
International Debt

How Is the Global Debt Distributed Between Countries?

Posted by Maya Sandler on January 16, 2023

Tip! 

​

Pandas moves slowly(ish) when dealing with big data.

Use swifter to run the code quicker by simply adding it to the code:
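For example, swapping a plain apply for swifter's accelerated apply (the DataFrame and function names below are just placeholders):

    import swifter  # registers the .swifter accessor on pandas objects
    df['result'] = df['value'].swifter.apply(my_function)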

Tip! 

​

Visual Studio Code has a(nother) great feature 

To change text simultaneously in multiple places, use multiple cursors:

1) Ctrl+D - selects the text at the cursor, or the next occurrence of the current selection.

2) Ctrl+Shift+L - selects all occurrences of the current selected text.

3) Alt+Click - adds another cursor at the clicked position; Ctrl+Alt+Up/Down - adds a cursor on the line above/below.

​

Also - Keyboard Shortcuts

Chunksize is set to 500 to allow a big dataset to be injected into the database in chunks.
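As a minimal sketch, assuming a pandas DataFrame and an SQLAlchemy engine (the table and variable names are illustrative), the chunked injection looks like this:

    # write the DataFrame to MySQL in chunks of 500 rows instead of one huge insert
    df.to_sql('debt', con=engine, if_exists='append', index=False, chunksize=500)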

Intent

A big database (updated through October 2020) of international debt statistics, covering the external debt of low-, lower-middle- and upper-middle-income countries between 1970 and 2027, must hold some very interesting insights. The debt information and insights have important implications for the risk to global financial stability and economic growth. 

​

In this project I will dive into the financial information of different countries and their debt to other countries (i.e. creditor countries). The insights from this project will shed light on the biggest creditor countries and the total amount of debt owed to them, the total debt of each country, and important debt indicators. 

 

It is interesting to answer these questions:

  1. What is the average debt of the different country income groups?

  2. What is the current total amount of debt owed by each country, and how does it compare to the average debt of its income group?

  3. Which creditor countries lend most of the global debt?

  4. In what year did most countries take on their highest debt?

  5. Which country income groups took on the most debt?

  6. On which country income group did COVID-19 have the biggest debt effect?

​

I have so many more questions already, and will probably have even more as I unfold this massive database. 

 

I downloaded the data from the World Bank database; cleaned, transformed, and imported it via a Python script; validated and analyzed the data in MySQL; and visualized it using Tableau Public.

​

This project contains the most complicated preprocessing step(s) I've done until now: 

  • It opens multiple files and creates a database with multiple SQL tables.

  • Creates new SQL tables from specific columns of specific data files and adds a new auto-increment id column.

  • Runs queries on other tables to build an entirely new table that represents the data better and is more efficient. 

​

I conclude the project with the insights from my analysis.

Description of the Data

The database was downloaded from The World Bank database. This dataset contains both national and regional debt statistics captured by ~500 economic indicators. Time series data is available for 133 countries for the years 1970-2020, as well as debt indicators. 

​

The database consists of 3 main .csv datasets:

1. IDS-DRSCountries_WLD_Data.csv - This is the main dataset, in which each country has multiple debt records over the years. It contains 66,102 loan records.

2. IDS-DRSCountries_WLD_Country-Metadata.csv - Metadata about the 133 countries.

3. IDS-DRSCountries_WLD_SeriesMetadata.csv - Metadata about the debt indicators.

Preprocessing

I wrote a Python script to read the data from the raw .csv files and load it into an SQL database, using Pandas, mysql.connector and SQLAlchemy. The basis for this code was a script I used in a previous project, but in this case there was massive preprocessing to do. To do this I imported these modules:
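Roughly, based on the libraries mentioned above:

    import pandas as pd                   # reading and transforming the .csv files
    import mysql.connector                # creating the database and tables in MySQL
    from sqlalchemy import create_engine  # engine used by pandas to inject data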

Creating a new database called "international_debt" in SQL:
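A minimal sketch of this step, assuming a local MySQL server (the connection parameters are placeholders):

    conn = mysql.connector.connect(host='localhost', user='root', password=my_password)
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS international_debt")
    cursor.execute("USE international_debt")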

The tables had a lot of repetitive values and no id numbers, which made them less efficient. The main debt data table had the biggest organizational issues and needed to be reorganized in a way that makes aggregation a lot simpler. 

​

To make the database more efficient for queries, and faster and easier to perform aggregation functions on, I decided to create a new database comprised of 3 normalized, relational tables:

1) Country table - with ID, country name, country code, income group, etc.

2) Debt indicators table - with ID, indicator name, indicator definition, indicator code, etc.  

3) Country debt table - a time series table of debt by country. 

​

I will explain the process I followed for each table. 

Country table

The original dataset of the countries and their information (below is a partial snapshot), in the file IDS-DRSCountries_WLD_Country-Metadata.csv, has 31 columns. Many of them are unnecessary (such as the latest household survey and the system used for national accounts). The dataset also doesn't have an id number.

Therefore, I wanted to create a new table called "countries", that has these columns:

  • id - an integer. This will be a new auto-increment int. column

  • code - 3 letter country code

  • short_name - the short name of the country

  • income_group - indicates the income category of the country

  • region - region of the world

  • lending_category - the category of debt

  • external_debt_status - indicates if the report is actual, estimated or preliminary

​

To do that, I:

  1. Created a new table in SQL called "countries"

  2. Read only specific columns from the csv file:
    pd.read_csv(file_name, usecols=cols_list)

  3. Added a new country id column, which is the primary key for this table and will be auto-incremented:
    id INT PRIMARY KEY NOT NULL AUTO_INCREMENT

  4. Injected data from the columns I read into the new SQL table - and skipped the first column (the id column).

 

This is the Python function for creating the table:
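A minimal sketch of such a function, assuming a mysql.connector connection named conn; the csv column names passed to usecols are illustrative and would need to match the metadata file:

    def create_countries_table(conn, file_name):
        cursor = conn.cursor()
        # 1) create the new table with an auto-increment id as its primary key
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS countries (
                id INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
                code CHAR(3),
                short_name VARCHAR(100),
                income_group VARCHAR(50),
                region VARCHAR(50),
                lending_category VARCHAR(50),
                external_debt_status VARCHAR(50))""")
        # 2) read only the needed columns from the csv file (and fix their order)
        cols_list = ['Code', 'Short Name', 'Income Group', 'Region',
                     'Lending category', 'External debt Reporting status']
        df = pd.read_csv(file_name, usecols=cols_list)[cols_list]
        # 3)-4) inject the data; MySQL fills the auto-increment id column by itself
        insert_query = ("INSERT INTO countries (code, short_name, income_group, region, "
                        "lending_category, external_debt_status) VALUES (%s, %s, %s, %s, %s, %s)")
        cursor.executemany(insert_query, [tuple(row) for row in df.itertuples(index=False)])
        conn.commit()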

Indicators tables

The original dataset for the debt indicators, in the file IDS-DRSCountries_WLD_SeriesMetadata.csv, has some columns, such as Source and Topic, that are unnecessary for this analysis. This dataset also doesn't have an id number.

Therefore, I wanted to create a new table called "indicators", that has these columns:

  • id - an integer. This will be a new auto-increment int. column

  • code - debt indicator code

  • indicator_name - the long name of indicators

  • short_definition - the short definition of the indicator

  • long_definition - the long definition of the indicator

  • aggregation_method - is it a sum, an average or something else

​

To do that, I:

  1. Created a new table in SQL called "indicators"

  2. Read only specific columns from the csv file, as in the countries table.

  3. Added a new id column, which is the primary key for this table and will be auto-incremented, as in the countries table.

  4. Injected data from the columns I read into the new SQL table - and skipped the first column (the id column).

​

* 497 debt indicators were counted.

​

This is the Python function for creating the table:

Debt table

The original dataset in the file IDS-DRSCountries_WLD_Data.csv has repeating columns and is built so that every country-indicator pair is a single row with a column for each year. This dataset doesn't have an id number either.

I needed to create a new table called "debt", so there would be a year column and a debt column instead of the original representation, and to add an auto-increment id column:

  • id - an integer. This will be a new auto-increment int. column

  • country_id - this is the id from the country table

  • indicator_id - the id from the indicators table

  • year - the year of the debt record

  • debt - the debt value for that country, indicator and year

​

To insert data from the "countries" table into the country_id column, I iterated over the rows of the debt data, got the country name from the original data, used an if condition to check whether I had already seen this country, and if not, ran a query on the countries table, got the id and saved it into the country_id column of the "debt" table.
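A minimal sketch of that lookup, assuming a mysql.connector cursor and a dictionary used as a cache (the DataFrame and column names are illustrative):

    country_ids = {}  # cache: country name -> id, to avoid querying the same country twice

    def get_country_id(cursor, country_name):
        if country_name not in country_ids:
            # country not seen yet - fetch its id from the countries table
            cursor.execute("SELECT id FROM countries WHERE short_name = %s", (country_name,))
            row = cursor.fetchone()
            country_ids[country_name] = row[0] if row else None
        return country_ids[country_name]

    for _, record in raw_debt_df.iterrows():
        country_id = get_country_id(cursor, record['Country Name'])
        # ... the country_id is then inserted into the debt table together with
        # the year, indicator and debt value of this record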


*** Pandas reads numeric data either as int64 or as float64. When verifying the data, I noticed that the values were not the same in the original csv file and the debt DataFrame. I looked up a solution and found it on this webpage; it suggests using a Python Decimal object:

Import decimal module 

​

    from decimal import Decimal
    import pandas as pd

    def decimal_from_value(value):
        return Decimal(value)

    df = pd.read_csv(filename, converters={'sales': decimal_from_value})
    # the converter sets the 'sales' column type to "object" (Decimal), not the default float64
    print(df.dtypes)
    # week     int64
    # sales    object

​

   

This is the Python function for creating the table:
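A minimal sketch of the reshaping part of such a function, assuming the year columns in the raw file are plain year strings and the id columns are named as shown (both are illustrative); pd.melt turns the per-year columns into a single year column and a single debt column:

    # wide -> long: one row per country, indicator and year instead of one column per year
    year_cols = [c for c in raw_df.columns if c.strip().isdigit()]
    debt_long = raw_df.melt(
        id_vars=['Country Name', 'Series Name'],   # illustrative column names
        value_vars=year_cols,
        var_name='year',
        value_name='debt')
    debt_long = debt_long.dropna(subset=['debt'])  # keep only years that actually have a value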

Validate, Clean and Transform the Data


I validated that all the data was in the proper formats and that it is consistent (years, publishers, etc.) via Microsoft Excel (using reorder and find). This process led to finding several rows with missing information, like the release year and publisher, and rows with a generic publisher name ("Unknown"). I transformed the data as part of the Python code.  

​

Missing Information:

Missing information appeared as "N/A" in the original .csv file. This created a problem while executing the code in Python, because pandas doesn't recognize what "N/A" is and sets it as a float data type. To transfer the original "N/A" to MySQL's NULL, I needed to add another step to the code that transforms the original N/A to None in the pandas DataFrame, so it is automatically transformed to NULL in MySQL. 
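A minimal sketch of that step (pandas reads "N/A" as NaN by default; casting to object keeps the None values):

    # replace every NaN (originally "N/A" in the csv) with None, so it becomes NULL in MySQL
    df = df.astype(object).where(pd.notnull(df), None)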

​

Consolidate empty and "Unknown" publisher names:

Instead of deleting and losing data, I decided to gather the original "N/A" publisher names and the publishers written as "Unknown" into a single publisher name called "Other". Logically, they carry basically the same meaning. This way all of them are consolidated under an "Other" publisher name and can be counted in the analysis. Counting the lines that were changed showed 261 games in the "Other" category.   

​

This section of the code was between reading the .csv file and injecting the transformed data to MySQL:

That's it, a script that does everything in one execution. I like it! 

Now to MySQL for the analysis and fetching the interesting information. 

Validating the Data in the MySQL Database:

After transforming the data, I needed to verify that the changes were made and that the data remained intact. I ran several queries to do it: 

 

1. Check that all rows were transferred: 

I compared the number of rows in the raw data to the number of rows in the dataset table, using the query below in MySQL. Both were equal to 16,598, so all the lines are here. Great! 

2. Checking for "N/A" and NULL values:

In the transformation of the data I changed the "N/A" values to None, which were supposed to become NULL during the insertion of the data into MySQL. Therefore, this needed to be checked. I checked the release_year column, as the N/A values originally appeared in this column.

Counting "N/A" retrieved 0, and counting NULL retrieved 271. That means all N/A values in the release year column turned into NULL values. Great!

3. Checking for NULL values in the publisher column:

The publisher column was also transformed in the Python script: it was changed from the original N/A, to pandas None, to "Other". This transformation needed to be checked as well:

Counting retrieved 0 "N/A" and 261 "Other", which is exactly the number counted in the Python script. Super great!

​

Now that the data is clean and transformed, we can start analyzing it in MySQL.

Results:

All queries were done via MySQL and visualized via Tableau.

Question 1

To find...

These results indicate that  

Question 2

People have a preference for specific genres, and gaming companies aim at those preferences to increase their sales. Therefore, I looked into the most profitable genres in the world (called "global_sales" in the code) via the following query.

These results show several interesting insights:

  1. Sport genre is t.

​

This analysis is very important for building specific personas of users in specific games, to increase appeal to the right market and to increase revenue.

Question 3

(1) Who are the most profitable publishers in the world? (2) Are there specific publishers that are more appropriate for selling to a specific regional market?

​

To answer these questions, I analyzed the sales by publisher to look at their ability to sell games globally, using the following query, as well as in specific regions of the world (using similar queries). 

Best Global Publishers:

This data shows that Nintendo has the biggest selling ability, more than 1.5 times that of the next best-selling publisher, Electronic Arts. Thereafter sales drop, and the next best publishers are Activision, Sony Computer Entertainment, etc. See the next image for a visualization.

Question 4

When creating a video game, as a game developer you may want to direct the game to the strongest market, rather than appealing to a niche audience, to increase revenue. Money and interests change over the years, and therefore the market can change. Therefore, I asked what the purchasing power of different regions of the world has been over the years.  

I visualized the data so I would be able to see the different regions and their markets over the years, and to filter specific publishers in or out. 


 

​

This analysis provides several insights regarding the strongest markets:

  1. There is a strong growth


     

Question 5

When developing a game, you want it to reach as many users as possible.  

​

The query is intended to look into the differences in platform preferences between the regions of the world, as well as to give a global view.

Regional Platform Preferences:

Examination of the strongest platforms by region indicates that the Wii is the strongest in the two regions with the most influence on gaming sales - North America and Europe. However, this is not the case in other regions. Also, there is a significant difference in the other platforms in each region. Most likely, users will not switch to a different platform once they have bought one, unless it is old and they want to buy a new device (and can afford it). 

​

Therefore, this data is extremely important when producing a game for consumers, as it affects whether the video game will be available to the users, and the whole interaction of the user with the hardware. 

Conclusions

In this project I asked several business questions regarding the market and the selling abilities of specific games, genres, publishers and platforms between 1980 and 2017.


The analysis indicated that there are two particularly strong markets (North America and Europe) with specific preferences for genres and platforms to publish games on. Japan was revealed as a different market zone with unique preferences, and a smaller market. The data also indicated specific regional market strengths for specific publishers, and a trend in their publishing abilities. 

​

The insights in this project regarding video game sales, publishers, platforms and genres are important for video game developers who are seeking the optimal effect, as well as for video game publishers who are targeting specific markets, and even for us as consumers. 
