How does Economic Growth Affect Income Inequality?

Mandeep Karki

Link to the Github webpage

Project Goals

I am interested in investigating whether economic growth leads to an increase in income inequality. Economic growth simply means an increase in the total value of goods and services produced in a given time frame. The standard measure of economic growth is the growth of Gross Domestic Product (GDP) over a fiscal year. GDP, however, captures only the production side of the economy and says nothing about how that output is distributed, and increased production does not get distributed evenly. I want to investigate this distributional aspect and see whether the fruits of economic growth are unevenly shared, with the rich benefiting more than the poor within a country. My hypothesis is that economic growth leads to an increase in income inequality.

Dataset for the Project

For my project, I get all of my data from the World Development Indicators databank on the World Bank website. The database provides extensive data on hundreds of development indicators for all countries. However, a lot of data is missing, especially for developing countries, which is a common plight in the field of development economics. Here, I first briefly discuss the most important variables in the study and then explain what I plan to do with the missing data. All variable definitions are the official World Bank definitions.

Variables

Gini Index: My primary dependent variable of interest is the Gini index, the most popular measure of income inequality. The Gini index measures the extent to which the distribution of income or consumption among individuals or households within an economy deviates from a perfectly equal distribution. A Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality (definition from the World Bank).

GDP per capita: GDP per capita is the sum of gross value added by all resident producers in the economy plus any product taxes (less subsidies) not included in the valuation of output, divided by mid-year population.

Inflation: Inflation is the measure of annual percentage change in the cost to the average consumer of acquiring a basket of goods and services in a given year.

Unemployment rate: Unemployment refers to the share of the labor force that is without work but available for and seeking employment in the given year.

Life Expectancy: Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. This is the proxy for health in our model as it is the most widely available data worldwide.

Population Growth Rate: Annual population growth rate for year t is the exponential rate of growth of midyear population from year t-1 to t, expressed as a percentage.

Income Level: Low Income if GNI per capita is USD 1085 or less, Lower-Middle Income if GNI per capita is between USD 1086 and USD 4095, Upper-Middle Income if GNI per capita is between USD 4095 and USD 13205, and High Income if GNI per capita is greater than USD 13205. This is the classification used by the World Bank.
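To make the Gini index definition above concrete, here is a minimal sketch of how the index can be computed from a list of individual incomes. The function name `gini_index` and the toy incomes are illustrative assumptions and are not part of the World Bank data used in this project. The sketch uses the sorted-rank form of the Gini coefficient, scaled to 0-100.

```python
def gini_index(incomes):
    """Return the Gini index (0-100) for a list of non-negative incomes.

    Uses the sorted-rank formula, which is equivalent to half the relative
    mean absolute difference between all pairs of incomes.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # G = (2 * sum_i i * x_i) / (n * sum_i x_i) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 100 * (2 * weighted / (n * total) - (n + 1) / n)


equal = gini_index([50, 50, 50, 50])        # everyone earns the same
concentrated = gini_index([0, 0, 0, 200])   # one person earns everything
print(equal, concentrated)
```

With only four people the fully concentrated case gives 75 rather than 100, because the sample maximum is 100·(n−1)/n; the value approaches 100 as the population grows.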

The dataset I am using is a panel of 73 countries for the period 1991-2019. The unit of observation is a country-year. As mentioned earlier, the dataset suffers heavily from missing data, and the Gini index is the variable with the most missing data. I kept the 73 countries that have the most data points available; some countries did not have a single data point for any of the years. I drop all the country-years that have a missing value for any of the major variables listed above except for the Gini index. For the Gini index, I use linear interpolation to generate the missing values. This is appropriate for two reasons. First, the Gini index suffers the most from missing values and I want to make the most of the available data. Second, the value of the Gini index doesn't fluctuate heavily in successive years, so the generated values are meaningful for our analysis.
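As a small illustration of how the missing Gini values are filled, the sketch below applies pandas' linear interpolation to one made-up country series. The numbers are hypothetical; only `Series.interpolate(method="linear")` corresponds to what the notebook uses later.

```python
import numpy as np
import pandas as pd

# Hypothetical Gini series for one country; NaN marks years without a survey.
gini = pd.Series(
    [np.nan, 40.0, np.nan, np.nan, 46.0, np.nan],
    index=[1991, 1992, 1993, 1994, 1995, 1996],
)

# Interior gaps are filled on a straight line between the neighbouring values.
# With the default settings, a gap before the first observation stays NaN and
# a gap after the last observation is filled with the last observed value.
filled = gini.interpolate(method="linear")
print(filled)
```

Note that `method="linear"` ignores the index and treats rows as equally spaced, so if some years have been dropped from a country's rows, the filled values are spaced by row position rather than by calendar year.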

Question and Model Decision

I am going to use a regression model to investigate the relationship between the Gini index and economic growth. It is important to understand and untangle this complex relationship because it has huge policy implications for both developing and developed countries: they continue to grow, and high income inequality is undesirable.

The dataset I am using is an unbalanced panel. I estimate Pooled OLS, Random Effects, and Fixed Effects models on the dataset. I will then use statistical tests to choose which model better explains the relationship.
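The difference between Pooled OLS and Fixed Effects can be seen on a tiny made-up panel. The sketch below is an illustration under stated assumptions, not the estimation used in this project: the two countries, the values, and the helper names `ols_slope` and `demean` are all hypothetical, and only one regressor is used.

```python
def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit with an intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def demean(values):
    """Subtract the group mean (the 'within' transformation)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Made-up panel: two countries, three years each (x = income proxy, y = Gini).
panel = {
    "A": {"x": [1.0, 2.0, 3.0], "y": [10.0, 9.0, 8.0]},
    "B": {"x": [11.0, 12.0, 13.0], "y": [20.0, 19.0, 18.0]},
}

# Pooled OLS stacks all country-years and ignores which country they belong to.
pooled_x = [v for c in panel.values() for v in c["x"]]
pooled_y = [v for c in panel.values() for v in c["y"]]
pooled_slope = ols_slope(pooled_x, pooled_y)

# Fixed effects demeans each country's data first, so only within-country
# variation over time identifies the slope.
within_x = [v for c in panel.values() for v in demean(c["x"])]
within_y = [v for c in panel.values() for v in demean(c["y"])]
within_slope = ols_slope(within_x, within_y)

print(round(pooled_slope, 3), round(within_slope, 3))
```

In this toy panel the pooled slope is positive because country B is both richer and more unequal, while the within-country slope is negative. This is the kind of disagreement that makes the choice between the models matter.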

I will also control for the income level of a country in investigating this relationship. The relationship between economic growth and the Gini index could be different for a rich country and a poor country (the Kuznets hypothesis). Therefore, I will evaluate this relationship for four different income levels (explained above).

ELT (Extract, Load, and Transform)

I loaded the dataset, an Excel file uploaded to my data folder. Each row is a country-year. I tidied the data by changing the datatypes, dropping rows with missing values, and performing linear interpolation on missing values of the Gini index. I created a new variable, income level, as it is important for the analysis (described above in the section on the dataset). I also created a new variable, country_year, by combining the country code and the year for each row. This is a unique identifier for each row of the dataset.

In [1]:
#Cloning my project repository, reading data, and importing libraries:
!git clone https://github.com/mandeepkrk/CMPSFinalProject
%cd /content/CMPSFinalProject
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Cloning into 'CMPSFinalProject'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 17 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (17/17), 2.07 MiB | 5.71 MiB/s, done.
Resolving deltas: 100% (3/3), done.
/content/CMPSFinalProject
In [2]:
#creating a dataframe with all the variables in our dataset:
df_gini = pd.read_excel("./gdp_gini_final.xlsx")
df_gini.head()
Out[2]:
year country country_code gdp gdp_gr gdp_pc gini gni gni_pc infl life_exp popn_gr tax_gdp_prcnt unemp
0 1991 Argentina ARG 189719984268.484528 9.133111 5730.72381 46.8 183857364656.528564 3860 140.502379 72.319 1.424063 5.008237 5.44
1 1992 Argentina ARG 228778917308.169861 7.937292 6815.32933 45.5 224521060685.898621 6060 16.071994 72.430 1.387435 5.547922 6.36
2 1993 Argentina ARG 236741715015.015015 8.206979 6957.417499 44.9 233743662862.862854 7110 -3.561096 72.565 1.357966 8.015138 10.1
3 1994 Argentina ARG 257440000000 5.836201 7464.474737 45.9 253743263100 7600 2.84934 73.172 1.347024 8.126437 11.76
4 1995 Argentina ARG 258031750000 -2.84521 7383.70451 48.9 253362890700 7340 3.165123 73.133 1.317554 8.054358 18.8
In [3]:
#Here, I create a new variable called 'country_year' combining country code and year. This serves as the unique identifier for each row of data.

df_gini.loc[:, "country_year"] = df_gini["country_code"].astype(str) + df_gini["year"].astype(str)
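A quick way to confirm that such a key really identifies each row is to check it for duplicates. The sketch below uses a made-up four-row frame named `toy` in place of `df_gini`; the check itself relies only on standard pandas calls.

```python
import pandas as pd

# Toy frame standing in for df_gini; the values are made up.
toy = pd.DataFrame({
    "country_code": ["ARG", "ARG", "BRA", "BRA"],
    "year": [1991, 1992, 1991, 1992],
})
toy["country_year"] = toy["country_code"].astype(str) + toy["year"].astype(str)

# A valid identifier must not repeat across rows.
key_is_unique = toy["country_year"].is_unique
duplicated_keys = toy.loc[toy["country_year"].duplicated(keep=False), "country_year"].tolist()
print(key_is_unique, duplicated_keys)
```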
In [4]:
#Checking datatypes for the variables in df_gini

print(df_gini.dtypes)
year               int64
country           object
country_code      object
gdp               object
gdp_gr            object
gdp_pc            object
gini              object
gni               object
gni_pc            object
infl              object
life_exp         float64
popn_gr          float64
tax_gdp_prcnt     object
unemp             object
country_year      object
dtype: object

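We can see that many variables that should be floats are read by pandas as object (string) columns. The source file marks missing values with the string '..', which forces those columns to the object dtype.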
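One way to avoid the object columns in the first place is to tell pandas which marker means "missing" when reading the file. The sketch below shows this on a tiny in-memory CSV, which is a made-up stand-in for the Excel export; `pd.read_excel` accepts the same `na_values` argument. It also shows `pd.to_numeric` as an after-the-fact alternative.

```python
import io
import pandas as pd

# Tiny CSV standing in for the World Bank export; '..' marks a missing value.
raw = io.StringIO(
    "country_code,year,gini,unemp\n"
    "ARG,1991,46.8,5.44\n"
    "ARG,1992,..,6.36\n"
)

as_read = pd.read_csv(raw)                     # gini is read as text because of '..'
raw.seek(0)
with_na = pd.read_csv(raw, na_values=[".."])   # gini is parsed as float64

# Converting after the fact: unparseable entries such as '..' become NaN.
converted = pd.to_numeric(as_read["gini"], errors="coerce")
print(with_na["gini"].dtype, int(converted.isna().sum()))
```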

In [5]:
# Drop tax_gdp_prcnt and gni as I am not using them for my analysis

df_gini = df_gini.drop(['tax_gdp_prcnt','gni'], axis=1)
In [6]:
#I want to change the variables to the appropriate datatypes. First, I check for the columns that have string value and shouldn't be string.
for col in ['gdp_pc', 'gdp','gdp_gr','gni_pc','infl','unemp']:
    if df_gini[col].dtype == 'object':
        print(f"Column '{col}' contains string values:")
        print(df_gini[col].unique())
Column 'gdp_pc' contains string values:
[5730.7238098842345 6815.329329698251 6957.417498892505 ...
 1495.752138410211 1475.1998833853477 1268.1209405624106]
Column 'gdp' contains string values:
[189719984268.48453 228778917308.16986 236741715015.015 ...
 25873601260.835304 26311507273.67354 23308667781.225754]
Column 'gdp_gr' contains string values:
[9.133110567389608 7.937291556430765 8.206979072212278 ...
 3.5043360955868224 4.034493896671648 1.44130602603785]
Column 'gni_pc' contains string values:
[3860 6060 7110 ... 3060 390 1330]
Column 'infl' contains string values:
[140.50237866101241 16.071993535694546 -3.561095575765748 ...
 10.095729868345842 7.411570948432569 7.6334704807674285]
Column 'unemp' contains string values:
[5.44 6.36 10.1 ... 8.137 8.431 5.54]
In [7]:
#Then I drop rows with NaN in the key variables, replace the '..' missing-value marker with NaN, and convert the columns to float.
#Rows whose values were '..' stay in the frame as NaN here; they are removed by the dropna call before the regression.
num_cols = ['gdp_pc', 'gdp', 'gdp_gr', 'infl', 'life_exp', 'popn_gr', 'gni_pc', 'unemp']
df_gini = df_gini.dropna(subset=num_cols)
df_gini[num_cols] = df_gini[num_cols].replace('..', np.nan).astype(float)
In [8]:
#Creating a new variable, income level, based on the country's GNI per capita as defined by the World Bank:

df_gini['income_lvl'] = np.select(
    [df_gini['gni_pc'] <= 1085,
     (df_gini['gni_pc'] > 1085) & (df_gini['gni_pc'] < 4095),
     (df_gini['gni_pc'] >= 4095) & (df_gini['gni_pc'] < 13205),
     df_gini['gni_pc'] >= 13205],
    [1, 2, 3, 4],
    default=np.nan)
df_gini['income_lvl'] = df_gini['income_lvl'].astype('Int64')
df_gini.head()
Out[8]:
year country country_code gdp gdp_gr gdp_pc gini gni_pc infl life_exp popn_gr unemp country_year income_lvl
0 1991 Argentina ARG 1.897200e+11 9.133111 5730.723810 46.8 3860.0 140.502379 72.319 1.424063 5.44 ARG1991 2
1 1992 Argentina ARG 2.287789e+11 7.937292 6815.329330 45.5 6060.0 16.071994 72.430 1.387435 6.36 ARG1992 3
2 1993 Argentina ARG 2.367417e+11 8.206979 6957.417499 44.9 7110.0 -3.561096 72.565 1.357966 10.10 ARG1993 3
3 1994 Argentina ARG 2.574400e+11 5.836201 7464.474737 45.9 7600.0 2.849340 73.172 1.347024 11.76 ARG1994 3
4 1995 Argentina ARG 2.580318e+11 -2.845210 7383.704510 48.9 7340.0 3.165123 73.133 1.317554 18.80 ARG1995 3
In [9]:
# change income_lvl to float for statistical analysis later

df_gini['income_lvl'] = df_gini['income_lvl'].astype(float)
In [10]:
#Confirming all the variables have been changed to appropriate datatype:
print(df_gini.dtypes)
year              int64
country          object
country_code     object
gdp             float64
gdp_gr          float64
gdp_pc          float64
gini             object
gni_pc          float64
infl            float64
life_exp        float64
popn_gr         float64
unemp           float64
country_year     object
income_lvl      float64
dtype: object

As we can see, all the variables except Gini now have appropriate datatypes. Since I want to perform linear interpolation to generate the missing Gini values, I carry out the following steps:

In [11]:
#changing the missing values in Gini to NaN:

df_gini['gini'] = df_gini['gini'].replace('..', np.nan)
In [12]:
# Check how many missing values there are in the column Gini

print(df_gini.gini.isnull().sum())
315
In [ ]:
#Perform linear interpolation for missing values in the gini variable, separately for each country.

# Make sure gini is numeric before interpolating (it was read as an object column).
df_gini['gini'] = pd.to_numeric(df_gini['gini'])

# interpolate() runs within each country group, so values never leak across countries.
df_gini['gini'] = df_gini.groupby('country_code')['gini'].transform(
    lambda s: s.interpolate(method='linear'))
In [14]:
df_gini['gini'].describe()
Out[14]:
count    1667.000000
mean       38.074625
std         9.108657
min        20.700000
25%        30.800000
50%        35.700000
75%        45.175000
max        61.600000
Name: gini, dtype: float64
In [15]:
#Check how many missing values remain; if the first value for a country is missing, interpolation leaves it missing:
print(df_gini.gini.isnull().sum())
0
In [16]:
#Confirming that the datatype for all variables is now correct:
print(df_gini.dtypes)
year              int64
country          object
country_code     object
gdp             float64
gdp_gr          float64
gdp_pc          float64
gini            float64
gni_pc          float64
infl            float64
life_exp        float64
popn_gr         float64
unemp           float64
country_year     object
income_lvl      float64
dtype: object
In [17]:
#The first 20 rows of the final dataframe:
df_gini.head(20)
Out[17]:
year country country_code gdp gdp_gr gdp_pc gini gni_pc infl life_exp popn_gr unemp country_year income_lvl
0 1991 Argentina ARG 1.897200e+11 9.133111 5730.723810 46.8 3860.0 140.502379 72.319 1.424063 5.44 ARG1991 2.0
1 1992 Argentina ARG 2.287789e+11 7.937292 6815.329330 45.5 6060.0 16.071994 72.430 1.387435 6.36 ARG1992 3.0
2 1993 Argentina ARG 2.367417e+11 8.206979 6957.417499 44.9 7110.0 -3.561096 72.565 1.357966 10.10 ARG1993 3.0
3 1994 Argentina ARG 2.574400e+11 5.836201 7464.474737 45.9 7600.0 2.849340 73.172 1.347024 11.76 ARG1994 3.0
4 1995 Argentina ARG 2.580318e+11 -2.845210 7383.704510 48.9 7340.0 3.165123 73.133 1.317554 18.80 ARG1995 3.0
5 1996 Argentina ARG 2.721498e+11 5.526690 7690.157003 49.5 7700.0 -0.052375 73.307 1.260411 17.11 ARG1996 3.0
6 1997 Argentina ARG 2.928590e+11 8.111047 8176.771195 49.1 8110.0 -0.463913 73.090 1.198264 14.82 ARG1997 3.0
7 1998 Argentina ARG 2.989482e+11 3.850179 8250.673174 50.7 7990.0 -1.705280 73.474 1.158178 12.65 ARG1998 3.0
8 1999 Argentina ARG 2.835230e+11 -3.385457 7735.322080 49.8 7540.0 -1.836558 73.722 1.152044 14.05 ARG1999 3.0
9 2000 Argentina ARG 2.842038e+11 -0.788999 7666.517834 51.1 7430.0 1.037287 73.926 1.133277 15.00 ARG2000 3.0
10 2001 Argentina ARG 2.686968e+11 -4.408840 7168.975872 53.3 6960.0 -1.095768 74.186 1.099171 17.32 ARG2001 3.0
11 2002 Argentina ARG 9.772400e+10 -10.894485 2579.488769 53.8 4020.0 30.555204 74.408 1.073538 19.59 ARG2002 2.0
12 2003 Argentina ARG 1.275870e+11 8.837041 3333.152904 51.0 3640.0 10.495703 74.080 1.032361 15.36 ARG2003 2.0
13 2004 Argentina ARG 1.646579e+11 9.029573 4258.160261 48.5 3360.0 18.363354 74.855 1.015337 13.52 ARG2004 2.0
14 2005 Argentina ARG 1.987371e+11 8.851660 5086.627761 47.8 4240.0 10.317511 75.139 1.033476 11.51 ARG2005 3.0
15 2006 Argentina ARG 2.325573e+11 8.047152 5890.978002 46.4 5460.0 13.741052 75.433 1.034672 10.08 ARG2006 3.0
16 2007 Argentina ARG 2.875305e+11 9.007651 7210.595548 46.3 6480.0 14.939925 75.006 1.006297 8.47 ARG2007 3.0
17 2008 Argentina ARG 3.615580e+11 4.057233 8977.506851 45.0 7630.0 23.171165 75.641 0.992294 7.84 ARG2008 3.0
18 2009 Argentina ARG 3.329765e+11 -5.918525 8184.389889 43.8 7760.0 15.377649 75.936 1.014284 8.65 ARG2009 3.0
19 2010 Argentina ARG 4.236274e+11 10.125398 10385.964432 43.7 9270.0 20.915124 75.721 0.255582 7.71 ARG2010 3.0
In [18]:
#Countries in our sample:

unique_countries = df_gini['country'].unique()

for i, country in enumerate(unique_countries):
    if i % 5 == 0:
        print()
    print(f"{country:20}", end="")
Argentina           Armenia             Australia           Austria             Belarus             
Belgium             Bolivia             Brazil              Bulgaria            Canada              
Chile               China               Colombia            Costa Rica          Croatia             
Cyprus              Czechia             Denmark             Dominican Republic  Ecuador             
El Salvador         Estonia             Finland             France              Georgia             
Germany             Greece              Honduras            Hungary             Iceland             
Indonesia           Iran, Islamic Rep.  Ireland             Israel              Italy               
Kazakhstan          Kosovo              Kyrgyz Republic     Latvia              Lithuania           
Luxembourg          Madagascar          Malaysia            Malta               Mexico              
Moldova             Netherlands         North Macedonia     Norway              Pakistan            
Panama              Paraguay            Peru                Philippines         Poland              
Portugal            Romania             Russian Federation  Slovak Republic     Slovenia            
Spain               Sweden              Switzerland         Tajikistan          Thailand            
Uganda              Ukraine             United Kingdom      United States       Uruguay             
Venezuela, RB       Viet Nam            Zambia              

Summary Stats and Visualizations

In this section, I present some summary statistics and preliminary visualization for some of the variables. I am mostly interested in looking at the scatterplot and trendlines for the relationship between Gini and my independent variables.
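The trendlines in the scatterplots of this section are ordinary least-squares lines fitted with `np.polyfit`. As a minimal sketch of what that call returns, the example below fits a line to four made-up points that lie exactly on y = 2x + 1; the data are illustrative only.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Degree-1 polyfit returns coefficients from highest power down: [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
trend = slope * x + intercept
print(slope, intercept)
```

`np.polyfit` does not skip missing values, so any rows with NaN in either variable need to be dropped before fitting.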

In [19]:
# Summary statistics for Gini by country:
gini_stats = df_gini.groupby('country')['gini'].describe()
print(gini_stats.to_string())
                    count       mean       std   min        25%        50%        75%   max
country                                                                                    
Argentina            29.0  46.191379  3.797510  41.1  42.700000  45.900000  49.100000  53.8
Armenia              21.0  32.419048  2.795285  28.0  30.000000  32.400000  34.800000  37.5
Australia            25.0  33.910000  0.758494  32.6  33.350000  34.000000  34.475000  35.4
Austria              26.0  30.100000  0.749637  28.7  29.525000  30.250000  30.750000  31.5
Belarus              22.0  27.954545  2.166350  25.2  26.500000  27.650000  29.400000  32.0
Belgium              17.0  28.182353  0.857493  27.2  27.600000  28.100000  28.400000  30.5
Bolivia              21.0  51.176190  6.158178  41.6  46.600000  49.200000  57.150000  61.6
Brazil               28.0  55.898214  2.898546  51.9  53.375000  55.250000  58.775000  60.1
Bulgaria             14.0  37.171429  2.640638  33.6  35.700000  36.350000  39.875000  41.3
Canada               29.0  33.055172  0.844557  31.3  32.700000  33.300000  33.700000  34.1
Chile                28.0  50.016071  4.010894  45.3  46.837500  48.333333  54.312500  56.4
China                27.0  39.542593  2.832573  33.9  38.350000  39.700000  41.816667  43.7
Colombia             29.0  54.103448  2.566707  49.7  52.600000  54.200000  55.550000  58.7
Costa Rica           29.0  48.089655  1.627519  45.6  46.800000  48.300000  48.800000  51.8
Croatia              11.0  31.354545  1.255678  28.9  30.650000  32.000000  32.350000  32.6
Cyprus               16.0  32.475000  1.910846  30.1  31.175000  31.900000  33.175000  37.0
Czechia              28.0  25.991071  1.186086  20.7  25.900000  26.200000  26.525000  27.5
Denmark              28.0  25.789286  1.931132  23.0  24.050000  25.400000  27.725000  28.7
Dominican Republic   28.0  48.264286  2.877369  41.9  47.000000  48.900000  50.100000  52.1
Ecuador              26.0  50.142308  4.049072  44.7  46.025000  49.966667  53.400000  58.6
El Salvador          29.0  47.034483  5.219253  38.0  42.300000  47.800000  51.500000  54.5
Estonia              17.0  32.641176  1.835776  30.3  31.200000  32.500000  33.600000  37.2
Finland              20.0  27.495000  0.362621  26.8  27.200000  27.516667  27.700000  28.3
France               24.0  32.054167  0.991696  29.7  31.675000  32.350000  32.625000  33.7
Georgia              24.0  38.170833  1.566908  35.9  36.850000  38.000000  39.525000  41.3
Germany              29.0  30.165517  1.230353  28.0  29.000000  30.300000  31.100000  31.9
Greece               20.0  34.350000  1.087220  32.8  33.600000  34.150000  35.025000  36.3
Honduras             29.0  53.644828  3.164150  48.2  51.300000  53.500000  55.700000  59.5
Hungary              16.0  29.875000  1.823367  27.0  28.975000  29.950000  30.650000  34.7
Iceland              17.0  27.605882  1.725714  25.4  26.200000  26.800000  28.700000  31.8
Indonesia            27.0  35.811111  3.293739  29.5  33.550000  35.600000  38.600000  40.8
Iran, Islamic Rep.   15.0  38.366667  3.950166  34.0  35.050000  36.700000  42.550000  44.8
Ireland              26.0  33.078846  1.646643  30.6  32.000000  32.850000  33.637500  37.0
Israel               23.0  40.100000  1.542430  38.1  38.650000  39.800000  41.450000  42.6
Italy                29.0  34.736207  1.121932  31.1  34.300000  35.000000  35.200000  36.7
Kazakhstan           19.0  29.931579  3.646010  26.8  27.650000  28.200000  31.000000  39.8
Kosovo               17.0  29.317647  2.000698  26.3  27.800000  29.000000  30.800000  33.3
Kyrgyz Republic      20.0  30.085000  2.827548  26.8  27.775000  29.800000  31.125000  37.4
Latvia               16.0  35.750000  1.267544  34.2  35.075000  35.550000  36.100000  39.0
Lithuania            16.0  35.793750  1.602693  32.5  35.025000  35.500000  37.225000  38.4
Luxembourg           29.0  30.858621  2.158623  27.0  30.000000  30.900000  32.000000  35.4
Madagascar           27.0  42.266667  1.935163  38.6  41.175000  42.600000  42.600000  47.4
Malaysia             28.0  45.225000  2.859410  41.1  42.275000  45.416667  47.804167  49.1
Malta                14.0  29.221429  0.696341  28.0  29.000000  29.100000  29.350000  31.0
Mexico               28.0  50.269643  2.352357  46.0  48.837500  50.375000  52.387500  53.4
Moldova              23.0  32.647826  4.924786  25.7  27.750000  34.400000  36.050000  42.6
Netherlands          16.0  28.606250  0.775000  27.6  28.050000  28.350000  29.225000  30.0
North Macedonia      11.0  36.609091  3.120082  33.0  34.350000  35.600000  38.750000  42.8
Norway               29.0  26.962069  1.427817  25.2  26.000000  26.840000  27.500000  31.6
Pakistan             24.0  30.129167  1.045271  28.7  29.483333  29.933333  30.775000  33.1
Panama               29.0  54.220690  3.156442  49.2  51.500000  54.600000  57.500000  58.2
Paraguay             25.0  51.804000  3.573890  45.7  48.500000  52.300000  54.600000  58.2
Peru                 23.0  47.882609  4.376762  41.5  43.750000  47.500000  50.900000  55.1
Philippines          20.0  45.795000  1.674552  42.3  45.075000  46.466667  46.916667  47.7
Poland               16.0  32.900000  2.282688  28.8  31.650000  33.150000  33.625000  38.0
Portugal             17.0  36.076471  1.795317  32.8  35.200000  36.000000  36.800000  38.9
Romania              14.0  36.200000  1.250846  34.4  35.650000  35.950000  36.475000  39.6
Russian Federation   23.0  39.091304  1.738315  36.8  37.450000  39.500000  40.500000  42.3
Slovak Republic      16.0  26.081250  1.614608  23.2  25.150000  26.100000  27.125000  29.3
Slovenia             16.0  24.837500  0.627030  23.7  24.400000  24.800000  25.025000  26.2
Spain                27.0  34.637037  1.249182  31.8  33.950000  34.700000  35.700000  36.5
Sweden               20.0  27.690000  1.304204  25.3  26.725000  27.600000  28.800000  30.0
Switzerland          20.0  32.847500  0.773317  31.6  32.450000  32.750000  33.362500  34.3
Tajikistan           21.0  32.447619  1.339883  29.5  31.500000  32.666667  33.600000  34.0
Thailand             28.0  40.607143  3.143916  34.9  37.725000  41.650000  42.575000  47.9
Uganda               28.0  42.433929  1.545515  39.0  41.437500  42.750000  43.579167  45.2
Ukraine              28.0  28.710714  3.862855  24.0  25.450000  28.800000  30.291667  39.3
United Kingdom       29.0  34.803448  1.502969  32.6  33.300000  35.000000  35.500000  38.8
United States        29.0  40.506897  0.804865  38.0  40.100000  40.600000  41.000000  41.5
Uruguay              14.0  42.057143  2.755055  39.5  39.750000  40.300000  44.950000  46.4
Venezuela, RB        28.0  45.921429  1.789549  42.1  44.800000  44.800000  47.500000  49.5
Viet Nam             23.0  36.145652  1.012009  34.8  35.550000  35.720000  36.740000  39.3
Zambia               29.0  52.217241  4.185394  42.1  49.100000  53.300000  55.120000  60.5
In [20]:
# The highest and lowest Gini values in our dataset

#the countries with the minimum and maximum Gini values
min_gini_country = gini_stats['min'].idxmin()
max_gini_country = gini_stats['max'].idxmax()

# The years with the minimum and maximum Gini values for each country
min_gini_year = df_gini[df_gini['country'] == min_gini_country].sort_values(by='gini')['year'].iloc[0]
max_gini_year = df_gini[df_gini['country'] == max_gini_country].sort_values(by='gini', ascending=False)['year'].iloc[0]

# Print the results
print(f"Country with minimum Gini: {min_gini_country}, Year: {min_gini_year}, Gini: {gini_stats.loc[min_gini_country, 'min']}")
print(f"Country with maximum Gini: {max_gini_country}, Year: {max_gini_year}, Gini: {gini_stats.loc[max_gini_country, 'max']}")
Country with minimum Gini: Czechia, Year: 1992, Gini: 20.7
Country with maximum Gini: Bolivia, Year: 2000, Gini: 61.6
In [21]:
# Let's see how the summary statistics for Gini vary by income level:
gini_stats_by_income_level = df_gini.groupby('income_lvl')['gini'].describe()
print(gini_stats_by_income_level.to_string())
            count       mean       std   min        25%        50%      75%   max
income_lvl                                                                       
1.0         201.0  39.818035  8.230691  27.4  33.133333  39.050000  43.2250  61.6
2.0         392.0  43.884418  9.851186  20.7  36.075000  46.333333  51.9625  60.1
3.0         383.0  41.873325  8.488807  25.2  35.700000  42.800000  48.7500  59.9
4.0         673.0  32.074443  4.704032  23.0  28.300000  31.900000  34.8000  49.9
In [22]:
#Let's see summary stats on inflation by income level.
#This matters for our investigation because inflation affects rich and poor countries, and rich and poor people within a country, differently.

inflation_stats_by_income_level = df_gini.groupby('income_lvl')['infl'].describe()
print(inflation_stats_by_income_level.to_string())
            count       mean         std        min       25%       50%        75%          max
income_lvl                                                                                     
1.0         201.0  21.477256   74.643823  -3.169556  5.452302  8.837864  17.607724   952.995953
2.0         392.0  37.767287  247.145958 -26.299993  3.519471  6.654897  13.429700  3333.585422
3.0         383.0   7.772207    9.542675  -5.992202  2.578473  4.785570   8.799837    75.277369
4.0         673.0   2.190532    2.190667  -9.653676  1.003867  1.880282   2.985620    15.333310
In [23]:
# Let's see the summary statistics on population growth by income level:
popn_gr_stats_by_income_level = df_gini.groupby('income_lvl')['popn_gr'].describe()
print(popn_gr_stats_by_income_level.to_string())
            count      mean       std       min       25%       50%       75%       max
income_lvl                                                                             
1.0         201.0  1.751652  1.436717 -3.629546  1.048445  2.141922  2.906232  3.532921
2.0         392.0  1.163340  1.024360 -1.757004  0.496976  1.394561  1.873829  3.571097
3.0         383.0  0.717204  0.915528 -2.096943 -0.006653  1.015471  1.398723  2.760033
4.0         673.0  0.657537  0.744143 -2.258464  0.238457  0.540461  1.082907  3.931356
In [24]:
# Let's see what the trend of the Gini Index has been for some of the major world powers in this period:


# Filter the data for United States, United Kingdom, France, Germany, and Australia and creating the plot:
countries = ['United States', 'United Kingdom', 'France', 'Germany', 'Australia']
df_filtered = df_gini[df_gini['country'].isin(countries)]

plt.figure(figsize=(10, 6))
for country in countries:
    df_country = df_filtered[df_filtered['country'] == country]
    plt.plot(df_country['year'], df_country['gini'], label=country)

plt.xlabel('Year')
plt.ylabel('Gini Coefficient')
plt.title('Gini Trend for Major World Powers')

plt.legend()
plt.show()

We can see that the Gini Index in the United States was higher than in the other major world powers to begin with and, unlike in the other countries, hasn't seen sharp fluctuations in this period.

In [25]:
#Scatterplots for gdp per capita and gini for countries with different income levels:

import matplotlib.pyplot as plt
import numpy as np


# Define colors for the plots
colors = {
    1: '#1F77B4',
    2: '#FF7F0E',
    3: '#2CA02C',
    4: '#D62728'
}

# Create a figure with four subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

# Loop through income levels
for i, income_lvl in enumerate(range(1, 5)):
    # Filter data for the current income level
    df_subset = df_gini[df_gini['income_lvl'] == income_lvl]

    # Plot the scatterplot
    axes[i // 2, i % 2].scatter(df_subset['gdp_pc'], df_subset['gini'], color=colors[income_lvl])

    # Calculate and plot the trendline
    slope, intercept = np.polyfit(df_subset['gdp_pc'], df_subset['gini'], 1)
    axes[i // 2, i % 2].plot(df_subset['gdp_pc'], slope * df_subset['gdp_pc'] + intercept, color='black', linestyle='--')

    # Set the title and labels
    axes[i // 2, i % 2].set_title(f'Income Level {income_lvl}')
    axes[i // 2, i % 2].set_xlabel('GDP per Capita')
    axes[i // 2, i % 2].set_ylabel('Gini Coefficient')

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
In [26]:
# Scatter plot with a trendline for life expectancy and Gini coefficient:
import matplotlib.pyplot as plt
import numpy as np


plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(df_gini['life_exp'], df_gini['gini'])

# Trendline
z = np.polyfit(df_gini['life_exp'], df_gini['gini'], 1)
p = np.poly1d(z)
plt.plot(df_gini['life_exp'], p(df_gini['life_exp']), color='red')

# Labels and title
plt.xlabel('Life Expectancy')
plt.ylabel('Gini Coefficient')
plt.title('Life Expectancy vs. Gini Coefficient')

# Show the plot
plt.show()
In [27]:
# gini vs popn_gr scatterplot for countries with different income levels:

import matplotlib.pyplot as plt
import numpy as np

# Define colors for the plots
colors = {
    1: '#1F77B4',
    2: '#FF7F0E',
    3: '#2CA02C',
    4: '#D62728'
}

# Create a figure with four subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

# Loop through income levels
for i, income_lvl in enumerate(range(1, 5)):
    # Filter data for the current income level
    df_subset = df_gini[df_gini['income_lvl'] == income_lvl]

    # Plot the scatterplot
    axes[i // 2, i % 2].scatter(df_subset['popn_gr'], df_subset['gini'], color=colors[income_lvl])

    # Calculate and plot the trendline
    slope, intercept = np.polyfit(df_subset['popn_gr'], df_subset['gini'], 1)
    axes[i // 2, i % 2].plot(df_subset['popn_gr'], slope * df_subset['popn_gr'] + intercept, color='black', linestyle='--')

    # Set the title and labels
    axes[i // 2, i % 2].set_title(f'Income Level {income_lvl}')
    axes[i // 2, i % 2].set_xlabel('Population Growth Rate')
    axes[i // 2, i % 2].set_ylabel('Gini Coefficient')

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
In [28]:
# Scatterplot of inflation and Gini for different income levels (observations with inflation outside -10 to 40 are excluded as outliers)

import matplotlib.pyplot as plt
import numpy as np

# Define colors for the plots
colors = {
    1: '#1F77B4',
    2: '#FF7F0E',
    3: '#2CA02C',
    4: '#D62728'
}

# Create a figure with four subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

# Loop through income levels, one subplot per level
for ax, income_lvl in zip(axes.flat, range(1, 5)):
    # Filter data for the current income level and inflation values between -10 and 40
    df_subset = df_gini[(df_gini['income_lvl'] == income_lvl) & (df_gini['infl'] < 40) & (df_gini['infl'] > -10)]

    # Plot the scatterplot
    ax.scatter(df_subset['infl'], df_subset['gini'], color=colors[income_lvl])

    # Calculate and plot the trendline
    slope, intercept = np.polyfit(df_subset['infl'], df_subset['gini'], 1)
    ax.plot(df_subset['infl'], slope * df_subset['infl'] + intercept, color='black', linestyle='--')

    # Set the title and labels
    ax.set_title(f'Income Level {income_lvl}')
    ax.set_xlabel('Inflation')
    ax.set_ylabel('Gini Coefficient')

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

Model and Results¶

In this section, I estimate three panel-data models (pooled OLS, fixed effects, and random effects). I then run specification tests to choose the final model and present its results.

In [ ]:
#Installing required linear model packages for our statistical analysis
!pip install linearmodels
from linearmodels import PanelOLS
from linearmodels import RandomEffects
In [34]:
from linearmodels.panel import PooledOLS, PanelOLS, RandomEffects

# df_gini has the columns: country, year, gini, gdp_pc, infl, life_exp, popn_gr, unemp, income_lvl
# Drop any missing values
df_gini.dropna(inplace=True)
# Create a MultiIndex DataFrame with country and time dimensions
df_gini.set_index(['country', 'year'], inplace=True)

# Define dependent and independent variables
exog_vars = ['gdp_pc', 'infl', 'life_exp', 'popn_gr', 'unemp', 'income_lvl']
exog = df_gini[exog_vars]
endog = df_gini['gini']

# Pooled OLS model
pooled_ols_model = PooledOLS(endog, exog)
pooled_ols_results = pooled_ols_model.fit()


# Print model summaries
print("Pooled OLS:")
print(pooled_ols_results)
Pooled OLS:
                          PooledOLS Estimation Summary                          
================================================================================
Dep. Variable:                   gini   R-squared:                        0.9685
Estimator:                  PooledOLS   R-squared (Between):              0.9726
No. Observations:                1639   R-squared (Within):              -0.4438
Date:                Fri, May 03 2024   R-squared (Overall):              0.9685
Time:                        04:28:58   Log-likelihood                   -5506.1
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      8371.8
Entities:                          72   P-value                           0.0000
Avg Obs:                       22.764   Distribution:                  F(6,1633)
Min Obs:                       11.000                                           
Max Obs:                       29.000   F-statistic (robust):             8371.8
                                        P-value                           0.0000
Time periods:                      29   Distribution:                  F(6,1633)
Avg Obs:                       56.517                                           
Min Obs:                       14.000                                           
Max Obs:                       72.000                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc        -0.0002  1.184e-05    -18.057     0.0000     -0.0002     -0.0002
infl           0.0025     0.0014     1.7775     0.0757     -0.0003      0.0052
life_exp       0.5553     0.0117     47.527     0.0000      0.5324      0.5783
popn_gr        4.8373     0.1809     26.740     0.0000      4.4825      5.1921
unemp          0.3307     0.0418     7.9146     0.0000      0.2488      0.4127
income_lvl    -2.0961     0.2896    -7.2376     0.0000     -2.6642     -1.5281
==============================================================================
In [35]:
# Fixed effects model
fixed_effects_model = PanelOLS(endog, exog, entity_effects=True)
fixed_effects_results = fixed_effects_model.fit()

print("\nFixed Effects:")
print(fixed_effects_results)
Fixed Effects:
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                   gini   R-squared:                        0.1737
Estimator:                   PanelOLS   R-squared (Between):             -0.4042
No. Observations:                1639   R-squared (Within):               0.1737
Date:                Fri, May 03 2024   R-squared (Overall):             -0.3823
Time:                        04:28:59   Log-likelihood                   -3672.2
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      54.681
Entities:                          72   P-value                           0.0000
Avg Obs:                       22.764   Distribution:                  F(6,1561)
Min Obs:                       11.000                                           
Max Obs:                       29.000   F-statistic (robust):             54.681
                                        P-value                           0.0000
Time periods:                      29   Distribution:                  F(6,1561)
Avg Obs:                       56.517                                           
Min Obs:                       14.000                                           
Max Obs:                       72.000                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc      1.365e-05  8.134e-06     1.6783     0.0935  -2.303e-06    2.96e-05
infl           0.0012     0.0005     2.3911     0.0169      0.0002      0.0022
life_exp      -0.0619     0.0364    -1.7001     0.0893     -0.1334      0.0095
popn_gr        0.8427     0.1621     5.2002     0.0000      0.5248      1.1606
unemp          0.2317     0.0261     8.8766     0.0000      0.1805      0.2829
income_lvl    -1.8328     0.1758    -10.426     0.0000     -2.1776     -1.4880
==============================================================================

F-test for Poolability: 179.73
P-value: 0.0000
Distribution: F(71,1561)

Included effects: Entity
In [36]:
# Random effects model
random_effects_model = RandomEffects(endog, exog)
random_effects_results = random_effects_model.fit()

print("\nRandom Effects:")
print(random_effects_results)
Random Effects:
                        RandomEffects Estimation Summary                        
================================================================================
Dep. Variable:                   gini   R-squared:                        0.5557
Estimator:              RandomEffects   R-squared (Between):              0.9503
No. Observations:                1639   R-squared (Within):               0.0312
Date:                Fri, May 03 2024   R-squared (Overall):              0.9424
Time:                        04:28:59   Log-likelihood                   -3856.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      340.35
Entities:                          72   P-value                           0.0000
Avg Obs:                       22.764   Distribution:                  F(6,1633)
Min Obs:                       11.000                                           
Max Obs:                       29.000   F-statistic (robust):             340.35
                                        P-value                           0.0000
Time periods:                      29   Distribution:                  F(6,1633)
Avg Obs:                       56.517                                           
Min Obs:                       14.000                                           
Max Obs:                       72.000                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc     -4.898e-05  8.017e-06    -6.1097     0.0000   -6.47e-05  -3.326e-05
infl           0.0020     0.0005     3.6861     0.0002      0.0009      0.0030
life_exp       0.5226     0.0143     36.517     0.0000      0.4945      0.5506
popn_gr        1.5994     0.1696     9.4298     0.0000      1.2667      1.9320
unemp          0.3435     0.0274     12.555     0.0000      0.2899      0.3972
income_lvl    -2.8567     0.1805    -15.823     0.0000     -3.2108     -2.5025
==============================================================================
In [37]:
# Perform a Hausman test to choose between the fixed and random effects models.
# The null hypothesis favors random effects; the alternative hypothesis favors fixed effects.

from scipy.stats import chi2

# Hausman test for comparing fixed effects and random effects
b_fe = fixed_effects_results.params
b_re = random_effects_results.params
cov_fe = fixed_effects_results.cov
cov_re = random_effects_results.cov

# Calculate the Hausman test statistic
d = b_fe - b_re
d_cov = cov_fe - cov_re
hausman_stat = d.T @ (d_cov @ d.T)
p_value = 1 - chi2.cdf(hausman_stat, df=d.shape[0])

# Print Hausman test statistic and p-value
print("Hausman Test:")
print("Test Statistic:", hausman_stat)
print("P-value:", p_value)
Hausman Test:
Test Statistic: 0.004634724637361317
P-value: 0.9999999979294973

Since the p-value of the Hausman test is not significant, we fail to reject the null hypothesis and choose the random effects model over the fixed effects model.
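For reference, the conventional Hausman statistic is the quadratic form H = d' (V_FE - V_RE)^(-1) d, where d is the difference between the fixed and random effects coefficient vectors. It inverts the covariance difference, so it can give a different value from the product computed in the cell above. The following is a minimal sketch under that textbook definition; the function name `hausman` and the toy numbers are illustrative assumptions, not values from this project.

```python
import numpy as np
from scipy.stats import chi2


def hausman(b_fe, b_re, cov_fe, cov_re):
    """Textbook Hausman statistic: d' (V_fe - V_re)^{-1} d."""
    d = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(cov_fe, dtype=float) - np.asarray(cov_re, dtype=float)
    # Solve v_diff @ x = d instead of forming the inverse explicitly
    stat = float(d @ np.linalg.solve(v_diff, d))
    dof = d.shape[0]
    p_value = float(chi2.sf(stat, dof))
    return stat, dof, p_value


# Toy inputs (hypothetical, two coefficients)
b_fe = [1.0, 0.5]
b_re = [0.8, 0.4]
cov_fe = [[0.05, 0.0], [0.0, 0.02]]
cov_re = [[0.03, 0.0], [0.0, 0.01]]

stat, dof, p_value = hausman(b_fe, b_re, cov_fe, cov_re)
print(f"H = {stat:.3f}, df = {dof}, p = {p_value:.4f}")
```

With the notebook's objects, this would be called as `hausman(fixed_effects_results.params, random_effects_results.params, fixed_effects_results.cov, random_effects_results.cov)`. In finite samples the covariance difference need not be positive definite, in which case the statistic can come out negative and the test is inconclusive.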

In [38]:
# Breusch-Pagan Lagrange Multiplier test for random effects, used to choose between the random effects model and pooled OLS.
# The null hypothesis is the pooled OLS model and the alternative hypothesis is the random effects model.

# Calculate the residuals from the pooled OLS and fixed effects models
pooled_ols_residuals = pooled_ols_results.resids
fixed_effects_residuals = fixed_effects_results.resids

# Calculate the squared residuals
pooled_ols_squared_residuals = pooled_ols_residuals ** 2
fixed_effects_squared_residuals = fixed_effects_residuals ** 2

# Calculate the per-country means of the fixed effects residuals
fe_residuals_mean = fixed_effects_residuals.groupby(level='country').mean()

# Calculate the Breusch-Pagan test statistic
num = (pooled_ols_squared_residuals - fe_residuals_mean).sum()
den = pooled_ols_squared_residuals.sum() - (pooled_ols_squared_residuals.mean() ** 2) * pooled_ols_residuals.shape[0]
lm_stat = num / den

# Calculate the p-value using a chi-squared distribution with degrees of freedom equal to the number of countries - 1
p_value = 1 - chi2.cdf(lm_stat, df=fixed_effects_residuals.groupby(level='country').count().shape[0] - 1)

# Print the Breusch-Pagan test statistic and p-value
print("\nBreusch-Pagan Lagrange Multiplier Test:")
print("Test Statistic:", lm_stat)
print("P-value:", p_value)
Breusch-Pagan Lagrange Multiplier Test:
Test Statistic: -0.021064060212915268
P-value: 1.0

The p-value of our Lagrange Multiplier test is large, so we fail to reject the null hypothesis and choose the pooled OLS model.

After performing both these tests, we can confirm that Pooled OLS is our final model.
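The Breusch-Pagan LM test for random effects is usually computed from the pooled OLS residuals alone and compared with a chi-squared distribution with one degree of freedom. A minimal sketch follows, assuming the Baltagi-Li form of the statistic for unbalanced panels; the function name `bp_lm_random_effects` and the simulated residuals are illustrative assumptions, not this project's data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def bp_lm_random_effects(resids):
    """LM statistic for H0: no entity-level random effect.

    `resids` is a Series of pooled OLS residuals whose first index
    level identifies the entity (for example, the country).
    """
    by_entity = resids.groupby(level=0)
    t_i = by_entity.size().to_numpy(dtype=float)   # observations per entity
    sums = by_entity.sum().to_numpy(dtype=float)   # residual sum per entity
    n_obs = t_i.sum()
    ratio = (sums ** 2).sum() / (resids.to_numpy() ** 2).sum()
    scale = n_obs ** 2 / (2.0 * ((t_i ** 2).sum() - n_obs))
    stat = float(scale * (ratio - 1.0) ** 2)
    return stat, float(chi2.sf(stat, 1))


# Simulated residuals (hypothetical): 30 entities, 10 periods each
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product([range(30), range(10)], names=["country", "year"])
noise = pd.Series(rng.normal(size=300), index=index)
entity_effect = pd.Series(np.repeat(rng.normal(scale=2.0, size=30), 10), index=index)

stat_none, p_none = bp_lm_random_effects(noise)
stat_re, p_re = bp_lm_random_effects(noise + entity_effect)
print(f"no entity effect:   LM = {stat_none:.2f}, p = {p_none:.3f}")
print(f"with entity effect: LM = {stat_re:.2f}, p = {p_re:.3g}")
```

On the notebook's data the call would be `bp_lm_random_effects(pooled_ols_results.resids)`. The statistic is a scaled square, so it cannot be negative, and a small p-value would favor random effects over pooled OLS.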

Final Model, Results, and Conclusion¶

In this section, I present the final statistical model and interpret the results. As stated above, after performing the statistical tests, the pooled OLS model is our final model.

In [39]:
#Final Model
print("Pooled OLS:")
print(pooled_ols_results)
Pooled OLS:
                          PooledOLS Estimation Summary                          
================================================================================
Dep. Variable:                   gini   R-squared:                        0.9685
Estimator:                  PooledOLS   R-squared (Between):              0.9726
No. Observations:                1639   R-squared (Within):              -0.4438
Date:                Fri, May 03 2024   R-squared (Overall):              0.9685
Time:                        04:28:58   Log-likelihood                   -5506.1
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      8371.8
Entities:                          72   P-value                           0.0000
Avg Obs:                       22.764   Distribution:                  F(6,1633)
Min Obs:                       11.000                                           
Max Obs:                       29.000   F-statistic (robust):             8371.8
                                        P-value                           0.0000
Time periods:                      29   Distribution:                  F(6,1633)
Avg Obs:                       56.517                                           
Min Obs:                       14.000                                           
Max Obs:                       72.000                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc        -0.0002  1.184e-05    -18.057     0.0000     -0.0002     -0.0002
infl           0.0025     0.0014     1.7775     0.0757     -0.0003      0.0052
life_exp       0.5553     0.0117     47.527     0.0000      0.5324      0.5783
popn_gr        4.8373     0.1809     26.740     0.0000      4.4825      5.1921
unemp          0.3307     0.0418     7.9146     0.0000      0.2488      0.4127
income_lvl    -2.0961     0.2896    -7.2376     0.0000     -2.6642     -1.5281
==============================================================================

For our final model, the R-squared is 0.9685, meaning that the independent variables explain 96.85% of the variation in the dependent variable.
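Because the regression above is fitted without a constant term (six regressors and an F(6,1633) distribution), the reported R-squared is most likely measured against the raw, uncentered sum of squares of `gini` rather than its variation around the mean. This is my assumption about how the estimator handles a no-constant model. The sketch below uses simulated data, not this project's, to show how far the two definitions can diverge.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(loc=70.0, scale=5.0, size=n)   # regressor with a large mean
y = 40.0 + rng.normal(scale=8.0, size=n)      # outcome unrelated to x

# Least squares without an intercept: y ~ beta * x
beta = (x @ y) / (x @ x)
resid = y - beta * x
rss = resid @ resid

# Uncentered R-squared compares the fit with predicting y = 0
r2_uncentered = 1.0 - rss / (y @ y)
# Centered R-squared compares the fit with predicting y = mean(y)
y_dev = y - y.mean()
r2_centered = 1.0 - rss / (y_dev @ y_dev)

print(f"uncentered R-squared: {r2_uncentered:.3f}")
print(f"centered R-squared:   {r2_centered:.3f}")
```

In this simulation the uncentered figure is high even though `x` carries no information about `y`, which is one reason to read a very high R-squared from a no-constant regression with care.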

The F-statistic is 8371.8 and the p-value for the global F-test is below the 0.05 significance level, so we can reject the null hypothesis that all coefficients are jointly zero. This indicates that the model as a whole is statistically significant, and at least one of the independent variables has a significant effect on the dependent variable.
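As a rough check, the global F-statistic can be recovered from the reported R-squared. The sketch below assumes the no-constant form F = (R²/k) / ((1 - R²)/(n - k)), with k = 6 regressors and n = 1639 observations, matching the F(6,1633) distribution in the summary. Because the R-squared is rounded to four decimals, the result only approximates the reported 8371.8.

```python
r_squared = 0.9685   # reported R-squared (rounded)
n_obs = 1639         # reported number of observations
k = 6                # regressors in exog_vars (no constant)

df_resid = n_obs - k                       # 1633, as in F(6,1633)
f_stat = (r_squared / k) / ((1.0 - r_squared) / df_resid)

print(f"df = ({k}, {df_resid}), F ~ {f_stat:.0f}")
```

The computed value is within a fraction of a percent of the figure in the summary, which is consistent with the rounding of the R-squared.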

In our final model, our primary independent variable of interest, gdp_pc, is significant in explaining the Gini Index, but not in the direction I expected. As gdp_pc rises by $1, the Gini Index decreases by 0.0002 points.
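The printed coefficient is rounded to four decimals, so a slightly more precise value can be backed out from the reported t-statistic and standard error and then rescaled to a more readable unit. A small sketch, assuming only the figures shown in the summary table:

```python
t_stat = -18.057        # reported T-stat for gdp_pc
std_err = 1.184e-05     # reported Std. Err. for gdp_pc

coef = t_stat * std_err            # coefficient = t-statistic * standard error
per_thousand = coef * 1_000        # effect of a $1,000 rise in GDP per capita

print(f"coefficient per $1:     {coef:.6f}")
print(f"coefficient per $1,000: {per_thousand:.3f}")
```

Read this way, a $1,000 increase in GDP per capita is associated with a Gini Index roughly 0.21 points lower.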

All the independent variables except inflation are significant at the 5% level in our final model. Increases in life expectancy, population growth rate, and unemployment increase inequality in a country, which intuitively makes sense as well. An increase in income level, however, decreases inequality. This means that as countries move up the income level brackets discussed earlier in terms of GNI per capita, inequality decreases.

From this project, we can conclude that economic growth in our sample of 72 countries over the 29-year period from 1991 to 2019 decreased income inequality. Increases in the population growth rate, unemployment rate, and life expectancy increased income inequality, while a country moving to a higher income level caused a decrease in income inequality.