This lesson covers applications of regression using multiple datasets. The first part gives a brief summary of the linear regression estimator as a refresher.
Linear Regression is a statistical method used to capture linear relationships between two (or more) variables.
The model consists of the equation $Y_{i} = \beta_{0} + \beta_{1}X_{i} + \varepsilon_{i}$, where $Y_{i}$ is the outcome variable, $X_{i}$ is the explanatory variable, $\beta_{0}$ and $\beta_{1}$ are the intercept and slope parameters, and $\varepsilon_{i}$ is the error term.
An example of this can be the effect of education (X) on income (Y).
We strive to find the best-fitting line, as it allows us to summarize the relationship between the variables and to predict $Y$ for new values of $X$.
The equation that gets us the best-fitting line is: $\hat{Y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{i}$
Where $\hat{Y}_{i}$ is the predicted outcome for observation $i$, and $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are the estimated intercept and slope that define the line.
With so many possible lines, how exactly do we select the best one?
We select the combination of intercept ($\beta_{0}$) and coefficient ($\beta_{1}$) values from our line equation ($Y_{i} = \beta_{0} + \beta_{1}X_{i} + \varepsilon_{i}$) that minimizes our sum of squared residuals (SSR).
The sum of squared residuals is none other than the sum of the squared differences between our actual outcomes ($Y_{i}$) and the predicted outcomes ($\hat{Y}_{i}$) generated by a candidate line: $\mathrm{SSR} = \sum_{i=1}^{n}(Y_{i} - \hat{Y}_{i})^{2}$.
You don't have to worry about how the formulas for the ideal values are derived, as statistical software does this for us; but if you are interested, this link shows how it is done. In addition, this link provides an excellent review of (or introduction to) regressions for those who may need it.
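To demystify what the software is doing, here is a minimal sketch (using simulated data invented purely for illustration, not the dataset below) of the closed-form solution for a single regressor, $\hat{\beta}_{1} = \mathrm{Cov}(X, Y)/\mathrm{Var}(X)$ and $\hat{\beta}_{0} = \bar{Y} - \hat{\beta}_{1}\bar{X}$:

import numpy as np
# Simulate data with a known relationship: Y = 5 + 2X + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 5 + 2 * x + rng.normal(0, 1, size=100)
# Closed-form OLS estimates for a single regressor
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)  # should land close to 5 and 2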
Let's apply this estimator in Python, starting with a made-up example before moving on to real-world datasets.
Let's import the library packages we will be using for this lesson. The package that allows us to run regressions in Python is the statsmodels library.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
Let's also import the first dataset we will be using.
# Importing the dataset
df = pd.read_csv('Salary_Data.csv')
df.head()
|   | YearsExperience | Salary | Sex |
|---|---|---|---|
| 0 | 1.1 | 39343 | Female |
| 1 | 1.3 | 46205 | Male |
| 2 | 1.5 | 37731 | Male |
| 3 | 2.0 | 43525 | Female |
| 4 | 2.2 | 39891 | Female |
The imported dataset reports each person's years of experience at a certain job, their salary, and their sex. Suppose we are interested in understanding how years of work experience can affect one's salary. We would regress salary on years of experience. Before doing that exactly, let's begin by visualizing the relationship in the data.
We can visualize the data by using the seaborn function sns.regplot and adding additional aesthetics.
# Set theme
sns.set_style('ticks')
plot = sns.regplot(x='YearsExperience', y='Salary', data=df, color='black')
# Adding title and labels
plot.set_title('Figure 1', fontdict={'size': 18, 'weight': 'normal'})
plot.set_xlabel('Years of Work Experience', fontdict={'size': 12})
plot.set_ylabel('Salary (Dollars)', fontdict={'size': 12})
[Figure 1: scatter plot of Salary (Dollars) against Years of Work Experience with a fitted regression line]
Our initial visualization of the data shows that there may be a linear relationship between years of work experience and salary. The data also appear homoskedastic (the spread of the observations around the line looks roughly constant across levels of experience). Let's continue by fitting the model and running our regression.
# Running the regression
model = smf.ols('Salary ~ YearsExperience', data=df).fit()
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     622.5
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           1.14e-20
Time:                        12:44:12   Log-Likelihood:                -301.44
No. Observations:                  30   AIC:                             606.9
Df Residuals:                      28   BIC:                             609.7
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.579e+04   2273.053     11.347      0.000    2.11e+04    3.04e+04
YearsExperience  9449.9623    378.755     24.950      0.000    8674.119    1.02e+04
==============================================================================
Omnibus:                        2.140   Durbin-Watson:                   1.648
Prob(Omnibus):                  0.343   Jarque-Bera (JB):                1.569
Skew:                           0.363   Prob(JB):                        0.456
Kurtosis:                       2.147   Cond. No.                         13.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model.summary() command presents us with a nicely formatted table displaying a lot of information. Let's dissect it.
The top section of the regression summary contains data on what was done and what was calculated in the regression model.
Starting with the left column, the summary tells us the outcome variable of interest (in this case, the salary of each individual), followed by the model and method used to get the results (in this case, an ordinary least squares regression, which obtained the intercept and coefficient estimates by minimizing the sum of squared residuals).
It then displays system information (the date and time the regression was run) and ends with some information about the dataset used to calculate the results: the total number of observations was 30, and 'Df Residuals' is simply the number of observations minus the number of parameters being estimated (here, 30 − 2 = 28).
The last entry in the left column displays the kind of covariance estimator the model used. Here we have 'nonrobust', the default, which does not account for any heteroskedasticity or autocorrelation present in the data. We will run a different kind of regression that does account for these later.
The right column of the top section continues to display data related to the regression model built.
Starting with the first two measures, the R-squared measures how well the line we found fits the data. It gives the share of the variation in the outcome that is explained by our regression line, and it is computed from the sum of squared residuals (SSR) we saw earlier together with the total sum of squares (TSS), using the following equation: $R^{2} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}$
If you are interested in knowing how the TSS is calculated, this three-minute YouTube video explains it quite well.
Because our R-squared metric ranges from 0 to 1, a value closer to 1 means the regression line explains more of the variation in the outcome, while a value closer to 0 suggests that our regression explains little of the relationship between the variables.
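As a quick sanity check, the pieces of this formula are exposed directly on the fitted results object; a minimal sketch (assuming the model object fitted above):

# Verify that R-squared = 1 - SSR/TSS using the fitted model's attributes
ssr = model.ssr           # sum of squared residuals
tss = model.centered_tss  # total sum of squares
print(1 - ssr / tss)      # matches model.rsquared (about 0.957)
print(model.rsquared)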
The adjusted R-squared is a modified version of the R-squared that penalizes additional predictors included in the model. Here, because we only have one independent variable, our adjusted R-squared barely differs from the R-squared. If you are interested in re-familiarizing yourself with these two concepts, the following two links are good resources to do so.
The F-statistic is the ratio of the mean squares of the model (treatment) to the mean squares of the error. It essentially measures the variation between sample means relative to the variation found within the samples used in the regression. A large F-statistic suggests evidence of a difference between the group means (you can reject the null hypothesis of no effect).
The Prob (F-statistic) is none other than the collective p-value for our regression results. Per conventional standards, if it is lower than 0.05, we can reject the null hypothesis of no effect (meaning our results are statistically significant).
We will ignore the remaining measures (Log-Likelihood, AIC, and BIC), as we do not need to understand them for our purposes.
The midsection of the regression summary is the most important, as this section displays our intercept and coefficient details. Let's interpret what we are seeing.
The first column displays all the variables that were inputted into the regression model. In this case, we only have our intercept and our years of experience variable.
The intercept is the value that our regression model equation produces if all independent variables were set to 0.
'YearsExperience' is our variable of interest, and here we can see the beta coefficient's effect on salary. One would interpret this value as: an additional year of work experience is associated with an estimated increase of about $9,450 in one's salary, holding all else constant (that's a big jump).
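Putting the intercept and coefficient together, we can generate predictions; for someone with five years of experience, the model implies roughly $25{,}792 + 9{,}450 \times 5 \approx 73{,}042$ dollars. A minimal sketch (assuming the fitted model from above):

# Predicted salary at 5 years of experience
new_data = pd.DataFrame({'YearsExperience': [5]})
print(model.predict(new_data))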
The second column displays the standard errors (SEs) associated with the beta coefficients in our model. The standard error measures sampling variation: how much our estimates, computed from a sample (n), would vary around the true value in the unobserved population (N). The higher the SE, the larger the variation from our sample to the true population. How does this relate to our estimated effects? If a coefficient were about the same size as its SE, the estimate would be statistically indistinguishable from zero (a non-significant effect). If the coefficient is far larger than its SE, the effect is distinguishable from zero (a significant effect). Because our coefficient is large relative to its SE, we can conclude that a distinguishable effect is present.
The third column displays the test statistic, which is the coefficient divided by the standard error.
In order to get our p-value, one would take the test statistic to a t-distribution table with the residual degrees of freedom (the number of observations minus the number of estimated parameters, the 'Df Residuals' from the top section). Luckily, the fourth column does that for us and presents the p-value of each beta coefficient estimate. Under arbitrary but conventional norms of significance, if the p-value is less than 0.05, we deem the coefficient statistically significant.
The last two columns display the bounds of the 95% confidence interval for each beta coefficient: the range of values that would contain the true coefficient in 95% of repeated samples, roughly the estimate plus or minus two standard errors.
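All of these mid-table quantities can also be pulled out programmatically; a minimal sketch (assuming the fitted model from above):

# Reproduce the middle section of the summary by hand
print(model.params)              # coefficients
print(model.bse)                 # standard errors
print(model.params / model.bse)  # test statistics
print(model.conf_int())          # 95% confidence intervals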
While the bottom section of the results contains multiple measures, the most important one to know is the Durbin-Watson statistic. It measures autocorrelation in the residuals, that is, whether neighboring errors are correlated with one another. The statistic ranges from 0 to 4; a value close to 2 (commonly, anywhere between about 1.5 and 2.5) suggests little or no autocorrelation.
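If you want to eyeball homoskedasticity itself, a common check (a minimal sketch, assuming the fitted model from above) is to plot the residuals against the fitted values and look for a roughly constant spread; a funnel shape would suggest heteroskedasticity:

# Residuals vs. fitted values as a visual homoskedasticity check
plt.scatter(model.fittedvalues, model.resid, color='black')
plt.axhline(0, linestyle='--', color='gray')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()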
Suppose we wanted to introduce an additional variable into our regression: the sex of the individual. We would then regress salary on years of experience and sex. Before doing that exactly, let's begin by visualizing the relationship using one continuous and one categorical variable.
While the previous example used sns.regplot() to plot our regression visualization, we will be using sns.lmplot() this time, as it allows us to add a categorical component. There are two ways to plot the regression lines, depending on how you want to convey the data.
# Set theme
sns.set_style('ticks')
# Plot itself; height and aspect size the figure directly (about 8 x 6 inches)
plot = sns.lmplot(x='YearsExperience', y='Salary', data=df, hue='Sex',
                  markers=['o', 'v'], height=6, aspect=8/6)
# Plot title and axis labels
plt.title('Figure 2')
plt.ylabel("Salary")
plt.xlabel("Years of Work Experience")
# Adjust x limits, if desired
#plt.xlim(1,10)
# Tighten the layout so the labels fit inside the figure
plt.tight_layout()
[Figure 2: salary against years of work experience, with separate markers and regression lines for each sex]
# Set theme
sns.set_style('ticks')
# Plot itself, with one panel (column) per category of Sex
plot = sns.lmplot(x='YearsExperience', y='Salary', data=df, col='Sex',
                  height=6, aspect=8/6)
# Tighten the layout so the labels fit inside the figure
plt.tight_layout()
[Two side-by-side panels: salary against years of work experience, with a separate regression fit for each sex]
Now, let's run the regression. There are multiple ways to run a regression with a categorical variable as an additional independent variable. We will cover a few of those ways.
Here, our coefficient tells us the relationship between sex and salary. Python automatically chooses one of the categories as the reference (here Female, the first alphabetically) and reports the other category's effect relative to it (the Sex[T.Male] row). Useful and easy, but not as flexible if you are interested in having more control over what actually gets captured.
# Running the regression
model = smf.ols('Salary ~ YearsExperience + Sex' , data=df).fit()
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     306.9
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           2.71e-19
Time:                        12:44:12   Log-Likelihood:                -301.12
No. Observations:                  30   AIC:                             608.2
Df Residuals:                      27   BIC:                             612.5
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.656e+04   2505.089     10.604      0.000    2.14e+04    3.17e+04
Sex[T.Male]     -1625.9575   2138.556     -0.760      0.454   -6013.912    2761.997
YearsExperience  9467.9701    382.375     24.761      0.000    8683.401    1.03e+04
==============================================================================
Omnibus:                        1.685   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.431   Jarque-Bera (JB):                1.198
Skew:                           0.232   Prob(JB):                        0.549
Kurtosis:                       2.138   Cond. No.                         16.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
An alternative way to let Python know you are working with a categorical variable is to use the C() operator within the regression formula. As we can see, we are still letting Python choose the reference category for us.
# Running the regression
model = smf.ols('Salary ~ YearsExperience + C(Sex)' , data=df).fit()
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     306.9
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           2.71e-19
Time:                        12:44:12   Log-Likelihood:                -301.12
No. Observations:                  30   AIC:                             608.2
Df Residuals:                      27   BIC:                             612.5
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.656e+04   2505.089     10.604      0.000    2.14e+04    3.17e+04
C(Sex)[T.Male]  -1625.9575   2138.556     -0.760      0.454   -6013.912    2761.997
YearsExperience  9467.9701    382.375     24.761      0.000    8683.401    1.03e+04
==============================================================================
Omnibus:                        1.685   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.431   Jarque-Bera (JB):                1.198
Skew:                           0.232   Prob(JB):                        0.549
Kurtosis:                       2.138   Cond. No.                         16.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
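If you would rather pick the reference category yourself instead of letting Python default to the first category alphabetically, the C() operator accepts patsy's Treatment coding; a minimal sketch that explicitly makes 'Male' the omitted baseline:

# Explicitly set 'Male' as the reference (omitted) category
model = smf.ols("Salary ~ YearsExperience + C(Sex, Treatment(reference='Male'))", data=df).fit()
print(model.summary())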
A third way to let Python know you are working with a categorical variable is to create a new dummy variable for women ourselves. There are two ways to do this: (A) the long way, and (B) the short way. Let's cover both of them.
# Show DataFrame
df.head()
|   | YearsExperience | Salary | Sex |
|---|---|---|---|
| 0 | 1.1 | 39343 | Female |
| 1 | 1.3 | 46205 | Male |
| 2 | 1.5 | 37731 | Male |
| 3 | 2.0 | 43525 | Female |
| 4 | 2.2 | 39891 | Female |
# Create an empty list (avoid calling it 'list', which would shadow the Python built-in)
female_dummy = []
# Use a for-loop and if-else statement
for x in df['Sex']:             # For every observation (row) in the variable, Sex
    if x == 'Female':           # If the observation (row) is equal to 'Female'
        female_dummy.append(1)  # Add a 1 for that observation to the list we created
    else:
        female_dummy.append(0)  # If it doesn't say 'Female', add a 0 for that observation
# Create a new variable and assign the list we created
df['Female'] = female_dummy
# Confirm that the variable is present
print(df)
# Running the regression
model = smf.ols('Salary ~ YearsExperience + Female' , data=df).fit()
# Presenting a table summary
print(model.summary())
    YearsExperience  Salary     Sex  Female
0               1.1   39343  Female       1
1               1.3   46205    Male       0
2               1.5   37731    Male       0
3               2.0   43525  Female       1
4               2.2   39891  Female       1
5               2.9   56642  Female       1
6               3.0   60150    Male       0
7               3.2   54445    Male       0
8               3.2   64445  Female       1
9               3.7   57189    Male       0
10              3.9   63218    Male       0
11              4.0   55794  Female       1
12              4.0   56957  Female       1
13              4.1   57081    Male       0
14              4.5   61111    Male       0
15              4.9   67938    Male       0
16              5.1   66029  Female       1
17              5.3   83088    Male       0
18              5.9   81363    Male       0
19              6.0   93940  Female       1
20              6.8   91738  Female       1
21              7.1   98273  Female       1
22              7.9  101302    Male       0
23              8.2  113812  Female       1
24              8.7  109431  Female       1
25              9.0  105582    Male       0
26              9.5  116969    Male       0
27              9.6  112635    Male       0
28             10.3  122391    Male       0
29             10.5  121872  Female       1

                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     306.9
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           2.71e-19
Time:                        12:44:12   Log-Likelihood:                -301.12
No. Observations:                  30   AIC:                             608.2
Df Residuals:                      27   BIC:                             612.5
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.494e+04   2551.242      9.775      0.000    1.97e+04    3.02e+04
YearsExperience  9467.9701    382.375     24.761      0.000    8683.401    1.03e+04
Female           1625.9575   2138.556      0.760      0.454   -2761.997    6013.912
==============================================================================
Omnibus:                        1.685   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.431   Jarque-Bera (JB):                1.198
Skew:                           0.232   Prob(JB):                        0.549
Kurtosis:                       2.138   Cond. No.                         16.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
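As an aside (an alternative the walkthrough above doesn't use), the same dummy can be created in one vectorized line by casting the boolean comparison to an integer:

# Equivalent one-liner: True/False cast to 1/0
df['Female'] = (df['Sex'] == 'Female').astype(int)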
# Importing the dataset to clear it
df = pd.read_csv('Salary_Data.csv')
# Use pd.get_dummies to generate dummy variables out of the Sex variable
# You can pass the argument 'drop_first=True' to drop the first of the dummy
# variables created
Sex = pd.get_dummies(df['Sex'], drop_first=False)
# Join the new dummy columns with the original DataFrame using 'df.join'
# Unlike concat and merge, join matches rows on the index; this works here
# because Sex was built from df and therefore shares its index
df = df.join(Sex)
# Running the regression
model = smf.ols('Salary ~ YearsExperience + Female' , data=df).fit()
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     306.9
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           2.71e-19
Time:                        12:44:12   Log-Likelihood:                -301.12
No. Observations:                  30   AIC:                             608.2
Df Residuals:                      27   BIC:                             612.5
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.494e+04   2551.242      9.775      0.000    1.97e+04    3.02e+04
YearsExperience  9467.9701    382.375     24.761      0.000    8683.401    1.03e+04
Female           1625.9575   2138.556      0.760      0.454   -2761.997    6013.912
==============================================================================
Omnibus:                        1.685   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.431   Jarque-Bera (JB):                1.198
Skew:                           0.232   Prob(JB):                        0.549
Kurtosis:                       2.138   Cond. No.                         16.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Suppose we want to know whether the effect of years of work experience on salary differs for women. A regression holds each variable constant, so unless we tell Python to interact the two variables, we cannot observe how the effect of one changes with the other. We can do this by using the '*' operator within the regression formula; it expands to both main effects plus their interaction (the ':' operator would give the interaction term alone).
# Running the regression
model = smf.ols('Salary ~ YearsExperience * Female' , data=df).fit()
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.959
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     205.3
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           3.25e-18
Time:                        12:44:12   Log-Likelihood:                -300.53
No. Observations:                  30   AIC:                             609.1
Df Residuals:                      26   BIC:                             614.7
Df Model:                           3
Covariance Type:            nonrobust
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept               2.687e+04   3170.710      8.473      0.000    2.03e+04    3.34e+04
YearsExperience         9115.6677    514.374     17.722      0.000    8058.357    1.02e+04
Female                 -2526.1464   4587.162     -0.551      0.587    -1.2e+04    6902.899
YearsExperience:Female   785.8048    768.207      1.023      0.316    -793.268    2364.878
==============================================================================
Omnibus:                        3.031   Durbin-Watson:                   1.703
Prob(Omnibus):                  0.220   Jarque-Bera (JB):                1.450
Skew:                           0.138   Prob(JB):                        0.484
Kurtosis:                       1.959   Cond. No.                         33.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
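To read this output: the YearsExperience coefficient is the slope for men (Female = 0), and the slope for women adds the interaction term on top of it. A minimal sketch (assuming the fitted interaction model from above):

# Slope of salary in experience, by group
slope_men = model.params['YearsExperience']
slope_women = slope_men + model.params['YearsExperience:Female']
print(slope_men, slope_women)  # about 9,115.67 vs. 9,901.47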
In addition, we can tell Python that we want heteroskedasticity-robust standard errors by inserting the cov_type='HC1' argument into the .fit() function.
# Running the regression
model = smf.ols('Salary ~ YearsExperience * Female' , data=df).fit(cov_type='HC1')
# Presenting a table summary
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.959
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     321.1
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           1.18e-20
Time:                        12:44:12   Log-Likelihood:                -300.53
No. Observations:                  30   AIC:                             609.1
Df Residuals:                      26   BIC:                             614.7
Df Model:                           3
Covariance Type:                  HC1
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept               2.687e+04   2832.504      9.485      0.000    2.13e+04    3.24e+04
YearsExperience         9115.6677    374.322     24.352      0.000    8382.010    9849.326
Female                 -2526.1464   4393.520     -0.575      0.565   -1.11e+04    6084.995
YearsExperience:Female   785.8048    682.491      1.151      0.250    -551.854    2123.464
==============================================================================
Omnibus:                        3.031   Durbin-Watson:                   1.703
Prob(Omnibus):                  0.220   Jarque-Bera (JB):                1.450
Skew:                           0.138   Prob(JB):                        0.484
Kurtosis:                       1.959   Cond. No.                         33.3
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)
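The same cov_type argument accepts other covariance estimators as well; for instance (an illustrative sketch only, since two groups is far too few clusters for this to be sensible in practice), cluster-robust standard errors grouped on a variable:

# Cluster-robust standard errors, clustering on Sex (illustration only)
model_cl = smf.ols('Salary ~ YearsExperience * Female', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['Sex']})
print(model_cl.summary())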
Suppose we want to compare different regression models and present them in a more organized table. There are two methods that allow us to do this.
# Import summary_col package
from statsmodels.iolib.summary2 import summary_col
## Run three separate regression models
# Reg 1
model1 = smf.ols('Salary ~ YearsExperience' , data=df).fit()
# Reg 2
model2 = smf.ols('Salary ~ YearsExperience + Female' , data=df).fit()
# Reg 3
model3 = smf.ols('Salary ~ YearsExperience * Female' , data=df).fit()
dfoutput = summary_col([model1,model2,model3],stars=True, float_format='%0.2f',
regressor_order=model1.params.index.tolist())
print(dfoutput)
# Use this to produce the table as LaTeX and copy/paste it to get actual table
#dfoutput.as_latex()
==========================================================
                        Salary I    Salary II   Salary III
----------------------------------------------------------
Intercept              25792.20*** 24937.74*** 26866.59***
                       (2273.05)   (2551.24)   (3170.71)
YearsExperience        9449.96***  9467.97***  9115.67***
                       (378.75)    (382.38)    (514.37)
Female                             1625.96     -2526.15
                                   (2138.56)   (4587.16)
YearsExperience:Female                         785.80
                                               (768.21)
R-squared              0.96        0.96        0.96
R-squared Adj.         0.96        0.95        0.95
==========================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
The GitHub repo for this package can be found here.
## Install the package from PyPI
# pip install stargazer
# Import library package to workspace
from stargazer.stargazer import Stargazer, LineLocation
# Use previous models to create the table
stargazer = Stargazer([model1, model2, model3])
# Specify significant digits
stargazer.significant_digits(2)
stargazer
Dependent variable: Salary

|  | (1) | (2) | (3) |
|---|---|---|---|
| Female |  | 1625.96 | -2526.15 |
|  |  | (2138.56) | (4587.16) |
| Intercept | 25792.20*** | 24937.74*** | 26866.59*** |
|  | (2273.05) | (2551.24) | (3170.71) |
| YearsExperience | 9449.96*** | 9467.97*** | 9115.67*** |
|  | (378.75) | (382.38) | (514.37) |
| YearsExperience:Female |  |  | 785.80 |
|  |  |  | (768.21) |
| Observations | 30 | 30 | 30 |
| R2 | 0.96 | 0.96 | 0.96 |
| Adjusted R2 | 0.96 | 0.95 | 0.95 |
| Residual Std. Error | 5788.32 (df=28) | 5832.43 (df=27) | 5827.43 (df=26) |
| F Statistic | 622.51*** (df=1; 28) | 306.85*** (df=2; 27) | 205.27*** (df=3; 26) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Here, we notice that the coefficient rows are ordered a little oddly, as stargazer places them in alphabetical order. If we want them in a more traditional order, we can specify the ordering to the stargazer package.
stargazer.covariate_order(['Intercept', 'YearsExperience', 'Female', 'YearsExperience:Female'])
stargazer
# The table displayed below is rendered HTML; to get the HTML source itself,
# execute stargazer.render_html()
Dependent variable: Salary

|  | (1) | (2) | (3) |
|---|---|---|---|
| Intercept | 25792.20*** | 24937.74*** | 26866.59*** |
|  | (2273.05) | (2551.24) | (3170.71) |
| YearsExperience | 9449.96*** | 9467.97*** | 9115.67*** |
|  | (378.75) | (382.38) | (514.37) |
| Female |  | 1625.96 | -2526.15 |
|  |  | (2138.56) | (4587.16) |
| YearsExperience:Female |  |  | 785.80 |
|  |  |  | (768.21) |
| Observations | 30 | 30 | 30 |
| R2 | 0.96 | 0.96 | 0.96 |
| Adjusted R2 | 0.96 | 0.95 | 0.95 |
| Residual Std. Error | 5788.32 (df=28) | 5832.43 (df=27) | 5827.43 (df=26) |
| F Statistic | 622.51*** (df=1; 28) | 306.85*** (df=2; 27) | 205.27*** (df=3; 26) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Lastly, we can also render the table as LaTeX for easy exporting.
print(stargazer.render_latex())
\begin{table}[!htbp] \centering
 \begin{tabular}{@{\extracolsep{5pt}}lccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{3}{c}{\textit{Dependent variable:}} \ \cr
\cline{3-4}
\\[-1.8ex] & (1) & (2) & (3) \\
\hline \\[-1.8ex]
 Intercept & 25792.20$^{***}$ & 24937.74$^{***}$ & 26866.59$^{***}$ \\
& (2273.05) & (2551.24) & (3170.71) \\
 YearsExperience & 9449.96$^{***}$ & 9467.97$^{***}$ & 9115.67$^{***}$ \\
& (378.75) & (382.38) & (514.37) \\
 Female & & 1625.96$^{}$ & -2526.15$^{}$ \\
& & (2138.56) & (4587.16) \\
 YearsExperience:Female & & & 785.80$^{}$ \\
& & & (768.21) \\
\hline \\[-1.8ex]
 Observations & 30 & 30 & 30 \\
 $R^2$ & 0.96 & 0.96 & 0.96 \\
 Adjusted $R^2$ & 0.96 & 0.95 & 0.95 \\
 Residual Std. Error & 5788.32(df = 28) & 5832.43(df = 27) & 5827.43(df = 26) \\
 F Statistic & 622.51$^{***}$ (df = 1.0; 28.0) & 306.85$^{***}$ (df = 2.0; 27.0) & 205.27$^{***}$ (df = 3.0; 26.0) \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
Let's use a more real-world example of data. The following data are from Zhou and Shaver's (2021) article, 'Reexamining the Effect of Refugees on Civil Conflict: A Global Subnational Analysis', in which they seek to understand (among other things) whether the presence of refugees causes more conflict within states. While their paper conducts several analyses, the one we will replicate regresses the number of violent events on refugee presence. Written in generic fixed-effects notation with the variable names used below, the regression we estimate takes the form: $\text{attack}_{it} = \beta_{1}\,\text{rtb}_{it} + \mathbf{X}_{it}'\boldsymbol{\gamma} + \alpha_{i} + \delta_{t} + \varepsilon_{it}$, where $\mathbf{X}_{it}$ collects the control variables and $\alpha_{i}$ and $\delta_{t}$ are country and year fixed effects.
Let's import the data and view it.
df = pd.read_csv('panel.full_GED_2020.csv')
df.head()
|   | GMI_ADMIN | GMI_CNTRY | CNTRY_NAME | ADMIN_NAME | Country | Year | attack | attack_state_based | attack_non_state | attack_one_sided | ... | no_new_site_rs_1 | no_new_site_rt_2 | no_new_site_rc_2 | no_new_site_rs_2 | nearborder | nearcapital | TotalRoadLength | RoadDensity | TotalRoadLength_km | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG-BAM | AFG | Afghanistan | Bamian | Afghanistan | 1990-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | NaN | NaN | NaN | 1990 |
| 1 | AFG-BAM | AFG | Afghanistan | Bamian | Afghanistan | 1991-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | NaN | NaN | NaN | 1991 |
| 2 | AFG-BAM | AFG | Afghanistan | Bamian | Afghanistan | 1992-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 789914.4805 | 0.041157 | 789.91448 | 1992 |
| 3 | AFG-BAM | AFG | Afghanistan | Bamian | Afghanistan | 1993-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | NaN | NaN | NaN | 1993 |
| 4 | AFG-BAM | AFG | Afghanistan | Bamian | Afghanistan | 1994-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | NaN | NaN | NaN | 1994 |
5 rows × 541 columns
The replication instructions tell us which variables were used for this regression analysis. Let's create a list object of those names and subset the DataFrame to them.
var_list = ['attack','rtb','rtb.other','attack_1','attack_neighbors_sum', 'log_pop_1',
'gcp_ppp_1', 'STD', 'SQKM_ADMIN', 'log_bdist2', 'log_capdist', 'idpb',
'Country', 'year']
df = df[var_list]
df.head()
|   | attack | rtb | rtb.other | attack_1 | attack_neighbors_sum | log_pop_1 | gcp_ppp_1 | STD | SQKM_ADMIN | log_bdist2 | log_capdist | idpb | Country | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 0 | NaN | 2 | NaN | NaN | 221.432 | 19192.551 | 5.458124 | 5.258359 | 0 | Afghanistan | 1990 |
| 1 | 0.0 | 0 | 1 | 0.0 | 2 | 12.590328 | 0.234417 | 221.432 | 19192.551 | 5.458124 | 5.258359 | 0 | Afghanistan | 1991 |
| 2 | 0.0 | 0 | 1 | 0.0 | 0 | 12.666320 | 0.234417 | 221.432 | 19192.551 | 5.458124 | 5.258359 | 0 | Afghanistan | 1992 |
| 3 | 0.0 | 0 | 0 | 0.0 | 1 | 12.736942 | 0.234417 | 221.432 | 19192.551 | 5.458124 | 5.258359 | 0 | Afghanistan | 1993 |
| 4 | 0.0 | 0 | 1 | 0.0 | 20 | 12.802905 | 0.234417 | 221.432 | 19192.551 | 5.458124 | 5.258359 | 0 | Afghanistan | 1994 |
Before we move on to the main analysis, let's discuss some useful commands.
# This gives us a set of statistics across all variables in the dataset
# You can specify a few by indexing the df with those variables then placing describe after
# example: df['attack'].describe()
df.describe()
|  | attack | rtb | rtb.other | attack_1 | attack_neighbors_sum | log_pop_1 | gcp_ppp_1 | STD | SQKM_ADMIN | log_bdist2 | log_capdist | idpb | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 73215.000000 | 73527.000000 | 73527.000000 | 70679.000000 | 73527.000000 | 65819.000000 | 65324.000000 | 73527.000000 | 7.352700e+04 | 73179.000000 | 73179.000000 | 73527.000000 | 73527.000000 |
| mean | 2.247067 | 0.037184 | 0.247474 | 2.161689 | 6.220246 | 13.375119 | 19.890419 | 103.564727 | 5.203192e+04 | 4.664811 | 5.588554 | 0.003618 | 2004.002652 |
| std | 20.342200 | 0.189213 | 0.431547 | 19.792058 | 45.933583 | 1.678239 | 74.463542 | 86.645497 | 1.676880e+05 | 1.120037 | 1.084331 | 0.060039 | 8.365720 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.057000e+00 | 0.716082 | 1.019839 | 0.000000 | 1990.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 12.477365 | 0.701478 | 25.171700 | 3.943059e+03 | 3.876278 | 4.891450 | 0.000000 | 1997.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.450893 | 2.971663 | 84.933500 | 1.131859e+04 | 4.671915 | 5.645031 | 0.000000 | 2004.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.376565 | 9.336956 | 162.612000 | 3.908533e+04 | 5.391149 | 6.280998 | 0.000000 | 2011.000000 |
| max | 1986.000000 | 1.000000 | 1.000000 | 1986.000000 | 3472.000000 | 19.282600 | 1973.801000 | 459.845000 | 3.528428e+06 | 7.772757 | 8.948518 | 1.000000 | 2018.000000 |
We know that the Country variable is categorical, so we can tabulate it to see which countries are in our dataset.
df['Country'].value_counts()
Russia           2351
Thailand         2088
Turkey           1943
Vietnam          1537
United States    1508
                 ...
Macedonia          26
South Sudan        24
Nauru              20
Timor Leste        17
Montenegro         13
Name: Country, Length: 185, dtype: int64
We can also run correlations on variables we are interested in and style the output. Here, we will run a correlation on a few numeric variables in the DataFrame.
# Storing the correlation in an object
corr = df[['attack','rtb','log_pop_1',
'gcp_ppp_1']].corr(numeric_only=True) # this makes sure that it keeps numeric only
# Aesthetic mapping
corr.style.background_gradient(cmap='RdBu')
|  | attack | rtb | log_pop_1 | gcp_ppp_1 |
|---|---|---|---|---|
| attack | 1.000000 | 0.040611 | 0.078641 | -0.010595 |
| rtb | 0.040611 | 1.000000 | 0.081201 | -0.032658 |
| log_pop_1 | 0.078641 | 0.081201 | 1.000000 | 0.396185 |
| gcp_ppp_1 | -0.010595 | -0.032658 | 0.396185 | 1.000000 |
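If you prefer a plot to a styled table, seaborn can render the same matrix as a heatmap; a minimal sketch (assuming the corr object created above):

# Render the correlation matrix as an annotated heatmap
sns.heatmap(corr, annot=True, cmap='RdBu', vmin=-1, vmax=1)
plt.show()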
Now, back to the main analysis. Let's run the regression.
model = smf.ols('attack ~ rtb + rtb.other + attack_1 + attack_neighbors_sum + log_pop_1 + gcp_ppp_1 + STD + SQKM_ADMIN + log_bdist2 + log_capdist + idpb + C(Country) + C(year)', data=df).fit()
print(model.summary())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/compat.py:36, in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
     35 try:
---> 36     return f(*args, **kwargs)
     37 except Exception as e:

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/eval.py:169, in EvalEnvironment.eval(self, expr, source_name, inner_namespace)
    168 code = compile(expr, source_name, "eval", self.flags, False)
--> 169 return eval(code, {}, VarLookupDict([inner_namespace]
    170                                     + self._namespaces))

File <string>:1

File ~/Library/r-miniconda/lib/python3.10/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'other'

The above exception was the direct cause of the following exception:

PatsyError                                Traceback (most recent call last)
Cell In[24], line 1
----> 1 model = smf.ols('attack ~ rtb + rtb.other + attack_1 + attack_neighbors_sum + log_pop_1 + gcp_ppp_1 + STD + SQKM_ADMIN + log_bdist2 + log_capdist + idpb + C(Country) + C(year)', data=df).fit()
      3 print(model.summary())

File ~/Library/r-miniconda/lib/python3.10/site-packages/statsmodels/base/model.py:200, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    197 if missing == 'none':  # with patsy it's drop or raise. let's raise.
    198     missing = 'raise'
--> 200 tmp = handle_formula_data(data, None, formula, depth=eval_env,
    201                           missing=missing)
    202 ((endog, exog), missing_idx, design_info) = tmp
    203 max_endog = cls._formula_max_endog

File ~/Library/r-miniconda/lib/python3.10/site-packages/statsmodels/formula/formulatools.py:63, in handle_formula_data(Y, X, formula, depth, missing)
     61 else:
     62     if data_util._is_using_pandas(Y, None):
---> 63         result = dmatrices(formula, Y, depth, return_type='dataframe',
     64                            NA_action=na_action)
     65     else:
     66         result = dmatrices(formula, Y, depth, return_type='dataframe',
     67                            NA_action=na_action)

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/highlevel.py:309, in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    299 """Construct two design matrices given a formula_like and data.
    300
    301 This function is identical to :func:`dmatrix`, except that it requires
   (...)
    306 See :func:`dmatrix` for details.
    307 """
    308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
--> 309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
    310                                   NA_action, return_type)
    311 if lhs.shape[1] == 0:
    312     raise PatsyError("model is missing required outcome variables")

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/highlevel.py:164, in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    162 def data_iter_maker():
    163     return iter([data])
--> 164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
    165                                   NA_action)
    166 if design_infos is not None:
    167     return build_design_matrices(design_infos, data,
    168                                  NA_action=NA_action,
    169                                  return_type=return_type)

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/highlevel.py:66, in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     64 if isinstance(formula_like, ModelDesc):
     65     assert isinstance(eval_env, EvalEnvironment)
---> 66     return design_matrix_builders([formula_like.lhs_termlist,
     67                                    formula_like.rhs_termlist],
     68                                   data_iter_maker,
     69                                   eval_env,
     70                                   NA_action)
     71 else:
     72     return None

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/build.py:693, in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    689 factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
    690 # Now all the factors have working eval methods, so we can evaluate them
    691 # on some data to find out what type of data they return.
    692 (num_column_counts,
--> 693  cat_levels_contrasts) = _examine_factor_types(all_factors,
    694                                                factor_states,
    695                                                data_iter_maker,
    696                                                NA_action)
    697 # Now we need the factor infos, which encapsulate the knowledge of
    698 # how to turn any given factor into a chunk of data:
    699 factor_infos = {}

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/build.py:443, in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    441 for data in data_iter_maker():
    442     for factor in list(examine_needed):
--> 443         value = factor.eval(factor_states[factor], data)
    444         if factor in cat_sniffers or guess_categorical(value):
    445             if factor not in cat_sniffers:

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/eval.py:568, in EvalFactor.eval(self, memorize_state, data)
    567 def eval(self, memorize_state, data):
--> 568     return self._eval(memorize_state["eval_code"],
    569                       memorize_state,
    570                       data)

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/eval.py:551, in EvalFactor._eval(self, code, memorize_state, data)
    549 def _eval(self, code, memorize_state, data):
    550     inner_namespace = VarLookupDict([data, memorize_state["transforms"]])
--> 551     return call_and_wrap_exc("Error evaluating factor",
    552                              self,
    553                              memorize_state["eval_env"].eval,
    554                              code,
    555                              inner_namespace=inner_namespace)

File ~/Library/r-miniconda/lib/python3.10/site-packages/patsy/compat.py:43, in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
     39     new_exc = PatsyError("%s: %s: %s"
     40                          % (msg, e.__class__.__name__, e),
     41                          origin)
     42     # Use 'exec' to hide this syntax from the Python 2 parser:
---> 43     exec("raise new_exc from e")
     44 else:
     45     # In python 2, we just let the original exception escape -- better
     46     # than destroying the traceback. But if it's a PatsyError, we can
     47     # at least set the origin properly.
     48     if isinstance(e, PatsyError):

File <string>:1

PatsyError: Error evaluating factor: AttributeError: 'Series' object has no attribute 'other'
    attack ~ rtb + rtb.other + attack_1 + attack_neighbors_sum + log_pop_1 + gcp_ppp_1 + STD + SQKM_ADMIN + log_bdist2 + log_capdist + idpb + C(Country) + C(year)
                   ^^^^^^^^^
What happened? Patsy, the library statsmodels uses to parse formulas, evaluates each term as Python code, so it reads 'rtb.other' as a request for an attribute named 'other' on the variable 'rtb', which does not exist. Let's rename the variable so it contains no period and rerun the regression.
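As an aside (an alternative to renaming, not part of the original replication steps), patsy's Q() function lets you quote a column name that is not a valid Python identifier:

# Quote the awkward column name inside the formula instead of renaming it
model = smf.ols("attack ~ rtb + Q('rtb.other') + attack_1 + attack_neighbors_sum + log_pop_1 + gcp_ppp_1 + STD + SQKM_ADMIN + log_bdist2 + log_capdist + idpb + C(Country) + C(year)", data=df).fit()

Renaming the column, as below, keeps all later code simple: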
# Rename variable name
df.rename(columns={'rtb.other':'rtb_other'}, inplace=True)
# Rerun regression
model = smf.ols('attack ~ rtb + rtb_other + attack_1 + attack_neighbors_sum + log_pop_1 + gcp_ppp_1 + STD + SQKM_ADMIN + log_bdist2 + log_capdist + idpb + C(Country) + C(year)', data=df).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 attack   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     481.4
Date:                Tue, 18 Apr 2023   Prob (F-statistic):               0.00
Time:                        12:44:33   Log-Likelihood:            -2.5339e+05
No. Observations:               65012   AIC:                         5.072e+05
Df Residuals:                   64799   BIC:                         5.091e+05
Df Model:                         212
Covariance Type:            nonrobust
================================================================================================================
                                                   coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Intercept                                        6.9926      1.033      6.769      0.000       4.968       9.017
C(Country)[T.Albania]                          -10.4937      0.767    -13.690      0.000     -11.996      -8.991
C(Country)[T.Algeria]                           -9.9632      0.578    -17.249      0.000     -11.095      -8.831
C(Country)[T.Andorra]                         4.048e-11   2.93e-11      1.381      0.167    -1.7e-11    9.79e-11
C(Country)[T.Angola]                           -10.0907      0.709    -14.238      0.000     -11.480      -8.702
C(Country)[T.Argentina]                        -10.5099      0.665    -15.802      0.000     -11.813      -9.206
C(Country)[T.Armenia]                          -11.1490      2.305     -4.838      0.000     -15.666      -6.632
C(Country)[T.Australia]                         -9.7005      1.056     -9.189      0.000     -11.770      -7.631
C(Country)[T.Austria]                          -10.8752      0.881    -12.347      0.000     -12.601      -9.149
C(Country)[T.Azerbaijan]                        -9.1036      1.675     -5.434      0.000     -12.387      -5.820
C(Country)[T.Bahrain]                           -9.8513      1.240     -7.942      0.000     -12.282      -7.420
C(Country)[T.Bangladesh]                       -11.6946      1.130    -10.348      0.000     -13.910      -9.480
C(Country)[T.Barbados]                          -9.9299      2.308     -4.302      0.000     -14.454      -5.406
C(Country)[T.Belarus]                          -10.5137      1.039    -10.119      0.000     -12.550      -8.477
C(Country)[T.Belgium]                          -10.5311      0.905    -11.642      0.000     -12.304      -8.758
C(Country)[T.Belize]                            -9.6549      1.057     -9.137      0.000     -11.726      -7.584
C(Country)[T.Benin]                            -10.8059      1.041    -10.385      0.000     -12.845      -8.766
C(Country)[T.Bhutan]                           -10.7179      0.806    -13.303      0.000     -12.297      -9.139
C(Country)[T.Bolivia]                          -10.4967      0.879    -11.941      0.000     -12.220      -8.774
C(Country)[T.Bosnia and Herzegovina]            10.4473      2.345      4.456      0.000       5.852      15.043
C(Country)[T.Botswana]                         -10.3172      0.870    -11.863      0.000     -12.022      -8.613
C(Country)[T.Bouvet Island]                  -1.692e-12   7.56e-11     -0.022      0.982     -1.5e-10    1.46e-10
C(Country)[T.Brazil]                           -10.1727      0.670    -15.172      0.000     -11.487      -8.858
C(Country)[T.Brunei]                           -10.5257      2.315     -4.547      0.000     -15.063      -5.988
C(Country)[T.Bulgaria]                         -10.7891      0.880    -12.262      0.000     -12.514      -9.065
C(Country)[T.Burkina Faso]                     -10.2120      0.630    -16.201      0.000     -11.447      -8.977
C(Country)[T.Burundi]                          -11.4974      0.889    -12.931      0.000     -13.240      -9.755
C(Country)[T.Cambodia]                         -10.1776      0.705    -14.436      0.000     -11.559      -8.796
C(Country)[T.Cameroon]                         -10.1522      0.851    -11.923      0.000     -11.821      -8.483
C(Country)[T.Canada]                            -9.9247      0.855    -11.602      0.000     -11.601      -8.248
C(Country)[T.Central African Republic]         -10.0012      0.751    -13.319      0.000     -11.473      -8.529
C(Country)[T.Chad]                             -11.1758      0.765    -14.602      0.000     -12.676      -9.676
C(Country)[T.Chile]                            -10.8279      0.772    -14.017      0.000     -12.342      -9.314
C(Country)[T.China]                            -11.8951      0.662    -17.962      0.000     -13.193     -10.597
C(Country)[T.Colombia]                          -9.2547      0.602    -15.379      0.000     -10.434      -8.075
C(Country)[T.Comoros]                           -9.9175      2.316     -4.282      0.000     -14.457      -5.378
C(Country)[T.Congo]                            -10.3818      0.887    -11.699      0.000     -12.121      -8.643
C(Country)[T.Congo, DRC]                        -6.9989      0.915     -7.653      0.000      -8.791      -5.206
C(Country)[T.Costa Rica]                       -10.5717      0.970    -10.904      0.000     -12.472      -8.671
C(Country)[T.Cote d'Ivoire]                    -10.1150      0.578    -17.488      0.000     -11.249      -8.981
C(Country)[T.Croatia]                           -6.4706      2.303     -2.809      0.005     -10.985      -1.956
C(Country)[T.Cuba]                             -10.2297      0.746    -13.716      0.000     -11.691      -8.768
C(Country)[T.Cyprus]                           -10.0311      1.040     -9.649      0.000     -12.069      -7.994
C(Country)[T.Czech Republic]                   -10.9331      1.003    -10.900      0.000     -12.899      -8.967
C(Country)[T.Czechoslovakia]                   -10.7273      2.717     -3.949      0.000     -16.052      -5.403
C(Country)[T.Denmark]                          -10.0883      0.788    -12.804      0.000     -11.633      -8.544
C(Country)[T.Djibouti]                         -10.9075      1.124     -9.705      0.000     -13.110      -8.705
C(Country)[T.Dominica]                       -3.164e-12    1.4e-10     -0.023      0.982   -2.77e-10    2.71e-10
C(Country)[T.Dominican Republic]               -10.1677      0.683    -14.897      0.000     -11.505      -8.830
C(Country)[T.Ecuador]                          -10.3548      0.677    -15.296      0.000     -11.682      -9.028
C(Country)[T.Egypt]                            -10.0910      0.658    -15.336      0.000     -11.381      -8.801
C(Country)[T.El Salvador]                      -10.6366      0.889    -11.960      0.000     -12.380      -8.893
C(Country)[T.Equatorial Guinea]                 -9.9420      1.120     -8.880      0.000     -12.136      -7.748
C(Country)[T.Eritrea]                          -11.1229      0.952    -11.679      0.000     -12.990      -9.256
C(Country)[T.Estonia]                          -10.4592      2.308     -4.532      0.000     -14.982      -5.936
C(Country)[T.Ethiopia]                         -10.9030      0.782    -13.947      0.000     -12.435      -9.371
C(Country)[T.Finland]                          -10.0756      0.807    -12.491      0.000     -11.657      -8.495
C(Country)[T.France]                           -10.5699      0.668    -15.813      0.000     -11.880      -9.260
C(Country)[T.French Guiana]                     -9.3243      1.679     -5.553      0.000     -12.616      -6.033
C(Country)[T.Gabon]                            -10.1867      0.902    -11.292      0.000     -11.955      -8.419
C(Country)[T.Georgia]                          -11.8831      1.375     -8.639      0.000     -14.579      -9.187
C(Country)[T.Germany]                          -10.6643      0.796    -13.402      0.000     -12.224      -9.105
C(Country)[T.Ghana]                            -10.9151      0.864    -12.635      0.000     -12.608      -9.222
C(Country)[T.Glorioso Islands]               -1.415e-12   6.24e-11     -0.023      0.982   -1.24e-10    1.21e-10
C(Country)[T.Greece]                           -10.6476      0.876    -12.156      0.000     -12.364      -8.931
C(Country)[T.Grenada]                           -9.8224      2.310     -4.252      0.000     -14.350      -5.294
C(Country)[T.Guatemala]                        -10.7083      0.700    -15.295      0.000     -12.081      -9.336
C(Country)[T.Guinea]                           -10.7869      0.636    -16.948      0.000     -12.034      -9.539
C(Country)[T.Guinea-Bissau]                     -9.9812      0.985    -10.130      0.000     -11.912      -8.050
C(Country)[T.Guyana]                            -9.6131      0.870    -11.052      0.000     -11.318      -7.908
C(Country)[T.Haiti]                            -10.6446      0.917    -11.610      0.000     -12.442      -8.848
C(Country)[T.Honduras]                         -10.4331      0.712    -14.658      0.000     -11.828      -9.038
C(Country)[T.Hungary]                          -10.4908      0.715    -14.667      0.000     -11.893      -9.089
C(Country)[T.Iceland]                           -9.1603      1.003     -9.135      0.000     -11.126      -7.195
C(Country)[T.India]                             -6.2476      0.630     -9.912      0.000      -7.483      -5.012
C(Country)[T.Indonesia]                        -10.0617      0.657    -15.316      0.000     -11.349      -8.774
C(Country)[T.Iran]                             -11.4329      0.679    -16.826      0.000     -12.765     -10.101
C(Country)[T.Iraq]                              -8.0762      0.716    -11.275      0.000      -9.480      -6.672
C(Country)[T.Ireland]                          -10.3339      1.383     -7.470      0.000     -13.045      -7.622
C(Country)[T.Israel]                            -7.5541      0.967     -7.814      0.000      -9.449      -5.659
C(Country)[T.Italy]                            -10.7665      0.677    -15.897      0.000     -12.094      -9.439
C(Country)[T.Jamaica]                           -9.9754      0.886    -11.261      0.000     -11.712      -8.239
C(Country)[T.Japan]                            -10.5082      0.922    -11.394      0.000     -12.316      -8.701
C(Country)[T.Jordan]                           -12.3579      1.026    -12.044      0.000     -14.369     -10.347
C(Country)[T.Juan De Nova Island]            -4.303e-13   1.88e-11     -0.023      0.982   -3.74e-11    3.65e-11
C(Country)[T.Kazakhstan]                       -10.3569      0.706    -14.677      0.000     -11.740      -8.974
C(Country)[T.Kenya]                            -10.4239      0.935    -11.148      0.000     -12.257      -8.591
C(Country)[T.Kuwait]                           -10.3159      1.234     -8.362      0.000     -12.734      -7.898
C(Country)[T.Kyrgyzstan]                       -11.0551      1.108     -9.977      0.000     -13.227      -8.883
C(Country)[T.Laos]                             -10.5817      0.725    -14.604      0.000     -12.002      -9.161
C(Country)[T.Latvia]                           -10.5426      2.308     -4.567      0.000     -15.067      -6.018
C(Country)[T.Lebanon]                          -10.7674      1.112     -9.687      0.000     -12.946      -8.589
C(Country)[T.Lesotho]                          -11.0968      0.974    -11.391      0.000     -13.006      -9.187
C(Country)[T.Liberia]                          -10.5896      0.793    -13.351      0.000     -12.144      -9.035
C(Country)[T.Libya]                             -9.5059      0.657    -14.471      0.000     -10.793      -8.218
C(Country)[T.Liechtenstein]                  -9.238e-13   3.95e-11     -0.023      0.981   -7.83e-11    7.65e-11
C(Country)[T.Lithuania]                        -10.7198      2.310     -4.641      0.000     -15.247      -6.193
C(Country)[T.Luxembourg]                       -10.2931      2.319     -4.439      0.000     -14.837      -5.749
C(Country)[T.Macedonia]                        -11.1880      2.387     -4.688      0.000     -15.866      -6.510
C(Country)[T.Madagascar]                       -10.3781      1.041     -9.971      0.000     -12.418      -8.338
C(Country)[T.Malawi]                           -11.4192      1.394     -8.191      0.000     -14.152      -8.687
C(Country)[T.Malaysia]                         -10.9470      0.772    -14.182      0.000     -12.460      -9.434
C(Country)[T.Maldives]                          -8.9294      0.758    -11.783      0.000     -10.415      -7.444
C(Country)[T.Mali]                              -8.6893      1.036     -8.384      0.000     -10.721      -6.658
C(Country)[T.Malta]                             -9.9947      2.309     -4.328      0.000     -14.521      -5.468
C(Country)[T.Mauritania]                       -10.0691      0.806    -12.491      0.000     -11.649      -8.489
C(Country)[T.Mexico]                            -9.5778      0.629    -15.236      0.000     -10.810      -8.346
C(Country)[T.Moldova]                          -10.4543      1.669     -6.265      0.000     -13.725      -7.183
C(Country)[T.Monaco]                         -1.848e-12   8.16e-11     -0.023      0.982   -1.62e-10    1.58e-10
C(Country)[T.Mongolia]                         -10.0094      0.708    -14.135      0.000     -11.397      -8.621
C(Country)[T.Montenegro]                       -11.0597      3.346     -3.306      0.001     -17.617      -4.502
C(Country)[T.Morocco]                          -10.7342      0.881    -12.188      0.000     -12.460      -9.008
C(Country)[T.Mozambique]                       -10.8886      0.855    -12.729      0.000     -12.565      -9.212
C(Country)[T.Myanmar]                           -9.6915      0.752    -12.895      0.000     -11.165      -8.218
C(Country)[T.Namibia]                           -9.9167      0.671    -14.772      0.000     -11.232      -8.601
C(Country)[T.Nauru]                           1.474e-12   6.34e-11      0.023      0.981   -1.23e-10    1.26e-10
C(Country)[T.Nepal]                             -7.8232      0.769    -10.178      0.000      -9.330      -6.317
C(Country)[T.Netherlands]                      -10.3422      0.815    -12.694      0.000     -11.939      -8.745
C(Country)[T.New Zealand]                       -9.7107      0.787    -12.335      0.000     -11.254      -8.168
C(Country)[T.Nicaragua]                        -10.3620      0.778    -13.317      0.000     -11.887      -8.837
C(Country)[T.Niger]                            -10.8769      0.980    -11.099      0.000     -12.798      -8.956
C(Country)[T.Nigeria]                           -9.4646      0.641    -14.760      0.000     -10.721      -8.208
C(Country)[T.North Korea]                      -10.7603      0.818    -13.154      0.000     -12.364      -9.157
C(Country)[T.Norway]                           -10.2392      0.700    -14.630      0.000     -11.611      -8.867
C(Country)[T.Oman]                             -10.2930      0.918    -11.215      0.000     -12.092      -8.494
C(Country)[T.Pakistan]                          -1.7774      1.042     -1.706      0.088      -3.819       0.264
C(Country)[T.Panama]                           -10.0690      0.849    -11.855      0.000     -11.734      -8.404
C(Country)[T.Papua New Guinea]                 -10.4955      0.699    -15.026      0.000     -11.865      -9.126
C(Country)[T.Paraguay]                         -10.0092      0.706    -14.186      0.000     -11.392      -8.626
C(Country)[T.Peru]                             -10.5480      0.640    -16.474      0.000     -11.803      -9.293
C(Country)[T.Philippines]                       -8.1711      0.768    -10.644      0.000      -9.676      -6.666
C(Country)[T.Poland]                           -10.3304      0.574    -18.009      0.000     -11.455      -9.206
C(Country)[T.Portugal]                         -10.3147      0.681    -15.141      0.000     -11.650      -8.979
C(Country)[T.Puerto Rico]                     5.353e-13   2.32e-11      0.023      0.982     -4.5e-11    4.61e-11
C(Country)[T.Qatar]                             -9.3391      1.008     -9.267      0.000     -11.314      -7.364
C(Country)[T.Romania]                          -10.5278      0.579    -18.175      0.000     -11.663      -9.392
C(Country)[T.Russia]                            -9.9070      0.548    -18.082      0.000     -10.981      -8.833
C(Country)[T.Rwanda]                           -10.5828      0.938    -11.284      0.000     -12.421      -8.745
C(Country)[T.San Marino]                     -1.417e-13   6.68e-12     -0.021      0.983   -1.32e-11    1.29e-11
C(Country)[T.Sao Tome and Principe]             -9.6305      1.672     -5.761      0.000     -12.907      -6.354
C(Country)[T.Saudi Arabia]                     -10.6553      0.763    -13.965      0.000     -12.151      -9.160
C(Country)[T.Senegal]                          -10.1515      0.859    -11.820      0.000     -11.835      -8.468
C(Country)[T.Sierra Leone]                      -6.7315      1.216     -5.538      0.000      -9.114      -4.349
C(Country)[T.Singapore]                        -10.7057      2.319     -4.617      0.000     -15.250      -6.161
C(Country)[T.Slovakia]                         -10.9679      1.431     -7.667      0.000     -13.772      -8.164
C(Country)[T.Slovenia]                         -11.6506      2.346     -4.965      0.000     -16.250      -7.051
C(Country)[T.Somalia]                           -6.9920      0.705     -9.915      0.000      -8.374      -5.610
C(Country)[T.South Africa]                      -9.6880      0.885    -10.951      0.000     -11.422      -7.954
C(Country)[T.South Korea]                      -10.6570      0.849    -12.554      0.000     -12.321      -8.993
C(Country)[T.South Sudan]                       -1.6418      2.522     -0.651      0.515      -6.585       3.302
C(Country)[T.Spain]                            -10.5423      0.688    -15.331      0.000     -11.890      -9.194
C(Country)[T.Sri Lanka]                         -6.8662      0.878     -7.819      0.000      -8.587      -5.145
C(Country)[T.St. Lucia]                        -10.1532      2.307     -4.402      0.000     -14.674      -5.632
C(Country)[T.St. Vincent and the Grenadines]    -9.9546      2.317     -4.296      0.000     -14.496      -5.413
C(Country)[T.Sudan]                             -7.6688      0.942     -8.141      0.000      -9.515      -5.822
C(Country)[T.Suriname]                          -9.2158      0.915    -10.067      0.000     -11.010      -7.421
C(Country)[T.Swaziland]                        -10.2924      1.230     -8.369      0.000     -12.703      -7.882
C(Country)[T.Sweden]                           -10.0701      0.660    -15.268      0.000     -11.363      -8.777
C(Country)[T.Switzerland]                      -10.7819      0.825    -13.065      0.000     -12.399      -9.164
C(Country)[T.Syria]                            -18.4160      2.798     -6.582      0.000     -23.900     -12.932
C(Country)[T.Taiwan]                           -10.4879      2.311     -4.539      0.000     -15.017      -5.959
C(Country)[T.Tajikistan]                       -10.4326      1.218     -8.564      0.000     -12.820      -8.045
C(Country)[T.Tanzania]                         -10.8448      0.663    -16.347      0.000     -12.145      -9.545
C(Country)[T.Thailand]                         -10.5052      0.552    -19.042      0.000     -11.587      -9.424
C(Country)[T.The Gambia]                       -10.5757      1.400     -7.554      0.000     -13.320      -7.832
C(Country)[T.Timor Leste]                      -10.6574      2.936     -3.630      0.000     -16.412      -4.903
C(Country)[T.Togo]                             -10.3568      0.811    -12.767      0.000     -11.947      -8.767
C(Country)[T.Trinidad and Tobago]              -10.5374      2.306     -4.569      0.000     -15.058      -6.017
C(Country)[T.Tunisia]                          -10.2820      0.664    -15.476      0.000     -11.584      -8.980
C(Country)[T.Turkey]                           -10.0720      0.521    -19.332      0.000     -11.093      -9.051
C(Country)[T.Turkmenistan]                     -10.8098      1.119     -9.661      0.000     -13.003      -8.617
C(Country)[T.USSR]                             -10.1749      1.400     -7.270      0.000     -12.918      -7.432
C(Country)[T.Uganda]                           -10.5963      0.866    -12.234      0.000     -12.294      -8.899
C(Country)[T.Ukraine]                          -10.0109      0.660    -15.164      0.000     -11.305      -8.717
C(Country)[T.United Arab Emirates]             -10.3401      1.032    -10.018      0.000     -12.363      -8.317
C(Country)[T.United Kingdom]                   -10.1519      1.144     -8.875      0.000     -12.394      -7.910
C(Country)[T.United States]                    -10.3425      0.609    -16.983      0.000     -11.536      -9.149
C(Country)[T.Uruguay]                           -9.8545      0.707    -13.939      0.000     -11.240      -8.469
C(Country)[T.Uzbekistan]                       -10.8402      0.778    -13.931      0.000     -12.365      -9.315
C(Country)[T.Venezuela]                        -10.0922      0.649    -15.559      0.000     -11.364      -8.821
C(Country)[T.Vietnam]                          -10.7552      0.552    -19.468      0.000     -11.838      -9.672
C(Country)[T.Yemen]                             -9.3241      0.726    -12.850      0.000     -10.746      -7.902
C(Country)[T.Yugoslavia]                        -7.3932      1.801     -4.105      0.000     -10.923      -3.863
C(Country)[T.Zambia]                           -10.9877      0.904    -12.159      0.000     -12.759      -9.216
C(Country)[T.Zimbabwe]                         -10.8388      0.923    -11.748      0.000     -12.647      -9.030
C(year)[T.1991]                                  0.0198      0.251      0.079      0.937      -0.473       0.512
C(year)[T.1992]                                  0.3692      0.246      1.500      0.134      -0.113       0.852
C(year)[T.1993]                                  0.3247      0.246      1.320      0.187      -0.158       0.807
C(year)[T.1994]                                  0.1986      0.246      0.808      0.419      -0.283       0.680
C(year)[T.1995]                                  0.0331      0.246      0.135      0.893      -0.449       0.515
C(year)[T.1996]                                 -0.2885      0.246     -1.173      0.241      -0.770       0.194
C(year)[T.1997]                                 -0.1242      0.246     -0.505      0.614      -0.606       0.358
C(year)[T.1998]                                  0.3067      0.246      1.247      0.213      -0.176       0.789
C(year)[T.1999]                                  0.1814      0.246      0.737      0.461      -0.301       0.664
C(year)[T.2000]                                  0.2017      0.246      0.820      0.412      -0.281       0.684
C(year)[T.2001]                                 -0.3461      0.246     -1.406      0.160      -0.829       0.136
C(year)[T.2002]                                  1.0385      0.246      4.218      0.000       0.556       1.521
C(year)[T.2003]                                 -0.4090      0.246     -1.660      0.097      -0.892       0.074
C(year)[T.2004]                                  0.8883      0.246      3.606      0.000       0.406       1.371
C(year)[T.2005]                                 -0.0438      0.246     -0.178      0.859      -0.527       0.439
C(year)[T.2006]                                  0.1819      0.246      0.738      0.461      -0.301       0.665
C(year)[T.2007]                                  0.3107      0.246      1.261      0.207      -0.172       0.794
C(year)[T.2008]                                  0.4659      0.247      1.889      0.059      -0.018       0.949
C(year)[T.2009]                                  0.4861      0.247      1.971      0.049       0.003       0.970
C(year)[T.2010]                                  0.1019      0.247      0.413      0.679      -0.382       0.585
C(year)[T.2011]                                  0.4702      0.247      1.904      0.057      -0.014       0.954
C(year)[T.2012]                                  0.2446      0.247      0.989      0.322      -0.240       0.729
C(year)[T.2013]                                 -0.0133      0.247     -0.054      0.957      -0.498       0.471
C(year)[T.2014]                                  0.8414      0.247      3.402      0.001       0.357       1.326
C(year)[T.2015]                                  0.3698      0.248      1.493      0.135      -0.116       0.855
C(year)[T.2016]                                  0.0180      0.248      0.072      0.942      -0.468       0.504
C(year)[T.2017]                                  0.7471      0.248      3.015      0.003       0.261       1.233
C(year)[T.2018]                                  0.4176      0.248      1.686      0.092      -0.068       0.903
rtb                                              0.1675      0.271      0.617      0.537      -0.364       0.699
rtb_other                                        0.3748      0.241      1.557      0.119      -0.097       0.847
attack_1                                         0.6728      0.003    251.298      0.000       0.668       0.678
attack_neighbors_sum                             0.0374      0.001     31.558      0.000       0.035       0.040
log_pop_1                                        0.2454      0.053      4.668      0.000       0.142       0.348
gcp_ppp_1                                       -0.0014      0.001     -1.679      0.093      -0.003       0.000
STD                                              0.0020      0.001      2.359      0.018       0.000       0.004
SQKM_ADMIN                                        -4e-07   3.47e-07     -1.151      0.250   -1.08e-06    2.81e-07
log_bdist2                                      -0.1588      0.071     -2.244      0.025      -0.298      -0.020
log_capdist                                      0.0855      0.070      1.218      0.223      -0.052       0.223
idpb                                             2.9011      0.784      3.701      0.000       1.365       4.438
==============================================================================
Omnibus:                   114979.490   Durbin-Watson:                   2.124
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       5777146419.606
Skew:                          11.262   Prob(JB):                         0.00
Kurtosis:                    1463.205   Cond. No.                     2.19e+21
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.58e-28. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.