Life Expectancy and Schooling
A Simple Linear Regression Example Using the Kaggle Life Expectancy Data
1. Question
Does life expectancy increase with schooling in the Kaggle life expectancy dataset?
Complete a full simple linear regression analysis using life expectancy as the response and schooling as the predictor. Your job is not only to fit the model, but also to identify and justify the treatment of problematic observations.
You should do the following:
- Prepare the data by selecting the variables needed for the analysis and removing rows with missing values.
- Fit the initial simple linear regression model.
- Plot the data and show the fitted regression line.
- Check the initial model diagnostics graphically.
- Identify the issue in the dataset.
- Provide evidence for this issue as a table.
- Explain why these cases appear problematic for this analysis.
- Remove the problematic cases and refit the model.
- Recheck the diagnostics graphically for the revised model.
- Interpret the final model clearly and state whether it is an improvement over the initial fit. Tabulate the important model-fit statistics in support of your conclusion.
- End with a short scientific write-up containing Methods, Results, and Discussion sections.
2. Marking Rubric
| Component | Marks |
|---|---|
| Data preparation | 3 |
| Initial model fitting and figure | 3 |
| Initial model diagnostics | 3 |
| Identifying and explaining the issue | 3 |
| Evidence for the issue presented as a table | 4 |
| Removing the problematic cases appropriately | 2 |
| Refitting the analysis | 3 |
| Rechecking the diagnostics | 3 |
| Interpreting the final model and comparing it with the initial model | 3 |
| Scientific write-up (Methods, Results, Discussion) | 3 |
Total: 30 marks
3. Data Preparation
I begin by reading the dataset and keeping only the variables needed for this analysis.
The working dataset contains 2,768 complete observations on life expectancy and schooling.
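The preparation step can be sketched roughly as follows. This is a minimal base-R illustration on a small made-up data frame, not the author's code: `raw_dat` and its `GDP` column are hypothetical, and the column names `life_expectancy` and `Schooling` are taken from the model output shown later in this document. Base R is used only so that the sketch has no package dependencies.

```r
# Toy stand-in for the raw Kaggle file (hypothetical values).
# The real data would be read from a local copy of the CSV instead.
raw_dat <- data.frame(
  Country         = c("A", "A", "B", "B", "C"),
  Year            = c(2000, 2001, 2000, 2001, 2000),
  life_expectancy = c(54.8, 55.3, NA, 72.1, 80.2),
  Schooling       = c(5.5, 5.9, 10.1, NA, 15.0),
  GDP             = c(114, 117, 1200, 1250, 40000),  # extra column, dropped below
  stringsAsFactors = FALSE
)

# Keep only the variables needed for the analysis
keep_cols <- c("Country", "Year", "life_expectancy", "Schooling")
life_dat  <- raw_dat[, keep_cols]

# Remove rows with a missing response or predictor
is_complete <- complete.cases(life_dat[, c("life_expectancy", "Schooling")])
life_dat    <- life_dat[is_complete, ]

# Number of complete observations kept
nrow(life_dat)
```

On the real dataset, the final `nrow()` call is the step that would report the number of complete observations.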
4. Diagnostics for the Initial Model
Regression assumptions should be checked graphically. I therefore inspect residuals versus fitted values, the normal Q-Q plot, the scale-location plot, and residuals versus leverage in Figure 1.
```r
# Four-panel diagnostic figure for the initial model.
# Assumes ggplot2 and patchwork are installed, and that full_aug holds the
# augmented data for the initial fit (.fitted, .resid, .std.resid, .hat).
library(ggplot2)
library(patchwork)

# Residuals vs fitted values
p_full_1 <- ggplot(full_aug, aes(x = .fitted, y = .resid)) +
  geom_point(shape = 1, alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted")

# Normal Q-Q plot of standardised residuals
p_full_2 <- ggplot(full_aug, aes(sample = .std.resid)) +
  stat_qq(shape = 1, alpha = 0.5) +
  stat_qq_line(colour = "red") +
  labs(title = "Normal Q-Q", x = "Theoretical quantiles", y = "Standardised residuals")

# Scale-location plot
p_full_3 <- ggplot(full_aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(shape = 1, alpha = 0.4) +
  labs(
    x = "Fitted values",
    y = "Sqrt(|Standardised residuals|)",
    title = "Scale-Location"
  )

# Standardised residuals vs leverage
p_full_4 <- ggplot(full_aug, aes(x = .hat, y = .std.resid)) +
  geom_point(shape = 1, alpha = 0.4) +
  labs(
    x = "Leverage",
    y = "Standardised residuals",
    title = "Residuals vs Leverage"
  )

# Arrange the four panels in a 2 x 2 grid (patchwork syntax)
(p_full_1 + p_full_2) / (p_full_3 + p_full_4)
```

The initial diagnostics in Figure 1 suggest that the model captures the broad trend reasonably well, but the extreme low-schooling observations deserve scrutiny. The residual plots are not catastrophic, but the zero-schooling countries form a cluster at the left edge that may be exerting undue influence on the fitted line.
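To make the quantities behind these plots concrete, the sketch below computes leverage and standardised residuals directly with base R on simulated data. Everything here is an illustrative assumption: the `toy` data, the three rows with `x` forced to zero, and the rule of thumb of flagging leverage above twice its mean are not taken from the original analysis. The column names mirror the `.hat` and `.std.resid` columns used in the plotting code.

```r
set.seed(1)

# Simulated predictor: three rows recorded as 0, the rest between 8 and 16
toy <- data.frame(x = c(0, 0, 0, runif(47, 8, 16)))
toy$y <- 45 + 2 * toy$x + rnorm(50, sd = 3)

# Hypothetical miscoded rows: high response despite x recorded as 0
toy$y[1:3] <- c(70, 72, 68)

fit <- lm(y ~ x, data = toy)

# Per-observation diagnostics, named like the columns used in the figures
diag_dat <- data.frame(
  .fitted    = fitted(fit),
  .resid     = residuals(fit),
  .hat       = hatvalues(fit),       # leverage
  .std.resid = rstandard(fit),       # internally standardised residuals
  .cooksd    = cooks.distance(fit)   # influence on the fitted coefficients
)

# Rule of thumb (an assumption here): flag leverage above 2p/n
p        <- length(coef(fit))
high_lev <- which(diag_dat$.hat > 2 * p / nrow(toy))
high_lev
```

Leverage depends only on how far a predictor value lies from the mean of the predictor, which is why rows sitting at zero stand out in a residuals-versus-leverage plot regardless of their response values.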
5. Countries With Zero Schooling
Before presenting the fitted model, I inspect the cases where schooling is recorded as zero. A national schooling value of exactly zero is implausible for several countries in the dataset, so it is worth checking whether these cases should be treated with caution.
Table 1 lists the countries represented by one or more rows where schooling is coded as zero.
| Country | Zero-schooling years |
|---|---|
| Antigua and Barbuda | 6 |
| Bosnia and Herzegovina | 1 |
| Equatorial Guinea | 1 |
| Micronesia (Federated States of) | 1 |
| Montenegro | 4 |
| South Sudan | 11 |
| Timor-Leste | 1 |
| Turkmenistan | 1 |
The detailed country-year rows are shown in Table 2.
| Country | Year | Schooling |
|---|---|---|
| Antigua and Barbuda | 2000 | 0 |
| Antigua and Barbuda | 2001 | 0 |
| Antigua and Barbuda | 2002 | 0 |
| Antigua and Barbuda | 2003 | 0 |
| Antigua and Barbuda | 2004 | 0 |
| Antigua and Barbuda | 2005 | 0 |
| Bosnia and Herzegovina | 2000 | 0 |
| Equatorial Guinea | 2000 | 0 |
| Micronesia (Federated States of) | 2000 | 0 |
| Montenegro | 2000 | 0 |
| Montenegro | 2001 | 0 |
| Montenegro | 2002 | 0 |
| Montenegro | 2003 | 0 |
| South Sudan | 2000 | 0 |
| South Sudan | 2001 | 0 |
| South Sudan | 2002 | 0 |
| South Sudan | 2003 | 0 |
| South Sudan | 2004 | 0 |
| South Sudan | 2005 | 0 |
| South Sudan | 2006 | 0 |
| South Sudan | 2007 | 0 |
| South Sudan | 2008 | 0 |
| South Sudan | 2009 | 0 |
| South Sudan | 2010 | 0 |
| Timor-Leste | 2000 | 0 |
| Turkmenistan | 2000 | 0 |
Because several of these zero-schooling values are unlikely to be literal population-level measurements, I first fit the model with all complete cases and then refit it after excluding these countries.
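Tables of this kind can be produced with a few lines of R. The following is a minimal sketch on a made-up data frame, assuming columns named `Country`, `Year`, and `Schooling` as in the tables above; the object names `zero_rows` and `zero_counts` are hypothetical.

```r
# Toy data (hypothetical): countries A and C have schooling coded as zero
life_dat <- data.frame(
  Country   = c("A", "A", "A", "B", "B", "C", "C"),
  Year      = c(2000, 2001, 2002, 2000, 2001, 2000, 2001),
  Schooling = c(0, 0, 9.1, 12.3, 12.5, 0, 7.4),
  stringsAsFactors = FALSE
)

# Country-year rows with schooling coded as zero (analogue of Table 2)
zero_rows <- life_dat[life_dat$Schooling == 0, c("Country", "Year", "Schooling")]

# Number of zero-schooling years per country (analogue of Table 1)
zero_counts <- as.data.frame(
  table(Country = zero_rows$Country),
  responseName = "zero_years",
  stringsAsFactors = FALSE
)

zero_rows
zero_counts
```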
6. Initial Model: All Complete Cases
I now fit the simple linear regression:
\[ \text{Life expectancy}_i = \beta_0 + \beta_1 \text{Schooling}_i + \varepsilon_i \]
In this model, \(\beta_1\) is the expected change in life expectancy associated with a one-unit increase in schooling.
```
Call:
lm(formula = life_expectancy ~ Schooling, data = life_dat)

Residuals:
     Min       1Q   Median       3Q      Max
-25.8986  -2.8210   0.6186   3.8186  30.4911

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.10889    0.43676     101   <2e-16 ***
Schooling    2.10345    0.03506      60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.172 on 2766 degrees of freedom
Multiple R-squared:  0.5655,    Adjusted R-squared:  0.5653
F-statistic:  3599 on 1 and 2766 DF,  p-value: < 2.2e-16
```
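As a worked illustration of how to read these coefficients, the sketch below plugs the reported estimates into the fitted line. The helper `predict_life()` is a hypothetical name introduced only for this example; the two coefficient values are copied from the summary output above.

```r
# Reported estimates from the initial model
b0 <- 44.10889   # intercept: expected life expectancy at schooling = 0
b1 <- 2.10345    # slope: change per one-unit increase in schooling

# Fitted line (hypothetical helper)
predict_life <- function(schooling) b0 + b1 * schooling

# Expected life expectancy at 6 and 12 units of schooling
predict_life(c(6, 12))

# The difference between adjacent schooling values is always the slope
diff(predict_life(c(11, 12)))
```

At 12 units of schooling the fitted value is 44.10889 + 2.10345 × 12 = 69.35, and each additional unit adds about 2.10 years regardless of the starting point, which is what the linearity assumption implies.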
The fitted relationship is shown in Figure 2.
Figure 2 shows a clear positive trend: countries with higher schooling values tend to have higher life expectancy. The relationship is not perfectly tight, but it is strong enough to make a simple straight-line model a reasonable first approximation.
7. Refit After Removing Zero-Schooling Countries
I now remove every country that appears in Table 1 and refit the model. This is a stricter filter than simply dropping the rows where schooling equals zero, because it assumes that a country with obviously suspect schooling values may have unreliable records in other years as well.
[1] 2640
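After this filtering step, 2,640 observations remain. In total, 128 of the original 2,768 rows are removed, of which only 26 had schooling coded as zero (Table 2).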
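The difference between dropping zero rows and dropping whole countries can be sketched as follows. This is a base-R illustration on made-up data, with `zero_countries` and `rows_only` as hypothetical object names; `life_dat_filtered` matches the name that appears in the refitted model call.

```r
# Toy data (hypothetical): countries A and C each have at least one zero
life_dat <- data.frame(
  Country         = c("A", "A", "A", "B", "B", "C", "C"),
  Year            = c(2000, 2001, 2002, 2000, 2001, 2000, 2001),
  Schooling       = c(0, 0, 9.1, 12.3, 12.5, 0, 7.4),
  life_expectancy = c(73.6, 73.8, 74.0, 71.2, 71.5, 58.1, 58.9),
  stringsAsFactors = FALSE
)

# Countries with at least one zero-schooling row
zero_countries <- unique(life_dat$Country[life_dat$Schooling == 0])

# Option 1: drop only the rows where schooling is zero
rows_only <- life_dat[life_dat$Schooling != 0, ]

# Option 2 (used in this analysis): drop every row from those countries
life_dat_filtered <- life_dat[!life_dat$Country %in% zero_countries, ]

nrow(rows_only)
nrow(life_dat_filtered)
```

The second option discards more data, including rows with apparently valid schooling values, in exchange for not relying on any record from a country whose series contains suspect entries.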
```
Call:
lm(formula = life_expectancy ~ Schooling, data = life_dat_filtered)

Residuals:
     Min       1Q   Median       3Q      Max
-25.0048  -2.8498   0.7661   3.9261  15.3182

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.43868    0.45199   91.68   <2e-16 ***
Schooling    2.31002    0.03602   64.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.871 on 2638 degrees of freedom
Multiple R-squared:  0.6093,    Adjusted R-squared:  0.6091
F-statistic:  4113 on 1 and 2638 DF,  p-value: < 2.2e-16
```
The refitted relationship is shown in Figure 3.
Compared with Figure 2, Figure 3 shows a slightly cleaner linear pattern. The slope remains strongly positive, but the fit is less distorted by the extreme left-edge cases.
8. Diagnostics for the Filtered Model
I again inspect the standard regression diagnostics, now for the filtered model (Figure 4).
```r
# Four-panel diagnostic figure for the filtered model.
# Assumes ggplot2 and patchwork are loaded, and that filtered_aug holds the
# augmented data for the refitted model (.fitted, .resid, .std.resid, .hat).

# Residuals vs fitted values
p_fil_1 <- ggplot(filtered_aug, aes(x = .fitted, y = .resid)) +
  geom_point(shape = 1, alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted")

# Normal Q-Q plot of standardised residuals
p_fil_2 <- ggplot(filtered_aug, aes(sample = .std.resid)) +
  stat_qq(shape = 1, alpha = 0.5) +
  stat_qq_line(colour = "red") +
  labs(title = "Normal Q-Q", x = "Theoretical quantiles", y = "Standardised residuals")

# Scale-location plot
p_fil_3 <- ggplot(filtered_aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(shape = 1, alpha = 0.4) +
  labs(
    x = "Fitted values",
    y = "Sqrt(|Standardised residuals|)",
    title = "Scale-Location"
  )

# Standardised residuals vs leverage
p_fil_4 <- ggplot(filtered_aug, aes(x = .hat, y = .std.resid)) +
  geom_point(shape = 1, alpha = 0.4) +
  labs(
    x = "Leverage",
    y = "Standardised residuals",
    title = "Residuals vs Leverage"
  )

# Arrange the four panels in a 2 x 2 grid (patchwork syntax)
(p_fil_1 + p_fil_2) / (p_fil_3 + p_fil_4)
```

The filtered-model diagnostics in Figure 4 are more convincing. The residuals remain somewhat scattered, as expected in a large cross-country dataset, but the leverage pattern is less concerning and the model behaves more like a stable simple linear regression.
9. Comparing the Two Fits
To make the comparison explicit, I summarise the key coefficients and fit statistics in Table 3.
```r
# Collect sample size, coefficients, 95% confidence limits for the slope,
# and R-squared for both fits (assumes tibble, dplyr, and gt are loaded).
model_compare <- tibble(
  Model = c("All complete cases", "Zero-schooling countries removed"),
  N = c(nobs(mod_school_full), nobs(mod_school_filtered)),
  Intercept = c(coef(mod_school_full)[1], coef(mod_school_filtered)[1]),
  Slope = c(coef(mod_school_full)[2], coef(mod_school_filtered)[2]),
  `95% CI lower` = c(confint(mod_school_full)[2, 1], confint(mod_school_filtered)[2, 1]),
  `95% CI upper` = c(confint(mod_school_full)[2, 2], confint(mod_school_filtered)[2, 2]),
  `R-squared` = c(summary(mod_school_full)$r.squared, summary(mod_school_filtered)$r.squared)
)

# Round for display and render as a table
model_compare |>
  mutate(
    Intercept = round(Intercept, 3),
    Slope = round(Slope, 3),
    `95% CI lower` = round(`95% CI lower`, 3),
    `95% CI upper` = round(`95% CI upper`, 3),
    `R-squared` = round(`R-squared`, 3)
  ) |>
  gt()
```

| Model | N | Intercept | Slope | 95% CI lower | 95% CI upper | R-squared |
|---|---|---|---|---|---|---|
| All complete cases | 2768 | 44.109 | 2.103 | 2.035 | 2.172 | 0.565 |
| Zero-schooling countries removed | 2640 | 41.439 | 2.310 | 2.239 | 2.381 | 0.609 |
Table 3 shows that removing the zero-schooling countries increases the estimated slope from about 2.10 to about 2.31 years of life expectancy per one-unit increase in schooling. At the same time, \(R^2\) increases from 0.565 to 0.609, indicating a somewhat tighter linear association after filtering.
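The confidence limits in Table 3 can be checked by hand from the reported slope estimates, standard errors, and residual degrees of freedom. The sketch below does this with a small hypothetical helper, `slope_ci()`; the numeric inputs are copied from the two summary outputs, and the interval is the usual estimate plus or minus the t critical value times the standard error.

```r
# 95% confidence interval for a coefficient from its estimate, standard
# error, and residual degrees of freedom (hypothetical helper)
slope_ci <- function(estimate, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - t_crit * se, upper = estimate + t_crit * se)
}

# Inputs copied from the summary outputs of the two models
ci_full     <- slope_ci(2.10345, 0.03506, df = 2766)
ci_filtered <- slope_ci(2.31002, 0.03602, df = 2638)

round(rbind(ci_full, ci_filtered), 3)
```

The two intervals do not overlap, which is consistent with a real shift in the slope. Because the two fits share most of their observations, this comparison should be read as descriptive rather than as a formal test of the difference.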
10. Write-up
10.1 Methods
I analysed the relationship between life expectancy and schooling using the Kaggle life expectancy dataset. I retained country, year, life expectancy, and schooling, and removed rows with missing values for life expectancy or schooling. I first fitted a simple linear regression using all complete observations, with life expectancy as the response and schooling as the predictor. I then identified countries for which schooling was coded as zero (Table 1 and Table 2). Because several of these zero values appeared implausible as literal measurements and were likely to represent coding anomalies or problematic records, I excluded all observations from those countries and refitted the regression. For both models, I examined the fitted relationship graphically (Figure 2 and Figure 3) and assessed assumptions using residual-versus-fitted, normal Q-Q, scale-location, and residual-versus-leverage plots (Figure 1 and Figure 4). Model coefficients and goodness-of-fit statistics were compared in Table 3.
10.2 Results
Using all complete observations, life expectancy increased strongly with schooling (Figure 2). The fitted slope was positive, indicating that countries with greater schooling tended to have higher life expectancy, and the simple linear regression explained a substantial proportion of the variation in life expectancy (\(R^2 = 0.565\)). Inspection of the data revealed a small group of countries with one or more observations in which schooling was coded as zero (Table 1 and Table 2). After removing those countries, the positive relationship remained clear and became slightly stronger (Figure 3). The estimated slope increased from 2.10 to 2.31, and model fit improved from \(R^2 = 0.565\) to \(R^2 = 0.609\) (Table 3). The diagnostic plots for the filtered model (Figure 4) suggested a cleaner and more stable fit than the initial model, with less concern about leverage from extreme low-schooling cases.
10.3 Discussion
This analysis indicates a strong positive association between schooling and life expectancy in the dataset, and the relationship is well captured by a simple linear model after removing countries with zero-schooling records. The comparison between the two fits suggests that those zero-schooling cases were influential enough to weaken the apparent linear pattern and reduce overall fit. Their removal produced a steeper slope and more convincing diagnostics, which supports the decision to treat them as questionable observations for this teaching example.
The results should still be interpreted cautiously. These data are observational and cross-national, so the fitted slope should not be read as a direct causal effect of schooling on life expectancy. Many other factors, including wealth, healthcare access, infectious disease burden, governance, and conflict, may contribute to the observed pattern. Even so, the filtered model provides a useful example of simple linear regression because the predictor-response relationship is interpretable, the scatterplot shows a clear trend, and the diagnostics are broadly acceptable.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {Life {Expectancy} and {Schooling}},
date = {2026-03-26},
url = {https://tangledbank.netlify.app/BCB744/basic_stats/kaggle_life_expectancy.html},
langid = {en}
}
