12. Simple Linear Regression
The Entry Point to Model-Based Biostatistics
- what a simple linear regression model is;
- when regression is more appropriate than correlation;
- the assumptions behind a straight-line model;
- how to fit a model with lm();
- how to diagnose normality, homoscedasticity, linearity, and outliers;
- how to interpret slopes, fitted values, confidence intervals, and prediction intervals;
- how explanatory and predictive uses of the same regression differ;
- how to report a regression in the style of a Results section.
- Self-Assessment Task 12-1 (/10)
- Self-Assessment Task 12-2 (/30)
- Self-Assessment instructions and full task overview
Linear models are among the most useful statistical tools available to biologists because they describe and quantify how a response variable, \(Y\), changes as one or more predictor variables, \(X\), change. In statistics, a model is a mathematical representation of a real process. It is not reality itself, but an idealised description of the part of reality we want to understand. Linear models are especially valuable because they are simple enough to interpret, yet flexible enough to support a large part of modern statistical practice.
In the broadest sense, a linear model is one in which the unknown parameters enter linearly, even if the variables themselves are transformed or combined in more elaborate ways. The most basic member of that family is the simple linear model, which is the focus of this chapter. It has one continuous predictor and one continuous response, and it is fitted by simple linear regression. The goal may be explanatory, where the predictor is hypothesised to influence the response, or predictive, where I simply want a formula that estimates likely values of the response from observed values of the predictor. A causal interpretation is therefore common, but it is not required.
Regression analysis is the procedure by which I estimate the model parameters from data. The aim is to fit the model that best captures the observed response-predictor relationship and then interpret the strength, direction, and uncertainty of that relationship. This differs from correlation, which quantifies association without imposing a response-predictor structure. When one variable changes systematically with another but neither prediction nor a defensible response-predictor distinction is of interest, correlation is usually the more appropriate tool.
As established in Chapter 11, every observed response can be written as \(Y_i = \hat{Y}_i + e_i\), where \(\hat{Y}_i\) is the fitted value and \(e_i\) is the residual. In the previous chapter I focused on residuals, fitted values, and the diagnostic thinking used to assess whether a model is behaving adequately. Here I carry that thinking into the first full regression model in the course and show how the equation, the slope, the fitted line, and the diagnostics all belong to one workflow.
Simple linear regression is therefore the entry point to the wider model-based framework. From here I move to polynomial regression, where curvature is handled within the linear-model framework; to multiple regression, where several predictors act simultaneously; to interaction terms, where the effect of one predictor depends on another; and later to generalised linear models, where the same modelling thinking is extended to non-normal responses by introducing a link function and a different error distribution. If the response-predictor structure is defensible and a straight-line mean relationship is biologically plausible, simple linear regression is usually the correct place to begin.
1 Main Concepts
These ideas organise the chapter.
- Simple linear regression models one continuous response as a function of one continuous predictor.
- The response-predictor distinction is essential: regression is not simply a line drawn onto a correlation scatter plot.
- The slope is usually the main inferential quantity because it describes the expected change in the response for a one-unit change in the predictor.
- The intercept is often less biologically interesting, but it is still part of the fitted model.
- Residuals are central to assumption checking because they reveal structure that the model has failed to capture.
- Confidence intervals and prediction intervals answer different questions and should not be confused.
2 Nature of the Data and Assumptions
2.1 Requirements Before Fitting
As the experimenter, you must ensure the following requirements before a simple linear regression is fitted:
- A defensible response-predictor structure: There should be a theoretical or philosophical basis for treating one variable as the predictor and the other as the response. This may be explicitly causal, but it can also be predictive if that distinction is still biologically sensible.
- Independence of observations: Each measured value of the response must be independent of the others. If repeated measurements, clustered sampling, or temporal dependence are present, a different modelling framework may be required. When temporal or spatial order is plausible, plot residuals against the order in which observations were collected. Runs of positive or negative residuals, or systematic cycles, indicate dependence that the model has not captured. For the sparrow and penguin examples you will encounter below, measurements are treated as independent, so this check is not performed here.
- Continuous predictor: The predictor variable should be continuous.
- Continuous response: The response variable should also be continuous.
2.2 Assumptions to Check After Fitting
After the model has been fitted, the following assumptions must be checked:
- Normality: The residuals \(e_i\) should be approximately normally distributed.
- Homoscedasticity: The variance of the residuals \(e_i\) should be roughly constant across the fitted values.
- Linearity: The mean relationship between the predictor and the response should be approximately linear.
- Measurement error in the predictor: Standard linear regression assumes that the predictor is measured without serious error. In practice this is only approximately true, and we return to this issue in Chapter 16.
As in the earlier inferential chapters, you must pay attention to the workflow. We first inspect the data (numerically and graphically), then fit the model, then examine the residuals (graphically first, then with assumption tests if desired), and only then interpret the coefficients with confidence.
2.3 Assumptions and Diagnostics
The table below summarises which diagnostic tool targets which assumption and what pattern to watch for.
| Assumption | What it concerns | Main diagnostic | Visual signal to watch for |
|---|---|---|---|
| Linearity | mean structure | residual vs fitted plot | slope or curvature in residuals |
| Constant variance | spread | residual vs fitted / scale-location | funnel or changing vertical spread |
| Normality | residual distribution | residual Q-Q plot | systematic bends away from reference line |
| Independence | relationship among residuals | residuals vs order/time/space | runs, cycles, clusters |
3 The Model
Simple linear regression is the first method we have encountered in which we write an explicit equation (the model) for the mean response and then estimate its parameters from data.
The model is:
\[Y_i = \alpha + \beta X_i + \epsilon_i \tag{1}\]
In Equation 1, \(Y_i\) is the response for observation \(i\), \(X_i\) is the predictor, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon_i\) is the error term. Errors \(\epsilon_i\) are unobserved theoretical quantities. Residuals \(e_i\) are their observed estimates, computed after fitting, and are the objects used for diagnostic checks.
The line is fitted by minimising the sum of squared residuals. This is why ordinary linear regression is often called an ordinary least squares method.
The animation below shows the fitted line rotating through the data as the error sum of squares is minimised.
The model contains errors \(\epsilon_i\), which are the unobserved theoretical deviations of each true response from the straight-line mean. In most regression models we assume that the errors are independent and identically distributed (i.i.d.). When the errors are approximately normal this can be written as \(\epsilon_i \sim N(0, \sigma^2)\). The requirement of mean zero implies that, on average, the model does not systematically over- or under-predict. Constant variance implies that the spread of errors is roughly similar across the predictor range. Independence implies that errors do not carry systematic structure from one observation to the next.
The residuals \(e_i = Y_i - \hat{Y}_i\) are the observed estimates of these unobserved errors. It is the residuals that are available after fitting and that we inspect in diagnostic plots.
Violation of these assumptions can lead to biased or inefficient parameter estimates, poor uncertainty estimates, and misleading inference.
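The distinction between unobserved errors and observed residuals can be made concrete in R. The sketch below uses made-up data (simulated from a known straight-line process) to confirm that the residuals returned by lm() are exactly \(Y_i - \hat{Y}_i\), and that least-squares residuals sum to zero:

```r
# Simulate a straight-line process (made-up values for illustration only)
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
mod <- lm(y ~ x)

# residuals() and the hand computation Y - fitted(Y) agree exactly
all.equal(unname(residuals(mod)), y - unname(fitted(mod)))

# A property of least squares: the residuals sum to (numerically) zero
sum(residuals(mod))
```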
The next section formalises how the line is chosen from the data.
Before fitting any model, take a moment to check whether simple linear regression is the right tool. For each scenario below, decide whether the method is appropriate and identify any requirement that is not met:
| Scenario | Appropriate? | Which requirement, if any, fails? |
|---|---|---|
| Plant height (continuous) modelled as a function of rainfall (continuous); one measurement per plant | ? | ? |
| Blood pressure measured five times on each of 20 patients; modelled as a function of time | ? | ? |
| Species presence/absence modelled as a function of temperature | ? | ? |
| Tree diameter (continuous) modelled as a function of stand density (continuous); trees within the same plot share resources | ? | ? |
Discuss your answers with a partner and then read on to check them against the requirements listed above.
4 The Fitting Rule
The least-squares criterion used by lm() is to choose \(\alpha\) and \(\beta\) so that the residual sum of squares is as small as possible:
\[\text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \tag{2}\]
Equation 2 is not a regression model. It is the fitting rule that tells us how the software decides which of all possible straight lines is the best-fitting one.
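To see the criterion in action, the sketch below computes the RSS of Equation 2 for the least-squares line fitted to the sparrow data (tabulated later in this chapter) and for an alternative line with a slightly tilted slope; any line other than the least-squares one gives a larger RSS:

```r
# Sparrow data from the table in Example 1
age  <- c(3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17)
wing <- c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0)

# Residual sum of squares for a candidate line with intercept a and slope b
rss <- function(a, b) sum((wing - (a + b * age))^2)

ls_fit  <- coef(lm(wing ~ age))              # least-squares intercept and slope
rss_ls  <- rss(ls_fit[1], ls_fit[2])
rss_alt <- rss(ls_fit[1], ls_fit[2] + 0.02)  # tilt the slope slightly

rss_ls < rss_alt  # TRUE: the least-squares line minimises the RSS
```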
5 R Function
The main function used in this chapter is lm():
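A minimal call has the form sketched below; my_data and its column values are made-up placeholders, not data from this chapter:

```r
# Hypothetical data frame with one continuous response and one predictor
my_data <- data.frame(predictor = c(1, 2, 3, 4, 5),
                      response  = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Fit the simple linear regression
mod <- lm(response ~ predictor, data = my_data)
coef(mod)  # intercept and slope
```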
You can read the formula response ~ predictor as “the response is modelled as a function of the predictor.”
The fitted model can then be explored with functions such as:
- summary() for the coefficients and overall fit;
- confint() for confidence intervals around the coefficients;
- augment() from broom for fitted values and residuals;
- predict() for confidence and prediction intervals;
- plot() for standard diagnostic plots;
- bptest() from lmtest for a formal test of heteroscedasticity.
Fit the sparrow model by hand before running lm(). The least-squares slope is \(\hat{\beta} = r \cdot (s_Y / s_X)\) and the intercept is \(\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}\).
Then verify your hand-calculated values against lm(wing ~ age, data = sparrows). Do they match? Explain in one sentence why the slope formula involves both the correlation and the ratio of standard deviations.
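One way to set up this verification (using the sparrow data tabulated below) is sketched here:

```r
# Sparrow data from the table in Example 1
age  <- c(3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17)
wing <- c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0)
sparrows <- data.frame(age, wing)

beta_hat  <- cor(age, wing) * sd(wing) / sd(age)  # slope: r * (s_Y / s_X)
alpha_hat <- mean(wing) - beta_hat * mean(age)    # intercept: Y-bar - slope * X-bar

# The hand calculation and lm() give the same coefficients
c(alpha_hat, beta_hat)
coef(lm(wing ~ age, data = sparrows))
```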
6 Outliers and Their Impact on Simple Linear Regression
Outliers are data points that deviate substantially from the overall pattern or trend observed in the data. They can have disproportionate effects on a simple linear regression because the fitted line is estimated by minimising squared residuals, as in Equation 2. Extreme observations may therefore influence the slope, the intercept, the standard errors, the confidence intervals, and the diagnostic patterns.
This does not mean that unusual observations must automatically be removed. Some are recording errors and should be corrected or excluded. Others are rare but real biological events and may carry important information. The correct response is therefore to identify potentially influential points, inspect them carefully, and decide whether they reveal error, unusual but valid biology, or a more fundamental model problem.
7 Example 1: Sparrow Wing Length and Age
- Fit the model with lm().
- Plot residuals versus fitted values, then check for curvature (linearity) and changing spread (constant variance).
- Plot the residual Q-Q plot and check whether the residual distribution is close enough to normal.
- If temporal or spatial order is plausible, plot residuals against order and check for runs or cycles (independence).
- Inspect Cook’s distance and leverage and identify observations with disproportionate influence.
- Revise the model if any diagnostic reveals leftover structure.
I begin with a very small sparrow dataset because it makes the general approach clear. I then go to a fuller worked example using the Adelie penguin data from the palmerpenguins package, which is much closer to the style and level of complexity encountered in real biological analyses.
| Age (days) | Wing length (cm) |
|---|---|
| 3 | 1.4 |
| 4 | 1.5 |
| 5 | 2.2 |
| 6 | 2.4 |
| 8 | 3.1 |
| 9 | 3.2 |
| 10 | 3.2 |
| 11 | 3.9 |
| 12 | 4.1 |
| 14 | 4.7 |
| 15 | 4.5 |
| 16 | 5.2 |
| 17 | 5.0 |
Before looking at the sparrow scatter plot, sketch what you expect the relationship between age and wing length to look like based on your biological knowledge alone. Then run the code below and compare your sketch with the actual data:
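A minimal base-R sketch of such a scatter plot, using the values from the table above, is:

```r
# Sparrow data from the table above
age  <- c(3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17)
wing <- c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0)

# Plot wing length against age to assess the shape of the relationship
plot(age, wing, xlab = "Age (days)", ylab = "Wing length (cm)")
```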
Does a straight line look like a reasonable summary of this relationship? Are there any obvious outliers or influential-looking points at the extremes of the age range?
7.1 Do an Exploratory Data Analysis (EDA)
The sparrow data show the basic form of a simple linear model.
age wing
Min. : 3 Min. :1.400
1st Qu.: 6 1st Qu.:2.400
Median :10 Median :3.200
Mean :10 Mean :3.415
3rd Qu.:14 3rd Qu.:4.500
Max. :17 Max. :5.200
In Figure 1, the scatter plot suggests a clear positive linear relationship: older sparrows tend to have longer wings, and the relationship appears close to linear over the range of the data. This example makes the fitted line and the slope easy to understand before we look at a noisier dataset.
7.2 State the Model Question and Hypothesis
With the sparrow example, I ask whether wing length changes systematically with age.
The statistic of interest in a simple linear regression is usually the slope in Equation 1, which quantifies the magnitude and direction of dependence of the response on the predictor:
\[H_{0}: \beta = 0\] \[H_{a}: \beta \ne 0\]
If the slope is zero (a more-or-less horizontal line), there is no linear relationship between the predictor and the expected value of the response. If the slope differs from zero, then the predictor helps explain variation in the response.
7.3 Fit the Model
Call:
lm(formula = wing ~ age, data = sparrows)
Residuals:
Min 1Q Median 3Q Max
-0.30699 -0.21538 0.06553 0.16324 0.22507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.71309 0.14790 4.821 0.000535 ***
age 0.27023 0.01349 20.027 5.27e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2184 on 11 degrees of freedom
Multiple R-squared: 0.9733, Adjusted R-squared: 0.9709
F-statistic: 401.1 on 1 and 11 DF, p-value: 5.267e-10
The output provides the intercept, the slope, their standard errors, a test of whether the coefficients differ from zero, the residual standard error, and the model \(R^2\).
- Examine the contents of the regression model object sparrow_mod. Explain the main components and how they relate to summary(sparrow_mod). ☐ (/3)
- Using values inside the model object, show how to reconstruct the observed response values from the fitted values and residuals. ☐ (/3)
- Fit a linear regression through the model residuals and explain the result. ☐ (/2)
- Fit a linear regression through the fitted values and explain the result. ☐ (/2)
Read the summary() output for sparrow_mod carefully. Without using a calculator, answer the following questions from the printed output:
- What is the estimated slope, and what does it tell you about the expected change in wing length per additional day of age?
- What is the \(p\)-value for the slope, and what conclusion do you draw about \(H_0: \beta = 0\)?
- The residual standard error is printed near the bottom of the summary. What does this number represent in the units of the response variable?
- What does \(R^2 \approx 0.97\) tell you about how much of the variation in wing length the model accounts for?
Discuss your answers with a partner before continuing.
7.4 Test the Assumptions
Assumptions in regression are checked after fitting the model.
# A tibble: 6 × 4
age wing .fitted .resid
<dbl> <dbl> <dbl> <dbl>
1 3 1.4 1.52 -0.124
2 4 1.5 1.79 -0.294
3 5 2.2 2.06 0.136
4 6 2.4 2.33 0.0655
5 8 3.1 2.87 0.225
6 9 3.2 3.15 0.0548
7.4.1 Normality
The Q-Q panel in Figure 2 shows the sample quantiles of the residuals plotted against the theoretical quantiles of a normal distribution. For these data the points track close to the reference line, indicating that the residuals do not depart strongly from normality.
7.4.2 Homoscedasticity
The residuals-versus-fitted panel in Figure 2 does not reveal a systematic funnel or wedge pattern. The spread of residuals appears reasonably even across the range of fitted values, consistent with the constant-variance assumption.
7.4.3 Influential observations
Because the dataset contains only 13 observations, influence diagnostics are less reliable than in larger samples. No single observation stands out dramatically in the scale-location or Cook’s-distance panels, but conclusions based on such a small sample should be interpreted cautiously.
In Figure 2, the diagnostic plots suggest that the model is broadly adequate for these data. The residuals do not show severe curvature, the spread is reasonably even, and the Q-Q plot does not suggest a dramatic departure from normality. Because the dataset is small, these plots should be interpreted cautiously, but there is no obvious reason to abandon the linear model.
7.5 Interpret the Results
Inference applies within the observed range of the predictor. Extrapolation beyond the data is possible but requires explicit justification, because the straight-line form may not hold outside the observed range.
I construct the final model fit as a figure, which I will use in my reporting.
In Figure 3, the fitted slope is positive, which means that wing length increases with age. In this example, the slope estimate is about 0.27 cm per day, so the model implies that the expected wing length increases by roughly 0.27 cm for each additional day of age across the range of these observations.
The intercept is the expected wing length when age is zero. Here that value is not biologically the main point of interest. It is simply the point where the fitted line crosses the vertical axis.
The model explains a large proportion of the variation in the observed wing lengths (\(R^2 \approx 0.97\)), and the test of the slope provides very strong evidence that the linear relationship is not zero (\(p < 0.001\)).
7.6 Reporting
Methods
The relationship between sparrow wing length and age was assessed with a simple linear regression, with wing length as the response variable and age as the continuous predictor. Model adequacy was evaluated from standard residual diagnostics.
Results
Sparrow wing length increased strongly with age in the fitted simple linear regression (\(\beta = 0.270\), 95% CI: 0.241 to 0.300; \(R^2 = 0.97\); \(p < 0.001\)) (Figure 3). Across the observed age range, older birds therefore had consistently longer wings, with the expected wing length increasing by about 0.27 cm for each additional day of age.
Discussion
This example is useful because it makes the biological interpretation of the slope very clear: age is associated with a strong increase in wing length over the observed range, and the fitted line captures most of the variation in these simple demonstration data.
Use the sparrow model to generate predictions for two new ages and compare the confidence and prediction intervals:
Answer the following:
- Which interval is wider at each age, and why?
- Both intervals are narrowest near the mean age and widen toward the extremes of the age range. Explain why interval width varies across the predictor range.
- Would you trust a prediction for age 25? Why or why not?
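A sketch of the comparison, refitting the sparrow model from the data above and predicting at two new ages (6 and 14 days):

```r
# Refit the sparrow model
age  <- c(3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17)
wing <- c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0)
mod <- lm(wing ~ age)

new_ages <- data.frame(age = c(6, 14))

# Confidence interval: uncertainty in the fitted mean line
conf_int <- predict(mod, newdata = new_ages, interval = "confidence")

# Prediction interval: adds residual scatter of individual birds
pred_int <- predict(mod, newdata = new_ages, interval = "prediction")

conf_int
pred_int  # wider at both ages than the confidence interval
```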
8 Example 2: Adelie Penguin Bill Length and Body Mass
The following example uses the penguins dataset from the palmerpenguins package to model bill length as a function of body mass in Adelie penguins.
Although I could also calculate a correlation, I will use a simple linear regression because I want a predictive model that estimates bill length from body mass. This is a defensible use of simple linear regression even though I am not claiming that body mass directly causes bill length.
| Bill length (mm) | Body mass (g) |
|---|---|
| 39.1 | 3750 |
| 39.5 | 3800 |
| 40.3 | 3250 |
| 36.7 | 3450 |
| 39.3 | 3650 |
| 38.9 | 3625 |
8.1 Do an Exploratory Data Analysis (EDA)
[1] 151 8
species island bill_length_mm bill_depth_mm
Adelie :151 Biscoe :44 Min. :32.10 Min. :15.50
Chinstrap: 0 Dream :56 1st Qu.:36.75 1st Qu.:17.50
Gentoo : 0 Torgersen:51 Median :38.80 Median :18.40
Mean :38.79 Mean :18.35
3rd Qu.:40.75 3rd Qu.:19.00
Max. :46.00 Max. :21.50
flipper_length_mm body_mass_g sex year
Min. :172 Min. :2850 female:73 Min. :2007
1st Qu.:186 1st Qu.:3350 male :73 1st Qu.:2007
Median :190 Median :3700 NA's : 5 Median :2008
Mean :190 Mean :3701 Mean :2008
3rd Qu.:195 3rd Qu.:4000 3rd Qu.:2009
Max. :210 Max. :4775 Max. :2009
We see that the dataset contains many more observations than the sparrow example. I focus here on body_mass_g and bill_length_mm. Both are continuous, and restricting the analysis to Adelie penguins gives me a relatively coherent biological subset for the example.
8.2 Create a Plot
Code
In Figure 4, there is also a clear positive relationship between body mass and bill length despite considerable scatter. This relationship appears linear enough to justify a simple linear model as a first approximation. Creating a publication-quality plot complete with the regression line in place preempts the model fitting, but I can use it later for reporting should it turn out that the model fit is defensible.
8.3 State the Hypothesis
\[H_{0}: \beta = 0\] \[H_{a}: \beta \ne 0\]
The null hypothesis is that body mass has no linear association with bill length, while the alternative is that the slope differs from zero.
If the slope is zero, then the predictor does not explain systematic change in the expected response.
8.4 Fit the Model
Call:
lm(formula = bill_length_mm ~ body_mass_g, data = Adelie)
Residuals:
Min 1Q Median 3Q Max
-6.4208 -1.3690 0.1874 1.4825 5.6168
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.699e+01 1.483e+00 18.201 < 2e-16 ***
body_mass_g 3.188e-03 3.977e-04 8.015 2.95e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.234 on 149 degrees of freedom
Multiple R-squared: 0.3013, Adjusted R-squared: 0.2966
F-statistic: 64.24 on 1 and 149 DF, p-value: 2.955e-13
8.5 Test the Assumptions
To facilitate assumption checking I use augment() from broom to add fitted values, residuals, leverage, and related diagnostics to the data.
8.5.1 Normality
I use the Shapiro-Wilk test as one formal check of the residual distribution.
Shapiro-Wilk normality test
data: residuals(mod1)
W = 0.99613, p-value = 0.9637
The formal test does not flag a serious departure from normality here, but graphical diagnostics are usually more informative than the test alone, especially in moderate samples where the test has limited power to detect mild departures.
Code
p1 <- ggplot(mod1_data, aes(sample = .resid)) +
stat_qq(shape = 1, colour = "pink") +
stat_qq_line(colour = "steelblue4") +
labs(title = "Normal Q-Q", x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_grey()
p2 <- ggplot(mod1_data, aes(x = .resid)) +
geom_histogram(binwidth = 1, fill = "pink", color = "pink") +
labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency") +
theme_grey()
p3 <- ggplot(mod1_data, aes(x = .fitted, y = .resid)) +
geom_point(shape = 1, colour = "pink") +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted", x = "Fitted values", y = "Residuals") +
theme_grey()
p4 <- ggplot(mod1_data, aes(x = .fitted, y = sqrt(abs(.resid)))) +
geom_point(shape = 1, colour = "pink") +
geom_smooth(se = FALSE, colour = "steelblue4") +
labs(title = "Scale-Location", x = "Fitted values", y = "Sq. Root |Resid.|") +
theme_grey()
ggarrange(p1, p2, p3, p4, nrow = 2, ncol = 2, labels = "AUTO")
In Figure 5, the Q-Q plot and histogram suggest that the residuals are approximately normally distributed. There is no obvious extreme departure that would make the model immediately unusable.
8.5.2 Homoscedasticity
The Breusch-Pagan test is one formal check of constant variance.
The formal test is consistent with the visual impression from the residual plots. Graphical inspection should take precedence: if the scale-location plot shows a clear trend, the spread is not constant regardless of the test p-value. The test does not suggest strong heteroscedasticity, and the residuals-versus-fitted and scale-location panels in Figure 5 also indicate that the spread of residuals is reasonably even across the fitted range.
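For intuition, one common (studentised) form of the statistic behind bptest() is simply \(n R^2\) from an auxiliary regression of the squared residuals on the fitted values. The sketch below illustrates this logic on simulated homoscedastic data, not on the penguins:

```r
# Simulated homoscedastic data (made-up values for illustration)
set.seed(42)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
mod <- lm(y ~ x)

# Breusch-Pagan idea: regress squared residuals on the fitted values;
# under constant variance this auxiliary regression explains nothing
aux     <- lm(residuals(mod)^2 ~ fitted(mod))
bp_stat <- length(y) * summary(aux)$r.squared            # LM statistic, n * R^2
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)   # chi-squared reference
p_value  # a large p-value is consistent with homoscedasticity
```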
8.5.3 Check for Outliers
Four complementary diagnostics reveal different facets of how individual observations affect the fitted model (Figure 6, Figure 7): DFFITS, Cook’s distance, residuals versus leverage, and Cook’s distance versus leverage.
Code
plt1 <- ggplot(mod1_data, aes(x = index, y = dffits)) +
geom_col(fill = ifelse(abs(mod1_data$dffits) > dffits_threshold, "black", "pink")) +
geom_hline(yintercept = dffits_threshold, linetype = "dashed", colour = "red") +
geom_hline(yintercept = -dffits_threshold, linetype = "dashed", colour = "red") +
geom_text(aes(label = ifelse(abs(dffits) > dffits_threshold, as.character(index), "")),
hjust = 1.0, vjust = 1.0, colour = "darkred") +
labs(x = "Observation Index", y = "DFFITS") +
theme_grey()
plt2 <- ggplot(mod1_data, aes(x = index, y = .cooksd, fill = colour)) +
geom_col() +
geom_hline(yintercept = cooksd_thresh, linetype = "dashed", color = "red") +
geom_text(aes(label = ifelse(.cooksd > cooksd_thresh, as.character(index), "")),
hjust = 1.2, vjust = 1.0, color = "darkred") +
labs(x = "Observation Index", y = "Cook's Dist.") +
scale_fill_identity() +
theme_grey()
plt3 <- ggplot(mod1_data, aes(x = .hat, y = .std.resid)) +
geom_point(aes(size = .cooksd, colour = colour), shape = 1) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "black") +
geom_vline(xintercept = 2 * mean(mod1_data$.hat), linetype = "dashed", colour = "blue") +
geom_smooth(method = "loess", se = FALSE, colour = "steelblue4") +
geom_text(aes(label = ifelse(.cooksd > cooksd_thresh, as.character(index), "")),
hjust = 1.0, vjust = 1.0, colour = "darkred") +
labs(x = "Leverage",
y = "Std. Resid.",
size = "Cook's Distance") +
scale_colour_identity() +
theme_grey() +
theme(legend.position = "bottom") +
guides(size = guide_legend("Cook's Dist."))
plt4 <- ggplot(mod1_data, aes(x = .hat, y = .cooksd)) +
geom_point(aes(size = .cooksd, colour = colour), shape = 1) +
geom_hline(yintercept = cooksd_thresh, linetype = "dashed", colour = "red") +
geom_vline(xintercept = 2 * mean(mod1_data$.hat), linetype = "dashed", colour = "blue") +
geom_text(aes(label = ifelse(.cooksd > cooksd_thresh, as.character(index), "")),
hjust = 1.0, vjust = 1.0, colour = "darkred") +
labs(x = "Leverage",
y = "Cook's Dist.",
size = "Cook's Distance") +
scale_colour_identity() +
theme_grey() +
guides(size = guide_legend("Cook's Distance"))
ggarrange(plt1, plt2, plt3, plt4, nrow = 2, ncol = 2, labels = "AUTO",
common.legend = TRUE)
Code
ggplot(mod1_data, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point(aes(size = .cooksd, colour = colour), shape = 1) +
geom_smooth(method = "lm", se = FALSE, colour = "steelblue4") +
geom_text(aes(label = ifelse(abs(.cooksd) > cooksd_thresh, rownames(mod1_data), "")),
hjust = 1.0, vjust = 1.0, colour = "darkred") +
labs(x = "Body mass (g)", y = "Bill length (mm)") +
scale_colour_identity() +
theme_grey() +
theme(legend.position = "bottom") +
guides(size = guide_legend("Cook's Distance"))
DFFITS (Difference in Fits) shows how much the fitted value at each observation changes when that observation is removed, expressed in estimated standard-error units (Figure 6 A). The dashed lines at \(\pm 2\sqrt{p/n}\) mark the threshold beyond which removal would noticeably shift the local prediction. Most bars sit well inside the bounds, and none extends dramatically beyond, so no individual penguin strongly controls its own fitted value.
Cook’s distance measures overall influence by showing the total shift in all fitted values when one observation is excluded. The rough threshold \(4/n\) (dashed red line; Figure 6 B) identifies candidates for inspection; flagged points should not be deleted automatically without good justification. A handful of observations approach or just cross the line (shown in black and labelled by row number), but no bar is very much taller than its neighbours and the exceedances are marginal.
Figure 6 C combines the two sources of concern. Leverage, on the \(x\)-axis, measures how far an observation’s predictor value sits from the centre of the predictor distribution; a high-leverage point has an unusual \(X\) value. The \(y\)-axis shows standardised residuals, and point size encodes Cook’s distance. The danger zone is the upper and lower right: high leverage combined with a large residual means the observation pulls the fitted line toward itself with no counterweight from the rest of the data. The vertical dashed line marks twice the mean leverage. Most points cluster toward the left and within \(\pm 2\) standardised residuals; a few approach the leverage threshold, but none combines unusual predictor position with a large residual.
In Figure 6 D, leverage is placed on the \(x\)-axis and Cook’s distance on the \(y\)-axis, making both dimensions of concern visible at once. Observations in the upper-right region (beyond the horizontal Cook’s threshold and the vertical leverage threshold) would be the highest priority to examine. No point occupies that region here.
Figure 7 locates the flagged observations in the original scatter. The highlighted penguins are not at the extremes of either variable and are not pulling the line in any obvious direction. Some observations contribute more to the fitted model than others, as is normal, but no single point dominates the result. If a flagged observation turned out to be a data-entry error, I would correct or exclude it; if it represents a genuine biological extreme, I would keep it and document its influence.
8.6 Interpret the Results
Now that the assumptions appear broadly acceptable, I can interpret the fitted model. The slope of the regression line is positive, so bill length increases with body mass. The coefficient is about \(3.2 \times 10^{-3}\) mm/g, meaning that the expected bill length increases by about 0.0032 mm for every additional gram of body mass.
Note that the units of the slope depend directly on the units of both variables. Here the slope is expressed in mm per gram, which produces a small numerical value. Rescaling body mass to kilograms (dividing by 1000) would yield a slope of about 3.2 mm/kg, which is easier to describe verbally. The biological meaning is unchanged.
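The effect of rescaling can be checked directly. The sketch below uses simulated values in roughly the same ranges as the penguin data (made-up numbers; the point is only the factor of 1000):

```r
# Simulated data in gram units (made-up values for illustration)
set.seed(7)
mass_g  <- runif(50, 2850, 4775)
bill_mm <- 27 + 0.0032 * mass_g + rnorm(50, sd = 2)

# Fit with mass in grams, then refit with mass in kilograms
slope_per_g  <- coef(lm(bill_mm ~ mass_g))[["mass_g"]]
mass_kg      <- mass_g / 1000
slope_per_kg <- coef(lm(bill_mm ~ mass_kg))[["mass_kg"]]

# Same fit, same biology; only the units of the slope change
slope_per_kg / slope_per_g  # exactly 1000
```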
The multiple \(R^2\) is about 0.30, so the model explains roughly 30% of the observed variation in bill length. A biologically informative regression does not need to explain nearly all the variation in the response to be worthwhile.
The test of the slope provides strong evidence that the relationship is not zero (\(p < 0.001\)). An ANOVA on the fitted model leads to the same practical conclusion: the straight-line model explains a meaningful amount of variation in bill length.
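The agreement between the two routes is no accident: with a single predictor, the ANOVA F-statistic equals the square of the slope’s t-statistic. A sketch with made-up data:

```r
# Simulated straight-line data (made-up values for illustration)
set.seed(3)
x <- runif(40, 0, 10)
y <- 2 + 0.3 * x + rnorm(40)
mod <- lm(y ~ x)

# t-statistic for the slope from summary(), F-statistic from anova()
t_slope <- summary(mod)$coefficients["x", "t value"]
F_model <- anova(mod)["x", "F value"]

all.equal(F_model, t_slope^2)  # TRUE: one predictor, so F = t^2
```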
Among the diagnostics, the residuals-versus-fitted plot is the most informative here. The spread is reasonably even and there is no clear curvature, which supports the straight-line mean structure for this species subset.
Had the Q-Q plot shown strong tail departures or the residual spread changed markedly across the fitted range, the next step would be to consider a response transformation, a different model family, or the inclusion of additional predictors.
8.7 Reporting
Methods
The data analysed in this example were drawn from the Palmer Penguins dataset, which contains measurements on penguins sampled in the Palmer Archipelago, Antarctica. For this worked example, only Adelie penguins were retained. Bill length was treated as the response variable and body mass as the continuous predictor.
A simple linear regression model was fitted using lm() in R, with bill length modelled as a function of body mass. Model adequacy was assessed by inspecting residual plots, by applying the Shapiro-Wilk test to the residuals, and by using the Breusch-Pagan test to assess homoscedasticity. Influential observations were explored using Cook’s distance, DFFITS, and leverage-based diagnostics.
Results
Bill length increased with body mass in the fitted simple linear regression (\(\beta = 0.00319\), SE = 0.00040, \(t = 8.02\), \(p < 0.001\)) (Figure 4). The model explained about 30% of the variation in bill length (\(R^2 = 0.30\)), indicating that body mass was an informative but incomplete predictor of bill length. The overall model was also strongly supported by the ANOVA (\(F = 64.25\), \(p < 0.001\), d.f. = 1, 149).
Discussion
The worked example supports a positive relationship between body mass and bill length in Adelie penguins, but it also shows the limits of a one-predictor model. Body mass explains part of the variation in bill length, not all of it. A fuller biological account would need additional predictors such as sex, age, or ecological context.
The penguin model explained only about 30% of the variation in bill length. Explore whether restricting the data to females improves the fit.
Compare the \(R^2\) and residual standard error from the female-only model with the values from mod1 (all Adelie penguins combined). Does splitting by sex change the picture? Run the same model for males and compare. What does this tell you about the usefulness of one-predictor models in biology?
9 Confidence and Prediction Intervals
The fitted line gives the expected mean response for a given value of the predictor, but two different kinds of interval are commonly needed. A confidence interval describes uncertainty in the estimated mean response, whereas a prediction interval describes uncertainty for an individual future observation. The prediction interval is always wider because it must include the scatter of individual observations around the fitted mean.
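Interval output of the kind shown below is produced by `predict()` via its `interval` argument. A minimal sketch on simulated data (the model and the two new predictor values are hypothetical):

```r
# Confidence vs prediction intervals from the same fitted model
set.seed(42)
x <- runif(30, 1, 5)
y <- 0.8 * x + rnorm(30, sd = 0.3)
m <- lm(y ~ x)

new <- data.frame(x = c(2, 4))
predict(m, newdata = new, interval = "confidence")   # uncertainty in the mean
predict(m, newdata = new, interval = "prediction")   # uncertainty for a new case
```

At every predictor value the prediction interval contains the confidence interval, because it adds the residual scatter of individual observations to the uncertainty in the estimated mean.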
Confidence intervals for the mean response:

       fit      lwr      upr
1 2.604698 2.444344 2.765051
2 4.226072 4.065719 4.386425

Prediction intervals for individual new observations:

       fit      lwr      upr
1 2.604698 2.097951 3.111444
2 4.226072 3.719325 4.732818
I can also visualise the same distinction in the penguin example.
Code
```r
pred_conf <- as.data.frame(predict(mod1,
                                   newdata = Adelie,
                                   interval = "confidence"))
pred_pred <- as.data.frame(predict(mod1,
                                   newdata = Adelie,
                                   interval = "prediction"))

results <- cbind(Adelie, pred_conf, pred_pred[, 2:3])
names(results)[9:13] <- c("fit", "lwr_conf", "upr_conf",
                          "lwr_pred", "upr_pred")

ggplot(data = results, aes(x = body_mass_g, y = fit)) +
  geom_line(linewidth = 0.4, colour = "red") +
  geom_ribbon(aes(ymin = lwr_pred, ymax = upr_pred),
              alpha = 0.2, fill = "red") +
  geom_ribbon(aes(ymin = lwr_conf, ymax = upr_conf),
              alpha = 0.2, fill = "blue") +
  geom_point(aes(y = bill_length_mm), shape = 1) +
  labs(x = "Body mass (g)", y = "Bill length (mm)") +
  theme_grey()
```

In Figure 8, the blue confidence band is narrower because it describes uncertainty around the fitted mean response, whereas the pink prediction band is wider because it must also accommodate the scatter of individual penguins around that mean. Confidence intervals are therefore useful when the primary interest lies in the mean expected response at a given predictor value. Prediction intervals are more relevant when the goal is to anticipate the range in which an individual future observation may fall.
10 Prediction Versus Explanation
The same fitted straight-line model can be used for at least two different scientific purposes. In an explanatory analysis, the main interest is usually the slope itself and what it says about the biological relationship between the predictor and the response. A predictive analysis, on the other hand, places the emphasis on the fitted values, prediction intervals, and how accurately the model can anticipate new observations.
The sparrow example is mostly explanatory. We care primarily that wing length increases with age and that the slope is clearly positive. The Adelie penguin example is closer to a predictive framing because we have treated body mass as a variable from which bill length might be estimated for new individuals.
The distinction is important because it changes what should be emphasised in a Results section. Explanatory work usually emphasises the slope, its uncertainty, and the biological interpretation of the effect. Predictive work still needs the model coefficients, but it should pay much more attention to fitted values, prediction intervals, and the amount of unexplained variation.
This is also why a model with a highly significant slope is not automatically a good predictive model. A relationship can be biologically real and still leave substantial scatter around the fitted line, as in Figure 4. Conversely, a model that predicts well is not automatically evidence for a causal mechanism. In Chapter 24, I return to this distinction.
11 What to Do When Assumptions Fail
When diagnostic patterns reveal that the assumptions of the linear model are not met, the first step is to ask what those patterns are telling you. Residuals that suggest non-linearity indicate that the model has not absorbed all the systematic structure in the data; the remedy is either to transform the response or predictor variables, or to fit a more flexible model. Polynomial regression, mechanistic non-linear models, and Generalised Additive Models (GAMs) can all accommodate curvature. If residual variance changes strongly with fitted values, consider a transformation or a different modelling framework altogether, such as a Generalised Linear Model (GLM).
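A log transform of the response is often the first remedy to try when the residual spread grows with the mean. The sketch below uses simulated data with multiplicative error, not a real dataset:

```r
# When error is multiplicative on the raw scale, variance grows with the
# mean; modelling log(y) restores an additive, roughly constant spread
set.seed(7)
x <- runif(80, 1, 10)
y <- exp(0.3 * x + rnorm(80, sd = 0.2))   # multiplicative error on raw scale

m_raw <- lm(y ~ x)        # residual spread fans out with the fitted values
m_log <- lm(log(y) ~ x)   # roughly constant spread on the log scale
coef(m_log)["x"]          # recovers the generating slope of about 0.3
```

After transforming, the slope describes change in \(\log(y)\) per unit of \(x\), so back-transformation is needed to talk about the response on its original scale.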
Outliers require careful judgement. If they arise from data-entry errors or procedural failures, removal is defensible; however, it should be documented and justified, because outliers may be functionally important as they can reveal rare but real extreme events. When you are confident that an outlier is a genuine observation, robust regression techniques such as M-estimation or least trimmed squares (which downweight influential points rather than discarding them) are worth considering. A final option is to apply an appropriate transformation, such as a logarithm or square root, which can compress the scale sufficiently to reduce the leverage of extreme values without removing any data.
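Robust alternatives are available in standard packages. A minimal sketch using `MASS::rlm()` (M-estimation) on simulated data with one injected outlier:

```r
library(MASS)   # ships with standard R installations; provides rlm()

set.seed(3)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
y[20] <- y[20] + 15   # one gross outlier at a high-leverage point

coef(lm(y ~ x))["x"]    # OLS slope is dragged upward by the outlier
coef(rlm(y ~ x))["x"]   # M-estimation downweights it; slope stays near 0.5
```

Note that the outlier is retained in the robust fit; it is only given less weight, which is exactly the behaviour wanted when an extreme value is genuine but should not dominate the estimate.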
12 Common Mistakes
Common mistakes in simple linear regression include:
- using regression when the relationship is only associative and poorly justified as a response-predictor model;
- ignoring non-independence among observations;
- fitting a straight line to a clearly curved relationship;
- treating statistical significance as biological importance;
- reporting \(R^2\) without discussing effect size, uncertainty, or assumptions.
Complete a full simple linear regression analysis using life expectancy as the response and schooling as the predictor. Your job is not only to fit the model, but also to identify and justify the treatment of problematic observations.
Use the dataset kaggle_life_expectancy_data.csv.
You should do the following:
- Prepare the data by selecting the variables needed for the analysis and removing rows with missing values.
- Fit the initial simple linear regression model.
- Plot the data and show the fitted regression line.
- Check the initial model diagnostics graphically.
- Identify the issue in the dataset.
- Provide evidence for this issue as a table.
- Explain why these cases appear problematic for this analysis.
- Remove the problematic cases and refit the model.
- Recheck the diagnostics graphically for the revised model.
- Interpret the final model clearly and state whether it is an improvement over the initial fit. Tabulate the important model-fit statistics in support of your conclusion.
- End with a short scientific write-up containing Methods, Results, and Discussion sections.
12.1 Marking Rubric
| Component | Marks |
|---|---|
| Data preparation | 3 |
| Initial model fitting and figure | 3 |
| Initial model diagnostics | 3 |
| Identifying and explaining the issue | 3 |
| Evidence for the issue presented as a table | 4 |
| Removing the problematic cases appropriately | 2 |
| Refitting the analysis | 3 |
| Rechecking the diagnostics | 3 |
| Interpreting the final model and comparing it with the initial model | 3 |
| Scientific write-up (Methods, Results, Discussion) | 3 |
Total: 30 marks
13 Summary
- Simple linear regression models one continuous response as a function of one continuous predictor.
- The slope is usually the main inferential quantity because it describes how the expected response changes with the predictor.
- Regression differs from correlation because it imposes a response-predictor structure.
- Residual diagnostics are essential because they tell us whether the model is adequate.
- Outlier diagnostics help us decide whether unusual observations are errors, influential extremes, or signs of model misspecification.
- Confidence intervals and prediction intervals answer different questions.
The sections above established the workflow for model-based analysis. In the next chapter, I extend the same workflow to curved relationships, and in Chapter 14 I then move to several predictors at once.
Reuse
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {12. {Simple} {Linear} {Regression}},
date = {2026-04-11},
url = {https://tangledbank.netlify.app/BCB744/basic_stats/12-simple-linear-regression.html},
langid = {en}
}
