12. Simple Linear Regression

The Entry Point to Model-Based Biostatistics

Published

2026/03/22

Note: In This Chapter
  • what a simple linear regression model is;
  • when regression is more appropriate than correlation;
  • the assumptions behind a straight-line model;
  • how to fit a model with lm();
  • how to diagnose normality, homoscedasticity, linearity, and outliers;
  • how to interpret slopes, fitted values, confidence intervals, and prediction intervals;
  • how explanatory and predictive uses of the same regression differ;
  • how to report a regression in the style of a Results section.
Note: Cheatsheet

A cheatsheet on statistical methods is available here.

Important: Tasks to Complete in This Chapter
  • Task H

1 Introduction

Linear models are among the most useful statistical tools in biology because they allow us to describe and quantify how one variable changes with another. In a regression, we write down a model for a response variable, \(Y\), as a function of a predictor variable, \(X\), and then estimate the parameters of that model from the data.

One of the simplest such models is the simple linear model, which has one continuous predictor and one continuous response. The aim may be explanatory, where the predictor is believed to influence the response, or predictive, where we want a formula that allows us to estimate likely values of the response from known values of the predictor. In either case, regression differs from correlation because it imposes a response-predictor structure on the data.

In the previous chapter, I defined residuals and showed how fitted models are checked. In this chapter, I apply that diagnostic groundwork to the first full regression model.

In this chapter, I lay the groundwork for polynomial regression, multiple regression, interaction terms, generalised linear models, and several of the more flexible modelling approaches discussed later. If the response-predictor structure is sound and a straight-line model is biologically plausible, simple linear regression is often the correct point of departure.

2 Key Concepts

These ideas organise the chapter.

  • Simple linear regression models one continuous response as a function of one continuous predictor.
  • The response-predictor distinction is essential. Regression is not simply correlation with a fitted line added.
  • The slope is usually the main inferential quantity because it describes the expected change in the response for a one-unit change in the predictor.
  • The intercept is often less biologically interesting, but it is still part of the fitted model.
  • Residuals are central to assumption checking because they reveal structure that the model has failed to capture.
  • Confidence intervals and prediction intervals answer different questions and should not be confused.

3 Nature of the Data and Assumptions

The experimenter must ensure the following key requirements for a simple linear regression:

  1. A defensible response-predictor structure: There should be a theoretical or philosophical basis for treating one variable as the predictor and the other as the response. This may be explicitly causal, but it can also be predictive if that distinction is still biologically sensible.
  2. Independence of observations: Each measured value of the response must be independent of the others. If repeated measurements, clustered sampling, or temporal dependence are present, a different modelling framework may be required.
  3. Continuous predictor: The predictor variable should be continuous.
  4. Continuous response: The response variable should also be continuous.

After the model has been fitted, the following assumptions must be checked:

  1. Normality: The residuals should be approximately normally distributed.
  2. Homoscedasticity: The residual variance should be roughly constant across the fitted values.
  3. Linearity: The mean relationship between the predictor and the response should be approximately linear.
  4. Measurement error in the predictor: Standard linear regression assumes that the predictor is measured without serious error. In practice this is only approximately true, and we return to this issue in Chapter 16.

As in the earlier inferential chapters, the workflow is important. We first inspect the data, then fit the model, then examine the residuals, and only then interpret the coefficients with confidence.

4 The Model

Simple linear regression is the first modelling framework in the sequence in which we write an explicit equation for the mean response and then estimate its parameters from data. The model itself and the least-squares fitting rule are set out more carefully in the Core Equations section below.

The line is fitted by minimising the sum of squared residuals. This is why ordinary linear regression is often called an ordinary least squares method.

The animation below shows the fitted line rotating through the data as the error sum of squares is minimised.

Note: The Residuals, \(\epsilon_i\)

In most regression models we assume that the residuals are independent and identically distributed. When the residuals are approximately normal this can be written as \(\epsilon_i \sim N(0, \sigma^2)\). The requirement of mean zero implies that, on average, the model does not systematically over- or under-predict. Constant variance implies that the spread of residuals is roughly similar across the predictor range. Independence implies that residuals do not carry systematic structure from one observation to the next.

Violation of these assumptions can lead to biased or inefficient parameter estimates, poor uncertainty estimates, and misleading inference.
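These properties can be verified directly on a fitted model. The sketch below uses simulated data (not the chapter's examples) in which the assumptions hold by construction; with an intercept in the model, the least-squares residuals always average to zero, and the residual standard error estimates \(\sigma\):

```r
set.seed(42)

# Simulate data that satisfy the model: Y = 2 + 0.5 X + N(0, 1) noise
x <- runif(100, min = 0, max = 20)
y <- 2 + 0.5 * x + rnorm(100, mean = 0, sd = 1)

fit <- lm(y ~ x)

# With an intercept in the model, the least-squares residuals sum to
# (machine) zero, so their mean is zero up to rounding error
mean(residuals(fit))

# The residual standard error estimates the true noise sd (here, 1)
summary(fit)$sigma
```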

5 R Function

The main function used in this chapter is lm():

lm(response ~ predictor, data = df)

You can read the formula response ~ predictor as “the response is modelled as a function of the predictor.”

6 The Core Equations

Simple linear regression has two equations that students should keep conceptually separate. The first describes the model for the mean response. The second describes how the fitted line is chosen from the data.

The model itself is:

\[Y_i = \alpha + \beta X_i + \epsilon_i \tag{1}\]

In Equation 1, \(Y_i\) is the response for observation \(i\), \(X_i\) is the predictor, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon_i\) is the residual error. This is the equation that says what a straight-line mean structure looks like.

The least-squares criterion used by lm() is then to choose \(\alpha\) and \(\beta\) so that the residual sum of squares is as small as possible:

\[\text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \tag{2}\]

Equation 2 is not a second regression model. It is the fitting rule. It tells us how the software decides which of all possible straight lines is the best-fitting one.
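For simple linear regression, the minimiser of Equation 2 has a closed form: the slope is the covariance of \(X\) and \(Y\) divided by the variance of \(X\), and the intercept makes the line pass through the point of means. The sketch below, on simulated data, checks this against lm() and confirms that any other line has a larger residual sum of squares:

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(50)

# Closed-form least-squares estimates
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# They match the estimates returned by lm()
fit <- lm(y ~ x)
coef(fit)

# The residual sum of squares from Equation 2, for any candidate line
rss <- function(a, b) sum((y - (a + b * x))^2)

# Perturbing the slope away from the least-squares value increases RSS
rss(alpha_hat, beta_hat) < rss(alpha_hat, beta_hat + 0.1)
```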

The fitted model can then be explored with functions such as:

  • summary() for the coefficients and overall fit;
  • confint() for confidence intervals around the coefficients;
  • augment() from broom for fitted values and residuals;
  • predict() for confidence and prediction intervals;
  • plot() for standard diagnostic plots;
  • bptest() from lmtest for a formal test of heteroscedasticity.
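As a small illustration of how these pieces fit together, the interval returned by confint() can be rebuilt from the coefficient table: it is the estimate plus or minus a t quantile times the standard error. A sketch on simulated data:

```r
set.seed(5)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

# 95% CI for the slope, directly...
ci <- confint(fit, "x", level = 0.95)

# ...and rebuilt by hand: estimate +/- t quantile * standard error
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
tq  <- qt(0.975, df = df.residual(fit))
c(est - tq * se, est + tq * se)
```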

7 Outliers and Their Impact on Simple Linear Regression

Outliers can have disproportionate effects on a simple linear regression because the fitted line is estimated by minimising squared residuals, as in Equation 2. Extreme observations may therefore influence the slope, the intercept, the standard errors, the confidence intervals, and the diagnostic patterns.

This does not mean that unusual observations must automatically be removed. Some are recording errors and should be corrected or excluded. Others are rare but real biological events and may carry important information. The correct response is therefore to identify potentially influential points, inspect them carefully, and decide whether they reveal error, unusual but valid biology, or a more fundamental model problem.
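The pull that a single extreme point exerts can be demonstrated directly. The sketch below, on simulated data, adds one high-leverage outlier, compares the fitted slope with and without it, and shows that Cook's distance flags the added point:

```r
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)

# Add a single high-leverage outlier far to the right of the x range,
# well below where the trend predicts
x_out <- c(x, 40)
y_out <- c(y, 2)

slope_clean   <- coef(lm(y ~ x))[2]
slope_outlier <- coef(lm(y_out ~ x_out))[2]

# One point is enough to drag the slope down appreciably
c(clean = slope_clean, with_outlier = slope_outlier)

# Cook's distance identifies the added point (observation 21)
which.max(cooks.distance(lm(y_out ~ x_out)))
```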

8 Example 1: Sparrow Wing Length and Age

We begin with a very small sparrow dataset because it makes the general logic transparent. We then move to a fuller worked example using the Adelie penguin data from the palmerpenguins package, which is much closer to the style and level of complexity encountered in real biological analyses.

Table 1: Sparrow wing-length data used in the introductory simple linear regression example.
Age (days)   Wing length (cm)
    3              1.4
    4              1.5
    5              2.2
    6              2.4
    8              3.1
    9              3.2
   10              3.2
   11              3.9
   12              4.1
   14              4.7
   15              4.5
   16              5.2
   17              5.0

8.1 Do an Exploratory Data Analysis (EDA)

We start with the sparrow data to see the basic form of a simple linear model.

summary(sparrows)
      age          wing      
 Min.   : 3   Min.   :1.400  
 1st Qu.: 6   1st Qu.:2.400  
 Median :10   Median :3.200  
 Mean   :10   Mean   :3.415  
 3rd Qu.:14   3rd Qu.:4.500  
 Max.   :17   Max.   :5.200  
Figure 1: Wing length of sparrows at different ages.

The scatter plot suggests a clear positive linear relationship: older sparrows tend to have longer wings, and the relationship appears close to linear over the range of the data. This small example is useful because it makes the fitted line and the slope easy to understand before we move to a noisier biological dataset.

8.2 State the Model Question and Hypothesis

The biological question in the sparrow example is whether wing length changes systematically with age.

The inferential focus in a simple linear regression is usually the slope in Equation 1:

\[H_{0}: \beta = 0\] \[H_{a}: \beta \ne 0\]

If the slope is zero (a horizontal line), there is no linear relationship between the predictor and the expected value of the response. If the slope differs from zero, then the predictor helps explain variation in the response.

8.3 Fit the Model

sparrow_mod <- lm(wing ~ age, data = sparrows)
summary(sparrow_mod)

Call:
lm(formula = wing ~ age, data = sparrows)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.30699 -0.21538  0.06553  0.16324  0.22507 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.71309    0.14790   4.821 0.000535 ***
age          0.27023    0.01349  20.027 5.27e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2184 on 11 degrees of freedom
Multiple R-squared:  0.9733,    Adjusted R-squared:  0.9709 
F-statistic: 401.1 on 1 and 11 DF,  p-value: 5.267e-10

The output provides the intercept, the slope, their standard errors, a test of whether the coefficients differ from zero, the residual standard error, and the model \(R^2\).

8.4 Test the Assumptions

Assumptions in regression are checked after fitting the model.

sparrow_aug <- augment(sparrow_mod)
sparrow_aug |>
  select(age, wing, .fitted, .resid) |>
  head()
# A tibble: 6 × 4
    age  wing .fitted  .resid
  <dbl> <dbl>   <dbl>   <dbl>
1     3   1.4    1.52 -0.124 
2     4   1.5    1.79 -0.294 
3     5   2.2    2.06  0.136 
4     6   2.4    2.33  0.0655
5     8   3.1    2.87  0.225 
6     9   3.2    3.15  0.0548
Figure 2: Diagnostic plots of the residuals from the sparrow wing-length regression.

The diagnostic plots suggest that the model is broadly adequate for these data. The residuals do not show severe curvature, the spread is reasonably even, and the Q-Q plot does not suggest a dramatic departure from normality. Because the dataset is small, these plots should be interpreted cautiously, but there is no obvious reason to abandon the linear model.

8.5 Interpret the Results

The fitted slope is positive, which means that wing length increases with age. In this example, the slope estimate is about 0.27 cm per day, so the model implies that the expected wing length increases by roughly 0.27 cm for each additional day of age across the range of these observations.

The intercept is the expected wing length when age is zero. Here that value is not biologically the main point of interest. It is simply the point where the fitted line crosses the vertical axis.

The model explains a large proportion of the variation in the observed wing lengths (\(R^2 \approx 0.97\)), and the test of the slope provides very strong evidence that the linear relationship is not zero (\(p < 0.001\)).
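In a simple linear regression, \(R^2\) is exactly the squared Pearson correlation between the predictor and the response, which ties this chapter back to correlation. A quick check on simulated data:

```r
set.seed(3)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

fit <- lm(y ~ x)

# R-squared from the model summary...
r2_model <- summary(fit)$r.squared

# ...equals the squared correlation coefficient
r2_cor <- cor(x, y)^2

all.equal(r2_model, r2_cor)
```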

8.6 Reporting

Figure 3: Wing length as a function of age in the sparrow example. The straight line is the fitted simple linear regression and the blue shading is the 95% confidence interval.
Note: Write-Up

Methods

The relationship between sparrow wing length and age was assessed with a simple linear regression, with wing length as the response variable and age as the continuous predictor. Model adequacy was evaluated from standard residual diagnostics.

Results

Sparrow wing length increased strongly with age in the fitted simple linear regression (\(\beta = 0.270\), 95% CI: 0.241 to 0.300; \(R^2 = 0.97\); \(p < 0.001\)). Across the observed age range, older birds therefore had consistently longer wings, with the expected wing length increasing by about 0.27 cm for each additional day of age.

Discussion

This example is useful because it makes the biological interpretation of the slope very clear: age is associated with a strong increase in wing length over the observed range, and the fitted line captures most of the variation in these simple demonstration data.

9 Example 2: Adelie Penguin Bill Length and Body Mass

The sparrow example makes the general workflow clear, but it is useful to repeat the same approach in a more interesting dataset. The following example uses the penguins dataset from the palmerpenguins package to model bill length as a function of body mass in Adelie penguins.

Although we could also calculate a correlation, we will use a simple linear regression because we want a predictive model that estimates bill length from body mass. This is a defensible use of simple linear regression even though we are not claiming that body mass directly causes bill length.

# Keep only the Adelie penguins
Adelie <- penguins[penguins$species == "Adelie", ]
# Drop the fourth row, which contains missing measurements
Adelie <- Adelie[-4, ]
Table 2: Size measurements for adult foraging Adelie penguins near Palmer Station, Antarctica.
Bill length (mm) Body mass (g)
39.1 3750
39.5 3800
40.3 3250
36.7 3450
39.3 3650
38.9 3625

9.1 Do an Exploratory Data Analysis (EDA)

dim(Adelie)
[1] 151   8
summary(Adelie)
      species          island   bill_length_mm  bill_depth_mm  
 Adelie   :151   Biscoe   :44   Min.   :32.10   Min.   :15.50  
 Chinstrap:  0   Dream    :56   1st Qu.:36.75   1st Qu.:17.50  
 Gentoo   :  0   Torgersen:51   Median :38.80   Median :18.40  
                                Mean   :38.79   Mean   :18.35  
                                3rd Qu.:40.75   3rd Qu.:19.00  
                                Max.   :46.00   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex          year     
 Min.   :172       Min.   :2850   female:73   Min.   :2007  
 1st Qu.:186       1st Qu.:3350   male  :73   1st Qu.:2007  
 Median :190       Median :3700   NA's  : 5   Median :2008  
 Mean   :190       Mean   :3701               Mean   :2008  
 3rd Qu.:195       3rd Qu.:4000               3rd Qu.:2009  
 Max.   :210       Max.   :4775               Max.   :2009  

We see that the dataset contains many more observations than the sparrow example. We focus here on body_mass_g and bill_length_mm. Both are continuous, and restricting the analysis to Adelie penguins gives us a relatively coherent biological subset for the example.

9.2 Create a Plot

ggplot(Adelie,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(shape = 1, colour = "pink") +
  geom_smooth(method = "lm", se = FALSE, colour = "steelblue4") +
  labs(x = "Body mass (g)", y = "Bill length (mm)") +
  theme_grey()
Figure 4: Scatter plot of the Palmer Station Adelie penguin data with a best fit line.

Although there is considerable scatter in the data, there is also a clear positive relationship between body mass and bill length. This relationship appears linear enough to justify a simple linear model as a first approximation.

9.3 State the Hypothesis

\[H_{0}: \beta = 0\] \[H_{a}: \beta \ne 0\]

That is, the null hypothesis is that body mass has no linear association with bill length, while the alternative is that the slope differs from zero.

If the slope is zero, then the predictor does not explain systematic change in the expected response.

9.4 Fit the Model

mod1 <- lm(bill_length_mm ~ body_mass_g,
           data = Adelie)
summary(mod1)

Call:
lm(formula = bill_length_mm ~ body_mass_g, data = Adelie)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4208 -1.3690  0.1874  1.4825  5.6168 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.699e+01  1.483e+00  18.201  < 2e-16 ***
body_mass_g 3.188e-03  3.977e-04   8.015 2.95e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.234 on 149 degrees of freedom
Multiple R-squared:  0.3013,    Adjusted R-squared:  0.2966 
F-statistic: 64.24 on 1 and 149 DF,  p-value: 2.955e-13

9.5 Test the Assumptions

To facilitate assumption checking we use augment() from broom to add fitted values, residuals, leverage, and related diagnostics to the data.

mod1_data <- augment(mod1)

9.5.1 Normality

The Shapiro-Wilk test can be used as one formal check of the residual distribution.

shapiro.test(residuals(mod1))

    Shapiro-Wilk normality test

data:  residuals(mod1)
W = 0.99613, p-value = 0.9637

Here the formal test does not suggest a serious departure from normality, but the graphical diagnostics are usually more informative than the test alone.

Figure 5: Diagnostic plots for the simple linear regression of Adelie penguin bill length on body mass. A) Normal Q-Q plot. B) Histogram of residuals. C) Residuals versus fitted values. D) Scale-location plot.

The Q-Q plot and histogram suggest that the residuals are approximately normally distributed. There is no obvious extreme departure that would make the model immediately unusable.

9.5.2 Homoscedasticity

The Breusch-Pagan test is one formal check of constant variance.

bptest(mod1)

    studentized Breusch-Pagan test

data:  mod1
BP = 1.6677, df = 1, p-value = 0.1966

The test does not suggest strong heteroscedasticity, and the residuals-versus-fitted and scale-location plots also indicate that the spread of residuals is reasonably even across the fitted range.

9.5.3 Check for Outliers

Outliers and influential observations can be explored in several complementary ways. Here we look at DFFITS, Cook’s distance, residuals versus leverage, and Cook’s distance versus leverage.

cooksd_thresh <- 4 / nrow(mod1_data)
dffits_threshold <- 2 * sqrt(2 / nrow(Adelie))

mod1_data <- mod1_data |>
  mutate(index = row_number(),
         leverage = hatvalues(mod1),
         dffits = dffits(mod1),
         colour = ifelse(.cooksd > cooksd_thresh, "black", "pink"))
Figure 6: Diagnostic plots for visual inspection of outliers in the Adelie penguin regression. A) DFFITS. B) Cook’s distance. C) Residuals versus leverage. D) Cook’s distance versus leverage. Observations beyond the Cook’s distance threshold are shown in black and labelled by row number.
Figure 7: Scatter plot of Adelie penguin bill length against body mass, with observations exceeding the Cook’s distance threshold highlighted.

These plots do not suggest a catastrophic influence problem, but they do remind us that some observations contribute more to the fitted line than others. If an identified point were an obvious data-entry error, it could be corrected or excluded. If it were a real biological extreme, it would usually be better to keep it and discuss its influence.

9.6 Interpret the Results

Now that the assumptions appear broadly acceptable, we can interpret the fitted model. The slope of the regression line is positive, so bill length increases with body mass. The coefficient is about \(3.2 \times 10^{-3}\) mm/g, meaning that the expected bill length increases by about 0.0032 mm for every additional gram of body mass.
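Because the slope is expressed in mm per gram, it is often clearer to rescale it to a biologically meaningful increment; for example, per 500 g of body mass (using the slope reported in the summary above):

```r
# Expected change in bill length (mm) for a 500 g increase in body
# mass, using the slope from the fitted model above
slope <- 3.188e-03
slope * 500
```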

The multiple \(R^2\) is about 0.30, so the model explains roughly 30% of the observed variation in bill length. This is a useful reminder that a biologically informative regression does not need to explain nearly all the variation in the response to be worthwhile.

The test of the slope provides strong evidence that the relationship is not zero (\(p < 0.001\)). An ANOVA on the fitted model leads to the same practical conclusion: the straight-line model explains a meaningful amount of variation in bill length.

9.7 Reporting

Note: Write-Up

Methods

The data analysed in this example were drawn from the Palmer Penguins dataset, which contains measurements on penguins sampled in the Palmer Archipelago, Antarctica. For this worked example, only Adelie penguins were retained. Bill length was treated as the response variable and body mass as the continuous predictor.

A simple linear regression model was fitted using lm() in R, with bill length modelled as a function of body mass. Model adequacy was assessed by inspecting residual plots, by applying the Shapiro-Wilk test to the residuals, and by using the Breusch-Pagan test to assess homoscedasticity. Influential observations were explored using Cook’s distance, DFFITS, and leverage-based diagnostics.

Results

Bill length increased with body mass in the fitted simple linear regression (\(\beta = 0.00319\), SE = 0.00040, \(t = 8.02\), \(p < 0.001\)) (Figure 4). The model explained about 30% of the variation in bill length (\(R^2 = 0.30\)), indicating that body mass was an informative but incomplete predictor of bill length. The overall model was also strongly supported by the ANOVA (\(F = 64.24\), \(p < 0.001\), d.f. = 1, 149).

Discussion

The worked example supports a positive relationship between body mass and bill length in Adelie penguins, but it also shows the limits of a one-predictor model. Body mass explains part of the variation in bill length, not all of it. A fuller biological account would need additional predictors such as sex, age, or ecological context.

10 Confidence and Prediction Intervals

The fitted line gives the expected mean response for a given value of the predictor, but two different kinds of interval are commonly needed.

  • A confidence interval describes uncertainty in the estimated mean response.
  • A prediction interval describes uncertainty for an individual future observation.

The prediction interval is always wider because it must include the scatter of individual observations around the fitted mean.

new_x <- tibble(age = c(7, 13))
predict(sparrow_mod, newdata = new_x, interval = "confidence")
       fit      lwr      upr
1 2.604698 2.444344 2.765051
2 4.226072 4.065719 4.386425
predict(sparrow_mod, newdata = new_x, interval = "prediction")
       fit      lwr      upr
1 2.604698 2.097951 3.111444
2 4.226072 3.719325 4.732818

We can also visualise the same distinction in the penguin example.

pred_conf <- as.data.frame(predict(mod1,
                                   newdata = Adelie,
                                   interval = "confidence"))

pred_pred <- as.data.frame(predict(mod1,
                                   newdata = Adelie,
                                   interval = "prediction"))

results <- cbind(Adelie, pred_conf, pred_pred[, 2:3])
names(results)[c(9:13)] <- c("fit", "lwr_conf", "upr_conf",
                             "lwr_pred", "upr_pred")

ggplot(data = results, aes(x = body_mass_g, y = fit)) +
  geom_line(linewidth = 0.4, colour = "red") +
  geom_ribbon(aes(ymin = lwr_pred, ymax = upr_pred),
              alpha = 0.2, fill = "red") +
  geom_ribbon(aes(ymin = lwr_conf, ymax = upr_conf),
              alpha = 0.2, fill = "blue") +
  geom_point(aes(y = bill_length_mm), shape = 1) +
  labs(x = "Body mass (g)", y = "Bill length (mm)") +
  theme_grey()
Figure 8: Adelie penguin bill length model with confidence interval (blue) and prediction interval (pink) around the fitted values.

Confidence intervals are useful when the primary interest lies in the mean expected response at a given predictor value. Prediction intervals are more relevant when the goal is to anticipate the range in which an individual future observation may fall.
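This width difference can be confirmed numerically. The sketch below, on simulated data, shows that at every predictor value the prediction interval is wider than the confidence interval, because it adds the residual variance to the uncertainty in the estimated mean:

```r
set.seed(11)
x <- seq(1, 20, length.out = 50)
y <- 5 + 0.8 * x + rnorm(50, sd = 2)

fit  <- lm(y ~ x)
newd <- data.frame(x = c(5, 10, 15))

ci <- predict(fit, newdata = newd, interval = "confidence")
pi <- predict(fit, newdata = newd, interval = "prediction")

# The prediction interval is wider at every x because it includes the
# scatter of individual observations around the fitted mean
ci_width <- ci[, "upr"] - ci[, "lwr"]
pi_width <- pi[, "upr"] - pi[, "lwr"]
all(pi_width > ci_width)
```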

11 Prediction Versus Explanation

By this point in the chapter it should be clear that the same fitted straight-line model can be used for at least two different scientific purposes.

  • In an explanatory analysis, the main interest is usually the slope itself and what it says about the biological relationship between the predictor and the response.
  • In a predictive analysis, the emphasis shifts toward fitted values, prediction intervals, and how accurately the model can anticipate new observations.

The sparrow example is mostly explanatory. We care primarily that wing length increases with age and that the slope is clearly positive. The Adelie penguin example is closer to a predictive framing because we have treated body mass as a variable from which bill length might be estimated for new individuals.

The distinction is important because it changes what should be emphasised in a Results section. Explanatory work usually emphasises the slope, its uncertainty, and the biological interpretation of the effect. Predictive work still needs the model coefficients, but it should pay much more attention to fitted values, prediction intervals, and the amount of unexplained variation.

This is also why a model with a highly significant slope is not automatically a good predictive model. A relationship can be biologically real and still leave substantial scatter around the fitted line, as in the penguin data. Conversely, a model that predicts well is not automatically evidence for a causal mechanism. In Chapter 24, we return to this distinction.
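One way to make the predictive framing concrete is to hold out part of the data and measure out-of-sample error. The sketch below is illustrative only (a simple random split on simulated data, not a method used elsewhere in this chapter):

```r
set.seed(21)
x <- runif(100, 0, 10)
y <- 2 + 1.2 * x + rnorm(100, sd = 3)
dat <- data.frame(x, y)

# Hold out 30 observations as a test set
test_idx <- sample(nrow(dat), 30)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

fit <- lm(y ~ x, data = train)

# Root mean squared error on unseen data measures predictive skill
rmse <- sqrt(mean((test$y - predict(fit, newdata = test))^2))
rmse

# The slope here is highly significant, yet the rmse stays near the
# noise sd (3): a significant slope does not guarantee tight prediction
```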

12 What to Do When Assumptions Fail

If the assumptions of the linear model are not adequate, the first step is not to abandon the analysis mechanically but to ask what the diagnostic pattern is telling you.

  • If the residuals suggest non-linearity, consider whether the relationship should be transformed or whether a more flexible model is needed.
  • If the residual variance changes strongly with fitted values, consider transformation or a different modelling approach. Chapter 6 dealt with the raw-data version of this problem; here the same issue appears in the residuals.
  • If outliers have strong leverage or influence, inspect them carefully before deciding what to do.
  • If the real question is still only one of association, and not response modelling, return to Chapter 9.

13 Common Mistakes

Common mistakes in simple linear regression include:

  • using regression when the relationship is only associative and poorly justified as a response-predictor model;
  • ignoring non-independence among observations;
  • fitting a straight line to a clearly curved relationship;
  • treating statistical significance as biological importance;
  • reporting \(R^2\) without discussing effect size, uncertainty, or assumptions.

14 Summary

  • Simple linear regression models one continuous response as a function of one continuous predictor.
  • The slope is usually the main inferential quantity because it describes how the expected response changes with the predictor.
  • Regression differs from correlation because it imposes a response-predictor structure.
  • Residual diagnostics are essential because they tell us whether the model is adequate.
  • Outlier diagnostics help us decide whether unusual observations are errors, influential extremes, or signs of model misspecification.
  • Confidence intervals and prediction intervals answer different questions.

In this chapter, I establish the workflow for modelling. In the next chapter, I extend the same workflow to curved relationships, and in Chapter 14 I then move to several predictors at once.

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {12. {Simple} {Linear} {Regression}},
  date = {2026-03-22},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/12-simple-linear-regression.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 12. Simple Linear Regression. https://tangledbank.netlify.app/BCB744/basic_stats/12-simple-linear-regression.html.