17: Multiple Linear Regression (MLR)

Published

2026/06/15

TipMaterial Required for This Chapter
Type Name Link
Theory BCB744 Biostatistics Ch. 14: Multiple Regression
Data The Doubs River data 💾 Doubs.RData
ImportantTasks to Complete in This Chapter

When you build those models, return to the chapter on Model Building for the thinking that should guide variable and model selection.

In the Model Building chapter I argued that choosing a model is a way of thinking. In this chapter I propose the methodological counterpart: I apply the most widely used model in ecology, multiple linear regression (MLR), and the diagnostics that tell me whether to trust the fitted model. I use the familiar Doubs data so that the statistics (not the dataset) are the new thing. I then apply the same tools to the seaweed \(\beta\)-diversity analysis of the integrative project and to the Deep Dive into Gradients, where I apply multiple regression to real distance-decay data.

Multiple regression is also foundational to the methods in the chapters that follow. A generalised linear model (GLM) is a multiple regression for response variables that are not normally distributed, such as counts or presence-absence data. A generalised additive model (GAM) is a multiple regression in which the straight lines become smooth curves. A mixed model (mixed models) is a multiple regression with a term for grouped or non-independent observations. Everything in this chapter, namely partial slopes, multicollinearity, model selection, and residual diagnostics, reappears in all three.

From One Predictor to Many

Simple linear regression relates a response \(y\) to a single predictor \(x\). Ecological responses rarely have a single cause. Species richness at a river site responds to elevation, flow, oxygen, and nutrients at the same time, and these act together. Multiple regression estimates the effect of each predictor while holding the others constant, which is what I want when several environmental drivers covary along a gradient.

The model for \(n\) observations and \(p\) predictors is

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2). \]

Each \(\beta_j\) is a partial regression coefficient: the expected change in \(y\) for a one-unit increase in \(x_j\) when every other predictor is held fixed. That conditional phrase is the whole point of multiple regression and the source of most of its difficulties, because predictors that are themselves correlated cannot really be varied one at a time.

The model depends on a small set of assumptions (remembered as LINE plus one):

  • Linearity: the mean of \(y\) is a linear function of the predictors.
  • Independence: the residuals are independent of one another (no spatial, temporal, or grouped structure).
  • Normality: the residuals are normally distributed.
  • Equal variance: the residual variance is constant across fitted values (homoscedasticity).
  • No perfect collinearity: no predictor is an exact linear combination of the others, and ideally predictors are not even strongly correlated.

The diagnostics later in this chapter exist to check the first four, and the variance inflation factor (VIF) checks the fifth.

Question the Doubs River Data

I’ll model the species richness of the Doubs fish community, that is, the number of species recorded at each site, as a function of the environmental variables. Richness is an actual count, which makes it a slightly unsuitable fit for a model that assumes normal errors, but I use it here regardless. It lets me teach the full multiple-regression workflow on familiar data, and then I’ll motivate for the GLM chapter, where the same response is modelled more correctly as a count.

library(tidyverse)
library(vegan)
library(car) # vif()
library(performance) # check_model(), model checks

load(here::here(
  "data",
  "BCB743",
  "NEwR-2ed_code_data",
  "NEwR2-Data",
  "Doubs.RData"
))

# one site has no fish; drop it from both tables
spe <- spe[rowSums(spe) > 0, ]
env <- env[rownames(spe), ]

# the response: number of fish species per site
dat <- data.frame(richness = specnumber(spe), env)
head(dat)
  richness  dfs ele  slo  dis  pH har  pho  nit  amm  oxy bod
1        1  0.3 934 48.0 0.84 7.9  45 0.01 0.20 0.00 12.2 2.7
2        3  2.2 932  3.0 1.00 8.0  40 0.02 0.20 0.10 10.3 1.9
3        4 10.2 914  3.7 1.80 8.3  52 0.05 0.22 0.05 10.5 3.5
4        8 18.5 854  3.2 2.53 8.0  72 0.10 0.21 0.00 11.0 1.3
5       11 21.5 849  2.3 2.64 8.1  84 0.38 0.52 0.20  8.0 6.2
6       10 32.4 846  3.2 2.86 7.9  60 0.20 0.15 0.00 10.2 5.3

The environmental variables are distance from source (dfs), elevation (ele), slope (slo), discharge (dis), pH, hardness (har), and five water-quality measures: phosphate (pho), nitrate (nit), ammonium (amm), dissolved oxygen (oxy), and biological oxygen demand (bod). The Correlations chapter already showed that several of these are tightly correlated, because they are joint expressions of the same upstream-downstream gradient. That correlation is about to cause trouble.

Fitting and Interpreting the Model

I begin with a model containing a wide set of predictors:

full_model <- lm(
  richness ~ ele + slo + oxy + nit + amm + bod + dfs + dis,
  data = dat
)
summary(full_model)

Call:
lm(formula = richness ~ ele + slo + oxy + nit + amm + bod + dfs +
    dis, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-7.4643 -2.8023 -0.5152  2.5369  6.7268

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.539389  27.252175   0.864    0.398
ele          -0.003545   0.016562  -0.214    0.833
slo          -0.082652   0.127978  -0.646    0.526
oxy          -1.131520   1.270678  -0.890    0.384
nit           3.368938   2.413831   1.396    0.178
amm         -10.530408   9.941835  -1.059    0.302
bod          -1.152535   0.857853  -1.344    0.194
dfs           0.024522   0.052548   0.467    0.646
dis          -0.015888   0.224961  -0.071    0.944

Residual standard error: 4.507 on 20 degrees of freedom
Multiple R-squared:  0.7825,    Adjusted R-squared:  0.6955
F-statistic: 8.995 on 8 and 20 DF,  p-value: 3.501e-05

At the bottom, the F-statistic and its p-value test the whole model against an intercept-only model: is this set of predictors, taken together, better than the mean? The multiple \(R^2\) is the proportion of variance in richness explained, and the adjusted \(R^2\) penalises that figure for the number of predictors, so it is the correct measure to quote and to compare between models. Each coefficient row gives a partial slope, its standard error, and a t-test of whether that slope differs from zero with all other predictors held constant.

NoteA significant model with no significant predictors

Notice that the model explains a large share of the variance, yet none of the individual predictors are significant. That combination, a strong overall fit with weak individual t-tests, is the classic sign of multicollinearity: the predictors are so correlated that the model cannot decide which of them deserves the credit, so it spreads the uncertainty across all of them and inflates their standard errors.

Multicollinearity and the Variance Inflation Factor

When predictors are correlated, their estimated coefficients become unstable: small changes in the data produce large swings in the slopes, the standard errors widen, and the partial slopes can even take the wrong sign. The variance inflation factor (VIF) quantifies the problem for each predictor by measuring how much the variance of a coefficient is inflated by its correlation with the other predictors. A VIF of 1 means no inflation; values above 10 (some authors use 5) suggest collinearity serious enough to act on.

vif(full_model)
      ele       slo       oxy       nit       amm       bod       dfs       dis
27.491252  1.751504 10.841346 16.159445 20.236779 15.354992 73.719434 22.550991 

Several predictors have excessive VIFs, with distance from source the worst offender, because elevation, discharge, and distance from source are essentially three measurements of position along the river. I prune the predictor set by removing the variable with the highest VIF, refitting, and repeating until every VIF falls below 10:

preds <- c("ele", "slo", "oxy", "nit", "amm", "bod", "dfs", "dis")
repeat {
  m <- lm(reformulate(preds, "richness"), data = dat)
  v <- vif(m)
  if (max(v) <= 10) {
    break
  }
  worst <- names(which.max(v))
  message("dropping ", worst, " (VIF = ", round(max(v), 1), ")")
  preds <- setdiff(preds, worst)
}
preds
[1] "ele" "slo" "oxy" "nit" "bod" "dis"

The procedure removes distance from source and ammonium. The reduced model retains predictors that are no longer dangerously collinear:

reduced_model <- lm(reformulate(preds, "richness"), data = dat)
vif(reduced_model)
     ele      slo      oxy      nit      bod      dis
7.965885 1.422210 4.673976 4.532890 4.150025 4.807107 
summary(reduced_model)

Call:
lm(formula = reformulate(preds, "richness"), data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9831 -3.2754  0.0006  2.5290  8.4815

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.860430  14.035666   3.552  0.00178 **
ele         -0.018751   0.008867  -2.115  0.04603 *
slo         -0.019662   0.114703  -0.171  0.86547
oxy         -2.127030   0.829852  -2.563  0.01773 *
nit          1.158802   1.271582   0.911  0.37201
bod         -2.067629   0.443585  -4.661  0.00012 ***
dis          0.022579   0.103307   0.219  0.82901
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.483 on 22 degrees of freedom
Multiple R-squared:  0.7633,    Adjusted R-squared:  0.6988
F-statistic: 11.83 on 6 and 22 DF,  p-value: 6.243e-06

The adjusted \(R^2\) is essentially unchanged. This is the lesson of collinearity: the dropped variables carried almost no unique information, so losing them costs nothing in fit while it greatly stabilises the coefficients. Dropping a collinear predictor does not throw away ecology; it removes a redundant proxy for a gradient that other variables already capture.

ImportantVIF pruning is mechanical, not ecological

Automatic VIF pruning keeps whichever correlated variable happens to be retained when the others are dropped, which need not be the most interpretable one. In practice you should let ecology guide the final choice: if distance from source is the variable you can interpret and measure most reliably, keep it and drop its correlates instead. The algorithm tells you that a set is collinear; it does not tell you which member to keep.

As a rule, start your model building sequence by allowing only variables to enter which you selected based on ecological/explanatory grounds. Let VIF deal with any remaining collinearity.

Model Selection

The reduced model avoids collinearity but still contains predictors that contribute little. Model selection seeks the most parsimonious model that fits well, trading goodness of fit against complexity. The Akaike Information Criterion (AIC) formalises that trade-off by rewarding fit and penalising the number of parameters. Lower values are better. Stepwise selection adds and removes terms to minimise AIC:

library(MASS)
final_model <- stepAIC(reduced_model, direction = "both", trace = FALSE)
summary(final_model)

Call:
lm(formula = richness ~ ele + oxy + bod, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4496 -3.2366  0.0016  2.2652  8.2195

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.274845   8.127307   6.924 2.94e-07 ***
ele         -0.023651   0.003321  -7.121 1.83e-07 ***
oxy         -2.353142   0.690590  -3.407  0.00223 **
bod         -1.978063   0.383839  -5.153 2.51e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.284 on 25 degrees of freedom
Multiple R-squared:  0.7544,    Adjusted R-squared:  0.7249
F-statistic:  25.6 on 3 and 25 DF,  p-value: 8.623e-08

Selection settles on a compact model in which richness declines with elevation, with dissolved oxygen, and with biological oxygen demand: richness is highest at low-elevation, lowland sites and is depressed both in the cold, highly oxygenated headwaters and at organically polluted sites. All three predictors are now strongly significant, and the adjusted \(R^2\) has improved despite the model being simpler.

WarningTreat stepwise selection with suspicion

Stepwise procedures are convenient but widely criticised. They inflate \(R^2\), bias p-values and standard errors (the final model is reported as if the predictors had been chosen in advance, which they were not), and different algorithms can select different models from the same data. Use stepwise selection to suggest a model, never to prove one. Where you have competing a priori hypotheses, compare a small set of candidate models on AIC directly, or report the full model you decided on for ecological reasons. The Model Building chapter discusses multi-model inference as the more defensible alternative.

Nested models, where one model’s predictors are a subset of another’s, can be compared directly with an F-test:

anova(final_model, reduced_model)
Analysis of Variance Table

Model 1: richness ~ ele + oxy + bod
Model 2: richness ~ ele + slo + oxy + nit + bod + dis
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     25 458.77
2     22 442.06  3    16.707 0.2772 0.8412

A non-significant result means the extra predictors in the larger model do not explain significantly more variance, so the smaller model is preferred on grounds of parsimony.

Regression Diagnostics

A model can fit well by \(R^2\) and still violate the assumptions that make its p-values and confidence intervals meaningful. Diagnostics are not optional. They are how you find out whether the LINE assumptions hold, whether a few points are driving the result, and whether the model you have is the model you should report. The four standard diagnostic plots are produced by calling plot() on the fitted model:

par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))
Figure 1: The four standard diagnostic plots for the final richness model.

Each panel checks a different assumption:

  • Residuals vs Fitted (top left) checks linearity and equal variance. The residuals should scatter randomly around the horizontal line with no curvature and no funnel shape. Curvature means a predictor enters non-linearly (a job for a GAM); a funnel means the variance grows with the mean (a job for a GLM or a transformation).
  • Q-Q Residuals (top right) checks normality. The points should lie on the diagonal; systematic departures at the ends indicate heavy tails or skew.
  • Scale-Location (bottom left) is a second, more sensitive check of equal variance. The red line should be roughly flat.
  • Residuals vs Leverage (bottom right) finds influential points. Points beyond the dashed Cook’s distance contours have enough leverage and large enough residuals to be moving the fit on their own.

I can put numbers to the visual impressions. A formal test of residual normality and a check of influence:

shapiro.test(residuals(final_model)) # normality of residuals

    Shapiro-Wilk normality test

data:  residuals(final_model)
W = 0.97044, p-value = 0.5718
cooks <- cooks.distance(final_model)
which(cooks > 4 / nrow(dat)) # points above the 4/n rule of thumb
6 9
6 8 

For this model the residuals are acceptably normal and one site is moderately influential without dominating the fit. An influential point should never be deleted reflexively. Inspect it, decide whether it is an error or a genuine extreme observation, refit with and without it, and report whether the conclusions change.

The performance package complements the plots with explicit, formal tests of the individual assumptions, each returning a plain-language verdict:

check_normality(final_model) # Shapiro-Wilk on the residuals
OK: residuals appear as normally distributed (p = 0.654).
check_heteroscedasticity(final_model) # constant-variance (Breusch-Pagan) test
OK: Error variance appears to be homoscedastic (p = 0.540).
check_collinearity(final_model) # VIF, in tidy form
# Check for Multicollinearity

Low Correlation

 Term  VIF   VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
  ele 1.22 [1.03, 2.43]     1.11      0.82     [0.41, 0.97]
  oxy 3.54 [2.28, 6.06]     1.88      0.28     [0.17, 0.44]
  bod 3.40 [2.20, 5.81]     1.84      0.29     [0.17, 0.45]
NoteWhat to do when diagnostics fail
  • Curvature in the residuals: add a polynomial or, better, fit a smooth term in a GAM.
  • Funnel-shaped residuals / variance tied to the mean: the response is probably a count or a proportion; fit a GLM with the appropriate error family rather than transforming.
  • Non-normal residuals: a transformation of the response may help, but for counts and presence-absence data a GLM is the obvious fix.
  • Non-independent residuals (sites grouped by river, plot, or region): a mixed model with a random effect for the grouping.
  • A single influential point: investigate it; refit without it and report the sensitivity.

Prediction and Reporting

A fitted model predicts the response at new predictor values, with intervals that express uncertainty:

new_site <- data.frame(ele = 300, oxy = 9, bod = 3)
final_terms <- attr(terms(final_model), "term.labels")
stopifnot(setequal(final_terms, names(new_site)))
predict(final_model, new_site, interval = "confidence")
       fit      lwr      upr
1 22.06713 19.12794 25.00631

The confidence interval expresses uncertainty about the mean richness at such a site; a prediction interval (use interval = "prediction") is wider because it also includes the scatter of individual sites about that mean.

When you write up a multiple regression, report enough evidence to support the model: the final model and how you arrived at it, the partial coefficients with their standard errors or confidence intervals, the adjusted \(R^2\) and the overall F-test, a statement that the assumptions were checked (with the diagnostics in a supplement if needed), and the VIFs or a note that collinearity was addressed. A coefficient without its uncertainty, or an \(R^2\) without a diagnostic check, is not a result.

When Multiple Regression Is Not Enough

The richness model in this chapter is reasonable: its residuals are approximately normal and its variance is reasonably constant. But richness is a count bounded below by zero, and as soon as counts are small, or the response is a proportion or a presence-absence record, the normal-error assumption fails in ways no transformation fully repairs. The next chapters remove the three restrictions of MLR in turn:

  • the GLM replaces the normal error and the identity link with error families and link functions suited to counts, proportions, and binary data;
  • the GAM replaces straight-line predictors with smooth curves, for responses that rise and fall along a gradient;
  • the mixed model adds random effects for grouped or nested data, where the independence assumption fails.

All three are multiple regression at the core. The partial slopes, the collinearity checks, the model selection, and the residual diagnostics you have learned here apply, with small modifications, to every one of them.

References

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {17: {Multiple} {Linear} {Regression} {(MLR)}},
  date = {2026-06-15},
  url = {https://tangledbank.netlify.app/BCB743/multiple_regression.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 17: Multiple Linear Regression (MLR). https://tangledbank.netlify.app/BCB743/multiple_regression.html.