---
date: last-modified
title: "17: Multiple Linear Regression (MLR)"
format:
html:
code-link: false
---
```{r code-brewing-opts, echo=FALSE}
knitr::opts_chunk$set(
comment = "R>",
warning = FALSE,
message = FALSE,
fig.width = 4.5,
fig.height = 2.625,
out.width = "75%",
fig.asp = NULL, # control via width/height
dpi = 300
)
ggplot2::theme_set(
ggplot2::theme_minimal(base_size = 8)
)
ggplot2::theme_set(
ggplot2::theme_bw(base_size = 8)
)
```
::: callout-tip
## **Material Required for This Chapter**
| Type | Name | Link |
| :--------- | :------------------- | :------------------------------------------------------------------------------------------------------ |
| **Theory** | BCB744 Biostatistics | [Ch. 14: Multiple Regression](../BCB744/basic_stats/14-multiple-regression-and-model-specification.qmd) |
| **Data** | The Doubs River data | [💾 `Doubs.RData`](../data/BCB743/NEwR-2ed_code_data/NeWR2-Data/Doubs.RData) |
:::
::: {.callout-important appearance="simple"}
## Tasks to Complete in This Chapter
* The [integrative mini-research project](assessments/BCB743_integrative_assignment.qmd) applies multiple regression (and constrained ordination) to the seaweed data; [Task K](tasks/Task_K.qmd) is the matching practice run.
When you build those models, return to the chapter on [Model Building](model_building.qmd) for the thinking that should guide variable and model selection.
:::
In the [Model Building](model_building.qmd) chapter I argued that choosing a model is a way of thinking. In this chapter I propose the methodological counterpart: I apply the most widely used model in ecology, **multiple linear regression** (MLR), and the diagnostics that tell me whether to trust the fitted model. I use the familiar Doubs data so that the statistics (not the dataset) are the new thing. I then apply the same tools to the seaweed $\beta$-diversity analysis of the [integrative project](assessments/BCB743_integrative_assignment.qmd) and to the [Deep Dive into Gradients](deep_dive.qmd), where I apply multiple regression to real distance-decay data.
Multiple regression is also foundational to the methods in the chapters that follow. A **generalised linear model** ([GLM](glm.qmd)) is a multiple regression for response variables that are not normally distributed, such as counts or presence-absence data. A **generalised additive model** ([GAM](gam.qmd)) is a multiple regression in which the straight lines become smooth curves. A **mixed model** ([mixed models](mixed_models.qmd)) is a multiple regression with a term for grouped or non-independent observations. Everything in this chapter, namely partial slopes, multicollinearity, model selection, and residual diagnostics, reappears in all three.
## From One Predictor to Many
Simple linear regression relates a response $y$ to a single predictor $x$. Ecological responses rarely have a single cause. Species richness at a river site responds to elevation, flow, oxygen, and nutrients at the same time, and these act together. Multiple regression estimates the effect of each predictor *while holding the others constant*, which is what I want when several environmental drivers covary along a gradient.
The model for $n$ observations and $p$ predictors is
$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).
$$
Each $\beta_j$ is a **partial regression coefficient**: the expected change in $y$ for a one-unit increase in $x_j$ when every other predictor is held fixed. That conditional phrase is the whole point of multiple regression and the source of most of its difficulties, because predictors that are themselves correlated cannot really be varied one at a time.
The model depends on a small set of assumptions (remembered as **LINE** plus one):
- **L**inearity: the mean of $y$ is a linear function of the predictors.
- **I**ndependence: the residuals are independent of one another (no spatial, temporal, or grouped structure).
- **N**ormality: the residuals are normally distributed.
- **E**qual variance: the residual variance is constant across fitted values (homoscedasticity).
- **No perfect collinearity**: no predictor is an exact linear combination of the others, and ideally predictors are not even strongly correlated.
The diagnostics later in this chapter exist to check the first four, and the variance inflation factor (VIF) checks the fifth.
## Question the Doubs River Data
I'll model the **species richness** of the Doubs fish community, that is, the number of species recorded at each site, as a function of the environmental variables. Richness is an actual count, which makes it a slightly unsuitable fit for a model that assumes normal errors, but I use it here regardless. It lets me teach the full multiple-regression workflow on familiar data, and then I'll motivate for the [GLM chapter](glm.qmd), where the same response is modelled more correctly as a count.
```{r code-load-data}
library(tidyverse)
library(vegan)
library(car) # vif()
library(performance) # check_model(), model checks
load(here::here(
"data",
"BCB743",
"NEwR-2ed_code_data",
"NEwR2-Data",
"Doubs.RData"
))
# one site has no fish; drop it from both tables
spe <- spe[rowSums(spe) > 0, ]
env <- env[rownames(spe), ]
# the response: number of fish species per site
dat <- data.frame(richness = specnumber(spe), env)
head(dat)
```
The environmental variables are distance from source (`dfs`), elevation (`ele`), slope (`slo`), discharge (`dis`), pH, hardness (`har`), and five water-quality measures: phosphate (`pho`), nitrate (`nit`), ammonium (`amm`), dissolved oxygen (`oxy`), and biological oxygen demand (`bod`). The [Correlations](correlations.qmd) chapter already showed that several of these are tightly correlated, because they are joint expressions of the same upstream-downstream gradient. That correlation is about to cause trouble.
## Fitting and Interpreting the Model
I begin with a model containing a wide set of predictors:
```{r code-full-model}
full_model <- lm(
richness ~ ele + slo + oxy + nit + amm + bod + dfs + dis,
data = dat
)
summary(full_model)
```
At the bottom, the **F-statistic** and its *p*-value test the whole model against an intercept-only model: is this set of predictors, taken together, better than the mean? The **multiple $R^2$** is the proportion of variance in richness explained, and the **adjusted $R^2$** penalises that figure for the number of predictors, so it is the correct measure to quote and to compare between models. Each **coefficient** row gives a partial slope, its standard error, and a *t*-test of whether that slope differs from zero with all other predictors held constant.
::: {.callout-note appearance="simple"}
## A significant model with no significant predictors
Notice that the model explains a large share of the variance, yet none of the individual predictors are significant. That combination, a strong overall fit with weak individual *t*-tests, is the classic sign of **multicollinearity**: the predictors are so correlated that the model cannot decide which of them deserves the credit, so it spreads the uncertainty across all of them and inflates their standard errors.
:::
## Multicollinearity and the Variance Inflation Factor
When predictors are correlated, their estimated coefficients become unstable: small changes in the data produce large swings in the slopes, the standard errors widen, and the partial slopes can even take the wrong sign. The **variance inflation factor** (VIF) quantifies the problem for each predictor by measuring how much the variance of a coefficient is inflated by its correlation with the other predictors. A VIF of 1 means no inflation; values above 10 (some authors use 5) suggest collinearity serious enough to act on.
```{r code-vif-full}
vif(full_model)
```
Several predictors have excessive VIFs, with distance from source the worst offender, because elevation, discharge, and distance from source are essentially three measurements of position along the river. I prune the predictor set by removing the variable with the highest VIF, refitting, and repeating until every VIF falls below 10:
```{r code-vif-prune}
preds <- c("ele", "slo", "oxy", "nit", "amm", "bod", "dfs", "dis")
repeat {
m <- lm(reformulate(preds, "richness"), data = dat)
v <- vif(m)
if (max(v) <= 10) {
break
}
worst <- names(which.max(v))
message("dropping ", worst, " (VIF = ", round(max(v), 1), ")")
preds <- setdiff(preds, worst)
}
preds
```
The procedure removes distance from source and ammonium. The reduced model retains predictors that are no longer dangerously collinear:
```{r code-reduced-model}
reduced_model <- lm(reformulate(preds, "richness"), data = dat)
vif(reduced_model)
summary(reduced_model)
```
The adjusted $R^2$ is essentially unchanged. This is the lesson of collinearity: the dropped variables carried almost no *unique* information, so losing them costs nothing in fit while it greatly stabilises the coefficients. Dropping a collinear predictor does not throw away ecology; it removes a redundant proxy for a gradient that other variables already capture.
::: {.callout-important}
## VIF pruning is mechanical, not ecological
Automatic VIF pruning keeps whichever correlated variable happens to be retained when the others are dropped, which need not be the most interpretable one. In practice you should let ecology guide the final choice: if distance from source is the variable you can interpret and measure most reliably, keep it and drop its correlates instead. The algorithm tells you *that* a set is collinear; it does not tell you *which* member to keep.
As a rule, start your model building sequence by allowing only variables to enter which you selected based on ecological/explanatory grounds. Let VIF deal with any remaining collinearity.
:::
## Model Selection
The reduced model avoids collinearity but still contains predictors that contribute little. **Model selection** seeks the most parsimonious model that fits well, trading goodness of fit against complexity. The **Akaike Information Criterion** (AIC) formalises that trade-off by rewarding fit and penalising the number of parameters. Lower values are better. Stepwise selection adds and removes terms to minimise AIC:
```{r code-step}
library(MASS)
final_model <- stepAIC(reduced_model, direction = "both", trace = FALSE)
summary(final_model)
```
```{r code-detach-mass, echo=FALSE}
# MASS::select() masks dplyr::select(); detach to avoid surprises later
detach("package:MASS")
```
Selection settles on a compact model in which richness declines with elevation, with dissolved oxygen, and with biological oxygen demand: richness is highest at low-elevation, lowland sites and is depressed both in the cold, highly oxygenated headwaters and at organically polluted sites. All three predictors are now strongly significant, and the adjusted $R^2$ has improved despite the model being simpler.
::: {.callout-warning}
## Treat stepwise selection with suspicion
Stepwise procedures are convenient but widely criticised. They inflate $R^2$, bias *p*-values and standard errors (the final model is reported as if the predictors had been chosen in advance, which they were not), and different algorithms can select different models from the same data. Use stepwise selection to *suggest* a model, never to prove one. Where you have competing *a priori* hypotheses, compare a small set of candidate models on AIC directly, or report the full model you decided on for ecological reasons. The [Model Building](model_building.qmd) chapter discusses multi-model inference as the more defensible alternative.
:::
Nested models, where one model's predictors are a subset of another's, can be compared directly with an *F*-test:
```{r code-anova-nested}
anova(final_model, reduced_model)
```
A non-significant result means the extra predictors in the larger model do not explain significantly more variance, so the smaller model is preferred on grounds of parsimony.
## Regression Diagnostics {#sec-diagnostics}
A model can fit well by $R^2$ and still violate the assumptions that make its *p*-values and confidence intervals meaningful. **Diagnostics are not optional.** They are how you find out whether the LINE assumptions hold, whether a few points are driving the result, and whether the model you have is the model you should report. The four standard diagnostic plots are produced by calling `plot()` on the fitted model:
```{r fig-base-diagnostics}
#| fig-width: 7
#| fig-height: 6
#| out-width: "85%"
#| fig-cap: "The four standard diagnostic plots for the final richness model."
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))
```
Each panel checks a different assumption:
- **Residuals vs Fitted** (top left) checks **linearity** and **equal variance**. The residuals should scatter randomly around the horizontal line with no curvature and no funnel shape. Curvature means a predictor enters non-linearly (a job for a [GAM](gam.qmd)); a funnel means the variance grows with the mean (a job for a [GLM](glm.qmd) or a transformation).
- **Q-Q Residuals** (top right) checks **normality**. The points should lie on the diagonal; systematic departures at the ends indicate heavy tails or skew.
- **Scale-Location** (bottom left) is a second, more sensitive check of **equal variance**. The red line should be roughly flat.
- **Residuals vs Leverage** (bottom right) finds **influential points**. Points beyond the dashed Cook's distance contours have enough leverage and large enough residuals to be moving the fit on their own.
I can put numbers to the visual impressions. A formal test of residual normality and a check of influence:
```{r code-diag-numbers}
shapiro.test(residuals(final_model)) # normality of residuals
cooks <- cooks.distance(final_model)
which(cooks > 4 / nrow(dat)) # points above the 4/n rule of thumb
```
For this model the residuals are acceptably normal and one site is moderately influential without dominating the fit. An influential point should never be deleted reflexively. Inspect it, decide whether it is an error or a genuine extreme observation, refit with and without it, and report whether the conclusions change.
The **performance** package complements the plots with explicit, formal tests of the individual assumptions, each returning a plain-language verdict:
```{r code-performance-checks}
check_normality(final_model) # Shapiro-Wilk on the residuals
check_heteroscedasticity(final_model) # constant-variance (Breusch-Pagan) test
check_collinearity(final_model) # VIF, in tidy form
```
::: {.callout-note appearance="simple"}
## What to do when diagnostics fail
- **Curvature** in the residuals: add a polynomial or, better, fit a smooth term in a [GAM](gam.qmd).
- **Funnel-shaped residuals / variance tied to the mean**: the response is probably a count or a proportion; fit a [GLM](glm.qmd) with the appropriate error family rather than transforming.
- **Non-normal residuals**: a transformation of the response may help, but for counts and presence-absence data a GLM is the obvious fix.
- **Non-independent residuals** (sites grouped by river, plot, or region): a [mixed model](mixed_models.qmd) with a random effect for the grouping.
- **A single influential point**: investigate it; refit without it and report the sensitivity.
:::
## Prediction and Reporting
A fitted model predicts the response at new predictor values, with intervals that express uncertainty:
```{r code-predict}
new_site <- data.frame(ele = 300, oxy = 9, bod = 3)
final_terms <- attr(terms(final_model), "term.labels")
stopifnot(setequal(final_terms, names(new_site)))
predict(final_model, new_site, interval = "confidence")
```
The **confidence** interval expresses uncertainty about the *mean* richness at such a site; a **prediction** interval (use `interval = "prediction"`) is wider because it also includes the scatter of individual sites about that mean.
When you write up a multiple regression, report enough evidence to support the model: the final model and how you arrived at it, the partial coefficients with their standard errors or confidence intervals, the adjusted $R^2$ and the overall *F*-test, a statement that the assumptions were checked (with the diagnostics in a supplement if needed), and the VIFs or a note that collinearity was addressed. A coefficient without its uncertainty, or an $R^2$ without a diagnostic check, is not a result.
## When Multiple Regression Is Not Enough
The richness model in this chapter is reasonable: its residuals are approximately normal and its variance is reasonably constant. But richness is a count bounded below by zero, and as soon as counts are small, or the response is a proportion or a presence-absence record, the normal-error assumption fails in ways no transformation fully repairs. The next chapters remove the three restrictions of MLR in turn:
- the [GLM](glm.qmd) replaces the normal error and the identity link with error families and link functions suited to counts, proportions, and binary data;
- the [GAM](gam.qmd) replaces straight-line predictors with smooth curves, for responses that rise and fall along a gradient;
- the [mixed model](mixed_models.qmd) adds random effects for grouped or nested data, where the independence assumption fails.
All three are multiple regression at the core. The partial slopes, the collinearity checks, the model selection, and the residual diagnostics you have learned here apply, with small modifications, to every one of them.
## References
::: {#refs}
:::