library(tidyverse)
library(vegan)
library(mgcv) # gam(), the reference implementation
load(here::here(
"data",
"BCB743",
"NEwR-2ed_code_data",
"NEwR2-Data",
"Doubs.RData"
))
spe <- spe[rowSums(spe) > 0, ]
env <- env[rownames(spe), ]
dat <- data.frame(
richness = specnumber(spe),
Babl = spe$Babl, # abundance of the stone loach, Barbatula barbatula
env
)19: Generalised Additive Models (GAM)
| Type | Name | Link |
|---|---|---|
| Prior | Generalised Linear Models | GLM |
| Data | The Doubs River data | 💾 Doubs.RData |
A GLM still draws a straight line, just on the link scale rather than the response scale. That is often wrong for ecological gradients. The whole reason I preferred correspondence analysis over PCA was that species respond to gradients unimodally: abundance rises to a peak at a preferred position and falls away on either side. A straight line cannot describe a hump. I could add a quadratic term by hand, but that forces a symmetric parabola and still requires me to guess the shape in advance.
A generalised additive model (GAM) removes the guesswork. It keeps the error families and link functions of the GLM but replaces each straight-line term \(\beta_j x_j\) with a smooth function \(f_j(x_j)\) whose shape is estimated from the data:
\[ g(\mu) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p). \]
The model decides for itself whether a predictor acts linearly, as a hump, as a threshold, or as something more complex, while a penalty stops it from chasing noise. You have already used a GAM without knowing it: the ordisurf() response surfaces in the ordination chapters fit precisely this kind of smooth onto an ordination diagram.
What Is a Smooth?
A smooth \(f(x)\) is built as a weighted sum of simple basis functions (here, thin-plate or cubic regression splines). On its own, a flexible basis would overfit, so a GAM adds a wiggliness penalty and chooses the weights by penalised likelihood. The strength of the penalty is selected automatically; I use restricted maximum likelihood (method = "REML"), which is the most reliable choice.
The flexibility actually used by a fitted smooth is summarised by its effective degrees of freedom (edf). An edf of 1 means the smooth has collapsed to a straight line; an edf of 2 is roughly a quadratic; larger values mean more curvature. The edf is the GAM’s main diagnostic because it tells you how non-linear each predictor turned out to be.
A First GAM: the Line Becomes a Curve
In the GLM chapter I modelled richness against the environment with straight lines on the log scale. Compare a linear Poisson GLM with a GAM in which distance from source enters as a smooth:
df AIC
glm_linear 2.000000 209.9142
gam_smooth 7.516463 173.3015
Family: poisson
Link function: log
Formula:
richness ~ s(dfs, k = 8)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.40600 0.05997 40.12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(dfs) 6.009 6.658 100.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.741 Deviance explained = 75.6%
-REML = 94.273 Scale est. = 1 n = 29
The GAM’s AIC is far lower, the smooth term is highly significant, and its edf is well above 1, so the richness-gradient relationship is strongly non-linear. The picture makes the difference obvious:
grid <- data.frame(dfs = seq(min(dat$dfs), max(dat$dfs), length.out = 200))
grid$glm <- predict(glm_linear, grid, type = "response")
grid$gam <- predict(gam_smooth, grid, type = "response")
ggplot(dat, aes(dfs, richness)) +
geom_point(alpha = 0.6) +
geom_line(
data = grid,
aes(dfs, glm),
linetype = "dashed",
colour = "grey40"
) +
geom_line(
data = grid,
aes(dfs, gam),
colour = "steelblue4",
linewidth = 0.8
) +
labs(x = "Distance from source (km)", y = "Species richness")Richness climbs steeply in the upper river, plateaus through the middle reaches, and dips near the mouth, a shape the straight line averages away entirely.
Reading a Smooth
A GAM summary() has two blocks: the parametric terms (the intercept and any ordinary linear predictors) and the smooth terms, reported by their edf and an approximate significance test. The smooth itself is best read from a plot of its partial effect, the contribution of that predictor on the link scale with the others held at their averages:
The y-axis is centred on zero because a smooth carries no intercept of its own; what matters is the shape and where the confidence band excludes zero. Wide bands at the ends, where data are sparse, are the expression of uncertainty.
Several Smooths at Once
Like multiple regression, a GAM can hold several predictors, each with its own smooth. The model adds their partial effects:
Family: poisson
Link function: log
Formula:
richness ~ s(ele, k = 6) + s(oxy, k = 6) + s(bod, k = 6)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.36900 0.06196 38.23 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(ele) 1.000 1.000 53.529 < 2e-16 ***
s(oxy) 3.243 3.820 16.013 0.00916 **
s(bod) 2.623 3.163 9.075 0.02531 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.878 Deviance explained = 87.7%
-REML = 82.521 Scale est. = 1 n = 29
Note the elevation smooth: its edf is exactly 1, so the GAM has decided that richness responds linearly to elevation and has drawn a straight line of its own accord. This is a safety feature of penalised smooths, i.e., given a linear relationship, a GAM recovers it rather than inventing curvature. Oxygen and biological oxygen demand, by contrast, are really non-linear.
Choosing and Checking the Basis Dimension
The argument k sets the maximum flexibility (the basis dimension), and the penalty then decides how much of it to use. You do not tune k to the “right” value, you set it generously and check that it was large enough. k.check() (and the fuller gam.check()) report, for each smooth, the basis dimension k', the edf used, and a test of whether residual pattern remains:
k' edf k-index p-value
s(ele) 5 1.000141 0.3762122 0.0000
s(oxy) 5 3.243444 1.1518184 0.7700
s(bod) 5 2.623360 0.8951831 0.2475
The rule is that a smooth needs a larger k only if its edf is close to k' and the k-index is below 1 with a small p-value. Here the oxygen and BOD smooths are comfortably below their basis dimension, so the basis is adequate. The elevation smooth shows a low p-value, but because its edf (1) is nowhere near k', the flag is spurious: a linear smooth cannot benefit from a larger basis. gam.check() additionally produces residual plots, which should be read like those of any GLM.
Concurvity: the GAM’s Multicollinearity
The multiple regression chapter warned that correlated predictors destabilise a model, and measured the problem with the VIF. The GAM has the same problem in a more general form, concurvity. That is, one smooth can be approximated by a combination of the others, not just linearly but through any curve. The concurvity() function measures it on a 0-to-1 scale, where values approaching 1 are dangerous:
para s(ele) s(oxy) s(bod)
worst 3.573937e-22 0.8135761 0.9635900 0.9575216
observed 3.573937e-22 0.4495871 0.5055429 0.6924397
estimate 3.573937e-22 0.4419147 0.8977509 0.8307041
The values here are high, because elevation, oxygen, and biological oxygen demand are all expressions of the same upstream-downstream gradient, which is the same collinearity I diagnosed with the VIF. As with the VIF, high concurvity means the individual smooths cannot be interpreted in isolation: the model still predicts well, but you should not claim that one of these correlated drivers matters and another does not.
Recovering the Unimodal Response
The motivating idea was that species respond to gradients as humps. I can see a GAM reconstruct one directly. The stone loach occurs at most sites, so its smooth is well supported:
[1] 0.4833824
grid2 <- data.frame(dfs = seq(min(dat$dfs), max(dat$dfs), length.out = 200))
grid2$fit <- predict(gam_loach, grid2, type = "response")
ggplot(dat, aes(dfs, Babl)) +
geom_point(alpha = 0.6) +
geom_line(
data = grid2,
aes(dfs, fit),
colour = "seagreen4",
linewidth = 0.8
) +
labs(x = "Distance from source (km)", y = "Stone loach abundance")The fitted curve rises to a peak partway down the river and declines towards both ends, which is the unimodal species-response model. The position of the peak is the species’ gradient optimum, and the width of the hump is its tolerance, the same quantities that correspondence analysis estimates as weighted averages. A GAM lets me see, for one species at a time, the response shape that ordination summarises for the whole community at once.
Prediction, Overfitting, and Smoothing
Predictions come from predict(), on the link scale by default or with type = "response" for the original units, and a smooth term carries a confidence band that widens where data are thin. The penalty is what keeps a GAM on track, and without it, a flexible basis would interpolate the noise. Two cautions remain. First, a smooth with a high edf fitted to few data points can still overfit; treat dramatic wiggles supported by a handful of observations with suspicion, and lean on REML rather than the older GCV criterion, which can undersmooth. Second, the smooths are descriptive, not mechanistic: a GAM tells you that richness peaks in the middle river, not why.
Where GAMs Lead
The additive smooth is versatile. Set bs = "re" and a smooth becomes a random effect, which takes us to the next chapter: a GAM with a random-effect smooth is a generalised additive mixed model. The smooth surfaces you fitted with ordisurf() in the ordination chapters were GAMs projected onto ordination axes, and the same mgcv function underlies spatial models that smooth over geographic coordinates. Once you can read an edf and a partial-effect plot, a large part of modern ecological modelling is open to you. The mixed models chapter takes the final step, adding structure for data that are grouped or non-independent.
References
Reuse
Citation
@online{smit2026,
author = {Smit, A. J. and J. Smit, A.},
title = {19: {Generalised} {Additive} {Models} {(GAM)}},
date = {2026-06-12},
url = {https://tangledbank.netlify.app/BCB743/gam.html},
langid = {en}
}