library(tidyverse)
library(vegan)
library(performance) # check_overdispersion(), r2()
# negative-binomial models use MASS::glm.nb(); we call it by namespace rather
# than attaching MASS, because MASS::select() would mask dplyr::select()
load(here::here(
"data",
"BCB743",
"NEwR-2ed_code_data",
"NEwR2-Data",
"Doubs.RData"
))
spe <- spe[rowSums(spe) > 0, ]
env <- env[rownames(spe), ]
dat <- data.frame(
richness = specnumber(spe), # a count: number of species per site
total = rowSums(spe), # total community abundance per site
trout = as.integer(spe$Satr > 0), # presence-absence of brown trout
env
)18: Generalised Linear Models (GLM)
| Type | Name | Link |
|---|---|---|
| Prior | Multiple Linear Regression | Multiple Regression |
| Data | The Doubs River data | 💾 Doubs.RData |
The Multiple Regression chapter ended by suggesting a MLR isn’t ideally suited to the Doubs data: species richness is a count, bounded below by zero, yet I modelled it with a method that assumes normally distributed errors of constant variance. For that particular response the model didn’t perform too badly, but ecological data routinely break the normal-error assumption in ways no transformation fully repairs. Counts pile up near zero and have a variance that grows with the mean. Presence-absence data take only the values 0 and 1. Proportions are confined to the interval \([0, 1]\). A straight line with normal errors can predict negative counts and impossible probabilities, and its p-values cannot be trusted when the variance is tied to the mean.
A generalised linear model (GLM) keeps the linear-predictor core of multiple regression but replaces the normal error with an error distribution chosen to match the response, and inserts a link function that connects the linear predictor to the mean of that distribution. Everything you learned about partial slopes, multicollinearity, model selection, and reporting applies here too, and only two new ideas are added.
The Three Components of a GLM
Every GLM has three parts:
- A random component: the probability distribution of the response, chosen from the exponential family (Normal, Poisson, binomial, Gamma, and others). This fixes the mean-variance relationship.
- A systematic component: the linear predictor \(\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\), exactly as in multiple regression.
- A link function \(g\) that connects the two, \(g(\mu) = \eta\), where \(\mu\) is the mean of the response. The link lets a linear predictor that ranges over all real numbers map onto a mean that may be constrained to be positive (counts) or to lie in \([0,1]\) (probabilities).
The common choices in ecology are:
| Response | Distribution | Default link | Mean constrained to |
|---|---|---|---|
| Continuous, symmetric | Normal (Gaussian) | identity | all real numbers |
| Counts | Poisson | log | \(> 0\) |
| Overdispersed counts | negative binomial | log | \(> 0\) |
| Presence-absence, proportions | binomial | logit | \([0, 1]\) |
| Positive, right-skewed (e.g. biomass) | Gamma | log | \(> 0\) |
Ordinary multiple regression is simply the GLM with a Normal distribution and an identity link, so a GLM is a generalisation rather than a different method.
Poisson Regression for Counts
I model species richness, the same count response as in the regression chapter, with a Poisson GLM and its default log link:
Call:
glm(formula = richness ~ ele + oxy + nit + bod, family = poisson,
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.4745815 0.6237394 8.777 < 2e-16 ***
ele -0.0019339 0.0003439 -5.624 1.87e-08 ***
oxy -0.1562692 0.0513775 -3.042 0.00235 **
nit 0.0671302 0.0682937 0.983 0.32563
bod -0.1609374 0.0309583 -5.199 2.01e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 154.386 on 28 degrees of freedom
Residual deviance: 38.676 on 24 degrees of freedom
AIC: 169.34
Number of Fisher Scoring iterations: 5
The coefficients live on the log scale, because the link is the logarithm. A coefficient there is not a change in richness but a change in log richness, so it is easiest to interpret after exponentiating:
On the response scale the effects are multiplicative: each one-unit increase in a predictor multiplies the expected richness by \(e^{\beta}\). A value above 1 means richness increases, below 1 means it decreases. Dissolved oxygen and biological oxygen demand both multiply expected richness by appreciably less than 1, so richness falls in cold, highly oxygenated headwaters and at organically polluted sites, the same ecological story the linear model told, now expressed in a way that can never predict a negative richness.
Overdispersion
The Poisson distribution makes a strong, testable assumption, which is that the variance equals the mean. Real ecological counts are often more variable than that, a condition called overdispersion, usually caused by unmeasured heterogeneity among sites or by clustering of individuals. Overdispersion does not bias the coefficients, but it makes the standard errors too small, so p-values look more significant than they are. Always check for it.
The quick check is the ratio of the residual deviance to its degrees of freedom, which should be near 1. The performance package gives a formal test:
[1] 1.611507
# Overdispersion test
dispersion ratio = 1.720
Pearson's Chi-Squared = 41.271
p-value = 0.016
The dispersion ratio here is modestly above 1, so there is mild overdispersion worth addressing. Two standard remedies exist.
Quasi-Poisson
The quasi-Poisson family estimates a dispersion parameter from the data and inflates the standard errors by its square root. The coefficients are unchanged; only their uncertainty grows:
[1] 1.719635
Estimate Std. Error
(Intercept) 5.474581476 0.8179397070
ele -0.001933851 0.0004509401
oxy -0.156269204 0.0673738330
nit 0.067130215 0.0895568041
bod -0.160937354 0.0405971027
Negative binomial
The negative binomial distribution adds a parameter, \(\theta\), that explicitly models the extra variance, and unlike the quasi-Poisson it is a proper likelihood, so it can be compared with AIC:
[1] 105.557
df AIC
pois 5 169.3352
negbin 6 171.1670
Here \(\theta\) is large and the negative binomial AIC barely improves on the Poisson, confirming that the overdispersion is mild and the conclusions are robust to it. That is a good outcome which you can only establish it by checking it specifically. When the dispersion ratio is large (say above 2) and \(\theta\) is small, the negative binomial standard errors are substantially wider than the Poisson ones, and reporting the Poisson model would be an error.
If a count response has far more zeros than a Poisson or negative binomial can produce (many sites where a species is simply absent for reasons unrelated to abundance), a zero-inflated or hurdle model is the appropriate tool (pscl::zeroinfl(), glmmTMB()). Diagnose this by comparing the observed number of zeros with the number the fitted model predicts.
Offsets: Modelling Rates Rather Than Counts
Sometimes the count you model was collected over different amounts of effort, area, or time, or you want a count relative to some total. Dividing the count to make a rate and then modelling it loses the count structure; the correct device is an offset, a predictor whose coefficient is fixed at 1. To model brown-trout abundance as a share of the whole fish community, I offset by the log of total community abundance:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.152473997 0.5305267417 -5.942159 2.812932e-09
ele 0.003421698 0.0005145177 6.650301 2.924933e-11
bod -0.380034906 0.0986347558 -3.852951 1.167026e-04
The model now describes the trout’s expected share of the community: that share rises with elevation and falls with organic pollution, which places brown trout clearly in the clean, cool, upstream reaches. Because total abundance enters through the offset rather than as an ordinary predictor, its coefficient is not estimated; it simply rescales each site’s expected count to its community size.
Logistic Regression for Presence-Absence
When the response is binary, like presence or absence, the binomial family with a logit link is the standard tool. I model the presence of brown trout:
Call:
glm(formula = trout ~ ele + bod, family = binomial, data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.802472 2.272496 0.793 0.4277
ele 0.007018 0.003109 2.258 0.0240 *
bod -1.055975 0.591533 -1.785 0.0742 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 39.336 on 28 degrees of freedom
Residual deviance: 18.725 on 26 degrees of freedom
AIC: 24.725
Number of Fisher Scoring iterations: 7
Logistic coefficients are log-odds. Exponentiating gives odds ratios, the factor by which the odds of presence change per unit of predictor:
Each unit of biological oxygen demand multiplies the odds of finding trout by roughly one third, a strong negative effect of organic pollution, while higher elevation slightly raises the odds. Odds ratios are still not the most intuitive quantity, so it is usually clearer to predict on the probability scale across a gradient of one predictor:
newdat <- data.frame(
ele = seq(min(dat$ele), max(dat$ele), length.out = 100),
bod = mean(dat$bod)
)
newdat$prob <- predict(logit, newdat, type = "response")
ggplot(newdat, aes(ele, prob)) +
geom_line(colour = "steelblue4", linewidth = 0.8) +
geom_rug(data = dat, aes(x = ele), sides = "b", inherit.aes = FALSE) +
labs(x = "Elevation (m)", y = "P(trout present)")A useful summary of fit for a logistic model is Tjur’s coefficient of discrimination, an \(R^2\) analogue that contrasts the mean predicted probability for present and absent sites:
Inference and Model Selection in GLMs
The t-tests of multiple regression become z-tests (Wald tests) in the summary() of a GLM, but Wald tests behave poorly in small samples and for coefficients near separation. The more reliable test compares nested models by the change in deviance, which follows a \(\chi^2\) distribution. drop1() performs these likelihood-ratio tests for each term:
Single term deletions
Model:
trout ~ ele + bod
Df Deviance AIC LRT Pr(>Chi)
<none> 18.725 24.725
ele 1 27.023 31.023 8.2983 0.003968 **
bod 1 25.614 29.614 6.8890 0.008673 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both predictors are clearly significant by the likelihood-ratio test, which here is the more trustworthy finding. For choosing among non-nested candidate models, AIC works exactly as it did for linear models (lower is better), with the same warning against treating automatic stepwise selection as confirmatory rather than exploratory.
Diagnostics for GLMs
The residuals-vs-fitted method of linear models does not transfer cleanly to GLMs, because the residuals are expected to be heteroscedastic from the onset. Use deviance residuals and read the plots with that in mind:
The most important GLM-specific checks are the overdispersion test shown earlier and, for counts, a comparison of observed and predicted zeros. For binary responses the ordinary residual plots are nearly uninformative (residuals can only take two patterns), so judge a logistic model by its discrimination (Tjur’s \(R^2\), or the area under the ROC curve) and by binned residuals, which average residuals within groups of similar fitted probability.
The most general GLM diagnostic is the simulated quantile residual implemented in the DHARMa package. It transforms the residuals of any GLM (or mixed model) onto a uniform scale where the usual visual checks become valid, and it provides direct tests for overdispersion, zero-inflation, and departures from the assumed distribution. If DHARMa is available, DHARMa::simulateResiduals(model) followed by plot() is the recommended check for every GLM in this module.
Reporting a GLM
Report the error family and link, the coefficients on an interpretable scale (exponentiated for log and logit links, with their confidence intervals), the result of the overdispersion check for counts, a measure of fit (deviance explained, or a pseudo-\(R^2\) such as Tjur’s for logistic models), and the likelihood-ratio tests of the terms you discuss. State which family you chose and why; “a Poisson GLM with a log link” tells a reader more than “a regression”.
Beyond the GLM
A GLM still assumes that each predictor acts as a straight line on the link scale, and that the observations are independent. The next two chapters relax those assumptions in turn. The GAM replaces the straight line on the link scale with a smooth curve, which is what a species’ unimodal response along a gradient requires. The mixed model adds random effects for grouped or nested data, and in its generalised form (the GLMM) carries the error families of this chapter into designs where the independence assumption fails.
References
Reuse
Citation
@online{smit2026,
author = {Smit, A. J. and J. Smit, A.},
title = {18: {Generalised} {Linear} {Models} {(GLM)}},
date = {2026-06-12},
url = {https://tangledbank.netlify.app/BCB743/glm.html},
langid = {en}
}