18: Generalised Linear Models (GLM)

Task L

Author

Affiliation

Published

2026/06/15

Practice Task

Work through these exercises after reading the Generalised Linear Models chapter. They use two vegan datasets that suit GLMs naturally: the oribatid mite counts (data(mite); data(mite.env)) for Poisson-family models, and the Sibbo island bird incidences (data(sipoo); data(sipoo.map)) for a binomial model. Four exercises are hands-on and two are conceptual. A worked answer is given under each exercise; try it yourself before opening it.

Choose a reasonably common species in mite and fit a Poisson GLM of its count on the mite.env predictors SubsDens, WatrCont and Topo. Report the coefficients on both the link and the response scale (exponentiate), and interpret the sign of each.

Show the answer

library(tidyverse)
library(vegan)
library(MASS)

data(mite); data(mite.env)
focal <- names(which.max(colSums(mite)))      # the most abundant species
dat <- mite.env
dat$y <- mite[[focal]]

m_pois <- glm(y ~ SubsDens + WatrCont + Topo, data = dat, family = poisson)
summary(m_pois)$coefficients

                 Estimate   Std. Error    z value     Pr(>|z|)
(Intercept) -0.1412208705 0.1034926679 -1.3645495 1.723947e-01
SubsDens    -0.0005566142 0.0020313081 -0.2740176 7.840711e-01
WatrCont     0.0074239678 0.0001673936 44.3503581 0.000000e+00
TopoHummock  0.3635814547 0.0579682728  6.2720767 3.562639e-10

exp(coef(m_pois))                              # multiplicative effect on the response (count) scale

(Intercept)    SubsDens    WatrCont TopoHummock 
  0.8682975   0.9994435   1.0074516   1.4384720

The fitted species is LCIL (the most abundant mite). On the link (log) scale the coefficients are additive; exponentiating moves them to the response (count) scale, where each becomes a multiplicative factor on the expected count. A positive log-coefficient (exponentiated value above 1) means the expected count rises with that predictor, a negative one (below 1) means it falls. For this mite the water-content term carries the strongest effect, consistent with moisture being the dominant gradient in this community.

Check the model for overdispersion (compare the residual deviance to its degrees of freedom, or use performance::check_overdispersion()). If it is overdispersed, refit it as a quasi-Poisson and a negative binomial model (MASS::glm.nb()), and compare the standard errors and inferences with the Poisson fit.

Show the answer

disp <- sum(residuals(m_pois, "pearson")^2) / m_pois$df.residual
disp                                           # >> 1 indicates overdispersion

[1] 45.67041

m_qp <- glm(y ~ SubsDens + WatrCont + Topo, data = dat, family = quasipoisson)
m_nb <- glm.nb(y ~ SubsDens + WatrCont + Topo, data = dat)

# compare standard errors of the same coefficients across the three fits
se <- function(m) summary(m)$coefficients[, "Std. Error"]
round(cbind(poisson = se(m_pois), quasipoisson = se(m_qp), negbin = se(m_nb)), 3)

            poisson quasipoisson negbin
(Intercept)   0.103        0.699  0.772
SubsDens      0.002        0.014  0.015
WatrCont      0.000        0.001  0.001
TopoHummock   0.058        0.392  0.380

The dispersion statistic is 45.7 (far above 1), so the counts are strongly overdispersed: their variance greatly exceeds the Poisson assumption that variance equals the mean. The Poisson standard errors are therefore much too small, which makes predictors look more significant than they are. The quasi-Poisson and negative-binomial refits inflate the standard errors to reflect the extra variance (the negative binomial does so by estimating a dispersion parameter, here $\theta$ = 0.55), so several borderline “significant” Poisson terms become non-significant. The point estimates barely change; what changes, and what matters, is the honesty of the uncertainty.

Using sipoo, compute each island’s species richness and fit a Poisson (or negative binomial) GLM of richness on log(area) taken from sipoo.map. The slope on log(area) is the exponent $z$ of the species-area relationship $S = cA^{z}$; report it and comment on its magnitude relative to the typical range of $z \approx 0.2$–$0.35$.

Show the answer

data(sipoo); data(sipoo.map)
rich <- rowSums(sipoo)                                      # bird species per island
area <- sipoo.map[rownames(sipoo), "area"]

sar <- glm(rich ~ log(area), family = poisson)
coef(sar)["log(area)"]                                     # the species-area exponent z

log(area) 
0.4626734

ggplot(tibble(area, rich), aes(area, rich)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = poisson),
              formula = y ~ log(x), se = TRUE) +
  scale_x_log10() +
  labs(x = "Island area (log scale)", y = "Bird species richness")

With a Poisson log link and log(area) as the predictor, the model is $\log S = \log c + z\,\log A$, so the slope is exactly the species-area exponent $z$ in $S = cA^{z}$. The estimated $z$ for the Sibbo islands is 0.46, at or slightly above the upper end of the classic $0.2$–$0.35$ range, which is plausible for a true-island archipelago (isolated islands tend to have steeper species-area slopes than nested mainland samples). Richness rises with area, but with diminishing returns on the arithmetic scale, exactly the curvature the log link captures.

Convert one moderately common sipoo bird species to presence/absence and fit a logistic GLM of its occurrence on log(area). Report the odds ratio for a doubling of island area and the area at which the predicted probability of occurrence is 0.5.

Show the answer

prev <- colSums(sipoo)
fb <- names(which.min(abs(prev - nrow(sipoo) / 2)))       # species closest to 50% prevalence
dfb <- data.frame(pa = sipoo[[fb]], area = area)

m_log <- glm(pa ~ log(area), data = dfb, family = binomial)
b <- coef(m_log)
c(species = fb,
  OR_per_doubling = round(2^b["log(area)"], 2),           # odds multiply by this per doubling of area
  area_p50 = round(exp(-b[1] / b["log(area)"]), 1))       # area where P(occurrence) = 0.5

                  species OR_per_doubling.log(area)      area_p50.(Intercept) 
               "Sylvcurr"                    "2.73"                     "9.2"

The fitted species is Sylvcurr. The logit model is $\text{logit}(p) = \beta_0 + \beta_1\log(\text{area})$, so doubling the area multiplies the odds of occurrence by $2^{\beta_1}$ = 2.73 (comfortably above 1, so larger islands are more likely to hold the species), and presence becomes more likely than absence ($p = 0.5$) above an island area of about 9.2 units. This is island biogeography in a single GLM: occurrence probability rises smoothly with island area along an S-shaped (logistic) curve.

Explain why a Poisson or negative-binomial GLM is more appropriate than ordinary least-squares regression for count data, and what overdispersion means both statistically and ecologically.

Show the answer

Counts are non-negative integers whose variance grows with their mean, and they are often concentrated near zero. Ordinary least squares assumes the opposite, namely a continuous response with constant-variance, normally distributed errors, so it can predict impossible negative counts, it gives wrong standard errors when the variance is tied to the mean, and its residuals are heteroscedastic by construction. A Poisson GLM models the log of the mean count with a distribution defined only on the non-negative integers and with the mean-variance link built in. Overdispersion is when the data vary even more than Poisson allows (variance > mean). Statistically it means the Poisson standard errors are too small and significance is overstated; ecologically it usually reflects unmodelled heterogeneity, namely clumping, patchiness, or omitted environmental drivers, so that individuals are not distributed independently. The negative binomial adds a parameter to absorb that extra variance and restore honest inference.

Explain the role of the link function in a GLM (log for counts, logit for binary data). Take one fitted coefficient from Exercises 1–4 and interpret it on both the link scale and the back-transformed response scale.

Show the answer

The link function connects the linear predictor (a sum of $\beta x$ terms that can take any real value) to the mean of a response that is constrained, so that the model can be linear and additive on the link scale while respecting the bounds of the data on the response scale. The log link keeps counts positive (the mean is $e^{\eta}$, always > 0) and makes predictor effects multiplicative; the logit link keeps probabilities in $[0, 1]$ and makes effects multiplicative on the odds. Taking the water-content coefficient from the Poisson mite model in Exercise 1: on the link scale it is the change in log-expected-count per unit of water content (additive); back-transformed with exp() it is the factor by which the expected count is multiplied per unit increase in water content. The same coefficient, read on two scales, answers two questions: “is the effect positive?” (link scale) and “by how much does the count change?” (response scale).

Assessment Criteria

This Task is not formally assessed. It is built around four hands-on analyses (Exercises 1–4) and two short conceptual questions (Exercises 5–6); work through all six and bring your annotated Quarto document to class for discussion.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2026,
  author = {Smit, A. J.},
  title = {18: {Generalised} {Linear} {Models} {(GLM)}},
  date = {2026-06-15},
  url = {https://tangledbank.netlify.app/BCB743/tasks/Task_L.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit AJ (2026) 18: Generalised Linear Models (GLM). https://tangledbank.netlify.app/BCB743/tasks/Task_L.html.

--- title: "18: Generalised Linear Models (GLM)" subtitle: "Task L" format: html: code-fold: true code-summary: "Show the answers" --- ```{r code-brewing-opts, echo=FALSE} knitr::opts_chunk$set( comment = "R>", warning = FALSE, message = FALSE, fig.width = 4.5, fig.height = 2.625, out.width = "75%", fig.asp = NULL, # control via width/height dpi = 300 ) ggplot2::theme_set( ggplot2::theme_minimal(base_size = 8) ) ggplot2::theme_set( ggplot2::theme_bw(base_size = 8) ) ``` ## Practice Task Work through these exercises after reading the [Generalised Linear Models](../glm.qmd) chapter. They use two **vegan** datasets that suit GLMs naturally: the oribatid mite counts (`data(mite); data(mite.env)`) for Poisson-family models, and the Sibbo island bird incidences (`data(sipoo); data(sipoo.map)`) for a binomial model. Four exercises are hands-on and two are conceptual. A worked answer is given under each exercise; try it yourself before opening it. 1. Choose a reasonably common species in `mite` and fit a **Poisson GLM** of its count on the `mite.env` predictors `SubsDens`, `WatrCont` and `Topo`. Report the coefficients on both the link and the response scale (exponentiate), and interpret the sign of each. ::: {.callout-note collapse="true"} ## Show the answer ```{r} #| code-fold: false #| label: task-l-q1 library(tidyverse) library(vegan) library(MASS) data(mite); data(mite.env) focal <- names(which.max(colSums(mite))) # the most abundant species dat <- mite.env dat$y <- mite[[focal]] m_pois <- glm(y ~ SubsDens + WatrCont + Topo, data = dat, family = poisson) summary(m_pois)$coefficients exp(coef(m_pois)) # multiplicative effect on the response (count) scale ``` The fitted species is `r focal` (the most abundant mite). On the **link** (log) scale the coefficients are additive; exponentiating moves them to the **response** (count) scale, where each becomes a multiplicative factor on the expected count. A positive log-coefficient (exponentiated value above 1) means the expected count rises with that predictor, a negative one (below 1) means it falls. For this mite the water-content term carries the strongest effect, consistent with moisture being the dominant gradient in this community. ::: 2. Check the model for **overdispersion** (compare the residual deviance to its degrees of freedom, or use `performance::check_overdispersion()`). If it is overdispersed, refit it as a **quasi-Poisson** and a **negative binomial** model (`MASS::glm.nb()`), and compare the standard errors and inferences with the Poisson fit. ::: {.callout-note collapse="true"} ## Show the answer ```{r} #| code-fold: false #| label: task-l-q2 disp <- sum(residuals(m_pois, "pearson")^2) / m_pois$df.residual disp # >> 1 indicates overdispersion m_qp <- glm(y ~ SubsDens + WatrCont + Topo, data = dat, family = quasipoisson) m_nb <- glm.nb(y ~ SubsDens + WatrCont + Topo, data = dat) # compare standard errors of the same coefficients across the three fits se <- function(m) summary(m)$coefficients[, "Std. Error"] round(cbind(poisson = se(m_pois), quasipoisson = se(m_qp), negbin = se(m_nb)), 3) ``` The dispersion statistic is **`r round(disp, 1)`** (far above 1), so the counts are strongly **overdispersed**: their variance greatly exceeds the Poisson assumption that variance equals the mean. The Poisson standard errors are therefore much too small, which makes predictors look more significant than they are. The quasi-Poisson and negative-binomial refits inflate the standard errors to reflect the extra variance (the negative binomial does so by estimating a dispersion parameter, here $\theta$ = **`r round(m_nb$theta, 2)`**), so several borderline "significant" Poisson terms become non-significant. The point estimates barely change; what changes, and what matters, is the honesty of the uncertainty. ::: 3. Using `sipoo`, compute each island's species **richness** and fit a Poisson (or negative binomial) GLM of richness on `log(area)` taken from `sipoo.map`. The slope on `log(area)` is the exponent $z$ of the species-area relationship $S = cA^{z}$; report it and comment on its magnitude relative to the typical range of $z \approx 0.2$--$0.35$. ::: {.callout-note collapse="true"} ## Show the answer ```{r} #| code-fold: false #| label: task-l-q3 #| fig-width: 5 #| fig-height: 3.5 data(sipoo); data(sipoo.map) rich <- rowSums(sipoo) # bird species per island area <- sipoo.map[rownames(sipoo), "area"] sar <- glm(rich ~ log(area), family = poisson) coef(sar)["log(area)"] # the species-area exponent z ggplot(tibble(area, rich), aes(area, rich)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = poisson), formula = y ~ log(x), se = TRUE) + scale_x_log10() + labs(x = "Island area (log scale)", y = "Bird species richness") ``` With a Poisson log link and `log(area)` as the predictor, the model is $\log S = \log c + z\,\log A$, so the slope is exactly the species-area exponent $z$ in $S = cA^{z}$. The estimated $z$ for the Sibbo islands is **`r round(coef(sar)[["log(area)"]], 2)`**, at or slightly above the upper end of the classic $0.2$--$0.35$ range, which is plausible for a true-island archipelago (isolated islands tend to have steeper species-area slopes than nested mainland samples). Richness rises with area, but with diminishing returns on the arithmetic scale, exactly the curvature the log link captures. ::: 4. Convert one moderately common `sipoo` bird species to presence/absence and fit a **logistic GLM** of its occurrence on `log(area)`. Report the odds ratio for a doubling of island area and the area at which the predicted probability of occurrence is 0.5. ::: {.callout-note collapse="true"} ## Show the answer ```{r} #| code-fold: false #| label: task-l-q4 prev <- colSums(sipoo) fb <- names(which.min(abs(prev - nrow(sipoo) / 2))) # species closest to 50% prevalence dfb <- data.frame(pa = sipoo[[fb]], area = area) m_log <- glm(pa ~ log(area), data = dfb, family = binomial) b <- coef(m_log) c(species = fb, OR_per_doubling = round(2^b["log(area)"], 2), # odds multiply by this per doubling of area area_p50 = round(exp(-b[1] / b["log(area)"]), 1)) # area where P(occurrence) = 0.5 ``` The fitted species is `r fb`. The logit model is $\text{logit}(p) = \beta_0 + \beta_1\log(\text{area})$, so doubling the area multiplies the **odds** of occurrence by $2^{\beta_1}$ = **`r round(2^coef(m_log)[["log(area)"]], 2)`** (comfortably above 1, so larger islands are more likely to hold the species), and presence becomes more likely than absence ($p = 0.5$) above an island area of about **`r round(exp(-coef(m_log)[[1]] / coef(m_log)[["log(area)"]]), 1)`** units. This is island biogeography in a single GLM: occurrence probability rises smoothly with island area along an S-shaped (logistic) curve. ::: 5. Explain why a Poisson or negative-binomial GLM is more appropriate than ordinary least-squares regression for count data, and what overdispersion means both statistically and ecologically. ::: {.callout-note collapse="true"} ## Show the answer Counts are non-negative integers whose variance grows with their mean, and they are often concentrated near zero. Ordinary least squares assumes the opposite, namely a continuous response with constant-variance, normally distributed errors, so it can predict impossible negative counts, it gives wrong standard errors when the variance is tied to the mean, and its residuals are heteroscedastic by construction. A Poisson GLM models the log of the mean count with a distribution defined only on the non-negative integers and with the mean-variance link built in. **Overdispersion** is when the data vary even more than Poisson allows (variance > mean). Statistically it means the Poisson standard errors are too small and significance is overstated; ecologically it usually reflects unmodelled heterogeneity, namely clumping, patchiness, or omitted environmental drivers, so that individuals are not distributed independently. The negative binomial adds a parameter to absorb that extra variance and restore honest inference. ::: 6. Explain the role of the **link function** in a GLM (log for counts, logit for binary data). Take one fitted coefficient from Exercises 1--4 and interpret it on both the link scale and the back-transformed response scale. ::: {.callout-note collapse="true"} ## Show the answer The **link function** connects the linear predictor (a sum of $\beta x$ terms that can take any real value) to the mean of a response that is constrained, so that the model can be linear and additive on the link scale while respecting the bounds of the data on the response scale. The **log** link keeps counts positive (the mean is $e^{\eta}$, always > 0) and makes predictor effects multiplicative; the **logit** link keeps probabilities in $[0, 1]$ and makes effects multiplicative on the odds. Taking the water-content coefficient from the Poisson mite model in Exercise 1: on the **link** scale it is the change in log-expected-count per unit of water content (additive); back-transformed with `exp()` it is the **factor** by which the expected count is multiplied per unit increase in water content. The same coefficient, read on two scales, answers two questions: "is the effect positive?" (link scale) and "by how much does the count change?" (response scale). ::: ## Assessment Criteria This Task is not formally assessed. It is built around four hands-on analyses (Exercises 1--4) and two short conceptual questions (Exercises 5--6); work through all six and bring your annotated Quarto document to class for discussion.