20: Linear and Generalised Linear Mixed Models

Author

A. J. Smit

Published

2026/06/12

TipMaterial Required for This Chapter
Type Name Link
Prior GLM; the Diatom case study GLM; nMDS: PERMANOVA (Diatoms)
Data Mayombo diatom data 💾 PB_data_matrix_abrev.csv
💾 PB_diat_env.csv

The regression chapters so far have assumed that observations are independent, the “I” in the LINE assumptions. Ecological sampling routinely breaks that assumption. Sites are nested within rivers, quadrats within transects, measurements within individuals, and subsamples within a single specimen. Observations that share a group are more alike than observations from different groups, and treating them as independent is pseudoreplication: it credits the analysis with more independent information than the design actually delivered, and it makes p-values far too small.

A mixed model solves this by adding a random effect for the grouping. I meet the same design that the diatom case study analysed with blocked PERMANOVA, and I now handle its nesting in a univariate model instead.

The Diatom Study and the Pseudoreplication It Creates

Epiphytic diatoms were sampled from kelp blades. The design crosses two fixed factors, host species (Ecklonia vs Laminaria) and host size (adult vs juvenile), and from each individual kelp plant up to three replicate tissue pieces were taken. Those three pieces are subsamples of the same blade, not independent replicates of the treatment. This is why the diatom chapter set plant as the permutation block.

library(tidyverse)
library(vegan)
library(lme4) # lmer(), glmer()
library(performance) # icc(), r2()

spp <- read.csv(
  here::here("data", "BCB743", "diatoms", "PB_data_matrix_abrev.csv"),
  row.names = 1
)
env <- read.csv(
  here::here("data", "BCB743", "diatoms", "PB_diat_env.csv"),
  row.names = 1
)
env <- env |>
  mutate(
    host_spp = factor(host_spp),
    host_size = factor(host_size),
    plant = factor(plant),
    richness = rowSums(spp > 0), # diatom genera per sample
    Gomph = spp[["Gomphoseptatum.spp"]] # the dominant genus, a count
  )

table(plant_samples = table(env$plant)) # how many subsamples per plant
plant_samples
 1  2  3 
 1  4 11 

Most plants contribute three subsamples. Ignoring that structure would treat 42 correlated subsamples as 42 independent observations.

Fixed and Random Effects

A fixed effect is a factor whose specific levels are of direct interest and would recur if you repeated the study: host species and host size are fixed. A random effect is a factor whose levels are a sample from a population of possible levels and are not themselves of interest: the particular kelp plants are an arbitrary sample of kelp plants, and I care about kelp plants in general, not plant number 7. I model the plant effect not as a set of coefficients but as a single variance: how much do blades differ from one another, over and above the treatments?

Written out, a random-intercept model gives each plant \(j\) its own deviation \(b_j\) from the overall mean,

\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + b_j + \varepsilon_{ij}, \qquad b_j \sim N(0, \sigma^2_{\text{plant}}), \quad \varepsilon_{ij} \sim N(0, \sigma^2). \]

The model estimates the two variances rather than 16 separate plant means, which is far more economical and lets the plant deviations “borrow strength” from one another.

A Linear Mixed Model

I begin with diatom richness as a linear mixed model. The fixed part is the host species by host size interaction; the random part, (1 | plant), gives each blade its own intercept:

lmm <- lmer(richness ~ host_spp * host_size + (1 | plant), data = env)
summary(lmm)
Linear mixed model fit by REML ['lmerMod']
Formula: richness ~ host_spp * host_size + (1 | plant)
   Data: env

REML criterion at convergence: 179.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.20015 -0.44303 -0.07726  0.35348  2.37041 

Random effects:
 Groups   Name        Variance Std.Dev.
 plant    (Intercept) 0.478    0.6914  
 Residual             4.829    2.1974  
Number of obs: 42, groups:  plant, 16

Fixed effects:
                      Estimate Std. Error t value
(Intercept)             6.0222     0.7489   8.041
host_sppLp             -4.0296     1.0591  -3.805
host_sizeJ             -0.6889     1.0406  -0.662
host_sppLp:host_sizeJ   4.2838     1.5426   2.777

Correlation of Fixed Effects:
            (Intr) hst_sL hst_sJ
host_sppLp  -0.707              
host_sizeJ  -0.720  0.509       
hst_sppL:_J  0.486 -0.687 -0.675

The Random effects table is new to you. It reports the plant-to-plant standard deviation and the residual (within-plant) standard deviation. Their relative size is captured by the intraclass correlation coefficient (ICC), the proportion of the total variance that is between plants:

icc(lmm)
# Intraclass Correlation Coefficient

    Adjusted ICC: 0.090
  Unadjusted ICC: 0.060

For richness the ICC is small: blades of the same plant are only slightly more alike than blades of different plants. The fixed-effect coefficients are read exactly as in an ordinary regression, except that their standard errors now correctly account for the grouping. Note that lmer() gives no p-values by design, because the degrees of freedom for these tests are not exactly known; I test effects below by comparing models.

NoteREML and ML

lmer() fits by restricted maximum likelihood (REML) by default, which gives unbiased variance estimates and is what you report. To compare models that differ in their fixed effects, however, you must refit with ordinary maximum likelihood (ML), because REML likelihoods of different fixed-effect structures are not comparable. The anova() method does this refitting automatically.

A Generalised Linear Mixed Model

Richness is a count, so the correct model is a generalised linear mixed model (GLMM): the random intercept of a mixed model combined with the Poisson family and log link of a GLM. I model the abundance of the dominant genus, Gomphoseptatum:

glmm <- glmer(
  Gomph ~ host_spp * host_size + (1 | plant),
  family = poisson,
  data = env
)
summary(glmm)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: poisson  ( log )
Formula: Gomph ~ host_spp * host_size + (1 | plant)
   Data: env

      AIC       BIC    logLik -2*log(L)  df.resid 
    572.0     580.7    -281.0     562.0        37 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-7.8117 -1.3080 -0.3768  0.9387 13.0544 

Random effects:
 Groups Name        Variance Std.Dev.
 plant  (Intercept) 0.3568   0.5973  
Number of obs: 42, groups:  plant, 16

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.1071     0.3173   6.641 3.12e-11 ***
host_sppLp             -0.6122     0.4616  -1.326   0.1847    
host_sizeJ             -0.3249     0.4521  -0.719   0.4724    
host_sppLp:host_sizeJ   1.5384     0.6466   2.379   0.0173 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) hst_sL hst_sJ
host_sppLp  -0.686              
host_sizeJ  -0.701  0.483       
hst_sppL:_J  0.490 -0.713 -0.699
icc(glmm)
# Intraclass Correlation Coefficient

    Adjusted ICC: 0.787
  Unadjusted ICC: 0.561

Here the picture is completely different from the richness model: the ICC is high, so blades differ strongly in their Gomphoseptatum counts. That is precisely the situation in which ignoring the nesting does real damage.

Why It Matters: Pseudoreplication Quantified

Fit the same fixed-effects model as an ordinary GLM, pretending the subsamples are independent, and compare the standard error of the host-species effect with the GLMM’s:

naive <- glm(Gomph ~ host_spp * host_size, family = poisson, data = env)

c(
  naive_SE = coef(summary(naive))["host_sppLp", "Std. Error"],
  glmm_SE = coef(summary(glmm))["host_sppLp", "Std. Error"]
)
 naive_SE   glmm_SE 
0.1654262 0.4615640 

The naive standard error is roughly a third of the correct one. Pretending that 42 subsamples are 42 independent observations nearly trebles the apparent precision and would turn a non-significant effect into a “significant” one. The mixed model spends the degrees of freedom the design actually earned. This is the single most important reason to use a mixed model, and you cannot know in advance whether the inflation will be small (as for richness) or large (as for Gomphoseptatum): you have to fit the random effect to find out.

Random Slopes

A random intercept lets the baseline vary between plants. A random slope lets a predictor’s effect vary between plants as well, written (1 + host_size | plant):

rs <- lmer(
  richness ~ host_spp * host_size + (1 + host_size | plant),
  data = env
)
summary(rs)$varcor
 Groups   Name        Std.Dev. Corr   
 plant    (Intercept) 0.65440         
          host_sizeJ  0.96568  -0.657 
 Residual             2.19751         

Use a random slope when there is reason to think the effect of a within-group predictor genuinely differs between groups. Random slopes are demanding of data: each added random term needs enough groups, and enough observations per group, to be estimated. With only sixteen plants and at most three subsamples each, the diatom data can barely support them, which leads directly to the next point.

Singular Fits

When a random-effects structure is too complex for the data, or a variance is genuinely near zero, lmer() returns a singular fit: a variance estimated at (or against) the boundary of zero. Modelling total diatom abundance is a clear case:

sf <- lmer(log1p(rowSums(spp)) ~ host_spp * host_size + (1 | plant), data = env)
VarCorr(sf)
 Groups   Name        Std.Dev.
 plant    (Intercept) 0.00000 
 Residual             0.97806 
isSingular(sf)
[1] TRUE

The plant variance collapses to zero: for total abundance, blades of the same plant are no more alike than blades of different plants. A singular fit is a warning. The remedies are to simplify the random structure (drop a random slope, or a whole grouping factor that explains nothing), and to report clearly that the grouping contributed no detectable variance, having checked rather than assumed it.

Testing Effects and Measuring Fit

Fixed effects are tested by a likelihood-ratio test that compares nested models, refitted with ML (which anova() handles):

m_full <- glmer(
  Gomph ~ host_spp * host_size + (1 | plant),
  family = poisson,
  data = env
)
m_red <- glmer(
  Gomph ~ host_spp + host_size + (1 | plant),
  family = poisson,
  data = env
)
anova(m_red, m_full)
Data: env
Models:
m_red: Gomph ~ host_spp + host_size + (1 | plant)
m_full: Gomph ~ host_spp * host_size + (1 | plant)
       npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)  
m_red     4 574.89 581.84 -283.44    566.89                       
m_full    5 572.01 580.70 -281.00    562.01 4.8788  1    0.02719 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A significant result means the interaction earns is real. For overall fit, a mixed model has two \(R^2\) values: the marginal \(R^2\) (variance explained by the fixed effects alone) and the conditional \(R^2\) (fixed and random effects together). The gap between them is the contribution of the grouping:

r2(glmm)
# R2 for Mixed Models

  Conditional R2: 0.848
     Marginal R2: 0.287
WarningTesting a variance component is a boundary problem

Testing whether a random effect is needed (is \(\sigma^2_{\text{plant}} = 0\)?) is not an ordinary likelihood-ratio test, because the null value sits on the boundary of the parameter space, which makes the usual \(\chi^2\) reference distribution conservative. In practice, decide whether to include a random effect from the design (is there grouping or nesting?) rather than from a significance test. If the design has subsamples within plants, the plant random effect belongs in the model whether or not its variance turns out to be large.

Diagnostics and Reporting

Check a mixed model much as you would a GLM: plot the residuals against the fitted values, and inspect the distribution of the estimated random intercepts, which should be roughly normal.

plot(glmm)
Figure 1: Residuals against fitted values for the Gomphoseptatum GLMM.

When you report a mixed model, give the fixed-effect estimates with their uncertainty, the variance components (or the ICC) for every random effect, the marginal and conditional \(R^2\), the error family and link for a GLMM, and the structure of the random effects in words (“a random intercept for each kelp plant, to account for the three subsamples taken per blade”). State that the random effect reflects the design, not a fishing expedition.

All These Regressions…

These four chapters have lifted the three restrictions of ordinary multiple regression one at a time. The GLM replaced the normal error with a family matched to the data; the GAM replaced straight lines with smooths; and the mixed model has replaced the independence assumption with explicit terms for grouping and nesting. The same skeleton, a linear predictor of environmental drivers, runs through all of them, and through the constrained ordination that applies the very same logic to whole communities at once. Choosing among them is the subject of the Model Building chapter: match the model to the data you have and the question you are asking, then check that the model you fitted is the model you should report.

References

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J. and J. Smit, A.},
  title = {20: {Linear} and {Generalised} {Linear} {Mixed} {Models}},
  date = {2026-06-12},
  url = {https://tangledbank.netlify.app/BCB743/mixed_models.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ, J. Smit A (2026) 20: Linear and Generalised Linear Mixed Models. https://tangledbank.netlify.app/BCB743/mixed_models.html.