20. Generalised Linear Models

Extending Regression Beyond Normal Responses

Author: A. J. Smit

Affiliation: University of the Western Cape

Published: 2026/04/07

In This Chapter

why ordinary linear models fail for binary, proportional, and count responses;
how families and link functions define a generalised linear model;
how simple proportion tests connect to the binomial GLM framework;
how to fit and interpret a real logistic regression with ecological data;
how to detect overdispersion in count data and what it means for interpretation.

Tasks to Complete in This Chapter

None

Ordinary linear regression assumes normally distributed errors with constant variance. Many ecological responses cannot meet those conditions. Presence-absence data are binary, proportions are constrained between 0 and 1, and counts are non-negative integers whose variance often changes with the mean.

Generalised linear models (GLMs) extend regression to handle those cases while retaining the same basic reasoning:

define the biological question;
choose predictors;
fit a model;
interpret the fitted effects.

The difference is that the response distribution and link function must now match the data structure.

1 Key Concepts

The family describes the response distribution, such as binomial for binary data or Poisson for counts.
The link function connects the linear predictor to the expected response scale.
A GLM still has a linear predictor, but it acts on the link scale rather than directly on the response scale.
Interpretation depends on scale: a coefficient may describe a change in log-odds or log-counts rather than a direct additive change in the response.
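Overdispersion means the data vary more than the chosen family implies (for a Poisson model, a variance larger than the mean). It is a warning that the standard errors from the simplest count model may be too small.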
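To make the distinction between the link scale and the response scale concrete, here is a minimal sketch using base R's `qlogis()` and `plogis()` (the logit and its inverse). The probabilities and means are made-up illustrative values, not taken from the chapter's datasets.

```r
# Illustrative values only (not from the chapter's datasets)
p <- c(0.10, 0.50, 0.90)          # probabilities on the response scale
eta <- qlogis(p)                  # logit: log(p / (1 - p)), the link scale
eta

# A constant step of +1 on the link scale is not a constant step in probability
p_shifted <- plogis(eta + 1)      # back-transform with the inverse logit
round(p_shifted - p, 3)

# For a log link, an additive step on the link scale multiplies the mean
mu <- c(2, 20)
exp(log(mu) + 0.5) / mu           # same multiplicative factor, exp(0.5), for both
```

The same additive change in the linear predictor therefore produces different changes in probability depending on where it starts, which is why coefficients are usually translated back to the response scale before they are interpreted.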

2 When This Method Is Appropriate

Use a GLM when:

the response is binary, such as present/absent or alive/dead;
the response is a proportion derived from successes and failures;
the response is a count;
the variance changes with the mean in a way that an ordinary linear model cannot sensibly represent.

In this chapter, I pick up the earlier proportion-testing material and place it inside the broader regression sequence.

3 Nature of the Data and Assumptions

The assumptions depend on the GLM family, but some ideas are common to all families:

observations should be independent;
the response family should match the data-generating structure;
the link function should make biological and statistical sense;
the linear predictor should be appropriately specified.

Unlike ordinary least squares, GLMs do not assume normal residuals with constant variance. Instead, the variance structure is part of the model family itself.
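One way to see that the variance structure belongs to the family is to inspect the variance functions stored in R's family objects. The sketch below uses only base R; the input values are arbitrary and chosen for illustration.

```r
# Family objects in base R carry the variance function used by glm()
mu_count <- c(1, 5, 25)
poisson()$variance(mu_count)      # Poisson: variance equals the mean

p <- c(0.1, 0.5, 0.9)
binomial()$variance(p)            # binomial: p * (1 - p), largest at p = 0.5

gaussian()$variance(mu_count)     # Gaussian: constant, independent of the mean
```

Only the Gaussian family treats the variance as unrelated to the mean, which is the assumption that ordinary least squares relies on.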

4 The Core Equations

A generalised linear model can be written as:

\[g(\mu_i) = \eta_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \tag{1}\]

In Equation 1, $\mu_i$ is the expected response for observation $i$, $g(\cdot)$ is the link function, and $\eta_i$ is the linear predictor.

For binomial logistic regression, the most common link is the logit:

\[\log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \tag{2}\]

Equation 2 says that predictors act linearly on the log-odds scale.

For Poisson regression, the most common link is the log link:

\[\log(\mu_i) = \alpha + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \tag{3}\]

Equation 3 means that predictors act linearly on the log-count scale.
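Inverting the logit and log links gives the expected response as a function of the linear predictor $\eta_i$, which is how link-scale predictions are converted back to probabilities or expected counts:

\[p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}}\]

\[\mu_i = e^{\eta_i}\]

It follows that a one-unit increase in $X_{1i}$, with the other predictors held fixed, multiplies the odds $p_i / (1 - p_i)$ by $e^{\beta_1}$ in the logistic model and multiplies the expected count $\mu_i$ by $e^{\beta_1}$ in the Poisson model.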

5 R Functions

The main function is glm().

glm(response ~ predictors, family = ..., data = df)

Typical uses are:

glm(binary_response ~ predictors, family = binomial, data = df)
glm(cbind(successes, failures) ~ predictors, family = binomial, data = df)
glm(count ~ predictors, family = poisson, data = df)
glm(count ~ predictors, family = quasipoisson, data = df)
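As a runnable illustration of the grouped-binomial form, the sketch below simulates data and fits a model with `cbind(successes, failures)`. The data frame `df`, the predictor `x`, and the parameter values are hypothetical and exist only for this example.

```r
# Simulated data for illustration only; 'df', 'x', 'successes', and 'failures'
# are hypothetical names, not variables from the chapter's datasets
set.seed(1)
n_units <- 50
df <- data.frame(x = runif(n_units, 0, 10))
trials <- 20
true_p <- plogis(-2 + 0.4 * df$x)            # assumed true relationship
df$successes <- rbinom(n_units, size = trials, prob = true_p)
df$failures <- trials - df$successes

# Grouped binomial response: two-column matrix of successes and failures
fit <- glm(cbind(successes, failures) ~ x, family = binomial, data = df)
coef(fit)                                     # estimates on the log-odds scale

# Predictions on the response (probability) scale
preds <- predict(fit, newdata = data.frame(x = c(0, 5, 10)), type = "response")
preds
```

Using `type = "response"` in `predict()` applies the inverse link, so the returned values are probabilities rather than log-odds.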

6 Simple Proportion Tests as a Bridge to GLMs

Before fitting full binomial models with predictors, it is useful to recognise that the simple proportion tests encountered earlier are part of the same broader reasoning framework.

If we only want to compare an observed proportion against a null expectation, or compare the proportions of two groups without additional predictors, prop.test() is still useful.

6.1 One-Sample Proportion Tests

For a two-sided one-sample proportion test, the hypotheses are:

\[H_{0}: p = p_{0}\] \[H_{a}: p \ne p_{0}\]

Here x is the number of successes, n is the number of trials, and p is the hypothesised probability:

prop.test(x = 45, n = 100, p = 0.5)


    1-sample proportions test with continuity correction

data:  45 out of 100, null probability 0.5
X-squared = 0.81, df = 1, p-value = 0.3681
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3514281 0.5524574
sample estimates:
   p 
0.45

prop.test(x = 33, n = 100, p = 0.5)


    1-sample proportions test with continuity correction

data:  33 out of 100, null probability 0.5
X-squared = 10.89, df = 1, p-value = 0.0009668
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2411558 0.4320901
sample estimates:
   p 
0.33

6.2 Two-Sample Proportion Tests

For a two-sided comparison of two proportions:

\[H_{0}: p_{1} = p_{2}\] \[H_{a}: p_{1} \ne p_{2}\]

mosquito <- matrix(c(70, 85, 50, 35), ncol = 2)
colnames(mosquito) <- c("yes", "no")
rownames(mosquito) <- c("Jack", "Jill")
mosquito

     yes no
Jack  70 50
Jill  85 35

prop.test(mosquito)


    2-sample test for equality of proportions with continuity correction

data:  mosquito
X-squared = 3.5704, df = 1, p-value = 0.05882
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.253309811  0.003309811
sample estimates:
   prop 1    prop 2 
0.5833333 0.7083333

These tests are useful when the question is simple and there are few or no predictors. Once predictors enter the problem, the binomial GLM becomes the natural extension.
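To show the connection explicitly, the two-sample mosquito comparison can be written as a binomial GLM with a single grouping predictor. This is a minimal sketch; the data frame `bites` and the object `fit_bites` are names introduced here for illustration, while the counts are those of the mosquito matrix.

```r
# Same counts as the mosquito matrix, arranged as one row per person
bites <- data.frame(
  person = factor(c("Jack", "Jill")),
  yes = c(70, 85),
  no = c(50, 35)
)

fit_bites <- glm(cbind(yes, no) ~ person, family = binomial, data = bites)

# Fitted probabilities reproduce the two sample proportions
fitted_p <- predict(fit_bites, type = "response")
fitted_p

# The personJill coefficient is the log odds ratio between the two groups
log_or <- log((85 / 35) / (70 / 50))
c(glm = unname(coef(fit_bites)["personJill"]), by_hand = log_or)
```

The Wald test reported by `summary(fit_bites)` will not match the `prop.test()` p-value exactly, because `prop.test()` uses a chi-squared statistic with a continuity correction, but both address the same null hypothesis of equal proportions.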

7 Example 1: Logistic Regression for Presence-Absence Data

7.1 Example Dataset

We use the Sessile_Benthic_Invertebrates.csv dataset from the rocky intertidal course data. These observations were collected along a shore-position gradient (Location_m) and from the left and right sides of the shore platform. Here we derive a binary response indicating whether acorn barnacles were present in a sampled unit.

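library(tidyverse)  # read_csv(), mutate(), summarise(), ggplot()
library(gt)         # gt() for formatted tables

# Read the data, make Side a factor, and derive the binary response
sessile <- read_csv(
  here::here("data", "BCB744", "Rocky Intertidal Data", "Sessile_Benthic_Invertebrates.csv"),
  show_col_types = FALSE
) |>
  mutate(
    Side = factor(Side),
    acorn_present = as.integer(Acorn.Barnacles > 0)  # 1 = present, 0 = absent
  )

gt(head(sessile, 10))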

A subset of the sessile-benthic invertebrate data used for the logistic-regression example.
Acorn.Barnacles	Anemones	Mussels	Side	acorn_present
2800	240	0	Left	1
2800	480	0	Right	1
0	0	0	Right	0
0	0	0	Left	0
4500	0	5	Left	1
13495	0	0	Right	1
300	0	0	Left	1
342	0	0	Right	1
5000	0	0	Right	1
1152	22	0	Left	1

7.2 Do an Exploratory Data Analysis (EDA)

sessile |>
  summarise(
    n = n(),
    n_present = sum(acorn_present),
    prop_present = mean(acorn_present),
    min_loc = min(Location_m),
    max_loc = max(Location_m)
  )

# A tibble: 1 × 5
      n n_present prop_present min_loc max_loc
  <int>     <int>        <dbl>   <dbl>   <dbl>
1   120        NA           NA       0      40
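The `NA` entries for `n_present` and `prop_present` arise because `sum()` and `mean()` return `NA` whenever their input contains missing values, and the model output later in this example reports four observations dropped for missingness. A minimal sketch with a made-up vector (not the course data) shows the behaviour and the `na.rm = TRUE` argument that avoids it:

```r
# Hypothetical presence/absence vector with one missing value
present <- c(1, 0, 1, NA, 1)

sum(present)                      # NA: the missing value propagates
sum(present, na.rm = TRUE)        # number of presences among observed values
mean(present, na.rm = TRUE)       # proportion present among observed values
sum(!is.na(present))              # number of non-missing observations
```

Applying the same argument inside `summarise()` would give summaries based on the non-missing rows only, which are also the rows that `glm()` uses by default.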


ggplot(sessile, aes(x = Location_m, y = acorn_present)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.35) +
  labs(
    x = "Location on shore gradient (m)",
    y = "Acorn barnacles present"
  )

Figure 1: Presence and absence of acorn barnacles across the shore-position gradient.

The response pattern in Figure 1 is binary and clearly unsuitable for ordinary linear regression. That immediately points us to a binomial GLM.

7.3 State the Model Question and Hypotheses

The biological question is whether the probability of acorn-barnacle presence changes along the shore-position gradient.

\[H_{0}: \beta_{\text{Location}} = 0\] \[H_{a}: \beta_{\text{Location}} \ne 0\]

7.4 Fit the Model

fit_logistic <- glm(acorn_present ~ Location_m + Side,
                    family = binomial,
                    data = sessile)

summary(fit_logistic)


Call:
glm(formula = acorn_present ~ Location_m + Side, family = binomial, 
    data = sessile)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.46201    0.33287   1.388  0.16514   
Location_m  -0.05270    0.01790  -2.944  0.00324 **
SideRight   -0.07575    0.38928  -0.195  0.84572   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 159.95  on 115  degrees of freedom
Residual deviance: 149.77  on 113  degrees of freedom
  (4 observations deleted due to missingness)
AIC: 155.77

Number of Fisher Scoring iterations: 4
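The null and residual deviances in this output can be combined into an overall likelihood-ratio test of the two predictors together. The sketch below is a worked calculation using only the numbers printed by `summary()`; it is a supplement to the Wald z tests in the coefficient table and does not replace them.

```r
# Values copied from the summary() output
null_dev <- 159.95
resid_dev <- 149.77
df_diff <- 115 - 113                      # two parameters added: Location_m and Side

lr_stat <- null_dev - resid_dev           # drop in deviance
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(lr_stat = lr_stat, p_value = p_value)
```

With the fitted object available, `anova(fit_logistic, test = "Chisq")` reports sequential deviance tests for each term.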

7.5 Test Assumptions / Check Diagnostics


new_logistic <- expand.grid(
  Location_m = seq(min(sessile$Location_m), max(sessile$Location_m), length.out = 200),
  Side = levels(sessile$Side)
) |>
  as_tibble() |>
  # pick(everything()) replaces cur_data(), which is deprecated since dplyr 1.1.0
  mutate(prob = predict(fit_logistic, newdata = pick(everything()), type = "response"))

ggplot(sessile, aes(x = Location_m, y = acorn_present, colour = Side)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.3) +
  geom_line(data = new_logistic, aes(y = prob), linewidth = 0.9) +
  labs(
    x = "Location on shore gradient (m)",
    y = "Probability of presence"
  )

Figure 2: Fitted logistic-regression curves for acorn-barnacle presence along the shore-position gradient.

For introductory logistic regression, the most important checks are whether the response family is appropriate and whether the fitted probability pattern in Figure 2 is biologically sensible.

7.6 Interpret the Results

The fitted location coefficient is negative (-0.053, p < 0.01), which means the log-odds of acorn-barnacle presence decline as shore position increases. In practical terms, the predicted probability of acorn-barnacle presence declines from about 0.61 near the lower end of the sampled gradient to about 0.16 near the upper end of the gradient.

The effect of platform side is negligible in this example (SideRight = -0.076, p = 0.85), so the dominant pattern is a decline in occurrence across shore position rather than a left-right platform difference.
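The probabilities quoted in this interpretation can be reproduced by hand from the printed coefficients. The sketch below copies the estimates from the `summary()` output and assumes the sampled gradient runs from 0 to 40 m, as reported in the exploratory summary; `Left` is the reference level of `Side`, so the intercept applies to the left side.

```r
# Coefficients copied from the summary() output (Left is the reference side)
b0 <- 0.46201
b_loc <- -0.05270

exp(b_loc)                # odds ratio per 1 m increase in Location_m
exp(10 * b_loc)           # odds ratio per 10 m increase

# Predicted probabilities on the left side at the ends of the sampled gradient
p_low <- plogis(b0 + b_loc * 0)
p_high <- plogis(b0 + b_loc * 40)
round(c(p_low, p_high), 2)
```

Reporting the odds ratio alongside the predicted probabilities gives readers both the multiplicative effect on the odds and the practical change in occurrence.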

7.7 Reporting

Write-Up

Methods

Acorn-barnacle occurrence was analysed using a binomial generalised linear model with logit link. Presence-absence was modelled as a function of shore position (Location_m) and platform side (Side) using rocky intertidal observations from the course dataset.

Results

The probability of acorn-barnacle presence declined significantly along the shore-position gradient (logit coefficient for Location_m = -0.053, $z = -2.94$, p < 0.01). The fitted probability of presence on the left side of the platform fell from about 0.61 at the low end of the sampled gradient to about 0.16 at the high end. There was no evidence that the left and right sides differed once location was accounted for ($z = -0.195$, p = 0.85).

Discussion

The ecological interpretation is that acorn barnacles became less likely to occur higher along the sampled shore gradient. In a manuscript, it is more helpful to phrase this as a change in probability of occurrence than to leave the result on the log-odds scale alone.

8 Example 2: Count Data and Overdispersion

8.1 Example Dataset

We now use the Small_Mobile_Invertebrates.csv rocky intertidal dataset. Here the response is the count of snails in each sampled unit.

mobile <- read_csv(
  here::here("data", "BCB744", "Rocky Intertidal Data", "Small_Mobile_Invertebrates.csv"),
  show_col_types = FALSE
) |>
  mutate(Side = factor(Side))

gt(head(mobile, 10))

A subset of the small mobile invertebrate data used for the count-data GLM example.
Chitons	Snails	Limpets	Side
0	0	0	Left
0	3	9	Right
0	0	0	Right
0	0	0	Left
0	625	0	Right
1	0	1	Left
4	0	27	Left
0	134	11	Right
1	3	31	Left
0	0	16	Left

8.2 Do an Exploratory Data Analysis (EDA)

mobile |>
  summarise(
    n = n(),
    mean_snails = mean(Snails),
    var_snails = var(Snails),
    max_snails = max(Snails)
  )

# A tibble: 1 × 4
      n mean_snails var_snails max_snails
  <int>       <dbl>      <dbl>      <dbl>
1   120          NA         NA         NA
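The `NA` results arise because `mean()`, `var()`, and `max()` return `NA` when `Snails` contains missing values; the model output below reports seven observations dropped for missingness. A hedged sketch with made-up counts (not the course data) shows how `na.rm = TRUE` allows the mean, variance, and variance-to-mean ratio to be computed:

```r
# Hypothetical counts with a missing value (not the course data)
counts <- c(0, 0, 3, 0, 120, NA, 1, 0, 15, 0)

m <- mean(counts, na.rm = TRUE)
v <- var(counts, na.rm = TRUE)
c(mean = m, variance = v, ratio = v / m)   # a ratio near 1 is expected under Poisson
```

A variance-to-mean ratio far above 1 in the raw counts is an early hint of overdispersion, although the formal check is made on the residuals of a fitted model.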


ggplot(mobile, aes(x = Location_m, y = Snails, colour = Side)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "Location on shore gradient (m)",
    y = "Snail count"
  )

Figure 3: Snail counts across the shore-position gradient.

The counts in Figure 3 are non-negative integers and include many zeros together with some very large values. That already suggests we need to be cautious about the Poisson assumption that the variance equals the mean.

8.3 State the Model Question and Hypotheses

The question is whether snail abundance changes along the shore gradient and between the two sides of the shore platform.

\[H_{0}: \beta_{\text{Location}} = 0\] \[H_{a}: \beta_{\text{Location}} \ne 0\]

\[H_{0}: \beta_{\text{Side}} = 0\] \[H_{a}: \beta_{\text{Side}} \ne 0\]

8.4 Fit the Model

fit_pois <- glm(Snails ~ Location_m + Side,
                family = poisson,
                data = mobile)

fit_qpois <- glm(Snails ~ Location_m + Side,
                 family = quasipoisson,
                 data = mobile)

summary(fit_pois)


Call:
glm(formula = Snails ~ Location_m + Side, family = poisson, data = mobile)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.801789   0.141224   12.76   <2e-16 ***
Location_m  -0.249638   0.009117  -27.38   <2e-16 ***
SideRight    2.757775   0.144401   19.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6117.7  on 112  degrees of freedom
Residual deviance: 2921.5  on 110  degrees of freedom
  (7 observations deleted due to missingness)
AIC: 3043.1

Number of Fisher Scoring iterations: 8

summary(fit_qpois)


Call:
glm(formula = Snails ~ Location_m + Side, family = quasipoisson, 
    data = mobile)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8018     6.2655   0.288    0.774
Location_m   -0.2496     0.4045  -0.617    0.538
SideRight     2.7578     6.4065   0.430    0.668

(Dispersion parameter for quasipoisson family taken to be 1968.325)

    Null deviance: 6117.7  on 112  degrees of freedom
Residual deviance: 2921.5  on 110  degrees of freedom
  (7 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 8

8.5 Test Assumptions / Check Diagnostics

dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

[1] 1968.293


new_count <- expand.grid(
  Location_m = seq(min(mobile$Location_m), max(mobile$Location_m), length.out = 200),
  Side = levels(mobile$Side)
) |>
  as_tibble() |>
  # pick(everything()) replaces cur_data(), which is deprecated since dplyr 1.1.0
  mutate(mu = predict(fit_qpois, newdata = pick(everything()), type = "response"))

ggplot(mobile, aes(x = Location_m, y = Snails, colour = Side)) +
  geom_point(alpha = 0.35) +
  geom_line(data = new_count, aes(y = mu), linewidth = 0.9) +
  labs(
    x = "Location on shore gradient (m)",
    y = "Snail count"
  )

Figure 4: Observed snail counts and fitted quasi-Poisson means across the shore-position gradient.

The fitted curves in Figure 4 and the dispersion value of 1968.3 show that the simple Poisson model is much too optimistic. That is exactly the situation in which overdispersion must be acknowledged before interpreting the coefficients.
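The link between the dispersion estimate and the quasi-Poisson standard errors can be checked by hand: the quasi-Poisson model leaves the coefficients unchanged and multiplies each Poisson standard error by the square root of the dispersion. The sketch below uses only values copied from the two `summary()` outputs.

```r
# Values copied from the two summary() outputs
se_pois <- c(Intercept = 0.141224, Location_m = 0.009117, SideRight = 0.144401)
phi <- 1968.325                            # quasi-Poisson dispersion estimate

se_quasi <- se_pois * sqrt(phi)            # inflate by the square root of the dispersion
round(se_quasi, 4)

# Quick cross-check: residual deviance per degree of freedom
2921.5 / 110
```

The inflated values agree with the quasi-Poisson standard errors in the printed table, and the residual deviance per degree of freedom is also far above 1, pointing in the same direction as the Pearson statistic.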

8.6 Interpret the Results

The Poisson model suggests that snail counts decline strongly along the shore gradient and differ between sides, but the overdispersion is so extreme that the nominal Poisson standard errors cannot be trusted.

The quasi-Poisson fit keeps the same mean structure but inflates the uncertainty to reflect the estimated dispersion. Under that more cautious model, neither the estimated decline with location (p = 0.54) nor the side effect (p = 0.67) is strongly supported. The biological conclusion therefore changes once overdispersion is taken seriously.

8.7 Reporting

Write-Up

Methods

Snail counts from rocky intertidal samples were analysed using a count-data GLM with location along the shore gradient and platform side as predictors. A Poisson model was fitted first, and overdispersion was then assessed using the Pearson residual dispersion statistic. Because overdispersion was extreme, a quasi-Poisson model was used for interpretation.

Results

The raw count data were highly variable, with the variance far exceeding the mean. The fitted Poisson model showed a strong negative effect of location and a strong side effect, but the dispersion statistic was approximately 1968.3, indicating severe overdispersion. Under the quasi-Poisson model, the same mean pattern remained but the standard errors were much larger, and neither predictor provided strong evidence for a clear effect (p > 0.05).

Discussion

The important lesson is not only that snail counts vary along the shore. It is that count-data inference can become misleading if overdispersion is ignored. In practice, the first biologically plausible pattern should always be checked against the adequacy of the assumed count distribution.

9 What to Do When Assumptions Fail / Alternatives

If a binary model shows dependence among observations, move to a generalised linear mixed model (GLMM) rather than forcing independence.
If a count model is overdispersed, consider quasi-Poisson, negative binomial, or other more appropriate count models.
If the response is a simple success proportion with no predictors, a proportion test may still be enough.
If the link-scale coefficients are hard to interpret, translate the results back to probabilities or expected counts on the response scale.
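As one possible illustration of the negative binomial alternative listed above, the sketch below fits Poisson and negative binomial models to simulated overdispersed counts using `glm.nb()` from the MASS package. The data, variable names, and parameter values are hypothetical, and this is not a re-analysis of the snail data.

```r
# Simulated overdispersed counts; names and parameter values are hypothetical
set.seed(42)
sim <- data.frame(x = runif(200, 0, 4))
sim$y <- rnbinom(200, mu = exp(0.5 + 0.6 * sim$x), size = 0.8)

fit_p <- glm(y ~ x, family = poisson, data = sim)
fit_nb <- MASS::glm.nb(y ~ x, data = sim)   # MASS:: avoids masking dplyr::select()

# Pearson dispersion of the Poisson fit (values well above 1 flag overdispersion)
disp_p <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)
disp_p

# Standard errors: Poisson versus negative binomial
cbind(poisson = summary(fit_p)$coefficients[, "Std. Error"],
      negbin  = summary(fit_nb)$coefficients[, "Std. Error"])
```

Unlike quasi-Poisson, the negative binomial model has a full likelihood, so AIC-based comparison remains available; which option is preferable depends on how the variance grows with the mean in the data at hand.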

10 Summary

GLMs extend regression to binary, proportional, and count responses.
The family and link function must match the response structure.
Simple proportion tests are part of the same broader modelling logic as binomial GLMs.
In the acorn-barnacle example, the probability of presence declined across the shore gradient.
In the snail-count example, severe overdispersion changed the inferential conclusion and showed why diagnostics matter in GLMs.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2026,
  author = {Smit, A. J.},
  title = {20. {Generalised} {Linear} {Models}},
  date = {2026-04-07},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/20-generalised-linear-models.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit AJ (2026) 20. Generalised Linear Models. https://tangledbank.netlify.app/BCB744/basic_stats/20-generalised-linear-models.html.
