---
title: "Doubs River Worked Example: Multiple Linear Regression"
subtitle: "From Exploratory Plots to a Results and Discussion Section"
date: last-modified
date-format: "YYYY/MM/DD"
reference-location: margin
---
```{r code-brewing-opts, echo=FALSE}
knitr::opts_chunk$set(
comment = "R>",
warning = FALSE,
message = FALSE,
fig.asp = NULL,
fig.align = "center",
fig.retina = 2,
dpi = 300
)
ggplot2::theme_set(
ggplot2::theme_grey(base_size = 8)
)
```
```{r code-libraries, echo=FALSE}
library(tidyverse)
library(broom)
library(car)
library(lmtest)
library(gt)
```
::: {.callout-note appearance="simple"}
## In This Worked Example
- justify a biologically defensible response-predictor structure;
- use exploratory plots to decide whether relationships are roughly linear;
- reduce the predictor set using mechanism and collinearity;
- fit and diagnose a multiple linear regression model; and
- write the analysis up as a short Results and Discussion section.
:::
# Introduction
The Doubs River environmental data contain measurements from `r nrow(read.csv(here::here("data", "BCB743", "DoubsEnv.csv")))` sites along the river. The variables describe position in the catchment (`dfs`, `alt`, `slo`), hydrology (`flo`), water chemistry (`pH`, `har`), nutrient and pollution load (`pho`, `nit`, `amm`, `bod`), and dissolved oxygen (`oxy`).
For a multiple regression, the variables should not be treated symmetrically. Here I model **dissolved oxygen** (`oxy`) as the **response** because oxygen concentration is an ecological state variable that can plausibly be influenced by channel morphology, flow, and pollution, whereas the reverse is not biologically sensible. For example, biochemical oxygen demand (`bod`) can reduce oxygen, but oxygen concentration does not cause upstream distance or channel slope.
The question is therefore:
> Which environmental variables best explain variation in dissolved oxygen along the Doubs River?
# Data Import and Preparation
```{r}
doubs <- read.csv(here::here("data", "BCB743", "DoubsEnv.csv")) |>
rename(
site = X,
distance = dfs,
altitude = alt,
slope = slo,
flow = flo,
hardness = har,
phosphorus = pho,
nitrate = nit,
ammonia = amm,
oxygen = oxy
) |>
mutate(
log_flow = log(flow),
log_slope = log1p(slope),
log_nitrate = log1p(nitrate),
log_phosphorus = log1p(phosphorus),
log_ammonia = log1p(ammonia)
)
glimpse(doubs)
```
The transformed variables are used only where the raw scatterplots suggest clear right-skew and mild curvature. I log-transformed `flow`, `slope`, and the nutrient variables to improve linearity and reduce leverage from the largest values.
# Exploratory Data Analysis
## Raw Predictor-Response Relationships
The first task is to inspect each predictor against the response. I show both a straight-line fit and a loess smoother; where they differ strongly, the linearity assumption is weaker on the raw scale.
```{r}
raw_long <- doubs |>
select(site, oxygen, distance, altitude, slope, flow, pH, hardness,
phosphorus, nitrate, ammonia, bod) |>
pivot_longer(
cols = -c(site, oxygen),
names_to = "predictor",
values_to = "value"
)
ggplot(raw_long, aes(value, oxygen)) +
geom_point(size = 1.8, alpha = 0.8, colour = "grey20") +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, colour = "dodgerblue4") +
geom_smooth(method = "loess", se = FALSE, linewidth = 0.5, linetype = 2, colour = "firebrick3") +
facet_wrap(~ predictor, scales = "free_x", ncol = 3) +
labs(
x = NULL,
y = "Dissolved oxygen",
title = "Raw relationships between dissolved oxygen and candidate predictors",
subtitle = "Blue = straight line; red dashed = loess smoother"
)
```
Several patterns are already visible:
- `bod`, `phosphorus`, and `ammonia` each show strong negative associations with oxygen.
- `distance` and `altitude` represent the same longitudinal gradient in opposite directions, so both should not be interpreted as independent drivers.
- `flow`, `slope`, and the nutrient variables are right-skewed, and the loess lines suggest that the raw-scale relationships are not perfectly linear.
## Transformed Relationships for Skewed Predictors
```{r}
trans_long <- doubs |>
select(oxygen, log_flow, log_slope, log_nitrate, log_phosphorus, log_ammonia) |>
pivot_longer(
cols = -oxygen,
names_to = "predictor",
values_to = "value"
)
ggplot(trans_long, aes(value, oxygen)) +
geom_point(size = 1.8, alpha = 0.8, colour = "grey20") +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, colour = "dodgerblue4") +
geom_smooth(method = "loess", se = FALSE, linewidth = 0.5, linetype = 2, colour = "firebrick3") +
facet_wrap(~ predictor, scales = "free_x", ncol = 3) +
labs(
x = NULL,
y = "Dissolved oxygen",
title = "Log transformations make several relationships closer to linear"
)
```
The transformed versions of `flow`, `slope`, and `nitrate` are more suitable for a linear model than their raw versions, so I use those scales below.
# Variable Selection
## Start with Biology, Not with `step()`
Model selection should happen inside a biologically defensible candidate set. I therefore begin with variables that could plausibly affect oxygen directly or indirectly:
- `bod`: a proximate measure of organic pollution and oxygen demand;
- `flow` and `slope`: hydrological and geomorphological controls on reaeration;
- `pH` and `hardness`: broad chemical descriptors of water conditions;
- `nitrate`: a nutrient signal associated with enrichment;
- `distance` and `altitude`: longitudinal context variables.
Before fitting anything, I inspect collinearity.
```{r}
cor_mat <- doubs |>
select(distance, altitude, log_flow, log_slope, pH, hardness,
log_nitrate, log_phosphorus, log_ammonia, bod) |>
cor()
cor_long <- as.data.frame(as.table(cor_mat)) |>
rename(var1 = Var1, var2 = Var2, r = Freq)
ggplot(cor_long, aes(var1, var2, fill = r)) +
geom_tile(colour = "white") +
scale_fill_gradient2(low = "#b2182b", mid = "white", high = "#2166ac",
midpoint = 0, limits = c(-1, 1)) +
coord_equal() +
labs(
x = NULL,
y = NULL,
fill = "r",
title = "Several predictors are strongly collinear"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The heatmap shows two collinearity clusters:
1. `distance`, `altitude`, and `log_flow` describe the same downstream gradient, so keeping all three would be redundant.
2. `bod`, `log_phosphorus`, and `log_ammonia` are also tightly associated, indicating a shared pollution signal.
I therefore make the following specification decisions before formal model selection:
- keep `log_flow` rather than `distance` or `altitude` because it is the most direct hydraulic mechanism for oxygenation;
- keep `log_slope` because it captures channel steepness independently of river size;
- keep `bod` as the main organic pollution term because it is most directly tied to oxygen demand;
- keep `log_nitrate` as a broader dissolved nutrient indicator;
- allow `pH` and `hardness` to compete for inclusion;
- drop `log_phosphorus`, `log_ammonia`, `distance`, and `altitude` from the candidate set to avoid redundant predictors.
## Candidate Models
```{r}
mod_global <- lm(oxygen ~ bod + pH + hardness + log_flow + log_slope + log_nitrate,
data = doubs)
mod_reduced <- lm(oxygen ~ bod + pH + log_flow + log_slope + log_nitrate,
data = doubs)
mod_minimal <- lm(oxygen ~ bod + log_flow + log_slope,
data = doubs)
model_comp <- tibble(
model = c("Global reduced-by-biology set", "Final parsimonious model", "Minimal mechanistic model"),
formula = c(
"oxygen ~ bod + pH + hardness + log_flow + log_slope + log_nitrate",
"oxygen ~ bod + pH + log_flow + log_slope + log_nitrate",
"oxygen ~ bod + log_flow + log_slope"
),
AIC = c(AIC(mod_global), AIC(mod_reduced), AIC(mod_minimal)),
adj_R2 = c(summary(mod_global)$adj.r.squared,
summary(mod_reduced)$adj.r.squared,
summary(mod_minimal)$adj.r.squared),
max_VIF = c(max(vif(mod_global)), max(vif(mod_reduced)), max(vif(mod_minimal)))
)
model_comp |>
mutate(
AIC = round(AIC, 2),
adj_R2 = round(adj_R2, 3),
max_VIF = round(max_VIF, 2)
) |>
gt()
```
The model including `hardness` has the lowest AIC, but only by `r round(AIC(mod_reduced) - AIC(mod_global), 2)` units. That difference is negligible. Because the reduced model also lowers the maximum variance inflation factor from `r round(max(vif(mod_global)), 2)` to `r round(max(vif(mod_reduced)), 2)`, I prefer the **parsimonious model**:
```{r}
final_mod <- mod_reduced
summary(final_mod)
```
# Model Fit and Interpretation
```{r}
coef_table <- tidy(final_mod, conf.int = TRUE) |>
mutate(across(where(is.numeric), ~ round(.x, 3)))
coef_table |>
gt()
```
The fitted model is:
$$
\widehat{\mathrm{oxygen}} = `r round(coef(final_mod)[1], 2)` +
`r round(coef(final_mod)[2], 2)` \, \mathrm{bod} +
`r round(coef(final_mod)[3], 2)` \, \mathrm{pH} +
`r round(coef(final_mod)[4], 2)` \, \log(\mathrm{flow}) +
`r round(coef(final_mod)[5], 2)` \, \log(1 + \mathrm{slope}) +
`r round(coef(final_mod)[6], 2)` \, \log(1 + \mathrm{nitrate})
$$
Overall, the model explains `r scales::percent(summary(final_mod)$r.squared, accuracy = 0.1)` of the variance in dissolved oxygen (`adjusted R^2 = `r round(summary(final_mod)$adj.r.squared, 3)``; `F_{5,24} = `r round(summary(final_mod)$fstatistic[1], 2)`, `p = `r format.pval(pf(summary(final_mod)$fstatistic[1], summary(final_mod)$fstatistic[2], summary(final_mod)$fstatistic[3], lower.tail = FALSE), digits = 3)``).
Among the individual predictors, oxygen declines strongly with increasing `bod`, and increases with both `log_flow` and `log_slope`. The fitted effects of `pH` and `log_nitrate` are weaker and their confidence intervals overlap zero, so they are retained as adjustment terms rather than as strong stand-alone drivers.
# Diagnostics
```{r}
par(mfrow = c(2, 2))
plot(final_mod)
par(mfrow = c(1, 1))
```
```{r}
diag_tbl <- tibble(
check = c("Shapiro-Wilk on residuals", "Breusch-Pagan for heteroscedasticity", "Largest Cook's distance", "Cook's rule-of-thumb threshold (4/n)"),
value = c(
round(shapiro.test(resid(final_mod))$p.value, 3),
round(bptest(final_mod)$p.value, 3),
round(max(cooks.distance(final_mod)), 3),
round(4 / nrow(doubs), 3)
)
)
diag_tbl |>
gt()
```
The residual diagnostics do not show serious departures from normality or constant variance. One site has a Cook's distance above the rough `4/n` threshold, so I check whether it changes the interpretation materially.
```{r}
site23_sensitivity <- lm(oxygen ~ bod + pH + log_flow + log_slope + log_nitrate,
data = filter(doubs, site != 23))
bind_rows(
tidy(final_mod) |> mutate(model = "All sites"),
tidy(site23_sensitivity) |> mutate(model = "Without site 23")
) |>
select(model, term, estimate, std.error, p.value) |>
mutate(across(where(is.numeric), ~ round(.x, 3))) |>
gt()
```
Removing site 23 changes the exact coefficient values but not the overall story: `bod` remains strongly negative, `log_slope` remains positive, and `log_flow` stays positive but becomes slightly less certain. I therefore retain the full dataset.
# Results
At 30 sites in the Doubs River, dissolved oxygen was analysed as a function of biochemical oxygen demand, pH, flow, channel slope, and nitrate concentration. Exploratory plots showed that several raw predictors were right-skewed and only approximately linear on the original scale, so `flow`, `slope`, and `nitrate` were log-transformed before modelling. Because longitudinal variables (`distance`, `altitude`, and `flow`) and pollution variables (`bod`, `phosphorus`, and `ammonia`) were strongly collinear, only one representative variable from each cluster was retained in the candidate set.
The final multiple regression model was `oxygen ~ bod + pH + log_flow + log_slope + log_nitrate`. The model explained `r scales::percent(summary(final_mod)$r.squared, accuracy = 0.1)` of the variance in dissolved oxygen (`adjusted R^2 = `r round(summary(final_mod)$adj.r.squared, 3)``; `F_{5,24} = `r round(summary(final_mod)$fstatistic[1], 2)`, `p < 0.001`). Dissolved oxygen decreased significantly as biochemical oxygen demand increased ($\beta = `r round(tidy(final_mod)$estimate[tidy(final_mod)$term == "bod"], 3)`, 95% CI `r round(tidy(final_mod, conf.int = TRUE)$conf.low[tidy(final_mod, conf.int = TRUE)$term == "bod"], 3)` to `r round(tidy(final_mod, conf.int = TRUE)$conf.high[tidy(final_mod, conf.int = TRUE)$term == "bod"], 3)``). Oxygen increased with both `log_flow` ($\beta = `r round(tidy(final_mod)$estimate[tidy(final_mod)$term == "log_flow"], 3)`, `p = `r round(tidy(final_mod)$p.value[tidy(final_mod)$term == "log_flow"], 3)``) and `log_slope` ($\beta = `r round(tidy(final_mod)$estimate[tidy(final_mod)$term == "log_slope"], 3)`, `p = `r round(tidy(final_mod)$p.value[tidy(final_mod)$term == "log_slope"], 3)``). The fitted effects of pH and nitrate were weaker and not individually significant at the 0.05 level, but they were retained because they modestly improved model fit and helped adjust for background chemical variation. Residual and influence diagnostics did not indicate serious violations of normality or homoscedasticity.
# Discussion
This analysis suggests that dissolved oxygen in the Doubs River is shaped primarily by a balance between **oxygen-consuming pollution** and **physical reaeration**. Sites with high biochemical oxygen demand tended to have lower oxygen concentrations, which is consistent with microbial decomposition consuming oxygen in enriched waters. In contrast, higher flow and steeper channels were associated with higher oxygen, which is biologically sensible because turbulence and faster water movement promote reaeration.
Two lessons about model specification are especially important here. First, the initial exploratory plots showed that some relationships were only roughly linear on the raw scale, so transformation was part of the modelling process rather than an afterthought. Second, several candidate predictors measured nearly the same longitudinal or pollution gradient. If all of them had been entered together, the model would have been harder to interpret and the coefficient estimates would have become unstable. Reducing the predictor set before final fitting therefore improved the analysis more than an automatic selection routine alone could have done.
The model is still observational, so the fitted coefficients should be interpreted as adjusted associations rather than proof of causation. In addition, the sample size is modest and one site had noticeable leverage, which means the exact coefficient values should not be over-interpreted. Even so, the ecological signal is clear: organic pollution is associated with oxygen depletion, while hydraulic conditions partly offset that decline.