16. Collinearity, Confounding, and Measurement Error

Three Threats to Interpretable Regression Models

Published

2026/04/07

Note: In This Chapter
  • what collinearity is, why shared variance destabilises coefficients, and how to diagnose it;
  • how confounding differs from collinearity and why it is a problem of attribution rather than overlap;
  • why measurement error weakens inference through attenuation bias;
  • how proxy variables differ from mechanistic variables and what that implies for interpretation;
  • a practical workflow for identifying and responding to each problem.
Important: Tasks to Complete in This Chapter
  • None

Regression models are often taught as though the hard part is getting the formula right. In practice, some of the most serious difficulties arise before and after the fitting step. Predictors may overlap so strongly that their individual effects cannot be separated. A third variable may distort a relationship that appears clean. The variables we fit may be noisy measurements, or only rough stand-ins for the biological processes we actually want to understand.

This chapter brings together three related problems:

  1. Collinearity, where predictors share so much information that coefficients become unstable;
  2. Confounding, where a missing variable creates or distorts an apparent effect; and
  3. Measurement error, where noise in the predictor attenuates the estimated effect.

All three produce models that look numerically reasonable but carry weakened or misleading biological interpretations. A model may fit, produce a statistically significant result, and still tell you something false about the mechanism.

The problems are especially common in ecology. Environmental variables are rarely independent because they are generated by the same physical processes: temperature, oxygen, and productivity co-vary in the ocean; altitude, temperature, and radiation co-vary on land. Collinearity and confounding are the normal condition of field data, and ignoring them leads to overconfident conclusions.

In the last section of this chapter, I give you a practical workflow that ties the three problems together. It is worth reading first as an orientation before working through the examples.

Some connections to earlier chapters are worth making. In Chapter 14, the seaweed climate model was trimmed because several candidate predictors were strongly correlated. That decision was an informal response to collinearity. Here, I show you how to go about making this decision. In Chapter 15, I noted that interactions can be confused with confounding when a third variable modifies the apparent relationship, and I’ll return to that distinction when addressing confounding below.

1 Important Ideas

  • Collinearity occurs when two or more predictor variables in a model are so strongly correlated that the model cannot cleanly separate their individual effects on the response.
  • The instability problem: when predictors share variance, the model tries to partition the same signal between two parameters. Small changes in the data can produce large swings in those parameters, and that is what unstable coefficients look like in practice.
  • VIF (Variance Inflation Factor) is the standard diagnostic that quantifies how much the variance of an estimated coefficient has been inflated by overlap with other predictors.
  • Confounding is about attribution. A confounder influences both the predictor of interest and the response to create or distort an apparent association. Two variables can be weakly correlated and still produce serious confounding.
  • Measurement error in a predictor attenuates its estimated effect and biases the slope towards zero. Weak effects are not always evidence of weak biology.
  • Proxy variables are a related interpretive problem. For example, altitude, depth, and distance from shore are easy to measure but they do not represent single mechanisms. Their coefficients describe a spatial or environmental gradient, not a specific causal process.
  • Compared to proxy variables, mechanistic variables such as temperature, nutrient concentration, and light intensity are closer to the process of interest and produce more directly interpretable coefficients, though they too can be collinear.
  • Interpretation vs prediction: collinearity, confounding, and poor measurement all weaken explanation. A collinear model may still predict well, but its coefficients do not support mechanistic claims. This distinction governs most of the decisions in this chapter and will not be restated for each example.

2 Problems of Note

In the regression chapters so far, I have assumed that each predictor contributes something identifiable to the response, but that assumption is often too optimistic since several things can undermine it.

Two predictors may describe almost the same underlying process and this makes it impossible to say how much each contributes. A variable may appear important only because an unmeasured third variable drives both it and the response, or a predictor may be measured with enough noise that its effect is weakened and unstable. And even a well-measured predictor may be no more than a proxy for something else, a spatial stand-in rather than a mechanism.

Together, these problems determine whether the biological story that a model tells is defensible.

3 R Functions

The most useful functions in this chapter are:

  • cor() and scatterplots for inspecting predictor overlap;
  • car::vif() for diagnosing collinearity in a fitted model;
  • lm() for comparing models that omit or include potentially confounding variables;
  • broom::tidy() for comparing coefficients across models.

4 Collinearity

4.1 What Is Collinearity?

Collinearity occurs when two predictor variables in a model are highly correlated. Multicollinearity is the more general term, referring to the situation where one predictor can be substantially predicted from the others.

In a marine environment, temperature and dissolved oxygen are strongly negatively correlated, with colder water holding more oxygen. Including both in a model to predict fish abundance forces the model to assign separate effects to variables that track the same underlying thermal gradient. That is multicollinearity.

Ecology is full of such paired relationships: altitude and temperature, nitrate and phosphate, rainfall and river flow. They co-vary because they share physical or biological underpinnings.

Graham’s paper, Confronting multicollinearity in ecological multiple regression (Graham 2003), is an important reference. It is a methodological review of why collinearity causes trouble in ecological models, how it inflates coefficient uncertainty, and why interpretive confidence can collapse even when the fitted model still predicts reasonably well.

4.2 The Mechanism: Why Shared Variance Destabilises Coefficients

Collinearity’s effects are not intuitive. To see why shared variance causes unstable coefficients, consider what the model is being asked to do.

When two predictors are highly correlated, they explain much of the same variation in the response. But the model must nonetheless assign a separate coefficient to each. There is, in principle, a continuous range of coefficient combinations that fit the data nearly equally well. For example, one predictor can be given more credit and the other less, or the credits can be reversed, with only marginal change in the total fit. Which combination the algorithm picks depends on small differences in the data.

Simply removing one observation, adding a small amount of noise, or resampling the data can cause the algorithm to pick a quite different combination. It is this sensitivity that inflated standard errors measure; the coefficient estimates are real but unreliable because the data do not cleanly constrain them.

Sign reversals follow from the same reasoning. If two predictors both have positive effects on the response, the model may compensate by giving one a negative partial coefficient and the other an inflated positive one, while keeping the total prediction nearly unchanged. The coefficient signs can flip because the algorithm is navigating a poorly constrained estimation problem; the flipped signs say nothing at all about the scientific problem being addressed.

In practice, collinearity is primarily a problem for interpretation: predictions can remain accurate even when individual coefficients are badly estimated.
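The instability described above is easy to reproduce. The sketch below uses simulated data (not the chapter’s seaweed dataset) with two nearly identical predictors, and refits the model on bootstrap resamples. The spread of the resampled slopes is exactly what inflated standard errors are measuring.

```r
# Sketch: coefficient instability under collinearity (simulated data,
# not the chapter's dataset; all values are illustrative).
set.seed(16)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)     # x2 is nearly a copy of x1
y  <- 2 * x1 + 2 * x2 + rnorm(n)   # both predictors truly have positive effects

# Refit the same model on bootstrap resamples and record the two slopes
boot_coefs <- t(replicate(200, {
  i <- sample(n, replace = TRUE)
  coef(lm(y[i] ~ x1[i] + x2[i]))[-1]
}))

apply(boot_coefs, 2, sd)  # large spread: the partial slopes are unstable
range(boot_coefs[, 1])    # often spans both negative and positive values
```

Note how trivially different resamples of the same data yield very different partial slopes, even though the joint fit barely changes.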

4.3 Proxy Variables and Mechanistic Variables

Example 1 introduces this distinction clearly, but the idea belongs in any discussion of collinearity from the start.

A mechanistic variable is one that directly participates in the process of interest, such as temperature acting on enzyme kinetics, light driving photosynthesis, nutrient concentration limiting cell division. A proxy variable is a stand-in that correlates with the mechanism but does not represent it directly. Examples are altitude standing in for temperature, depth as a proxy for light, or distance from shore as a substitute for wave exposure.

Proxy variables are common in ecology because the mechanistic variable is often expensive or impossible to measure at scale. They are convenient, and people are sometimes lazy. They are not wrong to use, but they create a specific interpretive limit in that the coefficient of a proxy cannot be taken as the effect of the mechanism it represents.

Collinearity between a proxy and the mechanism it stands in for is almost guaranteed. Including both altitude and temperature in a plant-growth model is asking the model to separate the effect of a spatial gradient from the thing that gradient primarily affects. The model usually cannot do this reliably. The analysis of the seaweed biodiversity dataset is to some extent guilty of this (Smit et al. 2017).

5 Example 1: Predicting Plant Growth

Suppose we model plant growth using mean annual temperature and altitude. Both enter the model as predictors, but altitude is not a mechanism; it is simply a spatial index. Plant growth responds to temperature, oxygen partial pressure, UV radiation, and growing season length, all of which correlate with altitude. Temperature is closer to the mechanism, but it too is partly a proxy for the combined effects of the elevational gradient.

Including both predictors asks the model to separate a proxy from one of the processes the proxy represents. The outcome is that coefficients become unstable, standard errors increase, and biological interpretation becomes unclear.

Better modelling choices depend on the question. If the hypothesis is physiological, temperature is the more appropriate predictor. If the question is about broad spatial gradients, altitude may be sufficient alone. But fitting both simultaneously is hard to justify and produces results that are difficult to defend.

6 Example 2: Nutrient Limitation in Coastal Systems

Phytoplankton production depends on nutrient supply. If we model it using both nitrate and phosphate, we typically encounter collinearity, because both nutrients are supplied by the same upwelled or remineralised water masses. Both predictors are also mechanistically meaningful, but unlike in Example 1, this is collinearity between two real variables, not between a proxy and a mechanism.

The model may fit well overall, but individual coefficients become unstable and hard to interpret. The model cannot distinguish how much of the phytoplankton response is due to nitrate specifically and how much to phosphate, because both vary together. Both nutrients are in fact required for phytoplankton growth, but which one most constrains production is context specific: it depends on demand relative to supply, so limitation shifts according to which nutrient is in shortest effective supply at that time and place. This is the principle of Liebig’s Law of the Minimum, and it is why the apparent response to one nutrient or the other often depends on their balance rather than on either concentration alone.

Possible resolutions depend on the goal. If the question concerns stoichiometric limitation, an N:P ratio may be more informative than either nutrient separately. If the question is about prediction rather than mechanism, PCA or another dimension-reduction approach can represent the broader nutrient gradient in a single axis. If one nutrient is theoretically limiting based on Liebig’s law, use that one and leave the other out.

This example shows that collinearity is not always a matter of bad variable choices. In fact, both variables do drive phytoplankton growth! So, collinearity can arise between the most sensible predictors imaginable. The decision is how to handle it given the goal. Throughout, it reinforces the notion that doing good science involves reliance on our deep understanding of the governing scientific principles behind our study system, as well of statistical model building.
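If the goal is to represent the broader nutrient gradient rather than either nutrient alone, the dimension-reduction option mentioned above can be sketched in a few lines. The data below are simulated (the loosely Redfield-like 16:1 proportionality is illustrative, not fitted to any real dataset).

```r
# Sketch: collapsing two correlated nutrients into a single gradient axis
# (simulated data; the nitrate-phosphate relationship is illustrative).
set.seed(104)
nitrate   <- rnorm(100, mean = 12, sd = 3)
phosphate <- 0.0625 * nitrate + rnorm(100, sd = 0.15)  # ~16:1 N:P plus noise

nut_pca <- prcomp(cbind(nitrate, phosphate), scale. = TRUE)
summary(nut_pca)$importance["Proportion of Variance", ]
# PC1 carries most of the shared nutrient signal; nut_pca$x[, 1] can then
# serve as a single "nutrient gradient" predictor in place of both variables.
```

The trade-off is interpretive: PC1 describes the joint gradient, so its coefficient cannot be read as the effect of either nutrient specifically.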

7 Diagnosing Collinearity

The standard diagnostic tool is the Variance Inflation Factor (VIF):

\[\text{VIF}_j = \frac{1}{1 - R_j^2} \tag{1}\]

In Equation 1, \(R_j^2\) is the \(R^2\) obtained by regressing predictor \(X_j\) on all the remaining predictors in the model. It measures how well \(X_j\) can be predicted from its companions; in other words, how much of \(X_j\)’s variance is already captured by the other predictors in the model.

When \(R_j^2\) is high, \(X_j\) carries very little information that is not already present elsewhere. A \(R_j^2\) of 0.9 means 90% of \(X_j\)’s variation is redundant given the other predictors. The VIF for that predictor would be \(1/(1 - 0.9) = 10\): the estimated variance of \(\hat{\beta}_j\) is ten times what it would be if \(X_j\) were uncorrelated with all other predictors. That is the inflation in uncertainty that collinearity imposes.

We rely on rough thresholds when deciding which variables to discard: a VIF below about 5 is generally manageable; above about 5 to 10, interpretation of individual coefficients becomes difficult and the predictor should be flagged. These are guides, not hard rules. Whether to act on a high VIF depends on how strongly the analysis relies on individual coefficients for interpretation.

Although the VIF is a useful number, it does not diagnose a problem on its own. A high VIF means a coefficient is imprecisely estimated due to overlap. Whether that is something you need to be concerned with depends on whether the model is intended for explanation or prediction.
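The mechanics of Equation 1 can be confirmed by hand. The sketch below uses simulated predictors (not the chapter’s data) and cross-checks the manual calculation against the standard identity that VIFs are the diagonal of the inverse correlation matrix.

```r
# Sketch: computing VIF by hand from Equation 1 (simulated predictors).
set.seed(7)
X <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200)
)
X$x3 <- 0.8 * X$x1 + 0.6 * X$x2 + rnorm(200, sd = 0.5)  # x3 overlaps with x1, x2

# VIF for x3: regress it on the remaining predictors, then apply 1 / (1 - R^2)
r2_x3  <- summary(lm(x3 ~ x1 + x2, data = X))$r.squared
vif_x3 <- 1 / (1 - r2_x3)

# Cross-check: VIFs equal the diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(X)))
names(vif_all) <- colnames(X)
c(manual = vif_x3, from_matrix = unname(vif_all["x3"]))  # the two agree
```

The same manual regression is what `car::vif()` performs internally for each predictor in turn.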

Note: Collinearity, Sample Size, and Model Complexity

Collinearity is more damaging in small samples. With many observations and a moderate correlation between predictors, the model has enough information to partially separate their contributions, and the inflated standard errors remain acceptable. With a small dataset and strongly correlated predictors, the estimation problem becomes very underconstrained. As a rough guide, collinearity that is tolerable with 200 observations can be crippling with 30.

When building models in small datasets, choose your predictors conservatively. The degrees of freedom available to estimate each additional parameter (see the discussion in Chapter 15) are quickly exhausted, and collinearity makes that situation worse.
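The sample-size point in the note above can be checked directly. The sketch below holds the predictor correlation fixed (0.9, an illustrative value) and compares the average standard error of a partial slope at n = 30 versus n = 200, using simulated data.

```r
# Sketch: the same predictor correlation hurts far more in small samples
# (simulated data; the correlation of 0.9 is illustrative).
se_of_slope <- function(n, r = 0.9) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)  # fixed correlation with x1
  y  <- x1 + x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}

set.seed(30)
mean(replicate(500, se_of_slope(30)))    # large standard error
mean(replicate(500, se_of_slope(200)))   # considerably smaller
```

The inflation never disappears, but with enough observations it can shrink to a tolerable level.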

8 Example 3: Worked Collinearity Example

Using the seaweed dataset from Chapter 14, we inspect the candidate climate predictors for overlap.

8.1 Inspect the Predictor Set

sw <- read.csv(here::here("data", "BCB743", "seaweed", "spp_df2.csv"))
sw_ectz <- sw |>
  filter(bio == "ECTZ") |>
  select(Y, augMean, febRange, febSD, augSD, annMean)

cor(sw_ectz[, -1])
           augMean  febRange     febSD     augSD   annMean
augMean  1.0000000 0.6678245 0.5039666 0.4603614 0.9711458
febRange 0.6678245 1.0000000 0.9212490 0.5415907 0.6860398
febSD    0.5039666 0.9212490 1.0000000 0.5744316 0.5533197
augSD    0.4603614 0.5415907 0.5744316 1.0000000 0.5851171
annMean  0.9711458 0.6860398 0.5533197 0.5851171 1.0000000

The correlation matrix already shows that several of the candidate predictors overlap strongly. Visually, this pattern is even more revealing.

Code
ggplot(sw_ectz, aes(x = augMean, y = annMean)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = TRUE, colour = "magenta", linewidth = 0.8) +
  labs(
    x = "August mean temperature (°C)",
    y = "Annual mean temperature (°C)"
  ) +
  theme_grey()
Figure 1: August mean temperature and annual mean temperature are tightly linearly related in the ECTZ subset. Two variables that fall along a line like this carry largely the same information; including both in a regression forces the model to partition the same signal between two parameters.

Figure 1 shows why collinearity is visually intuitive once you look for it. The two temperature variables follow almost the same line. A regression model that includes both must somehow divide this common variation between two separate coefficients, and as the mechanism section above explains, that division is unreliable. Mechanistically, you can also see why the collinearity is so strong: the two predictors are not independent, because the August mean temperature is itself a component of the annual mean signal.

8.2 Diagnose the Overlap with VIF

col_mod <- lm(Y ~ augMean + febRange + febSD + augSD + annMean,
              data = sw_ectz)

vif(col_mod)
  augMean  febRange     febSD     augSD   annMean 
27.947767 10.806635  8.765732  2.497739 31.061900 

The VIF values confirm what the scatter and correlation matrix suggested. Predictors with VIF values well above 5-10 have \(R_j^2\) values close to 1, so most of their variation is already explained by the other predictors in the model, and therefore their coefficients carry disproportionately large uncertainty.

8.3 Compare Simple and Multiple-Regression Slopes

The comparison below is a clear illustration of what collinearity actually does in practice.

preds <- c("augMean", "febRange", "febSD", "augSD", "annMean")

simple_models <- purrr::map(preds, ~ lm(as.formula(paste("Y ~", .x)),
                                        data = sw_ectz))

simple_slopes <- purrr::map2_dfr(simple_models, preds, \(mod, nm) {
  tidy(mod) |>
    filter(term != "(Intercept)") |>
    mutate(model = "Simple regression",
           predictor = nm)
})

multiple_slopes <- tidy(col_mod) |>
  filter(term != "(Intercept)") |>
  mutate(model = "Multiple regression",
         predictor = term)

bind_rows(simple_slopes, multiple_slopes) |>
  select(model, predictor, estimate, std.error, statistic, p.value)
# A tibble: 10 × 6
   model               predictor estimate std.error statistic   p.value
   <chr>               <chr>        <dbl>     <dbl>     <dbl>     <dbl>
 1 Simple regression   augMean    0.346     0.0109     31.7   6.68e- 96
 2 Simple regression   febRange   0.182     0.00890    20.4   8.31e- 58
 3 Simple regression   febSD      0.172     0.0124     13.8   1.56e- 33
 4 Simple regression   augSD      0.0879    0.00720    12.2   6.68e- 28
 5 Simple regression   annMean    0.332     0.00867    38.3   6.82e-115
 6 Multiple regression augMean   -0.0799    0.0426     -1.87  6.18e-  2
 7 Multiple regression febRange   0.113     0.0159      7.08  1.14e- 11
 8 Multiple regression febSD     -0.0572    0.0166     -3.45  6.37e-  4
 9 Multiple regression augSD      0.00302   0.00489     0.619 5.36e-  1
10 Multiple regression annMean    0.323     0.0416      7.76  1.59e- 13

We see that predictors that show strong, stable slopes in simple regressions become weaker, unstable, or sign-reversed once the overlapping predictors are fitted simultaneously.

The model is not necessarily wrong as far as its mathematics is concerned, since it is correctly trying to estimate partial effects given the available data. But those partial effects are being estimated under much greater uncertainty than any single slope suggests, because the model cannot cleanly separate what each predictor contributes. This creates problems when we need to explain the findings in terms of the underlying biological or ecological theory.

8.4 Interpret the Collinearity Problem

The difficulty lies in attribution, not in the existence of a fitted model. Several predictors here represent overlapping components of the same underlying climatic gradient. The model cannot assign a stable partial effect to any one of them. This is why collinearity is so often an interpretive problem rather than a purely computational one.

8.5 Reporting

Note: Write-Up

Methods

Candidate climate predictors were screened for overlap using a correlation matrix, a scatterplot of the most strongly correlated pair, and variance inflation factors. Simple and multiple regressions were then compared to assess how strongly collinearity altered the apparent slopes of individual predictors.

Results

Several climate predictors in the seaweed model were strongly correlated, and the VIF analysis showed substantial overlap among them. Predictors that showed clear positive slopes in isolation became weaker or less stable once the overlapping variables were fitted together, demonstrating that the apparent partial effects were sensitive to collinearity.

Discussion

The model is not necessarily invalid, but coefficient-level interpretation becomes difficult when predictors represent overlapping components of the same underlying gradient. That is the key discussion point for a collinearity analysis.

9 Addressing Collinearity

Common responses include:

  • removing one of the correlated predictors, choosing based on which better matches the biological question;
  • using theory to select the variable closest to the mechanism of interest;
  • combining correlated predictors into a composite index or ordination axis when the question concerns the overall gradient rather than any single variable;
  • accepting some overlap if prediction is the goal, while avoiding strong causal language about individual coefficients;
  • using regularisation methods such as ridge regression or the lasso when predictive stability is the primary objective.

No one VIF threshold automatically resolves the problem. The decision rests on whether the model needs interpretable individual coefficients. Ridge regression, for example, addresses collinearity by shrinking coefficients towards zero; it trades a small increase in bias for a large reduction in variance, making coefficients more stable but no longer unbiased. In Chapter 25, I will demonstrate the regularisation methods.

Note: Scale and Transformation

Collinearity can change under transformation. Two predictors that are strongly correlated on arithmetic scales may be less correlated after log transformation, particularly when their relationship is multiplicative rather than additive. Before concluding that collinearity is an unavoidable feature of a dataset, consider whether a transformation suggested by the biology might also reduce the overlap between predictors.
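The point that predictor correlation is not invariant under transformation is easy to illustrate. The toy construction below (entirely simulated; not from the chapter’s data) pairs a strongly right-skewed predictor with a second variable that tracks it: a handful of large values dominates the raw-scale Pearson correlation, and compressing the scale with logs changes the picture.

```r
# Sketch: predictor correlation changes under log transformation
# (toy simulated data; the construction is illustrative only).
set.seed(254)
n <- 1000
a <- rlnorm(n, meanlog = 0, sdlog = 1.5)  # skewed environmental predictor
b <- a + rexp(n)                          # second predictor tracking `a`

cor_raw <- cor(a, b)            # very high: a few large values dominate
cor_log <- cor(log(a), log(b))  # lower once the scale is compressed
c(raw = cor_raw, log = cor_log)
```

The direction and size of the change depend entirely on the data-generating process, which is why the note recommends letting the biology, not the diagnostics, suggest the transformation.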

10 Confounding

10.1 What Is Confounding?

Confounding is a different problem from collinearity.

Collinearity is about overlap among predictors that are already in the model. The variables are measured, present, and causing estimation difficulties because they share information.

Confounding is about a variable that is not in the model (either measured but omitted, or never measured at all) that influences both the predictor of interest and the response. The confounder creates or distorts the apparent association between the predictor and response, because the model attributes to the predictor an effect that is partly or wholly due to the omitted variable.

A confounder does not need to be strongly correlated with the predictor to cause problems. Even a modest association between the confounder and both variables can shift the estimated coefficient substantially.
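This point can be made concrete before the fuller worked example that follows. In the sketch below (simulated data, illustrative values), the predictor has no true effect at all, and its correlation with the confounder is only about 0.3, yet the omitted-variable bias produces a clearly non-zero slope.

```r
# Sketch: a weakly correlated confounder can still shift a coefficient
# (simulated data; contrast with the strongly correlated case in Example 4).
set.seed(266)
n    <- 500
conf <- rnorm(n)
x    <- 0.3 * conf + rnorm(n)        # modest x-confounder correlation (~0.3)
y    <- 0 * x + 3 * conf + rnorm(n)  # x has NO true effect; conf drives y

coef(lm(y ~ x))["x"]          # clearly non-zero: omitted-variable bias
coef(lm(y ~ x + conf))["x"]   # close to the true value of zero
```

The bias scales with the confounder’s effect on the response, not just with its correlation with the predictor, which is why even modest associations can matter.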

10.2 Confounding and Causal Interpretation

Confounding is fundamentally a causal concept, even though regression is not a causal tool by itself. The concern is whether the apparent effect of a predictor reflects a genuine influence on the response, or whether it reflects the action of an unmeasured third variable.

In field ecology, true causal claims require experimental manipulation or, at minimum, careful design that separates the process of interest from associated variables. Regression with observational data can identify associations and quantify how much they change when additional variables are included. It cannot establish causality on its own. The language to use is “the apparent effect of temperature after accounting for nutrient concentration,” not “the causal effect of temperature.” Whether the remaining partial effect reflects a true temperature mechanism requires domain knowledge and careful study design to assess.

11 Example 4: Worked Confounding Example

Suppose we want to know whether seawater temperature affects algal productivity along a coastal gradient, but nutrient concentration also varies with temperature and itself drives productivity. In that situation, a temperature-only model attributes to temperature some of the variation that nutrient concentration actually produces. The temperature coefficient is confounded.

set.seed(144)

n <- 120
nutrient <- rnorm(n, mean = 10, sd = 2)
temperature <- 15 + 0.9 * nutrient + rnorm(n, sd = 1.2)
productivity <- 4 + 1.4 * nutrient + 0.15 * temperature + rnorm(n, sd = 1.5)

conf_dat <- tibble(
  nutrient = nutrient,
  temperature = temperature,
  productivity = productivity
)

11.1 Do an Exploratory Data Analysis (EDA)

conf_dat |>
  summarise(
    mean_temperature = mean(temperature),
    sd_temperature = sd(temperature),
    mean_nutrient = mean(nutrient),
    sd_nutrient = sd(nutrient),
    mean_productivity = mean(productivity),
    sd_productivity = sd(productivity)
  )
# A tibble: 1 × 6
  mean_temperature sd_temperature mean_nutrient sd_nutrient mean_productivity
             <dbl>          <dbl>         <dbl>       <dbl>             <dbl>
1             23.7           2.26          9.84        2.04              21.3
# ℹ 1 more variable: sd_productivity <dbl>
cor(conf_dat)
              nutrient temperature productivity
nutrient     1.0000000   0.8403428    0.9035191
temperature  0.8403428   1.0000000    0.8066436
productivity 0.9035191   0.8066436    1.0000000
Code
p1 <- ggplot(conf_dat, aes(x = temperature, y = productivity)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  labs(x = "Temperature", y = "Algal productivity") +
  theme_grey()

p2 <- ggplot(conf_dat, aes(x = nutrient, y = productivity)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  labs(x = "Nutrient concentration", y = "Algal productivity") +
  theme_grey()

p3 <- ggplot(conf_dat, aes(x = nutrient, y = temperature)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  labs(x = "Nutrient concentration", y = "Temperature") +
  theme_grey()

ggarrange(p1, p2, p3, ncol = 3)
Figure 2: A confounding situation where nutrient concentration is associated with both temperature and productivity.

The EDA in Figure 2 already shows the confounding structure. Temperature and nutrient concentration are strongly positively correlated (\(r = 0.84\)), and both are positively related to algal productivity. A temperature-only model will therefore absorb part of the nutrient effect into the temperature coefficient.

11.2 State the Model Question and Hypotheses

The biological question is not just whether temperature and productivity are associated. It is whether temperature still explains productivity after nutrient concentration has been accounted for.

\[H_{0}: \beta_{\text{temperature}} = 0\] \[H_{a}: \beta_{\text{temperature}} \ne 0\]

Here \(\beta_{\text{temperature}}\) refers to the partial effect of temperature in the model that also includes nutrient concentration. If the temperature coefficient changes strongly when nutrient is added, that change is evidence of confounding: the original temperature result was absorbing nutrient variation.

11.3 Fit Competing Models

mod_temp_only <- lm(productivity ~ temperature, data = conf_dat)
mod_with_nutrient <- lm(productivity ~ temperature + nutrient, data = conf_dat)

conf_compare <- bind_rows(
  tidy(mod_temp_only) |>
    filter(term != "(Intercept)") |>
    mutate(model = "Temperature only"),
  tidy(mod_with_nutrient) |>
    filter(term != "(Intercept)") |>
    mutate(model = "Temperature + nutrient")
) |>
  select(model, term, estimate, std.error, statistic, p.value)

conf_compare
# A tibble: 3 × 6
  model                  term        estimate std.error statistic  p.value
  <chr>                  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 Temperature only       temperature    1.27     0.0854     14.8  1.02e-28
2 Temperature + nutrient temperature    0.253    0.112       2.25 2.61e- 2
3 Temperature + nutrient nutrient       1.33     0.124      10.7  3.99e-19
anova(mod_temp_only, mod_with_nutrient)
Analysis of Variance Table

Model 1: productivity ~ temperature
Model 2: productivity ~ temperature + nutrient
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    118 522.57                                 
2    117 263.30  1    259.26 115.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

11.4 Interpret the Results

In the temperature-only model, algal productivity appears to increase strongly with temperature (\(\beta = 1.27\), SE = 0.09, \(p < 0.001\)). Once nutrient concentration is included, the estimated temperature effect shrinks to about 0.25 (SE = 0.11, \(p = 0.026\)), while nutrient concentration has a strong positive effect (\(\beta = 1.33\), SE = 0.12, \(p < 0.001\)).

The temperature-only model was answering the wrong biological question. It estimated the marginal association between temperature and productivity, which includes everything that co-varies with temperature, including nutrient concentration. The partial temperature coefficient in the richer model is the effect of temperature after holding nutrient concentration constant, which is closer to the physiological question.

So, we see that the coefficient did not just shrink; it changed its meaning.

11.5 Practical Implications of Confounding

Confounding arises in field studies when variables of biological importance are omitted from the model, whether because they were not measured, not recognised, or excluded by an overly simple design. The best defence is domain knowledge: before fitting a model, ask which unmeasured variables could influence both the predictor and the response, and then measure or control for them. Statistical adjustment after the fact, as illustrated here, is only possible when the confounder is measured.

Note: Confounding, Interaction, and Scale

Confounding and interaction are sometimes confused. An interaction means the effect of one predictor depends on the value of another. Both predictors are present in the model, and their joint effect is the point (see Chapter 15). A confounder is a third variable whose omission distorts the estimated effect of the predictor of interest.

Confounding can also change under transformation. If the true relationship between a confounder and the response is multiplicative, a log transformation of the response may reduce the confounder’s influence on the temperature coefficient because the scale on which the relationship is additive has changed. This can alter the apparent magnitude of the confounding effect, but doesn’t resolve the scientific problem.

11.6 Reporting

Note: Write-Up

Methods

A simulated coastal algal-productivity dataset was used to examine confounding by nutrient concentration in a temperature-productivity relationship. A simple linear regression of productivity on temperature alone was compared with a multiple regression that included both temperature and nutrient concentration, and the fitted temperature coefficient was evaluated for change between the two models.

Results

In the temperature-only model, algal abundance appeared to increase strongly with temperature (\(\beta = 1.27 \pm 0.09\) SE, \(t = 14.83\), \(p < 0.001\)). However, once nutrient concentration was included, the fitted temperature effect decreased markedly to \(\beta = 0.25 \pm 0.11\) SE (\(t = 2.25\), \(p = 0.026\)), while nutrient concentration had a strong positive effect on abundance (\(\beta = 1.33 \pm 0.12\) SE, \(t = 10.73\), \(p < 0.001\)). Temperature and nutrient concentration were also strongly correlated (\(r = 0.84\)), indicating that the simple temperature-abundance relationship was substantially confounded by nutrient availability.

Discussion

The temperature-only model estimated the marginal temperature-abundance association, which conflated temperature effects with nutrient effects. The partial temperature coefficient in the richer model is a more defensible estimate of the temperature effect given this dataset, though a causal interpretation would require an experimental design that independently manipulates temperature and nutrients.

12 Measurement Error

Standard linear regression assumes that predictors are measured without serious error. That assumption is almost never exactly true in the field. Instruments are imperfect, environmental conditions fluctuate between sensor and organism, and the temporal or spatial scale of a measurement may not match the scale of the biological process.

When a predictor is measured with substantial noise, the estimated regression slope is biased towards zero. This is called attenuation bias. The model is relating the response to an imperfect version of the underlying variable, and the slope is therefore pulled downward. The effect is real but appears weaker than it is.
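This attenuation can be stated precisely. Under the classical additive error model, where the observed predictor is \(W = X + U\) with error \(U\) independent of \(X\), the expected OLS slope is the true slope scaled by the reliability ratio:

\[
\mathrm{E}[\hat{\beta}] \approx \beta \cdot \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2},
\]

where \(\sigma_X^2\) is the variance of the true predictor and \(\sigma_U^2\) is the variance of the measurement error. The ratio is at most 1, so classical error can only pull the slope towards zero, never inflate it.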

Attenuation bias must be considered for three reasons. It weakens the apparent effect of a predictor that may genuinely be important. It reduces statistical power, increasing the risk of false negatives. And it creates an asymmetry because predictors measured with less precision appear less important than those measured more accurately, which can mislead decisions about which variables to include.

13 Example 5: Worked Measurement-Error Example

Suppose a response truly depends on temperature, but the observed temperature values are noisy because of instrument error or a mismatch between sensor location and the thermal environment experienced by the organism.

13.1 Simulate the True and Noisy Predictors

# Packages assumed loaded earlier in the chapter:
# library(tidyverse); library(broom)
set.seed(414)

n <- 150
true_temp <- runif(n, min = 8, max = 24)
observed_temp <- true_temp + rnorm(n, sd = 4.5)
growth <- 1.2 + 0.55 * true_temp + rnorm(n, sd = 1.2)

me_dat <- tibble(
  true_temp = true_temp,
  observed_temp = observed_temp,
  growth = growth
)

mod_true <- lm(growth ~ true_temp, data = me_dat)
mod_observed <- lm(growth ~ observed_temp, data = me_dat)

bind_rows(
  tidy(mod_true) |> mutate(model = "True predictor"),
  tidy(mod_observed) |> mutate(model = "Observed noisy predictor")
) |>
  filter(term != "(Intercept)") |>
  select(model, term, estimate, std.error, statistic, p.value)
# A tibble: 2 × 6
  model                    term          estimate std.error statistic  p.value
  <chr>                    <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 True predictor           true_temp        0.555    0.0210      26.4 5.89e-58
2 Observed noisy predictor observed_temp    0.340    0.0293      11.6 1.43e-22

The measurement error standard deviation (4.5 °C) is set close to the standard deviation of the true predictor (≈ 4.6 °C), giving a noise-to-signal variance ratio near 1:1. Under classical measurement error, this implies that the fitted slope for the noisy predictor should be substantially attenuated relative to the true slope, with an expected reduction of roughly one-half. The fitted value is consistent with this expectation, though not identical, reflecting ordinary sampling variability in a finite dataset.
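As a back-of-envelope check on this expectation, the reliability ratio can be computed directly from the simulation settings used above:

```r
# Reliability ratio = var(true X) / (var(true X) + var(error))
var_x  <- (24 - 8)^2 / 12     # variance of Uniform(8, 24): about 21.3
var_u  <- 4.5^2               # measurement-error variance: 20.25
lambda <- var_x / (var_x + var_u)
round(lambda, 2)              # about 0.51
round(0.55 * lambda, 2)       # expected attenuated slope: about 0.28
```

The fitted slope of 0.340 sits a little above this expected value but well below the true 0.55, which is the sort of deviation a single simulated dataset of 150 observations can reasonably produce.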

13.2 Compare the Fitted Relationships

p3 <- ggplot(me_dat, aes(x = true_temp, y = growth)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = TRUE, colour = "magenta", linewidth = 0.8) +
  scale_x_continuous(limits = c(0, 30)) +
  labs(x = "True temperature (°C)", y = "Growth", title = "True predictor")

p4 <- ggplot(me_dat, aes(x = observed_temp, y = growth)) +
  geom_point(alpha = 0.7, colour = "dodgerblue4") +
  geom_smooth(method = "lm", se = TRUE, colour = "magenta", linewidth = 0.8) +
  scale_x_continuous(limits = c(0, 30)) +  
  labs(x = "Observed temperature (°C)", y = "Growth", title = "Noisy predictor")

ggarrange(p3, p4, ncol = 2, labels = "AUTO")
Figure 3: A) shows growth regressed against the true temperature, with a tight, steeply rising relationship. B) uses the same response but replaces the true predictor with a noisy version whose measurement error is comparable in magnitude to the signal itself. The slope is reduced and the cloud is visibly broader in the x-direction. The biology has not changed; only the measurement quality has.

Figure 3 makes the effect obvious. The left panel shows what the relationship looks like when temperature is measured at the organism’s scale: a clear, steep line. The right panel shows what happens when the sensor is poorly placed, averaged over the wrong spatial scale, or simply imprecise: the slope is approximately halved and the scatter is much wider. The fitted line in the right panel would lead you to conclude that temperature is a weaker driver than it actually is. Both panels use the same 150 observations and the same biological process, so the difference is entirely one of measurement quality.

13.3 Interpret the Measurement-Error Problem

The lesson is that a weak fitted slope is not necessarily evidence of a weak biological relationship. It may instead simply reflect poor measurement.

This concern must be taken seriously when comparing studies. A study that measured temperature precisely at the organism’s microhabitat will estimate a steeper, more significant slope than one that used remote-sensed data or a regional weather station. If measurement quality is ignored, the difference in results will be attributed to biology, when it may be largely an artefact of data quality or of a mismatch in measurement scale.

Practically, you’d want to invest in measurement quality, replicate measurements to estimate and potentially correct for noise, and be explicit about the precision of predictors when reporting results.

13.4 Reporting

NoteWrite-Up

Methods

A simulation was used to examine the effect of measurement error in a predictor variable by comparing the fitted relationship obtained from the true predictor with that obtained from a noisy observed version of the same predictor.

Results

Adding measurement error to the predictor reduced the apparent slope of the fitted relationship and weakened the visual strength of the association, illustrating the characteristic attenuation expected when predictors are measured with error.

Discussion

Weak fitted effects do not always imply weak biology. Poor measurement systematically flattens otherwise meaningful relationships. That is the interpretive warning to carry into any Discussion section that relies on observational field data.

13.5 Model II Regression as a Solution to Measurement Error

Ordinary least squares (OLS) estimates the conditional mean of a response given a predictor under an error structure concentrated in the response. This is appropriate when the predictor is measured with relatively high precision. When the predictor contains substantial measurement error, the fitted slope is attenuated because variability in the predictor is treated as noise rather than signal.

Model II regression provides an alternative when both variables contain measurement error. Two commonly used forms are major axis (MA) regression and standardised major axis (SMA) regression. These methods fit a line that reflects the shared variation between the variables rather than a directional dependence of one on the other.

Model II approaches are useful when the goal is to describe the relationship between two noisy variables and when neither can be treated as measured without error. They can therefore serve as a practical complement when attenuation bias is a concern.

However, the interpretation differs from OLS. The fitted line describes the structure of association rather than the conditional effect of a predictor on a response. For explanatory models, improving measurement quality or explicitly modelling the error process remains the more direct solution.
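As a sketch of how such a fit might be obtained in R (assuming the `lmodel2` package is installed; `smatr::sma()` is an alternative for SMA fits), using the simulated data from Example 5:

```r
library(lmodel2)

# Fit OLS, MA, and SMA lines to the noisy-predictor data; the MA and SMA
# slopes describe the major axis of shared variation, not E[y | x]
m2 <- lmodel2(growth ~ observed_temp, data = me_dat)
m2$regression.results
```

Note that the SMA slope is \(\mathrm{sign}(r)\, s_y / s_x\), so it will generally be steeper than the attenuated OLS slope; it answers a different question (the line of best symmetric fit) rather than recovering the causal slope.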

14 Revisiting Proxy Variables

Proxy variables appear throughout ecology, and by this point in the chapter several examples have already appeared: altitude in the plant-growth model, distance from shore as a proxy for exposure, chlorophyll-a as a proxy for productivity, body size as a proxy for condition or age.

Let’s integrate the consequences for interpretation.

A proxy is always an imperfect representation of the process of interest. It may correlate well with the mechanism, but it also correlates with other processes along the same gradient. Altitude covaries with temperature, radiation, precipitation, and soil properties. Depth in the ocean covaries with light, pressure, and oxygen. A coefficient estimated for a proxy variable describes the gradient well, but it says little about the driving mechanism.

This has a direct consequence for interpretation: a significant coefficient for a proxy variable tells you that the gradient matters. It does not tell you which component of that gradient is the active process, so to identify the mechanism you still need to measure it directly.

NoteMechanism Versus Proxy

If the question is physiological, use a physiological variable where possible. If the question concerns a broad spatial or environmental gradient, a proxy may be perfectly appropriate. The important thing is that the variable should match the scale and aim of the question, and the coefficient should be interpreted at the same scale, not treated as a mechanistic effect.

When a proxy and its mechanistic counterpart are both in the model, collinearity is almost guaranteed. The overlap between altitude and temperature, or between distance and exposure, is a consequence of the proxy relationship. This is why proxy variables belong in the collinearity discussion and not only in a separate section: the interpretive problems they create are inseparable from the instability problems that collinearity creates.

15 A Practical Workflow

When building and interpreting a regression model, the following steps address the problems in this chapter before they become interpretive traps.

  1. State the biological hypothesis first. Is the question about a mechanism or a pattern? Your answer will determine which variables belong in the model.
  2. Identify which predictors are mechanistic and which are proxies. Use mechanistic variables when the question is explanatory; proxies are acceptable for description or prediction.
  3. Ask which variables could confound the relationship of interest. If a plausible confounder was not measured, that is a design limitation that should be stated.
  4. Inspect predictor correlations and calculate VIF before fitting the full model. Strongly correlated predictors should not both enter the model without justification.
  5. Compare simpler and more fully-specified models when confounding is plausible. A substantial change in a coefficient when a new variable is added is evidence of confounding, not only of model improvement.
  6. Remove or combine redundant predictors when interpretation is the goal. If prediction is the goal and collinearity is moderate, some overlap may be tolerable, but then you should avoid strong mechanistic claims about individual coefficients.
  7. Be honest about measurement quality. If an important predictor was measured with substantial noise, it will limit the strength of the conclusions, and this should be acknowledged.
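The correlation and VIF inspection in step 4 can be sketched with the `car` package (assumed available; here reusing the simulated `me_dat` from Example 5, where the two temperature variables share variance by construction):

```r
# Pairwise correlation between the two versions of the predictor
cor(me_dat$true_temp, me_dat$observed_temp)

# VIF for a model that (unwisely) includes both; with two predictors,
# each VIF equals 1 / (1 - r^2)
library(car)
vif(lm(growth ~ true_temp + observed_temp, data = me_dat))
```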

An additional benefit of these steps is that they clarify the explanation–prediction distinction that underpins most decisions in this chapter. Collinearity, confounding, and measurement error all weaken explanation, and sometimes all three are present in the same dataset! A collinear model may still predict well, and an attenuated slope may still rank observations correctly, but the problems become serious when the question is biological interpretation rather than forecasting.

16 Summary

  • Collinearity makes coefficients unstable because predictors share variance. The model cannot cleanly partition the same signal between two parameters, so small data changes produce large coefficient changes, and sign reversals are possible.
  • VIF quantifies this instability. An \(R_j^2\) close to 1 means the predictor contributes little unique information and its coefficient variance is correspondingly inflated.
  • Proxy variables and mechanistic variables are susceptible to collinearity when both enter the same model, but their interpretive constraints differ: a proxy coefficient describes a gradient, not the mechanism.
  • Confounding is a problem of attribution. An omitted variable that influences both the predictor and the response distorts the apparent effect, and correction requires measuring the confounder and including it in the model.
  • Measurement error biases slopes towards zero, so weak effects may reflect poor measurement rather than weak biology.
  • These problems must be handled, and where possible avoided, primarily through strong biological reasoning (first), careful variable selection, and honest acknowledgement of what the data cannot establish.

In this chapter I looked at model construction; the next chapter focuses on model evaluation, including diagnostics, model comparison, and the question of whether a fitted model is actually behaving well.

References

Graham MH (2003) Confronting multicollinearity in ecological multiple regression. Ecology 84:2809–2815.
Smit AJ, Bolton JJ, Anderson RJ (2017) Seaweeds in two oceans: Beta-diversity. Frontiers in Marine Science 4:404.


Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {16. {Collinearity,} {Confounding,} and {Measurement} {Error}},
  date = {2026-04-07},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/16-collinearity-confounding-measurement-error.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 16. Collinearity, Confounding, and Measurement Error. https://tangledbank.netlify.app/BCB744/basic_stats/16-collinearity-confounding-measurement-error.html.