15. Collinearity, Confounding, and Measurement Error
Three Threats to Interpretable Regression Models
- what collinearity is and why it matters;
- how confounding differs from collinearity;
- why measurement error weakens inference;
- how proxy variables complicate interpretation;
- what a practical response looks like when these problems appear.
1 Introduction
Regression models are often taught as though the difficult part is fitting the correct formula and reading the summary table. In practice, some of the hardest problems come afterwards. Predictors may overlap strongly, causal roles may be confused, and the variables we fit may be noisy measurements or only rough proxies for the processes we care about.
This chapter brings together three closely related but distinct problems:
- collinearity, where predictors share too much information;
- confounding, where a third variable distorts interpretation; and
- measurement error, where the variables in the model are noisy versions of the quantities we would really like to observe.
All three problems matter because they erode interpretation. A model may still run and produce seemingly believable coefficients, but those coefficients may no longer mean what we would like them to mean.
In multiple regression and other multivariate models, we generally hope that the predictor variables provide distinct information. When they do not, our models can become unstable and difficult to interpret. This issue, known as collinearity or multicollinearity, is a common challenge in biological research. It is especially common in ecology because many environmental variables are linked through shared physical and biological processes: temperature, oxygen, and productivity, for example, often co-vary.
This chapter explains what collinearity is, why it is problematic, and how to diagnose and address it. It also explains why collinearity is not the same as confounding, and why measurement error and proxy variables create further limitations even when the fitted model looks statistically sound. While no single paper is as singularly famous here as Hurlbert’s work on pseudoreplication, a foundational and highly recommended review of multicollinearity is that of Graham (2003).
2 Key Concepts
Keep the following distinctions clear.
- Collinearity / Multicollinearity is a condition in multiple regression where two or more predictor variables are highly correlated, making it difficult for the model to separate their individual effects.
- Problem of interpretation, not prediction The main drawback of collinearity is that it makes the model’s coefficients unreliable and difficult to interpret, even when the model predicts the outcome well.
- Variance Inflation Factor (VIF) is the standard diagnostic tool for detecting multicollinearity. A high VIF indicates that the variance of a coefficient is being inflated by overlap with other predictors.
- Confounding, on the other hand, is a problem of attribution: a third variable influences both a predictor of interest and the response, distorting what the coefficient appears to mean.
- Measurement error weakens inference and can bias estimated effects, often towards zero.
- Proxy variables can be very useful, but they come with interpretive limits.
- Good design and good variable choice solve more of these problems than software does.
3 Why These Problems Matter
It is useful to state the broader issue clearly. In the regression chapters so far, we have assumed that the predictors in a model can be interpreted as though each one contributes something identifiable to the response. But that assumption is often too optimistic, since several things can go wrong:
- Two predictors may describe almost the same underlying process.
- A variable may appear important only because another, unmeasured variable has been omitted.
- A predictor may be measured with enough error that its effect is weakened and unstable.
- A fitted coefficient may describe a proxy rather than a mechanism.
All of these problems affect what the model allows us to say biologically.
4 R Functions
The most useful functions in this chapter are:
- cor() and plots of predictor relationships for identifying overlap;
- car::vif() for diagnosing collinearity in a fitted model;
- lm() for comparing models that omit or include potentially confounding variables;
- broom::tidy() for comparing coefficients across models.
5 Collinearity
5.1 What is collinearity?
Collinearity occurs when two predictor variables in a multiple regression model are highly correlated. Multicollinearity is the more general term, referring to a situation where one predictor variable can be linearly predicted from one or more of the other predictor variables with a substantial degree of accuracy.
For example, in a marine environment, you might measure water temperature, salinity, and dissolved oxygen. It is very likely that temperature and dissolved oxygen are strongly negatively correlated, because colder water holds more oxygen. If you include both in a model to predict the abundance of a fish species, the predictors share information, which introduces multicollinearity.
In ecology and environmental biology, this is common because many variables co-vary through shared physical or biological processes. Examples include:
- altitude and temperature;
- nitrate and phosphate;
- temperature and dissolved oxygen;
- rainfall and river flow.
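A tiny simulation makes the cost of this overlap visible. Everything below is illustrative (the variables x1, x2, and y are invented): x2 is little more than a noisy copy of x1, and including both inflates the standard error of the x1 coefficient.

```r
# Illustrative simulation: x2 is a noisy copy of x1, so the two predictors
# carry almost the same information.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
y  <- 1 + 0.5 * x1 + rnorm(n)

se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

cor(x1, x2)             # very high correlation between the predictors
se_together / se_alone  # how much the SE of x1 is inflated by adding x2
```

The data have not changed and neither has the true effect of x1; the only difference is that the model must now share the same information between two predictors.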
5.2 Why is it a problem?
Unlike pseudoreplication, collinearity does not necessarily invalidate the entire model in terms of its predictive power. A model with collinear predictors may still produce good predictions. However, it severely compromises the interpretation of the model’s coefficients. The primary problems are:
- Unstable coefficient estimates The estimated regression coefficients can vary wildly depending on which other variables are in the model. The standard errors of the coefficients become inflated, making it difficult to determine the true effect of each predictor.
- Incorrect signs A coefficient might appear to have the “wrong” sign, such as a positive effect where a negative one is expected biologically. This happens because the model is trying to partition shared variation between correlated predictors, and the results can become nonsensical.
- Loss of statistical significance Because the standard errors are inflated, a predictor that is truly important may appear statistically non-significant. The model cannot confidently attribute the effect to any single one of the correlated predictors.
These effects arise because the model must estimate multiple coefficients from overlapping information. When two predictors are strongly correlated, the model cannot cleanly separate their individual contributions because they explain the same variation in the response.
So, the central message is that collinearity is usually a problem of interpretation rather than raw prediction.
5.3 Example 1: Predicting plant growth
Suppose you model plant growth using:
- mean_annual_temperature, a mechanistic driver of metabolic rates; and
- altitude, a composite variable that correlates with temperature, oxygen, and radiation.
These variables are strongly correlated because altitude influences temperature. The model then attempts to assign separate effects to two variables that describe overlapping processes. As a result:
- coefficients become unstable;
- standard errors increase;
- biological interpretation becomes unclear.
Altitude is acting as a proxy variable. It is easy to measure but does not represent a single causal mechanism. Including both altitude and temperature asks the model to separate a proxy from the process it represents, which it often cannot do reliably.
Better modelling choices would include:
- using temperature if the hypothesis concerns physiology;
- using altitude if the question concerns broad spatial gradients;
- avoiding both in the same model unless their distinct roles are explicitly justified.
5.4 Example 2: Nutrient limitation in coastal systems
Suppose you model phytoplankton biomass using:
- nitrate; and
- phosphate.
These nutrients often co-vary because they are supplied by the same water masses. The outcome may be that:
- the model fits well overall;
- individual coefficients are unstable or non-significant.
The model cannot separate the effect of nitrate from phosphate because both track the same underlying process: nutrient supply.
Possible resolutions include:
- using one nutrient based on ecological theory, for example the limiting nutrient;
- using a ratio such as N:P if the hypothesis concerns stoichiometry;
- using PCA or another dimension-reduction method to represent a broader “nutrient gradient” if prediction is the main goal.
This example shows that collinearity can arise even when all the variables are mechanistically meaningful. The decision is therefore about how to handle collinearity and whether the goal is explanation or prediction.
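As a sketch of the dimension-reduction option, the fragment below simulates two co-varying nutrients (all numbers are invented) and collapses them onto a single axis with prcomp():

```r
# Sketch only: simulate two co-varying nutrients and reduce them to one axis.
set.seed(2)
n         <- 60
nitrate   <- rnorm(n, mean = 10, sd = 3)
phosphate <- 0.1 * nitrate + rnorm(n, sd = 0.15)  # supplied together, so correlated
nutrients <- data.frame(nitrate, phosphate)

pc <- prcomp(nutrients, scale. = TRUE)  # scale: the nutrients differ in units
summary(pc)                             # PC1 captures most of the shared variation
nutrient_axis <- pc$x[, 1]              # a single "nutrient gradient" predictor
```

The new variable nutrient_axis can then replace both raw nutrients in the regression, at the cost of losing a nutrient-specific interpretation: a useful trade when prediction, not stoichiometry, is the goal.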
5.5 Diagnosing collinearity
The most common way to diagnose multicollinearity is by calculating the Variance Inflation Factor (VIF) for each predictor variable.
- A VIF of 1 means the predictor shares no linear information with the other predictors.
- A VIF between about 1 and 5 is often manageable.
- A VIF above about 5 or 10 suggests substantial multicollinearity that should be considered carefully.
VIF does not diagnose a problem on its own. It indicates that a coefficient reflects shared variation with other predictors. The decision to act depends on whether the model needs interpretable coefficients.
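These rules of thumb follow directly from how the VIF is defined. For predictor $j$, regress it on all the other predictors and record the resulting $R_j^2$; then

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}\!\left(\hat{\beta}_j\right)
  = \frac{\sigma^2}{(n-1)\, s_{x_j}^2} \times \mathrm{VIF}_j .
```

So a VIF of 5 corresponds to $R_j^2 = 0.8$, and the standard error of $\hat{\beta}_j$ is inflated by a factor of $\sqrt{5} \approx 2.2$ relative to the case where the predictors are uncorrelated.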
5.6 A worked collinearity example
Using the seaweed example from the previous chapter, we can inspect the candidate climate predictors for overlap.
The correlation matrix already shows that several of the candidate predictors overlap strongly.
R> augMean febRange febSD augSD annMean
R> augMean 1.0000000 0.6678245 0.5039666 0.4603614 0.9711458
R> febRange 0.6678245 1.0000000 0.9212490 0.5415907 0.6860398
R> febSD 0.5039666 0.9212490 1.0000000 0.5744316 0.5533197
R> augSD 0.4603614 0.5415907 0.5744316 1.0000000 0.5851171
R> annMean 0.9711458 0.6860398 0.5533197 0.5851171 1.0000000
Calculating the VIF for each predictor of the fitted model (col_mod) gives:
R> augMean febRange febSD augSD annMean
R> 27.947767 10.806635 8.765732 2.497739 31.061900
The VIF values show that some predictors overlap strongly. High VIF values do not make the model mathematically invalid; they simply mean that the separate interpretation of those coefficients becomes hard to defend.
We can make the consequences more concrete by comparing the simple slopes for each predictor to the coefficients in the multiple regression.
preds <- c("augMean", "febRange", "febSD", "augSD", "annMean")
simple_models <- purrr::map(preds, ~ lm(as.formula(paste("Y ~", .x)),
data = sw_ectz))
simple_slopes <- purrr::map2_dfr(simple_models, preds, \(mod, nm) {
tidy(mod) |>
filter(term != "(Intercept)") |>
mutate(model = "Simple regression",
predictor = nm)
})
multiple_slopes <- tidy(col_mod) |>
filter(term != "(Intercept)") |>
mutate(model = "Multiple regression",
predictor = term)
bind_rows(simple_slopes, multiple_slopes) |>
select(model, predictor, estimate, std.error, statistic, p.value)
R> # A tibble: 10 × 6
R> model predictor estimate std.error statistic p.value
R> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
R> 1 Simple regression augMean 0.346 0.0109 31.7 6.68e- 96
R> 2 Simple regression febRange 0.182 0.00890 20.4 8.31e- 58
R> 3 Simple regression febSD 0.172 0.0124 13.8 1.56e- 33
R> 4 Simple regression augSD 0.0879 0.00720 12.2 6.68e- 28
R> 5 Simple regression annMean 0.332 0.00867 38.3 6.82e-115
R> 6 Multiple regression augMean -0.0799 0.0426 -1.87 6.18e- 2
R> 7 Multiple regression febRange 0.113 0.0159 7.08 1.14e- 11
R> 8 Multiple regression febSD -0.0572 0.0166 -3.45 6.37e- 4
R> 9 Multiple regression augSD 0.00302 0.00489 0.619 5.36e- 1
R> 10 Multiple regression annMean 0.323 0.0416 7.76 1.59e- 13
This comparison is useful because it shows the practical effect of collinearity. Predictors that seem strongly positive in isolation may become weaker, unstable, or even change sign once the overlapping predictors are entered together. The multiple regression is not necessarily wrong, but its coefficients are estimated under much greater uncertainty because the model is trying to assign separate effects to partially redundant variables.
5.7 Addressing collinearity
Common responses include:
- removing one of the correlated predictors;
- choosing one variable over another based on theory;
- combining correlated predictors into a single index or ordination axis;
- accepting some overlap if prediction is the main goal, but avoiding strong causal language;
- using regularisation methods such as ridge regression, lasso, or elastic net when the objective is predictive stability rather than simple coefficient interpretation.
There is no obvious threshold that solves the problem automatically. The real question is whether the overlap is small enough that the coefficients still mean something defensible.
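To make the regularisation option concrete, here is a minimal hand-rolled ridge estimator on simulated collinear predictors. This is a sketch for intuition only; in real analyses you would use an established package such as glmnet, and all data below are invented.

```r
# Sketch only: ridge regression shrinks coefficients towards zero, trading a
# little bias for much greater stability when predictors are collinear.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # nearly redundant with x1
y  <- 1 + 0.5 * x1 + rnorm(n)
X  <- cbind(1, x1, x2)          # design matrix with an intercept column

ridge <- function(X, y, lambda) {
  pen <- lambda * diag(ncol(X))
  pen[1, 1] <- 0                # leave the intercept unpenalised
  solve(t(X) %*% X + pen, t(X) %*% y)
}

ols_coef   <- ridge(X, y, lambda = 0)   # lambda = 0 recovers ordinary least squares
ridge_coef <- ridge(X, y, lambda = 10)  # shrunken, more stable slope estimates
cbind(ols_coef, ridge_coef)             # compare the two sets of coefficients
```

The penalised slopes are pulled towards zero and towards each other, which is exactly why these methods help prediction but should not be read as clean mechanistic effects.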
6 Confounding
Confounding is a different problem. The predictors in the model may or may not overlap; what matters is that a third variable, often one omitted from the model entirely, distorts the apparent relationship between a predictor of interest and the response.
A confounder is a variable that influences both:
- the predictor of interest; and
- the response.
This can create a spurious association or distort a real one. For example, if you relate abundance to temperature but nutrient supply is associated with both temperature and abundance, the temperature coefficient may partly absorb nutrient effects.
6.1 Confounding is not the same as collinearity
These ideas are related but not identical.
- Collinearity is about overlap among predictors in the data.
- Confounding is about mistaken attribution of an effect.
Two variables can be highly collinear without one being a confounder. Equally, a confounder can matter even when the correlation structure does not look especially dramatic.
6.2 A worked confounding example
We can see the logic of confounding clearly with a small simulated example. Suppose we want to know whether temperature affects algal abundance, but nutrient concentration also varies with temperature and itself influences abundance.
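The data behind this example are simulated, and the generating code is not shown in full. The fragment below is a sketch of the same flavour (the coefficients are chosen so that nutrient tracks temperature and drives abundance; it will not reproduce the exact numbers in the tables):

```r
# Sketch only: simulate a confounded system. Nutrient tracks temperature and
# is the main driver of abundance, so a temperature-only model overstates
# the temperature effect.
set.seed(13)
n <- 120
temperature <- runif(n, min = 10, max = 25)
nutrient    <- -6 + 0.8 * temperature + rnorm(n, sd = 1.5)
abundance   <- 2 + 0.25 * temperature + 1.3 * nutrient + rnorm(n, sd = 2)
conf_dat    <- data.frame(temperature, nutrient, abundance)

coef(lm(abundance ~ temperature, data = conf_dat))["temperature"]             # inflated
coef(lm(abundance ~ temperature + nutrient, data = conf_dat))["temperature"]  # near the true 0.25
```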
If we fit a model with temperature alone, we may conclude that temperature is strongly related to abundance:
R> # A tibble: 2 × 5
R> term estimate std.error statistic p.value
R> <chr> <dbl> <dbl> <dbl> <dbl>
R> 1 (Intercept) -8.74 2.03 -4.30 3.57e- 5
R> 2 temperature 1.27 0.0854 14.8 1.02e-28
Once nutrient is included as well, the picture changes:
R> # A tibble: 3 × 5
R> term estimate std.error statistic p.value
R> <chr> <dbl> <dbl> <dbl> <dbl>
R> 1 (Intercept) 2.17 1.77 1.22 2.23e- 1
R> 2 temperature 0.253 0.112 2.25 2.61e- 2
R> 3 nutrient 1.33 0.124 10.7 3.99e-19
p1 <- ggplot(conf_dat, aes(x = temperature, y = abundance)) +
geom_point(alpha = 0.7, colour = "dodgerblue4") +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(x = "Temperature", y = "Algal abundance") +
theme_grey()
p2 <- ggplot(conf_dat, aes(x = nutrient, y = abundance)) +
geom_point(alpha = 0.7, colour = "dodgerblue4") +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(x = "Nutrient concentration", y = "Algal abundance") +
theme_grey()
ggarrange(p1, p2, ncol = 2)
The important point is the change in interpretation. Once nutrient is included, the estimated effect of temperature is re-evaluated while holding nutrient constant. If the temperature slope changes substantially, that is evidence that the original temperature-only model was at least partly confounded by nutrient concentration.
This is why confounding is fundamentally a problem of attribution: we are trying to decide whether the apparent effect of one variable is actually partly due to another.
6.3 Practical implications of confounding
Confounding often arises because:
- a variable of biological importance was omitted;
- the variable was measured but not included;
- the design did not separate the process of interest from another correlated process.
The best remedy for confounding is to avoid it altogether through good subject knowledge, better experimental or sampling design, better measurement, and more careful model specification.
7 Measurement Error
Standard linear regression assumes that predictors are measured without serious error. That assumption is rarely exactly true in field biology.
When a predictor is noisy, its estimated effect is often biased towards zero. This is called attenuation bias. In practical terms, a real effect may appear weaker and less stable than it truly is. This matters because:
- field instruments are imperfect;
- environmental conditions fluctuate;
- biological measurements can be imprecise;
- some important variables are difficult or expensive to measure directly.
7.1 A worked measurement-error example
Suppose a response truly depends on temperature, but our observed temperature values are noisy because of instrument error or sampling mismatch.
set.seed(414)
n <- 150
true_temp <- runif(n, min = 8, max = 24)
observed_temp <- true_temp + rnorm(n, sd = 2.5)
growth <- 1.2 + 0.55 * true_temp + rnorm(n, sd = 2.0)
me_dat <- tibble(
true_temp = true_temp,
observed_temp = observed_temp,
growth = growth
)
mod_true <- lm(growth ~ true_temp, data = me_dat)
mod_observed <- lm(growth ~ observed_temp, data = me_dat)
bind_rows(
tidy(mod_true) |> mutate(model = "True predictor"),
tidy(mod_observed) |> mutate(model = "Observed noisy predictor")
) |>
filter(term != "(Intercept)") |>
select(model, term, estimate, std.error, statistic, p.value)
R> # A tibble: 2 × 6
R> model term estimate std.error statistic p.value
R> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
R> 1 True predictor true_temp 0.558 0.0350 16.0 5.38e-34
R> 2 Observed noisy predictor observed_temp 0.501 0.0362 13.8 1.77e-28
p3 <- ggplot(me_dat, aes(x = true_temp, y = growth)) +
geom_point(alpha = 0.7, colour = "dodgerblue4") +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(x = "True temperature", y = "Growth") +
theme_grey()
p4 <- ggplot(me_dat, aes(x = observed_temp, y = growth)) +
geom_point(alpha = 0.7, colour = "dodgerblue4") +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(x = "Observed temperature", y = "Growth") +
theme_grey()
ggarrange(p3, p4, ncol = 2)
The model using the noisy predictor typically produces a weaker slope than the model using the true predictor. This is the practical effect of attenuation bias: measurement error can make a real effect appear smaller than it is.
Measurement error therefore matters because it adds noise and because it can systematically weaken the estimated effect.
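The size of this effect has a simple closed form under the classical error model used in the simulation, where the observed predictor is the true value plus independent noise, $x^{\mathrm{obs}} = x + u$:

```latex
\hat{\beta}_{\mathrm{obs}}
  \;\xrightarrow{\; n \to \infty \;}\;
  \beta \times \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2},
```

where $\sigma_x^2$ is the variance of the true predictor and $\sigma_u^2$ the measurement-error variance. In the simulation above, $\sigma_x^2 = (24-8)^2/12 \approx 21.3$ and $\sigma_u^2 = 2.5^2 = 6.25$, so the expected slope is roughly $0.55 \times 21.3/27.6 \approx 0.43$; any single simulated draw, such as the one shown, will scatter around that value.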
8 Proxy Variables
Many biological models use proxy variables because the true mechanistic variables are unavailable.
Examples include:
- altitude as a proxy for temperature or radiation;
- distance from shore as a proxy for exposure;
- chlorophyll as a proxy for productivity;
- depth in the ocean as a proxy for light intensity;
- body size as a proxy for age or condition.
Proxy variables are often convenient, they may seem reasonable, and sometimes they are unavoidable. However, they do create interpretive limits: a proxy coefficient should not be interpreted as if it identified one clean mechanism.
This is why the examples above matter. In the plant-growth example, altitude is a broad spatial proxy, while temperature is closer to the mechanism. In the nutrient example, nitrate and phosphate are both mechanistically meaningful but may still function as overlapping representations of nutrient supply. In both cases, interpretation depends on whether the question is explanatory or predictive.
If the question is physiological, use physiological drivers where possible. If the question is broad-scale pattern, a proxy may be perfectly acceptable. The important thing is that the variable should match the scale and aim of the question.
9 A Practical Workflow
When building and interpreting a regression model:
- state the biological hypothesis first;
- identify which predictors are mechanistic and which are proxies;
- ask which variables could confound the relationship of interest;
- inspect predictor correlations and calculate VIF where appropriate;
- compare simpler and richer models when confounding is plausible;
- remove or combine redundant predictors when interpretation is the goal;
- be honest about causal limits when important variables are missing or poorly measured.
9.1 Explanation versus prediction
This chapter also forces an important distinction that connects directly to the regularisation chapter that follows later in the sequence.
- If the main goal is explanation, then unstable coefficients, proxies, confounding, and overlap are serious problems because they weaken the biological interpretation.
- If the main goal is prediction, some overlap may be tolerable if predictive performance is strong, although the coefficients may still not support strong mechanistic claims.
So the decision is not only “is there collinearity?” It is also “what is the model for?”
10 Summary
- Collinearity makes coefficients unstable because predictors share information.
- Confounding is a problem of attribution rather than simple overlap.
- Measurement error weakens inference and can bias estimated effects, often towards zero.
- Proxy variables are often useful, but they limit what a coefficient can be said to mean.
- These problems are handled best through strong biological reasoning, careful design, and honest interpretation rather than by software alone.
The next chapter moves from model construction to the question of whether a fitted model is actually behaving well: diagnostics, comparison, and evaluation.
References
Graham, M. H. (2003). Confronting multicollinearity in ecological multiple regression. Ecology 84(11), 2809–2815.
Citation
@online{smit2026,
  author = {Smit, A. J.},
  title = {15. {Collinearity,} {Confounding,} and {Measurement} {Error}},
  date = {2026-03-19},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/15-collinearity-confounding-measurement-error.html},
  langid = {en}
}
