16: Collinearity

Author

A. J. Smit

Published

2026/03/18


1 Introduction

In multiple regression and other multivariate models, we implicitly assume that each predictor variable carries its own, separable information. When predictors are instead strongly correlated with one another, our models can become unstable and difficult to interpret. This issue, known as collinearity or multicollinearity, is a common challenge in biological research. It is especially prevalent in ecology, where many environmental variables are linked through shared physical and biological processes (e.g., temperature, oxygen, and productivity often co-vary).

This chapter explains what collinearity is, why it is problematic, and how to diagnose and address it. While no single paper is as seminal as Hurlbert’s on pseudoreplication, a foundational and highly recommended review on the topic is provided by Graham (2003).

2 Key Concepts

  • Collinearity/Multicollinearity: A condition in multiple regression where two or more predictor variables are highly correlated, making it difficult for the model to separate their individual effects.
  • Problem of Interpretation, Not Prediction: Collinearity’s main drawback is that it makes the model’s coefficients (the estimated effects of each predictor) unreliable and difficult to interpret, even if the model as a whole predicts the outcome well.
  • Variance Inflation Factor (VIF): The standard diagnostic tool for detecting multicollinearity. A high VIF (e.g., > 5) for a predictor indicates its variance is being inflated by its correlation with other predictors.
  • Unstable Coefficients and Inflated Errors: Collinearity causes estimated coefficients to be unstable and have large standard errors, which can mask the true significance of a predictor.
  • Addressing Collinearity: Common solutions include removing one of the correlated predictors, combining them into a single index (e.g., via PCA), or using advanced regression techniques designed to handle it.

3 What is Collinearity?

Collinearity occurs when two predictor variables in a multiple regression model are highly correlated. Multicollinearity is the more general term, referring to a situation where one predictor variable can be linearly predicted from one or more of the other predictor variables with a substantial degree of accuracy.

For example, in a marine environment, you might measure water temperature, salinity, and dissolved oxygen. It is very likely that temperature and dissolved oxygen are strongly negatively correlated (colder water holds more oxygen). If you include both in a model to predict the abundance of a fish species, the predictors share information, which introduces multicollinearity.

4 Why is it a Problem?

Unlike pseudoreplication, collinearity does not necessarily invalidate the entire model in terms of its predictive power. A model with collinear predictors may still produce good predictions. However, it severely compromises the interpretation of the model’s coefficients.

The primary problems are:

  1. Unstable Coefficient Estimates: The estimated regression coefficients (\(\beta\) values) can vary wildly depending on which other variables are in the model. The standard errors of the coefficients become inflated, making it difficult to determine the true effect of each predictor.
  2. Incorrect Signs: A coefficient might appear to have the “wrong” sign (e.g., a positive effect when a negative one is expected biologically). This happens because the model is trying to partition the shared variance between the correlated predictors, and the results can be nonsensical.
  3. Loss of Statistical Significance: Because the standard errors are inflated, a predictor variable that is truly important might appear statistically insignificant (i.e., have a p-value > 0.05). The model cannot confidently attribute the effect to any single one of the correlated predictors.

These effects arise because the model must estimate multiple coefficients from overlapping information.

In essence, when two predictors are strongly correlated, the model cannot separate their individual contributions because they explain much of the same variation in the response.
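This inability to separate contributions is easy to demonstrate with simulated data. The sketch below is not from the chapter (which otherwise works in R); it is a minimal Python/numpy illustration with invented variables, where x2 is almost a copy of x1 and y depends equally on both.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two collinear predictors: x2 is x1 plus a little noise
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = 2 * x1 + 2 * x2 + rng.normal(scale=1.0, size=n)

def ols(X, y):
    """Least-squares slope estimates, fitted with an intercept column."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # drop the intercept

b_full = ols(np.column_stack([x1, x2]), y)  # both predictors together
b_x1 = ols(x1.reshape(-1, 1), y)            # x1 on its own

# The *combined* effect (b_full[0] + b_full[1], close to 4) is estimated
# precisely, but how it is split between x1 and x2 is unstable; dropping
# x2 makes x1 absorb the whole shared effect.
print(b_full, b_x1)
```

The individual coefficients in the full model wander from sample to sample, while their sum stays close to the true combined effect: exactly the "unstable coefficients, stable predictions" pattern described above.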

5 Examples

The following examples highlight two distinct forms of collinearity: one involving a proxy and a mechanistic variable, and one involving two mechanistic variables that co-vary.

5.1 Predicting Plant Growth

You model plant growth using:

  • mean_annual_temperature (mechanistic driver of metabolic rates)
  • altitude (a composite variable that correlates with temperature, oxygen, and radiation)

These variables are strongly correlated because altitude influences temperature.

Interpretation issue

The model attempts to assign separate effects to two variables that describe overlapping processes. As a result:

  • Coefficients become unstable
  • Standard errors increase
  • Biological interpretation becomes unclear

Main insight

Altitude acts as a proxy variable. It is easy to measure but does not represent a single causal mechanism. Including both altitude and temperature asks the model to separate a proxy from the process it represents, which it cannot do reliably.

Proxy variables are often used because they are easy to measure across large spatial scales, whereas mechanistic variables may require more detailed or local measurements.

Better modelling choices

  • Use temperature if your hypothesis concerns physiology
  • Use altitude if your question concerns broad spatial gradients
  • Avoid including both unless you explicitly model their roles

5.2 Nutrient Limitation in Coastal Systems

You model phytoplankton biomass using:

  • nitrate
  • phosphate

These nutrients often co-vary because they are supplied by the same water masses.

Outcome

  • The model fits well (high predictive ability)
  • Individual coefficients are unstable or non-significant

Interpretation

The model cannot separate the effect of nitrate from phosphate because both track the same underlying process: nutrient supply.

Resolution options

  • Use one nutrient based on ecological theory (e.g., the limiting nutrient)
  • Use a ratio (e.g., N:P) if your hypothesis concerns stoichiometry
  • Use PCA to represent a “nutrient gradient” if prediction is the goal

This example shows that collinearity can arise even when all variables are mechanistically meaningful.

The decision is therefore about how to handle the collinearity, and that depends on whether you aim to explain mechanisms or to optimise prediction.

These examples show that collinearity can arise either from including proxy variables or from measuring multiple aspects of the same underlying process.

6 Diagnosing Collinearity

The most common way to diagnose multicollinearity is by calculating the Variance Inflation Factor (VIF) for each predictor variable.

  • What is VIF? The VIF for a predictor X is a measure of how much the variance of its estimated coefficient is “inflated” by its correlation with the other predictors in the model.
  • How it Works: For each predictor, a regression is fitted using that predictor as the response and all other predictors as the explanatory variables. The R-squared from this model is used to calculate the VIF.
  • Rule of Thumb:
    • A VIF of 1 means the predictor is uncorrelated with the other predictors.
    • A VIF between 1 and 5 is generally considered acceptable.
    • A VIF greater than 5 or 10 indicates high multicollinearity that should be addressed.
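These thresholds follow from a simple formula. If \(R_j^2\) is the coefficient of determination obtained by regressing predictor \(X_j\) on all the other predictors, then

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]

so, for example, \(R_j^2 = 0.80\) gives a VIF of 5, and \(R_j^2 = 0.90\) gives a VIF of 10.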

A high VIF is not, on its own, evidence that the model is wrong; it indicates that a coefficient reflects variation shared with other predictors. The decision to act depends on whether you need interpretable coefficients.

In R, the vif() function in the car package is commonly used to calculate VIF values for a fitted model.
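The same calculation is easy to do by hand, following the auxiliary-regression recipe above. This is a Python/numpy sketch rather than R, with simulated temp/oxygen/salinity variables invented for the demonstration; it regresses each predictor on the others and applies VIF = 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress that column on all the other
    columns (with an intercept) and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        yhat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
        r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
temp = rng.normal(size=100)
oxygen = -temp + rng.normal(scale=0.2, size=100)  # strong negative correlation
salinity = rng.normal(size=100)                   # unrelated to the others

vifs = vif(np.column_stack([temp, oxygen, salinity]))
print(vifs)  # temp and oxygen show high VIFs; salinity sits near 1
```

This is exactly what car::vif() does for a fitted model, so the hand-rolled version is useful mainly for understanding where the numbers come from.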

7 Addressing Collinearity

In practice, collinearity often reflects how we construct models rather than a property of the data alone. Predictors become correlated because they describe the same underlying process.

The problem of collinearity can be addressed through one of two broad strategies, both of which involve deciding which variables (or combinations of variables) enter the model:

  1. Theory-driven selection (preferred): Use biological knowledge to choose variables that represent distinct mechanisms. Each predictor should correspond to a process you expect to influence the response. This approach is applied a priori.
  2. Data-driven selection: When theory is weak or variables are numerous, use statistical tools (e.g., PCA, regularisation) to reduce redundancy. In this case, the model prioritises predictive performance rather than interpretation of individual coefficients. This is an a posteriori approach that takes place after exploring relationships among variables, but before final model interpretation.

If you detect problematic collinearity, you have several options:

  1. Remove one of the correlated variables: The simplest solution. Choose the variable that contributes less to your biological question or represents a redundant proxy.
  2. Combine the variables: If two variables represent a similar underlying gradient (e.g., several measures of water quality), you could combine them into a single composite index using a method like Principal Component Analysis (PCA). You would then use the first principal component (PC1) as your new predictor.
  3. Use specialised regression methods: Techniques like Ridge Regression or Lasso Regression are designed to handle collinearity, but are more advanced.
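Option 2 can be sketched with the nutrient example from Section 5.2. This Python/numpy illustration (variable names invented; the standardise-then-SVD steps are one common way to compute a PCA) collapses two co-varying nutrients into a single PC1 "nutrient gradient" predictor.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
supply = rng.normal(size=n)                        # shared "nutrient supply" gradient
nitrate = supply + rng.normal(scale=0.3, size=n)   # both nutrients track the
phosphate = supply + rng.normal(scale=0.3, size=n) # same underlying process

# Standardise the predictors, then take the first principal component
# (first right-singular vector of the standardised matrix) as a single
# composite predictor.
X = np.column_stack([nitrate, phosphate])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ Vt[0]  # note: the sign of a principal component is arbitrary

# PC1 should track the shared gradient strongly (in absolute value)
print(abs(np.corrcoef(pc1, supply)[0, 1]))
```

The trade-off is interpretive: PC1 represents "the nutrient gradient" as a whole, so you give up any claim about nitrate versus phosphate individually.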

Practical workflow

  • Define the biological question (mechanism vs prediction)
  • Select variables using theory where possible
  • Diagnose collinearity (e.g., VIF)
  • If needed:
    • Remove redundant variables (interpretation-focused models)
    • Combine variables (latent gradients)
    • Use regularisation (prediction-focused models)
  • Do not interpret coefficients until collinearity has been assessed.

8 Conclusion

Collinearity is a problem of interpretation, not necessarily prediction. If your goal is simply to predict a response, it may not be a major issue. But in ecology and biology, we are often interested in attributing effects to specific processes. Failing to check for and address collinearity can lead to unstable, misleading, and ultimately incorrect scientific conclusions. Always check the VIF of your predictors before interpreting the coefficients of a multiple regression model.


References

Graham MH (2003) Confronting multicollinearity in ecological multiple regression. Ecology 84:2809–2815.

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {16: {Collinearity}},
  date = {2026-03-18},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/16-collinearity.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2026) 16: Collinearity. http://tangledbank.netlify.app/BCB744/basic_stats/16-collinearity.html.