23. Prediction and Explanation

Choosing Models for Different Scientific Goals

Author

A. J. Smit

Published

2026/03/19

1 Introduction

Not all models are built for the same purpose. Some are built to explain mechanisms, isolate effects, and support biologically interpretable statements. Others are built to predict future outcomes as accurately as possible. These goals overlap, but they are not identical.

That distinction matters because the model that explains best is not always the model that predicts best. The trade-off sharpens once models become high-dimensional, predictors become strongly correlated with one another, or out-of-sample performance becomes the main concern.

2 Key Concepts

Explanation and prediction are related but different modelling goals.
Interpretability matters most when explanation is the main goal.
Out-of-sample performance matters most when prediction is the main goal.
Model choice should be justified by purpose, not only by convention.

3 Explanation Versus Prediction

An explanatory model typically emphasises:

biologically interpretable coefficients;
careful model specification;
defensible causal or process-based language;
explicit treatment of confounders and assumptions.

A predictive model typically emphasises:

accurate performance on new data;
stability under collinearity or large predictor sets;
lower prediction error rather than coefficient-by-coefficient interpretation.

Neither goal is superior in the abstract. The important thing is to know which goal you are pursuing.
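As a minimal sketch of the two readings, the same fitted model can be summarised in either way. The data here are simulated and the variable names (`temperature`, `nutrients`, `growth`) are hypothetical, chosen only for illustration.

```r
# Simulated data: hypothetical variables, not taken from this chapter
set.seed(1)
n <- 200
temperature <- rnorm(n, mean = 15, sd = 3)
nutrients   <- rnorm(n, mean = 5, sd = 1)
growth      <- 2 + 0.5 * temperature + 1.2 * nutrients + rnorm(n, sd = 1)
dat <- data.frame(growth, temperature, nutrients)

# Hold back a quarter of the rows so prediction can be assessed on unseen data
test_rows <- sample(n, size = n / 4)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

fit <- lm(growth ~ temperature + nutrients, data = train)

# Explanatory reading: coefficient estimates with 95% confidence intervals
coef_table <- cbind(estimate = coef(fit), confint(fit))
print(round(coef_table, 2))

# Predictive reading: root mean squared error on rows the model never saw
pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((test$growth - pred)^2))
print(rmse_test)
```

The coefficient table answers "how does growth change with each predictor?", while the held-out error answers "how far off are predictions for new observations?". Which summary you report first should follow from the goal.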

4 Cross-Validation and Out-of-Sample Thinking

If prediction is the main goal, the model should be evaluated primarily on data that were not used to fit it. Cross-validation is one of the main ways to do this.

The basic idea is:

split the data into training and validation parts;
fit the model on the training data;
evaluate predictive performance on held-out data;
repeat this across several splits.

This kind of out-of-sample thinking becomes especially important when models are built for forecasting, classification, or any situation where performance on new data matters more than close interpretation of each coefficient.
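The four steps listed above can be sketched in base R as a k-fold cross-validation loop. This is an illustration on simulated data, with `k = 5` and a simple linear model chosen arbitrarily; it is not a prescribed workflow.

```r
# Simulated data (an assumption for illustration)
set.seed(42)
n <- 120
x <- runif(n, 0, 10)
y <- 3 + 0.8 * x + rnorm(n, sd = 2)
dat <- data.frame(x, y)

k <- 5

# Step 1: assign every row to one of k folds at random
fold <- sample(rep(1:k, length.out = n))

rmse_by_fold <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]
  valid <- dat[fold == i, ]

  # Step 2: fit on the training rows only
  fit <- lm(y ~ x, data = train)

  # Step 3: evaluate on the held-out fold
  pred <- predict(fit, newdata = valid)
  rmse_by_fold[i] <- sqrt(mean((valid$y - pred)^2))
}

# Step 4: the loop has repeated this across all k splits; now summarise
cv_rmse <- mean(rmse_by_fold)
print(round(rmse_by_fold, 2))
print(cv_rmse)
```

Every row is used for validation exactly once, so the averaged error reflects performance on data the model did not see during fitting.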

5 What This Means in Practice

If you are building a model to explain ecological structure or process, you will usually prioritise:

coefficients that can be interpreted directly;
predictors chosen for biological reasons rather than only for predictive convenience;
explicit discussion of assumptions, confounding, and design limitations;
model statements that remain close to the original scientific question.

If you are building a model to predict accurately, you will usually prioritise:

out-of-sample predictive performance;
stable behaviour when predictors overlap or become numerous;
procedures that reduce overfitting;
honest evaluation on data not used in fitting.

These priorities often overlap, but they are not identical. A model that is excellent for explanation may be suboptimal for forecasting. A model that predicts well may be harder to interpret mechanistically.
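To see why evaluation on data not used in fitting matters, the sketch below compares a straight-line model with a deliberately over-flexible polynomial. The data are simulated with a truly linear signal, and the degree of 10 is an arbitrary choice made to exaggerate the effect.

```r
# Simulated data with a linear signal (an assumption for illustration)
set.seed(123)
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1.5)
dat <- data.frame(x, y)

# First half for fitting, second half held out
train <- dat[1:30, ]
test  <- dat[31:60, ]

simple   <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 10), data = train)

# Root mean squared error of a model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

train_rmse <- c(simple = rmse(simple, train), flexible = rmse(flexible, train))
test_rmse  <- c(simple = rmse(simple, test),  flexible = rmse(flexible, test))

print(rbind(train = train_rmse, test = test_rmse))
```

The flexible model always fits the training rows at least as closely, because it contains the straight line as a special case. Whether it also does well on the held-out rows cannot be read from the training error, which is why the held-out comparison is the one that counts for prediction.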

6 Practical Guidance

If your goal is primarily explanation:

prefer models whose coefficients can be interpreted clearly;
keep the biological argument central;
avoid adding complexity unless it strengthens the scientific question.

If your goal is primarily prediction:

evaluate models out of sample;
accept that some loss of interpretability may be worthwhile;
consider methods that shrink coefficients or otherwise stabilise model performance.
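As a rough preview of what "shrinking coefficients" means, the sketch below computes ridge estimates directly from the closed-form solution for two nearly identical predictors. The helper `ridge_coef()` is hypothetical and written only for this illustration, and `lambda = 10` is arbitrary; in practice the penalty would usually be chosen by cross-validation with a dedicated package.

```r
# Simulated data with two almost collinear predictors (an assumption)
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
y  <- 1 + x1 + x2 + rnorm(n)

# Standardise predictors and centre the response so the intercept is not penalised
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Hypothetical helper: solves (X'X + lambda * I) beta = X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, yc, lambda = 0)    # no penalty: ordinary least squares
beta_ridge <- ridge_coef(X, yc, lambda = 10)   # penalised estimates

print(round(rbind(ols = beta_ols, ridge = beta_ridge), 2))
```

With `lambda = 0` the two coefficients are free to take large, offsetting values because the predictors carry almost the same information. The penalty pulls the estimates towards zero, trading some interpretability of each individual coefficient for stability.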

7 Summary

Explanation and prediction are overlapping but distinct modelling goals.
Ordinary regression is often strongest for explanation when the model is interpretable and well specified.
Cross-validation is central when predictive performance is the goal.
The next chapter takes up regularisation directly and shows how ridge, lasso, and elastic net help when prediction and coefficient stability matter most.

This chapter makes the modelling goal explicit. After regularisation, the final chapter closes the sequence with a reproducible analytical workflow.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2026,
  author = {Smit, A. J.},
  title = {23. {Prediction} and {Explanation}},
  date = {2026-03-19},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/23-prediction-and-explanation.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2026) 23. Prediction and Explanation. http://tangledbank.netlify.app/BCB744/basic_stats/23-prediction-and-explanation.html.
