15. Model Checking and Evaluation
Diagnostics, Comparison, and Generalisation
- Why model fitting is not the end of the analysis
- Diagnostic checks for linear models
- Influence, leverage, and residual structure
- Comparing candidate models with information criteria
- Overfitting and generalisation
1 Introduction
Once a model has been fitted, two further questions immediately follow:
- Is the model adequate for the data?
- How does this model compare with plausible alternatives?
These questions belong to model checking and model evaluation. A statistically significant model may still be poorly specified, violate assumptions, overfit the data, or generalise badly to new observations.
2 Key Concepts
The following concepts structure the chapter.
- Model checking asks whether a fitted model is adequate for the data and assumptions.
- Residuals are central diagnostic tools for linear-model behaviour.
- Influence and leverage identify observations that matter disproportionately.
- Model comparison weighs plausible alternatives rather than treating one fitted model as final.
- Generalisation matters because good in-sample fit does not guarantee useful prediction.
3 Model Checking
Model checking asks whether the fitted model behaves in a way consistent with its assumptions and intended interpretation.
For ordinary linear models, the most useful checks usually involve residuals.
3.1 Residual patterns
Residuals should not show strong systematic structure.
Problems to look for include:
- curvature, suggesting non-linearity,
- funnel shapes, suggesting heteroscedasticity,
- unusual points with strong leverage,
- heavy tails or severe asymmetry.
3.2 Standard diagnostic plots
The standard four-plot diagnostic display in R, produced by calling plot() on a fitted lm object, remains a good starting point.
The four panels typically show:
- residuals vs fitted values,
- normal Q-Q plot,
- scale-location plot,
- residuals vs leverage.
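A minimal sketch of this display, using R's built-in cars data as a stand-in for a real dataset:

```r
# Fit a simple linear model and draw the standard four diagnostic plots.
# By default, plot.lm shows residuals vs fitted, the normal Q-Q plot,
# the scale-location plot, and residuals vs leverage.
fit <- lm(dist ~ speed, data = cars)

op <- par(mfrow = c(2, 2))  # arrange the four panels in a 2 x 2 grid
plot(fit)
par(op)                     # restore the previous graphics settings
```

Each panel targets one of the residual problems listed above: curvature shows up in the residuals-vs-fitted panel, heteroscedasticity in the scale-location panel, and unusual points in the residuals-vs-leverage panel.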
3.3 Influence and leverage
Not all unusual points are equally important. Some observations have:
- large residuals: values the model fits poorly,
- high leverage: unusual predictor values,
- high influence: strong impact on the fitted coefficients.
An influential point deserves attention, but not automatic deletion. It may indicate:
- a data entry problem,
- measurement error,
- an important biological extreme, or
- a model that is too simple.
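Base R computes these diagnostics directly. The sketch below flags candidate points using two common rough cut-offs (leverage above 2p/n, Cook's distance above 4/n); the thresholds are conventions, not rules:

```r
# Leverage (hat values) and influence (Cook's distance) for a linear model,
# using the built-in 'cars' data as an illustration.
fit <- lm(dist ~ speed, data = cars)

lev  <- hatvalues(fit)       # leverage: unusual predictor values
cook <- cooks.distance(fit)  # influence: impact on the fitted coefficients

n <- nrow(cars)
p <- length(coef(fit))

# Rough flags, not deletion criteria: these rows deserve inspection.
flagged <- which(lev > 2 * p / n | cook > 4 / n)
flagged
```

Flagged rows should be examined against the possibilities above (data entry, measurement error, genuine extremes, model misspecification) before any decision is made.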
4 Assumption Checks and Their Limits
Formal tests such as Shapiro-Wilk or Breusch-Pagan can be useful, but they should not replace graphical assessment and biological judgement.
In practice:
- tiny deviations from normality are often unimportant,
- large datasets can make trivial deviations look “significant,” and
- visual residual structure is often more informative than a single test.
The real question is not whether a model is perfect, but whether it is adequate for the inferential goal.
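The large-sample point can be illustrated directly. In the sketch below, errors are drawn from a t-distribution with many degrees of freedom, which is very close to normal, yet with 5000 observations the Shapiro-Wilk test will often still reject:

```r
# With a large sample, a formal normality test can flag deviations
# that are practically trivial for inference.
set.seed(1)
x <- rt(5000, df = 15)   # slightly heavy-tailed, but close to normal

# The p-value is often very small here, even though a Q-Q plot of x
# would look nearly straight. (shapiro.test accepts at most 5000 points.)
shapiro.test(x)$p.value
```

A Q-Q plot of the same sample, read alongside the test, is usually more informative than the p-value alone.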
5 Model Comparison
Biological analysis often involves several plausible candidate models, each representing a different hypothesis.
Model comparison should therefore be framed as comparing competing explanations, not merely hunting for the best numerical fit.
5.1 Information criteria
The most common tools are:
- AIC,
- AICc for small samples,
- sometimes BIC for a stronger penalty on complexity.
These criteria balance:
- model fit, and
- model complexity.
Lower values indicate better relative support among the models being compared.
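A minimal sketch of an AIC comparison in base R, with a hand-computed small-sample correction (AICc); the quadratic candidate model here is purely illustrative:

```r
# Compare two candidate models for the built-in 'cars' data.
fit1 <- lm(dist ~ speed, data = cars)
fit2 <- lm(dist ~ speed + I(speed^2), data = cars)

AIC(fit1, fit2)  # lower AIC indicates better relative support

# AICc = AIC + 2k(k + 1) / (n - k - 1), where k counts all estimated
# parameters (including the residual variance for an lm fit).
aicc <- function(fit) {
  n <- nobs(fit)
  k <- attr(logLik(fit), "df")
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
c(fit1 = aicc(fit1), fit2 = aicc(fit2))
```

The correction term grows as k approaches n, which is why AICc is preferred for small samples.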
5.2 What not to do
Automated stepwise procedures often produce unstable and hard-to-interpret models. They can also encourage post hoc storytelling around whichever terms happen to survive selection.
Candidate models should ideally be motivated by theory first and compared second.
6 Overfitting and Generalisation
A model can fit the present data very well while performing poorly on new data. This is overfitting.
Overfitting occurs when the model captures noise as if it were signal. It is especially common when:
- sample size is modest,
- many predictors are included,
- interactions are added freely,
- model choice is heavily data-driven.
The key distinction is between:
- training performance: fit to the data used to estimate the model,
- generalisation performance: fit to new or unseen data.
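Overfitting is easy to demonstrate. In the sketch below, 20 predictors of pure noise are fitted to 30 observations; the in-sample R-squared is high even though there is no signal at all, while the adjusted R-squared is far less flattering:

```r
# Pure-noise predictors can produce a high in-sample R^2 when the
# number of predictors is large relative to the sample size.
set.seed(42)
n <- 30
y <- rnorm(n)                                     # response: pure noise
X <- as.data.frame(matrix(rnorm(n * 20), n, 20))  # 20 noise predictors

fit <- lm(y ~ ., data = cbind(y = y, X))

summary(fit)$r.squared      # high despite no real signal
summary(fit)$adj.r.squared  # much lower once complexity is penalised
```

On genuinely new data, such a model would predict no better than the mean of y.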
6.1 Cross-validation
Cross-validation provides one way to estimate predictive performance on unseen data.
In principle, the workflow is:
- split the data into parts,
- fit the model on one part,
- test it on the held-out part,
- repeat.
This is especially valuable when the goal of modelling is prediction rather than explanation.
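The workflow above can be sketched as k-fold cross-validation in base R, here estimating out-of-sample root mean squared error on the built-in cars data:

```r
# k-fold cross-validation for a simple linear model.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # random fold labels

rmse <- sapply(1:k, function(i) {
  train <- cars[folds != i, ]                 # fit on k-1 folds
  test  <- cars[folds == i, ]                 # test on the held-out fold
  fit   <- lm(dist ~ speed, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$dist - pred)^2))            # RMSE on unseen rows
})

mean(rmse)  # cross-validated estimate of predictive error
```

Averaging over folds gives a more stable estimate than a single train/test split, at the cost of fitting the model k times.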
7 A Practical Workflow
After fitting a model:
- Inspect residual plots.
- Check for influential points.
- Ask whether any detected pattern suggests a missing variable or wrong functional form.
- Compare only biologically sensible candidate models.
- Use information criteria or cross-validation according to the modelling goal.
- Report uncertainty and limitations clearly.
8 Summary
- Fitting a model is only the middle of the analysis.
- Residual diagnostics are essential for checking adequacy.
- Influence and leverage can distort fitted results and must be inspected.
- Model comparison should evaluate competing biological hypotheses, not just maximise fit.
- Overfitting is a major risk when complexity outruns signal.
Good modelling requires both a plausible model and evidence that the model is behaving well.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {15. {Model} {Checking} and {Evaluation}},
date = {2026-03-19},
url = {http://tangledbank.netlify.app/BCB744/basic_stats/15-model-checking-and-evaluation.html},
langid = {en}
}
