R>
R> Call:
R> lm(formula = Y ~ augMean + febSD + augSD, data = sw_ectz)
R>
R> Residuals:
R> Min 1Q Median 3Q Max
R> -0.153994 -0.049229 -0.006086 0.045947 0.148579
R>
R> Coefficients:
R> Estimate Std. Error t value Pr(>|t|)
R> (Intercept) 0.028365 0.007020 4.040 6.87e-05 ***
R> augMean 0.283335 0.011131 25.455 < 2e-16 ***
R> febSD 0.049639 0.008370 5.930 8.73e-09 ***
R> augSD 0.022150 0.004503 4.919 1.47e-06 ***
R> ---
R> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R>
R> Residual standard error: 0.06609 on 285 degrees of freedom
R> Multiple R-squared: 0.8387, Adjusted R-squared: 0.837
R> F-statistic: 494.1 on 3 and 285 DF, p-value: < 2.2e-16
16. Model Checking and Evaluation
Diagnostics, Comparison, and Generalisation
- why model fitting is not the end of the analysis;
- how residuals are used to assess model adequacy;
- what leverage and influence mean in practice;
- how to compare competing models;
- why overfitting and generalisation must be considered.
1 Introduction
Once a model has been fitted, two further questions immediately follow:
- Is the model adequate for the data?
- How does this model compare with plausible alternatives?
These are the tasks of model checking and model evaluation. A statistically significant model may still be poorly specified, may violate assumptions, may be driven by a few influential observations, or may generalise badly to new data. Model fitting is therefore only the middle of the analysis.
2 Key Concepts
The chapter turns on the following ideas.
- Model checking asks whether the fitted model is adequate for the data and assumptions.
- Residuals are the main diagnostic tool because they show what the model failed to explain.
- Leverage and influence identify observations that matter disproportionately to the fitted result.
- Model comparison is about competing explanations, not just numerical optimisation.
- Generalisation matters because good fit to the present data does not guarantee useful prediction elsewhere.
3 A Worked Diagnostic Example
We continue with the seaweed example from the previous chapters and fit the selected multiple regression model again.
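Assuming the seaweed data frame `sw_ectz` from the earlier chapters, the selected model can be refitted and summarised as follows (the object name `mod_eval` matches the one used in the model comparison later in this chapter):

```r
# Refit the selected three-predictor model from the earlier chapters
mod_eval <- lm(Y ~ augMean + febSD + augSD, data = sw_ectz)
summary(mod_eval)
```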
4 Residual Diagnostics
Residual plots are the natural first step in model checking.
The four standard plots produced by calling `plot()` on a fitted `lm` object allow us to assess several things at once:
- Residuals vs fitted values for curvature or changing variance;
- Normal Q-Q plot for approximate normality;
- Scale-location plot for heteroscedasticity;
- Residuals vs leverage for potentially influential observations.
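These four plots can be produced in one panel, assuming the fitted model is stored as `mod_eval`:

```r
# The four standard diagnostic plots for an lm object, shown in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(mod_eval)
par(op)  # restore the previous plotting layout
```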
No diagnostic plot should be read mechanically. The question is whether the pattern is severe enough to undermine the scientific use of the model. In this example the plots suggest that the model is broadly adequate, although, as always, no real dataset is perfectly obedient.
5 Influence and Leverage
Not all unusual observations matter equally.
- A point with a large residual is poorly fitted.
- A point with high leverage has an unusual combination of predictor values.
- A point with high influence materially changes the fitted coefficients.
Cook’s distance is one commonly used summary of influence.
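A quick numerical overview of the Cook's distances can be obtained directly from the fitted model (here assumed to be stored as `mod_eval`):

```r
# Cook's distance for each observation; summary() condenses the distribution
summary(cooks.distance(mod_eval))
```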
R> Min. 1st Qu. Median Mean 3rd Qu. Max.
R> 2.020e-08 5.316e-04 1.712e-03 3.061e-03 4.956e-03 1.876e-02
An influential point should not be deleted automatically. It may reflect:
- a data entry error;
- a measurement problem;
- a biologically informative extreme case;
- or a model that is too simple for the pattern in the data.
6 Assumption Checks and Their Limits
Formal tests can be useful, but they should not replace residual plots and biological judgement.
The practical points are these:
- tiny deviations from normality are often unimportant;
- large datasets can make trivial deviations look statistically significant;
- visual residual structure is usually more informative than a single assumption test;
- the real question is whether the model is adequate for the inferential goal, not whether it is mathematically perfect.
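For illustration, a formal normality check can be run on the residuals, though, as noted above, it should supplement rather than replace the residual plots. A minimal sketch, assuming the fitted model is `mod_eval`:

```r
# Shapiro-Wilk test of residual normality; a small p-value flags a
# deviation from normality but says nothing about its practical size
shapiro.test(residuals(mod_eval))
```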
7 Model Comparison
Biological analysis often involves several plausible models, not just one. Model comparison should therefore be framed as comparing competing explanations.
We can illustrate this with the seaweed example by comparing:
- a null model;
- the selected three-predictor model;
- the larger five-predictor model considered earlier.
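The comparison below assumes that `null_mod` (the intercept-only model, e.g. `lm(Y ~ 1, data = sw_ectz)`), `mod_eval`, and `full_mod` were fitted as in the earlier chapters:

```r
# Compare the candidate models by AIC; lower values indicate better
# fit after penalising for the number of estimated parameters
AIC(null_mod, mod_eval, full_mod)
```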
R> df AIC
R> null_mod 2 -222.8193
R> mod_eval 5 -744.1734
R> full_mod 7 -826.4472
The selected model is far better than the null model, and its performance is competitive with the larger model while remaining easier to interpret. That is an important general lesson: a more complex model is not automatically a better scientific model.
8 Overfitting and Generalisation
A model can fit the present data very well while performing poorly on new data. This is overfitting.
Overfitting is especially likely when:
- the sample size is modest;
- many predictors are included;
- interaction terms are added freely;
- model choice is heavily data-driven.
The key distinction is between:
- in-sample fit: how well the model describes the data used to estimate it;
- generalisation: how well it would perform on new or unseen data.
8.1 Cross-validation
Cross-validation is one practical way to assess predictive performance:
- split the data into parts;
- fit the model on one part;
- test it on the held-out part;
- repeat across multiple splits.
This is especially useful when the modelling goal is prediction rather than explanation.
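The steps above can be sketched in base R. This is a minimal k-fold illustration, assuming the seaweed data frame `sw_ectz` and the selected model formula; the seed and number of folds are arbitrary choices:

```r
# Minimal k-fold cross-validation sketch in base R
set.seed(13)
k <- 5
# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(sw_ectz)))

cv_rmse <- sapply(1:k, function(i) {
  train <- sw_ectz[folds != i, ]   # fit on the other k - 1 folds
  test  <- sw_ectz[folds == i, ]   # predict the held-out fold
  fit   <- lm(Y ~ augMean + febSD + augSD, data = train)
  sqrt(mean((test$Y - predict(fit, newdata = test))^2))
})
mean(cv_rmse)  # average out-of-sample prediction error (RMSE)
```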
9 A Practical Workflow
After fitting a regression model:
- inspect the residual plots;
- check for leverage and influential points;
- ask whether any pattern suggests a missing predictor or wrong functional form;
- compare only biologically sensible candidate models;
- balance fit, complexity, and interpretability;
- report limitations clearly.
10 Reporting
A Results-style report of model checking and evaluation should not read like a dump of diagnostics. It should summarise what was checked and what conclusion was reached.
For example:
Diagnostic plots of the selected multiple regression indicated no major departures from linear-model assumptions. Residuals were reasonably evenly distributed across the fitted range, the Q-Q plot suggested only mild deviation from normality, and no observation showed influence severe enough to invalidate interpretation. Model comparison further showed that the selected three-predictor model was strongly preferred to the intercept-only model and retained a more interpretable structure than the larger five-predictor alternative.
11 Summary
- Fitting a model is only the middle of the analysis.
- Residual diagnostics are essential for checking whether the model is adequate.
- Leverage and influence identify observations that may disproportionately affect the fitted result.
- Model comparison should weigh competing biological explanations, not just maximise fit.
- Overfitting is a real risk whenever complexity outruns signal.
At this point the core modelling spine is in place: simple regression, multiple regression, interactions, threats to interpretation, and model evaluation. The later chapters now extend that same logic to more specialised modelling situations.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {16. {Model} {Checking} and {Evaluation}},
date = {2026-03-19},
url = {http://tangledbank.netlify.app/BCB744/basic_stats/16-model-checking-and-evaluation.html},
langid = {en}
}
