17. Model Checking and Evaluation

Diagnostics, Comparison, and Generalisation

Published

2026/03/22

Note: In This Chapter
  • why model fitting is not the end of the analysis;
  • how residuals are used to assess model adequacy;
  • what leverage and influence mean in practice;
  • how to compare competing models;
  • why overfitting and generalisation must be considered.
Important: Tasks to Complete in This Chapter
  • None

1 Introduction

Once a model has been fitted, two further questions immediately follow:

  1. Is the model adequate for the data?
  2. How does this model compare with plausible alternatives?

These are the tasks of model checking and model evaluation. A statistically significant model may still be poorly specified, may violate assumptions, may be driven by a few influential observations, or may generalise badly to new data. Model fitting is therefore only the middle of the analysis.

2 Key Concepts

In this chapter, I organise the discussion around the following ideas.

  • Model checking asks whether the fitted model is adequate for the data and assumptions.
  • Residuals are the main diagnostic tool because they show what the model failed to explain.
  • Leverage and influence identify observations that matter disproportionately to the fitted result.
  • Model comparison is about competing explanations, not just numerical optimisation.
  • Generalisation matters because good fit to the present data does not guarantee useful prediction elsewhere.

3 Nature of the Data and Assumptions

Model checking begins only after a model has been fitted, but the logic depends on the same assumptions introduced earlier in the regression sequence.

  • Independence remains fundamental because residual dependence can make a model look better behaved than it really is.
  • Linearity of the mean structure still matters unless a more flexible model class has been justified.
  • Homoscedasticity affects the reliability of standard errors and confidence intervals.
  • Residual distribution should be broadly compatible with the inferential use of the model, even though mild deviations are common.
  • Generalisation depends on whether the fitted model captures signal rather than noise.

The practical aim is not to prove that a model is perfect. It is to decide whether the fitted model is adequate for the scientific question being asked.

4 R Functions

The main functions used in this chapter are:

  • plot() for the standard diagnostic panels from a fitted lm() object;
  • cooks.distance() for a summary of influence;
  • AIC() for comparing biologically plausible candidate models;
  • augment() from broom when fitted values, residuals, or leverage diagnostics need to be joined back to the data.
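
As a brief illustration of the last of these (assuming the broom package is installed): applied to a fitted model such as mod_eval, which is fitted in the next section, augment() returns the original data with the fitted values, residuals, leverage (.hat), and Cook's distance (.cooksd) appended as extra columns, so diagnostics can be filtered and plotted with ordinary data-manipulation tools.

```r
library(broom)

# One row per observation, with diagnostic columns appended
sw_diag <- augment(mod_eval)
head(sw_diag)
```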

5 A Worked Diagnostic Example

We continue with the seaweed example from the previous chapters and fit the selected multiple regression model again.

library(tidyverse)

sw <- read.csv("../../data/BCB743/seaweed/spp_df2.csv")
sw_ectz <- sw |>
  filter(bio == "ECTZ")

mod_eval <- lm(Y ~ augMean + febSD + augSD, data = sw_ectz)
summary(mod_eval)

Call:
lm(formula = Y ~ augMean + febSD + augSD, data = sw_ectz)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.153994 -0.049229 -0.006086  0.045947  0.148579 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
augMean     0.283335   0.011131  25.455  < 2e-16 ***
febSD       0.049639   0.008370   5.930 8.73e-09 ***
augSD       0.022150   0.004503   4.919 1.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06609 on 285 degrees of freedom
Multiple R-squared:  0.8387,    Adjusted R-squared:  0.837 
F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16

5.1 Residual Diagnostics

Residual plots are the natural first step in model checking.
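
The four standard panels come directly from the fitted lm() object via plot(); arranging them in a 2 × 2 grid shows them all at once.

```r
# Show all four diagnostic panels in one 2 x 2 layout
op <- par(mfrow = c(2, 2))
plot(mod_eval)
par(op)
```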

Figure 1: The four standard diagnostic panels for the fitted model.

These four standard plots allow us to assess several things at once:

  • Residuals vs fitted values for curvature or changing variance;
  • Normal Q-Q plot for approximate normality;
  • Scale-location plot for heteroscedasticity;
  • Residuals vs leverage for potentially influential observations.

No diagnostic plot should be read mechanically. The question is whether the pattern is severe enough to undermine the scientific use of the model. In this example the plots suggest that the model is broadly adequate, although, as always, no real dataset is perfectly obedient.

5.2 Influence and Leverage

Not all unusual observations matter equally.

  • A point with a large residual is poorly fitted.
  • A point with high leverage has an unusual combination of predictor values.
  • A point with high influence materially changes the fitted coefficients.

Cook’s distance is one commonly used summary of influence.

cook_summary <- summary(cooks.distance(mod_eval))
cook_summary
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
2.020e-08 5.316e-04 1.712e-03 3.061e-03 4.956e-03 1.876e-02 
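
Beyond the summary, it can help to see which observations stand out. A common rule of thumb (an informal screening device, not part of any formal test) flags points whose Cook's distance exceeds 4/n; a minimal sketch:

```r
cooks_d <- cooks.distance(mod_eval)

# 4/n is a rough screening threshold; points above it deserve
# inspection, not automatic deletion
flagged <- which(cooks_d > 4 / nobs(mod_eval))
sw_ectz[flagged, ]
```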

An influential point should not be deleted automatically. It may reflect:

  • a data entry error;
  • a measurement problem;
  • a biologically informative extreme case;
  • or a model that is too simple for the pattern in the data.

5.3 Assumption Checks and Their Limits

Formal tests can be useful, but they should not replace residual plots and biological judgement.

The practical points are these:

  • tiny deviations from normality are often unimportant;
  • large datasets can make trivial deviations look statistically significant;
  • visual residual structure is usually more informative than a single assumption test;
  • the real question is whether the model is adequate for the inferential goal, not whether it is mathematically perfect.
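
For completeness, here are two commonly used formal checks, read alongside (never instead of) the residual plots. The second assumes the lmtest package is installed.

```r
# Shapiro-Wilk test of residual normality (base R)
shapiro.test(residuals(mod_eval))

# Breusch-Pagan test of constant residual variance (lmtest package)
lmtest::bptest(mod_eval)
```

With nearly 300 observations, remember the caveat above: either test may return a small p-value for a deviation too trivial to matter.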

5.4 Model Comparison

Biological analysis often involves several plausible models, not just one. Model comparison should therefore be framed as comparing competing explanations.

We can illustrate this with the seaweed example by comparing:

  • a null model;
  • the selected three-predictor model;
  • the larger five-predictor model considered earlier.

null_mod <- lm(Y ~ 1, data = sw_ectz)
full_mod <- lm(Y ~ augMean + febRange + febSD + augSD + annMean, data = sw_ectz)

AIC(null_mod, mod_eval, full_mod)
         df       AIC
null_mod  2 -222.8193
mod_eval  5 -744.1734
full_mod  7 -826.4472
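
Raw AIC values are only meaningful relative to one another, so it is often clearer to tabulate differences from the best candidate; a sketch:

```r
aic_tab <- AIC(null_mod, mod_eval, full_mod)

# Delta AIC: difference from the best (lowest-AIC) candidate
aic_tab$dAIC <- aic_tab$AIC - min(aic_tab$AIC)
aic_tab[order(aic_tab$dAIC), ]
```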

The selected model is far better than the null model, and its performance is competitive with the larger model while remaining easier to interpret. That is an important general lesson: a more complex model is not automatically a better scientific model.

5.5 Overfitting and Generalisation

A model can fit the present data very well while performing poorly on new data. This is overfitting.

Overfitting is especially likely when:

  • the sample size is modest;
  • many predictors are included;
  • interaction terms are added freely;
  • model choice is heavily data-driven.

The key distinction is between:

  • in-sample fit: how well the model describes the data used to estimate it;
  • generalisation: how well it would perform on new or unseen data.

5.5.1 Cross-validation

Cross-validation is one practical way to assess predictive performance:

  1. split the data into parts;
  2. fit the model on one part;
  3. test it on the held-out part;
  4. repeat across multiple splits.

This is especially useful when the modelling goal is prediction rather than explanation.

We can make that concrete by comparing the selected three-predictor model with the larger five-predictor model using 10-fold cross-validation on the same seaweed subset.

set.seed(74416)

fold_id <- sample(rep(1:10, length.out = nrow(sw_ectz)))

# Compute the out-of-sample RMSE for each fold: fit the model on the
# training folds and predict the single held-out fold
cv_rmse <- function(formula, data, folds) {
  tibble(fold = sort(unique(folds))) |>
    mutate(
      rmse = map_dbl(fold, \(k) {
        # Split the data into training and test sets for fold k
        train_dat <- data[folds != k, , drop = FALSE]
        test_dat <- data[folds == k, , drop = FALSE]

        mod <- lm(formula, data = train_dat)
        pred <- predict(mod, newdata = test_dat)

        # Root mean squared prediction error on the held-out fold
        sqrt(mean((test_dat$Y - pred) ^ 2))
      })
    )
}

cv_mod_eval <- cv_rmse(Y ~ augMean + febSD + augSD, sw_ectz, fold_id) |>
  mutate(model = "Selected three-predictor model")

cv_full_mod <- cv_rmse(Y ~ augMean + febRange + febSD + augSD + annMean,
                       sw_ectz,
                       fold_id) |>
  mutate(model = "Larger five-predictor model")

cv_results <- bind_rows(cv_mod_eval, cv_full_mod)

cv_results |>
  group_by(model) |>
  summarise(
    mean_rmse = mean(rmse),
    sd_rmse = sd(rmse),
    .groups = "drop"
  )
# A tibble: 2 × 3
  model                          mean_rmse sd_rmse
  <chr>                              <dbl>   <dbl>
1 Larger five-predictor model       0.0577 0.00615
2 Selected three-predictor model    0.0663 0.00577

ggplot(cv_results, aes(x = model, y = rmse, fill = model)) +
  geom_boxplot(alpha = 0.8, width = 0.55, show.legend = FALSE) +
  geom_jitter(width = 0.08, alpha = 0.8, size = 1.4, show.legend = FALSE) +
  labs(x = NULL, y = "Cross-validated RMSE") +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 12, hjust = 1))
Figure 2: Ten-fold cross-validation RMSE for the selected three-predictor seaweed model and a larger five-predictor alternative.

The cross-validation results show that the larger model does improve out-of-sample prediction, but only modestly. In this run, the mean RMSE decreased from about 0.066 for the selected three-predictor model to about 0.058 for the larger five-predictor model. That is a real predictive gain, but it is not dramatic, so the trade-off between predictive improvement and interpretability still needs to be judged explicitly.

6 A Practical Workflow

After fitting a regression model:

  1. inspect the residual plots;
  2. check for leverage and influential points;
  3. ask whether any pattern suggests a missing predictor or wrong functional form;
  4. compare only biologically sensible candidate models;
  5. balance fit, complexity, and interpretability;
  6. report limitations clearly.

7 Reporting

A journal article should not present model checking as a raw list of diagnostics. Instead, it should state what was checked, what problems were looked for, and what conclusion was reached about the adequacy of the model.

Note: Write-Up

Methods

After fitting the selected multiple regression model, diagnostic plots were examined to assess linearity, homoscedasticity, and the approximate normality of residuals. Leverage and influence were also inspected to identify observations that might disproportionately affect the fitted model. Model performance was compared with that of both an intercept-only model and a larger alternative model using information-theoretic criteria, and predictive performance was further assessed using 10-fold cross-validation.

Results

Diagnostic evaluation of the selected multiple regression revealed no major departures from model assumptions. Residuals were distributed reasonably evenly across the fitted range, the Q-Q plot suggested only mild deviation from normality, and no individual observation appeared sufficiently influential to undermine interpretation. Model comparison further indicated that the selected three-predictor model was strongly preferred to the intercept-only model and achieved a fit comparable to that of the larger five-predictor model while retaining a simpler and more interpretable structure. Ten-fold cross-validation showed that the larger model reduced mean out-of-sample RMSE from about 0.066 to 0.058, indicating a modest predictive advantage at the cost of additional complexity.

Discussion

These results suggest that the selected model provides an adequate statistical description of the data and that its biological interpretation is not driven by obvious assumption violations or isolated influential observations. The comparison with the larger candidate model also reinforces a central modelling principle: additional complexity is justified only when it yields a clearer or substantially better explanation, not simply because it increases fit.

8 Summary

  • Fitting a model is only the middle of the analysis.
  • Residual diagnostics are essential for checking whether the model is adequate.
  • Leverage and influence identify observations that may disproportionately affect the fitted result.
  • Model comparison should weigh competing biological explanations, not just maximise fit.
  • Overfitting is a real risk whenever complexity outruns signal.

At this point the core modelling spine is in place: simple regression, multiple regression, interactions, threats to interpretation, and model evaluation. The later chapters now extend that same logic to more specialised modelling situations.

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {17. {Model} {Checking} and {Evaluation}},
  date = {2026-03-22},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/17-model-checking-and-evaluation.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 17. Model Checking and Evaluation. https://tangledbank.netlify.app/BCB744/basic_stats/17-model-checking-and-evaluation.html.