11. Residuals and Model-Based Diagnostics

Checking Fitted Models Before Interpreting Them

Published

2026/03/22

Note: In This Chapter
  • the shift from raw-data checks to model-based checks;
  • what a fitted model, a predicted value, and a residual are;
  • why regression assumptions apply to residuals;
  • how to read residual-versus-fitted and residual Q-Q plots;
  • how diagnostic patterns guide later modelling decisions.
Important: Tasks to Complete in This Chapter
  • None

1 Introduction: From Raw Data to Models

In earlier chapters, I checked assumptions on raw data. In Chapter 6, I asked whether values within groups, paired differences, or observed associations were well behaved enough for the planned test.

Regression changes the object that is checked. A regression model is a formal description of how a response variable changes with one or more predictor variables. A fitted model is that description after the data have been used to estimate it. The assumption checks then apply to the residuals, which are the differences between observed values and the values predicted by the fitted model.

The questions remain familiar. I still check shape, spread, and dependence. I now check them on residuals rather than on raw observations. In this chapter, I introduce that shift before the regression chapters begin. Simple Linear Regression, Multiple Regression and Model Specification, and Model Checking and Evaluation all build on it.

2 What Is a Model?

A model is a mathematical description of how a response changes with one or more predictors.

A response variable is the outcome being explained or predicted. Sometimes it is called the outcome or dependent variable.

A predictor variable is a measured quantity used to explain or predict the response. It is also known as the independent variable.

A fitted model is the version of the model after its unknown quantities have been estimated from the data.

A predicted value is the value the fitted model expects for the response at a given predictor value.

Consider a simple biological question: how does plant height change with light availability? Plant height is the response. Light availability is the predictor. A fitted model would describe the expected plant height at each light level. The predicted value is the height that the fitted model gives for a plant growing under a specific amount of light.

In this chapter, I stay at a theoretical level as far as model fitting is concerned. In the next chapter, I fit the first straight-line model. Here my aim is to explain what is being checked once a model has been fitted.

3 What Are Residuals?

A residual is the difference between an observed value and its predicted value:

\[ \text{residual} = \text{observed value} - \text{predicted value} \]

Residuals represent the part of the response that the fitted model has not explained. They are the observed differences between the data and the fitted values, so they approximate the unexplained variation that remains after modelling. That remaining variation may arise from omitted predictors, measurement error, natural biological variation, or a model form that is too simple for the pattern in the data.

We will use the built-in trees dataset throughout this chapter. The dataset records tree girth, height, and volume. We will treat Volume as the response and Girth as the predictor.

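library(tidyverse)  # provides as_tibble(), mutate(), select(), slice_head()

# Convert the built-in trees data to a tibble and add a tree identifier
trees_tbl <- as_tibble(trees) |>
  mutate(tree_id = row_number())

# Fit a straight-line model of volume on girth
model <- lm(Volume ~ Girth, data = trees_tbl)

# Attach the predicted volumes and residuals to the data
tree_resids <- trees_tbl |>
  mutate(
    predicted_volume = fitted(model),
    residual = residuals(model)
  )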

tree_resids |>
  select(tree_id, Girth, Volume, predicted_volume, residual) |>
  slice_head(n = 8) |>
  knitr::kable(
    digits = 2,
    col.names = c("Tree", "Girth", "Observed volume", "Predicted volume", "Residual")
  )
Tree  Girth  Observed volume  Predicted volume  Residual
   1    8.3             10.3              5.10      5.20
   2    8.6             10.3              6.62      3.68
   3    8.8             10.2              7.64      2.56
   4   10.5             16.4             16.25      0.15
   5   10.7             18.8             17.26      1.54
   6   10.8             19.7             17.77      1.93
   7   11.0             15.6             18.78     -3.18
   8   11.0             18.2             18.78     -0.58

Each row in the table shows the same calculation. The fitted model predicts a tree volume from girth. The residual records the remaining difference between the observed and predicted volumes.

The basic R workflow is:

model <- lm(Volume ~ Girth, data = trees_tbl)

fitted(model)
residuals(model)
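To make the definition concrete, the sketch below recomputes the residuals by hand and compares them with the output of residuals(). It uses base R only and the built-in trees data frame directly, so it does not depend on trees_tbl. The girth value of 12 passed to predict() is a hypothetical tree chosen purely for illustration.

```r
# Refit the chapter's model using the built-in trees data (base R only)
model <- lm(Volume ~ Girth, data = trees)

# Residual = observed value - predicted value, computed by hand
manual_resid <- trees$Volume - fitted(model)

# The hand calculation should match residuals() up to floating-point error
all.equal(unname(manual_resid), unname(residuals(model)))

# Predicted volume for a hypothetical tree with a girth of 12
# (same units as the Girth column; this value is an illustration only)
predict(model, newdata = data.frame(Girth = 12))
```

A least-squares fit with an intercept also forces the residuals to sum to zero (up to rounding), which is why diagnostic plots are read relative to a horizontal line at zero.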

4 Why Assumptions Apply to Residuals

A fitted model aims to describe the systematic part of the relationship between a response and its predictors. The residuals contain what remains after that systematic part has been removed.

Assumptions therefore apply to the residuals because the residuals represent the unexplained variation. If the residuals still show a pattern, the model has left structure behind. If the residuals have unstable spread, the model’s uncertainty estimates become less reliable. If the residuals are dependent, the sample size is effectively smaller than it appears.

Contrast with earlier chapters:

Context            What is checked
t-test or ANOVA    data within groups
paired design      paired differences
correlation        observed relationship between variables
regression         residuals from the fitted model

5 Key Regression Assumptions

5.1 Linearity

Linearity means that the fitted model has captured the mean relationship between the predictor and the response with an appropriate straight-line form. In a simple linear regression, this means the residuals should not retain a curved or otherwise systematic pattern after the line has been fitted.
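A small simulation can show what a linearity failure looks like in the residuals. The sketch below is not part of the trees analysis. The data are hypothetical, generated from a curved (quadratic) relationship that I chose for illustration, and a straight line is then fitted anyway.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Hypothetical data: the true mean relationship is curved (quadratic)
x <- seq(1, 10, length.out = 100)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 1)

# Fit a straight line even though the truth is curved
line_fit <- lm(y ~ x)
res <- residuals(line_fit)

# Mean residual in the lower, middle, and upper thirds of the x range
thirds <- cut(x, breaks = 3, labels = c("low", "middle", "high"))
third_means <- tapply(res, thirds, mean)
third_means
```

If the straight line were adequate, the three means would all sit close to zero. Here the ends and the middle take opposite signs, which is the numerical counterpart of a curved band in a residual plot.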

5.2 Constant Variance

Variance, in this context, means the spread of the residuals around zero. Constant variance means that this spread stays broadly similar across the fitted values. If the spread expands or contracts systematically, a single estimate of residual spread overstates uncertainty where the residuals are tight and understates it where they are wide.
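The sketch below simulates a violation of constant variance. The data are hypothetical: I assume a noise standard deviation that grows in proportion to the predictor, which is one common way a funnel shape can arise, and then compare the residual spread in the lower and upper halves of the fitted values.

```r
set.seed(2)  # arbitrary seed so the simulated example is reproducible

# Hypothetical data: a straight-line mean, but noise that grows with x
x <- seq(1, 10, length.out = 200)
y <- 3 + 2 * x + rnorm(200, sd = 0.5 * x)

fit <- lm(y ~ x)

# Split the residuals at the median fitted value and compare their spread
lower_half <- fitted(fit) <= median(fitted(fit))
sd_lower <- sd(residuals(fit)[lower_half])
sd_upper <- sd(residuals(fit)[!lower_half])

c(lower = sd_lower, upper = sd_upper)
```

Splitting at the median is a crude summary that I use only to put a number on the funnel. The residual-versus-fitted plot remains the primary check.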

5.3 Independence

Independence means that one residual does not carry information about another. This requirement depends on the sampling or experimental design. Repeated measures on the same individual, temporal correlation, spatial clustering, and nested sampling can all violate independence.
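Dependence cannot be seen in a residual-versus-fitted plot alone, but it can be illustrated when the observations have a natural order. The sketch below simulates a hypothetical time series whose errors carry over from one time step to the next (an AR(1) process with a coefficient of 0.8, a value I picked for illustration) and then measures the correlation between neighbouring residuals.

```r
set.seed(3)  # arbitrary seed so the simulated example is reproducible

n <- 200
time <- seq_len(n)

# Hypothetical errors in which each value retains part of the previous one
errors <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 5 + 0.1 * time + errors

fit <- lm(y ~ time)
res <- residuals(fit)

# Lag-1 correlation: residual at time t against the residual at time t - 1
lag1_cor <- cor(res[-1], res[-n])
lag1_cor
```

For independent residuals this correlation would be close to zero. A clearly positive value means neighbouring observations repeat some of the same information.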

5.4 Normality of Residuals

Approximate normality of residuals means that the distribution of residuals is reasonably close to a normal distribution. This matters most for the reliability of confidence intervals and tests in small samples. Linearity, constant variance, and independence usually deserve attention first.

6 Diagnostic Plots

A diagnostic plot is a graph used to check whether the fitted model has left behind problematic structure in the residuals.

6.1 Residual versus Fitted Plot

The residual-versus-fitted plot places fitted values on the x-axis and residuals on the y-axis.

This plot is read by asking two questions:

  1. Do the residuals scatter around zero without a systematic pattern?
  2. Does their spread stay broadly similar across the fitted range?

Random scatter around zero supports the fitted mean structure. Curvature suggests that the straight-line form is missing part of the relationship. A funnel shape suggests changing variance.
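The plot described above can be assembled by hand in base R, which makes its ingredients explicit. This is a minimal sketch using the built-in trees data. The axis labels, the dashed zero line, and the red lowess smoother are my own choices for illustration.

```r
# Fit the chapter's model on the built-in trees data
model <- lm(Volume ~ Girth, data = trees)

fit_vals <- fitted(model)     # x-axis: fitted values
res_vals <- residuals(model)  # y-axis: residuals

plot(fit_vals, res_vals,
     xlab = "Fitted volume", ylab = "Residual",
     main = "Residuals versus fitted values")

# Reference line at zero: well-behaved residuals scatter evenly around it
abline(h = 0, lty = 2)

# A lowess smoother helps the eye detect curvature in the mean of the residuals
lines(lowess(fit_vals, res_vals), col = "red")
```

A smoother that stays close to the dashed line speaks to the first question. The vertical extent of the points along the x-axis speaks to the second.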

6.2 Q-Q Plot of Residuals

The Q-Q plot of residuals compares the ordered residuals with the values expected from a normal distribution.

Points that follow the reference line support approximate normality. Systematic bends in the tails show departures from normality. Small departures are common. Large departures deserve inspection, especially when the sample is small or when extreme points drive the pattern.

The standard base-R diagnostic commands are:

plot(model, which = 1)  # residual vs fitted
plot(model, which = 2)  # Q-Q plot
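The Q-Q plot can also be built step by step with qqnorm() and qqline(). Note one difference from the commands above: plot(model, which = 2) draws standardized residuals, whereas this sketch uses the raw residuals, so the y-axis scale differs although the shape of the pattern is read in the same way. The correlation at the end is only a rough numerical companion to the plot and is not a formal test.

```r
# Fit the chapter's model on the built-in trees data
model <- lm(Volume ~ Girth, data = trees)
res_vals <- residuals(model)

# qqnorm() plots sample quantiles against theoretical normal quantiles
# and invisibly returns both sets of values
qq <- qqnorm(res_vals, main = "Normal Q-Q plot of residuals")

# qqline() adds a reference line through the first and third quartiles
qqline(res_vals)

# Rough summary: how closely the points track a straight line
qq_cor <- cor(qq$x, qq$y)
qq_cor
```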

7 Worked Example

We continue with the same trees model:

library(broom)  # provides augment()

model <- lm(Volume ~ Girth, data = trees_tbl)

augment(model) |>
  select(Girth, Volume, .fitted, .resid) |>
  slice_head(n = 6)
# A tibble: 6 × 4
  Girth Volume .fitted .resid
  <dbl>  <dbl>   <dbl>  <dbl>
1   8.3   10.3    5.10  5.20 
2   8.6   10.3    6.62  3.68 
3   8.8   10.2    7.64  2.56 
4  10.5   16.4   16.2   0.152
5  10.7   18.8   17.3   1.54 
6  10.8   19.7   17.8   1.93 

The fitted values are the model’s expected tree volumes. The residuals show how far the observed volumes depart from those fitted values.

Figure 1: Residuals versus fitted values for the tree-volume model.

In Figure 1, the residuals are centred near zero across most of the fitted range. The smooth line bends slightly at the ends, and the spread increases for larger fitted values. The straight-line model captures much of the relationship, but it leaves some structure in the residuals. That is the kind of signal that later chapters use when they move to transformed, polynomial, or nonlinear fits.

Figure 2: Q-Q plot of residuals for the tree-volume model.

In Figure 2, most points follow the line closely through the centre of the distribution. The tails depart modestly. That pattern supports approximate normality well enough for an introductory straight-line model. The stronger warning in this example comes from the residual-versus-fitted plot rather than from the Q-Q plot.
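The reading of the residual-versus-fitted plot can be backed with a rough numerical summary. The sketch below divides the fitted range of the trees model into three equal-width bands, a split I chose for convenience, and reports the mean and standard deviation of the residuals in each band. With only 31 trees the bands are small, so the numbers are indicative at best.

```r
# Fit the chapter's model on the built-in trees data
model <- lm(Volume ~ Girth, data = trees)
res_vals <- residuals(model)

# Three equal-width bands across the range of fitted values
bands <- cut(fitted(model), breaks = 3, labels = c("low", "middle", "high"))

# Mean residual per band: values far from zero hint at curvature
band_means <- tapply(res_vals, bands, mean)

# Residual spread per band: clearly unequal values hint at changing variance
band_sds <- tapply(res_vals, bands, sd)

rbind(mean = band_means, sd = band_sds)
```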

8 Linking Diagnostics to Decisions

Use the diagnostic patterns to guide the next modelling step:

  • Curvature in the residual-versus-fitted plot, as in Figure 1, suggests a change in functional form. In later chapters that may mean a polynomial term, a smoother, or a mechanistic nonlinear model.
  • Changing spread across fitted values, also visible in Figure 1, suggests a transformation or a model with a different mean-variance structure.
  • Strong non-normality in the residuals, which would appear in a Q-Q plot such as Figure 2, suggests checking outliers, influential observations, or a more suitable response distribution.
  • Dependence in the residuals points back to the design and forward to repeated-measures or mixed-model methods.

A diagnostic plot does not declare success or failure by itself. It shows where the model still disagrees with the data and suggests where the model may need revising.
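As one illustration of acting on a diagnostic, the sketch below refits the trees model with both variables log-transformed and draws the two residual-versus-fitted plots side by side. The log-log form is an assumption I make here for demonstration. It is only one of several possible responses to curvature and changing spread, and it is not necessarily the approach the later chapters take.

```r
# Original straight-line model and a log-log alternative (illustrative choice)
straight_fit <- lm(Volume ~ Girth, data = trees)
log_fit <- lm(log(Volume) ~ log(Girth), data = trees)

# Draw the two residual-versus-fitted plots side by side
op <- par(mfrow = c(1, 2))

plot(fitted(straight_fit), residuals(straight_fit),
     xlab = "Fitted volume", ylab = "Residual", main = "Straight line")
abline(h = 0, lty = 2)

plot(fitted(log_fit), residuals(log_fit),
     xlab = "Fitted log(volume)", ylab = "Residual", main = "Log-log")
abline(h = 0, lty = 2)

par(op)  # restore the previous plotting layout
```

The residuals of the two models are on different scales (volume units versus log units), so compare the shape of each scatter and not the size of the numbers.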

9 Connection to Earlier Chapters

In this chapter, I extend the same checking habits you used earlier in Chapter 6:

  • Grouped comparisons checked variation within groups. Regression checks variation across fitted values.
  • Normality was checked within groups or within paired differences. Regression checks normality in the residuals.
  • Unequal spread across groups warned against simple mean comparisons. Regression checks whether residual spread changes across the fitted range.

So ask: is the structure plausible, is the spread stable, and are the observations independent? In regression, we ask those questions after a model has been fitted. For t-tests, ANOVAs, and correlations, we check the assumptions on the data before running the test.

10 Summary

Residuals are the differences between observed and predicted values.

Regression assumptions are checked on residuals because residuals contain the unexplained variation left by the fitted model.

Residual-versus-fitted plots assess mean structure and spread. Q-Q plots assess approximate normality of residuals.

Diagnostic patterns tell you whether the current model is adequate or whether the model needs revision.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {11. {Residuals} and {Model-Based} {Diagnostics}},
  date = {2026-03-22},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/11-residuals-and-model-based-diagnostics.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 11. Residuals and Model-Based Diagnostics. https://tangledbank.netlify.app/BCB744/basic_stats/11-residuals-and-model-based-diagnostics.html.