14. Multiple Regression and Model Specification

From Biological Hypotheses to Statistical Models

Published

2026/03/22

Note: In This Chapter
  • why simple regression is often not enough;
  • how a multiple regression extends the simple linear model;
  • why model specification is part of the scientific argument;
  • how to think about omitted variables, proxies, and categorical predictors;
  • how dummy-variable coding works in practice for real bioregions;
  • how to fit and interpret a worked multiple regression example;
  • how interactions and contrasts fit into the broader multiple-regression workflow.
Important: Tasks to Complete in This Chapter
  • None

1 Introduction

In Chapter 12 we saw how to model the relationship between two variables using simple linear regression. In ecosystems, however, the relationship between the response and its drivers is usually more complex than a single predictor can capture. In such cases, multiple linear regression models the response as a function of several explanatory variables at once.

Biological systems are rarely driven by one variable alone. A response such as growth, abundance, diversity, or disease prevalence usually reflects the combined influence of several environmental, physiological, or experimental factors. Multiple regression is therefore the natural extension of simple regression to a more realistic ecological setting.

But adding more predictors creates a much more demanding scientific problem: which predictors belong in the model, and why? This chapter therefore combines two topics that should not be separated:

  1. multiple regression as a modelling framework; and
  2. model specification as the translation of biological ideas into statistical structure.

2 Key Concepts

In this chapter, I organise the discussion around the following ideas.

  • Multiple regression models a response using several predictors at once.
  • Holding other predictors constant is what gives partial coefficients their meaning.
  • Model specification translates biological reasoning into a statistical formula.
  • Omitted variables, poor functional form, and unjustified interactions can distort inference.
  • Predictor choice should be theory-driven rather than data-convenience driven.
  • Categorical predictors are part of the same linear-model family and can be handled naturally in multiple regression.
  • ANCOVA is not a separate mathematical species; it is a linear model containing a factor and a continuous covariate.

3 Nature of the Data and Assumptions

The same broad assumptions that apply in simple linear regression apply here too, but now in a multivariable setting.

  • Response variable The response should usually be continuous.
  • Predictors Predictor variables are often continuous, but they may also be categorical.
  • Independence The observations must be independent.
  • Approximate linearity The response should relate approximately linearly to each predictor on the scale used in the model.
  • Residual assumptions Residuals should be approximately normal with roughly constant variance.
  • Limited multicollinearity Predictors should not be so strongly overlapping that their coefficients become unstable or uninterpretable.

The additional difficulty in multiple regression is that the model must be judged as a fitted statistical object and also as a biological specification. Even if the residual assumptions are acceptable, the model can still be weak because the variables are badly chosen or badly motivated. We return to the interpretive problems created by multicollinearity, confounding, and measurement error in Chapter 16.

4 The Core Equation

The general multiple-regression model can be written as:

\[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \epsilon_i \tag{1}\]

In Equation 1, each \(\beta_j\) is a partial slope. That means each coefficient is interpreted while holding the other predictors in the model constant. This is the main mathematical difference from simple linear regression and the reason multiple-regression coefficients require more careful biological interpretation.
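To make the difference concrete, here is a minimal simulated sketch (hypothetical data; the names x1 and x2 are illustrative, not from the seaweed dataset). When two predictors overlap, the simple-regression slope of x1 absorbs part of the effect of x2, while the partial slope from the multiple regression recovers something close to the true coefficient.

```r
# Hypothetical simulation: partial vs simple slopes under correlated predictors.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)            # x2 overlaps with x1
y  <- 2 + 1.0 * x1 + 0.5 * x2 + rnorm(n, sd = 0.3)

coef(lm(y ~ x1))["x1"]        # simple slope: inflated by x2's omitted effect
coef(lm(y ~ x1 + x2))["x1"]   # partial slope: close to the true value of 1.0
```

The simple slope is biased towards roughly 1.4 here because x2 is omitted; the partial slope, interpreted with x2 held constant, is near 1.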

5 Outliers

As in simple regression, outliers can have strong effects on coefficient estimation, standard errors, and interpretation. In multiple regression, however, unusual observations may be unusual in multivariate space even when they do not look extreme on any single predictor alone. So, diagnostic plots and leverage measures become especially important once we fit the model.
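A brief sketch of this idea, using simulated data (hypothetical values; not from the seaweed dataset): an observation can be unremarkable on each predictor separately yet sit far from the joint pattern, and leverage measures such as hatvalues() pick this up.

```r
# Hypothetical simulation: a point unusual in multivariate space only.
set.seed(42)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.2)       # x1 and x2 track each other closely
# Add one point that is not extreme on either axis alone,
# but falls well off the joint x1-x2 trend:
x1 <- c(x1, 1.5)
x2 <- c(x2, -1.5)
y  <- x1 + x2 + rnorm(51, sd = 0.5)

fit <- lm(y ~ x1 + x2)
which.max(hatvalues(fit))            # the added observation has the highest leverage
```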

6 R Function

The lm() function in R is used to fit a multiple linear regression model. The syntax is similar to that of the lm() function used for simple linear regression, but now with several predictor variables.

The function takes the basic form:

lm(formula, data)

For a model with only continuous predictors:

lm(response ~ predictor1 + predictor2 + predictor3,
   data = dataset)

Interaction effects are implemented by including the product of two variables in the formula:

lm(response ~ predictor1 * predictor2, data = dataset)

When we have both continuous and categorical predictors, the formula is simply extended:

lm(response ~ continuous_predictor1 + continuous_predictor2 +
     factor(categorical_predictor),
   data = dataset)

R automatically handles the dummy-variable coding of factors. This is one reason multiple regression and ANOVA are so closely connected.

7 Where ANCOVA Fits

Students are often taught ANCOVA as if it were a separate named procedure standing halfway between ANOVA and regression. In practice, ANCOVA is just a linear model that contains:

  • a categorical predictor representing groups or treatments; and
  • a continuous covariate that we want to adjust for.

In R, the regression framing is written as:

lm(response ~ factor_group + covariate, data = dataset)

If one wants to keep the older ANOVA-style framing, the same additive structure can also be written as:

aov(response ~ factor_group + covariate, data = dataset)

The reason ANCOVA belongs in this chapter is that the model is algebraically the same as multiple regression with mixed predictor types. The factor is represented through dummy variables, the covariate enters as an ordinary slope term, and the fitted group coefficients are interpreted after adjusting for the covariate.

This is also why the distinction between ANOVA and ANCOVA is mostly one of emphasis:

  • ANOVA framing emphasises comparing group means;
  • ANCOVA framing emphasises comparing groups while controlling for a continuous covariate;
  • regression framing makes the common model structure explicit.

Mathematically, there is no difference. The fitted model is the same. What changes is the teaching emphasis and, in R, the style of output that is foregrounded.

If the model is:

lm(y ~ group + x, data = df)

or equivalently:

aov(y ~ group + x, data = df)

then:

  • the coefficient of x is the adjusted slope of the covariate;
  • the coefficients of group are adjusted differences among the factor levels relative to the reference level;
  • the whole model can be described either as a regression with a factor and a covariate or as an ANCOVA.

The practical difference is therefore one of emphasis:

  • lm() encourages you to read coefficients, fitted slopes, confidence intervals, and the common linear-model structure directly.
  • aov() encourages you to read the analysis in ANOVA-style terms, with sums of squares and omnibus tests emphasised.
  • For ANCOVA, lm() is usually the clearer teaching choice because it makes it obvious that the covariate and the factor are part of the same linear model.

The important assumption in the usual introductory ANCOVA framing is that the slope of the covariate is the same across groups. In regression language, that means we are fitting an additive model without a group:x interaction.

If the biological question is whether the covariate-response slope differs among groups, then the model is no longer a simple ANCOVA in the introductory sense. It becomes an interaction model, and that is the topic of Chapter 15.

So the clean teaching rule is:

  • y ~ group + x → ANCOVA-style additive model;
  • y ~ group * x → interaction model with group-specific slopes.
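The two rules above can be compared formally with a nested-model test. In this sketch (hypothetical simulated data in which the slopes genuinely differ between groups), anova() asks whether the interaction model improves on the additive one.

```r
# Hypothetical simulation: additive (ANCOVA-style) vs interaction model.
set.seed(11)
df <- data.frame(group = factor(rep(c("A", "B"), each = 30)),
                 x     = runif(60))
df$y <- ifelse(df$group == "A", 1 + 2.0 * df$x, 3 + 0.5 * df$x) +
        rnorm(60, sd = 0.2)

m_add <- lm(y ~ group + x, data = df)   # one common slope
m_int <- lm(y ~ group * x, data = df)   # group-specific slopes
anova(m_add, m_int)                     # tests whether the slopes differ
```

Here the interaction term is strongly supported, because the simulated slopes (2.0 vs 0.5) really do differ between groups.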

8 Why Model Specification is Important

In simple regression, the modelling decision is often obvious: one response, one predictor. In multiple regression, the structure of the model becomes part of the argument. The important questions are:

  • Which variables represent the hypothesised process?
  • Which variables are potential confounders that must be controlled?
  • Which variables are only proxies for a harder-to-measure mechanism?
  • Should the relationship be linear, transformed, or nonlinear?
  • Does the effect of one predictor depend on another?

Model specification therefore comes before model fitting. If the biological logic is weak, the fitted model is likely to be weak as well. The practical consequences of confounders, proxies, and measurement error are developed in much more detail in Chapter 16.

9 Example 1: A Climate Model for the East Coast Transition Zone

The main worked example in this chapter uses the seaweed data produced in the analysis by Smit et al. (2017). The same dataset is used repeatedly in BCB_Stats, and it is well suited to this chapter because it contains:

  • a continuous response variable, the mean Sørensen dissimilarity Y;
  • several continuous climatological predictors;
  • and a categorical predictor, the bioregional classification bio.
sw <- read.csv("../../data/BCB743/seaweed/spp_df2.csv")
rbind(head(sw, 3), tail(sw, 3))[,-1]
       dist  bio    augMean   febRange      febSD     augSD    annMean
1     0.000  BMP 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
2    51.138  BMP 0.05741369 0.09884404 0.16295271 0.3132800 0.01501846
3   104.443  BMP 0.15043904 0.34887754 0.09934163 0.4188239 0.02602247
968 102.649 ECTZ 0.41496099 0.11330069 0.24304493 0.7538546 0.52278161
969  49.912 ECTZ 0.17194242 0.05756093 0.18196664 0.3604341 0.24445006
970   0.000 ECTZ 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
              Y        Y1          Y2
1   0.000000000 0.0000000 0.000000000
2   0.003610108 0.0000000 0.003610108
3   0.003610108 0.0000000 0.003610108
968 0.198728140 0.1948882 0.003839961
969 0.069337442 0.0443038 0.025033645
970 0.000000000 0.0000000 0.000000000

We begin by focusing on the East Coast Transition Zone (ECTZ) and asking which combination of climate variables best explains variation in Y.

sw_ectz <- sw |>
  filter(bio == "ECTZ")

sw_sub1 <- sw_ectz[, c("Y", "augMean", "febRange", "febSD", "augSD", "annMean")]

9.1 Do an Exploratory Data Analysis (EDA)

We first inspect the scale and range of the variables in the subset we intend to model.

summary(sw_sub1)
       Y              augMean          febRange          febSD       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.09615   1st Qu.:0.1817   1st Qu.:0.1866   1st Qu.:0.1820  
 Median :0.22779   Median :0.4686   Median :0.7531   Median :0.6256  
 Mean   :0.24393   Mean   :0.5313   Mean   :0.8329   Mean   :0.7209  
 3rd Qu.:0.35690   3rd Qu.:0.7927   3rd Qu.:1.3896   3rd Qu.:1.2072  
 Max.   :0.63754   Max.   :1.7508   Max.   :2.2704   Max.   :2.1249  
     augSD           annMean      
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.4178   1st Qu.:0.1874  
 Median :1.0454   Median :0.5228  
 Mean   :1.3207   Mean   :0.5722  
 3rd Qu.:2.0924   3rd Qu.:0.8085  
 Max.   :4.5267   Max.   :1.8941  

Before fitting a multiple regression, it is also useful to look at each predictor individually against the response because it provides intuition about the data structure before we begin combining effects.

predictors <- c("augMean", "febRange", "febSD", "augSD", "annMean")

models <- map(predictors, ~ lm(as.formula(paste("Y ~", .x)),
                               data = sw_ectz))

plts1 <- map(predictors, function(predictor) {
  ggplot(sw_ectz, aes(x = .data[[predictor]], y = Y)) +
    geom_point(shape = 1, colour = "dodgerblue4") +
    geom_smooth(method = "lm", col = "magenta", fill = "pink") +
    labs(title = paste("Y vs", predictor),
         x = predictor,
         y = "Y") +
    theme_grey()
})

ggpubr::ggarrange(plotlist = plts1, ncol = 2,
                  nrow = 3, labels = "AUTO")
Figure 1: Individual simple linear regressions fitted to the variables augMean, febRange, febSD, augSD, and annMean as predictors of the seaweed species composition as measured by the Sørensen dissimilarity, Y.

The individual plots show that each predictor appears to relate positively and approximately linearly to the response when considered alone. But those one-predictor views do not tell us how the predictors behave once they are considered together, and they do not tell us whether some predictors overlap so strongly that their coefficients become unstable.

9.2 State the Hypotheses

The null hypothesis for the overall model is that the response Y is not related to any of the entered climatological variables, implying that the coefficients for all predictors are equal to zero:

\[H_{0}: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\]

The alternative is that at least one coefficient differs from zero:

\[H_{a}: \exists \, \beta_i \ne 0 \text{ for at least one } i\]

At the main-effects level, each predictor has its own hypothesis:

\[H_{0}: \beta_i = 0\] \[H_{a}: \beta_i \ne 0\]

However, as soon as several predictors are included in the model, each coefficient becomes a partial effect. Its interpretation depends on the presence of the others.

9.3 Fit the Initial Model

We first fit a full model containing the five climate predictors, and a null model containing only the intercept.

full_mod1 <- lm(Y ~ augMean + febRange + febSD + augSD + annMean,
                data = sw_sub1)
null_mod1 <- lm(Y ~ 1, data = sw_sub1)

summary(full_mod1)

Call:
lm(formula = Y ~ augMean + febRange + febSD + augSD + annMean, 
    data = sw_sub1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.131229 -0.042352 -0.003415  0.038601  0.144128 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.044886   0.006306   7.118 9.04e-12 ***
augMean     -0.079925   0.042630  -1.875 0.061841 .  
febRange     0.112913   0.015949   7.080 1.14e-11 ***
febSD       -0.057174   0.016554  -3.454 0.000637 ***
augSD        0.003024   0.004886   0.619 0.536421    
annMean      0.322776   0.041615   7.756 1.59e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05713 on 283 degrees of freedom
Multiple R-squared:  0.8804,    Adjusted R-squared:  0.8782 
F-statistic: 416.5 on 5 and 283 DF,  p-value: < 2.2e-16

The full model fits strongly, but the coefficient table already hints at a practical problem: some variables that looked useful on their own are no longer clearly interpretable once the others are included. This is a common sign of multicollinearity.

9.4 Check for Multicollinearity

Some of the predictors may be highly correlated with each other, and this can make the model less precise and harder to interpret because the coefficients’ standard errors become inflated. A formal way to detect multicollinearity is to calculate the variance inflation factor (VIF) for each predictor.
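The VIF has a simple definition worth seeing once by hand: for predictor j, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A minimal sketch with hypothetical simulated data:

```r
# VIF by hand (hypothetical data): 1 / (1 - R^2 of predictor-on-predictors).
set.seed(3)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.3)   # strongly overlapping predictor

vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x1                                  # well above 1, reflecting the overlap
```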

vif(full_mod1)
  augMean  febRange     febSD     augSD   annMean 
27.947767 10.806635  8.765732  2.497739 31.061900 

The VIF values show that some predictors overlap too strongly for comfortable interpretation. We therefore simplify the candidate set before asking which predictors belong in the final model.

This is a useful practical response to collinearity, but it is not the whole conceptual story. Chapter 16 explains why overlapping predictors create interpretive instability, how that differs from confounding, and why measurement error can produce similar practical difficulties.

threshold <- 10
vif_mod <- full_mod1
sw_vif <- sw_sub1

while (TRUE) {
  vif_values <- vif(vif_mod)
  max_vif <- max(vif_values)

  if (max_vif > threshold) {
    high_vif_var <- names(which.max(vif_values))
    sw_vif <- sw_vif[, !(names(sw_vif) %in% high_vif_var)]
    vif_mod <- lm(Y ~ ., data = sw_vif)
  } else {
    break
  }
}

summary(vif_mod)

Call:
lm(formula = Y ~ ., data = sw_vif)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.153994 -0.049229 -0.006086  0.045947  0.148579 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
augMean     0.283335   0.011131  25.455  < 2e-16 ***
febSD       0.049639   0.008370   5.930 8.73e-09 ***
augSD       0.022150   0.004503   4.919 1.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06609 on 285 degrees of freedom
Multiple R-squared:  0.8387,    Adjusted R-squared:  0.837 
F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16
vif(vif_mod)
 augMean    febSD    augSD 
1.423601 1.674397 1.585055 

This process leaves a much more interpretable candidate model. The retained predictors are augMean, febSD, and augSD, and the VIF values are now comfortably low.

Regularisation techniques such as ridge regression, lasso regression, or elastic net can also be used to deal with multicollinearity. These are discussed in Chapter 25.

9.5 Fit the Selected Model

It may be that not all variables retained after the VIF screen are needed to explain the response. We can therefore use forward selection to identify the most parsimonious model within this reduced candidate set.

mod1 <- stepAIC(
  lm(Y ~ 1, data = sw_vif),
  scope = list(lower = lm(Y ~ 1, data = sw_vif), upper = vif_mod),
  direction = "forward",
  trace = FALSE
)

formula(mod1)
Y ~ augMean + febSD + augSD
summary(mod1)

Call:
lm(formula = Y ~ augMean + febSD + augSD, data = sw_vif)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.153994 -0.049229 -0.006086  0.045947  0.148579 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
augMean     0.283335   0.011131  25.455  < 2e-16 ***
febSD       0.049639   0.008370   5.930 8.73e-09 ***
augSD       0.022150   0.004503   4.919 1.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06609 on 285 degrees of freedom
Multiple R-squared:  0.8387,    Adjusted R-squared:  0.837 
F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16
anova(mod1)
Analysis of Variance Table

Response: Y
           Df Sum Sq Mean Sq  F value    Pr(>F)    
augMean     1 6.0084  6.0084 1375.660 < 2.2e-16 ***
febSD       1 0.3604  0.3604   82.507 < 2.2e-16 ***
augSD       1 0.1057  0.1057   24.196 1.473e-06 ***
Residuals 285 1.2448  0.0044                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The selected model is:

\[ Y = \alpha + \beta_1 \text{augMean} + \beta_2 \text{febSD} + \beta_3 \text{augSD} + \epsilon \]

This is a useful teaching result because it shows that model building is not simply about keeping every plausible variable. It is about arriving at a model that is statistically stable and biologically interpretable.

9.6 Test the Assumptions / Check Diagnostics

Before interpreting the model in earnest, we must check whether its assumptions are adequate. Added-variable plots are useful because they show the relationship between the response and each predictor while accounting for the effect of the other predictors.

avPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 2: Partial regression plots for mod1 with the selected variables augMean, febSD, and augSD.

The added-variable plot for augMean shows a particularly strong and tight partial relationship with the response. The febSD and augSD plots show weaker effects with more scatter, but both still contribute.

We can also look at standard diagnostic plots for the model as a whole.

autoplot(mod1, shape = 21, colour = "dodgerblue4",
         smooth.colour = "magenta") +
  theme_grey()
Figure 3: Diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

The model appears broadly acceptable. There may be mild curvature or slight heteroscedasticity, but there is no glaring red flag severe enough to invalidate interpretation. As always, adequacy is a matter of judgement, not perfection.

Component-plus-residual plots provide another useful perspective because they show whether the relationship between the response and each predictor appears reasonably linear in the fitted model.

crPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 4: Component plus residual diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

9.7 Interpret the Results

The fitted model shows that augMean, febSD, and augSD are all positively related to Sørensen dissimilarity in the East Coast Transition Zone once they are considered together. The strongest partial effect is associated with augMean, while febSD and augSD make smaller but still clearly positive contributions.

Each slope in a multiple regression must be read conditionally: it gives the expected change in the response for a one-unit increase in that predictor while the other predictors are held constant. That conditional interpretation is the strength of the model, but it is also where the dangers lie. If predictors overlap strongly, or if one predictor stands in for an omitted biological cause, the apparent partial effect can become difficult to interpret. Chapter 16 returns to that problem directly.

The model fit is strong: the adjusted \(R^2\) is about 0.84, so the selected predictors explain a large proportion of the observed variation in Y. The overall model test provides very strong evidence that the model explains more variation than an intercept-only model (\(p < 0.001\)).

The ANOVA table is useful here because it shows how much variance each predictor adds sequentially to the model. But remember that this sequential interpretation depends on the order in which predictors are entered.
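That order dependence is easy to demonstrate on simulated data (hypothetical values): when predictors are correlated, the sequential (Type I) sum of squares attributed to a predictor shrinks if it is entered after a correlated partner, because the partner has already claimed the shared variance.

```r
# Hypothetical simulation: sequential sums of squares depend on entry order.
set.seed(5)
x1 <- rnorm(80)
x2 <- 0.7 * x1 + rnorm(80, sd = 0.5)
y  <- x1 + x2 + rnorm(80)

anova(lm(y ~ x1 + x2))$"Sum Sq"[1]   # SS for x1 when entered first
anova(lm(y ~ x2 + x1))$"Sum Sq"[2]   # SS for x1 when entered second (smaller)
```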

9.8 Reporting

Note: Write-Up

Methods

Multiple linear regression was used to model mean Sørensen dissimilarity in the East Coast Transition Zone as a function of candidate climate predictors. Collinearity among predictors was assessed before final interpretation, and overlapping predictors were removed to obtain a more stable final model.

Results

In the East Coast Transition Zone, mean Sørensen dissimilarity was well explained by a multiple linear regression containing August mean temperature, February temperature variability, and August temperature variability (adjusted \(R^2 = 0.84\); overall model \(p < 0.001\)). All three retained predictors had positive partial effects, with August mean temperature showing the strongest contribution (\(\beta = 0.283\)), followed by February temperature variability (\(\beta = 0.050\)) and August temperature variability (\(\beta = 0.022\)). Predictors with problematic overlap were removed before final interpretation, so the reported coefficients represent comparatively stable partial relationships.

Discussion

The main discussion point is that the model identifies a small set of comparatively stable climatic predictors, rather than treating every candidate variable as independently meaningful. This is why model specification and collinearity screening matter in multiple regression.

10 Example 2: Categorical Predictors with Distance and Bioregion

Multiple regression can include both continuous and categorical predictors. In R, this is done by entering the categorical variable as a factor. This is important because many ecological questions involve a mixture of environmental gradients and group structure.

For example:

lm(response ~ dist + factor(bio), data = df)

Once categorical predictors are included, multiple regression and ANOVA become part of the same linear-model family. This is why the earlier ANOVA chapter and the present modelling chapters connect so naturally.

10.1 Dummy Variable Coding

When a categorical variable is included in a regression model, R does not fit a separate coefficient for every level plus an intercept. That would be redundant. Instead, it uses dummy variables (also called indicator variables) to represent the levels of the factor.

Suppose a categorical predictor C has three levels: C1, C2, and C3. If C1 is used as the reference level (by default, lm() takes the first factor level in alphanumeric order as the reference), then only two dummy variables are needed:

  • D1 = 1 when the observation belongs to C2, and 0 otherwise;
  • D2 = 1 when the observation belongs to C3, and 0 otherwise.

The reference level C1 is represented by D1 = 0 and D2 = 0. In that case the model can be written as:

\[ Y_i = \alpha + \beta_1 X_i + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \epsilon_i \]

Here:

  • \(\alpha\) is the intercept for the reference category;
  • \(\beta_1\) is the slope of the continuous predictor X for the reference category;
  • \(\gamma_1\) is the difference in intercept between C2 and the reference category C1;
  • \(\gamma_2\) is the difference in intercept between C3 and the reference category C1.

This is the coding that lm() uses by default for factors. The first level of the factor is taken as the reference level unless you change it explicitly with relevel(). The practical implication is that when you see coefficients such as bioBMP or bioECTZ, they do not represent absolute means for those groups. They represent differences from the reference bioregion, holding the continuous predictors constant.
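The effect of relevel() can be verified directly. In this sketch (hypothetical simulated data), changing the reference level changes which differences the coefficients report, but not the fitted model itself.

```r
# Hypothetical simulation: the reference level changes the coefficient
# parameterisation, not the underlying fitted model.
set.seed(9)
d <- data.frame(g = factor(rep(c("C1", "C2", "C3"), each = 10)),
                x = runif(30))
d$y <- as.numeric(d$g) + d$x + rnorm(30, sd = 0.2)

m1 <- lm(y ~ x + g, data = d)                       # reference level: C1
m2 <- lm(y ~ x + relevel(g, ref = "C3"), data = d)  # reference level: C3
all.equal(fitted(m1), fitted(m2))                   # TRUE: identical fits
```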

If an interaction between a continuous predictor and a factor is added, the same reasoning extends to the slopes. The main slope belongs to the reference level, and the interaction terms show how the slope changes in the other levels relative to that reference. That is the natural next step, but it belongs in Chapter 15, where the whole focus is on how those slope differences are interpreted.

10.2 What the Coding Looks Like in Practice

The abstract notation above is necessary, but it becomes much easier once we look at the actual coding used for the seaweed bioregions. We first create one row for each bioregion and then inspect the model matrix that lm() would construct internally.

bio_demo <- tibble(
  bio = factor(c("AMP", "B-ATZ", "BMP", "ECTZ"),
               levels = levels(factor(sw$bio)))
)

bio_model_matrix <- model.matrix(~ bio, data = bio_demo) |>
  as.data.frame() |>
  mutate(bio = bio_demo$bio) |>
  relocate(bio)

bio_model_matrix
    bio (Intercept) bioB-ATZ bioBMP bioECTZ
1   AMP           1        0      0       0
2 B-ATZ           1        1      0       0
3   BMP           1        0      1       0
4  ECTZ           1        0      0       1
Table 1: How the seaweed bioregions are represented by default dummy coding when AMP is the reference level.
bio_model_matrix |>
  pivot_longer(cols = -bio, names_to = "term", values_to = "value") |>
  ggplot(aes(x = term, y = bio, fill = factor(value))) +
  geom_tile(colour = "white") +
  scale_fill_manual(values = c("0" = "grey85", "1" = "steelblue4")) +
  labs(x = NULL, y = "Bioregion", fill = "Value") +
  theme_grey()
Figure 5: Dummy-variable coding for the seaweed bioregions. AMP is the reference level, so its dummy columns are all zero.

This picture makes the bookkeeping explicit. AMP is the reference level, so all the dummy columns are zero there. B-ATZ, BMP, and ECTZ each receive a 1 in their own indicator column, which is why their fitted coefficients must be read as differences from AMP, not as stand-alone group means.

The same logic will later carry directly into interaction models. But before we go there, it is useful to stay with the simpler additive case in which distance has one common slope and the bioregions differ only in their intercepts.

10.3 State the Biological Question

The seaweed dataset includes two additional variables that we have not yet considered closely:

  • dist, the geographic distance between seaweed samples along the coast of South Africa; and
  • bio, the bioregional classification of the samples.

These variables lend themselves to a few interesting questions:

  1. Is the Sørensen dissimilarity of the seaweed flora related to geographic distance?
  2. Does the average Sørensen dissimilarity vary among bioregions?
  3. Is the effect of geographic distance on Sørensen dissimilarity different for each bioregion?

The first two questions are additive-model questions and therefore belong in this chapter. The third question asks whether the slope of distance differs among bioregions, which is an interaction question and is therefore deferred to Chapter 15.

10.4 State the Model and Hypotheses

For the additive model in this chapter, the response can be written as:

\[ Y = \alpha + \beta_1 \text{dist} + \beta_2 \text{bio} + \epsilon \]

Because bio has four levels, the fitted model is actually expanded internally into several dummy variables. With AMP as the reference bioregion, the model is equivalent to:

\[ \begin{aligned} Y = \alpha &+ \beta_1 \text{dist} + \beta_2 \text{bio}_{\text{B-ATZ}} + \beta_3 \text{bio}_{\text{BMP}} + \beta_4 \text{bio}_{\text{ECTZ}} + \epsilon \end{aligned} \]

In words:

  • the intercept and dist slope describe the reference bioregion;
  • the bio coefficients describe how the intercept differs in the other bioregions relative to the reference;
  • the slope of dist is assumed to be the same in all bioregions.

The overall null is that neither the distance effect nor the bioregion effect contributes to explaining the response. At the coefficient level, we therefore ask whether the slope of dist differs from zero and whether the bioregion coefficients differ from the reference level.

10.5 Visualise the Main Effects

p1 <- ggplot(sw, aes(x = dist, y = Y)) +
  geom_point(shape = 1, aes(colour = bio)) +
  geom_smooth(method = "lm", se = TRUE,
              colour = "black", fill = "grey60",
              linewidth = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "Distance (km)",
       y = "Sørensen Dissimilarity") +
  theme_grey()

p2 <- ggplot(sw, aes(x = bio, y = Y)) +
  geom_boxplot(aes(colour = bio)) +
  labs(x = "Bioregional Classification",
       y = "Sørensen Dissimilarity") +
  theme_grey() +
  theme(legend.position = "none")

ggarrange(p1, p2, ncol = 2, labels = "AUTO")
Figure 6: Plot of main effects of A) distance along the coast and B) bioregional classification on the Sørensen dissimilarity index.

The plot of Y against distance already suggests that distance matters. Colouring by bioregion and inspecting the boxplots also suggest that the average level of Y differs among bioregions. The additive model is the first sensible way to combine those two patterns in one regression.

10.6 Fit and Assess Nested Models

mod2a <- lm(Y ~ dist, data = sw)
mod2 <- lm(Y ~ dist + bio, data = sw)

anova(mod2a, mod2, test = "F")
Analysis of Variance Table

Model 1: Y ~ dist
Model 2: Y ~ dist + bio
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    968 7.7388                                  
2    965 4.1156  3    3.6232 283.19 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)

Call:
lm(formula = Y ~ dist + bio, data = sw)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.189964 -0.038724 -0.002277  0.029329  0.256021 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.698e-02  4.547e-03  -5.933 4.15e-09 ***
dist         4.612e-04  1.059e-05  43.533  < 2e-16 ***
bioB-ATZ     7.340e-02  1.368e-02   5.364 1.02e-07 ***
bioBMP       1.258e-02  5.258e-03   2.393   0.0169 *  
bioECTZ      1.385e-01  5.043e-03  27.461  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06531 on 965 degrees of freedom
Multiple R-squared:  0.7453,    Adjusted R-squared:  0.7442 
F-statistic:   706 on 4 and 965 DF,  p-value: < 2.2e-16
anova(mod2)
Analysis of Variance Table

Response: Y
           Df Sum Sq Mean Sq F value    Pr(>F)    
dist        1 8.4199  8.4199 1974.25 < 2.2e-16 ***
bio         3 3.6232  1.2077  283.19 < 2.2e-16 ***
Residuals 965 4.1156  0.0043                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The nested comparison shows that adding the bioregion effect to the distance-only model improves the fit strongly. The residual sum of squares decreases sharply, and the test indicates that the richer additive model provides a substantially better fit (\(p < 0.001\)).
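The F statistic in that table can be reproduced by hand from the residual sums of squares printed above. The sketch below plugs in the rounded RSS values from the output, so the result agrees with the reported 283.19 only up to print rounding:

```r
# Reproduce the nested-model F statistic from the printed ANOVA table:
#   F = ((RSS_reduced - RSS_full) / df_diff) / (RSS_full / df_resid_full)
rss_reduced <- 7.7388   # Y ~ dist        (968 residual df)
rss_full    <- 4.1156   # Y ~ dist + bio  (965 residual df)
df_diff     <- 968 - 965

f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / 965)
round(f_stat, 1)        # agrees with anova()'s 283.19 up to rounding
```

The numerator is the improvement in fit per added parameter; the denominator is the residual mean square of the richer model.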

At this stage we are still fitting the simpler model in which distance has one common slope. That keeps the interpretation aligned with the main goal of Chapter 14: understanding how continuous and categorical predictors coexist in one additive linear model. In Chapter 15, we return to the same biological setting and ask whether allowing the slope of distance to differ among bioregions improves the model further.

10.7 Interpret the Additive Model

The coefficient table must be read with care because bio is represented through dummy-variable coding. The reference bioregion is absorbed into the intercept, and the remaining bioregion coefficients are interpreted relative to that reference.

So, if AMP is the reference level, then:

  • the coefficient for dist is the slope for AMP;
  • the coefficient for bioBMP is the intercept difference between BMP and AMP — a constant vertical offset that applies at every value of dist, not only at dist = 0;
  • the same fitted slope of dist applies across the bioregions in this additive model.

The same reasoning applies to the other bioregions. In this chapter, the important point is that adding a factor to a regression model changes the intercept structure while preserving one shared slope for the continuous predictor.

The broad interpretation is straightforward even if the bookkeeping is not:

  • the effect of distance on Y is strong;
  • the mean level of Y differs among bioregions;
  • and those two sources of structure can be handled within the same additive linear model.

If we now decide that different bioregions may also have different rates at which dissimilarity increases with distance, then we must move to an interaction model. That is the starting point of the next chapter.
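The intercept-shift logic can be made concrete with a small synthetic example. The data, group labels, and effect sizes below are invented purely for illustration; they are not taken from the seaweed analysis:

```r
# Synthetic demo: a factor in an additive lm() shifts intercepts only;
# the slope of the continuous predictor is shared by all groups.
set.seed(42)
dist <- runif(120, 0, 100)
grp  <- factor(rep(c("A", "B", "C"), each = 40))
y    <- 0.5 * dist + c(A = 0, B = 10, C = -5)[as.character(grp)] + rnorm(120)

m <- lm(y ~ dist + grp)
coef(m)
# With treatment coding, "grpB" estimates the B-minus-A offset (about 10)
# and "grpC" the C-minus-A offset (about -5).

# The fitted B-minus-A gap is identical at dist = 0 and dist = 100:
nd0   <- data.frame(dist = 0,   grp = c("A", "B"))
nd100 <- data.frame(dist = 100, grp = c("A", "B"))
gap_diff <- diff(predict(m, nd0)) - diff(predict(m, nd100))
gap_diff  # essentially zero: the offset is constant in an additive model
```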

10.8 Reporting

NoteWrite-Up

Methods

Nested linear models were used to test whether Sørensen dissimilarity was related to geographic distance and whether mean dissimilarity differed among bioregions. A distance-only model was compared with an additive model containing both distance and bioregion.

Results

Sørensen dissimilarity increased strongly with geographic distance, and the additive model showed that mean dissimilarity also differed among bioregions. Adding bioregion to the distance-only model improved fit substantially (\(p < 0.001\)), indicating that both continuous geographic separation and categorical bioregional context contributed to explaining variation in the response. The fitted distance effect was strongly positive, while the bioregion coefficients indicated differences in baseline dissimilarity relative to the reference bioregion.

Discussion

The biologically important point is that compositional turnover is structured jointly by distance and bioregional context. At this stage, however, the model still assumes one common distance effect across the coastline. The next chapter asks whether that simplifying assumption is biologically defensible.

11 Example 3: The Final Additive Model

We can now expand the selected climate model to include bio as an additive predictor alongside augMean, febSD, and augSD. This creates a more complete model in which the climatic effects are estimated jointly with bioregional differences, but still under the additive assumption.

11.1 Fit the Comprehensive Additive Model

sw_sub2 <- sw |>
  dplyr::select(Y, augMean, febSD, augSD, bio)

full_mod3 <- lm(Y ~ augMean + febSD + augSD + bio, data = sw_sub2)
full_mod3a <- lm(Y ~ augMean + febSD + augSD, data = sw_sub2)
null_mod3 <- lm(Y ~ 1, data = sw_sub2)

anova(full_mod3, full_mod3a)
Analysis of Variance Table

Model 1: Y ~ augMean + febSD + augSD + bio
Model 2: Y ~ augMean + febSD + augSD
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    963 4.834                                  
2    966 5.689 -3  -0.85507 56.781 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(full_mod3, full_mod3a)
           df       AIC
full_mod3   8 -2373.840
full_mod3a  5 -2221.852
summary(full_mod3)

Call:
lm(formula = Y ~ augMean + febSD + augSD + bio, data = sw_sub2)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.167665 -0.045280 -0.006433  0.039474  0.288426 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.027394   0.005267   5.201 2.42e-07 ***
augMean      0.281357   0.007892  35.653  < 2e-16 ***
febSD       -0.002945   0.002864  -1.028    0.304    
augSD        0.014463   0.002918   4.956 8.49e-07 ***
bioB-ATZ    -0.012508   0.014915  -0.839    0.402    
bioBMP      -0.034542   0.006385  -5.410 7.96e-08 ***
bioECTZ      0.050082   0.006387   7.842 1.18e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07085 on 963 degrees of freedom
Multiple R-squared:  0.7008,    Adjusted R-squared:  0.699 
F-statistic:   376 on 6 and 963 DF,  p-value: < 2.2e-16
anova(full_mod3)
Analysis of Variance Table

Response: Y
           Df Sum Sq Mean Sq  F value    Pr(>F)    
augMean     1 9.9900  9.9900 1990.165 < 2.2e-16 ***
febSD       1 0.0530  0.0530   10.563  0.001194 ** 
augSD       1 0.4266  0.4266   84.983 < 2.2e-16 ***
bio         3 0.8551  0.2850   56.781 < 2.2e-16 ***
Residuals 963 4.8340  0.0050                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The richer additive model is clearly better than the climate-only model. The AIC is lower, and the nested comparison indicates that adding bioregional structure improves fit strongly even before we allow any climate slopes to vary among bioregions.

The model is now richer than the one we began with because it combines several continuous climate predictors with a categorical regional structure. That does not yet imply that the climate relationships differ among bioregions. It only means that bioregional shifts in baseline response matter and should be accounted for.
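As a reminder of what the AIC column in the comparison above reports: for a fitted lm, AIC equals minus twice the log-likelihood plus twice the number of estimated parameters (including the residual variance). A toy check on a built-in dataset, not the seaweed data:

```r
# Verify the AIC definition on a built-in dataset (mtcars).
m <- lm(mpg ~ wt, data = mtcars)
k <- attr(logLik(m), "df")   # 3 parameters: intercept, slope, residual variance
aic_manual <- -2 * as.numeric(logLik(m)) + 2 * k
all.equal(aic_manual, AIC(m))
```

This is why a lower AIC, as for full_mod3 above, indicates a better trade-off between fit and model complexity.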

11.2 Interpret the Full Additive Model

The summary of full_mod3 shows that:

  • augMean remains a strong predictor of Y;
  • bio contributes additional explanatory structure beyond the climate predictors alone;
  • the overall model explains a substantial proportion of the variation in the response.

So, the outcome is that the relationship between the seaweed dissimilarity index and the selected climate variables can be modelled more convincingly once broad bioregional differences are included. That is a more defensible additive ecological statement than anything we could obtain from a single-predictor regression.

11.3 Reporting

NoteWrite-Up

Methods

A comprehensive additive multiple regression was fitted to evaluate whether mean Sørensen dissimilarity was explained jointly by the selected climate predictors and by bioregional classification. The additive climate-plus-bioregion model was compared with the corresponding climate-only model using information criteria and nested model comparison.

Results

The additive multiple regression explained a large proportion of the variation in mean Sørensen dissimilarity and fit the data substantially better than the corresponding climate-only model (lower AIC; nested model comparison \(p < 0.001\)). August mean temperature remained the dominant positive predictor, and the inclusion of bioregion showed that broad regional differences in baseline dissimilarity persisted even after the selected climate variables had been taken into account.

Discussion

This final additive model supports the ecological interpretation that compositional turnover is shaped both by climate and by broad regional context. If the next biological question is whether the climate slopes themselves differ among bioregions, then the analysis must move beyond additive structure and into the interaction framework developed in Chapter 15.

12 Alternative Categorical Variable Coding Schemes (Contrasts)

Throughout the chapter, the categorical variables have been handled with the default dummy-variable coding. But alternative coding schemes are possible, and they change the way the coefficients are interpreted.

The following synthetic example shows a few different possibilities.

set.seed(123)
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
y <- 2 * x + ifelse(cat_var == "A", 3,
                    ifelse(cat_var == "B", 1,
                           ifelse(cat_var == "C", -1, -3))) + rnorm(n)

data_contr <- data.frame(y, x, cat_var)
head(data_contr)
           y           x cat_var
1  0.6667876 -0.56047565       B
2  1.3086873 -0.23017749       B
3  0.4496192  1.55870831       D
4  2.1326402  0.07050839       A
5 -2.8608771  0.12928774       D
6  0.1497346  1.71506499       D

With default treatment coding, one level becomes the reference category:

contrasts(data_contr$cat_var)
  B C D
A 0 0 0
B 1 0 0
C 0 1 0
D 0 0 1

This means that the coefficient for each non-reference level is interpreted as the difference between that level and the reference level, holding the other predictors constant.
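Fitting the model to the synthetic data confirms this reading. The sketch below regenerates the same data (same seed as above) and refits; with true group effects of 3, 1, -1, and -3, the treatment-coded coefficients should estimate the differences from level A:

```r
# Regenerate the synthetic data from above and fit with treatment coding.
set.seed(123)
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
y <- 2 * x + ifelse(cat_var == "A", 3,
                    ifelse(cat_var == "B", 1,
                           ifelse(cat_var == "C", -1, -3))) + rnorm(n)

m_trt <- lm(y ~ x + cat_var)
coef(m_trt)
# "cat_varB" estimates the true B-minus-A offset (1 - 3 = -2),
# "cat_varC" the C-minus-A offset (-1 - 3 = -4), and
# "cat_varD" the D-minus-A offset (-3 - 3 = -6).
```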

You can change the reference level:

data_contr$cat_var <- relevel(data_contr$cat_var, ref = "C")
contrasts(data_contr$cat_var)
  A B D
C 0 0 0
A 1 0 0
B 0 1 0
D 0 0 1

Or use effect coding:

contrasts(data_contr$cat_var) <- contr.sum(4)
contrasts(data_contr$cat_var)
  [,1] [,2] [,3]
C    1    0    0
A    0    1    0
B    0    0    1
D   -1   -1   -1

The practical lesson is not that students must memorise every contrast scheme. It is that the interpretation of categorical coefficients depends on how the coding is done.
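To see what effect coding buys you, consider a small self-contained example (synthetic data, invented group means). Under sum-to-zero contrasts with a single factor, the intercept is the mean of the group means, and each coefficient is one group's deviation from that mean:

```r
# Effect (sum-to-zero) coding: intercept = mean of group means,
# coefficients = deviations of group means from that grand mean.
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 30))
y <- c(A = 10, B = 12, C = 17)[as.character(g)] + rnorm(90)

contrasts(g) <- contr.sum(3)
m_sum <- lm(y ~ g)

grp_means <- tapply(y, g, mean)
unname(coef(m_sum)[1]) - mean(grp_means)           # zero: intercept is grand mean
unname(coef(m_sum)[2]) - (grp_means["A"] - mean(grp_means))  # zero: deviation of A
```

Under treatment coding, the same model would instead report B-minus-A and C-minus-A differences, so the fitted values are identical but the coefficients answer different questions.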

13 Summary

  • Multiple regression extends simple regression to several predictors.
  • Model specification is not a side issue; it is part of the scientific argument.
  • The meaning of each coefficient depends on the other predictors included in the model.
  • Multicollinearity can make a model statistically unstable and biologically difficult to interpret.
  • Categorical predictors and interaction terms belong naturally within the multiple-regression framework.
  • A useful multiple regression is one that is both statistically adequate and scientifically defensible.

In this chapter, I laid out the modelling sequence from additive models through richer bioregional structures to categorical coding, and pointed ahead to interactions. In the next chapter, I return to the interaction problem directly and develop it in more focused detail.

References

Smit AJ, Bolton JJ, Anderson RJ (2017) Seaweeds in two oceans: Beta-diversity. Frontiers in Marine Science 4:404.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {14. {Multiple} {Regression} and {Model} {Specification}},
  date = {2026-03-22},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/14-multiple-regression-and-model-specification.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 14. Multiple Regression and Model Specification. https://tangledbank.netlify.app/BCB744/basic_stats/14-multiple-regression-and-model-specification.html.