13. Multiple Regression and Model Specification

From Biological Hypotheses to Statistical Models

Author

A. J. Smit

Published

2026/03/19

Note: In This Chapter
  • why simple regression is often not enough;
  • how a multiple regression extends the simple linear model;
  • why model specification is part of the scientific argument;
  • how to think about omitted variables, proxies, and categorical predictors;
  • how to fit and interpret a worked multiple regression example;
  • how interactions and contrasts fit into the broader multiple-regression workflow.
Important: Tasks to Complete in This Chapter
  • None

1 Introduction

In Chapter 11 we saw how to model the relationship between two variables using simple linear regression. In ecosystems, however, the response typically reflects several drivers at once and cannot be adequately captured by a single explanatory variable. In such cases we use multiple linear regression to model the relationship between the response and several explanatory variables simultaneously.

Biological systems are rarely driven by one variable alone. A response such as growth, abundance, diversity, or disease prevalence usually reflects the combined influence of several environmental, physiological, or experimental factors. Multiple regression is therefore the natural extension of simple regression to a more realistic ecological setting.

But adding more predictors creates a much more demanding scientific problem: which predictors belong in the model, and why? This chapter therefore combines two topics that should not be separated:

  1. multiple regression as a modelling framework; and
  2. model specification as the translation of biological ideas into statistical structure.

2 Key Concepts

The chapter turns on the following ideas.

  • Multiple regression models a response using several predictors at once.
  • Holding other predictors constant is what gives partial coefficients their meaning.
  • Model specification translates biological reasoning into a statistical formula.
  • Omitted variables, poor functional form, and unjustified interactions can distort inference.
  • Predictor choice should be theory-driven rather than data-convenience driven.
  • Categorical predictors are part of the same linear-model family and can be handled naturally in multiple regression.

3 Nature of the Data and Assumptions

The same broad assumptions that apply in simple linear regression apply here too, but now in a multivariable setting.

  • Response variable: the response should usually be continuous.
  • Predictors: predictor variables are often continuous, but they may also be categorical.
  • Independence: the observations must be independent.
  • Approximate linearity: the response should relate approximately linearly to each predictor on the scale used in the model.
  • Residual assumptions: residuals should be approximately normal with roughly constant variance.
  • Limited multicollinearity: predictors should not overlap so strongly that their coefficients become unstable or uninterpretable.

The additional difficulty in multiple regression is that the model must be judged as a fitted statistical object and also as a biological specification. Even if the residual assumptions are acceptable, the model can still be weak because the variables are badly chosen or badly motivated. We return to the interpretive problems created by multicollinearity, confounding, and measurement error in Chapter 15.

4 Outliers

As in simple regression, outliers can have strong effects on coefficient estimation, standard errors, and interpretation. In multiple regression, however, unusual observations may be unusual in multivariate space even when they do not look extreme on any single predictor alone. So, diagnostic plots and leverage measures become especially important once we fit the model.
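This idea can be made concrete with a small simulation (hypothetical data, not the seaweed set used later): a point whose coordinates are each unremarkable on their own can still sit far from the joint pattern of two correlated predictors, and leverage measures pick this up.

```r
# Simulated example (not the chapter's data): two correlated predictors with
# one point that is within range on each axis alone but extreme jointly.
set.seed(13)
x1 <- rnorm(50)
x2 <- 0.9 * x1 + rnorm(50, sd = 0.2)   # x1 and x2 track each other closely
# Append a point whose x1 and x2 are each within range, but whose
# combination (high x1, low x2) breaks the joint pattern:
x1 <- c(x1, 1.5)
x2 <- c(x2, -1.5)
y  <- 2 + x1 + x2 + rnorm(51, sd = 0.3)

mod <- lm(y ~ x1 + x2)
which.max(hatvalues(mod))   # leverage: the appended 51st point dominates
cooks.distance(mod)[51]     # its influence on the fitted coefficients
```

Neither `x1[51]` nor `x2[51]` would look extreme in a univariate summary, yet the point has by far the largest leverage because it violates the joint structure of the predictors.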

5 R Function

The lm() function in R is used to fit a multiple linear regression model. The syntax is similar to that of the lm() function used for simple linear regression, but now with several predictor variables.

The function takes the basic form:

lm(formula, data)

For a model with only continuous predictors:

lm(response ~ predictor1 + predictor2 + predictor3,
   data = dataset)

Interaction effects are specified with the * operator, which expands to both main effects plus their interaction (i.e. predictor1 + predictor2 + predictor1:predictor2):

lm(response ~ predictor1 * predictor2, data = dataset)

When we have both continuous and categorical predictors, the formula is simply extended:

lm(response ~ continuous_predictor1 + continuous_predictor2 +
     factor(categorical_predictor),
   data = dataset)

R automatically handles the dummy-variable coding of factors. This is one reason multiple regression and ANOVA are so closely connected.

6 Why Model Specification Matters

In simple regression, the modelling decision is often obvious: one response, one predictor. In multiple regression, the structure of the model becomes part of the argument. The important questions are:

  • Which variables represent the hypothesised process?
  • Which variables are potential confounders that must be controlled?
  • Which variables are only proxies for a harder-to-measure mechanism?
  • Should the relationship be linear, transformed, or nonlinear?
  • Does the effect of one predictor depend on another?

Model specification therefore comes before model fitting. If the biological logic is weak, the fitted model is likely to be weak as well. The practical consequences of confounders, proxies, and measurement error are developed in much more detail in Chapter 15.
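The functional-form question can be made concrete with a small simulated sketch (hypothetical variables temp and depth, not from the chapter's data) in which the true temperature effect is logarithmic; comparing two candidate specifications with AIC rewards the one that matches the underlying process.

```r
# Simulated sketch: the true temperature effect is logarithmic.
set.seed(11)
d <- data.frame(temp = runif(150, 5, 25), depth = runif(150, 0, 50))
d$y <- 2 + 1.5 * log(d$temp) + 0.05 * d$depth + rnorm(150, sd = 0.2)

m_lin <- lm(y ~ temp + depth, data = d)       # temperature entered linearly
m_log <- lm(y ~ log(temp) + depth, data = d)  # logarithmic functional form
AIC(m_lin, m_log)   # the better-specified log model should score lower here
```

The point is not that transformations always help, but that the choice between them is a specification decision that should be motivated before fitting, then checked against the data.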

7 Example 1: A Climate Model for the East Coast Transition Zone

The main worked example in this chapter uses the seaweed data produced in the analysis by Smit et al. (2017). The same dataset is used repeatedly in BCB_Stats, and it is well suited to this chapter because it contains:

  • a continuous response variable, the mean Sørensen dissimilarity Y;
  • several continuous climatological predictors;
  • and a categorical predictor, the bioregional classification bio.
sw <- read.csv("../../data/BCB743/seaweed/spp_df2.csv")
rbind(head(sw, 3), tail(sw, 3))[,-1]
R>        dist  bio    augMean   febRange      febSD     augSD    annMean
R> 1     0.000  BMP 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
R> 2    51.138  BMP 0.05741369 0.09884404 0.16295271 0.3132800 0.01501846
R> 3   104.443  BMP 0.15043904 0.34887754 0.09934163 0.4188239 0.02602247
R> 968 102.649 ECTZ 0.41496099 0.11330069 0.24304493 0.7538546 0.52278161
R> 969  49.912 ECTZ 0.17194242 0.05756093 0.18196664 0.3604341 0.24445006
R> 970   0.000 ECTZ 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
R>               Y        Y1          Y2
R> 1   0.000000000 0.0000000 0.000000000
R> 2   0.003610108 0.0000000 0.003610108
R> 3   0.003610108 0.0000000 0.003610108
R> 968 0.198728140 0.1948882 0.003839961
R> 969 0.069337442 0.0443038 0.025033645
R> 970 0.000000000 0.0000000 0.000000000

We begin by focusing on the East Coast Transition Zone (ECTZ) and asking which combination of climate variables best explains variation in Y.

sw_ectz <- sw |>
  filter(bio == "ECTZ")

sw_sub1 <- sw_ectz[, c("Y", "augMean", "febRange", "febSD", "augSD", "annMean")]

8 Do an Exploratory Data Analysis (EDA)

We first inspect the scale and range of the variables in the subset we intend to model.

summary(sw_sub1)
R>        Y              augMean          febRange          febSD       
R>  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
R>  1st Qu.:0.09615   1st Qu.:0.1817   1st Qu.:0.1866   1st Qu.:0.1820  
R>  Median :0.22779   Median :0.4686   Median :0.7531   Median :0.6256  
R>  Mean   :0.24393   Mean   :0.5313   Mean   :0.8329   Mean   :0.7209  
R>  3rd Qu.:0.35690   3rd Qu.:0.7927   3rd Qu.:1.3896   3rd Qu.:1.2072  
R>  Max.   :0.63754   Max.   :1.7508   Max.   :2.2704   Max.   :2.1249  
R>      augSD           annMean      
R>  Min.   :0.0000   Min.   :0.0000  
R>  1st Qu.:0.4178   1st Qu.:0.1874  
R>  Median :1.0454   Median :0.5228  
R>  Mean   :1.3207   Mean   :0.5722  
R>  3rd Qu.:2.0924   3rd Qu.:0.8085  
R>  Max.   :4.5267   Max.   :1.8941

Before fitting a multiple regression, it is also useful to look at each predictor individually against the response because it provides intuition about the data structure before we begin combining effects.

predictors <- c("augMean", "febRange", "febSD", "augSD", "annMean")

models <- map(predictors, ~ lm(as.formula(paste("Y ~", .x)),
                               data = sw_ectz))

plts1 <- map(predictors, function(predictor) {
  ggplot(sw_ectz, aes(x = .data[[predictor]], y = Y)) +
    geom_point(shape = 1, colour = "dodgerblue4") +
    geom_smooth(method = "lm", col = "magenta", fill = "pink") +
    labs(title = paste("Y vs", predictor),
         x = predictor,
         y = "Y") +
    theme_grey()
})

ggpubr::ggarrange(plotlist = plts1, ncol = 2,
                  nrow = 3, labels = "AUTO")
Figure 1: Individual simple linear regressions fitted to the variables augMean, febRange, febSD, augSD, and annMean as predictors of the seaweed species composition as measured by the Sørensen dissimilarity, Y.

The individual plots show that each predictor appears to relate positively and approximately linearly to the response when considered alone. But those one-predictor views do not tell us how the predictors behave once they are considered together, and they do not tell us whether some predictors overlap so strongly that their coefficients become unstable.

9 State the Hypotheses

The null hypothesis for the overall model is that there is no relationship between the response Y and any of the climatological predictors, i.e. that all slope coefficients are zero:

\[H_{0}: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\]

The alternative is that at least one coefficient differs from zero:

\[H_{a}: \exists \, \beta_i \ne 0 \text{ for at least one } i\]

At the main-effects level, each predictor has its own hypothesis:

\[H_{0}: \beta_i = 0\] \[H_{a}: \beta_i \ne 0\]

However, as soon as several predictors are included in the model, each coefficient becomes a partial effect. Its interpretation depends on the presence of the others.
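What "partial" means can be seen in a small simulation (hypothetical predictors, not the chapter's data): the slope of x1 fitted on its own absorbs part of the effect of a correlated predictor x2, and moves back towards its true value once x2 enters the model.

```r
# Simulated sketch: x1's true partial effect is 0.5, but x2 (correlated
# with x1) has a true effect of 1.0.
set.seed(1)
x1 <- rnorm(200)
x2 <- 0.7 * x1 + rnorm(200, sd = 0.5)   # x2 correlated with x1
y  <- 1 + 0.5 * x1 + 1.0 * x2 + rnorm(200, sd = 0.4)

coef(lm(y ~ x1))["x1"]        # marginal slope: absorbs part of x2's effect
coef(lm(y ~ x1 + x2))["x1"]   # partial slope, holding x2 constant
```

The marginal slope is inflated well above 0.5 because x1 stands in for the omitted x2; the partial slope recovers the value used in the simulation.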

10 Fit the Initial Model

We first fit a full model containing the five climate predictors, and a null model containing only the intercept.

full_mod1 <- lm(Y ~ augMean + febRange + febSD + augSD + annMean,
                data = sw_sub1)
null_mod1 <- lm(Y ~ 1, data = sw_sub1)

summary(full_mod1)
R> 
R> Call:
R> lm(formula = Y ~ augMean + febRange + febSD + augSD + annMean, 
R>     data = sw_sub1)
R> 
R> Residuals:
R>       Min        1Q    Median        3Q       Max 
R> -0.131229 -0.042352 -0.003415  0.038601  0.144128 
R> 
R> Coefficients:
R>              Estimate Std. Error t value Pr(>|t|)    
R> (Intercept)  0.044886   0.006306   7.118 9.04e-12 ***
R> augMean     -0.079925   0.042630  -1.875 0.061841 .  
R> febRange     0.112913   0.015949   7.080 1.14e-11 ***
R> febSD       -0.057174   0.016554  -3.454 0.000637 ***
R> augSD        0.003024   0.004886   0.619 0.536421    
R> annMean      0.322776   0.041615   7.756 1.59e-13 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 0.05713 on 283 degrees of freedom
R> Multiple R-squared:  0.8804, Adjusted R-squared:  0.8782 
R> F-statistic: 416.5 on 5 and 283 DF,  p-value: < 2.2e-16

The full model fits strongly, but the coefficient table already hints at a practical problem: some variables that looked useful on their own are no longer clearly interpretable once the others are included. This is a common sign of multicollinearity.

11 Check for Multicollinearity

Some of the predictors may be highly correlated with each other, and this can make the model less precise and harder to interpret because the coefficients’ standard errors become inflated. A formal way to detect multicollinearity is to calculate the variance inflation factor (VIF) for each predictor.
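The VIF for predictor j is 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A self-contained sketch with simulated predictors (hypothetical p1, p2, p3, not the chapter's variables) shows how near-duplication inflates the value:

```r
# Simulated sketch of the VIF calculation by hand.
set.seed(2)
p1 <- rnorm(100)
p2 <- 0.95 * p1 + rnorm(100, sd = 0.3)   # p2 nearly duplicates p1
p3 <- rnorm(100)

r2_p1 <- summary(lm(p1 ~ p2 + p3))$r.squared  # how well the others predict p1
1 / (1 - r2_p1)                               # a large VIF for p1
```

A common rule of thumb treats VIF values above about 5 to 10 as a warning sign that the coefficient for that predictor is poorly pinned down.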

vif(full_mod1)
R>   augMean  febRange     febSD     augSD   annMean 
R> 27.947767 10.806635  8.765732  2.497739 31.061900

The VIF values show that some predictors overlap too strongly for comfortable interpretation. We therefore simplify the candidate set before asking which predictors belong in the final model.

This is a useful practical response to collinearity, but it is not the whole conceptual story. Chapter 15 explains why overlapping predictors create interpretive instability, how that differs from confounding, and why measurement error can produce similar practical difficulties. For now, we prune iteratively: drop the predictor with the highest VIF, refit, and repeat until every remaining VIF falls below a threshold of 10.

threshold <- 10
vif_mod <- full_mod1
sw_vif <- sw_sub1

while (TRUE) {
  vif_values <- vif(vif_mod)
  max_vif <- max(vif_values)

  if (max_vif > threshold) {
    high_vif_var <- names(which.max(vif_values))
    sw_vif <- sw_vif[, !(names(sw_vif) %in% high_vif_var)]
    vif_mod <- lm(Y ~ ., data = sw_vif)
  } else {
    break
  }
}

summary(vif_mod)
R> 
R> Call:
R> lm(formula = Y ~ ., data = sw_vif)
R> 
R> Residuals:
R>       Min        1Q    Median        3Q       Max 
R> -0.153994 -0.049229 -0.006086  0.045947  0.148579 
R> 
R> Coefficients:
R>             Estimate Std. Error t value Pr(>|t|)    
R> (Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
R> augMean     0.283335   0.011131  25.455  < 2e-16 ***
R> febSD       0.049639   0.008370   5.930 8.73e-09 ***
R> augSD       0.022150   0.004503   4.919 1.47e-06 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 0.06609 on 285 degrees of freedom
R> Multiple R-squared:  0.8387, Adjusted R-squared:  0.837 
R> F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16
vif(vif_mod)
R>  augMean    febSD    augSD 
R> 1.423601 1.674397 1.585055

This process leaves a much more interpretable candidate model. The retained predictors are augMean, febSD, and augSD, and the VIF values are now comfortably low.

Regularisation techniques such as ridge regression, lasso regression, or elastic net can also be used to deal with multicollinearity. These are discussed later in Chapter 24.

12 Fit the Selected Model

It may be that not all variables retained after the VIF screen are needed to explain the response. We can therefore use forward selection to identify the most parsimonious model within this reduced candidate set.

mod1 <- stepAIC(
  lm(Y ~ 1, data = sw_vif),
  scope = list(lower = lm(Y ~ 1, data = sw_vif), upper = vif_mod),
  direction = "forward",
  trace = FALSE
)

formula(mod1)
R> Y ~ augMean + febSD + augSD
summary(mod1)
R> 
R> Call:
R> lm(formula = Y ~ augMean + febSD + augSD, data = sw_vif)
R> 
R> Residuals:
R>       Min        1Q    Median        3Q       Max 
R> -0.153994 -0.049229 -0.006086  0.045947  0.148579 
R> 
R> Coefficients:
R>             Estimate Std. Error t value Pr(>|t|)    
R> (Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
R> augMean     0.283335   0.011131  25.455  < 2e-16 ***
R> febSD       0.049639   0.008370   5.930 8.73e-09 ***
R> augSD       0.022150   0.004503   4.919 1.47e-06 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 0.06609 on 285 degrees of freedom
R> Multiple R-squared:  0.8387, Adjusted R-squared:  0.837 
R> F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16
anova(mod1)
R> Analysis of Variance Table
R> 
R> Response: Y
R>            Df Sum Sq Mean Sq  F value    Pr(>F)    
R> augMean     1 6.0084  6.0084 1375.660 < 2.2e-16 ***
R> febSD       1 0.3604  0.3604   82.507 < 2.2e-16 ***
R> augSD       1 0.1057  0.1057   24.196 1.473e-06 ***
R> Residuals 285 1.2448  0.0044                       
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The selected model is:

\[ Y = \alpha + \beta_1 \text{augMean} + \beta_2 \text{febSD} + \beta_3 \text{augSD} + \epsilon \]

This is a useful teaching result because it shows that model building is not simply about keeping every plausible variable. It is about arriving at a model that is statistically stable and biologically interpretable.

13 Test the Assumptions / Check Diagnostics

Before interpreting the model in earnest, we must check whether its assumptions are adequate. Added-variable plots are useful because they show the relationship between the response and each predictor while accounting for the effect of the other predictors.

avPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 2: Partial regression plots for mod1 with the selected variables augMean, febSD, and augSD.

The added-variable plot for augMean shows a particularly strong and tight partial relationship with the response. The febSD and augSD plots show weaker effects with more scatter, but both still contribute.

We can also look at standard diagnostic plots for the model as a whole.

autoplot(mod1, shape = 21, colour = "dodgerblue4",
         smooth.colour = "magenta") +
  theme_grey()
Figure 3: Diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

The model appears broadly acceptable. There may be mild curvature or slight heteroscedasticity, but there is no glaring red flag severe enough to invalidate interpretation. As always, adequacy is a matter of judgement, not perfection.

Component-plus-residual plots provide another useful perspective because they show whether the relationship between the response and each predictor appears reasonably linear in the fitted model.

crPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 4: Component plus residual diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

14 Interpret the Results

The fitted model shows that augMean, febSD, and augSD are all positively related to Sørensen dissimilarity in the East Coast Transition Zone once they are considered together. The strongest partial effect is associated with augMean, while febSD and augSD make smaller but still clearly positive contributions.

It is important to interpret the coefficients correctly. In a multiple regression, each slope coefficient represents the expected change in the response variable for a one-unit increase in the predictor variable, holding the other predictors constant. This is what gives the model its interpretive power. It is also why collinearity and confounding matter so much: if the predictors overlap strongly or stand in for other unmeasured causes, the apparent partial effect can become difficult to interpret. Chapter 15 unpacks these issues directly.

The model fit is strong: the adjusted \(R^2\) is about 0.84, so the selected predictors explain a large proportion of the observed variation in Y. The overall model test provides very strong evidence that the model explains more variation than an intercept-only model (\(p < 0.001\)).

The ANOVA table is useful here because it shows how much variance each predictor adds sequentially to the model. But remember that this sequential interpretation depends on the order in which predictors are entered.
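The order dependence is easy to demonstrate with two correlated simulated predictors (hypothetical a and b, not the chapter's variables): the sequential (Type I) sum of squares credited to a is large when it enters first and much smaller when it enters after b.

```r
# Simulated sketch: sequential sums of squares depend on entry order
# when predictors are correlated.
set.seed(7)
a <- rnorm(100)
b <- 0.8 * a + rnorm(100, sd = 0.6)   # a and b are correlated
y <- a + b + rnorm(100)

anova(lm(y ~ a + b))["a", "Sum Sq"]   # a first: credited with shared variance
anova(lm(y ~ b + a))["a", "Sum Sq"]   # a second: only its unique contribution
```

With uncorrelated predictors the two orderings would agree; the gap between them is another face of the overlap that VIF measures.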

15 Reporting

When reporting a multiple regression, include:

  • the response variable and the retained predictors;
  • the fitted coefficients and their direction;
  • the overall fit of the model;
  • the logic used to deal with multicollinearity or redundant predictors;
  • a biological interpretation of the retained effects.

For example:

In the East Coast Transition Zone, mean Sørensen dissimilarity was well explained by a multiple linear regression containing August mean temperature, February temperature variability, and August temperature variability (adjusted \(R^2 = 0.84\); overall model \(p < 0.001\)). All three retained predictors had positive partial effects, with August mean temperature showing the strongest contribution (\(\beta = 0.283\)), followed by February temperature variability (\(\beta = 0.050\)) and August temperature variability (\(\beta = 0.022\)). Predictors with problematic overlap were removed before final interpretation, so the reported coefficients represent comparatively stable partial relationships.

16 Categorical Predictors and Model Specification

Multiple regression can include both continuous and categorical predictors. In R, this is done by entering the categorical variable as a factor. This matters because many ecological questions involve a mixture of environmental gradients and group structure.

For example:

lm(response ~ dist + factor(bio), data = df)

Once categorical predictors are included, multiple regression and ANOVA become part of the same linear-model family. This is why the earlier ANOVA chapter and the present modelling chapters connect so naturally.

16.1 Dummy variable coding

When a categorical variable is included in a regression model, R does not fit a separate coefficient for every level plus an intercept. That would be redundant. Instead, it uses dummy variables (also called indicator variables) to represent the levels of the factor.

Suppose a categorical predictor C has three levels: C1, C2, and C3. If C1 is used as the reference level (by default, R takes the first factor level, with levels ordered alphabetically), then only two dummy variables are needed:

  • D1 = 1 when the observation belongs to C2, and 0 otherwise;
  • D2 = 1 when the observation belongs to C3, and 0 otherwise.

The reference level C1 is represented by D1 = 0 and D2 = 0. In that case the model can be written as:

\[ Y_i = \alpha + \beta_1 X_i + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \epsilon_i \]

Here:

  • \(\alpha\) is the intercept for the reference category;
  • \(\beta_1\) is the slope of the continuous predictor X for the reference category;
  • \(\gamma_1\) is the difference in intercept between C2 and the reference category C1;
  • \(\gamma_2\) is the difference in intercept between C3 and the reference category C1.

This is the coding that lm() uses by default for factors. The first level of the factor is taken as the reference level unless you change it explicitly with relevel(). The practical implication is that when you see coefficients such as bioBMP or bioECTZ, they do not represent absolute means for those groups. They represent differences from the reference bioregion, holding the continuous predictors constant.

If an interaction between a continuous predictor and a factor is added, the same reasoning extends to the slopes. The main slope belongs to the reference level, and the interaction terms show how the slope changes in the other levels relative to that reference. You need to understand this to interpret the dist * bio example below.
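The coding itself can be inspected directly with model.matrix(), which shows the design matrix that lm() builds internally (a toy factor here, not the chapter's bio variable):

```r
# A three-level toy factor; C1 is the reference because it sorts first.
grp <- factor(c("C1", "C2", "C3", "C1"))
model.matrix(~ grp)   # columns: (Intercept), grpC2, grpC3; C1 rows have 0, 0
levels(relevel(grp, ref = "C2"))   # relevel() makes C2 the reference instead
```

Rows belonging to C1 carry zeros in both dummy columns, so the intercept alone describes that group, exactly as in the equation above.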

17 Example 2: Interaction of Distance and Bioregion

The seaweed dataset includes two additional variables that we have not yet considered closely:

  • dist, the geographic distance between seaweed samples along the coast of South Africa; and
  • bio, the bioregional classification of the samples.

These variables lend themselves to a few interesting questions:

  1. Is the Sørensen dissimilarity of the seaweed flora related to geographic distance?
  2. Does the average Sørensen dissimilarity vary among bioregions?
  3. Is the effect of geographic distance on Sørensen dissimilarity different for each bioregion?

The third question requires that we consider possible interaction between the drivers, geographic distance and bioregion.

17.1 State the model and hypotheses

The full model can be written as:

\[ Y = \alpha + \beta_1 \text{dist} + \beta_2 \text{bio} + \beta_3 (\text{dist} \times \text{bio}) + \epsilon \]

Because bio has four levels, the fitted model is actually expanded internally into several dummy variables and their interactions. With AMP as the reference bioregion, the model is equivalent to:

\[ \begin{aligned} Y = \alpha &+ \beta_1 \text{dist} + \beta_2 \text{bio}_{\text{B-ATZ}} + \beta_3 \text{bio}_{\text{BMP}} + \beta_4 \text{bio}_{\text{ECTZ}} \\ &+ \beta_5 (\text{dist} \times \text{bio}_{\text{B-ATZ}}) + \beta_6 (\text{dist} \times \text{bio}_{\text{BMP}}) + \beta_7 (\text{dist} \times \text{bio}_{\text{ECTZ}}) + \epsilon \end{aligned} \]

In words:

  • the intercept and dist slope describe the reference bioregion;
  • the bio coefficients describe how the intercept differs in the other bioregions relative to the reference;
  • the interaction terms describe how the slope of dist differs in the other bioregions relative to the reference.

The overall null is that neither the main effects nor the interaction contributes to explaining the response. The interaction-specific null is that the effect of distance does not differ among bioregions.

17.2 Visualise the main effects

p1 <- ggplot(sw, aes(x = dist, y = Y)) +
  geom_point(shape = 1, aes(colour = bio)) +
  geom_smooth(method = "lm", se = TRUE,
              colour = "black", fill = "grey60",
              linewidth = 0.4) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "Distance (km)",
       y = "Sørensen Dissimilarity") +
  theme_grey()

p2 <- ggplot(sw, aes(x = bio, y = Y)) +
  geom_boxplot(aes(colour = bio)) +
  labs(x = "Bioregional Classification",
       y = "Sørensen Dissimilarity") +
  theme_grey() +
  theme(legend.position = "none")

ggarrange(p1, p2, ncol = 2, labels = "AUTO")
Figure 5: Plot of main effects of A) distance along the coast and B) bioregional classification on the Sørensen dissimilarity index.

The plot of Y against distance already suggests that one slope for all bioregions may be too simple. Colouring by bioregion reveals hidden structure. The boxplots also suggest that the average values of Y differ among bioregions.

17.3 Fit and assess nested models

mod2a <- lm(Y ~ dist, data = sw)
mod2 <- lm(Y ~ dist * bio, data = sw)

anova(mod2a, mod2, test = "F")
R> Analysis of Variance Table
R> 
R> Model 1: Y ~ dist
R> Model 2: Y ~ dist * bio
R>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
R> 1    968 7.7388                                  
R> 2    962 2.2507  6    5.4881 390.95 < 2.2e-16 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)
R> 
R> Call:
R> lm(formula = Y ~ dist * bio, data = sw)
R> 
R> Residuals:
R>       Min        1Q    Median        3Q       Max 
R> -0.112117 -0.030176 -0.004195  0.023698  0.233520 
R> 
R> Coefficients:
R>                 Estimate Std. Error t value Pr(>|t|)    
R> (Intercept)    5.341e-03  4.177e-03   1.279   0.2013    
R> dist           3.530e-04  1.140e-05  30.958  < 2e-16 ***
R> bioB-ATZ      -6.140e-03  1.659e-02  -0.370   0.7114    
R> bioBMP         3.820e-02  6.659e-03   5.737 1.29e-08 ***
R> bioECTZ        1.629e-02  6.447e-03   2.527   0.0117 *  
R> dist:bioB-ATZ  7.976e-04  1.875e-04   4.255 2.30e-05 ***
R> dist:bioBMP   -1.285e-04  2.065e-05  -6.222 7.31e-10 ***
R> dist:bioECTZ   4.213e-04  1.801e-05  23.392  < 2e-16 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 0.04837 on 962 degrees of freedom
R> Multiple R-squared:  0.8607, Adjusted R-squared:  0.8597 
R> F-statistic: 849.2 on 7 and 962 DF,  p-value: < 2.2e-16
anova(mod2)
R> Analysis of Variance Table
R> 
R> Response: Y
R>            Df Sum Sq Mean Sq F value    Pr(>F)    
R> dist        1 8.4199  8.4199 3598.79 < 2.2e-16 ***
R> bio         3 3.6232  1.2077  516.21 < 2.2e-16 ***
R> dist:bio    3 1.8648  0.6216  265.69 < 2.2e-16 ***
R> Residuals 962 2.2507  0.0023                      
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The nested comparison shows that adding the bioregion effect and its interaction with distance improves the model strongly. The residual sum of squares decreases sharply, and the test indicates that the richer model provides a substantially better fit (\(p < 0.001\)).

This chapter does not explain the full interaction table, because that is the focus of Chapter 14. But the present example is important because it shows that interactions are not an exotic extra. They arise naturally once multiple regression allows both continuous and categorical predictors.

17.4 Interpret the interaction model

The coefficient table must be read with care because bio is represented through dummy-variable coding. The reference bioregion is absorbed into the intercept, and the remaining bioregion coefficients are interpreted relative to that reference. The interaction terms then tell us how the slope of dist differs among the other bioregions relative to the reference.

So, if AMP is the reference level, then:

  • the coefficient for dist is the slope for AMP;
  • the coefficient for bioBMP is the difference in intercept between BMP and AMP when dist = 0;
  • the coefficient for dist:bioBMP is the amount by which the BMP slope differs from the AMP slope.

The same logic applies to the other bioregions. This is why the main effect of a factor and the interaction terms should never be interpreted in isolation. They are part of one combined coding system.

The broad interpretation is straightforward even if the bookkeeping is not:

  • the effect of distance on Y is strong;
  • the mean level of Y differs among bioregions;
  • and, importantly, the slope of distance is not the same in all bioregions.

That last point is the biologically important one. Different bioregions show different rates at which Sørensen dissimilarity increases with distance.
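Recovering a per-group slope from a coefficient table like this is simple arithmetic: the reference slope plus that group's interaction term. A self-contained sketch (simulated distances and three hypothetical groups mirroring the dist * bio structure, with known slopes) shows the bookkeeping:

```r
# Simulated sketch: three groups with different true slopes of Y on dist.
set.seed(3)
d <- data.frame(
  dist = runif(120, 0, 100),
  bio  = factor(rep(c("AMP", "BMP", "ECTZ"), each = 40))
)
true_slopes <- c(AMP = 0.010, BMP = 0.005, ECTZ = 0.020)
d$Y <- 0.05 + true_slopes[as.character(d$bio)] * d$dist + rnorm(120, sd = 0.02)

m <- lm(Y ~ dist * bio, data = d)
b <- coef(m)
b[["dist"]]                         # slope for the reference level (AMP)
b[["dist"]] + b[["dist:bioBMP"]]    # slope for BMP
b[["dist"]] + b[["dist:bioECTZ"]]   # slope for ECTZ
```

Each sum recovers the slope used in the simulation, which is why interaction coefficients should never be read as slopes in their own right.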

A journal-style Results statement for this analysis might read as follows:

Sørensen dissimilarity increased strongly with geographic distance, but the rate of increase differed among bioregions in the fitted interaction model. Adding bioregional main effects and the dist × bio interaction improved model fit substantially relative to a single-slope distance model (\(F = 390.95\), \(p < 0.001\)), and the interaction model explained a large proportion of the variation in dissimilarity (\(R^2 = 0.86\); adjusted \(R^2 = 0.86\)). The baseline distance effect was strongly positive (\(\beta = 3.53 \times 10^{-4}\), \(p < 0.001\)), but this slope was steeper in some bioregions and shallower in others, confirming that the relationship between geographic separation and compositional turnover is not spatially uniform along the South African coast.

18 Example 3: A Richer Final Model

We can now expand the selected climate model to include bio as a predictor alongside augMean, febSD, and augSD. This creates a richer model in which the continuous climate effects may differ among bioregions.

sw_sub2 <- sw |>
  dplyr::select(Y, augMean, febSD, augSD, bio)

full_mod3 <- lm(Y ~ (augMean + febSD + augSD) * bio, data = sw_sub2)
full_mod3a <- lm(Y ~ augMean + febSD + augSD, data = sw_sub2)
null_mod3 <- lm(Y ~ 1, data = sw_sub2)

anova(full_mod3, full_mod3a)
R> Analysis of Variance Table
R> 
R> Model 1: Y ~ (augMean + febSD + augSD) * bio
R> Model 2: Y ~ augMean + febSD + augSD
R>   Res.Df    RSS  Df Sum of Sq      F    Pr(>F)    
R> 1    954 3.5603                                   
R> 2    966 5.6890 -12   -2.1288 47.535 < 2.2e-16 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(full_mod3, full_mod3a)
R>            df       AIC
R> full_mod3  17 -2652.498
R> full_mod3a  5 -2221.852
summary(full_mod3)
R> 
R> Call:
R> lm(formula = Y ~ (augMean + febSD + augSD) * bio, data = sw_sub2)
R> 
R> Residuals:
R>      Min       1Q   Median       3Q      Max 
R> -0.15399 -0.03841 -0.01475  0.03464  0.24051 
R> 
R> Coefficients:
R>                    Estimate Std. Error t value Pr(>|t|)    
R> (Intercept)       0.0299094  0.0062756   4.766 2.17e-06 ***
R> augMean           0.3441099  0.0158575  21.700  < 2e-16 ***
R> febSD            -0.0006481  0.0027954  -0.232 0.816706    
R> augSD            -0.0059012  0.0034011  -1.735 0.083044 .  
R> bioB-ATZ         -0.0459611  0.0242519  -1.895 0.058374 .  
R> bioBMP            0.0160756  0.0100749   1.596 0.110906    
R> bioECTZ          -0.0015444  0.0090275  -0.171 0.864197    
R> augMean:bioB-ATZ -0.0461775  0.0874044  -0.528 0.597400    
R> augMean:bioBMP   -0.2406297  0.0211404 -11.382  < 2e-16 ***
R> augMean:bioECTZ  -0.0607745  0.0189030  -3.215 0.001348 ** 
R> febSD:bioB-ATZ    0.0409425  0.0818927   0.500 0.617223    
R> febSD:bioBMP      0.0056433  0.0150126   0.376 0.707070    
R> febSD:bioECTZ     0.0502867  0.0082266   6.113 1.43e-09 ***
R> augSD:bioB-ATZ    0.0655983  0.0371033   1.768 0.077382 .  
R> augSD:bioBMP      0.0410220  0.0114706   3.576 0.000366 ***
R> augSD:bioECTZ     0.0280513  0.0053752   5.219 2.21e-07 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 0.06109 on 954 degrees of freedom
R> Multiple R-squared:  0.7797, Adjusted R-squared:  0.7762 
R> F-statistic: 225.1 on 15 and 954 DF,  p-value: < 2.2e-16
anova(full_mod3)
R> Analysis of Variance Table
R> 
R> Response: Y
R>              Df Sum Sq Mean Sq  F value    Pr(>F)    
R> augMean       1 9.9900  9.9900 2676.902 < 2.2e-16 ***
R> febSD         1 0.0530  0.0530   14.208 0.0001737 ***
R> augSD         1 0.4266  0.4266  114.308 < 2.2e-16 ***
R> bio           3 0.8551  0.2850   76.374 < 2.2e-16 ***
R> augMean:bio   3 0.7910  0.2637   70.647 < 2.2e-16 ***
R> febSD:bio     3 0.3475  0.1158   31.034 < 2.2e-16 ***
R> augSD:bio     3 0.1353  0.0451   12.085 9.097e-08 ***
R> Residuals   954 3.5603  0.0037                       
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction-rich model is clearly better than the reduced additive model: its AIC is lower, and the nested comparison indicates that allowing the slopes to differ among bioregions substantially improves the fit.

This is a good place to pause and reflect on model specification. The model is now much richer than the one we began with. That does not automatically make it better in every sense. It is better only if the biological question genuinely concerns whether the climate relationships differ among bioregions.

18.1 Interpret the richer model

The summary of full_mod3 shows that:

  • augMean remains a strong predictor of Y;
  • several interaction terms are important, which means the climate-response relationships differ among bioregions;
  • the overall model explains a substantial proportion of the variation in the response.

In words, the message is this: the relationship between the seaweed dissimilarity index and the selected climate variables is not spatially uniform. The influence of those variables changes among the bioregions. That is a much richer ecological statement than anything we could obtain from a single-predictor regression.

A journal-style Results statement for this richer model might read as follows:

The interaction-rich multiple regression explained a large proportion of the variation in mean Sørensen dissimilarity (\(R^2 = 0.78\); adjusted \(R^2 = 0.78\)) and fit the data substantially better than the corresponding additive model (lower AIC; nested model comparison \(F = 47.54\), \(p < 0.001\)). August mean temperature remained the dominant positive predictor in the reference bioregion (\(\beta = 0.344\), \(p < 0.001\)), but several significant interaction terms showed that the effects of August mean temperature, February temperature variability, and August temperature variability differed among bioregions. In particular, the slopes associated with augMean were weaker in BMP and ECTZ than in the reference bioregion, while positive interactions involving febSD and augSD indicated that the effect of short-term temperature variability was stronger in some bioregions than in others. These results show that the climatic drivers of compositional turnover are not spatially uniform along the South African coast.
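Note that the bioregion-specific slopes discussed above are not printed directly by summary(): each is the baseline slope plus the relevant interaction coefficient. Because the seaweed data are not regenerated here, the arithmetic is sketched with simulated data (the group labels and true slopes below are invented for illustration):

```r
# Simulated sketch: in an interaction model, the slope for a
# non-reference group equals the baseline slope plus that group's
# interaction coefficient.
set.seed(42)
n <- 200
grp <- factor(rep(c("ref", "BMP"), each = n / 2), levels = c("ref", "BMP"))
x <- rnorm(n)
# True slopes: 0.35 in the reference group, 0.10 in "BMP"
y <- ifelse(grp == "ref", 0.35, 0.10) * x + rnorm(n, sd = 0.1)

m <- lm(y ~ x * grp)
b <- coef(m)
slope_ref <- b["x"]                  # baseline slope (close to 0.35)
slope_bmp <- b["x"] + b["x:grpBMP"]  # baseline + interaction (close to 0.10)
```

The same addition applied to the coefficients of full_mod3 recovers the augMean slope within each bioregion.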

19 Alternative Categorical Variable Coding Schemes (Contrasts)

Throughout the chapter, the categorical variables have been handled with the default dummy-variable coding. But alternative coding schemes are possible, and they change the way the coefficients are interpreted.

The following synthetic example shows a few different possibilities.

set.seed(123)
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
y <- 2 * x + ifelse(cat_var == "A", 3,
                    ifelse(cat_var == "B", 1,
                           ifelse(cat_var == "C", -1, -3))) + rnorm(n)

data_contr <- data.frame(y, x, cat_var)
head(data_contr)
R>            y           x cat_var
R> 1  0.6667876 -0.56047565       B
R> 2  1.3086873 -0.23017749       B
R> 3  0.4496192  1.55870831       D
R> 4  2.1326402  0.07050839       A
R> 5 -2.8608771  0.12928774       D
R> 6  0.1497346  1.71506499       D

With default treatment coding, one level becomes the reference category:

contrasts(data_contr$cat_var)
R>   B C D
R> A 0 0 0
R> B 1 0 0
R> C 0 1 0
R> D 0 0 1

This means that the coefficient for each non-reference level is interpreted as the difference between that level and the reference level, holding the other predictors constant.
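Fitting the synthetic example under this default coding makes the interpretation concrete: with A as the reference, each cat_var coefficient estimates that level's difference from A, holding x constant. The sketch below regenerates the same synthetic data so that it runs on its own:

```r
# Regenerate the synthetic data from above
set.seed(123)
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
y <- 2 * x + ifelse(cat_var == "A", 3,
                    ifelse(cat_var == "B", 1,
                           ifelse(cat_var == "C", -1, -3))) + rnorm(n)

# Under treatment coding, (Intercept) estimates the mean of level A at
# x = 0, and each cat_var coefficient is that level's difference from A.
# The true differences are B - A = -2, C - A = -4, D - A = -6.
mod_treat <- lm(y ~ x + cat_var)
round(coef(mod_treat), 2)
```

The estimated coefficients for cat_varB, cat_varC, and cat_varD should land close to the true differences of -2, -4, and -6, subject to sampling noise.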

You can change the reference level:

data_contr$cat_var <- relevel(data_contr$cat_var, ref = "C")
contrasts(data_contr$cat_var)
R>   A B D
R> C 0 0 0
R> A 1 0 0
R> B 0 1 0
R> D 0 0 1

Or use effect coding:

contrasts(data_contr$cat_var) <- contr.sum(4)
contrasts(data_contr$cat_var)
R>   [,1] [,2] [,3]
R> C    1    0    0
R> A    0    1    0
R> B    0    0    1
R> D   -1   -1   -1
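Under sum-to-zero (effect) coding the intercept estimates the grand mean of the level means (at x = 0), and each coefficient estimates a level's deviation from that grand mean. Crucially, the choice of coding changes only the parameterisation, not the model itself, so the fitted values are identical. A self-contained sketch using the same synthetic data:

```r
# Regenerate the synthetic data from above
set.seed(123)
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
y <- 2 * x + ifelse(cat_var == "A", 3,
                    ifelse(cat_var == "B", 1,
                           ifelse(cat_var == "C", -1, -3))) + rnorm(n)

mod_treat <- lm(y ~ x + cat_var)                      # default treatment coding
mod_sum   <- lm(y ~ x + cat_var,
                contrasts = list(cat_var = contr.sum))  # effect coding

# Same model space, different parameterisation: fitted values agree.
all.equal(fitted(mod_treat), fitted(mod_sum))
# Under effect coding the intercept is near the grand mean of the
# level effects (the true mean of 3, 1, -1, -3 is 0).
coef(mod_sum)["(Intercept)"]
```

The same logic explains why R-squared, residuals, and predictions are unchanged when you switch contrast schemes: only the meaning of the individual coefficients shifts.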

The practical lesson is not that students must memorise every contrast scheme. It is that the interpretation of categorical coefficients depends on how the coding is done.

20 Summary

  • Multiple regression extends simple regression to several predictors.
  • Model specification is not a side issue; it is part of the scientific argument.
  • The meaning of each coefficient depends on the other predictors included in the model.
  • Multicollinearity can make a model statistically unstable and biologically difficult to interpret.
  • Categorical predictors and interaction terms belong naturally within the multiple-regression framework.
  • A useful multiple regression is one that is both statistically adequate and scientifically defensible.

This chapter has now laid out the full modelling landscape: additive models, interactions, richer bioregional structures, and categorical coding. The next chapter returns to the interaction problem directly and develops it in more focused detail.

References

Smit AJ, Bolton JJ, Anderson RJ (2017) Seaweeds in two oceans: Beta-diversity. Frontiers in Marine Science 4:404.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {13. {Multiple} {Regression} and {Model} {Specification}},
  date = {2026-03-19},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/13-multiple-regression-and-model-specification.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2026) 13. Multiple Regression and Model Specification. http://tangledbank.netlify.app/BCB744/basic_stats/13-multiple-regression-and-model-specification.html.