13. Polynomial Regression

Curvature Within the Linear Model Framework

Published

2026/03/22

Note: In This Chapter
  • why polynomial regression is still a linear model in its parameters;
  • how to decide when a quadratic or cubic term is warranted;
  • how to fit and compare polynomial models with lm();
  • how to interpret curvature biologically without overclaiming mechanism;
  • how to report a polynomial regression in the same journal style used across the sequence.
Important: Tasks to Complete in This Chapter
  • None

1 Introduction

In simple linear regression, we assumed that the expected response changes with the predictor along a straight line. In many biological settings, that assumption is too rigid: responses often accelerate, decelerate, peak, or flatten.

Polynomial regression is the first extension of linear modelling that allows this kind of curvature while remaining in the familiar lm() framework. The fitted curve may look nonlinear, but the model is still linear in the coefficients, which means ordinary least squares, confidence intervals, and model diagnostics are still used in exactly the same way.

In this chapter, I focus on how to detect curvature, how to add polynomial terms sensibly, and how to avoid using high-order polynomials as a substitute for biological reasoning.

2 Key Concepts

  • Polynomial regression is linear in parameters even though it is curved in the predictor.
  • A quadratic term allows one bend in the relationship.
  • A cubic term allows more complex curvature, but interpretation usually becomes harder.
  • Model hierarchy is important: if \(X^2\) is in the model, retain \(X\); if \(X^3\) is in the model, retain both \(X\) and \(X^2\).
  • Parsimony is important: adding terms can improve apparent fit while harming generalisability.

3 When This Method Is Appropriate

Polynomial regression is useful when:

  • a straight-line model leaves clear residual curvature;
  • the relationship is smooth and continuous rather than step-like;
  • there is no strong mechanistic equation yet, but a curved descriptive model is needed;
  • you want to stay within the familiar inferential approach offered by lm().

When the biology suggests a specific process (for example a saturating uptake model), a mechanistic nonlinear model (Chapter 22) may be better. When curvature is complex and not well represented by low-order powers, a GAM may be better.

4 Nature of the Data and Assumptions

Because polynomial regression uses ordinary least squares, the core assumptions are the same as for simple linear regression:

  1. independence of observations;
  2. approximately normal residuals;
  3. roughly constant residual variance;
  4. continuous response variable.

5 The Core Equations

What changes in polynomial regression is the mean structure rather than the underlying fitting framework. A cubic polynomial can be written as:

\[Y_i = \alpha + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \epsilon_i \tag{1}\]

In Equation 1, curvature is represented by powers of \(X\), but the model remains linear in the unknown coefficients \(\alpha, \beta_1, \beta_2, \beta_3\).

For introductory work, the important point is not to memorise the highest-order form. It is to understand that polynomial regression adds powers of the predictor while still estimating ordinary regression coefficients. A quadratic model is therefore just a simpler version of Equation 1 in which the cubic term is omitted.
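
To make "linear in the coefficients" concrete, the sketch below builds a quadratic design matrix by hand and solves the least-squares problem directly, then compares the result with lm(). The simulated data, the seed, and the object names (x, y, X, beta_hat) are illustrative assumptions and are not part of this chapter's dataset.

```r
# Simulated data for illustration only (hypothetical values)
set.seed(1)
x <- seq(0, 10, length.out = 30)
y <- 2 + 1.5 * x - 0.12 * x^2 + rnorm(length(x), sd = 0.5)

# Design matrix: the columns are 1, x, and x^2.
# The curve is nonlinear in x, but each column enters the model linearly.
X <- cbind(intercept = 1, x = x, x_sq = x^2)

# Ordinary least squares via the normal equations: (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same coefficients from lm()
fit <- lm(y ~ x + I(x^2))

cbind(by_hand = drop(beta_hat), lm = unname(coef(fit)))
```

The two columns should agree to numerical precision, which is the sense in which a quadratic fit is an ordinary linear model with one extra column.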

6 R Function

The model is still fitted with lm():

lm(y ~ x + I(x^2), data = df)                       # quadratic
lm(y ~ x + I(x^2) + I(x^3), data = df)              # cubic
lm(y ~ poly(x, degree = 3), data = df)              # orthogonal polynomial basis

For teaching and biological interpretation, raw powers (I(x^2), I(x^3)) are usually easier to explain.
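
The two parameterisations describe the same fitted curve; only the coefficients differ. The sketch below checks this on the chilled Quebec subset of the base R CO2 data (the subset used in the worked example of this chapter). The object names co2_sub, fit_raw, and fit_orth are my own and are assumptions for illustration.

```r
# CO2 ships with base R (datasets package)
co2_sub <- subset(CO2, Treatment == "chilled" & Type == "Quebec")

fit_raw  <- lm(uptake ~ conc + I(conc^2), data = co2_sub)          # raw powers
fit_orth <- lm(uptake ~ poly(conc, degree = 2), data = co2_sub)    # orthogonal basis

# Same fitted values, different parameterisation
all.equal(unname(fitted(fit_raw)), unname(fitted(fit_orth)))

# Raw powers are strongly correlated with each other...
cor(co2_sub$conc, co2_sub$conc^2)

# ...whereas the orthogonal basis columns are uncorrelated by construction
round(cor(poly(co2_sub$conc, degree = 2)), 10)
```

The strong correlation between conc and conc^2 is why standard errors of raw-power coefficients can be inflated, and why the orthogonal basis is often preferred for numerical stability even though its coefficients are harder to read biologically.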

7 Example 1: CO\(_2\) Uptake and Curved Response

7.1 Example Dataset

We use the classic CO2 dataset from base R. These are real measurements of carbon dioxide uptake by grass plants, measured across increasing ambient CO\(_2\) concentrations. The biological expectation is that uptake should increase at low concentrations and then begin to level off, so a straight line may be too simple.

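library(tidyverse)  # as_tibble(), filter(), summarise(), ggplot()
library(gt)         # gt() for formatted tables

# Keep only the chilled plants of Quebec origin
co2_df <- as_tibble(CO2) |> 
  filter(Treatment == "chilled", Type == "Quebec")

gt(head(co2_df, 10))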
A subset of the CO2 dataset showing chilled Quebec plants used in the polynomial-regression example.

Plant  Type    Treatment  conc  uptake
Qc1    Quebec  chilled      95    14.2
Qc1    Quebec  chilled     175    24.1
Qc1    Quebec  chilled     250    30.3
Qc1    Quebec  chilled     350    34.6
Qc1    Quebec  chilled     500    32.5
Qc1    Quebec  chilled     675    35.4
Qc1    Quebec  chilled    1000    38.7
Qc2    Quebec  chilled      95     9.3
Qc2    Quebec  chilled     175    27.3
Qc2    Quebec  chilled     250    35.0

7.2 Do an Exploratory Data Analysis (EDA)

co2_df |>
  summarise(
    n = n(),
    min_conc = min(conc),
    max_conc = max(conc),
    mean_uptake = mean(uptake),
    sd_uptake = sd(uptake)
  )
# A tibble: 1 × 5
      n min_conc max_conc mean_uptake sd_uptake
  <int>    <dbl>    <dbl>       <dbl>     <dbl>
1    21       95     1000        31.8      9.64
ggplot(co2_df, aes(x = conc, y = uptake)) +
  geom_point(alpha = 0.75) +
  labs(x = expression(CO[2]~concentration),
       y = "CO2 uptake")
Figure 1: CO\(_2\) uptake in chilled Quebec plants across concentration levels.

The relationship is clearly positive, but it is not perfectly straight. Uptake rises quickly at low to moderate concentrations and then appears to flatten. That pattern motivates trying a polynomial extension.
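
A scatterplot suggests curvature, but the residuals from a straight-line fit show it more directly. The sketch below is one way to check, using base R only; the object names (co2_sub, fit_line, mean_resid) are assumptions for illustration.

```r
co2_sub  <- subset(CO2, Treatment == "chilled" & Type == "Quebec")
fit_line <- lm(uptake ~ conc, data = co2_sub)

# Residuals against the predictor, with a lowess smoother as a visual guide
plot(co2_sub$conc, resid(fit_line),
     xlab = "CO2 concentration", ylab = "Residual (straight-line fit)")
abline(h = 0, lty = 2)
lines(lowess(co2_sub$conc, resid(fit_line)), col = "red")

# Mean residual at each concentration level
mean_resid <- tapply(resid(fit_line), co2_sub$conc, mean)
round(mean_resid, 2)
```

An arch-shaped pattern, with negative mean residuals at the lowest and highest concentrations and positive ones in between, is the signature of curvature that a straight line cannot absorb.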

7.3 State the Model Question and Hypotheses

The primary model question is whether including curvature meaningfully improves fit relative to a straight-line model.

For the quadratic extension, a useful inferential pair is:

\[H_{0}: \beta_2 = 0\] \[H_{a}: \beta_2 \ne 0\]

If \(\beta_2 = 0\), the model collapses to a straight-line form. If \(\beta_2 \neq 0\), the data support curvature.
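
This hypothesis can be tested in two equivalent ways: with the t-test on the quadratic coefficient, or with an F-test comparing the straight-line and quadratic models. The sketch below shows that the two agree (the squared t statistic equals the F statistic). Object names are my own assumptions.

```r
co2_sub  <- subset(CO2, Treatment == "chilled" & Type == "Quebec")
fit_lin  <- lm(uptake ~ conc, data = co2_sub)
fit_quad <- lm(uptake ~ conc + I(conc^2), data = co2_sub)

# t-test for the quadratic coefficient, taken from the coefficient table
t_quad <- coef(summary(fit_quad))["I(conc^2)", "t value"]

# F-test comparing the two nested models
f_quad <- anova(fit_lin, fit_quad)$F[2]

c(t_squared = t_quad^2, F = f_quad)
```

Note that this equivalence holds for the two-model comparison. When three models are passed to anova() at once, as later in this section, the F statistic for the same step is larger (39.746) because the denominator is the residual mean square of the largest model in the table.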

7.4 Fit the Models

mod_lin  <- lm(uptake ~ conc, data = co2_df)
mod_quad <- lm(uptake ~ conc + I(conc^2), data = co2_df)
mod_cub  <- lm(uptake ~ conc + I(conc^2) + I(conc^3), data = co2_df)

summary(mod_lin)

Call:
lm(formula = uptake ~ conc, data = co2_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3773  -3.7712   0.0476   4.8664  10.7414 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.421041   2.583221   8.292 9.79e-08 ***
conc         0.023750   0.004919   4.828 0.000117 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.631 on 19 degrees of freedom
Multiple R-squared:  0.5509,    Adjusted R-squared:  0.5273 
F-statistic: 23.31 on 1 and 19 DF,  p-value: 0.0001169
summary(mod_quad)

Call:
lm(formula = uptake ~ conc + I(conc^2), data = co2_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7023 -2.9023  0.2657  2.3714 10.0540 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.053e+01  3.269e+00   3.221 0.004740 ** 
conc         8.392e-02  1.511e-02   5.554 2.85e-05 ***
I(conc^2)   -5.542e-05  1.351e-05  -4.102 0.000669 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.898 on 18 degrees of freedom
Multiple R-squared:  0.7679,    Adjusted R-squared:  0.7421 
F-statistic: 29.78 on 2 and 18 DF,  p-value: 1.954e-06
summary(mod_cub)

Call:
lm(formula = uptake ~ conc + I(conc^2) + I(conc^3), data = co2_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0662 -2.2950  0.3338  1.6063  6.4848 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.325e+00  3.792e+00  -1.404    0.178    
conc         2.339e-01  3.128e-02   7.478 9.03e-07 ***
I(conc^2)   -3.970e-04  6.818e-05  -5.822 2.04e-05 ***
I(conc^3)    2.094e-07  4.145e-08   5.051 9.84e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.187 on 17 degrees of freedom
Multiple R-squared:  0.9072,    Adjusted R-squared:  0.8908 
F-statistic: 55.39 on 3 and 17 DF,  p-value: 5.509e-09

A nested-model comparison helps us decide whether added terms improve the model enough to justify the extra complexity.

anova(mod_lin, mod_quad, mod_cub)
Analysis of Variance Table

Model 1: uptake ~ conc
Model 2: uptake ~ conc + I(conc^2)
Model 3: uptake ~ conc + I(conc^2) + I(conc^3)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     19 835.48                                  
2     18 431.80  1    403.68 39.746 7.919e-06 ***
3     17 172.66  1    259.14 25.516 9.844e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod_lin, mod_quad, mod_cub)
         df      AIC
mod_lin   3 142.9485
mod_quad  4 131.0877
mod_cub   5 113.8378
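
AIC values are easier to compare as differences from the best-supported model. The sketch below refits the three models so that it runs on its own and then computes those differences; the list name fits and the vector delta_aic are assumptions for illustration.

```r
co2_sub <- subset(CO2, Treatment == "chilled" & Type == "Quebec")

fits <- list(
  linear    = lm(uptake ~ conc, data = co2_sub),
  quadratic = lm(uptake ~ conc + I(conc^2), data = co2_sub),
  cubic     = lm(uptake ~ conc + I(conc^2) + I(conc^3), data = co2_sub)
)

aic <- sapply(fits, AIC)

# Difference from the lowest AIC; 0 marks the best-supported model
delta_aic <- aic - min(aic)
round(delta_aic, 2)
```

From the AIC values printed above, the cubic model is lowest, with the quadratic about 17 units and the linear about 29 units higher. On statistical grounds alone the cubic is therefore favoured, so any decision to retain the quadratic has to rest on parsimony and biological plausibility.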

7.5 Test Assumptions / Check Diagnostics

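library(broom)   # augment()
library(lmtest)  # bptest()

quad_aug <- augment(mod_quad)  # fitted values and residuals as a tidy data frame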

shapiro.test(residuals(mod_quad))

    Shapiro-Wilk normality test

data:  residuals(mod_quad)
W = 0.9902, p-value = 0.998
bptest(mod_quad)

    studentized Breusch-Pagan test

data:  mod_quad
BP = 1.3941, df = 2, p-value = 0.498
par(mfrow = c(2, 2))
plot(mod_quad)
par(mfrow = c(1, 1))
Figure 2: Standard diagnostics for the quadratic model.

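Neither test gives evidence against the assumptions: the Shapiro-Wilk test does not reject normality of the residuals (W = 0.990, p = 0.998), and the Breusch-Pagan test does not reject constant variance (BP = 1.39, p = 0.498). With only 21 observations these tests have limited power, so the plots in Figure 2 deserve at least as much weight. Independence also needs a caveat, because the 21 observations are repeated measurements on three plants and are therefore not strictly independent. No model is perfect, but when diagnostics are broadly acceptable and the residual curvature left by the straight-line fit is reduced, the polynomial extension is usually justified.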

7.6 Interpret the Results

co2_pred <- tibble(conc = seq(min(co2_df$conc), max(co2_df$conc), length.out = 200)) |>
  mutate(
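    # cur_data() is deprecated in dplyr >= 1.1.0, so pass the grid explicitly
    linear = predict(mod_lin, newdata = tibble(conc = conc)),
    quadratic = predict(mod_quad, newdata = tibble(conc = conc)),
    cubic = predict(mod_cub, newdata = tibble(conc = conc))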
  ) |>
  pivot_longer(cols = c(linear, quadratic, cubic),
               names_to = "model", values_to = "fit")

ggplot(co2_df, aes(x = conc, y = uptake)) +
  geom_point(alpha = 0.65) +
  geom_line(data = co2_pred, aes(y = fit, colour = model), linewidth = 0.9) +
  labs(x = expression(CO[2]~concentration),
       y = "CO2 uptake",
       colour = "Model")
Figure 3: Linear, quadratic, and cubic polynomial fits to the CO\(_2\) uptake data.

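In this dataset, the quadratic model captures the biologically plausible deceleration in uptake at higher concentrations better than a straight line. The cubic term is also statistically supported (nested-model F = 25.5, p < 0.001; AIC 113.8 versus 131.1 for the quadratic), so the choice between the two is a judgement about purpose rather than about fit alone. A cubic with a positive leading coefficient must eventually bend upward again, which is hard to reconcile with a response that levels off. Where a higher-order term does not provide a clear biological gain, the quadratic model is usually preferable as a descriptive summary.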

The practical interpretation is not usually about the standalone value of \(\beta_2\). Instead, it is about the shape of the response: uptake increases with concentration, but the rate of increase slows across the observed range.
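
The shape can be summarised with two derived quantities: the slope of the fitted curve at a given concentration, \(\beta_1 + 2\beta_2 X\), and the concentration at which the fitted parabola turns, \(-\beta_1 / (2\beta_2)\). The sketch below computes both; the helper slope_at() and the other object names are hypothetical and introduced only for illustration.

```r
co2_sub  <- subset(CO2, Treatment == "chilled" & Type == "Quebec")
fit_quad <- lm(uptake ~ conc + I(conc^2), data = co2_sub)

b <- unname(coef(fit_quad))  # b[1] intercept, b[2] linear term, b[3] quadratic term

# Slope of the fitted curve (first derivative) at chosen concentrations
slope_at <- function(conc) b[2] + 2 * b[3] * conc
slope_at(c(100, 500, 1000))

# Concentration at which the fitted parabola reaches its maximum
conc_peak <- -b[2] / (2 * b[3])
conc_peak
```

With the coefficients printed in the model summary, the turning point falls near a concentration of 757, which is inside the observed range, so the fitted quadratic bends slightly downward at the highest concentrations. That downturn is a property of the parabola and should not be read as evidence that uptake declines.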

7.7 Reporting

Note: Write-Up

Methods

Carbon dioxide uptake was modelled as a function of ambient CO\(_2\) concentration using ordinary least squares regression on the CO2 plant dataset (chilled Quebec plants). A straight-line model was compared with quadratic and cubic polynomial extensions. Model comparison used nested-model ANOVA and AIC, and assumptions were assessed from residual diagnostics.

Results

The straight-line model could not capture the curvature in the concentration-uptake relationship (\(R^2 = 0.55\)). Adding a quadratic term improved fit substantially (nested-model comparison, \(F_{1,17} = 39.75\), \(p < 0.001\); \(R^2 = 0.77\)). The cubic term improved fit further (\(F_{1,17} = 25.52\), \(p < 0.001\); AIC = 113.8 versus 131.1 for the quadratic), but the quadratic model was retained as the more parsimonious and interpretable description. It described a positive but decelerating relationship in which uptake increased with concentration and flattened at higher concentrations.

Discussion

A low-order polynomial provided a useful descriptive model of nonlinearity while retaining the familiar lm() workflow. Biologically, the fitted curve is consistent with diminishing marginal increases in uptake as CO\(_2\) concentration rises. The model should be interpreted as a flexible approximation over the observed range rather than as a mechanistic process equation.

8 What to Do When Assumptions Fail / Alternatives

  • If residual curvature remains strong, move to a more flexible model such as a GAM.
  • If biology suggests a specific functional process, use a mechanistic nonlinear model (Chapter 22).
  • If polynomial degree must become high to fit the pattern, that is usually a warning sign of overfitting.
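
One way to see the overfitting warning in practice is to compare in-sample error with leave-one-out prediction error as the degree increases. The sketch below uses the PRESS statistic, which for a linear model can be computed from the residuals and hat values without refitting. The helper press() and the choice of degrees 1 to 5 are my own assumptions, and orthogonal polynomials are used only for numerical stability.

```r
co2_sub <- subset(CO2, Treatment == "chilled" & Type == "Quebec")

# PRESS: sum of squared leave-one-out residuals, e_i / (1 - h_ii)
press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)

degrees <- 1:5
fits <- lapply(degrees, function(d) lm(uptake ~ poly(conc, d), data = co2_sub))

comparison <- data.frame(
  degree = degrees,
  rss    = sapply(fits, deviance),  # in-sample residual sum of squares
  press  = sapply(fits, press)      # leave-one-out prediction error
)
comparison
```

The in-sample RSS can only decrease as terms are added, whereas PRESS need not, which is why apparent fit alone is a poor guide to degree. Because each concentration here has three replicate plants, leaving out a single observation is a fairly mild test, so treat the comparison as indicative rather than decisive.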

9 Summary

  • Polynomial regression is a linear-model extension that captures smooth curvature.
  • Low-order polynomials (especially quadratic) are often enough for introductory biological applications.
  • Model comparison and diagnostics should drive term selection, not visual fit alone.
  • Interpretation should focus on response shape and biological plausibility, not just coefficient significance.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {13. {Polynomial} {Regression}},
  date = {2026-03-22},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/13-polynomial-regression.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 13. Polynomial Regression. https://tangledbank.netlify.app/BCB744/basic_stats/13-polynomial-regression.html.