13. Polynomial Regression
Curvature Within the Linear Model Framework
- why polynomial regression is still a linear model in its parameters;
- how to decide when a quadratic or cubic term is warranted;
- how to fit and compare polynomial models with lm();
- how to interpret curvature biologically without overclaiming mechanism;
- how to report a polynomial regression in the same journal style used across the sequence.
1 Introduction
In simple linear regression, we assumed that the expected response changes in a straight-line way with the predictor. In many biological settings, that first assumption is too rigid: responses often accelerate, decelerate, peak, or flatten.
Polynomial regression is the first extension of linear modelling that allows this kind of curvature while remaining in the familiar lm() framework. The fitted curve may look nonlinear, but the model is still linear in the coefficients, which means ordinary least squares, confidence intervals, and model diagnostics are still used in exactly the same way.
In this chapter, I focus on how to detect curvature, how to add polynomial terms sensibly, and how to avoid using high-order polynomials as a substitute for biological reasoning.
2 Key Concepts
- Polynomial regression is linear in parameters even though it is curved in the predictor.
- A quadratic term allows one bend in the relationship.
- A cubic term allows more complex curvature, but interpretation usually becomes harder.
- Model hierarchy is important: if \(X^2\) is in the model, retain \(X\); if \(X^3\) is in the model, retain both \(X\) and \(X^2\).
- Parsimony is important: adding terms can improve apparent fit while harming generalisability.
3 When This Method Is Appropriate
Polynomial regression is useful when:
- a straight-line model leaves clear residual curvature;
- the relationship is smooth and continuous rather than step-like;
- there is no strong mechanistic equation yet, but a curved descriptive model is needed;
- you want to stay within the familiar inferential approach offered by lm().
When the biology suggests a specific process (for example a saturating uptake model), a mechanistic nonlinear model (Chapter 22) may be better. When curvature is complex and not well represented by low-order powers, a generalised additive model (GAM) may be better.
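As a rough sketch of how the first criterion in the list above might be checked, the code below fits a straight line and averages its residuals at each concentration. It assumes that `co2_df` (the object name used in the model calls later in this chapter) is the chilled Quebec subset of the built-in `CO2` data. The subsetting step is my reconstruction, not code taken from the chapter.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

# Fit the straight-line model
mod_lin <- lm(uptake ~ conc, data = co2_df)

# Mean residual at each concentration; a systematic run of signs
# (for example negative, positive, negative) points to unmodelled curvature
resid_by_conc <- tapply(residuals(mod_lin), co2_df$conc, mean)
print(round(resid_by_conc, 2))

# Residuals-versus-fitted plot; a bowed smoother tells the same story
plot(mod_lin, which = 1)
```

If the mean residuals scatter around zero with no pattern, the straight line is probably adequate and a polynomial term is not needed.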
4 Nature of the Data and Assumptions
Because polynomial regression uses ordinary least squares, the core assumptions are the same as for simple linear regression:
- independence of observations;
- approximately normal residuals;
- roughly constant residual variance;
- continuous response variable.
5 The Core Equations
What changes in polynomial regression is the mean structure rather than the underlying fitting framework. A cubic polynomial can be written as:
\[Y_i = \alpha + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \epsilon_i \tag{1}\]
In Equation 1, curvature is represented by powers of \(X\), but the model remains linear in the unknown coefficients \(\alpha, \beta_1, \beta_2, \beta_3\).
For introductory work, the important point is not to memorise the highest-order form. It is to understand that polynomial regression adds powers of the predictor while still estimating ordinary regression coefficients. A quadratic model is therefore just a simpler version of Equation 1 in which the cubic term is omitted.
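To make "linear in the coefficients" concrete, the following sketch extracts the design matrix behind a cubic fit and re-estimates the coefficients by ordinary least squares on its columns. It again assumes that `co2_df` is the chilled Quebec subset of `CO2`. The QR-based solve is an illustration of what lm() does internally, not a recommended workflow.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

mod_cub <- lm(uptake ~ conc + I(conc^2) + I(conc^3), data = co2_df)

# The design matrix has one column per coefficient: 1, X, X^2, X^3
X <- model.matrix(mod_cub)
print(head(X, 3))

# Ordinary least squares on those columns via a QR decomposition
beta_hat <- qr.coef(qr(X), co2_df$uptake)

# The hand-computed estimates should agree with those from lm()
print(all.equal(unname(beta_hat), unname(coef(mod_cub))))
```

The powers of \(X\) are simply extra columns in the design matrix, so the fitting problem is the same as in multiple linear regression.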
6 R Function
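The model is still fitted with lm(); polynomial terms enter the formula either as raw powers wrapped in I(), for example y ~ x + I(x^2), or through poly(x, 2), which produces orthogonal polynomials by default.
For teaching and biological interpretation, raw powers (I(x^2), I(x^3)) are usually easier to explain.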
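The sketch below compares the two ways of specifying a quadratic term. It assumes `co2_df` is the chilled Quebec subset of `CO2`, and the model names `mod_raw`, `mod_orth`, and `mod_raw2` are hypothetical labels of my own. The point is that raw and orthogonal polynomials describe the same fitted curve and differ only in how the coefficients are scaled.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

# Raw powers written out with I()
mod_raw  <- lm(uptake ~ conc + I(conc^2), data = co2_df)

# Orthogonal polynomials (the default behaviour of poly())
mod_orth <- lm(uptake ~ poly(conc, 2), data = co2_df)

# Raw powers requested through poly()
mod_raw2 <- lm(uptake ~ poly(conc, 2, raw = TRUE), data = co2_df)

# Same fitted values, whichever parameterisation is used
print(all.equal(fitted(mod_raw), fitted(mod_orth)))

# Raw coefficients agree between I() and poly(raw = TRUE)
print(all.equal(unname(coef(mod_raw)), unname(coef(mod_raw2))))

# Orthogonal coefficients are on a different scale and are not
# directly interpretable in units of conc
print(coef(mod_raw))
print(coef(mod_orth))
```

Orthogonal polynomials reduce the correlation between terms, which can help numerically, but their coefficients no longer map onto \(\beta_1\) and \(\beta_2\) in Equation 1.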
7 Example 1: CO\(_2\) Uptake and Curved Response
7.1 Example Dataset
We use the classic CO2 dataset built into R, restricted here to the 21 observations from chilled Quebec plants. These are real measurements of carbon dioxide uptake by grass plants, measured across increasing ambient CO\(_2\) concentrations. The biological expectation is that uptake should increase at low concentrations and then begin to level off, so a straight line may be too simple.
| Plant | Type | Treatment | conc | uptake |
|---|---|---|---|---|
| Qc1 | Quebec | chilled | 95 | 14.2 |
| Qc1 | Quebec | chilled | 175 | 24.1 |
| Qc1 | Quebec | chilled | 250 | 30.3 |
| Qc1 | Quebec | chilled | 350 | 34.6 |
| Qc1 | Quebec | chilled | 500 | 32.5 |
| Qc1 | Quebec | chilled | 675 | 35.4 |
| Qc1 | Quebec | chilled | 1000 | 38.7 |
| Qc2 | Quebec | chilled | 95 | 9.3 |
| Qc2 | Quebec | chilled | 175 | 27.3 |
| Qc2 | Quebec | chilled | 250 | 35.0 |
7.2 Do an Exploratory Data Analysis (EDA)
| n | min_conc | max_conc | mean_uptake | sd_uptake |
|---|---|---|---|---|
| 21 | 95 | 1000 | 31.8 | 9.64 |
The relationship is clearly positive, but it is not perfectly straight. Uptake rises quickly at low to moderate concentrations and then appears to flatten. That pattern motivates trying a polynomial extension.
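The subset and summary used in this example could be reproduced along the following lines. This is a base-R sketch under the assumption that `co2_df` holds the chilled Quebec plants; the chapter's own code may have used tidyverse functions instead. The helper name `eda` is hypothetical.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

# First rows of the working data
print(head(co2_df, 10))

# One-row numerical summary of the predictor range and the response
eda <- data.frame(
  n           = nrow(co2_df),
  min_conc    = min(co2_df$conc),
  max_conc    = max(co2_df$conc),
  mean_uptake = mean(co2_df$uptake),
  sd_uptake   = sd(co2_df$uptake)
)
print(eda)

# Scatterplot of the raw relationship before any model is fitted
plot(uptake ~ conc, data = co2_df,
     xlab = "Ambient CO2 concentration", ylab = "CO2 uptake")
```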
7.3 State the Model Question and Hypotheses
The primary model question is whether including curvature meaningfully improves fit relative to a straight-line model.
For the quadratic extension, a useful inferential pair is:
\[H_{0}: \beta_2 = 0\] \[H_{a}: \beta_2 \ne 0\]
If \(\beta_2 = 0\), the model collapses to a straight-line form. If \(\beta_2 \neq 0\), the data support curvature.
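The hypothesis pair above can be tested in two equivalent ways, sketched here under the assumption that `co2_df` is the chilled Quebec subset of `CO2`. The t-test on the quadratic coefficient and the F-test comparing the linear and quadratic models give the same p-value, because for a single added term \(F = t^2\).

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

mod_lin  <- lm(uptake ~ conc, data = co2_df)
mod_quad <- lm(uptake ~ conc + I(conc^2), data = co2_df)

# t-test for H0: beta_2 = 0, taken from the coefficient table
coef_tab <- coef(summary(mod_quad))
t_quad <- coef_tab["I(conc^2)", "t value"]
p_t    <- coef_tab["I(conc^2)", "Pr(>|t|)"]

# F-test for the same hypothesis from a two-model nested comparison
cmp    <- anova(mod_lin, mod_quad)
F_quad <- cmp[2, "F"]
p_F    <- cmp[2, "Pr(>F)"]

# The two tests are equivalent for a single added term
print(all.equal(t_quad^2, F_quad))
print(all.equal(p_t, p_F))
```

When three or more models are passed to anova() at once, the F statistics are scaled by the residual mean square of the largest model, so the value reported for the quadratic step in such a table need not equal \(t^2\) from the quadratic model's own summary.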
7.4 Fit the Models
Call:
lm(formula = uptake ~ conc, data = co2_df)
Residuals:
Min 1Q Median 3Q Max
-14.3773 -3.7712 0.0476 4.8664 10.7414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.421041 2.583221 8.292 9.79e-08 ***
conc 0.023750 0.004919 4.828 0.000117 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.631 on 19 degrees of freedom
Multiple R-squared: 0.5509, Adjusted R-squared: 0.5273
F-statistic: 23.31 on 1 and 19 DF, p-value: 0.0001169
Call:
lm(formula = uptake ~ conc + I(conc^2), data = co2_df)
Residuals:
Min 1Q Median 3Q Max
-8.7023 -2.9023 0.2657 2.3714 10.0540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.053e+01 3.269e+00 3.221 0.004740 **
conc 8.392e-02 1.511e-02 5.554 2.85e-05 ***
I(conc^2) -5.542e-05 1.351e-05 -4.102 0.000669 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.898 on 18 degrees of freedom
Multiple R-squared: 0.7679, Adjusted R-squared: 0.7421
F-statistic: 29.78 on 2 and 18 DF, p-value: 1.954e-06
Call:
lm(formula = uptake ~ conc + I(conc^2) + I(conc^3), data = co2_df)
Residuals:
Min 1Q Median 3Q Max
-6.0662 -2.2950 0.3338 1.6063 6.4848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.325e+00 3.792e+00 -1.404 0.178
conc 2.339e-01 3.128e-02 7.478 9.03e-07 ***
I(conc^2) -3.970e-04 6.818e-05 -5.822 2.04e-05 ***
I(conc^3) 2.094e-07 4.145e-08 5.051 9.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.187 on 17 degrees of freedom
Multiple R-squared: 0.9072, Adjusted R-squared: 0.8908
F-statistic: 55.39 on 3 and 17 DF, p-value: 5.509e-09
A nested-model comparison helps us decide whether added terms improve the model enough to justify the extra complexity.
Analysis of Variance Table
Model 1: uptake ~ conc
Model 2: uptake ~ conc + I(conc^2)
Model 3: uptake ~ conc + I(conc^2) + I(conc^3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 835.48
2 18 431.80 1 403.68 39.746 7.919e-06 ***
3 17 172.66 1 259.14 25.516 9.844e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
df AIC
mod_lin 3 142.9485
mod_quad 4 131.0877
mod_cub 5 113.8378
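The three fits and the comparison tables above could be produced with code along these lines. The model names `mod_lin`, `mod_quad`, and `mod_cub` come from the output shown; the subsetting step is my assumption about how `co2_df` was built.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

# Straight-line, quadratic, and cubic models; lower-order terms are retained
mod_lin  <- lm(uptake ~ conc, data = co2_df)
mod_quad <- lm(uptake ~ conc + I(conc^2), data = co2_df)
mod_cub  <- lm(uptake ~ conc + I(conc^2) + I(conc^3), data = co2_df)

# Sequential F-tests for each added term
nested_cmp <- anova(mod_lin, mod_quad, mod_cub)
print(nested_cmp)

# AIC penalises each extra parameter; lower values indicate better support
aic_tab <- AIC(mod_lin, mod_quad, mod_cub)
print(aic_tab)
```

The df column in the AIC table counts the residual variance as a parameter, which is why the straight-line model shows 3 rather than 2.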
7.5 Test Assumptions / Check Diagnostics
Shapiro-Wilk normality test
data: residuals(mod_quad)
W = 0.9902, p-value = 0.998
studentized Breusch-Pagan test
data: mod_quad
BP = 1.3941, df = 2, p-value = 0.498
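For the quadratic model, neither test gives evidence against the assumptions: the Shapiro-Wilk test shows no departure from normality (W = 0.990, p = 0.998), and the studentized Breusch-Pagan test shows no evidence of non-constant variance (BP = 1.39, df = 2, p = 0.498). No model is perfect, but when diagnostics are broadly acceptable and the residual curvature seen in the straight-line fit is reduced, the polynomial extension is usually justified.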
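A base-R sketch of these two checks follows, assuming `co2_df` is the chilled Quebec subset of `CO2`. The Breusch-Pagan output above looks like it came from a dedicated function such as lmtest::bptest(). Here the studentized statistic is computed by hand as \(n R^2\) from an auxiliary regression of the squared residuals on the model's predictors, which should correspond to the same test. The names `aux`, `bp_stat`, `bp_df`, and `bp_p` are hypothetical.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")
mod_quad <- lm(uptake ~ conc + I(conc^2), data = co2_df)

# Normality of residuals
sw <- shapiro.test(residuals(mod_quad))
print(sw)

# Studentized Breusch-Pagan test by hand:
# regress squared residuals on the same predictors and use n * R^2
co2_df$e2 <- residuals(mod_quad)^2
aux <- lm(e2 ~ conc + I(conc^2), data = co2_df)
bp_stat <- nrow(co2_df) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))

# The standard four-panel diagnostic plots remain the first thing to inspect
par(mfrow = c(2, 2))
plot(mod_quad)
```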
7.6 Interpret the Results
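# Requires dplyr, tidyr, and ggplot2 (all loaded by library(tidyverse))
# Build a fine grid of concentrations spanning the observed range
conc_grid <- tibble(conc = seq(min(co2_df$conc), max(co2_df$conc), length.out = 200))

# Predict from each model on the same grid, then reshape to long format
co2_pred <- conc_grid |>
  mutate(
    linear = predict(mod_lin, newdata = conc_grid),
    quadratic = predict(mod_quad, newdata = conc_grid),
    cubic = predict(mod_cub, newdata = conc_grid)
  ) |>
  pivot_longer(cols = c(linear, quadratic, cubic),
               names_to = "model", values_to = "fit")

# Overlay the three fitted curves on the observed data
ggplot(co2_df, aes(x = conc, y = uptake)) +
  geom_point(alpha = 0.65) +
  geom_line(data = co2_pred, aes(y = fit, colour = model), linewidth = 0.9) +
  labs(x = expression(CO[2]~concentration),
       y = expression(CO[2]~uptake),
       colour = "Model")

In this dataset, the quadratic model captures biologically plausible deceleration in uptake at higher concentrations better than a straight line. The cubic term is also statistically supported here (nested-model comparison \(p < 0.001\); AIC 113.8 versus 131.1 for the quadratic), so the choice between the two rests on parsimony and biological plausibility rather than on fit alone. A cubic curve can bend a second time within the observed range, and unless that extra bend has a biological explanation, the quadratic model is usually the more defensible description.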
The practical interpretation is not usually about the standalone value of \(\beta_2\). Instead, it is about the shape of the response: uptake increases with concentration, but the rate of increase slows across the observed range.
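One way to describe the shape numerically is through the slope of the fitted curve, \(\beta_1 + 2\beta_2 X\), and the concentration at which that slope reaches zero, \(-\beta_1 / (2\beta_2)\). The sketch below computes both from a refitted quadratic model, assuming `co2_df` is the chilled Quebec subset of `CO2`. The helper `slope_at()` is hypothetical.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")
mod_quad <- lm(uptake ~ conc + I(conc^2), data = co2_df)

b <- coef(mod_quad)

# Slope of the fitted quadratic at a given concentration: b1 + 2 * b2 * x
slope_at <- function(x) b[["conc"]] + 2 * b[["I(conc^2)"]] * x

# Marginal change in uptake per unit concentration at low, mid, and high values
print(slope_at(c(95, 500, 1000)))

# Concentration at which the fitted parabola peaks: -b1 / (2 * b2)
conc_peak <- -b[["conc"]] / (2 * b[["I(conc^2)"]])
print(conc_peak)
```

With the coefficients reported in the quadratic summary above, the peak falls at roughly 757, which is inside the observed range of 95 to 1000. The fitted curve therefore turns downward at the highest concentrations. That is a geometric property of a parabola and should not be read as evidence that uptake declines.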
7.7 Reporting
Methods
Carbon dioxide uptake was modelled as a function of ambient CO\(_2\) concentration using ordinary least squares regression on the CO2 plant dataset (chilled Quebec plants). A straight-line model was compared with quadratic and cubic polynomial extensions. Model comparison used nested-model ANOVA and AIC, and assumptions were assessed from residual diagnostics.
Results
The straight-line model did not capture the curvature in the concentration-uptake relationship. Adding a quadratic term improved model fit substantially relative to the linear model (nested-model comparison, \(p < 0.001\); AIC 131.1 versus 142.9). The cubic term was also statistically supported (\(p < 0.001\); AIC 113.8), but the quadratic model was retained as the more parsimonious and biologically interpretable description. It described a positive but decelerating relationship in which uptake increased with concentration but began to flatten at higher concentration values.
Discussion
A low-order polynomial provided a useful descriptive model of nonlinearity while retaining the familiar lm() workflow. Biologically, the fitted curve is consistent with diminishing marginal increases in uptake as CO\(_2\) concentration rises. The model should be interpreted as a flexible approximation over the observed range rather than as a mechanistic process equation.
8 What to Do When Assumptions Fail / Alternatives
- If residual curvature remains strong, move to a more flexible model such as a GAM.
- If biology suggests a specific functional process, use a mechanistic nonlinear model (Chapter 22).
- If polynomial degree must become high to fit the pattern, that is usually a warning sign of overfitting.
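The overfitting warning in the last point can be examined with a leave-one-out check. The sketch below assumes `co2_df` is the chilled Quebec subset of `CO2` and compares the ordinary residual sum of squares with the PRESS statistic for polynomial degrees 1 to 5. For a least-squares fit, PRESS can be computed without refitting, using \(e_i / (1 - h_i)\) as the leave-one-out residual. The names `rss` and `press` are hypothetical.

```r
# Assumption: co2_df is the chilled Quebec subset of the built-in CO2 data
co2_df <- subset(as.data.frame(CO2), Type == "Quebec" & Treatment == "chilled")

degrees <- 1:5

# In-sample residual sum of squares: can only decrease as terms are added
rss <- sapply(degrees, function(d) {
  fit <- lm(uptake ~ poly(conc, d), data = co2_df)
  deviance(fit)
})

# PRESS: sum of squared leave-one-out prediction errors,
# obtained from ordinary residuals and hat values
press <- sapply(degrees, function(d) {
  fit <- lm(uptake ~ poly(conc, d), data = co2_df)
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
})

print(data.frame(degree = degrees, RSS = rss, PRESS = press))
```

If PRESS stops falling, or starts rising, while RSS keeps shrinking, the additional terms are fitting noise rather than structure. Note that the observations here come from only three plants, so leave-one-out error may still be optimistic.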
9 Summary
- Polynomial regression is a linear-model extension that captures smooth curvature.
- Low-order polynomials (especially quadratic) are often enough for introductory biological applications.
- Model comparison and diagnostics should drive term selection, not visual fit alone.
- Interpretation should focus on response shape and biological plausibility, not just coefficient significance.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {13. {Polynomial} {Regression}},
date = {2026-03-22},
url = {https://tangledbank.netlify.app/BCB744/basic_stats/13-polynomial-regression.html},
langid = {en}
}
