The sparrow wing data from Zar (1999).

| Age (days) | Wing length (cm) |
|---|---|
| 3 | 1.4 |
| 4 | 1.5 |
| 5 | 2.2 |
| 6 | 2.4 |
| 8 | 3.1 |
| 9 | 3.2 |
| 10 | 3.2 |
| 11 | 3.9 |
| 12 | 4.1 |
| 14 | 4.7 |
| 15 | 4.5 |
| 16 | 5.2 |
| 17 | 5.0 |
9. Simple Linear Regressions
The shallow end in the ocean of regression models
- The simple linear regression
- The model coefficients
- Graphing linear regressions
- Confidence intervals
- Prediction intervals
- Model fit diagnostics
Find here a Cheatsheet on statistical methods.
- My book chapter on simple linear regression.
- My book chapter on multiple linear regression.
- My book chapter on nonlinear regression.
- My book chapter on regularisation techniques.
- Task H
At a glance
Regression analysis is used to model and analyse the relationship between a dependent variable (response) and one or more independent variables (predictors). There is an expectation that one variable depends on or is influenced by the other. The data requirements for a regression analysis are:
- **Continuous dependent and independent variables**: Both the dependent variable and independent variables should be measured on a continuous scale (e.g. height, mass, light intensity). This requires that the two variables are paired (bivariate).
- **Linear relationship**: There should be a linear relationship between the dependent variable and independent variables. This can be assessed visually using scatter plots. If the relationship is not linear, you may need to consider non-linear regression or apply a data transformation.
- **Independence of observations**: The observations should be independent of each other. In the case of time series data or clustered data, this assumption may be violated, requiring specific regression techniques to account for the dependence (e.g. time series analysis, mixed-effects models).
- **Homoscedasticity**: The variance of the residuals (errors) should be constant across all levels of the independent variables. If the variances are not constant (heteroscedasticity), you may need to consider weighted least squares regression or other techniques to address this issue.
- **Normality of residuals**: The residuals should be approximately normally distributed. This can be assessed using diagnostic plots, such as a histogram of residuals or a Q-Q plot. If the residuals are not normally distributed, you may need to consider data transformations or more robust regression techniques such as GLMs.
- **No multicollinearity**: This applies to multiple regression, which will not be covered in BCB744. Independent variables should not be highly correlated with each other, as this can cause issues in estimating the unique effect of each predictor. You can assess multicollinearity using variance inflation factors (VIFs) or correlation matrices. If multicollinearity is an issue, you may need to consider removing or combining highly correlated variables, or using techniques such as ridge regression or principal component analysis.
- **Random sampling**: The data should be obtained through random sampling or random assignment, ensuring that each observation has an equal chance of being included in the sample.
Introduction to regressions
A linear regression, or model, shows the relationship between a continuous dependent (response) variable and one or more independent variables (drivers), at least one of which must also be continuous. It helps us understand how a change in the independent variable(s) is responsible for a change in the dependent variable. Linear models therefore imply a causal relationship between variables, and we say that the response, $y$, depends on (is a function of) the driver, $x$.
In statistics, ‘to model’ refers to the process of constructing a mathematical or statistical representation of a real-world phenomenon or system. The goal of modelling is to capture the essential features of the system or phenomenon in a simplified and structured form that can be analysed and understood.
A model can take many forms, such as an equation, a graph, a set of rules, or a simulation. The choice of model depends on the nature of the phenomenon being studied and the purpose of the analysis. For example, a linear regression model can be used to model the relationship between two continuous variables, while a logistic regression model can be used to model the probability of a binary outcome.
The process of modelling involves making assumptions about the relationship between variables, choosing an appropriate model structure, and estimating the model parameters based on data. Once a model has been constructed and estimated, it can be used to make predictions, test hypotheses, and gain insight into the underlying mechanisms of the phenomenon being studied.
Other variations of regressions you’ll encounter in biology include multiple regression, logistic regression, non-linear regression (such as the Michaelis-Menten model you learned about in BDC223), generalised linear models, generalised additive models, regression trees, and others. In this Chapter we will limit our encounters with regression models to simple linear regressions.
An example dataset
We use a dataset about sparrow wing length as a function of age. A graph of a linear regression model typically consists of a scatter plot, with each point representing a pair of observations (a value of the independent variable and the corresponding value of the dependent variable), overlaid with the fitted regression line.
The fitted line shows the relationship between the variables and offers an easy way to make predictions about the dependent variable for a given value of the independent variable. As we shall see later, we can also plot the residuals, which are the differences between the observed data points and the values predicted by the fitted line.
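To follow along in R, the data in the table at the top of this chapter can be entered directly. A minimal sketch, assuming the data frame and column names (`sparrows`, `age`, `wing`) that appear in the model output later in the chapter:

```r
# assemble the sparrow wing data (Zar 1999) from the table above
sparrows <- data.frame(
  age  = c(3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17),
  wing = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0)
)

# a quick scatter plot to check that the relationship looks linear
plot(wing ~ age, data = sparrows,
     xlab = "Age (days)", ylab = "Wing length (cm)")
```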
The simple linear regression
A simple linear regression relates one continuous dependent variable to a continuous independent variable. The linear regression equation is already known to you (Equation 1).
The linear regression:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \tag{1}$$

where $y_i$ is the $i$-th observation of the dependent variable, $x_i$ is the corresponding value of the independent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope (the regression coefficient), and $\epsilon_i$ is the residual error of the $i$-th observation.
Coefficients are population parameters (estimated from the sample data) that describe two properties of the straight line that best fits a scatter plot between a dependent variable and an independent continuous variable. The dependent variable, $y$, is the response we wish to explain, and the independent variable, $x$, is the predictor we think it responds to.
The regression parameters $\beta_0$ and $\beta_1$ are estimated from the data by finding the line that minimises the error sum of squares (Equation 2).
The error sum of squares:

$$SS_{error} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{2}$$

where $\hat{y}_i$ is the value of $y$ predicted by the fitted line at $x_i$.
The sparrow data set’s linear model is represented as:

$$\text{wing}_i = \beta_0 + \beta_1 \times \text{age}_i + \epsilon_i$$
When we perform a linear regression in R, it will output the model and the coefficients:
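The output below was produced by a call along these lines (a minimal sketch; the model object name `sparrows.lm` is the one used again further down):

```r
# fit the simple linear regression of wing length on age
sparrows.lm <- lm(wing ~ age, data = sparrows)

# print the model summary
summary(sparrows.lm)
```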
```
Call:
lm(formula = wing ~ age, data = sparrows)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.30699 -0.21538  0.06553  0.16324  0.22507 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.71309    0.14790   4.821 0.000535 ***
age          0.27023    0.01349  20.027 5.27e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2184 on 11 degrees of freedom
Multiple R-squared:  0.9733,    Adjusted R-squared:  0.9709 
F-statistic: 401.1 on 1 and 11 DF,  p-value: 5.267e-10
```
The summary output shows six components:
- **Call**: The regression model as specified in the function call.
- **Residuals**: Provide a quick view of the distribution of the residuals. The residuals will always have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
- **Coefficients**: The various regression coefficients, i.e. the intercept and the slope, together with their standard errors, t-values, and p-values.
Insight into the model accuracy is given by the Residual standard error (RSE), R-squared ($R^2$) and the F-statistic. These are metrics that are used to check how well the overall model fits our data.
We will look at some of these components in turn.
The model coefficients
The intercept
The intercept (more precisely, the $y$-intercept, $\beta_0$) is the value of the dependent variable when the independent variable is zero. In the sparrows model this is the expected wing length at an age of 0 days (0.713 cm, from the `Estimate` column of the output above), although an intercept does not always have a sensible biological interpretation.
There are several hypothesis tests associated with a simple linear regression. All of them assume that the residual error, $\epsilon$, is independent and identically distributed (i.i.d.) and normal, with a mean of zero and constant variance.
One of the tests looks at the significance of the intercept, i.e. it tests the H0 that $\beta_0 = 0$. The p-value for this test is given in the `Coefficients` table, in the row indicated by `(Intercept)`, under the `Pr(>|t|)` column. For the sparrows model this p-value is 0.000535, so we reject the H0 and conclude that the intercept is different from zero.
The regression coefficient
The interpretation of the regression coefficient, $\beta_1$ (the slope), is straightforward: it is the amount by which the dependent variable changes for each unit increase in the independent variable. Its estimate is given in the `Coefficients` table under `Estimate`, in the row called `age` (the latter name will of course depend on the name of the predictor column in your dataset). The coefficient of determination ($R^2$), which is closely related, is discussed below under Overall model accuracy.
The second hypothesis test performed when fitting a linear regression model concerns the regression coefficient. It asks whether there is a significant relationship (slope) of $y$ on $x$, i.e. it tests the H0 that $\beta_1 = 0$. The p-value for this test is given in the `Coefficients` table, in the column called `Pr(>|t|)`, in the row `age`. In the sparrows data, the p-value associated with the slope of `wing` on `age` is less than 0.05 and we therefore reject the H0 that there is no relationship between wing length and age (i.e. that $\beta_1 = 0$).
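If you want to work with these values rather than read them off the printed summary, the full coefficient table can be extracted as a matrix (a sketch, assuming the `sparrows.lm` object fitted above):

```r
# estimates, standard errors, t-values and p-values as a matrix;
# the "Pr(>|t|)" column holds the p-values for the intercept and slope
coef(summary(sparrows.lm))
```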
Residuals
The regression model’s residuals are the differences between the observed values, $y_i$, and the values predicted by the fitted model, $\hat{y}_i$.
Residuals are an important diagnostic tool for linear regression and many other models. If the residuals are randomly distributed around zero, it indicates that the model is a good fit for the data. However, if the residuals show a pattern or trend, such as a curve, S-, or U-shape, it may indicate that the model is not a good fit for the data and that additional variables or a more complex model may be needed.
The residuals also tell us if there are violations of assumptions, such as departures from normality or heteroscedastic variances. If the assumptions are not met, the model’s validity is brought into question. Additionally, outliers in the residuals can help to identify influential observations that may be driving the results of the regression analysis.
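A quick sketch of how to extract and inspect the residuals (assuming `sparrows.lm`):

```r
# extract the residuals and confirm that they average to (about) zero
sparrows.res <- residuals(sparrows.lm)
round(mean(sparrows.res), 8)

# plot the residuals against the fitted values to look for patterns
plot(fitted(sparrows.lm), sparrows.res,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```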
Overall model accuracy
There are a few things that tell us about the overall model fit: the residual standard error (RSE), the coefficient of determination ($R^2$), and the F-statistic.
Residual standard error (RSE) and root mean square error (RMSE)
The residual standard error (RSE) is a measure of the average amount that the response variable deviates from the regression line. It is calculated as the square root of the residual sum of squares divided by the degrees of freedom (Equation 3).
The RSE:

$$RSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}} \tag{3}$$

where $y_i$ is the $i$-th observed value, $\hat{y}_i$ is the corresponding fitted value, $n$ is the number of observations, and $n - 2$ is the residual degrees of freedom for a simple linear regression (two parameters, $\beta_0$ and $\beta_1$, are estimated).
The root mean square error (RMSE) is a similar measure, but it is calculated as the square root of the mean of the squared residuals. It is a measure of the standard deviation of the residuals (Equation 4).
The RMSE:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}} \tag{4}$$
The RSE and RMSE are similar but not identical; the difference lies in the denominator. The RSE takes into account the degrees of freedom, which becomes important when models with different numbers of parameters are compared. The RMSE, which simply divides by the number of observations, is more commonly used in machine learning and data mining, where the focus is on prediction accuracy rather than statistical inference.
Both the RSE and RMSE provide information about the amount of error in the model predictions, with smaller values indicating a better fit. However, both may be influenced by outliers or other sources of variability in the data. Use a variety of means to assess the model fit diagnostics.
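Both quantities are easy to compute from the fitted model (a sketch, assuming `sparrows.lm`):

```r
# RSE: square root of the residual sum of squares divided by the
# residual degrees of freedom (the value reported by summary())
sigma(sparrows.lm)

# RMSE: square root of the mean of the squared residuals
sqrt(mean(residuals(sparrows.lm)^2))
```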
R-squared ($R^2$)
The coefficient of determination, the $R^2$, is the proportion of the total variation in the dependent variable that is explained by the regression model. It is calculated as one minus the ratio of the residual sum of squares to the total sum of squares (Equation 5).
The $R^2$:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{5}$$

where $\bar{y}$ is the mean of the observed values.
Simply put, the $R^2$ indicates the proportion of the variation in the dependent variable that is accounted for by the model; it ranges from 0 (the model explains none of the variation) to 1 (the model explains all of it).

Note, however, that a high $R^2$ does not necessarily mean that a linear model is the correct description of the relationship, nor does it imply causation; the model assumptions still need to be checked.

Regressions may take on any relationship, not only a linear one. For example, there are parabolic, hyperbolic, logistic, exponential, etc. relationships of $y$ as a function of $x$.

In the case of our sparrows data, the $R^2$ is 0.97, i.e. age explains about 97% of the variation in wing length.

Sometimes you will also see something called the adjusted $R^2$. It penalises the $R^2$ for the number of predictors in the model and becomes important when comparing models with different numbers of predictors, as in multiple regression.
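Both versions can be extracted from the model summary (a sketch, assuming `sparrows.lm`):

```r
# proportion of the variance in wing length explained by age
summary(sparrows.lm)$r.squared

# the adjusted R-squared, penalised for the number of predictors
summary(sparrows.lm)$adj.r.squared
```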
F-statistic
The F-statistic (or F-value) is another measure of the overall significance of the model. It is used to test whether at least one of the independent variables in the model has a non-zero coefficient, indicating that it has a significant effect on the dependent variable.
It is calculated by taking the ratio of the mean square regression (MSR) to the mean square error (MSE) (Equation 6). The MSR measures the variation in the dependent variable that is explained by the independent variables in the model, while the MSE measures the variation in the dependent variable that is not explained by the independent variables.
Calculating the F-statistic:

$$F = \frac{MSR}{MSE} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 / p}{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - p - 1)} \tag{6}$$

where $p$ is the number of independent variables (here $p = 1$) and the other model components are as in Equation 3.
If the F-statistic is large and the associated p-value is small (typically less than 0.05), it indicates that at least one of the independent variables in the model has a significant effect on the dependent variable. In other words, the H0 that all the independent variables have zero coefficients can be rejected in favour of the Ha that at least one independent variable has a non-zero coefficient.
Note that a significant F-statistic does not necessarily mean that all the independent variables in the model are significant. Additional diagnostic tools, such as individual t-tests and residual plots, should be used to determine which independent variables are significant and whether the model is a good fit for the data.
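The F-statistic and its degrees of freedom can likewise be pulled out of the summary object (a sketch, assuming `sparrows.lm`):

```r
# a named vector with the F-value and its numerator and denominator df
summary(sparrows.lm)$fstatistic
```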
Fortunately, in this Chapter we will encounter linear regressions with only one independent variable. The situation where we deal with multiple independent variables is called multiple regression. We will encounter some multiple regression type models in Quantitative Ecology.
Confidence interval for linear regression
Confidence intervals (CI) are used to estimate the range of values within which the true value of a population parameter is likely to fall with a certain degree of confidence. Typically, in biology we use a 95% confidence interval. Confidence intervals around a linear regression model can be calculated for the intercept and slope coefficients, as well as for the predicted values of the dependent variable.
The confidence interval for the slope coefficient represents the range of likely values for the true slope of the linear relationship between the independent and dependent variables, given the data and the model assumptions. A confidence interval that does not include zero indicates that the slope coefficient is statistically significant at a given level of confidence, meaning that there is strong evidence of a non-zero effect of the independent variable on the dependent variable. In this case we do not accept the H0 that states that the true slope is zero.
The confidence interval for the predicted values of the dependent variable represents the range of likely values for the true value of the dependent variable at a given level of the independent variable. This can be useful for assessing the precision of the predictions made by the linear regression model, and for identifying any regions of the independent variable where the predictions are less reliable.
Again we have to observe the assumption of i.i.d. residuals as before. For a given value of the independent variable, the 95% confidence interval of the mean response is obtained with the predict() function, setting the argument interval = "confidence". The output returns the fitted mean wing length at that age (fit) together with the lower (lwr) and upper (upr) bounds of its 95% confidence interval.
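A minimal sketch of the call, assuming the `sparrows.lm` object from earlier and using an age of 80 days (the same value used in the prediction examples further down):

```r
# 95% confidence interval of the mean wing length at an age of 80 days
predict(sparrows.lm, newdata = data.frame(age = 80),
        interval = "confidence")
```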
Prediction interval for linear regression
Prediction intervals serve different purposes from confidence intervals, and they are also calculated differently. A prediction interval is used to estimate the range of likely values for a new (future) observation of the dependent variable, given a specific value of the independent variable. It takes into account both the variability of the dependent variable around the predicted mean response, as well as the uncertainty in the estimated coefficients of the model. Prediction intervals are wider than confidence intervals, as they account for the additional uncertainty due to the variability of the dependent variable. As always, we observe that the model assumptions (i.i.d. residuals) must be met.
The way we do this is similar to finding the confidence interval:
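A minimal sketch of the call that produces the output below, again assuming `sparrows.lm` and an age of 80 days:

```r
# 95% prediction interval for a new observation at an age of 80 days
predict(sparrows.lm, newdata = data.frame(age = 80),
        interval = "prediction")
```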
```
       fit      lwr     upr
1 22.33142 20.19353 24.4693
```
The difference between confidence and prediction intervals is subtle and requires some philosophical consideration. In practice, if you use these intervals to make inferences about the population from which the samples were drawn, use the prediction intervals. If you instead want to describe the samples which you have taken, use the confidence intervals.
Predicting from the linear model
Knowing the values of the coefficients $\beta_0$ (intercept) and $\beta_1$ (slope), we can predict the value of the dependent variable for any value of the independent variable by substituting them into Equation 1:
```r
# use the accessor function to grab the coefficients:
wing.coef <- coefficients(sparrows.lm)
wing.coef
```

```
(Intercept)         age 
  0.7130945   0.2702290 
```

```r
# what would the wing length be at an age of, say, 80 days?
age <- 80

# the first and second coef. can be accessed using the
# square bracket notation:
wing.pred <- (wing.coef[2] * age) + wing.coef[1]
wing.pred # the unit is cm
```

```
     age 
22.33142 
```
The prediction is that, at an age of 80 days, the wing length would be 22.331 cm. (Note that 80 days lies well beyond the range of ages in the data, so this extrapolation should be interpreted with caution.) This is the same value returned above using the predict() function.
We can predict more than one value. The predict() function takes a dataframe of values of the independent variable for which we want predictions and returns a vector of the corresponding predicted wing lengths:
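For example, a minimal sketch using a few arbitrary ages (the values chosen here are only for illustration):

```r
# predict wing lengths for several ages at once
new.ages <- data.frame(age = c(5, 10, 15, 20))
predict(sparrows.lm, newdata = new.ages)
```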
Diagnostic plots for examining the fit of a linear model
We may use several kinds of graphical displays to test the suitability of linear models for describing relationships in our data.
Plot of residuals vs. fitted values
A residual plot shows the residuals (the observed values, $y_i$, minus the values predicted by the linear model, $\hat{y}_i$) on the vertical axis, and the fitted values on the horizontal axis. Points scattered randomly around zero suggest that a linear model describes the data well; a clear pattern (e.g. a curve or funnel shape) suggests otherwise.
Plot of standardised residuals
We may use a plot of the standardised residuals vs. the fitted values, which is helpful for detecting heteroscedasticity (e.g. a systematic change in the spread of residuals over the range of predicted values).
Normal probability plot of residuals (Normal Q-Q plot)
A normal Q-Q plot compares the quantiles of the residuals with the quantiles of a theoretical normal distribution; if the residuals are normally distributed, the points fall approximately along a straight line.
Let’s see all these plots in action for the sparrows data. The package ggfortify has a convenient function to automagically make all of these graphs:
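A minimal sketch of that call, assuming the function meant is ggfortify’s autoplot() and the `sparrows.lm` object fitted earlier:

```r
library(ggfortify)

# produce the standard set of diagnostic plots for the fitted linear model
autoplot(sparrows.lm)
```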
One might also use the gg_diagnose() function (from the lindia package) to create all the various (above plus some!) diagnostic plots available for fitted linear models.
Diagnostic plots will be further explored in the exercises (see below).
References
Zar, J. H. (1999). Biostatistical Analysis. 4th ed. Upper Saddle River, NJ: Prentice Hall.
Reuse
Citation
@online{smit_2021,
  author = {Smit, A. J.},
  title = {9. {Simple} {Linear} {Regressions}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/09-regressions.html},
  langid = {en}
}