
Simple Linear Regression

Part 2

AJ Smit

2020/06/25 (updated: 2024-03-26)


Assumptions of linear regressions

As with t-tests and ANOVAs, we require that some assumptions are met:

  • Normally distributed data
  • Homogeneity of variances of the errors (residuals)
  • Independence of data

Of course, another assumption is that there is a linear relationship between the predictor and the predictand (the response). But maybe the relationship is better described by a polynomial function (i.e. a model that accommodates some kind of 'bendiness' in the best-fit line added to the data points), or some other non-linear model (e.g. logarithmic, exponential, logistic, Michaelis-Menten, Gompertz, or a non-parametric Generalised Additive Model).
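
As a quick illustration (a sketch only, assuming the lungs data introduced in Part 1 and used later in these slides), we could compare a straight-line fit against a second-order polynomial and let an information criterion arbitrate:

mod_lin <- lm(LungCap ~ Age, data = lungs)
mod_poly <- lm(LungCap ~ poly(Age, 2), data = lungs) # allows some 'bendiness'
AIC(mod_lin, mod_poly) # the model with the lower AIC is better supported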

Outliers also have a huge effect on linear regressions.

In Part 2 of the Simple Linear Regression lectures we will look at some of the ways we can go about testing these assumptions.


First things first

In ANOVAs and t-tests, we can generally assess the normality of the data and the homogeneity of variances before doing the tests. With regressions we can only do this after fitting the model.

For t-tests, ANOVAs, and linear regressions, we know upfront whether or not the data satisfy the independence criterion.

As described in Part 1, we start by fitting the model. We then check the overall model significance (the result of the F-test), the significance of the β coefficient and, if needed, the significance of the α term. It is also useful to assess the r² value. These statistics describe the fitted values. To see them, simply fit the model and create a plot with the fitted line.

Only then do we check for normality and homogeneity of variances, and for this we use the residuals.
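
If mod is not still in your workspace from Part 1, it can be refitted as below (a sketch that assumes, as in Part 1, the lungs data with LungCap as the response and Age as the predictor):

mod <- lm(LungCap ~ Age, data = lungs)
summary(mod) # F-statistic, the α and β coefficients, and the r² value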


As we have seen before, the fitted values can be accessed as follows (see attributes(mod)):

head(mod$fitted)
R> 1 2 3 4 5 6
R> 4.415948 10.954129 9.864432 8.774735 3.871100 7.140190

The fitted values are the values that describe the path of the best-fit line:

ggplot(data = lungs, aes(x = Age, y = LungCap)) +
  geom_point(shape = 1, colour = "red3") +
  geom_line(aes(y = mod$fitted.values), colour = "blue3") +
  labs(x = "Age", y = "Lung capacity") + theme_pubr()


ggplot2 offers a direct and convenient way of fitting a linear model, as we did on the previous slide, but the summary of the model fit will be missing:

ggplot(lungs, aes(Age, LungCap)) +
  geom_point(shape = 1, colour = "red3") +
  stat_smooth(method = "lm", se = TRUE, colour = "blue3", linewidth = 0.2) + # CIs around the LM
  labs(x = "Age", y = "Lung capacity") + theme_pubr()


The residuals are what remains after the fitted linear trend has been removed from the data, i.e. the differences between the observed and the fitted values. Calling attributes(mod) will show you how to access the residuals.

head(mod$residuals)
R> 1 2 3 4 5 6
R> 2.0590518 -0.8291289 -0.3144321 2.3502647 0.9289002 -0.9151901
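
We can confirm that the residuals are simply the observed values minus the fitted values (assuming the same lungs data as before):

all.equal(mod$residuals, lungs$LungCap - mod$fitted.values) # should be TRUE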

Slightly adapting the plotting code from the previous slides gives us a plot of the residuals:

ggplot(data = lungs, aes(x = Age)) +
  geom_point(aes(y = mod$residuals), shape = 1, colour = "red3") +
  labs(x = "Age", y = "Residual") + theme_pubr()


Testing the assumptions

library(ggfortify)
autoplot(mod, colour = "salmon", shape = 1, size = 0.2, ncol = 2, which = c(1:2)) + theme_pubr()

  • Residuals vs Fitted: Is the relationship linear? You want to see a horizontal line without distinct bumps and deviations from the horizontal.
  • Normal Q-Q: Are the residuals normally distributed? The points should be on a straight line.
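
For reference, base R produces equivalent diagnostic panels without ggfortify; the which numbers correspond to those used above:

plot(mod, which = c(1, 2)) # Residuals vs Fitted, then Normal Q-Q
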
autoplot(mod, colour = "salmon", shape = 1, size = 0.2, ncol = 2, which = c(3, 5)) + theme_pubr()

  • Scale-Location: Are the residuals homoscedastic? The fitted line must be horizontal, and the standardised residuals (points) must be spread equally far above/below the line along the length of the fitted line.
  • Residuals vs Leverage: Are there outliers? Look out for the labelled points -- the numbers correspond to the rows in the data frame that contain the outlying values.
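
The visual checks can be complemented with formal tests. A sketch (shapiro.test() is in base R; bptest() requires the lmtest package):

shapiro.test(residuals(mod)) # H0: the residuals are normally distributed
lmtest::bptest(mod) # Breusch-Pagan test; H0: the residuals are homoscedastic
which(cooks.distance(mod) > 4 / nrow(lungs)) # a common rule-of-thumb flag for influential points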

Questions

  • What would be the slope of a linear model fitted to the residuals?
  • What would be the intercept of a linear model fitted to the residuals?
  • State the null hypotheses for the intercept and slope for this linear model, and provide statistical support for accepting/not accepting the linear model fitted to the residuals.
  • What is the significance of the overall model fit?
  • Why (to all of the above)?
  • Create a plot of the fitted line added to the scatterplot of residuals.
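
As a starting point for answering these questions (a sketch reusing the data and plotting code from the earlier slides):

mod_res <- lm(mod$residuals ~ lungs$Age) # inspect the slope, intercept, and overall fit
summary(mod_res)

ggplot(data = lungs, aes(x = Age)) +
  geom_point(aes(y = mod$residuals), shape = 1, colour = "red3") +
  geom_smooth(aes(y = mod$residuals), method = "lm", se = FALSE, colour = "blue3") +
  labs(x = "Age", y = "Residual") + theme_pubr()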
