class: center, middle, inverse, title-slide

.title[
# Simple Linear Regression
]
.subtitle[
## Part 1
]
.author[
### AJ Smit
]
.date[
### 2020/06/25 (updated: 2024-03-26)
]

---

## Simple Linear Regression

For more details about a Simple Linear Regression, please visit <https://ajsmit.github.io/Basic_stats/simple-linear-regressions.html>

Other takes on linear regressions can be seen at:

- <https://www.youtube.com/watch?v=66z_MRwtFJM>
- <https://rpubs.com/aaronsc32/simple-linear-regression>
- <https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals>
- <https://rpubs.com/aaronsc32/regression-through-the-origin>
- <https://rpubs.com/aaronsc32/multiple-regression>

Sometimes a Simple Linear Regression is called an Ordinary Least Squares (OLS) regression.

---

## What is a Simple Linear Regression for?

For examining the *causal dependence* of one continuous variable on an independent continuous variable. The research question is, “Does Y *depend on* X?” or "Is a change in Y *caused by* a change in X?"

---

## What is the nature of the data?

- **Independent variable:** a *continuous (numeric or double) variable*, e.g. time, age, mass, length, concentration, etc.
- **Dependent variable:** also a *continuous variable*, e.g. mass, length, number of leaves, concentration, number of individuals, etc.
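
---

### What might such data look like?

As a minimal sketch of the kind of variables described above, one might simulate a continuous independent variable (age) and a continuous dependent variable (mass) in R. The variable names and coefficients here are made up, for illustration only:

```r
# Hypothetical example: simulate a continuous predictor (age) and a
# continuous response (mass); the coefficients are invented
set.seed(13)
age <- runif(50, min = 1, max = 20)         # independent variable
mass <- 2.5 * age + 10 + rnorm(50, sd = 5)  # dependent variable, with noise
head(data.frame(age, mass))
```

Both columns are numeric (`dbl`), which is the structure a Simple Linear Regression expects.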
---

## The Simple Linear Regression equation

`$$y_{n}=\beta \cdot x_{n}+\alpha+\epsilon_{n}$$`

Where,

- `\(y_{1..n}\)`: *dependent variable*, also called the response or outcome variable
- `\(x_{1..n}\)`: *independent variable*, also called the predictor
- `\(\alpha\)`: *intercept* term, describes where the fitted line intercepts the *y*-axis
- `\(\beta\)`: *slope*, the 'inclination' or 'steepness' of the line
- `\(\epsilon_{1..n}\)`: *residual variation*, the amount of variation not explained by a linear relationship of `\(y\)` on `\(x\)`

---

## Minimising the Sum of Squares

The parameters `\(\alpha\)` and `\(\beta\)` are determined by *minimising the sum of squares* of the error term, `\(\epsilon\)`:

`$$error~SS=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}$$`

Where,

- `\(y_{i}\)` is the `\(i\)`-th observed response, and
- `\(\hat{y}_{i}\)` is the predicted response after fitting the linear regression

By minimising the error SS, a linear regression finds the optimal (best-fit) line: the one that minimises the squared vertical distances between the observed and fitted values. See the animation provided at the link on the next slide.

---

## Animation of Minimising Error Sum of Squares

To see an animation demonstrating the minimisation of the error sum of squares, click [here](https://raw.githubusercontent.com/ajsmit/Basic_stats/master/figures/lm_rotate.avi). The corresponding code for the animation may be found [here](https://github.com/ajsmit/Basic_stats/tree/master/data).

---

class: center, middle

# Example: Lung Capacity Data

---

### What do the data look like?
```r
library(tidyverse)
library(ggpubr) # provides theme_pubr(), used in the plots below

lungs <- read_tsv("../data/LungCapData.csv") # a tab-separated file, despite the .csv extension
head(lungs)
```

```
R> # A tibble: 6 × 6
R>   LungCap   Age Height Smoke Gender Caesarean
R>     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>
R> 1    6.48     6   62.1 no    male   no
R> 2   10.1     18   74.7 yes   female no
R> 3    9.55    16   69.7 no    female yes
R> 4   11.1     14   71   no    male   no
R> 5    4.8      5   56.9 no    male   no
R> 6    6.22    11   58.7 no    female no
```

---

### What is the relationship between age and lung capacity?

- `\(x\)`: age, the independent variable
- `\(y\)`: lung capacity, the dependent variable

We do a visual examination of the data first:

```r
ggplot(data = lungs, aes(x = Age, y = LungCap)) +
  geom_point(shape = 1, colour = "red3") +
  labs(x = "Age", y = "Lung capacity") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Linear_regression_slides--1-_files/figure-html/unnamed-chunk-2-1.png" width="216" style="display: block; margin: auto;" />

---

### FYI, what is Pearson's correlation coefficient?

```r
cor(lungs$Age, lungs$LungCap)
```

```
R> [1] 0.8196749
```

---

### What function do we use to fit the linear regression?

We fit a linear regression (sometimes we say 'fit a linear model') using the **`lm()`** function. Let's find some help on the function first:

```r
?lm # or, help(lm)
```

---

### How do we fit the model?

```r
mod <- lm(LungCap ~ Age, data = lungs)
summary(mod)
```

```
R> 
R> Call:
R> lm(formula = LungCap ~ Age, data = lungs)
R> 
R> Residuals:
R>     Min      1Q  Median      3Q     Max 
R> -4.7799 -1.0203 -0.0005  0.9789  4.2650 
R> 
R> Coefficients:
R>             Estimate Std. Error t value Pr(>|t|)    
R> (Intercept)  1.14686    0.18353   6.249 7.06e-10 ***
R> Age          0.54485    0.01416  38.476  < 2e-16 ***
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> 
R> Residual standard error: 1.526 on 723 degrees of freedom
R> Multiple R-squared:  0.6719, Adjusted R-squared:  0.6714 
R> F-statistic:  1480 on 1 and 723 DF,  p-value: < 2.2e-16
```

---

### What does the output mean?
- **(Intercept):** estimate of the `\(y\)`-intercept, `\(\alpha\)`
- **Age:** estimate of the slope, `\(\beta\)`

### What do the *p*-values tell us?

- for `\(\alpha\)`: `\(H_{0}\)`, there is no difference between the estimate of the `\(y\)`-intercept and 0
- for `\(\beta\)`: `\(H_{0}\)`, there is no difference between the estimate of the slope and 0

### What else in the output is of importance?

- **Standard errors of estimates:** a measure of the precision (uncertainty) of the estimates of `\(\alpha\)` and `\(\beta\)`
- **Adjusted `\(r^{2}\)`:** the coefficient of determination, the proportion of the variation in `\(y\)` that is explained by a straight line with the fitted slope and intercept; it is a measure of how well the model fits the data
- **F-statistic, d.f., and *p*-value:** overall model fit

---

### What are the attributes of the linear regression object, `mod`?

```r
attributes(mod)
```

```
R> $names
R>  [1] "coefficients"  "residuals"     "effects"       "rank"
R>  [5] "fitted.values" "assign"        "qr"            "df.residual"
R>  [9] "xlevels"       "call"          "terms"         "model"
R> 
R> $class
R> [1] "lm"
```

We can extract some of the named attributes for further use, e.g.:

```r
mod$coef
```

```
R> (Intercept)         Age 
R>   1.1468578   0.5448484 
```

---

### How do we add the regression line to the plot we made earlier?

```r
ggplot(data = lungs, aes(x = Age, y = LungCap)) +
  geom_point(shape = 1, colour = "red3") +
  geom_line(aes(y = mod$fitted.values), colour = "blue3") +
  labs(x = "Age", y = "Lung capacity") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Linear_regression_slides--1-_files/figure-html/unnamed-chunk-8-1.png" width="216" style="display: block; margin: auto;" />

---

### How do we find confidence intervals (CIs) for the model fit?
```r
confint(mod)
```

```
R>                 2.5 %    97.5 %
R> (Intercept) 0.7865454 1.5071702
R> Age         0.5170471 0.5726497
```

```r
confint(mod, level = 0.90)
```

```
R>                  5 %      95 %
R> (Intercept) 0.844593 1.4491226
R> Age         0.521526 0.5681708
```

CIs give the range of values that can be taken, with a certain level of confidence (usually 95% in the biological sciences), to contain the true population parameter (here, `\(\alpha\)` and `\(\beta\)`). We will return to CIs in Chapter 10.

---

A Simple Linear Regression is similar to an ANOVA (the latter looks at the dependence of a continuous response variable as a function of a categorical independent variable); as such, we can create an ANOVA table for the linear model:

```r
anova(mod)
```

```
R> Analysis of Variance Table
R> 
R> Response: LungCap
R>            Df Sum Sq Mean Sq F value    Pr(>F)    
R> Age         1 3447.0  3447.0  1480.4 < 2.2e-16 ***
R> Residuals 723 1683.5     2.3                      
R> ---
R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This output is similar to the F-test seen in the output of `summary(mod)`. It tests the significance of the overall model fit.

---

### How do we use the regression model to make predictions?

Aside from determining if there is a causal dependence between two variables, Simple Linear Regressions may also be used to predict the response given some input. For example, for our linear model, `mod`, what does it predict the lung capacity will be for people aged 13, 15, and 17 years old?

```r
# create a df with a column called Age (as per the input data)
pred <- data.frame(Age = c(13, 15, 17))
pred
```

```
R>   Age
R> 1  13
R> 2  15
R> 3  17
```

```r
predict(mod, pred)
```

```
R>         1         2         3 
R>  8.229887  9.319584 10.409280 
```

---

.left-column[
## Questions
]
.right-column[
- What is the unit of `\(\alpha\)`?
- What is the unit of `\(\beta\)`?
]
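
---

### Predictions with confidence intervals

`predict()` can also return a confidence (or prediction) interval around each predicted value via its `interval` argument. A minimal self-contained sketch, using simulated data (the variables `x`, `y`, and the coefficients are made up, not the lung capacity data):

```r
# Simulated data for illustration; coefficients are invented
set.seed(7)
dat <- data.frame(x = 1:50)
dat$y <- 1.1 + 0.54 * dat$x + rnorm(50, sd = 1.5)

m <- lm(y ~ x, data = dat)

# predicted values for new inputs, with 95% confidence intervals
predict(m, newdata = data.frame(x = c(13, 15, 17)),
        interval = "confidence")
```

This returns a matrix with columns `fit`, `lwr`, and `upr`; `interval = "prediction"` gives the (wider) interval expected to contain a single new observation rather than the mean response.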