class: center, middle, inverse, title-slide

.title[
# Correlations
]
.subtitle[
## Part 1
]
.author[
### AJ Smit
]
.date[
### 2020/06/28 (updated: 2024-03-26)
]

---

## Correlation

For more details about correlation, please visit <https://ajsmit.github.io/Basic_stats/correlations.html>.

Another take on correlations can be seen at:

- <https://rpubs.com/aaronsc32/anova-compare-more-than-two-groups>

---

## What is a correlation for?

- To investigate the strength of a potential association between two (or more) variables.
- There is no requirement that one variable *causes* a response in the other (unlike regression; see Chapter 8).

The research question is, "Is X related to Y?" or "Does X predict Y?"

---

## What is the nature of the data?

- Paired variables, but neither is designated dependent or independent.
- One is continuous; the other can be continuous or ordinal.

---

## What is the correlation coefficient?

Correlation is denoted by `\(r\)`, the correlation coefficient, which ranges from -1 to 1.

- The closer the coefficient is to 1, the stronger the *positive* correlation; i.e., as X increases, so does Y.
- Coefficients closer to -1 represent a stronger *negative* correlation; i.e., as X increases, Y decreases.
- The closer `\(r\)` is to `\(0\)`, the weaker the correlation between the variables; `\(r = 0\)` indicates no linear correlation between the two variables.

The correlation coefficient, `\(r\)`, should not be confused with the coefficient of determination, `\(r^{2}\)` or `\(R^{2}\)`, reported for regressions.

---

## What assumptions do the data need to fulfil?

Like all statistical tests, correlation comes with a set of assumptions. We also require that the data are i) paired (each X observation must have an associated Y value), and ii) free of outliers.

Assumptions:

1. The association must be approximately linear
2. The samples follow independent normal distributions (but see below)
3.
The requirement for homoscedasticity

There are two main types of correlations, depending on the nature of the data:

1. Continuous normal data (Pearson's Product Moment correlation)
2. Ordinal data, which may be non-normal (Spearman's *rho* correlation, or Kendall's *tau* correlation)

---
class: center, middle

# Examples

---

## What do the data look like?

```r
data(iris)
head(iris)
```

```
R>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
R> 1          5.1         3.5          1.4         0.2  setosa
R> 2          4.9         3.0          1.4         0.2  setosa
R> 3          4.7         3.2          1.3         0.2  setosa
R> 4          4.6         3.1          1.5         0.2  setosa
R> 5          5.0         3.6          1.4         0.2  setosa
R> 6          5.4         3.9          1.7         0.4  setosa
```

---

```r
summary(iris)
```

```
R>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
R>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
R>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
R>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
R>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
R>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
R>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
R>        Species  
R>  setosa    :50  
R>  versicolor:50  
R>  virginica :50  
R>                 
R>                 
R>                 
```

The data are continuous and each X value has a corresponding Y value. If the data are normally distributed and homoscedastic, we can apply a Pearson's Product Moment correlation.

---

## Visually, what is the association between `Sepal.Length` and `Sepal.Width`?

.left-column[
- `\(x\)`: Sepal.Length
- `\(y\)`: Sepal.Width

Let's examine *Iris setosa* only.
]

.right-column[
```r
setosa <- iris %>%
  filter(Species == "setosa") %>%
  select(-Species)

ggplot(data = setosa, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(shape = 1, colour = "red3") +
  labs(x = "Length", y = "Width") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

---

## What about the assumptions?

**1.
The association must be approximately linear.**

From the plot on the previous slide, yes, the relationship is linear. If the scatter plot had shown a curved pattern, we would be dealing with a nonlinear association between the two variables.

---

**2. The samples follow independent normal distributions.**

For each variable, X and Y, use the Shapiro-Wilk normality test: `shapiro.test()`

```r
shapiro.test(setosa$Sepal.Length)
```

```
R> 
R> 	Shapiro-Wilk normality test
R> 
R> data:  setosa$Sepal.Length
R> W = 0.9777, p-value = 0.4595
```

```r
shapiro.test(setosa$Sepal.Width)
```

```
R> 
R> 	Shapiro-Wilk normality test
R> 
R> data:  setosa$Sepal.Width
R> W = 0.97172, p-value = 0.2715
```

Above we see that the two *p*-values are > 0.05, so the distributions of the data do not differ significantly from a normal distribution. We can assume normality.

---

We can also assess the normality assumption through visual inspection of Q-Q plots (quantile-quantile plots). A Q-Q plot draws the correlation between a given sample and the normal distribution. To do so, we can use `ggpubr::ggqqplot()`:

```r
plt_a <- ggqqplot(setosa$Sepal.Length, ylab = "Sepal Length") # a ggpubr function
plt_b <- ggqqplot(setosa$Sepal.Width, ylab = "Sepal Width")
ggarrange(plt_a, plt_b, ncol = 2) # a ggpubr function
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-5-1.png" width="648" style="display: block; margin: auto;" />

Looking at the plots, we can conclude that both sets of samples follow normal distributions.

---

**3. The requirement for homoscedasticity**

Fit a line of best fit, and see whether the values lie evenly above and below the line:

```r
ggplot(data = setosa, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />

Yes, everything seems in order.
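
---

## Computing the correlation

With all three assumptions satisfied, a Pearson's Product Moment correlation is appropriate for these data. As a minimal sketch (assuming the `setosa` data frame created earlier is available), base R's `cor.test()` returns the correlation coefficient `\(r\)`, a *t*-statistic, and a *p*-value:

```r
# Pearson's correlation between sepal length and width for Iris setosa;
# method = "pearson" is the default, shown here for clarity
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "pearson")
```

By default, `cor.test()` also reports a 95% confidence interval around `\(r\)`; if only the coefficient itself is needed, `cor()` returns it without the accompanying hypothesis test.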
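
---

## Rank-based alternatives

When the data are ordinal, contain outliers, or fail the normality assumption, Spearman's *rho* or Kendall's *tau* can be used instead. A sketch (again assuming the `setosa` data frame from earlier), using the `method` argument of `cor.test()`:

```r
# Spearman's rho: Pearson's correlation applied to the ranks of the values
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "spearman")

# Kendall's tau: based on concordant and discordant pairs of observations
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "kendall")
```

Because these methods operate on ranks, they make no assumption of normality or homoscedasticity, at the cost of some statistical power when the data really are normal.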