class: center, middle, inverse, title-slide

.title[
# Correlations
]
.subtitle[
## Part 2
]
.author[
### AJ Smit
]
.date[
### 2020/06/28 (updated: 2024-03-26)
]

---
class: center, middle

# Doing the correlation

---

## What function do we use to do the correlation?

The most basic function for calculating a correlation is `cor()`. Let's find some help on the function first:

```r
?cor
# or, help(cor)
```

---

## How do we do the Pearson Product Moment correlation?

The equation for Pearson's correlation coefficient is:

`$$r = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^2\sum_{i=1}^{n}(y_{i}-\bar{y})^2}}$$`

Where `\(\bar{x}\)` and `\(\bar{y}\)` are the means of the X and Y variables, respectively.

The default for `cor()` is to compute a Pearson correlation, so we may omit the `method` argument:

```r
with(setosa, cor(Sepal.Length, Sepal.Width))
```

```
R> [1] 0.7425467
```

## What does the output mean?

The output and interpretation are simple. There are no associated *p*-values and no associated hypothesis tests. The coefficient simply tells us about the strength and direction of the association between the two variables.

---

## A more comprehensive correlation

We can do a more detailed correlation, which adds a hypothesis test and a confidence interval, using `cor.test()`:

```r
with(setosa, cor.test(x = Sepal.Length, y = Sepal.Width))
```

```
R> 
R>     Pearson's product-moment correlation
R> 
R> data:  Sepal.Length and Sepal.Width
R> t = 7.6807, df = 48, p-value = 6.71e-10
R> alternative hypothesis: true correlation is not equal to 0
R> 95 percent confidence interval:
R>  0.5851391 0.8460314
R> sample estimates:
R>       cor 
R> 0.7425467 
```

---

## What if we want to see the association between all the variables in the `setosa` dataset?
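The `setosa` data frame is not created in this part of the slides. A minimal sketch of one way to rebuild it, assuming it is the *setosa* subset of R's built-in `iris` data with the non-numeric `Species` column dropped (an assumption, since `cor()` only accepts numeric columns):

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data,
# keeping only the four numeric measurement columns
setosa <- subset(iris, Species == "setosa", select = -Species)

# Check the structure: 50 observations of 4 numeric variables
str(setosa)
```
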
For this we can rely on the `cor()` function again:

```r
setosa_pearson <- cor(setosa)
setosa_pearson
```

```
R>              Sepal.Length Sepal.Width Petal.Length Petal.Width
R> Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
R> Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
R> Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
R> Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000
```

---

## How would we visualise all these associations?

In the Iris dataset, above, we compared associations for each pair of the following columns: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`. This required six pairwise correlations, and creating a separate visualisation for each of the six pairs would be a pain. With a dataset that has more variables, such as `ecklonia`, we can visualise all the pairs at once:

```r
library(tidyverse) # provides read_csv(), select() and %>%

# Keep only the numeric columns; cor() cannot handle character columns
ecklonia <- read_csv("../data/ecklonia.csv") %>%
  select(-species, -site, -ID)
head(ecklonia)
```

```
R> # A tibble: 6 × 9
R>   stipe_length stipe_diameter frond_length digits primary_blade_width
R>          <dbl>          <dbl>        <dbl>  <dbl>               <dbl>
R> 1          456           23.5          116      6                15
R> 2          477           27            141      6                20
R> 3          427           17.5          144      7                10
R> 4          347           22.5          127      5                12
R> 5          470           17            160      5                11
R> 6          478           17.5          181      4                10.5
R> # ℹ 4 more variables: primary_blade_length <dbl>, stipe_mass <dbl>,
R> #   frond_mass <dbl>, epiphyte_length <dbl>
```

---

```r
# Correlation matrix for all pairs of columns
ecklonia_pearson <- cor(ecklonia)

library(corrplot)
corrplot(ecklonia_pearson, method = "circle", type = "lower",
         number.digits = 2, addCoef.col = "salmon", tl.col = "black")
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--2-_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---

```r
# Scatterplot matrix of the selected variables
pairs(~ stipe_length + stipe_diameter + frond_length +
        primary_blade_length + primary_blade_width +
        stipe_mass + frond_mass,
      data = ecklonia)
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--2-_files/figure-html/unnamed-chunk-8-1.png" width="720" style="display: block; margin: auto;" />

---

## What if we had ordinal data instead of continuous data?
Ordinal data are ordered categorical factors; in other words, the data are rank ordered. Let us create a test dataset:

```r
lungs <- read_tsv("../data/LungCapData.csv") %>%
  mutate(size_class_intervals = as.factor(cut(Height, breaks = 4)),
         # Same four bins, but labelled and stored as an ordered factor
         size_class = cut(Height, breaks = 4,
                          labels = c("infant", "toddler", "adolescent", "teen"),
                          ordered_result = TRUE))
head(lungs)
```

```
R> # A tibble: 6 × 8
R>   LungCap   Age Height Smoke Gender Caesarean size_class_intervals size_class
R>     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <fct>                <ord>
R> 1    6.48     6   62.1 no    male   no        (54.4,63.5]          toddler
R> 2   10.1     18   74.7 yes   female no        (72.7,81.8]          teen
R> 3    9.55    16   69.7 no    female yes       (63.5,72.7]          adolescent
R> 4   11.1     14   71   no    male   no        (63.5,72.7]          adolescent
R> 5    4.8      5   56.9 no    male   no        (54.4,63.5]          toddler
R> 6    6.22    11   58.7 no    female no        (54.4,63.5]          toddler
```

For more information about creating categorical data from continuous data, see <https://www.youtube.com/watch?v=EWs1Ordh8nI>.

---

```r
is.ordered(lungs$size_class)
```

```
R> [1] TRUE
```

```r
head(as.numeric(lungs$size_class), 111)
```

```
R>  [1] 2 4 3 3 2 2 2 3 3 2 4 3 2 2 2 2 2 2 2 3 2 2 3 3 3 2 2 3 1 3 1 3 3 3 3 3 1
R> [38] 3 3 4 3 3 2 3 1 2 2 3 4 2 3 4 3 2 4 1 3 4 2 3 2 3 2 2 3 3 2 2 3 3 2 3 1 3
R> [75] 2 2 3 2 3 3 2 3 3 4 2 2 4 2 2 4 3 3 4 3 4 4 3 3 4 2 3 3 2 4 3 1 4 3 2 1 3
```

---

The equation for Spearman's *rho* is:

`$$\rho = \frac{\sum_{i=1}^{n}(x_{i}'-\bar{x}')(y_{i}'-\bar{y}')}{\sqrt{\sum_{i=1}^{n}(x_{i}'-\bar{x}')^2\sum_{i=1}^{n}(y_{i}'-\bar{y}')^2}}$$`

Where `\(x_{i}'\)` and `\(y_{i}'\)` are the ranks of each observation in the X and Y variables, respectively, and `\(\bar{x}'\)` and `\(\bar{y}'\)` are the means of those ranks.

## How do we apply Spearman's *rho* correlation by ranks?
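Spearman's *rho* is the Pearson correlation computed on the ranks of the observations rather than on their raw values. A minimal sketch of that equivalence, using two columns of R's built-in `iris` data as stand-ins (an assumption for illustration; any two numeric vectors would do):

```r
# Assumption: two iris columns stand in for any pair of numeric variables
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Spearman's rho as computed by cor()
rho_builtin <- cor(x, y, method = "spearman")

# The same quantity by hand: rank each variable (ties get average ranks),
# then compute an ordinary Pearson correlation on the ranks
rho_by_hand <- cor(rank(x), rank(y), method = "pearson")

# TRUE when the two approaches agree
all.equal(rho_builtin, rho_by_hand)
```

The same test applied to the `lungs` data:
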
```r
cor.test(as.numeric(lungs$size_class), lungs$LungCap, method = "spearman")
```

```
R> 
R>     Spearman's rank correlation rho
R> 
R> data:  as.numeric(lungs$size_class) and lungs$LungCap
R> S = 9358124, p-value < 2.2e-16
R> alternative hypothesis: true rho is not equal to 0
R> sample estimates:
R>       rho 
R> 0.8526579 
```

---

.left-column[## Questions]

.right-column[
Find your own data set suitable for a correlation analysis.

- State the null and alternative hypotheses. Do a Pearson's correlation. Explain the findings.
- Do the tests necessary to evaluate the various assumptions for the above analysis.
- Create all the associated figures for the above analysis.
- Transform one of the variables to ordinal data, and do a Spearman's *rho* or Kendall's *tau* correlation. Explain the findings.
]

---

.right-column[
- Do a full correlation analysis on the full `iris` dataset (i.e. all three species):
    - state all null hypotheses;
    - test all assumptions;
    - create all necessary plots;
    - write a few sentences on the findings.
]
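
---

Kendall's *tau*, mentioned in the questions, is requested through the same `method` argument of `cor.test()`. A minimal sketch, assuming a variable from R's built-in `iris` data is binned into an ordered factor as a stand-in for real ordinal data (the column choice and the name `sepal_class` are hypothetical):

```r
# Assumption: bin a continuous iris column into four ordered classes,
# as a stand-in for a real ordinal variable
sepal_class <- cut(iris$Sepal.Length, breaks = 4, ordered_result = TRUE)

# Kendall's tau between the ordinal classes and a continuous variable
kendall_result <- cor.test(as.numeric(sepal_class), iris$Petal.Length,
                           method = "kendall")
kendall_result
```

Binned data contain many tied ranks, which is a common reason to prefer Kendall's *tau* over Spearman's *rho*.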