Correlations

Part 2
AJ Smit
2020/06/28 (updated: 2024-03-26)

Doing the correlation

What function do we use to do the correlation?

The most basic function for calculating a correlation coefficient is cor().

Let's find some help on the function first:

?cor # or,
help(cor)


How do we do the Pearson Product Moment correlation?

The equation for the Pearson's correlation coefficient is:

$r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2} \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}}$

Where $\bar{x}$ and $\bar{y}$ are the means for the X and Y variables, respectively.
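
As a rough check, the equation can be evaluated by hand and compared with cor(). This sketch assumes that setosa holds the four numeric columns for the setosa rows of R's built-in iris data. The object was created earlier in the course, so the subset() line below is only a stand-in for it.

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data
setosa <- subset(iris, Species == "setosa", select = -Species)

x <- setosa$Sepal.Length
y <- setosa$Sepal.Width

# Numerator: sum of the cross-products of deviations from the means
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the two sums of squares
denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual <- numerator / denominator
r_manual

# The hand calculation should agree with the built-in function
all.equal(r_manual, cor(x, y))
```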

The default method for cor() is "pearson", so we may omit the method argument:

with(setosa, cor(Sepal.Length, Sepal.Width))

R> [1] 0.7425467

What does the output mean?

The output is simple to interpret: it is a single number, the correlation coefficient. cor() reports no p-value and performs no hypothesis test. It simply tells us about the strength and direction of the association between the two variables.


A more comprehensive correlation

We can do a more detailed correlation using cor.test():

with(setosa, cor.test(x = Sepal.Length, y = Sepal.Width))

R> 
R>     Pearson's product-moment correlation
R> 
R> data:  Sepal.Length and Sepal.Width
R> t = 7.6807, df = 48, p-value = 6.71e-10
R> alternative hypothesis: true correlation is not equal to 0
R> 95 percent confidence interval:
R>  0.5851391 0.8460314
R> sample estimates:
R>       cor 
R> 0.7425467
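
The t statistic and p-value in this output follow directly from r and the sample size, via $t = r \sqrt{n - 2} / \sqrt{1 - r^{2}}$ with $n - 2$ degrees of freedom. The sketch below reproduces them by hand. As before, it assumes setosa is the setosa subset of the built-in iris data, and the names t_manual and p_manual are only illustrative.

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data
setosa <- subset(iris, Species == "setosa", select = -Species)

res <- cor.test(setosa$Sepal.Length, setosa$Sepal.Width)

r <- unname(res$estimate)  # the correlation coefficient
n <- nrow(setosa)          # number of paired observations

# t statistic on n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)

c(t = t_manual, df = n - 2, p = p_manual)

# The pieces of the test result can also be extracted directly
res$conf.int
```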


What if we want to see the association between all the variables in the `setosa` dataset?

For this we can rely on the cor() function again:

setosa_pearson <- cor(setosa)
setosa_pearson

R>              Sepal.Length Sepal.Width Petal.Length Petal.Width
R> Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
R> Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
R> Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
R> Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000
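
The matrix is symmetric, so each pair appears twice and the diagonal is always 1. One possible way to list each pair once, strongest association first, is sketched below. It assumes setosa is the setosa subset of the built-in iris data, and idx and pairs_long are illustrative names.

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data
setosa <- subset(iris, Species == "setosa", select = -Species)
setosa_pearson <- cor(setosa)

# Row/column positions of the cells above the diagonal (one per pair)
idx <- which(upper.tri(setosa_pearson), arr.ind = TRUE)

pairs_long <- data.frame(
  var1 = rownames(setosa_pearson)[idx[, "row"]],
  var2 = colnames(setosa_pearson)[idx[, "col"]],
  r    = setosa_pearson[idx]
)

# Sort by the absolute value of r, strongest first
pairs_long <- pairs_long[order(-abs(pairs_long$r)), ]
pairs_long
```

With four variables this gives choose(4, 2) = 6 rows.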


How would we visualise all these associations?

For the setosa data, above, we calculated the association for each pair of the following columns: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. Four variables give six pairwise correlations, and creating a separate visualisation for each pair would be tedious. The problem grows quickly with more variables, but we can plot all pairs at once. As an example, we use a dataset with nine numeric variables:

library(tidyverse) # provides read_csv(), select(), and the %>% pipe

ecklonia <- read_csv("../data/ecklonia.csv") %>% 
  select(-species, -site, -ID)
head(ecklonia)

R> # A tibble: 6 × 9
R>   stipe_length stipe_diameter frond_length digits primary_blade_width
R>          <dbl>          <dbl>        <dbl>  <dbl>               <dbl>
R> 1          456           23.5          116      6                15  
R> 2          477           27            141      6                20  
R> 3          427           17.5          144      7                10  
R> 4          347           22.5          127      5                12  
R> 5          470           17            160      5                11  
R> 6          478           17.5          181      4                10.5
R> # ℹ 4 more variables: primary_blade_length <dbl>, stipe_mass <dbl>,
R> #   frond_mass <dbl>, epiphyte_length <dbl>


ecklonia_pearson <- cor(ecklonia)
library(corrplot)
corrplot(ecklonia_pearson, method = "circle", type = "lower",
         number.digits = 2, addCoef.col = "salmon", tl.col = "black")


pairs(~ stipe_length + stipe_diameter + frond_length + primary_blade_length +
        primary_blade_width + stipe_mass + frond_mass,
      data = ecklonia)
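
A scatterplot matrix can also show the coefficients alongside the scatterplots. The sketch below is one way to do this with a custom panel function; panel_cor is a hypothetical helper, not part of base R. Because the ecklonia file is not bundled with R, the example uses the setosa subset of the built-in iris data as a stand-in.

```r
# Assumption: the setosa subset of the built-in iris data stands in
# for the ecklonia data
setosa <- subset(iris, Species == "setosa", select = -Species)

# Hypothetical helper: print Pearson's r in a panel instead of points
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))  # use a 0-1 coordinate system
  on.exit(par(old))                # restore the panel's coordinates
  r <- cor(x, y)
  text(0.5, 0.5, sprintf("%.2f", r))
}

# Scatterplots below the diagonal, coefficients above it
pairs(setosa, upper.panel = panel_cor)
```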


What if we had ordinal data instead of continuous data?

Ordinal data are categorical data whose levels have a natural order; in other words, the observations can be ranked. Let us create a test dataset:

lungs <- read_tsv("../data/LungCapData.csv") %>%
  mutate(size_class_intervals = cut(Height, breaks = 4),
         size_class = cut(Height, breaks = 4,
                          labels = c("infant", "toddler", "adolescent", "teen"),
                          ordered_result = TRUE))
head(lungs)

R> # A tibble: 6 × 8
R>   LungCap   Age Height Smoke Gender Caesarean size_class_intervals size_class
R>     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <fct>                <ord>     
R> 1    6.48     6   62.1 no    male   no        (54.4,63.5]          toddler   
R> 2   10.1     18   74.7 yes   female no        (72.7,81.8]          teen      
R> 3    9.55    16   69.7 no    female yes       (63.5,72.7]          adolescent
R> 4   11.1     14   71   no    male   no        (63.5,72.7]          adolescent
R> 5    4.8      5   56.9 no    male   no        (54.4,63.5]          toddler   
R> 6    6.22    11   58.7 no    female no        (54.4,63.5]          toddler

For more information about creating categorical data from continuous data, see https://www.youtube.com/watch?v=EWs1Ordh8nI.


is.ordered(lungs$size_class)

R> [1] TRUE

head(as.numeric(lungs$size_class), 111)

R>   [1] 2 4 3 3 2 2 2 3 3 2 4 3 2 2 2 2 2 2 2 3 2 2 3 3 3 2 2 3 1 3 1 3 3 3 3 3 1
R>  [38] 3 3 4 3 3 2 3 1 2 2 3 4 2 3 4 3 2 4 1 3 4 2 3 2 3 2 2 3 3 2 2 3 3 2 3 1 3
R>  [75] 2 2 3 2 3 3 2 3 3 4 2 2 4 2 2 4 3 3 4 3 4 4 3 3 4 2 3 3 2 4 3 1 4 3 2 1 3
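
The integers printed above are the positions of each observation's level within the ordered factor. A small sketch with made-up heights (the values in heights are hypothetical, not taken from the lung capacity data) shows how cut() assigns the classes and how as.numeric() turns them into codes that can be ranked:

```r
# Hypothetical heights, for illustration only
heights <- c(48, 58, 66, 72, 78)

# Four equal-width bins across the range of the data, as ordered levels
size_class <- cut(heights, breaks = 4,
                  labels = c("infant", "toddler", "adolescent", "teen"),
                  ordered_result = TRUE)
size_class

# Ordered factors support comparisons between levels
size_class[1] < size_class[4]

# The underlying integer codes: 1 = lowest level, 4 = highest
as.numeric(size_class)
```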


The equation for Spearman's rho is:

$\rho = \frac{\sum_{i = 1}^{n} (x_{i}^{'} - \bar{x}^{'}) (y_{i}^{'} - \bar{y}^{'})}{\sqrt{\sum_{i = 1}^{n} (x_{i}^{'} - \bar{x}^{'})^{2} \sum_{i = 1}^{n} (y_{i}^{'} - \bar{y}^{'})^{2}}}$

Where $x_{i}^{'}$ and $y_{i}^{'}$ are the ranks for each observation in the X and Y variables, respectively, and $\bar{x}^{'}$ and $\bar{y}^{'}$ are the means of those ranks.

How do we apply a Spearman's rho correlation by ranks?

cor.test(as.numeric(lungs$size_class), lungs$LungCap, method = "spearman")

R> 
R>     Spearman's rank correlation rho
R> 
R> data:  as.numeric(lungs$size_class) and lungs$LungCap
R> S = 9358124, p-value < 2.2e-16
R> alternative hypothesis: true rho is not equal to 0
R> sample estimates:
R>       rho 
R> 0.8526579
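
Spearman's rho is Pearson's correlation calculated on the ranks rather than on the raw values. The sketch below checks this on two columns of the built-in iris data, used here as a stand-in because the lung capacity file is not bundled with R.

```r
# Assumption: two columns of the built-in iris data stand in for
# the lung capacity data
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Spearman's rho from the built-in method
rho_builtin <- cor(x, y, method = "spearman")

# Pearson's r on the ranks (tied values receive their average rank)
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

c(builtin = rho_builtin, on_ranks = rho_ranks)
all.equal(rho_builtin, rho_ranks)
```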


Questions

Find your own data set suitable for a correlation analysis.

State the null and alternative hypotheses. Do a Pearson's correlation. Explain the findings.
Do the tests necessary to evaluate the various assumptions for the above analysis.
Create all the associated figures for the above analysis.
Transform one of the variables to ordinal data, and do a Spearman's rho or Kendall's tau correlation. Explain the findings.


Do a full correlation analysis on the full iris dataset (i.e. all three species):
- state all null hypotheses;
- test all assumptions;
- create all necessary plots;
- write a few sentences on the findings.
