
Correlations

Part 2

AJ Smit

2020/06/28 (updated: 2024-03-26)


Doing the correlation


What function do we use to do the correlation?

The most basic function for calculating a correlation coefficient is cor().

Let's find some help on the function first:

?cor # or,
help(cor)

How do we do the Pearson Product Moment correlation?

The equation for the Pearson's correlation coefficient is:

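$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where $\bar{x}$ and $\bar{y}$ are the means of the X and Y variables, respectively.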

cor() calculates a Pearson correlation by default, so we may omit the method argument:

with(setosa, cor(Sepal.Length, Sepal.Width))
R> [1] 0.7425467
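
To connect the equation to the function call, the coefficient can be calculated by hand. This is a minimal sketch, assuming that setosa holds the four numeric columns of the setosa rows of R's built-in iris data (the slides create this object elsewhere); the names numerator, denominator and r_manual are introduced here for illustration only.

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data,
# restricted to its four numeric columns.
setosa <- iris[iris$Species == "setosa", 1:4]

x <- setosa$Sepal.Length
y <- setosa$Sepal.Width

# Numerator: sum of the cross-products of the deviations from the means
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the two sums of squares
denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual <- numerator / denominator
r_manual

# The hand calculation should agree with the built-in function
all.equal(r_manual, cor(x, y))
```

If the assumption about setosa holds, r_manual should match the value that cor() reports above.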

What does the output mean?

The output and its interpretation are simple. There are no associated p-values or hypothesis tests; the value simply tells us about the strength and direction of the association between the two variables.


A more comprehensive correlation

We can do a more detailed correlation using cor.test():

with(setosa, cor.test(x = Sepal.Length, y = Sepal.Width))
R>
R> Pearson's product-moment correlation
R>
R> data: Sepal.Length and Sepal.Width
R> t = 7.6807, df = 48, p-value = 6.71e-10
R> alternative hypothesis: true correlation is not equal to 0
R> 95 percent confidence interval:
R> 0.5851391 0.8460314
R> sample estimates:
R> cor
R> 0.7425467
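
The printed summary is an object whose parts can be pulled out individually, which helps when reporting results. The sketch below assumes the same setosa subset of the built-in iris data as elsewhere in these slides; res, r, n and t_manual are illustrative names. It also recomputes the t statistic from r and the sample size using the standard relationship t = r * sqrt(n - 2) / sqrt(1 - r^2).

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data.
setosa <- iris[iris$Species == "setosa", 1:4]

res <- with(setosa, cor.test(x = Sepal.Length, y = Sepal.Width))

res$estimate   # the correlation coefficient
res$parameter  # degrees of freedom (n - 2)
res$p.value    # p-value for the null hypothesis of zero correlation
res$conf.int   # 95% confidence interval for the correlation

# Recompute the t statistic from r and n
r <- unname(res$estimate)
n <- nrow(setosa)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

all.equal(t_manual, unname(res$statistic))
```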

What if we want to see the association between all the variables in the setosa dataset?

For this we can rely on the cor() function again:

setosa_pearson <- cor(setosa)
setosa_pearson
R>              Sepal.Length Sepal.Width Petal.Length Petal.Width
R> Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
R> Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
R> Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
R> Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000
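
Because the matrix is symmetric, each coefficient appears twice. One possible way to list every pair only once is to keep the upper triangle, as in the sketch below. It assumes the same setosa subset of the built-in iris data; idx and pair_table are illustrative names, not objects used elsewhere in the slides.

```r
# Assumption: `setosa` is the setosa subset of the built-in iris data.
setosa <- iris[iris$Species == "setosa", 1:4]
setosa_pearson <- cor(setosa)

# Row and column positions of the cells above the diagonal
idx <- which(upper.tri(setosa_pearson), arr.ind = TRUE)

# One row per unique pair of variables
pair_table <- data.frame(
  var1 = rownames(setosa_pearson)[idx[, "row"]],
  var2 = colnames(setosa_pearson)[idx[, "col"]],
  r    = setosa_pearson[idx],
  row.names = NULL
)

# Strongest associations first
pair_table[order(-abs(pair_table$r)), ]
```

Four variables give choose(4, 2) = 6 unique pairs, so pair_table has six rows.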

How would we visualise all these associations?

In the iris data above, we compared each pair of the columns Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. That gives six pairwise correlations, and plotting each pair separately would be tedious. We can visualise them all at once; the example below uses the ecklonia data:

library(tidyverse)

# Read the data and drop the non-numeric identifier columns
ecklonia <- read_csv("../data/ecklonia.csv") %>%
  select(-species, -site, -ID)
head(ecklonia)
R> # A tibble: 6 × 9
R>   stipe_length stipe_diameter frond_length digits primary_blade_width
R>          <dbl>          <dbl>        <dbl>  <dbl>               <dbl>
R> 1          456           23.5          116      6                15
R> 2          477           27            141      6                20
R> 3          427           17.5          144      7                10
R> 4          347           22.5          127      5                12
R> 5          470           17            160      5                11
R> 6          478           17.5          181      4                10.5
R> # ℹ 4 more variables: primary_blade_length <dbl>, stipe_mass <dbl>,
R> #   frond_mass <dbl>, epiphyte_length <dbl>
ecklonia_pearson <- cor(ecklonia)
library(corrplot)
corrplot(ecklonia_pearson, method = "circle", type = "lower",
         number.digits = 2, addCoef.col = "salmon", tl.col = "black")

pairs(~ stipe_length + stipe_diameter + frond_length + primary_blade_length +
        primary_blade_width + stipe_mass + frond_mass,
      data = ecklonia)
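
A scatterplot matrix can also display the coefficients themselves. The sketch below passes a custom panel function to pairs(), adapted from the pattern in the pairs() help page. Since ecklonia.csv is not bundled with R, it uses the setosa subset of the built-in iris data as a stand-in (an assumption); panel_cor is a hypothetical helper name.

```r
# Assumption: the setosa subset of iris stands in for the ecklonia data.
setosa <- iris[iris$Species == "setosa", 1:4]

# Panel function that prints Pearson's r in the centre of a panel
panel_cor <- function(x, y, ...) {
  old_usr <- par("usr")
  on.exit(par(usr = old_usr))   # restore the panel's coordinate system
  par(usr = c(0, 1, 0, 1))      # temporary 0-1 coordinates for the text
  r <- cor(x, y, use = "complete.obs")
  text(0.5, 0.5, sprintf("%.2f", r))
}

# Scatterplots with smoothers below the diagonal, coefficients above it
pairs(setosa, lower.panel = panel.smooth, upper.panel = panel_cor)
```

The same call should work on ecklonia once the data are loaded, because every remaining column is numeric.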


What if we had ordinal data instead of continuous data?

Ordinal data are ordered categorical factors; in other words, the data are rank ordered. Let us create a test dataset:

lungs <- read_tsv("../data/LungCapData.csv") %>%
  mutate(size_class_intervals = cut(Height, breaks = 4),
         size_class = cut(Height, breaks = 4,
                          labels = c("infant", "toddler", "adolescent", "teen"),
                          ordered_result = TRUE))
head(lungs)
R> # A tibble: 6 × 8
R>   LungCap   Age Height Smoke Gender Caesarean size_class_intervals size_class
R>     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <fct>                <ord>
R> 1    6.48     6   62.1 no    male   no        (54.4,63.5]          toddler
R> 2   10.1     18   74.7 yes   female no        (72.7,81.8]          teen
R> 3    9.55    16   69.7 no    female yes       (63.5,72.7]          adolescent
R> 4   11.1     14   71   no    male   no        (63.5,72.7]          adolescent
R> 5    4.8      5   56.9 no    male   no        (54.4,63.5]          toddler
R> 6    6.22    11   58.7 no    female no        (54.4,63.5]          toddler

For more information about creating categorical data from continuous data, see https://www.youtube.com/watch?v=EWs1Ordh8nI.

is.ordered(lungs$size_class)
R> [1] TRUE
head(as.numeric(lungs$size_class), 111)
R> [1] 2 4 3 3 2 2 2 3 3 2 4 3 2 2 2 2 2 2 2 3 2 2 3 3 3 2 2 3 1 3 1 3 3 3 3 3 1
R> [38] 3 3 4 3 3 2 3 1 2 2 3 4 2 3 4 3 2 4 1 3 4 2 3 2 3 2 2 3 3 2 2 3 3 2 3 1 3
R> [75] 2 2 3 2 3 3 2 3 3 4 2 2 4 2 2 4 3 3 4 3 4 4 3 3 4 2 3 3 2 4 3 1 4 3 2 1 3
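
The integer codes come from the order of the factor levels. The small sketch below illustrates how cut() bins a continuous variable into an ordered factor and how as.numeric() then returns the level positions. The heights vector and the class labels are made-up values for illustration, not taken from the lung capacity data.

```r
# Hypothetical heights, chosen only to illustrate the binning
heights <- c(48, 55, 62, 70, 78, 81)

# Four equal-width bins, stored as an ordered factor
size_class <- cut(heights, breaks = 4,
                  labels = c("class_1", "class_2", "class_3", "class_4"),
                  ordered_result = TRUE)

is.ordered(size_class)  # the factor carries an ordering
levels(size_class)      # the levels, from lowest to highest
as.numeric(size_class)  # the position of each observation's level
```

Because the codes are only rank positions, the distance between adjacent classes carries no meaning, which is why a rank-based correlation is appropriate for such data.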

The equation for Spearman's rho is:

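$$\rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where $x_i$ and $y_i$ are the ranks of each observation in the X and Y variables, respectively, and $\bar{x}$ and $\bar{y}$ are the means of those ranks.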

How do we apply a Spearman's rho correlation by ranks?

cor.test(as.numeric(lungs$size_class), lungs$LungCap, method = "spearman")
R>
R> Spearman's rank correlation rho
R>
R> data: as.numeric(lungs$size_class) and lungs$LungCap
R> S = 9358124, p-value < 2.2e-16
R> alternative hypothesis: true rho is not equal to 0
R> sample estimates:
R> rho
R> 0.8526579
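
Since the equation for Spearman's rho is the Pearson equation applied to ranks, the two can be checked against each other. This sketch assumes the setosa subset of the built-in iris data as example input, because the lung capacity file is not bundled with R; rho_builtin and rho_ranks are illustrative names.

```r
# Assumption: the setosa subset of iris stands in for the lungs data.
setosa <- iris[iris$Species == "setosa", 1:4]

x <- setosa$Sepal.Length
y <- setosa$Petal.Length

# Spearman's rho as reported by cor()
rho_builtin <- cor(x, y, method = "spearman")

# Pearson's r calculated on the ranks (ties receive average ranks)
rho_ranks <- cor(rank(x), rank(y))

all.equal(rho_builtin, rho_ranks)

# Kendall's tau is another rank-based option
cor(x, y, method = "kendall")
```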

Questions

Find your own data set suitable for a correlation analysis.

  • State the null and alternative hypotheses. Do a Pearson's correlation. Explain the findings.
  • Do the tests necessary to evaluate the various assumptions for the above analysis.
  • Create all the associated figures for the above analysis.
  • Transform one of the variables to ordinal data, and do a Spearman's rho or Kendall's tau correlation. Explain the findings.
  • Do a full correlation analysis on the full iris dataset (i.e. all three species):

    • state all null hypotheses;

    • test all assumptions;

    • create all necessary plots;

    • write a few sentences on the findings.
