
Correlations

Part 1

AJ Smit

2020/06/28 (updated: 2024-03-26)


Correlation

For more details about Correlation, please visit https://ajsmit.github.io/Basic_stats/correlations.html.

Another take on correlations can be seen at:


What is a correlation for?

  • To investigate the strength of a potential association between two (or more) variables.
  • No requirement that one variable causes a response in the other (unlike regression; see Chapter 8).

The research question is, “Is X related to Y?” or “Does X predict Y?”


What is the nature of the data?

  • Paired variables, but neither is designated as dependent or independent.
  • One variable is continuous; the other can be continuous or ordinal.

What is the correlation coefficient?

Correlation is denoted by r, the correlation coefficient, which ranges from -1 to 1.

  • The closer the coefficient is to 1, the stronger the positive correlation; i.e., as X increases, so does Y.
  • The closer the coefficient is to -1, the stronger the negative correlation; i.e., as X increases, Y decreases.
  • As r approaches 0, the correlation between the variables weakens; r = 0 indicates no correlation between the two variables.

The correlation coefficient, r, should not be confused with the coefficient of determination, r² or R², used in regression.
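As a quick illustration of these limits, r can be computed in R with the base cor() function. The vectors below are made-up values for illustration, not data from the slides:

```r
# Made-up paired observations, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # rises almost linearly with x

cor(x, y)    # close to  1: strong positive correlation
cor(x, -y)   # close to -1: strong negative correlation
```

cor() defaults to Pearson's Product Moment correlation.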


What assumptions do the data need to fulfil?

Like all statistical tests, correlation comes with a set of assumptions. We also require that the data are i) paired (each X observation must have an associated Y value), and ii) free of outliers.

Assumptions:

  1. The association must be approximately linear
  2. The samples follow independent normal distributions (but see below)
  3. The requirement for homoscedasticity

There are two main types of correlations, depending on the nature of the data:

  1. Continuous normal data (Pearson's Product Moment correlation)
  2. Ordinal data, may be non-normal (Spearman's rho correlation, or Kendall's tau correlation)
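These two families map directly onto the method argument of R's cor() (and cor.test()). A minimal sketch on simulated data (the seed and vectors are made up for illustration):

```r
set.seed(13)                        # made-up simulated data
x <- rnorm(30)
y <- 0.8 * x + rnorm(30, sd = 0.5)  # y tracks x with some noise

cor(x, y, method = "pearson")       # continuous, normally distributed data
cor(x, y, method = "spearman")      # rank-based (Spearman's rho)
cor(x, y, method = "kendall")       # rank-based (Kendall's tau)
```

The rank-based methods require only ordinal data and are robust to non-normality and outliers.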

Examples


What do the data look like?

data(iris)
head(iris)
R>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
R> 1          5.1         3.5          1.4         0.2  setosa
R> 2          4.9         3.0          1.4         0.2  setosa
R> 3          4.7         3.2          1.3         0.2  setosa
R> 4          4.6         3.1          1.5         0.2  setosa
R> 5          5.0         3.6          1.4         0.2  setosa
R> 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
R>  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
R>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
R>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
R>  Median :5.800   Median :3.000   Median :4.350   Median :1.300
R>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
R>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
R>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
R>        Species
R>  setosa    :50
R>  versicolor:50
R>  virginica :50

The data are continuous and each X value has a corresponding Y value. If the data are also normally distributed and homoscedastic, we can apply a Pearson's Product Moment correlation.


Visually, what is the association between Sepal.Length and Sepal.Width?

  • x: Sepal.Length
  • y: Sepal.Width

Let's examine Iris setosa only.

library(dplyr)   # for %>%, filter(), select()
library(ggplot2)
library(ggpubr)  # for theme_pubr()

setosa <- iris %>%
  filter(Species == "setosa") %>%
  select(-Species)

ggplot(data = setosa, aes(x = Sepal.Length,
                          y = Sepal.Width)) +
  geom_point(shape = 1, colour = "red3") +
  labs(x = "Length", y = "Width") +
  theme_pubr()


What about the assumptions?

1. The association must be approximately linear.

From the plot on the previous slide, yes, the relationship is linear. Had the scatter plot shown a curved pattern, we would be dealing with a nonlinear association between the two variables.


2. The samples follow independent normal distributions.

For each variable, X and Y, use the Shapiro-Wilk normality test, shapiro.test():

shapiro.test(setosa$Sepal.Length)
R>
R> Shapiro-Wilk normality test
R>
R> data: setosa$Sepal.Length
R> W = 0.9777, p-value = 0.4595
shapiro.test(setosa$Sepal.Width)
R>
R> Shapiro-Wilk normality test
R>
R> data: setosa$Sepal.Width
R> W = 0.97172, p-value = 0.2715

Above we see that both p-values are > 0.05, hence the distributions of the data are not significantly different from a normal distribution. We can assume normality.


We can also assess the normality assumption through visual inspection of Q-Q plots (quantile-quantile plots). A Q-Q plot draws the correlation between a given sample and the normal distribution. To do so, we can use ggpubr::ggqqplot():

plt_a <- ggqqplot(setosa$Sepal.Length, ylab = "Sepal Length") # a ggpubr function
plt_b <- ggqqplot(setosa$Sepal.Width, ylab = "Sepal Width")
ggarrange(plt_a, plt_b, ncol = 2) # a ggpubr function

Looking at the plots, we can conclude that both sets of samples follow normal distributions.


3. The requirement for homoscedasticity

Fit a line of best fit and see whether the values lie evenly above and below the line:

ggplot(data = setosa, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_pubr()

Yes, everything seems in order.
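With all three assumptions checked, the slides stop short of running the test itself; it would be a one-liner with R's cor.test() (using base-R subset() here as an equivalent of the dplyr filtering shown earlier):

```r
data(iris)
# Keep Iris setosa only, dropping the Species column
setosa <- subset(iris, Species == "setosa", select = -Species)

# Pearson's Product Moment correlation with a significance test
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "pearson")
# r is about 0.74 with a very small p-value: a significant positive correlation
```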

