class: center, middle, inverse, title-slide

.title[
# Correlations
]
.subtitle[
## Part 1
]
.author[
### AJ Smit
]
.date[
### 2020/06/28 (updated: 2024-03-26)
]

---

## Correlation

For more details about correlation, please visit <https://ajsmit.github.io/Basic_stats/correlations.html>.

Another take on correlations can be seen at:

- <https://rpubs.com/aaronsc32/anova-compare-more-than-two-groups>

---

## What is a correlation for?

- To investigate the strength of a potential association between two (or more) variables.
- There is no requirement that one variable *causes* a response in the other (unlike regression; see Chapter 8).

The research question is, "Is X related to Y?" or "Does X predict Y?"

---

## What is the nature of the data?

- Paired variables, but neither is designated dependent or independent.
- One is continuous; the other can be continuous or ordinal.

---

## What is the correlation coefficient?

Correlation is denoted by `\(r\)`, the correlation coefficient, which ranges from -1 to 1.

- The closer the coefficient is to 1, the stronger the *positive* correlation; i.e., as X increases, so does Y.
- Coefficients closer to -1 represent a stronger *negative* correlation; i.e., as X increases, Y decreases.
- The closer `\(r\)` is to `\(0\)`, the weaker the correlation between the variables; `\(r = 0\)` indicates no linear correlation between the two variables.

The correlation coefficient, `\(r\)`, should not be confused with the coefficient of determination, `\(r^{2}\)` or `\(R^{2}\)`, reported for regressions.

---

## What assumptions do the data need to fulfil?

Like all statistical tests, correlation comes with a set of assumptions. We also require that the data are i) paired (each X observation must have an associated Y value), and ii) free of outliers.

Assumptions:

1. The association must be approximately linear
2. The samples follow independent normal distributions (but see below)
3.
The requirement for homoscedasticity

There are two main types of correlations, depending on the nature of the data:

1. Continuous normal data (Pearson's Product Moment correlation)
2. Ordinal data, which may be non-normal (Spearman's *rho* correlation, or Kendall's *tau* correlation)

---
class: center, middle

# Examples

---

## What do the data look like?

```r
data(iris)
head(iris)
```

```
R>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
R> 1          5.1         3.5          1.4         0.2  setosa
R> 2          4.9         3.0          1.4         0.2  setosa
R> 3          4.7         3.2          1.3         0.2  setosa
R> 4          4.6         3.1          1.5         0.2  setosa
R> 5          5.0         3.6          1.4         0.2  setosa
R> 6          5.4         3.9          1.7         0.4  setosa
```

---

```r
summary(iris)
```

```
R>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
R>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
R>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
R>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
R>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
R>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
R>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
R>        Species  
R>  setosa    :50  
R>  versicolor:50  
R>  virginica :50  
R>                 
R>                 
R>                 
```

The data are continuous and each X value has a corresponding Y value. If the data are normally distributed and homoscedastic, we can apply a Pearson's Product Moment correlation.

---

## Visually, what is the association between `Sepal.Length` and `Sepal.Width`?

.left-column[
- `\(x\)`: Sepal.Length
- `\(y\)`: Sepal.Width

Let's examine *Iris setosa* only.
]

.right-column[
```r
setosa <- iris %>%
  filter(Species == "setosa") %>%
  select(-Species)

ggplot(data = setosa, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(shape = 1, colour = "red3") +
  labs(x = "Length", y = "Width") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

---

## What about the assumptions?

**1.
The association must be approximately linear.**

From the plot on the previous slide, yes, the relationship is linear. If the scatter plot had shown a curved pattern, we would be dealing with a nonlinear association between the two variables.

---

**2. The samples follow independent normal distributions.**

For each variable, X and Y, use the Shapiro-Wilk normality test: `shapiro.test()`

```r
shapiro.test(setosa$Sepal.Length)
```

```
R> 
R> 	Shapiro-Wilk normality test
R> 
R> data:  setosa$Sepal.Length
R> W = 0.9777, p-value = 0.4595
```

```r
shapiro.test(setosa$Sepal.Width)
```

```
R> 
R> 	Shapiro-Wilk normality test
R> 
R> data:  setosa$Sepal.Width
R> W = 0.97172, p-value = 0.2715
```

Above we see that the two *p*-values are > 0.05, so the distributions of the data do not differ significantly from a normal distribution. We can assume normality.

---

We can also assess the normality assumption through visual inspection of Q-Q plots (quantile-quantile plots). A Q-Q plot draws the correlation between a given sample and the normal distribution. To do so, we can use `ggpubr::ggqqplot()`:

```r
plt_a <- ggqqplot(setosa$Sepal.Length, ylab = "Sepal Length") # a ggpubr function
plt_b <- ggqqplot(setosa$Sepal.Width, ylab = "Sepal Width")
ggarrange(plt_a, plt_b, ncol = 2) # a ggpubr function
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-5-1.png" width="648" style="display: block; margin: auto;" />

Looking at the plots, we can conclude that both sets of samples follow normal distributions.

---

**3. The requirement for homoscedasticity**

Fit a line of best fit, and see whether the values lie evenly above and below the line:

```r
ggplot(data = setosa, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_pubr()
```

<img src="data:image/png;base64,#BCB744_Correlations_slides--1-_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />

Yes, everything seems in order.
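
---

## Computing the correlation

With all three assumptions satisfied, a Pearson's Product Moment correlation is appropriate for these data. As a minimal sketch (assuming the `setosa` data frame created earlier is available), base R's `cor.test()` returns the correlation coefficient `\(r\)`, a *t*-statistic, and a *p*-value:

```r
# Pearson's correlation between sepal length and width for Iris setosa;
# method = "pearson" is the default, shown here for clarity
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "pearson")
```

By default, `cor.test()` also reports a 95% confidence interval around `\(r\)`; if only the coefficient itself is needed, `cor()` returns it without the accompanying hypothesis test.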
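
---

## Rank-based alternatives

When the data are ordinal, contain outliers, or fail the normality assumption, Spearman's *rho* or Kendall's *tau* can be used instead. A sketch (again assuming the `setosa` data frame from earlier), using the `method` argument of `cor.test()`:

```r
# Spearman's rho: Pearson's correlation applied to the ranks of the values
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "spearman")

# Kendall's tau: based on concordant and discordant pairs of observations
cor.test(setosa$Sepal.Length, setosa$Sepal.Width, method = "kendall")
```

Because these methods operate on ranks, they make no assumption of normality or homoscedasticity, at the cost of some statistical power when the data really are normal.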