9. Correlation and Association
Quantifying Relationships Without Imposing a Response Model
- correlation as a measure of association
- choosing Pearson, Spearman, or Kendall
- correlation matrices and heatmaps
- correlation and non-independence
- correlation versus regression
- Self-Assessment Task 9-1 (/10)
- Self-Assessment Task 9-2 (/20)
- Self-Assessment instructions and full task overview
In the previous chapters, I asked whether means differ among groups. Correlation asks a different question: do two variables vary together?
Here we concern ourselves with association rather than group comparison. Correlation is the last of the staple inferential tools and prepares the transition into regression. It does not impose a response model; it simply measures the strength and direction of association between two variables. As we shall see later, regression adds expectations about the roles of the variables: one variable is the response and the other is the predictor.
Correlation coefficients are effect sizes. Their sign shows direction and their magnitude shows how tightly the variables vary together. Coefficients vary from -1.0 (perfect inverse correlation), to 0 (no association), to 1.0 (perfect positive correlation).
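The end points of this range are easy to check on toy data. The sketch below uses made-up vectors (not the chapter's data) to show a perfect positive line, a perfect inverse line, and a symmetric U-shape. In the last case Pearson's coefficient is zero even though the two variables are clearly related, so a value near zero is best read as "no linear association" rather than "no relationship at all".

```r
# Toy vectors for illustration only (not from the chapter's data)
x <- 1:10

r_pos <- cor(x, 2 * x + 1)    # points lie exactly on an increasing line
r_neg <- cor(x, -3 * x + 5)   # points lie exactly on a decreasing line

# A symmetric U-shape: strongly related, but with no linear trend
u <- -5:5
r_u <- cor(u, u^2)

round(c(r_pos = r_pos, r_neg = r_neg, r_u = r_u), 3)
```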
1 Choosing the Appropriate Correlation
The main decision we will face is the form of the relationship.
- Use Pearson’s correlation when the relationship is approximately linear and the main question concerns linear co-variation.
- Use Spearman’s rank correlation when the relationship is monotonic but not especially linear, or when ranked order is more informative than raw spacing.
- Use Kendall’s rank correlation when the emphasis is on concordance in rank order, especially with ordinal data, many tied ranks, or a direct question about agreement between rankings.
Start with a scatterplot and then decide whether the pattern is:
- roughly linear;
- monotonic but curved or unevenly spaced;
- clustered, outlier-driven, or structured by site, time, or some other grouping.
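To see how the three coefficients respond to the form of a relationship, the sketch below uses a made-up, strictly increasing but strongly curved pattern. The vectors are illustrative assumptions, not data from this chapter. Because the ordering is perfect, both rank-based coefficients reach 1, whereas Pearson's coefficient falls short of 1 because the points do not lie on a straight line.

```r
# Hypothetical monotonic but curved relationship
x <- 1:20
y <- exp(x / 3)

r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```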
Association does not imply causation. Correlation can reveal co-variation, but it cannot identify the mechanism behind it.
1.1 How to Read Magnitude
Rough verbal labels can help, but they are only a starting point:
- around 0.1 to 0.3: weak association;
- around 0.3 to 0.7: moderate association;
- above 0.7: strong association.
Those labels are only heuristics. A coefficient of 0.4 may be unremarkable in one system and biologically substantial in another, so it is highly context specific. A moderate correlation may be biologically useful if the variables are noisy or hard to measure. A strong correlation may still be uninformative if it is driven by site structure, repeated measurements, or a lurking third variable. Sample size also affects stability because correlation estimates are less stable in small datasets.
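The point about sample size can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the helper `sim_r()`, the true correlation of 0.4, and the two sample sizes are all hypothetical choices made for illustration. It draws many samples from the same population and compares how widely the sample coefficient scatters at each sample size.

```r
set.seed(1)

# Hypothetical helper: draw n pairs with true correlation rho,
# then return the sample Pearson coefficient
sim_r <- function(n, rho = 0.4) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  cor(x, y)
}

r_small <- replicate(2000, sim_r(n = 10))
r_large <- replicate(2000, sim_r(n = 100))

# Middle 95% of the sample coefficients at each sample size
quantile(r_small, c(0.025, 0.975))
quantile(r_large, c(0.025, 0.975))
```

The interval for the small samples should be several times wider than the interval for the large samples, even though the underlying association is identical.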
2 Data Structure and Diagnostics
Three questions should be settled before calculating a coefficient.
2.1 What is the Data Structure?
- Each observation in one variable must correspond to the same observation in the other variable.
- The sampling units should be independent.
- Correlation is calculated for one variable pair at a time, even if many pairs are later assembled into a matrix.
2.2 What is the Relationship Form?
- Pearson focuses on linear association.
- Spearman and Kendall focus on ordered association.
- A scatterplot is the main diagnostic because it shows whether the pattern is linear, monotonic, clustered, or broken by outliers.
2.3 What’s the Data Quality?
- Outliers can distort any coefficient, but Pearson is especially sensitive to them.
- Site, time, transect, quadrat, or repeated-measures structure can create correlations that do not reflect the biological question of interest.
2.4 Secondary Considerations
Distributional shape is less important than relationship form, outliers, and independence. Pearson is most sensitive to non-linearity and influential points. Normality is not the first issue to inspect.
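The sensitivity of Pearson's coefficient to a single influential point can be shown with a toy example. The values below are invented for illustration. Ten points with no real trend give a coefficient near zero; adding one extreme observation pushes Pearson's coefficient above 0.9, while Spearman's rank coefficient stays small because the extreme point contributes only its rank.

```r
# Invented data with no real trend
x <- 1:10
y <- c(5, 6, 4, 5, 6, 4, 5, 6, 4, 5)

r_base <- cor(x, y)   # close to zero

# Add one extreme observation far from the rest
x_out <- c(x, 40)
y_out <- c(y, 20)

r_pearson_out  <- cor(x_out, y_out, method = "pearson")
r_spearman_out <- cor(x_out, y_out, method = "spearman")

round(c(base = r_base,
        pearson_with_outlier = r_pearson_out,
        spearman_with_outlier = r_spearman_out), 2)
```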
Using the built-in iris dataset, examine the relationship between Sepal.Length and Petal.Length:
- Make a scatter plot coloured by Species. Is the overall pattern linear?
- Make the same scatter plot with only the Setosa points. Is the pattern still linear?
- Based on your plots, is there a data-structure issue that could affect your correlation estimate if you compute a single pooled Pearson coefficient?
What would you do differently to compute a correlation that correctly reflects the within-species relationship?
3 R Functions
The main function in this chapter is cor.test(). Use:
- method = "pearson" for Pearson’s product-moment correlation;
- method = "spearman" for Spearman’s rank correlation;
- method = "kendall" for Kendall’s rank correlation.
The function cor() calculates the coefficient itself and is useful for pairwise matrices. The inferential version, cor.test(), is usually more useful in worked examples because it returns the coefficient, test statistic, and p-value.
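As a brief illustration of the difference between the two functions, the sketch below uses the built-in trees dataset (chosen only because it ships with R; it is not the kelp data used in the worked examples). It shows that cor() returns a bare number or matrix, while cor.test() returns an object whose components can be extracted by name.

```r
# Built-in example data, used here purely for illustration
data(trees)

# cor() returns only the coefficient
r_only <- cor(trees$Girth, trees$Height)

# cor.test() returns an object of class "htest" with named components
ct <- cor.test(trees$Girth, trees$Height, method = "pearson")

ct$estimate    # the coefficient
ct$statistic   # the t statistic
ct$parameter   # degrees of freedom (n - 2)
ct$p.value     # the p-value
ct$conf.int    # 95% confidence interval (reported for Pearson)

# Given a data frame of numeric columns, cor() returns a pairwise matrix
cor_mat <- cor(trees)
cor_mat
```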
For Pearson correlation:
\[r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{1}\]
For Spearman rank correlation:
\[\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2-1)} \tag{2}\]
For Kendall rank correlation:
\[\tau = \frac{n_c - n_d}{\binom{n}{2}} \tag{3}\]
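These formulas show the structure of the coefficients. In Equation 1, \(\bar{x}\) and \(\bar{y}\) are the sample means. In Equation 2, \(d_i\) is the difference between the ranks of \(x_i\) and \(y_i\); this shortcut form assumes that there are no tied ranks. In Equation 3, \(n_c\) and \(n_d\) are the numbers of concordant and discordant pairs among the \(\binom{n}{2}\) possible pairs of observations, again in the simple case without ties. In practice, first decide which coefficient matches the pattern in the data.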
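Equations 1 and 2 can be checked by computing them directly and comparing with cor(). The sketch below uses a small set of invented, tie-free values (an assumption needed for the shortcut form of Equation 2 to match exactly).

```r
# Invented, tie-free toy data
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 6.8, 7.1, 5.0)
y <- c(1.0, 2.9, 1.4, 4.8, 5.1, 6.0, 7.7, 3.9)
n <- length(x)

# Equation 1: Pearson's r from centred cross-products
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# Equation 2: Spearman's rho from rank differences d_i
d <- rank(x) - rank(y)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Both comparisons should return TRUE
all.equal(r_manual, cor(x, y, method = "pearson"))
all.equal(rho_manual, cor(x, y, method = "spearman"))
```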
4 Pearson Correlation
Pearson’s \(r\) measures the strength and direction of linear association. It is appropriate when the relationship is reasonably well described by a straight line and no small set of outliers dominates the pattern.
5 Example 1: Pearson Correlation in Ecklonia maxima
I begin by asking whether the length of Ecklonia maxima stipes tends to increase together with frond length.
5.1 Do an Exploratory Data Analysis (EDA)
# A tibble: 1 × 5
n mean_stipe_length sd_stipe_length mean_frond_length sd_frond_length
<int> <dbl> <dbl> <dbl> <dbl>
1 26 531. 132. 171. 49.4
Code
# Build the label showing the correlation coefficient
# (assumes the ecklonia data frame and ggplot2 are already loaded)
r_print <- paste0(
  "r = ",
  round(cor(ecklonia$stipe_length, ecklonia$frond_length), 2)
)

# Scatterplot with a fitted straight line and the coefficient as a label
ggplot(data = ecklonia, aes(x = stipe_length, y = frond_length)) +
  geom_smooth(method = "lm", colour = "blue3", se = FALSE, linewidth = 1) +
  geom_point(size = 2.7, colour = "red3", shape = 16) +
  # annotate() draws the label once rather than once per data row
  annotate("label", x = 300, y = 240, label = r_print) +
  labs(x = "Stipe length (cm)", y = "Frond length (cm)")

The scatterplot in Figure 1 shows a clear positive linear trend. The points are not tightly packed around the fitted line, but the relationship is straight enough that Pearson’s coefficient is appropriate. The main diagnostic here is the linear form of the relationship; univariate normality is secondary.
5.2 State the Hypotheses
\[H_{0}: \rho = 0\] \[H_{a}: \rho \ne 0\]
Here \(\rho\) is the population Pearson correlation coefficient.
5.3 Apply the Test
Pearson's product-moment correlation
data: ecklonia$stipe_length and ecklonia$frond_length
t = 4.2182, df = 24, p-value = 0.0003032
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3548169 0.8300525
sample estimates:
cor
0.6524911
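The test statistic, p-value, and confidence interval in this output can be recovered from the coefficient and the sample size alone. The sketch below copies \(r\) and \(n\) from the output above and applies the standard t transformation and Fisher's z transformation; it is meant as a check on how the pieces fit together, not as a replacement for cor.test().

```r
# Values copied from the cor.test() output above
r <- 0.6524911
n <- 26

# t statistic on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 4)
```

The results should agree with the reported t, p-value, and interval up to rounding.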
5.4 Interpret the Results
The correlation is positive and moderately strong. Longer stipes tend to be associated with longer fronds. The coefficient is large enough to be biologically informative, and the p-value is well below 0.001, so we reject \(H_0\).
The effect size is the coefficient itself. Here \(r = 0.65\) indicates a fairly strong linear association for a biological dataset of this kind. That does not imply causation. It shows co-variation.
5.5 Reporting
Methods
The linear association between stipe length and frond length in Ecklonia maxima was assessed with a Pearson product-moment correlation.
Results
Stipe length and frond length in Ecklonia maxima were positively correlated (Pearson correlation: \(r = 0.65\), \(n = 26\), \(p < 0.001\)), indicating that kelps with longer stipes also tended to have longer fronds.
Discussion
The two morphological variables co-vary moderately to strongly in these data. The result shows association only; it does not identify a causal mechanism.
Using the mtcars dataset, compute the Pearson correlation between hp (horsepower) and mpg (fuel economy).
- First make a scatter plot. Does the relationship look linear?
- Compute cor.test(mtcars$hp, mtcars$mpg). Report the correlation coefficient, 95% CI, and p-value.
- Write a one-sentence interpretation following the style of the Results statement in the Reporting section above (include \(r\), df, and \(p\)-value).
- Does the negative sign make practical sense? Explain.
6 Spearman Rank Correlation
When the relationship is ordered but not especially linear, Spearman’s \(\rho\) is often the better choice. It replaces the raw values with ranks and asks whether the two variables tend to increase together in rank order.
7 Example 2: Spearman’s \(\rho\) When the Rank Pattern Is Clearer Than the Linear One
In the ecklonia data, the relationship between stipe diameter and primary blade length is not especially well described by a straight line, but there is still a biological suggestion that thicker stipes tend to occur with longer primary blades.
7.1 Do an Exploratory Data Analysis (EDA)
As I always do, I start with a scatterplot.
# A tibble: 1 × 5
n mean_stipe_diameter sd_stipe_diameter mean_primary_blade_length
<int> <dbl> <dbl> <dbl>
1 26 24.2 6.74 17.9
# ℹ 1 more variable: sd_primary_blade_length <dbl>
The scatterplot in Figure 2 suggests an increasing pattern, but the spacing around a straight line is uneven and a few observations influence the fitted line strongly. The biological question is about ordered increase rather than precise linear scaling. That points to Spearman rather than Pearson.
7.2 State the Hypotheses
\[H_{0}: \rho_{s} = 0\] \[H_{a}: \rho_{s} \ne 0\]
Here \(\rho_s\) is the population Spearman rank-correlation coefficient.
7.3 Apply the Test
Spearman's rank correlation rho
data: ecklonia$stipe_diameter and ecklonia$primary_blade_length
S = 1444.1, p-value = 0.008311
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5062992
It is useful to compare this with Pearson’s correlation on the same variable pair:
Pearson's product-moment correlation
data: ecklonia$stipe_diameter and ecklonia$primary_blade_length
t = 1.6413, df = 24, p-value = 0.1138
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07946077 0.62777344
sample estimates:
cor
0.3176688
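The S statistic in the Spearman output links back to Equation 2: it plays the role of the sum of squared rank differences, so the coefficient can be recovered from S and the sample size. The sketch below copies S and \(n\) from the Spearman output above; because S is printed to one decimal place, the result matches the reported \(\rho\) only up to rounding.

```r
# Values copied from the Spearman output above
S <- 1444.1
n <- 26

# Equation 2 with S in place of the sum of squared rank differences
rho_from_S <- 1 - 6 * S / (n * (n^2 - 1))

round(rho_from_S, 3)
```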
7.4 Interpret the Results
Spearman’s \(\rho\) shows a moderate positive rank association. In these data, the rank-based signal is clearer than the strictly linear one. Pearson understates the pattern because it is asking a different question.
This is the main reason to use Spearman: the biological pattern is ordered, but the raw spacing around a line is not especially stable.
7.5 Reporting
Methods
A Spearman rank correlation was used to assess the monotonic association between stipe diameter and primary blade length in Ecklonia maxima. A rank-based method was preferred because the visual pattern suggested ordered increase without a clean linear form.
Results
Stipe diameter and primary blade length showed a moderate positive rank association (Spearman correlation: \(\rho = 0.51\), \(n = 26\), \(p < 0.01\)), indicating that kelps with thicker stipes also tended to have longer primary blades. On the same variable pair, the Pearson correlation was weaker and not statistically convincing at the 5% level (\(r = 0.32\), \(p > 0.05\)).
Discussion
Spearman’s \(\rho\) is more informative here because the biological conclusion concerns ordered increase rather than exact linear scaling.
8 Kendall Rank Correlation
Kendall’s \(\tau\) measures concordance in rank ordering. It asks whether pairs of observations tend to agree in how they are ordered on the two variables.
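Concordance can be made concrete by counting pairs directly. The sketch below uses invented, tie-free values and enumerates every pair of observations: a pair is concordant if the two variables order the pair the same way, and discordant if they order it in opposite ways. Applying Equation 3 to these counts should reproduce the value from cor().

```r
# Invented, tie-free toy data
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 6.8, 7.1, 5.0)
y <- c(1.0, 2.9, 1.4, 4.8, 5.1, 6.0, 7.7, 3.9)

# All pairs of observation indices (one pair per column)
pair_idx <- combn(length(x), 2)

# +1 if a pair is ordered the same way on x and y, -1 if opposite
sign_prod <- sign(x[pair_idx[1, ]] - x[pair_idx[2, ]]) *
  sign(y[pair_idx[1, ]] - y[pair_idx[2, ]])

n_c <- sum(sign_prod > 0)   # concordant pairs
n_d <- sum(sign_prod < 0)   # discordant pairs

# Equation 3
tau_manual <- (n_c - n_d) / choose(length(x), 2)

all.equal(tau_manual, cor(x, y, method = "kendall"))
```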
9 Example 3: Kendall’s \(\tau\) for Concordance in Rank Order
Suppose my question is whether longer primary blades also tend to be wider in the same rank order. This is a direct concordance question, which is the natural setting for Kendall’s \(\tau\).
9.1 Do an Exploratory Data Analysis (EDA)
The scatterplot in Figure 3 shows a positive tendency, but the main point is not exact linear spacing. The question is whether larger values in one variable tend to be matched by larger values in the other. That is a concordance question, so Kendall’s \(\tau\) is appropriate.
9.2 State the Hypotheses
\[H_{0}: \tau = 0\] \[H_{a}: \tau \ne 0\]
Here \(\tau\) is the population Kendall rank-correlation coefficient.
9.3 Apply the Test
Kendall's rank correlation tau
data: ecklonia$primary_blade_length and ecklonia$primary_blade_width
z = 2.3601, p-value = 0.01827
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.3426171
9.4 Interpret the Results
Kendall’s \(\tau\) is positive, so the rank ordering is broadly consistent: longer blades also tend to be wider. The effect is moderate rather than strong, and the p-value is below 0.05, so we reject \(H_0\).
Kendall is often less familiar than Spearman, but its interpretation is very direct when the scientific question is about agreement in rank order.
9.5 Reporting
Methods
Kendall’s \(\tau\) was used to assess concordance between primary blade length and primary blade width in Ecklonia maxima.
Results
Primary blade length and primary blade width showed a positive association in rank order (Kendall correlation: \(\tau = 0.34\), \(n = 26\), \(p < 0.05\)), indicating that longer blades also tended to be wider.
Discussion
Kendall’s \(\tau\) is useful when the scientific message concerns agreement in rank order rather than precise linear scaling.
10 Correlation Matrices and Heatmaps
Once a single pairwise relationship is understood, the same idea can be scaled to many continuous variables at once.
10.1 Create the Correlation Matrix
| variable | digits | epiphyte_length | frond_length | frond_mass | primary_blade_length | primary_blade_width | stipe_diameter | stipe_length | stipe_mass |
|---|---|---|---|---|---|---|---|---|---|
| digits | 1.00 | 0.05 | 0.36 | 0.28 | 0.10 | 0.14 | 0.24 | 0.24 | 0.07 |
| epiphyte_length | 0.05 | 1.00 | 0.61 | 0.44 | 0.26 | 0.41 | 0.54 | 0.61 | 0.51 |
| frond_length | 0.36 | 0.61 | 1.00 | 0.57 | −0.02 | 0.28 | 0.39 | 0.65 | 0.39 |
| frond_mass | 0.28 | 0.44 | 0.57 | 1.00 | 0.15 | 0.36 | 0.51 | 0.51 | 0.47 |
| primary_blade_length | 0.10 | 0.26 | −0.02 | 0.15 | 1.00 | 0.34 | 0.32 | 0.13 | 0.16 |
| primary_blade_width | 0.14 | 0.41 | 0.28 | 0.36 | 0.34 | 1.00 | 0.83 | 0.34 | 0.83 |
| stipe_diameter | 0.24 | 0.54 | 0.39 | 0.51 | 0.32 | 0.83 | 1.00 | 0.59 | 0.82 |
| stipe_length | 0.24 | 0.61 | 0.65 | 0.51 | 0.13 | 0.34 | 0.59 | 1.00 | 0.58 |
| stipe_mass | 0.07 | 0.51 | 0.39 | 0.47 | 0.16 | 0.83 | 0.82 | 0.58 | 1.00 |
By producing many pairwise comparisons at once, these matrices offer a useful exploratory tool. They help to identify strong positive and negative relationships among many variables. Some of these correlations will appear by chance and others will be real, so you will have to apply your expert judgement. A matrix should therefore guide further inspection and not be treated as a finished inferential result.
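The risk of chance correlations grows with the number of pairs. The sketch below is a simulation under stated assumptions: it generates nine columns of pure noise with 26 observations each (matching the dimensions of the matrix above, but with no real associations), tests every pair, and then applies a Holm adjustment with p.adjust() as one possible correction. The variable names are hypothetical.

```r
set.seed(42)

n_obs <- 26   # same sample size as the kelp data
n_var <- 9    # same number of variables as the matrix above

# Pure noise: by construction there are no true correlations
noise <- as.data.frame(matrix(rnorm(n_obs * n_var), ncol = n_var))

# Number of distinct variable pairs in the matrix
n_pairs  <- choose(n_var, 2)
pair_idx <- combn(n_var, 2)

# Raw p-value for each pairwise Pearson test
p_raw <- apply(pair_idx, 2, function(ij) {
  cor.test(noise[[ij[1]]], noise[[ij[2]]])$p.value
})

sum(p_raw < 0.05)   # pairs flagged by chance alone

# Holm adjustment across all pairwise tests
p_holm <- p.adjust(p_raw, method = "holm")
sum(p_holm < 0.05)
```

With 36 pairs and a 5% threshold, one or two raw "significant" results are expected on average even when nothing is going on, which is why the matrix is best used for screening.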
10.2 Visualise the Matrix
Code
# Reshape the correlation matrix to long format and draw it as a heatmap
# (assumes dplyr, tidyr, and ggplot2 are loaded)
ecklonia_pearson |>
  as.data.frame() |>
  mutate(x = rownames(ecklonia_pearson)) |>
  pivot_longer(
    cols = -x,
    names_to = "y",
    values_to = "r"
  ) |>
  ggplot(aes(x, y, fill = r)) +
  geom_tile(colour = "white") +
  # Diverging palette centred on zero, spanning the full range of r
  scale_fill_gradient2(
    low = "blue",
    high = "red",
    mid = "white",
    midpoint = 0,
    limits = c(-1, 1),
    na.value = "grey95",
    space = "Lab",
    name = "r"
  ) +
  labs(x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  coord_fixed()

Together, Figure 4 and Figure 5 show the same correlation structure in two different visual styles. The matrix and heatmap are useful for scanning the dataset quickly, but they should be treated with caution. They do not control the false-positive rate across all displayed comparisons, and they do not explain why a correlation exists.
Using the penguins dataset from palmerpenguins, compute a pairwise Pearson correlation matrix for the four continuous morphological variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g).
- Filter to complete cases first (drop_na()), then compute the matrix using cor().
- Identify the strongest positive and strongest negative correlation.
- Do any of these correlations surprise you biologically? Why might bill_depth_mm and bill_length_mm have a negative pooled correlation across species?
11 Correlation and Non-Independent Data
Even the correct coefficient can mislead when the observations are not independent. This is common in ecology because measurements are often grouped within sites, transects, quadrats, times, or individuals.
12 Example 4: A Correlation Can Be Modified by Site Structure
I again use the kelp data.
# A tibble: 2 × 3
site n r_site
<chr> <int> <dbl>
1 Batsata Rock 13 0.223
2 Boulders Beach 13 0.920
Code
# Within-site fits (coloured) and the pooled fit (black, dashed)
ggplot(ecklonia, aes(x = stipe_length, y = epiphyte_length, colour = site)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  geom_smooth(aes(group = 1),
              method = "lm",
              se = FALSE,
              colour = "black",
              linetype = "dashed",
              linewidth = 0.6) +
  labs(x = "Stipe length (cm)", y = "Epiphyte length (cm)", colour = "Site")

The contrast between the pooled fit and the within-site fits is clear in Figure 6.
The pooled correlation across all observations is clearly positive, but the plot also shows clustering by site. Part of the overall association reflects site differences rather than individual-level co-variation.
This is why independence must be taken seriously. A naive pooled coefficient mixes at least two sources of structure:
- variation among individuals within sites;
- variation between sites.
If the biological question is about site-level differences, then site should be modelled explicitly. If the question is about individual-level association, a grouped or mixed-effects model is more defensible than a pooled correlation.
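How site structure alone can create a pooled correlation is easiest to see in simulated data. The sketch below is a hypothetical two-site example, not the kelp data: within each site the two variables are unrelated, but the sites differ in their means on both variables. Centring each variable on its site mean is shown as one simple way to look at within-site co-variation; it is a diagnostic aid and not a substitute for a grouped or mixed-effects model.

```r
set.seed(7)

# Hypothetical two-site data: unrelated within site, shifted between sites
n_per_site <- 50
site   <- rep(c("A", "B"), each = n_per_site)
offset <- ifelse(site == "A", 0, 10)
x <- offset + rnorm(2 * n_per_site)
y <- offset + rnorm(2 * n_per_site)

# Pooled coefficient ignores site
r_pooled <- cor(x, y)

# Separate coefficient within each site
r_within <- sapply(split(seq_along(x), site), function(i) cor(x[i], y[i]))

# Remove site means, then correlate what is left
x_c <- x - ave(x, site)
y_c <- y - ave(y, site)
r_centred <- cor(x_c, y_c)

round(c(pooled = r_pooled, r_within, centred = r_centred), 2)
```

The pooled coefficient is large only because the two clusters sit in different places; the within-site and centred coefficients show that there is little individual-level association in these simulated data.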
13 Correlation Versus Regression
Correlation and regression are related, but they do different jobs.
- Correlation quantifies how strongly two variables vary together.
- Regression estimates how the expected value of a response changes with a predictor.
Both are rooted in covariance, but Pearson’s correlation standardises covariance into a unitless coefficient while simple linear regression uses the same covariance structure to estimate a slope. See the next chapter.
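The shared covariance structure can be verified numerically. The sketch below uses the built-in cars dataset (an arbitrary choice for illustration) to show that the slope of a simple linear regression equals Pearson's coefficient rescaled by the ratio of the standard deviations, and equivalently the covariance divided by the variance of the predictor.

```r
# Built-in example data, used here purely for illustration
data(cars)
x <- cars$speed
y <- cars$dist

# Pearson's r: covariance standardised by both standard deviations
r <- cor(x, y)

# Slope recovered from r, and directly from the covariance
slope_from_r   <- r * sd(y) / sd(x)
slope_from_cov <- cov(x, y) / var(x)

# Slope estimated by simple linear regression
slope_lm <- unname(coef(lm(dist ~ speed, data = cars))["speed"])

c(from_r = slope_from_r, from_cov = slope_from_cov, lm = slope_lm)
```

Unlike \(r\), the slope carries the units of the response per unit of the predictor, and it changes if the roles of the two variables are swapped.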
If the question is “do these variables co-vary?”, correlation is appropriate. If the question is “how much does the response change when the predictor changes?”, the analysis has already become a regression.
For each of the following research questions, decide whether correlation or regression is more appropriate, and explain why:
- A botanist measures leaf area and stomatal density in 50 plants. She wants to know whether larger leaves have more stomata per unit area.
- A conservation biologist wants to predict the number of species in a forest patch from the patch area.
- Two ecologists each independently count the number of invasive plants in 20 quadrats. They want to know whether their counts agree.
- A physiologist wants to know how oxygen consumption changes for each 1°C increase in body temperature.
Which of these require a slope estimate? Which only require a measure of co-variation? Discuss with a partner.
14 If Assumptions Fail
If the relationship is not linear, the variables are ordinal, or the data contain influential outliers, Pearson correlation may not be appropriate. In such cases:
- use Spearman or Kendall for rank-based association;
- inspect the scatterplot before trusting any coefficient;
- ask whether site, time, or repeated-measures structure is inflating the coefficient;
- treat matrices as exploratory summaries;
- move to regression if the real question is one of response and predictor, or if confounding and grouped structure need to be handled directly.
Assumption checking itself is discussed in Chapter 6.
For each of the following pairs of variables, decide which correlation method (Pearson, Spearman, or Kendall) is most appropriate and briefly justify your choice:
- Body mass (g) and wing span (mm) in a sample of 120 birds — the scatterplot looks roughly linear. (/2)
- One ecologist ranks 15 sites from least to most disturbed (1–15); another ecologist ranks the same 15 sites by species richness. The question is whether more disturbed sites tend to rank lower in diversity. (/2)
- Sea surface temperature and chlorophyll-a concentration in a large oceanographic dataset — the relationship is clearly curvilinear (cooling increases phytoplankton). (/2)
- For the sea surface temperature and chlorophyll-a example above, does Spearman also have an advantage over Pearson when data are skewed? (/2)
- A nutritionist scores diet quality on a 5-point ordinal scale and records BMI for 40 participants. (/2)
Find two datasets of your own and do a full correlation analysis on each. Briefly describe the data and why they exist. State the hypotheses, do an EDA, make exploratory figures, choose and justify the appropriate correlation method, assess assumptions, and write up the results in publication style.
Rubric
| Criterion | Excellent (Full Marks) | Partial Credit | Absent / Poor | Marks |
|---|---|---|---|---|
| 1. Dataset Choice and Justification | Two variables (from one or more datasets) are clearly described and justified as candidates for correlation analysis; rationale is thoughtful and contextually informed. | Variables are chosen and described but the rationale is vague or unconvincing. | Variable selection appears arbitrary or trivial; little or no justification is given. | /2 |
| 2. Hypothesis Framing | Null and alternative hypotheses are explicitly stated and aligned with the correlation analysis (e.g., \(H_0: \rho = 0\)). Contextual meaning is clearly explained. | Hypotheses are present but poorly articulated or lacking contextual relevance. | Hypotheses are missing, incorrect, or misaligned with the analysis. | /2 |
| 3. Exploratory Data Analysis | EDA includes summary statistics, variable distribution inspection, and consideration of linearity or monotonicity. Potential issues (e.g., outliers) are noted. | EDA is attempted but lacks depth or overlooks important features such as skewness or relationship form. | No meaningful EDA is performed before conducting the correlation. | /3 |
| 4. Exploratory Figures | Appropriate visualisation (e.g., scatterplot with smoothing line, marginal histograms) is clear, labelled, and supports interpretation. | A plot is included but is unclear, poorly formatted, or not well interpreted. | No plot is provided, or the plot is irrelevant or uninformative. | /2 |
| 5. Correlation Method and Calculation | The correlation method is appropriate to the data characteristics, with Pearson, Spearman, or Kendall chosen and justified. Code and output are correct and clearly reported. | The method is used correctly but without justification, or there are some reporting issues. | Correlation is applied mechanically or incorrectly; code or output is missing. | /3 |
| 6. Significance and Effect Size | The p-value and correlation coefficient (\(r\), \(\rho\), or \(\tau\)) are reported, with interpretation of both statistical and practical significance. | Results are reported but not clearly interpreted or contextualised. | The p-value or coefficient is misinterpreted, or key output is missing. | /2 |
| 7. Assumption Checking and Discussion | Relevant assumptions are addressed according to the chosen method (e.g., relationship form, outliers, independence), supported by appropriate plots and discussion. | Some assumptions are discussed or partially checked, but the reasoning is unclear or incomplete. | There is no discussion or evidence of assumption checking. | /3 |
| 8. Written Results Section | Results are presented in a clear, concise, publication-ready format, with technical correctness and a logical flow from EDA to conclusion. | Results are readable but disorganised, imprecise, or not fully connected to the evidence. | Results are unclear, incorrect, or unstructured. | /3 |
Total: /20
15 Summary
Correlation quantifies association without fitting a response model. The working sequence is:
- plot the data;
- decide whether the question is really about association;
- choose Pearson, Spearman, or Kendall according to the pattern in the data;
- check independence before trusting the coefficient;
- interpret sign, magnitude, and uncertainty together;
- treat pairwise matrices as exploratory, not definitive.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {9. {Correlation} and {Association}},
date = {2026-04-07},
url = {https://tangledbank.netlify.app/BCB744/basic_stats/09-correlation-and-association.html},
langid = {en}
}
