Correlations and Associations

Published

January 1, 2021

TipMaterial Required for This Chapter
Type Name Link
Slides Correlation lecture slides 💾 BCB743_06_correlations.pdf
Data The Doubs River data 💾 Doubs.RData
ImportantTasks to Complete in This Chapter

You were introduced to correlations in BCB744, and you will now revisit this concept in the context of environmental data.

1 Set-Up the Analysis Environment

library(tidyverse)
library(vegan)
library(Hmisc) # for rcorr()

2 The Doubs River Data

David Zelený describes the background to the data on his excellent website and in the book Numerical Ecology with R by Borcard et al. (2011). These data are a beautiful example of how gradients structure biodiversity. It will be in your own interest to fully understand how the various environmental factors used as explanatory variables vary along a riverine gradient from the source to the terminus of the river.

2.1 Correlations between environmental variables

Correlation refers to the statistical (non-causal) relationship between two continuous variables. It measures the extent to which changes in one variable correspond to changes in another variable. Correlations are quantified into values ranging from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no correlation. A positive correlation implies that as one variable increases, the other variable also increases. Conversely, a negative correlation implies that as one variable increases, the other decreases. Correlation can be calculated using several methods, the most common one being the Pearson correlation coefficient. Non-parametric correlations can be applied to ordinal or non-normal data.

In a system such as the Doubs River, correlations among environmental variables should not be seen as a collection of independent pairwise relationships. Instead, many of these correlations are joint expressions of a single, spatially ordered gradient extending from the headwaters to the river mouth. Variables such as elevation, distance from source, slope, nutrient concentrations, oxygen availability, and organic load are structurally linked through longitudinal transport, cumulative catchment effects, and downstream integration of physical and biogeochemical processes. As a result, strong correlations here often reflect the coherence of the riverine gradient rather than discrete ecological linkages between variables. Recognising this distinction is important, as it takes interpretation from isolated variable–variable associations to an integrative understanding of how multivariate environmental structure is imposed by spatial ordering along the river continuum.

Because these variables are jointly structured by a altitudinal (or some other geographical) gradient, high pairwise correlations also indicate substantial collinearity within the environmental dataset. This has important consequences for further analyses. When strongly correlated predictors are treated as independent covariates (such as in multiple regression or constrained ordination) their shared variance becomes difficult to disentangle, parameter estimates become unstable, and apparent effects may reflect gradient position rather than distinct environmental influences. Here, the correlation matrix functions less as an indication of relationships to be interpreted individually and more as a diagnostic view of multivariate redundancy imposed by spatial ordering. Identifying such collinearity early is therefore a matter of one’s analytical routine… it informs decisions about variable selection, dimensional reduction, and the appropriateness of modelling frameworks that assume relative independence among predictors.

Here follows a correlation analysis of the Doubs River envirnmental data:

load(here::here("data", "BCB743", "NEwR-2ed_code_data", "NEwR2-Data", "Doubs.RData"))

head(env, 5)
R>    dfs ele  slo  dis  pH har  pho  nit  amm  oxy bod
R> 1  0.3 934 48.0 0.84 7.9  45 0.01 0.20 0.00 12.2 2.7
R> 2  2.2 932  3.0 1.00 8.0  40 0.02 0.20 0.10 10.3 1.9
R> 3 10.2 914  3.7 1.80 8.3  52 0.05 0.22 0.05 10.5 3.5
R> 4 18.5 854  3.2 2.53 8.0  72 0.10 0.21 0.00 11.0 1.3
R> 5 21.5 849  2.3 2.64 8.1  84 0.38 0.52 0.20  8.0 6.2

We do not need to standardise as one would do for the calculation of Euclidean distances, but in some instances data transformations might be necessary:

env_cor <- round(cor(env), 2)
env_cor
R>       dfs   ele   slo   dis    pH   har   pho   nit   amm   oxy   bod
R> dfs  1.00 -0.94 -0.38  0.95  0.01  0.70  0.48  0.75  0.41 -0.51  0.39
R> ele -0.94  1.00  0.44 -0.87 -0.04 -0.74 -0.44 -0.76 -0.38  0.36 -0.34
R> slo -0.38  0.44  1.00 -0.34 -0.22 -0.53 -0.19 -0.31 -0.17  0.31 -0.18
R> dis  0.95 -0.87 -0.34  1.00  0.02  0.70  0.39  0.61  0.29 -0.36  0.25
R> pH   0.01 -0.04 -0.22  0.02  1.00  0.09 -0.08 -0.05 -0.12  0.18 -0.15
R> har  0.70 -0.74 -0.53  0.70  0.09  1.00  0.36  0.51  0.29 -0.38  0.34
R> pho  0.48 -0.44 -0.19  0.39 -0.08  0.36  1.00  0.80  0.97 -0.72  0.89
R> nit  0.75 -0.76 -0.31  0.61 -0.05  0.51  0.80  1.00  0.80 -0.63  0.64
R> amm  0.41 -0.38 -0.17  0.29 -0.12  0.29  0.97  0.80  1.00 -0.72  0.89
R> oxy -0.51  0.36  0.31 -0.36  0.18 -0.38 -0.72 -0.63 -0.72  1.00 -0.84
R> bod  0.39 -0.34 -0.18  0.25 -0.15  0.34  0.89  0.64  0.89 -0.84  1.00

Or if we want to see the associated p-values to establish a statistical significance:

rcorr(as.matrix(env))
R>       dfs   ele   slo   dis    pH   har   pho   nit   amm   oxy   bod
R> dfs  1.00 -0.94 -0.38  0.95  0.01  0.70  0.48  0.75  0.41 -0.51  0.39
R> ele -0.94  1.00  0.44 -0.87 -0.04 -0.74 -0.44 -0.76 -0.38  0.36 -0.34
R> slo -0.38  0.44  1.00 -0.34 -0.22 -0.53 -0.19 -0.31 -0.17  0.31 -0.18
R> dis  0.95 -0.87 -0.34  1.00  0.02  0.70  0.39  0.61  0.29 -0.36  0.25
R> pH   0.01 -0.04 -0.22  0.02  1.00  0.09 -0.08 -0.05 -0.12  0.18 -0.15
R> har  0.70 -0.74 -0.53  0.70  0.09  1.00  0.36  0.51  0.29 -0.38  0.34
R> pho  0.48 -0.44 -0.19  0.39 -0.08  0.36  1.00  0.80  0.97 -0.72  0.89
R> nit  0.75 -0.76 -0.31  0.61 -0.05  0.51  0.80  1.00  0.80 -0.63  0.64
R> amm  0.41 -0.38 -0.17  0.29 -0.12  0.29  0.97  0.80  1.00 -0.72  0.89
R> oxy -0.51  0.36  0.31 -0.36  0.18 -0.38 -0.72 -0.63 -0.72  1.00 -0.84
R> bod  0.39 -0.34 -0.18  0.25 -0.15  0.34  0.89  0.64  0.89 -0.84  1.00
R> 
R> n= 30 
R> 
R> 
R> P
R>     dfs    ele    slo    dis    pH     har    pho    nit    amm    oxy   
R> dfs        0.0000 0.0365 0.0000 0.9771 0.0000 0.0076 0.0000 0.0251 0.0040
R> ele 0.0000        0.0146 0.0000 0.8447 0.0000 0.0144 0.0000 0.0376 0.0493
R> slo 0.0365 0.0146        0.0625 0.2362 0.0028 0.3067 0.0997 0.3593 0.1006
R> dis 0.0000 0.0000 0.0625        0.9147 0.0000 0.0355 0.0004 0.1136 0.0522
R> pH  0.9771 0.8447 0.2362 0.9147        0.6405 0.6619 0.7976 0.5134 0.3494
R> har 0.0000 0.0000 0.0028 0.0000 0.6405        0.0481 0.0039 0.1191 0.0370
R> pho 0.0076 0.0144 0.3067 0.0355 0.6619 0.0481        0.0000 0.0000 0.0000
R> nit 0.0000 0.0000 0.0997 0.0004 0.7976 0.0039 0.0000        0.0000 0.0002
R> amm 0.0251 0.0376 0.3593 0.1136 0.5134 0.1191 0.0000 0.0000        0.0000
R> oxy 0.0040 0.0493 0.1006 0.0522 0.3494 0.0370 0.0000 0.0002 0.0000       
R> bod 0.0309 0.0677 0.3546 0.1770 0.4232 0.0619 0.0000 0.0001 0.0000 0.0000
R>     bod   
R> dfs 0.0309
R> ele 0.0677
R> slo 0.3546
R> dis 0.1770
R> pH  0.4232
R> har 0.0619
R> pho 0.0000
R> nit 0.0001
R> amm 0.0000
R> oxy 0.0000
R> bod

We can now follow with a visual exploration (see Question 1, below).

Although Pearson correlation is used here for convenience and familiarity, its interpretation requires that the distributional assumptions are met, which is only weakly true for several of the environmental variables in the Doubs dataset. Nutrient concentrations, biological oxygen demand, and ammonia exhibit skewness, uneven variance across sites, and the influence of low or zero values associated with upstream locations. Under such conditions, Pearson’s r reflects the strength of association and the leverage of extreme values along the river gradient. Non-parametric (rank-based) alternatives, such as Spearman correlation, relax these assumptions assessing the relative ordering rather than their given magnitudes. Comparing Pearson and Spearman correlations for this dataset therefore provides a useful diagnostic exercise, since it allows one to assess whether observed relationships are driven primarily by monotonic structure along the gradient or by a smaller number of influential observations. By stating our choice of correlation in this way, we can shift the analysis from default computation to informative method selection, where the form of the data actively informs the statistical lens applied to it.

Note also that the p-values reported for these correlations must be carefully used. With a sample size of 30 sites arranged along a strongly ordered riverine gradient, the assumption of independent observations that underpins classical significance testing is only weakly met. Many environmental variables here co-vary not because of distinct pairwise relationships, but because they are jointly structured by position along the river continuum. Under such conditions, small p-values often arise as a by-product of gradient coherence rather than as evidence for separable or mechanistic links between variables. Here, the p-value functions primarily as a descriptive indicator of consistency across sites, not as a hypothesis test in the conventional sense (H0: There is not statistically-significant relationship among variable pairs along the length of the Doubs River). Treating it as an yes/no marker for “significant” versus “non-significant” relationships might hide the multivariate dependence that motivates the analysis in the first place, and can encourage interpretive habits that become misleading once these variables are carried forward into ordination or regression-based frameworks.

2.2 Association between species

In this chapter, species association is used in a strictly statistical sense to denote patterns of joint occurrence across sampling sites. An association describes whether two species tend to be recorded at the same sites more or less often than expected from their individual distributions, without implying any underlying ecological mechanism.

Such patterns may arise from shared responses to environmental gradients, similar habitat tolerances, or spatial structuring of sites, as well as from direct biological interactions.

The methods applied here (based on dissimilarity measures such as the Jaccard index) operate only on presence–absence or abundance patterns and are therefore agnostic to the processes that generate those patterns. Association matrices should accordingly be seen as summaries of co-occurrence structure, not as evidence of interaction, dependency, or ecological coupling between species.

The Doubs River fish species dataset is an example of abundance data and it will serve well to examine the properties of an association matrix:

head(spe)
R>   Cogo Satr Phph Babl Thth Teso Chna Pato Lele Sqce Baba Albi Gogo Eslu Pefl
R> 1    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0
R> 2    0    5    4    3    0    0    0    0    0    0    0    0    0    0    0
R> 3    0    5    5    5    0    0    0    0    0    0    0    0    0    1    0
R> 4    0    4    5    5    0    0    0    0    0    1    0    0    1    2    2
R> 5    0    2    3    2    0    0    0    0    5    2    0    0    2    4    4
R> 6    0    3    4    5    0    0    0    0    1    2    0    0    1    1    1
R>   Rham Legi Scer Cyca Titi Abbr Icme Gyce Ruru Blbj Alal Anan
R> 1    0    0    0    0    0    0    0    0    0    0    0    0
R> 2    0    0    0    0    0    0    0    0    0    0    0    0
R> 3    0    0    0    0    0    0    0    0    0    0    0    0
R> 4    0    0    0    0    1    0    0    0    0    0    0    0
R> 5    0    0    2    0    3    0    0    0    5    0    0    0
R> 6    0    0    0    0    2    0    0    0    1    0    0    0

In order to calculate an association matrix for the fish species we first need to transpose the data:

spe_t <- t(spe)

Now we can calculate the association matrix:

spe_assoc1 <- vegdist(spe_t, method = "jaccard")
 # display only a portion of the data...
as.matrix((spe_assoc1))[1:10, 1:10]
R>           Cogo      Satr      Phph      Babl      Thth      Teso      Chna
R> Cogo 0.0000000 0.7368421 0.7794118 0.7945205 0.3333333 0.4545455 0.9354839
R> Satr 0.7368421 0.0000000 0.3108108 0.4705882 0.7368421 0.7333333 0.9583333
R> Phph 0.7794118 0.3108108 0.0000000 0.2804878 0.7794118 0.7571429 0.9113924
R> Babl 0.7945205 0.4705882 0.2804878 0.0000000 0.8108108 0.7397260 0.8481013
R> Thth 0.3333333 0.7368421 0.7794118 0.8108108 0.0000000 0.5833333 0.9000000
R> Teso 0.4545455 0.7333333 0.7571429 0.7397260 0.5833333 0.0000000 0.8787879
R> Chna 0.9354839 0.9583333 0.9113924 0.8481013 0.9000000 0.8787879 0.0000000
R> Pato 0.8918919 0.9078947 0.7948718 0.7307692 0.9210526 0.7500000 0.4827586
R> Lele 0.8627451 0.8235294 0.7386364 0.6666667 0.9056604 0.7346939 0.6136364
R> Sqce 0.8360656 0.7978723 0.7346939 0.6562500 0.8730159 0.8281250 0.7017544
R>           Pato      Lele      Sqce
R> Cogo 0.8918919 0.8627451 0.8360656
R> Satr 0.9078947 0.8235294 0.7978723
R> Phph 0.7948718 0.7386364 0.7346939
R> Babl 0.7307692 0.6666667 0.6562500
R> Thth 0.9210526 0.9056604 0.8730159
R> Teso 0.7500000 0.7346939 0.8281250
R> Chna 0.4827586 0.6136364 0.7017544
R> Pato 0.0000000 0.5000000 0.6774194
R> Lele 0.5000000 0.0000000 0.4531250
R> Sqce 0.6774194 0.4531250 0.0000000
spe_assoc2 <- vegdist(spe_t, method = "jaccard", binary = TRUE)
as.matrix((spe_assoc2))[1:10, 1:10]
R>           Cogo      Satr      Phph      Babl      Thth      Teso      Chna
R> Cogo 0.0000000 0.5294118 0.6000000 0.6666667 0.2222222 0.4000000 0.8888889
R> Satr 0.5294118 0.0000000 0.2380952 0.3600000 0.5294118 0.6111111 0.8846154
R> Phph 0.6000000 0.2380952 0.0000000 0.1666667 0.6000000 0.6000000 0.7692308
R> Babl 0.6666667 0.3600000 0.1666667 0.0000000 0.6666667 0.6666667 0.6153846
R> Thth 0.2222222 0.5294118 0.6000000 0.6666667 0.0000000 0.4000000 0.8235294
R> Teso 0.4000000 0.6111111 0.6000000 0.6666667 0.4000000 0.0000000 0.7500000
R> Chna 0.8888889 0.8846154 0.7692308 0.6153846 0.8235294 0.7500000 0.0000000
R> Pato 0.8125000 0.8333333 0.7083333 0.6000000 0.8125000 0.6428571 0.2307692
R> Lele 0.8181818 0.6538462 0.5384615 0.3846154 0.8181818 0.7000000 0.4210526
R> Sqce 0.7307692 0.5517241 0.3928571 0.2500000 0.7307692 0.7307692 0.5200000
R>           Pato      Lele      Sqce
R> Cogo 0.8125000 0.8181818 0.7307692
R> Satr 0.8333333 0.6538462 0.5517241
R> Phph 0.7083333 0.5384615 0.3928571
R> Babl 0.6000000 0.3846154 0.2500000
R> Thth 0.8125000 0.8181818 0.7307692
R> Teso 0.6428571 0.7000000 0.7307692
R> Chna 0.2307692 0.4210526 0.5200000
R> Pato 0.0000000 0.3888889 0.5600000
R> Lele 0.3888889 0.0000000 0.2800000
R> Sqce 0.5600000 0.2800000 0.0000000

Interpreting these association matrices requires attention to what the chosen form of the Jaccard index reflects. When calculated on abundance data, the Jaccard distance shows whether two species co-occur and how their abundances are distributed across shared sites. Under this specofocation, species that frequently occur together but differ greatly in dominance can appear less closely associated than species with more similar abundances, even when their occupancy patterns are comparable. The binary version of the Jaccard index, however, removes abundance information altogether and retains only shared presence or absence, giving a measure that isolates co-occurrence structure independently of dominance. It is also worth emphasising that the Jaccard index is a dissimilarity, i.e., lower values correspond to stronger co-occurrence (greater association), whereas higher values indicate weaker co-occurrence. Interpreting “strong” or “weak” association therefore requires paying attention both to the magnitude of the distance and to how that distance has been constructed (from a dissimilarity index, in this case). Reading the abundance-weighted and binary matrices side by side makes clear that association is not an intrinsic property of a species pair, but a consequence of the analytical view applied. This is a distinction that becomes important when these distances are later used for clustering and ordination.

3 Looking Forward

The analyses in this chapter are intended as diagnostic steps on which the subsequent modelling choices can build. But they may also be insightful in themselves, as being able to read these matricers certainly gives one a level of deeper insight into one’s data. Strong correlations among environmental variables indicate redundancy that constrains how these predictors can be used together in regression-based models, i.e., collinearity inflates variance, destabilises parameter estimates, and undermines attempts to attribute effects to individual covariates. In such situations, treating each variable as an independent explanatory factor becomes statistically problematic. Correlation matrices therefore provide an initial step before downstream analyses are attempted. They reveal when dimensional reduction or alternative representations of environmental structure are required before formal modelling proceeds.

The same applies to species associations. Patterns of co-occurrence summarised in association matrices anticipate the geometry that species will occupy in multivariate space. High redundancy among species warns of compressed ordination structures, while weak or uneven associations often translate into elongated gradients or sparse configurations. This will become clear when we do our first ordination analyses. By examining correlations and associations at an early stage, we describe the data and diagnose their effective dimensionality; so, they represent assessments that motivate the transition toward ordination, clustering, and constrained multivariate methods in subsequent chapters.

References

Borcard D, Gillet F, Legendre P, others (2011) Numerical ecology with R. Springer

Reuse

Citation

BibTeX citation:
@online{smit,_a._j.2021,
  author = {Smit, A. J.,},
  title = {Correlations and {Associations}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB743/correlations.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2021) Correlations and Associations. http://tangledbank.netlify.app/BCB743/correlations.html.