Correspondence Analysis (CA)
Type | Name | Link |
---|---|---|
Theory | Numerical Ecology in R | See pages 132-140 |
Slides | CA lecture slides | 💾 BCB743_09_CA.pdf |
Data | The Doubs River data | 💾 Doubs.RData |
Correspondence Analysis (CA) is an eigenvector-based ordination method that handles nonlinear species responses more effectively than Principal Component Analysis (PCA). PCA relies on linear relationships and maximises variance explained using a covariance or correlation matrix, but CA employs similar regression techniques based on \(\chi^2\)-standardised data and weights. This makes it more appropriate for species count and presence/absence data.
CA maximises the correspondence between species scores and sample scores by preserving \(\chi^2\) distances between sites in a species-by-site matrix instead of Euclidean distances. The \(\chi^2\) distance metric is not influenced by double zeros, making it suitable for situation when many species might be absent from several sites. The process involves performing a Singular Value Decomposition (SVD) or eigenvalue decomposition (two different approaches in linear algebra applied to the analysis of matrices) on the standardised data matrix and reporting the eigenvalues and eigenvectors.
In CA ordination diagrams, species and sites are represented by points, with their relative positions indicating the strength of their correspondence. The distances between points in the ordination space reflect the \(\chi^2\) distances between the rows or columns of the original data matrix. CA answers questions such as “Which sites do my species prefer?” or “Which sites correspond to which species?”
The species scores are linear combinations of the original species data that maximise the dispersion of the species scores along the ordination axes, capturing the main patterns in the data. These scores represent the weighted averages of the species concerning the environmental variables or site characteristics, approximating nonlinear response of species along environmental gradients, such as unimodal or skewed responses.
One potential downside of CA is that it assumes the total abundance or presence of species across sites to be constant, which may not always hold true. Additionally, some ecologists argue that CA might be overly influenced by rare species, as their contributions to the \(\chi^2\) statistic can be disproportionately large. This issue can be mitigated by applying appropriate transformations or down weighting rare species in the analysis.
Another problem with CA is the ‘arch effect.’ This is similar to the horseshoe effect in PCA, but less severe. The arch effect can be mitigated by using a Detrended Correspondence Analysis (DCA), which is a variation of CA that detrends the arch effect by removing the linear trend from the eigenvalues.
CA produces one axis fewer than the minimum of the number of sites (n) or the number of species (p). Like PCA, CA produces orthogonal axes ranked in decreasing order of importance. However, the variation represented is the total inertia, which is the sum of squares of all values in the \(\chi^2\) matrix, rather than the sum of eigenvalues along the diagonal as in PCA. Individual eigenvalues in CA can be greater than 1, indicating that the corresponding axis captures a significant portion of the total variance in the data.
The scaling of ordination plots in CA is similar to that in PCA. Scaling 1 (site scaling) means that sites close together in the plot have similar species relative frequencies, and any site near a species point will have a relatively large abundance of that species. Scaling 2 (species scaling) means that species points close together will have similar abundance patterns across sites, and any species close to a site point is more likely to have a high abundance at that site.
As with all ordination techniques, interpreting CA results should be done with caution and in conjunction with additional ecological knowledge and statistical tests, as the ordination axes may not always have a clear ecological interpretation. Please supplement your reading by referring to GUSTA ME) and David Zelený’s writing on the topic in Analysis of community ecology data in R.
Set-up the Analysis Environment
The Doubs River Data
In the PCA chapter we analysed the environmental data. This time we work with the species data.
Cogo Satr Phph Babl Thth Teso Chna Pato Lele Sqce Baba Albi Gogo Eslu Pefl
1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 5 4 3 0 0 0 0 0 0 0 0 0 0 0
3 0 5 5 5 0 0 0 0 0 0 0 0 0 1 0
4 0 4 5 5 0 0 0 0 0 1 0 0 1 2 2
5 0 2 3 2 0 0 0 0 5 2 0 0 2 4 4
6 0 3 4 5 0 0 0 0 1 2 0 0 1 1 1
7 0 5 4 5 0 0 0 0 1 1 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Rham Legi Scer Cyca Titi Abbr Icme Gyce Ruru Blbj Alal Anan
1 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0
5 0 0 2 0 3 0 0 0 5 0 0 0
6 0 0 0 0 2 0 0 0 1 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0
Do the CA
The vegan function cca()
can be used for CA and Constrained Correspondence Analysis (CCA). When we do not specify constraints, as we do here, we will do a simple CA:
Okay, so there’s a problem. The error message says that at least one of the rows sums to 0. Which one?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
3 12 16 21 34 21 16 0 14 14 11 18 19 28 33 40 44 42 46 56 62 72 4 15 11 43
27 28 29 30
63 70 87 89
We see that the offending row is row 8, so we can omit it. This function will omit any row that sums to zero (or less):
Cogo Satr Phph Babl Thth Teso Chna Pato Lele Sqce Baba Albi Gogo Eslu Pefl
1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 5 4 3 0 0 0 0 0 0 0 0 0 0 0
3 0 5 5 5 0 0 0 0 0 0 0 0 0 1 0
4 0 4 5 5 0 0 0 0 0 1 0 0 1 2 2
5 0 2 3 2 0 0 0 0 5 2 0 0 2 4 4
6 0 3 4 5 0 0 0 0 1 2 0 0 1 1 1
7 0 5 4 5 0 0 0 0 1 1 0 0 0 0 0
9 0 0 1 3 0 0 0 0 0 5 0 0 0 0 0
Rham Legi Scer Cyca Titi Abbr Icme Gyce Ruru Blbj Alal Anan
1 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0
5 0 0 2 0 3 0 0 0 5 0 0 0
6 0 0 0 0 2 0 0 0 1 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 1 0 0 0 4 0 0 0
Now we are ready for the CA:
Call: cca(X = spe)
Inertia Rank
Total 1.167
Unconstrained 1.167 26
Inertia is scaled Chi-square
Eigenvalues for unconstrained axes:
CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
0.6010 0.1444 0.1073 0.0834 0.0516 0.0418 0.0339 0.0288
(Showing 8 of 26 unconstrained eigenvalues)
The more verbose summary()
output:
Call:
cca(X = spe)
Partitioning of scaled Chi-square:
Inertia Proportion
Total 1.167 1
Unconstrained 1.167 1
Eigenvalues, and their contribution to the scaled Chi-square
Importance of components:
CA1 CA2 CA3 CA4 CA5 CA6 CA7
Eigenvalue 0.601 0.1444 0.10729 0.08337 0.05158 0.04185 0.03389
Proportion Explained 0.515 0.1237 0.09195 0.07145 0.04420 0.03586 0.02904
Cumulative Proportion 0.515 0.6387 0.73069 0.80214 0.84634 0.88220 0.91124
CA8 CA9 CA10 CA11 CA12 CA13
Eigenvalue 0.02883 0.01684 0.010826 0.010142 0.007886 0.006123
Proportion Explained 0.02470 0.01443 0.009278 0.008691 0.006758 0.005247
Cumulative Proportion 0.93594 0.95038 0.959655 0.968346 0.975104 0.980351
CA14 CA15 CA16 CA17 CA18 CA19
Eigenvalue 0.004867 0.004606 0.003844 0.003067 0.001823 0.001642
Proportion Explained 0.004171 0.003948 0.003294 0.002629 0.001562 0.001407
Cumulative Proportion 0.984522 0.988470 0.991764 0.994393 0.995955 0.997362
CA20 CA21 CA22 CA23 CA24
Eigenvalue 0.001295 0.0008775 0.0004217 0.0002149 0.0001528
Proportion Explained 0.001110 0.0007520 0.0003614 0.0001841 0.0001309
Cumulative Proportion 0.998472 0.9992238 0.9995852 0.9997693 0.9999002
CA25 CA26
Eigenvalue 8.949e-05 2.695e-05
Proportion Explained 7.669e-05 2.310e-05
Cumulative Proportion 1.000e+00 1.000e+00
The output looks similar to that of a PCA. The important things to note are the inertia (unconstrained and total inertia are the same), the Eigenvalues for the unconstrained axes, the Species scores, and the Site scores. Their interpretation is the same as before, but we can reiterate. Let us calculate the total inertia:
The inertia for the first axis (CA1) is:
The inertia of CA1 and CA2 is:
The fraction of the variance explained by CA1 and CA2 is:
Above, the value is the same one as in Cumulative Proportion in the summary(spe_ca)
output under the CA2 column.
The scree plot (Figure 1) shows the eigenvalues of the CA axes which helps us decide how many axes to retain in the analysis. In this case, we will retain the first two axes, as they explain the most variance in the data.
Species scores are actual species scores as they now relate to species data (in the PCA, the environmental variables were in the columns and so the species scores referred instead to the environment). The most positive and most negative eigenvectors (or loadings) indicate those species that dominate in their influence along particular CA axes. For example, CA1 will be most heavily loaded by the species Cogo and Satr (eigenvectors of 1.50075 and 1.66167, respectively). If there is an environmental gradient, it will be these species that show the strongest response. At the very least, we can say that the contributions of these species are having an overriding influence on the community differences seen between sites.
Site scores are also as seen earlier in PCA. The highest positive or negative loadings indicate sites that are dispersed far apart on the biplot (in ordination space). They will have large differences in fish community composition.
Please see Numerical Ecology in R (pp. 133 to 140). There you will find explanations for how to interpret the ordinations and the ordination diagrams shown below.
Ordination Diagrams
The biplots for the above ordination are given in Figure 2.
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
plot(spe_ca, scaling = 1, main = "CA fish abundances - biplot scaling 1")
plot(spe_ca, scaling = 2, main = "CA fish abundances - biplot scaling 2")
par(opar)
Scaling 1: This is site scaling, which emphasises relationships between rows accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are approximations of their \(\chi^{2}\) distances in multidimensional space. Objects found near a point representing a species are likely to contain a high contribution of that species. Site scaling means that sites close together in the plot have similar species relative frequencies, and any site near a species point will have a relatively large abundance of that species.
Scaling 2: Species scaling. This emphasises relationships between columns accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are not approximations of their \(\chi^{2}\) distances in multidimensional space, but the distances among species are. Species scaling means that species points close together will have similar abundance patterns across sites, and any species close to a site point is more likely to have a high abundance at that site.
Below I provide biplots with site and species scores for four selected species (Figure 3). The point size of the site scores scales with species scores: the larger the point, the greater the species score. Here, the species score is seen as a centre of abundance; it represents the species’ maximum abundance, which decreases in every direction from the centroid. The plots are augmented with response surfaces created using the ordisurf()
function. This function fits models to predict the abundance of the species Salmo trutta fario (Brown Trout), Scardinius erythrophthalmus (Rudd), Telestes souffia (Souffia or Western Vairone), and Cottus gobio (Bullhead) using a Generalised Additive Model (GAM) of the Correspondence Analysis (CA) site scores on axes 1 and 2 as the predictor variables. The response surfaces illustrate where the species are most abundant and the direction of their response.
Additionally, I used the envfit()
function to project biplot arrows for the continuous environmental variables into the ordination space. Each arrow points in the direction of the maximum increase of the variable. The length of the arrow is proportional to the correlation between the variable and the ordination axes. The significance of the correlation is tested by permutation, with significant vectors shown in red. The environmental variables are the same as those used in the PCA.
palette(viridis(8))
opar <- par(no.readonly = TRUE)
par(mar = c(4, 4, 0.9, 0.5) + .1, mfrow = c(2, 2))
invisible(ordisurf(spe_ca ~ Satr, data = spe, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Salmo trutta fario"))
abline(h = 0, v = 0, lty = 3)
invisible(ordisurf(spe_ca ~ Scer, data = spe, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Scardinius erythrophthalmus"))
abline(h = 0, v = 0, lty = 3)
invisible(ordisurf(spe_ca ~ Teso, data = spe, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Telestes souffia"))
abline(h = 0, v = 0, lty = 3)
invisible(ordisurf(spe_ca ~ Cogo, data = spe, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Cottus gobio"))
abline(h = 0, v = 0, lty = 3)
env <- env[-8, ] # because we removed the eighth site in the spp data
# A posteriori projection of environmental variables in a CA
# The last plot produced (CA scaling 2) must be active
spe_ca_env <- envfit(spe_ca, env, scaling = 2) # Scaling 2 is default
plot(spe_ca_env)
# Plot significant variables with a different colour
plot(spe_ca_env, p.max = 0.05, col = "red")
par(opar)
The species response surfaces in Figure 3 show the change of species abundance across the ordination space and the vectors indicate how the species distribution and abundance relate to the predominant environmental gradients. Seen in this way, it quickly becomes evident that the biplot is a simplification of coenospaces.
References
Reuse
Citation
@online{j._smit2021,
author = {J. Smit, Albertus},
title = {Correspondence {Analysis} {(CA)}},
date = {2021-01-01},
url = {http://tangledbank.netlify.app/BCB743/CA.html},
langid = {en}
}