Principal Coordinate Analysis (PCoA)
Type | Name | Link |
---|---|---|
Theory | Numerical Ecology in R | See pages 140-145 |
Slides | CA lecture slides | 💾 BCB743_10_PCoA.pdf |
Data | The Doubs River data | 💾 Doubs.RData |
The seaweed environmental data | 💾 SeaweedEnv.RData |
|
The seaweed bioregion classification |
💾 bioregions.csv . |
Principal Coordinates Analysis (PCoA) is also known as Classical Multidimensional Scaling (MDS). It is an ordination technique used to analyse and represent multivariate data based on a (dis)similarity matrix. We use PCoA when it is necessary to specify dissimilarity measures other than Euclidean distance or \(\chi^2\) distance, as in PCA and CA, respectively.
Unlike PCA and CA, which operate directly on the raw data matrix, PCoA takes a (dis)similarity matrix as input. This matrix can be calculated using various dissimilarity measures available in vegan’s vegdist()
function, which may be necessary when our dataset includes quantitative, semi-quantitative, qualitative, and mixed variables. If the dissimilarities are Euclidean distances, PCoA results are equivalent to those obtained from PCA.
PCoA performs a dimensionality reduction on the (dis)similarity matrix, scaling the dissimilarities and returning a set of points in a lower-dimensional space (typically 2D or 3D). When plotted, the Euclidean distances between these points approximate the original dissimilarities, effectively representing the dissimilarities between objects as Euclidean distances in a lower-dimensional space. This representation can be helpful in visualising and interpreting complex relationships in a more interpretable form.
Conceptually, PCoA is similar to PCA and CA in that it aims to represent the relationships between objects in a lower-dimensional space. However, it differs in its approach to preserving distances. While PCA preserves Euclidean distances between objects and CA preserves \(\chi^2\) distances, PCoA can preserve any (dis)similarity measure provided as input and so it is more flexible for handling a greater range of data types.
In PCoA, the eigenvalues represent the extent to which each principal coordinate axis captures the variability in the original dissimilarity matrix. The proportion of a given eigenvalue to the sum of all eigenvalues indicates the relative importance of each axis. Higher eigenvalues represent axes that capture more variance (or dissimilarity) in the data, helping identify the most significant gradients in the dataset.
While earlier versions of PCoA in vegan did not provide information about the original variables, this limitation has been overcome in newer versions of the capscale()
function.
Set-up the Analysis Environment
The Doubs River Data
We continue to use the species data:
Calculate a Suitable Dissimilarity Matrix
You may or may not want to calculate a dissimilarity index upfront (see below). Here I calculate the Bray-Curtis dissimilarity which is appropriate for abundance data:
Do the PCoA
The book Numerical Ecology in R uses a built-in function cmdscale()
or the function pcoa()
in ape for its PCoA calculation. The vegan function capscale()
can also be used for PCoA, and this is the approach I take here. The ‘CAP’ in capscale()
stands for ‘Canonical Analysis of Principal Coordinates’. capscale()
works differently from rda()
or cca()
in that we can only specify the input via a formula interface. See ?capscale
for information. To run a PCoA without constraints we use 1
on the right-hand side of the formula (this suggests that PCoA also offer options for constrained ordination), with the dissimilarity matrix on the left. Here is how, and I give three options for doing the analysis:
Option 1: Supply a precalculated dissimilarity matrix
# spe_pcoa <- cmdscale(spe_bray, k = nrow(spe) - 1, eig = TRUE)
spe_pcoa <- capscale(spe_bray ~ 1)
spe_pcoa
Call: capscale(formula = spe_bray ~ 1)
Inertia Rank
Total 6.7621
RealTotal 7.0583
Unconstrained 7.0583 17
Imaginary -0.2963
Inertia is squared Bray distance
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
3.695 1.098 0.710 0.415 0.305 0.192 0.157 0.132
(Showing 8 of 17 unconstrained eigenvalues)
When we do a summary()
of the output we see that the results are similar to that of PCA and CA, but the Species scores are missing because information about original variables (species) are not available. This is due to the fact that in this instance input into capscale()
was the square (site × site) dissimilarity matrix produced from the species table, not the raw species table itself. Here is the output:
Call:
capscale(formula = spe_bray ~ 1)
Partitioning of squared Bray distance:
Inertia Proportion
Total 7.058 1
Unconstrained 7.058 1
Eigenvalues, and their contribution to the squared Bray distance
Importance of components:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7
Eigenvalue 3.6953 1.0985 0.7105 0.41497 0.30456 0.19179 0.15697
Proportion Explained 0.5235 0.1556 0.1007 0.05879 0.04315 0.02717 0.02224
Cumulative Proportion 0.5235 0.6792 0.7798 0.83862 0.88177 0.90894 0.93118
MDS8 MDS9 MDS10 MDS11 MDS12 MDS13
Eigenvalue 0.13191 0.12943 0.08668 0.046158 0.038645 0.02746
Proportion Explained 0.01869 0.01834 0.01228 0.006539 0.005475 0.00389
Cumulative Proportion 0.94987 0.96820 0.98048 0.987023 0.992498 0.99639
MDS14 MDS15 MDS16 MDS17
Eigenvalue 0.013065 0.007088 0.0040395 0.0013006
Proportion Explained 0.001851 0.001004 0.0005723 0.0001843
Cumulative Proportion 0.998239 0.999243 0.9998157 1.0000000
Option 2: Supply the raw data to capscale()
We can provide the raw species table instead and request that capscale()
calculates the required dissimilarity indices by automagically calling vegdist()
. The advantage of this approach is that it adds species scores as weighted sums of (residual) community matrix, whereas only providing the pre-calculated dissimilarity matrix provides no fixed method for adding species scores. I advocate providing a raw species table to capscale()
to retain the species information. This avoids many problems later on, such as having to calculate the weighted species scores ourselves.
Call: capscale(formula = spe ~ 1, distance = "bray")
Inertia Rank
Total 6.7621
RealTotal 7.0583
Unconstrained 7.0583 17
Imaginary -0.2963
Inertia is squared Bray distance
Species scores projected from 'spe'
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
3.695 1.098 0.710 0.415 0.305 0.192 0.157 0.132
(Showing 8 of 17 unconstrained eigenvalues)
summary()
now produces a familiar and more complete output:
Call:
capscale(formula = spe ~ 1, distance = "bray")
Partitioning of squared Bray distance:
Inertia Proportion
Total 7.058 1
Unconstrained 7.058 1
Eigenvalues, and their contribution to the squared Bray distance
Importance of components:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7
Eigenvalue 3.6953 1.0985 0.7105 0.41497 0.30456 0.19179 0.15697
Proportion Explained 0.5235 0.1556 0.1007 0.05879 0.04315 0.02717 0.02224
Cumulative Proportion 0.5235 0.6792 0.7798 0.83862 0.88177 0.90894 0.93118
MDS8 MDS9 MDS10 MDS11 MDS12 MDS13
Eigenvalue 0.13191 0.12943 0.08668 0.046158 0.038645 0.02746
Proportion Explained 0.01869 0.01834 0.01228 0.006539 0.005475 0.00389
Cumulative Proportion 0.94987 0.96820 0.98048 0.987023 0.992498 0.99639
MDS14 MDS15 MDS16 MDS17
Eigenvalue 0.013065 0.007088 0.0040395 0.0013006
Proportion Explained 0.001851 0.001004 0.0005723 0.0001843
Cumulative Proportion 0.998239 0.999243 0.9998157 1.0000000
Option 3: Use pre-made dissimilarity matrix and add species back using sppscores()
Another approach to add back the species information into the ordination object produced by supplying the pre-made dissimialrity matrix to capscale()
:
Call:
capscale(formula = spe_bray ~ 1)
Partitioning of squared Bray distance:
Inertia Proportion
Total 7.058 1
Unconstrained 7.058 1
Eigenvalues, and their contribution to the squared Bray distance
Importance of components:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7
Eigenvalue 3.6953 1.0985 0.7105 0.41497 0.30456 0.19179 0.15697
Proportion Explained 0.5235 0.1556 0.1007 0.05879 0.04315 0.02717 0.02224
Cumulative Proportion 0.5235 0.6792 0.7798 0.83862 0.88177 0.90894 0.93118
MDS8 MDS9 MDS10 MDS11 MDS12 MDS13
Eigenvalue 0.13191 0.12943 0.08668 0.046158 0.038645 0.02746
Proportion Explained 0.01869 0.01834 0.01228 0.006539 0.005475 0.00389
Cumulative Proportion 0.94987 0.96820 0.98048 0.987023 0.992498 0.99639
MDS14 MDS15 MDS16 MDS17
Eigenvalue 0.013065 0.007088 0.0040395 0.0013006
Proportion Explained 0.001851 0.001004 0.0005723 0.0001843
Cumulative Proportion 0.998239 0.999243 0.9998157 1.0000000
We can unpack what is inside the results, and there we can see that we can access the eigenvalues as we did for PCA and CA:
The percentage inertia explained by the first three axes is therefore:
# The `bstick()` function is not compatible with PCoA
# when negative eigenvalues are present
# Plot the scree plot without the broken stick model
# Extract eigenvalues
eigenvalues <- spe_pcoa$CA$eig
# Calculate the proportion of variance explained
variance_explained <- eigenvalues / sum(eigenvalues)
# Create a scree plot
plot(variance_explained, type = "b", main = "Scree Plot",
xlab = "Principal Components", ylab = "Prop. Var. Explained")
The scree plot (Figure 1) shows the proportion of variation explained by the PC axes. In this case, we will still only retain the first two axes. I selected these because after the 2nd PC, the proportion of variance explained by each additional PC is less than 10% and the plot starts levelling off.
See Numerical Ecology in R (pp. 140 to 145) for information about the interpretation of a PCoA and the ordination diagrams shown below.
Ordination Diagrams
We create the ordination diagrammes as before:
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
plot(spe_pcoa, scaling = 1, main = "PCoA fish abundances - biplot scaling 1")
plot(spe_pcoa, scaling = 2, main = "PCoA fish abundances - biplot scaling 2")
par(opar)
Scaling 1 and scaling 2 is the same as in PCA and CA.
The plots above work okay, but we can improve them. Note that you can also apply these improvements to PCA and CA ordinations. Let us build plots from scratch:
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
pl1 <- ordiplot(spe_pcoa, type = "none", scaling = 1,
main = "PCoA fish abundances - biplot scaling 1")
points(pl1, "sites", pch = 21, cex = 1.75, col = "grey80", bg = "grey80")
points(pl1, "species", pch = 21, col = "turquoise", arrows = TRUE)
text(pl1, "species", col = "blue4", cex = 0.9)
text(pl1, "sites", col = "red4", cex = 0.9)
pl2 <- ordiplot(spe_pcoa, type = "none", scaling = 2,
main = "PCoA fish abundances - biplot scaling 2")
points(pl2, "sites", pch = 21, cex = 1.75, col = "grey80", bg = "grey80")
points(pl2, "species", pch = 21, col = "turquoise", arrows = TRUE)
text(pl2, "species", col = "blue4", cex = 0.9)
text(pl2, "sites", col = "red4", cex = 0.9)
par(opar)
We can also fit response surfaces using ordisurf()
:
require('viridis')
palette(viridis(8))
opar <- par(no.readonly = TRUE)
par(mar = c(4, 4, 0.9, 0.5) + .1, mfrow = c(2, 2))
with(spe, tmp <- ordisurf(spe_pcoa ~ Satr, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Salmo trutta fario"))
abline(h = 0, v = 0, lty = 3)
with(spe, tmp <- ordisurf(spe_pcoa ~ Scer, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Scardinius erythrophthalmus"))
abline(h = 0, v = 0, lty = 3)
with(spe, tmp <- ordisurf(spe_pcoa ~ Teso, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Telestes souffia"))
abline(h = 0, v = 0, lty = 3)
with(spe, tmp <- ordisurf(spe_pcoa ~ Cogo, bubble = 3,
family = quasipoisson, knots = 2, col = 6,
display = "sites", main = "Cottus gobio"))
abline(h = 0, v = 0, lty = 3)
env <- dplyr::slice(env, -8)
(spe_pcoa_env <- envfit(spe_pcoa, env, scaling = 2))
***VECTORS
MDS1 MDS2 r2 Pr(>r)
dfs 0.99710 0.07609 0.7210 0.001 ***
ele -0.99807 -0.06208 0.5659 0.001 ***
slo -0.92225 0.38660 0.1078 0.136
dis 0.99746 -0.07129 0.5324 0.001 ***
pH -0.42673 -0.90438 0.0480 0.503
har 0.98804 0.15417 0.2769 0.023 *
pho 0.45343 0.89129 0.6912 0.001 ***
nit 0.86338 0.50456 0.6117 0.001 ***
amm 0.42719 0.90416 0.7076 0.001 ***
oxy -0.76847 -0.63989 0.7639 0.001 ***
bod 0.43152 0.90210 0.8561 0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Permutation: free
Number of permutations: 999
Handling Mixed Variable Types
The simplest way to handle mixed variable types is to use simply plot the factor variable of interest as a differently shaped or coloured symbol on the ordination diagram. I will demonstrate this with the seaweed dataset.
First, I construct an environmental dataset that contains some mixed variables by column binding a dataset of seawater temperatures and a bioregional classification of the 58 coastal sections:
bioreg <- read.csv(paste0(root, "seaweed/bioregions.csv"), header = TRUE)
load(paste0(root, "seaweed/SeaweedEnv.RData"))
E <- cbind(bioreg, env) %>%
mutate(spal.prov = factor(spal.prov),
spal.ecoreg = factor(spal.ecoreg),
lombard = factor(lombard),
bolton = factor(bolton))
head(E)
spal.prov spal.ecoreg lombard bolton febMean febMax febMed febX95
1 BMP NE NamBR BMP 13.00117 18.72044 12.66004 16.80969
2 BMP NE NamBR BMP 13.37950 18.61897 13.18389 17.07242
3 BMP NE NamBR BMP 13.36163 17.86458 13.23187 16.61114
4 BMP NE NamBR BMP 13.28966 17.12073 13.10284 16.12137
5 BMP NE NamBR BMP 12.81128 16.37829 12.40032 15.53240
6 BMP NE NamBR BMP 12.40247 15.96730 11.75096 15.21999
febRange augMean augMin augMed augX5 augRange annMean annSD
1 6.070326 11.75228 9.812431 11.82838 10.12598 2.502093 12.33503 1.255298
2 5.889300 11.57731 9.739288 11.61312 10.08165 2.973370 12.38795 1.401646
3 5.431383 11.29382 9.619388 11.26842 10.01617 3.084130 12.24332 1.474712
4 5.049024 11.13296 9.567049 11.02333 10.03277 2.995822 12.15410 1.505176
5 4.977916 11.23448 9.624302 10.99935 10.17375 2.940255 11.94613 1.449530
6 5.142721 11.50199 9.757004 11.15880 10.38581 2.925087 11.83773 1.385862
annRange febSD augSD annChl augChl febChl
1 1.2488912 1.625917 0.7665420 2.623040 11.070480 8.884580
2 1.8021850 1.753863 0.8969112 4.903870 8.760170 8.401560
3 2.0678127 1.703917 0.9408326 3.723187 8.356506 6.718254
4 2.1567012 1.593944 0.9393490 4.165980 4.164904 3.727157
5 1.5767921 1.517366 0.9542671 8.020257 8.765154 8.786165
6 0.9004776 1.501801 0.9768441 12.882601 7.591975 9.160030
'data.frame': 58 obs. of 22 variables:
$ spal.prov : Factor w/ 2 levels "AMP","BMP": 2 2 2 2 2 2 2 2 2 2 ...
$ spal.ecoreg: Factor w/ 2 levels "ABE","NE": 2 2 2 2 2 2 2 2 2 2 ...
$ lombard : Factor w/ 4 levels "ABR","NamBR",..: 2 2 2 2 2 2 2 2 2 2 ...
$ bolton : Factor w/ 4 levels "AMP","B-ATZ",..: 3 3 3 3 3 3 3 3 3 3 ...
$ febMean : num 13 13.4 13.4 13.3 12.8 ...
$ febMax : num 18.7 18.6 17.9 17.1 16.4 ...
$ febMed : num 12.7 13.2 13.2 13.1 12.4 ...
$ febX95 : num 16.8 17.1 16.6 16.1 15.5 ...
$ febRange : num 6.07 5.89 5.43 5.05 4.98 ...
$ augMean : num 11.8 11.6 11.3 11.1 11.2 ...
$ augMin : num 9.81 9.74 9.62 9.57 9.62 ...
$ augMed : num 11.8 11.6 11.3 11 11 ...
$ augX5 : num 10.1 10.1 10 10 10.2 ...
$ augRange : num 2.5 2.97 3.08 3 2.94 ...
$ annMean : num 12.3 12.4 12.2 12.2 11.9 ...
$ annSD : num 1.26 1.4 1.47 1.51 1.45 ...
$ annRange : num 1.25 1.8 2.07 2.16 1.58 ...
$ febSD : num 1.63 1.75 1.7 1.59 1.52 ...
$ augSD : num 0.767 0.897 0.941 0.939 0.954 ...
$ annChl : num 2.62 4.9 3.72 4.17 8.02 ...
$ augChl : num 11.07 8.76 8.36 4.16 8.77 ...
$ febChl : num 8.88 8.4 6.72 3.73 8.79 ...
In Smit et al. (2017) I used forward selection and the assessment of VIF to find only the non-collinear variables, which included augMean
, febRange
, febSD
and augSD
as the most parsimonious descriptors. I’ll do a PCoA of the seaweed environmental data (only the subset indicated above) and colour the sites by the bioregion:
E_pcoa <- capscale(env[, c("augMean", "febRange", "febSD", "augSD")] ~ 1)
col <- c("firebrick1", "seagreen4", "blue2", "goldenrod2")
pch <- c(17, 19)
opar <- par(no.readonly = TRUE)
ordiplot(E_pcoa, type = "n", scaling = 1,
xlim = c(-1.8, 2), ylim = c(-2.8, 1.2),
main = "PCoA of seaweed env. data")
points(E_pcoa, "sites", pch = 21, cex = 1.75, col = col[E$bolton], bg = "white")
text(E_pcoa, "sites", col = col[E$bolton], cex = 0.5)
The arrangement of sites in the ordination diagram (Figure 5) is still only affected by augMean
, febRange
, febSD
, and augSD
and colour is used only to identify the sites by bioregion. Colour does not affect the outcome of the ordination, and yet we can already see that sites show a clear bioregional grouping. Please see the analysis of the Mayombo diatom dataset for additional approaches to deal with categorical variables that are presumed to be influential. To formally account for bioregion in the ordination, we must take a different approach.
The second option is to use a suitable dissimilarity metric. Numerical, nominal, ordinal, and binary variables can all be accommodated with the Gower distance. We do not use vegan for this, but rather the daisy()
function in cluster.
We calculate the Gower distances and proceed with the PCoA as before:
library(cluster)
# cannot use mixed var types
# E_gower <- vegdist(E, method = "gower")
# can handle mixed var types... use instead of vegdist() gower dissimilarity
E_gower <- daisy(E, metric = "gower")
summary(E_gower)
1653 dissimilarities, summarized :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.006058 0.181880 0.344160 0.321890 0.443730 0.724140
Metric : mixed ; Types = N, N, N, N, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I
Number of objects : 58
1 2 3 4 5
1 0.00000000 0.03893923 0.05586573 0.09502716 0.06723042
2 0.03893923 0.00000000 0.02753347 0.06586077 0.04354363
3 0.05586573 0.02753347 0.00000000 0.04284923 0.04719106
4 0.09502716 0.06586077 0.04284923 0.00000000 0.06826655
5 0.06723042 0.04354363 0.04719106 0.06826655 0.00000000
E_pcoa <- capscale(E_mat ~ 1)
# sadly this means that the names in the Spcies scores are now missing
summary(E_pcoa)
Call:
capscale(formula = E_mat ~ 1)
Partitioning of squared Unknown distance:
Inertia Proportion
Total 3.93 1
Unconstrained 3.93 1
Eigenvalues, and their contribution to the squared Unknown distance
Importance of components:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7
Eigenvalue 2.2311 1.0530 0.16179 0.11635 0.08678 0.05990 0.05202
Proportion Explained 0.5678 0.2680 0.04117 0.02961 0.02208 0.01524 0.01324
Cumulative Proportion 0.5678 0.8358 0.87693 0.90654 0.92863 0.94387 0.95711
MDS8 MDS9 MDS10 MDS11 MDS12 MDS13
Eigenvalue 0.04274 0.023523 0.019153 0.016634 0.013035 0.010275
Proportion Explained 0.01088 0.005986 0.004874 0.004233 0.003317 0.002615
Cumulative Proportion 0.96799 0.973971 0.978845 0.983079 0.986396 0.989011
MDS14 MDS15 MDS16 MDS17 MDS18 MDS19
Eigenvalue 0.008646 0.006938 0.005866 0.005202 0.004133 0.0033437
Proportion Explained 0.002200 0.001766 0.001493 0.001324 0.001052 0.0008509
Cumulative Proportion 0.991211 0.992976 0.994469 0.995793 0.996845 0.9976959
MDS20 MDS21 MDS22 MDS23 MDS24
Eigenvalue 0.0029563 0.0020816 0.0012713 0.001041 0.0007704
Proportion Explained 0.0007523 0.0005297 0.0003235 0.000265 0.0001961
Cumulative Proportion 0.9984482 0.9989779 0.9993015 0.999567 0.9997626
MDS25 MDS26 MDS27
Eigenvalue 3.718e-04 2.940e-04 2.671e-04
Proportion Explained 9.463e-05 7.482e-05 6.798e-05
Cumulative Proportion 9.999e-01 9.999e-01 1.000e+00
We can extract the various kinds of scores for manual plotting.
References
Reuse
Citation
@online{j._smit2021,
author = {J. Smit, Albertus},
title = {Principal {Coordinate} {Analysis} {(PCoA)}},
date = {2021-01-01},
url = {http://tangledbank.netlify.app/BCB743/PCoA.html},
langid = {en}
}