Principal Component Analysis (PCA)

Author

Affiliation

Published

January 1, 2021

Material required for this chapter

Type	Name	Link
Theory	Numerical Ecology in R	See pages 117-132
Slides	PCA lecture slides	💾 `BCB743_08_PCA.pdf`
Data	The Doubs River data	💾 `Doubs.RData`
R function	A function for ordination plots	💾 `cleanplot.pca.R`

Tasks to complete in this Chapter

Task C 1–5

PCA as a foundation for undertanding ordination

Thoroughly understanding PCA is a prerequisite for understanding more complex ordination techniques. Mastered this chapter before moving on to CA, PCoA, RDA, and CCA. Much of the same terminology and concepts are used in these techniques.

Ordination refers to a suite of multivariate techniques that reduce a high-dimensional dataset into a lower-dimensional space, typically 2D or 3D, in such a way that any intrinsic structure in the data forms visually-discernible patterns (Pielou, 1984). In ecology, ordination techniques are used to describe relationships between community structure patterns and underlying environmental gradients. They allow us to determine the relative importance of different gradients and visualise species-environment relationships.

Principal Component Analysis (PCA) is one of the commonly used ordination techniques in ecology. It is a dimensionality reduction technique that transforms the original set of variables into a new set of uncorrelated variables called principal components. PCA is performed on a data matrix containing species abundances or environmental variables across multiple sites or samples.

The PCA process involves calculating the eigenvectors and eigenvalues of the covariance or correlation matrix of the data. The eigenvectors represent the directions of maximum variance in the data, and the corresponding eigenvalues represent the amount of variance explained by each eigenvector. The new axes, called principal components, are linear combinations of the original variables, ordered by the amount of variance they explain.

PCA preserves the Euclidean distances between samples in the original high-dimensional space when projecting them onto the lower-dimensional ordination space. This property makes PCA more suitable for analysing environmental data, where Euclidean distances between samples are meaningful and interpretable. However, for species data, which is often in the form of counts or frequencies, Euclidean distances may not be an appropriate measure of dissimilarity between samples.

The Horseshoe Effect

The ‘horseshoe effect’ (sometimes called the Guttman effect) is an artefact often seen with PCA when applied to species data, especially when using species abundance data for communities along environmental gradients. It distorts the data points in the ordination space. A less severe version of the horseshoe effect is called the ‘arch effect’ and is seen in Correspondence Analysis (CA).

The horseshoe effect occurs because PCA assumes linear relationships between variables, while species data often exhibit unimodal responses to environmental gradients. The unimodal model was discussed in BDC334. When species have unimodal distributions along a gradient, PCA tends to fold the ends of the gradient towards each other. This is visible as a horseshoe-shaped pattern in the ordination diagram.

This distortion can lead to several issues:

The horseshoe shape can make it appear that sites at opposite ends of the gradient are more similar than they really are.
The folding of the gradient ends compresses the data and potentially obscures important ecological patterns.
The second PCA axis—the most affected axis—often doesn’t represent a meaningful ecological gradient, making interpretation challenging.
Unlike the arch effect in CA, the horseshoe effect in PCA can lead to incorrect ordering of samples along the gradient.

To address these issues, we prefer to use non-metric Multidimensional Scaling (nMDS) or a distance-based method (like Principal Coordinates Analysis, PCoA) that is less susceptible to this artefact. Alternatively, we may use techniques specifically designed to handle unimodal species responses, such as Correspondence Analysis (CA), which uses $χ^{2}$ -distance and not Euclidean distance, or its detrended version (Detrended Correspondence Analysis, DCA), but these come with their own set of considerations.

Set-up the Analysis Environment

library(tidyverse)
library(vegan)
library(ggcorrplot) # for the correlations
library(ggpubr)

The Doubs River Data

load("../data/NEwR-2ed_code_data/NEwR2-Data/Doubs.RData")
head(env)

   dfs ele  slo  dis  pH har  pho  nit  amm  oxy bod
1  0.3 934 48.0 0.84 7.9  45 0.01 0.20 0.00 12.2 2.7
2  2.2 932  3.0 1.00 8.0  40 0.02 0.20 0.10 10.3 1.9
3 10.2 914  3.7 1.80 8.3  52 0.05 0.22 0.05 10.5 3.5
4 18.5 854  3.2 2.53 8.0  72 0.10 0.21 0.00 11.0 1.3
5 21.5 849  2.3 2.64 8.1  84 0.38 0.52 0.20  8.0 6.2
6 32.4 846  3.2 2.86 7.9  60 0.20 0.15 0.00 10.2 5.3

First Do a Correlation

# computing a correlation matrix
corr <- round(cor(env), 1)

# visualisation of the correlation matrix
ggcorrplot(corr, type = 'upper', outline.col = "white",
           colors = c("#1679a1", "white", "#f8766d"),
           lab = TRUE)

Figure 1: Pairwise correlations among the environmental variables included with the Doubs River study.

Some variables are very correlated, and they might be omitted from the subsequent analyses. We say that these variables are ‘collinear.’ Collinear variables cannot be teased apart in terms of finding out which one is most influential in structuring the community. There are more advanced ways to search for collinear variables (e.g. Variance Inflation Factors, VIF) and in this way we can systematically exclude them from the PCA. See Graham (2003) for a discussion on collinearity. Here we will proceed with all the variables.

See the Spatial Context

The patterns in the data and the correlations between them will make more sense if we can visualise a spatial context. Thankfully spatial data are available:

head(spa)

       X      Y
1 85.678 20.000
2 84.955 20.100
3 92.301 23.796
4 91.280 26.431
5 92.005 29.163
6 95.954 36.315

ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1) +
  geom_label(vjust = 0, nudge_y = 0.5, check_overlap = TRUE)

Figure 2: The spatial configuration of the Doubs River sites.

These site numbers correspond approximately to the ones in Verneaux (1973) but some of the numbers may have been shifted slightly in the example Doubs dataset used here compared to how they were originally numbered in Verneaux’s thesis and subsequent publication. This should not affect the interpretation. We can also scale the symbol size by the magnitude of the environmental variables. Lets look at two pairs of variables that are strongly correlated with one-another:

# We scale the data first so as to better represent the full
# magnitude of all variables with a common symbol size
env_std <- decostand(env, method = "standardize")

# positive correlations
plt1 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "red", aes(size = env_std$amm, shape = 3)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(size = "Magnitude", title = "Ammonium concentration")

plt2 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "red", aes(size = env_std$bod)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Biological oxygen demand")

# inverse correlations
plt3 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "blue", aes(size = env_std$alt)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Altitude")

plt4 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "blue", aes(size = env_std$flo)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Flow rate")

ggarrange(plt1, plt2, plt3, plt4, nrow = 2, ncol = 2,
          common.legend = TRUE, labels = "AUTO")

Figure 3: Four different representations of the site configuration (spatial context) of the Doubs River sampling layout. Symbols are scaled relative to A) ammonium concentration, B) BOD, C) Altitude, and D) Flow rate.

Do the PCA

We use the function rda() to do the PCA, but it can also be performed in base R with the functions prcomp() and princomp(). rda() is the same function that we will use later for a Redundancy Analysis, but when used without specifying constraints (as we do here) it amounts to simply doing a PCA. Typically we standardise environmental data to unit variance, but the PCA done by the rda() function accomplishes this step automagically when scale = TRUE. When applied to environmental data (as we typically do with a PCA) it works with correlations amongst the scaled variables. PCA preserves Euclidean distance and the relationships detected are linear, and for this reason it is not typically applied to species data without suitable transformations. In fact, in this module we will seldom apply a PCA to species data at all.

env_pca <- rda(env, scale = TRUE)
env_pca

Call: rda(X = env, scale = TRUE)

-- Model Summary --

              Inertia Rank
Total              11     
Unconstrained      11   11

Inertia is correlations

-- Eigenvalues --

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10  PC11 
5.969 2.164 1.065 0.739 0.400 0.336 0.173 0.108 0.024 0.017 0.006

# same ...
# env_std <- scale(env)
# env_pca <- rda(env_std, scale = FALSE)
# env_pca

In ordination we use the term inertia as a synonym for ‘variation’, but some PCA software (such as R’s prcomp() and princomp()) simply uses the term sdev for standard deviations. In PCA, when we use a correlation matrix (as we do here), the inertia is the sum of the diagonal values of the correlation matrix, which is simply the number of variables (11 in this example). When a PCA uses a covariance matrix the inertia is the sum of the variances of the variables.

You will also see in the output the mention of the term ‘unconstrained’. In a PCA the analysis is always unconstrained (i.e. not influenced by some a priori defined variables we hypothesise to explain the between site patterns in the multivariate data).

The section headed Eigenvalues for unconstrained axes shows the relative importance of the resultant reduced axes, and they can be used to determine the proportion of the total inertia (sum of the eigenvalues) captured by any one of the axes. They can be accessed with the function eigenvals() (the preferred function; see ?rda for help), but an alternative method is given below. The first eigenvalue (the one associated with PC1) always explains the most variation (the largest fraction), and each subsequent one explains the largest proportion of the remaining variance. We say the axes are orthogonal and ranked in decreasing order of importance. The sum of all eigenvalues is the total inertia, so collectively they theoretically can explain all of the variation in the dataset (but clearly they should not be used to explain all the variance). To extract the first eigenvalue we can do:

round(eigenvals(env_pca)[1], 3)

  PC1 
5.969

# or

round(env_pca$CA$eig[1], 3)

  PC1 
5.969

The total inertia is:

sum(eigenvals(env_pca))

[1] 11

# or

sum(env_pca$CA$eig)

[1] 11

So the proportion of variation explained by the first PC is:

round(env_pca$CA$eig[1] / sum(env_pca$CA$eig) * 100, 1) # result in %

 PC1 
54.3

We can show the same information as part of a more verbose summary. Here we see the pre-calculated Proportion Explained and Cumulative Proportion (it should be obvious what this is). There is also an assortment of other information, viz. Scaling 2 for species and site scores, Species scores, and Site scores.

summary(env_pca)


Call:
rda(X = env, scale = TRUE) 

Partitioning of correlations:
              Inertia Proportion
Total              11          1
Unconstrained      11          1

Eigenvalues, and their contribution to the correlations 

Importance of components:
                         PC1    PC2     PC3     PC4     PC5     PC6     PC7
Eigenvalue            5.9687 2.1639 1.06517 0.73875 0.40019 0.33563 0.17263
Proportion Explained  0.5426 0.1967 0.09683 0.06716 0.03638 0.03051 0.01569
Cumulative Proportion 0.5426 0.7393 0.83616 0.90332 0.93970 0.97022 0.98591
                           PC8      PC9     PC10     PC11
Eigenvalue            0.108228 0.023701 0.017083 0.005983
Proportion Explained  0.009839 0.002155 0.001553 0.000544
Cumulative Proportion 0.995748 0.997903 0.999456 1.000000

Newer versions of the ordinations’ summary() method no longer shows the species and site scores. To access these scores, we use the scores() function:

The species scores can be extracted from the ordination object by:

# extract species scores for first three PCA axes:
scores(env_pca, display = "species", choices = 1:3)

            PC1        PC2         PC3
dfs  1.08657429  0.5342463 -0.27333233
ele -1.04396094 -0.6147742  0.20711738
slo -0.57702931 -0.4892811 -0.63490306
dis  0.95843365  0.6608150 -0.32456247
pH  -0.06364143  0.4629373  1.01316786
har  0.90118054  0.5849982  0.06448776
pho  1.05820606 -0.6013982  0.13865996
nit  1.15013343 -0.1004908 -0.05166769
amm  1.00678966 -0.6969288  0.14076555
oxy -0.97458581  0.4990886 -0.09017454
bod  0.97315389 -0.7148470  0.15145231
attr(,"const")
[1] 4.226177

Similarly, for the site scores, do:

# extract the site scores for axes 1 and 2
scores(env_pca, display = "site", choices = 1:3) |> 
  head() # truncate to save space in this example

         PC1        PC2         PC3
1 -1.4127388 -1.4009788 -2.03483870
2 -1.0372473 -0.7795535  0.24400009
3 -0.9450700 -0.4676536  1.25042488
4 -0.8737116 -0.2698849  0.19304045
5 -0.4208759 -0.6694396  0.83190665
6 -0.7722358 -0.7206662 -0.07357441

The scores() method also applies to other ordination methods.

Question

We are dealing with environmental variables here, so why do we call

Terminology

Ordination methods in vegan such as rda() for a PCA/RDA or cca() for a CA/CCA is performed with a singular-value decomposition (SVD). Once the data matrix $X$ has been centred (and, where necessary, scaled or $χ^{2}$ -transformed), the SVD computes

$X = U Σ V^{T},$

with orthonormal left-singular vectors $U$ , orthonormal right-singular vectors $V$ , and positive singular values on the diagonal of $Σ$ . From here, all the components of our ordination can be extracted. How do these relate to our data?

Eigenvectors are the directions in the original variable space along which the variance in the data is maximised. They relate to our original data in two ways:
- $V$ is the matrix of right singular vectors; they are the eigenvectors of the columns (species or variables). The quantities can be accessed by env_pca$CA$v.
- $U$ is the matrix of left singular vectors; i.e., the eigenvectors of the rows (sites). The left singular vectors can be accessed by env_pca$CA$u
$Σ$ is a diagonal matrix of singular values $d$ , and the eigenvalues are the square of these singular values, $e i g = d^{2}$ . Eigenvalues represent the amount of variance explained by each corresponding eigenvector (or principal component), represented in PCA (and other unconstrained ordinations such as CA) by env_pca$CA$eig. Higher eigenvalues indicate that the corresponding principal component explains a larger portion of the total variance in the data. The sum of all eigenvalues is equal to the total variance in the data.

Let’s map these directly to our env_pca object, which I’ll assume was created with the rda() function. Looking inside the env_pca object, we’ll see a structure with a component named $C A$ or $C C A$ . This holds the results for the constrained/canonical analysis. For a simple PCA, all the results are in $C A$ . The eigenvectors (env_pca$CA$eig) define the direction of the new axes, they aren’t the final coordinates for ordination plots. Projection is the process of mapping the original data points onto the new reduced axes (principal components). What we plot are actually these scores.

Scores are the coordinates of the original sites and species projected onto the new ordination axes. They are calculated by scaling the eigenvectors. These scores are what we actually plot to create a biplot or triplot.
- The Site Scores tell us where each site (sample) is located along the new ordination axes. Sites that are close together in the ordination plot have similar variable (or species) compositions. We obtain the site scores as scores(env_pca, display = "site", ...).
- The Species Scores (or Variable Scores) inform us about how the original variables (whatever is in the columns, i.e., species or environmental variables) relate to the ordination axes. Sometimes we call the species scores loadings. In a PCA biplot, these are represented by arrows (but in CA not, as that reduced space is no longer linear). An arrow pointing strongly towards PCA1 means that species or variable is a major contributor to that axis of variation. So, species scores represent the contributions of the original variables to the principal components. They indicate the strength and direction of the correlation between the original variables and the new principal components. Species scores are obtained with scores(env_pca, display = "species", ...). Larger positive values indicate a stronger positive correlation, while larger negative values indicate a stronger negative correlation. Even though the term “species scores” is used in the software, it refers to the loadings of the original variables (environmental variables in this case).

Here’s the direct mapping:

Mathematical Concept	`vegan` Term	Where to Find It in the `rda` Object	Description
Eigenvalues	Eigenvalues	`env_pca$CA$eig`	The amount of variance explained by each axis (component). `summary(env_pca)` shows this clearly.
Eigenvectors of Variables	Species Scores	`env_pca$CA$v`	The raw eigenvectors for the columns of the data (the variables). These are scaled to become the “species scores” that you plot as arrows.
Eigenvectors of Sites	Site Scores	`env_pca$CA$u`	The raw eigenvectors for the rows of the data (the sites). These are scaled to become the “site scores” that you plot as points.
Plottable Coordinates	Scores	`scores(env_pca)`	The function `scores()` extracts and properly scales the raw eigenvectors (`$u` and `$v`) to give the coordinates ready for plotting.

When interpreting a PCA biplot (see below), the species scores (loadings) indicate the contributions of the original variables to the principal components, while the site scores represent the positions of the sites in the reduced-dimensional space defined by those principal components. The direction and length of the species score vectors (arrows) provide information about the relationships between the original variables and the principal components, while the positions of the site points reflect the similarities or differences between sites based on the environmental gradients represented by the principal components.

How Many Axes to Retain?

The number of axes to retain is a difficult question to answer. The first few axes will always explain the most variation, but how do we know how many reduced axes are influential and should be kept? Commonly recommended is the broken stick method—keep the principal components whose eigenvalues are higher than corresponding random broken stick components:

# make a scree plot using the vegan function:
screeplot(env_pca, bstick = TRUE, type = "lines")

Figure 4: Scree plot of the Doubs River environmental data PCA.

Or I can make a scree plot using ggplot2, which is more flexible:

scree_dat <- data.frame(eigenvalue = as.vector(eigenvals(env_pca)),
                        bstick = bstick(env_pca))
scree_dat$axis <- rownames(scree_dat)
rownames(scree_dat) <- NULL
scree_dat <- scree_dat |> 
  mutate(axis = factor(axis, levels = paste0(rep("PC", 11), seq(1:11))))

ggplot(data = scree_dat, aes(x = axis, y = eigenvalue)) +
  geom_point() +
  geom_line(aes(group = 1)) +
  geom_point(aes(y = bstick), colour = "red") +
  geom_line(aes(y = bstick, group = 1), colour = "red") +
  labs(x = "Principal component", y = "Inertia")

Figure 5: Scree plot of the Doubs River environmental data PCA made in **ggplot2**.

In the plot, above, the red line is the broken stick components and the black line the eigenvalues for the different PCs. The plot suggests keeping the first two PC axes, which explain approximately 74% of the total inertia. See Numerical Ecology with R pp. 121-122 for more information about how to decide how many PCs to retain.

Ordination Diagrams

I provide some examples of ordination diagrams scattered throughout the course content (e.g. here), but you may also refer to the step-by-step walk throughs provided by Roeland Kindt. Also see David Zelený’s excellent writing on the topic.

Let us look at examples. In a PCA ordination diagram, following the tradition of scatter diagrams in Cartesian coordinate systems, objects are represented as points and variables are displayed as arrows. We first use the standard vegan biplot() function:

opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
biplot(env_pca, scaling = 1, main = "PCA scaling 1", choices = c(1, 2))
biplot(env_pca, scaling = 2, main = "PCA scaling 2", choices = c(1, 2))
par(opar)

Scaling

Scaling 1: This scaling emphasises relationships between rows accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are approximations of their Euclidian distances in multidimensional space. Objects positioned further apart show a greater degree of environmental dissimilarity. The angles among descriptor vectors should not be interpreted as indicating the degree of correlation between the variables.

Scaling 2: This scaling emphasises relationships between columns accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are not approximations of their Euclidian distances in multidimensional space. The angles among descriptor vectors can be interpreted as indicating the degree of correlation between the variables.

Now we create biplots using the cleanplot.pca() function that comes with the Numerical Ecology in R book. The figures are more or less the same, except the plot showing the Site scores with Scaling 1 adds a ‘circle of equilibrium contribution’ (see Numerical Ecolology with R, p. 125). The circle of equilibrium contribution is a visual aid drawn on a biplot to help assess the relative importance of species (or variables) in the ordination space. It’s most useful in PCA biplots using scaling 1, where the focus is on species relationships. The circle is not a formal statistical test. It helps us to quickly identify the most important variables or species, but it doesn’t directly indicate statistical significance.

We only assign importance to the arrows that extend beyond the radius of the circle (Figure 7):

# we need to load the function first from its R file:
source("../data/NEwR-2ed_code_data/NEwR2-Functions/cleanplot.pca.R")
cleanplot.pca(env_pca, scaling = 1)

Figure 7: Ordination plot of the Doubs River environmental data made with the `cleanplot.pca()` function.

At this point it is essential that you refer to Numerical Ecology in R (pp. 118 to 126) for help with interpreting the ordination diagrams.

Fitting Environmental Response Surfaces

The ordisurf() function in vegan is used to visualise underlying environmental gradients on an ordination plot. This function fits a smooth surface (usually a generalised additive model GAM fitted to the site scores of PC axes of interest) to the ordination plot based on a specified environmental variable. It highlights how that variable changes across the ordination space so that we may interpret the spatial structure of the data in relation to the environmental gradient more easily.

For more about ordisurf(), see Gavin Simpson’s blog post What is ordifurf() doing?.

We plot the response surfaces for elevation and biological oxygen demand:

biplot(env_pca, type = c("text", "points"), col = c("black", "black"))
1invisible(ordisurf(env_pca ~ bod, env, add = TRUE, col = "turquoise", knots = 1))
invisible(ordisurf(env_pca ~ ele, env, add = TRUE, col = "salmon", knots = 1))

1: invisible() is used to suppress the output of the ordisurf() function; only the figure is returned.

Figure 8: Ordination plot of the Doubs River environmental data fitted with a smooth response surface for elevation and biological oxygen demand.

PCA does well as simpifying multidimensional data, but it has important limitations when applied to ecological community data due to its linear assumptions. In PCA biplots, environmental gradient contours form linear trend surfaces perpendicular to their vectors (Figure 8), reflecting the method’s inherent linearity. However, ecological data and environmental gradients are typically non-linear, with species exhibiting complex, unimodal responses to environmental factors. This mismatch can lead to oversimplified and misleading interpretations. Consequently, PCA is generally not recommended for community data analysis. Instead, alternative ordination methods like Correspondence Analysis (CA), Canonical Correspondence Analysis (CCA), and non-metric Multidimensional Scaling (nMDS) are preferred, as they better capture the non-linear relationships and provide more ecologically meaningful insights.

References

Graham MH (2003) Confronting multicollinearity in ecological multiple regression. Ecology 84:2809–2815.

Verneaux J (1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2021,
  author = {Smit, A. J.,},
  title = {Principal {Component} {Analysis} {(PCA)}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB743/PCA.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2021) Principal Component Analysis (PCA). http://tangledbank.netlify.app/BCB743/PCA.html.

--- date: "2021-01-01" title: "Principal Component Analysis (PCA)" --- ::: callout-tip ## **Material required for this chapter** | Type | Name | Link | |----------------|------------------------------------------|-----------------------------------------------------------------------------------| | **Theory** | Numerical Ecology in R | See pages 117-132 | | **Slides** | PCA lecture slides | [💾 `BCB743_08_PCA.pdf`](../slides/BCB743_08_PCA.pdf) | | **Data** | The Doubs River data | [💾 `Doubs.RData`](../data/NEwR-2ed_code_data/NeWR2-Data/Doubs.RData) | | **R function** | A function for ordination plots | [💾 `cleanplot.pca.R`](../data/NEwR-2ed_code_data/NeWR2-Functions/cleanplot.pca.R)| ::: ::: {.callout-important appearance="simple"} ## Tasks to complete in this Chapter * [Task C 1--5](assessments/Task_C.qmd) ::: ::: {.callout-important appearance="simple"} ## PCA as a foundation for undertanding ordination Thoroughly understanding PCA is a prerequisite for understanding more complex ordination techniques. Mastered this chapter before moving on to CA, PCoA, RDA, and CCA. Much of the same terminology and concepts are used in these techniques. ::: Ordination refers to a suite of multivariate techniques that reduce a high-dimensional dataset into a lower-dimensional space, typically 2D or 3D, in such a way that any intrinsic structure in the data forms visually-discernible patterns (Pielou, 1984). In ecology, ordination techniques are used to describe relationships between community structure patterns and underlying environmental gradients. They allow us to determine the relative importance of different gradients and visualise species-environment relationships. Principal Component Analysis (PCA) is one of the commonly used ordination techniques in ecology. It is a dimensionality reduction technique that transforms the original set of variables into a new set of uncorrelated variables called principal components. PCA is performed on a data matrix containing species abundances or environmental variables across multiple sites or samples. The PCA process involves calculating the eigenvectors and eigenvalues of the covariance or correlation matrix of the data. The eigenvectors represent the directions of maximum variance in the data, and the corresponding eigenvalues represent the amount of variance explained by each eigenvector. The new axes, called principal components, are linear combinations of the original variables, ordered by the amount of variance they explain. PCA preserves the Euclidean distances between samples in the original high-dimensional space when projecting them onto the lower-dimensional ordination space. This property makes PCA more suitable for analysing environmental data, where Euclidean distances between samples are meaningful and interpretable. However, for species data, which is often in the form of counts or frequencies, Euclidean distances may not be an appropriate measure of dissimilarity between samples. ## The Horseshoe Effect {#sec-horseshoe-effect} The ‘horseshoe effect’ (sometimes called the Guttman effect) is an artefact often seen with PCA when applied to species data, especially when using species abundance data for communities along environmental gradients. It distorts the data points in the ordination space. A less severe version of the horseshoe effect is called the ‘arch effect’ and is seen in [Correspondence Analysis (CA)](CA.qmd). The horseshoe effect occurs because PCA assumes linear relationships between variables, while species data often exhibit unimodal responses to environmental gradients. The unimodal model was discussed in [BDC334](../BDC334/L02a-gradients.html#sec-unimodal). When species have unimodal distributions along a gradient, PCA tends to fold the ends of the gradient towards each other. This is visible as a horseshoe-shaped pattern in the ordination diagram. This distortion can lead to several issues: 1. The horseshoe shape can make it appear that sites at opposite ends of the gradient are more similar than they really are. 2. The folding of the gradient ends compresses the data and potentially obscures important ecological patterns. 3. The second PCA axis---the most affected axis---often doesn't represent a meaningful ecological gradient, making interpretation challenging. 4. Unlike the arch effect in CA, the horseshoe effect in PCA can lead to incorrect ordering of samples along the gradient. To address these issues, we prefer to use [non-metric Multidimensional Scaling (nMDS)](nMDS.qmd) or a distance-based method (like [Principal Coordinates Analysis, PCoA](PCoA.qmd)) that is less susceptible to this artefact. Alternatively, we may use techniques specifically designed to handle unimodal species responses, such as [Correspondence Analysis (CA)](CA.qmd), which uses $\chi^2$-distance and not Euclidean distance, or its detrended version ([Detrended Correspondence Analysis, DCA](DCA.qmd)), but these come with their own set of considerations. ## Set-up the Analysis Environment ```{r} library(tidyverse) library(vegan) library(ggcorrplot) # for the correlations library(ggpubr) ``` ## The Doubs River Data ```{r} load("../data/NEwR-2ed_code_data/NEwR2-Data/Doubs.RData") head(env) ``` ## First Do a Correlation ```{r} #| fig-width: 8 #| fig-height: 8 #| fig.align: center #| fig.cap: "Pairwise correlations among the environmental variables included with the Doubs River study." #| label: fig-pca-correlation # computing a correlation matrix corr <- round(cor(env), 1) # visualisation of the correlation matrix ggcorrplot(corr, type = 'upper', outline.col = "white", colors = c("#1679a1", "white", "#f8766d"), lab = TRUE) ``` Some variables are very correlated, and they might be omitted from the subsequent analyses. We say that these variables are 'collinear.' Collinear variables cannot be teased apart in terms of finding out which one is most influential in structuring the community. There are more advanced ways to search for collinear variables (e.g. Variance Inflation Factors, VIF) and in this way we can systematically exclude them from the PCA. See @graham2003confronting for a discussion on collinearity. Here we will proceed with all the variables. ## See the Spatial Context The patterns in the data and the correlations between them will make more sense if we can visualise a spatial context. Thankfully spatial data are available: ```{r} #| fig.align: center #| fig.cap: "The spatial configuration of the Doubs River sites." #| label: fig-pca-map1 head(spa) ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) + geom_point(shape = 1) + geom_label(vjust = 0, nudge_y = 0.5, check_overlap = TRUE) ``` These site numbers correspond approximately to the ones in @verneaux1973cours but some of the numbers may have been shifted slightly in the example Doubs dataset used here compared to how they were originally numbered in Verneaux's thesis and subsequent publication. This should not affect the interpretation. We can also scale the symbol size by the magnitude of the environmental variables. Lets look at two pairs of variables that are strongly correlated with one-another: ```{r} #| fig-width: 8 #| fig-height: 6 #| fig.align: center #| fig.cap: "Four different representations of the site configuration (spatial context) of the Doubs River sampling layout. Symbols are scaled relative to A) ammonium concentration, B) BOD, C) Altitude, and D) Flow rate." #| label: fig-pca-map2 # We scale the data first so as to better represent the full # magnitude of all variables with a common symbol size env_std <- decostand(env, method = "standardize") # positive correlations plt1 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) + geom_point(shape = 1, col = "red", aes(size = env_std$amm, shape = 3)) + geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) + labs(size = "Magnitude", title = "Ammonium concentration") plt2 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) + geom_point(shape = 1, col = "red", aes(size = env_std$bod)) + geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) + labs(title = "Biological oxygen demand") # inverse correlations plt3 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) + geom_point(shape = 1, col = "blue", aes(size = env_std$alt)) + geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) + labs(title = "Altitude") plt4 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) + geom_point(shape = 1, col = "blue", aes(size = env_std$flo)) + geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) + labs(title = "Flow rate") ggarrange(plt1, plt2, plt3, plt4, nrow = 2, ncol = 2, common.legend = TRUE, labels = "AUTO") ``` ## Do the PCA We use the function `rda()` to do the PCA, but it can also be performed in base R with the functions `prcomp()` and `princomp()`. `rda()` is the same function that we will use later for a Redundancy Analysis, but when used without specifying constraints (as we do here) it amounts to simply doing a PCA. Typically we standardise environmental data to unit variance, but the PCA done by the `rda()` function accomplishes this step automagically when `scale = TRUE`. When applied to environmental data (as we typically do with a PCA) it works with correlations amongst the scaled variables. PCA preserves Euclidean distance and the relationships detected are linear, and for this reason it is not typically applied to species data without suitable transformations. In fact, in this module we will seldom apply a PCA to species data at all. ```{r} env_pca <- rda(env, scale = TRUE) env_pca # same ... # env_std <- scale(env) # env_pca <- rda(env_std, scale = FALSE) # env_pca ``` In ordination we use the term **inertia** as a synonym for 'variation', but some PCA software (such as R's `prcomp()` and `princomp()`) simply uses the term `sdev` for standard deviations. In PCA, when we use a correlation matrix (as we do here), the inertia is the sum of the diagonal values of the correlation matrix, which is simply the number of variables (11 in this example). When a PCA uses a covariance matrix the inertia is the sum of the variances of the variables. You will also see in the output the mention of the term '**unconstrained**'. In a PCA the analysis is always unconstrained (i.e. not influenced by some *a priori* defined variables we hypothesise to explain the between site patterns in the multivariate data). The section headed **Eigenvalues for unconstrained axes** shows the relative importance of the resultant reduced axes, and they can be used to determine the proportion of the total inertia (sum of the eigenvalues) captured by any one of the axes. They can be accessed with the function `eigenvals()` (the preferred function; see `?rda` for help), but an alternative method is given below. The first eigenvalue (the one associated with PC1) always explains the most variation (the largest fraction), and each subsequent one explains the largest proportion of the **remaining** variance. We say the axes are **orthogonal** and ranked in decreasing order of importance. The sum of all eigenvalues is the total inertia, so collectively they theoretically can explain all of the variation in the dataset (but clearly they **should not** be used to explain **all** the variance). To extract the first eigenvalue we can do: ```{r} round(eigenvals(env_pca)[1], 3) # or round(env_pca$CA$eig[1], 3) ``` The total inertia is: ```{r} sum(eigenvals(env_pca)) # or sum(env_pca$CA$eig) ``` So the proportion of variation explained by the first PC is: ```{r} round(env_pca$CA$eig[1] / sum(env_pca$CA$eig) * 100, 1) # result in % ``` We can show the same information as part of a more verbose summary. Here we see the pre-calculated **Proportion Explained** and **Cumulative Proportion** (it should be obvious what this is). There is also an assortment of other information, viz. **Scaling 2 for species and site scores**, **Species scores**, and **Site scores**. ```{r} summary(env_pca) ``` Newer versions of the ordinations' `summary()` method no longer shows the species and site scores. To access these scores, we use the `scores()` function: The species scores can be extracted from the ordination object by: ```{r} # extract species scores for first three PCA axes: scores(env_pca, display = "species", choices = 1:3) ``` Similarly, for the site scores, do: ```{r} # extract the site scores for axes 1 and 2 scores(env_pca, display = "site", choices = 1:3) |> head() # truncate to save space in this example ``` The `scores()` method also applies to other ordination methods. ::: {.callout-important appearance="simple"} ## Question We are dealing with environmental variables here, so why do we call ::: ## Terminology Ordination methods in vegan such as `rda()` for a PCA/RDA or `cca(`) for a CA/CCA is performed with a singular-value decomposition (SVD). Once the data matrix $X$ has been centred (and, where necessary, scaled or $\chi^2$-transformed), the SVD computes $$X = U \,\Sigma \,V^{\mathsf T},$$ with orthonormal *left-singular vectors* $U$, orthonormal *right-singular vectors* $V$, and positive singular values on the diagonal of $\Sigma$. From here, all the components of our ordination can be extracted. How do these relate to our data? - **Eigenvectors** are the directions in the original variable space along which the variance in the data is maximised. They relate to our original data in two ways: - $V$ is the matrix of right singular vectors; they are the *eigenvectors of the columns* (species or variables). The quantities can be accessed by `env_pca$CA$v`. - $U$ is the matrix of left singular vectors; *i.e.*, the *eigenvectors of the rows* (sites). The left singular vectors can be accessed by `env_pca$CA$u` - $\Sigma$ is a diagonal matrix of singular values $d$, and the **eigenvalues** are the square of these singular values, $eig=d^ 2$. Eigenvalues represent the amount of variance explained by each corresponding eigenvector (or principal component), represented in PCA (and other unconstrained ordinations such as CA) by `env_pca$CA$eig`. Higher eigenvalues indicate that the corresponding principal component explains a larger portion of the total variance in the data. The sum of all eigenvalues is equal to the total variance in the data. Let's map these directly to our `env_pca` object, which I'll assume was created with the `rda()` function. Looking inside the `env_pca` object, we'll see a structure with a component named $CA$ or $CCA$. This holds the results for the constrained/canonical analysis. For a simple PCA, all the results are in $CA$. The eigenvectors (`env_pca$CA$eig`) define the direction of the new axes, they aren't the final coordinates for ordination plots. *Projection* is the process of mapping the original data points onto the new reduced axes (principal components). What we plot are actually these *scores*. - **Scores** are the coordinates of the original sites and species projected onto the new ordination axes. They are calculated by scaling the eigenvectors. These scores are what we actually plot to create a biplot or triplot. - The **Site Scores** tell us where each site (sample) is located along the new ordination axes. Sites that are close together in the ordination plot have similar variable (or species) compositions. We obtain the site scores as `scores(env_pca, display = "site", ...)`. - The **Species Scores (or Variable Scores)** inform us about how the original variables (whatever is in the columns, *i.e.*, species or environmental variables) relate to the ordination axes. Sometimes we call the species scores *loadings*. In a PCA biplot, these are represented by arrows (but in CA not, as that reduced space is no longer linear). An arrow pointing strongly towards PCA1 means that species or variable is a major contributor to that axis of variation. So, species scores represent the contributions of the original variables to the principal components. They indicate the strength and direction of the correlation between the original variables and the new principal components. Species scores are obtained with `scores(env_pca, display = "species", ...)`. Larger positive values indicate a stronger positive correlation, while larger negative values indicate a stronger negative correlation. Even though the term "species scores" is used in the software, it refers to the loadings of the original variables (environmental variables in this case). Here's the direct mapping: | Mathematical Concept | `vegan` Term | Where to Find It in the `rda` Object | Description | | :--- | :--- | :--- | :--- | | **Eigenvalues** | Eigenvalues | `env_pca$CA$eig` | The amount of variance explained by each axis (component). `summary(env_pca)` shows this clearly. | | **Eigenvectors of Variables** | Species Scores | `env_pca$CA$v` | The raw eigenvectors for the columns of the data (the variables). These are scaled to become the "species scores" that you plot as arrows. | | **Eigenvectors of Sites** | Site Scores | `env_pca$CA$u` | The raw eigenvectors for the rows of the data (the sites). These are scaled to become the "site scores" that you plot as points. | | **Plottable Coordinates** | Scores | `scores(env_pca)` | The function `scores()` extracts and properly scales the raw eigenvectors (`$u` and `$v`) to give the coordinates ready for plotting. | When interpreting a PCA biplot (see below), the species scores (loadings) indicate the contributions of the original variables to the principal components, while the site scores represent the positions of the sites in the reduced-dimensional space defined by those principal components. The direction and length of the species score vectors (arrows) provide information about the relationships between the original variables and the principal components, while the positions of the site points reflect the similarities or differences between sites based on the environmental gradients represented by the principal components.   ## How Many Axes to Retain? The number of axes to retain is a difficult question to answer. The first few axes will always explain the most variation, but how do we know how many reduced axes are influential and should be kept? Commonly recommended is the **broken stick** method---keep the principal components whose eigenvalues are higher than corresponding random broken stick components: ```{r fig.height=4, fig.width=6} #| fig.align: center #| fig.cap: "Scree plot of the Doubs River environmental data PCA." #| label: fig-pca-screeplot # make a scree plot using the vegan function: screeplot(env_pca, bstick = TRUE, type = "lines") ``` Or I can make a scree plot using **ggplot2**, which is more flexible: ```{r fig.height=3, fig.width=6} #| fig.align: center #| fig.cap: "Scree plot of the Doubs River environmental data PCA made in **ggplot2**." #| label: fig-pca-scree-ggplot scree_dat <- data.frame(eigenvalue = as.vector(eigenvals(env_pca)), bstick = bstick(env_pca)) scree_dat$axis <- rownames(scree_dat) rownames(scree_dat) <- NULL scree_dat <- scree_dat |> mutate(axis = factor(axis, levels = paste0(rep("PC", 11), seq(1:11)))) ggplot(data = scree_dat, aes(x = axis, y = eigenvalue)) + geom_point() + geom_line(aes(group = 1)) + geom_point(aes(y = bstick), colour = "red") + geom_line(aes(y = bstick, group = 1), colour = "red") + labs(x = "Principal component", y = "Inertia") ``` In the plot, above, the red line is the broken stick components and the black line the eigenvalues for the different PCs. The plot suggests keeping the first two PC axes, which explain approximately 74% of the total inertia. See Numerical Ecology with R pp. 121-122 for more information about how to decide how many PCs to retain. ## Ordination Diagrams                                                                                             I provide some examples of ordination diagrams scattered throughout the course content (e.g. [here](/quantecol/chapters/pca_examples/#a-ggplot-biplot)), but you may also refer to the step-by-step walk throughs provided by [Roeland Kindt](https://rpubs.com/Roeland-KINDT/694016). Also see David Zelený's [excellent writing on the topic](https://www.davidzeleny.net/anadat-r/doku.php/en:ordiagrams). Let us look at examples. In a PCA ordination diagram, following the tradition of scatter diagrams in Cartesian coordinate systems, objects are represented as points and variables are displayed as arrows. We first use the standard **vegan** `biplot()` function: ```{r} #| fig-width: 10 #| fig-height: 5.625 #| fig.align: center #| fig.cap: "Ordination plot of the Doubs River environmental data showing site scaling (left) and species scaling (right)." #| label: fig-pca-biplots opar <- par(no.readonly = TRUE) par(mfrow = c(1, 2)) biplot(env_pca, scaling = 1, main = "PCA scaling 1", choices = c(1, 2)) biplot(env_pca, scaling = 2, main = "PCA scaling 2", choices = c(1, 2)) par(opar) ``` ## Scaling **Scaling 1:** This scaling *emphasises relationships between rows* accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are approximations of their Euclidian distances in multidimensional space. Objects positioned further apart show a greater degree of environmental dissimilarity. The angles among descriptor vectors should not be interpreted as indicating the degree of correlation between the variables. **Scaling 2:** This scaling *emphasises relationships between columns* accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are not approximations of their Euclidian distances in multidimensional space. The angles among descriptor vectors can be interpreted as indicating the degree of correlation between the variables. Now we create biplots using the `cleanplot.pca()` function that comes with the *Numerical Ecology in R* book. The figures are more or less the same, except the plot showing the Site scores with Scaling 1 adds a 'circle of equilibrium contribution' (see Numerical Ecolology with R, p. 125). The circle of equilibrium contribution is a visual aid drawn on a biplot to help assess the relative importance of species (or variables) in the ordination space. It's most useful in PCA biplots using scaling 1, where the focus is on species relationships. The circle is not a formal statistical test. It helps us to quickly identify the most important variables or species, but it doesn't directly indicate statistical significance. We only assign importance to the arrows that extend beyond the radius of the circle (@fig-pca-cleanplot): ```{r} #| fig-width: 10 #| fig-height: 5.625 #| fig.align: center #| fig.cap: "Ordination plot of the Doubs River environmental data made with the `cleanplot.pca()` function." #| label: fig-pca-cleanplot # we need to load the function first from its R file: source("../data/NEwR-2ed_code_data/NEwR2-Functions/cleanplot.pca.R") cleanplot.pca(env_pca, scaling = 1) ``` **At this point it is essential that you refer to *Numerical Ecology in R* (pp. 118 to 126) for help with interpreting the ordination diagrams.** ## Fitting Environmental Response Surfaces The `ordisurf()` function in **vegan** is used to visualise underlying environmental gradients on an ordination plot. This function fits a smooth surface (usually a generalised additive model GAM fitted to the site scores of PC axes of interest) to the ordination plot based on a specified environmental variable. It highlights how that variable changes across the ordination space so that we may interpret the spatial structure of the data in relation to the environmental gradient more easily. For more about `ordisurf()`, see Gavin Simpson's blog post [What is `ordifurf()` doing?](https://fromthebottomoftheheap.net/2011/06/10/what-is-ordisurf-doing/). We plot the response surfaces for elevation and biological oxygen demand: ```{r} #| message: false #| warning: false #| fig-width: 10 #| fig-height: 5.625 #| fig.align: center #| fig.cap: "Ordination plot of the Doubs River environmental data fitted with a smooth response surface for elevation and biological oxygen demand." #| label: fig-pca-ordisurf biplot(env_pca, type = c("text", "points"), col = c("black", "black")) invisible(ordisurf(env_pca ~ bod, env, add = TRUE, col = "turquoise", knots = 1)) # <1> invisible(ordisurf(env_pca ~ ele, env, add = TRUE, col = "salmon", knots = 1)) ``` 1. `invisible()` is used to suppress the output of the `ordisurf()` function; only the figure is returned. PCA does well as simpifying multidimensional data, but it has important limitations when applied to ecological community data due to its linear assumptions. In PCA biplots, environmental gradient contours form linear trend surfaces perpendicular to their vectors (@fig-pca-ordisurf), reflecting the method's inherent linearity. However, ecological data and environmental gradients are typically non-linear, with species exhibiting complex, unimodal responses to environmental factors. This mismatch can lead to oversimplified and misleading interpretations. Consequently, PCA is generally not recommended for community data analysis. Instead, alternative ordination methods like Correspondence Analysis (CA), Canonical Correspondence Analysis (CCA), and non-metric Multidimensional Scaling (nMDS) are preferred, as they better capture the non-linear relationships and provide more ecologically meaningful insights.