Principal Component Analysis (PCA)

Published

January 1, 2021

Material required for this chapter
Type Name Link
Theory Numerical Ecology in R See pages 117-132
Slides PCA lecture slides 💾 BCB743_08_PCA.pdf
Data The Doubs River data 💾 Doubs.RData
R function A function for ordination plots 💾 cleanplot.pca.R
Tasks to complete in this Chapter
PCA as a foundation for undertanding ordination

Thoroughly understanding PCA is a prerequisite for understanding more complex ordination techniques. Mastered this chapter before moving on to CA, PCoA, RDA, and CCA. Much of the same terminology and concepts are used in these techniques.

Ordination refers to a suite of multivariate techniques that reduce a high-dimensional dataset into a lower-dimensional space, typically 2D or 3D, in such a way that any intrinsic structure in the data forms visually-discernible patterns (Pielou, 1984). In ecology, ordination techniques are used to describe relationships between community structure patterns and underlying environmental gradients. They allow us to determine the relative importance of different gradients and visualise species-environment relationships.

Principal Component Analysis (PCA) is one of the commonly used ordination techniques in ecology. It is a dimensionality reduction technique that transforms the original set of variables into a new set of uncorrelated variables called principal components. PCA is performed on a data matrix containing species abundances or environmental variables across multiple sites or samples.

The PCA process involves calculating the eigenvectors and eigenvalues of the covariance or correlation matrix of the data. The eigenvectors represent the directions of maximum variance in the data, and the corresponding eigenvalues represent the amount of variance explained by each eigenvector. The new axes, called principal components, are linear combinations of the original variables, ordered by the amount of variance they explain.

PCA preserves the Euclidean distances between samples in the original high-dimensional space when projecting them onto the lower-dimensional ordination space. This property makes PCA more suitable for analysing environmental data, where Euclidean distances between samples are meaningful and interpretable. However, for species data, which is often in the form of counts or frequencies, Euclidean distances may not be an appropriate measure of dissimilarity between samples.

The Horseshoe Effect

The ‘horseshoe effect’ (sometimes called the Guttman effect) is an artefact often seen with PCA when applied to species data, especially when using species abundance data for communities along environmental gradients. It distorts the data points in the ordination space. A less severe version of the horseshoe effect is called the ‘arch effect’ and is seen in Correspondence Analysis (CA).

The horseshoe effect occurs because PCA assumes linear relationships between variables, while species data often exhibit unimodal responses to environmental gradients. The unimodal model was discussed in BDC334. When species have unimodal distributions along a gradient, PCA tends to fold the ends of the gradient towards each other. This is visible as a horseshoe-shaped pattern in the ordination diagram.

This distortion can lead to several issues:

  1. The horseshoe shape can make it appear that sites at opposite ends of the gradient are more similar than they really are.
  2. The folding of the gradient ends compresses the data and potentially obscures important ecological patterns.
  3. The second PCA axis—the most affected axis—often doesn’t represent a meaningful ecological gradient, making interpretation challenging.
  4. Unlike the arch effect in CA, the horseshoe effect in PCA can lead to incorrect ordering of samples along the gradient.

To address these issues, we prefer to use non-metric Multidimensional Scaling (nMDS) or a distance-based method (like Principal Coordinates Analysis, PCoA) that is less susceptible to this artefact. Alternatively, we may use techniques specifically designed to handle unimodal species responses, such as Correspondence Analysis (CA), which uses \chi^2-distance and not Euclidean distance, or its detrended version (Detrended Correspondence Analysis, DCA), but these come with their own set of considerations.

Set-up the Analysis Environment

library(tidyverse)
library(vegan)
library(ggcorrplot) # for the correlations
library(ggpubr)

The Doubs River Data

load("../data/NEwR-2ed_code_data/NEwR2-Data/Doubs.RData")
head(env)
   dfs ele  slo  dis  pH har  pho  nit  amm  oxy bod
1  0.3 934 48.0 0.84 7.9  45 0.01 0.20 0.00 12.2 2.7
2  2.2 932  3.0 1.00 8.0  40 0.02 0.20 0.10 10.3 1.9
3 10.2 914  3.7 1.80 8.3  52 0.05 0.22 0.05 10.5 3.5
4 18.5 854  3.2 2.53 8.0  72 0.10 0.21 0.00 11.0 1.3
5 21.5 849  2.3 2.64 8.1  84 0.38 0.52 0.20  8.0 6.2
6 32.4 846  3.2 2.86 7.9  60 0.20 0.15 0.00 10.2 5.3

First Do a Correlation

# computing a correlation matrix
corr <- round(cor(env), 1)

# visualisation of the correlation matrix
ggcorrplot(corr, type = 'upper', outline.col = "white",
           colors = c("#1679a1", "white", "#f8766d"),
           lab = TRUE)
Figure 1: Pairwise correlations among the environmental variables included with the Doubs River study.

Some variables are very correlated, and they might be omitted from the subsequent analyses. We say that these variables are ‘collinear.’ Collinear variables cannot be teased apart in terms of finding out which one is most influential in structuring the community. There are more advanced ways to search for collinear variables (e.g. Variance Inflation Factors, VIF) and in this way we can systematically exclude them from the PCA. See Graham (2003) for a discussion on collinearity. Here we will proceed with all the variables.

See the Spatial Context

The patterns in the data and the correlations between them will make more sense if we can visualise a spatial context. Thankfully spatial data are available:

head(spa)
       X      Y
1 85.678 20.000
2 84.955 20.100
3 92.301 23.796
4 91.280 26.431
5 92.005 29.163
6 95.954 36.315
ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1) +
  geom_label(vjust = 0, nudge_y = 0.5, check_overlap = TRUE)
Figure 2: The spatial configuration of the Doubs River sites.

These site numbers correspond approximately to the ones in Verneaux (1973) but some of the numbers may have been shifted slightly in the example Doubs dataset used here compared to how they were originally numbered in Verneaux’s thesis and subsequent publication. This should not affect the interpretation. We can also scale the symbol size by the magnitude of the environmental variables. Lets look at two pairs of variables that are strongly correlated with one-another:

# We scale the data first so as to better represent the full
# magnitude of all variables with a common symbol size
env_std <- decostand(env, method = "standardize")

# positive correlations
plt1 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "red", aes(size = env_std$amm, shape = 3)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(size = "Magnitude", title = "Ammonium concentration")

plt2 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "red", aes(size = env_std$bod)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Biological oxygen demand")

# inverse correlations
plt3 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "blue", aes(size = env_std$alt)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Altitude")

plt4 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
  geom_point(shape = 1, col = "blue", aes(size = env_std$flo)) +
  geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
  labs(title = "Flow rate")

ggarrange(plt1, plt2, plt3, plt4, nrow = 2, ncol = 2,
          common.legend = TRUE, labels = "AUTO")
Figure 3: Four different representations of the site configuration (spatial context) of the Doubs River sampling layout. Symbols are scaled relative to A) ammonium concentration, B) BOD, C) Altitude, and D) Flow rate.

Do the PCA

We use the function rda() to do the PCA, but it can also be performed in base R with the functions prcomp() and princomp(). rda() is the same function that we will use later for a Redundancy Analysis, but when used without specifying constraints (as we do here) it amounts to simply doing a PCA. Typically we standardise environmental data to unit variance, but the PCA done by the rda() function accomplishes this step automagically when scale = TRUE. When applied to environmental data (as we typically do with a PCA) it works with correlations amongst the scaled variables. PCA preserves Euclidean distance and the relationships detected are linear, and for this reason it is not typically applied to species data without suitable transformations. In fact, in this module we will seldom apply a PCA to species data at all.

env_pca <- rda(env, scale = TRUE)
env_pca
Call: rda(X = env, scale = TRUE)

              Inertia Rank
Total              11     
Unconstrained      11   11
Inertia is correlations 

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10  PC11 
5.969 2.164 1.065 0.739 0.400 0.336 0.173 0.108 0.024 0.017 0.006 
# same ...
# env_std <- scale(env)
# env_pca <- rda(env_std, scale = FALSE)
# env_pca

In ordination we use the term inertia as a synonym for ‘variation’, but some PCA software (such as R’s prcomp() and princomp()) simply uses the term sdev for standard deviations. In PCA, when we use a correlation matrix (as we do here), the inertia is the sum of the diagonal values of the correlation matrix, which is simply the number of variables (11 in this example). When a PCA uses a covariance matrix the inertia is the sum of the variances of the variables.

You will also see in the output the mention of the term ‘unconstrained’. In a PCA the analysis is always unconstrained (i.e. not influenced by some a priori defined variables we hypothesise to explain the between site patterns in the multivariate data).

The section headed Eigenvalues for unconstrained axes shows the relative importance of the resultant reduced axes, and they can be used to determine the proportion of the total inertia (sum of the eigenvalues) captured by any one of the axes. They can be accessed with the function eigenvals() (the preferred function; see ?rda for help), but an alternative method is given below. The first eigenvalue (the one associated with PC1) always explains the most variation (the largest fraction), and each subsequent one explains the largest proportion of the remaining variance. We say the axes are orthogonal and ranked in decreasing order of importance. The sum of all eigenvalues is the total inertia, so collectively they theoretically can explain all of the variation in the dataset (but clearly they should not be used to explain all the variance). To extract the first eigenvalue we can do:

round(eigenvals(env_pca)[1], 3)
  PC1 
5.969 
# or

round(env_pca$CA$eig[1], 3)
  PC1 
5.969 

The total inertia is:

sum(eigenvals(env_pca))
[1] 11
# or

sum(env_pca$CA$eig)
[1] 11

So the proportion of variation explained by the first PC is:

round(env_pca$CA$eig[1] / sum(env_pca$CA$eig) * 100, 1) # result in %
 PC1 
54.3 

We can show the same information as part of a more verbose summary. Here we see the pre-calculated Proportion Explained and Cumulative Proportion (it should be obvious what this is). There is also an assortment of other information, viz. Scaling 2 for species and site scores, Species scores, and Site scores.

summary(env_pca)

Call:
rda(X = env, scale = TRUE) 

Partitioning of correlations:
              Inertia Proportion
Total              11          1
Unconstrained      11          1

Eigenvalues, and their contribution to the correlations 

Importance of components:
                         PC1    PC2     PC3     PC4     PC5     PC6     PC7
Eigenvalue            5.9687 2.1639 1.06517 0.73875 0.40019 0.33563 0.17263
Proportion Explained  0.5426 0.1967 0.09683 0.06716 0.03638 0.03051 0.01569
Cumulative Proportion 0.5426 0.7393 0.83616 0.90332 0.93970 0.97022 0.98591
                           PC8      PC9     PC10     PC11
Eigenvalue            0.108228 0.023701 0.017083 0.005983
Proportion Explained  0.009839 0.002155 0.001553 0.000544
Cumulative Proportion 0.995748 0.997903 0.999456 1.000000

Terminology

Eigenvectors: Eigenvectors are the directions in the original variable space along which the variance in the data is maximised. In vegan’s PCA, they are represented by env_pca$CA$u. Each eigenvector points in the direction of a principal component and defines its orientation in the multidimensional space. Eigenvectors define the principal component axes, and the species scores (see below) reveal how the original variables relate to those axes.

Principal components: Principal components are the new axes derived from the original data in PCA. They are linear combinations of the original variables that maximise the variance in the data. The number of principal components is equal to the number of original variables, and they are orthogonal (uncorrelated) to each other. The first principal component captures the most variance, with each subsequent component capturing progressively less. The principal components are represented in PCA by env_pca$CA$v.

Species scores (loadings): While eigenvectors are the directions of maximum variance, the species scores indicate the coordinates of the original variables on those directions. So, species scores, also known as loadings, represent the contributions of the original variables to the principal components. They indicate the strength and direction of the correlation between the original variables and the new principal components. Species scores are obtained with scores(env_pca, display = "species"). Larger positive values indicate a stronger positive correlation, while larger negative values indicate a stronger negative correlation. Even though the term “species scores” is used in the software, it refers to the loadings of the original variables (environmental variables in this case).

Eigenvalues: Eigenvalues represent the amount of variance explained by each corresponding eigenvector (or principal component), represented in PCA by env_pca$CA$eig. Higher eigenvalues indicate that the corresponding principal component explains a larger portion of the total variance in the data. The sum of all eigenvalues is equal to the total variance in the data.

Projection: ‘Projection’ is the process of mapping the original data points onto the new reduced axes (principal components). This aids in visualising the data and identifying patterns or clusters.

Site scores: Site scores represent the coordinates of the sites (or samples) in the reduced-dimensional space defined by the principal components. They are obtained by scores(env_pca, display = "sites") and are used to plot the positions of the sites in 2D or 3D ordination space. Sites that are further apart in this space differ more in terms of the environmental conditions, as determined by the major environmental gradients indicated by the species scores (loadings).

When interpreting a PCA biplot (see below), the species scores (loadings) indicate the contributions of the original variables to the principal components, while the site scores represent the positions of the sites in the reduced-dimensional space defined by those principal components. The direction and length of the species score vectors (arrows) provide information about the relationships between the original variables and the principal components, while the positions of the site points reflect the similarities or differences between sites based on the environmental gradients represented by the principal components.

How Many Axes to Retain?

The number of axes to retain is a difficult question to answer. The first few axes will always explain the most variation, but how do we know how many reduced axes are influential and should be kept? Commonly recommended is the broken stick method—keep the principal components whose eigenvalues are higher than corresponding random broken stick components:

# make a scree plot using the vegan function:
screeplot(env_pca, bstick = TRUE, type = "lines")
Figure 4: Scree plot of the Doubs River environmental data PCA.

Or I can make a scree plot using ggplot2, which is more flexible:

scree_dat <- data.frame(eigenvalue = as.vector(eigenvals(env_pca)),
                        bstick = bstick(env_pca))
scree_dat$axis <- rownames(scree_dat)
rownames(scree_dat) <- NULL
scree_dat <- scree_dat |> 
  mutate(axis = factor(axis, levels = paste0(rep("PC", 11), seq(1:11))))

ggplot(data = scree_dat, aes(x = axis, y = eigenvalue)) +
  geom_point() +
  geom_line(aes(group = 1)) +
  geom_point(aes(y = bstick), colour = "red") +
  geom_line(aes(y = bstick, group = 1), colour = "red") +
  labs(x = "Principal component", y = "Inertia")
Figure 5: Scree plot of the Doubs River environmental data PCA made in ggplot2.

In the plot, above, the red line is the broken stick components and the black line the eigenvalues for the different PCs. The plot suggests keeping the first two PC axes, which explain approximately 74% of the total inertia. See Numerical Ecology with R pp. 121-122 for more information about how to decide how many PCs to retain.

Ordination Diagrams

I provide some examples of ordination diagrams scattered throughout the course content (e.g. here), but you may also refer to the step-by-step walk throughs provided by Roeland Kindt. Also see David Zelený’s excellent writing on the topic.

Let us look at examples. In a PCA ordination diagram, following the tradition of scatter diagrams in Cartesian coordinate systems, objects are represented as points and variables are displayed as arrows. We first use the standard vegan biplot() function:

opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
biplot(env_pca, scaling = 1, main = "PCA scaling 1", choices = c(1, 2))
biplot(env_pca, scaling = 2, main = "PCA scaling 2", choices = c(1, 2))
par(opar)
Figure 6: Ordination plot of the Doubs River environmental data showing site scaling (left) and species scaling (right).

Scaling

Scaling 1: This scaling emphasises relationships between rows accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are approximations of their Euclidian distances in multidimensional space. Objects positioned further apart show a greater degree of environmental dissimilarity. The angles among descriptor vectors should not be interpreted as indicating the degree of correlation between the variables.

Scaling 2: This scaling emphasises relationships between columns accurately in low-dimensional ordination space. Distances among objects (samples or sites) in the biplot are not approximations of their Euclidian distances in multidimensional space. The angles among descriptor vectors can be interpreted as indicating the degree of correlation between the variables.

Now we create biplots using the cleanplot.pca() function that comes with the Numerical Ecology in R book. The figures are more or less the same, except the plot showing the Site scores with Scaling 1 adds a ‘circle of equilibrium contribution’ (see Numerical Ecolology with R, p. 125). The circle of equilibrium contribution is a visual aid drawn on a biplot to help assess the relative importance of species (or variables) in the ordination space. It’s most useful in PCA biplots using scaling 1, where the focus is on species relationships. The circle is not a formal statistical test. It helps us to quickly identify the most important variables or species, but it doesn’t directly indicate statistical significance.

We only assign importance to the arrows that extend beyond the radius of the circle (Figure 7):

# we need to load the function first from its R file:
source("../data/NEwR-2ed_code_data/NEwR2-Functions/cleanplot.pca.R")
cleanplot.pca(env_pca, scaling = 1)
Figure 7: Ordination plot of the Doubs River environmental data made with the cleanplot.pca() function.

At this point it is essential that you refer to Numerical Ecology in R (pp. 118 to 126) for help with interpreting the ordination diagrams.

Fitting Environmental Response Surfaces

The ordisurf() function in vegan is used to visualise underlying environmental gradients on an ordination plot. This function fits a smooth surface (usually a generalised additive model GAM fitted to the site scores of PC axes of interest) to the ordination plot based on a specified environmental variable. It highlights how that variable changes across the ordination space so that we may interpret the spatial structure of the data in relation to the environmental gradient more easily.

For more about ordisurf(), see Gavin Simpson’s blog post What is ordifurf() doing?.

We plot the response surfaces for elevation and biological oxygen demand:

biplot(env_pca, type = c("text", "points"), col = c("black", "black"))
1invisible(ordisurf(env_pca ~ bod, env, add = TRUE, col = "turquoise", knots = 1))
invisible(ordisurf(env_pca ~ ele, env, add = TRUE, col = "salmon", knots = 1))
1
invisible() is used to suppress the output of the ordisurf() function; only the figure is returned.
Figure 8: Ordination plot of the Doubs River environmental data fitted with a smooth response surface for elevation and biological oxygen demand.

PCA does well as simpifying multidimensional data, but it has important limitations when applied to ecological community data due to its linear assumptions. In PCA biplots, environmental gradient contours form linear trend surfaces perpendicular to their vectors (Figure 8), reflecting the method’s inherent linearity. However, ecological data and environmental gradients are typically non-linear, with species exhibiting complex, unimodal responses to environmental factors. This mismatch can lead to oversimplified and misleading interpretations. Consequently, PCA is generally not recommended for community data analysis. Instead, alternative ordination methods like Correspondence Analysis (CA), Canonical Correspondence Analysis (CCA), and non-metric Multidimensional Scaling (nMDS) are preferred, as they better capture the non-linear relationships and provide more ecologically meaningful insights.

References

Graham MH (2003) Confronting multicollinearity in ecological multiple regression. Ecology 84:2809–2815.
Verneaux J (1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs.

Reuse

Citation

BibTeX citation:
@online{j._smit2021,
  author = {J. Smit, Albertus},
  title = {Principal {Component} {Analysis} {(PCA)}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB743/PCA.html},
  langid = {en}
}
For attribution, please cite this work as:
J. Smit A (2021) Principal Component Analysis (PCA). http://tangledbank.netlify.app/BCB743/PCA.html.