---
date: last-modified
title: "8a: Principal Component Analysis (PCA)"
format:
html:
code-link: false
---
```{r code-brewing-opts, echo=FALSE}
knitr::opts_chunk$set(
comment = "R>",
warning = FALSE,
message = FALSE,
fig.width = 4.5,
fig.height = 2.625,
out.width = "75%",
fig.asp = NULL, # control via width/height
dpi = 300
)
ggplot2::theme_set(
ggplot2::theme_minimal(base_size = 8)
)
ggplot2::theme_set(
ggplot2::theme_bw(base_size = 8)
)
```
```{r code-setup-hidden}
#| echo: false
library(tidyverse)
library(ggpubr)
```
::: callout-tip
## **Material Required for This Chapter**
| Type | Name | Link |
| :--- | :--- | :--- |
| **Theory** | Numerical Ecology with R | See pages 117-132 |
| **Slides** | PCA lecture slides | [💾 `BCB743_08_PCA.pdf`](../slides/BCB743_08_PCA.pdf) |
| **Data** | The Doubs River data | [💾 `Doubs.RData`](../data/BCB743/NEwR-2ed_code_data/NeWR2-Data/Doubs.RData) |
| **R function** | A function for ordination plots | [💾 `cleanplot.pca.R`](../data/BCB743/NEwR-2ed_code_data/NeWR2-Functions/cleanplot.pca.R)|
:::
::: {.callout-important appearance="simple"}
## Tasks to Complete in This Chapter
* [Task B 1--6](tasks/Task_B.qmd)
:::
::: {.callout-important appearance="simple"}
## PCA as a Foundation for Understanding Ordination
Thoroughly understanding PCA is a prerequisite for understanding more complex ordination techniques. Master this chapter before moving on to CA, PCoA, RDA, and CCA. Much of the same terminology and concepts are used in these techniques.
:::
::: {.callout-note appearance="simple"}
## The PCA Workflow
The chapter follows one sequence from start to finish. Keep it in view as a map:
1. Examine the correlations among variables.
2. Standardise the variables to a common scale.
3. Run the PCA.
4. Examine the eigenvalues (how much variation each axis carries).
5. Decide how many axes to retain.
6. Plot the ordination (the biplot).
7. Interpret the ecological gradients.
:::
## Geometric Projection of High-Dimensional Data
Ordination refers to a suite of multivariate techniques that reduces a high-dimensional dataset into a lower-dimensional space, typically 2D or 3D, in such a way that any intrinsic structure in the data forms visually-discernible patterns (Pielou, 1984). In ecology, ordination techniques are used to describe relationships between community structure patterns and underlying environmental gradients. They allow me to determine the relative importance of different gradients and visualise species-environment relationships.
Principal Component Analysis (PCA) is one of the commonly used ordination techniques in ecology. It is a dimensionality reduction technique that transforms the original set of variables into a new set of uncorrelated variables called principal components. PCA is performed on a data matrix containing species abundances or environmental variables across multiple sites or samples.
Although PCA is often introduced as a method for reducing dimensionality, it is more accurately understood as a geometric re-expression of the data. The method does not discard variables arbitrarily; rather, it rotates the original coordinate system so that new, mutually orthogonal (independent) axes are aligned with directions of maximal variance, and then projects the data onto these axes. This rotation-projection viewpoint will become explicit later when I examine how PCA is computed in R, but it is worth keeping in mind from the outset that PCA reorganises variation by changing the basis in which the data are represented.
@fig-pca-concept shows the idea on two correlated variables. The cloud of points has most of its spread along one diagonal direction. PCA finds that direction and makes it the first axis (PC1), with the second axis (PC2) at right angles to it. Rotating the data onto these new axes leaves the points unchanged but concentrates the variation onto PC1, so that a single axis now carries most of what two variables held between them.
```{r fig-pca-concept}
#| fig-cap: "PCA on two correlated variables. **(A)** The original variables, with the PC axes overlaid: PC1 follows the direction of greatest spread, PC2 lies at right angles to it. **(B)** The same points after rotation onto the PC axes; the variation is now concentrated along PC1. No data are lost, the cloud is only viewed from a more informative angle."
#| fig-width: 5.2
#| fig-height: 5.6
#| code-fold: true
set.seed(42)
n <- 60
x1 <- rnorm(n)
x2 <- 0.85 * x1 + rnorm(n, sd = 0.5)
demo <- data.frame(V1 = x1 * 2 + 6, V2 = x2 * 1.5 + 5)
demo_c <- scale(demo, scale = FALSE) # centre only
pc <- prcomp(demo_c)
# the sign of a PC axis is arbitrary; flip so PC1 points up-right
if (pc$rotation[1, 1] < 0) {
pc$rotation[, 1] <- -pc$rotation[, 1]
pc$x[, 1] <- -pc$x[, 1]
}
arr <- data.frame(
lab = c("PC1", "PC2"),
xend = pc$rotation[1, ] * pc$sdev * 2.2,
yend = pc$rotation[2, ] * pc$sdev * 2.2
)
panel_a <- ggplot(as.data.frame(demo_c), aes(V1, V2)) +
geom_hline(yintercept = 0, colour = "grey85") +
geom_vline(xintercept = 0, colour = "grey85") +
geom_point(colour = "grey45", size = 1.4) +
geom_segment(
data = arr,
aes(x = 0, y = 0, xend = xend, yend = yend),
colour = "firebrick",
arrow = arrow(length = unit(2.5, "mm")),
linewidth = 0.6
) +
geom_text(
data = arr,
aes(x = xend * 1.14, y = yend * 1.14, label = lab),
colour = "firebrick",
size = 3
) +
labs(
title = "Original variables, with PC axes",
x = "Variable 1 (centred)",
y = "Variable 2 (centred)"
) +
coord_equal()
panel_b <- ggplot(as.data.frame(pc$x), aes(PC1, PC2)) +
geom_hline(yintercept = 0, colour = "grey85") +
geom_vline(xintercept = 0, colour = "grey85") +
geom_point(colour = "steelblue", size = 1.4) +
labs(
title = "Same data, projected onto PC axes",
x = "PC1 (most variance)",
y = "PC2"
) +
coord_equal()
ggarrange(panel_a, panel_b, ncol = 1, labels = "AUTO", heights = c(1.6, 1))
```
The PCA process involves calculating the **eigenvectors** and **eigenvalues** of the covariance or correlation matrix of the data. The eigenvectors represent the directions of maximum variance in the data, and the corresponding eigenvalues represent the amount of variance explained by each eigenvector. The new axes, called principal components, are linear combinations of the original variables, ordered by the amount of variance they explain.
PCA seeks a low-dimensional representation that retains as much of the Euclidean structure among samples as possible. The full set of axes reproduces the original Euclidean distances exactly; the first two or three axes that I actually plot retain only an approximation, namely as much of that structure as the retained axes can hold. This makes PCA well suited to environmental data, where Euclidean distances between samples are meaningful and interpretable. For species data, often in the form of counts or frequencies, Euclidean distances may not be an appropriate measure of dissimilarity between samples.
Let me work through a PCA to show the ideas.
## Set-up the Analysis Environment
```{r code-library-tidyverse}
library(tidyverse)
library(vegan)
library(ggcorrplot) # for the correlations
library(ggpubr)
```
## The Doubs River Data
```{r code-load-here-here-data}
load(here::here(
"data",
"BCB743",
"NEwR-2ed_code_data",
"NEwR2-Data",
"Doubs.RData"
))
head(env)
```
## First Do a Correlation
```{r code-pca-correlation}
#| fig-width: 8
#| fig-height: 8
#| fig.align: center
#| fig.cap: "Pairwise correlations among the environmental variables included with the Doubs River study."
# computing a correlation matrix
corr <- round(cor(env), 1)
# visualisation of the correlation matrix
ggcorrplot(
corr,
type = 'upper',
outline.col = "white",
colors = c("#1679a1", "white", "#f8766d"),
lab = TRUE
)
```
Some variables are very correlated, and they might be omitted from the subsequent analyses. I say that these variables are **collinear**. Collinear variables cannot be teased apart in terms of finding out which one is most influential in structuring the community. There are more advanced ways to search for collinear variables (e.g., Variance Inflation Factors, VIF) and in this way I can systematically exclude them from the PCA. See @graham2003confronting for a discussion on collinearity.
Here I will proceed with all the variables, a deliberate decision for my approach here. In PCA, collinearity does not cause instability in estimated coefficients in the way it does in multiple regression, because PCA does not estimate regression parameters. Instead, correlated variables jointly determine the orientation of the ordination axes, i.e., strongly collinear variables tend to load in similar directions and contribute collectively to the same components. While this can complicate interpretation (by making it harder to disentangle the influence of individual variables) it does not invalidate the construction of the axes themselves. For teaching purposes, retaining collinear variables makes the structure of the data explicit and allows me to see how shared variance is redistributed across components. As I shall see later, in the constrained analyses where coefficients are interpreted directly, collinearity becomes a more serious inferential (statistical) issue and is addressed explicitly through diagnostics such as variance inflation factors and variable selection.
## See the Spatial Context
The patterns in the data and the correlations between them will make more sense if I can visualise a spatial context. Thankfully spatial data are available:
```{r fig-pca-map1}
#| fig.align: center
#| fig.cap: "The spatial configuration of the Doubs River sites."
head(spa)
ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
geom_point(shape = 1) +
geom_label(vjust = 0, nudge_y = 0.5, check_overlap = TRUE)
```
These site numbers correspond approximately to the ones in @verneaux1973cours but some of the numbers may have been shifted slightly in the example Doubs dataset used here compared to how they were originally numbered in Verneaux's thesis and subsequent publication. This should not affect the interpretation. I can also scale the symbol size by the magnitude of the environmental variables. Lets look at two pairs of variables that are strongly correlated with one-another:
```{r fig-pca-map2}
#| fig-width: 8
#| fig-height: 6
#| fig.align: center
#| fig.cap: "Four different representations of the site configuration (spatial context) of the Doubs River sampling layout. Symbols are scaled relative to A) ammonium concentration, B) BOD, C) Altitude, and D) Flow rate."
# We scale the data first so as to better represent the full
# magnitude of all variables with a common symbol size
env_std <- decostand(env, method = "standardize")
# positive correlations
plt1 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
geom_point(shape = 1, col = "red", aes(size = env_std$amm, shape = 3)) +
geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
labs(size = "Magnitude", title = "Ammonium concentration")
plt2 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
geom_point(shape = 1, col = "red", aes(size = env_std$bod)) +
geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
labs(title = "Biological oxygen demand")
# inverse correlations
plt3 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
geom_point(shape = 1, col = "blue", aes(size = env_std$ele)) +
geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
labs(title = "Altitude")
plt4 <- ggplot(spa, aes(x = X, y = Y, label = rownames(spa))) +
geom_point(shape = 1, col = "blue", aes(size = env_std$dis)) +
geom_text(vjust = -0.5, nudge_y = 0.5, check_overlap = TRUE) +
labs(title = "Flow rate")
ggarrange(
plt1,
plt2,
plt3,
plt4,
nrow = 2,
ncol = 2,
common.legend = TRUE,
labels = "AUTO"
)
```
## Do the PCA
I use the function `rda()` to do the PCA, but it can also be performed in base R with the functions `prcomp()` and `princomp()`. `rda()` is the same function that I will use later for a Redundancy Analysis, but when used without specifying constraints (as I do here) it amounts to simply doing a PCA. Typically I standardise environmental data to unit variance, and `rda()` accomplishes this step when `scale = TRUE`. When applied to environmental data (as I typically do with a PCA) it works with correlations amongst the scaled variables. PCA is built on Euclidean distance and the relationships it detects are linear, and for this reason it is not typically applied to species data without suitable transformations. In fact, in this module I will seldom apply a PCA to species data at all.
```{r code-env-pca-rda-env-scale}
env_pca <- rda(env, scale = TRUE)
env_pca
# same ...
# env_std <- scale(env)
# env_pca <- rda(env_std, scale = FALSE)
# env_pca
```
In ordination I use the term **inertia** as a synonym for 'variation', but some PCA software (such as R's `prcomp()` and `princomp()`) simply uses the term `sdev` for standard deviations. In PCA, when I use a correlation matrix (as I do here), the inertia is the sum of the diagonal values of the correlation matrix, which is simply the number of variables (11 in this example). When a PCA uses a covariance matrix the inertia is the sum of the variances of the variables.
You will also see in the output the mention of the term **unconstrained**. In a PCA the analysis is always unconstrained (i.e., not influenced by some *a priori* defined variables I hypothesise to explain the between site patterns in the multivariate data).
The section headed **Eigenvalues for unconstrained axes** shows the relative importance of the resultant reduced axes, and they can be used to determine the proportion of the total inertia (sum of the eigenvalues) captured by any one of the axes. They can be accessed with the function `eigenvals()` (the preferred function; see `?rda` for help), but an alternative method is given below. The first eigenvalue (the one associated with PC1) always explains the most variation (the largest fraction), and each subsequent one explains the largest proportion of the **remaining** variance. I say the axes are **orthogonal** and ranked in decreasing order of importance. The sum of all eigenvalues is the total inertia, so collectively they theoretically can explain all of the variation in the dataset (but clearly they **should not** be used to explain **all** the variance).
At this point it is important to distinguish between statistical adequacy and ecological sufficiency. A principal component that explains a large proportion of the variance is effective at compressing the data, but this does not by necessity imply that it represents the most influential ecological process. Variance explained is a measure of how much information is retained under dimensional reduction; it is not a measure of causal dominance or biological importance. Ecologically meaningful gradients may account for relatively little variance, while strong variance components may reflect broad-scale structure, measurement scale, or correlated responses rather than interpretable mechanisms. Here is where your knowledge as an ecologist will (or should) guide you.
To extract the first eigenvalue I can do:
```{r code-round-eigenvals-env-pca}
round(eigenvals(env_pca)[1], 3)
# or
round(env_pca$CA$eig[1], 3)
```
The total inertia is:
```{r code-sum-eigenvals-env-pca}
sum(eigenvals(env_pca))
# or
sum(env_pca$CA$eig)
```
So the proportion of variation explained by the first PC is:
```{r code-round-env-pca-ca-eig}
round(env_pca$CA$eig[1] / sum(env_pca$CA$eig) * 100, 1) # result in %
```
I can show the same information as part of a more verbose summary. Here I see the pre-calculated **Proportion Explained** and **Cumulative Proportion** (it should be obvious what this is). There is also an assortment of other information, viz. **Scaling 2 for species and site scores**, **Species scores**, and **Site scores**.
```{r code-summary-env-pca}
summary(env_pca)
```
::: {.callout-note appearance="simple"}
## What to Read in That Output
The block is long, but only a few numbers carry the message you need. Find the **Proportion Explained** and **Cumulative Proportion** rows near the top and read across the first two columns:
- PC1 explains about 54% of the variation.
- PC2 explains about 20%.
- Together they account for about 74%.
Everything else (the species and site scores, the scaling note) is detail I extract separately below. For deciding how far the ordination compresses the data, those three percentages are the whole story.
:::
Newer versions of the ordinations' `summary()` method no longer shows the species and site scores. To access these scores, I use the `scores()` function:
The species scores can be extracted from the ordination object by:
```{r code-scores-env-pca-display-species}
# extract species scores for first three PCA axes:
scores(env_pca, display = "species", choices = 1:3)
```
Similarly, for the site scores, do:
```{r code-scores-env-pca-display-site}
# extract the site scores for axes 1 and 2
scores(env_pca, display = "site", choices = 1:3) |>
head() # truncate to save space in this example
```
The `scores()` method also applies to other ordination methods.
::: {.callout-important appearance="simple"}
## Question
I am dealing with environmental variables here, so why do I call them "species scores"?
:::
## Terminology
The geometric ideas introduced earlier (rotation of axes and projection of data into a reduced space) are formalised in ordination methods through a singular-value decomposition (SVD). In **vegan**, ordination functions such as `rda()` for PCA/RDA and `cca()` for CA/CCA implement this decomposition directly. The whole pipeline reduces to one sentence: **eigenvectors define the directions of the new axes, and scores are the data projected onto those directions** (@fig-pca-pipeline). Several labels attach to the steps along the way, and the schematic keeps them in order.
```{mermaid}
%%| label: fig-pca-pipeline
%%| echo: false
%%| fig-cap: "From data to ordination plot. The SVD turns the data matrix into directions (eigenvectors) and their importances (eigenvalues); scaling the eigenvectors gives the scores that are plotted."
flowchart TD
A["Data matrix<br/>(sites × variables)"] --> B["SVD / eigen-analysis"]
B --> C["Eigenvectors<br/>(directions of the axes)"]
B --> D["Eigenvalues<br/>(variance per axis)"]
C --> E["Scores<br/>(coordinates = projections)"]
D --> E
E --> F["Ordination plot<br/>(biplot)"]
```
The mathematics behind the first two boxes is given below for completeness. It can be skimmed on a first reading; the terms it defines are gathered in the table that follows, which is the part to return to.
::: {.callout-note collapse="true" appearance="simple"}
## Mathematical Detail: the SVD in Full
Once the data matrix $X$ has been centred (and, where necessary, scaled or $\chi^2$-transformed), the SVD computes:
$$X = U \,\Sigma \,V^{\mathsf T},$$
with orthonormal *left-singular vectors* $U$, orthonormal *right-singular vectors* $V$, and positive singular values on the diagonal of $\Sigma$. From here, all the components of my ordination can be extracted. How do these relate to my data?
- **Eigenvectors** are the directions in the original variable space along which the variance in the data is maximised. They relate to my original data in two ways:
- $V$ is the matrix of right singular vectors; they are the *eigenvectors of the columns* (species or variables). The quantities can be accessed by `env_pca$CA$v`.
- $U$ is the matrix of left singular vectors; *i.e.*, the *eigenvectors of the rows* (sites). The left singular vectors can be accessed by `env_pca$CA$u`
- $\Sigma$ is a diagonal matrix of singular values $d$, and the **eigenvalues** are the square of these singular values, $eig=d^ 2$. Eigenvalues represent the amount of variance explained by each corresponding eigenvector (or principal component), represented in PCA (and other unconstrained ordinations such as CA) by `env_pca$CA$eig`. Higher eigenvalues indicate that the corresponding principal component explains a larger portion of the total variance in the data. The sum of all eigenvalues is equal to the total variance in the data.
:::
Let me map these directly to my `env_pca` object, which I will assume was created with the `rda()` function. Looking inside the `env_pca` object, I will see a structure with a component named $CA$ or $CCA$. This holds the results for the constrained or correspondence analysis. For a simple PCA, all the results are in $CA$. The eigenvectors (`env_pca$CA$v` for variables, `env_pca$CA$u` for sites) define the direction of the new axes; they are not the final coordinates for ordination plots. (The eigenvalues, the variance carried by each axis, are in `env_pca$CA$eig`.) *Projection* is the process of mapping the original data points onto the new reduced axes (principal components). What I plot are actually these *scores*.
- **Scores** are the coordinates of the original sites and species projected onto the new ordination axes. They are calculated by scaling the eigenvectors. These scores are what I actually plot to create a biplot or triplot.
- The **Site Scores** tell me where each site (sample) is located along the new ordination axes. Sites that are close together in the ordination plot have similar variable (or species) compositions. I obtain the site scores as `scores(env_pca, display = "site", ...)`.
- The **Species Scores** (more accurately, **variable scores** in this context) describe how the columns of the data matrix, whatever they represent ecologically, relate to the ordination axes. Sometimes I call the species scores *loadings*. In a PCA biplot, these are represented by arrows (but in CA not, as that reduced space is no longer linear). An arrow pointing strongly towards PCA1 means that species or variable is a major contributor to that axis of variation. So, species scores represent the contributions of the original variables to the principal components. They indicate the strength and direction of the correlation between the original variables and the new principal components. Species scores are obtained with `scores(env_pca, display = "species", ...)`. Larger positive values indicate a stronger positive correlation, while larger negative values indicate a stronger negative correlation. It is worth considering the terminology "species scores". The label “species scores” is structural since it inherited from a general ordination framework in which columns of the data matrix often correspond to species abundances. In PCA applied to environmental data, those columns contain variables, not taxa. The software does not “know” what the columns represent, so it treats them as abstract descriptors whose interpretation is supplied entirely by the analyst. Confusing function names with ontological claims may be a common source of error in multivariate analysis, and it is therefore important to read ordination output as a mathematical (or conceptual, if math is not your strong point) representation first, and only later as an ecological one.
Here is the direct mapping:
| Mathematical Concept | `vegan` Term | Where to Find It in the `rda` Object | Description |
| :--- | :--- | :--- | :--- |
| **Eigenvalues** | Eigenvalues | `env_pca$CA$eig` | The amount of variance explained by each axis (component). `summary(env_pca)` shows this clearly. |
| **Eigenvectors of Variables** | Species Scores | `env_pca$CA$v` | The raw eigenvectors for the columns of the data (the variables). These are scaled to become the "species scores" that you plot as arrows. |
| **Eigenvectors of Sites** | Site Scores | `env_pca$CA$u` | The raw eigenvectors for the rows of the data (the sites). These are scaled to become the "site scores" that you plot as points. |
| **Plottable Coordinates** | Scores | `scores(env_pca)` | The function `scores()` extracts and properly scales the raw eigenvectors (`$u` and `$v`) to give the coordinates ready for plotting. |
When interpreting a PCA biplot (see below), the species scores (loadings) indicate the contributions of the original variables to the principal components, while the site scores represent the positions of the sites in the reduced-dimensional space defined by those principal components. The direction and length of the species score vectors (arrows) provide information about the relationships between the original variables and the principal components, while the positions of the site points reflect the similarities or differences between sites based on the environmental gradients represented by the principal components.
<!-- It is customary to plot the Site and Species Scores as ordination diagrams called **biplots** for a few of the reduced axes. I will get to this below. -->
<!-- **Scaling 1** and **Scaling 2**, depending on what was specified in the `rda()` function call, are useful for whether one wants to interpret species (scaling 1) or variables (scaling 2). When calling Scaling 1, the distances between points plotted on the ordination diagram (i.e. the information in the rows, which in this case is sites) will retain their Euclidean distances, which allows for better interpretation of how sites relate to one-another. Calling Scaling 2 preserves more accurately the angles between variables (the columns, i.e. here I have environmental variables) with the consequence that in the biplot smaller angles between variable vectors will reflect stronger correlations. More on scaling below. -->
## How Many Axes to Retain?
The number of axes to retain is a difficult question to answer. The first few axes will always explain the most variation, but how do I know how many reduced axes are influential and should be kept? Commonly recommended is the **broken stick** method, namely keep the principal components whose eigenvalues are higher than corresponding random broken stick components:
```{r code-pca-screeplot, fig.height=4, fig.width=6}
#| fig.align: center
#| fig.cap: "Scree plot of the Doubs River environmental data PCA."
# make a scree plot using the vegan function:
screeplot(env_pca, bstick = TRUE, type = "lines")
```
Or I can make a scree plot using **ggplot2**, which is more flexible:
```{r fig-pca-scree-ggplot, fig.height=3, fig.width=6}
#| fig.align: center
#| fig.cap: "Scree plot of the Doubs River environmental data PCA made in **ggplot2**."
scree_dat <- data.frame(
eigenvalue = as.vector(eigenvals(env_pca)),
bstick = bstick(env_pca)
)
scree_dat$axis <- rownames(scree_dat)
rownames(scree_dat) <- NULL
scree_dat <- scree_dat |>
mutate(axis = factor(axis, levels = paste0(rep("PC", 11), seq(1:11))))
ggplot(data = scree_dat, aes(x = axis, y = eigenvalue)) +
geom_point() +
geom_line(aes(group = 1)) +
geom_point(aes(y = bstick), colour = "red") +
geom_line(aes(y = bstick, group = 1), colour = "red") +
labs(x = "Principal component", y = "Inertia")
```
In the plot above, the red line is the broken-stick components and the black line the eigenvalues for the different PCs. The plot suggests retaining the first two PC axes, which together account for approximately 74% of the total inertia and capture most of the large-scale variance in the dataset, though not necessarily all ecologically relevant structure. This distinction will be important in constrained ordination and model-based approaches, where axes are evaluated by variance accounted for and by their relationship to explicitly hypothesised environmental drivers. See Numerical Ecology with R pp. 121-122 for more information about how to decide how many PCs to retain.
## Ordination Diagrams
<!-- ### Plots along one-dimension -->
<!-- Now I will construct some primitive graphs of the Site and Species Scores to show how to interpret the eigenvectors associated with each eigenvalue. I will typically not do this kind of graphical display; for plots suitable for publication see the [Biplots](PCA_examples.qmd#biplots) section, below. -->
<!-- The first thing I need to do is extract the Species and Site Scores in a manner that makes them convenient for plotting. To do this, I can apply the `scores()` function to the PCA object, `env_pca`, and assign the output to tidied dataframes. The `scores()` function can tidy the data to some extent, but I make it even tidier in subsequent steps by creating long format data (rather than wide) using the `pivot_longer()` function. Various other bits of code lines accomplish additional restructuring of the data to make datasets that are fully compliant for creating the kind of figure I have in mind: -->
<!-- ```{r} -->
<!-- # species scores first by setting 'display' to species -->
<!-- # I am interested in all axes (PC1 to PC11) so set the 'choices' argument -->
<!-- # appropriately (see `?scores`): -->
<!-- spp_sc <- scores(env_pca, display = "species", choices = seq(1:11), tidy = TRUE) -->
<!-- # now pivot longer to make the data even tidier: -->
<!-- spp_sc <- spp_sc |> -->
<!-- select(-score) |> # remove column -->
<!-- pivot_longer(cols = PC1:PC11, # pivot -->
<!-- names_to = "PC_axis", -->
<!-- values_to = "score") |> -->
<!-- group_by(PC_axis) |> -->
<!-- mutate(rank = rank(abs(score)), # rank absolute scores -->
<!-- origin = 0, # create a column for start of arrows -->
<!-- PC_axis = factor(PC_axis, levels = paste0(rep("PC", 11), seq(1:11)))) |> -->
<!-- # above, reorder the factor levels so PC axis plot in right order -->
<!-- filter(rank >= 10) |> # keep only 2 higheest ranked scores -->
<!-- ungroup() -->
<!-- head(spp_sc) -->
<!-- # now the site scores: -->
<!-- site_sc <- scores(env_pca, display = "sites", choices = seq(1:11), tidy = TRUE) -->
<!-- site_sc <- site_sc |> -->
<!-- select(-score) |> -->
<!-- pivot_longer(cols = PC1:PC11, -->
<!-- names_to = "PC_axis", -->
<!-- values_to = "score") |> -->
<!-- mutate(label = as.numeric(label), -->
<!-- PC_axis = factor(PC_axis, levels = paste0(rep("PC", 11), seq(1:11)))) -->
<!-- head(site_sc) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- # var explained along PC1, PC2, and PC3 for adding to plot: -->
<!-- PC1_var <- round(env_pca$CA$eig[1] / sum(env_pca$CA$eig) * 100, 1) -->
<!-- PC2_var <- round(env_pca$CA$eig[2] / sum(env_pca$CA$eig) * 100, 1) -->
<!-- PC3_var <- round(env_pca$CA$eig[3] / sum(env_pca$CA$eig) * 100, 1) -->
<!-- ``` -->
<!-- Now I can assemble a plot, and in it focus on the first two PCs. It seems somewhat complex, but the code can easily be deciphered if you read through it 'layer-by-layer': -->
<!-- ```{r} -->
<!-- #| fig-width: 8 -->
<!-- #| fig-height: 6 -->
<!-- library(ggrepel) -->
<!-- #| fig.align: center -->
<!-- #| fig.cap: "Ordination plot of the Doubs River environmental data showing only the influence of the first principal component." -->
<!-- #| label: fig-pca-linear -->
<!-- ggplot(data = site_sc, aes(x = PC_axis, y = score)) + -->
<!-- geom_hline(aes(yintercept = 0), linetype = "dashed") + -->
<!-- geom_jitter(shape = 19, width = 0.09, aes(colour = label)) + -->
<!-- scale_colour_viridis_c(name = "Site no.") + -->
<!-- geom_segment(data = spp_sc, aes(x = PC_axis, y = origin, -->
<!-- xend = PC_axis, yend = score), -->
<!-- lineend = "butt", -->
<!-- arrow = arrow(length = unit(3, "mm"), -->
<!-- type = "open", -->
<!-- angle = 30), -->
<!-- alpha = 0.8, size = 0.7, colour = "red") + -->
<!-- geom_text_repel(data = spp_sc, aes(label = label), size = 3.0, -->
<!-- direction = "y", colour = "red") + -->
<!-- annotate(geom = "text", x = 1, y = -2.5, size = 3.0, colour = "red", -->
<!-- label = paste0(PC1_var, "% var. expl.")) + -->
<!-- annotate(geom = "text", x = 2, y = -2.5, size = 3.0, colour = "red", -->
<!-- label = paste0(PC2_var, "% var. expl.")) + -->
<!-- annotate(geom = "text", x = 3, y = -2.5, size = 3.0, colour = "red", -->
<!-- label = paste0(PC3_var, "% var. expl.")) + -->
<!-- coord_flip() + -->
<!-- labs(x = NULL, y = "Score") + -->
<!-- theme( -->
<!-- panel.grid.major.x = element_blank(), -->
<!-- panel.grid.minor.x = element_blank(), -->
<!-- panel.grid.major.y = element_line(colour = "pink", linetype = "dashed"), -->
<!-- legend.position = c(0.075, 0.75), -->
<!-- legend.box.background = element_rect(colour = "black") -->
<!-- ) -->
<!-- ``` -->
<!-- Although you will never see a graph like this one, examining it is nevertheless informative. What I did was: -->
<!-- * **plot all new PC axes on the vertical axis**; they represent the reduced, simplified ecological space; these PC axes are ranked from most important (PC1) to least important (PC11), and each one's ability to explain some property of the environment is ranked by the magnitude of their eigenvalues -->
<!-- * PC1 explains 54.3% of the total variation in the environmental dataset -->
<!-- * PC2 explains an additional 19.7% of the total variation in the environmental dataset -->
<!-- * the cumulative % variance explained by PC1 and PC2 is 74% -->
<!-- * for each PC axis I plot -->
<!-- * the **Site scores as coloured points** -->
<!-- * the colours indicate the sampled sites' numbers -->
<!-- * the points indicate the spread of the sites across linear Euclidean space for the main environmental gradients represented by each PC axis -->
<!-- * the **Species scores as arrows** -->
<!-- * I only plot the top two most heavily loaded absolute eigenvectors -->
<!-- * the main environmental gradients are represented by the arrows -->
<!-- * the gradient represented is annotated by text giving the name of the environmental variables -->
<!-- * the location of the arrow heads is located in Euclidean space at the coordinates provided by the Species scores -->
<!-- * the longer the arrow, the more influence it has on causing the sites to spread out in Euclidean space -->
<!-- * the arrows point in the direction where the magnitude of the environmental variable is greater, and in the opposite direction the magnitude of the variable is less; for instance, sites are spread out along PC1 primarily due to the influence of the variables nitrate and distance from source such that the sites further down the river (more yellow) tend to have a higher nitrate concentration and have a larger distance from source, and sites closer to the source (more blue) have a smaller distance from source and lower nitrate concentration. -->
I provide some examples of ordination diagrams scattered throughout the module content (e.g., [here](PCA_examples.qmd#a-ggplot-biplot)), but you may also refer to the step-by-step walk throughs provided by [Roeland Kindt](https://rpubs.com/Roeland-KINDT/694016). Also see David Zelený's [excellent writing on the topic](https://www.davidzeleny.net/anadat-r/doku.php/en:ordiagrams).
Let me look at examples. In a PCA ordination diagram, following the tradition of scatter diagrams in Cartesian coordinate systems, objects are represented as points and variables are displayed as arrows. I first use the standard **vegan** `biplot()` function:
```{r code-pca-biplots}
#| fig-width: 10
#| fig-height: 5.625
#| fig.align: center
#| fig.cap: "Ordination plot of the Doubs River environmental data showing site scaling (left) and species scaling (right)."
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
biplot(env_pca, scaling = 1, main = "PCA scaling 1", choices = c(1, 2))
biplot(env_pca, scaling = 2, main = "PCA scaling 2", choices = c(1, 2))
par(opar)
```
### How to Read This PCA
Interpretation is a skill worth seeing modelled. Reading the scaling 2 biplot above, with its variable arrows, here is what the ordination says about the Doubs River:
- **PC1 carries 54% of the variation**, far more than any other axis, so it is the dominant pattern in the data.
- **The sites spread along PC1 in river order.** The low-numbered sites near the source gather at one end; the high-numbered sites towards the mouth gather at the other. PC1 recovers the source-to-mouth sequence without ever being told where a site sits on the river.
- **The variable arrows split into two opposing groups.** Elevation (`ele`), slope (`slo`), and dissolved oxygen (`oxy`) point towards the headwater end. Distance from source (`dfs`), discharge (`dis`), hardness (`har`), the nutrients (`nit`, `pho`, `amm`), and biological oxygen demand (`bod`) point towards the lowland end.
- **PC1 is therefore an upstream-to-downstream gradient.** One end is the cool, steep, oxygen-rich headwaters; the other is the warmer, harder, nutrient-rich and organically loaded lower river.
- **PC2 (20%) is a weaker secondary contrast**, separating sites with high discharge and oxygen from those with high organic load, and it is harder to interpret cleanly. This is common: lower axes often mix several minor influences.
The ecological reading and the [correlation matrix](correlations.qmd) tell the same story. Variables whose arrows point the same way are the collinear groups seen earlier, and PC1 is the single gradient that their shared variance was expressing all along.
## Scaling
The choice of scaling in a PCA biplot helps me to make the right inferences about my data. Each scaling answers a different question, and the interpretation of distances, angles, and vector lengths depends on which question is being asked. The table is the short version to memorise:
| Scaling | Interpret best | Use when the question is about |
| :--- | :--- | :--- |
| Scaling 1 | distances among sites | which sites resemble which |
| Scaling 2 | correlations among variables | how variables relate to each other and to the axes |
**Scaling 1** should be used when the primary question concerns how *sites relate to one another across environmental gradients*. Distances among objects (samples or sites, whatever is contained in the data table's rows) in the biplot are approximations of their Euclidean distances in multidimensional space. Objects positioned further apart show a greater degree of environmental dissimilarity. The angles among descriptor vectors should not be interpreted as indicating the degree of correlation between the variables.
**Scaling 2** should be used when the primary question concerns how *variables relate to the ordination axes and to one another*. This is the content of the data table's columns. Distances among objects (samples or sites) in the biplot are not approximations of their Euclidean distances in multidimensional space. The angles among descriptor vectors can be interpreted as indicating the degree of correlation between the variables.
Now I create biplots using the `cleanplot.pca()` function that comes with the *Numerical Ecology with R* book. The figures are more or less the same, except that the scaling 1 plot adds a *circle of equilibrium contribution* (see *Numerical Ecology with R*, p. 125).
The circle has a precise meaning. If a variable contributed equally to every ordination axis, its arrow would have a predictable length in the two-dimensional plane I plot, namely the radius of the circle. For a plot of two axes built from $p$ variables, that radius is $\sqrt{2/p}$. A variable whose arrow reaches beyond the circle therefore contributes more than its equal share to the two displayed axes, and one whose arrow falls short contributes less. Two cautions follow. The circle is a reference length and not a significance test, so it tells me which variables are well represented in the plane on view, not which are statistically significant. It also applies only to scaling 1, where arrow lengths are directly comparable; under scaling 2 the arrows encode correlations rather than contributions, and the circle is not drawn. With those caveats, I read importance from the arrows that extend beyond the radius (@fig-pca-cleanplot):
```{r fig-pca-cleanplot}
#| fig-width: 10
#| fig-height: 5.625
#| fig.align: center
#| fig.cap: "Ordination plot of the Doubs River environmental data made with the `cleanplot.pca()` function."
# we need to load the function first from its R file:
source(here::here(
"data",
"BCB743",
"NEwR-2ed_code_data",
"NEwR2-Functions",
"cleanplot.pca.R"
))
cleanplot.pca(env_pca, scaling = 1)
```
**At this point it is essential that you refer to *Numerical Ecology with R* (pp. 118 to 126) for help with interpreting the ordination diagrams.**
## Fitting Environmental Response Surfaces
The `ordisurf()` function in **vegan** is used to visualise underlying environmental gradients on an ordination plot. This function fits a smooth surface (usually a generalised additive model GAM fitted to the site scores of PC axes of interest) to the ordination plot based on a specified environmental variable. It highlights how that variable changes across the ordination space so that I may interpret the spatial structure of the data in relation to the environmental gradient more easily.
For more about `ordisurf()`, see Gavin Simpson's blog post [What is `ordisurf()` doing?](https://fromthebottomoftheheap.net/2011/06/10/what-is-ordisurf-doing/).
I plot the response surfaces for elevation and biological oxygen demand:
```{r code-pca-ordisurf}
#| message: false
#| warning: false
#| fig-width: 10
#| fig-height: 5.625
#| fig.align: center
#| fig.cap: "Ordination plot of the Doubs River environmental data fitted with a smooth response surface for elevation and biological oxygen demand."
biplot(env_pca, type = c("text", "points"), col = c("black", "black"))
invisible(ordisurf(
env_pca ~ bod,
env,
add = TRUE,
col = "turquoise",
knots = 1
)) # <1>
invisible(ordisurf(env_pca ~ ele, env, add = TRUE, col = "salmon", knots = 1))
```
1. `invisible()` is used to suppress the output of the `ordisurf()` function; only the figure is returned.
## When PCA Fails: The Horseshoe Effect {#sec-horseshoe-effect}
Having run, plotted, and interpreted a PCA, I am in a position to see where it breaks down. On the Doubs environmental data PCA performed well, since the variables relate to the river gradient in a roughly linear way. On species data the picture often changes, and the most common symptom is the horseshoe.
The **horseshoe effect** (sometimes called the Guttman effect) is an artefact often seen with PCA when applied to species data, especially when using species abundance data for communities along environmental gradients. It distorts the data points in the ordination space. A less severe version of the horseshoe effect is called the ‘arch effect’ and is seen in [Correspondence Analysis (CA)](CA.qmd).
The horseshoe effect occurs because PCA assumes linear relationships between variables, while species data often exhibit unimodal responses to environmental gradients. The unimodal model was discussed in [BDC334](../BDC334/Lec-03-gradients.qmd#sec-unimodal). When species have unimodal distributions along a gradient, PCA tends to fold the ends of the gradient towards each other.
From a geometric perspective, the horseshoe effect indicates a mismatch between the structure of the data and the space into which PCA forces it. Species responding unimodally to an underlying gradient occupy a curved configuration in multivariate space. PCA, however, seeks a linear subspace that maximises variance under Euclidean geometry. In doing so, it approximates this curved structure with straight axes, causing the ends of the gradient to bend back towards one another in the reduced ordination. The horseshoe, which is visible on the ordination diagram, is therefore not a peculiarity of the data, but a consequence of representing a non-linear manifold using a linear projection.
This distortion can lead to several issues:
1. The horseshoe shape can make it appear that sites at opposite ends of the gradient are more similar than they really are.
2. The folding of the gradient ends compresses the data and potentially obscures important ecological patterns.
3. The second PCA axis (the most affected axis) often does not represent a meaningful ecological gradient, making interpretation challenging.
4. Unlike the arch effect in CA, the horseshoe effect in PCA can lead to incorrect ordering of samples along the gradient.
These distortions are informative, because they reveal the limits of PCA’s linear and Euclidean assumptions when confronted with non-linear ecological structure. To address these issues, I prefer to use [non-metric Multidimensional Scaling (nMDS)](nMDS.qmd) or a distance-based method (like [Principal Coordinates Analysis, PCoA](PCoA.qmd)) that are not constrained to linear projections. Alternatively, I may use techniques specifically designed to handle unimodal species responses, such as [Correspondence Analysis (CA)](CA.qmd), which uses $\chi^2$-distance and not Euclidean distance, or its detrended version ([Detrended Correspondence Analysis, DCA](DCA.qmd)), but these come with their own set of considerations.
PCA is best understood as an exploratory and diagnostic method rather than a catch-all solution for ecological ordination. Its strength lies in revealing broad structure in multivariate data, i.e., dominant gradients, major axes of covariation, and large-scale patterns that organise samples and variables in a reduced space. When applied to environmental data, PCA can provide a clear initial orientation to highlight correlated drivers, identify major contrasts among sites, and offer a coherent global summary of variation. Its limitations arise from the scope of a PCA; it is constrained to linear relationships under Euclidean geometry, and therefore cannot represent non-linear ecological responses without distorting them. PCA can establish a baseline against which alternative ordination methods can be evaluated and guide you towards approaches such as CA, CCA, nMDS, or distance-based methods that are better aligned with specific response forms and inferential goals.
::: {.callout-warning appearance="simple"}
## Common Mistakes
A few errors recur often enough to name. Each undoes part of what the chapter has built:
- **Reading variance explained as ecological importance.** A large eigenvalue means an axis compresses the data well, not that it represents the dominant ecological process. The two can differ.
- **Treating a PC axis as a measured variable.** An axis is a composite of many variables; it has no units and is interpreted through the loadings, not taken directly from the plot.
- **Ignoring scaling.** Distances among sites are interpretable under scaling 1, correlations among variables under scaling 2. Reading the wrong quantity from the wrong biplot is a common slip.
- **Running PCA on raw species counts.** Euclidean geometry and unimodal species responses do not match, and the result is a horseshoe. Use CA, nMDS, or a distance-based method, or transform the data first.
:::
## References
::: {#refs}
:::