Summary: Unconstrained Ordinations

Author

Affiliation

Published

June 24, 2024

In all the ordination techniques we have seen thus far, the primary goal is to represent high-dimensional data in a lower-dimensional space (usually 2D or 3D) while preserving as much of the original structure as possible. Points that are close together in the ordination plot are generally more similar in the original high-dimensional space.

Principal Component Analysis (PCA)

Data Type: Continuous data (e.g., environmental variables)
Interpretation: PCA identifies axes (principal components) that explain the maximum variation in the data. The axes represent linear combinations of the original variables that explain the most variance. The distance between points reflects their Euclidean dissimilarity. The loading values of variables on the axes indicate their contribution to the variation. Angles between variable arrows in the biplot represent correlations.
In vegan: rda() without constraining variables is used for PCA. Biplots can be created using the biplot() function, showing both sample scores and variable loadings.
Assumption: PCA assumes linear relationships between variables.

Correspondence Analysis (CA)

Data Type: Categorical or count data (e.g., species abundance)
Interpretation: CA explores the relationship between rows (e.g., sites) and columns (e.g., species), i.e. the biplot shows the relationships between rows and columns; but pay attention to the scaling. The distances between points in the ordination space reflect their $χ^{2}$ dissimilarity. Preserves original distances as well as possible in low-dimensional space.
In vegan: The cca() function with no constraining variables, specified with the formula = ~., is used for CA. Similar to PCA, biplots can be created.
Assumption: Assumes unimodal species responses and weighted averaging.

Principal Coordinate Analysis (PCoA)

Data Type: Distance or dissimilarity matrices (any type of distance) in vegan’s vegdist()
Interpretation: PCoA aims to represent the distances between objects in a low-dimensional space while preserving the original dissimilarities as much as possible. Interpretation depends on the chosen distance measure.
In vegan: The capscale() function with a distance matrix as input performs PCoA. Biplots are not directly applicable to PCoA but figures can be constructed in layers using ordiplot(), etc.
Assumption: PCoA does not assume linear relationships between variables.

Non-Metric Multidimensional Scaling (NMDS)

Data Type: Distance or dissimilarity matrices (any type of distance)
Interpretation: NMDS is an iterative method that tries to arrange objects in low-dimensional space so that the rank order of distances in the ordination matches the rank order of the original dissimilarities. The stress value indicates how well the ordination represents the original distances. Like PCoA, interpretation depends on the chosen distance measure. nMDS is considered more robust as it doesn’t assume linearity, but it can be sensitive to outliers and tied ranks.
In vegan: The metaMDS() function is used for NMDS. You can use the envfit() function to add environmental variables or species scores to the plot, but it’s an indirect fitting process.
Assumption: nMDS does not assume linear relationships between variables.

Which Method to Choose?

This depends on your data type, research question, and the type of dissimilarities you want to analyse:

Data Type: CA is more suitable for abundance data with many zeros, while PCA is better for continuous environmental variables. PCoA and NMDS for dissimilarities (use Gower distances for categorical, ordinal, or binary data types).
Distance/Dissimilarity: If you have a specific distance measure in mind (e.g., Bray-Curtis), use PCoA or NMDS.
Linear vs. Non-linear: PCA and CA assume linear relationships, while NMDS can capture non-linear patterns.
Focus: If you want to emphasise species composition, CA or NMDS might be suitable. If the focus is on the underlying gradients explaining the variation, PCA or PCoA could be preferred.

Considerations in vegan

Standardisation: Pay attention to data standardisation or transformation before analysis—this is especially the case for environmental variables measured along different units. vegan provides various options for standardisation, or use functions in base R. Species data typically do not require transformation, unless some special considerations are needed, for instance when working with overly dominant or rare species.
Ecological Interpretation: Use vegan’s envfit() and ordistep() to facilitate the interpretation of the relationship between community composition and environmental variables in ordination plots.
Dimensionality: Typically we visualise the relationships in 2D plots, but higher dimensions may be important. We can use vegan’s screeplot() function to help determine how many influential axes to retain.
Scaling: The scaling of ordination biplots can affect interpretation. scaling = 1 emphasises relationships among samples and scaling = 2 emphasises relationships among variables.
Proportion of Variance Explained: The vegan functions provide information on the proportion of variation explained by the reduced axes for PCA, CA, and PCoA.
Plotting: The ordiplot() function provides a consistent interface for plotting different ordination results. There are also various ways to enhance ordination plots, such as ordihull(), ordiellipse() for grouping; envfit() for fitting environmental variables; ordisurf() for response surfaces. You can also access the various components of the ordination results (e.g., scores, loadings) for custom plotting with ggplot2, which might be necessary to create more insightful and less cluttered figures for publication.

Applying Ordination Techniques to Environmental Data

Typically, ordinations are applied to species data. Sometimes, however, we may want to apply ordinations to the environmental data itself. In this way, we allow the ‘environment’ to speak for itself, revealing some patterns that we may then use to inform subsequent analyses. For example:

It can reveal the presence of correlated variables, which can be problematic in subsequent analyses. For example, if two variables are highly correlated, they may both appear to be important in explaining the species data, but in reality, only one of them is driving the patterns. Such correlated variable can be seen on the ordination plots as vectors pointing in the same direction.
It can help identify major gradients in the environmental variables, and this can then be related to the species composition. The lengths of the environmental vectors on ordination plots can be used to infer the importance of the variables in structuring the data. Strong gradients can be hypothesised to influence species composition. So, once we have set up hypotheses about the presumed influential environmental gradients, we can explore how these gradients correlate with the species data. Even without directly analysing the species data, we can infer potential influences of environmental factors on species distributions and community composition.
Another way to identify influential variables that might not form strong gradients is by plotting the ordination results and identifying clusters of similar environmental conditions. As before, we can use these clusters to hypothesise about potential similarities in community structure in these areas.
It allows us to develop a solid understanding of the environmental variation across the landscape and sets a baseline for interpreting any patterns observed in the species data. These ordination results can be plotted on maps for supplementary visualisations of the environmental gradients across the study area.
We can use functions like envfit() (see below) to fit species data to the environmental ordination space, which will facilitate our understanding of how well the environmental variables explain the species composition. If the environmental variables explain a large proportion of the variation in the species data, this suggests a strong relationship between the environment and the species composition.
An analysis of the environmental data can also lead us to further analyses, such as some of the constrained ordinations or multiple linear regression, which directly relate environmental variables to species data.

Linking Environmental Properties to Species Data

Typically, one is interested in understanding the relationship between species composition and environmental variables. This can be achieved by fitting environmental variables to the ordination space using envfit(), ordisurf(), ordiellipse(), ordispider(), and ordihull(). The ordistep() function can help identify the most important variables.

envfit() involves performing an unconstrained ordination on the species data alone and afterwards fitting environmental vectors onto the ordination plot. The environmental vectors are projected onto the ordination space, and their direction and length indicate the correlation and strength of each environmental variable with the ordination axes. The envfit() function can also be used to test the significance of the environmental variables in structuring the species data. We use envfit() to explore species patterns first and then see how these patterns are related to the environment—so, our primary interest is in understanding the intrinsic patterns of species composition without initially imposing any constraints from environmental data.

ordisuf() and ordiellipse() are used to visualise the response of species composition to environmental gradients or factors. ordisurf() fits a response surface to the ordination plot, showing how species composition changes along the environmental gradients. ordiellipse() draws ellipses around groups of samples, which can be defined by environmental variables or other factors. ordispider() and ordihull() are used to draw lines or polygons around groups of samples, respectively. These functions, therefore, show us how gradients vary across the landscape, and how species or sites are related to some categorical influential variables.

Constrained ordination (also known as canonical ordination) directly incorporates environmental variables into the ordination process. The ordination axes are linear combinations of environmental variables, meaning that the ordination is directly constrained by the environmental data. To do this, we do a constrained ordination (such as db-RDA or CCA), where the species data are directly related to the environmental variables. This allows us to to explicitly model the variation in species data that can be explained by the environmental variables, and it helps us understand the direct influence of environmental factors on species composition. Typically, we would choose constrained ordinations when our primary interest is in understanding how much of the variation in species composition can be explained by environmental variables. It is also useful when we have some hypotheses about the influence of a priori selected environmental variables on species distribution and want to test them formally. Lastly, constrained ordination also lets us partition the variance in species data into components explained by different kinds of environmental variables, and in so doing revealing also the residual (unexplained) components. Use the capscale() function in vegan to perform constrained ordination (see Distance-Based Redundancy Analysis). This allows us to explore how environmental variables structure the data and how they relate to each other.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2024,
  author = {Smit, A. J.,},
  title = {Summary: {Unconstrained} {Ordinations}},
  date = {2024-06-24},
  url = {http://tangledbank.netlify.app/BCB743/unconstrained-summary.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2024) Summary: Unconstrained Ordinations. http://tangledbank.netlify.app/BCB743/unconstrained-summary.html.

--- date: "2024-06-24" title: "Summary: Unconstrained Ordinations" --- In all the ordination techniques we have seen thus far, the primary goal is to represent high-dimensional data in a lower-dimensional space (usually 2D or 3D) while preserving as much of the original structure as possible. Points that are close together in the ordination plot are generally more similar in the original high-dimensional space. ## Principal Component Analysis (PCA) * **Data Type:** Continuous data (e.g., environmental variables) * **Interpretation:** PCA identifies axes (principal components) that explain the maximum variation in the data. The axes represent linear combinations of the original variables that explain the most variance. The distance between points reflects their Euclidean dissimilarity. The loading values of variables on the axes indicate their contribution to the variation. Angles between variable arrows in the biplot represent correlations. * **In vegan:** `rda()` without constraining variables is used for PCA. Biplots can be created using the `biplot()` function, showing both sample scores and variable loadings. * **Assumption:** PCA assumes linear relationships between variables. ## Correspondence Analysis (CA) * **Data Type:** Categorical or count data (e.g., species abundance) * **Interpretation:** CA explores the relationship between rows (e.g., sites) and columns (e.g., species), i.e. the biplot shows the relationships between rows and columns; but pay attention to the scaling. The distances between points in the ordination space reflect their $\chi^2$ dissimilarity. Preserves original distances as well as possible in low-dimensional space. * **In vegan:** The `cca()` function with no constraining variables, specified with the `formula = ~.`, is used for CA. Similar to PCA, biplots can be created. * **Assumption:** Assumes unimodal species responses and weighted averaging. ## Principal Coordinate Analysis (PCoA) * **Data Type:** Distance or dissimilarity matrices (any type of distance) in **vegan**'s `vegdist()` * **Interpretation:** PCoA aims to represent the distances between objects in a low-dimensional space while preserving the original dissimilarities as much as possible. Interpretation depends on the chosen distance measure. * **In vegan:** The `capscale()` function with a distance matrix as input performs PCoA. Biplots are not directly applicable to PCoA but figures can be constructed in layers using `ordiplot()`, etc. * **Assumption:** PCoA does not assume linear relationships between variables. ## Non-Metric Multidimensional Scaling (NMDS) * **Data Type:** Distance or dissimilarity matrices (any type of distance) * **Interpretation:** NMDS is an iterative method that tries to arrange objects in low-dimensional space so that the rank order of distances in the ordination matches the rank order of the original dissimilarities. The stress value indicates how well the ordination represents the original distances. Like PCoA, interpretation depends on the chosen distance measure. nMDS is considered more robust as it doesn't assume linearity, but it can be sensitive to outliers and tied ranks. * **In *vegan*:** The `metaMDS()` function is used for NMDS. You can use the `envfit()` function to add environmental variables or species scores to the plot, but it's an indirect fitting process. * **Assumption:** nMDS does not assume linear relationships between variables. ## Which Method to Choose? This depends on your data type, research question, and the type of dissimilarities you want to analyse: * **Data Type:** CA is more suitable for abundance data with many zeros, while PCA is better for continuous environmental variables. PCoA and NMDS for dissimilarities (use Gower distances for categorical, ordinal, or binary data types). * **Distance/Dissimilarity:** If you have a specific distance measure in mind (e.g., Bray-Curtis), use PCoA or NMDS. * **Linear vs. Non-linear:** PCA and CA assume linear relationships, while NMDS can capture non-linear patterns. * **Focus:** If you want to emphasise species composition, CA or NMDS might be suitable. If the focus is on the underlying gradients explaining the variation, PCA or PCoA could be preferred. ## Considerations in **vegan** * **Standardisation:** Pay attention to data standardisation or transformation before analysis---this is especially the case for environmental variables measured along different units. **vegan** provides various options for standardisation, or use functions in base R. Species data typically do not require transformation, unless some special considerations are needed, for instance when working with overly dominant or rare species. * **Ecological Interpretation:** Use **vegan**'s `envfit()` and `ordistep()` to facilitate the interpretation of the relationship between community composition and environmental variables in ordination plots. * **Dimensionality:** Typically we visualise the relationships in 2D plots, but higher dimensions may be important. We can use **vegan**'s `screeplot()` function to help determine how many influential axes to retain. * **Scaling:** The scaling of ordination biplots can affect interpretation. `scaling = 1` emphasises relationships among samples and `scaling = 2` emphasises relationships among variables. * **Proportion of Variance Explained:** The *vegan* functions provide information on the proportion of variation explained by the reduced axes for PCA, CA, and PCoA. * **Plotting:** The `ordiplot()` function provides a consistent interface for plotting different ordination results. There are also various ways to enhance ordination plots, such as `ordihull()`, `ordiellipse()` for grouping; `envfit()` for fitting environmental variables; `ordisurf()` for response surfaces. You can also access the various components of the ordination results (e.g., scores, loadings) for custom plotting with **ggplot2**, which might be necessary to create more insightful and less cluttered figures for publication. ## Applying Ordination Techniques to Environmental Data Typically, ordinations are applied to species data. Sometimes, however, we may want to apply ordinations to the environmental data itself. In this way, we allow the 'environment' to speak for itself, revealing some patterns that we may then use to inform subsequent analyses. For example: - It can reveal the presence of correlated variables, which can be problematic in subsequent analyses. For example, if two variables are highly correlated, they may both appear to be important in explaining the species data, but in reality, only one of them is driving the patterns. Such correlated variable can be seen on the ordination plots as vectors pointing in the same direction. - It can help identify major gradients in the environmental variables, and this can then be related to the species composition. The lengths of the environmental vectors on ordination plots can be used to infer the importance of the variables in structuring the data. Strong gradients can be hypothesised to influence species composition. So, once we have set up hypotheses about the presumed influential environmental gradients, we can explore how these gradients correlate with the species data. Even without directly analysing the species data, we can infer potential influences of environmental factors on species distributions and community composition. - Another way to identify influential variables that might not form strong gradients is by plotting the ordination results and identifying clusters of similar environmental conditions. As before, we can use these clusters to hypothesise about potential similarities in community structure in these areas. - It allows us to develop a solid understanding of the environmental variation across the landscape and sets a baseline for interpreting any patterns observed in the species data. These ordination results can be plotted on maps for supplementary visualisations of the environmental gradients across the study area. - We can use functions like `envfit()` (see below) to fit species data to the environmental ordination space, which will facilitate our understanding of how well the environmental variables explain the species composition. If the environmental variables explain a large proportion of the variation in the species data, this suggests a strong relationship between the environment and the species composition. - An analysis of the environmental data can also lead us to further analyses, such as some of the constrained ordinations or multiple linear regression, which directly relate environmental variables to species data. ## Linking Environmental Properties to Species Data Typically, one is interested in understanding the relationship between species composition and environmental variables. This can be achieved by fitting environmental variables to the ordination space using `envfit()`, `ordisurf()`, `ordiellipse()`, `ordispider()`, and `ordihull()`. The `ordistep()` function can help identify the most important variables. `envfit()` involves performing an unconstrained ordination on the species data alone and afterwards fitting environmental vectors onto the ordination plot. The environmental vectors are *projected* onto the ordination space, and their direction and length indicate the correlation and strength of each environmental variable with the ordination axes. The `envfit()` function can also be used to test the significance of the environmental variables in structuring the species data. We use `envfit()` to explore species patterns first and then see how these patterns are related to the environment---so, our primary interest is in understanding the intrinsic patterns of species composition without initially imposing any constraints from environmental data. `ordisuf()` and `ordiellipse()` are used to visualise the response of species composition to environmental gradients or factors. `ordisurf()` fits a response surface to the ordination plot, showing how species composition changes along the environmental gradients. `ordiellipse()` draws ellipses around groups of samples, which can be defined by environmental variables or other factors. `ordispider()` and `ordihull()` are used to draw lines or polygons around groups of samples, respectively. These functions, therefore, show us how gradients vary across the landscape, and how species or sites are related to some categorical influential variables. Constrained ordination (also known as canonical ordination) directly incorporates environmental variables into the ordination process. The ordination axes are linear combinations of environmental variables, meaning that the ordination is directly constrained by the environmental data. To do this, we do a constrained ordination (such as db-RDA or CCA), where the species data are directly related to the environmental variables. This allows us to to explicitly model the variation in species data that can be explained by the environmental variables, and it helps us understand the direct influence of environmental factors on species composition. Typically, we would choose constrained ordinations when our primary interest is in understanding how much of the variation in species composition can be explained by environmental variables. It is also useful when we have some hypotheses about the influence of *a priori* selected environmental variables on species distribution and want to test them formally. Lastly, constrained ordination also lets us partition the variance in species data into components explained by different kinds of environmental variables, and in so doing revealing also the residual (unexplained) components. Use the `capscale()` function in **vegan** to perform constrained ordination (see [Distance-Based Redundancy Analysis](constrained_ordination.html)). This allows us to explore how environmental variables structure the data and how they relate to each other.