---
date: last-modified
title: "11b: Community Analysis Case Study: Mayombo's Diatom Data"
---
```{r code-brewing-opts, echo=FALSE}
knitr::opts_chunk$set(
comment = "R>",
warning = FALSE,
message = FALSE,
fig.width = 4.5,
fig.height = 2.625,
out.width = "75%",
fig.asp = NULL, # control via width/height
dpi = 300
)
ggplot2::theme_set(
ggplot2::theme_minimal(base_size = 8)
)
ggplot2::theme_set(
ggplot2::theme_bw(base_size = 8)
)
```
<!-- # Topic 11: Community analysis case study (Mayombo et al. 2019) -->
::: callout-tip
## **Material Required for This Chapter**
| Type | Name | Link |
| :--- | :--- | :--- |
| **Reading** | Serge Mayobo's diatom paper | [💾 `Mayombo_et_al_2019.pdf`](../docs/Mayombo_et_al_2019.pdf) |
| **Data** | Abbreviated diatom data matrix | [💾 `PB_data_matrix_abrev.csv`](../data/BCB743/diatoms/PB_data_matrix_abrev.csv) |
| | Diatoms data matrix | [💾 `PB_data_matrix.csv`](../data/BCB743/diatoms/PB_data_matrix.csv) |
| | Diatom environmental data | [💾 `PB_diat_env.csv`](../data/BCB743/diatoms/PB_diat_env.csv) |
:::
Kelp forests are known to host a large biomass of epiphytic fauna and flora, including diatoms, which constitute the base of aquatic food webs and play an important role in the transfer of energy to higher trophic levels. Epiphytic diatom assemblages associated with two common species of South African kelps, *Ecklonia maxima* and *Laminaria pallida*, were investigated in this study. Primary blades of adult and juvenile thalli of both kelp species were sampled at False Bay in July 2017 and analysed using scanning electron microscopy. The diatom community data are here subjected to a suite of multivariate methods to show the structure of the diatom flora as a function of i) kelp species and ii) kelp size. Read @mayombo2019diatoms for more details and the findings of the research.
This chapter is less an nMDS tutorial than a worked case study in how a community-ecology analysis is built, from the sampling design through to species-level interpretation. The nMDS is one step in that sequence. The workflow runs from the species matrix to a statement about which diatoms drive any pattern (@fig-diatom-workflow), and each step answers a distinct question.
```{mermaid}
%%| label: fig-diatom-workflow
%%| fig-cap: "The community-analysis workflow followed in this chapter. Each step answers a different question, and the methods are chosen to suit the nested sampling design."
flowchart TD
A["Species matrix<br/>(samples × diatoms)"] --> B["Transform (log)"]
B --> C["Bray-Curtis<br/>dissimilarity"]
C --> D["betadisper<br/>(equal dispersions?)"]
D --> E["PERMANOVA<br/>(do groups differ?)"]
C --> F["nMDS<br/>(visualise structure)"]
E --> G["manyglm<br/>(which taxa?)"]
F --> G
```
Each method answers one question in that chain:
| Method | Question it answers |
| :--- | :--- |
| Bray-Curtis dissimilarity | how different is the diatom community between two samples? |
| `betadisper()` | do the groups have equal multivariate spread, a precondition for PERMANOVA? |
| PERMANOVA (`adonis2()`) | do the group means differ in community composition? |
| nMDS | what does the community structure look like in two dimensions? |
| `manyglm()` | which individual diatom taxa drive the differences? |
::: {.callout-note collapse="true" appearance="simple"}
## Peer Review Insight
The study was peer-reviewed, and a reviewer's comments together with my responses are reproduced here, since they show how statistical choices of this kind are questioned and defended in practice rather than being arbitrary.
**Reviewer 1**
The design of the observational study includes 2 treatments - age (young versus old) and host species (*Laminaria* versus *Ecklonia*), 4 replicates (4 primary blades from each combination of host algae and age), and 3 subsamples from each blade (pseudoreplicates, if treated incorrectly as replicates). The experimental design is analogous to a 2-way ANOVA, but with community data instead of a single individual response variable. This design can evaluate interactive effects between the two treatments (age and species). The authors' experimental design is most suited to analyses using PERMANOVA, which is the community statistics version of the ANOVA.
Please indicate for the readers why the data were transformed and standardised using the stated procedures. Definitely a good idea to transform data, but the readers need to understand why particular procedures were employed. Please describe the Wisconsin double standardisation:
- row/column standardised by row/column total to produce relative abundance to total and column/row standardised; vs.
- column/row max--to produce abundance relative to species max abundance.
Why a double standardisation + square-root transformation, as opposed to a single row/column standardisation by row/column total + square-root transformation?
**AJS: About ANOSIM and PERMANOVA**
Overall, Analysis of Similarities (ANOSIM) and the Mantel test were sensitive to heterogeneity in dispersions, with ANOSIM generally being more sensitive than the Mantel test. In contrast, PERMANOVA and Pillai's trace were mostly unaffected by heterogeneity for balanced designs. [...]. PERMANOVA was also unaffected by differences in correlation structure. [...] PERMANOVA was generally, but not always, more powerful than the others to detect changes in community structure.
**AJS: About data transformation**
Useful when the range of data values is large. Data are square root transformed, and then submitted to Wisconsin double standardisation, or species divided by their maxima, and stands standardised to equal totals. These two standardisations often improve the quality of ordinations.
:::
## Set-Up the Analysis Environment
```{r code-library-tidyverse}
library(tidyverse)
library(vegan)
# library(BiodiversityR)
# Files will be referenced using here::here() for absolute paths
```
## Load and Prepare the Data
### The species data
The diatom species data include the following:
- columns: diatom genera
- rows: samples (samples taken from two species of kelp; equivalent to sites in other species x sites tables)
- row names correspond to combinations of the factors in the columns inside `PB_diat_env.csv`
where `host_size` is `A` for adult kelp plant (host), `J` for juvenile kelp plant (host), `host_spp` is `Lp` for kelp species *Laminaria pallida* (host), `Em` for kelp plant *Ecklonia maxima* (host), `plant` is the unique number identifying a specific kelp plant, and `rep` is the replicate tissue sample from each kelp host plant from which the diatoms were extracted.
```{r code-spp-read-csv-here}
# with shortened name to fix nMDS overplotting
spp <- read.csv(
here::here("data", "BCB743", "diatoms", "PB_data_matrix_abrev.csv"),
row.names = "Replicate",
sep = ",",
header = TRUE
)
spp[1:6, 1:6]
# with full names
spp2 <- read.csv(
here::here("data", "BCB743", "diatoms", "PB_data_matrix.csv"),
row.names = "Replicate",
sep = ",",
header = TRUE
)
spp2[1:6, 1:6]
# remove ".spp" from column header name
colnames(spp) <- str_replace(colnames(spp), "\\.spp", "")
colnames(spp2) <- str_replace(colnames(spp2), "\\.spp", "")
```
Logarithmic transformation as suggested by @anderson2006distance: $log_{b}(x) + 1$ for $x > 0$, where $b$ is the base of the logarithm; zeros are left as zeros. Higher bases give less weight to quantities and more to presences.
```{r code-spp-log-decostand-spp}
spp.log <- decostand(spp, method = "log")
spp.log.dis <- vegdist(spp.log, method = "bray")
```
By default, `decostand(method = "log")` uses a base-2 logarithm. That choice is fine here because the purpose is to compress dominance before Bray-Curtis distances are calculated, but it should not be confused with a natural-log transformation.
### The predictors
The content is described above. These variables are categorical, and they are not actually 'environmental' data, but their purpose in the analysis is analogous to true environmental data, namely they describe where the samples were taken from.
```{r code-env-tibble-read-csv}
env <- tibble(
read.csv(here::here("data", "BCB743", "diatoms", "PB_diat_env.csv")),
sep = ",",
header = TRUE
)
env$plant <- as.factor(env$plant)
env$rep <- as.factor(env$rep)
head(env)
```
With the environmental data (factors), the following analyses can be done:
- ✘ Discriminant Analysis (DA)
- ✘ Analysis of Similarities (ANOSIM)
- ✔︎ Permutational Analysis of Variance (PERMANOVA)
- ✘ Mantel test
I will do an nMDS and PERMANOVA.
## Multivariate Homogeneity of Group Dispersions (Variances)
Before doing the PERMANOVA (testing differences between means), first check to see if the dispersion is the same. See `?adonis2` for more on this.
**Homogeneity of groups** `betadisper()` evaluates the differences in group homogeneities. I can view it as being analogous to Levene's test of the equality of variances. The null hypothesis evaluated is that the multivariate dispersions (the average distance of group members to their group centroid) are equal across groups. Unfortunately I can only use one factor as an independent variable so it is not yet possible to look for interactions (species × size).
So, I test the $H_{0}$ that the dispersion (variance) in diatom community structure does not differ between the two host species:
```{r code-mod-spp-with-env}
(mod.spp <- with(env, betadisper(spp.log.dis, host_spp)))
anova(mod.spp)
```
There is no difference in dispersion between the diatom communities on the two host species. Apply the same procedure to see if host size has an effect:
```{r code-mod-size-with-env}
(mod.size <- with(env, betadisper(spp.log.dis, host_size)))
anova(mod.size)
```
No, it does not have an effect either. Make some plots to visualise the patterns:
```{r fig-par-mfrow-c}
#| fig-width: 8
#| fig-height: 6
par(mfrow = c(2, 2))
plot(mod.spp, sub = NULL)
boxplot(mod.spp)
plot(mod.size)
boxplot(mod.size)
```
Optionally, I can confirm the above analysis with the `permutest()` function. `permutest()` is a permutational ANOVA-like test that tests the $H_{0}$ that there is no difference in the multivariate dispersion of diatom community structure between *Ecklonia maxima* and *Laminaria pallida*, and between adult and juvenile plants:
```{r code-permutest-mod-spp-there}
permutest(mod.spp) # there is in fact no difference
permutest(mod.size) # nope...
```
It should be enough to do the `anova()`, above, though. You can safely ignore the `permutest()`.
## PERMANOVA
Permutational Multivariate Analysis of Variance (PERMANOVA; @anderson2013permanova) uses distance or dissimilarity matrices (Bray-Curtis dissimilarities in this example), whereas ANOSIM uses only ranks of the dissimilarities. The former therefore preserves more information and it is the recommended approach to test for differences between multivariate means. PERMANOVA also allows for variation partitioning and permits for more complex designs (multiple factors, nested factors, interactions, covariates, etc.). To this end, I use `adonis2()` to evaluate the differences in the group means, which makes it analogous to multivariate analysis of variance.
Note that nestedness should be stated in the blocks (plants): "If you have a nested error structure, so that you do not want your data be shuffled over classes (blocks), you should define blocks in your permutation" (Jari Oksanen).
```{r code-perm-how-nperm}
# the permutational structure captures the nesting of replicates within plant
perm <- how(nperm = 1000)
setBlocks(perm) <- with(env, plant)
(perm.1 <- adonis2(
spp.log.dis ~ host_spp * host_size,
data = env,
permutations = perm
))
```
The test returns `Pr(>F) = 1`, which looks alarming but follows directly from the permutation design.
::: {.callout-note appearance="simple"}
## Why Is Pr(>F) Exactly 1?
A *p*-value of 1 looks like a broken analysis, but here it is a logical consequence of the blocked permutation. Host species and host size are fixed properties of each whole plant, while the blocks restrict permutations to *within* each plant. Because every sample from a plant shares that plant's species and size, no within-plant shuffle can change the group labels: the model matrix is identical under every permutation, so every permuted *F* equals the observed *F*, and `Pr(>F)` comes out at exactly 1.
The defensible reading is therefore not "there is strong evidence of no effect" but "this design, with permutations correctly restricted to respect the nesting, cannot by itself test a treatment that is applied at the plant level." Properly testing the species or size effect would mean permuting whole plants between treatment groups, since the plant, not the subsample, is the true unit of replication. The two other results point the same way, and more interpretably: the dispersions are homogeneous (above) and the nMDS groups overlap broadly (below). Together they support the conclusion that the diatom assemblages do not separate clearly by host species or size.
:::
## nMDS
The dispersion test and PERMANOVA asked, numerically, whether the groups differ. The nMDS now shows the community structure those tests were probing, so that the number and the picture can be read together. The two should agree: if the tests find no separation, the ordination should show overlapping groups.
Do the nMDS and assemble the figures:
```{r code-spp-nmds-metamds-spp}
spp.nmds <- metaMDS(
spp.log.dis,
k = 2,
trymax = 100,
trace = 0,
autotransform = FALSE,
wascores = FALSE
)
sppscores(spp.nmds) <- spp.log
# not printed as it is too long...
# scores(spp.nmds, display = "species")
# scores(spp.nmds, display = "sites")
```
```{r fig-col-c-indianred3-steelblue4}
#| fig-width: 8
#| fig-height: 6
col <- c("indianred3", "steelblue4")
pch <- c(17, 19)
opar <- par()
plt1 <- layout(
rbind(c(1, 1, 2, 2, 3, 3), c(4, 4, 4, 5, 5, 5)),
heights = c(2, 3),
respect = TRUE
)
# layout.show(plt1)
par(mar = c(3, 3, 1, 1))
# plot 1
plot(
mod.spp,
main = NULL,
tck = .05,
mgp = c(1.8, 0.5, 0),
col = col,
pch = pch,
sub = NULL
)
# plot 2
plot(
mod.size,
main = NULL,
tck = .05,
mgp = c(1.8, 0.5, 0),
col = col,
pch = pch,
sub = NULL
)
# plot 3
stressplot(
spp.nmds,
p.col = "steelblue4",
l.col = "indianred3",
tck = .05,
mgp = c(1.8, 0.5, 0)
)
# plot 4
par(mar = c(3, 3, 2, 1))
plot(
spp.nmds,
display = "sites",
type = "n",
main = NULL,
tck = .05,
mgp = c(1.8, 0.5, 0),
xlim = c(-2, 2),
ylim = c(-1, 2)
)
with(
env,
points(spp.nmds, display = "sites", col = col[host_spp], pch = pch[host_spp])
)
with(env, ordispider(spp.nmds, groups = host_spp, label = TRUE, col = col))
with(env, ordiellipse(spp.nmds, groups = host_spp, col = col, label = FALSE))
points(spp.nmds, display = "species", pch = 1, col = "seagreen")
orditorp(spp.nmds, display = "species", cex = 0.8, col = "black", air = 0.01)
# plot 5
par(mar = c(3, 3, 2, 1))
plot(
spp.nmds,
display = "sites",
type = "n",
main = NULL,
tck = .05,
mgp = c(1.8, 0.5, 0),
xlim = c(-2, 2),
ylim = c(-1, 2)
)
with(
env,
points(
spp.nmds,
display = "sites",
col = col[host_size],
pch = pch[host_size]
)
)
with(env, ordispider(spp.nmds, groups = host_size, label = TRUE, col = col))
with(env, ordiellipse(spp.nmds, groups = host_size, col = col, label = FALSE))
points(spp.nmds, display = "species", pch = 1, col = "seagreen")
orditorp(spp.nmds, display = "species", cex = 0.8, col = "black", air = 0.01)
# dev.off()
par(opar)
```
### Reading the Ordination
The two bottom panels are the ordinations to interpret, coloured by host species (left) and by host size (right). The spiders join each sample to its group centroid and the ellipses show each group's spread.
- **The host-species groups overlap almost entirely.** Their centroids sit close together, differing by only about 0.1 on the first axis against a within-group spread of roughly 0.5, and the ellipses are nearly concentric.
- **The adult and juvenile groups overlap to the same degree.** Neither host species nor host size carves out a distinct region of the ordination.
- **The picture agrees with the tests.** Broad overlap is what homogeneous dispersions and a PERMANOVA that detects no difference would predict, so the visual and the numerical results tell one story.
- **Read the broad pattern, not the fine detail.** The stress is about 0.19, in the "so-so" band, so the two-dimensional map is only a moderate summary of the full community structure. The species names placed by `orditorp()` mark roughly where each diatom is most abundant, and their exact positions should not be over-interpreted.
## Multivariate Abundance Using Generalised Linear Models
::: {.callout-note appearance="simple"}
## What manyglm Adds
PERMANOVA asks whether the communities differ overall. It cannot say which taxa are associated with the difference. `manyglm()` fits a separate generalised linear model to each diatom species and tests them together, so it can identify the individual taxa whose abundances differ among groups. The two methods are complementary: first the community question, then the species question.
:::
What follows is an example of *model-based multivariate analysis*, an alternative to the distance-based methods used above. A full treatment is beyond this chapter; the aim here is to show the kind of question the method answers and to reproduce the analysis from @mayombo2019diatoms. For the underlying theory, see @wang2012mvabund and @wang2017mvabund.
```{r code-library-mvabund}
library(mvabund)
diat_spp <- mvabund(spp2)
```
Look at the spread of the data using the boxplot function. The figure is not used in paper:
```{r code-par-mar-c-adjusts}
#| fig-width: 8
#| fig-height: 6
par(mar = c(2, 10, 2, 2)) # adjusts the margins
boxplot(spp, horizontal = TRUE, las = 2, main = "Abundance", col = "indianred")
```
Check the mean-variance relationship:
```{r fig-meanvar-plot-diat-spp}
#| fig-width: 8
meanvar.plot(diat_spp)
```
The above plot shows that spp with a high mean also have a high variance.
1. Are there differences in the species composition of the diatom spp. sampled? *This has already been addressed above, but I can apply an alternative approach below*.
2. Do some of them specialise on particular spp of kelp, while others are more generalised? *Addressed below*.
3. Do some occur more on juveniles, while some are on adults, and which ones indiscriminately live across age classes? *Addressed below*.
4. Which species? *Addressed below*.
Scale manually for `ggplot2()` custom plot. Create a scale function:
```{r fn-log-fun-function-x}
log_fun <- function(x) {
min_x <- min(x[x != 0], na.rm = TRUE)
a <- log(x) / min_x
a[which(!is.finite(a))] <- 0
return(a)
}
```
Make a plot that shows which diatoms species are responsible for differences between adult and juvenile kelps:
```{r fig-spp2}
#| fig-width: 8
#| fig-height: 6
spp2 %>%
mutate(host_size = env$host_size) %>%
gather(key = species, value = abund, -host_size) %>%
as_tibble() %>%
group_by(species) %>%
mutate(log.abund = log_fun(abund)) %>%
ungroup() %>%
ggplot(aes(x = fct_reorder(species, abund, .fun = mean), y = log.abund)) +
geom_boxplot(
aes(colour = host_size),
linewidth = 0.4,
outlier.size = 0,
fill = "grey90"
) +
geom_point(
aes(colour = host_size, shape = host_size),
position = position_dodge2(width = 0.8),
alpha = 0.6,
size = 2.5
) +
scale_colour_manual(name = "Age", values = c("indianred3", "steelblue4")) +
scale_shape_manual(name = "Age", values = c(17, 19)) +
scale_y_continuous(name = "Log abundance") +
coord_flip() +
theme_bw() +
theme(
panel.grid.major = element_line(
linetype = "dashed",
colour = "seagreen3",
linewidth = 0.2
),
panel.grid.minor = element_blank(),
axis.text.x = element_text(
size = 13,
colour = "black",
margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")
),
axis.text.y = element_text(
size = 13,
colour = "black",
face = "italic",
margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")
),
axis.title.x = element_text(size = 14, vjust = 5.75, colour = "black"),
axis.title.y = element_blank(),
axis.ticks.length = unit(-0.25, "cm"),
axis.ticks = element_line(colour = "black", linewidth = 0.5)
)
```
I settle on a negative binomial distribution for the species data. This will be provided to the `manyglm()` function:
```{r fig-size-mod2-manyglm-diat-spp-env}
#| fig-width: 8
#| fig-height: 6
size_mod2 <- manyglm(
diat_spp ~ (env$host_spp * env$host_size) / env$plant,
family = "negative binomial"
)
plot(size_mod2) # better residuals...
```
```{r code-out-anova-size-mod2-p}
# anova(size_mod2, test = "wald")
out <- anova(size_mod2, p.uni = "adjusted", test = "wald")
out$table
```
What is the proportional contribution of some important species to juvenile and adult plants?
```{r code-prop-contrib-data-frame}
prop.contrib <- data.frame(
spp = colnames(out$uni.test),
prop = out$uni.test[3, ],
row.names = NULL
)
prop.contrib %>%
mutate(perc = round((prop / sum(prop)) * 100, 1)) %>%
arrange(desc(perc)) %>%
mutate(cum = cumsum(perc))
```
## References
::: {#refs}
:::