Datasets
This page is a reference point for every dataset used in BCB743. First, it explains where the data live and how the code reaches them, because the module relies on three types of sources. Second, it lists and describes the datasets, with links, so you can read the original documentation before you analyse anything. Third, it records which dataset is used where (in which chapter and in which Task) so that when a method needs a worked example you can find the matching data quickly.
Almost every mistake in multivariate ecology begins before any model is fitted: a species matrix is confused with an environmental matrix, sites and samples are transposed, or double zeros are left to mislead a Euclidean method. Knowing the shape and origin of a dataset is therefore the first analytical step.
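A few of these shape checks can be automated before any analysis starts. The sketch below is only illustrative: the function name check_site_tables() and the toy values are hypothetical, and it assumes both tables are data frames with sites in rows and matching row names.

```r
# Toy data (hypothetical values): 3 sites in rows, 4 species in columns
spe <- data.frame(
  sp1 = c(3, 0, 1), sp2 = c(0, 0, 4), sp3 = c(2, 5, 0), sp4 = c(0, 1, 0),
  row.names = c("site1", "site2", "site3")
)
env <- data.frame(
  depth = c(1.2, 3.4, 2.2), pH = c(7.1, 6.8, 7.4),
  row.names = c("site1", "site2", "site3")
)

# Hypothetical helper: returns a character vector of problems (empty if none)
check_site_tables <- function(spe, env) {
  spe <- as.data.frame(spe)
  env <- as.data.frame(env)
  problems <- character(0)

  # Sites must be the rows of both tables, in the same order
  if (nrow(spe) != nrow(env)) {
    problems <- c(problems, "species and environment tables have different numbers of rows")
  } else if (!identical(rownames(spe), rownames(env))) {
    problems <- c(problems, "site labels differ or are in a different order")
  }

  # A species table should hold non-negative numbers only
  numeric_cols <- vapply(spe, is.numeric, logical(1))
  if (!all(numeric_cols)) {
    problems <- c(problems, "species table contains non-numeric columns")
  } else if (any(spe < 0, na.rm = TRUE)) {
    problems <- c(problems, "species table contains negative values")
  }

  problems
}

check_site_tables(spe, env)                    # no problems expected
check_site_tables(as.data.frame(t(spe)), env)  # transposed table is flagged
```

A check like this does not detect double zeros; that choice belongs to the dissimilarity measure or transformation selected later.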
How the data are organised
Make the datasets available to your R session in one of three ways.
- In-package data ship inside an R package and load with data(). After library(vegan), the call data(dune) places the dune meadow species matrix in your workspace, and data(dune.env) its matching environmental table. No file path is involved. The vegan and ade4 packages supply most of the teaching datasets used here.
- Repository data are files stored in the course repository under data/BCB743/. The code reads them with here::here(), which builds a path from the project root so the same line works on any machine.
- External data are hosted elsewhere, most often on David Zelený’s Analysis of community ecology data in R site, or at original sources such as the WHO. The download links below point to those origins.
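The first two routes can be sketched as follows. This is a minimal example, assuming the vegan and here packages are installed and that the course repository is your project root; the file path uses the seaweed species file listed further down this page, and the read is skipped if the file is not found.

```r
# Route 1: in-package data (no file path involved)
library(vegan)
data(dune)      # dune meadow species matrix
data(dune.env)  # matching environmental table
dim(dune)
dim(dune.env)

# Route 2: repository data, with the path built from the project root.
# Guarded so the sketch still runs where the here package or the
# course repository is not available.
if (requireNamespace("here", quietly = TRUE)) {
  spp_path <- here::here("data", "BCB743", "seaweed", "SeaweedSpp.csv")
  if (file.exists(spp_path)) {
    spp <- read.csv(spp_path)
    print(dim(spp))
  } else {
    message("File not found; is the course repository the project root?")
  }
}
```

Because here::here() resolves the path relative to the project root rather than the current working directory, the same call should work from a script, a Quarto document, or the console.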
Datasets used in BCB743
The table below is the quick index: each dataset, where it comes from, and every place it is used. Chapter numbers follow the sidebar; Task letters follow the Tasks. A dataset that appears in both a chapter and a Task is doing double duty as a worked example and as an exercise.
| Dataset | Source / package | Used in chapters | Used in Tasks |
|---|---|---|---|
| Doubs River fish (spe, env, spa) | NEwR / repo; also ade4 doubs | 5 Correlations, 7 Ordination, 8a PCA, 9a CA, 9b DCA, 10 PCoA, 11a nMDS, 14 Cluster, 17 Multiple regression, 18 GLM, 19 GAM | A, B, D, E, F, H, I |
| Oribatid mite (mite, mite.env) | vegan; also ade4 oribatid | 7 Ordination, 9a CA, 13b Two Oceans, 15 Model building | G, H, L, M |
| Dune meadow (dune, dune.env) | vegan | 9b DCA | E, G |
| Lichen pastures (varespec, varechem) | vegan | — | J |
| Sibbo/Sipoo birds (sipoo, sipoo.map) | vegan | — | L |
| Pyrifos mesocosms (pyrifos) | vegan | — | N |
| Barro Colorado Island (BCI) | vegan | listed in module reading | — |
| Iris flowers (iris) | base datasets | 8b PCA examples, 14 Cluster | — |
| WHO SDG indicators | repo WHO/; source WHO | 8c PCA SDG, 14 Cluster | C, I |
| South African seaweeds | repo seaweed/; Smit et al. 2017 | 10 PCoA, 13a db-RDA (+ revised), 13b Two Oceans, 14 Cluster, 15 Model building, 16 Deep dive, 17 Multiple regression | K |
| Mayombo diatoms | repo diatoms/; Mayombo et al. 2019 | 11b nMDS PERMANOVA, 20 Mixed models | — |
| Yushan birds (ybirds) | repo + Zelený | — | B, D, E |
| Aravo alpine plants (aravo) | ade4; Zelený | — | B, D |
R package datasets
vegan
The vegan package is the workhorse of the module, and its built-in datasets supply most of the teaching examples. They are not files; install the package and load each with data(). The links lead to the vegan reference manual, which documents the variables in each table.
| Dataset | Description | Used in BCB743 |
|---|---|---|
| BCI, BCI.env | Barro Colorado Island tree counts: a 50-ha tropical forest plot, 50 sites × 225 species, with site environment | Listed in module reading |
| dune, dune.env | Vegetation and environment in Dutch dune meadows: 20 sites × 30 species, plus five environmental descriptors | 9b; Tasks E, G |
| dune.taxon, dune.phylodis | Taxonomic classification and phylogeny of the dune meadow species | — |
| mite, mite.env, mite.pcnm, mite.xy | Oribatid mites from a peat bog: 70 cores × 35 species, with substrate variables and spatial coordinates | 7, 9a, 13b, 15; Tasks G, H, L, M |
| pyrifos | Response of aquatic invertebrates to the insecticide chlorpyrifos: 12 mesocosm ditches, each sampled on 11 occasions | Task N |
| sipoo, sipoo.map | Birds on the islands of the Sipoo (Sibbo) archipelago: presence/absence by island, with island areas | Task L |
| varespec, varechem | Lichen pastures (reindeer grazing): 24 sites × 44 species cover, plus 14 soil-chemistry variables | Task J |
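Before using any of these, it is worth confirming that what loads matches the dimensions listed. The following is a small sketch, assuming vegan is installed; the object names dims and vegan_items are only illustrative.

```r
library(vegan)

# List the datasets that the installed vegan version provides
vegan_items <- data(package = "vegan")$results[, "Item"]
head(vegan_items)

# Load a few of the tables and collect their dimensions
data(BCI)
data(mite)
data(varespec)
data(varechem)

dims <- rbind(
  BCI      = dim(BCI),
  mite     = dim(mite),
  varespec = dim(varespec),
  varechem = dim(varechem)
)
colnames(dims) <- c("rows", "columns")
dims  # compare with the site and species counts in the table
```

Rows are sites (or cores) and columns are species or variables, which is the orientation the vegan functions expect.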
Tangled Bank repository datasets
These datasets are stored as files in the course repository under data/BCB743/. They include the multivariate teaching sets that are not packaged in R, plus several datasets reserved for the integrative assignment and self-study.
The seaweed data
The seaweed data are the backbone of the constrained-ordination half of the module, and they are described in their own section below. The files live in data/BCB743/seaweed/.
| File | Contents |
|---|---|
| 💾 SeaweedSpp.csv | Species (Y) matrix: presence of macroalgal species in each of 58 coastal sections |
| 💾 SeaweedEnv.RData | Environmental (E) matrix: in situ seawater temperature descriptors |
| 💾 SeaweedEnv.csv | The environmental matrix as CSV |
| 💾 bioregions.csv | Bioregion assignment (BMP, B-ATZ, AMP, ECTZ) for each section |
| 💾 SeaweedSites.csv, 💾 sites.csv | Section coordinates and identifiers |
| 💾 Seaweed_geodist.csv | Geographic (coastal) distances between sections |
| 💾 macroalgae.csv | The full macroalgal records |
Doubs River (Numerical Ecology with R)
The Doubs data are the recurring example for the unconstrained methods. They come bundled with the Numerical Ecology with R (2nd ed.) code, stored under data/BCB743/NEwR-2ed_code_data/. The Doubs.RData file loads three objects: spe (fish abundances), env (environment), and spa (spatial coordinates).
| File | Contents |
|---|---|
| 💾 Doubs.RData | spe, env, spa for the Doubs river |
| 💾 DoubsSpe.csv | Fish species abundances (27 species × 30 sites) |
| 💾 DoubsEnv.csv | Environmental variables (11 variables × 30 sites) |
| 💾 DoubsSpa.csv | Spatial (x, y) coordinates of the sites |
The same folder also ships the NEwR helper functions, including 💾 cleanplot.pca.R used in the PCA chapter.
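Because load() places objects straight into the workspace and silently replaces any existing objects with the same names, it can help to load an .RData file into its own environment first and inspect what arrives. The sketch below is one way to do that; the helper load_rdata() is hypothetical, and the demonstration uses a small stand-in file with made-up values rather than the real Doubs data.

```r
# Hypothetical helper: load an .RData file into a private environment
# and return its objects as a named list
load_rdata <- function(path) {
  e <- new.env()
  loaded <- load(path, envir = e)  # load() returns the object names
  mget(loaded, envir = e)
}

# Stand-in file with three toy objects named like those in Doubs.RData
tmp <- tempfile(fileext = ".RData")
spe <- data.frame(sp1 = c(0, 2), sp2 = c(1, 0))
env <- data.frame(alt = c(900, 850))
spa <- data.frame(x = c(1, 2), y = c(3, 4))
save(spe, env, spa, file = tmp)

objs <- load_rdata(tmp)
names(objs)
lapply(objs, dim)

# With the course repository as project root, the real file would be read
# the same way (guarded, since the file may not be present):
if (requireNamespace("here", quietly = TRUE)) {
  doubs_path <- here::here("data", "BCB743", "NEwR-2ed_code_data", "Doubs.RData")
  if (file.exists(doubs_path)) {
    doubs <- load_rdata(doubs_path)
    print(names(doubs))
  }
}
```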
Mayombo diatoms
Serge Mayombo’s diatom data record epiphytic diatom communities on two kelp host species (Mayombo et al. 2019). They are used for the PERMANOVA/nMDS example and again for the mixed-models chapter. Files are in data/BCB743/diatoms/.
| File | Contents |
|---|---|
| 💾 PB_data_matrix_abrev.csv | Diatom species × sample abundance matrix |
| 💾 PB_diat_env.csv | Sample metadata: host species, host size, and so on |
WHO Sustainable Development Goals
The WHO SDG indicators drive the PCA and cluster-analysis examples on socio-economic data. The full collection — dozens of indicator CSVs plus an assembled matrix — lives in data/BCB743/WHO/.
| File | Contents |
|---|---|
| 💾 SDG_complete.csv | The assembled country × indicator matrix used in the analyses |
| 💾 SDG_description.csv | A key linking indicator codes to their descriptions |
| 💾 WHO.zip | The complete set of individual SDG indicator files |
Yushan birds (external example)
The Yushan Mountain bird data (Ding, via Zelený) give the Tasks a second community-ecology example beyond the Doubs. They are supplied both as repository files and on Zelený’s site.
| File | Contents |
|---|---|
| 💾 ybirds_spe.txt | Bird species abundances along the elevation gradient |
| 💾 ybirds_env.txt | Environmental variables for each plot |
| 💾 ybirds.xlsx | The same data as a spreadsheet |
Further repository datasets (assignment and self-study)
These are available in the repository for the integrative assignment and for independent practice; they are not part of a numbered worked example.
| Dataset | Files | Description and source |
|---|---|---|
| Barents Sea fish | 💾 BarentsFish_spp.csv, 💾 BarentsFish_env.csv | Fish catches and environment in the Barents Sea |
| Macintyre kwongan/woodland | 💾 folder | Plant community and environment matrices (Macintyre et al. 2018) |
| Thomsen biodiversity–stress | 💾 folder | Long-term community data (Thomsen, Jørgensen et al. 1992–2009) |
| Data dictionary | 💾 data-dictionary.csv | A key to the variables across the BCB743 datasets |
David Zelený’s example data
David Zelený’s Analysis of community ecology data in R hosts a large, well-documented collection of example datasets. Several of them coincide with the package datasets above (the same Doubs, mite and dune data), and his pages are an excellent companion to the methods chapters. The complete catalogue follows; datasets used in BCB743 are flagged.
Vegetation data
| Dataset | Description | Dimensions | Used in BCB743 |
|---|---|---|---|
| Aravo | Alpine plant communities, Aravo, France (Choler 2005; Dray et al. 2014) | 75 plots × 82 species, 6 environmental variables, 8 traits | Tasks B, D |
| Barley | Barley field weed community (Pyšek & Lepš) | 122 plots × weed species, 3 environmental variables (species count not stated) | — |
| BCI | Barro Colorado Island forest permanent plot (Condit et al.) | 50 1-ha plots × tree species, 7 environmental + 13 soil variables | Module reading |
| Bryce Canyon | Bryce Canyon vegetation (Dave Roberts) | 160 plots × 169 species, 11 environmental variables | — |
| Carpathian wetlands | Spring-fen wetlands (Hájek, Hekera & Hájková) | 70 plots × species (plants and bryophytes), 15 environmental variables | — |
| Danube meadow | Danube floodplain meadow (Ellenberg 1956) | ≈25 plots × 94 species (48 × 171 in the download), plus Ellenberg values | — |
| Dune meadow | Dutch dune meadows (Jongman et al. 1995) | 20 plots × 30 species, 5 environmental variables | 9b; Tasks E, G |
| Taiwan MQU forests | Forest plots along a Taiwan elevation gradient (Li & Zelený) | 9 plots × 89 woody species, 2 environmental variables, 5 leaf traits | — |
| Gentry’s transects | Gentry’s global forest transects | 197 localities × species, with latitude, longitude, elevation, precipitation | — |
| Lalashan transect | Forest vegetation plots (Zelený et al.) | 18 plots × woody species, ~40 environmental variables, ~16 traits | — |
| Nanjenshan | Nanjenshan forest dynamics plot (Sun & Hsieh) | Page restricted; dimensions not stated | — |
| Němčičky | Forest understory permanent plot (Chudomelová et al.) | 97 samples × 274 species, plus Ellenberg values | — |
| Ohrazení | Wet-meadow experiment (Jan Lepš) | 96 samples × 86 species, with experimental factors | — |
| Seedlings (removal) | Randomised-block seedling experiment (Špačková & Lepš) | 16 plots (4 blocks × 4 treatments) × seedling species | — |
| Taiwan GBIF | Taiwan GBIF records (Liao & Chen) | Page restricted; dimensions not stated | — |
| Taiwan 1-ha plots | Seven 1-ha forest plots (Zelený & Li) | Page restricted; 7 plots, further dimensions not stated | — |
| Třebíč grasslands | Dry grasslands (David Zelený) | 48 samples × 171 species, ~18 environmental variables | — |
| Vltava valley | Vltava river valley vegetation (David Zelený) | 97 plots × 274 species, ~28 environmental variables, plus traits | — |
Zoological data
| Dataset | Description | Dimensions | Used in BCB743 |
|---|---|---|---|
| Yushan birds | Bird communities along an elevation gradient, Taiwan (Ding) | 50 sites × 59 bird species, 20 environmental variables, plus traits | Tasks B, D, E |
| Carabids (Finland) | Boreal-forest carabid beetles (Niemelä et al.) | Species × 5 habitat types (counts not stated) | — |
| Carabids (Canada) | Carabid beetles from Canada (Bergeron, Blanchet) | 194 sites × species, with vegetation-structure predictors | — |
| Coral reefs | Indonesian coral-reef data (Warwick, Anderson) | 10 transects over 6 years × 75 coral species | — |
| Doubs fish | Doubs river fish (Verneaux) | 30 sites × 27 fish species, 11 environmental variables, plus coordinates | Extensively (5–19); Tasks A, B, D, E, F, H, I |
| Spring-fen molluscs | Molluscs, vegetation and water chemistry (Horsák & Hájek) | 43 localities × species, 14 water-chemistry variables | — |
| Oribatid mites | Oribatid mites (Borcard et al.) | 70 cores × 35 morphospecies, 5 environmental variables, plus coordinates | 7, 9a, 13b, 15; Tasks G, H, L, M |
Simulated data
| Dataset | Description | Dimensions |
|---|---|---|
| Simulated data | Simulated community along a gradient (Minchin 1987; Fridley et al. 2007) | Up to 500 samples × ~280–296 species, 1–2 gradients (several variants) |
| Spatial simulated data | Spatially structured simulated community (Smith & Lundholm 2010) | 144 habitats × 50 species, elevation plus coordinates |
Other
| Dataset | Description | Dimensions |
|---|---|---|
| Cookies / pastries / pizza | A light teaching example (everest4ever) | 1931 recipes × 133 ingredients (presence/absence), 3 food types |
| Normal pig | A teaching curiosity | Stub page; dimensions not stated |
| Morse codes | Rothkopf’s Morse-code confusion experiment | 36 × 36 confusion matrix, plus 2 code attributes |
| What’s Cooking? | Recipe-ingredient data (Kaggle) | ~40,000 recipes × ~6,700 ingredients, 20 cuisines |
The seaweed data
Because the seaweed data anchor the constrained-ordination chapters, they deserve a fuller account.
The analyses rest on two matrices. The first, Y, holds distribution records of 846 macroalgal species across 58 coastal sections, each 50 km long, spanning the South African coast. This represents about 90% of the known seaweed flora of the country, assembled from verifiable literature and from John Bolton and Rob Anderson’s own collections over three decades (Bolton 1986; Stegenga et al. 1997; Bolton and Stegenga 2002; De Clerck et al. 2005). The second, E, is a dataset of in situ coastal seawater temperatures (Smit et al. 2013), derived from daily measurements over up to 40 years.
Four bioregions structure the coast (Bolton and Anderson 2004): the Benguela Marine Province (BMP, sections 1–17), the Benguela–Agulhas Transition Zone (B-ATZ, 18–22), the Agulhas Marine Province (AMP, 19–43/44), and the East Coast Transition Zone (ECTZ, 44/45–58). The plotting functions partition and colour-code the data by bioregion so that regional patterns in β-diversity become visible.
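Since Y records presence rather than abundance, β-diversity between sections is computed from a dissimilarity suited to binary data. As an illustration only, the sketch below calculates the Sørensen dissimilarity on a toy presence/absence matrix in base R; the choice of Sørensen here is an assumption for the example, not a statement of what each chapter uses, and the values are made up.

```r
# Toy presence/absence matrix: 3 sections (rows) x 5 species (columns)
Y <- matrix(
  c(1, 1, 1, 0, 0,
    1, 1, 0, 1, 0,
    0, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("section", 1:3), paste0("sp", 1:5))
)

shared   <- Y %*% t(Y)   # number of species shared by each pair of sections
richness <- rowSums(Y)   # number of species present in each section

# Sørensen dissimilarity: 1 - 2a / (n_i + n_j),
# where a is the shared count and n_i, n_j are the two richness values
sorensen <- 1 - 2 * shared / outer(richness, richness, "+")
round(sorensen, 2)
```

Sections 1 and 3 share no species, so their dissimilarity is 1. With vegan loaded, vegdist(Y, method = "bray", binary = TRUE) should give the same values as a dist object.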
The data and the full analysis are described in Smit et al. (2017) and worked through in the Two Oceans appendices; a separate description of the seaweed data gives further background. The complete code and data are also on the Seaweed-beta GitHub repository.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {Datasets},
date = {2026-06-14},
url = {https://tangledbank.netlify.app/BCB743/datasets.html},
langid = {en}
}
