Datasets

Published

2026/06/14

This page is a reference point for every dataset used in BCB743. First, it explains where the data live and how the code reaches them, because the module relies on three types of sources. Second, it lists and describes the datasets, with links, so you can read the original documentation before you analyse anything. Third, it records which dataset is used where (in which chapter and in which Task) so that when a method needs a worked example you can find the matching data quickly.

Almost every mistake in multivariate ecology begins before any model is fitted: a species matrix is confused with an environmental matrix, sites and species are transposed, or double zeros are left to mislead a Euclidean method. Knowing the shape and origin of a dataset is therefore the first analytical step.
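The double-zero problem can be seen in a minimal sketch using base R only. The matrix below is a made-up toy example, not course data, and `bray()` is a hypothetical helper that computes the Bray-Curtis dissimilarity by hand. Sites 1 and 2 share no species, while sites 1 and 3 share one.

```r
# Toy site-by-species abundance matrix (sites in rows; values are invented).
spp <- rbind(
  site1 = c(sp1 = 1, sp2 = 0, sp3 = 0, sp4 = 0, sp5 = 0),
  site2 = c(sp1 = 0, sp2 = 1, sp3 = 0, sp4 = 0, sp5 = 0),
  site3 = c(sp1 = 5, sp2 = 0, sp3 = 3, sp4 = 4, sp5 = 2)
)

# Euclidean distance: joint absences (sp3 to sp5 in sites 1 and 2) add nothing
# to the distance, so they count as agreement.
euc <- as.matrix(dist(spp, method = "euclidean"))

# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors.
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

bc_12 <- bray(spp["site1", ], spp["site2", ])  # no shared species
bc_13 <- bray(spp["site1", ], spp["site3", ])  # one shared species

print(round(euc, 3))
print(c(site1_site2 = bc_12, site1_site3 = bc_13))
```

In this toy case the Euclidean distance places sites 1 and 2 closest together even though they have no species in common, whereas the Bray-Curtis dissimilarity ranks sites 1 and 3 as the more similar pair.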

How the data are organised

Make the datasets available to your R session in one of three ways.

  1. In-package data ship inside an R package and load with data(). After library(vegan), the call data(dune) places the dune meadow species matrix in your workspace, and data(dune.env) its matching environmental table. No file path is involved. The vegan and ade4 packages supply most of the teaching datasets used here.

  2. Repository data are files stored in the course repository under data/BCB743/. The code reads them with here::here(), which builds a path from the project root so the same line works on any machine:

```r
spp <- read.csv(here::here("data", "BCB743", "seaweed", "SeaweedSpp.csv"))
load(here::here(
  "data",
  "BCB743",
  "NEwR-2ed_code_data",
  "NEwR2-Data",
  "Doubs.RData"
))
```
  3. External data are hosted elsewhere, most often on David Zelený’s Analysis of community ecology data in R site, or at original sources such as the WHO. The download links below point to those origins.
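As a minimal sketch of the first route, the lines below load the dune meadow pair from vegan and check that the species and environmental tables describe the same sites in the same order. The helper `same_sites()` is a hypothetical name introduced here for illustration, and the block assumes that vegan is installed; if it is not, the loading step is skipped.

```r
# Hypothetical helper: TRUE when two tables have identical row names
# in the same order, i.e. they describe the same sites.
same_sites <- function(spp, env) {
  identical(rownames(spp), rownames(env))
}

# Load the in-package data only if vegan is available (assumption: installed).
if (requireNamespace("vegan", quietly = TRUE)) {
  data(dune, package = "vegan")      # species matrix
  data(dune.env, package = "vegan")  # matching environmental table
  print(dim(dune))                   # sites x species
  print(dim(dune.env))               # sites x environmental descriptors
  print(same_sites(dune, dune.env))
}
```

Running a check like this before any analysis is one way to catch a mismatched or reordered environmental table early.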
A note on the download links

The 💾 links point to the file as it sits in the course repository on GitHub. Package datasets (vegan, ade4) are not files you download; you obtain them by installing the package and calling data(). Their links lead to the package help page that documents the variables.

Datasets used in BCB743

The table below is the quick index: each dataset, where it comes from, and every place it is used. Chapter numbers follow the sidebar; Task letters follow the Tasks. A dataset that appears in both a chapter and a Task is doing double duty as a worked example and as an exercise.

| Dataset | Source / package | Used in chapters | Used in Tasks |
|---|---|---|---|
| Doubs River fish (spe, env, spa) | NEwR / repo; also ade4 doubs | 5 Correlations, 7 Ordination, 8a PCA, 9a CA, 9b DCA, 10 PCoA, 11a nMDS, 14 Cluster, 17 Multiple regression, 18 GLM, 19 GAM | A, B, D, E, F, H, I |
| Oribatid mite (mite, mite.env) | vegan; also ade4 oribatid | 7 Ordination, 9a CA, 13b Two Oceans, 15 Model building | G, H, L, M |
| Dune meadow (dune, dune.env) | vegan | 9b DCA | E, G |
| Lichen pastures (varespec, varechem) | vegan | | J |
| Sibbo/Sipoo birds (sipoo, sipoo.map) | vegan | | L |
| Pyrifos mesocosms (pyrifos) | vegan | | N |
| Barro Colorado Island (BCI) | vegan | listed in module reading | |
| Iris flowers (iris) | base datasets | 8b PCA examples, 14 Cluster | |
| WHO SDG indicators | repo WHO/; source WHO | 8c PCA SDG, 14 Cluster | C, I |
| South African seaweeds | repo seaweed/; Smit et al. 2017 | 10 PCoA, 13a db-RDA (+ revised), 13b Two Oceans, 14 Cluster, 15 Model building, 16 Deep dive, 17 Multiple regression | K |
| Mayombo diatoms | repo diatoms/; Mayombo et al. 2019 | 11b nMDS PERMANOVA, 20 Mixed models | |
| Yushan birds (ybirds) | repo + Zelený | | B, D, E |
| Aravo alpine plants (aravo) | ade4; Zelený | | B, D |

R package datasets

vegan

The vegan package is the workhorse of the module, and its built-in datasets supply most of the teaching examples. They are not files; install the package and load each with data(). The links lead to the vegan reference manual, which documents the variables in each table.

| Dataset | Description | Used in BCB743 |
|---|---|---|
| BCI, BCI.env | Barro Colorado Island tree counts: a 50-ha tropical forest plot, 50 sites × 225 species, with site environment | Listed in module reading |
| dune, dune.env | Vegetation and environment in Dutch dune meadows: 20 sites × 30 species, plus five environmental descriptors | 9b; Tasks E, G |
| dune.taxon, dune.phylodis | Taxonomic classification and phylogenetic distances of the dune meadow species | |
| mite, mite.env, mite.pcnm, mite.xy | Oribatid mites from a peat bog: 70 cores × 35 species, with substrate variables and spatial coordinates | 7, 9a, 13b, 15; Tasks G, H, L, M |
| pyrifos | Response of aquatic invertebrates to the insecticide chlorpyrifos: 12 mesocosm ditches sampled on 11 occasions before and after treatment | Task N |
| sipoo, sipoo.map | Birds on the islands of the Sipoo (Sibbo) archipelago: presence/absence by island, with island areas | Task L |
| varespec, varechem | Lichen pastures (reindeer grazing): 24 sites × 44 species cover, plus 14 soil-chemistry variables | Task J |

Tangled Bank repository datasets

These datasets are stored as files in the course repository under data/BCB743/. They include the multivariate teaching sets that are not packaged in R, plus several datasets reserved for the integrative assignment and self-study.

The seaweed data

The seaweed data are the backbone of the constrained-ordination half of the module, and they are described in their own section below. The files live in data/BCB743/seaweed/.

| File | Contents |
|---|---|
| 💾 SeaweedSpp.csv | Species (Y) matrix: presence of macroalgal species in each of 58 coastal sections |
| 💾 SeaweedEnv.RData | Environmental (E) matrix: in situ seawater temperature descriptors |
| 💾 SeaweedEnv.csv | The environmental matrix as CSV |
| 💾 bioregions.csv | Bioregion assignment (BMP, B-ATZ, AMP, ECTZ) for each section |
| 💾 SeaweedSites.csv, 💾 sites.csv | Section coordinates and identifiers |
| 💾 Seaweed_geodist.csv | Geographic (coastal) distances between sections |
| 💾 macroalgae.csv | The full macroalgal records |
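Before analysing a presence matrix such as the species file above, it can help to confirm its shape and coding. The sketch below is illustrative only: `check_presence_matrix()` is a hypothetical helper, and the toy data frame stands in for the real file, whose exact column layout is not documented here.

```r
# Hypothetical helper: summarise a site-by-species table and flag problems.
check_presence_matrix <- function(spp, n_sites = 58) {
  m <- as.matrix(spp)
  list(
    n_sites_ok  = nrow(m) == n_sites,    # one row per coastal section
    is_binary   = all(m %in% c(0, 1)),   # presence/absence coding only
    empty_sites = sum(rowSums(m) == 0),  # sections with no species recorded
    empty_spp   = sum(colSums(m) == 0)   # species never recorded
  )
}

# Toy stand-in for the real file: 3 sections x 4 species (invented values).
toy <- data.frame(
  sp1 = c(1, 0, 1),
  sp2 = c(0, 0, 1),
  sp3 = c(1, 1, 0),
  sp4 = c(0, 0, 0)
)
str(check_presence_matrix(toy, n_sites = 3))
```

With the real data the same call would take the data frame returned by read.csv(); if that file carries a leading index column (an assumption worth checking), it would need to be dropped first.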

Doubs River (Numerical Ecology with R)

The Doubs data are the recurring example for the unconstrained methods. They come bundled with the Numerical Ecology with R (2nd ed.) code, stored under data/BCB743/NEwR-2ed_code_data/. The Doubs.RData file loads three objects: spe (fish abundances), env (environment), and spa (spatial coordinates).

| File | Contents |
|---|---|
| 💾 Doubs.RData | spe, env, spa for the Doubs river |
| 💾 DoubsSpe.csv | Fish species abundances (27 species × 30 sites) |
| 💾 DoubsEnv.csv | Environmental variables (11 variables × 30 sites) |
| 💾 DoubsSpa.csv | Spatial (x, y) coordinates of the sites |

The same folder also ships the NEwR helper functions, including 💾 cleanplot.pca.R used in the PCA chapter.
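The objects in Doubs.RData can be inspected without cluttering the workspace. The following is a minimal sketch: `load_as_list()` is a hypothetical helper, and the relative path assumes the working directory is the project root.

```r
# Hypothetical helper: load an .RData file into its own environment and
# return its objects as a named list, leaving the workspace untouched.
load_as_list <- function(path) {
  e <- new.env()
  load(path, envir = e)
  as.list(e)
}

# Path as given in the text, relative to the project root (assumption).
doubs_path <- file.path(
  "data", "BCB743", "NEwR-2ed_code_data", "NEwR2-Data", "Doubs.RData"
)

if (file.exists(doubs_path)) {
  doubs <- load_as_list(doubs_path)
  print(names(doubs))        # which objects the file provides
  print(lapply(doubs, dim))  # rows and columns of each object
  if (!is.null(doubs$spe)) {
    # Sites with no fish recorded, if any; these matter for some methods.
    print(which(rowSums(doubs$spe) == 0))
  }
}
```

Listing names and dimensions first makes it easy to confirm that sites are in rows and that the species, environmental, and spatial tables have the same number of sites.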

Mayombo diatoms

Serge Mayombo’s diatom data record epiphytic diatom communities on two kelp host species (Mayombo et al. 2019). They are used for the PERMANOVA/nMDS example and again for the mixed-models chapter. Files are in data/BCB743/diatoms/.

| File | Contents |
|---|---|
| 💾 PB_data_matrix_abrev.csv | Diatom species × sample abundance matrix |
| 💾 PB_diat_env.csv | Sample metadata: host species, host size, and so on |

WHO Sustainable Development Goals

The WHO SDG indicators drive the PCA and cluster-analysis examples on socio-economic data. The full collection — dozens of indicator CSVs plus an assembled matrix — lives in data/BCB743/WHO/.

| File | Contents |
|---|---|
| 💾 SDG_complete.csv | The assembled country × indicator matrix used in the analyses |
| 💾 SDG_description.csv | A key linking indicator codes to their descriptions |
| 💾 WHO.zip | The complete set of individual SDG indicator files |

Yushan birds (external example)

The Yushan Mountain bird data (Ding, via Zelený) give the Tasks a second community-ecology example beyond the Doubs. They are supplied both as repository files and on Zelený’s site.

| File | Contents |
|---|---|
| 💾 ybirds_spe.txt | Bird species abundances along the elevation gradient |
| 💾 ybirds_env.txt | Environmental variables for each plot |
| 💾 ybirds.xlsx | The same data as a spreadsheet |

Further repository datasets (assignment and self-study)

These are available in the repository for the integrative assignment and for independent practice; they are not part of a numbered worked example.

| Dataset | Files | Description / source |
|---|---|---|
| Barents Sea fish | 💾 BarentsFish_spp.csv, 💾 BarentsFish_env.csv | Fish catches and environment in the Barents Sea |
| Macintyre kwongan/woodland | 💾 folder | Plant community and environment matrices (Macintyre et al. 2018) |
| Thomsen biodiversity–stress | 💾 folder | Long-term community data (Thomsen, Jørgensen et al. 1992–2009) |
| Data dictionary | 💾 data-dictionary.csv | A key to the variables across the BCB743 datasets |

David Zelený’s example data

David Zelený’s Analysis of community ecology data in R hosts a large, well-documented collection of example datasets. Several of them coincide with the package datasets above (the same Doubs, mite and dune data), and his pages are an excellent companion to the methods chapters. The complete catalogue follows; datasets used in BCB743 are flagged.

Vegetation data

| Dataset | Description | Dimensions | Used in BCB743 |
|---|---|---|---|
| Aravo | Alpine plant communities, Aravo, France (Choler 2005; Dray et al. 2014) | 75 plots × 82 species, 6 environmental variables, 8 traits | Tasks B, D |
| Barley | Barley field weed community (Pyšek & Lepš) | 122 plots × weed species, 3 environmental variables (species count not stated) | |
| BCI | Barro Colorado Island forest permanent plot (Condit et al.) | 50 1-ha plots × tree species, 7 environmental + 13 soil variables | Module reading |
| Bryce Canyon | Bryce Canyon vegetation (Dave Roberts) | 160 plots × 169 species, 11 environmental variables | |
| Carpathian wetlands | Spring-fen wetlands (Hájek, Hekera & Hájková) | 70 plots × species (plants and bryophytes), 15 environmental variables | |
| Danube meadow | Danube floodplain meadow (Ellenberg 1956) | ≈25 plots × 94 species (48 × 171 in the download), plus Ellenberg values | |
| Dune meadow | Dutch dune meadows (Jongman et al. 1995) | 20 plots × 30 species, 5 environmental variables | 9b; Tasks E, G |
| Taiwan MQU forests | Forest plots along a Taiwan elevation gradient (Li & Zelený) | 9 plots × 89 woody species, 2 environmental variables, 5 leaf traits | |
| Gentry’s transects | Gentry’s global forest transects | 197 localities × species, with latitude, longitude, elevation, precipitation | |
| Lalashan transect | Forest vegetation plots (Zelený et al.) | 18 plots × woody species, ~40 environmental variables, ~16 traits | |
| Nanjenshan | Nanjenshan forest dynamics plot (Sun & Hsieh) | Page restricted; dimensions not stated | |
| Němčičky | Forest understory permanent plot (Chudomelová et al.) | 97 samples × 274 species, plus Ellenberg values | |
| Ohrazení | Wet-meadow experiment (Jan Lepš) | 96 samples × 86 species, with experimental factors | |
| Seedlings (removal) | Randomised-block seedling experiment (Špačková & Lepš) | 16 plots (4 blocks × 4 treatments) × seedling species | |
| Taiwan GBIF | Taiwan GBIF records (Liao & Chen) | Page restricted; dimensions not stated | |
| Taiwan 1-ha plots | Seven 1-ha forest plots (Zelený & Li) | Page restricted; 7 plots, further dimensions not stated | |
| Třebíč grasslands | Dry grasslands (David Zelený) | 48 samples × 171 species, ~18 environmental variables | |
| Vltava valley | Vltava river valley vegetation (David Zelený) | 97 plots × 274 species, ~28 environmental variables, plus traits | |

Zoological data

| Dataset | Description | Dimensions | Used in BCB743 |
|---|---|---|---|
| Yushan birds | Bird communities along an elevation gradient, Taiwan (Ding) | 50 sites × 59 bird species, 20 environmental variables, plus traits | Tasks B, D, E |
| Carabids (Finland) | Boreal-forest carabid beetles (Niemelä et al.) | Species × 5 habitat types (counts not stated) | |
| Carabids (Canada) | Carabid beetles from Canada (Bergeron, Blanchet) | 194 sites × species, with vegetation-structure predictors | |
| Coral reefs | Indonesian coral-reef data (Warwick, Anderson) | 10 transects over 6 years × 75 coral species | |
| Doubs fish | Doubs river fish (Verneaux) | 30 sites × 27 fish species, 11 environmental variables, plus coordinates | Extensively (5–19); Tasks A, B, D, E, F, H, I |
| Spring-fen molluscs | Molluscs, vegetation and water chemistry (Horsák & Hájek) | 43 localities × species, 14 water-chemistry variables | |
| Oribatid mites | Oribatid mites (Borcard et al.) | 70 cores × 35 morphospecies, 5 environmental variables, plus coordinates | 7, 9a, 13b, 15; Tasks G, H, L, M |

Simulated data

| Dataset | Description | Dimensions |
|---|---|---|
| Simulated data | Simulated community along a gradient (Minchin 1987; Fridley et al. 2007) | Up to 500 samples × ~280–296 species, 1–2 gradients (several variants) |
| Spatial simulated data | Spatially structured simulated community (Smith & Lundholm 2010) | 144 habitats × 50 species, elevation plus coordinates |

Other

| Dataset | Description | Dimensions |
|---|---|---|
| Cookies / pastries / pizza | A light teaching example (everest4ever) | 1931 recipes × 133 ingredients (presence/absence), 3 food types |
| Normal pig | A teaching curiosity | Stub page; dimensions not stated |
| Morse codes | Rothkopf’s Morse-code confusion experiment | 36 × 36 confusion matrix, plus 2 code attributes |
| What’s Cooking? | Recipe-ingredient data (Kaggle) | ~40,000 recipes × ~6,700 ingredients, 20 cuisines |

The seaweed data

Because the seaweed data anchor the constrained-ordination chapters, they deserve a fuller account.

The analyses rest on two matrices. The first, Y, holds distribution records of 846 macroalgal species across 58 coastal sections, each 50 km long, spanning the South African coast. This represents about 90% of the known seaweed flora of the country, assembled from verifiable literature and from John Bolton and Rob Anderson’s own collections over three decades (Bolton 1986; Stegenga et al. 1997; Bolton and Stegenga 2002; De Clerck et al. 2005). The second, E, is a dataset of in situ coastal seawater temperatures (Smit et al. 2013), derived from daily measurements over up to 40 years.

Four bioregions structure the coast (Bolton and Anderson 2004): the Benguela Marine Province (BMP, sections 1–17), the Benguela–Agulhas Transition Zone (B-ATZ, 18–22), the Agulhas Marine Province (AMP, 19–43/44), and the East Coast Transition Zone (ECTZ, 44/45–58). The plotting functions partition and colour-code the data by bioregion so that regional patterns in β-diversity become visible.

The data and the full analysis are described in Smit et al. (2017) and worked through in the Two Oceans appendices. Further background is given in a description of the seaweed data, and the complete code and data are also on the Seaweed-beta GitHub repository.

References

Bolton JJ (1986) Marine phytogeography of the Benguela upwelling region on the west coast of southern Africa: A temperature dependent approach. Botanica Marina 29:251–256.
Bolton JJ, Anderson RJ (2004) Marine vegetation. In: Cowling RM, Richardson DM, Pierce SM (eds) Vegetation of Southern Africa. Cambridge University Press, pp 348–370.
Bolton JJ, Stegenga H (2002) Seaweed species diversity in South Africa. South African Journal of Marine Science 24:9–18.
De Clerck O, Bolton JJ, Anderson RJ, Coppejans E (2005) Guide to the seaweeds of KwaZulu-Natal. Scripta Botanica Belgica 33:294 pp.
Smit AJ, Roberts M, Anderson RJ, Dufois F, Dudley SFJ, Bornman TG, Olbers J, Bolton JJ (2013) A coastal seawater temperature dataset for biogeographical studies: large biases between in situ and remotely-sensed data sets around the coast of South Africa. PLOS ONE 8:e81944.
Smit AJ, Bolton JJ, Anderson RJ (2017) Seaweeds in two oceans: Beta-diversity. Frontiers in Marine Science 4:404.
Stegenga H, Bolton JJ, Anderson RJ (1997) Seaweeds of the South African west coast. Contributions of the Bolus Herbarium 18:3–637.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {Datasets},
  date = {2026-06-14},
  url = {https://tangledbank.netlify.app/BCB743/datasets.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) Datasets. https://tangledbank.netlify.app/BCB743/datasets.html.