Biostatistics R Exam (Example)

Published

30 May 2025

About the Exam

The Biostatistics Exam will start at 8:30 on 30 May 2025, and you have until 8:30 on 31 May 2025 to complete it. The exam may be written anywhere in the world, and it contributes 70% of the final assessment marks for the Biostatistics component of the module.

Assessment Criteria

Your responses will be evaluated based on the following criteria:

Technical Accuracy (50%)
- Correct application of data analyses and statistical methods. Statistical tests should address the hypotheses within the scope of what was taught in the module. The assessor recognises that students will not have the level of statistical knowledge and experience that a research statistician might have. For example, in places, linear mixed models might be better suited to the questions, but students were taught only the basics of ANOVA and simple linear models (relatively simple designs). When non-parametric alternatives are required, I’ll assign marks to any statement that suggests the correct test to use, but not to the actual execution of those tests.
- Use of appropriate R packages, functions, and syntax (code style and liberal commenting)
- Appropriate choice and justification of techniques, including due consideration for the assumptions of the methods used
- Accurate calculations and results interpretation, down to small details such as how many decimal places to use
Depth of Analysis (20%)
- Comprehensive exploration of the problem
- Insightful interpretation of results
- Consideration of shortfalls in the analysis (due to data limitations, assumptions, etc.), and suggestions for improvement
- Application of out-of-the-box thinking to the problem
Clarity and Communication (20%)
- Logical organisation of ideas, including clear section headings and subheadings
- Clear and concise explanations at each stage of the analysis
- Effective use of publication quality visualisations where appropriate, including all necessary annotations
- Communication of results in a way that is appropriate for a scientific audience (e.g. a journal article)
Critical Thinking Shown in Final Conclusion/Synopsis (10%)
- Discussion of the findings in the context of the problem (add ecological context, etc., as you deem necessary)
- Identification of limitations
- Discussion of assumptions
- Consideration of broader implications

General Notes to Assessor (applies to all tasks):

Heavily penalise untidy formatting, a lack of good document structure according to logical headings and heading hierarchies, and excessive output of long, unnecessary data printouts that serve no purpose (other than the obvious and required use of head(), tail(), glimpse(), and summary()) [-15%].
Answers where the code gives error messages (i.e., it fails to run and provide the required output) get 0 for that question.
Long-form text feedback/answers written within code blocks, which would have been more appropriately placed in the markdown text between code blocks and presented in full sentences, should be penalised [-10% for each question where this occurs].
Text answers written as bullet points that lack detailed explanatory power are penalised [-10%].
Untidy presentation and formatting that fails to resemble my model answers (below) is penalised [-15%].

The marks indicated for each task reflect the relative weight and expected depth of your response. Focus on demonstrating both technical proficiency and conceptual understanding in your answers.

Instructions

This is an open-book assessment.

You must address all tasks in the allocated time of 24 hours. Please submit your answers as a neatly formatted .html document (produced from a Quarto document in RStudio) on the iKamva platform.

Clearly structure the document according to the task numbers, i.e., use appropriately hierarchical headings, subheadings, and sub-subheadings to structure your document logically.

Naming convention: Biostatistics_Prac_Exam_YourSurname.html

Background

These data represent the aerial cover of kelp canopy in South Africa, as measured by Landsat satellites, for the period 1984 to 2024 at a quarterly interval. The intention is to understand the spatio-temporal patterns in kelp canopy cover and to explore how these patterns may be related to coastal sections and biogeographical provinces.

You are provided with three datasets at the Google Drive link emailed to you:

A table of 58 coastal sections (58_sections.csv) that partitions the South African coastline into approximately 50 km intervals. Each section is defined by a single coordinate point (latitude, longitude) representing the boundary of the section.
A table of the biogeographical provinces (bioregions.csv) that the 58 coastal sections fall within. There is one row for each of the 58 sections. For this exercise, the biogeographical classification by Professor John Bolton is of interest.
A netCDF file (kelpCanopyFromLandsat_SouthAfrica_v04.nc) of kelp sampling locations and aerial cover data – these are presented as various variables at grid points across time.

Task 1: Initial Processing

[Task Weight: 10%]
[Components (1) and (2) marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]

You are provided with a NetCDF file that contains satellite-derived measurements of kelp canopy area across the South African coastline from 1984 to 2024, sampled quarterly. Each observation corresponds to a grid cell at a specific time point.

Read the kelp canopy area, time, location (latitude/longitude), and satellite pass data from the NetCDF file. Once unpacked, it contains over 5 million rows. Your processing workflow will include:

extracting data from the netCDF file where area and passes are variables defined over 3D space (longitude, latitude, and time); and
using functions such as tidync::hyper_tibble() or ncdf4::ncvar_get() to read these values.
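As a rough illustration of the reshaping step, the sketch below flattens a toy 3D array into long format using base R only. It assumes the array dimensions are ordered (longitude, latitude, time), as the description above suggests; the real file may order them differently, and every value here is invented rather than taken from the exam data.

```r
# Toy stand-ins for the kind of objects ncdf4::ncvar_get() returns
# (assumed dimension order: longitude, latitude, time; all values invented)
lon  <- c(17.0, 17.5)
lat  <- c(-34.0, -33.5, -33.0)
time <- c(5173, 5265)
area <- array(seq_len(12), dim = c(length(lon), length(lat), length(time)))

# expand.grid() varies its first argument fastest, which matches
# R's column-major storage order for arrays
long <- expand.grid(longitude = lon, latitude = lat, time = time)

# Flattening the array therefore lines up row-for-row with the grid
long$area <- as.vector(area)

head(long)
```

Checking one or two cells of the long table against the original array, as in `area[2, 2, 2]`, is a cheap way to confirm that the dimension order was assumed correctly.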

Restructure the data into a data.table or data.frame:

the data should have six columns: longitude, latitude, year, quarter, area, and passes;
each row should correspond to a unique pixel in space-time (i.e., one location at one time point); and
note that the time variable in the netCDF file is in numeric format (e.g., days since origin, where origin = "1970-01-01"), so you’ll have to convert it to POSIX timestamps using appropriate tools (e.g., as.POSIXct()).
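A minimal sketch of the time conversion, assuming the values really are days since 1970-01-01 (the units attribute of the time variable in the file should be checked first). The vector `time_days` is made-up illustrative data.

```r
# Hypothetical time values, assumed to be days since 1970-01-01 (UTC)
time_days <- c(5173, 5265, 19858)

# as.POSIXct() on a number expects seconds since the origin,
# so days are converted to seconds first
timestamp <- as.POSIXct(time_days * 86400, origin = "1970-01-01", tz = "UTC")

# Derive the calendar year and the quarter (1 to 4) from the timestamps
year    <- as.integer(format(timestamp, "%Y"))
quarter <- (as.integer(format(timestamp, "%m")) - 1L) %/% 3L + 1L

data.frame(time_days, timestamp, year, quarter)
```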

If you are unable to read the NetCDF file, you may request access to a processed version of this file (in long CSV format) from me, but you’ll be penalised by 10% if you do so.

Task 2: Exploratory Data Analysis

[Task Weight: 10%]
[Tasks 2.1, 2.2, and 2.3, each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]

2.1 Weighted Mean Time Series

For each year and quarter combination:

compute the weighted mean of the kelp canopy area across all locations, using the number of satellite passes as weights;
exclude observations where passes = 0 or area is NA; and
plot the resulting time series of weighted mean kelp area, using i) quarters on the x-axis, and ii) a continuous time index from 1984–2024.
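The weighting and exclusion rules above can be sketched in base R as follows. The data frame `kelp` and its values are invented for illustration, and `w_area` and `time_index` are hypothetical column names.

```r
# Toy data standing in for the unpacked kelp table (values are invented)
kelp <- data.frame(
  year    = c(1984, 1984, 1984, 1984, 1985),
  quarter = c(1, 1, 1, 2, 1),
  area    = c(100, 300, NA, 50, 200),
  passes  = c(1, 3, 2, 0, 4)
)

# Drop rows with no satellite passes or a missing area
valid <- kelp[kelp$passes > 0 & !is.na(kelp$area), ]

# Split by year-quarter and compute the pass-weighted mean in each group
groups <- split(valid, list(valid$year, valid$quarter), drop = TRUE)
wmean <- do.call(rbind, lapply(groups, function(g) {
  data.frame(year = g$year[1], quarter = g$quarter[1],
             w_area = weighted.mean(g$area, w = g$passes))
}))
wmean <- wmean[order(wmean$year, wmean$quarter), ]
rownames(wmean) <- NULL

# Continuous time index (fractional year) for the time-series plot
wmean$time_index <- wmean$year + (wmean$quarter - 1) / 4

wmean
```

On a table with several million rows, a grouped operation in data.table or dplyr would likely be faster than `split()`, but the logic is the same.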

Compute the weighted mean area at each unique (longitude, latitude) pixel across time. Then:

select a random sample of 100 pixels;
for each sampled pixel, extract the full time series of weighted mean area;
plot all 100 time series in a single panel (overlaid), using semi-transparent lines; and
label axes appropriately.

2.2 Summary Statistics

Using the weighted data prepared for each year and quarter combination (in the first part of Task 2.1), compute and report summary statistics for the following levels of temporal aggregation:
- by year;
- by quarter;
- by year/quarter combination;

include: weighted mean, median, standard deviation, interquartile range, skewness, and kurtosis; and
comment on the appropriateness of each statistic for these data, and justify your choices in light of the data distribution.
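Base R has no built-in skewness or kurtosis function, so one option is to compute them from central moments. The sketch below does that; `skewness_m` and `kurtosis_m` are hypothetical helper names, the kurtosis is the non-excess form (3 for a normal distribution), and the input vector is invented. Packages that provide these statistics may use slightly different estimators.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2)
skewness_m <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment scaled by variance^2
kurtosis_m <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

x <- c(1, 2, 3, 4, 10)  # invented, right-skewed values

c(median = median(x), sd = sd(x), iqr = IQR(x),
  skewness = skewness_m(x), kurtosis = kurtosis_m(x))
```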

Create visualisations (e.g. boxplots, violin plots, histograms) to support your interpretations.

Based on these, discuss any discernible temporal trends (e.g. decadal increases/decreases) and seasonal patterns (quarterly effects).

2.3 Observation Density Map

Create a map plotting each observed pixel location (defined by longitude × latitude):

colour each pixel by the total number of valid observations (i.e., non-NA values of area) across all time points;
overlay the 58 coastal sections as reference points or lines, numbered from west (1) to east (58); and
use an appropriate geographic projection and include a legend.

Task 3: Inferential Statistics (Part 1)

[Task Weight: 20%]
[Components (1), (2), (3), and (4) each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 20%]

You are now asked to formally test whether the weighted mean kelp canopy area has changed over time, and whether it shows evidence of seasonal variation.

You should:

Formulate and clearly state the null and alternative hypotheses for each of the following:
- a temporal effect (i.e., whether kelp canopy area has changed across the study period); and
- a seasonal effect (i.e., whether kelp canopy area differs between quarters).

Choose and implement a statistical model appropriate to this task.

You may also consider:

whether to model individual observations or to aggregate the data across spatial pixels; and
how to treat missing or zero-valued observations.

The model you choose should reflect your understanding of the data structure and the nature of the questions being asked.

Justify your modelling approach, including:
- why you chose that particular method (rather than alternatives);
- the assumptions involved; and
- how those assumptions might be violated in this dataset.
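One possible starting point, among several defensible ones, is a linear model with a continuous time term and quarter as a categorical predictor, fitted to the aggregated series. The sketch below uses synthetic data with a known slope and a known third-quarter effect, so it demonstrates the mechanics only; it says nothing about the real kelp data, and the column names are assumptions.

```r
# Synthetic quarterly series with a known slope (0.5 per year)
# and a known Q3 effect (+5); no real data are used
set.seed(1)
d <- expand.grid(quarter = 1:4, year = 1984:2024)
d$time <- d$year + (d$quarter - 1) / 4
d$w_area <- 100 + 0.5 * (d$time - 1984) +
  ifelse(d$quarter == 3, 5, 0) + rnorm(nrow(d), sd = 1)

# Linear trend plus quarter as a categorical predictor
fit <- lm(w_area ~ time + factor(quarter), data = d)

anova(fit)          # sequential F-tests for the trend and seasonal terms
coef(summary(fit))  # slope estimate and quarter contrasts relative to Q1

# Quarterly series are often autocorrelated, which violates the
# independence assumption; the residual ACF is one quick check
acf(resid(fit), plot = FALSE)
```

Diagnostic plots of the fitted model (for example `plot(fit)`) would normally accompany this to assess normality and homoscedasticity of the residuals.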

Present and interpret your results as you would in a scientific paper.

Task 4: Assigning Kelp Observations to Coastal Sections

[Task Weight: 20%]
[Tasks 4.1 and 4.2 each marked on a 0–100 scale, then scaled in the proportion 0.7 and 0.3 of the Task Weight of 20%]

Using the data prepared above, your task now is to spatially classify each kelp canopy observation by assigning it to two types of geographic units.

4.1 Assignment to Coastal Sections

You are provided with a table of 58 coastal sections, each defined by a single geographic coordinate (Latitude and Longitude). These points mark successive ~50 km intervals along the South African coastline, numbered from west (1) to east (58).

Assign each kelp canopy observation to the nearest coastal section based on geographic proximity:

use a geodesic (great-circle) distance metric to compute proximity between kelp sampling points and section coordinates (assume all coordinates are in WGS84);
add a new column to your kelp dataset called section_id, indicating the row number (1–58) of the nearest section; and
you may use any R packages or methods you like, but your code should be efficient and well-commented.
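A base R sketch of nearest-section assignment using the haversine formula is given below. It treats the Earth as a sphere of radius 6371 km, which is an approximation to the WGS84 ellipsoid; `haversine_km` is a hypothetical helper, and all coordinates are invented.

```r
# Haversine great-circle distance in km (longitude/latitude in degrees);
# spherical approximation with an assumed mean Earth radius of 6371 km
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Invented coordinates: three "sections" and four unique kelp pixels
sections <- data.frame(lon = c(17.0, 18.4, 19.5),
                       lat = c(-30.0, -34.0, -34.7))
pixels   <- data.frame(lon = c(17.1, 18.3, 18.5, 19.4),
                       lat = c(-30.2, -33.9, -34.2, -34.6))

# Distance matrix: rows are pixels, columns are sections
d <- outer(seq_len(nrow(pixels)), seq_len(nrow(sections)), function(i, j) {
  haversine_km(pixels$lon[i], pixels$lat[i], sections$lon[j], sections$lat[j])
})

# Row number of the nearest section for each pixel
pixels$section_id <- apply(d, 1, which.min)
pixels
```

Computing distances for the unique pixel locations only, and then joining `section_id` back to the full table, avoids repeating the same calculation for every time step. Functions such as `geosphere::distGeo()` offer ellipsoidal distances if the spherical approximation is a concern.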

4.2 Assignment to Biogeographical Provinces

You are also provided with a table that maps each coastal section (1–58) to a biogeographical province, based on a classification by Professor John Bolton.

Using your previous assignment of each kelp observation to a section_id, add a second column called bioregion_id that indicates which biogeographical province the observation falls within.
Your final kelp dataset should contain the following key columns (alongside the original data):
- longitude, latitude
- year, quarter, area, passes
- section_id (integer 1–58)
- bioregion_id (character or factor)
Include your full, annotated R code that performs both spatial assignments in your resulting .html document. Your method should be reproducible, and your code should be easy to follow. Print the head() and tail() of your final dataset, and include a summary() of the data.
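The province lookup amounts to a keyed join from section to province. A minimal sketch with `match()` follows; the lookup table, its column names, and the province labels "A" and "B" are all placeholders, since the actual layout of bioregions.csv is not shown here.

```r
# Invented lookup: one row per section (real column names may differ)
bioregions <- data.frame(section_id   = 1:4,
                         bioregion_id = c("A", "A", "B", "B"),
                         stringsAsFactors = FALSE)

# Invented kelp rows that already carry a section_id
kelp <- data.frame(area = c(10, 20, 30), section_id = c(3L, 1L, 3L))

# match() keeps the row order of kelp, whereas merge() re-sorts by the key
kelp$bioregion_id <- bioregions$bioregion_id[
  match(kelp$section_id, bioregions$section_id)
]

kelp
```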

Task 5: Inferential Statistics (Part 2)

[Task Weight: 30%]
[Tasks 5.1, 5.2, 5.3, 5.4, and 5.5 each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 30%]

You are now asked to evaluate a series of research questions concerning the spatial and temporal structure of kelp canopy area. These questions are to be answered using the kelp dataset that has already been processed to include both section_id and bioregion_id. Use the weighted kelp canopy area (area, weighted by passes) as your response variable throughout – you should have already prepared this dataset in Task 2.

You may use ANOVAs and/or linear models. In each case you must clearly state your hypotheses, justify your choice of model, and interpret your findings both statistically and ecologically.

5.1 Spatial Differences Between Coastal Sections

Question: Is there a statistically significant difference in mean kelp canopy area between coastal sections?

5.2 Spatial Differences Between Biogeographical Provinces

Question: Is there a statistically significant difference in mean kelp canopy area between biogeographical provinces?

5.3 Interaction Between Section and Province

Question: Is there an interaction between coastal section and biogeographical province in explaining variation in kelp canopy area?

5.4 Linear Trend Over Time by Province

Question: Is there a linear trend in kelp canopy area over time, and does the direction or strength of this trend differ between biogeographical provinces?

5.5 Seasonal Variation Across Provinces

Question: Does the seasonal pattern in kelp canopy area differ between provinces?

General Instructions for Task 5 (above)

For each sub-question above:

formally state the null and alternative hypotheses;
justify your choice of model;
justify your choice of predictors;
justify your decision to aggregate or not aggregate the data at various levels;
discuss the assumptions involved and any violations you detect;
present the relevant model outputs and statistical tests;
include visualisations where appropriate (e.g. interaction plots, trend lines, diagnostic plots);
justify your choice of visualisation; and
present the results in a clear and concise manner, including tables and figures where appropriate, in a manner that would be appropriate for a scientific audience (e.g. a journal article).
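For a question about whether a trend differs between groups (as in 5.4), one common approach is to compare an additive model with one that includes a time-by-group interaction. The sketch below uses synthetic data with two made-up provinces, "P1" and "P2", whose slopes are deliberately opposite; it illustrates the comparison only and is not a model answer.

```r
# Synthetic yearly series for two made-up provinces with opposite slopes
set.seed(42)
d <- expand.grid(year = 1984:2024, bioregion_id = c("P1", "P2"),
                 stringsAsFactors = TRUE)
slope <- ifelse(d$bioregion_id == "P1", 1, -1)
d$w_area <- 50 + slope * (d$year - 1984) + rnorm(nrow(d), sd = 2)

# Additive model: a common slope with province-specific intercepts
fit_add <- lm(w_area ~ year + bioregion_id, data = d)

# Interaction model: the slope is allowed to differ between provinces
fit_int <- lm(w_area ~ year * bioregion_id, data = d)

# F-test of whether the interaction term improves the fit
anova(fit_add, fit_int)

# The interaction coefficient estimates the difference in slope (P2 minus P1)
coef(fit_int)
```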

You are not required to use the same modelling approach for all five sub-questions, though consistency across related questions is encouraged.

Task 6: Write-up

[Task Weight: 10%]

Write a short report (maximum 2 pages of text) that synthesises your findings across Tasks 2 through 5. This report should be written in the style of the Discussion section of a scientific paper, intended for an ecological audience.

Your goal is to interpret the major patterns and relationships you have identified, and to comment meaningfully on their ecological significance. Your write-up should include:

Temporal Trends and Seasonality.
Spatial Structure and Biogeography.
Interaction Effects and Spatial–Temporal Coupling.
Limitations and Assumptions.
Ecological Interpretation.

Format and tone:

Aim for clarity and economy of expression.
Don’t generate any new tables or figures. The tables and figures from Tasks 2 through 5 should be sufficient.
Write in complete paragraphs. Avoid bulleted summaries.
Add references to the tables and figures from Tasks 2 through 5 as needed.
Cite any additional references you use.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2025,
  author = {Smit, A. J.},
  title = {Biostatistics {R} {Exam} {(Example)}},
  date = {2025-05-30},
  url = {https://tangledbank.netlify.app/BCB744/assessments/BCB744_Biostats_Prac_Exam_2025.html},
  langid = {en-GB}
}

For attribution, please cite this work as:

Smit AJ (2025) Biostatistics R Exam (Example). https://tangledbank.netlify.app/BCB744/assessments/BCB744_Biostats_Prac_Exam_2025.html.