BCB744 Final Exam

Published

26 April 2026

Exam Guidelines

Date and Exam Duration

The Biostatistics Exam will start at 08:00 on 26 April 2026, and you have until 14:00 on 27 April 2026 to complete it. The exam may be conducted anywhere in the world, and it will contribute 40% of the final assessment mark for the BCB744 module.

Student Guidelines

  • Convert Quarto to HTML: Submit your assignment as an HTML file, derived from a Quarto document. Ensure your submission is a thoroughly annotated report, complete with meta-information (name, date, purpose, etc.) at the beginning. Each section/test should be accompanied by detailed explanations of its purpose.

  • Required YAML Structure: Your Quarto document must begin with the following YAML header (fill in your own details where indicated):

    ---
    title: "BCB744 Biostatistics Exam"
    author: "Your Name"
    date: "2026-04-26"
    format:
      html:
        embed-resources: true
        toc: false
        number-sections: false
    ---

    Your document should follow this hierarchical structure:

    # Part A
    
    ## Question 1
    
    ### Preamble
    ### Introduction
    ### Methods
    ### Results
    ### Discussion
    
    ## Question 2
    
    ### Preamble
    ### Introduction
    ### Methods
    ### Results
    ### Discussion
    
    [... continue for remaining questions ...]
    
    # Part B
    
    ## Task 1
    
    ## Task 2
    ### 2.1
    ### 2.2
    ### 2.3
    
    [... continue for remaining tasks ...]
    
    # References
  • Testing Assumptions: For all questions necessitating formal inferential statistics, conduct and document the appropriate preliminary tests to check statistical assumptions. This includes stating the assumptions, detailing the procedures for testing these assumptions, and specifying the null hypotheses (\(H_{0}\)). If assumptions are tested graphically, elucidate the rationale behind the graphical method. Discuss the outcomes of these assumption tests and provide a rationale for the chosen inferential statistical tests (e.g., t-test, ANOVA).

  • State Hypotheses: When inferential statistics are employed, clearly articulate the null (\(H_{0}\)) and alternative (\(H_{A}\)) hypotheses. Later, in the Results section, remember to state whether the \(H_{0}\) is rejected or not rejected.

  • Graphical Support: Support all descriptive and inferential statistical analyses with appropriate graphical representations of the data.

  • Presentation Format: Structure each answer as a concise mini-paper, including the sections Introduction, Methods, Results, Discussion, and References. Though each answer is expected to span 2-3 pages, there are no strict page limits. [Does not apply to questions marked with an *]

    • Incorporate a Preamble section before the Introduction to detail preliminary analyses, figures, tables, and other relevant background information that doesn’t fit into the main narrative of your paper. This section provides insight into the preparatory work and will not be considered part of the main evaluation.

    • The Introduction should set the stage by offering background information, establishing the relevance of the study, and clearly stating the research question or hypothesis.

    • The Methods section must specify the statistical methodologies applied, including how assumptions were tested and any additional data analyses performed. Emphasise the inferential statistics without delving into exploratory data analysis (EDA).

    • In the Results section, focus solely on the findings pertinent to the hypotheses introduced in the Introduction. While assumption tests are part of the statistical analysis, they need not be highlighted in this section as that is what the ‘Preamble’ section is for. Ensure that figure and/or table captions are informative and self-explanatory.

    • The Discussion section is for interpreting the results, considering their significance, limitations, and implications, and suggesting avenues for future research. You may reference up to five pertinent studies in the Methods and Discussion sections.

    • End with a consolidated References section, listing all sources cited across the questions.

  • Formatting: Presentation matters. Marks are allocated for the visual quality of the submission. This includes the neatness of the document, proper use of headings, and adherence to coding conventions (e.g., spacing).
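
As a minimal illustration of the assumption checks described under "Testing Assumptions" above, the workflow might be sketched as follows (using the built-in iris data purely as a stand-in; your exam datasets will differ):

```r
library(car)      # for leveneTest()
library(ggplot2)

# Two groups to compare (illustrative stand-in data)
dat <- subset(iris, Species %in% c("setosa", "versicolor"))
dat$Species <- droplevels(dat$Species)

# H0: the data within each group are normally distributed
tapply(dat$Sepal.Length, dat$Species, shapiro.test)

# H0: the variances are equal across groups (homoscedasticity)
leveneTest(Sepal.Length ~ Species, data = dat)

# Graphical alternative: Q-Q plots, one panel per group
ggplot(dat, aes(sample = Sepal.Length)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ Species)
```

The outcomes of these checks then motivate the choice between, for example, a t-test and its non-parametric counterpart.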

Assessor Guidelines (applies to all questions/tasks)

  • Assess good document structure according to logical headings and heading hierarchies.
  • Heavily penalise untidy formatting and excessive output of long, unnecessary data printouts (other than the obvious and required use of head(), tail(), glimpse(), and summary()) that serve no purpose [-15%].
  • Answers where the code produces error messages (i.e., it fails to run and provide the required output) receive 0 for that question.
  • Penalise answers where long-form text feedback/answers are written within code blocks when they should have been placed in the markdown text between code blocks and presented in full sentences [-10% for each question where this occurs].
  • Text answers written as bullet points that lack detailed explanatory power are penalised [-10%].
  • Untidy presentation and formatting that fails to resemble my model answers (below) is penalised [-15%].

Exam Structure

The exam comprises two parts, Part A and Part B, which contribute 0.35 and 0.65 of the exam mark, respectively.

Data Access

Some datasets are included with R packages, and the remaining data files may be found at the Google Drive link.

Part A (0.35)

Question 1: Effects of Mercury-Contaminated Fish Consumption on Chromosomes

Dataset Overview

The dataset mercuryfish, available in the R package coin, comprises measurements of mercury levels in blood, and proportions of cells exhibiting abnormalities and chromosome aberrations. These data are collected from individuals who consume mercury-contaminated fish and a control group with no such exposure. For detailed attributes and dataset structure, refer to the dataset’s documentation within the package.
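
To get started, the dataset can be loaded directly from the package, for example:

```r
library(coin)
data("mercuryfish")
str(mercuryfish)   # inspect the variables and their types
?mercuryfish       # full documentation of the dataset
```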

Objectives

Your analysis should aim to address the following research questions:

  1. Impact of Methyl-Mercury: Is the consumption of fish containing methyl-mercury associated with an increased proportion of cellular abnormalities?

  2. Mercury Concentration and Cellular Abnormalities: How does the concentration of mercury in the blood affect the proportion of cells with abnormalities? Moreover, is there a difference in this relationship between the control group and those exposed to mercury?

  3. Relationship Between Variables: Does a relationship exist between the proportion of abnormal cells (abnormal) and the proportion of cells with chromosome aberrations (ccells)? This analysis should be conducted separately for the control and exposed groups to identify any disparities.

Question 2: Malignant Glioma Pilot Study

Dataset Introduction

The glioma dataset, found within the coin R package, originates from a pilot study focusing on patients with malignant glioma who underwent pretargeted adjuvant radioimmunotherapy using yttrium-90-biotin. This dataset includes variables such as patient sex, treatment group, age, histology (tissue study), and survival time.

Objectives

This analysis aims to investigate the following aspects:

  1. Sex and Group Interaction on Survival Time: Determine whether there is an interaction between patient sex and treatment group that significantly impacts the survival time (time).

  2. Age and Histology Interaction on Survival Time: Assess if age and histology interact in a way that influences the survival time of patients.

  3. Comprehensive Data Exploration: Conduct an exhaustive graphical examination of the dataset to uncover any additional patterns or relationships that merit statistical investigation. Identify the most compelling and insightful observation, formulate a relevant hypothesis, and perform the appropriate statistical analysis.

Question 3: Risk Factors Associated with Low Infant Birth Mass

Dataset Introduction

Package MASS, dataset birthwt: This dataframe has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass. during 1986.

Objectives

State three hypotheses and test them. Make sure one of the tests makes use of the 95% confidence interval approach rather than a formal inferential methodology.

Question 4: The Lung Capacity Data*

Objectives

  1. Using the Lung Capacity data provided, please calculate the 95% CIs for the LungCap variable as a function of:

    1. Gender

    2. Smoke

    3. Caesarean

  2. Create a graph of the mean ± 95% CIs and determine if there are statistical differences in LungCap between the levels of Gender, Smoke, and Caesarean. Do the same using a t-test. Are your findings the same using these two approaches?

  3. Produce all the associated tests for assumptions – i.e. the assumptions to be met when deciding whether to use a t-test or its non-parametric counterpart.

  4. Create a combined tidy dataframe (observe tidy principles) with the estimates for the 95% CI for the LungCap data (LungCap as a function of Gender), estimated using both the traditional and bootstrapping approaches. Create a plot comprising two panels (one for the traditional estimates, one for the bootstrapped estimates) of the mean, median, scatter of raw data points, and the upper and lower 95% CI.

  5. Undertake a statistical analysis that factors in the effect of Age together with one of the categorical variables on LungCap. What new insight does this provide?

Question 5: Piglet Data

Objectives

Here are some fictitious data for pigs raised on different diets (make up an equally fictitious justification for the data and develop hypotheses around that):

feed_1 <- c(60.8, 57.0, 65.0, 58.6, 61.7)
feed_2 <- c(68.7, 67.7, 74.0, 66.3, 69.8)
feed_3 <- c(102.6, 102.1, 100.2, 96.5, 110.3)
feed_4 <- c(87.9, 84.2, 83.1, 85.7, 90.3)

bacon <- data.frame(cbind(feed_1, feed_2, feed_3, feed_4))
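
Note that these data are in wide format; for an ANOVA they would typically first be reshaped into long (tidy) format, with one observation per row. A minimal sketch:

```r
library(tidyr)
library(dplyr)

feed_1 <- c(60.8, 57.0, 65.0, 58.6, 61.7)
feed_2 <- c(68.7, 67.7, 74.0, 66.3, 69.8)
feed_3 <- c(102.6, 102.1, 100.2, 96.5, 110.3)
feed_4 <- c(87.9, 84.2, 83.1, 85.7, 90.3)
bacon <- data.frame(feed_1, feed_2, feed_3, feed_4)

# One row per observation: a 'feed' factor and a 'mass' value
bacon_long <- bacon |>
  pivot_longer(everything(), names_to = "feed", values_to = "mass") |>
  mutate(feed = as.factor(feed))

# aov() expects the data in this long format
summary(aov(mass ~ feed, data = bacon_long))
```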

Question 6: Investigating the Impact of Biochar on Crop Growth and Nutritional Value

Dataset Introduction

In this analysis, we will explore the effects of biochar application on the growth and elemental composition of four key crops: carrot, lettuce, soybean, and sweetcorn. The dataset for this study is sourced from the US Environmental Protection Agency (EPA) and is available at EPA’s Biochar Dataset. To gain a comprehensive understanding of the dataset and its implications, it is highly recommended to review two pertinent research papers linked on the dataset page. These papers not only provide valuable background information on the studies conducted but also offer critical insights and methodologies for data analysis that may be beneficial for this project.

Research Goals

The primary aim of this project is to analyse the impact of biochar on plant yield and identify the three most significant nutrients that influence human health. Your task is to:

  1. Determine whether biochar treatments vary in effectiveness across the different crops.
  2. Provide evidence-based recommendations on how to tailor biochar application for each specific crop to optimise the production of nutrients beneficial to human health and achieve the best possible yield.

In the Introduction section, it is important to justify the selection of the three nutrients you will focus on, explaining their importance to human nutrition. Through detailed data analysis, this project seeks to offer actionable insights on biochar application strategies that enhance both the nutritional value and the biomass of the crops by the end of their growth period.

Question 7: Miscellaneous*

Objectives

  1. For each line of the script, below, write an English explanation for what the code does.

ggplot(points, aes(x = group, y = count)) +
  geom_boxplot(aes(colour = group), size = 1, outlier.colour = NA) +
  geom_point(position = position_jitter(width = 0.2), alpha = 0.3) +
  facet_grid(group ~ ., scales = "free") +
  labs(x = "", y = "Number of data points") +
  theme(legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_blank())

  2. Using the rnorm() function, generate some fictitious data that can be plotted using the code, above. Make sure to assemble these data into a dataframe suitable for plotting, complete with correct column titles.

  3. Apply the code exactly as stated to the data to demonstrate your understanding of the code and convince the examiner of your understanding of the correct data structure.


Part B (0.65)

Assessment Criteria

Your responses to Part B will be evaluated based on the following criteria:

  1. Technical Accuracy (50%)
    • Correct application of data analyses and statistical methods. Statistical tests should address the hypotheses within the scope of what was taught in this module. The assessor recognises that students will not have access to the level of statistical knowledge and experience a research statistician might have. For example, in places, linear mixed models might be better suited to the questions, but students were taught only the basics of ANOVA and simple linear models (relatively simple designs). Where non-parametric alternatives are required, marks will be assigned to any statement that identifies the correct test to use, but not to the execution of those tests.
    • Use of appropriate R packages, functions, and syntax (code style and liberal commenting)
    • Appropriate choice and justification of techniques, including due consideration for the assumptions of the methods used
    • Accurate calculations and results interpretation, down to small details such as how many decimal places to use
  2. Depth of Analysis (20%)
    • Comprehensive exploration of the problem
    • Insightful interpretation of results
    • Consideration of shortfalls in the analysis (due to data limitations, assumptions, etc.), and suggestions for improvement
    • Application of out-of-the-box thinking to the problem
  3. Clarity and Communication (20%)
    • Logical organisation of ideas, including clear section headings and subheadings
    • Clear and concise explanations at each stage of the analysis
    • Effective use of publication quality visualisations where appropriate, including all necessary annotations
    • Communication of results in a way that is appropriate for a scientific audience (e.g. a journal article)
  4. Critical Thinking Shown in Final Conclusion/Synopsis (10%)
    • Discussion of the findings in the context of the problem (add ecological context, etc., as you deem necessary)
    • Identification of limitations
    • Discussion of assumptions
    • Consideration of broader implications

The marks indicated for each task reflect the relative weight and expected depth of your response. Focus on demonstrating both technical proficiency and conceptual understanding in your answers.

Background

These data represent the aerial cover of kelp canopy in South Africa, as measured by Landsat satellites, for the period 1984 to 2024 at a quarterly interval. The intention is to understand the spatio-temporal patterns in kelp canopy cover and to explore how these patterns may be related to coastal sections and biogeographical provinces.

You are provided with the following files at the Google Drive link emailed to you:

  1. A table of 58 coastal sections (58_sections.csv) that partitions the South African coastline into approximately 50 km intervals. Each section is defined by a single coordinate point (latitude, longitude) representing the boundary of the section.
  2. A table of the biogeographical provinces (bioregions.csv) that the 58 coastal sections fall within. There is one row for each of the 58 sections. For this exercise, the biogeographical classification by Professor John Bolton is of interest.
  3. A netCDF file (kelpCanopyFromLandsat_SouthAfrica_v04.nc). The netCDF file contains satellite-derived measurements of kelp canopy area across the South African coastline from 1984 to 2024, sampled quarterly. Each observation corresponds to a grid cell at a specific time point.

Task 1: Initial Processing

  • [Task Weight: 10%]
  • [Components (1) and (2) marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]
  1. Read the kelp canopy area, time, location (latitude/longitude), and satellite pass data from the NetCDF file. Once unpacked, it contains over 5 million rows. Your processing workflow will include:
  • extracting data from the netCDF file where area and passes are variables defined over 3D space (longitude, latitude, and time); and
  • using functions such as tidync::hyper_tibble() or ncdf4::ncvar_get() to read these values.
  2. Restructure the data into a data.table or data.frame:
  • the data should have six columns: longitude, latitude, year, quarter, area, and passes;
  • each row should correspond to a unique pixel in space-time (i.e., one location at one time point); and
  • note that the time variable in the netCDF file is in numeric format (e.g., days since origin, where origin = "1970-01-01"), so you’ll have to convert it to POSIX timestamps using appropriate tools (e.g., as.POSIXct()).

If you are unable to read the NetCDF file, you may request access to a processed version of this file (in long CSV format) from me, but you’ll be penalised by 10% if you do so.
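
A minimal sketch of this workflow, assuming the netCDF variables are named area, passes, and time (check the actual names with tidync() first, and verify the time unit against the file's metadata):

```r
library(tidync)
library(dplyr)

# Inspect the grids and variable names before extracting anything
tidync("kelpCanopyFromLandsat_SouthAfrica_v04.nc")

# Extract all variables over longitude x latitude x time into a
# long tibble: one row per pixel per time step
kelp <- tidync("kelpCanopyFromLandsat_SouthAfrica_v04.nc") |>
  hyper_tibble()

# If time is stored as days since 1970-01-01, convert it to dates
# and derive year and quarter (adjust if the file uses another unit)
kelp <- kelp |>
  mutate(date = as.Date(time, origin = "1970-01-01"),
         year = as.integer(format(date, "%Y")),
         quarter = (as.integer(format(date, "%m")) - 1) %/% 3 + 1) |>
  select(longitude, latitude, year, quarter, area, passes)
```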

Task 2: Exploratory Data Analysis

  • [Task Weight: 10%]
  • [Tasks 2.1, 2.2, and 2.3, each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]

2.1 Weighted Mean Time Series

  1. For each year and quarter combination:
  • compute the weighted mean of the kelp canopy area across all locations, using the number of satellite passes as weights;
  • exclude observations where passes = 0 or area is NA; and
  • plot the resulting time series of weighted mean kelp area, using i) quarters on the x-axis, and ii) a continuous time index from 1984–2024.
  2. Compute the weighted mean area at each unique (longitude, latitude) pixel across time. Then:
  • select a random sample of 100 pixels;
  • for each sampled pixel, extract the full time series of weighted mean area;
  • plot all 100 time series in a single panel (overlayed), using semi-transparent lines; and
  • label axes appropriately.
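
As a sketch, assuming the long-format kelp data from Task 1 (with columns year, quarter, area, and passes), the weighted mean series in 2.1.1 might be computed and plotted as:

```r
library(dplyr)
library(ggplot2)

wm <- kelp |>
  filter(passes > 0, !is.na(area)) |>   # exclusions per the task
  group_by(year, quarter) |>
  summarise(w_area = weighted.mean(area, w = passes),
            .groups = "drop")

# Continuous time index: year plus the fraction of the year elapsed
ggplot(wm, aes(x = year + (quarter - 1) / 4, y = w_area)) +
  geom_line() +
  labs(x = "Year", y = "Weighted mean kelp canopy area")
```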

2.2 Summary Statistics

  1. Using the weighted data for each year and quarter combination (prepared in 2.1.1), compute and report summary statistics at the following levels of temporal aggregation:
    • by year;
    • by quarter;
    • by year/quarter combination;
    • include: weighted mean, median, standard deviation, interquartile range, skewness, and kurtosis; and
    • comment on the appropriateness of each statistic for these data, and justify your choices in light of the data distribution.
  2. Create visualisations (e.g. boxplots, violin plots, histograms) to support your interpretations.
  3. Based on these, discuss any discernible temporal trends (e.g. decadal increases/decreases) and seasonal patterns (quarterly effects).

2.3 Observation Density Map

Create a map plotting each observed pixel location (defined by longitude × latitude):

  • colour each pixel by the total number of valid observations (i.e., non-NA values of area) across all time points;
  • overlay the 58 coastal sections as reference points or lines, numbered from west (1) to east (58); and
  • use an appropriate geographic projection and include a legend.

Task 3: Assigning Kelp Observations to Coastal Sections

  • [Task Weight: 20%]
  • [Tasks 3.1 and 3.2 each marked on a 0–100 scale, then scaled in the proportions 0.7 and 0.3 of the Task Weight of 20%]

Using the data prepared above, your task now is to spatially classify each kelp canopy observation by assigning it to two types of geographic units.

3.1 Assignment to Coastal Sections

You are provided with a table of 58 coastal sections, each defined by a single geographic coordinate (Latitude and Longitude). These points mark successive ~50 km intervals along the South African coastline, numbered from west (1) to east (58).

Assign each kelp canopy observation to the nearest coastal section based on geographic proximity:

  • use a geodesic (great-circle) distance metric to compute proximity between kelp sampling points and section coordinates (assume all coordinates are in WGS84);
  • add a new column to your kelp dataset called section_id, indicating the row number (1–58) of the nearest section; and
  • you may use any R packages or methods you like, but your code should be efficient and well-commented.
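
One possible approach (a sketch, assuming a sections data.frame with Longitude and Latitude columns): compute geodesic distances with the geosphere package on the unique pixels only, then join the assignments back to the full dataset, which avoids building a distance matrix over all 5 million rows.

```r
library(geosphere)
library(dplyr)

# Distances need only be computed once per unique pixel
pixels <- distinct(kelp, longitude, latitude)

# distm() returns an n_pixels x 58 matrix of great-circle
# distances (in metres, WGS84); the nearest section is the
# column holding the row-wise minimum
dmat <- distm(pixels[, c("longitude", "latitude")],
              sections[, c("Longitude", "Latitude")],
              fun = distGeo)
pixels$section_id <- max.col(-dmat)   # index of the minimum per row

kelp <- left_join(kelp, pixels, by = c("longitude", "latitude"))
```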

3.2 Assignment to Biogeographical Provinces

You are also provided with a table that maps each coastal section (1–58) to a biogeographical province, based on a classification by Professor John Bolton.

  • Using your previous assignment of each kelp observation to a section_id, add a second column called bioregion_id that indicates which biogeographical province the observation falls within.
  • Your final kelp dataset should contain the following key columns (alongside the original data):
    • longitude, latitude
    • year, quarter, area, passes
    • section_id (integer 1–58)
    • bioregion_id (character or factor)
  • Include your full, annotated R code that performs both spatial assignments into your resultant .html document. Your method should be reproducible, and your code should be easy to follow. Print the head() and tail() of your final dataset, and include a summary() of the data.
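
The province assignment in 3.2 then reduces to a table join; a sketch, assuming bioregions.csv carries columns named section_id and bioregion_id (adjust to the actual column names in the file):

```r
library(dplyr)
library(readr)

bioregions <- read_csv("bioregions.csv")

kelp <- kelp |>
  left_join(select(bioregions, section_id, bioregion_id),
            by = "section_id") |>
  mutate(bioregion_id = as.factor(bioregion_id))

head(kelp); tail(kelp); summary(kelp)
```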

Task 4: Inferential Statistics

  • [Task Weight: 30%]
  • [Tasks 4.1, 4.2, 4.3, 4.4, and 4.5 each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 30%]

You are now asked to evaluate a series of research questions concerning the spatial and temporal structure of kelp canopy area. These questions are to be answered using the kelp dataset that has already been processed to include both section_id and bioregion_id. Use the weighted kelp canopy area (area, weighted by passes) as your response variable throughout – you should have already prepared this dataset in Task 2.

You may use ANOVAs and/or linear models. In each case you must clearly state your hypotheses, justify your choice of model, and interpret your findings both statistically and ecologically.

4.1 Spatial Differences Between Coastal Sections

Question: Is there a statistically significant difference in mean kelp canopy area between coastal sections?

4.2 Spatial Differences Between Biogeographical Provinces

Question: Is there a statistically significant difference in mean kelp canopy area between biogeographical provinces?

4.3 Interaction Between Section and Province

Question: Is there an interaction between coastal section and biogeographical province in explaining variation in kelp canopy area?

4.4 Linear Trend Over Time by Province

Question: Is there a linear trend in kelp canopy area over time, and does the direction or strength of this trend differ between biogeographical provinces?

4.5 Seasonal Variation Across Provinces

Question: Does the seasonal pattern in kelp canopy area differ between provinces?

General Guidance for Task 4 (above)

For each sub-question above, you should:

  • formally state the null and alternative hypotheses;
  • justify your choice of model;
  • justify your choice of predictors;
  • justify your decision to aggregate or not aggregate the data at various levels;
  • discuss the assumptions involved and any violations you detect;
  • present the relevant model outputs and statistical tests;
  • include visualisations where appropriate (e.g. interaction plots, trend lines, diagnostic plots);
  • justify your choice of visualisation; and
  • present the results in a clear and concise manner, including tables and figures where appropriate, in a manner that would be appropriate for a scientific audience (e.g. a journal article).

You are not required to use the same modelling approach for all five sub-questions, though consistency across related questions is encouraged.

Task 5: Write-up

  • [Task Weight: 10%]

Write a short report (maximum 2 pages of text) that synthesises your findings across Tasks 2 through 4. This report should be written in the style of the Discussion section of a scientific paper, intended for an ecological audience.

Your goal is to interpret the major patterns and relationships you have identified, and to comment meaningfully on their ecological significance. Your write-up should include:

  • Temporal Trends and Seasonality.
  • Spatial Structure and Biogeography.
  • Interaction Effects and Spatial–Temporal Coupling.
  • Limitations and Assumptions.
  • Ecological Interpretation.

Format and tone:

  • Aim for clarity and economy of expression.
  • Don’t generate any new tables or figures. The tables and figures from Tasks 2 through 4 should be sufficient.
  • Write in complete paragraphs. Avoid bulleted summaries.
  • Add references to the tables and figures from Tasks 2 through 4 as needed.
  • Cite any additional references you use.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {BCB744 {Final} {Exam}},
  date = {2026-04-26},
  url = {https://tangledbank.netlify.app/BCB744/assessments/BCB744_Final_Prac_Exam_2026.html},
  langid = {en-GB}
}
For attribution, please cite this work as:
Smit AJ (2026) BCB744 Final Exam. https://tangledbank.netlify.app/BCB744/assessments/BCB744_Final_Prac_Exam_2026.html.