BCB744 Final Exam

Author

Affiliation

Published

26 April 2026

Exam Guidelines

Date and Exam Duration

The Biostatistics Exam will start at 8:00 on 26 April and you have until 14:00 on 27 April, 2026, to complete it. This exam may be conducted anywhere in the world, and it will contribute 40% of the final assessment marks for the BCB744 module.

Student Guidelines

Convert Quarto to HTML: Submit your assignment as an HTML file, derived from a Quarto document. Ensure your submission is a thoroughly annotated report, complete with meta-information (name, date, purpose, etc.) at the beginning. Each section/test should be accompanied by detailed explanations of its purpose.

Required YAML Structure: Your Quarto document must begin with the following YAML header (fill in your own details where indicated):

---
title: "BCB744 Biostatistics Exam"
author: "Your Name"
date: "2026-04-26"
format:
  html:
    embed-resources: true
    toc: false
    number-sections: false
---

Your document should follow this hierarchical structure:

# Part A

## Question 1

### Preamble
### Introduction
### Methods
### Results
### Discussion

## Question 2

### Preamble
### Introduction
### Methods
### Results
### Discussion

[... continue for remaining questions ...]

# Part B

## Task 1

## Task 2
### 2.1
### 2.2
### 2.3

[... continue for remaining tasks ...]

# References

Testing Assumptions: For all questions necessitating formal inferential statistics, conduct and document the appropriate preliminary tests to check statistical assumptions. This includes stating the assumptions, detailing the procedures for testing these assumptions, and specifying the null hypotheses (\(H_{0}\)). If assumptions are tested graphically, elucidate the rationale behind the graphical method. Discuss the outcomes of these assumption tests and provide a rationale for the chosen inferential statistical tests (e.g., t-test, ANOVA).
State Hypotheses: When inferential statistics are employed, clearly articulate the null (\(H_{0}\)) and alternative (\(H_{A}\)) hypotheses. Later, in the results section, remember to state whether the \(H_{0}\) or \(H_{A}\) is accepted or rejected.
Graphical Support: Support all descriptive and inferential statistical analyses with appropriate graphical representations of the data.
Presentation Format: Structure each answer as a concise mini-paper, including the sections Introduction, Methods, Results, Discussion, and References. Though each answer is expected to span 2-3 pages, there are no strict page limits. [Does not apply to questions marked with an *]
- Incorporate a Preamble section before the Introduction to detail preliminary analyses, figures, tables, and other relevant background information that doesn’t fit into the main narrative of your paper. This section provides insight into the preparatory work and will not be considered part of the main evaluation.
- The Introduction should set the stage by offering background information, establishing the relevance of the study, and clearly stating the research question or hypothesis.
- The Methods section must specify the statistical methodologies applied, including how assumptions were tested and any additional data analyses performed. Emphasise the inferential statistics without delving into exploratory data analysis (EDA).
- In the Results section, focus solely on the findings pertinent to the hypotheses introduced in the Introduction. While assumption tests are part of the statistical analysis, they need not be highlighted in this section as that is what the ‘Preamble’ section is for. Ensure that figure and/or table captions are informative and self-explanatory.
- The Discussion section is for interpreting the results, considering their significance, limitations, and implications, and suggesting avenues for future research. You may reference up to five pertinent studies in the Methods and Discussion sections.
- End with a consolidated References section, listing all sources cited across the questions.
Formatting: Presentation matters. Marks are allocated for the visual quality of the submission. This includes the neatness of the document, proper use of headings, and adherence to coding conventions (e.g., spacing).

Assessor Guidelines (applies to all questions/tasks)

Assess good document structure according to logical headings and heading hierarchies.
Heavily penalise untidy formatting, excessive output of long, unnecessary data printouts (other than the obvious and required used of head(), tail(), glimpse(), and summary()) that serve no purpose [-15%].
Answers where the code gives error messages (it fails to run to provide the required output) get 0 for that question.
Where students write long-form text feedback/answers within code blocks, where it should have been more appropriately placed within the markdown text between code blocks and presented in full sentences, should be penalised [-10% for each question where this occurs].
Text answers written as bullet points and which lacks detailed explanatory power gets penalised [-10%].
Untidy presentation and formatting that fails to resemble my model answers (below) penalised [-15%].

Exam Structure

The exam is comprised of two parts, Part A and Part B. They contribute 0.35 and 0.65 to the exam mark, respectively.

Data Access

Some datasets are included with R packages, and the remaining data files may be found at the Google Drive link.

Part A (0.35)

Question 1: Effects of Mercury-Contaminated Fish Consumption on Chromosomes

Dataset Overview

The dataset mercuryfish, available in the R package coin, comprises measurements of mercury levels in blood, and proportions of cells exhibiting abnormalities and chromosome aberrations. These data are collected from individuals who consume mercury-contaminated fish and a control group with no such exposure. For detailed attributes and dataset structure, refer to the dataset’s documentation within the package.

Objectives

Your analysis should aim to address the following research questions:

Impact of Methyl-Mercury: Is the consumption of fish containing methyl-mercury associated with an increased proportion of cellular abnormalities?
Mercury Concentration and Cellular Abnormalities: How does the concentration of mercury in the blood affect the proportion of cells with abnormalities? Moreover, is there a difference in this relationship between the control group and those exposed to mercury?
Relationship Between Variables: Does a relationship exist between the proportion of abnormal cells (abnormal) and the proportion of cells with chromosome aberrations (ccells)? This analysis should be conducted separately for the control and exposed groups to identify any disparities.

Answer

Impact of Methyl-Mercury

Load the mercuryfish dataset

library(tidyverse)
library(coin)
data(mercuryfish)
mercuryfish

     group mercury abnormal ccells
1  control     5.3      8.6    2.7
2  control    15.0      5.0    0.5
3  control    11.0      8.4    0.0
4  control     5.8      1.0    0.0
5  control    17.0     13.0    5.0
6  control     7.0      5.0    0.0
7  control     8.5      1.0    0.0
8  control     9.4      2.3    1.3
9  control     7.8      2.0    0.0
10 control    12.0      6.4    1.8
11 control     8.7      7.0    0.0
12 control     4.0      1.7    0.0
13 control     3.0      4.0    1.0
14 control    12.2      1.8    1.8
15 control     6.1      2.8    0.0
16 control    10.2      4.7    3.1
17 exposed   100.0      6.4    0.7
18 exposed    70.0      9.2    4.6
19 exposed   196.0      3.6    0.0
20 exposed    69.0      3.7    1.7
21 exposed   370.0     14.2    5.2
22 exposed   270.0      7.0    0.0
23 exposed   150.0     13.5    5.0
24 exposed    60.0     21.5    9.5
25 exposed   330.0      9.0    2.0
26 exposed  1100.0     11.0    3.0
27 exposed    40.0      8.0    1.0
28 exposed   100.0      9.2    3.5
29 exposed    70.0      8.0    2.0
30 exposed   150.0     14.0    5.0
31 exposed   200.0     11.9    5.5
32 exposed   304.0     10.0    2.0
33 exposed   236.0      6.6    3.0
34 exposed   178.0     13.0    4.0
35 exposed    41.0      0.0    0.0
36 exposed   120.0      6.0    2.0
37 exposed   330.0     13.1    2.2
38 exposed    62.0      0.0    0.0
39 exposed    12.8      5.3    2.0

Check the structure of the dataset

head(mercuryfish)

    group mercury abnormal ccells
1 control     5.3      8.6    2.7
2 control    15.0      5.0    0.5
3 control    11.0      8.4    0.0
4 control     5.8      1.0    0.0
5 control    17.0     13.0    5.0
6 control     7.0      5.0    0.0

EDA: Boxplot of the proportion of abnormal cells by group

ggplot(mercuryfish, aes(x = group, y = abnormal)) +
  geom_boxplot(notch = TRUE) +
  labs(
    title = "Proportion of Abnormal Cells by Group",
    x = "Group",
    y = "Proportion of Abnormal Cells"
  )

Above se see that the exposed group has a higher proportion of abnormal cells (the notches do not overlap between the two groups) Let’s test this formally…

Test normality of the data within the groups Shapiro-Wilk test H0: The distribution of my data does not differ from a normal distribution Ha: The distribution of my data differs from a normal distribution

shapiro.test(mercuryfish$abnormal[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$abnormal[mercuryfish$group == "control"]
W = 0.90267, p-value = 0.0887

shapiro.test(mercuryfish$abnormal[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$abnormal[mercuryfish$group == "exposed"]
W = 0.96841, p-value = 0.6509

Above we see that we can accept the assumption of normality for both groups

What about homogeneity of variances?

mercuryfish |>
  group_by(group) |>
  summarise(sample_var = var(abnormal))

# A tibble: 2 × 2
  group   sample_var
  <fct>        <dbl>
1 control       11.2
2 exposed       24.3

We see that the variances are more-or-less the same for the two groups

We could do a Levene’s test to confirm this Levene’s test H0: The variances of the groups are equal Ha: The variances of the groups are not equal

car::leveneTest(abnormal ~ group, data = mercuryfish)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.6607 0.2055
      37

Above we see that we accept the H0 that states that the variances are equal The data are normally distributed and variances are equal and we can proceed with a Student’s t-test

Conduct a Student’s t-test to compare the proportion of abnormal cells between the two groups

t.test(abnormal ~ group, var.equal = TRUE, data = mercuryfish)


    Two Sample t-test

data:  abnormal by group
t = -2.9664, df = 37, p-value = 0.005253
alternative hypothesis: true difference in means between group control and group exposed is not equal to 0
95 percent confidence interval:
 -7.084765 -1.334257
sample estimates:
mean in group control mean in group exposed 
             4.668750              8.878261

The p-value is less than 0.05, indicating a significant difference in the proportion of abnormal cells between the control and exposed groups

Mercury Concentration and Cellular Abnormalities

EDA: Scatterplot of mercury concentration vs. proportion of abnormal cells

ggplot(mercuryfish, aes(x = mercury, y = abnormal, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Mercury Concentration vs. Proportion of Abnormal Cells",
    x = "Mercury Concentration",
    y = "Proportion of Abnormal Cells"
  )

Test normality of the data

shapiro.test(mercuryfish$mercury[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$mercury[mercuryfish$group == "control"]
W = 0.97435, p-value = 0.9032

shapiro.test(mercuryfish$mercury[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$mercury[mercuryfish$group == "exposed"]
W = 0.64984, p-value = 3.341e-06

We see that the mercury concentrations are normally distributed for the control group (we do not reject H0) but not for the exposed group (we reject H0); earlier we have seen that the response variable (abnormalities) is normal for both the control and the exposed groups

But since we want to model a linear relationship, now is not quite the right time to do the tests for normality – we want to do this for the residuals of the model (that is, we fit the model first, and then test the residuals for normality)

We also see from the scatterplot that the data might be approximately linear for the exposed group, but not for the control group where the data are more scattered around very low mercury concentrations near zero

We also see from the very wide confidence intervals that the model is not very good at predicting the proportion of abnormal cells from mercury concentration in the blood in the exposed group; my guess is that there will not be a linear relationship between mercury concentration and the proportion of abnormal cells in the control or exposed groups

We can proceed with a linear regression model to assess the relationship

Fit a linear regression model to assess the relationship between mercury concentration and the proportion of abnormal cells H0(1): There is no relationship between mercury concentration and the proportion of abnormal cells Ha(1): There is a relationship between mercury concentration and the proportion of abnormal cells H0(2): The relationship between mercury concentration and the proportion of abnormal cells does not differ between the control and exposed groups Ha(2): The relationship between mercury concentration and the proportion of abnormal cells differs between the control and exposed groups

model.lm <- lm(abnormal ~ mercury + group, data = mercuryfish)
summary(model.lm)


Call:
lm(formula = abnormal ~ mercury + group, data = mercuryfish)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1709 -2.5884 -0.2124  2.6725 13.3395 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.622336   1.081910   4.272 0.000135 ***
mercury      0.005193   0.004129   1.258 0.216575    
groupexposed 3.226585   1.610348   2.004 0.052677 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.325 on 36 degrees of freedom
Multiple R-squared:  0.2261,    Adjusted R-squared:  0.1832 
F-statistic:  5.26 on 2 and 36 DF,  p-value: 0.009906

We can now check the residuals for normality in the two groups which will confirm that the model is appropriate (or not)

mercuryfish$residuals <- residuals(model.lm)
shapiro.test(mercuryfish$residuals[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals[mercuryfish$group == "control"]
W = 0.90326, p-value = 0.09066

shapiro.test(mercuryfish$residuals[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals[mercuryfish$group == "exposed"]
W = 0.94985, p-value = 0.2905

We see that the residuals are normally distributed for both groups and hence using a linear model was appropriate

The p-value for the interaction term not less than 0.05, indicating that the relationship between mercury concentration and the proportion of abnormal cells does not differ between the control and exposed groups – we can reject Ha(1) and Ha(2) If we wanted to (recommended), we could refit the model without the interaction term

We could also do an ANCOVA instead of the linear model to test the interaction term

model.aov <- aov(abnormal ~ group * mercury, data = mercuryfish)
summary(model.aov)

              Df Sum Sq Mean Sq F value  Pr(>F)   
group          1  167.2  167.20   9.246 0.00445 **
mercury        1   29.6   29.59   1.636 0.20924   
group:mercury  1   40.5   40.47   2.238 0.14361   
Residuals     35  633.0   18.08                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What do we conclude? The proportion of abnormal cells differs significantly between the control and exposed groups, with the exposed group exhibiting a higher proportion of abnormal cells. However, the relationship between mercury concentration and the proportion of abnormal cells does not differ between the two groups. There is a good amount of scatter in the amount of cell abnormalities even in just the control group, which suggests that mercury concentration alone may not be a strong predictor of cellular abnormalities. Increasing the amount of mercury in the blood does not necessarily lead to a linear increase but it certainly does account for a few of the highest values seen in the exposed group.

Relationship Between Variables

EDA: Scatterplot of proportion of abnormal cells vs. chromosome cell proportion

ggplot(mercuryfish, aes(x = abnormal, y = ccells, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Abnormal Cell Proportion vs. Chromosome Cell Proportion by Group",
    x = "Proportion of Abnormal Cells",
    y = "Proportion of Cells with Chromosome Aberrations"
  )

We see that there is a clear linear relationship between abnormal cell proportion and Cu cell proportion in both groups, and the confidence intervals are narrow(-ish), indicating that the model could be reasonably good at predicting Cu cell proportion from the proportion of abnormal cells

We know the relationship between continuous covariates is linear and may therefore proceed with a linear regression model; the remaining assumptions will be tested afterwards

Fit a linear regression model to assess the relationship between the proportion of Cu cells and the proportion of abnormal cells H0(1): There is no relationship between the proportion of Cu cells and the proportion of abnormal cells Ha(1): There is a relationship between the proportion of Cu cells and the proportion of abnormal cells H0(2): The relationship between the proportion of Cu cells and the proportion of abnormal cells does not differ between the control and exposed groups Ha(2): The relationship between the proportion of Cu cells and the proportion of abnormal cells differs between the control and exposed groups

model.lm2 <- lm(ccells ~ abnormal + group, data = mercuryfish)
summary(model.lm2)


Call:
lm(formula = ccells ~ abnormal + group, data = mercuryfish)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4760 -0.7479  0.1761  0.5831  2.0133 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.67797    0.36166  -1.875    0.069 .  
abnormal      0.37547    0.04461   8.417 5.01e-10 ***
groupexposed  0.12272    0.42837   0.286    0.776    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.183 on 36 degrees of freedom
Multiple R-squared:  0.7152,    Adjusted R-squared:  0.6994 
F-statistic: 45.21 on 2 and 36 DF,  p-value: 1.516e-10

We can now check the residuals for normality in the two groups which will confirm that the model is appropriate (or not)

mercuryfish$residuals2 <- residuals(model.lm2)
shapiro.test(mercuryfish$residuals2[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals2[mercuryfish$group == "control"]
W = 0.95476, p-value = 0.5686

shapiro.test(mercuryfish$residuals2[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals2[mercuryfish$group == "exposed"]
W = 0.96346, p-value = 0.5366

We see that the residuals are normally distributed for both groups and hence using a linear model was appropriate

The p-value ‘abnormal’ term is less than 0.05, indicating that the relationship between the proportion of abnormal cells and the proportion of Cu cells is significant – we accept Ha(1) The p-value for the interaction term is not less than 0.05, indicating that the relationship between the proportion of abnormal cells and the proportion of Cu cells does not differ between the control and exposed groups – we do not reject H0(2)

What do we conclude? The proportion of Cu cells is significantly related to the proportion of abnormal cells, with a higher proportion of abnormal cells corresponding to a higher proportion of Cu cells. This relationship does not differ between the control and exposed groups. The model is appropriate for predicting the proportion of Cu cells from the proportion of abnormal cells, as the residuals are normally distributed for both groups.

Alternative approaches for assigning marks: Instead of doing a linear regression with interaction term, which I did not formally teach, equally justified are individual linear regressions for each group and using the confidence intervals to make inferences. This would involve fitting two linear regression models, one for each group, and comparing the confidence intervals of the coefficients to determine if the relationship between the proportion of Cu cells and the proportion of abnormal cells differs between the two groups. This would apply to all the other questions as well.

Or, in part (c), we could have done correlations for each group and compared the correlation coefficients to determine if the relationship between the proportion of abnormal cells and the proportion of Cu cells differs between the two groups.

Question 2: Malignant Glioma Pilot Study

Dataset Introduction

The glioma dataset, found within the coin R package, originates from a pilot study focusing on patients with malignant glioma who underwent pretargeted adjuvant radioimmunotherapy using yttrium-90-biotin. This dataset includes variables such as patient sex, treatment group, age, histology (tissue study), and survival time.

Objectives

This analysis aims to investigate the following aspects:

Sex and Group Interaction on Survival Time: Determine whether there is an interaction between patient sex and treatment group that significantly impacts the survival time (time).
Age and Histology Interaction on Survival Time: Assess if age and histology interact in a way that influences the survival time of patients.
Comprehensive Data Exploration: Conduct an exhaustive graphical examination of the dataset to uncover any additional patterns or relationships that merit statistical investigation. Identify the most compelling and insightful observation, formulate a relevant hypothesis, and perform the appropriate statistical analysis.

Answer

Sex and Group Interaction on Survival Time

Load the glioma dataset

data(glioma, package = "coin")

Check the structure of the dataset

str(glioma)

'data.frame':   37 obs. of  7 variables:
 $ no.      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age      : int  41 45 48 54 40 31 53 49 36 52 ...
 $ sex      : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 2 2 2 2 2 ...
 $ histology: Factor w/ 2 levels "GBM","Grade3": 2 2 2 2 2 2 2 2 2 2 ...
 $ group    : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 2 2 ...
 $ event    : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
 $ time     : int  53 28 69 58 54 25 51 61 57 57 ...

Summary statistics

summary(glioma)

      no.              age            sex      histology      group   
 Min.   : 1.000   Min.   :19.00   Female:16   GBM   :20   Control:18  
 1st Qu.: 5.000   1st Qu.:40.00   Male  :21   Grade3:17   RIT    :19  
 Median :10.000   Median :47.00                                       
 Mean   : 9.757   Mean   :48.49                                       
 3rd Qu.:14.000   3rd Qu.:57.00                                       
 Max.   :19.000   Max.   :83.00                                       
   event              time      
 Mode :logical   Min.   : 5.00  
 FALSE:14        1st Qu.:13.00  
 TRUE :23        Median :28.00  
                 Mean   :30.84  
                 3rd Qu.:50.00  
                 Max.   :69.00

EDA: Boxplot of survival

ggplot(glioma, aes(x = group, y = time)) +
  geom_boxplot(notch = FALSE, aes(colour = sex)) +
  labs(title = "Survival Time by Group", x = "Group", y = "Months")

There does seem to be an effect of group on survival time, with the radioimmunotherapy group having a higher median survival time than the control group A sex-related effect does not seem present

Looking at the box and whisker plot, there seems to be an issue with the assumption of normality. Let’s test the normality of the data using the Shapiro-Wilk test:

Shapiro-Wilk test for normality H0: The data are normally distributed Ha: The data are not normally distributed (separate hypotheses for sex and group)

shapiro.test(glioma$time[glioma$group == "Control"])


    Shapiro-Wilk normality test

data:  glioma$time[glioma$group == "Control"]
W = 0.80087, p-value = 0.001563

shapiro.test(glioma$time[glioma$group == "RIT"])


    Shapiro-Wilk normality test

data:  glioma$time[glioma$group == "RIT"]
W = 0.92571, p-value = 0.1443

shapiro.test(glioma$time[glioma$sex == "Female"])


    Shapiro-Wilk normality test

data:  glioma$time[glioma$sex == "Female"]
W = 0.87292, p-value = 0.03016

shapiro.test(glioma$time[glioma$sex == "Male"])


    Shapiro-Wilk normality test

data:  glioma$time[glioma$sex == "Male"]
W = 0.90727, p-value = 0.04852

The p-values are less than 0.05 for the control group and for both sexes, indicating that the data are not normally distributed. We will use non-parametric tests to compare the survival times between the two groups within the factors.

We must also check the homogeneity of variances using Levene’s test: H0: The variances are equal Ha: The variances are not equal

car::leveneTest(time ~ group, data = glioma)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.5274 0.2247
      35

car::leveneTest(time ~ sex, data = glioma)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.0069 0.9344
      35

Variances are the same everywhere; nevertheless, due to the non-normal data, we will have to use a non-parametric test.

We will check for differences between groups formally using the Kruskal-Wallis test:

Kruskal-Wallis test H0: The medians of the groups are equal Ha: At least one median is different (to be stated separately for group and sex)

kruskal.test(time ~ group, data = glioma)


    Kruskal-Wallis rank sum test

data:  time by group
Kruskal-Wallis chi-squared = 16.123, df = 1, p-value = 5.936e-05

kruskal.test(time ~ sex, data = glioma)


    Kruskal-Wallis rank sum test

data:  time by sex
Kruskal-Wallis chi-squared = 0.54251, df = 1, p-value = 0.4614

The p-value is less than 0.05, indicating that there is a significant difference in survival time between the two groups. We reject the null hypothesis and conclude that the medians of the groups are not equal, but we do not reject the H0 for sex differences.

We cannot do a formal stats test to look for significant interaction terms between sex and group due to the non-normality of the data (i.e. there is not non-parametric equivalent for a two-way ANOVA); However, we can simply look at the box-and-whisker plot: there does not seem to be a significant interaction as the response is the same regardless of sex within each group.

Age and Histology Interaction on Survival Time

EDA: Line/scatterplot of survival by histology and age

ggplot(glioma, aes(x = age, y = time)) +
  geom_point(aes(colour = histology)) +
  geom_smooth(method = "lm", se = TRUE, aes(colour = histology)) +
  labs(
    title = "Survival Time by Histology and Age",
    x = "Age (Years)",
    y = "Survival time (Months)"
  )

There does not seem to be a clear effect of age on survival time within each histology group. Notice the wide confidence intervals at the starts and ends of the age range for each histology group: the starts and ends overlap within each group suggesting a statistically insignificant effect of age

We can test formally with a linear model:

Linear model H0: There is no interaction between age and histology Ha: There is an interaction between age and histology

model.lm <- lm(time ~ age * histology, data = glioma)
summary(model.lm)


Call:
lm(formula = time ~ age * histology, data = glioma)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.830 -10.163  -1.397  10.392  37.274 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)   
(Intercept)          38.2531    13.7074   2.791  0.00867 **
age                  -0.3516     0.2461  -1.429  0.16239   
histologyGrade3      -3.2009    20.5856  -0.155  0.87738   
age:histologyGrade3   0.5739     0.4308   1.332  0.19194   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.48 on 33 degrees of freedom
Multiple R-squared:  0.4464,    Adjusted R-squared:  0.396 
F-statistic: 8.869 on 3 and 33 DF,  p-value: 0.0001877

Before we can interpret the linear model fit, we must check that the lm is indeed suited to the data. We can do this by checking the residuals:

Residuals vs Fitted plot

plot(model.lm, which = 1)

Or you could check normality of residuals with a QQ plot:

Normal Q-Q plot

qqnorm(model.lm$residuals)

Or you could check for homogeneity of variances with a residuals vs fitted plot:

Residuals vs Fitted plot

plot(model.lm, which = 3)

Or simply assess the normality of the residuals with a Shapiro-Wilk test:

Shapiro-Wilk test for normality

glioma$residuals <- model.lm$residuals

shapiro.test(glioma$residuals[glioma$group == "RIT"])


    Shapiro-Wilk normality test

data:  glioma$residuals[glioma$group == "RIT"]
W = 0.93321, p-value = 0.1984

shapiro.test(glioma$residuals[glioma$group == "Control"])


    Shapiro-Wilk normality test

data:  glioma$residuals[glioma$group == "Control"]
W = 0.90444, p-value = 0.0687

Okay, residuals are normal according to most tests above.

The interaction term is not significant, indicating that there is no interaction between age and histology on survival time.

We will use the Kruskal-Wallis test to compare survival times between the different histology groups:

Kruskal-Wallis test H0: The medians of the histology groups are equal Ha: At least one median is different

kruskal.test(time ~ histology, data = glioma)


    Kruskal-Wallis rank sum test

data:  time by histology
Kruskal-Wallis chi-squared = 13.74, df = 1, p-value = 0.0002099

The p-value is less than 0.05, indicating that there is a significant difference in survival time between the histology groups. We reject the null hypothesis and conclude that the medians of the histology groups are not equal.

We cannot do a formal stats test to look for significant interaction terms between age and histology due to the non-normality of the data; However, we can simply look at the box-and-whisker plot: there does not seem to be a significant interaction as the response is the same regardless of age within each histology group.

Or we can test with an ANCOVA instead of the linear model above (start first with the assumption of normality and variances within the histology groups – Levene’s test and Shapiro-Wilk’s test)

model.aov1 <- aov(time ~ histology * age, data = glioma)
summary(model.aov1)

              Df Sum Sq Mean Sq F value   Pr(>F)    
histology      1   5795    5795  24.169 2.36e-05 ***
age            1    159     159   0.662    0.422    
histology:age  1    425     425   1.775    0.192    
Residuals     33   7912     240                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The outcome is the same as the linear model above: there is no interaction between age and histology on survival time.

Comprehensive Data Exploration

Question 3: Risk Factors Associated with Low Infant Birth Mass

Dataset Introduction

Package MASS, dataset birthwt: This dataframe has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass. during 1986.

Objectives

State three hypotheses and test them. Make sure one of the tests makes use of the 95% confidence interval approach rather than a formal inferential methodology.

Answer Load the birthwt dataset

library(MASS)
library(dplyr)
data(birthwt)

Check the structure of the dataset

str(birthwt)

'data.frame':   189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Question 4: The Lung Capacity Data*

Objectives

Using the Lung Capacity data provided, please calculate the 95% CIs for the LungCap variable as a function of:
1. Gender
2. Smoke
3. Caesarean
Create a graph of the mean ± 95% CIs and determine if there are statistical differences in LungCap between the levels of Gender, Smoke, and Caesarean. Do the same using a t-test. Are your findings the same using these two approaches?
Produce all the associated tests for assumptions – i.e. the assumptions to be met when deciding whether to use a t-test or its non-parametric counterpart.
Create a combined tidy dataframe (observe tidy principles) with the estimates for the 95% CI for the LungCap data (LungCap as a function of Gender), estimated using both the traditional and bootstrapping approaches. Create a plot comprising two panels (one for the traditional estimates, one for the bootstrapped estimates) of the mean, median, scatter of raw data points, and the upper and lower 95% CI.
Undertake a statistical analysis that factors in the effect of Age together with one of the categorical variables on LungCap. What new insight does this provide?

Question 5: Piglet Data

Objectives

Here are some fictitious data for pigs raised on different diets (make up an equally fictitious justification for the data and develop hypotheses around that):

feed_1 <- c(60.8, 57.0, 65.0, 58.6, 61.7)
feed_2 <- c(68.7, 67.7, 74.0, 66.3, 69.8)
feed_3 <- c(102.6, 102.1, 100.2, 96.5, 110.3)
feed_4 <- c(87.9, 84.2, 83.1, 85.7, 90.3)

bacon <- data.frame(cbind(feed_1, feed_2, feed_3, feed_4))

Question 6: Investigating the Impact of Biochar on Crop Growth and Nutritional Value

Dataset Introduction

In this analysis, we will explore the effects of biochar application on the growth and elemental composition of four key crops: carrot, lettuce, soybean, and sweetcorn. The dataset for this study is sourced from the US Environmental Protection Agency (EPA) and is available at EPA’s Biochar Dataset. To gain a comprehensive understanding of the dataset and its implications, it is highly recommended to review two pertinent research papers linked on the dataset page. These papers not only provide valuable background information on the studies conducted but also offer critical insights and methodologies for data analysis that may be beneficial for this project.

Research Goals

The primary aim of this project is to analyse the impact of biochar on plant yield and identify the three most significant nutrients that influence human health. Your task is to:

Determine whether biochar treatments vary in effectiveness across the different crops.
Provide evidence-based recommendations on how to tailor biochar application for each specific crop to optimise the production of nutrients beneficial to human health and achieve the best possible yield.

In the Introduction section, it is important to justify the selection of the three nutrients you will focus on, explaining their importance to human nutrition. Through detailed data analysis, this project seeks to offer actionable insights on biochar application strategies that enhance both the nutritional value and the biomass of the crops by the end of their growth period.

Question 7: Miscellaneous*

Objectives

For each line of the script, below, write an English explanation for what the code does.

ggplot(points, aes(x = group, y = count)) +
  geom_boxplot(aes(colour = group), size = 1, outlier.colour = NA) +
  geom_point(position = position_jitter(width = 0.2), alpha = 0.3) +
  facet_grid(group ~ ., scales = "free") +
  labs(x = "", y = "Number of data points") +
  theme(
    legend.position = "none",
    strip.background = element_blank(),
    strip.text = element_blank()
  )

Using the rnorm() function, generate some fictitious data that can be plotted using the code, above. Make sure to assemble these data into a dataframe suitable for plotting, complete with correct column titles.
Apply the code exactly as stated to the data to demonstrate your understanding of the code and convince the examiner of your understanding of the correct data structure.

Part B (0.65)

Assessment Criteria

Your responses to Part B will be evaluated based on the following criteria:

Technical Accuracy (50%)
- Correct application of data analyses and statistical methods. Statistical tests should address the hypotheses within what was taught in this module. The assessor recognises that students will not have access to the level of statistical knowledge and experience a research statistician might have. For example, in places, linear mixed models might be more suited to the questions, but students only were taught the basics about ANOVA and simple linear model (relatively simple designs). When non-parametric alternatives are required, I’ll assign marks to the any statement that suggests the correct test to use, but not actually marks the execution of those tests.
- Use of appropriate R packages, functions, and syntax (code style and liberal commenting)
- Appropriate choice and justification of techniques, including due consideration for the assumptions of the methods used
- Accurate calculations and results interpretation, down to small details such as how many decimal places to use
Depth of Analysis (20%)
- Comprehensive exploration of the problem
- Insightful interpretation of results
- Consideration of shortfalls in the analysis (due to data limitations, assumptions, etc.), and suggestions for improvement
- Application of out-of-the-box thinking to the problem
Clarity and Communication (20%)
- Logical organisation of ideas, including clear section headings and subheadings
- Clear and concise explanations at each stage of the analysis
- Effective use of publication quality visualisations where appropriate, including all necessary annotations
- Communication of results in a way that is appropriate for a scientific audience (e.g. a journal article)
Critical Thinking Shown in Final Conclusion/Synopsis (10%)
- Discussion of the findings in the context of the problem (add ecological context, etc., as you deem necessary)
- Identification of limitations
- Discussion of assumptions
- Consideration of broader implications

The marks indicated for each task reflect the relative weight and expected depth of your response. Focus on demonstrating both technical proficiency and conceptual understanding in your answers.

Background

These data represent the aerial cover of kelp canopy in South Africa, as measured by Landsat satellites, for the period 1984 to 2024 at a quarterly interval. The intention is to understand the spatio-temporal patterns in kelp canopy cover and to explore how these patterns may be related to coastal sections and biogeographical provinces.

You are provided with the following files at the Google Drive link emailed to you:

A table of 58 coastal sections (58_sections.csv) that partitions the South African coastline into approximately 50 km intervals. Each section is defined by a single coordinate point (latitude, longitude) representing the boundary of the section.
A table of the biogeographical provinces (bioregions.csv) that the 58 coastal sections fall within. There is one row for each of the 58 sections. For this exercise, the biogeographical classification by Professor John Bolton is of interest.
A netCDF file (kelpCanopyFromLandsat_SouthAfrica_v04.nc). The netCDF file contains satellite-derived measurements of kelp canopy area across the South African coastline from 1984 to 2024, sampled quarterly. Each observation corresponds to a grid cell at a specific time point.

Task 1: Initial Processing

[Task Weight: 10%]
[Components (1) and (2) marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]

Read the kelp canopy area, time, location (latitude/longitude), and satellite pass data from the NetCDF file. Once unpacked, it contains over 5 million rows. Your processing workflow will include:

extracting data from the netCDF file where area and passes are variables defined over 3D space (longitude, latitude, and time); and
using functions such as tidync::hyper_tibble() or ncdf4::ncvar_get() to read these values.

Answer

Note to assessor: Students may have used any of a number of NetCDF targeted packages, such as tidync, stars, or terra. Below I use ncdf4.

library(tidyverse)
library(ncdf4)
library(geosphere)
library(mgcv)

nc_path <- here::here("data", "BCB744", "Kelpwatch", "")
nc_file <- paste0(nc_path, "kelpCanopyFromLandsat_SouthAfrica_v04.nc")
nc <- nc_open(nc_file)

area <- ncvar_get(nc, "area")
time <- ncvar_get(nc, "time")
year <- ncvar_get(nc, "year")
quarter <- ncvar_get(nc, "quarter")
latitude <- ncvar_get(nc, "latitude")
longitude <- ncvar_get(nc, "longitude")
passes <- ncvar_get(nc, "passes")

nc_close(nc)

POSIX timestamps rather than raw seconds:

time <- as.POSIXct(time, origin = "1970-01-01", tz = "UTC")

Create index vectors:

time_idx <- seq_along(time) # 1…ntime
loc_idx <- seq_along(latitude) # 1…nloc

Restructure the data into a data.table or data.frame:

the data should have six columns: longitude, latitude, year, quarter, area, and passes;
each row should correspond to a unique pixel in space-time (i.e., one location at one time point); and
note that the time variable in the netCDF file is in numeric format (e.g., days since origin, where origin = "1970-01-01"), so you’ll have to convert it to POSIX timestamps using appropriate tools (e.g., as.POSIXct()).

If you are unable to read the NetCDF file, you may request access to a processed version of this file (in long CSV format) from me, but you’ll be penalised by 10% if you do so.

Answer

Note to assessor: Find some evidence for the successful execution of the above instructions such as the presence of the required dataframe columns, correct conversion of the time variable, and so on.

Cartesian join of (time_idx × loc_idx), in an order that matches how as.vector() will flatten a \(nloc × ntime\) matrix. Then, add the flattened area and the six columns and select the required variables and make a long:

Create Cartesian product of indices

long_df <- crossing(
  time_idx = time_idx,
  loc_idx = loc_idx
) |>
  mutate(
    area = as.vector(area),
    time = time[time_idx],
    year = year[time_idx],
    quarter = quarter[time_idx],
    latitude = latitude[loc_idx],
    longitude = longitude[loc_idx],
    passes = passes[loc_idx]
  ) |>
  dplyr::select(area, time, year, quarter, latitude, longitude, passes)

summary(long_df)

      area              time                          year         quarter     
 Min.   :  0.0     Min.   :1984-04-01 00:00:00   Min.   :1984   Min.   :1.000  
 1st Qu.:  0.0     1st Qu.:1996-01-01 00:00:00   1st Qu.:1996   1st Qu.:1.000  
 Median :  0.0     Median :2005-07-01 00:00:00   Median :2005   Median :2.000  
 Mean   :190.3     Mean   :2005-05-24 09:13:06   Mean   :2005   Mean   :2.477  
 3rd Qu.:324.0     3rd Qu.:2015-01-01 00:00:00   3rd Qu.:2015   3rd Qu.:3.000  
 Max.   :900.0     Max.   :2024-04-01 00:00:00   Max.   :2024   Max.   :4.000  
 NA's   :1061652                                                               
    latitude        longitude         passes     
 Min.   :-34.83   Min.   :14.83   Min.   :0.000  
 1st Qu.:-34.66   1st Qu.:18.12   1st Qu.:1.000  
 Median :-34.35   Median :18.48   Median :1.000  
 Mean   :-33.42   Mean   :18.55   Mean   :1.399  
 3rd Qu.:-32.98   3rd Qu.:19.40   3rd Qu.:2.000  
 Max.   :-25.44   Max.   :19.97   Max.   :4.000

Task 2: Exploratory Data Analysis

[Task Weight: 10%]
[Tasks 2.1, 2.2, and 2.3, each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 10%]

2.1 Weighted Mean Time Series

For each year and quarter combination:

compute the weighted mean of the kelp canopy area across all locations, using the number of satellite passes as weights;
exclude observations where passes = 0 or area is NA; and
plot the resulting time series of weighted mean kelp area, using i) quarters on the x-axis, and ii) a continuous time index from 1984–2024.

Answer

Note to assessor: If the student produced the figure exactly as I have it below, this question can get full marks. Else, assess the analysis workflow and assign marks in accordance with the portions of the script that are correctly executed.

Filter out missing values and zero passes at the start

quarter_means <- long_df |>
  filter(!is.na(area), !is.na(passes)) |>
  group_by(year, quarter) |>
  summarise(
    weighted_area = weighted.mean(area, w = passes, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(!is.na(weighted_area))

summary(quarter_means)

      year         quarter      weighted_area  
 Min.   :1984   Min.   :1.000   Min.   :  0.0  
 1st Qu.:1996   1st Qu.:1.000   1st Qu.:145.8  
 Median :2005   Median :2.000   Median :186.0  
 Mean   :2005   Mean   :2.477   Mean   :185.1  
 3rd Qu.:2014   3rd Qu.:3.000   3rd Qu.:232.1  
 Max.   :2024   Max.   :4.000   Max.   :321.2

library(tidyverse)

Filter out invalid observations

quarter_means <- long_df |>
  filter(!is.na(area), !is.na(passes), passes > 0) |>
  group_by(year, quarter) |>
  summarise(
    weighted_area = weighted.mean(area, w = passes, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(!is.na(weighted_area)) |>
  mutate(
    quarter = as.integer(quarter),
    time_index = year + (quarter - 1) / 4
  )

Plot quarterly weighted mean time series

ggplot(quarter_means, aes(x = time_index, y = weighted_area)) +
  geom_line(color = "steelblue", size = 0.8) +
  geom_point(color = "black", size = 0.6) +
  labs(
    title = "Quarterly Weighted Mean Kelp Area (1984–2024)",
    x = "Year (quarterly intervals)",
    y = "Weighted Mean Canopy Area (m²)"
  ) +
  theme_minimal()

Compute the weighted mean area at each unique (longitude, latitude) pixel across time. Then:

select a random sample of 100 pixels;
for each sampled pixel, extract the full time series of weighted mean area;
plot all 100 time series in a single panel (overlayed), using semi-transparent lines; and
label axes appropriately.

Answer

Note to assessor: Again, if they produced the figure exactly as below, this question can get full marks. Else, assess the analysis workflow and assign marks in accordance with the portions of the script that are correctly executed.

pixel_means <- long_df |>
  filter(!is.na(area), !is.na(passes), passes > 0) |>
  group_by(year, quarter, longitude, latitude) |>
  summarise(
    weighted_area = weighted.mean(area, w = passes, na.rm = TRUE),
    .groups = "drop"
  )

summary(pixel_means)

      year         quarter        longitude        latitude     
 Min.   :1984   Min.   :1.000   Min.   :15.08   Min.   :-34.83  
 1st Qu.:1997   1st Qu.:2.000   1st Qu.:18.33   1st Qu.:-34.67  
 Median :2006   Median :3.000   Median :18.84   Median :-34.36  
 Mean   :2006   Mean   :2.503   Mean   :18.64   Mean   :-33.62  
 3rd Qu.:2016   3rd Qu.:3.000   3rd Qu.:19.41   3rd Qu.:-33.95  
 Max.   :2024   Max.   :4.000   Max.   :19.97   Max.   :-26.48  
 weighted_area  
 Min.   :  0.0  
 1st Qu.:  0.0  
 Median :  0.0  
 Mean   :196.1  
 3rd Qu.:360.0  
 Max.   :900.0

Step 1: Calculate weighted mean per pixel over time

pixel_means <- long_df |>
  filter(!is.na(area), !is.na(passes), passes > 0) |>
  group_by(year, quarter, longitude, latitude) |>
  summarise(
    weighted_area = weighted.mean(area, w = passes, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(time_index = year + (quarter - 1) / 4)

Step 2: Identify 100 unique (lon, lat) pairs

set.seed(42)
sample_pixels <- pixel_means |>
  distinct(longitude, latitude) |>
  sample_n(100)

Step 3: Filter time series for the sampled pixels

pixel_timeseries_sample <- pixel_means |>
  inner_join(sample_pixels, by = c("longitude", "latitude"))

Step 4: Plot all 100 pixel time series

ggplot(
  pixel_timeseries_sample,
  aes(
    x = time_index,
    y = weighted_area,
    group = interaction(longitude, latitude)
  )
) +
  geom_line(alpha = 0.3, color = "darkorange") +
  labs(
    title = "Kelp Area Time Series for 100 Random Pixels",
    x = "Year (quarterly intervals)",
    y = "Weighted Mean Canopy Area (m²)"
  ) +
  theme_minimal()

This figure is uninterpretable…

2.2 Summary Statistics

Using the weighted data prepared for each year and quarter combination (prepared in 2.1.1), compute and report summary statistics for the levels of temporal aggregation:
- by year;
- by quarter;
- by year/quarter combination;

include: weighted mean, median, standard deviation, interquartile range, skewness, and kurtosis; and
comment on the appropriateness of each statistic for these data, and justify your choices in light of the data distribution.

Answer

Note to assessor: Since this question can be objectively assessed relative to the expectations (specific reporting of correct summary stats per each level of aggregation as I have them below), marks can simply be assigned based on the correct reporting of the summary statistics.

library(e1071) # for skewness and kurtosis

Yearly aggregation

year_stats <- quarter_means |>
  group_by(year) |>
  summarise(
    mean = mean(weighted_area, na.rm = TRUE),
    median = median(weighted_area, na.rm = TRUE),
    sd = sd(weighted_area, na.rm = TRUE),
    iqr = IQR(weighted_area, na.rm = TRUE),
    skew = skewness(weighted_area, na.rm = TRUE),
    kurt = kurtosis(weighted_area, na.rm = TRUE)
  )
year_stats

# A tibble: 41 × 7
        year  mean median    sd   iqr     skew   kurt
   <int[1d]> <dbl>  <dbl> <dbl> <dbl>    <dbl>  <dbl>
 1      1984  54.1   13.6  81.2  73.2   0.375   -2.33
 2      1985  77.4   77.4 101.   71.4   0       -2.75
 3      1986 146.   146.   NA     0   NaN      NaN   
 4      1987  66.3   66.3  NA     0   NaN      NaN   
 5      1988 133.   133.  188.  133.    0       -2.75
 6      1989 108.    96.3  79.7  91.8   0.250   -2.04
 7      1990 171.   173.   41.1  31.5  -0.0806  -1.89
 8      1991 178.   185.   43.8  50.2  -0.272   -2.03
 9      1992 213.   240.  124.  152.   -0.329   -2.05
10      1993 131.   135.   42.8  41.1  -0.191   -1.95
# ℹ 31 more rows

Quarterly aggregation (across all years)

quarter_stats <- quarter_means |>
  group_by(quarter) |>
  summarise(
    mean = mean(weighted_area, na.rm = TRUE),
    median = median(weighted_area, na.rm = TRUE),
    sd = sd(weighted_area, na.rm = TRUE),
    iqr = IQR(weighted_area, na.rm = TRUE),
    skew = skewness(weighted_area, na.rm = TRUE),
    kurt = kurtosis(weighted_area, na.rm = TRUE)
  )
quarter_stats

# A tibble: 4 × 7
  quarter  mean median    sd   iqr   skew    kurt
    <int> <dbl>  <dbl> <dbl> <dbl>  <dbl>   <dbl>
1       1  216.   220.  66.6  83.4 -0.898  0.922 
2       2  185.   204.  74.2 114.  -0.361 -0.598 
3       3  172.   179.  71.2  85.6 -0.326 -0.0222
4       4  166.   166.  49.7  46.2  0.177  1.32

Year-quarter combination (already in quarter_means)

year_quarter_stats <- quarter_means |>
  mutate(year_quarter = paste0(year, " Q", quarter)) |>
  summarise(
    mean = mean(weighted_area, na.rm = TRUE),
    median = median(weighted_area, na.rm = TRUE),
    sd = sd(weighted_area, na.rm = TRUE),
    iqr = IQR(weighted_area, na.rm = TRUE),
    skew = skewness(weighted_area, na.rm = TRUE),
    kurt = kurtosis(weighted_area, na.rm = TRUE)
  )
year_quarter_stats

# A tibble: 1 × 6
   mean median    sd   iqr   skew     kurt
  <dbl>  <dbl> <dbl> <dbl>  <dbl>    <dbl>
1  185.   186.  68.3  86.2 -0.323 -0.00284

Create visualisations (e.g. boxplots, violin plots, histograms) to support your interpretations.

Answer

Note to assessor: These or similar figures are required, showing the seasonal effects, the annual effect, a histogram of the frequency of the weighted areas, and something similar to the density distribution (or a frequency histogram) that shows distribution within a quarter. I’d also accept variations of these plots, as long as they speak to the nature of the data being assessed. They need to capture the feature of the summary statistics that I have above.

Boxplot by quarter

ggplot(quarter_means, aes(x = factor(quarter), y = weighted_area)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Kelp Canopy Area by Quarter",
    x = "Quarter",
    y = "Weighted Area"
  ) +
  theme_minimal()

Boxplot by year

ggplot(quarter_means, aes(x = factor(year), y = weighted_area)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Kelp Canopy Area by Year", x = "Year", y = "Weighted Area") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Histogram of all weighted areas

ggplot(quarter_means, aes(x = weighted_area)) +
  geom_histogram(bins = 30, fill = "tan", color = "black") +
  labs(
    title = "Histogram of Weighted Kelp Area",
    x = "Weighted Area",
    y = "Frequency"
  ) +
  theme_minimal()

Density plot by quarter

ggplot(quarter_means, aes(x = weighted_area, color = factor(quarter))) +
  geom_density() +
  labs(
    title = "Density of Weighted Area by Quarter",
    x = "Weighted Area",
    color = "Quarter"
  ) +
  theme_minimal()

Based on these, discuss any discernible temporal trends (e.g. decadal increases/decreases) and seasonal patterns (quarterly effects).

Answer

Note to assessor: Some variation of this:

Temporal variation in kelp canopy area, as observed in the quarterly weighted means from 1984 to 2024, exhibits a broadly increasing trend until the early-2000s, followed by more irregular fluctuations at a higher weighted canopy area than at the start of the time series. The yearly summary statistics confirm this, and show rising means and medians through the late 1990s into the early 2000s, peaking during the 2010s. Although values for some early years (e.g., 1986, 1987) suffer from missing data (e.g., NA for SD and skewness), the central decades present coherent patterns.

The skewness values tend to hover close to zero, with mild left-skew in later years, suggesting distributions that are not heavily distorted but do include more frequent lower outliers in certain quarters. The kurtosis estimates, consistently below 0 across many years, indicate relatively flat distributions with light tails, again pointing to the presence of moderate, widespread values and a paucity of extreme outliers in the tails.

The quarterly summaries reveal a seasonal signal: Q1 (January–March) is associated with the highest mean kelp area (216 km²), while Q4 (October–December) shows the lowest (166 km²). This likely reflects seasonal growth dynamics, perhaps linked to photoperiod, wave exposure, or nutrient regimes (factors known to modulate canopy) forming macroalgae. The density plot by quarter reinforces this picture: Q1’s density curve is skewed leftward, indicative of a modal concentration around high values, whereas Q4 shows more compact distributions with lower peaks.

The boxplots by year reinforce this temporal variation. Inter-annual variation is relatively low during some periods (e.g., 1990s) but expands considerably in the 2010s and 2020s, suggesting a greater degree of spatial heterogeneity or more frequent episodes of extreme coverage. Notably, the increase in spread does not always correspond with a decline in median or mean area, implying that while certain coastal sectors may experience loss, others persist or expand.

The histogram of weighted area suggests a right-skewed global distribution when pooled over time, driven by many instances of low or zero canopy and fewer but notable high-coverage events.

So, these patterns suggest:

A long-term trend toward increased kelp canopy extent over the study period, potentially stabilising or fragmenting in recent years;
Clear intra-annual (seasonal) variation, with Q1 consistently supporting greater canopy development than later quarters;
Increasing variability over time, which might reflect changing climate regimes, hydrodynamic forcing, or anthropogenic disturbance gradients.

These findings provide a strong descriptive foundation for the inferential modelling that follows.

2.3 Observation Density Map

Create a map plotting each observed pixel location (defined by longitude × latitude):

colour each pixel by the total number of valid observations (i.e., non-NA values of area) across all time points;
overlay the 58 coastal sections as reference points or lines, numbered from west (1) to east (58); and
use an appropriate geographic projection and include a legend.

Answer

Note to assessor: I’m not concerned too much about having a coastline as the data clearly show the coastal profile, but, of course, having the coastline as well is great too (maybe worth a mark or three extra). I’d also be okay if the figure focuses only on sections 1-22 where kelp is present. Again, this is a fairly objective question, so marks can be assigned based on the correct figure (full marks for all the labels, etc., correctly applied as per standards of scientific publications).

library(ggrepel) # for labelling points
library(viridis) # for color scale

Load the long_df and coastal sections Assume long_df has already been created from earlier tasks Load the 58 sections file (replace with actual path if needed)

sections_df <- read.csv(here::here(
  "data",
  "BCB744",
  "Kelpwatch",
  "58_sections.csv"
))

Count valid observations per pixel

obs_counts <- long_df |>
  filter(!is.na(area)) |>
  group_by(longitude, latitude) |>
  summarise(
    n_obs = n(),
    .groups = "drop"
  )

Prepare coastal sections data

sections <- sections_df |>
  mutate(section_id = row_number())

Plot

ggplot() +
  geom_point(
    data = obs_counts,
    shape = 4,
    aes(x = longitude, y = latitude, colour = n_obs)
  ) +
  scale_colour_viridis(name = "Valid\nObs.", option = "plasma") +
  geom_point(
    data = sections,
    color = "black",
    size = 1.1,
    shape = 1,
    aes(x = Longitude, y = Latitude)
  ) +
  geom_text_repel(
    data = sections,
    size = 2,
    color = "black",
    max.overlaps = 58,
    aes(x = Longitude, y = Latitude, label = section_id)
  ) +
  coord_fixed(1.3) +
  labs(
    title = "Observation Density Map",
    subtitle = "Total no. valid kelp area obs. per pixel (1984–2024)",
    x = "Longitude",
    y = "Latitude"
  ) +
  theme_minimal()

Task 3: Assigning Kelp Observations to Coastal Sections

[Task Weight: 20%]
[Tasks 4.1 and 4.2 each marked on a 0–100 scale, then scaled in the proportion 0.7 and 0.3 of the Task Weight of 20%]

Using the data prepared above, your task now is to spatially classify each kelp canopy observation by assigning it to two types of geographic units.

3.1 Assignment to Coastal Sections

You are provided with a table of 58 coastal sections, each defined by a single geographic coordinate (Latitude and Longitude). These points mark successive ~50 km intervals along the South African coastline, numbered from west (1) to east (58).

Assign each kelp canopy observation to the nearest coastal section based on geographic proximity:

use a geodesic (great-circle) distance metric to compute proximity between kelp sampling points and section coordinates (assume all coordinates are in WGS84);
add a new column to your kelp dataset called section_id, indicating the row number (1–58) of the nearest section; and
you may use any R packages or methods you like, but your code should be efficient and well-commented.

Answer

Note to assessor: Check for objective evidence that a dataframe has been created and that it has the correct columns, i.e., section_id or section, longitude, latitude, year, quarter, and area or weighted_area. Calculations must show evidence of using the Haversine distance metric.

I use only the relevant sections:

sections_df <- sections_df |>
  slice(1:22) # no Ecklonia further east of section 22

Prepare a script for sectioning the kelp data:

processed_file_path <- here::here(
  "data",
  "BCB744",
  "Kelpwatch",
  "pixel_means.RData"
)

if (file.exists(processed_file_path)) {
  message("Loading pre-processed pixel_means from: ", processed_file_path)
  load(processed_file_path)
  message("Successfully loaded pixel_means with sections.")
} else {
  message("Processing and assigning sections to pixel_means...")
}

Check if the processed file exists and load it

if (file.exists(processed_file_path)) {
  message("Loading pre-processed pixel_means from: ", processed_file_path)
  load(processed_file_path)
  message("Successfully loaded pixel_means with sections.")
} else {
  message("Processing and assigning sections to pixel_means...")

  # Create section assignments
  section_coords_matrix <- as.matrix(sections_df[, c("Longitude", "Latitude")])

  find_closest_section_idx <- function(kelp_lon, kelp_lat, all_section_coords) {
    if (is.na(kelp_lon) || is.na(kelp_lat)) {
      return(NA_integer_)
    }
    current_kelp_coord <- c(kelp_lon, kelp_lat)
    distances <- distHaversine(p1 = current_kelp_coord, p2 = all_section_coords)
    return(which.min(distances))
  }

  # Apply the function to each kelp data point
  assigned_section_indices <- mapply(
    find_closest_section_idx,
    kelp_lon = pixel_means$longitude,
    kelp_lat = pixel_means$latitude,
    MoreArgs = list(
      all_section_coords = section_coords_matrix
    ),
    SIMPLIFY = TRUE
  )

  # Add section column to pixel_means
  # pixel_means2[, section := assigned_section_indices]
  pixel_means$section <- assigned_section_indices

  # Ensure the data directory exists
  if (!dir.exists(dirname(processed_file_path))) {
    dir.create(dirname(processed_file_path), recursive = TRUE)
  }

  # Save the processed data
  save(pixel_means, file = processed_file_path)
  message("Processing complete. Data saved to: ", processed_file_path)
}

head(pixel_means)

# A tibble: 6 × 7
       year   quarter longitude  latitude weighted_area time_index section
  <int[1d]> <int[1d]> <dbl[1d]> <dbl[1d]>         <dbl>  <dbl[1d]>   <int>
1      1984         2      15.1     -26.6             0      1984.       1
2      1984         2      15.1     -26.6             0      1984.       1
3      1984         2      15.1     -26.6             0      1984.       1
4      1984         2      15.1     -26.6             0      1984.       1
5      1984         2      15.1     -26.6             0      1984.       1
6      1984         2      15.1     -26.6             0      1984.       1

3.2 Assignment to Biogeographical Provinces

You are also provided with a table that maps each coastal section (1–58) to a biogeographical province, based on a classification by Professor John Bolton.

Using your previous assignment of each kelp observation to a section_id, add a second column called bioregion_id that indicates which biogeographical province the observation falls within.
Your final kelp dataset should contain the following key columns (alongside the original data):
- longitude, latitude
- year, quarter, area, passes
- section_id (integer 1–58)
- bioregion_id (character or factor)
Include your full, annotated R code that performs both spatial assignments into your resultant .html document. Your method should be reproducible, and your code should be easy to follow. Print the head() and tail() of your final dataset, and include a summary() of the data.

Answer

Note to assessor: Check for objective evidence that a dataframe has been created and that it has the correct columns, i.e., section_id or section, longitude, latitude, year, quarter, area or weighted_area, and also bioregion or bolton or some unique identifier for the bioregion.

I used a simple merge to assign bioregions to the kelp data, but there are other ways to do it. Again, we want objective evidence that the data have been assigned correctly (e.g. the head() and tail() of the data, or a summary()).

bioreg <- read.csv(here::here(
  "data",
  "BCB744",
  "Kelpwatch",
  "bioregions.csv"
)) |>
  mutate(section = row_number()) # Make rownames explicit as 'section'

Merge into pixel_means (assumed already loaded in environment)

pixel_means <- pixel_means |>
  left_join(bioreg, by = "section") |>
  dplyr::select(-spal.prov, -spal.ecoreg, -lombard)

The question asks for passes to be present in the data, but I’m okay with omitting it as it is included in the weighted area calculation

head(pixel_means)

# A tibble: 6 × 8
       year   quarter longitude latitude weighted_area time_index section bolton
  <int[1d]> <int[1d]> <dbl[1d]> <dbl[1d>         <dbl>  <dbl[1d]>   <int> <chr> 
1      1984         2      15.1    -26.6             0      1984.       1 BMP   
2      1984         2      15.1    -26.6             0      1984.       1 BMP   
3      1984         2      15.1    -26.6             0      1984.       1 BMP   
4      1984         2      15.1    -26.6             0      1984.       1 BMP   
5      1984         2      15.1    -26.6             0      1984.       1 BMP   
6      1984         2      15.1    -26.6             0      1984.       1 BMP

tail(pixel_means)

# A tibble: 6 × 8
       year   quarter longitude latitude weighted_area time_index section bolton
  <int[1d]> <int[1d]> <dbl[1d]> <dbl[1d>         <dbl>  <dbl[1d]>   <int> <chr> 
1      2024         2      20.0    -34.8             0      2024.      22 AMP   
2      2024         2      20.0    -34.8             0      2024.      22 AMP   
3      2024         2      20.0    -34.8             0      2024.      22 AMP   
4      2024         2      20.0    -34.8             0      2024.      22 AMP   
5      2024         2      20.0    -34.8             0      2024.      22 AMP   
6      2024         2      20.0    -34.8             0      2024.      22 AMP

summary(pixel_means)

      year         quarter        longitude        latitude     
 Min.   :1984   Min.   :1.000   Min.   :15.08   Min.   :-34.83  
 1st Qu.:1997   1st Qu.:2.000   1st Qu.:18.33   1st Qu.:-34.67  
 Median :2006   Median :3.000   Median :18.84   Median :-34.36  
 Mean   :2006   Mean   :2.503   Mean   :18.64   Mean   :-33.62  
 3rd Qu.:2016   3rd Qu.:3.000   3rd Qu.:19.41   3rd Qu.:-33.95  
 Max.   :2024   Max.   :4.000   Max.   :19.97   Max.   :-26.48  
 weighted_area     time_index      section         bolton         
 Min.   :  0.0   Min.   :1984   Min.   : 1.00   Length:5446570    
 1st Qu.:  0.0   1st Qu.:1998   1st Qu.:15.00   Class :character  
 Median :  0.0   Median :2007   Median :19.00   Mode  :character  
 Mean   :196.1   Mean   :2007   Mean   :15.82                     
 3rd Qu.:360.0   3rd Qu.:2016   3rd Qu.:20.00                     
 Max.   :900.0   Max.   :2024   Max.   :22.00

Task 4: Inferential Statistics (Part 2)

[Task Weight: 30%]
[Tasks 5.1, 5.2, 5.3, 5.4, and 5.5 each marked on a 0–100 scale, then scaled to equal proportions of the Task Weight of 30%]

You are now asked to evaluate a series of research questions concerning the spatial and temporal structure of kelp canopy area. These questions are to be answered using the kelp dataset that has already been processed to include both section_id and bioregion_id. Use the weighted kelp canopy area (area, weighted by passes) as your response variable throughout – you should have already prepared this dataset in Task 2.

You may use ANOVAs and/or linear models. In each case you must clearly state your hypotheses, justify your choice of model, and interpret your findings both statistically and ecologically.

4.1 Spatial Differences Between Coastal Sections

Question: Is there a statistically significant difference in mean kelp canopy area between coastal sections?

Answer

Note to Assessor (applies to Questions 5.1–5.5): It is understood that students are not expected to possess the depth of statistical training characteristic of professional statisticians. While some questions may, in principle, warrant more advanced treatments (e.g. linear mixed-effects models), the scope of instruction in this module was limited to foundational techniques—namely, simple linear models and ANOVA for relatively straightforward designs. If students applied LMEs or other more complicated models without demonstrating a clear understanding for why they did so, give them zero percent for that answer. Where assumptions of parametric tests are violated, students should identify suitable non-parametric alternatives. Marks will be awarded for correctly identifying such alternatives, even if these tests are not implemented in code.

Hypotheses:

Null (H₀): There is no difference in mean kelp canopy area across coastal sections.
Alternative (H₁): Mean kelp canopy area differs among at least one pair of coastal sections.

Justification: A one-way ANOVA is appropriate for testing whether the means of a continuous response variable (kelp area) differ across the levels of a single categorical factor (coastal section).

library(lmtest)
library(car)

Set section and bioreg as a factor as this is needed for the tests that will assess the section and bioreg effects

pixel_means <- pixel_means |>
  mutate(section = as.factor(section), bolton = as.factor(bolton))

model_5.1 <- aov(weighted_area ~ section, data = pixel_means)
summary(model_5.1)

                 Df    Sum Sq   Mean Sq F value Pr(>F)    
section          18 2.020e+10 1.122e+09   13252 <2e-16 ***
Residuals   5446551 4.612e+11 8.467e+04                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Also test assumptions… Residuals

resid_5.1 <- sample(residuals(model_5.1), 5000)

Normality

shapiro.test(resid_5.1) # Subsample for tractability


    Shapiro-Wilk normality test

data:  resid_5.1
W = 0.80962, p-value < 2.2e-16

qqnorm(resid_5.1)
qqline(resid_5.1)

Homogeneity of variances

leveneTest(weighted_area ~ section, data = pixel_means)

Levene's Test for Homogeneity of Variance (center = median)
           Df F value    Pr(>F)    
group      18   14154 < 2.2e-16 ***
      5446551                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residuals vs fitted

plot(model_5.1, which = 1)

One could do a TukeyHSD test, but the question did not explicitly ask for this. Add two marks if correctly given and executed.

Results Mean kelp canopy area varied significantly across the 19 coastal sections assessed (F\({18,5446551}\) = 13,252, p < 0.001; one-way ANOVA). Residual diagnostics indicated substantial departures from normality (Shapiro–Wilk W = 0.81, p < 0.001) and non-constant variance across groups (Levene’s test: F\({18,5446551}\) = 14,154, p < 0.001), though the extremely large sample size renders the ANOVA robust to these violations. The fitted model accounted for a substantial proportion of the spatial variance in kelp canopy area. Residual plots suggested heteroscedasticity consistent with spatial heterogeneity in canopy distributions.

To provide a fully defensible hypothesis test (considering both assumptions were violated), do a Kruskal–Wallis rank sum test. This is the standard non-parametric alternative to one-way ANOVA when comparing more than two independent groups (here, coastal sections). It tests whether the distributions of kelp canopy area differ across sections without assuming normality or equal variances.

4.2 Spatial Differences Between Biogeographical Provinces

Question: Is there a statistically significant difference in mean kelp canopy area between biogeographical provinces?

Answer

Hypotheses:

Null (H₀): Mean kelp canopy area is not different amongst bioregions.
Alternative (H₁): Mean kelp canopy area differs between bioregions.

Justification: Here, bioregion is treated as a categorical predictor, suitable for a one-way ANOVA to test inter-bioregional differences. The weighted kelp area is the continuous response variable. Assumptions of normality and heteroscedasticity hold.

model_5.2 <- aov(weighted_area ~ bolton, data = pixel_means)
summary(model_5.2)

                 Df    Sum Sq   Mean Sq F value Pr(>F)    
bolton            2 6.897e+09 3.448e+09   39585 <2e-16 ***
Residuals   5446567 4.745e+11 8.711e+04                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Also test assumptions…

resid_5.2 <- sample(residuals(model_5.2), 5000)

shapiro.test(resid_5.2)


    Shapiro-Wilk normality test

data:  resid_5.2
W = 0.74748, p-value < 2.2e-16

qqnorm(resid_5.2)
qqline(resid_5.2)

leveneTest(weighted_area ~ bolton, data = pixel_means)

Levene's Test for Homogeneity of Variance (center = median)
           Df F value    Pr(>F)    
group       2   39585 < 2.2e-16 ***
      5446567                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(model_5.2, which = 1)

Results A one-way ANOVA revealed significant differences in mean kelp canopy area between the three biogeographical provinces (BMP, SWP, AMP) (F\({2,5446567}\) = 39,585, p < 0.001). Residual analysis again revealed non-normal error distribution (Shapiro–Wilk W = 0.75, p < 0.001) and unequal variances among provinces (Levene’s test: F\({2,5446567}\) = 39,585, p < 0.001). Despite these assumption violations, the effect was large and consistent with substantial regional differentiation in canopy extent. Residuals from the model were symmetrically distributed but with longer tails in high-variance provinces.

As before, a Kruskal–Wallis rank sum test is recommended. Again suitable, since the bioregion variable has three independent levels. This test evaluates whether the central tendency of canopy area differs among provinces without assuming parametric conditions.

4.3 Interaction Between Section and Province

Question: Is there an interaction between coastal section and biogeographical province in explaining variation in kelp canopy area?

Answer

Hypotheses:

H₀₁ (Main effect – Section): There are no differences in mean kelp canopy area across coastal sections.
H₀₂ (Main effect – Province): There are no differences in mean kelp canopy area across biogeographical provinces.
H₀₃ (Interaction): The effect of coastal section on kelp canopy area is the same across all provinces (i.e., no interaction).
H₁ (Alternative): At least one main effect is non-zero, or there is a significant interaction (i.e., the influence of section depends on the province).

A two-way ANOVA with interaction is appropriate here because:

Both predictors, i.e., section (a factor) and province (bolton, also a factor), are categorical.
The response variable (weighted_area) is continuous and assumed to meet the assumptions of normality and homogeneity of variance.
The interaction term allows us to assess whether the spatial granularity of section varies in importance across broader-scale bioregions.

This model allows us to partition variation in canopy area into hierarchical (or cross-cutting) spatial components.

Two-way ANOVA with interaction

model_5.3 <- aov(weighted_area ~ section * bolton, data = pixel_means)

Summary of the model

summary(model_5.3)

                 Df    Sum Sq   Mean Sq F value Pr(>F)    
section          18 2.020e+10 1.122e+09   13252 <2e-16 ***
Residuals   5446551 4.612e+11 8.467e+04                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Also test assumptions…

resid_5.3 <- sample(residuals(model_5.3), 5000)

shapiro.test(resid_5.3)


    Shapiro-Wilk normality test

data:  resid_5.3
W = 0.81015, p-value < 2.2e-16

qqnorm(resid_5.3)
qqline(resid_5.3)

car::leveneTest(model_5.3)

Levene's Test for Homogeneity of Variance (center = median)
           Df F value    Pr(>F)    
group      18   14154 < 2.2e-16 ***
      5446551                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(model_5.3, which = 1)

In this case, interaction terms cannot be estimated, because there’s no shared replication across bioregions within sections. The design is not factorial. The solution is to fix mixed models with random and fixed effects.

Results A two-way ANOVA incorporating both section and biogeographical province, including their interaction, confirmed strong spatial structuring of kelp canopy area (F\(_{18,5446551}\) = 13,252, p < 0.001 for the section main effect). However, the interaction term was non-estimable due to the confounded structure of the design — each section belonged uniquely to a single province, precluding factorial replication. As such, a fully crossed design could not be evaluated within a fixed-effects ANOVA framework. Model residuals displayed similar deviations from normality and homogeneity as in previous analyses, but were centered and broadly symmetric.

A non-parametric two-way tests for interaction effects do not exist. Because of the nested design, a fully non-parametric two-way ANOVA is not viable. As an alternative approach, perform separate Kruskal–Wallis tests within each province, or apply a permutation-based two-way ANOVA that accommodates unbalanced and non-normal designs.

4.4 Linear Trend Over Time by Province

Question: Is there a linear trend in kelp canopy area over time, and does the direction or strength of this trend differ between biogeographical provinces?

Answer

Hypotheses:

H₀₁ (Main effect of time): There is no linear trend in kelp canopy area over time.
H₀₂ (Time × Province interaction): The effect (slope) of year is the same across all provinces.
H₁ (Alternative): Kelp canopy area shows a linear trend over time, and the rate or direction of change differs by province.

This is a typical ANCOVA design because:

The continuous predictor year captures temporal trends.
The categorical predictor bolton accounts for regional differences.
The interaction term year × bolton assesses whether temporal slopes differ among provinces.

ANCOVA is ideal for testing whether regression lines have the same slope across groups; here, whether kelp decline/growth rates differ by province.

Aggregate by year and province

province_annual <- pixel_means |>
  group_by(year, bolton) |>
  summarise(mean_area = mean(weighted_area, na.rm = TRUE), .groups = "drop")

Fit the ANCOVA model

model_5.4 <- lm(mean_area ~ year * bolton, data = province_annual)

Show the fitted model summary:

summary(model_5.4)


Call:
lm(formula = mean_area ~ year * bolton, data = province_annual)

Residuals:
     Min       1Q   Median       3Q      Max 
-108.302  -30.177   -3.231   20.679  217.695 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -9706.8982  1442.5276  -6.729 6.83e-10 ***
year                 4.8688     0.7196   6.766 5.70e-10 ***
boltonB-ATZ       4135.0773  2006.6662   2.061 0.041571 *  
boltonBMP         7488.3915  2006.6662   3.732 0.000296 ***
year:boltonB-ATZ    -1.9800     1.0012  -1.978 0.050340 .  
year:boltonBMP      -3.6787     1.0012  -3.674 0.000362 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52.74 on 116 degrees of freedom
Multiple R-squared:  0.7016,    Adjusted R-squared:  0.6888 
F-statistic: 54.56 on 5 and 116 DF,  p-value: < 2.2e-16

A Type III ANOVA, as in the car::Anova() function with type = 3, tests the significance of each effect after accounting for all other effects in the model, including interactions. It asks: “What is the unique contribution of this factor, given that all others (including interactions) are already in the model?”

Anova(model_5.4, type = 3)

Anova Table (Type III tests)

Response: mean_area
            Sum Sq  Df F value    Pr(>F)    
(Intercept) 125927   1 45.2807 6.830e-10 ***
year        127296   1 45.7731 5.702e-10 ***
bolton       38795   2  6.9749  0.001379 ** 
year:bolton  37570   2  6.7548  0.001679 ** 
Residuals   322599 116                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(province_annual, aes(x = year, y = mean_area, color = bolton)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Linear Trends in Kelp Canopy Area by Province",
    y = "Mean Canopy Area per Pixel",
    x = "Year"
  )

Also test assumptions…

resid_5.4 <- residuals(model_5.4)

Normality

shapiro.test(resid_5.4)


    Shapiro-Wilk normality test

data:  resid_5.4
W = 0.93452, p-value = 1.599e-05

qqnorm(resid_5.4)
qqline(resid_5.4)

Homoscedasticity

plot(model_5.4, which = 1)

bptest(model_5.4) # Breusch-Pagan test or any other suitable test


    studentized Breusch-Pagan test

data:  model_5.4
BP = 16.228, df = 5, p-value = 0.006223

Multicollinearity Don’t penalise students for not testing multicollinearity, but add a bonus two marks if included

vif(model_5.4)

                    GVIF Df GVIF^(1/(2*Df))
year        3.138942e+00  1        1.771706
bolton      8.468999e+08  2      170.591747
year:bolton 8.466877e+08  2      170.581063

Independence of residuals

dwtest(model_5.4) # Durbin-Watson test


    Durbin-Watson test

data:  model_5.4
DW = 1.7605, p-value = 0.1114
alternative hypothesis: true autocorrelation is greater than 0

Results The ANCOVA model indicated a significant linear increase in mean kelp canopy area over time across all provinces (F\({1,116}\) = 45.8, p < 0.001), with significant differences in temporal trends among provinces (year × province interaction: F\({2,116}\) = 6.75, p = 0.0017). This suggests that both the direction and magnitude of temporal trends in canopy area are modulated by biogeographical province. Residuals exhibited slight right-skew (Shapiro–Wilk W = 0.93, p < 0.001), and heteroscedasticity was supported by a significant Breusch–Pagan test (BP = 16.23, df = 5, p = 0.006). No evidence of residual autocorrelation was detected (Durbin–Watson DW = 1.76, p = 0.11). Variance inflation factors indicated high collinearity between province and interaction terms, reflecting the nested structure of the provinces over time.

Pragmatic alternative to account for non-normality, heteroscedasticity, and collinearity: perform separate Spearman’s rank correlation tests between year and mean area within each province to assess monotonic trends, then compare slope magnitudes informally or using non-parametric trend tests like Mann–Kendall.

4.5 Seasonal Variation Across Provinces

Question: Does the seasonal pattern in kelp canopy area differ between provinces?

Answer

Hypotheses:

H₀₁ (Main effect of quarter): Mean kelp canopy area does not vary across seasons (quarters).
H₀₂ (Quarter × Province interaction): The pattern of seasonal variation is the same in all provinces — i.e., there is no interaction.
H₁ (Alternative): There are seasonal differences in kelp canopy area, and the shape or magnitude of the seasonal cycle differs by province.

A two-way ANOVA is appropriate here, with:

quarter as a categorical predictor representing seasonal variation.
bolton as a categorical grouping variable for province.
An interaction term (quarter × bolton) to test whether seasonal cycles differ by region.

This model assesses both the amplitude and phase-shift of seasonal dynamics across spatial domains.

Ensure quarter is treated as a categorical variable

pixel_means <- pixel_means |>
  mutate(quarter = as.factor(quarter))

Group by year, quarter, and province to retain replication across years

province_seasonal_replicated <- pixel_means |>
  group_by(year, quarter, bolton) |>
  summarise(mean_area = mean(weighted_area, na.rm = TRUE), .groups = "drop")

Fit two-way ANOVA with interaction

model_5.5 <- aov(
  mean_area ~ quarter * bolton,
  data = province_seasonal_replicated
)

Model summary

summary(model_5.5)

                Df  Sum Sq Mean Sq F value Pr(>F)    
quarter          3   66835   22278   3.061 0.0281 *  
bolton           2 1944753  972377 133.590 <2e-16 ***
quarter:bolton   6  103175   17196   2.362 0.0295 *  
Residuals      423 3078948    7279                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extract residuals

resid_5.5 <- residuals(model_5.5)

Test for normality of residuals

shapiro.test(resid_5.5)


    Shapiro-Wilk normality test

data:  resid_5.5
W = 0.95331, p-value = 1.7e-10

qqnorm(resid_5.5)
qqline(resid_5.5)

Test for homogeneity of variances across quarter × province combinations

province_seasonal_replicated$group <- interaction(
  province_seasonal_replicated$quarter,
  province_seasonal_replicated$bolton
)
leveneTest(mean_area ~ group, data = province_seasonal_replicated)

Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group  11  2.3599 0.007804 **
      423                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction plot: mean kelp canopy area by quarter and province

ggplot(
  province_seasonal_replicated,
  aes(x = quarter, y = mean_area, group = bolton, color = bolton)
) +
  stat_summary(fun = mean, geom = "line", size = 1) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  theme_minimal(base_size = 11) +
  scale_x_discrete(labels = c("Q1", "Q2", "Q3", "Q4")) +
  labs(
    title = "Interaction of Quarter and Province on Mean Kelp Canopy Area",
    x = "Quarter",
    y = "Mean Kelp Canopy Area (per pixel per year)",
    color = "Province"
  ) +
  theme(legend.position = "bottom")

Results Seasonal variation in kelp canopy area differed significantly among provinces (two-way ANOVA: quarter × province interaction: F\({6,423}\) = 2.36, p = 0.030). Both main effects were significant, with variation among quarters (F\({3,423}\) = 3.06, p = 0.028) and among provinces (F\({2,423}\) = 133.6, p < 0.001). Residuals exhibited mild non-normality (Shapiro–Wilk W = 0.95, p < 0.001), and homoscedasticity was violated across the interaction groups (Levene’s test: F\({11,423}\) = 2.36, p = 0.008). Nevertheless, interaction plots indicated province-specific seasonal cycles in canopy area that were consistent across years.

There is no easy non-parametric alternative.

General Guidance for Task 4 (above)

For each sub-question, above, consider:

formally state the null and alternative hypotheses;
justify your choice of model;
justify your choice of predictors;
justify your decision to aggregate or not aggregate the data at various levels;
discuss the assumptions involved and any violations you detect;
present the relevant model outputs and statistical tests;
include visualisations where appropriate (e.g. interaction plots, trend lines, diagnostic plots);
justify your choice of visualisation; and
present the results in a clear and concise manner, including tables and figures where appropriate, in a manner that would be appropriate for a scientific audience (e.g. a journal article).

You are not required to use the same modelling approach for all five sub-questions, though consistency across related questions is encouraged.

Task 5: Write-up

[Task Weight: 10%]

Write a short report (maximum 2 pages of text) that synthesises your findings across Tasks 2 through 5. This report should be written in the style of the Discussion section of a scientific paper, intended for an ecological audience.

Your goal is to interpret the major patterns and relationships you have identified, and to comment meaningfully on their ecological significance. Your write-up should include:

Temporal Trends and Seasonality.
Spatial Structure and Biogeography.
Interaction Effects and Spatial–Temporal Coupling.
Limitations and Assumptions.
Ecological Interpretation.

Format and tone:

Aim for clarity and economy of expression.
Don’t generate any new tables and figures. The tables and figures from Tasks 2 through 5 should be sufficient.
Write in complete paragraphs. Avoid bulleted summaries.
Add references to the tables and figures from Tasks 2 through 5 as needed.
Cite any additional references you use.

Answer

Note to Assessor: Penalise all text that was clearly AI generated by subtracting 50% off the mark.

Discussion

Our analysis of long-term kelp canopy dynamics revealed strong spatio-temporal structuring in both the mean extent and variability of kelp forests across southern Africa. While the limitations of simple linear and ANOVA-based models prevent exhaustive ecological inference, the results consistently support our observations that kelp canopy cover is not uniformly variable in space or time, but instead shaped by regionally distinct seasonal regimes and long-term trends that interact with the underlying biogeographic patterns.

At the broadest temporal scale, the annual means (Task 2) show evidence for a long-term linear increase in kelp canopy area across the study period, although the strength and direction of this trend varied substantially by biogeographical province. The Benguela Marine Province (BMP) exhibited a relatively strong positive trend, while the Agulhas and Benguela-Agulhas Transition Zone (AMP and B-ATZ) showed weaker or inconsistent trajectories. These patterns likely reflect contrasting oceanographic influences, specifically the differing degrees of upwelling intensity and sea surface temperature variability across provinces. When the annual trends were disaggregated seasonally (Task 3), the signal became more complex. Seasonal means indicated a marked peak in kelp canopy area during austral summer and early autumn (quarters 1 and 2). This seasonal cycle was most pronounced in BMP, less so in B-ATZ, and relatively flat in AMP, suggesting that regional exposure to storm-driven disturbance, solar irradiance, or herbivore activity may modulate canopy recovery and loss at intra-annual scales.

Spatial patterns in kelp canopy extent, assessed both at the bioregional level (Task 4.1) and at the finer spatial resolution of coastal sections (Task 5.1), reinforce the view that biogeographic context is a dominant driver of canopy variability. A large proportion of the overall variance in canopy area was attributable to differences among sections, but these differences were nested within broader provincial contrasts. The one-way and two-way ANOVAs demonstrated strong main effects of both spatial variables, although the interaction between section and province could not be statistically partitioned due to their hierarchical structure. Nonetheless, the magnitude of spatial variation suggests that local environmental factors (such as wave exposure, topographic complexity, and nutrient availability) likely modulate kelp dynamics on sub-provincial scales, even as broader provincial regimes exert overarching influence.

Where temporal and spatial effects intersect, the coupling of seasonal cycles with regional identity becomes ecologically meaningful. Task 5.5 illustrated this clearly: although all provinces experience some degree of seasonal fluctuation, the shape and amplitude of these cycles differed significantly. The interaction effect between quarter and province was significant despite the relatively simple model structure, indicating that the timing of canopy peaks and troughs is not synchronised coast-wide. This asynchrony complicates large-scale generalisations about kelp seasonality and underscores the importance of considering regional context in ecological forecasting. That the BMP, for instance, displays both the strongest seasonal cycle and the most pronounced long-term increase suggests a possible reinforcing interaction between natural seasonal pulses and long-term drivers such as ocean cooling (prevalent in the kelp’s distributional range) or changing storm regimes.

Several methodological limitations constrain the scope of these interpretations. First, violations of model assumptions (non-normality, heteroscedasticity, and the lack of full factorial design in spatial terms) limit the interpretive power of parametric tests. While non-parametric alternatives were noted where appropriate (e.g. Kruskal–Wallis, etc.), their implementation was outside the scope of this coursework. Secondly, while the data offer high temporal and spatial resolution, they lack covariates such as sea surface temperature, nutrient levels, or herbivore abundance that would allow causal inference. Finally, the use of mean canopy area as the primary response variable conceals potentially important variation in canopy fragmentation, persistence, or patch turnover.

Despite these limitations, the ecological interpretation remains convincing: kelp canopy extent is increasing overall, but in a regionally specific manner that reflects underlying biogeographic structure. Seasonal variation is evident and substantial, but not temporally synchronised across the coastline. These findings align with previous studies that describe the interaction between physical forcing and biological response in kelp-dominated ecosystems (Dayton 1985; Wernberg et al. 2016). Future work should explore whether these spatial–temporal patterns correspond to known gradients in oceanographic regime or are indicative of broader shifts in coastal ecosystem functioning under climate change.

References

Dayton, P. K. (1985). Ecology of kelp communities. Annual Review of Ecology and Systematics, 16, 215–245.

Wernberg, T., et al. (2016). Climate-driven regime shift of a temperate marine ecosystem. Science, 353(6295), 169–172.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2026,
  author = {Smit, A. J.},
  title = {BCB744 {Final} {Exam}},
  date = {2026-04-26},
  url = {https://tangledbank.netlify.app/BCB744/assessments/BCB744_Final_Prac_Exam_2026.html},
  langid = {en-GB}
}

For attribution, please cite this work as:

Smit AJ (2026) BCB744 Final Exam. https://tangledbank.netlify.app/BCB744/assessments/BCB744_Final_Prac_Exam_2026.html.