BCB744 Biostatistics — Theory Test (Version 8)

Total: 134 marks | Time: 180 minutes

Author: A. J. Smit

Affiliation: University of the Western Cape

Published: January 1, 2026

Instructions

This paper has three parts: Part A (General Theory, 46 marks), Part B (Experiment Design and Hypothesis Formulation, 51 marks), and Part C (Statistical Output Interpretation, 37 marks).
Answer all questions.
Write clearly and in complete sentences where prose is required.
Mark allocations are shown next to each question in (/ marks) notation.
Statistical notation: use H₀ for the null hypothesis and H_A for the alternative hypothesis.

Part A: General Theory (46 marks)

Question 1 — Assumptions and Transformations (/8)

a. Name three properties of the normal distribution that are directly relevant to the validity of parametric hypothesis tests. (/ 3)
b. A researcher measures tree-ring width (mm) for 100 trees. The data are strongly right-skewed with several very large values. Why might a log-transformation be appropriate, and what property does it tend to stabilise? (/ 2)
c. A parasitologist counts the number of helminth parasites per fish host and wants to test whether counts differ between two host species. The count data are right-skewed with variance exceeding the mean. They apply a square-root transformation before running a t-test. What property of the count distribution does the square-root transformation specifically address, and is this transformation sufficient given the degree of overdispersion described? (/ 3)

Model Answer — Question 1

a. Any three of the following (1 mark each):

✓ It is symmetric around the mean — residuals of equal magnitude above and below the mean are equally probable, which is required for unbiased estimation.
✓ It is fully described by only two parameters (μ and σ) — tests based on normal theory rely on this parsimony to derive exact null distributions.
✓ The mean, median, and mode coincide — ensuring the mean is a stable and meaningful measure of central tendency on which parametric tests focus.
✓ The distribution has defined, finite variance — required for the central limit theorem and for the calculation of standard errors and t-statistics.

✓ A log-transformation compresses large values and expands small values, reducing right skew and making the transformed distribution closer to symmetric/normal.
✓ It stabilises multiplicative variance (variance that scales with the mean): if the coefficient of variation (SD/mean) is roughly constant across the range of the data, log-transformation converts multiplicative error structure to additive, which is what normal-theory models assume.

✓ The square-root transformation is classically applied to Poisson-distributed counts to stabilise variance: for a Poisson distribution, variance = mean, so variance increases with the mean. The square-root transformation makes variance approximately constant (homoscedastic) across the range of means.
✓ However, the parasitologist’s counts show overdispersion (variance > mean), which suggests the data are better described by a negative binomial than by a Poisson distribution. The square-root transformation stabilises Poisson (equidispersed) variance but is less effective when the variance grows faster than the mean, as it does under negative binomial overdispersion.
✓ The transformation may be insufficient; a log(x + 1) transformation (which stabilises negative binomial variance more effectively) or a non-parametric Wilcoxon rank-sum test would be more appropriate alternatives.
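The contrast drawn in the last part of this answer can be checked by simulation. The following is a minimal R sketch, not part of the marking scheme: the two means (10 and 100), the negative binomial `size = 2`, and the helper name `var_after` are illustrative assumptions, and the counts are simulated rather than taken from any helminth dataset.

```r
# Simulated counts only; the means and dispersion below are assumptions
# chosen for illustration, not values from the helminth study.
set.seed(1)
n <- 1e5
means <- c(low = 10, high = 100)

# Variance of the transformed counts at each mean
var_after <- function(draw, transform) {
  sapply(means, function(m) var(transform(draw(m))))
}

draw_pois <- function(m) rpois(n, lambda = m)
draw_nb   <- function(m) rnbinom(n, mu = m, size = 2)  # variance = mu + mu^2 / size

pois_sqrt <- var_after(draw_pois, sqrt)   # close to 0.25 at both means
nb_sqrt   <- var_after(draw_nb, sqrt)     # still rises steeply with the mean
nb_log    <- var_after(draw_nb, log1p)    # log(x + 1): far more similar across means

round(rbind(pois_sqrt, nb_sqrt, nb_log), 2)
```

Under these assumptions the square root flattens the Poisson variance but leaves a strong mean-variance trend in the overdispersed counts, which is the reason the model answer treats it as insufficient.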

Question 2 — Variables and Measurement Scales (/6)

a. Describe the four levels of measurement (nominal, ordinal, interval, ratio). For each level, give one biological example. (/ 4)
b. Why does the level of measurement of a response variable constrain the choice of statistical test? Give one concrete example where using a test designed for a higher measurement level on a lower-level variable would be problematic. (/ 2)

Model Answer — Question 2

a. One mark per level with a valid example:

✓ Nominal: categories with no inherent order; differences have no quantitative meaning. Example: species identity (damselfish, parrotfish, wrasse) or habitat type (rocky shore, sandy beach, seagrass bed).
✓ Ordinal: categories with a meaningful rank order, but intervals between ranks are not equal. Example: substrate rugosity scored 1–5 (low to high), or dominance rank in a social group.
✓ Interval: continuous measurements with equal intervals between values, but an arbitrary zero (zero does not mean absence). Example: water temperature in °C (0°C is arbitrary — does not mean absence of heat).
✓ Ratio: continuous measurements with a true zero (zero means complete absence of the quantity). Example: body mass (g), shell length (mm), or dissolved oxygen (mg L⁻¹) — a value of zero means none is present.

✓ Parametric tests (e.g., ANOVA, t-tests) require at least interval-level measurement, because they use arithmetic operations (mean, variance) that assume equal spacing between values. Applying these tests to ordinal data assumes equal intervals that do not exist.
✓ Example: calculating the mean of an ordinal rugosity score (1–5) treats the difference between 1 and 2 as equal to the difference between 4 and 5 — which is not guaranteed. A non-parametric test (e.g., Kruskal-Wallis) is appropriate for ordinal response variables, as it operates on ranks rather than raw values.
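The rank-based alternative mentioned above might look as follows in R. This is a hedged sketch: the `rugosity` data frame and its scores are invented purely for illustration, and only base R's `kruskal.test()` is used.

```r
# Hypothetical rugosity scores (1-5) at three habitat types; invented for illustration.
rugosity <- data.frame(
  habitat = factor(rep(c("rocky_shore", "sandy_beach", "seagrass_bed"), each = 6)),
  score   = c(4, 5, 3, 5, 4, 4,
              1, 2, 1, 2, 3, 1,
              3, 2, 3, 4, 2, 3)
)

# Medians are a defensible summary for ordinal scores; means assume equal spacing
group_medians <- tapply(rugosity$score, rugosity$habitat, median)
group_medians

# Kruskal-Wallis works on ranks, so it needs no equal-interval assumption
kw <- kruskal.test(score ~ habitat, data = rugosity)
kw
```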

Question 3 — Repeated Measures and Within-Subject Designs (/7)

a. Explain the difference between a between-subjects and a within-subjects (repeated measures) experimental design. (/ 2)
b. What statistical advantage does a within-subjects design offer over a between-subjects design, and why? (/ 2)
c. A researcher measures plant biomass at weeks 0, 4, 8, and 12 under two fertiliser treatments (control, high-N). The same 15 pots are measured at all four time points. What type of analysis is most appropriate, and what violation of standard one-way ANOVA assumptions must be addressed? (/ 3)

Model Answer — Question 3

✓ In a between-subjects design, different individuals are assigned to different treatment conditions — each person or experimental unit contributes data to only one group. Variation between individuals is part of the error term.
✓ In a within-subjects (repeated measures) design, the same individuals are measured under all treatment conditions or at multiple time points — each unit contributes observations to every level of the within-subjects factor. The design exploits the fact that each individual serves as its own control.

✓ A within-subjects design reduces residual variance and thereby increases statistical power, because individual-level baseline differences (which are a major source of noise) are removed from the error term by subtracting each subject’s mean response.
✓ Mechanistically: in a between-subjects design, between-individual variability is part of the error; in a within-subjects design, this variability is partitioned into a separate subject term and is excluded from the denominator of the F-ratio, leaving only within-subject variability as error.

✓ The appropriate analysis is a two-way mixed-design (split-plot) repeated measures ANOVA, or a linear mixed-effects model with pot as a random effect, with time as the within-subjects factor and fertiliser treatment as the between-subjects factor.
✓ Standard ANOVA assumes that observations are independent, but repeated measurements from the same pot are correlated — this violates the independence assumption. Additionally, repeated measures ANOVA requires sphericity: the variances of the differences between all pairs of time points must be equal. This is checked with Mauchly’s test of sphericity; if violated, epsilon corrections (Greenhouse-Geisser or Huynh-Feldt) are applied to the degrees of freedom.
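The power advantage described in the second part of this answer can be illustrated with a small simulation. This is a sketch under stated assumptions: 20 simulated subjects, a between-subject SD of 10, a measurement SD of 1, and a true treatment effect of 2. None of these values come from the question. The same data are analysed twice, once ignoring the pairing and once using it.

```r
# Simulated data: subject SD, noise SD and effect size are illustrative assumptions.
set.seed(42)
n_subjects <- 20
baseline  <- rnorm(n_subjects, mean = 50, sd = 10)     # large between-subject differences
control   <- baseline + rnorm(n_subjects, sd = 1)
treatment <- baseline + 2 + rnorm(n_subjects, sd = 1)  # true effect = 2

# Between-subjects analysis: subject differences stay in the error term
p_between <- t.test(treatment, control, paired = FALSE)$p.value

# Within-subjects analysis: each subject acts as its own control
p_within <- t.test(treatment, control, paired = TRUE)$p.value

c(between = p_between, within = p_within)
```

With baseline differences this large relative to the effect, the paired analysis detects the effect while the unpaired analysis does not, because the subject-level variation no longer sits in the denominator of the test statistic.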

Question 4 — Standardised Regression Coefficients (/6)

a. What is a standardised (beta) regression coefficient, and how does it differ from an unstandardised coefficient? (/ 3)
b. In a multiple regression predicting bird species richness from habitat patch area (ha) and distance to the nearest forest fragment (km), the standardised coefficients are β_area = 0.61 and β_distance = −0.38. What can you conclude about the relative importance of the two predictors? (/ 2)
c. Why can standardised coefficients not be meaningfully compared across different studies that used different samples? (/ 1)

Model Answer — Question 4

✓ An unstandardised coefficient (b) gives the change in the response variable (in its original units) for a one-unit increase in the predictor (in its original units). Its value depends on the measurement scales of both variables, making direct comparison of coefficients across predictors (or studies) with different units meaningless.
✓ A standardised (beta) coefficient is obtained by standardising both the response and predictor variables to have mean = 0 and SD = 1 before fitting the model (or equivalently, by multiplying the unstandardised coefficient by SD_x / SD_y). It gives the change in the response in standard deviation units for a one-SD increase in the predictor.
✓ Standardised coefficients are unitless and can be compared directly across predictors within the same model, providing a measure of the relative contribution of each predictor.

✓ Patch area (β = 0.61) has a larger absolute standardised coefficient than distance to fragment (β = −0.38), indicating that a one-SD increase in patch area is associated with a larger change in species richness (in SD units) than a one-SD increase in distance.
✓ Therefore, within this model and dataset, patch area is the more important predictor of bird species richness. Distance has a moderate negative effect (larger distance → fewer species), but patch area explains more of the variation.

✓ Standardised coefficients are scaled by the standard deviation of the predictor in the sample used. If two studies sample populations with different ranges or variances of the predictor (e.g., one study covers a small patch-size range, another a wide range), the standard deviations will differ, and the same underlying unstandardised slope will produce different beta weights. Comparing beta coefficients across studies therefore conflates the underlying effect size with sample variability.
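The two equivalent routes to a standardised coefficient described in this answer (rescaling by SD_x / SD_y, or z-scoring all variables before fitting) can be compared directly. The sketch below uses simulated data; the data frame `birds`, its column names, and every numeric value are assumptions made for illustration only.

```r
# Simulated patch data; all numbers are illustrative assumptions.
set.seed(7)
n <- 60
area_ha     <- runif(n, 1, 50)
distance_km <- runif(n, 0.1, 10)
richness    <- 10 + 0.4 * area_ha - 1.2 * distance_km + rnorm(n, sd = 4)
birds <- data.frame(richness, area_ha, distance_km)

# Unstandardised slopes, in richness units per ha and per km
fit_raw <- lm(richness ~ area_ha + distance_km, data = birds)
b <- coef(fit_raw)[-1]

# Route 1: rescale each slope by SD_x / SD_y
sd_x <- sapply(birds[c("area_ha", "distance_km")], sd)
beta_rescaled <- b * sd_x / sd(birds$richness)

# Route 2: z-score every variable, then refit
fit_std <- lm(richness ~ area_ha + distance_km, data = as.data.frame(scale(birds)))
beta_refit <- coef(fit_std)[-1]

rbind(beta_rescaled, beta_refit)  # the two routes agree
```

Because both routes divide by sample standard deviations, rerunning the sketch with a narrower range of `area_ha` changes the beta weight even though the underlying slope of 0.4 is unchanged, which is the cross-study problem raised in the final part of the answer.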

Question 5 — ANOVA and Post-hoc Tests (/6)

a. Explain why it is statistically incorrect to perform all pairwise comparisons between three or more groups using individual t-tests, rather than ANOVA. (/ 3)
b. What is the Tukey Honestly Significant Difference (HSD) test? When is it the appropriate post-hoc procedure following a significant one-way ANOVA result? (/ 3)

Model Answer — Question 5

✓ Each individual t-test is conducted at α = 0.05, meaning there is a 5% chance of a Type I error per test. With k groups, there are k(k−1)/2 pairwise comparisons: for k = 3, that is 3 comparisons; for k = 5, that is 10.
✓ The family-wise error rate (FWER) — the probability of making at least one false rejection across all tests — inflates substantially: with 3 independent tests, FWER ≈ 1 − (0.95)³ ≈ 0.14, not 0.05. With 10 tests, FWER ≈ 0.40.
✓ ANOVA conducts a single omnibus F-test that controls the error rate at α = 0.05 for the global null hypothesis (all means equal), avoiding this inflation.

✓ The Tukey HSD test is a post-hoc multiple comparison procedure that makes all pairwise comparisons among group means while controlling the family-wise error rate at α across all comparisons. It uses the studentised range distribution to compute critical differences.
✓ It is appropriate when: (a) the omnibus ANOVA F-test is significant (indicating some difference exists), (b) all groups have approximately equal sample sizes (balanced or near-balanced design), and (c) the researcher wants to identify which specific pairs of groups differ significantly, with simultaneous Type I error control across all pairwise tests.
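The family-wise error rates quoted in the first part of this answer follow from a one-line formula. A short R sketch reproduces them; the function name `fwer` is hypothetical, and the calculation treats the pairwise tests as independent, which is the same approximation the answer uses.

```r
# Family-wise error rate for all pairwise t-tests among k groups,
# treating the tests as independent (an approximation, as in the answer above)
fwer <- function(k, alpha = 0.05) {
  m <- choose(k, 2)        # number of pairwise comparisons, k(k - 1) / 2
  1 - (1 - alpha)^m
}

round(fwer(3), 2)  # 3 comparisons  -> 0.14
round(fwer(5), 2)  # 10 comparisons -> 0.4
```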

Question 6 — The Scientific Method (/6)

a. Explain the difference between a null hypothesis and an alternative hypothesis. (/ 2)
b. Why is it important to formulate hypotheses before collecting data? What statistical problem arises when hypotheses are adjusted after seeing the data? (/ 2)
c. What is a confounding variable? Provide one example from biology and explain how you would control for it in an experiment. (/ 2)

Model Answer — Question 6

✓ The null hypothesis (H₀) is the default position of no effect, no difference, or no relationship between variables — it is what we assume to be true until evidence suggests otherwise.
✓ The alternative hypothesis (H_A) states that there is an effect, difference, or relationship — it is what the researcher typically hopes to support with data.

✓ Formulating hypotheses before data collection ensures that the test is a genuine test of a prediction rather than a post-hoc rationalisation, preserving the logical structure of hypothesis testing.
✓ Adjusting hypotheses after seeing the data is called HARKing (Hypothesising After Results are Known) or contributes to p-hacking, inflating the Type I error rate because multiple implicit comparisons have been made without correction.

✓ A confounding variable is one that is associated with both the predictor and the response variable, creating a spurious apparent relationship. Example: studying the effect of intertidal height on limpet size, with wave exposure confounding both (exposed shores have lower intertidal zones and smaller limpets). Control: hold wave exposure constant by sampling from shores of the same exposure class, or include it as a covariate in the model.

Question 7 — Statistical Inference and Error (/7)

a. Define a p-value in plain language (without using the word “probability” in a circular way). (/ 2)
b. Distinguish between a Type I error and a Type II error. Which one does the significance level α directly control? (/ 3)
c. A researcher increases the sample size of their experiment from n = 20 to n = 80. What effect does this have on statistical power, and why? (/ 2)

Model Answer — Question 7

✓ A p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. It quantifies how surprising the data are under H₀.
✓ A small p-value means the observed result would be rare if H₀ were true — it does not tell us the probability that H₀ is true.

✓ A Type I error (false positive) is rejecting a true H₀: concluding there is an effect when there is none.
✓ A Type II error (false negative) is failing to reject a false H₀: missing a real effect.
✓ The significance level α directly controls the Type I error rate: by setting α = 0.05 we accept a 5% chance of a false positive whenever H₀ is in fact true.

✓ Increasing sample size increases statistical power (the ability to detect a real effect when one exists), because larger samples produce more precise estimates with smaller standard errors.
✓ With smaller standard errors, the test statistic becomes larger for the same true effect size, making it more likely to exceed the critical threshold and lead to rejection of a false H₀.
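The effect of moving from n = 20 to n = 80 can be quantified with base R's `power.t.test()`. This is a sketch under assumptions the question does not specify: a two-sample t-test, n interpreted as the size of each group, and a true difference of 0.5 standard deviations. The helper name `power_at` is hypothetical.

```r
# Assumed scenario: two-sample t-test, true difference of 0.5 SD, alpha = 0.05.
# The effect size is an illustrative assumption, not a value from the question.
power_at <- function(n_per_group) {
  power.t.test(n = n_per_group, delta = 0.5, sd = 1, sig.level = 0.05)$power
}

power_n20 <- power_at(20)
power_n80 <- power_at(80)
c(n20 = power_n20, n80 = power_n80)  # power rises sharply with n
```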

Part B: Experiment Design and Hypothesis Formulation (51 marks)

Question 8 — Factorial Design: Lizard Sprint Speed (/13)

A herpetologist measures the maximum sprint speed (m s⁻¹) of common lizards (Zootoca vivipara) reared under two temperatures (20°C and 30°C) and two diet types (insect-based and plant-based). Six individuals are assigned to each of the four treatment combinations. The first six rows of the dataset are:

  lizard_id  temperature  diet_type  sprint_speed_m_s
1         1        20°C     insects              1.23
2         2        20°C     insects              1.18
3         3        20°C  vegetation              0.89
4         4        20°C  vegetation              0.92
5         5        30°C     insects              1.67
6         6        30°C     insects              1.71

The researcher asks: “Does sprint speed vary with temperature, diet type, or the interaction between them?”

a. State formal null and alternative hypotheses for each of the following effects: (i) the main effect of temperature, (ii) the main effect of diet type, and (iii) the temperature × diet interaction. (/ 6)
b. Which statistical test is most appropriate? Give three reasons, including reference to the number of predictors and their nature. (/ 4)
c. The temperature × diet interaction is significant. What does this mean biologically? How does it affect how you would report and interpret the main effects? (/ 3)

Model Answer — Question 8

a. Two marks per effect pair (H₀ + H_A):

(i) Temperature:

✓ H₀: Mean sprint speed does not differ between lizards reared at 20°C and 30°C (μ₂₀ = μ₃₀).
✓ H_A: Mean sprint speed differs between the two temperature treatments (μ₂₀ ≠ μ₃₀).

(ii) Diet type:

✓ H₀: Mean sprint speed does not differ between lizards fed insects and those fed vegetation (μ_insects = μ_vegetation).
✓ H_A: Mean sprint speed differs between the two diet types (μ_insects ≠ μ_vegetation).

(iii) Temperature × diet interaction:

✓ H₀: The effect of temperature on sprint speed is the same regardless of diet type (no interaction; the effects are additive).
✓ H_A: The effect of temperature on sprint speed depends on diet type (the two factors interact; their combined effect is not simply additive).

✓ Two-way (factorial) ANOVA — this is the correct test because there are two categorical predictors (temperature with 2 levels; diet with 2 levels) and a single continuous response variable (sprint speed).
✓ Reason 1: There are two categorical predictors (factors) rather than one, each with distinct levels. A two-way ANOVA simultaneously tests the main effect of each factor and their interaction, which a one-way ANOVA or a set of t-tests cannot do.
✓ Reason 2: The response (sprint speed, m s⁻¹) is continuous and ratio-scaled, appropriate for ANOVA which compares group means.
✓ Reason 3: The design is balanced (equal replication, 6 per cell), which maximises the power and interpretive clarity of a factorial ANOVA; each cell’s mean is estimated with equal precision.

✓ A significant interaction means that the effect of temperature on sprint speed depends on diet type (or equivalently, the diet effect depends on temperature). The two factors do not act independently.
✓ For example, warming may strongly enhance sprint speed in insect-fed lizards (because sufficient protein supports muscle development) but have little effect in vegetation-fed lizards (because plant-based nutrition cannot support the thermal enhancement of locomotor performance).
✓ Because the interaction is significant, the main effects cannot be interpreted in isolation — reporting a single main effect of temperature (e.g., “warmer lizards are faster”) is misleading if this is only true for one diet type. You must present and interpret the conditional effects (simple main effects) separately for each diet type, ideally via an interaction plot.
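A factorial ANOVA of this design could be fitted in R along the following lines. The sketch simulates 24 lizards with the same layout and column names as the question; the cell means, the SD, and the ASCII level labels `"20C"` and `"30C"` are assumptions made for illustration, so the output says nothing about the real dataset.

```r
# Simulated data with the same layout as the question: 2 temperatures x 2 diets,
# 6 lizards per cell. Cell means and SD are invented for illustration.
set.seed(8)
lizards <- expand.grid(
  rep         = 1:6,
  temperature = c("20C", "30C"),
  diet_type   = c("insects", "vegetation"),
  stringsAsFactors = TRUE
)
cell_mean <- with(lizards,
  ifelse(diet_type == "insects",
         ifelse(temperature == "30C", 1.70, 1.20),    # strong warming effect
         ifelse(temperature == "30C", 0.95, 0.90)))   # weak warming effect
lizards$sprint_speed_m_s <- rnorm(nrow(lizards), mean = cell_mean, sd = 0.08)

# temperature * diet_type expands to both main effects plus their interaction
fit <- aov(sprint_speed_m_s ~ temperature * diet_type, data = lizards)
summary(fit)

# Cell means: the conditional effects to report when the interaction matters
with(lizards, tapply(sprint_speed_m_s, list(temperature, diet_type), mean))
```

The table of cell means is the numerical counterpart of an interaction plot: reading across each diet column shows how the temperature effect differs between diets.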

Question 9 — Mussel Shell Thickness Across Shore Types (ANCOVA) (/13)

A marine biologist measures shell thickness (mm) in mussels (Mytilus galloprovincialis) from three shoreline types (Exposed, Semi-exposed, Sheltered), also recording shell length (mm) as a continuous covariate. The first six rows of the dataset are:

  mussel_id    shore_type  shell_length_mm  thickness_mm
1         1       Exposed             52.1          3.84
2         2       Exposed             48.7          3.61
3         3  Semi-exposed             54.2          3.47
4         4  Semi-exposed             51.8          3.29
5         5     Sheltered             49.3          2.95
6         6     Sheltered             53.6          3.02

The research question is: “Does shell thickness differ among shore types, after statistically controlling for shell length?”

a. State formal null and alternative hypotheses appropriate for this ANCOVA. (/ 3)
b. Why is it necessary to include shell length as a covariate? What would be the risk of comparing raw (unadjusted) group means? (/ 3)
c. State the key assumption of ANCOVA that must be verified before interpreting the adjusted means. Describe how you would test it and what outcome would indicate a violation. (/ 4)
d. If the assumption in (c) is violated, describe two alternative approaches the researcher could use. (/ 3)

Model Answer — Question 9

✓ H₀: After adjusting for shell length, the mean shell thickness does not differ among shore types (adjusted μ_Exposed = adjusted μ_Semi-exposed = adjusted μ_Sheltered).
✓ H_A: After adjusting for shell length, at least one shore type has a mean shell thickness that differs from the others.
✓ The hypotheses explicitly reference the covariate-adjusted means — failing to mention the adjustment would be an incomplete statement of the ANCOVA hypothesis.

✓ Shell thickness scales with body size — larger mussels have thicker shells. If mussels from different shore types differ systematically in shell length (e.g., wave-exposed mussels are smaller due to physical disturbance or differential growth), then unadjusted mean thickness differences among shore types would confound the shore-type effect with body size effects.
✓ Ignoring shell length risks attributing a size-driven difference in thickness to shore type, producing a biased or spurious treatment effect. ANCOVA statistically removes the linear effect of shell length and estimates the shore-type effect at a common shell length.

✓ The critical assumption is homogeneity of regression slopes (parallelism): the slope of the shell length vs. shell thickness relationship must be equal across all shore type groups. If it differs, the covariate adjustment is not uniform and the adjusted means at a single common shell length are not comparable across groups.
✓ Test: fit a model including the interaction term shell_length_mm × shore_type (i.e., lm(thickness_mm ~ shell_length_mm * shore_type, data = mussels)). If the interaction term is statistically significant (p < 0.05), the slopes are not equal — the assumption is violated.
✓ Graphically, a violation appears as non-parallel regression lines for the three groups when thickness is plotted against shell length, with markedly different slopes (one group’s relationship much steeper or shallower than others).

✓ Option 1: Stratified analysis with simple slopes — report the relationship between shell length and thickness separately for each shore type, and describe how the shore-type effect on thickness varies across the range of shell lengths (Johnson-Neyman technique).
✓ Option 2: Fit a multiple regression with the interaction explicitly included (thickness_mm ~ shell_length_mm + shore_type + shell_length_mm:shore_type) and use this model to obtain predicted thickness at specific, biologically meaningful shell lengths for each shore type, rather than a single overall adjusted mean. This acknowledges and quantifies the heterogeneous slopes rather than forcing parallelism.
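The homogeneity-of-slopes check described in this answer amounts to comparing two nested models. A minimal R sketch follows, using simulated mussels with the question's column names; the group sizes (20 per shore type), the common slope, the shore offsets, and the noise level are all assumptions for illustration.

```r
# Simulated mussels; group sizes, slopes and noise are illustrative assumptions.
set.seed(9)
mussels <- data.frame(
  shore_type      = factor(rep(c("Exposed", "Semi-exposed", "Sheltered"), each = 20)),
  shell_length_mm = runif(60, 40, 60)
)
offset <- c(Exposed = 0.6, `Semi-exposed` = 0.3, Sheltered = 0)
mussels$thickness_mm <- 0.5 + 0.05 * mussels$shell_length_mm +
  unname(offset[as.character(mussels$shore_type)]) + rnorm(60, sd = 0.15)

# ANCOVA with a common slope versus a model with one slope per shore type
fit_parallel <- lm(thickness_mm ~ shell_length_mm + shore_type, data = mussels)
fit_slopes   <- lm(thickness_mm ~ shell_length_mm * shore_type, data = mussels)

# F-test for the shell_length_mm:shore_type interaction (2 extra slope terms)
slope_test <- anova(fit_parallel, fit_slopes)
slope_test
```

A small p-value in the second row of `slope_test` would indicate unequal slopes. In that case `fit_slopes` is the model to keep, and `predict()` at chosen shell lengths replaces the single adjusted mean.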

Question 10 — Repeated Measures: Trout Immune Response (/13)

A fish immunologist measures plasma antibody titre (arbitrary units) of 18 individual rainbow trout (Oncorhynchus mykiss) at three time points: Baseline (Day 0), post-vaccination (Day 14), and peak immune response (Day 28). The same 18 fish are measured at all three time points. The first six rows of the dataset are:

  fish_id  timepoint  antibody_titre
1       1        D0              1.2
2       1       D14              4.8
3       1       D28              9.1
4       2        D0              1.4
5       2       D14              5.2
6       2       D28              8.7

The research question is: “Does plasma antibody titre change significantly over time following vaccination?”

a. State formal null and alternative hypotheses for this analysis. (/ 3)
b. Identify the appropriate statistical test and give three reasons for your choice, with reference to the study design and data structure. (/ 5)
c. What key assumption does this repeated measures design introduce that standard one-way ANOVA does not require? How is it checked? (/ 3)
d. If the overall test is significant, what post-hoc procedure would you apply to identify which time points differ? (/ 2)

Model Answer — Question 10

✓ H₀: Mean plasma antibody titre does not differ among time points; μ_D0 = μ_D14 = μ_D28 (there is no change in antibody titre over time following vaccination).
✓ H_A: Mean antibody titre differs among at least some time points — vaccination produces a significant change in immune titre over time.
✓ The alternative is non-directional for the omnibus test (though a directional prediction of increasing titre is biologically motivated, the overall F-test is non-directional; directional predictions should be addressed in post-hoc comparisons).

✓ One-way repeated measures ANOVA (or a linear mixed-effects model with time as a fixed effect and fish_id as a random intercept).
✓ Reason 1: The same 18 individual fish are measured at all three time points — the measurements are not independent between time points within the same fish. This within-subject structure requires a repeated measures approach.
✓ Reason 2: There is a single categorical within-subjects factor (timepoint) with three levels (D0, D14, D28). A one-way repeated measures ANOVA is designed to test for differences in a continuous response across multiple levels of a within-subjects factor. Using three paired t-tests would inflate the Type I error rate.
✓ Reason 3: The response variable (antibody titre) is continuous, meeting the measurement-scale requirement for a parametric approach.

✓ The key additional assumption is sphericity: the variances of the differences between all pairs of time points must be equal — i.e., Var(D0 − D14) = Var(D0 − D28) = Var(D14 − D28).
✓ Sphericity is checked using Mauchly’s test of sphericity. If violated (Mauchly’s p < 0.05), the F-statistic is positively biased, and degrees of freedom must be corrected using the Greenhouse-Geisser or Huynh-Feldt epsilon correction to maintain a valid α level.

✓ Pairwise paired t-tests with a Bonferroni (or Holm) correction are applied to compare all pairs of time points (D0 vs. D14, D0 vs. D28, D14 vs. D28), correcting the p-values for the three simultaneous comparisons.
✓ Alternatively, Tukey’s HSD (if available in the repeated-measures framework) or a linear contrast approach can be used to identify the specific time points at which significant changes in titre occur.
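The omnibus test and the corrected pairwise comparisons named in this answer can both be run in base R. The sketch below simulates 18 fish with the question's column names; the time-point means and the two standard deviations are invented, so the results only illustrate the workflow. It does not include Mauchly's test, which needs a multivariate (wide-format) fit.

```r
# Simulated titres for 18 fish at three time points; all values are invented.
set.seed(10)
trout <- expand.grid(fish_id = factor(1:18), timepoint = c("D0", "D14", "D28"),
                     stringsAsFactors = TRUE)
fish_effect <- rnorm(18, sd = 0.5)                       # stable per-fish offset
time_mean   <- c(D0 = 1.3, D14 = 5.0, D28 = 9.0)
trout$antibody_titre <- unname(time_mean[as.character(trout$timepoint)]) +
  fish_effect[as.integer(trout$fish_id)] + rnorm(nrow(trout), sd = 0.4)

# Error(fish_id / timepoint) moves between-fish variation out of the error term
rm_fit <- aov(antibody_titre ~ timepoint + Error(fish_id / timepoint), data = trout)
summary(rm_fit)

# Paired comparisons of all time points with Holm-adjusted p-values
# (rows are ordered by fish within each time point, so the pairing lines up)
posthoc <- pairwise.t.test(trout$antibody_titre, trout$timepoint,
                           paired = TRUE, p.adjust.method = "holm")
posthoc$p.value
```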

Question 11 — Multiple Regression: Seagrass Biomass (/12)

A coastal ecologist measures above-ground seagrass biomass (g m⁻²) at 48 sites, along with three candidate environmental predictors: water clarity (Secchi depth, m), tidal exposure (hours exposed per day), and sediment organic matter (%). The first six rows of the dataset are:

  site  biomass_g_m2  secchi_m  tidal_hrs  sediment_om_pct
1    1         412.3       3.2        2.1             3.4
2    2         387.1       2.9        2.8             4.1
3    3         318.5       2.1        3.5             5.2
4    4         271.4       1.8        4.2             6.8
5    5         224.8       1.4        5.1             8.3
6    6         193.2       1.2        5.8             9.1

The research aim is: “To determine which combination of environmental variables best predicts above-ground seagrass biomass.”

a. State the null and alternative hypotheses for the overall multiple regression model. (/ 3)
b. Give three specific reasons why multiple regression (not simple linear regression) is appropriate for this research aim. (/ 3)
c. The researcher uses AIC to compare four candidate models. Describe what they would look for in the AIC table to select the best model. (/ 3)
d. From the data preview, what concern arises about multicollinearity among the three predictors, and how would you formally diagnose it? (/ 3)

Model Answer — Question 11

✓ H₀: None of the three environmental predictors (Secchi depth, tidal exposure, sediment OM) has a linear relationship with seagrass biomass; all regression slopes are zero (β₁ = β₂ = β₃ = 0). The model explains no more variance than the intercept-only model.
✓ H_A: At least one predictor has a non-zero slope — at least one environmental variable is a significant linear predictor of seagrass biomass.
✓ The overall null is tested by the omnibus F-statistic in the ANOVA table of the regression output.

✓ Reason 1: There are three candidate predictors, each potentially influencing biomass. Simple linear regression models only one predictor at a time, ignoring the simultaneous effects of the others and failing to control for their confounding influence on the focal predictor’s coefficient.
✓ Reason 2: The research aim is to identify the best combination of predictors — this requires fitting and comparing models that include subsets of all three predictors simultaneously, which is the purpose of multiple regression and model selection.
✓ Reason 3: Partial regression coefficients in multiple regression estimate the effect of each predictor holding the others constant — providing a clearer picture of each variable’s independent contribution. Simple regression slopes confound the effects of correlated predictors.

✓ Select the model with the lowest AIC value as the best supported model. Compare other candidate models by computing ΔAIC = AIC_model − AIC_minimum. Models with ΔAIC < 2 are considered empirically equivalent (equally well supported); ΔAIC > 10 indicates strong evidence against the model.
✓ Also check that the “best” model makes biological sense: AIC minimisation should be combined with substantive knowledge. A simpler model (fewer parameters) with nearly equal AIC may be preferred on the principle of parsimony. When the sample size is small relative to the number of estimated parameters (n/k below about 40), AICc should be used in place of AIC.

✓ The data preview shows that Secchi depth decreases, tidal exposure increases, and sediment OM increases together as site number increases — the three predictors appear to co-vary systematically, suggesting strong inter-predictor correlations (multicollinearity). Sites with more tidal exposure may have lower water clarity and higher organic matter deposition.
✓ Formal diagnosis: calculate the Variance Inflation Factor (VIF) for each predictor after fitting the full model. VIF_j = 1 / (1 − R²_j), where R²_j is the proportion of variance in predictor j explained by all the other predictors. VIF > 5 (or > 10 by more lenient standards) indicates problematic multicollinearity that inflates coefficient standard errors and makes individual estimates unstable.
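The VIF formula in this answer can be applied by hand in base R. The sketch below uses only the six preview rows printed in the question, so it illustrates the calculation rather than the real diagnosis, which would use all 48 sites. The names `preview` and `vif_by_hand` are hypothetical; in practice a packaged function such as `vif()` from the `car` package is commonly used instead.

```r
# The six preview rows from the question; the full analysis would use all 48 sites.
preview <- data.frame(
  secchi_m        = c(3.2, 2.9, 2.1, 1.8, 1.4, 1.2),
  tidal_hrs       = c(2.1, 2.8, 3.5, 4.2, 5.1, 5.8),
  sediment_om_pct = c(3.4, 4.1, 5.2, 6.8, 8.3, 9.1)
)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_by_hand <- sapply(names(preview), function(p) {
  r2 <- summary(lm(reformulate(setdiff(names(preview), p), response = p),
                   data = preview))$r.squared
  1 / (1 - r2)
})

round(cor(preview), 2)  # pairwise correlations among the predictors
vif_by_hand
```

On these six rows every VIF is far above the threshold of 5, consistent with the visual impression that the three predictors change together from site to site.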

Part C: Statistical Output Interpretation (37 marks)

Question 12 — Multiple Regression Output: Kelp Frond Length (/13)

A kelp ecologist models the frond length (cm) of Ecklonia maxima at 75 sampling locations as a function of three continuous environmental predictors: daily irradiance (mol photons m⁻² day⁻¹), water temperature (°C), and nitrate concentration (μmol L⁻¹). The lm() output is:

Call:
lm(formula = frond_length_cm ~ irradiance + temperature + nitrate,
   data = kelp)

Residuals:
    Min      1Q  Median      3Q     Max
 -28.43  -7.91    0.34   8.12   31.17

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -45.240      8.120   -5.57  < 0.001 ***
irradiance      3.840      0.530    7.25  < 0.001 ***
temperature     2.120      0.680    3.12   0.0026 **
nitrate         1.470      0.410    3.59   0.0006 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.34 on 71 degrees of freedom
Multiple R-squared:  0.7401,  Adjusted R-squared:  0.7291
F-statistic: 67.42 on 3 and 71 DF,  p-value: < 2.2e-16

a. Write the fitted regression equation and comment on whether the intercept is biologically interpretable. (/ 2)
b. Interpret the coefficient for irradiance (3.840) as a partial regression coefficient. What does “partial” mean in this context? (/ 3)
c. A colleague argues that temperature should be removed from the model because its p-value (0.0026) is larger than those of the other two predictors. Evaluate this argument. (/ 3)
d. What does adjusted R² = 0.7291 indicate, and why is it lower than R² = 0.7401? (/ 3)
e. A new site has irradiance = 8.5 mol photons m⁻² day⁻¹, temperature = 16.2°C, and nitrate = 4.8 μmol L⁻¹. Calculate the predicted frond length. (/ 2)

Model Answer — Question 12

✓ $\widehat{frond\_length} = -45.240 + 3.840 \times irradiance + 2.120 \times temperature + 1.470 \times nitrate$
✓ The intercept (−45.24 cm) is the predicted frond length when all three predictors simultaneously equal zero (zero irradiance, zero temperature, zero nitrate), a combination that never occurs in nature and lies outside the observed data range. A negative frond length is also physically impossible. The intercept is therefore a mathematical anchor for the regression plane, not a biologically meaningful quantity.

✓ The coefficient 3.840 is a partial regression coefficient: it estimates the change in expected frond length associated with a one-unit increase in irradiance (1 mol photon m⁻² day⁻¹), holding temperature and nitrate constant. For each additional mol photon m⁻² day⁻¹ of irradiance, kelp fronds are predicted to be 3.84 cm longer, all else being equal.
✓ “Partial” means this estimate isolates the unique contribution of irradiance to frond length after accounting for the linear effects of temperature and nitrate — it is not the simple (marginal) effect of irradiance in isolation, which would also absorb any variation shared with the other two predictors.
✓ The strong positive relationship is biologically consistent: greater light availability drives photosynthesis and carbon fixation, fuelling frond elongation growth.
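The meaning of "partial" can be illustrated numerically. The sketch below uses simulated, hypothetical data (the variable names mirror the question, but the values are invented and irradiance is deliberately made to correlate with temperature). It shows that the irradiance coefficient from the full model equals the slope obtained after the linear effects of temperature and nitrate have been removed from both frond length and irradiance, and that it differs from the simple (marginal) slope.

```r
# Simulated, hypothetical data (assumption): not the exam dataset
set.seed(7)
n           <- 75
temperature <- rnorm(n, mean = 16, sd = 2)
irradiance  <- 5 + 0.3 * temperature + rnorm(n)   # correlated with temperature
nitrate     <- rnorm(n, mean = 5, sd = 1.5)
frond       <- -45 + 3.8 * irradiance + 2.1 * temperature +
               1.5 * nitrate + rnorm(n, sd = 12)

# Partial coefficient from the full multiple regression
full          <- lm(frond ~ irradiance + temperature + nitrate)
partial_coef  <- coef(full)[["irradiance"]]

# Same quantity obtained by "partialling out" the other predictors
res_y         <- resid(lm(frond ~ temperature + nitrate))
res_x         <- resid(lm(irradiance ~ temperature + nitrate))
partial_slope <- coef(lm(res_y ~ res_x))[["res_x"]]

# Marginal slope: irradiance on its own, absorbing shared variation
marginal_slope <- coef(lm(frond ~ irradiance))[["irradiance"]]

print(c(partial = partial_coef, residualised = partial_slope,
        marginal = marginal_slope))
```

The first two values agree exactly; the marginal slope generally differs whenever irradiance is correlated with the other predictors.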

✓ The colleague’s argument is incorrect. All three predictors are statistically significant at α = 0.05 (temperature: p = 0.0026). A lower p-value does not mean one predictor is “better” or that others should be dropped — it reflects a combination of effect size and estimation precision.
✓ Model selection should be guided by AIC, adjusted R², or biological reasoning — not by p-value ranking among retained predictors. Removing a significant predictor (temperature, p = 0.0026 ≪ 0.05) without justification would discard a real effect and bias the remaining coefficients if temperature is correlated with irradiance or nitrate (omitted variable bias).
✓ Furthermore, each predictor’s p-value already accounts for all other predictors in the model. Temperature remains significant conditional on irradiance and nitrate — it has an independent contribution to frond length that would be conflated with the other predictors if removed.

✓ Adjusted R² = 0.7291 means that approximately 72.9% of the variation in kelp frond length is explained by the three-predictor model, after penalising for model complexity (number of predictors relative to sample size).
✓ Adjusted R² is lower than R² (0.7401) because it applies a penalty for each additional parameter estimated: R²_adj = 1 − (1 − R²) × (n − 1) / (n − k − 1), where n = 75 and k = 3. Unlike R², adjusted R² does not automatically increase when a predictor is added; it decreases if the predictor adds less explanatory power than expected by chance. The small gap (0.0110) here shows that the penalty for three predictors is minor at n = 75, which is consistent with all three predictors contributing real explanatory power.
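As a quick check, the adjusted R² formula can be evaluated directly from the values reported in the output. This is a small sketch using only the numbers given in the question (R² = 0.7401, n = 75, k = 3); the object names are illustrative.

```r
# Values taken from the lm() output in the question
r2 <- 0.7401   # Multiple R-squared
n  <- 75       # sampling locations
k  <- 3        # predictors (excluding the intercept)

# R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2_adj, 4))   # agrees with the reported 0.7291
```

The denominator n − k − 1 = 71 is the residual degrees of freedom shown in the output.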

✓ $\widehat{frond\_length} = -45.240 + 3.840 \times 8.5 + 2.120 \times 16.2 + 1.470 \times 4.8$
✓ = −45.240 + 32.640 + 34.344 + 7.056 = 28.80 cm
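The same arithmetic can be reproduced in R from the reported coefficients. This is a sketch only; the vector names are illustrative and the coefficient values are copied from the output above.

```r
# Coefficients copied from the lm() output
coefs <- c("(Intercept)" = -45.240,
           irradiance    =   3.840,
           temperature   =   2.120,
           nitrate       =   1.470)

# Predictor values for the new site (1 multiplies the intercept)
new_site <- c("(Intercept)" = 1,
              irradiance    = 8.5,
              temperature   = 16.2,
              nitrate       = 4.8)

# Predicted frond length (cm) = sum of coefficient x predictor value
predicted_length <- sum(coefs * new_site[names(coefs)])

print(round(predicted_length, 2))
```

With the fitted model object available, `predict()` with a `newdata` data frame containing the same three columns would return the same point prediction.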

Question 13 — Two-Way ANOVA Output: Prawn Survival (/12)

An aquaculturist tests the effects of salinity (three levels: 20, 30, 40 ppt) and temperature (two levels: 20°C, 28°C) on the percentage survival (%) of juvenile prawns (Penaeus japonicus) after 72 hours. Ten replicates per treatment combination are used. The ANOVA table is:

                       Df  Sum Sq  Mean Sq  F value   Pr(>F)
salinity                2  1284.6   642.3   31.48   < 0.001 ***
temperature             1   394.1   394.1   19.32   < 0.001 ***
salinity:temperature    2   187.3    93.7    4.59    0.0132 *
Residuals              54  1101.8    20.4

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

State the null hypothesis for the salinity × temperature interaction term. (/ 2)
What does Residuals df = 54 tell you about the experimental design? Show your reasoning. (/ 2)
The interaction is significant (F(2, 54) = 4.59, p = 0.0132). Interpret this biologically. (/ 3)
Because the interaction is significant, how should you approach the interpretation of the main effects? (/ 3)
Verify the F-value for salinity using the values in the ANOVA table. Show your working. (/ 2)

Model Answer — Question 13

✓ H₀: The effect of salinity on juvenile prawn survival is the same at both temperatures — the two factors act additively, and the response to salinity does not depend on temperature (or equivalently, the temperature effect is the same at all three salinities).

✓ Total observations n = number of cells × replicates per cell = (3 salinity × 2 temperature) × 10 = 60 observations.
✓ Residuals df = total n − number of cells = 60 − 6 = 54 ✓. This represents the within-cell error — the variability among the 10 replicates within each salinity × temperature treatment combination.

✓ A significant interaction means the effect of salinity on survival depends on temperature — the response to changing salinity is not the same at 20°C as it is at 28°C.
✓ Biologically: at 20°C, prawns may tolerate a broader range of salinities (a flat, moderate survival plateau), while at 28°C (thermal stress), deviations from optimal salinity (30 ppt) may be far more lethal — the additional stressor of temperature reduces the physiological scope to cope with osmotic imbalance.
✓ This non-additive (synergistic stressor) interaction has important practical implications: the optimal salinity for prawn aquaculture depends on the culture temperature and cannot be determined from single-factor experiments.

✓ When the interaction is significant, the main effects are conditional — the effect of salinity cannot be summarised as a single universal value, because it differs between the two temperature levels. Reporting “salinity significantly affected survival” without qualification is misleading.
✓ The appropriate approach is to interpret and present simple main effects — the effect of salinity separately at 20°C and at 28°C, ideally with an interaction plot. Main effects should not be interpreted in isolation when a significant interaction is present.
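A minimal sketch of the simple-main-effects approach follows. The data are simulated and entirely hypothetical (the cell means in `true_means` are invented to produce an interaction, and are not the aquaculturist's results); only the 3 × 2 layout with 10 replicates per cell follows the question.

```r
# Simulated, hypothetical data (assumption): 3 salinities x 2 temperatures x 10 reps
set.seed(13)
design <- expand.grid(replicate   = 1:10,
                      salinity    = factor(c(20, 30, 40)),
                      temperature = factor(c(20, 28)))

# Invented cell means (% survival): flat response at 20 C, peaked at 28 C
true_means <- matrix(c(78, 85, 80,    # 20 C column
                       62, 82, 60),   # 28 C column
                     nrow = 3,
                     dimnames = list(salinity    = c("20", "30", "40"),
                                     temperature = c("20", "28")))
idx <- cbind(as.character(design$salinity), as.character(design$temperature))
design$survival <- true_means[idx] + rnorm(nrow(design), sd = 4.5)

# Full factorial model, including the interaction term
fit <- aov(survival ~ salinity * temperature, data = design)

# Cell means: the numbers an interaction plot would display
cell_means <- with(design, tapply(survival, list(salinity, temperature), mean))
print(round(cell_means, 1))

# Simple main effects: effect of salinity within each temperature
simple_effects <- lapply(levels(design$temperature), function(tmp) {
  summary(aov(survival ~ salinity,
              data = subset(design, temperature == tmp)))
})
names(simple_effects) <- paste0("temperature_", levels(design$temperature))
print(simple_effects)
```

Fitting a separate one-way ANOVA within each temperature is the simplest version of this idea; it uses each subset's own error term rather than the pooled residual from the full model.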

✓ F_salinity = Mean Square_salinity / Mean Square_Residuals = 642.3 / 20.4 = 31.48 ✓
✓ This F-ratio means the mean square among salinity levels is 31.48 times larger than the residual (within-cell) mean square. Differences of this size among the salinity means are far beyond what would be expected from random sampling of a common population, strongly supporting rejection of the salinity H₀.
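The mean squares and F-ratios for all three terms can be recomputed from the degrees of freedom and sums of squares in the table. The sketch below uses only those reported values; the data frame and column names are illustrative.

```r
# Df and Sum Sq copied from the ANOVA table in the question
anova_tab <- data.frame(
  term   = c("salinity", "temperature", "salinity:temperature", "Residuals"),
  df     = c(2, 1, 2, 54),
  sum_sq = c(1284.6, 394.1, 187.3, 1101.8)
)

# Mean square = Sum Sq / Df
anova_tab$mean_sq <- anova_tab$sum_sq / anova_tab$df

# F = Mean Sq of the term / Mean Sq of the residuals
ms_resid <- anova_tab$mean_sq[anova_tab$term == "Residuals"]
f_values <- anova_tab$mean_sq[1:3] / ms_resid
names(f_values) <- anova_tab$term[1:3]

print(round(f_values, 2))

# Residual df from the design: total observations minus number of cells
resid_df <- (3 * 2 * 10) - (3 * 2)
print(resid_df)
```

Using the unrounded residual mean square (1101.8 / 54) rather than 20.4 gives the same F-values to two decimal places.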

Question 14 — Welch’s t-Test Output: Cuttlefish Mantle Length (/12)

The following R output compares mantle length (mm) of Sepia officinalis caught at two sites: an inshore estuary and an offshore reef.

Welch Two Sample t-test

data: mantle_length by site
t = 3.271, df = 31.84, p-value = 0.002614
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 8.432 35.568

sample estimates:
 mean in group estuary mean in group reef
             142.3              120.0

Levene's Test for Homogeneity of Variance
        Df  F value   Pr(>F)
group    1   9.412   0.00374 **
      46

Why was Welch’s t-test used rather than Student’s t-test? Support your answer with evidence from the output. (/ 2)
State the null and alternative hypotheses being tested and interpret the t-statistic and p-value in plain language. (/ 3)
Interpret the 95% confidence interval [8.432, 35.568]. What does it tell you about the precision of the estimated difference? (/ 3)
The degrees of freedom are reported as 31.84 rather than a whole number. Explain why this occurs in Welch’s test and what it implies about the two groups’ sample sizes and/or variances. (/ 2)
Write a one-sentence conclusion reporting the result in the style of a scientific paper. (/ 2)

Model Answer — Question 14

✓ Levene’s test is significant (F(1, 46) = 9.412, p = 0.00374), indicating that the variances differ significantly between the two groups. The homoscedasticity assumption of Student’s t-test is violated.
✓ Welch’s t-test is appropriate because it does not assume equal variances — it adjusts both the test statistic and the degrees of freedom to account for heteroscedasticity.

✓ H₀: the true difference in mean mantle length between estuary and reef cuttlefish is zero (µ_estuary − µ_reef = 0).
✓ H_A: the true difference is not equal to zero (µ_estuary − µ_reef ≠ 0).
✓ t = 3.271, p = 0.0026: the observed difference in means (22.3 mm) is 3.271 standard errors from zero. If H₀ were true, the probability of observing a difference at least this large in either direction would be 0.0026. We reject H₀ at α = 0.05 and conclude that mean mantle length differs significantly between sites.
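The link between the t-statistic, the degrees of freedom, and the two-sided p-value can be sketched in R using only the values printed in the output. The object names are illustrative, and the standard error is not reported directly; it is back-calculated here from the difference in means and the t-statistic.

```r
# Values copied from the Welch t-test output
t_obs     <- 3.271
df_welch  <- 31.84
mean_diff <- 142.3 - 120.0          # estuary minus reef (mm)

# Standard error implied by t = difference / SE
se_diff <- mean_diff / t_obs

# Two-sided p-value: probability of |t| at least this large under H0
p_two_sided <- 2 * pt(-abs(t_obs), df = df_welch)

print(round(se_diff, 2))
print(signif(p_two_sided, 2))
```

Doubling the lower-tail area is what makes the test two-sided, matching the "not equal to 0" alternative stated in the output.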

✓ We are 95% confident that the true difference in mean mantle length (estuary − reef) lies between 8.43 mm and 35.57 mm.
✓ The interval is relatively wide (spanning ~27 mm), reflecting moderate imprecision in the estimate. However, since the entire interval is positive (the lower bound exceeds zero), we can be confident the estuary cuttlefish are genuinely longer on average — the direction of the difference is clear even if the exact magnitude is uncertain.

✓ In Welch’s test, the degrees of freedom are calculated using the Welch-Satterthwaite equation, which weights each group’s contribution by its sample variance and sample size. Because it is built from the estimated variances, the formula almost always yields a non-integer value. When the two groups have unequal variances (as indicated here by Levene’s test), that value falls below the maximum possible df (N − 2 = 46).
✓ The fractional df (31.84 vs. a maximum of 46) implies that the two groups contribute unequally to the standard error of the difference, most likely because one group has a substantially larger variance (and/or a smaller sample size) and therefore dominates the uncertainty. The reduced df gives a slightly larger critical value, and hence a more conservative test than Student’s t-test would provide.
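The Welch-Satterthwaite calculation can be sketched as follows. The data are simulated and hypothetical: the output does not report group sizes or standard deviations, so equal groups of 24 (consistent with N = 48) and the chosen standard deviations are assumptions made purely for illustration.

```r
# Simulated, hypothetical samples (assumption): 24 per site, unequal spread
set.seed(42)
estuary <- rnorm(24, mean = 142, sd = 25)
reef    <- rnorm(24, mean = 120, sd = 10)

n1 <- length(estuary)
n2 <- length(reef)

# Per-group contribution to the variance of the difference in means
v1 <- var(estuary) / n1
v2 <- var(reef) / n2

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Compare with the df reported by R's Welch test
fit <- t.test(estuary, reef, var.equal = FALSE)

print(c(manual = df_welch, t_test = unname(fit$parameter)))
```

The result always lies between min(n1, n2) − 1 and n1 + n2 − 2, and it moves toward the lower bound as one group's variance comes to dominate.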

✓ “Mantle length of Sepia officinalis was significantly greater at the inshore estuary (mean = 142.3 mm) than at the offshore reef (mean = 120.0 mm; Welch’s t-test, t(31.84) = 3.271, p = 0.003, 95% CI of difference: [8.4, 35.6] mm).”

End of Version 8

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2026,
  author = {Smit, A. J.},
  title = {BCB744 {Biostatistics} — {Theory} {Test} {(Version} 8)},
  date = {2026-01-01},
  url = {https://tangledbank.netlify.app/BCB744/assessments/BCB744_Biostats_Theory_Test_V8.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit AJ (2026) BCB744 Biostatistics — Theory Test (Version 8). https://tangledbank.netlify.app/BCB744/assessments/BCB744_Biostats_Theory_Test_V8.html.

--- title: "BCB744 Biostatistics — Theory Test (Version 8)" subtitle: "Total: 13 marks | Time: 180 minutes" date: "2026" format: html: number-sections: false toc: true toc-depth: 2 toc-title: "Contents" embed-resources: true engine: knitr params: hide_answers: false --- ::: {.callout-important appearance="simple"} **Instructions** - This paper has **three parts**: Part A (General Theory, 46 marks), Part B (Experiment Design and Hypothesis Formulation, 51 marks), and Part C (Statistical Output Interpretation, 37 marks). - Answer **all** questions. - Write clearly and in complete sentences where prose is required. - Mark allocations are shown next to each question in **(/ marks)** notation. - Statistical notation: use *H*~0~ for the null hypothesis and *H*~A~ for the alternative hypothesis. ::: --- # Part A: General Theory (46 marks) ## Question 1 — Assumptions and Transformations (/8) a. Name **three** properties of the normal distribution that are directly relevant to the validity of parametric hypothesis tests. **(/ 3)** b. A researcher measures tree-ring width (mm) for 100 trees. The data are strongly right-skewed with several very large values. Why might a **log-transformation** be appropriate, and what property does it tend to stabilise? **(/ 2)** c. A parasitologist counts the number of helminth parasites per fish host and wants to test whether counts differ between two host species. The count data are right-skewed with variance exceeding the mean. They apply a square-root transformation before running a *t*-test. What property of the count distribution does the square-root transformation specifically address, and is this transformation sufficient given the degree of overdispersion described? 
**(/ 3)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 1** *a.* Any three of the following (1 mark each): - ✓ It is **symmetric** around the mean — residuals of equal magnitude above and below the mean are equally probable, which is required for unbiased estimation. - ✓ It is **fully described by only two parameters** (μ and σ) — tests based on normal theory rely on this parsimony to derive exact null distributions. - ✓ The mean, median, and mode **coincide** — ensuring the mean is a stable and meaningful measure of central tendency on which parametric tests focus. - ✓ The distribution has **defined, finite variance** — required for the central limit theorem and for the calculation of standard errors and *t*-statistics. *b.* - ✓ A log-transformation compresses large values and expands small values, reducing right skew and making the transformed distribution closer to symmetric/normal. - ✓ It stabilises **multiplicative variance** (variance that scales with the mean): if the coefficient of variation (SD/mean) is roughly constant across the range of the data, log-transformation converts multiplicative error structure to additive, which is what normal-theory models assume. *c.* - ✓ The square-root transformation is classically applied to Poisson-distributed counts to stabilise variance: for a Poisson distribution, variance = mean, so variance increases with the mean. The square-root transformation makes variance approximately constant (homoscedastic) across the range of means. - ✓ However, the parasitologist's counts show **overdispersion** (variance > mean), which indicates the data follow a negative binomial rather than Poisson distribution. The square-root transformation stabilises Poisson (equidispersed) variance but is less effective for negative binomial overdispersion. 
- ✓ The transformation may be insufficient; a log(x + 1) transformation (which stabilises negative binomial variance more effectively) or a **non-parametric Wilcoxon rank-sum test** would be more appropriate alternatives. ::: `r if (params$hide_answers) ":::"` --- ## Question 2 — Variables and Measurement Scales (/6) a. Describe the **four levels of measurement** (nominal, ordinal, interval, ratio). For each level, give one biological example. **(/ 4)** b. Why does the level of measurement of a response variable **constrain the choice of statistical test**? Give one concrete example where using a test designed for a higher measurement level on a lower-level variable would be problematic. **(/ 2)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 2** *a.* One mark per level with a valid example: - ✓ **Nominal**: categories with no inherent order; differences have no quantitative meaning. Example: species identity (damselfish, parrotfish, wrasse) or habitat type (rocky shore, sandy beach, seagrass bed). - ✓ **Ordinal**: categories with a meaningful rank order, but intervals between ranks are not equal. Example: substrate rugosity scored 1–5 (low to high), or dominance rank in a social group. - ✓ **Interval**: continuous measurements with equal intervals between values, but an arbitrary zero (zero does not mean absence). Example: water temperature in °C (0°C is arbitrary — does not mean absence of heat). - ✓ **Ratio**: continuous measurements with a true zero (zero means complete absence of the quantity). Example: body mass (g), shell length (mm), or dissolved oxygen (mg L⁻¹) — a value of zero means none is present. *b.* - ✓ Parametric tests (e.g., ANOVA, *t*-tests) require at least interval-level measurement, because they use arithmetic operations (mean, variance) that assume equal spacing between values. Applying these tests to ordinal data assumes equal intervals that do not exist. 
- ✓ Example: calculating the mean of an ordinal rugosity score (1–5) treats the difference between 1 and 2 as equal to the difference between 4 and 5 — which is not guaranteed. A non-parametric test (e.g., Kruskal-Wallis) is appropriate for ordinal response variables, as it operates on ranks rather than raw values. ::: `r if (params$hide_answers) ":::"` --- ## Question 3 — Repeated Measures and Within-Subject Designs (/7) a. Explain the difference between a **between-subjects** and a **within-subjects (repeated measures)** experimental design. **(/ 2)** b. What statistical advantage does a within-subjects design offer over a between-subjects design, and why? **(/ 2)** c. A researcher measures plant biomass at weeks 0, 4, 8, and 12 under two fertiliser treatments (control, high-N). The same 15 pots are measured at all four time points. What type of analysis is most appropriate, and what violation of standard one-way ANOVA assumptions must be addressed? **(/ 3)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 3** *a.* - ✓ In a **between-subjects design**, different individuals are assigned to different treatment conditions — each person or experimental unit contributes data to only one group. Variation between individuals is part of the error term. - ✓ In a **within-subjects (repeated measures) design**, the same individuals are measured under all treatment conditions or at multiple time points — each unit contributes observations to every level of the within-subjects factor. The design exploits the fact that each individual serves as its own control. *b.* - ✓ A within-subjects design **reduces residual variance** and thereby **increases statistical power**, because individual-level baseline differences (which are a major source of noise) are removed from the error term by subtracting each subject's mean response. 
- ✓ Mechanistically: in a between-subjects design, between-individual variability is part of the error; in a within-subjects design, this variability is partitioned into a separate subject term and is excluded from the denominator of the *F*-ratio, leaving only within-subject variability as error. *c.* - ✓ The appropriate analysis is a **two-way repeated measures ANOVA** (or a linear mixed-effects model), with time as the within-subjects factor and fertiliser treatment as the between-subjects factor. - ✓ Standard ANOVA assumes that observations are **independent**, but repeated measurements from the same pot are correlated — this violates the independence assumption. Additionally, repeated measures ANOVA requires **sphericity**: the variances of the differences between all pairs of time points must be equal. This is checked with **Mauchly's test of sphericity**; if violated, epsilon corrections (Greenhouse-Geisser or Huynh-Feldt) are applied to the degrees of freedom. ::: `r if (params$hide_answers) ":::"` --- ## Question 4 — Standardised Regression Coefficients (/6) a. What is a **standardised (beta) regression coefficient**, and how does it differ from an unstandardised coefficient? **(/ 3)** b. In a multiple regression predicting bird species richness from habitat patch area (ha) and distance to the nearest forest fragment (km), the standardised coefficients are β~area~ = 0.61 and β~distance~ = −0.38. What can you conclude about the **relative importance** of the two predictors? **(/ 2)** c. Why can standardised coefficients **not** be meaningfully compared across different studies that used different samples? **(/ 1)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 4** *a.* - ✓ An **unstandardised coefficient** (b) gives the change in the response variable (in its original units) for a one-unit increase in the predictor (in its original units). 
Its value depends on the measurement scales of both variables, making direct comparison of coefficients across predictors (or studies) with different units meaningless. - ✓ A **standardised (beta) coefficient** is obtained by standardising both the response and predictor variables to have mean = 0 and SD = 1 before fitting the model (or equivalently, by multiplying the unstandardised coefficient by SD~x~ / SD~y~). It gives the change in the response in standard deviation units for a one-SD increase in the predictor. - ✓ Standardised coefficients are **unitless** and can be compared directly across predictors within the same model, providing a measure of the relative contribution of each predictor. *b.* - ✓ Patch area (β = 0.61) has a **larger absolute standardised coefficient** than distance to fragment (β = −0.38), indicating that a one-SD increase in patch area is associated with a larger change in species richness (in SD units) than a one-SD increase in distance. - ✓ Therefore, within this model and dataset, **patch area is the more important predictor** of bird species richness. Distance has a moderate negative effect (larger distance → fewer species), but patch area explains more of the variation. *c.* - ✓ Standardised coefficients are scaled by the **standard deviation of the predictor** in the sample used. If two studies sample populations with different ranges or variances of the predictor (e.g., one study covers a small patch-size range, another a wide range), the standard deviations will differ, and the same underlying unstandardised slope will produce different beta weights. Comparing beta coefficients across studies therefore conflates the underlying effect size with sample variability. ::: `r if (params$hide_answers) ":::"` --- ## Question 5 — ANOVA and Post-hoc Tests (/6) a. Explain why it is statistically incorrect to perform all pairwise comparisons between three or more groups using individual *t*-tests, rather than ANOVA. **(/ 3)** b. 
What is the **Tukey Honestly Significant Difference (HSD)** test? When is it the appropriate post-hoc procedure following a significant one-way ANOVA result? **(/ 3)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 5** *a.* - ✓ Each individual *t*-test is conducted at *α* = 0.05, meaning there is a 5% chance of a Type I error per test. With *k* groups, there are *k*(k−1)/2 pairwise comparisons: for *k* = 3, that is 3 comparisons; for *k* = 5, that is 10. - ✓ The **family-wise error rate** (FWER) — the probability of making *at least one* false rejection across all tests — inflates substantially: with 3 independent tests, FWER ≈ 1 − (0.95)³ ≈ 0.14, not 0.05. With 10 tests, FWER ≈ 0.40. - ✓ ANOVA conducts a single omnibus *F*-test that controls the error rate at *α* = 0.05 for the global null hypothesis (all means equal), avoiding this inflation. *b.* - ✓ The Tukey HSD test is a **post-hoc multiple comparison procedure** that makes all pairwise comparisons among group means while controlling the family-wise error rate at *α* across all comparisons. It uses the studentised range distribution to compute critical differences. - ✓ It is appropriate when: (a) the omnibus ANOVA *F*-test is significant (indicating *some* difference exists), (b) all groups have approximately equal sample sizes (balanced or near-balanced design), and (c) the researcher wants to identify *which specific pairs* of groups differ significantly, with simultaneous Type I error control across all pairwise tests. ::: `r if (params$hide_answers) ":::"` --- ## Question 6 — The Scientific Method (/6) a. Explain the difference between a null hypothesis and an alternative hypothesis. **(/ 2)** b. Why is it important to formulate hypotheses *before* collecting data? What statistical problem arises when hypotheses are adjusted after seeing the data? **(/ 2)** c. What is a confounding variable? 
Provide one example from biology and explain how you would control for it in an experiment. **(/ 2)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 6** *a.* - ✓ The null hypothesis (*H*~0~) is the default position of no effect, no difference, or no relationship between variables — it is what we assume to be true until evidence suggests otherwise. - ✓ The alternative hypothesis (*H*~A~) states that there *is* an effect, difference, or relationship — it is what the researcher typically hopes to support with data. *b.* - ✓ Formulating hypotheses before data collection ensures that the test is a genuine test of a prediction rather than a post-hoc rationalisation, preserving the logical structure of hypothesis testing. - ✓ Adjusting hypotheses after seeing the data is called *HARKing* (Hypothesising After Results are Known) or contributes to *p-hacking*, inflating the Type I error rate because multiple implicit comparisons have been made without correction. *c.* - ✓ A confounding variable is one that is associated with both the predictor and the response variable, creating a spurious apparent relationship. Example: studying the effect of intertidal height on limpet size, with wave exposure confounding both (exposed shores have lower intertidal zones and smaller limpets). Control: hold wave exposure constant by sampling from shores of the same exposure class, or include it as a covariate in the model. `r if (params$hide_answers) ":::"` ::: --- ## Question 7 — Statistical Inference and Error (/7) a. Define a *p*-value in plain language (without using the word "probability" in a circular way). **(/ 2)** b. Distinguish between a Type I error and a Type II error. Which one does the significance level *α* directly control? **(/ 3)** c. A researcher increases the sample size of their experiment from *n* = 20 to *n* = 80. What effect does this have on statistical power, and why? 
**(/ 2)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 7** *a.* - ✓ A *p*-value is the probability of obtaining a test statistic at least as extreme as the one observed, *assuming the null hypothesis is true*. It quantifies how surprising the data are under *H*~0~. - ✓ A small *p*-value means the observed result would be rare if *H*~0~ were true — it does **not** tell us the probability that *H*~0~ is true. *b.* - ✓ A **Type I error** (false positive) is rejecting a true *H*~0~: concluding there is an effect when there is none. - ✓ A **Type II error** (false negative) is failing to reject a false *H*~0~: missing a real effect. - ✓ The significance level *α* directly controls the Type I error rate: by setting *α* = 0.05 we accept a 5% chance of a false positive. *c.* - ✓ Increasing sample size increases statistical power (the ability to detect a real effect when one exists), because larger samples produce more precise estimates with smaller standard errors. - ✓ With smaller standard errors, the test statistic becomes larger for the same true effect size, making it more likely to exceed the critical threshold and lead to rejection of a false *H*~0~. ::: `r if (params$hide_answers) ":::"` --- # Part B: Experiment Design and Hypothesis Formulation (51 marks) ## Question 8 — Factorial Design: Lizard Sprint Speed (/13) A herpetologist measures the maximum sprint speed (m s⁻¹) of common lizards (*Zootoca vivipara*) reared under two temperatures (20°C and 30°C) and two diet types (insect-based and plant-based). Six individuals are assigned to each of the four treatment combinations. 
The first six rows of the dataset are: ``` lizard_id temperature diet_type sprint_speed_m_s 1 1 20°C insects 1.23 2 2 20°C insects 1.18 3 3 20°C vegetation 0.89 4 4 20°C vegetation 0.92 5 5 30°C insects 1.67 6 6 30°C insects 1.71 ``` The researcher asks: *"Does sprint speed vary with temperature, diet type, or the interaction between them?"* a. State formal null and alternative hypotheses for each of the following effects: (i) the main effect of temperature, (ii) the main effect of diet type, and (iii) the temperature × diet interaction. **(/ 6)** b. What statistical test is most appropriate, and give **three** reasons, including reference to the number of predictors and their nature. **(/ 4)** c. The temperature × diet interaction is significant. What does this mean biologically? How does it affect how you would report and interpret the main effects? **(/ 3)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 8** *a.* Two marks per effect pair (*H*~0~ + *H*~A~): *(i) Temperature:* - ✓ *H*~0~: Mean sprint speed does not differ between lizards reared at 20°C and 30°C (μ~20~ = μ~30~). - ✓ *H*~A~: Mean sprint speed differs between the two temperature treatments (μ~20~ ≠ μ~30~). *(ii) Diet type:* - ✓ *H*~0~: Mean sprint speed does not differ between lizards fed insects and those fed vegetation (μ~insects~ = μ~vegetation~). - ✓ *H*~A~: Mean sprint speed differs between the two diet types (μ~insects~ ≠ μ~vegetation~). *(iii) Temperature × diet interaction:* - ✓ *H*~0~: The effect of temperature on sprint speed is the same regardless of diet type (no interaction; the effects are additive). - ✓ *H*~A~: The effect of temperature on sprint speed depends on diet type (the two factors interact; their combined effect is not simply additive). 
*b.* - ✓ **Two-way (factorial) ANOVA** — this is the correct test because there are two categorical predictors (temperature with 2 levels; diet with 2 levels) and a single continuous response variable (sprint speed). - ✓ Reason 1: There are **two factorial predictors** (not one), each with distinct levels. A two-way ANOVA simultaneously tests main effects of each factor and their interaction — a design that one-way ANOVA or *t*-tests cannot accommodate. - ✓ Reason 2: The response (sprint speed, m s⁻¹) is **continuous** and ratio-scaled, appropriate for ANOVA which compares group means. - ✓ Reason 3: The design is **balanced** (equal replication, 6 per cell), which maximises the power and interpretive clarity of a factorial ANOVA; each cell's mean is estimated with equal precision. *c.* - ✓ A significant interaction means that the effect of temperature on sprint speed **depends on diet type** (or equivalently, the diet effect depends on temperature). The two factors do not act independently. - ✓ For example, warming may strongly enhance sprint speed in insect-fed lizards (because sufficient protein supports muscle development) but have little effect in vegetation-fed lizards (because plant-based nutrition cannot support the thermal enhancement of locomotor performance). - ✓ Because the interaction is significant, the **main effects cannot be interpreted in isolation** — reporting a single main effect of temperature (e.g., "warmer lizards are faster") is misleading if this is only true for one diet type. You must present and interpret the **conditional effects** (simple main effects) separately for each diet type, ideally via an interaction plot. 
::: `r if (params$hide_answers) ":::"` --- ## Question 9 — Mussel Shell Thickness Across Shore Types (ANCOVA) (/13) A marine biologist measures shell thickness (mm) in mussels (*Mytilus galloprovincialis*) from three shoreline types (Exposed, Semi-exposed, Sheltered), also recording shell length (mm) as a continuous covariate. The first six rows of the dataset are: ``` mussel_id shore_type shell_length_mm thickness_mm 1 1 Exposed 52.1 3.84 2 2 Exposed 48.7 3.61 3 3 Semi-exposed 54.2 3.47 4 4 Semi-exposed 51.8 3.29 5 5 Sheltered 49.3 2.95 6 6 Sheltered 53.6 3.02 ``` The research question is: *"Does shell thickness differ among shore types, after statistically controlling for shell length?"* a. State formal null and alternative hypotheses appropriate for this ANCOVA. **(/ 3)** b. Why is it necessary to include shell length as a covariate? What would be the risk of comparing raw (unadjusted) group means? **(/ 3)** c. State the **key assumption** of ANCOVA that must be verified before interpreting the adjusted means. Describe how you would test it and what outcome would indicate a violation. **(/ 4)** d. If the assumption in (c) is violated, describe **two alternative approaches** the researcher could use. **(/ 3)** `r if (params$hide_answers) "::: {.content-hidden}"` ::: {.callout-tip appearance="simple"} **Model Answer — Question 9** *a.* - ✓ *H*~0~: After adjusting for shell length, the mean shell thickness does not differ among shore types (adjusted μ~Exposed~ = adjusted μ~Semi-exposed~ = adjusted μ~Sheltered~). - ✓ *H*~A~: After adjusting for shell length, at least one shore type has a mean shell thickness that differs from the others. - ✓ The hypotheses explicitly reference the covariate-adjusted means — failing to mention the adjustment would be an incomplete statement of the ANCOVA hypothesis. *b.* - ✓ Shell thickness scales with body size — larger mussels have thicker shells. 
  If mussels from different shore types differ systematically in shell length (e.g., wave-exposed mussels are smaller due to physical disturbance or differential growth), then unadjusted mean thickness differences among shore types would **confound the shore-type effect with body size effects**.
- ✓ Ignoring shell length risks attributing a size-driven difference in thickness to shore type, producing a **biased or spurious treatment effect**. ANCOVA statistically removes the linear effect of shell length and estimates the shore-type effect at a common shell length.

*c.*

- ✓ The critical assumption is **homogeneity of regression slopes** (parallelism): the slope of the shell length vs. shell thickness relationship must be equal across all shore type groups. If it differs, the covariate adjustment is not uniform and the adjusted means at a single common shell length are not comparable across groups.
- ✓ Test: fit a model including the interaction term `shell_length_mm × shore_type` (i.e., `lm(thickness_mm ~ shell_length_mm * shore_type, data = mussels)`). A statistically significant interaction term (*p* < 0.05) is evidence that the slopes differ among shore types, so the assumption is violated.
- ✓ Graphically, a violation appears as non-parallel regression lines for the three groups when thickness is plotted against shell length, with markedly different slopes (one group's relationship much steeper or shallower than others).

*d.*

- ✓ Option 1: **Stratified analysis with simple slopes** — report the relationship between shell length and thickness separately for each shore type, and describe how the shore-type effect on thickness varies across the range of shell lengths (for example, using the Johnson-Neyman technique to identify the range of shell lengths over which shore types differ significantly).
- ✓ Option 2: Fit a **multiple regression with the interaction explicitly included** (`thickness_mm ~ shell_length_mm + shore_type + shell_length_mm:shore_type`) and use this model to obtain predicted thickness at specific, biologically meaningful shell lengths for each shore type, rather than a single overall adjusted mean. This acknowledges and quantifies the heterogeneous slopes rather than forcing parallelism.

:::

`r if (params$hide_answers) ":::"`

---

## Question 10 — Repeated Measures: Trout Immune Response (/13)

A fish immunologist measures plasma antibody titre (arbitrary units) of 18 individual rainbow trout (*Oncorhynchus mykiss*) at three time points: Baseline (Day 0), post-vaccination (Day 14), and peak immune response (Day 28). The same 18 fish are measured at all three time points. The first six rows of the dataset are:

```
  fish_id timepoint antibody_titre
1       1        D0            1.2
2       1       D14            4.8
3       1       D28            9.1
4       2        D0            1.4
5       2       D14            5.2
6       2       D28            8.7
```

The research question is: *"Does plasma antibody titre change significantly over time following vaccination?"*

a. State formal null and alternative hypotheses for this analysis. **(/ 3)**
b. Identify the appropriate statistical test and give **three reasons** for your choice, with reference to the study design and data structure. **(/ 5)**
c. What key assumption does this repeated measures design introduce that standard one-way ANOVA does not require? How is it checked? **(/ 3)**
d. If the overall test is significant, what **post-hoc procedure** would you apply to identify which time points differ? **(/ 2)**

`r if (params$hide_answers) "::: {.content-hidden}"`

::: {.callout-tip appearance="simple"}
**Model Answer — Question 10**

*a.*

- ✓ *H*~0~: Mean plasma antibody titre does not differ among time points; μ~D0~ = μ~D14~ = μ~D28~ (there is no change in antibody titre over time following vaccination).
- ✓ *H*~A~: Mean antibody titre differs between at least two time points — vaccination produces a significant change in immune titre over time.
- ✓ The alternative is non-directional for the omnibus test (though a directional prediction of increasing titre is biologically motivated, the overall *F*-test is non-directional; directional predictions should be addressed in post-hoc comparisons).

*b.*

- ✓ **One-way repeated measures ANOVA** (or a linear mixed-effects model with time as a fixed effect and fish_id as a random intercept).
- ✓ Reason 1: The same **18 individual fish** are measured at all three time points — the measurements are not independent between time points within the same fish. This within-subject structure requires a repeated measures approach.
- ✓ Reason 2: There is a **single categorical within-subjects factor** (timepoint) with **three levels** (D0, D14, D28). A one-way repeated measures ANOVA is designed to test for differences in a continuous response across multiple levels of a within-subjects factor. Using three uncorrected paired *t*-tests would inflate the Type I error rate.
- ✓ Reason 3: The **response variable** (antibody titre) is continuous, meeting the measurement-scale requirement for a parametric approach.

*c.*

- ✓ The key additional assumption is **sphericity**: the variances of the differences between all pairs of time points must be equal — i.e., Var(D0 − D14) = Var(D0 − D28) = Var(D14 − D28).
- ✓ Sphericity is checked using **Mauchly's test of sphericity**. If violated (Mauchly's *p* < 0.05), the *F*-test is positively biased (too liberal, inflating the Type I error rate), and the degrees of freedom must be corrected using the Greenhouse-Geisser or Huynh-Feldt epsilon correction to maintain a valid *α* level.

*d.*

- ✓ **Pairwise paired *t*-tests with a Bonferroni (or Holm) correction** are applied to compare all pairs of time points (D0 vs. D14, D0 vs. D28, D14 vs. D28), correcting the *p*-values for the three simultaneous comparisons.
- ✓ Alternatively, **Tukey's HSD** (if available in the repeated-measures framework) or a linear contrast approach can be used to identify the specific time points at which significant changes in titre occur.

:::

`r if (params$hide_answers) ":::"`

---

## Question 11 — Multiple Regression: Seagrass Biomass (/12)

A coastal ecologist measures above-ground seagrass biomass (g m⁻²) at 48 sites, along with three candidate environmental predictors: water clarity (Secchi depth, m), tidal exposure (hours exposed per day), and sediment organic matter (%). The first six rows of the dataset are:

```
  site biomass_g_m2 secchi_m tidal_hrs sediment_om_pct
1    1        412.3      3.2       2.1             3.4
2    2        387.1      2.9       2.8             4.1
3    3        318.5      2.1       3.5             5.2
4    4        271.4      1.8       4.2             6.8
5    5        224.8      1.4       5.1             8.3
6    6        193.2      1.2       5.8             9.1
```

The research aim is: *"To determine which combination of environmental variables best predicts above-ground seagrass biomass."*

a. State the null and alternative hypotheses for the **overall** multiple regression model. **(/ 3)**
b. Give **three specific reasons** why multiple regression (not simple linear regression) is appropriate for this research aim. **(/ 3)**
c. The researcher uses AIC to compare four candidate models. Describe what they would look for in the AIC table to select the best model. **(/ 3)**
d. From the data preview, what concern arises about **multicollinearity** among the three predictors, and how would you formally diagnose it? **(/ 3)**

`r if (params$hide_answers) "::: {.content-hidden}"`

::: {.callout-tip appearance="simple"}
**Model Answer — Question 11**

*a.*

- ✓ *H*~0~: None of the three environmental predictors (Secchi depth, tidal exposure, sediment OM) has a linear relationship with seagrass biomass; all regression slopes are zero (β~1~ = β~2~ = β~3~ = 0). The model explains no more variance than the intercept-only model.
- ✓ *H*~A~: At least one predictor has a non-zero slope — at least one environmental variable is a significant linear predictor of seagrass biomass.
- ✓ The overall null is tested by the omnibus *F*-statistic in the ANOVA table of the regression output.

*b.*

- ✓ Reason 1: There are **three candidate predictors**, each potentially influencing biomass. Simple linear regression models only one predictor at a time, ignoring the simultaneous effects of the others and failing to control for their confounding influence on the focal predictor's coefficient.
- ✓ Reason 2: The research aim is to identify the **best combination** of predictors — this requires fitting and comparing models that include subsets of all three predictors simultaneously, which is the purpose of multiple regression and model selection.
- ✓ Reason 3: **Partial regression coefficients** in multiple regression estimate the effect of each predictor **holding the others constant** — providing a clearer picture of each variable's independent contribution. Simple regression slopes confound the effects of correlated predictors.

*c.*

- ✓ Select the model with the **lowest AIC value** as the best supported model. Compare other candidate models by computing ΔAIC = AIC~model~ − AIC~minimum~. Models with ΔAIC < 2 are considered empirically equivalent (equally well supported); ΔAIC > 10 indicates strong evidence against the model.
- ✓ Also check that the "best" model makes biological sense — AIC minimisation should be combined with substantive knowledge. A simpler model (fewer parameters) with nearly equal AIC may be preferred on the principle of parsimony (prefer AICc for small *n*/*k* ratios).

*d.*

- ✓ The data preview shows that Secchi depth decreases, tidal exposure increases, and sediment OM increases together as site number increases — the three predictors appear to **co-vary systematically**, suggesting strong inter-predictor correlations (multicollinearity). Sites with more tidal exposure may have lower water clarity and higher organic matter deposition.
- ✓ Formal diagnosis: calculate the **Variance Inflation Factor (VIF)** for each predictor after fitting the full model. VIF~j~ = 1 / (1 − *R*²~j~), where *R*²~j~ is the proportion of variance in predictor *j* explained by all other predictors. VIF > 5 (or > 10 by more lenient standards) indicates problematic multicollinearity that inflates coefficient standard errors and makes individual estimates unstable.

:::

`r if (params$hide_answers) ":::"`

---

# Part C: Statistical Output Interpretation (37 marks)

## Question 12 — Multiple Regression Output: Kelp Frond Length (/13)

A kelp ecologist models the frond length (cm) of *Ecklonia maxima* at 75 sampling locations as a function of three continuous environmental predictors: daily irradiance (mol photons m⁻² day⁻¹), water temperature (°C), and nitrate concentration (μmol L⁻¹). The `lm()` output is:

```
Call:
lm(formula = frond_length_cm ~ irradiance + temperature + nitrate, data = kelp)

Residuals:
    Min      1Q  Median      3Q     Max
 -28.43   -7.91    0.34    8.12   31.17

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -45.240      8.120   -5.57  < 0.001 ***
irradiance      3.840      0.530    7.25  < 0.001 ***
temperature     2.120      0.680    3.12   0.0026 **
nitrate         1.470      0.410    3.59   0.0006 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.34 on 71 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7291
F-statistic: 67.42 on 3 and 71 DF,  p-value: < 2.2e-16
```

a. Write the fitted regression equation and comment on whether the intercept is biologically interpretable. **(/ 2)**
b. Interpret the coefficient for `irradiance` (3.840) as a **partial regression coefficient**. What does "partial" mean in this context? **(/ 3)**
c. A colleague argues that `temperature` should be removed from the model because its *p*-value (0.0026) is larger than those of the other two predictors. Evaluate this argument. **(/ 3)**
d. What does adjusted *R*² = 0.7291 indicate, and why is it lower than *R*² = 0.7401? **(/ 3)**
e. A new site has irradiance = 8.5 mol photons m⁻² day⁻¹, temperature = 16.2°C, and nitrate = 4.8 μmol L⁻¹. Calculate the predicted frond length. **(/ 2)**

`r if (params$hide_answers) "::: {.content-hidden}"`

::: {.callout-tip appearance="simple"}
**Model Answer — Question 12**

*a.*

- ✓ $\widehat{frond\_length} = -45.240 + 3.840 \times irradiance + 2.120 \times temperature + 1.470 \times nitrate$
- ✓ The intercept (−45.24 cm) is the predicted frond length when all three predictors simultaneously equal zero — a combination that never occurs in nature (zero irradiance, zero temperature, zero nitrate). It is a **mathematical anchor** for the regression plane, not a biologically meaningful quantity; extrapolation to zero values is outside the observed data range.

*b.*

- ✓ The coefficient 3.840 is a **partial regression coefficient**: it estimates the change in expected frond length associated with a one-unit increase in irradiance (1 mol photon m⁻² day⁻¹), **holding temperature and nitrate constant**. For each additional mol photon m⁻² day⁻¹ of irradiance, kelp fronds are predicted to be 3.84 cm longer, all else being equal.
- ✓ "Partial" means this estimate isolates the unique contribution of irradiance to frond length *after* accounting for the linear effects of temperature and nitrate — it is not the simple (marginal) effect of irradiance in isolation, which would also absorb any variation shared with the other two predictors.
- ✓ The strong positive relationship is biologically consistent: greater light availability drives photosynthesis and carbon fixation, fuelling frond elongation growth.

*c.*

- ✓ The colleague's argument is **incorrect**. All three predictors are statistically significant at *α* = 0.05 (temperature: *p* = 0.0026). A lower *p*-value does not mean one predictor is "better" or that others should be dropped — it reflects a combination of effect size and estimation precision.
- ✓ Model selection should be guided by **AIC**, **adjusted *R*²**, or biological reasoning — not by *p*-value ranking among retained predictors. Removing a significant predictor (temperature, *p* = 0.0026 ≪ 0.05) without justification would discard a real effect and bias the remaining coefficients if temperature is correlated with irradiance or nitrate (omitted variable bias).
- ✓ Furthermore, each predictor's *p*-value already accounts for all other predictors in the model. Temperature remains significant conditional on irradiance and nitrate — it has an independent contribution to frond length that would be conflated with the other predictors if removed.

*d.*

- ✓ Adjusted *R*² = 0.7291 means that approximately **72.9% of the variation in kelp frond length** is explained by the three-predictor model, after penalising for model complexity (number of predictors relative to sample size).
- ✓ Adjusted *R*² is lower than *R*² (0.7401) because it applies a **penalty for each additional parameter** estimated: *R*²~adj~ = 1 − (1 − *R*²) × (*n* − 1) / (*n* − *k* − 1), where *n* = 75 and *k* = 3, giving 1 − 0.2599 × 74 / 71 ≈ 0.7291. Unlike *R*², adjusted *R*² does not automatically increase when a predictor is added — it decreases if the predictor adds less explanatory power than expected by chance. The small gap (0.0110) here is consistent with all three predictors genuinely contributing.

*e.*

- ✓ $\widehat{frond\_length} = -45.240 + 3.840 \times 8.5 + 2.120 \times 16.2 + 1.470 \times 4.8$
- ✓ = −45.240 + 32.640 + 34.344 + 7.056 = **28.80 cm**

:::

`r if (params$hide_answers) ":::"`

---

## Question 13 — Two-Way ANOVA Output: Prawn Survival (/12)

An aquaculturist tests the effects of **salinity** (three levels: 20, 30, 40 ppt) and **temperature** (two levels: 20°C, 28°C) on the percentage survival (%) of juvenile prawns (*Penaeus japonicus*) after 72 hours. Ten replicates per treatment combination are used.
The ANOVA table is:

```
                     Df Sum Sq Mean Sq F value  Pr(>F)
salinity              2 1284.6   642.3   31.48 < 0.001 ***
temperature           1  394.1   394.1   19.32 < 0.001 ***
salinity:temperature  2  187.3    93.7    4.59  0.0132 *
Residuals            54 1101.8    20.4
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

a. State the null hypothesis for the **salinity × temperature interaction** term. **(/ 2)**
b. What does `Residuals df = 54` tell you about the experimental design? Show your reasoning. **(/ 2)**
c. The interaction is significant (*F*(2, 54) = 4.59, *p* = 0.0132). Interpret this biologically. **(/ 3)**
d. Because the interaction is significant, how should you approach the interpretation of the main effects? **(/ 3)**
e. Verify the *F*-value for salinity using the values in the ANOVA table. Show your working. **(/ 2)**

`r if (params$hide_answers) "::: {.content-hidden}"`

::: {.callout-tip appearance="simple"}
**Model Answer — Question 13**

*a.*

- ✓ *H*~0~: The effect of salinity on juvenile prawn survival is the same at both temperatures — the two factors act **additively**, and the response to salinity does not depend on temperature (or equivalently, the temperature effect is the same at all three salinities).

*b.*

- ✓ Total observations *n* = number of cells × replicates per cell = (3 salinity × 2 temperature) × 10 = **60 observations**.
- ✓ Residuals df = total *n* − number of cells = 60 − 6 = **54** ✓. This represents the within-cell error — the variability among the 10 replicates within each salinity × temperature treatment combination.

*c.*

- ✓ A significant interaction means the **effect of salinity on survival depends on temperature** — the response to changing salinity is not the same at 20°C as it is at 28°C.
- ✓ Biologically: at 20°C, prawns may tolerate a broader range of salinities (a flat, moderate survival plateau), while at 28°C (thermal stress), deviations from optimal salinity (30 ppt) may be far more lethal — the additional stressor of temperature reduces the physiological scope to cope with osmotic imbalance.
- ✓ This non-additive (synergistic stressor) interaction has important practical implications: the optimal salinity for prawn aquaculture depends on the culture temperature and cannot be determined from single-factor experiments.

*d.*

- ✓ When the interaction is significant, the main effects are **conditional** — the effect of salinity cannot be summarised as a single universal value, because it differs between the two temperature levels. Reporting "salinity significantly affected survival" without qualification is misleading.
- ✓ The appropriate approach is to interpret and present **simple main effects** — the effect of salinity separately at 20°C and at 28°C, ideally with an interaction plot. Main effects should not be interpreted in isolation when a significant interaction is present.

*e.*

- ✓ *F*~salinity~ = Mean Square~salinity~ / Mean Square~Residuals~ = **642.3 / 20.4 ≈ 31.49** ✓, which agrees with the tabulated 31.48 to within rounding (R computes the ratio from unrounded mean squares).
- ✓ This *F*-ratio means the between-salinity-group variance is about 31.5 times larger than the within-cell residual variance — the salinity differences are far beyond what would be expected from random sampling of a common population, strongly supporting rejection of the salinity *H*~0~.

:::

`r if (params$hide_answers) ":::"`

---

## Question 14 — Welch's *t*-Test Output: Cuttlefish Mantle Length (/12)

The following R output compares mantle length (mm) of *Sepia officinalis* caught at two sites: an inshore estuary and an offshore reef.

```
	Welch Two Sample t-test

data:  mantle_length by site
t = 3.271, df = 31.84, p-value = 0.002614
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  8.432 35.568
sample estimates:
mean in group estuary    mean in group reef
                142.3                 120.0
```

```
Levene's Test for Homogeneity of Variance
      Df F value  Pr(>F)
group  1   9.412 0.00374 **
      46
```

a. Why was Welch's *t*-test used rather than Student's *t*-test? Support your answer with evidence from the output. **(/ 2)**
b. State the null and alternative hypotheses being tested and interpret the *t*-statistic and *p*-value in plain language. **(/ 3)**
c. Interpret the **95% confidence interval** [8.432, 35.568]. What does it tell you about the precision of the estimated difference? **(/ 3)**
d. The degrees of freedom are reported as 31.84 rather than a whole number. Explain why this occurs in Welch's test and what it implies about the two groups' sample sizes and/or variances. **(/ 2)**
e. Write a one-sentence conclusion reporting the result in the style of a scientific paper. **(/ 2)**

`r if (params$hide_answers) "::: {.content-hidden}"`

::: {.callout-tip appearance="simple"}
**Model Answer — Question 14**

*a.*

- ✓ Levene's test is significant (*F*(1, 46) = 9.412, *p* = 0.00374), indicating that the variances differ significantly between the two groups. The homoscedasticity assumption of Student's *t*-test is violated.
- ✓ Welch's *t*-test is appropriate because it does not assume equal variances — it adjusts both the test statistic and the degrees of freedom to account for heteroscedasticity.

*b.*

- ✓ *H*~0~: the true difference in mean mantle length between estuary and reef cuttlefish is zero (μ~estuary~ − μ~reef~ = 0).
- ✓ *H*~A~: the true difference is not equal to zero (μ~estuary~ − μ~reef~ ≠ 0).
- ✓ *t* = 3.271, *p* = 0.0026: the observed difference in means (22.3 mm) is 3.271 standard errors from zero.
  The probability of observing a difference at least this large in either direction by chance alone, assuming *H*~0~ is true, is 0.0026. We reject *H*~0~ at α = 0.05 and conclude that mean mantle length differs significantly between sites.

*c.*

- ✓ We are 95% confident that the true difference in mean mantle length (estuary − reef) lies between **8.43 mm and 35.57 mm**.
- ✓ The interval is relatively wide (spanning ~27 mm), reflecting moderate imprecision in the estimate. However, since the entire interval is positive (the lower bound exceeds zero), we can be confident the estuary cuttlefish are genuinely longer on average — the direction of the difference is clear even if the exact magnitude is uncertain.

*d.*

- ✓ In Welch's test, the degrees of freedom are calculated using the **Welch-Satterthwaite equation**, which weights each group's contribution by its sample variance and sample size. When the two groups have unequal variances (as indicated here by Levene's test), the formula yields a non-integer value — a fractional reduction from the maximum possible df (N − 2 = 46).
- ✓ The fractional df (31.84 vs. a maximum of 46) implies that the two groups contribute unequally to the combined standard error of the difference, likely because one group has a substantially larger variance, which downweights the effective sample size of that group. This results in a more conservative test.

*e.*

- ✓ "Mantle length of *Sepia officinalis* was significantly greater at the inshore estuary (mean = 142.3 mm) than at the offshore reef (mean = 120.0 mm; Welch's *t*-test, *t*(31.84) = 3.271, *p* = 0.003, 95% CI of difference: [8.4, 35.6] mm)."

:::

`r if (params$hide_answers) ":::"`

*End of Version 8*