BCB744 (BioStatistics): Summative Task 2, 12 April 2024

Published

August 17, 2024

Honesty Pledge

This assignment requires that you work as an individual and not share your code, results, or discussion with your peers. Penalties and disciplinary action will apply if you are found cheating.

Acknowledgement of the Pledge

Copy the statement, below, into your document and replace the underscores with your name acknowledging adherence to the UWC’s Honesty Pledge.

I, ____________, hereby state that I have not communicated with or gained information in any way from my peers and that all work is my own.

Instructions

Please note the following instructions. Failing to comply with them in full will result in a loss of marks.

QUARTO –> HTML Submit your assessment answers as an .html file compiled from your Quarto document. Produce fully annotated reports, including the meta-information at the top (name, date, purpose, etc.). Provide ample commentary explaining the purpose of the various tests/sections as necessary.
TESTING OF ASSUMPTIONS For all questions, make sure that when formal inferential statistics are required, each is preceded by the appropriate tests for the assumptions, i.e., state the assumptions, state the statistical procedure for testing the assumptions and mention their corresponding $H_{0}$. If a graphical approach is used to test assumptions, explain the principle behind the approach. Explain the findings emerging from the test of assumptions, and justify your selection of the appropriate inferential test (e.g. t-test, ANOVA, etc.) that you will use.
STATE HYPOTHESES When inferential statistics are required, please provide the full $H_{0}$ and $H_{A}$, and conclude the analysis with a statement of which is accepted or rejected.
GRAPHICAL SUPPORT All descriptive and inferential statistics must be supported by the appropriate figures of the results.
STATEMENT OF RESULTS Make sure that the textual statement of the final result is written exactly as required for it to be published in a journal article. Please consult a journal if you don’t know how.
FORMATTING Pay attention to formatting. Some marks will be allocated to the appearance of the script, including considerations of aspects of the tidiness of the file, the use of the appropriate headings, and adherence to code conventions (e.g. spacing etc.).
MARK ALLOCATION Please see the Introduction Page for an explanation of the assessment approach that will be applied to these questions.

Submit the .html file wherein you provide answers to Questions 1–7 by no later than 19:00 today. Label the script as follows:

BCB744_<Name>_<Surname>_Summative_Task_2.html, e.g.

BCB744_AJ_Smit_Summative_Task_2.html.

Upload your .html files onto Google Forms.

Question 1

Chromosomal effects of mercury-contaminated fish consumption

These data reside in package coin, dataset mercuryfish. The dataframe contains the mercury level in blood, the proportion of cells with abnormalities, and the proportion of cells with chromosome aberrations in consumers of mercury-contaminated fish and a control group. Please see the dataset’s help file for more information.

Analyse the dataset and answer the following questions:

Does the presence of methyl-mercury in a diet containing fish result in a higher proportion of cellular abnormalities?
Does the concentration of mercury in the blood influence the proportion of cells with abnormalities, and does this differ between the control and exposed groups?
Is there a relationship between the variables abnormal and ccells? This will have to be for the control and exposed groups, noting that an interaction effect might be present.

Answers

Does the presence of methyl-mercury in a diet containing fish result in a higher proportion of cellular abnormalities?

library(tidyverse) # provides ggplot2, used for the figures below
library(coin)
data(mercuryfish)
head(mercuryfish)

    group mercury abnormal ccells
1 control     5.3      8.6    2.7
2 control    15.0      5.0    0.5
3 control    11.0      8.4    0.0
4 control     5.8      1.0    0.0
5 control    17.0     13.0    5.0
6 control     7.0      5.0    0.0

# EDA: do a boxplot
ggplot(mercuryfish, aes(x = group, y = abnormal)) +
  geom_boxplot(aes(colour = group), notch = TRUE)

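# Looking at the above figure, the exposed group appears to have a higher
# proportion of abnormal cells (non-overlapping notches would suggest that
# the medians differ). A figure is not a formal test, however, so we will
# now test the assumptions before selecting an inferential test.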

# Testing assumptions
# 1. Normality

# Shapiro-Wilk test
# H0: The data are normally distributed
# Ha: The data are not normally distributed

shapiro.test(mercuryfish$abnormal[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$abnormal[mercuryfish$group == "control"]
W = 0.90267, p-value = 0.0887

shapiro.test(mercuryfish$abnormal[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$abnormal[mercuryfish$group == "exposed"]
W = 0.96841, p-value = 0.6509

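# Both p-values exceed 0.05 (control: p = 0.089; exposed: p = 0.651), so
# we do not reject H0: there is no evidence that the data in either group
# deviate from normality.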

# Test homogeneity of variances

# Levene's test

# H0: The variances are equal
# Ha: The variances are not equal

car::leveneTest(abnormal ~ group, data = mercuryfish)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.6607 0.2055
      37

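# Levene's test is not significant (p = 0.206), so we do not reject H0 and
# treat the variances as equal. Both assumptions are met, so we can go
# ahead and select a Student's two sample t-test.
# H0: The mean proportion of abnormal cells does not differ between groups
# Ha: The mean proportion of abnormal cells differs between groups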

t.test(abnormal ~ group, var.equal = TRUE, data = mercuryfish)


    Two Sample t-test

data:  abnormal by group
t = -2.9664, df = 37, p-value = 0.005253
alternative hypothesis: true difference in means between group control and group exposed is not equal to 0
95 percent confidence interval:
 -7.084765 -1.334257
sample estimates:
mean in group control mean in group exposed 
             4.668750              8.878261

# We reject H0: the proportion of cellular abnormalities is significantly
# higher in the exposed group (mean = 8.88) than in the control group
# (mean = 4.67) (t = -2.97, df = 37, p = 0.005).
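The question asks whether the proportion is *higher* in the exposed group, which is a directional hypothesis, whereas the test above is two-sided. A minimal sketch of the one-sided version follows, assuming the factor levels of `group` are ordered `control`, `exposed` (as the output above suggests). The object name `one_sided` is only illustrative.

```r
# Load only the dataset; the coin package must be installed
data("mercuryfish", package = "coin")

# The formula interface tests mean(first level) - mean(second level),
# so check the level order before choosing the direction
levels(mercuryfish$group)

# alternative = "less" encodes Ha: mean(control) - mean(exposed) < 0,
# i.e. the exposed group has the higher mean
one_sided <- t.test(abnormal ~ group, data = mercuryfish,
                    var.equal = TRUE, alternative = "less")
one_sided
```

A one-sided test is only defensible when the direction is stated before looking at the data, and the question as worded here does that.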

Does the concentration of mercury in the blood influence the proportion of cells with abnormalities, and does this differ between the control and exposed groups?

  # EDA: Scatterplot of mercury concentration vs. proportion of abnormal
  # cells
  
  ggplot(mercuryfish, aes(x = mercury, y = abnormal, color = group)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE) +
    labs(title = "Mercury Concentration vs. Proportion of Abnormal Cells",
         x = "Mercury Concentration",
         y = "Proportion of Abnormal Cells")

  # Test normality of the data
  
  shapiro.test(mercuryfish$mercury[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$mercury[mercuryfish$group == "control"]
W = 0.97435, p-value = 0.9032

  shapiro.test(mercuryfish$mercury[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$mercury[mercuryfish$group == "exposed"]
W = 0.64984, p-value = 3.341e-06

  # We see that the mercury concentrations are normally distributed for the 
  # control group (we do not reject H0) but not for the exposed group
  # (we reject H0); earlier we have seen that the response variable
  # (abnormalities) is normal for both the control and the exposed groups
  
  # But since we want to model a linear relationship, now is not quite the
  # right time to do the tests for normality -- we want to do this for the
  # residuals of the model (that is, we fit the model first, and then test 
  # the residuals for normality)
  
  # We also see from the scatterplot that the data might be approximately
  # linear for the exposed group, but not for the control group where the
  # data are more scattered around very low mercury concentrations near
  # zero
   
  # We also see from the very wide confidence intervals that the model is
  # not very good at predicting the proportion of abnormal cells from
  # mercury concentration in the blood in the exposed group; my guess is
  # that there will not be a linear relationship between mercury
  # concentration and the proportion of abnormal cells in the control or
  # exposed groups
   
  # We can proceed with a linear regression model to assess the
  # relationship
  
  # Fit a linear regression model to assess the relationship between
  # mercury concentration and the proportion of abnormal cells
  # H0(1): There is no relationship between mercury concentration and the
  # proportion of abnormal cells
  # Ha(1): There is a relationship between mercury concentration and the
  # proportion of abnormal cells
  # H0(2): The relationship between mercury concentration and the
  # proportion of abnormal cells does not differ between the control and
  # exposed groups
  # Ha(2): The relationship between mercury concentration and the
  # proportion of abnormal cells differs between the control and exposed
  # groups
  
  model.lm <- lm(abnormal ~ mercury + group, data = mercuryfish)
  summary(model.lm)


Call:
lm(formula = abnormal ~ mercury + group, data = mercuryfish)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1709 -2.5884 -0.2124  2.6725 13.3395 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.622336   1.081910   4.272 0.000135 ***
mercury      0.005193   0.004129   1.258 0.216575    
groupexposed 3.226585   1.610348   2.004 0.052677 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.325 on 36 degrees of freedom
Multiple R-squared:  0.2261,    Adjusted R-squared:  0.1832 
F-statistic:  5.26 on 2 and 36 DF,  p-value: 0.009906

  # We can now check the residuals for normality in the two groups
  # which will confirm that the model is appropriate (or not)
  
  mercuryfish$residuals <- residuals(model.lm)
  shapiro.test(mercuryfish$residuals[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals[mercuryfish$group == "control"]
W = 0.90326, p-value = 0.09066

  shapiro.test(mercuryfish$residuals[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals[mercuryfish$group == "exposed"]
W = 0.94985, p-value = 0.2905

  # The residuals do not deviate significantly from normality in either
  # group (control: p = 0.091; exposed: p = 0.291), so this assumption of
  # the linear model is met

  # The model fitted above is additive (mercury + group) and contains no
  # interaction term, so it cannot test H0(2); that requires fitting
  # abnormal ~ mercury * group and examining the mercury:group term
  # The slope for mercury is not significantly different from zero
  # (p = 0.217), so we do not reject H0(1)
  # The difference in intercept between the groups is marginal
  # (groupexposed: p = 0.053)
  
    
  # What do we conclude?
  # Blood mercury concentration does not significantly predict the
  # proportion of abnormal cells once group is accounted for
  # (slope = 0.005, t = 1.26, p = 0.217). At a given mercury concentration
  # the exposed group tends towards a higher proportion of abnormal cells,
  # but this is not significant at the 0.05 level (p = 0.053). The overall
  # model is significant but explains little of the variation
  # (F(2, 36) = 5.26, p = 0.010, adjusted R-squared = 0.18). Whether the
  # relationship differs between the two groups remains untested by this
  # additive model.
  # There is a good amount of scatter in the amount of cell abnormalities
  # even in just the control group, which suggests that mercury
  # concentration alone may not be a strong predictor of cellular
  # abnormalities. Increasing the amount of mercury in the blood does not
  # necessarily lead to a linear increase, although it may account for a
  # few of the highest values seen in the exposed group.
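H0(2) asks whether the mercury slope differs between the groups, and that requires an interaction term. The sketch below is one way to test it, not necessarily the approach intended for marking. The object names `mod_add` and `mod_int` are illustrative, and no output is shown because it was not part of the original analysis.

```r
# Load only the dataset; the coin package must be installed
data("mercuryfish", package = "coin")

# Additive model: one common mercury slope, group-specific intercepts
mod_add <- lm(abnormal ~ mercury + group, data = mercuryfish)

# Interaction model: mercury * group expands to
# mercury + group + mercury:group, so the slope may differ by group
mod_int <- lm(abnormal ~ mercury * group, data = mercuryfish)

# The row "mercury:groupexposed" tests H0(2)
coef(summary(mod_int))

# Partial F-test of the two nested models; a non-significant result
# favours the simpler additive model
anova(mod_add, mod_int)
```

With a single interaction coefficient, the partial F-test and the t-test on the `mercury:groupexposed` row are equivalent.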

Relationship Between Variables

  # EDA: Scatterplot of the proportion of abnormal cells vs. the
  # proportion of Cu cells
  
  ggplot(mercuryfish, aes(x = abnormal, y = ccells, color = group)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE) +
    labs(title = "Proportion of Cu Cells vs. Proportion of Abnormal Cells",
         x = "Proportion of Abnormal Cells",
         y = "Proportion of Cu Cells")

  # We see that there is a clear linear relationship between abnormal cell
  # proportion and Cu cell proportion in both groups, and the confidence
  # intervals are narrow(-ish), indicating that the model could be
  # reasonably good at predicting Cu cell proportion from the proportion of
  # abnormal cells
  
  # The scatterplot suggests that the relationship between the two
  # continuous variables is linear, so we may proceed with a linear
  # regression model; the remaining assumptions will be tested afterwards
  
  # Fit a linear regression model to assess the relationship between the 
  # proportion of Cu cells and the proportion of abnormal cells
  # H0(1): There is no relationship between the proportion of Cu cells and
  # the proportion of abnormal cells
  # Ha(1): There is a relationship between the proportion of Cu cells and
  # the proportion of abnormal cells
  # H0(2): The relationship between the proportion of Cu cells and the
  # proportion of abnormal cells does not differ between the control and
  # exposed groups
  # Ha(2): The relationship between the proportion of Cu cells and the
  # proportion of abnormal cells differs between the control and exposed
  # groups
  
  model.lm2 <- lm(ccells ~ abnormal + group, data = mercuryfish)
  summary(model.lm2)


Call:
lm(formula = ccells ~ abnormal + group, data = mercuryfish)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4760 -0.7479  0.1761  0.5831  2.0133 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.67797    0.36166  -1.875    0.069 .  
abnormal      0.37547    0.04461   8.417 5.01e-10 ***
groupexposed  0.12272    0.42837   0.286    0.776    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.183 on 36 degrees of freedom
Multiple R-squared:  0.7152,    Adjusted R-squared:  0.6994 
F-statistic: 45.21 on 2 and 36 DF,  p-value: 1.516e-10

  # We can now check the residuals for normality in the two groups
  # which will confirm that the model is appropriate (or not)
  
  mercuryfish$residuals2 <- residuals(model.lm2)
  shapiro.test(mercuryfish$residuals2[mercuryfish$group == "control"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals2[mercuryfish$group == "control"]
W = 0.95476, p-value = 0.5686

  shapiro.test(mercuryfish$residuals2[mercuryfish$group == "exposed"])


    Shapiro-Wilk normality test

data:  mercuryfish$residuals2[mercuryfish$group == "exposed"]
W = 0.96346, p-value = 0.5366

  # The residuals do not deviate significantly from normality in either
  # group (control: p = 0.569; exposed: p = 0.537), so this assumption of
  # the linear model is met

  # The p-value for the 'abnormal' term is less than 0.05, indicating that
  # the relationship between the proportion of abnormal cells and the
  # proportion of Cu cells is significant -- we reject H0(1) in favour of
  # Ha(1)
  # The model is additive (abnormal + group) and contains no interaction
  # term, so H0(2) is not tested here; that requires fitting
  # ccells ~ abnormal * group and examining the abnormal:group term
  # The 'groupexposed' term (p = 0.776) shows only that the intercept does
  # not differ between the groups
    
  # What do we conclude?
  # The proportion of Cu cells increases significantly with the proportion
  # of abnormal cells (slope = 0.375, t = 8.42, p < 0.001), and the model
  # explains most of the variation (F(2, 36) = 45.21, p < 0.001,
  # adjusted R-squared = 0.70). At a given proportion of abnormal cells,
  # the proportion of Cu cells does not differ between the control and
  # exposed groups (p = 0.776). The residuals are normally distributed
  # for both groups, which supports the use of the linear model.
  
  # Alternative approaches for assigning marks: Instead of doing a linear
  # regression with interaction term, which I did not formally teach,
  # equally justified are individual linear regressions for each group
  # and using the confidence intervals to make inferences. This would
  # involve fitting two linear regression models, one for each group, and
  # comparing the confidence intervals of the coefficients to determine if
  # the relationship between the proportion of Cu cells and the proportion
  # of abnormal cells differs between the two groups. This would apply to
  # all the other questions as well.
  
  # Or, in part (c), we could have done correlations for each group and
  # compared the correlation coefficients to determine if the relationship
  # between the proportion of abnormal cells and the proportion of Cu cells
  # differs between the two groups.
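The two alternative approaches mentioned above (separate regressions per group compared by confidence interval, and per-group correlations) could be sketched as follows. The object names (`by_group`, `fits`, `slope_ci`, `cor_ci`) are illustrative only, and no output is shown because this analysis was not run in the original.

```r
# Load only the dataset; the coin package must be installed
data("mercuryfish", package = "coin")

# Split the data by group and fit one simple regression per group
by_group <- split(mercuryfish, mercuryfish$group)
fits <- lapply(by_group, function(d) lm(ccells ~ abnormal, data = d))

# 95% confidence interval of the slope in each group (one row per group)
slope_ci <- t(sapply(fits, function(m) confint(m)["abnormal", ]))
slope_ci

# Pearson correlation per group with its 95% confidence interval
cor_ci <- t(sapply(by_group, function(d) {
  ct <- cor.test(d$ccells, d$abnormal)
  c(r = unname(ct$estimate),
    lower = ct$conf.int[1],
    upper = ct$conf.int[2])
}))
cor_ci
```

Overlapping intervals are only an informal indication that the slopes or correlations are similar. A formal comparison of slopes still needs the interaction term.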

Question 2

Malignant glioma pilot study

Package coin, dataset glioma: A non-randomized pilot study on malignant glioma patients with pretargeted adjuvant radioimmunotherapy using yttrium-90-biotin.

Do sex and group interact to affect survival time (time)?
Do age and histology interact to affect survival time (time)?
Show a full graphical exploration of the data. Are there any other remaining patterns visible in the data that should be explored statistically? Study your results, select the most promising and insightful question that remains, and do the analysis.
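As a starting point for the first part, an interaction between two factors can be tested with a two-way linear model. The sketch below assumes the `glioma` columns `sex`, `group` and `time` named in the question, and the object names `fit_sex_group` and `tab` are illustrative. It skips the assumption tests, which must still come first, and it treats `time` as an ordinary continuous response, ignoring any censoring recorded in the dataset.

```r
# Load only the dataset; the coin package must be installed
data("glioma", package = "coin")

# sex * group expands to sex + group + sex:group
fit_sex_group <- lm(time ~ sex * group, data = glioma)

# ANOVA table; the row "sex:group" tests the interaction
tab <- anova(fit_sex_group)
tab

# Cell means help with interpreting any interaction
aggregate(time ~ sex + group, data = glioma, FUN = mean)
```

The age and histology part follows the same pattern, but `age` is continuous, so the interaction there describes a difference in slopes rather than in cell means.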

Question 3

Risk factors associated with low infant birth weight

Package MASS, dataset birthwt: A dataset about the risk factors associated with low infant birth mass collected at Baystate Medical Center, Springfield, Mass. during 1986.

State three hypotheses and test them. Make sure one of the tests makes use of the 95% confidence interval approach rather than a formal inferential methodology.

Question 4

The `LungCapData.csv` data

Using the Lung Capacity data provided, please calculate the 95% CIs for the LungCap variable as a function of:
- Gender
- Smoke
- Caesarean

Create a graph of the mean ± 95% CIs and determine if there are statistical differences in LungCap between the levels of Gender, Smoke, and Caesarean. Do the same using inferential statistics. Are your findings the same using these two approaches?

Produce all the associated tests for assumptions—i.e. the assumptions to be met when deciding whether to use your choice of inferential test or its non-parametric counterpart.

Create a combined tidy dataframe (observe tidy principles) with the estimates for the 95% CI for the LungCap data (LungCap as a function of Gender), estimated using both the traditional and bootstrapping approaches. Create a plot comprising two panels (one for the traditional estimates, one for the bootstrapped estimates) of the mean, median, scatter of raw data points, and the upper and lower 95% CI.

Undertake a statistical analysis that incorporates both the effect of Age and one of the categorical variables on LungCap. What new insight does this provide?
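The traditional and bootstrapped 95% CIs asked for above could be computed along these lines. This is a sketch on simulated values, not on `LungCapData.csv`: the data frame `lung`, its numbers, and the helpers `ci_t`, `ci_boot` and `ci_one` are all hypothetical. The percentile bootstrap is one of several bootstrap intervals.

```r
set.seed(744)

# Simulated stand-in for LungCap by Gender (hypothetical values)
lung <- data.frame(
  Gender  = rep(c("female", "male"), each = 50),
  LungCap = c(rnorm(50, mean = 7.4, sd = 2.5),
              rnorm(50, mean = 8.3, sd = 2.6))
)

# Traditional CI: mean +/- t quantile * standard error
ci_t <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))
  q  <- qt(1 - (1 - level) / 2, df = length(x) - 1)
  c(mean = mean(x), lower = mean(x) - q * se, upper = mean(x) + q * se)
}

# Percentile bootstrap CI: resample with replacement, take the quantiles
# of the resampled means
ci_boot <- function(x, level = 0.95, n_boot = 2000) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  a <- (1 - level) / 2
  c(mean  = mean(x),
    lower = unname(quantile(boot_means, a)),
    upper = unname(quantile(boot_means, 1 - a)))
}

# Apply a CI function per Gender and return one tidy row per group
ci_one <- function(f, method) {
  out <- do.call(rbind, lapply(split(lung$LungCap, lung$Gender), f))
  data.frame(Gender = rownames(out), method = method, out,
             row.names = NULL)
}

# Combined tidy data frame: one row per Gender x method
ci_tidy <- rbind(ci_one(ci_t, "traditional"), ci_one(ci_boot, "bootstrap"))
rownames(ci_tidy) <- NULL
ci_tidy
```

Because `method` is a column rather than part of the column names, the result can be passed straight to `ggplot()` with `facet_wrap(~ method)` for the two-panel figure.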

Question 5

The air quality data

Package datasets, dataset airquality. These are daily air quality measurements in New York, May to September 1973. See the help file for details.

Which two of the four response variables are best correlated with each other?
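One way to approach this is to compute all pairwise correlations and pick the largest in absolute value. The sketch assumes the four variables meant are `Ozone`, `Solar.R`, `Wind` and `Temp`, and it uses Pearson's coefficient without first checking its assumptions. If normality fails, `method = "spearman"` is the usual alternative.

```r
# The four measured variables; Month and Day are excluded
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Pairwise correlations; missing values are dropped per pair of variables
cor_mat <- cor(vars, use = "pairwise.complete.obs", method = "pearson")
round(cor_mat, 2)

# Find the off-diagonal entry with the largest absolute correlation
abs_cor <- abs(cor_mat)
diag(abs_cor) <- 0
best <- which(abs_cor == max(abs_cor), arr.ind = TRUE)[1, ]
rownames(cor_mat)[best]
```

The absolute value matters because a strong negative correlation counts as "well correlated" just as much as a strong positive one.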

Question 6

The `shells.csv` data

This dataset contains measurements of shell widths and lengths of the left and right valves of two species of mussels, Aulacomya sp. and Choromytilus sp. Length and width measurements are presented in mm.

Fully analyse this dataset.

Question 7

The `fertiliser_crop_data.csv` data

The data represent an experiment designed to test whether or not fertiliser type and the density of planting have an effect on the yield of wheat. The dataset contains the following variables:

Final yield (kg per acre)—make sure to convert this to the most suitable SI unit before continuing with your analysis
Type of fertiliser (fertiliser type A, B, or C)
Planting density (1 = low density, 2 = high density)
Block in the field (north, east, south, west)

Fully analyse this dataset.
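The unit conversion asked for above could look like the sketch below. It assumes kg per hectare is the intended target (the hectare is not itself an SI unit but is accepted for use with SI; kg per square metre would be the strict SI form). The helper `kg_acre_to_kg_ha` and the example values are hypothetical.

```r
# Exact definitions: 1 international acre = 4046.8564224 m^2 and
# 1 hectare = 10000 m^2, so 1 acre = 0.40468564224 ha
acre_to_ha <- 4046.8564224 / 10000

# Hypothetical helper: convert a yield from kg per acre to kg per hectare
kg_acre_to_kg_ha <- function(yield_kg_acre) {
  yield_kg_acre / acre_to_ha
}

# Example call on made-up yields
kg_acre_to_kg_ha(c(100, 177.2))
```

Dividing by the conversion factor is correct because a hectare is larger than an acre, so the same crop yields more mass per hectare than per acre.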

Question 8

Reflect on the project you intend doing during your Honours year. Specifically, focus on your experimental or sampling design (even though this might not be fully known at this stage), the nature of the data you anticipate obtaining, and the statistical analyses you will perform. Structure your response as follows:

Provide a brief Aim and state the Objectives
What are your predictions?
Write down the hypotheses you will test
Describe the experimental or sampling design that will support testing the hypotheses
Describe the data you anticipate obtaining
What statistical analyses will you perform on the data?

For those of you who will not generate data suitable for statistical analysis, please reflect on

The end

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{j._smit,
  author = {J. Smit, Albertus},
  title = {BCB744 {(BioStatistics):} {Summative} {Task} 2, 12 {April}
    2024},
  date = {},
  url = {http://tangledbank.netlify.app/assessments/BCB744_Summative_Task_2_2024.html},
  langid = {en}
}

For attribution, please cite this work as:

J. Smit A BCB744 (BioStatistics): Summative Task 2, 12 April 2024. http://tangledbank.netlify.app/assessments/BCB744_Summative_Task_2_2024.html.

--- title: "BCB744 (BioStatistics): Summative Task 2, 12 April 2024" date: "`r Sys.Date()`" format: html: number-sections: false --- ## Honesty Pledge **This assignment requires that you work as an individual and not share your code, results, or discussion with your peers. Penalties and disciplinary action will apply if you are found cheating.** ::: callout-note ## Acknowledgement of the Pledge Copy the statement, below, into your document and replace the underscores with your name acknowledging adherence to the UWC's Honesty Pledge. **I, \_\_\_\_\_\_\_\_\_\_\_\_, hereby state that I have not communicated with or gained information in any way from my peers and that all work is my own.** ::: ```{r} #| echo: false library(tidyverse) library(ggpubr) ``` ## Instructions Please note the following instructions. Failing to comply with them in full will result in a loss of marks. * **QUARTO --> HTML** Submit your assessment answers as an .html file compiled from your Quarto document. Produce *fully annotated reports*, including the meta-information at the top (name, date, purpose, etc.). Provide ample commentary explaining the purpose of the various tests/sections as necessary. * **TESTING OF ASSUMPTIONS** For all questions, make sure that when *formal inferential statistics are required, each is preceded by the appropriate tests for the assumptions*, i.e., state the assumptions, state the statistical procedure for testing the assumptions and mention their corresponding $H_{0}$. If a graphical approach is used to test assumptions, explain the principle behind the approach. Explain the findings emerging from the test of assumptions, and justify your selection of the appropriate inferential test (e.g. *t*-test, ANOVA, etc.) that you will use. * **STATE HYPOTHESES** When inferential statistics are required, please provide the full $H_{0}$ and $H_{A}$, and conclude the analysis with a statement of which is accepted or rejected. 
* **GRAPHICAL SUPPORT** All descriptive and inferential statistics must be supported by the appropriate figures of the results. * **STATEMENT OF RESULTS** Make sure that the textual statement of the final result is written exactly as required for it to be published in a journal article. Please consult a journal if you don't know how. * **FORMATTING** Pay attention to formatting. Some marks will be allocated to the appearance of the script, including considerations of aspects of the tidiness of the file, the use of the appropriate headings, and adherence to code conventions (e.g. spacing etc.). * **MARK ALLOCATION** Please see the [Introduction Page](https://tangledbank.netlify.app/bcb744/bcb744_index#summative-tasks) for an explanation of the assessment approach that will be applied to these questions. Submit the .html file wherein you provide answers to Questions 1–7 by no later than 19:00 today. Label the script as follows: BCB744_\<Name\>_\<Surname\>_Summative_Task_2.html, e.g. BCB744_AJ_Smit_Summative_Task_2.html. Upload your .html files onto [Google Forms](https://docs.google.com/forms/d/e/1FAIpQLSfBO9a42ESrL3E3ytIFIqMGvnei19ynWisPpJHGU8P9DXvyow/viewform?usp=sf_link). ## Question 1 ### Chromosomal effects of mercury-contaminated fish consumption These data reside in package **coin**, dataset `mercuryfish`. The dataframe contains the mercury level in blood, the proportion of cells with abnormalities, and the proportion of cells with chromosome aberrations in consumers of mercury-contaminated fish and a control group. Please see the dataset's help file for more information. Analyse the dataset and answer the following questions: a. Does the presence of methyl-mercury in a diet containing fish result in a higher proportion of cellular abnormalities? b. Does the concentration of mercury in the blood influence the proportion of cells with abnormalities, and does this differ between the `control` and `exposed` groups? c. 
Is there a relationship between the variables `abnormal` and `ccells`? This will have to be for the `control` and `exposed` groups, noting that an interaction effect *might* be present. ### Answers a. Does the presence of methyl-mercury in a diet containing fish result in a higher proportion of cellular abnormalities? ```{r} library(coin) data(mercuryfish) head(mercuryfish) # EDA: do a boxplot ggplot(mercuryfish, aes(x = group, y = abnormal)) + geom_boxplot(aes(colour = group), notch = TRUE) # Looking at the above figure, we see that there is a statistically # significant difference between the two groups. We will now test the # assumption. # Testing assumptions # 1. Normality # Shapiro-Wilk test # H0: The data are normally distributed # Ha: The data are not normally distributed shapiro.test(mercuryfish$abnormal[mercuryfish$group == "control"]) shapiro.test(mercuryfish$abnormal[mercuryfish$group == "exposed"]) # We see that the data are normally distributed. # Test homogeneity of variances # Levene's test # H0: The variances are equal # Ha: The variances are not equal car::leveneTest(abnormal ~ group, data = mercuryfish) # We can therefore go ahead and perform the test. # We select a Student's two sample t-test t.test(abnormal ~ group, var.equal = TRUE, data = mercuryfish) # We now have confirmation that the presence of methyl-mercury in a diet # will have a significant effect on the proportion of cellular abnormalities. ``` b. Does the concentration of mercury in the blood influence the proportion of cells with abnormalities, and does this differ between the `control` and `exposed` groups? ```{r} # EDA: Scatterplot of mercury concentration vs. proportion of abnormal # cells ggplot(mercuryfish, aes(x = mercury, y = abnormal, color = group)) + geom_point() + geom_smooth(method = "lm", se = TRUE) + labs(title = "Mercury Concentration vs. 
Proportion of Abnormal Cells", x = "Mercury Concentration", y = "Proportion of Abnormal Cells") # Test normality of the data shapiro.test(mercuryfish$mercury[mercuryfish$group == "control"]) shapiro.test(mercuryfish$mercury[mercuryfish$group == "exposed"]) # We see that the mercury concentrations are normally distributed for the # control group (we do not reject H0) but not for the exposed group # (we reject H0); earlier we have seen that the response variable # (abnormalities) is normal for both the control and the exposed groups # But since we want to model a linear relationship, now is not quite the # right time to do the tests for normality -- we want to do this for the # residuals of the model (that is, we fit the model first, and then test # the residuals for normality) # We also see from the scatterplot that the data might be approximately # linear for the exposed group, but not for the control group where the # data are more scattered around very low mercury concentrations near # zero # We also see from the very wide confidence inrtervals that the model is # not very good at predicting the proportion of abnormal cells from # mercury concentration in the blood in the exposed group; my guess is # that there will not be a linear relationship between mercury # concentration and the proportion of abnormal cells in the control or # exposed groups # We can proceed with a linear regression model to assess the # relationship # Fit a linear regression model to assess the relationship between # mercury concentration and the proportion of abnormal cells # H0(1): There is no relationship between mercury concentration and the # proportion of abnormal cells # Ha(1): There is a relationship between mercury concentration and the # proportion of abnormal cells # H0(2): The relationship between mercury concentration and the # proportion of abnormal cells does not differ between the control and # exposed groups # Ha(2): The relationship between mercury concentration and the # 
proportion of abnormal cells differs between the control and exposed
# groups
# The formula uses `*` so that the mercury:group interaction term
# discussed below is actually part of the fitted model
model.lm <- lm(abnormal ~ mercury * group, data = mercuryfish)
summary(model.lm)

# We can now check the residuals for normality in the two groups
# which will confirm that the model is appropriate (or not)
mercuryfish$residuals <- residuals(model.lm)
shapiro.test(mercuryfish$residuals[mercuryfish$group == "control"])
shapiro.test(mercuryfish$residuals[mercuryfish$group == "exposed"])

# We see that the residuals are normally distributed for both groups
# and hence using a linear model was appropriate

# The p-value for the interaction term is not less than 0.05, indicating
# that the relationship between mercury concentration and the proportion
# of abnormal cells does not differ between the control and exposed
# groups -- we can reject Ha(1) and Ha(2)

# If we wanted to (recommended), we could refit the model without the
# interaction term

# What do we conclude?
# The proportion of abnormal cells differs significantly between the
# control and exposed groups, with the exposed group exhibiting a higher
# proportion of abnormal cells. However, the relationship between mercury
# concentration and the proportion of abnormal cells does not differ
# between the two groups.

# There is a good amount of scatter in the amount of cell abnormalities
# even in just the control group, which suggests that mercury
# concentration alone may not be a strong predictor of cellular
# abnormalities. Increasing the amount of mercury in the blood does not
# necessarily lead to a linear increase but it certainly does account
# for a few of the highest values seen in the exposed group.
```

c. **Relationship Between Variables**

```{r}
# EDA: Scatterplot of the proportion of Cu cells vs. the proportion of
# abnormal cells
ggplot(mercuryfish, aes(x = abnormal, y = ccells, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Proportion of Cu Cells vs. Proportion of Abnormal Cells",
       x = "Proportion of Abnormal Cells",
       y = "Proportion of Cu cells")

# We see that there is a clear linear relationship between abnormal cell
# proportion and Cu cell proportion in both groups, and the confidence
# intervals are narrow(-ish), indicating that the model could be
# reasonably good at predicting Cu cell proportion from the proportion of
# abnormal cells

# We know the relationship between continuous covariates is linear and
# may therefore proceed with a linear regression model; the remaining
# assumptions will be tested afterwards

# Fit a linear regression model to assess the relationship between the
# proportion of Cu cells and the proportion of abnormal cells

# H0(1): There is no relationship between the proportion of Cu cells and
# the proportion of abnormal cells
# Ha(1): There is a relationship between the proportion of Cu cells and
# the proportion of abnormal cells

# H0(2): The relationship between the proportion of Cu cells and the
# proportion of abnormal cells does not differ between the control and
# exposed groups
# Ha(2): The relationship between the proportion of Cu cells and the
# proportion of abnormal cells differs between the control and exposed
# groups

# The formula uses `*` so that the abnormal:group interaction term
# needed to test H0(2) is part of the fitted model
model.lm2 <- lm(ccells ~ abnormal * group, data = mercuryfish)
summary(model.lm2)

# We can now check the residuals for normality in the two groups
# which will confirm that the model is appropriate (or not)
mercuryfish$residuals2 <- residuals(model.lm2)
shapiro.test(mercuryfish$residuals2[mercuryfish$group == "control"])
shapiro.test(mercuryfish$residuals2[mercuryfish$group == "exposed"])

# We see that the residuals are normally distributed for both groups
# and hence using a linear model was appropriate

# The p-value for the 'abnormal' term is less than 0.05, indicating that
# the relationship between the proportion of abnormal cells and the
# proportion of Cu cells is significant -- we accept Ha(1)

# The p-value for the interaction term is not less than 0.05, indicating
# that the relationship between the proportion of abnormal cells and the
# proportion of Cu cells does not differ between the control and exposed
# groups -- we do not reject H0(2)

# What do we conclude?
# The proportion of Cu cells is significantly related to the proportion
# of abnormal cells, with a higher proportion of abnormal cells
# corresponding to a higher proportion of Cu cells. This relationship
# does not differ between the control and exposed groups. The model is
# appropriate for predicting the proportion of Cu cells from the
# proportion of abnormal cells, as the residuals are normally distributed
# for both groups.

# Alternative approaches for assigning marks: Instead of doing a linear
# regression with interaction term, which I did not formally teach,
# equally justified are individual linear regressions for each group
# and using the confidence intervals to make inferences. This would
# involve fitting two linear regression models, one for each group, and
# comparing the confidence intervals of the coefficients to determine if
# the relationship between the proportion of Cu cells and the proportion
# of abnormal cells differs between the two groups. This would apply to
# all the other questions as well.

# Or, in part (c), we could have done correlations for each group and
# compared the correlation coefficients to determine if the relationship
# between the proportion of abnormal cells and the proportion of Cu cells
# differs between the two groups.
```

## Question 2

### Malignant glioma pilot study

Package **coin**, dataset `glioma`: A non-randomized pilot study on malignant glioma patients with pretargeted adjuvant radioimmunotherapy using yttrium-90-biotin.

a. Do `sex` and `group` interact to affect survival time (`time`)?

b. Do `age` and `histology` interact to affect survival time (`time`)?

c. Show a full graphical exploration of the data. Are there any other remaining patterns visible in the data that should be explored statistically?
Study your results, select the most promising and insightful question that remains, and do the analysis.

## Question 3

### Risk factors associated with low infant birth weight

Package **MASS**, dataset `birthwt`: A dataset about the risk factors associated with low infant birth mass collected at Baystate Medical Center, Springfield, Mass. during 1986.

State three hypotheses and test them. Make sure one of the tests makes use of the 95% confidence interval approach rather than a formal inferential methodology.

## Question 4

### The [`LungCapData.csv`](https://github.com/ajsmit/R_courses/raw/main/static/data/LungCapData.csv) data

a. Using the Lung Capacity data provided, please calculate the 95% CIs for the `LungCap` variable as a function of:

    * `Gender`
    * `Smoke`
    * `Caesarean`

b. Create a graph of the mean ± 95% CIs and determine if there are statistical differences in `LungCap` between the levels of `Gender`, `Smoke`, and `Caesarean`. Do the same using inferential statistics. Are your findings the same using these two approaches?

c. Produce all the associated tests for assumptions---i.e. the assumptions to be met when deciding whether to use your choice of inferential test or its non-parametric counterpart.

d. Create a combined tidy dataframe (observe tidy principles) with the estimates for the 95% CI for the `LungCap` data (`LungCap` as a function of `Gender`), estimated using both the traditional and bootstrapping approaches. Create a plot comprising two panels (one for the traditional estimates, one for the bootstrapped estimates) of the mean, median, scatter of raw data points, and the upper and lower 95% CI.

e. Undertake a statistical analysis that incorporates both the effect of `Age` *and* one of the categorical variables on `LungCap`. What new insight does this provide?

## Question 5

### The air quality data

Package **datasets**, dataset `airquality`.
These are daily air quality measurements in New York, May to September 1973. See the help file for details.

a. Which two of the four response variables are best correlated with each other?

## Question 6

### The **[`shells.csv`](https://raw.githubusercontent.com/ajsmit/R_courses/main/static/data/shells.csv)** data

This dataset contains measurements of shell widths and lengths of the left and right valves of two species of mussels, *Aulacomya* sp. and *Choromytilus* sp. Length and width measurements are presented in mm.

Fully analyse this dataset.

## Question 7

### The [`fertiliser_crop_data.csv`](https://raw.githubusercontent.com/ajsmit/R_courses/main/static/data/fertiliser_crop_data.csv) data

The data represent an experiment designed to test whether or not fertiliser type and the density of planting have an effect on the yield of wheat. The dataset contains the following variables:

* Final yield (kg per acre)---make sure to convert this to the most suitable SI unit before continuing with your analysis
* Type of fertiliser (fertiliser type A, B, or C)
* Planting density (1 = low density, 2 = high density)
* Block in the field (north, east, south, west)

Fully analyse this dataset.

## Question 8

Reflect on the project you intend doing during your Honours year. Specifically, focus on your experimental or sampling design (even though this might not be fully known at this stage), the nature of the data you anticipate obtaining, and the statistical analyses you will perform. Structure your response as follows:

- Provide a brief Aim and state the Objectives
- What are your predictions?
- Write down the hypotheses you will test
- Describe the experimental or sampling design that will support testing the hypotheses
- Describe the data you anticipate obtaining
- What statistical analyses will you perform on the data?

For those of you who will not generate data suitable for statistical analysis, please reflect on

## The end

Submit the .html file wherein you provide answers to Questions 1–7 by no later than 19:00 today. Label the script as follows:

BCB744_\<Name\>_\<Surname\>_Summative_Task_2.html, e.g. BCB744_AJ_Smit_Summative_Task_2.html.

Upload your .html files onto [Google Forms](https://docs.google.com/forms/d/e/1FAIpQLSfBO9a42ESrL3E3ytIFIqMGvnei19ynWisPpJHGU8P9DXvyow/viewform?usp=sf_link).
