2. Exploring With Summaries and Descriptions

Question 1

a. Explain the output of dimnames() when applied to the penguins dataset. (/2)

b. Explain the output of str() when applied to the penguins dataset. (/3)

Answer

```r
library(palmerpenguins)
data(penguins)

# a.
dimnames(penguins)

# b.
str(penguins)
```

    tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
     $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
     $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
     $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
     $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
     $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
     $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
     $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
✓ dimnames() returns the names of the rows and columns of the dataset.
✓ The rows are numbered 1 to 344, and the columns are named species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
✓ str() provides a concise summary of the dataset’s structure. It shows the number of observations (344) and variables (8), the variable names, and their data types.
✓ The dataset contains 344 observations of 8 variables.
✓ The variables are a mix of factor, numeric (double), and integer data types.
Question 2
How would you manually calculate the mean value for the normal_data we generated in the lecture? (/3)
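Answer

```r
# Recreate the normal_data from the lecture
set.seed(666)
n <- 5000
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# The mean is the sum of all observations divided by their number
round(sum(normal_data) / length(normal_data), 3)
```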
✓✓✓ The mean value of the normal_data is calculated by summing all the data points and dividing by the number of data points. Assign three marks if they correctly calculate the mean of the normal_data without using the function mean().
Question 3
Find the faithful dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use ggplot()), and conclude with a brief statement about the data distribution. (/10)
Answer
```r
# Load the faithful dataset
data(faithful)
library(ggpubr)

plt1 <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(fill = "lightblue", color = "black") +
  labs(title = "Eruptions", x = "Duration (minutes)", y = "Frequency") +
  theme_minimal()

plt2 <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(fill = "lightgreen", color = "black") +
  labs(title = "Waiting", x = "Time (minutes)", y = "Frequency") +
  theme_minimal()

# Arrange the plots side by side
ggarrange(plt1, plt2, ncol = 2)

# Distribution shape measures
library(e1071)
kurtosis(faithful$eruptions)
#> [1] -1.511605
kurtosis(faithful$waiting)
#> [1] -1.156263
skewness(faithful$eruptions)
#> [1] -0.4135498
skewness(faithful$waiting)
#> [1] -0.414025
```
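For transparency, these statistics can be checked against the classical moment-based definitions. The sketch below computes the g1 and g2 estimators by hand; note that e1071 implements several finite-sample variants, so its default output may differ slightly from these values.

```r
# Manual moment-based skewness (g1) and excess kurtosis (g2) for
# eruption duration; e1071's defaults apply small finite-sample
# corrections, so values may differ slightly from the above
x <- faithful$eruptions
m <- function(k) mean((x - mean(x))^k)  # k-th central sample moment
m(3) / m(2)^(3/2)   # skewness, g1
m(4) / m(2)^2 - 3   # excess kurtosis, g2
```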
The faithful dataset contains data on the Old Faithful geyser in Yellowstone National Park. It measures the eruption duration (in minutes) and the waiting time between eruptions (also in minutes). We use both graphical and statistical measures to understand the central tendency and overall distribution of each variable.
1. Eruption Duration
Using ggplot2, the histogram of eruption durations reveals a bimodal distribution – one cluster of short eruptions around 2 minutes and another around 4.3 minutes. This visible separation implies that computing a single mean or median would obscure meaningful structural information in the data.
Statistical measures of distribution shape support the visual impression:
Skewness = -0.4135
Kurtosis = -1.5116
The negative skewness suggests a slight asymmetry with a longer left tail, though the bimodal structure renders this value difficult to interpret in isolation. The negative kurtosis implies a flatter distribution compared to a normal curve – again, a consequence of the data’s underlying bimodality.
2. Waiting Time
The histogram for waiting times similarly displays two overlapping modes, with denser regions around 55 and 80 minutes.
The numerical descriptors are:
Skewness = -0.4140
Kurtosis = -1.1563
As with eruption durations, the slightly negative skew and platykurtic (low kurtosis) profile again reflect a flattened, spread-out distribution – not tightly peaked, and not symmetric.
3. Conclusion About the Data’s Distribution
Due to the bimodal nature of both variables, neither is well characterised by a single measure of central tendency. We can still compute mean and median values, but their interpretive value is diminished in the presence of this obvious bimodality. The bimodality indicates a structural feature of the physical process (i.e. not merely a statistical artefact), suggestive of two regimes of geyser activity (short/long eruptions, brief/long waits).
The skewness and kurtosis figures reinforce the visual presentation of the data structure: the distributions are asymmetrical and flatter than a normal curve, making parametric assumptions (e.g., normality in linear models) poorly suited unless the data are stratified or transformed. The histograms and moment-based statistics highlight the need to interrogate shape and structure – not merely central values – and suggest that a more detailed approach be taken to studying the processes at work.
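To illustrate the kind of stratification mentioned above, here is a minimal sketch; the 3-minute cut-off between short and long eruptions is an assumption, chosen by eye from the histogram.

```r
# A minimal stratification sketch: split eruptions into two assumed
# regimes at 3 minutes (cut-off chosen by eye from the histogram)
# and summarise each regime separately
library(tidyverse)

faithful %>%
  mutate(regime = if_else(eruptions < 3, "short", "long")) %>%
  group_by(regime) %>%
  summarise(
    mean_eruptions = mean(eruptions),
    median_eruptions = median(eruptions),
    mean_waiting = mean(waiting),
    n = n()
  )
```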
✓ 8/10 for the correct graphical representation and interpretation of the data distribution, and for the calculation of skewness and kurtosis.
✓ 2/10 for some sensible explanation of what this means.
Question 4
Manually calculate the variance and SD for the normal_data we generated in the lecture. Make sure your answers are the same as those reported there. (/5)
Answer
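For reference, the quantities being computed are the sample variance and standard deviation:

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}
\]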
```r
# Using the equations for sample variance and SD
# (not the built-in functions), do:

# Variance
(norm_var <- round(sum((normal_data - mean(normal_data))^2) / (length(normal_data) - 1), 3))
#> [1] 1.002

# Standard deviation
(norm_sd <- round(sqrt(norm_var), 3))
#> [1] 1.001
```
Question 5
Write a few lines of code to demonstrate that the \((0-0.25]\), \((0.25-0.5]\), \((0.5-0.75]\), and \((0.75-1]\) quantiles of the normal_data we generated in the lecture indeed conform to the formal definition of quantiles. I.e., show manually how you can determine that 25% of the observations indeed fall below -0.66 for the normal_data. Explain the rationale for your approach. (/10)
Answer
```r
# Generate random data from a normal distribution
set.seed(666)
n <- 5000 # Number of data points
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# Calculate the quantiles
q_25 <- quantile(normal_data, p = 0.25)
q_50 <- quantile(normal_data, p = 0.50)
q_75 <- quantile(normal_data, p = 0.75)
q_100 <- quantile(normal_data, p = 1.00)

# Verify each quantile interval
(count_0_to_25 <- sum(normal_data <= q_25))
#> [1] 1250
(count_25_to_50 <- sum(normal_data > q_25 & normal_data <= q_50))
#> [1] 1250
(count_50_to_75 <- sum(normal_data > q_50 & normal_data <= q_75))
#> [1] 1250
(count_75_to_100 <- sum(normal_data > q_75 & normal_data <= q_100))
#> [1] 1250

# Calculate percentages
(perc_0_to_25 <- count_0_to_25 / n * 100)
#> [1] 25
(perc_25_to_50 <- count_25_to_50 / n * 100)
#> [1] 25
(perc_50_to_75 <- count_50_to_75 / n * 100)
#> [1] 25
(perc_75_to_100 <- count_75_to_100 / n * 100)
#> [1] 25

# Make a figure to visualise
library(ggplot2)
ggplot(data.frame(x = normal_data), aes(x = x)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  geom_vline(xintercept = c(q_25, q_50, q_75),
             color = c("red", "blue", "green"), linetype = "dashed") +
  labs(title = "Histogram of normal_data with quantiles",
       x = "Value", y = "Frequency") +
  theme_minimal()
```
To demonstrate that the \((0-0.25]\), \((0.25-0.5]\), \((0.5-0.75]\), and \((0.75-1]\) quantiles of our generated normal_data conform to the formal definition of quantiles, we must verify that approximately 25% of observations fall within each quantile interval. The formal definition states that the \(p\)-th quantile is a value \(q_p\) such that the proportion of observations less than or equal to \(q_p\) is approximately \(p\).
Approach
In practice, there are three key steps:
1. Compute the quantiles using the quantile() function.
2. Count observations in each quantile interval by using logical expressions to filter the data:
   - For the first interval \((0-0.25]\): count values \(\leq q_{0.25}\)
   - For the second interval \((0.25-0.5]\): count values \(> q_{0.25}\) and \(\leq q_{0.50}\)
   - For the third interval \((0.5-0.75]\): count values \(> q_{0.50}\) and \(\leq q_{0.75}\)
   - For the fourth interval \((0.75-1]\): count values \(> q_{0.75}\)
3. Convert counts to percentages by dividing by the total number of observations and multiplying by 100.
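As a compact cross-check (using base R's ecdf(), which the model answer does not require), the empirical cumulative distribution function evaluated at each quantile should return approximately the corresponding probability:

```r
# The empirical CDF evaluated at each quantile gives the proportion
# of observations at or below it; by the definition of quantiles,
# these should be close to 0.25, 0.50, and 0.75
ecdf(normal_data)(quantile(normal_data, p = c(0.25, 0.50, 0.75)))
```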
✓ (x 10) They must correctly calculate the quantiles and verify that approximately 25% (i.e. obtain the percentage value in the calcs) of the observations fall within each quantile interval. They should also provide a clear explanation of their approach. No need to have a figure, but some bonus marks may be given if one is provided.
Question 6
Why is it important to consider the grouping structures that might be present within our datasets? (/2)
Answer
Simpson’s Paradox: Relationships seen in aggregated data can reverse or disappear when subgroups are examined separately.
Heterogeneity: Different groups within data often exhibit distinct statistical properties. A single mean (and/or SD) across all observations masks this variability and can provide a distorted view of the real phenomena.
Statistical Independence: Many statistical tests assume independence of observations. If data contain hierarchical or nested structures (e.g., individuals within ecosystems, repeated measurements from the same subjects), this assumption is violated and might invalidate statistical inferences.
Group-Specific Insights: Analysing data by relevant groupings (e.g. a population of a flowering plant in different climatic regions or time periods) may show relevant patterns and differences that would otherwise remain hidden in aggregate statistics.
So, if we don’t account for group structures, summary statistics may reflect artificial central tendencies or dispersions that don’t realistically represent any meaningful subpopulation within the data. This limits the validity and usefulness of the analysis.
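A minimal illustration (beyond what the question requires) of how a pooled summary can mask group structure, using the penguins data from earlier:

```r
# Pooled vs grouped means of penguin body mass: the single pooled
# value describes no individual species particularly well
library(palmerpenguins)
library(tidyverse)

penguins %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))

penguins %>%
  group_by(species) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
```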
✓ (x 2) … give some marks if some of what’s above is mentioned.
Question 7
Explain the output of summary() when applied to the penguins dataset. (/3)
Answer
```r
summary(penguins)
```

          species          island    bill_length_mm  bill_depth_mm
     Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
     Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
     Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                     Mean   :43.92   Mean   :17.15
                                     3rd Qu.:48.50   3rd Qu.:18.70
                                     Max.   :59.60   Max.   :21.50
                                     NA's   :2       NA's   :2
     flipper_length_mm  body_mass_g      sex           year
     Min.   :172.0      Min.   :2700   female:165   Min.   :2007
     1st Qu.:190.0      1st Qu.:3550   male  :168   1st Qu.:2007
     Median :197.0      Median :4050   NA's  : 11   Median :2008
     Mean   :200.9      Mean   :4202                Mean   :2008
     3rd Qu.:213.0      3rd Qu.:4750                3rd Qu.:2009
     Max.   :231.0      Max.   :6300                Max.   :2009
     NA's   :2          NA's   :2
The summary() function provides a concise overview of each variable in the dataset. For numerical variables it reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values (i.e. central tendency and some view of the dispersion). For categorical variables, it shows the frequency of each level, and it flags counts of missing values (NA's). From this we can gain some basic insight into the distribution of the data and identify potential outliers or missing values.
✓ (x 3) Assign a few marks if they correctly describe the output of summary() with specific reference to the penguins dataset. For example:
We can see that the penguin’s bill depth (in mm) is close to normally distributed given that the mean value of 17.15 mm is very close to the median of 17.30 mm, with the minimum and maximum values at 13.10 mm and 21.50 mm, respectively. There are two missing values here. The bill length, however, is slightly skewed to the left, with the mean of 43.92 mm falling below the median of 44.45 mm…
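A quick companion check (not part of the model answer) for the missing values that summary() flags:

```r
# Count missing values per variable; the totals should match the
# NA's rows reported by summary(penguins)
colSums(is.na(penguins))
```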
3. Exploring With Figures
Question 8
a. Using a tidy workflow, assemble a summary table of the palmerpenguins dataset that has a similar appearance to that produced by psych::describe(penguins). (/5)
   - For bonus marks (which won’t count for anything) of up to 10% added to Task E, apply beautiful and creative styling to the table using the kableExtra package. Try to make it as publication-ready as possible. Refer to a few journal articles to see how tables are professionally typeset.
b. Still using the palmerpenguins dataset, perform an exploratory data analysis to investigate the relationship between penguin species and their morphological traits (bill length, bill depth, flipper length, and body mass). Employ the tidyverse approaches learned earlier in the module to explore the data and account for the grouping structures present within the dataset. (/10)
c. Provide visualisations (use Figure 4 as inspiration) and summary statistics to support your findings and elaborate on any observed patterns or trends. (/10)
d. Ensure your presentation is professional and adheres to the standards required by scientific publications. State the major aims of your analysis and the patterns you seek. Using the combined findings from the EDA and the figures produced here, discuss the findings in a formal Results section. (/5)
Answer

```r
library(kableExtra)
library(tidyverse)

# a.
penguins %>%
  select_if(is.numeric) %>%
  psych::describe() %>%
  kable("html") %>%
  kable_styling("striped", full_width = F)

# b.
penguins %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  pivot_longer(-species) %>%
  ggplot(aes(x = species, y = value, fill = species)) +
  geom_boxplot() +
  facet_wrap(~name, scales = "free_y") +
  theme_minimal()
```
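Parts (c) and (d) are open-ended. As a minimal sketch of accompanying summary statistics, one option is a per-species mean and SD for each trait (an assumed format, not prescribed by the question):

```r
# A possible per-species summary for part (c): mean and SD of each
# morphological trait (an assumed format, not prescribed above)
penguins %>%
  group_by(species) %>%
  summarise(across(
    c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
  ))
```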