2. Exploring With Summaries and Descriptions

Question 1

a. Explain the output of dimnames() when applied to the penguins dataset. (/2)

b. Explain the output of str() when applied to the penguins dataset. (/3)

Answer

```r
library(palmerpenguins)
data(penguins)

# a.
dimnames(penguins)

# b.
str(penguins)
```

    tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
     $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
     $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
     $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
     $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
     $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
     $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
     $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
✓ dimnames() returns the names of the rows and columns of the dataset.
✓ The rows are numbered 1 to 344, and the columns are named species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
✓ str() provides a concise summary of the dataset’s structure. It shows the number of observations (344) and variables (8), the variable names, and their data types.
✓ The dataset contains 344 observations of 8 variables.
✓ The variables are a mix of factor, numeric (double), and integer data types.
Question 2
How would you manually calculate the mean value for the normal_data we generated in the lecture? (/3)
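Answer

```r
# Recreate the normal_data from the lecture
set.seed(666)
n <- 5000
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# The mean is the sum of all observations divided by their number
round(sum(normal_data) / length(normal_data), 3)
```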
✓✓✓ The mean value of the normal_data is calculated by summing all the data points and dividing by the number of data points. Assign three marks if they correctly calculate the mean of the normal_data without using the function mean().
Question 3
Find the faithful dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use ggplot()), and conclude with a brief statement about the data distribution. (/10)
Answer
```r
# Load the faithful dataset
data(faithful)
library(ggpubr)

plt1 <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(fill = "lightblue", color = "black") +
  labs(title = "Eruptions", x = "Duration (minutes)", y = "Frequency") +
  theme_minimal()

plt2 <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(fill = "lightgreen", color = "black") +
  labs(title = "Waiting", x = "Time (minutes)", y = "Frequency") +
  theme_minimal()

# Arrange the plots side by side
ggarrange(plt1, plt2, ncol = 2)

# Distribution shape measures
library(e1071)
kurtosis(faithful$eruptions)
#> [1] -1.511605
kurtosis(faithful$waiting)
#> [1] -1.156263
skewness(faithful$eruptions)
#> [1] -0.4135498
skewness(faithful$waiting)
#> [1] -0.414025
```
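For transparency, these statistics can be checked against the classical moment-based definitions. The sketch below computes the g1 and g2 estimators by hand; note that e1071 implements several finite-sample variants, so its default output may differ slightly from these values.

```r
# Manual moment-based skewness (g1) and excess kurtosis (g2) for
# eruption duration; e1071's defaults apply small finite-sample
# corrections, so values may differ slightly from the above
x <- faithful$eruptions
m <- function(k) mean((x - mean(x))^k)  # k-th central sample moment
m(3) / m(2)^(3/2)   # skewness, g1
m(4) / m(2)^2 - 3   # excess kurtosis, g2
```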
The faithful dataset contains data on the Old Faithful geyser in Yellowstone National Park. It measures the eruption duration (in minutes) and the waiting time between eruptions (also in minutes). We use both graphical and statistical measures to understand the central tendency and overall distribution of each variable.
1. Eruption Duration
Using ggplot2, the histogram of eruption durations reveals a bimodal distribution – one cluster of short eruptions around 2 minutes and another around 4.3 minutes. This visible separation implies that computing a single mean or median would obscure meaningful structural information in the data.
Statistical measures of distribution shape support the visual impression:
Skewness = -0.4135
Kurtosis = -1.5116
The negative skewness suggests a slight asymmetry with a longer left tail, though the bimodal structure renders this value difficult to interpret in isolation. The negative kurtosis implies a flatter distribution compared to a normal curve – again, a consequence of the data’s underlying bimodality.
2. Waiting Time
The histogram for waiting times similarly displays two overlapping modes, with denser regions around 55 and 80 minutes.
The numerical descriptors are:
Skewness = -0.4140
Kurtosis = -1.1563
As with eruption durations, the slightly negative skew and platykurtic (low kurtosis) profile again reflect a flattened, spread-out distribution – not tightly peaked, and not symmetric.
3. Conclusion About the Data’s Distribution
Due to the bimodal nature of both variables, neither is well characterised by a single measure of central tendency. We can still compute mean and median values, but their interpretive value is diminished in the presence of this obvious bimodality. The bimodality indicates a structural feature of the physical process (i.e. not merely a statistical artefact), suggestive of two regimes of geyser activity (short/long eruptions, brief/long waits).
The skewness and kurtosis figures reinforce the visual presentation of the data structure: the distributions are asymmetrical and flatter than a normal curve, making parametric assumptions (e.g., normality in linear models) poorly suited unless the data are stratified or transformed. The histograms and moment-based statistics highlight the need to interrogate shape and structure – not merely central values – and suggest that a more detailed approach be taken to studying the processes at work.
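To illustrate the kind of stratification mentioned above, here is a minimal sketch; the 3-minute cut-off between short and long eruptions is an assumption, chosen by eye from the histogram.

```r
# A minimal stratification sketch: split eruptions into two assumed
# regimes at 3 minutes (cut-off chosen by eye from the histogram)
# and summarise each regime separately
library(tidyverse)

faithful %>%
  mutate(regime = if_else(eruptions < 3, "short", "long")) %>%
  group_by(regime) %>%
  summarise(
    mean_eruptions = mean(eruptions),
    median_eruptions = median(eruptions),
    mean_waiting = mean(waiting),
    n = n()
  )
```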
✓ 8/10 for the correct graphical representation and interpretation of the data distribution, and for the calculation of skewness and kurtosis.
✓ 2/10 for some sensible explanation of what this means.
Question 4
Manually calculate the variance and SD for the normal_data we generated in the lecture. Make sure your answers are the same as those reported there. (/5)
Answer
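For reference, the quantities being computed are the sample variance and standard deviation:

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}
\]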
```r
# Using the equations for sample variance and SD
# (not the built-in functions), do:

# Variance
(norm_var <- round(sum((normal_data - mean(normal_data))^2) / (length(normal_data) - 1), 3))
#> [1] 1.002

# Standard deviation
(norm_sd <- round(sqrt(norm_var), 3))
#> [1] 1.001
```
Question 5
Write a few lines of code to demonstrate that the \((0-0.25]\), \((0.25-0.5]\), \((0.5-0.75]\), and \((0.75-1]\) quantiles of the normal_data we generated in the lecture indeed conform to the formal definition of quantiles. I.e., show manually how you can determine that 25% of the observations indeed fall below -0.66 for the normal_data. Explain the rationale for your approach. (/10)
Answer
```r
# Generate random data from a normal distribution
set.seed(666)
n <- 5000 # Number of data points
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# Calculate the quantiles
q_25 <- quantile(normal_data, p = 0.25)
q_50 <- quantile(normal_data, p = 0.50)
q_75 <- quantile(normal_data, p = 0.75)
q_100 <- quantile(normal_data, p = 1.00)

# Verify each quantile interval
(count_0_to_25 <- sum(normal_data <= q_25))
#> [1] 1250
(count_25_to_50 <- sum(normal_data > q_25 & normal_data <= q_50))
#> [1] 1250
(count_50_to_75 <- sum(normal_data > q_50 & normal_data <= q_75))
#> [1] 1250
(count_75_to_100 <- sum(normal_data > q_75 & normal_data <= q_100))
#> [1] 1250

# Calculate percentages
(perc_0_to_25 <- count_0_to_25 / n * 100)
#> [1] 25
(perc_25_to_50 <- count_25_to_50 / n * 100)
#> [1] 25
(perc_50_to_75 <- count_50_to_75 / n * 100)
#> [1] 25
(perc_75_to_100 <- count_75_to_100 / n * 100)
#> [1] 25

# Make a figure to visualise
library(ggplot2)
ggplot(data.frame(x = normal_data), aes(x = x)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  geom_vline(xintercept = c(q_25, q_50, q_75),
             color = c("red", "blue", "green"), linetype = "dashed") +
  labs(title = "Histogram of normal_data with quantiles",
       x = "Value", y = "Frequency") +
  theme_minimal()
```
To demonstrate that the \((0-0.25]\), \((0.25-0.5]\), \((0.5-0.75]\), and \((0.75-1]\) quantiles of our generated normal_data conform to the formal definition of quantiles, we must verify that approximately 25% of observations fall within each quantile interval. The formal definition states that the \(p\)-th quantile is a value \(q_p\) such that the proportion of observations less than or equal to \(q_p\) is approximately \(p\).
Approach
In practice, there are three key steps:
1. Compute the quantiles using the quantile() function.
2. Count observations in each quantile interval by using logical expressions to filter the data:
   - For the first interval \((0-0.25]\): count values \(\leq q_{0.25}\)
   - For the second interval \((0.25-0.5]\): count values \(> q_{0.25}\) and \(\leq q_{0.50}\)
   - For the third interval \((0.5-0.75]\): count values \(> q_{0.50}\) and \(\leq q_{0.75}\)
   - For the fourth interval \((0.75-1]\): count values \(> q_{0.75}\)
3. Convert counts to percentages by dividing by the total number of observations and multiplying by 100.
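As a compact cross-check (using base R's ecdf(), which the model answer does not require), the empirical cumulative distribution function evaluated at each quantile should return approximately the corresponding probability:

```r
# The empirical CDF evaluated at each quantile gives the proportion
# of observations at or below it; by the definition of quantiles,
# these should be close to 0.25, 0.50, and 0.75
ecdf(normal_data)(quantile(normal_data, p = c(0.25, 0.50, 0.75)))
```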
✓ (x 10) They must correctly calculate the quantiles and verify that approximately 25% (i.e. obtain the percentage value in the calcs) of the observations fall within each quantile interval. They should also provide a clear explanation of their approach. No need to have a figure, but some bonus marks may be given if one is provided.
Question 6
Why is it important to consider the grouping structures that might be present within our datasets? (/2)
Answer
Simpson’s Paradox: Relationships seen in aggregated data can reverse or disappear when subgroups are examined separately.
Heterogeneity: Different groups within data often exhibit distinct statistical properties. A single mean (and/or SD) across all observations masks this variability and can provide a distorted view of the real phenomena.
Statistical Independence: Many statistical tests assume independence of observations. If data contain hierarchical or nested structures (e.g., individuals within ecosystems, repeated measurements from the same subjects), this assumption is violated and might invalidate statistical inferences.
Group-Specific Insights: Analysing data by relevant groupings (e.g. a population of a flowering plant in different climatic regions or time periods) may show relevant patterns and differences that would otherwise remain hidden in aggregate statistics.
So, if we don’t account for group structures, summary statistics may reflect artificial central tendencies or dispersions that don’t realistically represent any meaningful subpopulation within the data. This limits the validity and usefulness of the analysis.
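A minimal illustration (beyond what the question requires) of how a pooled summary can mask group structure, using the penguins data from earlier:

```r
# Pooled vs grouped means of penguin body mass: the single pooled
# value describes no individual species particularly well
library(palmerpenguins)
library(tidyverse)

penguins %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))

penguins %>%
  group_by(species) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
```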
✓ (x 2) … give some marks if some of what’s above is mentioned.
Question 7
Explain the output of summary() when applied to the penguins dataset. (/3)
Answer
```r
summary(penguins)
```

          species          island    bill_length_mm  bill_depth_mm
     Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
     Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
     Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                     Mean   :43.92   Mean   :17.15
                                     3rd Qu.:48.50   3rd Qu.:18.70
                                     Max.   :59.60   Max.   :21.50
                                     NA's   :2       NA's   :2
     flipper_length_mm  body_mass_g      sex           year
     Min.   :172.0      Min.   :2700   female:165   Min.   :2007
     1st Qu.:190.0      1st Qu.:3550   male  :168   1st Qu.:2007
     Median :197.0      Median :4050   NA's  : 11   Median :2008
     Mean   :200.9      Mean   :4202                Mean   :2008
     3rd Qu.:213.0      3rd Qu.:4750                3rd Qu.:2009
     Max.   :231.0      Max.   :6300                Max.   :2009
     NA's   :2          NA's   :2
The summary() function provides a concise overview of each variable in the dataset. For numerical variables it reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values (i.e. central tendency and some view of the dispersion). For categorical variables, it shows the frequency of each level, and it flags counts of missing values (NA's). From this we can gain some basic insight into the distribution of the data and identify potential outliers or missing values.
✓ (x 3) Assign a few marks if they correctly describe the output of summary() with specific reference to the penguins dataset. For example:
We can see that the penguin’s bill depth (in mm) is close to normally distributed given that the mean value of 17.15 mm is very close to the median of 17.30 mm, with the minimum and maximum values at 13.10 mm and 21.50 mm, respectively. There are two missing values here. The bill length, however, is slightly skewed to the left, with the mean of 43.92 mm falling below the median of 44.45 mm…
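A quick companion check (not part of the model answer) for the missing values that summary() flags:

```r
# Count missing values per variable; the totals should match the
# NA's rows reported by summary(penguins)
colSums(is.na(penguins))
```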
3. Exploring With Figures
Question 8
a. Using a tidy workflow, assemble a summary table of the palmerpenguins dataset that has a similar appearance to that produced by psych::describe(penguins). (/5)
   - For bonus marks (which won’t count for anything) of up to 10% added to Task E, apply beautiful and creative styling to the table using the kableExtra package. Try to make it as publication-ready as possible. Refer to a few journal articles to see how tables are professionally typeset.
b. Still using the palmerpenguins dataset, perform an exploratory data analysis to investigate the relationship between penguin species and their morphological traits (bill length, bill depth, flipper length, and body mass). Employ the tidyverse approaches learned earlier in the module to explore the data and account for the grouping structures present within the dataset. (/10)
c. Provide visualisations (use Figure 4 as inspiration) and summary statistics to support your findings and elaborate on any observed patterns or trends. (/10)
d. Ensure your presentation is professional and adheres to the standards required by scientific publications. State the major aims of your analysis and the patterns you seek. Using the combined findings from the EDA and the figures produced here, discuss the findings in a formal Results section. (/5)
Answer

```r
library(kableExtra)
library(tidyverse)

# a.
penguins %>%
  select_if(is.numeric) %>%
  psych::describe() %>%
  kable("html") %>%
  kable_styling("striped", full_width = F)

# b.
penguins %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  pivot_longer(-species) %>%
  ggplot(aes(x = species, y = value, fill = species)) +
  geom_boxplot() +
  facet_wrap(~name, scales = "free_y") +
  theme_minimal()
```
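Parts (c) and (d) are open-ended. As a minimal sketch of accompanying summary statistics, one option is a per-species mean and SD for each trait (an assumed format, not prescribed by the question):

```r
# A possible per-species summary for part (c): mean and SD of each
# morphological trait (an assumed format, not prescribed above)
penguins %>%
  group_by(species) %>%
  summarise(across(
    c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
  ))
```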