BCB744 Task E

Author

Affiliation

A. J. Smit

University of the Western Cape

Self-Assessment Task E.1

Explain the output of dimnames() when applied to the penguins dataset. (/2)
Explain the output of str() when applied to the penguins dataset. (/3)

Self-Assessment Task E.2

How would you manually calculate the mean value for the normal_data we generated in the lecture? (/3)

Self-Assessment Task E.3

Find the faithful dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use ggplot()), and conclude with a brief statement about the data distribution. (/10)

Self-Assessment Task E.4

Manually calculate the variance and SD for the normal_data we generated in the lecture. Make sure your answer is the same as those reported there. (/5)

Self-Assessment Task E.5

Write a few lines of code to demonstrate that the $(0-0.25]$, $(0.25-0.5]$, $(0.5-0.75]$,$(0.75-1]$ quantiles of the normal_data we generated in the lecture indeed conform to the formal definition for what quantiles are. I.e., show manually how you can determine that 25% of the observations indeed fall below -0.66 for the normal_data. Explain the rationale to your approach. (/10)

Self-Assessment Task E.6

Why is it important to consider the grouping structures that might be present within our datasets? (/2)

Self-Assessment Task E.7

Explain the output of summary() when applied to the penguins dataset. (/3)

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,
  author = {Smit, A. J.},
  title = {BCB744 {Task} {E}},
  url = {https://tangledbank.netlify.app/BCB744/tasks/task_e_blocks.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit AJ BCB744 Task E. https://tangledbank.netlify.app/BCB744/tasks/task_e_blocks.html.

--- title: "BCB744 Task E" format: html: fig-format: png fig_retina: 2 fig-dpi: 400 execute: dev: ragg_png fig-format: png params: hide_answers: true --- ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.1 a. Explain the output of `dimnames()` when applied to the `penguins` dataset. **(/2)** b. Explain the output of `str()` when applied to the `penguins` dataset. **(/3)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| eval: true #| echo: true library(palmerpenguins) data(penguins) # a. dimnames(penguins) # b. str(penguins) ``` - ✓ `dimnames()` returns the names of the rows and columns of the dataset. - ✓ The rows are numbered 1 to 344, and the columns are named `species`, `island`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, `sex`, `year`. - ✓ `str()` provides a concise summary of the dataset's structure. It shows the number of observations (344) and variables (8), the variable names, and their data types. - ✓ The dataset contains 344 observations of 8 variables. - ✓ The variables are a mix of character, factor, and numeric data types. `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.2 How would you manually calculate the mean value for the `normal_data` we generated in the lecture? **(/3)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| eval: true #| echo: true set.seed(666) n <- 5000 mean <- 0 sd <- 1 normal_data <- rnorm(n, mean, sd) round(sum(normal_data) / length(normal_data), 3) ``` - ✓✓✓ The mean value of the `normal_data` is calculated by summing all the data points and dividing by the number of data points. Assign three marks if they correctly calculate the mean of the `normal_data` without using the function `mean()`. `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.3 Find the `faithful` dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use `ggplot()`), and conclude with a brief statement about the data distribution. **(/10)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| fig-cap: "Histograms of Old Faithful eruptions and waiting times." #| eval: true #| echo: true # Load the faithful dataset data(faithful) library(ggpubr) plt1 <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(fill = "lightblue", colour = "black") + labs(title = "Eruptions", x = "Duration (minutes)", y = "Frequency") + theme_minimal() plt2 <- ggplot(faithful, aes(x = waiting)) + geom_histogram(fill = "lightgreen", colour = "black") + labs(title = "Waiting", x = "Time (minutes)", y = "Frequency") + theme_minimal() # Arrange the plots side by side ggarrange(plt1, plt2, ncol = 2) # Central tendency measures library(e1071) kurtosis(faithful$eruptions) kurtosis(faithful$waiting) skewness(faithful$eruptions) skewness(faithful$waiting) ``` The faithful dataset contains data on the Old Faithful geyser in Yellowstone National Park. It measures the eruption duration (in minutes) and the waiting time between eruptions (also in minutes). We use both graphical and statistical measures to understand the central tendency and overall distribution of each variable. *1. Eruption Duration* Using **ggplot2**, the histogram of eruption durations reveals a bimodal distribution --- one cluster of short eruptions around 2 minutes, another around 4.3 minutes. This visible separation implies that computing a single mean or median would obscure meaningful structural information in the data. Statistical measures of central tendency support the visual impression: - Skewness = -0.4135 - Kurtosis = -1.5116 The negative skewness suggests a slight asymmetry with a longer left tail, though the bimodal structure renders this value difficult to interpret in isolation. The negative kurtosis implies a flatter distribution compared to a normal curve --- again, a consequence of the data’s underlying bimodality. *2. Waiting Time* The histogram for waiting times similarly displays two overlapping modes, with denser regions around 55, 80 minutes. The numerical descriptors are: - Skewness = -0.4140 - Kurtosis = -1.1563 As with eruption durations, the slightly negative skew, platykurtic (low kurtosis) profile again reflect a flattened and spread-out distribution --- not tightly peaked, and not symmetric. *3. Conclusion About the Data's Distribution* Due to the bimodal nature of the measured variables, neither conforms to expectations of central tendency characterisation. Yes, we can compute mean, median values but their interpretive value is diminished in the presence of this obvious bimodality. This indicates a structural feature of the physical process operating there (*i.e.* not merely a statistical artefact), suggestive of two regimes of geyser activity (short/long eruptions, brief/long waits). The skewness and kurtosis figures reinforce the visual presentation of the data structure: the distributions are asymmetrical and flatter than a normal curve and make parametric assumptions (*e.g.*, normality in linear models) poorly-suited (unless stratified, transformed in some way or another). The histograms and moment-based statistics highlight the need to interrogate shape and structure --- not merely central values --- and would suggest that a more detailed approach is taken to studying the processes operating there. - ✓ 8/10 for the correct graphical representation and interpretation of the data distribution, and for the calculation of skewness, kurtosis. - ✓ 2/10 for some sensible explanation of what this means. `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.4 Manually calculate the variance and SD for the `normal_data` we generated in the lecture. Make sure your answer is the same as those reported there. **(/5)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| eval: true #| echo: true # Using the equations for sample variance and sd # (not the built-in functions), do: # Variance (norm_var <- round(sum((normal_data - mean(normal_data))^2) / (length(normal_data) - 1), 3)) # Standard deviation (norm_sd <- round(sqrt(norm_var), 3)) ``` `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.5 Write a few lines of code to demonstrate that the $(0-0.25]$, $(0.25-0.5]$, $(0.5-0.75]$,$(0.75-1]$ quantiles of the `normal_data` we generated in the lecture indeed conform to the formal definition for what quantiles are. I.e., show manually how you can determine that 25% of the observations indeed fall below `r round(quantile(normal_data, p = 0.25), 2)` for the `normal_data`. Explain the rationale to your approach. **(/10)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| fig-cap: "Quantile verification for simulated normal data." # Generate random data from a normal distribution set.seed(666) n <- 5000 # Number of data points mean <- 0 sd <- 1 normal_data <- rnorm(n, mean, sd) # Calculate the quantiles q_25 <- quantile(normal_data, p = 0.25) q_50 <- quantile(normal_data, p = 0.50) q_75 <- quantile(normal_data, p = 0.75) q_100 <- quantile(normal_data, p = 1.00) # Verify each quantile interval (count_0_to_25 <- sum(normal_data <= q_25)) (count_25_to_50 <- sum(normal_data > q_25 & normal_data <= q_50)) (count_50_to_75 <- sum(normal_data > q_50 & normal_data <= q_75)) (count_75_to_100 <- sum(normal_data > q_75 & normal_data <= q_100)) # Calculate percentages (perc_0_to_25 <- count_0_to_25 / n * 100) (perc_25_to_50 <- count_25_to_50 / n * 100) (perc_50_to_75 <- count_50_to_75 / n * 100) (perc_75_to_100 <- count_75_to_100 / n * 100) # Make a figure to visualise library(ggplot2) ggplot(data.frame(x = normal_data), aes(x = x)) + geom_histogram(binwidth = 0.5, fill = "lightblue", colour = "black") + geom_vline(xintercept = c(q_25, q_50, q_75), colour = c("red", "blue", "green"), linetype = "dashed") + labs(title = "Histogram of normal_data with quantiles", x = "Value", y = "Frequency") + theme_minimal() ``` To demonstrate that the $(0-0.25]$, $(0.25-0.5]$, $(0.5-0.75]$,$(0.75-1]$ quantiles of our generated `normal_data` conform to the formal definition of quantiles, we must verify that approximately 25% of observations fall within each quantile interval. The formal definition states that the $p$-th quantile is a value $q_p$ such that the proportion of observations less than or equal to $q_p$ is approximately $p$. *Approach* In practice, there are three key steps: 1. Compute the quantiles using the `quantile()` function. 2. Count observations in each quantile interval by using logical expressions to filter the data: - For the first interval $(0-0.25]$: count values $\leq q_{0.25}$ - For the second interval $(0.25-0.5]$: count values $> q_{0.25}$$\leq q_{0.50}$ - For the third interval $(0.5-0.75]$: count values $> q_{0.50}$$\leq q_{0.75}$ - For the fourth interval $(0.75-1]$: count values $> q_{0.75}$ 3. Convert counts to percentages by dividing by the total number of observations and multiplying by 100. - ✓ (x 10) They must correctly calculate the quantiles and verify that approximately 25% (*i.e.* obtain the percentage value in the calcs) of the observations fall within each quantile interval. They should also provide a clear explanation of their approach. No need to have a figure, but some bonus marks may be given if one is provided. `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.6 Why is it important to consider the grouping structures that might be present within our datasets? **(/2)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** Simpson's Paradox: Relationships seen in grouped data can reverse/disappear when examining subgroups separately. Heterogeneity: Different groups within data often exhibit distinct statistical properties. A a single mean (and/or SD) across all observations masks this variability and can provide a distorted view of the real phenomena. Statistical Independence: Many statistical tests assume independence of observations. If data contain hierarchical or nested structures (*e.g.*, individuals within ecosystems, repeated measurements from the same subjects), this assumption is violated, might invalidate statistical inferences. Group-Specific Insights: Analysing data by relevant groupings (*e.g.* a population of a flowering plant in different climatic regions or time periods) may show relevant patterns and differences that would otherwise remain hidden in aggregate statistics. So, if we do not account for group structures, summary statistics may reflect artificial central tendencies, dispersions that do not realistically represent any meaningful subpopulation within the data. This limits the validity and usefulness of the analysis. - ✓ (x 2) ... give some marks if some if what is above is mentioned. `r if (params$hide_answers) ":::"` ::: ::: {.callout-note appearance="simple"} ## Self-Assessment Task E.7 Explain the output of `summary()` when applied to the `penguins` dataset. **(/3)** `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} summary(penguins) ``` - The `summary()` function provides a concise summary (obviously) of the dataset's numerical variables. It includes the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values for each variable (*i.e.* central tendency, some view of the dispersion). For categorical variables and it shows the frequency of each level. We caqn gain some basic insight into the distribution of the data and identify potential outliers or missing values. - ✓ (x 3) Assign a few marks if they correctly describe the output of `summary()` with specific reference to the `penguins` dataset. For example: - We can see that the penguin's bill depth (in mm) is close to normally distributed given that the mean value of 17.15 mm is very close to the median of 17.30 mm, with the minimum, maximum values at 13.10 mm and 21.50 mm and respectively. There are two missing values here. The bill length, however, is slightly skewed to the right with a mean of 43.92 mm, a median of 44.45 mm... `r if (params$hide_answers) ":::"` :::