BCB744: Intro R Theory Test

Published

February 13, 2026

1 Instructions

The Intro R Theory Test will start at 9:00 on 13 February, 2026. You have until 12:00 to complete it.

Your answer should demonstrate a comprehensive understanding of the theoretical concepts and techniques required to read and comprehend R code.

Only answer what is explicitly stated. For example, if the question asks for only a graph as the final output, only the graph will be assessed, not the reasoning that brought you there. Anything extra will not earn extra marks, so save yourself the time and produce the most concise answer possible given the content of the question. What is required will always be explicitly stated.

This is a closed-book assessment. Below is a set of questions to answer. You must answer all questions in the allocated time of 3 hours. Please write your answers neatly in the answer book provided. Structure your answers logically.

1.1 Question 1 [10 marks]

Please translate the following code into English by providing an explanation for each line:

library(tidyverse)
monthlyData <- dailyData |> 
    mutate(t = as.POSIXct(t)) |>
    mutate(month = floor_date(t, unit = "month")) |> 
    group_by(lon, lat, month) |> 
    summarise(temp = mean(temp, na.rm = TRUE)) |> 
    mutate(year = year(month)) |> 
    group_by(lon, lat) |> 
    mutate(num = seq(1:length(temp))) |> 
    ungroup()

In your answer, simply refer to the line numbers (1-10) before each line of code, and provide an explanation for each line.

Answer

Line 1: The library(tidyverse) call loads the tidyverse collection of packages (including dplyr and, from tidyverse 2.0.0 onwards, lubridate), which provides the functions used in the lines that follow.
Line 2: The object monthlyData is created by starting with dailyData, a dataset containing daily records, and passing it through the pipe (|>) to the steps that follow.
Line 3: The mutate() function is used to convert the column t (presumably a date or timestamp) into the POSIXct date-time format. This ensures that t is stored in a standardised date-time format suitable for time-based operations.
Line 4: The mutate() function is used again to create a new column, month, which is derived from t. The floor_date() function rounds each date down to the first day of its month, so all records from the same month share one value.
Line 5: The group_by() function groups the dataset by lon (longitude), lat (latitude), and month. This means subsequent operations will be performed separately for each unique combination of these three variables.
Line 6: The summarise() function computes the mean temperature (temp) for each group, leaving one row per combination of lon, lat, and month. The na.rm = TRUE argument ensures that missing values (NA) are ignored in the calculation.
Line 7: The mutate() function creates a new column, year, by extracting the year from the month column. This provides an explicit reference to the year of each data entry.
Line 8: The group_by() function is applied again, but this time only by lon and lat. Because summarise() drops the last grouping variable (month) by default, the data are already grouped this way; the line makes the spatial grouping explicit.
Line 9: The mutate() function adds a new column, num, which holds the sequence of numbers from 1 to the number of rows (1:length(temp)) within each group. This effectively creates an index for each monthly record within each longitude-latitude group.
Line 10: The ungroup() function removes all grouping, ensuring that further operations on monthlyData are performed on the entire dataset rather than within groups.
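
The steps explained above can be tried on a small example. What follows is a minimal sketch, not part of the test: the dailyData object is hypothetical (one grid cell, 91 days, made-up temperatures), and two details differ from the question code, namely `.groups = "drop_last"` (which only states the default explicitly and silences a message) and `seq_along(temp)` in place of `seq(1:length(temp))`.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily data: one grid cell, 1 January to 31 March 2020
dailyData <- data.frame(
  t = as.character(seq(as.Date("2020-01-01"), by = "day", length.out = 91)),
  lon = 18.5,
  lat = -34.0,
  temp = c(rep(20, 31), rep(18, 29), rep(16, 31))  # Jan, Feb, Mar
)

monthlyData <- dailyData |>
  mutate(t = as.POSIXct(t, tz = "UTC")) |>              # text to date-time
  mutate(month = floor_date(t, unit = "month")) |>      # first day of each month
  group_by(lon, lat, month) |>
  summarise(temp = mean(temp, na.rm = TRUE),
            .groups = "drop_last") |>                   # keeps lon, lat grouping
  mutate(year = year(month)) |>
  group_by(lon, lat) |>
  mutate(num = seq_along(temp)) |>                      # index within each cell
  ungroup()

print(monthlyData)
```

With these made-up values the result has three rows (one per month), monthly means of 20, 18 and 16, and num running from 1 to 3.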

1.2 Question 2 [5 marks]

What is ‘Occam’s Razor’? What is the relevance to science?

Answer

Occam’s Razor is a principle of parsimony often attributed to the 14th-century philosopher William of Ockham. While the famous phrasing, “Entities should not be multiplied beyond necessity,” does not appear in his surviving works, the principle captures his idea that simpler explanations are generally preferable. It is relevant to science, and to the BCB744 module, because it is often interpreted as “the simplest explanation that sufficiently explains the data should be preferred over more complex alternatives.” This is a useful guiding principle in research, especially when you are faced with multiple explanations for a phenomenon: more complex explanations should only be considered when the simpler ones fail to account for the data. Keep in mind, however, that biological systems tend to be complex, and oversimplifying an explanation may ignore important interactions and heterogeneities.

1.3 Question 3 [20 marks]

Using the penguin data provided in Table 1, please produce the figure produced by the code block.

Table 1: Penguin Sample (n = 4 per Species x Island)
Species	Island	Bill length (mm)	Bill depth (mm)	Flipper length (mm)	Body mass (g)	Sex	Year
Adelie	Biscoe	37.8	20.0	190.0	4,250.0	male	2009
Adelie	Biscoe	37.9	18.6	193.0	2,925.0	female	2009
Adelie	Biscoe	37.9	18.6	172.0	3,150.0	female	2007
Adelie	Biscoe	40.1	18.9	188.0	4,300.0	male	2008
Adelie	Dream	41.5	18.5	201.0	4,000.0	male	2009
Adelie	Dream	41.1	19.0	182.0	3,425.0	male	2007
Adelie	Dream	36.0	17.9	190.0	3,450.0	female	2007
Adelie	Dream	38.1	18.6	190.0	3,700.0	female	2008
Adelie	Torgersen	40.2	17.0	176.0	3,450.0	female	2009
Adelie	Torgersen	42.9	17.6	196.0	4,700.0	male	2008
Adelie	Torgersen	33.5	19.0	190.0	3,600.0	female	2008
Adelie	Torgersen	39.3	20.6	190.0	3,650.0	male	2007
Chinstrap	Dream	45.2	17.8	198.0	3,950.0	female	2007
Chinstrap	Dream	49.3	19.9	203.0	4,050.0	male	2009
Chinstrap	Dream	46.5	17.9	192.0	3,500.0	female	2007
Chinstrap	Dream	45.5	17.0	196.0	3,500.0	female	2008
Gentoo	Biscoe	50.4	15.7	222.0	5,750.0	male	2009
Gentoo	Biscoe	49.5	16.1	224.0	5,650.0	male	2009
Gentoo	Biscoe	52.5	15.6	221.0	5,450.0	male	2009
Gentoo	Biscoe	49.3	15.7	217.0	5,850.0	male	2007

pen_long <- pen |>
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
    names_to = "measurement_type",
    values_to = "value_mm"
  ) |>
  group_by(species, island, measurement_type) |>
  summarise(value_mm = mean(value_mm, na.rm = TRUE), .groups = "drop")

ggplot(pen_long, aes(x = measurement_type, y = value_mm, fill = measurement_type)) +
  geom_col(width = 0.7, colour = "black", linewidth = 0.2, show.legend = FALSE) +
  facet_grid(species ~ island) +
  scale_x_discrete(
    labels = c(
      bill_length_mm = "Bill length", bill_depth_mm = "Bill depth",
      flipper_length_mm = "Flipper length"
    )
  ) +
  labs(x = "Measurement", y = "Mean length (mm)") +
  theme_bw(base_size = 11) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

Marks will only be assigned for the figure that the code produces.

Answer
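
The bar heights in the figure can be checked with a short calculation. This is a minimal sketch in base R, assuming that `pen` contains exactly the 20 rows of Table 1 (only the three plotted columns are transcribed here); it uses `aggregate()` instead of the `pivot_longer()` and `summarise()` steps in the question, so the layout of the result differs but the means are the same.

```r
# The three plotted measurements, transcribed row by row from Table 1
pen <- data.frame(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(12, 4, 4)),
  island = rep(c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"), each = 4),
  bill_length_mm = c(37.8, 37.9, 37.9, 40.1, 41.5, 41.1, 36.0, 38.1,
                     40.2, 42.9, 33.5, 39.3, 45.2, 49.3, 46.5, 45.5,
                     50.4, 49.5, 52.5, 49.3),
  bill_depth_mm = c(20.0, 18.6, 18.6, 18.9, 18.5, 19.0, 17.9, 18.6,
                    17.0, 17.6, 19.0, 20.6, 17.8, 19.9, 17.9, 17.0,
                    15.7, 16.1, 15.6, 15.7),
  flipper_length_mm = c(190, 193, 172, 188, 201, 182, 190, 190,
                        176, 196, 190, 190, 198, 203, 192, 196,
                        222, 224, 221, 217)
)

# One row per species x island combination, one mean per measurement
bar_heights <- aggregate(
  cbind(bill_length_mm, bill_depth_mm, flipper_length_mm) ~ species + island,
  data = pen,
  FUN = mean
)
print(bar_heights)
```

Each panel therefore shows three bars, each the mean of four birds. For example, the Gentoo/Biscoe flipper-length bar has height (222 + 224 + 221 + 217) / 4 = 221 mm. Because `facet_grid(species ~ island)` lays out every combination of the three species (rows) and three islands (columns), the figure is a 3 x 3 grid in which only the five combinations present in Table 1 contain bars.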

1.4 Question 4 [5 marks]

Explain the difference between R and RStudio.

Answer

Taken verbatim from Tangled Bank:

R is a programming language and software environment for statistical computing and graphics. It provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, multivariate analyses, neural networks, and so forth) and graphical techniques, and is highly extensible.

RStudio is an integrated development environment (IDE) for R. It provides a graphical user interface (GUI) for working with R, making it easier to use for those who are less familiar with command-line interfaces. Some of the features provided by RStudio include:

a code editor with syntax highlighting and code completion;
a console for running R code;
a graphical interface for managing packages and libraries;
integrated tools for plotting and visualisation;
support for version control with Git and SVN.

R is the core software for statistical computing, like a car’s engine, while RStudio provides a more user-friendly interface for working with R, like the car’s body, seats, steering wheel, and other bells and whistles.

1.5 Question 5 [10 marks]

By way of example, please explain some key aspects of R code conventions. For each line of code you write (neatly and legibly so each intended style item is visible), explain also in English what aspects of the code are being adhered to.

For example:

a <- b is not the same as a < -b. The former is correct because there is a space preceding and following the assignment operator (<-, a less-than sign immediately followed by a dash to form an arrow). The latter has a different meaning and is incorrect as an assignment: because there is a space between the less-than sign and the dash, it reads as “a is less than negative b”.
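
The difference is easy to see by running both forms. This is a small illustrative sketch; the values 5 and 3 are arbitrary.

```r
a <- 5
b <- 3

a <- b                    # assignment: a now holds the value of b (3)
is_less <- a < -b         # comparison: is 3 less than -3?

print(a)
print(is_less)
```

The first statement changes `a`; the second leaves `a` untouched and simply returns a logical value, here `FALSE`.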

Answer

Proper use of indentation:

if (x > 0) {
  print("Positive number")
}

Use of meaningful variable names:

temperature <- 25

Use of comments to explain code:

# Calculate the mean temperature
mean_temp <- mean(temperature)

Consistent use of spacing around operators:

a <- b + c

Consistent use of compound object names:

A principle of writing clean and readable R code (or any code) is maintaining consistent variable naming conventions throughout a script or project. Mixing different naming styles, such as “snake_case” (words separated by underscores) and “camelCase” (capitalising the first letter of each subsequent word), makes the code harder to read, maintain, and debug.

Examples:

# Example of consistent use of either convention:
my_variable <- 10 # snake case
another_variable <- 20 # snake case

# An example of inconsistent use of conventions:
myVariable <- 30 # camel case
yet_another_variable <- 40 # snake case

# This is also incorrect:
variable_one <- 13 # lowercase "one"
variable_Two <- 13 * 2 # uppercase "Two"

Avoiding = as the assignment operator:

# Correct:
a <- 1

# Incorrect:
a = 1

Consistent use of spaces around # symbols in comments:

# This is correct:

# This is a comment
# This is another comment
# And another

# This is incorrect:

#This is a comment
# A comment?
#  Another comment

Correct use of + and - for unary operators:

# Correct:
a <- -b

Use of TRUE and FALSE instead of T and F:

# Correct:
is_positive <- TRUE

# Incorrect:
is_positive <- T
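
The reason behind this convention can be shown in a few lines. In this sketch, `TRUE` and `FALSE` are reserved words that cannot be reassigned, whereas `T` and `F` are ordinary variables that any script may overwrite; the reassignment below is a deliberately bad example.

```r
# T is just a variable that happens to hold TRUE by default
T <- 0                 # legal, and silently changes what T means

is_positive <- T       # now 0, not TRUE
print(isTRUE(is_positive))

# Remove the masking variable so that T refers to base R's TRUE again
rm(T)
print(T)
```

Trying the same with `TRUE <- 0` stops with an error, which is why the full words are the safer choice.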

For more, refer to the tidyverse style guide.

1.6 Question 6 [15 marks]

You are a research assistant who has just been given your first job. You are asked to analyse a dataset about patterns of extreme heat in the ocean and the possible role that ocean currents (specifically, eddies) might play in modulating the patterns of sea surface temperature extremes in space and time.

Being naive and relatively inexperienced, and misguided by an exaggerated sense of preparedness, as young people tend to be, you gladly accept the task and start by exploring the data. You notice that the dataset is quite large and you have no idea what is happening, what you are doing, why you are doing it, or what you are looking for. Ten minutes into the job you start to question your life choices. Your feeling of bewilderment is compounded by the fact that, when you examine the data (the output of the head() and tail() commands is shown below), the entries seem confusing.

fpath <- "/Volumes/OceanData/spatial/processed/WBC/misc_results"
fname <- "KC-MCA-data-2013-01-01-2022-12-31-bbox-v1_ma_14day_detrended.csv"
data <- read.csv(file.path(fpath, fname))

> nrow(data)
[1] 53253434

> head(data)
           t     lon    lat      ex    ke
1 2013-01-01 121.875 34.625 -0.7141 2e-04
2 2013-01-01 121.875 34.625 -0.8027 2e-04
3 2013-01-02 121.875 34.625 -0.8916 2e-04
4 2013-01-02 121.875 34.625 -0.9751 2e-04
5 2013-01-03 121.875 34.625 -1.0589 3e-04
6 2013-01-03 121.875 34.625 -1.1406 3e-04

> tail(data)
                  t     lon    lat     ex      ke
53253429 2022-12-29 174.375 44.875 0.4742 -0.0049
53253430 2022-12-29 174.375 44.875 0.4856 -0.0049
53253431 2022-12-30 174.375 44.875 0.4969 -0.0050
53253432 2022-12-30 174.375 44.875 0.5169 -0.0050
53253433 2022-12-31 174.375 44.875 0.5367 -0.0051
53253434 2022-12-31 174.375 44.875 0.5465 -0.0051

You resign yourself to admitting that you do not understand much but, rather than risk sounding like a fool when you go to your professor, you decide to do as much preparation as you can so that you at least have something to show for your time.

What will you take back to your professor to show that you have prepared yourself as fully as possible? For example:
- What are you able to understand about the study and the nature of the data?
- What will you do for yourself to better understand the task at hand?
- What do you understand about the data?
- What will you do to aid your understanding of the data?
- What will your next steps be going forward?
- Etc. (Anything else you can think of doing to convince the professor that you thought about the data.) [/10 marks]
What will you need from your professor to help you understand the data and the task at hand so that you are well equipped to tackle the problem? [/5 marks]

Answer

I am able to understand what ‘extreme heat’ is and what ocean eddies are: all I need to do is find some papers and read broadly around these concepts, so I will start there.
I can see from the columns that there appear to be three independent variables (lon, lat, t) and two dependent variables (ex and ke). It is easy to see that lon and lat are the longitude and latitude of the data points, and that t is the date of each data point. I will need to understand what the ex and ke variables are, and how they relate to lon, lat and t. Presumably ex and ke relate to the extreme heat and the ocean eddies, respectively. I will confirm with the professor.
Because I have lon and lat, I can make a map of the study area. By making a map for one or a few days in the dataset, I can get a sense of the spatial distribution of the data. I can also plot the ex and ke data to see what they look like. Because the data cover the period 2013-2022, I know that I can create a map for each day (a time-series analysis might eventually be needed?), and that is probably where the analysis will take me once I have confirmed my thinking with the professor. If I am really proactive and want to seriously impress the professor, I will make an animation of the data to show how the patterns evolve over time. This will clearly show the processes operating there. A REALLY informed mind will even be able to go as far as understanding what the analysis should entail but, admittedly, this requires a deep subject matter understanding, which you might not possess at the moment, but which is nevertheless not beyond your reach to attain without guidance.
I can conclude that the data reveal some dynamical process (I infer ‘dynamical’ from the fact that we have time-series data, and time-series reveal dynamics).
Knowing from the map I created what the geographical region is, and what is happening there that might be of interest to the study, I can make some guesses about what the analysis will be.
FYI, basic research would reveal the following (not for marks):
- the coordinates shown (about 122-174°E and 35-45°N, reading positive latitudes as north) place the data in the north-west Pacific Ocean, off East Asia and Japan;
- once you know the region covered, you can read about the processes operating there;
- the region, together with the “KC” in the file name and the “WBC” in the file path, suggests that the study is likely about the Kuroshio Current, a western boundary current;
- plotting ke will likely reveal eddies associated with the current;
- you can read about the current and its eddies and think about how eddies might affect the temperature in the region; both of these are dynamical processes.
I will need to understand what the data are telling me and what the variables mean, in particular what ex and ke are and how they vary with lon, lat and t.
Having discovered all these things simply by doing a basic first-stab analysis, I can prepare a report of my cursory findings and draw up a list of the things I know, together with suggested avenues for further exploration. I will take this to the professor to confirm my understanding and to get guidance on how to proceed.
I will also add a list of the things I cannot know from the data, and what I need to know from the professor to proceed.
There is also something strange happening with the data. It seems that there are duplicate entries: each combination of lat x lon x t occurs twice, with identical values of ke but a pair of dissimilar values of ex. I will need to understand why this is the case. This looks incorrect and points to pre-processing errors somewhere. I will have to ask the professor for access to all pre-processing scripts and the raw data to see if I can trace the error back to its source.
If I were this professor, I would be immensely impressed by your proactive approach to the problem. You are showing that you are not just a passive learner, but that you are actively engaging with the data and the problem at hand. This is a very good sign of a good researcher in the making. I would seriously think about finding you a salary for permanent employment in my lab.
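
The duplicate check mentioned in the answer can be sketched in a few lines of base R. This is an illustrative example only: it uses the six rows shown in the head() output rather than the full 53-million-row file, and the object names `dat` and `key_counts` are made up for this sketch.

```r
# The six rows printed by head(data) in the question
dat <- data.frame(
  t = as.Date(c("2013-01-01", "2013-01-01", "2013-01-02",
                "2013-01-02", "2013-01-03", "2013-01-03")),
  lon = 121.875,
  lat = 34.625,
  ex = c(-0.7141, -0.8027, -0.8916, -0.9751, -1.0589, -1.1406),
  ke = c(2e-04, 2e-04, 2e-04, 2e-04, 3e-04, 3e-04)
)

# Does any combination of time and position occur more than once?
has_duplicates <- any(duplicated(dat[, c("t", "lon", "lat")]))

# How many rows are there per combination?
key_counts <- aggregate(ex ~ t + lon + lat, data = dat, FUN = length)
names(key_counts)[names(key_counts) == "ex"] <- "n"

print(has_duplicates)
print(key_counts)
```

In this sample every date and position occurs exactly twice, which is the pattern to raise with the professor. On the full dataset the same idea applies, although a faster tool than `aggregate()` would probably be needed.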

1.7 Question 7 [15 marks]

Explain why one typically prefers working with CSV files over Excel files in R.
What are the properties of a CSV file that make it more suitable for data analysis in R?
What are the properties of an Excel file that make it less suitable for data analysis in R?

Answer

CSV (Comma-Separated Values) files are preferred over Excel files due to their simplicity, compatibility, and efficiency in handling data. CSV files are stored as plain text, making them easy to read and write across different software and platforms. They do not contain proprietary formatting, formulas, or metadata, which minimises the risk of unintended data transformations.

Excel files (.xls, .xlsx) are proprietary, designed for spreadsheet applications, and incorporate formulas and complex visual formatting that can interfere with data processing in R. Unlike CSV files, which can be read directly using base R functions like read.csv(), Excel files require additional packages such as readxl for data extraction. Excel’s tendency to automatically modify data types, such as converting text to dates or numbers, introduces errors, making CSV a more reliable format for reproducible data analysis.

CSV files store data in a simple text-based format that ensures easy readability by both humans and computers.
Each row represents a single record, and fields are separated by commas (or another delimiter) to ensure a consistent tabular format.
CSV files can be opened and edited using a wide range of software, including text editors, spreadsheets (e.g., Excel, Google Sheets), and statistical tools (e.g., R, Python).
R provides functions such as read.csv() (base R, no additional packages needed) and read_csv() (readr, part of the tidyverse) for quickly reading CSV files.
Unlike Excel files, CSV files do not contain embedded formulas, formatting, figures, or macros, which reduces the risk of unintended changes to the data.
Being plain text, CSV files are typically smaller in size compared to Excel files.

Excel files are stored in a format (.xls, .xlsx) that is specific to Microsoft Excel; special packages (e.g., readxl, openxlsx) are needed to read them in R.
Excel often reformats data automatically, for example by changing numeric values to dates or rounding decimal values. This can lead to errors in data analysis.
Excel files support formulas, pivot tables, conditional formatting, and visual elements that may not be relevant for raw data processing in R.
Users can store multiple sheets within a single Excel file, which makes it trickier to maintain a standardised structure when importing data into R.
Excel files are not made for handling large datasets. Excel becomes very slow and is prone to crashing or running into memory limitations when dealing with ‘big’ data.
Excel files are binary or zipped rather than plain text, so version control systems like Git cannot show meaningful line-by-line differences between versions.
Excel files are complex and more prone to accidental modifications or corruption.
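
The plain-text nature of a CSV file can be seen with a short round trip in base R. This is a minimal sketch; the `plants` data frame and its values are made up for illustration, and the file is written to a temporary location.

```r
# A small, made-up data frame
plants <- data.frame(
  name = c("Acacia", "Protea", "Leucadendron"),
  height_cm = c(85.5, 90.3, 78.0)
)

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(plants, csv_path, row.names = FALSE)

# The file is plain text: one header line plus one line per record
raw_lines <- readLines(csv_path)
print(raw_lines)

# Reading it back needs only base R
plants_back <- read.csv(csv_path)
print(plants_back)

# An Excel file would instead need an extra package, for example
# readxl::read_excel(path, sheet = 1), where path is the .xlsx file
```

Because the file is ordinary text, it can be opened in any editor and compared line by line under version control.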

1.8 Question 8 [20 marks]

Explain each of the following in the context of their use in R. For each, provide an example of how you would construct them in R:

A vector
A matrix
A dataframe
A list

Answer

A vector in R is the simplest and most fundamental data structure. It is a one-dimensional collection of elements, all of the same type (e.g., numeric, character, or logical). Vectors can be created using the c() function. For example:

# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)

# Creating a character vector
names <- c("Acacia", "Protea", "Leucadendron")

# Creating a logical vector
logical_values <- c(TRUE, FALSE, TRUE)

A matrix is a two-dimensional data structure where all elements must be of the same type. It is essentially an extension of a vector with a specified number of rows and columns.

# Creating a matrix with 3 rows and 2 columns
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

A dataframe is a two-dimensional data structure that can contain different data types in different columns (variables). It is the most commonly used data structure for data analysis in R and resembles a table with rows and columns.

# Creating a dataframe
my_dataframe <- data.frame(
  Name = c("Acacia", "Protea", "Leucadendron"),
  Age = c(25, 30, 22),
  Height = c(85.5, 90.3, 78.0)
)

A list is a flexible data structure that can store elements of different types, including vectors, matrices, dataframes, and even other lists. Unlike vectors and matrices, which require uniform data types, lists can contain heterogeneous elements.

# Creating a list with different data types
# Uses the objects created above, for example
my_list <- list(
  plants = my_dataframe,
  some_numbers = my_matrix,
  other_numbers = numbers
)
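
The four structures can be compared by building them and then inspecting and indexing each one. This is a brief sketch that recreates the objects from the answer so it runs on its own; the helper names `mixed`, `column_types` and the others introduced below are chosen for illustration.

```r
numbers <- c(1, 2, 3, 4, 5)
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
my_dataframe <- data.frame(
  Name = c("Acacia", "Protea", "Leucadendron"),
  Age = c(25, 30, 22),
  Height = c(85.5, 90.3, 78.0)
)
my_list <- list(
  plants = my_dataframe,
  some_numbers = my_matrix,
  other_numbers = numbers
)

# Vector: one type only, so mixed input is coerced (here to character)
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

# Matrix: filled column by column, indexed as [row, column]
matrix_dims <- dim(my_matrix)
matrix_value <- my_matrix[3, 2]

# Dataframe: each column keeps its own type
column_types <- sapply(my_dataframe, class)

# List: elements are retrieved by name with $ or [[ ]]
second_plant <- my_list$plants$Name[2]
list_names <- names(my_list)

str(my_list)
```

Here `mixed_class` is "character", `matrix_value` is 6 because the matrix is filled column by column, and `second_plant` is "Protea".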

TOTAL MARKS: 100

– THE END –

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2026,
  author = {Smit, A. J.},
  title = {BCB744: {Intro} {R} {Theory} {Test}},
  date = {2026-02-13},
  url = {http://tangledbank.netlify.app/BCB744/assessments/BCB744_Intro_R_Theory_Test_2026.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2026) BCB744: Intro R Theory Test. http://tangledbank.netlify.app/BCB744/assessments/BCB744_Intro_R_Theory_Test_2026.html.

--- title: "BCB744: Intro R Theory Test" date: "2026-02-13" params: hide_answers: false format: pdf: citation-location: bottom code-line-numbers: true documentclass: article fontsize: "10pt" fig-dpi: 300 fig-format: pdf highlight-style: tango include-in-header: ../../styles/preamble.tex indent: false keep-tex: true latex-tinytex: false number-sections: false par-skip: 6pt pdf-engine: lualatex toc: false citecolor: "blue" linkcolor: "blue" filecolor: "blue" urlcolor: "blue" toccolor: "blue" code-blocks: break-lines: true break-by-word: true --- ```{r code-brewing-opts, echo=FALSE} knitr::opts_chunk$set( comment = "R>", warning = FALSE, message = FALSE, fig.width = 4.5, fig.height = 2.625, out.width = "75%", fig.asp = NULL, # control via width/height dpi = 300 ) ggplot2::theme_set( ggplot2::theme_minimal(base_size = 8) ) ggplot2::theme_set( ggplot2::theme_bw(base_size = 8) ) ``` ```{r code-setup, include=FALSE} knitr::opts_chunk$set( message = FALSE, warning = FALSE, cache = TRUE, echo = TRUE, eval = TRUE ) ``` # Instructions The Intro R Theory Test will start at **9:00** on 13 February, 2026. You have until **12:00** to complete it. Your answer should demonstrate a comprehensive understanding of the theoretical concepts and techniques required to read and comprehend R code. Only answer what is explicitely stated. For example, if the question asks for only a graph as the final output, *only* the graph will be assessed, not the reasoning that brought you there. Anything extra will not amount to extra marks, so save yourself the time and produce the most concise answer possible given the content of the question. **What is required will *always* be explicitely stated.** **This is a closed book assessment.** Below is a set of questions to answer. You must answer **all** questions in the allocated time of **3-hr**. Please write your answers neatly in the answer book provided. Structure your answers logically. 
## Question 1 [10 marks] Please translate the following code into English by providing an explanation for each line: ```{r code-monthlydata-dailydata, echo=TRUE, eval=FALSE} library(tidyverse) monthlyData <- dailyData |> mutate(t = asPOSIXct(t)) |> mutate(month = floor_date(t, unit = "month")) |> group_by(lon, lat, month) |> summarise(temp = mean(temp, na.rm = TRUE)) |> mutate(year = year(month)) |> group_by(lon, lat) |> mutate(num = seq(1:length(temp))) |> ungroup() ``` In your answer, simply refer to the line numbers (1-10) before each line of code, and provide an explanation for each line. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** * Line 1: The variable `monthlyData` is created by starting with `dailyData`, which is a dataset containing daily records. * Line 2: The `mutate()` function is used to convert the column `t` (presumably a date or timestamp) into a POSIXct datetime format. This ensures that `t` is stored in a standardised date-time format suitable for time-based operations. * Line 3: The `mutate()` function is again used to create a new column `month`, which is derived from `t`. The `floor_date()` function rounds down the date to the first day of the corresponding month, effectively extracting the month from `t`. * Line 4: The `group_by()` function groups the dataset by `lon` (longitude), `lat` (latitude), `month`. This means subsequent operations will be performed separately for each unique combination of these three variables. * Line 5: The `summarise()` function computes the mean temperature (`temp`) for each group. The `na.rm = TRUE` argument ensures that missing values (`NA`) are ignored in the calculation. * Line 6: The `mutate()` function creates a new column, `year`, extracting the year from the `month` column. This provides an explicit reference to the year of each data entry. * Line 7: The `group_by()` function is applied again, but this time only by `lon` and `lat`. 
This modifies the grouping structure to remove the month grouping while retaining spatial grouping. * Line 8: `The mutate()` function adds a new column, `num`, which assigns a sequence of numbers (`1:length(temp)`) to the grouped data. This effectively creates an index for each record within each longitude-latitude group. * Line 9: The `ungroup()` function removes all grouping, ensuring that further operations on `monthlyData` are performed on the entire dataset rather than within groups. `r if (params$hide_answers) ":::"` ## Question 2 [5 marks] What is 'Occam's Razor'? What is the relevance to science? `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** Occam’s Razor is a principle of parsimony often attributed to the 14th-century philosopher William of Ockham. While the famous phrasing, “Entities should not be multiplied beyond necessity,” does not appear in his surviving works, the principle captures his idea that simpler explanations are generally preferable. It is relevant to the BCB744 module because the principle of Occam's Razor is often interpreted as “the simplest explanation that sufficiently explains the data should be preferred over more complex alternatives.” This is a nice guiding principle which might be useful in your research, especially when you are faced with multiple explanations for a phenomenon. The principle suggests that the simplest explanation is often the best one, and that more complex explanations should only be considered when the simpler ones fail to account for the data. But, keep in mind that biological systems tend to be complex, and oversimplifying an explanation may ignore important interactions, heterogeneities. `r if (params$hide_answers) ":::"` ## Question 3 [20 marks] Using the penguin data provided in Table 1, please produce the figure produced by the code block. 
```{r} #| echo: false library(palmerpenguins) library(tidyverse) library(gt) data(penguins) set.seed(744) pen <- penguins |> dplyr::group_by(species, island) |> dplyr::slice_sample(n = 4) |> dplyr::ungroup() pen |> gt::gt() |> gt::fmt_number( columns = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g), decimals = 1 ) |> gt::cols_label( species = "Species", island = "Island", bill_length_mm = gt::latex("\\shortstack{Bill\\\\length\\\\(mm)}"), bill_depth_mm = gt::latex("\\shortstack{Bill\\\\depth\\\\(mm)}"), flipper_length_mm = gt::latex("\\shortstack{Flipper\\\\length\\\\(mm)}"), body_mass_g = gt::latex("\\shortstack{Body\\\\mass\\\\(g)}"), sex = "Sex", year = "Year" ) |> gt::tab_header( title = gt::md("**Table 1:** Penguin Sample (n = 4 per Species x Island)") ) |> gt::opt_table_font( font = list(gt::google_font("Source Sans 3"), gt::default_fonts()) ) |> gt::tab_options( table.font.size = gt::px(11), heading.title.font.size = gt::px(13), column_labels.font.size = gt::px(10), data_row.padding = gt::px(2), table.width = gt::pct(100) ) ``` ```{r} #| eval: false #| echo: true #| fig-width: 8 #| fig-height: 5 pen_long <- pen |> pivot_longer( cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm), names_to = "measurement_type", values_to = "value_mm" ) |> group_by(species, island, measurement_type) |> summarise(value_mm = mean(value_mm, na.rm = TRUE), .groups = "drop") ggplot(pen_long, aes(x = measurement_type, y = value_mm, fill = measurement_type)) + geom_col(width = 0.7, colour = "black", linewidth = 0.2, show.legend = FALSE) + facet_grid(species ~ island) + scale_x_discrete( labels = c( bill_length_mm = "Bill length", bill_depth_mm = "Bill depth", flipper_length_mm = "Flipper length" ) ) + labs(x = "Measurement", y = "Mean length (mm)") + theme_bw(base_size = 11) + theme( strip.text = element_text(face = "bold"), axis.text.x = element_text(angle = 20, hjust = 1) ) ``` Marks will *only* be assigned for the figure that the code produces. 
`r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: false #| eval: true pen_long <- pen |> tidyr::pivot_longer( cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm), names_to = "measurement_type", values_to = "value_mm" ) |> dplyr::group_by(species, island, measurement_type) |> dplyr::summarise(value_mm = mean(value_mm, na.rm = TRUE), .groups = "drop") ggplot(pen_long, aes(x = measurement_type, y = value_mm, fill = measurement_type)) + geom_col(width = 0.7, colour = "black", linewidth = 0.2, show.legend = FALSE) + facet_grid(species ~ island) + scale_x_discrete( labels = c( bill_length_mm = "Bill length", bill_depth_mm = "Bill depth", flipper_length_mm = "Flipper length" ) ) + labs( x = "Measurement", y = "Mean length (mm)" ) + theme_bw(base_size = 11) + theme( strip.text = element_text(face = "bold"), axis.text.x = element_text(angle = 20, hjust = 1) ) ``` `r if (params$hide_answers) ":::"` ## Question 4 [5 marks] Explain the difference between R and RStudio. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** Taken verbatim from Tangled Bank: **R is a programming language and software environment for statistical computing and graphics**. It provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, multivariate analyses, neural networks, and so forth), graphical techniques and is highly extensible. **RStudio is an integrated development environment (IDE)** for R. It provides a graphical user interface (GUI) for working with R, making it easier to use for those who are less familiar with command-line interfaces. Some of the features provided by RStudio include: - a code editor with syntax highlighting and code completion; - a console for running R code; - a graphical interface for managing packages and libraries; - an integrated tools for plotting and visualisation; - support for version control with Git and SVN. 
R is the core software for statistical computing, like a car's engine, while RStudio provides a more user-friendly interface for working with R, like the car's body, seats, steering wheel, and other bells and whistles.

`r if (params$hide_answers) ":::"`

## Question 5 [10 marks]

By way of example, please explain some key aspects of R code conventions. For each line of code you write (**neatly and legibly so each intended style item is visible**), explain also in English what aspects of the code are being adhered to.

For example: **`a <- b`** is not the same as **`a < -b`**. The former is correct because there is a space preceding and following the assignment operator (**`<-`**, a less-than sign immediately followed by a dash to form an arrow); this has a different meaning from the latter, which is incorrect because a space separates the less-than sign from the dash, so that it reads as "a is less than negative b".

`r if (params$hide_answers) "::: {.content-hidden}"`

**Answer**

1. Proper use of indentation:

```{r code-if-x, eval=FALSE}
if (x > 0) {
  print("Positive number")
}
```

2. Use of meaningful variable names:

```{r code-temperature, eval=FALSE}
temperature <- 25
```

3. Use of comments to explain code:

```{r code-mean-temp-mean-temperature, eval=FALSE}
# Calculate the mean temperature
mean_temp <- mean(temperature)
```

4. Consistent use of spacing around operators:

```{r code-a-b-c, eval=FALSE}
a <- b + c
```

5. Consistent use of compound object names:

A principle of writing clean and readable R code (or *any* code) is maintaining consistent variable naming conventions throughout a script or project. Mixing different naming styles, such as "snake_case" (words separated by underscores) and "camelCase" (capitalising the first letter of each subsequent word), makes the code harder to read, maintain, and debug.
Examples:

```{r code-my-variable-snake-case, eval=FALSE}
# Example of consistent use of one convention:
my_variable <- 10 # snake case
another_variable <- 20 # snake case

# An example of inconsistent use of conventions:
myVariable <- 30 # camel case
yet_another_variable <- 40 # snake case

# This is also incorrect:
variable_one <- 13 # lowercase "one"
variable_Two <- 13 * 2 # uppercase "Two"
```

6. Avoiding `=` as the assignment operator:

```{r code-a, eval=FALSE}
# Correct:
a <- 1

# Incorrect:
a = 1
```

7. Consistent use of spaces around # symbols in comments:

```{r code-chunk, eval=FALSE}
# This is correct:
# This is a comment
# This is another comment
# And another

# This is incorrect:
#This is a comment
#  A comment?
#   Another comment
```

8. Correct use of `+` and `-` for unary operators:

```{r code-a-b, eval=FALSE}
# Correct:
a <- -b
```

9. Use of `TRUE` and `FALSE` instead of `T` and `F`:

```{r code-is-positive-true, eval=FALSE}
# Correct:
is_positive <- TRUE

# Incorrect:
is_positive <- T
```

For more, refer to the [tidyverse style guide](https://style.tidyverse.org/syntax.html).

`r if (params$hide_answers) ":::"`

## Question 6 [15 marks]

You are a research assistant who has just been given your first job. You are asked to analyse a dataset about patterns of extreme heat in the ocean and the possible role that ocean currents (specifically, eddies) might play in modulating the patterns of sea surface temperature extremes in space and time.

Being naive and relatively inexperienced, and misguided by your exaggerated sense of preparedness as young people tend to be, you gladly accept the task, and start by exploring the data. You notice that the dataset is quite large and you have no idea what is happening, what you are doing, why you are doing it, or what you are looking for. Ten minutes into the job you start to question your life choices.
Your feeling of bewilderment is compounded by the fact that, when you examine the data (the output of the `head()` and `tail()` commands is shown below), the entries seem confusing.

```{r code-fpath-volumes-oceandata-spatial, eval=FALSE}
fpath <- "/Volumes/OceanData/spatial/processed/WBC/misc_results"
fname <- "KC-MCA-data-2013-01-01-2022-12-31-bbox-v1_ma_14day_detrended.csv"
data <- read.csv(file.path(fpath, fname))
```

```{r code-nrow-data, echo=TRUE, eval=FALSE}
> nrow(data)
[1] 53253434
> head(data)
           t     lon    lat      ex    ke
1 2013-01-01 121.875 34.625 -0.7141 2e-04
2 2013-01-01 121.875 34.625 -0.8027 2e-04
3 2013-01-02 121.875 34.625 -0.8916 2e-04
4 2013-01-02 121.875 34.625 -0.9751 2e-04
5 2013-01-03 121.875 34.625 -1.0589 3e-04
6 2013-01-03 121.875 34.625 -1.1406 3e-04
> tail(data)
                  t     lon    lat     ex      ke
53253429 2022-12-29 174.375 44.875 0.4742 -0.0049
53253430 2022-12-29 174.375 44.875 0.4856 -0.0049
53253431 2022-12-30 174.375 44.875 0.4969 -0.0050
53253432 2022-12-30 174.375 44.875 0.5169 -0.0050
53253433 2022-12-31 174.375 44.875 0.5367 -0.0051
53253434 2022-12-31 174.375 44.875 0.5465 -0.0051
```

You resign yourself to admitting that you do not understand much but, rather than risk sounding like a fool when you go to your professor, you decide to do as much preparation as you can so that you at least have something to show for your time.

a. What will you take back to your professor to show that you have prepared yourself as fully as possible? For example:

    - What is in your ability to understand about the study and the nature of the data?
    - What will you do for yourself to better understand the task at hand?
    - What do you understand about the data?
    - What will you do to aid your understanding of the data?
    - What will your next steps be going forward?
    - Etc. (Anything else you can think about doing to convince the professor you thought about the data?)

    [*/10 marks*]

b.
What will you need from your professor to help you understand the data and the task at hand so that you are well equipped to tackle the problem? [*/5 marks*]

`r if (params$hide_answers) "::: {.content-hidden}"`

**Answer**

* I am able to understand what the concept of 'extreme heat' is, and what ocean eddies are. All I need to do is find some papers and read broadly around these concepts, so I will start there.
* I can see from the columns that there appear to be three independent variables (`lon`, `lat`, `t`) and two dependent variables (`ex` and `ke`). I will need to understand what these variables are, and how they relate to each other. It is easy to see that `lon` and `lat` are the longitude and latitude of the data points, and that `t` is the date of the data point. I will need to understand what the `ex` and `ke` variables are, and how they relate to the `lon` and `lat` variables. Presumably `ex` and `ke` are the extreme heat and ocean eddies, respectively. I will confirm with the professor.
* Because I have `lon` and `lat`, I can make a map of the study area. By making a map of the study area for one or a few days in the dataset, I can get a sense of the spatial distribution of the data. I can also plot the `ex` and `ke` data to see what the data look like. Because the data cover the period 2013-2022, I know that I can create a map for each day (a time-series analysis might eventually be needed?), and that is probably where the analysis will take me later once I have confirmed my thinking with the professor. If I am really proactive and want to seriously impress the professor, I will make an animation of the data to show how the revealed patterns evolve over time. This will clearly show the processes operating there.
A REALLY informed mind might even go as far as working out what the analysis should entail. Admittedly, this requires a deep subject matter understanding, which you might not possess at the moment, but which is nevertheless not beyond your reach to attain without guidance.
* I can conclude that the data reveal some dynamical process (I infer 'dynamical' from the fact that we have time-series data, and time-series reveal dynamics).
* Knowing what the geographical region is from the map I created and what is happening there that might be of interest to the study, I can make some guesses about what the analysis will be.
* FYI, basic research would reveal the following (not for marks):
    * you would see that it is an ocean region in the north-west Pacific, around and east of Japan (longitudes of roughly 122 to 174°E, latitudes of roughly 35 to 45°N);
    * once you know the region covered, you can read about the processes operating in the region that the data cover;
    * because the temperature spatially defines the Kuroshio Current, you can infer that the study is about the Kuroshio Current;
    * plotting `ke` will reveal eddies in the Kuroshio Current;
    * you can read about the Kuroshio Current and its eddies and think about how eddies might affect the temperature in the region; both of these are dynamical processes.
* I will need to understand what the data are telling me, and what the variables mean. I will need to understand what the `ex` and `ke` variables are, and how they relate to the `lon` and `lat` variables.
* Having discovered all these things simply by doing a basic first-stab analysis, I can prepare a report of my cursory findings and draw up a list of things I know, together with suggested further avenues for exploration. I will take this to the professor to confirm my understanding and to get guidance on how to proceed.
* I will also add a list of the things I cannot know from the data, and what I need to know from the professor to proceed.
* There is also something strange happening with the data.
It seems that there are duplicate data entries (two occurrences of each combination of `lat` x `lon` x `t`, resulting in duplicated values of `ke` and a pair of dissimilar values of `ex` for each spatio-temporal point). I will need to understand why this is the case. Clearly this is incorrect, and it points to pre-processing errors somewhere. I will have to ask the professor to give me access to all pre-processing scripts and the raw data so that I can try to trace the error back to its source.
* If I were this professor, I would be immensely impressed by your proactive approach to the problem. You are showing that you are not just a passive learner, but that you are actively engaging with the data and the problem at hand. This is a very good sign of a good researcher in the making. I would seriously think about finding you a salary for permanent employment in my lab.

`r if (params$hide_answers) ":::"`

## Question 7 [15 marks]

a. Explain why one typically prefers working with CSV files over Excel files in R.
b. What are the properties of a CSV file that make it more suitable for data analysis in R?
c. What are the properties of an Excel file that make it less suitable for data analysis in R?

`r if (params$hide_answers) "::: {.content-hidden}"`

**Answer**

a) CSV (Comma-Separated Values) files are preferred over Excel files due to their simplicity, compatibility, and efficiency in handling data. CSV files are stored as plain text, making them easy to read and write across different software and platforms. They do not contain proprietary formatting, formulas, or metadata, which minimises the risk of unintended data transformations. Excel files (.xls, .xlsx) are proprietary and designed for spreadsheet applications, and they incorporate complex formatting, formulas, and visual elements that can interfere with data processing in R.
Unlike CSV files, which can be directly read using base R functions like `read.csv()`, Excel files require additional packages such as **readxl** for data extraction. Excel’s tendency to automatically modify data types, such as converting text to dates or numbers, is annoying and introduces errors, making CSV a more reliable format for reproducible data analysis.

b)

* CSV files store data in a simple text-based format that ensures easy readability by both humans and computers.
* Each row represents a single record, and fields are separated by commas (or another delimiter) to ensure a consistent tabular format.
* CSV files can be opened and edited using a wide range of software, including text editors, spreadsheets (*e.g.*, Excel, Google Sheets), and statistical tools (*e.g.*, R, Python).
* R provides optimised functions like `read.csv()` (base R) and `read_csv()` (tidyverse) for quickly reading CSV files; the base R function needs no additional dependencies.
* Unlike Excel files, CSV files do not contain embedded formulas, formatting, figures, or macros, which reduces the risk of unintended data stuff-ups.
* Being plain text, CSV files are typically smaller in size compared to Excel files.

c)

* Excel files are stored in a format (.xls, .xlsx) that is specific to Microsoft Excel; special packages (*e.g.*, **readxl**, **openxlsx**) are needed to read them in R.
* Excel often reformats data automatically, for example by changing numeric values to dates or rounding decimal values. This can lead to errors in data analysis.
* Excel files support formulas, pivot tables, conditional formatting, and visual elements that may not be relevant for raw data processing in R.
* Users can store multiple sheets within a single Excel file, which makes it trickier to maintain a standardised structure when importing data into R.
* Excel files are not made for handling large datasets. Excel becomes very slow and is prone to crashing or hitting memory limitations when dealing with 'big' data.
* Excel's binary files do not work well with version control systems like Git, which cannot show meaningful line-by-line differences for them.
* Excel files are complex and more prone to accidental modifications or corruption.

`r if (params$hide_answers) ":::"`

## Question 8 [20 marks]

Explain each of the following in the context of their use in R. For each, provide an example of how you would construct them in R:

a. A vector
b. A matrix
c. A dataframe
d. A list

`r if (params$hide_answers) "::: {.content-hidden}"`

**Answer**

(a) A vector in R is the simplest and most fundamental data structure. It is a one-dimensional collection of elements, all of the same type (*e.g.*, numeric, character, or logical). Vectors can be created using the `c()` function. For example:

```{r code-numbers-c, eval=FALSE}
# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)

# Creating a character vector
names <- c("Acacia", "Protea", "Leucadendron")

# Creating a logical vector
logical_values <- c(TRUE, FALSE, TRUE)
```

(b) A matrix is a two-dimensional data structure where all elements must be of the same type. It is essentially an extension of a vector with a specified number of rows and columns.

```{r code-my-matrix-matrix-c-nrow, eval=FALSE}
# Creating a matrix with 3 rows and 2 columns
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
```

(c) A dataframe is a two-dimensional data structure that can contain different data types in different columns (variables). It is the most commonly used data structure for data analysis in R and resembles a table with rows and columns.

```{r code-my-dataframe-data-frame, eval=FALSE}
# Creating a dataframe
my_dataframe <- data.frame(
  Name = c("Acacia", "Protea", "Leucadendron"),
  Age = c(25, 30, 22),
  Height = c(85.5, 90.3, 78.0)
)
```

(d) A list is a flexible data structure that can store elements of different types, including vectors, matrices, dataframes, and even other lists. Unlike vectors and matrices, which require uniform data types, lists can contain heterogeneous elements.

```{r code-my-list-list, eval=FALSE}
# Creating a list with different data types
# Uses the objects created above, for example
my_list <- list(
  plants = my_dataframe,
  some_numbers = my_matrix,
  other_numbers = numbers
)
```

`r if (params$hide_answers) ":::"`

**TOTAL MARKS: 100**

**-- THE END --**