---
title: "BCB744: Intro R Theory Test"
date: "2026-02-13"
params:
hide_answers: true
format:
html: default
typst:
fontsize: 10pt
code-line-numbers: true
---
```{r code-brewing-opts, echo=FALSE}
knitr::opts_chunk$set(
comment = "R>",
warning = FALSE,
message = FALSE,
fig.width = 4.5,
fig.height = 2.625,
out.width = "75%",
fig.asp = NULL, # control via width/height
dpi = 300
)
ggplot2::theme_set(
  ggplot2::theme_bw(base_size = 8)
)
```
```{r code-setup, include=FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
cache = TRUE,
echo = TRUE,
eval = TRUE
)
```
# Instructions
The Intro R Theory Test will start at **12:30** on **26 March, 2026**. You have until **15:30** to complete it.
Your answers should demonstrate a comprehensive understanding of the theoretical concepts and techniques required to read and comprehend R code.
Only answer what is explicitly stated. For example, if the question asks for only a graph as the final output, *only* the graph will be assessed, not the reasoning that brought you there. Anything extra will not earn extra marks, so save yourself the time and produce the most concise answer possible given the content of the question. **What is required will *always* be explicitly stated.**
**This is a closed book assessment.** Below is a set of questions to answer. You must answer **all** questions in the allocated time of **3 hours**. Please write your answers neatly in the answer book provided. Structure your answers logically.
## Question 1 [10 marks]
Please translate the following code into English by providing an explanation for each line. At the end, state what kind of data structure this pipeline produces and how that output could flow into a `ggplot2` figure.
```{r code-monthlydata-dailydata, echo=TRUE, eval=FALSE}
library(tidyverse)
monthlyData <- dailyData |>
  mutate(t = as.POSIXct(t)) |>
  mutate(month = floor_date(t, unit = "month")) |>
  group_by(lon, lat, month) |>
  summarise(temp = median(temp, na.rm = TRUE)) |>
  mutate(year = year(month)) |>
  group_by(lon, lat) |>
  mutate(num = seq(1:length(temp))) |>
  ungroup()
```
In your answer, refer to the line numbers (1-10) and provide an explanation for each line. Then add one final sentence describing the resulting dataset and a likely `ggplot2` use.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
* Line 1: `library(tidyverse)` loads the packages needed for the data manipulation pipeline, such as `dplyr`.
* Line 2: `monthlyData <- dailyData |>` starts with the dataset `dailyData` and begins a pipeline that will create a new object called `monthlyData`.
* Line 3: `mutate(t = as.POSIXct(t))` converts the variable `t` into a date-time format that R can use for time-based calculations.
* Line 4: `mutate(month = floor_date(t, unit = "month"))` creates a new variable called `month` by rounding each date down to the start of its month.
* Line 5: `group_by(lon, lat, month)` groups the data by longitude, latitude, and month so that the next calculation is done separately for each location in each month.
* Line 6: `summarise(temp = median(temp, na.rm = TRUE))` calculates the median temperature for each longitude-latitude-month group, ignoring missing values.
* Line 7: `mutate(year = year(month))` creates a new variable called `year` by extracting the year from the `month` variable.
* Line 8: `group_by(lon, lat)` changes the grouping so that the data are now grouped only by spatial location, not by month.
* Line 9: `mutate(num = seq(1:length(temp)))` creates an index variable `num` that numbers the monthly observations within each location.
* Line 10: `ungroup()` removes the grouping structure so that subsequent operations apply to the whole dataset.
The final output is a dataframe/tibble with one row per longitude-latitude-month combination and summary variables that are ready to flow into a `ggplot2` plot such as a time-series line graph of monthly median temperature for each location.
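As a sketch of that final step (hypothetical stand-in data, since `dailyData` is not provided here), the summarised table maps straight onto a time-series line plot:

```{r code-monthly-plot-sketch, eval=FALSE}
library(ggplot2)

# Hypothetical miniature of monthlyData (one location, three months)
monthly_demo <- data.frame(
  lon = 121.875, lat = 34.625,
  month = as.Date(c("2013-01-01", "2013-02-01", "2013-03-01")),
  temp = c(14.2, 13.8, 15.1)
)

# One time-series line per lon-lat location
p <- ggplot(monthly_demo, aes(x = month, y = temp,
                              group = interaction(lon, lat))) +
  geom_line() +
  labs(x = "Month", y = "Monthly median temperature")
```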
`r if (params$hide_answers) ":::"`
## Question 2 [5 marks]
What are the three properties of **tidy data**? Briefly explain each one, and then state why tidy data work especially well with `ggplot2`.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
Tidy data have three defining properties:
* Each **variable** must have its own column.
* Each **observation** must have its own row.
* Each **value** must have its own cell.
These rules ensure that the structure of the dataset is clear and consistent. Variables are not spread across multiple columns, different observations are not mixed within the same row, and individual values are not bundled together in one cell.
Tidy data work especially well with `ggplot2` because `ggplot2` expects a clean dataframe in which variables can be mapped directly to aesthetics such as `x`, `y`, `colour`, or `facet`. When data are tidy, plotting and further analysis are straightforward because the software does not have to guess what the rows, columns, or cells represent.
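A minimal sketch of the point, using hypothetical data: an untidy table with one variable (year) spread across two columns is reshaped with `tidyr::pivot_longer()` so that each variable has its own column.

```{r code-tidy-sketch, eval=FALSE}
library(tidyr)

# Untidy: the variable "year" is spread across two columns
wide <- data.frame(
  site = c("A", "B"),
  `2023` = c(10, 12),
  `2024` = c(11, 14),
  check.names = FALSE
)

# Tidy: each variable (site, year, count) has its own column,
# and each observation has its own row
long <- pivot_longer(wide, cols = c(`2023`, `2024`),
                     names_to = "year", values_to = "count")
```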
`r if (params$hide_answers) ":::"`
## Question 3 [20 marks]
Using the penguin data provided in Table 1, please draw the figure that the following code block produces.
```{r}
#| echo: false
library(palmerpenguins)
library(tidyverse)
library(gt)
data(penguins)
set.seed(744)
pen <- penguins |>
dplyr::group_by(species, island) |>
dplyr::slice_sample(n = 4) |>
dplyr::ungroup()
pen |>
gt::gt() |>
gt::fmt_number(
columns = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
decimals = 1
) |>
gt::cols_label(
species = "Species",
island = "Island",
bill_length_mm = "Bill length (mm)",
bill_depth_mm = "Bill depth (mm)",
flipper_length_mm = "Flipper length (mm)",
body_mass_g = "Body mass (g)",
sex = "Sex",
year = "Year"
) |>
gt::tab_header(
title = gt::md("**Table 1:** Penguin Sample (n = 4 per Species x Island)")
) |>
gt::opt_table_font(
font = list(gt::google_font("Source Sans 3"), gt::default_fonts())
) |>
gt::tab_options(
table.font.size = gt::px(11),
heading.title.font.size = gt::px(13),
column_labels.font.size = gt::px(10),
data_row.padding = gt::px(2),
table.width = gt::pct(100)
)
```
```{r}
#| eval: false
#| echo: true
#| fig-width: 8
#| fig-height: 5
pen_long <- pen |>
pivot_longer(
cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
names_to = "measurement_type",
values_to = "value_mm"
) |>
group_by(species, island, measurement_type) |>
summarise(
mean_mm = mean(value_mm, na.rm = TRUE),
sd_mm = sd(value_mm, na.rm = TRUE),
.groups = "drop"
)
ggplot(pen_long, aes(x = measurement_type, y = mean_mm, fill = measurement_type)) +
geom_col(width = 0.7, colour = "black", linewidth = 0.2, show.legend = FALSE) +
geom_errorbar(
aes(ymin = mean_mm - sd_mm, ymax = mean_mm + sd_mm),
width = 0.2,
linewidth = 0.2
) +
facet_grid(species ~ island) +
scale_x_discrete(
labels = c(
bill_length_mm = "Bill length", bill_depth_mm = "Bill depth",
flipper_length_mm = "Flipper length"
)
) +
labs(x = "Measurement", y = "Mean length (mm) ± SD") +
theme_bw(base_size = 11) +
theme(
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 20, hjust = 1)
)
```
Marks will *only* be assigned for the figure that the code produces.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: false
#| eval: true
pen_long <- pen |>
tidyr::pivot_longer(
cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
names_to = "measurement_type",
values_to = "value_mm"
) |>
dplyr::group_by(species, island, measurement_type) |>
dplyr::summarise(
mean_mm = mean(value_mm, na.rm = TRUE),
sd_mm = sd(value_mm, na.rm = TRUE),
.groups = "drop"
)
ggplot(pen_long, aes(x = measurement_type, y = mean_mm,
fill = measurement_type)) +
geom_col(width = 0.7, colour = "black", linewidth = 0.2,
show.legend = FALSE) +
geom_errorbar(
aes(ymin = mean_mm - sd_mm, ymax = mean_mm + sd_mm),
width = 0.2,
linewidth = 0.2
) +
facet_grid(species ~ island) +
scale_x_discrete(
labels = c(
bill_length_mm = "Bill length",
bill_depth_mm = "Bill depth",
flipper_length_mm = "Flipper length"
)
) +
labs(
x = "Measurement",
y = "Mean length (mm) ± SD"
) +
theme_bw(base_size = 11) +
theme(
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 20, hjust = 1)
)
```
`r if (params$hide_answers) ":::"`
## Question 4 [5 marks]
Why do we prefer to use R over Excel for data analysis and statistics?
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
We prefer R over Excel for data analysis and statistics because R is a programming language designed specifically for reproducible data handling, statistical analysis, and graphics, whereas Excel is primarily a spreadsheet application.
R has several important advantages:
- analyses are written as code, which makes every step transparent and reproducible;
- the same script can be rerun on updated data without repeating the work manually;
- R has a far wider range of statistical methods and graphical tools than Excel;
- large datasets are generally easier to manage in R than in Excel;
- R works better with version control and reproducible research workflows;
- manual editing is minimised, which reduces the chance of hidden mistakes.
Excel is useful for quick viewing and simple tables, but it is less suitable for serious statistical work because it encourages point-and-click workflows, makes reproducibility difficult, and can silently alter data formats such as dates, numbers, or text codes. For those reasons, R is generally the preferred tool for scientific data analysis.
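A minimal sketch of the reproducibility argument (all names are hypothetical, and the input CSV is written to a temporary file only so the example is self-contained): because every step is code, rerunning the script on an updated file repeats the whole analysis without manual work.

```{r code-repro-sketch, eval=FALSE}
# Hypothetical input file, written to a temporary location
infile <- tempfile(fileext = ".csv")
write.csv(data.frame(temp = c(14.2, 15.1, 13.8)), infile,
          row.names = FALSE)

# The analysis itself: rerunning these lines on an updated
# CSV reproduces the result with no manual steps
sst <- read.csv(infile)
summary_stats <- data.frame(
  mean_temp = mean(sst$temp, na.rm = TRUE),
  sd_temp = sd(sst$temp, na.rm = TRUE)
)
```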
`r if (params$hide_answers) ":::"`
## Question 5 [10 marks]
By way of example, please explain some key aspects of R code conventions. For each line of code you write (**neatly and legibly so each intended style item is visible**), explain also in English what aspects of the code are being adhered to.
For example:
**`a <- b`** is not the same as **`a < -b`**. The former is correct: a space precedes and follows the assignment operator (**`<-`**, a less-than sign immediately followed by a dash to form an arrow). The latter has a different meaning: because the space falls between the less-than sign and the dash, it reads as "a is less than negative b".
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
1. Proper use of indentation:
```{r code-if-x, eval=FALSE}
if (x > 0) {
print("Positive number")
}
```
2. Use of meaningful variable names:
```{r code-temperature, eval=FALSE}
temperature <- 25
```
3. Use of comments to explain code:
```{r code-mean-temp-mean-temperature, eval=FALSE}
# Calculate the mean temperature
mean_temp <- mean(temperature)
```
4. Consistent use of spacing around operators:
```{r code-a-b-c, eval=FALSE}
a <- b + c
```
5. Consistent use of compound object names:
A principle of writing clean and readable R code (or *any* code) is maintaining a consistent variable naming convention throughout a script or project. Mixing different naming styles --- such as "snake_case" (words separated by underscores) and "camelCase" (capitalising the first letter of each subsequent word) --- makes the code harder to read, maintain, and debug.
Examples:
```{r code-my-variable-snake-case, eval=FALSE}
# Example of consistent use of one convention (snake case):
my_variable <- 10
another_variable <- 20

# An example of inconsistent use of conventions:
myVariable <- 30           # camel case
yet_another_variable <- 40 # snake case

# This is also inconsistent:
variable_one <- 13     # lowercase "one"
variable_Two <- 13 * 2 # uppercase "Two"
```
6. Avoiding `=` as the assignment operator:
```{r code-a, eval=FALSE}
# Correct:
a <- 1
# Incorrect:
a = 1
```
7. Consistent use of a space after the `#` symbol in comments:
```{r code-chunk, eval=FALSE}
# This is correct:
# This is a comment
# This is another comment
# And another
# This is incorrect (no space after the # symbol):
#This is a comment
#A comment?
#Another comment
```
8. Correct use of `+` and `-` for unary operators:
```{r code-a-b, eval=FALSE}
# Correct:
a <- -b
```
9. Use of `TRUE` and `FALSE` instead of `T` and `F`:
```{r code-is-positive-true, eval=FALSE}
# Correct:
is_positive <- TRUE
# Incorrect:
is_positive <- T
```
For more, refer to the [tidyverse style guide](https://style.tidyverse.org/syntax.html).
`r if (params$hide_answers) ":::"`
## Question 6 [15 marks]
You are a research assistant who has just been given your first job. You are asked to analyse a dataset about patterns of extreme heat in the ocean and the possible role that ocean currents (specifically, eddies) might play in modulating extreme sea surface temperatures in space and time.
Being naive and relatively inexperienced, and misguided by the exaggerated sense of preparedness to which young people are prone, you gladly accept the task and start by exploring the data. You notice that the dataset is quite large and you have no idea what is happening, what you are doing, why you are doing it, or what you are looking for. Ten minutes into the job you start to question your life choices. Your bewilderment is compounded by the fact that, when you examine the data (the output of the `head()` and `tail()` commands is shown below), the entries seem confusing.
```{r code-fpath-volumes-oceandata-spatial, eval=FALSE}
fpath <- "/Volumes/OceanData/spatial/processed/WBC/misc_results"
fname <- "KC-MCA-data-2013-01-01-2022-12-31-bbox-v1_ma_14day_detrended.csv"
data <- read.csv(file.path(fpath, fname))
```
```{r code-nrow-data, echo=TRUE, eval=FALSE}
> nrow(data)
[1] 53253434
> head(data)
t lon lat ex ke
1 2013-01-01 121.875 34.625 -0.7141 2e-04
2 2013-01-01 121.875 34.625 -0.8027 2e-04
3 2013-01-02 121.875 34.625 -0.8916 2e-04
4 2013-01-02 121.875 34.625 -0.9751 2e-04
5 2013-01-03 121.875 34.625 -1.0589 3e-04
6 2013-01-03 121.875 34.625 -1.1406 3e-04
> tail(data)
t lon lat ex ke
53253429 2022-12-29 174.375 44.875 0.4742 -0.0049
53253430 2022-12-29 174.375 44.875 0.4856 -0.0049
53253431 2022-12-30 174.375 44.875 0.4969 -0.0050
53253432 2022-12-30 174.375 44.875 0.5169 -0.0050
53253433 2022-12-31 174.375 44.875 0.5367 -0.0051
53253434 2022-12-31 174.375 44.875 0.5465 -0.0051
```
You resign yourself to admitting that you do not understand much but, at the risk of sounding like a fool when you go to your professor, you decide to do as much preparation as you can so that you at least have something to show for your time.
a. What will you take back to your professor to show that you have prepared yourself as fully as possible? For example:
- What is in your ability to understand about the study and the nature of the data?
- What will you do for yourself to better understand the task at hand?
- What do you understand about the data?
- What will you do to aid your understanding of the data?
- What will your next steps be going forward?
- Etc. (Anything else you can think of doing to convince the professor you thought about the data?) [*/10 marks*]
b. What will you need from your professor to help you understand the data and the task at hand so that you are well equipped to tackle the problem? [*/5 marks*]
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
* I am able to understand what the concept of 'extreme heat' is, and what ocean eddies are --- all I need to do is find some papers and do broad reading around these concepts, so I will start there.
* I can see from the columns that there appear to be three independent variables (`lon`, `lat`, `t`) and two dependent variables (`ex` and `ke`). It is easy to see that `lon` and `lat` are the longitude and latitude of the data points, and that `t` is the date of each observation. I will need to find out what `ex` and `ke` represent and how they relate to each other; presumably they relate to the temperature extremes and the eddies, respectively. I will confirm this with the professor.
* Because I have `lon` and `lat`, I can make a map of the study area. By mapping one or a few days in the dataset, I can get a sense of the spatial distribution of the data. I can also plot the `ex` and `ke` variables to see what the data look like. Because the data cover the period 2013-2022, I know that I can create a map for each day (a time-series analysis might eventually be needed), and that is probably where the analysis will take me once I have confirmed my thinking with the professor. If I am really proactive and want to seriously impress the professor, I will make an animation of the data to show how the revealed patterns evolve over time; this will clearly show the processes operating there. A really informed mind might even anticipate what the analysis should entail, but, admittedly, this requires a deep subject-matter understanding that I might not yet possess --- although it is not beyond my reach to attain without guidance.
* I can conclude that the data reveal some dynamical process (I infer 'dynamical' from the fact that we have time-series data, and time-series reveal dynamics).
* Knowing what the geographical region is from the map I created and what is happening there that might be of interest to the study, I can make some guesses about what the analysis will be.
* FYI, what basic research would reveal includes the following (not for marks):
  * the coordinates (longitudes of roughly 122-174°E and latitudes of roughly 35-45°N) and the `KC` in the file name point to an ocean region east of Japan;
  * once you know the region covered, you can read about the processes operating there;
  * since `WBC` in the file path suggests western boundary currents, you can infer that the study is about the Kuroshio Current;
  * plotting `ke` will reveal eddies in the Kuroshio Current;
  * you can read about the Kuroshio Current and its eddies and think about how eddies might affect the temperature in the region --- both of these are dynamical processes.
* I will need to understand what the data are telling me and what each variable means.
* Having discovered all these things simply by doing a basic first-stab analysis, I can prepare a report of my cursory findings and draw up a list of things I know, together with suggested further avenues for exploration. I will take this to the professor to confirm my understanding and to get guidance on how to proceed.
* I will also add a list of the things I cannot know from the data, and what I need to know from the professor to proceed.
* There is also something strange happening with the data. There seem to be duplicate entries: two occurrences of each combination of `lat` x `lon` x `t`, with duplicated values of `ke` but a pair of dissimilar values of `ex` for each spatio-temporal point. I will need to understand why this is the case. Clearly this is incorrect, and it points to a pre-processing error somewhere. I will ask the professor for access to all the pre-processing scripts and the raw data to see if I can trace the error back to its source.
* If I were this professor, I would be immensely impressed by your proactive approach to the problem. You are showing that you are not just a passive learner, but that you are actively engaging with the data and the problem at hand. This is a very good sign of a researcher in the making, and I would seriously think about finding you a salary for permanent employment in my lab.
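The duplicate check described above can be sketched as follows (a hypothetical four-row miniature of the dataset stands in for the 53-million-row original):

```{r code-dup-check-sketch, eval=FALSE}
# Hypothetical miniature showing the doubled rows
demo <- data.frame(
  t = c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"),
  lon = 121.875, lat = 34.625,
  ex = c(-0.7141, -0.8027, -0.8916, -0.9751),
  ke = c(2e-04, 2e-04, 2e-04, 2e-04)
)

# Count rows per t-lon-lat combination; any count > 1 is suspect
dup_counts <- aggregate(ex ~ t + lon + lat, data = demo, FUN = length)
names(dup_counts)[4] <- "n"
dup_counts[dup_counts$n > 1, ]
```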
`r if (params$hide_answers) ":::"`
## Question 7 [15 marks]
Name the general characteristics of ASCII-type data files, then name and explain three common variations of these tabular data files.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
ASCII-type tabular data files are plain-text files in which each row is stored as a line of text and the fields are separated by a delimiter.
a. **Tab-separated values (`.tsv`)**
In a TSV file, the fields are separated by tab characters. This format is commonly encountered when data contain commas within text fields, because tabs reduce ambiguity. In R, one practical issue is that you must use the correct reader or delimiter setting, such as `read.delim()` or `readr::read_tsv()`, otherwise the entire line may be imported into a single column.
b. **Comma-separated values (`.csv`)**
In a CSV file, the fields are separated by commas. This is one of the most common file formats for exchanging tabular data between spreadsheets, databases, and statistical software. A practical issue is that commas may also appear inside text fields, so quoting rules matter. In R, CSV files are commonly read with `read.csv()` or `readr::read_csv()`.
c. **Semicolon-delimited text files**
In semicolon-delimited files, the fields are separated by semicolons. These are often encountered in regional settings where commas are used as decimal separators, making comma-delimited files inconvenient or ambiguous. A practical issue in R is that importing such a file as though it were a standard CSV will parse the columns incorrectly. You would instead use something like `read.csv2()` or specify `sep = ";"`.
These formats are all plain-text table files, but they differ in the delimiter used. Correctly identifying the delimiter is essential for importing the data properly into R.
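These differences can be sketched in base R (temporary files, hypothetical data): each writer pairs with the reader that expects the same delimiter.

```{r code-delim-sketch, eval=FALSE}
df <- data.frame(site = c("A", "B"), temp = c(14.5, 15.2))

# CSV: comma-delimited, read with read.csv()
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
csv_in <- read.csv(csv_file)

# TSV: tab-delimited, read with read.delim()
tsv_file <- tempfile(fileext = ".tsv")
write.table(df, tsv_file, sep = "\t", row.names = FALSE)
tsv_in <- read.delim(tsv_file)

# Semicolon-delimited with decimal commas, read with read.csv2()
semi_file <- tempfile(fileext = ".txt")
write.csv2(df, semi_file, row.names = FALSE)
semi_in <- read.csv2(semi_file)
```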
`r if (params$hide_answers) ":::"`
## Question 8 [20 marks]
Explain each of the following in the context of their use in R. For each, provide an example of how you would construct them in R:
a. A vector
b. A matrix
c. A dataframe
d. A list
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
(a) A vector in R is the simplest and most fundamental data structure. It is a one-dimensional collection of elements, all of the same type (*e.g.*, numeric, character, or logical). Vectors can be created using the `c()` function. For example:
```{r code-numbers-c, eval=FALSE}
# Creating a numeric vector
numbers <- c(1, 2, 3, 4, 5)
# Creating a character vector
names <- c("Acacia", "Protea", "Leucadendron")
# Creating a logical vector
logical_values <- c(TRUE, FALSE, TRUE)
```
(b) A matrix is a two-dimensional data structure where all elements must be of the same type. It is essentially an extension of a vector with a specified number of rows and columns.
```{r code-my-matrix-matrix-c-nrow, eval=FALSE}
# Creating a matrix with 3 rows and 2 columns
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
```
(c) A dataframe is a two-dimensional data structure that can contain different data types in different columns (variables). It is the most commonly used data structure for data analysis in R and resembles a table with rows and columns.
```{r code-my-dataframe-data-frame, eval=FALSE}
# Creating a dataframe
my_dataframe <- data.frame(
Name = c("Acacia", "Protea", "Leucadendron"),
Age = c(25, 30, 22),
Height = c(85.5, 90.3, 78.0)
)
```
(d) A list is a flexible data structure that can store elements of different types, including vectors, matrices, dataframes, and even other lists. Unlike vectors and matrices, which require uniform data types, lists can contain heterogeneous elements.
```{r code-my-list-list, eval=FALSE}
# Creating a list with different data types
# Uses the data created above, for example
my_list <- list(
plants = my_dataframe,
  some_numbers = my_matrix,
other_numbers = numbers
)
```
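A short sketch of how list elements are then accessed (rebuilt here so the example is self-contained): `$` and `[[ ]]` extract the element itself, while single `[ ]` returns a one-element list.

```{r code-list-access, eval=FALSE}
numbers <- c(1, 2, 3, 4, 5)
my_list <- list(some_numbers = numbers, label = "demo")

by_dollar <- my_list$some_numbers        # the numeric vector itself
by_brackets <- my_list[["some_numbers"]] # same result via [[ ]]
as_sublist <- my_list["some_numbers"]    # a list containing the vector
```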
`r if (params$hide_answers) ":::"`
**TOTAL MARKS: 100**
**-- THE END --**