BCB744: Intro R Theory Test
1 Instructions
The Intro R Theory Test will start at 9:00 on 13 February, 2026. You have until 12:00 to complete it.
Your answer should demonstrate a comprehensive understanding of the theoretical concepts and techniques required to read and comprehend R code.
Only answer what is explicitly stated. For example, if the question asks for only a graph as the final output, only the graph will be assessed, not the reasoning that brought you there. Anything extra will not earn extra marks, so save yourself the time and produce the most concise answer possible given the content of the question. What is required will always be explicitly stated.
This is a closed book assessment. Below is a set of questions to answer. You must answer all questions in the allocated time of three hours. Please write your answers neatly in the answer book provided. Structure your answers logically.
1.1 Question 1 [10 marks]
Please translate the following code into English by providing an explanation for each line:
In your answer, simply refer to the line numbers (1-10) before each line of code, and provide an explanation for each line.
Answer
- Line 1: The variable `monthlyData` is created by starting with `dailyData`, which is a dataset containing daily records.
- Line 2: The `mutate()` function is used to convert the column `t` (presumably a date or timestamp) into a POSIXct datetime format. This ensures that `t` is stored in a standardised date-time format suitable for time-based operations.
- Line 3: The `mutate()` function is again used to create a new column `month`, which is derived from `t`. The `floor_date()` function rounds down the date to the first day of the corresponding month, effectively extracting the month from `t`.
- Line 4: The `group_by()` function groups the dataset by `lon` (longitude), `lat` (latitude), and `month`. This means subsequent operations will be performed separately for each unique combination of these three variables.
- Line 5: The `summarise()` function computes the mean temperature (`temp`) for each group. The `na.rm = TRUE` argument ensures that missing values (`NA`) are ignored in the calculation.
- Line 6: The `mutate()` function creates a new column, `year`, extracting the year from the `month` column. This provides an explicit reference to the year of each data entry.
- Line 7: The `group_by()` function is applied again, but this time only by `lon` and `lat`. This modifies the grouping structure to remove the month grouping while retaining spatial grouping.
- Line 8: The `mutate()` function adds a new column, `num`, which assigns a sequence of numbers (`1:length(temp)`) to the grouped data. This effectively creates an index for each record within each longitude-latitude group.
- Line 9: The `ungroup()` function removes all grouping, ensuring that further operations on `monthlyData` are performed on the entire dataset rather than within groups.
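The code block that this answer refers to is not shown above. As a rough aid, the pipeline below is one sketch that would fit the nine line-by-line explanations. It is reconstructed from the answer rather than copied from the test, and it assumes the dplyr and lubridate packages, the native pipe, and a small made-up `dailyData`.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy input: 60 daily records at a single location.
# Column names follow the answer; the values are invented for illustration.
dailyData <- data.frame(
  t    = format(seq(as.Date("2020-01-01"), as.Date("2020-02-29"), by = "day")),
  lon  = 18.5,
  lat  = -34.5,
  temp = c(NA, rep(10, 30), rep(20, 29))
)

# Assumed reconstruction; the exact arguments in the test may differ.
monthlyData <- dailyData |>
  mutate(t = as.POSIXct(t)) |>                    # text to date-time
  mutate(month = floor_date(t, unit = "month")) |> # first day of each month
  group_by(lon, lat, month) |>
  summarise(temp = mean(temp, na.rm = TRUE)) |>   # monthly mean per location
  mutate(year = year(month)) |>
  group_by(lon, lat) |>
  mutate(num = 1:length(temp)) |>                 # running index per location
  ungroup()
```

With this toy input the result has one row per month, and `num` counts the months within the single longitude-latitude pair.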
1.2 Question 2 [5 marks]
What is ‘Occam’s Razor’? What is the relevance to science?
Answer
Occam’s Razor is a principle of parsimony often attributed to the 14th-century philosopher William of Ockham. While the famous phrasing, “Entities should not be multiplied beyond necessity,” does not appear in his surviving works, the principle captures his idea that simpler explanations are generally preferable. It is relevant to science, and to the BCB744 module, because it is often interpreted as “the simplest explanation that sufficiently explains the data should be preferred over more complex alternatives.” This is a useful guiding principle in research, especially when you are faced with multiple explanations for a phenomenon: more complex explanations should only be considered when the simpler ones fail to account for the data. Keep in mind, however, that biological systems tend to be complex, and oversimplifying an explanation may ignore important interactions and heterogeneities.
1.3 Question 3 [20 marks]
Using the penguin data provided in Table 1, please draw the figure that the code block produces.
Table 1: Penguin Sample (n = 4 per Species x Island)

| Species | Island | Bill length (mm) | Bill depth (mm) | Flipper length (mm) | Body mass (g) | Sex | Year |
|---|---|---|---|---|---|---|---|
| Adelie | Biscoe | 37.8 | 20.0 | 190.0 | 4,250.0 | male | 2009 |
| Adelie | Biscoe | 37.9 | 18.6 | 193.0 | 2,925.0 | female | 2009 |
| Adelie | Biscoe | 37.9 | 18.6 | 172.0 | 3,150.0 | female | 2007 |
| Adelie | Biscoe | 40.1 | 18.9 | 188.0 | 4,300.0 | male | 2008 |
| Adelie | Dream | 41.5 | 18.5 | 201.0 | 4,000.0 | male | 2009 |
| Adelie | Dream | 41.1 | 19.0 | 182.0 | 3,425.0 | male | 2007 |
| Adelie | Dream | 36.0 | 17.9 | 190.0 | 3,450.0 | female | 2007 |
| Adelie | Dream | 38.1 | 18.6 | 190.0 | 3,700.0 | female | 2008 |
| Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3,450.0 | female | 2009 |
| Adelie | Torgersen | 42.9 | 17.6 | 196.0 | 4,700.0 | male | 2008 |
| Adelie | Torgersen | 33.5 | 19.0 | 190.0 | 3,600.0 | female | 2008 |
| Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3,650.0 | male | 2007 |
| Chinstrap | Dream | 45.2 | 17.8 | 198.0 | 3,950.0 | female | 2007 |
| Chinstrap | Dream | 49.3 | 19.9 | 203.0 | 4,050.0 | male | 2009 |
| Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3,500.0 | female | 2007 |
| Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3,500.0 | female | 2008 |
| Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5,750.0 | male | 2009 |
| Gentoo | Biscoe | 49.5 | 16.1 | 224.0 | 5,650.0 | male | 2009 |
| Gentoo | Biscoe | 52.5 | 15.6 | 221.0 | 5,450.0 | male | 2009 |
| Gentoo | Biscoe | 49.3 | 15.7 | 217.0 | 5,850.0 | male | 2007 |
```r
pen_long <- pen |>
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
    names_to = "measurement_type",
    values_to = "value_mm"
  ) |>
  group_by(species, island, measurement_type) |>
  summarise(value_mm = mean(value_mm, na.rm = TRUE), .groups = "drop")

ggplot(pen_long, aes(x = measurement_type, y = value_mm, fill = measurement_type)) +
  geom_col(width = 0.7, colour = "black", linewidth = 0.2, show.legend = FALSE) +
  facet_grid(species ~ island) +
  scale_x_discrete(
    labels = c(
      bill_length_mm = "Bill length", bill_depth_mm = "Bill depth",
      flipper_length_mm = "Flipper length"
    )
  ) +
  labs(x = "Measurement", y = "Mean length (mm)") +
  theme_bw(base_size = 11) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 20, hjust = 1)
  )
```
Marks will only be assigned for the figure that the code produces.
Answer
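The figure itself is not reproduced here. As a rough check on the bar heights it should show, the group means can be computed from Table 1. The sketch below assumes the table has been typed by hand into a tibble named `pen`, using the column names from the code block and only the columns that the code uses; the tibble and the object name `pen_means` are illustrative and not part of the test.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Table 1 transcribed by hand (species, island and the three length columns only).
pen <- tribble(
  ~species,    ~island,     ~bill_length_mm, ~bill_depth_mm, ~flipper_length_mm,
  "Adelie",    "Biscoe",    37.8, 20.0, 190,
  "Adelie",    "Biscoe",    37.9, 18.6, 193,
  "Adelie",    "Biscoe",    37.9, 18.6, 172,
  "Adelie",    "Biscoe",    40.1, 18.9, 188,
  "Adelie",    "Dream",     41.5, 18.5, 201,
  "Adelie",    "Dream",     41.1, 19.0, 182,
  "Adelie",    "Dream",     36.0, 17.9, 190,
  "Adelie",    "Dream",     38.1, 18.6, 190,
  "Adelie",    "Torgersen", 40.2, 17.0, 176,
  "Adelie",    "Torgersen", 42.9, 17.6, 196,
  "Adelie",    "Torgersen", 33.5, 19.0, 190,
  "Adelie",    "Torgersen", 39.3, 20.6, 190,
  "Chinstrap", "Dream",     45.2, 17.8, 198,
  "Chinstrap", "Dream",     49.3, 19.9, 203,
  "Chinstrap", "Dream",     46.5, 17.9, 192,
  "Chinstrap", "Dream",     45.5, 17.0, 196,
  "Gentoo",    "Biscoe",    50.4, 15.7, 222,
  "Gentoo",    "Biscoe",    49.5, 16.1, 224,
  "Gentoo",    "Biscoe",    52.5, 15.6, 221,
  "Gentoo",    "Biscoe",    49.3, 15.7, 217
)

# Same reshaping and summary as the question's code: one mean per
# species x island x measurement, which is one bar in the figure.
pen_means <- pen |>
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm),
    names_to = "measurement_type",
    values_to = "value_mm"
  ) |>
  group_by(species, island, measurement_type) |>
  summarise(value_mm = mean(value_mm, na.rm = TRUE), .groups = "drop")
```

Under these assumptions `pen_means` has 15 rows, so the figure has 15 bars spread over five populated panels of a 3 x 3 grid (species in rows, islands in columns). The remaining four panels are empty, and within each panel the bars appear in the order bill depth, bill length, flipper length.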
1.4 Question 4 [5 marks]
Explain the difference between R and RStudio.
Answer
Taken verbatim from Tangled Bank:
R is a programming language and software environment for statistical computing and graphics. It provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, multivariate analyses, neural networks, and so forth) and graphical techniques, and is highly extensible.
RStudio is an integrated development environment (IDE) for R. It provides a graphical user interface (GUI) for working with R, making it easier to use for those who are less familiar with command-line interfaces. Some of the features provided by RStudio include:
- a code editor with syntax highlighting and code completion;
- a console for running R code;
- a graphical interface for managing packages and libraries;
- integrated tools for plotting and visualisation;
- support for version control with Git and SVN.
R is the core software for statistical computing, like a car’s engine, while RStudio provides a more user-friendly interface for working with R, like the car’s body, the seats, steering wheel, and other bells and whistles.
1.5 Question 5 [10 marks]
By way of example, please explain some key aspects of R code conventions. For each line of code you write (neatly and legibly so each intended style item is visible), explain also in English what aspects of the code are being adhered to.
For example:
`a <- b` is not the same as `a < -b`. The former is correct because there is a space preceding and following the assignment operator (`<-`, a less-than sign immediately followed by a dash to form an arrow); this has a different meaning from the latter, which is incorrect because a space separates the less-than sign from the dash, so that it reads as “a is less than negative b”.
Answer
- Proper use of indentation:
- Use of meaningful variable names:
- Use of comments to explain code:
- Consistent use of spacing around operators:
- Consistent use of compound object names:
A principle of writing clean and readable R code (or any code) is maintaining consistent variable naming conventions throughout a script or project. Mixing different naming styles, such as “snake_case” (words separated by underscores) and “camelCase” (capitalising the first letter of each subsequent word), makes the code harder to read, maintain, and debug.
Examples:
```r
# Example of consistent use of one convention:
my_variable <- 10 # snake case
another_variable <- 20 # snake case

# An example of inconsistent use of conventions:
myVariable <- 30 # camel case
yet_another_variable <- 40 # snake case

# This is also incorrect:
variable_one <- 13 # lowercase "one"
variable_Two <- 13 * 2 # uppercase "Two"
```
- Avoiding `=` as the assignment operator:
- Consistent use of spaces around # symbols in comments:
- Correct use of `+` and `-` for unary operators:
- Use of `TRUE` and `FALSE` instead of `T` and `F`:
For more, refer to the tidyverse style guide.
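Several of the conventions listed above can be seen together in a few lines. The snippet below is only an illustrative sketch; the object names and values are made up for the purpose.

```r
# Meaningful snake_case names; `<-` for assignment, with spaces around it.
body_mass_g <- c(4250, 2925, 3150, 4300)
mean_mass_kg <- mean(body_mass_g) / 1000 # spaces around binary operators

# Indentation shows which statements belong inside the braces.
if (mean_mass_kg > 3) {
  size_class <- "large"
} else {
  size_class <- "small"
}

# A unary minus sits directly against its operand, with no space after the sign.
offset_g <- -50
adjusted_mass_g <- body_mass_g + offset_g

# TRUE and FALSE are spelled out; T and F are ordinary names that can be reassigned.
mean_with_na <- mean(c(body_mass_g, NA), na.rm = TRUE)
```

Inside a function call, `=` is still the right choice for naming arguments (as in `na.rm = TRUE`); the convention only discourages it for assigning objects.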
1.6 Question 6 [15 marks]
You are a research assistant who has just been given your first job. You are asked to analyse a dataset about patterns of extreme heat in the ocean and the possible role that ocean currents (specifically, eddies) might play in modulating the patterns of sea surface temperature extremes in space and time.
Being naive and relatively inexperienced, and misguided by your exaggerated sense of preparedness, as young people tend to be, you gladly accept the task and start by exploring the data. You notice that the dataset is quite large and you have no idea what is happening, what you are doing, why you are doing it, or what you are looking for. Ten minutes into the job you start to question your life choices. Your feeling of bewilderment is compounded by the fact that, when you examine the data (the output of the head() and tail() commands is shown below), the entries seem confusing.
```
> nrow(data)
[1] 53253434
> head(data)
           t     lon    lat      ex    ke
1 2013-01-01 121.875 34.625 -0.7141 2e-04
2 2013-01-01 121.875 34.625 -0.8027 2e-04
3 2013-01-02 121.875 34.625 -0.8916 2e-04
4 2013-01-02 121.875 34.625 -0.9751 2e-04
5 2013-01-03 121.875 34.625 -1.0589 3e-04
6 2013-01-03 121.875 34.625 -1.1406 3e-04
> tail(data)
                  t     lon    lat     ex      ke
53253429 2022-12-29 174.375 44.875 0.4742 -0.0049
53253430 2022-12-29 174.375 44.875 0.4856 -0.0049
53253431 2022-12-30 174.375 44.875 0.4969 -0.0050
53253432 2022-12-30 174.375 44.875 0.5169 -0.0050
53253433 2022-12-31 174.375 44.875 0.5367 -0.0051
53253434 2022-12-31 174.375 44.875 0.5465 -0.0051
```
You resign yourself to admitting that you do not understand much, but, rather than risk sounding like a fool when you go to your professor, you decide to do as much of the preparation as you can so that you at least have something to show for your time.
- What will you take back to your professor to show that you have prepared yourself as fully as possible? For example:
  - What is in your ability to understand about the study and the nature of the data?
  - What will you do for yourself to better understand the task at hand?
  - What do you understand about the data?
  - What will you do to aid your understanding of the data?
  - What will your next steps be going forward?
  - Etc. (Anything else you can think about doing to convince the professor you thought about the data?) [/10 marks]
- What will you need from your professor to help you understand the data and the task at hand so that you are well equipped to tackle the problem? [/5 marks]
Answer
- I am able to understand what the concept of ‘extreme heat’ is, and what ocean eddies are; all I need to do is find some papers and read broadly around these concepts, so that is where I will start.
- I can see from the columns that there appear to be three independent variables (`lon`, `lat`, `t`) and two dependent variables (`ex` and `ke`). I will need to understand what these variables are, and how they relate to each other. It is easy to see that `lon` and `lat` are the longitude and latitude of the data points, and that `t` is the date of the data point. I will need to understand what the `ex` and `ke` variables are, and how they relate to the `lon` and `lat` variables. Presumably `ex` and `ke` are the extreme heat and ocean eddies, respectively. I will confirm with the professor.
- Because I have `lon` and `lat`, I can make a map of the study area. By making a map of the study area for one or a few days in the dataset, I can get a sense of the spatial distribution of the data. I can also plot the `ex` and `ke` data to see what the data look like. Because the data cover the period 2013-2022, I know that I can create a map for each day (a time-series analysis might eventually be needed?), and that is probably where the analysis will take me later, once I have confirmed my thinking with the professor. If I am really proactive and want to seriously impress the professor, I will make an animation of the data to show how the patterns evolve over time. This will clearly show the processes operating there. A REALLY informed mind will even be able to go as far as understanding what the analysis should entail. Admittedly, this requires a deep subject matter understanding, which you might not possess at the moment, but which is nevertheless within your reach to attain without guidance.
- I can conclude that the data reveal some dynamical process (I infer ‘dynamical’ from the fact that we have time-series data, and time-series reveal dynamics).
- Knowing what the geographical region is from the map I created and what is happening there that might be of interest to the study, I can make some guesses about what the analysis will be.
- FYI, basic research would reveal the following (not for marks):
  - you would see that it is an ocean region south of South Africa;
  - once you know the region covered, you can read about the processes operating in the region that the data cover;
  - because the temperature spatially defines the Agulhas Current, you can infer that the study is about the Agulhas Current;
  - plotting `ke` will reveal eddies in the Agulhas Current;
  - you can read about the Agulhas Current and its eddies and think about how eddies might affect the temperature in the region; both of these are dynamical processes.
- I will need to understand what the data are telling me, and what the variables mean. I will need to understand what the `ex` and `ke` variables are, and how they relate to the `lon` and `lat` variables.
- Having discovered all these things simply by doing a basic first-stab analysis, I can prepare a report of my cursory findings and draw up a list of the things I know, together with suggested further avenues for exploration. I will take this to the professor to confirm my understanding and to get guidance on how to proceed.
- I will also add a list of the things I cannot know from the data, and what I need to know from the professor to proceed.
- There is also something strange happening with the data. It seems that there are duplicate data entries (two occurrences of each combination of `lat` × `lon` × `t`, resulting in duplicated values of `ke` for each spatio-temporal point and a pair of dissimilar values for `ex`). I will need to understand why this is the case. Clearly this is incorrect, and it points to pre-processing errors somewhere. I will have to ask the professor to give me access to all pre-processing scripts and the raw data to see if I can trace the error back to its source.
- If I were this professor, I would be immensely impressed by your proactive approach to the problem. You are showing that you are not just a passive learner, but that you are actively engaging with the data and the problem at hand. This is a very good sign of a good researcher in the making. I would seriously think about finding you a salary for permanent employment in my lab.
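The first exploratory checks described above (time span, spatial extent, and the suspected duplicates) might be sketched as follows. The real `data` object is not available here, so the example assumes a tiny stand-in, `toy`, built from the first four rows of the `head()` output; dplyr and the object names `dup_check` and `extent` are likewise assumptions made for illustration.

```r
library(dplyr)

# Stand-in for the full dataset: the first four rows printed by head(data).
toy <- data.frame(
  t   = as.Date(c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02")),
  lon = 121.875,
  lat = 34.625,
  ex  = c(-0.7141, -0.8027, -0.8916, -0.9751),
  ke  = 2e-04
)

# How many rows share each space-time point? More than one suggests duplication.
dup_check <- toy |>
  count(lon, lat, t, name = "n_rows") |>
  filter(n_rows > 1)

# Basic orientation: spatial extent and time span covered by the data.
extent <- toy |>
  summarise(
    lon_min = min(lon), lon_max = max(lon),
    lat_min = min(lat), lat_max = max(lat),
    t_min = min(t), t_max = max(t)
  )
```

Run on the full dataset, the same two summaries would give concrete numbers to take to the professor: how widespread the duplication is, and exactly which region and period the data cover.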
1.7 Question 7 [15 marks]
- Explain why one typically prefers working with CSV files over Excel files in R.
- What are the properties of a CSV file that make it more suitable for data analysis in R?
- What are the properties of an Excel file that make it less suitable for data analysis in R?
Answer
CSV (Comma-Separated Values) files are preferred over Excel files due to their simplicity, compatibility, and efficiency in handling data. CSV files are stored as plain text, making them easy to read and write across different software and platforms. They do not contain proprietary formatting, formulas, or metadata, which minimises the risk of unintended data transformations.
Excel files (.xls, .xlsx) are proprietary and designed for spreadsheet applications; they incorporate complex formatting, formulas, and visual elements that can interfere with data processing in R. Unlike CSV files, which can be read directly using base R functions like read.csv(), Excel files require additional packages such as readxl for data extraction. Excel’s tendency to automatically modify data types, such as converting text to dates or numbers, is annoying and introduces errors, making CSV a more reliable format for reproducible data analysis.
- CSV files store data in a simple text-based format that ensures easy readability by both humans and computers.
- Each row represents a single record, and fields are separated by commas (or another delimiter) to ensure a consistent tabular format.
- CSV files can be opened and edited using a wide range of software, including text editors, spreadsheets (e.g., Excel, Google Sheets), and statistical tools (e.g., R, Python).
- R provides functions such as `read.csv()` (base R) and `read_csv()` (tidyverse) for quickly reading CSV files; the base R function needs no additional packages.
- Unlike Excel files, CSV files do not contain embedded formulas, formatting, figures, or macros, which reduces the risk of unintended changes to the data.
- Being plain text, CSV files are typically smaller in size compared to Excel files.
- Excel files are stored in a format (.xls, .xlsx) that is specific to Microsoft Excel; special packages (e.g., readxl, openxlsx) are needed to read them in R.
- Excel often reformats data automatically, for example by changing numeric values to dates or rounding decimal values. This can lead to errors in data analysis.
- Excel files support formulas, pivot tables, conditional formatting, and visual elements that may not be relevant for raw data processing in R.
- Users can store multiple sheets within a single Excel file and this makes it trickier to maintain a standardised structure when importing data into R.
- Excel files are not made for handling large datasets. Excel becomes very slow and is prone to crashing or memory limitations when dealing with ‘big’ data.
- Excel’s binary files do not work well with version control systems like Git, because changes to them cannot be shown as readable line-by-line differences.
- Excel files are complex and more prone to accidental modifications or corruption.
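The plain-text nature of a CSV file can be seen in a short round trip using only base R. The sketch below writes a small, made-up data frame to a temporary file and reads it back; the object names are illustrative.

```r
# A tiny made-up table, written to a temporary CSV file.
penguins_small <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(37.8, 50.4)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(penguins_small, csv_path, row.names = FALSE)

# The file is plain text: a header line, then one line per record,
# with fields separated by commas.
raw_lines <- readLines(csv_path)

# Reading it back needs nothing beyond base R.
from_csv <- read.csv(csv_path)

# Reading an Excel file needs an add-on package and usually a sheet choice, e.g.
# readxl::read_excel("penguins.xlsx", sheet = 1)  # hypothetical file, not run
```

Because `raw_lines` is ordinary text, the same file can be inspected in any text editor or compared line by line between versions.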
1.8 Question 8 [20 marks]
Explain each of the following in the context of their use in R. For each, provide an example of how you would construct them in R:
- A vector
- A matrix
- A dataframe
- A list
Answer
- A vector in R is the simplest and most fundamental data structure. It is a one-dimensional collection of elements, all of the same type (e.g., numeric, character, or logical). Vectors can be created using the `c()` function, for example `c(1, 2, 3)`.
- A matrix is a two-dimensional data structure where all elements must be of the same type. It is essentially an extension of a vector with a specified number of rows and columns.
- A dataframe is a two-dimensional data structure that can contain different data types in different columns (variables). It is the most commonly used data structure for data analysis in R and resembles a table with rows and columns.
- A list is a flexible data structure that can store elements of different types, including vectors, matrices, dataframes, and even other lists. Unlike vectors and matrices, which require uniform data types, lists can contain heterogeneous elements.
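One way to construct each of the four structures is sketched below, using base R only. The object names and values are made up for illustration.

```r
# Vector: one dimension, all elements of the same type.
bill_length_mm <- c(37.8, 37.9, 40.1)

# Matrix: two dimensions, one type; filled column by column by default.
count_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns may differ in type, but all have the same length.
penguins_df <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  bill_length_mm = c(37.8, 45.2, 50.4),
  is_male = c(TRUE, FALSE, TRUE)
)

# List: elements may be of any type and length, including other structures.
results <- list(
  measurements = bill_length_mm,
  counts = count_matrix,
  table = penguins_df,
  note = "toy example"
)
```

Elements of the list can then be retrieved by name, for instance `results$table$species` or `results[["counts"]][2, 1]`.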
TOTAL MARKS: 100
– THE END –
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {BCB744: {Intro} {R} {Theory} {Test}},
date = {2026-02-13},
url = {http://tangledbank.netlify.app/BCB744/assessments/BCB744_Intro_R_Theory_Test_2026.html},
langid = {en}
}