BCB744: Introduction to R, and Biostatistics

Published

February 3, 2025

PhD Comics on data expectations.

1 Venue, Timetable, and Content

The venue for the module is the 5th Floor Computer Lab, BCB Department, University of the Western Cape. The module runs in the first two weeks of Term 1 of 2026 (2-13 February), and again during the mid-semester break in Semester 1 (30 March to 2 April). Lectures will run from 09:00 to 16:30 on the days indicated in the table below.

The module coordinator and lecturer is Prof AJ Smit (Room 4.103), and the teaching assistant for the module is Mr. Jesse Phillips (4115146@myuwc.ac.za). For queries about the Honours programme in general, please consult Dr. Patrick O’Farrell (Room 4.111).

  • Introduction to R: From 2 to 13 February 2026.
  • Data Collection Field Trip: From 13 to 17 March 2026.
  • Biostatistics: From 30 March to 2 April 2026 (during the mid-semester break of Semester 1).
INTRODUCTION TO R

| Wk | Lecture | Topic | Class Date | Tasks/Assess. | Task/Assess. due |
|---|---|---|---|---|---|
| Wk6 | L1 (PM) | About the Module | 2 Feb 26 | Task A | 4 Feb 26 |
| | | 1. R and RStudio | | | |
| | | 2. Working With Data and Code | | | |
| | | 3. R Markdown and Quarto | | | |
| | L2 (AM) | 4. Data Classes and Structures in R | 3 Feb 26 | Task A | 4 Feb 26 |
| | | 5. R Workflows | | | |
| | L3 | 6. Graphics With ggplot2 | 4 Feb 26 | Task B | 6 Feb 26 |
| | | 7. Faceting Figures | | | |
| | | 8. Brewing Colours | | | |
| | L4 | 9. Mapping With ggplot2 | 6 Feb 26 | Task C | 9 Feb 26 |
| | | 10. Mapping With Style | | | |
| | | 11. Mapping With Natural Earth and the sf Package | | | |
| | Self | 12. The Fiji Earthquake Data | | Bonus Task | 20 Feb 26 |
| Wk7 | L5 | 13. Tidy Data | 9 Feb 26 | Task D | 11 Feb 26 |
| | | 14. Tidier Data | | | |
| | | 15. Tidiest Data | | | |
| | L6 | 16. Synthesis | 11 Feb 26 | | |
| | | Test 1: Intro R Theory Test | 13 Feb 26 | | 13 Feb 26 |
| | | Test 1: Intro R Prac Test | TBA | | TBA |

BIOSTATISTICS

| Wk | Lecture | Topic | Class Date | Tasks/Assess. | Task/Assess. due |
|---|---|---|---|---|---|
| Wk14 | L1 | 1. Introduction to Statistics | 30 Mar 26 | | |
| | | 2. Exploring With Summaries and Descriptions | | Task E | 31 Mar 26 |
| | | 3. Exploring With Figures | | Task E | 31 Mar 26 |
| | L2 | 4. Data Distributions | 31 Mar 26 | | |
| | | 5. Statistical Inference and Hypothesis Testing | | | |
| | | 6. Assumptions | | | |
| | | 7. Inferences About One or Two Populations | | Task F | 1 Apr 26 |
| | L3 | 8. Analysis of Variance (ANOVA) | 1 Apr 26 | Task G | 2 Apr 26 |
| | | 9. Simple Linear Regressions | | Task H | 3 Apr 26 |
| | | 10. Correlations | | Task H | 3 Apr 26 |
| | L4 | 11. A Guide to Selecting the Right Parametric Test | 2 Apr 26 | | |
| | | 12. Non-Parametric Statistics | | | |
| | | 13. Confidence Intervals | | | |
| | | 14. Data Transformations | | | |
| | | Test 2: Biostatistics Theory Test | TBA | | TBA |
| | | Test 2: Biostatistics Prac Test | TBA | | TBA |
| | | Exam: Biostatistics Prac Exam (Intro R + Biostatistics) | TBA | | TBA |

2 Description and Content

Yes, the comma in this page’s title is correct: “BCB744: Introduction to R, and Biostatistics.” The module provides an introduction to the R software language. I will also teach biostatistics.

This is a core module in your Honours programme. You will learn to use R for data analysis, visualisation, and statistical inference. You will also learn fundamental biostatistics concepts, such as hypothesis testing, probabilities, confidence intervals, regression analysis, Analysis of Variance, and other staples of biostatistics. I will use real-world datasets from the biological, ecological, and environmental fields that you can use to practice applying your R and biostatistics skills.

The approach taken in this Workshop is not dissimilar from a course in Data Science. However, in this Workshop, we will not do data science; we will use R to actually do science. There is a difference! Any scientist who can use R is also well equipped to be a data scientist, and some people who have completed this module do just that. The difference between the two ideas, philosophies, and careers is explained in the box immediately below.

Note: Real Scientists and Data ‘Scientists’

I am deliberate in my use of the phrase not real scientists. This is a claim about epistemic practice.

Data “science” is defined by the manipulation of data and by technical fluency with statistical or computational tools. Science, as defined by how it is practised, is a framework built around a structured relationship between theory, question formation, experimental or observational design, and inference. A scientist begins with a problem framed within an existing body of knowledge and articulates expectations that could, in principle, fail. They then construct a procedure whose purpose may involve pattern discovery and downstream analyses and which, importantly, adds layers of explanation for the phenomenon under scrutiny.

Much of what is commonly called data science (primarily by data “scientists” themselves) goes in the opposite direction of what real scientists do. Large, often poorly characterised datasets are subjected to algorithmic exploration (more often using unsupervised methods without human guidance) in search of statistically admissible regularities, after which interpretation is retrofitted. This is not hypothesis testing in any meaningful sense; it is post hoc rationalisation constrained primarily by optimisation criteria rather than by theory. The fact that such procedures can generate predictions does not grant them scientific status. Prediction without explanation is engineering, at best, with statistical underpinnings.

A data science practitioner who can move seamlessly from marketing data to genomic counts to social media engagement metrics does so precisely because the underlying activity is detached from any requirement to understand the causal structure of the system under study. Scientific work does not grant this mobility. One cannot meaningfully investigate trophic dynamics, atmospheric circulation, or physiological regulation without years of undergraduate, domain-specific instruction in the theory and practice of the field. The data science methods are transferable but understanding is not.

The claim of “data-driven discovery” also does not rescue the position. Scientific discovery necessitates falsifiability and the humility to be wrong in a way that matters theoretically. Data analysis tools blindly applied do not invite this risk. When a model underperforms, it is replaced. When a scientific hypothesis fails, something about the world has resisted our attempt to describe it.

This distinction is why traditional scientific training places much weight on experimental design, controls, and prior expectation. These are supported by a rich philosophical tradition that guides the practitioner in confronting ideas with reality. Data science may be technically demanding, commercially valuable, or computationally impressive, but without the scientific philosophy it is not science.

Calling data scientists “scientists” therefore reflects sloppy language, a slippage that mistakes proximity to data for engagement with explanation, and a willingness to dilute epistemic categories until they describe little more than technical competence with fashionable tools. If the word scientist is to retain its true, intended meaning, it must refer to a mode of inquiry defined by theory-driven questions and disciplined inference and not just by the capacity to process data (often at scale).

The Intro R Workshop focuses on the functionality offered by the tidyverse suite of packages. I designed the Workshop to introduce you to a powerful set of tools for data manipulation, exploration, and visualisation. The tidyverse is a collection of R packages that work together to provide a cohesive set of functions for manipulating data. This course will cover the most popular packages in the tidyverse, including tidyr for data reshaping, dplyr for data ‘wrangling’, and ggplot2 for data visualisation. You will learn how to clean, transform, and visualise data, as well as how to use these tools to build reproducible, informative data analysis pipelines. With a focus on practical application and hands-on exercises, you will gain the skills and knowledge needed to effectively use the tidyverse in your own data analysis projects.
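As a rough illustration of how these packages fit together, the sketch below reshapes a small table with tidyr and summarises it with dplyr. It assumes both packages are installed; the table `temps_wide` and its values are made up for this example and are not part of the module's data.

```r
# Assumes the tidyr and dplyr packages are installed.
library(dplyr)
library(tidyr)

# A small, made-up wide table: one row per site, one column per month
# (hypothetical values, for illustration only).
temps_wide <- tibble(
  site = c("A", "B", "C"),
  jan = c(18.2, 17.5, 16.9),
  feb = c(18.9, 17.8, 17.1)
)

# Reshape to long (tidy) format: one row per site-month observation.
temps_long <- temps_wide %>%
  pivot_longer(cols = c(jan, feb), names_to = "month", values_to = "temp")

# Summarise the mean temperature per month.
temps_summary <- temps_long %>%
  group_by(month) %>%
  summarise(mean_temp = mean(temp), .groups = "drop")

temps_summary
```

The long table has one observation per row, which is the shape that dplyr and ggplot2 functions generally expect.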

In biological and ecological sciences, statistical methods play a crucial role in analysing and interpreting data. Some of the basic statistical methods used include:

  • Descriptive statistics: These methods are used to summarise and describe the basic features of a dataset, such as the mean, median, and standard deviation.

  • Inferential statistics: These allow you, the scientist, to make predictions and inferences about a population based on a sample of data. Common inferential statistical techniques include t-tests, ANOVA, and regression analysis.

  • Non-parametric statistics: Non-parametric methods are called for when the data do not meet the assumptions of parametric statistics. Examples of non-parametric techniques include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
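The three groups of methods listed above can be sketched in base R as follows. This is only an illustration using the built-in ChickWeight dataset; the choice of day 21 and of Diets 1 and 3 is arbitrary, and the sketch does not check the test assumptions that the module covers later.

```r
# ChickWeight ships with base R (package 'datasets').
# Keep only the final weights (day 21) as a plain data frame.
final_wt <- as.data.frame(subset(ChickWeight, Time == 21))

# Descriptive statistics: mean, median, and standard deviation.
desc <- c(
  mean = mean(final_wt$weight),
  median = median(final_wt$weight),
  sd = sd(final_wt$weight)
)
desc

# Inferential statistics: compare two diets with a t-test ...
two_diets <- droplevels(subset(final_wt, Diet %in% c("1", "3")))
t_res <- t.test(weight ~ Diet, data = two_diets)
t_res

# ... and compare all four diets with a one-way ANOVA.
aov_res <- aov(weight ~ Diet, data = final_wt)
summary(aov_res)

# Non-parametric counterparts of the two tests above.
w_res <- wilcox.test(weight ~ Diet, data = two_diets, exact = FALSE)
k_res <- kruskal.test(weight ~ Diet, data = final_wt)
w_res
k_res
```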

3 Skills and Graduate Attributes

By the end of this module, you will be able to:

  • Understand and use R within the RStudio IDE
  • Know and understand the tidyverse suite of functions and approach to data analysis and graphics
  • Understand the principles underlying tidy data
  • Understand the types of data and data distributions that biologists and ecologists will frequently encounter
  • Understand and be able to execute the most frequently used inferential statistical tests
  • Use the R software and associated packages to undertake these analyses
  • Interpret the outcomes of these analyses and use them to make probabilistic inferences about the scientific questions being asked
  • Communicate the findings in written and oral form

The graduate attributes resulting from completion of this module align with the expectations of the workplace across diverse organisations and institutions where graduates typically find employment.

4 Assessment Policy

Continuous Assessment (CA) and a Final Assessment will provide a Final Mark for the module. These modes of assessment meet our needs as far as formative and summative assessments are concerned. The weighting of the CA and the Final Assessment is 0.6 and 0.4, respectively. Except for the Biostatistics Theory Test, all assessments are open book, so consult your code and reading material if and when you need to.

| Assessment Component | Weight Contribution (%) |
|---|---|
| CONTINUOUS ASSESSMENT (0.6) | |
| Introduction to R | |
| Progress Portfolio | 10 |
| Self-Assessment Tasks A–D (Random penalty) | (max. -10) |
| Intro R Test (0.3 × theory + 0.7 × prac) | 35 |
| Biostatistics | |
| Presentations | 10 |
| Progress Portfolio | 10 |
| Self-Assessment Tasks E–H (Random penalty) | (max. -10) |
| Biostatistics Test (0.3 × theory + 0.7 × prac) | 35 |
| Total | 100 |
| FINAL ASSESSMENT (0.4) | |
| Biostatistics Prac Exam (Intro R + Biostatistics) | 100 |
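To see how the weights in the table combine, here is a minimal sketch of the arithmetic. The function names and the example marks are hypothetical, all marks are assumed to be out of 100, and the random penalty for the self-assessment tasks is deliberately left out because the table does not specify how it is applied.

```r
# Weighted test mark: 0.3 x theory + 0.7 x prac.
test_mark <- function(theory, prac) {
  0.3 * theory + 0.7 * prac
}

# Continuous Assessment (CA) mark, using the contributions in the table.
# The random self-assessment penalty is ignored in this sketch.
ca_mark <- function(portfolio_r, test_r, presentations,
                    portfolio_bio, test_bio) {
  0.10 * portfolio_r + 0.35 * test_r +
    0.10 * presentations + 0.10 * portfolio_bio + 0.35 * test_bio
}

# Final Mark: 0.6 x CA + 0.4 x Final Assessment (the Exam).
final_mark <- function(ca, exam) {
  0.6 * ca + 0.4 * exam
}

# Hypothetical marks for one student.
intro_r_test <- test_mark(theory = 60, prac = 70)
biostats_test <- test_mark(theory = 50, prac = 80)
ca <- ca_mark(
  portfolio_r = 80, test_r = intro_r_test, presentations = 70,
  portfolio_bio = 75, test_bio = biostats_test
)
final_mark(ca = ca, exam = 65)
```

Because the practical tests carry 70% of each test mark, a strong practical performance outweighs a weaker theory result.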

Take care to submit the tests and exams as instructed: pay attention to the naming conventions and to the format of the files submitted. Typically this will be a Quarto document (.qmd) together with its rendered output (I prefer .html).

Random quizzes will not form part of the CA for BCB744.

Starting 3 February 2026, you will submit a thoroughly annotated .html file, produced from a Quarto document, that outlines each day’s teaching material as you followed along in class. Since it is made within Quarto, you will have to include code chunks, their output, and some narrative text that describes what each portion of your code does. The document must be:

  • One continuous file that you add to each day (use a clear heading with the date and topic for every new day).
  • Rendered to HTML and submitted daily starting 3 February 2026, on the same day of the lecture your portfolio material covers.
  • Complete and readable, with the code you ran, the outputs it produced, and a few sentences explaining what the code does. Include any notes that you made for your future self, which will help you study the work for the assessments you will encounter. Any text added must be in your own words, and explained in a way that you understand (in good English, without grammatical and spelling errors).
  • Neatly structured, with short sections, headings, and clear figure/table outputs (no screenshots). I will give significant weight to the visual impression you create, so take pride in your work.

The presentations are a critical part of the CA. They are designed to help you develop your communication skills around topics tangential to the broad field of knowledge generation. The presentations will cover topics such as the nature of knowledge and belief, the nature of science, the scientific method, the limits to science, and other broader societal topics.

BCB744 (Introduction to R and Biostatistics) relies on regular, honest self-reflection about your grasp of each day’s lecture content. After every lecture, complete the Daily Self-Assessment Tasks to gauge your understanding; answers will be provided the following day, before new content is introduced. Rate each task on a personal scale from 1 (no real comprehension) to 10 (complete mastery). These self-assessment marks will be kept on record and checked randomly, and we will discourage students from undertaking the Intro R Test and the Biostatistics Test if their self-assessment scores are consistently low.

If you realise you are struggling, seek assistance from the lecturer or teaching assistant early (ideally on the day). Consistent, candid self-assessment strongly correlates with later performance in the Intro R Test, the Biostatistics Test, and the combined Exam. The goal is to align your learning strategies with course expectations and build a foundation for success.

For the daily self-assessment tasks to be effective, you must work alone on all of them.

At the conclusion of Intro R and Biostatistics, you will take the more rigorous Intro R Test and Biostatistics Test. As indicated in the table above, these assessments carry significant weight. The tests will be conducted over several days, and you may complete them either at home or on campus. They constitute a key component of Continuous Assessment (CA) and are designed to prepare you for the final exam.

Each test consists of two parts:

  1. Theory Test (30%) — This is a written, closed-book assessment where you will be tested on theoretical concepts. The only resource available during this test is the R help system.
  2. Practical Test (70%) — In this open-book coding assessment, you will apply your theoretical knowledge to real data problems. While you may reference online materials (including ChatGPT), collaboration with peers is strictly prohibited.

The practical component of the tests will be graded as follows:

  • Content (20%):
    • Questions answered in order
    • A written explanation of approach included for each question
    • Appropriate formatting of text, for example, fonts not larger than necessary, headings used properly, etc. Be sensible, tasteful.
  • Code formatting, structure, and correctness (50%):
    • Use Tidyverse code
    • No more than ~80 characters of code per line (pay particular attention to the comments)
    • Application of R code conventions, e.g. spaces around <-, after #, after ,, etc.
    • New line for each dplyr function (lines end in %>%) and each ggplot layer (lines end in +)
    • Proper indentation of pipes and ggplot() layers
    • All chunks labelled without spaces
    • No unwanted / commented out code left behind in the document
  • Figures (30%):
    • Sensible use of themes / colours
    • Publication quality
    • Informative and complete titles, axes labels, legends, etc.
    • No redundant features or aesthetics
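The code conventions in the rubric above might look like this in practice. This is a sketch only, assuming the dplyr and ggplot2 packages are installed; the object names are illustrative, and the built-in ChickWeight data stand in for the test data.

```r
# Assumes the dplyr and ggplot2 packages are installed.
library(dplyr)
library(ggplot2)

# Mean chick weight per diet and day: one dplyr function per line,
# each line ending in %>% and indented under the first.
chick_means <- as.data.frame(ChickWeight) %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")

# One ggplot layer per line, each line ending in +, with complete
# labels and no line longer than ~80 characters.
chick_plot <- ggplot(
  data = chick_means,
  mapping = aes(x = Time, y = mean_weight, colour = Diet)
) +
  geom_line() +
  geom_point() +
  labs(
    title = "Mean chick weight by diet",
    x = "Age (days)",
    y = "Mean body weight (g)",
    colour = "Diet"
  ) +
  theme_minimal()

chick_plot
```

Note the spaces around `<-` and `=`, after each comma, and after each `#`.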

The Exam is the final assessment. As such, it will test your skills broadly across both Intro R and Biostatistics. The Exam may be up to five days in duration. It will involve the analysis of real-world data. Some of the questions might expect that you write 1) statements of aims, objectives, and hypotheses; 2) the full, detailed methods followed by analyses, together with all code; 3) full reporting of results in a manner suited for peer-reviewed publications; 4) graphical support highlighting the patterns observed (again with the code); and 5) a discussion, if and when required. The weighting of marks to these various sections is:

  1. Aims, objectives, and hypotheses: 5%
  2. Methods and analyses: 45%
  3. Results: 20%
  4. Graphs: 15%
  5. Discussion: 15%

Other questions might be shorter in nature, designed to specifically test important aspects of BCB744. Such questions might be worth anything from 10 to 50 marks.

The Exam is also open book. Review the questions carefully, answer them at home, and submit by the deadline.

  • The Progress Portfolios must be submitted on the day of the lectures they cover.
  • The Tasks must be submitted by the date specified in the timetable.

A statement such as the one below accompanies every assignment — pay attention, as failing to observe this instruction may result in a loss of marks (i.e. if an assignment remains ungraded because the owner of the material cannot be identified):

Submit the output of your Quarto script, wherein you provide answers to the task questions, by no later than 08:30 the following day (or the Monday in cases when assignments were given on Fridays). Label the output file as follows (e.g.): BCB744_Smit_Task_A.html.
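If you want to avoid typing mistakes in the file name, a small helper can build it for you. This is a sketch under the assumption that the pattern in the example above (module, surname, task letter) is the required one; `task_filename()` is a hypothetical function, not something supplied with the module.

```r
# Build a submission file name that follows the example pattern above.
# 'surname' and 'task' are supplied by the student.
task_filename <- function(surname, task, module = "BCB744") {
  sprintf("%s_%s_Task_%s.html", module, surname, task)
}

task_filename("Smit", "A")
```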

Late Submissions

Late assignments will be penalised 10% per day and will not be accepted more than 48 hours late, unless evidence such as a doctor’s note, a death certificate, or another documented emergency can be provided. If you know in advance that a submission will be late, please discuss this with me and seek prior approval. This policy is based on the idea that, in order to learn how to translate your human thoughts into computer language (coding), you should be coding several times each week, ideally daily. Time has been allocated in class for working on assignments, and students are expected to continue to work on the assignments outside of class. Successfully completing (and passing) this module requires that you finish assignments based on what we have covered in class by the following class period. Work diligently from the outset so that, even if something unexpected happens at the last minute, you should already be close to done. This approach also allows rapid feedback to be provided to you, which is only possible when assignments are submitted punctually and returned quickly.
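The arithmetic of the late penalty can be sketched as below. The policy text does not say how partial days are counted or whether the 10% is taken off the earned mark, so this sketch assumes that every started day counts as a full day and that the penalty is a proportion of the earned mark; `late_mark()` is a hypothetical helper.

```r
# Penalised mark for a late submission without an approved excuse.
# Assumptions: each started day late counts as a full day, and the
# penalty is 10% of the earned mark per day.
late_mark <- function(mark, hours_late) {
  if (hours_late <= 0) {
    return(mark)        # on time: no penalty
  }
  if (hours_late > 48) {
    return(NA_real_)    # more than 48 hours late: not accepted
  }
  days_late <- ceiling(hours_late / 24)
  mark * (1 - 0.10 * days_late)
}

late_mark(80, hours_late = 0)
late_mark(80, hours_late = 5)
late_mark(80, hours_late = 30)
late_mark(80, hours_late = 50)
```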

5 Data Used

All the data required for BCB744 may be downloaded here. After you have downloaded the archived (.zip) data, unzip it in a folder named data placed at the root of your R project. This will ensure that all the data are easily accessible to you.
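The steps above can also be done from within R. This is a minimal sketch that assumes your working directory is the root of your R project; the archive name `BCB744_data.zip` is a placeholder, so replace it with the name of the file you actually downloaded.

```r
# Hypothetical name of the downloaded archive; replace with the real one.
zip_path <- "BCB744_data.zip"

# Create the 'data' folder at the project root if it does not exist yet.
if (!dir.exists("data")) {
  dir.create("data")
}

# Unzip the archive into 'data' (skipped if the archive is not found).
if (file.exists(zip_path)) {
  unzip(zip_path, exdir = "data")
}

# List what is now available, with paths relative to the 'data' folder.
data_files <- list.files("data", recursive = TRUE)
data_files
```

Files can then be read with relative paths such as `file.path("data", "some_file.csv")`, where the file name is again only an example.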

R also gives you access to many built-in datasets that are useful for practising your R skills. To find out which datasets are available to you on your system, execute the first command below. Help files for each of the datasets are also available:

# list the datasets available on your system:
data()

# load one of them into your workspace, for example:
data("ChickWeight")

# find help, for example:
?datasets::ChickWeight

It is important to use these (or any) datasets to practice your R skills. Actively engaging with my web pages and practising on the included datasets will make the difference between a 60% average mark for the module and a mark in excess of 80%.

6 Prerequisites

You should have a moderate numerical literacy, but prior programming experience is not required. In all sciences, practical problem solving skills and tenacity in the face of challenges are crucial for success. Scientific disciplines constantly evolve and present new and complex problems that require creative and innovative solutions. You will have to demonstrate agile and adaptive approaches to solving challenges and you must have the ability to break down complex problems into smaller parts and approach them systematically. You must also be able to identify and overcome roadblocks, and be persistent in your efforts to find a solution. These attributes will allow you to be effective in this module.

7 Method of Instruction

The workshop is designed to be as interactive as possible, so while you are working on exercises the tutor and I will circulate among you and engage with you to help you understand any material and the associated code you are uncomfortable with. Often this will result in discussions of novel applications and alternative approaches to the data analysis challenges you are required to solve. More challenging concepts might emerge during the Tasks and Assignments (typically these will be submitted the following day) and any such challenges will be dealt with in class prior to learning new concepts.

Although the module ultimately supports the application of biologically-oriented statistics, a large part of it is also about programming. It is up to you to take your coding skills to the next level and move beyond what I teach in class. Coding is a bit like learning a language and, as such, programming is a skill that is best learned by doing.

8 Learning

Note: Also Read: How to Learn

Please refer to my advice about how to learn.

Collaborative learning provides an opportunity to work together and learn from each other. It develops communication, teamwork, and leadership, and it can deepen your understanding of the subject matter. Discuss the BCB744 Workshop activities with your peers as you work on them. Use the WhatsApp group set up for the module for discussion purposes (I might assist via this medium if necessary if your questions/comments have relevance to the whole class). A better option is to use GitHub Issues. Ask questions, answer questions, and share ideas liberally. Please identify your work partners by name on all assignments (if you decide to work in pairs).

At the same time, you are individually responsible for the submitted work. Collaboration means discussing ideas, approaches, and interpretations; it does not include sharing or reusing code, text, or outputs. Anything you submit must be your own, and any external material (including AI-generated code or web-sourced snippets) must be clearly cited. Plagiarism is a serious offence and will be dealt with decisively. Consequences of cheating are severe: they range from a 0% for the assignment or exam up to dismissal from the course for a second offence.

A huge volume of code is available on the web and it can be adapted to solve your own problems. You may make use of any online resources (e.g., StackOverflow, a widely used source of discussion about R code), but you MUST clearly indicate (cite) that your solution relies on found code, regardless of how much you have modified it to your own needs. Reused code that was found via a web search and is not explicitly cited is plagiarism and will be treated as such. On assignments you may not directly share code with your peers in this workshop.

The 2025 BSc (Hons) cohort will be the first to experience the use of AI tools in the BCB744 module. The use of AI tools is a new development and it is important that you are exposed to these tools. The use of AI tools will be limited to the use of the OpenAI ChatGPT tool, which may be used to generate ‘proto-code’ that will assist you in becoming familiar with the R language. We will explore ideas together, and the mark allocation to tasks and assignments will be adjusted accordingly.

9 Software

In this course you will rely entirely on R running within the RStudio IDE. The use of R is covered extensively in the BCB744 module where the installation process is discussed.

Additionally, the very basics (about R, RStudio, packages, their installation, etc.) can also be found on the ModernDive website. A slightly longer, more detailed account of the installation process and the very basics is provided on the DataCamp platform.
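Once R and RStudio are installed, a quick way to check whether the packages you need are present is sketched below. The helper `missing_packages()` is hypothetical, and the package list is only an example based on the tidyverse focus of this module, not an official list. The installation step runs only in an interactive session and needs an internet connection.

```r
# Return the names of the packages in 'pkgs' that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs, requireNamespace, logical(1),
    quietly = TRUE, USE.NAMES = FALSE
  )
  pkgs[!installed]
}

# Example list (an assumption, not an official module requirement).
to_install <- missing_packages(c("tidyverse"))

# Install only what is missing, and only when working interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```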

ModernDive also provides a nice overview of using R for data science.

For more in-depth coverage of the R language, refer to R Master Hadley Wickham’s pages. There you will find everything you need to know in a well thought through presentation. Thoroughly working through this material, page by page, will quickly make you an R Master yourself (well, almost).

10 Computers

You are encouraged to provide your own laptops and to install the necessary software before the module starts. Limited support can be provided if required, but in the end, the onus is on you to understand how your computer works (from the filesystem through to dealing with software installation issues). There are also computers with R, RStudio (and the necessary add-on libraries) available in the 5th floor lab in the BCB Department.

11 Attendance

This workshop-based, hands-on course can only deliver acceptable outcomes if you attend all classes. The schedule is set and cannot be changed. Sometimes an occasional absence cannot be avoided. Please be courteous and notify me or the tutor in advance of any absence. If you work with a partner in class, notify them too. Keep up with the reading assignments while you are away and we will all work with you to get you back up to speed on what you missed. If you do miss a class, however, the assignments must still be submitted on time (also see Late Submissions).

Since you may decide to work in collaboration with a peer on tasks and assignments, please keep this person informed at all times in case some emergency makes you unavailable for a period of time. Someone might depend on your input and contributions, so do not leave them in the lurch, unable to complete a task in your absence.

12 Support

It is expected that some tricky aspects of the module will take time to master, and the best way to master problematic material is to practice, practice some more, and then to ask questions. Trying for 10 minutes and then giving up is rarely enough. I will be more sympathetic to your cause if you can demonstrate sustained effort before asking me. When you ask questions about a challenge, explain the approaches you tried and how they failed. I will not help you if you have not tried to help yourself first (maybe with advice from friends). There will be time in class to do this, typically before we embark on a new topic. You are also encouraged to bring up related questions that arise in your own B.Sc. (Hons.) research project.

Should you require more time with me, find out when I am ‘free’ and set an appointment by sending me a calendar invitation. I am happy to have a personal meeting with you via Zoom, but I prefer face-to-face in my office.

Guidelines for asking questions:

  • First search existing issues (open or closed) for answers. If the question has already been answered, you are done! If there is an open issue, feel free to contribute to it. Or feel free to open a closed issue if you believe the answer is not satisfactory.
  • Give your issue an informative title.
    • Good: “Error: could not find function "ggplot"”
    • Bad: “My code does not work!” Note that you can edit an issue’s title after it has been posted.
  • Format your questions nicely using markdown and code formatting. Preview your issue prior to posting.
  • As I explained above, your peers and I will be more sympathetic to your cause if you can show all the things you have tried yourself to fix the issue first.
  • Include code and example data so that the person trying to help you has something to work with (ideally something that reproduces the error).
  • Where appropriate, provide links to specific files, or even lines within them, in the body of your issue. This will help your peers understand your question. Note that only the teaching team will have access to private repos.
  • (Optional) Tag someone or some group of people. Start by typing their GitHub username prefixed with the @ symbol. Of course, this supposes that each of you has a GitHub account and username.
  • Hit Submit new issue when you are ready to post.
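For the guideline about including code and example data, a minimal reproducible example could be put together along these lines. This is a sketch only; the data frame `example_df` and its values are made up, and the "problem" shown (a mean that returns NA) is just a stand-in for whatever you are actually stuck on.

```r
# 1. Share a small piece of your data in copy-and-paste form.
#    dput() prints R code that recreates the object.
dput(head(as.data.frame(ChickWeight), 3))

# 2. Or build a tiny data frame that still triggers the problem
#    (made-up values, for illustration only).
example_df <- data.frame(
  site = c("A", "B", "C"),
  temp = c(18.2, NA, 16.9)
)

# 3. Show the exact code that misbehaves, and say what you expected.
mean(example_df$temp)                # returns NA: not what I expected
mean(example_df$temp, na.rm = TRUE)  # the result I was hoping for

# 4. Report your R version and the packages you have loaded.
sessionInfo()
```

Paste the code and its output into the issue inside a code block so that others can run it directly.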

Footnotes

  1. A maximum of 10% may be deducted from your presentation marks should you be found to be dishonest in your self-assessments.

Citation

BibTeX citation:
@online{smit2025,
  author = {Smit, A. J.},
  title = {BCB744: {Introduction} to {R,} and {Biostatistics}},
  date = {2025-02-03},
  url = {http://tangledbank.netlify.app/BCB744/BCB744_index.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2025) BCB744: Introduction to R, and Biostatistics. http://tangledbank.netlify.app/BCB744/BCB744_index.html.