BCB744: Introduction to R, & Biostatistics
Venue, Timetable, and Content
The venue for the module is the 5th Floor Computer Lab, BCB Department, University of the Western Cape. The module will run from 09:00 to 16:30 on the days indicated in the table below.
The module coordinator and lecturer is Prof AJ Smit (Room 4.103), and the teaching assistant for the module is Chané Claassen (4142581@myuwc.ac.za). For queries about the Honours programme in general, please consult Prof Bryan Maritz (Room 4.105).
- Intro to R: From 3 to 7 February 2025.
- Biostatistics: From 31 March to 4 April 2025 (during the mid-semester break of Semester 1).
Yes, the comma in this page’s title is correct: “BCB744: Introduction to R, and Biostatistics.” The module provides an introduction to the R software and language. I will also teach biostatistics.
This is a core module in your Honours programme. You will learn to use R for data analysis, visualisation, and statistical inference. You will also learn fundamental biostatistics concepts, such as hypothesis testing, probabilities, confidence intervals, regression analysis, Analysis of Variance, and other staples of biostatistics. I will use real-world datasets from the biological, ecological, and environmental fields that you can use to practice applying your R and biostatistics skills.
The approach taken in this Workshop is not dissimilar to that of a course in Data Science. However, in this Workshop, we won’t do data science, but we will use R to actually do science. There is a difference! Any scientist who can use R is also ideally equipped to be a data scientist, and some people who have completed this module actually do just that. The difference between the two ideas, philosophies, and careers is explained in the box immediately below.
A Scientist able to apply their intermediate to advanced R skills is by default also a ‘Data Scientist’. The opposite is generally not true: Data Scientists are not real Scientists—especially after only having completed ‘traditional’ courses in data science.
Science refers to the application of the scientific method of conducting research, where hypotheses are proposed, experiments are designed and conducted to test these hypotheses, and data are collected and analysed to draw conclusions. The aim of Science is to generate new knowledge and understanding of the natural world. A Scientist will typically be equipped to work through all of these steps.
Data Science, on the other hand, involves the use of computational and statistical tools to extract knowledge and insights from data. These datasets typically already exist because someone (companies, industries, NGOs, etc.) collected them. Data Science focuses on analysing large and complex datasets to uncover patterns, trends, and relationships that can be used to inform decision-making. The Data Scientist is not typically involved in generating the data de novo.
These key aspects summarise the difference between the two fields:
- Approach: Science is hypothesis-driven, while Data Science is data-driven. Science begins with a hypothesis that is tested through experiments, while Data Science begins with data and uses statistical and computational methods to uncover insights.
- Goals: Science aims to generate new knowledge and understanding of the natural world, while Data Science aims to uncover insights and make predictions based on existing data. Scientists focus on understanding the underlying mechanisms of natural phenomena and their area of focus is the real world, while Data Scientists focus on extracting knowledge and insights from data, often in the realm of business.
- Methods: Science involves making observations of the world, conducting experiments, collecting and analysing data, and drawing conclusions based on the results. Data Science only involves using statistical and computational tools to analyse data and uncover patterns and relationships.
- Context: Science is typically focused on a specific domain, such as biology, chemistry, or physics. Data Science can be applied to any domain that involves data, including business, finance, healthcare, and social media.
The Intro R Workshop focuses on the functionality offered by the tidyverse suite of packages. I designed the Workshop to introduce you to a powerful set of tools for data manipulation, exploration, and visualisation. The tidyverse is a collection of R packages that work together to provide a cohesive set of functions for manipulating data. This course will cover the most popular packages in the tidyverse, including tidyr for data reshaping, dplyr for data ‘wrangling’, and ggplot2 for data visualisation. You will learn how to clean, transform, and visualise data, as well as how to use these tools to build reproducible and informative data analysis pipelines. With a focus on practical application and hands-on exercises, you will gain the skills and knowledge needed to effectively use the tidyverse in your own data analysis projects.
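As a rough illustration of how these packages fit together, the sketch below reshapes and summarises the built-in `iris` dataset. The choice of dataset and the object and column names (`iris_long`, `iris_means`, `trait`, `value_cm`, `mean_cm`) are assumptions made for this example only, not material taken from the Workshop itself.

```r
library(dplyr)
library(tidyr)

# tidyr: reshape the four measurement columns into long ("tidy") format,
# so that each row holds one measurement of one trait
iris_long <- iris %>%
  pivot_longer(cols = -Species, names_to = "trait", values_to = "value_cm")

# dplyr: 'wrangle' the long data into a mean per species and trait
iris_means <- iris_long %>%
  group_by(Species, trait) %>%
  summarise(mean_cm = mean(value_cm), .groups = "drop")

iris_means
```

The same long-format table can be handed directly to ggplot2, which is one reason the tidy layout is used so widely.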
In biological and ecological sciences, statistical methods play a crucial role in analysing and interpreting data. Some of the basic statistical methods used include:
- Descriptive statistics: These methods are used to summarise and describe the basic features of a dataset, such as the mean, median, and standard deviation.
- Inferential statistics: These allow you, the scientist, to make predictions and inferences about a population based on a sample of data. Common inferential statistical techniques include t-tests, ANOVA, and regression analysis.
- Non-parametric statistics: Non-parametric methods are called for when the data do not meet the assumptions of parametric statistics. Examples of non-parametric techniques include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
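The three families of methods listed above map onto functions that ship with base R. The following is only a sketch, assuming the built-in `PlantGrowth` and `cars` datasets as stand-ins for real biological data. The object names are illustrative, and checking the assumptions of each test is deliberately left out here.

```r
# Descriptive statistics: summarise plant weights
mean(PlantGrowth$weight)
median(PlantGrowth$weight)
sd(PlantGrowth$weight)

# Inferential statistics
# t-test: compare two groups only (control vs. treatment 1)
two_groups <- droplevels(subset(PlantGrowth, group %in% c("ctrl", "trt1")))
t_res <- t.test(weight ~ group, data = two_groups)

# ANOVA: compare all three groups
aov_res <- aov(weight ~ group, data = PlantGrowth)
summary(aov_res)

# Simple linear regression: stopping distance as a function of speed
lm_res <- lm(dist ~ speed, data = cars)
summary(lm_res)

# Non-parametric alternatives to the t-test and ANOVA
# (exact = FALSE because the data contain tied values)
wilcox_res <- wilcox.test(weight ~ group, data = two_groups, exact = FALSE)
kruskal_res <- kruskal.test(weight ~ group, data = PlantGrowth)
```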
By the end of this module, you will be able to:
- Understand and use R within the RStudio IDE
- Know and understand the tidyverse suite of functions and approach to data analysis and graphics
- Understand the principles underlying tidy data
- Understand the types of data and data distributions that biologists and ecologists will frequently encounter
- Understand and be able to execute the most frequently used inferential statistics
- Use the R software and associated packages to undertake these analyses
- Interpret the outcomes of these analyses and use them to make probabilistic inferences about the scientific enquiries
- Communicate the findings by written and oral means
The graduate attributes resulting from completion of this module align with the expectations of the workplace across the diverse organisations and institutions where graduates typically find employment.
Data Used
All the data required for BCB744 may be downloaded here. After you have downloaded the archived (.zip) data, unzip it in a folder named `data` placed at the root of your R project. This will ensure that all the data are easily accessible to you.
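The unzipping step can also be done from within R. The sketch below is hedged: `project_root` uses a temporary directory as a stand-in for your real project folder, and the archive name `BCB744_data.zip` is a hypothetical placeholder, not the actual name of the download.

```r
# Stand-in for the root of your R project; in practice this is the folder
# that contains your .Rproj file
project_root <- tempdir()

# Create the `data` folder at the project root if it does not yet exist
data_dir <- file.path(project_root, "data")
dir.create(data_dir, showWarnings = FALSE)

# Hypothetical archive name: replace with the .zip file you downloaded
zip_path <- file.path(project_root, "BCB744_data.zip")
if (file.exists(zip_path)) {
  unzip(zip_path, exdir = data_dir)
}

# Confirm which files are now available
list.files(data_dir)
```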
R also gives you access to many built-in datasets that are useful for practicing our R skills. To find out which datasets are available to you on your system, execute the following command. Help files for each of the datasets are also available:
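The command itself did not survive on this page. The sketch below shows the standard base R way to list built-in datasets and open their help files, using `iris` purely as an example.

```r
# List the datasets in all currently attached packages
data()

# Restrict the listing to the base `datasets` package and keep it as a table
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Open the help file for one dataset
help("iris", package = "datasets")
```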
It is important to use these (or any) datasets to practise your R skills. Actively engaging with my comprehensive and detailed web pages, and practising on the included and additional datasets, will make the difference between a 60% average mark for the module and a mark in excess of 80%.
Prerequisites
You should have a moderate numerical literacy, but prior programming experience is not required. In all sciences, practical problem solving skills and a tenacity for challenges are crucial for success. Scientific disciplines constantly evolve and present new and complex problems that require creative and innovative solutions. You will have to demonstrate agile and adaptive approaches to solving challenges, and you must have the ability to break down complex problems into smaller parts and approach them systematically. You must also be able to identify and overcome roadblocks, and be persistent in your efforts to find a solution. These attributes will allow you to be effective in this module.
Method of Instruction
The workshop is designed to be as interactive as possible, so while you are working on exercises the tutor and I will circulate among you and engage with you to help you understand any material and the associated code you are uncomfortable with. Often this will result in discussions of novel applications and alternative approaches to the data analysis challenges you are required to solve. More challenging concepts might emerge during the Tasks and Assignments (typically these will be submitted the following day), and any such challenges will be dealt with in class prior to learning new concepts.
Although the module ultimately supports the application of biologically-oriented statistics, a large part of it is also about programming. It is up to you to take your coding skills to the next level and move beyond what I teach in class. Coding is a bit like learning a language, and as such programming is a skill that is best learned by doing.
Learning
Please refer to my advice about how to learn.
Collaborative learning provides an opportunity for you to work together and learn from each other. In this way, you will develop a deeper understanding of the subject matter. Collaborating with your friends and peers allows you to explore different perspectives and ideas, which can broaden your understanding and help you to see the subject matter from new angles. This type of learning environment also fosters the development of important skills such as communication, teamwork, and leadership, which are essential for success in academic and professional careers. Collaborative learning can create a sense of community and support among your group of peers. In the end, it will enhance your university experience, drive your love for learning, and prepare you for success beyond the university.
Discuss the BCB744 Workshop activities with your peers as you work on them. Use the WhatsApp group set up for the module for discussion purposes (I might assist via this medium if your questions or comments are relevant to the whole class). A better option is to use GitHub Issues. You will learn more in this module if you work with your friends than if you do not. Ask questions, answer questions, and share ideas liberally. Please identify your work partners by name on all assignments (if you decide to work in pairs).
Collaborative learning does not give you permission to reuse someone else’s code or text. Plagiarism is a serious offence and will be dealt with decisively. Consequences of cheating are severe: they range from 0% for the assignment or exam up to dismissal from the course for a second offence.
A huge volume of code is available on the web and it can be adapted to solve your own problems. You may make use of any online resources (e.g. from StackOverflow, a heavily used source of discussion about R code), but you MUST clearly indicate (cite) that your solution relies on found code, regardless of the extent to which you have modified it for your own needs. Reused code that is discovered via a web search and which is not explicitly cited is plagiarism and it will be treated as such. On assignments you may not directly share code with your peers in this workshop.
The 2025 BSc (Hons) cohort will be the first to experience the use of AI tools in the BCB744 module. The use of AI tools is a new and exciting development and it is important that you are exposed to these tools. The use of AI tools will be limited to the OpenAI ChatGPT tool, which may be used to generate ‘proto-code’ that will assist you in becoming familiar with the R language. We will explore ideas together, and the mark allocation to tasks and assignments will be adjusted accordingly.
Software
In this course you will rely entirely on R running within the RStudio IDE. The use of R is covered extensively in the BCB744 module where the installation process is discussed.
Additionally, the very basics—i.e. about R, RStudio, packages, their installation, etc.—can also be found on the ModernDive website. A slightly longer and more detailed account of the installation process and the very basics is provided on the datacamp platform.
ModernDive also provides a nice overview of using R for data science.
For more in-depth coverage of the R language, refer to R Master Hadley Wickham’s pages. There you will find everything you need to know in a well thought through presentation. Thoroughly working through this material, page by page, will quickly make you an R Master yourself (well, almost).
Computers
You are encouraged to provide your own laptops and to install the necessary software before the module starts. Limited support can be provided if required, but in the end, the onus is on you to understand how your computer works (from the filesystem through to dealing with software installation issues). There are also computers with R and RStudio (and the necessary add-on libraries) available in the 5th floor lab in the BCB Department.
Attendance
This workshop-based, hands-on course can only deliver acceptable outcomes if you attend all classes. The schedule is set and cannot be changed. Sometimes an occasional absence cannot be avoided. Please be courteous and notify me or the tutor in advance of any absence. If you work with a partner in class, notify them too. Keep up with the reading assignments while you are away and we will all work with you to get you back up to speed on what you miss. If you do miss a class, however, the assignments must still be submitted on time (also see Late submission of CA).
Since you may decide to work in collaboration with a peer on tasks and assignments, please keep this person informed at all times in case some emergency makes you unavailable for a period of time. Someone might depend on your input and contributions—do not leave someone in the lurch so that they cannot complete a task in your absence.
Assessment Policy
Continuous Assessment (CA) and a Final Assessment will provide a Final Mark for the module. These modes of assessment meet our needs as far as formative and summative assessments are concerned. The weighting of the CA and the Final Assessment is 0.6 and 0.4, respectively. All assessments are open book, so consult your code and reading material if and when you need to.
Assessment Component | Weight | Contribution (%) |
---|---|---|
CONTINUOUS ASSESSMENT | (0.6) | |
Introduction to R | | |
Presentations | | 10 |
Self-Assessment Tasks A–D (Random penalty)¹ | | (max. -10) |
Intro R Test | | 40 |
Biostatistics | | |
Presentations | | 10 |
Self-Assessment Tasks E–H (Random penalty) | | (max. -10) |
Biostatistics Test | | 40 |
Total | | 100 |
FINAL ASSESSMENT | (0.4) | |
Exam (Intro R + Biostatistics) | | 100 |

¹ A maximum of 10% may be deducted from your presentation marks should you be found to be dishonest in your self assessments.
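To make the weighting concrete, here is a small worked sketch in R. All marks are invented for illustration. It assumes that the presentation and test contributions (10, 40, 10, 40) are percentages of the CA mark, and it leaves out the possible self-assessment penalty.

```r
# Hypothetical marks out of 100 for each CA component (invented values)
ca_marks <- c(
  intro_r_presentation  = 70,
  intro_r_test          = 65,
  biostats_presentation = 75,
  biostats_test         = 60
)

# Assumed contribution of each component to the CA mark (sums to 1)
ca_contribution <- c(10, 40, 10, 40) / 100

# Weighted CA mark
ca_mark <- sum(ca_marks * ca_contribution)

# Hypothetical exam mark out of 100
exam_mark <- 58

# Final Mark: CA weighted 0.6 and Final Assessment weighted 0.4
final_mark <- 0.6 * ca_mark + 0.4 * exam_mark
final_mark
```

With these invented numbers the CA mark is 64.5 and the Final Mark is 61.9.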
Care must be taken that the tests and exams are submitted as instructed, i.e. paying attention to naming conventions and the format of the files submitted – typically this will be in a Quarto document (.qmd) and the knitted output (I prefer .html).
Random quizzes will not form part of the CA for BCB744.
The presentations are a critical part of the CA. They are designed to help you develop your communication skills around topics tangential to the broad field of knowledge generation. The presentations will cover topics such as the nature of knowledge and belief, the nature of science, the scientific method, the limits to science, and other broader societal topics.
BCB744 (Introduction to R and Biostatistics) relies on the expectation that you will engage in regular, honest self-reflection about your grasp of each day’s lecture content. After every lecture, time should be devoted to completing the Daily Self-Assessment Tasks, which are designed to help you gauge your understanding of the covered material. Answers to these tasks will be provided the following day, before introducing new content. The importance of honesty in these reflections cannot be overstated: each task should be rated on a personal scale from 1 (no real comprehension) to 10 (complete mastery). These self-assessment marks will be kept on record and serve as an indicator of progress. We will not permit the submission of these tasks, but they will be checked randomly. We will also discourage students from undertaking the Intro R Test and the Biostatistics Test if their self-assessment scores are consistently low.
Students who realise they are struggling are strongly advised to seek assistance from the lecturer or teaching assistant well before the gap in understanding becomes too large to bridge (i.e. on the day). The correlation between consistent, candid, and honest self-assessment and later performance in the Intro R Test, the Biostatistics Test, and the combined Exam (Intro R + Biostatistics) is high. By admitting the need for help early, you can align your learning strategies with course expectations and reinforce your command of the subject matter. Being the judge of personal preparedness demands self-reflection and honesty about your own strengths and weaknesses so as to develop a strong foundation for success.
For the daily self-assessment tasks to be effective, you must work alone on all of them.
Be responsible for your own learning. The lecturer and teaching assistant are here to help you, but you must take the initiative to seek assistance when needed. The more you engage with the material, the more you will learn and the better you will perform in the assessments.
At the conclusion of Intro R, and Biostatistics, you will take the more rigorous Intro R Test and Biostatistics Test. As indicated in the table above, these assessments carry significant weight. The tests will be conducted over several days, and you may complete them both at home and on campus. They constitute a key component of Continuous Assessment (CA) and are designed to prepare you for the final exam.
Each test consists of two parts:
- Theory Test (30%) – This is a written, closed-book assessment where you will be tested on theoretical concepts. The only resource available during this test is the R help system.
- Practical Test (70%) – In this open-book coding assessment, you will apply your theoretical knowledge to real data problems. While you may reference online materials (including ChatGPT), collaboration with peers is strictly prohibited.
The practical component of the tests will be graded as follows:
- Content (20%):
  - Questions answered in order
  - A written explanation of approach included for each question
  - Appropriate formatting of text, for example, fonts not larger than necessary, headings used properly, etc. Be sensible and tasteful.
- Code formatting, structure, and correctness (50%):
  - Use Tidyverse code
  - No more than ~80 characters of code per line (pay particular attention to the comments)
  - Application of R code conventions, e.g. spaces around `<-`, after `#`, after `,`, etc.
  - New line for each `dplyr` function (lines end in `%>%`) or `ggplot` layer (lines end in `+`)
  - Proper indentation of pipes and `ggplot()` layers
  - All chunks labelled without spaces
  - No unwanted / commented out code left behind in the document
- Figures (30%):
  - Sensible use of themes / colours
  - Publication quality
  - Informative and complete titles, axes labels, legends, etc.
  - No redundant features or aesthetics
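The code and figure conventions above might look as follows in practice. This is an illustrative sketch using the built-in `iris` dataset, not an official model answer. The object names and the choice of theme and fill colour are assumptions.

```r
library(dplyr)
library(ggplot2)

# Mean petal length per species: one dplyr function per line,
# each line ending in %>% and the continuation lines indented
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length), .groups = "drop")

# One ggplot layer per line, each line ending in +,
# with informative axis labels and a plain theme
petal_plot <- ggplot(data = iris_summary,
                     aes(x = Species, y = mean_petal)) +
  geom_col(fill = "grey40") +
  labs(x = "Species", y = "Mean petal length (cm)") +
  theme_bw()

petal_plot
```

Note the spaces around `<-` and `=`, the space after each `#` and `,`, and that no line exceeds 80 characters.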
The Exam is the final assessment. As such, it will test your skills broadly across both Intro R and Biostatistics. The Exam may be up to five days in duration. It will involve the analysis of real-world data. Some of the questions might expect that you write 1) statements of aims, objectives, and hypotheses; 2) the full and detailed methods followed by analyses together with all code; 3) full reporting of results in a manner suited for peer-reviewed publications; 4) graphical support highlighting the patterns observed (again with the code); and 5) a discussion if and when required. The weighting of marks to these various sections is:
- Aims, objectives, and hypotheses: 5%
- Methods and analyses: 45%
- Results: 20%
- Graphs: 15%
- Discussion: 15%
Other questions might be shorter in nature, designed to specifically test important aspects of BCB744. Such questions might be worth anything from 10 to 50 marks.
The Exam is also open book. Go home. Look at the questions. Answer them at home. Submit them by the deadline.
A statement such as the one below accompanies every assignment—pay attention, as failing to observe this instruction may result in a loss of marks (i.e. if an assignment remains ungraded because the owner of the material cannot be identified):
Submit the output of your Quarto script wherein you provide answers to the task questions by no later than 08:30 the following day (or the Monday in cases when assignments were given on Fridays). Label the script as follows (e.g.): BCB744_Smit_Task_A.html.
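If you want to avoid typos in the file name, the naming pattern in the example can be generated in R. The helper `task_file_name()` below is hypothetical and written only for this illustration. It assumes the pattern is the module code, your surname, and the task letter, as in the example above.

```r
# Hypothetical helper that builds a file name following the pattern
# BCB744_<Surname>_Task_<Letter>.html
task_file_name <- function(surname, task, ext = "html") {
  sprintf("BCB744_%s_Task_%s.%s", surname, task, ext)
}

task_file_name("Smit", "A")
# "BCB744_Smit_Task_A.html"
```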
Late Submissions
Late assignments will be penalised 10% per day and will not be accepted more than 48 hours late, unless evidence such as a doctor’s note, a death certificate, or another documented emergency can be provided. If you know in advance that a submission will be late, please discuss this and seek prior approval. This policy is based on the idea that in order to learn how to translate your human thoughts into computer language (coding) you should be practising several times each week, ideally daily. Time has been allocated in class for working on assignments and students are expected to continue to work on the assignments outside of class. Successfully completing (and passing) this module requires that you finish assignments based on what we have covered in class by the following class period. Work diligently from the outset so that even if something unexpected happens at the last minute you should already be close to done. This approach also allows rapid feedback to be provided to you, which can only be accomplished by returning assignments quickly and punctually.
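The penalty rule can be sketched as a small function. This is one possible reading only: the policy does not say how part-days are counted, so rounding every started day up to a full day is an assumption, and the function `late_penalty()` is hypothetical. Documented emergencies and prior approval are not modelled.

```r
# Hypothetical helper: percentage penalty for a late submission
# Assumption: every started 24-hour period counts as a full day
late_penalty <- function(hours_late) {
  if (hours_late <= 0) {
    return(0)           # on time: no penalty
  }
  if (hours_late > 48) {
    return(NA_real_)    # more than 48 hours late: not accepted
  }
  ceiling(hours_late / 24) * 10
}

late_penalty(5)   # within the first day
late_penalty(30)  # within the second day
late_penalty(50)  # beyond the 48-hour limit
```

Under this reading the three calls return 10, 20, and `NA` respectively.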
Support
It’s expected that some tricky aspects of the module will take time to master, and the best way to master problematic material is to practice, practice some more, and then to ask questions. Trying for 10 minutes and then giving up is not good enough. I’ll be more sympathetic to your cause if you can demonstrate having tried for a full day before giving up and asking me. When you ask questions about some challenge, this is the way to do it—explain to me your numerous attempts at trying to solve the problem, and explain how these various attempts have failed. I will not help you if you have not tried to help yourself first (maybe with advice from friends). There will be time in class to do this, typically before we embark on a new topic. You are also encouraged to bring up related questions that arise in your own B.Sc. (Hons.) research project.
Should you require more time with me, find out when I am ‘free’ and set an appointment by sending me a calendar invitation. I am happy to have a personal meeting with you via Zoom, but I prefer face-to-face in my office.
Guidelines for asking questions:
- First search existing issues (open or closed) for answers. If the question has already been answered, you’re done! If there is an open issue, feel free to contribute to it. Or feel free to open a closed issue if you believe the answer is not satisfactory.
- Give your issue an informative title. Note that you can edit an issue’s title after it’s been posted.
  - Good: “Error: could not find function "ggplot"”
  - Bad: “My code does not work!”
- Format your questions nicely using markdown and code formatting. Preview your issue prior to posting.
- As I explained above, your peers and I will be more sympathetic to your cause if you can show all the things you have tried as you, yourself, tried to fix the issue first.
- Include code and example data so the person trying to help you has something to work with (and which results in the error, perhaps)
- Where appropriate, provide links to specific files, or even lines within them, in the body of your issue. This will help your peers understand your question. Note that only the teaching team will have access to private repos.
- (Optional) Tag someone or some group of people. Start by typing their GitHub username prefixed with the @ symbol. Of course this supposes that each of you has a GitHub account and username.
- Hit Submit new issue when you’re ready to post.
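One way to include code and example data in an issue is sketched below, using base R only. The slice of `iris` is a stand-in for your own data, and `example_data` is an illustrative name.

```r
# A small slice of a built-in dataset keeps the example short;
# substitute the smallest piece of your own data that shows the problem
example_data <- head(iris[, c("Species", "Petal.Length")], 3)

# dput() prints R code that recreates the object exactly;
# paste its output into the issue so others can rebuild your data
dput(example_data)

# Report the R version and attached packages alongside the failing code
sessionInfo()
```

Pasting the `dput()` output together with the shortest code that triggers the error lets a peer reproduce the problem on their own machine.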
Citation
@online{smit2025,
author = {Smit, A. J.},
title = {BCB744: {Introduction} to {R,} \& {Biostatistics}},
date = {2025-02-03},
url = {http://tangledbank.netlify.app/BCB744/BCB744_index.html},
langid = {en}
}