Introduction to R, and Biostatistics

Published

August 8, 2022

“Most people use statistics like a drunk man uses a lamppost; more for support than illumination.”

— Andrew Lang

“If your experiment needs a statistician, you need a better experiment.”

— Ernest Rutherford

Welcome to the pages for BCB744. This page provides the syllabus and teaching policies for the module, and it serves as a starting point for accessing all the theory, instruction, and data.

2024 Workshop Schedule

  • Intro to R: 5th - 9th February
  • Biostatistics: 25th - 28th March (during the mid-semester break of Semester 1)

Venue

5th Floor Computer Lab, BCB Department, Life Sciences Building, University of the Western Cape.

Honours Coordinator

Prof. Bryan Maritz: Room 4.105, Department of Biodiversity & Conservation Biology

Course Coordinator

Prof. AJ Smit: Room 4.103, Department of Biodiversity & Conservation Biology, ajsmit@uwc.ac.za

Instructors

Cayley Cammell: Department of Biodiversity & Conservation Biology, 4269088@myuwc.ac.za

Zoë-Angelique Petersen: Department of Biodiversity & Conservation Biology, 4042512@myuwc.ac.za

Course description

Yes, the comma in this page’s title is correct: “BCB744: Introduction to R, and Biostatistics.” The module provides an introduction to the R software and language. I will also teach biostatistics.

This is a core module in your Honours programme. You will learn to use R for data analysis, visualisation, and statistical inference. You will also learn fundamental biostatistics concepts, such as hypothesis testing, probabilities, confidence intervals, regression analysis, Analysis of Variance, and other staples of biostatistics. I will use real-world datasets from the biological, ecological, and environmental fields that you can use to practice applying your R and biostatistics skills.

The approach taken in this Workshop is not dissimilar from a course in Data Science. However, in this Workshop we won’t do data science; we will use R to actually do science. There is a difference! Any scientist who can use R is also well equipped to be a data scientist, and some people who have completed this module do just that. The difference between the two ideas, philosophies, and careers is explained in the box immediately below.

‘Real’ Scientists and Data ‘Scientists’

A Scientist able to apply their intermediate to advanced R skills is by default also a ‘Data Scientist’. The opposite is generally not true: Data Scientists are not real Scientists—especially after only having completed ‘traditional’ courses in data science.

Science refers to the application of the scientific method of conducting research, where hypotheses are proposed, experiments are designed and conducted to test these hypotheses, and data are collected and analysed to draw conclusions. The aim of Science is to generate new knowledge and understanding of the natural world. A Scientist will typically be equipped to work through all of these steps.

Data Science, on the other hand, involves the use of computational and statistical tools to extract knowledge and insights from data. These datasets typically already exist because someone (companies, industries, NGOs, etc.) collected them. Data Science focuses on analysing large and complex datasets to uncover patterns, trends, and relationships that can be used to inform decision-making. The Data Scientist is not typically involved in generating the data de novo.

These key aspects summarise the difference between the two fields:

  • Approach: Science is hypothesis-driven, while Data Science is data-driven. Science begins with a hypothesis that is tested through experiments, while Data Science begins with data and uses statistical and computational methods to uncover insights.

  • Goals: Science aims to generate new knowledge and understanding of the natural world, while Data Science aims to uncover insights and make predictions based on existing data. Scientists focus on understanding the underlying mechanisms of natural phenomena, and their area of focus is the real world, while Data Scientists focus on extracting knowledge and insights from data, often in the realm of business.

  • Methods: Science involves making observations of the world, conducting experiments, collecting and analysing data, and drawing conclusions based on the results. Data Science only involves using statistical and computational tools to analyse data and uncover patterns and relationships.

  • Context: Science is typically focused on a specific domain, such as biology, chemistry, or physics. Data Science can be applied to any domain that involves data, including business, finance, healthcare, and social media.

Course timetable and content

The Workshop has two sub-components. The first, the Intro R Workshop, will take place at the very start of your Honours year. The second is called Biostatistics, and it usually starts in late March or early April. This year’s timetable is provided below.

| Wk | Lecture | Topic | Class Date | Tasks/Assessment | Task/Assess. due |
|---|---|---|---|---|---|
| INTRO R | | | | | |
| Wk1 | L1 | Introduction | 5 Feb ’24 | | |
| | | 1. R and RStudio | | Task A | 6 Feb ’24 |
| | | 2. Working with data and code | | | |
| | | 3. R workflows | | Task B | 7 Feb ’24 |
| | L2 | 4. Graphics with ggplot2 | 6 Feb ’24 | Task C | 7 Feb ’24 |
| | | 5. Faceting figures | | Task D | 7 Feb ’24 |
| | L3 | 6. Brewing colours | 7 Feb ’24 | Task E | 8 Feb ’24 |
| | | 7. Mapping with ggplot2 | | | |
| | | 8. Mapping with style | | | |
| | | 9. Mapping with Natural Earth and the sf package | | Task F | 8 Feb ’24 |
| | Self | 10. The Fiji Earthquake data | | | |
| | | | | Bonus Task | 1 Apr ’24 |
| | L4 | 11. Tidy data | 8 Feb ’24 | Task G | 1 Mar ’24 |
| | | 12. Tidier data | | | |
| | | 13. Tidiest data | | Summative Task 1 | 1 & 25 Mar ’24 |
| | L5 | Recap | 9 Feb ’24 | | |
| Wk2 | | | 16 Feb ’24 | Intro R Assessment | 6 Mar ’24 |
| BIOSTATISTICS | | | | | |
| Wk10 | L1 | Biostatistics and the philosophy of scientific inquiry | 25 Mar ’24 | | |
| | | 1. Data classes and structures in R | | Task A | 26 Mar ’24 |
| | | 2. Exploring with summaries and descriptions | | Task B | 26 Mar ’24 |
| | | 3. Exploring with figures | | Task C | 26 Mar ’24 |
| | L2 | 4. Data distributions | 26 Mar ’24 | | |
| | | 5. Statistical inference and hypothesis testing | | Task D | 27 Mar ’24 |
| | | 6. Assumptions | | Summative Task 2 | 12 Apr ’24 |
| | | 7. Inferences about one or two populations | | Task E | 27 Mar ’24 |
| | L3 | 8. Analysis of Variance (ANOVA) | 27 Mar ’24 | Task F | 28 Mar ’24 |
| | | 9. Simple linear regressions | | Task G | 28 Mar ’24 |
| | | 10. Correlations | | Task H | 28 Mar ’24 |
| | | 11. A guide to selecting the right parametric test | | | |
| | | 12. Non-Parametric statistics | | | |
| | | 13. Confidence intervals | | | |
| | | 14. Data transformations | | | |
| Wk12 | | | 11 Apr ’24 | Final Integrative Assessment | 14 Apr ’24 |

Core theoretical content and philosophical framework of the Intro R Workshop

The Intro R Workshop focuses on the functionality offered by the tidyverse suite of packages. I designed the Workshop to introduce you to a powerful set of tools for data manipulation, exploration, and visualisation. The tidyverse is a collection of R packages that work together to provide a cohesive set of functions for manipulating data. This course will cover the most popular packages in the tidyverse, including tidyr for data reshaping, dplyr for data ‘wrangling’, and ggplot2 for data visualisation. You will learn how to clean, transform, and visualise data, as well as how to use these tools to build reproducible and informative data analysis pipelines. With a focus on practical application and hands-on exercises, you will gain the skills and knowledge needed to effectively use the tidyverse in your own data analysis projects.
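
To make the idea of a tidyverse pipeline concrete, here is a minimal sketch that reshapes, summarises, and plots one of R's built-in datasets. The choice of `iris` and the object names (`iris_long`, `iris_summary`, `iris_plot`) are illustrative assumptions, not part of the module material, and the tidyr, dplyr, and ggplot2 packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# tidyr: reshape the four measurement columns from wide to long format
iris_long <- iris %>%
  as_tibble() %>%
  pivot_longer(cols = -Species,
               names_to = "trait",
               values_to = "size_cm")

# dplyr: summarise each trait within each species
iris_summary <- iris_long %>%
  group_by(Species, trait) %>%
  summarise(mean_cm = mean(size_cm),
            sd_cm = sd(size_cm),
            .groups = "drop")

# ggplot2: one panel per trait, one box per species
iris_plot <- ggplot(iris_long, aes(x = Species, y = size_cm)) +
  geom_boxplot(aes(fill = Species), show.legend = FALSE) +
  facet_wrap(~trait, scales = "free_y") +
  labs(x = "Species", y = "Size (cm)") +
  theme_bw()

iris_plot
```

Each step hands a data frame to the next, which is what makes the whole analysis easy to rerun from the raw data.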

One of the key heuristic devices I use throughout is to focus on figures, particularly maps. Maps can be a powerful tool for teaching coding because they combine both aesthetics and information in a visually compelling format. Maps are not only beautiful, but they also provide insights and context that can be difficult to grasp through numbers or text alone. By working with code to create your own maps, you can learn programming concepts and techniques while also developing your visual literacy skills. This engaging and interactive approach to learning coding can help to demystify the subject and make it more accessible when you don’t have a background in coding.
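
As a small taste of how code becomes a map, the sketch below plots the built-in `quakes` data (seismic events near Fiji) by longitude and latitude. It is only an illustration: the object name `quake_map` and the styling choices are my own assumptions, and it draws no coastline, which would need extra map data.

```r
library(ggplot2)

# quakes: locations, depths, and magnitudes of seismic events near Fiji
quake_map <- ggplot(quakes, aes(x = long, y = lat)) +
  geom_point(aes(colour = depth, size = mag), alpha = 0.6) +
  scale_colour_viridis_c(name = "Depth (km)") +
  scale_size_continuous(name = "Magnitude", range = c(0.5, 3)) +
  # keep an approximately correct aspect ratio for geographic coordinates
  coord_quickmap() +
  labs(x = "Longitude (degrees east)", y = "Latitude (degrees north)") +
  theme_minimal()

quake_map
```

Every line adds one visual idea (points, colour scale, size scale, projection, labels), so the link between code and picture is easy to follow.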

Core statistical content of the Biostatistics Workshop

In biological and ecological sciences, statistical methods play a crucial role in analysing and interpreting data. Some of the basic statistical methods used include:

  • Descriptive statistics: These methods are used to summarise and describe the basic features of a dataset, such as the mean, median, and standard deviation.

  • Inferential statistics: These allow you, the scientist, to make predictions and inferences about a population based on a sample of data. Common inferential statistical techniques include t-tests, ANOVA, and regression analysis.

  • Non-parametric statistics: Non-parametric methods are called for when the data do not meet the assumptions of parametric statistics. Examples of non-parametric techniques include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
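
The three families of methods listed above can be sketched with base R alone. This is a rough illustration, not a recommended analysis: the use of the built-in `ChickWeight` data, the restriction to day 21, and the comparison of Diets 1 and 3 are all assumptions made for the example, and no test assumptions are checked here.

```r
# Work with a plain data frame of chick body mass on the final day (day 21)
chicks <- as.data.frame(ChickWeight)
final_day <- subset(chicks, Time == 21)

# Descriptive statistics
mean(final_day$weight)
median(final_day$weight)
sd(final_day$weight)

# Inferential statistics: two-sample t-test (Welch's by default), Diet 1 vs 3
two_diets <- droplevels(subset(final_day, Diet %in% c("1", "3")))
t_out <- t.test(weight ~ Diet, data = two_diets)
t_out

# Inferential statistics: one-way ANOVA across all four diets
aov_out <- aov(weight ~ Diet, data = final_day)
summary(aov_out)

# Non-parametric counterparts of the two tests above
wilcox_out <- wilcox.test(weight ~ Diet, data = two_diets, exact = FALSE)
kruskal_out <- kruskal.test(weight ~ Diet, data = final_day)
wilcox_out
kruskal_out
```

Which of these is appropriate for a given dataset depends on the assumptions covered later in the Biostatistics Workshop.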

Core skills developed by BCB744

By the end of this module, you will be able to:

  • Understand and use R within the RStudio IDE
  • Know and understand the tidyverse suite of functions and approach to data analysis and graphics
  • Understand the principles underlying tidy data
  • Understand the types of data and data distributions that biologists and ecologists will frequently encounter
  • Understand and be able to execute the most frequently used inferential statistics
  • Use the R software and associated packages to undertake these analyses
  • Interpret the outcomes of these analyses and use them to make probabilistic inferences about the scientific enquiries
  • Communicate the findings by written and oral means

Graduate attributes

The graduate attributes resulting from completion of this module align with the expectations of the workplace across the diverse organisations and institutions where graduates typically find employment.

Data used in support of the module

All the data required for BCB744 may be downloaded here.

Built-in data

R has many built-in datasets that are useful for practising our R skills. To find out which datasets are available to you on your system, execute the following command. Help files for each of the datasets are also available:

# list the datasets available on your system:
data()

# find help, for example:
?datasets::ChickWeight

It is important to practise your R skills on these (or any) datasets. Actively engaging with my comprehensive and detailed web pages, and practising on the included and additional datasets, will make the difference between a 60% average mark for the module and a mark in excess of 80%.
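
A first look at any built-in dataset might go something like the sketch below. The choice of `ChickWeight` and the object name `available` are only examples; the same steps apply to any other dataset.

```r
# Load a built-in dataset into the workspace by name
data("ChickWeight", package = "datasets")

# Inspect its structure: dimensions, column names, and column types
dim(ChickWeight)
names(ChickWeight)
str(ChickWeight)

# Look at the first few rows and a per-column summary
head(ChickWeight)
summary(ChickWeight)

# Names and titles of the datasets that ship with the 'datasets' package
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)
```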

Prerequisites

You should have a moderate numerical literacy, but prior programming experience is not required. In all sciences, practical problem solving skills and a tenacity for challenges are crucial for success. Scientific disciplines constantly evolve and present new and complex problems that require creative and innovative solutions. You will have to demonstrate agile and adaptive approaches to solving challenges, and you must have the ability to break down complex problems into smaller parts and approach them systematically. You must also be able to identify and overcome roadblocks, and be persistent in your efforts to find a solution. These attributes will allow you to be effective in this module.

Method of instruction

The workshop is designed to be as interactive as possible, so while you are working on exercises the tutor and I will circulate among you and engage with you to help you understand any material and the associated code you are uncomfortable with. Often this will result in discussions of novel applications and alternative approaches to the data analysis challenges you are required to solve. More challenging concepts might emerge during the Tasks and Assignments (typically these will be submitted the following day), and any such challenges will be dealt with in class prior to learning new concepts.

Although the module ultimately supports the application of biologically-oriented statistics, a large part of it is also about coding. It is up to you to take your coding skills to the next level and move beyond what I teach in class. Coding is a bit like learning a language, and as such programming is a skill that is best learned by doing.

Learning collaboratively

Also read: How to learn

Please refer to my advice about how to learn.

Collaborative learning provides an opportunity for you to work together and learn from each other. In this way, you will develop a deeper understanding of the subject matter. Collaborating with your friends and peers allows you to explore different perspectives and ideas, which can broaden your understanding and help you to see the subject matter from new angles. This type of learning environment also fosters the development of important skills such as communication, teamwork, and leadership, which are essential for success in academic and professional careers. Collaborative learning can create a sense of community and support among your group of peers. In the end, it will enhance your university experience, drive your love for learning, and prepare you for success beyond the university.

Discuss the BCB744 Workshop activities with your peers as you work on them. Use the WhatsApp group set up for the module for discussion purposes (I might assist via this medium if your questions or comments are relevant to the whole class). A better option is to use GitHub Issues. You will learn more in this module if you work with your friends than if you do not. Ask questions, answer questions, and share ideas liberally. Please identify your work partners by name on all assignments (if you decide to work in pairs).

Cooperative learning is not a licence for plagiarism. Plagiarism is a serious offence and will be dealt with decisively. Consequences of cheating are severe: they range from 0% for the assignment or exam up to dismissal from the course for a second offence.

Reusing code found elsewhere

A huge volume of code is available on the web and it can be adapted to solve your own problems. You may make use of any online resources (e.g. from StackOverflow, a widely used source of discussion about R code), but you MUST clearly indicate (cite) that your solution relies on found code, regardless of the extent to which you have modified it for your own needs. Reused code that is discovered via a web search and which is not explicitly cited is plagiarism and it will be treated as such. On assignments you may not directly share code with your peers in this workshop.

OpenAI ChatGPT and other AI tools

The 2024 BSc (Hons) cohort will be the first to experience the use of AI tools in the BCB744 module. The use of AI tools is a new and exciting development and it is important that you are exposed to these tools. The use of AI tools will be limited to the use of the OpenAI ChatGPT tool, which may be used to generate ‘proto-code’ that will assist you in becoming familiar with the R language. We will explore ideas together, and the mark allocation to tasks and assignments will be adjusted accordingly.

Software

In this course you will rely entirely on R running within the RStudio IDE. The use of R is covered extensively in the BCB744 module where the installation process is discussed.

Additionally, the very basics—i.e. about R, RStudio, packages, their installation, etc.—can also be found on the ModernDive website. A slightly longer and more detailed account of the installation process and the very basics is provided on the datacamp platform.

ModernDive also provides a nice overview of using R for data science.

For more in-depth coverage of the R language, refer to R Master Hadley Wickham’s pages. There you will find everything you need to know in a well thought through presentation. Thoroughly working through this material, page by page, will quickly make you an R Master yourself (well, almost).

Computers

You are encouraged to provide your own laptops and to install the necessary software before the module starts. Limited support can be provided if required, but in the end, the onus is on you to understand how your computer works (from the filesystem through to dealing with software installation issues). There are also computers with R and RStudio (and the necessary add-on libraries) available in the 5th floor lab in the BCB Department.

Attendance

This workshop-based, hands-on course can only deliver acceptable outcomes if you attend all classes. The schedule is set and cannot be changed. Sometimes an occasional absence cannot be avoided. Please be courteous and notify me or the tutor in advance of any absence. If you work with a partner in class, notify them too. Keep up with the reading assignments while you are away and we will all work with you to get you back up to speed on what you miss. If you do miss a class, however, the assignments must still be submitted on time (also see Late submission of CA).

Since you may decide to work in collaboration with a peer on tasks and assignments, please keep this person informed at all times in case some emergency makes you unavailable for a period of time. Someone might depend on your input and contributions—do not leave someone in the lurch so that they cannot complete a task in your absence.

Assessment policy

Continuous Assessment (CA) and a Final Assessment will provide a Final Mark for the module. These modes of assessment meet our needs as far as formative and summative assessments are concerned. The weighting of the CA and the Final Assessment is 0.6 and 0.4, respectively. All assessments are open book, so consult your code and reading material if and when you need to.

| Assessment Component | Weighting Contribution (%) |
|---|---|
| CONTINUOUS ASSESSMENT (60) | |
| Introduction to R | |
| Daily Tasks (A-G) | 7.5 |
| Summative Task 1 | 12.5 |
| Intro R Assessment | 30 |
| Biostatistics | |
| Daily Tasks (A-G) | 7.5 |
| Summative Task 2 | 42.5 |
| Total | 100 |
| FINAL ASSESSMENT (40) | |
| Integrative Assessment (Intro R + Biostatistics) | 100 |
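
The arithmetic behind the Final Mark can be sketched as follows. The weightings are taken from the table above; the student marks are invented purely for illustration, and the object names are hypothetical.

```r
# Hypothetical marks (%) for one student; these numbers are invented
ca_marks <- c(intro_daily = 70, summative_1 = 65, intro_assessment = 72,
              bio_daily = 80, summative_2 = 68)

# Contribution (%) of each component to the CA mark, from the table above
ca_weights <- c(intro_daily = 7.5, summative_1 = 12.5, intro_assessment = 30,
                bio_daily = 7.5, summative_2 = 42.5) / 100

# The CA weightings must add up to 100%
stopifnot(isTRUE(all.equal(sum(ca_weights), 1)))

# CA mark: weighted sum of the component marks
ca_mark <- sum(ca_marks * ca_weights[names(ca_marks)])

# Final Mark: CA counts 0.6 and the Integrative (Final) Assessment 0.4
final_assessment <- 60  # hypothetical mark (%)
final_mark <- 0.6 * ca_mark + 0.4 * final_assessment

ca_mark     # 69.875
final_mark  # 65.925
```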

You must work alone on all these Tasks. Care must be taken that the Tasks are submitted as instructed, i.e. paying attention to naming conventions and the format of the files submitted—typically this will be in a Quarto document (.qmd) and the knitted output (I prefer .html).

Random quizzes will not form part of the CA for BCB744.

Daily Tasks

Intro R and Biostatistics require that we work with real-world datasets. To this end, a series of Daily Tasks involving real data is a required part of BCB744. These are interwoven into the daily Chapters, and your feedback is required the following day. The Daily Tasks are part of the CA.

When assessing the tasks, we will pay attention to the following criteria:

  • Content (10%):
    • Questions answered in order
    • Annotations (meta-data in file header, comments about code, ideas, and approach)
  • Code formatting and correctness (45%):
    • Application of R code conventions, e.g. spaces around <-, after #, after ,, etc.
    • New line for each dplyr function (lines end in %>%) or ggplot layer (lines end in +)
  • Figures (45%):
    • Sensible use of themes / colours
    • Publication quality
    • Informative and complete titles, axes labels, legends, etc.
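
The formatting criteria above might look like this in practice. This is a sketch of the layout only, using the built-in `ChickWeight` data as an assumed example; the object names `mean_growth` and `growth_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Spaces around <- and after commas; one dplyr function per line,
# with each continued line ending in %>%
mean_growth <- as.data.frame(ChickWeight) %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")

# One ggplot layer per line, with each continued line ending in +
growth_plot <- ggplot(mean_growth, aes(x = Time, y = mean_weight)) +
  geom_line(aes(colour = Diet)) +
  geom_point(aes(colour = Diet)) +
  labs(x = "Time (days)", y = "Mean body mass (g)", colour = "Diet") +
  theme_minimal()

growth_plot
```

Note the space after each `#`, the indentation of continued lines, and the complete axis and legend titles.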

Summative Tasks

At the end of Intro R and Biostatistics, the more demanding Summative Tasks 1 and 2 will be required. As the table above indicates, the weighting of the Summative Tasks is more than that of the Daily Tasks but they also differ slightly between the two parts of the module. These assessments will take place over a few days and you may work at home or on campus. Like the Daily Tasks, the Summative Tasks are also open book assessments. The Summative Tasks form part of the CA and will prepare you for the Intro R Assessment (at the end of Intro R) and the Integrative Assessment (after Biostatistics; see ‘Final Assessment’, below).

We will assess the Summative Tasks as per the assessment breakdown provided under the Daily Tasks.

Intro R Assessment

The Intro R Assessment will, as the name suggests, be about the material covered in the Introduction to R section of the work only. It forms part of the CA.

The Intro R Assessment will be graded with the following expectations in mind:

  • Content (20%):
    • Questions answered in order
    • A written explanation of approach included for each question
    • Appropriate formatting of text, for example, fonts not larger than necessary, headings used properly, etc. Be sensible and tasteful.
  • Code formatting, structure, and correctness (50%):
    • Use Tidyverse code
    • No more than ~80 characters of code per line (pay particular attention to the comments)
    • Application of R code conventions, e.g. spaces around <-, after #, after ,, etc.
    • New line for each dplyr function (lines end in %>%) or ggplot layer (lines end in +)
    • Proper indentation of pipes and ggplot() layers
    • All chunks labelled without spaces
    • No unwanted / commented out code left behind in the document
  • Figures (30%):
    • Sensible use of themes / colours
    • Publication quality
    • Informative and complete titles, axes labels, legends, etc.
    • No redundant features or aesthetics

Integrative (Final) Assessment

The Integrative Assessment is the Final Assessment. As such, it will test your skills broadly across both Intro R and Biostatistics, and hence its weighting causes it to contribute more towards your final grade as it falls outside of the pool of marks that form your CA. The Integrative Assessment may be up to five days in duration. It will involve the analysis of real-world data. Some of the Questions might expect (as per Question) that you write 1) statements of aims, objectives, and hypotheses; 2) the full and detailed methods followed by analyses, together with all code; 3) full reporting of results in a manner suited for peer-reviewed publications; 4) graphical support highlighting the patterns observed (again with the code); and 5) a discussion if and when required. The weighting of marks to these various sections is:

  1. Aims, objectives, and hypotheses: 5%
  2. Methods and analyses: 45%
  3. Results: 20%
  4. Graphs: 15%
  5. Discussion: 15%

Other Questions might be shorter in nature, designed to specifically test important aspects of BCB744. Such Questions might be worth anything from 10 to 50 marks, depending on the nature of the Questions.

The Integrative Assessment is also open book. Go home. Look at the questions. Answer them at home. Submit them by the deadline.

Submission of Tasks, Assessments, and the Final Assessment

A statement such as the one below accompanies every assignment—pay attention, as failing to observe this instruction may result in a loss of marks (i.e. if an assignment remains ungraded because the owner of the material cannot be identified):

Submit a Quarto script wherein you provide answers to Questions by no later than 8:00 the following day (or the Monday in cases when assignments were given on Fridays). Label the script as follows (e.g.): BCB744_AJ_Smit_Task_A.qmd.

Late submission of Tasks

Late assignments will be penalised 10% per day and will not be accepted more than 48 hours late, unless evidence such as a doctor’s note, a death certificate, or another documented emergency can be provided. If you know in advance that a submission will be late, please discuss this and seek prior approval. This policy is based on the idea that in order to learn how to translate your human thoughts into computer language (coding) you should be working with code multiple times each week, ideally daily. Time has been allocated in class for working on assignments, and students are expected to continue to work on the assignments outside of class. Successfully completing (and passing) this module requires that you finish assignments based on what we have covered in class by the following class period. Work diligently from the outset so that even if something unexpected happens at the last minute you should already be close to done. This approach also allows rapid feedback to be provided to you, which can only be accomplished by returning assignments quickly and punctually.

Support

It’s expected that some tricky aspects of the module will take time to master, and the best way to master problematic material is to practice, practice some more, and then to ask questions. Trying for 10 minutes and then giving up is not good enough. I’ll be more sympathetic to your cause if you can demonstrate having tried for a full day before giving up and asking me. When you ask questions about some challenge, this is the way to do it—explain to me your numerous attempts at trying to solve the problem, and explain how these various attempts have failed. I will not help you if you have not tried to help yourself first (maybe with advice from friends). There will be time in class to do this, typically before we embark on a new topic. You are also encouraged to bring up related questions that arise in your own B.Sc. (Hons.) research project.

Should you require more time with me, find out when I am ‘free’ and set an appointment by sending me a calendar invitation. I am happy to have a personal meeting with you via Zoom, but I prefer face-to-face in my office.

Help via BCB744 Issues on GitHub

All discussion for the BCB744 and BCB743 workshops will be held in the Issues of this repository. Please post all content-related questions there, and use email only for personal matters. Note that this is a public repository, so be professional in your writing here (grammar, etc.).

To start a new thread, create a New issue. Tag your peers using their handle—@ajsmit, for example—to get their attention.

Once a question has been answered, the issue will be closed, so lots of good answers might end up in closed issues. Don’t forget to look there when looking for answers: you can use the Search feature on this repository to find answers that might have been offered for the same or a similar problem experienced by someone else in the past.

Guidelines for posting questions:

  • First search existing issues (open or closed) for answers. If the question has already been answered, you’re done! If there is an open issue, feel free to contribute to it. Or feel free to open a closed issue if you believe the answer is not satisfactory.
  • Give your issue an informative title.
    • Good: “Error: could not find function "ggplot"”
    • Bad: “My code does not work!” Note that you can edit an issue’s title after it’s been posted.
  • Format your questions nicely using markdown and code formatting. Preview your issue prior to posting.
  • As I explained above, your peers and I will be more sympathetic to your cause if you can show all the things you have tried as you, yourself, tried to fix the issue first.
  • Include code and example data so the person trying to help you has something to work with (and which results in the error, perhaps)
  • Where appropriate, provide links to specific files, or even lines within them, in the body of your issue. This will help your peers understand your question. Note that only the teaching team will have access to private repos.
  • (Optional) Tag someone or some group of people. Start by typing their GitHub username prefixed with the @ symbol. Of course this supposes that each of you has a GitHub account and username.
  • Hit Submit new issue when you’re ready to post.
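
One way to include code and example data in an issue, as the guidelines above ask, is sketched below. The use of the built-in `quakes` data and the object name `example_data` are assumptions for illustration; in a real issue you would use a small piece of your own data and the shortest code that triggers your error.

```r
# 1. A small, self-contained dataset that a helper can paste and run.
#    dput() prints R code that recreates the object.
example_data <- head(quakes, 5)
dput(example_data)

# 2. The shortest code that reproduces the problem goes here, together
#    with the full text of the error message it produces.

# 3. The R version and loaded packages, so others can match your setup
sessionInfo()
```

Paste the output of `dput()` and `sessionInfo()` into the issue inside code formatting so that it stays readable.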

Citation

BibTeX citation:
@online{j._smit2022,
  author = {J. Smit, Albertus},
  title = {Introduction to {R,} and {Biostatistics}},
  date = {2022-08-08},
  url = {http://tangledbank.netlify.app/BCB744/BCB744_index.html},
  langid = {en}
}
For attribution, please cite this work as:
J. Smit A (2022) Introduction to R, and Biostatistics. http://tangledbank.netlify.app/BCB744/BCB744_index.html.