BCB744: Introduction to R, & Biostatistics

Published

February 3, 2025

Venue, Timetable, and Content

The venue for the module is the 5th Floor Computer Lab, BCB Department, University of the Western Cape. The module will run from 09:00 to 16:30 on the days indicated in the table below.

The module coordinator and lecturer is Prof AJ Smit (Room 4.103), and the teaching assistant for the module is Chané Claassen (4142581@myuwc.ac.za). For queries about the Honours programme in general, please consult Prof Bryan Maritz (Room 4.105).

Intro to R: From 3 to 7 February 2025.
Biostatistics: From 31 March to 4 April 2025 (during the mid-semester break of Semester 1).
Important links:
- Self-Assessments
- Presentations
- BCB744 Data

Wk	Lecture	Topic	Class Date	Tasks/Assessments	Task/Assess. due
		INTRO R
Wk1	L1	About the Module	3 Feb 25	Task A	4 Feb 25
		1. R and RStudio
		2. Working With Data and Code
		3. Data Classes and Structures in R
		4. R Workflows
	L2	5. Graphics With ggplot2	4 Feb 25	Task B	5 Feb 25
		6. Faceting Figures
		7. Brewing Colours
	L3	8. Mapping With ggplot2	5 Feb 25	Task C	6 Feb 25
		9. Mapping With style
		10. Mapping With Natural Earth and the sf Package
	Self	11. The Fiji Earthquake data		Bonus Task	31 Mar 25
	L4	12. Tidy Data	6 Feb 25	Task D	7 Feb 25
		13. Tidier Data
		14. Tidiest Data
	L5	Recap	7 Feb 25
		Test 1	17 Mar 25		TBA
		BIOSTATISTICS
Wk10	L1	The History of Scientific Inquiry	31 Mar 25
		1. Rmarkdown and Quarto
		2. Exploring With Summaries and Descriptions		Task E	1 Apr 25
		3. Exploring With Figures		Task E	1 Apr 25
	L2	4. Data Distributions	1 Apr 25
		5. Statistical Inference and Hypothesis Testing
		6. Assumptions
		7. Inferences About One or Two Populations		Task F	2 Apr 25
	L3	8. Analysis of Variance (ANOVA)	2 Apr 25	Task G	3 Apr 25
		9. Simple Linear Regressions		Task H	4 Apr 25
		10. Correlations		Task H	4 Apr 25
	L4	11. A Guide to Selecting the Right Parametric Test	3 Apr 25
		12. Non-Parametric Statistics
		13. Confidence Intervals
		14. Data Transformations
		Test 2	7-11 Apr 25		TBA
		Exam	TBA		TBA

Yes, the comma in this page’s title is correct: “BCB744: Introduction to R, and Biostatistics.” The module provides an introduction to the R software and language. I will also teach biostatistics.

This is a core module in your Honours programme. You will learn to use R for data analysis, visualisation, and statistical inference. You will also learn fundamental biostatistics concepts, such as hypothesis testing, probabilities, confidence intervals, regression analysis, Analysis of Variance, and other staples of biostatistics. I will use real-world datasets from the biological, ecological, and environmental fields that you can use to practise applying your R and biostatistics skills.

The approach taken in this Workshop is not dissimilar to a course in Data Science. However, in this Workshop, we won’t do data science, but we will use R to actually do science. There is a difference! Any scientist who can use R is also well equipped to be a data scientist, and some people who have completed this module have gone on to do just that. The difference between the two philosophies and careers is explained in the box immediately below.

Real Scientists and Data ‘Scientists’

A Scientist able to apply their intermediate to advanced R skills is by default also a ‘Data Scientist’. The opposite is generally not true: Data Scientists are not real Scientists—especially after only having completed ‘traditional’ courses in data science.

Science refers to the application of the scientific method of conducting research, where hypotheses are proposed, experiments are designed and conducted to test these hypotheses, and data are collected and analysed to draw conclusions. The aim of Science is to generate new knowledge and understanding of the natural world. A Scientist will typically be equipped to work through all of these steps.

Data Science, on the other hand, involves the use of computational and statistical tools to extract knowledge and insights from data. These datasets typically already exist because someone (companies, industries, NGOs, etc.) collected them. Data Science focuses on analysing large and complex datasets to uncover patterns, trends, and relationships that can be used to inform decision-making. The Data Scientist is not typically involved in generating the data de novo.

These key aspects summarise the difference between the two fields:

Approach: Science is hypothesis-driven, while Data Science is data-driven. Science begins with a hypothesis that is tested through experiments, while Data Science begins with data and uses statistical and computational methods to uncover insights.
Goals: Science aims to generate new knowledge and understanding of the natural world, while Data Science aims to uncover insights and make predictions based on existing data. Scientists focus on understanding the underlying mechanisms of natural phenomena, and their area of focus is the real world, while Data Scientists focus on extracting knowledge and insights from data, often in the realm of business.
Methods: Science involves making observations of the world, conducting experiments, collecting and analysing data, and drawing conclusions based on the results. Data Science only involves using statistical and computational tools to analyse data and uncover patterns and relationships.
Context: Science is typically focused on a specific domain, such as biology, chemistry, or physics. Data Science can be applied to any domain that involves data, including business, finance, healthcare, and social media.

The Intro R Workshop focuses on the functionality offered by the tidyverse suite of packages. I designed the Workshop to introduce you to a powerful set of tools for data manipulation, exploration, and visualisation. The tidyverse is a collection of R packages that work together to provide a cohesive set of functions for manipulating data. This course will cover the most popular packages in the tidyverse, including tidyr for data reshaping, dplyr for data ‘wrangling’, and ggplot2 for data visualisation. You will learn how to clean, transform, and visualise data, as well as how to use these tools to build reproducible and informative data analysis pipelines. With a focus on practical application and hands-on exercises, you will gain the skills and knowledge needed to effectively use the tidyverse in your own data analysis projects.

In biological and ecological sciences, statistical methods play a crucial role in analysing and interpreting data. Some of the basic statistical methods used include:

Descriptive statistics: These methods are used to summarise and describe the basic features of a dataset, such as the mean, median, and standard deviation.
Inferential statistics: These allow you, the scientist, to make predictions and inferences about a population based on a sample of data. Common inferential statistical techniques include t-tests, ANOVA, and regression analysis.
Non-parametric statistics: Non-parametric methods are called for when the data do not meet the assumptions of parametric statistics. Examples of non-parametric techniques include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
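
To make these three families of methods concrete, here is a minimal sketch in base R. The built-in ToothGrowth dataset and the particular tests applied to it are illustrative assumptions, not prescribed module content; the same functions can be applied to any comparable data.

```r
# Illustrative only: ToothGrowth is a built-in dataset chosen for convenience
data(ToothGrowth)

# Descriptive statistics: summarise tooth length
desc_stats <- c(
  mean = mean(ToothGrowth$len),
  median = median(ToothGrowth$len),
  sd = sd(ToothGrowth$len)
)
desc_stats

# Inferential statistics: Welch two-sample t-test comparing the supplements
t_res <- t.test(len ~ supp, data = ToothGrowth)
t_res

# Non-parametric statistics
# Wilcoxon rank-sum test for two groups; exact = FALSE requests the normal
# approximation, which avoids a warning when the data contain ties
w_res <- wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)
w_res

# Kruskal-Wallis test across the three dose groups
k_res <- kruskal.test(len ~ factor(dose), data = ToothGrowth)
k_res
```

Whether a parametric or a non-parametric test is appropriate depends on whether the assumptions of the parametric test are met, which has to be checked for each dataset.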

Core Skills

By the end of this module, you will be able to:

Understand and use R within the RStudio IDE
Know and understand the tidyverse suite of functions and approach to data analysis and graphics
Understand the principles underlying tidy data
Understand the types of data and data distributions that biologists and ecologists will frequently encounter
Understand and be able to execute the most frequently used inferential statistics
Use the R software and associated packages to undertake these analyses
Interpret the outcomes of these analyses and use them to make probabilistic inferences about the scientific enquiries
Communicate the findings by written and oral means

Graduate Attributes

The graduate attributes resulting from completion of this module align with the expectations of the workplace across the diverse organisations and institutions where graduates typically find employment.

Data Used

All the data required for BCB744 may be downloaded here. After you have downloaded the archived (.zip) data, unzip it in a folder named data placed at the root of your R project. This will ensure that all the data are easily accessible to you.
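
The unzipping step can also be done from within R. The sketch below assumes that the downloaded archive was saved as Archive.zip in the root of your R project; that file name and location are assumptions, so adjust them to match your own download.

```r
# Assumption: the archive was saved as "Archive.zip" in the project root;
# change zip_path if your file has a different name or location
zip_path <- "Archive.zip"
data_dir <- "data"

# Extract the archive into data/ (the folder is created if it does not exist)
if (file.exists(zip_path)) {
  unzip(zip_path, exdir = data_dir)
}

# List the extracted files to confirm that they are where you expect them
data_files <- list.files(data_dir, recursive = TRUE)
data_files
```

If the archive is not found, nothing is extracted and the listing is simply empty, which is a quick way to tell that the path needs correcting.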

R also gives you access to many built-in datasets that are useful for practising your R skills. To find out which datasets are available to you on your system, execute the following command. Help files for each of the datasets are also available:

# list the built-in datasets available on your system:
data()

# find help, for example:
?datasets::ChickWeight

It is important to practise your R skills on these (or any other) datasets. Actively engaging with my comprehensive and detailed web pages, and practising on the included and additional datasets, will make the difference between a 60% average mark for the module and a mark in excess of 80%.

Prerequisites

You should have moderate numerical literacy, but prior programming experience is not required. In all sciences, practical problem-solving skills and a tenacity for challenges are crucial for success. Scientific disciplines constantly evolve and present new and complex problems that require creative and innovative solutions. You will have to demonstrate agile and adaptive approaches to solving challenges, and you must have the ability to break down complex problems into smaller parts and approach them systematically. You must also be able to identify and overcome roadblocks, and be persistent in your efforts to find a solution. These attributes will allow you to be effective in this module.

Method of Instruction

The workshop is designed to be as interactive as possible, so while you are working on exercises the tutor and I will circulate among you and engage with you to help you understand any material and the associated code you are uncomfortable with. Often this will result in discussions of novel applications and alternative approaches to the data analysis challenges you are required to solve. More challenging concepts might emerge during the Tasks and Assignments (typically these will be submitted the following day), and any such challenges will be dealt with in class prior to learning new concepts.

Although the module ultimately supports the application of biologically-oriented statistics, a large part of it is also about programming. It is up to you to take your coding skills to the next level and move beyond what I teach in class. Coding is a bit like learning a language, and as such programming is a skill that is best learned by doing.

Learning

Software

In this course you will rely entirely on R running within the RStudio IDE. The use of R is covered extensively in the BCB744 module where the installation process is discussed.

Additionally, the very basics (about R, RStudio, packages, their installation, etc.) can also be found on the ModernDive website. A slightly longer and more detailed account of the installation process and the very basics is provided on the DataCamp platform.

ModernDive also provides a nice overview of using R for data science.

For more in-depth coverage of the R language, refer to R Master Hadley Wickham’s pages. There you will find everything you need to know in a well-thought-through presentation. Thoroughly working through this material, page by page, will quickly make you an R Master yourself (well, almost).

Computers

You are encouraged to provide your own laptops and to install the necessary software before the module starts. Limited support can be provided if required, but in the end, the onus is on you to understand how your computer works (from the filesystem through to dealing with software installation issues). There are also computers with R and RStudio (and the necessary add-on libraries) available in the 5th floor lab in the BCB Department.

Attendance

This workshop-based, hands-on course can only deliver acceptable outcomes if you attend all classes. The schedule is set and cannot be changed. Sometimes an occasional absence cannot be avoided. Please be courteous and notify me or the tutor in advance of any absence. If you work with a partner in class, notify them too. Keep up with the reading assignments while you are away and we will all work with you to get you back up to speed on what you miss. If you do miss a class, however, the assignments must still be submitted on time (also see Late submission of CA).

Since you may decide to work in collaboration with a peer on tasks and assignments, please keep this person informed at all times in case some emergency makes you unavailable for a period of time. Someone might depend on your input and contributions—do not leave someone in the lurch so that they cannot complete a task in your absence.

Assessment Policy

Continuous Assessment (CA) and a Final Assessment will provide a Final Mark for the module. These modes of assessment meet our needs as far as formative and summative assessments are concerned. The weighting of the CA and the Final Assessment is 0.6 and 0.4, respectively. All assessments are open book, so consult your code and reading material if and when you need to.

Assessment Component	Weight	Contribution (%)
CONTINUOUS ASSESSMENT	(0.6)
Introduction to R
Presentations		10
Self-Assessment Tasks A–D (Random penalty)¹		(max. -10).
Intro R Test		40
Biostatistics
Presentations		10
Self-Assessment Tasks E–H (Random penalty)		(max. -10).
Biostatistics Test		40
Total		100
FINAL ASSESSMENT	(0.4)
Exam (Intro R + Biostatistics)		100

¹ A maximum of 10% may be deducted from your presentation marks should you be found to be dishonest in your self assessments.
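
As a worked illustration of the weighting, the sketch below combines the CA and the Final Assessment using the 0.6 and 0.4 weights stated above. All the marks in it are hypothetical and invented purely for illustration, and the possible self-assessment penalty is not modelled.

```r
# Weighted final mark, using the stated weightings (CA 0.6, exam 0.4)
final_mark <- function(ca, exam, w_ca = 0.6, w_exam = 0.4) {
  w_ca * ca + w_exam * exam
}

# Hypothetical marks, invented purely for illustration
ca_components <- c(
  intro_r_presentation = 7,   # out of 10
  intro_r_test = 28,          # out of 40
  biostats_presentation = 8,  # out of 10
  biostats_test = 27          # out of 40
)
ca_total <- sum(ca_components)  # CA mark out of 100
exam_mark <- 65                 # exam mark out of 100

final_mark(ca_total, exam_mark)
```

With these invented numbers the CA total is 70, and the final mark is 0.6 × 70 + 0.4 × 65 = 68.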

Care must be taken that the tests and exams are submitted as instructed, i.e. paying attention to naming conventions and the format of the files submitted – typically this will be in a Quarto document (.qmd) and the knitted output (I prefer .html).

Random quizzes will not form part of the CA for BCB744.

The presentations are a critical part of the CA. They are designed to help you develop your communication around topics tangential to the broad field of knowledge generation. The presentations will cover topics such as the nature of knowledge and belief, the nature of science, the scientific method, the limits to science, and other broader societal topics.

BCB744 (Introduction to R and Biostatistics) relies on the expectation that you will engage in regular, honest self-reflection about your grasp of each day’s lecture content. After every lecture, time should be devoted to completing the Daily Self-Assessment Tasks, which are designed to help you gauge your understanding of the covered material. Answers to these tasks will be provided the following day, before introducing new content. The honesty of these reflections cannot be overstated: each task should be rated on a personal scale from 1 (no real comprehension) to 10 (complete mastery). These self-assessment marks will be kept on record and serve as an indicator of progress. We will not permit the submission of these tasks, but they will be checked randomly. We will also discourage students from undertaking the Intro R Test and the BioStats Test if their self-assessment scores are consistently low.

Students who realise they are struggling are strongly advised to seek assistance from the lecturer or teaching assistant well before the gap in understanding becomes too large to bridge (i.e. on the day). The correlation between consistent, candid, and honest self-assessment and later performance in the Intro R Test, the Biostatistics Test, and the combined Exam (Intro R + Biostatistics) is high. By admitting the need for help early, you can align your learning strategies with course expectations and reinforce your command of the subject matter. Being the judge of personal preparedness demands self-reflection and honesty about your own strengths and weaknesses so as to develop a strong foundation for success.

For the daily self-assessment tasks to be effective, you must work alone on all of them.

Be responsible for your own learning. The lecturer and teaching assistant are here to help you, but you must take the initiative to seek assistance when needed. The more you engage with the material, the more you will learn and the better you will perform in the assessments.

At the conclusion of Intro R, and Biostatistics, you will take the more rigorous Intro R Test and Biostatistics Test. As indicated in the table above, these assessments carry significant weight. The tests will be conducted over several days, and you may complete them both at home and on campus. They constitute a key component of Continuous Assessment (CA) and are designed to prepare you for the final exam.

Each test consists of two parts:

Theory Test (30%) – This is a written, closed-book assessment where you will be tested on theoretical concepts. The only resource available during this test is the R help system.
Practical Test (70%) – In this open-book coding assessment, you will apply your theoretical knowledge to real data problems. While you may reference online materials (including ChatGPT), collaboration with peers is strictly prohibited.

The practical component of the tests will be graded as follows:

Content (20%):
- Questions answered in order
- A written explanation of approach included for each question
- Appropriate formatting of text, for example, fonts not larger than necessary, headings used properly, etc. Be sensible and tasteful.
Code formatting, structure, and correctness (50%):
- Use Tidyverse code
- No more than ~80 characters of code per line (pay particular attention to the comments)
- Application of R code conventions, e.g. spaces around <-, after #, after ,, etc.
- New line for each dplyr function (lines end in %>%) or ggplot layer (lines end in +)
- Proper indentation of pipes and ggplot() layers
- All chunks labelled without spaces
- No unwanted / commented out code left behind in the document
Figures (30%):
- Sensible use of themes / colours
- Publication quality
- Informative and complete titles, axes labels, legends, etc.
- No redundant features or aesthetics
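
As a rough sketch of what code following these conventions might look like, consider the example below. It assumes that the dplyr and ggplot2 packages are installed and uses the built-in ChickWeight dataset; the object names, axis labels, and theme are illustrative choices, not a marking template.

```r
library(dplyr)
library(ggplot2)

# Mean chick weight per diet on each measurement day
chick_means <- as_tibble(ChickWeight) %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")

# One ggplot layer per line, each line ending in +
chick_plot <- ggplot(
  data = chick_means,
  aes(x = Time, y = mean_weight, colour = Diet)
) +
  geom_line() +
  labs(
    x = "Time (days)",
    y = "Mean body weight (g)",
    colour = "Diet"
  ) +
  theme_minimal()

chick_plot
```

Each dplyr step and each ggplot layer sits on its own line, there are spaces around <- and after every comma and #, and no line exceeds about 80 characters. In a Quarto document, the chunk holding this code would also carry a label without spaces.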

The Exam is the final assessment. As such, it will test your skills broadly across both Intro R and Biostatistics. The Exam may be up to five days in duration. It will involve the analysis of real-world data. Some of the questions might expect that you write 1) statements of aims, objectives, and hypotheses; 2) the full and detailed methods followed by analyses together with all code; 3) full reporting of results in a manner suited for peer-reviewed publications; 4) graphical support highlighting the patterns observed (again with the code); and 5) a discussion if and when required. The weighting of marks to these various sections is:

Aims, objectives, and hypotheses: 5%
Methods and analyses: 45%
Results: 20%
Graphs: 15%
Discussion: 15%

Other questions might be shorter in nature, designed to specifically test important aspects of BCB744. Such questions might be worth anything from 10 to 50 marks.

The Exam is also open book. Go home. Look at the questions. Answer them at home. Submit them by the deadline.

A statement such as the one below accompanies every assignment—pay attention, as failing to observe this instruction may result in a loss of marks (i.e. if an assignment remains ungraded because the owner of the material cannot be identified):

Submit the output of your Quarto script wherein you provide answers to the task questions by no later than 08:30 the following day (or the Monday in cases when assignments were given on Fridays). Label the file as follows (e.g.): BCB744_Smit_Task_A.html.

Late Submissions

Late assignments will be penalised 10% per day and will not be accepted more than 48 hours late, unless evidence such as a doctor’s note, a death certificate, or documentation of another emergency can be provided. If you know in advance that a submission will be late, please discuss this and seek prior approval. This policy is based on the idea that in order to learn how to translate your human thoughts into computer language (coding) you should be working with code several times each week, ideally daily. Time has been allocated in class for working on assignments and students are expected to continue to work on the assignments outside of class. Successfully completing (and passing) this module requires that you finish assignments based on what we have covered in class by the following class period. Work diligently from the outset so that even if something unexpected happens at the last minute you should already be close to done. This approach also allows rapid feedback to be provided to you, which can only be accomplished by returning assignments quickly and punctually.

Support

It’s expected that some tricky aspects of the module will take time to master, and the best way to master problematic material is to practice, practice some more, and then to ask questions. Trying for 10 minutes and then giving up is not good enough. I’ll be more sympathetic to your cause if you can demonstrate having tried for a full day before giving up and asking me. When you ask questions about some challenge, this is the way to do it—explain to me your numerous attempts at trying to solve the problem, and explain how these various attempts have failed. I will not help you if you have not tried to help yourself first (maybe with advice from friends). There will be time in class to do this, typically before we embark on a new topic. You are also encouraged to bring up related questions that arise in your own B.Sc. (Hons.) research project.

Should you require more time with me, find out when I am ‘free’ and set an appointment by sending me a calendar invitation. I am happy to have a personal meeting with you via Zoom, but I prefer face-to-face in my office.

Guidelines for asking questions:

First search existing issues (open or closed) for answers. If the question has already been answered, you’re done! If there is an open issue, feel free to contribute to it. Or feel free to open a closed issue if you believe the answer is not satisfactory.
Give your issue an informative title. Note that you can edit an issue’s title after it’s been posted.
- Good: “Error: could not find function "ggplot"”
- Bad: “My code does not work!”
Format your questions nicely using markdown and code formatting. Preview your issue prior to posting.
As I explained above, your peers and I will be more sympathetic to your cause if you can show all the things you tried as you attempted to fix the issue yourself first.
Include code and example data so that the person trying to help you has something to work with (ideally something that reproduces the error).
Where appropriate, provide links to specific files, or even lines within them, in the body of your issue. This will help your peers understand your question. Note that only the teaching team will have access to private repos.
(Optional) Tag someone or some group of people. Start by typing their GitHub username prefixed with the @ symbol. Of course, this supposes that each of you has a GitHub account and username.
Hit Submit new issue when you’re ready to post.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit2025,
  author = {Smit, A. J.},
  title = {BCB744: {Introduction} to {R,} \& {Biostatistics}},
  date = {2025-02-03},
  url = {http://tangledbank.netlify.app/BCB744/BCB744_index.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2025) BCB744: Introduction to R, & Biostatistics. http://tangledbank.netlify.app/BCB744/BCB744_index.html.

--- date: "2025-02-03" title: "BCB744: Introduction to R, & Biostatistics" --- ![](../images/PhD_comics.JPG){width=100%} # Venue, Timetable, and Content The venue for the module is the 5th Floor Computer Lab, BCB Department, University of the Western Cape. The module will run from 09:00 to 16:30 on the days indicated in the table below. The module coordinator and lecturer is Prof AJ Smit (Room 4.103), and the teaching assistant for the module is Chané Claassen (4142581@myuwc.ac.za). For queries about the Honours programme in general, please consult Prof Bryan Maritz (Room 4.105). * **Intro to R**: From 3 to 7 February 2025. * **Biostatistics**: From 31 March to 4 April 2025 (during the mid-semester break of Semester 1). * **Important links:** - **[[Self-Assessments]{.my-highlight}](../assessments/BCB744_Intro_R_Self-Assessment.qmd)** - **[[Presentations]{.my-highlight}](../assessments/BCB744_Intro_R_Presentations.qmd)** - **[[BCB744 Data]{.my-highlight}](../data/Archive.zip)** | Wk | Lecture | Topic | Class Date | Tasks/Assessments | Task/Assess. due | |-------|-----------|-------------------------------------------------------------------------------------------|---------------|---------------------------------------------------------------------------------------|-------------------| | | | **INTRO R** | | | | | Wk1 | L1 | About the Module | 3 Feb 25 | [Task A](../assessments/BCB744_Task_A.html) | 4 Feb 25 | | | | [1. R and RStudio](intro_r/01-RStudio.html) | | | | | | | [2. Working With Data and Code](intro_r/02-working-with-data.html) | | | | | | | [3. Data Classes and Structures in R](intro_r/03-data-in-R.html) | | | | | | | [4. R Workflows](intro_r/04-workflow.html) | | | | | | L2 | [5. Graphics With **ggplot2**](intro_r/05-graphics.html) | 4 Feb 25 | [Task B](../assessments/BCB744_Task_B.html) | 5 Feb 25 | | | | [6. Faceting Figures](intro_r/06-faceting.html) | | | | | | | [7. Brewing Colours](intro_r/07-brewing.html) | | | | | | L3 | [8. 
Mapping With **ggplot2**](intro_r/08-mapping.html) | 5 Feb 25 | [Task C](../assessments/BCB744_Task_C.html) | 6 Feb 25 | | | | [9. Mapping With style](intro_r/09-mapping_style.html) | | | | | | | [10. Mapping With Natural Earth and the **sf** Package](intro_r/10-mapping_rnaturalearth.html)| | | | | | Self | [11. The Fiji Earthquake data](intro_r/11-mapping_quakes.html) | | [Bonus Task](../assessments/BCB744_Task_Bonus.html) | 31 Mar 25 | | | L4 | [12. Tidy Data](intro_r/12-tidy.html) | 6 Feb 25 | [Task D](../assessments/BCB744_Task_D.html) | 7 Feb 25 | | | | [13. Tidier Data](intro_r/13-tidier.html) | | | | | | | [14. Tidiest Data](intro_r/14-tidiest.html) | | | | | | L5 | Recap | 7 Feb 25 | | | | | | [**Test 1**]{.my-highlight} | 17 Mar 25 | | TBA | | | | **BIOSTATISTICS** | | | | | Wk10 | L1 | [The History of Scientific Inquiry](../docs/AJ%20Smit%20-%20To-Know-is-to-Not-Believe%20(Chapter%202).pdf)| 31 Mar 25 | | | | | | 1. Rmarkdown and Quarto | | | | | | | [2. Exploring With Summaries and Descriptions](basic_stats/02-summarise-and-describe.qmd) | | [Task E](../assessments/BCB744_Task_E.html) | 1 Apr 25 | | | | [3. Exploring With Figures](basic_stats/03-visualise.qmd) | | [Task E](../assessments/BCB744_Task_E.html) | 1 Apr 25 | | | L2 | [4. Data Distributions](basic_stats/04-distributions.qmd) | 1 Apr 25 | | | | | | [5. Statistical Inference and Hypothesis Testing](basic_stats/05-inference.qmd) | | | | | | | [6. Assumptions](basic_stats/06-assumptions.qmd) | | | | | | | [7. Inferences About One or Two Populations](basic_stats/07-t_tests.qmd) | | [Task F](../assessments/BCB744_Task_F.html) | 2 Apr 25 | | | L3 | [8. Analysis of Variance (ANOVA)](basic_stats/08-anova.qmd) | 2 Apr 25 | [Task G](../assessments/BCB744_Task_G.html) | 3 Apr 25 | | | | [9. Simple Linear Regressions](basic_stats/09-regressions.qmd) | | [Task H](../assessments/BCB744_Task_H.html) | 4 Apr 25 | | | | [10. 
Correlations](basic_stats/10-correlations.qmd) | | [Task H](../assessments/BCB744_Task_H.html) | 4 Αpr 25 | | | L4 | [11. A Guide to Selecting the Right Parametric Test](basic_stats/11-decision_guide.qmd) | 3 Apr 25 | | | | | | [12. Non-Parametric Statistics](basic_stats/12-glance.qmd) | | | | | | | [13. Confidence Intervals](basic_stats/13-confidence.qmd) | | | | | | | [14. Data Transformations](basic_stats/14-transformations.qmd) | | | | | | | [**Test 2**]{.my-highlight} | 7-11 Apr 25 | | TBA | | | | [**Exam**]{.my-highlight} | TBA | | TBA | ::: {.panel-tabset} ## Course Description {#sec-descr} Yes, the comma in this page's title is correct: "BCB744: Introduction to R, and Biostatistics." The module provides an introduction to the R software and language. I will also teach biostatistics. This is a core module in your Honours programme. You will learn to use R for data analysis, visualisation, and statistical inference. You will also learn fundamental biostatistics concepts, such as hypothesis testing, probabilities, confidence intervals, regression analysis, Analysis of Variance, and other staples of biostatistics. I will use real-world datasets from the biological, ecological, and environmental fields that you can use to practice applying your R and biostatistics skills. The approach taken in this Workshop is not dissimilar from a course in Data Science. However, in this Workshop, we won't do data science, but we will use R to actually **do** science. There is a difference! Any scientist that can use R is also ideally equipped to be a data scientist, and some people who have completed this module actually do just that. The difference between the two ideas, philosophies, careers is provided in the box immediately below. ::: {.callout-note appearance="simple"} ## Real Scientists and Data 'Scientists' {#sec-datascience} A **Scientist** able to apply their intermediate to advanced R skills is by default also a 'Data Scientist'. 
The opposite is generally not true: Data Scientists are not *real* Scientists---especially after only having completed 'traditional' courses in data science. **Science** refers to the application of the *scientific method* of conducting research, where hypotheses are proposed, experiments are designed and conducted to test these hypotheses, and data are collected and analysed to draw conclusions. The aim of Science is to generate *new knowledge and understanding* of the natural world. A Scientist will typically be equipped to work through all of these steps. **Data Science**, on the other hand, involves the use of computational and statistical tools to extract knowledge and insights from data. These datasets typically already exist because someone (companies, industries, NGOs, etc.) collected them. Data Science focuses on analysing large and complex datasets to uncover patterns, trends, and relationships that can be used to inform decision-making. The Data Scientist is not typically involved in generating the data from *de novo*. These key aspects summarise the difference between the two fields: - **Approach** Science is hypothesis-driven, while Data Science is data-driven. Science begins with a hypothesis that is tested through experiments, while Data Science begins with data and uses statistical and computational methods to uncover insights. - **Goals** Science aims to generate new knowledge and understanding of the natural world, while Data Science aims to uncover insights and make predictions based on existing data. Scientist focus on understanding the underlying mechanisms of natural phenomena and their area of focus is the real world, while Data Scientists focus on extracting knowledge and insights from data, often in the realm of business. - **Methods** Science involves making observations of the world, conducting experiments, collecting and analysing data, and drawing conclusions based on the results. 
Data Science *only* involves using statistical and computational tools to analyse data and uncover patterns and relationships. - **Context** Science is typically focused on a specific domain, such as biology, chemistry, or physics. Data Science can be applied to any domain that involves data, including business, finance, healthcare, and social media. ::: ## Theoretical Content The Intro R Workshop focuses on the functionality offered by the [**tidyverse**](https://www.tidyverse.org/) suite of packages. I designed the Workshop to introduce you to a powerful set of tools for data manipulation, exploration, and visualisation. The **tidyverse** is a collection of R packages that work together to provide a cohesive set of functions for manipulating data. This course will cover the most popular packages in the **tidyverse**, including **tidyr** for data reshaping, **dplyr** for data 'wrangling', and **ggplot2** for data visualisation. You will learn how to clean, transform, and visualise data, as well as how to use these tools to build reproducible and informative data analysis pipelines. With a focus on practical application and hands-on exercises, you will gain the skills and knowledge needed to effectively use the tidyverse in your own data analysis projects.  ## Statistical Content In biological and ecological sciences, statistical methods play a crucial role in analysing and interpreting data. Some of the basic statistical methods used include: - **Descriptive statistics** These methods are used to summarise and describe the basic features of a dataset, such as the mean, median, and standard deviation. - **Inferential statistics** These allow you, the scientist, to make predictions and inferences about a population based on a sample of data. Common inferential statistical techniques include *t*-tests, ANOVA, and regression analysis. 
- **Non-parametric statistics** Non-parametric methods are called for when the data do not meet the assumptions of parametric statistics. Examples of non-parametric techniques include the Wilcoxon rank-sum test and the Kruskal-Wallis test.

:::

::: {.panel-tabset}
## Core Skills

By the end of this module, you will be able to:

- Understand and use R within the RStudio IDE
- Know and understand the **tidyverse** suite of functions and approach to data analysis and graphics
- Understand the principles underlying *tidy data*
- Understand the types of data and data distributions that biologists and ecologists will frequently encounter
- Understand and be able to execute the most frequently used inferential statistics
- Use the R software and associated packages to undertake these analyses
- Interpret the outcomes of these analyses and use them to make probabilistic inferences about the scientific enquiries
- Communicate the findings by written and oral means

## Graduate Attributes

The [**graduate attributes**](../pages/graduate_attributes.qmd) resulting from completion of this module align with the expectations of the workplace across the diverse organisations and institutions where graduates typically find employment.

:::

# Data Used

All the data required for BCB744 may be [downloaded here](../data/Archive.zip). After you have downloaded the archived (.zip) data, unzip it in a folder named `data` placed at the root of your R project. This will ensure that all the data are easily accessible to you.

R also gives you access to many built-in datasets that are useful for practising your R skills. To find out which datasets are available to you on your system, execute the following command. Help files for each of the datasets are also available:

```{r}
#| eval: false
# list the available datasets like this:
data()
# find help, for example:
?datasets::ChickWeight
```

It is important to use these (or any) datasets to practise your R skills on.
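
As one possible way to begin practising with a built-in dataset, here is a minimal sketch using `ChickWeight`. It uses base R only so that it runs without add-on packages; the object names `final_day` and `mean_weight` are illustrative choices and not part of the module material.

```r
# Load a built-in dataset into the workspace
data("ChickWeight", package = "datasets")

# Inspect its structure and the first few rows
str(ChickWeight)
head(ChickWeight)

# Keep only the measurements taken on the last day of the experiment
final_day <- subset(ChickWeight, Time == max(Time))

# Mean chick weight per diet on that day
mean_weight <- aggregate(weight ~ Diet, data = final_day, FUN = mean)
mean_weight
```

The same steps can be repeated with any other dataset listed by `data()`; reading the help file first (for example `?datasets::ChickWeight`) tells you what each column represents.
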
Actively engaging with my comprehensive and detailed web pages, and practising on the included and additional datasets, will make the difference between a 60% average mark for the module and a mark in excess of 80%.

# Prerequisites

You should have a moderate numerical literacy, but prior programming experience is not required. In all sciences, practical problem-solving skills and a tenacity for challenges are crucial for success. Scientific disciplines constantly evolve and present new and complex problems that require creative and innovative solutions. You will have to demonstrate agile and adaptive approaches to solving challenges, and you must have the ability to break down complex problems into smaller parts and approach them systematically. You must also be able to identify and overcome roadblocks, and be persistent in your efforts to find a solution. These attributes will allow you to be effective in this module.

# Method of Instruction

The workshop is designed to be as interactive as possible, so while you are working on exercises the tutor and I will circulate among you and engage with you to help you understand any material and the associated code you are uncomfortable with. Often this will result in discussions of novel applications and alternative approaches to the data analysis challenges you are required to solve. More challenging concepts might emerge during the Tasks and Assignments (typically these will be submitted the following day), and any such challenges will be dealt with in class prior to learning new concepts.

Although the module ultimately supports the application of biologically-oriented statistics, a large part of it is also about programming. It is up to you to take your coding skills to the next level and move beyond what I teach in class. Coding is a bit like learning a language, and as such programming is a skill that is best learned by doing.

# Learning

::: {.panel-tabset}
## Collaboration

::: {.callout-note appearance="simple"}
## Also read: How to learn

Please refer to my [advice about how to learn](../pages/How_to_learn.qmd).
:::

Collaborative learning provides an opportunity for you to work together and learn from each other. In this way, you will develop a deeper understanding of the subject matter. Collaborating with your friends and peers allows you to explore different perspectives and ideas, which can broaden your understanding and help you to see the subject matter from new angles. This type of learning environment also fosters the development of important skills such as communication, teamwork, and leadership, which are essential for success in academic and professional careers. Collaborative learning can create a sense of community and support among your group of peers. In the end, it will enhance your university experience, drive your love for learning, and prepare you for success beyond the university.

Discuss the BCB744 Workshop activities with your peers as you work on them. Use the WhatsApp group set up for the module for discussion purposes (I might assist via this medium if your questions or comments are relevant to the whole class). A better option is to use [GitHub Issues](/quantecol/#help-via-bcb744-and-bcb743-issues-on-github). You will learn more in this module if you work with your friends than if you do not. Ask questions, answer questions, and share ideas liberally. Please identify your work partners by name on all assignments (if you decide to work in pairs).

**Collaborative learning does not give you permission to reuse someone else's code or text. Plagiarism is a serious offence and will be dealt with decisively. Consequences of cheating are severe---they range from a 0% for the assignment or exam up to dismissal from the course for a second offence.**

## Found Code

A huge volume of code is available on the web and it can be adapted to solve your own problems.
You may make use of any online resources (e.g. from [StackOverflow](https://stackoverflow.com/), a widely used source of discussion about [R code](https://stackoverflow.com/questions/tagged/r))---but you **MUST** clearly indicate (cite) that your solution relies on found code, regardless of the extent to which you have modified it to your own needs. Reused code that is discovered via a web search and which is not explicitly cited is plagiarism and it will be treated as such. On assignments you may not directly share code with your peers in this workshop.

## AI tools

The 2025 BSc (Hons) cohort will be the first to use AI tools in the BCB744 module. This is a new and exciting development, and it is important that you are exposed to these tools. Their use will be limited to OpenAI's ChatGPT, which may be used to generate 'proto-code' that will help you become familiar with the R language. We will explore ideas together, and the mark allocation to tasks and assignments will be adjusted accordingly.

:::

# Software

In this course you will rely entirely on [R](https://cran.r-project.org/) running within the [RStudio](https://www.rstudio.com/) IDE. The use of R is covered extensively in the [BCB744](https://tangledbank.netlify.app/bcb744/intro_r/01-rstudio) module where the [installation process](https://tangledbank.netlify.app/BCB744/intro_r/01-RStudio) is discussed. Additionally, the very basics---i.e. about R, RStudio, packages, their installation, etc.---can also be found on the [ModernDive](https://moderndive.netlify.app/1-getting-started.html) website. A slightly longer and more detailed account of the installation process and the very basics is provided on the [datacamp](https://www.datacamp.com/tutorial/r-studio-tutorial) platform. [ModernDive](https://moderndive.netlify.app) also provides a nice overview of using R for data science.
For more in-depth coverage of the R language, refer to R Master [Hadley Wickham's](https://r4ds.hadley.nz/) pages. There you will find everything you need to know in a well thought-through presentation. Thoroughly working through this material, page by page, will quickly make you an R Master yourself (well, almost).

# Computers

You are encouraged to provide your own laptops and to install the necessary software before the module starts. Limited support can be provided if required, but in the end, the onus is on you to understand how your computer works (from the filesystem through to dealing with software installation issues). There are also computers with R and RStudio (and the necessary add-on libraries) available in the 5th floor lab in the BCB Department.

# Attendance

This workshop-based, hands-on course can only deliver acceptable outcomes if you attend all classes. The schedule is set and cannot be changed. Sometimes an occasional absence cannot be avoided. Please be courteous and notify me or the tutor in advance of any absence. If you work with a partner in class, notify them too. Keep up with the reading assignments while you are away and we will all work with you to get you back up to speed on what you miss. If you do miss a class, however, the assignments must still be submitted on time (also see [**Late submission of CA**](#late-submission-of-ca)).

Since you may decide to work in collaboration with a peer on tasks and assignments, please keep this person informed at all times in case some emergency makes you unavailable for a period of time. Someone might depend on your input and contributions---do not leave someone in the lurch so that they cannot complete a task in your absence.

# Assessment Policy {#sec-policy}

**Continuous Assessment (CA)** and a **Final Assessment** will provide a **Final Mark** for the module. These modes of assessment meet our needs as far as [formative and summative assessments](../pages/assessment_theory.qmd) are concerned.
The weighting of the CA and the Final Assessment is 0.6 and 0.4, respectively. All assessments are open book, so consult your code and reading material if and when you need to.

| Assessment Component                              | Weight    | Contribution (%) |
|---------------------------------------------------|-----------|------------------|
| **CONTINUOUS ASSESSMENT**                         | **(0.6)** |                  |
| **Introduction to R**                             |           |                  |
| Presentations                                     |           | 10               |
| Self-Assessment Tasks A--D (Random penalty)[^1]   |           | (max. -10)       |
| [Intro R Test]{.my-highlight}                     |           | 40               |
| **Biostatistics**                                 |           |                  |
| Presentations                                     |           | 10               |
| Self-Assessment Tasks E--H (Random penalty)       |           | (max. -10)       |
| [Biostatistics Test]{.my-highlight}               |           | 40               |
| *Total*                                           |           | *100*            |
| **FINAL ASSESSMENT**                              | **(0.4)** |                  |
| [*Exam (Intro R + Biostatistics)*]{.my-highlight} |           | *100*            |

[^1]: A maximum of 10% may be deducted from your presentation marks should you be found to be dishonest in your self-assessments.

Care must be taken that the tests and exams are submitted as instructed, i.e. paying attention to naming conventions and the format of the files submitted. Typically this will be a Quarto document (.qmd) and the knitted output (I prefer .html).

Random quizzes will not form part of the CA for BCB744.

::: {.panel-tabset}
## Presentations

The presentations are a critical part of the CA. They are designed to help you develop your communication skills around topics tangential to the broad field of knowledge generation. The presentations will cover topics such as the nature of knowledge and belief, the nature of science, the scientific method, the limits to science, and other broader societal topics.
For more detail, see these links:

- [[Presentations]{.my-highlight}](../assessments/BCB744_Intro_R_Presentations.qmd)
- [[Assessment Sheet]{.my-highlight}](../assessments/BCB744_presentation_scoring.pdf)

## Self-Assessment Tasks

BCB744 (Introduction to R and Biostatistics) relies on the expectation that you will engage in regular, honest self-reflection about your grasp of each day's lecture content. After every lecture, time should be devoted to completing the Daily Self-Assessment Tasks, which are designed to help you gauge your understanding of the covered material. Answers to these tasks will be provided the following day, before introducing new content. The importance of honesty in these reflections cannot be overstated: each task should be rated on a personal scale from 1 (no real comprehension) to 10 (complete mastery).

These self-assessment marks will be kept on record and serve as an indicator of progress. We will not permit the submission of these tasks, but they will be checked randomly. We will also discourage students from undertaking the Intro R Test and the Biostatistics Test if their self-assessment scores are consistently low. Students who realise they are struggling are strongly advised to seek assistance from the lecturer or teaching assistant on the day, well before the gap in understanding becomes too large to bridge.

The correlation between consistent, candid, and honest self-assessment and later performance in the Intro R Test, the Biostatistics Test, and the combined Exam (Intro R + Biostatistics) is high. By admitting the need for help early, you can align your learning strategies with course expectations and reinforce your command of the subject matter. Being the judge of personal preparedness demands self-reflection and honesty about your own strengths and weaknesses so as to develop a strong foundation for success.
[**For the daily self-assessment tasks to be effective, you must work alone on all of them.**]{.my-highlight}

**Be responsible for your own learning. The lecturer and teaching assistant are here to help you, but you must take the initiative to seek assistance when needed. The more you engage with the material, the more you will learn and the better you will perform in the assessments.**

For more detail, see these links:

- [[Self-Assessments]{.my-highlight}](../assessments/BCB744_Intro_R_Self-Assessment.qmd)
- [[Rubric (All Tasks)]{.my-highlight}](../assessments/BCB744_self_assessment_scoring.pdf)

## Tests {#sec-summative}

At the conclusion of Intro R and of Biostatistics, you will take the more rigorous **Intro R Test** and **Biostatistics Test**. As indicated in the table above, these assessments carry significant weight. The tests will be conducted over several days, and you may complete them both at home and on campus. They constitute a key component of Continuous Assessment (CA) and are designed to prepare you for the final exam.

Each test consists of two parts:

1. **Theory Test (30%)** – This is a **written, closed-book** assessment where you will be tested on theoretical concepts. The only resource available during this test is the R help system.
2. **Practical Test (70%)** – In this **open-book coding assessment**, you will apply your theoretical knowledge to real data problems. While you may reference online materials (including ChatGPT), collaboration with peers is strictly prohibited.

The practical component of the tests will be graded as follows:

- Content (20%):
    - Questions answered in order
    - A written explanation of approach included for each question
    - Appropriate formatting of text, for example, fonts not larger than necessary, headings used properly, etc. Be sensible and tasteful.
- Code formatting, structure, and correctness (50%):
    - Use **Tidyverse** code
    - No more than \~80 characters of code per line (pay particular attention to the comments)
    - Application of [R code conventions](http://adv-r.had.co.nz/Style.html), e.g. spaces around `<-`, after `#`, after `,`, etc.
    - New line for each `dplyr` function (lines end in `%>%`) or `ggplot` layer (lines end in `+`)
    - Proper indentation of pipes and `ggplot()` layers
    - All chunks labelled without spaces
    - No unwanted / commented out code left behind in the document
- Figures (30%):
    - Sensible use of themes / colours
    - Publication quality
    - Informative and complete titles, axes labels, legends, etc.
    - No redundant features or aesthetics

## Exam

The Exam is the final assessment. As such, it will test your skills broadly **across both Intro R and Biostatistics**.

The Exam may be up to five days in duration. It will involve the analysis of real world data. Some of the questions might require you to write 1) statements of aims, objectives, and hypotheses; 2) the full and detailed methods followed by analyses together with all code; 3) full reporting of results in a manner suited for peer reviewed publications; 4) graphical support highlighting the patterns observed (again with the code); and 5) a discussion if and when required. The weighting of marks to these various sections is:

1. Aims, objectives, and hypotheses: 5%
2. Methods and analyses: 45%
3. Results: 20%
4. Graphs: 15%
5. Discussion: 15%

Other questions might be shorter in nature, designed to specifically test important aspects of BCB744. Such questions might be worth anything from 10 to 50 marks.

The Exam is also open book. Go home. Look at the questions. Answer them at home. Submit them by the deadline.

## Submission of Assignments

A statement such as the one below accompanies every assignment---pay attention, as failing to observe this instruction may result in a loss of marks (i.e.
if an assignment remains ungraded because the owner of the material cannot be identified):

Submit the output of your Quarto script, wherein you provide answers to the task questions, by no later than 08:30 the following day (or the Monday in cases when assignments were given on Fridays). Label the file as follows (e.g.): **BCB744_Smit_Task_A.html**.

**Late Submissions**

Late assignments will be penalised 10% per day and will not be accepted more than 48 hours late, unless evidence such as a doctor's note, a death certificate, or another documented emergency can be provided. If you know in advance that a submission will be late, please discuss this and seek prior approval.

This policy is based on the idea that in order to learn how to translate your human thoughts into computer language (coding) you should be working with code multiple times each week---ideally daily. Time has been allocated in class for working on assignments and students are expected to continue to work on the assignments outside of class. Successfully completing (and passing) this module requires that you finish assignments based on what we have covered in class by the following class period. Work diligently from the outset so that even if something unexpected happens at the last minute you should already be close to done.

This approach also allows rapid feedback to be provided to you, which can only be accomplished by returning assignments quickly and punctually.

:::

# Support

It's expected that some tricky aspects of the module will take time to master, and the best way to master problematic material is to practise, practise some more, and then to ask questions. Trying for 10 minutes and then giving up is not good enough. I'll be more sympathetic to your cause if you can demonstrate having tried for a full day before giving up and asking me.
When you ask questions about some challenge, this is the way to do it---explain to me your numerous attempts at trying to solve the problem, and explain how these various attempts have failed. *I will not help you if you have not tried to help yourself first* (maybe with advice from friends). There will be time in class to do this, typically before we embark on a new topic. You are also encouraged to bring up related questions that arise in your own B.Sc. (Hons.) research project.

Should you require more time with me, find out when I am 'free' and set an appointment by sending me a calendar invitation. I am happy to have a personal meeting with you via Zoom, but I prefer face-to-face in my office.

**Guidelines for asking questions:**

- First search existing issues (open or closed) for answers. If the question has already been answered, you're done! If there is an open issue, feel free to contribute to it. Or feel free to reopen a closed issue if you believe the answer is not satisfactory.
- Give your issue an informative title.
    - Good: "Error: could not find function "ggplot""
    - Bad: "My code does not work!"

    Note that you can edit an issue's title after it's been posted.
- Format your questions nicely using markdown and code formatting. Preview your issue prior to posting.
- As I explained above, your peers and I will be more sympathetic to your cause if you can show *all the things you have tried as you, yourself, tried to fix the issue first*.
- Include code and example data so the person trying to help you has something to work with (and which results in the error, perhaps).
- Where appropriate, provide links to specific files, or even lines within them, in the body of your issue. This will help your peers understand your question. Note that only the teaching team will have access to private repos.
- (Optional) Tag someone or some group of people. Start by typing their GitHub username prefixed with the \@ symbol.
Of course this supposes that each of you has a GitHub account and username.
- Hit **Submit new issue** when you're ready to post.
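
To illustrate the guideline about including code and example data, here is a minimal sketch of what a reproducible example might look like. It is only one possible layout, written in base R. The data values are made up, and `plot_chicks()` is a hypothetical function that is deliberately left undefined so that the snippet reproduces a "could not find function" error of the kind used in the title example above.

```r
# Small, made-up example data that anyone can recreate by copy-and-paste
chicks <- data.frame(
  time   = c(0, 2, 4, 6),
  weight = c(42, 51, 59, 64)
)

# dput() prints R code that rebuilds the data exactly; paste that
# output into your issue so that others can recreate `chicks`
dput(chicks)

# The smallest piece of code that triggers the problem.
# `plot_chicks()` is hypothetical and intentionally undefined.
error_message <- tryCatch(
  plot_chicks(chicks),
  error = function(e) conditionMessage(e)
)

# The exact error text is a good basis for an informative issue title
print(error_message)

# Report your R version and loaded packages alongside the code
sessionInfo()
```

Wrapping the failing call in `tryCatch()` is a choice made here so that the script runs to the end; in a real issue you would normally paste the failing call and the error text as they appear in your console.
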