14. Tidier Data
“Statistics cannot rescue a poorly designed study.”
— Douglas Altman
“The statistician’s task is not to discover the truth, but to measure uncertainty.”
— Bradley Efron
1 Introduction
On Day 1 you already worked through a tidy workflow: you saw how to import data, manipulate it, run a quick analysis, and create figures. In the previous session you filled in the missing piece of the workflow by learning how to tidy up your data within R. For the remainder of today you will be revisiting the “transform” portion of the tidy workflow. In this session you will go into more depth on what you learned on Day 1, and in the last session you will learn some new tricks. Over these two sessions you will also become more comfortable with the pipe operator %>%, while practising writing tidy code.
Transformation in the tidyverse is a small grammar, not a random list of tools. All five verbs share three properties:
- The first argument is a data frame.
- The output is a data frame.
- The result stays tidy unless you deliberately collapse it (mainly with summarise()).
There are five primary data transformation functions that you will focus on here:
- Arrange observations (rows) with arrange()
- Filter observations (rows) with filter()
- Select variables (columns) with select()
- Create new variables (columns) with mutate()
- Summarise variables (columns) with summarise()
A quick decision guide:
- Need fewer rows? → filter()
- Need different row order? → arrange()
- Need fewer or re-ordered columns? → select()
- Need new columns (same number of rows)? → mutate()
- Need fewer rows by aggregating? → summarise() (usually after group_by())

To keep these straight:
- arrange() and filter() change rows, not columns.
- select() and mutate() change columns, not rows.
- summarise() changes rows on purpose (it collapses data).
You will use the full South African Coastal Temperature Network dataset for these exercises. Before you begin, however, you will need to cover two new concepts.
The pipe is not just syntax. Each line should answer one question: “What rows do I want?” → “What columns do I need?” → “What new columns should I add?” If a pipeline feels confusing, split it into steps and inspect the intermediate outputs.
2 Comparison and Logical (Boolean) Operators
The assignment operator (<-) is a symbol that we use to assign the result of some bit of code to an object in your environment. Likewise, comparison operators are symbols we use to compare different objects. This is how you tell R to decide between many different courses of action. You will see these symbols often out in the “real world”, so let us spend a moment now getting to know them better. Most of these should be very familiar to you already:
- >: Greater than
- >=: Greater than or equal to
- <: Less than
- <=: Less than or equal to
- ==: Equal to
- !=: Not equal to
It is important here to note that == is for comparisons, while = is for assignment (for example, naming function arguments). They are not interchangeable, as we may see in the following code chunk. This is one of the more common mistakes one makes when writing code. Luckily, the error message this creates provides the clues we need to figure out that we have made this specific mistake.
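The chunk that produces this error could look like the following sketch; the SACTN dataframe is used throughout this section, and the site name is taken from the error message below:

```r
# Using `=` where `==` is needed -- this triggers the error shown below
SACTN %>%
  filter(site = "Amanzimtoti")
```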
R> Error in `filter()`:
R> ! We detected a named input.
R> ℹ This usually means that you've used `=` instead of `==`.
R> ℹ Did you mean `site == "Amanzimtoti"`?
dplyr inspects your arguments and refuses to guess what you meant. This is a feature: it prevents silent mistakes and makes debugging faster.
The comparison operators are often used together with Boolean operators. Boolean operators are used for logical operations and can compare values, resulting in either TRUE or FALSE. Here they are:
- !: NOT - negates a true value to false, and a false value to true.
- &: AND (vectorised) - returns TRUE if both operands are true, FALSE otherwise.
- |: OR (vectorised) - returns TRUE if at least one of the operands is true.
- &&: AND (short-circuit) - considers only the first element of each operand (and, since R 4.3.0, signals an error if an operand is longer than one).
- ||: OR (short-circuit) - considers only the first element of each operand (and, since R 4.3.0, signals an error if an operand is longer than one).
Think of & as constraint tightening (both conditions must be true) and | as constraint loosening (either condition can be true).
The %in% operator in R is a special operator used to test if elements of a vector or data object are contained in another vector or data object. It returns a Boolean vector (TRUE and FALSE) indicating whether each element of the first vector is found in the second vector. This operator is particularly useful for subsetting or filtering data based on matching values. For example, x %in% y will check for each element of x if there is a match in y, and return a logical vector indicating the presence or absence of each x element in y.
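As a minimal, self-contained illustration (the site names here are examples only, not drawn from the dataset):

```r
# For each element of `sites`, is it present in `west_coast`?
sites <- c("Port Nolloth", "Humewood", "Sea Point")
west_coast <- c("Port Nolloth", "Sea Point", "Doringbaai")
sites %in% west_coast
# TRUE FALSE TRUE
```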
So, comparison operators are used to make direct comparisons between specific things, but logical operators are used more broadly when making logical arguments. Logic is central to most computing so it is worth taking the time to cover these symbols explicitly here. R makes use of the same Boolean logic symbols as many other platforms, including Google, so some (or all) of these will likely be familiar.
When writing a line of tidy code you tend to use these logical operators to combine two or more arguments that use comparison operators. For example, the following code chunk uses the filter() function to find all temperatures recorded at Pollock Beach during December or January. Do not worry if the line of code is difficult to piece out, but make sure you can locate which symbols are comparison operators and which are logical operators. Please note that, for purposes of brevity, all of the outputs in this section are limited to ten lines, but when you run these code chunks on your own computer they will be much longer.
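A chunk matching the output below might look like this sketch; the use of lubridate's month() to pull the month number out of the date column is an assumption:

```r
library(tidyverse)
library(lubridate)  # assumed for month()

SACTN %>%
  filter(site == "Pollock Beach",
         month(date) == 12 | month(date) == 1)
```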
R> site src date temp depth type
R> 1 Pollock Beach SAWS 1999-12-01 19.95000 0 thermo
R> 2 Pollock Beach SAWS 2000-01-01 19.03333 0 thermo
R> 3 Pollock Beach SAWS 2000-12-01 19.20000 0 thermo
R> 4 Pollock Beach SAWS 2001-01-01 18.32667 0 thermo
R> 5 Pollock Beach SAWS 2001-12-01 20.59032 0 thermo
R> 6 Pollock Beach SAWS 2002-01-01 21.47097 0 thermo
R> 7 Pollock Beach SAWS 2002-12-01 19.78065 0 thermo
R> 8 Pollock Beach SAWS 2003-01-01 20.64516 0 thermo
R> 9 Pollock Beach SAWS 2003-12-01 20.48710 0 thermo
R> 10 Pollock Beach SAWS 2004-01-01 21.34839 0 thermo
You will look at the interplay between comparison and logical operators in more depth in the following session after you have reacquainted yourself with the main transformation functions you need to know.
3 Arrange Observations (Rows) with arrange()
First up in our greatest hits reunion tour is the function arrange(). This very simply arranges the observations (rows) in a dataframe based on the variables (columns) it is given. If you are concerned with ties in the ordering of your data, you may provide additional columns to arrange(). The columns are used for ordering in the order they are given, from left to right.
When does arranging matter?
- Checking extremes (highest/lowest values).
- Validating temporal order before plotting.
- Preparing grouped summaries so you can inspect patterns.
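A sketch of the chunk behind the output below, assuming the rows are being arranged by the temp column in ascending order:

```r
SACTN %>%
  arrange(temp)
```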
R> site src date temp depth type
R> 1 Sea Point SAWS 1990-07-01 9.635484 0 thermo
R> 2 Muizenberg SAWS 1984-07-01 9.708333 0 thermo
R> 3 Doringbaai SAWS 2000-12-01 9.772727 0 thermo
R> 4 Hondeklipbaai SAWS 2003-06-01 9.775000 0 thermo
R> 5 Sea Point SAWS 1984-06-01 10.000000 0 thermo
R> 6 Muizenberg SAWS 1992-07-01 10.193548 0 thermo
R> 7 Hondeklipbaai SAWS 2005-07-01 10.333333 0 thermo
R> 8 Hondeklipbaai SAWS 2003-07-01 10.340909 0 thermo
R> 9 Sea Point SAWS 2000-12-01 10.380645 0 thermo
R> 10 Muizenberg SAWS 1984-08-01 10.387097 0 thermo
If you would rather arrange your data in descending order, as is perhaps more often the case, you simply wrap the column name you are arranging by with the desc() function as shown below.
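A sketch of the descending version, wrapping the column name in desc():

```r
SACTN %>%
  arrange(desc(temp))
```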
R> site src date temp depth type
R> 1 Sodwana DEA 2000-02-01 28.34648 18 UTR
R> 2 Sodwana DEA 1999-03-01 28.04890 18 UTR
R> 3 Sodwana DEA 1998-03-01 27.87781 18 UTR
R> 4 Sodwana DEA 1998-02-01 27.76452 18 UTR
R> 5 Sodwana DEA 1996-02-01 27.73637 18 UTR
R> 6 Sodwana DEA 2000-03-01 27.52637 18 UTR
R> 7 Sodwana DEA 2000-01-01 27.52291 18 UTR
R> 8 Leadsmanshoal EKZNW 2007-02-01 27.48132 10 UTR
R> 9 Sodwana EKZNW 2005-01-01 27.45619 12 UTR
R> 10 Sodwana EKZNW 2007-02-01 27.44054 12 UTR
It must also be noted that when arranging data in this way, any rows with NA values will be sent to the bottom of the dataframe. This is not always ideal and so must be kept in mind.
4 Filter Observations (Rows) with filter()
When simply arranging data is not enough, and you need to remove rows of data you do not want, filter() is the tool to use. For example, you can select all monthly temperatures recorded at the site Humewood during the year 1990 with the following code chunk:
Every filter() call removes rows. Stacking multiple filters shrinks your data quickly. If results look “too small,” check each step.
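The chunk behind the output below might look like the following; extracting the year with lubridate's year() is an assumption:

```r
library(lubridate)  # assumed for year()

SACTN %>%
  filter(site == "Humewood", year(date) == 1990)
```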
R> site src date temp depth type
R> 1 Humewood SAWS 1990-01-01 21.87097 0 thermo
R> 2 Humewood SAWS 1990-02-01 18.64286 0 thermo
R> 3 Humewood SAWS 1990-03-01 18.61290 0 thermo
R> 4 Humewood SAWS 1990-04-01 17.30000 0 thermo
R> 5 Humewood SAWS 1990-05-01 16.35484 0 thermo
R> 6 Humewood SAWS 1990-06-01 15.93333 0 thermo
R> 7 Humewood SAWS 1990-07-01 15.70968 0 thermo
R> 8 Humewood SAWS 1990-08-01 16.09677 0 thermo
R> 9 Humewood SAWS 1990-09-01 16.41667 0 thermo
R> 10 Humewood SAWS 1990-10-01 17.14194 0 thermo
Remember to use the assignment operator (<-; keyboard shortcut Alt + -) if you want to create an object in the environment with the new results.
It must be mentioned that filter() also automatically removes any rows in the filtering column that contain NA values. Should you want to keep rows that contain missing values, insert the is.na() function into the line of code in question. To illustrate this, let us filter the temperatures for the Port Nolloth data collected by the DEA that were at or below 11°C, or that were missing values. You will put each argument on a separate line to help keep things clear. Note how R automatically indents the continuation lines in this chunk to help remind you that they are in fact part of the same argument. Also note how I have put the last bracket of this argument on its own line. This is not required, but I like to do so because forgetting the last bracket is a very common mistake.
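A sketch of the chunk just described, with each argument on its own line and the final bracket on its own line:

```r
SACTN %>%
  filter(site == "Port Nolloth",
         src == "DEA",
         temp <= 11 | is.na(temp)
  )
```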
5 Select Variables (Columns) with select()
When you load a dataset that contains more columns than will be useful or required, it is preferable to shave off the excess. You do this with the select() function. In the following examples you are going to remove unneeded columns, such as depth and type. There are many ways to do this and none is technically better or faster, so it is up to the user to find a favourite technique.
When using select(), think in one of two directions:
- Positive selection: “What do I want to keep?”
- Negative selection: “What do I want to remove?”

Either way, select() never changes rows. It only changes columns.
# Select columns individually by name
SACTN %>%
  select(site, src, date, temp)

# Select all columns between site and temp like a sequence
SACTN %>%
  select(site:temp)

# Select all columns except those stated individually
SACTN %>%
  select(-date, -depth)

# Select all columns except those within a given sequence
# Note that the '-' goes outside of a new set of brackets
# that are wrapped around the sequence of columns to remove
SACTN %>%
  select(-(date:depth))

You may also use select() to reorder the columns in a dataframe. In this case the inclusion of the everything() function may be a useful shortcut, as illustrated below.
# Change up order by specifying individual columns
SACTN %>%
  select(temp, src, date, site)

# Use the everything function to grab all columns
# not already specified
SACTN %>%
  select(type, src, everything())

# Or go bananas and use all of the rules at once
# Remember, when dealing with tidy data,
# everything may be interchanged
SACTN %>%
  select(temp:type, everything(), -src)

[] Notation
The square bracket [] notation may also be used for indexing and subsetting data structures such as vectors, matrices, data frames, and lists. Before tidyverse existed, this was the only way to do so. Square brackets allow you to select elements from these data structures based on their positions, conditions, or names. The use of square brackets can vary slightly depending on the data structure being accessed. Here is a brief overview:
- Vectors: square brackets let you select elements by their numeric position, or by a logical vector indicating which elements to keep. For example, vector[c(1, 3)] returns the first and third elements of the vector.
- Matrices: square brackets take two dimensions, [row, column], to select elements. You can select entire rows, columns, or individual elements. Leaving a dimension empty (e.g., [1, ]) selects everything in that dimension.
- Dataframes: similar to matrices, square brackets subset dataframes by [row, column]. However, since dataframes have column names, you can also use these names for selection, e.g., df[1, ] selects the first row of the dataframe, and df[, "columnName"] selects all rows of a specific column.
- Lists: lists can be subsetted by numeric or character indices corresponding to their elements. For example, list[[1]] selects the first element of the list. Note the double brackets, which are used to extract elements from a list directly. Single brackets, e.g., list[1], return a sublist containing the first element.
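A few self-contained examples of these patterns, using a toy vector, the built-in iris dataframe, and a small list:

```r
# Vector: select by position, then by a logical condition
v <- c(10, 20, 30, 40)
v[c(1, 3)]     # 10 30
v[v > 25]      # 30 40

# Dataframe: the first row; then all rows of a named column
iris[1, ]
iris[, "Species"]

# List: [[ ]] extracts an element, [ ] returns a sublist
l <- list(a = 1:3, b = "text")
l[[1]]         # 1 2 3
l[1]           # a one-element list, still named "a"
```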
Base R subsetting is powerful but low-level. Tidyverse verbs are higher-level: they make your intent explicit and read like a sentence. Use the one that makes your reasoning clearest.
Do this now: provide examples of i) the various data structures available in R, and ii) how to use square brackets to subset each of them. You may use any of the built-in datasets to do so.
6 Create New Variables (Columns) with mutate()
When you perform data analysis/statistics in R, it is often because you need to create some new values that did not exist in the raw data. The previous three functions you looked at (arrange(), filter(), select()) will prepare you to create new data, but do not do so themselves. This is when you need to use mutate(). You must, however, be mindful that mutate() is only useful for creating new variables (columns) that are a function of one or more existing columns (well, that is how it is mainly used). Any new column you create with mutate() will always have the same number of rows as the dataframe you are working with. In order to create a new column you must first tell R what the name of the column will be; in this case let us create a column named kelvin. The second step is to tell R what to put in the new column. As you may have guessed, you are going to convert the temp column, which contains degrees Celsius (°C), into kelvin (K) by adding 273.15 to every row.
Common mistakes to avoid:
- Trying to create a summary (fewer rows) inside mutate(); use summarise() instead.
- Accidentally overwriting an existing column when you meant to create a new one.
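A sketch of the mutate() call that produces the output below; the SACTN dataframe and its temp column come from the preceding sections:

```r
SACTN %>%
  mutate(kelvin = temp + 273.15)
```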
R> site src date temp depth type kelvin
R> 1 Port Nolloth DEA 1991-02-01 11.47029 5 UTR 284.6203
R> 2 Port Nolloth DEA 1991-03-01 11.99409 5 UTR 285.1441
R> 3 Port Nolloth DEA 1991-04-01 11.95556 5 UTR 285.1056
R> 4 Port Nolloth DEA 1991-05-01 11.86183 5 UTR 285.0118
R> 5 Port Nolloth DEA 1991-06-01 12.20722 5 UTR 285.3572
R> 6 Port Nolloth DEA 1991-07-01 12.53810 5 UTR 285.6881
R> 7 Port Nolloth DEA 1991-08-01 11.25202 5 UTR 284.4020
R> 8 Port Nolloth DEA 1991-09-01 11.29208 5 UTR 284.4421
R> 9 Port Nolloth DEA 1991-10-01 11.37661 5 UTR 284.5266
R> 10 Port Nolloth DEA 1991-11-01 10.98208 5 UTR 284.1321
This is a very basic example; mutate() is capable of much more than simple addition. You will get into some more exciting examples during the next session.
7 Summarise Variables (Columns) with summarise()
Finally, this brings you to the last tool for this section. To create new columns you use mutate(), but to calculate any sort of summary/statistic from a column that will return fewer rows than the dataframe has, you use summarise(). This makes summarise() much more powerful than the other functions in this section, but because it is able to do more, it can also be more unpredictable, making its use potentially more challenging. You will almost always end up using this function in your workflows. The following chunk very simply calculates the overall mean temperature for the entire SACTN.
Think of summarise() as a controlled rupture: it intentionally breaks the one-observation-per-row structure to produce aggregate values. This is why it is almost always paired with group_by() later.
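The overall mean referred to above can be computed with a chunk like this sketch; na.rm = TRUE is an assumption, included so that missing values do not turn the result into NA:

```r
SACTN %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))
```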
Note how the above chunk created a new dataframe. This is done because it cannot add this one result to the previous dataframe due to the mismatch in the number of rows. If you were to create additional columns with other summaries, you may do so within the same summarise() function. These multiple summaries are displayed on individual lines in the following chunk to help keep things clear.
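A sketch matching the output below, with one summary per line; again, na.rm = TRUE is assumed:

```r
SACTN %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            sd_temp = sd(temp, na.rm = TRUE),
            min_temp = min(temp, na.rm = TRUE),
            max_temp = max(temp, na.rm = TRUE))
```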
R> mean_temp sd_temp min_temp max_temp
R> 1 19.26955 3.682122 9.136322 28.34648
Creating summaries of the entire SACTN dataset in this way is not appropriate as you should not be combining time series from such different parts of the coast. In order to calculate summaries within variables you will need to learn how to use group_by(), which in turn will first require you to learn how to chain multiple functions together within a pipe (%>%). That is how you will begin the next session for today. You will finish with several tips on how to make your data the tidiest that it may be.
If a result looks wrong:
- Run the pipeline stepwise and inspect intermediate objects.
- Check row counts before and after each verb.
- Use count() or summarise(n = n()) to confirm expectations.
Citation
@online{smit2021,
  author = {Smit, A. J.},
  title = {14. {Tidier} {Data}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB744/intro_r/14-tidier.html},
  langid = {en}
}
