1. Data Classes and Structures in R

Getting familiar with data classes and structures

Published

January 1, 2021

That which can be destroyed by the truth should be.

— P.C. Hodgell

In this Chapter
  • Data as encountered by the statistician
  • Data classes in R
  • Data structures in R
Tasks to complete in this Chapter
  • Task A 1-4

Introduction

In biostatistics we will encounter different data types, and comprehending data classes, data structures, and their statistical interpretations is important for several reasons:

  • Correctly identifying the data class: To manipulate, analyse, or visualise data effectively, we must first identify the type of data we are working with. Since different data classes possess distinct properties, we must determine the class of our data to leverage the appropriate functions and operations.

  • Efficient data processing: As biologists using R, we may come across data structures that are not fully compatible with the available functions. For example, when dealing with extensive datasets, employing data structures intended for vectorised operations like arrays or matrices can significantly boost data processing speed. In contrast, other data representations may not perform well when managing large datasets. By becoming familiar with the different data classes and their attributes in R, we can make informed choices about which data structures and functions to use to accomplish our data processing and analysis tasks effectively.

  • Data manipulation and analysis: Different data classes in R have specific methods for manipulating and analysing data. For example, if we’re working with character strings, we can use string manipulation functions like gsub() or strsplit(), which are designed for text rather than numerical data. If we’re working with dates and times, we need to use date and time functions like as.Date() or lubridate::ymd(). By understanding the different data classes in R, we can choose the appropriate functions for manipulating and analysing our data.

  • Data visualisation: Data can be visualised in numerous ways, such as using histograms for numeric data, bar graphs for categorical data, or scatter plots for representing the relationship between two numeric variables. However, the way we provide data to R is critical. By familiarising ourselves with the diverse data classes in R, we can select the most suitable visualisation methods for our data.

Let us delve into how biologists classify data and the distinct terminology they employ to describe it. It’s essential to comprehend these subtleties before we investigate how R recognises data classes. We’ll scrutinise the data types that are ubiquitous in the fields of biology and ecology, and subsequently shift our attention to data classes and structures in R.

Types of variables common in biology and ecology

We will most frequently encounter data arranged in columns of a data file—typically in MS Excel files or CSV files. A column is a variable, and one variable is comprised of one data type. So, when we refer to a variable, we expect that all the data within would be homogeneous—at least in as far as the data’s type.

The type of data that biologists and statisticians work with can influence the statistical techniques and methods they use to analyse and interpret the data. Let us discuss some of the different types of biological and ecological data we are likely to encounter.

Numeric variables

Numeric data in the context of biostatistics refers to quantitative data that can be expressed in numerical form, typically obtained from field and laboratory measurements, or from field sampling campaigns. Examples of numeric data in biostatistics include the height and mass of animals, concentrations of nutrients, laboratory test results such as respiration rates, or the number of limpets in a quadrat. Numeric data can be further categorised as discrete or continuous.

Discrete variables

Discrete data are whole (integer) numbers that represent counts of items or events. Integer data usually answer the question, “how many?” In the biological and Earth sciences, discrete data are commonly encountered as counts, or as integers that code the presence or absence of certain characteristics or events. Examples include the number of individuals of a particular species in a population, the number of chromosomes in a cell, or the number of earthquakes occurring in a region within a given time frame. Other examples of discrete data in these sciences include the number of mutations in a gene, the number of cells in a tissue sample, or the number of species present in an ecosystem. These types of data are often analysed using statistical techniques such as frequency distributions, contingency tables, and chi-square tests.

Continuous variables

Continuous data, on the other hand, are measured on a continuous scale. These usually represent measured quantities such as temperature (measured in degrees Celsius) or distance (measured in metres or similar), etc. They can take any value within a range, including whole numbers and fractions, so in principle there is an infinite number of ‘steps’ between any two values. In practice the number of steps depends on rounding (they can even be rounded to whole integers) and on considerations such as measurement precision and accuracy. Often, continuous data have upper and lower bounds that depend on the characteristics of the phenomenon being studied or the measurement being taken.

Dates

We often encounter date data when dealing with time-related data. For example, in ecological research, data collection may involve recording the date of a particular observation or sampling event, such as the date when a bird was sighted, or when water samples were taken from a stream. The purpose of using date (or time) data in biology and ecology is to enable us to understand and analyse temporal patterns and relationships in our response variables. This can include exploring seasonal trends, understanding the impact of environmental changes over time, or tracking the growth and development of organisms.

By analysing date data, we can gain insights into long-term trends and patterns that may not be apparent when looking at the data in aggregate. We can also use this information to make predictions about future trends, develop more effective management strategies, and identify potential areas for further research.

Qualitative variables

Qualitative data in the context of statistics used by biologists refers to non-numerical data that are collected from observations, experimental treatment groups, or other sources. They tend to be textual and are often used to describe characteristics or properties of living organisms, ecosystems, or other biological phenomena. Examples of qualitative data in biology may include the colour of flowers, the type of habitat where an animal is found, the behaviour of animals, or the presence or absence of certain traits or characteristics in a population.

Qualitative data can be further classified into nominal or ordinal data.

Nominal variables

Nominal data are used to describe qualitative variables that do not have any inherent order or ranking. Examples of nominal data in biology may include the type of plant or animal species, or the presence or absence of certain genetic traits. Nominal data are also often called categorical data. Because there are well-defined categories, the number of members belonging to each category can be counted. For example, there are three red flowers, 66 purple flowers, and 13 yellow flowers.

Ordinal variables

Ordinal data describe qualitative categorical variables that have a natural order or ranking. They are used when we need to arrange things in a particular order, such as from worst to best or from least to most. However, the differences between the values cannot be measured or quantified exactly, making them somewhat subjective. Examples of ordinal data include the different stages of development of an organism or the performance of a species under different fertilisers. Ordinal data can be entered as descriptive character strings, and internally, they are translated into an ordered vector of integers in R. For example, we can use a scale of 1 for terrible, 2 for ‘so-so’, 3 for average, 4 for good, and 5 for brilliant.
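
As a minimal sketch of the five-point scale described above, the following R code stores some ratings as an ordered factor and then reveals the integer codes behind the labels. The object names `scale_levels` and `ratings`, and the observations themselves, are invented for illustration.

```r
# Hypothetical ratings on the five-point scale described above
scale_levels <- c("terrible", "so-so", "average", "good", "brilliant")
ratings <- factor(c("good", "terrible", "average", "brilliant", "good"),
                  levels = scale_levels, ordered = TRUE)

# The labels are what we see when printing ...
ratings

# ... but each label is stored as its position in the level order
as.integer(ratings)
# [1] 4 1 3 5 4
```

The order of the `levels` argument, not the alphabetical order of the labels, determines which integer each category receives.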

Binary variables

Life can be boiled down to a series of binary decisions: should I have pizza for dinner, yes or no? Should I go to bed early, TRUE or FALSE? Should I start that new series on Netflix, accept or reject? Am I present or absent? You get the gist… Data of this binary kind are known as ‘logical’ data, and in R they can only take on the values of TRUE or FALSE (remember to mind your case!). In the computing world, logical data are often represented by 1 for TRUE and 0 for FALSE. So basically, your life’s choices can be summarised as a string of 1s and 0s. Who knew it was that simple?

When it comes down to it, everything in life is either black or white, right or wrong, good or bad. It’s like a cosmic game of “Would You Rather?” — and we’re all just playing along.
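
The correspondence between TRUE/FALSE and 1/0 can be sketched in a few lines of R. The vector `present` is a made-up set of presence-absence records, not data from this chapter.

```r
# Hypothetical presence (TRUE) / absence (FALSE) records for five sites
present <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

class(present)       # "logical"
as.integer(present)  # 1 0 1 1 0

# Because TRUE counts as 1 and FALSE as 0, sums and means are meaningful
sum(present)   # number of sites where the species is present: 3
mean(present)  # proportion of sites where it is present: 0.6
```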

Character variables

As the name implies, these are not numbers. Rather, they are human words that have found their way into R for one reason or another. In biology we most commonly encounter character values when we have a list of things, such as sites or species. These values will often be used as categorical or ordinal data.

Missing values

It’s unfortunate to admit that one of the most reliable aspects of any biological dataset is the presence of missing data (the presence of something that’s missing?!). It is a stark reminder of the fragility of life. How can we say that something contains missing data? It seems counterintuitive, as if the data were never there in the first place. However, as we remember the principles of tidy data, we see that every observation must be documented in a row, and each column in that row must contain a value. This organisation allows us to create a matrix of data from multiple observations. Since the data are presented in a two-dimensional format, any missing values from an observation will leave a gaping hole in the matrix. We call these ‘missing values.’ It’s a somber reality that even the most meticulous collection of data can be marred by the loss of information.

Complex numbers

“And if you gaze long enough into an abyss, the abyss will gaze back into you.”

— Friedrich Nietzsche

As we draw to a close on the topic of data types, we cling desperately to the threads of our sanity, hoping against hope that they remain tightly stitched. But let it be known, to those who dare enter further into the realm of data, that beneath the surface lie countless rocks, and around every corner lurk a legion of complex data types, waiting to ensnare the unwary. These shadows of information are as enigmatic as they are perilous, for they challenge the very essence of our understanding. It is not until the final chapter of our journey, when we confront the elusive art of modeling, that we will face these data demons head-on. But fear not, for we shall arm ourselves with the knowledge and techniques acquired on this treacherous path, and with each step forward, we shall move closer to mastering the darkness that awaits us.

Data classes in R

R is a powerful programming language and software environment for statistical computing and graphics. It offers a wide variety of data classes and types to represent different types of information.

The atomic modes are logical, integer, numeric (also sometimes called double), complex, character and raw. The Date class is a specialised form of the numeric class. Each atomic mode has its own properties and functions that can be used to manipulate objects of that mode. These atomic modes can be used to make an atomic data structure such as a vector, array, or matrix. This knowledge is also important when working with R tabular mixed data structures, such as a data.frame, tibble, or list. Please refer to Hadley Wickham’s overview of vectors in Advanced R, 2nd edition for more insight presented in an informative yet concise way.
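
The following sketch shows one illustrative value for each atomic mode, using typeof() to report how R stores it. The example values and the object name `d` are arbitrary choices made for illustration.

```r
# One illustrative value for each atomic mode; typeof() reports the storage type
typeof(TRUE)          # "logical"
typeof(5L)            # "integer" (the L suffix requests an integer)
typeof(5)             # "double"
class(5)              # "numeric" (class() uses the friendlier name)
typeof(2 + 3i)        # "complex"
typeof("limpet")      # "character"
typeof(as.raw(255))   # "raw"

# A Date is stored as a number of days since 1970-01-01
d <- as.Date("1970-01-10")
class(d)     # "Date"
typeof(d)    # "double"
unclass(d)   # 9
```

The last three lines illustrate why the Date class can be thought of as a specialised numeric: removing the class attribute with unclass() leaves a plain number.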

The data class can be determined with the class() or str() commands. Here’s a brief overview of some of the main data types in R:

numeric

In R, the numeric data class represents either integers or floating point (decimal) values. Numerical data are quantitative in nature as they represent things that can be objectively counted, measured, or calculated—the measured variables.

Numeric data are one of the most common types of data used in statistical and mathematical analysis. In R, numeric data are represented by the class numeric, which includes both integers and floating-point numbers. Numeric data can be used in a variety of operations and calculations, including arithmetic operations, statistical analyses, and visualisations. One important feature of the numeric data class in R is that it supports vectorisation, which allows for efficient and concise operations on large sets of numeric data. Additionally, R provides a wide range of built-in functions for working with numeric data, including functions for calculating basic statistical measures such as mean, median, and standard deviation.

In R integer (discrete) data are called int or <int> while continuous data are denoted num or <dbl>.

Example of integer data Suppose you have a dataset of the number of rats in different storm water drains in a neighbourhood. The number of rats is a discrete variable because it can only take on integer values (you can’t have a fraction of a rat).

Here’s how you could create a vector of this data in R:

# Create a vector of the number of rats in each storm water drain
num_rats <- c(0, 1, 2, 2, 3, 1, 4, 0, 2, 1, 2, 2, 0, 3, 2, 1, 1, 4, 2, 0)
num_rats
 [1] 0 1 2 2 3 1 4 0 2 1 2 2 0 3 2 1 1 4 2 0
class(num_rats)
[1] "numeric"

In this example, the data are represented as a vector called num_rats of class numeric (as revealed by class(num_rats)). Each element of the vector represents the number of rats in one storm water drain. For example, the first element of the vector (num_rats[1]) is 0, which means that the first drain in the dataset is free of rats. The fourth element of the vector (num_rats[4]) is 2, indicating that the fourth drain in the dataset is occupied by 2 rats.

One can also explicitly create a vector of integers using the as.integer() function:

num_rats_int <- as.integer(num_rats)
num_rats_int
 [1] 0 1 2 2 3 1 4 0 2 1 2 2 0 3 2 1 1 4 2 0
class(num_rats_int)
[1] "integer"

Above we coerced the class numeric data to class integer. But we can take floating point numeric and convert them to integers too with the as.integer() function. As we see, the effect is that the whole part of the number is retained and the rest discarded:

pies <- pi * 1:5
pies
[1]  3.141593  6.283185  9.424778 12.566371 15.707963
class(pies)
[1] "numeric"
as.integer(pies)
[1]  3  6  9 12 15

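Effectively, for the positive numbers above, this is equivalent to what the floor() function would return (for negative numbers as.integer() truncates towards zero, as trunc() does, whereas floor() always rounds down):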

floor(pies)
[1]  3  6  9 12 15

Be careful when coercing floating point numbers to integers. If rounding is what you expect, this is not what you will get. For rounding, use round() instead:

round(pies, 0)
[1]  3  6  9 13 16
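
The differences between these functions only become visible once negative numbers and exact halves are involved. The sketch below uses a made-up vector `x` to compare them side by side.

```r
# Hypothetical values on both sides of zero
x <- c(-2.7, -0.5, 0.5, 2.7)

as.integer(x)  # drops the fractional part (towards zero):  -2  0  0  2
trunc(x)       # same truncation, result stays numeric:     -2  0  0  2
floor(x)       # always rounds down:                        -3 -1  0  2
ceiling(x)     # always rounds up:                          -2  0  1  3
round(x)       # nearest whole number:                      -3  0  0  3
```

Note that round() sends exact halves such as 0.5 to the nearest even number, which is why both -0.5 and 0.5 become 0 here.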

Example of continuous data Here are some randomly generated temperature data assigned to an object called temp_data:

# Generate a vector of 50 normally distributed temperature values
temp_data <- round(rnorm(n = 50, mean = 15, sd = 3), 2)
temp_data
 [1] 21.65 12.38 14.41 15.82  9.05 14.39 12.20 14.49 13.58 15.29 13.47 13.06
[13] 14.61 17.92 17.41 15.02 14.30 15.19 17.41 17.61 12.00 11.94 11.25 14.36
[25] 15.25 13.02 10.15 14.57 20.78 20.16 10.90 12.43 12.45 14.30 16.27 11.89
[37] 14.07 14.00 11.65 13.67 14.46 12.73 16.02 17.86 11.76 21.72 18.77 16.29
[49] 11.42 12.06
class(temp_data)
[1] "numeric"

character

In R, the character data class represents textual data such as words, sentences, and paragraphs. Character data can be created using either single or double quotes, and it can include letters, numbers, and other special characters. In addition, character data can be concatenated using the paste() function or other string manipulation functions.

One important feature of the character data class in R is its versatility in working with textual data. For instance, it can be used to store and manipulate text data, including text-based datasets, text-based files, and text-based visualisations. Additionally, R provides a wide range of built-in functions for working with character data, including functions for manipulating strings, searching for patterns, and formatting output. Overall, the character data class in R is a fundamental data type that is critical for working with textual data in a variety of contexts. Character values are most often used to represent labels, names, or descriptions.
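
A few of the base R string functions mentioned above can be sketched as follows. The site and species labels are invented for illustration.

```r
# Hypothetical site and species labels
site <- "Kalk Bay"
species <- "Scutellastra cochlear"

# Single and double quotes create the same kind of object
class('limpet')                  # "character"

# Concatenate strings
paste(species, "at", site)       # "Scutellastra cochlear at Kalk Bay"
paste0("site_", 1:3)             # "site_1" "site_2" "site_3"

# Search, replace, split, and measure
grepl("cochlear", species)       # TRUE
gsub(" ", "_", species)          # "Scutellastra_cochlear"
strsplit(species, " ")[[1]]      # "Scutellastra" "cochlear"
nchar(site)                      # 8
toupper(site)                    # "KALK BAY"
```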

factor

In R, the factor data class is used to represent discrete categorical variables. Factors are often used in statistical analyses to represent class or group belonging. Factor values are categorical data, such as levels or categories of a variable. Factor variables are most commonly also character data, but they can be numeric too if coded correctly as factors. Factor values can be ordered (ordinal) or unordered (categorical or nominal).

Categorical variables take on a limited number of distinct values, often corresponding to different groups or levels. For example, a categorical variable might represent different colours, size classes, or species. Factors in R are represented as integers with corresponding character levels, where each level corresponds to a distinct category. The levels of a factor can be defined explicitly using the factor() function or automatically using the cut() function. One important feature of the factor data class in R is that it allows for efficient and effective data manipulation and analysis, particularly when working with large datasets. For instance, factors can be used in statistical analyses such as regression models or ANOVA, and they can also be used to create visualisations such as bar or pie graphs. The factor data class in R is a fundamental data type that is critical for representing and working with categorical variables in data analysis and visualisation.

The factor data class of data in an R data.frame structure (or in a tibble) is indicated by Factor or <fct>. Ordered factors are denoted by Ord.factor or <ord>.
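
As a rough sketch of the two routes to a factor mentioned above, the code below defines levels explicitly with factor() and then derives them from numeric measurements with cut(). The size classes, shell lengths, and break points are hypothetical.

```r
# Explicit levels with factor(): hypothetical size classes for six limpets
size_class <- factor(c("small", "large", "medium", "small", "large", "small"),
                     levels = c("small", "medium", "large"))
levels(size_class)       # "small" "medium" "large"
as.integer(size_class)   # 1 3 2 1 3 1 (the integer codes behind the labels)
table(size_class)        # counts per category

# Automatic levels with cut(): bin hypothetical shell lengths (mm)
shell_length <- c(12, 35, 58, 20, 41, 66)
length_class <- cut(shell_length, breaks = c(0, 25, 50, 75),
                    labels = c("small", "medium", "large"))
length_class
class(length_class)      # "factor"
```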

Nominal data One example of nominal factor data that ecologists might encounter is the type of vegetation in a particular area, such as ‘grassland’, ‘forest’, or ‘wetland’. Here’s an example of how to generate a vector of nominal data in R using the sample() function:

# Generate a vector of vegetation types
vegetation <- sample(c("grassland", "forest", "wetland"), size = 50, replace = TRUE)

# View the vegetation data
vegetation
 [1] "grassland" "wetland"   "forest"    "forest"    "wetland"   "forest"   
 [7] "forest"    "forest"    "wetland"   "grassland" "wetland"   "wetland"  
[13] "forest"    "forest"    "grassland" "wetland"   "forest"    "grassland"
[19] "grassland" "wetland"   "forest"    "grassland" "grassland" "grassland"
[25] "wetland"   "grassland" "grassland" "forest"    "forest"    "wetland"  
[31] "grassland" "grassland" "grassland" "forest"    "wetland"   "wetland"  
[37] "grassland" "forest"    "grassland" "wetland"   "forest"    "wetland"  
[43] "forest"    "forest"    "wetland"   "wetland"   "forest"    "wetland"  
[49] "grassland" "grassland"
class(vegetation)
[1] "character"
The sample() function

Note that the sample() function is not made specifically for nominal data; it can be used on any kind of data class.

Task A
  1. Provide some examples of what one can do with the sample() function. Why might we (mostly biologists and ecologists) find it useful?

Ordinal data Here’s an example vector of ordinal data in R that could be encountered by ecologists:

# Vector of ordinal data representing the successional stage of a forest
succession <- c("Early Pioneer", "Late Pioneer",
                "Young Forest", "Mature Forest",
                "Old Growth")
succession
[1] "Early Pioneer" "Late Pioneer"  "Young Forest"  "Mature Forest"
[5] "Old Growth"   
class(succession)
[1] "character"
# Convert to ordered factor
succession <- factor(succession, ordered = TRUE,
                     levels = c("Early Pioneer", "Late Pioneer",
                                "Young Forest", "Mature Forest",
                                "Old Growth"))
succession
[1] Early Pioneer Late Pioneer  Young Forest  Mature Forest Old Growth   
5 Levels: Early Pioneer < Late Pioneer < Young Forest < ... < Old Growth
class(succession)
[1] "ordered" "factor" 

In this example, the successional stage of a forest is represented by an ordinal scale with five levels ranging from ‘Early Pioneer’ to ‘Old Growth’. The factor() function is used to convert the vector to an ordered factor, with the ordered argument set to TRUE and the levels argument set to the same order as the original vector. This ensures that the levels are properly represented as an ordered factor.
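
One practical consequence of an ordered factor is that comparisons follow the level order rather than the alphabet. The sketch below uses a made-up set of four plots (`plots`) to show this.

```r
stages <- c("Early Pioneer", "Late Pioneer", "Young Forest",
            "Mature Forest", "Old Growth")

# Hypothetical successional stages recorded at four plots
plots <- factor(c("Young Forest", "Early Pioneer", "Old Growth", "Late Pioneer"),
                levels = stages, ordered = TRUE)

# Which plots are at an earlier stage than Young Forest?
plots < "Young Forest"   # FALSE TRUE FALSE TRUE

# The most advanced stage observed
max(plots)               # Old Growth

# Sorting follows the level order, not alphabetical order
sort(plots)
```

With an unordered factor, the comparison and max() calls above would not be meaningful, and R would return NA with a warning or an error instead.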

logical

In R, the logical data class represents binary or Boolean data. Logical data are used to represent variables that can take on only two possible values, TRUE or FALSE. A logical vector can also contain NA, which marks a missing value. NULL is different: it is an empty object that represents the absence of a value altogether, and it cannot be stored as an element of a logical vector.

Logical data can be created using logical operators such as ==, !=, >, <, >=, and <=. Logical data are commonly used in R for data filtering and selection, conditional statements, and logical operations. For example, logical data can be used to filter a dataset to include only observations that meet certain criteria or to perform logical operations such as AND (&) and OR (|). The logical data class in R is a fundamental data type that is critical for representing and working with binary or Boolean variables in data analysis and programming.
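
The filtering workflow described above might look like the following in base R. The counts in `limpets` and the labels in `substrate` are invented for illustration.

```r
# Hypothetical limpet counts and substrate type for six quadrats
limpets <- c(0, 12, 7, 3, 15, 0)
substrate <- c("rock", "rock", "sand", "rock", "rock", "sand")

# Comparison operators return a logical vector
limpets > 5                       # FALSE TRUE TRUE FALSE TRUE FALSE

# Combine conditions with & (AND) and | (OR)
busy_rock <- limpets > 5 & substrate == "rock"
busy_rock                         # FALSE TRUE FALSE FALSE TRUE FALSE

# Use the logical vector to keep only the matching observations
limpets[busy_rock]                # 12 15

# which() returns the positions where a condition is TRUE
which(limpets == 0 | substrate == "sand")   # 1 3 6
```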

Example logical (binary) data Here’s an example of generating a vector of binary or logical data in R, which represents the presence or absence of a particular species in different ecological sites:

# Generate a vector of 1s and 0s to represent the presence
# or absence of a species in different ecological sites
species_presence <- sample(c(0,1), 10, replace = TRUE)
species_presence
 [1] 0 0 0 1 0 0 1 1 1 0

We can also make a formal logical class data:

species_presence_logi <- as.logical(species_presence)
class(species_presence_logi)
[1] "logical"

In this example, we again use the sample() function to randomly generate a vector of 10 values, each either 0 or 1, to represent the presence or absence of a species in 10 different ecological sites. However, it is often not necessary to coerce to class logical, as we see in the presence-absence datasets we will encounter in BCB743: Quantitative Ecology.

date

In R, the POSIXct, POSIXlt, and Date classes are commonly used to represent date and time data. These classes each have unique characteristics that make them useful for different purposes.

The POSIXct class is a date/time class that represents dates and times as a numerical value, typically measured in seconds since January 1st, 1970 (UTC). This makes it precise to the second (fractional seconds can also be stored). It is useful for performing calculations and data manipulation involving time, such as finding the difference between two dates or adding a certain number of seconds to a given time. An example of how to generate a POSIXct object in R is as follows:

my_time <- as.POSIXct("2022-03-10 12:34:56")
class(my_time)
[1] "POSIXct" "POSIXt" 
my_time
[1] "2022-03-10 12:34:56 SAST"
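
The arithmetic mentioned above can be sketched as follows. The time zone is fixed to UTC here (an assumption made so that the results do not depend on the computer's settings), and `t1` and `t2` are arbitrary example times.

```r
# Fix the time zone so the result does not depend on local settings
t1 <- as.POSIXct("2022-03-10 12:34:56", tz = "UTC")

# The underlying number: seconds since 1970-01-01 00:00:00 UTC
as.numeric(t1)

# Add 90 seconds
t1 + 90                              # "2022-03-10 12:36:26 UTC"

# Difference between two times
t2 <- as.POSIXct("2022-03-11 00:34:56", tz = "UTC")
difftime(t2, t1, units = "hours")    # Time difference of 12 hours
```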

The POSIXlt class, on the other hand, typically represents dates and times in a more human-readable format. It stores date and time information as a list of named elements, including year, month, day, hour, minute, and second. This format is useful for displaying data in a more understandable way and for extracting specific components of a date or time. An example of how to generate a POSIXlt object in R is as follows:

my_time <- as.POSIXlt("2022-03-10 12:34:56")
class(my_time)
[1] "POSIXlt" "POSIXt" 
my_time
[1] "2022-03-10 12:34:56 SAST"
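
Because a POSIXlt object is a list of named elements, individual components can be extracted with the `$` operator. The sketch below again assumes UTC and uses the hypothetical object name `lt`.

```r
lt <- as.POSIXlt("2022-03-10 12:34:56", tz = "UTC")

lt$hour          # 12
lt$min           # 34
lt$mday          # 10   (day of the month)
lt$mon + 1       # 3    (months are counted from 0)
lt$year + 1900   # 2022 (years are counted from 1900)
lt$wday          # 4    (day of the week, 0 = Sunday, so 4 is Thursday)
```

The offsets for month and year are a common source of off-by-one errors, so it is worth checking them whenever components are extracted this way.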

The Date class is used to represent dates only, without any time information. Dates are typically stored as the number of days since January 1st, 1970. This class provides functions for performing arithmetic operations and comparisons between dates. It is useful for working with time-based data that is only concerned with the date component, such as daily sales or stock prices. An example of how to generate a Date object in R is as follows:

my_date <- as.Date("2022-03-10")
class(my_date)
[1] "Date"
my_date
[1] "2022-03-10"
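
The arithmetic and comparisons mentioned above can be sketched as follows, using two arbitrary example dates (`start` and `end`).

```r
start <- as.Date("2022-03-10")
end <- as.Date("2022-03-31")

# The underlying number: days since 1970-01-01
as.numeric(start)          # 19061

# Arithmetic and comparisons
end - start                # Time difference of 21 days
start + 7                  # "2022-03-17"
end > start                # TRUE

# Change how the date is displayed without changing the stored value
format(start, "%Y/%m/%d")  # "2022/03/10"
```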

To generate a vector of dates in R with daily intervals, we can use the seq() function to create a sequence of dates, specifying the start and end dates and the time interval. Here’s an example:

# Generate a vector of dates from January 1, 2022 to December 31, 2022
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")

# View the first 10 dates in the vector
head(dates, 10)
 [1] "2022-01-01" "2022-01-02" "2022-01-03" "2022-01-04" "2022-01-05"
 [6] "2022-01-06" "2022-01-07" "2022-01-08" "2022-01-09" "2022-01-10"
class(dates)
[1] "Date"

Understanding the characteristics of these date and time classes in R is essential for effective data analysis and manipulation in fields where time-based data is a critical component.

Date and time data in R can be manipulated using various built-in functions and packages such as lubridate and chron. Additionally, date and time data can be visualised using different types of graphs such as time series plots, heatmaps, and Hovmöller diagrams. The date and time data classes in R are essential for working with temporal data and conducting time-related analyses in various biological and environmental datasets.

Missing values, NA

Missing values can be encountered in vectors of all data classes. To demonstrate some data that contains missing values, I will generate a data sequence containing 5% missing values. We can use the rnorm() function to generate a sequence of random normal numbers and then randomly assign 5% of the values as missing using the sample() function. The indices of the missing values are stored in missing_indices, and we use them to assign NA to the corresponding elements of the data sequence. Here’s some code to achieve this:

# Set the length of the sequence
n <- 100

# Generate a sequence of random normal numbers with
# mean 0 and standard deviation 1
data <- rnorm(n, mean = 0, sd = 1)

# Randomly assign 5% of the values as missing
missing_indices <- sample(1:n, size = round(0.05*n))
data[missing_indices] <- NA
length(data)
[1] 100
data
  [1] -0.81940192 -0.00740456 -0.36999559  0.70074979          NA -1.24852871
  [7] -0.99792356  0.48755000 -0.78069577 -1.27643878  1.03977193  1.09405968
 [13]  0.74844413 -2.22925395 -0.10169076  0.22252991 -0.01513552  0.15518752
 [19] -0.63032108  0.10941233  0.75692863  1.54465913 -1.99389488 -0.10323424
 [25] -0.52478450          NA -0.52410587 -1.21112225 -0.66382656  1.98609308
 [31] -0.74351216  0.98347683  0.10919197 -1.90699075  0.83776093  1.48987542
 [37]          NA  1.11696675  0.03982922 -1.11019956  0.99554720  0.85509468
 [43] -0.73931718 -0.45132178  1.42535260 -0.75741995 -1.26200804 -0.70876054
 [49]  0.07646090  1.30062866 -1.21517127 -2.07190314  0.66569083 -1.49051237
 [55]  0.52042125  1.27683410 -0.88654953  1.06120386          NA -2.63300632
 [61] -1.54958598 -1.79825644  0.26311841 -0.78707880  0.43926228  0.93532487
 [67]  0.16997429 -0.31449895 -1.49526094 -1.32722447  0.38798064  0.52625778
 [73] -0.40413670  0.28038868  2.00052101  0.12642065  0.23809910  1.67756946
 [79]  0.32614738  1.21189286 -1.43601480  0.09241923  1.56839501 -1.30208426
 [85]  1.33344490 -0.20869887 -0.86656886  0.88263507  0.23866951 -0.16493668
 [91] -1.69156154  0.52660571  1.27504834 -0.48766272 -0.05187454  1.22361374
 [97]          NA -0.13347367  1.30950275  0.77215222

To remove all NAs from the vector of data we can use na.omit():

data_sans_na <- na.omit(data)
length(data_sans_na)
[1] 95
data_sans_na
 [1] -0.81940192 -0.00740456 -0.36999559  0.70074979 -1.24852871 -0.99792356
 [7]  0.48755000 -0.78069577 -1.27643878  1.03977193  1.09405968  0.74844413
[13] -2.22925395 -0.10169076  0.22252991 -0.01513552  0.15518752 -0.63032108
[19]  0.10941233  0.75692863  1.54465913 -1.99389488 -0.10323424 -0.52478450
[25] -0.52410587 -1.21112225 -0.66382656  1.98609308 -0.74351216  0.98347683
[31]  0.10919197 -1.90699075  0.83776093  1.48987542  1.11696675  0.03982922
[37] -1.11019956  0.99554720  0.85509468 -0.73931718 -0.45132178  1.42535260
[43] -0.75741995 -1.26200804 -0.70876054  0.07646090  1.30062866 -1.21517127
[49] -2.07190314  0.66569083 -1.49051237  0.52042125  1.27683410 -0.88654953
[55]  1.06120386 -2.63300632 -1.54958598 -1.79825644  0.26311841 -0.78707880
[61]  0.43926228  0.93532487  0.16997429 -0.31449895 -1.49526094 -1.32722447
[67]  0.38798064  0.52625778 -0.40413670  0.28038868  2.00052101  0.12642065
[73]  0.23809910  1.67756946  0.32614738  1.21189286 -1.43601480  0.09241923
[79]  1.56839501 -1.30208426  1.33344490 -0.20869887 -0.86656886  0.88263507
[85]  0.23866951 -0.16493668 -1.69156154  0.52660571  1.27504834 -0.48766272
[91] -0.05187454  1.22361374 -0.13347367  1.30950275  0.77215222
attr(,"na.action")
[1]  5 26 37 59 97
attr(,"class")
[1] "omit"
Dealing with NAs in functions

Many functions have specific arguments to deal with NAs in data. See for example the na.rm = TRUE argument given to mean(), median(), and min(), or the na.action argument of lm().
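
A minimal sketch of how missing values behave in summary functions follows. The temperature readings in `temps` are invented for illustration.

```r
# Hypothetical temperatures with two missing readings
temps <- c(14.2, NA, 15.8, 16.1, NA, 13.9)

# Locate and count the missing values
is.na(temps)                 # FALSE TRUE FALSE FALSE TRUE FALSE
sum(is.na(temps))            # 2

# By default the missing values propagate into the result
mean(temps)                  # NA

# na.rm = TRUE drops them before the calculation
mean(temps, na.rm = TRUE)    # 15
max(temps, na.rm = TRUE)     # 16.1
```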

Data structures in R

Data structures can be viewed as the ‘containers’ that hold the different data classes. The kind of data structure can also be revealed by the class() command.

vector, array, and matrix

Vectors In R, a vector is a one-dimensional array-like data structure that can hold a sequence of values of the same atomic mode, such as numeric, character, or logical values, or dates and times. A vector can be created using the c() function, which stands for ‘combine’ or ‘concatenate,’ and is used to combine a sequence of values into a vector. Vectors can also be created by using the seq() function to generate a sequence of numbers, or the rep() function to repeat a value or sequence of values. Here is an example of a numeric vector:

# create a numeric vector
my_vector <- c(1, 2, 3, 4, 5)

# coerce to vector
my_vector <- as.vector(c(1, 2, 3, 4, 5))
class(my_vector) # but it doesn't change the class from numeric
[1] "numeric"
# print the vector
my_vector
[1] 1 2 3 4 5
Coercion to vector

Coercion to vector returns a vector of one of the atomic modes (the basic data types), which is why the class above remains numeric.

One of the advantages of using vectors in R is that many of the built-in functions and operations work on vectors, allowing us to easily manipulate and analyse large amounts of data. Additionally, R provides many functions specifically designed for working with vectors, such as mean(), median(), sum(), min(), max(), and many others.
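
The sketch below illustrates seq(), rep(), and vectorised operations on a small made-up vector of body masses (`mass_g`). It also shows what happens when values of different modes are combined.

```r
# Generate sequences and repetitions
seq(from = 0, to = 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
rep(c("control", "treatment"), each = 3)   # three of each label, in order
rep(1:2, times = 3)                        # 1 2 1 2 1 2

# Vectorised arithmetic applies to every element at once
mass_g <- c(120, 85, 143, 98)
mass_g / 1000                              # 0.120 0.085 0.143 0.098
mean(mass_g)                               # 111.5
mass_g[mass_g > 100]                       # 120 143

# A vector holds one mode only, so mixed input is coerced
c(1, "a", TRUE)                            # "1" "a" "TRUE"
```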

Matrices A matrix (this terminology may differ in other languages) is a special case of an array that has exactly two dimensions (rows and columns). Like an array, it can only hold elements of the same data type, and it is specifically suited to handling data in a tabular format. A matrix can be created using the matrix() function in R.

# create a numeric matrix
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# print the matrix
my_matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
class(my_matrix)
[1] "matrix" "array" 

We can query the size or dimensions of the matrix as follows:

dim(my_matrix)
[1] 2 3
ncol(my_matrix)
[1] 3
nrow(my_matrix)
[1] 2

Coercion of matrices to vectors A matrix can be coerced to a vector:

as.vector(my_matrix)
[1] 1 2 3 4 5 6

Arrays In R (as opposed to in Python or some other languages), an array specifically refers to a multi-dimensional data structure that can hold elements of the same data type. It can have any number of dimensions (1, 2, 3, etc.), and its dimensions can be named. An array can be created using the array() function in R.

# create a 3-dimensional array
my_array <- array(1:27, dim = c(3, 3, 3))

# print the array
my_array
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
class(my_array)
[1] "array"

We can query the size or dimensions of the array in the same way:

dim(my_array)
[1] 3 3 3
ncol(my_array)
[1] 3
nrow(my_array)
[1] 3
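
As noted above, the dimensions of an array can be named. A minimal sketch, assuming we supply the names through the dimnames argument of array(); the object name and the labels are invented for illustration:

```r
# a 2 x 3 x 4 array with a name for every row, column, and matrix
named_array <- array(1:24,
                     dim = c(2, 3, 4),
                     dimnames = list(c("r1", "r2"),
                                     c("c1", "c2", "c3"),
                                     paste0("m", 1:4)))

# inspect the names attached to each dimension
dimnames(named_array)

# the dimensions themselves are unchanged by naming
dim(named_array)
```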

Coercion of arrays to vectors The array can be coerced to a vector:

as.vector(my_array)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27
Task A
  1. Using examples (new data), explain how the as.vector() function works when applied to matrices and arrays. How does it decide in what order to string the elements of the matrices and arrays together?
  2. Use the result produced by as.vector() (your own data) and assemble a new array with a different combination of dimensions from the one initially produced.

The key difference between vectors, arrays, and matrices in R is their dimensions. A vector has one dimension, an array can have any number of dimensions, while a matrix is limited to two dimensions. Additionally, a matrix is often used to store data in a tabular format, while an array is used to store multi-dimensional data in general. A kind of matrix commonly encountered in multivariate statistics is a distance or dissimilarity matrix.
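
To make the idea of a distance matrix concrete, here is a small sketch using dist() from base R's stats package. The sites, variables, and values are hypothetical:

```r
# hypothetical environmental data: three sites, two variables
env <- cbind(temp = c(10, 12, 15),
             depth = c(2, 3, 7))
rownames(env) <- c("site1", "site2", "site3")

# Euclidean distances between all pairs of sites
site_dist <- dist(env, method = "euclidean")
site_dist

# convert the 'dist' object to an ordinary square matrix
dist_matrix <- as.matrix(site_dist)
class(dist_matrix)
dim(dist_matrix)
```

The resulting matrix is square and symmetric, with zeros on the diagonal because each site is at zero distance from itself.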

In R, vectors, arrays, and matrices share a common characteristic: unless we assign names ourselves, they do not have row or column names. Therefore, to refer to any element, row, or column, we must use its corresponding index. How?

Accessing elements, rows, columns, and matrices In R, the square bracket notation is used to access elements, rows, columns, or matrices in arrays. The notation takes the form of [i, j, k, ...], where i, j, k, and so on, represent the indices of the rows, columns, or matrices to be accessed.

Suppose we have the following array:

my_array <- array(data = round(rnorm(n = 60, mean = 13, sd = 2), 1),
                  dim = c(5, 4, 3))
my_array
, , 1

     [,1] [,2] [,3] [,4]
[1,] 14.0 11.8 12.0 14.1
[2,] 12.7 11.5 15.1 11.0
[3,] 13.9 14.5 11.1 16.1
[4,] 12.9 12.6 12.8 12.6
[5,] 13.3 13.8 15.6 12.4

, , 2

     [,1] [,2] [,3] [,4]
[1,] 17.0 13.5 12.4 10.2
[2,] 10.5 13.0 11.9 11.1
[3,] 10.9 13.4 12.2 12.8
[4,] 16.2 10.1 16.2 17.2
[5,] 14.1 14.4  9.6 12.6

, , 3

     [,1] [,2] [,3] [,4]
[1,] 10.9 12.0 12.7 14.5
[2,]  7.7  9.8 16.9 12.3
[3,] 12.4 15.1 13.1 13.8
[4,] 14.0 17.6 11.0 11.9
[5,] 14.3 15.4 13.4 14.4
dim(my_array)
[1] 5 4 3

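This creates a \(5\times4\times3\) array holding 60 random values drawn from a normal distribution with a mean of 13 and a standard deviation of 2, rounded to one decimal place. Because the values are random, they will differ each time the code is run.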

When working with multidimensional arrays, it is possible to omit some of the indices in the square bracket notation. An omitted index means ‘everything along that dimension’, so the result is a subset of the array, which can be thought of as a lower-dimensional array obtained by fixing the indices that are supplied. For example, consider the 3-dimensional array my_array above, for which dim(my_array) returns 5 4 3. With the notation my_array[1,,] we fix the first index at 1 and obtain a 2-dimensional array for which dim(my_array[1,,]) returns 4 3:

dim(my_array[1,,])
[1] 4 3
my_array[1,,]
     [,1] [,2] [,3]
[1,] 14.0 17.0 10.9
[2,] 11.8 13.5 12.0
[3,] 12.0 12.4 12.7
[4,] 14.1 10.2 14.5

Here are some more examples of how to use square brackets notation with arrays in R:

To access a single element in the array, use the notation [i, j, k], where i, j, and k are the indices along each of the three dimensions, which in combination uniquely identify each element. Below we return the element in the second row, third column, and first matrix:

my_array[2, 3, 1]  
[1] 15.1

To access a single row of one matrix in the array, use the notation [i, , k], where i is the index of the row and k is the index of the matrix. This will return the second row and all of the columns of the first matrix:

my_array[2,,1]
[1] 12.7 11.5 15.1 11.0

To access a single column of one matrix in the array, use the notation [ , j, k], where j is the index of the column and k is the index of the matrix. Here we will return all the rows of column two in matrix three:

my_array[ , 2, 3]
[1] 12.0  9.8 15.1 17.6 15.4

To access a single matrix in the array, use the notation [ , , k], where k is the index of the matrix:

my_array[ , , 2]
     [,1] [,2] [,3] [,4]
[1,] 17.0 13.5 12.4 10.2
[2,] 10.5 13.0 11.9 11.1
[3,] 10.9 13.4 12.2 12.8
[4,] 16.2 10.1 16.2 17.2
[5,] 14.1 14.4  9.6 12.6

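To obtain a subset of the array, use the notation [i, j, k] with one or more of i, j, or k omitted. Fixing an index at a single value yields a lower-dimensional array, whereas supplying a range such as 2:3 keeps that dimension: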

my_array[1, , ]
     [,1] [,2] [,3]
[1,] 14.0 17.0 10.9
[2,] 11.8 13.5 12.0
[3,] 12.0 12.4 12.7
[4,] 14.1 10.2 14.5
my_array[ , 2:3, ]
, , 1

     [,1] [,2]
[1,] 11.8 12.0
[2,] 11.5 15.1
[3,] 14.5 11.1
[4,] 12.6 12.8
[5,] 13.8 15.6

, , 2

     [,1] [,2]
[1,] 13.5 12.4
[2,] 13.0 11.9
[3,] 13.4 12.2
[4,] 10.1 16.2
[5,] 14.4  9.6

, , 3

     [,1] [,2]
[1,] 12.0 12.7
[2,]  9.8 16.9
[3,] 15.1 13.1
[4,] 17.6 11.0
[5,] 15.4 13.4

data.frame

A dataframe is perhaps the most commonly used ‘container’ for data in R because it is so convenient and serves many purposes. A dataframe is not a data class; more correctly, it is a form of tabular data (like a table in MS Excel), in which each vector (a variable or column) making up the table has the same length. What makes a dataframe versatile is that its variables can be any combination of the atomic data types. It may even include list columns (we will not cover list columns in this module). Applying the class() function to a dataframe shows that it belongs to class data.frame.

Here’s an example of an R data.frame with Date, numeric, and categorical data classes:

# Create a vector of dates
dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
                   "2022-01-04", "2022-01-05"))

# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 0, sd = 1)

# Create a vector of categorical data
categorical_data <- c("A", "B", "C", "A", "B")

# Combine the vectors into a data.frame
my_dataframe <- data.frame(dates = dates,
                           numeric_data = numeric_data,
                           categorical_data = categorical_data)

# Print the dataframe
my_dataframe
       dates numeric_data categorical_data
1 2022-01-01    0.0112347                A
2 2022-01-02    0.8438787                B
3 2022-01-03    0.2512231                C
4 2022-01-04    0.9081754                A
5 2022-01-05    0.3407367                B
class(my_dataframe)
[1] "data.frame"
str(my_dataframe)
'data.frame':   5 obs. of  3 variables:
 $ dates           : Date, format: "2022-01-01" "2022-01-02" ...
 $ numeric_data    : num  0.0112 0.8439 0.2512 0.9082 0.3407
 $ categorical_data: chr  "A" "B" "C" "A" ...
summary(my_dataframe)
     dates             numeric_data     categorical_data  
 Min.   :2022-01-01   Min.   :0.01123   Length:5          
 1st Qu.:2022-01-02   1st Qu.:0.25122   Class :character  
 Median :2022-01-03   Median :0.34074   Mode  :character  
 Mean   :2022-01-03   Mean   :0.47105                     
 3rd Qu.:2022-01-04   3rd Qu.:0.84388                     
 Max.   :2022-01-05   Max.   :0.90818                     
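
The point that each column is a vector of the same length, but with its own class, can be checked directly. Below is a minimal sketch using a small, made-up dataframe; the name example_df and its contents are illustrative only:

```r
# a small dataframe with three columns of different classes
example_df <- data.frame(site = c("A", "B", "C"),
                         count = c(12L, 7L, 30L),
                         present = c(TRUE, FALSE, TRUE),
                         stringsAsFactors = FALSE)

# the class of every column
column_classes <- sapply(example_df, class)
column_classes

# every column has the same length, i.e. the number of rows
nrow(example_df)
ncol(example_df)

# extract a single column as a vector
example_df$count
```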

Dataframes may also have row names:

rownames(my_dataframe) <- paste("row", 1:5)
my_dataframe
           dates numeric_data categorical_data
row 1 2022-01-01    0.0112347                A
row 2 2022-01-02    0.8438787                B
row 3 2022-01-03    0.2512231                C
row 4 2022-01-04    0.9081754                A
row 5 2022-01-05    0.3407367                B

Typically we will create a dataframe by reading in data from a .csv file, but it is useful to be able to construct one from scratch.
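
As a sketch of that more typical workflow, the example below writes a tiny made-up dataframe to a temporary .csv file and reads it back with read.csv(). The file location, object names, and values are assumptions made purely for illustration:

```r
# write a small, made-up dataframe to a temporary .csv file
csv_file <- tempfile(fileext = ".csv")
to_export <- data.frame(date = as.Date(c("2022-01-01", "2022-01-02")),
                        value = c(1.5, 2.5))
write.csv(to_export, file = csv_file, row.names = FALSE)

# read the file back in as a dataframe
imported <- read.csv(csv_file)
str(imported)

# dates are not read in as class Date, so we convert them explicitly
imported$date <- as.Date(imported$date)
class(imported$date)
```

It is worth checking the class of each variable after importing, because a .csv file stores only text and R has to infer the classes.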

tibble

In R, a dataframe and a tibble are both data structures used to store tabular data. Although tibbles are also dataframes, they differ subtly in several ways.

  • Tibbles are a relatively new addition to the R ecosystem and form part of the tidyverse suite of packages. They are designed to be more user-friendly than traditional dataframes and have several additional features, such as more informative error messages, stricter data input and output rules, and better handling of NA.

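  • A tibble never automatically converts strings to factors or changes column names, which helps avoid unexpected behaviour when working with the data. By contrast, data.frame() alters non-syntactic column names by default and, before R 4.0.0, also converted strings to factors by default.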

  • A tibble does not have row names.

  • A tibble has a slightly different and more compact printing method than a dataframe, which makes it easier to read and work with.

  • Finally, a tibble has better performance than dataframes for many tasks, especially when working with large datasets.
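
Two of the differences listed above, the treatment of column names and of row names, can be seen in a short sketch. It assumes the tibble package (part of the tidyverse) is installed; the object names are illustrative only:

```r
library(tibble)

# data.frame() changes a column name that contains a space
df_names <- data.frame(`my variable` = 1:3)
names(df_names)

# tibble() keeps the column name exactly as given
tb_names <- tibble(`my variable` = 1:3)
names(tb_names)

# row names are dropped on conversion unless we ask for them as a column
with_names <- data.frame(x = 1:2, row.names = c("first", "second"))
dropped <- as_tibble(with_names)
kept <- as_tibble(with_names, rownames = "id")
names(dropped)
names(kept)
```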

While a dataframe is a core data structure in R, a tibble provides additional functionality, and tibbles are becoming increasingly popular among R users, particularly those working with tidyverse packages. Applying the class() function to a tibble reveals that it belongs to the classes tbl_df, tbl, and data.frame.

We can convert our dataframe my_dataframe to a tibble, and present the output with the print() function that applies nicely to tibbles:

library(tidyverse) # we need to load the tidyverse package
my_tibble <- as_tibble(my_dataframe)
class(my_tibble)
[1] "tbl_df"     "tbl"        "data.frame"
print(my_tibble)
# A tibble: 5 × 3
  dates      numeric_data categorical_data
  <date>            <dbl> <chr>           
1 2022-01-01       0.0112 A               
2 2022-01-02       0.844  B               
3 2022-01-03       0.251  C               
4 2022-01-04       0.908  A               
5 2022-01-05       0.341  B               

This very simple tibble looks very similar to a dataframe, but as we start using more complex sets of data we will learn to appreciate the small conveniences that tibbles offer.

list

This is also not actually a data class, but rather another way of representing a collection of objects of different types, all the way from numerical vectors to dataframes. Lists are useful for storing complex data structures and can also be accessed using indexing.

As an example, we create another dataframe:

dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
                   "2022-01-04", "2022-01-05"))

# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 1, sd = 1)

# Create a vector of categorical data
categorical_data <- c("C", "D", "D", "F", "A")

# Combine the vectors into a data.frame
my_other_dataframe <- data.frame(dates = dates,
                                  numeric_data = numeric_data,
                                  categorical_data = categorical_data)

my_list <- list(A = my_dataframe,
                B = my_other_dataframe)
my_list
$A
           dates numeric_data categorical_data
row 1 2022-01-01    0.0112347                A
row 2 2022-01-02    0.8438787                B
row 3 2022-01-03    0.2512231                C
row 4 2022-01-04    0.9081754                A
row 5 2022-01-05    0.3407367                B

$B
       dates numeric_data categorical_data
1 2022-01-01    1.6954532                C
2 2022-01-02    1.7709695                D
3 2022-01-03    2.3458497                D
4 2022-01-04    0.4964464                F
5 2022-01-05    0.2085725                A
class(my_list)
[1] "list"
str(my_list)
List of 2
 $ A:'data.frame':  5 obs. of  3 variables:
  ..$ dates           : Date[1:5], format: "2022-01-01" "2022-01-02" ...
  ..$ numeric_data    : num [1:5] 0.0112 0.8439 0.2512 0.9082 0.3407
  ..$ categorical_data: chr [1:5] "A" "B" "C" "A" ...
 $ B:'data.frame':  5 obs. of  3 variables:
  ..$ dates           : Date[1:5], format: "2022-01-01" "2022-01-02" ...
  ..$ numeric_data    : num [1:5] 1.695 1.771 2.346 0.496 0.209
  ..$ categorical_data: chr [1:5] "C" "D" "D" "F" ...

We can access one of the dataframes in the list by position or by name as follows:

my_list[[2]]
       dates numeric_data categorical_data
1 2022-01-01    1.6954532                C
2 2022-01-02    1.7709695                D
3 2022-01-03    2.3458497                D
4 2022-01-04    0.4964464                F
5 2022-01-05    0.2085725                A
my_list[["A"]]
           dates numeric_data categorical_data
row 1 2022-01-01    0.0112347                A
row 2 2022-01-02    0.8438787                B
row 3 2022-01-03    0.2512231                C
row 4 2022-01-04    0.9081754                A
row 5 2022-01-05    0.3407367                B

To access a variable within one of the elements of the list we can do something like:

my_list[["B"]]$numeric_data
[1] 1.6954532 1.7709695 2.3458497 0.4964464 0.2085725
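
Lists are not limited to dataframes. The sketch below, built from made-up values, holds three objects of different types and shows the difference between single and double square brackets when indexing:

```r
# a list holding a numeric vector, a character string, and a logical value
example_list <- list(numbers = c(1.5, 2.5, 3.5),
                     label = "site A",
                     flag = TRUE)

# single brackets return a (shorter) list
class(example_list["numbers"])

# double brackets return the element itself
class(example_list[["numbers"]])

# the $ operator is shorthand for access by name
example_list$label

# the names and the number of elements in the list
names(example_list)
length(example_list)
```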
Task A
  1. Find five datasets that you like the look and content of. They may be some of the datasets built into R (and the various packages you downloaded), or they may be ones you found somewhere else. For each:
  • describe the data types (statistical view) of the variables contained within,
  • using the functions shown in the Chapter, describe their R data classes of each variable, and
  • using the functions shown in the Chapter, describe their data structures.

Conclusion

These are the bread and butter data classes and structures in R. Other data classes and structures also exist, but these may be particular to certain packages. We will encounter some of them in BCB743: Quantitative Ecology.

Citation

BibTeX citation:
@online{j._smit2021,
  author = {J. Smit, Albertus},
  title = {1. {Data} {Classes} and {Structures} in {R}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/01-data-in-R.html},
  langid = {en}
}
For attribution, please cite this work as:
J. Smit A (2021) 1. Data Classes and Structures in R. http://tangledbank.netlify.app/BCB744/basic_stats/01-data-in-R.html.