# Create a vector of the number of pets owned by each household
num_rats <- c(0, 1, 2, 2, 3, 1, 4, 0, 2, 1, 2, 2, 0, 3, 2, 1, 1, 4, 2, 0)
num_rats
[1] 0 1 2 2 3 1 4 0 2 1 2 2 0 3 2 1 1 4 2 0
[1] "numeric"
Getting familiar with data classes and structures
January 1, 2021
“That which can be destroyed by the truth should be.”
— P.C. Hodgell
In biostatistics we will encounter different data types, and comprehending data classes, data structures, and their statistical interpretations is important for several reasons:
Correctly identifying the data class To manipulate, analyse, or visualise data effectively, we must first identify the type of data we are working with. Since different data classes possess distinct properties, we must determine the class of our data to leverage the appropriate functions and operations.
Efficient data processing As biologists using R, we may come across data structures that are not fully compatible with the available functions. For example, when dealing with extensive datasets, employing data structures intended for vectorised operations like arrays or matrices can significantly boost data processing speed. In contrast, other data representations may not perform well when managing large datasets. By becoming familiar with the different data classes and their attributes in R, we can make informed choices about which data structures and functions to use to accomplish their data processing and analysis tasks effectively.
Data manipulation and analysis Different data classes in R have specific methods for manipulating and analysing data. For example, if we’re working with character strings, we can use string manipulation functions like gsub()
or strsplit()
which do not work on numerical data. If we’re working with dates and times, we need to use date and time functions like as.Date()
or lubridate::ymd()
. By understanding the different data classes in R, we can choose the appropriate functions for manipulating and analysing our data.
Data visualisation Data can be visualised in numerous ways, such as using histograms for numeric data, bar graphs for categorical data, or scatter plots for representing the relationship between two numeric variables. However, the way we provide data to R is critical. By familiarising ourselves with the diverse data classes in R, we can select the most suitable visualisation methods for our data.
Let us delve into how biologists classify data and the distinct terminology they employ to describe it. It’s essential to comprehend these subtleties before we investigate how R recognises data classes. We’ll scrutinise the data types that are ubiquitous in the fields of biology and ecology, and subsequently shift our attention to data classes and structures in R.
We will most frequently encounter data arranged in columns of a data file—typically in MS Excel files or CSV files. A column is a variable, and one variable is comprised of one data type. So, when we refer to a variable, we expect that all the data within would be homogeneous—at least in as far as the data’s type.
The type of data that biologists and statisticians work with can influence the statistical techniques and methods they use to analyse and interpret the data. Let us discuss some of the different types of biological and ecological data we are likely to encounter.
Numeric data in the context of biostatistics refers to quantitative data that can be expressed in numerical form, typically obtained from field and laboratory measurements, or from field sampling campaigns. Examples of numeric data in biostatistics include the height and mass of a animals, concentrations of nutrients, laboratory test results such as respiration rates, or the number of limpets in a quadrat. Numeric data can be further categorised as discrete or continuous.
Discrete data are whole (integer) numbers that represent counts of items or events. Integer data usually answer the question, “how many?” For example, in the biological and Earth sciences, discrete data is commonly encountered in the form of counts or integers that represent the presence or absence of certain characteristics or events. For example, the number of individuals of a particular species in a population, the number of chromosomes in a cell, or the number of earthquakes occurring in a region within a given time frame. Other examples of discrete data in these sciences include the number of mutations in a gene, the number of cells in a tissue sample, or the number of species present in an ecosystem. These types of data are often analysed using statistical techniques such as frequency distributions, contingency tables, and chi-square tests.
Continuous data, on the other hand, are measured on a continuous scale. These usually represent measured quantities such as something’s heat content (temperature, measured in degrees Celsius) or distance (measured in metres or similar), etc. They can be rational numbers including integers and fractions, but typically they have an infinite number of ‘steps’ that depends on rounding (they can even be rounded to whole integers) or considerations such as measurement precision and accuracy. Often, continuous data have upper and lower bounds that depend on the characteristics of the phenomenon being studied or the measurement being taken.
We often encounter date data when dealing with time-related data. For example, in ecological research, data collection may involve recording the date of a particular observation or sampling event, such as the date when a bird was sighted, or when water samples were taken from a stream. The purpose of using date (or time) data in biology and ecology is to enable us to understand and analyse temporal patterns and relationships in their response variables. This can include exploring seasonal trends, understanding the impact of environmental changes over time, or tracking the growth and development of organisms.
By analysing date data, we can gain insights into long-term trends and patterns that may not be apparent when looking at the data in aggregate. They can also use this information to make predictions about future trends, develop more effective management strategies, and identify potential areas for further research.
Qualitative data in the context of statistics used by biologists refers to non-numerical data that are collected from observations, experimental treatment groups, or other sources. They tend to be textual and it are often used to describe characteristics or properties of living organisms, ecosystems, or other biological phenomena. Examples of qualitative data in biology may include the colour of flowers, the type of habitat where an animal is found, the behaviour of animals, or the presence or absence of certain traits or characteristics in a population.
Qualitative data can be further classified into nominal or ordinal data.
Add some text.
Nominal data are used to describe qualitative variables that do not have any inherent order or ranking. Examples of nominal data in biology may include the type of plant or animal species, or the presence or absence of certain genetic traits. Another term for nominal data is categorical data. Because there are well-defined categories, the number of members belonging to each of the category can be counted. For example, there are three red flowers, 66 purple flowers, and 13 yellow flowers.
Ordinal data refer to a type of data that can be used to describe qualitative categorical variables that have a natural order or ranking. It is used when we need to arrange things in a particular order, such as from worst to best or from least to most. However, the differences between the values cannot be measured or quantified exactly, making them somewhat subjective. Examples of ordinal data include the different stages of development of an organism or the performance of a species to different fertilisers. Ordinal data can be entered as descriptive character strings, and internally, they are translated into an ordered vector of integers in R. For example, we can use a scale of 1
for terrible, 2
for ‘so-so’, 3
for average, 4
for good, and 5
for brilliant.
Life can be boiled down to a series of binary decisions: should I have pizza for dinner, yes or no? Should I go to bed early, TRUE or FALSE? Should I start that new series on Netflix, accept or reject? Am I present or absent? You get the gist… This kind of binary decision-making is known as ‘logical’, and in R they can only take on the values of TRUE
or FALSE
(remember to mind your case!). In the computing world, logical data are often represented by 1
for TRUE
and 0
for FALSE
. So basically, your life’s choices can be summarised as a string of 1
s and 0
s. Who knew it was that simple?
When it comes down to it, everything in life is either black
or white
, right
or wrong
, good
or bad
. It’s like a cosmic game of “Would You Rather?” — and we’re all just playing along.
As the name implies, these are not numbers. Rather, they are human words that have found their way into R for one reason or another. In biology we most commonly encounter character values when we have a list of things, such as sites or species. These values will often be used as categorical or ordinal data.
It’s unfortunate to admit that one of the most reliable aspects of any biological dataset is the presence of missing data (the presence of something that’s missing?!). It is a stark reminder of the fragility of life. How can we say that something contains missing data? It seems counter intuitive, as if the data were never there in the first place. However, as we remember the principles of tidy data, we see that every observation must be documented in a row, and each column in that row must contain a value. This organisation allows us to create a matrix of data from multiple observations. Since the data are presented in a two-dimensional format, any missing values from an observation will leave a gaping hole in the matrix. We call these ‘missing values.’ It’s a somber reality that even the most meticulous collection of data can be marred by the loss of information.
“And if you gaze long enough into an abyss, the abyss will gaze back into you.”
— Friedrich Nietzsche
As we draw to a close on the topic of data types, we cling desperately to the threads of our sanity, hoping against hope that they remain tightly stitched. But let it be known, to those who dare enter further into the realm of data, that beneath the surface lie countless rocks, and around every corner lurk a legion of complex data types, waiting to ensnare the unwary. These shadows of information are as enigmatic as they are perilous, for they challenge the very essence of our understanding. It is not until the final chapter of our journey, when we confront the elusive art of modeling, that we will face these data demons head-on. But fear not, for we shall arm ourselves with the knowledge and techniques acquired on this treacherous path, and with each step forward, we shall move closer to mastering the darkness that awaits us.
R is a powerful programming language and software environment for statistical computing and graphics. It offers a wide variety of data classes and types to represent different types of information.
The atomic modes are logical
, integer
, numeric
(also sometimes called double
), complex
, character
and raw
. The Date
class is a specialised form of the numeric
class. Each atomic mode has its own properties and functions that can be used to manipulate objects of that mode. These atomic modes can be used to make an atomic data structure such as a vector
, array
, or matrix
. This knowledge is also important when working with R tabular mixed data structures, such as a data.frame
, tibble
, or list
. Please refer to Hadley Wickham’s overview of vectors in Advanced R, 2nd edition for more insight presented in an informative yet concise way.
The data class can be determined with the class()
or str()
commands. Here’s a brief overview of some of the main data types in R:
numeric
In R, the numeric
data class represents either integers or floating point (decimal) values. Numerical data are quantitative in nature as they represent things that can be objectively counted, measured, or calculated—the measured variables.
Numeric data are one of the most common types of data used in statistical and mathematical analysis. In R, numeric data are represented by the class numeric
, which includes both integers and floating-point numbers. Numeric data can be used in a variety of operations and calculations, including arithmetic operations, statistical analyses, and visualisations. One important feature of the numeric
data class in R is that it supports vectorisation, which allows for efficient and concise operations on large sets of numeric data. Additionally, R provides a wide range of built-in functions for working with numeric data, including functions for calculating basic statistical measures such as mean, median, and standard deviation.
In R integer (discrete) data are called int
or <int>
while continuous data are denoted num
or <dbl>
.
Example of integer data Suppose you have a dataset of the number of rats in different storm water drains in a neighbourhood. The number of rats is a discrete variable because it can only take on integer values (you can’t own a fraction of a rat).
Here’s how you could create a vector of this data in R:
# Create a vector of the number of pets owned by each household
num_rats <- c(0, 1, 2, 2, 3, 1, 4, 0, 2, 1, 2, 2, 0, 3, 2, 1, 1, 4, 2, 0)
num_rats
[1] 0 1 2 2 3 1 4 0 2 1 2 2 0 3 2 1 1 4 2 0
[1] "numeric"
In this example, the data are represented as a vector called num_rats
of class numeric
(as revealed by class(num_rats)
). Each element of the vector represents the number of rats in one storm water drain. For example, the first element of the vector (num_rats[1]
) is 0
, which means that the first drain in the dataset is free of rats. The fourth element of the vector (num_rats[4]
) is 2
, indicating that the fourth drain in the dataset is occupied by 2
rats.
One can also explicitly create a vector of integer using the as.integer()
function:
[1] 0 1 2 2 3 1 4 0 2 1 2 2 0 3 2 1 1 4 2 0
[1] "integer"
Above we coerced the class numeric
data to class integer
. But we can take floating point numeric and convert them to integers too with the as.integer()
function. As we see, the effect is that the whole part of the number is retained and the rest discarded:
[1] 3.141593 6.283185 9.424778 12.566371 15.707963
[1] "numeric"
[1] 3 6 9 12 15
Effectively, what happened above is more-or-less equivalent to what the floor()
function would return:
Be careful when coercing floating point numbers to integers. If rounding is what you expect, this is not what you will get. For rounding, use round()
instead:
Example of continuous data Here are some randomly generated temperature data assigned to an object called temp_data
:
# Generate a vector of 50 normally distributed temperature values
temp_data <- round(rnorm(n = 50, mean = 15, sd = 3), 2)
temp_data
[1] 21.65 12.38 14.41 15.82 9.05 14.39 12.20 14.49 13.58 15.29 13.47 13.06
[13] 14.61 17.92 17.41 15.02 14.30 15.19 17.41 17.61 12.00 11.94 11.25 14.36
[25] 15.25 13.02 10.15 14.57 20.78 20.16 10.90 12.43 12.45 14.30 16.27 11.89
[37] 14.07 14.00 11.65 13.67 14.46 12.73 16.02 17.86 11.76 21.72 18.77 16.29
[49] 11.42 12.06
[1] "numeric"
character
In R, the character
data class represents textual data such as words, sentences, and paragraphs. Character data can be created using either single or double quotes, and it can include letters, numbers, and other special characters. In addition, character data can be concatenated using the paste()
function or other string manipulation functions.
One important feature of the character data class in R is its versatility in working with textual data. For instance, it can be used to store and manipulate text data, including text-based datasets, text-based files, and text-based visualisations. Additionally, R provides a wide range of built-in functions for working with character data, including functions for manipulating strings, searching for patterns, and formatting output. Overall, the character data class in R is a fundamental data type that is critical for working with textual data in a variety of contexts. You will most frequently use character values are often used to represent labels, names, or descriptions.
factor
In R, the factor
data class is used to represent discrete categorical variables. Factors are often used in statistical analyses to represent class or group belonging. Factor values are categorical data, such as levels or categories of a variable. Factor variables are most commonly also character data, but they can be numeric too if coded correctly as factors. Factor values can be ordered (ordinal) or unordered (categorical or nominal).
Categorical variables take on a limited number of distinct values, often corresponding to different groups or levels. For example, a categorical variable might represent different colours, size classes, or species. Factors in R are represented as integers with corresponding character levels, where each level corresponds to a distinct category. The levels of a factor can be defined explicitly using the factor()
function or automatically using the cut()
function. One important feature of the factor data class in R is that it allows for efficient and effective data manipulation and analysis, particularly when working with large datasets. For instance, factors can be used in statistical analyses such as regression models or ANOVA, and they can also be used to create visualisations such as bar or pie graphs. The factor
data class in R is a fundamental data type that is critical for representing and working with categorical variables in data analysis and visualisation.
The factor
data class of data in an R data.frame
structure (or in a tibble
) is indicated by Factor
or <fctr>
. Ordered factors are denoted by columns named Ord.factor
or <ord>
.
Nominal data One example of nominal factor data that ecologists might encounter is the type of vegetation in a particular area, such as ‘grassland’, ‘forest’, or ‘wetland’. Here’s an example of how to generate a vector of nominal data in R using the sample()
function:
# Generate a vector of vegetation types
vegetation <- sample(c("grassland", "forest", "wetland"), size = 50, replace = TRUE)
# View the vegetation data
vegetation
[1] "grassland" "wetland" "forest" "forest" "wetland" "forest"
[7] "forest" "forest" "wetland" "grassland" "wetland" "wetland"
[13] "forest" "forest" "grassland" "wetland" "forest" "grassland"
[19] "grassland" "wetland" "forest" "grassland" "grassland" "grassland"
[25] "wetland" "grassland" "grassland" "forest" "forest" "wetland"
[31] "grassland" "grassland" "grassland" "forest" "wetland" "wetland"
[37] "grassland" "forest" "grassland" "wetland" "forest" "wetland"
[43] "forest" "forest" "wetland" "wetland" "forest" "wetland"
[49] "grassland" "grassland"
[1] "character"
sample()
function
Note that the sample()
function is not made specifically for nominal data; it can be used on any kind of data class.
sample()
function. Why might we (mostly biologists and ecologists) find it useful?Ordinal data Here’s an example vector of ordinal data in R that could be encountered by ecologists:
# Vector of ordinal data representing the successional stage of a forest
succession <- c("Early Pioneer", "Late Pioneer",
"Young Forest", "Mature Forest",
"Old Growth")
succession
[1] "Early Pioneer" "Late Pioneer" "Young Forest" "Mature Forest"
[5] "Old Growth"
[1] "character"
# Convert to ordered factor
succession <- factor(succession, ordered = TRUE,
levels = c("Early Pioneer", "Late Pioneer",
"Young Forest", "Mature Forest",
"Old Growth"))
succession
[1] Early Pioneer Late Pioneer Young Forest Mature Forest Old Growth
5 Levels: Early Pioneer < Late Pioneer < Young Forest < ... < Old Growth
[1] "ordered" "factor"
In this example, the successional stage of a forest is represented by an ordinal scale with five levels ranging from ‘Early Pioneer’ to ‘Old Growth’. The factor()
function is used to convert the vector to an ordered factor, with the ordered
argument set to TRUE
and the levels
argument set to the same order as the original vector. This ensures that the levels are properly represented as an ordered factor.
logical
In R, the logical
data class represents binary or Boolean data. Logical data are used to represent variables that can take on only two possible values, TRUE
or FALSE
. In addition to TRUE
and FALSE
, logical data can also take on the values of NA
or NULL
, which represent missing or undefined values.
Logical data can be created using logical operators such as ==
, !=
, >
, <
, >=
, and <=
. Logical data are commonly used in R for data filtering and selection, conditional statements, and logical operations. For example, logical data can be used to filter a dataset to include only observations that meet certain criteria or to perform logical operations such as AND (&
) and OR (|
). The logical data class in R is a fundamental data type that is critical for representing and working with binary or Boolean variables in data analysis and programming.
Example logical (binary) data Here’s an example of generating a vector of binary or logical data in R, which represents the presence or absence of a particular species in different ecological sites:
# Generate a vector of 1s and 0s to represent the presence
# or absence of a species in different ecological sites
species_presence <- sample(c(0,1), 10, replace = TRUE)
species_presence
[1] 0 0 0 1 0 0 1 1 1 0
We can also make a formal logical class data:
In this example, we again use the sample()
function to randomly generate a vector of 10 values, each either 0 or 1, to represent the presence or absence of a species in 10 different ecological sites. However, it is often not necessary to coerce to class logical
, as we see in the presence-absence datasets we will encounter in BCB743: Quantitative Ecology.
date
In R, the POSIXct
, POSIXlt
, and Date
classes are commonly used to represent date and time data. These classes each have unique characteristics that make them useful for different purposes.
The POSIXct
class is a date/time class that represents dates and times as a numerical value, typically measured in seconds since January 1st, 1970. This class provides a high level of precision, with values accurate to the second. It is useful for performing calculations and data manipulation involving time, such as finding the difference between two dates or adding a certain number of seconds to a given time. An example of how to generate a POSIXct
object in R is as follows:
[1] "POSIXct" "POSIXt"
[1] "2022-03-10 12:34:56 SAST"
The POSIXlt
class, on the other hand, typically represents dates and times in a more human-readable format. It stores date and time information as a list of named elements, including year
, month
, day
, hour
, minute
, and second.
This format is useful for displaying data in a more understandable way and for extracting specific components of a date or time. An example of how to generate a POSIXlt object in R is as follows:
[1] "POSIXlt" "POSIXt"
[1] "2022-03-10 12:34:56 SAST"
The Date
class is used to represent dates only, without any time information. Dates are typically stored as the number of days since January 1st, 1970. This class provides functions for performing arithmetic operations and comparisons between dates. It is useful for working with time-based data that is only concerned with the date component, such as daily sales or stock prices. An example of how to generate a Date object in R is as follows:
To generate a vector of dates in R with daily intervals, we can use the seq()
function to create a sequence of dates, specifying the start and end dates and the time interval. Here’s an example:
# Generate a vector of dates from January 1, 2022 to December 31, 2022
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
# View the first 10 dates in the vector
head(dates, 10)
[1] "2022-01-01" "2022-01-02" "2022-01-03" "2022-01-04" "2022-01-05"
[6] "2022-01-06" "2022-01-07" "2022-01-08" "2022-01-09" "2022-01-10"
[1] "Date"
Understanding the characteristics of these date and time classes in R is essential for effective data analysis and manipulation in fields where time-based data is a critical component.
Date and time data in R can be manipulated using various built-in functions and packages such as lubridate and chron. Additionally, date and time data can be visualised using different types of graphs such as time series plots, heatmaps, and Hovmöller diagrams. The date and time data classes in R are essential for working with temporal data and conducting time-related analyses in various biological and environmental datasets.
NA
Missing values can be encountered in vectors of all data classes. To demonstrate some data that contains missing values, I will generate a data sequence containing 5% missing values. We can use the rnorm()
function to generate a sequence of random normal numbers and then randomly assign 5% of the values as missing using the sample()
function. The indices of the missing values are stored in missing_indices, and we use them to assign NA
to the corresponding elements of the data sequence. Here’s some code to achieve this:
# Set the length of the sequence
n <- 100
# Generate a sequence of random normal numbers with
# mean 0 and standard deviation 1
data <- rnorm(n, mean = 0, sd = 1)
# Randomly assign 5% of the values as missing
missing_indices <- sample(1:n, size = round(0.05*n))
data[missing_indices] <- NA
length(data)
[1] 100
[1] -0.81940192 -0.00740456 -0.36999559 0.70074979 NA -1.24852871
[7] -0.99792356 0.48755000 -0.78069577 -1.27643878 1.03977193 1.09405968
[13] 0.74844413 -2.22925395 -0.10169076 0.22252991 -0.01513552 0.15518752
[19] -0.63032108 0.10941233 0.75692863 1.54465913 -1.99389488 -0.10323424
[25] -0.52478450 NA -0.52410587 -1.21112225 -0.66382656 1.98609308
[31] -0.74351216 0.98347683 0.10919197 -1.90699075 0.83776093 1.48987542
[37] NA 1.11696675 0.03982922 -1.11019956 0.99554720 0.85509468
[43] -0.73931718 -0.45132178 1.42535260 -0.75741995 -1.26200804 -0.70876054
[49] 0.07646090 1.30062866 -1.21517127 -2.07190314 0.66569083 -1.49051237
[55] 0.52042125 1.27683410 -0.88654953 1.06120386 NA -2.63300632
[61] -1.54958598 -1.79825644 0.26311841 -0.78707880 0.43926228 0.93532487
[67] 0.16997429 -0.31449895 -1.49526094 -1.32722447 0.38798064 0.52625778
[73] -0.40413670 0.28038868 2.00052101 0.12642065 0.23809910 1.67756946
[79] 0.32614738 1.21189286 -1.43601480 0.09241923 1.56839501 -1.30208426
[85] 1.33344490 -0.20869887 -0.86656886 0.88263507 0.23866951 -0.16493668
[91] -1.69156154 0.52660571 1.27504834 -0.48766272 -0.05187454 1.22361374
[97] NA -0.13347367 1.30950275 0.77215222
To remove all NA
s from the vector of data we can use na.omit()
:
[1] 95
[1] -0.81940192 -0.00740456 -0.36999559 0.70074979 -1.24852871 -0.99792356
[7] 0.48755000 -0.78069577 -1.27643878 1.03977193 1.09405968 0.74844413
[13] -2.22925395 -0.10169076 0.22252991 -0.01513552 0.15518752 -0.63032108
[19] 0.10941233 0.75692863 1.54465913 -1.99389488 -0.10323424 -0.52478450
[25] -0.52410587 -1.21112225 -0.66382656 1.98609308 -0.74351216 0.98347683
[31] 0.10919197 -1.90699075 0.83776093 1.48987542 1.11696675 0.03982922
[37] -1.11019956 0.99554720 0.85509468 -0.73931718 -0.45132178 1.42535260
[43] -0.75741995 -1.26200804 -0.70876054 0.07646090 1.30062866 -1.21517127
[49] -2.07190314 0.66569083 -1.49051237 0.52042125 1.27683410 -0.88654953
[55] 1.06120386 -2.63300632 -1.54958598 -1.79825644 0.26311841 -0.78707880
[61] 0.43926228 0.93532487 0.16997429 -0.31449895 -1.49526094 -1.32722447
[67] 0.38798064 0.52625778 -0.40413670 0.28038868 2.00052101 0.12642065
[73] 0.23809910 1.67756946 0.32614738 1.21189286 -1.43601480 0.09241923
[79] 1.56839501 -1.30208426 1.33344490 -0.20869887 -0.86656886 0.88263507
[85] 0.23866951 -0.16493668 -1.69156154 0.52660571 1.27504834 -0.48766272
[91] -0.05187454 1.22361374 -0.13347367 1.30950275 0.77215222
attr(,"na.action")
[1] 5 26 37 59 97
attr(,"class")
[1] "omit"
Data structures can be viewed as the ‘containers’ that hold the different data classes. The kind of data structure can also be revealed by the class()
command.
vector
, array
, and matrix
Vectors In R, a vector is a one-dimensional array-like data structure that can hold a sequence of values of the same atomic mode, such as numeric
, character
, logical
values, or Date
and times. A vector can be created using the c()
function, which stands for ‘combine’ or ‘concatenate,’ and is used to combine a sequence of values into a vector. Vectors can also be created by using the seq()
function to generate a sequence of numbers, or the rep()
function to repeat a value or sequence of values. Here is an example of a numeric vector:
# create a numeric vector
my_vector <- c(1, 2, 3, 4, 5)
# coerce to vector
my_vector <- as.vector(c(1, 2, 3, 4, 5))
class(my_vector) # but it doesn't change the class from numeric
[1] "numeric"
[1] 1 2 3 4 5
The behaviour is such that the output of coercion to vector is that one the atomic modes (the basic data types) is returned.
One of the advantages of using vectors in R is that many of the built-in functions and operations work on vectors, allowing us to easily manipulate and analyse large amounts of data. Additionally, R provides many functions specifically designed for working with vectors, such as mean()
, median()
, sum()
, min()
, max()
, and many others.
Matrices A matrix (again, this terminology may be different for other languages), on the other hand, is a special case of an array that has two dimensions (rows and columns). It is also a multi-dimensional data structure that can hold elements of the same data type, but it is specifically designed for handling data in a tabular format. A matrix can be created using the matrix()
function in R.
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[1] "matrix" "array"
We can query the size or dimensions of the matrix as follows:
Coercion of matrices to vectors A matrix can be coerced to a vector:
Arrays In R (as opposed to in python or some other languages), an array specifically refers to a multi-dimensional data structure that can hold elements of the same data type. It can have any number of dimensions (1, 2, 3, etc.), and its dimensions can be named. An array can be created using the array()
function in R.
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
, , 3
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
[1] "array"
We can figure something out about the size or dimensions of the array:
Coercion of arrays to vectors The array can be coerced to a vector:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27
as.vector()
function works when applied to matrices and arrays. How does it decide in what order to string the elements of the matrices and arrays together?as.vector()
(your own data) and assemble a new array with a different combination of dimensions from the one initially produced.The key difference between vectors, arrays, and a matrices in R is their dimensions. A vector has one dimension, an array can have any number of dimensions, while a matrix is limited to two dimensions. Additionally, a matrix is often used to store data in a tabular format, while an array is used to store multi-dimensional data in general. A commonly encountered kind of matrix is seen in multivariate statistics is a distance or dissimilarity matrix.
In R, vectors, arrays, and matrices share a common characteristic: they do not have row or column names. Therefore, to refer to any element, row, or column, one must use their corresponding index. How?
Accessing elements, rows, columns, and matrices In R, the square bracket notation is used to access elements, rows, columns, or matrices in arrays. The notation takes the form of [i, j, k, ...]
, where i
, j
, k
, and so on, represent the indices of the rows, columns, or matrices to be accessed.
Suppose we have the following array:
, , 1
[,1] [,2] [,3] [,4]
[1,] 14.0 11.8 12.0 14.1
[2,] 12.7 11.5 15.1 11.0
[3,] 13.9 14.5 11.1 16.1
[4,] 12.9 12.6 12.8 12.6
[5,] 13.3 13.8 15.6 12.4
, , 2
[,1] [,2] [,3] [,4]
[1,] 17.0 13.5 12.4 10.2
[2,] 10.5 13.0 11.9 11.1
[3,] 10.9 13.4 12.2 12.8
[4,] 16.2 10.1 16.2 17.2
[5,] 14.1 14.4 9.6 12.6
, , 3
[,1] [,2] [,3] [,4]
[1,] 10.9 12.0 12.7 14.5
[2,] 7.7 9.8 16.9 12.3
[3,] 12.4 15.1 13.1 13.8
[4,] 14.0 17.6 11.0 11.9
[5,] 14.3 15.4 13.4 14.4
[1] 5 4 3
This creates a \(5\times4\times3\) array with values from 1
to 60
.
When working with multidimensional arrays, it is possible to omit some of the indices in the square bracket notation. This results in a subset of the array, which can be thought of as a lower-dimensional array obtained by fixing the omitted dimensions. For example, consider a 3-dimensional array my_array
above with dimensions dim(my_array) = c(5,4,3)
. If we use the notation my_array[1,,]
, we would obtain a 2-dimensional array with dimensions dim(my_array[1,,]) = c(4,3)
obtained by fixing the first index at 1:
[1] 4 3
[,1] [,2] [,3]
[1,] 14.0 17.0 10.9
[2,] 11.8 13.5 12.0
[3,] 12.0 12.4 12.7
[4,] 14.1 10.2 14.5
Here are some more examples of how to use square brackets notation with arrays in R:
To access a single element in the array, use the notation [i, j, k]
, where i
, j
, and k
are the indices along each of the three dimensions, which in combination, uniquely identifies each element. Below we return the element in the second row, third column, and first matrix:
To access a single row in the array, use the notation [i, , ]
, where i
is the index of the row. This will return the second rows and all of the columns of the first matrix:
To access a single column in the array, use the notation [ , j, ]
, where j
is the index of the column. Here we will return all the elements in the row of column two and matrix three:
To access a single matrix in the array, use the notation [ , , k]
, where k
is the index of the matrix:
[,1] [,2] [,3] [,4]
[1,] 17.0 13.5 12.4 10.2
[2,] 10.5 13.0 11.9 11.1
[3,] 10.9 13.4 12.2 12.8
[4,] 16.2 10.1 16.2 17.2
[5,] 14.1 14.4 9.6 12.6
To obtain a subset of the array, use the notation [i, j, k]
with i
, j
, or k
omitted to obtain a lower-dimensional array:
[,1] [,2] [,3]
[1,] 14.0 17.0 10.9
[2,] 11.8 13.5 12.0
[3,] 12.0 12.4 12.7
[4,] 14.1 10.2 14.5
, , 1
[,1] [,2]
[1,] 11.8 12.0
[2,] 11.5 15.1
[3,] 14.5 11.1
[4,] 12.6 12.8
[5,] 13.8 15.6
, , 2
[,1] [,2]
[1,] 13.5 12.4
[2,] 13.0 11.9
[3,] 13.4 12.2
[4,] 10.1 16.2
[5,] 14.4 9.6
, , 3
[,1] [,2]
[1,] 12.0 12.7
[2,] 9.8 16.9
[3,] 15.1 13.1
[4,] 17.6 11.0
[5,] 15.4 13.4
data.frame
A dataframe is perhaps the most commonly-used ‘container’ for data in R because they are so convenient and serve many purposes. A dataframe is not a data class—more correctly, it is a form of tabular data (like a table in MS Excel), with each vector (a variable or column) comprising the table sharing the same length. What makes a dataframe versatile is that its variables can be any combination of the atomic data types. It may even include list columns (we will not cover list columns in this module). Applying the class()
function to a dataframe shows that it blongs to class data.frame
.
Here’s an example of an R data.frame
with Date
, numeric
, and categorical
data classes:
# Create a vector of dates
dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
"2022-01-04", "2022-01-05"))
# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 0, sd = 1)
# Create a vector of categorical data
categorical_data <- c("A", "B", "C", "A", "B")
# Combine the vectors into a data.frame
my_dataframe <- data.frame(dates = dates,
numeric_data = numeric_data,
categorical_data = categorical_data)
# Print the dataframe
my_dataframe
dates numeric_data categorical_data
1 2022-01-01 0.0112347 A
2 2022-01-02 0.8438787 B
3 2022-01-03 0.2512231 C
4 2022-01-04 0.9081754 A
5 2022-01-05 0.3407367 B
[1] "data.frame"
'data.frame': 5 obs. of 3 variables:
$ dates : Date, format: "2022-01-01" "2022-01-02" ...
$ numeric_data : num 0.0112 0.8439 0.2512 0.9082 0.3407
$ categorical_data: chr "A" "B" "C" "A" ...
dates numeric_data categorical_data
Min. :2022-01-01 Min. :0.01123 Length:5
1st Qu.:2022-01-02 1st Qu.:0.25122 Class :character
Median :2022-01-03 Median :0.34074 Mode :character
Mean :2022-01-03 Mean :0.47105
3rd Qu.:2022-01-04 3rd Qu.:0.84388
Max. :2022-01-05 Max. :0.90818
Dataframes may also have row names:
dates numeric_data categorical_data
row 1 2022-01-01 0.0112347 A
row 2 2022-01-02 0.8438787 B
row 3 2022-01-03 0.2512231 C
row 4 2022-01-04 0.9081754 A
row 5 2022-01-05 0.3407367 B
Typically we will create a dataframe by reading in data from a .csv
file, but it is useful to be able to construct one from scratch.
tibble
In R, a dataframe and a tibble are both data structures used to store tabular data. Although tibbles are also dataframes, but they differ subtly in several ways.
A tibble is a relatively new addition to the R language and forms part of the tidyverse suite of packages. They are designed to be more user-friendly than traditional data frames and have several additional features, such as more informative error messages, stricter data input and output rules, and better handling of NA
.
Unlike a dataframe, a tibble never automatically converts strings to factors or changes column names, which can help avoid unexpected behavior when working with the data.
A tibble does not have row names.
A tibble has a slightly different and more compact printing method than a dataframe, which makes them easier to read and work with.
Finally, a tibble has better performance than dataframes for many tasks, especially when working with large datasets.
While a dataframe is a core data structure in R, a tibble provides additional functionality and are becoming increasingly popular among R users, particularly those working with tidyverse packages. Applying the class()
function to a tibble revelas that it belongs to the classes tbl_df
, tbl
and data.frame
.
We can convert our dataframe my_dataframe
to a tibble, and present the output with the print()
function that applies nicely to tibbles:
library(tidyverse) # we need to load the tidyverse package
my_tibble <- as_tibble(my_dataframe)
class(my_tibble)
[1] "tbl_df" "tbl" "data.frame"
# A tibble: 5 × 3
dates numeric_data categorical_data
<date> <dbl> <chr>
1 2022-01-01 0.0112 A
2 2022-01-02 0.844 B
3 2022-01-03 0.251 C
4 2022-01-04 0.908 A
5 2022-01-05 0.341 B
This very simple tibble looks identical to a dataframe, but as we start using more complex sets of data you’ll learn to appreciate the small convenience that tibbles offer.
list
This is also not actually a data class, but rather another way of representing a collection of objects of different types, all the way from numerical vectors to dataframes. Lists are useful for storing complex data structures and can also be accessed using indexing.
As an example, we create another dataframe:
dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
"2022-01-04", "2022-01-05"))
# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 1, sd = 1)
# Create a vector of categorical data
categorical_data <- c("C", "D", "D", "F", "A")
# Combine the vectors into a data.frame
my_other_dataframe <- data.frame(dates = dates,
numeric_data = numeric_data,
categorical_data = categorical_data)
my_list <- list(A = my_dataframe,
B = my_other_dataframe)
my_list
$A
dates numeric_data categorical_data
row 1 2022-01-01 0.0112347 A
row 2 2022-01-02 0.8438787 B
row 3 2022-01-03 0.2512231 C
row 4 2022-01-04 0.9081754 A
row 5 2022-01-05 0.3407367 B
$B
dates numeric_data categorical_data
1 2022-01-01 1.6954532 C
2 2022-01-02 1.7709695 D
3 2022-01-03 2.3458497 D
4 2022-01-04 0.4964464 F
5 2022-01-05 0.2085725 A
[1] "list"
List of 2
$ A:'data.frame': 5 obs. of 3 variables:
..$ dates : Date[1:5], format: "2022-01-01" "2022-01-02" ...
..$ numeric_data : num [1:5] 0.0112 0.8439 0.2512 0.9082 0.3407
..$ categorical_data: chr [1:5] "A" "B" "C" "A" ...
$ B:'data.frame': 5 obs. of 3 variables:
..$ dates : Date[1:5], format: "2022-01-01" "2022-01-02" ...
..$ numeric_data : num [1:5] 1.695 1.771 2.346 0.496 0.209
..$ categorical_data: chr [1:5] "C" "D" "D" "F" ...
We can access one of the dataframes is the list as follows:
dates numeric_data categorical_data
1 2022-01-01 1.6954532 C
2 2022-01-02 1.7709695 D
3 2022-01-03 2.3458497 D
4 2022-01-04 0.4964464 F
5 2022-01-05 0.2085725 A
dates numeric_data categorical_data
row 1 2022-01-01 0.0112347 A
row 2 2022-01-02 0.8438787 B
row 3 2022-01-03 0.2512231 C
row 4 2022-01-04 0.9081754 A
row 5 2022-01-05 0.3407367 B
To access a variable within one of the elements of the list we can do something like:
These are the bread and butter data classes and structures in R. Other data classes and structures also exist, but these may be particular to certain packages. We will encounter some of them in BCB743: Quantitative Ecology.
@online{j._smit2021,
author = {J. Smit, Albertus},
title = {1. {Data} {Classes} and {Structures} in {R}},
date = {2021-01-01},
url = {http://tangledbank.netlify.app/BCB744/basic_stats/01-data-in-R.html},
langid = {en}
}
---
date: "2021-01-01"
title: "1. Data Classes and Structures in R"
subtitle: "Getting familiar with data classes and structures"
---
> "*That which can be destroyed by the truth should be.*"
>
> --- P.C. Hodgell
::: {.callout-note appearance="simple"}
## In this Chapter
* Data as encountered by the statistician
* Data classes in R
* Data structures in R
:::
::: {.callout-important appearance="simple"}
## Tasks to complete in this Chapter
* Task A 1-4
:::
![](/images/769_life_finds_a_way.png){fig-align="center" width="600"}
# Introduction
In biostatistics we will encounter different data types, and comprehending **data classes**, **data structures**, and their **statistical interpretations** is important for several reasons:
- **Correctly identifying the data class** To manipulate, analyse, or visualise data effectively, we must first identify the type of data we are working with. Since different data classes possess distinct properties, we must determine the class of our data to leverage the appropriate functions and operations.
- **Efficient data processing** As biologists using R, we may come across data structures that are not fully compatible with the available functions. For example, when dealing with extensive datasets, employing data structures intended for vectorised operations like arrays or matrices can significantly boost data processing speed. In contrast, other data representations may not perform well when managing large datasets. By becoming familiar with the different data classes and their attributes in R, we can make informed choices about which data structures and functions to use to accomplish their data processing and analysis tasks effectively.
- **Data manipulation and analysis** Different data classes in R have specific methods for manipulating and analysing data. For example, if we're working with character strings, we can use string manipulation functions like `gsub()` or `strsplit()` which do not work on numerical data. If we're working with dates and times, we need to use date and time functions like `as.Date()` or `lubridate::ymd()`. By understanding the different data classes in R, we can choose the appropriate functions for manipulating and analysing our data.
- **Data visualisation** Data can be visualised in numerous ways, such as using histograms for numeric data, bar graphs for categorical data, or scatter plots for representing the relationship between two numeric variables. However, the way we provide data to R is critical. By familiarising ourselves with the diverse data classes in R, we can select the most suitable visualisation methods for our data.
Let us delve into how biologists classify data and the distinct terminology they employ to describe it. It's essential to comprehend these subtleties before we investigate how R recognises data classes. We'll scrutinise the data types that are ubiquitous in the fields of biology and ecology, and subsequently shift our attention to data classes and structures in R.
# Types of variables common in biology and ecology
We will most frequently encounter data arranged in columns of a data file---typically in MS Excel files or CSV files. A column is a variable, and one variable is comprised of one data type. So, when we refer to a variable, we expect that all the data within would be homogeneous---at least in as far as the data's type.
The type of data that biologists and statisticians work with can influence the statistical techniques and methods they use to analyse and interpret the data. Let us discuss some of the different types of biological and ecological data we are likely to encounter.
## Numeric variables
Numeric data in the context of biostatistics refers to **quantitative data** that can be expressed in numerical form, typically obtained from field and laboratory measurements, or from field sampling campaigns. Examples of numeric data in biostatistics include the height and mass of a animals, concentrations of nutrients, laboratory test results such as respiration rates, or the number of limpets in a quadrat. Numeric data can be further categorised as **discrete** or **continuous**.
### Discrete variables
Discrete data are whole (integer) numbers that represent counts of items or events. Integer data usually answer the question, "how many?" For example, in the biological and Earth sciences, discrete data is commonly encountered in the form of counts or integers that represent the presence or absence of certain characteristics or events. For example, the number of individuals of a particular species in a population, the number of chromosomes in a cell, or the number of earthquakes occurring in a region within a given time frame. Other examples of discrete data in these sciences include the number of mutations in a gene, the number of cells in a tissue sample, or the number of species present in an ecosystem. These types of data are often analysed using statistical techniques such as frequency distributions, contingency tables, and chi-square tests.
### Continuous variables
Continuous data, on the other hand, are measured on a continuous scale. These usually represent measured quantities such as something's heat content (temperature, measured in degrees Celsius) or distance (measured in metres or similar), etc. They can be rational numbers including integers and fractions, but typically they have an infinite number of 'steps' that depends on rounding (they can even be rounded to whole integers) or considerations such as measurement precision and accuracy. Often, continuous data have upper and lower bounds that depend on the characteristics of the phenomenon being studied or the measurement being taken.
## Dates
We often encounter date data when dealing with time-related data. For example, in ecological research, data collection may involve recording the date of a particular observation or sampling event, such as the date when a bird was sighted, or when water samples were taken from a stream. The purpose of using date (or time) data in biology and ecology is to enable us to understand and analyse temporal patterns and relationships in their response variables. This can include exploring seasonal trends, understanding the impact of environmental changes over time, or tracking the growth and development of organisms.
By analysing date data, we can gain insights into long-term trends and patterns that may not be apparent when looking at the data in aggregate. They can also use this information to make predictions about future trends, develop more effective management strategies, and identify potential areas for further research.
## Qualitative variables
Qualitative data in the context of statistics used by biologists refers to *non-numerical data* that are collected from observations, experimental treatment groups, or other sources. They tend to be textual and it are often used to describe characteristics or properties of living organisms, ecosystems, or other biological phenomena. Examples of qualitative data in biology may include the colour of flowers, the type of habitat where an animal is found, the behaviour of animals, or the presence or absence of certain traits or characteristics in a population.
Qualitative data can be further classified into **nominal** or **ordinal** data.
### Character data
Add some text.
### Nominal variables
Nominal data are used to describe qualitative variables that do not have any inherent order or ranking. Examples of nominal data in biology may include the type of plant or animal species, or the presence or absence of certain genetic traits. Another term for nominal data is *categorical data*. Because there are well-defined categories, the number of members belonging to each of the category can be counted. For example, there are three red flowers, 66 purple flowers, and 13 yellow flowers.
### Ordinal variables
Ordinal data refer to a type of data that can be used to describe qualitative categorical variables that have a natural order or ranking. It is used when we need to arrange things in a particular order, such as from worst to best or from least to most. However, the differences between the values cannot be measured or quantified exactly, making them somewhat subjective. Examples of ordinal data include the different stages of development of an organism or the performance of a species to different fertilisers. Ordinal data can be entered as descriptive character strings, and internally, they are translated into an ordered vector of integers in R. For example, we can use a scale of `1` for terrible, `2` for 'so-so', `3` for average, `4` for good, and `5` for brilliant.
## Binary variables
Life can be boiled down to a series of binary decisions: should I have pizza for dinner, *yes* or *no*? Should I go to bed early, *TRUE* or *FALSE*? Should I start that new series on Netflix, *accept* or *reject*? Am I *present* or *absent*? You get the gist... This kind of binary decision-making is known as 'logical', and in R they can only take on the values of `TRUE` or `FALSE` (remember to mind your case!). In the computing world, logical data are often represented by `1` for `TRUE` and `0` for `FALSE`. So basically, your life's choices can be summarised as a string of `1`s and `0`s. Who knew it was that simple?
<!-- ::: {.column-margin} -->
![](../../images/bad_choices.jpg){fig-align="center" width=50%}
<!-- ::: -->
When it comes down to it, everything in life is either `black` or `white`, `right` or `wrong`, `good` or `bad`. It's like a cosmic game of "Would You Rather?" --- and we're all just playing along.
## Character variables
As the name implies, these are not numbers. Rather, they are human words that have found their way into R for one reason or another. In biology we most commonly encounter character values when we have a list of things, such as sites or species. These values will often be used as categorical or ordinal data.
## Missing values
It's unfortunate to admit that one of the most reliable aspects of any biological dataset is the presence of missing data (the presence of something that's missing?!). It is a stark reminder of the fragility of life. How can we say that something contains missing data? It seems counter intuitive, as if the data were never there in the first place. However, as we remember the principles of *tidy* data, we see that every observation must be documented in a row, and each column in that row must contain a value. This organisation allows us to create a matrix of data from multiple observations. Since the data are presented in a two-dimensional format, any missing values from an observation will leave a gaping hole in the matrix. We call these 'missing values.' It's a somber reality that even the most meticulous collection of data can be marred by the loss of information.
## Complex numbers
> *"And if you gaze long enough into an abyss, the abyss will gaze back into you."*
>
> --- Friedrich Nietzsche
As we draw to a close on the topic of data types, we cling desperately to the threads of our sanity, hoping against hope that they remain tightly stitched. But let it be known, to those who dare enter further into the realm of data, that beneath the surface lie countless rocks, and around every corner lurk a legion of complex data types, waiting to ensnare the unwary. These shadows of information are as enigmatic as they are perilous, for they challenge the very essence of our understanding. It is not until the final chapter of our journey, when we confront the elusive art of modeling, that we will face these data demons head-on. But fear not, for we shall arm ourselves with the knowledge and techniques acquired on this treacherous path, and with each step forward, we shall move closer to mastering the darkness that awaits us.
# Data classes in R
R is a powerful programming language and software environment for statistical computing and graphics. It offers a wide variety of data classes and types to represent different types of information.
The **atomic modes** are `logical`, `integer`, `numeric` (also sometimes called `double`), `complex`, `character` and `raw`. The `Date` class is a specialised form of the `numeric` class. Each atomic mode has its own properties and functions that can be used to manipulate objects of that mode. These atomic modes can be used to make an atomic data structure such as a `vector`, `array`, or `matrix`. This knowledge is also important when working with R tabular mixed data structures, such as a `data.frame`, `tibble`, or `list`. Please refer to Hadley Wickham's overview of vectors in [Advanced R, 2nd edition](https://adv-r.hadley.nz/index.html) for more insight presented in an informative yet concise way.
The **data class** can be determined with the `class()` or `str()` commands. Here's a brief overview of some of the main data types in R:
## `numeric`
In R, the `numeric` data class represents either **integers** or **floating point** (decimal) values. Numerical data are quantitative in nature as they represent things that can be objectively counted, measured, or calculated---the *measured variables*.
Numeric data are one of the most common types of data used in statistical and mathematical analysis. In R, numeric data are represented by the class `numeric`, which includes both integers and floating-point numbers. Numeric data can be used in a variety of operations and calculations, including arithmetic operations, statistical analyses, and visualisations. One important feature of the `numeric` data class in R is that it supports vectorisation, which allows for efficient and concise operations on large sets of numeric data. Additionally, R provides a wide range of built-in functions for working with numeric data, including functions for calculating basic statistical measures such as mean, median, and standard deviation.
In R integer (discrete) data are called `int` or `<int>` while continuous data are denoted `num` or `<dbl>`.
**Example of integer data** Suppose you have a dataset of the number of rats in different storm water drains in a neighbourhood. The number of rats is a discrete variable because it can only take on integer values (you can't own a fraction of a rat).
Here's how you could create a vector of this data in R:
```{r}
# Create a vector of the number of pets owned by each household
num_rats <- c(0, 1, 2, 2, 3, 1, 4, 0, 2, 1, 2, 2, 0, 3, 2, 1, 1, 4, 2, 0)
num_rats
class(num_rats)
```
In this example, the data are represented as a vector called `num_rats` of class `numeric` (as revealed by `class(num_rats)`). Each element of the vector represents the number of rats in one storm water drain. For example, the first element of the vector (`num_rats[1]`) is `0`, which means that the first drain in the dataset is free of rats. The fourth element of the vector (`num_rats[4]`) is `2`, indicating that the fourth drain in the dataset is occupied by `2` rats.
One can also explicitly create a vector of integer using the `as.integer()` function:
```{r}
num_rats_int <- as.integer(num_rats)
num_rats_int
class(num_rats_int)
```
Above we *coerced* the class `numeric` data to class `integer`. But we can take floating point numeric and convert them to integers too with the `as.integer()` function. As we see, the effect is that the whole part of the number is retained and the rest discarded:
```{r}
pies <- pi * seq(1:5)
pies
class(pies)
as.integer(pies)
```
Effectively, what happened above is more-or-less equivalent to what the `floor()` function would return:
```{r}
floor(pies)
```
Be careful when coercing floating point numbers to integers. If rounding is what you expect, this is not what you will get. For rounding, use `round()` instead:
```{r}
round(pies, 0)
```
**Example of continuous data** Here are some randomly generated temperature data assigned to an object called `temp_data`:
```{r}
# Generate a vector of 50 normally distributed temperature values
temp_data <- round(rnorm(n = 50, mean = 15, sd = 3), 2)
temp_data
class(temp_data)
```
## `character`
In R, the `character` data class represents textual data such as words, sentences, and paragraphs. Character data can be created using either single or double quotes, and it can include letters, numbers, and other special characters. In addition, character data can be concatenated using the `paste()` function or other string manipulation functions.
One important feature of the character data class in R is its versatility in working with textual data. For instance, it can be used to store and manipulate text data, including text-based datasets, text-based files, and text-based visualisations. Additionally, R provides a wide range of built-in functions for working with character data, including functions for manipulating strings, searching for patterns, and formatting output. Overall, the character data class in R is a fundamental data type that is critical for working with textual data in a variety of contexts. You will most frequently use character values are often used to represent labels, names, or descriptions.
## `factor`
In R, the `factor` data class is used to represent discrete categorical variables. Factors are often used in statistical analyses to represent class or group belonging. Factor values are categorical data, such as levels or categories of a variable. Factor variables are most commonly also character data, but they can be numeric too if coded correctly as factors. Factor values can be ordered (ordinal) or unordered (categorical or nominal).
Categorical variables take on a limited number of distinct values, often corresponding to different groups or levels. For example, a categorical variable might represent different colours, size classes, or species. Factors in R are represented as integers with corresponding character levels, where each level corresponds to a distinct category. The levels of a factor can be defined explicitly using the `factor()` function or automatically using the `cut()` function. One important feature of the factor data class in R is that it allows for efficient and effective data manipulation and analysis, particularly when working with large datasets. For instance, factors can be used in statistical analyses such as regression models or ANOVA, and they can also be used to create visualisations such as bar or pie graphs. The `factor` data class in R is a fundamental data type that is critical for representing and working with categorical variables in data analysis and visualisation.
The `factor` data class of data in an R `data.frame` structure (or in a `tibble`) is indicated by `Factor` or `<fctr>`. Ordered factors are denoted by columns named `Ord.factor` or `<ord>`.
**Nominal data** One example of nominal factor data that ecologists might encounter is the type of vegetation in a particular area, such as 'grassland', 'forest', or 'wetland'. Here's an example of how to generate a vector of nominal data in R using the `sample()` function:
```{r}
# Generate a vector of vegetation types
vegetation <- sample(c("grassland", "forest", "wetland"), size = 50, replace = TRUE)
# View the vegetation data
vegetation
class(vegetation)
```
:::{.callout-note appearance="simple"}
## The `sample()` function
Note that the `sample()` function is not made specifically for nominal data; it can be used on any kind of data class.
:::
::: {.callout-important}
## Task A
1. Provide some examples of what one can do with the `sample()` function. Why might we (mostly biologists and ecologists) find it useful?
:::
**Ordinal data** Here's an example vector of ordinal data in R that could be encountered by ecologists:
```{r}
# Vector of ordinal data representing the successional stage of a forest
succession <- c("Early Pioneer", "Late Pioneer",
"Young Forest", "Mature Forest",
"Old Growth")
succession
class(succession)
# Convert to ordered factor
succession <- factor(succession, ordered = TRUE,
levels = c("Early Pioneer", "Late Pioneer",
"Young Forest", "Mature Forest",
"Old Growth"))
succession
class(succession)
```
In this example, the successional stage of a forest is represented by an ordinal scale with five levels ranging from 'Early Pioneer' to 'Old Growth'. The `factor()` function is used to convert the vector to an ordered factor, with the `ordered` argument set to `TRUE` and the `levels` argument set to the same order as the original vector. This ensures that the levels are properly represented as an ordered factor.
## `logical`
In R, the `logical` data class represents binary or Boolean data. Logical data are used to represent variables that can take on only two possible values, `TRUE` or `FALSE`. In addition to `TRUE` and `FALSE`, logical data can also take on the values of `NA` or `NULL`, which represent missing or undefined values.
Logical data can be created using logical operators such as `==`, `!=`, `>`, `<`, `>=`, and `<=`. Logical data are commonly used in R for data filtering and selection, conditional statements, and logical operations. For example, logical data can be used to filter a dataset to include only observations that meet certain criteria or to perform logical operations such as AND (`&`) and OR (`|`). The logical data class in R is a fundamental data type that is critical for representing and working with binary or Boolean variables in data analysis and programming.
**Example logical (binary) data** Here's an example of generating a vector of binary or logical data in R, which represents the presence or absence of a particular species in different ecological sites:
```{r}
# Generate a vector of 1s and 0s to represent the presence
# or absence of a species in different ecological sites
species_presence <- sample(c(0,1), 10, replace = TRUE)
species_presence
```
We can also make a formal logical class data:
```{r}
species_presence_logi <- as.logical(species_presence)
class(species_presence_logi)
```
In this example, we again use the `sample()` function to randomly generate a vector of 10 values, each either 0 or 1, to represent the presence or absence of a species in 10 different ecological sites. However, it is often not necessary to coerce to class `logical`, as we see in the presence-absence datasets we will encounter in [BCB743: Quantitative Ecology](../../BCB743/BCB743_index.html).
## `date`
In R, the `POSIXct`, `POSIXlt`, and `Date` classes are commonly used to represent date and time data. These classes each have unique characteristics that make them useful for different purposes.
The `POSIXct` class is a date/time class that represents dates and times as a numerical value, *typically* measured in seconds since January 1st, 1970. This class provides a high level of precision, with values accurate to the second. It is useful for performing calculations and data manipulation involving time, such as finding the difference between two dates or adding a certain number of seconds to a given time. An example of how to generate a `POSIXct` object in R is as follows:
```{r}
my_time <- as.POSIXct("2022-03-10 12:34:56")
class(my_time)
my_time
```
The `POSIXlt` class, on the other hand, *typically* represents dates and times in a more human-readable format. It stores date and time information as a list of named elements, including `year`, `month`, `day`, `hour`, `minute`, and `second.` This format is useful for displaying data in a more understandable way and for extracting specific components of a date or time. An example of how to generate a POSIXlt object in R is as follows:
```{r}
my_time <- as.POSIXlt("2022-03-10 12:34:56")
class(my_time)
my_time
```
The `Date` class is used to represent dates only, without any time information. Dates are *typically* stored as the number of days since January 1st, 1970. This class provides functions for performing arithmetic operations and comparisons between dates. It is useful for working with time-based data that is only concerned with the date component, such as daily sales or stock prices. An example of how to generate a Date object in R is as follows:
```{r}
my_date <- as.Date("2022-03-10")
class(my_date)
my_date
```
To generate a vector of dates in R with daily intervals, we can use the `seq()` function to create a sequence of dates, specifying the start and end dates and the time interval. Here's an example:
```{r}
# Generate a vector of dates from January 1, 2022 to December 31, 2022
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
# View the first 10 dates in the vector
head(dates, 10)
class(dates)
```
Understanding the characteristics of these date and time classes in R is essential for effective data analysis and manipulation in fields where time-based data is a critical component.
Date and time data in R can be manipulated using various built-in functions and packages such as **lubridate** and **chron**. Additionally, date and time data can be visualised using different types of graphs such as time series plots, heatmaps, and Hovmöller diagrams. The date and time data classes in R are essential for working with temporal data and conducting time-related analyses in various biological and environmental datasets.
## Missing values, `NA`
Missing values can be encountered in vectors of all data classes. To demonstrate some data that contains missing values, I will generate a data sequence containing 5% missing values. We can use the `rnorm()` function to generate a sequence of random normal numbers and then randomly assign 5% of the values as missing using the `sample()` function. The indices of the missing values are stored in missing_indices, and we use them to assign `NA` to the corresponding elements of the data sequence. Here's some code to achieve this:
```{r}
# Set the length of the sequence
n <- 100
# Generate a sequence of random normal numbers with
# mean 0 and standard deviation 1
data <- rnorm(n, mean = 0, sd = 1)
# Randomly assign 5% of the values as missing
missing_indices <- sample(1:n, size = round(0.05*n))
data[missing_indices] <- NA
length(data)
data
```
To remove all `NA`s from the vector of data we can use `na.omit()`:
```{r}
data_sans_na <- na.omit(data)
length(data_sans_na)
data_sans_na
```
:::{.callout-note appearance="simple"}
## Dealing with `NA`s in functions
Many functions have specific arguments to deal with `NA`s in data. See for example the `na.rm = TRUE` argument given to `mean()`, `median()`, `min()`, `lm()`, etc.
:::
# Data structures in R
Data structures can be viewed as the 'containers' that hold the different data classes. The kind of data structure can also be revealed by the `class()` command.
## `vector`, `array`, and `matrix`
**Vectors** In R, a vector is a one-dimensional array-like data structure that can hold a sequence of values of the same atomic mode, such as `numeric`, `character`, `logical` values, or `Date` and times. A vector can be created using the `c()` function, which stands for 'combine' or 'concatenate,' and is used to combine a sequence of values into a vector. Vectors can also be created by using the `seq()` function to generate a sequence of numbers, or the `rep()` function to repeat a value or sequence of values. Here is an example of a numeric vector:
```{r}
# create a numeric vector
my_vector <- c(1, 2, 3, 4, 5)
# coerce to vector
my_vector <- as.vector(c(1, 2, 3, 4, 5))
class(my_vector) # but it doesn't change the class from numeric
# print the vector
my_vector
```
:::{.callout-note appearance="simple"}
## Coercion to vector
The behaviour is such that the output of coercion to vector is that one the atomic modes (the basic data types) is returned.
:::
One of the advantages of using vectors in R is that many of the built-in functions and operations work on vectors, allowing us to easily manipulate and analyse large amounts of data. Additionally, R provides many functions specifically designed for working with vectors, such as `mean()`, `median()`, `sum()`, `min()`, `max()`, and many others.
**Matrices** A matrix (again, this terminology may be different for other languages), on the other hand, is a special case of an array that has two dimensions (rows and columns). It is also a multi-dimensional data structure that can hold elements of the *same data type*, but it is specifically designed for handling data in a tabular format. A matrix can be created using the `matrix()` function in R.
```{r}
# create a numeric matrix
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
# print the matrix
my_matrix
class(my_matrix)
```
We can query the size or dimensions of the matrix as follows:
```{r}
dim(my_matrix)
ncol(my_matrix)
nrow(my_matrix)
```
**Coercion of matrices to vectors** A matrix can be coerced to a vector:
```{r}
as.vector(my_matrix)
```
**Arrays** In R (as opposed to in python or some other languages), an array specifically refers to a multi-dimensional data structure that can hold elements of the *same data type*. It can have any number of dimensions (1, 2, 3, etc.), and its dimensions can be named. An array can be created using the `array()` function in R.
```{r}
# create a 2-dimensional array
my_array <- array(1:27, dim = c(3, 3, 3))
# print the array
my_array
class(my_array)
```
We can figure something out about the size or dimensions of the array:
```{r}
dim(my_array)
ncol(my_array)
nrow(my_array)
```
**Coercion of arrays to vectors** The array can be coerced to a vector:
```{r}
as.vector(my_array)
```
::: callout-important
## Task A
2. Using examples (new data), explain how the `as.vector()` function works when applied to matrices and arrays. How does it decide in what order to string the elements of the matrices and arrays together?
3. Use the result produced by `as.vector()` (your own data) and assemble a new array with a different combination of dimensions from the one initially produced.
:::
The key difference between vectors, arrays, and a matrices in R is their dimensions. A vector has one dimension, an array can have any number of dimensions, while a matrix is limited to two dimensions. Additionally, a matrix is often used to store data in a tabular format, while an array is used to store multi-dimensional data in general. A commonly encountered kind of matrix is seen in multivariate statistics is a distance or dissimilarity matrix.
*In R, vectors, arrays, and matrices share a common characteristic: they do not have row or column names.* Therefore, to refer to any element, row, or column, one must use their corresponding index. How?
**Accessing elements, rows, columns, and matrices** In R, the square bracket notation is used to access elements, rows, columns, or matrices in arrays. The notation takes the form of `[i, j, k, ...]`, where `i`, `j`, `k`, and so on, represent the indices of the rows, columns, or matrices to be accessed.
Suppose we have the following array:
<!-- # my_array <- array(data = 1:60, dim = c(5, 4, 3)) -->
```{r}
my_array <- array(data = round(rnorm(n = 60, mean = 13, sd = 2), 1),
dim = c(5, 4, 3))
my_array
dim(my_array)
```
This creates a $5\times4\times3$ array with values from `1` to `60`.
When working with multidimensional arrays, it is possible to omit some of the indices in the square bracket notation. This results in a subset of the array, which can be thought of as a lower-dimensional array obtained by fixing the omitted dimensions. For example, consider a 3-dimensional array `my_array` above with dimensions `dim(my_array) = c(5,4,3)`. If we use the notation `my_array[1,,]`, we would obtain a 2-dimensional array with dimensions `dim(my_array[1,,]) = c(4,3)` obtained by fixing the first index at 1:
```{r}
dim(my_array[1,,])
my_array[1,,]
```
Here are some more examples of how to use square brackets notation with arrays in R:
To access a **single element** in the array, use the notation `[i, j, k]`, where `i`, `j`, and `k` are the indices along each of the three dimensions, which in combination, uniquely identifies each element. Below we return the element in the second row, third column, and first matrix:
```{r}
my_array[2, 3, 1]
```
To access a single row in the array, use the notation `[i, , ]`, where `i` is the index of the row. This will return the second rows and all of the columns of the first matrix:
```{r}
my_array[2,,1]
```
To access a single column in the array, use the notation `[ , j, ]`, where `j` is the index of the column. Here we will return all the elements in the row of column two and matrix three:
```{r}
my_array[ , 2, 3]
```
To access a single matrix in the array, use the notation `[ , , k]`, where `k` is the index of the matrix:
```{r}
my_array[ , , 2]
```
To obtain a subset of the array, use the notation `[i, j, k]` with `i`, `j`, or `k` omitted to obtain a lower-dimensional array:
```{r}
my_array[1, , ]
my_array[ , 2:3, ]
```
## `data.frame`
A dataframe is perhaps the most commonly-used 'container' for data in R because they are so convenient and serve many purposes. A dataframe is *not a data class*---more correctly, it is a form of tabular data (like a table in MS Excel), with each vector (a variable or column) comprising the table sharing the same length. What makes a dataframe versatile is that its variables can be any combination of the atomic data types. It may even include *list columns* (we will not cover list columns in this module). Applying the `class()` function to a dataframe shows that it blongs to class `data.frame`.
Here's an example of an R `data.frame` with `Date`, `numeric`, and `categorical` data classes:
```{r}
# Create a vector of dates
dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
"2022-01-04", "2022-01-05"))
# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 0, sd = 1)
# Create a vector of categorical data
categorical_data <- c("A", "B", "C", "A", "B")
# Combine the vectors into a data.frame
my_dataframe <- data.frame(dates = dates,
numeric_data = numeric_data,
categorical_data = categorical_data)
# Print the dataframe
my_dataframe
class(my_dataframe)
str(my_dataframe)
summary(my_dataframe)
```
Dataframes may also have row names:
```{r}
rownames(my_dataframe) <- paste(rep("row", 5), seq = 1:5)
my_dataframe
```
Typically we will create a dataframe by reading in data from a `.csv` file, but it is useful to be able to construct one from scratch.
## `tibble`
In R, a dataframe and a tibble are both data structures used to store tabular data. Although tibbles are also dataframes, but they differ subtly in several ways.
* A tibble is a relatively new addition to the R language and forms part of the **tidyverse** suite of packages. They are designed to be more user-friendly than traditional data frames and have several additional features, such as more informative error messages, stricter data input and output rules, and better handling of `NA`.
* Unlike a dataframe, a tibble never automatically converts strings to factors or changes column names, which can help avoid unexpected behavior when working with the data.
* A tibble does not have row names.
* A tibble has a slightly different and more compact printing method than a dataframe, which makes them easier to read and work with.
* Finally, a tibble has better performance than dataframes for many tasks, especially when working with large datasets.
While a dataframe is a core data structure in R, a tibble provides additional functionality and are becoming increasingly popular among R users, particularly those working with **tidyverse** packages. Applying the `class()` function to a tibble revelas that it belongs to the classes `tbl_df`, `tbl` and `data.frame`.
We can convert our dataframe `my_dataframe` to a tibble, and present the output with the `print()` function that applies nicely to tibbles:
```{r}
#| warning: false
library(tidyverse) # we need to load the tidyverse package
my_tibble <- as_tibble(my_dataframe)
class(my_tibble)
print(my_tibble)
```
This very simple tibble looks identical to a dataframe, but as we start using more complex sets of data you'll learn to appreciate the small convenience that tibbles offer.
## `list`
This is also not actually a data class, but rather another way of representing a collection of objects of different types, all the way from numerical vectors to dataframes. Lists are useful for storing complex data structures and can also be accessed using indexing.
As an example, we create another dataframe:
```{r}
dates <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03",
"2022-01-04", "2022-01-05"))
# Create a vector of numeric data
numeric_data <- rnorm(n = 5, mean = 1, sd = 1)
# Create a vector of categorical data
categorical_data <- c("C", "D", "D", "F", "A")
# Combine the vectors into a data.frame
my_other_dataframe <- data.frame(dates = dates,
numeric_data = numeric_data,
categorical_data = categorical_data)
my_list <- list(A = my_dataframe,
B = my_other_dataframe)
my_list
class(my_list)
str(my_list)
```
We can access one of the dataframes is the list as follows:
```{r}
my_list[[2]]
my_list[["A"]]
```
To access a variable within one of the elements of the list we can do something like:
```{r}
my_list[["B"]]$numeric_data
```
::: callout-important
## Task A
4. Find five datasets that you like the look and content of. They may be some of the datasets built into R (and the various packages you downloaded), or they may be ones you found somewhere else. For each:
- describe the data types (statistical view) of the variables contained within,
- using the functions shown in the Chapter, describe their R data classes of each variable, and
- using the functions shown in the Chapter, describe their data structures.
:::
# Conclusion
These are the bread and butter data classes and structures in R. Other data classes and structures also exist, but these may be particular to certain packages. We will encounter some of them in [BCB743: Quantitative Ecology](../../BCB743/BCB743_index.qmd).