2. Exploring With Summaries and Descriptions

The first steps of exploratory data analysis

Published

January 1, 2021

“I think it is much more interesting to live with uncertainty than to live with answers that might be wrong.”

—Richard Feynman

In this Chapter
  • Data summaries
  • Measures of central tendency
  • Measures of dispersal
  • Descriptive statistics by group
Tasks to complete in this Chapter
  • Task B 1-10

Introduction

Exploratory data analysis (EDA) is an essential initial step in every statistical analysis project aimed at understanding the structure, patterns, and characteristics of the data. During EDA, we visually and quantitatively examine the data by creating plots, summary statistics, and descriptive measures to identify trends, outliers, and potential relationships among variables. This process helps us formulate hypotheses, choose appropriate statistical models, and identify potential issues or anomalies that may require further investigation or data cleaning. EDA serves as a foundation for making informed decisions and guiding the subsequent steps of a statistical analysis project.

In this Chapter we will focus on the basic components of EDA. These initial forays are built around measures of central tendency (the mean, median, and mode) and measures of dispersion and variability (e.g. the standard deviation, standard error, variance, and quantiles) of the data. I advocate a workflow in which EDA and graphical assessments (Chapter 3) precede inferential statistics (Chapter 7).

But first, I want to define a few principles that you will encounter throughout this course on biostatistics.

Variables and parameters

In a statistical framework, a parameter is a fixed, unknown (to be determined) value that characterises a population or distribution, and is often estimated using sample data. On the other hand, a variable is a measurable characteristic that can vary among individuals or objects in a population or sample. Parameters are used to describe the distribution of a population, while variables are the data points that are collected and analysed to draw conclusions about the population. Variables are used to estimate parameters through statistical methods such as regression and hypothesis testing.

Samples and populations

In statistics, samples and populations are two fundamental concepts used to describe data sets and the process of drawing conclusions from them.

  • Population A population is the entire set of individuals, objects, or events of interest in a study. It represents the complete collection of data points that you would ideally like to examine to answer a specific research question. Populations can be finite or infinite, depending on the context. For example, the population could be all the trees in a forest, all the people living in a country, or all the bacteria in a particular environment.

  • Sample A sample is a smaller random subset of the population, selected to represent the larger group. Due to practical constraints, such as time, cost, and accessibility, it is often impossible or impractical to collect data from every member of a population. Instead, we collect data from a carefully chosen sample, with the goal of making inferences about the broader population based on the information gathered from this smaller group. Ideally, a sample should be representative of the population, meaning that it accurately reflects the characteristics and variability present in the entire group. The larger the sample we draw from the population, the more representative it will be, the smaller the variance of our estimates, and the more accurate the estimated mean (Figure 1).

Figure 1: Drawing increasingly larger sample sizes from a population with a true mean of 13 and an SD of 1.
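The idea behind Figure 1 can be explored with a few lines of R. This is a minimal sketch, assuming a population with a true mean of 13 and an SD of 1; the sample sizes are arbitrary:

set.seed(13) # for reproducibility of the random draws
sample_sizes <- c(5, 25, 100, 1000)

# Draw increasingly larger samples and compare their means and
# SDs with the population values (mean = 13, SD = 1)
for (n in sample_sizes) {
  smp <- rnorm(n, mean = 13, sd = 1)
  cat("n =", n, " mean =", round(mean(smp), 3),
      " SD =", round(sd(smp), 3), "\n")
}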
When is something random?

This is a bit of a rant. I often get annoyed by the imprecise use of language and disrespect for scientific terms. ‘Random’ is such a word (as are ‘literally’ and ‘energy’). The scientific concept of randomness and the colloquial usage of the word differ in their precision and context. Understanding these distinctions can help clarify the role of randomness in various settings.

  • Scientific concept of randomness In the scientific context, randomness refers to a specific pattern or sequence of events that completely lacks predictability, order, or regularity. The biostatistician often associates the term ‘random’ with probability theory and statistical models. In experiments, we use random sampling techniques to ensure that samples are representative of the population, and we employ random assignment to minimise biases in experimental groups. In computer science, algorithms generate pseudo-random numbers, which approximate true randomness but are ultimately determined by deterministic processes.

  • Colloquial usage of randomness In everyday language, the term ‘random’ is often used more loosely to describe events or situations that appear unexpected, surprising, or unrelated to a specific context. Colloquially, randomness may imply a lack of explanation or discernible pattern, even though in most cases the event might not be genuinely random in a strict scientific sense. For example, when someone says, “I had a random encounter with an old friend,” they mean the meeting was unexpected or unplanned, rather than being the result of a stochastic process. In these instances, there most certainly is an explanation for why the two friends haphazardly met.

An example of a truly random physical process is radioactive decay. Radioactive decay is a spontaneous process in which an unstable atomic nucleus loses energy by emitting radiation, such as alpha particles, beta particles, or gamma rays. The exact timing of when a specific atom will decay is unpredictable and follows a random pattern. The decay events are independent of each other and are governed by the principles of quantum mechanics, making radioactive decay a truly random physical process.
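Because decay events arrive at random, the waiting times between them are commonly modelled with the exponential distribution. Here is a minimal sketch of such a simulation; the decay rate is an arbitrary assumption, not a value for any real isotope:

set.seed(42)
# Simulate 10 waiting times (arbitrary time units) between decay
# events, assuming a hypothetical decay rate of 0.1 per time unit
decay_times <- rexp(10, rate = 0.1)
round(decay_times, 2)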

Sometimes we will see the term stochastic in the scientific literature. This term is similar to random stricto sensu, but it is used in a different context. The term ‘random’ refers to a lack of predictability or discernible pattern in a sequence of events, often associated with outcomes that are equally likely or follow a specific probability distribution. ‘Stochastic’, on the other hand, describes processes that incorporate an element of randomness or uncertainty, with outcomes that depend on both deterministic factors and random variables. Both terms are used in the context of probability theory and statistical models to describe events or processes that exhibit some level of unpredictability. Stochastic processes are widespread in nature and may explain a large fraction of ecological variation.

An example of a stochastic ecological process is the population dynamics of a species subject to unpredictable environmental fluctuations. In this scenario, the growth or decline of a species’ population is influenced by both deterministic factors, such as birth and death rates, and random factors, like unpredictable weather events, disease outbreaks, or variations in food availability. These random factors introduce uncertainty and variability in the population dynamics, making it a stochastic ecological process.
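To make this concrete, here is a minimal sketch of a stochastic population model. The starting population, growth rate, and noise level are all arbitrary assumptions chosen only for illustration:

set.seed(7)
n_years <- 50
N <- numeric(n_years)
N[1] <- 100   # hypothetical starting population size
r <- 0.05     # deterministic per-year growth rate

# Each year, growth is the deterministic rate plus random
# environmental noise
for (t in 2:n_years) {
  noise <- rnorm(1, mean = 0, sd = 0.2)
  N[t] <- N[t - 1] * exp(r + noise)
}

plot(N, type = "l", xlab = "Year", ylab = "Population size")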

Okay, you may ask, “Weather models can predict tomorrow’s weather to some degree, albeit not with complete accuracy, so is there truly such a thing as unpredictable weather events?” Weather events might not be truly random in the strictest sense, as weather patterns are influenced by various physical factors, such as temperature, air pressure, humidity, and interactions between different atmospheric layers. However, due to the immense complexity and chaotic nature of these factors, weather events can be challenging to predict with absolute certainty, especially over longer time scales.

In this context, “unpredictable weather events” are considered random from a practical standpoint because they introduce uncertainty and variability in the system. While they may not be truly random, like quantum-level events, their unpredictability and the difficulty in forecasting their occurrence make them effectively random for many ecological and statistical analyses.

The Palmer penguin dataset

The Palmer penguin dataset in the palmerpenguins package is a modern alternative to the classic Iris dataset, often used for teaching data exploration, visualisation, and statistical modeling. The dataset was introduced by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) program. It contains information collected from penguins living in the Palmer Archipelago, located in the Western Antarctic Peninsula, and focuses on three different penguin species: Adélie, Chinstrap, and Gentoo.

The dataset includes measurements such as bill length (bill_length_mm), bill depth (bill_depth_mm), flipper length (flipper_length_mm), body mass (body_mass_g), and sex (sex), along with the species of the penguins. It provides an opportunity to explore various aspects of data analysis, such as EDA, data visualisation, data preprocessing, hypothesis testing, and machine learning techniques like classification and clustering.

Let’s start with loading the data:

library(tidyverse) # provides as_tibble(), glimpse(), and other functions used below
library(palmerpenguins)

Exploring the properties of the data structure

Several functions help us get a broad view of the data structure (the dataset) itself: its size (number of rows and columns), the data classes contained, the presence of missing values, and so forth. This table lists some of the most important ways to get such an overview:

Purpose Function
The class of the dataset class()
The head of the dataframe head()
The tail of the dataframe tail()
Printing the data print()
Glimpse the data glimpse()
Show number of rows nrow()
Show number of columns ncol()
The column names colnames()
The row names row.names()
The dimensions dim()
The dimension names dimnames()
The data structure str()

First, it is a good idea to understand the class of the data structure. We can do this with the class() function. Applying it to the penguins dataset, we see that the dataset is a tibble, which is a modern take on the old-fashioned dataframe:

class(penguins)
[1] "tbl_df"     "tbl"        "data.frame"

We can convert between tibbles and dataframes using as.data.frame() and as_tibble() (the latter from the tibble package, part of the tidyverse):

penguins_df <- as.data.frame(penguins)
class(penguins_df)
[1] "data.frame"

And vice versa:

penguins_tbl <- as_tibble(penguins_df)
class(penguins_tbl)
[1] "tbl_df"     "tbl"        "data.frame"

Note that the print() methods for tibbles and dataframes differ somewhat. You’ll see that the print() function applied to a dataframe does not result in a particularly nice-looking table. Applied to a tibble, on the other hand, print() results in a neat, concise display of the dataset’s content:

# print the dataframe
print(penguins_df[1:5,]) # I limit to 5 rows
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
# print the tibble
print(penguins_tbl)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The glimpse() function is not dissimilar to print() as it gives more-or-less the same kind of information, but transposed: each variable occupies a row and its values run horizontally across the screen:

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

I suggest converting all your dataframes to tibbles to benefit from the Tidyverse philosophy around which tibbles were designed.

The tops and bottoms of tibbles (or dataframes) can be viewed with the head() and tail() functions. Both take the n = ... argument so that you can specify how many rows to display:

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
tail(penguins, n = 3)
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            49.6          18.2               193        3775
2 Chinstrap Dream            50.8          19                 210        4100
3 Chinstrap Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <fct>, year <int>

You can wrap head() and tail() in print() for a nicer display:

print(head(penguins))
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
print(tail(penguins, n = 3))
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            49.6          18.2               193        3775
2 Chinstrap Dream            50.8          19                 210        4100
3 Chinstrap Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <fct>, year <int>

Three functions tell us about the size of the tibble. We can return the number of rows with nrow(), the number of columns with ncol(), or both the number of rows and columns with dim():

nrow(penguins)
[1] 344
ncol(penguins)
[1] 8
dim(penguins)
[1] 344   8

If we wanted to return the names of the columns we could use colnames() as such:

colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Tibbles do not have row names, but dataframes can have them (although they are not frequently used). If we did have row names and wanted to know what they are, we could use row.names() (note the . in the function name). We can also use colnames() (names() returns an identical output) and row.names() to assign new names to dataframes or tibbles—see the respective help files for more information about this use. There is also the dimnames() function.
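For example, here is a minimal sketch of assigning new column names to a copy of the penguins data; the shortened names are arbitrary and purely illustrative:

penguins_copy <- penguins
# Assign hypothetical, shorter column names to the copy
colnames(penguins_copy) <- c("sp", "isl", "bill_l", "bill_d",
                             "flip_l", "mass", "sex", "yr")
colnames(penguins_copy)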

Task B
  1. Explain the output of dimnames() when applied to the penguins dataset.

The last function to introduce in this section is str(). It can be applied to any R object (i.e. any data structure or model output).

str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Task B
  1. Explain the output of str() when applied to the penguins dataset.

Exploring the properties of the data

Thus far we have looked at the properties of the data structure. Now we turn to the properties of the data contained within that structure. This might simply be to see the data class of the different variables, or to provide some basic descriptive statistics. Before we go too deeply into the data summaries, we need to first review basic descriptive statistics.

Descriptive statistics refer to the process of summarising and organising data collected from our various kinds of studies in a meaningful way. The process of summary implies that we take something detailed (like a large number of individual measurements) and reduce its complexity to a view—a ‘statistic’—that tells us something informative about the nature of the population. Descriptive statistics allow us to gain an initial understanding of the data’s main features, including the sample size, central tendency, variability, and distribution. We will look at data distributions in Chapter 4.

The sample size is simply the number of data points (\(n\)) per sample. The measures of central tendency include the mean, median, and mode, while variability is most often described using the range, interquartile range, variance, and standard deviation. Additionally, descriptive statistics can involve visual representations, such as histograms, box plots, and scatter plots, to further illustrate the data’s characteristics, and we will cover these in Chapter 3. Descriptive statistics serve as a foundation for more advanced statistical analyses, as they help us make sense of our data and identify patterns, trends, or potential anomalies.

Measures of central tendency

Statistic Function Package
Mean mean() base
Median median() base
Mode Do it!
Skewness skewness() e1071
Kurtosis kurtosis() e1071

Central tendency is a fundamental concept in statistics, referring to the central or typical value of a dataset that best represents its overall distribution. The measures of central tendency are also sometimes called ‘location’ statistics. As a key component of descriptive statistics, central tendency is essential for summarising and simplifying complex data. It provides a single representative value that captures the data’s general behaviour and which might tell us something about the bigger population from which the random samples were drawn.

A thorough assessment of the central tendency in EDA serves several purposes:

  • Summary of data Measures of central tendency, such as the mean, median, and mode, provide a single value that represents the center or typical value of a dataset. They help summarise the data and allow us to gain an early insight into the dataset’s general properties and behaviour.

  • Comparing groups or distributions Central tendency measures allow us to compare different datasets or groups within a dataset. They can help identify differences or similarities in the data. This may be useful for hypothesis testing and inferential statistics.

  • Data transformation decisions Understanding the central tendency of our data can inform decisions on whether to apply transformations to the data to better meet the assumptions of certain statistical tests or improve the interpretability of the results.

  • Identifying potential issues Examining the central tendency can help reveal issues with the data, such as outliers or data entry errors, that could influence the results of inferential statistics. Outliers, for example, can greatly impact the mean, making the median a more robust measure of central tendency in such cases.

Understanding the central tendency informs the choice of inferential statistics in the following ways:

  • Assumptions of statistical tests Many inferential statistical tests have assumptions about the distribution of the data, such as normality, linearity, or homoscedasticity. Analysing the central tendency helps assess whether these assumptions are met and informs the choice of an appropriate test.

  • Choice of statistical models The central tendency can influence the choice of statistical models or the selection of dependent and independent variables in regression analyses—certain models or relationships may be more appropriate depending on the data’s distribution and central tendencies.

  • Choice of estimators Central tendency measures can influence our choice of estimators for further inferential statistics, depending on the data’s distribution and presence of outliers (e.g., mean vs. median).

Before I discuss each central tendency statistic, I’ll generate some random data to represent normal and skewed distributions. I’ll use these data in my discussions, below.

# Generate random data from a normal distribution
n <- 5000 # Number of data points
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# Generate random data from a slightly
# right-skewed beta distribution
alpha <- 2
beta <- 5
right_skewed_data <- rbeta(n, alpha, beta)

# Generate random data from a slightly
# left-skewed beta distribution
alpha <- 5
beta <- 2
left_skewed_data <- rbeta(n, alpha, beta)

# Generate random data with a bimodal distribution
mean1 <- 0
mean2 <- 10
sd1 <- 3
sd2 <- 4

# Generate data from two normal distributions
data1 <- rnorm(n, mean1, sd1)
data2 <- rnorm(n, mean2, sd2)

# Combine the data from both distributions to
# create a bimodal distribution
bimodal_data <- c(data1, data2)
# Set up a 2 x 2 (four-panel) plot layout
par(mfrow = c(2, 2))

# Plot the histogram of the normal distribution
hist(normal_data, main = "Normal Distribution",
     xlab = "Value", ylab = "Frequency",
     col = "lightblue", border = "black")

# Plot the histogram of the right-skewed distribution
hist(right_skewed_data, main = "Right-Skewed Distribution",
     xlab = "Value", ylab = "Frequency",
     col = "lightgreen", border = "black")

# Plot the histogram of the left-skewed distribution
hist(left_skewed_data, main = "Left-Skewed Distribution",
     xlab = "Value", ylab = "Frequency",
     col = "lightcoral", border = "black")

# Plot the histogram of the bimodal distribution
hist(bimodal_data, main = "Bimodal Distribution",
     xlab = "Value", ylab = "Frequency",
     col = "khaki2", border = "black")

# Reset the plot layout to default
par(mfrow = c(1, 1))
Figure 2: A series of histograms for the previously generated normal, right-skewed, left-skewed, and bimodal distributions.

The sample mean

The mean is the arithmetic average of the data, and it is calculated by summing all the data and dividing it by the sample size, n (Equation 1).

The mean, \(\bar{x}\), is calculated thus: \[\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n} \tag{1}\] where \(x_{1}, x_{2}, \ldots, x_{n}\) are the observations and \(n\) is the number of observations in the sample.

We can calculate the mean of a sample using the … wait for it … mean() function:

round(mean(normal_data), 3)
[1] 0.021
Task B
  1. How would you manually calculate the mean value for the normal_data? Do it now!

The mean is quite sensitive to the presence of outliers or extreme values in the data, and it is advised that its use be reserved for normally distributed data from which the extremes/outliers have been removed. When extreme values are indeed part of our data and not simply ‘noise,’ then we have to resort to a different measure of central tendency: the median.

The median

The median indicates the center value in our dataset. The simplest way to explain what it is is to describe how it is determined. It can be calculated by ‘hand’ (if you have a small enough amount of data) by arranging all the numbers in sequence from low to high, and then finding the middle value. If there are five numbers, say 5, 2, 6, 13, 1, then you would arrange them from low to high, i.e. 1, 2, 5, 6, 13. The middle number is 5. This is the median. But there is no middle value if we have an even number of observations. What now? Take this example sequence of six integers (they may also be floating point numbers), which has already been ordered for your pleasure: 1, 2, 5, 6, 9, 13. Find the middle two numbers (i.e. 5, 6) and take their mean. It is 5.5. That is the median.
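Here is a minimal sketch of this by-hand procedure in R, using the two small example sequences from above:

odd_seq <- c(5, 2, 6, 13, 1)
sort(odd_seq)             # 1 2 5 6 13; the middle value is 5

even_seq <- c(1, 2, 5, 6, 9, 13)
mean(sort(even_seq)[3:4]) # mean of the two middle values: 5.5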

The median is therefore the value that separates the lower half of the sample data from the upper half. In normally distributed continuous data the median is equal to the mean. Comparable concepts to the median are the 1st and 3rd quartiles, which mark the points below which 25% and 75% of the data fall, respectively (see the section on measures of variance and dispersion later in this Chapter). The advantage of the median over the mean is that it is unaffected by extreme values or outliers. The median is also used to provide a robust description of non-normally distributed data (see Chapter 4 for a discussion of the normal and other data distributions).

What is the median of the normal_data dataset? We use the median() function for this:

round(median(normal_data), 3)
[1] 0.033

It is easier to see what the median is by looking at a much smaller dataset. Let’s take 11 random data points:

small_normal_data <- round(rnorm(11, 13, 3), 1)
sort(small_normal_data)
 [1]  8.7  9.5 10.8 13.1 13.4 13.8 14.0 14.6 14.9 15.9 16.1
median(small_normal_data)
[1] 13.8

The mean and median together provide a comprehensive understanding of the data’s central tendency and underlying distribution.

What is the relationship between the median and quantiles?

The relation between the median and quantiles lies in their roles as measures that describe the relative position of data points within a dataset. Quantiles are values that partition a dataset into equal intervals, with each interval containing the same proportion of the data. The most common types of quantiles are quartiles, quintiles, deciles, and percentiles.

The median is a special case of a quantile, specifically the 50th percentile or the second quartile (Q2). It divides the dataset into two equal halves, with 50% of the data points falling below the median and 50% of the data points falling above the median. In this sense, the median is a central quantile that represents the middle value of the dataset.

Both the median and quantiles help describe the distribution and spread of a dataset, with the median providing information about the center and other quantiles (such as quartiles) offering insights into the overall shape, skewness, and dispersion of the data.
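We can confirm this relationship directly. A quick sketch using the normal_data generated earlier:

# The 50th percentile (second quartile) is identical to the median
quantile(normal_data, p = 0.5)
median(normal_data)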

The mode

The mode is a measure that represents the value or values that occur most frequently in a dataset. Unlike the mean and median, the mode can be used with both numerical and categorical data, making it quite versatile. For a dataset with a single value that appears most often, the distribution is considered unimodal. However, datasets can also be bimodal (having two modes) or multimodal (having multiple modes) when there are multiple values that occur with the same highest frequency.

While the mode may not always be a good representative of the dataset’s center, especially in the presence of extreme values or skewed distributions, it can still offer valuable information about the data’s characteristics when used alongside the other measures of central tendency.

There is no built-in function to calculate the mode of a numeric vector, but you can make one if you need it. There are some examples on the internet that you will be able to adapt to your needs, but my cursory evaluation of them does not suggest they are particularly useful. The easiest way to see the data’s mode(s) is to examine a histogram of your data. All the data we have explored above are examples of unimodal distributions, but a bimodal distribution can also be seen in Figure 2.
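As an illustration, here is one minimal way of writing such a function yourself; stat_mode() is just an illustrative name, and this is a sketch rather than a definitive implementation:

stat_mode <- function(x) {
  x <- x[!is.na(x)]      # drop missing values
  counts <- table(x)     # tabulate the frequency of each value
  # return the value(s) with the highest frequency
  names(counts)[counts == max(counts)]
}

stat_mode(c(1, 2, 2, 3, 3, 3)) # "3"
stat_mode(penguins$species)    # the most frequent species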

Skewness

Skewness is a measure of the symmetry (or asymmetry) of the data distribution, and it is best understood by considering the location of the median relative to the mean. A distribution with a skewness of zero is considered symmetric, with both tails extending equally on either side of the mean. Here, the mean will be the same as the median. A negative skewness indicates that the mean of the data is less than their median—the data distribution is left-skewed; that is, there is a longer or heavier tail to the left of the mean. A positive skewness results from data that have a mean that is larger than their median; these data have a right-skewed distribution, so there will be a longer or heavier tail to the right of the mean. Base R does not have a built-in skewness function, but we can use the one included with the e1071 package:

library(e1071)
# Positive skewness
skewness(right_skewed_data)
[1] 0.5296222
# Is the mean larger than the median?
mean(right_skewed_data) > median(right_skewed_data)
[1] TRUE
# Negative skewness
skewness(left_skewed_data)
[1] -0.6348656
# Is the mean less than the median?
mean(left_skewed_data) < median(left_skewed_data)
[1] TRUE

Kurtosis

Kurtosis describes the tail shape of the data’s distribution. Kurtosis is effectively a measure of the ‘tailedness’ or the concentration of data in the tails of a distribution, relative to a normal distribution. A normal distribution has zero kurtosis (or close to) and thus the standard tail shape (mesokurtic). Negative kurtosis indicates data with a thin-tailed (platykurtic) distribution. Positive kurtosis indicates a fat-tailed distribution (leptokurtic).

As with skewness, we use the e1071 package for a kurtosis function. All of the output shown below suggests a tendency towards thin-tailedness, but it is subtle.

kurtosis(normal_data)
[1] -0.06825458
kurtosis(right_skewed_data)
[1] -0.1542938
kurtosis(left_skewed_data)
[1] -0.0001723586

I have seldom used skewness or kurtosis in any EDA, but it is worth being aware of them. The overall purpose of examining data using this range of central tendency statistics is to get an idea of whether our data are normally distributed—a normal distribution is a key requirement for parametric inferential statistics. See Chapter 4 for a discussion of data distributions. These central tendency statistics will serve you well as a first glance, but formal tests for normality do exist and I encourage their use before embarking on the rest of the journey. We will explore these formal tests in Chapter 7.

Task B
  1. Find the faithful dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use ggplot()), and conclude with a brief statement about the data distribution.

Measures of variance or dispersion around the center

Statistic Function
Variance var()
Standard deviation sd()
Minimum min()
Maximum max()
Range range()
Quantile quantile()
Inter Quartile Range IQR()

A good understanding of variability, or variation around the central point, is crucial in EDA for several reasons:

  • Signal vs. noise Variability helps distinguish between the signal (true underlying pattern) and noise (random fluctuations that might arise from stochastic processes, measurement or experimental error, or other unaccounted for influences) in the data. High variability can make it difficult to detect meaningful patterns or relationships in the data, while low variability may indicate a strong underlying pattern.

  • Precision and reliability Variability is related to the precision and reliability of measurements. Smaller variability indicates more consistent and precise measurements, whereas larger variability suggests inconsistency and potential issues with the data collection process.

  • Comparing groups Understanding variability is essential when comparing different groups or datasets. Even if two groups have similar central tendencies, their variability may differ significantly, leading to different interpretations of the data.

  • Assumptions of statistical tests Many inferential statistical tests have assumptions about the variability of the data, such as homoscedasticity (equal variances across groups) or independence of observations. Assessing variability helps determine whether these assumptions are met and informs the choice of appropriate tests.

  • Effect sizes and statistical power Variability plays a role in determining the effect size (magnitude of the difference between groups or strength of relationships) and the statistical power (ability to detect a true effect) of a study. High variability can make it harder to detect significant effects, requiring larger sample sizes to achieve adequate statistical power.

Understanding variability informs the choice of inferential statistics:

  • Parametric vs non-parametric tests If the data exhibit normality and homoscedasticity, parametric tests may be appropriate (see Chapter 7). However, if the data have high variability or violate the assumptions of parametric tests, non-parametric alternatives may be more suitable.

  • Choice of estimators Variability can influence the choice of estimators (e.g., mean vs. median) for central tendency, depending on the data’s distribution and presence of outliers.

  • Sample size calculations Variability informs sample size calculations for inferential statistics. Higher variability typically requires larger sample sizes to achieve sufficient statistical power.

  • Model selection Variability can influence the choice of statistical models, as certain models may better accommodate the variability in the data than others (e.g., linear vs. non-linear models, fixed vs. random effects).

Let us now look at the estimators of variance.

Variance and standard deviation

Variance and standard deviation (SD) are the most commonly used measures of dispersion. The sample variance, \(S^{2}\), may be calculated according to the following formula (Equation 2). If we cannot be bothered to calculate the variance and SD by hand, we may use the built-in functions var() and sd():

The sample variance: \[S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} \tag{2}\]

This reads, “the sum of the squared differences from the mean, divided by the sample size minus 1.” To get the standard deviation, \(S\), we take the square root of the variance, i.e. \(S = \sqrt{S^{2}}\).

var(normal_data)
[1] 1.018911
sd(normal_data)
[1] 1.009411
Task B
  1. How would you manually calculate the variance and SD for the normal_data? Do it now! Make sure your answer is the same as those reported above.

The interpretation of the mean and the median is fairly straightforward and intuitive. Not so for the measures of variance. What does \(S\) represent? Firstly, the unit of measurement of \(S\) is the same as that of \(\bar{x}\) (the variance does not share this characteristic). If temperature is measured in °C, then \(S\) also takes a unit of °C. Since \(S\) measures the dispersion around the mean, we write it as \(\bar{x} \pm S\) (note that the population mean and standard deviation are often written with the Greek letters \(\mu\) and \(\sigma\), respectively; i.e. \(\mu \pm \sigma\)). The smaller \(S\), the closer the sample data lie to \(\bar{x}\); the larger it is, the further they spread out from \(\bar{x}\). So, \(S\) tells us about the proportion of observations that fall within a given distance of \(\bar{x}\). But what proportion? We invoke the 68-95-99.7 rule: ~68% of the population (as represented by a random sample of \(n\) observations taken from the population) falls within 1\(S\) of \(\bar{x}\) (i.e. ~34% below \(\bar{x}\) and ~34% above \(\bar{x}\)); ~95% of the population falls within 2\(S\); and ~99.7% falls within 3\(S\) (Figure 3).

Figure 3: The idealised Normal distribution showing the proportion of data within 1, 2, and 3SD from the mean.
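We can check the 68-95-99.7 rule empirically against the normal_data generated earlier; a minimal sketch:

m <- mean(normal_data)
s <- sd(normal_data)

# Proportion of observations within 1, 2, and 3 SD of the mean;
# these should be close to 0.68, 0.95, and 0.997, respectively
mean(abs(normal_data - m) <= 1 * s)
mean(abs(normal_data - m) <= 2 * s)
mean(abs(normal_data - m) <= 3 * s)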

Like the mean, \(S\) is affected by extreme values and outliers, so before we attach \(S\) as a summary statistic to describe some data, we need to ensure that the data are in fact normally distributed. We will talk about how to do this in Chapter 7, where we will go over the numerous ways to check the assumption of normality. When the data are found to be non-normal, we need to find appropriate ways to express the spread of the data. Enter the quartiles.

The minimum, maximum, and range

A description of the extremes (the edges of the distribution) of the data can be provided by the functions min(), max(), and range(). These concepts are straightforward and do not require elaboration. They apply to data of any distribution, not only to normal data, and they are often where you want to start when looking at the data for the first time. Note that range() does something different from what its name suggests:

min(normal_data)
[1] -3.448997
max(normal_data)
[1] 4.177403
range(normal_data)
[1] -3.448997  4.177403

range() actually gives us the minimum and maximum values, not the difference between them. To compute the range as a single value (the maximum minus the minimum) we must be a bit more clever:

range(normal_data)[2] - range(normal_data)[1]
[1] 7.6264
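Alternatively, because range() returns the minimum and maximum as a vector of length two, diff() gives the same value more compactly:

diff(range(normal_data)) # same value as above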

Quartiles and the interquartile range

A more forgiving approach (forgiving of the extremes, often called ‘robust’) is to divide the distribution of the ordered data into quarters and find the points below which 25% (0.25, the first quartile; Q1), 50% (0.50, the median; Q2) and 75% (0.75, the third quartile; Q3) of the data fall. These are called quartiles (from ‘quarter’; not to be confused with quantiles, a more general concept that divides the distribution at any arbitrary proportion from 0 to 1).

The interquartile range (IQR) is a measure of statistical dispersion that provides information about the spread of the middle 50% of a dataset. It is calculated by subtracting the first quartile (25th percentile) from the third quartile (75th percentile).

The quartiles and IQR have several important uses:

  • Identifying central tendency As I have shown earlier, the second quartile, or median, is a measure of central tendency that is less sensitive to outliers than the mean. It offers a more robust estimate of the typical value in skewed distributions or those with extreme values.

  • Measure of variability The IQR is a robust measure of variability that is less sensitive to outliers and extreme values compared to other measures like the range or standard deviation. It gives a better understanding of the data spread in the middle part of the distribution.

  • Identifying outliers The IQR can be used to identify potential outliers in the data. A common method is to define outliers as data points falling below the first quartile minus 1.5 times the IQR or above the third quartile plus 1.5 times the IQR.

  • Describing skewed data For skewed distributions, the quartiles and IQR provide a better description of the data spread than the standard deviation, as they are not influenced by the skewness of the data. They can help reveal the degree of asymmetry in the distribution and the concentration of values in the middle portion.

  • Comparing distributions The IQR can be used to compare the variability or spread of two or more distributions. It provides a more robust comparison than the standard deviation or range when the distributions have outliers or are not symmetric, and the median reveals departures from normality.

  • Box plots The quartiles and IQR are key components of box plots, which are graphical representations of the distribution of a dataset. Box plots display the median, first quartile, third quartile, and potential outliers, providing a visual representation of the data’s central tendency, spread, and potential outliers.

In R we use the quantile() function to provide the quartiles. Here is a demonstration:

# Look at the normal data
quantile(normal_data, p = 0.25)
       25% 
-0.6546966 
quantile(normal_data, p = 0.75)
      75% 
0.7044242 
# Look at skewed data
quantile(left_skewed_data, p = 0.25)
      25% 
0.6137101 
quantile(left_skewed_data, p = 0.75)
     75% 
0.840831 
Task B
  1. Write a few lines of code to demonstrate that the above results indeed conform to the formal definition for what quantiles are? I.e., show manually how you can determine that 25% of the observations indeed falls below -0.65 for the normal_data. Explain the rationale to your approach.

We calculate the interquartile range using the IQR() function:

IQR(normal_data)
[1] 1.359121
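As mentioned above, the IQR is often used to flag potential outliers via the 1.5 × IQR rule. Here is a minimal sketch applying those fences to the penguins’ body mass; no claim is made that it will find any outliers in these particular data:

mass <- penguins$body_mass_g

q1 <- quantile(mass, 0.25, na.rm = TRUE)
q3 <- quantile(mass, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Tukey's fences: values beyond 1.5 x IQR from the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
mass[!is.na(mass) & (mass < lower | mass > upper)]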

Data summaries

Purpose Function Package
Summary of the data properties summary() base
describe() psych
skim() skimr
descriptives() jmv
dfSummary() summarytools

R has a built-in function that provides a broad overview, not only of the classes within a dataframe, but also of some coarsely calculated descriptive statistics that might serve well as a starting place for EDA. There are also several packages that offer similar views of our data. I’ll introduce these below, but leave it to you to study them and choose the one you like best.

Above, I said, “coarsely calculated”. Why? Because these summaries ignore the grouping structure of the data.

Task B
  1. Why is it important to consider the grouping structures that might be present within our datasets?

summary()

The first method is a generic function that can be applied to a range of R data structures, and whose output depends on the class of the structure. It is called summary(). This function can be applied to the dataset itself (here a tibble) and also to the output of some models fitted to the data (later we will see, for instance, how it is applied to t-tests, ANOVAs, correlations, and regressions). When applied to a dataframe or tibble, we will be presented with something quite useful. Let us return to the Palmer penguin dataset, and you’ll see many familiar descriptive statistics:

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Task B
  1. Explain the output of summary() when applied to the penguins dataset.

psych::describe()

The psych package has the describe() function, which provides a somewhat more verbose output containing many of the descriptive statistics I introduced earlier in this Chapter:

psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04

skimr::skim()

The skimr package offers something similar, but presented differently. The skim() function returns the following:

library(skimr)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

jmv::descriptives()

Here’s yet another view into our data, this time courtesy of the jmv package:

library(jmv)
descriptives(penguins, freq = TRUE)

 DESCRIPTIVES

 Descriptives                                                                                                                           
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
                         species    island    bill_length_mm    bill_depth_mm    flipper_length_mm    body_mass_g    sex    year        
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   N                         344       344               342              342                  342            342    333          344   
   Missing                     0         0                 2                2                    2              2     11            0   
   Mean                                             43.92193         17.15117             200.9152       4201.754            2008.029   
   Median                                           44.45000         17.30000             197.0000       4050.000            2008.000   
   Standard deviation                               5.459584         1.974793             14.06171       801.9545           0.8183559   
   Minimum                                          32.10000         13.10000                  172           2700                2007   
   Maximum                                          59.60000         21.50000                  231           6300                2009   
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 


 FREQUENCIES

 Frequencies of species                                
 ───────────────────────────────────────────────────── 
   species      Counts    % of Total    Cumulative %   
 ───────────────────────────────────────────────────── 
   Adelie          152      44.18605        44.18605   
   Chinstrap        68      19.76744        63.95349   
   Gentoo          124      36.04651       100.00000   
 ───────────────────────────────────────────────────── 


 Frequencies of island                                 
 ───────────────────────────────────────────────────── 
   island       Counts    % of Total    Cumulative %   
 ───────────────────────────────────────────────────── 
   Biscoe          168      48.83721        48.83721   
   Dream           124      36.04651        84.88372   
   Torgersen        52      15.11628       100.00000   
 ───────────────────────────────────────────────────── 


 Frequencies of sex                                 
 ────────────────────────────────────────────────── 
   sex       Counts    % of Total    Cumulative %   
 ────────────────────────────────────────────────── 
   female       165      49.54955        49.54955   
   male         168      50.45045       100.00000   
 ────────────────────────────────────────────────── 

summarytools::dfSummary()

And lastly, there is the summarytools package and the dfSummary() function within:

library(summarytools)
print(dfSummary(penguins, 
                varnumbers   = FALSE, 
                valid.col    = FALSE, 
                graph.magnif = 0.76),
      method = 'render')

Data Frame Summary

penguins

Dimensions: 344 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
species [factor]
1. Adelie
2. Chinstrap
3. Gentoo
152 ( 44.2% )
68 ( 19.8% )
124 ( 36.0% )
0 (0.0%)
island [factor]
1. Biscoe
2. Dream
3. Torgersen
168 ( 48.8% )
124 ( 36.0% )
52 ( 15.1% )
0 (0.0%)
bill_length_mm [numeric]
Mean (sd) : 43.9 (5.5)
min ≤ med ≤ max:
32.1 ≤ 44.5 ≤ 59.6
IQR (CV) : 9.3 (0.1)
164 distinct values 2 (0.6%)
bill_depth_mm [numeric]
Mean (sd) : 17.2 (2)
min ≤ med ≤ max:
13.1 ≤ 17.3 ≤ 21.5
IQR (CV) : 3.1 (0.1)
80 distinct values 2 (0.6%)
flipper_length_mm [integer]
Mean (sd) : 200.9 (14.1)
min ≤ med ≤ max:
172 ≤ 197 ≤ 231
IQR (CV) : 23 (0.1)
55 distinct values 2 (0.6%)
body_mass_g [integer]
Mean (sd) : 4201.8 (802)
min ≤ med ≤ max:
2700 ≤ 4050 ≤ 6300
IQR (CV) : 1200 (0.2)
94 distinct values 2 (0.6%)
sex [factor]
1. female
2. male
165 ( 49.5% )
168 ( 50.5% )
11 (3.2%)
year [integer]
Mean (sd) : 2008 (0.8)
min ≤ med ≤ max:
2007 ≤ 2008 ≤ 2009
IQR (CV) : 2 (0)
2007 : 110 ( 32.0% )
2008 : 114 ( 33.1% )
2009 : 120 ( 34.9% )
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-08-19

As you can see, there are many options and you may use the one you least dislike. I’ll not be prescriptive or openly opinionated about it.

Descriptive statistics by group

In this Chapter, we have explored the fundamental types of summary statistics typically encountered during EDA and the methods for calculating them. While these basic measures provide valuable insights, delving deeper into the data requires recognising the inherent grouping structures. By identifying how group membership influences measured variables, we can answer questions such as, “Are there differences in bill length, bill depth, flipper length, and body mass between species?” or “Does the island where individuals were studied affect these variables?” Additionally, we might be interested in investigating the effects of sex and year. Numerous approaches can be taken to dissect this structured dataset, each shedding light on unique aspects of the data’s structure and relationships.

Here’s an example. Remember the ChickWeight dataset introduced in Chapter 4 of Intro R? In this dataset, we calculated the mean (etc.) for all the chickens, over all the diet groups to which they had been assigned (the factor Diet has four levels, i.e. Diets 1 to 4), and over the entire duration of the experiment (the experiment lasted 21 days). It would be more useful to see what the weights of the chickens in each of the four groups are at the end of the experiment—we can compare means (± SD) and medians (± interquartile ranges, etc.), for instance. You’ll notice now how the measures of central tendency are being combined with the measures of variability/range. Further, we can augment this statistical summary with many kinds of graphical summaries, which will be far more revealing of differences (if any) amongst groups.

An analysis of the ChickWeight dataset that recognises the effect of diet and time (start and end of experiment) might reveal something like the summary below.
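Such a summary can be produced with a tidy group_by()/summarise() pipeline. Here is a minimal sketch of one way to do it (the summary-column names simply mirror those in the output that follows):

library(dplyr)

ChickWeight |>
  filter(Time %in% c(0, 21)) |>
  group_by(Diet, Time) |>
  summarise(mean_wt = mean(weight),
            sd_wt   = sd(weight),
            min_wt  = min(weight),
            qrt1_wt = quantile(weight, 0.25),
            med_wt  = median(weight),
            qrt3_wt = quantile(weight, 0.75),
            max_wt  = max(weight),
            n_wt    = n(),
            .groups = "drop")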

# A tibble: 8 × 10
  Diet   Time mean_wt sd_wt min_wt qrt1_wt med_wt qrt3_wt max_wt  n_wt
  <fct> <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <int>
1 1         0    41.4   1       39    41     41       42      43    20
2 1        21   178.   58.7     96   138.   166      208.    305    16
3 2         0    40.7   1.5     39    39.2   40.5     42      43    10
4 2        21   215.   78.1     74   169    212.     262.    331    10
5 3         0    40.8   1       39    41     41       41      42    10
6 3        21   270.   71.6    147   229    281      317     373    10
7 4         0    41     1.1     39    40.2   41       42      42    10
8 4        21   239.   43.3    196   204    237      264     322     9

We typically report the measure of central tendency together with the associated variation. So, in a table we would want to include the mean ± SD. For example, this table is almost ready for including in a publication:

Mean ± SD for the `ChickWeight` dataset as a function of Diet and Time.
Diet Time Mean ± SD
1 0 41.4 ± 1
1 21 177.8 ± 58.7
2 0 40.7 ± 1.5
2 21 214.7 ± 78.1
3 0 40.8 ± 1
3 21 270.3 ± 71.6
4 0 41 ± 1.1
4 21 238.6 ± 43.3

Further, we want to supplement this EDA with some figures that visually show the effects. Here I show a few options for displaying the effects in different ways: Figure 4 shows the spread of the raw data, the mean or median, as well as the appropriate accompanying indicators of variation around the mean or median. I will say much more about using figures in EDA in Chapter 3.

Figure 4: The figures represent A) a scatterplot of the mean and raw chicken mass values; B) a bar graph of the chicken mass values, with whiskers indicating ±1 SD; C) a box and whisker plot of the chicken mass data; and D) chicken mass as a function of both Diet and Time (10 and 21 days).
Task B
  1. Using a tidy workflow, assemble a summary table of the palmerpenguins dataset that has a similar appearance as that produced by psych::describe(penguins).

    • For bonus marks of up to 10% added to Task B, apply a beautiful and creative styling to the table using the kable() function together with the kableExtra package. Try to make it as publication ready as possible. Refer to a few journal articles to see how tables are professionally typeset.
  2. Still using the palmerpenguins dataset, perform an exploratory data analysis to investigate the relationship between penguin species and their morphological traits (bill length, bill depth, flipper length, and body mass). Employ the tidyverse approaches learned earlier in the module to explore the data and account for the grouping structures present within the dataset. Provide visualisations (use Figure 4 for inspiration) and summary statistics to support your findings and elaborate on any observed patterns or trends.
