2. Summarising Biological Data

The First Steps of Exploratory Data Analysis

Author

A. J. Smit

Published

2026/04/05

A reminder that claims need evidence.

“I think it is much more interesting to live with uncertainty than to live with answers that might be wrong.”

— Richard Feynman
Note: In This Chapter
  • Data summaries
  • Measures of central tendency
  • Measures of dispersal
  • Descriptive statistics by group
Important: Tasks to Complete in This Chapter
  • Self-Assessment Task 2-1 (/3)
  • Self-Assessment Task 2-2 (/10)
  • Self-Assessment Task 2-3 (/5)
  • Self-Assessment Task 2-4 (/10)
  • Self-Assessment Task 2-5 (/5)
  • Self-Assessment Task 2-6 (/3)
  • Self-Assessment Task 2-7 (/2)
  • Self-Assessment instructions and full task overview

1 Introduction

Exploratory data analysis (EDA) establishes what kind of data we actually have before we test hypotheses or fit models. It tells us how many observations we have, what variables were measured, how much variation is present, whether missing values or outliers need attention, and whether the data contain obvious grouping structure. Poor summaries at this stage usually lead to poor inference later.

In this chapter, I introduce the numerical summaries used at the start of that process. I focus on measures of centre, measures of spread, and a small set of tools for inspecting the structure of a dataset. These summaries work alongside the figures in Chapter 3 and the treatment of distributions and uncertainty in Chapter 4.

2 Key Concepts

The ideas that organise the rest of the chapter are:

  • EDA: The first analytical pass through a dataset. Its job is to reveal structure, variation, missingness, and potential problems before formal inference begins.
  • Centre and spread: Most numerical summaries describe either where the data cluster or how widely they vary. They form part of summary (descriptive) statistics.
  • Distribution: The shape of a distribution determines which summaries describe it well. Symmetric data and skewed data are not best summarised in the same way, and they will be analysed using different tests or models.
  • Grouped summaries: Biological datasets often contain treatments, species, sites, or times. Summaries by group may reveal structure hidden in pooled data.
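Grouped summaries of this kind are straightforward with dplyr. The toy tibble below is hypothetical, purely to illustrate the pattern; the same group_by() and summarise() recipe applies to the real datasets used later in the chapter.

```r
library(dplyr)

# hypothetical measurements from two sites, for illustration only
toy <- tibble(
  site = rep(c("A", "B"), each = 4),
  mass = c(2.1, 2.4, 2.2, 2.6, 3.8, 4.1, 3.9, 4.4)
)

toy |>
  group_by(site) |>
  summarise(
    n      = n(),
    mean   = mean(mass),
    median = median(mass),
    sd     = sd(mass)
  )
```

A pooled mean of mass would hide the fact that the two sites form distinct clusters; the grouped summary makes that structure visible.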

3 Foundational Definitions

A few definitions that recur throughout the module are presented next.

3.1 Variables and Parameters

A parameter is a fixed but usually unknown quantity that describes a population or probability distribution. Examples include the true population mean, variance, or regression slope. A variable is the measured characteristic itself: body mass, bill length, temperature, salinity, presence or absence, and so on. We observe variables in a sample and use them to estimate parameters.

3.2 Samples and Populations

The population is the full set of units about which we want to make a claim: all trees in a forest, all quadrats in a marsh, all fish in an estuary, or all patients in a trial. A sample is the subset we actually measure. Because we rarely have access to the whole population, we rely on a random sample to estimate population-level quantities. The question is therefore whether the sample is informative about the population we care about. As sample size increases, and as the sample better represents the population, the sample mean becomes a more stable estimate of the population mean (Figure 1).

Code
library(tidyverse)

set.seed(666)

# initialise an empty, typed tibble to grow inside the loop
normal_takes_shape <- tibble(number_draws = character(), draws = numeric())

# simulate increasingly larger samples
for (i in c(2, 5, 10, 50, 100, 500, 1000, 10000, 100000)) {
  draws_i <-
    tibble(
      number_draws = c(rep.int(
        x = paste(as.integer(i), "draws"),
        times = i
      )),
      draws = c(rnorm(
        n = i,
        mean = 13,
        sd = 1
      ))
    )

  normal_takes_shape <- rbind(normal_takes_shape, draws_i)
  rm(draws_i)
}

normal_takes_shape |>
  mutate(number_draws = as_factor(number_draws)) |>
  ggplot(aes(x = draws)) +
  geom_density(colour = "indianred3") +
  theme_grey() +
  facet_wrap(
    vars(number_draws),
    scales = "free_y"
  ) +
  labs(
    x = "Value",
    y = "Density"
  )
Figure 1: Densities of increasingly large samples drawn from a population with a true mean of 13 and an SD of 1.

3.3 When Is Something Random?

In statistics, random means that outcomes cannot be predicted exactly in advance, even when the process generating them is understood. Random sampling and random assignment use this idea to reduce bias and to justify probability-based inference.

The term stochastic is closely related but usually refers to a process rather than to a single outcome. Population growth under fluctuating weather, disease transmission, and dispersal all have deterministic components, but they also include variation that is modelled probabilistically. In practice, both terms point us to the same issue: biological systems often contain uncertainty that must be described rather than ignored.
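The practical counterpart of this idea is the pseudo-random number generator: individual draws are unpredictable, but fixing the seed makes a simulation exactly reproducible, which is why the code in this chapter calls set.seed() before every simulation. A minimal sketch:

```r
set.seed(13)
first <- rnorm(5)   # five unpredictable draws

set.seed(13)
second <- rnorm(5)  # same seed, so the identical five draws

identical(first, second)  # TRUE
```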

Important: Do It Now!

Run the code for Figure 1 and compare the panels for 2, 10, and 10 000 draws in terms of their mean and SD. If you forgot the functions, you can find them below. In two or three sentences, explain what changes as sample size increases and why that is important when we use a sample to estimate a population mean.

4 Descriptive Statistics

Now to the summaries used most often in EDA. These summaries answer three basic questions: where is the centre of the data, how much do the values vary, and which summaries remain sensible when the data are skewed or contain outliers.

4.1 The Core Equations

The sample mean is:

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n} \tag{1}\]

In Equation 1, \(x_{1}, x_{2}, \ldots, x_{n}\) are the observations and \(n\) is the sample size. The mean is therefore the total of the observations divided by the number of observations.

The sample variance is:

\[S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} \tag{2}\]

Equation 2 measures the average squared deviation from the sample mean. The divisor is \(n - 1\) rather than \(n\) because the sample mean itself is estimated from the data, which removes one degree of freedom.

The sample standard deviation is the square root of the variance:

\[S = \sqrt{S^{2}} \tag{3}\]

The standard deviation is usually easier to interpret than the variance because it is expressed on the original measurement scale of the data.

For robust summaries, the interquartile range is:

\[\text{IQR} = Q_{3} - Q_{1} \tag{4}\]

In Equation 4, \(Q_{1}\) is the first quartile and \(Q_{3}\) is the third quartile, so the IQR measures the spread of the middle 50% of the data.
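As a quick sanity check, all four equations can be computed by hand on a small, made-up vector and compared against R's built-in functions (quantile() with its default settings is consistent with IQR()):

```r
x <- c(4, 7, 2, 9, 6)  # a small hypothetical sample
n <- length(x)

x_bar <- sum(x) / n                                     # Equation 1
s2    <- sum((x - x_bar)^2) / (n - 1)                   # Equation 2
s     <- sqrt(s2)                                       # Equation 3
iqr   <- unname(quantile(x, 0.75) - quantile(x, 0.25))  # Equation 4

# each manual result agrees with the corresponding built-in
stopifnot(
  all.equal(x_bar, mean(x)),
  all.equal(s2, var(x)),
  all.equal(s, sd(x)),
  all.equal(iqr, IQR(x))
)
```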

4.2 Measures of Central Tendency

Statistic   Function     Package
Mean        mean()       base
Median      median()     base
Mode        Do it!
Skewness    skewness()   e1071
Kurtosis    kurtosis()   e1071

Measures of central tendency describe where the data cluster. The mean and standard deviation work well when the data are roughly symmetric and not dominated by extreme values. The median and IQR are usually better when the data are skewed or contain outliers.

Before discussing each statistic, I will generate several simple datasets with different shapes. These give us a controlled way to compare the summaries.

Show code
# Generate random data from a normal distribution
set.seed(666)
n <- 5000 # Number of data points
mean <- 0
sd <- 1
normal_data <- rnorm(n, mean, sd)

# Generate random data from a slightly
# right-skewed beta distribution
alpha <- 2
beta <- 5
right_skewed_data <- rbeta(n, alpha, beta)

# Generate random data from a slightly
# left-skewed beta distribution
alpha <- 5
beta <- 2
left_skewed_data <- rbeta(n, alpha, beta)

# Generate random data with a bimodal distribution
mean1 <- 0
mean2 <- 10
sd1 <- 3
sd2 <- 4

# Generate data from two normal distributions
data1 <- rnorm(n, mean1, sd1)
data2 <- rnorm(n, mean2, sd2)

# Combine the data from both distributions to
# create a bimodal distribution
bimodal_data <- c(data1, data2)

make_hist_plot <- function(x, title, fill_col) {
  stat_lines <- tibble(
    statistic = c("Mean", "Median"),
    xint = c(mean(x), median(x))
  )

  ggplot(tibble(value = x), aes(x = value)) +
    geom_histogram(bins = 30, fill = fill_col, colour = "black", linewidth = 0.3) +
    geom_vline(
      data = stat_lines,
      aes(xintercept = xint, colour = statistic, linetype = statistic),
      linewidth = 0.5,
      show.legend = TRUE
    ) +
    scale_colour_manual(values = c("Mean" = "red", "Median" = "blue")) +
    scale_linetype_manual(values = c("Mean" = "solid", "Median" = "dashed")) +
    labs(
      title = title,
      x = "Value",
      y = "Frequency"
    ) +
    theme_grey() +
    theme(
      legend.position = "bottom",
      plot.title = element_text(size = 9)
    )
}

plt_normal <- make_hist_plot(normal_data, "Normal Distribution", "grey80")
plt_right <- make_hist_plot(right_skewed_data, "Right-Skewed Distribution", "grey80")
plt_left  <- make_hist_plot(left_skewed_data, "Left-Skewed Distribution", "grey80")
plt_bimodal <- make_hist_plot(bimodal_data, "Bimodal Distribution", "grey80")

ggpubr::ggarrange(
  plt_normal, plt_right, plt_left, plt_bimodal,
  ncol = 2, nrow = 2,
  labels = c("A", "B", "C", "D"),
  common.legend = TRUE,
  legend = "bottom"
)
Figure 2: Generated normal, right-skewed, left-skewed, and bimodal distributions, shown in panels A-D, with the mean and median indicated in each panel.

4.2.1 The Sample Mean

The mean is the arithmetic average of the data. As shown in Equation 1, it is calculated by summing the observations and dividing by the sample size.

We calculate it with mean():

round(mean(normal_data), 3)
[1] 0.009
Important: Self-Assessment Task 2-1

How would you manually calculate the mean value for the normal_data we generated in the lecture? (/3)

The mean uses all observed values, which makes it informative but also sensitive to skew and outliers. That sensitivity does not make it invalid for non-normal data, but it does make it a poor summary when a few extreme values dominate the result.
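A tiny, made-up example makes the sensitivity concrete: adding a single extreme value drags the mean far from the bulk of the data, while the median barely moves.

```r
x <- c(2, 3, 3, 4, 5)
mean(x)              # 3.4
median(x)            # 3

x_out <- c(x, 100)   # add one extreme value
mean(x_out)          # 19.5 -- pulled far to the right
median(x_out)        # 3.5  -- almost unchanged
```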

In panel A of Figure 2, the normal distribution is centred cleanly around its mean. In panels B and C, tail asymmetry pulls the mean away from the bulk of the data.

4.2.2 The Median

The median is the middle value after the data have been ordered. With an odd number of observations, it is the single central value. With an even number, it is the mean of the two central values.

The median divides the ordered data into two equal halves. In a symmetric distribution it will often be close to the mean. In skewed data it usually gives a more stable description of the centre because extreme values have little influence on it.

That contrast is visible in panels B and C of Figure 2 where the median sits closer to the main cluster of values than the mean does.

We calculate the median with median():

round(median(normal_data), 3)
[1] 0.017

It is easier to see the calculation on a small dataset:

set.seed(123) # for reproducibility
small_normal_data <- round(rnorm(11, 13, 3), 1)
sort(small_normal_data)
 [1]  9.2 10.9 11.3 11.7 12.3 13.2 13.4 14.4 16.7 17.7 18.1
median(small_normal_data)
[1] 13.2
Important: Do It Now!

Use Figure 2 to decide which measure of centre you would report for each of the four panels. For each panel, write down either mean or median and give one short reason for your choice.

Note: What Is the Relationship Between the Median and Quantiles?

The median is the 50th percentile, or second quartile (\(Q_{2}\)). More generally, quantiles divide the ordered data into specified proportions. Quartiles split them into four parts, deciles into ten, and percentiles into one hundred.
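quantile() computes any of these directly. On the vector 1:100 (a convenient hypothetical example), the 50th percentile equals the median, and the other probabilities slice the ordered data proportionally:

```r
x <- 1:100

quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
# 10.9, 25.75, 50.5, 75.25, 90.1

quantile(x, 0.5) == median(x)  # TRUE: the median is the 50th percentile
```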

4.2.3 The Mode

The mode is the most frequent value or values in a dataset. It is useful for categorical data and for identifying whether a distribution is unimodal, bimodal, or multimodal. For continuous numerical data, exact repeated values are often less informative, so the mode is usually assessed visually from a histogram or density plot rather than calculated directly.

Panel D of Figure 2 shows why visual inspection is useful here: one mean can be calculated, but the figure shows two clear peaks.

Base R has no built-in function for the statistical mode (its mode() function returns an object's storage type, not the most frequent value). In practice, visual inspection is often the more useful route.
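For discrete or categorical data, a small helper built on table() will do. stat_mode() below is a hypothetical function written for illustration, not part of base R or any package:

```r
# a minimal mode function for discrete data (hypothetical helper)
stat_mode <- function(x) {
  counts <- table(x)
  # keep every value tied for the highest frequency
  names(counts)[counts == max(counts)]
}

stat_mode(c(1, 2, 2, 3, 3, 4))    # "2" "3" -- a bimodal sample
stat_mode(c("a", "b", "b", "c"))  # "b"
```

Note that the helper returns the modal values as character labels, which keeps it usable for both numerical and categorical input.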

4.2.4 Skewness

Skewness describes asymmetry in a distribution. A symmetric distribution has skewness close to zero. Positive skewness means the right tail is longer; negative skewness means the left tail is longer.

Skewness is often easiest to understand by comparing the mean and median. In a right-skewed distribution the mean is usually greater than the median. In a left-skewed distribution the mean is usually smaller.

You can see both patterns in Figure 2, where panel B has a longer right tail, and panel C has a longer left tail.

library(e1071)
# Positive skewness
skewness(right_skewed_data)
[1] 0.5453162
# Is the mean larger than the median?
mean(right_skewed_data) > median(right_skewed_data)
[1] TRUE
# Negative skewness
skewness(left_skewed_data)
[1] -0.5790834
# Is the mean less than the median?
mean(left_skewed_data) < median(left_skewed_data)
[1] TRUE

4.2.5 Kurtosis

Kurtosis describes tail heaviness relative to a normal distribution. A normal distribution has excess kurtosis close to zero (mesokurtic). Negative excess kurtosis indicates a thin-tailed (platykurtic) distribution, and positive excess kurtosis indicates a fat-tailed (leptokurtic) distribution. Note that kurtosis() from e1071 reports excess kurtosis, so zero is the normal baseline.

kurtosis(normal_data)
[1] -0.01646261
kurtosis(right_skewed_data)
[1] -0.1898941
kurtosis(left_skewed_data)
[1] -0.1805365
Important: Self-Assessment Task 2-2

Find the faithful dataset and describe both variables in terms of their measures of central tendency. Include graphs in support of your answers (use ggplot()), and conclude with a brief statement about the data distribution. (/10)

Skewness and kurtosis can be informative, but they do not replace visual inspection or later assumption checks. They give a first numerical impression of distribution shape.

4.3 Measures of Variance or Dispersion Around the Centre

Statistic             Function
Variance              var()
Standard deviation    sd()
Minimum               min()
Maximum               max()
Range                 range()
Quantile              quantile()
Interquartile range   IQR()

Measures of dispersion describe how widely the values are spread. Two samples can have the same mean but very different biological interpretations if one is tightly clustered and the other highly variable.

4.3.1 Variance and Standard Deviation

Variance and standard deviation are measures of dispersion. The sample variance is given in Equation 2, and the standard deviation in Equation 3. We can calculate them with var() and sd():

var(normal_data)
[1] 1.002459
sd(normal_data)
[1] 1.001229
Important: Self-Assessment Task 2-3

Manually calculate the variance and SD for the normal_data we generated in the lecture. Make sure your answer is the same as those reported there. (/5)

The standard deviation is easier to interpret than the variance because it is measured on the same scale as the data. If temperature is measured in degrees Celsius, the standard deviation is also measured in degrees Celsius.

For roughly normal data, the 68-95-99.7 rule gives a useful approximation: about 68% of observations lie within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD (Figure 3).

Figure 3: The idealised normal distribution showing the proportion of data within 1, 2, and 3 SD of the mean.
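The rule is easy to check empirically. Simulating a large normal sample and counting the proportion of values within k SD of the mean recovers the three percentages closely:

```r
set.seed(666)
x <- rnorm(100000, mean = 0, sd = 1)

# proportion of observations within 1, 2, and 3 SD of the mean
sapply(1:3, function(k) mean(abs(x - mean(x)) <= k * sd(x)))
# approximately 0.683, 0.954, 0.997
```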

Like the mean, the standard deviation is sensitive to extreme values. For skewed data or data with strong outliers, the IQR is often more informative.

4.3.2 The Minimum, Maximum, and Range

min(), max(), and range() give the extremes of the data:

min(normal_data)
[1] -3.400137
max(normal_data)
[1] 3.235566
range(normal_data)
[1] -3.400137  3.235566

range() returns the minimum and maximum as a pair. If we want the numerical width of the range, we subtract the minimum from the maximum (equivalently, diff(range(normal_data))):

range(normal_data)[2] - range(normal_data)[1]
[1] 6.635703

These are simple summaries, but they are often the first place to look for impossible values, obvious outliers, or coding errors.

4.3.3 Quartiles and the Interquartile Range

Quartiles divide the ordered data into quarters. The first quartile (\(Q_{1}\)) marks the point below which 25% of the data fall, the second quartile (\(Q_{2}\)) is the median, and the third quartile (\(Q_{3}\)) marks the point below which 75% fall.

The IQR measures the spread of the middle 50% of the data. Because it ignores the tails, it is much less sensitive to outliers than the standard deviation. It is often the better description of spread for skewed data.
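A small, made-up example shows the contrast: adding one outlier to a tight cluster of values inflates the SD many times over but leaves the IQR almost unchanged.

```r
x <- c(10, 12, 11, 13, 12, 11, 10, 12)
x_out <- c(x, 60)   # one extreme value added

sd(x)       # about 1.06
sd(x_out)   # about 16.2 -- inflated by the single outlier
IQR(x)      # 1.25
IQR(x_out)  # 1 -- barely affected
```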

We obtain quartiles with quantile():

# Look at the normal data
quantile(normal_data, p = 0.25)
       25% 
-0.6597937 
quantile(normal_data, p = 0.75)
      75% 
0.6840946 
# Look at skewed data
quantile(left_skewed_data, p = 0.25)
      25% 
0.6133139 
quantile(left_skewed_data, p = 0.75)
      75% 
0.8390202 

We calculate the IQR with IQR():

IQR(normal_data)
[1] 1.343888
Important: Do It Now!

Calculate the range and IQR for left_skewed_data. The code is just above Figure 2. Then add one very large value to that vector and calculate both summaries again. Which measure changes more, and what does that tell you about when range or IQR is the better description of spread?

Important: Self-Assessment Task 2-4

Write a few lines of code to demonstrate that the \((0-0.25]\), \((0.25-0.5]\), \((0.5-0.75]\), and \((0.75-1]\) quantiles of the normal_data we generated in the lecture indeed conform to the formal definition of quantiles. That is, show manually how you can determine that 25% of the observations indeed fall below -0.66 for normal_data. Explain the rationale behind your approach. (/10)

The choice between mean and SD on the one hand, and median and IQR on the other, depends on the data. Symmetric distributions are often well described by the first pair. Skewed distributions or data with strong extremes are usually better described by the second.

The contrast among panels A-D in Figure 2 shows why that choice depends on the distribution rather than on habit.

5 The Palmer Penguin Dataset

The Palmer penguin dataset in the palmerpenguins package is a widely used teaching dataset for data exploration, visualisation, and modelling. It contains measurements from three penguin species in the Palmer Archipelago: Adélie, Chinstrap, and Gentoo.

The variables include bill length (bill_length_mm), bill depth (bill_depth_mm), flipper length (flipper_length_mm), body mass (body_mass_g), species, island, and sex. The dataset is rich enough to illustrate grouping structure, missingness, and both numerical and categorical variables without being unnecessarily large.

Let us start by loading the data:

library(palmerpenguins)

6 Exploring the Data Structure

Now we go from statistical definitions to implementation. These functions answer three practical questions: what variables exist, how large the dataset is, and whether the data contain missing values or unexpected structure.

6.1 Inspecting Type and Layout

Several functions give a quick overview of the dataset itself rather than of the values inside it:

Purpose                     Function
The class of the dataset    class()
The head of the dataframe   head()
The tail of the dataframe   tail()
Printing the data           print()
Glimpse the data            glimpse()
Show number of rows         nrow()
Show number of columns      ncol()
The column names            colnames()
The row names               row.names()
The dimensions              dim()
The dimension names         dimnames()
The data structure          str()

First, check the class of the object. The penguins dataset is a tibble, which is the tidyverse version of a data frame:

class(penguins)
[1] "tbl_df"     "tbl"        "data.frame"

We can convert between tibbles and data frames with as.data.frame() and as_tibble():

penguins_df <- as.data.frame(penguins)
class(penguins_df)
[1] "data.frame"
penguins_tbl <- as_tibble(penguins_df)
class(penguins_tbl)
[1] "tbl_df"     "tbl"        "data.frame"

The print methods differ. Data frames print more bluntly; tibbles are more compact and readable:

print(penguins_df[1:5,])
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
print(penguins_tbl)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

glimpse() gives similar information in a horizontal layout:

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

6.2 Inspecting Size and Names

Use nrow(), ncol(), and dim() to check dataset size:

nrow(penguins)
[1] 344
ncol(penguins)
[1] 8
dim(penguins)
[1] 344   8

Use colnames() to inspect variable names:

colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Tibbles do not use row names in the same way as traditional data frames, but row.names() and dimnames() are still worth recognising.

Important: Do It Now!

Explain the output of dimnames() when applied to the penguins dataset.

Important: Self-Assessment Task 2-5
  1. Explain the output of dimnames() when applied to the penguins dataset. (/2)
  2. Explain the output of str() when applied to the penguins dataset. (/3)

6.3 Previewing the Data

head() and tail() let us inspect the first and last rows:

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
tail(penguins, n = 3)
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            49.6          18.2               193        3775
2 Chinstrap Dream            50.8          19                 210        4100
3 Chinstrap Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <fct>, year <int>

You can wrap them in print() if you want more control over display:

print(head(penguins))
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
print(tail(penguins, n = 3))
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            49.6          18.2               193        3775
2 Chinstrap Dream            50.8          19                 210        4100
3 Chinstrap Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <fct>, year <int>

str() is often the most compact first inspection because it shows object type, variable classes, and a preview of values:

str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Important: Do It Now!

Explain the output of str() when applied to the penguins dataset.

7 Data Summaries

Once we understand the structure of the dataset, we summarise the values it contains. The tools in this section automate the numerical descriptions introduced above.

Use them for slightly different purposes:

  • summary() for a quick overview of variable types and basic summaries.
  • skim() for a broader inspection that includes missingness and type-specific summaries.
  • describe() for more detailed descriptive statistics on numerical variables.
  • descriptives() and dfSummary() when you want more elaborate tabular output.
Purpose                          Function         Package
Summary of the data properties   summary()        base
                                 describe()       psych
                                 skim()           skimr
                                 descriptives()   jmv
                                 dfSummary()      summarytools

7.1 summary()

summary() is the standard quick overview in base R. For data frames and tibbles, it reports variable classes and a small set of numerical summaries:

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Important: Self-Assessment Task 2-6

Explain the output of summary() when applied to the penguins dataset. (/3)

7.2 psych::describe()

psych::describe() provides a more detailed numerical summary:

psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04

7.3 skimr::skim()

skim() adds type-specific summaries and a clearer account of missingness:

library(skimr)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

7.4 jmv::descriptives()

descriptives() from jmv gives another formatted summary view:

library(jmv)
descriptives(penguins, freq = TRUE)

 DESCRIPTIVES

 Descriptives                                                                                                                           
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
                         species    island    bill_length_mm    bill_depth_mm    flipper_length_mm    body_mass_g    sex    year        
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   N                         344       344               342              342                  342            342    333          344   
   Missing                     0         0                 2                2                    2              2     11            0   
   Mean                                             43.92193         17.15117             200.9152       4201.754            2008.029   
   Median                                           44.45000         17.30000             197.0000       4050.000            2008.000   
   Standard deviation                               5.459584         1.974793             14.06171       801.9545           0.8183559   
   Minimum                                          32.10000         13.10000                  172           2700                2007   
   Maximum                                          59.60000         21.50000                  231           6300                2009   
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 


 FREQUENCIES

 Frequencies of species                                
 ───────────────────────────────────────────────────── 
   species      Counts    % of Total    Cumulative %   
 ───────────────────────────────────────────────────── 
   Adelie          152      44.18605        44.18605   
   Chinstrap        68      19.76744        63.95349   
   Gentoo          124      36.04651       100.00000   
 ───────────────────────────────────────────────────── 


 Frequencies of island                                 
 ───────────────────────────────────────────────────── 
   island       Counts    % of Total    Cumulative %   
 ───────────────────────────────────────────────────── 
   Biscoe          168      48.83721        48.83721   
   Dream           124      36.04651        84.88372   
   Torgersen        52      15.11628       100.00000   
 ───────────────────────────────────────────────────── 


 Frequencies of sex                                 
 ────────────────────────────────────────────────── 
   sex       Counts    % of Total    Cumulative %   
 ────────────────────────────────────────────────── 
   female       165      49.54955        49.54955   
   male         168      50.45045       100.00000   
 ────────────────────────────────────────────────── 

7.5 summarytools::dfSummary()

dfSummary() from summarytools produces a richer tabular report:

library(summarytools)
print(dfSummary(penguins, 
                varnumbers   = FALSE, 
                valid.col    = FALSE, 
                graph.magnif = 0.76),
      method = "render")

Data Frame Summary

penguins

Dimensions: 344 x 8
Duplicates: 0
Variable [type]: Stats / Values and Freqs (% of Valid) | Missing

species [factor]: Adelie 152 (44.2%), Chinstrap 68 (19.8%), Gentoo 124 (36.0%) | Missing: 0 (0.0%)
island [factor]: Biscoe 168 (48.8%), Dream 124 (36.0%), Torgersen 52 (15.1%) | Missing: 0 (0.0%)
bill_length_mm [numeric]: Mean (sd): 43.9 (5.5); min ≤ med ≤ max: 32.1 ≤ 44.5 ≤ 59.6; IQR (CV): 9.3 (0.1); 164 distinct values | Missing: 2 (0.6%)
bill_depth_mm [numeric]: Mean (sd): 17.2 (2); min ≤ med ≤ max: 13.1 ≤ 17.3 ≤ 21.5; IQR (CV): 3.1 (0.1); 80 distinct values | Missing: 2 (0.6%)
flipper_length_mm [integer]: Mean (sd): 200.9 (14.1); min ≤ med ≤ max: 172 ≤ 197 ≤ 231; IQR (CV): 23 (0.1); 55 distinct values | Missing: 2 (0.6%)
body_mass_g [integer]: Mean (sd): 4201.8 (802); min ≤ med ≤ max: 2700 ≤ 4050 ≤ 6300; IQR (CV): 1200 (0.2); 94 distinct values | Missing: 2 (0.6%)
sex [factor]: female 165 (49.5%), male 168 (50.5%) | Missing: 11 (3.2%)
year [integer]: Mean (sd): 2008 (0.8); min ≤ med ≤ max: 2007 ≤ 2008 ≤ 2009; IQR (CV): 2 (0); 2007: 110 (32.0%), 2008: 114 (33.1%), 2009: 120 (34.9%) | Missing: 0 (0.0%)

Generated by summarytools 1.1.5 (R version 4.5.2)
2026-04-05
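The IQR and CV columns in these reports are easy to cross-check by hand. A minimal sketch, assuming `penguins` comes from the palmerpenguins package (as in the summaries above):

```r
# Cross-check dfSummary's IQR and CV for bill_length_mm by hand.
# Assumes `penguins` is provided by the palmerpenguins package.
library(palmerpenguins)

x <- penguins$bill_length_mm
iqr <- IQR(x, na.rm = TRUE)                          # interquartile range
cv  <- sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)   # coefficient of variation

round(iqr, 1)  # matches the 9.3 reported above
round(cv, 1)   # matches the 0.1 reported above
```

Doing this once by hand is a useful habit: it confirms that you know exactly which statistic a convenience function is reporting.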

No single tool is best in every situation. The important point is to know what each one is for and to use it deliberately rather than mechanically.

8 Descriptive Statistics by Group

ImportantDo It Now!

Before looking at the grouped ChickWeight summaries, identify the variables that define the grouping structure in that dataset. Then propose one grouped table and one grouped figure that would help you compare chicks in a biologically sensible way.

ImportantSelf-Assessment Task 2-7

Why is it important to consider the grouping structures that might be present within our datasets? (/2)

Whole-dataset summaries are only a starting point. Biological data are often structured by species, treatment, site, sex, season, or year, and those groupings are often more informative than the overall mean. Once we acknowledge that structure, the descriptive question changes from “What is the average?” to “Average for whom, under what condition, and with what spread?”

The ChickWeight data make the point clearly. A single mean across all chicks, all diets, and all sampling days hides the fact that the birds were measured repeatedly and assigned to different diets. It is much more informative to summarise weight within diet groups and at specific time points, for example at the start and end of the experiment. That lets us compare means with standard deviations, medians with interquartile ranges, and then relate those numerical summaries to figures that make the group differences visible.

An analysis of the ChickWeight dataset that recognises the effects of diet and time (start and end of the experiment) might reveal something like this:
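A grouped summary of this kind can be produced with a dplyr pipeline along these lines (a sketch reconstructed from the output, not the chapter's actual code; the column names are my own choices):

```r
library(dplyr)

# Summarise chick weight by diet at the start (day 0) and end (day 21)
# of the experiment; ChickWeight ships with base R.
chick_summary <- ChickWeight |>
  filter(Time %in% c(0, 21)) |>
  group_by(Diet, Time) |>
  summarise(
    mean_wt = mean(weight),
    sd_wt   = sd(weight),
    min_wt  = min(weight),
    qrt1_wt = quantile(weight, 0.25),
    med_wt  = median(weight),
    qrt3_wt = quantile(weight, 0.75),
    max_wt  = max(weight),
    n_wt    = n(),
    .groups = "drop"
  )
chick_summary
```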

# A tibble: 8 × 10
  Diet   Time mean_wt sd_wt min_wt qrt1_wt med_wt qrt3_wt max_wt  n_wt
  <fct> <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <int>
1 1         0    41.4   1       39    41     41       42      43    20
2 1        21   178.   58.7     96   138.   166      208.    305    16
3 2         0    40.7   1.5     39    39.2   40.5     42      43    10
4 2        21   215.   78.1     74   169    212.     262.    331    10
5 3         0    40.8   1       39    41     41       41      42    10
6 3        21   270.   71.6    147   229    281      317     373    10
7 4         0    41     1.1     39    40.2   41       42      42    10
8 4        21   239.   43.3    196   204    237      264     322     9

We typically report the measure of central tendency together with the associated variation, so in a table we would want to include the mean ± SD. For example, this table is almost ready for inclusion in a publication:
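One way such a "Mean ± SD" column could be assembled (a sketch; the formatting and rounding choices are mine):

```r
library(dplyr)

# Build a publication-style "mean ± SD" column for ChickWeight,
# grouped by Diet at days 0 and 21
chick_tbl <- ChickWeight |>
  filter(Time %in% c(0, 21)) |>
  group_by(Diet, Time) |>
  summarise(
    mean_sd = paste0(round(mean(weight), 1), " ± ", round(sd(weight), 1)),
    .groups = "drop"
  )
chick_tbl
```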

Table 1: Mean ± SD for the ChickWeight dataset as a function of Diet and Time.

  Diet   Time   Mean ± SD
     1      0    41.4 ± 1.0
     1     21   177.8 ± 58.7
     2      0    40.7 ± 1.5
     2     21   214.7 ± 78.1
     3      0    40.8 ± 1.0
     3     21   270.3 ± 71.6
     4      0    41.0 ± 1.1
     4     21   238.6 ± 43.3

Further, we want to supplement this EDA with figures that visually show the effects. Here I show a few options for displaying the effects in different ways: Figure 4 shows the spread of the raw data, together with the mean or median and the appropriate accompanying indicators of variation around each. I will say much more about using figures in EDA in Chapter 3.

Figure 4: The figures represent A) a scatterplot of the mean and raw chicken mass values; B) a bar graph of the chicken mass values, with whiskers indicating ±1 SD; C) a box-and-whisker plot of the chicken mass data; and D) chicken mass as a function of both Diet and Time (0 and 21 days).
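As an illustration, a panel like C could be built with ggplot2 along these lines (a minimal sketch; the aesthetic choices are my own, not necessarily those used for Figure 4):

```r
library(ggplot2)

# Box-and-whisker plot of final-day (day 21) chick mass by diet,
# with the jittered raw observations overlaid
chicks_end <- subset(ChickWeight, Time == 21)
p <- ggplot(chicks_end, aes(x = Diet, y = weight)) +
  geom_boxplot(outlier.shape = NA) +  # raw points are shown separately
  geom_jitter(width = 0.1, alpha = 0.4) +
  labs(x = "Diet", y = "Chicken mass (g)")
p
```

Showing the raw points alongside the boxes keeps the sample sizes and spread visible, which a bar-and-whisker summary alone would hide.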

9 Reporting

We want to communicate our EDA in a report or publication. Here is how we would typically do it:

NoteWrite-Up

Methods

Chicken body mass was summarised descriptively by diet group at the start of the experiment (Time = 0) and again at day 21. Group means, standard deviations, and sample sizes were calculated, and complementary figures were used to visualise the spread of raw observations within each diet.

Results

At the start of the experiment, mean body mass was similar across diets (about 41 g in all four groups). By day 21, mean mass had increased in every diet group, but the increase differed among diets: Diet 3 had the highest mean final mass (270.3 ± 71.6 g SD, n = 10), followed by Diet 4 (238.6 ± 43.3 g, n = 9), Diet 2 (214.7 ± 78.1 g, n = 10), and Diet 1 (177.8 ± 58.7 g, n = 16). The grouped summaries and figures therefore suggest a strong diet-related difference in final body mass, with Diet 3 producing the heaviest chicks in this descriptive comparison.

Discussion

This is still a descriptive result rather than a formal inferential test, but it shows what grouped summaries reveal. Once the data are split by diet and time, biologically important differences become visible that would be hidden in a single overall mean.

ImportantDo It Now!

Select your favourite three continuous variables in the BCB7342 field trip set of data. Assess the grouping structure, and apply a full set of descriptive analyses. Describe your data’s distribution in light of your findings, and provide visual support.

10 Conclusion

Numerical summaries are the starting point for any serious data analysis. Measures of centre tell you where a distribution sits; measures of spread tell you how uncertain that location is; quantiles and structure-inspection tools tell you whether the data meet the assumptions you will rely on in later tests. None of these summaries is informative in isolation. For example, a mean without a measure of spread is incomplete, and a spread without context for the sample size can also be misleading. Used together, and always computed within the grouping structures that the study design imposes, they give you a first account of what your data contain.

The numerical summaries covered here are, however, only one half of exploratory data analysis. Tables of means and standard deviations alone can hide the shape of a distribution, mask outliers, and obscure the relationships between variables. Chapter 3 extends the picture by turning exploratory data analysis into graphical evidence. In that chapter, I will show you how to choose the right plot for each type of question and how to communicate your findings in a form that a scientific audience can immediately interpret.

Reuse

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {2. {Summarising} {Biological} {Data}},
  date = {2026-04-05},
  url = {https://tangledbank.netlify.app/BCB744/basic_stats/02-summarise-and-describe.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit AJ (2026) 2. Summarising Biological Data. https://tangledbank.netlify.app/BCB744/basic_stats/02-summarise-and-describe.html.