4. Data Distributions
- Getting familiar with data handling in R
- The concept of data distributions
All distribution generation and visualisation functions in this chapter are done using library(TidyDensity) and library(tidyverse).
Introduction
A good grasp of data distributions is a prerequisite of any statistical analysis. It enables us to describe and summarise the underlying patterns, trends, and variations in our data. This allows for more robust predictions and inferences about natural processes, such as outcomes of experiments or the structure of biodiversity. Conclusions stemming from our application of inferential statistics are defensible only if we understand and can justify the distribution of our data. In this chapter we will learn about the most common data distributions encountered in ecological research, and how this knowledge will help us to effectively apply statistical methods to analyse and interpret them.
What are data distributions?
Data distributions are fundamental views through which we understand how values disperse across datasets. They are a mathematical underpinning that transforms our basic numerical observations (which we obtain as samples taken at random to represent a population) into interpretable patterns of frequency, probability, and structural regularity. Distributions show us the underlying “shape” of variation itself and reveal how values are clustered symmetrically (or not) around a central tendency, are skewed toward the extremes, or exhibit multiple peaks that could suggest distinct groupings within the broader dataset.
We need to understand distributions because any collection of measurements (e.g. heights of individuals, rates of nutrient uptake, or counts of animals moving past a point) exhibits characteristic shapes when we plot their frequency against their value. These shapes may be normal, exponential, uniform, bimodal, etc., and are specific to the processes that generated the data. As such, distributions are a statistical link between the empirical observations and our claims about the underlying mechanisms. Our understanding of the data distribution and the process under scrutiny will therefore inform us of the statistical approaches most suited to extracting meaningful inference from them.
Discrete distributions
Discrete distributions form one of two fundamental branches in probability theory; the other is continuous distributions. This division rests on the nature of the sample space: discrete distributions assign probabilities to countable outcomes (integers, finite sets), while continuous distributions apply to uncountable intervals where individual point probabilities equal zero.
Discrete random variables have a finite or countable number of possible values and are the foundation of discrete probability distributions. Two core mathematical constraints govern how probabilities are assigned to the values of a discrete random variable:
- First, each probability value must fall within the closed interval [0,1]; no outcome can possess negative probability or exceed certainty.
- Second, the sum of all probabilities across the complete sample space must equal exactly 1.0; this ensures that the distribution accounts for all possible outcomes without omission or duplication.
In practice, these constraints translate into simple verification steps:
- When constructing or validating a discrete distribution, sum all probability values to confirm they equal unity (1).
- Inspect each individual probability to ensure none violates the boundary conditions ([0,1]).
Violations point to computational errors or incomplete specification of the sample space.
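As a quick sketch of these checks in R (using the Binomial PMF via dbinom(); the distribution and its parameters are chosen purely for illustration):
# PMF of a Binomial(n = 10, p = 0.3) over its full sample space
p <- dbinom(0:10, size = 10, prob = 0.3)
# check 1: every probability falls within [0, 1]
all(p >= 0 & p <= 1)  # TRUE
# check 2: probabilities across the complete sample space sum to 1
sum(p)  # 1 (up to floating-point error)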
The expected value
The expected value of a discrete random variable $X$ is the probability-weighted average of its outcomes:
$$E(X) = \sum_{x} x \, P(X = x)$$
For a fair six-sided die, $E(X) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$.
Variance calculation follows a two-step process. First, compute the expected value of the squared outcomes:
$$E(X^2) = \sum_{x} x^2 \, P(X = x)$$
Then apply the computational formula:
$$\text{Var}(X) = E(X^2) - [E(X)]^2$$
For the die example:
$$E(X^2) = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17, \qquad \text{Var}(X) = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 2.92$$
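The same die calculations can be verified numerically in R; this minimal sketch encodes the die's sample space and its uniform probabilities directly:
x <- 1:6          # outcomes of a fair six-sided die
p <- rep(1/6, 6)  # each outcome is equally probable
ex <- sum(x * p)    # E(X) = 3.5
ex2 <- sum(x^2 * p) # E(X^2) = 91/6, approx. 15.17
ex2 - ex^2          # Var(X) = 35/12, approx. 2.92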
Below I provide options to generate and visualise data belonging to several classes of discrete distributions. In Chapter 6 we will learn how to transform these data prior to performing the appropriate statistical analysis.
Bernoulli and Binomial distributions
The Bernoulli and Binomial distributions belong to what might be termed the “trial-based” subfamily of discrete distributions: they directly model outcomes of repeated experiments. Trial-based distributions require specifying both the number of trials and success probability.
Bernoulli and Binomial distributions are both discrete probability distributions that describe the outcomes of binary events. They are similar, but there are also some key differences between the two. In the real-life examples encountered in ecology and biology we will mostly encounter the Binomial distribution. Let us consider each in more detail.
Bernoulli distribution The Bernoulli distribution represents a single binary trial or experiment with only two possible outcomes: ‘success’ (usually represented as 1) and ‘failure’ (usually represented as 0). The probability of success is denoted by $p$.
The Bernoulli distribution:
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
where $p$ is the probability of success and $x$ is the outcome of the single trial.
Binomial distribution The Binomial distribution represents the sum of outcomes in a fixed number of independent Bernoulli trials with the same probability of success, $p$.
The Binomial distribution:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n$$
where $n$ is the number of trials, $k$ is the number of successes, $p$ is the probability of success in each trial, and $\binom{n}{k}$ is the binomial coefficient.
In practice, determine $n$ from the design of the study (the fixed number of trials) and $p$ from the per-trial probability of success.
There are several examples of Binomial distributions in ecological and biological contexts. The Binomial distribution is relevant when studying the number of successes in a fixed number of independent trials, each with the same probability of success. A few examples of the Binomial distribution:
Seed germination Suppose we plant 100 seeds of a particular plant species and want to know the probability of a certain number of seeds germinating. If the probability of germination for each seed is constant, then we can model the number of germinated seeds by a Binomial distribution.
Disease prevalence An epidemiologist studies the prevalence of a disease within a population. For a random sample of 500 individuals, and with a fixed probability of an individual having the disease, the number of infected individuals in the sample can be modeled using a Binomial distribution.
Species occupancy We do an ecological assessment to determine the occupancy of bird species across 50 habitat patches. If the probability of the species occupying a patch is the same across all patches, the number of patches occupied by the species will follow a Binomial distribution.
Allele inheritance We want to examine the inheritance of a specific trait following Mendelian inheritance patterns. If the probability of inheriting the dominant allele for a given gene is constant, the number of offspring with the dominant trait in a fixed number of offspring follows the Binomial distribution.
Note that in these examples we assume a fixed probability and independence between trials, and this is not always true in real-world situations.
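To make the seed germination example concrete, here is a minimal base R sketch (the germination probability of 0.8 is an assumption chosen for illustration; rbinom() simulates the counts and dbinom() gives exact probabilities):
set.seed(13)
# number of germinating seeds in each of 1000 hypothetical plantings
# of 100 seeds, with a constant germination probability of 0.8
germinated <- rbinom(n = 1000, size = 100, prob = 0.8)
hist(germinated, main = "Binomial: germinating seeds out of 100")
# exact probability that exactly 85 of the 100 seeds germinate
dbinom(85, size = 100, prob = 0.8)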
Negative Binomial and Geometric distributions
The Geometric and Negative Binomial distributions form a “waiting time” subfamily, focusing on the number of trials preceding specified success patterns. These waiting-time distributions are parameterised by the probability of success and the target number of successes.
Negative Binomial distribution A Negative Binomial random variable, $X$, counts the number of failures observed before the $r$-th success in a sequence of independent Bernoulli trials, each with probability of success $p$.
The Negative Binomial distribution:
$$P(X = k) = \binom{k + r - 1}{k} p^r (1 - p)^k, \quad k = 0, 1, 2, \ldots$$
The equation describes the probability mass function (PMF) of a Negative Binomial distribution, where $r$ is the target number of successes, $p$ is the per-trial probability of success, and $k$ is the number of failures before the $r$-th success.
Geometric distribution A Geometric random variable, $X$, counts the number of failures before the first success in a sequence of independent Bernoulli trials.
The Geometric distribution:
$$P(X = k) = (1 - p)^k p, \quad k = 0, 1, 2, \ldots$$
The equation represents the PMF of a Geometric distribution, where $p$ is the per-trial probability of success and $k$ is the number of failures before the first success; it is the special case of the Negative Binomial distribution with $r = 1$.
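A minimal sketch of these waiting-time distributions in base R; note that rnbinom() and rgeom() count failures before the target success(es), matching the PMFs above, and the parameter values are illustrative:
set.seed(13)
# failures before the 5th success, with per-trial success probability 0.3
nb <- rnbinom(n = 1000, size = 5, prob = 0.3)
hist(nb, main = "Negative Binomial: failures before 5th success")
# failures before the 1st success, with per-trial success probability 0.3
geo <- rgeom(n = 1000, prob = 0.3)
hist(geo, main = "Geometric: failures before 1st success")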
Poisson distribution
The Poisson distribution represents a “rate-based” subfamily, which is used to model count phenomena without an explicit trial structure. It involves the average occurrence rate over a specified interval.
A Poisson random variable, $X$, counts the number of events occurring in a fixed interval of time or space.
The Poisson distribution:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
The function represents the PMF of a Poisson distribution, where $\lambda$ is the average number of events per interval and $k$ is the observed count.
For practical application, identify $\lambda$ as the mean rate of occurrence over the interval of interest (e.g. the average number of animals passing a survey point per hour).
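As a sketch, counts such as the number of animals passing a survey point per hour can be simulated with rpois() (the rate of 4 events per hour is an assumption for illustration):
set.seed(13)
# simulated counts per hour, with an average rate of 4 sightings per hour
counts <- rpois(n = 1000, lambda = 4)
hist(counts, main = "Poisson: sightings per hour")
# probability of observing exactly 6 sightings in an hour
dpois(6, lambda = 4)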
Hypergeometric distribution
The Hypergeometric distribution occupies a distinct position within the discrete distribution taxonomy: we might term it the “finite population sampling” subfamily.
The Hypergeometric distribution models sampling without replacement from a finite population containing two types of items. Unlike Binomial sampling, each draw changes the composition of the remaining items.
The Hypergeometric distribution:
$$P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$$
where $N$ is the population size, $K$ is the number of ‘success’ items in the population, $n$ is the number of draws, and $k$ is the number of successes among the draws.
Consider drawing 5 cards from a deck without replacement, seeking exactly 2 hearts. Here $N = 52$, $K = 13$, $n = 5$, and $k = 2$.
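The card example translates directly into dhyper(), where m is the number of hearts in the deck, n the number of non-hearts, and k the number of cards drawn; the second line checks the PMF by hand:
# P(exactly 2 hearts in 5 cards drawn without replacement)
dhyper(x = 2, m = 13, n = 39, k = 5)
# the same probability from the PMF directly
choose(13, 2) * choose(39, 3) / choose(52, 5)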
To do: insert figures.
Continuous distributions
Normal distribution
The Normal distribution:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
Another name for this kind of distribution is a Gaussian distribution. A random sample with a Gaussian distribution is normally distributed. These values are independent and identically distributed random variables (i.i.d.), and they have an expected mean given by $\mu$ and a variance given by $\sigma^2$.
The Central Limit Theorem (CLT) is a fundamental result in probability theory and statistics, which states that the distribution of the sum (or average) of a large number of independent, identically distributed (i.i.d.) random variables approaches a Normal distribution regardless of the shape of the original distribution. So, the CLT asserts that the Normal distribution is the limiting distribution for the sum or average of many random variables, as long as certain conditions are met.
The CLT provides a basis for making inferences about population parameters using sample statistics. For example, when dealing with large sample sizes, the sampling distribution of the sample mean is approximately normally distributed, even if the underlying population distribution is not normal. This allows us to apply inferential techniques based on the Normal distribution, such as hypothesis testing and constructing confidence intervals, to estimate population parameters using sample data.
Some conditions must be met for the CLT to be true:
- The random variables must be independent The observations should not be influenced by one another.
- The random variables must be identically distributed They must come from the same population with the same mean and variance.
- The number of random variables (sample size) must be sufficiently large Although there is no strict rule for the sample size, a common rule of thumb is that the sample size should be at least 30 for the CLT to be a reasonable approximation.
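We can see the CLT in action with a short simulation: means of samples drawn from a clearly non-normal (exponential) population pile up into an approximately normal shape. The sample sizes and rate below are illustrative:
set.seed(13)
# 1000 means, each of a sample of 30 draws from a skewed population
means <- replicate(1000, mean(rexp(30, rate = 0.7)))
par(mfrow = c(1, 2))
hist(rexp(1000, rate = 0.7), main = "Skewed (exponential) population")
hist(means, main = "Means of samples of n = 30")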
Uniform distribution
The continuous uniform distribution is sometimes called a rectangular distribution. Simply put, every value within the distribution’s interval is equally probable. This is what we typically mean by ‘random numbers’ (Figure 9).
Student T distribution
This is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. It is used in testing the statistical significance of differences between the means of different sets of samples, and not so much in modelling many kinds of experiments or observations (Figure 10).
Chi-squared distribution
This distribution is used mostly in hypothesis testing, and seldom to describe the distribution of data representing natural phenomena (Figure 11).
Exponential distribution
This is a probability distribution that describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate (Figure 12).
F distribution
This is a probability distribution that arises in the context of the analysis of variance (ANOVA) and regression analysis. It is used to compare the variances of two populations (Figure 13).
Gamma distribution
This is a two-parameter family of continuous probability distributions. It is used to model the time until an event occurs. It is a generalisation of the exponential distribution (Figure 14).
Beta distribution
This is a family of continuous probability distributions defined on the interval [0, 1] parameterised by two positive shape parameters, typically denoted by α and β. It is used to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines (Figure 15).
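Until the figures are in place, the density curves of the continuous distributions discussed above can be sketched with base R’s d*() functions; all parameter values below are illustrative:
# density curves of the continuous distributions discussed above
par(mfrow = c(2, 4))
curve(dnorm(x, mean = 13, sd = 2), 5, 21, main = "Normal(13, 2)")
curve(dunif(x, min = 0, max = 1), 0, 1, main = "Uniform(0, 1)")
curve(dt(x, df = 5), -4, 4, main = "Student's t, df = 5")
curve(dchisq(x, df = 3), 0, 10, main = "Chi-squared, df = 3")
curve(dexp(x, rate = 0.7), 0, 8, main = "Exponential, rate = 0.7")
curve(df(x, df1 = 5, df2 = 20), 0, 5, main = "F(5, 20)")
curve(dgamma(x, shape = 2, rate = 1), 0, 10, main = "Gamma(2, 1)")
curve(dbeta(x, shape1 = 2, shape2 = 5), 0, 1, main = "Beta(2, 5)")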
Finding one’s data distribution
Data belonging to a sample will never exactly follow a specific distribution, even when a test for normality says it does; there will always be a small probability that the data are non-normal and are in fact better described by some other distribution. In other words, data are only compatible with a certain distribution, and one can never answer the question “Does my data follow the distribution xy exactly?” with a simple yes/no answer. So what now? How does one find one’s data distribution? We can use the Cullen and Frey graph, produced by the descdist() function that lives in the fitdistrplus package. This graph tells us whether the skewness and kurtosis of our data are consistent with those of a particular distribution. We will demonstrate by generating various data distributions and testing them using the Cullen and Frey graph.
library(fitdistrplus)
library(logspline)
# example data, approximately log-normally distributed
y <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
plot(x = c(1:length(y)), y = y)
hist(y)
descdist(y, discrete = FALSE, boot = 100)
# normally distributed data
y <- rnorm(100, 13, 2)
plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)
# uniformly distributed data
y <- runif(100)
plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)
# exponentially distributed data
y <- rexp(100, 0.7)
plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)
There are also several other approaches we can use to try to identify the data distribution. Let us start with the gold standard: normal data. We will demonstrate some visualisation approaches. The one that you already know is a basic histogram; it tells us something about the distribution’s skewness, the tails, the mode(s) of the data, outliers, etc. Histograms can be compared to the shapes associated with idealised (simulated) distributions, as we do below.
y <- rnorm(n = 200, mean = 13, sd = 2)
par(mfrow = c(2, 2))
# using some basic base graphics as ggplot2 is overkill;
# we can get a histogram using hist()
hist(y, main = "Histogram of observed data")
plot(density(y), main = "Density estimate of data")
plot(ecdf(y), main = "Empirical cumulative distribution function")
# standardise the data
z.norm <- (y - mean(y)) / sd(y)
# make a QQ plot of the standardised data
qqnorm(z.norm)
# add a 45-degree reference line
abline(0, 1)
Above we have also added a diagonal line to the QQ plot. If the sampled data come from a population with the chosen distribution, the points should fall approximately along this reference line. The greater the departure from the reference line, the greater the evidence that the data come from a population with a different distribution.
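Beyond these visual checks, the fitdistrplus package can also fit candidate distributions to the data and compare them formally. This is a minimal sketch, fitting two illustrative candidates to the normal data generated above:
# fit two candidate distributions to the data generated above
fit_norm <- fitdist(y, "norm")
fit_gamma <- fitdist(y, "gamma")
# goodness-of-fit statistics (AIC, BIC, KS, etc.) for both candidates
gofstat(list(fit_norm, fit_gamma))
# diagnostic plots (density, QQ, CDF, PP) for the better candidate
plot(fit_norm)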