4. Data Distributions

Getting familiar with data handling in R

Published

January 1, 2021

In this Chapter
  • The concept of data distributions
Tasks to complete in this Chapter
  • None

Introduction

A good grasp of data distributions is a prerequisite for any statistical analysis. It enables us to describe and summarise the underlying patterns, trends, and variations in our data, and in doing so it allows for more robust predictions and inferences about natural processes, such as the outcomes of experiments or the structure of biodiversity. By understanding the distribution of our data, we can select the most appropriate statistical techniques and ensure the validity of our conclusions. Only by applying the correct inferential statistics can we develop robust and defensible insights about our research. In this Chapter we will learn about the most common data distributions encountered in ecological research, and how this knowledge empowers us to apply statistical methods effectively to analyse and interpret our data.

Discrete distributions

A discrete random variable has a finite or countable number of possible values; as the name suggests, it models integer (count) data. Below we provide options to generate and visualise data belonging to several classes of discrete distributions. In Chapter 6 we will learn how to transform these data prior to performing the appropriate statistical analysis.

Bernoulli and binomial distributions

Bernoulli and Binomial distributions are both discrete probability distributions that describe the outcomes of binary events. They are similar, but there are some key differences between the two. In real-life examples encountered in ecology and biology we will mostly encounter the Binomial distribution. Let us consider each in more detail.

Bernoulli distribution The Bernoulli distribution represents a single binary trial or experiment with only two possible outcomes—‘success’ (usually represented as 1) and ‘failure’ (usually represented as 0). The probability of success is denoted by \(p\), while the probability of failure is \(1 - p\). A Bernoulli distribution is characterised by only one parameter, \(p\), which represents the probability of success for the single trial (Equation 1) (Figure 1, Figure 2).

The Bernoulli distribution: \[ P(X=k) = \begin{cases} p, & \text{if } k=1 \\ 1-p, & \text{if } k=0 \end{cases} \tag{1}\]

where \(X\) is a random variable, \(k\) is the outcome (1 for success and 0 for failure), and \(p\) is the probability of success.

Figure 1: Bernoulli distribution with 50 trials and 20 simulations set at a probability of p = 0.2. This could be a heavily loaded coin that has a 20% chance of landing on heads.
Figure 2: Bernoulli distribution with 50 trials and 20 simulations set at a probability of p = 0.5. This represents an unbiased coin that has an equal chance of landing on heads or tails.
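
The code used to produce Figure 1 and Figure 2 is not shown here, but a minimal sketch of how Bernoulli trials might be simulated in R is given below; rbinom() with size = 1 returns single binary trials, and the success probability of 0.2 is an assumed value echoing Figure 1:

# simulate 50 Bernoulli trials (size = 1) with an assumed success probability of 0.2
set.seed(13)
bern <- rbinom(n = 50, size = 1, prob = 0.2)
bern
# tabulate and plot the outcomes (0 = failure, 1 = success)
barplot(table(bern), main = "Bernoulli trials, p = 0.2")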

Binomial distribution The Binomial distribution represents the sum of outcomes in a fixed number of independent Bernoulli trials with the same probability of success, \(p\). It is characterised by two parameters—\(n\) (the number of trials) and \(p\) (the probability of success in each trial). The Binomial distribution describes the probability of obtaining a specific number of successes (\(k\)) in \(n\) trials (Equation 2) (Figure 3).

The Binomial distribution: \[P(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \tag{2}\]

where \(X\) is a random variable, \(k\) is the number of successes, \(n\) is the total number of trials, and \(p\) is the probability of success in each trial.

Figure 3: Binomial distribution with 40 trials and 10 simulations at an a priori set probability of p = 0.5.
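
As a minimal sketch (not necessarily the code behind Figure 3), Binomial data can be simulated with rbinom(); the 1000 repetitions of 40 trials at p = 0.5 below are assumed values chosen so that the histogram is reasonably smooth:

# each value is the number of successes in 40 trials with p = 0.5,
# repeated for 1000 simulated experiments
set.seed(13)
binom <- rbinom(n = 1000, size = 40, prob = 0.5)
hist(binom, main = "Binomial: successes out of 40 trials, p = 0.5")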

The Binomial distribution is relevant in many ecological and biological contexts: it applies whenever we study the number of successes in a fixed number of independent trials, each with the same probability of success. A few examples of the Binomial distribution:

  • Seed germination Suppose we plant 100 seeds of a particular plant species and want to know the probability of a certain number of seeds germinating. If the probability of germination for each seed is constant, then we can model the number of germinated seeds by a Binomial distribution.

  • Disease prevalence An epidemiologist studies the prevalence of a disease within a population. For a random sample of 500 individuals, and with a fixed probability of an individual having the disease, the number of infected individuals in the sample can be modelled using a Binomial distribution.

  • Species occupancy We do an ecological assessment to determine the occupancy of a bird species across 50 habitat patches. If the probability of the species occupying a patch is the same across all patches, the number of patches occupied by the species will follow a Binomial distribution.

  • Allele inheritance We want to examine the inheritance of a specific trait following Mendelian inheritance patterns. If the probability of inheriting the dominant allele for a given gene is constant, the number of offspring with the dominant trait in a fixed number of offspring follows the Binomial distribution.

Note that in these examples we assume a fixed probability of success and independence between trials, which will not always be true in real-world situations.
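
To make the seed germination example concrete, here is a small worked sketch using dbinom() and pbinom(); the germination probability of 0.6 is an assumed value for illustration only:

# assumed per-seed germination probability and number of seeds planted
p <- 0.6
n <- 100
# probability that exactly 60 of the 100 seeds germinate
dbinom(60, size = n, prob = p)
# probability that at least 70 seeds germinate
pbinom(69, size = n, prob = p, lower.tail = FALSE)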

Negative Binomial distribution

A Negative Binomial random variable, \(X\), counts the number of failures that occur in a sequence of independent Bernoulli trials, each with success probability \(p\), before \(r\) successes are observed. This distribution could, for example, be used to predict the number of tails that result from a series of coin tosses before three heads are observed (Equation 3) (Figure 4).

The Negative Binomial distribution: \[P(X=k) = \binom{k+r-1}{k} p^r (1-p)^k \tag{3}\]

The equation describes the probability mass function (PMF) of a Negative Binomial distribution, where \(X\) is a random variable, \(k\) is the number of failures, \(r\) is the number of successes, and \(p\) is the probability of success in each trial. The binomial coefficient is denoted by \(\binom{k+r-1}{k}\), which calculates the number of ways to arrange \(k\) failures and \(r\) successes such that the last trial is a success.

Figure 4: A negative binomial distribution with 50 trials and 10 simulations at an a priori expectation of p = 0.75.
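
A minimal sketch (not necessarily the code behind Figure 4) using rnbinom(), which in R's parameterisation returns the number of failures observed before size successes occur; the values below are assumptions:

# number of failures before 3 successes, with success probability 0.75,
# repeated for 100 simulated sequences
set.seed(13)
nbinom <- rnbinom(n = 100, size = 3, prob = 0.75)
hist(nbinom, main = "Negative Binomial: failures before 3 successes, p = 0.75")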

Geometric distribution

A Geometric random variable, \(X\), represents the number of failures that occur before the first success is observed. Each trial is independent and has success probability \(p\). As an example, the Geometric distribution is useful for modelling the number of unsuccessful rolls of a die before a six is observed (Equation 4) (Figure 5).

The Geometric distribution: \[P(X=k) = p (1-p)^k \tag{4}\]

The equation represents the PMF of a Geometric distribution, where \(X\) is a random variable, \(k\) is the number of failures before the first success, and \(p\) is the probability of success in each trial. The Geometric distribution can be thought of as a special case of the Negative Binomial distribution with \(r = 1\), which models the number of failures before achieving a single success.

Figure 5: A geometric distribution with 50 trials and 10 simulations at an a priori expectation of p = 0.75.
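
A minimal sketch (not necessarily the code behind Figure 5) using rgeom(), which returns the number of failures before the first success; the success probability of 1/6 (rolling a six) is assumed to echo the die example:

# number of non-six rolls before the first six, repeated 100 times
set.seed(13)
geom <- rgeom(n = 100, prob = 1/6)
hist(geom, main = "Geometric: failures before the first six")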

Poisson distribution

A Poisson random variable, \(X\), tallies the number of events occurring in a fixed interval of time or space, given that these events occur with an average rate \(\lambda\). Poisson distributions can be used to model counts of events such as the number of meteors observed per hour or the number of people entering a shopping mall per minute (Equation 5) (Figure 6).

The Poisson distribution: \[P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!} \tag{5}\]

The function represents the PMF of a Poisson distribution, where \(X\) is a random variable, \(k\) is the number of events or occurrences, and \(\lambda\) (lambda) is the average rate of occurrences (events per unit of time or space). The constant \(e\) is the base of the natural logarithm, and \(k!\) is the factorial of \(k\). The Poisson distribution is commonly used to model the number of events occurring within a fixed interval of time or space when events occur independently and at a constant average rate.

Figure 6: A Poisson distribution with 50 trials and 10 simulations at an a priori set rate of λ = 0.75.
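
A minimal sketch (not necessarily the code behind Figure 6) using rpois(); the rate of λ = 4 events per interval is an assumed value for illustration:

# counts of events per interval, given an assumed average rate of 4 events per interval
set.seed(13)
pois <- rpois(n = 100, lambda = 4)
hist(pois, main = "Poisson counts, lambda = 4")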

Continuous distributions

Normal distribution

Figure 7: The idealised Normal distribution showing the proportion of data within 1, 2, and 3SD from the mean.

Another name for this kind of distribution is the Gaussian distribution. A random sample drawn from a Gaussian distribution is normally distributed. The values are independent and identically distributed (i.i.d.) random variables with an expected mean given by \(\mu\) (or \(\bar{x}\) in Chapter 3.2.1) and a finite variance given by \(\sigma^{2}\) (or \(S^{2}\) in Chapter 3.3.1). If the sample drawn from a population is sufficiently large, the estimated mean and SD will be practically indistinguishable from those of the population (see the Central Limit Theorem, below). The Normal distribution is represented by Equation 6 (Figure 8).

The Normal distribution: \[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 } \tag{6}\]

where \(x\) is a continuous random variable, \(\mu\) (mu) is the mean, and \(\sigma\) (sigma) is the standard deviation. The constant factor \(\frac{1}{\sigma\sqrt{2\pi}}\) ensures that the Probability Density Function (PDF) integrates to 1, and the exponential term is responsible for the characteristic bell-shaped curve of the Normal distribution.

Figure 8: Normal distribution with 40 trials and 5 simulations.
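
A minimal sketch (not necessarily the code behind Figure 8) using rnorm(); the mean of 13 and SD of 2 are assumed values, the same ones used in the examples further down:

# 200 normally distributed values with an assumed mean of 13 and SD of 2
set.seed(13)
norm_dat <- rnorm(n = 200, mean = 13, sd = 2)
hist(norm_dat, main = "Normal, mean = 13, SD = 2")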
Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental result in probability theory and statistics, which states that the distribution of the sum (or average) of a large number of independent, identically distributed (IID) random variables approaches a Normal distribution regardless of the shape of the original distribution. So, the CLT asserts that the Normal distribution is the limiting distribution for the sum or average of many random variables, as long as certain conditions are met.

The CLT provides a basis for making inferences about population parameters using sample statistics. For example, when dealing with large sample sizes, the sampling distribution of the sample mean is approximately normally distributed, even if the underlying population distribution is not normal. This allows us to apply inferential techniques based on the Normal distribution, such as hypothesis testing and constructing confidence intervals, to estimate population parameters using sample data.

Some conditions must be met for the CLT to be true:

  • The random variables must be independent The observations should not be influenced by one another.
  • The random variables must be identically distributed They must come from the same population with the same mean and variance.
  • The number of random variables (sample size) must be sufficiently large Although there is no strict rule for the sample size, a common rule of thumb is that the sample size should be at least 30 for the CLT to be a reasonable approximation.
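
A small simulation sketch can illustrate the CLT: although individual exponential values are strongly skewed, the means of repeated samples drawn from them are approximately normally distributed. The sample size of 30, the rate of 0.7, and the 1000 repetitions are assumptions chosen for illustration:

set.seed(13)
# 1000 samples of size 30 drawn from a skewed (exponential) distribution
sample_means <- replicate(1000, mean(rexp(n = 30, rate = 0.7)))
par(mfrow = c(1, 2))
hist(rexp(1000, rate = 0.7), main = "Raw exponential values")
hist(sample_means, main = "Means of samples of size 30")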

Uniform distribution

The continuous Uniform distribution is sometimes called a rectangular distribution. Simply put, it states that all values within the distribution's range are equally probable. This is what is usually meant by 'random numbers' (Figure 9).

Figure 9: Uniform distribution with 100 trials and 5 simulations.
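
A minimal sketch (not necessarily the code behind Figure 9) using runif(); the range of 0 to 10 is an assumed interval:

# 100 values drawn uniformly between an assumed minimum of 0 and maximum of 10
set.seed(13)
unif <- runif(n = 100, min = 0, max = 10)
hist(unif, main = "Uniform on [0, 10]")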

Student T distribution

This is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. It is used for testing the statistical significance of differences between the means of sets of samples, and not so much for modelling experiments or observations (Figure 10).

Figure 10: A Student's t distribution with 100 trials and 5 simulations.
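
A minimal sketch (not necessarily the code behind Figure 10) using rt(); the 5 degrees of freedom are an assumed value:

# 100 values from a t distribution with an assumed 5 degrees of freedom
set.seed(13)
t_dat <- rt(n = 100, df = 5)
hist(t_dat, main = "Student's t, df = 5")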

Chi-squared distribution

This distribution is mostly used in hypothesis testing (for example, in tests of independence and goodness-of-fit), and seldom to describe the distribution of data representing natural phenomena (Figure 11).

Figure 11: A Chi-squared distribution with 100 trials and 5 simulations.
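
A minimal sketch (not necessarily the code behind Figure 11) using rchisq(); the 3 degrees of freedom are an assumed value:

# 100 values from a Chi-squared distribution with an assumed 3 degrees of freedom
set.seed(13)
chisq_dat <- rchisq(n = 100, df = 3)
hist(chisq_dat, main = "Chi-squared, df = 3")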

Exponential distribution

This is a probability distribution that describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate (Figure 12).

Figure 12: An exponential distribution with 100 trials and 5 simulations.
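
A minimal sketch (not necessarily the code behind Figure 12) using rexp(); the rate of 0.7 is an assumed value:

# 100 waiting times with an assumed rate of 0.7 events per unit time
set.seed(13)
exp_dat <- rexp(n = 100, rate = 0.7)
hist(exp_dat, main = "Exponential, rate = 0.7")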

F distribution

This is a probability distribution that arises in the context of the analysis of variance (ANOVA) and regression analysis. It is used to compare the variances of two populations (Figure 13).

Figure 13: An F distribution with 100 trials and 5 simulations.
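
A minimal sketch (not necessarily the code behind Figure 13) using rf(); the degrees of freedom of 5 and 20 are assumed values:

# 100 values from an F distribution with assumed degrees of freedom 5 and 20
set.seed(13)
f_dat <- rf(n = 100, df1 = 5, df2 = 20)
hist(f_dat, main = "F, df1 = 5, df2 = 20")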

Gamma distribution

This is a two-parameter family of continuous probability distributions. It is frequently used to model waiting times, such as the time until a given number of events occur, and it is a generalisation of the Exponential distribution (Figure 14).

Figure 14: A Gamma distribution with 100 trials and 5 simulations.
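
A minimal sketch (not necessarily the code behind Figure 14) using rgamma(); the shape of 2 and scale of 1.5 are assumed values:

# 100 values from a Gamma distribution with assumed shape 2 and scale 1.5
set.seed(13)
gamma_dat <- rgamma(n = 100, shape = 2, scale = 1.5)
hist(gamma_dat, main = "Gamma, shape = 2, scale = 1.5")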

Beta distribution

This is a family of continuous probability distributions defined on the interval [0, 1] parameterised by two positive shape parameters, typically denoted by α and β. It is used to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines (Figure 15).

Figure 15: A Beta distribution with 100 trials and 5 simulations.
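
A minimal sketch (not necessarily the code behind Figure 15) using rbeta(); the shape parameters of 2 and 5 are assumed values:

# 100 values on [0, 1] from a Beta distribution with assumed shapes 2 and 5
set.seed(13)
beta_dat <- rbeta(n = 100, shape1 = 2, shape2 = 5)
hist(beta_dat, main = "Beta, shape1 = 2, shape2 = 5")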

Finding one’s data distribution

Data belonging to a sample will never exactly follow a specific distribution, even when a test for normality says it does; there will always be a small probability that the data are non-normal and are in fact better described by some other distribution. In other words, data are only compatible with a certain distribution, and we can never answer the question "Do my data follow the distribution xy exactly?" with a simple yes or no. So what now? How does one find one's data distribution? We can use the Cullen and Frey graph produced by the descdist() function in the fitdistrplus package. This graph tells us whether the skewness and kurtosis of our data are consistent with those of a particular distribution. We will demonstrate by generating various data distributions and testing them using the Cullen and Frey graph.

library(fitdistrplus)
library(logspline)

# A fixed sample of observed values to examine
y <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

plot(x = c(1:length(y)), y = y)
hist(y)
descdist(y, discrete = FALSE, boot = 100)

summary statistics
------
min:  16.62   max:  54.15 
median:  40.38 
mean:  40.28434 
estimated sd:  7.420034 
estimated skewness:  -0.551717 
estimated kurtosis:  3.565162 
# normally distributed data
y <- rnorm(100, 13, 2)

plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)

summary statistics
------
min:  8.606167   max:  17.69996 
median:  12.99119 
mean:  12.86674 
estimated sd:  1.967107 
estimated skewness:  0.04159819 
estimated kurtosis:  2.855855 
# uniformly distributed data
y <- runif(100)

plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)

summary statistics
------
min:  0.0020474   max:  0.996361 
median:  0.48258 
mean:  0.4982293 
estimated sd:  0.2975505 
estimated skewness:  0.08480085 
estimated kurtosis:  1.763882 
# exponentially distributed data
y <- rexp(100, 0.7)

plot(x = c(1:100), y = y)
hist(y)
descdist(y, discrete = FALSE)

summary statistics
------
min:  0.001621721   max:  8.886 
median:  1.067032 
mean:  1.598883 
estimated sd:  1.618668 
estimated skewness:  2.03815 
estimated kurtosis:  8.547846 

There are also several other approaches we can use to try to identify the data distribution. Let us start with the gold standard first: normal data. We will demonstrate some visualisation approaches. The one that you already know is a basic histogram; it tells us something about the distribution's skewness, the tails, the mode(s) of the data, outliers, etc. Histograms can be compared to the shapes associated with idealised (simulated) distributions, as we will do here.

y <- rnorm(n = 200, mean = 13, sd = 2)
par(mfrow = c(2, 2))
# using some basic base graphics, as ggplot2 is overkill here;
# we can get a histogram with the hist() function
hist(y, main = "Histogram of observed data")
plot(density(y), main = "Density estimate of data")
plot(ecdf(y), main = "Empirical cumulative distribution function")
# standardise the data
z.norm <- (y - mean(y)) / sd(y) 
# make a qqplot
qqnorm(z.norm)
# add a 45-degree reference line
abline(0, 1)

Above we have also added a diagonal line to the qqplot. If the sampled data come from a population with the chosen distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the data come from a population with a different distribution.

# compare against some idealised density curves
curve(dnorm(x, mean = 10, sd = 2), from = 0, to = 20, main = "Normal distribution")
curve(dgamma(x, scale = 1.5, shape = 2), from = 0, to = 15, main = "Gamma distribution")
curve(dweibull(x, scale = 2.5, shape = 1.5), from = 0, to = 15, main = "Weibull distribution")
