1 Introduction
A good grasp of data distributions is a prerequisite for any statistical analysis. Distributions help us describe the underlying patterns, trends, and variation in our data. They also guide the choice of models and tests. Inferential conclusions are only defensible if we understand, at least broadly, the kind of variation our data represent.
This chapter links three ideas that belong together:
the distribution of observations,
the distribution of sample statistics, and
the uncertainty around the quantities we estimate from data.
2 Key Concepts
The chapter is organised around the following linked ideas.
Data distributions describe the shape, spread, and structure of observed variables.
Random variables may be discrete or continuous, and that distinction affects model choice.
Sampling variation means sample summaries differ from one sample to the next.
The central limit theorem explains why sample means often become approximately normal.
Confidence intervals express uncertainty around estimated quantities.
3 What a Distribution Tells Us
A distribution describes the pattern of possible values for a variable and how frequently or how probably those values occur. When we plot data, we often ask:
Where is the centre?
How spread out are the values?
Is the distribution symmetric or skewed?
Are there heavy tails or outliers?
Does the dataset suggest one group or several?
These features are not cosmetic. They affect what summary statistics are useful, what assumptions are plausible, and what models are appropriate.
4 Discrete and Continuous Variables
Probability distributions are usually divided into two broad classes.
4.1 Discrete distributions
Discrete variables take countable values such as 0, 1, 2, and so on. These arise when we count events, individuals, or successes.
Common biological examples include:
the number of offspring in a brood,
the number of infected individuals in a sample,
the number of seeds that germinate, and
the number of birds detected in a transect.
4.2 Continuous distributions
Continuous variables can take any value within a range. These arise when we measure quantities such as length, mass, temperature, or nutrient concentration.
Common examples include:
plant height,
chlorophyll concentration,
growth rate, and
body temperature.
5 Common Distributions in Biology
Biological processes often leave a signature in the distribution of the data. A few distributions appear repeatedly in ecological and biological work.
5.1 Normal distribution
The normal distribution is symmetric and bell-shaped. It is often a reasonable approximation for continuous traits influenced by many small contributing factors. Many classical methods rely on normality either in the data themselves or, more often, in the residuals of a model.
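As a sketch (simulated data, not part of the chapter's own code), a trait built up from many small independent contributions ends up approximately normal:

```r
# Simulate a continuous trait as the sum of 50 small independent effects.
# By the additive argument above, the result is approximately normal.
set.seed(1)
trait <- replicate(10000, sum(runif(50, min = -1, max = 1)))

mean(trait)  # close to 0, the expected value of each sum
sd(trait)    # close to sqrt(50 / 3) ~ 4.08, since Var of Unif(-1, 1) is 1/3
```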
5.2 Binomial distribution
The binomial distribution describes the number of successes in a fixed number of independent trials with constant success probability. It is useful for proportions, occupancy, germination, and similar binary outcomes.
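A small simulated sketch shows the binomial in action (the tray size of 20 seeds and germination probability of 0.7 are made-up numbers):

```r
# Hypothetical germination trial: 20 seeds per tray, each germinating
# independently with probability 0.7.
set.seed(42)
germinated <- rbinom(1000, size = 20, prob = 0.7)

mean(germinated)                   # close to the expected count 20 * 0.7 = 14
dbinom(14, size = 20, prob = 0.7)  # probability of exactly 14 germinating
```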
5.3 Poisson distribution
The Poisson distribution describes counts of events in a fixed interval of time or space. It often appears in abundance or event-count data, especially when counts are low and skewed.
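A quick simulated check (the rate of 2 birds per transect is an assumed value) illustrates the Poisson's defining property, equal mean and variance:

```r
# Hypothetical transect counts: birds detected at an average rate of 2 per transect.
set.seed(7)
counts <- rpois(5000, lambda = 2)

mean(counts)  # near 2
var(counts)   # also near 2: for a Poisson, mean and variance are equal
```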
5.4 Negative binomial distribution
The negative binomial distribution is often used for count data that are more variable than a Poisson model allows. This is common in ecological field data where aggregation and heterogeneity are strong.
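The overdispersion is easy to demonstrate by simulation (the mean of 2 and size of 0.5 are illustrative values):

```r
# Aggregated counts: negative binomial with mean 2 but extra variance.
# For this parameterisation, variance = mu + mu^2 / size.
set.seed(11)
nb <- rnbinom(5000, mu = 2, size = 0.5)

mean(nb)  # near 2, the same mean as the Poisson example
var(nb)   # near 2 + 2^2 / 0.5 = 10, well above the Poisson variance of 2
```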
5.5 Log-normal and gamma distributions
These are useful for positive, right-skewed continuous variables such as biomass, time-to-event measurements, and concentration data.
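A log-normal sketch (the meanlog and sdlog values are chosen arbitrarily) shows the characteristic right skew of such variables:

```r
# Simulated biomass-like data: strictly positive and right-skewed on the
# raw scale, approximately normal after log transformation.
set.seed(3)
biomass <- rlnorm(5000, meanlog = 1, sdlog = 0.6)

min(biomass) > 0                 # TRUE: log-normal values are always positive
mean(biomass) > median(biomass)  # TRUE: the long right tail pulls the mean up
```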
Note: The Process Matters
The point is not to memorise a long list of named distributions. The point is to ask what process plausibly generated the data. The distribution is often a clue about the process, and the process should inform the model.
6 Samples, Populations, and Sampling Variation
Statistical inference depends on the fact that we usually observe a sample, not the full population. Different samples from the same population will differ. This is called sampling variation.
If we repeatedly sampled the same population and calculated the sample mean each time, those means would themselves form a distribution. That distribution is the sampling distribution of the mean.
This distinction matters:
the data distribution describes the values of observations,
the sampling distribution describes the values of a statistic across repeated samples.
Many misunderstandings in introductory statistics come from confusing these two ideas.
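The distinction can be made concrete with a short simulation (the population values are invented): each sample of 25 gives a different mean, and those means form their own distribution:

```r
# A simulated population and the sampling distribution of the mean.
set.seed(13)
population <- rnorm(100000, mean = 50, sd = 10)

# 2000 repeated samples of n = 25, each summarised by its mean.
sample_means <- replicate(2000, mean(sample(population, size = 25)))

range(sample_means)  # the means vary from sample to sample
mean(sample_means)   # but they centre on the population mean of 50
```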
7 The Central Limit Theorem
One of the most important results in statistics is the central limit theorem (CLT). In practical terms, it says that the sampling distribution of the sample mean tends toward normality as sample size increases, even when the underlying data are not perfectly normal.
This matters because many inferential procedures rely on the behaviour of sample means or model estimates rather than on the raw data alone.
The CLT does not solve every problem:
it does not excuse poor design,
it does not make extreme skewness irrelevant in small samples, and
it does not rescue non-independence.
But it explains why normal-based approximations are often useful.
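The CLT can be seen directly by simulation (the exponential distribution here is just a convenient skewed example):

```r
# Raw exponential data are strongly right-skewed (theoretical skewness = 2),
# but means of n = 30 observations are already close to symmetric.
set.seed(21)
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3

raw   <- rexp(10000, rate = 1)
means <- replicate(2000, mean(rexp(30, rate = 1)))

skewness(raw)    # near 2
skewness(means)  # much closer to 0
```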
8 Standard Error and Uncertainty
Once we recognise that sample estimates vary from sample to sample, we need a way to describe that uncertainty.
The standard deviation describes variability among observations in the sample.
The standard error describes variability in an estimate, such as the sample mean, across repeated samples. For the mean:
\[
SE = \frac{SD}{\sqrt{n}}
\]
As sample size increases, the standard error decreases. Larger samples therefore tend to produce more precise estimates of population parameters.
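The formula can be checked by simulation (the population SD of 10 and n = 25 are assumed values): the analytic SE matches the spread of means across repeated samples:

```r
# Compare the analytic standard error SD / sqrt(n) with the empirical
# standard deviation of sample means across 5000 repeated samples.
set.seed(5)
n <- 25
analytic_se  <- 10 / sqrt(n)   # population SD is 10 by construction
empirical_se <- sd(replicate(5000, mean(rnorm(n, mean = 0, sd = 10))))

c(analytic = analytic_se, empirical = empirical_se)  # both close to 2
```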
9 Confidence Intervals
A confidence interval (CI) gives a range of plausible values for an unknown population parameter. For a 95% CI, the interpretation is tied to repeated sampling: if we repeated the same sampling procedure many times and calculated a CI each time, about 95% of those intervals would contain the true parameter.
Confidence intervals are useful because they communicate both:
the estimated value, and
the uncertainty around that estimate.
A narrow interval suggests greater precision; a wide interval suggests greater uncertainty.
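The repeated-sampling interpretation can itself be simulated (the sample size of 15 and true mean of 0 are arbitrary choices): across many repetitions, about 95% of t-based intervals cover the true mean:

```r
# Coverage check for 95% t-based confidence intervals.
set.seed(9)
covers <- replicate(4000, {
  x  <- rnorm(15)             # true population mean is 0
  se <- sd(x) / sqrt(length(x))
  tc <- qt(0.975, df = length(x) - 1)
  abs(mean(x)) < tc * se      # TRUE if the CI contains 0
})

mean(covers)  # close to 0.95
```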
9.1 Why confidence intervals matter
Confidence intervals are often more informative than a bare p-value because they show:
the scale of the effect,
the precision of the estimate, and
whether biologically meaningful values remain plausible.
For group comparisons, the confidence interval is often calculated for the difference between groups. If that interval excludes zero, the result is typically consistent with a statistically detectable difference.
10 A Simple Example in R
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)
n <- length(x)
mean_x <- mean(x)
sd_x <- sd(x)
se_x <- sd_x / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_low <- mean_x - t_crit * se_x
ci_high <- mean_x + t_crit * se_x
tibble(n = n, mean = mean_x, sd = sd_x, se = se_x,
       ci_low = ci_low, ci_high = ci_high)
R> # A tibble: 1 × 6
R>       n  mean    sd     se ci_low ci_high
R>   <int> <dbl> <dbl>  <dbl>  <dbl>   <dbl>
R> 1     8  12.0 0.245 0.0866   11.8    12.3
This example uses the t distribution to calculate a confidence interval for the mean of a small sample. Later chapters will connect this logic directly to hypothesis tests and model coefficients.
11 Linking Distributions to Later Chapters
This chapter sits at the bridge between description and inference.
Chapter 2 describes data numerically.
Chapter 3 describes data graphically.
This chapter adds probabilistic structure and uncertainty.
Chapter 5 then uses that structure to formalise inference.
12 Summary
Distributions describe the pattern of variation in data.
Biological processes often suggest which distributions are plausible.
Statistical inference depends on sampling variation, not just raw data values.
The central limit theorem explains why sampling distributions are often approximately normal.
Standard errors and confidence intervals quantify uncertainty in sample estimates.
Understanding these ideas is essential before moving on to formal hypothesis tests and regression models.
@online{smit2026,
  author = {Smit, A. J.},
  title = {4. Distributions, Sampling, and Uncertainty},
  date = {2026-03-19},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/04-distributions-sampling-uncertainty.html},
  langid = {en}
}