4. Distributions, Sampling, and Uncertainty

Author

A. J. Smit

Published

2026/03/19

Note: In This Chapter
  • Why distributions matter
  • Discrete and continuous random variables
  • Common distributions encountered in biology
  • Sampling variation and the central limit theorem
  • Confidence intervals as measures of uncertainty
Important: Tasks to Complete in This Chapter
  • None

1 Introduction

A good grasp of data distributions is a prerequisite for any statistical analysis. Distributions help us describe the underlying patterns, trends, and variation in our data. They also guide the choice of models and tests. Inferential conclusions are only defensible if we understand, at least broadly, the kind of variation our data represent.

This chapter links three ideas that belong together:

  1. the distribution of observations,
  2. the distribution of sample statistics, and
  3. the uncertainty around the quantities we estimate from data.

2 Key Concepts

The chapter is organised around the following linked ideas.

  • Data distributions describe the shape, spread, and structure of observed variables.
  • Random variables may be discrete or continuous, and that distinction affects model choice.
  • Sampling variation means sample summaries differ from one sample to the next.
  • The central limit theorem explains why sample means often become approximately normal.
  • Confidence intervals express uncertainty around estimated quantities.

3 What a Distribution Tells Us

A distribution describes the pattern of possible values for a variable and how frequently or how probably those values occur. When we plot data, we often ask:

  • Where is the centre?
  • How spread out are the values?
  • Is the distribution symmetric or skewed?
  • Are there heavy tails or outliers?
  • Does the dataset suggest one group or several?

These features are not cosmetic. They affect what summary statistics are useful, what assumptions are plausible, and what models are appropriate.
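The questions above map directly onto simple numerical summaries. A minimal sketch, using hypothetical right-skewed measurement data (not from the chapter):

```r
set.seed(1)
x <- rlnorm(200, meanlog = 2, sdlog = 0.5)  # positive, right-skewed values

centre <- median(x)              # where is the centre?
spread <- IQR(x)                 # how spread out are the values?
skewed <- mean(x) > median(x)    # right skew pulls the mean above the median

c(centre = centre, spread = spread, skewed = skewed)
```

For skewed data like these, the median and IQR are more representative summaries than the mean and standard deviation, which is exactly why the shape of the distribution matters before any summary is chosen.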

4 Discrete and Continuous Variables

Probability distributions are usually divided into two broad classes.

4.1 Discrete distributions

Discrete variables take countable values such as 0, 1, 2, and so on. These arise when we count events, individuals, or successes.

Common biological examples include:

  • the number of offspring in a brood,
  • the number of infected individuals in a sample,
  • the number of seeds that germinate, and
  • the number of birds detected in a transect.
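Counts like these can be simulated directly in R. A brief sketch with made-up parameter values (germination probability and detection rate are assumptions for illustration):

```r
set.seed(42)
# Number germinating out of 20 seeds, each with probability 0.7
germinated <- rbinom(100, size = 20, prob = 0.7)
# Number of birds detected per transect, average rate 3.5
birds <- rpois(100, lambda = 3.5)

# Both are whole, non-negative numbers by construction
head(germinated)
head(birds)
```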

4.2 Continuous distributions

Continuous variables can take any value within a range. These arise when we measure quantities such as length, mass, temperature, or nutrient concentration.

Common examples include:

  • plant height,
  • chlorophyll concentration,
  • growth rate, and
  • body temperature.

5 Common Distributions in Biology

Biological processes often leave a signature in the distribution of the data. A few distributions appear repeatedly in ecological and biological work.

5.1 Normal distribution

The normal distribution is symmetric and bell-shaped. It is often a reasonable approximation for continuous traits influenced by many small contributing factors. Many classical methods rely on normality either in the data themselves or, more often, in the residuals of a model.
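One way to see why traits shaped by many small contributing factors tend toward normality is to add up many small, independent random contributions. A minimal simulation sketch:

```r
set.seed(7)
# Each simulated "trait" value is the sum of 50 small, independent effects
trait <- replicate(5000, sum(runif(50, min = -1, max = 1)))

# The resulting distribution is close to symmetric:
# the mean and median nearly coincide
c(mean = mean(trait), median = median(trait))
```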

5.2 Binomial distribution

The binomial distribution describes the number of successes in a fixed number of independent trials with constant success probability. It is useful for proportions, occupancy, germination, and similar binary outcomes.
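A short worked example, assuming 20 seeds that each germinate independently with probability 0.7 (values chosen for illustration):

```r
# P(exactly 15 of 20 seeds germinate)
p_exactly_15 <- dbinom(15, size = 20, prob = 0.7)
# P(10 or fewer germinate)
p_at_most_10 <- pbinom(10, size = 20, prob = 0.7)
# Expected count is n * p
mean_count <- 20 * 0.7

round(c(p_exactly_15 = p_exactly_15,
        p_at_most_10 = p_at_most_10,
        mean_count = mean_count), 4)
```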

5.3 Poisson distribution

The Poisson distribution describes counts of events in a fixed interval of time or space. It often appears in abundance or event-count data, especially when counts are low and skewed.
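A defining feature of the Poisson distribution is that its mean equals its variance, which can be checked by simulation (the rate of 2 events per interval is an assumed value):

```r
set.seed(11)
counts <- rpois(10000, lambda = 2)  # e.g. events per fixed interval

# Mean and variance should be approximately equal
c(mean = mean(counts), var = var(counts))
```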

5.4 Negative binomial distribution

The negative binomial distribution is often used for count data that are more variable than a Poisson model allows. This is common in ecological field data where aggregation and heterogeneity are strong.
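The overdispersion that motivates the negative binomial can be demonstrated by comparing it with a Poisson of the same mean. A sketch with assumed parameters (`mu` is the mean; a small `size` produces strong aggregation):

```r
set.seed(21)
nb_counts   <- rnbinom(10000, mu = 2, size = 0.5)  # aggregated counts
pois_counts <- rpois(10000, lambda = 2)            # same mean, no aggregation

# The negative binomial variance far exceeds its mean;
# the Poisson variance roughly equals its mean
c(nb_var = var(nb_counts), pois_var = var(pois_counts))
```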

5.5 Log-normal and gamma distributions

These are useful for positive, right-skewed continuous variables such as biomass, time-to-event measurements, and concentration data.
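Both distributions produce strictly positive, right-skewed values, so their mean sits above their median. A quick simulation sketch with illustrative parameters:

```r
set.seed(5)
biomass <- rlnorm(5000, meanlog = 0, sdlog = 1)  # log-normal, e.g. biomass
waiting <- rgamma(5000, shape = 2, rate = 0.5)   # gamma, e.g. time-to-event

# Right skew: the mean exceeds the median in both cases
c(lnorm_skewed = mean(biomass) > median(biomass),
  gamma_skewed = mean(waiting) > median(waiting))
```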

Note: The Process Matters

The point is not to memorise a long list of named distributions. The point is to ask what process plausibly generated the data. The distribution is often a clue about the process, and the process should inform the model.

6 Samples, Populations, and Sampling Variation

Statistical inference depends on the fact that we usually observe a sample, not the full population. Different samples from the same population will differ. This is called sampling variation.

If we repeatedly sampled the same population and calculated the sample mean each time, those means would themselves form a distribution. That distribution is the sampling distribution of the mean.

This distinction matters:

  • the data distribution describes the values of observations,
  • the sampling distribution describes the values of a statistic across repeated samples.

Many misunderstandings in introductory statistics come from confusing these two ideas.
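The distinction can be made concrete by simulation. Here a skewed "population" is sampled repeatedly, and the sample means form their own, much narrower distribution (population parameters are assumed for illustration):

```r
set.seed(13)
population <- rgamma(1e5, shape = 2, rate = 1)  # a skewed population

# Draw many samples of n = 30 and record each sample mean
sample_means <- replicate(2000, mean(sample(population, 30)))

# The data distribution vs. the sampling distribution of the mean:
# the statistic varies much less than the observations
c(sd_data = sd(population), sd_means = sd(sample_means))
```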

7 The Central Limit Theorem

One of the most important results in statistics is the central limit theorem (CLT). In practical terms, it says that the sampling distribution of the sample mean tends toward normality as sample size increases, even when the underlying data are not perfectly normal.

This matters because many inferential procedures rely on the behaviour of sample means or model estimates rather than on the raw data alone.

The CLT does not solve every problem:

  • it does not excuse poor design,
  • it does not make extreme skewness irrelevant in small samples, and
  • it does not rescue non-independence.

But it explains why normal-based approximations are often useful.
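The CLT's effect can be watched directly: draw means from strongly skewed data at two sample sizes and compare how symmetric the sampling distributions are. A minimal sketch using exponential data (chosen here because it is markedly right-skewed):

```r
set.seed(3)
# Sampling distributions of the mean at n = 5 and n = 50
means_n5  <- replicate(5000, mean(rexp(5, rate = 1)))
means_n50 <- replicate(5000, mean(rexp(50, rate = 1)))

# As n grows the skew shrinks: the mean and median of the
# sampling distribution move closer together
c(gap_n5  = mean(means_n5)  - median(means_n5),
  gap_n50 = mean(means_n50) - median(means_n50))
```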

8 Standard Error and Uncertainty

Once we recognise that sample estimates vary from sample to sample, we need a way to describe that uncertainty.

The standard deviation describes variability among observations in the sample.

The standard error describes variability in an estimate, such as the sample mean, across repeated samples. For the mean:

\[ SE = \frac{SD}{\sqrt{n}} \]

As sample size increases, the standard error decreases. Larger samples therefore tend to produce more precise estimates of population parameters.
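Because of the square root in the denominator, quadrupling the sample size only halves the standard error. A one-line illustration (the population SD of 10 is an arbitrary assumed value):

```r
sd_pop <- 10
se <- function(n) sd_pop / sqrt(n)  # SE = SD / sqrt(n)

# Quadrupling n from 25 to 100 halves the SE
c(se_n25 = se(25), se_n100 = se(100))
```

This diminishing return is worth remembering when planning sample sizes: precision improves with n, but ever more slowly.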

9 Confidence Intervals

A confidence interval (CI) gives a range of plausible values for an unknown population parameter. For a 95% CI, the interpretation is tied to repeated sampling: if we repeated the same sampling procedure many times and calculated a CI each time, about 95% of those intervals would contain the true parameter.

Confidence intervals are useful because they communicate both:

  • the estimated value, and
  • the uncertainty around that estimate.

A narrow interval suggests greater precision; a wide interval suggests greater uncertainty.
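The repeated-sampling interpretation can be verified by simulation: construct many 95% CIs from samples of a known population and count how often they capture the true mean. A sketch with assumed population parameters:

```r
set.seed(17)
true_mean <- 50
covered <- replicate(2000, {
  x  <- rnorm(25, mean = true_mean, sd = 8)
  ci <- mean(x) + qt(c(0.025, 0.975), df = 24) * sd(x) / sqrt(25)
  ci[1] <= true_mean && true_mean <= ci[2]
})

# The proportion of intervals containing the true mean is close to 0.95
mean(covered)
```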

9.1 Why confidence intervals matter

Confidence intervals are often more informative than a bare p-value because they show:

  • the scale of the effect,
  • the precision of the estimate, and
  • whether biologically meaningful values remain plausible.

For group comparisons, the confidence interval is often calculated for the difference between groups. If that interval excludes zero, the result is typically consistent with a statistically detectable difference.
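A two-group comparison of this kind is a one-liner with `t.test()`, which reports the CI for the difference in means. A sketch with simulated groups whose true means differ (all values assumed for illustration):

```r
set.seed(23)
control   <- rnorm(30, mean = 12.0, sd = 1.2)
treatment <- rnorm(30, mean = 13.5, sd = 1.2)

fit <- t.test(treatment, control)
fit$conf.int  # 95% CI for the difference in means; here it excludes zero
```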

10 A Simple Example in R

library(tibble)  # for tibble()

x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)

n <- length(x)
mean_x <- mean(x)
sd_x <- sd(x)
se_x <- sd_x / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # critical t value for a 95% CI
ci_low <- mean_x - t_crit * se_x
ci_high <- mean_x + t_crit * se_x

tibble(
  n = n,
  mean = mean_x,
  sd = sd_x,
  se = se_x,
  ci_low = ci_low,
  ci_high = ci_high
)
R> # A tibble: 1 × 6
R>       n  mean    sd     se ci_low ci_high
R>   <int> <dbl> <dbl>  <dbl>  <dbl>   <dbl>
R> 1     8  12.0 0.245 0.0866   11.8    12.3

This example uses the t distribution to calculate a confidence interval for the mean of a small sample. Later chapters will connect this logic directly to hypothesis tests and model coefficients.
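As a check, R's built-in `t.test()` computes the same interval in one call:

```r
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)

# The 95% CI from t.test() matches the manual calculation above
t.test(x)$conf.int
```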

11 Linking Distributions to Later Chapters

This chapter sits at the bridge between description and inference.

  • Chapter 2 describes data numerically.
  • Chapter 3 describes data graphically.
  • This chapter adds probabilistic structure and uncertainty.
  • Chapter 5 then uses that structure to formalise inference.

12 Summary

  • Distributions describe the pattern of variation in data.
  • Biological processes often suggest which distributions are plausible.
  • Statistical inference depends on sampling variation, not just raw data values.
  • The central limit theorem explains why sampling distributions are often approximately normal.
  • Standard errors and confidence intervals quantify uncertainty in sample estimates.

Understanding these ideas is essential before moving on to formal hypothesis tests and regression models.

Citation

BibTeX citation:
@online{smit2026,
  author = {Smit, A. J.},
  title = {4. {Distributions,} {Sampling,} and {Uncertainty}},
  date = {2026-03-19},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/04-distributions-sampling-uncertainty.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2026) 4. Distributions, Sampling, and Uncertainty. http://tangledbank.netlify.app/BCB744/basic_stats/04-distributions-sampling-uncertainty.html.