1: Introduction to Statistics

Author

A. J. Smit

Published

2026/03/18

1 Introduction to Statistics

Statistics provides a framework for learning from data. In biology, we rarely observe systems under controlled, repeatable conditions. Instead, we measure processes that vary across space, time, and individuals. Statistical methods allow us to quantify this variation, evaluate evidence, and draw conclusions about underlying processes.

This course approaches statistics as a tool for scientific reasoning, rather than a collection of tests. The central task is to connect three elements:

the biological question,
the data we collect, and
the model we use to represent the process.

A statistical analysis is only as strong as the alignment among these three components.

2 What statistics does in biology

In ecological and biological systems, variation is not noise to be removed; it is a property of the system. Temperature fluctuates, populations change, and individuals differ. Statistics allows us to:

describe patterns in data,
quantify uncertainty, and
evaluate whether observed patterns reflect underlying processes.

This leads to two broad goals:

Explanation — attributing patterns to biological mechanisms
Prediction — forecasting responses under new conditions

These goals require different modelling choices. A recurring theme in this course is that you must decide which goal you are pursuing before interpreting results. In fact, a seasoned biologist usually knows which modelling approach or statistical test to use before any data have been collected, because that choice is implicit in the hypothesis stated before the laboratory or field research begins.

3 From tests to models

Many introductory courses present statistics as a sequence of tests: t-tests, ANOVA, correlations, and regressions. Naming the tools is useful, but treating them as isolated procedures encourages a checklist approach.

This course instead uses a model-based framework.

A statistical model is a simplified representation of how a response variable depends on one or more predictors. For example, a regression model expresses how a biological response changes with environmental conditions. Once this framework is established, many classical tests can be understood as special cases of a general modelling approach.

Approaching statistical analysis in this way has two consequences:

It unifies statistical methods under a single conceptual framework.
It makes assumptions explicit, which allows us to evaluate whether a model is appropriate.
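
The claim that classical tests are special cases of a general model can be checked numerically. The sketch below is an illustration only: the shell lengths are made-up values, and the helper names `pooled_t` and `slope_t` are hypothetical. It compares Student's two-sample t statistic (pooled variance) with the t statistic for the slope of a simple linear regression in which the predictor is a 0/1 indicator of group membership.

```python
import math
import statistics


def pooled_t(a, b):
    """Student's two-sample t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


def slope_t(x, y):
    """t statistic for the slope of a least-squares regression of y on x."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return slope / slope_se


# Hypothetical shell lengths (mm) at two sites; the values are invented.
site_a = [10.1, 11.3, 9.8, 10.7, 11.0]
site_b = [12.2, 11.9, 13.1, 12.6, 12.0]

# Recode the grouping as a 0/1 predictor and stack the responses.
x = [0] * len(site_a) + [1] * len(site_b)
y = site_a + site_b

t_test = pooled_t(site_a, site_b)
t_reg = slope_t(x, y)
print(f"two-sample t: {t_test:.6f}")
print(f"regression slope t: {t_reg:.6f}")
```

The two statistics agree up to floating-point error, because the regression slope on a 0/1 predictor is the difference between the group means and the residual variance is the pooled variance. The equivalence holds for the equal-variance form of the t-test, not for Welch's version.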

4 The structure of the course

The chapters that follow are organised to reflect how analyses are conducted in practice.

1. Foundations. We begin with data summaries, visualisation, and distributions. These chapters establish how to describe variation and recognise structure in data.

2. Inference. We introduce hypothesis testing, confidence intervals, and standard statistical tests. These tools provide formal methods for evaluating evidence.

3. Regression and relationships. We then focus on relationships among variables. Regression forms the backbone of most analyses in ecology and biology.

4. Assumptions and transformations. Statistical models rely on assumptions about data structure. We examine these assumptions and how to respond when they are violated.

5. Common failure modes. We address two major sources of incorrect inference:

Pseudoreplication, which concerns independence of observations
Collinearity, which concerns redundancy among predictors

These chapters emphasise that incorrect conclusions often arise from how data are collected and structured, rather than from the choice of statistical test.
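
Both failure modes can be made concrete with a few lines of code. The sketch below uses invented data and a hypothetical design (fish measured within tanks, and temperature recorded along a depth gradient); it is not taken from the course chapters. The first part counts replicates at the level of the experimental unit, and the second computes the variance inflation factor (VIF) for a pair of predictors, which in the two-predictor case reduces to 1 / (1 - r^2), where r is their Pearson correlation.

```python
import math
from statistics import mean

# --- Pseudoreplication ---------------------------------------------------
# Hypothetical design: the treatment is applied to whole tanks, and three
# fish are measured per tank. The tank, not the fish, is the experimental unit.
growth_mm = {
    ("control", "tank_1"): [2.1, 2.3, 2.0],
    ("control", "tank_2"): [2.4, 2.2, 2.5],
    ("warmed", "tank_3"): [3.0, 3.2, 2.9],
    ("warmed", "tank_4"): [2.8, 3.1, 3.0],
}

n_fish = sum(len(values) for values in growth_mm.values())
n_tanks = len(growth_mm)

# Collapse the fish within each tank to one value per experimental unit.
tank_means = {}
for (treatment, _tank), values in growth_mm.items():
    tank_means.setdefault(treatment, []).append(mean(values))

print(f"measurements: {n_fish}, independent units: {n_tanks}")
print(tank_means)


# --- Collinearity --------------------------------------------------------
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictors: water depth (m) and temperature (deg C).
depth_m = [1, 2, 3, 4, 5, 6]
temperature_c = [18.2, 17.1, 16.3, 15.0, 14.2, 12.9]

r = pearson_r(depth_m, temperature_c)
vif = 1 / (1 - r**2)  # two-predictor special case of the VIF
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

Treating all twelve fish as independent would overstate the replication, since only four tanks received a treatment. Likewise, a very large VIF signals that depth and temperature carry nearly the same information, so their separate effects cannot be estimated reliably from these data.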

6. Model-based reasoning. The final section develops a framework for building and evaluating models. Topics include model specification, confounding, interaction effects, model selection, dependence structures, and hierarchical models. The course concludes by distinguishing prediction from explanation and outlining a reproducible analytical workflow.

5 Core principles

Throughout the course, several principles guide analysis:

1. Define the question before the method. Statistical tools do not determine the question. The biological problem determines the model and the analysis.

2. Identify the data-generating process. Every dataset reflects a process: how observations were collected, structured, and measured. Understanding this process is essential for valid inference.

3. Match the model to the process. Predictors should represent meaningful biological mechanisms or clearly defined proxies. Misalignment leads to biased or unstable results.

4. Check assumptions explicitly. Model assumptions are part of the analysis, not an afterthought. Violations provide information about model adequacy.

5. Separate prediction from explanation. A model that predicts well does not necessarily provide interpretable coefficients. Interpretation requires additional constraints on model structure.

6 What you should expect

By the end of this course, you should be able to:

translate a biological question into a statistical model,
identify the appropriate experimental unit and structure of the data,
diagnose common problems such as pseudoreplication and collinearity, and
interpret model outputs in terms of biological processes.

The aim is not to memorise procedures, but to develop the ability to reason from data to conclusions in a principled way.

7 Final remark

Statistical analysis is not a final step applied after data collection. It begins when you define your question and design your study. The quality of your conclusions depends on decisions made at every stage—from sampling design to model specification to interpretation.

This course develops those decisions as a coherent workflow.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2026,
  author = {Smit, A. J.},
  title = {1: {Introduction} to {Statistics}},
  date = {2026-03-18},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/01-introduction.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2026) 1: Introduction to Statistics. http://tangledbank.netlify.app/BCB744/basic_stats/01-introduction.html.
