6. Raw-Data Assumptions and Transformations
Diagnostics for Groups, Pairs, and Associations
The opening meme in Figure 1 is a light reminder that distributions are often less cooperative than we hope.
- how assumption checks depend on data structure
- normality within groups
- variance across groups
- paired differences
- joint structure in associations
- transformations and what to do when assumptions fail
Inferential statistics can be broadly categorised into parametric and nonparametric methods. The choice between them hinges on understanding the distribution of our data and the assumptions underlying each method. Parametric methods rely on specific assumptions about the underlying probability distribution of the population from which the sample data are drawn. The two key assumptions are normality, that the data follow a normal (Gaussian) distribution, and homoscedasticity, which requires equal variances across groups or levels of predictors.
Strictly speaking, the core parametric requirement is not normality per se but that the data follow a known probability distribution specified in advance. When the response is a count, a proportion, or a binary outcome, other distributions (Poisson, binomial) apply, and methods such as Generalised Linear Models extend the parametric framework accordingly (Part V). For the tests covered in Parts II and III of the book (t-tests, ANOVA, regression, and correlation) the relevant distribution is normal, so normality and homoscedasticity are the assumptions that matter here.
Nonparametric methods offer an alternative when data do not conform to any known distribution, or when assumptions cannot be met and the sample is too small for the Central Limit Theorem (CLT) to compensate. They make fewer assumptions, are more robust to non-standard distributions, and trade a modest reduction in statistical power for greater generality. For most biological datasets of adequate size, a well-chosen parametric test is preferred, but nonparametric alternatives become the more defensible choice when distributional requirements clearly cannot be satisfied.
Choosing between the two approaches requires first assessing whether your data meet parametric assumptions. With larger samples, the CLT means that moderate departures from normality matter less, because the sampling distribution of the mean approaches normality regardless. With small samples, the same violations carry more weight and nonparametric alternatives deserve serious consideration. When assumptions are met, parametric tests are more powerful, but when they are violated, that power advantage dissipates.
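A small simulation sketch of the CLT at work, using an arbitrary right-skewed (exponential) population; the sample sizes and population are illustrative choices, not part of the chapter's data:

```r
# Sketch: sample means drawn from a skewed population. The skewness of the
# mean shrinks roughly as 1/sqrt(n), so larger samples give near-normal means.
set.seed(42)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_n5  <- replicate(5000, mean(rexp(5)))    # small samples: means still skewed
means_n50 <- replicate(5000, mean(rexp(50)))   # larger samples: means near normal

c(skew_n5 = skew(means_n5), skew_n50 = skew(means_n50))
```

The first value stays clearly positive while the second sits much closer to zero, which is exactly why sample size moderates the importance of the normality assumption.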
Assumption checks determine whether a planned inferential method is defensible. In this chapter, I apply those checks to raw data rather than to model residuals — residual-based diagnostics belong to the later regression chapters, beginning with Residuals and Model-Based Diagnostics. My aim here is to prepare you for the tests immediately ahead: t-tests, ANOVA, and Correlation and Association.
The discussion is organised around data structure. Group comparisons require normality checks within groups and homoscedasticity checks across them. Paired designs require checks on the differences. Associations require checks on the joint pattern of both variables.
1 Assumption-Checking Workflow
Assumption checking follows a series of steps:
- identify the structure of the analysis;
- inspect the raw data visually;
- assess the assumptions graphically;
- use formal tests to support judgement;
- decide whether to proceed, transform, or change method;
- re-check if the analysis changes.
Different assumptions often fail together, which is why I tie the workflow to the analysis rather than to a list of tests.
2 Identifying the Data Structure
Before checking assumptions, identify the structure of the data:
- Are there groups? Then check each group separately.
- Are the measurements paired? Then check the differences.
- Are both variables continuous? Then check the joint pattern of the relationship.
Assumptions apply to the structure implied by the analysis, not to the dataset as a whole. I introduced the grouped-data example earlier in Comparing Groups, where I emphasised how plots change once categories are recognised explicitly.
In the regression chapters, I will apply these same ideas to residuals, which are the differences between observed values and model predictions, but here I develop the assumptions at the level of raw data. The extension to residuals belongs with the regression series of chapters.
3 Group Comparisons: Normality Within Groups
I begin with a grouped comparison that prepares directly for t-tests and ANOVA.
I compare body mass across penguin species. The grouping structure is evident in the dataset: each observation belongs to one of several species. This is the same grouped-data idea introduced in Comparing Groups, but here the aim is diagnostic rather than descriptive. The assumption applies within each species, not to the pooled data.
3.1 Graphical Checks
Start with plots. Histograms show the shape within each species. A Q-Q plot compares the observed quantiles of the data with the quantiles expected from a normal distribution. If the distribution is close to normal, the points fall roughly along a straight reference line. These grouped and pooled views are shown in Figure 2.
Code
plt1 <- ggplot(peng, aes(body_mass_g)) +
geom_histogram(bins = 20, fill = "grey70", colour = "white") +
facet_wrap(~species, scales = "free_x") +
labs(x = "Body mass (g)", y = "Count")
plt2 <- ggplot(peng, aes(sample = body_mass_g)) +
stat_qq(shape = 21, fill = "salmon", colour = "black") +
stat_qq_line(colour = "red4") +
labs(x = "Theoretical quantiles", y = "Sample quantiles")
plt3 <- ggplot(peng, aes(sample = body_mass_g)) +
stat_qq(shape = 21, fill = "salmon", colour = "black") +
stat_qq_line(colour = "red4") +
facet_wrap(~species, scales = "free") +
labs(x = "Theoretical quantiles", y = "Sample quantiles")
ggarrange(plt1, plt2, plt3, ncol = 1, labels = "AUTO")
In Figure 2, the pooled Q-Q plot bends because it combines Adelie, Chinstrap, and Gentoo penguins into one mixture distribution. The grouped Q-Q plots are the relevant ones because the comparison is made across species. That is the pattern we carry forward into Chapter 7 and Chapter 8.
3.2 Formal Test
The most commonly used formal test is the Shapiro-Wilk test, shapiro.test(). Its null hypothesis is that the data are compatible with normality:
\[H_{0}: \text{the distribution is compatible with normality}\] \[H_{a}: \text{the distribution departs from normality}\]
The test statistic is:
\[W = \frac{(\sum_{i=1}^n a_i x_{(i)})^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \tag{1}\]
Here, \(W\) is the Shapiro-Wilk test statistic, \(a_i\) are coefficients that depend on the sample size and expected order statistics under normality, \(x_{(i)}\) is the \(i\)-th ordered observation, and \(\overline{x}\) is the sample mean.
Run the test within species, not on the pooled sample:
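The grouped computation can be sketched as follows, assuming the palmerpenguins package and dropping rows with missing body mass:

```r
# Sketch: Shapiro-Wilk applied within each species rather than to the
# pooled sample.
library(dplyr)
library(palmerpenguins)

peng <- penguins |> filter(!is.na(body_mass_g))

res <- peng |>
  group_by(species) |>
  summarise(
    shapiro_w = shapiro.test(body_mass_g)$statistic,
    shapiro_p = shapiro.test(body_mass_g)$p.value,
    .groups = "drop"
  )
res
```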
# A tibble: 3 × 3
species shapiro_w shapiro_p
<fct> <dbl> <dbl>
1 Adelie 0.993 0.717
2 Chinstrap 0.975 0.194
3 Gentoo 0.973 0.0135
For Adelie (p = 0.717) and Chinstrap (p = 0.194), p is greater than 0.05, so the test shows no strong evidence against normality. That result does not prove exact normality; it only shows that any departure is not strong enough for this test to detect at the sample size available. For Gentoo (p = 0.0135), there is sufficient evidence against the null, suggesting that the Gentoo data depart from normality.
That result does not automatically end the analysis, but it does change the next step. First, inspect the plots in Figure 2 and ask whether the departure is mild or severe. Second, continue to the variance check below, because non-normality and unequal variance often appear together. If one or more groups are clearly non-normal, and especially if variance also differs among groups, the later comparison should use a method that tolerates those features better. In the chapters that follow, that may mean Welch’s version of the test, a rank-based alternative, a transformation followed by a fresh round of checks, or a different model altogether. The decision depends on the size of the departure, the sample sizes, and the biological scale on which you need to interpret the result.
Using the penguins dataset (load palmerpenguins), run the normality checks for flipper_length_mm instead of body_mass_g. For each species, produce a Q-Q plot and apply the Shapiro-Wilk test. Do the conclusions differ from what was found for body mass? Copy the code from above and adapt it. Report your W statistics and p-values and describe in one sentence whether you would consider the flipper-length data approximately normal within each species.
3.3 Other Tests for Normality
Several other tests can be used to assess whether data are compatible with normality:
- Kolmogorov-Smirnov test compares the empirical distribution of a sample with a specified theoretical distribution. In R use ks.test().
- Anderson-Darling test is another goodness-of-fit test, available as ad.test() in packages such as nortest.
- Lilliefors test is a modification of the Kolmogorov-Smirnov test for estimated mean and variance. See lillie.test().
- Jarque-Bera test is based on skewness and kurtosis. See jarque.bera.test() in tseries.
- Cramer-von Mises test is another goodness-of-fit approach. See cvm.test() in goftest.
These tests support the plots. They do not replace them.
In the regression chapters, I check normality with Q-Q plots of residuals rather than with grouped raw-data plots. That extension belongs with fitted models and is introduced in Chapter 11.
4 Group Comparisons: Variance Across Groups
Grouped comparisons also require attention to spread. For methods such as one-way ANOVA, large differences in variance across groups can affect the inferential result.
We continue with penguin body mass, where the spread differs clearly across species:
Code
peng_var <- peng |>
group_by(species) |>
summarise(sample_var = var(body_mass_g), .groups = "drop")
plt1 <- ggplot(peng, aes(species, body_mass_g, fill = species)) +
geom_boxplot(show.legend = FALSE, alpha = 0.7) +
labs(x = "", y = "Body mass (g)")
plt2 <- ggplot(peng_var, aes(species, sample_var, fill = species)) +
geom_col(show.legend = FALSE, alpha = 0.7) +
labs(x = "", y = "Sample variance")
ggpubr::ggarrange(plt1, plt2, ncol = 2, widths = c(1.4, 1))
The box plots and variance bars in Figure 3 show the spread clearly. Large differences in spread suggest heteroscedasticity.
The most commonly used formal test is Levene’s test, car::leveneTest(). Its null hypothesis is that the group variances are equal:
\[H_{0}: \sigma^{2}_{1} = \sigma^{2}_{2} = \cdots = \sigma^{2}_{k}\] \[H_{a}: \text{at least one } \sigma^{2}_{i} \text{ differs from the others}\]
The Levene test statistic is:
\[W = \frac{(N-k)}{(k-1)} \cdot \frac{\sum_{i=1}^k n_i (\bar{z}_i - \bar{z})^2}{\sum_{i=1}^k \sum_{j=1}^{n_i} (z_{ij} - \bar{z}_i)^2} \tag{2}\]
Here, \(N\) is the total sample size, \(k\) is the number of groups, \(n_i\) is the sample size in group \(i\), \(z_{ij}\) is the absolute deviation of observation \(j\) in group \(i\) from its group centre, \(\bar{z}_i\) is the mean of those deviations within group \(i\), and \(\bar{z}\) is the overall mean deviation.
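The output shown next can be produced with a call of this kind (a sketch, assuming the palmerpenguins data with missing body-mass values removed):

```r
# Sketch: Levene's test for equal body-mass variance across species.
# car::leveneTest() centres deviations on the group median by default,
# which is the robust (Brown-Forsythe) variant.
library(car)
library(palmerpenguins)

peng <- na.omit(penguins[, c("species", "body_mass_g")])
leveneTest(body_mass_g ~ species, data = peng)
```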
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 5.1203 0.006445 **
339
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the small p-value (0.006) provides evidence against equal variances across species. This structure is used directly in ANOVA.
# A tibble: 3 × 2
species sample_var
<fct> <dbl>
1 Adelie 210283.
2 Chinstrap 147713.
3 Gentoo 254133.
4.1 Other Tests for Homogeneity
Several other tests are available for comparing variances:
- F-test compares two variances. Use var.test(). It assumes normality.
- Bartlett’s test compares variances across multiple groups. Use bartlett.test(). It also assumes normality.
- Brown-Forsythe test is a median-based modification of Levene’s test and is more robust to non-normality.
- Fligner-Killeen test is a robust non-parametric variance test available in base R as fligner.test().
In the regression chapters, I assess changing variance with residual-versus-fitted plots. That diagnostic belongs with fitted models and is introduced in Chapter 11.
5 Paired Designs: Check the Differences
In paired designs, the assumption applies to the differences, not to the raw measurements.
The paired structure means that each after value belongs to the same unit as one before value. The analysis will therefore be built on the differences, which are visualised in Figure 4.
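The paired object used in the diagnostic code is not constructed in this chapter; a minimal hypothetical version, with simulated before and after values, might look like this:

```r
# Hypothetical paired data: one before and one after measurement per unit,
# with the analysis built on their difference. All values are simulated.
set.seed(7)
before <- rnorm(15, mean = 100, sd = 10)
after  <- before + rnorm(15, mean = 4, sd = 5)   # simulated treatment effect
paired <- data.frame(before = before, after = after, diff = after - before)

shapiro.test(paired$diff)
```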
Code
plt1 <- ggplot(paired, aes(diff)) +
geom_histogram(bins = 12, fill = "grey70", colour = "white") +
labs(x = "After - before", y = "Count")
plt2 <- ggplot(paired, aes(sample = diff)) +
stat_qq(shape = 21, fill = "salmon", colour = "black") +
stat_qq_line(colour = "red4") +
labs(x = "Theoretical quantiles", y = "Sample quantiles")
ggarrange(plt1, plt2, ncol = 2, labels = "AUTO")
Shapiro-Wilk normality test
data: paired$diff
W = 0.94617, p-value = 0.3127
The histogram and Q-Q plot in Figure 4 are the relevant diagnostics because the paired t-test is built on the differences, not on the raw before and after values. This is the structure used directly in the paired t-test in Chapter 7.
6 Associations: Check the Joint Structure
Association problems have a different geometry. There are no groups, and there are no paired differences. The main diagnostic is the joint structure of the two variables.
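As a sketch of the kind of joint check meant here, assuming the palmerpenguins data and two continuous variables chosen purely for illustration:

```r
# Sketch: scatter plot of two continuous variables with a linear trend
# overlaid. The joint pattern, not each marginal distribution on its own,
# is the primary diagnostic for an association.
library(ggplot2)
library(palmerpenguins)

peng <- na.omit(penguins[, c("bill_length_mm", "body_mass_g")])

plt <- ggplot(peng, aes(bill_length_mm, body_mass_g)) +
  geom_point(shape = 21, fill = "salmon", colour = "black") +
  geom_smooth(method = "lm", se = FALSE, colour = "red4") +
  labs(x = "Bill length (mm)", y = "Body mass (g)")
plt
```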
In Figure 5, the questions are whether the relationship is approximately linear and whether a few extreme values dominate the pattern. Checking each variable for normality in isolation would miss the main issue. The structure used here leads directly into Correlation and Association.
7 Independence
Independence is set by design. If repeated measurements are taken on the same individual, or if samples are nested within a site, tank, quadrat, or transect, then the observations are not independent.
Suppose ten leaves are measured from each of three plants, but the analysis treats the thirty leaves as if they came from thirty independent plants. That is pseudoreplication. The apparent sample size is inflated, and the inferential result becomes too optimistic because the real experimental unit was the plant, not the leaf.
Check independence by asking:
- what is the true experimental unit;
- are observations repeated, nested, spatially clustered, or temporally linked;
- does the design require a paired test, repeated-measures method, or mixed model.
We return to these issues in Pseudoreplication and Dependence and Mixed Models.
For each scenario below, state whether the observations are independent and identify the potential source of non-independence if there is one:
- You measure oxygen consumption in 20 individual fish, one per tank, across two temperature treatments.
- You measure body mass in 30 individual fish, but six fish share each of five tanks, and tanks are the actual treatment units.
- You sample 15 quadrats along a single 100 m transect to estimate cover of an invasive alga.
- You measure blood pressure in 10 patients before and after a treatment.
Which of these would call for a paired test, a mixed model, or a simple independent-samples test? Discuss with a partner.
8 Reporting Assumption Checks
Methods
Penguin body mass was examined by species before any formal group comparison. Normality was assessed within species using histograms, Q-Q plots, and the Shapiro-Wilk test. Homogeneity of variance across species was assessed with box plots, direct comparison of sample variances, and Levene’s test.
Results
Body-mass distributions differed among penguin species in both location and spread. The species-specific histograms and Q-Q plots in Figure 2 gave the relevant diagnostic view because the assumption applied within species rather than to the pooled dataset. The box plots and variance summaries in Figure 3 reinforced that variance was not uniform across species, so a later comparison of species means would need to take that heterogeneity into account.
Discussion
In a journal-style account, these diagnostics justify the choice of inferential method. Here they show that the grouped structure of the penguin data must be respected and that unequal variance among species would need to be handled explicitly in the later analysis.
9 Responding to Assumption Violations
Once assumptions have been checked, the response is usually one of four:
- proceed because the departures are minor and the planned method is adequate;
- change to a method that matches the data more closely;
- transform the response and then re-check the assumptions;
- recognise that the design does not support the intended inference.
The best response depends on the structure of the problem. Counts, proportions, and non-independent observations often need a different method rather than a transformed version of the same method.
The non-parametric alternatives for the main inferential methods are introduced within the relevant method chapters: Wilcoxon procedures in Chapter 7, Kruskal-Wallis in Chapter 8, and rank-based correlation in Chapter 9. Chapter 10 then summarises how those choices fit together.
10 Why Transform Data?
Transformations change the scale of the response so that the data are more compatible with the assumptions of the planned method. They are often used to reduce right-skew, stabilise variance, or linearise a relationship.
Use a transformation only when it improves the fit between the data and the method in a way that still supports a clear biological interpretation.
After transformation, quantities such as means and confidence intervals usually need to be back-transformed before reporting. Back-transformed uncertainty is often asymmetric.
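A short sketch of that back-transformation, using simulated log-normal data (the sample size and parameters are arbitrary):

```r
# Sketch: mean and 95% CI computed on the log scale, then back-transformed.
# The back-transformed interval is asymmetric around the geometric mean.
set.seed(11)
x <- rlnorm(40, meanlog = 2, sdlog = 0.8)   # simulated right-skewed response

t_log  <- t.test(log(x))                    # inference on the log scale
centre <- exp(t_log$estimate)               # geometric mean, original scale
ci     <- exp(t_log$conf.int)               # asymmetric interval
c(lower = ci[1], centre = unname(centre), upper = ci[2])
```

The upper arm of the interval is longer than the lower arm, which is the asymmetry to report.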
“Torture numbers and they will confess to anything” — Gregg Easterbrook
11 Transformation Decision Guide
| Data problem | Usual response | Comment |
|---|---|---|
| Strong right-skew in positive continuous data | log(x) or log10(x) | Often useful for multiplicative processes or long right tails |
| Count data with variance increasing with the mean | sqrt(x) or a count model | A Poisson or negative-binomial GLM is often better |
| Proportions near 0 or 1 | Usually a binomial GLM | Arcsine transformations are now often unnecessary |
| Severe right-skew with difficult scale | 1/x in rare cases | Hard to interpret biologically |
| Negative skew | Reflect, then transform if needed | Often suggests that a different model or scale may be better |
If the response has its own natural error structure, such as counts, proportions, presence-absence data, or survival times, a generalised model is often better than transformation.
Use the decision guide above to recommend a transformation (or decide that transformation is not appropriate) for each of the following:
- Seabird nest counts per island (range: 0 to 2500; variance increases with the mean).
- Percentage cover of lichen on rock surfaces (range: 2% to 98%; most sites around 50%).
- Salmon body mass (range: 1.5 to 8.2 kg; slight right-skew).
- Presence or absence of a parasitic worm (binary: 0 or 1).
- Time-to-death of bacteria exposed to an antibiotic (range: 5 to 240 minutes; strongly right-skewed).
For each, state the transformation or model type you would recommend and give a one-sentence justification.
12 Common Transformations
12.1 Log Transformation
A log transformation is often useful for positive, right-skewed data. Use log(x) or log10(x). The statistical conclusion does not depend on whether the base is \(e\) or 10 because the two scales differ only by a constant factor.
If zeros are present, a constant is sometimes added first, for example log(x + 1). Do that for a clear reason, not mechanically.
12.2 Square-Root Transformation
The square-root transformation, \(\sqrt{x}\), is often used for count-like data or for responses where the variance increases with the mean.
12.3 Arcsine Transformation
The arcsine square-root transformation,
\[y' = \arcsin(\sqrt{y}) \tag{3}\]
was historically used for proportions between 0 and 1. It still appears in older literature, but a binomial GLM is often the better solution because it respects the mean-variance relationship directly.
The back-transformation is:
\[y = \sin(y')^2 \tag{4}\]
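A quick numerical check of Equations 3 and 4 (the proportions are arbitrary illustrative values):

```r
# Sketch: arcsine square-root transform and its exact back-transform.
p       <- c(0.02, 0.25, 0.50, 0.75, 0.98)
p_trans <- asin(sqrt(p))      # Equation 3
p_back  <- sin(p_trans)^2     # Equation 4
all.equal(p, p_back)          # TRUE: the round trip recovers the proportions
```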
12.4 Reciprocal Transformation
The reciprocal transformation, 1/x, can reduce strong right-skew, but it often produces a scale that is hard to explain biologically.
12.5 Square and Cube Transformations
Square and cube transformations can be useful after reflecting negatively skewed data, but they also magnify larger values and can make outliers more influential.
12.6 Anscombe Transformation
For Poisson counts, the Anscombe transformation is:
\[y' = 2\sqrt{x + \frac{3}{8}} \tag{5}\]
It was designed to stabilise variance in count data, but a Poisson or negative-binomial GLM is usually preferable now.
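A simulation sketch of what "stabilise variance" means here, using arbitrary Poisson means:

```r
# Sketch: on the raw scale the Poisson variance tracks the mean; after the
# Anscombe transformation the variance is close to 1 at every mean.
set.seed(3)
for (lambda in c(4, 16, 64)) {
  x <- rpois(1e4, lambda)
  cat(sprintf("lambda = %2d  var(x) = %6.2f  var(2*sqrt(x + 3/8)) = %4.2f\n",
              lambda, var(x), var(2 * sqrt(x + 3/8))))
}
```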
13 Worked Example with Transformations
To demonstrate what a transformation is trying to achieve, I use the built-in airquality dataset. Here the response is ozone concentration, and I compare May-June with August-September. This is a two-group comparison, so the relevant questions are the same ones used earlier in the chapter: are the distributions within groups reasonably well behaved, and are the group variances similar enough for a mean-based comparison?
Code
ozone_dat <- na.omit(airquality[, c("Ozone", "Month")]) |>
as_tibble() |>
filter(Month %in% c(5, 6, 8, 9)) |>
mutate(
season = factor(
ifelse(Month %in% c(5, 6), "May-June", "August-September")
),
log_ozone = log(Ozone)
)
plt1 <- ggplot(ozone_dat, aes(Ozone)) +
geom_histogram(bins = 20, fill = "grey70", colour = "grey30") +
facet_wrap(~season, scales = "free_x") +
labs(x = "Ozone", y = "Frequency")
plt2 <- ggplot(ozone_dat, aes(log_ozone)) +
geom_histogram(bins = 20, fill = "grey70", colour = "grey30") +
facet_wrap(~season, scales = "free_x") +
labs(x = "log(Ozone)", y = "Frequency")
ggarrange(plt1, plt2, ncol = 2, labels = "AUTO")
Code
# A tibble: 2 × 5
season raw_w raw_p log_w log_p
<fct> <dbl> <dbl> <dbl> <dbl>
1 August-September 0.858 0.0000112 0.973 0.260
2 May-June 0.765 0.00000448 0.939 0.0537
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 6.3393 0.01362 *
88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.0964 0.7569
88
Figure 6 shows the effect of the transformation directly. The raw ozone values show strong right-skew in both seasons. On the raw scale, both groups depart clearly from normality and the variances differ. On the log scale, the upper tail is compressed, the group spreads are closer, and the Shapiro-Wilk and Levene results are far less problematic. The transformation has changed the scale in a way that makes a mean-based comparison more defensible.
In this chapter, I stop at the diagnostic stage, since the inferential step belongs in Chapter 7, where the same ozone data are analysed with a two-sample t-test on the transformed scale before the result is back-transformed for reporting.
14 Check the Assumptions Again
The data diagnostics do not end after we have transformed the data. The transformed data must be checked again. The second pass decides whether the transformation actually improved the fit between the data and the planned method. It is annoying but necessary.
15 When Transformation Is Not the Right Solution
Transformation is usually the wrong response when:
- the response is fundamentally a count, proportion, presence-absence variable, or survival time with its own natural error structure;
- the observations are not independent;
- the biological interpretation becomes less clear than the original problem;
- a more appropriate method already exists.
The airquality dataset (built into R) contains ozone concentrations measured in New York over several months. Run the full assumption-checking workflow on ozone (Ozone) split by Month:
- Produce histograms and Q-Q plots grouped by month.
- Apply the Shapiro-Wilk test within each month.
- Apply Levene’s test for equal variance across months (use car::leveneTest()).
- Based on your diagnostics, would you apply a log transformation? Apply it and re-check.
16 Summary
- Assumptions apply to the structure implied by the analysis.
- Group comparisons require checks within groups and across group variances.
- Paired designs require checks on the differences.
- Associations require checks on the joint pattern of the variables.
- Independence comes from design, not from a formal test.
- Transformations should be justified by the data and followed by a second round of diagnostics.
Citation
@online{smit2026,
author = {Smit, A. J.},
title = {6. {Raw-Data} {Assumptions} and {Transformations}},
date = {2026-04-05},
url = {https://tangledbank.netlify.app/BCB744/basic_stats/06-assumptions-and-transformations.html},
langid = {en}
}

