14. Data Transformations

The dark arts

Author

Affiliation

Smit, A. J.

University of the Western Cape

Published

March 26, 2024

In this Chapter

Discuss the reasons for transforming data
Cover some of the most common data transformations
Discuss the importance of checking the assumptions of the transformed data

Tasks to complete in this Chapter

None

Data transformations

Data transformation is used to change the scale of the data in a way that makes it conform to the assumptions of normality and homoscedasticity so that we can proceed with parametric tests in the usual way. Transformations can be used to change the shape of the distribution of the data, reduce the effects of outliers, or stabilise the variance across levels of the independent variable. However, we must often first identify the way in which our data are distributed (refer to Chapter 4) so we may better decide how to transformation them in an attempt to coerce them into a format that will pass the assumptions of normality and homoscedasticity.

Common data transformations include logarithmic, square root, and reciprocal transformations, among others. These transformations can be applied to the dependent variable, independent variable, or both, depending on the nature of the data and the research question of interest.

When selecting a data transformation method, it is important to consider the goals of the analysis, as well as the properties of the data. Different transformations can have different effects on the distribution of the data, and may lead to different conclusions or interpretations of the results.

When transforming data, one does a mathematical operation on the observations and then use these transformed numbers in the statistical tests. After one as conducted the statistical analysis and calculated the mean ± SD (or ± 95% CI), these values are back transformed (i.e. by applying the reverse of the transformation function) to the original scale before being reported. Note that in back-transformed data the SD (or CI) are not necessarily symmetrical, so one cannot simply compute one (e.g. the upper) and then assumed the lower one would be the same distance away from the mean.

“Torture numbers and they will confess to anything” — Gregg Easterbrook

When transforming data, it is a good idea to know a bit about how data within your field of study are usually transformed—try and use the same approach in your own work. Don’t try all the various transformations until you find one that works, else it might seem as if you are trying to massage the data into an acceptable outcome. The effects of transformations are often difficult to see on the shape of data distributions, especially when you have few samples, so trust that what you are doing is correct. Unfortunately, as I said before, transforming data requires a bit of experience and knowledge with the subject matter, so read widely before you commit to one.

Some of the texts below come from this discussion and from John H. McDonald. Below (i.e. the text on log transformation, square-root transformation, and arcsine transformation) I have extracted, often verbatim, the excellent text produced by John H MacDonald from his ‘Handbook of Biological Statistics’. Please attribute this text directly to him. I have made minor editorial changes to point towards some R code, but aside from that the text is more-or-less used as is. I strongly suggest reading the preceding text under his ‘Data transformations’ section, as well as consulting the textbook for in-depth reading about biostatistics. Highly recommended!

Log transformation

Log transformation is often applied to positively skewed data. It consists of taking the log of each observation. You can use either base-10 logs (log10(x)) or base- $e$ logs, also known as natural logs (log(x)). It makes no difference for a statistical test whether you use base-10 logs or natural logs, because they differ by a constant factor; the base- 10 log of a number is just 2.303…× the natural log of the number. You should specify which log you’re using when you write up the results, as it will affect things like the slope and intercept in a regression. I prefer base-10 logs, because it’s possible to look at them and see the magnitude of the original number: $l o g (1) = 0$ , $l o g (10) = 1$ , $l o g (100) = 2$ , etc.

The back transformation is to raise 10 or $e$ to the power of the number; if the mean of your base-10 log-transformed data is 1.43, the back transformed mean is $10^{1.43} = 26.9$ (in R, 10^1.43). If the mean of your base- $e$ log-transformed data is 3.65, the back transformed mean is $e^{3.65} = 38.5$ (in R, exp(3.65)). If you have zeros or negative numbers, you can’t take the log; you should add a constant to each number to make them positive and non-zero (i.e. log10(x + 1)). If you have count data, and some of the counts are zero, the convention is to add 0.5 to each number.

Many variables in biology have log-normal distributions, meaning that after log-transformation, the values are normally distributed. This is because if you take a bunch of independent factors and multiply them together, the resulting product is log-normal. For example, let’s say you’ve planted a bunch of weed seeds, then 10 years later you see how tall the trees are. The height of an individual tree would be affected by the nitrogen in the soil, the amount of water, amount of sunlight, amount of insect damage, etc. Having more nitrogen might make a tree 10% larger than one with less nitrogen; the right amount of water might make it 30% larger than one with too much or too little water; more sunlight might make it 20% larger; less insect damage might make it 15% larger, etc. Thus the final size of a tree would be a function of nitrogen × water × sunlight × insects, and mathematically, this kind of function turns out to be log-normal.

Arcsine transformation

Arcsine transformation is commonly used for proportions, which range from 0 to 1, or percentages that go from 0 to 100. Specifically, this transformation is quite useful when the data follow a binomial distribution and have extreme proportions close to 0 or 1.

A biological example of the type of data suitable for arcsine transformation is the proportion of offspring that survives or the proportion of plants that succumbs to a disease; such data often follow a binomial distribution.

This transformation involves of taking the arcsine of the square root of a number (in R, arcsin(sqrt(x))). (The result is given in radians, not degrees, and can range from −π/2 to π/2). The numbers to be arcsine transformed must be in the range 0 to 1. […] the back-transformation is to square the sine of the number (in R, sin(x)^2).

Square root transformation

The square root transformation (in R, sqrt(x)) is often used to stabilise the variance of data that have a non-linear relationship between the mean and variance (heteroscedasticity). It is effective for reducing right-skewness (positively skewed). Taking the square root of each observation has the effect of compressing the data towards zero and reducing the impact of extreme values. It is a monotonic transformation, which means that it preserves the order of the data and does not change the relative rankings of the observations.

The square root transformation does not work with negative values, but one could add a constant to each number to make them positive.

A square root transformation is most frequently applied where the data are counts or frequencies, such as the number of individuals in a population or the number of events in a certain time period. Count data are prone to the variance increasing with the mean due to the discrete nature of the data. In these cases, the data tend to follow a Poisson distribution, which is characterised by a variance that is equal to the mean. The same applies to some environmental data, such as rainfall or wind; these may also exhibit heteroscedasticity due to extreme weather phenomena.

Square transformation

Another transformation available for dealing with heteroscedasticity is the square transformation. As the name suggests, it involves taking the square of each observation in a dataset (x^2). The effect sought is to reduce left skewness.

This transformation has the effect of magnifying the differences between values and so increasing the influence of extreme values. However, this can make outliers more prominent and can make it more challenging to interpret the results of statistical analysis.

The square transformation is often used in situations where the data are related to areas or volumes, such as the size of cells or the volume of an organ, where the data may follow a nonlinear relationship between the mean and variance.

Cube transformation

This transformation also applies to heteroscedastic data. It is sometimes used with moderately left skewed data. This transformation is more drastic than a square transformation, and the drawback are more severe.

The cube transformation is less commonly used than other data transformations such as square-root or log transformation. Use with caution.

Reciprocal transformation

It involves taking the reciprocal or inverse of each observation in a dataset (1/x). It is another variance stabilising transformation and is used with severely positively skewed data.

Anscombe transformation

Another variance stabilising transformation is the Anscombe transformation, sqrt(max(x+1)-x). It is applied to negatively skewed data. This transformation can be used to shift the data and compress it towards zero, and remove the influence of extreme values. It is a monotonic transformation, which means that it preserves the order of the data and does not change the relative rankings of the observations.

The Anscombe transformation is useful when dealing with count or frequency data that have a non-linear relationship between the mean and variance; such data are characteristic of Poisson-distributed count data.

Transformation and regressions

Regression models do not necessarily require data transformations to deal with heteroscedasticity. Generalised Linear Models (GLM) can be used with a variety of variance and error structures in the residuals via so-called link functions. Please consult the glm() function for details.

The linearity requirement specifically applies to linear regressions. However, regressions do not have to be linear. Some degree of curvature can be accommodated by additive (polynomial) models, which are like linear regressions, but with additional terms (you already have the knowledge you need to fit such models). More complex departures from linearity can be modelled by non-linear models (e.g. exponential, logistic, Michaelis-Menten, Gompertz, von Bertalanffy, and their ilk) or Generalised Additive Models (GAM)—these more complex relationships will not be covered in this module. The gam() function in the mgcv package fits GAMs. After fitting these parametric or semi-parametric models to accommodate non-linear regressions, the residual error structure still does to meet the normality requirements, and these can be tested as before with simple linear regressions.

The importance of checking the assumptions of the transformed data

It is important to check the assumptions of the transformed data after applying a transformation. The transformed data may not meet the assumptions, so applying any of the tests for normality, homoscedasticity, and independence is still necessary before we resume statistical testing.

Conclusion

Knowing how to successfully implement transformations can be as much alchemy as science and requires a great deal of experience to get right. Due to the multitude of options I cannot offer comprehensive examples to deal with all eventualities—so I will not provide any examples at all! I suggest reading widely on the internet or textbooks, and practising by yourselves on your own datasets.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2024,
  author = {Smit, A. J.,},
  title = {14. {Data} {Transformations}},
  date = {2024-03-26},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/14-transformations.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2024) 14. Data Transformations. http://tangledbank.netlify.app/BCB744/basic_stats/14-transformations.html.

--- date: "2024-03-26" title: "14. Data Transformations" subtitle: "The dark arts" --- ```{r} #| echo: false #| message: false library(tidyverse) ``` ::: {.callout-note appearance="simple"} ## In this Chapter - Discuss the reasons for transforming data - Cover some of the most common data transformations - Discuss the importance of checking the assumptions of the transformed data ::: ::: {.callout-important appearance="simple"} ## Tasks to complete in this Chapter - None ::: # Data transformations Data transformation is used to change the scale of the data in a way that makes it conform to the assumptions of normality and homoscedasticity so that we can proceed with parametric tests in the usual way. Transformations can be used to *change the shape of the distribution* of the data, *reduce the effects of outliers*, or *stabilise the variance* across levels of the independent variable. However, we must often first identify the way in which our data are distributed (refer to [Chapter 4](04-distributions.qmd)) so we may better decide how to transformation them in an attempt to coerce them into a format that will pass the assumptions of normality and homoscedasticity. Common data transformations include **logarithmic**, **square root**, and **reciprocal** transformations, among others. These transformations can be applied to the dependent variable, independent variable, or both, depending on the nature of the data and the research question of interest. When selecting a data transformation method, it is important to consider the goals of the analysis, as well as the properties of the data. Different transformations can have different effects on the distribution of the data, and may lead to different conclusions or interpretations of the results. When transforming data, one does a mathematical operation on the observations and then use these transformed numbers in the statistical tests. After one as conducted the statistical analysis and calculated the mean ± SD (or ± 95% CI), these values are back transformed (i.e. by applying the reverse of the transformation function) to the original scale before being reported. Note that in back-transformed data the SD (or CI) are not necessarily symmetrical, so one cannot simply compute one (e.g. the upper) and then assumed the lower one would be the same distance away from the mean. > “*Torture numbers and they will confess to anything*” --- Gregg Easterbrook When transforming data, it is a good idea to know a bit about how data within your field of study are usually transformed---try and use the same approach in your own work. Don't try all the various transformations until you find one that works, else it might seem as if you are trying to massage the data into an acceptable outcome. The effects of transformations are often difficult to see on the shape of data distributions, especially when you have few samples, so trust that what you are doing is correct. Unfortunately, as I said before, transforming data requires a bit of experience and knowledge with the subject matter, so read widely before you commit to one. Some of the texts below come from [this discussion](http://fmwww.bc.edu/repec/bocode/t/transint.html) and from [John H. McDonald](http://www.biostathandbook.com/transformation.html). Below (i.e. the text on log transformation, square-root transformation, and arcsine transformation) I have extracted, often verbatim, the excellent text produced by John H MacDonald from his 'Handbook of Biological Statistics'. Please attribute this text directly to him. I have made minor editorial changes to point towards some R code, but aside from that the text is more-or-less used as is. I strongly suggest reading the preceding text under his 'Data transformations' section, as well as consulting the textbook for in-depth reading about biostatistics. Highly recommended! ## Log transformation Log transformation is often applied to positively skewed data. It consists of taking the log of each observation. You can use either base-10 logs (`log10(x)`) or base-$e$ logs, also known as natural logs (`log(x)`). It makes no difference for a statistical test whether you use base-10 logs or natural logs, because they differ by a constant factor; the base- 10 log of a number is just 2.303...× the natural log of the number. You should specify which log you're using when you write up the results, as it will affect things like the slope and intercept in a regression. I prefer base-10 logs, because it's possible to look at them and see the magnitude of the original number: $log(1) = 0$, $log(10) = 1$, $log(100) = 2$, etc. The back transformation is to raise 10 or $e$ to the power of the number; if the mean of your base-10 log-transformed data is 1.43, the back transformed mean is $10^{1.43} = 26.9$ (in R, `10^1.43`). If the mean of your base-$e$ log-transformed data is 3.65, the back transformed mean is $e^{3.65} = 38.5$ (in R, `exp(3.65)`). If you have zeros or negative numbers, you can't take the log; you should add a constant to each number to make them positive and non-zero (i.e. `log10(x + 1))`. If you have count data, and some of the counts are zero, the convention is to add 0.5 to each number. Many variables in biology have log-normal distributions, meaning that after log-transformation, the values are normally distributed. This is because if you take a bunch of independent factors and multiply them together, the resulting product is log-normal. For example, let's say you've planted a bunch of weed seeds, then 10 years later you see how tall the trees are. The height of an individual tree would be affected by the nitrogen in the soil, the amount of water, amount of sunlight, amount of insect damage, etc. Having more nitrogen might make a tree 10% larger than one with less nitrogen; the right amount of water might make it 30% larger than one with too much or too little water; more sunlight might make it 20% larger; less insect damage might make it 15% larger, etc. Thus the final size of a tree would be a function of nitrogen × water × sunlight × insects, and mathematically, this kind of function turns out to be log-normal. ## Arcsine transformation Arcsine transformation is commonly used for proportions, which range from 0 to 1, or percentages that go from 0 to 100. Specifically, this transformation is quite useful when the data follow a binomial distribution and have extreme proportions close to 0 or 1. A biological example of the type of data suitable for arcsine transformation is the proportion of offspring that survives or the proportion of plants that succumbs to a disease; such data often follow a binomial distribution. This transformation involves of taking the arcsine of the square root of a number (in R, `arcsin(sqrt(x))`). (The result is given in radians, not degrees, and can range from −π/2 to π/2). The numbers to be arcsine transformed must be in the range 0 to 1. [...] the back-transformation is to square the sine of the number (in R, `sin(x)^2`). ## Square root transformation The square root transformation (in R, `sqrt(x)`) is often used to stabilise the variance of data that have a non-linear relationship between the mean and variance (heteroscedasticity). It is effective for reducing right-skewness (positively skewed). Taking the square root of each observation has the effect of compressing the data towards zero and reducing the impact of extreme values. It is a monotonic transformation, which means that it preserves the order of the data and does not change the relative rankings of the observations. The square root transformation does not work with negative values, but one could add a constant to each number to make them positive. A square root transformation is most frequently applied where the data are counts or frequencies, such as the number of individuals in a population or the number of events in a certain time period. Count data are prone to the variance increasing with the mean due to the discrete nature of the data. In these cases, the data tend to follow a Poisson distribution, which is characterised by a variance that is equal to the mean. The same applies to some environmental data, such as rainfall or wind; these may also exhibit heteroscedasticity due to extreme weather phenomena. ## Square transformation Another transformation available for dealing with heteroscedasticity is the square transformation. As the name suggests, it involves taking the square of each observation in a dataset (`x^2`). The effect sought is to reduce left skewness. This transformation has the effect of magnifying the differences between values and so increasing the influence of extreme values. However, this can make outliers more prominent and can make it more challenging to interpret the results of statistical analysis. The square transformation is often used in situations where the data are related to areas or volumes, such as the size of cells or the volume of an organ, where the data may follow a nonlinear relationship between the mean and variance. ## Cube transformation This transformation also applies to heteroscedastic data. It is sometimes used with moderately left skewed data. This transformation is more drastic than a square transformation, and the drawback are more severe. The cube transformation is less commonly used than other data transformations such as square-root or log transformation. Use with caution. ## Reciprocal transformation It involves taking the reciprocal or inverse of each observation in a dataset (`1/x`). It is another variance stabilising transformation and is used with severely positively skewed data. ## Anscombe transformation Another variance stabilising transformation is the Anscombe transformation, `sqrt(max(x+1)-x)`. It is applied to negatively skewed data. This transformation can be used to shift the data and compress it towards zero, and remove the influence of extreme values. It is a monotonic transformation, which means that it preserves the order of the data and does not change the relative rankings of the observations. The Anscombe transformation is useful when dealing with count or frequency data that have a non-linear relationship between the mean and variance; such data are characteristic of Poisson-distributed count data.    # Transformation and regressions Regression models do not necessarily require data transformations to deal with heteroscedasticity. Generalised Linear Models (GLM) can be used with a variety of variance and error structures in the residuals via so-called link functions. Please consult the `glm()` function for details. The linearity requirement specifically applies to linear regressions. However, regressions do not have to be linear. Some degree of curvature can be accommodated by additive (polynomial) models, which are like linear regressions, but with additional terms (you already have the knowledge you need to fit such models). More complex departures from linearity can be modelled by non-linear models (e.g. exponential, logistic, Michaelis-Menten, Gompertz, von Bertalanffy, and their ilk) or Generalised Additive Models (GAM)---these more complex relationships will not be covered in this module. The `gam()` function in the **mgcv** package fits GAMs. After fitting these parametric or semi-parametric models to accommodate non-linear regressions, the residual error structure still does to meet the normality requirements, and these can be tested as before with simple linear regressions. # The importance of checking the assumptions of the transformed data It is important to check the assumptions of the transformed data after applying a transformation. The transformed data may not meet the assumptions, so applying any of the tests for normality, homoscedasticity, and independence is still necessary before we resume statistical testing. # Conclusion Knowing how to successfully implement transformations can be as much alchemy as science and requires a great deal of experience to get right. Due to the multitude of options I cannot offer comprehensive examples to deal with all eventualities---so I will not provide any examples at all! I suggest reading widely on the internet or textbooks, and practising by yourselves on your own datasets.