8. ANOVA

Comparing the means of more than two groups

Author

Affiliation

Smit, A. J.

University of the Western Cape

Published

January 1, 2021

“He uses statistics as a drunken man uses lamp posts—for support rather than for illumination.”

— Marissa Mayer

In this Chapter

Single factor ANOVA
Multiple factor ANOVA
Tukey HSD test
Wilcoxon rank sum test
Kruskal-Wallis rank sum test

Cheatsheet

Find here a Cheatsheet on statistical methods.

Tasks to complete in this Chapter

Task G

1 At a glance

Where the Student’s and Welch’s t-tests are used to test for the difference between the means of two groups, analysis of variance (ANOVA)¹ is used to compare the means of three or more groups. The following expectations about your data should be met to conduct an ANOVA:

¹ Variations of the basic ANOVA include an analysis of covariance (ANCOVA) when, in addition to the categorical variable, your data also contains a continuous covariate (e.g., age, mass) that you want to control. Or, if your study involves multiple dependent variables, you may need to consider using multivariate analysis of variance (MANOVA) instead. ANOVAs that have multiple independent variables are called factorial ANOVAs (e.g. 2-way ANOVA or 3-way ANOVA).

Continuous dependent variable The dependent variable should be measured on a continuous scale (e.g., height, weight, test scores).
Categorical independent variable The independent variable should be categorical with at least three levels or groups (e.g., different treatments, age groups).
- Independent groups The groups being compared should be independent of each other, meaning that the observations within each group should not affect the other group’s observations. In the case of repeated measures ANOVA, the groups are related (e.g., before-and-after measurements or multiple measurements on the same subjects).

Other assumptions to be aware of with regards to the dependent variable:

Independence of data
Normally distributed data
Homogeneity of variances
We also encourage that the data are balanced

2 Introduction to ANOVA

Whole big books have been written about analysis of variance (ANOVA). Although there are many experimental designs that may be analysed with ANOVAs, biologists are taught to pay special attention to the design of experiments, and generally make sure that the experiments are fully factorial (in the case of two-way or higher ANOVAs) and balanced. For this reason we will focus in this introductory statistics course on one-way and factorial ANOVAs only.

Factorial designs

A factorial experimental design involves studying the effects of two or more factors (also known as independent variables) on a response variable (dependent variable) simultaneously. The term ‘factorial’ comes from the fact that each level of one factor is combined with each level of the other factor(s), resulting in all possible combinations of factor levels.

In a factorial design, the main effects of each factor and the interaction effects between factors can be analysed. When we have two factors, this is typically done by applying a 2-way ANOVA, but higher-order ANOVAs are also available. The main effect of a factor is its individual effect on the response variable (ignoring the effect due to the other(s)), while the interaction effect occurs when the effect of one factor depends on the level of another factor.

The simplest factorial design is the 2 $\times$ 2 design, which involves two factors, each with two levels. For example, if you are studying the effect of temperature (high vs. low) and fertiliser concentration (high vs. low) on the yield of a crop, a 2 $\times$ 2 factorial design would involve four experimental conditions:

high temperature and high fertiliser concentration
high temperature and low fertiliser concentration
low temperature and high fertiliser concentration
low temperature and low fertiliser concentration

In the above example there are 4 possible combinations. Factorial designs can involve more than two factors and/or more than two levels per factor, leading to more complex experimental setups. For example, in a 3 $\times$ 4 there will be 12 combinations of the two factor levels, and in a 2 $\times$ 3 $\times$ 5 factorial experiment the three factors with their respective levels will result in 30 combinations.

As we have seen in Chapter 7 about t-tests, ANOVAs also require that some assumptions are met:

Normal distribution The data in each group should follow a normal distribution or be approximately normally distributed. If the assumption of normality is not met, as, for example, determined with the shapiro.test(), you may consider using a non-parametric alternative such as Kruskal-Wallis rank sum test, kruskal.test(). This assumption can be relaxed for large sample sizes due to the Central Limit Theorem.
Homoscedasticity The variances of the groups should be approximately equal. This assumption can be tested using Levene’s test, car::leveneTest(), or Bartlett’s test, bartlett.test(). When the variances are different but the data are normally distributed, consider the ‘Welch ANOVA,’ oneway.test(), instead of aov().
Random sampling The data should be obtained through random sampling or random assignment, ensuring that each observation has an equal chance of being included in the sample.
Independent observations The observations within each group should be independent of each other.

If some of the above assumptions are violated, then your course of action is to either use a non-parametric test (and here), transform the data (as in Chapter 6) or use a generalised linear model if non-normal, or to use a linear mixed model when non-independence of data cannot be guaranteed. As I have already indicated, ANOVAs are also sensitive to the presence of outliers, so we need to ensure that outliers are not present. Outliers can be removed but if they are an important feature of the data, then a non-parametric test can be used.

Rather than talking about t-tests and ANOVAs as if they do different things, let us acknowledge that they ask a similar question. That question being, “are the means of these two or more things we want to compare different, or are they the same?” At this stage it is important to note that, as with t-tests, the independent variable is expected to be categorical (i.e. a factor denoting two or more different treatments or sampling conditions) and that the dependent variable must be continuous. You may perhaps be more familiar with this question when it is presented as a set of hypotheses as we saw in a t-test:

H₀ Group A is not different from Group B.
H_a Group A is different from Group B.

In an ANOVA, hypotheses willl be more similar to these:

H₀ There are no differences among Groups A, B, or C.
H_a There are differences among Groups A, B, or C.

This is a scientific question in the simplest sense. Often, for basic inquiries such as that posed above, we need to see if one group differs significantly from another. The way in which we accomplish this is by looking at the mean and variance within a set of data compared against another similar set.

2.1 Remember the t-test

As you already know, a t-test is used when we want to compare two different sample sets against one another. This is also known as a two-factor or two level test. When one wants to compare multiple (more than two) sample sets against one another an ANOVA is required (I’ll get there shortly). Remember how to perform a t-test in R: we will revisit this test using the chicks data, but only for Diets 1 and 2 from day 21.

# First grab the data
chicks <- as_tibble(ChickWeight)

# Then subset out only the sample sets to be compared
chicks_sub <- chicks %>% 
  filter(Diet %in% c(1, 2), Time == 21)

Once we have filtered our data we may now perform the t-test.

t.test(weight ~ Diet, data = chicks_sub)


    Welch Two Sample t-test

data:  weight by Diet
t = -1.2857, df = 15.325, p-value = 0.2176
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -98.09263  24.19263
sample estimates:
mean in group 1 mean in group 2 
         177.75          214.70

As one may recall from Chapter 7, whenever we want to give a formula to a function in R, we use the ~. The formula used above, weight ~ Diet, reads in plain English as “weight as a function of diet”. This is perhaps easier to understand as “Y as a function of X.” This means that we are assuming whatever is to the left of the ~ is the dependant variable, and whatever is to the right is the independent variable. Did the Diet 1 and 2 produce significantly fatter birds?

One could also supplement the output by producing a graph (Figure 1).

library(ggstatsplot)

## since the confidence intervals for the effect sizes are computed using
## bootstrapping, important to set a seed for reproducibility
set.seed(13)

## parametric t-test and box plot
ggbetweenstats(
  data = chicks_sub,
  x = Diet,
  y = weight,
  xlab = "Diet",
  ylab = "Chick mass (g)",
  plot.type = "box",
  p.adjust.method = "bonferroni",
  pairwise.display = "ns",
  type = "p",
  results.subtitle = FALSE,
  conf.level = 0.95,
  title = "t-test",
  ggtheme = ggthemes::theme_fivethirtyeight(),
  package = "basetheme",
  palette = "ink"
)

Figure 1: Box-and-whisker plot showing the differences in means between chicks reared to 21 days old and fed Diets 1 and 2

Notice above that we did not need to specify to use a t-test. The ggbetweenstats() function automatically determines if an independent samples t-test or a 1-way ANOVA is required based on whether there are two groups or three or more groups within the grouping (factor) variable.

That was a nice revision. But applied to the chicks data it seemed a bit silly, because you may ask, “What if I wanted to know if there are differences among the means computed at Day 1, Day 6, Day 10, and Day 21?” We should not use t-tests to do this (although we can). So now we can move on to the ANOVA.

Task G.1: Do it now!

Why should we not just apply t-tests once per each of the pairs of comparisons we want to make?

2.2 Why not do multiple t-tests?

In the chicks data we have four diets, not only two as in the t-test example just performed. Why not then simply do a t-test multiple times, once for each pair of diets given to the chickens? Multiple t-tests would be written as:

$H_{0} : μ_{1} = μ_{2}$
$H_{0} : μ_{1} = μ_{3}$
$H_{0} : μ_{1} = μ_{4}$
$H_{0} : μ_{2} = μ_{3}$
$H_{0} : μ_{2} = μ_{4}$
$H_{0} : μ_{3} = μ_{4}$

This would be invalid. The problem is that the chance of committing a Type I error increases as more multiple comparisons are done. So, the overall chance of rejecting the H₀ increases. Why? If one sets $α = 0.05$ (the significance level below which the H₀ is no longer accepted), one will still reject the H₀ 5% of the time when it is in fact true (i.e. when there is no difference between the groups). When many pairwise comparisons are made, the probability of rejecting the H₀ at least once is higher because we take this 5% risk each time we repeat a t-test. In the case of the chicken diets, we would have to perform six t-tests, and the error rate would increase to slightly less than $6 \times 5 %$ . See Table 1. ::: {.cell layout-align=“center”}

:::

K	0.2	0.1	0.05	0.02	0.01	0.001
2	0.20	0.10	0.05	0.02	0.01	0.00
3	0.49	0.27	0.14	0.06	0.03	0.00
4	0.74	0.47	0.26	0.11	0.06	0.01
5	0.89	0.65	0.40	0.18	0.10	0.01
10	1.00	0.99	0.90	0.60	0.36	0.04
20	1.00	1.00	1.00	0.98	0.85	0.17
100	1.00	1.00	1.00	1.00	1.00	0.99

Table 1. Probability of committing a Type I error due to applying multiple t-tests to test for differences between K means. α from 0.2 to 0.0001 are shown.

If you insist in creating more work for yourself and do t-tests many times, one way to overcome the problem of committing Type I errors that stem from multiple comparisons is to apply a Bonferroni correction.

Bonferonni correction

The Bonferroni correction is used to adjust the significance level of multiple hypothesis tests, such as multiple paired t-tests among many groups, in order to reduce the risk of false positives or Type I errors. It is named after the Italian mathematician Carlo Emilio Bonferroni.

The Bonferroni correction is based on the principle that when multiple hypothesis tests are performed, the probability of observing at least one significant result due to random chance increases. To correct for this, the significance level (usually 0.05) is divided by the number of tests being performed. This results in a more stringent significance level for each individual test and it so reduces the risk of committing a Type I error.

For example, if we conduct ten hypothesis tests, the significance level for each test after Bonferonni correction would become 0.05/10 = 0.005. The implication is that each individual test would need to have a p-value less than 0.005 to be considered significant at the overall significance level of 0.05.

On the downside, this method can be overly conservative and we may then increase the risk of Type II errors, which are false negatives. If you really cannot avoid multiple tests, then also assess one of the alternatives to Bonferonni’s method, viz: the false discovery rate (FDR) correction, the Holm-Bonferroni correction, Benjamini-Hochberg’s procedure, the Sidak correction, or some of the Bayesian approaches.

Or better still, we do an ANOVA that controls for these Type I errors so that it remains at 5%.

3 ANOVA

Hypothesis for an ANOVA

If we have four groups whose means we want to compare, a suitable H₀ would be:

$H_{0} : μ_{1} = μ_{2} = μ_{3} = μ_{4}$

and

$H_{a}$ : the means of the four groups are not equal.

Here, $μ_{1}$ , $μ_{2}$ , $μ_{3}$ , and $μ_{4}$ are four population means. For the H₀ to be rejected, all that is required is for one of the pairs of means to be different, not all of them.

3.1 Single factor

We continue with the chicken data. The t-test showed that Diets 1 and 2 resulted in the same chicken mass at Day 21. What about the other two diets? Our H₀ is that, at Day 21, $μ_{1} = μ_{2} = μ_{3} = μ_{4}$ . Is there a statistical difference between chickens fed these four diets, or do we retain the H₀? The R function for an ANOVA is aov(). To look for significant differences between all four diets on the last day of sampling we use this one line of code:

chicks.aov1 <- aov(weight ~ Diet, data = filter(chicks, Time == 21))
summary(chicks.aov1)

            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         3  57164   19055   4.655 0.00686 **
Residuals   41 167839    4094                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task G.2: Do it now!

What does the outcome say about the chicken masses? Which ones are different from each other?
Devise a graphical display of this outcome.

If this seems too easy to be true, it’s because we aren’t quite done yet. You could use your graphical display to eyeball where the significant differences are, or we can turn to a more ‘precise’ approach. The next step one could take is to run a Tukey HSD test on the results of the ANOVA by wrapping tukeyHSD() around aov():

TukeyHSD(chicks.aov1)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ Diet, data = filter(chicks, Time == 21))

$Diet
         diff        lwr       upr     p adj
2-1  36.95000  -32.11064 106.01064 0.4868095
3-1  92.55000   23.48936 161.61064 0.0046959
4-1  60.80556  -10.57710 132.18821 0.1192661
3-2  55.60000  -21.01591 132.21591 0.2263918
4-2  23.85556  -54.85981 102.57092 0.8486781
4-3 -31.74444 -110.45981  46.97092 0.7036249

The output of tukeyHSD() shows us that pairwise comparisons of all of the groups we are comparing. We can also display this as a very rough figure (Figure 2):

plot(TukeyHSD(chicks.aov1))

Figure 2: A plot of the Tukey-HSD test showing the differences in means between chicks reared to 21 days old and fed four diets.

We may also produce a nicer looking graphical summary in the form of a box-and-whisker plot and/or a violin plot. Here I combine both (Figure 3):

set.seed(666)

## parametric t-test and box plot
ggbetweenstats(
  data = filter(chicks, Time == 21),
  x = Diet,
  y = weight,
  xlab = "Diet",
  ylab = "Chick mass (g)",
  plot.type = "box",
  boxplot.args = list(notch = TRUE),
  type = "parametric",
  results.subtitle = FALSE,
  pairwise.comparisons = TRUE,
  pairwise.display = "s",
  p.adjust.method = "bonferroni",
  conf.level = 0.95,
  title = "ANOVA",
  ggtheme = ggthemes::theme_fivethirtyeight(),
  package = "basetheme",
  palette = "ink"
)

Figure 3: Box-and-whisker plot showing the differences in means between chicks reared to 21 days old and fed four diets. Shown is a notched box plot where the extent of the notches is `1.58 * IQR / sqrt(n)`. This is approximately equivalent to a 95% confidence interval andf may be used for comparing medians.

Task G.3: Do it now!

Look at the help file for the TukeyHSD() function to better understand what the output means.

How does one interpret the results? What does this tell us about the effect that that different diets has on the chicken weights at Day 21?
Figure out a way to plot the Tukey HSD outcomes in ggplot.
Why does the ANOVA return a significant result, but the Tukey test shows that not all of the groups are significantly different from one another?

3.2 Multiple factors

What if we have multiple grouping variables, and not just one? We would encounter this kind of situation in factorial designs. In the case of the chicken data, there is also time that seems to be having an effect.

Task G.4: Do it now!

How is time having an effect? (/3)
What hypotheses can we construct around time? (/2)

Let us look at some variations around questions concerning time. We might ask, at a particular time step, are there differences amongst the effect due to diet on chicken mass? Let’s see when diets are starting to have an effect by examining the outcomes at times 0, 2, 10, and 21:

# effect at time = 0
summary(aov(weight ~ Diet, data = filter(chicks, Time == 0)))

            Df Sum Sq Mean Sq F value Pr(>F)
Diet         3   4.32   1.440   1.132  0.346
Residuals   46  58.50   1.272

# effect at time = 2
summary(aov(weight ~ Diet, data = filter(chicks, Time == 2)))

            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         3  158.4   52.81   4.781 0.00555 **
Residuals   46  508.1   11.05                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# effect at time = 10
summary(aov(weight ~ Diet, data = filter(chicks, Time == 10)))

            Df Sum Sq Mean Sq F value   Pr(>F)    
Diet         3   8314    2771    6.46 0.000989 ***
Residuals   45  19304     429                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# effect at time = 21
summary(aov(weight ~ Diet, data = filter(chicks, Time == 21)))

            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         3  57164   19055   4.655 0.00686 **
Residuals   41 167839    4094                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task G.5: Do it now!

What do you conclude from the above series of ANOVAs? (/3)
What problem is associated with running multiple tests in the way that we have done here? (/2)

Or we may ask, regardless of diet (i.e. disregarding the effect of diet by clumping all chickens together), is time having an effect?

chicks.aov2 <- aov(weight ~ as.factor(Time),
                   data = filter(chicks, Time %in% c(0, 2, 10, 21)))
summary(chicks.aov2)

                 Df Sum Sq Mean Sq F value Pr(>F)    
as.factor(Time)   3 939259  313086   234.8 <2e-16 ***
Residuals       190 253352    1333                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task G.6: Do it now!

Write out the hypotheses for this ANOVA. (/2)
What do you conclude from the above ANOVA? (/3)

Or, to save ourselves a lot of time and reduce the coding effort, we may simply run a two-way ANOVA and look at the effects of Diet and Time simultaneously. To specify the different factors we put them in our formula and separate them with a +:

summary(aov(weight ~ Diet + as.factor(Time),
            data = filter(chicks, Time %in% c(0, 21))))

                Df Sum Sq Mean Sq F value  Pr(>F)    
Diet             3  39595   13198   5.987 0.00091 ***
as.factor(Time)  1 734353  734353 333.120 < 2e-16 ***
Residuals       90 198402    2204                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task G.7: Do it now!

What question are we asking with the above line of code? (/3)
What is the answer? (/2)
Why did we wrap Time in as.factor()? (/2)

It is also possible to look at what the interaction effect between grouping variables (i.e. in this case the effect of time on diet—does the effect of time depend on which diet we are looking at?), and not just within the individual grouping variables. To do this we replace the + in our formula with *:

summary(aov(weight ~ Diet * as.factor(Time),
            data = filter(chicks, Time %in% c(4, 21))))

                     Df Sum Sq Mean Sq F value   Pr(>F)    
Diet                  3  40914   13638   6.968 0.000298 ***
as.factor(Time)       1 582221  582221 297.472  < 2e-16 ***
Diet:as.factor(Time)  3  25530    8510   4.348 0.006684 ** 
Residuals            86 168322    1957                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Task G.8: Do it now!

How do these results differ from the previous set? (/3)

One may also run a post-hoc Tukey test on these results the same as for a single factor ANOVA:

TukeyHSD(aov(weight ~ Diet * as.factor(Time),
             data = filter(chicks, Time %in% c(20, 21))))

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ Diet * as.factor(Time), data = filter(chicks, Time %in% c(20, 21)))

$Diet
         diff        lwr       upr     p adj
2-1  36.18030  -9.301330  81.66194 0.1663037
3-1  90.63030  45.148670 136.11194 0.0000075
4-1  62.25253  15.223937 109.28111 0.0045092
3-2  54.45000   3.696023 105.20398 0.0305957
4-2  26.07222 -26.072532  78.21698 0.5586643
4-3 -28.37778 -80.522532  23.76698 0.4863940

$`as.factor(Time)`
          diff       lwr      upr     p adj
21-20 8.088223 -17.44017 33.61661 0.5303164

$`Diet:as.factor(Time)`
                diff        lwr        upr     p adj
2:20-1:20  35.188235  -40.67378 111.050253 0.8347209
3:20-1:20  88.488235   12.62622 164.350253 0.0111136
4:20-1:20  63.477124  -14.99365 141.947897 0.2035951
1:21-1:20   7.338235  -58.96573  73.642198 0.9999703
2:21-1:20  44.288235  -31.57378 120.150253 0.6116081
3:21-1:20  99.888235   24.02622 175.750253 0.0023872
4:21-1:20  68.143791  -10.32698 146.614563 0.1371181
3:20-2:20  53.300000  -31.82987 138.429869 0.5234263
4:20-2:20  28.288889  -59.17374 115.751515 0.9723470
1:21-2:20 -27.850000 -104.58503  48.885027 0.9486212
2:21-2:20   9.100000  -76.02987  94.229869 0.9999766
3:21-2:20  64.700000  -20.42987 149.829869 0.2732059
4:21-2:20  32.955556  -54.50707 120.418182 0.9377007
4:20-3:20 -25.011111 -112.47374  62.451515 0.9862822
1:21-3:20 -81.150000 -157.88503  -4.414973 0.0305283
2:21-3:20 -44.200000 -129.32987  40.929869 0.7402877
3:21-3:20  11.400000  -73.72987  96.529869 0.9998919
4:21-3:20 -20.344444 -107.80707  67.118182 0.9960548
1:21-4:20 -56.138889 -135.45396  23.176184 0.3619622
2:21-4:20 -19.188889 -106.65152  68.273738 0.9972631
3:21-4:20  36.411111  -51.05152 123.873738 0.8984019
4:21-4:20   4.666667  -85.06809  94.401428 0.9999998
2:21-1:21  36.950000  -39.78503 113.685027 0.8067041
3:21-1:21  92.550000   15.81497 169.285027 0.0075185
4:21-1:21  60.805556  -18.50952 140.120628 0.2629945
3:21-2:21  55.600000  -29.52987 140.729869 0.4679025
4:21-2:21  23.855556  -63.60707 111.318182 0.9896157
4:21-3:21 -31.744444 -119.20707  55.718182 0.9486128

Task G.9: Do it now!

Yikes! That’s a massive amount of results. What does all of this mean, and why is it so verbose? (/5)

Summary

To summarise t-tests, single-factor (1-way) and multifactor (2- or 3-way, etc.) ANOVAs:

A t-test is applied to situations where one wants to compare the means of only two groups of a response variable within one categorical independent variable (we say a factor with two levels).
A 1-way ANOVA also looks at the means of a response variable belonging to one categorical independent variable, but the categorical response variable has more than two levels in it.
Following on from there, a 2-way ANOVA compares the means of response variables belonging to all the levels within two categorical independent variables (e.g. Factor 1 might have three levels, and Factor 2 five levels). In the simplest formulaton, it does so by looking at the main effects, which is the group differences between the three levels of Factor 1 and disregarding the contribution due to the group membership to Factor 2, and also the group differences amongst the levels of Factor 2 but disregarding the group membership of Factor 1. In addition to looking at the main effects, a 2-way ANOVA can also consider the interaction (or combined effect) of Factors 1 and 2 in influencing the means.

4 Alternatives to ANOVA

In the first main section of this chapter we learned how to test hypotheses based on the comparisons of means between sets of data when we were able to meet our two base assumptions. These parametric tests are preferred over non-parametric tests because they are more robust. However, when we simply aren’t able to meet these assumptions we must not despair. Non-parametric tests are still useful. In this chapter we will learn how to run non-parametric tests for two sample and multiple sample datasets. To start, let’s load our libraries and chicks data if we have not already.

# First activate libraries
library(tidyverse)
library(ggpubr)

# Then load data
chicks <- as_tibble(ChickWeight)

With our libraries and data loaded, let’s find a day in which at least one of our assumptions are violated.

# Then check for failing assumptions
chicks %>% 
  filter(Time == 0) %>% 
  group_by(Diet) %>% 
  summarise(norm_wt = as.numeric(shapiro.test(weight)[2]),
            var_wt = var(weight))

# A tibble: 4 × 3
  Diet  norm_wt var_wt
  <fct>   <dbl>  <dbl>
1 1     0.0138   0.989
2 2     0.138    2.23 
3 3     0.00527  1.07 
4 4     0.0739   1.11

4.1 Wilcoxon rank sum test

The non-parametric version of a t-test is a Wilcox rank sum test. To perform this test in R we may again use compare_means() and specify the test we want:

compare_means(weight ~ Diet,
              data = filter(chicks, Time == 0,
                            Diet %in% c(1, 2)),
              method = "wilcox.test")

# A tibble: 1 × 8
  .y.    group1 group2     p p.adj p.format p.signif method  
  <chr>  <chr>  <chr>  <dbl> <dbl> <chr>    <chr>    <chr>   
1 weight 1      2      0.235  0.23 0.23     ns       Wilcoxon

What do our results show?

4.2 Kruskall-Wallis rank sum test

4.2.1 Single factor

The non-parametric version of an ANOVA is a Kruskall-Wallis rank sum test. As you may have by now surmised, this may be done with compare_means() as seen below:

compare_means(weight ~ Diet,
              data = filter(chicks, Time == 0),
              method = "kruskal.test")

# A tibble: 1 × 6
  .y.        p p.adj p.format p.signif method        
  <chr>  <dbl> <dbl> <chr>    <chr>    <chr>         
1 weight 0.475  0.48 0.48     ns       Kruskal-Wallis

As with the ANOVA, this first step with the Kruskall-Wallis test is not the last. We must again run a post-hoc test on our results. This time we will need to use pgirmess::kruskalmc(), which means we will need to load a new library.

library(pgirmess)

kruskalmc(weight ~ Diet, data = filter(chicks, Time == 0))

Multiple comparison test after Kruskal-Wallis 
alpha: 0.05 
Comparisons
    obs.dif critical.dif stat.signif
1-2    6.95     14.89506       FALSE
1-3    6.90     14.89506       FALSE
1-4    4.15     14.89506       FALSE
2-3    0.05     17.19933       FALSE
2-4    2.80     17.19933       FALSE
3-4    2.75     17.19933       FALSE

Let’s consult the help file for kruskalmc() to understand what this print-out means.

4.2.2 Multiple factors

The water becomes murky quickly when one wants to perform multiple factor non-parametric comparison of means tests. To that end, we will not cover the few existing methods here. Rather, one should avoid the necessity for these types of tests when designing an experiment.

4.3 The SA time data

sa_time <- as_tibble(read_csv("../../data/snakes.csv",
                              col_types = list(col_double(),
                                               col_double(),
                                               col_double())))
sa_time_long <- sa_time %>% 
  gather(key = "term", value = "minutes") %>% 
  filter(minutes < 300) %>% 
  mutate(term = as.factor(term))

my_comparisons <- list( c("now", "now_now"),
                        c("now_now", "just_now"),
                        c("now", "just_now") )

ggboxplot(sa_time_long, x = "term", y = "minutes",
          color = "term", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = "jitter", shape = "term")

Figure 4: Time is not a limited resource in South Africa.

5 Examples

5.1 Snakes!

These data could be analysed by a two-way ANOVA without replication, or a repeated measures ANOVA. Here I will analyse it by using a two-way ANOVA without replication.

Place and Abramson (2008) placed diamondback rattlesnakes (Crotalus atrox) in a ‘rattlebox,’ a box with a lid that would slide open and shut every 5 minutes. At first, the snake would rattle its tail each time the box opened. After a while, the snake would become habituated to the box opening and stop rattling its tail. They counted the number of box openings until a snake stopped rattling; fewer box openings means the snake was more quickly habituated. They repeated this experiment on each snake on four successive days, which is treated as an influential variable here. Place and Abramson (2008) used 10 snakes, but some of them never became habituated; to simplify this example, data from the six snakes that did become habituated on each day are used.

First, we read in the data, making sure to convert the column named day to a factor. Why? Because ANOVAs work with factor independent variables, while day as it is encoded by default is in fact a continuous variable.

snakes <- read_csv("../../data/snakes.csv")
snakes$day = as.factor(snakes$day)

The first thing we do is to create some summaries of the data. Refer to the summary statistics Chapter.

snakes.summary <- snakes %>% 
  group_by(day, snake) %>% 
  summarise(mean_openings = mean(openings),
            sd_openings = sd(openings)) %>% 
  ungroup()
snakes.summary

# A tibble: 24 × 4
   day   snake mean_openings sd_openings
   <fct> <chr>         <dbl>       <dbl>
 1 1     D1               85          NA
 2 1     D11              40          NA
 3 1     D12              65          NA
 4 1     D3              107          NA
 5 1     D5               61          NA
 6 1     D8               22          NA
 7 2     D1               58          NA
 8 2     D11              45          NA
 9 2     D12              27          NA
10 2     D3               51          NA
# ℹ 14 more rows

Task G.9: Do it now!

Something seems… off. What’s going on here? Please explain this outcome.

To fix this problem, let us ignore the grouping by both snake and day.

snakes.summary <- snakes %>% 
  group_by(day) %>% 
  summarise(mean_openings = mean(openings),
            sd_openings = sd(openings)) %>% 
  ungroup()
snakes.summary

# A tibble: 4 × 3
  day   mean_openings sd_openings
  <fct>         <dbl>       <dbl>
1 1              63.3        30.5
2 2              47          12.2
3 3              34.5        26.0
4 4              25.3        18.1

Rmisc::summarySE() offers a convenience function if your feeling less frisky about calculating the summary statistics yourself:

library(Rmisc)
snakes.summary2 <- summarySE(data = snakes,
                             measurevar = "openings",
                             groupvars = c("day"))
snakes.summary2

  day N openings       sd        se       ci
1   1 6 63.33333 30.45434 12.432931 31.95987
2   2 6 47.00000 12.21475  4.986649 12.81859
3   3 6 34.50000 25.95958 10.597956 27.24291
4   4 6 25.33333 18.08498  7.383164 18.97903

Now we turn to some visual data summaries (Figure 5).

ggplot(data = snakes, aes(x = day, y = openings)) +
  geom_segment(data = snakes.summary2, aes(x = day, xend = day,
                                           y = openings - ci,
                                           yend = openings + ci,
                                           colour = day),
              size = 2.0, linetype = "solid", show.legend = FALSE) +
  geom_boxplot(aes(fill = day), alpha = 0.3, show.legend = FALSE) + 
  geom_jitter(width = 0.05) +
  theme_pubclean()

Figure 5: Boxplots showing the change in the snakes’ habituation to box opening over time.

What are our null hypotheses?

H₀ There is no difference between snakes with respect to the number of openings at which they habituate.
H₀ There is no difference between days in terms of the number of openings at which the snakes habituate.

Fit the ANOVA model to test these hypotheses:

snakes.aov <- aov(openings ~ day + snake, data = snakes)
summary(snakes.aov)

            Df Sum Sq Mean Sq F value Pr(>F)  
day          3   4878  1625.9   3.320 0.0487 *
snake        5   3042   608.4   1.242 0.3382  
Residuals   15   7346   489.7                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we need to test of the assumptions hold true (i.e. erros are normally distributed and heteroscedastic) (Figure 6). Also, where are the differences (Figure 7)?

par(mfrow = c(1, 2))
# Checking assumptions...
# make a histogram of the residuals;
# they must be normal
snakes.res <- residuals(snakes.aov)
hist(snakes.res, col = "red")

# make a plot of residuals and the fitted values;
# # they must be normal and homoscedastic
plot(fitted(snakes.aov), residuals(snakes.aov), col = "red")

Figure 6: Exploring the assumptions visually.

snakes.tukey <- TukeyHSD(snakes.aov, which = "day", conf.level = 0.90)
plot(snakes.tukey, las = 1, col = "red")

Figure 7: Exploring the differences between days.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2021,
  author = {Smit, A. J.,},
  title = {8. {ANOVA}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB744/basic_stats/08-anova.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2021) 8. ANOVA. http://tangledbank.netlify.app/BCB744/basic_stats/08-anova.html.

--- date: "2021-01-01" title: "8. ANOVA" subtitle: "Comparing the means of more than two groups" reference-location: margin number-sections: true --- ![](../../images/fancy_stats.jpg){width=50%} > *"He uses statistics as a drunken man uses lamp posts---for support rather than for illumination."* > > --- Marissa Mayer ```{r, echo=FALSE} knitr::opts_chunk$set( echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, fig.width = 6, fig.asp = 0.65, out.width = "100%", fig.align = "center" ) library(tidyverse) library(ggpubr) library(ggthemes) library(kableExtra) ``` ::: {.callout-note appearance="simple"} ## In this Chapter - Single factor ANOVA - Multiple factor ANOVA - Tukey HSD test - Wilcoxon rank sum test - Kruskal-Wallis rank sum test ::: ::: {.callout-note appearance="simple"} ## Cheatsheet Find here a [Cheatsheet](../../docs/Methods_cheatsheet_v1.pdf) on statistical methods. ::: ::: {.callout-important appearance="simple"} ## Tasks to complete in this Chapter - Task G ::: # At a glance Where the Student's and Welch's *t*-tests are used to test for the difference between the means of two groups, analysis of variance (ANOVA)[^1] is used to *compare the means of three or more groups*. The following expectations about your data should be met to conduct an ANOVA: [^1]: Variations of the basic ANOVA include an analysis of covariance (ANCOVA) when, in addition to the categorical variable, your data also contains a **continuous covariate** (e.g., age, mass) that you want to control. Or, if your study involves multiple dependent variables, you may need to consider using **multivariate analysis of variance (MANOVA)** instead. ANOVAs that have multiple independent variables are called **factorial ANOVAs** (e.g. 2-way ANOVA or 3-way ANOVA). - **Continuous dependent variable** The dependent variable should be measured on a *continuous scale* (e.g., height, weight, test scores). - **Categorical independent variable** The independent variable should be *categorical with at least three levels* or groups (e.g., different treatments, age groups). - **Independent groups** The groups being compared should be independent of each other, meaning that the observations within each group should not affect the other group's observations. In the case of repeated measures ANOVA, the groups are related (e.g., before-and-after measurements or multiple measurements on the same subjects). Other assumptions to be aware of with regards to the dependent variable: - Independence of data - Normally distributed data - Homogeneity of variances - We also encourage that the data are balanced # Introduction to ANOVA Whole big books have been written about analysis of variance (ANOVA). Although there are many experimental designs that may be analysed with ANOVAs, biologists are taught to pay special attention to the design of experiments, and generally make sure that the experiments are fully *factorial* (in the case of two-way or higher ANOVAs) and balanced. For this reason we will focus in this introductory statistics course on one-way and factorial ANOVAs only. ::: {.callout-note appearance="simple"} ## Factorial designs A factorial experimental design involves studying the effects of two or more factors (also known as independent variables) on a response variable (dependent variable) simultaneously. The term 'factorial' comes from the fact that each level of one factor is combined with each level of the other factor(s), resulting in all possible combinations of factor levels. In a factorial design, the *main effects* of each factor and the *interaction effects* between factors can be analysed. When we have two factors, this is typically done by applying a 2-way ANOVA, but higher-order ANOVAs are also available. The main effect of a factor is its individual effect on the response variable (ignoring the effect due to the other(s)), while the interaction effect occurs when the effect of one factor depends on the level of another factor. The simplest factorial design is the 2 $\times$ 2 design, which involves two factors, each with two levels. For example, if you are studying the effect of temperature (high vs. low) and fertiliser concentration (high vs. low) on the yield of a crop, a 2 $\times$ 2 factorial design would involve four experimental conditions: * high temperature and high fertiliser concentration * high temperature and low fertiliser concentration * low temperature and high fertiliser concentration * low temperature and low fertiliser concentration In the above example there are 4 possible combinations. Factorial designs can involve more than two factors and/or more than two levels per factor, leading to more complex experimental setups. For example, in a 3 $\times$ 4 there will be 12 combinations of the two factor levels, and in a 2 $\times$ 3 $\times$ 5 factorial experiment the three factors with their respective levels will result in 30 combinations. ::: As we have seen in [Chapter 7](07-t_tests.qmd) about *t*-tests, ANOVAs also require that some assumptions are met: - **Normal distribution** The data in each group should follow a normal distribution or be approximately normally distributed. If the assumption of normality is not met, as, for example, determined with the `shapiro.test()`, you may consider using a non-parametric alternative such as Kruskal-Wallis rank sum test, `kruskal.test()`. This assumption can be relaxed for large sample sizes due to the [Central Limit Theorem](04-distributions.qmd#sec-normal). - **Homoscedasticity** The variances of the groups should be approximately equal. This assumption can be tested using Levene's test, `car::leveneTest()`, or Bartlett's test, `bartlett.test()`. When the variances are different but the data are normally distributed, consider the 'Welch ANOVA,' `oneway.test()`, instead of `aov()`. - **Random sampling** The data should be obtained through random sampling or random assignment, ensuring that each observation has an equal chance of being included in the sample. - **Independent observations** The observations within each group should be independent of each other. If some of the above assumptions are violated, then your course of action is to either use a [non-parametric test](@sec-alt) (and [here](../../docs/Methods_cheatsheet_v1.pdf)), transform the data (as in [Chapter 6](06-assumptions.qmd)) or use a generalised linear model if non-normal, or to use a linear mixed model when non-independence of data cannot be guaranteed. As I have already indicated, ANOVAs are also sensitive to the presence of outliers, so we need to ensure that outliers are not present. Outliers can be removed but if they are an important feature of the data, then a non-parametric test can be used. Rather than talking about *t*-tests and ANOVAs as if they do different things, let us acknowledge that they ask a similar question. That question being, "are the means of these two or more things we want to compare different, or are they the same?" At this stage it is important to note that, as with *t*-tests, the independent variable is expected to be categorical (i.e. a factor denoting two or more different treatments or sampling conditions) and that the dependent variable must be continuous. You may perhaps be more familiar with this question when it is presented as a set of hypotheses as we saw in a *t*-test: - ***H*~0~** Group A is not different from Group B. - ***H*~a~** Group A is different from Group B. In an ANOVA, hypotheses willl be more similar to these: - ***H*~0~** There are no differences among Groups A, B, or C. - ***H*~a~** There are differences among Groups A, B, or C. This is a scientific question in the simplest sense. Often, for basic inquiries such as that posed above, we need to see if one group differs significantly from another. The way in which we accomplish this is by looking at the mean and variance within a set of data compared against another similar set. ## Remember the *t*-test As you already know, a *t*-test is used when we want to compare two different sample sets against one another. This is also known as a two-factor or two level test. When one wants to compare multiple (more than two) sample sets against one another an ANOVA is required (I'll get there shortly). Remember how to perform a *t*-test in R: we will revisit this test using the `chicks` data, but only for Diets 1 and 2 from day 21. ```{r} # First grab the data chicks <- as_tibble(ChickWeight) # Then subset out only the sample sets to be compared chicks_sub <- chicks %>% filter(Diet %in% c(1, 2), Time == 21) ``` Once we have filtered our data we may now perform the *t*-test. ```{r} t.test(weight ~ Diet, data = chicks_sub) ``` As one may recall from [Chapter 7](07-t_tests.qmd), whenever we want to give a formula to a function in R, we use the `~`. The formula used above, `weight ~ Diet`, reads in plain English as "weight as a function of diet". This is perhaps easier to understand as "*Y* as a function of *X*." This means that we are assuming whatever is to the left of the `~` is the dependant variable, and whatever is to the right is the independent variable. Did the Diet 1 and 2 produce significantly fatter birds? One could also supplement the output by producing a graph (@fig-boxwhisker1). ```{r} #| label: fig-boxwhisker1 #| fig-cap: "Box-and-whisker plot showing the differences in means between chicks reared to 21 days old and fed Diets 1 and 2" library(ggstatsplot) ## since the confidence intervals for the effect sizes are computed using ## bootstrapping, important to set a seed for reproducibility set.seed(13) ## parametric t-test and box plot ggbetweenstats( data = chicks_sub, x = Diet, y = weight, xlab = "Diet", ylab = "Chick mass (g)", plot.type = "box", p.adjust.method = "bonferroni", pairwise.display = "ns", type = "p", results.subtitle = FALSE, conf.level = 0.95, title = "t-test", ggtheme = ggthemes::theme_fivethirtyeight(), package = "basetheme", palette = "ink" ) ``` Notice above that we did not need to specify to use a *t*-test. The `ggbetweenstats()` function automatically determines if an independent samples *t*-test or a 1-way ANOVA is required based on whether there are two groups or three or more groups within the grouping (factor) variable. That was a nice revision. But applied to the `chicks` data it seemed a bit silly, because you may ask, "What if I wanted to know if there are differences among the means computed at Day 1, Day 6, Day 10, and Day 21?" We should not use *t*-tests to do this (although we can). So now we can move on to the ANOVA. ::: callout-important ## Task G.1: Do it now! - Why should we not just apply *t*-tests once per each of the pairs of comparisons we want to make? ::: ## Why not do multiple *t*-tests? In the `chicks` data we have four diets, not only two as in the *t*-test example just performed. Why not then simply do a *t*-test multiple times, once for each pair of diets given to the chickens? Multiple *t*-tests would be written as: - $H_{0}: \mu_1 = \mu_2$ - $H_{0}: \mu_1 = \mu_3$ - $H_{0}: \mu_1 = \mu_4$ - $H_{0}: \mu_2 = \mu_3$ - $H_{0}: \mu_2 = \mu_4$ - $H_{0}: \mu_3 = \mu_4$ This would be invalid. The problem is that the chance of committing a Type I error increases as more multiple comparisons are done. So, the overall chance of rejecting the *H*~0~ increases. Why? If one sets $\alpha=0.05$ (the significance level below which the *H*~0~ is no longer accepted), one will still reject the *H*~0~ 5% of the time when it is in fact true (i.e. when there is no difference between the groups). When many pairwise comparisons are made, the probability of rejecting the *H*~0~ at least once is higher because we take this 5% risk each time we repeat a *t*-test. In the case of the chicken diets, we would have to perform six *t*-tests, and the error rate would increase to slightly less than $6\times5\%$. See Table 1. ```{r} #| echo: false # Function to calculate the probability of Type I error type_one_error_probability <- function(k, alpha) { pairwise_comparisons <- k * (k - 1) / 2 return(1 - (1 - alpha)^pairwise_comparisons) } # Set the values of k and significance levels k_values <- c(2, 3, 4, 5, 10, 20, 100) alpha_values <- c(0.20, 0.10, 0.05, 0.02, 0.01, 0.001) # Create an empty data frame to store the results results <- data.frame(K = integer(), Alpha = numeric(), Type1ErrorProbability = numeric()) # Calculate the probabilities and store them in the data frame for (k in k_values) { for (alpha in alpha_values) { type1_error_prob <- type_one_error_probability(k, alpha) results <- rbind( results, data.frame(K = k, Alpha = alpha, Type1ErrorProbability = type1_error_prob)) } } results_long <- results |> pivot_wider(names_from = Alpha, values_from = Type1ErrorProbability) ``` ::: {#kable-table-div} ```{r} #| tbl-cap: "**Table 1.** Probability of committing a Type I error due to applying multiple *t*-tests to test for differences between *K* means. *α* from `0.2` to `0.0001` are shown." #| echo: false kbl(results_long, digits = 2) |> kable_classic(bootstrap_options = "striped", full_width = FALSE) ``` ::: If you insist in creating more work for yourself and do *t*-tests many times, one way to overcome the problem of committing Type I errors that stem from multiple comparisons is to apply a Bonferroni correction. ::: {.callout-note appearance="simple"} ## Bonferonni correction The Bonferroni correction is used to adjust the significance level of multiple hypothesis tests, such as multiple paired *t*-tests among many groups, in order to reduce the risk of false positives or Type I errors. It is named after the Italian mathematician Carlo Emilio Bonferroni. The Bonferroni correction is based on the principle that when multiple hypothesis tests are performed, the probability of observing at least one significant result due to random chance increases. To correct for this, the significance level (usually 0.05) is divided by the number of tests being performed. This results in a more stringent significance level for each individual test and it so reduces the risk of committing a Type I error. For example, if we conduct ten hypothesis tests, the significance level for each test after Bonferonni correction would become 0.05/10 = 0.005. The implication is that each individual test would need to have a *p*-value less than 0.005 to be considered significant at the overall significance level of 0.05. On the downside, this method can be overly conservative and we may then increase the risk of Type II errors, which are false negatives. If you really cannot avoid multiple tests, then also assess one of the alternatives to Bonferonni's method, viz: the false discovery rate (FDR) correction, the Holm-Bonferroni correction, Benjamini-Hochberg's procedure, the Sidak correction, or some of the Bayesian approaches. ::: Or better still, we do an ANOVA that controls for these Type I errors so that it remains at 5%. # ANOVA ::: {.callout-note} ## Hypothesis for an ANOVA If we have four groups whose means we want to compare, a suitable *H*~0~ would be: $H_{0}: \mu_1 = \mu_2 = \mu_3 = \mu_4$ and $H_{a}$: the means of the four groups are not equal. Here, $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ are four population means. For the *H*~0~ to be rejected, all that is required is for one of the pairs of means to be different, not all of them. ::: ## Single factor We continue with the chicken data. The *t*-test showed that Diets 1 and 2 resulted in the same chicken mass at Day 21. What about the other two diets? Our *H*~0~ is that, at Day 21, $\mu_{1}=\mu_{2}=\mu_{3}=\mu_{4}$. Is there a statistical difference between chickens fed these four diets, or do we retain the *H*~0~? The R function for an ANOVA is `aov()`. To look for significant differences between all four diets on the last day of sampling we use this one line of code: ```{r} chicks.aov1 <- aov(weight ~ Diet, data = filter(chicks, Time == 21)) summary(chicks.aov1) ``` ::: callout-important ## Task G.2: Do it now! a. What does the outcome say about the chicken masses? Which ones are different from each other? b. Devise a graphical display of this outcome. ::: If this seems too easy to be true, it's because we aren't quite done yet. You could use your graphical display to eyeball where the significant differences are, or we can turn to a more 'precise' approach. The next step one could take is to run a Tukey HSD test on the results of the ANOVA by wrapping `tukeyHSD()` around `aov()`: ```{r} TukeyHSD(chicks.aov1) ``` The output of `tukeyHSD()` shows us that pairwise comparisons of all of the groups we are comparing. We can also display this as a very rough figure (@fig-tukeydiff): ```{r} #| label: fig-tukeydiff #| fig-cap: "A plot of the Tukey-HSD test showing the differences in means between chicks reared to 21 days old and fed four diets." plot(TukeyHSD(chicks.aov1)) ``` We may also produce a nicer looking graphical summary in the form of a box-and-whisker plot and/or a violin plot. Here I combine both (@fig-boxwhisker2): ```{r} #| label: fig-boxwhisker2 #| fig-cap: "Box-and-whisker plot showing the differences in means between chicks reared to 21 days old and fed four diets. Shown is a notched box plot where the extent of the notches is `1.58 * IQR / sqrt(n)`. This is approximately equivalent to a 95% confidence interval andf may be used for comparing medians." set.seed(666) ## parametric t-test and box plot ggbetweenstats( data = filter(chicks, Time == 21), x = Diet, y = weight, xlab = "Diet", ylab = "Chick mass (g)", plot.type = "box", boxplot.args = list(notch = TRUE), type = "parametric", results.subtitle = FALSE, pairwise.comparisons = TRUE, pairwise.display = "s", p.adjust.method = "bonferroni", conf.level = 0.95, title = "ANOVA", ggtheme = ggthemes::theme_fivethirtyeight(), package = "basetheme", palette = "ink" ) ``` ::: callout-important ## Task G.3: Do it now! Look at the help file for the `TukeyHSD()` function to better understand what the output means. a. How does one interpret the results? What does this tell us about the effect that that different diets has on the chicken weights at Day 21? b. Figure out a way to plot the Tukey HSD outcomes in **ggplot**. c. Why does the ANOVA return a significant result, but the Tukey test shows that not all of the groups are significantly different from one another? :::    ## Multiple factors What if we have multiple grouping variables, and not just one? We would encounter this kind of situation in factorial designs. In the case of the chicken data, there is also time that seems to be having an effect. ::: callout-important ## Task G.4: Do it now! a. How is time having an effect? **(/3)** b. What hypotheses can we construct around time? **(/2)** ::: Let us look at some variations around questions concerning time. We might ask, at a particular time step, are there differences amongst the effect due to diet on chicken mass? Let's see when diets are starting to have an effect by examining the outcomes at times 0, 2, 10, and 21: ```{r} # effect at time = 0 summary(aov(weight ~ Diet, data = filter(chicks, Time == 0))) # effect at time = 2 summary(aov(weight ~ Diet, data = filter(chicks, Time == 2))) # effect at time = 10 summary(aov(weight ~ Diet, data = filter(chicks, Time == 10))) # effect at time = 21 summary(aov(weight ~ Diet, data = filter(chicks, Time == 21))) ``` ::: callout-important ## Task G.5: Do it now! a. What do you conclude from the above series of ANOVAs? **(/3)** b. What problem is associated with running multiple tests in the way that we have done here? **(/2)** ::: Or we may ask, regardless of diet (i.e. disregarding the effect of diet by clumping all chickens together), is time having an effect? ```{r} chicks.aov2 <- aov(weight ~ as.factor(Time), data = filter(chicks, Time %in% c(0, 2, 10, 21))) summary(chicks.aov2) ``` ::: callout-important ## Task G.6: Do it now! a. Write out the hypotheses for this ANOVA. **(/2)** b. What do you conclude from the above ANOVA? **(/3)** ::: Or, to save ourselves a lot of time and reduce the coding effort, we may simply run a two-way ANOVA and look at the effects of `Diet` and `Time` simultaneously. To specify the different factors we put them in our formula and separate them with a `+`: ```{r} summary(aov(weight ~ Diet + as.factor(Time), data = filter(chicks, Time %in% c(0, 21)))) ``` ::: callout-important ## Task G.7: Do it now! a. What question are we asking with the above line of code? **(/3)** b. What is the answer? **(/2)** c. Why did we wrap `Time` in `as.factor()`? **(/2)** ::: It is also possible to look at what the interaction effect between grouping variables (i.e. in this case the effect of time on diet---does the effect of time depend on which diet we are looking at?), and not just within the individual grouping variables. To do this we replace the `+` in our formula with `*`: ```{r} summary(aov(weight ~ Diet * as.factor(Time), data = filter(chicks, Time %in% c(4, 21)))) ``` ::: callout-important ## Task G.8: Do it now! How do these results differ from the previous set? **(/3)** ::: One may also run a post-hoc Tukey test on these results the same as for a single factor ANOVA: ```{r} TukeyHSD(aov(weight ~ Diet * as.factor(Time), data = filter(chicks, Time %in% c(20, 21)))) ``` ::: callout-important ## Task G.9: Do it now! Yikes! That's a massive amount of results. What does all of this mean, and why is it so verbose? **(/5)** :::   ::: {.callout-note appearance="simple"} ## Summary To summarise *t*-tests, single-factor (1-way) and multifactor (2- or 3-way, etc.) ANOVAs: 1. A *t*-test is applied to situations where one wants to compare the means of only **two** groups of a response variable within **one categorical independent variable** (we say a factor with two levels). 2. A 1-way ANOVA also looks at the means of a response variable belonging to **one categorical independent variable**, but the categorical response variable has **more than two** levels in it. 3. Following on from there, a 2-way ANOVA compares the means of response variables belonging to all the levels within **two categorical independent variables** (e.g. Factor 1 might have three levels, and Factor 2 five levels). In the simplest formulaton, it does so by looking at the **main effects**, which is the group differences between the three levels of Factor 1 and disregarding the contribution due to the group membership to Factor 2, and also the group differences amongst the levels of Factor 2 but disregarding the group membership of Factor 1. In addition to looking at the main effects, a 2-way ANOVA can also consider the **interaction** (or combined effect) of Factors 1 and 2 in influencing the means. ::: # Alternatives to ANOVA {#sec-alt} In the first main section of this chapter we learned how to test hypotheses based on the comparisons of means between sets of data when we were able to meet our two base assumptions. These parametric tests are preferred over non-parametric tests because they are more robust. However, when we simply aren't able to meet these assumptions we must not despair. Non-parametric tests are still useful. In this chapter we will learn how to run non-parametric tests for two sample and multiple sample datasets. To start, let's load our libraries and `chicks` data if we have not already. ```{r} # First activate libraries library(tidyverse) library(ggpubr) # Then load data chicks <- as_tibble(ChickWeight) ``` With our libraries and data loaded, let's find a day in which at least one of our assumptions are violated. ```{r} # Then check for failing assumptions chicks %>% filter(Time == 0) %>% group_by(Diet) %>% summarise(norm_wt = as.numeric(shapiro.test(weight)[2]), var_wt = var(weight)) ``` ## Wilcoxon rank sum test The non-parametric version of a *t*-test is a Wilcox rank sum test. To perform this test in R we may again use `compare_means()` and specify the test we want: ```{r} compare_means(weight ~ Diet, data = filter(chicks, Time == 0, Diet %in% c(1, 2)), method = "wilcox.test") ``` What do our results show? ## Kruskall-Wallis rank sum test ### Single factor The non-parametric version of an ANOVA is a Kruskall-Wallis rank sum test. As you may have by now surmised, this may be done with `compare_means()` as seen below: ```{r} compare_means(weight ~ Diet, data = filter(chicks, Time == 0), method = "kruskal.test") ``` As with the ANOVA, this first step with the Kruskall-Wallis test is not the last. We must again run a post-hoc test on our results. This time we will need to use `pgirmess::kruskalmc()`, which means we will need to load a new library. ```{r} library(pgirmess) kruskalmc(weight ~ Diet, data = filter(chicks, Time == 0)) ``` Let's consult the help file for `kruskalmc()` to understand what this print-out means. ### Multiple factors The water becomes murky quickly when one wants to perform multiple factor non-parametric comparison of means tests. To that end, we will not cover the few existing methods here. Rather, one should avoid the necessity for these types of tests when designing an experiment. ## The SA time data ```{r anova-plot7} #| fig-cap: "Time is not a limited resource in South Africa." #| label: fig-anova-plot6 sa_time <- as_tibble(read_csv("../../data/snakes.csv", col_types = list(col_double(), col_double(), col_double()))) sa_time_long <- sa_time %>% gather(key = "term", value = "minutes") %>% filter(minutes < 300) %>% mutate(term = as.factor(term)) my_comparisons <- list( c("now", "now_now"), c("now_now", "just_now"), c("now", "just_now") ) ggboxplot(sa_time_long, x = "term", y = "minutes", color = "term", palette = c("#00AFBB", "#E7B800", "#FC4E07"), add = "jitter", shape = "term") ``` # Examples ## Snakes! These data could be analysed by a two-way ANOVA without replication, or a repeated measures ANOVA. Here I will analyse it by using a two-way ANOVA without replication. Place and Abramson (2008) placed diamondback rattlesnakes (*Crotalus atrox*) in a 'rattlebox,' a box with a lid that would slide open and shut every 5 minutes. At first, the snake would rattle its tail each time the box opened. After a while, the snake would become habituated to the box opening and stop rattling its tail. They counted the number of box openings until a snake stopped rattling; fewer box openings means the snake was more quickly habituated. They repeated this experiment on each snake on four successive days, which is treated as an influential variable here. Place and Abramson (2008) used 10 snakes, but some of them never became habituated; to simplify this example, data from the six snakes that did become habituated on each day are used. First, we read in the data, making sure to convert the column named `day` to a factor. Why? Because ANOVAs work with factor independent variables, while `day` as it is encoded by default is in fact a continuous variable. ```{r} snakes <- read_csv("../../data/snakes.csv") snakes$day = as.factor(snakes$day) ``` The first thing we do is to create some summaries of the data. Refer to the summary statistics Chapter. ```{r} snakes.summary <- snakes %>% group_by(day, snake) %>% summarise(mean_openings = mean(openings), sd_openings = sd(openings)) %>% ungroup() snakes.summary ``` ::: callout-important ## Task G.9: Do it now! - Something seems... off. What's going on here? Please explain this outcome. ::: To fix this problem, let us ignore the grouping by both `snake` and `day`. ```{r} snakes.summary <- snakes %>% group_by(day) %>% summarise(mean_openings = mean(openings), sd_openings = sd(openings)) %>% ungroup() snakes.summary ``` `Rmisc::summarySE()` offers a convenience function if your feeling less frisky about calculating the summary statistics yourself: ```{r} library(Rmisc) snakes.summary2 <- summarySE(data = snakes, measurevar = "openings", groupvars = c("day")) snakes.summary2 ``` Now we turn to some visual data summaries (@fig-anova-plot). ```{r} #| fig-cap: "Boxplots showing the change in the snakes' habituation to box opening over time." #| label: fig-anova-plot ggplot(data = snakes, aes(x = day, y = openings)) + geom_segment(data = snakes.summary2, aes(x = day, xend = day, y = openings - ci, yend = openings + ci, colour = day), size = 2.0, linetype = "solid", show.legend = FALSE) + geom_boxplot(aes(fill = day), alpha = 0.3, show.legend = FALSE) + geom_jitter(width = 0.05) + theme_pubclean() ``` What are our null hypotheses? 1. ***H*~0~** There is no difference between snakes with respect to the number of openings at which they habituate. 2. ***H*~0~** There is no difference between days in terms of the number of openings at which the snakes habituate. Fit the ANOVA model to test these hypotheses: ```{r} snakes.aov <- aov(openings ~ day + snake, data = snakes) summary(snakes.aov) ``` Now we need to test of the assumptions hold true (i.e. erros are normally distributed and heteroscedastic) (@fig-anova-plot5). Also, where are the differences (@fig-tukey)? ```{r} #| fig-cap: "Exploring the assumptions visually." #| label: fig-anova-plot5 par(mfrow = c(1, 2)) # Checking assumptions... # make a histogram of the residuals; # they must be normal snakes.res <- residuals(snakes.aov) hist(snakes.res, col = "red") # make a plot of residuals and the fitted values; # # they must be normal and homoscedastic plot(fitted(snakes.aov), residuals(snakes.aov), col = "red") ``` ```{r} #| fig-cap: "Exploring the differences between days." #| label: fig-tukey snakes.tukey <- TukeyHSD(snakes.aov, which = "day", conf.level = 0.90) plot(snakes.tukey, las = 1, col = "red") ```