Basic Tests for Three or More Samples

Tags
Statistics
Visualization
R
Wilcoxon
Asymptotic
Date
May 4, 2019
Description
Nonparametric analogues of the one-way classification ANOVA and the simplest two-way classifications, namely the Kruskal-Wallis test, the Jonckheere-Terpstra test, and the Friedman test.
Slug
nonparametric-methods-three-or-more-samples
The methods discussed below are nonparametric analogues of analysis of variance (ANOVA). When the ANOVA assumptions (normality and homogeneity of variance of the residuals) cannot be met, a nonparametric test is appropriate.

Notations

Suppose we have $k$ independent random samples (independence within and across samples). The $i$-th sample has $n_i$ observations, $X_{i1}, \ldots, X_{in_i}$. The general question is equality of means / medians, or identical distributions.
  • $X_{ij}$ - the $j$-th observation from sample $i$, where $j = 1, \ldots, n_i$ and $i = 1, \ldots, k$.
  • $N = \sum_{i=1}^{k} n_i$ - total number of observations across all samples.
  • $R_{ij}$ - rank of $X_{ij}$ in the combined sample (ranges from 1 to $N$).
  • $R_i = \sum_{j=1}^{n_i} R_{ij}$ - sum of ranks of observations in the $i$-th sample (group).

The Kruskal-Wallis test

The Kruskal-Wallis (K-W) test extends the Wilcoxon-Mann-Whitney (WMW) test via the Wilcoxon rank-sum formulation. It can be used as an overall test for equality of means / medians at the population level, when the samples come from otherwise identical continuous distributions (i.e. location shifts).
It can also be used as a test for equality of distribution against an alternative that at least one of the populations tends to yield larger observations than at least one of the other populations.
Under $H_0$ (equality of means / medians), all of the $R_i$ should be “roughly similar”, because each sample will contain small, medium and large ranks. Under $H_1$ (not all the same), some $R_i$ should be “larger” than others and some “smaller”: we would see clusters of small / large ranks in some of the samples. The test statistic

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$

will be inflated if there is a cluster of large ranks in one or two of the samples. A large $H$ is evidence against $H_0$, as shown in the figure below. If $\min_i n_i$ is moderate or large, then under $H_0$, $H \overset{\cdot}{\sim} \chi^2_{k-1}$.
```r
library(tidyverse)

# chi-squared density with 3 degrees of freedom
df <- tibble(
  x = seq(0, 20, 0.01),
  y = dchisq(x, 3)
)

# shaded rejection region: everything beyond the 0.95 quantile
rejection.region <- df %>%
  filter(x >= qchisq(0.95, 3)) %>%
  bind_rows(tibble(
    x = qchisq(0.95, 3),
    y = 0
  ))

ggplot(rejection.region, aes(x, y)) +
  geom_polygon(show.legend = FALSE) +
  geom_line(data = df, aes(x, y)) +
  annotate("segment", x = 9, y = 0.01, xend = 12, yend = 0.03,
           arrow = arrow(length = unit(0.3, "cm"))) +
  annotate("text", label = "p-value=0.05", x = 14.3, y = 0.033, size = 4) +
  xlab(expression(x)) +
  ylab(expression(f(x))) +
  ggtitle(expression(chi[3]^2)) +
  theme_minimal()
```
R code inspired by this post. See ?plotmath for symbol details.
In this example, we reject $H_0$ if $H$ is larger than the 0.95 quantile of $\chi^2_3$.
In R, the function kruskal.test() can be used in two ways (both are sketched below):
  1. Input a data vector x and a vector of group indicators g (x and g should both have length N), and call kruskal.test(x, g). This form emphasizes the Wilcoxon nature of the test.
  2. Use the same inputs, but with a formula: kruskal.test(x ~ g). This formulation emphasizes the connection to ANOVA.
A very useful R package, coin, includes more functionality for this test.
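For concreteness, here is a small sketch of both calling styles on made-up data (the values and group sizes below are purely illustrative):

```r
# Illustrative data: three groups of slightly different sizes
x <- c(2.1, 3.4, 1.8, 4.0,       # group 1
       5.2, 4.8, 6.1,            # group 2
       3.9, 7.3, 6.8, 5.5)       # group 3
g <- factor(rep(1:3, times = c(4, 3, 4)))

# Style 1: data vector plus group indicators
kruskal.test(x, g)

# Style 2: formula interface, closer to how an ANOVA would be written
kruskal.test(x ~ g)
```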

Multiple comparisons

If $H_0$ is rejected, i.e. there is a difference among the groups, we may be interested in exploring which groups differ. This is the multiple comparisons problem. There are many ways to address it; we'll look at one of them - consider all pairwise comparisons. For each pair of samples $i$ and $j$, declare them different if

$$\left| \frac{R_i}{n_i} - \frac{R_j}{n_j} \right| > t_{1-\alpha/2,\,N-k} \sqrt{S^2 \, \frac{N-1-H}{N-k} \left( \frac{1}{n_i} + \frac{1}{n_j} \right)},$$

where the LHS is the difference in average ranks for observations in samples $i$ and $j$ ($\bar{R}_i = R_i / n_i$), and the RHS is a measure of variability in that difference ($S^2$ is defined in the ties section below; without ties, $S^2 = N(N+1)/12$). The $t_{1-\alpha/2,\,N-k}$ is the upper $\alpha/2$ quantile from the $t$-distribution with $N - k$ degrees of freedom.
The problem now is: what is the “right” $\alpha$ to use here? A naïve approach is to test all pairwise comparisons post-hoc (after rejecting the null hypothesis of overall equality of means / medians) at the same level $\alpha$, say 0.05. The problem with this approach is that the overall Type I error probability is inflated above 0.05. The more comparisons there are, the more inflation there is, which makes our inference unreliable.
We need to make an adjustment, and there are also lots of ways to do this! One popular adjustment is the Bonferroni correction. The idea is to control the overall Type I error probability at level $\alpha$ by testing each of the $m$ post-hoc comparisons at level $\alpha / m$ instead of at level $\alpha$ (for all pairwise comparisons, $m = k(k-1)/2$). With a more stringent cutoff, we make it harder to reject a null hypothesis of “no difference”.
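As a quick illustration of the adjustment idea (using base R's pairwise two-sample Wilcoxon tests rather than the rank-comparison rule above, and reusing the illustrative x and g from the earlier sketch):

```r
# All pairwise two-sample Wilcoxon tests, Bonferroni-adjusted p-values
pairwise.wilcox.test(x, g, p.adjust.method = "bonferroni")

# Equivalently, adjust a vector of (made-up) raw p-values by hand
p.adjust(c(0.012, 0.034, 0.250), method = "bonferroni")
```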

Ties in the data

We use mid-ranks for the ties, as before. Define

$$S^2 = \frac{1}{N-1} \left( \sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij}^2 - \frac{N(N+1)^2}{4} \right)$$

and our test statistic

$$H' = \frac{1}{S^2} \left( \sum_{i=1}^{k} \frac{R_i^2}{n_i} - \frac{N(N+1)^2}{4} \right).$$

If there are no ties, this reduces to the previous statistic. $H'$ will be approximately $\chi^2_{k-1}$ for large samples.
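A tiny sketch of the mid-rank convention (rank() averages tied ranks by default) and of the fact that kruskal.test() corrects for ties automatically; the data are made up:

```r
# Tied values receive the average ("mid") rank
rank(c(3, 5, 5, 8))        # 1.0 2.5 2.5 4.0

# kruskal.test() applies the tie correction internally
x_tied <- c(3, 5, 5, 8, 2, 5, 9, 7, 4, 4)
g_tied <- factor(rep(1:2, each = 5))
kruskal.test(x_tied, g_tied)
```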

The Jonckheere-Terpstra test

The J-T test is a test against a directional (ordered) alternative. This is more specific than the K-W test, as now there is an a priori ordering, i.e. $H_1: \theta_1 \le \theta_2 \le \cdots \le \theta_k$ with at least one strict inequality, where the $\theta_i$ are the ordered means / medians. Against this kind of alternative the J-T test has more statistical power than the K-W test. The K-W test can still be used, but we get a more sensitive analysis with a test that is specific to this type of alternative.
Importantly, as with all cases of ordered alternatives, the order of the groups and the direction of the alternative need to be set prior to data collection based on theory, prior experience, etc. You can’t look at the data and then decide on order, or decide to do an ordered alternative. The J-T test is designed for this case, and extends the Mann-Whitney two-sample formulation. As usual, this is best explained in an example!
Example: braking distances taken by motorists to stop when driving at various speeds. The a priori expectation (formed before seeing any data) is that braking distance increases as driving speed increases. A subset of the data is given in the following table.
| Sample | Speed (mph) | Braking distances (feet) |
|--------|-------------|--------------------------|
| 1      | 20          | 48                       |
| 2      | 25          | 33, 48, 56, 59           |
| 3      | 30          | 60, 67, 101              |
| 4      | 35          | 85, 107                  |
The speeds in the data are ordered, and we expect braking distance to increase with them. The procedure: with $k$ samples (here $k = 4$), compute all pairwise Mann-Whitney statistics $U_{uv}$ for $u < v$, i.e. for each pair count the number of observations in sample $v$ that exceed each observation in sample $u$ (a tie counts as one half), and then add all of these up. In the example, $U_{12} = $ number of sample 2 values that exceed the sample 1 value $= 2.5$ (48 counts as a half since it is equal to the value in sample 1). Likewise: $U_{13} = 3$, $U_{14} = 2$, $U_{23} = 12$, $U_{24} = 8$, $U_{34} = 5$. In total, our test statistic is $JT = 2.5 + 3 + 2 + 12 + 8 + 5 = 32.5$.
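Here is a small sketch of this hand computation in R; the list below simply re-enters the table above:

```r
# Braking distances by speed group, in increasing order of speed
groups <- list(
  `20 mph` = c(48),
  `25 mph` = c(33, 48, 56, 59),
  `30 mph` = c(60, 67, 101),
  `35 mph` = c(85, 107)
)

# Sum of pairwise Mann-Whitney counts; ties count as one half
jt <- 0
k <- length(groups)
for (u in 1:(k - 1)) {
  for (v in (u + 1):k) {
    for (obs in groups[[u]]) {
      jt <- jt + sum(groups[[v]] > obs) + 0.5 * sum(groups[[v]] == obs)
    }
  }
}
jt  # 32.5
```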
We could potentially do an exact calculation by enumerating all possible configurations, or we could use simulation, randomly sampling from the possible configurations. The third option is an asymptotic (normal) approximation.
In R, the package clinfun has a function called jonckheere.test() that does the job for us. When the samples are small enough and there are no ties, it does an exact calculation; otherwise, a normal approximation is used.
In our case, since we have an ordered alternative, a one-sided test should be used. Also, since we have very small samples, can we trust a normal approximation?
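A minimal call on the braking-distance data, assuming the groups are coded 1-4 in increasing order of speed (the vectors below just re-enter the table):

```r
library(clinfun)

distance <- c(48,              # 20 mph
              33, 48, 56, 59,  # 25 mph
              60, 67, 101,     # 30 mph
              85, 107)         # 35 mph
speed    <- c(1, 2, 2, 2, 2, 3, 3, 3, 4, 4)

# One-sided test: distances expected to increase with speed
jonckheere.test(distance, speed, alternative = "increasing")
```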
If we had perfect separation (i.e. all observations in sample 1 < all observations in sample 2 < ⋯ without overlap), then the value of the test statistic would be 35 (the sum of $n_u n_v$ over all pairs). Our observed value of 32.5 is pretty close to that, so we have circumstantial evidence against $H_0$. But the question is: how many configurations are there in total, and how many of them have a value of $JT \ge 32.5$?
The exact calculation (which ignores ties) gives a p-value of 0.0011; the normal approximation (recall we have small samples) gives 0.0022. Both values are very small, so we have converging evidence against $H_0$. We may conclude that braking distance increases as driving speed increases.

The median test

Another easy generalization is the median test, now with $k$ independent samples instead of two. The null hypothesis is that all $k$ populations have the same median; under $H_0$ we estimate the common median $M$ by the median of the combined sample. We can then build a contingency table following the same procedure as in the two-independent-samples case, counting how many observations in each sample fall above and below $M$:
|             | sample 1 | sample 2 | ⋯ | sample $k$ | total |
|-------------|----------|----------|---|------------|-------|
| Above $M$   | $O_{11}$ | $O_{12}$ | ⋯ | $O_{1k}$   | A     |
| Below $M$   | $O_{21}$ | $O_{22}$ | ⋯ | $O_{2k}$   | B     |
Column margins $n_1, \ldots, n_k$ are fixed by design, and the row margins A and B are constrained by the use of the combined-sample median ($A + B = N$, with $A \approx B \approx N/2$). What doesn't translate from the $2 \times 2$ case is the notion of a “more extreme” table: what does that mean now? Another issue is that knowing one cell value is no longer enough to fill out the whole table, since we have more degrees of freedom here; we need $k - 1$ of them.
Because of those issues, we'll go straight to an approximation here. Our test statistic is

$$T = \sum_{i=1}^{k} \left[ \frac{(O_{1i} - E_{1i})^2}{E_{1i}} + \frac{(O_{2i} - E_{2i})^2}{E_{2i}} \right],$$

where $E_{1i} = n_i A / N$ is the expected number of cases above $M$ for group $i$ under $H_0$ (and $E_{2i} = n_i B / N$ the expected number below). Under $H_0$, $T$ is approximately $\chi^2_{k-1}$. This test statistic is essentially the $\chi^2$ test for independence, which we'll talk about more in later chapters.
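A minimal sketch of the whole procedure in base R on made-up data: classify each observation as above or below the combined-sample median, tabulate by group, and feed the table to chisq.test():

```r
set.seed(1)
# Made-up data: three groups with slightly shifted locations
x <- c(rnorm(10, 0), rnorm(12, 0.5), rnorm(11, 1))
g <- factor(rep(1:3, times = c(10, 12, 11)))

M   <- median(x)                                  # combined-sample median
tab <- table(ifelse(x > M, "above", "below"), g)  # 2 x k contingency table
tab

chisq.test(tab)  # chi-squared test for independence (approximate for small counts)
```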

The Friedman test

The Friedman test makes inference for several related samples. Its parametric analogue is the two-way classification ANOVA for a randomized block design.
According to Wikipedia, the Friedman test is similar to repeated-measures ANOVA.
Data: we have $n$ independent “blocks” (patients, plots of land, etc.) and $k$ “treatments” which are applied to (or measured on) each block. So the data form an $n \times k$ array, where as usual we put blocks along the rows and treatments along the columns. Blocks are independent of one another; measurements within a block are dependent.
Procedure: within each block, independently rank the observations from 1 (smallest) to $k$ (largest). The null hypothesis $H_0$: all treatments have the same effect; the alternative is “not $H_0$”. The test statistic is

$$Q = \frac{12}{n k (k+1)} \sum_{i=1}^{k} R_i^2 - 3 n (k+1),$$

where $R_i$ is the sum over all blocks of the ranks allocated to treatment $i$ (rank along the rows, sum down the columns). The sum of squared rank totals is the core of the statistic that encapsulates the procedure; the other constants are centering and scaling terms that give the statistic a distribution we can work with.
For $n$ that is “not too small”, $Q$ is approximately $\chi^2_{k-1}$ under $H_0$. The test looks at whether we fall in the far right tail of that distribution. Again, it's much easier to explain the test with an example.
Seven students' pulse rates (beats per minute) were measured (i) before exercise, (ii) immediately after exercise, and (iii) 5 minutes after exercise. Our $H_0$ is that there is no effect of exercise / rest on pulse rate; the alternative is “not $H_0$”. Results (repeated measures):
| Student | Before | Right after | 5 min. after |
|---------|--------|-------------|--------------|
| 1       | 72     | 120         | 76           |
| 2       | 96     | 120         | 95           |
| 3       | 88     | 132         | 104          |
| 4       | 92     | 120         | 96           |
| 5       | 74     | 101         | 84           |
| 6       | 76     | 96          | 72           |
| 7       | 82     | 112         | 76           |
Our procedure says rank along each row separately:
| Student | Before | Right after | 5 min. after |
|---------|--------|-------------|--------------|
| 1       | 1      | 3           | 2            |
| 2       | 2      | 3           | 1            |
| 3       | 1      | 3           | 2            |
| 4       | 1      | 3           | 2            |
| 5       | 1      | 3           | 2            |
| 6       | 2      | 3           | 1            |
| 7       | 2      | 3           | 1            |
then sum the ranks down each column, which gives $R_1 = 10$ (Before), $R_2 = 21$ (Right after), $R_3 = 11$ (5 min. after).
In our data, $Q = \frac{12}{7 \cdot 3 \cdot 4}\left(10^2 + 21^2 + 11^2\right) - 3 \cdot 7 \cdot 4 \approx 10.57$. Compare to $\chi^2_2$ at $\alpha = 0.05$, where the critical value is $\chi^2_{2,\,0.95} \approx 5.99$. Since $10.57 > 5.99$, we have strong evidence against $H_0$.
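We can check this in R with the built-in friedman.test(); the matrix below just re-enters the pulse data, with blocks (students) in rows and treatments in columns:

```r
pulse <- matrix(
  c(72, 120,  76,
    96, 120,  95,
    88, 132, 104,
    92, 120,  96,
    74, 101,  84,
    76,  96,  72,
    82, 112,  76),
  nrow = 7, byrow = TRUE,
  dimnames = list(paste("student", 1:7),
                  c("before", "right after", "5 min after"))
)

# Friedman test: blocks in rows, treatments in columns
friedman.test(pulse)
```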

So far we’ve focused on location and distribution inferences. What about association analyses?