Previously we’ve used the sign test to look at the median (a measure of location) survival time with a censored data point. The original observations were transformed into “successes” and “failures” and a **lot** of information was thrown out.

For location inference, a basic one-sample procedure is the **Wilcoxon signed-rank test**. It’s used to test whether a sample comes from a population with a specified mean or median.

### Wilcoxon signed-rank test

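Suppose observations $x_1, x_2, \ldots, x_n$ are a sample from a **symmetric continuous** distribution with unknown median $\theta$. We want to test $H_0: \theta = \theta_0$ against a one-sided or two-sided alternative, e.g.

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0.$$

Write $d_i = x_i - \theta_0$ for the deviation of each observation from the hypothesized median. The assumption of symmetry implies that under $H_0$:

- Each $d_i$ is equally likely to be positive or negative (i.e. each $x_i$ is equally likely to be above or below $\theta_0$ under $H_0$, which is the logic behind the sign test).

- A deviation of any given magnitude $|d_i|$ is equally likely to be positive or negative.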

A symmetric population distribution has a mean that coincides with the median. In those circumstances the test may also be formulated in terms of means. The test procedure is simple:

- Calculate the deviations $d_i = x_i - \theta_0$ of each observation from $\theta_0$, the median hypothesized under $H_0$.

- Order the magnitudes $|d_i|$ (i.e. the deviations in absolute value) from smallest (rank = 1) to largest (rank = $n$).

- Assign a $+$ sign to ranks corresponding to $d_i > 0$, and a $-$ sign to ranks corresponding to $d_i < 0$.
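The three steps above can be sketched in R. This is only a minimal illustration: the helper name `signed_ranks()` and the small data vector are made up for the example, and the sketch assumes there are no ties and no zero deviations.

```r
# Hypothetical helper: signed ranks of the deviations from theta0.
# Assumes no ties among the magnitudes and no deviation equal to zero.
signed_ranks <- function(x, theta0) {
  d <- x - theta0      # step 1: deviations from the hypothesized median
  r <- rank(abs(d))    # step 2: rank the magnitudes, smallest = 1
  sign(d) * r          # step 3: attach the sign of each deviation
}

# Illustrative data (not from the text) with hypothesized median 10
x <- c(12, 7, 15, 9, 20)
sr <- signed_ranks(x, theta0 = 10)
print(sr)  # 2 -3 4 -1 5
```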

From here, we can define a few different test statistics. We denote by $S_+$ the sum of positive ranks, and by $S_-$ the sum of ranks associated with negative deviations. These two are equivalent to each other because the total sum of ranks in a sample of size $n$ is fixed: $S_+ + S_- = n(n+1)/2$. We may use any of the statistics $S_+$, $S_-$, or a third statistic $S_d = |S_+ - S_-|$ as a test statistic. All three carry the same information about the plausibility of $H_0$. Under $H_0$, we’d expect $S_+$ and $S_-$ to be roughly equal, and likewise $S_d$ should be close to 0. Intermediate values of $S_+$ or $S_-$ are more supportive of $H_0$.

We can use the permutation test approach to get the **exact** distribution for any of the three test statistics. We look at all the possible allocations of signs to the ranks $1, 2, \ldots, n$. Let’s take a look at the following example.

### Heart rate example

Heart rate (beats per minute) when standing was recorded for seven people. Assume a symmetric distribution for heart rate in the population. Continuity is questionable as heart rate is an integer, but we’re okay if there are no ties, which is usually why we assume continuity.

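The observed data (recovered from the magnitudes and signs in the table below) is:

$$73, \ 82, \ 87, \ 68, \ 106, \ 60, \ 97$$

Suppose we want to test

$$H_0: \theta = 70 \quad \text{vs.} \quad H_1: \theta > 70.$$

First, compute the magnitudes $|x_i - 70|$ and rank them:

| Magnitude | 3 | 12 | 17 | 2 | 36 | 10 | 27 |
|---|---|---|---|---|---|---|---|
| Rank | 2 | 4 | 5 | -1 | 7 | -3 | 6 |

Note that we added a minus sign to the ranks of the observations that were smaller than 70. Now we can calculate the test statistics:

$$S_+ = 2 + 4 + 5 + 7 + 6 = 24, \qquad S_- = 1 + 3 = 4, \qquad S_d = |S_+ - S_-| = 20.$$

Sanity check: $S_+ + S_- = 28 = \frac{7 \times 8}{2}$.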

Under $H_0$, we’d expect $S_+ \approx S_-$ and $S_d \approx 0$. Under $H_1$, we’d expect $S_+$ and $S_d$ to be larger. Here we observe $S_+$ to be appreciably larger than $S_-$, which is in support of $H_1$.

For the permutation test, we need to build the permutation distribution. Note that $S_+$ can take values from 0 (when all ranks are negative, i.e. all observations are less than $\theta_0$) up to 28 (all ranks are positive). There are $2^7 = 128$ ways of allocating signs to the ranks. All of them are equally likely under $H_0$. Below we have a few possible configurations:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | $S_+$ | $S_-$ | $S_d$ |
|---|---|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | 0 | 28 | 28 |
| + | - | - | - | - | - | - | 1 | 27 | 26 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| + | + | + | + | + | + | + | 28 | 0 | 28 |

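With a one-sided test, large values of $S_+$ in the upper tail (equivalently, small values of $S_-$) are evidence against $H_0$. Only 7 of the 128 sign allocations give $S_- \leq 4$, so we want

$$P(S_+ \geq 24) = P(S_- \leq 4) = \frac{7}{128} \approx 0.0547,$$

which is pretty unlikely if $H_0$ holds. The evidence against $H_0$ is sufficiently strong to warrant a similar study with a larger group of students.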
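The permutation argument can be checked by brute force. The sketch below enumerates all $2^7$ sign allocations and compares the result with R's built-in `psignrank()`. The data vector is an assumption: it is the one implied by the magnitudes and signs in the table above with $\theta_0 = 70$.

```r
# Heart-rate data implied by the table of magnitudes and signs (theta0 = 70)
x <- c(73, 82, 87, 68, 106, 60, 97)
theta0 <- 70
n <- length(x)

d <- x - theta0
s_plus_obs <- sum(rank(abs(d))[d > 0])   # observed sum of positive ranks

# All 2^n allocations of signs to the ranks 1..n. Each row is one
# configuration: 1 means the rank is positive, 0 means it is negative.
signs <- as.matrix(expand.grid(rep(list(0:1), n)))
s_plus_all <- as.vector(signs %*% (1:n))  # S+ for every configuration

# Exact one-sided p-value: share of configurations with S+ >= observed S+
p_exact <- mean(s_plus_all >= s_plus_obs)

# Cross-check against the built-in signed-rank distribution:
# P(S+ >= s) = P(S+ > s - 1)
p_builtin <- psignrank(s_plus_obs - 1, n, lower.tail = FALSE)

print(c(S_plus = s_plus_obs, p_exact = p_exact, p_builtin = p_builtin))
```

Full enumeration is only practical for small $n$, since the number of rows doubles with every extra observation.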

### Function in R

In R, the function `wilcox.test()` does the exact permutation test if $n < 50$ and there are no ties in the data. Otherwise, a normal approximation is used. It should be noted that two options in `wilcox.test` conflict with each other:

- `exact`: compute the exact permutation distribution. Ignored when there are ties.

- `correct`: use a continuity correction in the normal approximation.

`exact` overrides `correct`, so when `exact = T` the value of `correct` doesn’t matter.
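A quick way to see the interaction between the two options is to run the test under all four combinations. This is a small sketch, assuming the heart-rate data are the values implied by the table of magnitudes in the example above; `p_value()` is a hypothetical wrapper introduced only for this illustration.

```r
# Heart-rate data implied by the example above (an assumption)
x <- c(73, 82, 87, 68, 106, 60, 97)

# Hypothetical wrapper: one-sided p-value for a given pair of options
p_value <- function(exact, correct) {
  wilcox.test(x, mu = 70, alternative = "greater",
              exact = exact, correct = correct)$p.value
}

p_exact_corr    <- p_value(exact = TRUE,  correct = TRUE)
p_exact_nocorr  <- p_value(exact = TRUE,  correct = FALSE)
p_approx_corr   <- p_value(exact = FALSE, correct = TRUE)
p_approx_nocorr <- p_value(exact = FALSE, correct = FALSE)

# With exact = TRUE the two p-values agree; with exact = FALSE they differ
print(c(p_exact_corr, p_exact_nocorr, p_approx_corr, p_approx_nocorr))
```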

In our case, `wilcox.test(data, alternative = "greater", mu = 70)` gives $V = 24$ and a p-value of 0.05469. In R, $S_+$ is called `V`. We can also do this by hand with the following procedure.

For a sample of size $n$ there are $2^n$ configurations in the permutation distribution, and the statistic $S_+$ or $S_-$ can take on $n(n+1)/2 + 1$ possible values: the integers 0 to $n(n+1)/2$ (79 possible values, 0 to 78, when $n = 12$). We may use `dsignrank(z, n)` to build up the permutation distribution for $S_+$ (equivalently $S_-$), where `z` are the possible values, and `n` is the sample size.

For each distribution, there’s a series of functions in R to get the density `dsignrank`, distribution function `psignrank`, quantile function `qsignrank` and random numbers `rsignrank`.

Another useful application is to get a confidence interval (CI) for the population median (mean): `wilcox.test(x, conf.int = T, conf.level = 0.95)`. The interval is based on **Walsh averages**.

### Walsh average

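Walsh averages are just pairwise averages of the observations, $(x_i + x_j)/2$ for $i \leq j$ (each observation is also paired with itself). Take the hypothesized median $\theta_0$ and the order statistics $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$. Now, if $x_{(i)} < \theta_0$, $x_{(i)}$ will have a **negative** rank. In fact, the largest negative deviation will belong to $x_{(1)}$, the next largest to $x_{(2)}$, and so on. Likewise, if $x_{(i)} > \theta_0$, $x_{(i)}$ will have a positive rank, and the largest positive deviation will belong to $x_{(n)}$.

Now, we consider the paired averages of $x_{(1)}$ with each of $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.

- Denote by $-p$ the signed rank of the deviation associated with $x_{(1)}$, assuming $x_{(1)} < \theta_0$. Then each average $(x_{(1)} + x_{(q)})/2$ is less than $\theta_0$ when the deviation associated with $x_{(q)}$ also has negative rank, or positive rank less than $p$.

- If $x_{(q)}$ is the smallest observation with a positive rank greater than $p$, then $|x_{(q)} - \theta_0| > |x_{(1)} - \theta_0|$, so the average of $x_{(1)}$ and $x_{(q)}$ is greater than $\theta_0$. The same holds for the average of $x_{(1)}$ with any $x_{(r)}$, $r > q$.

- The number of averages involving $x_{(1)}$ that are less than $\theta_0$ equals $p$, the magnitude of the negative rank associated with $x_{(1)}$. Likewise, the number of averages of $x_{(2)}$ with each of $x_{(2)}, x_{(3)}, \ldots, x_{(n)}$ that are less than $\theta_0$ equals the magnitude of the negative rank associated with $x_{(2)}$.

- Summing over the observations, the number of **Walsh averages** less than $\theta_0$ equals $S_-$, and the number of Walsh averages greater than $\theta_0$ equals $S_+$.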
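The correspondence between Walsh averages and the signed-rank sums can be verified numerically. The sketch below uses the heart-rate values implied by the earlier example (an assumption, with $\theta_0 = 70$) and simply counts the Walsh averages on each side of $\theta_0$.

```r
# Heart-rate data implied by the earlier example (an assumption)
x <- c(73, 82, 87, 68, 106, 60, 97)
theta0 <- 70

# Walsh averages: (x_i + x_j) / 2 for all pairs with i <= j (including i = j)
pair_sums <- outer(x, x, "+")
walsh <- pair_sums[upper.tri(pair_sums, diag = TRUE)] / 2

# Signed-rank sums computed the usual way
d <- x - theta0
r <- rank(abs(d))
s_minus <- sum(r[d < 0])
s_plus  <- sum(r[d > 0])

# Counts of Walsh averages below and above theta0
n_below <- sum(walsh < theta0)
n_above <- sum(walsh > theta0)

print(c(n_below = n_below, S_minus = s_minus))
print(c(n_above = n_above, S_plus = s_plus))
```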

We can use this correspondence to get a point estimate and a CI for the population median (mean). To do this:

- Order the data from smallest to largest.

- Compute the pairwise Walsh averages and arrange them in a table. For $n$ observations there are $n(n+1)/2$ of them, one per pair $i \leq j$, so 78 averages when $n = 12$. The Walsh averages increase along each row and column.

- The point estimator of the population median is the **median of the Walsh averages**, which with 78 averages is the average of the 39th and the 40th ordered Walsh averages. In our example, these two coincide. The median of the Walsh averages is also called the **Hodges-Lehmann estimator**, having been proposed by Hodges and Lehmann.

- To find a 95% confidence interval using the Walsh averages, we select as end points the values of $\theta$ that will just be acceptable in a two-sided test at the 5% level. The CI puts roughly 2.5% of the probability in each tail, so for one tail the cumulative count of configurations can’t be more than $0.025 \times 2^n$ (102.4 when $n = 12$). If $k$ is the largest value with $P(S_- \leq k) \leq 0.025$, the end points are the $(k+1)$th smallest and $(k+1)$th largest Walsh averages.
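The recipe above can be sketched in R and compared with `wilcox.test(..., conf.int = TRUE)`. For lack of the original 12-observation data, the sketch assumes the seven heart-rate values implied by the earlier example, so there are 28 Walsh averages rather than 78. The index `k` is obtained from `qsignrank()`, which plays the role of $k + 1$ in the last step.

```r
# Heart-rate data implied by the earlier example (an assumption)
x <- c(73, 82, 87, 68, 106, 60, 97)
n <- length(x)

# Sorted Walsh averages
pair_sums <- outer(x, x, "+")
walsh <- sort(pair_sums[upper.tri(pair_sums, diag = TRUE)] / 2)

# Hodges-Lehmann estimator: median of the Walsh averages
hl <- median(walsh)

# 95% CI: k-th smallest and k-th largest Walsh average, where k is the
# smallest value with P(S <= k) >= alpha / 2 under the null distribution
alpha <- 0.05
k <- qsignrank(alpha / 2, n)
ci <- c(walsh[k], rev(walsh)[k])

# Compare with the built-in interval and estimate
fit <- wilcox.test(x, conf.int = TRUE, conf.level = 0.95)
print(list(hl = hl, ci = ci, builtin_ci = as.numeric(fit$conf.int)))
```

Because the null distribution is discrete, the achieved confidence level is usually a little above the nominal 95%.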

### Asymptotic results

What happens as the sample size $n$ increases? The number of possible configurations becomes large very quickly, so enumerating all of them to get the exact distribution soon becomes infeasible, even with the aid of software.

The number of possible assignments of signs to the ranks is $2^n$, which increases rapidly.

But we also have symmetry, and as $n$ increases, the number of possible values of $S_+$ also increases, to $n(n+1)/2 + 1$. There is a **normal approximation** to the exact distribution of the test statistic! Under $H_0$, rank $i$ contributes $i$ to $S_+$ with probability $1/2$, so we can easily show that

$$E(S_+) = \frac{n(n+1)}{4}, \qquad \text{Var}(S_+) = \frac{n(n+1)(2n+1)}{24},$$

since the sum of the integers from 1 to $n$ is $n(n+1)/2$ and the sum of their squares is $n(n+1)(2n+1)/6$. For large enough $n$, we can define a test statistic $Z$:

$$Z = \frac{S - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}},$$

where $S$ is either $S_+$ or $S_-$, and $Z$ is approximately standard normal under $H_0$.

We can also improve the approximation by using a **continuity correction**, which accounts for the fact that we approximate a discrete quantity with a continuous distribution (in this case the normal). The idea is that if $S$ is the smaller of the sums of the positive or negative ranks, we replace $S$ by $S + \frac{1}{2}$. If $S$ is the larger sum, it is replaced by $S - \frac{1}{2}$. In R, this is requested with the option `correct = T` in `wilcox.test`.

Another thing we could do when $n$ is too large to enumerate all of the $2^n$ possible configurations (e.g. in the millions) is to take a random sample of them instead.
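Both ideas, the normal approximation and random sampling of sign configurations, can be sketched in a few lines of R. The example reuses the heart-rate values implied by the earlier example (an assumption). With $n = 7$ the sample is far too small for the approximation to be recommended, so the numbers only illustrate the mechanics. The number of random draws `B` and the seed are arbitrary choices.

```r
# Heart-rate data implied by the earlier example (an assumption)
x <- c(73, 82, 87, 68, 106, 60, 97)
theta0 <- 70
n <- length(x)
d <- x - theta0
s_plus <- sum(rank(abs(d))[d > 0])

# Null mean and standard deviation of S+
mu_s <- n * (n + 1) / 4
sd_s <- sqrt(n * (n + 1) * (2 * n + 1) / 24)

# Upper-tail p-value from the normal approximation
p_norm <- pnorm((s_plus - mu_s) / sd_s, lower.tail = FALSE)

# S+ is the larger sum here, so the correction replaces it by S+ - 1/2
p_norm_cc <- pnorm((s_plus - 0.5 - mu_s) / sd_s, lower.tail = FALSE)

# Monte Carlo alternative: sample sign configurations at random,
# giving each rank a positive sign with probability 1/2
set.seed(1)
B <- 20000
s_plus_sim <- replicate(B, sum((1:n)[runif(n) < 0.5]))
p_mc <- mean(s_plus_sim >= s_plus)

print(c(normal = p_norm, corrected = p_norm_cc, monte_carlo = p_mc))
```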

### Wilcoxon with ties

We’ve been assuming the underlying population distribution is continuous, so in theory in the data there should be no ties. In practice, however, ties can of course happen. Real observations never have a distribution that is strictly continuous either because of their nature, or as a result of rounding errors or limited measurement precision.

The other in-theory-impossible-but-in-practice-often-observed case is deviations of 0. We’re talking about values in the sample that are exactly equal to $\theta_0$, the median hypothesized under $H_0$. There’s no one agreed upon best approach for handling these cases!

#### Ties in the observations

One suggestion is to replace the ranks for the tied values with their **mid-ranks**. For example, suppose a sample of 12 observations and a hypothesized median $\theta_0$ give the following deviations and signed ranks:

| Deviation | -18 | -12 | -6 | -4 | 7 | 10 | 12 | 17 | 19 | 19 | 48 | 78 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Signed rank | -8 | -5.5 | -2 | -1 | 3 | 4 | 5.5 | 7 | 9.5 | 9.5 | 11 | 12 |

The deviations $-12$ and $12$ have the same magnitude so their ranks would be tied. They would be ranked 5 and 6, so we give them both 5.5. Similarly the two 19’s are both ranked 9.5.

We can now calculate $S_+$ and $S_-$ the same as before. But the exact distribution changes because of using these mid-ranks. In fact, the exact permutation distribution depends on the number of ties and where they fall in the rank sequence. It’s much harder to work out. The distribution with no ties is unimodal, with values of the statistic confined to integers increasing in unit steps, while the distribution with ties is heavily multimodal, with $S$ taking unevenly spaced and not necessarily integer values. Discontinuities are also more marked.
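Mid-ranks are what R's `rank()` produces by default, so the table above can be reproduced directly. The sketch below also enumerates the $2^{12} = 4096$ sign allocations to show that the permutation distribution now contains non-integer values. It is a brute-force illustration, not the shift algorithm.

```r
# Deviations from the table above
d <- c(-18, -12, -6, -4, 7, 10, 12, 17, 19, 19, 48, 78)
n <- length(d)

# rank() uses ties.method = "average" by default, which gives mid-ranks
mid <- rank(abs(d))
signed <- sign(d) * mid

s_plus  <- sum(mid[d > 0])
s_minus <- sum(mid[d < 0])

# Exact permutation distribution of S+ for these mid-ranks by enumeration:
# each row of `signs` marks which ranks receive a positive sign
signs <- as.matrix(expand.grid(rep(list(0:1), n)))
s_plus_all <- as.vector(signs %*% mid)

# With ties, S+ is no longer restricted to integer values
has_non_integer <- any(s_plus_all %% 1 != 0)

print(signed)
print(c(S_plus = s_plus, S_minus = s_minus))
print(has_non_integer)
```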

Streitberg and Röhmel came up with an algorithm called the *shift algorithm* to handle this situation. The R package `exactRankTests` implements this algorithm under the function `wilcox.exact`. Works for ties or no ties!

Another suggestion is to modify the normal approximation. The idea is to consider each of the signed ranks as a score $s_i$ for observation $i$. Under $H_0$, as before each score has equal probability of being positive or negative, so the expected value and the variance of $S_+$ or $S_-$ are

$$E(S) = \frac{1}{2} \sum_{i=1}^n |s_i|, \qquad \text{Var}(S) = \frac{1}{4} \sum_{i=1}^n s_i^2.$$

These work out to be the same as before for the particular choice of score ⇔ rank. The **score representation** is thus

$$Z = \frac{S - \frac{1}{2} \sum_{i=1}^n |s_i|}{\frac{1}{2} \sqrt{\sum_{i=1}^n s_i^2}}.$$

#### Deviations of zero

Again, opinions differ for data points that are equal to the median hypothesized under $H_0$. A standard piece of advice is to drop such points from the calculation of $S_+$ and $S_-$ (this is equivalent to assigning them rank 0), but this decreases the sample size and we lose data!

Sprent and Smeeton (in part 3.3.4 of the book) proposed a slight alternative: temporarily assign such points a rank of 1 (or the appropriate tied rank if there’s more than one zero), then sign-rank all the other observations as usual. Finally, switch the rank(s) associated with zero deviation to 0. This keeps all the data and uses the ranks up to $n$. However, the effect on the exact distribution is unclear.
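The two treatments of zero deviations give different rank sums, which a short R sketch makes concrete. The data vector below is made up for illustration (it is not from the text), and the second option is one reading of the proposal described above.

```r
# Illustrative data (not from the text): two observations equal theta0
x <- c(10, 8, 13, 15, 3, 10, 18)
theta0 <- 10
d <- x - theta0

# Option 1: drop the zero deviations, then rank what is left
d_nz <- d[d != 0]
r_nz <- rank(abs(d_nz))
drop_stats <- c(S_plus = sum(r_nz[d_nz > 0]), S_minus = sum(r_nz[d_nz < 0]))

# Option 2: rank everything, so the zeros take the lowest (tied) ranks,
# then reset the ranks of the zeros to 0
r_all <- rank(abs(d))
r_all[d == 0] <- 0
keep_stats <- c(S_plus = sum(r_all[d > 0]), S_minus = sum(r_all[d < 0]))

print(drop_stats)  # S_plus = 10, S_minus = 5
print(keep_stats)  # S_plus = 16, S_minus = 9
```

In the second option the non-zero deviations keep ranks 3 to 7, so the largest rank is still $n = 7$.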

### Summary

- The **sign test**, unlike the **Wilcoxon signed-rank test**, does not require symmetry of the underlying distribution. When the data come from a skewed distribution (e.g. income), both the t-test and the Wilcoxon may be inappropriate in the sense that they may not give us valid inference. This depends in part on how skewed the population distribution is. The sign test will still be valid. Confidence intervals based on the t-test / Wilcoxon may also be misleading. Those based on the sign test (and Binomial distribution) will still be fine. In other situations, all three will lead to similar conclusions.

- A suggestion: try different analyses and see if your conclusions are consistent.

- When the symmetry assumption is violated, the **sign test** may have higher efficiency - higher power in tests and shorter confidence intervals for a given confidence level. We can compare the asymptotic relative efficiency (ARE) of the sign test, Wilcoxon and t-test:
  - The ARE of the Wilcoxon compared to the t-test is at least 0.864, and can go up to infinity under some circumstances.
  - The Wilcoxon is never “too bad” and can be very good.
  - When the data are actually normally distributed (the situation where the t-test is optimal), the ARE of the Wilcoxon is 0.955, so very little loss here.

Up to this point, we’ve been talking about location inference - questions about the mean or median of the population distribution, but we can do a lot more. One particularly useful application is studying whether our data are consistent with having been drawn from some specified distribution.