Other Single Sample Inferences

Apr 26, 2019
Explore whether the sample is consistent with a specified distribution at the population level. Kolmogorov's test, Lilliefors test and Shapiro-Wilk test are introduced, as well as tests for runs or trends.
Previously we talked a lot about location inference, which is looking at the mean or median of the population distribution, or in fancier words, inferences about centrality. In this chapter we explore whether our sample is consistent with being a random sample from a specified (continuous) distribution at the population level.
Under $H_0$, the population distribution is completely specified with all the relevant parameters, such as a normal distribution with given mean and variance, or a uniform distribution with given lower and upper limits, or at least it’s some specific family of distributions.

Kolmogorov’s test

Kolmogorov's test is a widely used procedure. The idea is to look at the empirical CDF $S(x)$, which is a step function that has jumps of $1/n$ at the observed data points.
We mentioned before that as $n \to \infty$, it becomes “close” to the true CDF, $F(x)$. So, $S(x)$ is a consistent estimator for $F(x)$.
If our data really are a random sample from the distribution $F(x)$, we should then be seeing evidence of that in $S(x)$: $S(x)$ and $F(x)$ should be “close”. Under $H_0$, $F(x)$ is completely specified, so it’s known. $S(x)$ is determined by the data, so it’s known as well, which means we can compare them explicitly!
The logic of the test statistic is that if $x_1, x_2, \dots, x_n$ are a sample from a population with distribution $F(x)$, then the maximum difference between the CDF under $H_0$ and the empirical CDF should be small. The larger the maximum difference is, the more evidence against $H_0$. It’s generally a good idea to plot the empirical CDF together with the hypothesized $F(x)$ to see visually how close they are:
R code for generating the plot:

    library(tidyverse)

    set.seed(42)
    tibble(
      # 1000 simulated draws stand in for the hypothesized distribution;
      # the 10 draws play the role of the observed data
      Type = c(rep("Hypothesized", 1000), rep("Data", 10)),
      Value = c(
        round(rnorm(1000, mean = 60, sd = 15)),
        round(rnorm(10, mean = 60, sd = 15))
      )
    ) %>%
      ggplot(aes(Value, color = Type)) +
      stat_ecdf(geom = "step") +
      theme_minimal()

The red line is the empirical CDF of our data, $S(x)$, and the blue line approximates the hypothesized CDF $F(x)$ using the empirical CDF of 1000 simulated draws.
Our test statistic is the maximum vertical distance between $F(x)$ and the empirical CDF $S(x)$, or
$$T = \sup_x \left| F(x) - S(x) \right|$$
for a two-sided test with a two-tailed alternative. Deriving the exact distribution of $T$ in this case is much more complex. In R, the function ks.test() does the job. You have to specify what distribution to compare to, e.g. ks.test(dat, "pnorm", 2, 4) to test whether dat look like a sample from a normal distribution with mean 2 and standard deviation 4, i.e. $N(2, 4^2)$.
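To make the statistic concrete, here is a minimal sketch that computes the maximum vertical distance $T$ by hand and compares it with ks.test(). The helper name `ks_stat` and the simulated data are assumptions made for illustration only; the null is taken to be a normal distribution with known mean and standard deviation.

```r
# Kolmogorov statistic T = sup_x |F(x) - S(x)| for a fully specified normal null.
# `ks_stat` is a hypothetical helper; mu0 and sigma0 are assumed known under H0.
ks_stat <- function(x, mu0, sigma0) {
  n <- length(x)
  f <- pnorm(sort(x), mean = mu0, sd = sigma0)  # hypothesized CDF at the ordered data
  i <- seq_len(n)
  # S(x) jumps from (i - 1)/n to i/n at the i-th ordered value, so the
  # largest gap occurs either just before or right at one of the jumps.
  max(pmax(f - (i - 1) / n, i / n - f))
}

set.seed(1)
dat <- rnorm(25, mean = 2, sd = 4)  # simulated data, for illustration only

t_by_hand <- ks_stat(dat, 2, 4)
t_builtin <- unname(ks.test(dat, "pnorm", 2, 4)$statistic)

all.equal(t_by_hand, t_builtin)
```

Checking both sides of every jump matters because the empirical CDF is a step function: the supremum need not be attained at the value the step takes at a data point.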
More typically, though, we won’t know the values of the parameters that define the distribution. In other words, we have unknown parameters that need to be estimated. If we use the Kolmogorov test with estimated values (from the sample) of the parameters, the distribution of the test statistic T changes.

Lilliefors test for normality

The Lilliefors test is a simple modification of Kolmogorov’s test. We have a sample $x_1, x_2, \dots, x_n$ from some unknown distribution $F(x)$. Compute the sample mean and sample standard deviation as estimates of $\mu$ and $\sigma$, respectively:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$$
Use these to compute “standardized” or “normalized” versions of the data to test for normality:
$$z_i = \frac{x_i - \bar{x}}{s}, \qquad i = 1, 2, \dots, n$$
Compare the empirical CDF of the $z_i$ to the CDF of $N(0, 1)$, as with the Kolmogorov procedure. Alternatively, use the original data and compare to $N(\bar{x}, s^2)$. Here $H_0$: random sample comes from a population with the normal distribution with unknown mean and standard deviation, and $H_1$: the population distribution is non-normal.
This is a composite test of normality: the null does not pin down a single distribution but a whole family (every normal distribution, whatever its mean and standard deviation). We can obtain the distribution of the test statistic via simulation. In R, we can use the function nortest::lillie.test().


  • We computed $\bar{x}$ and $s$ and used those as estimators for the normal mean and s.d. in the population. Basically follow the Kolmogorov procedure with $\mu = \bar{x}$ and $\sigma = s$.
  • Lilliefors vs. Kolmogorov - procedurally very similar, but the reference distribution for the test statistic changes because we estimate the population mean and standard deviation.
  • Lilliefors found this reference distribution by simulation in the late 1960s. The idea was to generate random normal variates. For various values of sample size $n$, these random numbers are grouped into “samples”. For example, if $n = 8$, a simulated sample of size 8 from $N(0, 1)$ (under $H_0$) is generated. The $z_i$ values are computed as described earlier. The empirical CDF is compared to the $N(0, 1)$ CDF, and the maximum vertical discrepancy is found / recorded. Repeat this thousands of times to build up the simulated reference distribution for the test statistic under $H_0$ when $n = 8$. Repeat for many different sample sizes. As the number of simulations increases for a given sample size, the approximation improves.
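The simulation described above can be sketched in a few lines of R. This is an illustration under stated assumptions rather than Lilliefors' original procedure: the function names `lillie_stat` and `simulate_lillie_null`, the number of simulations (5000), and the observed sample are all made up for the example.

```r
# Lilliefors statistic: Kolmogorov distance between the ECDF of the
# standardized data z_i = (x_i - mean) / sd and the N(0, 1) CDF.
lillie_stat <- function(x) {
  n <- length(x)
  z <- sort((x - mean(x)) / sd(x))
  f <- pnorm(z)
  i <- seq_len(n)
  max(pmax(f - (i - 1) / n, i / n - f))
}

# Simulated reference distribution of the statistic under H0 for sample size n.
# n_sim = 5000 is an arbitrary choice for this sketch.
simulate_lillie_null <- function(n, n_sim = 5000) {
  replicate(n_sim, lillie_stat(rnorm(n)))
}

set.seed(42)
null_8 <- simulate_lillie_null(8)

# Approximate 5% critical value for n = 8
crit_8 <- unname(quantile(null_8, 0.95))

# Empirical p-value for an observed sample (simulated here, for illustration)
x_obs <- rexp(8)
p_val <- mean(null_8 >= lillie_stat(x_obs))
```

Because the statistic is computed on standardized data, simulating from $N(0, 1)$ is enough: any other normal mean and standard deviation would give the same reference distribution.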

Test for the exponential

Let’s look at a different example, the exponential. Our $H_0$: random sample comes from the exponential distribution
$$F(x) = 1 - e^{-x/\theta}, \qquad x \geq 0$$
where $\theta$ is an unknown parameter, vs. $H_1$: distribution is not exponential. Another composite null. We can compute
$$z_i = \frac{x_i}{\bar{x}}, \qquad i = 1, 2, \dots, n$$
where we use $\bar{x}$ to estimate $\theta$. Consider the empirical CDF of $z_1, z_2, \dots, z_n$. Compare it to
$$F(z) = 1 - e^{-z}$$
and find the maximum vertical distance between the two. This is the test statistic for the Lilliefors test for exponentiality. Tables for the exact distribution for this case exist, but not in general. The R package KScorrect tests against many hypothesized distributions.
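When tables are not at hand, the same simulation idea used for normality applies. The sketch below assumes the mean parametrization of the exponential (so the sample mean estimates the unknown parameter), and the names `exp_lillie_stat` and `exp_lillie_test` are hypothetical helpers, not functions from any package.

```r
# Lilliefors-type statistic for exponentiality: scale the data by the sample
# mean and compare the ECDF with the standard exponential CDF 1 - exp(-z).
exp_lillie_stat <- function(x) {
  n <- length(x)
  z <- sort(x / mean(x))
  f <- 1 - exp(-z)
  i <- seq_len(n)
  max(pmax(f - (i - 1) / n, i / n - f))
}

# Simulation-based p-value. The statistic does not depend on the scale of the
# data, so the reference distribution can be simulated with rate 1.
exp_lillie_test <- function(x, n_sim = 5000) {
  t_obs <- exp_lillie_stat(x)
  t_null <- replicate(n_sim, exp_lillie_stat(rexp(length(x))))
  list(statistic = t_obs, p.value = mean(t_null >= t_obs))
}

set.seed(7)
x <- rexp(30, rate = 0.2)   # simulated data, for illustration only
res <- exp_lillie_test(x)
res$statistic
res$p.value
```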

Another Test for Normality

The Shapiro-Wilk test is another important test for normality which is used quite often in practice. We again have a random sample $x_1, x_2, \dots, x_n$ with unknown distribution $F(x)$. $H_0$: $F(x)$ is a normal distribution with unspecified mean and variance, vs. $H_1$: $F(x)$ is non-normal.
The idea essentially is to look at the correlation between the ordered sample values (order statistics from the sample) and the expected order statistics from $N(0, 1)$. If the null holds, we’d expect this correlation to be near 1. Smaller values are evidence against $H_0$. A Q-Q plot has the same logic as this test.
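For the test more specifically:
  • $k = n/2$ if $n$ is even, otherwise $k = (n-1)/2$.
  • $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ are the order statistics for the sample.
  • $a_1, a_2, \dots, a_k$ are coefficients derived from the expected order statistics from $N(0, 1)$, obtained from tables.
  • $W = \frac{1}{D}\left[\sum_{i=1}^k a_i \left(x_{(n-i+1)} - x_{(i)}\right)\right]^2$, where $D = \sum_{i=1}^n (x_i - \bar{x})^2$.
We may also see it written as
$$W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
where the coefficients now run over all $n$ order statistics and are antisymmetric, $a_{n-i+1} = -a_i$, so the two forms agree once the sum is squared.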
With large samples, the chance to reject $H_0$ increases - even small departures from normality will be detected, and formally lead to rejecting $H_0$ even if the data are “normal enough”. Many parametric tests (such as the t-test) are pretty robust to departures from normality.
The takeaway here is to always think about what you’re doing. Don’t apply tests blindly - think about results, what they really mean, and how you will use them.
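As a rough illustration of the Shapiro-Wilk ideas in this section, the sketch below runs shapiro.test() from base R and computes a Q-Q style correlation alongside it. The data are simulated, and qnorm(ppoints(n)) is only an approximation to the expected normal order statistics, not the tabulated coefficients, so the squared correlation is related to $W$ but not identical to it.

```r
set.seed(123)
x <- rnorm(50, mean = 10, sd = 2)   # simulated data, for illustration only

sw <- shapiro.test(x)               # returns the W statistic and a p-value

# Correlation between the ordered sample and approximate expected
# normal order statistics (the logic behind a Q-Q plot).
qq_cor <- cor(sort(x), qnorm(ppoints(length(x))))

c(W = unname(sw$statistic), qq_cor_squared = qq_cor^2)

# Neither quantity depends on the unknown mean or standard deviation
sw_shifted <- shapiro.test(3 * x + 100)

# Sample size matters: the same mild departure from normality (a t
# distribution with 10 degrees of freedom, an arbitrary choice) is much
# more likely to be flagged with 5000 observations than with 30.
p_small <- shapiro.test(rt(30, df = 10))$p.value
p_large <- shapiro.test(rt(5000, df = 10))$p.value
```

Note that shapiro.test() only accepts sample sizes between 3 and 5000.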

Runs or Trends

The motivation here is that many basic analyses make the assumption of a random sample, i.e. independent, identically distributed observations (i.i.d). When this assumption doesn’t hold, we need a different analysis strategy (e.g. time series, spatial statistics, etc.) depending on the characteristics of the data.

Cox-Stuart test

When the data are taken over time (ordered in time), there may be a trend in the observations. Cox and Stuart proposed a simple test for a monotonically increasing or decreasing trend in the data. Note that monotonic doesn’t mean linear, but simply a consistent tendency for values to increase or decrease.
The procedure is based on the sign test. Consider a sample of independent observations $x_1, x_2, \dots, x_n$. If $n = 2m$, take the differences
$$d_i = x_{m+i} - x_i, \qquad i = 1, 2, \dots, m$$
If $n = 2m + 1$, omit the middle value $x_{m+1}$ and calculate $d_i = x_{m+1+i} - x_i$.
If there is an increasing trend over time, we’d expect the observations earlier in the series will tend to be smaller, so the differences will tend to be positive, and vice versa if there is a decreasing trend. If there’s no monotonic trend, the observations differ by random fluctuations about the center, and the differences are equally likely to be positive or negative.
Under $H_0$ of no monotonic trend, the number of positive signs among the $m$ differences is $\text{Bin}(m, 0.5)$. That’s a sign test scenario!
The U.S. Department of Commerce publishes estimates obtained for independent samples each year of the mean annual mileages covered by various classes of vehicles in the U.S. The figures for cars and trucks (in 1000s of miles) for the years 1970–1983 are:
9.8 1
Is there evidence of a monotonic trend in each case?
We don’t specify increasing or decreasing because we don’t have that information, so it’s a two-sided alternative.
For cars, all the differences are negative. When $X \sim \text{Bin}(7, 0.5)$, $P(X = 0) = 0.5^7 \approx 0.0078$. We have a two-sided alternative, so we need to consider also $P(X = 7)$, which by symmetry has the same probability, so we get a p-value $\approx 2 \times 0.0078 = 0.0156$. This is reasonably strong evidence against $H_0$.
For trucks, we have 4 negative differences and 3 positive differences, which is supportive of $H_0$, in fact, the most supportive you could be with just 7 differences.
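The Cox-Stuart procedure is short enough to write out directly. The function name `cox_stuart` is a hypothetical helper, and the two series below are made-up numbers chosen only to reproduce the sign patterns discussed above (all differences negative, and 3 positive vs. 4 negative); they are not the mileage data.

```r
# Cox-Stuart test for a monotonic trend, built on the sign test.
cox_stuart <- function(x) {
  n <- length(x)
  m <- n %/% 2
  # Pair each of the first m observations with the one in the second half;
  # for odd n the middle value is skipped automatically.
  d <- x[(n - m + 1):n] - x[1:m]
  d <- d[d != 0]                    # zero differences carry no sign information
  n_pos <- sum(d > 0)
  # Under H0 the number of positive differences is Bin(length(d), 0.5)
  binom.test(n_pos, length(d), p = 0.5, alternative = "two.sided")
}

# Made-up series of length 14: steadily decreasing, so all 7 differences are negative
decreasing <- 14:1
p_decreasing <- cox_stuart(decreasing)$p.value

# Made-up series of length 14 with 3 positive and 4 negative differences
mixed <- c(1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 3, 4, 5, 6)
p_mixed <- cox_stuart(mixed)$p.value
```

The first p-value equals $2 \times 0.5^7$, matching the hand calculation for the all-negative case, and the second equals 1, since a 3 / 4 split is as balanced as 7 differences can be.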

Runs test

Note that the sign test does not account for, or “recognize”, the pattern in the signs for the trucks. There is evidence for some sort of trend, but since it’s not monotonic, the sign test can’t catch it. It also can’t find periodic, cyclic, and seasonal trends, because it only counts the number of successes / failures. We need a different type of procedure.
One possibility is the runs test, which looks for patterns in the successes / failures. We’re looking for patterns that may indicate a “lack of randomness” in the data. Suppose we toss a coin 10 times and see
H T H T H T H T H T
We’d suspect non-randomness because of the constant switching back and forth. Similarly, if we saw
H H H T T T T H H H
we’d suspect non-randomness because of too few switches, or too “blocky”.
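For tests of randomness, both the numbers and lengths of runs are relevant. In the first case we have 10 runs of length 1 each, and in the second case we have 3 runs - one of length 3, followed by one of length 4, and another of length 3. Too many runs and too few are both indications of lack of randomness. Let
$$R = \text{number of runs}, \quad n_1 = \text{number of H's}, \quad n_2 = \text{number of T's}, \quad n = n_1 + n_2$$
Our hypotheses are
$$H_0: \text{the sequence is random} \quad \text{vs.} \quad H_1: \text{the sequence is not random}$$
We reject $H_0$ if $R$ is too big or too small. To get a handle on this, we need to think about the distribution of the number of runs for a given sequence length. We’d like to know
$$P(R = r) = \frac{\text{number of ways of getting } r \text{ runs}}{\text{total number of ways of arranging the H's and T's}}$$
This is conceptually easy, but doing this directly would be tedious for an even moderate $n$. We can use combinatorics to work it out. The denominator is the number of ways to pick $n_1$ out of $n$: $\binom{n}{n_1}$. As for the numerator, we need to think about all the ways to arrange $n_1$ H’s and $n_2$ T’s to get $r$ runs in total:
$$P(R = 2s) = \frac{2\binom{n_1 - 1}{s - 1}\binom{n_2 - 1}{s - 1}}{\binom{n}{n_1}}, \qquad P(R = 2s + 1) = \frac{\binom{n_1 - 1}{s - 1}\binom{n_2 - 1}{s} + \binom{n_1 - 1}{s}\binom{n_2 - 1}{s - 1}}{\binom{n}{n_1}}$$
In principle, we can use these formulas to compute tail probabilities of events, and hence p-values, if $n_1$ and $n_2$ aren’t too large. We could run into numerical issues if this isn’t the case, and computing the tail probabilities is tedious, so we also have a normal approximation:
$$E(R) = 1 + \frac{2 n_1 n_2}{n}, \qquad Var(R) = \frac{2 n_1 n_2 (2 n_1 n_2 - n)}{n^2 (n - 1)}, \qquad Z = \frac{R - E(R)}{\sqrt{Var(R)}} \sim N(0, 1)$$
We can still improve by continuity correction: add $\frac{1}{2}$ to the numerator if $R < E(R)$ and subtract $\frac{1}{2}$ if $R > E(R)$.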
The question of interest overall is randomness, or lack thereof, so the test is two-sided by nature. There are two extremes of run behavior:
  1. clustering or clumping of types - a small number of long runs is evidence of clustering (one-sided).
  2. alternating pattern of types - a large number of runs is evidence of an alternating pattern (again, in a one-sided perspective).
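Both routes to a p-value for the two-category runs test, the exact run-count probabilities and the normal approximation with continuity correction, can be sketched as follows. The function names (`count_runs`, `druns`, `runs_test_normal`) are hypothetical helpers, and the two sequences are the alternating and blocky coin-toss patterns of the kind discussed in this section.

```r
# Number of runs in a sequence
count_runs <- function(s) length(rle(as.vector(s))$lengths)

# Exact P(R = r) for a random arrangement of n1 events of one type
# and n2 of the other (r is a single integer).
druns <- function(r, n1, n2) {
  s <- r %/% 2
  num <- if (r %% 2 == 0) {
    2 * choose(n1 - 1, s - 1) * choose(n2 - 1, s - 1)
  } else {
    choose(n1 - 1, s - 1) * choose(n2 - 1, s) +
      choose(n1 - 1, s) * choose(n2 - 1, s - 1)
  }
  num / choose(n1 + n2, n1)
}

# Normal approximation with continuity correction (two-sided)
runs_test_normal <- function(s) {
  tab <- table(s)
  stopifnot(length(tab) == 2)
  n1 <- tab[[1]]
  n2 <- tab[[2]]
  n <- n1 + n2
  r <- count_runs(s)
  mu <- 1 + 2 * n1 * n2 / n
  v <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))
  cc <- if (r < mu) 0.5 else if (r > mu) -0.5 else 0
  z <- (r - mu + cc) / sqrt(v)
  list(runs = r, z = z, p.value = 2 * pnorm(-abs(z)))
}

alternating <- rep(c("H", "T"), 5)                 # 10 runs, n1 = n2 = 5
blocky <- strsplit("HHHTTTTHHH", "")[[1]]          # 3 runs, 6 H's and 4 T's

p_upper_alt <- druns(10, 5, 5)                     # exact P(R >= 10) = P(R = 10)
p_lower_blocky <- sum(sapply(2:3, druns, n1 = 6, n2 = 4))  # exact P(R <= 3)
approx_alt <- runs_test_normal(alternating)
```

With only 10 tosses the exact probabilities are the safer choice; the normal approximation is included to show the mechanics.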

Runs test for multiple categories

We may also take a simulation-based approach. The goal is to find critical values, or p-values empirically based on simulation, rather than using the normal approximation.
The procedure is to generate a large number of random sequences of length $n$, with $n_1$ of type 1 events and $n_2$ of type 2 events (e.g. use R to generate a random sequence of 0’s and 1’s, where the probabilities for 0 and 1 come from the original data - essentially permuting the original sequence). Count the number of runs in each sequence; this is the same statistic we computed from the observed data. The generated sequences show what we’d expect if the null is reasonable. Gathering all of these together gives an empirical distribution for the number of runs you might expect to see in a sequence of length $n$ ($n_1$ of type 1, $n_2$ of type 2) if $H_0$ is reasonable.
If $n$ (hence also $n_1$ and $n_2$) is small, we can compute the exact probabilities. Also, if $n$ is small or moderate, if you generate a lot of random sequences, you will see a lot of repeated sequences.
What if we have more than 2 types of events? Smeeton and Cox described a method for estimating the distribution of the number of runs by simulating permutations of the sequence of categories. Suppose we have $k$ different outcomes / events, and let $n_i$ denote the number of observations of type $i$. We have $n = \sum_{i=1}^k n_i$ - the total length of the sequence, and $p_i = n_i / n$ - the proportion of observations of type $i$.
We can again use the simulation approach here: generate a lot of sequences of length $n$, with $n_1$ of type 1, $n_2$ of type 2, …, $n_k$ of type $k$, and count the number of runs in each sequence.
p-values: suppose we have 1000 random sequences of length $n$, and the number of runs ranges from 5 to 25. In the 1000 simulations, we need to record how many showed 5 runs, 6 runs, …, 25 runs. If we observed 12 runs in our data, the tail probability is $\frac{\#(\text{sequences with 12 or fewer runs})}{1000}$, and we find the tail probability on the other tail by symmetry (e.g. (5, 6) in one tail is matched with (24, 25) in the other).
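Normal approximation: Schuster and Gu proposed an asymptotic test based on the normal distribution that makes use of the mean and variance of the number of runs:
$$E(R) = n\left(1 - \sum_{i=1}^k p_i^2\right) + 1, \qquad Var(R) = n\left[\sum_{i=1}^k p_i^2 - 2\sum_{i=1}^k p_i^3 + \left(\sum_{i=1}^k p_i^2\right)^2\right]$$
Use these in a normal approximation:
$$Z = \frac{R - E(R)}{\sqrt{Var(R)}} \sim N(0, 1)$$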
where R denotes the observations in our sample. Barton and David suggest that the normal approximation is adequate for , no matter what the number of categories is.
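A minimal sketch of the multi-category runs test, both by permutation and by normal approximation, is given below. The function names are hypothetical, the number of simulated permutations (2000) is arbitrary, and the two-sided p-value is obtained by doubling the smaller tail, which is one common convention and an assumption here rather than the method of any of the cited authors. The normal version uses the usual asymptotic mean and variance of the number of runs, with $p_i$ the observed proportion of category $i$.

```r
# Number of runs in a sequence with any number of categories
count_runs <- function(s) length(rle(as.vector(s))$lengths)

# Permutation version: shuffling the observed sequence keeps the count of
# every category fixed, which is what H0 (random order) conditions on.
runs_perm_test <- function(s, n_sim = 2000) {
  r_obs <- count_runs(s)
  r_sim <- replicate(n_sim, count_runs(sample(s)))
  lower <- mean(r_sim <= r_obs)
  upper <- mean(r_sim >= r_obs)
  list(runs = r_obs, p.value = min(1, 2 * min(lower, upper)))
}

# Normal approximation using the asymptotic mean and variance of the run count
runs_normal_multi <- function(s) {
  n <- length(s)
  p <- as.vector(table(s)) / n
  mu <- n * (1 - sum(p^2)) + 1
  v <- n * (sum(p^2) - 2 * sum(p^3) + sum(p^2)^2)
  z <- (count_runs(s) - mu) / sqrt(v)
  list(expected = mu, variance = v, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Made-up sequence with three categories, heavily clumped (only 3 runs)
clumped <- rep(c("A", "B", "C"), each = 10)

set.seed(2019)
perm_res <- runs_perm_test(clumped)
norm_res <- runs_normal_multi(clumped)
```

For the clumped sequence both versions point the same way: far fewer runs than a random ordering of 10 A's, 10 B's and 10 C's would typically produce.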

So far we’ve been talking about inferences on single samples. Next we’ll take a step further and discuss paired samples.