In this chapter our focus is mostly on “count data”: the data are numbers of units with particular attributes. The problems appear different at first sight, yet many are solved using procedures developed previously.
Attributes are characteristics that the units can have, and are typically qualitative or categorical. Data are arranged in a contingency table. The table has a dimension (row, column, layer, …) for each attribute. We consider multiple attributes simultaneously. Attributes can be
- Nominal or ordinal:
- Nominal - categories are just names, with no ordering.
- Ordinal - categories can be ordered.
- Explanatory or response variables.
The general setup is an $r \times c$ table ($r$ rows and $c$ columns) where
- $n_{ij}$ is the count in cell $(i, j)$.
- $n_{i+} = \sum_j n_{ij}$ is also denoted $n_{i.}$, and $n_{+j} = \sum_i n_{ij}$ is also denoted $n_{.j}$.
- $n_{++} = \sum_{i,j} n_{ij}$ is also denoted $n_{..}$ or $n$.
Inference and procedures are based on conditional inference: we treat row and column marginals as if they were fixed. Here “conditional” means we condition on the row/column totals. Note that sometimes one (or both) of the marginals will be fixed by the study design, but not always.
We start with $2 \times 2$ tables for ease of discussion. Assume for now that the row variable is the explanatory variable, and the column variable is the response variable. The row variable often will have marginal totals that are fixed by design. Suppose the response variable has two levels, “Success” and “Failure”. We have

|                     | Response 1 (S) | Response 2 (F) | Total    |
|---------------------|----------------|----------------|----------|
| Row 1 (treated)     | $n_{11}$       | $n_{12}$       | $n_{1+}$ |
| Row 2 (non-treated) | $n_{21}$       | $n_{22}$       | $n_{2+}$ |
| Total               | $n_{+1}$       | $n_{+2}$       | $n$      |

where $n_{1+}$ and $n_{2+}$ are fixed by design, whereas $n_{+1}$ and $n_{+2}$ are not fixed.
We saw before that this design implies independent Binomials: $n_{11} \sim \text{Bin}(n_{1+}, p_1)$ for row 1 and $n_{21} \sim \text{Bin}(n_{2+}, p_2)$ for row 2. Our question is whether there is an association between the treatment and response, or in hypothesis-testing language $H_0: p_1 = p_2$, which says the probability of the successful outcome is the same for treated and non-treated.
We haven’t said anything (yet) about the common probability of success under $H_0$, $p = p_1 = p_2$.
In our example, we observe $n_{11}$ successes for row 1 (treated), and $n_{21}$ successes for row 2 (non-treated). We can use this to derive

$$P(n_{11}, n_{21}) = \binom{n_{1+}}{n_{11}} p_1^{n_{11}} (1 - p_1)^{n_{12}} \binom{n_{2+}}{n_{21}} p_2^{n_{21}} (1 - p_2)^{n_{22}}.$$

Also, in total we have $n = n_{1+} + n_{2+}$ observations with $n_{+1} = n_{11} + n_{21}$ successes.
Using properties of conditional, independent, etc. probabilities, we can show that the probability of observing $n_{11}$ successes in $n_{1+}$ trials, conditional on a total of $n_{+1}$ successes in $n$ trials, is

$$P(n_{11} \mid n_{+1}) = \frac{P(n_{11},\; n_{21} = n_{+1} - n_{11})}{P(n_{+1})}.$$

Under $H_0$, $p_1 = p_2 = p$. All the probabilities in the individual Binomials end up cancelling each other out, and only the cell frequencies remain, which is the hypergeometric distribution:

$$P(n_{11} \mid n_{+1}) = \frac{\binom{n_{1+}}{n_{11}} \binom{n_{2+}}{n_{21}}}{\binom{n}{n_{+1}}}.$$
What might the cell counts be “expected” to look like under $H_0$? If we knew $p$, we could get exact expected counts for each cell using properties of the binomial. We don’t know $p$, but we can estimate it using the sample proportion of successes: $\hat{p} = n_{+1}/n$. The expected number of successes in row 1 can be obtained using $\hat{p}$:

$$\hat{E}_{11} = n_{1+} \hat{p} = \frac{n_{1+} n_{+1}}{n}.$$
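As a quick sketch in R (with made-up counts, not data from the notes), the estimated expected counts for every cell can be computed directly from the margins:

```r
# Hypothetical 2x2 table: rows = treatment, columns = response (S, F)
tab <- matrix(c(30, 20,
                18, 32), nrow = 2, byrow = TRUE)
n <- sum(tab)              # total sample size
row.tot <- rowSums(tab)    # n_{1+}, n_{2+}
col.tot <- colSums(tab)    # n_{+1}, n_{+2}
# Expected count in cell (i, j) under H0 is n_{i+} * n_{+j} / n
expected <- outer(row.tot, col.tot) / n
expected
#      [,1] [,2]
# [1,]   24   26
# [2,]   24   26
```

The same matrix is what `chisq.test()` uses internally as its `$expected` component.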
Another useful construct is the odds ratio, which is used a lot in medical settings. It can also be used as a measure of association between the treatment and response in a $2 \times 2$ table.
Under $H_0$ (no treatment effect), we saw last time that the odds ratio $\theta = \dfrac{p_1/(1-p_1)}{p_2/(1-p_2)} = 1$. Deviations of the sample odds ratio

$$\hat{\theta} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$$

from 1 are indications against $H_0$. We can also show that $\log\hat{\theta}$ is asymptotically normal, with variance estimated by $1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}$.
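A sketch in R, again on a made-up table: the sample odds ratio, plus an approximate 95% confidence interval based on the asymptotic normality of the log odds ratio:

```r
# Sample odds ratio for a hypothetical 2x2 table
tab <- matrix(c(30, 20,
                18, 32), nrow = 2, byrow = TRUE)
or.hat <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or.hat                        # (30 * 32) / (20 * 18) = 2.667
# Estimated standard error of log(or.hat): sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
se.log <- sqrt(sum(1 / tab))
# Approximate 95% CI on the odds-ratio scale
exp(log(or.hat) + c(-1, 1) * 1.96 * se.log)
```

Values of $\hat{\theta}$ far from 1 (equivalently, intervals excluding 1) point against $H_0$.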
A second possibility for how the table could have arisen: both attributes are some sort of response, e.g. both could be side effects experienced when patients are given a treatment.
Only the overall sample size $n$ is fixed by design now. Each of the $n$ individuals gets allocated independently of the others to the appropriate cell according to responses on the two attributes.
Now each individual has four possible outcomes, corresponding to the different patterns of response $(1,1)$, $(1,2)$, $(2,1)$ and $(2,2)$. What this means is that all cell counts $n_{11}$, $n_{12}$, $n_{21}$, $n_{22}$ vary freely. This implies that, with unknown probabilities $p_{ij}$, each of the $n$ subjects is allocated independently to cell $(i, j)$, which is a multinomial model (multiple outcomes) with $n$ observations and probabilities $p_{ij}$ such that $\sum_{i,j} p_{ij} = 1$.
$p_{1+}$ and $p_{2+}$ are marginal probabilities that units fall in row 1 or 2, respectively; likewise, $p_{+1}$ and $p_{+2}$ are marginal probabilities that units fall in column 1 or 2, respectively. If the row and column attributes are independent, then $p_{ij} = p_{i+} p_{+j}$ for all $i, j$. Since all of the probabilities here are unknown, we estimate them using the sample proportions:

$$\hat{p}_{i+} = \frac{n_{i+}}{n}, \qquad \hat{p}_{+j} = \frac{n_{+j}}{n}.$$
By properties of the multinomial, the expected count in cell $(i, j)$ under independence is estimated to be

$$\hat{E}_{ij} = n \hat{p}_{i+} \hat{p}_{+j} = \frac{n_{i+} n_{+j}}{n}.$$
This is the same as in the first model, even though our assumptions about how the table came about were different. Also the odds ratio estimation will be the same.
A third possibility: we collect data for 6 months, and each individual is classified, independently of the others, to cell $(i, j)$ according to values on the two attributes. Nothing here is fixed, as we don’t know how many individuals we’re going to have!
The model here is independent Poissons for each of the four cells, with mean $\mu_{ij}$ for cell $(i, j)$. Under $H_0$ of no association between the row and column attributes, we get to the same end point as in the previous two cases.
Note that this all extends pretty easily to a general $r \times c$ table. In general, we can ignore which elements, if any, are fixed by design, and present a unified framework for the rest of the analysis.
The unified framework for $r \times c$ tables focuses on the level of measurement of the categorical variables. We treat nominal attributes and ordinal attributes differently here.
Three approaches are commonly used when row and column categories are both nominal: Fisher's exact test, the Pearson chi-squared test, and the likelihood-ratio test. All are tests for a null of no association (or independence) between the two attributes. Under that null, all three tests have the same large-sample (asymptotic) distribution: $\chi^2_{(r-1)(c-1)}$. Large values are evidence against $H_0$.
Importantly, the values of the three test statistics will differ on the same sample. So will their exact distributions. However, unless the choice of the significance level cutoff is critical, the three seldom lead to different conclusions.
We saw Fisher's exact test for the $2 \times 2$ table in lots of detail before. We also saw that for a general $r \times c$ table, “extreme” is a little murkier. We won’t do it by hand. In R, the function `fisher.test()` does the job. Input either
- `x` - a two-dimensional contingency table, or
- `x` and `y` - two vectors (observations for each attribute), from which the table can be built.

The function computes exact p-values for the table. We can also obtain a simulated p-value by passing the argument `simulate.p.value = TRUE`. For $2 \times 2$ tables, it can also make inference for the odds ratio.
In practice, Fisher's exact test is often used when asymptotic theory is inappropriate (usually when sample sizes are small), or in the case of $2 \times 2$ tables.
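For illustration, a sketch of `fisher.test()` on made-up tables (all counts here are hypothetical):

```r
# Fisher's exact test on a hypothetical 2x2 table
tab <- matrix(c(30, 20,
                18, 32), nrow = 2, byrow = TRUE)
ft <- fisher.test(tab)
ft$p.value     # exact two-sided p-value
ft$estimate    # conditional MLE of the odds ratio (2x2 tables only)

# For larger tables the exact computation can be expensive;
# a Monte Carlo p-value is available instead:
tab3 <- matrix(c(10, 4, 6,
                  3, 9, 5,
                  5, 6, 8), nrow = 3, byrow = TRUE)
fisher.test(tab3, simulate.p.value = TRUE, B = 10000)
```

The odds-ratio output (`$estimate` and a confidence interval) is only produced in the $2 \times 2$ case.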
An alternative statistic for testing independence of the row and column categories is the Pearson chi-squared statistic:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where $E_{ij} = n_{i+} n_{+j}/n$ is the expected cell count under $H_0$ of independence, and $O_{ij} = n_{ij}$ is the observed cell count in the data. $X^2$ can be calculated directly from the table using this equation.
A caveat is that if there are cells with small expected counts (e.g. $E_{ij} < 5$), the asymptotic $\chi^2$ reference distribution is not a good approximation.
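To make the formula concrete, here is a sketch (hypothetical counts) computing the Pearson statistic by hand and checking it against `chisq.test()`:

```r
# Pearson chi-squared statistic computed by hand on a made-up 2x2 table
tab <- matrix(c(30, 20,
                18, 32), nrow = 2, byrow = TRUE)
# Expected counts under independence: n_{i+} * n_{+j} / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
X2 <- sum((tab - expected)^2 / expected)
X2
# Compare with the built-in (correct = FALSE turns off the
# continuity correction so the two values match exactly)
chisq.test(tab, correct = FALSE)$statistic
```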
An alternative statistic for the test of association is the likelihood-ratio statistic:

$$G^2 = 2 \sum_{i,j} O_{ij} \log\frac{O_{ij}}{E_{ij}},$$

where once again $E_{ij}$ is the expected cell count under $H_0$ of independence, and $O_{ij}$ is the observed cell count in the data.
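As a sketch on a made-up table, the likelihood-ratio statistic can be computed directly and referred to the asymptotic chi-squared distribution:

```r
# Likelihood-ratio statistic G^2 = 2 * sum(O * log(O / E)), hypothetical counts
tab <- matrix(c(30, 20,
                18, 32), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
G2 <- 2 * sum(tab * log(tab / expected))
G2
# Asymptotic p-value against chi-squared with (r-1)(c-1) = 1 df
pchisq(G2, df = 1, lower.tail = FALSE)
```

Note that this direct formula assumes all observed counts are positive (cells with $O_{ij} = 0$ contribute 0 and need separate handling in code).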
On the same data, these three will obviously give different numerical results. Note that since the three tests apply to nominal categories, reordering the rows or columns doesn’t affect the values of the test statistics. For ordered categories, there are more appropriate tests.
The following cases are considered here:
- Nominal explanatory and ordered response.
- Ordered explanatory and ordered response, e.g. increasing dose of a drug and varied side-effects.
- Row and column attributes are both ordered responses, e.g. two different side-effects.
We’re going to look at the number of patients receiving each drug who experience different levels of side effects:
Here we’re not saying anything about how this study was designed. We don’t have to think about if any margins are fixed. Question: is there an association between drug type (A vs. B) and level of side effect, or are these two attributes independent?
This is like a massive Wilcoxon-Mann-Whitney (WMW) situation with lots of ties! Both test statistic formulations can be applied.
In the Wilcoxon formulation we may use mid-ranks again, e.g. 65 subjects had no side effects, so all get the same mid-rank of $(1 + 65)/2 = 33$. The next 16 are tied at “slight”, so we set their common mid-rank to $(66 + 81)/2 = 73.5$. We can proceed as usual and correct for ties.
In the Mann-Whitney formulation, there is no need to specify the mid-ranks. For each recipient of drug A, we just count the number of drug B patients showing the same or more severe side effects, counting ties as 0.5. For instance, the 42 drug B patients with no side effects are tied with the 23 drug A patients with no side effects, contributing $0.5 \times 23 \times 42$; there are 8 drug B patients with slight and 4 with moderate side effects, so this group contributes $23 \times (8 + 4)$ to the test statistic. Repeat for drug A patients at the other response levels.
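A sketch of this analysis in R. The counts below partially match the example (23/42 patients with no side effects, 8 each at “slight”), but the moderate and severe columns are filled in with hypothetical values just to complete the table:

```r
# WMW test from a 2 x 4 table of ordered responses
# (moderate/severe counts for drug A and severe for drug B are made up)
tab <- matrix(c(23, 8, 4, 1,
                42, 8, 4, 2), nrow = 2, byrow = TRUE,
              dimnames = list(c("A", "B"),
                              c("none", "slight", "moderate", "severe")))
# Expand the counts into one scored observation per patient (1 < 2 < 3 < 4)
x <- rep(1:4, tab["A", ])   # drug A patients
y <- rep(1:4, tab["B", ])   # drug B patients
# Normal-approximation WMW test; R adjusts the variance for the many ties
wilcox.test(x, y, exact = FALSE)
```

Expanding the table into scored observations lets `wilcox.test()` handle the mid-ranks and tie correction automatically.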
If the nominal explanatory variable has more than two values (e.g. 3 drugs in the previous example), the obvious extension to the Kruskal-Wallis test applies.
The row categories are an ordinal explanatory variable, and the columns are ordered responses. We want to ask if there’s an association between the two attributes. The J-T (Jonckheere-Terpstra) test can again be applied, with each row as an ordered sample.
We have data on side effects experienced at increasing dose levels of a drug. Do side effects increase with dose level?
Again, we score relevant ties in any column (as with the previous analysis) as $0.5$. Obviously the first column contributes most of the ties. We compute each pairwise MW statistic and add them all up. That is,

$$JT = \sum_{i < j} U_{ij},$$

where $U_{ij}$ is the Mann-Whitney count comparing dose groups $i$ and $j$.
Conclusion: this has a two-sided approximate p-value of 0.4576, so there is no evidence of association. Even the one-sided p-value is pretty high. At first glance these results may seem counter-intuitive, but side effects are very rare, so it’s hard to discover much pattern.
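As an illustration of the pairwise-MW construction (on a small made-up dose table, not the one analyzed above), the J-T statistic can be accumulated by brute force:

```r
# Jonckheere-Terpstra statistic as a sum of pairwise Mann-Whitney counts
# (hypothetical table: rows = increasing dose, ordered columns = severity)
tab <- matrix(c(30, 3, 1,
                28, 4, 2,
                26, 5, 3), nrow = 3, byrow = TRUE)
# Expand each dose group into scored observations (1 < 2 < 3)
groups <- lapply(1:nrow(tab), function(i) rep(1:ncol(tab), tab[i, ]))
JT <- 0
for (i in 1:(nrow(tab) - 1)) for (j in (i + 1):nrow(tab)) {
  # MW count U_{ij} for dose pair (i, j), scoring ties as 0.5
  for (x in groups[[i]])
    JT <- JT + sum(x < groups[[j]]) + 0.5 * sum(x == groups[[j]])
}
JT
```

A reference distribution (normal approximation with a tie correction) would then give the p-value; packaged implementations exist but the statistic itself is just this double loop.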
What if both row and column attributes are “responses”? Do high responses in one classification tend to be associated with high responses in the other (positive association) or low responses (negative association)? Or maybe there is no association between the two responses?
We can use the J-T test here as well, of course. The problem with the J-T test is that it isn’t calibrated as we’d like for a measure of association. There has been a lot of work in this general class of problems: calibration for measures of association in general $r \times c$ tables. Many measures have been proposed and studied.
One measure is the Goodman-Kruskal gamma statistic $\gamma$. It counts concordances and discordances between the row and column classifications, like Kendall’s $\tau$, but with no allowance for ties. When rows and columns are ordered, for each count in cell $(i, j)$, there is concordance between that cell count and any cell count below and to the right.
Cell $(i, j)$ has count $n_{ij}$. Let $A_{ij}$ denote the sum of all counts below and to the right of cell $(i, j)$, not including those in the same row or column. The total number of concordances is given by

$$C = \sum_{i,j} n_{ij} A_{ij}.$$

Similarly, let $B_{ij}$ denote the sum of all counts in cells below and to the left of cell $(i, j)$, for which the ordering on the two attributes disagrees. The total number of discordances is given by

$$D = \sum_{i,j} n_{ij} B_{ij}.$$

We define our test statistic $\hat{\gamma}$:

$$\hat{\gamma} = \frac{C - D}{C + D},$$

which is calibrated to be between $-1$ and $+1$ so that we can interpret it easily.
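A short sketch (hypothetical counts) that computes $C$, $D$, and $\hat{\gamma}$ by looping over the cells:

```r
# Goodman-Kruskal gamma for a made-up ordered 3 x 3 table
tab <- matrix(c(20, 10,  5,
                10, 15, 10,
                 5, 10, 20), nrow = 3, byrow = TRUE)
nr <- nrow(tab); nc <- ncol(tab)
C <- 0; D <- 0
for (i in 1:nr) for (j in 1:nc) {
  if (i < nr && j < nc)  # counts strictly below and to the right: concordant
    C <- C + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
  if (i < nr && j > 1)   # counts strictly below and to the left: discordant
    D <- D + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
}
gamma.hat <- (C - D) / (C + D)
gamma.hat  # about 0.553: moderate positive association (diagonal-heavy table)
```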
Here we’ll only talk about the $\chi^2$ goodness-of-fit test, where we have an asymptotic $\chi^2$ distribution as our reference. It’s widely used as a goodness-of-fit test of data to any discrete distribution.
Instead of a model of independence or lack of association as before, we can consider goodness of fit to a hypothesized discrete distribution, which may be binomial, Poisson, uniform or some other discrete distribution. We compute expected cell counts under the hypothesized model, and compare to the observed counts in the data.
Sometimes, the parameter(s) of those distributions will not be known. In this case, we need to estimate the unknown parameters from the data.
A computer program is supposed to generate random digits from 0 to 9. If it is doing so, we’ll get digits that look like i.i.d. observations on the values 0 to 9, each with probability 0.1.
We want to test $H_0$: the numbers appear to be random digits, vs. $H_1$: some digits are more likely than others.
Generate some number of digits from the program, say 300, and compare to the discrete uniform on 0 to 9 under $H_0$. Under $H_0$, the expected count of each digit is $300 \times 0.1 = 30$. The observed results in our sample:
Compare the resulting statistic to $\chi^2_9$ ($k = 10$ categories gives $k - 1 = 9$ degrees of freedom for the reference distribution).
In R, use the `chisq.test()` function. In this case:

```r
x <- c(22, 28, 41, 35, 19, 25, 25, 40, 30, 35)
chisq.test(x)
```
The default is to test against the discrete uniform if no probabilities are specified. This gives a p-value of approximately 0.049. For a different set of specified probabilities, e.g. as given by a binomial or Poisson, we need to pass another vector of the same length as the data vector specifying these probabilities. Alternatively, we can supply the vector of expected counts under $H_0$ and specify `rescale.p = TRUE`.
If we have to estimate parameters, we lose a degree of freedom for each one. In this situation, R will compute the test statistic but won’t use the correct degrees of freedom. Use `pchisq()` with the correct degrees of freedom in this case.
Suppose we want to test the goodness of fit to a binomial, but the probability of success is unknown and we have to estimate it.
We have data on the first 18 major league baseball players to have 45 times at bat in 1970. The numbers of hits they got in their 45 times at bat are given as follows:
We will test the null hypothesis that these data follow a $\text{Bin}(45, p)$ distribution. But first we need to estimate $p$, the probability of a hit for each time at bat. We may use the overall relative frequency of hits across all players:

$$\hat{p} = \frac{\text{total hits}}{18 \times 45} \approx 0.2654.$$
With $n = 45$ and $\hat{p} = 0.2654$, we then compute the probability of $k$ hits from $\text{Bin}(45, 0.2654)$ for $k = 0, 1, \ldots, 45$ and get expected counts. Theoretically any count from 0 to 45 is possible, but many have tiny probabilities, which will result in near-zero expected counts.
```r
head(18 * dbinom(0:45, size = 45, prob = 0.2654))
# 1.688745e-05 2.745533e-04 2.182224e-03
# 1.130047e-02 4.286825e-02 1.269988e-01
```
Here we multiply by 18 because there are 18 players. We get many small probabilities, which means we’d get many near-zero expected counts, and this is a problem for $\chi^2$ tests! In practice, we should combine some values to get more reasonable expected cell counts, e.g. combine cells with expected counts less than 0.5: here the first 8 values (0 to 7 hits), and all values 18 and above, are combined.
```r
# 7 hits or below
sum(18 * dbinom(0:45, size = 45, prob = 0.2654)[1:8])
# 1.105234
18 * dbinom(0:45, size = 45, prob = 0.2654)[9:18]
# 1.0566192 1.5693786 2.0411749 2.3464190 2.4018906
# 2.2027936 1.8190547 1.3582077 0.9200627 0.5670437
# 18 hits or above
sum(18 * dbinom(0:45, size = 45, prob = 0.2654)[19:46])
# 0.6121209
```
Now we can build the chi-square goodness-of-fit test:
```r
obs <- c(1, 1, 1, 5, 2, 1, 1, 2, 1, 1, 1, 1)
expected.values <- c(
  round(sum(18 * dbinom(0:45, size = 45, prob = 0.2654)[1:8]), 2),
  round(18 * dbinom(0:45, size = 45, prob = 0.2654)[9:18], 2),
  round(sum(18 * dbinom(0:45, size = 45, prob = 0.2654)[19:46]), 2)
)
chisq.test(obs, p = expected.values, rescale.p = TRUE)
# Chi-squared test for given probabilities
#
# data:  obs
# X-squared = 6.737, df = 11, p-value = 0.82
#
# Warning message:
# In chisq.test(obs, p = expected.values, rescale.p = TRUE) :
#   Chi-squared approximation may be incorrect
```
The value of the test statistic is correct, but we should have 10 degrees of freedom instead of 11, because estimating $p$ from the data costs us 1 degree of freedom!
```r
pchisq(6.737, df = 10, lower.tail = FALSE)
# 0.7500185
```
Here we specify `lower.tail = FALSE` because we want the probability to the right of our observed test statistic value; by default the probability to the left is calculated.
This is the last part of the “classical” nonparametric statistics. Next, we’ll be focusing on topics in modern nonparametric statistics, which is also the finale of our nonparametric methods discussions.