Bootstrap

Bootstrap

Tags
Statistics
R
Distribution
Regression
Date
May 13, 2019
Description
The procedure and applications of the nonparametric bootstrap. We also discuss the Jackknife estimator.
Slug
nonparametric-methods-bootstrap
Many of the modern nonparametric statistics are highly computational - mostly been developed in the last 30 years or so.
Bootstrapping is a widely used computational method, mainly because often the theory is intractable - you can’t easily work out the properties of the estimator, or the theory is difficult to apply/work with - checking the assumptions itself can be difficult. It allows us to work with
  • (almost) arbitrary distributions at the population level, and
  • more general characteristics of the population, i.e. not just the mean.

Introduction

There are two main versions of the bootstrap: the parametric bootstrap and the non-parametric bootstrap.
A nice thing about the nonparametric bootstrap is that it’s very close to being assumption free. One key assumption is that the sample (empirical) CDF is a good approximation to the true population CDF for samples that are not too small. That is, , the ECDF, should reflect the main characteristics of , the theoretical population CDF. We don’t care about - we want to use as a proxy for .
The idea is based on resampling of data. In the real world, we have the population and some parameter . Repeated sampling is usually not feasible. In the bootstrap world, we have a sample and an estimated parameter . Repeated sampling is totally feasible with enough computing power to do it.
Note that permutation testing is also a type of resampling. So what’s the main difference? With permutation, we’re resampling without replacement and leads to exact tests. With bootstrap, we resample with replacement from our original sample, and that leads to approximate results. Bootstrapping leads to a more general approach.

Procedure

Suppose we have a sample of size : from some unknown population. A bootstrap sample would be a random sample of size sampled from with replacement. Some will appear multiple times in a given bootstrap sample, and others not at all.
In principle, we could enumerate all possible bootstrap resamples; in practice, this number grows very quickly with , e.g. when there are more than 90,000 possible different bootstrap samples.
In real applications, we’ll always take a sample of the possible bootstrap resamples - call this . Often, will suffice, or but we rarely need more than this.
Standard notation: a typical bootstrap resample will be denoted , where each is equal to one of the original . If bootstrap resamples are generated, the is denoted by .
For each bootstrap sample, we compute the statistic of interest (getting values, one for each sample). We build up from these an approximate, empirical distribution. We use this to learn about the population quantity of interest, e.g. estimate the population parameter by the average of the bootstrap quantities.
Suppose is the population parameter of interest. Let be an estimate of from the bootstrap sample . The average of the values is an estimate for .
The book call it , which is non-typical; it’s usually denoted or .
As B increases, tends to the mean computed for the true bootstrap sampling distribution, which is based on all possible configurations. Even for or 1000, the estimator will be pretty good in general.
Similarly, the estimator of the true bootstrap standard error is
where is the sample s.d. of bootstrap sample values. The easiest way to do this in R is to use the sample() function with replace = T, e.g. sample(x, size = n, replace = T) where x is the data with n observations.

Example in R

Suppose we have a small dataset with observations from some unknown distribution. We consider bootstrap for the mean and the median.
Data generating code:
set.seed(42) x <- rexp(20, 4)
In the sample, we see and a median of 0.119, which indicates skewness. A boxplot would also show this. We should be suspicious of a normal assumption, but the sample is small so it’s hard to know. Maybe we don’t want to use the mean (and its standard error) as our basis for inference, especially in light of the skewness. The median could be a better alternative, but we know little about the sampling distribution of the median (standard error, etc.). Some of these problems can be addressed via the bootstrap.
B <- 500 boot.mean <- vector("numeric", length = B) boot.median <- vector("numeric", length = B) for (i in seq(B)) { boot.mean[i] <- mean(sample(x, size = length(x), replace = T)) boot.median[i] <- median(sample(x, size = length(x), replace = T)) }
The mean and median of the boot.mean (0.1747501 and 0.1731665, respectively) are both pretty similar to the data mean, 0.1741497. The bootstrap standard deviation sd(boot.mean) 0.03444655 matches pretty well with the standard error for the mean sd(x) / sqrt(length(x)) 0.03395322. The bootstrap mean behaves very much like what we would “traditionally” do for a data set like this, i.e. compute the sample mean and the standard error.
What we couldn’t do so easily is to look at the median of the sample. We can use the sample median as an estimator for the population median, but we don’t have an easy estimator for the variability of the estimator! With bootstrap, it’s quite easy to explore.
The mean(boot.median) 0.1290285 and median(boot.median) 0.118546 are both pretty close to the sample median 0.118546. However, we can also easily get, for example, sd(boot.median) to be 0.03989125, or the distribution of the minimum, etc.
The data were generated from an exponential with rate . The true mean() is , and the true median is , so both are within the bootstrap mean and median results. We’ve underestimated both slightly, because the sample values of the mean and median were on the small side, and that introduces bias in bootstrap estimates.
The true median can be derived from solving where is the PDF for the exponential distribution.

Regression analysis with bootstrap

Suppose we have multivariate data where is the explanatory variable and is the response variable. In general, there are different types of bootstrap schemes that can be developed. These schemes depend on the types of assumptions you’re willing to make about how the data might have been generated.
We’ll consider simple linear regression here, but the same general ideas hold for multiple regression and generalized linear models, etc. Our model is
where are uncorrelated, mean 0 and constant variance . In general, the may be (i) fixed by design, (ii) randomly sampled, or (iii) simply observed. We’re going to proceed as if they were fixed by design (it doesn’t make that much difference).
The two common approaches for the bootstrap are model-based resampling (or resampling residuals) and case-based resampling. Note that there are different versions of these as well. Both methods are implemented in the R package car::Boot().

Model-based resampling

  1. Set the as the original values.
  1. Fit the OLS (ordinary least squares) to the linear model and find residuals (some version of it).
  1. Resample the residuals, call them .
  1. Define where and are OLS estimators from original pairs; are the original data; are the resampled residuals, and are the new responses.
  1. Fit OLS regression to to get the intercept estimate , slope estimate , and variance estimate where .
This scheme doesn’t trust the model that much, and has the advantage that it retains the information in the explanatory variables. There’s arguments about which residuals to use (raw or studentized), but it doesn’t make much of a difference in practice.

Case-based resampling

  1. Resample the data pairs directly, keeping the pairs together, with replacement. In R (or any other language), this can be done by sampling the indices.
  1. Fit the OLS regression model and get , where .
This method assumes nothing about the shape of the regression function or the distribution of the residuals. All it assumes is that the observations are independent, which is an assumption we make all the time anyways.

Confidence intervals

In either case, we get an empirical bootstrap distribution of values. We could use these to get confidence intervals for the parameters of interest. This is a widely-useful application of the bootstrap in general - CIs when the theoretical derivation is difficult.
We’ll have “versions” of the statistic of interest: . This results in an empirical distribution. There are two simple, rather intuitive approaches to get CIs for . One caveat is that it may not work well in practice - you may need to use “corrections” or improvements.
Recall that for a sample of size from a normal distribution with variance , the 95% CI for the mean, , is given by . If is unknown, which is typical, we replace with , the sample standard deviation, and 1.96 by the appropriate quantile for (say 2.5% on both tails). We also use this when data are not necessarily normal, according to the central limit theorem.
The first procedure is to mimic the “normal-type” confidence intervals. We can do something similar using bootstrap-based estimates instead:
where is the bootstrap estimate of (mean of bootstrap values), and is the bootstrap standard error estimate (standard deviation of bootstrap values). This forces the CI to be symmetric about .
The second approach is to use the empirical bootstrap distribution. Suppose we have a histogram of . We would take the 2.5% of bootstrap values in the lower and upper tails. This does not enforce symmetry around . Here’s an example in R for the second approach:
set.seed(42) x <- c(1, 2, 3, 5, 8) y <- c(3, 7, 9, 7, 12) res <- vector("numeric", 1000) for (i in seq(1000)) { xb <- sample(x, size = length(x), replace = T) yb <- sample(y, size = length(y), replace = T) res[i] <- mean(xb) / mean(yb) } quantile(res, c(0.025, 0.975)) # 2.5% 97.5% # 0.2444444 0.9476842

Possible bias in bootstrap

In the parametric bootstrap, we assume that the data come from some known distribution but with unknown parameters. The idea of the parametric bootstrap is to use the data to estimate the parameters, and then draw resamples from the estimated distribution. For example, we draw samples from or . This method is not used much in practice because of the strong distribution assumption.
We can see that going from the parametric bootstrap to model-based resampling to case-based resampling, we’re making fewer and weaker assumptions. In other words, we’re getting more and more “secure”.
So why don’t we always use the safest bootstrap? In resampling cases, there’s the usual bias-variance tradeoff. Generally speaking, if we compare the confidence intervals on the same data for the same parameters using the three methods, the parametric bootstrap would give the narrowest intervals, and the case-based resampling would yield the loosest bounds. If we’re very certain (which almost never happens) that the overall model is correct, then the parametric bootstrap would work best.

Jackknife

There are various procedures for dealing with the possible bias of a statistic, and Jackknife is one of them. We can use this to estimate bias and correct for it.
Suppose the sample is and we’re interested in some population characteristic . We estimate it based on the entire sample by . We get additional estimates of by replacing the original sample by a set of subsamples, leaving out one observation in turn each time. The jackknife sample of size is given by:
On each subsample, we estimate by (taking out the observation) and get . The mean of jackknife estimates is
and the Jackknife estimator is given as
where is based on the whole sample, and is based on Jackknife estimates.
The Jackknife estimator of bias is simply defined as:
and the bias-corrected jackknife estimate of θ is given by

Remarks

  1. There are also purely bootstrap-based ways of dealing with bias, e.g. Double bootstrap (bootstrapping the bootstrap samples), or jackknife after bootstrap, etc.
  1. The bootstrap is based on the “plug-in” principle, and the jackknife is based on the “leave one out” principle.
  1. The Jackknife was developed before bootstrapping, and the latter is often considered as a superior technique. A comparison of the two methods is given here.