Statistics: Bootstrapping
Bootstrapping is the use of simulation to approximate the value of the plug-in estimator $T(\widehat{\nu})$ of a statistical functional $T$ which is expressed in terms of independent observations from the input distribution $\nu$. The key point is that drawing $k$ observations from the empirical distribution $\widehat{\nu}$ is the same as drawing $k$ times with replacement from the list of observations.
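To make the key point concrete, the empirical distribution can be written out explicitly. The notation $X_1, \ldots, X_n$ for the observations and $\delta_x$ for a point mass at $x$ is assumed here for illustration.

```latex
\widehat{\nu} \;=\; \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i},
\qquad
\widehat{\nu}(A) \;=\; \frac{1}{n}\,\#\{\, i : X_i \in A \,\}.
```

A single draw from $\widehat{\nu}$ is therefore a uniformly random choice among the observed values, and independent draws correspond to sampling with replacement.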
Example
Consider the statistical functional $T(\nu) =$ the expected difference between the greatest and least of 10 independent observations from $\nu$. Suppose that 50 observations $X_1, \ldots, X_{50}$ from $\nu$ are observed, and that $\widehat{\nu}$ is the associated empirical CDF. Explain how $T(\widehat{\nu})$ may be estimated with arbitrarily small error.
Solution. The value of $T(\widehat{\nu})$ is defined to be the expectation of a distribution that we know how to sample from. So we sample 10 times with replacement from $X_1, \ldots, X_{50}$, identify the largest and smallest of the 10 observations, and record the difference. We repeat this $B$ times for some large integer $B$, and we return the sample mean of these $B$ values.
By the law of large numbers, the result can be made arbitrarily close to $T(\widehat{\nu})$ with arbitrarily high probability by choosing $B$ sufficiently large.
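The procedure in this solution might be sketched in Julia as follows. The function name `bootstrap_range`, the fixed seed, and the uniform placeholder data are assumptions made for illustration; only the standard library is used, with `rand(rng, X, k)` drawing `k` values with replacement from `X`.

```julia
using Random, Statistics

# Approximate T(ν̂): the expected range (max minus min) of k draws
# from the empirical distribution of the observations in X.
function bootstrap_range(X::AbstractVector{<:Real}, k::Integer, B::Integer;
                         rng::AbstractRNG = Random.default_rng())
    total = 0.0
    for _ in 1:B
        draw = rand(rng, X, k)                  # k draws with replacement from X
        total += maximum(draw) - minimum(draw)  # range of this resample
    end
    return total / B                            # sample mean of the B ranges
end

rng = MersenneTwister(0)    # fixed seed, only so the sketch is reproducible
X = rand(rng, 50)           # placeholder for the 50 observed values
estimate = bootstrap_range(X, 10, 10^5; rng = rng)
println(estimate)
```

Increasing `B` reduces only the simulation error in approximating $T(\widehat{\nu})$; the 50 observations themselves stay fixed.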
Although this example might seem a bit contrived, bootstrapping is useful in practice because of a common source of statistical functionals that fit the bootstrap form: standard errors.
Example
Suppose that we estimate the median $\theta$ of a distribution $\nu$ using the plug-in estimator $\widehat{\theta}$ for 75 observations, and we want to produce a confidence interval for $\theta$. Show how to use bootstrapping to estimate the standard error of the estimator.
Solution. By definition, the standard error of the plug-in estimator $\widehat{\theta}$ is the square root of the variance of the median of 75 independent draws from $\nu$. Therefore, the plug-in estimator of the standard error is the square root of the variance of the median of 75 independent draws from $\widehat{\nu}$. This can be readily simulated. If the observations are stored in a vector X, then the Julia code

    using Random, Statistics, StatsBase
    X = rand(75)  # placeholder data standing in for the 75 observations
    std(median(sample(X, 75)) for _ in 1:10^5)  # sample() draws with replacement by default

or the equivalent R expression

    sd(sapply(1:10^5, function(n) {median(sample(X, 75, replace=TRUE))}))

returns a very accurate approximation of the plug-in standard error $T(\widehat{\nu})$.
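Since the example is motivated by a confidence interval, here is a hedged sketch of how the bootstrap standard error might be turned into one. It assumes the sample median is approximately normally distributed, so that plus or minus 1.96 standard errors gives a rough 95% interval; that assumption, the helper name `bootstrap_se_median`, and the placeholder data are not part of the example above.

```julia
using Random, Statistics

# Bootstrap standard error of the sample median of X, from B resamples.
function bootstrap_se_median(X::AbstractVector{<:Real}, B::Integer;
                             rng::AbstractRNG = Random.default_rng())
    n = length(X)
    medians = [median(rand(rng, X, n)) for _ in 1:B]  # resample with replacement
    return std(medians)
end

rng = MersenneTwister(1)      # fixed seed, only for reproducibility
X = rand(rng, 75)             # placeholder for the 75 observations
theta_hat = median(X)         # plug-in estimate of the median
se = bootstrap_se_median(X, 10^4; rng = rng)

# Assumption: theta_hat is roughly normal, so ±1.96 se is a rough 95% interval.
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
println(ci)
```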
Perhaps the most important caution regarding bootstrapping is that the bootstrap only approximates $T(\widehat{\nu})$. It approximates $T(\nu)$ (where $\nu$ is the underlying true distribution from which the observations are sampled) only insofar as we have enough observations for $\widehat{\nu}$ to approximate $\nu$ well.
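One way to make this caution precise is to split the total error into two parts. Here $\widehat{T}_B$ denotes the simulation average over $B$ bootstrap replicates and $n$ the number of observations; both symbols are notation introduced for this sketch.

```latex
\widehat{T}_B - T(\nu)
  \;=\;
  \underbrace{\widehat{T}_B - T(\widehat{\nu})}_{\text{simulation error: shrinks as } B \to \infty}
  \;+\;
  \underbrace{T(\widehat{\nu}) - T(\nu)}_{\text{plug-in error: fixed by the } n \text{ observations}}
```

Only the first term is under the control of the person running the simulation. The second term can be reduced only by collecting more observations.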
Exercise
Suppose that $\nu$ is the uniform distribution on $[0,1]$. Generate 75 observations from $\nu$, store them in a vector X, and compute the bootstrap estimate of $T(\widehat{\nu})$, where $T(\nu)$ is the standard deviation of the median of 75 independent observations from $\nu$. Use Monte Carlo simulation to directly estimate $T(\nu)$. Can the gap between your approximations of $T(\widehat{\nu})$ and $T(\nu)$ be made arbitrarily small by using more bootstrap samples?
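Solution. The gap cannot be made arbitrarily small. Using more bootstrap samples only improves the approximation of $T(\widehat{\nu})$; the difference between $T(\widehat{\nu})$ and $T(\nu)$ is fixed by the 75 observations. We would need more than 75 observations from $\nu$ to get closer to the exact value of $T(\nu)$.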
    using Statistics, StatsBase
    X = rand(75)
    std(median(sample(X, 75)) for _ in 1:10^6)  # estimate T(ν̂)
    std(median(rand(75)) for _ in 1:10^6)       # estimate T(ν)
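To see the effect of the number of bootstrap samples directly, one might repeat the computation for several values of `B` on the same fixed observations. The sketch below is an illustration under assumptions: the helper `sd_of_median`, the seed, and the chosen values of `B` are hypothetical, and only the standard library is used.

```julia
using Random, Statistics

# Standard deviation of the median, estimated from B simulated samples.
# `draw` is a zero-argument function that returns one sample.
sd_of_median(draw, B) = std(median(draw()) for _ in 1:B)

rng = MersenneTwister(2)    # fixed seed, only for reproducibility
n = 75
X = rand(rng, n)            # one fixed set of observations from Uniform[0, 1]

# Fresh samples from the true distribution: approximates T(ν).
direct = sd_of_median(() -> rand(rng, n), 10^5)

for B in (10^3, 10^4, 10^5)
    # Resamples of X with replacement: approximates T(ν̂).
    boot = sd_of_median(() -> rand(rng, X, n), B)
    println("B = ", B, "  bootstrap = ", boot, "  direct = ", direct)
end
```

As `B` grows, the bootstrap values should settle near $T(\widehat{\nu})$ for this particular `X`, which in general differs from the direct estimate of $T(\nu)$.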