I misunderstood Terrence Murphy's question about using bootstrap()
in connection with multiple imputations. It is possible and
appropriate to use bootstrap() in his context, though the implementation
requires some creativity, and some care in how the results are used.
Murphy wrote:
>>I occasionally use S plus with the menu and wonder if anyone's willing to
>>share a few lines of code to solve the following task, which is essentially
>>data management.
>>
>>I have twenty imputed datasets because my original data had 20% missing
>>data which looks to be MAR using the ISNI index of Troxell et al. I want to
>>bootstrap the coefficients of a model that will be fit to each of the
>>twenty imputed datasets. The logical approach seems to be to bootstrap each
>>of the 20 datasets and then combine all the output and take the grand mean
>>and grand percentiles to construct the final bootstrap estimate and
>>confidence intervals.
>>
>>The bootstrap command from the point and click menu works like this:
>>
>>coef(lm(response~predictor, data.frame))
>>
>>What I want to to do is run the bootstrap function for twenty different
>>data sets that are combined in one data file, and indexed with a variable
>>called _imputation_ = 1,2, ..., 20.
>>
>>Any ideas in how I might code this in S plus?
The original data has n=304, and missing values.
There are 20 multiple imputations; I'll call these X1, X2, ..., X20.
These are combined into on dataset with 20*304 rows,
with a variable "imputation" taking values 1 to 20.
My interpretation was that the goal was to calculate
theta1 = statistic(X1)
theta2 = statistic(X2)
...
theta20 = statistic(X20)
and take the standard deviation of these 20 values as a standard error.
That would be too small; in the trivial case with no missing data,
where the imputations are identical, it would yield a standard error of 0.
My posting of 28 June, suggesting miApply() or miEval() from
library(missing), is based on this interpretation.
What Murphy really intended is:
theta1.1, theta1.2, ... theta1.B from bootstrap samples of X1
theta2.1, theta2.2, ... theta2.B from bootstrap samples of X2
...
theta20.1, theta20.2, ... theta20.B from bootstrap samples of X20
where B is the number of bootstrap samples, e.g. B=1000.
Then the variance of these 20*B numbers incorporates both sources
of variation:
- between-imputation variation
- variation due to original random sampling, measured by usual standard
errors obtained by formulas or bootstrapping.
--------------------------------------------------
--- Implementation
One way to implement this using the bootstrap() function in the
resample library is to draw 20 samples at a time (set block.size=20),
and draw one sample from each imputation using a custom sampler.
I'll assume the original data has the 304 observations from imputation 1
first, then the 304 observations from imputation 2, etc.
my.samp.bootstrap <- function(n, B, ...){
# This assumes that B = 20 (use the block.size argument)
# Output: first column has numbers from 1:304,
# second from 1:304 + 304, etc.
n <- n / B # original n is 20*304; now n=304
samp.bootstrap(n, B, size=n) + rep(0:19 * n, each = n)
}
value1 <- bootstrap(data, statistic, block.size = 20, B = 20 * 1000,
sampler = my.samp.bootstrap)
Another way is to draw samples for all 20 imputations at once, stratified,
and to use a statistic that splits the data into 20 groups, and
does the calculations for each group:
value2 <- bootstrap(data, statisticThatHandles20Groups, group = imputation)
If there are p coefficients being bootstrapped, then
value1$replicates
has dimension [20*1000, p], and
value2$replicates
has dimension [1000, 20*p]
With the first method the "observed value" (value1$observed)
is computed from all 304 * 20 observations, and is a vector of length p.
With the second there would be observed values from each of the 20
imputations, length 20*p.
--------------------------------------------------
--- How results are used
>>take the grand mean
>>and grand percentiles to construct the final bootstrap estimate and
>>confidence intervals.
We rarely use the average of bootstrap values as a "bootstrap estimate";
this is a poor estimator because it has DOUBLE the usual bias.
Instead we use the statistic calculated for the original data
as an estimate, and limit our use of the bootstrap to estimate
standard error or confidence intervals.
With the first implementation, limits.percentile()
should give reasonable quick-and-dirty confidence limits.
Computing BCa limits here would not be straightforward. limits.bca()
would not give reasonable results. How to compute a BCa interval here
would be an interesting research question -- getting a good value
for the acceleration parameter may be difficult.
If bias is a concern, one might compute a "BC" interval (bias-corrected
percentile interval, a special case of BCa with acceleration set to zero.
In simpler bootstrap contexts, calling
limits.bca(..., acceleration = 0)
does this. In this context there are complications:
* Here we want p intervals, not 20*p intervals. To get this
without reshaping the bootstrap output requires using the first
method above, that created "value1".
* The calculations require an observed value. value1$observed
may not be reasonable; it is probably better to use the average
of the observed values from the 20 imputations. Roughly:
value1b <- value1
value1b$observed <- (average across 20 imputations)
limits.bca(value1b, acceleration = 0)
The observed values from the 20 imputations could be calculated
using miApply() or miEval(), and average using miMeanSE(), from
library(missing).
Tim Hesterberg
========================================================
| Tim Hesterberg Research Scientist |
| timh@insightful.com Insightful Corp. |
| (206)802-2319 1700 Westlake Ave. N, Suite 500 |
| (206)283-8691 (fax) Seattle, WA 98109-3044, U.S.A. |
| www.insightful.com/Hesterberg |
========================================================
Download the S+Resample library from www.insightful.com/downloads/libraries
|