Article

A Bootstrap Method for a Multiple-Imputation Variance Estimator in Survey Sampling

1
Department of Biostatistics, Epidemiology and Environmental Health Sciences, JPH College of Public Health, Georgia Southern University, Statesboro, GA 30460, USA
2
Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303, USA
*
Author to whom correspondence should be addressed.
Stats 2022, 5(4), 1231-1241; https://doi.org/10.3390/stats5040074
Submission received: 28 October 2022 / Revised: 26 November 2022 / Accepted: 26 November 2022 / Published: 29 November 2022
(This article belongs to the Section Statistical Methods)

Abstract

Rubin’s variance estimator of the multiple imputation estimator for a domain mean is not asymptotically unbiased. Kim et al. derived the closed-form bias of Rubin’s variance estimator. In addition, they proposed an asymptotically unbiased variance estimator for the multiple imputation estimator when the imputed values can be written as a linear function of the observed values. However, this estimator requires the assumption that the covariance of the imputed values within the same imputed dataset is twice that between different imputed datasets. In this study, we proposed a bootstrap variance estimator that does not need this assumption. Both theoretical arguments and simulation studies show that it is asymptotically unbiased and valid. The new method was applied to the Hox pupil popularity data for illustration.

1. Introduction

Surveys are popular tools for empirical research in many fields, such as the social sciences, marketing, and public health. One of the goals of survey data analysis is the estimation of parameters for subpopulations, which are called domains. In survey data analysis, the estimator of the domain mean $\theta$ usually involves weights [1,2,3,4,5] and is written as
$$\hat{\theta} = \sum_{i \in A} \alpha_i y_i, \qquad (1)$$
where $A$ represents the set of indices of the sample elements, $y_i$ are the observed sample values in $A$, and $\alpha_i$ is a known weight that is free of $y_i$. The $\alpha_i$ can be used to adjust for non-response or for calibration purposes.
Imputation is commonly used in survey data analysis to handle the non-response items in survey samples. It replaces the non-response items in the survey sample with substituted values. Single imputation leads to underestimating the variances of the estimators [6]. To overcome this, a multiple imputation method [6,7] was proposed to account for the variances in the analysis when substituted values are used instead of the true observations. The basic idea of multiple imputations is to impute each non-response value M times to create M imputed datasets. Then, each imputed dataset is analyzed using standard data analysis techniques. Finally, Rubin’s rule is used to obtain Rubin’s estimator and Rubin’s variance estimator by combining the results from each imputed dataset.
Research on multiple imputation is motivated by a Bayesian framework. Rubin [6] claimed that multiple imputation can provide valid frequentist inference in many applications, assuming “proper” imputation [6] or the congeniality condition [8] for both the imputation model and the analysis model. However, Fay [9,10] and Binder and Sun [11] found that it is difficult to satisfy the conditions for “proper” imputation under general complex sampling schemes. In a parametric model setting, Wang and Robins [12] and Nielsen [13] showed that Rubin’s variance estimator is unbiased when the complete-sample estimator is the maximum likelihood estimator. Then, Robins and Wang [14] showed the bias of Rubin’s variance estimator for nonparametric inference procedures and proposed a new variance estimator. Yang and Kim [15] showed the bias of Rubin’s variance estimator for the method of moments estimator. They also proposed a new variance estimator for the case where the complete-sample estimator is the method of moments estimator or the maximum likelihood estimator. However, these proposed variance estimators cannot be directly used when the complete-sample estimator is given as (1) for the domain mean. For estimator (1), Kim et al. [16] derived the closed-form bias of Rubin’s variance estimator and proposed a new variance estimator for the case where the imputed values can be expressed as a linear function of the response items. However, their estimator requires the assumption that the covariance of the imputed values within the same imputed dataset is twice that between different imputed datasets. Kim et al. [16] did not discuss how to check this assumption. Our simulation results for the domain mean estimation, presented in Section 3.1, showed that Kim’s variance estimator is biased, which indicated that this assumption may not be satisfied.
In this study, we proposed a bootstrap variance estimator without this assumption. Traditional bootstrapping [17] is usually used to estimate the properties of an estimator by sampling from an approximating distribution. For independent and identically distributed sample data with n observations, it obtains bootstrap samples, each with n observations, by randomly drawing observations with replacement from the original sample. It was used for variance estimation in the presence of single imputation for missing data analysis [18,19,20,21,22,23,24,25]. Recently, researchers [26,27,28,29,30,31] applied the traditional bootstrap for multiple imputations. They applied multiple imputations for each bootstrap sample to obtain the parameter estimator. Then, the variance of the estimator is estimated by the sample variance of the estimators from the bootstrap samples. Therefore, they need to refit the model M × B times, where M is the number of imputed samples and B is the number of bootstrap samples. Obviously, the traditional bootstrap methods are computationally intensive. In this study, we proposed a computationally efficient approach for a bootstrap procedure. We used the bootstrap samples with single imputation and multiple imputation samples to estimate the variance. Therefore, it only needed to refit the model M + B times to estimate the variance of the multiple imputation estimator with the complete-sample estimator in the form of (1).
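As a rough worked illustration of the refit counts stated above, take $M = 30$ and $B = 500$, which are values that also appear in the simulation settings of this article:

```latex
\begin{align*}
\text{traditional bootstrap:} \quad & M \times B = 30 \times 500 = 15{,}000 \ \text{refits},\\
\text{proposed approach:}     \quad & M + B = 30 + 500 = 530 \ \text{refits}.
\end{align*}
```

These counts cover model refits only; the actual run time also depends on the cost of each fit and of the resampling itself.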

2. Methods

2.1. Rubin’s Method

We consider data generated from a superpopulation model, i.e., a random sample from an infinite population. The sampling mechanism and the response mechanism are assumed to be ignorable under the superpopulation model in the sense of Rubin [6]. Let $y_1, \ldots, y_n$ be the complete sample for the random variable $Y$. Without loss of generality, assume that the first $r$ observations are observed and the remaining $n - r$ observations are missing, i.e., $Y_{obs} = \{y_1, \ldots, y_r\}$ and $Y_{mis} = \{y_{r+1}, \ldots, y_n\}$. Denote $R_i$ as the indicator of whether the $i$th observation is observed: if $R_i = 1$, $y_i$ is observed, and if $R_i = 0$, $y_i$ is missing. The multiple imputation method uses the following three steps to estimate the domain mean $\theta$ for the complete-sample estimator in the form of (1).
Step 1. Create $M$ imputed datasets, indexed by $m = 1, \ldots, M$, where the observations $y_i^{(m)}$, $i = 1, \ldots, n$, are imputed only for the missing values; otherwise, they remain as observed.
Step 2. Apply the complete-sample estimation procedure to each imputed dataset. Let $\hat{\theta}_n^{(m)}$ be the imputed-sample estimator of $\theta$ for the $m$th imputed dataset and $\hat{V}_n^{(m)}$ be the imputed-sample variance estimator of $\hat{\theta}_n^{(m)}$.
Step 3. Combine the results from the imputed datasets using Rubin’s rule [6]. Rubin’s estimator of $\theta$ is $\hat{\theta}_{MI} = M^{-1} \sum_{m=1}^{M} \hat{\theta}_n^{(m)}$, and Rubin’s variance estimator is
$$\hat{V}_{MI}(\hat{\theta}_{MI}) = W_M + (1 + M^{-1}) B_M, \qquad (2)$$
where $W_M = M^{-1} \sum_{m=1}^{M} \hat{V}_n^{(m)}$ is the within-imputation variance and $B_M = (M - 1)^{-1} \sum_{m=1}^{M} (\hat{\theta}_n^{(m)} - \hat{\theta}_{MI})^2$ is the between-imputation variance.
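Step 3 can be sketched in a few lines of Python. This is a minimal illustration of Rubin’s rule as stated above, not the authors’ code; the function name `rubin_combine` and its argument names are hypothetical.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine M imputed-sample results with Rubin's rule.

    estimates: imputed-sample point estimates, one per imputed dataset
    variances: imputed-sample variance estimates, one per imputed dataset
    Returns (theta_MI, W_M, B_M, V_MI).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    theta_mi = mean(estimates)       # Rubin's point estimator
    w_m = mean(variances)            # within-imputation variance W_M
    b_m = variance(estimates)        # between-imputation variance B_M (divisor M - 1)
    v_mi = w_m + (1 + 1 / m) * b_m   # Rubin's variance estimator, Equation (2)
    return theta_mi, w_m, b_m, v_mi
```

For example, `rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` combines three imputed-sample results whose between-imputation variance is 1.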
Rubin’s variance estimator is based on the decomposition
$$\mathrm{var}(\hat{\theta}_{MI}) = \mathrm{var}(\hat{\theta}_n) + \mathrm{var}(\hat{\theta}_{MI} - \hat{\theta}_n) + 2\,\mathrm{cov}(\hat{\theta}_{MI} - \hat{\theta}_n,\ \hat{\theta}_n), \qquad (3)$$
where $\hat{\theta}_n$ is the complete-sample estimator of $\theta$. In Rubin’s variance estimator (2), $W_M$ estimates the first term of (3) and $(1 + M^{-1}) B_M$ estimates the second term of (3). Kim et al. [16] showed that for estimators in the form of (1), $(1 + M^{-1}) B_M$ is an asymptotically unbiased estimator of the second term of (3) under the covariance assumption, that is, the covariance between $y_i^{(m_1)}$ and $y_j^{(m_1)}$ is twice the covariance between $y_i^{(m_1)}$ and $y_j^{(m_2)}$ for $i \neq j$ and $m_1 \neq m_2$. Therefore, if $W_M$ is an unbiased estimator of the first term of (3), the asymptotic bias of Rubin’s variance estimator is
$$\mathrm{bias}(\hat{V}_{MI}) = -2\,\mathrm{cov}(\hat{\theta}_{MI} - \hat{\theta}_n,\ \hat{\theta}_n). \qquad (4)$$
Kim et al. [16] derived the closed-form bias (4) for estimators in the form of (1) when the imputed values can be expressed as a linear function of the observed items. Kim’s method [16] therefore requires the covariance assumption. In this study, we proposed a bootstrap variance estimator for $\hat{\theta}_{MI}$ that does not need this assumption, and hence, this new method has a much wider range of applications.

2.2. The Bootstrap Estimation Method

2.2.1. Assumptions

We employed the regularity assumptions for the data and the imputation mechanism in Kim et al. [16] for the newly proposed method.
(C.1) Both the complete-sample point estimator and variance estimator are asymptotically unbiased, i.e.,
$$E(\hat{\theta}_n) = \theta + O(n^{-1}) \quad \text{and} \quad E(\hat{V}_n) = \mathrm{var}(\hat{\theta}_n) + O(n^{-1}),$$
where $\hat{V}_n$ is the complete-sample variance estimator of $\hat{\theta}_n$.
(C.2) The conditional expectations of the imputed values are the same as those of the real values, i.e.,
$$E(y_i^{(m)} \mid A, A_R) = E(y_i \mid A, A_R), \quad m = 1, \ldots, M,$$
where $A_R$ is the set of indices of the observed items.
(C.3) The conditional covariances of the imputed values are asymptotically the same as those of the real values, i.e.,
$$\max_{i,j} \left| \mathrm{cov}(y_i^{(m)}, y_j^{(m)} \mid A, A_R) - \mathrm{cov}(y_i, y_j \mid A, A_R) \right| = o(1), \quad m = 1, \ldots, M.$$
(C.4) Conditional on $A$, $A_R$, and $Y_{obs}$, the imputed values $y_j^{(m)}$ of $y_j$, $j \in A_k$, are conditionally independent and identically distributed, where $A_k$ is the set of indices of the missing items.
Assumptions (C.1) and (C.2) ensure that the imputed-sample estimators of the form (1) are approximately unbiased. Assumption (C.3) ensures that the covariances of the imputed values are asymptotically the same as those of the original values. Therefore, we can treat the imputed values similarly to the original values in the inference procedure. Assumption (C.4) ensures that both the estimators $\hat{\theta}_n^{(m)}$, $m = 1, \ldots, M$, and the naive variance estimators $\hat{V}_n^{(m)}$, $m = 1, \ldots, M$, are identically distributed. Therefore, the imputed values and the imputed samples can be treated equally in the inference procedure.

2.2.2. Variance Decomposition

Instead of the decomposition (3), we decomposed the variance as
$$\mathrm{var}(\hat{\theta}_{MI}) = V_1 + V_2 + V_3 + V_4, \qquad (5)$$
where $V_1 = \mathrm{var}\left(\sum_{i \in A_R} \alpha_i y_i\right)$, $V_2 = M^{-1} E\left\{\mathrm{var}\left(\sum_{j \in A_k} \alpha_j y_j^{(m)} \mid Y_{obs}\right)\right\}$, $V_3 = \mathrm{var}\left\{E\left(\sum_{j \in A_k} \alpha_j y_j^{(m)} \mid Y_{obs}\right)\right\}$, and $V_4 = 2\,\mathrm{cov}\left(\sum_{j \in A_k} \alpha_j y_j^{(m)},\ \sum_{i \in A_R} \alpha_i y_i\right)$. $V_1$ is the variance of the weighted sum of the observed items. The sum of $V_2$ and $V_3$ is the variance contributed by the imputed items, based on the law of total variance. $V_4$ is twice the covariance between the weighted sum of the imputed items and the weighted sum of the observed items.

2.2.3. Bootstrap Variance Estimator and Its Properties

To estimate (5), we generated bootstrap data as follows. For $b = 1, \ldots, B$:
Step 1: Take a random sample of size $n$ with replacement from the original sample, which consists of independent and identically distributed observations. Denote the bootstrap sample as $(y_1^{(b)}, R_1^{(b)}), \ldots, (y_n^{(b)}, R_n^{(b)})$. Denote the observed items as $Y_{obs}^{(b)} = \{y_1^{(b)}, \ldots, y_r^{(b)}\}$ and the missing items as $Y_{mis}^{(b)} = \{y_{r+1}^{(b)}, \ldots, y_n^{(b)}\}$ in the bootstrap sample. Let $A^{(b)}$, $A_R^{(b)}$, and $A_k^{(b)}$ represent the set of indices of the sample elements, the set of indices of the observed items, and the set of indices of the missing items in the $b$th bootstrap sample, respectively.
Step 2: With the resultant bootstrap sample, calculate imputed values for the missing items $y_j^{(b)}$, $j \in A_k^{(b)}$, using a single imputation method, which is the same as the multiple imputation method with $M = 1$. The imputed-sample estimator in the $b$th bootstrap sample is $\hat{\theta}^{(b)} = \sum_{i \in A^{(b)}} \alpha_i^{(b)} y_i^{(b)}$.
We estimated $V_1$ and $V_2$ using the multiple imputed samples and estimated $V_3$ and $V_4$ using the bootstrap samples. The four estimators are as follows.
$V_1$ can be estimated from the imputed-sample output based on condition (C.1), i.e., $\hat{V}_1 = r (nM)^{-1} \sum_{m=1}^{M} \hat{V}_n^{(m)}$.
$V_2$ is estimated using the empirical conditional variance, $\hat{V}_2 = M^{-1} (M - 1)^{-1} \sum_{m=1}^{M} \left(\sum_{j \in A_k} \alpha_j y_j^{(m)} - H_1\right)^2$, where $H_1 = M^{-1} \sum_{m=1}^{M} \sum_{j \in A_k} \alpha_j y_j^{(m)}$.
$V_3$ is estimated using $\hat{V}_3 = (B - 1)^{-1} \sum_{b=1}^{B} \left(\sum_{j \in A_k^{(b)}} \alpha_j^{(b)} \eta_j^{(b)} - H_2\right)^2$, where $\eta_j^{(b)} \equiv E(y_j^{(b)} \mid Y_{obs}^{(b)})$ and $H_2 = B^{-1} \sum_{b=1}^{B} \sum_{j \in A_k^{(b)}} \alpha_j^{(b)} \eta_j^{(b)}$.
$V_4$ is estimated using $\hat{V}_4 = 2 \sum_{j \in A_k} \alpha_j \sum_{i \in A_R} \alpha_i (B - 1)^{-1} \sum_{b=1}^{B} (\bar{\eta}_j^{(b)} - H_3)(\bar{Y}_r^{(b)} - H_4)$, where $\bar{\eta}_j^{(b)} = (n - r)^{-1} \sum_{j \in A_k^{(b)}} \eta_j^{(b)}$, $\bar{Y}_r^{(b)} = r^{-1} \sum_{i \in A_R^{(b)}} y_i^{(b)}$, $H_3 = B^{-1} \sum_{b=1}^{B} \bar{\eta}_j^{(b)}$, and $H_4 = B^{-1} \sum_{b=1}^{B} \bar{Y}_r^{(b)}$.
Therefore, the bootstrap variance estimator is $\hat{V} = \hat{V}_1 + \hat{V}_2 + \hat{V}_3 + \hat{V}_4$.
The newly proposed bootstrap variance estimator is an asymptotically unbiased estimator of the variance of the multiple imputation estimator. Under conditions (C.1)–(C.4), $\hat{V}_1$ is an asymptotically unbiased estimator of $V_1$. $\hat{V}_2$ is an asymptotically unbiased estimator of $V_2$ because $E(\hat{V}_2) = E\{E(\hat{V}_2 \mid Y_{obs})\}$. $\hat{V}_3$ is an asymptotically unbiased estimator of $V_3$ by the Glivenko–Cantelli theorem:
$$E(\hat{V}_3) = E\left[ E\left\{ \left(\sum_{j \in A_k^{(b)}} \alpha_j^{(b)} \eta_j^{(b)}\right)^2 \Bigm| Y_{obs} \right\} - \left\{ E\left(\sum_{j \in A_k^{(b)}} \alpha_j^{(b)} \eta_j^{(b)} \Bigm| Y_{obs}\right) \right\}^2 \right],$$
where $\eta_j^{(b)} = E(y_j^{(b)} \mid Y_{obs})$.
For $\hat{V}_4$, we derived
$$E\{\mathrm{cov}(\bar{Y}_r^{(b)}, \bar{\eta}_j^{(b)} \mid A, A_R, Y_{obs})\} = \mathrm{cov}(y_i, y_j^{(m)} \mid A, A_R) + O\left(\frac{1}{r}\right),$$
and
$$\begin{aligned}
E\left( \frac{1}{B-1} \sum_{b=1}^{B} (\bar{\eta}_j^{(b)} - H_3)(\bar{Y}_r^{(b)} - H_4) \right)
&= E\left\{ \frac{1}{B-1} E\left( \sum_{b=1}^{B} \left( (\bar{\eta}_j^{(b)} - \mu_1) + (\mu_1 - H_3) \right)\left( (\bar{Y}_r^{(b)} - \mu_2) + (\mu_2 - H_4) \right) \Bigm| Y_{obs} \right) \right\} \\
&= E\{\mathrm{cov}(\bar{\eta}_j^{(b)}, \bar{Y}_r^{(b)} \mid Y_{obs})\},
\end{aligned}$$
where $\mu_1 \equiv E(\bar{\eta}_j^{(b)} \mid Y_{obs})$ and $\mu_2 \equiv E(\bar{Y}_r^{(b)} \mid Y_{obs})$. Thus, $\hat{V}_4$ is an asymptotically unbiased estimator of $V_4$.
To account for the uncertainty in the variance estimator with a small-to-moderate number of imputations and bootstrap replicates, a t-distribution is used to calculate $100(1 - \alpha)\%$ confidence intervals for $\theta$, i.e., $\hat{\theta}_{MI} \pm t_{v, 1-\alpha/2} \hat{V}^{1/2}$. The degrees of freedom $v$ are approximated along the lines of the Satterthwaite [32] method as
$$v = \frac{\hat{V}^2}{(M-1)^{-1} \hat{V}_2^2 + (B-1)^{-1} \left[ \hat{V}_3^2 + \hat{V}_5^2 + \hat{V}_6^2 + \hat{V}_7^2 + 2(1 - \pi^2) \hat{V}_5 (\hat{V}_3 - \hat{V}_7) - 2 \hat{V}_3 \hat{V}_7 - 2 \hat{V}_5 \hat{V}_6 \pi^2 \right]},$$
in which $\pi$ is the response rate of the domain of interest, $\hat{V}_5 = (B-1)^{-1} \sum_{b=1}^{B} (H_5 + H_6)^2$, $\hat{V}_6 = (B-1)^{-1} \sum_{b=1}^{B} H_5^2$, and $\hat{V}_7 = (B-1)^{-1} \sum_{b=1}^{B} H_6^2$, where $H_5 = \sum_{i \in A_R} \alpha_i (\bar{Y}_r^{(b)} - H_4)$ and $H_6 = \sum_{j \in A_k} \alpha_j (\bar{\eta}_j^{(b)} - H_3)$. The detailed derivation of $v$ is provided in Appendix A.
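The degrees-of-freedom expression can be evaluated directly once the variance components are available. The sketch below transcribes the expression exactly as stated above; the function name `satterthwaite_df` and its argument names are hypothetical, and the inputs are assumed to have been computed elsewhere.

```python
def satterthwaite_df(v_hat, v2, v3, v5, v6, v7, pi, M, B):
    """Approximate degrees of freedom v, transcribed from the expression in the text.

    v_hat: bootstrap variance estimate (V1 + V2 + V3 + V4)
    v2, v3, v5, v6, v7: the components named V2, V3, V5, V6, V7 in the text
    pi: response rate of the domain of interest
    M: number of imputations; B: number of bootstrap replicates
    """
    if M < 2 or B < 2:
        raise ValueError("need M >= 2 and B >= 2")
    bootstrap_part = (
        v3 ** 2 + v5 ** 2 + v6 ** 2 + v7 ** 2
        + 2 * (1 - pi ** 2) * v5 * (v3 - v7)
        - 2 * v3 * v7
        - 2 * v5 * v6 * pi ** 2
    )
    denominator = v2 ** 2 / (M - 1) + bootstrap_part / (B - 1)
    return v_hat ** 2 / denominator
```

The resulting value can then be passed to any t-quantile routine to form the interval $\hat{\theta}_{MI} \pm t_{v, 1-\alpha/2} \hat{V}^{1/2}$.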

3. Examples and Results

Simulations and a real data analysis were conducted to investigate the performance of the proposed bootstrap variance estimator ($\hat{V}$). We compared it with Rubin’s variance estimator ($\hat{V}_{MI}$), Kim’s method ($\hat{V}_k$) [16], and the traditional bootstrapping method ($\hat{V}_B$). As in Yang and Kim [15], we report the relative bias (Rbias) of the variance estimators, the mean width (mwidth), and the coverage probabilities of the 95% confidence intervals (95cov) for $\theta$. The relative bias was calculated using $\{E(\tilde{V}) - \mathrm{var}(\hat{\theta}_{MI})\} / \mathrm{var}(\hat{\theta}_{MI}) \times 100\%$, where $\tilde{V}$ is a variance estimator. The 95% confidence intervals were calculated based on a t-distribution with $v$ degrees of freedom. We considered different values of $B$, $M$, and $n$ to investigate their effects. These results were based on 5000 Monte Carlo runs for each setting. The R code used for these simulations is available from the first author upon request.
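The relative bias can be approximated from Monte Carlo output as in the following sketch. It assumes, which the text does not state explicitly, that $\mathrm{var}(\hat{\theta}_{MI})$ is approximated by the sample variance of the point estimates across Monte Carlo runs and $E(\tilde{V})$ by the average of the variance estimates; the function name is hypothetical.

```python
from statistics import mean, variance


def relative_bias_percent(variance_estimates, point_estimates):
    """Monte Carlo relative bias (%) of a variance estimator.

    variance_estimates: one variance estimate per Monte Carlo run
    point_estimates: the corresponding multiple imputation point estimates
    """
    # Assumption: the Monte Carlo variance of the point estimates stands in
    # for the true var(theta_MI).
    true_var = variance(point_estimates)
    return (mean(variance_estimates) - true_var) / true_var * 100.0
```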

3.1. Simulation 1: Domain Mean Estimation

We simulated a sample $y_1, \ldots, y_n$ from $N(\mu, \sigma^2)$, where $\mu = 2$, $\sigma^2 = 4$, and $n = 500$ or $1000$. Then, non-respondents were generated based on the response rate $\pi$ using the ampute function in the mice package of R. We considered two response rates, 0.8 and 0.5. To indicate whether an observation belonged to a domain $Z$, we generated $z_i$ from a Bernoulli distribution with probability $d$, which was 0.2 or 0.6 in the simulation study. If $z_i = 1$, then $y_i$ was in domain $Z$; otherwise, $z_i = 0$. The $z_i$, $i = 1, \ldots, n$, were always observed.
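The data-generating step can be sketched in Python as follows. This is not the authors’ R code: in place of the `ampute` function from the mice package, the sketch assumes independent Bernoulli response indicators with success probability $\pi$, and the function name and defaults are hypothetical.

```python
import random


def simulate_domain_sample(n, response_rate, domain_prob, mu=2.0, sigma2=4.0, seed=0):
    """Generate (y, z, resp) for one run of the domain mean simulation.

    y: draws from N(mu, sigma2)
    z: 0/1 domain indicators, Bernoulli(domain_prob)
    resp: 0/1 response indicators, Bernoulli(response_rate)
          (assumption: a simple stand-in for mice::ampute)
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    y = [rng.gauss(mu, sigma) for _ in range(n)]
    z = [1 if rng.random() < domain_prob else 0 for _ in range(n)]
    resp = [1 if rng.random() < response_rate else 0 for _ in range(n)]
    return y, z, resp
```

Calling `simulate_domain_sample(500, 0.8, 0.2)` gives one sample for the setting $n = 500$, $\pi = 0.8$, $d = 0.2$.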
Our goal was to estimate the mean $\theta$ of the domain $Z$. Without loss of generality, denote the first $r$ observations as the respondents $(y_1, \ldots, y_r)$ and the remaining $n - r$ observations as the non-respondents $(y_{r+1}, \ldots, y_n)$. Then, the complete-sample estimator of $\theta$ is
$$\hat{\theta}_n = \left(\sum_{i=1}^{n} z_i\right)^{-1} \sum_{i=1}^{n} z_i y_i.$$
Let $n_d = \sum_{i=1}^{n} z_i$ be the size of the domain $Z$ and $r_d = \sum_{i=1}^{r} z_i$ be the number of response items in the domain $Z$. It follows that $\alpha_i = z_i / n_d$.
To handle the missing observations, we used the multiple imputation method with $M = 10$ or $30$. The $m$th imputed-sample estimator is
$$\hat{\theta}_n^{(m)} = \left(\sum_{i=1}^{n} z_i\right)^{-1} \sum_{i=1}^{n} z_i y_i^{(m)}.$$
If $y_i$ is missing, the conditional expectation of the imputed value is $\eta_i \equiv E(y_i^{(m)} \mid y_1, \ldots, y_r) = \bar{y}_r$, where $\bar{y}_r = \sum_{i=1}^{r} y_i / r$.
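The imputation and estimation step for the domain mean can be sketched as below. The text specifies only that the imputed values have conditional expectation $\bar{y}_r$; the sketch assumes, for illustration, that each missing value is drawn from a normal distribution with the respondent mean and respondent standard deviation. The function name is hypothetical.

```python
import random
from statistics import mean, stdev


def impute_domain_mean(y, z, resp, M=10, seed=0):
    """Create M completed datasets and the imputed-sample domain mean estimates.

    y: values (entries for non-respondents are ignored)
    z: 0/1 domain indicators; resp: 0/1 response indicators
    Returns (completed, estimates), each of length M.
    """
    rng = random.Random(seed)
    observed = [yi for yi, ri in zip(y, resp) if ri == 1]
    ybar_r, s_r = mean(observed), stdev(observed)
    n_d = sum(z)
    completed, estimates = [], []
    for _ in range(M):
        # Assumption: normal draws centred at the respondent mean ybar_r
        ym = [yi if ri == 1 else rng.gauss(ybar_r, s_r) for yi, ri in zip(y, resp)]
        completed.append(ym)
        estimates.append(sum(zi * v for zi, v in zip(z, ym)) / n_d)
    return completed, estimates
```

Averaging the returned `estimates` gives Rubin’s point estimator $\hat{\theta}_{MI}$ for the domain mean.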
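We then used the bootstrap method described in Section 2.2.3, with $B = 200$ or $500$, to estimate the variance of the multiple imputation estimator. Denote the bootstrap sample as $\{y_1^{(b)}, \ldots, y_n^{(b)}\}$, in which the response items are $Y_{obs}^{(b)} = \{y_1^{(b)}, \ldots, y_r^{(b)}\}$, the non-response items are $Y_{mis}^{(b)} = \{y_{r+1}^{(b)}, \ldots, y_n^{(b)}\}$, and the corresponding indicators for domain $Z$ are $z_i^{(b)}$, $i = 1, \ldots, n$. Let $n_d^{(b)}$ be the size of the domain $Z$ in the $b$th bootstrap sample. Then, we calculate
$$\hat{V}_1 = \frac{r_d}{n_d^2 (n-1) M} \sum_{m=1}^{M} \sum_{i=1}^{n} (y_i^{(m)} - \bar{y}^{(m)})^2, \quad \text{where } \bar{y}^{(m)} = \frac{1}{n} \sum_{i=1}^{n} y_i^{(m)};$$
$$\hat{V}_2 = \frac{1}{M(M-1)} \sum_{m=1}^{M} \left( \sum_{j=r+1}^{n} \frac{z_j}{n_d} y_j^{(m)} - H_1 \right)^2, \quad \text{where } H_1 = \frac{1}{M} \sum_{m=1}^{M} \sum_{j=r+1}^{n} \frac{z_j}{n_d} y_j^{(m)};$$
$$\hat{V}_3 = \frac{1}{B-1} \sum_{b=1}^{B} \left( \sum_{j=r+1}^{n} \frac{z_j^{(b)}}{n_d^{(b)}} \bar{Y}_r^{(b)} - H_2 \right)^2, \quad \text{where } H_2 = \frac{1}{B} \sum_{b=1}^{B} \sum_{j=r+1}^{n} \frac{z_j^{(b)}}{n_d^{(b)}} \bar{Y}_r^{(b)} \text{ and } \bar{Y}_r^{(b)} = \frac{\sum_{i=1}^{r} y_i^{(b)}}{r};$$
$$\hat{V}_4 = 2 \sum_{j=r+1}^{n} \frac{z_j}{n_d} \sum_{i=1}^{r} \frac{z_i}{n_d} (B-1)^{-1} \sum_{b=1}^{B} (\bar{Y}_r^{(b)} - H_3)(\bar{Y}_z^{(b)} - H_4), \quad \text{where } \bar{Y}_z^{(b)} = \frac{\sum_{i=1}^{r} z_i^{(b)} y_i^{(b)}}{\sum_{i=1}^{r} z_i^{(b)}},$$
$H_3 = B^{-1} \sum_{b=1}^{B} \bar{Y}_r^{(b)}$, and $H_4 = B^{-1} \sum_{b=1}^{B} \bar{Y}_z^{(b)}$.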
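The four components for the domain mean can be computed along the following lines. This is a minimal sketch, not the authors’ R implementation. It assumes that the respondent counts are recomputed within each bootstrap replicate and that replicates with no respondent in the domain are redrawn; both are assumptions of the sketch, and the function name is hypothetical.

```python
import random
from statistics import mean, variance


def bootstrap_variance_domain_mean(y, z, resp, imputed, B=200, seed=0):
    """Sketch of (V1, V2, V3, V4) for the multiple imputation domain mean.

    y: values (entries for non-respondents are ignored)
    z: 0/1 domain indicators; resp: 0/1 response indicators
    imputed: list of M >= 2 completed copies of y (imputed values filled in)
    """
    rng = random.Random(seed)
    n, M = len(y), len(imputed)
    n_d = sum(z)
    r_d = sum(zi * ri for zi, ri in zip(z, resp))
    if r_d == 0:
        raise ValueError("no respondents in the domain")
    m_d = n_d - r_d

    # V1: average sample variance of the completed data, scaled by r_d / n_d^2
    pooled = mean(variance(ym) for ym in imputed)  # variance() divides by n - 1
    v1 = r_d / n_d ** 2 * pooled

    # V2: variability across imputations of the imputed part of the estimator
    imputed_part = [
        sum(zj * yj for zj, rj, yj in zip(z, resp, ym) if rj == 0) / n_d
        for ym in imputed
    ]
    v2 = variance(imputed_part) / M

    # Bootstrap replicates: resample (y, z, resp) triples with replacement
    eta_part, ybar_r, ybar_z = [], [], []
    while len(eta_part) < B:
        idx = [rng.randrange(n) for _ in range(n)]
        obs = [(y[i], z[i]) for i in idx if resp[i] == 1]
        nd_b = sum(z[i] for i in idx)
        rd_b = sum(zi for _, zi in obs)
        if rd_b == 0:
            continue  # assumption: redraw degenerate replicates
        yr = mean(yi for yi, _ in obs)  # respondent mean, eta for every missing item
        ybar_r.append(yr)
        ybar_z.append(sum(yi * zi for yi, zi in obs) / rd_b)
        eta_part.append((nd_b - rd_b) / nd_b * yr)

    v3 = variance(eta_part)
    h3, h4 = mean(ybar_r), mean(ybar_z)
    cov_b = sum((a - h3) * (c - h4) for a, c in zip(ybar_r, ybar_z)) / (B - 1)
    v4 = 2 * (m_d / n_d) * (r_d / n_d) * cov_b
    return v1, v2, v3, v4
```

The bootstrap variance estimate is the sum of the four returned components. Only $B$ single-imputation passes are needed in the loop, since each replicate uses the respondent mean directly rather than $M$ fresh imputations.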
To compare the newly proposed method with the method proposed by Kim et al. [16], we used the following formula derived by them [16]:
$$\hat{V}_k = W_M + (1 + M^{-1}) B_M + 2 n_d^{-2} m_d \left( \frac{1}{r} - \frac{1}{r_d} \right) \hat{\sigma}^2,$$
where $m_d = n_d - r_d$ and $\hat{\sigma}^2 = (n-1)^{-1} M^{-1} \sum_{m=1}^{M} \sum_{i=1}^{n} (y_i^{(m)} - \bar{y}^{(m)})^2$.
The simulation results are summarized in Table 1. They show that the newly proposed method $\hat{V}$ had a smaller relative bias than both Rubin’s method $\hat{V}_{MI}$ and Kim’s method $\hat{V}_k$, especially for cases with a small response rate. The relative bias of the proposed method was larger when the sample size was larger. This may have been due to the smaller value of $\mathrm{var}(\hat{\theta}_{MI})$ for the larger sample size. The other methods show this pattern as well. The width of the confidence interval of Rubin’s method was usually larger, while that of Kim’s method was usually smaller, than that of the newly proposed method. The coverage rate of the newly proposed method was close to the nominal level. However, Rubin’s method had a coverage rate above the nominal level and Kim’s method had a coverage rate below it. These results indicated that the covariance assumption required in Kim’s method may not be satisfied, and hence, $(1 + M^{-1}) B_M$ may not unbiasedly estimate $\mathrm{var}(\hat{\theta}_{MI} - \hat{\theta}_n)$ for the domain mean estimation. The newly proposed method $\hat{V}$ performed similarly to the traditional bootstrap method $\hat{V}_B$. However, it was much faster than the traditional bootstrap method $\hat{V}_B$, especially for cases with large $n$, $B$, and $M$. For example, for a case with $n = 1000$, $B = 500$, $M = 30$, and 5000 Monte Carlo runs, the new method needed about half an hour, while the traditional bootstrap method needed about four hours. In addition, the values of $B$ considered in these examples had no significant effects on these methods. Larger values of $n$ and $M$ resulted in smaller widths of the confidence intervals.

3.2. Simulation 2: Linear Regression

The variance estimation method proposed in this study can not only be applied to the domain mean estimation but also to any other estimator that has the form of (1). In this example, we showed its performance for the mean estimation of the response variable in the linear regression model. We considered the model used in Yang and Kim [15] as follows:
$$y_i = \beta x_i + e_i, \quad i = 1, \ldots, n,$$
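Table 1. Simulation results for the domain mean estimation. Rbias is in %; Mwidth and 95cov are in units of $10^{-2}$.

| n | B | M | π | d | Rbias $\hat{V}$ | Rbias $\hat{V}_{MI}$ | Rbias $\hat{V}_K$ | Rbias $\hat{V}_B$ | Mwidth $\hat{V}$ | Mwidth $\hat{V}_{MI}$ | Mwidth $\hat{V}_K$ | Mwidth $\hat{V}_B$ | 95cov $\hat{V}$ | 95cov $\hat{V}_{MI}$ | 95cov $\hat{V}_K$ | 95cov $\hat{V}_B$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 500 | 10 | 0.8 | 0.2 | 0.6 | 146.2 | −12.5 | 0.5 | 65 | 102 | 58 | 65 | 95 | 99 | 89 | 95 |
| 500 | 500 | 10 | 0.8 | 0.6 | 1.6 | 21.6 | −4.9 | 1.5 | 48 | 53 | 46 | 48 | 95 | 97 | 95 | 95 |
| 500 | 500 | 10 | 0.5 | 0.2 | −1.3 | 237.7 | −41.1 | 0.7 | 61 | 113 | 36 | 61 | 95 | 99 | 58 | 96 |
| 500 | 500 | 10 | 0.5 | 0.6 | −0.7 | 18.1 | −40.5 | −0.6 | 55 | 61 | 42 | 56 | 95 | 96 | 85 | 95 |
| 500 | 500 | 30 | 0.8 | 0.2 | 1.0 | 156.8 | −12.6 | 0.9 | 63 | 100 | 57 | 62 | 95 | 99 | 92 | 95 |
| 500 | 500 | 30 | 0.8 | 0.6 | 1.3 | 21.6 | −5.3 | 1.3 | 47 | 52 | 46 | 47 | 95 | 97 | 95 | 95 |
| 500 | 500 | 30 | 0.5 | 0.2 | −2.2 | 261.4 | −54.3 | 0.6 | 58 | 109 | 33 | 57 | 95 | 100 | 63 | 95 |
| 500 | 500 | 30 | 0.5 | 0.6 | −1.4 | 17.8 | −42.2 | −1.4 | 54 | 59 | 41 | 54 | 95 | 97 | 86 | 95 |
| 500 | 200 | 10 | 0.8 | 0.2 | 0.5 | 146.2 | −12.5 | 0.2 | 65 | 102 | 58 | 65 | 95 | 99 | 89 | 95 |
| 500 | 200 | 10 | 0.8 | 0.6 | 1.6 | 21.6 | −4.9 | 1.5 | 48 | 53 | 46 | 48 | 95 | 97 | 95 | 95 |
| 500 | 200 | 10 | 0.5 | 0.2 | −1.4 | 237.7 | −41.1 | 0.7 | 62 | 113 | 36 | 61 | 95 | 99 | 58 | 96 |
| 500 | 200 | 10 | 0.5 | 0.6 | −0.8 | 18.1 | −40.5 | −0.7 | 55 | 61 | 42 | 56 | 95 | 96 | 85 | 95 |
| 500 | 200 | 30 | 0.8 | 0.2 | 1.1 | 156.8 | −12.6 | 1.0 | 63 | 100 | 57 | 62 | 95 | 99 | 92 | 95 |
| 500 | 200 | 30 | 0.8 | 0.6 | 1.3 | 21.6 | −5.3 | 1.3 | 47 | 52 | 46 | 47 | 95 | 97 | 95 | 95 |
| 500 | 200 | 30 | 0.5 | 0.2 | −2.2 | 261.4 | −54.3 | 0.6 | 61 | 108 | 33 | 57 | 96 | 100 | 63 | 95 |
| 500 | 200 | 30 | 0.5 | 0.6 | −1.3 | −42.2 | 17.8 | −1.3 | 54 | 59 | 41 | 54 | 95 | 97 | 86 | 95 |
| 1000 | 500 | 10 | 0.8 | 0.2 | −1.1 | 141.4 | 13.1 | −1.1 | 46 | 72 | 42 | 46 | 95 | 99 | 89 | 95 |
| 1000 | 500 | 10 | 0.8 | 0.6 | −4.0 | 53.4 | −9.7 | −4.0 | 34 | 37 | 33 | 34 | 95 | 97 | 94 | 95 |
| 1000 | 500 | 10 | 0.5 | 0.2 | −3.7 | 255.4 | −44.3 | −3.2 | 43 | 80 | 25 | 44 | 95 | 99 | 57 | 95 |
| 1000 | 500 | 10 | 0.5 | 0.6 | −0.3 | 19.5 | −39.2 | −0.3 | 39 | 44 | 30 | 40 | 95 | 97 | 85 | 95 |
| 1000 | 500 | 30 | 0.8 | 0.2 | −1.5 | 148.6 | −14.6 | −1.6 | 44 | 71 | 41 | 44 | 95 | 99 | 92 | 95 |
| 1000 | 500 | 30 | 0.8 | 0.6 | −3.7 | 15.7 | −9.8 | −3.6 | 33 | 37 | 32 | 34 | 95 | 96 | 94 | 95 |
| 1000 | 500 | 30 | 0.5 | 0.2 | −3.9 | 251.2 | −55.3 | −3.3 | 41 | 77 | 23 | 40 | 94 | 99 | 63 | 94 |
| 1000 | 500 | 30 | 0.5 | 0.6 | −0.1 | 20.0 | −40.8 | −0.1 | 38 | 42 | 29 | 39 | 95 | 97 | 85 | 95 |
| 1000 | 200 | 10 | 0.8 | 0.2 | −1.0 | 141.4 | −13.1 | −1.0 | 46 | 72 | 42 | 46 | 95 | 99 | 89 | 95 |
| 1000 | 200 | 10 | 0.8 | 0.6 | −3.9 | 15.3 | −9.7 | −4.0 | 34 | 37 | 33 | 34 | 94 | 97 | 94 | 95 |
| 1000 | 200 | 10 | 0.5 | 0.2 | −3.8 | 225.4 | −44.3 | −3.3 | 44 | 80 | 25 | 44 | 96 | 99 | 57 | 95 |
| 1000 | 200 | 10 | 0.5 | 0.6 | −0.3 | 19.5 | −39.2 | −0.6 | 39 | 44 | 30 | 40 | 95 | 97 | 85 | 95 |
| 1000 | 200 | 30 | 0.8 | 0.2 | −1.6 | 148.6 | 14.6 | −1.5 | 45 | 71 | 41 | 44 | 95 | 99 | 92 | 95 |
| 1000 | 200 | 30 | 0.8 | 0.6 | −3.7 | 15.7 | −9.8 | −3.6 | 33 | 37 | 32 | 34 | 95 | 96 | 94 | 94 |
| 1000 | 200 | 30 | 0.5 | 0.2 | −4.0 | 251.2 | −55.3 | −3.3 | 42 | 77 | 23 | 40 | 95 | 99 | 63 | 94 |
| 1000 | 200 | 30 | 0.5 | 0.6 | −0.1 | 20.0 | −40.8 | −0.1 | 38 | 42 | 29 | 39 | 95 | 97 | 85 | 95 |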
where $n = 500$ or $1000$, $\beta = 0.1$, $x_i \sim \exp(1)$, and $e_i \sim N(0, \sigma_e^2)$ with $\sigma_e^2 = 0.25$. In this setup, we assumed that the $x_i$ were fully observed. The missingness in $y_i$ was controlled by the response indicator $\delta_i$, where $\delta_i \sim \mathrm{Ber}(p_i)$ and $p_i = 1 / \{1 + \exp(-\phi_0 - \phi_1 x_i)\}$. Two scenarios were considered, namely, $(\phi_0, \phi_1) = (-1.5, 2)$ and $(\phi_0, \phi_1) = (3, -3)$, which yielded an approximate average response rate of 0.5. We were interested in estimating $\theta = E(y_i)$. The normal imputation procedure [33] was used to generate the imputed values for the missing observations. Let $Y_{obs} = \{y_1, \ldots, y_r\}$ be the observed items and $Y_{mis} = \{y_{r+1}, \ldots, y_n\}$ be the missing items. In the simulations, we set $B = 200$ or $500$ and $M = 10$ or $30$. Then, the complete-sample estimator was $\hat{\theta}_n = \sum_{i=1}^{n} y_i / n$.
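The data-generating model for this simulation can be sketched as follows. The sketch assumes the logistic response model $p_i = 1/\{1 + \exp(-\phi_0 - \phi_1 x_i)\}$ and reads $\exp(1)$ as the exponential distribution with rate 1; the function name and defaults are hypothetical, and this is not the authors’ R code.

```python
import math
import random


def simulate_regression_sample(n, phi0, phi1, beta=0.1, sigma2_e=0.25, seed=0):
    """Generate (x, y, delta) for one run of the linear regression simulation.

    x: exponential covariate with rate 1 (always observed)
    y: beta * x + e, with e ~ N(0, sigma2_e)
    delta: 0/1 response indicators, Bernoulli(p_i) with a logistic p_i
    """
    rng = random.Random(seed)
    sd_e = math.sqrt(sigma2_e)
    x = [rng.expovariate(1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, sd_e) for xi in x]
    # Assumption: logistic response probability in phi0 + phi1 * x
    p = [1.0 / (1.0 + math.exp(-phi0 - phi1 * xi)) for xi in x]
    delta = [1 if rng.random() < pi else 0 for pi in p]
    return x, y, delta
```

Under these assumptions, the first scenario, `simulate_regression_sample(n, -1.5, 2.0)`, gives an average response rate of roughly one half, in line with the rate quoted above.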
Therefore, $\alpha_i = 1/n$. This is a special case of the estimator with form (1) in which the $\alpha_i$, $i = 1, \ldots, n$, are all the same. The $m$th imputed-sample estimator is $\hat{\theta}_n^{(m)} = \sum_{i=1}^{n} y_i^{(m)} / n$.
If $y_i$ was missing, the conditional expectation of the imputed value was $\eta_i \equiv E(y_i^{(m)} \mid y_1, \ldots, y_r) = x_i (X_r^{T} X_r)^{-1} X_r^{T} Y_{obs}$ for $m = 1, \ldots, M$, where $X_r$ was an $r \times 2$ matrix with rows $X_i = [1 \;\; x_i]$, $i = 1, \ldots, r$. To estimate the variance of the multiple imputation estimator, we used the bootstrap procedure described in Section 2.2.3. Denote the bootstrap sample as $\{y_1^{(b)}, \ldots, y_n^{(b)}\}$, the response items as $Y_{obs}^{(b)} = \{y_1^{(b)}, \ldots, y_r^{(b)}\}$, and the non-response items as $Y_{mis}^{(b)} = \{y_{r+1}^{(b)}, \ldots, y_n^{(b)}\}$ in the bootstrap sample. The corresponding independent variable for $y_i^{(b)}$ is $x_i^{(b)}$. Then, we calculated
$$\hat{V}_1 = \frac{r}{n^2 (n-1) M} \sum_{m=1}^{M} \sum_{i=1}^{n} (y_i^{(m)} - \bar{y}^{(m)})^2, \quad \text{where } \bar{y}^{(m)} = n^{-1} \sum_{i=1}^{n} y_i^{(m)};$$
$$\hat{V}_2 = \frac{1}{M(M-1)} \sum_{m=1}^{M} \left( \sum_{j=r+1}^{n} \frac{1}{n} y_j^{(m)} - H_1 \right)^2, \quad \text{where } H_1 = \frac{1}{M} \sum_{m=1}^{M} \sum_{j=r+1}^{n} \frac{1}{n} y_j^{(m)};$$
$$\hat{V}_3 = \frac{1}{B-1} \sum_{b=1}^{B} \left( \sum_{j=r+1}^{n} \frac{1}{n} \eta_j^{(b)} - H_2 \right)^2, \quad \text{where } H_2 = \frac{1}{B} \sum_{b=1}^{B} \sum_{j=r+1}^{n} \frac{1}{n} \eta_j^{(b)} \text{ and}$$
$$\eta_j^{(b)} = x_j^{(b)} (X_r^{(b)T} X_r^{(b)})^{-1} X_r^{(b)T} Y_{obs}^{(b)};$$
$$\hat{V}_4 = 2 \sum_{j=r+1}^{n} \frac{1}{n} \sum_{i=1}^{r} \frac{1}{n} (B-1)^{-1} \sum_{b=1}^{B} (\bar{\eta}_j^{(b)} - H_3)(\bar{Y}_r^{(b)} - H_4),$$
where $\bar{\eta}_j^{(b)} = (n-r)^{-1} \sum_{j=r+1}^{n} \eta_j^{(b)}$, $\bar{Y}_r^{(b)} = r^{-1} \sum_{i=1}^{r} y_i^{(b)}$, $H_3 = B^{-1} \sum_{b=1}^{B} \bar{\eta}_j^{(b)}$, and $H_4 = B^{-1} \sum_{b=1}^{B} \bar{Y}_r^{(b)}$.
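The conditional expectations $\eta_j$ used in $\hat{V}_3$ and $\hat{V}_4$ are least-squares fitted values from the respondents. For a single covariate with an intercept, they can be computed without matrix algebra, as in the sketch below; the function name is hypothetical, and the closed-form slope and intercept are the standard simple linear regression formulas.

```python
def ols_fitted_values(x_obs, y_obs, x_new):
    """Least-squares predictions at x_new from a fit of y on [1, x] using respondents.

    Equivalent to [1, x_j] (X_r^T X_r)^{-1} X_r^T Y_obs for simple linear regression.
    """
    r = len(x_obs)
    if r < 2 or r != len(y_obs):
        raise ValueError("need at least two matched respondent pairs")
    xbar = sum(x_obs) / r
    ybar = sum(y_obs) / r
    sxx = sum((xi - xbar) ** 2 for xi in x_obs)
    if sxx == 0:
        raise ValueError("x_obs must not be constant")
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x_obs, y_obs))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [intercept + slope * xj for xj in x_new]
```

In each bootstrap replicate, the function would be called with the replicate’s respondents as `x_obs` and `y_obs` and the covariates of its non-respondents as `x_new`.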
We derived the variance estimator based on Kim’s method [16]:
$$\hat{V}_k = W_M + (1 + M^{-1}) B_M + 2 n^{-2} \mathbf{1}^{T} X_{n-r} (X_r^{T} X_r)^{-1} S_r \hat{\sigma}^2,$$
where $\mathbf{1}$ is a vector of ones of length $n - r$; $X_{n-r}$ is an $(n - r) \times 2$ matrix with rows $X_j = [1 \;\; x_j]$, $j = r+1, \ldots, n$; $S_r = [\,r \;\; \sum_{i=1}^{r} x_i\,]^{T}$ is a $2 \times 1$ matrix; and $\hat{\sigma}^2 = (n-1)^{-1} M^{-1} \sum_{m=1}^{M} \sum_{i=1}^{n} (y_i^{(m)} - \bar{y}^{(m)})^2$.
The simulation results are shown in Table 2. The biases of both Rubin’s method and Kim’s method were close to zero and the coverage rates were close to the nominal level. This indicated that the covariance assumption was satisfied and the bias term (4) was close to zero. The newly proposed method and the traditional bootstrap method performed similarly to Rubin’s method and Kim’s method. This showed that they performed as well as the other two methods when the assumptions of those methods were satisfied. The newly proposed method was much faster than the traditional bootstrapping method. For a case with $n = 1000$, $B = 500$, $M = 30$, and 5000 Monte Carlo runs, the new method needed about half an hour, while the traditional bootstrap method needed about eight hours. In addition, as in Simulation 1, the values of $B$ considered in Simulation 2 had no significant effects, while larger $n$ and $M$ values led to smaller widths of the confidence intervals.

3.3. A Real Data Analysis

To illustrate the proposed method, we applied it to the Hox pupil popularity data [34]. The dataset contains information about pupil popularity, sex, school, and teacher experience. Pupil popularity is measured using a self-rating scale that ranges from 0 (very unpopular) to 10 (very popular). Sex denotes the gender of the student (boy or girl). The school variable indicates which of the 100 schools the student attends. Teacher experience records the teacher’s teaching experience in years. There are 2000 observations in the dataset.
We first describe the data used in the analysis. We were interested in the average pupil popularity for the third school, which was treated as the domain. We selected the first 50 schools, which gave a dataset with 998 observations. The values of the school variable were always observed. For pupil popularity, there were 405 missing values. The third school had 18 observations, of which six had missing pupil popularity. Therefore, the domain proportion was 1.8% and the overall response rate was approximately 60%. Then, the method used in Simulation 1 (Section 3.1) was used to calculate the average pupil popularity of the third school. We used $M = 30$ and $B = 500$. Rubin’s point estimator was 6.41. The four methods used in the simulation study were used to estimate the variance of Rubin’s point estimator. The bootstrap variance estimator $\hat{V}$ was 0.059, with a corresponding 95% confidence interval of (5.935, 6.887). Rubin’s variance estimator $\hat{V}_{MI}$ was 0.109, with a corresponding 95% confidence interval of (5.762, 7.060). Kim’s variance estimator $\hat{V}_k$ was 0.053, with a corresponding 95% confidence interval of (5.961, 6.862). The traditional bootstrap variance estimator $\hat{V}_B$ was 0.066, with a corresponding 95% confidence interval of (5.932, 6.946). The results of the newly proposed bootstrap variance estimator were close to those of the traditional bootstrap variance estimator. Rubin’s variance estimator $\hat{V}_{MI}$ was larger than the bootstrap variance estimator $\hat{V}$, and Kim’s variance estimator $\hat{V}_k$ was smaller than the bootstrap variance estimator $\hat{V}$. In line with the simulation results, this suggested that the covariance assumption may not have been satisfied. In addition, Rubin’s variance estimator may have resulted in a coverage probability above the nominal level and Kim’s variance estimator may have resulted in a coverage probability below it.

4. Discussion and Conclusions

In this study, we proposed a new bootstrap variance estimator for a multiple imputation estimator with the complete-sample estimator in the form of (1). It does not need the covariance assumption used in Kim’s method, and thus, it has a wider range of applications. The simulations showed that it performed similarly to the traditional bootstrap method while requiring much less computational time. Moreover, the new method performed well whether or not the covariance assumption was satisfied. In addition, Kim’s method can only be applied when the imputed values can be written as a linear function of the observed values. The new method has no such restriction, and it can be applied to the multiple imputation estimator with any form of imputed values. Further research can be conducted to investigate its performance in such situations.
The newly proposed method can be extended to a more general class of estimators with the form $\sum_{i \in A} g(y_i)$, such as a maximum likelihood estimator, the proportion of $y$ values less than a given constant, and a quantile of $y$. Future research can be conducted to investigate the performance when using these estimators.

Author Contributions

Conceptualization, L.Y. and Y.Z.; methodology, L.Y. and Y.Z.; software, L.Y.; validation, L.Y. and Y.Z.; writing—original draft preparation, L.Y.; writing—review and editing, L.Y. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used in this article can be found in reference [34].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of the approximate number of degrees of freedom:
As in Rubin and Schenker [35], we derived the degrees of freedom of the t-distribution using (i) the conditional on the observed observations, imputed observations, and bootstrap observations; (ii) the assumption that V ^ 1 is fixed; and (iii) the assumption that V ^ / V ^ , , where V ^ , is the probability limit of V ^ as both M and B go to infinity, is distributed as χ v 2 / v .
Let H 5 = i A R α i ( Y ¯ r b H 4 ) , H 6 = j A k α j ( η ¯ j b H 3 ) , V ^ 5 = ( B 1 ) 1 b = 1 B ( H 5 + H 6 ) 2 ,   V ^ 6 = ( B 1 ) 1 b = 1 B H 5 2 , and V ^ 7 = ( B 1 ) 1 b = 1 B H 6 2 . Let α = V ^ 2 , / V ^ 2 , where V ^ 2 , = p l i m M V ^ 2 . Let β = V ^ 3 , / V ^ 3 , γ = V ^ 5 , / V ^ 5 , κ = V ^ 6 , / V ^ 6 , and ϕ = V ^ 7 , / V ^ 7 , where V ^ . , = p l i m B V ^ . . Let D be the dataset, which consists of the observations, multiple imputations, and bootstrap observations. Then, α 1 D follows an χ 2 distribution with degrees of freedom M − 1. Similarly, β 1 D , γ 1 D , κ 1 D , and ϕ 1 D follow an χ 2 distribution with degrees of freedom B − 1. Therefore, v a r ( α 1 | D ) = 2 / ( M 1 ) , v a r ( β 1 | D ) = v a r ( γ 1 | D ) = v a r ( κ 1 | D ) = v a r ( ϕ 1 | D ) = 2 / ( B 1 ) . Furthermore c o v ( β 1 , γ 1 | D ) = 2 ( 1 r 2 ) / ( B 1 ) ,   c o v ( β 1 , κ 1 | D ) = 2 / ( B 1 ) , c o v ( κ 1 , γ 1 | D ) = 2 ( 1 r 2 ) / ( B 1 ) , c o v ( ϕ 1 , γ 1 | D ) = 2 r 2 / ( B 1 ) , and the covariances between any other two terms are zero.
Then, we can write $\hat{V} = \hat{V}_1 + \hat{V}_2 + \hat{V}_3 + \hat{V}_5 - \hat{V}_6 - \hat{V}_7$. Let $f = \hat{V}_2/\hat{V}_1$, $g = \hat{V}_3/\hat{V}_1$, $h = \hat{V}_5/\hat{V}_1$, $a = \hat{V}_6/\hat{V}_1$, and $b = \hat{V}_7/\hat{V}_1$. Using a Taylor expansion, we derived
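$$
\begin{aligned}
\frac{\hat{V}}{\hat{V}_{\infty,\infty}} &= \frac{1 + f + g + h - a - b}{1 + \alpha f + \beta g + \gamma h - \kappa a - \phi b} \\
&\approx 1 + \frac{f}{1 + f + g + h - a - b}\,(\alpha^{-1} - 1) + \frac{g}{1 + f + g + h - a - b}\,(\beta^{-1} - 1) \\
&\quad + \frac{h}{1 + f + g + h - a - b}\,(\gamma^{-1} - 1) - \frac{a}{1 + f + g + h - a - b}\,(\kappa^{-1} - 1) - \frac{b}{1 + f + g + h - a - b}\,(\phi^{-1} - 1).
\end{aligned}
$$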
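The first-order coefficients of this expansion can be checked one variable at a time. As a sketch, write $N = 1 + f + g + h - a - b$ (a shorthand introduced here for illustration, not used in the original), set $x = \alpha^{-1}$ or $y = \kappa^{-1}$, and hold all other ratios at 1:

```latex
\begin{align*}
R(x) &= \frac{N}{N - f + f/x}, &
R'(x) &= \frac{N f / x^{2}}{\left(N - f + f/x\right)^{2}}, &
R'(1) &= \frac{f}{N}, \\
S(y) &= \frac{N}{N + a - a/y}, &
S'(y) &= -\frac{N a / y^{2}}{\left(N + a - a/y\right)^{2}}, &
S'(1) &= -\frac{a}{N}.
\end{align*}
```

The same calculation gives $g/N$ and $h/N$ for $\beta^{-1}$ and $\gamma^{-1}$, and $-b/N$ for $\phi^{-1}$, so the terms that enter $\hat{V}$ with a negative sign carry negative coefficients in the expansion.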
Let $\pi$ be the response rate for the domain of interest (written $r$ in the covariance expressions above). We obtained
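$$
\begin{aligned}
\operatorname{var}\!\left(\frac{\hat{V}}{\hat{V}_{\infty,\infty}} \,\Big|\, D\right)
&= \frac{2}{M-1}\,\frac{f^2}{(1 + f + g + h - a - b)^2} + \frac{2}{B-1}\,\frac{g^2 + h^2 + a^2 + b^2 + 2gh(1-\pi^2) - 2ga - 2ha(1-\pi^2) - 2hb\pi^2}{(1 + f + g + h - a - b)^2} \\
&= 2\,\frac{(M-1)^{-1} f^2 + (B-1)^{-1}\left(g^2 + h^2 + a^2 + b^2 + 2gh(1-\pi^2) - 2ga - 2ha(1-\pi^2) - 2hb\pi^2\right)}{(1 + f + g + h - a - b)^2}.
\end{aligned}
$$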
Thus, the degrees of freedom are
$$
\begin{aligned}
v &= \frac{(1 + f + g + h - a - b)^2}{(M-1)^{-1} f^2 + (B-1)^{-1}\left(g^2 + h^2 + a^2 + b^2 + 2gh(1-\pi^2) - 2ga - 2ha(1-\pi^2) - 2hb\pi^2\right)} \\
&= \frac{\hat{V}^2}{(M-1)^{-1}\hat{V}_2^2 + (B-1)^{-1}\left[\hat{V}_3^2 + \hat{V}_5^2 + \hat{V}_6^2 + \hat{V}_7^2 + 2(1-\pi^2)\hat{V}_5(\hat{V}_3 - \hat{V}_6) - 2\hat{V}_3\hat{V}_6 - 2\hat{V}_5\hat{V}_7\pi^2\right]}.
\end{aligned}
$$
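As a minimal sketch of how the first form of $v$ could be evaluated in practice, the Python function below takes the six variance components, the numbers of imputations and bootstrap replicates, and the domain response rate, and returns the approximate degrees of freedom. The function name `approx_df` and its argument names are illustrative assumptions, not part of the original article.

```python
def approx_df(v1, v2, v3, v5, v6, v7, M, B, pi):
    """Approximate degrees of freedom v, using the ratio form in f, g, h, a, b.

    v1, v2, v3, v5, v6, v7 : variance components (hypothetical argument names)
    M  : number of imputations (at least 2)
    B  : number of bootstrap replicates (at least 2)
    pi : response rate for the domain of interest
    """
    if M < 2 or B < 2:
        raise ValueError("M and B must both be at least 2")
    if v1 <= 0:
        raise ValueError("v1 must be positive")

    # Ratios of each component to v1, as defined before the Taylor expansion.
    f, g, h, a, b = (x / v1 for x in (v2, v3, v5, v6, v7))
    total = 1.0 + f + g + h - a - b

    # Cross terms implied by the conditional covariances.
    cross = (2 * g * h * (1 - pi ** 2)
             - 2 * g * a
             - 2 * h * a * (1 - pi ** 2)
             - 2 * h * b * pi ** 2)
    denom = f ** 2 / (M - 1) + (g ** 2 + h ** 2 + a ** 2 + b ** 2 + cross) / (B - 1)
    return total ** 2 / denom


# Illustrative call with made-up component values.
print(approx_df(2.0, 1.0, 0.5, 0.8, 0.3, 0.2, M=10, B=200, pi=0.7))
```

When the bootstrap components $g$, $h$, $a$, and $b$ are all zero, the expression reduces algebraically to $(M-1)(1 + 1/f)^2$; for example, $f = 1$ and $M = 11$ give $v = 40$. Because only ratios to $\hat{V}_1$ enter, rescaling all components by a common factor leaves $v$ unchanged.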

References

  1. Clogg, C.C.; Rubin, D.B.; Schenker, N.; Schultz, B.; Weidman, L. Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. J. Am. Stat. Assoc. 1991, 86, 68–78.
  2. Schafer, J.L.; Ezzati-Rice, T.M.; Johnson, W.; Khare, M.; Little, R.J.A.; Rubin, D.B. The NHANES III multiple imputation project. Race Ethn. 1996, 60, 28–37.
  3. Gelman, A.; King, G.; Liu, C. Not asked and not answered: Multiple imputation for multiple surveys. J. Am. Stat. Assoc. 1998, 93, 846–857.
  4. Davey, A.; Shanahan, M.J.; Schafer, J.L. Correcting for selective nonresponse in the National Longitudinal Survey of Youth using multiple imputation. J. Hum. Resour. 2001, 36, 500–519.
  5. Taylor, J.M.G.; Cooper, K.L.; Wei, J.T.; Sarma, A.V.; Raghunathan, T.E.; Heeringa, S.G. Use of multiple imputation to correct for nonresponse bias in a survey of urologic symptoms among African-American men. Am. J. Epidemiol. 2002, 156, 774–782.
  6. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley: Hoboken, NJ, USA, 1987.
  7. Rubin, D.B. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 1996, 91, 473–489.
  8. Meng, X.L. Multiple-imputation inferences with uncongenial sources of input. Stat. Sci. 1994, 9, 538–558.
  9. Fay, R.E. When are Inferences from Multiple Imputation Valid? Proceedings of the Section on Survey Research Methods; U.S. Bureau of the Census: Washington, DC, USA, 1992; pp. 227–232.
  10. Fay, R.E. Valid inferences from imputed survey data. Surv. Res. Methods 1993, 41–48.
  11. Binder, D.A.; Sun, W.M. Frequency valid multiple imputation for surveys with a complex design. Surv. Res. Methods 1996, 281–286.
  12. Wang, N.; Robins, J.M. Large-sample theory for parametric multiple imputation procedures. Biometrika 1998, 85, 935–948.
  13. Nielsen, S.F. Proper and improper multiple imputation. Int. Stat. Rev. 2003, 71, 593–607.
  14. Robins, J.M.; Wang, N.S. Inference for imputation estimators. Biometrika 2000, 87, 113–124.
  15. Yang, S.; Kim, J.K. Fractional imputation in survey sampling: A comparative review. Stat. Sci. 2016, 31, 415–432.
  16. Kim, J.K.; Brick, J.M.; Fuller, W.A.; Kalton, G. On the bias of the multiple-imputation variance estimator in survey sampling. J. R. Stat. Soc. Ser. B 2006, 68, 509–521.
  17. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; Chapman and Hall: New York, NY, USA, 1993.
  18. Särndal, C.E. Methods for estimating the precision of survey estimates when imputation has been used. Surv. Methodol. 1992, 18, 241–252.
  19. Rao, J.N.K.; Shao, J. Jackknife variance estimation with survey data under hot deck imputation. Biometrika 1992, 79, 811–822.
  20. Rao, J.N.K. On variance estimation with imputed survey data. J. Am. Stat. Assoc. 1996, 91, 499–506.
  21. Shao, J.; Sitter, R.R. Bootstrap for imputed survey data. J. Am. Stat. Assoc. 1996, 91, 1278–1288.
  22. Shao, J.; Steel, P. Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. J. Am. Stat. Assoc. 1999, 94, 254–265.
  23. Haziza, D. Imputation and inference in the presence of missing data. In Handbook of Statistics: Sample Surveys: Theory Methods and Inference; Rao, C.R., Pfeffermann, D., Eds.; Elsevier: Amsterdam, The Netherlands, 2009; Volume 29A, pp. 215–246.
  24. Kim, J.K.; Rao, J.N.K. A unified approach to linearization variance estimation from survey data after imputation for item nonresponse. Biometrika 2009, 96, 917–932.
  25. Chen, S.; Haziza, D.; Léger, C.; Mashreghi, Z. Pseudo-population bootstrap methods for imputed survey data. Biometrika 2019, 106, 369–384.
  26. Lu, K.F.; Li, D.Y.; Koch, G.G. Comparison between two controlled multiple imputation methods for sensitivity analyses of time-to-event data with possibly informative censoring. Stat. Biopharm. Res. 2015, 7, 199–213.
  27. Gao, F.; Liu, G.F.; Zeng, D.; Xu, L.; Lin, B.; Diao, G.; Golm, G.; Heyse, J.F.; Ibrahim, J.G. Control-based imputation for sensitivity analyses in informative censoring for recurrent event data. Pharm. Stat. 2017, 16, 424–432.
  28. Schomaker, M.; Heumann, C. Bootstrap inference when using multiple imputation. Stat. Med. 2018, 37, 2252–2266.
  29. Darken, P.; Nyberg, J.; Ballal, S.; Wright, D. The attributable estimand: A new approach to account for intercurrent events. Pharm. Stat. 2020, 19, 626–635.
  30. Nguyen, T.L.; Collins, G.S.; Pellegrini, F.; Moons, K.G.; Debray, T.P. On the aggregation of published prognostic scores for causal inference in observational studies. Stat. Med. 2020, 39, 1440–1457.
  31. Bartlett, J.W.; Hughes, R.A. Bootstrap inference for multiple imputation under uncongeniality and misspecification. Stat. Methods Med. Res. 2020, 29, 3533–3546.
  32. Satterthwaite, F.E. An approximate distribution of estimates of variance components. Biom. Bull. 1946, 2, 110–114.
  33. Schenker, N.; Welsh, A.H. Asymptotic results for multiple imputation. Ann. Stat. 1988, 16, 1550–1566.
  34. Hox, J.J. Multilevel Analysis: Techniques and Applications; Lawrence Erlbaum: Mahwah, NJ, USA, 2002.
  35. Rubin, D.B.; Schenker, N. Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. J. Am. Stat. Assoc. 1986, 81, 366–374.
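Table 2. Simulation results for the linear regression mean estimation. Column groups: Rbias (%), Mwidth ($\times 10^2$), and 95cov ($\times 10^2$); within each group, the columns correspond to $\hat{V}$, $\hat{V}_{MI}$, $\hat{V}_K$, and $\hat{V}_B$.

| Scen | n | B | M | Rbias $\hat{V}$ | Rbias $\hat{V}_{MI}$ | Rbias $\hat{V}_K$ | Rbias $\hat{V}_B$ | Mwidth $\hat{V}$ | Mwidth $\hat{V}_{MI}$ | Mwidth $\hat{V}_K$ | Mwidth $\hat{V}_B$ | 95cov $\hat{V}$ | 95cov $\hat{V}_{MI}$ | 95cov $\hat{V}_K$ | 95cov $\hat{V}_B$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 500 | 500 | 10 | 0.7 | 0.4 | 0.4 | 0.1 | 21 | 22 | 22 | 22 | 95 | 95 | 95 | 96 |
| 1 | 500 | 500 | 30 | 0.8 | 0.2 | 0.3 | 0.1 | 20 | 20 | 20 | 20 | 95 | 95 | 95 | 95 |
| 1 | 500 | 200 | 10 | 0.6 | 0.4 | 0.4 | −0.1 | 21 | 22 | 22 | 22 | 95 | 95 | 95 | 96 |
| 1 | 500 | 200 | 30 | 0.7 | 0.2 | 0.3 | 0.0 | 20 | 20 | 20 | 20 | 95 | 95 | 95 | 95 |
| 1 | 1000 | 500 | 10 | −1.8 | −2.4 | −2.3 | −2.3 | 14 | 15 | 15 | 16 | 94 | 95 | 95 | 96 |
| 1 | 1000 | 500 | 30 | −2.4 | −2.9 | −2.9 | −2.9 | 14 | 14 | 14 | 14 | 94 | 95 | 95 | 95 |
| 1 | 1000 | 200 | 10 | −2.0 | −2.4 | −2.4 | −2.4 | 14 | 15 | 15 | 16 | 94 | 95 | 95 | 96 |
| 1 | 1000 | 200 | 30 | −2.4 | −2.9 | −2.9 | −2.8 | 14 | 14 | 14 | 14 | 94 | 95 | 95 | 95 |
| 2 | 500 | 500 | 10 | −6.0 | −5.0 | −5.0 | −5.0 | 13 | 14 | 14 | 14 | 94 | 94 | 94 | 95 |
| 2 | 500 | 500 | 30 | −5.0 | −4.0 | −4.0 | −4.1 | 13 | 13 | 13 | 13 | 94 | 95 | 95 | 95 |
| 2 | 500 | 200 | 10 | −6.0 | −5.0 | −5.0 | −5.0 | 13 | 14 | 14 | 14 | 94 | 94 | 94 | 95 |
| 2 | 500 | 200 | 30 | −5.0 | −4.0 | −4.0 | −4.0 | 13 | 13 | 13 | 13 | 94 | 95 | 95 | 95 |
| 2 | 1000 | 500 | 10 | −1.0 | 0.4 | 0.4 | 0.0 | 9 | 10 | 10 | 10 | 95 | 95 | 95 | 96 |
| 2 | 1000 | 500 | 30 | −0.6 | 0.8 | 0.8 | 0.6 | 9 | 9 | 9 | 9 | 95 | 95 | 95 | 95 |
| 2 | 1000 | 200 | 10 | −1.1 | 0.4 | 0.4 | −0.1 | 9 | 10 | 10 | 10 | 95 | 95 | 95 | 96 |
| 2 | 1000 | 200 | 30 | −0.8 | 0.8 | 0.8 | 0.7 | 9 | 9 | 9 | 9 | 95 | 95 | 95 | 95 |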
Share and Cite

MDPI and ACS Style

Yu, L.; Zhao, Y. A Bootstrap Method for a Multiple-Imputation Variance Estimator in Survey Sampling. Stats 2022, 5, 1231-1241. https://doi.org/10.3390/stats5040074
