Article

A Nuisance-Free Inference Procedure Accounting for the Unknown Missingness with Application to Electronic Health Records

Jiwei Zhao 1,* and Chi Chen 2
1 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
2 Novartis Institutes for Biomedical Research, Shanghai 201203, China
* Author to whom correspondence should be addressed.
Entropy 2020, 22(10), 1154; https://doi.org/10.3390/e22101154
Submission received: 24 August 2020 / Revised: 27 September 2020 / Accepted: 12 October 2020 / Published: 14 October 2020

Abstract

We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider may be a traditional setting or a modern high-dimensional setting where a sparsity assumption is usually imposed and regularization is commonly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt a conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via a data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for the post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., for patient-reported outcomes or real world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.

1. Introduction

A major step towards scientific discovery is to identify useful associations among various features and to quantify their uncertainties. This usually warrants building a regression model for an outcome variable and estimating the coefficient associated with each feature, as well as the precision of that estimate. Beyond traditional regression with small dimensionality, and with advances in biotechnology, modern high-dimensional regression usually posits a sparse parameter in the model and then applies regularization to select the significant features and recover the sparsity. In particular, post-selection inference can be challenging in a regularized regression framework. In this paper, our main interest is a regression model where the outcome variable is prone to missing values. We study both the traditional setting where regularization is not needed and the modern one with regularization.
The missing data issue is an inevitable concern for statistical analysis in various disciplines, ranging from biomedical studies to the social sciences. In many applications, the occurrence of missing data is usually not the investigator's primary interest but complicates the statistical analysis. The validity of any method devised for missing data depends heavily on the assumed missingness mechanism [1]. Unfortunately, such assumptions are largely unknown and difficult, if not infeasible, to test empirically. Therefore, one prefers to concentrate on analyzing the regression model for the outcome variable while treating the mechanism model as a nuisance. A flexible assumption, imposed at a minimal level on the mechanism, provides protection against model misspecification at that level.
While it is indeed promising to regard the missingness mechanism as a nuisance under a flexible assumption, a potential issue is model identifiability when the mechanism contains missing-not-at-random cases, i.e., when the mechanism is allowed to depend on the missing values themselves. In the past few years, researchers have made great progress on this topic by introducing a so-called instrument, which can be either a shadow variable [2,3,4,5,6,7] or an instrumental variable [8,9]. Both approaches are reasonable and suit different applications. In this paper, we adopt the shadow variable approach, as it facilitates the interpretability of the regression model for the outcome. The details of the shadow variable approach are articulated throughout the paper.
Therefore, we proceed within a semiparametric framework where our primary interest is a parametric regression, e.g., a linear model, and the statistical task is to estimate the parameter of interest and conduct statistical inference (particularly post-selection inference in the setting with regularization). For the nuisance missingness mechanism, we only impose a nonparametric assumption without specifying a concrete form. We denote the shadow variable by $Z$, which is one component of the covariate $X$. In general, a shadow variable of smaller dimensionality allows more flexibility in the missingness mechanism; therefore, although it could be multidimensional, we only consider univariate $Z$ throughout the paper. With all of these ingredients, we analyze a conditional likelihood approach that eventually results in a nuisance-free procedure for parameter estimation and statistical inference.
There are at least two further highlights of our proposed method that are worth mentioning. The first pertains to the algorithm and computation. Although the conditional likelihood function looks complicated at first sight, we show that, via a data manipulation strategy, it can be written analytically as the likelihood of a conventional logistic regression with a prespecified data format. Therefore, our objective function can readily be optimized by many existing software packages, which greatly alleviates the computational burden of our procedure. Second, while variance estimation under the traditional setting is straightforward following the asymptotic approximation, it is challenging in the setting with regularization. To resolve this problem, we present an easy-to-implement, data-driven method that estimates the variance of the regularized estimator via a data perturbation technique. It is noted that the current literature on inference procedures for regularized estimation in the presence of missing values is very scarce. The authors of [10,11,12] all considered the model selection problem under high dimensionality with missing data; however, none of them studied post-selection inference in this context.
The remainder of the paper is structured as follows. In Section 2, we first lay out our model formulation and introduce the shadow variable and the conditional likelihood. Section 3 details the traditional setting without regularization: we present our algorithm for maximizing the conditional likelihood function, the theory deriving the asymptotic representation of our proposed estimator, and how to estimate its variance. Section 4 is devoted to the modern setting where the sparsity assumption is imposed and the regularization technique is adopted; both the algorithm and the theory, as well as variance estimation through the data perturbation technique, are presented. In Section 5, we conduct comprehensive simulation studies to examine the finite sample performance of our proposed estimator and compare it with some existing methods. Section 6 applies our method to the regression model for the albumin level, which suffers from a large amount of missing values in the MIMIC-III study [13]. The paper concludes with a discussion in Section 7.

2. Methodology

Denote the outcome variable by $Y$ and the covariate by $X$. We assume $X = (U^T, Z)^T$, where $U$ is $p$-dimensional and $Z$ is univariate, with a detailed interpretation given later. We consider the linear model
$$Y = \alpha + \beta^T U + \gamma Z + \epsilon, \tag{1}$$
where $\beta$ is also $p$-dimensional, $\alpha$ and $\gamma$ are scalars, the true value of $\gamma$, denoted $\gamma_0$, is nonzero, and $\epsilon \sim N(0, \sigma^2)$. We consider the situation where $Y$ has missing values while $X$ is fully observed. We introduce a binary variable $R$ to indicate missingness: $R = 1$ if $Y$ is observed and $R = 0$ if missing. To allow the greatest flexibility of the missingness mechanism model, we assume
$$\mathrm{pr}(R = 1 \mid Y, X) = \mathrm{pr}(R = 1 \mid Y, U) = s(Y, U), \tag{2}$$
where $s(\cdot)$ merely represents an unknown and unspecified function not depending on $Z$. We reiterate that, as assumption (2), in a nonparametric flavor, does not specify a concrete form of $s(\cdot)$, one need not worry about misspecification of the mechanism model. Moreover, as it allows dependence on $Y$, assumption (2) covers, besides missing-completely-at-random (MCAR) and many scenarios of missing-at-random (MAR), various situations of missing-not-at-random (MNAR).
We term $Z$ the shadow variable, following the works in [5,6,7,14]. Its existence depends on whether it is sensible that $Z$ and $R$ are conditionally independent (given $Y$ and $U$) and that $Y$ heavily relies on $Z$ (as $\gamma_0 \neq 0$). There are many examples in the literature documenting that the existence of $Z$ is practically reasonable. In application, a surrogate or a proxy of the outcome variable $Y$ that does not simultaneously affect the missingness mechanism can be a good choice for the shadow variable $Z$.
We assume independent and identically distributed observations $\{r_i, y_i, u_i, z_i\}$ for $i = 1, \ldots, N$, where the first $n$ subjects are free of missing data. We now present an $s(\cdot)$-free procedure via the conditional likelihood. Denote $V = (Y, U^T)^T$. We start with
$$\prod_{i=1}^n p(v_i \mid z_i, r_i = 1) = \prod_{i=1}^n \frac{s(v_i)}{g(z_i)}\, p(v_i \mid z_i),$$
where $g(z_i) = \mathrm{pr}(r_i = 1 \mid z_i) = \int \mathrm{pr}(r_i = 1 \mid v)\, p(v \mid z_i)\, dv$ and $p(\cdot \mid \cdot)$ is a generic notation for a conditional probability density/mass function. If $V$ were univariate, denoting by $A$ the rank statistic of $\{v_1, \ldots, v_n\}$, we would have
$$\prod_{i=1}^n p(v_i \mid z_i, r_i = 1) = p(v_1, \ldots, v_n \mid z_1, \ldots, z_n, r_1 = \cdots = r_n = 1) = p(A \mid v_{(1)}, \ldots, v_{(n)}, z_1, \ldots, z_n, r_1 = \cdots = r_n = 1)\; p(v_{(1)}, \ldots, v_{(n)} \mid z_1, \ldots, z_n, r_1 = \cdots = r_n = 1). \tag{3}$$
The conditional likelihood that we use, the first term on the right-hand side of (3), is exactly
$$p(A \mid v_{(1)}, \ldots, v_{(n)}, z_1, \ldots, z_n, r_1 = \cdots = r_n = 1) = \frac{p(v_1, \ldots, v_n \mid z_1, \ldots, z_n, r_1 = \cdots = r_n = 1)}{p(v_{(1)}, \ldots, v_{(n)} \mid z_1, \ldots, z_n, r_1 = \cdots = r_n = 1)} = \frac{\prod_{i=1}^n p(v_i \mid z_i, r_i = 1)}{\sum_{\omega \in \Omega} \prod_{i=1}^n p(v_{\omega(i)} \mid z_i, r_i = 1)} = \frac{\prod_{i=1}^n p(v_i \mid z_i)}{\sum_{\omega \in \Omega} \prod_{i=1}^n p(v_{\omega(i)} \mid z_i)}, \tag{4}$$
where $\Omega$ represents the collection of all one-to-one mappings from $\{1, \ldots, n\}$ to $\{1, \ldots, n\}$. Now (4) is nuisance-free and can be used to estimate the unknown parameters in $p(v_i \mid z_i)$.
Although $V$ is multidimensional in our case, the idea presented above still applies and leads to
$$\frac{\prod_{i=1}^n p(y_i, u_i \mid z_i, r_i = 1)}{\sum_{\omega \in \Omega} \prod_{i=1}^n p(y_{\omega(i)}, u_{\omega(i)} \mid z_i, r_i = 1)} = \frac{\prod_{i=1}^n p(y_i, u_i \mid z_i)}{\sum_{\omega \in \Omega} \prod_{i=1}^n p(y_{\omega(i)}, u_{\omega(i)} \mid z_i)}. \tag{5}$$
Furthermore, to simplify the computation, we adopt the pairwise version of (5), following the previous discussion on the pairwise pseudo-likelihood in [15], which results in
$$\prod_{1 \le i < j \le n} \frac{p(y_i, u_i \mid z_i)\, p(y_j, u_j \mid z_j)}{p(y_i, u_i \mid z_i)\, p(y_j, u_j \mid z_j) + p(y_i, u_i \mid z_j)\, p(y_j, u_j \mid z_i)}.$$
After plugging in model (1) and some algebra, the objective eventually becomes minimizing
$$L(\theta) = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \phi_{ij}(\theta) = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} r_i r_j \log\left\{1 + W_{ij} \exp(\theta^T d_{ij})\right\}, \tag{6}$$
where $\theta = (\tilde{\gamma}, \tilde{\beta}^T)^T$, $\tilde{\gamma} = \gamma/\sigma^2$, $\tilde{\beta} = \tilde{\gamma}\beta$, $d_{ij} = (y_{ij} z_{ij}, u_{ij}^T z_{ij})^T$, $y_{ij} = y_i - y_j$, $u_{ij} = u_i - u_j$, $z_{ij} = z_i - z_j$, and $W_{ij} = p(z_i \mid u_j)\, p(z_j \mid u_i)/\{p(z_i \mid u_i)\, p(z_j \mid u_j)\}$.
Denote the minimizer of (6) by $\hat{\theta}$. By checking that
$$\frac{\partial^2 \phi_{ij}(\theta)}{\partial\theta\, \partial\theta^T} = \frac{r_i r_j\, W_{ij} \exp(\theta^T d_{ij})}{\left\{1 + W_{ij} \exp(\theta^T d_{ij})\right\}^2}\, d_{ij} d_{ij}^T$$
is positive definite, we see that $\hat{\theta}$ uniquely exists. To compute $\hat{\theta}$, one also needs a model for $W_{ij}$. Fortunately, this model only depends on the fully observed data $x_i$ and $x_j$: essentially any existing parametric, semiparametric, or nonparametric modeling technique for $p(z \mid u)$ can be used, and $W_{ij}$ can be estimated accordingly. Throughout, we denote by $\hat{W}_{ij}$ an available well-behaved estimator of $W_{ij}$. Although our procedure stems from $p(y, u \mid z, r = 1)$, which only relies on the data $\{y_i, x_i\}$ with $r_i = 1$, it can be seen that not only are the data $\{y_i, x_i\}$ with $r_i = 1$ used to compute $\hat{\theta}$; the data $\{x_i\}$ with $r_i = 0$ are also used in the process of estimating $W_{ij}$. Therefore, all observed data, from both completely observed subjects and partially observed subjects, are utilized in our procedure.
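For concreteness, the following is a minimal sketch (in Python, with illustrative names) of how $\hat{W}_{ij}$ could be obtained under a working normal linear model $z \mid u \sim N(\alpha_z + \eta^T u, \sigma_z^2)$ fitted by least squares; under this working model, the ratio defining $W_{ij}$ simplifies to $\log W_{ij} = -z_{ij}\, \eta^T u_{ij}/\sigma_z^2$.

```python
import numpy as np

def log_W_hat(u, z):
    """Estimate log W_ij for all pairs under a working normal linear model
    z | u ~ N(alpha_z + eta^T u, sigma_z^2), fitted by ordinary least squares.
    Under this model, log W_ij = -z_ij * eta^T u_ij / sigma_z^2."""
    n, p = u.shape
    design = np.column_stack([np.ones(n), u])     # intercept + covariates
    coef, *_ = np.linalg.lstsq(design, z, rcond=None)
    sigma2 = np.sum((z - design @ coef) ** 2) / (n - p - 1)
    eta = coef[1:]
    z_diff = z[:, None] - z[None, :]              # z_i - z_j for all pairs
    eu = u @ eta
    eu_diff = eu[:, None] - eu[None, :]           # eta^T (u_i - u_j)
    return -z_diff * eu_diff / sigma2             # n x n matrix of log W_ij
```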
One may notice that, due to assumption (2), which allows the greatest flexibility of the mechanism model, and the adoption of the conditional likelihood, not all of the parameters $\alpha$, $\beta$, $\gamma$, and $\sigma^2$ are estimable. Nevertheless, the parameter $\beta$, which quantifies the association between $Y$ and $U$ after adjusting for $Z$ and is of primary scientific interest, is fully estimable. The remainder of the paper focuses on the estimation and inference of $\beta$, as well as the variable selection procedure based on $\beta$.
Before moving on, we compare with the existing literature to underline the novel contributions of this paper. Based on a slightly different but more restrictive missingness mechanism assumption, $\mathrm{pr}(R = 1 \mid Y, X) = a(Y)\, b(X)$, the works in [16,17,18] used a similar idea to analyze non-ignorable missing data for a generalized linear model and a semiparametric proportional likelihood ratio model, respectively. They focused on different aspects of how to use the conditional likelihoods and their consequences, such as the partial identifiability issue and the large bias issue. In this paper, we focus on the linear model (1), and we have just shown that the parameter $\beta$ is fully identifiable. The method presented in this paper can be applied to different models, but their identifiability problems and other relevant issues have to be analyzed on a case-by-case basis. For instance, Ref. [19] studied the parameter estimation problem in a logistic regression model with low dimensionality under assumption (2). They showed that, different from the current paper, all the unknown parameters are identifiable in their context. However, because of the complexity of their objective function, the algorithm studied in [19] is nontrivial and cannot be extended to a high-dimensional setting.

3. Traditional Setting without Regularization

Computation. Directly minimizing $L(\theta)$ is feasible but computationally intensive. By rearranging the terms in $L(\theta)$, we realize that it can be rewritten as the negative log-likelihood function of a standard logistic regression model. To be more specific, let $k$ index the pair $(i, j)$, with $k = 1, \ldots, K$ and $K = \binom{n}{2}$. Then,
$$L(\theta) = \frac{1}{K} \sum_{k=1}^K \log\left\{1 + \exp\left(s_k\, \theta^T t_k + \log \hat{W}_k\right)\right\}, \tag{7}$$
where $s_k = \mathrm{sign}(z_{ij})$ and $t_k = (|z_{ij}|\, y_{ij}, |z_{ij}|\, u_{ij}^T)^T$. Denote $g_k = I\{z_{ij} > 0\}$; then one can show that the summand in (7) equals
$$-g_k\left(\theta^T t_k + s_k \log \hat{W}_k\right) + \log\left\{1 + \exp\left(\theta^T t_k + s_k \log \hat{W}_k\right)\right\},$$
which is the contribution of the $k$-th subject to the negative log-likelihood of a logistic regression with $g_k$ as the response, $\theta$ as the coefficient vector, $t_k$ as the covariates, and $s_k \log \hat{W}_k$ as the offset term, but without an intercept. Therefore, $\hat{\theta}$ can be obtained by fitting the aforementioned logistic regression model. Algorithm 1 describes the data manipulation and model fitting steps used to estimate $\theta$ under this traditional setting.
Algorithm 1 Minimization of (6) without penalization
1: Inputs: $\{y_i, u_i, z_i\}$ and $\hat{W}_{ij}$, for $i, j = 1, \ldots, n$
2: Initialize: $k \leftarrow 0$
3: for $j \in \{2, \ldots, n\}$ do
4:  for $i \in \{1, \ldots, j-1\}$ do
5:    $k \leftarrow k + 1$
6:    $y_{ij} \leftarrow y_i - y_j$, $u_{ij} \leftarrow u_i - u_j$, $z_{ij} \leftarrow z_i - z_j$, $\hat{W}_k \leftarrow \hat{W}_{ij}$
7:    $g_k \leftarrow I\{z_{ij} > 0\}$
8:    $s_k \leftarrow \mathrm{sign}(z_{ij})$
9:    $t_k \leftarrow (|z_{ij}|\, y_{ij}, |z_{ij}|\, u_{ij}^T)^T$
10: Fit a logistic regression with response $g_k$, covariates $t_k$, offset $s_k \log \hat{W}_k$, and no intercept.
11: Outputs: $\hat{\theta}$
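As a companion to Algorithm 1, the sketch below (Python, assuming the statsmodels package and a precomputed matrix log_W of estimated $\log \hat{W}_{ij}$, e.g., from the earlier sketch; all names are illustrative) builds the pairwise data and fits the offset logistic regression.

```python
import numpy as np
import statsmodels.api as sm

def fit_conditional_likelihood(y, u, z, log_W):
    """Fit the logistic-regression reformulation (7) of objective (6).
    y, z: length-n arrays of complete cases; u: n x p matrix;
    log_W[i, j] holds the estimated log W_ij."""
    n = len(y)
    g, t, offset = [], [], []
    for j in range(1, n):
        for i in range(j):
            z_ij = z[i] - z[j]
            if z_ij == 0:            # tied z's carry no information here
                continue
            g.append(1.0 if z_ij > 0 else 0.0)
            t.append(np.concatenate(([abs(z_ij) * (y[i] - y[j])],
                                     abs(z_ij) * (u[i] - u[j]))))
            offset.append(np.sign(z_ij) * log_W[i, j])
    # GLM adds no intercept unless one is included in the design matrix
    model = sm.GLM(np.array(g), np.asarray(t),
                   family=sm.families.Binomial(), offset=np.array(offset))
    return model.fit().params        # theta_hat = (gamma_tilde, beta_tilde)
```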
Asymptotic Theory. The asymptotic theory of $\hat{\theta}$ involves a model for $p(z \mid u)$, which does not involve any missing values; therefore, any statistical model, whether parametric, semiparametric, or nonparametric, can be used. For simplicity, we only discuss the parametric case here; further elaborations are deferred to Section 7. For a parametric model $p(z \mid u; \eta)$, one can apply the standard maximum likelihood estimator $\hat{\eta}$. Here, we simply assume
$$\sqrt{N}\left(\hat{\eta} - \eta_0\right) = G^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\partial}{\partial \eta} \log p(z_i \mid u_i; \eta_0) + o_p(1), \tag{8}$$
where $G = E\left\{-\partial^2 \log p(z \mid u; \eta_0)/\partial\eta\, \partial\eta^T\right\}$, $E\left\|\partial^2 \log p(z \mid u; \eta_0)/\partial\eta\, \partial\eta^T\right\|^2 < \infty$, $\eta_0$ is the true value of $\eta$, and $\|M\| = \{\mathrm{trace}(M M^T)\}^{1/2}$ for a matrix $M$. With this prerequisite, we have the following result for $\hat{\theta}$, whose proof is provided in Appendix A.
Theorem 1.
Assume (8) as well as $E\left\|\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\theta^T\right\|^2 < \infty$. Denote by $\theta_0$ the true value of $\theta$. Then
$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} N\left(0,\, A^{-1} \Sigma A^{-1}\right),$$
where $A = E\left\{\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\theta^T\right\}$, $\Sigma = 4 E\left\{\lambda_{12}(\theta_0, \eta_0)\, \lambda_{13}(\theta_0, \eta_0)^T\right\}$, $\lambda_{ij}(\theta_0, \eta_0) = B G^{-1} M_{ij}(\eta_0) + N_{ij}(\theta_0, \eta_0)$, $B = E\left\{\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\eta^T\right\}$, $M_{ij}(\eta_0) = \frac{1}{2}\left\{\frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \frac{\partial}{\partial\eta}\log p(z_j \mid u_j; \eta_0)\right\}$, and $N_{ij}(\theta_0, \eta_0) = \partial \phi_{ij}(\theta_0, \eta_0)/\partial\theta$.
If one prefers the asymptotic result for $\hat{\beta}$, we have the following.
Corollary 1.
Let $C$ be the $p \times (p + 1)$ matrix such that $C\theta = \beta$, i.e.,
$$C = \frac{1}{\tilde{\gamma}}\begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}.$$
Denote by $\beta_0$ the true value of $\beta$. Then, following Theorem 1, we have $\sqrt{N}\left(\hat{\beta} - \beta_0\right) \xrightarrow{d} N\left(0,\, C A^{-1} \Sigma A^{-1} C^T\right)$.
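In code, the mapping of Corollary 1 is immediate; the numpy sketch below (illustrative names, with $C$ evaluated at the estimate $\hat{\tilde{\gamma}}$) transforms $\hat{\theta}$ and its covariance.

```python
import numpy as np

def beta_and_cov(theta_hat, cov_theta):
    """Map theta = (gamma_tilde, beta_tilde) to beta = beta_tilde / gamma_tilde
    and transform the covariance via the matrix C of Corollary 1."""
    gamma_t = theta_hat[0]
    p = len(theta_hat) - 1
    C = np.hstack([np.zeros((p, 1)), np.eye(p) / gamma_t])  # C @ theta = beta
    return C @ theta_hat, C @ cov_theta @ C.T
```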
Variance Estimation. With Theorem 1 and Corollary 1, variance estimation is straightforward via the plug-in strategy. Note that $\mathrm{var}(\hat{\theta}) = \frac{1}{N} A^{-1} \Sigma A^{-1}$; one then has the estimate $\widehat{\mathrm{var}}(\hat{\theta}) = \frac{1}{N} \hat{A}^{-1} \hat{\Sigma} \hat{A}^{-1}$, where
$$\hat{A} = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \frac{\partial^2 \phi_{ij}(\hat{\theta}, \hat{\eta})}{\partial\theta\, \partial\theta^T}, \qquad \hat{\Sigma} = \frac{4}{N} \sum_{i=1}^N \left[\frac{1}{N-1} \sum_{j=1, j \neq i}^N \left\{\hat{B} \hat{G}^{-1} M_{ij}(\hat{\eta}) + N_{ij}(\hat{\theta}, \hat{\eta})\right\}\right]^{\otimes 2},$$
$$\hat{B} = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \frac{\partial^2 \phi_{ij}(\hat{\theta}, \hat{\eta})}{\partial\theta\, \partial\eta^T}, \qquad \hat{G} = -\frac{1}{N} \sum_{i=1}^N \frac{\partial^2}{\partial\eta\, \partial\eta^T} \log p(z_i \mid u_i; \hat{\eta}),$$
and $a^{\otimes 2} = a a^T$ for a vector $a$.
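As an illustration, $\hat{A}$ has a closed form thanks to the logistic structure of $\phi_{ij}$ (the Hessian displayed in Section 2); a numpy sketch with illustrative names follows. The pieces $\hat{B}$, $\hat{G}$, and $M_{ij}$ depend on the chosen model for $p(z \mid u)$ and are omitted here.

```python
import numpy as np

def A_hat(theta_hat, d, log_W, rr):
    """Plug-in estimate of A via the closed-form Hessian of phi_ij:
    r_i r_j * W e / (1 + W e)^2 * d_ij d_ij^T, with e = exp(theta^T d_ij).
    d: K x (p+1) rows of d_ij; log_W: length-K log W_hat_ij;
    rr: length-K products r_i * r_j (all ones if pairs are complete cases)."""
    lin = d @ theta_hat + log_W                    # theta^T d_ij + log W_ij
    w = rr * np.exp(lin) / (1.0 + np.exp(lin)) ** 2
    return (d * w[:, None]).T @ d / len(d)         # average over the K pairs
```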

4. Modern Setting with Regularization

In the past few decades, it has become standard practice to consider the high-dimensional regression model, where one assumes the parameter $\beta$ is sparse and uses a regularization technique to recover the sparsity. While it is a prominent problem to analyze this type of model when the data are prone to missing values, the literature is quite scarce, primarily because it is cumbersome to rigorously address the missingness under high dimensionality. It is therefore valuable to extend the nuisance-free likelihood procedure proposed in Section 3 to the setting with regularization.
Computation. Regularization is a powerful technique for identifying the zero elements of a sparse parameter in a regression model. Various penalty functions have been extensively studied, such as LASSO [20], SCAD [21], and MCP [22]. In particular, we study the adaptive LASSO penalty [23], with the objective of minimizing the function
$$L_\lambda(\theta) = L(\theta) + \sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\, |\tilde{\beta}_j|, \tag{9}$$
where $\lambda > 0$ is the tuning parameter. Following [23], $\hat{\tilde{\beta}}_j$ is a root-$N$-consistent estimator of $\tilde{\beta}_j$; for example, one can use the estimator obtained by minimizing the unregularized objective function (6). Obviously, the penalty term in (9) does not alter the numerical characteristics of $L(\theta)$ presented in Section 3: $L_\lambda(\theta)$ is essentially the regularized negative log-likelihood of a logistic regression model in the same format as discussed in (7).
To choose the tuning parameter $\lambda$, one can follow either cross-validation or various information-based criteria, all of which have been extensively studied in the literature. In this paper, we follow the Bayesian information criterion (BIC) to determine $\lambda$. Specifically, we choose $\lambda$ as the minimizer of the BIC function
$$\mathrm{BIC}(\lambda) = 2 L(\hat{\theta}_\lambda) + p_\lambda \frac{\log(n)}{n},$$
where $p_\lambda$ is the number of nonzero elements in $\hat{\tilde{\beta}}_\lambda$ and the minimizer of (9) is written as $\hat{\theta}_\lambda = (\hat{\tilde{\gamma}}_\lambda, \hat{\tilde{\beta}}_\lambda^T)^T$. We summarize the whole computation pipeline as Algorithm 2 below.
Algorithm 2 Minimization of (9) with the adaptive LASSO penalty
1: Inputs: $\{y_i, u_i, z_i\}$ and $\hat{W}_{ij}$, for $i, j = 1, \ldots, n$
2: Initialize: $k \leftarrow 0$
3: for $j \in \{2, \ldots, n\}$ do
4:  for $i \in \{1, \ldots, j-1\}$ do
5:    $k \leftarrow k + 1$
6:    $y_{ij} \leftarrow y_i - y_j$, $u_{ij} \leftarrow u_i - u_j$, $z_{ij} \leftarrow z_i - z_j$, $\hat{W}_k \leftarrow \hat{W}_{ij}$
7:    $g_k \leftarrow I\{z_{ij} > 0\}$
8:    $s_k \leftarrow \mathrm{sign}(z_{ij})$
9:    $t_k \leftarrow (|z_{ij}|\, y_{ij}, |z_{ij}|\, u_{ij}^T)^T$
10: Fit a logistic regression with response $g_k$, covariates $t_k$, offset $s_k \log \hat{W}_k$, and no intercept.
11: Obtain the unregularized estimate $\hat{\theta}$, whose components supply the adaptive weights.
12: Fit the logistic regression with the adaptive LASSO penalty.
13: Find the $\lambda$ that minimizes $\mathrm{BIC}(\lambda)$.
14: Outputs: $\hat{\theta}_\lambda$ at the selected $\lambda$
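A sketch of steps 11-13 using statsmodels' elastic-net solver (L1_wt=1.0 gives a pure LASSO, and a per-coefficient alpha carries the adaptive weights); the function and argument names are illustrative, and evaluating $L$ through the GLM log-likelihood divided by $K$ reflects the normalization of (7).

```python
import numpy as np
import statsmodels.api as sm

def fit_alasso_bic(g, t, offset, theta_init, lambdas, n):
    """Steps 11-13 of Algorithm 2: adaptive-LASSO fit tuned by BIC.
    g, t, offset: pairwise data from steps 1-10; theta_init: the
    unregularized estimate from step 11; n: number of complete cases."""
    K = len(g)
    model = sm.GLM(g, t, family=sm.families.Binomial(), offset=offset)
    weights = 1.0 / np.abs(theta_init)
    weights[0] = 0.0                      # gamma_tilde is unpenalized in (9)
    best_bic, best_theta = np.inf, None
    for lam in lambdas:
        theta = model.fit_regularized(alpha=lam * weights, L1_wt=1.0).params
        L_val = -model.loglike(theta) / K          # objective (7) at theta
        p_lam = int(np.sum(theta[1:] != 0))        # nonzero beta_tilde's
        bic = 2 * L_val + p_lam * np.log(n) / n
        if bic < best_bic:
            best_bic, best_theta = bic, theta
    return best_theta
```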
Asymptotic Theory. Recall that $\theta = (\tilde{\gamma}, \tilde{\beta}^T)^T$. Without loss of generality, we assume the first $p_0$ parameters in $\tilde{\beta}$ are nonzero, where $1 \le p_0 < p$. For simplicity, we denote by $\theta_T = (\tilde{\gamma}, \tilde{\beta}_1, \ldots, \tilde{\beta}_{p_0})^T$ the vector of nonzero components and by $\theta_{T^c} = (\tilde{\beta}_{p_0+1}, \ldots, \tilde{\beta}_p)^T$ the vector of zeros.
In Theorem 1, we defined $A = E\{\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\theta^T\}$, a $(p+1) \times (p+1)$ matrix. We now assume it can be partitioned as
$$A = \begin{pmatrix} A_1 & A_2 \\ A_2^T & A_3 \end{pmatrix},$$
where $A_1$ is the $(p_0+1) \times (p_0+1)$ submatrix corresponding to $\theta_T$. Similarly, we defined $\Sigma = 4E\{\lambda_{12}(\theta_0, \eta_0)\lambda_{13}(\theta_0, \eta_0)^T\}$, and we also assume the partition
$$\Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_2 \\ \Sigma_2^T & \Sigma_3 \end{pmatrix},$$
where $\Sigma_1$ is the $(p_0+1) \times (p_0+1)$ submatrix corresponding to $\theta_T$ as well. We write the minimizer of (9) as $\hat{\theta}_\lambda = (\hat{\theta}_{\lambda,T}^T, \hat{\theta}_{\lambda,T^c}^T)^T$ and its true value as $\theta_0 = (\theta_{0,T}^T, \theta_{0,T^c}^T)^T$.
Now, we present the oracle property pertaining to $\hat{\theta}_\lambda$, which includes the asymptotic normality of the nonzero components and the variable selection consistency. The proof is provided in Appendix B.
Theorem 2.
Assume (8), that $A_1$ is positive definite, and that $E\|\partial \phi_{ij}(\theta_0, \eta_0)/\partial\theta\|^2 < \infty$ for each $\theta$ in a neighborhood of $\theta_0$. We also assume $\sqrt{N}\lambda \to 0$ and $N\lambda \to \infty$. Then,
$$\sqrt{N}\left(\hat{\theta}_{\lambda,T} - \theta_{0,T}\right) \xrightarrow{d} N\left(0,\, A_1^{-1} \Sigma_1 A_1^{-1}\right).$$
In addition, let $T_N = \{j \in 1, \ldots, p : \hat{\tilde{\beta}}_{j,\lambda} \neq 0\}$ and $T = \{j \in 1, \ldots, p : \tilde{\beta}_{j,0} \neq 0\}$; then
$$\lim_{N \to \infty} \mathrm{pr}(T_N = T) = 1.$$
Variance Estimation. Although the above theory provides a rigorous justification of the asymptotic properties of $\hat{\theta}_\lambda$, it does not, in practice, guide standard error estimation. Here, we propose a data perturbation approach for variance estimation. Specifically, following [24], we generate a set of independent and identically distributed positive random variables $\Xi = \{\xi_i, i = 1, \ldots, N\}$ with $E(\xi_i) = 1$ and $\mathrm{var}(\xi_i) = 1$, e.g., from the standard exponential distribution. Since our objective has a U-statistic structure, we perturb it by attaching $\kappa_{ij} = \xi_i \xi_j$ to each of its pairwise terms. We first obtain the estimator $\hat{\theta}^*$ by minimizing the perturbed version of (6):
$$L^*(\theta) = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \kappa_{ij}\, \phi_{ij}(\theta).$$
Then, we obtain the estimator $\hat{\theta}^*_\lambda$ by minimizing the perturbed version of (9):
$$L^*_\lambda(\theta) = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \kappa_{ij}\, \phi_{ij}(\theta) + \sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}^*_j|}\, |\tilde{\beta}_j|,$$
where the optimal $\lambda$ is again computed by the BIC. We repeat this data perturbation scheme a large number of times, say, $M$.
Following the theory in [25,26], under some regularity conditions, one can first show that $\sqrt{N}(\hat{\theta}^*_{\lambda,T} - \theta_{0,T})$ converges in distribution to $N(0, A_1^{-1}\Sigma_1 A_1^{-1})$, the same limiting distribution as that of $\sqrt{N}(\hat{\theta}_{\lambda,T} - \theta_{0,T})$. Furthermore, one can also show $\mathrm{pr}^*(\hat{\theta}^*_{\lambda,T^c} = 0) \to 1$, where $\mathrm{pr}^*$ is the probability measure generated by the original data $\mathcal{X}$ and the perturbation data $\Xi$. In addition, one can show that the distribution of $\sqrt{N}(\hat{\theta}^*_{\lambda,T} - \hat{\theta}_{\lambda,T})$ conditional on the data can be used to approximate the unconditional distribution of $\sqrt{N}(\hat{\theta}_{\lambda,T} - \theta_{0,T})$, and that $\mathrm{pr}^*(\hat{\theta}^*_{\lambda,T^c} = 0 \mid \mathcal{X}) \to 1$.
To obtain a confidence interval for $\theta_j$, the $j$-th coordinate of $\theta$, the lower and upper bounds can be formed by $\hat{\theta}^*_{\lambda,j,\alpha/2}$ and $\hat{\theta}^*_{\lambda,j,1-\alpha/2}$, respectively, where $\hat{\theta}^*_{\lambda,j,q}$ represents the $q$-th quantile of $\{\hat{\theta}^*_{\lambda,j,m}, m = 1, \ldots, M\}$.
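A minimal sketch of this perturbation scheme (Python; fit_weighted is a hypothetical callback that refits the regularized objective with pairwise weights $\kappa_{ij}$, e.g., by passing them as freq_weights to the GLM of Algorithm 2):

```python
import numpy as np

def perturbation_ci(fit_weighted, N, pair_index, M=500, alpha=0.05, seed=0):
    """Perturbation-based confidence intervals for the regularized estimator.
    fit_weighted(kappa): refit with length-K pair weights, return theta_hat;
    pair_index: K x 2 array mapping pair k to its subject indices (i, j)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(M):
        xi = rng.exponential(1.0, size=N)          # E(xi) = var(xi) = 1
        kappa = xi[pair_index[:, 0]] * xi[pair_index[:, 1]]
        draws.append(fit_weighted(kappa))
    draws = np.vstack(draws)
    return (np.quantile(draws, alpha / 2, axis=0),
            np.quantile(draws, 1 - alpha / 2, axis=0))
```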

5. Simulation Studies

We conduct comprehensive simulation studies to evaluate the finite sample performance of our proposed estimators and compare them with some existing methods. We first present the results for the model without regularization, then for the model with regularization.

5.1. Scenarios without Regularization

For the proposed estimator studied in Section 3, we generate $\{R_i, Y_i, U_i^T, Z_i\}$, $i = 1, \ldots, N$, independent and identically distributed copies of $(R, Y, U^T, Z)$, as follows. We first generate the random vector $U = (U_1, \ldots, U_p)^T$ with $U_i \sim N(0.5, 1)$ and $p = 4$, and then generate $Z = \alpha_z + \eta^T U + \epsilon_z$ with $\alpha_z = 0.5$, $\eta = (0.5, 1, 1, 1.5)^T$, and $\epsilon_z \sim N(0, 1)$. Afterwards, the outcome variable $Y$ is generated following model (1) with $\alpha = 1$, $\beta = (0.5, 1, 1, 1.5)^T$, $\gamma = 0.5$, and $\epsilon \sim N(0, 1)$, and the missingness indicator $R$ is generated following $\mathrm{pr}(R = 1 \mid Y, U) = I(Y < 2.5, U_1 < 2, U_2 < 2, U_3 < 2, U_4 < 2)$, which results in around 40% missing values. We examine two situations, with sample sizes $N = 500$ and $N = 1000$, respectively. Besides the estimator studied in Section 3 (Proposed), we also implement the estimator using all simulated data (FullData) and the estimator using completely observed subjects only (CC). Based on 1000 simulation replicates, for each of the three estimators, we summarize the sample bias, sample standard deviation, estimated standard error, and coverage probability of 95% confidence intervals in Table 1.
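For reference, a numpy sketch of this data-generating process (sample size and seed illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 500, 4
U = rng.normal(0.5, 1.0, size=(N, p))
eta = np.array([0.5, 1.0, 1.0, 1.5])
Z = 0.5 + U @ eta + rng.normal(size=N)
beta, gamma = np.array([0.5, 1.0, 1.0, 1.5]), 0.5
Y = 1.0 + U @ beta + gamma * Z + rng.normal(size=N)
# indicator missingness mechanism from the text (~40% missing)
R = ((Y < 2.5) & (U < 2).all(axis=1)).astype(int)
Y_observed = np.where(R == 1, Y, np.nan)
```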
Furthermore, we consider a similar simulation setting where the data generation is the same as above except for a logistic missingness mechanism model with $\mathrm{logit}\{\mathrm{pr}(R = 1 \mid Y, U)\} = 3 - 2Y + 0.5U_1 - U_2 + U_3 - 1.5U_4$, which also results in around 40% missing values. The results are shown in Table 2.
We can draw the following conclusions from Table 1 and Table 2. For the Proposed estimator, although its bias is slightly larger than that of the benchmark FullData, it is still very close to zero; its sample standard deviation and estimated standard error are close to each other; and the empirical coverage probability of the 95% confidence interval is very close to the nominal level. These observations match the theoretical justification in Theorem 1 well. On the contrary, the CC estimator is clearly biased, resulting in empirical coverage far from the nominal level, and is therefore not recommended in practice. It is also clear that, compared with the benchmark FullData, the Proposed estimator loses some estimation efficiency. This is the price of the conditional likelihood approach, which completely eliminates the effect of the nuisance.

5.2. Scenarios with Regularization

For the estimator studied in Section 4, the independent and identically distributed samples are generated as follows. The variable $U = (U_1, \ldots, U_p)^T$ is generated from $\mathrm{MVN}(0, \Sigma_u)$ with $\Sigma_u = (0.5^{|i-j|})_{1 \le i,j \le p}$ and $p = 8$. Then, the shadow variable $Z$ is generated following $Z = \alpha_z + \eta^T U + \epsilon_z$ with $\alpha_z = 0$, $\eta = (0.5, 0.5, 1, 1, 0.5, 0.5, 1, 1)^T$, and $\epsilon_z \sim N(0, 1)$. The outcome variable $Y$ is generated from model (1) with $\alpha = 0$, $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, $\gamma = 3$, $\epsilon \sim N(0, \sigma^2)$, and $\sigma = 3$. The missingness indicator follows $\mathrm{logit}\{\mathrm{pr}(R = 1 \mid Y, U)\} = 5 + 5Y + 0.2U_1 + 0.2U_7$, which results in about 45% missing values. Similar to Section 5.1, we examine two situations, with sample sizes $N = 500$ and $N = 1000$, respectively, and we implement the three estimators FullData, CC, and Proposed. When the Proposed estimator is implemented, we perform $M = 500$ perturbations in order to obtain the confidence interval for the unknown parameter. The results summarized below are based on 1000 simulation replicates.
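Again for reference, a numpy sketch of this data-generating process (seed illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 500, 8
Sigma_u = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=N)
eta = np.array([0.5, 0.5, 1, 1, 0.5, 0.5, 1, 1])
Z = U @ eta + rng.normal(size=N)
beta, gamma, sigma = np.array([3, 1.5, 0, 0, 2, 0, 0, 0]), 3.0, 3.0
Y = U @ beta + gamma * Z + rng.normal(0, sigma, size=N)
# logistic missingness mechanism from the text (~45% missing)
pr_obs = 1.0 / (1.0 + np.exp(-(5 + 5 * Y + 0.2 * U[:, 0] + 0.2 * U[:, 6])))
R = rng.binomial(1, pr_obs)
```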
Figure 1 shows the $L_1$, $L_2$, and $L_\infty$ norms of the estimation bias for the three estimators. As the sample size increases, the estimation bias shrinks for every method. It is also clear that the bias of the Proposed estimator is larger than that of the benchmark FullData but much smaller than that of CC.
We present the statistical inference results in Table 3 for $N = 500$ and Table 4 for $N = 1000$, including the sample bias, sample standard deviation, estimated standard error, and the coverage probability and length of the 95% confidence interval for the three methods. For the nonzero $\beta$'s as well as $\tilde{\gamma}$, similar to Section 5.1, the CC method yields coverage probabilities far from the nominal level and hence is not reliable. For the Proposed method, the estimation bias is quite close to zero, the sample standard deviation and the estimated standard error are quite close to each other, and the coverage probability of the confidence interval approaches the nominal 95% level as the sample size grows. For the zero (noise) $\beta$'s, the coverage probabilities of all three methods are close to 1, reflecting the variable selection consistency in the oracle property, even for the CC method. Furthermore, a very nice finite sample property of our proposed estimator is that it produces the confidence intervals with the shortest length, as can be clearly seen from both Table 3 and Table 4.

6. Real Data Application

The Medical Information Mart for Intensive Care III (MIMIC-III) is an openly available electronic health records (EHR) database developed by the MIT Lab for Computational Physiology [13]. It comprises de-identified health-related data on intensive care unit patients, with rich information including demographics, vital signs, laboratory tests, medications, and more.
Our initial motivation for this data analysis is to understand the missingness mechanism of some laboratory test biomarkers in this EHR system. In an EHR database, since the data are collected in a non-prescheduled fashion, i.e., only when the patient seeks care or the physician orders care, the visiting process can be informative about the patients' risk categories. Therefore, it is very plausible that the data are missing not at random, or a mix of missing not at random and missing at random [27,28]. During a brief initial data cleaning process, we observed an interesting phenomenon: compared with most biomarkers, which usually have <3% missing values, the albumin level in the blood sample, a biomarker highly indicative of different types of diseases [29], has around 30% missingness.
To further understand this phenomenon, we concentrate on a subset of the data with sample size $N = 1359$, in which 421 samples have missing values in the albumin level while all other variables are complete. We aim to apply the proposed method to the study of the albumin level ($Y$). The calcium level in the blood sample, which is free of missing data, has been shown in the biomedical literature to be highly correlated with the albumin level [30,31,32]; therefore, we adopt the calcium level as the shadow variable $Z$. Seventeen other variables comprise the vector $U$: demographics (age and gender), chart events (respiratory rate, glucose, heart rate, systolic blood pressure, diastolic blood pressure, and temperature), other laboratory tests (urea nitrogen, platelets, magnesium, hematocrit, red blood cell count, white blood cell count, and peripheral capillary oxygen saturation (SpO2)), and aggregated metrics (simplified acute physiology score (SAPS-II) and sequential organ failure assessment score (SOFA)).
We implement the proposed estimator studied in Section 4 to achieve both variable selection and post-selection inference. We also compare it with the CC method, which naively fits the regularized linear regression with the adaptive LASSO penalty. For each method, we apply the data perturbation scheme presented in Section 4 with $M = 500$ for standard error estimation. The results are summarized in Table 5. The solution path of the Proposed method, as the tuning parameter $\lambda$ varies, is provided in Figure 2.
In general, both methods achieve variable selection and post-selection inference by leveraging the regularization technique coupled with the data perturbation strategy, and both identify many variables as noise with zero coefficients. In particular, the Proposed method yields larger effects for the calcium level (the shadow variable) and the red blood cell count, and a smaller effect for the aggregated SOFA score. The Proposed method classifies the body temperature and the white blood cell count as nonsignificant, whereas the CC method identifies them as nonzero but with very small effects. It is also worth mentioning that the Proposed method assigns the magnesium level a sizable coefficient; the relationship between magnesium and serum protein levels has been extensively investigated in the scientific literature [33,34,35].

7. Discussion

In this paper, we provide a systematic approach for parameter estimation and statistical inference in both the traditional linear model, where regularization is not needed, and the modern regularized regression setting, when the outcome variable is prone to missing values and the missingness mechanism can be arbitrarily flexible. A pivotal ingredient of our procedure is the shadow variable $Z$, which overcomes the model identifiability problem and enables the nuisance-free conditional likelihood procedure.
Certainly, any method has its own limitations and can potentially be improved. One needs a model for $p(z \mid u)$ to implement the proposed estimators in Section 3 and Section 4. As this modeling does not involve any missing data, we simply use parametric maximum likelihood estimation in our algorithm as well as in the theoretical justification. Indeed, any statistical or machine learning method can be used to model $p(z \mid u)$. For instance, one may consider a semiparametric model [36], e.g.,
$$p(z \mid u; \eta, F) = \frac{\exp(\eta^T u z)\, f(z)}{\int \exp(\eta^T u z)\, dF(z)},$$
where $\eta = (\eta_1, \ldots, \eta_p)^T$ is a vector of unknown parameters and $f(z)$ is the density of an unknown baseline distribution function $F$ with respect to some dominating measure $\nu$. With this model fitted, $W_{ij}$ simplifies to $W_{ij} = \exp(-z_{ij}\, \eta^T u_{ij})$, since the baseline density and the normalizing constants cancel in the ratio. Therefore, a similar conditional likelihood approach can be used to estimate $\eta$ without estimating the nonparametric component $f(z)$.
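Under this model, the same pairwise-conditioning logic as in Section 2 gives, for a pair of complete cases, the probability $1/[1 + \exp\{-\eta^T u_{ij} z_{ij}\}]$ of the observed pairing versus the swapped one, so $\eta$ can be estimated by maximizing the corresponding pairwise log-likelihood. A minimal sketch (Python; illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def fit_eta_pairwise(u, z):
    """Pairwise conditional likelihood estimate of eta in the model
    p(z|u) proportional to exp(eta^T u z) f(z): maximize
    sum_{i<j} log expit(eta^T u_ij z_ij)."""
    n, p = u.shape
    X = np.array([(u[i] - u[j]) * (z[i] - z[j])
                  for j in range(1, n) for i in range(j)])

    def nll(eta):
        # -log expit(t) = log(1 + exp(-t)), computed stably via logaddexp
        return np.sum(np.logaddexp(0.0, -X @ eta))

    return minimize(nll, np.zeros(p), method="BFGS").x
```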

Author Contributions

Conceptualization, J.Z.; experiments, J.Z. and C.C.; writing, J.Z. and C.C.; supervision, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Jiwei Zhao is supported by the National Science Foundation award 1953526.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

Proof. 
Note that $\hat{\theta}$ is obtained by solving the estimating equation $\partial L(\hat{\theta}, \hat{\eta})/\partial\theta = 0$, which is equivalent to
$$\left\{\frac{\partial L(\hat{\theta}, \hat{\eta})}{\partial\theta} - \frac{\partial L(\theta_0, \hat{\eta})}{\partial\theta}\right\} + \left\{\frac{\partial L(\theta_0, \hat{\eta})}{\partial\theta} - \frac{\partial L(\theta_0, \eta_0)}{\partial\theta}\right\} + \frac{\partial L(\theta_0, \eta_0)}{\partial\theta} = 0. \tag{A1}$$
Specifically, by Taylor expansion,
$$\frac{\partial L(\hat{\theta}, \hat{\eta})}{\partial\theta} - \frac{\partial L(\theta_0, \hat{\eta})}{\partial\theta} = \frac{\partial^2 L(\theta_0, \hat{\eta})}{\partial\theta\, \partial\theta^T}\left(\hat{\theta} - \theta_0\right) + o_p\left(N^{-1/2}\right). \tag{A2}$$
Similarly,
$$\frac{\partial L(\theta_0, \hat{\eta})}{\partial\theta} - \frac{\partial L(\theta_0, \eta_0)}{\partial\theta} = \frac{\partial^2 L(\theta_0, \eta_0)}{\partial\theta\, \partial\eta^T}\left(\hat{\eta} - \eta_0\right) + o_p\left(N^{-1/2}\right). \tag{A3}$$
Plugging (A2) and (A3) into (A1), we obtain
$$\sqrt{N}\frac{\partial^2 L(\theta_0, \hat{\eta})}{\partial\theta\, \partial\theta^T}\left(\hat{\theta} - \theta_0\right) + \sqrt{N}\frac{\partial^2 L(\theta_0, \eta_0)}{\partial\theta\, \partial\eta^T}\left(\hat{\eta} - \eta_0\right) + \sqrt{N}\frac{\partial L(\theta_0, \eta_0)}{\partial\theta} + o_p(1) = 0. \tag{A4}$$
As $\sqrt{N}(\hat{\eta} - \eta_0) = G^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + o_p(1)$ from the asymptotic property of $\hat{\eta}$, (A4) is equivalent to
$$\sqrt{N}\frac{\partial^2 L(\theta_0, \hat{\eta})}{\partial\theta\, \partial\theta^T}\left(\hat{\theta} - \theta_0\right) + \frac{\partial^2 L(\theta_0, \eta_0)}{\partial\theta\, \partial\eta^T} G^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \sqrt{N}\frac{\partial L(\theta_0, \eta_0)}{\partial\theta} + o_p(1) = 0. \tag{A5}$$
Thus,
$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = -\left\{\frac{\partial^2 L(\theta_0, \hat{\eta})}{\partial\theta\, \partial\theta^T}\right\}^{-1} \times \left\{\frac{\partial^2 L(\theta_0, \eta_0)}{\partial\theta\, \partial\eta^T} G^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \sqrt{N}\frac{\partial L(\theta_0, \eta_0)}{\partial\theta}\right\} + o_p(1) = -A^{-1}\left\{B G^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \sqrt{N}\frac{\partial L(\theta_0, \eta_0)}{\partial\theta}\right\} + o_p(1),$$
where $\partial^2 L(\theta_0, \eta_0)/\partial\theta\, \partial\theta^T \xrightarrow{p} A = E\{\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\theta^T\}$ and $\partial^2 L(\theta_0, \eta_0)/\partial\theta\, \partial\eta^T \xrightarrow{p} B = E\{\partial^2 \phi_{ij}(\theta_0, \eta_0)/\partial\theta\, \partial\eta^T\}$. In addition, we form a projection of $\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0)$ in (A5) through
$$\frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) = \binom{N}{2}^{-1}\sum_{1 \le i < j \le N}\frac{1}{2}\left\{\frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \frac{\partial}{\partial\eta}\log p(z_j \mid u_j; \eta_0)\right\},$$
and
$$\frac{\partial L(\theta_0, \eta_0)}{\partial\theta} = \binom{N}{2}^{-1}\sum_{1 \le i < j \le N}\frac{\partial \phi_{ij}(\theta_0, \eta_0)}{\partial\theta}.$$
To sum up, (A5) can be written as
$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = -A^{-1}\sqrt{N}\binom{N}{2}^{-1}\sum_{1 \le i < j \le N}\left\{B G^{-1} M_{ij}(\eta_0) + N_{ij}(\theta_0, \eta_0)\right\} + o_p(1),$$
where $M_{ij}(\eta_0) = \frac{1}{2}\left\{\frac{\partial}{\partial\eta}\log p(z_i \mid u_i; \eta_0) + \frac{\partial}{\partial\eta}\log p(z_j \mid u_j; \eta_0)\right\}$ and $N_{ij}(\theta_0, \eta_0) = \partial \phi_{ij}(\theta_0, \eta_0)/\partial\theta$. □

Appendix B. Proof of Theorem 2

Proof. 
Define the function
$$q_{ij}(\theta) = \phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right) - \phi_{ij}(\theta_0, \hat{\eta}) - \frac{\theta^T}{\sqrt{N}}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta} = O_p\left(\frac{1}{N}\right), \tag{A6}$$
and form a U-statistic based on $q_{ij}(\theta)$ as
$$Q_N(\theta) = \frac{2}{N(N-1)}\sum_{1 \le i < j \le N} q_{ij}(\theta) = L\left(\theta_0 + \frac{\theta}{\sqrt{N}}\right) - L(\theta_0) - \frac{1}{\sqrt{N}}\cdot\frac{2}{N(N-1)}\,\theta^T\sum_{1 \le i < j \le N}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta}.$$
The variance of $Q_N(\theta)$ is bounded as $\mathrm{var}\{Q_N(\theta)\} \le \frac{2}{N}\mathrm{var}\{q_{ij}(\theta)\}$ by Corollary 3.2 of [37]. Meanwhile, $\frac{2}{N}\mathrm{var}\{q_{ij}(\theta)\} = \frac{2}{N}\left(E\left[q_{ij}(\theta)^2\right] - \left[E\,q_{ij}(\theta)\right]^2\right) \le \frac{2}{N}E\left[q_{ij}(\theta)^2\right]$, as $\left[E\,q_{ij}(\theta)\right]^2 \ge 0$. As $\phi_{ij}(\theta, \hat{\eta})$ is convex and differentiable at $\theta_0$, we have
$$\phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right) - \phi_{ij}(\theta_0, \hat{\eta}) \ge \frac{\theta^T}{\sqrt{N}}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta}, \tag{A7}$$
from which we obtain $q_{ij}(\theta) \ge 0$. Similarly,
$$\phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right) - \phi_{ij}(\theta_0, \hat{\eta}) \le \frac{\theta^T}{\sqrt{N}}\frac{\partial \phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)}{\partial\theta}. \tag{A8}$$
From (A6)–(A8), we conclude
$$0 \le q_{ij}(\theta) \le \frac{\theta^T}{\sqrt{N}}\left\{\frac{\partial \phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)}{\partial\theta} - \frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta}\right\}.$$
Therefore, we can bound
$$\frac{2}{N}E\left[q_{ij}(\theta)^2\right] \le \frac{2}{N}\cdot\frac{1}{N}\,E\left[\theta^T\theta\,\left\|\frac{\partial \phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)}{\partial\theta} - \frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta}\right\|^2\right].$$
The term $\theta^T\theta\,\left\|\partial \phi_{ij}(\theta_0 + \theta/\sqrt{N}, \hat{\eta})/\partial\theta - \partial \phi_{ij}(\theta_0, \hat{\eta})/\partial\theta\right\|^2 \xrightarrow{p} 0$ as $N \to \infty$. Thus, $\mathrm{var}\{N \cdot Q_N(\theta)\} \to 0$ and consequently
$$N \cdot Q_N(\theta) - N \cdot E\{Q_N(\theta)\} \xrightarrow{p} 0. \tag{A9}$$
Meanwhile, $E\{Q_N(\theta)\} = E\left\{\phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)\right\} - E\{\phi_{ij}(\theta_0, \hat{\eta})\}$. Eventually, from (A9) we have
$$N\left\{L\left(\theta_0 + \frac{\theta}{\sqrt{N}}\right) - L(\theta_0)\right\} - \theta^T\sqrt{N}\,\frac{2}{N(N-1)}\sum_{1 \le i < j \le N}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta} - N\left[E\left\{\phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)\right\} - E\{\phi_{ij}(\theta_0, \hat{\eta})\}\right] \xrightarrow{p} 0. \tag{A10}$$
The third term on the left side of (A10) satisfies
$$N\left[E\left\{\phi_{ij}\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right)\right\} - E\{\phi_{ij}(\theta_0, \hat{\eta})\}\right] = N\,E\left\{\phi_{ij}(\theta_0, \hat{\eta}) + \frac{\theta^T}{\sqrt{N}}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta} + \frac{1}{2}\frac{\theta^T}{\sqrt{N}}\frac{\partial^2 \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta\, \partial\theta^T}\frac{\theta}{\sqrt{N}} + o_p\left(\frac{1}{N}\right)\right\} - N\,E\{\phi_{ij}(\theta_0, \hat{\eta})\} \xrightarrow{p} \frac{1}{2}\theta^T A\theta.$$
By the CLT for U-statistics,
$$\sqrt{N}\,\frac{2}{N(N-1)}\sum_{1 \le i < j \le N}\frac{\partial \phi_{ij}(\theta_0, \hat{\eta})}{\partial\theta} \xrightarrow{d} N(0, \Sigma).$$
Using Slutsky's theorem, we can simplify (A10) to
$$N\left\{L\left(\theta_0 + \frac{\theta}{\sqrt{N}}\right) - L(\theta_0)\right\} \xrightarrow{d} \frac{1}{2}\theta^T A\theta + \theta^T W,$$
where $W \sim N(0, \Sigma)$. Based on convexity [38], for every compact set $K \subset \mathbb{R}^{p+1}$, we have
$$\left[N\left\{L\left(\theta_0 + \frac{\theta}{\sqrt{N}}, \hat{\eta}\right) - L(\theta_0, \hat{\eta})\right\} : \theta \in K\right] \xrightarrow{d} \left[\frac{1}{2}\theta^T A\theta + \theta^T W : \theta \in K\right].$$
Now we develop the large sample properties of the penalty term in the objective function with the adaptive LASSO penalty. We write the (recentered) penalty term as
$$N\sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - N\sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\,|\tilde{\beta}_{j,0}|.$$
From Theorem 1, we have already obtained $\sqrt{N}(\hat{\tilde{\beta}}_j - \tilde{\beta}_{j,0}) = O_p(1)$. Meanwhile, $N\lambda \to \infty$ and $\sqrt{N}\lambda \to 0$. If $\tilde{\beta}_{j,0} \neq 0$, then $\sqrt{N}\lambda/|\hat{\tilde{\beta}}_j| \xrightarrow{p} 0$ and $\sqrt{N}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) \to \mathrm{sign}(\tilde{\beta}_{j,0})\,\tilde{\beta}_j$. Eventually,
$$N\frac{\lambda}{|\hat{\tilde{\beta}}_j|}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) = \frac{\sqrt{N}\lambda}{|\hat{\tilde{\beta}}_j|}\,\sqrt{N}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) \xrightarrow{p} 0.$$
If $\tilde{\beta}_{j,0} = 0$, then $\sqrt{N}\lambda/|\hat{\tilde{\beta}}_j| = N\lambda/|\sqrt{N}\hat{\tilde{\beta}}_j| \xrightarrow{p} \infty$; consequently,
$$N\frac{\lambda}{|\hat{\tilde{\beta}}_j|}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) = \frac{\sqrt{N}\lambda}{|\hat{\tilde{\beta}}_j|}\,|\tilde{\beta}_j| \xrightarrow{p} \begin{cases} 0, & \text{if } \tilde{\beta}_j = 0, \\ \infty, & \text{if } \tilde{\beta}_j \neq 0. \end{cases}$$
Therefore, we can summarize
$$N\sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) \xrightarrow{p} \begin{cases} 0, & \text{if } \tilde{\beta} = (\tilde{\beta}_1, \ldots, \tilde{\beta}_{p_0}, 0, \ldots, 0), \\ \infty, & \text{otherwise.} \end{cases}$$
Since the limit function takes the value infinity, we cannot use the standard argument based on uniform convergence in probability on compact sets [39]; however, we can apply the slightly more involved notion of epi-convergence. Thus, based on the works in [23,40,41], we have
$$N\left\{L\left(\theta_0 + \frac{\theta}{\sqrt{N}}\right) - L(\theta_0)\right\} + N\sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\left(\left|\tilde{\beta}_{j,0} + \frac{\tilde{\beta}_j}{\sqrt{N}}\right| - |\tilde{\beta}_{j,0}|\right) \xrightarrow{e\text{-}d} V(\theta), \tag{A12}$$
where
$$V(\theta) = \begin{cases} \frac{1}{2}\theta_T^T A_1 \theta_T + \theta_T^T W_T, & \text{if } \theta = (\tilde{\gamma}, \tilde{\beta}_1, \ldots, \tilde{\beta}_{p_0}, 0, \ldots, 0), \\ \infty, & \text{otherwise,} \end{cases}$$
and $W_T \sim N(0, \Sigma_1)$. Specifically, the left side of (A12) is minimized at $\theta = \sqrt{N}(\hat{\theta}_\lambda - \theta_0)$, and $V(\theta)$ has the unique minimizer $\left\{(-A_1^{-1} W_T)^T, 0^T\right\}^T$, obtained by setting $\partial V(\theta)/\partial\theta = 0$. Therefore, the convergence of minimizers [40] can be concluded from (A12):
$$\sqrt{N}\left(\hat{\theta}_{\lambda,T} - \theta_{0,T}\right) \xrightarrow{d} -A_1^{-1} W_T \quad \text{and} \quad \sqrt{N}\left(\hat{\theta}_{\lambda,T^c} - \theta_{0,T^c}\right) \xrightarrow{d} 0.$$
For $j \in T$,
$$\mathrm{pr}\left(j \notin T_N\right) = \mathrm{pr}\left(\hat{\tilde{\beta}}_{j,\lambda} = 0\right) \to 0.$$
Thus, $\mathrm{pr}(T \subset T_N) \to 1$. In addition, $\hat{\theta}_\lambda$ minimizes the convex objective function $L_\lambda(\theta)$, so that $0 \in \partial L_\lambda(\hat{\theta}_\lambda)$. As $L_\lambda(\theta)$ might be nondifferentiable and its gradient may not exist for some $\theta$, we use $\nabla L_\lambda(\theta)$ to represent an arbitrary selection from the subgradient of $L_\lambda(\theta)$. Taking the subgradient of the objective function with the adaptive LASSO penalty, we obtain
$$\nabla L_\lambda(\hat{\theta}_\lambda) = \nabla L(\hat{\theta}_\lambda) + \sum_{j=1}^p \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\,\partial\left|\hat{\tilde{\beta}}_{j,\lambda}\right|.$$
For $j \notin T$, $\mathrm{pr}(j \in T_N)$ can be bounded above by
$$\mathrm{pr}\left\{\nabla_j L(\hat{\theta}_\lambda) + \frac{\lambda}{|\hat{\tilde{\beta}}_j|}\,\mathrm{sign}\left(\hat{\tilde{\beta}}_{j,\lambda}\right) = 0\right\} \le \mathrm{pr}\left\{\sqrt{N}\left|\nabla_j L(\hat{\theta}_\lambda)\right| = \frac{\sqrt{N}\lambda}{|\hat{\tilde{\beta}}_j|}\right\}, \tag{A14}$$
where $\nabla_j$ is the $j$-th coordinate of the subgradient and $\sqrt{N}\lambda/|\hat{\tilde{\beta}}_j| \xrightarrow{p} \infty$ as $j \notin T$.
We can expand the subgradient $\sqrt{N}\nabla L(\hat{\theta}_\lambda)$ as
$$\sqrt{N}\nabla L(\hat{\theta}_\lambda) = \sqrt{N}\left\{\nabla L(\hat{\theta}_\lambda) - \nabla L(\theta_0) - A\left(\hat{\theta}_\lambda - \theta_0\right)\right\} + \sqrt{N}\nabla L(\theta_0) + \sqrt{N}A\left(\hat{\theta}_\lambda - \theta_0\right),$$
where $\sqrt{N}\nabla L(\theta_0)$ is bounded in probability and $\sqrt{N}A(\hat{\theta}_\lambda - \theta_0)$ converges in distribution, hence is bounded in probability as well. By Theorem 1 of [42],
$$\sup_{\|\hat{\theta}_\lambda - \theta_0\| \le M/\sqrt{N}}\left\|\nabla L(\hat{\theta}_\lambda) - \nabla L(\theta_0) - A\left(\hat{\theta}_\lambda - \theta_0\right)\right\| = o_p\left(\frac{1}{\sqrt{N}}\right).$$
Therefore, $\sqrt{N}\left\{\nabla L(\hat{\theta}_\lambda) - \nabla L(\theta_0) - A(\hat{\theta}_\lambda - \theta_0)\right\} \xrightarrow{p} 0$. Finally, $\sqrt{N}\nabla_j L(\hat{\theta}_\lambda)$ is bounded in probability and the right side of (A14) converges to 0, which proves $\mathrm{pr}(j \in T_N) \to 0$ for $j \notin T$. □

References

  1. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data, 2nd ed.; Wiley: Hoboken, NJ, USA, 2002.
  2. Shao, J.; Zhao, J. Estimation in longitudinal studies with nonignorable dropout. Stat. Its Interface 2013, 6, 303–313.
  3. Wang, S.; Shao, J.; Kim, J.K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat. Sin. 2014, 24, 1097–1116.
  4. Zhao, J.; Shao, J. Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. J. Am. Stat. Assoc. 2015, 110, 1577–1590.
  5. Miao, W.; Tchetgen Tchetgen, E.J. On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 2016, 103, 475–482.
  6. Zhao, J.; Ma, Y. Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse. Biometrika 2018, 105, 479–486.
  7. Miao, W.; Liu, L.; Tchetgen Tchetgen, E.; Geng, Z. Identification, Doubly Robust Estimation, and Semiparametric Efficiency Theory of Nonignorable Missing Data with a Shadow Variable. arXiv 2019, arXiv:1509.02556.
  8. Tchetgen Tchetgen, E.J.; Wirth, K.E. A general instrumental variable framework for regression analysis with outcome missing not at random. Biometrics 2017, 73, 1123–1131.
  9. Sun, B.; Liu, L.; Miao, W.; Wirth, K.; Robins, J.; Tchetgen Tchetgen, E.J. Semiparametric estimation with data missing not at random using an instrumental variable. Stat. Sin. 2018, 28, 1965–1983.
  10. Zhao, J.; Yang, Y.; Ning, Y. Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data. Stat. Sin. 2018, 28, 2125–2148.
  11. Jiang, W.; Bogdan, M.; Josse, J.; Miasojedow, B.; Rockova, V.; TraumaBase Group. Adaptive Bayesian SLOPE–High-dimensional Model Selection with Missing Values. arXiv 2019, arXiv:1909.06631.
  12. Jiang, W.; Josse, J.; Lavielle, M.; TraumaBase Group. Logistic regression with missing covariates—Parameter estimation, model selection and prediction within a joint-modeling framework. Comput. Stat. Data Anal. 2020, 145, 106907.
  13. Johnson, A.E.; Pollard, T.J.; Shen, L.; Li-wei, H.L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035.
  14. Zhao, J.; Ma, Y. A versatile estimation procedure without estimating the nonignorable missingness mechanism. arXiv 2019, arXiv:1907.03682.
  15. Liang, K.Y.; Qin, J. Regression analysis under non-standard situations: A pairwise pseudolikelihood approach. J. R. Stat. Soc. Ser. B 2000, 62, 773–786.
  16. Zhao, J.; Shao, J. Approximate conditional likelihood for generalized linear models with general missing data mechanism. J. Syst. Sci. Complex. 2017, 30, 139–153.
  17. Zhao, J. Reducing bias for maximum approximate conditional likelihood estimator with general missing data mechanism. J. Nonparametr. Stat. 2017, 29, 577–593.
  18. Yang, Y.; Zhao, J.; Wilding, G.; Kluczynski, M.; Bisson, L. Stability enhanced variable selection for a semiparametric model with flexible missingness mechanism and its application to the ChAMP study. J. Appl. Stat. 2020, 47, 827–843.
  19. Zhao, J.; Chen, C. Estimators based on unconventional likelihoods with nonignorable missing data and its application to a children's mental health study. J. Nonparametr. Stat. 2019, 31, 911–931.
  20. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
  21. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  22. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
  23. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  24. Cai, T.; Tian, L.; Wei, L. Semiparametric Box–Cox power transformation models for censored survival observations. Biometrika 2005, 92, 619–632.
  25. Kosorok, M.R. Introduction to Empirical Processes and Semiparametric Inference; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007.
  26. Minnier, J.; Tian, L.; Cai, T. A perturbation method for inference on regularized regression estimates. J. Am. Stat. Assoc. 2011, 106, 1371–1382.
  27. Hu, Z.; Melton, G.B.; Arsoniadis, E.G.; Wang, Y.; Kwaan, M.R.; Simon, G.J. Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. J. Biomed. Inform. 2017, 68, 112–120.
  28. Li, J.; Wang, M.; Steinbach, M.S.; Kumar, V.; Simon, G.J. Don't Do Imputation: Dealing with Informative Missing Values in EHR Data Analysis. In Proceedings of the 2018 IEEE International Conference on Big Knowledge (ICBK), Singapore, 17–18 November 2018; pp. 415–422.
  29. Phillips, A.; Shaper, A.G.; Whincup, P. Association between serum albumin and mortality from cardiovascular disease, cancer, and other causes. Lancet 1989, 334, 1434–1436.
  30. Katz, S.; Klotz, I.M. Interactions of calcium with serum albumin. Arch. Biochem. Biophys. 1953, 44, 351–361.
  31. Butler, S.; Payne, R.; Gunn, I.; Burns, J.; Paterson, C. Correlation between serum ionised calcium and serum albumin concentrations in two hospital populations. Br. Med. J. 1984, 289, 948–950.
  32. Hossain, A.; Mostafa, G.; Mannan, K.; Prosad Deb, K.; Hossain, M. Correlation between serum albumin level and ionized calcium in idiopathic nephrotic syndrome in children. Urol. Nephrol. Open Access J. 2015, 3, 70–71.
  33. Kroll, M.; Elin, R. Relationships between magnesium and protein concentrations in serum. Clin. Chem. 1985, 31, 244–246.
  34. Huijgen, H.J.; Soesan, M.; Sanders, R.; Mairuhu, W.M.; Kesecioglu, J.; Sanders, G.T. Magnesium levels in critically ill patients: What should we measure? Am. J. Clin. Pathol. 2000, 114, 688–695.
  35. Djagbletey, R.; Phillips, B.; Boni, F.; Owoo, C.; Owusu-Darkwa, E.; deGraft Johnson, P.K.G.; Yawson, A.E. Relationship between serum total magnesium and serum potassium in emergency surgical patients in a tertiary hospital in Ghana. Ghana Med. J. 2016, 50, 78–83.
  36. Luo, X.; Tsai, W.Y. A proportional likelihood ratio model. Biometrika 2011, 99, 211–222.
  37. Shao, J. Mathematical Statistics; Springer Texts in Statistics; Springer: Berlin/Heidelberg, Germany, 2003.
  38. Arcones, M.A. Weak convergence of convex stochastic processes. Stat. Probab. Lett. 1998, 37, 171–182.
  39. Rejchel, W. Model selection consistency of U-statistics with convex loss and weighted lasso penalty. J. Nonparametr. Stat. 2017, 29, 768–791.
  40. Geyer, C.J. On the asymptotics of constrained M-estimation. Ann. Stat. 1994, 22, 1993–2010.
  41. Pflug, G.C. Asymptotic stochastic programs. Math. Oper. Res. 1995, 20, 769–789.
  42. Niemiro, W. Least empirical risk procedures in statistical inference. Appl. Math. 1993, 22, 55–67.
Figure 1. In Section 5.2, the $L_1$ (first column), $L_2$ (second column), and $L_\infty$ (third column) norms of the estimation bias of the FullData estimator (using all simulated data), the CC estimator (using only completely observed subjects), and the proposed estimator studied in Section 4.
Figure 2. In Section 6, the solution path of the proposed estimator in the MIMIC-III study as the tuning parameter $\lambda$ varies. The optimal $\lambda$, $\lambda^*$, equals 1.0030, and $\log \lambda^* = 0.0030$.
Table 1. In Section 5.1, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the 95% confidence interval for the FullData estimator (using all simulated data), the CC estimator (using only completely observed subjects), and the proposed estimator studied in Section 3.

N     Parameter  Method     Bias     SD      SE      CP
500   γ̃          FullData    0.0026  0.0444  0.0450  0.9540
                 CC         −0.0329  0.0564  0.0560  0.9100
                 Proposed    0.0174  0.0829  0.0789  0.9450
      β1         FullData    0.0022  0.0489  0.0503  0.9510
                 CC          0.0376  0.0670  0.0699  0.9300
                 Proposed    0.0164  0.1644  0.1607  0.9400
      β2         FullData   −0.0017  0.0657  0.0635  0.9310
                 CC         −0.0649  0.0851  0.0835  0.8680
                 Proposed   −0.0399  0.2305  0.2239  0.9360
      β3         FullData    0.0022  0.0616  0.0635  0.9540
                 CC          0.0778  0.0871  0.0867  0.8430
                 Proposed    0.0462  0.2323  0.2298  0.9410
      β4         FullData   −0.0045  0.0792  0.0810  0.9530
                 CC         −0.0988  0.1007  0.1043  0.8550
                 Proposed   −0.0672  0.3081  0.3047  0.9380
1000  γ̃          FullData   −0.0012  0.0317  0.0317  0.9540
                 CC         −0.0348  0.0396  0.0393  0.8510
                 Proposed    0.0068  0.0573  0.0555  0.9350
      β1         FullData    0.0011  0.0367  0.0355  0.9370
                 CC          0.0399  0.0490  0.0494  0.8840
                 Proposed    0.0154  0.1154  0.1138  0.9460
      β2         FullData    0.0020  0.0448  0.0448  0.9500
                 CC         −0.0649  0.0577  0.0588  0.8110
                 Proposed   −0.0153  0.1531  0.1591  0.9590
      β3         FullData   −0.0015  0.0458  0.0449  0.9460
                 CC          0.0779  0.0605  0.0611  0.7490
                 Proposed    0.0135  0.1598  0.1634  0.9480
      β4         FullData    0.0009  0.0564  0.0571  0.9540
                 CC         −0.0949  0.0720  0.0734  0.7550
                 Proposed   −0.0242  0.2091  0.2167  0.9430
Table 2. In Section 5.1, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of the 95% confidence interval for the FullData estimator (using all simulated data), the CC estimator (using only completely observed subjects), and the proposed estimator studied in Section 3, with a logistic missingness mechanism model.

N     Parameter  Method     Bias     SD      SE      CP
500   γ̃          FullData   −0.0011  0.0464  0.0451  0.9410
                 CC         −0.0306  0.0567  0.0567  0.9200
                 Proposed    0.0100  0.0822  0.0787  0.9380
      β1         FullData   −0.0004  0.0509  0.0503  0.9520
                 CC          0.0440  0.0636  0.0637  0.8930
                 Proposed    0.0146  0.1308  0.1236  0.9420
      β2         FullData    0.0013  0.0639  0.0637  0.9520
                 CC         −0.0871  0.0828  0.0821  0.8190
                 Proposed   −0.0173  0.1824  0.1753  0.9430
      β3         FullData   −0.0030  0.0655  0.0636  0.9400
                 CC          0.0876  0.0847  0.0821  0.8030
                 Proposed    0.0214  0.1840  0.1756  0.9440
      β4         FullData    0.0023  0.0845  0.0812  0.9390
                 CC         −0.1307  0.1083  0.1061  0.7560
                 Proposed   −0.0331  0.2533  0.2384  0.9360
1000  γ̃          FullData    0.0004  0.0315  0.0317  0.9490
                 CC         −0.0286  0.0396  0.0398  0.8950
                 Proposed    0.0060  0.0568  0.0555  0.9390
      β1         FullData    0.0007  0.0362  0.0354  0.9420
                 CC          0.0442  0.0451  0.0447  0.8410
                 Proposed    0.0079  0.0910  0.0859  0.9290
      β2         FullData   −0.0004  0.0450  0.0448  0.9390
                 CC         −0.0879  0.0571  0.0576  0.6640
                 Proposed   −0.0044  0.1277  0.1220  0.9420
      β3         FullData   −0.0009  0.0450  0.0448  0.9450
                 CC          0.0880  0.0588  0.0577  0.6660
                 Proposed    0.0114  0.1309  0.1222  0.9380
      β4         FullData   −0.0005  0.0576  0.0572  0.9510
                 CC         −0.1342  0.0755  0.0745  0.5740
                 Proposed   −0.0191  0.1757  0.1661  0.9370
Table 3. In Section 5.2, with sample size N = 500: sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), coverage probability (CP), and length (Length) of the 95% confidence interval for the FullData estimator (using all simulated data), the CC estimator (using only completely observed subjects), and the proposed estimator studied in Section 4.

Parameter  Method     Bias     SD      SE      CP      Length
γ̃          FullData    0.0001  0.0120  0.0132  0.9480  0.0515
           CC         −0.0729  0.0180  0.0183  0.0370  0.0716
           Proposed   −0.0423  0.0500  0.0498  0.8200  0.1926
True nonzero
β1         FullData    0.0021  0.1686  0.1649  0.9400  0.6415
           CC         −0.6547  0.2207  0.2114  0.1460  0.8233
           Proposed    0.0354  0.4698  0.4746  0.9320  1.8513
β2         FullData   −0.0275  0.1692  0.1791  0.9440  0.6952
           CC         −0.3501  0.2227  0.2174  0.6180  0.8471
           Proposed   −0.2654  0.5843  0.5609  0.8940  1.9237
β5         FullData   −0.0172  0.1576  0.1756  0.9650  0.6826
           CC         −0.4478  0.2172  0.2161  0.4370  0.8418
           Proposed   −0.1251  0.4037  0.4611  0.9330  1.8063
True zero
β3         FullData    0.0085  0.1567  0.1890  0.9960  0.7184
           CC          0.0063  0.2067  0.2304  0.9890  0.8890
           Proposed    0.0109  0.0988  0.1690  1.0000  0.4398
β4         FullData   −0.0019  0.1581  0.1900  0.9940  0.7206
           CC         −0.0017  0.2097  0.2307  0.9900  0.8914
           Proposed    0.0126  0.1112  0.1447  1.0000  0.3668
β6         FullData    0.0045  0.1212  0.1606  0.9980  0.6146
           CC         −0.0053  0.1749  0.1953  0.9900  0.7560
           Proposed    0.0034  0.0664  0.1160  1.0000  0.2555
β7         FullData    0.0014  0.1351  0.1839  0.9980  0.7063
           CC         −0.0055  0.1870  0.2245  0.9950  0.8717
           Proposed    0.0024  0.0386  0.1115  1.0000  0.2538
β8         FullData   −0.0072  0.1295  0.1748  0.9990  0.6653
           CC         −0.0062  0.1795  0.2125  0.9940  0.8251
           Proposed    0.0016  0.0741  0.1066  1.0000  0.2284
Table 4. In Section 5.2, with sample size N = 1000: sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), coverage probability (CP), and length (Length) of the 95% confidence interval for the FullData estimator (using all simulated data), the CC estimator (using only completely observed subjects), and the proposed estimator studied in Section 4.

Parameter  Method     Bias     SD      SE      CP      Length
γ̃          FullData   −0.0005  0.0073  0.0088  0.9690  0.0344
           CC         −0.0730  0.0126  0.0130  0.0000  0.0507
           Proposed   −0.0213  0.0311  0.0334  0.8700  0.1293
True nonzero
β1         FullData   −0.0005  0.1186  0.1170  0.9300  0.4547
           CC         −0.6655  0.1568  0.1507  0.0090  0.5864
           Proposed    0.0211  0.2911  0.2969  0.9300  1.1631
β2         FullData   −0.0321  0.1175  0.1249  0.9550  0.4861
           CC         −0.3387  0.1477  0.1534  0.3960  0.5972
           Proposed   −0.0979  0.2907  0.3383  0.9230  1.3115
β5         FullData   −0.0225  0.1051  0.1206  0.9590  0.4698
           CC         −0.4485  0.1478  0.1534  0.1770  0.5964
           Proposed   −0.0621  0.2351  0.2526  0.9290  0.9871
True zero
β3         FullData   −0.0007  0.0621  0.1162  1.0000  0.4253
           CC          0.0023  0.1414  0.1614  0.9920  0.6180
           Proposed    0.0044  0.0581  0.0910  1.0000  0.2091
β4         FullData    0.0020  0.0632  0.1170  1.0000  0.4271
           CC         −0.0005  0.1333  0.1608  0.9930  0.6207
           Proposed    0.0063  0.0584  0.0887  1.0000  0.2107
β6         FullData    0.0013  0.0571  0.1010  1.0000  0.3670
           CC         −0.0034  0.1159  0.1378  0.9950  0.5313
           Proposed    0.0012  0.0281  0.0688  1.0000  0.1430
β7         FullData   −0.0028  0.0599  0.1144  1.0000  0.4231
           CC         −0.0033  0.1243  0.1584  0.9970  0.6131
           Proposed    0.0016  0.0288  0.0698  1.0000  0.1421
β8         FullData    0.0039  0.0589  0.1080  1.0000  0.3970
           CC          0.0028  0.1256  0.1497  0.9940  0.5752
           Proposed    0.0000  0.0333  0.0644  1.0000  0.1314
Table 5. In Section 6, the parameter estimate (Estimate), standard error (SE), and confidence interval (CI) for the CC estimator (using only completely observed subjects) and the proposed estimator studied in Section 4 in the MIMIC-III study.

Effect             CC: Estimate  SE      CI                   Proposed: Estimate  SE      CI
Calcium (shadow)    0.7707       0.0691  [0.6532, 0.9153]      1.5271             0.1796  [1.1815, 1.8835]
Red Blood Cell      0.6491       0.0514  [0.5337, 0.7257]      0.7545             0.1631  [0.3594, 1.0109]
Magnesium           0.0000       0.0686  [−0.2073, 0.0000]     0.2731             0.2452  [0.0000, 0.6609]
SOFA               −0.2720       0.0268  [−0.3135, −0.2099]   −0.1852             0.1040  [−0.3467, 0.0000]
Temperature        −0.0360       0.0351  [−0.0883, 0.0659]     0.0000             0.0964  [0.0000, 0.3132]
White Blood Cell   −0.0245       0.0123  [−0.0416, 0.0000]     0.0000             0.0025  [0.0000, 0.0000]
Age                 0.0000       0.0008  [0.0000, 0.0000]      0.0000             0.0017  [0.0000, 0.0000]
Gender              0.0000       0.0240  [−0.0477, 0.0662]     0.0000             0.1320  [−0.4025, 0.0000]
Respiratory Rate    0.0000       0.0034  [−0.0141, 0.0000]     0.0000             0.0008  [0.0000, 0.0000]
Glucose             0.0000       0.0000  [0.0000, 0.0000]      0.0000             0.0005  [0.0000, 0.0000]
Heart Rate          0.0000       0.0025  [−0.0091, 0.0000]     0.0000             0.0004  [0.0000, 0.0000]
Systolic BP         0.0000       0.0045  [−0.0139, 0.0000]     0.0000             0.0000  [0.0000, 0.0000]
Diastolic BP        0.0000       0.0072  [0.0000, 0.0223]      0.0000             0.0000  [0.0000, 0.0000]
Urea Nitrogen       0.0000       0.0004  [0.0000, 0.0000]      0.0000             0.0000  [0.0000, 0.0000]
Platelets           0.0000       0.0000  [0.0000, 0.0000]      0.0000             0.0000  [0.0000, 0.0000]
Hematocrit          0.0000       0.0027  [0.0000, 0.0000]      0.0000             0.0000  [0.0000, 0.0000]
SpO2                0.0000       0.0145  [−0.0479, 0.0000]     0.0000             0.0162  [0.0000, 0.0000]
SAPS-II             0.0000       0.0106  [−0.0051, 0.0269]     0.0000             0.0000  [0.0000, 0.0000]