Article

An Information Criterion for Auxiliary Variable Selection in Incomplete Data Analysis

by Shinpei Imori 1,3,* and Hidetoshi Shimodaira 2,3
1 Graduate School of Science, Hiroshima University, Hiroshima 739-8526, Japan
2 Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
3 RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
* Author to whom correspondence should be addressed.
Entropy 2019, 21(3), 281; https://doi.org/10.3390/e21030281
Submission received: 21 February 2019 / Revised: 9 March 2019 / Accepted: 12 March 2019 / Published: 14 March 2019
(This article belongs to the Special Issue Information-Theoretical Methods in Data Mining)

Abstract
Statistical inference is considered for variables of interest, called primary variables, when auxiliary variables are observed along with the primary variables. We consider the setting of incomplete data analysis, where some primary variables are not observed. Utilizing a parametric model of the joint distribution of the primary and auxiliary variables, it is possible to improve the estimation of the parametric model for the primary variables when the auxiliary variables are closely related to the primary variables. However, the estimation accuracy deteriorates when the auxiliary variables are irrelevant to the primary variables. For selecting useful auxiliary variables, we formulate the problem as model selection and propose an information criterion for predicting primary variables by leveraging auxiliary variables. The proposed information criterion is an asymptotically unbiased estimator of the Kullback–Leibler divergence for complete data of primary variables under some reasonable conditions. We also clarify an asymptotic equivalence between the proposed information criterion and a variant of leave-one-out cross-validation. The performance of our method is demonstrated via a simulation study and a real data example.

1. Introduction

Auxiliary variables are often observed along with primary variables. Here, the primary variables are random variables of interest, and our purpose is to estimate their predictive distribution, i.e., a probability distribution of the primary variables in future test data, while the auxiliary variables are random variables that are observed in training data but not included in the primary variables. We assume that the auxiliary variables are not observed in the test data, or we do not use them even if they are observed in the test data. When the auxiliary variables have a close relation with the primary variables, we expect to improve the accuracy of predictive distribution of the primary variables by considering a joint modeling of the primary and auxiliary variables.
The notion of auxiliary variables has been considered in the statistics and machine learning literature. For example, the “curds and whey” method [1] and the “coaching variables” method [2] are based on a similar idea of improving the prediction accuracy of primary variables by using auxiliary variables. In multitask learning, Caruana [3] improved the generalization accuracy of a main task by exploiting extra tasks. Auxiliary variables are also considered in incomplete data analysis, i.e., when some of the primary variables are not observed; Mercatanti et al. [4] gave theoretical results showing that parameter estimation can be improved by utilizing auxiliary variables in the Gaussian mixture model (GMM).
Although auxiliary variables are expected to be useful for modeling primary variables, they can actually be harmful. As mentioned in Mercatanti et al. [4], using auxiliary variables may affect modeling results adversely because the number of parameters to be estimated increases and a candidate model of the auxiliary variables can be misspecified. Hence, it is important to select useful auxiliary variables. This is formulated as model selection by considering parametric models with auxiliary variables. In this paper, the usefulness of auxiliary variables for estimating the predictive distribution of primary variables is measured by a risk function based on the Kullback–Leibler (KL) divergence [5], which is often used for model selection. Because the KL risk function includes unknown parameters, we have to estimate it in actual use. The Akaike Information Criterion (AIC) proposed by Akaike [6] is one of the most famous criteria; it is known as an asymptotically unbiased estimator of the KL risk function. AIC is a good criterion from the perspective of prediction due to its asymptotic efficiency; see Shibata [7,8]. Takeuchi [9] proposed a modified version of AIC, called the Takeuchi Information Criterion (TIC), which relaxes an assumption used for deriving AIC, namely, correct specification of the candidate model. However, AIC and TIC are derived for primary variables without considering auxiliary variables in the setting of complete data analysis, and therefore they are suitable neither for auxiliary variable selection nor for incomplete data analysis.
Incomplete data analysis is widely used in a broad range of statistical problems by regarding some of the primary variables as latent variables that are not observed. This setting also includes complete data analysis as a special case, where all the primary variables are observed. Information criteria for incomplete data analysis have been proposed in previous studies. Shimodaira [10] developed an information criterion based on the KL divergence for complete data when the data are only partially observed. Cavanaugh and Shumway [11] modified the first term of the information criterion of Shimodaira [10] by the objective function of the EM algorithm [12]. Recently, Shimodaira and Maeda [13] proposed an information criterion, which is derived by mitigating a condition assumed in Shimodaira [10] and Cavanaugh and Shumway [11].
However, none of these previously proposed criteria is derived by taking auxiliary variables into account. Thus, we propose a new information criterion by considering not only primary variables but also auxiliary variables in the setting of incomplete data analysis. The proposed criterion is a generalization of AIC, TIC, and the criterion of Shimodaira and Maeda [13]. To the best of our knowledge, this is the first attempt to derive an information criterion by considering auxiliary variables. Moreover, we show an asymptotic equivalence between the proposed criterion and a variant of leave-one-out cross validation (LOOCV); this result is a generalization of the relationship between TIC and LOOCV [14].
Note that the term “auxiliary variables” is also used in other contexts in the literature. For example, Ibrahim et al. [15] considered using auxiliary variables in missing data analysis, which is similar to our usage in the sense that the auxiliary variables are highly correlated with the missing data. However, they use the auxiliary variables in order to avoid specifying a missing data mechanism; this goal is different from ours, because no missing data mechanism is considered in our study.
The remainder of this paper is organized as follows. Notation and the setting of this paper are introduced in Section 2. Illustrative examples of useful and useless auxiliary variables are given in Section 3. The information criterion for selecting useful auxiliary variables in incomplete data analysis is derived in Section 4, and the asymptotic equivalence between the proposed criterion and a variant of LOOCV is shown in Section 5. The performance of our method is examined via a simulation study and a real data analysis in Section 6 and Section 7, respectively. Finally, we conclude this paper in Section 8. All proofs are shown in Appendix A.

2. Preliminaries

2.1. Incomplete Data Analysis for Primary Variables

First we explain a setting of incomplete data analysis for primary variables in accordance with Shimodaira and Maeda [13]. Let X denote a vector of primary variables, which consists of two parts as X = (Y, Z), where Y denotes the observed part and Z denotes the unobserved latent part. This setting reduces to complete data analysis of X = Y when Z is empty. We write the true density function of X as q_x(x) = q_x(y, z) and a candidate parametric model of the true density as p_x(x; θ) = p_x(y, z; θ), where θ ∈ Θ ⊂ ℝ^d is an unknown parameter vector and Θ is its parameter space. We assume that x = (y, z) ∈ 𝒴 × 𝒵 for all density functions, where 𝒴 and 𝒵 are the domains of Y and Z, respectively. Thus the marginal densities of the observed part Y are obtained by q_y(y) = ∫ q_x(y, z) dz and p_y(y; θ) = ∫ p_x(y, z; θ) dz. For denoting densities, we will omit random variables such as q_y and p_y(θ). We assume that θ is identifiable with respect to p_y(θ).
In this paper, we consider only a simple setting of i.i.d. random variables of sample size n. Let x_i = (y_i, z_i), i = 1, …, n, be independent realizations of X, where we only observe y_1, …, y_n and we cannot see the values of z_1, …, z_n. We estimate θ from the observed training data y_1, …, y_n. Then the maximum likelihood estimate (MLE) of θ is given by
$$\hat{\theta}_y = \arg\max_{\theta \in \Theta} \ell_y(\theta) \equiv \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log p_y(y_i; \theta),$$
where ℓ_y(θ) denotes the log-likelihood function (divided by n) of θ with respect to y_1, …, y_n.
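To make the definition concrete, the following minimal Python sketch maximizes the observed-data log-likelihood numerically when the latent variable Z has a finite support, so that p_y(y; θ) = Σ_z p_x(y, z; θ). The callable log_px and the list z_support are hypothetical user-supplied placeholders; this is only an illustration under those assumptions, not the computational procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def mle_observed(y, log_px, z_support, theta_init):
    """Maximize ell_y(theta) = (1/n) sum_i log sum_z exp(log_px(y_i, z, theta)).

    y          : observed values y_1, ..., y_n
    log_px     : hypothetical callable returning log p_x(y, z; theta)
    z_support  : finite support of the latent variable, e.g. [0, 1]
    theta_init : starting value for the optimizer
    """
    y = np.asarray(y)

    def neg_loglik(theta):
        # observed-data log-likelihood, marginalizing Z by summation
        ll = sum(logsumexp([log_px(yi, z, theta) for z in z_support]) for yi in y)
        return -ll / len(y)

    res = minimize(neg_loglik, theta_init, method="Nelder-Mead")
    return res.x  # theta_hat_y
```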
If we were only interested in Y, we would consider the plug-in predictive distribution p y ( θ ^ y ) by substituting θ ^ y into p y ( θ ) . However, we are interested in the whole primary variable X = ( Y , Z ) and its density q x . We thus consider p x ( θ ^ y ) by substituting θ ^ y into p x ( θ ) , and evaluate the MLE by comparing p x ( θ ^ y ) with q x . For this purpose, Shimodaira and Maeda [13] derived an information criterion as an asymptotically unbiased estimator of the KL risk function which measures how well p x ( θ ^ y ) approximates q x .

2.2. Statistical Analysis with Auxiliary Variables

Next, we extend the setting to incomplete data analysis with auxiliary variables. Let A denote a vector of auxiliary variables. In addition to Y, we observe A in the training data, but we are not interested in A. For convenience, we introduce a vector of observable variables B = (Y, A) and a vector of all variables C = (Y, Z, A), as summarized in Table 1. Now c_i = (y_i, z_i, a_i), i = 1, …, n, are independent realizations of C, and we estimate θ from the observed training data b_i = (y_i, a_i), i = 1, …, n. Let θ̂_b be the MLE of θ obtained by using A in addition to Y. Since we are only interested in the primary variables, we consider the plug-in predictive distribution p_x(θ̂_b) by substituting θ̂_b into p_x(θ), and evaluate the MLE by comparing p_x(θ̂_b) with q_x.
In order to define the MLE θ̂_b, let us clarify a candidate parametric model with auxiliary variables. We write the true density function of C as q_c(c) = q_c(y, z, a) and a candidate parametric model of the true density as p_c(c; β) = p_c(y, z, a; β), where β = (θ⊤, φ⊤)⊤ ∈ ℬ ⊂ ℝ^{d+f} is an unknown parameter vector with nuisance parameter φ ∈ ℝ^f, and ℬ is its parameter space. We assume that c = (y, z, a) ∈ 𝒴 × 𝒵 × 𝒜 for all density functions, where 𝒜 is the domain of A. We also assume that β is identifiable with respect to p_b(y, a; β) = ∫ p_c(y, z, a; β) dz. Let us redefine p_x(θ) as p_x(y, z; θ) = ∫ p_c(y, z, a; β) da and the parameter space of θ as
$$\Theta = \left\{ \theta \,\middle|\, \begin{pmatrix} \theta \\ \varphi \end{pmatrix} \in \mathcal{B} \right\}.$$
Then, θ̂_b is obtained from the MLE of β given by
$$\hat{\beta}_b = \begin{pmatrix} \hat{\theta}_b \\ \hat{\varphi}_b \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \ell_b(\beta) \equiv \arg\max_{\beta \in \mathcal{B}} \frac{1}{n} \sum_{i=1}^{n} \log p_b(b_i; \beta),$$
where ℓ_b(β) denotes the log-likelihood function (divided by n) of β with respect to b_1, …, b_n.
Finally, we introduce a general notation for density functions. For a random variable, say R, we write the true density function as q r ( r ) and a candidate parametric model of q r as p r ( r ; θ ) or p r ( r ; β ) . For random variables R and S, we write the true conditional density function of R given S = s as q r | s ( r | s ) and its corresponding model as p r | s ( r | s ; θ ) or p r | s ( r | s ; β ) . For example, a candidate model of C can be decomposed as
$$p_c(y, z, a; \beta) = p_x(y, z; \theta)\, p_{a|x}(a \mid y, z; \beta).$$

2.3. Comparing the Two Estimators

We have thus far obtained the two MLEs of θ, namely θ̂_y and θ̂_b, and their corresponding predictive distributions p_x(θ̂_y) and p_x(θ̂_b), respectively. We would like to determine which of the two predictive distributions approximates q_x better than the other. The approximation error of p_x(θ) is measured by the KL divergence from q_x to p_x(θ) defined as
$$D_x(q_x; p_x(\theta)) = -\int q_x(x) \log p_x(x; \theta)\, dx + \int q_x(x) \log q_x(x)\, dx.$$
Since the last term on the right hand side does not depend on p_x(θ), we ignore it for computing the loss function of p_x(θ) defined by
$$L_x(\theta) = -\int q_x(x) \log p_x(x; \theta)\, dx.$$
Let θ̂ be an estimator of θ. The risk (or expected loss) function of p_x(θ̂) is defined by
$$R_x(\hat{\theta}) = \mathrm{E}\bigl[L_x(\hat{\theta})\bigr], \qquad (3)$$
where we take the expectation by considering θ̂ as a random variable. Note that θ̂ in the notation of R_x(θ̂) indicates the procedure for computing θ̂ rather than a particular value of θ̂. R_x(θ̂) measures how well p_x(θ̂) approximates q_x on average in the long run.
For comparing the two MLEs, we define R_x(θ̂_y) and R_x(θ̂_b) by considering that θ̂_y and θ̂_b are functions of the independent random variables Y_1, …, Y_n and B_1, …, B_n, respectively, where B_i = (Y_i, A_i) has the same distribution as B for all i = 1, …, n. The estimator θ̂_b is better than θ̂_y when R_x(θ̂_b) < R_x(θ̂_y), that is, when the auxiliary variable A helps the statistical inference on q_x. On the other hand, A is harmful when R_x(θ̂_b) > R_x(θ̂_y). Although we focus only on the comparison between Y and B = (Y, A) in this paper, if there are multiple auxiliary variables (and their combinations) A_1, A_2, …, then we may compare R_x(θ̂_{(y,a_1)}), R_x(θ̂_{(y,a_2)}), …, to determine good auxiliary variables. Of course, the risk functions cannot be calculated in reality because they depend on the unknown true distribution. Thus, we derive a new information criterion as an estimator of the risk function in our setting. Since an asymptotically unbiased estimator of R_x(θ̂_y) has already been derived in Shimodaira and Maeda [13], we will only derive an asymptotically unbiased estimator of R_x(θ̂_b).
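When the true distribution is under our control, as in the simulation study of Section 6, the two risks can be approximated by Monte Carlo. The sketch below shows one way to organize such a comparison; the sampler sample_qc, the fitting routines fit_theta_y and fit_theta_b, and the model log-density log_px are hypothetical user-supplied callables, so this is an illustration of the definition rather than part of the proposed method.

```python
import numpy as np

def loss_x(theta, log_px, x_big):
    """Monte Carlo approximation of L_x(theta) = -E_{q_x}[log p_x(X; theta)]
    from a large evaluation sample x_big drawn from the true q_x."""
    return -np.mean([log_px(x, theta) for x in x_big])

def risk_difference(T, n, sample_qc, fit_theta_y, fit_theta_b, log_px, x_big):
    """Approximate R_x(theta_hat_b) - R_x(theta_hat_y) over T simulated training sets.
    A negative value indicates that the auxiliary variable A is helpful."""
    diffs = []
    for _ in range(T):
        y, z, a = sample_qc(n)             # training data from the true q_c; z is discarded
        theta_b = fit_theta_b(y, a)        # MLE using Y and A
        theta_y = fit_theta_y(y)           # MLE using Y only
        diffs.append(loss_x(theta_b, log_px, x_big) - loss_x(theta_y, log_px, x_big))
    return float(np.mean(diffs))
```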

3. An Illustrative Example with Auxiliary Variables

3.1. Model Setting

In this section, we demonstrate parameter estimation using auxiliary variables in the Gaussian mixture model (GMM), which can be formulated as incomplete data analysis. Let us consider a two-component GMM; observed values are generated from one of two Gaussian distributions, where the assigned labels are missing. The observed data and missing labels are realizations of Y and Z, respectively. We estimate a predictive distribution of X = (Y, Z) from the observation of Y, and we attempt to improve it by utilizing A in addition to Y. The true density function of the primary variables X = (Y, Z) ∈ ℝ × {0, 1} is given as
$$q_{y|z}(y \mid z) = z\, N(y; 1.2, 0.7) + (1 - z)\, N(y; -1.2, 0.7), \qquad q_z(z) = 0.6 z + 0.4 (1 - z),$$
where N ( · ; μ , σ 2 ) denotes the density function of N ( μ , σ 2 ) , i.e., the normal distribution with mean μ and variance σ 2 . We consider the following two cases for the true conditional distribution of auxiliary variable A given X = x :
Case 1:
$$q_{a|x}(a \mid y, z) = q_{a|z}(a \mid z) = z\, N(a; 1.8, 0.49) + (1 - z)\, N(a; -1.8, 0.49).$$
Case 2:
$$q_{a|x}(a \mid y, z) = q_a(a) = 0.6\, N(a; 1.8, 0.49) + 0.4\, N(a; -1.8, 0.49).$$
The random variables X and A are not independent in Case 1, whereas they are independent in Case 2. Hence, A will contribute to estimating θ in Case 1. On the other hand, in Case 2, A cannot be useful, and A becomes just noise if we estimate θ from Y and A.
In both cases, we use the following two-component GMM as a candidate model of q c :
$$p_{b|z}(y, a \mid z; \beta) = z\, N_2\bigl((y, a)^{\top}; \mu_1, \Sigma\bigr) + (1 - z)\, N_2\bigl((y, a)^{\top}; \mu_2, \Sigma\bigr), \qquad p_z(z; \theta) = \pi_1 z + (1 - \pi_1)(1 - z), \qquad (4)$$
where N 2 ( · ; μ i , Σ ) denotes the density function of bivariate normal distribution N 2 ( μ i , Σ ) , i = 1 , 2 , and the parameters are
$$\mu_1 = \begin{pmatrix} \mu_{1y} \\ \mu_{1a} \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} \mu_{2y} \\ \mu_{2a} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_y^2 & \sigma_{ya} \\ \sigma_{ya} & \sigma_a^2 \end{pmatrix}.$$
Therefore, β = (θ⊤, φ⊤)⊤ with θ = (π_1, μ_{1y}, μ_{2y}, σ_y²)⊤ and φ = (μ_{1a}, μ_{2a}, σ_a², σ_{ya})⊤. The true parameters of θ and φ for Case 1 are given by θ_0 = (0.6, 1.2, −1.2, 0.7)⊤ and φ_0 = (1.8, −1.8, 0.49, 0)⊤, respectively. By considering the joint density function p_c(y, z, a; β) = p_{b|z}(y, a | z; β) p_z(z; θ), this candidate model correctly specifies the true density function q_c(y, z, a) = q_{a|x}(a | y, z) q_{y|z}(y | z) q_z(z) in Case 1. On the other hand, the model is misspecified in Case 2, and there is no true parameter value.
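The two data-generating mechanisms can be simulated directly. The following Python sketch draws (y_i, z_i, a_i) from the true model above; the function name and the use of numpy are our own choices for illustration, and the standard deviations passed to the sampler are the square roots of the variances 0.7 and 0.49.

```python
import numpy as np

def generate(n, case, rng):
    """Draw n realizations (y, z, a) from the true model of Section 3.1.
    case=1: A depends on Z (useful auxiliary variable);
    case=2: A is independent of (Y, Z) (useless auxiliary variable)."""
    z = rng.binomial(1, 0.6, size=n)                           # q_z: P(Z = 1) = 0.6
    y = np.where(z == 1,
                 rng.normal(1.2, np.sqrt(0.7), n),
                 rng.normal(-1.2, np.sqrt(0.7), n))            # q_{y|z}
    label = z if case == 1 else rng.binomial(1, 0.6, size=n)   # Case 2: label independent of (Y, Z)
    a = np.where(label == 1,
                 rng.normal(1.8, 0.7, n),                      # sd 0.7 = sqrt(0.49)
                 rng.normal(-1.8, 0.7, n))
    return y, z, a

rng = np.random.default_rng(0)
y, z, a1 = generate(100, case=1, rng=rng)   # the sample size used in the illustration below
```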

3.2. Estimation Results

For illustrating the impact of auxiliary variables on parameter estimation in each case, we generated a typical dataset c 1 , , c n with sample size n = 100 from q c , which is actually picked from 10,000 datasets generated in the simulation study of Section 6, and details of how to select the typical dataset are also shown there. For each case, we computed the three MLEs θ ^ y , θ ^ b , and θ ^ x , where θ ^ x is the MLE of θ calculated by using complete data x 1 , , x n as if labels z 1 , , z n were available.
The result of Case 1 is shown in Figure 1, where A is beneficial for estimating θ. In the left panel, the two clusters are well separated, which makes parameter estimation stable. The estimated p_b(β̂_b) captures the structure of the two clusters corresponding to the labels z_i = 0 and z_i = 1, showing that p_c(β̂_b) is estimated reasonably well, and thus p_x(θ̂_b) is a good approximation of q_x. Looking at the right panel, we also observe that p_y(θ̂_b) is better than p_y(θ̂_y) for approximating p_y(θ̂_x), suggesting that the auxiliary variable is useful for recovering the lost information of the missing data. In fact, the three MLEs are calculated as follows: θ̂_y = (0.671, 1.143, −1.324, 0.678)⊤, θ̂_b = (0.613, 1.228, −1.093, 0.744)⊤, and θ̂_x = (0.620, 1.233, −1.141, 0.695)⊤. By comparing ‖θ̂_b − θ̂_x‖ = 0.069 with ‖θ̂_y − θ̂_x‖ = 0.212, we can see that θ̂_b is better than θ̂_y for predicting θ̂_x without looking at the latent variable. All these observations indicate that the parameter estimation of θ is improved by using A in Case 1.
The result of Case 2 is shown in Figure 2, where A is harmful for estimating θ. For a fair comparison, exactly the same values of {(y_i, z_i)}_{i=1}^{100} are used in both cases. Thus, θ̂_y and θ̂_x have the same values as in Case 1, whereas θ̂_b takes a different value, θ̂_b = (0.581, 0.403, 0.232, 2.015)⊤. By comparing ‖θ̂_b − θ̂_x‖ = 2.078 with ‖θ̂_y − θ̂_x‖ = 0.212, we can see that θ̂_b is worse than θ̂_y for predicting θ̂_x. This is also seen in Figure 2. In the left panel, the estimated p_b(β̂_b) captures some structure of two clusters, but they do not correspond to the labels z_i = 0 and z_i = 1. As a result, p_y(θ̂_b) becomes a very poor approximation of p_y(θ̂_x) in the right panel, indicating that the parameter estimation of θ is actually hindered by using A in Case 2.
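For readers who want to reproduce estimates of this kind, the sketch below computes the three MLEs for the candidate model (4): θ̂_y from Y alone, θ̂_b from (Y, A), and the complete-data θ̂_x from (Y, Z), using scikit-learn's EM implementation of Gaussian mixtures with a shared ("tied") covariance. This is our own illustrative implementation, not the authors' code; EM may converge to a local maximum, and which fitted component corresponds to Z = 1 is arbitrary (label switching, discussed in Section 6.1).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def theta_from_gmm(gm):
    """Extract theta = (pi_1, mu_1y, mu_2y, sigma_y^2) from a fitted 2-component GMM
    whose first coordinate is Y (works for 1-D and 2-D fits with tied covariance)."""
    pi1 = gm.weights_[0]
    mu1y, mu2y = gm.means_[0, 0], gm.means_[1, 0]
    sigma_y2 = gm.covariances_[0, 0]       # tied covariance has shape (n_features, n_features)
    return np.array([pi1, mu1y, mu2y, sigma_y2])

def fit_theta_y(y):
    """theta_hat_y: MLE from the observed Y only (y is a 1-D numpy array)."""
    gm = GaussianMixture(2, covariance_type="tied", random_state=0).fit(y.reshape(-1, 1))
    return theta_from_gmm(gm)

def fit_theta_b(y, a):
    """theta_hat_b: MLE from B = (Y, A)."""
    gm = GaussianMixture(2, covariance_type="tied", random_state=0).fit(np.column_stack([y, a]))
    return theta_from_gmm(gm)

def fit_theta_x(y, z):
    """theta_hat_x: complete-data MLE with known labels (closed form, pooled variance)."""
    pi1 = np.mean(z == 1)
    mu1, mu2 = y[z == 1].mean(), y[z == 0].mean()
    s2 = (np.sum((y[z == 1] - mu1) ** 2) + np.sum((y[z == 0] - mu2) ** 2)) / len(y)
    return np.array([pi1, mu1, mu2, s2])
```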
These examples suggest that usefulness of auxiliary variables depends strongly on the true distribution and a candidate model. Hence, it is important to select useful auxiliary variables from observed data.

4. Information Criterion

4.1. Asymptotic Expansion of the Risk Function

In this section, we derive a new information criterion as an asymptotically unbiased estimator of the risk function R_x(θ̂_b) defined in (3). We start from a general framework of misspecification, i.e., without assuming that candidate models are correctly specified, and later we give specific assumptions. Let β̄ be the optimal parameter value with respect to the KL divergence from q_b to p_b(β), that is,
$$\bar{\beta} = \begin{pmatrix} \bar{\theta} \\ \bar{\varphi} \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \int q_b(b) \log p_b(b; \beta)\, db.$$
If the candidate model is correctly specified, i.e., there exists β_0 = (θ_0⊤, φ_0⊤)⊤ such that q_b = p_b(β_0), then β̄ = β_0 as well as θ̄ = θ_0.
In this paper, we assume the regularity conditions A1 to A6 of White [16] for q_b and p_b(β) so that the MLE β̂_b has consistency and asymptotic normality. In particular, β̄ is determined uniquely (i.e., identifiable) and is interior to ℬ. We assume that I_b and J_b defined below are nonsingular in a neighbourhood of β̄. Then White [16] showed the asymptotic normality as n → ∞,
$$\sqrt{n}\,(\hat{\beta}_b - \bar{\beta}) \xrightarrow{d} N_{d+f}\bigl(0,\; I_b^{-1} J_b I_b^{-1}\bigr), \qquad (5)$$
where I_b and J_b are (d + f) × (d + f) matrices defined, using ∇ = ∂/∂β, ∇⊤ = ∂/∂β⊤, and ∇² = ∂²/∂β∂β⊤, as
$$I_b = -\mathrm{E}\bigl[\nabla^2 \log p_b(b; \bar{\beta})\bigr], \qquad J_b = \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} \log p_b(b; \bar{\beta})\bigr].$$
Note that we write derivatives in abbreviated form; e.g., ∇² log p_b(b; β̄) means ∇² log p_b(b; β)|_{β=β̄}, and so on. In addition, we allow interchange of integrals and derivatives rather formally when working with models, although we actually need conditions on the models such as those in White [16]. Moreover, condition A7 of White [16] is assumed in order to establish I_b = J_b when considering the situation that the candidate model is correctly specified. We assume the above conditions throughout the paper without stating them explicitly.
Let us define three (d + f) × (d + f) matrices as
$$I_x = -\mathrm{E}\bigl[\nabla^2 \log p_x(x; \bar{\theta})\bigr], \qquad I_y = -\mathrm{E}\bigl[\nabla^2 \log p_y(y; \bar{\theta})\bigr], \qquad I_{z|y} = -\mathrm{E}\bigl[\nabla^2 \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = I_x - I_y,$$
which will be used in the lemmas below. Since the derivatives of log p_x(x; θ) and log p_y(y; θ) with respect to φ are zero, these matrices become singular when f > 0, but this is not a problem in our calculation. The following lemma shows that the dominant term of R_x(θ̂_b) is L_x(θ̄) and the remainder terms are of order O(n⁻¹), by noting that ∇L_x(θ̄) = O(1) and E[β̂_b − β̄] = O(n⁻¹) in general. The proof is given in Appendix A.1.
Lemma 1.
The risk function R_x(θ̂_b) is expanded asymptotically as
$$R_x(\hat{\theta}_b) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2n}\, \mathrm{tr}\bigl(I_x I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}).$$
Just as a remark, the term ∇L_x(θ̄)⊤ E[β̂_b − β̄] = O(n⁻¹) above does not appear in the derivation of AIC or TIC, where B = X and thus ∇L_x(θ̄) = 0. This term appears when the loss function for evaluation and that for estimation differ, for example, in the derivation of the information criterion under covariate shift; see the term K_w[1]b_w in Equation (4.1) of Shimodaira [17].

4.2. Estimating the Risk Function

For deriving an estimator of R_x(θ̂_b), we introduce an additional condition. Let us assume that the candidate model is correctly specified for the latent part:
$$q_{z|y}(z \mid y) = p_{z|y}(z \mid y; \bar{\theta}). \qquad (6)$$
This is the same condition as Equation (14) of Shimodaira and Maeda [13] except that θ̄ is replaced by
$$\bar{\theta}_y = \arg\max_{\theta \in \Theta} \int q_y(y) \log p_y(y; \theta)\, dy.$$
Since Z is missing completely in our setting, we need such a condition to proceed further. Although no method can detect misspecification of p_{z|y} if p_b is correctly specified, it is often the case that misspecification of p_{z|y} leads to that of p_b, and thus it is detected indirectly, as in Case 2 of Section 3.
Note that the symbol θ̄ in our notation should properly be θ̄_b, although we use θ̄ for simplicity; there is also θ̄_x, defined similarly from p_x(x; θ). They all differ from each other with differences of order O(1) in general, but θ̄ = θ̄_y = θ̄_x = θ_0 when p_c(β) is correctly specified as q_c = p_c(β_0).
Now we give the asymptotic expansion of E[ℓ_y(θ̂_b)], which shows that −ℓ_y(θ̂_b) can be used as an estimator of L_x(θ̄), but the asymptotic bias is of order O(n⁻¹).
Lemma 2.
Assume the condition (6). Then, the expectation of the estimated log-likelihood ℓ_y(θ̂_b) can be expanded as
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = -L_x(\bar{\theta}) - C(q_x) - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}),$$
where K_{b,y} = E[∇ log p_b(β̄) ∇⊤ log p_y(θ̄)] and C(q_x) = ∫ q_x(x) log q_{z|y}(z | y) dx.
The proof of Lemma 2 is given in Appendix A.2. By eliminating L x ( θ ¯ ) from the two expressions in Lemma 1 and Lemma 2, and rearranging the formula, we get the following lemma, which plays a central role in deriving our information criterion.
Lemma 3.
Assume the condition (6). Then, an expansion of the risk function R_x(θ̂_b) is given by
$$R_x(\hat{\theta}_b) = -\mathrm{E}[\ell_y(\hat{\theta}_b)] - C(q_x) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (7)$$
We can ignore C ( q x ) for model selection, because it is a constant term which does not depend on the candidate model. Thus, finally, we define an information criterion from the right hand side of (7). The following theorem is an immediate consequence of Lemma 3.
Theorem 1.
Assume the condition (6). Let us define an information criterion as
$$\widehat{\mathrm{risk}}_{x;b} = -2n\, \ell_y(\hat{\theta}_b) + 2\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr). \qquad (8)$$
Then this criterion is an asymptotically unbiased estimator of 2n R_x(θ̂_b), ignoring the constant term C(q_x):
$$\mathrm{E}\bigl[\widehat{\mathrm{risk}}_{x;b}\bigr] = 2n\, R_x(\hat{\theta}_b) + 2n\, C(q_x) + o(1).$$
Note that the subscript x;b of risk̂_{x;b} is defined in accordance with Shimodaira and Maeda [13]: the former x and the latter b indicate the random variables used for evaluation and for estimation, respectively. This criterion is an extension of TIC because, when X = B = Y, risk̂_{x;b} coincides with the TIC of Takeuchi [9], defined as follows:
$$\mathrm{TIC} = -2n\, \ell_y(\hat{\theta}_y) + 2\, \mathrm{tr}\bigl(I_y^{-1} J_y\bigr).$$

4.3. Akaike Information Criteria for Auxiliary Variable Selection

In actual use, risk̂_{x;b} may be too complicated. Thus, we derive a simpler information criterion by assuming correctness of the candidate model, as in the derivation of AIC.
Theorem 2.
Suppose p_c(β) is correctly specified so that q_c = p_c(β_0) for some β_0 ∈ ℬ. Then, we have
$$J_b = I_b, \qquad K_{b,y} = I_y, \qquad (9)$$
and thus risk̂_{x;b} is rewritten as
$$\mathrm{AIC}_{x;b} = -2n\, \ell_y(\hat{\theta}_b) + \mathrm{tr}\bigl(I_x I_b^{-1}\bigr) + \mathrm{tr}\bigl(I_y I_b^{-1}\bigr). \qquad (10)$$
This criterion is an asymptotically unbiased estimator of 2n R_x(θ̂_b), ignoring the constant term C(q_x):
$$\mathrm{E}\bigl[\mathrm{AIC}_{x;b}\bigr] = 2n\, R_x(\hat{\theta}_b) + 2n\, C(q_x) + o(1).$$
The proof is given in Appendix A.3. In practical situations, I_x, I_y, and I_b are replaced by their consistent estimators.
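As one possible way to carry out this plug-in step, the sketch below estimates I_b and I_y by sample averages of numerical Hessians of the candidate log-densities at β̂_b, and estimates I_x = I_y + I_{z|y} by taking the expectation over Z given Y under the fitted model, which is in the spirit of condition (6). The log-density callables and the finite latent support are hypothetical user inputs; this is an illustrative recipe, not the authors' implementation.

```python
import numpy as np

def num_hessian(f, beta, eps=1e-4):
    """Central-difference Hessian of a scalar function f at the point beta."""
    beta = np.asarray(beta, dtype=float)
    d = len(beta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(beta + ei + ej) - f(beta + ei - ej)
                       - f(beta - ei + ej) + f(beta - ei - ej)) / (4 * eps ** 2)
    return H

def aic_x_b(beta_hat, y, b, log_py, log_pb, log_pzy, z_support):
    """AIC_{x;b} = -2n l_y(theta_hat_b) + tr(I_x I_b^{-1}) + tr(I_y I_b^{-1}),
    with I_b, I_y, I_{z|y} replaced by plug-in estimates at beta_hat.

    log_py(y_i, beta), log_pb(b_i, beta), log_pzy(z, y_i, beta) are hypothetical
    log-densities of the candidate model, viewed as functions of the full beta;
    z_support lists the latent values, e.g. [0, 1] for the two-component GMM."""
    I_b = -np.mean([num_hessian(lambda be: log_pb(bi, be), beta_hat) for bi in b], axis=0)
    I_y = -np.mean([num_hessian(lambda be: log_py(yi, be), beta_hat) for yi in y], axis=0)
    I_zy = -np.mean([sum(np.exp(log_pzy(z, yi, beta_hat))
                         * num_hessian(lambda be: log_pzy(z, yi, be), beta_hat)
                         for z in z_support)
                     for yi in y], axis=0)
    I_x = I_y + I_zy
    I_b_inv = np.linalg.inv(I_b)
    loglik_y = sum(log_py(yi, beta_hat) for yi in y)    # equals n * l_y(theta_hat_b)
    return -2.0 * loglik_y + np.trace(I_x @ I_b_inv) + np.trace(I_y @ I_b_inv)
```

Analytic first and second derivatives can of course replace the finite differences when they are available.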
The newly obtained criterion AIC x ; b is a generalization of AIC and some of its variants. If θ is estimated by θ ^ y instead of θ ^ b , we simply let B = Y in the expression of AIC x ; b so that we get AIC x ; y proposed by Shimodaira and Maeda [13]:
$$\mathrm{AIC}_{x;y} = -2n\, \ell_y(\hat{\theta}_y) + \mathrm{tr}\bigl(I_x I_y^{-1}\bigr) + d.$$
Note that if B = Y , I y is not singular because β = θ . On the other hand, if there is no latent part, we simply let X = Y in the expression of AIC x ; b so that we get
$$\mathrm{AIC}_{y;b} = -2n\, \ell_y(\hat{\theta}_b) + 2\, \mathrm{tr}\bigl(I_y I_b^{-1}\bigr).$$
This can be used to select useful auxiliary variables in complete data analysis. Moreover, if X = Y = B , AIC x ; b reduces to the original AIC proposed by Akaike [6]:
$$\mathrm{AIC}_{y;y} = -2n\, \ell_y(\hat{\theta}_y) + 2d.$$
It is worth mentioning that tr ( I z | y I b 1 ) is interpreted as the additional penalty for the latent part:
$$\mathrm{AIC}_{x;b} - \mathrm{AIC}_{y;b} = \mathrm{tr}\bigl(I_x I_b^{-1}\bigr) - \mathrm{tr}\bigl(I_y I_b^{-1}\bigr) = \mathrm{tr}\bigl(I_{z|y} I_b^{-1}\bigr) \ge 0,$$
which is also mentioned in Equation (1) of Shimodaira and Maeda [13] for the case of B = Y .

4.4. The Illustrative Example (Cont.)

Let us return to the problem of determining whether to use the auxiliary variables, that is, the comparison between p_x(θ̂_b) and p_x(θ̂_y). By comparing AIC_{x;b} with AIC_{x;y}, we can determine whether the vector of auxiliary variables A is useful or useless: only when AIC_{x;b} < AIC_{x;y} do we conclude that A is useful for estimating θ to predict X.
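In code, the rule is a one-line comparison. The tiny sketch below (with hypothetical AIC values in the docstring) makes the convention explicit, namely that smaller is better, and extends directly to several candidate auxiliary sets as used in Section 7.

```python
def select_by_aic(aic_values):
    """Return the label of the candidate with the smallest AIC value.
    Example with hypothetical numbers:
        select_by_aic({"Y only": 212.9, "Y and A": 210.3})  ->  "Y and A"
    i.e., the auxiliary variable is judged useful when AIC_{x;b} < AIC_{x;y}."""
    return min(aic_values, key=aic_values.get)
```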
Let us apply this procedure to the illustrative example in Section 3. The generalized AICs are computed for the two cases of the typical dataset, and the results are shown in Table 2. The value of AIC_{x;b} − AIC_{x;y} is negative for Case 1, concluding that the auxiliary variable is useful, and it is positive for Case 2, concluding that the auxiliary variable is useless. According to the AIC values, therefore, we use the auxiliary variable of Case 1 but do not use the auxiliary variable of Case 2. This decision agrees with the observations of Figure 1 and Figure 2 in Section 3.2. In fact, the decision is correct, because the value of R_x(θ̂_b) − R_x(θ̂_y) is negative for Case 1 and positive for Case 2, as will be seen in the simulation study of Section 6.2.
We can also examine the usefulness of the auxiliary variable for predicting Y instead of X, that is, the comparison between p_y(θ̂_b) and p_y(θ̂_y). By comparing AIC_{y;b} with AIC_{y;y}, we can determine whether A is useful or useless for predicting Y. Looking at the value of AIC_{y;b} − AIC_{y;y} in Table 2, we reach the same decision as that for X.

5. Leave-One-Out Cross Validation

Variable selection by cross-validatory (CV) choice [18] is often applied in real data analysis due to its simplicity, although its computational burden is larger than that of information criteria; see Arlot and Celisse [19] for a review of cross-validation methods. As shown in Stone [14], leave-one-out cross validation (LOOCV) is asymptotically equivalent to TIC. Because LOOCV does not require calculation of the information matrices appearing in TIC, LOOCV is easier to use than TIC. There is also literature on improving LOOCV, such as Yanagihara et al. [20], which modifies LOOCV to reduce its bias by considering maximum weighted log-likelihood estimation. However, we focus on the result of Stone [14] and extend it to our setting.
In incomplete data analysis, LOOCV cannot be used directly because the loss function with respect to the complete data includes latent variables. Thus, we transform the loss function as follows:
$$L_x(\theta) = \int q_y(y)\, g(y; \theta)\, dy,$$
where g(y; θ) = −log p_y(y; θ) + f(y; θ) and
$$f(y; \theta) = -\int q_{z|y}(z \mid y) \log p_{z|y}(z \mid y; \theta)\, dz.$$
Note that f(y; θ) = 0 when X = Y. Using the function g(y; θ), we then obtain the following LOOCV estimator of the risk function R_x(θ̂_b):
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr),$$
where θ̂_b^{(i)} is the leave-one-out estimate of θ defined through
$$\hat{\beta}_b^{(i)} = \begin{pmatrix} \hat{\theta}_b^{(i)} \\ \hat{\varphi}_b^{(i)} \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \frac{1}{n} \sum_{j \ne i}^{n} \log p_b(b_j; \beta) = \arg\max_{\beta \in \mathcal{B}} \Bigl\{ \ell_b(\beta) - \frac{1}{n} \log p_b(b_i; \beta) \Bigr\}.$$
We will show below in this section that L_x^{cv}(θ̂_b) is asymptotically equivalent to risk̂_{x;b}. For implementing the LOOCV procedure with latent variables, however, we have to estimate q_{z|y}(z | y) by p_{z|y}(z | y; θ̂_b) in f(y; θ). This introduces a bias into L_x^{cv}(θ̂_b), and hence information criteria are preferable to LOOCV in incomplete data analysis.
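A direct, if computationally heavy, implementation of this estimator refits the model n times. The sketch below does exactly that; the fitting routine and the model log-densities are hypothetical user-supplied callables, and q_{z|y} is replaced by the plug-in p_{z|y}(· | y; θ̂_b) computed from the full data, which is the source of the bias mentioned above.

```python
import numpy as np

def loocv_risk(b, y, fit, log_py, log_pzy, z_support, beta_full):
    """LOOCV estimate L_x^cv(theta_hat_b) = (1/n) sum_i g(y_i; theta_hat_b^(i)),
    where g(y; theta) = -log p_y(y; theta) + f(y; theta).

    b, y       : lists of observed b_i = (y_i, a_i) and y_i
    fit        : hypothetical routine returning the MLE from a data subset
    log_py, log_pzy : log-densities of the candidate model (functions of beta)
    z_support  : finite support of the latent variable
    beta_full  : full-data estimate, used as the plug-in for q_{z|y} in f(y; theta)
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        beta_i = fit([b[j] for j in range(n) if j != i])     # leave-one-out estimate
        g = -log_py(y[i], beta_i)
        for z in z_support:
            w = np.exp(log_pzy(z, y[i], beta_full))          # plug-in weight for q_{z|y}(z | y_i)
            g -= w * log_pzy(z, y[i], beta_i)                # adds f(y_i; theta_hat_b^(i))
        total += g
    return total / n
```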
Let us show the asymptotic equivalence of L_x^{cv}(θ̂_b) and risk̂_{x;b}, assuming that we know the functional form of f(y; θ). Noting that β̂_b^{(i)} is a critical point of ℓ_b(β) − log p_b(b_i; β)/n, we have
$$\nabla \ell_b\bigl(\hat{\beta}_b^{(i)}\bigr) = \frac{1}{n} \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr) = O_p(n^{-1}).$$
By applying a Taylor expansion to ∇ℓ_b(β) around β = β̂_b, it follows from ∇ℓ_b(β̂_b) = 0 that
$$\nabla^2 \ell_b(\tilde{\beta}_{bi})\bigl(\hat{\beta}_b^{(i)} - \hat{\beta}_b\bigr) = \frac{1}{n} \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr), \qquad (14)$$
where β̃_{bi} lies between β̂_b^{(i)} and β̂_b. We can see from (14) that β̂_b^{(i)} − β̂_b = O_p(n⁻¹). Next, we regard g(y_i; θ) as a function of β and apply a Taylor expansion to it around β = β̂_b. Therefore, g(y_i; θ̂_b^{(i)}) can be expressed as follows:
$$g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigl(\hat{\beta}_b^{(i)} - \hat{\beta}_b\bigr), \qquad (15)$$
where θ̃_{bi} lies between θ̂_b^{(i)} and θ̂_b (θ̃_{bi} does not correspond to β̃_{bi}). Then we assume that
$$\frac{1}{n} \sum_{i=1}^{n} \nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr) \xrightarrow{p} -\, I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr]. \qquad (16)$$
By noting β̂_b^{(i)} = β̂_b + O_p(n⁻¹), we have β̃_{bi} = β̄ + O_p(n^{−1/2}) and θ̃_{bi} = θ̄ + O_p(n^{−1/2}), and thus (16) holds at least formally. With the above setup, we show the following theorem. The proof is given in Appendix A.4.
Theorem 3.
Supposing the same assumptions as Theorem 1 together with (16), we have
$$2n\, L_x^{\mathrm{cv}}(\hat{\theta}_b) = \widehat{\mathrm{risk}}_{x;b} + 2 \sum_{i=1}^{n} f(y_i; \bar{\theta}) + o_p(1). \qquad (17)$$
Because the second term on the right-hand side of (17) does not depend on the candidate model under condition (6), this theorem implies that L_x^{cv}(θ̂_b) is asymptotically equivalent to risk̂_{x;b} up to the scaling and the constant term. One may wonder, however, why f(y; θ) is included in g(y; θ) when comparing models of p_b(b; β). If p_{z|y}(θ) is correctly specified for q_{z|y}, then f(y; θ̄) = −∫ q_{z|y}(z | y) log q_{z|y}(z | y) dz no longer depends on the model, so we might simply exclude f(y; θ) from g(y; θ), leading to the loss L_y(θ) instead. The reason for including f(y; θ) in g(y; θ) is as follows: L_x^{cv}(θ̂_b), as well as risk̂_{x;b} (and AIC_{x;b}), includes the additional penalty for estimating θ̂_b through f(y; θ̂_b), which depends on the candidate model even if p_{z|y}(θ) is correctly specified.

6. Experiments with Simulated Datasets

This section shows the usefulness of auxiliary variables and the proposed information criteria via a simulation study. The models illustrated in Section 3 are used for confirming the asymptotic unbiasedness of the information criterion and the validity of auxiliary variable selection.

6.1. Unbiasedness

At first, we confirm the asymptotic unbiasedness of AIC_{x;b} for estimating 2n R_x(θ̂_b) except for the constant term C(q_x). The simulation setting is the same as Case 1 in Section 3; thus the data generating model is given by
$$q_{b|z}(y, a \mid z) = z\, N_2\bigl((y, a)^{\top}; \mu_{10}, \Sigma_0\bigr) + (1 - z)\, N_2\bigl((y, a)^{\top}; \mu_{20}, \Sigma_0\bigr), \qquad q_z(z) = 0.6 z + 0.4 (1 - z),$$
where μ_{10} = −μ_{20} = (1.2, 1.8)⊤ and Σ_0 = diag(0.7, 0.49). We generated T = 10⁴ independent replicates of the dataset {(y_i, z_i, a_i)}_{i=1}^n from this model; in fact, we used {(y_i, z_i, a_{i,1})}_{i=1}^n generated in Section 6.2. The candidate model is given by (4), which is correctly specified for the above data generating model. Because AIC_{x;b} is derived by ignoring C(q_x), we compare E[AIC_{x;b} − AIC_{x;y}] with 2n{R_x(θ̂_b) − R_x(θ̂_y)}. The expectations are approximated by simulation averages as
$$\mathrm{E}[\mathrm{AIC}_{x;b} - \mathrm{AIC}_{x;y}] \approx \frac{1}{T} \sum_{t=1}^{T} \bigl\{\mathrm{AIC}_{x;b}^{(t)} - \mathrm{AIC}_{x;y}^{(t)}\bigr\}, \qquad 2n\bigl\{R_x(\hat{\theta}_b) - R_x(\hat{\theta}_y)\bigr\} \approx \frac{2n}{T} \sum_{t=1}^{T} \bigl\{L_x\bigl(\hat{\theta}_b^{(t)}\bigr) - L_x\bigl(\hat{\theta}_y^{(t)}\bigr)\bigr\},$$
where AIC_{x;b}^{(t)}, AIC_{x;y}^{(t)}, θ̂_b^{(t)}, and θ̂_y^{(t)} are those computed for the t-th dataset (t = 1, …, T).
Here, we remark on the calculation of the loss function L_x(θ̂) in the two-component GMM. Let θ̂ = (π̂_1, μ̂_1, μ̂_2, σ̂²)⊤ be an estimator of θ. We expect that the components of the GMM corresponding to Z = 1 and Z = 0 consist of (π̂_1, μ̂_1, σ̂²) and (1 − π̂_1, μ̂_2, σ̂²), respectively. However, we cannot determine the assignment of the estimated parameters in reality, i.e., (π̂_1, μ̂_1, σ̂²) and (1 − π̂_1, μ̂_2, σ̂²) may correspond to Z = 0 and Z = 1, respectively, because the labels z_1, …, z_n are missing. The assignment is required to calculate L_x(θ̂), whereas it is not used for L_y(θ̂) or the proposed information criteria. Hence, in this paper, we define L_x(θ̂) as the minimum of L_x(θ̂) and L_x(θ̂′), where θ̂′ = (1 − π̂_1, μ̂_2, μ̂_1, σ̂²)⊤.
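With a large evaluation sample from q_x at hand (as in the simulation), this convention can be implemented as follows; the sketch uses our own function names and treats θ̂ as the tuple (π̂_1, μ̂_1, μ̂_2, σ̂²).

```python
import numpy as np
from scipy.stats import norm

def loss_x(theta, y, z):
    """Monte Carlo approximation of L_x(theta) for the two-component GMM,
    using a large evaluation sample (y, z) drawn from the true q_x."""
    pi1, mu1, mu2, s2 = theta
    sd = np.sqrt(s2)
    logpx = np.where(z == 1,
                     np.log(pi1) + norm.logpdf(y, mu1, sd),
                     np.log(1.0 - pi1) + norm.logpdf(y, mu2, sd))
    return -np.mean(logpx)

def loss_x_aligned(theta, y, z):
    """Resolve the label-switching ambiguity: take the smaller loss of theta and
    its swapped version theta' = (1 - pi1, mu2, mu1, sigma^2)."""
    pi1, mu1, mu2, s2 = theta
    return min(loss_x((pi1, mu1, mu2, s2), y, z),
               loss_x((1.0 - pi1, mu2, mu1, s2), y, z))
```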
Table 3 shows the results of the simulation for n = 100, 200, 500, 1000, 2000, and 5000. For all n, we observe that E[AIC_{x;b} − AIC_{x;y}] is very close to 2n{R_x(θ̂_b) − R_x(θ̂_y)}, indicating the unbiasedness of AIC_{x;b}.

6.2. Auxiliary Variable Selection

Next, we demonstrate that the proposed AIC selects a useful auxiliary variable (Case 1), while it does not select a useless auxiliary variable (Case 2). In each case, we generated T = 10⁴ independent replicates of the dataset {(y_i, z_i, a_i)}_{i=1}^n from the model. In fact, the values of {(y_i, z_i)}_{i=1}^n are shared in both cases, so we generated replicates of {(y_i, z_i, a_{i,1}, a_{i,2})}_{i=1}^n, where a_{i,1} and a_{i,2} are the auxiliary variables for Case 1 and Case 2, respectively. In each case, we compute AIC_{x;b} and AIC_{x;y}; we then select θ̂_b (i.e., select the auxiliary variable A) if AIC_{x;b} < AIC_{x;y}, and select θ̂_y (i.e., do not select the auxiliary variable A) otherwise. The selected estimator is denoted as θ̂_best. This experiment was repeated T = 10⁴ times. Note that the typical dataset in Section 3 was picked from the generated datasets so that it has around the median value in each of L_x(θ̂_b) − L_x(θ̂_y), L_y(θ̂_b) − L_y(θ̂_y), AIC_{x;b} − AIC_{x;y}, and AIC_{y;b} − AIC_{y;y} in both cases.
The selection frequencies are shown in Table 4 and Table 5. We observe that, as expected, the useful auxiliary variable tends to be selected in Case 1, while the useless auxiliary variable tends not to be selected in Case 2.
For verifying the usefulness of the auxiliary variable in both cases, we computed the risk value R_x(θ̂) for θ̂ = θ̂_y, θ̂_b, and θ̂_best. They are approximated by the simulation average as
$$R_x(\hat{\theta}) \approx \frac{1}{T} \sum_{t=1}^{T} L_x\bigl(\hat{\theta}^{(t)}\bigr).$$
The results are shown in Table 6 and Table 7. For easier comparison, the reported values are the differences from L_x(θ_0), where θ_0 is the true parameter value. For all n, we observe that, as expected, R_x(θ̂_b) < R_x(θ̂_y) in Case 1 and R_x(θ̂_b) > R_x(θ̂_y) in Case 2. In both cases, R_x(θ̂_best) is close to min{R_x(θ̂_b), R_x(θ̂_y)}, indicating that the variable selection works well.

7. Experiments with Real Datasets

We show an example of auxiliary variable selection using the Wine Data Set available at the UCI Machine Learning Repository [21], which consists of 1 categorical variable (3 categories) and 13 continuous variables, denoted V_1, …, V_13. For simplicity, we only use the first two categories and regard the category label as the latent variable Z ∈ {0, 1}; the experimental results were similar for the other combinations. The sample size is then n = 130, and all variables except Z are standardized. We set one of the 13 continuous variables as the observed primary variable Y and the remaining 12 variables as auxiliary variables A_1, …, A_12. For example, if Y is V_1, then A_1, …, A_12 are V_2, …, V_13. The dataset is now {(y_i, z_i, a_{i,1}, …, a_{i,12})}_{i=1}^n, which is randomly divided into a training set of size n_tr = 86 (z_i is not used) and a test set of size n_te = 44 (a_{i,1}, …, a_{i,12} are not used).
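The data preparation described here can be reproduced with scikit-learn's packaged copy of the UCI Wine Data Set. The snippet below is a hedged sketch of the preprocessing only — the subsequent AIC computations proceed as in Section 4 — and the random seed and variable names are our own choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
mask = wine.target < 2                                   # keep the first two categories only
X_all, z = wine.data[mask], wine.target[mask]            # n = 130, 13 continuous variables
X_all = (X_all - X_all.mean(axis=0)) / X_all.std(axis=0) # standardize V1..V13 (Z is untouched)

ell = 0                                                  # ell = 0 selects Y = V_1
y = X_all[:, ell]
A = np.delete(X_all, ell, axis=1)                        # the remaining columns are A_1..A_12
y_tr, y_te, A_tr, A_te, z_tr, z_te = train_test_split(y, A, z, train_size=86, random_state=0)
# z_tr is not used for fitting; A_te is not used for evaluation, as described above.
```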
In the experiment, we compute AIC_{x;b_ℓ} for B_ℓ = (Y, A_ℓ), ℓ = 1, …, 12, and AIC_{x;y} for Y from the training dataset using the model (4). We select θ̂_best from θ̂_{b_1}, …, θ̂_{b_12} and θ̂_y by finding the minimum of the 13 AIC values. Thus we are selecting one of the auxiliary variables A_1, …, A_12 or not selecting any of them. It is possible to select a combination of the auxiliary variables, but we did not attempt such an experiment. For measuring the generalization error, we compute L_x(θ̂_y) − L_x(θ̂_best) from the test set as
$$L_x(\hat{\theta}_y) - L_x(\hat{\theta}_{\mathrm{best}}) \approx -\frac{1}{n_{te}} \sum_{i \in D_{te}} \bigl\{\log p_x(y_i, z_i; \hat{\theta}_y) - \log p_x(y_i, z_i; \hat{\theta}_{\mathrm{best}})\bigr\},$$
where D_te ⊂ {1, …, n} represents the test set. The assignment problem of L_x(·) mentioned in Section 6 is avoided in a similar manner.
For each case of Y = V_ℓ, ℓ = 1, …, 13, the above experiment was repeated 100 times, and the average generalization error was computed. The results are shown in Table 8. A positive value indicates that θ̂_best performed better than θ̂_y. We observe that θ̂_best is better than, or almost the same as, θ̂_y in all cases ℓ = 1, …, 13, suggesting that the proposed AIC works well for selecting a useful auxiliary variable.

8. Conclusions

We often encounter a dataset composed of various variables. If only some of the variables are of interest, then the rest of the variables can be interpreted as auxiliary variables. Auxiliary variables may be able to improve estimation accuracy of unknown parameters but they could also be harmful. Hence, it is important to select useful auxiliary variables.
In this paper, we focused on exploiting auxiliary variables in incomplete data analysis. The usefulness of auxiliary variables is measured by a risk function based on the KL divergence for complete data. We derived an information criterion which is an asymptotically unbiased estimator of the risk function except for a constant term. Moreover, we extended a result of Stone [14] to our setting and proved asymptotic equivalence between a variant of LOOCV and the proposed criteria. Since LOOCV requires an additional condition for its justification, the proposed criteria are preferable to LOOCV.
This study assumes that the observed variables differ between the training set and the test set. There are other settings, such as covariate shift [17] and transfer learning [22], where the distributions differ between the training set and the test set. It would be possible to combine these settings to construct a generalized framework. It is also possible to extend our study to take account of a missing-data mechanism. We leave these extensions as future work.

Author Contributions

Conceptualization, S.I. and H.S.; methodology, S.I. and H.S.; software, S.I.; validation, S.I. and H.S.; formal analysis, S.I. and H.S.; writing-original draft preparation S.I. and H.S.; visualization, S.I. and H.S.

Funding

This research was funded in part by JSPS KAKENHI Grant (17K12650 to S.I., 16H02789 to H.S.) and by “Funds for the Development of Human Resources in Science and Technology” under MEXT, through the “Home for Innovative Researchers and Academic Knowledge Users (HIRAKU)” consortium (to S.I.).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

Appendix A.1. Proof of Lemma 1

Proof. 
A Taylor expansion of L_x(θ) around θ = θ̄, formally regarding it as a function of β, gives
$$L_x(\hat{\theta}_b) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) + \frac{1}{2}\, \mathrm{tr}\bigl\{I_x (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\} + o_p(n^{-1}),$$
where ∇²L_x(θ̄) = I_x is used above. By taking the expectation of both sides,
$$\mathrm{E}[L_x(\hat{\theta}_b)] = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2}\, \mathrm{tr}\bigl\{I_x\, \mathrm{E}[(\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}]\bigr\} + o(n^{-1}) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2n}\, \mathrm{tr}\bigl(I_x I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}),$$
where the asymptotic variance of β̂_b in (5) is given as
$$n\, \mathrm{E}\bigl[(\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr] = I_b^{-1} J_b I_b^{-1} + o(1). \qquad (\mathrm{A1})$$

Appendix A.2. Proof of Lemma 2

Proof. 
A Taylor expansion of ℓ_y(θ) around θ = θ̄, formally regarding it as a function of β, gives
$$\ell_y(\hat{\theta}_b) = \ell_y(\bar{\theta}) + \nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) - \frac{1}{2}\, \mathrm{tr}\bigl\{I_y (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\} + o_p(n^{-1}),$$
where ∇²ℓ_y(θ̄) = −I_y + o_p(1) is used above. By taking the expectation of both sides,
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = \mathrm{E}[\ell_y(\bar{\theta})] + \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] - \frac{1}{2}\, \mathrm{E}\bigl[\mathrm{tr}\bigl\{I_y (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\}\bigr] + o(n^{-1}) = \mathrm{E}[\ell_y(\bar{\theta})] + \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (\mathrm{A2})$$
In the last expression, we used (A1) for the asymptotic variance of β̂_b. For working on the second term in (A2), we first derive an expression for β̂_b − β̄. A Taylor expansion of the score function ∇ℓ_b(β) around β = β̄ gives
$$\nabla \ell_b(\hat{\beta}_b) = \nabla \ell_b(\bar{\beta}) + \nabla^2 \ell_b(\bar{\beta})(\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1/2}) = \nabla \ell_b(\bar{\beta}) - I_b(\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1/2}),$$
where ∇²ℓ_b(β̄) = −I_b + o_p(1) is used above. By noticing ∇ℓ_b(β̂_b) = 0, we thus obtain
$$\hat{\beta}_b - \bar{\beta} = I_b^{-1} \nabla \ell_b(\bar{\beta}) + o_p(n^{-1/2}) = \frac{1}{n} \sum_{i=1}^{n} I_b^{-1} \nabla \log p_b(b_i; \bar{\beta}) + o_p(n^{-1/2}), \qquad (\mathrm{A3})$$
where E[∇ℓ_b(β̄)] = 0 and each term in the summation has mean zero, because E[∇ log p_b(b; β̄)] = ∇E[log p_b(b; β̄)] = 0. Now we are back to the second term in (A2). Using (A3), we have
$$\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}(\hat{\beta}_b - \bar{\beta}) + \bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top}(\hat{\beta}_b - \bar{\beta}) = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}(\hat{\beta}_b - \bar{\beta}) + \bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta}) + o_p(n^{-1}). \qquad (\mathrm{A4})$$
By noting E[∇ℓ_b(β̄)] = 0, the expectation of the second term in (A4) is
$$\mathrm{E}\bigl[\bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta})\bigr] = \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta})\bigr] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{E}\bigl[\nabla \log p_y(y_i; \bar{\theta})^{\top} I_b^{-1} \nabla \log p_b(b_j; \bar{\beta})\bigr] = \frac{1}{n}\, \mathrm{E}\bigl[\nabla \log p_y(y; \bar{\theta})^{\top} I_b^{-1} \nabla \log p_b(b; \bar{\beta})\bigr] = \frac{1}{n}\, \mathrm{tr}\bigl\{I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} \log p_y(y; \bar{\theta})\bigr]\bigr\} = \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr). \qquad (\mathrm{A5})$$
Combining (A4) and (A5), we have
$$\mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + o(n^{-1}). \qquad (\mathrm{A6})$$
We next show that E[∇ℓ_y(θ̄)] = −∇L_x(θ̄). Let us recall that we have assumed q_{z|y}(z | y) = p_{z|y}(z | y; θ̄) in (6), which leads to
$$\mathrm{E}\bigl[\nabla \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = \int q_y(y) \int p_{z|y}(z \mid y; \bar{\theta})\, \nabla \log p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = \int q_y(y) \int \nabla p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = \int q_y(y)\, \nabla\! \int p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = 0.$$
Therefore,
$$\nabla L_x(\bar{\theta}) = -\nabla\, \mathrm{E}[\log p_x(x; \bar{\theta})] = -\mathrm{E}[\nabla \log p_x(x; \bar{\theta})] = -\bigl\{\mathrm{E}[\nabla \log p_y(y; \bar{\theta})] + \mathrm{E}[\nabla \log p_{z|y}(z \mid y; \bar{\theta})]\bigr\} = -\mathrm{E}[\nabla \ell_y(\bar{\theta})].$$
Substituting this and (A6) into the second term in (A2), we have
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = \mathrm{E}[\ell_y(\bar{\theta})] - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (\mathrm{A7})$$
The first term on the right hand side in (A7) is
$$\mathrm{E}[\ell_y(\bar{\theta})] = \mathrm{E}[\log p_y(y; \bar{\theta})] = \mathrm{E}[\log p_x(x; \bar{\theta})] - \mathrm{E}[\log p_{z|y}(z \mid y; \bar{\theta})] = -L_x(\bar{\theta}) - C(q_x),$$
where (6) is used again in the last term. Finally, (A7) is rewritten as
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = -L_x(\bar{\theta}) - C(q_x) - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}).$$

Appendix A.3. Proof of Theorem 2

Proof. 
First recall that we have assumed q_c(c) = p_c(c; β_0), which also implies the condition (6) as q_{z|y}(z | y) = p_{z|y}(z | y; θ_0) with β̄ = β_0. Thus Theorem 1 holds. Substituting J_b = I_b and K_{b,y} = I_y in the penalty term of (8), we have
$$2\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) = 2\, \mathrm{tr}\bigl(I_b^{-1} I_y\bigr) + \mathrm{tr}\bigl((I_x - I_y) I_b^{-1}\bigr) = \mathrm{tr}\bigl(I_b^{-1} I_y\bigr) + \mathrm{tr}\bigl(I_x I_b^{-1}\bigr),$$
giving the penalty term of (10). Therefore, we only have to show (9). Noting the identity
$$\nabla^2 \log p_b(b; \beta) = \frac{1}{p_b(b; \beta)} \nabla^2 p_b(b; \beta) - \nabla \log p_b(b; \beta)\, \nabla^{\top} \log p_b(b; \beta),$$
it follows from q_b(b) = p_b(b; β_0) that
$$I_b = -\mathrm{E}\bigl[\nabla^2 \log p_b(b; \beta_0)\bigr] = -\int \nabla^2 p_b(b; \beta_0)\, db + \mathrm{E}\bigl[\nabla \log p_b(b; \beta_0)\, \nabla^{\top} \log p_b(b; \beta_0)\bigr] = -\nabla^2\! \int p_b(b; \beta_0)\, db + J_b = J_b.$$
Note that the same result can be obtained from Theorem 3.3 in White [16]. Next we show K_{b,y} = I_y. Since q_{a|y}(a | y) = p_{a|y}(a | y; β_0),
$$\int q_{a|y}(a \mid y)\, \nabla \log p_{a|y}(a \mid y; \beta_0)\, da = \int \nabla p_{a|y}(a \mid y; \beta_0)\, da = \nabla\! \int p_{a|y}(a \mid y; \beta_0)\, da = 0.$$
Therefore, we have
$$K_{b,y} = \mathrm{E}\bigl[\nabla \log p_b(b; \beta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] = \mathrm{E}\bigl[\nabla \log p_y(y; \theta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] + \mathrm{E}\bigl[\nabla \log p_{a|y}(a \mid y; \beta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] = I_y + \int q_y(y) \Bigl\{\int q_{a|y}(a \mid y)\, \nabla \log p_{a|y}(a \mid y; \beta_0)\, da\Bigr\}\, \nabla^{\top} \log p_y(y; \theta_0)\, dy = I_y.$$

Appendix A.4. Proof of Theorem 3

Proof. 
It follows from (14) and (15) that
$$g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\, \nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \mathrm{tr}\bigl\{\nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigr\}.$$
This and the assumption (16) imply that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{tr}\bigl\{\nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigr\} = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b\bigr) - \frac{1}{n}\, \mathrm{tr}\bigl\{I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr]\bigr\} + o_p(n^{-1}).$$
Under the assumption q_{z|y}(z | y) = p_{z|y}(z | y; θ̄),
$$\nabla f(y; \bar{\theta}) = -\int q_{z|y}(z \mid y)\, \nabla \log p_{z|y}(z \mid y; \bar{\theta})\, dz = -\int \nabla p_{z|y}(z \mid y; \bar{\theta})\, dz = 0. \qquad (\mathrm{A8})$$
This yields that
$$\mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr] = -\mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} \log p_y(\bar{\theta})\bigr] = -K_{b,y}.$$
Hence, by noting g(y; θ) = −log p_y(y; θ) + f(y; θ), it holds that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = -\ell_y(\hat{\theta}_b) + \frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + o_p(n^{-1}). \qquad (\mathrm{A9})$$
For evaluating the second term on the right hand side, we apply a Taylor expansion to n^{-1} Σ_{i=1}^{n} f(y_i; θ) around θ = θ̄, formally regarding it as a function of β. By noting (A8), this gives
$$\frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n} \sum_{i=1}^{n} (\hat{\beta}_b - \bar{\beta})^{\top}\, \nabla^2 f(y_i; \bar{\theta})\, (\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1}) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n}\, \mathrm{tr}\Bigl\{\sum_{i=1}^{n} \nabla^2 f(y_i; \bar{\theta})\, (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\Bigr\} + o_p(n^{-1}).$$
It follows from the law of large numbers that
$$\frac{1}{n} \sum_{i=1}^{n} \nabla^2 f(y_i; \bar{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \int q_{z|y}(z \mid y_i)\, \nabla^2 \log p_{z|y}(z \mid y_i; \bar{\theta})\, dz \xrightarrow{p} -\mathrm{E}\bigl[\nabla^2 \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = I_{z|y}.$$
Hence, (A1) indicates that
$$\frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + o_p(n^{-1}). \qquad (\mathrm{A10})$$
By substituting (A10) into (A9), we establish that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = -\ell_y(\hat{\theta}_b) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + o_p(n^{-1}).$$
Hence, the proof is complete. □

References

  1. Breiman, L.; Friedman, J.H. Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 1997, 59, 3–54. [Google Scholar] [CrossRef]
  2. Tibshirani, R.; Hinton, G. Coaching variables for regression and classification. Stat. Comput. 1998, 8, 25–33. [Google Scholar] [CrossRef]
  3. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
  4. Mercatanti, A.; Li, F.; Mealli, F. Improving inference of Gaussian mixtures using auxiliary variables. Stat. Anal. Data Min. 2015, 8, 34–48. [Google Scholar] [CrossRef] [Green Version]
  5. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  6. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  7. Shibata, R. An optimal selection of regression variables. Biometrika 1981, 68, 45–54. [Google Scholar] [CrossRef]
  8. Shibata, R. Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Stat. Math. 1983, 35, 415–423. [Google Scholar] [CrossRef]
  9. Takeuchi, K. Distribution of information statistics and criteria for adequacy of models. Math. Sci. 1976, 153, 12–18. (In Japanese) [Google Scholar]
  10. Shimodaira, H. A new criterion for selecting models from partially observed data. In Selecting Models from Data; Cheeseman, P., Oldford, R.W., Eds.; Springer: New York, NY, USA, 1994; pp. 21–29. [Google Scholar]
  11. Cavanaugh, J.E.; Shumway, R.H. An Akaike information criterion for model selection in the presence of incomplete data. J. Stat. Plan. Inference 1998, 67, 45–65. [Google Scholar] [CrossRef] [Green Version]
  12. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–38. [Google Scholar] [CrossRef]
  13. Shimodaira, H.; Maeda, H. An information criterion for model selection with missing data via complete-data divergence. Ann. Inst. Stat. Math. 2018, 70, 421–438. [Google Scholar] [CrossRef]
  14. Stone, M. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 44–47. [Google Scholar] [CrossRef]
  15. Ibrahim, J.G.; Lipsitz, S.R.; Horton, N. Using auxiliary data for parameter estimation with non-ignorably missing outcomes. J. R. Stat. Soc. Ser. C Appl. Stat. 2001, 50, 361–373. [Google Scholar] [CrossRef]
  16. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  17. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244. [Google Scholar] [CrossRef] [Green Version]
  18. Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Methodol. 1974, 36, 111–147. [Google Scholar] [CrossRef]
  19. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef] [Green Version]
  20. Yanagihara, H.; Tonda, T.; Matsumoto, C. Bias correction of cross-validation criterion based on Kullback–Leibler information under a general condition. J. Multivar. Anal. 2006, 97, 1965–1975. [Google Scholar] [CrossRef] [Green Version]
  21. Dua, D.; Karra Taniskidou, E. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 31 July 2017. [Google Scholar]
  22. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
Figure 1. Useful auxiliary variable (Case 1). The left panel plots {(y_i, a_i)}_{i=1}^{100} with labels indicating z_i. The estimated p_b(β̂_b) is shown by the contour lines. The right panel shows the histogram of {y_i}_{i=1}^{100} and three density functions: p_y(θ̂_x) (broken line), p_y(θ̂_y) (dotted line), and p_y(θ̂_b) (solid line). In Section 4.4, this useful auxiliary variable is selected by our method (Case 1 in Table 2).
Figure 2. Useless auxiliary variable (Case 2). The symbols are the same as in Figure 1. In Section 4.4, this useless auxiliary variable is NOT selected by our method (Case 2 in Table 2).
Table 1. Random variables in incomplete data analysis with auxiliary variables. B = (Y, A) is used for estimation of unknown parameters, and X = (Y, Z) is used for evaluation of candidate models.

            Observed   Latent   Complete
Primary     Y          Z        X
Auxiliary   A
All         B                   C
Table 2. Comparisons between θ̂_b and θ̂_y for predicting X, and for predicting Y.

          p_x(θ̂_b) vs. p_x(θ̂_y)    p_y(θ̂_b) vs. p_y(θ̂_y)
          AIC_{x;b} − AIC_{x;y}     AIC_{y;b} − AIC_{y;y}
Case 1    −2.67                     −0.96
Case 2    9.86                      10.37
Table 3. Expected Akaike Information Criterion (AIC) difference compared with the risk difference. The values are computed from T = 10⁴ runs of simulation, with standard errors in parentheses.

n                             100       200       500       1000      2000      5000
E[AIC_{x;b} − AIC_{x;y}]      −3.559    −3.263    −3.221    −3.197    −3.195    −3.180
                              (0.074)   (0.021)   (0.015)   (0.013)   (0.013)   (0.012)
2n{R_x(θ̂_b) − R_x(θ̂_y)}      −3.603    −3.333    −3.275    −3.208    −3.182    −3.232
                              (0.071)   (0.054)   (0.050)   (0.050)   (0.050)   (0.050)
Table 4. Useful auxiliary variable (Case 1): selection frequencies of θ̂_b and θ̂_y.

n       100     200     500     1000    2000    5000
θ̂_b    9230    9475    9649    9687    9711    9727
θ̂_y    770     525     351     313     289     273
Table 5. Useless auxiliary variable (Case 2): selection frequencies of θ̂_b and θ̂_y.

n       100     200     500     1000    2000    5000
θ̂_b    1508    212     1       0       0       0
θ̂_y    8492    9788    9999    10,000  10,000  10,000
Table 6. Useful auxiliary variable (Case 1): estimated risk functions of θ̂_b, θ̂_y, and θ̂_best, with standard errors in parentheses.

n                             100       200       500       1000      2000      5000
2n{R_x(θ̂_b) − L_x(θ_0)}      4.229     4.079     4.051     4.039     4.029     4.033
                              (0.032)   (0.030)   (0.029)   (0.028)   (0.029)   (0.028)
2n{R_x(θ̂_y) − L_x(θ_0)}      7.831     7.412     7.326     7.247     7.211     7.266
                              (0.078)   (0.061)   (0.058)   (0.058)   (0.058)   (0.058)
2n{R_x(θ̂_best) − L_x(θ_0)}   5.109     4.741     4.501     4.491     4.479     4.454
                              (0.052)   (0.045)   (0.041)   (0.042)   (0.042)   (0.041)
Table 7. Useless auxiliary variable (Case 2): estimated risk functions of θ̂_b, θ̂_y, and θ̂_best, with standard errors in parentheses.

n                             100       200       500        1000       2000       5000
2n{R_x(θ̂_b) − L_x(θ_0)}      105.527   214.659   543.685    1091.105   2182.647   5452.623
                              (0.111)   (0.167)   (0.301)    (0.474)    (0.723)    (1.151)
2n{R_x(θ̂_y) − L_x(θ_0)}      7.831     7.412     7.326      7.247      7.211      7.266
                              (0.078)   (0.061)   (0.058)    (0.058)    (0.058)    (0.058)
2n{R_x(θ̂_best) − L_x(θ_0)}   22.064    11.555    7.375      7.247      7.211      7.266
                              (0.358)   (0.304)   (0.079)    (0.058)    (0.058)    (0.058)
Table 8. Experiment average of n_te{L_x(θ̂_y) − L_x(θ̂_best)} for each case of Y = V_ℓ, ℓ = 1, …, 13. Standard errors are in parentheses.

Y                                  V_1      V_2      V_3      V_4      V_5      V_6      V_7
n_te{L_x(θ̂_y) − L_x(θ̂_best)}     0.13     −0.14    89.71    46.24    −1.76    3.34     76.54
                                   (0.08)   (0.12)   (3.82)   (4.17)   (2.52)   (1.34)   (6.09)

Y                                  V_8      V_9      V_10     V_11     V_12     V_13
n_te{L_x(θ̂_y) − L_x(θ̂_best)}     13.91    39.45    1.72     111.24   15.48    0.23
                                   (2.21)   (3.12)   (0.29)   (8.46)   (2.11)   (0.09)
