Article

An Information Criterion for Auxiliary Variable Selection in Incomplete Data Analysis

by Shinpei Imori 1,3,* and Hidetoshi Shimodaira 2,3
1 Graduate School of Science, Hiroshima University, Hiroshima 739-8526, Japan
2 Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
3 RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
* Author to whom correspondence should be addressed.
Entropy 2019, 21(3), 281; https://doi.org/10.3390/e21030281
Submission received: 21 February 2019 / Revised: 9 March 2019 / Accepted: 12 March 2019 / Published: 14 March 2019
(This article belongs to the Special Issue Information-Theoretical Methods in Data Mining)

Abstract
Statistical inference is considered for variables of interest, called primary variables, when auxiliary variables are observed along with the primary variables. We consider the setting of incomplete data analysis, where some primary variables are not observed. Utilizing a parametric model of the joint distribution of the primary and auxiliary variables, it is possible to improve the estimation of the parametric model for the primary variables when the auxiliary variables are closely related to the primary variables. However, the estimation accuracy deteriorates when the auxiliary variables are irrelevant to the primary variables. For selecting useful auxiliary variables, we formulate the problem as model selection and propose an information criterion for predicting primary variables by leveraging auxiliary variables. The proposed information criterion is an asymptotically unbiased estimator of the Kullback–Leibler divergence for complete data of primary variables under some reasonable conditions. We also clarify an asymptotic equivalence between the proposed information criterion and a variant of leave-one-out cross-validation. The performance of our method is demonstrated via a simulation study and a real data example.

1. Introduction

Auxiliary variables are often observed along with primary variables. Here, the primary variables are random variables of interest, and our purpose is to estimate their predictive distribution, i.e., a probability distribution of the primary variables in future test data, while the auxiliary variables are random variables that are observed in training data but not included in the primary variables. We assume that the auxiliary variables are not observed in the test data, or we do not use them even if they are observed in the test data. When the auxiliary variables have a close relation with the primary variables, we expect to improve the accuracy of predictive distribution of the primary variables by considering a joint modeling of the primary and auxiliary variables.
The notion of auxiliary variables has been considered in the statistics and machine learning literature. For example, the “curds and whey” method [1] and the “coaching variables” method [2] are based on a similar idea of improving the prediction accuracy of primary variables by using auxiliary variables. In multitask learning, Caruana [3] improved the generalization accuracy of a main task by exploiting extra tasks. Auxiliary variables are also considered in incomplete data analysis, i.e., when some of the primary variables are not observed; Mercatanti et al. [4] gave theoretical results showing that parameter estimation can be improved by utilizing auxiliary variables in the Gaussian mixture model (GMM).
Although auxiliary variables are expected to be useful for modeling primary variables, they can actually be harmful. As mentioned in Mercatanti et al. [4], using auxiliary variables may affect modeling results adversely because the number of parameters to be estimated increases and a candidate model of the auxiliary variables can be misspecified. Hence, it is important to select useful auxiliary variables. This is formulated as model selection by considering parametric models with auxiliary variables. In this paper, the usefulness of auxiliary variables for estimating the predictive distribution of primary variables is measured by a risk function based on the Kullback–Leibler (KL) divergence [5], which is often used for model selection. Because the KL risk function includes unknown parameters, we have to estimate it in actual use. The Akaike Information Criterion (AIC) proposed by Akaike [6] is one of the most famous criteria; it is known as an asymptotically unbiased estimator of the KL risk function. AIC is a good criterion from the perspective of prediction due to its asymptotic efficiency; see Shibata [7,8]. Takeuchi [9] proposed a modified version of AIC, called the Takeuchi Information Criterion (TIC), which relaxes an assumption used for deriving AIC, namely, correct specification of the candidate model. However, AIC and TIC are derived for primary variables without considering auxiliary variables in the setting of complete data analysis, and therefore they are suitable neither for auxiliary variable selection nor for incomplete data analysis.
Incomplete data analysis is widely used in a broad range of statistical problems by regarding some of the primary variables as latent variables that are not observed. This setting also includes complete data analysis as a special case, where all the primary variables are observed. Information criteria for incomplete data analysis have been proposed in previous studies. Shimodaira [10] developed an information criterion based on the KL divergence for complete data when the data are only partially observed. Cavanaugh and Shumway [11] modified the first term of the information criterion of Shimodaira [10] by the objective function of the EM algorithm [12]. Recently, Shimodaira and Maeda [13] proposed an information criterion, which is derived by mitigating a condition assumed in Shimodaira [10] and Cavanaugh and Shumway [11].
However, none of these previously proposed criteria is derived by taking auxiliary variables into account. Thus, we propose a new information criterion by considering not only primary variables but also auxiliary variables in the setting of incomplete data analysis. The proposed criterion is a generalization of AIC, TIC, and the criterion of Shimodaira and Maeda [13]. To the best of our knowledge, this is the first attempt to derive an information criterion by considering auxiliary variables. Moreover, we show an asymptotic equivalence between the proposed criterion and a variant of leave-one-out cross validation (LOOCV); this result is a generalization of the relationship between TIC and LOOCV [14].
Note that the term “auxiliary variables” is also used in other contexts in the literature. For example, Ibrahim et al. [15] considered using auxiliary variables in missing data analysis, which is similar to our usage in the sense that the auxiliary variables are highly correlated with the missing data. However, they use the auxiliary variables in order to avoid specifying a missing data mechanism; this goal is different from ours, because no missing data mechanism is considered in our study.
The remainder of this paper is organized as follows. Notation and the setting of this paper are introduced in Section 2. Illustrative examples of useful and useless auxiliary variables are given in Section 3. The information criterion for selecting useful auxiliary variables in incomplete data analysis is derived in Section 4, and the asymptotic equivalence between the proposed criterion and a variant of LOOCV is shown in Section 5. The performance of our method is examined via a simulation study and a real data analysis in Section 6 and Section 7, respectively. Finally, we conclude this paper in Section 8. All proofs are shown in Appendix A.

2. Preliminaries

2.1. Incomplete Data Analysis for Primary Variables

First we explain a setting of incomplete data analysis for primary variables in accordance with Shimodaira and Maeda [13]. Let X denote a vector of primary variables, which consists of two parts as X = (Y, Z), where Y denotes the observed part and Z denotes the unobserved latent part. This setting reduces to complete data analysis of X = Y when Z is empty. We write the true density function of X as q_x(x) = q_x(y, z) and a candidate parametric model of the true density as p_x(x; θ) = p_x(y, z; θ), where θ ∈ Θ ⊂ ℝ^d is an unknown parameter vector and Θ is its parameter space. We assume that x = (y, z) ∈ 𝒴 × 𝒵 for all density functions, where 𝒴 and 𝒵 are the domains of Y and Z, respectively. Thus the marginal densities of the observed part Y are obtained by q_y(y) = ∫ q_x(y, z) dz and p_y(y; θ) = ∫ p_x(y, z; θ) dz. For denoting densities, we will omit random variables such as q_y and p_y(θ). We assume that θ is identifiable with respect to p_y(θ).
In this paper, we consider only a simple setting of i.i.d. random variables of sample size n. Let x_i = (y_i, z_i), i = 1, …, n, be independent realizations of X, where we only observe y_1, …, y_n and we cannot see the values of z_1, …, z_n. We estimate θ from the observed training data y_1, …, y_n. Then the maximum likelihood estimate (MLE) of θ is given by
$$\hat{\theta}_y = \arg\max_{\theta \in \Theta} \ell_y(\theta) \equiv \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log p_y(y_i; \theta),$$
where ℓ_y(θ) denotes the log-likelihood function (divided by n) of θ with respect to y_1, …, y_n.
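To make the definition concrete, the following minimal Python sketch maximizes the observed-data log-likelihood numerically when the latent variable Z has a finite support, so that p_y(y; θ) = Σ_z p_x(y, z; θ). The callable log_px and the list z_support are hypothetical user-supplied placeholders; this is only an illustration under those assumptions, not the computational procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def mle_observed(y, log_px, z_support, theta_init):
    """Maximize ell_y(theta) = (1/n) sum_i log sum_z exp(log_px(y_i, z, theta)).

    y          : observed values y_1, ..., y_n
    log_px     : hypothetical callable returning log p_x(y, z; theta)
    z_support  : finite support of the latent variable, e.g. [0, 1]
    theta_init : starting value for the optimizer
    """
    y = np.asarray(y)

    def neg_loglik(theta):
        # observed-data log-likelihood, marginalizing Z by summation
        ll = sum(logsumexp([log_px(yi, z, theta) for z in z_support]) for yi in y)
        return -ll / len(y)

    res = minimize(neg_loglik, theta_init, method="Nelder-Mead")
    return res.x  # theta_hat_y
```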
If we were only interested in Y, we would consider the plug-in predictive distribution p y ( θ ^ y ) by substituting θ ^ y into p y ( θ ) . However, we are interested in the whole primary variable X = ( Y , Z ) and its density q x . We thus consider p x ( θ ^ y ) by substituting θ ^ y into p x ( θ ) , and evaluate the MLE by comparing p x ( θ ^ y ) with q x . For this purpose, Shimodaira and Maeda [13] derived an information criterion as an asymptotically unbiased estimator of the KL risk function which measures how well p x ( θ ^ y ) approximates q x .

2.2. Statistical Analysis with Auxiliary Variables

Next, we extend the setting to incomplete data analysis with auxiliary variables. Let A denote a vector of auxiliary variables. In addition to Y, we observe A in the training data, but we are not interested in A. For convenience, we introduce a vector of observable variables B = (Y, A) and a vector of all variables C = (Y, Z, A), as summarized in Table 1. Now c_i = (y_i, z_i, a_i), i = 1, …, n, are independent realizations of C, and we estimate θ from the observed training data b_i = (y_i, a_i), i = 1, …, n. Let θ̂_b be the MLE of θ obtained by using A in addition to Y. Since we are only interested in the primary variables, we consider the plug-in predictive distribution p_x(θ̂_b) by substituting θ̂_b into p_x(θ), and evaluate the MLE by comparing p_x(θ̂_b) with q_x.
In order to define the MLE θ̂_b, let us clarify a candidate parametric model with auxiliary variables. We write the true density function of C as q_c(c) = q_c(y, z, a) and a candidate parametric model of the true density as p_c(c; β) = p_c(y, z, a; β), where β = (θ⊤, φ⊤)⊤ ∈ ℬ ⊂ ℝ^{d+f} is an unknown parameter vector with nuisance parameter φ ∈ ℝ^f, and ℬ is its parameter space. We assume that c = (y, z, a) ∈ 𝒴 × 𝒵 × 𝒜 for all density functions, where 𝒜 is the domain of A. We also assume that β is identifiable with respect to p_b(y, a; β) = ∫ p_c(y, z, a; β) dz. Let us redefine p_x(θ) as p_x(y, z; θ) = ∫ p_c(y, z, a; β) da and the parameter space of θ as
$$\Theta = \left\{ \theta \,\middle|\, \begin{pmatrix} \theta \\ \varphi \end{pmatrix} \in \mathcal{B} \right\}.$$
Then, θ̂_b is obtained from the MLE of β given by
$$\hat{\beta}_b = \begin{pmatrix} \hat{\theta}_b \\ \hat{\varphi}_b \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \ell_b(\beta) \equiv \arg\max_{\beta \in \mathcal{B}} \frac{1}{n} \sum_{i=1}^{n} \log p_b(b_i; \beta),$$
where ℓ_b(β) denotes the log-likelihood function (divided by n) of β with respect to b_1, …, b_n.
Finally, we introduce a general notation for density functions. For a random variable, say R, we write the true density function as q r ( r ) and a candidate parametric model of q r as p r ( r ; θ ) or p r ( r ; β ) . For random variables R and S, we write the true conditional density function of R given S = s as q r | s ( r | s ) and its corresponding model as p r | s ( r | s ; θ ) or p r | s ( r | s ; β ) . For example, a candidate model of C can be decomposed as
$$p_c(y, z, a; \beta) = p_x(y, z; \theta)\, p_{a|x}(a \mid y, z; \beta).$$

2.3. Comparing the Two Estimators

We have thus far obtained the two MLEs of θ, namely θ̂_y and θ̂_b, and their corresponding predictive distributions p_x(θ̂_y) and p_x(θ̂_b), respectively. We would like to determine which of the two predictive distributions approximates q_x better than the other. The approximation error of p_x(θ) is measured by the KL divergence from q_x to p_x(θ) defined as
$$D_x(q_x; p_x(\theta)) = -\int q_x(x) \log p_x(x; \theta)\, dx + \int q_x(x) \log q_x(x)\, dx.$$
Since the last term on the right hand side does not depend on p_x(θ), we ignore it for computing the loss function of p_x(θ) defined by
$$L_x(\theta) = -\int q_x(x) \log p_x(x; \theta)\, dx.$$
Let θ̂ be an estimator of θ. The risk (or expected loss) function of p_x(θ̂) is defined by
$$R_x(\hat{\theta}) = \mathrm{E}\bigl[L_x(\hat{\theta})\bigr], \qquad (3)$$
where we take the expectation by considering θ̂ as a random variable. Note that θ̂ in the notation of R_x(θ̂) indicates the procedure for computing θ̂ rather than a particular value of θ̂. R_x(θ̂) measures how well p_x(θ̂) approximates q_x on average in the long run.
For comparing the two MLEs, we define R_x(θ̂_y) and R_x(θ̂_b) by considering that θ̂_y and θ̂_b are functions of the independent random variables Y_1, …, Y_n and B_1, …, B_n, respectively, where B_i = (Y_i, A_i) has the same distribution as B for all i = 1, …, n. The estimator θ̂_b is better than θ̂_y when R_x(θ̂_b) < R_x(θ̂_y), that is, when the auxiliary variable A helps the statistical inference on q_x. On the other hand, A is harmful when R_x(θ̂_b) > R_x(θ̂_y). Although we focus only on the comparison between Y and B = (Y, A) in this paper, if there are multiple auxiliary variables (and their combinations) A_1, A_2, …, then we may compare R_x(θ̂_{(y,a_1)}), R_x(θ̂_{(y,a_2)}), …, to determine good auxiliary variables. Of course, the risk functions cannot be calculated in reality because they depend on the unknown true distribution. Thus, we derive a new information criterion as an estimator of the risk function in our setting. Since an asymptotically unbiased estimator of R_x(θ̂_y) has already been derived in Shimodaira and Maeda [13], we will only derive an asymptotically unbiased estimator of R_x(θ̂_b).
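When the true distribution is under our control, as in the simulation study of Section 6, the two risks can be approximated by Monte Carlo. The sketch below shows one way to organize such a comparison; the sampler sample_qc, the fitting routines fit_theta_y and fit_theta_b, and the model log-density log_px are hypothetical user-supplied callables, so this is an illustration of the definition rather than part of the proposed method.

```python
import numpy as np

def loss_x(theta, log_px, x_big):
    """Monte Carlo approximation of L_x(theta) = -E_{q_x}[log p_x(X; theta)]
    from a large evaluation sample x_big drawn from the true q_x."""
    return -np.mean([log_px(x, theta) for x in x_big])

def risk_difference(T, n, sample_qc, fit_theta_y, fit_theta_b, log_px, x_big):
    """Approximate R_x(theta_hat_b) - R_x(theta_hat_y) over T simulated training sets.
    A negative value indicates that the auxiliary variable A is helpful."""
    diffs = []
    for _ in range(T):
        y, z, a = sample_qc(n)             # training data from the true q_c; z is discarded
        theta_b = fit_theta_b(y, a)        # MLE using Y and A
        theta_y = fit_theta_y(y)           # MLE using Y only
        diffs.append(loss_x(theta_b, log_px, x_big) - loss_x(theta_y, log_px, x_big))
    return float(np.mean(diffs))
```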

3. An Illustrative Example with Auxiliary Variables

3.1. Model Setting

In this section, we demonstrate parameter estimation using auxiliary variables in the Gaussian mixture model (GMM), which can be formulated as incomplete data analysis. Let us consider a two-component GMM; observed values are generated from one of two Gaussian distributions, where the assigned labels are missing. The observed data and missing labels are realizations of Y and Z, respectively. We estimate a predictive distribution of X = (Y, Z) from the observation of Y, and we attempt to improve it by utilizing A in addition to Y. The true density function of the primary variables X = (Y, Z) ∈ ℝ × {0, 1} is given as
$$q_{y|z}(y \mid z) = z\, N(y; 1.2, 0.7) + (1 - z)\, N(y; -1.2, 0.7), \qquad q_z(z) = 0.6 z + 0.4 (1 - z),$$
where N ( · ; μ , σ 2 ) denotes the density function of N ( μ , σ 2 ) , i.e., the normal distribution with mean μ and variance σ 2 . We consider the following two cases for the true conditional distribution of auxiliary variable A given X = x :
Case 1:
$$q_{a|x}(a \mid y, z) = q_{a|z}(a \mid z) = z\, N(a; 1.8, 0.49) + (1 - z)\, N(a; -1.8, 0.49).$$
Case 2:
$$q_{a|x}(a \mid y, z) = q_a(a) = 0.6\, N(a; 1.8, 0.49) + 0.4\, N(a; -1.8, 0.49).$$
The random variables X and A are not independent in Case 1, whereas they are independent in Case 2. Hence, A will contribute to estimating θ in Case 1. On the other hand, in Case 2, A cannot be useful, and A becomes just noise if we estimate θ from Y and A.
In both cases, we use the following two-component GMM as a candidate model of q c :
$$p_{b|z}(y, a \mid z; \beta) = z\, N_2\bigl((y, a)^{\top}; \mu_1, \Sigma\bigr) + (1 - z)\, N_2\bigl((y, a)^{\top}; \mu_2, \Sigma\bigr), \qquad p_z(z; \theta) = \pi_1 z + (1 - \pi_1)(1 - z), \qquad (4)$$
where N 2 ( · ; μ i , Σ ) denotes the density function of bivariate normal distribution N 2 ( μ i , Σ ) , i = 1 , 2 , and the parameters are
$$\mu_1 = \begin{pmatrix} \mu_{1y} \\ \mu_{1a} \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} \mu_{2y} \\ \mu_{2a} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_y^2 & \sigma_{ya} \\ \sigma_{ya} & \sigma_a^2 \end{pmatrix}.$$
Therefore, β = (θ⊤, φ⊤)⊤ with θ = (π_1, μ_{1y}, μ_{2y}, σ_y²)⊤ and φ = (μ_{1a}, μ_{2a}, σ_a², σ_{ya})⊤. The true parameters of θ and φ for Case 1 are given by θ_0 = (0.6, 1.2, −1.2, 0.7)⊤ and φ_0 = (1.8, −1.8, 0.49, 0)⊤, respectively. By considering the joint density function p_c(y, z, a; β) = p_{b|z}(y, a | z; β) p_z(z; θ), this candidate model correctly specifies the true density function q_c(y, z, a) = q_{a|x}(a | y, z) q_{y|z}(y | z) q_z(z) in Case 1. On the other hand, the model is misspecified in Case 2, and there is no true parameter value.
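The two data-generating mechanisms can be simulated directly. The following Python sketch draws (y_i, z_i, a_i) from the true model above; the function name and the use of numpy are our own choices for illustration, and the standard deviations passed to the sampler are the square roots of the variances 0.7 and 0.49.

```python
import numpy as np

def generate(n, case, rng):
    """Draw n realizations (y, z, a) from the true model of Section 3.1.
    case=1: A depends on Z (useful auxiliary variable);
    case=2: A is independent of (Y, Z) (useless auxiliary variable)."""
    z = rng.binomial(1, 0.6, size=n)                           # q_z: P(Z = 1) = 0.6
    y = np.where(z == 1,
                 rng.normal(1.2, np.sqrt(0.7), n),
                 rng.normal(-1.2, np.sqrt(0.7), n))            # q_{y|z}
    label = z if case == 1 else rng.binomial(1, 0.6, size=n)   # Case 2: label independent of (Y, Z)
    a = np.where(label == 1,
                 rng.normal(1.8, 0.7, n),                      # sd 0.7 = sqrt(0.49)
                 rng.normal(-1.8, 0.7, n))
    return y, z, a

rng = np.random.default_rng(0)
y, z, a1 = generate(100, case=1, rng=rng)   # the sample size used in the illustration below
```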

3.2. Estimation Results

For illustrating the impact of auxiliary variables on parameter estimation in each case, we generated a typical dataset c 1 , , c n with sample size n = 100 from q c , which is actually picked from 10,000 datasets generated in the simulation study of Section 6, and details of how to select the typical dataset are also shown there. For each case, we computed the three MLEs θ ^ y , θ ^ b , and θ ^ x , where θ ^ x is the MLE of θ calculated by using complete data x 1 , , x n as if labels z 1 , , z n were available.
The result of Case 1 is shown in Figure 1, where A is beneficial for estimating θ. In the left panel, the two clusters are well separated, which makes parameter estimation stable. The estimated p_b(β̂_b) captures the structure of the two clusters corresponding to the labels z_i = 0 and z_i = 1, showing that p_c(β̂_b) is estimated reasonably well, and thus p_x(θ̂_b) is a good approximation of q_x. Looking at the right panel, we also observe that p_y(θ̂_b) is better than p_y(θ̂_y) for approximating p_y(θ̂_x), suggesting that the auxiliary variable is useful for recovering the lost information of the missing data. In fact, the three MLEs are calculated as follows: θ̂_y = (0.671, 1.143, −1.324, 0.678)⊤, θ̂_b = (0.613, 1.228, −1.093, 0.744)⊤, and θ̂_x = (0.620, 1.233, −1.141, 0.695)⊤. By comparing ‖θ̂_b − θ̂_x‖ = 0.069 with ‖θ̂_y − θ̂_x‖ = 0.212, we can see that θ̂_b is better than θ̂_y for predicting θ̂_x without looking at the latent variable. All these observations indicate that the parameter estimation of θ is improved by using A in Case 1.
The result of Case 2 is shown in Figure 2, where A is harmful for estimating θ. For a fair comparison, exactly the same values of {(y_i, z_i)}_{i=1}^{100} are used in both cases. Thus, θ̂_y and θ̂_x have the same values as in Case 1, whereas θ̂_b takes a different value, θ̂_b = (0.581, 0.403, 0.232, 2.015)⊤. By comparing ‖θ̂_b − θ̂_x‖ = 2.078 with ‖θ̂_y − θ̂_x‖ = 0.212, we can see that θ̂_b is worse than θ̂_y for predicting θ̂_x. This is also seen in Figure 2. In the left panel, the estimated p_b(β̂_b) captures some structure of two clusters, but they do not correspond to the labels z_i = 0 and z_i = 1. As a result, p_y(θ̂_b) becomes a very poor approximation of p_y(θ̂_x) in the right panel, indicating that the parameter estimation of θ is actually hindered by using A in Case 2.
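For readers who want to reproduce estimates of this kind, the sketch below computes the three MLEs for the candidate model (4): θ̂_y from Y alone, θ̂_b from (Y, A), and the complete-data θ̂_x from (Y, Z), using scikit-learn's EM implementation of Gaussian mixtures with a shared ("tied") covariance. This is our own illustrative implementation, not the authors' code; EM may converge to a local maximum, and which fitted component corresponds to Z = 1 is arbitrary (label switching, discussed in Section 6.1).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def theta_from_gmm(gm):
    """Extract theta = (pi_1, mu_1y, mu_2y, sigma_y^2) from a fitted 2-component GMM
    whose first coordinate is Y (works for 1-D and 2-D fits with tied covariance)."""
    pi1 = gm.weights_[0]
    mu1y, mu2y = gm.means_[0, 0], gm.means_[1, 0]
    sigma_y2 = gm.covariances_[0, 0]       # tied covariance has shape (n_features, n_features)
    return np.array([pi1, mu1y, mu2y, sigma_y2])

def fit_theta_y(y):
    """theta_hat_y: MLE from the observed Y only (y is a 1-D numpy array)."""
    gm = GaussianMixture(2, covariance_type="tied", random_state=0).fit(y.reshape(-1, 1))
    return theta_from_gmm(gm)

def fit_theta_b(y, a):
    """theta_hat_b: MLE from B = (Y, A)."""
    gm = GaussianMixture(2, covariance_type="tied", random_state=0).fit(np.column_stack([y, a]))
    return theta_from_gmm(gm)

def fit_theta_x(y, z):
    """theta_hat_x: complete-data MLE with known labels (closed form, pooled variance)."""
    pi1 = np.mean(z == 1)
    mu1, mu2 = y[z == 1].mean(), y[z == 0].mean()
    s2 = (np.sum((y[z == 1] - mu1) ** 2) + np.sum((y[z == 0] - mu2) ** 2)) / len(y)
    return np.array([pi1, mu1, mu2, s2])
```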
These examples suggest that usefulness of auxiliary variables depends strongly on the true distribution and a candidate model. Hence, it is important to select useful auxiliary variables from observed data.

4. Information Criterion

4.1. Asymptotic Expansion of the Risk Function

In this section, we derive a new information criterion as an asymptotically unbiased estimator of the risk function R_x(θ̂_b) defined in (3). We start from a general framework of misspecification, i.e., without assuming that candidate models are correctly specified, and later we give specific assumptions. Let β̄ be the optimal parameter value with respect to the KL divergence from q_b to p_b(β), that is,
$$\bar{\beta} = \begin{pmatrix} \bar{\theta} \\ \bar{\varphi} \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \int q_b(b) \log p_b(b; \beta)\, db.$$
If the candidate model is correctly specified, i.e., there exists β_0 = (θ_0⊤, φ_0⊤)⊤ such that q_b = p_b(β_0), then β̄ = β_0 as well as θ̄ = θ_0.
In this paper, we assume the regularity conditions A1 to A6 of White [16] for q_b and p_b(β) so that the MLE β̂_b has consistency and asymptotic normality. In particular, β̄ is determined uniquely (i.e., identifiable) and is interior to ℬ. We assume that I_b and J_b defined below are nonsingular in a neighbourhood of β̄. Then White [16] showed the asymptotic normality as n → ∞,
$$\sqrt{n}\,(\hat{\beta}_b - \bar{\beta}) \xrightarrow{d} N_{d+f}\bigl(0,\; I_b^{-1} J_b I_b^{-1}\bigr), \qquad (5)$$
where I_b and J_b are (d + f) × (d + f) matrices defined, using ∇ = ∂/∂β, ∇⊤ = ∂/∂β⊤, and ∇² = ∂²/∂β∂β⊤, as
$$I_b = -\mathrm{E}\bigl[\nabla^2 \log p_b(b; \bar{\beta})\bigr], \qquad J_b = \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} \log p_b(b; \bar{\beta})\bigr].$$
Note that we write derivatives in abbreviated form; e.g., ∇² log p_b(b; β̄) means ∇² log p_b(b; β)|_{β=β̄}, and so on. In addition, we allow interchange of integrals and derivatives rather formally when working with models, although we actually need conditions on the models such as those in White [16]. Moreover, condition A7 of White [16] is assumed in order to establish I_b = J_b when considering the situation that the candidate model is correctly specified. We assume the above conditions throughout the paper without stating them explicitly.
Let us define three (d + f) × (d + f) matrices as
$$I_x = -\mathrm{E}\bigl[\nabla^2 \log p_x(x; \bar{\theta})\bigr], \qquad I_y = -\mathrm{E}\bigl[\nabla^2 \log p_y(y; \bar{\theta})\bigr], \qquad I_{z|y} = -\mathrm{E}\bigl[\nabla^2 \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = I_x - I_y,$$
which will be used in the lemmas below. Since the derivatives of log p_x(x; θ) and log p_y(y; θ) with respect to φ are zero, these matrices become singular when f > 0, but this is not a problem in our calculation. The following lemma shows that the dominant term of R_x(θ̂_b) is L_x(θ̄) and the remainder terms are of order O(n⁻¹), by noting that ∇L_x(θ̄) = O(1) and E[β̂_b − β̄] = O(n⁻¹) in general. The proof is given in Appendix A.1.
Lemma 1.
The risk function R_x(θ̂_b) is expanded asymptotically as
$$R_x(\hat{\theta}_b) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2n}\, \mathrm{tr}\bigl(I_x I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}).$$
Just as a remark, the term ∇L_x(θ̄)⊤ E[β̂_b − β̄] = O(n⁻¹) above does not appear in the derivation of AIC or TIC, where B = X and thus ∇L_x(θ̄) = 0. This term appears when the loss function for evaluation and that for estimation differ, for example, in the derivation of the information criterion under covariate shift; see the term K_w[1]b_w in Equation (4.1) of Shimodaira [17].

4.2. Estimating the Risk Function

For deriving an estimator of R_x(θ̂_b), we introduce an additional condition. Let us assume that the candidate model is correctly specified for the latent part:
$$q_{z|y}(z \mid y) = p_{z|y}(z \mid y; \bar{\theta}). \qquad (6)$$
This is the same condition as Equation (14) of Shimodaira and Maeda [13] except that θ̄ is replaced by
$$\bar{\theta}_y = \arg\max_{\theta \in \Theta} \int q_y(y) \log p_y(y; \theta)\, dy.$$
Since Z is missing completely in our setting, we need such a condition to proceed further. Although no method can detect misspecification of p_{z|y} if p_b is correctly specified, it is often the case that misspecification of p_{z|y} leads to that of p_b, and thus it is detected indirectly, as in Case 2 of Section 3.
Note that the symbol θ̄ in our notation should properly be θ̄_b, although we use θ̄ for simplicity; there is also θ̄_x, defined similarly from p_x(x; θ). They all differ from each other with differences of order O(1) in general, but θ̄ = θ̄_y = θ̄_x = θ_0 when p_c(β) is correctly specified as q_c = p_c(β_0).
Now we give the asymptotic expansion of E[ℓ_y(θ̂_b)], which shows that −ℓ_y(θ̂_b) can be used as an estimator of L_x(θ̄), but the asymptotic bias is of order O(n⁻¹).
Lemma 2.
Assume the condition (6). Then, the expectation of the estimated log-likelihood ℓ_y(θ̂_b) can be expanded as
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = -L_x(\bar{\theta}) - C(q_x) - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}),$$
where K_{b,y} = E[∇ log p_b(β̄) ∇⊤ log p_y(θ̄)] and C(q_x) = ∫ q_x(x) log q_{z|y}(z | y) dx.
The proof of Lemma 2 is given in Appendix A.2. By eliminating L x ( θ ¯ ) from the two expressions in Lemma 1 and Lemma 2, and rearranging the formula, we get the following lemma, which plays a central role in deriving our information criterion.
Lemma 3.
Assume the condition (6). Then, an expansion of the risk function R_x(θ̂_b) is given by
$$R_x(\hat{\theta}_b) = -\mathrm{E}[\ell_y(\hat{\theta}_b)] - C(q_x) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (7)$$
We can ignore C ( q x ) for model selection, because it is a constant term which does not depend on the candidate model. Thus, finally, we define an information criterion from the right hand side of (7). The following theorem is an immediate consequence of Lemma 3.
Theorem 1.
Assume the condition (6). Let us define an information criterion as
$$\widehat{\mathrm{risk}}_{x;b} = -2n\, \ell_y(\hat{\theta}_b) + 2\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr). \qquad (8)$$
Then this criterion is an asymptotically unbiased estimator of 2n R_x(θ̂_b), ignoring the constant term C(q_x):
$$\mathrm{E}\bigl[\widehat{\mathrm{risk}}_{x;b}\bigr] = 2n\, R_x(\hat{\theta}_b) + 2n\, C(q_x) + o(1).$$
Note that the subscript x;b of risk̂_{x;b} is defined in accordance with Shimodaira and Maeda [13]: the former x and the latter b indicate the random variables used for evaluation and for estimation, respectively. This criterion is an extension of TIC because, when X = B = Y, risk̂_{x;b} coincides with the TIC of Takeuchi [9], defined as follows:
$$\mathrm{TIC} = -2n\, \ell_y(\hat{\theta}_y) + 2\, \mathrm{tr}\bigl(I_y^{-1} J_y\bigr).$$

4.3. Akaike Information Criteria for Auxiliary Variable Selection

In actual use, risk̂_{x;b} may be too complicated. Thus, we derive a simpler information criterion by assuming correctness of the candidate model, as in the derivation of AIC.
Theorem 2.
Suppose p_c(β) is correctly specified so that q_c = p_c(β_0) for some β_0 ∈ ℬ. Then, we have
$$J_b = I_b, \qquad K_{b,y} = I_y, \qquad (9)$$
and thus risk̂_{x;b} is rewritten as
$$\mathrm{AIC}_{x;b} = -2n\, \ell_y(\hat{\theta}_b) + \mathrm{tr}\bigl(I_x I_b^{-1}\bigr) + \mathrm{tr}\bigl(I_y I_b^{-1}\bigr). \qquad (10)$$
This criterion is an asymptotically unbiased estimator of 2n R_x(θ̂_b), ignoring the constant term C(q_x):
$$\mathrm{E}\bigl[\mathrm{AIC}_{x;b}\bigr] = 2n\, R_x(\hat{\theta}_b) + 2n\, C(q_x) + o(1).$$
The proof is given in Appendix A.3. In practical situations, I_x, I_y, and I_b are replaced by their consistent estimators.
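As one possible way to carry out this plug-in step, the sketch below estimates I_b and I_y by sample averages of numerical Hessians of the candidate log-densities at β̂_b, and estimates I_x = I_y + I_{z|y} by taking the expectation over Z given Y under the fitted model, which is in the spirit of condition (6). The log-density callables and the finite latent support are hypothetical user inputs; this is an illustrative recipe, not the authors' implementation.

```python
import numpy as np

def num_hessian(f, beta, eps=1e-4):
    """Central-difference Hessian of a scalar function f at the point beta."""
    beta = np.asarray(beta, dtype=float)
    d = len(beta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(beta + ei + ej) - f(beta + ei - ej)
                       - f(beta - ei + ej) + f(beta - ei - ej)) / (4 * eps ** 2)
    return H

def aic_x_b(beta_hat, y, b, log_py, log_pb, log_pzy, z_support):
    """AIC_{x;b} = -2n l_y(theta_hat_b) + tr(I_x I_b^{-1}) + tr(I_y I_b^{-1}),
    with I_b, I_y, I_{z|y} replaced by plug-in estimates at beta_hat.

    log_py(y_i, beta), log_pb(b_i, beta), log_pzy(z, y_i, beta) are hypothetical
    log-densities of the candidate model, viewed as functions of the full beta;
    z_support lists the latent values, e.g. [0, 1] for the two-component GMM."""
    I_b = -np.mean([num_hessian(lambda be: log_pb(bi, be), beta_hat) for bi in b], axis=0)
    I_y = -np.mean([num_hessian(lambda be: log_py(yi, be), beta_hat) for yi in y], axis=0)
    I_zy = -np.mean([sum(np.exp(log_pzy(z, yi, beta_hat))
                         * num_hessian(lambda be: log_pzy(z, yi, be), beta_hat)
                         for z in z_support)
                     for yi in y], axis=0)
    I_x = I_y + I_zy
    I_b_inv = np.linalg.inv(I_b)
    loglik_y = sum(log_py(yi, beta_hat) for yi in y)    # equals n * l_y(theta_hat_b)
    return -2.0 * loglik_y + np.trace(I_x @ I_b_inv) + np.trace(I_y @ I_b_inv)
```

Analytic first and second derivatives can of course replace the finite differences when they are available.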
The newly obtained criterion AIC x ; b is a generalization of AIC and some of its variants. If θ is estimated by θ ^ y instead of θ ^ b , we simply let B = Y in the expression of AIC x ; b so that we get AIC x ; y proposed by Shimodaira and Maeda [13]:
$$\mathrm{AIC}_{x;y} = -2n\, \ell_y(\hat{\theta}_y) + \mathrm{tr}\bigl(I_x I_y^{-1}\bigr) + d.$$
Note that if B = Y , I y is not singular because β = θ . On the other hand, if there is no latent part, we simply let X = Y in the expression of AIC x ; b so that we get
$$\mathrm{AIC}_{y;b} = -2n\, \ell_y(\hat{\theta}_b) + 2\, \mathrm{tr}\bigl(I_y I_b^{-1}\bigr).$$
This can be used to select useful auxiliary variables in complete data analysis. Moreover, if X = Y = B , AIC x ; b reduces to the original AIC proposed by Akaike [6]:
$$\mathrm{AIC}_{y;y} = -2n\, \ell_y(\hat{\theta}_y) + 2d.$$
It is worth mentioning that tr ( I z | y I b 1 ) is interpreted as the additional penalty for the latent part:
$$\mathrm{AIC}_{x;b} - \mathrm{AIC}_{y;b} = \mathrm{tr}\bigl(I_x I_b^{-1}\bigr) - \mathrm{tr}\bigl(I_y I_b^{-1}\bigr) = \mathrm{tr}\bigl(I_{z|y} I_b^{-1}\bigr) \ge 0,$$
which is also mentioned in Equation (1) of Shimodaira and Maeda [13] for the case of B = Y .

4.4. The Illustrative Example (Cont.)

Let us return to the problem of determining whether to use the auxiliary variables, that is, the comparison between p_x(θ̂_b) and p_x(θ̂_y). By comparing AIC_{x;b} with AIC_{x;y}, we can determine whether the vector of auxiliary variables A is useful or useless: only when AIC_{x;b} < AIC_{x;y} do we conclude that A is useful for estimating θ to predict X.
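In code, the rule is a one-line comparison. The tiny sketch below (with hypothetical AIC values in the docstring) makes the convention explicit, namely that smaller is better, and extends directly to several candidate auxiliary sets as used in Section 7.

```python
def select_by_aic(aic_values):
    """Return the label of the candidate with the smallest AIC value.
    Example with hypothetical numbers:
        select_by_aic({"Y only": 212.9, "Y and A": 210.3})  ->  "Y and A"
    i.e., the auxiliary variable is judged useful when AIC_{x;b} < AIC_{x;y}."""
    return min(aic_values, key=aic_values.get)
```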
Let us apply this procedure to the illustrative example in Section 3. The generalized AICs are computed for the two cases of the typical dataset, and the results are shown in Table 2. The value of AIC_{x;b} − AIC_{x;y} is negative for Case 1, concluding that the auxiliary variable is useful, and it is positive for Case 2, concluding that the auxiliary variable is useless. According to the AIC values, therefore, we use the auxiliary variable of Case 1 but do not use the auxiliary variable of Case 2. This decision agrees with the observations of Figure 1 and Figure 2 in Section 3.2. In fact, the decision is correct, because the value of R_x(θ̂_b) − R_x(θ̂_y) is negative for Case 1 and positive for Case 2, as will be seen in the simulation study of Section 6.2.
We can also examine the usefulness of the auxiliary variable for predicting Y instead of X, that is, the comparison between p_y(θ̂_b) and p_y(θ̂_y). By comparing AIC_{y;b} with AIC_{y;y}, we can determine whether A is useful or useless for predicting Y. Looking at the value of AIC_{y;b} − AIC_{y;y} in Table 2, we reach the same decision as that for X.

5. Leave-One-Out Cross Validation

Variable selection by cross-validatory (CV) choice [18] is often applied in real data analysis due to its simplicity, although its computational burden is larger than that of information criteria; see Arlot and Celisse [19] for a review of cross-validation methods. As shown in Stone [14], leave-one-out cross validation (LOOCV) is asymptotically equivalent to TIC. Because LOOCV does not require calculation of the information matrices appearing in TIC, LOOCV is easier to use than TIC. There is also literature on improving LOOCV, such as Yanagihara et al. [20], which modifies LOOCV to reduce its bias by considering maximum weighted log-likelihood estimation. However, we focus on the result of Stone [14] and extend it to our setting.
In incomplete data analysis, LOOCV cannot be used directly because the loss function with respect to the complete data includes latent variables. Thus, we transform the loss function as follows:
$$L_x(\theta) = \int q_y(y)\, g(y; \theta)\, dy,$$
where g(y; θ) = −log p_y(y; θ) + f(y; θ) and
$$f(y; \theta) = -\int q_{z|y}(z \mid y) \log p_{z|y}(z \mid y; \theta)\, dz.$$
Note that f(y; θ) = 0 when X = Y. Using the function g(y; θ), we then obtain the following LOOCV estimator of the risk function R_x(θ̂_b):
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr),$$
where θ̂_b^{(i)} is the leave-one-out estimate of θ defined through
$$\hat{\beta}_b^{(i)} = \begin{pmatrix} \hat{\theta}_b^{(i)} \\ \hat{\varphi}_b^{(i)} \end{pmatrix} = \arg\max_{\beta \in \mathcal{B}} \frac{1}{n} \sum_{j \ne i}^{n} \log p_b(b_j; \beta) = \arg\max_{\beta \in \mathcal{B}} \Bigl\{ \ell_b(\beta) - \frac{1}{n} \log p_b(b_i; \beta) \Bigr\}.$$
We will show below in this section that L_x^{cv}(θ̂_b) is asymptotically equivalent to risk̂_{x;b}. For implementing the LOOCV procedure with latent variables, however, we have to estimate q_{z|y}(z | y) by p_{z|y}(z | y; θ̂_b) in f(y; θ). This introduces a bias into L_x^{cv}(θ̂_b), and hence information criteria are preferable to LOOCV in incomplete data analysis.
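A direct, if computationally heavy, implementation of this estimator refits the model n times. The sketch below does exactly that; the fitting routine and the model log-densities are hypothetical user-supplied callables, and q_{z|y} is replaced by the plug-in p_{z|y}(· | y; θ̂_b) computed from the full data, which is the source of the bias mentioned above.

```python
import numpy as np

def loocv_risk(b, y, fit, log_py, log_pzy, z_support, beta_full):
    """LOOCV estimate L_x^cv(theta_hat_b) = (1/n) sum_i g(y_i; theta_hat_b^(i)),
    where g(y; theta) = -log p_y(y; theta) + f(y; theta).

    b, y       : lists of observed b_i = (y_i, a_i) and y_i
    fit        : hypothetical routine returning the MLE from a data subset
    log_py, log_pzy : log-densities of the candidate model (functions of beta)
    z_support  : finite support of the latent variable
    beta_full  : full-data estimate, used as the plug-in for q_{z|y} in f(y; theta)
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        beta_i = fit([b[j] for j in range(n) if j != i])     # leave-one-out estimate
        g = -log_py(y[i], beta_i)
        for z in z_support:
            w = np.exp(log_pzy(z, y[i], beta_full))          # plug-in weight for q_{z|y}(z | y_i)
            g -= w * log_pzy(z, y[i], beta_i)                # adds f(y_i; theta_hat_b^(i))
        total += g
    return total / n
```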
Let us show the asymptotic equivalence of L_x^{cv}(θ̂_b) and risk̂_{x;b}, assuming that we know the functional form of f(y; θ). Noting that β̂_b^{(i)} is a critical point of ℓ_b(β) − log p_b(b_i; β)/n, we have
$$\nabla \ell_b\bigl(\hat{\beta}_b^{(i)}\bigr) = \frac{1}{n} \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr) = O_p(n^{-1}).$$
By applying a Taylor expansion to ∇ℓ_b(β) around β = β̂_b, it follows from ∇ℓ_b(β̂_b) = 0 that
$$\nabla^2 \ell_b(\tilde{\beta}_{bi})\bigl(\hat{\beta}_b^{(i)} - \hat{\beta}_b\bigr) = \frac{1}{n} \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr), \qquad (14)$$
where β̃_{bi} lies between β̂_b^{(i)} and β̂_b. We can see from (14) that β̂_b^{(i)} − β̂_b = O_p(n⁻¹). Next, we regard g(y_i; θ) as a function of β and apply a Taylor expansion to it around β = β̂_b. Therefore, g(y_i; θ̂_b^{(i)}) can be expressed as follows:
$$g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigl(\hat{\beta}_b^{(i)} - \hat{\beta}_b\bigr), \qquad (15)$$
where θ̃_{bi} lies between θ̂_b^{(i)} and θ̂_b (θ̃_{bi} does not correspond to β̃_{bi}). Then we assume that
$$\frac{1}{n} \sum_{i=1}^{n} \nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr) \xrightarrow{p} -\, I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr]. \qquad (16)$$
By noting β̂_b^{(i)} = β̂_b + O_p(n⁻¹), we have β̃_{bi} = β̄ + O_p(n^{−1/2}) and θ̃_{bi} = θ̄ + O_p(n^{−1/2}), and thus (16) holds at least formally. With the above setup, we show the following theorem. The proof is given in Appendix A.4.
Theorem 3.
Supposing the same assumptions as Theorem 1 together with (16), we have
$$2n\, L_x^{\mathrm{cv}}(\hat{\theta}_b) = \widehat{\mathrm{risk}}_{x;b} + 2 \sum_{i=1}^{n} f(y_i; \bar{\theta}) + o_p(1). \qquad (17)$$
Because the second term on the right-hand side of (17) does not depend on the candidate model under condition (6), this theorem implies that L_x^{cv}(θ̂_b) is asymptotically equivalent to risk̂_{x;b} up to the scaling and the constant term. One may wonder, however, why f(y; θ) is included in g(y; θ) when comparing models of p_b(b; β). If p_{z|y}(θ) is correctly specified for q_{z|y}, then f(y; θ̄) = −∫ q_{z|y}(z | y) log q_{z|y}(z | y) dz no longer depends on the model, so we might simply exclude f(y; θ) from g(y; θ), leading to the loss L_y(θ) instead. The reason for including f(y; θ) in g(y; θ) is as follows: L_x^{cv}(θ̂_b), as well as risk̂_{x;b} (and AIC_{x;b}), includes the additional penalty for estimating θ̂_b through f(y; θ̂_b), which depends on the candidate model even if p_{z|y}(θ) is correctly specified.

6. Experiments with Simulated Datasets

This section shows the usefulness of auxiliary variables and the proposed information criteria via a simulation study. The models illustrated in Section 3 are used for confirming the asymptotic unbiasedness of the information criterion and the validity of auxiliary variable selection.

6.1. Unbiasedness

At first, we confirm the asymptotic unbiasedness of AIC_{x;b} for estimating 2n R_x(θ̂_b) except for the constant term C(q_x). The simulation setting is the same as Case 1 in Section 3; thus the data generating model is given by
$$q_{b|z}(y, a \mid z) = z\, N_2\bigl((y, a)^{\top}; \mu_{10}, \Sigma_0\bigr) + (1 - z)\, N_2\bigl((y, a)^{\top}; \mu_{20}, \Sigma_0\bigr), \qquad q_z(z) = 0.6 z + 0.4 (1 - z),$$
where μ_{10} = −μ_{20} = (1.2, 1.8)⊤ and Σ_0 = diag(0.7, 0.49). We generated T = 10⁴ independent replicates of the dataset {(y_i, z_i, a_i)}_{i=1}^n from this model; in fact, we used {(y_i, z_i, a_{i,1})}_{i=1}^n generated in Section 6.2. The candidate model is given by (4), which is correctly specified for the above data generating model. Because AIC_{x;b} is derived by ignoring C(q_x), we compare E[AIC_{x;b} − AIC_{x;y}] with 2n{R_x(θ̂_b) − R_x(θ̂_y)}. The expectations are approximated by simulation averages as
$$\mathrm{E}[\mathrm{AIC}_{x;b} - \mathrm{AIC}_{x;y}] \approx \frac{1}{T} \sum_{t=1}^{T} \bigl\{\mathrm{AIC}_{x;b}^{(t)} - \mathrm{AIC}_{x;y}^{(t)}\bigr\}, \qquad 2n\bigl\{R_x(\hat{\theta}_b) - R_x(\hat{\theta}_y)\bigr\} \approx \frac{2n}{T} \sum_{t=1}^{T} \bigl\{L_x\bigl(\hat{\theta}_b^{(t)}\bigr) - L_x\bigl(\hat{\theta}_y^{(t)}\bigr)\bigr\},$$
where AIC_{x;b}^{(t)}, AIC_{x;y}^{(t)}, θ̂_b^{(t)}, and θ̂_y^{(t)} are those computed for the t-th dataset (t = 1, …, T).
Here, we remark on the calculation of the loss function L_x(θ̂) in the two-component GMM. Let θ̂ = (π̂_1, μ̂_1, μ̂_2, σ̂²)⊤ be an estimator of θ. We expect that the components of the GMM corresponding to Z = 1 and Z = 0 consist of (π̂_1, μ̂_1, σ̂²) and (1 − π̂_1, μ̂_2, σ̂²), respectively. However, we cannot determine the assignment of the estimated parameters in reality, i.e., (π̂_1, μ̂_1, σ̂²) and (1 − π̂_1, μ̂_2, σ̂²) may correspond to Z = 0 and Z = 1, respectively, because the labels z_1, …, z_n are missing. The assignment is required to calculate L_x(θ̂), whereas it is not used for L_y(θ̂) or the proposed information criteria. Hence, in this paper, we define L_x(θ̂) as the minimum of L_x(θ̂) and L_x(θ̂′), where θ̂′ = (1 − π̂_1, μ̂_2, μ̂_1, σ̂²)⊤.
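With a large evaluation sample from q_x at hand (as in the simulation), this convention can be implemented as follows; the sketch uses our own function names and treats θ̂ as the tuple (π̂_1, μ̂_1, μ̂_2, σ̂²).

```python
import numpy as np
from scipy.stats import norm

def loss_x(theta, y, z):
    """Monte Carlo approximation of L_x(theta) for the two-component GMM,
    using a large evaluation sample (y, z) drawn from the true q_x."""
    pi1, mu1, mu2, s2 = theta
    sd = np.sqrt(s2)
    logpx = np.where(z == 1,
                     np.log(pi1) + norm.logpdf(y, mu1, sd),
                     np.log(1.0 - pi1) + norm.logpdf(y, mu2, sd))
    return -np.mean(logpx)

def loss_x_aligned(theta, y, z):
    """Resolve the label-switching ambiguity: take the smaller loss of theta and
    its swapped version theta' = (1 - pi1, mu2, mu1, sigma^2)."""
    pi1, mu1, mu2, s2 = theta
    return min(loss_x((pi1, mu1, mu2, s2), y, z),
               loss_x((1.0 - pi1, mu2, mu1, s2), y, z))
```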
Table 3 shows the results of the simulation for n = 100, 200, 500, 1000, 2000, and 5000. For all n, we observe that E[AIC_{x;b} − AIC_{x;y}] is very close to 2n{R_x(θ̂_b) − R_x(θ̂_y)}, indicating the unbiasedness of AIC_{x;b}.

6.2. Auxiliary Variable Selection

Next, we demonstrate that the proposed AIC selects a useful auxiliary variable (Case 1), while it does not select a useless auxiliary variable (Case 2). In each case, we generated T = 10⁴ independent replicates of the dataset {(y_i, z_i, a_i)}_{i=1}^n from the model. In fact, the values of {(y_i, z_i)}_{i=1}^n are shared in both cases, so we generated replicates of {(y_i, z_i, a_{i,1}, a_{i,2})}_{i=1}^n, where a_{i,1} and a_{i,2} are the auxiliary variables for Case 1 and Case 2, respectively. In each case, we compute AIC_{x;b} and AIC_{x;y}; we then select θ̂_b (i.e., select the auxiliary variable A) if AIC_{x;b} < AIC_{x;y}, and select θ̂_y (i.e., do not select the auxiliary variable A) otherwise. The selected estimator is denoted as θ̂_best. This experiment was repeated T = 10⁴ times. Note that the typical dataset in Section 3 was picked from the generated datasets so that it has around the median value in each of L_x(θ̂_b) − L_x(θ̂_y), L_y(θ̂_b) − L_y(θ̂_y), AIC_{x;b} − AIC_{x;y}, and AIC_{y;b} − AIC_{y;y} in both cases.
The selection frequencies are shown in Table 4 and Table 5. We observe that, as expected, the useful auxiliary variable tends to be selected in Case 1, while the useless auxiliary variable tends not to be selected in Case 2.
For verifying the usefulness of the auxiliary variable in both cases, we computed the risk value R_x(θ̂) for θ̂ = θ̂_y, θ̂_b, and θ̂_best. They are approximated by the simulation average as
$$R_x(\hat{\theta}) \approx \frac{1}{T} \sum_{t=1}^{T} L_x\bigl(\hat{\theta}^{(t)}\bigr).$$
The results are shown in Table 6 and Table 7. For easier comparison, the reported values are the differences from L_x(θ_0), where θ_0 is the true parameter value. For all n, we observe that, as expected, R_x(θ̂_b) < R_x(θ̂_y) in Case 1 and R_x(θ̂_b) > R_x(θ̂_y) in Case 2. In both cases, R_x(θ̂_best) is close to min{R_x(θ̂_b), R_x(θ̂_y)}, indicating that the variable selection works well.

7. Experiments with Real Datasets

We show an example of auxiliary variable selection using the Wine Data Set available at the UCI Machine Learning Repository [21], which consists of 1 categorical variable (3 categories) and 13 continuous variables, denoted V_1, …, V_13. For simplicity, we only use the first two categories and regard the category label as the latent variable Z ∈ {0, 1}; the experimental results were similar for the other combinations. The sample size is then n = 130, and all variables except Z are standardized. We set one of the 13 continuous variables as the observed primary variable Y and the remaining 12 variables as auxiliary variables A_1, …, A_12. For example, if Y is V_1, then A_1, …, A_12 are V_2, …, V_13. The dataset is now {(y_i, z_i, a_{i,1}, …, a_{i,12})}_{i=1}^n, which is randomly divided into a training set of size n_tr = 86 (z_i is not used) and a test set of size n_te = 44 (a_{i,1}, …, a_{i,12} are not used).
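The data preparation described here can be reproduced with scikit-learn's packaged copy of the UCI Wine Data Set. The snippet below is a hedged sketch of the preprocessing only — the subsequent AIC computations proceed as in Section 4 — and the random seed and variable names are our own choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
mask = wine.target < 2                                   # keep the first two categories only
X_all, z = wine.data[mask], wine.target[mask]            # n = 130, 13 continuous variables
X_all = (X_all - X_all.mean(axis=0)) / X_all.std(axis=0) # standardize V1..V13 (Z is untouched)

ell = 0                                                  # ell = 0 selects Y = V_1
y = X_all[:, ell]
A = np.delete(X_all, ell, axis=1)                        # the remaining columns are A_1..A_12
y_tr, y_te, A_tr, A_te, z_tr, z_te = train_test_split(y, A, z, train_size=86, random_state=0)
# z_tr is not used for fitting; A_te is not used for evaluation, as described above.
```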
In the experiment, we compute AIC_{x;b_ℓ} for B_ℓ = (Y, A_ℓ), ℓ = 1, …, 12, and AIC_{x;y} for Y from the training dataset using the model (4). We select θ̂_best from θ̂_{b_1}, …, θ̂_{b_12} and θ̂_y by finding the minimum of the 13 AIC values. Thus we are selecting one of the auxiliary variables A_1, …, A_12 or not selecting any of them. It is possible to select a combination of the auxiliary variables, but we did not attempt such an experiment. For measuring the generalization error, we compute L_x(θ̂_y) − L_x(θ̂_best) from the test set as
$$L_x(\hat{\theta}_y) - L_x(\hat{\theta}_{\mathrm{best}}) \approx -\frac{1}{n_{te}} \sum_{i \in D_{te}} \bigl\{\log p_x(y_i, z_i; \hat{\theta}_y) - \log p_x(y_i, z_i; \hat{\theta}_{\mathrm{best}})\bigr\},$$
where D_te ⊂ {1, …, n} represents the test set. The assignment problem of L_x(·) mentioned in Section 6 is avoided in a similar manner.
For each case of Y = V_ℓ, ℓ = 1, …, 13, the above experiment was repeated 100 times, and the average generalization error was computed. The results are shown in Table 8. A positive value indicates that θ̂_best performed better than θ̂_y. We observe that θ̂_best is better than, or almost the same as, θ̂_y in all cases ℓ = 1, …, 13, suggesting that the proposed AIC works well for selecting a useful auxiliary variable.

8. Conclusions

We often encounter a dataset composed of various variables. If only some of the variables are of interest, then the rest of the variables can be interpreted as auxiliary variables. Auxiliary variables may be able to improve estimation accuracy of unknown parameters but they could also be harmful. Hence, it is important to select useful auxiliary variables.
In this paper, we focused on exploiting auxiliary variables in incomplete data analysis. The usefulness of auxiliary variables is measured by a risk function based on the KL divergence for complete data. We derived an information criterion which is an asymptotically unbiased estimator of the risk function except for a constant term. Moreover, we extended a result of Stone [14] to our setting and proved asymptotic equivalence between a variant of LOOCV and the proposed criteria. Since LOOCV requires an additional condition for its justification, the proposed criteria are preferable to LOOCV.
This study assumes that the observed variables differ between the training set and the test set. There are other settings, such as covariate shift [17] and transfer learning [22], where the distributions differ between the training set and the test set. It would be possible to combine these settings to construct a generalized framework. It is also possible to extend our study to take account of a missing-data mechanism. We leave these extensions as future work.

Author Contributions

Conceptualization, S.I. and H.S.; methodology, S.I. and H.S.; software, S.I.; validation, S.I. and H.S.; formal analysis, S.I. and H.S.; writing-original draft preparation S.I. and H.S.; visualization, S.I. and H.S.

Funding

This research was funded in part by JSPS KAKENHI Grant (17K12650 to S.I., 16H02789 to H.S.) and by “Funds for the Development of Human Resources in Science and Technology” under MEXT, through the “Home for Innovative Researchers and Academic Knowledge Users (HIRAKU)” consortium (to S.I.).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

Appendix A.1. Proof of Lemma 1

Proof. 
A Taylor expansion of L_x(θ) around θ = θ̄, formally regarding it as a function of β, gives
$$L_x(\hat{\theta}_b) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) + \frac{1}{2}\, \mathrm{tr}\bigl\{I_x (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\} + o_p(n^{-1}),$$
where ∇²L_x(θ̄) = I_x is used above. By taking the expectation of both sides,
$$\mathrm{E}[L_x(\hat{\theta}_b)] = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2}\, \mathrm{tr}\bigl\{I_x\, \mathrm{E}[(\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}]\bigr\} + o(n^{-1}) = L_x(\bar{\theta}) + \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{2n}\, \mathrm{tr}\bigl(I_x I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}),$$
where the asymptotic variance of β̂_b in (5) is given as
$$n\, \mathrm{E}\bigl[(\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr] = I_b^{-1} J_b I_b^{-1} + o(1). \qquad (\mathrm{A1})$$

Appendix A.2. Proof of Lemma 2

Proof. 
A Taylor expansion of ℓ_y(θ) around θ = θ̄, formally regarding it as a function of β, gives
$$\ell_y(\hat{\theta}_b) = \ell_y(\bar{\theta}) + \nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) - \frac{1}{2}\, \mathrm{tr}\bigl\{I_y (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\} + o_p(n^{-1}),$$
where ∇²ℓ_y(θ̄) = −I_y + o_p(1) is used above. By taking the expectation of both sides,
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = \mathrm{E}[\ell_y(\bar{\theta})] + \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] - \frac{1}{2}\, \mathrm{E}\bigl[\mathrm{tr}\bigl\{I_y (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\bigr\}\bigr] + o(n^{-1}) = \mathrm{E}[\ell_y(\bar{\theta})] + \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (\mathrm{A2})$$
In the last expression, we used (A1) for the asymptotic variance of β̂_b. For working on the second term in (A2), we first derive an expression for β̂_b − β̄. A Taylor expansion of the score function ∇ℓ_b(β) around β = β̄ gives
$$\nabla \ell_b(\hat{\beta}_b) = \nabla \ell_b(\bar{\beta}) + \nabla^2 \ell_b(\bar{\beta})(\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1/2}) = \nabla \ell_b(\bar{\beta}) - I_b(\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1/2}),$$
where ∇²ℓ_b(β̄) = −I_b + o_p(1) is used above. By noticing ∇ℓ_b(β̂_b) = 0, we thus obtain
$$\hat{\beta}_b - \bar{\beta} = I_b^{-1} \nabla \ell_b(\bar{\beta}) + o_p(n^{-1/2}) = \frac{1}{n} \sum_{i=1}^{n} I_b^{-1} \nabla \log p_b(b_i; \bar{\beta}) + o_p(n^{-1/2}), \qquad (\mathrm{A3})$$
where E[∇ℓ_b(β̄)] = 0 and each term in the summation has mean zero, because E[∇ log p_b(b; β̄)] = ∇E[log p_b(b; β̄)] = 0. Now we are back to the second term in (A2). Using (A3), we have
$$\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta}) = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}(\hat{\beta}_b - \bar{\beta}) + \bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top}(\hat{\beta}_b - \bar{\beta}) = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}(\hat{\beta}_b - \bar{\beta}) + \bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta}) + o_p(n^{-1}). \qquad (\mathrm{A4})$$
By noting E[∇ℓ_b(β̄)] = 0, the expectation of the second term in (A4) is
$$\mathrm{E}\bigl[\bigl\{\nabla \ell_y(\bar{\theta}) - \mathrm{E}[\nabla \ell_y(\bar{\theta})]\bigr\}^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta})\bigr] = \mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top} I_b^{-1} \nabla \ell_b(\bar{\beta})\bigr] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{E}\bigl[\nabla \log p_y(y_i; \bar{\theta})^{\top} I_b^{-1} \nabla \log p_b(b_j; \bar{\beta})\bigr] = \frac{1}{n}\, \mathrm{E}\bigl[\nabla \log p_y(y; \bar{\theta})^{\top} I_b^{-1} \nabla \log p_b(b; \bar{\beta})\bigr] = \frac{1}{n}\, \mathrm{tr}\bigl\{I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(b; \bar{\beta})\, \nabla^{\top} \log p_y(y; \bar{\theta})\bigr]\bigr\} = \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr). \qquad (\mathrm{A5})$$
Combining (A4) and (A5), we have
$$\mathrm{E}\bigl[\nabla \ell_y(\bar{\theta})^{\top}(\hat{\beta}_b - \bar{\beta})\bigr] = \mathrm{E}[\nabla \ell_y(\bar{\theta})]^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + o(n^{-1}). \qquad (\mathrm{A6})$$
We next show that E[∇ℓ_y(θ̄)] = −∇L_x(θ̄). Let us recall that we have assumed q_{z|y}(z | y) = p_{z|y}(z | y; θ̄) in (6), which leads to
$$\mathrm{E}\bigl[\nabla \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = \int q_y(y) \int p_{z|y}(z \mid y; \bar{\theta})\, \nabla \log p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = \int q_y(y) \int \nabla p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = \int q_y(y)\, \nabla\! \int p_{z|y}(z \mid y; \bar{\theta})\, dz\, dy = 0.$$
Therefore,
$$\nabla L_x(\bar{\theta}) = -\nabla\, \mathrm{E}[\log p_x(x; \bar{\theta})] = -\mathrm{E}[\nabla \log p_x(x; \bar{\theta})] = -\bigl\{\mathrm{E}[\nabla \log p_y(y; \bar{\theta})] + \mathrm{E}[\nabla \log p_{z|y}(z \mid y; \bar{\theta})]\bigr\} = -\mathrm{E}[\nabla \ell_y(\bar{\theta})].$$
Substituting this and (A6) into the second term in (A2), we have
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = \mathrm{E}[\ell_y(\bar{\theta})] - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}). \qquad (\mathrm{A7})$$
The first term on the right hand side in (A7) is
$$\mathrm{E}[\ell_y(\bar{\theta})] = \mathrm{E}[\log p_y(y; \bar{\theta})] = \mathrm{E}[\log p_x(x; \bar{\theta})] - \mathrm{E}[\log p_{z|y}(z \mid y; \bar{\theta})] = -L_x(\bar{\theta}) - C(q_x),$$
where (6) is used again in the last term. Finally, (A7) is rewritten as
$$\mathrm{E}[\ell_y(\hat{\theta}_b)] = -L_x(\bar{\theta}) - C(q_x) - \nabla L_x(\bar{\theta})^{\top}\, \mathrm{E}[\hat{\beta}_b - \bar{\beta}] + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) - \frac{1}{2n}\, \mathrm{tr}\bigl(I_y I_b^{-1} J_b I_b^{-1}\bigr) + o(n^{-1}).$$

Appendix A.3. Proof of Theorem 2

Proof. 
First recall that we have assumed q_c(c) = p_c(c; β_0), which also implies the condition (6) as q_{z|y}(z | y) = p_{z|y}(z | y; θ_0) with β̄ = β_0. Thus Theorem 1 holds. Substituting J_b = I_b and K_{b,y} = I_y in the penalty term of (8), we have
$$2\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) = 2\, \mathrm{tr}\bigl(I_b^{-1} I_y\bigr) + \mathrm{tr}\bigl((I_x - I_y) I_b^{-1}\bigr) = \mathrm{tr}\bigl(I_b^{-1} I_y\bigr) + \mathrm{tr}\bigl(I_x I_b^{-1}\bigr),$$
giving the penalty term of (10). Therefore, we only have to show (9). Noting the identity
$$\nabla^2 \log p_b(b; \beta) = \frac{1}{p_b(b; \beta)} \nabla^2 p_b(b; \beta) - \nabla \log p_b(b; \beta)\, \nabla^{\top} \log p_b(b; \beta),$$
it follows from q_b(b) = p_b(b; β_0) that
$$I_b = -\mathrm{E}\bigl[\nabla^2 \log p_b(b; \beta_0)\bigr] = -\int \nabla^2 p_b(b; \beta_0)\, db + \mathrm{E}\bigl[\nabla \log p_b(b; \beta_0)\, \nabla^{\top} \log p_b(b; \beta_0)\bigr] = -\nabla^2\! \int p_b(b; \beta_0)\, db + J_b = J_b.$$
Note that the same result can be obtained from Theorem 3.3 in White [16]. Next we show K_{b,y} = I_y. Since q_{a|y}(a | y) = p_{a|y}(a | y; β_0),
$$\int q_{a|y}(a \mid y)\, \nabla \log p_{a|y}(a \mid y; \beta_0)\, da = \int \nabla p_{a|y}(a \mid y; \beta_0)\, da = \nabla\! \int p_{a|y}(a \mid y; \beta_0)\, da = 0.$$
Therefore, we have
$$K_{b,y} = \mathrm{E}\bigl[\nabla \log p_b(b; \beta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] = \mathrm{E}\bigl[\nabla \log p_y(y; \theta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] + \mathrm{E}\bigl[\nabla \log p_{a|y}(a \mid y; \beta_0)\, \nabla^{\top} \log p_y(y; \theta_0)\bigr] = I_y + \int q_y(y) \Bigl\{\int q_{a|y}(a \mid y)\, \nabla \log p_{a|y}(a \mid y; \beta_0)\, da\Bigr\}\, \nabla^{\top} \log p_y(y; \theta_0)\, dy = I_y.$$

Appendix A.4. Proof of Theorem 3

Proof. 
It follows from (14) and (15) that
$$g\bigl(y_i; \hat{\theta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\, \nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr) = g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \mathrm{tr}\bigl\{\nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigr\}.$$
This and the assumption (16) imply that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{tr}\bigl\{\nabla^2 \ell_b(\tilde{\beta}_{bi})^{-1}\, \nabla \log p_b\bigl(b_i; \hat{\beta}_b^{(i)}\bigr)\, \nabla^{\top} g\bigl(y_i; \tilde{\theta}_{bi}\bigr)\bigr\} = \frac{1}{n} \sum_{i=1}^{n} g\bigl(y_i; \hat{\theta}_b\bigr) - \frac{1}{n}\, \mathrm{tr}\bigl\{I_b^{-1}\, \mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr]\bigr\} + o_p(n^{-1}).$$
Under the assumption q_{z|y}(z | y) = p_{z|y}(z | y; θ̄),
$$\nabla f(y; \bar{\theta}) = -\int q_{z|y}(z \mid y)\, \nabla \log p_{z|y}(z \mid y; \bar{\theta})\, dz = -\int \nabla p_{z|y}(z \mid y; \bar{\theta})\, dz = 0. \qquad (\mathrm{A8})$$
This yields that
$$\mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} g(y; \bar{\theta})\bigr] = -\mathrm{E}\bigl[\nabla \log p_b(\bar{\beta})\, \nabla^{\top} \log p_y(\bar{\theta})\bigr] = -K_{b,y}.$$
Hence, by noting g(y; θ) = −log p_y(y; θ) + f(y; θ), it holds that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = -\ell_y(\hat{\theta}_b) + \frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + o_p(n^{-1}). \qquad (\mathrm{A9})$$
For evaluating the second term on the right hand side, we apply a Taylor expansion to n^{-1} Σ_{i=1}^{n} f(y_i; θ) around θ = θ̄, formally regarding it as a function of β. By noting (A8), this gives
$$\frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n} \sum_{i=1}^{n} (\hat{\beta}_b - \bar{\beta})^{\top}\, \nabla^2 f(y_i; \bar{\theta})\, (\hat{\beta}_b - \bar{\beta}) + o_p(n^{-1}) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n}\, \mathrm{tr}\Bigl\{\sum_{i=1}^{n} \nabla^2 f(y_i; \bar{\theta})\, (\hat{\beta}_b - \bar{\beta})(\hat{\beta}_b - \bar{\beta})^{\top}\Bigr\} + o_p(n^{-1}).$$
It follows from the law of large numbers that
$$\frac{1}{n} \sum_{i=1}^{n} \nabla^2 f(y_i; \bar{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \int q_{z|y}(z \mid y_i)\, \nabla^2 \log p_{z|y}(z \mid y_i; \bar{\theta})\, dz \xrightarrow{p} -\mathrm{E}\bigl[\nabla^2 \log p_{z|y}(z \mid y; \bar{\theta})\bigr] = I_{z|y}.$$
Hence, (A1) indicates that
$$\frac{1}{n} \sum_{i=1}^{n} f\bigl(y_i; \hat{\theta}_b\bigr) = \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + o_p(n^{-1}). \qquad (\mathrm{A10})$$
By substituting (A10) into (A9), we establish that
$$L_x^{\mathrm{cv}}(\hat{\theta}_b) = -\ell_y(\hat{\theta}_b) + \frac{1}{n}\, \mathrm{tr}\bigl(I_b^{-1} K_{b,y}\bigr) + \frac{1}{2n}\, \mathrm{tr}\bigl(I_{z|y} I_b^{-1} J_b I_b^{-1}\bigr) + \frac{1}{n} \sum_{i=1}^{n} f(y_i; \bar{\theta}) + o_p(n^{-1}).$$
Hence, the proof is complete. □

References

  1. Breiman, L.; Friedman, J.H. Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 1997, 59, 3–54. [Google Scholar] [CrossRef]
  2. Tibshirani, R.; Hinton, G. Coaching variables for regression and classification. Stat. Comput. 1998, 8, 25–33. [Google Scholar] [CrossRef]
  3. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
  4. Mercatanti, A.; Li, F.; Mealli, F. Improving inference of Gaussian mixtures using auxiliary variables. Stat. Anal. Data Min. 2015, 8, 34–48. [Google Scholar] [CrossRef] [Green Version]
  5. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  6. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  7. Shibata, R. An optimal selection of regression variables. Biometrika 1981, 68, 45–54. [Google Scholar] [CrossRef]
  8. Shibata, R. Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Stat. Math. 1983, 35, 415–423. [Google Scholar] [CrossRef]
  9. Takeuchi, K. Distribution of information statistics and criteria for adequacy of models. Math. Sci. 1976, 153, 12–18. (In Japanese) [Google Scholar]
  10. Shimodaira, H. A new criterion for selecting models from partially observed data. In Selecting Models from Data; Cheeseman, P., Oldford, R.W., Eds.; Springer: New York, NY, USA, 1994; pp. 21–29. [Google Scholar]
  11. Cavanaugh, J.E.; Shumway, R.H. An Akaike information criterion for model selection in the presence of incomplete data. J. Stat. Plan. Inference 1998, 67, 45–65. [Google Scholar] [CrossRef] [Green Version]
  12. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–38. [Google Scholar] [CrossRef]
  13. Shimodaira, H.; Maeda, H. An information criterion for model selection with missing data via complete-data divergence. Ann. Inst. Stat. Math. 2018, 70, 421–438. [Google Scholar] [CrossRef]
  14. Stone, M. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 44–47. [Google Scholar] [CrossRef]
  15. Ibrahim, J.G.; Lipsitz, S.R.; Horton, N. Using auxiliary data for parameter estimation with non-ignorably missing outcomes. J. R. Stat. Soc. Ser. C Appl. Stat. 2001, 50, 361–373. [Google Scholar] [CrossRef]
  16. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  17. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244. [Google Scholar] [CrossRef] [Green Version]
  18. Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Methodol. 1974, 36, 111–147. [Google Scholar] [CrossRef]
  19. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef] [Green Version]
  20. Yanagihara, H.; Tonda, T.; Matsumoto, C. Bias correction of cross-validation criterion based on Kullback–Leibler information under a general condition. J. Multivar. Anal. 2006, 97, 1965–1975. [Google Scholar] [CrossRef] [Green Version]
  21. Dua, D.; Karra Taniskidou, E. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 31 July 2017. [Google Scholar]
  22. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
Figure 1. Useful auxiliary variable (Case 1). The left panel plots {(y_i, a_i)}_{i=1}^{100} with labels indicating z_i. The estimated p_b(β̂_b) is shown by the contour lines. The right panel shows the histogram of {y_i}_{i=1}^{100} and three density functions: p_y(θ̂_x) (broken line), p_y(θ̂_y) (dotted line), and p_y(θ̂_b) (solid line). In Section 4.4, this useful auxiliary variable is selected by our method (Case 1 in Table 2).
Figure 2. Useless auxiliary variable (Case 2). The symbols are the same as in Figure 1. In Section 4.4, this useless auxiliary variable is NOT selected by our method (Case 2 in Table 2).
Table 1. Random variables in incomplete data analysis with auxiliary variables. B = (Y, A) is used for estimation of unknown parameters, and X = (Y, Z) is used for evaluation of candidate models.

            Observed   Latent   Complete
Primary     Y          Z        X
Auxiliary   A
All         B                   C
Table 2. Comparisons between θ̂_b and θ̂_y for predicting X, and for predicting Y.

          p_x(θ̂_b) vs. p_x(θ̂_y)    p_y(θ̂_b) vs. p_y(θ̂_y)
          AIC_{x;b} − AIC_{x;y}     AIC_{y;b} − AIC_{y;y}
Case 1    −2.67                     −0.96
Case 2    9.86                      10.37
Table 3. Expected Akaike Information Criterion (AIC) difference compared with the risk difference. The values are computed from T = 10⁴ runs of simulation, with standard errors in parentheses.

n                             100       200       500       1000      2000      5000
E[AIC_{x;b} − AIC_{x;y}]      −3.559    −3.263    −3.221    −3.197    −3.195    −3.180
                              (0.074)   (0.021)   (0.015)   (0.013)   (0.013)   (0.012)
2n{R_x(θ̂_b) − R_x(θ̂_y)}      −3.603    −3.333    −3.275    −3.208    −3.182    −3.232
                              (0.071)   (0.054)   (0.050)   (0.050)   (0.050)   (0.050)
Table 4. Useful auxiliary variable (Case 1): selection frequencies of θ̂_b and θ̂_y.

n       100     200     500     1000    2000    5000
θ̂_b    9230    9475    9649    9687    9711    9727
θ̂_y    770     525     351     313     289     273
Table 5. Useless auxiliary variable (Case 2): selection frequencies of θ̂_b and θ̂_y.

n       100     200     500     1000    2000    5000
θ̂_b    1508    212     1       0       0       0
θ̂_y    8492    9788    9999    10,000  10,000  10,000
Table 6. Useful auxiliary variable (Case 1): estimated risk functions of θ̂_b, θ̂_y, and θ̂_best, with standard errors in parentheses.

n                             100       200       500       1000      2000      5000
2n{R_x(θ̂_b) − L_x(θ_0)}      4.229     4.079     4.051     4.039     4.029     4.033
                              (0.032)   (0.030)   (0.029)   (0.028)   (0.029)   (0.028)
2n{R_x(θ̂_y) − L_x(θ_0)}      7.831     7.412     7.326     7.247     7.211     7.266
                              (0.078)   (0.061)   (0.058)   (0.058)   (0.058)   (0.058)
2n{R_x(θ̂_best) − L_x(θ_0)}   5.109     4.741     4.501     4.491     4.479     4.454
                              (0.052)   (0.045)   (0.041)   (0.042)   (0.042)   (0.041)
Table 7. Useless auxiliary variable (Case 2): estimated risk functions of θ̂_b, θ̂_y, and θ̂_best, with standard errors in parentheses.

n                             100       200       500        1000       2000       5000
2n{R_x(θ̂_b) − L_x(θ_0)}      105.527   214.659   543.685    1091.105   2182.647   5452.623
                              (0.111)   (0.167)   (0.301)    (0.474)    (0.723)    (1.151)
2n{R_x(θ̂_y) − L_x(θ_0)}      7.831     7.412     7.326      7.247      7.211      7.266
                              (0.078)   (0.061)   (0.058)    (0.058)    (0.058)    (0.058)
2n{R_x(θ̂_best) − L_x(θ_0)}   22.064    11.555    7.375      7.247      7.211      7.266
                              (0.358)   (0.304)   (0.079)    (0.058)    (0.058)    (0.058)
Table 8. Experiment average of n_te{L_x(θ̂_y) − L_x(θ̂_best)} for each case of Y = V_ℓ, ℓ = 1, …, 13. Standard errors are in parentheses.

Y                                  V_1      V_2      V_3      V_4      V_5      V_6      V_7
n_te{L_x(θ̂_y) − L_x(θ̂_best)}     0.13     −0.14    89.71    46.24    −1.76    3.34     76.54
                                   (0.08)   (0.12)   (3.82)   (4.17)   (2.52)   (1.34)   (6.09)

Y                                  V_8      V_9      V_10     V_11     V_12     V_13
n_te{L_x(θ̂_y) − L_x(θ̂_best)}     13.91    39.45    1.72     111.24   15.48    0.23
                                   (2.21)   (3.12)   (0.29)   (8.46)   (2.11)   (0.09)
