1. Introduction
In statistical research and data mining, methods for selecting the “best” model associated with the observed data are of great importance. This is especially crucial in areas such as healthcare and medical research, where the models themselves are very important (so merely minimizing the generalization error directly, via cross-validation and non-parametric methods, will not suffice). Model selection plays a major role in regression analysis, where one searches for a set of “important” variables that compose the model that best explains the given data. In recent decades, a variety of methods have been developed for selecting models when the data are completely observed, in particular methods based on information criteria [1,2].
Information criteria are perhaps the most common tools used for model selection, through the process of comparing a quantitative score for each model and choosing the model with the best score. The information criterion score is a trade-off between goodness-of-fit and the complexity of the model, specifically $-2\log(\hat{L}) + \text{penalty}$, where $\hat{L}$ is the maximized likelihood of the candidate model, and the penalty term takes the complexity of the sub-model into consideration. Perhaps the most common information criterion today is the Akaike information criterion (AIC), first introduced by [3,4], where the AIC penalty is simply $2k$, twice the number of parameters in the model. The idea of this information criterion was derived from the Kullback–Leibler (KL) divergence, a quantitative measure of directed distance between two models. AIC estimates how much information is lost when approximating the true unknown model, which actually generated the observed data, with the estimated model.
Another common information criterion is the Bayesian information criterion (BIC), presented by [5], which has penalty $k\log(n)$, where $n$ is the number of observations. BIC was derived from Bayesian theory, based on the Bayes factor approach. It is an approximation to the Bayes factor, which selects the model with the greatest posterior probability. Unlike the AIC, it is asymptotically consistent—it will select the true model if it is among the candidate models considered—but it is inefficient for prediction (in contrast to AIC, which is asymptotically efficient for prediction). This criterion is used when trying to understand which of the variables explain the data, and to find the model most likely to be true [1,6].
A problem with both of these methods is that they are not suitable when the data contain missing values. Model selection in the presence of missing data is a major challenge, even when the data are missing completely at random (MCAR); here we use Rubin’s taxonomy for missingness [7,8,9], cf. the recent and more comprehensive taxonomy of [10].
In particular, suppose we have $n$ independent observations, containing some missing values, and a set of candidate models, $\mathcal{M}$, to choose from in order to explain the observations. It might happen that each candidate model, $m_j$, is based on a different sample size, $n_j$.
Figure 1 shows a setting of a design matrix in a regression model, where the various possible models do not have the same sample size; thus, implementing model selection using conventional information criteria is inappropriate.
Obviously, missing data are a problem for nearly all statistical applications, not just for model selection. Various methods have been developed to deal with this problem, most of which involve manipulating the data into a complete dataset, so that a standard analysis can be performed. The common approaches for dealing with missing data range from complete-case analysis (which simply deletes all observations with one or more missing values, leaving a complete dataset on which a standard analysis can be performed) to single imputation and multiple imputation, which consists of three stages—imputation, analysis, and combination—all of which are required to properly account for the uncertainty of imputation [8,11]. Somewhat naturally, early work addressing missing data in the context of model selection took this approach and tried to adapt it; however, it is challenging to combine the results of variable selection (the last stage) across all imputed datasets in a principled framework, and a variety of articles address this issue [12,13,14,15,16,17].
Another strategy, presented by [12], is to combine multiple imputation and variable selection in a Bayesian framework. The authors of that study showed two methods for linear regression models with missing data in the covariates. The first method, impute then select (ITS), starts by generating different imputed datasets, then applies Bayesian variable selection methods to each imputed dataset, and uses Rubin’s rule to combine the results. The second method, simultaneously impute and select, embeds the steps of imputation and variable selection in a single, combined Gibbs sampling process. The first method is easier to implement in standard software and enables flexible imputation and separate analysis, but the study’s results indicated that the second method slightly outperformed the first.
One more variable selection strategy is to combine resampling techniques with imputation, as presented by [18]. They developed the procedure of bootstrap imputation and stability selection, which generates bootstrap samples and conducts single imputation on each bootstrap dataset. It then obtains the randomized lasso estimate for each dataset, and the final variable selection is conducted by stability selection. The authors of [19] developed a similar method, multiple imputation random lasso, with the main difference being that their method first performs imputation, and only then generates bootstrap samples. Both sets of results showed that these approaches are suitable for high-dimensional problems. This approach, using imputation or expectation–maximization (and EM-like algorithms) to replace missing data, is also present in [20,21,22,23], cf. [24,25,26].
In this work we present a different approach, based on first principles of information criteria [3,4,5]. We propose to apply the information criterion to a candidate model using all of its fully observed cases, and then to normalize, in some form, the information criterion according to the sample size. In this way, the different models can be compared by the information criterion score per observation. Consider again the example in Figure 1: in order to calculate the information criterion of a model containing one pair of the variables, we use only observations 5, 6, 7, and 10, and then normalize by the number of observations, $n_j = 4$, to obtain the normalized information criterion. By contrast, for a model containing a different pair of the variables, we use observations 7, 9, and 10 to calculate the information criterion and then normalize by $n_j = 3$.
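To make the proposal concrete, here is a minimal R sketch (R is also the language used for the simulations in Section 4) that, for a given candidate subset of covariates, fits the model on the rows that are fully observed for that subset and returns the AIC per observation; the data frame `dat`, the response name, and the candidate subsets are hypothetical placeholders.

```r
# Minimal sketch of the normalized-information-criterion idea (names are placeholders)
normalized_aic <- function(dat, response, covariates) {
  cols <- c(response, covariates)
  rows <- complete.cases(dat[, cols])            # rows fully observed for this candidate model
  n_j  <- sum(rows)
  fit  <- lm(reformulate(covariates, response), data = dat[rows, cols])
  AIC(fit) / n_j                                 # information-criterion score per observation
}

# Example usage: compare two candidate subsets of a hypothetical data frame `dat`
# scores <- c(m1 = normalized_aic(dat, "y", c("x1", "x2")),
#             m2 = normalized_aic(dat, "y", c("x3", "x4")))
# names(which.min(scores))
```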
Here, we focus on two methods for model selection in the presence of missing data, which are based on the theory of the existing AIC and BIC methods, and introduce the normalized AIC and the normalized BIC. The advantage of the new methods is that they use more observations than “complete cases”, on the one hand, while avoiding unnecessary variance due to imputation, on the other (as well as offering savings in computational complexity compared to imputation methods).
The rest of this paper is organized as follows. In Section 2, we present the theory of normalized AIC and normalized BIC, which adapts current theory to cases where the data are missing. In Section 3, we discuss the computational efficiency [27] of the normalized information criteria, compared to imputation-based methods, as well as their statistical efficiency, and present three additional methods that extend the normalized AIC method from Section 2.1, in order to increase its statistical efficiency. Simulation results are provided in Section 4. We conclude this article in Section 5.
2. Normalized Information Criteria
Our theory of normalized AIC and normalized BIC relies on the traditional theory of AIC and BIC, with the changes required by relaxing the assumption that all candidate models have the same sample size; the new criteria make no such assumption.
Suppose we have data that are independent and identically distributed (i.i.d.), drawn from some probability distribution with unknown density $f$, and that we have a set of candidate models, $\mathcal{M}$, where each model $m_j$ is a family of probability densities, $f_j(\cdot \mid \theta_j)$, depending on a parameter vector, $\theta_j \in \Theta_j$, where $\Theta_j$ is the parameter space of model $m_j$, with dimension $k_j$.
In the context of this paper, the data contain missing values. Each model is based on the observations that do not contain missing values, so each model can have a different sample size, $n_j$, which is indicated by the subscript $j$. The likelihood of the data given the $j$th model, based on $n_j$ observations, is $L_j(\theta_j) = \prod_{i=1}^{n_j} f_j(y_i \mid \theta_j)$, and the maximum likelihood estimator of $\theta_j$, based on these data, is $\hat{\theta}_j$.
2.1. Normalized AIC
The AIC is equal to $-2\log L(\hat{\theta}) + 2k$, where $k$ is the number of parameters in the model. Akaike recognized the properties of $\mathrm{AIC}/(2n)$ as a mean-unbiased estimate of the expected relative KL distance between a candidate model and the “true” model on previously unseen data; however, since the multiplicative factor $1/(2n)$ is shared by the AIC scores of all models and does not affect their selection, it was disregarded, and subsequent research focused on the AIC score itself. We assume, on the other hand, that each candidate model, $m_j$, is based on a different sample size, $n_j$. Therefore, the sample size $n$ is no longer constant across the candidate models, and cannot be ignored.
The AIC is an estimation of the KL divergence, for selecting the model that minimizes the information lost when a candidate model, $f_j(\cdot \mid \theta_j)$, is used to approximate the reality, $f$:
$$ \mathrm{KL}\big(f, f_j(\cdot \mid \theta_j)\big) \;=\; \int f(x)\log\frac{f(x)}{f_j(x \mid \theta_j)}\,dx \;=\; \int f(x)\log f(x)\,dx \;-\; \mathrm{E}_{x}\big[\log f_j(x \mid \theta_j)\big], $$
where $\int f(x)\log f(x)\,dx$ is an unknown constant, equal for all candidate models, and the second term, the relative KL information, is the expectation with respect to the probability density $f$. Note that the true model $f$ is not necessarily contained in any of the candidate models. For all candidate models, we calculate the negative expected relative distance and select the model with the smallest value, i.e., the one “closest” to $f$:
$$ -\,\mathrm{E}_{y_{n_j}}\Big[\mathrm{E}_{x}\big[\log f_j\big(x \mid \hat{\theta}_j(y_{n_j})\big)\big]\Big], $$
where $x$ is a conceptual future random sample from the distribution $f$, which plays the role of a “validation” set used to estimate the relative KL information, and is assumed to be independent of the data $y_{n_j}$. The expectations $\mathrm{E}_{x}$ and $\mathrm{E}_{y_{n_j}}$ are taken with respect to the true model, $f$, over the random samples $x$ and $y_{n_j}$, respectively. When multiplied by 2 and by $n_j$, this quantity is estimated by $\mathrm{AIC}_j = -2\log L_j(\hat{\theta}_j) + 2k_j$, the AIC score for the candidate model $m_j$. Thus, the final criterion obtained for model $m_j$ is
$$ \frac{\mathrm{AIC}_j}{n_j} \;=\; \frac{-2\log L_j(\hat{\theta}_j) + 2k_j}{n_j}. $$
The best model chosen by this method is the one that minimizes the criterion. A more detailed derivation of the AIC appears in Appendix A.
2.2. Normalized BIC
The BIC is $-2\log L(\hat{\theta}) + k\log(n)$, where $k$ is the number of parameters in the model and $n$ is the number of observations. The BIC is a Bayesian method for selecting the model that maximizes the a posteriori probability of the model given the data. For all candidate models, $m_j$, we calculate the posterior probability and select the model with the greatest value. According to Bayes’ theorem, the posterior probability of model $m_j$, given data comprised of $n_j$ observations, is
$$ P(m_j \mid y_{n_j}) \;=\; \frac{P(m_j)\,P(y_{n_j} \mid m_j)}{P(y_{n_j})}, \qquad P(y_{n_j} \mid m_j) \;=\; \int L_j(\theta_j)\,\pi(\theta_j \mid m_j)\,d\theta_j, \tag{1} $$
where $P(m_j)$ is the prior probability of model $m_j$, $P(y_{n_j})$ is the unconditional likelihood of the data comprised of $n_j$ observations, $\pi(\theta_j \mid m_j)$ is the prior density of $\theta_j$ given model $j$, and $P(y_{n_j} \mid m_j)$ is the marginal probability of the data for model $j$.
In case all prior probabilities are equal, i.e., $P(m_j) = P(m_l)$ for all $j, l$, and $n_j = n$ for all $j$, the critical quantity to be approximated is $P(y_{n} \mid m_j)$, because all other expressions in (1) are the same for all models. Thus, by maximizing this term the posterior probability is maximized as well. In this standard case, BIC is obtained as
$$ \mathrm{BIC}_j \;=\; -2\log L_j(\hat{\theta}_j) + k_j\log(n) \;\approx\; -2\log P(y_{n} \mid m_j). \tag{2} $$
The best model chosen by the method is the one that minimizes the criterion (i.e., maximizes the posterior probability). For a more detailed derivation of BIC see
Appendix B. Similarly, in the case of different sample sizes, a “naive” BIC score can be calculated,
$$ \mathrm{BIC}_j \;=\; -2\log L_j(\hat{\theta}_j) + k_j\log(n_j); $$
however, now the denominator $P(y_{n_j})$ in (1) cannot be ignored. To maximize the posterior probability we need to maximize the quantity
$$ \frac{P(y_{n_j} \mid m_j)}{P(y_{n_j})}. \tag{3} $$
We develop the log denominator of expression (3) as follows:
Assume that the unconditional density of the data factorizes as $P(y_{n_j}) = \prod_{i=1}^{n_j} g(y_i)$, where $g$ is the marginal density of a single observation; using the law of large numbers yields
$$ \frac{1}{n_j}\log P(y_{n_j}) \;=\; \frac{1}{n_j}\sum_{i=1}^{n_j}\log g(y_i) \;\longrightarrow\; \mathrm{E}\big[\log g(y)\big] \;=\; -h, $$
and the log unconditional likelihood converges to
$$ \log P(y_{n_j}) \;\approx\; -n_j\,h. \tag{4} $$
When comparing two models, $m_j$ and $m_l$, we need to check whether
$$ \frac{P(y_{n_j} \mid m_j)}{P(y_{n_j})} \;>\; \frac{P(y_{n_l} \mid m_l)}{P(y_{n_l})}, $$
which is equivalent to
$$ \log P(y_{n_j} \mid m_j) - \log P(y_{n_j}) \;>\; \log P(y_{n_l} \mid m_l) - \log P(y_{n_l}), $$
and from (2)–(4),
$$ \mathrm{BIC}_j - \mathrm{BIC}_l \;<\; 2\,(n_j - n_l)\,h. $$
Instead of comparing each pair of models, it is possible to calculate the result for each model, $m_j$, by comparing it to the fixed model that consists of an intercept only, $m_0$. Since $m_0$ uses all observations, the maximal sample size among the candidate models is that of the constant model, $n_0$. We find that in case $n_j = n_0$, the above reduces to
$$ \mathrm{BIC}_j - \mathrm{BIC}_0 \;<\; 0, $$
and in case $n_j < n_0$, the above becomes
$$ \frac{\mathrm{BIC}_j - \mathrm{BIC}_0}{\,n_j - n_0\,} \;>\; 2h. $$
Although the value of $h$ is unknown, we can estimate it. (The scalar parameter $h$ is known as the entropy of the distribution $g$. Estimation of a scalar is generally much simpler than estimation of a distribution, and thus estimating $h$ is easier than estimating $g$. However, here we do not even require an estimate of $h$, so we do not discuss it further; this is left for future developments.) Furthermore, as a simple substitute, we can optimize the left-hand side and select the model with the best ratio.
The final normalized BIC is calculated as follows for each model, $m_j$: in the case $n_j = n_0$, the score will be $\mathrm{BIC}_j - \mathrm{BIC}_0$; otherwise, $n_j < n_0$ and the score will be $(\mathrm{BIC}_j - \mathrm{BIC}_0)/(n_0 - n_j)$. The best model will be the one with the lowest score.
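For completeness, the standard large-sample (Laplace) approximation of the marginal likelihood, which is the step behind (2) and the naive BIC above (cf. Appendix B), can be written as
$$ \log P(y_{n_j} \mid m_j) \;=\; \log \int L_j(\theta_j)\,\pi(\theta_j \mid m_j)\,d\theta_j \;\approx\; \log L_j(\hat{\theta}_j) \;-\; \frac{k_j}{2}\log(n_j) \;+\; O(1), $$
so that $-2\log P(y_{n_j} \mid m_j) \approx \mathrm{BIC}_j$, which is the approximation used throughout this section.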
3. Statistical Efficiency and Computational Efficiency
The methods presented in Section 2 introduce a benefit in terms of statistical efficiency. First, they are statistically efficient in the sense that they use all of the cases that are completely observed for a given candidate model. Unlike complete-case analysis, where every candidate model uses only the cases that are completely observed across the whole dataset, the newly proposed methods use a larger fraction of the original data, so the model selection procedure is more efficient. Second, the new methods avoid the addition of “noise” and unnecessary variance due to imputation.
In addition to being advantageous in terms of statistical efficiency, the normalized information criteria are also advantageous in terms of computational efficiency (see also [27]). They do not require a procedure for filling in missing values, as imputation methods do, and moreover the model selection procedure processes less data for each candidate model.
3.1. Computational Efficiency
In this section, we compare the computational efficiency of model selection using the normalized-information-criterion methods with that of single-imputation methods; we find that, under an MCAR assumption, imputation, which works with a complete dataset after filling in the missing data, can be exponentially less efficient.
Consider the following crude toy example of “all subset regression” (exhaustively comparing all possible models in the candidate set), and examine how much data are processed by the model-selection procedures. Assume a global model that contains $d$ variables and $n$ observations, and that the goal is to choose the best model among a set of candidate models, $\mathcal{M}$. This set consists of the sub-models of the global model formed by all possible combinations of the variables, so the total number of models in the candidate set is $2^d$. The number of models in $\mathcal{M}$ that contain exactly $j$ variables, where $1 \le j \le d$, is $\binom{d}{j}$. Each model $m$ containing $j$ variables can be represented by an $n \times j$ rectangular data matrix. These are the data that model $m$ receives, so when missing values do not occur, the “data size” for model $m$ with $j$ variables is exactly the product of the number of variables and the number of observations: $j \cdot n$.
Under MCAR, each entry is missing with probability $p$, independently for all entries. The probability that a single observation (i.e., a row) with $j$ variables is completely observed is $(1-p)^j$, and the number of completely observed observations out of the $n$ possible observations is binomially distributed with parameters $n$ and $(1-p)^j$; therefore, the expected number of completely observed observations is $n(1-p)^j$. The expected “data size” of model $m$ with $j$ variables in the case of MCAR is the expected number of complete observations multiplied by the number of variables: $j \cdot n(1-p)^j$.
In order to compare the computational efficiency of the two approaches, we calculate the “size of the data” each method processes during the model selection procedure; i.e., the “data size” each model $m$ receives, summed over all candidate models under consideration in the procedure. This size is a function of the number of observations in the global model, $n$, the number of variables in the global model, $d$, and the missingness probability per entry of the data, $p$.
In order to calculate the computational complexity of model selection with single imputation, we set aside the fill-in procedure itself and calculate as if the missingness probability were 0 (due to the completion of the missing values). Thus, each model is based on exactly the same number of observations, $n$. As mentioned above, the number of models in $\mathcal{M}$ that contain $j$ variables is $\binom{d}{j}$, and the data size of each is the product $j \cdot n$. Using Newton’s binomial formula, we obtain
$$ \sum_{j=1}^{d}\binom{d}{j}\,j\,n \;=\; n\,d\,2^{\,d-1}. \tag{5} $$
To find the computational complexity of the normalized-information-criterion methods, a similar calculation is made, but this time each value is observed with probability $1-p$. Again, $\binom{d}{j}$ is the number of models in $\mathcal{M}$ that contain $j$ variables, and the expected data size of each is the product mentioned before, $j \cdot n(1-p)^j$. Using Newton’s binomial formula (after differentiating with respect to $p$), we obtain
$$ \sum_{j=1}^{d}\binom{d}{j}\,j\,n\,(1-p)^{j} \;=\; n\,d\,(1-p)\,(2-p)^{\,d-1}. \tag{6} $$
Calculating the computational efficiency ratio between the single-imputation methods (5) and the normalized-information-criterion methods (6) yields
$$ \frac{n\,d\,2^{\,d-1}}{n\,d\,(1-p)\,(2-p)^{\,d-1}} \;=\; \frac{1}{1-p}\left(\frac{2}{2-p}\right)^{d-1}. $$
Hence, the improvement in computational efficiency increases exponentially as the number of variables increases; i.e., using normalized information criteria significantly reduces computational complexity, compared to single imputation.
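As a quick numerical check of this ratio, here is a small R illustration under the toy-example assumptions above (the particular values of $d$, $n$, and $p$ are arbitrary):

```r
# Total data processed by all-subset selection, with and without per-entry MCAR missingness
d <- 10; n <- 400; p <- 0.2
j <- 1:d

size_imputation <- sum(choose(d, j) * j * n)              # Equation (5): n * d * 2^(d - 1)
size_normalized <- sum(choose(d, j) * j * n * (1 - p)^j)  # Equation (6): n * d * (1 - p) * (2 - p)^(d - 1)

c(brute_force = size_imputation / size_normalized,
  closed_form = (1 / (1 - p)) * (2 / (2 - p))^(d - 1))    # the two ratios agree (about 3.23 here)
```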
A similar simple calculation for other scenarios, such as forward selection, would yield results that might be less overwhelming but substantial nonetheless. Indeed, for forward selection the ratio of computational efficiency increases quadratically (though this is more nuanced, and it depends also on the number of variables in the chosen model).
3.2. Increasing the Statistical Efficiency
The new methods are statistically efficient, since they use more observations than complete-case analysis and do not add noise as imputation methods do. However, they do not use all of the available information for each model, as opposed to imputation methods, which use all of the observed data. For example, suppose that, in a regression model with missing values in the design matrix, the model selected by normalized AIC contains two particular variables. Figure 2a illustrates the data available for the selected model, completely ignoring the other variables. In calculating the normalized AIC, only observations 3, 6, 9, and 10 are used, while the rest of the available values, marked in white, are not used.
To address this inefficiency of the preliminary normalized AIC procedure discussed in Section 2.1, we propose three additional methods that extend the normalized AIC, called post-selection imputation, Akaike sub-model averaging, and minimum variance averaging. These methods use all of the observed information relevant to the model chosen by the normalized AIC method in order to increase its efficiency. All three methods use only the data of the selected model and completely ignore variables that are not part of the chosen model.
3.2.1. Post-Selection Imputation
Imputation is performed only after the removal of the “unwanted” columns, identified using our selection procedure. For instance, in Figure 2a, the method will first fill in the missing values, marked in black, and only then estimate the parameters.
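A minimal R sketch of this step, assuming the normalized-AIC stage has already returned a set of selected covariates; `dat`, the response name `y`, and `selected` are hypothetical placeholders, and the mice package is used as in our simulations:

```r
library(mice)

selected <- c("x1", "x2")                        # placeholder: covariates chosen by normalized AIC
sub_dat  <- dat[, c("y", selected)]              # drop the "unwanted" columns first

imp  <- mice(sub_dat, m = 5, printFlag = FALSE)  # impute only the retained columns
fits <- with(imp, lm(y ~ x1 + x2))               # re-estimate the selected model on each imputation
pool(fits)                                       # combine the estimates by Rubin's rules
```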
3.2.2. Akaike Sub-Model Averaging
After identifying the “best” model, it is possible to make use of its nested models, which bring in additional data. For example, in Figure 2a, the model has three sub-models: first, a model containing only the first of the two variables; second, a model containing only the second variable; and third, the model containing both variables (the full selected model).
This procedure (as well as the procedure in Section 3.2.3) includes all observations relevant to each sub-model, based on the values that do exist in those observations. As an illustration, Figure 2 shows how the observations are associated with their respective models. In Figure 2b, all observations which contain values of the first variable are associated with the first model, which contains only that variable; those observations (3, 5, 6, 7, 9, and 10) are highlighted. Similarly, Figure 2c highlights the observations (2, 3, 4, 6, 8, 9, and 10) associated with the second model, which contains only the second variable. Finally, Figure 2d highlights the observations (3, 6, 9, and 10) belonging to the third model—the full model. Note that the best (full) model has the minimal sample size.
We will use model averaging, determining the associated weights of the sub-models akin to Akaike weights [2,28,29]. We take a similar approach to that of Section 2.2, finding the “posterior” distribution of each sub-model and approximating the log posterior odds relative to the best model. Further, we assume that the prior probabilities are the same, but the denominator probabilities are not, and therefore we do not ignore the denominator.
The posterior distribution of model $j$ is
$$ P(m_j \mid y_{n_j}) \;=\; \frac{P(m_j)\,P(y_{n_j} \mid m_j)}{P(y_{n_j})}, \tag{7} $$
where $P(m_j)$ is the prior probability of model $m_j$, $P(y_{n_j})$ is the unconditional likelihood of the data comprised of $n_j$ observations, and $P(y_{n_j} \mid m_j)$ is the marginal probability of the data for model $j$.
Akaike recognized the quantity $\exp\{-\frac{1}{2}\mathrm{AIC}_j\}$ as the “likelihood” of the model determined by the method of maximum likelihood [30,31,32], so
$$ P(y_{n_j} \mid m_j) \;\propto\; \exp\big\{-\tfrac{1}{2}\mathrm{AIC}_j\big\}. \tag{8} $$
We develop the denominator of expression (7) exactly as in Equation (4), and obtain
$$ P(y_{n_j}) \;\approx\; \exp\{-n_j\,h\}. \tag{9} $$
In case all prior probabilities are equal, from (7)–(9) we obtain the quantity
$$ P(m_j \mid y_{n_j}) \;\propto\; \exp\big\{-\tfrac{1}{2}\mathrm{AIC}_j + n_j\,h\big\}. \tag{10} $$
In order to use (10) as weights we could solve for $h$; however, there is a simpler albeit cruder approach. Let $m^{*}$ be the model chosen by the normalized AIC procedure, i.e., the model that minimizes the normalized score $\mathrm{AIC}_j/n_j$; denote its AIC score and sample size by $\mathrm{AIC}_{*}$ and $n_{*}$, and let $t$ be the number of existing sub-models of $m^{*}$ in the original dataset.
Similar to Section 2.2, let us define the relative difference
$$ \Delta_j \;=\; \begin{cases} \dfrac{\mathrm{AIC}_j - \mathrm{AIC}_{*}}{\,n_j - n_{*}\,}, & n_j > n_{*},\\[1.5ex] \mathrm{AIC}_j - \mathrm{AIC}_{*}, & n_j = n_{*}, \end{cases} $$
where the second case yields the same relative difference as the original Akaike weights. The sub-model weights, $w_j$, are obtained by finding the relative difference for each model, $m_j$, and substituting it in the following equation:
$$ w_j \;=\; \frac{\exp\{-\Delta_j/2\}}{\sum_{l=1}^{t}\exp\{-\Delta_l/2\}}. $$
The proposed weights have similar characteristics to the original Akaike weights. We strive to minimize the proposed relative difference, so “closer” models (with lower AIC) have a higher weight and vice versa. In addition, the relative difference is always non-negative, $\Delta_j \ge 0$, and is exactly zero for the best model. Another feature, which corresponds to missing data, is that greater weights are given to models with larger sample sizes, which increases the statistical power.
The weight $w_j$ has the heuristic interpretation of the probability that $m_j$ is the best model, given the data and the set of sub-models. The final prediction is obtained as a linear combination of all sub-model estimates with their associated weights,
$$ \hat{y} \;=\; \sum_{j=1}^{t} w_j\, \hat{y}_j. \tag{11} $$
This is further examined below via simulations.
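The following R sketch implements this averaging step using the weight form reconstructed above (exponentiated half-differences, normalized to sum to one); `aic`, `n_obs`, `best`, and `preds` are hypothetical inputs holding, for the sub-models of the selected model, their AIC scores, their sample sizes, the index of the model chosen by normalized AIC, and their predictions on a common test set:

```r
# Akaike-type sub-model averaging (sketch)
# aic, n_obs : AIC score and sample size of each sub-model of the selected model
# best       : index of the model chosen by the normalized-AIC step
# preds      : matrix of test-set predictions, one column per sub-model
akaike_submodel_average <- function(aic, n_obs, preds, best) {
  delta <- ifelse(n_obs == n_obs[best],
                  aic - aic[best],                            # same sample size: usual Akaike difference
                  (aic - aic[best]) / (n_obs - n_obs[best]))  # larger sample size: per-observation difference
  w <- exp(-delta / 2)
  w <- w / sum(w)                  # weights sum to one; the best model gets the largest weight
  as.vector(preds %*% w)           # weighted combination of the sub-model predictions, as in (11)
}
```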
3.2.3. Minimal Variance Sub-Model Averaging
The method presented in Section 3.2.2 might ignore the correlation that exists between the different models. We propose another method for sub-model averaging that takes this correlation into account. The motivation for this method is the same as for the previous one—making use of the nested models within the “best” model, which utilizes additional data. As mentioned before, the final prediction is obtained by a linear combination of all sub-model estimates with their associated weights, as in (11).
The variance of $\hat{y}$ is
$$ \mathrm{Var}(\hat{y}) \;=\; w^{\top}\,\Sigma\,w, $$
where $w$ is the $t \times 1$ weights vector, $\Sigma$ is the $t \times t$ estimated generalization-error (mean-squared error; MSE) covariance matrix of the models, and $t$ is the number of existing sub-models of the model selected by the normalized AIC in the original dataset.
We will use model averaging by determining the appropriate weights, proportional to the inverse of the generalization MSE covariance matrix of the models. The weights vector is obtained by minimizing the variance using Lagrange multipliers with the constraint $U^{\top}w = \sum_{j=1}^{t} w_j = 1$, yielding
$$ w \;=\; \frac{\Sigma^{-1}U}{U^{\top}\Sigma^{-1}U}, $$
where $U$ is a $t \times 1$ vector of 1’s. These weights are the optimal minimum-variance weights and thus minimize the MSE. The procedure assigns large weights to models with small variance and small weights to models with large variance, taking into account the correlation between the models.
In order to obtain the MSE covariance matrix, we use the well-known bootstrap method, introduced by [33] and extensively described in [34]. Our procedure starts by resampling observations independently with replacement from the selected dataset, until the sample size equals that of the original dataset, $n$. These samples serve as bootstrap training datasets. The probability that a given observation does not appear in a bootstrap sample is $(1 - 1/n)^n \to e^{-1} \approx 0.368$ as $n \to \infty$, so the expected number of observations not chosen is approximately $0.368\,n$, and these observations serve as a validation dataset.
Each bootstrap sample $i$ ($i = 1, \ldots, B$) yields an estimate of the $t \times 1$ vector of generalization errors, whose component $j$ ($j = 1, \ldots, t$) is calculated as
$$ \widehat{\mathrm{MSE}}_{ij} \;=\; \frac{1}{|V_i|}\sum_{l \in V_i}\big(y_l - \hat{y}_l^{(i,j)}\big)^2, $$
where $V_i$ is the validation set of bootstrap sample $i$, $\hat{y}_l^{(i,j)}$ is the prediction of model $j$ fitted on bootstrap sample $i$ for observation $l$, $i$ is the bootstrap sample index, and $j$ is the model index. This process is repeated $B$ times, resulting in $B$ bootstrap samples and $B$ estimators, which are combined into the estimated $B \times t$ MSE matrix. This matrix is then used to compute the $t \times t$ variance–covariance matrix of the models, $\Sigma$.
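An R sketch of the whole procedure, under the assumptions above; `fit_funs` is a hypothetical list with one function per sub-model, each taking a training data frame and a validation data frame and returning predictions for the validation rows:

```r
# Minimum-variance sub-model weights via bootstrap out-of-bag MSE (sketch)
min_var_weights <- function(dat, y_name, fit_funs, B = 100) {
  n <- nrow(dat); t_models <- length(fit_funs)
  mse <- matrix(NA_real_, nrow = B, ncol = t_models)
  for (i in seq_len(B)) {
    idx   <- sample(n, n, replace = TRUE)       # bootstrap training sample of size n
    oob   <- setdiff(seq_len(n), idx)           # ~36.8% of rows left out: validation set
    train <- dat[idx, ]; valid <- dat[oob, ]
    for (j in seq_len(t_models)) {              # out-of-bag MSE of each sub-model
      pred      <- fit_funs[[j]](train, valid)
      mse[i, j] <- mean((valid[[y_name]] - pred)^2, na.rm = TRUE)
    }
  }
  Sigma  <- cov(mse)                            # t x t MSE covariance matrix of the models
  Sinv_U <- solve(Sigma, rep(1, t_models))      # Sigma^{-1} U
  Sinv_U / sum(Sinv_U)                          # w = Sigma^{-1} U / (U' Sigma^{-1} U)
}
```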
4. Simulation Studies
In order to investigate the statistical properties of the normalized information criteria for model selection with missing data, we performed a series of simulations using R.
4.1. Design
The data were simulated in the form of a linear regression model, $y = X\beta + \varepsilon$, where $y$ is the vector of dependent variables, $X$ is the design matrix, and $\varepsilon$ is the noise vector, whose entries are i.i.d. and normally distributed with mean zero and variance 2.5, i.e., $\varepsilon_i \sim N(0, 2.5)$. The design matrix is composed of $d$ explanatory variables, which were generated from a multivariate normal distribution with a symmetric correlation matrix with equal off-diagonal elements, denoted $\rho$, where three different off-diagonal values were considered. The coefficients were set so that the “real” model comprises four of the variables. The response vector, $y$, was obtained by the aforementioned linear regression. A total of 400 observations were generated, and for the AIC methods outlined in Section 2.1 and Section 3.2, an additional 200 observations were treated as “test” data for model validation, with the remaining observations being “training” data, used to fit the models.
We created missing data by omitting some of the values from the design matrix (the training design matrix for the AIC methods). We explored the MCAR missingness mechanism with two different missingness probabilities per observation, yielding two corresponding proportions of complete cases. With this simulation configuration there are six different designs, one for each combination of the correlation parameter ($\rho$) and the missingness probability ($p$). For each of these six simulation configurations, 100 different datasets with missing values were simulated. The candidate models, $\mathcal{M}$, are all-subset regressions, so there are $2^d$ possible models.
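For illustration, a sketch of this data-generating process in R; the number of variables, the nonzero coefficients, the correlation $\rho$, and the missingness probability $p$ shown here are placeholders rather than the exact values used in the paper, and entry-wise MCAR deletion is used as a simple stand-in for the missingness mechanism:

```r
library(MASS)

simulate_dataset <- function(n = 400, d = 10, rho = 0.3, p = 0.1,
                             beta = c(rep(1, 4), rep(0, 6)), sigma2 = 2.5) {
  R <- matrix(rho, d, d); diag(R) <- 1            # equicorrelated correlation matrix
  X <- mvrnorm(n, mu = rep(0, d), Sigma = R)      # multivariate-normal design matrix
  y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))   # linear model with N(0, 2.5) noise
  X[matrix(runif(n * d) < p, n, d)] <- NA         # MCAR: each entry missing with probability p
  data.frame(y = as.vector(y), X)
}

# dat <- simulate_dataset()
```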
4.2. Comparing Model Selection Results between AIC Methods
The normalized AIC method calculates, for each candidate model $m_j$, the normalized AIC score, $\mathrm{AIC}_j/n_j$, according to the number of observations that the model is based on, $n_j$. The model with the minimal score is selected. The methods that extend the normalized AIC are based on the results of the normalized AIC. These methods were compared to the complete-case, single-imputation, and multiple-imputation methods. Single imputation and multiple imputation were performed using the MICE package [35], with one and five imputations, respectively. For all three of these methods, the regular AIC score was calculated for all possible models.
Each method was applied to each of the 100 datasets, where for each dataset the generalization error (MSE) of the chosen model was calculated over a test set as $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}(y_i - \hat{y}_i)^2$. The MSE was then averaged over the 100 simulated datasets. Overall, for each method, six average MSE values, one per configuration, were calculated for comparison purposes. To test the statistical significance of the difference between the competing average scores, a two-tailed paired t-test was performed, with significance level 0.05.
4.2.1. Normalized AIC
Figure 3 presents the comparisons between the normalized AIC method and the other three methods. Each plot shows the MSE result averaged over 100 simulation datasets for the six different designs. Full circle markings indicate that the result is significant according to the paired t-test, and empty markings indicate otherwise. Figure 3a shows the comparison of normalized AIC to complete cases, and Figure 3b shows the comparison of normalized AIC to single imputation. It can be seen that for both comparisons, and for all the different configurations, the normalized AIC method yields better results and is statistically more efficient. For almost all the comparisons made, the results are significant (with one exception in Figure 3b).
On the other hand, in Figure 3c, which compares normalized AIC to multiple imputation, neither method achieves better results across all configurations, and none of the differences are significant; therefore, the results are inconclusive and the two methods are fairly on par. As expected, all methods yield better results for small missingness probabilities and lower correlations.
4.2.2. Extensions of Normalized AIC
In this section, we examine the methods that build on the normalized AIC approach. The post-selection imputation, Akaike sub-model averaging, and minimum variance averaging methods make use of only the variables chosen by the normalized-AIC method. For the chosen model, each method re-estimates the predictions. In post-selection imputation, multiple imputation is performed with the MICE package, and five imputations are made. In minimum variance averaging, $B$ bootstrap samples are made.
Figure 4, Figure 5 and Figure 6 compare the extension methods to normalized AIC, to multiple imputation, and to each other, respectively. They summarize the average generalization MSE score for all six different designs. Here, too, full circles indicate statistically significant differences.
Figure 4 compares normalized AIC with the methods extending it, and examines whether these methods succeed in improving its statistical efficiency. Figure 4a, comparing post-selection imputation to the normalized AIC method, shows that post-selection imputation yields better results for all the different designs, significantly so for higher missingness probabilities and higher correlations. Further, Figure 4b,c, which compare Akaike sub-model averaging and minimum variance averaging to normalized AIC, respectively, show that both extending methods yield significantly better results for all six configurations.
Figure 5 compares multiple imputation and the three extensions of the normalized AIC method: post-selection imputation (Figure 5a), Akaike sub-model averaging (Figure 5b), and minimum variance averaging (Figure 5c).
It can be seen that all three extension methods achieved lower MSE than multiple imputation, for all the different configurations. Post-selection imputation yields significant results for part of the configurations, Akaike sub-model averaging yields significant results for all but one configuration, and minimum variance averaging is statistically significantly more efficient for all six configurations.
Figure 6 compares the three methods extending the normalized AIC to each other. Figure 6a, comparing Akaike sub-model averaging to post-selection imputation, shows that Akaike sub-model averaging yields better results for all the different configurations, significantly so in only two of them. Figure 6b, comparing minimum variance averaging to post-selection imputation, and Figure 6c, comparing minimum variance averaging to Akaike sub-model averaging, show that minimum variance averaging succeeds in minimizing the MSE in both comparisons and for all the different configurations, significantly in all but one case (in Figure 6c).
Detailed numerical results for each method are provided in Appendix C.
4.3. Comparing Model Selection Results between BIC Methods
The normalized BIC method calculates, for each candidate model $m_j$, the normalized BIC score according to the number of observations that the model is based on, $n_j$. The score is $(\mathrm{BIC}_j - \mathrm{BIC}_0)/(n_0 - n_j)$ if the sample size of model $j$ is not completely observed ($n_j < n_0$), and $\mathrm{BIC}_j - \mathrm{BIC}_0$ otherwise, where model 0 is the constant model. The model with the minimum score is selected. Similar to the AIC methods, the normalized BIC was compared to the complete-case, single-imputation, and multiple-imputation methods. For all three of these methods, the regular BIC score was calculated.
In addition, we compared normalized BIC to a fourth method. We implemented a variation of the ITS method described in Section 1, which consists of three similar steps but uses different algorithms than those described in the original method in [12]. The first step is to perform multiple imputation, which is achieved using the MICE package, with $L$ imputed datasets. In the second step, a Bayesian variable selection algorithm is implemented for each dataset $l$ separately, where $l = 1, \ldots, L$. We implemented this using the BMS package [36]. Then, we calculated the posterior inclusion probability of each variable $x_i$ in each imputed dataset, $\pi_l(x_i)$. In the third and final step, we combined the results using Rubin’s rules [7] to obtain the average marginal posterior probability of $x_i$ across the $L$ multiple imputation datasets, defined as $\bar{\pi}(x_i) = \frac{1}{L}\sum_{l=1}^{L}\pi_l(x_i)$. The final variable selection is determined by whether $\bar{\pi}(x_i)$ exceeds the threshold of 0.5.
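A sketch of this ITS variation in R, using the mice and BMS packages; the data frame `dat` (with the response in the first column, as bms() expects), the number of imputations `L`, and the 0.5 threshold follow the description above, while the BMS prior settings are illustrative defaults:

```r
library(mice)
library(BMS)

its_selection <- function(dat, L = 5, threshold = 0.5) {
  imp <- mice(dat, m = L, printFlag = FALSE)           # step 1: multiple imputation
  pip <- sapply(seq_len(L), function(l) {
    completed <- complete(imp, l)                      # l-th imputed dataset
    fit <- bms(completed, mprior = "uniform",          # step 2: Bayesian variable selection
               g = "UIP", user.int = FALSE)            #         (Bayesian model averaging)
    coef(fit, order.by.pip = FALSE)[, "PIP"]           # posterior inclusion probabilities
  })
  avg_pip <- rowMeans(pip)                             # step 3: average PIP across imputations
  names(avg_pip)[avg_pip > threshold]                  # keep variables whose average PIP > 0.5
}
```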
The BIC methods were tested similarly to the AIC methods, only here, for each dataset, the “incorrectly omitted” and “incorrectly included” scores were calculated for the chosen model and averaged over the 100 simulated datasets. “Incorrectly omitted” is the number of important variables that were not selected in the chosen model (i.e., how many of the four truly important variables were not selected), and “incorrectly included” is the number of unimportant variables selected in the chosen model (i.e., how many of the six unimportant variables were selected).
In Figure 7, we compared the normalized BIC method to the complete-case and ITS methods. Figure 7a provides the “incorrectly included” score for each configuration and method. The results show that, while the other methods tend to include variables that are not important in the chosen model, the method we developed, normalized BIC, included only variables that were important. The results presented in Figure 7b summarize the “incorrectly omitted” score. It can be seen that none of the methods tends to exclude important variables, although when the correlation is large the normalized BIC method does tend to omit some.
The results of the single and multiple imputation methods were inferior to those of the other methods, so we do not present them here. Detailed numerical results for each method are provided in Appendix D.
5. Discussion
In this paper, we presented a new perspective on popular information criteria, specifically AIC and BIC, and developed a theory that adapts them to data containing missing values. The traditional information criteria are not applicable in cases where data are missing, as they are based on the assumption that all models have the same constant sample size. The methods presented here address this issue by making the slight modifications to the existing theory required once this assumption is dropped, using the different sample size of each model to normalize its score. Thus, the score of each model is per observation, which makes the scores comparable for model selection.
The methods we have introduced are intuitive, easy to use, and can be easily integrated into statistical software to help in the task of model selection. These methods do not require a fully complete dataset or modification of the original dataset before analysis: they do not remove observations in advance (as in the complete-case method), so more information is utilized, and they do not require filling in data (as in imputation methods), with its additional uncertainty. In addition, the normalized information criteria require fewer computational resources than imputation methods and are more computationally efficient in the model-selection process. In particular, for “all subset” selection the computational efficiency is exponentially better as the number of variables increases; similarly, for “forward” selection the efficiency is quadratically better as the number of variables increases (with a dependency on the number of variables finally selected).
In the simulations, we find that the new methods typically outperform the common alternative methods, with the exception of normalized AIC versus multiple imputation, which were roughly on par. It seems that the trade-off between “adding noise” (with multiple imputation) and not using all the available data (with normalized AIC) is balanced in this situation. However, the statistical efficiency of the vanilla normalized AIC can easily be improved by an additional step that does not leave out any relevant data. In particular, the three additional methods we suggest here, post-selection imputation, Akaike sub-model averaging, and minimum variance averaging, reduced the MSE significantly and outperformed the rest (the latter in particular). Further examination of the computational complexity of these methods is indeed warranted; however, multiple imputation would nevertheless come out as more demanding, bearing in mind that it also requires an actual imputation stage (which we ignored here).
In this work we focused on a setting akin to “classic” statistics, i.e., a low-dimensional linear regression model. In future work it would be interesting to broaden the scope, both to more elaborate models and to “high-dimensional” data, where concerns of computational efficiency are much more acute. Finally, the normalization we introduced here focuses on AIC and BIC, but this new perspective could serve as a basis for any likelihood-based information criterion, not only AIC and BIC; this too is the subject of ongoing work.