1. Introduction
The assessment of the relative importance of each variable in a model is a problem of fundamental relevance in statistical modeling. Multiple approaches have been proposed for this purpose (see [1,2,3] and the references contained therein). Popular methods include scaling all variables to have a common standard deviation and then using coefficient estimates from the full model to assess importance. Another familiar technique is to use p-values for the coefficients from the full model to quantify importance. However, neither of these techniques is defensible when determining variable importance in the context of best subset selection. In fact, the relative importance of the variables in the full model may differ substantially from the importance of the corresponding variables in a best subset selection framework. For example, a variable may be statistically significant at the 0.05 level in the full model, yet be excluded from the selected model.
Using the coefficient estimates for standardized variables or their associated p-values to evaluate variable importance for the selected model is equally problematic. In particular, for the variables that remain in the selected model, it is well known that the standard errors of coefficient estimates will be unrealistically small, leading to confidence intervals that are too narrow, inflated test statistics, and deflated p-values [4].
When performing best subset selection, the assessment of variable importance is more challenging. Ideally, any such method would quantify importance based on the variables that tend to be prevalent in the models favored by the criteria used to score the models. A cogent and intuitively appealing method proposed by Efron [4] uses the nonparametric bootstrap method to repeatedly perform best subset selection on bootstrapped datasets and quantifies importance based on the proportion of times each variable is included in the optimal model. However, this approach entails performing best subset selection many times over, which will be computationally expensive even with the advent of branch and bound algorithms that can make best subset selection much faster [5]. A similar approach involves so-called “Akaike weights” [6,7]. With this method, importance is based on the sum of the Akaike weights for all the models that include a specific variable; the most important variables will have larger values of this sum. This approach is very slow because it requires that every possible model be fit. Thus, branch and bound algorithms that speed up the best subset selection process cannot be used. In fact, in the context of best subset selection, any algorithm for evaluating variable importance will likely become much slower as the number of variables increases, and if the algorithm requires fitting all candidate models, it will quickly become computationally prohibitive.
In this paper, we introduce a novel method for determining variable importance in the context of best subset selection that is relatively efficient computationally. Moreover, we believe that the proposed measure provides the most natural quantification of variable importance for best subset selection. An additional benefit of our measure is that valid p-values can be computed based on the parametric bootstrap method. Calculating the p-values, however, can be a slow process. We therefore discuss approaches that can improve computational efficiency.
The remainder of the manuscript is structured as follows. Section 2 presents the proposed variable importance measure along with some relevant theoretical results. Section 3 presents a simulation study to verify the validity of p-values based on the variable importance values. In Section 4, the proposed methods are demonstrated in a modeling application where the goal is to predict percentage body fat based on easily obtained anthropometric measurements. In Appendix A and Appendix B, we explore computational issues and illustrate how the method can be applied in the context of other variable selection algorithms, including heuristic procedures such as forward selection and backward elimination.
2. Variable Importance for Best Subset Selection
2.1. Proposed Methodology
We will focus on variable selection that is based on the following penalized log-likelihood measure:

$$\mathrm{IC}_\lambda(\boldsymbol{\beta}) = -2\,\ell(\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|_0,$$

where $\ell(\boldsymbol{\beta})$ denotes the log-likelihood for a model with regression parameters $\boldsymbol{\beta}$, $\|\boldsymbol{\beta}\|_0$ is the number of non-zero values in $\boldsymbol{\beta}$, and $\lambda$ is the penalty term. The $\boldsymbol{\beta}$ that minimizes this penalized log-likelihood measure is the best subset selection estimator (i.e., the estimator resulting from the selected model from best subset selection). $\lambda$ is commonly pre-specified according to the penalization of an information criterion. Some of the most popular information criteria are the Akaike information criterion (AIC, [8]), where $\lambda = 2$; the Bayesian information criterion (BIC, [9]), where $\lambda = \log(n)$; and the Hannan–Quinn information criterion (HQIC, [10]), where $\lambda = c\,\log(\log(n))$ for a constant $c$ (we select $c = 2$ for our usage of HQIC). The form of $\mathrm{IC}_\lambda(\boldsymbol{\beta})$ represents that of a general likelihood-based information criterion.
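To make this criterion concrete, the following is a minimal base-R sketch of best subset selection by exhaustive search under the penalized log-likelihood above. The function name and arguments are illustrative (not the notation of any particular package), and branch and bound algorithms avoid the full enumeration used here.

```r
## Minimal sketch: score every subset by -2*logLik + lambda*(number of active
## regression coefficients) and keep the minimizer. Only the regression
## coefficients are penalized, following the definition above.
best_subset <- function(y, X, lambda = 2) {  # lambda = 2 corresponds to AIC
  p <- ncol(X)
  best <- list(ic = Inf, active = integer(0))
  for (k in 0:(2^p - 1)) {
    # Decode subset k into the indices of the active (included) columns.
    active <- which(as.integer(intToBits(as.integer(k)))[1:p] == 1)
    fit <- if (length(active) == 0) lm(y ~ 1) else lm(y ~ X[, active, drop = FALSE])
    ic <- -2 * as.numeric(logLik(fit)) + lambda * length(active)
    if (ic < best$ic) best <- list(ic = ic, active = active)
  }
  best  # active set and criterion value of the selected model
}
```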
We propose the following quantity to assess variable importance for the $i$th set of variables:

$$\mathrm{VI}_i(\lambda) = \min_{\boldsymbol{\beta}:\ \beta_j = 0 \ \forall j \in S_i} \mathrm{IC}_\lambda(\boldsymbol{\beta}) \;-\; \min_{\boldsymbol{\beta}:\ \beta_j \neq 0 \ \forall j \in S_i} \mathrm{IC}_\lambda(\boldsymbol{\beta}),$$

where $S_i$ is a set that contains the indices of the variables that are included in the $i$th variable set. Note that in many applications, these variable sets will each only contain a single variable. However, assessing variable importance for a set of variables can be useful when there is a natural way to group together some of the variables. One important instance of this occurs with categorical variables where multiple indicator variables are used to encode the levels of the variable. For ease of exposition, in much of the subsequent presentation, we will often describe the measure as pertaining to one variable as opposed to a variable set.
The variable importance measure is very similar to the traditional likelihood ratio test statistic to compare two models, except that it uses the criterion $\mathrm{IC}_\lambda(\boldsymbol{\beta})$ instead of $-2\,\ell(\boldsymbol{\beta})$, and compares $\mathrm{IC}_\lambda$ for the optimal model that includes a variable (or variable set) of interest to the optimal model that excludes the same variable (or variable set). The measure is bounded below by $-\lambda\,|S_i|$ and is not bounded above. When the measure is negative, this means that the best subset selection estimator for the information criterion does not include the variable(s). Large negative values provide strong evidence that the variable(s) should not be included in the final model. When the quantity is positive, the best subset selection estimator includes the variable(s). Large positive values of this quantity provide strong evidence that the variable(s) should be included in the final model. When this quantity is zero, then the inclusion or exclusion of the variable(s) does not impact the optimal value of the penalized log-likelihood measure.
Based on simulation results, it appears that the following modification of the variable importance measure has a null sampling distribution that is sometimes well approximated by a chi-square distribution with degrees of freedom equal to $|S_i|$:

$$\widetilde{\mathrm{VI}}_i(\lambda) = \mathrm{VI}_i(\lambda) + \lambda\,|S_i|.$$
Thus, this modified variable importance measure could be used as a test statistic to assess the contribution of a particular variable (or variable set). The chi-squared approximation, however, breaks down when the variables are highly correlated. Instead, we can use a simulation-based approach to approximate the null distribution for each test statistic. We suggest the use of the parametric bootstrap. However, when we are calculating the statistic for a given variable, we use the model that contains all of the variables except for the variable(s) of interest to generate the new response variable. This method preserves the correlation between the variables, which facilitates an accurate characterization of the null distribution. The main downside of this approach is that calculating the test statistic once can take some time, so repeatedly calculating the test statistic to approximate a null distribution can be very time consuming.
Calculating p-values based on the modified variable importance measure could be very useful since the traditional method of obtaining p-values after performing best subset selection is well known to be biased and only provides p-values for variables included in the final model. Additionally, using p-values based on the full model is not advisable when best subset selection is employed.
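To make these definitions concrete, the following base-R sketch computes the variable importance measure and its modified version for a variable set via two constrained exhaustive searches (one over models that include the set, one over models that exclude it). The helper names are hypothetical, and a practical implementation would use branch and bound searches rather than full enumeration.

```r
## Best criterion value among subsets that either contain all of S_i
## (include = TRUE) or contain none of S_i (include = FALSE).
constrained_best_ic <- function(y, X, lambda, S_i, include = TRUE) {
  p <- ncol(X)
  best_ic <- Inf
  for (k in 0:(2^p - 1)) {
    active <- which(as.integer(intToBits(as.integer(k)))[1:p] == 1)
    ok <- if (include) all(S_i %in% active) else !any(S_i %in% active)
    if (!ok) next
    fit <- if (length(active) == 0) lm(y ~ 1) else lm(y ~ X[, active, drop = FALSE])
    ic <- -2 * as.numeric(logLik(fit)) + lambda * length(active)
    best_ic <- min(best_ic, ic)
  }
  best_ic
}

variable_importance <- function(y, X, lambda, S_i) {
  vi <- constrained_best_ic(y, X, lambda, S_i, include = FALSE) -
        constrained_best_ic(y, X, lambda, S_i, include = TRUE)
  # The modified measure adds lambda * |S_i| to the raw measure.
  c(VI = vi, modified_VI = vi + lambda * length(S_i))
  # Example call: variable_importance(y, X, lambda = log(length(y)), S_i = 2)  # BIC penalty
}
```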
2.2. Theoretical Results
In this subsection, we will prove that the modified variable importance measures are non-negative.
First, we define the active set $A(\boldsymbol{\beta}) = \{\,j : \beta_j \neq 0\,\}$ as the set that contains the indices of the non-zero coefficients and $A^c(\boldsymbol{\beta}) = \{\,j : \beta_j = 0\,\}$ as the set that contains the indices of the zero coefficients. We also define the following constructs:

$$\hat{\boldsymbol{\beta}}_A = \operatorname*{arg\,max}_{\boldsymbol{\beta}:\ \beta_j = 0 \ \forall j \notin A} \ell(\boldsymbol{\beta}), \qquad \mathrm{IC}_\lambda(A) = -2\,\ell(\hat{\boldsymbol{\beta}}_A) + \lambda\,|A|,$$

where $\hat{\boldsymbol{\beta}}_A$ denotes the maximum likelihood estimator for the model with active set $A$ and $|A|$ denotes the cardinality of $A$.
Lemma 1. For models $A_1$ and $A_2$, such that $A_1 \subseteq A_2$, the following is true:

$$\ell(\hat{\boldsymbol{\beta}}_{A_1}) \leq \ell(\hat{\boldsymbol{\beta}}_{A_2}).$$

Proof. Let $A_1$ define model one and $A_2$ define model two, where $A_1 \subseteq A_2$. Hence, model one is a submodel of model two. Note that since model one is subsumed by model two, we know that $\{\boldsymbol{\beta} : \beta_j = 0 \ \forall j \notin A_1\} \subseteq \{\boldsymbol{\beta} : \beta_j = 0 \ \forall j \notin A_2\}$. We may assert the following:

$$\ell(\hat{\boldsymbol{\beta}}_{A_1}) = \max_{\boldsymbol{\beta}:\ \beta_j = 0 \ \forall j \notin A_1} \ell(\boldsymbol{\beta}) \;\leq\; \max_{\boldsymbol{\beta}:\ \beta_j = 0 \ \forall j \notin A_2} \ell(\boldsymbol{\beta}) = \ell(\hat{\boldsymbol{\beta}}_{A_2}). \qquad \square$$
Theorem 1. The modified variable importance measures are non-negative for any $\boldsymbol{X}$, $\boldsymbol{y}$, $\lambda$, and $i$.

Proof. Let $A_0$ be the active set for the optimal model that does not contain the $i$th variable set and let $A_1 = A_0 \cup S_i$ be the same active set, but with the $i$th variable set included. Also, let $A^*$ be the active set for the optimal model that contains the $i$th variable set. $A_0 \subseteq A_1$, so by Lemma 1 we have the following:

$$\mathrm{IC}_\lambda(A_1) = -2\,\ell(\hat{\boldsymbol{\beta}}_{A_1}) + \lambda\,|A_1| \;\leq\; -2\,\ell(\hat{\boldsymbol{\beta}}_{A_0}) + \lambda\big(|A_0| + |S_i|\big) = \mathrm{IC}_\lambda(A_0) + \lambda\,|S_i|.$$

We also have $\mathrm{IC}_\lambda(A^*) \leq \mathrm{IC}_\lambda(A_1)$, since $A^*$ is the active set of the model with the minimum penalized log-likelihood measure that includes the $i$th variable set and $A_1$ also includes the $i$th variable set. Hence, we can establish that

$$\mathrm{IC}_\lambda(A^*) \;\leq\; \mathrm{IC}_\lambda(A_0) + \lambda\,|S_i|.$$

The result then follows from the definition of the modified variable importance measure. □
3. p-Value Simulations
In this section, we provide a set of simulation results that validate the parametric bootstrap approach to calculating p-values based on the modified variable importance measure. These results also show that the chi-square approximation cannot be applied indiscriminately; it is especially problematic when there is strong correlation between the variables and when the penalty term grows with the sample size.
The setup for the simulations used in this section is as follows. First, we generate various linear regression models by initially defining a regression parameter vector $\boldsymbol{\beta}$ for six covariates. To create null effects for each of our models, we randomly set three of these six regression coefficients to zero. With this approach, the relative importance of the three retained non-null effects will vary from one model to the next.
For the generation of the covariate vectors $\boldsymbol{x}$, we define $\boldsymbol{\Sigma} = (1 - \rho)\,\boldsymbol{I} + \rho\,\boldsymbol{J}$, where $\boldsymbol{J}$ is a 6 by 6 matrix of all ones. We then generate the vectors as $\boldsymbol{x} \sim N(\boldsymbol{0}, \boldsymbol{\Sigma})$. With $\boldsymbol{X}$, we generate the outcome variables as $\boldsymbol{y} \sim N(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I})$. We determine the variance $\sigma^2$ by considering the relation $\mathrm{SNR} = \boldsymbol{\beta}^\top \boldsymbol{\Sigma}\,\boldsymbol{\beta}/\sigma^2$, where SNR denotes the signal-to-noise ratio, commonly defined as $\mathrm{Var}(\boldsymbol{x}^\top\boldsymbol{\beta})/\sigma^2$ for linear regression.
To set the variance $\sigma^2$, we use an SNR of about 0.4286, which corresponds to a coefficient of determination of $R^2 = 0.3$. We considered multiple different values of the SNR, but we decided on this value because it proved to be the most problematic for our methods. This SNR therefore illustrates the efficacy of our method in a “worst-case” scenario.
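As an illustration, the following sketch generates one replication of this simulation design. The non-null coefficient values and the correlation shown are placeholders, since only the structure of the design (exchangeable correlation $\rho$, three randomly nulled effects, and a variance set from the SNR) is described above.

```r
library(MASS)  # for mvrnorm()

## One replication of the simulation design sketched above. Placeholder
## values: beta = rep(1, 6) and rho = 0.5 are illustrative, not the paper's.
simulate_regression <- function(n = 1000, rho = 0.5, snr = 0.4286,
                                beta = rep(1, 6)) {
  p <- length(beta)
  beta[sample(p, 3)] <- 0                        # randomly create three null effects
  Sigma <- (1 - rho) * diag(p) + rho             # Sigma = (1 - rho) I + rho J
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  sigma2 <- drop(t(beta) %*% Sigma %*% beta) / snr  # from SNR = beta' Sigma beta / sigma^2
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(y = y, X = X, beta = beta, sigma2 = sigma2)
}
```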
3.1. Simulated Distributions
For the following simulation sets, we generate 10,000 linear regression models as described above, under both independent and correlated covariate settings. For each of these models, we generate 1000 observations. We then calculate the modified variable importance measures for each variable that has a zero coefficient. Next, we collect all 30,000 modified variable importance values corresponding to these null effects and plot their empirical distribution relative to a $\chi^2_1$ distribution. We repeat this process for AIC, BIC, and HQIC.
In the set of simulations shown in Figure 1, it appears that the modified variable importance measures corresponding to each of the information criteria are roughly chi-squared distributed under independence. However, it appears that the modified variable importance measures are not approximately $\chi^2_1$ distributed when there is correlation between the variables. The distribution appears to deviate further from a $\chi^2_1$ distribution as $\lambda$ increases when there is correlation.
3.2. Type 1 Error Rates
For the following sets of results, we again generate 10,000 linear regression models as previously described, with a sample size of 1000 observations for each model. For each of these simulated regressions, we calculate the test statistics and p-values for each variable. We then use the p-values from the coefficients that were truly zero to calculate the type 1 error rates for each of the different scenarios.
3.2.1. Naive Approach
Here, we show the type 1 error rates that would result from treating the modified variable importance measures as if they were $\chi^2_1$ distributed. We consider rates based on the p-values corresponding to AIC, BIC, and HQIC.
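For reference, the naive p-value is simply the upper tail probability of the chi-square reference distribution evaluated at the modified variable importance measure, as in the following short sketch (the function name is illustrative).

```r
## Naive p-value: treat the modified measure as chi-squared with |S_i|
## degrees of freedom (df = 1 for a single variable).
naive_p_value <- function(modified_vi, df = 1) {
  pchisq(modified_vi, df = df, lower.tail = FALSE)
}
```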
From the simulation results in Figure 2, it appears that the distribution of the test statistics is well approximated by a chi-square distribution under independence. In large-sample settings, note that all of the tests appear to have the correct type 1 error rate when the variables are independent. However, we can see that the type 1 error rates for the tests with BIC and HQIC are much too large when the variables are highly correlated. The type 1 error rate with AIC does appear to be much closer to the desired level, but it is still slightly too large.
From Figure 3, we can see that the type 1 error rates from using this approach with BIC and HQIC tend to get larger as the correlation between variables increases. It appears that the type 1 error rates when using this method with AIC are slightly larger than desired, but are still fairly close to the nominal level.
3.2.2. Bootstrapping Approach
We propose the use of a parametric bootstrap approach to approximate the null distribution of the modified variable importance measure. For a given variable, we generate a new realization of the response variable based on the model that includes all variables except for the current variable of interest. We then calculate the modified variable importance measure for the current variable of interest. We carry out this calculation for each variable, repeating the process many times to empirically approximate null distributions for the importance measures.
Here, we show the type 1 error rates that would result from using the parametric bootstrap approach described above. The null distribution of the modified variable importance measure for each variable is estimated with 100 bootstrapped replicates in each simulation.
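The following sketch illustrates this procedure for a Gaussian linear model, using the hypothetical variable_importance() helper sketched in Section 2.1; it returns a bootstrap p-value for a single variable set.

```r
## Parametric bootstrap p-value sketch (Gaussian linear model assumed).
## Responses are regenerated from the fitted model that excludes the variable
## set of interest, so the observed covariates and their correlation are kept.
bootstrap_p_value <- function(y, X, lambda, S_i, nboot = 100) {
  observed <- variable_importance(y, X, lambda, S_i)["modified_VI"]
  null_fit <- lm(y ~ X[, -S_i, drop = FALSE])   # null-generating model
  mu <- fitted(null_fit)
  sigma <- summary(null_fit)$sigma
  null_stats <- replicate(nboot, {
    y_star <- rnorm(length(y), mean = mu, sd = sigma)
    variable_importance(y_star, X, lambda, S_i)["modified_VI"]
  })
  mean(null_stats >= observed)  # proportion of null statistics at least as large
}
```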
From the simulation results in Figure 4 and Figure 5, we see that the parametric bootstrap approach gives approximately correct type 1 error rates even with BIC and HQIC under large correlation. We do see some slight deviations from the desired type 1 error rates under atypically high correlation. However, even under a very extreme correlation of 0.99 between all of the variables, the type 1 error rates are still within about 0.03 of the desired rates.
It appears that the parametric bootstrap approach can be used to obtain valid p-values from the modified variable importance measures. However, this method is much more computationally expensive than using a $\chi^2$ distribution because we need to perform best subset selection $2 \times p \times n_{\mathrm{boot}}$ times, where $n_{\mathrm{boot}}$ denotes the number of bootstrap replicates and $p$ denotes the number of variables. Thus, in Appendix A, we discuss approaches for improving computational efficiency.
4. Body Fat Dataset Example
In this section, we demonstrate the utility of the proposed methods in an application. The goal is to build a regression model to predict the body fat percentage for men based on the following variables: age (years), weight (pounds), height (inches), and the circumferences of the ankle (cm), bicep (cm), chest (cm), forearm (cm), hip (cm), knee (cm), neck (cm), thigh (cm), waist (inches), and wrist (cm). Our dataset can be found at https://dasl.datadescription.com/datafile/bodyfat/ (accessed on 2 August 2024) and comprises 250 observations.
In the context of linear regression, we will consider the use of best subset selection with AIC, BIC, and HQIC. We will demonstrate the utility of the proposed variable importance measure, as well as p-values based on the modified version of this measure, calculated using the parametric bootstrap. We will calculate both the variable importance measures and the p-values for AIC, BIC, and HQIC, and we will compare these results to what one would obtain using a conventional approach based on using the full model.
4.1. Coefficient Estimates
The standardized coefficient estimates resulting from the full model and from best subset selection using each of the information criteria are featured in Table 1. From these results, we can see that waist and wrist circumference appear to be the most important variables because they have large coefficient estimates (in magnitude) and they are both included in all of the models.
We can also see that weight is included in the model selected by BIC, but is not included in the model selected by AIC or HQIC. This is a very interesting result, because we would expect the variables included in the model selected by BIC to also appear in the models favored by AIC and HQIC, since BIC has the largest penalty term and therefore favors more parsimonious models. We suspect that this inconsistency occurs because weight is highly correlated with many of the body measurement variables, yet does not seem to explain much additional variability in the outcome when these other variables are included in the model. Thus, when the majority of the body measurement variables are excluded from the model, weight can be a very useful variable to include. However, when a number of these body measurement variables are included, weight does not seem to provide much additional benefit. This is also reflected in the fact that the coefficient for weight is more than 10 times larger (in magnitude) in the BIC model than in the full model.
4.2. Variable Importance
Variable importance assessments resulting from each method are displayed in Table 2. The variable importance measures for the full model are taken to be the squared Wald test statistics for each of the variables. These measures reflect some of the same conclusions that we noted based on the coefficient estimates. Namely, for each method, waist circumference is the most important variable by far, and wrist circumference is the second most important variable. These results also show that weight is very important for best subset selection with BIC, and is somewhat important for best subset selection with HQIC, but is much less important for the other methods.
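For reference, the full-model importance values in Table 2 can be obtained as squared Wald statistics from a single least-squares fit, as in the following sketch; the file path and the response column name are hypothetical placeholders for the body fat data.

```r
## Full-model variable importance sketch: squared Wald statistics
## (estimate / standard error)^2 from the full linear model.
bodyfat <- read.csv("bodyfat.csv")            # hypothetical local copy of the DASL data
full_fit <- lm(PctBF ~ ., data = bodyfat)     # 'PctBF' is a placeholder column name
wald_sq <- coef(summary(full_fit))[-1, "t value"]^2  # drop the intercept row
sort(wald_sq, decreasing = TRUE)              # larger values indicate greater importance
```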
4.3. p-Values
The p-values resulting from each method, along with Wald p-values based on the selected models, are featured in Table 3. Note that the Wald p-values based on the selected models differ, often substantially, from the Wald p-values arising from the full model and from the p-values based on variable importance. The difference in these p-values becomes more pronounced as the penalty term of the selection criterion increases. For instance, in the model selected by BIC, weight has a Wald p-value of $1.3 \times 10^{-4}$. However, the p-value for weight based on the full model is 0.895 and the p-value for weight based on variable importance with BIC is 0.224.
An interesting result from these analyses is that weight is the most important variable for best subset selection based on BIC, but it has a relatively large associated p-value based on simulation. This is likely a consequence of the high correlation between weight and the other body measurements. The p-value suggests that weight does not have a strong effect in the presence of these other measurements. Weight is correlated with body measurements that each explain some variation in the response, and therefore has a small coefficient estimate when several of these variables are included in the model, yet has a large coefficient estimate when the majority of these variables are not included in the model.
5. Discussion
In this paper, we develop and investigate a novel method for assessing variable importance when performing best subset selection. Our measure is specifically designed for the context of best subset selection, and provides a cogent approach for quantifying variable importance based on comparing two values of an information criterion: one for the optimal model that includes a variable (or variable set) of interest, and the other for the optimal model that excludes the same variable (or variable set). Our modified variable importance measure may be used in conjunction with the parametric bootstrap to calculate associated p-values, providing an inferential approach that is defensible when best subset selection is performed. The methods presented in this paper are implemented in the BranchGLM package (version 3.0.0) for R (version 4.4.1) [11], which is available on CRAN.
Because our variable importance measure was developed expressly for the purpose of best subset selection, we note that it may yield results that are different from those based on bootstrap selection frequencies or Akaike weights, each of which can be employed in the context of model averaging. We also note that data splitting provides an alternative to our approach for calculating p-values that are valid under best subset selection. Again, however, data splitting may yield different p-values compared to our method. Although both types of p-values are used to detect null and non-null effects, the p-values associated with our method are based on the importance of an effect across all possible models, whereas the conventional p-value only considers a specific model.