1. Introduction
When presented with a set of variables posited as potential predictors of a targeted outcome, researchers harness an array of techniques to construct an appropriate descriptive or predictive model. Historically, the conventional approach entailed the formulation of a model under the guidance of an expert, who, grounded in a scientific understanding of the underlying mechanics of the observed outcome, would advocate for the structure of the proposed model. In contemporary practice, while domain expertise continues to inform the compilation of potential explanatory variables, the adoption of an all-encompassing model incorporating all of these variables is rare. Instead, scientists and statisticians have gravitated towards model selection algorithms (e.g., best subsets selection, forward selection, backward elimination, stepwise selection, and the LASSO), which utilize the sample at hand to yield models exhibiting more parsimonious structures than the comprehensive global model.
Once a model is selected, the common practice is to conduct inference on that model, proceeding as if it were the only model ever considered. However, different samples from the same population may lead to the selection of models with different structures. The standard practice therefore neglects the sampling variability inherent in the selection process. This pervasive issue in contemporary applications of statistics was described by Breiman as a “quiet scandal in the statistical community” [1].
A direct consequence of this oversight is the inherent difficulty in replicating the results of a modeling analysis using subsequent samples [2,3]. Inferences contingent on the chosen model are therefore profoundly impacted, raising concerns regarding the bias of regression effect estimators, the accuracy of their estimated standard errors, and the validity of associated p-values and confidence intervals [3]. For instance, failing to account for model selection variability will typically result in underestimated standard errors, which in turn yield erroneously small p-values and overly optimistic, narrow confidence intervals.
In addition, common model selection procedures are prone to including spurious effects in the final model [4,5]. In regression models with correlated variables, the inclusion of spurious effects can heavily influence the estimates and the interpretation of important effects. For example, if two explanatory variables are correlated, their estimated effects can vary substantially between models that include only one of the variables and models that include both [3].
In recent years, various approaches have been proposed that are aimed at addressing the estimation of regression effects while accounting for model selection variability. These approaches can often be characterized as multimodel inference procedures, where regression effect estimates are not solely derived from a single model but rather from the entire collection of models. Conceptually, such procedures are aligned with ensemble methods, where the final prediction or classification is derived through the combination of multiple submodels.
The foundational principles of multimodel inference are influenced by Bayesian principles, particularly the concept of Bayesian model averaging (BMA). The BMA framework is founded on the recognition that different estimates are contingent upon specific models, and that these models possess varying probabilities of arising, given the observed data. Consequently, the posterior mean of any quantity of interest should encompass contributions from all the models based on their posterior probabilities. In essence, the posterior mean of the target quantity should be an average of the estimates conditional on each model, weighted by the posterior probability of that particular model.
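In symbols, for a quantity of interest $\Delta$ and a candidate set of models $M_1, \ldots, M_K$ fitted to data $y$, the BMA point estimate is the posterior-probability-weighted average:

```latex
\mathbb{E}[\Delta \mid y] \;=\; \sum_{k=1}^{K} \mathbb{E}[\Delta \mid M_k, y]\; p(M_k \mid y),
```

where $p(M_k \mid y)$ denotes the posterior probability of model $M_k$ given the observed data.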
In the frequentist domain, several multimodel inference methodologies mirror the BMA framework. A prominent approach uses Akaike weights to play the role of model probabilities [6]. These Akaike weights offer a measure of evidence for each model in a candidate set, allowing the estimate of each regression coefficient to be computed by aggregating the Akaike-weighted model-specific estimates. In this manner, a model averaging calculation akin to BMA is performed, but with each model’s Akaike weight in place of its posterior probability.
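Concretely, if $\mathrm{AIC}_i$ is the AIC of model $i$ and $\Delta_i = \mathrm{AIC}_i - \min_j \mathrm{AIC}_j$, the Akaike weight of model $i$ is $w_i = \exp(-\Delta_i/2) / \sum_j \exp(-\Delta_j/2)$. A minimal Python sketch of the weighting and averaging steps follows; the AIC values and coefficient estimates are hypothetical:

```python
import math

def akaike_weights(aics):
    """Convert a list of AIC values into Akaike weights.

    Each weight is exp(-delta_i / 2) normalized over the candidate set,
    where delta_i = AIC_i - min(AIC)."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

def model_averaged_estimate(estimates, weights):
    """Weighted average of model-specific estimates of one coefficient.

    By convention here, a model that excludes the variable contributes
    an estimate of 0 (the shrinkage form of model averaging)."""
    return sum(b * w for b, w in zip(estimates, weights))

# Hypothetical AICs and coefficient estimates for three candidate models:
aics = [100.0, 101.2, 104.5]
w = akaike_weights(aics)
beta_hat = model_averaged_estimate([1.8, 1.5, 0.0], w)
```

The weights sum to one, so the averaged estimate is a convex combination of the model-specific estimates.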
The basis for incorporating Akaike weights into the multimodel inference framework presupposes their validity as approximations to model selection probabilities. Indeed, Burnham and Anderson claim that Akaike weights “may be interpreted as the probability that model i is the actual expected K–L [Kullback–Leibler] best model for the sampling situation considered” [6]. In this paper, however, we argue that Akaike weights do not provide adequate approximations to model probabilities. Instead, we demonstrate that repeating the model selection process via the bootstrap yields better approximations to model selection probabilities, which can then be incorporated for the purpose of conducting multimodel inference. This bootstrap procedure must nevertheless be implemented with caution because of the bias generated by bootstrapping likelihood-based information criteria. Under appropriate conditions, we derive the form of this bias and propose a simple correction.
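The bootstrap estimate of a model selection probability can be sketched as follows: resample the data with replacement, rerun the AIC-based selection on each resample, and record the fraction of resamples in which each candidate model wins. The sketch below uses a deliberately simple two-model setting (intercept-only versus simple linear regression) on synthetic data; the function names and the Gaussian AIC expression (up to an additive constant) are illustrative, and this is the unadjusted version of the procedure, before any bias correction:

```python
import math
import random

def aic_gaussian(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, up to an additive
    constant shared by all candidate models: n*log(RSS/n) + 2k, where k
    counts the mean parameters plus the error variance."""
    return n * math.log(rss / n) + 2 * k

def select_by_aic(pairs):
    """Choose between an intercept-only model (M0) and a simple linear
    model (M1) for a list of (x, y) tuples."""
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    ybar = sum(ys) / n
    rss0 = sum((y - ybar) ** 2 for y in ys)          # intercept-only fit
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx                                 # least-squares slope
    rss1 = sum((y - ybar - slope * (x - xbar)) ** 2 for x, y in zip(xs, ys))
    return "M0" if aic_gaussian(rss0, n, 2) <= aic_gaussian(rss1, n, 3) else "M1"

def bootstrap_model_frequencies(pairs, B=500, seed=1):
    """Unadjusted bootstrap model frequencies: repeat the AIC selection on
    B resamples (drawn with replacement) and tally the winners."""
    rng = random.Random(seed)
    counts = {"M0": 0, "M1": 0}
    for _ in range(B):
        resample = [rng.choice(pairs) for _ in range(len(pairs))]
        counts[select_by_aic(resample)] += 1
    return {m: c / B for m, c in counts.items()}

# Synthetic data with a genuine slope, so M1 should dominate:
rng = random.Random(0)
data = [(x, 0.5 + 1.0 * x + rng.gauss(0, 1)) for x in range(30)]
freqs = bootstrap_model_frequencies(data)
```

The resulting frequencies estimate how often each model would be selected under repeated sampling, which is exactly the quantity the Akaike weights are meant to approximate.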
4. Application in Biomedicine: Sulindac for the Treatment of Colonic and Rectal Adenomas in Patients with Familial Adenomatous Polyposis
Familial adenomatous polyposis (FAP) is an autosomal dominant genetic disease characterized by the development of thousands of polyps throughout the colon and rectum. The condition is rare, occurring in about 1 in 1000 people, and although it accounts for only a small fraction of all diagnosed colorectal cancers, it is the second most common inherited colorectal cancer syndrome.
FAP is the result of a mutation of a tumor suppressor gene on chromosome 5 known as the adenomatous polyposis coli (APC) gene. Polyps begin to arise in the early teens and, if untreated, patients face a nearly certain lifetime risk of developing colorectal cancer. In addition, patients with FAP are at risk of developing extracolonic pathologies such as desmoid tumors (a solid connective tissue tumor), hepatoblastomas (liver tumors), and thyroid cancer [12].
In patients with FAP, early detection and treatment are essential for preventing the development of colorectal carcinoma and thus improving the prognosis. The standard treatment for FAP is colectomy with or without proctectomy. Subtotal colectomy is desirable for many patients but requires continued surveillance, while total proctocolectomy does not require surveillance but leaves patients with increased stool urgency and higher rates of urinary dysfunction. There is therefore a need for the development of non-surgical treatments for patients with FAP [13,14].
In 1993, the effect of Sulindac, a non-steroidal anti-inflammatory drug (NSAID), was investigated for the treatment of FAP in a randomized clinical trial [15]. The study recruited 22 patients with FAP: 11 were assigned to the treatment group, which received 150 mg doses of Sulindac, and 11 to the control group, which received an identical-appearing placebo tablet. The study also recorded the sex and age of the patients.
The results of the study were published in the paper titled “Treatment of Colonic and Rectal Adenomas with Sulindac in Familial Adenomatous Polyposis” in The New England Journal of Medicine [15]. The data set is available in the R package “medicaldata” [16].
The main outcome of interest in the study is the proportionate difference in the number of polyps at 3 months compared to baseline. In other words, if we define D as the proportionate change in the number of polyps, then

D = (number of polyps at 3 months − number of polyps at baseline) / (number of polyps at baseline).

In addition to this outcome, the explanatory variables in the data set include treatment (a value of 1 for Sulindac and 0 for the placebo), sex (a value of 1 for male and 0 for female), and age. Moreover, in our modeling analysis, we consider the interaction between treatment and sex, and we define this variable as treatment × sex.
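Taking the candidate set to be every subset of the explanatory variables (an assumption for illustration; the labels below are stand-ins for the study variables, whose original symbols were not preserved), the models to be compared can be enumerated directly:

```python
from itertools import combinations

# Illustrative labels for the four explanatory variables in the study:
variables = ["treatment", "sex", "age", "treatment:sex"]

def candidate_models(names):
    """All subsets of the explanatory variables, including the
    intercept-only model, as tuples of variable labels."""
    models = []
    for r in range(len(names) + 1):
        models.extend(combinations(names, r))
    return models

models = candidate_models(variables)
# 2^4 = 16 candidate mean structures, each to be fitted by least
# squares and scored by AIC.
```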
The primary objective of this application is to assess the selection probability of the different linear models that can be constructed with the variables collected in the study. To maintain consistency with the original publication [15], all candidate models are fitted using least squares regression; an inspection of the distribution of the residuals for the full model confirms that this is a reasonable approach. Ultimately, we compare the model selection probability estimates obtained from the Akaike weights, the unadjusted bootstrap model frequencies (BMFs), and the bias-adjusted bootstrap model frequencies. The results are displayed in Table 4.
In analyzing the results of this example, several important points emerge. First, it is worth emphasizing that all estimating approaches consistently point towards the model incorporating both the treatment and sex variables as the most likely. This finding is particularly reassuring, as each approach hinges on model selection through AIC.
However, a notable discrepancy arises in the degree to which this favored model is endorsed, as reflected by the disparities between the Akaike weights and the BMFs. The Akaike weights suggest a lower model selection probability for this model, whereas both the adjusted and unadjusted BMFs indicate a substantially higher probability (Table 4). This distinction carries practical implications, especially in the context of estimating treatment effects and standard errors within the model averaging framework: a higher probability assigned to a single model implies a concentration of estimates from that model, leading to reduced variability due to model selection.
Further scrutiny reveals distinctions between the unadjusted and bias-adjusted BMFs in their estimated probabilities for the second most popular model. For the model containing treatment, sex, and the treatment × sex interaction, the unadjusted BMF yields an estimated probability close to the Akaike weight for the same model, whereas the adjusted version provides a markedly lower estimate. For the model involving only treatment, in contrast, the adjusted approach yields a higher probability estimate.
To better appreciate the practical effects of these differences in model probability estimation, consider the results in Table 5 and Table 6, which show that the smoothed CIs for the adjusted BMFs are narrower than those for the unadjusted case. More importantly, the smoothed CI for the treatment effect is noticeably narrower in the adjusted setting. This occurs because, as noted previously, the second most popular model in the adjusted case is the model that contains only the treatment variable. In other words, a sizable share of the contributions to the smoothed estimate of the treatment effect comes from a model that allocates all the data to estimating that single effect. In the unadjusted case, by contrast, a comparable share of the contributions to the smoothed estimate comes from a model that allocates the data to estimating the treatment, sex, and interaction terms.
From an information-theoretic point of view, this dissimilarity can be explained by considering the optimism inherent in fitted log-likelihoods. As established in Section 3.2, the fitted log-likelihood (the log-likelihood evaluated at the maximum likelihood estimates) serves as an overly optimistic measure of goodness-of-fit. Thus, to circumvent the unrealistic selection of complex models when the selection is repeated across bootstrap resamples, the addition of a penalty term of size k (as employed in AIC) proves insufficient. Instead, an extra k must be incorporated to strike a proper balance between the goodness-of-fit term and the penalty term. Since the unadjusted BMF lacks this extra penalty, it can be expected to favor larger models relative to the adjusted BMF.
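The effect of the extra penalty can be seen in a toy numeric illustration (the log-likelihood values and parameter counts below are hypothetical, and the score is written on the log-likelihood scale; the precise form and justification of the adjustment are derived in Section 3.2):

```python
def aic_like_score(loglik, k, extra_penalty=False):
    """Penalized goodness-of-fit on the log-likelihood scale.

    The AIC-style penalty subtracts k (the number of estimated
    parameters); the bias-adjusted variant subtracts an extra k,
    for a total penalty of 2k."""
    return loglik - (2 * k if extra_penalty else k)

# Hypothetical fitted log-likelihoods for a smaller and a larger model:
small = {"loglik": -50.0, "k": 3}
large = {"loglik": -47.0, "k": 5}

plain_pick = ("large" if aic_like_score(**large) > aic_like_score(**small)
              else "small")
adjusted_pick = ("large" if aic_like_score(**large, extra_penalty=True)
                 > aic_like_score(**small, extra_penalty=True) else "small")
# The larger model wins under the standard penalty, but the smaller
# model wins once the extra k is charged.
```

Models whose fit advantage is smaller than the extra penalty lose their preference under the adjustment, which is why the adjusted BMF shifts probability toward more parsimonious models.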