Article

Model Averaging for Accelerated Failure Time Models with Missing Censoring Indicators

Department of Statistics and Data Science, School of Economics, Jinan University, Guangzhou 510632, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(5), 641; https://doi.org/10.3390/math12050641
Submission received: 23 January 2024 / Revised: 16 February 2024 / Accepted: 20 February 2024 / Published: 22 February 2024
(This article belongs to the Section Probability and Statistics)

Abstract

Model averaging has become a crucial statistical methodology, especially in situations where numerous models vie to elucidate a phenomenon. Over the past two decades, there has been substantial advancement in the theory of model averaging. However, a gap remains in the field regarding model averaging in the presence of missing censoring indicators. Therefore, in this paper, we present a new model-averaging method for accelerated failure time models with right censored data when censoring indicators are missing. The model-averaging weights are determined by minimizing the Mallows criterion. Under mild conditions, the calculated weights exhibit asymptotic optimality, leading to the model-averaging estimator achieving the lowest squared error asymptotically. Monte Carlo simulations demonstrate that the method proposed in this paper has lower mean squared errors than other model-selection and model-averaging methods. Finally, we conduct an empirical analysis using the real-world Acute Myeloid Leukemia (AML) dataset. The results of the empirical analysis demonstrate that the method proposed in this paper outperforms existing approaches in terms of predictive performance.

1. Introduction

In some practical scenarios, we often need to select useful models from a candidate model set. A popular approach to address this issue is model selection. Methods such as the Akaike Information Criterion (AIC) [1], Mallows’ Cp [2] and Bayesian Information Criterion (BIC) [3] are designed to identify the best model. However, in cases where a single model does not receive strong support from the data, these model-selection methods may overlook valuable information from other candidate models, leading to issues of model-selection uncertainty and bias [4].
To tackle these challenges and enhance prediction accuracy, several model-averaging techniques have been developed to leverage all information from the candidate models. Taking inspiration from AIC and BIC, Buckland et al. [5] proposed the smoothed AIC (SAIC) and smoothed BIC (SBIC) methods, respectively. Hansen [6] introduced the Mallows model-averaging (MMA) estimator, obtaining weights through minimization of Mallows' Cp criterion. The MMA estimator asymptotically attains the minimum squared error among the model-averaging estimators in its class. Subsequently, Wan et al. [7] relaxed the constraints of Hansen [6], allowing for non-nested candidate models and continuous weights. In practical applications, many datasets exhibit heteroscedasticity, so it is essential to explore model-averaging methods tailored to heteroscedastic settings. First, Hansen and Racine [8] proposed jackknife model averaging (JMA), which determines weights by minimizing a cross-validation criterion; JMA significantly reduces the mean squared error (MSE) compared to MMA when errors are heteroscedastic. Second, Liu and Okui [9] modified the MMA method of Hansen [6] to make it suitable for heteroscedastic scenarios. Furthermore, Zhao et al. [10] extended [6]'s work by estimating the covariance matrix based on a weighted average of the squared residuals from all candidate models, which improves the model-averaging estimator under heteroscedasticity.
In survival analysis, the accelerated failure time (AFT) model provides a straightforward description of how covariates directly impact survival time and has consequently garnered widespread attention. Several parameter-estimation methods exist for the AFT model, including Miller's estimator [11], the Buckley–James estimator [12], the Koul–Susarla–Van Ryzin (KSV) estimator [13] and the weighted least squares (WLS) estimator [14]. However, all these methods assume that the censoring indicator is observable. Therefore, Wang and Dinse [15] improved the KSV estimator to adapt it to situations where the censoring indicator is missing.
Under practical conditions, it is common to encounter situations where only the observed time is available and it is uncertain whether the event of interest has occurred. In such cases, the data suffer from missingness in the censoring indicator. For example, in a clinical trial for lung cancer, a patient may die of unknown causes: while the survival time is observed, it is uncertain whether the patient died specifically of lung cancer. This situation leads to missingness in the censoring indicator. Previous studies have mainly addressed missingness in the censoring indicator under a specific model, while research on model averaging for right-censored data typically assumes that the censoring indicator is observable. Therefore, this paper adopts the inverse probability weighting method proposed by [15] to construct the response variable and, through appropriate weight-selection criteria, chooses weights to build a model-averaging estimator for the accelerated failure time model. This approach significantly enhances the predictive performance of the model and mitigates the bias introduced by selecting a single model. Compared to previous research, this paper makes two main contributions: First, it introduces a novel model-averaging method for the case of missing censoring indicators. Second, the paper allows for heteroscedasticity and employs model-averaging techniques to estimate the variance.
The remaining sections of this paper are organized as follows. In Section 2, we introduce the notation and progressively delineate the methodology and associated theoretical properties of the proposed model-averaging approach. In Section 3, we report the Monte Carlo simulation results. In Section 4, we assess the predictive performance of the proposed model-averaging method against other approaches using the real-world Acute Myeloid Leukemia (AML) dataset. In Section 5, we provide a comprehensive summary of the entire paper and suggest future research directions in this area. All theorem proofs are presented in Appendix A.

2. Methodology and Theoretical Property

We denote $Y = \log(T) = (Y_1, \ldots, Y_n)'$ and $C = \log(V) = (C_1, \ldots, C_n)'$, where $T$ represents the survival time and $V$ denotes the censoring time. $X = (X_1, X_2, \ldots, X_n)'$ denotes the covariate matrix for $n$ independent observations, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$. The accelerated failure time model can be expressed as follows:
$$Y_i = \mu_i + e_i = \sum_{j=1}^{p}\beta_j x_{ij} + e_i, \qquad i = 1, \ldots, n,$$
where $e_i$ is the random error with $E(e_i \mid X_i) = 0$ and $E(e_i^2 \mid X_i) = \sigma_i^2$.
We assume that there are $M$ candidate models in the candidate model set, where the $m$th candidate model contains $p_m$ covariates. Following [7], the candidate models are allowed to be non-nested. The $m$th candidate model is
$$Y_{mi} = \sum_{j=1}^{p_m}\beta_j x_{ij} + e_{mi}, \qquad i = 1, \ldots, n, \tag{2}$$
for $m = 1, \ldots, M$. The matrix form of (2) is
$$Y_m = X_m\beta_m + e_m,$$
where $X_m$ is an $n \times p_m$ full column-rank matrix, $Y_m = (Y_{m1}, \ldots, Y_{mn})'$, $\beta_m = (\beta_1, \ldots, \beta_{p_m})'$ and $e_m = (e_{m1}, \ldots, e_{mn})'$.
In the case of right censored data, the response variable $Y_i$ might be censored, making it unobservable. We only observe $(Z_i, X_i, \delta_i)$, where $Z_i = \min(Y_i, C_i)$ and the censoring indicator $\delta_i = I(Y_i \le C_i)$. Define a missingness indicator $\xi_i$, which is 1 if $\delta_i$ is observed and 0 otherwise. When the censoring indicators are missing, the observed data are $\{Z_i, X_i, \xi_i, \xi_i\delta_i\}$. For simplicity, we set $U_i = (Z_i, X_i)$. In this paper, similar to [15], we assume the missing mechanism for $\delta$ to be:
$$P(\xi = 1 \mid Z, X, \delta) = P(\xi = 1 \mid Z).$$
This assumption is more stringent than the missing at random (MAR) condition yet less restrictive than the assumption of missing completely at random (MCAR).
Koul et al. [13] introduced a method that involves synthetic data for constructing linear regression models. Wang and Dinse [15] extended [13]’s method to address the situation where censoring indicators are missing. In our work, we follow the approach proposed by [15] to construct a response in the form of inverse probability weighting, specifically:
$$Y_{Wi} = \frac{\frac{\xi_i\delta_i}{\pi(Z_i)} + \left(1 - \frac{\xi_i}{\pi(Z_i)}\right) m(U_i)}{1 - G_n(Z_i)}\, Z_i,$$
where $\pi(z) = E(\xi \mid Z = z)$, $m(u) = E(\delta \mid U = u)$ and $G_n(\cdot)$ represents the cumulative distribution function of $C$. It is easy to observe that, under the missing data mechanism in this paper:
$$E(Y_{Wi} \mid X_i) = \mu_i = X_i'\beta.$$
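A quick check of this unbiasedness (a sketch, treating $\pi$, $m$ and $G_n$ as known): under the missing mechanism above, $E(\xi_i \mid Z_i, X_i, \delta_i) = \pi(Z_i)$, so
$$E(Y_{Wi} \mid Z_i, X_i, \delta_i) = \frac{\delta_i + \left(1 - \frac{\pi(Z_i)}{\pi(Z_i)}\right) m(U_i)}{1 - G_n(Z_i)}\, Z_i = \frac{\delta_i Z_i}{1 - G_n(Z_i)},$$
which is exactly the synthetic response of [13]; taking a further expectation conditional on $X_i$ then yields $\mu_i$ by the standard synthetic-data argument, which assumes the censoring time is independent of the survival time and covariates.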
Similar to Equation (2), we have:
$$Y_{Wi} = \sum_{j=1}^{p_m}\beta_j x_{ij} + e_{Wi}, \qquad i = 1, \ldots, n,$$
where $E(e_{Wi} \mid X_i) = 0$ and $\sigma_{Wi}^2 = \mathrm{var}(e_{Wi} \mid X_i)$. This is expressed in matrix form as:
$$Y_W = X_m\beta_m + e_W,$$
where $Y_W = (Y_{W1}, \ldots, Y_{Wn})'$ and $e_W = (e_{W1}, \ldots, e_{Wn})'$. The weighted least squares estimator of $\beta_m$ is then:
$$\hat\beta_m = (X_m' D X_m)^{-1} X_m' D Y_W,$$
where $D = \mathrm{diag}\{1/\sigma_{W1}^2, \ldots, 1/\sigma_{Wn}^2\}$.
Let $\mu_{mi} = E(Y_{Wi} \mid X_i)$; subsequently, the estimator of the $m$th candidate model's $\mu_m = (\mu_{m1}, \ldots, \mu_{mn})'$ is given by:
$$\hat\mu_m = X_m\hat\beta_m = X_m(X_m' D X_m)^{-1} X_m' D Y_W = P_m Y_W,$$
where $P_m = X_m(X_m' D X_m)^{-1} X_m' D$. Denote the weight vector $w = (w_1, \ldots, w_M)^T$, belonging to the set $H_M = \{w \in [0, 1]^M : \sum_{m=1}^{M} w_m = 1\}$. The model-averaging estimator of $\mu$ is defined as follows:
$$\hat\mu_{G_n}(w) = \sum_{m=1}^{M} w_m X_m\hat\beta_m = \sum_{m=1}^{M} w_m X_m(X_m' D X_m)^{-1} X_m' D Y_W = P(w) Y_W,$$
for any $w \in H_M$, where $P(w) = \sum_{m=1}^{M} w_m X_m(X_m' D X_m)^{-1} X_m' D$.
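To make the construction concrete, here is a minimal NumPy sketch of the candidate weighted least squares fits and the averaged estimator; the function names (`wls_fit`, `model_average`) and the convention of passing the diagonal of $D$ as a vector are our own assumptions, not the paper's:

```python
import numpy as np

def wls_fit(X_m, y_w, d):
    """Weighted least squares fit of one candidate model.

    X_m : (n, p_m) design matrix of the mth candidate model
    y_w : (n,) inverse-probability-weighted responses Y_W
    d   : (n,) diagonal of D, i.e., 1 / sigma_Wi^2
    Returns the fitted values mu_hat_m = P_m y_w.
    """
    Xd = X_m * d[:, None]                          # D X_m (row-scaled)
    beta_m = np.linalg.solve(X_m.T @ Xd, Xd.T @ y_w)
    return X_m @ beta_m

def model_average(candidate_fits, w):
    """Combine candidate fitted values with simplex weights w."""
    return np.column_stack(candidate_fits) @ w     # sum_m w_m * mu_hat_m
```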
Define the squared loss function $L_{G_n}(w) = \|\mu - \hat\mu(w)\|^2$, where $\|\cdot\|$ denotes the Euclidean norm. Then the risk function is defined as:
$$R_{G_n}(w) = E(L_{G_n}(w)) = \|P(w)\mu - \mu\|^2 + \mathrm{tr}\{P(w)\,\Omega\, P'(w)\}, \tag{10}$$
where $\Omega = \mathrm{diag}\{\sigma_{W1}^2, \ldots, \sigma_{Wn}^2\}$. The derivation of (10) is as follows:
$$
\begin{aligned}
R_{G_n}(w) = E[L_{G_n}(w)] &= E[(\mu - \hat\mu(w))'(\mu - \hat\mu(w))] \\
&= E[\mu'\mu - 2\mu' P(w) Y_W + Y_W' P'(w) P(w) Y_W] \\
&= \mu'\mu - 2\mu' P(w)\mu + \mu' P'(w) P(w)\mu + \mathrm{tr}(P(w)\,\Omega\, P'(w)) \\
&= (P(w)\mu - \mu)'(P(w)\mu - \mu) + \mathrm{tr}(P(w)\,\Omega\, P'(w)). \tag{11}
\end{aligned}
$$
Regarding the choice of weights, a natural approach is to minimize the risk function. However, as shown in Equation (11), the risk function involves the unknown $\mu$, which makes it infeasible to minimize directly. Therefore, we replace $\mu$ with $Y_W$ and seek an unbiased estimator of the risk function as the criterion for weight selection.
Define the criterion for weight selection as
$$C_{G_n}(w) = \|Y_W - \hat\mu(w)\|^2 + 2\,\mathrm{tr}\{P(w)\Omega\}.$$
It is not difficult to observe that $E(C_{G_n}(w)) = R_{G_n}(w) + \sum_{i=1}^{n}\sigma_{Wi}^2$. Disregarding the term that does not depend on $w$, $C_{G_n}(w)$ serves as an unbiased estimator of the risk function.
In practice, $m(\cdot)$, $\pi(\cdot)$ and $G_n(\cdot)$ are usually unknown, so we need to estimate them. First, regarding the estimation of $m(u)$: it is usually estimated by a logit model. Suppose $m(u)$ is estimated by the parametric model $m_0(u; \theta)$, where $m_0(u; \theta) = \frac{e^{U'\theta}}{1 + e^{U'\theta}}$. By maximum likelihood estimation, we obtain the parameter estimate $\hat\theta_n$ of $\theta$. $\pi(z)$ can usually be estimated nonparametrically by
$$\hat\pi_n(z) = \sum_{i=1}^{n}\xi_i\, W\!\left(\frac{z - Z_i}{b_n}\right) \Big/ \sum_{i=1}^{n} W\!\left(\frac{z - Z_i}{b_n}\right),$$
where $W(\cdot)$ is a kernel function and $b_n$ is a bandwidth sequence. Next, we define $u(z) = E(\delta \mid Z = z)$, which is estimated nonparametrically by
$$\hat u_n(z) = \frac{\sum_{i=1}^{n}\frac{\delta_i\xi_i}{\hat\pi_n(Z_i)}\, K\!\left(\frac{z - Z_i}{h_n}\right)}{\sum_{i=1}^{n}\frac{\xi_i}{\hat\pi_n(Z_i)}\, K\!\left(\frac{z - Z_i}{h_n}\right)},$$
where $K(\cdot)$ is a kernel function and $h_n$ is a bandwidth sequence. We adopt the following estimator of $G_n(z)$:
$$\hat G_n(z) = 1 - \prod_{i:\, Z_i \le z}\left(\frac{n - R_i}{n - R_i + 1}\right)^{1 - \hat u_n(Z_i)},$$
where $R_i$ denotes the rank of $Z_i$.
Next, replacing $m(\cdot)$, $\pi(\cdot)$ and $G_n(\cdot)$ with $m_0(\cdot, \hat\theta_n)$, $\hat\pi_n(\cdot)$ and $\hat G_n(\cdot)$, we have:
$$\hat Y_{Wi} = \frac{\frac{\xi_i\delta_i}{\hat\pi_n(Z_i)} + \left(1 - \frac{\xi_i}{\hat\pi_n(Z_i)}\right) m_0(U_i, \hat\theta_n)}{1 - \hat G_n(Z_i)}\, Z_i.$$
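The plug-in quantities above translate directly into code. The sketch below uses our own helper names; the kernels follow the simulation section, and $\delta_i$ is taken as 0 wherever it is unobserved, which is harmless because those terms receive weight $\xi_i = 0$:

```python
import numpy as np

def pi_hat(z, Z, xi, b):
    """Kernel estimate of pi(z) = E(xi | Z = z), uniform kernel W."""
    w = 0.5 * (np.abs((z - Z) / b) <= 1)
    return np.sum(xi * w) / np.sum(w)

def u_hat(z, Z, delta, xi, pi_vals, h):
    """Kernel estimate of u(z) = E(delta | Z = z), biweight kernel K,
    using only cases with observed indicators, reweighted by 1/pi_hat."""
    t = (z - Z) / h
    K = (15.0 / 16.0) * (1 - 2 * t**2 + t**4) * (np.abs(t) <= 1)
    kw = K * xi / pi_vals
    return np.sum(delta * kw) / np.sum(kw)

def G_hat(z, Z, u_vals):
    """Product-limit-type estimate of the censoring distribution G_n."""
    n = len(Z)
    R = Z.argsort().argsort() + 1                  # ranks R_i of Z_i
    idx = Z <= z
    base = (n - R[idx]) / (n - R[idx] + 1.0)
    return 1.0 - np.prod(base ** (1.0 - u_vals[idx]))

def y_w_hat(Z, xi, delta, pi_vals, m_vals, G_vals):
    """Estimated IPW response hat Y_Wi from the plug-in quantities."""
    num = xi * delta / pi_vals + (1 - xi / pi_vals) * m_vals
    return num / (1 - G_vals) * Z
```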
The corresponding weight-selection criterion is then:
$$C_{\hat G_n}(w) = \|\hat Y_W - \hat\mu_{\hat G_n}(w)\|^2 + 2\,\mathrm{trace}\{P(w)\Omega\}, \tag{13}$$
where $\hat Y_W = (\hat Y_{W1}, \ldots, \hat Y_{Wn})'$. The weights minimizing $C_{\hat G_n}(w)$ are given by:
$$\tilde w = \arg\min_{w \in H_M} C_{\hat G_n}(w).$$
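Because $P(w) = \sum_m w_m P_m$ is linear in $w$, $\mathrm{trace}\{P(w)\Omega\} = \sum_m w_m\,\mathrm{trace}(P_m\Omega)$, so $C_{\hat G_n}(w)$ is a quadratic function of $w$ and minimizing it over the simplex $H_M$ is a small quadratic program. A sketch using `scipy.optimize` (our own formulation, not code from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def mallows_weights(F, y_w, tr_pm_omega):
    """Minimize C(w) = ||y_w - F w||^2 + 2 sum_m w_m trace(P_m Omega)
    over the simplex {w >= 0, sum w = 1}.

    F           : (n, M) matrix whose columns are the candidate fits
    tr_pm_omega : (M,) vector of trace(P_m Omega), m = 1, ..., M
    """
    M = F.shape[1]

    def criterion(w):
        r = y_w - F @ w
        return r @ r + 2.0 * tr_pm_omega @ w

    res = minimize(criterion, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=({"type": "eq",
                                 "fun": lambda w: np.sum(w) - 1.0},))
    return res.x
```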
We now enumerate the regularity conditions required for asymptotic optimality.
(C1)
Let $S(t) = 1 - (1 - F(t))(1 - G_n(t))$ and $\tau_H = \inf\{t : S(t) = 1\}$, where $F(t)$ is the cumulative distribution function of $Y_i$. Assume that $1 - G_n(\tau_H) > 0$.
(C2)
There exists a positive constant $k$ such that $\max_{1 \le i \le n}|\mu_i| \le k$.
(C3)
Denote $\xi_n = \inf_{w \in H_M} R_{G_n}(w)$ and let $w_m^0$ be an $M \times 1$ unit vector whose $m$th element is 1 and whose other elements are 0. For some integer $1 \le J < \infty$ and some positive constant $k$ such that $E(e_i^{4J}) \le k < \infty$, assume
$$M\xi_n^{-2J}\sum_{m=1}^{M}\big\{R_{G_n}(w_m^0)\big\}^{J} \to 0.$$
(C4)
There exists $\epsilon > 0$ such that $\inf_{1 \le i \le n}\sigma_{Wi}^2 \ge \epsilon$.
(C5)
$m(\cdot)$ and $\pi(\cdot)$ are bounded.
(C6)
$nh_n \to \infty$ and $nh_n^2 \to 0$.
(C7)
Let $\tilde p = \max_m p_m$ and let $\rho_{ii}^{(m)}$ denote the $i$th diagonal element of $P_m$. There exists a constant $c$ such that $\rho_{ii}^{(m)} \le c\, n^{-1} p_m$.
Condition (C1) is utilized in [16] and ensures that $1 - G_n(t)$ is not equal to 0. Condition (C2) mandates that the conditional mean $\mu_i$ remains bounded, in line with assumptions seen in prior research, including [7,17]. Condition (C3) is a requirement commonly found in the model-averaging literature (e.g., [7,18]). Condition (C4) mandates the non-degeneracy of the covariance matrix $\Omega$ as $n \to \infty$; similar assumptions can also be found in [9,10]. Similar to [15], Conditions (C5) and (C6) impose constraints on the bounds of $m(\cdot)$ and $\pi(\cdot)$ and on the bandwidths, respectively. Condition (C7) is frequently employed in the analysis of the asymptotic optimality of cross-validation methods, as seen in prior works like [8].
Theorem 1.
Under Conditions (C1) to (C6),
$$\frac{L_{\hat G_n}(\tilde w)}{\inf_{w \in H_M} L_{\hat G_n}(w)} \xrightarrow{\ p\ } 1.$$
Theorem 1 establishes the asymptotic optimality of the model-averaging procedure employing weights w ˜ , as its squared loss converges to that of the infeasible best possible model average estimator.
In most cases, $\Omega$ is unknown and needs to be estimated. We estimate $\Omega$ using the residuals derived from the model-averaging process: $\hat e(w) = \hat Y_W - \hat\mu(w) = \{\hat e_{W1}(w), \ldots, \hat e_{Wn}(w)\}'$. Specifically, the estimator of $\Omega$ is
$$\hat\Omega(w) = \mathrm{diag}\{\hat\sigma_{W1}^2(w), \ldots, \hat\sigma_{Wn}^2(w)\},$$
where $\hat\sigma_{Wi}^2(w) = \mathrm{var}(\hat e_{Wi}(w))$.
In the existing literature on model averaging, estimates of the variance are predominantly derived from the largest candidate model, as exemplified by works such as [6,16]. In contrast, our approach, following [10], leverages information from all candidate models rather than relying on a single model. Such an estimator is more robust. Replacing $\Omega$ with $\hat\Omega(w)$ in (13), the criterion becomes
$$\hat C_{\hat G_n}(w) = \|\hat Y_W - \hat\mu_{\hat G_n}(w)\|^2 + 2\,\mathrm{trace}\{P(w)\hat\Omega(w)\}. \tag{16}$$
The weights that minimize $\hat C_{\hat G_n}(w)$ are as follows:
$$\hat w = \arg\min_{w \in H_M}\hat C_{\hat G_n}(w).$$
Note that the weight-selection criterion $\hat C_{\hat G_n}(w)$ is a cubic function of $w$.
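The paper does not spell out how this cubic program is solved; one simple strategy, shown below purely as an assumption on our part, alternates between updating $\hat\Omega(w)$ from the current averaging residuals and re-solving the induced quadratic problem with `mallows_weights` from the sketch above (the squared-residual variance update is likewise our plug-in choice):

```python
import numpy as np

def hcima_weights(F, P_diags, y_w, n_iter=10):
    """Heuristic alternating minimization of the cubic criterion.

    F       : (n, M) candidate fits as columns
    P_diags : list of (n,) arrays, the diagonals of each P_m
    """
    M = F.shape[1]
    w = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        omega = (y_w - F @ w) ** 2                 # plug-in for sigma_Wi^2(w)
        tr_pm_omega = np.array([(d * omega).sum() for d in P_diags])
        w = mallows_weights(F, y_w, tr_pm_omega)   # quadratic sub-problem
    return w
```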
Theorem 2.
Under Conditions (C1) to (C7),
$$\frac{L_{\hat G_n}(\hat w)}{\inf_{w \in H_M} L_{\hat G_n}(w)} \xrightarrow{\ p\ } 1.$$

3. Simulation

In the simulation study, we generate data from the accelerated failure time (AFT) model $\log(T_i) = Y_i = \sum_{j=1}^{1000}\beta_j x_{ij} + e_i$, where $\beta_j = 1/j^2$; the observations $X_i = (x_{i1}, x_{i2}, \ldots, x_{i1000})'$ are generated from a multivariate normal distribution with zero mean and covariance matrix $\Sigma = (\sigma_{ij})$, where $\sigma_{ij} = 0.5^{|i-j|}$. The errors $e_i$ follow the normal distribution $N(0, \gamma^2(x_{i2}^4 + 0.01))$. By varying the value of $\gamma$, we allow $R^2$ to range from 0.1 to 0.9. This variance specification closely resembles that of [8]; however, we introduce a small constant, 0.01, to ensure that the variances remain strictly positive. The censoring time $C_i$ is generated from $N(C_0, 7)$; by varying the value of $C_0$, we achieve censoring rates (CRs) of approximately 20% and 40%. We set the sample sizes to $n = 150$ and $300$. Here, the candidate models are nested, meaning the $m$th model includes the first $m$ regressors. The number of candidate models $M$ is set to $\lceil 3n^{1/3}\rceil$, where $\lceil x\rceil$ denotes the smallest integer greater than $x$.
Based on the missing mechanism described in this paper, we assume that the probability that a censoring indicator is missing, $1 - \pi(z)$, is determined by a logistic model: $\log\{\pi(z)/(1 - \pi(z))\} = \theta_1 + \theta_2 z$. Following [15], we employ the uniform kernel function $W(x) = 1/2$ for $|x| \le 1$ and $W(x) = 0$ otherwise, and the biweight kernel function $K(x) = \frac{15}{16}(1 - 2x^2 + x^4)$ for $|x| \le 1$ and $K(x) = 0$ otherwise. The bandwidths are $b_n = h_n = n^{-1/3}\max(Z)$. We estimate $m(u)$ under the logistic model $\log\{m(u)/(1 - m(u))\} = \gamma_1 + \gamma_2 z + \gamma_3' x$. As highlighted by [19], when the data on $\delta$ are completely (or quasi-completely) separated, the maximum likelihood estimate of $\gamma = (\gamma_1, \gamma_2, \gamma_3')'$ does not exist. In our simulation setup, the number of covariates greatly exceeds the sample size; therefore, we employ the lasso method to estimate the parameters.
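A sketch of one replication of this data-generating process follows; the parameter values `gamma`, `c0` and `theta` are placeholders to be tuned to the targeted $R^2$, censoring rate and missing rate, and we read $N(C_0, 7)$ as having variance 7, which is an assumption:

```python
import numpy as np

def simulate(n, p=1000, gamma=1.0, c0=3.0, theta=(1.0, 0.5), seed=0):
    """One replication of the AFT simulation design."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_ij = 0.5^|i-j|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = 1.0 / (idx + 1.0) ** 2                        # beta_j = 1/j^2
    mu = X @ beta
    e = rng.normal(0.0, np.sqrt(gamma**2 * (X[:, 1] ** 4 + 0.01)))
    Y = mu + e                                           # log survival time
    C = rng.normal(c0, np.sqrt(7.0), size=n)             # log censoring time
    Z = np.minimum(Y, C)
    delta = (Y <= C).astype(float)
    pi = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * Z)))  # P(xi = 1 | Z)
    xi = rng.binomial(1, pi)
    return X, Z, delta, xi, mu
```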
We compare the proposed Model-Averaging method for the Missing Censoring Indicators in the Heteroscedastic setting (HCIMA) with other classical model-selection and model-averaging methods in this article. Brief descriptions of these methods are provided below:
  • The model-selection methods rely on AIC and BIC, where the AIC and BIC criteria for the $m$th model are defined as
$$\mathrm{AIC}(m) = \log\hat\sigma^2_{\hat G_n, m} + 2n^{-1}\,\mathrm{tr}(P_m)$$
    and
$$\mathrm{BIC}(m) = \log\hat\sigma^2_{\hat G_n, m} + n^{-1}\,\mathrm{tr}(P_m)\log(n),$$
    where $\hat\sigma^2_{\hat G_n, m} = n^{-1}\|\hat Y_W - \hat\mu_{\hat G_n, m}\|^2$.
  • Model-averaging methods based on SAIC and SBIC: the weights for the $m$th candidate model are given by
$$w_{(m)}^{\mathrm{AIC}} = \exp(-\mathrm{AIC}_m/2)\Big/\sum_{m=1}^{M}\exp(-\mathrm{AIC}_m/2), \qquad
w_{(m)}^{\mathrm{BIC}} = \exp(-\mathrm{BIC}_m/2)\Big/\sum_{m=1}^{M}\exp(-\mathrm{BIC}_m/2),$$
    where $\mathrm{AIC}_m = \mathrm{AIC}(m) - \min(\mathrm{AIC})$ and $\mathrm{BIC}_m = \mathrm{BIC}(m) - \min(\mathrm{BIC})$ (see the sketch after this list).
  • Additionally, we compare our approach with the method that estimates the variance using the largest candidate model (MCIMA). The variance estimation and weight selection in that approach are as follows:
$$\hat\sigma_{\hat G_n} = (\hat\sigma_{\hat G_n 1}, \ldots, \hat\sigma_{\hat G_n n})^T = \sqrt{\frac{n}{n - p_M}}\,(I - P_M)\hat Y_W,$$
$$\hat C_n(w) = \|\hat Y_W - \hat\mu(w)\|^2 + 2\,\mathrm{trace}\{P(w)\hat\Omega\},$$
    where $\hat\Omega = \mathrm{diag}\{\hat\sigma^2_{\hat G_n 1}, \ldots, \hat\sigma^2_{\hat G_n n}\}$.
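For reference, the smoothed information-criterion weights are a one-liner (a sketch; `ic` is the vector of AIC or BIC values over the $M$ candidate models):

```python
import numpy as np

def smoothed_weights(ic):
    """SAIC/SBIC weights from a vector of information-criterion values."""
    d = np.asarray(ic) - np.min(ic)       # IC_m = IC(m) - min(IC)
    w = np.exp(-d / 2.0)
    return w / w.sum()
```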
In the simulation, we utilize the Mean Squared Error (MSE) to evaluate the performance of the various methods, where the MSE is defined as $\frac{1}{n}\|\hat\mu_{\hat G_n} - \mu\|^2$. We present the mean of the MSEs from 500 replications.
Figure 1 and Figure 2, respectively, show the Mean Squared Error (MSE) values of the various methods over 500 repetitions under different censoring rates and sample sizes, with missing rates (MRs) of 20% and 40%. In terms of MSE, our proposed HCIMA method outperforms the other approaches. Additionally, the MCIMA method performs better than the existing methods in all cases except when compared to HCIMA. Furthermore, it is evident that SAIC and SBIC outperform their respective AIC and BIC counterparts, further highlighting the advantages of model-averaging methods.
Comparing Figure 1 and Figure 2, we observe that the MSE at MR = 20% is slightly higher than at MR = 40%. The reason is that when $\xi_i = 1$ and $\delta_i = 0$, the signs of $Y_{Wi}$ and $Z_i$ are opposite, and as the MR increases, the $\xi_i = 1, \delta_i = 0$ case occurs less often. Although this result may seem counterintuitive, it does not affect the performance of the method proposed in this paper, which retains its advantages in this case.

4. Real Data Analysis

In this section, we assess the predictive performance of our proposed HCIMA method using the real Acute Myeloid Leukemia (AML) dataset. This dataset contains 672 samples, including 97 variables such as patient age, survival time, gender, race, mutation count, etc. For more specific information about this dataset, we refer the reader to https://www.cbioportal.org/study/clinicalData?id=aml_ohsu_2018 (accessed on 13 December 2023).
We selected ten variables for analysis: Cause Of Death, Age, Sex, Overall Survival Status, Overall Survival Months (Survival Time), Number of Cumulative Treatment Stages, Cumulative Treatment Regimen Count, Mutation Count, Platelet Count and WBC (White Blood Cell Count). After removing rows with missing values, we retained a total of 396 samples. We treat samples with unknown causes of death as having missing censoring indicators. Among these 396 samples, 76 have unknown causes of death and 167 samples were still alive when the clinical trial ended. Therefore, the missing rate is approximately 19% and the censoring rate is 42%. We focus on the impact on Survival Time of the seven variables remaining after excluding "Cause Of Death" and "Overall Survival Status". Therefore, we can construct $2^7 - 1 = 127$ non-nested candidate models.
We randomly select $n_0$ samples as the training dataset, while the remaining $n_1 = n - n_0$ samples are used as the testing dataset. We set the training dataset size to 50%, 60%, 70% and 80% of the total dataset size, respectively. Following [16,20], we employ the normalized mean squared prediction error (NMSPE) as the performance metric:
$$\mathrm{NMSPE} = \frac{\sum_{i=n_0+1}^{n}\big(\hat Y_{Wi} - \hat\mu_i\big)^2}{\min_{m = 1, 2, \ldots, M}\sum_{i=n_0+1}^{n}\big(\hat Y_{Wi} - \hat\mu_{mi}\big)^2},$$
where $\hat\mu_i$ represents the predicted value and $\hat\mu_{mi}$ denotes the value of $\hat\mu$ for the $m$th model.
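Computed directly from its definition (a sketch with our own argument names):

```python
import numpy as np

def nmspe(y_test, pred_avg, preds_by_model):
    """NMSPE: prediction error of the averaged model divided by the
    error of the best single candidate model on the test set."""
    num = np.sum((y_test - pred_avg) ** 2)
    den = min(np.sum((y_test - p) ** 2) for p in preds_by_model)
    return num / den
```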
We repeat this random splitting 1000 times and calculate the mean, the standard deviation and the optimal rate of each method over the 1000 repetitions. Specifically, the optimal rate refers to the frequency with which a method achieves the minimum NMSPE across the 1000 repetitions.
Table 1 displays the mean, optimal rates and standard deviations of NMSPE for each method over 1000 repetitions. Consistent with the simulation results, the HCIMA method exhibits the lowest average NMSPE and standard deviation and the highest optimal rate. The MCIMA method also performs well, ranking second after HCIMA. This indicates that the proposed model-averaging methods in this paper demonstrate superior predictive performance compared to other approaches.

5. Discussion

To address the uncertainty in model selection and enhance predictive accuracy, this paper proposes a novel model-averaging approach for the accelerated failure time model with missing censoring indicators. Moreover, we establish asymptotic optimality under certain mild conditions. In Monte Carlo simulations, the proposed method exhibits lower mean squared errors than other model-selection and model-averaging methods. Empirical results demonstrate that the proposed method has a lower NMSPE than other approaches, indicating its superior predictive performance. This further underscores the applicability of the proposed method to real-life data scenarios with missing censoring indicators.
In this paper, we introduce the inverse probability weighted form of response variable proposed in [15]. The primary advantage of this form of response variable lies in its double robustness, making it less susceptible to the impact of model misspecification (if π ( · ) or m ( · ) is misspecified). However, as mentioned in [15], its drawback, compared to synthetic response [13], regression calibration and imputation [15], is a larger variance. Yet, in practical scenarios, the harm caused by model misspecification often outweighs the harm of higher variance. Therefore, in our work, we follow the recommendation of [15] to use the inverse probability weighted form of the response variable. A future research direction is to further enhance this response variable for better applicability in the context of missing censoring indicators.
As far as we know, there is currently very limited research on model averaging with missing censoring indicators, so many questions deserve further investigation. In terms of data, our approach could be extended to high-dimensional settings; in terms of models, extensions to partially linear models, generalized linear models and other frameworks could be pursued.

Author Contributions

Conceptualization, L.L.; Methodology, L.L.; Writing—review and editing, L.L.; Software, J.L.; Data curation, J.L.; Writing—original draft, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to thank the reviewers and editors for their careful reading and constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this appendix, we provide the proofs for Theorems 1 and 2. To facilitate the presentation, we begin with several lemmas.
Lemma A1.
Under Conditions (C1) to (C3), there exists a positive constant $c_1$ such that
$$\max_{1 \le i \le n} E\big(e_{Wi}^{4J} \mid X_i\big) \le c_1,$$
where J is given in Condition (C3).
Lemma A1 is consistent with Lemma 6.1 in [16] and, under our specific conditions, the proof technique for Lemma A1 is the same as that for Lemma 6.1 in [16].
Lemma A2.
Under Conditions (C1) to (C3),
$$\left|\frac{L_{G_n}(w)}{R_{G_n}(w)} - 1\right| = o_p(1).$$
Under our specific conditions, we can prove this lemma using the same techniques as the proof of (A.3) in [7]. Therefore, we omit the proof here.
Lemma A3.
Under Conditions (C1) to (C5), as $n \to \infty$, we have
$$\|\hat Y_W - Y_W\|^2 = o_p(1).$$
 Proof of Lemma A3. 
$$
\begin{aligned}
\|\hat Y_W - Y_W\|^2 &= \sum_{i=1}^{n}\left\{\frac{\frac{\xi_i\delta_i}{\pi(Z_i)} + \left(1 - \frac{\xi_i}{\pi(Z_i)}\right)m(U_i)}{1 - G_n(Z_i)} - \frac{\frac{\xi_i\delta_i}{\hat\pi_n(Z_i)} + \left(1 - \frac{\xi_i}{\hat\pi_n(Z_i)}\right)m_0(U_i, \hat\theta_n)}{1 - \hat G_n(Z_i)}\right\}^2 Z_i^2 \\
&\le K\sum_{i=1}^{n}\left\{\frac{1}{1 - G_n(Z_i)} - \frac{1}{1 - \hat G_n(Z_i)}\right\}^2 Z_i^2 \\
&\le C\, n^{1/2}\max_{1 \le i \le n}\left|\frac{1}{1 - \hat G_n(Z_i)} - \frac{1}{1 - G_n(Z_i)}\right|^2\left(\frac{1}{n}\mu^T\mu + \frac{1}{n}e_W^T e_W\right),
\end{aligned}
$$
where $K$ and $C$ are constants. By Condition (C2), we have $\frac{1}{n}\mu^T\mu = O_p(1)$ and $\frac{1}{n}e_W^T e_W = O_p(1)$. According to [15], $\hat G_n(Z_i) - G_n(Z_i) = o_p(1)$. Combined with Condition (C1), we have:
$$\frac{\hat G_n(z) - G_n(z)}{1 - G_n(z)} = o_p(1)
\qquad\text{and}\qquad
\frac{\hat G_n(z) - G_n(z)}{1 - \hat G_n(z)} = o_p(1).$$
Similar to the proof of Lemma 6.2 in [16], we have:
$$n^{1/2}\max_{1 \le i \le n}\left|\frac{1}{1 - \hat G_n(Z_i)} - \frac{1}{1 - G_n(Z_i)}\right| = o_p(1).$$
Furthermore, we can obtain
$$\|\hat Y_W - Y_W\|^2 = o_p(1). \qquad \square$$
With the three lemmas mentioned above, we can now proceed to prove Theorem 1.
Proof of Theorem 1. First, we note that
$$
\begin{aligned}
C_{\hat G_n}(w) &= \|\hat Y_W - \hat\mu_{\hat G_n}(w)\|^2 + 2\,\mathrm{trace}\{P(w)\Omega\} \\
&= \|\hat Y_W - \mu + \mu - \hat\mu_{\hat G_n}(w)\|^2 + 2\,\mathrm{trace}\{P(w)\Omega\} \\
&= \|\hat Y_W - \mu\|^2 + \|\mu - \hat\mu_{\hat G_n}(w)\|^2 + 2(\hat Y_W - \mu)'(\mu - \hat\mu_{\hat G_n}(w)) + 2\,\mathrm{trace}\{P(w)\Omega\} \\
&= \|e_W\|^2 + L_{\hat G_n}(w) + 2 e_W'\big(\mu - P(w)\mu + P(w)\mu - \hat\mu_{\hat G_n}(w)\big) + 2\,\mathrm{trace}\{P(w)\Omega\} \\
&= L_{\hat G_n}(w) + 2 e_W'(I - P(w))\mu + 2\,\mathrm{trace}(P(w)\Omega) - 2 e_W' P(w) e_W + \|e_W\|^2.
\end{aligned}
$$
Following [7], except for a term unrelated to w, to prove Theorem 1, we only need to verify
$$\sup_{w \in H_M} \frac{|e_W'(I - P(w))\mu|}{R_{G_n}(w)} = o_p(1), \tag{A1}$$
$$\sup_{w \in H_M} \frac{\big|\mathrm{trace}(P(w)\Omega) - e_W' P(w) e_W\big|}{R_{G_n}(w)} = o_p(1), \tag{A2}$$
$$\sup_{w \in H_M} \left|\frac{L_{\hat G_n}(w)}{R_{G_n}(w)} - 1\right| = o_p(1). \tag{A3}$$
We begin by proving Equation (A1). As per Equation (11), we can ascertain that:
$$R_{G_n}(w_m^0) \ge \|P(w_m^0)\mu - \mu\|^2, \tag{A4}$$
$$R_{G_n}(w_m^0) \ge \mathrm{trace}\big(P(w_m^0)\,\Omega\, P^T(w_m^0)\big). \tag{A5}$$
Furthermore, we denote the maximum eigenvalue of a matrix $A$ as $\lambda_{\max}(A)$; since $P_m$ is an idempotent matrix, we have:
$$\lambda_{\max}(P_m) = 1,
\qquad
\lambda_{\max}\{P(w)\} \le \sum_{m=1}^{M} w_m\,\lambda_{\max}(P_m) \le 1.$$
According to the proof of Theorem 1 in [21], we have:
$$\lim_{n \to \infty}\ \sup_{w \in H_M}\lambda_{\max}\big(P(w)\, P'(w)\big) < \infty.$$
Applying the triangle inequality, Bonferroni's inequality, Chebyshev's inequality and Theorem 2 of [22], we can conclude, for any $\tau > 0$, that
$$
\begin{aligned}
P\left\{\sup_{w \in H_M}\frac{|e_W'(I - P(w))\mu|}{R_{G_n}(w)} > \tau\right\}
&\le P\left\{\sup_{w \in H_M}\sum_{m=1}^{M} w_m\,|e_W'(I - P_m)\mu| > \tau\xi_n\right\} \\
&= P\left\{\max_{1 \le m \le M}|e_W'(I - P_m)\mu| > \tau\xi_n\right\} \\
&= P\Big\{|\langle e_W, A(w_1^0)\mu\rangle| > \tau\xi_n \ \cup\ \cdots\ \cup\ |\langle e_W, A(w_M^0)\mu\rangle| > \tau\xi_n\Big\} \\
&\le \sum_{m=1}^{M} P\big\{|\langle e_W, A(w_m^0)\mu\rangle| > \tau\xi_n\big\}
\le \sum_{m=1}^{M}\frac{E|\langle e_W, A(w_m^0)\mu\rangle|^{2J}}{\tau^{2J}\xi_n^{2J}} \\
&\le C_1\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\big\|\Omega^{1/2}(2J)\, A(w_m^0)\mu\big\|^{2J},
\end{aligned}
$$
where $\langle\cdot,\cdot\rangle$ represents an inner product, $A(w) = I - P(w)$, $C_1$ is a constant, $\Omega(2J) = \mathrm{diag}\{\gamma_1^2(2J), \ldots, \gamma_n^2(2J)\}$ and $\gamma_i^2(2J) = \{E(e_{Wi}^{2J} \mid X_i)\}^{1/(2J)}$. By Lemma A1, $\gamma_i^2(2J) < \infty$; thus, $\{\lambda_{\max}(\Omega(2J))\}^J = O(1)$. Hence, combining this with Equation (A4), we have:
$$
\begin{aligned}
P\left\{\sup_{w \in H_M}|\langle e_W, A(w)\mu\rangle| / R_{G_n}(w) > \tau\right\}
&\le C_1\tau^{-2J}\xi_n^{-2J}\{\lambda_{\max}(\Omega(2J))\}^J\sum_{m=1}^{M}\|A(w_m^0)\mu\|^{2J} \\
&\le C_1\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\|A(w_m^0)\mu\|^{2J}
\le C_1\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\big\{R_{G_n}(w_m^0)\big\}^{J}.
\end{aligned}
$$
Together with Condition (C3), this proves Equation (A1). Next, we prove (A2). Similar to the proof of Equation (A1), we have:
$$
\begin{aligned}
P\left\{\sup_{w \in H_M}\big|\mathrm{trace}[\Omega P(w)] - \langle e_W, P(w) e_W\rangle\big| / R_{G_n}(w) > \tau\right\}
&\le \sum_{m=1}^{M} P\Big\{\big|\mathrm{trace}\big(\Omega P(w_m^0)\big) - \langle e_W, P(w_m^0) e_W\rangle\big| > \tau\xi_n\Big\} \\
&\le \sum_{m=1}^{M}\frac{E\big|\mathrm{trace}\big(\Omega P(w_m^0)\big) - \langle e_W, P(w_m^0) e_W\rangle\big|^{2J}}{\tau^{2J}\xi_n^{2J}} \\
&\le C_2\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\Big\{\mathrm{tr}\big(P(w_m^0)\,\Omega(4J)\, P'(w_m^0)\big)\Big\}^{J},
\end{aligned}
$$
where $C_2$ is a constant, $\Omega(4J) = \mathrm{diag}\{\gamma_1^2(4J), \ldots, \gamma_n^2(4J)\}$ and $\gamma_i^2(4J) = \{E(e_{Wi}^{4J} \mid X_i)\}^{1/(4J)}$. By Lemma A1, $\gamma_i^2(4J) < \infty$; thus, $\{\lambda_{\max}(\Omega(4J))\}^J = O(1)$. Hence, combining Equation (A5) and Condition (C3), we have:
$$
\begin{aligned}
P\left\{\sup_{w \in H_M}\big|\mathrm{trace}[\Omega P(w)] - \langle e_W, P(w) e_W\rangle\big| / R_{G_n}(w) > \tau\right\}
&\le C_2\tau^{-2J}\xi_n^{-2J}\{\lambda_{\max}[\Omega(4J)]\}^{J}\sum_{m=1}^{M}\Big\{\mathrm{tr}\big(P(w_m^0)\, P'(w_m^0)\big)\Big\}^{J} \\
&\le C_2\tau^{-2J}\xi_n^{-2J}\Big(\inf_i\sigma_{Wi}^2\Big)^{-J}\sum_{m=1}^{M}\Big\{\inf_i\sigma_{Wi}^2\ \mathrm{trace}\big(P^2(w_m^0)\big)\Big\}^{J} \\
&\le C_3\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\big\{R_{G_n}(w_m^0)\big\}^{J} = o(1).
\end{aligned}
$$
Next, we will prove Equation (A3). Note that
$$
\left|\frac{L_{\hat G_n}(w)}{R_{G_n}(w)} - 1\right|
\le \left|\frac{L_{G_n}(w)}{R_{G_n}(w)} - 1\right| + \frac{|L_{\hat G_n}(w) - L_{G_n}(w)|}{R_{G_n}(w)}
\le \left|\frac{L_{G_n}(w)}{R_{G_n}(w)} - 1\right| + \frac{\Big|\,\|\mu - \hat\mu_{\hat G_n}(w)\|^2 - \|\mu - \hat\mu(w)\|^2\,\Big|}{R_{G_n}(w)}. \tag{A9}
$$
From Lemma A2, we know that $|L_{G_n}(w)/R_{G_n}(w) - 1| = o_p(1)$. Therefore, to prove (A3), it suffices to verify that the second term on the right-hand side of Equation (A9) converges to 0 in probability.
$$
\begin{aligned}
\frac{\Big|\,\|\mu - \hat\mu_{\hat G_n}(w)\|^2 - \|\mu - \hat\mu(w)\|^2\,\Big|}{R_{G_n}(w)}
&= \frac{\Big|\,2(\mu - \hat\mu(w))'\big(\hat\mu(w) - \hat\mu_{\hat G_n}(w)\big) + \|\hat\mu(w) - \hat\mu_{\hat G_n}(w)\|^2\,\Big|}{R_{G_n}(w)} \\
&\le \frac{2\{L_{G_n}(w)\}^{1/2}\,\|P(w)(\hat Y_W - Y_W)\|}{R_{G_n}(w)} + \frac{\|P(w)(\hat Y_W - Y_W)\|^2}{R_{G_n}(w)}.
\end{aligned}
$$
According to Lemma A3, we have:
$$\|P(w)(\hat Y_W - Y_W)\|^2 \le \lambda_{\max}\big(P(w)\, P'(w)\big)\,\|\hat Y_W - Y_W\|^2 = o_p(1).$$
Combining this with Lemma A3, we can conclude that (A9) is of o p ( 1 ) , which establishes the proof for (A3). □
Proof of Theorem 2. It is evident from Equations (13) and (16) that:
$$\hat C_{\hat G_n}(w) = C_{\hat G_n}(w) + 2\,\mathrm{trace}\{P(w)\hat\Omega(w)\} - 2\,\mathrm{trace}\{P(w)\Omega\}.$$
In conjunction with Theorem 1, it is evident that to prove Theorem 2, we only need to establish:
$$\sup_{w \in H_M}\big|\mathrm{trace}\{P(w)\hat\Omega(w)\} - \mathrm{trace}\{P(w)\Omega\}\big| \big/ R_{G_n}(w) = o_p(1).$$
We denote $Q_m = \mathrm{diag}\{\rho_{11}^{(m)}, \ldots, \rho_{nn}^{(m)}\}$ and $Q(w) = \sum_{m=1}^{M} w_m Q_m$. According to Lemma A1, we have:
$$\lambda_{\max}(\Omega) = O(1).$$
Considering the definition of Ω ^ ( w ) , and employing proof techniques similar to [10,23], we obtain:
$$
\begin{aligned}
&\sup_{w \in H_M}\big|\mathrm{trace}\{P(w)\hat\Omega(w)\} - \mathrm{trace}\{P(w)\Omega\}\big| / R_{G_n}(w) \\
&\quad= \sup_{w \in H_M}\big|\{\hat Y_W - P(w)\hat Y_W\}'\, Q(w)\,\{\hat Y_W - P(w)\hat Y_W\} - \mathrm{trace}\{Q(w)\Omega\}\big| / R_{G_n}(w) \\
&\quad= \sup_{w \in H_M}\big|\{e_W + \mu - P(w)\hat Y_W\}'\, Q(w)\,\{e_W + \mu - P(w)\hat Y_W\} - \mathrm{trace}\{Q(w)\Omega\}\big| / R_{G_n}(w) \\
&\quad\le \sup_{w \in H_M}\big|e_W' Q(w) e_W - \mathrm{trace}\{Q(w)\Omega\}\big| / R_{G_n}(w)
+ 2\sup_{w \in H_M}\big|e_W' Q(w)\{P(w)\hat Y_W - \mu\}\big| / R_{G_n}(w) \\
&\qquad+ \sup_{w \in H_M}\{P(w)\hat Y_W - \mu\}'\, Q(w)\,\{P(w)\hat Y_W - \mu\} / R_{G_n}(w) \\
&\quad\le \sup_{w \in H_M}\big|e_W' Q(w) e_W - \mathrm{trace}\{Q(w)\Omega\}\big| / R_{G_n}(w)
+ 2\sup_{w \in H_M}\big|e_W' Q(w)\{P(w)\mu - \mu\}\big| / R_{G_n}(w) \\
&\qquad+ 2\sup_{w \in H_M}\big|e_W' Q(w) P(w) e_W - \mathrm{trace}\{Q(w) P(w)\Omega\}\big| / R_{G_n}(w)
+ 2\sup_{w \in H_M}\big|\mathrm{trace}\{Q(w) P(w)\Omega\}\big| / R_{G_n}(w) \\
&\qquad+ \sup_{w \in H_M}\{P(w)\hat Y_W - \mu\}'\, Q(w)\,\{P(w)\hat Y_W - \mu\} / R_{G_n}(w) \\
&\quad\equiv T_1 + T_2 + T_3 + T_4 + T_5.
\end{aligned}
$$
Let $\bar\rho = \max_m\max_i\rho_{ii}^{(m)}$. According to Condition (C7), we have:
$$\bar\rho = O\big(n^{-1}\tilde p\big).$$
Given the definition of $R_{G_n}(w)$ and Condition (C4), the following holds:
$$R_{G_n}(w_m^0) \ge \mathrm{trace}\big(P_m\,\Omega\, P_m^T\big) \ge \epsilon\,\mathrm{trace}(P_m) = \epsilon\, p_m,$$
so $\xi_n \to \infty$ and $M\xi_n^{-2J} = o(1)$.
From (A6), (A11) and (A13), the Chebyshev inequality and Theorem 2 of [22], for any $\tau > 0$, there exist constants $c_1$ and $c_2$ such that:
$$
\begin{aligned}
P(T_1 > \tau) &\le \sum_{m=1}^{M} P\big\{\big|e_W' Q_m e_W - \mathrm{trace}(Q_m\Omega)\big| > \tau\xi_n\big\}
\le \tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M} E\big|e_W' Q_m e_W - \mathrm{trace}(Q_m\Omega)\big|^{2J} \\
&\le c_1\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\Big\{\mathrm{trace}\big(\Omega^{1/2}(4J)\, Q_m\,\Omega(4J)\, Q_m\,\Omega^{1/2}(4J)\big)\Big\}^{J} \\
&\le c_1\tau^{-2J}\xi_n^{-2J}\, M\,\lambda_{\max}^{2J}(\Omega(4J))\max_{1 \le m \le M}\{\mathrm{trace}(Q_m)\}^{J}
= \xi_n^{-2J} M\, O\big\{(n^{-1}\tilde p)^{2J}\big\} = o(1), \tag{A15}
\end{aligned}
$$
$$
\begin{aligned}
P(T_3/2 > \tau) &\le \sum_{m=1}^{M} P\big\{\big|e_W' Q_m P_m e_W - \mathrm{trace}(Q_m P_m\Omega)\big| > \tau\xi_n\big\}
\le \tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M} E\big|e_W' Q_m P_m e_W - \mathrm{trace}(Q_m P_m\Omega)\big|^{2J} \\
&\le c_2\tau^{-2J}\xi_n^{-2J}\sum_{m=1}^{M}\Big\{\mathrm{trace}\big(\Omega^{1/2}(4J)\, Q_m P_m\,\Omega(4J)\, P_m^T Q_m\,\Omega^{1/2}(4J)\big)\Big\}^{J} \\
&\le c_2\tau^{-2J}\xi_n^{-2J}\, M\,\lambda_{\max}^{2J}(\Omega(4J))\,\lambda_{\max}^{J}\big(P_m P_m^T\big)\max_{1 \le m \le M}\{\mathrm{trace}(Q_m^2)\}^{J}
= \xi_n^{-2J} M\, O\big\{(n^{-1}\tilde p)^{2J}\big\} = o(1), \tag{A16}
\end{aligned}
$$
$$T_2/2 \le \sup_{w \in H_M}\Big\{\|e_W\|^2\,\bar\rho^2\,\|P(w)\mu - \mu\|^2 / R_{G_n}^2(w)\Big\}^{1/2}
\le \|e_W\|\,\bar\rho\,\xi_n^{-1/2} = \xi_n^{-1/2}\, O\big(n^{-1/2}\tilde p\big) = o(1), \tag{A17}$$
$$
\begin{aligned}
T_4/2 &\le \xi_n^{-1}\bar\rho\,\lambda_{\max}(\Omega)\sup_{w \in H_M}\mathrm{trace}\{P(w)\}
\le \xi_n^{-1}\bar\rho\,\lambda_{\max}(\Omega)\max_m\mathrm{trace}(P_m) \\
&\le \xi_n^{-1}\bar\rho\,\lambda_{\max}(\Omega)\max_m\lambda_{\max}(P_m)\max_m\mathrm{rank}(P_m)
= \xi_n^{-1}\, O\big(n^{-1}\tilde p^2\big) = o(1), \tag{A18}
\end{aligned}
$$
$$T_5 \le \bar\rho\sup_{w \in H_M}\{P(w)\hat Y_W - \mu\}^T\{P(w)\hat Y_W - \mu\} / R_{G_n}(w)
= \bar\rho\sup_{w \in H_M} L_{\hat G_n}(w) / R_{G_n}(w) = O_p\big(n^{-1}\tilde p\big). \tag{A19}$$
Therefore, combining (A15)–(A19), along with Condition (C7), it is clear that Theorem 2 holds. □

References

  1. Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory; Petrov, B., Csáki, F., Eds.; Akadémiai Kiadó: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
  2. Mallows, C.L. Some Comments on Cp. Technometrics 1973, 15, 661–675. [Google Scholar]
  3. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 15–18. [Google Scholar] [CrossRef]
  4. Hjort, N.L.; Claeskens, G. Frequentist Model Average Estimators. J. Am. Stat. Assoc. 2003, 98, 879–899. [Google Scholar] [CrossRef]
  5. Buckland, S.T.; Burnham, K.P.; Augustin, N.H. Model selection: An integral part of inference. Biometrics 1997, 53, 603–618. [Google Scholar] [CrossRef]
  6. Hansen, B.E. Least squares model averaging. Econometrica 2007, 75, 1175–1189. [Google Scholar] [CrossRef]
  7. Wan, A.T.; Zhang, X.; Zou, G. Least squares model averaging by Mallows criterion. J. Econom. 2010, 156, 277–283. [Google Scholar] [CrossRef]
  8. Hansen, B.E.; Racine, J.S. Jackknife model averaging. J. Econom. 2012, 167, 38–46. [Google Scholar] [CrossRef]
  9. Liu, Q.; Okui, R. Heteroscedasticity-robust Cp model averaging. Econom. J. 2013, 16, 463–472. [Google Scholar] [CrossRef]
  10. Zhao, S.; Zhang, X.; Gao, Y. Model averaging with averaging covariance matrix. Econom. Lett. 2016, 145, 214–217. [Google Scholar] [CrossRef]
  11. Miller, R. Least square regression with censored data. Biometrika 1976, 63, 449–464. [Google Scholar] [CrossRef]
  12. Buckley, J.; James, I. Linear regression with censored data. Biometrika 1979, 66, 429–436. [Google Scholar] [CrossRef]
  13. Koul, H.; Susarla, V.; Van Ryzin, J. Regression analysis with randomly right-censored data. Ann. Stat. 1981, 9, 1276–1288. [Google Scholar] [CrossRef]
  14. He, S.; Huang, X. Central limit theorem of linear regression model under right censorship. Sci. China Ser. A-Math. 2003, 46, 600–610. [Google Scholar] [CrossRef]
  15. Wang, Q.; Dinse, G.E. Linear regression analysis of survival data with missing censoring indicators. Lifetime Data Anal. 2011, 17, 256–279. [Google Scholar] [CrossRef]
  16. Liang, Z.Q.; Chen, X.L.; Zhou, Y.Q. Mallows model averaging estimation for linear regression model with right censored data. Acta Math. Appl. Sin. Engl. Ser. 2022, 38, 5–23. [Google Scholar] [CrossRef]
  17. Wei, Y.; Wang, Q.; Liu, W. Model averaging for linear models with responses missing at random. Ann. Inst. Stat. Math. 2020, 73, 535–553. [Google Scholar] [CrossRef]
  18. Liu, Q.; Okui, R.; Yoshimura, A. Generalized least squares model averaging. Econom. Rev. 2016, 35, 1692–1752. [Google Scholar] [CrossRef]
  19. Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10. [Google Scholar] [CrossRef]
  20. Zhu, R.; Wan, A.T.; Zhang, X.; Zou, G. A Mallows-type model averaging estimator for the varying-coefficient partially linear model. J. Am. Stat. Assoc. 2019, 114, 882–892. [Google Scholar] [CrossRef]
  21. Dong, Q.K.; Liu, B.X.; Zhao, H. Weighted least squares model averaging for accelerated failure time models. Comput. Stat. Data Anal. 2023, 184, 107743. [Google Scholar] [CrossRef]
  22. Whittle, P. Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab. Appl. 1960, 5, 302–305. [Google Scholar] [CrossRef]
  23. Qiu, Y.; Wang, W.; Xie, T.; Yu, J.; Zhang, X. Boosting Store Sales Through Machine Learning-Informed Promotional Decisions. 2023. Available online: http://www.mysmu.edu/faculty/yujun/Research/Maml_sales.pdf (accessed on 10 February 2024).
Figure 1. Mean Squared Errors (MSEs) of various methods under different sample sizes and censoring rates at MR = 20%.
Figure 2. Mean Squared Errors (MSEs) of various methods under different sample sizes and censoring rates at MR = 40%.
Table 1. The mean, optimal rate and standard deviation of NMSPE.

Training size | Metric             | AIC    | SAIC   | BIC    | SBIC   | MCIMA  | HCIMA
50%           | Mean               | 1.3628 | 1.3370 | 1.3517 | 1.3345 | 1.2765 | 1.2165
              | Standard deviation | 0.5283 | 0.5060 | 0.5123 | 0.4970 | 0.4039 | 0.3500
              | Optimal rate       | 0.084  | 0.137  | 0.042  | 0.093  | 0.306  | 0.338
60%           | Mean               | 1.3663 | 1.3388 | 1.3556 | 1.3404 | 1.2651 | 1.1800
              | Standard deviation | 0.5504 | 0.5166 | 0.5151 | 0.5068 | 0.4343 | 0.3066
              | Optimal rate       | 0.094  | 0.119  | 0.049  | 0.091  | 0.288  | 0.359
70%           | Mean               | 1.3347 | 1.3213 | 1.3361 | 1.3259 | 1.2451 | 1.1766
              | Standard deviation | 0.5324 | 0.5232 | 0.5288 | 0.5257 | 0.3433 | 0.2966
              | Optimal rate       | 0.097  | 0.140  | 0.057  | 0.079  | 0.259  | 0.368
80%           | Mean               | 1.2794 | 1.2619 | 1.2828 | 1.2628 | 1.2034 | 1.1504
              | Standard deviation | 0.4865 | 0.4714 | 0.4861 | 0.4777 | 0.2941 | 0.2030
              | Optimal rate       | 0.083  | 0.165  | 0.063  | 0.129  | 0.240  | 0.320