1. Introduction
This work proposes a general variable selection method for length-biased and interval-censored failure time data under the classical proportional hazards (PH) model. Interval-censored data arise when each failure time of interest cannot be measured exactly but is only known to lie in a certain time interval formed by periodic follow-ups [1]. Such data are frequently encountered in many scientific studies, including clinical trials and epidemiological surveys, and their regression analysis has been discussed extensively in the literature; see [2,3,4,5,6,7,8] for details. Specifically, Zeng et al. [4], Wang et al. [6] and Zeng et al. [7] investigated inference procedures for the additive hazards, PH and transformation models, respectively.
In addition to interval censoring, left truncation is also frequently encountered in prospective cohort studies, inducing non-randomly selected samples from the target population. A typical example of left truncation occurs in the PLCO study, where individuals with any of the PLCO cancers at the onset of the study were not enrolled [6,9]. In particular, when the truncation times follow the uniform distribution (also known as the length-biased or stationarity assumption), left-truncated data reduce to the length-biased data discussed by many authors, including but not limited to Wang [10], Shen et al. [11] and Ning et al. [12].
The analysis of length-biased data under right censoring has been investigated extensively in the literature [11,13,14,15,16]. To name a few examples, Shen et al. [11] presented unbiased estimating equation approaches for the transformation and accelerated failure time models. Qin and Shen [13] developed an inverse-weighted estimating equation approach for the PH model. Qin et al. [14] developed new expectation-maximization (EM) algorithms to estimate the survival function of the failure time. For length-biased and interval-censored data, Gao and Chan [15] developed an EM algorithm for the PH model via a two-stage data augmentation. Further, Shen et al. [16] considered the mixture PH model with a nonsusceptible or cured fraction.
In many practical applications, one may collect a large number of candidate covariates, but in general, only a few of them are useful for modeling the failure time of interest. In such cases, penalized variable selection provides a useful tool to eliminate irrelevant variables and further enhance estimation accuracy. Popular penalty functions include LASSO [17], SCAD [18], adaptive LASSO (ALASSO) [19], SICA [20], SELO [21], MCP [22] and BAR [23,24]. In particular, Fan et al. [25] provided a comprehensive review of variable selection methods and the corresponding algorithms. In recent years, machine learning-based methods have also gained considerable attention due to their ability to identify relevant features. To name a few examples, Garavand et al. [26] used clinical examination features and compared different machine learning algorithms in developing a model for the early diagnosis of coronary artery disease. Hosseini et al. [27] used blood microscopic images and a convolutional neural network for detecting and classifying B-acute lymphoblastic leukemia. Garavand et al. [28] conducted a systematic review of advanced techniques for facilitating the rapid diagnosis of coronary artery disease. Ghaderzadeh and Aria [29] conducted a systematic review of artificial intelligence techniques for COVID-19 detection.
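To make the penalties listed above concrete, the following sketch (an illustrative Python snippet, not code from any of the cited works) evaluates the LASSO, SCAD and MCP penalty functions at a point; the tuning constants a = 3.7 for SCAD and gamma = 3 for MCP are common default choices, not values taken from this paper.

```python
import numpy as np

def lasso_penalty(t, lam):
    # LASSO: lam * |t|
    return lam * np.abs(t)

def scad_penalty(t, lam, a=3.7):
    # SCAD penalty (Fan and Li), piecewise in |t|
    t = np.abs(t)
    p1 = lam * t
    p2 = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    p3 = lam**2 * (a + 1) / 2
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))

def mcp_penalty(t, lam, gamma=3.0):
    # MCP penalty (Zhang), quadratic then constant in |t|
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    0.5 * gamma * lam**2)
```

Unlike LASSO, whose penalty grows linearly without bound, SCAD and MCP flatten out for large coefficients, which is what reduces the bias on strong signals.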
Regarding left-truncated failure time data, a number of variable selection methods have been proposed. In particular, Chen [30] considered right-censored data and developed a variable selection method for the additive hazards model with covariate measurement errors. He et al. [31] also considered right-censored data and performed variable selection with penalized estimating equations for the accelerated failure time model. Li et al. [32] developed a conditional likelihood-based variable selection method for left-truncated and interval-censored data under the PH model. However, it is worth noting that the work of Li et al. [32] only involved the ALASSO penalty, and their method can be anticipated to lose some efficiency because it ignores the distributional information of the truncation times.
In this paper, we offer an efficient penalized likelihood method to achieve variable selection in the PH model with length-biased and interval-censored data. Compared with the traditional conditional likelihood method of Li et al. [32], the proposed method yields an efficiency gain by fully taking into account the distributional information of the truncation times. In particular, to optimize the penalized likelihood function, which has an intractable form, we develop a penalized EM algorithm by introducing pseudo-left-truncated data and Poisson random variables. The proposed method is easy to implement, computationally stable, and has desirable advantages over the variable selection method based on the penalized conditional likelihood [32]. An application to a real data set arising from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial demonstrates the practical usefulness of the proposed method.
The PLCO cancer screening trial is a large-scale multicenter trial conducted to screen for the PLCO cancers and investigate cancer-related mortality. To date, motivated by the rich data structure of the PLCO database, various statistical methods have been proposed in the literature. To name a few examples, Wang et al. [6] developed an EM algorithm to estimate a spline-based PH model with interval-censored data. Sun et al. [33] considered variable selection in a semiparametric nonmixture cure model with interval-censored data. Li and Peng [34] investigated instrumental variable estimation of the complier causal treatment effect with interval-censored data. Withana Gamage et al. [35] considered the estimation of the PH model with left-truncated and arbitrarily interval-censored data.
The remainder of this paper is organized as follows. In Section 2, we introduce the notation, assumptions and the corresponding likelihood. Section 3 presents the proposed penalized EM algorithm, and Section 4 establishes the oracle property of the proposed estimators. In Section 5, a simulation study is conducted to assess the variable selection performance and estimation accuracy of the proposed method, followed by an application in Section 6. Discussions and conclusions are given in Section 7, and Section 8 provides several potential future research directions.
2. Notation, Model and Penalized Likelihood
For the target population, let $\widetilde{T}$, $\widetilde{A}$ and $\widetilde{Z}$ denote the failure time of interest (e.g., the time to the onset of the failure event), the truncation time (e.g., the time to the study enrollment) and the p-dimensional vector of covariates, respectively. Given $\widetilde{Z}$, the PH model specifies that the conditional cumulative hazard function of $\widetilde{T}$ takes the form

$$ \Lambda(t \mid \widetilde{Z}) = \Lambda_0(t) \exp(\widetilde{Z}^\top \beta), \qquad (1) $$

where $\beta$ is the p-dimensional vector of unknown regression coefficients and $\Lambda_0(\cdot)$ is an unknown increasing cumulative baseline hazard function. Let d denote the number of nonzero components of $\beta$, and let $\beta_1$ and $\beta_2$ denote the corresponding d-dimensional nonzero and $(p-d)$-dimensional zero component vectors, respectively.
Under the left-truncation scheme, only individuals with $\widetilde{T} \ge \widetilde{A}$ are enrolled in the study, and the failure time, truncation time and covariate vector that satisfy this constraint are denoted by T, A and Z, respectively. Then, $(T, A, Z)$ has the same joint distribution as $(\widetilde{T}, \widetilde{A}, \widetilde{Z})$ given $\widetilde{T} \ge \widetilde{A}$ [36]. As mentioned above, if the truncation time $\widetilde{A}$ is further assumed to follow the uniform distribution on $[0, \tau]$, where $\tau$ is the upper bound of the support of $\widetilde{T}$, we have the length-biased sampling mechanism [10,14]. Let $f(t \mid \widetilde{Z})$ and $S(t \mid \widetilde{Z})$ denote the density and survival functions of $\widetilde{T}$ given $\widetilde{Z}$, respectively. Under the assumption that $\widetilde{A}$ is independent of $\widetilde{T}$ given $\widetilde{Z}$, the joint density function of $(T, A)$ given Z evaluated at $(t, a)$ with $t \ge a \ge 0$ is

$$ \frac{f(t \mid Z)\, g(a \mid Z)}{P(\widetilde{T} \ge \widetilde{A} \mid Z)}, \qquad (2) $$

where $g(a \mid Z)$ denotes the density function of $\widetilde{A}$ at time a and equals $\tau^{-1}$ under the length-biased sampling scheme.
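The length-biased sampling mechanism just described can be mimicked by simple rejection sampling: draw population pairs, keep only those with the failure time at least the uniform truncation time, and the retained failure times are length-biased. The sketch below is an illustrative Python simulation; the exponential baseline and the numeric constants are arbitrary choices for the demonstration, not specifications from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 10.0        # upper bound of the support, used for the uniform truncation time
n_pop = 200_000

# population failure times (exponential baseline, chosen only for illustration)
t_pop = rng.exponential(scale=2.0, size=n_pop)
t_pop = np.minimum(t_pop, tau)              # keep the support bounded by tau
a_pop = rng.uniform(0.0, tau, size=n_pop)   # uniform truncation times (stationarity)

keep = t_pop >= a_pop                        # left truncation: only T >= A is observed
t_obs, a_obs = t_pop[keep], a_pop[keep]

# length bias: the observed failure times are stochastically larger
print(t_obs.mean() > t_pop.mean())  # True
```

The inflation of the observed mean relative to the population mean is exactly the bias that the likelihood in the next paragraphs corrects for.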
Consider a failure time study that recruits n subjects, where each failure time is interval-censored owing to periodic examinations for the occurrence of the failure event. For $i = 1, \ldots, n$, denote by $T_i$, $A_i$ and $Z_i$ the failure time, truncation time and covariate vector of the ith subject, respectively. We assume that there exists a sequence of examination times $A_i \le U_{i1} < \cdots < U_{iM_i}$ for subject i, where $M_i$ is a random positive integer, and define $U_{i0} = A_i$. Let $L_i$ and $R_i$ denote the endpoints of the smallest examination interval that brackets $T_i$; that is, $L_i = \max\{U_{ij}: U_{ij} < T_i\}$ and $R_i = \min\{U_{ij}: U_{ij} \ge T_i\}$, with $R_i = \infty$ if $T_i > U_{iM_i}$. Clearly, $T_i$ is left-censored if $L_i = A_i$, and $T_i$ is right-censored if $R_i = \infty$. Then, the observed data consist of $\{(A_i, L_i, R_i, Z_i): i = 1, \ldots, n\}$. Under the assumption that the examination times are independent of the failure and truncation times given the covariates, the likelihood function based on the observed data can be written as

$$ L_n(\beta, \Lambda_0) = \prod_{i=1}^{n} \frac{S(L_i \mid Z_i) - S(R_i \mid Z_i)\, I(R_i < \infty)}{\mu(Z_i)}, \qquad (3) $$

where $S(t \mid Z) = \exp\{-\Lambda_0(t)\, e^{Z^\top \beta}\}$ and $\mu(Z) = \int_0^\tau S(t \mid Z)\, dt$.
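The bracketing of a failure time by its adjacent examination times can be made concrete with a few lines of code. The sketch below is an illustrative Python helper (not from the paper); the truncation time plays the role of the 0th examination time, and an infinite right endpoint signals right censoring.

```python
import numpy as np

def bracket(t, a, exams):
    """Endpoints (L, R) of the smallest interval bracketing t, where a is the
    truncation time (the 0th examination time) and R = inf means right-censored."""
    grid = np.sort(np.asarray(exams, dtype=float))
    if t <= grid[0]:
        return a, grid[0]            # left-censored: event before the first exam
    if t > grid[-1]:
        return grid[-1], np.inf      # right-censored: event after the last exam
    j = np.searchsorted(grid, t)     # index of the first examination time >= t
    return grid[j - 1], grid[j]

print(bracket(2.5, 0.5, [1.0, 2.0, 3.0]))  # (2.0, 3.0)
```

Only the pair (L, R) enters the likelihood; the exact failure time never appears in the observed data.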
Essentially, the likelihood (3) is the product of the marginal likelihood and the conditional likelihood; that is, $L_n = L_{1n} \times L_{2n}$, where

$$ L_{1n} = \prod_{i=1}^{n} \frac{S(A_i \mid Z_i)}{\mu(Z_i)} $$

and

$$ L_{2n} = \prod_{i=1}^{n} \frac{S(L_i \mid Z_i) - S(R_i \mid Z_i)\, I(R_i < \infty)}{S(A_i \mid Z_i)}. $$

In the above, $I(\cdot)$ denotes the indicator function; $L_{1n}$ is the marginal likelihood of the $A_i$'s given the $Z_i$'s; and $L_{2n}$ is the conditional likelihood given the $A_i$'s. Notably, the commonly used conditional likelihood method only utilizes $L_{2n}$ for inference, which can be anticipated to lose some estimation efficiency because $L_{1n}$ also involves the parameters in model (1).
For the nuisance function $\Lambda_0(\cdot)$, we propose to approximate it with a step function that has non-negative jumps at the unique examination times. Specifically, let $t_1 < \cdots < t_K$ denote the ordered unique finite values of the $A_i$'s, $L_i$'s and $R_i$'s, where K is an integer determined by the observed data. For $k = 1, \ldots, K$, denote by $\lambda_k$ the non-negative jump size of $\Lambda_0(\cdot)$ at $t_k$. Then, we have $\Lambda_0(t) = \sum_{k: t_k \le t} \lambda_k$, and the likelihood function (3) can be rewritten as

$$ L_n(\beta, \lambda_1, \ldots, \lambda_K) = \prod_{i=1}^{n} \frac{\exp\Big\{-\sum_{k: t_k \le L_i} \lambda_k e^{Z_i^\top \beta}\Big\} - \exp\Big\{-\sum_{k: t_k \le R_i} \lambda_k e^{Z_i^\top \beta}\Big\}\, I(R_i < \infty)}{\mu(Z_i)}, \qquad (4) $$

where $\mu(Z_i) = \int_0^\tau \exp\{-\sum_{k: t_k \le t} \lambda_k e^{Z_i^\top \beta}\}\, dt$.
To accomplish variable selection and estimate the nonzero parameters simultaneously, we propose to maximize the following penalized log-likelihood:

$$ \ell_p(\beta, \Lambda_0) = \log L_n(\beta, \Lambda_0) - n \sum_{j=1}^{p} P_{\lambda_n}(|\beta_j|), \qquad (5) $$

where $P_{\lambda_n}(\cdot)$ denotes a penalty function that depends on the tuning parameter $\lambda_n > 0$. In what follows, we provide a general maximization procedure for (5) under various commonly adopted penalty functions, such as LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR [17,18,19,20,21,22,23,24]. Because the penalized log-likelihood (5) has an intractable form, direct maximization with existing software is extremely difficult and unstable; this is the case even without the penalty term, as shown in Gao and Chan [15]. In the next section, we propose a reliable and stable penalized EM algorithm to overcome this computational challenge.
3. Estimation Procedure
The proposed penalized EM algorithm involves two layers of data augmentation, which aim at simplifying the form of (4) and obtaining a tractable objective function. In the first stage of data augmentation, for the ith subject, we introduce a set of independent pseudo-truncated data, also referred to as "ghost data" [37], consisting of $N_i$ latent failure times that fall before their associated truncation times, where the random integer $N_i$ follows a negative binomial distribution with success probability $P(\widetilde{T} \ge \widetilde{A} \mid Z_i)$. Given $N_i$, the latent failure times are allocated to the intervals formed by the $t_k$'s according to a multinomial distribution whose cell probabilities are determined by the current model parameters. In the above, $\tau$ is the finite upper bound of the support of $\widetilde{A}$ and can be specified as the largest finite observation time in practice [14]. After deleting some constants that are irrelevant to the parameters to be estimated, we obtain the augmented likelihood function (6) based on the observed and pseudo data.
In the second stage, for the ith subject, we introduce independent latent variables $W_{ik}$, $k = 1, \ldots, K$, where $W_{ik}$ is a Poisson random variable with mean $\lambda_k \exp(Z_i^\top \beta)$. Then, the likelihood function (6) can be re-expressed in terms of the Poisson variables. Let $\phi(\cdot; \mu)$ denote the Poisson probability mass function with mean $\mu$. By treating the latent $N_i$'s and $W_{ik}$'s as observable, we have the complete data likelihood, in which we require that $\sum_{k: t_k \le L_i} W_{ik} = 0$, and that $\sum_{k: L_i < t_k \le R_i} W_{ik} \ge 1$ if $R_i < \infty$.
Let $\theta = (\beta^\top, \lambda_1, \ldots, \lambda_K)^\top$, and let $\theta^{(m)}$ be the update of $\theta$ at the mth iteration, with $\theta^{(0)}$ denoting the initial value. Based on $\theta^{(m)}$, we can present the expectation step (E-step) and the maximization step (M-step) of the proposed algorithm. In the E-step, we calculate the conditional expectations of the latent variables given the observed data and $\theta^{(m)}$; this step yields the expected complete-data log-likelihood in (7). In particular, at the mth iteration of the algorithm, the required conditional expectations of the $N_i$'s and the $W_{ik}$'s have closed-form expressions. For notational simplicity, we omit the conditioning arguments, including the observed data and the current estimates of the parameters, in these conditional expectations.
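E-step expectations of this kind often reduce to moments of truncated Poisson variables; for example, the conditional mean of a Poisson count W with mean mu, given W >= 1, is mu / (1 - exp(-mu)). The snippet below checks this closed form against direct summation of the probability mass function (illustrative Python, not the paper's exact E-step formulas):

```python
import math

def zero_truncated_poisson_mean(mu):
    # E[W | W >= 1] for W ~ Poisson(mu)
    return mu / (1.0 - math.exp(-mu))

# direct evaluation of E[W | W >= 1] by summing the pmf
mu = 0.8
num = sum(k * math.exp(-mu) * mu**k / math.factorial(k) for k in range(1, 60))
den = 1.0 - math.exp(-mu)
direct = num / den

print(abs(direct - zero_truncated_poisson_mean(mu)) < 1e-10)  # True
```

Closed forms like this are what make the E-step cheap: no numerical integration is required at any iteration.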
In the M-step of the algorithm, by solving $\partial Q(\theta; \theta^{(m)}) / \partial \lambda_k = 0$, we obtain a closed-form expression for each $\lambda_k$, given in (8). Next, by plugging (8) into (7), we obtain an objective function that involves only the unknown parameter $\beta$. To obtain the sparse estimator of $\beta$, we propose to minimize the objective function (9), namely the negative of this profiled objective plus the penalty term $n \sum_{j=1}^{p} P_{\lambda_n}(|\beta_j|)$.
For LASSO and ALASSO, the modified shooting algorithm given in Zhang and Lu [38] and others can be adopted to minimize (9). For BAR, a closed-form solution for $\beta$ is available [39]. For the other penalties, after applying a local linear approximation to the penalty function [40], one can also adopt the modified shooting algorithm to minimize (9).
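For readers unfamiliar with the shooting algorithm, the sketch below shows generic Python coordinate descent for a LASSO-penalized least-squares criterion; it is a stand-in illustration of the coordinate-wise soft-thresholding update, not the authors' implementation of (9), and the data are simulated for the demonstration.

```python
import numpy as np

def soft_threshold(z, t):
    # soft-thresholding operator, the coordinate-wise LASSO update
    return np.sign(z) * max(abs(z) - t, 0.0)

def shooting_lasso(X, y, lam, n_iter=200):
    """Coordinate descent ("shooting") for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return b

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
b_hat = shooting_lasso(X, y, lam=5.0)
print(np.round(b_hat, 3))
```

Each coordinate update has a closed form, which is why the shooting algorithm remains stable even when some coefficients are driven exactly to zero.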
In summary, for the given $\lambda_n$ and initial estimator $\theta^{(0)}$, we repeat the E-step and M-step until the convergence criterion is satisfied, rendering the sparse estimators of the regression parameters. It is worth pointing out that the proposed algorithm is insensitive to the choice of the initial value $\theta^{(0)}$. In practice, one can simply set the initial value of each component of $\beta$ to 0 and the initial value of each jump size $\lambda_k$ to a common small positive constant, for $k = 1, \ldots, K$. The algorithm is declared to have converged when the summation of the absolute differences in the estimates between two successive iterations is less than a prespecified small positive number.
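The stopping rule can be coded generically: iterate the combined E/M update until the summed absolute change in the parameter vector falls below a tolerance. Below is a minimal Python skeleton with a placeholder contraction update standing in for one E+M sweep (the real update would be the steps described above); the fixed point and tolerance are arbitrary illustration values.

```python
import numpy as np

def run_em(update, theta0, tol=1e-6, max_iter=1000):
    """Iterate theta <- update(theta) until the summed absolute
    difference between successive iterates drops below tol."""
    theta = np.asarray(theta0, dtype=float)
    for it in range(max_iter):
        theta_new = update(theta)
        if np.sum(np.abs(theta_new - theta)) < tol:
            return theta_new, it + 1
        theta = theta_new
    return theta, max_iter

# placeholder "update" with known fixed point (2, -1), standing in for one E+M sweep
fixed_point = np.array([2.0, -1.0])
theta_hat, n_iter = run_em(lambda th: 0.5 * (th + fixed_point), np.zeros(2))
print(np.round(theta_hat, 4))
```

Because the penalized EM updates are monotone in the penalized likelihood, this simple absolute-difference criterion is usually sufficient in practice.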
To select the optimal $\lambda_n$, we follow Li et al. [39] and others and adopt the BIC criterion, which is defined as

$$ \mathrm{BIC} = -2\, \ell(\hat{\beta}, \hat{\Lambda}_0) + d_n \log n, $$

where $\hat{\beta}$ is the final estimator of $\beta$, $\hat{\Lambda}_0$ is the final estimator of $\Lambda_0$, $\ell(\cdot, \cdot)$ denotes the logarithm of (4) and $d_n$ is the total number of nonzero estimates in $\hat{\beta}$ and $\{\hat{\lambda}_k\}$. For a given set of candidate values of $\lambda_n$, the optimal $\lambda_n$ can be set as the one that yields the smallest BIC.
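The tuning-parameter search is a plain grid minimization of the BIC. In the Python sketch below, `fit_penalized` is a hypothetical stand-in for the penalized EM fit (it is not the paper's code), and the toy fit function is contrived so that the grid search has a well-defined minimizer.

```python
import numpy as np

def select_lambda(fit_penalized, lambdas, n):
    """Pick the tuning parameter whose fit minimizes BIC = -2*loglik + df*log(n).

    fit_penalized(lam) must return (loglik, n_nonzero); here it is a
    hypothetical stand-in for the penalized EM algorithm.
    """
    best_lam, best_bic = None, np.inf
    for lam in lambdas:
        loglik, df = fit_penalized(lam)
        bic = -2.0 * loglik + df * np.log(n)
        if bic < best_bic:
            best_lam, best_bic = lam, bic
    return best_lam, best_bic

# toy stand-in: the log-likelihood degrades slowly while the model shrinks
toy_fit = lambda lam: (-50.0 - 10.0 * lam, max(1, int(10 - 8 * lam)))
lam_opt, bic_opt = select_lambda(toy_fit, np.linspace(0.1, 1.0, 20), n=200)
print(round(float(lam_opt), 3))  # 0.905
```

The BIC trades fit against the number of retained parameters, so the selected lambda sits where further shrinkage stops paying for itself in degrees of freedom.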
4. Asymptotic Properties
Without loss of generality, we write $\beta = (\beta_1^\top, \beta_2^\top)^\top$, where $\beta_1$ includes the first d components of $\beta$ that are nonzero and $\beta_2$ consists of the remaining zero components. Denote the true value of $\beta$ by $\beta_0 = (\beta_{10}^\top, \beta_{20}^\top)^\top$, where $\beta_{10}$ is the true value of $\beta_1$ and $\beta_{20}$ is the true value of $\beta_2$. Let $\hat{\beta} = (\hat{\beta}_1^\top, \hat{\beta}_2^\top)^\top$ be the estimator of $\beta$ obtained from the method proposed above, where $\hat{\beta}_1$ denotes the estimate of $\beta_1$ and $\hat{\beta}_2$ denotes the estimate of $\beta_2$. In what follows, we establish the asymptotic properties of $\hat{\beta}$.
For any penalty function $P_{\lambda_n}(\cdot)$ with tuning parameter $\lambda_n$, we let $\rho(t; \lambda_n) = \lambda_n^{-1} P_{\lambda_n}(t)$ and assume that $\rho(t; \lambda_n)$ belongs to the function class considered in Lv and Fan [20], namely the class of functions that are increasing and concave in $t \in [0, \infty)$ and have a continuous derivative on $(0, \infty)$ with $\rho'(0+; \lambda_n) > 0$. This class is quite general and includes the penalty functions considered in this work. To establish the asymptotic properties of $\hat{\beta}$, we need the following regularity conditions.
- (C1)
The true regression parameter $\beta_0$ lies in the interior of a compact set of $\mathbb{R}^p$, and the true cumulative baseline hazard function $\Lambda_0(\cdot)$ is continuously differentiable and positive on $(0, \tau]$, where $[0, \tau]$ covers the union of the supports of the truncation and failure times. In addition, we assume that $\Lambda_0(\tau) < \infty$.
- (C2)
The covariate vector Z is bounded with probability one, and the covariance matrix of Z is positive definite.
- (C3)
The number of examination times, M, is positive and $E(M) < \infty$. Additionally, there exists a positive constant $\eta$ such that, with probability one, any two adjacent examination times are separated by at least $\eta$. Furthermore, there exists a probability measure $\mu$ in $\mathbb{R}^2$ such that the bivariate distribution function of each pair of adjacent examination times conditional on $(M, Z)$ is dominated by $\mu$, and its Radon–Nikodym derivative can be expanded to a positive function that has twice-continuous derivatives with respect to its two arguments u and v.
Conditions (C1) and (C2) are standard in failure time data analysis [7]. Condition (C3) pertains to the joint distribution of the examination times and ensures that two adjacent examination times are separated by at least $\eta$; otherwise, the data may contain exactly observed failure times, which require different theoretical treatments. Note that conditions (C1)–(C3) are used to establish the root-n consistency of the unpenalized maximum likelihood estimator of the regression vector [7], which is required for handling the penalty term in the penalized likelihood. These conditions also ensure that the log profile likelihood has a quadratic expansion around $\beta_0$ [7].
Theorem 1 (root-n consistency). Under conditions (C1) to (C3), if $\sqrt{n}\, \lambda_n = O_p(1)$, then $\|\hat{\beta} - \beta_0\| = O_p(n^{-1/2})$, where $\|\cdot\|$ denotes the Euclidean norm of a given vector.
Theorem 2 (oracle property). Under conditions (C1) to (C3), if $\lambda_n \to 0$ and $\sqrt{n}\, \lambda_n \to \infty$, then $\hat{\beta}$ has the following properties:
- 1.
(Sparsity) $P(\hat{\beta}_2 = 0) \to 1$;
- 2.
(Asymptotic normality) $\sqrt{n}(\hat{\beta}_1 - \beta_{10}) \to_d N(0, \Sigma_1)$, where $\Sigma_1$ is the upper-left $d \times d$ sub-matrix of the inverse of the efficient Fisher information matrix for β, denoted by $\mathcal{I}(\beta_0)$.
Theorem 1 indicates that $\hat{\beta}$ is consistent, and Theorem 2 (i) implies that $\hat{\beta}$ is sparse and has the selection consistency property, that is, $P(\hat{\beta}_2 = 0) \to 1$. Theorem 2 (ii) implies that the estimators of the nonzero regression parameters are semiparametrically efficient. The detailed proofs of Theorems 1 and 2 under the ALASSO penalty are given in Appendix A. For the other penalty functions in the class above, one can prove the two theorems with analogous techniques, which are omitted in this paper.
5. A Simulation Study
We conducted a simulation study to evaluate the finite-sample performance of the proposed penalized EM algorithm. We first assumed that there exist 10 covariates following the marginal standard normal distribution with a common pairwise correlation. We set the true value of $\beta$, denoted by $\beta_0$, to be a sparse vector whose nonzero components represented either large or weak effects. The truncation time $\widetilde{A}$ followed the uniform distribution on $[0, \tau]$. The failure time of interest $\widetilde{T}$ was generated from model (1). Because we considered length-biased sampling, only the pairs satisfying $\widetilde{T} \ge \widetilde{A}$ were kept in the simulated data, and these were denoted as $\{(T_i, A_i, Z_i)\}$.
To construct interval censoring, for subject i, we generated a series of potential examination times, with the gap times between successive examinations drawn from a prespecified distribution. Then, $(L_i, R_i)$ was defined as the smallest examination interval that brackets $T_i$. On average, we had about 30–44% left-censored observations and 19–30% right-censored ones. We considered some classical penalty functions, including LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR [17,18,20,21,22,24]. To find the optimal $\lambda_n$ for each penalty, we considered 20 equally spaced points over an interval $[a, b]$ and selected the one that minimized the BIC. In particular, b was chosen to guarantee that all the regression parameter estimates were penalized to zero, while a was chosen to ensure that all the covariates were selected. The following results are based on two sample sizes (the larger being $n = 400$) and 100 replications.
To assess the variable selection performance, we calculated the false positives (FP) and true positives (TP), defined as the average number of selected covariates whose true coefficients are zero and the average number of selected covariates whose true coefficients are nonzero, respectively. To measure the estimation accuracy, we reported the median of the mean squared errors (MMSE) and the standard deviation of the mean squared errors (SD), where the mean squared error is defined as $(\hat{\beta} - \beta_0)^\top \Sigma (\hat{\beta} - \beta_0)$ and $\Sigma$ denotes the population covariance matrix of the covariate vector Z.
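The performance metrics above can be computed in a few lines. The following is an illustrative Python sketch (not the authors' evaluation code); `Sigma` denotes the covariate covariance matrix that weights the squared-error criterion.

```python
import numpy as np

def fp_tp(beta_hat, beta_true, eps=1e-8):
    """FP: selected covariates whose true coefficient is zero.
    TP: selected covariates whose true coefficient is nonzero."""
    selected = np.abs(beta_hat) > eps
    fp = int(np.sum(selected & (beta_true == 0)))
    tp = int(np.sum(selected & (beta_true != 0)))
    return fp, tp

def mse(beta_hat, beta_true, Sigma):
    # weighted squared error (beta_hat - beta_true)' Sigma (beta_hat - beta_true)
    d = beta_hat - beta_true
    return float(d @ Sigma @ d)

beta_true = np.array([1.0, 0.0, -0.8, 0.0])
beta_hat = np.array([0.9, 0.05, -0.7, 0.0])
print(fp_tp(beta_hat, beta_true), round(mse(beta_hat, beta_true, np.eye(4)), 4))
```

Across replications, FP and TP are averaged, while the MSE values are summarized by their median (MMSE) and standard deviation (SD).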
Table 1 and Table 2 present the results obtained by the proposed method under the two configurations of $\beta_0$ (large and weak effects), respectively. In both tables, we also present the results of the oracle estimation and of the analysis method without variable selection. The results in Table 1 and Table 2 show that almost all the penalty functions gave a similar variable selection performance; the only exception is LASSO, which yielded a slightly larger FP. This observation is to be expected because LASSO often selects more noise variables than the other penalty functions [39]. In addition, one can see from the tables that, except for LASSO, the MMSEs yielded by the penalty functions were close to those of the oracle estimation and smaller than those of the analysis method without variable selection. As the sample size increased, both the variable selection performance and the estimation accuracy improved for all the penalty functions.
For comparison, we also include in Table 1 and Table 2 the results obtained by the variable selection method based on the penalized conditional likelihood (PCL); the detailed implementation of the PCL method is given in Appendix B. Notably, Li et al. [32] also maximized the PCL to conduct variable selection but only considered the ALASSO penalty. It is clear that, compared with the PCL approach, the proposed method yielded smaller MMSEs, implying more accurate estimation. Furthermore, the SDs obtained by the proposed method are smaller than those of the PCL method, because the PCL method ignores the distributional information of the truncation times and thus loses some estimation efficiency.
In this study, we also considered other simulation settings with larger numbers of covariates, up to p = 50. Specifically, the true regression parameter was kept sparse in each case, and the other simulation specifications were the same as above. The simulation results, presented in Table 3 and Table 4, show similar conclusions to the above. In particular, the proposed method with ALASSO, SCAD, SELO, SICA, MCP and BAR yielded much smaller MMSEs than the analysis method without variable selection as p increased. This clearly demonstrates the necessity of conducting variable selection in the presence of a large number of covariates.
6. An Application
6.1. Background and Analysis Methods
We applied the proposed method to a set of real data arising from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial [9,41]. Sponsored by the National Cancer Institute, the PLCO cancer screening trial was initiated in 1993 and recruited participants, aged 55 to 74, who had not previously taken part in any other cancer screening trial, at ten screening centers nationwide. In particular, the participants who were randomly assigned to the screening group received the Prostate-Specific Antigen (PSA) test periodically over 13 years. If abnormally high PSA levels were found, a prostate biopsy was conducted to determine the occurrence status of prostate cancer. In this study, we focused on the prostate cancer screening data in the screening group and aimed at identifying important risk factors for the development of prostate cancer. Prostate cancer generally causes no signs or symptoms in the early stages, but as the disease progresses, it can cause serious complications, such as urination problems and anemia; exploring its risk factors is therefore of pressing need and is also beneficial for conducting early prevention in males. To this end, the failure time of interest was defined as the age at onset of prostate cancer. Because the participants were only examined intermittently, only interval-censored observations could be obtained for the onset of prostate cancer. In addition, because the study excluded individuals who had already developed prostate cancer at recruitment, the age at onset of prostate cancer was subject to left truncation, with the truncation time being the age at which the individual enrolled in the study.
We considered seven potential risk factors: Race (1 for African American and 0 otherwise), Education (1 for at least college and 0 otherwise), Cancer (1 for having an immediate family member with any PLCO cancer and 0 otherwise), ProsCancer (1 for having an immediate family member with prostate cancer and 0 otherwise), Diabetes (1 for having diabetes and 0 otherwise), Stroke (1 for having had a stroke and 0 otherwise) and Gallblad (1 for having gall bladder stones and 0 otherwise). The sample size was n = 32,897, and the data included both left- and right-censored observations.
To achieve variable selection, we implemented the proposed method with LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR, as in the simulation study. To select the optimal $\lambda_n$ for each penalty, we utilized a two-step method. In the first step, we examined a range of points over a wide interval $[a, b]$ to roughly identify a narrower interval that includes the optimal tuning parameter; as in the simulation study, a was selected to ensure that all the covariates were selected, while b was chosen to ensure that all the regression parameter estimates were penalized to zero. Next, we considered 20 equally spaced points within the narrower interval and selected the optimal $\lambda_n$ that minimized the BIC. In addition, we employed the BIC to select the best penalty among all those considered for the data; it turned out that the PH model with SCAD and MCP yielded the smallest BIC values. To calculate the standard errors, we used the nonparametric bootstrap with 100 bootstrap samples. In addition to the proposed method, we also considered the variable selection method based on the penalized conditional likelihood (PCL) for comparison.
6.2. Results
We summarize in Table 5 the results obtained by the proposed and PCL methods. The results indicate that, except for LASSO, the proposed method with the other penalties identified Race, ProsCancer and Diabetes as significant risk factors for prostate cancer. Specifically, being African American and having an immediate family member with prostate cancer increased the risk of developing prostate cancer, while having diabetes was associated with a lower risk of developing prostate cancer. These findings are in accordance with the conclusions obtained by Meister [42], Pierce [43] and others. In addition, the results in Table 5 show that the PCL method yielded relatively larger standard error estimates than the proposed method. This finding again demonstrates the efficiency gain of the proposed method from taking into account the distributional information of the truncation times in the inference procedure.
7. Discussion and Conclusions
In this article, we considered length-biased and interval-censored data and developed a penalized analysis procedure to choose important variables among a large number of covariates in the PH model. The main contribution of this work is the development of a novel penalized EM algorithm via a two-stage data augmentation, which greatly simplifies the penalized nonparametric maximum likelihood estimation. Specifically, by introducing pseudo-truncated data and Poisson random variables, the possibly high-dimensional parameters involved in $\Lambda_0(\cdot)$ have explicit solutions, making the proposed algorithm simple and computationally stable. In contrast to the work of Li et al. [32], which only involved the ALASSO penalty, we proposed to jointly utilize the local linear approximation and the modified shooting algorithm, yielding sparse estimators of the regression parameters under various popular penalty functions; thus, the proposed method offers flexible options for the data analyst. The numerical results from the simulation study showed the satisfactory finite-sample performance and desirable advantages of the proposed method. Moreover, by legitimately taking into account the distributional information of the truncation times, the proposed method is more efficient than the traditional penalized conditional likelihood approach (e.g., the method of Li et al. [32]).
Notably, the findings of our prostate cancer data analysis may have certain public health implications. Specifically, African Americans and individuals who have an immediate family member with prostate cancer constitute higher-risk groups that should receive early preventive measures (e.g., cancer screening) in order to reduce the burden of prostate cancer.