1. Introduction
The joint analysis of longitudinal and survival data has been widely applied in clinical studies on cancer and HIV/AIDS, where the primary endpoints typically involve time-to-event outcomes such as disease-free and overall survival. Following the seminal work by Faucett and Thomas [1] and Wulfsohn and Tsiatis [2], the standard joint model has been extensively investigated, and the advantages of joint models have been widely discussed [3,4,5,6,7,8]. However, patients with a compromised quality of life (QOL) may discontinue their participation in a clinical trial because of disease recurrence, or they may die. In this case, the absence of QOL measures resulting from patient withdrawal is informative about the trade-off between intensive treatment and poor QOL. To establish strong evidence, we jointly model longitudinal QOL measures and survival data to investigate their relationship. For longitudinal QOL and survival data, Henderson et al. [9] and Zeng and Cai [10] used shared normally distributed random effects to jointly analyze the relationship between longitudinal QOL and survival time. Tang et al. [11] considered a novel semiparametric joint model for multivariate longitudinal and survival data to analyze data from the International Breast Cancer Study. Longitudinal QOL measurements can be linearly converted into longitudinal proportional data whose values lie in the unit interval (0, 1) [12]. Song and Tan [12] emphasized that disregarding the constraint that values lie between 0 and 1 could lead to erroneous interpretations. Two methods are available for the longitudinal component. The first applies the classic linear mixed model to the longitudinal proportional data after a logit transformation [13], and the second models the longitudinal proportional data directly with the simplex distribution [14,15]. The models established under both methods use the EM algorithm and the Laplace approximation to estimate the unknown parameters. To be more flexible and practical, this paper uses a partially linear mixed-effect model for the logit-transformed longitudinal proportional data and uses the B-splines method to model the unknown function in the model. Meanwhile, to enhance the feasibility of the proposed model, we use the CDPMM method to model the random effects.
In addition, variable selection in the joint model is also considered. In traditional regression models, variable selection methods include forward selection, backward elimination, stepwise selection, and the use of information criteria such as the Akaike information criterion (AIC). However, these approaches can be computationally expensive and unstable in complex models with a large number of covariates. To address this issue, penalized likelihood methods have been proposed, a popular one being the Lasso of Tibshirani [16]. The Lasso estimates linear regression coefficients by minimizing the least squares criterion subject to a constraint on the $L_1$ norm of the coefficients. Tibshirani [16] noted that Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Park and Casella [17] extended this idea under the Bayesian framework and introduced the Bayesian Lasso (BLasso) variable selection method. They used a double exponential prior for the regression coefficients and a gamma distribution for the shrinkage parameter. The BLasso method has been successfully applied to various models, including linear regression [18], semiparametric structural equation models [19], and joint models of longitudinal and survival data [11]. Building on this work, our paper extends the BLasso variable selection method to the joint model of longitudinal proportional data and survival data. The resulting BLasso procedure estimates the unknown parameters while simultaneously identifying the significant effects of crucial covariates.
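The double exponential (Laplace) prior mentioned above is usually handled in Gibbs samplers through its representation as a scale mixture of normals. The following is a minimal sketch of that representation only, not of the sampler used in this paper; the function name, the rate `lam`, and the unconditional form (without conditioning on an error variance) are illustrative assumptions.

```python
import math
import random

def sample_laplace_via_mixture(lam, rng):
    """Draw one coefficient from a Laplace prior with rate lam using its
    normal scale-mixture form (hypothetical helper for illustration)."""
    # Mixing variance: tau2 ~ Exponential with rate lam^2 / 2.
    tau2 = rng.expovariate(lam ** 2 / 2.0)
    # Conditional prior: beta | tau2 ~ Normal(0, tau2).
    return rng.gauss(0.0, math.sqrt(tau2))

rng = random.Random(0)
lam = 2.0  # assumed shrinkage rate, chosen only for the demonstration
draws = [sample_laplace_via_mixture(lam, rng) for _ in range(100_000)]

# A Laplace distribution with rate lam has variance 2 / lam^2 and E|beta| = 1 / lam.
sample_var = sum(d * d for d in draws) / len(draws)
mean_abs = sum(abs(d) for d in draws) / len(draws)
print(round(sample_var, 3), round(mean_abs, 3))
```

The sample moments can be compared with the Laplace values 2 / lam^2 and 1 / lam to check the mixture numerically. A larger `lam` concentrates the prior near zero, which is what shrinks unimportant coefficients.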
The rest of this paper is organized as follows. In Section 2, the joint model of longitudinal proportional and survival data is introduced. In Section 3, the Bayesian estimation of the joint model is proposed. In Section 4, three numerical simulations are presented to evaluate the performance of the proposed methods. In Section 5, we use the proposed approach to analyze the data from the MA.5 research experiment. We then provide some concluding remarks in Section 6. For more technical details, please refer to Appendix A.
2. Model and Notation
Consider a dataset consisting of $n$ individuals. Let $y_{ij}$ be a longitudinal proportional measurement for the $i$-th individual ($i=1,\ldots,n$) at observation time point $t_{ij}$ for $j=1,\ldots,n_i$, with $y_{ij}\in(0,1)$, where $n_i$ represents the number of observations of individual $i$. We assume that $z_{ij}$ is the logit transformation of $y_{ij}$, i.e., $z_{ij}=\log\{y_{ij}/(1-y_{ij})\}$. Furthermore, $T_i^{*}$ and $C_i$ are the true survival time and the censoring time of individual $i$, respectively. Let $T_i=\min(T_i^{*},C_i)$ denote the corresponding observed event time, and let $\delta_i=1(T_i^{*}\le C_i)$ denote the failure indicator, where $1(\cdot)$ is an indicator function.
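The data preparation described in this paragraph can be sketched in a few lines. The helper names and the score range used in the example are hypothetical; only the linear rescaling, the logit transformation, and the definitions of the observed time and failure indicator come from the text.

```python
import math

def to_unit_interval(score, low, high):
    """Linearly map a raw score on [low, high] to the unit interval."""
    return (score - low) / (high - low)

def logit(y):
    """Logit transformation of a proportion strictly inside (0, 1)."""
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie strictly inside (0, 1)")
    return math.log(y / (1.0 - y))

def observe(true_time, censor_time):
    """Return the observed event time and the failure indicator
    (1 if the event was observed, 0 if the time was censored)."""
    return min(true_time, censor_time), int(true_time <= censor_time)

# Hypothetical example: a score of 5.5 on an assumed 1-to-7 scale.
y = to_unit_interval(5.5, 1.0, 7.0)
z = logit(y)
print(y, round(z, 4), observe(2.0, 3.5), observe(4.0, 3.5))
```

Scores that map exactly to 0 or 1 have no finite logit, so the sketch raises an error for them; how such boundary values are treated in practice is outside what the text specifies.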
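We denote $\mathbf{z}_i=(z_{i1},\ldots,z_{in_i})^{\top}$, where $z_{ij}=\operatorname{logit}(y_{ij})$ is the transformed longitudinal proportional measurement at observation time $t_{ij}$. Let $\mathbf{z}=(\mathbf{z}_1^{\top},\ldots,\mathbf{z}_n^{\top})^{\top}$ and $\mathbf{b}=(\mathbf{b}_1^{\top},\ldots,\mathbf{b}_n^{\top})^{\top}$. The random effects $\mathbf{b}_i$ are time-independent and underlie both the longitudinal and survival processes for the $i$-th individual. Given the random effects $\mathbf{b}_i$, we assume that $z_{ij}$ follows a partially linear mixed-effect model:

$$z_{ij}=\mathbf{x}_{i}^{\top}\boldsymbol{\beta}+g(t_{ij})+\mathbf{w}_{i}^{\top}\mathbf{b}_i+\varepsilon_{ij},\qquad(1)$$

where $\mathbf{x}_{i}$ and $\mathbf{w}_{i}$ represent the time-independent design vectors of fixed and random effects associated with $z_{ij}$, respectively; $\boldsymbol{\beta}$ is a $p\times1$ vector of fixed-effects regression parameters; $\mathbf{b}_i$ is a $q\times1$ random effects vector; $g(\cdot)$ is a twice continuously differentiable unknown function; and $\varepsilon_{ij}$ is a white noise process with variance $\sigma^2$. Additionally, we assume that the $\varepsilon_{ij}$'s are independent of $\mathbf{b}_i$. To enhance the feasibility of our proposed model, instead of the traditional normality assumption, which may be violated in some applications [20], we specify the random effects using a Dirichlet process (DP) mixture of normals.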
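To make the idea of a DP mixture of normals concrete, the following sketch draws random effects from a truncated stick-breaking approximation. It is a generic illustration under stated assumptions and does not reproduce the CDPMM construction used in this paper; the concentration parameter `alpha`, the truncation level, and the standard deviations `base_sd` and `comp_sd` are all hypothetical.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Mixture weights from a truncated stick-breaking construction."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)             # last weight takes the leftover stick
    return weights

def sample_random_effects(n, alpha, truncation, rng, base_sd=2.0, comp_sd=0.5):
    """Draw n scalar random effects from a truncated DP mixture of normals."""
    weights = stick_breaking_weights(alpha, truncation, rng)
    # Component means drawn from an assumed normal base distribution.
    means = [rng.gauss(0.0, base_sd) for _ in range(truncation)]
    labels = rng.choices(range(truncation), weights=weights, k=n)
    effects = [rng.gauss(means[k], comp_sd) for k in labels]
    return effects, weights

rng = random.Random(42)
effects, weights = sample_random_effects(200, alpha=1.0, truncation=20, rng=rng)
print(len(effects), round(sum(weights), 6))
```

Because several components can receive non-negligible weight, the resulting density of the random effects can be skewed or multimodal, which is the flexibility a single normal distribution lacks.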
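For the event time $T_i^{*}$, given the random effects $\mathbf{b}_i$, we assume that $T_i^{*}$ follows the hazard model:

$$\lambda(t\mid\mathbf{b}_i)=\lambda_0(t)\exp\left(\mathbf{v}_i^{\top}\boldsymbol{\gamma}+\boldsymbol{\phi}^{\top}\mathbf{b}_i\right),\qquad(2)$$

where the known fixed-effects design vector $\mathbf{v}_i$ connects the unknown $r\times1$ parameter vector $\boldsymbol{\gamma}$ to $T_i^{*}$. Additionally, the unknown $q\times1$ parameter vector $\boldsymbol{\phi}$ links $\mathbf{b}_i$ to $T_i^{*}$. Lastly, the baseline hazard function $\lambda_0(t)$ remains unknown.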
From the above discussion, models (1) and (2) are linked through shared random effects, and the resulting model is called a shared random effects joint model (JMSRE). The parameter $\boldsymbol{\phi}$ in the JMSRE reflects the association between the transformed longitudinal proportional data and the survival data that is induced by the random effects. When $\boldsymbol{\phi}=\mathbf{0}$, the longitudinal process is not related to the event time; i.e., the longitudinal proportional data and the survival data can be modeled separately. In this case, joint modeling is not necessary, and the longitudinal measurements can be ignored when modeling the survival data.
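Further, to make Bayesian inference on $g(\cdot)$ based on model (1), we approximate $g(t)$ through a B-splines method:

$$g(t)\approx\mathbf{B}(t)^{\top}\boldsymbol{\alpha}=\sum_{k}\alpha_k B_k(t),$$

where $\mathbf{B}(t)$ is the vector of B-spline basis functions $B_k(t)$ evaluated at $t$, $d$ is the degree of B-splines, $K$ is the number of knots, and $\boldsymbol{\alpha}$ is the unknown coefficient vector whose length equals the number of basis functions.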
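The B-spline approximation can be illustrated with the Cox-de Boor recursion. This is a minimal sketch assuming a clamped knot sequence on a closed interval with user-supplied interior knots; the function names, the knot placement, and the coefficient values in the example are hypothetical and are not taken from the paper.

```python
def bspline_basis(t, degree, interior_knots, lower, upper):
    """Evaluate all B-spline basis functions at t via the Cox-de Boor recursion."""
    # Clamped knot vector: boundary knots repeated degree + 1 times.
    knots = [lower] * (degree + 1) + list(interior_knots) + [upper] * (degree + 1)
    n_basis = len(knots) - degree - 1
    # Degree-0 basis: indicator of the half-open knot span containing t.
    basis = [0.0] * (len(knots) - 1)
    for i in range(len(knots) - 1):
        if knots[i] <= t < knots[i + 1]:
            basis[i] = 1.0
    if t == upper:
        basis[n_basis - 1] = 1.0  # include the right end point in the last span
    # Raise the degree one step at a time.
    for k in range(1, degree + 1):
        new = []
        for i in range(len(knots) - k - 1):
            left_den = knots[i + k] - knots[i]
            right_den = knots[i + k + 1] - knots[i + 1]
            left = (t - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + k + 1] - t) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            new.append(left + right)
        basis = new
    return basis

def spline_value(t, alpha, degree, interior_knots, lower, upper):
    """Approximate g(t) as the inner product of the basis vector and alpha."""
    basis = bspline_basis(t, degree, interior_knots, lower, upper)
    return sum(a * b for a, b in zip(alpha, basis))

# Hypothetical setup: cubic splines with three interior knots on [0, 3].
knots_demo = [0.75, 1.5, 2.25]
basis_demo = bspline_basis(1.2, 3, knots_demo, 0.0, 3.0)
print(len(basis_demo), round(sum(basis_demo), 6))
```

With this clamped construction, the number of basis functions equals the number of interior knots plus the degree plus one, and the basis functions sum to one at every point of the interval.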
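We denote $\boldsymbol{\theta}_1$ as the unknown parameters associated with model (1) and $\boldsymbol{\theta}_2$ as the unknown parameters associated with model (2). Thus, given the random effects $\mathbf{b}=(\mathbf{b}_1,\ldots,\mathbf{b}_n)$, the joint likelihood function of $\boldsymbol{\theta}=(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)$ can be written as

$$L(\boldsymbol{\theta}\mid\mathcal{D},\mathbf{b})=\prod_{i=1}^{n}\left\{\prod_{j=1}^{n_i}p(z_{ij}\mid\mathbf{b}_i,\boldsymbol{\theta}_1)\right\}p(T_i,\delta_i\mid\mathbf{b}_i,\boldsymbol{\theta}_2),$$

where $\mathcal{D}$ denotes the observed data, $p(z_{ij}\mid\mathbf{b}_i,\boldsymbol{\theta}_1)$ is the density of the transformed longitudinal measurement $z_{ij}$ implied by model (1), and $p(T_i,\delta_i\mid\mathbf{b}_i,\boldsymbol{\theta}_2)=\lambda(T_i\mid\mathbf{b}_i)^{\delta_i}\exp\left\{-\int_0^{T_i}\lambda(u\mid\mathbf{b}_i)\,du\right\}$ is the contribution of the observed event time $T_i$ and failure indicator $\delta_i$ under model (2).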
4. Simulation Studies
In this section, we perform three simulation studies to examine the finite-sample performance of the previously mentioned methods.
The model used in these studies was the one defined in models (1) and (2), involving a total of 200 individuals. The specific details of the model are as follows:
In model (1), the random effects $\mathbf{b}_i$ can be either one-dimensional or multi-dimensional. However, in the following simulation study, $b_i$ was set to be one-dimensional. In order to perform variable selection on $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$, the covariate vectors $\mathbf{x}_i$ and $\mathbf{v}_i$ were set to be multi-dimensional in the simulation study. The data were generated as follows: the observation time $t_{ij}$ was randomly generated between 0 and 3. Two binary covariates followed Bernoulli distributions with success probabilities of 0.5 and 0.3, respectively. The remaining covariates were generated from a multivariate normal distribution $N(\mathbf{0},\boldsymbol{\Sigma})$ with mean vector $\mathbf{0}$ and covariance matrix $\boldsymbol{\Sigma}$. The covariance matrix $\boldsymbol{\Sigma}$ is a symmetric positive definite matrix with diagonal elements of 1 and all other elements of 0.5. The random error $\varepsilon_{ij}$ was generated from a normal distribution with mean 0 and variance $\sigma^2$. The baseline hazard function $\lambda_0(t)$ was taken to be constant, the censoring time $C_i$ was generated from a uniform distribution, and the true survival time $T_i^{*}$ was generated from the exponential distribution with mean $1/\lambda(t\mid b_i)$, $i=1,\ldots,n$. Our main objective is to use the proposed approaches to identify insignificant covariates and estimate the non-zero coefficients. Bayesian results were obtained from 200 replications.
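The survival part of this data-generating scheme can be sketched as follows. Every numerical value in the example (coefficients, association parameter, baseline hazard, censoring bound, and the standard normal random effect) is a hypothetical placeholder rather than a setting from the paper; the sketch only assumes a constant baseline hazard, so that the survival time is exponential, and covariates with the Bernoulli and equicorrelated normal structure described above.

```python
import math
import random

def equicorrelated_normals(p, rho, rng):
    """Draw p normals with unit variances and pairwise covariance rho."""
    shared = rng.gauss(0.0, 1.0)
    return [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)]

def simulate_subject(gamma, phi, baseline_hazard, censor_upper, rng):
    """Simulate (observed time, failure indicator) for one individual."""
    # Two Bernoulli covariates (success probabilities 0.5 and 0.3) plus normals.
    v = [float(rng.random() < 0.5), float(rng.random() < 0.3)]
    v += equicorrelated_normals(len(gamma) - 2, 0.5, rng)
    b = rng.gauss(0.0, 1.0)  # hypothetical scalar random effect
    linear = sum(vk * gk for vk, gk in zip(v, gamma)) + phi * b
    rate = baseline_hazard * math.exp(linear)
    true_time = rng.expovariate(rate)             # exponential with mean 1 / rate
    censor_time = rng.uniform(0.0, censor_upper)  # uniform censoring
    return min(true_time, censor_time), int(true_time <= censor_time)

rng = random.Random(1)
gamma = [0.5, 0.0, -0.5, 0.0, 0.5]  # hypothetical values, some exactly zero
data = [simulate_subject(gamma, 0.5, 1.0, 3.0, rng) for _ in range(200)]
censoring_rate = 1.0 - sum(delta for _, delta in data) / len(data)
print(f"censoring rate: {censoring_rate:.2f}")
```

Writing each normal covariate as a shared component plus an independent component gives unit variances and pairwise covariance 0.5 without a matrix decomposition. Changing the censoring bound is a simple way to tune the censoring rate.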
To demonstrate the accuracy and flexibility of our proposed method, we conducted three simulation studies. These simulations aimed to estimate the parameters of interest, identify unimportant variables, and capture the features of the unknown function $g(\cdot)$ and the random effects $b_i$. The true values of the unknown parameters $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ were the same in Simulation I and Simulation II and included zero components, whereas the true values of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ in Simulation III were all non-zero. The settings of the unknown function and the random effects differed across the three simulation studies: the unknown function settings included both nonlinear and linear forms, and the random effects followed a normal mixture distribution that was unimodal, bimodal, and trimodal, respectively. Together, these studies illustrate the effectiveness and versatility of our method.
We utilized the proposed semiparametric Bayesian procedure to simultaneously estimate unknown parameters and identify significant covariates in each of the three simulation studies. The mean censoring rates for the survival times in these studies were 44%, 45%, and 37%. The prior hyperparameters were set as follows:
,
. These hyperparameters correspond to the hyperpriors for the adjustment coefficients in Equations (
10)–(12). We set
,
, and
, which correspond to the prior parameters of
and
. We set the degree of B-splines
, the number of knots
, and
.
To assess the convergence of the proposed algorithm, we computed the estimated potential scale reduction (EPSR) values for the parameters. Additionally, we also need to test the convergence of the unknown function fitted using the B-splines method. Figure 1 indicates that the EPSR values remained consistently below 1.2 after around 3000 iterations in all three simulation studies. Consequently, in each of the 200 replications, we discarded the first 3000 iterations as burn-in and collected the following 3000 observations to calculate the Bayesian estimates of the parameters. For comparison, we also applied a Gaussian prior to the random effects. The purpose of these simulations is to compare the semiparametric approach based on the CDPMM prior with the parametric approach based on the Gaussian prior from a Bayesian perspective. Results obtained from the three simulation studies are reported in Table 1, Table 2 and Table 3, which include five measures: “Median”, “Bias”, “SD”, “RMS”, and “F0”. “Median” represents the median of the estimates from 200 replications. “Bias” indicates the difference between the true value and the mean of the estimates from 200 replications. “SD” indicates the standard deviation of the estimates from 200 replications. “RMS” is the root mean square error of the estimates from 200 replications relative to their true values. “F0” indicates the proportion of the 200 replications in which a parameter was identified as zero, where a parameter is identified as zero if its 95% credible interval contains zero.
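The five summary measures can be computed from the per-replication estimates and interval limits as in the following sketch. The function name and the toy inputs are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which version was used.

```python
import math
import statistics

def summarize(estimates, intervals, true_value):
    """Summarize per-replication estimates of one parameter.

    estimates: point estimates, one per replication
    intervals: (lower, upper) 95% interval limits, one pair per replication
    """
    n = len(estimates)
    mean_est = statistics.fmean(estimates)
    return {
        "Median": statistics.median(estimates),
        # Bias as defined in the text: true value minus mean of the estimates.
        "Bias": true_value - mean_est,
        "SD": statistics.stdev(estimates),  # assumed: sample standard deviation
        "RMS": math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n),
        # F0: share of replications whose interval contains zero.
        "F0": sum(1 for low, high in intervals if low <= 0.0 <= high) / n,
    }

# Toy inputs for illustration only (four replications instead of 200).
toy_estimates = [0.9, 1.0, 1.1, 1.2]
toy_intervals = [(0.5, 1.3), (-0.1, 2.0), (0.6, 1.4), (0.7, 1.7)]
summary = summarize(toy_estimates, toy_intervals, true_value=1.0)
print(summary)
```

Under this definition, a truly non-zero coefficient should have F0 near 0 and a truly zero coefficient should have F0 near 1.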
The results in Table 1, Table 2 and Table 3 suggest that the Bayesian estimates of the parameters are reasonably accurate. In all simulations, the proposed CDPMM prior performed better than the Gaussian prior in both parameter estimation and inferential characteristics: the bias (Bias) values of the results based on the CDPMM prior are all less than 0.10, and the root mean square (RMS) and standard deviation (SD) values are both less than 0.20. Furthermore, the BLasso method correctly identified the important covariates in most cases, regardless of the prior inputs of parameters. The F0 values corresponding to the important covariates were less than 10%, so these covariates were rarely identified as zero, whereas the F0 values corresponding to the unimportant covariates were more than 90%, so these covariates were identified as zero in most replications. The recovery performance of the proposed method for the unknown function $g(\cdot)$ can be measured using the root mean square error (RMSE), which is expressed as

$$\mathrm{RMSE}^{(r)}=\left\{\frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{n_i}\left[\hat g^{(r)}(t_{ij})-g(t_{ij})\right]^2\right\}^{1/2},$$

where $N=\sum_{i=1}^{n}n_i$, $\hat g^{(r)}(t)=\mathbf{B}(t)^{\top}\hat{\boldsymbol{\alpha}}^{(r)}$, and $\hat{\boldsymbol{\alpha}}^{(r)}$ represents the Bayesian estimated value of the B-spline coefficient vector $\boldsymbol{\alpha}$ in the $r$-th replication. Similar to the RMSE of the unknown function, we also calculate the RMSE of the random effects. Figure 2 plots the estimated curve of the unknown function $g(\cdot)$ and the estimated density of the random effects $b_i$ under the different priors, for the replication whose RMSE is the median among the 200 replications, and compares them against the true curves and true densities in the three simulation studies.
Upon inspection of Figure 2, it is evident that the Bayesian B-splines method proposed in this paper is flexible enough to accurately fit the true curve of the unknown function $g(\cdot)$. Additionally, compared with the Gaussian prior, the proposed CDPMM prior is flexible enough to capture the general shapes of the three distributions considered for the random effects $b_i$. The results presented in Table 4, based on 200 replications in the three simulation studies under the CDPMM prior and the Gaussian prior, further support the robustness of the CDPMM method. The estimated means and standard deviations (SDs) of the random effects $b_i$ closely align with their corresponding true values. Moreover, the 25%, 50%, and 75% quantiles of the RMSE of the unknown function and the random effects are sufficiently small, indicating the effectiveness of the CDPMM approach in estimating the random effects.
All these findings show that, compared with the Gaussian prior method, our CDPMM prior method makes the Bayesian B-spline curve flexible enough to accurately fit the true curve of nonlinear data. Additionally, the Bayesian procedure effectively captures the true information of the random effects $b_i$, regardless of their true distributions and forms. Furthermore, BLasso has a high probability of correctly identifying the true model.
5. An Example
In this section, we apply the method proposed in the previous sections to the MA.5 research experiment conducted by the Clinical Trial Group of the National Cancer Institute of Canada. The data pertain to 716 premenopausal women with early-stage breast cancer. A total of 356 patients were randomly assigned to receive cyclophosphamide, epirubicin, and fluorouracil (CEF) adjuvant chemotherapy as the experimental group. The remaining 360 patients received cyclophosphamide, methotrexate, and fluorouracil (CMF) adjuvant chemotherapy as the control group of the trial. In the trial, visits were made before the start of treatment, during each of the six treatment cycles, and every three months after treatment. At each visit, a medical history was taken, a physical examination was conducted, and the Breast Cancer Questionnaire (BCQ) was used to assess the patient’s QOL. The dataset consists of a total of 7807 observations. By the end of the study, 366 patients had died, resulting in a censoring rate of approximately 49%. For a detailed study of these data, please refer to Song et al. [26] and Levine et al. [27]. We linearly convert the evaluated BCQ score into the unit interval $(0,1)$, and the longitudinal data constrained to the interval $(0,1)$ are the longitudinal proportional data of interest. The trial focuses on the recurrence-free survival (RFS) time, which is the duration between randomization and disease recurrence. The treatment received, age, and the number of tumor-positive lymph nodes may directly affect RFS and the patient’s QOL. We fitted the following model to the MA.5 research experiment dataset:

$$z_{ij}=\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+g(t_{ij})+b_i+\varepsilon_{ij},\qquad\lambda(t\mid b_i)=\lambda_0(t)\exp\left(\gamma_1x_{i1}+\gamma_2x_{i2}+\gamma_3x_{i3}+\phi b_i\right),\qquad(23)$$

where variable $z_{ij}$ represents the BCQ score after applying the logit function transformation. $x_{i1}$ is a binary treatment indicator that distinguishes whether the $i$-th patient underwent CEF treatment or CMF treatment. Age and the number of lymph node metastases are also binary variables: $x_{i2}$ separates patients who are 40 years old or younger (the younger group) from patients who are older than 40 years (the elderly group), and $x_{i3}=0$ when the number of lymph node metastases is 0–3; otherwise, it is 1. The term $g(t)$ in Equation (23) represents an unknown function related to the observation time $t$.
The unknown function
is estimated using a cubic B-spline function, and the domain of the cubic B-spline function is
. The prior distributions and values of all hyperparameters in the case study are the same as those set in the simulation study above. Based on the above settings, we calculated the EPSR values for all parameters. The results indicate that after approximately 3000 iterations, all EPSR values are less than 1.2. Therefore, we discard the first 3000 iterations as burn-in and use the following 3000 iterations to calculate the Bayesian estimates. The results of the example analysis under the two different priors are shown in Table 5.
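The EPSR criterion used here can be computed from several parallel chains of one parameter. The sketch below follows the standard potential scale reduction formula (between-chain versus within-chain variance); the number of chains, the chain length, and the simulated draws are assumptions for illustration, since the text does not report how many chains were run.

```python
import random
import statistics

def epsr(chains):
    """Estimated potential scale reduction for one scalar parameter.

    chains: list of equal-length lists of posterior draws, one per chain
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(0)
# Three chains sampling the same distribution: EPSR should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values: EPSR should be well above 1.2.
stuck = [[rng.gauss(3.0 * k, 1.0) for _ in range(2000)] for k in range(3)]
print(epsr(mixed) < 1.2, epsr(stuck) < 1.2)
```

Values below the 1.2 threshold for every parameter are taken as evidence that the chains have mixed; in practice the statistic is tracked over iterations to choose the burn-in length.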
From Table 5, the following observations can be made. (i) The parameter estimates based on the CDPMM prior proposed in this paper have smaller standard deviations (SDs) and narrower credible intervals than those based on the Gaussian prior. This suggests that the approach proposed in this paper is more efficient. (ii) Under the CDPMM prior, the estimated risk ratio of CEF relative to CMF treatment is less than 1, implying that patients randomized to CEF chemotherapy have a lower risk of recurrence. (iii) The credible interval for the treatment coefficient in the longitudinal model does not include 0, indicating that the different adjuvant chemotherapy regimens have a significant impact on patients’ QOL. Additionally, it suggests that CEF chemotherapy is more toxic than CMF chemotherapy. (iv) The estimated risk ratio for four or more lymph node metastases relative to fewer than four is greater than 1. This implies that patients with a higher number of lymph node metastases have a greater risk of breast cancer recurrence and a shorter RFS. (v) The regression coefficient in the longitudinal model for four or more lymph node metastases is 0.304, and its credible interval does not include 0, indicating high significance. This suggests that patients with a higher number of lymph node metastases experience a lower QOL, which aligns with clinical experience. (vi) The estimated risk ratio of the young group relative to the old group is greater than 1, implying that the risk of breast cancer recurrence is higher and the RFS is shorter in the young group. (vii) The credible interval for the age coefficient in the longitudinal model does not contain 0, indicating that age has a significant impact on patients’ QOL. This suggests that the quality of life for the elderly group is better than that of the young group. (viii) The estimate of $\phi$ is 0.269, and its credible interval does not include 0. This indicates that $\phi$ is significantly different from 0, suggesting a significant association between the longitudinal proportional data and the survival data. Therefore, the JMSRE model proposed in this paper is applicable and reasonable for analyzing the MA.5 research experiment’s data.
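The risk ratios quoted above follow directly from the form of the hazard model. As a short worked step, assume a proportional hazards form in which a binary covariate $v_{ik}$ enters the linear predictor with coefficient $\gamma_k$, and write $\eta_i$ for the remaining terms (other covariates and the random-effect term); the symbols $v_{ik}$, $\gamma_k$, and $\eta_i$ are notation introduced here for illustration.

```latex
\frac{\lambda(t \mid b_i,\, v_{ik}=1)}{\lambda(t \mid b_i,\, v_{ik}=0)}
  = \frac{\lambda_0(t)\exp(\gamma_k + \eta_i)}{\lambda_0(t)\exp(\eta_i)}
  = \exp(\gamma_k)
```

The baseline hazard and all other terms cancel, so the ratio is constant over time. A negative coefficient gives a ratio below 1 (lower risk for the group coded 1), and a positive coefficient gives a ratio above 1.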
6. Concluding Remarks
In this paper, a semiparametric joint model is proposed for longitudinal proportional data and survival data. The model does not assume the normality of random effects and does not require a parametric form for the unknown function influencing the longitudinal responses. The proposed model offers several advantages. Firstly, it improves the flexibility of jointly modeling longitudinal proportional data and survival data. Secondly, the proposed B-splines method captures different unknown functions in a flexible manner. Thirdly, compared to a Gaussian prior, the proposed CDPMM method accurately captures the unimodal, bimodal, and multimodal features of random effects. Lastly, the computational burden is not heavy: each replication in the simulation study takes approximately 4 min, and the breast cancer dataset takes about 78 min to run.
Our simulation studies and example analysis demonstrate that the Bayesian estimation approach proposed for the joint model is accurate and robust. The use of Bayesian B-splines allows for a more flexible estimation of the unknown function curve, enabling it to capture the true characteristics of the unknown function more effectively. Additionally, compared with the Gaussian prior method, the CDPMM method effectively captures the true information of the random effects $b_i$. Furthermore, the BLasso method has a high probability of correctly identifying the true model. In comparison to the method proposed by Song et al. [26] for jointly modeling longitudinal proportional data and survival data, the joint model proposed in this paper offers greater flexibility.
The joint model of longitudinal proportional data and survival data proposed in this paper still leaves several problems open, which we need to address in the future: (i) The model does not impose any constraints on the form of the baseline hazard function. (ii) We should consider more complex spline models, such as those that select knots automatically, to enhance the performance of the proposed model. (iii) We should also explore a joint model for multivariate longitudinal proportional outcomes and multivariate survival outcomes.