1. Introduction
Historically, cancer has been a disease with a high mortality rate. For this reason, it is of interest to correctly identify the relationship between cancer and the factors that affect it. A useful way to study cancer progression is to monitor patient survival over time, and identifying the factors that influence the long-term survival of cancer patients is an important concern of medical studies. Appropriate models for the lifetime of cancer patients are also needed to evaluate the effectiveness of different treatments and to identify the factors affecting survival time. Recently, with the development of new drugs and improved treatments, patients with certain types of cancer can live as long as people without cancer and show no recurrence of the disease. These patients are said to be cured or long-term survivors. The remaining patients, who show a recurrence of the disease, are called non-cured or susceptible. Cure rate models play an important role in medical studies and clinical trials and have become popular in reliability and survival analysis. To introduce the cure rate models, let M be an unobservable random variable denoting the initial number of competing causes related to the occurrence of an event of interest for an individual in the population.
For the initial number of competing causes M, several discrete distributions have been considered in the literature. To name a few, de Castro et al. [1], Cancho et al. [2], Yiqi et al. [3], Ortega et al. [4] and D’Andrea et al. [5] considered the negative binomial distribution for M. Leão et al. [6] considered the zero-modified geometric distribution, Gallardo et al. [7] the Yule–Simon distribution, Gallardo et al. [8] the polylogarithm distribution, Gallardo et al. [9] the modified power series family of distributions, Cancho et al. [10] the power series distribution, Balakrishnan et al. [11] the weighted Poisson distribution, and Balakrishnan and Pal [12] the Conway–Maxwell–Poisson distribution.
In this work, we introduce a new cure rate model based on the Flory–Schulz (F-S) distribution. The motivations for introducing this model are: (1) the F-S distribution has a probability generating function (PGF) in a simple closed form, which is very useful in the cure rate context; (2) the probability of zero competing causes, P(M = 0), also has a simple form, which allows one, among other things, to reparametrize the model directly in terms of the cure rate; (3) the F-S distribution has one parameter, making it a parsimonious competitor of the traditional promotion time cure rate model proposed in Chen et al. [13]; (4) the F-S distribution is flexible and accommodates over-dispersion in the data, which produces a more flexible cure rate model; (5) to the best of our knowledge, the F-S distribution has not yet been proposed in a cure rate model context for modeling the number of competing causes. In addition, we consider the generalized truncated Nadarajah–Haghighi (GeTNH) distribution (Azimi and Esmailian [14]), a model recently proposed in the literature, for the time-to-event of the concurrent causes.
The remainder of this article is organized as follows. In Section 2, we formulate the F-S cure rate model based on the Flory–Schulz and GeTNH distributions. In Section 3, we obtain the maximum likelihood (ML) estimators and asymptotic confidence intervals for the parameters of the F-S cure rate model based on the GeTNH distribution. In Section 4, we analyze three real cancer data sets. In Section 5, we perform a simulation study to evaluate the performance of the ML estimators of the model parameters. Finally, in Section 6, we provide some conclusions.
2. Modeling
Let
be a sequence of random variables that denotes the time of occurrence of the event of interest (for example, the number of clonogenic tumor cells at the end of treatment that can produce detectable cancer). Assume that, conditional on
M, the
are independent identically distributed with the common survival function
. We also assume that
are independent of
M and define
as
.
and
determines the cured and susceptible individuals, respectively. The observable time until the event of interest is defined as (Cancho et al. [
10])
Under this setting and according to Tsodikov et al. [
15] and Rodrigues et al. [
16], the survival function for the population is given by
where
is the probability generating function (PGF) of
M. Let
M denote the number of competing cause random variables related to the occurrence of the event of interest. We consider that random variable
M has the
(Flory–Schulz distribution with parameter
) with probability mass function given as (Flory [
17])
According to Gallardo et al. [
7] and Gallardo et al. [
8], in the cure rate contexts,
M is a random variable with support on set
. Therefore, we shift the probability mass function in (
3) to the form of
with PGF
Specifically, applying the F-S model within the cure rate framework discussed above and using Equation (
2), the survival function for the population is given by
Hereafter, we refer to the model appearing in Equation (
4) as the Flory–Schulz cure rate (F-SCR) model. From Equation (
4) and as
is a proper survival function, the cure rate immediately is given by
Thus, the cured fraction is an increasing function of . Henceforth, we adopt the parameterization , i.e., . In this way, covariates can be linked directly to the cure rate p, which allows regression coefficients to be compared across different models parameterized in terms of the cure rate.
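As a worked sketch of the steps above, assume that the F-S distribution is written with a parameter $\theta \in (0,1)$ (the symbol $\theta$ and the notation $A_M$, $S_{\mathrm{pop}}$ are assumptions made here for illustration) and has the standard probability mass function $P(M=m)=\theta^{2}m(1-\theta)^{m-1}$, $m=1,2,\dots$. Shifting the support to $\{0,1,2,\dots\}$ and summing the resulting series would give the PGF, the population survival function and the cure rate as follows, where $S(t)$ denotes the common survival function of the latent event times:

```latex
\begin{align*}
P(M=m) &= \theta^{2}(m+1)(1-\theta)^{m}, \qquad m=0,1,2,\dots,\\
A_M(z) &= \sum_{m=0}^{\infty} z^{m}\,P(M=m)
        = \frac{\theta^{2}}{\{1-(1-\theta)z\}^{2}}, \qquad |z|\le 1,\\
S_{\mathrm{pop}}(t) &= A_M\{S(t)\}
        = \frac{\theta^{2}}{\{1-(1-\theta)S(t)\}^{2}},\\
p &= \lim_{t\to\infty} S_{\mathrm{pop}}(t) = A_M(0) = P(M=0) = \theta^{2},
  \qquad\text{so that}\quad \theta=\sqrt{p}.
\end{align*}
```

The second line uses $\sum_{m\ge 0}(m+1)x^{m}=(1-x)^{-2}$ with $x=(1-\theta)z$. Under these assumptions, substituting $\theta=\sqrt{p}$ gives $S_{\mathrm{pop}}(t)=p/\{1-(1-\sqrt{p})S(t)\}^{2}$, which equals 1 at $t=0$ and decreases to $p$ as $t\to\infty$.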
With the parameterization in the cure rate
p, the survival function of the F-SCR model for the population from Equation (
4) is given by
From Equation (
5), the probability density function (PDF) of the population is given by
where
is the survival function and
is the probability density function of the latent event times
. Since
is not a proper survival function, the
is not a proper PDF either. The hazard function associated with the PDF in Equation (
6) is given by
Moreover, we note that in the negative binomial (NBCR) cure rate model (Rodrigues et al. [
16]) if
and
, the NBCR cure rate model reduces to the F-SCR model. As mentioned, the F-SCR model has several attractive features: it has a single parameter, a simple form and a straightforward reparametrization in terms of the cure rate. In this context, it is very common to use the Weibull distribution as the model for the time-to-event of the concurrent causes. This is because the Weibull model has a simple expression for the survival function and its hazard rate assumes different shapes depending only on its shape parameter. Alternatively, in this work we propose, for the first time in a cure rate model context, the use of the GeTNH distribution, because its hazard function can be increasing, decreasing, bathtub shaped or increasing–decreasing (upside-down bathtub shaped) depending on the parameter values. The PDF for the GeTNH model is
where
. The survival function of GeTNH distribution is given by
For this particular case (taking the GeTNH distribution for the time to the event of interest), we refer to the model as the F-SCR/GeTNH model, and the PDF, CDF and associated survival functions are given by
and
respectively.
The different shapes of the hazard rate function (HRF) and survival function of the F-SCR/GeTNH model are illustrated in Figure 1. Note that the HRF of the F-SCR/GeTNH model is very flexible and can be decreasing, bathtub shaped or increasing–decreasing (upside-down bathtub shaped) depending on the parameter values. This wide range of HRF shapes makes the F-SCR/GeTNH model flexible for analyzing real lifetime data sets in many areas of reliability and survival analysis.
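To make the construction in this section concrete, the following is a minimal sketch of the population survival, density and hazard functions of an F-SCR model parameterized by the cure rate p. It assumes the population survival takes the form p / {1 - (1 - sqrt(p)) S(t)}^2, which follows from the standard F-S probability mass function shifted to start at zero, and it uses a Weibull latent time instead of the GeTNH distribution purely for illustration. All function names are hypothetical.

```python
import math


def weibull_survival(t, shape, scale):
    """Latent survival function S(t) of a Weibull(shape, scale) time, t > 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_pdf(t, shape, scale):
    """Latent density f(t) = -dS(t)/dt of the same Weibull time, t > 0."""
    z = (t / scale) ** shape
    return (shape / t) * z * math.exp(-z)


def fscr_survival(t, p, shape, scale):
    """Assumed F-SCR population survival: p / {1 - (1 - sqrt(p)) S(t)}^2."""
    a = math.sqrt(p)
    s = weibull_survival(t, shape, scale)
    return p / (1.0 - (1.0 - a) * s) ** 2


def fscr_pdf(t, p, shape, scale):
    """Improper population density, minus the derivative of fscr_survival."""
    a = math.sqrt(p)
    s = weibull_survival(t, shape, scale)
    f = weibull_pdf(t, shape, scale)
    return 2.0 * p * (1.0 - a) * f / (1.0 - (1.0 - a) * s) ** 3


def fscr_hazard(t, p, shape, scale):
    """Population hazard: improper density divided by population survival."""
    return fscr_pdf(t, p, shape, scale) / fscr_survival(t, p, shape, scale)


if __name__ == "__main__":
    # The survival curve starts near 1 and levels off at the cure rate p.
    for t in (0.01, 1.0, 5.0, 100.0):
        print(t, fscr_survival(t, p=0.3, shape=1.5, scale=2.0))
```

Replacing the two Weibull helpers with the GeTNH survival function and density would give the F-SCR/GeTNH model; the population-level formulas do not change.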
3. Inference
In this section, the parameters of the F-SCR/GeTNH model are estimated by the maximum likelihood (ML) method. We consider lifetimes subject to right censoring. Let
and
be the failure and censoring time variables for the
ith individual, respectively. In a random sample of size
n, we observed
and
, where
and
denote that a failure or a censoring time was observed for the
ith individual, respectively. Additionally, to avoid the identifiability problems discussed in Li et al. [18] and Hanin and Huang [19], we assume that the population is heterogeneous and consider a covariate vector of dimension s, say
, related to the cure rate. Such covariates can be linked with
as
where
is a vector of unknown regression coefficients. Under those assumptions, these covariates are linked to the parameter
p in the following way:
The notation ⊤ indicates the transpose. Under the usual assumptions in survival analysis (independence between
and
, independence among the observations and non-informative censoring, Williams and Lagakos [
20]), the log-likelihood of the model is given by
where
, and
.
The ML estimates of the model parameters are obtained by numerically maximizing the log-likelihood function in Equation (8). For this purpose, we used the NMaximize function in the Mathematica 12.0 software.
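The right-censored log-likelihood described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a logistic link with an intercept between the covariates and the cure rate, the F-SCR population survival p / {1 - (1 - sqrt(p)) S(t)}^2, and a Weibull latent time in place of the GeTNH distribution. The function and parameter names are hypothetical.

```python
import math


def logistic(eta):
    """Assumed link: maps a linear predictor to a cure rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fscr_weibull_loglik(params, times, deltas, covariates):
    """Right-censored log-likelihood of an F-SCR model with Weibull latent times.

    params     : [log_shape, log_scale, beta_0, beta_1, ..., beta_s]
    times      : observed times t_i > 0
    deltas     : 1 if a failure was observed, 0 if the time is censored
    covariates : one sequence (x_i1, ..., x_is) per individual
    """
    shape, scale = math.exp(params[0]), math.exp(params[1])
    beta = params[2:]
    total = 0.0
    for t, d, x in zip(times, deltas, covariates):
        eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
        p = logistic(eta)                  # individual cure rate
        a = math.sqrt(p)
        z = (t / scale) ** shape
        s = math.exp(-z)                   # latent survival S(t_i)
        log_denom = math.log(1.0 - (1.0 - a) * s)
        if d == 1:
            # log of the improper density 2 p (1 - a) f(t) / denom^3
            log_f = math.log(shape) - math.log(t) + math.log(z) - z
            total += (math.log(2.0) + math.log(p) + math.log1p(-a)
                      + log_f - 3.0 * log_denom)
        else:
            # log of the population survival p / denom^2
            total += math.log(p) - 2.0 * log_denom
    return total
```

The negative of this function can be passed to any general-purpose numerical optimizer. Working with the logarithms of the shape and scale keeps the search unconstrained.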
Further, the confidence intervals of the model parameters with covariates based on the asymptotic distributions of their ML estimators are derived. Under certain regularity conditions presented in
Appendix A,
is approximately distributed as a multivariate normal distribution with mean
and covariance matrix
, which can be estimated by
The approximate
confidence intervals of the parameters
and the components of
are
where
is the
quantile of the standard normal distribution.
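The interval construction above can be sketched in a few lines. The sketch assumes that the point estimates and an estimated covariance matrix (for example, the inverse of the observed information matrix) are already available; the function name and argument layout are hypothetical.

```python
import math
from statistics import NormalDist


def wald_intervals(estimates, covariance, level=0.95):
    """Approximate confidence intervals: estimate +/- z * standard error.

    estimates  : list of ML estimates
    covariance : estimated covariance matrix as a list of rows
    level      : confidence level, e.g. 0.95
    """
    # Upper (1 - alpha/2) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    intervals = []
    for k, est in enumerate(estimates):
        se = math.sqrt(covariance[k][k])   # standard error from the diagonal
        intervals.append((est - z * se, est + z * se))
    return intervals
```

For a parameter restricted to positive values, a lower bound computed this way can fall below zero, in which case it is usually truncated at the boundary of the parameter space.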
4. Applications: Data Analysis
We provide applications of the F-SCR model to three real cancer data sets. We compare the F-SCR model with the most popular cure rate models, namely, the Bernoulli (BeCR) and binomial (BCR) cure rate models (D’Andrea et al. [5]), the Poisson (PCR) cure rate model (Chen et al. [13]) and the NBCR model, each based on the GeTNH, Nadarajah–Haghighi (NH) and Weibull distributions, by considering Akaike’s information criterion (AIC) (Akaike [21]), the Bayesian information criterion (BIC) (Schwarz [22]) and the Hannan–Quinn information criterion (HQIC) (Hannan and Quinn [23]). These criteria are computed for each of the fitted models; lower AIC, BIC and HQIC values indicate a better model. The survival functions of these cure rate models, parameterized in terms of the cure rate, are presented in Appendix B. We also compare the Kaplan–Meier estimator with the fitted survival curves of the models. In this section, all calculations were performed using the maxLik package of the R software [24] with the Newton–Raphson maximization method and the NMaximize function in the Mathematica 12.0 software with the Nelder–Mead maximization method.
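For reference, the three information criteria can be computed from a maximized log-likelihood as in the following sketch, which uses their standard definitions. The function name and the input values in the usage line are hypothetical and are not results from this study.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Return AIC, BIC and HQIC for a fitted model.

    loglik   : maximized log-likelihood
    n_params : number of estimated parameters
    n_obs    : sample size
    """
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    hqic = -2.0 * loglik + 2.0 * n_params * math.log(math.log(n_obs))
    return {"AIC": aic, "BIC": bic, "HQIC": hqic}


if __name__ == "__main__":
    # Illustrative inputs only.
    print(information_criteria(loglik=-100.0, n_params=3, n_obs=100))
```

All three criteria add a penalty for the number of parameters to minus twice the log-likelihood, so a one-parameter count model such as the F-S is penalized less than the two-parameter negative binomial when the fits are similar.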
4.1. Colon Cancer Data
The colon cancer data set relates to the recurrence or death of colon cancer patients. This data set has 1858 observations with 50.48% censoring (938 in total) and is available in the survival package [25] of R [24]. For computational purposes, we convert all observations from days to years. The mean and standard deviation of the observed survival time (measured in years) are 4.2124 and 2.5937, respectively. For more details concerning this data set, see Laurie et al. [26]. For the colon data set, the following covariates are available:
node4: , more than four positive lymph nodes (0 = no, 1 = yes),
sex: (1 = male, 0 = female),
etype: (1 = recurrence, 2 = death),
surg: , time from surgery to registration (0 = short, 1 = long),
extent: , extent of local spread (1 = submucosa, 2 = muscle, 3 = serosa, 4 = contiguous structures).
Figure 2 shows the plot of the Kaplan–Meier estimator for the survival function of the colon data. We observe that, for a proportion of the patients, colon cancer will never recur; the patients censored at the end of the experiment may be immune, suggesting that they can be considered cured.
Table 1, Table 2 and Table 3 provide the ML estimators of the parameters and the corresponding information criteria for the colon cancer data, assuming the GeTNH, NH and Weibull distributions for the time-to-event of the concurrent causes. The AIC, BIC and HQIC values in Table 1, Table 2 and Table 3 are lowest for the F-SCR/GeTNH model, which therefore provides a better fit to the colon cancer data than the BeCR/GeTNH, BCR/GeTNH, PCR/GeTNH, NBCR/GeTNH, F-SCR/NH, BeCR/NH, BCR/NH, PCR/NH, NBCR/NH, F-SCR/We, BeCR/We, BCR/We, PCR/We and NBCR/We cure rate models. In Table 1 and Table 2 the estimated lower bound of the approximate confidence intervals of the parameters
and
is negative in some cases. Since
and
, we convert all negative values to zero (0
).
From
Table 1 and using the F-SCR/GeTNH model, we conclude that the parameters
,
,
and
have a significant effect on the cure rate (95% confidence interval for
,
,
,
and
, not including zero) and
and
have no significant influence on the cure rate.
Figure 3 shows the plots of the estimated survival functions of the F-SCR/GeTNH model for the colon cancer data, stratified by sex and by whether the patient has more than four positive lymph nodes (0 = no, 1 = yes), for event type: recurrence; time from surgery to registration: short; extent of local spread: serosa. Figure 4 shows the corresponding plots stratified by surg and by whether the patient has more than four positive lymph nodes (0 = no, 1 = yes), for event type: recurrence; sex: female; extent of local spread: serosa. From Figure 3 and Figure 4, we observe that the cure rate of patients with more than four positive lymph nodes is lower than that of patients without. According to Figure 3, among patients without more than four positive lymph nodes, the cure rate of male patients is lower than that of female patients, whereas among patients with more than four positive lymph nodes, the cure rate of female patients is lower than that of male patients. According to Figure 4, in both lymph node groups, the cure rate of patients with a long time from surgery to registration is lower than that of patients with a short time from surgery to registration.
4.2. Melanoma Cancer Data
This melanoma cancer data set comes from the phase III cutaneous melanoma clinical trial described and analyzed by Ibrahim et al. [27], in which patients were observed for recurrence after the removal of a malignant melanoma. The data set, labeled E1690, is available at http://merlot.stat.uconn.edu/~mhchen/survbook/ (accessed on 15 October 2022). The original sample has 427 patients (417 observations and 10 missing) with 55.63% censored observations (232 in total); the mean and standard deviation of the observed lifetimes (measured in years) are 3.18 and 1.69, respectively. For the melanoma data set, we considered the following covariates:
age: , classified as zero when the age was below the third quartile (57.56 years) and as one otherwise;
nodes1: , nodule category 1 to 4, with 4 being the most severe category of cancer;
perform: , performance status. This means a patient’s functional capacity scale as regards his or her daily activities (0: fully active, 1: other);
sex: , (0: male; 1: female).
Figure 5 shows the Kaplan–Meier estimator for the survival function of the melanoma cancer data, suggesting that there is a proportion of patients in the population with melanoma cancer that can be considered cured.
Table 4 shows the ML estimators of the parameters and the corresponding information criteria for the melanoma cancer data, assuming the GeTNH distribution for the time-to-event of the concurrent causes. The AIC, BIC and HQIC values in Table 4 are lowest for the F-SCR/GeTNH model, which therefore provides a better fit to the melanoma cancer data than the BeCR/GeTNH, BCR/GeTNH, PCR/GeTNH and NBCR/GeTNH cure rate models. From Table 4 and using the F-SCR/GeTNH model, we conclude that the parameter
has a significant effect on the cure rate (95% confidence interval for
, not including zero) and
,
and
have no significant influence on the cure rate.
Figure 6 shows the plots of the estimated survival functions of the F-SCR/GeTNH model for the melanoma cancer data stratified by nodule category (1 to 4), for perform: fully active; sex: female; age: below 57.56 years. From Figure 6, we observe that the cure rate of melanoma cancer patients with nodule category 1 is higher than that of patients with nodule categories 2 to 4.
4.3. Oropharynx Cancer Data
This oropharynx cancer data set contains the survival times of 195 patients with carcinoma of the oropharynx. The percentage of censored observations is nearly 27.1% (53 in total). For computational purposes, we convert the survival times from days to years. The mean and standard deviation of the observed survival time (measured in years) are 1.53 and 1.14, respectively. These data are available in Kalbfleisch and Prentice [28]. For the oropharynx data set, the following covariates are associated with each participant:
age: (0: less than 60 years; 1: greater or equal to 60 years);
T stage: ; 1: primary tumour measuring 2 cm or less in largest diameter; 2: primary tumour measuring 2 cm to 4 cm in the largest diameter with minimal infiltration in depth; 3: primary tumour measuring more than 4 cm; 4: massive invasive tumour,
N stage: ; 0: no clinical evidence of node metastases; 1: single positive node 3 cm or less in diameter, not fixed; 2: single positive node more than 3 cm in diameter, not fixed; 3: multiple positive nodes or fixed positive nodes;
sex: (1: male; 2: female).
Figure 7 shows the Kaplan–Meier estimator for the survival function of the oropharynx cancer data, suggesting that there is a proportion of patients in the population with oropharynx cancer that can be considered cured.
Table 5 shows the ML estimators of the parameters and the corresponding information criteria for the oropharynx cancer data, assuming the GeTNH distribution for the time-to-event of the concurrent causes. The AIC, BIC and HQIC values in Table 5 are lowest for the F-SCR/GeTNH model, which therefore provides a better fit to the oropharynx cancer data than the BeCR/GeTNH, BCR/GeTNH, PCR/GeTNH and NBCR/GeTNH cure rate models. From Table 5 and using the F-SCR/GeTNH model, we conclude that the parameters
and
have a significant effect on the cure rate (95% confidence interval for
and
, not including zero) and
and
have no significant influence on the cure rate.
Figure 8 shows the plots of the estimated survival functions of the F-SCR/GeTNH model for the oropharynx cancer data stratified by T stage (1 to 4), for sex: male; age: greater than or equal to 60 years. From Figure 8, we observe that the cure rate of oropharynx cancer patients with T stage 1 is higher than that of patients with T stages 2 to 4.
5. Simulation Study
In this section, we present a simulation study to assess the accuracy of the ML estimators of the parameters of the F-SCR model based on the GeTNH distribution with covariates. Applying an algorithm similar to that of Kutal and Qian [29], right-censored samples of size n from the F-SCR model based on the GeTNH distribution are generated as follows.
- Step 1:
Fix the parameter values, , and , as well as the value of the cure fraction .
- Step 2:
Generate n random samples from
- Step 3:
The random survival time can be calculated from the equation
if
; otherwise,
is infinity.
- Step 4:
Generate a random sample of censoring times from a GeTNH distribution, adjusting its parameters to obtain the desired censoring rates.
- Step 5:
Calculate
. Pairs of simulated values
,
are thus obtained, where
if
and
if
. In the simulation study, we pick the F-SCR model with three covariates
and
, where
,
and
. In the case of high cure rate the initial values of parameters are
and in the case of low cure rate they are
. The initial value of
and
is computed from four combinations of values of
and cure rates
. For
, we choose (0, 0, 0), (1, 0, 0), (0, 1, 2) and (1, 2, 1). In the studies, we also consider two levels of the cure rate, say high cure rate (0.8, 0.7, 0.6, 0.5) and low cure rate (0.25, 0.2, 0.15, 0.10). Solving the four equations resulting from Equation (
7), for high cure rate we obtain
and
, and for low cure rate we obtain
and
.
For the simulations, we take the sample sizes
. We replicate the simulations 1000 times and evaluate the estimated bias and the mean square errors (MSEs) of the ML estimators. The program code is available upon request. Based on the simulation results in Table 6, we observe that the MSEs of the ML estimators decrease as the sample size n increases. In fact, the ML estimators tend to be closer to the true parameter values as n increases, which is in agreement with the consistency of the ML estimators for the F-SCR/GeTNH model.
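The data-generating steps of the simulation can be sketched by inverse-transform sampling as follows. This is an illustration under stated assumptions, not the code used in the study: it assumes the F-SCR population survival p / {1 - (1 - sqrt(p)) S(t)}^2, uses a single cure rate p without covariates, takes a Weibull latent time instead of the GeTNH distribution, and draws exponential censoring times instead of GeTNH ones. All names are hypothetical.

```python
import math
import random


def simulate_fscr_weibull(n, p, shape, scale, censor_rate, seed=0):
    """Generate n right-censored pairs (t_i, delta_i) from an F-SCR model.

    A uniform draw u <= p marks a cured individual (infinite event time);
    otherwise the event time solves S_pop(y) = u for y.
    """
    rng = random.Random(seed)
    a = math.sqrt(p)
    data = []
    for _ in range(n):
        u = rng.random()
        y = math.inf                       # cured unless shown otherwise
        if u > p:
            # Invert p / {1 - (1 - a) S}^2 = u for the latent survival S.
            s = (1.0 - a / math.sqrt(u)) / (1.0 - a)
            if s > 0.0:                    # guard against rounding to zero
                # Invert the Weibull survival exp(-(y / scale)^shape) = s.
                y = scale * (-math.log(s)) ** (1.0 / shape)
        c = rng.expovariate(censor_rate)   # assumed exponential censoring
        t = min(y, c)
        delta = 1 if y <= c else 0
        data.append((t, delta))
    return data


if __name__ == "__main__":
    sample = simulate_fscr_weibull(500, p=0.3, shape=1.5, scale=2.0,
                                   censor_rate=0.2, seed=1)
    print(sum(1 - d for _, d in sample) / len(sample))  # observed censoring rate
```

Cured individuals always appear as censored observations, so the observed censoring rate is at least p; the censoring rate parameter can be tuned to reach a target overall censoring percentage.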