The last two decades have seen remarkable progress in cancer treatments, leading to longer patient survival and improved quality of life. This has prompted a surge of statistical research on cure models. These models are useful tools for analyzing and describing survival data with long-term survivors, since they express and predict a patient's prognosis while allowing for the real possibility that the subject may never experience the event of interest. Cure models make it possible to estimate the cured proportion, $1 - p(x)$, and also the probability of survival of the uncured patients up to a given time point, or latency, $S_0(t \mid x)$. In the literature, [1] proposed the nonparametric incidence estimator $1 - \hat{p}_h(x) = \hat{S}_h(T^1_{\max} \mid x)$, where $\hat{S}_h$ is the conditional Kaplan-Meier estimator with bandwidth $h$, and $T^1_{\max}$ is the largest uncensored failure time. The first completely nonparametric approach in mixture cure models was proposed by [2], who introduced the nonparametric latency estimator $\hat{S}_{0,b}(t \mid x) = \dfrac{\hat{S}_b(t \mid x) - (1 - \hat{p}_b(x))}{\hat{p}_b(x)}$, with bandwidth $b$, studied in detail by [3].

Furthermore, in cancer studies it is of interest to test whether a covariate influences the cure rate or the survival time of the susceptible patients. Since no significance test has yet been proposed for nonparametric cure models, this gap is filled here with a covariate significance test for the incidence. The test identifies which covariates must be included in the incidence of a mixture cure model. Following [4], the proposed statistic is based on an empirical process defined in terms of the sample size $n$, an estimator of the cure indicator for each individual, and the covariate $Z$. Possible test statistics are of Cramér-von Mises (CvM) or Kolmogorov-Smirnov (KS) type. The null distribution of the test statistic is approximated by bootstrap, using independent naive resampling. For an $m$-dimensional covariate, $Z = (Z_1, \ldots, Z_m)$, the method considers $m$ hypotheses, one per component, to be tested independently. To control the false discovery rate, the approach of [5] to multiple significance testing is studied. In addition, to control the family-wise error rate, the conservative method of [6] is considered.
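As a rough illustration of how a test of this kind could be computed, the sketch below builds a marked empirical process from cure-indicator estimates and a scalar covariate, evaluates CvM and KS type statistics, and approximates their null distribution by resampling the two arrays independently. The specific form of the process, the centring by the sample mean, the reading of "independent naive resampling", and the names `marked_process`, `cvm_ks` and `bootstrap_pvalues` are all assumptions made for illustration, not the definitions used in [4] or in the proposed test. The cure-indicator estimates `eta_hat` are taken as given.

```python
import numpy as np


def marked_process(eta_hat, z):
    """Assumed process U_n(z) = n^{-1/2} * sum_i (eta_i - mean(eta)) * 1(Z_i <= z),
    evaluated at every sample point, returned in increasing order of z."""
    eta_hat = np.asarray(eta_hat, dtype=float)
    z = np.asarray(z, dtype=float)
    n = z.size
    order = np.argsort(z, kind="stable")
    z_sorted = z[order]
    csum = np.cumsum(eta_hat[order] - eta_hat.mean())
    # With tied covariate values, use the cumulative sum at the last tied index.
    last = np.searchsorted(z_sorted, z_sorted, side="right") - 1
    return csum[last] / np.sqrt(n)


def cvm_ks(eta_hat, z):
    """CvM type (mean of squares) and KS type (sup of absolute values) statistics."""
    u = marked_process(eta_hat, z)
    return float(np.mean(u ** 2)), float(np.max(np.abs(u)))


def bootstrap_pvalues(eta_hat, z, n_boot=500, seed=0):
    """Approximate p-values by drawing eta and Z independently with replacement,
    which mimics the null hypothesis of no covariate effect (an assumption)."""
    eta_hat = np.asarray(eta_hat, dtype=float)
    z = np.asarray(z, dtype=float)
    rng = np.random.default_rng(seed)
    n = z.size
    cvm_obs, ks_obs = cvm_ks(eta_hat, z)
    cvm_boot = np.empty(n_boot)
    ks_boot = np.empty(n_boot)
    for b in range(n_boot):
        eta_star = eta_hat[rng.integers(0, n, size=n)]
        z_star = z[rng.integers(0, n, size=n)]
        cvm_boot[b], ks_boot[b] = cvm_ks(eta_star, z_star)
    return float(np.mean(cvm_boot >= cvm_obs)), float(np.mean(ks_boot >= ks_obs))
```

For an $m$-dimensional covariate, a function like `bootstrap_pvalues` would be called once per component, giving $m$ pairs of p-values that can then be passed to a multiple-testing procedure.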
Application to Medical Data
The proposed methodology is applied to a dataset of 414 colorectal cancer patients from CHUAC. The goal is to estimate the cure rate as a function of stage (from 1 to 4) and age. The event of interest is death due to colorectal cancer, and the censoring percentage is between (Stage 4) and (Stage 1). Figure S1 in the Supplementary Materials shows that the effect of age on the cure rate changes with the stage. For example, in Stage 1, patients have a probability of survival between and , depending on age; whereas in Stage 3, for patients above 60, over a 10-year gap that probability decreases considerably from to almost 0. The latency estimation for three specific ages is shown in Figure S2 in the Supplementary Materials. For Stages 1–2, age does not seem to be a determining factor for the survival of the uncured patients. In contrast, for Stages 3–4, the latency estimation varies considerably with age. For example, the probability that the follow-up time from diagnosis until death is larger than years is around for patients aged 35 and 50, whereas for 80-year-old patients that probability is larger than .
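To make the estimation step concrete, the following minimal sketch computes a kernel-weighted (conditional) Kaplan-Meier estimator for a single continuous covariate such as age, and from it a cure probability (the survival estimate at the largest uncensored time) and a latency estimate. The Epanechnikov kernel, the Nadaraya-Watson weights, the use of one bandwidth for both pieces of the latency, and the function names are assumptions chosen for illustration; the sketch also assumes untied observed times and does not reproduce the analysis of the CHUAC data.

```python
import numpy as np


def nw_weights(x0, x, h):
    """Nadaraya-Watson weights at x0 with an Epanechnikov kernel (assumed choice)."""
    u = (x0 - np.asarray(x, dtype=float)) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    total = k.sum()
    if total == 0.0:
        raise ValueError("no observations within bandwidth h of x0")
    return k / total


def conditional_km(t, x0, times, delta, x, h):
    """Kernel-weighted product-limit estimate of S(t | x0).
    delta[i] = 1 for an observed event, 0 for a censored time."""
    times = np.asarray(times, dtype=float)
    delta = np.asarray(delta)
    order = np.argsort(times, kind="stable")
    t_s, d_s = times[order], delta[order]
    w = nw_weights(x0, x, h)[order]
    at_risk = np.cumsum(w[::-1])[::-1]  # total weight of subjects still at risk
    factors = np.ones_like(w)
    jump = (d_s == 1) & (at_risk > 0)
    factors[jump] = 1.0 - w[jump] / at_risk[jump]
    return float(np.prod(factors[t_s <= t]))


def cure_probability(x0, times, delta, x, h):
    """Estimated cure probability: S_h evaluated at the largest uncensored time."""
    times = np.asarray(times, dtype=float)
    t1_max = times[np.asarray(delta) == 1].max()
    return conditional_km(t1_max, x0, times, delta, x, h)


def latency(t, x0, times, delta, x, b):
    """Estimated survival of the uncured: (S_b(t|x0) - cure) / (1 - cure)."""
    cure = cure_probability(x0, times, delta, x, b)
    if cure >= 1.0:
        raise ValueError("estimated probability of being uncured is zero")
    return (conditional_km(t, x0, times, delta, x, b) - cure) / (1.0 - cure)
```

Evaluating `cure_probability` over a grid of ages within each stage would give curves of the kind described for Figure S1, and `latency` over a grid of times for fixed ages would give curves of the kind described for Figure S2.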
Moreover, a dataset of patients with sarcomas, provided by CHUS, is studied. It consists of 261 observations with 372,420 covariates containing information about DNA methylation and 32 covariates with clinical data. The event of interest is death due to sarcoma, and a total of 195 observations are censored. With the conservative method, the results show that only one covariate is significant for the cure rate: “Year of initial pathologic diagnosis”. With the non-conservative alternative, the results for bootstrap resamples show that, for the CvM statistic, there are 14,182 significant covariates and 650 non-conclusive covariates, which need to be considered again in the next iteration of the process. For the KS statistic, there are 12,411 significant covariates and 608 non-conclusive covariates. The program is still running for bootstrap resamples.
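The contrast between a conservative and a non-conservative correction can be illustrated with two standard procedures applied to a vector of per-covariate p-values. The sketch below uses the Bonferroni correction for family-wise error rate control and the Benjamini-Hochberg step-up procedure for false discovery rate control; these are swapped in as well-known stand-ins and are not claimed to be the methods of [6] and [5]. In particular, the sketch has no notion of "non-conclusive" covariates or of iterating over bootstrap resamples. Function names and the level `alpha` are hypothetical.

```python
import numpy as np


def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_j when p_j <= alpha / m (controls the family-wise error rate)."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Step-up procedure: find the largest k with p_(k) <= alpha * k / m and
    reject the k hypotheses with the smallest p-values."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject
```

With hundreds of thousands of covariates, the Bonferroni threshold `alpha / m` becomes extremely small, which is in line with very few covariates being flagged by a conservative method while a false discovery rate procedure flags many more.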