#### **2. Estimation When the Cure Status Is Partially Available**

Let *Y* be the time until the event of interest, let **X** be a vector of covariates, and let *F*(*t* | **x**) = *P*(*Y* ≤ *t* | **X** = **x**) be the distribution function of *Y* conditional on **X** = **x**. In follow-up studies, the event of interest may not be observed because of, for example, the end of the study or loss to follow-up, which occurs at a censoring time *C*<sup>∗</sup> with conditional distribution function *G*(*t* | **x**) = *P*(*C*<sup>∗</sup> ≤ *t* | **X** = **x**). As a consequence, instead of *Y*, only the possibly censored survival time *T*<sup>∗</sup> = min(*Y*, *C*<sup>∗</sup>) and the event indicator *δ* = **1**(*Y* < *C*<sup>∗</sup>) can be observed. The random variables *Y* and *C*<sup>∗</sup> are assumed to be conditionally independent given **X** = **x**, which is a widely used assumption. We set *Y* = ∞ if the subject will never experience the event and so is cured. Let *ν* = **1**(*Y* = ∞) be the indicator of being cured. Note that *ν* is only partially observed: the individual is known not to be cured (*ν* = 0) when the event is observed (*δ* = 1), but in the standard setting *ν* is unknown when *δ* = 0. When the cure status is partially known, some censored individuals are identified as cured, so *ν* = 1 is observed for them.

**Citation:** Safari, W.C.; López-de-Ullibarri, I.; Jácome, M.A. Nonparametric Inference for Mixture Cure Model When the Cure Information Is Partially Available. *Eng. Proc.* **2021**, *7*, 17. https://doi.org/10.3390/engproc2021007017

Academic Editors: Joaquim de Moura, Marco A. González, Javier Pereira and Manuel G. Penedo

Published: 9 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

To accommodate the cure status information, we introduce an additional random variable *ξ*, which indicates whether the cure status *ν* is known (*ξ* = 1) or not (*ξ* = 0). Furthermore, let the censoring distribution be an improper distribution function *G*(*t* | **x**) = (1 − *π*(**x**))*G*<sub>0</sub>(*t* | **x**). Thus, with probability *π*(**x**), the censoring variable is *C*<sup>∗</sup> = ∞, and with probability 1 − *π*(**x**), *C*<sup>∗</sup> takes the value of a random variable *C* with proper continuous distribution function *G*<sub>0</sub>(*t* | **x**). A cured individual is identified with probability *P*(*ξ* = 1 | *ν* = 1, **X** = **x**) = *P*(*C*<sup>∗</sup> = ∞ | **X** = **x**) = *π*(**x**). In this setup, the data actually observed are {(**X**<sub>*i*</sub>, *T*<sub>*i*</sub>, *δ*<sub>*i*</sub>, *ξ*<sub>*i*</sub>, *ξ*<sub>*i*</sub>*ν*<sub>*i*</sub>) : *i* = 1, ..., *n*}, where the observed time is *T*<sub>*i*</sub> = min(*Y*<sub>*i*</sub>, *C*<sup>∗</sup><sub>*i*</sub>) = *T*<sup>∗</sup><sub>*i*</sub>, except for individuals identified as cured, for whom *T*<sub>*i*</sub> = *C*<sub>*i*</sub>. Hence, the observations can be classified into three groups: (a) the individual is observed to have experienced the event and, therefore, is known to be uncured (**X**<sub>*i*</sub>, *T*<sub>*i*</sub> = *Y*<sub>*i*</sub>, *δ*<sub>*i*</sub> = 1, *ξ*<sub>*i*</sub> = 1, *ξ*<sub>*i*</sub>*ν*<sub>*i*</sub> = 0); (b) the lifetime is censored and the cure status is unknown (**X**<sub>*i*</sub>, *T*<sub>*i*</sub> = *C*<sub>*i*</sub>, *δ*<sub>*i*</sub> = 0, *ξ*<sub>*i*</sub> = 0, *ξ*<sub>*i*</sub>*ν*<sub>*i*</sub> = 0); and (c) the lifetime is censored and the individual is known to be cured (**X**<sub>*i*</sub>, *T*<sub>*i*</sub> = *C*<sub>*i*</sub>, *δ*<sub>*i*</sub> = 0, *ξ*<sub>*i*</sub> = 1, *ξ*<sub>*i*</sub>*ν*<sub>*i*</sub> = 1). In standard cure models, where the cure status is unknown for all the censored observations, only groups (a) and (b) are considered.
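The three observation groups can be made concrete with a small sketch. The function name `classify_observations` and the array layout are assumptions made for illustration only; the sketch simply maps each observed triple (*δ*, *ξ*, *ξν*) to the group label (a), (b), or (c) described above.

```python
import numpy as np

def classify_observations(delta, xi, xi_nu):
    """Label each observation with its group: 'a', 'b', or 'c'.

    delta : 1 if the event was observed, 0 if censored
    xi    : 1 if the cure status is known, 0 otherwise
    xi_nu : product xi * nu, equal to 1 only for known-cured individuals
    """
    delta = np.asarray(delta, dtype=int)
    xi = np.asarray(xi, dtype=int)
    xi_nu = np.asarray(xi_nu, dtype=int)

    groups = np.full(delta.shape, "", dtype="<U1")
    # (a) event observed, hence known to be uncured
    groups[(delta == 1) & (xi == 1) & (xi_nu == 0)] = "a"
    # (b) censored, cure status unknown
    groups[(delta == 0) & (xi == 0) & (xi_nu == 0)] = "b"
    # (c) censored, known to be cured
    groups[(delta == 0) & (xi == 1) & (xi_nu == 1)] = "c"

    # Any other combination is inconsistent with the setup in the text.
    if np.any(groups == ""):
        raise ValueError("inconsistent (delta, xi, xi_nu) combination")
    return groups

# Hypothetical toy data: one observation from each group.
print(classify_observations([1, 0, 0], [1, 0, 1], [0, 0, 1]))
```

In a standard cure model without cure status information, only the labels `a` and `b` would appear.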

The probability of cure is 1 − *p*(**x**) = *P*(*Y* = ∞ | **X** = **x**), and the conditional survival function of the uncured individuals, also known as the latency, is *S*<sub>0</sub>(*t* | **x**) = *P*(*Y* > *t* | *Y* < ∞, **X** = **x**). The mixture cure model specifies the survival function *S*(*t* | **x**) = *P*(*Y* > *t* | **X** = **x**) as follows:

$$S(t \mid \mathbf{x}) = 1 - p(\mathbf{x}) + p(\mathbf{x})S\_0(t \mid \mathbf{x}).\tag{1}$$

Assuming model (1) and the availability of a suitable estimator of *S*(*t* | **x**), estimators of the cure probability and the latency can be derived from the following relationships:

$$1 - p(\mathbf{x}) = \lim\_{t \to \infty} S(t \mid \mathbf{x}) > 0,\ S\_0(t \mid \mathbf{x}) = \frac{S(t \mid \mathbf{x}) - \{1 - p(\mathbf{x})\}}{p(\mathbf{x})}.\tag{2}$$
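As a quick numerical check of relations (1) and (2), the sketch below builds a survival function from a hypothetical uncure probability and latency, then recovers both from *S* alone. The value `p = 0.6` and the exponential latency are illustrative assumptions, not quantities from the paper, and the covariate is suppressed for simplicity.

```python
import numpy as np

p = 0.6  # hypothetical probability of being uncured

def latency(t):
    """Hypothetical latency S_0(t): unit-rate exponential survival."""
    return np.exp(-np.asarray(t, dtype=float))

def survival(t):
    """Mixture cure model (1): S(t) = 1 - p + p * S_0(t)."""
    return 1.0 - p + p * latency(t)

# Relation (2): the cure probability is the limit of S(t) as t grows.
# A large finite time stands in for the limit here.
cure_prob = float(survival(50.0))

# Relation (2): recover the latency from S and the cure probability.
t = np.linspace(0.0, 5.0, 6)
recovered = (survival(t) - cure_prob) / (1.0 - cure_prob)

print(cure_prob)
print(np.max(np.abs(recovered - latency(t))))
```

The recovered latency agrees with the assumed one up to floating-point error, which is the identity that the estimators of the cure probability and the latency exploit.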

Safari et al. [2] proposed the following generalized product-limit estimator of the conditional survival function *S*(*t* | **x**) when the cure status is partially known:

$$\widehat{S}\_{h}^{c}(t \mid \mathbf{x}) = \prod\_{i=1}^{n} \left( 1 - \frac{\delta\_{[i]} B\_{h[i]}(\mathbf{x}) \mathbf{1}\left(T\_{(i)} \le t\right)}{\sum\_{j=i}^{n} B\_{h[j]}(\mathbf{x}) + \sum\_{j=1}^{i-1} B\_{h[j]}(\mathbf{x}) \mathbf{1}\left(\xi\_{[j]} \nu\_{[j]} = 1\right)}\right),\tag{3}$$

where **X**<sub>[*i*]</sub>, *δ*<sub>[*i*]</sub>, *ξ*<sub>[*i*]</sub>, and *ν*<sub>[*i*]</sub> are the concomitants of the ordered observed times *T*<sub>(1)</sub> ≤ ... ≤ *T*<sub>(*n*)</sub>, and *B*<sub>*h*[*i*]</sub>(**x**) is the Nadaraya–Watson (NW) weight

$$B\_{h[i]}(\mathbf{x}) = \frac{K\_{h}\left(\mathbf{x} - X\_{[i]}\right)}{\sum\_{j=1}^{n} K\_{h}\left(\mathbf{x} - X\_{j}\right)},$$

where *K*<sub>*h*</sub>(·) = *K*(·/*h*)/*h* is a kernel function *K*(·) rescaled with bandwidth *h*. The corresponding estimator of the cure rate 1 − *p*(**x**) [3] is the following:

$$1 - \hat{p}\_h^c(\mathbf{x}) = \hat{S}\_h^c \left( T\_{(n)}^1 \mid \mathbf{x} \right), \tag{4}$$

where *T*<sup>1</sup><sub>(*n*)</sub> is the largest uncensored observed time. In light of (3), (4), and the relation in (2), a nonparametric estimator of the latency function is given by the following:

$$\hat{S}\_{0,h\_1,h\_2}^{c}(t \mid \mathbf{x}) = \begin{cases} \dfrac{\hat{S}\_{h\_2}^{c}(t \mid \mathbf{x}) - (1 - \hat{p}\_{h\_1}^{c}(\mathbf{x}))}{\hat{p}\_{h\_1}^{c}(\mathbf{x})} & \text{if } 0 \le t \le T\_{(n)}^{1} \text{ and } \hat{S}\_{h\_2}^{c}(t \mid \mathbf{x}) > 1 - \hat{p}\_{h\_1}^{c}(\mathbf{x})\\\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

The optimal bandwidth for *Ŝ*<sup>*c*</sup><sub>*h*</sub>(*t* | **x**) in (3) is not necessarily the optimal bandwidth for 1 − *p̂*<sup>*c*</sup><sub>*h*</sub>(**x**) in (4); therefore, the estimator in (5) is more general, using two different bandwidths: *h*<sub>2</sub> for estimating *S*(*t* | **x**) and *h*<sub>1</sub> for estimating 1 − *p*(**x**). Note that if *h* = *h*<sub>1</sub> = *h*<sub>2</sub>, then the estimator in (5) reduces to the following estimator:

$$
\hat{S}\_{0,h}^{c}(t \mid \mathbf{x}) = \frac{\hat{S}\_h^{c}(t \mid \mathbf{x}) - (1 - \hat{p}\_h^{c}(\mathbf{x}))}{\hat{p}\_h^{c}(\mathbf{x})}.
$$
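The estimators in (3), (4), and (5) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the covariate is taken to be scalar, the Epanechnikov kernel is an arbitrary choice, tied times are ordered with events first, and all function names are hypothetical.

```python
import numpy as np

def nw_weights(x, X, h):
    """Nadaraya-Watson weights B_h[i](x) for a scalar covariate.

    The Epanechnikov kernel is an assumption made for this sketch.
    """
    u = (x - np.asarray(X, dtype=float)) / h
    k = 0.75 * np.maximum(1.0 - u**2, 0.0) / h  # K_h(u) = K(u / h) / h
    total = k.sum()
    if total == 0.0:
        raise ValueError("no observations within bandwidth h of x")
    return k / total

def survival_estimate(t, x, X, T, delta, xi_nu, h):
    """Generalized product-limit estimator (3) at time t and covariate x."""
    X, T = np.asarray(X, dtype=float), np.asarray(T, dtype=float)
    delta, xi_nu = np.asarray(delta, dtype=int), np.asarray(xi_nu, dtype=int)
    # Sort by observed time; at ties, events come first (a convention
    # assumed here).
    order = np.lexsort((-delta, T))
    T, delta, xi_nu = T[order], delta[order], xi_nu[order]
    B = nw_weights(x, X[order], h)
    at_risk = np.cumsum(B[::-1])[::-1]           # sum over j >= i of B_[j]
    cured_w = B * (xi_nu == 1)
    cured_before = np.cumsum(cured_w) - cured_w  # known cured with j < i
    denom = at_risk + cured_before
    # Terms with zero weight contribute a factor of 1.
    jump = np.divide(delta * B, denom, out=np.zeros_like(B), where=denom > 0)
    return float(np.prod(1.0 - jump[T <= t]))

def cure_estimate(x, X, T, delta, xi_nu, h):
    """Cure rate estimator (4): estimator (3) at the largest event time."""
    T, delta = np.asarray(T, dtype=float), np.asarray(delta, dtype=int)
    return survival_estimate(T[delta == 1].max(), x, X, T, delta, xi_nu, h)

def latency_estimate(t, x, X, T, delta, xi_nu, h1, h2):
    """Latency estimator (5): h1 for the cure rate, h2 for the survival."""
    T, delta = np.asarray(T, dtype=float), np.asarray(delta, dtype=int)
    t_max = T[delta == 1].max()
    cure = cure_estimate(x, X, T, delta, xi_nu, h1)
    surv = survival_estimate(t, x, X, T, delta, xi_nu, h2)
    if 0 <= t <= t_max and surv > cure:
        return (surv - cure) / (1.0 - cure)
    return 0.0

# Hypothetical toy data: identical covariates give equal NW weights.
X = np.zeros(4)
T = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 0])
xi_nu = np.array([0, 1, 0, 0])  # the second individual is known to be cured
print(survival_estimate(3.0, 0.0, X, T, delta, xi_nu, h=1.0))
print(latency_estimate(1.0, 0.0, X, T, delta, xi_nu, h1=1.0, h2=1.0))
```

With equal weights and no known-cured individuals (`xi_nu` all zero), the sketch coincides with the Kaplan–Meier estimator; a known-cured individual stays in the denominator of every later factor, which is what raises the estimated survival in the toy data.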

#### **3. Application to COVID-19 Data**

To illustrate the nonparametric estimators presented in Section 2, we present an application to patients hospitalized with COVID-19 in Galicia (Spain) during the first outbreak of the epidemic. We have a medical database of 10,454 COVID-19 patients reported by the Galician Healthcare Service between 6 March and 7 May 2020. This database contains information on sex, age, and the dates of different medical outcomes such as admission to the intensive care unit (ICU), discharge, or death. The aim was to estimate the time from admission to the hospital ward until admission to the ICU while adjusting for age and sex. In our analysis, we included only 2380 patients who had been hospitalized for at least one day. Among them, 8.3% were admitted to the ICU and 91.7% were censored. In the censored group, 68.8% of patients were discharged from the hospital alive and without the need for ICU, and 13.8% died without entering the ICU. Therefore, these patients were identified as "cured" from the event of interest, which is admission to the ICU. Note that in this example, "being cured" means never experiencing admission to the ICU, not being cured in the medical sense.

**Acknowledgments:** This work has been supported by MINECO grant MTM2017-82724-R and the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-014). We also acknowledge the support received from the Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014–2020 Program) through grant ED431G 2019/01. The authors are grateful to Andrés Paz-Ares Rodríguez (General Director of Public Health), Xurxo Hervada Vidal (General Deputy Director of Information on Health and Epidemiology), and Benigno Rosón Calvo (General Deputy Director of the SERGAS Information System) for providing the COVID-19 data.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

