Article

Enhancing Survival Analysis Model Selection through XAI(t) in Healthcare

1 Department of Electrical and Information Engineering (DEI), Politecnico di Bari, Via Orabona 4, 70125 Bari, BA, Italy
2 Bioengineering Unit of Bari Institute, Istituti Clinici Scientifici Maugeri IRCCS, Via Generale Nicola Bellomo 73/75, 70124 Bari, BA, Italy
3 Respiratory Rehabilitation Unit of Bari Institute, Istituti Clinici Scientifici Maugeri IRCCS, Via Generale Nicola Bellomo 73/75, 70124 Bari, BA, Italy
4 Bioengineering Unit of Telese Terme Institute, Istituti Clinici Scientifici Maugeri IRCCS, Via Bagni Vecchi 1, 82037 Telese Terme, BN, Italy
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(14), 6084; https://doi.org/10.3390/app14146084
Submission received: 13 June 2024 / Revised: 8 July 2024 / Accepted: 9 July 2024 / Published: 12 July 2024
(This article belongs to the Special Issue Application of Decision Support Systems in Biomedical Engineering)

Abstract

Artificial intelligence algorithms have become extensively utilized in survival analysis for high-dimensional, multi-source data. However, due to their complexity, these methods often yield poorly interpretable outcomes, posing challenges in the analysis of several conditions. One of these conditions is obstructive sleep apnea, a sleep disorder characterized by the simultaneous occurrence of comorbidities. Survival analysis provides a potential solution for assessing and categorizing the severity of obstructive sleep apnea, aiding personalized treatment strategies. Given the critical role of time in such scenarios and considering limitations in model interpretability, time-dependent explainable artificial intelligence algorithms have been developed in recent years for direct application to basic Machine Learning models, such as Cox regression and survival random forest. Our work aims to enhance model selection in OSA survival analysis using time-dependent XAI for Machine Learning and Deep Learning models. We developed an end-to-end pipeline, training several survival models and selecting the best performers. Our top models—Cox regression, Cox time, and logistic hazard—achieved good performance, with C-index scores of 0.81, 0.78, and 0.77, and Brier scores of 0.10, 0.12, and 0.11 on the test set. We applied SurvSHAP methods to Cox regression and logistic hazard to investigate their behavior. Although the models showed similar performance, our analysis established that the results of the log hazard model were more reliable and useful in clinical practice compared to those of Cox regression in OSA scenarios.

1. Introduction

Obstructive Sleep Apnea (OSA) is a sleep disorder caused by the recurrence of either partial (hypopnea) or complete (apnea) collapse of the pharyngeal airway, which leads to reduced or ceased airflow during sleep [1,2,3]. Such a disease compromises sleep health by discontinuous episodes of hypoxia [4,5], and its symptoms may be treated through respiratory rehabilitation [3]. Furthermore, effective follow-up during rehabilitation is crucial for monitoring the patient’s progress, adjusting treatment as necessary, and ensuring long-term management of OSA symptoms. The prognosis of OSA is complicated by the simultaneous occurrence of other comorbidities, mostly metabolic and cardiovascular [6], as well as renal [1].
Survival Analysis (SA) of patients with OSA could help clinicians better assess and categorize disease severity, thus more easily identifying higher-risk individuals, especially during the rehabilitation and follow-up phases. Therefore, treatment can be tailored to the subject’s condition, considering both the comorbidities and symptoms typical of OSA, thus increasing survival probability.
Survival analysis has been traditionally performed using statistical methods, such as Kaplan–Meier curves (KMCs) and Cox regression, but such methods are less effective in the case of datasets with high dimensionality [7].
On the other hand, artificial intelligence (AI) has been widely applied to overcome this limitation of traditional SA methods by exploiting either Machine Learning (ML) [6,8,9,10,11,12] or Deep Learning (DL) models [2,4,12,13,14,15,16]. Nevertheless, AI algorithms addressing survival analysis are known for the poor interpretability of their results, since determining which features impact model predictions is not a simple task. Moreover, the role of comorbidities and risk factors is often evaluated only according to hazard and odds ratios [17]. This limits their applicability in decision support systems for clinical purposes [18].
EXplainable Artificial Intelligence (XAI) can be integrated into SA pipelines to provide clinicians with more interpretable model predictions [13,19,20,21,22,23,24]. Shapley additive explanations (SHAP) [25] and local interpretable model-agnostic explanations (LIME) [26] are then utilized to clarify the relationship between the input and output of the model. Both SHAP and LIME have been extended to address survival tasks by including time and event variables in the computation. These versions are called SurvSHAP [18] and SurvLIME [27]. Specifically, SurvSHAP calculates the contribution of each feature to survival predictions at different time points, thus dynamically providing an overview of feature importance. The latter factor is crucial in healthcare to understand how a survival model made its predictions since the significance of clinical variables can evolve over time.
A gap has been found in the existing literature on XAI methods for SA tasks. To the best of our knowledge, no prior works have applied time-dependent XAI (XAI(t)) methods to survival DL models while offering a model comparison in terms of dataset- and model-level explanations. Most existing XAI-oriented research focuses on standard XAI methodology applied to classification tasks, possibly considering survival analysis as a separate task. Such a gap can be attributed to the novelty of XAI(t) approaches, as well as the challenges in producing a simple and reliable explanation when time is involved. Focusing on the OSA clinical scenario, this work proposes a survival analysis pipeline that focuses on the following:
  • Training and validating different Machine Learning and Deep Learning survival models, selecting the best-performing ones according to the metrics used in survival tasks;
  • Investigating the role of comorbidities in OSA from an XAI(t) perspective;
  • Performing a model comparison, selecting the most reliable models according to the explanations retrieved by XAI(t) algorithms.
The rest of this work is structured as follows. In Section 2, we describe the clinical dataset, the survival models involved, and the XAI(t) algorithms used. Section 3 presents the obtained results. In Section 4, we compare the best ML and DL models based on SurvSHAP explanations. Finally, Section 5 reports the conclusions.

2. Materials and Methods

First, we perform pre-processing operations to clean and prepare the data. Then, we compute statistical tests and correlations for feature selection. Next, we split the data into training and test sets and train the SA models to choose the best-performing ones based on the evaluation metrics on the test set. The final step involves interpreting and comparing the selected models using SurvSHAP. Figure 1 depicts an overview of the pipeline designed for the analysis.
  • Dataset
For this study, we use an internal dataset collected by the Istituti Clinici Scientifici (ICS) Maugeri, Hospital Sleep Laboratory, Bari. All patients’ data were gathered while they underwent rehabilitation. All patients involved had a confirmed diagnosis of OSA, established by in-laboratory overnight polysomnography (PSG).
The raw data are available in [28].
The original dataset is composed of 1592 samples with 45 features, including clinical data and data retrieved from polysomnography exams. In addition to information on age, follow-up time, status, marital status, profession, and sex, our dataset contains the Body Mass Index (BMI), the Glomerular Filtration Rate (GFR), the Ejection Fraction (EF), the Oxygen Desaturation Index (ODI), the minimum recorded blood oxygen saturation (SaO2 min), the apnea-hypopnea index (AHI), and other information on comorbidities such as heart disease and diabetes.
  • Data Pre-Processing
The first operation involved dropping non-relevant features: patient I.D., the number of health records (N CC), admission date, discharge date, profession description, and follow-up date. Then, we converted the survival time from days to months without performing any aggregation operations to facilitate both the analysis and interpretation of data. Next, we checked for null values in EF, ODI, and anemia features. We dropped the EF and ODI features due to 90% null values and lack of information, while we imputed the anemia values based on hemoglobin levels, considering the patient’s sex.
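As a concrete illustration of the anemia imputation step, a minimal pandas sketch is given below; the 13/12 g/dL hemoglobin cutoffs for men and women are the commonly used WHO-style thresholds and, like the column names, are assumptions of the sketch rather than the exact rule applied in the study.
```python
import numpy as np
import pandas as pd

def impute_anemia(df: pd.DataFrame) -> pd.Series:
    """Fill missing anemia flags from hemoglobin, stratified by sex (sketch)."""
    # Assumed thresholds: anemia if Hb < 13 g/dL (men) or < 12 g/dL (women)
    threshold = np.where(df["Sex"] == "Male", 13.0, 12.0)
    derived = (df["Hemoglobin"] < threshold).astype(int)
    # Keep observed anemia values; use the derived flag only where they are missing
    return df["Anemia"].fillna(derived)
```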
Afterward, we discretized the variables wherever feasible: age was categorized into two classes (0 if the patient was under 65 years old, 1 otherwise), while GFR and BMI were binned according to reference values from the medical literature [29,30]. The age cutoff was chosen according to the relevant medical literature [31,32,33]. The adoption of unsupervised cutoffs that do not rely on the data distribution (but are still clinically relevant) allowed for better generalization of the results. Finally, we included two other features: Continuous Positive Airway Pressure (CPAP), a binary categorical feature indicating whether a subject followed CPAP treatment, and the corresponding treatment duration in years (Years_of_CPAP).
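The discretization itself reduces to a few binning operations; the sketch below assumes illustrative column names and reference-style bin edges for BMI and GFR, which may differ from the exact values adopted in the study.
```python
import pandas as pd

def discretize(df: pd.DataFrame) -> pd.DataFrame:
    """Bin age, BMI, and GFR into categorical features (illustrative cutoffs)."""
    out = df.copy()
    # Age: 0 if under 65 years, 1 otherwise (literature-based cutoff)
    out["Age_cat"] = (out["Age"] >= 65).astype(int)
    # BMI: underweight / normal / overweight / obese (assumed bin edges)
    out["BMI_cat"] = pd.cut(out["BMI"], bins=[0, 18.5, 25, 30, float("inf")],
                            labels=[0, 1, 2, 3])
    # GFR: reduced / mildly reduced / normal (assumed bin edges)
    out["GFR_cat"] = pd.cut(out["GFR"], bins=[0, 60, 90, float("inf")],
                            labels=[0, 1, 2])
    return out
```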
Outlier removal was not performed, as in the medical domain, removing outliers from clinical data is not always feasible due to the possibility that they may indicate pathological conditions. Finally, since our study focuses on a respiratory disease, samples with last follow-up dates after 2020 (i.e., 198 samples) were excluded due to the COVID-19 pandemic to avoid the presence of bias in our results. As a result of pre-processing, we retained 1394 samples.
  • Statistical Analysis and Feature Selection
Before training the SA models, we performed feature selection, considering only the training set. For each feature, we calculated Pearson’s correlation (PC) coefficients to visualize and remove the most correlated ones by filtering with |PC| > 0.6. The correlation matrices before and after filtering are depicted in Appendix A, Figure A1 and Figure A2. In cases of numerical-categorical high-correlation feature pairs, we retained the numerical ones.
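A minimal sketch of this filtering step is shown below, computed on the training set only; the 0.6 threshold follows the paper, while the rule used here for choosing which member of a correlated pair to drop is a simplification (the study kept the numerical feature of numerical-categorical pairs).
```python
import pandas as pd

def drop_correlated(train_df: pd.DataFrame, threshold: float = 0.6):
    """Return the feature names kept after filtering pairs with |PC| > threshold."""
    corr = train_df.corr(numeric_only=True).abs()   # Pearson correlation by default
    kept, dropped = [], set()
    for i, col_i in enumerate(corr.columns):
        if col_i in dropped:
            continue
        kept.append(col_i)
        # Drop any later feature that is too correlated with a kept one
        for col_j in corr.columns[i + 1:]:
            if corr.loc[col_i, col_j] > threshold:
                dropped.add(col_j)
    return kept
```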
After all pre-processing operations, we retained 1394 records with 23 features. A summary of the data is reported in Table 1. For the survival analysis task, the status feature represents the event, while follow-up days (converted into months) represents the time.
  • Survival models
To perform the experiments, we included the following models:
  • Machine Learning—Cox proportional hazards (CPH) [34], survival random forest (SRF) [35], survival gradient boosting [36], survival support vector machine [37].
  • Deep Learning—Cox time (CT) [38], DeepHit (DH) [39], DeepSurv (DS) [40], and NNet-Survival (logistic hazard (LH) and piecewise constant hazard (PCH)) [41,42].
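As an indication of how the two model families can be fitted in practice, the sketch below trains a Cox proportional hazards model with scikit-survival and a logistic hazard network with pycox; the hyperparameters, network size, and variable names are placeholders rather than the study's configuration, which used R and Python pipelines.
```python
import numpy as np
import torchtuples as tt
from pycox.models import LogisticHazard
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

# Assumed inputs: X_train, X_test are numeric numpy feature matrices;
# times_* are follow-up times (months) and events_* are 0/1 event indicators.

# --- Machine Learning: Cox proportional hazards (scikit-survival) ---
y_train = Surv.from_arrays(event=events_train.astype(bool), time=times_train)
cph = CoxPHSurvivalAnalysis().fit(X_train, y_train)
cph_surv = cph.predict_survival_function(X_test)        # step functions S(t | x)

# --- Deep Learning: logistic hazard / NNet-Survival (pycox) ---
labtrans = LogisticHazard.label_transform(20)           # 20 discrete time intervals
y_train_discrete = labtrans.fit_transform(times_train, events_train)
net = tt.practical.MLPVanilla(X_train.shape[1], [32, 32], labtrans.out_features,
                              batch_norm=True, dropout=0.1)
lh = LogisticHazard(net, tt.optim.Adam(0.01), duration_index=labtrans.cuts)
lh.fit(X_train.astype(np.float32), y_train_discrete,
       batch_size=128, epochs=100, verbose=False)
lh_surv = lh.predict_surv_df(X_test.astype(np.float32))  # survival curves per subject
```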
  • Time-Dependent XAI
Unlike other ad hoc explanation methods, such as SurvLIME [27] and survNAM [43], SurvSHAP [18] also takes into account time in the explanation process. This method generalizes Shapley additive explanations (SHAP) [25] to survival models, giving a global explanation of the overall behavior of the model over time. SurvSHAP can reveal the importance of comorbidities affecting the prognosis over the follow-up period, offering valuable insights into the progression of OSA and strategies for improving patient prognosis. Similarly, SurvLIME explains individual predictions, so it is helpful for analyzing single-case predictions and planning personalized treatment over the follow-up period.
Specifically, SurvSHAP describes variable contributions across the entire considered time period in order to detect variables with a time-dependent effect, and its aggregation approach determines which variables are more important for a prediction than others. Let $D = \{(X_i, y_i, t_i)\}$ denote the dataset used for training, which contains $m$ unique time instants $t_m > t_{m-1} > \dots > t_1$; $y_i$ is the status of the patient, i.e., whether the patient is dead or alive, and $t_i$ is the time point of the follow-up. Therefore, every patient is represented by $n$ covariates, a status $y$, and an instant of time $t$. For each individual described by a variable vector $X$, the model returns the individual’s survival distribution $\hat{S}(t, X)$. For the observation of interest $X^*$ at any selected time point $t$, the algorithm assigns an importance value $\phi_t(X^*, c)$ to the value of each variable $X^{(c)}$ included in the model, where $c \in \{1, 2, \dots, n\}$ and $n$ is the number of variables. To calculate the SurvSHAP values, it is necessary to define the expected value of the survival function conditioned on the values of a subset $C$ of the features:

$$ e^{C}_{X^*, t} = E\left[ \hat{S}(t, X) \mid X^{(C)} = X^{*(C)} \right] \quad (1) $$

By defining $P(c, \pi)$ as the precedence subset of $c$ in a permutation $\pi$, i.e., $P(c, \pi) = \{x \in X : x \text{ precedes } c \text{ in } \pi\}$, the contribution of variable $c$ to the model is calculated as:

$$ \phi_t(X^*, c) = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \left( e^{P(c, \pi) \cup \{c\}}_{X^*, t} - e^{P(c, \pi)}_{X^*, t} \right) \quad (2) $$

where $\Pi$ is the set of all permutations of the $n$ variables and the superscripts indicate which variables contribute to the estimation of the survival function. The first term in (2) refers to the contribution of the model with the subset up to and including variable $c$, while the second term represents the contribution of the subset before variable $c$. In this way, we obtain the contribution of each single variable to the prediction. For an easier comparison among different models and time points, this value can be normalized to obtain values on a common scale from −1 to 1, so the contribution becomes:

$$ \phi^{*}_{t}(X^*, c) = \frac{\phi_t(X^*, c)}{\sum_{j=1}^{n} |\phi_t(X^*, j)|} \quad (3) $$

To calculate the global variable importance, we aggregate the time-dependent contributions, thus obtaining the following aggregated SurvSHAP value:

$$ \Psi(X^*, c) = \int_{0}^{t_m} |\phi^{*}_{t}(X^*, c)| \, dt \quad (4) $$
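To make the computation in Equation (2) concrete, the sketch below estimates $\phi_t(X^*, c)$ for a single observation by Monte Carlo sampling over permutations, approximating the conditional expectations with a background dataset in which the "known" variables are fixed to the values of $X^*$; this approximation strategy and the function names are assumptions of the sketch, not the exact estimator of the SurvSHAP implementation.
```python
import numpy as np

def survshap_t(predict_surv, x_star, background, c, n_perm=50, seed=0):
    """Monte Carlo estimate of phi_t(x*, c) over the model's time grid (sketch).

    predict_surv: callable mapping an (n, n_vars) array to (n, n_times) survival curves.
    x_star: 1-D array with the observation of interest.
    background: 2-D array of reference observations used to approximate expectations.
    c: index of the variable whose contribution is computed.
    """
    rng = np.random.default_rng(seed)
    n_vars = x_star.shape[0]

    def expected_curve(known_idx):
        # E[S(t, X) | X_known = x*_known], averaged over the background data
        x = background.copy()
        x[:, known_idx] = x_star[known_idx]
        return predict_surv(x).mean(axis=0)

    contrib = 0.0
    for _ in range(n_perm):
        perm = rng.permutation(n_vars)
        pos = int(np.where(perm == c)[0][0])
        preceding = perm[:pos]                  # P(c, pi)
        with_c = np.append(preceding, c)        # P(c, pi) united with {c}
        contrib = contrib + expected_curve(with_c) - expected_curve(preceding)
    return contrib / n_perm                     # phi_t(x*, c) on the time grid
```
Normalizing these curves across variables and integrating their absolute values over time then yields Equations (3) and (4).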
  • Experimental Pipeline
We performed experiments using R (v. 4.3.2) and Python (v. 3.8) languages. We randomly split the dataset using the holdout technique, with 70% used as the training set and the remaining 30% as the test set, stratifying by the Status feature. We also verified that the observation period in the test set did not exceed the observation period in the training set.
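A minimal sketch of this split, assuming the cleaned data live in a pandas DataFrame with event and follow-up columns named as below (the random seed is an assumption):
```python
from sklearn.model_selection import train_test_split

# 70/30 holdout stratified on the event indicator
train_df, test_df = train_test_split(
    data, test_size=0.30, stratify=data["Status"], random_state=42
)

# Check mirroring the paper: the test observation period must not exceed training's
assert test_df["Follow_up_months"].max() <= train_df["Follow_up_months"].max()
```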
The experimental pipeline workflow is illustrated in Figure 2.
Appendix A, Algorithm A1, reports the pseudo-code related to the experimental pipeline workflow depicted in Figure 2.

3. Results

We evaluated the models’ performance on the test set according to the following metrics:
  • Harrell’s C-Index [44]—Also known as the concordance index, it measures the proportion of all pairs of observations for which the survival order predicted by the model matches the order in the data. A higher C-index (ideally equal to 1) indicates better concordance between the model prediction and the relative ground truth, whereas a C-index equal to 0.5 indicates a random prediction.
  • Integrated Cumulative-Dynamic Area Under the Curve (C/D AUC) [45]—This summarizes the area under the time-dependent (cumulative/dynamic) Receiver Operating Characteristic (ROC) curve computed at different time points during the observation period.
  • Brier Score [46]—This measures the mean squared difference between the predicted survival probability and the observed event status; a lower Brier score indicates better model accuracy, while a higher Brier score indicates performance degradation. Ideal values are close to 0, whereas a score of 0.25 corresponds to an uninformative prediction of 0.5 for every subject.
We placed greater emphasis on the C-index and the Brier score, as they are the most commonly used metrics in survival tasks and are generally more easily understandable than the C/D AUC.
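These metrics can be computed with scikit-survival as in the sketch below, where risk_scores and surv_funcs are assumed model outputs (a per-subject risk score and per-subject survival functions evaluated on a time grid inside the test follow-up range).
```python
import numpy as np
from sksurv.metrics import (concordance_index_censored, cumulative_dynamic_auc,
                            integrated_brier_score)
from sksurv.util import Surv

y_train = Surv.from_arrays(event=events_train.astype(bool), time=times_train)
y_test = Surv.from_arrays(event=events_test.astype(bool), time=times_test)

# Harrell's C-index from per-subject risk scores (higher score = higher risk)
c_index = concordance_index_censored(events_test.astype(bool), times_test, risk_scores)[0]

# Evaluation time grid strictly inside the observed test follow-up range
times = np.percentile(times_test, np.linspace(10, 90, 20))
surv_probs = np.asarray([[fn(t) for t in times] for fn in surv_funcs])

# Integrated cumulative/dynamic AUC and integrated Brier score over the grid
_, mean_auc = cumulative_dynamic_auc(y_train, y_test, risk_scores, times)
ibs = integrated_brier_score(y_train, y_test, surv_probs, times)
```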
The evaluation metrics on the test set are shown in Table 2, with graphical comparisons shown in Figure 3 and Figure 4 for the ML and DL models, respectively.
  • ML Model Results: As a general result, SSVM was the worst-performing model, while all other models achieved good results. The differences between Cox regression, SGBM, and SRF were minimal for the C-index and integrated C/D AUC, whereas Cox regression exhibited a lower Brier score. Thus, we chose CPH for the explainability step. Moreover, this model returned a hazard ratio for each feature that can be used for data explainability in addition to XAI techniques (Section 4).
  • DL Model Results: As shown in Figure 4, CT, PCH, and LH exhibited similar performance and were thus the best models. We selected LH for the explainability phase (see Section 4 for more details).
The time-variant Brier scores and C/D AUC values for the CPH, CT, and LH models are depicted in Figure 5.
Moreover, to assess the stability of the outcomes, we performed model training using 10-fold cross-validation. The results obtained (on the same test set used in the holdout approach) are depicted in Appendix A, Table A1, confirming the performance trends obtained with the holdout split.

4. Discussion

4.1. Related Works

In the artificial intelligence field, results are difficult to interpret, especially when dealing with deep models that are considered “black boxes”, making it difficult to understand how the model generates predictions. XAI has recently been adapted to this context to improve the explainability, interpretability, and transparency of model results [47,48,49]. The goal of XAI algorithms is to convert unexplained ML and DL predictions into more interpretable “white-box” ones. Notably, in the realm of survival analysis, understanding which characteristics significantly influence predictions—essentially, understanding how such features are “weighted” for prognosis or diagnosis—is fundamental because it allows clinicians to choose the type of treatments for patients (preventive or curative).
Most existing works apply XAI techniques like SHAP and LIME to AI-based frameworks addressing SA. Qi et al. [50] used SHAP to improve the explainability of an ML-based framework concerning the role of mitochondrial regulatory genes in the evolution of renal clear cell carcinoma. Zaccaria et al. [13] adopted approximated SHAP values to build an interpretable transcriptomics-based prognostic system for Diffuse Large B-Cell Lymphoma (DLBCL). Srinidhi and Bhargavi [22] embedded SHAP and LIME in ML- and DL-based frameworks to support the prediction of survival rates for patients with pancreatic cancer. Zuo et al. [51] employed both SHAP and LIME to explain the survival predictions of many ML methods fed with a radiomics feature set extracted from tomographic images. Chadaga et al. [23] used SHAP and LIME to explain ML-based estimations concerning the survival probability of children after bone marrow transplantation. Both SHAP and LIME were included in Alabi’s study [21] to better interpret the predictions of ML algorithms addressing SA on clinical data from people with nasopharyngeal carcinoma. Peng and colleagues [24] used SHAP and LIME to corroborate the outcomes of ML models concerning hepatitis diagnosis and prognosis.
Even fewer works have employed either SurvSHAP or SurvLIME for performing SA within ML-based workflows. Zhu and colleagues [12] employed SurvSHAP to investigate the efficacy of adjuvant chemotherapy starting from predictions of both ML and DL models concerning the survival probability of breast cancer patients. Passera et al. [20] explained the outcomes of survival models fed by demographic and clinical features using both SurvSHAP(t) and SurvLIME. These XAI methods were oriented toward global and local explanations by analyzing data from the whole cohort or a single patient. Baniecki et al. [19] performed SA to estimate hospitalization times by training ML algorithms on a multimodal dataset and explaining the results of time-to-event models for single patients using SurvSHAP.
Remarkably, as far as we know, no studies have utilized XAI(t) to elucidate the predictions generated by a Deep Learning model. In fact, such architectures often play a supporting role in SA workflows; they are used not to predict mortality risks but to determine additional metrics that are then fed into a classical survival pipeline for performing SA [2,4,14]. In addition, these works did not include XAI strategies to better interpret how the model generated survival predictions.
Related works in the literature show a paucity of studies on survival analysis with either ML algorithms or DL models embedding XAI techniques, e.g., SurvSHAP and SurvLIME, to increase the interpretability of the predictions of mortality risk. Table 3 summarizes related works involving classical XAI and XAI(t) in SA.

4.2. Explanation Methods

The model explanation methods can be classified into two categories: dataset-level explanations and model-level explanations. The former category aims at analyzing the dataset characteristics to understand their impact on event predictions; the latter focuses on ranking and highlighting the features that the model considers important for predictions.
Intuitively, both methods present some limitations:
  • Dataset-level explanation limitations: SHAP is designed to return a local explanation, i.e., it gives an explanation for a single sample. Consequently, SurvSHAP behaves in the same way. When used on a dataset, its resulting explanation depends on the sample distribution. In fact, if the data are unbalanced for specific features, their contribution will be minimal, but this conclusion cannot be generalized to other data. Hence, the ideal scenario could be to use a large dataset for the sake of higher generalization of the results. However, the computation of feature contributions for a single sample is computationally expensive because of model complexity and the operations involved (e.g., multiple feature permutations, predictions, and performance computations for SHAP values; local sample generation, local model training, and prediction for LIME values). In XAI(t), this is exacerbated since the feature contributions are also computed for different time instants.
  • Model-level explanation limitations: Although model-level explanations are computationally less expensive than dataset-level ones, they return explanation information at the population level that cannot be used for explanations at a single prediction level. Moreover, the explanation methods based on permutation can lead to misinterpretations when the independent variables are strongly correlated [52].
To overcome the limitations of both explanation methods, they can be exploited in a complementary way: while the model-level explanation method can be used to identify the most important features affecting predictions, the dataset-level explanation method can be used to investigate how they affect predictions.
Obviously, both methods strongly depend on model performance: if a model is not reliable, then the model-level explanation method will not be able to identify some features that may be important for event predictions, while the dataset-level explanation method can lead to misinterpretations of feature behavior.
Here, we performed both dataset- and model-level time-dependent explanations on the test set (419 samples) by comparing ML and DL survival models and using SurvSHAP. Notably, all results in the following are presented in relation to the survival function.
We note that generating explanations for all test samples is computationally intensive for Deep Learning models. Also, there are minimal performance differences between the CT and LH models. For these reasons, we opted for the LH model for the explanation phase due to its computational optimizations, which enable the generation of comprehensive explanations within a reasonable time. Here, the issue related to the permutation method for model-level explanations was mitigated by filtering the features based on the correlation coefficients, as described in Section 2.
  • Dataset-Level Explanations
    We computed the SurvSHAP values on the test data, retrieving and ranking the most important features.
As depicted on the left side of Figure 6, for the CPH model, age emerges as the most important feature, followed by years of CPAP, renal dysfunction, COPD, BMI categories, sex, and anemia.
Similarly, in Figure 7, age remains a predominant feature, yet the LH model assigns almost the same importance (with minimal differences) to the other ones. In more detail, AHI comes in second, followed by renal dysfunction, SaO2 min, years of CPAP, COPD, and anemia. In the LH model, these features exhibit greater importance in terms of their average contributions to the predictions compared to the CPH model. The reason behind the differences in feature importance can be attributed to the non-linear relationships between features and targets discovered by the LH model, which preferred numeric features over categorical ones in some cases. The temporal trends related to the model outcomes are depicted on the right side of Figure 6 and Figure 7. Interestingly, in the CPH model, after ∼9 years (∼110 months), years of CPAP, sex, and BMI categories gained importance with respect to renal dysfunction, anemia, and COPD, respectively.
A similar behavior was seen in the LH model, where a discretization effect on predictions was observed. Although SaO2 min ranked fourth in terms of feature contribution and initially did not appear to be very important, it gained increasing importance as the months progressed, becoming the third most important feature in the last part of the observation period.
Finally, although in the CPH model all features had similar contributions in the final observation period, in the LH model we can observe one set of features consisting of AHI and SaO2 min and another composed of renal dysfunction, COPD, and years of CPAP.
While in classical XAI approaches the variable importance is presented as a statistical measure reflecting the overall importance, the exploitation of SurvSHAP makes the variable importance change over time, thus leading to a different evaluation of the impact of predictions at time t.
This dynamic understanding enables clinicians to assign different weights to features based on their evolving importance throughout all stages of patient care, contributing to a more personalized and effective patient care approach.
Furthermore, we investigated the individual contribution of each feature for each subject by plotting a bee swarm plot, as depicted in Figure 8 and Figure 9.
Here, each data sample corresponds to several points in the plot according to its feature value. Aggregated SurvSHAP(t) values close to 0 typically indicate a minimal impact on the survival function. The above figures confirm that the models interpreted the feature values in a coherent way. High values of features considered risk factors or comorbidities in medicine contributed negatively to survival, while high values related to protective factors contributed positively. This alignment with medical understanding underscores the models’ coherent interpretation of feature values and their impact on survival outcomes.
  • Model-Level Explanations
The second main evaluation step consisted of investigating the feature importance from the models’ perspectives. This was accomplished by computing the difference between the loss function of the trained model and the loss function of the model with permutations. Specifically, the loss function (i.e., Brier score) was computed multiple times by permuting the values of each single feature one at a time while keeping the others unchanged. Features whose permutation produced larger increases in the loss were considered more influential from the models’ perspectives. The results are depicted in Figure 10 for the CPH model and Figure 11 for the LH model.
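A simplified version of this procedure is sketched below using the integrated Brier score as the loss (the figures report its time-resolved counterpart); predict_surv_matrix is an assumed helper returning survival probabilities over the chosen time grid.
```python
import numpy as np
from sksurv.metrics import integrated_brier_score

def permutation_importance(predict_surv_matrix, X_test, y_train, y_test, times,
                           n_repeats=5, seed=0):
    """Loss increase after permuting each feature, one at a time (sketch)."""
    rng = np.random.default_rng(seed)
    baseline = integrated_brier_score(y_train, y_test,
                                      predict_surv_matrix(X_test), times)
    importance = {}
    for col in X_test.columns:
        losses = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())
            losses.append(integrated_brier_score(y_train, y_test,
                                                 predict_surv_matrix(X_perm), times))
        importance[col] = float(np.mean(losses)) - baseline   # > 0: model relies on col
    return importance
```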
The relationships between OSA and the most important features retrieved by the SA models align with the medical literature [53,54,55,56,57]. Such comorbidities were revealed to be predictors of a lower survival probability for people with OSA in a previous work [1]. The most important features identified for the CPH model aligned with the features highlighted by computing the SurvSHAP values on the test set. Remarkably, additional features such as malignancy and dilated heart disease were not reported in the dataset-level explanations. This suggests that while certain features may not stand out prominently in individual data instances (i.e., they are underrepresented in the data), they still hold significant weight when considered from a CPH model-level perspective. This conclusion can also be confirmed by looking at the related hazard ratios in Table 4.
However, although the presence of malignancy and dilated cardiomyopathy has a strong influence on individual survival, such features are uncommon in the population (including our cohort; see Figure A3), so they are not always available and are of limited use in clinical practice.
In addition, unlike in the dataset-level explanations, where the contribution of the CPAP treatment period increased over time, in the model-level explanation it lost importance in the final part of the observation period, falling below the contributions related to renal dysfunction and COPD.
Concerning the LH model (Figure 11), the features identified were the same as those retrieved by applying SurvSHAP to the test set. Unlike in the CPH model, the contribution of age was almost equal to that of the other features. Surprisingly, AHI had the greatest contribution from ∼150 to ∼170 months (∼12.5 to 14 years). In addition, starting from ∼125 months, SaO2 min gained increasing importance until it became the most important feature. Comparing Figure 10 and Figure 11, the LH model computed the feature contributions in a more balanced way than the CPH model.
Notably, while exhibiting similar performance, the selected models focused on different feature sets. Specifically, the CPH model identified several features related to mortality in a general way, such as dilated heart disease and the presence of malignancies, while assigning greater importance to age. On the other hand, the LH model focused on features related to OSA pathology, such as the minimum oxygenation level (SaO2 min) and the apnea-hypopnea index (AHI), which gained importance over the observation period. In light of this, the use of SurvSHAP may lead clinicians to assess that, given comparable performance, the results of the LH model are more reliable here. Ultimately, the results of the LH model are also more useful, since AHI and SaO2 min are directly derived from polysomnography and are more relevant than the presence of malignancies or dilated cardiomyopathy, which are rarer conditions in the population (making such data not always applicable).
A summary of the differences between the CPH and LH models from an XAI(t) perspective is depicted in Table 5.

5. Conclusions

In this study, we provided a full pipeline for training and comparing survival ML and DL models, and we leveraged SurvSHAP to enrich our understanding of model predictions over time.
In the context of survival analysis, this work aimed to address two crucial issues: (i) identifying the most reliable model for survival prediction among models performing equally well, and (ii) determining and analyzing the variable importance according to the observation time t.
We focused on the domain of obstructive sleep apnea, with a particular emphasis on the role of comorbidities. However, any other pathology may be considered as a use case for performing explainable survival analysis with the SurvSHAP method. In addition, we made comparisons among Machine Learning and Deep Learning survival models.
The performance and reliability of the AI survival model in the OSA context are affected by several factors, such as data quality and quantity, features relevant to OSA pathology, and AI model architecture. Here, we demonstrated how complex survival models are particularly capable of identifying the features that most accurately describe the pathology under examination.
After performing data exploration and preparation operations, we trained four survival ML models and five survival DL models. Then, we selected the best-performing models according to evaluation metrics computed on the test set. Finally, we leveraged SurvSHAP algorithms and classified the explanations into two categories—dataset- and model-level explanations—to investigate the differences among the best models based on this study’s observation time.
The achieved results illustrated the differences among the values of feature importance retrieved by ML and DL models, highlighting how the models identify the most important variables and how they affect the predictions over time. We demonstrated how time-dependent explainability, i.e., XAI(t), helps understand the model’s behavior and the interpretation of data feature contributions to the survival function. Our XAI-based analysis showed that although the CPH model exhibited slightly higher performance, the LH model proved to be more reliable and clinically useful for supporting the follow-up of patients with OSA.
Such a dynamic evaluation and understanding can enhance the clinical decision-making process during the follow-up stages of the rehabilitation process by assisting physicians in assigning different weights to the features related to patients’ conditions based on their evolving importance over time.

Author Contributions

F.B.: Conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, and writing—review and editing. P.M.M.: Conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, and writing—review and editing. V.S.: Writing—original draft and writing—review and editing. S.C.: Visualization and writing—review and editing. G.P.: Conceptualization, resources, project administration, validation, and writing—review and editing. L.P.: Resources and data curation. M.A.: Resources, data curation, investigation, and validation. G.C.: Resources, data curation, investigation, and validation. P.G.: Resources. G.D.: Conceptualization, resources, and project administration. V.B.: Conceptualization, methodology, supervision, writing—review and editing, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded under the research projects:
  • National Recovery and Resilience Plan (NRRP), Mission 4, Component 1, Investment 4.1, Decree No. 118 of the Italian Ministry of University and Research, Concession Decree No. 2333 of the Italian Ministry of University and Research, CUP D93C23000450005, within the Italian National Program Ph.D. Program in Autonomous Systems (DAuSy), co-funded by the European Union—Next-Generation EU.
  • National Recovery and Resilience Plan (NRRP), project “BRIEF—Biorobotics Research and Innovation Engineering Facilities”, Mission 4: “Istruzione e Ricerca”, Component 2: “Dalla ricerca all’impresa”, Investment 3.1: “Fondo per la realizzazione di un sistema integrato di infrastrutture di ricerca e innovazione”, CUP: J13C22000400007, funded by the European Union—Next-Generation EU.
  • Project D3 4 Health “Digital-Driven Diagnostics, Prognostics, and Therapeutics for Sustainable Healthcare” (PNC 0000001), National Plan for Complementary Investments in the NRRP, CUP B53C22006170001, funded by the European Union—Next-Generation EU.

Institutional Review Board Statement

This study was conducted according to the principles of the World Medical Association Declaration of Helsinki and was approved by the Ethics Committee of the Istituto Tumori “G.Paolo II”-IRCCS (Bari, Italy), Prot. 67/CE CE Maugeri.

Informed Consent Statement

Informed consent was waived since this is a retrospective study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no competing interests.

Appendix A

Appendix A.1. Correlation Matrices

Figure A1. Correlation matrix before filtering by |PC| > 0.6.
Figure A2. Correlation matrix after filtering by |PC| > 0.6.

Appendix A.2. Pipeline Workflow Pseudo-Code

Algorithm A1 Pipeline Pseudo-Code
 1: Full_Dataset ← Data Loading and Pre-processing
 2: training_set, test_set ← train_test_split(Full_Dataset, stratified_by = event, ratio = 0.7)
 3: features ← feature_selection(training_set)
 4: training_set ← training_set(features)
 5: test_set ← test_set(features)
 6: performance_list ← list()
 7: Model_List ← (CPH, SRF, SSVM, SGBM, CT, DH, DS, LH, PCH)
 8: Task ← SurvTask(time, event, training_set)
 9: for model in Model_List do
10:     if model == (CPH || SRF) then
11:         model.fit(time, event) on training_set(features)
12:     else
13:         composite_learner ← mlr3_learner(model, parameters)
14:         model ← composite_learner.fit(Task)
15:     end if
16:     ExplainerTr ← create_explainer(model, training_set, target = training(time, event))
17:     ExplainerTest ← create_explainer(model, test_set, target = test(time, event))
18:     test_metrics ← model_performance(ExplainerTest)
19:     performance_list ← append(performance_list, test_metrics)
20: end for
21: best_models ← select_model(performance_list)
22: Model_SurvSHAP ← SurvSHAP(ExplainerTr(best_models), new_observations = test_set)
23: Feature_importance ← model_parts(ExplainerTest(best_models))

Appendix A.3. Performance Metrics (10-Fold Cross-Validation)

Table A1. Performance metrics obtained on the test set in terms of the C-index and integrated Brier score using 10-fold cross-validation.
Family | Model | C-Index | Integrated Brier Score
Machine Learning | CPH | 0.82 | 0.10
Machine Learning | SRF | 0.82 | 0.12
Machine Learning | SSVM | 0.72 | 0.32
Machine Learning | SGBM | 0.80 | 0.11
Deep Learning | Cox time | 0.77 | 0.12
Deep Learning | DeepHit | 0.74 | 0.15
Deep Learning | DeepSurv | 0.57 | 0.17
Deep Learning | LogHazard | 0.76 | 0.12
Deep Learning | PCHazard | 0.77 | 0.14

Appendix A.4. Survival Curves for Malignancy and Idiopathic Dilated Cardiomyopathy

Figure A3. Survival curves with related statistics. On the left, presence of malignancy. On the right, presence of dilated cardiomyopathy.

References

  1. Scrutinio, D.; Guida, P.; Aliani, M.; Castellana, G.; Guido, P.; Carone, M. Age and comorbidities are crucial predictors of mortality in severe obstructive sleep apnoea syndrome. Eur. J. Intern. Med. 2021, 90, 71–76. [Google Scholar] [CrossRef]
  2. Blanchard, M.; Feuilloy, M.; Sabil, A.; Gervès-Pinquié, C.; Gagnadoux, F.; Girault, J.M. A Deep Survival Learning Approach for Cardiovascular Risk Estimation in Patients with Sleep Apnea. IEEE Access 2022, 10, 133468–133478. [Google Scholar] [CrossRef]
  3. Pagano, G.; Aliani, M.; Genco, M.; Coccia, A.; Proscia, V.; Cesarelli, M.; D’Addio, G. Rehabilitation outcome in patients with obstructive sleep apnea syndrome using wearable inertial sensor for gait analysis. In Proceedings of the 2022 IEEE International Symposium on Medical Measurements and Applications, MeMeA 2022—Conference Proceedings, Messina, Italy, 22–24 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
  4. Huttunen, R.; Leppänen, T.; Duce, B.; Oksenberg, A.; Myllymaa, S.; Töyräs, J.; Korkalainen, H. Assessment of obstructive sleep apnea-related sleep fragmentation utilizing Deep Learning-based sleep staging from photoplethysmography. Sleep 2021, 44, zsab142. [Google Scholar] [CrossRef]
  5. D’Addio, G.; De Felice, A.; Balzano, G.; Zotti, R.; Iannotti, P.; Bifulco, P.; Cesarelli, M. Diagnostic decision support of heart rate turbulence in sleep apnea syndrome. Stud. Health Technol. Inform. 2013, 186, 150–154. [Google Scholar] [CrossRef]
  6. Ma, E.Y.; Kim, J.W.; Lee, Y.; Cho, S.W.; Kim, H.; Kim, J.K. Combined unsupervised-supervised Machine Learning for phenotyping complex diseases with its application to obstructive sleep apnea. Sci. Rep. 2021, 11, 4457. [Google Scholar] [CrossRef]
  7. Wang, P.; Li, Y.; Reddy, C.K. Machine Learning for survival analysis: A survey. ACM Comput. Surv. 2019, 51, 1–36. [Google Scholar] [CrossRef]
  8. Zaccaria, G.M.; Vegliante, M.C.; Mezzolla, G.; Stranieri, M.; Volpe, G.; Altini, N.; Gargano, G.; Pappagallo, S.A.; Bucci, A.; Esposito, F.; et al. A Decision-tree Approach to Stratify DLBCL Risk Based on Stromal and Immune Microenvironment Determinants. HemaSphere 2023, 7, e862. [Google Scholar] [CrossRef]
  9. Altini, N.; Brunetti, A.; Mazzoleni, S.; Moncelli, F.; Zagaria, I.; Prencipe, B.; Lorusso, E.; Buonamico, E.; Carpagnano, G.E.; Bavaro, D.F.; et al. Predictive Machine Learning Models and Survival Analysis for COVID-19 Prognosis Based on Hematochemical Parameters. Sensors 2021, 21, 8503. [Google Scholar] [CrossRef]
  10. Silva, C.A.; Morillo, C.A.; Leite-Castro, C.; González-Otero, R.; Bessani, M.; González, R.; Castellanos, J.C.; Otero, L. Machine Learning for atrial fibrillation risk prediction in patients with sleep apnea and coronary artery disease. Front. Cardiovasc. Med. 2022, 9, 1–11. [Google Scholar] [CrossRef]
  11. Wang, M.; Greenberg, M.; Forkert, N.D.; Chekouo, T.; Afriyie, G.; Ismail, Z.; Smith, E.E.; Sajobi, T.T. Dementia risk prediction in individuals with mild cognitive impairment: A comparison of Cox regression and Machine Learning models. BMC Med. Res. Methodol. 2022, 22, 284. [Google Scholar] [CrossRef]
  12. Zhu, E.; Zhang, L.; Wang, J.; Hu, C.; Pan, H.; Shi, W.; Xu, Z.; Ai, P.; Shan, D.; Ai, Z. Deep Learning-guided adjuvant chemotherapy selection for elderly patients with breast cancer. Breast Cancer Res. Treat. 2024, 205, 97–107. [Google Scholar] [CrossRef]
  13. Zaccaria, G.M.; Altini, N.; Mezzolla, G.; Vegliante, M.C.; Stranieri, M.; Pappagallo, S.A.; Ciavarella, S.; Guarini, A.; Bevilacqua, V. SurvIAE: Survival prediction with Interpretable Autoencoders from Diffuse Large B-Cells Lymphoma gene expression data. Comput. Methods Programs Biomed. 2024, 244, 107966. [Google Scholar] [CrossRef]
  14. Korkalainen, H.; Leppanen, T.; Duce, B.; Kainulainen, S.; Aakko, J.; Leino, A.; Kalevo, L.; Afara, I.O.; Myllymaa, S.; Toyras, J. Detailed Assessment of Sleep Architecture with Deep Learning and Shorter Epoch-to-Epoch Duration Reveals Sleep Fragmentation of Patients with Obstructive Sleep Apnea. IEEE J. Biomed. Health Inform. 2021, 25, 2567–2574. [Google Scholar] [CrossRef]
  15. Berloco, F.; Bevilacqua, V.; Colucci, S. Distributed Analytics For Big Data: A Survey. Neurocomputing 2024, 574, 127258. [Google Scholar] [CrossRef]
  16. Bevilacqua, V.; Altini, N.; Prencipe, B.; Brunetti, A.; Villani, L.; Sacco, A.; Morelli, C.; Ciaccia, M.; Scardapane, A. Lung Segmentation and Characterization in COVID-19 Patients for Assessing Pulmonary Thromboembolism: An Approach Based on Deep Learning and Radiomics. Electronics 2021, 10, 2475. [Google Scholar] [CrossRef]
  17. Stare, J.; Maucort-Boulch, D. Odds ratio, hazard ratio and relative risk. Metod. Zv. 2016, 13, 59. [Google Scholar] [CrossRef]
  18. Krzyziński, M.; Spytek, M.; Baniecki, H.; Biecek, P. SurvSHAP(t): Time-dependent explanations of Machine Learning survival models. Knowl.-Based Syst. 2023, 262, 110234. [Google Scholar] [CrossRef]
  19. Baniecki, H.; Sobieski, B.; Bombiński, P.; Szatkowski, P.; Biecek, P. Hospital Length of Stay Prediction Based on Multi-modal Data Towards Trustworthy Human-AI Collaboration in Radiomics. In Artificial Intelligence in Medicine; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Portorož, Slovenia, 2023; Volume 13897, pp. 65–74. [Google Scholar] [CrossRef]
  20. Passera, R.; Zompi, S.; Gill, J.; Busca, A. Explainable Machine Learning (XAI) for Survival in Bone Marrow Transplantation Trials: A Technical Report. BioMedInformatics 2023, 3, 752–768. [Google Scholar] [CrossRef]
  21. Alabi, R.O.; Elmusrati, M.; Leivo, I.; Almangush, A.; Mäkitie, A.A. Machine Learning explainability in nasopharyngeal cancer survival using LIME and SHAP. Sci. Rep. 2023, 13, 8984. [Google Scholar] [CrossRef]
  22. Srinidhi, B.; Bhargavi, M.S. An XAI Approach to Predictive Analytics of Pancreatic Cancer. In Proceedings of the 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 9–10 August 2023; pp. 343–348. [Google Scholar] [CrossRef]
  23. Chadaga, K.; Prabhu, S.; Sampathila, N.; Chadaga, R. Healthcare Analytics A Machine Learning and explainable artificial intelligence approach for predicting the efficacy of hematopoietic stem cell transplant in pediatric patients. Healthc. Anal. 2023, 3, 100170. [Google Scholar] [CrossRef]
  24. Peng, J.; Zou, K.; Zhou, M.; Teng, Y.; Zhu, X.; Zhang, F.; Xu, J. An Explainable Artificial Intelligence Framework for the Deterioration Risk Prediction of Hepatitis Patients. J. Med. Syst. 2021, 45, 61. [Google Scholar] [CrossRef] [PubMed]
  25. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  26. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), New York, NY, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  27. Kovalev, M.S.; Utkin, L.V.; Kasimov, E.M. SurvLIME: A method for explaining Machine Learning survival models. Knowl.-Based Syst. 2020, 203, 106164. [Google Scholar] [CrossRef]
  28. Scrutinio, D. Dataset Related to Study: “Age and Comorbidity Are Crucial Predictors of Mortality in Severe Obstructive Sleep Apnoea Syndrome”. Available online: https://zenodo.org/records/4290149 (accessed on 27 April 2024).
  29. Delanaye, P.; Schaeffner, E.; Ebert, N.; Cavalier, E.; Mariat, C.; Krzesinski, J.M.; Moranne, O. Normal reference values for glomerular filtration rate: What do we really know? Nephrol. Dial. Transplant. 2012, 27, 2664–2672. [Google Scholar] [CrossRef]
  30. A Healthy Lifestyle—WHO Recommendations. Available online: https://www.who.int/europe/news-room/fact-sheets/item/a-healthy-lifestyle---who-recommendations (accessed on 20 April 2024).
  31. Tondo, P.; Scioscia, G.; Sabato, R.; Leccisotti, R.; Hoxhallari, A.; Sorangelo, S.; Mansueto, G.; Campanino, T.; Carone, M.; Barbaro, M.P.; et al. Mortality in obstructive sleep apnea syndrome (OSAS) and overlap syndrome (OS): The role of nocturnal hypoxemia and CPAP compliance. Sleep Med. 2023, 112, 96–103. [Google Scholar] [CrossRef]
  32. Peppard, P.E.; Young, T.; Barnet, J.H.; Palta, M.; Hagen, E.W.; Hla, K.M. Increased Prevalence of Sleep-Disordered Breathing in Adults. Am. J. Epidemiol. 2013, 177, 1006–1014. [Google Scholar] [CrossRef] [PubMed]
  33. Heilbrunn, E.S.; Ssentongo, P.; Chinchilli, V.M.; Oh, J.; Ssentongo, A.E. Sudden death in individuals with obstructive sleep apnoea: A systematic review and meta-analysis. BMJ Open Respir. Res. 2021, 8. [Google Scholar] [CrossRef]
  34. Cox, D.R. Regression Models and Life-Tables. J. R. Stat. Soc. Ser. B (Methodol.) 1972, 34, 187–220. [Google Scholar] [CrossRef]
  35. Ishwaran, H.; Kogalur, U.; Blackstone, E.; Lauer, M. Random Survival Forests. Ann. Appl. Stat. 2008, 2, 841–860. [Google Scholar] [CrossRef]
  36. Ridgeway, G. The State of Boosting. Comput. Sci. Stat. 2001, 31, 172–181. [Google Scholar]
  37. Fouodo, C.; König, I.; Weihs, C.; Ziegler, A.; Wright, M. Support Vector Machines for Survival Analysis with R. R J. 2018, 10, 412–423. [Google Scholar] [CrossRef]
  38. Kvamme, H.; Borgan, Ø.; Scheel, I. Time-to-Event Prediction with Neural Networks and Cox Regression. arXiv 2019, arXiv:1907.00825. [Google Scholar]
  39. Lee, C.; Zame, W.; Yoon, J.; Schaar, M. DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  40. Katzman, J.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; Kluger, Y. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 2018, 18, 24. [Google Scholar] [CrossRef]
  41. Gensheimer, M.F.; Narasimhan, B. A scalable discrete-time survival model for neural networks. PeerJ 2018, 7, e6257. [Google Scholar] [CrossRef]
  42. Eleuteri, A.; Aung, M.; Taktak, A.; Damato, B.; Lisboa, P. Continuous and Discrete Time Survival Analysis: Neural Network Approaches. In Proceedings of the 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, 22–26 August 2007; pp. 5420–5423. [Google Scholar] [CrossRef]
  43. Utkin, L.; Satyukov, E.; Konstantinov, A. SurvNAM: The Machine Learning survival model explanation. Neural Netw. 2021, 147, 81–102. [Google Scholar] [CrossRef]
  44. Harrell, F.E.; Lee, K.L.; Mark, D.B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 1996, 15, 361–387. [Google Scholar] [CrossRef]
  45. Lambert, J.; Chevret, S. Summary measure of discrimination in survival models based on cumulative/dynamic time-dependent ROC curves. Stat. Methods Med. Res. 2016, 25, 2088–2102. [Google Scholar] [CrossRef]
  46. Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
  47. Hu, J.; Zhu, K.; Cheng, S.; Kovalchuk, N.M.; Soulsby, A.; Simmons, M.J.; Matar, O.K.; Arcucci, R. Explainable AI models for predicting drop coalescence in microfluidics device. Chem. Eng. J. 2024, 481, 148465. [Google Scholar] [CrossRef]
  48. Chen, D.; Cheng, S.; Hu, J.; Kasoar, M.; Arcucci, R. Explainable Global Wildfire Prediction Models using Graph Neural Networks. arXiv 2024, arXiv:2402.07152. [Google Scholar] [CrossRef]
  49. Altini, N.; Puro, E.; Taccogna, M.G.; Marino, F.; De Summa, S.; Saponaro, C.; Mattioli, E.; Zito, F.A.; Bevilacqua, V. Tumor Cellularity Assessment of Breast Histopathological Slides via Instance Segmentation and Pathomic Features Explainability. Bioengineering 2023, 10, 396. [Google Scholar] [CrossRef]
  50. Qi, X.; Ge, Y.; Yang, A.; Liu, Y.; Wang, Q.; Wu, G. Potential value of mitochondrial regulatory pathways in the clinical application of clear cell renal cell carcinoma: A Machine Learning-based study. J. Cancer Res. Clin. Oncol. 2023, 149, 17015–17026. [Google Scholar] [CrossRef]
  51. Zuo, Y.; Liu, Q.; Li, N.; Li, P.; Zhang, J.; Song, S. Optimal 18F-FDG PET/CT radiomics model development for predicting EGFR mutation status and prognosis in lung adenocarcinoma: A multicentric study. Front. Oncol. 2023, 13, 1173355. [Google Scholar] [CrossRef]
  52. Hooker, G.; Mentch, L.; Zhou, S. Unrestricted permutation forces extrapolation: Variable importance requires at least one more model, or there is no free variable importance. Stat. Comput. 2021, 31, 82. [Google Scholar] [CrossRef]
  53. Zamarrón, E.; Jaureguizar, A.; García-Sánchez, A.; Díaz-Cambriles, T.; Alonso-Fernández, A.; Lores, V.; Mediano, O.; Rodríguez-Rodríguez, P.; Cabello-Pelegrín, S.; Morales-Ruíz, E.; et al. Obstructive sleep apnea is associated with impaired renal function in patients with diabetic kidney disease. Sci. Rep. 2021, 11, 5675. [Google Scholar] [CrossRef]
  54. McNicholas, W.T. COPD-OSA Overlap Syndrome: Evolving Evidence Regarding Epidemiology, Clinical Consequences, and Management. Chest 2017, 152, 1318–1326. [Google Scholar] [CrossRef] [PubMed]
  55. Khan, A.M.; Ashizawa, S.; Hlebowicz, V.; Appel, D.W. Anemia of aging and obstructive sleep apnea. Sleep Breath. Schlaf Atm. 2011, 15, 29–34. [Google Scholar] [CrossRef]
  56. Cheng, L.; Guo, H.; Zhang, Z.; Yao, Y.; Yao, Q. Obstructive sleep apnea and incidence of malignant tumors: A meta-analysis. Sleep Med. 2021, 84, 195–204. [Google Scholar] [CrossRef]
  57. Jehan, S.; Myers, A.K.; Zizi, F.; Pandi-Perumal, S.R.; Jean-Louis, G.; McFarlane, S.I. Obesity, obstructive sleep apnea and type 2 diabetes mellitus: Epidemiology and pathophysiologic insights. Sleep Med. Disord. Int. J. 2018, 2, 52. [Google Scholar] [CrossRef]
Figure 1. Processing pipeline followed.
Figure 2. Experimental pipeline. We exploited the compatibility with the SurvSHAP and mlr3 libraries by creating a survival task and wrapping unsupported models in a composite learner. For each trained model, we created the related explainer object and computed performance metrics and explanations. Models included Cox proportional hazards (CPH), survival random forest (SRF), survival SVM (SSVM), survival gradient boosted model (SGBM), and survival Deep Learning (SDL).
Figure 3. Metrics of survival Machine Learning models computed on the test set: Cox proportional hazards (CPH), survival random forest (SRF), survival SVM (SSVM), and survival gradient boosted model (SGBM).
Figure 4. Metrics of survival DL models computed on the test set.
Figure 5. Comparison of time-variant models: CPH, Cox time, and log hazard models. The x-axis represents the event time expressed in months, where each tick represents the event (black—0; red—1).
Figure 6. Dataset-level explanations for the Cox regression model: on the left, the feature importance ranking according to the average of the absolute Shapley values; on the right, the feature importance according to the observation time. In the right part of the figure, the x-axis represents the event time expressed in months, where each tick represents the event (black—0, red—1).
Figure 7. Dataset-level explanations for the log hazard model: on the left, the feature importance ranking according to the average of the absolute Shapley values; on the right, the feature importance according to the observation time. In the right part of the figure, the x-axis represents the event time expressed in months, where each tick represents the event (black—0, red—1).
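Both panels of Figures 6 and 7 are aggregations of the same SurvSHAP(t) output: the left panel averages absolute Shapley values over patients and time points, while the right panel averages over patients only, keeping the time dimension. Assuming the per-patient explanations have already been collected into a hypothetical array `phi` of shape (n_samples, n_features, n_times), the two aggregations reduce to:

```python
import numpy as np
import pandas as pd

def aggregate_survshap(phi, feature_names, times):
    # phi: SurvSHAP(t) values, shape (n_samples, n_features, n_times) -- assumed precomputed.
    # Left panels: global ranking = mean |phi| over patients and time points.
    global_importance = pd.Series(
        np.abs(phi).mean(axis=(0, 2)), index=feature_names
    ).sort_values(ascending=False)

    # Right panels: time-resolved importance = mean |phi| over patients only,
    # leaving one importance curve per feature across the observation times.
    over_time = pd.DataFrame(np.abs(phi).mean(axis=0), index=feature_names, columns=times)
    return global_importance, over_time
```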
Figure 8. Feature importance for the Cox regression model as a bee swarm plot (aggregated SurvSHAP(t) values).
Figure 9. Feature importance for the log hazard model as a bee swarm plot (aggregated SurvSHAP(t) values). The sample with a very low oxygenation level corresponds to a patient with severe OSA (AHI > 33) and a cholesterol category of 2 (≥240 mg/dL), who is also affected by asthma and morbid obesity.
Figure 10. Time-dependent feature importance for the Cox regression model, obtained by subtracting the full model Brier score from the Brier score after single feature permutations.
Figure 11. Time-dependent feature importance for the log hazard model, obtained by subtracting the full model Brier score from the Brier score after single feature permutations.
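The curves in Figures 10 and 11 come from a permutation-based, time-resolved importance: each feature column is shuffled, the time-dependent Brier score is recomputed, and the full-model curve is subtracted. A minimal sketch of that loop, reusing the assumed `model`, `X_test`, `y_train`, `y_test`, and `times` from the earlier snippets (the permutation seed is arbitrary):

```python
import numpy as np
from sksurv.metrics import brier_score

def surv_matrix(model, X, times):
    # Predicted survival probabilities S(t | x) evaluated on the time grid.
    return np.asarray([fn(times) for fn in model.predict_survival_function(X)])

# Baseline time-dependent Brier score of the unperturbed model.
_, bs_full = brier_score(y_train, y_test, surv_matrix(model, X_test, times), times)

rng = np.random.default_rng(0)
importance_t = {}
for col in X_test.columns:
    X_perm = X_test.copy()
    X_perm[col] = rng.permutation(X_perm[col].to_numpy())   # break the feature-outcome link
    _, bs_perm = brier_score(y_train, y_test, surv_matrix(model, X_perm, times), times)
    importance_t[col] = bs_perm - bs_full                   # loss increase over time
```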
Table 1. Final dataset with related statistics. Number of patients = 1394.

| Feature | Type | Description | Number | Mean ± Std | p-Value |
| --- | --- | --- | --- | --- | --- |
| Demographics | | | | | |
| Status | Categorical | Indicates whether the patient is dead or alive at follow-up | | | Reference |
| – Dead | | | 363 | - | |
| – Alive | | | 1031 | - | |
| Sex | Categorical | Sex of the patient | | | 0.182 |
| – Male | | | 997 | - | |
| – Female | | | 397 | - | |
| Age | Categorical | Indicates whether the patient is over 65 | | | <0.001 |
| – Under 65 years | | | 700 | - | |
| – Over 65 years | | | 694 | - | |
| Marital Status | Categorical | Marital status of the patient | | | 0.842 |
| – Married | | | 262 | - | |
| – Not married | | | 1132 | - | |
| Comorbidities | | | | | |
| Hypertension | Categorical | Presence of hypertension | 752 | - | 0.652 |
| Diabetes | Categorical | Presence of diabetes | 413 | - | 0.033 |
| Heart failure | Categorical | History of heart failure | 79 | - | <0.001 |
| Dilated cardiomyopathy | Categorical | Presence of dilated cardiomyopathy | 17 | - | <0.001 |
| Atrial fibrillation | Categorical | History of atrial fibrillation | 135 | - | <0.001 |
| Previous cardiovascular events | Categorical | History of previous cardiovascular events | 32 | - | 0.089 |
| Valvular heart disease | Categorical | Presence of valvular heart disease | 33 | - | 0.006 |
| Cardiovascular disease | Categorical | Presence of cardiovascular disease | 339 | - | <0.001 |
| Chronic obstructive pulmonary disease (COPD) | Categorical | Presence of COPD | 288 | - | <0.001 |
| Asthma | Categorical | Presence of asthma | 70 | - | 0.297 |
| Malignancy | Categorical | History or presence of malignancy | 26 | - | <0.001 |
| Renal dysfunction | Categorical | Presence of renal dysfunction | 308 | - | <0.001 |
| Anemia | Categorical | Presence of anemia | 263 | - | <0.001 |
| Cholesterol Category | Categorical | Cholesterol classified into 3 categories | | | 0.030 |
| – ≤200 mg/dL | | | 870 | - | |
| – 200–239 mg/dL | | | 371 | - | |
| – ≥240 mg/dL | | | 153 | - | |
| Weight Categories | Categorical | Weight categories based on BMI value | | | <0.001 |
| – Normal weight | | | 75 | - | |
| – Overweight | | | 291 | - | |
| – Obesity class I | | | 397 | - | |
| – Obesity class II | | | 327 | - | |
| – Morbid obesity | | | 304 | - | |
| Polysomnographic data | | | | | |
| AHI | Numeric | Apnea–hypopnea index | - | 57.01 ± 19.16 | 0.002 |
| SaO2 min [%] | Numeric | Minimum oxygen saturation | - | 70.94 ± 13.63 | 0.335 |
| Treatment Info | | | | | |
| CPAP | Categorical | Received CPAP treatment | 555 | - | 0.006 |
| Years of CPAP | Numeric | Duration of CPAP usage (years) | - | 4.44 ± 3.25 | <0.001 |
| Follow-up days | Numeric | Time from admission to follow-up (months) | - | 98.62 ± 49.76 | 0.086 |
Table 2. Metrics of survival models computed on the test set, with the best performance metrics highlighted in bold. Cox proportional hazards (CPH); survival random forest (SRF); survival SVM (SSVM); and survival gradient boosted model (SGBM).

| Family | Model | C-Index | Integrated C/D AUC | Integrated Brier Score |
| --- | --- | --- | --- | --- |
| Machine Learning | CPH | **0.81** | **0.72** | **0.10** |
| | SRF | 0.81 | 0.70 | 0.12 |
| | SSVM | 0.71 | 0.61 | 0.15 |
| | SGBM | 0.79 | 0.69 | 0.14 |
| Deep Learning | Cox time | **0.78** | **0.73** | 0.12 |
| | DeepHit | 0.73 | 0.70 | 0.13 |
| | DeepSurv | 0.57 | 0.60 | 0.16 |
| | LogHazard | 0.77 | 0.70 | **0.11** |
| | PCHazard | **0.78** | 0.71 | 0.13 |
Table 3. Related works.

| Author | Year | Task | Input Data | Survival Analysis Model | Explainability Model |
| --- | --- | --- | --- | --- | --- |
| Zaccaria et al. [13] | 2023 | Prognosis of DLBCL | Transcriptomic data | AutoEncoders | DeepSHAP |
| Alabi et al. [21] | 2023 | Prognosis of NPC | CT images, clinical data | Linear Regression, KNN, support vector machines, Naive Bayes, tree-based models | SHAP, LIME |
| Srinidhi et al. [22] | 2023 | Prognosis of pancreatic cancer | CT images, clinical data | Convolutional Neural Networks, support vector machines | SHAP, LIME |
| Chadaga et al. [23] | 2023 | Prediction of BMT efficacy | Clinical data | Tree-based models, Linear Regression, KNN, AdaBoost, CatBoost | SHAP, LIME |
| Peng et al. [24] | 2021 | Prognosis of hepatitis | Clinical and demographic data | Linear Regression, CART, KNN, tree-based models, Naive Bayes | SHAP, LIME |
| Qi et al. [50] | 2023 | Prognosis of RCC | Genomic data | LASSO-Cox | SHAP, LIME |
| Zuo et al. [51] | 2023 | Identification of EGFR in lung adenocarcinoma | CT images | LightGBM, Linear Regression, tree-based models | SHAP, LIME |
| Zhu et al. [12] | 2024 | Prognosis of breast cancer | Clinical and demographic data | Cox Mixtures, DeepSurv, Cox PH, survival random forest | SurvSHAP |
| Baniecki et al. [19] | 2023 | Prediction of hospital LoS | Text data, tabular data, X-ray images | Tree-based models, CoxPH, DeepSurv, DeepHit | SurvSHAP, SurvLIME |
| Passera et al. [20] | 2023 | Test XAI on SA for BMT | Clinical and demographic data | CoxPH, survival random forest | SurvSHAP, SurvLIME |
Table 4. Cox proportional hazards matrix with features sorted by hazard ratio in descending order.

| Variable | Coef. | Exp. Coef. | Se. Coef. | Z | p-Value |
| --- | --- | --- | --- | --- | --- |
| Malignancy | 1.83 | 6.21 | 0.29 | 6.30 | 2.89 × 10⁻¹⁰ |
| Idiopathic dilated cardiomyopathy | 1.41 | 4.08 | 0.48 | 2.92 | 0.00 |
| COPD | 0.53 | 1.70 | 0.14 | 3.81 | 1.37 × 10⁻⁴ |
| Renal dysfunction | 0.56 | 1.75 | 0.15 | 3.81 | 1.37 × 10⁻⁴ |
| Age | 1.37 | 3.94 | 0.18 | 7.75 | 9.28 × 10⁻¹⁵ |
| Anemia | 0.40 | 1.49 | 0.15 | 2.73 | 0.01 |
| Atrial fibrillation | 0.37 | 1.45 | 0.23 | 1.64 | 0.10 |
| Heart failure | 0.35 | 1.41 | 0.28 | 1.24 | 0.22 |
| Diabetes | 0.11 | 1.12 | 0.14 | 0.78 | 0.43 |
| Sex | 0.30 | 1.35 | 0.16 | 1.81 | 0.07 |
| Hypertension | 0.02 | 1.02 | 0.13 | 0.15 | 0.88 |
| Cardiovascular disease | −0.01 | 0.99 | 0.19 | −0.06 | 0.95 |
| BMI categories | −0.07 | 0.93 | 0.06 | −1.12 | 0.26 |
| Cholesterol categories | −0.08 | 0.92 | 0.10 | −0.83 | 0.41 |
| Valvular disease | −0.02 | 0.98 | 0.42 | −0.05 | 0.96 |
| SaO2 min | −0.01 | 0.99 | 0.01 | −1.16 | 0.25 |
| AHI | −0.01 | 0.99 | 0.00 | −1.54 | 0.12 |
| Years of CPAP | −0.13 | 0.87 | 0.03 | −5.06 | 4.11 × 10⁻⁷ |
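As a sanity check on Table 4, each hazard ratio is the exponential of the corresponding coefficient, and the Wald statistic is the coefficient divided by its standard error: for malignancy, exp(1.83) ≈ 6.2 and Z = 1.83/0.29 ≈ 6.3, while for age exp(1.37) ≈ 3.9, matching the reported values up to rounding of the coefficients.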
Table 5. Summary of differences between the CPH and LH models. DCM—dilated cardiomyopathy.

| | CPH | LH |
| --- | --- | --- |
| C-Index | 0.81 | 0.77 |
| Brier Score | 0.10 | 0.11 |
| Dataset-level explanations (419 test samples): relevant features | Age, years of CPAP, renal dysfunction, COPD, BMI, sex, anemia | Age, AHI, renal dysfunction, SaO2 min, years of CPAP, COPD, anemia |
| Dataset-level explanations: prevailing features | Age | Age |
| Dataset-level explanations: observations | Huge gap between the age contribution and the other features; feature contributions vary little over time | Moderate gap between the age contribution and the other features; the remaining features provide similar contributions, which vary over time |
| Model-level explanations: relevant features | Age, years of CPAP, renal dysfunction, COPD, anemia, malignancy, DCM | Age, years of CPAP, renal dysfunction, COPD, anemia, AHI, SaO2 min |
| Model-level explanations: prevailing features | Age, followed by years of CPAP | All features contribute almost equally |
| Model-level explanations: observations | Age still prevails over the other features; malignancy and DCM are not strictly related to mortality and are not very common in the population | The relevant features contribute almost equally; AHI and SaO2 min are more useful and accessible in the OSA context |