Article

Integrating Shapley Values into Machine Learning Techniques for Enhanced Predictions of Hospital Admissions

by Georgios Feretzakis 1,*, Aikaterini Sakagianni 2, Athanasios Anastasiou 3, Ioanna Kapogianni 1, Effrosyni Bazakidou 4, Petros Koufopoulos 5, Yiannis Koumpouros 6, Christina Koufopoulou 7, Vasileios Kaldis 8 and Vassilios S. Verykios 1

1 School of Science and Technology, Hellenic Open University, 26335 Patras, Greece
2 Intensive Care Unit, Sismanogleio General Hospital, 15126 Marousi, Greece
3 Biomedical Engineering Laboratory, National Technical University of Athens, 15780 Athens, Greece
4 Medical School, Humanitas University, 20072 Milan, Italy
5 Internal Medicine Department, Sismanogleio General Hospital, 15126 Marousi, Greece
6 Digital Innovation in Public Health Research Lab, Department of Public and Community Health, University of West Attica, 11521 Athens, Greece
7 Anesthesiology Department, Aretaieio Hospital, National and Kapodistrian University of Athens, 11528 Athens, Greece
8 Emergency Department, Sismanogleio General Hospital, 15126 Marousi, Greece
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5925; https://doi.org/10.3390/app14135925
Submission received: 2 June 2024 / Revised: 27 June 2024 / Accepted: 4 July 2024 / Published: 7 July 2024
(This article belongs to the Special Issue Bioinformatics in Healthcare to Prevent Cancer and Children Obesity)

Abstract:
(1) Background: Predictive modeling is becoming increasingly relevant in healthcare, aiding in clinical decision making and improving patient outcomes. However, many of the most potent predictive models, such as deep learning algorithms, are inherently opaque, and their decisions are challenging to interpret. This study addresses this challenge by employing Shapley Additive Explanations (SHAP) to facilitate model interpretability while maintaining prediction accuracy. (2) Methods: We utilized Gradient Boosting Machines (GBMs) to predict patient outcomes in an emergency department setting, with a focus on model transparency to ensure actionable insights. (3) Results: Our analysis identifies “Acuity”, “Hours”, and “Age” as critical predictive features. We provide a detailed exploration of their intricate interactions and effects on the model’s predictions. The SHAP summary plots highlight that “Acuity” has the highest impact on predictions, followed by “Hours” and “Age”. Dependence plots further reveal that higher acuity levels and longer hours are associated with poorer patient outcomes, while age shows a non-linear relationship with outcomes. Additionally, SHAP interaction values uncover that the interaction between “Acuity” and “Hours” significantly influences predictions. (4) Conclusions: We employed force plots for individual-level interpretation, aligning with the current shift toward personalized medicine. This research highlights the potential of combining machine learning’s predictive power with interpretability, providing a promising route toward a data-driven, evidence-based future for healthcare.

1. Introduction

Globally, there is an escalating demand for emergency department (ED) services [1]. This increasing demand has, in turn, led to the overcrowding of emergency departments, which has significantly compromised the quality of services provided [2,3]. The decline in service quality often results in delays in delivering care, potentially leading to increased morbidity and mortality rates. The urgency of this issue was further underscored during the COVID-19 pandemic, as hospital capacities were stretched to their limits, underscoring the criticality of effective patient admission processes [4].
The decision regarding the hospitalization of patients arriving at the emergency department is primarily made by medical personnel, informed by clinical examination results and a series of medical investigations. However, in situations where the workload is exacerbated or the hospital is understaffed, a reliable decision support system becomes an invaluable tool [5]. The data underpinning these systems should ideally adhere to the criteria of the five V’s: volume, velocity, variety, veracity, and value [6,7].
In handling such large healthcare datasets, particular attention must be paid to record linkage, data cleaning, similarity search, and summarization algorithms [8,9]. Recent studies on efficient record linkage using compact Hamming space and the development of parallel and distributed engines for record linkage and similarity search, such as LSHDB, provide the foundation for advanced data preprocessing techniques [10,11]. These methods ensure the quality of the data, which in turn leads to more reliable and effective decision-making models in healthcare.
In the contemporary landscape, machine learning (ML) models and artificial intelligence (AI) technologies have been increasingly leveraged to aid in decision support systems. Deep learning applications in healthcare are rapidly evolving, bringing about significant improvements in diagnosis, treatment, and overall patient care, such as the detection and classification of malaria parasites in blood smear images [12]. However, the ’black box’ nature of many of these models hinders their broader application due to the lack of transparency and explainability [13].
To address this, the current study makes use of Shapley Additive Explanations (SHAP), a unified measure of feature importance that offers enhanced model interpretability while maintaining prediction accuracy [14]. We utilized the MIMIC-IV-ED database, a publicly available critical care database, that offers an abundant, diverse, and reliable data source for healthcare-related studies [15,16]. This open dataset facilitates research studies, promoting a collaborative and transparent research culture in the healthcare domain [15].
The Shapley value, a concept from cooperative game theory, provides a systematic technique for the equitable distribution of overall gains or costs among a group of participants [17]. The concept was developed by Lloyd Shapley, who in 2012 was awarded the Nobel Prize in Economic Sciences for his contributions to the theory of stable allocations. It emphasizes the importance of each player’s individual contribution to the overall result, a factor that is crucial in situations where cooperation is required and the outcome depends on the combined efforts of all participants.
The significance of Shapley values in this context cannot be overstated. In the realm of machine learning, the interpretability of predictive models is not just an academic concern but a practical necessity, especially in high-stakes domains like healthcare. Shapley values, as implemented through SHAP, provide a potent solution to the ‘black box’ nature of advanced algorithms. By offering a method to fairly distribute the ‘credit’ for a prediction across all input features, SHAP enables clinicians and researchers to understand the driving factors behind prognostic models. This granular level of insight is indispensable for earning the trust of medical professionals and for the ethically aligned deployment of AI in healthcare. With the incorporation of SHAP values into our analysis, this study stands at the intersection of cutting-edge machine learning techniques and the pressing need for their explainability in clinical decision making.

1.1. Aim and Contributions of the Study

The aim of this study is to develop a machine learning model that predicts hospital admissions from emergency department visits using the MIMIC-IV-ED database, with a particular emphasis on the interpretability of the model through SHAP values. The novelty of this study lies in the integration of SHAP values with a Gradient Boosting Machine (GBM) model, which not only enhances the accuracy of predictions but also provides clear, interpretable insights into the factors influencing these predictions.
Our contributions are threefold: First, we provide a robust predictive model for hospital admissions that leverages the comprehensive MIMIC-IV-ED dataset. Second, we enhance the interpretability of machine learning predictions in healthcare by employing SHAP values, offering a transparent view of the model’s decision-making process. Third, we demonstrate the practical applicability of our model in a clinical setting, addressing the need for reliable decision support tools that can alleviate the burden on healthcare providers in overcrowded EDs. Through these contributions, our study advances the field of healthcare predictive modeling by not only improving prediction accuracy but also ensuring that these models are interpretable and actionable in real-world clinical environments.
The rest of the paper is organized as follows. The second section explores the relevant literature surrounding the application of machine learning in predicting patient outcomes, with a particular focus on emergency department settings. The third section provides a comprehensive outline of the methodology used in the current study, detailing our data sourcing, preprocessing, and analysis techniques. We specifically delve into the usage of the Gradient Boosting Machine (GBM) model and the SHAP interpretability approach. The fourth section presents the results derived from our model and the corresponding SHAP analysis. This includes both global feature importance insights and individual-level interpretations gleaned from force plots. In the subsequent discussion section, we contextualize our findings within the existing body of research, explore the practical implications of our work, consider potential limitations, and suggest avenues for future research. Finally, we conclude the paper with a summary of the key findings and their broader implications for healthcare predictive modeling and clinical decision making.

1.2. Related Work

Using the MIMIC-IV dataset, Huang et al. predicted in-hospital mortality in lung cancer patients admitted to the intensive care unit [18]. Zhao et al. created a predictive model for evaluating the mortality risk of patients with sepsis-related acute respiratory failure [19]. Similarly, Xie et al. created a benchmark database based on the MIMIC-IV emergency department database, combining data from multiple tables using four unique identifiers for data linkage [20].
In addition to leveraging the MIMIC-IV dataset, our literature review reveals a spectrum of machine-learning methodologies applied to hospital admission predictions. Tschoellitsch et al. utilized a machine learning model in EDs with commendable accuracy; however, this model requires extensive data preprocessing, which can be a limiting factor in time-sensitive environments [21]. Decision trees and support vector machines, as used by Araz et al., provide intuitive decision-making pathways and robust classification capabilities, but they often suffer from overfitting and scalability issues, particularly when dealing with large datasets like those from EDs [22]. The extreme gradient boosting method presents an efficient handling of varied data types and missing values, a clear advantage in clinical settings. Nonetheless, it also brings a computational complexity that necessitates careful tuning to avoid overfitting [22]. Logistic regression and artificial neural networks, while widely used for their simplicity and flexibility, respectively, may not capture complex nonlinear relationships as effectively as ensemble methods [22].
The work by Goto et al., which focused on predicting children’s admissions from triage data, showed the potential for ML to complement conventional triage approaches [23]. However, it is important to note that such models can be highly specific to the datasets they were trained on, limiting their generalizability [23]. The Random Forest classifier, found to perform optimally by Feretzakis et al., offers a robust alternative with its ensemble approach, reducing the risk of overfitting. However, it can be less interpretable, which is a crucial consideration in clinical settings where understanding the model’s reasoning is vital [24].
It is worth noting that advancements in the same domain have been made by another study [25], which used a clustering-related technique, the k-means algorithm, to categorize ED patients and compare its efficacy with the traditional admission methods. Their efforts aimed to reduce ED overcrowding by optimizing the triage process, thus enhancing the quality of healthcare services. The k-means clustering algorithm offers several advantages but also carries notable limitations. Among its benefits, k-means can process large datasets efficiently and provide a systematic, objective categorization of patients, which is crucial for the fast-paced environment of the ED. This method aids in optimizing resource allocation, thereby potentially reducing wait times and improving patient flow. Moreover, it scales well to accommodate growing data volumes and supports data-driven decision making. However, the algorithm also has drawbacks, such as sensitivity to outliers, which can lead to misclassification. It requires the number of clusters (k) to be predetermined, which can be challenging without prior insight into the optimal categories. Additionally, k-means assumes clusters are of similar size and spherical shapes, which might not apply well to the complexity of medical data. The initial placement of centroids can affect the final outcome, introducing potential instability across different runs. Furthermore, the simplification of complex medical conditions into neat clusters might risk inadequate triage decisions if not carefully integrated with expert clinical judgments. These factors highlight the need for a balanced approach, combining algorithmic efficiency with clinical expertise to enhance the triage process in emergency settings.
The performance of popular classification techniques from the scikit-learn library was evaluated by the same researchers using a dataset of usual coagulation and biochemical markers, with a Gaussian Naive Bayes (NB) model outperforming other models in terms of ROC AUC [26]. The Gaussian NB model’s performance superiority in certain cases points to its efficiency with smaller datasets and its speed in training and prediction. However, its assumption of feature independence can be overly simplistic for the complex interdependencies often present in medical data [26].
In their study, Boulitsakis et al. compared the performance of three ML algorithms using patient data obtained from electronic medical records aiming at predicting imminent clinical deterioration in patients presenting to the ED. The study incorporated equity modeling to reduce bias among protected demographic groups. The machine learning model outperformed the National Early Warning Score 2 (NEWS2) [27]. The authors suggest that the implementation of these models could reduce fatigue and identify high-risk patients with low NEWS2 scores. The comparison with NEWS2 underscores the potential of machine learning in identifying high-risk patients but also highlights the challenge of integrating these models into existing clinical workflows [27].
Nohara et al. [28] employed SHAP in conjunction with GBM to improve the interpretability of models predicting outcomes in patients with cerebral infarction. Our study uses a different dataset (MIMIC-IV) and focuses on predicting hospital admissions in an emergency department setting, providing unique insights specific to our dataset.
In light of these insights, our study not only builds upon the established foundations of machine learning in the healthcare domain but also acknowledges the importance of balancing methodological robustness with the practicalities of clinical application. Through this comprehensive review, we pave the way for a more nuanced understanding of each method’s place in healthcare predictive analytics.

2. Materials and Methods

To improve the ability to predict the disposition of patients at emergency department triage, we utilized a machine learning-based approach.
We utilized the Gradient Boosting Machine (GBM) algorithm, a powerful ensemble method that builds an additive model in a forward stage-wise manner [29]. The GBM optimizes arbitrary differentiable loss functions and can handle a mix of categorical and numerical input features, which suits our data. The implementation of our predictive model using the Gradient Boosting Machine (GBM) was performed in Python, leveraging several libraries for data preprocessing, model training, and interpretability analysis. The main libraries used include Pandas for data manipulation, NumPy for numerical operations, scikit-learn for model training and evaluation, SHAP (Shapley Additive Explanations) for enhancing model interpretability, and Matplotlib and Seaborn for visualization.
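As a minimal sketch of this stack, the snippet below trains a scikit-learn GradientBoostingClassifier on synthetic stand-ins for three of the MIMIC-IV-ED features; the data, feature names, and hyperparameters are illustrative assumptions, not the study’s actual configuration:

```python
# Illustrative sketch only: synthetic stand-ins for MIMIC-IV-ED features.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "acuity": rng.integers(1, 6, 500),    # 1 = highest severity, 5 = lowest
    "hours": rng.exponential(4.0, 500),   # total time spent in the ED
    "age": rng.integers(18, 95, 500),
})
y = (X["acuity"] <= 2).astype(int)        # toy admission label

# Hyperparameters here are illustrative, not the study's tuned values.
gbm = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=100, max_depth=3, random_state=42
)
gbm.fit(X, y)
```

Because GBM trees split directly on raw feature values, the mixed integer and continuous columns above need no special treatment beyond the one-hot encoding of categoricals described later.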
Our approach, which integrates a Gradient Boosting Machine (GBM) with Shapley Additive Explanations (SHAP), is systematically outlined in Algorithm 1. The algorithm delineates the sequence of operations, from data preprocessing to model training and interpretation using SHAP values. It provides the pseudo-code necessary for replicating our model, specifying the parameters used, the iteration process, and the calculation of SHAP values for feature importance analysis.
Algorithm 1: GBM with SHAP for Predicting Hospital Admissions
Input: Dataset D from the MIMIC-IV-ED, features F, target outcome T (admission status)
1: Data Preparation:
  a. Retrieve dataset D from the MIMIC-IV-ED database.
  b. Clean D by handling missing values and outliers.
  c. Encode categorical features in F using one-hot encoding.
  d. Normalize or standardize numerical features in F.
  e. Split D into training set D_train (80%) and testing set D_test (20%) using stratified sampling.
2: Model Training with GBM:
  a. Define GBM parameters (learning rate, number of trees, depth of trees, etc.).
  b. Train the GBM model on D_train predicting target T.
  c. Validate the model on D_test with performance metrics (Accuracy, Precision, Recall, F1 Score, AUC).
3: SHAP Value Calculation for Feature Importance:
  a. Apply SHAP to the trained GBM model to compute the contribution of each feature in F.
  b. Rank features by the magnitude of their SHAP values.
4: Model Interpretation:
  a. Generate SHAP summary plot for global feature importance from D_test.
  b. Produce SHAP dependence plots to examine interactions among top features (“Acuity”, “Hours”, “Age”).
  c. Create SHAP force plots for individual prediction interpretation.
5: Results Analysis and Validation:
  a. Analyze the GBM model’s performance on D_test.
  b. Interpret the SHAP plots for actionable clinical insights.
  c. Assess the model’s generalization capability through its performance metrics.
Output: Trained GBM model, SHAP values for feature importance, interpretative plots for clinical decision support.
Furthermore, to facilitate a clearer understanding, we introduce a flowchart explaining the step-by-step process (Figure 1). This visual representation maps the entire process of our method, beginning with data collection from the MIMIC-IV-ED database, data preprocessing, training the GBM model, and finally applying SHAP for interpretability.
Our data were retrieved from the MIMIC-IV database, which provides a rich and reliable source for healthcare-related studies [15]. In our research, we utilized a comprehensive set of features from the MIMIC-IV-ED dataset to develop our predictive model. These features include age, means of arrival/transport, vital signs collected at triage (patient temperature, heart rate, respiratory rate, oxygen saturation, systolic blood pressure, and diastolic blood pressure), pain intensity, acuity (based on the triage assessment, where the care provider assigns an integer level of severity ranging from 1 for the highest severity to 5 for the lowest severity), final disposition (admission/discharge), and total waiting time in hours. The feature “Hours” refers to the total duration of time a patient spent in the emergency department (ED) from arrival to the time of the disposition decision (either discharge or admission). This feature is derived from the timestamps provided in the MIMIC-IV-ED dataset. Specifically, it is obtained by subtracting the arrival time from the disposition time. It is an important factor as it can indicate the severity and complexity of the patient’s condition, the efficiency of the ED processes, and the demand on ED resources. For the categorical features, we used one-hot encoding to transform them into a format that could be used by the GBM model.
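The derivation of “Hours” and the one-hot encoding step can be sketched with pandas; the two-row table below is fabricated, and the `intime`/`outtime` column names are assumptions modeled on MIMIC-IV-ED conventions:

```python
import pandas as pd

# Fabricated stay table; intime/outtime mirror MIMIC-IV-ED naming conventions.
ed = pd.DataFrame({
    "intime": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:30"]),
    "outtime": pd.to_datetime(["2023-01-01 14:00", "2023-01-02 01:30"]),
    "arrival_transport": ["WALK IN", "AMBULANCE"],
})

# "Hours": disposition time minus arrival time, expressed in hours.
ed["hours"] = (ed["outtime"] - ed["intime"]).dt.total_seconds() / 3600.0

# One-hot encode the categorical arrival mode for use by the GBM.
ed = pd.get_dummies(ed, columns=["arrival_transport"])
```

For these two fabricated rows, the derived values are 6.0 and 16.0 h, and `get_dummies` produces one indicator column per arrival mode (e.g., `arrival_transport_WALK IN`).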
The preprocessing of the MIMIC-IV ED dataset involved several key steps to ensure the quality and usability of the data for predictive modeling. Initially, data were extracted from the MIMIC-IV ED database, focusing on emergency department triage data, which included initial vital signs and demographic information. To clean and reshape the data, we identified and removed non-valid records, such as those with incorrectly registered values (e.g., pain scores above the valid range). Handling missing values was a critical part of our preprocessing. Records with null values were systematically removed from the dataset to prevent any negative impact on the model’s performance. After preprocessing, the final dataset contained 276,939 instances.
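The cleaning steps described above, removing out-of-range pain scores and records with null values, can be sketched as follows on a fabricated toy table:

```python
import numpy as np
import pandas as pd

# Fabricated triage extract exhibiting the defects described above.
triage = pd.DataFrame({
    "pain": [3, 13, 0, np.nan, 10],        # valid pain scores are 0-10
    "heartrate": [72, 80, np.nan, 90, 110],
    "acuity": [2, 3, 1, 4, 2],
})

# Drop out-of-range pain values (NaN also fails .between), then drop nulls.
clean = triage[triage["pain"].between(0, 10)].dropna()
```

In this toy example only the first and last rows survive: the pain score of 13 is out of range, and the two records containing nulls are removed systematically, mirroring the preprocessing described above.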
After preparing the data, we split it into a training set (80% of the data) and a testing set (20% of the data) using stratified sampling. This was done to ensure that the model could be validated on unseen data. The model was trained using the training set and the performance of the model was subsequently validated using the testing set. The randomness in the data splitting was controlled by setting a seed value to ensure the reproducibility of the results.
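A minimal sketch of this split, assuming scikit-learn’s `train_test_split` with stratification and a fixed seed, on an imbalanced toy target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 700 + [1] * 300)        # imbalanced toy admission labels

# 80/20 split, stratified on the target, seeded for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Stratification preserves the 30% positive-class ratio in both partitions, which matters when admission labels are imbalanced, and the fixed `random_state` makes the partition reproducible across runs.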
After model training, we utilized SHAP to interpret the model’s predictions. Using SHAP values, we are able to measure the contribution of each individual feature to the prediction outcome for each instance in the test set. These SHAP values have notable properties that make them valuable for our interpretability goals. First, for each instance they decompose the difference between that instance’s prediction and the mean prediction over all instances, quantifying how much each feature ’shifts’ the prediction from the baseline. Second, SHAP values ensure that the contributions of the features are allocated fairly: each feature’s contribution to the prediction is proportionate to its influence on the outcome.
The SHAP methodology allows us to dive deeper into the model’s decision process by producing SHAP dependence plots. These plots help to visualize and understand the complex relationship between the predictor variables and the model outcome. They reveal nuanced insights into how different features interact and how they collectively influence the model’s predictions. These dependence plots go beyond simple feature importance, shedding light on how individual features behave and interact in the context of other features. This robust interpretability tool thus enables a more holistic understanding of the predictive behavior of our model.

3. Results

The GBM model was trained using the training set and was subsequently used to predict the “Disposition” of patients, represented as a binary variable (0 for “HOME”, 1 for “ADMITTED”), in both the training and testing sets. We evaluated the model performance based on several metrics: Accuracy, Precision, Recall, F1 Score, and Area Under the Receiver Operating Characteristic Curve (AUC).
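These metrics can be computed with scikit-learn; the labels and scores below are fabricated purely to illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Fabricated labels and predicted probabilities, purely to show the calls.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.2, 0.4, 0.8, 0.3, 0.1, 0.9, 0.6, 0.7]
y_pred = [int(p >= 0.5) for p in y_prob]          # threshold at 0.5

acc = accuracy_score(y_true, y_pred)    # 0.75
prec = precision_score(y_true, y_pred)  # 0.75
rec = recall_score(y_true, y_pred)      # 0.75
f1 = f1_score(y_true, y_pred)           # 0.75
auc = roc_auc_score(y_true, y_prob)     # 0.875
```

Note that AUC is computed from the predicted probabilities rather than the thresholded labels, which is why it can differ from the other four metrics.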
On the training set, the model achieved an accuracy of 0.733, precision of 0.702, recall of 0.636, F1 score of 0.668, and AUC score of 0.720. The model’s performance on the testing set was slightly lower but comparable, with an accuracy of 0.729, precision of 0.694, recall of 0.629, F1 score of 0.660, and AUC score of 0.715.
The similar performance on both the training and testing sets suggests that our model was able to generalize well to unseen data and did not suffer from overfitting. As demonstrated by the AUC scores in Figure 2, our model’s ability to distinguish between patients who were sent home and those who were admitted was fairly good. This indicates the model’s potential utility in a real-world clinical setting.
As a result of our methodology, we generated SHAP values and then illustrated them using SHAP summary and dependence plots. The summary plot gives an overview of the importance and impacts of features across all instances in the test dataset. It ranks the features by the sum of SHAP value magnitudes over all samples and uses SHAP values to show the distribution of the impacts each feature has on the model output.
From the summary plot (Figure 3), it is clear that not all features contribute equally to the predictive power of the model. Certain features have a higher impact, which can be identified by their positions in the plot and by the color which indicates the feature value. This kind of plot provides a holistic view of feature importance and how each feature’s value affects the output.
The SHAP summary plot, depicted in Figure 3, provides a holistic view of feature importance and the effects of feature values on the model predictions. The y-axis lists the features, and the x-axis quantifies the SHAP value. Each dot corresponds to a Shapley value for a feature and an instance.
The features are ranked by importance from top to bottom, determined by the sum of SHAP value magnitudes over all samples. It is clear that “Acuity”, “Hours”, and “Age” are the top three influential factors in the model’s decision-making process. The colors represent the feature value (red high, blue low).
“Acuity” appears to be the most impactful feature. High “Acuity” values (red points) are associated with a positive SHAP value, indicating that high acuity levels increase the likelihood of the positive class. Conversely, low acuity levels decrease this likelihood.
The “Hours” feature exhibits a mixed impact. Both high and low “Hours” values are found on either side of the SHAP value spectrum, indicating its complex relationship with the model’s output.
Lastly, the “Age” feature predominantly exhibits positive SHAP values, suggesting that an increase in “Age” generally contributes to a higher likelihood of the positive class.
The insights from the SHAP summary plot, thus, provide an intuitive understanding of the model’s behavior. They align with the expectations and prior knowledge in the healthcare context.
Dependence plots were created for the features “Acuity”, “Hours”, and “Age”. These were chosen because they were among the top features identified in the summary plot and their interaction effects with the model prediction are of particular interest in the scope of this study. The SHAP dependence plots (Figure 4, Figure 5 and Figure 6) show the marginal effect one or two features have on the predicted outcome of a machine learning model. The color represents a third feature that may potentially interact with the x-axis feature.
The reason why we chose to create dependence plots for “Acuity”, “Hours”, and “Age” is mainly due to their prominence in the summary plot, suggesting they have a high magnitude of impact on the model’s predictions. The dependence plots for these features can potentially highlight the relationship between each feature and the target variable, as well as any possible interaction effects between these features and others.
In conclusion, our results highlight the power of the Gradient Boosting Machine and SHAP values in providing insightful interpretability of complex models. The SHAP summary plot and dependence plots offered an intuitive understanding of feature importance and their interactions in a way that simple global feature importance methods cannot match. In future work, it would be useful to investigate and compare other feature interpretation methods as well as other model types.
The force plot illustrates how each feature is pushing the model output away from the base value (expected value) and towards the output value for that specific instance (row of data). The base value is the average prediction for all instances, and the output value is the prediction for the particular instance we are inspecting. The features that push the prediction higher are shown in red, while the features that push the prediction lower are shown in blue.
The force plot (Figure 7) provides a detailed view of how the model makes decisions. By looking at this plot, we can understand how the value of each feature contributes positively or negatively to the final prediction, which further enhances our interpretability of the model’s behavior. It represents a single patient’s visit to the hospital with a gradient-boosting machine predicting their hospitalization risk. We start with the red values.
  • “Hours”, with a value of 8.217, contributes to increasing the model’s prediction compared to the base value. In other words, this feature is pushing the prediction higher.
  • “Age” of 76 years also increases the model’s prediction. As per the data, older patients are likely to have a higher risk of requiring hospitalization.
  • “Arrival transport WALK IN” with a value of 0 also contributes to increasing the risk, suggesting that patients arriving by other means than walking in might be more unwell, and therefore at a higher risk of hospitalization.
  • “Acuity” with a value of 2 also contributes to an increase in the model’s prediction. Higher acuity might mean a more serious condition, and hence a higher hospitalization risk.
Next, we interpret the blue values:
  • “Heartrate” with a value of 68 contributes to decreasing the model’s prediction. A lower heart rate might be indicative of a stable condition, hence lowering the risk of hospitalization.
  • “SBP” (systolic blood pressure), with a value of 145, also lowers the model’s prediction. High systolic blood pressure might indicate the body’s response to a health stressor, but in the context of the other parameters, it lowers the predicted risk in the model.
  • “Temperature” of 97.9 degrees Fahrenheit also decreases the model’s prediction, as it is close to the average body temperature, indicating no sign of fever, hence lowering the hospitalization risk.
The SHAP force plot for a single patient visit provides insightful explanations at an individual level. The red features (i.e., “Hours”, “Age”, “Arrival transport WALK IN”, “Acuity”) drive the model’s prediction higher compared to the base value, while the blue features (i.e., “Heartrate”, “SBP”, “Temperature”) pull it lower.
For instance, a longer stay in the ED (8.217 h) and older age (76 years old) are notable factors increasing the predicted risk of hospitalization. The transport method upon arrival and a higher acuity level also contribute to this increased risk. Conversely, a stable heart rate of 68 bpm, systolic blood pressure of 145 mmHg, and normal body temperature of 97.9 °F lower the predicted risk. These interpretations, however, are based on the patterns learned by the model from the data and might not fully capture the complexity of clinical decision making.
The dependence plot of “Age” and “Acuity” (Figure 8) offers valuable insights into how these two important features interact to influence patient outcomes.
The colors in the plot show the effects of “Acuity”. There appears to be a non-linear relationship between “Age” and “Acuity”, with the influence of “Age” on the model’s prediction being particularly pronounced for patients with high acuity. The effects of “Age” are also amplified in patients with very low acuity, suggesting that age is a significant factor in both severe and mild cases. The dependence plot thus reveals how “Age” and “Acuity” interactively impact the model’s predictions.
Similarly, the dependence plot of “Heartrate” and “Age” (Figure 9) demonstrates the interaction between these two features.
The influence of “Heartrate” on the model’s prediction appears to be significantly affected by “Age”, especially for older patients. There is a clear pattern in which higher heart rates are associated with higher SHAP values for older patients, suggesting that the model identifies high heart rates in older patients as a significant risk factor. This visualization reinforces the importance of considering complex interactions between features in clinical prediction models.
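The interaction effects visualized in Figures 8 and 9 derive from the Shapley decomposition, which for a small number of features can be computed exactly by enumerating feature coalitions. The sketch below uses a toy two-feature risk model with an explicit age–acuity interaction term; the model, its coefficients, and the baseline values are illustrative assumptions, not the study’s GBM.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline, features):
    """Exact Shapley values by enumerating all feature coalitions."""
    n = len(features)
    phi = {f: 0.0 for f in features}

    def value(subset):
        # Features outside the coalition take their baseline value
        z = {f: (x[f] if f in subset else baseline[f]) for f in features}
        return model(z)

    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[f] += w * (value(set(S) | {f}) - value(set(S)))
    return phi

# Toy risk model with an explicit age x acuity interaction term (hypothetical)
def toy_model(z):
    return 0.01 * z["age"] + 0.05 * z["acuity"] + 0.002 * z["age"] * z["acuity"]

x = {"age": 76, "acuity": 2}          # the patient being explained
baseline = {"age": 50, "acuity": 3}   # assumed reference values
phi = shapley_values(toy_model, x, baseline, ["age", "acuity"])

# Efficiency property: the attributions sum to f(x) - f(baseline)
assert abs(sum(phi.values()) - (toy_model(x) - toy_model(baseline))) < 1e-9
```

In practice, libraries such as `shap` compute these values efficiently for tree ensembles (TreeSHAP) rather than by exhaustive enumeration, whose cost grows exponentially with the number of features.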

4. Discussion

In the present study, the primary innovation is the application of Gradient Boosting Machines (GBMs) combined with Shapley Additive Explanations (SHAP) to the MIMIC-IV dataset [29]. This methodology seeks to enhance the interpretability and predictive accuracy of machine learning models in healthcare, particularly for predicting patient outcomes in emergency settings.
The interpretability of machine learning models is paramount in the healthcare domain, especially for clinical decision making. Black-box models, despite their prediction accuracy, do not afford the transparency that healthcare professionals require [13]. The integration of SHAP with GBMs in this study has shown that it is possible to achieve both high predictive performance and interpretability. However, it is crucial to acknowledge and differentiate this work from similar studies in the field to solidify its contribution to medical informatics.
A pertinent study that warrants comparison is that of Nohara et al. (2022), who also employed SHAP in conjunction with GBM but utilized a different dataset, sourced from a hospital in Japan [28]. Their research focuses on improving the interpretability of models used in predicting outcomes for patients with cerebral infarction, employing SHAP to analyze feature importance within their predictive models. This approach parallels the methodological framework utilized by the present study, albeit applied to a distinct clinical environment and patient demographic.
Our findings emphasize the importance of certain clinical variables such as “Acuity”, “Hours”, and “Age” in predicting patient outcomes. These variables have previously been recognized as significant predictors in other healthcare studies [30,31]. Our study, however, has taken this a step further by elucidating the interaction effects between these and other variables.
The SHAP summary plot provided an intuitive understanding of feature importance, revealing “Acuity”, “Hours”, and “Age” as the top three features influencing the model’s predictions. These findings align with previous research indicating that older age and high acuity are linked with poorer outcomes in emergency departments [30,31]. However, the insights derived from the SHAP values extend beyond simple global feature importance by revealing how features interact to influence the outcome.
For example, the dependence plot between “Age” and “Acuity” shows a non-linear relationship, with the influence of age on the prediction most pronounced at the extremes of both age and acuity. Similarly, the interaction between “Heartrate” and “Age” indicates a higher risk for older patients with high heart rates. These insights emphasize the importance of adopting a multi-criteria approach to patient assessment and triage in the emergency department [30].
Furthermore, the force plot provides a granular level of interpretability, showing how each feature contributed to the model’s prediction for individual patients. This type of interpretability is crucial for personalized medicine, where understanding individual patient characteristics is as important as understanding overall trends [32].
To provide a comprehensive and fair comparison of our GBM approach with existing methodologies, we evaluated our model against other established machine learning techniques. This comparative analysis involved benchmarking our model’s performance, in terms of accuracy, precision, recall, F1 score, and AUC, against algorithms such as decision trees, random forests, support vector machines, and neural networks, which have previously been applied to similar healthcare datasets.
For each comparator model, we utilized the same training and testing sets to ensure an equitable comparison. The performance metrics were calculated for each model, and statistical significance tests were employed to determine if the differences in performance were due to the model’s predictive capability rather than random variation.
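The metrics used in this benchmarking are standard and straightforward to reproduce. Below is a minimal pure-Python sketch (in practice, library routines such as those in `sklearn.metrics` would be used) applied to hypothetical labels and scores; the data are illustrative only, not from the study.

```python
def confusion(y_true, y_pred):
    """Binary confusion-matrix counts: (tp, fp, fn, tn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

def auc(y_true, y_score):
    # AUC as the probability a random positive outranks a random negative
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical admission labels and model scores (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.8, 0.4, 0.2, 0.7, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

acc, prec, rec, f1 = scores(y_true, y_pred)
print(acc, prec, rec, f1, auc(y_true, y_score))
```

The AUC here is computed via its probabilistic (Mann–Whitney) interpretation, the chance that a randomly chosen admitted patient receives a higher score than a randomly chosen non-admitted one, which is equivalent to the area under the ROC curve.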
Our findings revealed that while certain models, such as the random forest, demonstrated comparable predictive accuracy, our GBM model, interpreted through SHAP values, provided additional interpretability that is indispensable in clinical decision making. In particular, the ability to elucidate the intricate interactions between features such as “Acuity”, “Hours”, and “Age” and patient outcomes is a distinct advantage that helps healthcare professionals understand the model’s rationale, a qualitative benefit that goes beyond numerical performance measures.
Moreover, the integration of SHAP plots for individual-level interpretation aligns with personalized medicine objectives, offering insights that are not available through traditional global importance measures. Future research will be aimed at furthering these comparisons, considering not only performance metrics but also aspects such as computational efficiency, ease of model update, and interpretability, to reinforce the practicality of our GBM-SHAP methodology in emergency healthcare settings.
The use of GBMs in conjunction with SHAP values thus offers an innovative solution to the trade-off between predictive power and interpretability. This methodology can be applied in various healthcare settings to derive meaningful insights from complex data, thereby supporting evidence-based clinical decision making [14,29,33].
While our findings have practical implications for emergency departments, they also contribute to broader discussions around the interpretability of machine learning models in healthcare. Future research could explore other feature interpretation methods and model types, as well as apply these methods in different healthcare contexts.
While our approach uses the same machine learning algorithm (GBM) and explanation method (SHAP) as Nohara et al. [28], our study is distinguished by the use of a different dataset (MIMIC-IV) and focuses on predicting hospital admissions in an emergency department setting. This provides unique insights into patient outcomes specific to our dataset.
In conclusion, this study has demonstrated the value of using GBMs and SHAP values for interpreting complex machine learning models in healthcare. These methods offer a powerful tool for understanding and predicting patient outcomes, thereby supporting evidence-based practice in healthcare settings.

5. Conclusions

In conclusion, our research has contributed valuable insights to the realm of healthcare predictive modeling. By employing Gradient Boosting Machines (GBMs) along with Shapley Additive Explanations (SHAP), we built a highly predictive model while maintaining the level of interpretability that is crucial in healthcare settings [14,29]. The necessity of such an approach is underscored by the requirement of transparency and explainability in decision-making processes, particularly in a sensitive and impactful area like healthcare [13].
Our findings have highlighted “Acuity”, “Hours”, and “Age” as critical variables in predicting patient outcomes in an emergency department. By leveraging the SHAP values, we are not only able to determine the importance of these features but also elucidate the intricate interactions between them [14]. This understanding may have vital implications for patient assessment and triage processes, enabling a more informed and evidence-based approach to healthcare.
The use of force plots further facilitated an individual-level interpretation, providing insights into each patient’s unique characteristics and how they influenced the model’s prediction. This aligns with the contemporary shift towards personalized medicine, which emphasizes understanding and addressing individual patient characteristics [32].
While our study focused on an emergency department context, the demonstrated approach is generalizable and can be utilized across various healthcare contexts. Future research can explore other model types and feature interpretation methods, further pushing the envelope of predictive modeling in healthcare. Moreover, different healthcare settings could be explored to validate and further refine the methodologies used in this study.
Overall, this study has emphasized the potential of combining predictive power with interpretability in machine learning models, specifically within healthcare. With the continued advancement of AI and machine learning, this study’s methodology provides a promising route for navigating the path toward a data-driven healthcare future, where evidence-based practice and patient-specific treatment are paramount.

Author Contributions

Conceptualization, G.F., A.S. and V.S.V.; methodology, G.F., A.S., V.K. and V.S.V.; validation, G.F.; formal analysis, V.K.; investigation, G.F. and V.S.V.; data curation, I.K.; writing—original draft preparation, G.F., A.S., V.S.V., C.K., E.B. and P.K.; writing—review and editing, G.F., A.S., V.S.V., C.K., E.B., P.K., Y.K. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the MIMIC-IV-ED database, which is publicly accessible. The specific version of the dataset and the queries used for this analysis are available from the corresponding author upon reasonable request.

Acknowledgments

We would like to acknowledge the use of the MIMIC-IV-ED database (version 2.2) in this study. The database was developed and is maintained by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology. This work was conducted in compliance with the database’s data use agreement. We would like to express our gratitude to the many individuals who contributed to this invaluable resource. For detailed information on the database, readers are referred to the publications [15,16].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Higginson, I. Emergency department crowding. Emerg. Med. J. 2012, 29, 437–443. [Google Scholar] [CrossRef] [PubMed]
  2. Pines, J.M.; Hilton, J.A.; Weber, E.J.; Alkemade, A.J.; Al Shabanah, H.; Anderson, P.D.; Bernhard, M.; Bertini, A.; Gries, A.; Ferrandiz, S.; et al. International perspectives on emergency department crowding. Acad. Emerg. Med. 2011, 18, 1358–1370. [Google Scholar] [CrossRef] [PubMed]
  3. Sun, B.C.; Hsia, R.Y.; Weiss, R.E.; Zingmond, D.; Liang, L.J.; Han, W.; McCreath, H.; Asch, S.M. Effect of emergency department crowding on outcomes of admitted patients. Ann. Emerg. Med. 2013, 61, 605–611.e6. [Google Scholar] [CrossRef] [PubMed]
  4. Rosenbaum, L. Facing COVID-19 in Italy—Ethics, Logistics, and Therapeutics on the Epidemic’s Front Line. N. Engl. J. Med. 2020, 382, 1873–1875. [Google Scholar] [CrossRef] [PubMed]
  5. Dubey, R.; Zhou, J.; Wang, Y.; Thompson, P.M.; Ye, J. Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. Neuroimage 2014, 87, 220–241. [Google Scholar] [CrossRef] [PubMed]
  6. Abdalla, H.B. A brief survey on big data: Technologies, terminologies and data-intensive applications. J. Big Data 2022, 9, 107. [Google Scholar] [CrossRef]
  7. Khan, N.; Yaqoob, I.; Hashem, I.A.T.; Inayat, Z.; Ali, W.K.; Alam, M.; Shiraz, M.; Gani, A. Big data: Survey, technologies, opportunities, and challenges. Sci. World J. 2014, 2014, 712826. [Google Scholar] [CrossRef] [PubMed]
  8. Randall, S.M.; Ferrante, A.M.; Boyd, J.H.; Semmens, J.B. The effect of data cleaning on record linkage quality. BMC Med. Inform. Decis. Mak. 2013, 13, 64. [Google Scholar] [CrossRef] [PubMed]
  9. Brown, A.P.; Randall, S.M. Secure Record Linkage of Large Health Data Sets: Evaluation of a Hybrid Cloud Model. JMIR Med. Inform. 2020, 8, e18920. [Google Scholar] [CrossRef]
  10. Soliman, A.; Rajasekaran, S.; Toman, P.; Ravishanker, N. A fast privacy-preserving patient record linkage of time series data. Sci. Rep. 2023, 13, 3292. [Google Scholar] [CrossRef]
  11. Karapiperis, D.; Gkoulalas-Divanis, A.; Verykios, V.S. LSHDB: A parallel and distributed engine for record linkage and similarity search. In Proceedings of the IEEE 16th International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 12–15 December 2016; pp. 1–4. [Google Scholar] [CrossRef]
  12. Fasihfar, Z.; Rokhsati, H.; Sadeghsalehi, H.; Ghaderzadeh, M.; Gheisari, M. AI-Driven Malaria Diagnosis: Developing a Robust Model for Accurate Detection and Classification of Malaria Parasites. Iran. J. Blood Cancer 2023, 15, 112–124. [Google Scholar] [CrossRef]
  13. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2018, 51, 1–42. [Google Scholar] [CrossRef]
  14. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  15. Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV (version 2.2). Sci. Data 2023, 10, 1. [Google Scholar] [CrossRef] [PubMed]
  16. Goldberger, A.; Amaral, L.; Glass, L.; Hausdorff, J.; Ivanov, P.C.; Mark, R.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, E215–E220. [Google Scholar] [CrossRef] [PubMed]
  17. Shapley, L.S. A value for n-person games. In Contributions to the Theory of Games; Kuhn, H.W., Tucker, A.W., Eds.; Princeton University Press: Princeton, NJ, USA, 1953; Volume II, pp. 307–317. [Google Scholar] [CrossRef]
  18. Huang, T.; Le, D.; Yuan, L.; Xu, S.; Peng, X. Machine learning for prediction of in-hospital mortality in lung cancer patients admitted to intensive care unit. PLoS ONE 2023, 18, e0280606. [Google Scholar] [CrossRef] [PubMed]
  19. Zhao, J.; Feng, Q.; Wu, P.; Lupu, R.A.; Wilke, R.A.; Wells, Q.S.; Denny, J.C.; Wei, W.Q. Learning from longitudinal data in electronic health record and genetic data to improve cardiovascular event prediction. Sci. Rep. 2019, 9, 717. [Google Scholar] [CrossRef] [PubMed]
  20. Xie, F.; Zhou, J.; Lee, J.W.; Tan, M.; Li, S.; Rajnthern, L.S.; Chee, M.L.; Chakraborty, B.; Wong, A.I.; Dagan, A.; et al. Benchmarking emergency department prediction models with machine learning and public electronic health records. Sci. Data 2022, 9, 658. [Google Scholar] [CrossRef]
  21. Tschoellitsch, T.; Seidl, P.; Böck, C.; Maletzky, A.; Moser, P.; Thumfart, S.; Giretzlehner, M.; Hochreiter, S.; Meier, J. Using emergency department triage for machine learning-based admission and mortality prediction. Eur. J. Emerg. Med. 2023, 30, 408–416. [Google Scholar] [CrossRef]
  22. Araz, O.M.; Bentley, D.; Muelleman, R.L. Using Google Flu Trends data in forecasting influenza-like-illness related ED visits in Omaha, Nebraska. Am. J. Emerg. Med. 2014, 32, 1016–1023. [Google Scholar] [CrossRef]
  23. Goto, T.; Camargo, C.A.; Faridi, M.K.; Yun, B.J.; Hasegawa, K. Machine learning-based prediction of clinical outcomes for children during emergency department triage. JAMA Netw. Open 2019, 2, e186937. [Google Scholar] [CrossRef] [PubMed]
  24. Feretzakis, G.; Sakagianni, A.; Loupelis, E.; Kalles, D.; Panteris, V.; Tzelves, L.; Chatzikyriakou, R.; Trakas, N.; Kolokytha, S.; Batiani, P.; et al. Prediction of Hospitalization Using Machine Learning for Emergency Department Patients. Stud. Health Technol. Inform. 2022, 294, 145–146. [Google Scholar] [CrossRef] [PubMed]
  25. Feretzakis, G.; Sakagianni, A.; Kalles, D.; Loupelis, E.; Tzelves, L.; Panteris, V.; Chatzikyriakou, R.; Trakas, N.; Kolokytha, S.; Batiani, P.; et al. Exploratory Clustering for Emergency Department Patients. Stud. Health Technol. Inform. 2022, 295, 503–506. [Google Scholar] [CrossRef] [PubMed]
  26. Feretzakis, G.; Sakagianni, A.; Loupelis, E.; Karlis, G.; Kalles, D.; Tzelves, L.; Chatzikyriakou, R.; Trakas, N.; Petropoulou, S.; Tika, A.; et al. Predicting Hospital Admission for Emergency Department Patients: A Machine Learning Approach. Stud. Health Technol. Inform. 2022, 289, 297–300. [Google Scholar] [CrossRef]
  27. Boulitsakis Logothetis, S.; Green, D.; Holland, M.; Al Moubayed, N. Predicting acute clinical deterioration with interpretable machine learning to support emergency care decision making. Sci. Rep. 2023, 13, 13563. [Google Scholar] [CrossRef] [PubMed]
  28. Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput. Methods Programs Biomed. 2022, 214, 106584. [Google Scholar] [CrossRef] [PubMed]
  29. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  30. Singer, A.J.; Thode, H.C., Jr.; Viccellio, P. The association between length of emergency department boarding and mortality. Acad. Emerg. Med. 2011, 18, 1324–1329. [Google Scholar] [CrossRef] [PubMed]
  31. Salvi, F.; Morichi, V.; Grilli, A.; Giorgi, R.; De Tommaso, G.; Dessì-Fulgheri, P. The elderly in the emergency department: A critical review of problems and solutions. Intern. Emerg. Med. 2007, 2, 292–301. [Google Scholar] [CrossRef]
  32. National Research Council (US) Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease; National Academies Press: Washington, DC, USA, 2011. Available online: https://www.ncbi.nlm.nih.gov/books/NBK91503/ (accessed on 21 April 2024). [CrossRef]
  33. Moons, K.G.; Royston, P.; Vergouwe, Y.; Grobbee, D.E.; Altman, D.G. Prognosis and prognostic research: What, why, and how? BMJ 2009, 338, b375. [Google Scholar] [CrossRef]
Figure 1. Flowchart illustrating the step-by-step process.
Figure 2. Receiver Operating Characteristic (ROC) curve of the GBM model on the test set. The AUC score is 0.715, indicating acceptable discriminative performance.
Figure 3. SHAP summary plot.
Figure 4. SHAP dependence plot for “Acuity”.
Figure 5. SHAP dependence plot for “Hours”.
Figure 6. SHAP dependence plot for “Age”.
Figure 7. SHAP force plot for a single patient visit.
Figure 8. SHAP dependence plot for “Age” and “Acuity”.
Figure 9. SHAP dependence plot for “Heartrate” and “Age”.

