3.2. Discussion
Using a clinical database, we previously demonstrated accurate prediction of the fluid requirements of ICU patients receiving vasopressor agents, using physiologic variables from the previous 24 h in the ICU [5]. Subsequently, we demonstrated improved mortality prediction among ICU patients who developed AKI by building models on this subset of patients [1]. In this paper, we applied the same approach to another subset of patients (ICU patients who presented with SAH) from MIMIC, and to a database from another country (elderly patients who underwent cardiac surgery in Dunedin Hospital, New Zealand). In addition, instead of limiting the variables to those already validated as predictive of mortality, we applied automated feature selection to a large pool of candidate variables. Finally, we employed three machine learning algorithms to build our models: logistic regression, Bayesian network, and artificial neural network.
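The selection algorithm is not restated in this section, so the following is only a minimal sketch of one common automated feature-selection approach, univariate mutual-information ranking via scikit-learn, applied to a wide table of candidate variables. The synthetic data, the choice of k = 10, and the scikit-learn toolchain are illustrative assumptions, not the method reported in this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a large pool of candidate physiologic variables
# (rows = patients, columns = candidate variables, y = hospital mortality).
X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)

# Rank candidates by mutual information with the outcome; keep the top 10.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("retained variable indices:", np.flatnonzero(selector.get_support()))
```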
ICU patients who develop AKI are one subset in which severity scoring systems have consistently performed poorly. The largest worldwide multi-center prospective study found that observed mortality among these patients was substantially greater than that predicted by SAPS II (60.3% vs. 45.6%, p < 0.001) [6]. In a UK-wide study of ICU patients who developed AKI, the APACHE II score under-predicted the number of deaths [7]; in that study, the null hypothesis of perfect calibration was strongly rejected (p < 0.001) by both the Hosmer-Lemeshow test and Cox's calibration regression.
SAH is a neurological emergency caused by bleeding into the subarachnoid space. Despite advances in neurocritical monitoring and treatment options, population-based studies have reported an SAH death rate of about 50% [8]. Numerous systems have been proposed for grading the clinical condition of patients following SAH, including the Hunt and Hess Scale, the Fisher Scale, the Glasgow Coma Score (GCS), and the World Federation of Neurological Surgeons Scale. However, there are few validation studies of these scales and no prospective controlled comparison studies [9]. For this reason, the use of any particular SAH grading scale is largely a matter of individual or institutional preference.
For patients undergoing cardiac surgery, the EuroSCORE is the most frequently used risk algorithm [10]. It was created primarily to allow patient grouping across the total spectrum of cardiac surgery. In the original EuroSCORE model, only 10% of the patients examined were >75 years of age. It is generally accepted that the EuroSCORE places significant weight on age as a surgical risk factor and, as a result, overestimates mortality in elderly patients undergoing cardiac surgery [11,12]. This small proportion of geriatric patients in the training cohort is a weakness of all current cardiac surgical scoring systems.
The AUC and Hosmer-Lemeshow p value of SAPS that we obtained among the MIMIC II patients with AKI (AUC = 0.6419, Hosmer-Lemeshow p ≈ 0) are consistent with the performance of SAPS in predicting mortality among ICU patients in the US (AUC = 0.67, p = 0.05) [13]. In a UK study, APACHE II, another severity scoring system based on physiologic variables during the first 24 h in the ICU, likewise showed poor calibration (Hosmer-Lemeshow p < 0.001) when used to predict death among patients with AKI [7]. Although SAPS discriminated better among patients in the MIMIC II database presenting with SAH (AUC = 0.84), its calibration in this cohort was very poor (Hosmer-Lemeshow p < 0.001). As expected from previous studies, the EuroSCORE did not discriminate well between survivors and non-survivors among elderly patients who underwent open heart surgery in Dunedin Hospital during the study period (AUC = 0.648). This poor performance of current predictive models when applied to (1) regions different from where the model was built and (2) specific subsets of ICU patients is the main impetus for this research.
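To make the two performance criteria used throughout this discussion concrete, the sketch below computes discrimination (AUC) and a decile-based Hosmer-Lemeshow calibration statistic for a vector of predicted mortality risks. This is a minimal illustration assuming Python with NumPy, SciPy, and scikit-learn; the function, variable names, and simulated data are ours, not the study's evaluation code.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y_true, y_prob, n_bins=10):
    """Decile-based Hosmer-Lemeshow goodness-of-fit statistic and p value."""
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true)[order]
    y_prob = np.asarray(y_prob)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), n_bins):
        n = len(idx)
        observed = y_true[idx].sum()   # observed deaths in this risk decile
        expected = y_prob[idx].sum()   # expected deaths under the model
        p_bar = expected / n
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar) + 1e-12)
    # n_bins - 2 degrees of freedom is conventional for a fitted model
    return stat, chi2.sf(stat, df=n_bins - 2)

# Synthetic example: 1000 patients with predicted risks and simulated outcomes.
rng = np.random.default_rng(0)
risk = rng.uniform(0.05, 0.9, size=1000)
death = rng.binomial(1, risk)          # outcomes consistent with the risks

print("AUC:", roc_auc_score(death, risk))
stat, p = hosmer_lemeshow(death, risk)
print(f"Hosmer-Lemeshow: chi2 = {stat:.2f}, p = {p:.3f}")
```

A high AUC with a small Hosmer-Lemeshow p value corresponds to the SAH result above: the score ranks patients well but its predicted probabilities are miscalibrated.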
Three machine learning algorithms were employed to build the mortality prediction models. Regression has been the methodology of choice in the field of medicine. Most clinicians have some familiarity with the concepts behind regression, and the output is relatively easy to understand from a clinical standpoint. However, regression has a number of limitations, including its major assumptions that the predictors are independent and identically distributed (iid) and that the outcome of interest is a function of some linear combination of the variables. The use of quadratic effects and interaction terms allows more complex, but not necessarily better, hypotheses or models. A Bayesian (or belief) network develops models based on conditional probability distributions. The main concept behind the methodology (is variable A dependent on variable B, is variable B dependent on variable A, or are variables A and B independent, given the probability distribution?) is not difficult to comprehend. As with regression, the output usually makes sense to a clinician. However, conversion of the threshold Bayes factor, which sets the strength of the dependencies between the variables, to the widely accepted p value has not been well established. For this reason, the clinical significance of the arcs, or dependencies, in the network may be more difficult to ascertain. Finally, the artificial neural network used here is a multi-layer perceptron. A perceptron is a linear classifier that defines the hyperplane separating the data points according to the outcome of interest. Like the Bayesian network, it makes no assumption that the features are iid, and complex hypotheses can be explored. However, the output, which consists of nodes and the weights of the variables in each node, is undecipherable to a clinician, making this methodology relatively unpopular in medicine.
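As a concrete illustration of two of the three model families, the sketch below fits a logistic regression and a small multi-layer perceptron on synthetic data and compares their test-set AUCs. The scikit-learn toolchain and the synthetic cohort are our assumptions, not the software reported in this paper; Bayesian-network structure learning is omitted because it requires a dedicated library (e.g., pgmpy).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one patient subset: features ~ physiologic
# variables, y ~ hospital mortality (minority class).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "multi-layer perceptron": MLPClassifier(hidden_layer_sizes=(16,),
                                            max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```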
For all three patient subsets, the AUCs of the locally customized models were significantly higher than those of the gold standards, i.e., SAPS for the ICU patients and EuroSCORE for the cardiac surgery patients. Calibration either improved where the gold standard was poorly calibrated (SAPS) or was preserved where the gold standard was well calibrated (EuroSCORE). For the ICU patients who developed AKI, the artificial neural network had the highest AUC. Logistic regression and the Bayesian network performed best for the ICU patients who presented with SAH. Finally, the Bayesian network and the artificial neural network had the highest AUCs for elderly patients who underwent cardiac surgery. The models predicting mortality among elderly patients who underwent cardiac surgery in Dunedin Hospital performed surprisingly well despite the much smaller set of low-resolution data available as candidate variables.
The gold standard in evidence-based medicine is a well-designed, well-executed multi-center prospective randomized controlled trial. Even when such trials are performed and subsequently published, they rarely, if ever, provide clear evidence on which to base the management of an individual patient. Patient prognostication is no exception. There is an abundance of literature on risk assessment performed prospectively. However, patients enrolled in prospective randomized controlled trials are heterogeneous, and conclusions are valid for the "average" patient. In addition, these trials are executed under very strictly monitored, and thus artificial, conditions, and their findings often do not translate to the real-world ICU. It is difficult to predict whether an individual patient will behave like the "average" patient in a multi-center prospective randomized trial. Hence, day-to-day clinical decisions are still based mostly on personal experience, experiences shared by colleagues, and consideration of reported data where they exist.
Data mining may provide an additional tool for decision support. The main objective of this project was to determine whether predictive models built on patient subsets yield more accurate predictions than the traditional one-model-fits-all approach. As more ICUs switch to paperless systems, local ICU databases become available for building models. Rather than developing models with good external validity by including a heterogeneous patient population from various centers across countries, as has traditionally been done, an alternative approach is to build models for specific patient subsets using one's own local or regional database.