**Preface to "The Convergence of Human and Artificial Intelligence on Clinical Care - Part I"**

Artificial intelligence is gradually becoming a go-to technology in clinical care, from diagnosing a wide range of diseases to predicting outcomes and selecting the best treatment at a personalized level. In the past few years, intelligent systems have contributed to building prediction models and identifying patients at higher risk of certain high-impact conditions such as heart failure, sepsis, and ischemic stroke. In this edited book, key areas that are pushing the forefront of innovation in healthcare are presented by leading experts. The studies showcase the power of technology, its limits, and the value of collaboration, all with a core mission of tackling healthcare's changing landscape using AI. This edited book contains twelve studies, ranging from large trials to pilots, in five main categories: (i) adaptive imputation to increase the density of clinical data for improving downstream modeling; (ii) machine-learning-empowered diagnosis models; (iii) machine learning models for outcome prediction; (iv) innovative use of AI to improve our understanding of the public view; and (v) understanding of the attitude of providers in trusting insights from AI for complex cases.

Overall, the studies used an array of data modalities, including data from electronic health records, imaging data, voice signals, resource utilization, Twitter data, and questionnaire data, in addition to a wide range of modeling frameworks, designs, and algorithms. This edited book is an excellent example of how technology can add value in healthcare settings and hints at some of the pressing challenges in the field, including attitudes towards technology and the quality of the data used in predictive modeling. As we move toward implementing these tools in clinical settings, experts from different fields need to work together to better understand the technological challenges and the needs of care providers and patients, and to ensure that integrating AI into the clinical workflow introduces no unintended consequences.

I believe that our future is in partnering with intelligent systems to solve complex multidimensional problems in many fields, including health care, and shifting from performance-driven outcomes to risk-sensitive model optimization, improved transparency, and better patient representation for more equitable healthcare for all.

> **Vida Abedi** *Editor*

## *Article* **Increasing the Density of Laboratory Measures for Machine Learning Applications**

**Vida Abedi <sup>1,2,</sup>\*, Jiang Li <sup>1</sup>, Manu K. Shivakumar <sup>3</sup>, Venkatesh Avula <sup>1</sup>, Durgesh P. Chaudhary <sup>4</sup>, Matthew J. Shellenberger <sup>5</sup>, Harshit S. Khara <sup>5</sup>, Yanfei Zhang <sup>6</sup>, Ming Ta Michael Lee <sup>6</sup>, Donna M. Wolk <sup>7</sup>, Mohammed Yeasin <sup>8</sup>, Raquel Hontecillas <sup>2,9</sup>, Josep Bassaganya-Riera <sup>2,9</sup> and Ramin Zand <sup>4</sup>**


**Abstract:** Background. The imputation of missingness is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from downstream analysis in translational medicine. The missingness of laboratory values in the EHR is not at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for the EHR is an important step in addressing data imbalance and enhancing the predictive power of modeling tools for healthcare applications. Method. We analyzed the laboratory measures derived from Geisinger's EHR on patients in three distinct cohorts—patients tested for *Clostridioides difficile* (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted Logical Observation Identifiers Names and Codes (LOINC) codes, from which we excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnoses, as well as active problem lists, were also extracted. The adaptive imputation strategy was designed as a hybrid approach: the comorbidity patterns of patients were transformed into latent patterns and then clustered, and imputation was performed on each cluster of patients for each cohort independently to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns. Results. We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for *C. difficile* infection). We extracted 495 LOINC codes as well as 7916 diagnosis codes for the IBD cohort, 8160 diagnosis codes for the Cdiff cohort, and 2042 diagnosis codes for the OA cohort based on the primary/secondary diagnoses and active problem list in the EHR.
Overall, the greatest improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was −35.5 for the Cdiff, −8.3 for the IBD, and −11.3 for the OA dataset. Conclusions. An adaptive imputation strategy designed specifically for the EHR, one that uses complementary information from the clinical profile of the patient, can improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.

**Citation:** Abedi, V.; Li, J.; Shivakumar, M.K.; Avula, V.; Chaudhary, D.P.; Shellenberger, M.J.; Khara, H.S.; Zhang, Y.; Lee, M.T.M.; Wolk, D.M.; et al. Increasing the Density of Laboratory Measures for Machine Learning Applications. *J. Clin. Med.* **2021**, *10*, 103. https://doi.org/10.3390/jcm10010103

Received: 25 November 2020 Accepted: 25 December 2020 Published: 30 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** imputation; electronic health records; machine learning; EHR; laboratory measures; medical informatics; inflammatory bowel disease; *C. difficile* infection; osteoarthritis; complex diseases

#### **1. Introduction**

Given the complexity and high dimensionality of Electronic Health Records (EHR), imputation is inevitable in any study that attempts to use such data for downstream analysis or for building advanced machine learning models to support clinical decision-making. The EHR, like any other administrative dataset, is not designed for research purposes, even though the breadth and depth of its information can be used to improve care at many levels [1]. Furthermore, the missing values in healthcare systems are typically not missing at random. Three main categories explain the missingness in clinical settings [2,3]—incompleteness, inconsistency, and inaccuracy—and these capture a variety of situations, including the following: the patient was cared for outside of the healthcare system where the data are collected, the patient did not seek treatment, the health care provider did not enter the information, the patient expired, or the value was simply not needed.

Given the complexity of clinical data and the advanced analytics that can be applied to such data, it is important to account for any sources of bias in the data that will drive predictive models. Imputation is an example of a data preprocessing step that could lead to biased results. Likewise, excluding variables or patients with a high level of missingness can also introduce bias and reduce the scope of a study. In a recent review article, 85 out of 316 studies reported some form of missing data, and only 12 studies actively handled the missingness; as the authors showed, the majority of researchers exclude incomplete cases, causing biased outcomes [4]. Furthermore, imputation could boost the statistical power for data-poor patients, who tend to be minorities and low-income patients with more restricted access to primary and specialty care and rehabilitation programs.

Imputation has long been studied in many fields, but only recently has the research focused on medical applications. Twelve different imputation techniques applied to laboratory measures from the EHR were compared in [5]. In general, the authors found that Multivariate Imputation by Chained Equations (MICE) and softImpute consistently imputed missing values with low error [5]; however, in that study, the analysis was restricted to the 28 most commonly available variables. In another study, the authors assessed the different causes of missing data in the EHR and identified these causes as a source of unintentional bias [6]. A comparative analysis of three imputation methods for DNA microarrays (a Singular Value Decomposition (SVD)-based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average) showed that, in general, the KNN and SVD methods surpass the commonly accepted solutions of filling missing values with zeros or row averages [7]. However, comparing imputation for clinical data with a DNA microarray can be misleading: the missingness in a DNA microarray is likely at random, due to technical challenges, unlike missingness in the EHR. In another study, fuzzy clustering was integrated with a neural network to enhance the imputation process [8].

Research has also been done to evaluate imputation methods for non-normal data [9]. Using simulated data from a range of non-normal distributions and a 50% level of missingness (missing completely at random or missing at random), it was found that the linearity between variables could be used to determine the need for transformation of non-normal variables. In the case of a linear relationship, transformation can introduce bias, while a nonlinear relationship between variables may require adequate transformation to accurately capture the nonlinearity. Furthermore, many of the techniques are optimized for smaller levels of missingness (the most commonly available measurements), yet most clinical datasets (including EHRs) have a significant level of missingness for many of the important variables that are routinely used for diagnosis. To address this problem, machine learning methods have also been proposed [10]. There are more examples of imputation applied to simulated than to real-life EHR data; however, few studies have focused on imputing laboratory values. For instance, Ford E. and colleagues [11] proposed using logistic regression models with and without Bayesian priors representing the rates of misclassification in the data; in that study, however, the authors focused on misclassified diagnoses rather than laboratory values. The challenges of imputation for EHRs are unique, and if left unaddressed, the utility of the data becomes limited [12]. Consequently, even though, for smaller targeted studies, it could be possible to integrate additional modalities or perform an analytical evaluation through a chart review to determine a likely cause of missingness, for larger studies this becomes infeasible. For example, the missingness level for very important variables, such as hemoglobin A1c (LOINC ID: 17856-6), a common biomarker for diabetes, can easily reach 50% or more in many realistic large datasets. Lastly, in a more recent study, the integration of genetic and clinical information was shown to improve the imputation of data missing from the EHR [13]; however, genetic data integrated with the EHR are still scarce.

Finally, given the complexity and the scale of the problem, MICE [14] remains the method of choice in many studies. The MICE fully conditional specification (FCS) algorithm imputes multivariate missing data on a variable-by-variable basis [15]. An imputation model is specified for each incomplete variable, and the imputation of missingness in one variable is conducted iteratively based on the other variables. Variations of MICE have also been proposed [16]; however, imputation of EHR data poses its own challenges, especially when targeting less commonly measured variables. Nonetheless, given the high level of redundancy and the presence of highly correlated entities in the EHR, imputation by MICE still performs relatively well for large clinical datasets. A comprehensive overview of handling missing data in the EHR is presented in [12].

In this study, we created three unique cohorts from EHR data, with varying sizes and heterogeneity, and developed a hybrid imputation strategy that we applied to these cohorts. We selected the inflammatory bowel disease cohort because of its heterogeneity and the fact that a clear understanding of IBD's risk factors is still lacking. We selected the *Clostridioides difficile* cohort because understanding the recurrent infection is important and the existing EHR data can help us identify clinical biomarkers. Finally, we created the osteoarthritis (OA) cohort to test the limits of this model, as the OA diagnosis is not based on any laboratory measurements known today. Our imputation model was based on using comorbidity information to cluster patients prior to the imputation of their laboratory values.

#### **2. Methods**

In the following section, we will (1) describe our cohort definition and data extraction for the laboratory values and comorbidities from our EHR data warehouse and (2) outline our imputation design.

#### *2.1. Study Cohort*

The cohort in this study consisted of 67,445 patients from the Geisinger Health System with three different phenotypes. This study was exempted by the Geisinger Institutional Review Board for using deidentified information.

*Clostridioides difficile* (Cdiff) infection case and control cohort: *Clostridioides difficile* (*C. difficile*) is an anaerobic, Gram-positive, spore-forming bacterium and a major cause of intestinal infection and antibiotic-associated diarrhea. Toxins are the major virulence factors of *C. difficile* [17]. Toxins A (TcdA) and B (TcdB) are large, secreted glucosyltransferase proteins that target intestinal epithelial cells and disrupt the epithelial barrier, leading to secretory diarrhea. The diagnosis of *C. difficile* at Geisinger is captured and documented by Polymerase Chain Reaction (PCR) confirmation, which is highly sensitive and is also considered the gold standard by the eMERGE algorithm for EHR mining [18]. We identified the *C. difficile* cohort, which includes patients tested for *C. difficile*, from the EHR of the Geisinger Health System. The cohort includes both cases and controls. Cases are defined as having laboratory-positive PCR test results; controls are patients tested for *C. difficile* with negative PCR test results. The case/control ratio is 1:8. We are interested in the combined case and control cohort, since patients tested for *C. difficile*, irrespective of their test results, share some signs and symptoms (such as diarrhea); furthermore, a combined case and control cohort increases our sample size, an important factor for imputation, while providing a framework for building predictive models that can benefit from the integration of a large number of laboratory-based features.

Inflammatory Bowel Disease (IBD) cohort: We identified the IBD cohort from the EHR of the Geisinger Health System. Inclusion criteria were based on the diagnoses recorded for patients during their visits and admissions and on currently active problem-list entries, using the ICD9 and ICD10 codes for Crohn's disease (CD) and ulcerative colitis (UC) (see Table A1 in Appendix A). To achieve higher fidelity regarding the diagnosis in the EHR, qualifying criteria included two or more outpatient encounters, one or more inpatient admissions, or an entry in the problem list with an active flag.

Osteoarthritis (OA) cohort: We identified an osteoarthritis (OA) cohort from the EHR of the Geisinger Health System; the cohort includes patients with a knee or hip OA diagnosis, either primary or secondary (see Table A1 in Appendix A for the OA diagnosis ICD codes).

#### *2.2. Data Extraction*

We extracted clinical laboratory measurements for this cohort using the Logical Observation Identifiers Names and Codes (LOINC) system. For comorbidities, we extracted all the diagnosis codes for all the patients based on the ICD9, as well as ICD10, codes. Comorbidity data included details from out-patient visits, in-patient admissions, and problem lists. The latter was used to capture conditions identified outside of the Geisinger Health System but discussed and assessed during the patient's care management. We excluded laboratory codes with more than 75% missingness. To further clarify, in this study, missingness is defined as the laboratory measure "not resulted". Therefore, if an order was placed but the results were not available (or not valid), we considered that as a missing value. We analyzed the data in three batches, including only laboratory measures that have, at most, (a) 25% missingness, (b) 50% missingness, and (c) 75% missingness.
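The batching step described above can be sketched as follows: a LOINC code is kept in a batch if its fraction of patients without any resulted value is at most the threshold. The LOINC codes and missingness fractions below are hypothetical, not taken from the study data.

```python
# Group LOINC codes into missingness batches (<=25%, <=50%, <=75%),
# as in the data extraction step. A code with more than 75% missingness
# falls into no batch and is excluded from the analysis.

def batch_by_missingness(missing_frac, thresholds=(0.25, 0.50, 0.75)):
    """missing_frac: dict LOINC code -> fraction of patients with no resulted value."""
    return {t: sorted(code for code, f in missing_frac.items() if f <= t)
            for t in thresholds}

# Hypothetical missingness fractions for four laboratory codes
fracs = {"718-7": 0.05, "2501-5": 0.52, "17856-6": 0.60, "rare-code": 0.90}
batches = batch_by_missingness(fracs)
print(batches[0.25])   # ['718-7']
print(batches[0.75])   # ['17856-6', '2501-5', '718-7']
```

Note that the batches are cumulative: every code in the 25% batch also appears in the 50% and 75% batches, mirroring the "at most" thresholds in the text.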

#### *2.3. Data Processing*

Quality Control (QC) and outlier detection strategy: Geisinger has implemented a rigorous process to continuously extract, transform, organize, and store EHR data and remove erroneous entries for research purposes. For example, we currently have access to quality-controlled laboratory values with the reconciliation of units. Median laboratory values for each patient were calculated to be used for this study. It is important to mention that, especially for less common laboratory values, the frequency of measurements and the window between the first and last measurements per patient is relatively narrow. We analyzed the frequency patterns and reported the results in our descriptive section.

As part of the added data processing and outlier detection and removal, the distribution of each laboratory value was analyzed and fit to a tri-modal Gaussian distribution model (see Equation (1)). The rationale for this strategy, as opposed to an assumption of normality, is driven by the nature of the laboratory measures. Laboratory orders, especially those with a higher level of missingness, are typically missing not at random (MNAR), and there are mainly three groups of patients for whom a measurement is recorded (those with higher-than-average, lower-than-average, and average measures). However, the average measurement is not necessarily associated with the largest group in all cases, especially for laboratory measures that are specific to a phenotype, such as iron-binding capacity, which is ordered only when the physician needs that information to make a diagnosis or management decision. Two cut-off values are created to filter outliers based on the three-distribution model. This automated process for generating data-driven cut-off values is proposed for large-scale data mining, where limited manual curation is applied in data preparation and preprocessing.

$$f = \mathcal{N}_1\big(\mu_1, \sigma_1^2\big) + \mathcal{N}_2\big(\mu_2, \sigma_2^2\big) + \mathcal{N}_3\big(\mu_3, \sigma_3^2\big) \tag{1}$$

where $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of the *i*-th component. The lowest boundary to filter out the outliers is set to $c_{low} = \max(\min(\mu_1 - 3\sigma_1, \mu_2 - 3\sigma_2, \mu_3 - 3\sigma_3), 0)$, and the highest boundary is set to $c_{high} = \max(\mu_1 + 3\sigma_1, \mu_2 + 3\sigma_2, \mu_3 + 3\sigma_3)$.
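The cut-off rule can be sketched directly from the boundary formulas, assuming the three component means and standard deviations have already been fitted (e.g., by a three-component mixture-model fit to one laboratory measure); the numeric fits below are hypothetical.

```python
# Outlier boundaries from a fitted tri-modal Gaussian (Equation (1)):
# c_low  = max(min(mu_i - 3*sigma_i), 0)   (lab values cannot be negative)
# c_high = max(mu_i + 3*sigma_i)

def outlier_bounds(params):
    """params: list of three (mu, sigma) tuples. Returns (c_low, c_high)."""
    lows = [mu - 3 * sigma for mu, sigma in params]
    highs = [mu + 3 * sigma for mu, sigma in params]
    c_low = max(min(lows), 0)   # clamp at zero for non-negative lab values
    c_high = max(highs)
    return c_low, c_high

# Hypothetical component fits for one laboratory measure
fits = [(50.0, 5.0), (80.0, 8.0), (120.0, 10.0)]
c_low, c_high = outlier_bounds(fits)
print(c_low, c_high)   # 35.0 150.0
```

Any median laboratory value outside `[c_low, c_high]` would then be treated as an outlier and removed before imputation.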

Data processing of the comorbidity dataset was performed to remove noise by excluding the ICD9/10 codes that were recorded only once in the patient's chart (rule of 2). The resulting matrix was then converted to binary to represent the presence or absence of an ICD9/10 code for each patient. This is important, since the count does not necessarily correlate with the severity or duration of the condition. Therefore, a binary comorbidity matrix for each cohort was created for imputation modeling.
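A minimal sketch of this binarization with the rule of 2 follows; the helper name `binarize_comorbidities` and the toy charts are illustrative, not from the study code.

```python
# Build the binary comorbidity matrix with the "rule of 2":
# ICD codes recorded only once in a patient's chart are treated as noise
# and dropped before binarization, since the count does not necessarily
# correlate with the severity or duration of the condition.

def binarize_comorbidities(counts):
    """counts: dict patient -> {icd_code: times recorded}.
    Returns dict patient -> {icd_code: 0/1}, keeping codes recorded >= 2 times."""
    all_codes = sorted({c for chart in counts.values() for c in chart})
    return {pt: {c: int(chart.get(c, 0) >= 2) for c in all_codes}
            for pt, chart in counts.items()}

charts = {
    "pt1": {"K50.9": 3, "E11.9": 1},   # E11.9 recorded once -> dropped
    "pt2": {"K50.9": 2},
}
matrix = binarize_comorbidities(charts)
print(matrix["pt1"])   # {'E11.9': 0, 'K50.9': 1}
```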

#### *2.4. Data Abstraction and Imputation Strategy*

The comorbidity dataset was used to compute an encoding matrix for each dataset (Cdiff, OA, and IBD) using singular value decomposition (Equation (2)).

$$A_{PT\_ICD\_cohort} = U S V^{T} \tag{2}$$

where $A_{PT\_ICD\_cohort}$ is the matrix encompassing all the ICD9/10 codes (presence or absence) for all the patients in each dataset, $U$ is an $m \times m$ square matrix, $S$ is an $m \times n$ diagonal matrix with $m$ rows and $n$ columns, and $V$ is an $n \times n$ square matrix. The columns of $V$ are eigenvectors of $A^{T}A$, and the columns of $U$ are eigenvectors of $AA^{T}$. The diagonal elements of $S$ are the square roots of the eigenvalues of $A^{T}A$ (equivalently, of $AA^{T}$).

The encoding matrix was then used to create different levels of data abstraction by retaining only the top 100 or 1000 components of the encoding via dimensionality reduction (Equation (3)) for each dataset. We used these predefined cut-off values based on our preliminary assessment [19], as well as empirical studies [20,21]. For comparison, the full rank was also used in the modeling. Note that the approximation matrix is referred to as the data abstraction, and the finalized output is referred to as the latent comorbidities.

$$A_{PT\_ICD\_cohort\_g} = U_{reduced} S_{reduced} V_{reduced}^{T} \tag{3}$$

where *g* is the level of abstraction (100 or 1000) corresponding to the rank of the reduced matrices. $A_{PT\_ICD\_cohort\_g}$ is an approximation of the initial matrix $A_{PT\_ICD\_cohort}$.
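The rank-*g* abstraction of Equations (2) and (3) can be illustrated with NumPy; the toy matrix and *g* = 2 here are stand-ins for the real patient-by-ICD matrix and the paper's *g* = 100 or 1000.

```python
import numpy as np

# Rank-g abstraction of a (toy) binary patient x ICD matrix via SVD.
rng = np.random.default_rng(0)
A = (rng.random((8, 5)) > 0.5).astype(float)    # toy presence/absence matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

g = 2                                            # level of abstraction
A_g = U[:, :g] @ np.diag(s[:g]) @ Vt[:g, :]      # approximation A_PT_ICD_cohort_g

# The approximation error shrinks as g grows; at full rank it is ~0.
err = np.linalg.norm(A - A_g)
print(A_g.shape, round(float(err), 3))
```

At full rank the product reconstructs the original matrix exactly (up to floating-point error), which is why the paper also includes the full-rank setting as a comparison.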

As a final step in the data abstraction process, a baseline noise reduction is performed by removing ICD codes if the sum of all the values for a given code in the latent comorbidity matrix is less than 1. This strategy reduces noise due to irrelevant (very rare) comorbidities in the model. The imputation method presented in this work is a hybrid method—based on concurrently applying dimensionality reduction and a clustering strategy—that efficiently captures relationships among the features (or variables) and reduces noise (through dimensionality reduction) while providing an adaptive mechanism to perform imputation for any complex phenotype or trait. Using latent comorbidity data, patients are clustered using the k-means clustering technique with *K* set to 2, 4, 8, and 16 clusters, depending on the heterogeneity of the cohort.
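The clustering step on latent comorbidity vectors can be sketched with a bare-bones Lloyd's-algorithm k-means; the deterministic initialization and the synthetic latent vectors below are illustrative simplifications, not the study's implementation.

```python
import numpy as np

# Minimal k-means (Lloyd's algorithm) over latent comorbidity vectors.
def kmeans(X, k, iters=20):
    """X: (n_patients, n_latent) array. Returns (labels, centers)."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)   # deterministic init
    centers = X[idx].astype(float).copy()
    for _ in range(iters):
        # distance of every point to every center, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Synthetic latent vectors forming two well-separated groups of patients
rng = np.random.default_rng(1)
latent = np.vstack([rng.normal(0, 0.3, (20, 3)),
                    rng.normal(3, 0.3, (20, 3))])
labels, _ = kmeans(latent, k=2)
print(sorted(set(labels.tolist())))   # [0, 1]
```

Imputation is then run within each cluster separately, so patients with similar comorbidity profiles inform each other's missing laboratory values.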

Imputation was applied using the MICE fully conditional specification (FCS) algorithm [5], which imputes multivariate missing data on a variable-by-variable basis. An imputation model is specified for each incomplete variable, and the imputation of missingness in one variable is conducted iteratively using the Markov Chain Monte Carlo (MCMC) method. More specifically, we selected the predictive mean matching (pmm) algorithm, which is the default method of mice() for imputing continuous incomplete variables. For each missing value, pmm finds a set of observed values (by default, 5) whose predicted means are closest to that of the missing value and imputes the missing value by a random draw from that set. In other words, pmm is restricted to the observed values. We also used Random Forest (rf), which imputes missingness by recursively subdividing the data based on the values of the predictor variables, using bootstrap aggregation of multiple regression trees to reduce the risk of overfitting and improve the predictions by combining the predictions from many trees [22]. The latter does not rely on distributional assumptions and can better accommodate nonlinear relationships and interactions.
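The pmm donor step described above can be sketched as follows; the predicted means are hypothetical stand-ins for the fitted regression model that mice() would supply, and the helper name `pmm_draw` is illustrative.

```python
import random

# Predictive mean matching (pmm) donor step: for a missing entry, take the
# d observed values whose predicted means are closest to the predicted mean
# of the missing case, then draw one of them at random. The imputed value
# is therefore always an actually observed value.

def pmm_draw(pred_missing, preds_observed, values_observed, d=5, seed=0):
    """preds_observed[i] is the predicted mean for observed value values_observed[i]."""
    ranked = sorted(zip(preds_observed, values_observed),
                    key=lambda pv: abs(pv[0] - pred_missing))
    donors = [v for _, v in ranked[:d]]
    return random.Random(seed).choice(donors), donors

# Hypothetical predicted means and observed laboratory values
preds = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
vals  = [1.1, 2.2, 2.9, 4.3, 9.8, 11.7]
imputed, donors = pmm_draw(3.1, preds, vals)
print(donors)   # [2.9, 4.3, 2.2, 1.1, 9.8]
```

Because the draw is restricted to the donor pool, pmm never produces implausible values (e.g., negative laboratory results) the way an unconstrained regression prediction could.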

Imputations using MICE-pmm and MICE-rf were applied to each subgroup independently to predict the missing values. The results were compared with MICE-pmm and MICE-rf applied to estimate the missing laboratory values in the three cohorts without any consideration of the comorbidity information. The reader is referred to the work of S. van Buuren and K. Groothuis-Oudshoorn [15] for more details about imputation by MICE.

#### *2.5. Evaluation Strategy*

Model evaluation is performed by randomly selecting values and predicting them using the hybrid strategy. A total of 100 values from each laboratory measure were randomly withheld for testing. For example, for the Cdiff cohort, where we identified 48 laboratory codes with less than 75% missingness, we held out 100 values for each of the 48 laboratory codes and estimated them 10 times. The root mean square error (RMSE) was calculated and averaged over the 10 runs. The comparison was based on calculating the difference between imputation using the hybrid model and the standard MICE algorithm, without any consideration of the comorbidity information, using both the pmm and rf models implemented in the MICE package. The presented results are, therefore, RMSE differences, where negative values represent a reduction in the root mean square error.
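The hold-out comparison can be sketched as follows, with illustrative numbers in place of real imputations; a negative RMSE difference favors the hybrid model, matching the sign convention in the Results.

```python
import math

# Hold-out evaluation sketch: withhold known laboratory values, impute them
# with each strategy, and compare RMSE of the hybrid model against standard
# MICE without comorbidity information.

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

truth          = [10.0, 12.0, 9.0, 11.0]    # withheld (known) values
imputed_hybrid = [10.5, 11.5, 9.2, 11.3]    # hypothetical hybrid imputations
imputed_mice   = [12.0, 14.0, 8.0, 13.0]    # hypothetical standard MICE imputations

diff = rmse(truth, imputed_hybrid) - rmse(truth, imputed_mice)
print(round(diff, 3))   # negative -> hybrid reduced the error
```

In the study, this difference was averaged over 10 repeated hold-out runs per laboratory code before being reported.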

#### **3. Results**

In the following section, we will (1) describe our cohorts, pattern of missingness, and frequency of available data for different levels of missingness and (2) present imputation results for the three datasets.

#### *3.1. Description of Laboratory Values for the Three Cohorts*

We identified a total of 67,445 patients in three different cohorts (Cdiff, OA, and IBD) from Geisinger's electronic data warehouse. Further, we identified 495 LOINC codes from this cohort. We selected the LOINC codes for which we had, at most, 75% missingness (i.e., the number of patients without any measurement divided by the total number of patients is less than or equal to 75%) in each of the three cohorts.

We identified a total of 46,215 patients tested for *C. difficile*. We extracted comorbidity and laboratory data from the EHR for this cohort. A total of 48 laboratory codes and 8160 ICD codes for comorbidities were used. Specifically, we identified a total of 48 of the laboratory codes from the 495 codes that had at least 25% of the 46,215 patients with at least one measurement in their records. It is important to highlight that many of the LOINC codes can be very specific (<1% of the patients have such measurements) or were used for a narrow period and may not be actively in use. The dimensionality reduction was set to 100 and 1000. The Cdiff cohort had high heterogeneity, since the dataset contained both cases (tested positive for *C. difficile*) and controls (tested negative for *C. difficile*). The number of clusters tested was 4, 8, and 16.

Similarly, we further identified 11,230 IBD patients with both comorbidity and laboratory data from the EHR. A total of 48 laboratory codes and 7916 ICD codes for comorbidities were identified. The dimensionality reduction was set to 100 and 1000. The number of clusters tested was two, four, and eight, given the smaller sample size of this cohort.

Finally, we identified 187,040 patients with a primary or secondary diagnosis of knee or hip OA, from which we randomly selected 10,000 patients for imputation modeling. A total of 44 laboratory codes and 2042 ICD codes for comorbidities were used. The OA cohort had high heterogeneity, since the dataset was large (almost 200,000 cases in the initial pool) and contained both hip and knee OA. We selected a random set of 10,000 patients, as it is impractical to use an extremely large cohort for optimizing an imputation, the optimization alone being a computationally expensive process. The number of clusters tested was 4, 8, and 16.

The distribution of missingness in the laboratory values differed across the cohorts. Table A2 summarizes the percentage missing for the laboratory measures. Our results showed that the pattern and frequency of the laboratory measurements depended on the missingness level. Briefly, for laboratory values with high missingness, a larger percentage of patients (30–60%) had only one resulted value; therefore, the median that we calculated in our experiment was effectively the only reported value for the patient (see Figure 1A). We further observed that the laboratory values with a high level of missingness (when a patient had more than one value) tended to have an observation window of approximately two to six years (see Figure 1B) and a frequency below five measurements (see Figure 1C). However, for more common laboratory values, we observed a window of approximately 5 to 12 years and a frequency above 10 (see Figure 1C).

The outlier detection using a multimodal Gaussian distribution function was applied to each laboratory measure for each cohort separately. Figure 2 highlights that, for laboratory measures with higher missingness levels, the distribution differs across cohorts, and the accepted range is therefore adjusted accordingly. For more common laboratory measures (such as the example presented in Figure 3), the distributions are similar, and the accepted range for these laboratory measures is within the calculated range. To help the reader better understand the pattern of laboratory data, we created distribution plots for all the laboratory values used in this study for the three cohorts (see Figure A1 and Table A2).

#### *3.2. Imputation Applied to Laboratory Values*

*C. difficile* (Cdiff) infection case and control cohort: Using adaptive imputation for the Cdiff cohort showed improved performance, especially for the high missingness group (laboratory measures that have, at most, 75% missingness). The average RMSE difference (comparing the proposed imputation with the standard imputation model, without any consideration of comorbidity information, using MICE) was −31.47 for a level of abstraction *g* = 1000 and a cluster number *k* = 4. The average RMSE difference was −8.75 for *g* = 100 and *k* = 4, demonstrating that, at a high missingness level, additional information from the patient's comorbidities can play an important role in improving the accuracy of the imputation prediction. A total of 27 combinations (nine combinations for each missingness threshold) were tested, and for each missingness level (Table 1), the tradeoff between sample size and the clustering approach resulted in one or two instances where clustering was associated with improved performance. Since the dataset is of fixed size, a higher number of clusters will reduce the power of the imputation method, especially when the number of clusters is increased to eight or beyond. However, as each dataset has its unique characteristics, the best set of parameters must be empirically determined prior to performing the imputation using the adaptive strategy. Using MICE and the random forest model (rf), the RMSE differences were negative for the majority of the combinations. For the <75% missingness group, seven of the nine parameter combinations favored the novel method (see Table 1 and Figure 4).


**Figure 1.** The pattern of missingness for the three cohorts. A generalized additive model was used for smoothing. The gray area around the smoothing curve represents a 95% confidence interval. (**A**) The percentage of patients with one laboratory measurement versus the missingness percentage for the three datasets. (**B**) The average number of years between the first and last laboratory measurements (calculated for patients with two or more measurements) versus the missingness percentage for the three datasets. (**C**) The frequency of the laboratory measurements calculated for patients with two or more measurements versus the missingness percentage for the three datasets. Cdiff: *Clostridioides difficile*, IBD: inflammatory bowel disease, and OA: osteoarthritis.



**Figure 2.** Distribution of laboratory values normalized for Logical Observation Identifiers Names and Codes (LOINC) 2501-5 (iron-binding capacity) for the three datasets (Cdiff in red, IBD in green, and OA in blue). The "iron-binding capacity" is missing at 52% in the Cdiff dataset, 65% in the IBD dataset, and 64% in the OA dataset. The subpanels represent the three modeled distributions to calculate the upper and lower boundaries. The dashed lines represent the upper and lower outlier boundaries (based on Equation (1)).

**Figure 3.** Distribution of laboratory values normalized for LOINC 787-2 (mean corpuscular volume or MCV) for the three datasets (Cdiff in red, IBD in green, and OA in blue). The "MCV" is missing at 2% in the Cdiff dataset, 5% in the IBD dataset, and 4% in the OA dataset. The subpanels represent the three modeled distributions to calculate the upper and lower boundaries. The dashed lines represent the upper and lower outlier boundaries (based on Equation (1)).

**Table 1.** The root mean square error (RMSE) difference between imputation applied with and without the integration of comorbidity information for the three datasets. A negative RMSE difference corresponds to an improvement by the hybrid approach. The predictive mean matching (pmm) and random forest (rf) models in Multivariate Imputation by Chained Equations (MICE) were used in this study. The reader is referred to Tables A3–A5 for more comprehensive results, with *p*-values reported from multiple runs.

***C. difficile* (Cdiff) Infection**

| Cluster Number | Dimensionality Level (g) | MICE-PMM, Missingness < 25% | MICE-PMM, < 50% | MICE-PMM, < 75% | MICE-RF, Missingness < 25% | MICE-RF, < 50% | MICE-RF, < 75% |
|---|---|---|---|---|---|---|---|
| 4 | 100 | −0.77 | 7.12 | −8.76 | 0.35 | −1.47 | −4.92 |
| 4 | 1000 | 7.42 | 6.93 | −31.47 | 2.07 | 0.50 | −12.72 |
| 4 | 8160 | −3.09 | 2.06 | 8.37 | −4.40 | −3.28 | 0.49 |
| 8 | 100 | 0.11 | 9.19 | 12.39 | 1.40 | 11.06 | −16.75 |
| 8 | 1000 | 0.14 | 6.69 | 4.02 | 1.24 | 4.04 | 9.73 |
| 8 | 8160 | 4.63 | 10.09 | 6.99 | −0.88 | −7.32 | −5.11 |
| 16 | 100 | −2.12 | −3.00 | 5.03 | −0.04 | 14.73 | −2.36 |

**Figure 4.** Violin plots of the root mean square error (RMSE) differences comparing the performance of Multivariate Imputation by Chained Equations (MICE) with and without the comorbidity information. Two algorithms, predictive mean matching (pmm) and random forest (rf), were compared. A negative RMSE difference indicates a performance improvement when the comorbidity information is utilized.

Inflammatory Bowel Disease (IBD) cohort: Using adaptive imputation for the IBD cohort showed improved performance, especially for the high-missingness group (laboratory measures that have, at most, 75% missingness). The average RMSE difference compared to the standard model using MICE alone was −8.35 with no abstraction and a cluster number *k* = 2. Similarly, the average RMSE difference was −8.24 for *k* = 8. The results highlighted that, at a high missingness level, additional information from the patient comorbidity data can play an important role in improving the accuracy of the imputation prediction, even though the sample size is significantly smaller (in this case, 11 K versus 46 K for the Cdiff cohort). A total of 27 combinations (or nine combinations for each missingness threshold) were tested. The tradeoff between the sample size and the clustering approach resulted in parameter combinations that were associated with improved performance. Additional analyses were performed with the random forest model in MICE, and an RMSE difference of −2.70 was recorded for a missingness level of 25% (see Table 1 and Figure 4). Our results corroborate


the value of parameter optimization on the dataset using various modeling frameworks. Thus, the best set of parameters should be empirically determined for each dataset.

Osteoarthritis (OA) cohort: Using adaptive imputation for the OA cohort showed that the best performance improvement was for missingness at 50% (Table 1 and Figure 4). The tradeoff between the sample size reduction, when clustering is utilized, and the use of additional information from comorbidities showed benefits even for this smaller and more heterogeneous dataset. The rf model in MICE was the best fit for this dataset.

#### **4. Discussion**

This study is a first step towards improving the many layers of data analytics and quality-control pipelines that enhance the quality of data extracted from the EHR and ingested into machine learning applications for precision medicine. The use of heterogeneous and large-scale clinical datasets, such as EHRs, provides an avenue for exploring strategies to improve care at the individualized level, including developing personalized models of response to therapy and predicting disease onset, among others [1]. However, the data extracted from EHRs are noisy and have many missing values. In the majority of studies, variables suffering from missingness are excluded from models and analyses [4], even variables with high discriminative ability according to clinical knowledge. As we showed in this work, it is not advisable to rely solely on the redundancy of EHR laboratory data to conduct imputation for realistic applications, because the majority of that redundancy is associated with variables that are missing at high levels. For instance, beyond the commonly ordered laboratory tests (20–30 laboratory measures), the remaining values are missing at very high rates, even in a healthcare system with a stable population (Geisinger is an integrated healthcare system with a drop-out rate <5%). However, laboratory measures are highly correlated with comorbidities and diagnoses, as the latter are based on laboratory values in realistic settings. Therefore, our intuitive modeling strategy focuses on using this redundancy to improve the imputation of laboratory values.

Furthermore, many diagnoses are based on laboratory values; however, due to the challenges associated with mining laboratory measures, many models ignore this important parameter or only include the ones that are not missing at high levels to reduce the noise and bias due to poor imputation predictions. We created three diverse datasets to test this intuitive strategy of imputation designed specifically for EHR laboratory data by including information from the comorbidities.

The IBD dataset was used because IBD is a heterogeneous disease and a clear understanding of its risk factors is still lacking. Recent advances in the knowledge of IBD's pathogenesis have implicated a complex interplay between metabolic reprogramming and immunity [23]. Furthermore, the response to treatment in IBD varies significantly among individuals and disease subtypes based on demographic characteristics, diet, comorbidities, underlying immunological factors, and genetic polymorphisms. Thus, there is an urgent unmet need to replace the current imputation approaches with personalized strategies that consider individual variability, diversity, and more balanced patient representation. Building predictive models for treatment outcomes in IBD is therefore an important step in utilizing the available data on drug responses to provide better care for this patient population, and the integration of laboratory measures in a predictive model for IBD has clinical value.

We created the Cdiff dataset because understanding recurrent *C. difficile* infection is important, and the existing data from the EHR can help us identify clinical biomarkers and build a decision support system for physicians to target patients at a higher risk of recurrence for more targeted preventive care.

Finally, the OA dataset was added to test the limits of this model. An OA diagnosis is not based on any laboratory measure known today; it is based on imaging alone. Therefore, we did not expect the OA cohort to have any special patterns in their laboratory profile, yet we observed that, even in this situation, the use of a comorbidity pattern can help improve the imputation of laboratory values. The OA dataset was also the smallest dataset tested in this study.

Overall, our results showed that each dataset is unique, and a one-size-fits-all approach does not apply when selecting the imputation model. On simulated datasets with interactions between variables, the imputation of missing data using MICE with regression trees resulted in less biased parameter estimates than MICE with linear regression [24]. In the CALIBER study, MICE random forest showed greater imputation efficiency, with narrower confidence intervals for the error metric [25]. Through a simulation of a dataset in which the partially observed variable depended on the fully observed variables in a nonlinear way, MICE-RF showed less bias in parameter estimates and better confidence interval coverage. In our study, rf also performed well; however, the best performance was observed when pmm was used in the Cdiff cohort. Nonetheless, because the RMSEs were calculated across all laboratory variables, the improvement may be driven by a few variables that were imputed better in some, but not all, cases. Further analysis will be needed to address this assumption.

The method presented here is an intuitive approach for any given complex disease where biosignatures or risk factors are only partially known and the relationships among the variables can be convoluted given the large dimensionality of the dataset. Even though the level of missingness can vary, the best results are typically obtained when the level of missingness is low or moderate; however, improvement over conventional methods that do not consider comorbidity information can still be achieved when the missingness level is high. Our strategy was to ensure that (1) our experiment aligned with the current methodologies in practice and (2) others can easily adapt this modification to their work. In future directions, we will explore whether advanced modeling frameworks, such as generative adversarial networks (GANs) [26] or the newly proposed generative adversarial imputation nets (GAIN) framework [27], can be optimized for imputing laboratory values from EHRs.

Finally, our study provides a step in what we believe is a pipeline of data quality improvements for empowering machine learning models using EHRs. The main limitation of this approach is the need for large datasets: by its nature, the clustering step reduces the sample size available for the imputation, thus reducing its power. Therefore, this approach is ideal for machine learning applications where the sample size tends to be large and comprehensive. Our smallest cohort consisted of 10,000 OA patients, and our best prediction improvement was observed for the largest dataset of 46,215 patients. Another limitation that we could not address concerns our masking strategy for the evaluation, which was done at random, even though we knew that the missingness in the EHR was not at random. However, because we did not know *a priori* the reason for missingness for each patient, and given the complex nature of the data, masking at random was the most sensible strategy in this case. As of now, we do not have a better strategy for withholding values that simulates missingness not at random (MNAR); the contributing factors to MNAR are multifactorial and largely unknown.
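The masking-at-random evaluation described above can be sketched as a small harness: withhold a fraction of the observed entries, impute, and score only the withheld positions against the ground truth. The sketch below is illustrative; the lab vector is synthetic, and a simple mean imputer stands in for the MICE models used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_at_random(x, frac, rng):
    """Return a copy of x with a random fraction of its observed entries
    set to NaN (MCAR masking), plus the boolean mask of withheld positions."""
    out = x.astype(float).copy()
    observed = np.flatnonzero(~np.isnan(out))
    chosen = rng.choice(observed, size=int(frac * observed.size), replace=False)
    mask = np.zeros(out.size, dtype=bool)
    mask[chosen] = True
    out[chosen] = np.nan
    return out, mask

# Hypothetical lab vector with ~30% values already missing.
truth = rng.normal(10, 2, 500)
lab = truth.copy()
lab[rng.random(500) < 0.3] = np.nan

# Withhold 10% of the observed values, impute, and score the withheld entries.
masked, withheld = mask_at_random(lab, frac=0.1, rng=rng)
imputed = np.where(np.isnan(masked), np.nanmean(masked), masked)
rmse = np.sqrt(np.mean((imputed[withheld] - truth[withheld]) ** 2))
```

Running the harness repeatedly (e.g., 10 runs, as in Tables A3–A5) yields a distribution of RMSE differences from which *p*-values can be reported.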

This study had several other limitations. First, by converting the comorbidity information into binary form, we may have lost important information. This study design can be enhanced further to answer a specific research question by optimizing the pattern of ICD codes recoded (both the frequency and time intervals) to capture the duration and severity of the conditions. Second, we withheld a relatively small number of values to evaluate our model. This is because we included laboratory codes with as high as 75% missingness and applied clustering prior to imputing; thus, withholding a higher proportion of laboratory values may further increase the sparsity of the dataset and introduce further bias. As a future direction, we plan on applying the algorithm several times to random subsamples of the data of size n/2 (n = number of samples). This repeated double randomization, similar to the concept of bagging and sub-bagging [28,29] algorithms, could further help optimize our strategy. Third, we did not limit the window with respect to the diagnosis index event, as would be done in a carefully designed study [30,31]. The identification of pre- and post-index windows should be thoroughly planned based on the research question, the sparsity of the data, the healthcare system, and the variables under consideration [30]. However, as this is a proof-of-concept study, we did not limit our observation window, in order to improve data availability so that we could experiment with different levels of missingness. Even though this is a limitation of the study, in many instances there were only a few laboratory values for each patient for the less commonly used laboratory codes. Fourth, as this was a pilot study, we wanted to corroborate the generalizability and scalability of the proposed strategy. Therefore, we did not exhaustively vary the abstraction level or the size of the clusters; however, we applied the model on three different cohorts that were created specifically for this study. Finally, by combining the laboratory codes into three groups (<25% missing, <50% missing, and <75% missing), we were unable to determine if the improvement was due to one or a few laboratory variables. Further assessments will be needed to study the improvement of imputation for each laboratory measure on a case-by-case basis for more targeted evaluations and improvements.

To conclude, the advantages of imputing missingness are manifold: imputation can increase data density and improve the representation of data-poor patients, thus reducing implicit algorithmic bias. Patients with limited access to healthcare and specialty care may be underrepresented in models because their data footprint is smaller. The inclusion of more laboratory values is important, as a diagnosis prediction that is not at least partially based on laboratory information could be weak. Predicting a future disease by focusing only on past diagnoses (i.e., using only information based on the ICD codes) does not take full advantage of the information in electronic health records. Laboratory measurements, similar to imaging and imaging reports, are at the core of diagnosis and care management. The novelty of this study is in its intuitive design and relatively simple implementation, incorporating information from a patient's comorbidities to improve the imputation of laboratory values.

As a future direction, we will investigate how best to impute longitudinal laboratory measures to better inform clinical studies. We will also explore integrating additional features, such as demographic information, age, gender, and medication usage, as well as genetic information when available, to further enhance the imputation outcome. Finally, we will evaluate various preprocessing and normalization strategies to determine whether these manipulations can improve our predictions, especially for variables with skewed distributions, and we will explore the impact of imputation on each laboratory value to identify any patterns or trends that can help improve the prediction of missing values. In summary, we optimized the level of abstraction needed to improve the imputation for three cohorts of varying sizes and complexities. This study demonstrates that the use of shared latent comorbidities can facilitate improvements in imputing laboratory measures from EHRs for downstream analysis and predictive modeling.

**Author Contributions:** Conceptualization, V.A. (Vida Abedi) and R.Z.; methodology, V.A. (Vida Abedi), J.L., M.K.S., V.A. (Venkatesh Avula); software, J.L., M.K.S., V.A. (Venkatesh Avula); validation, D.P.C., M.J.S., H.S.K., M.T.M.L., D.M.W., R.H., J.B.-R.; formal analysis, V.A. (Vida Abedi), J.L., V.A. (Venkatesh Avula); investigation, V.A. (Vida Abedi), J.L., R.Z.; resources, V.A. (Vida Abedi), R.Z.; data curation, D.P.C., Y.Z., J.L.; writing—original draft preparation, V.A. (Vida Abedi); writing—review and editing, V.A. (Vida Abedi), J.L., M.Y., R.Z.; visualization, V.A. (Venkatesh Avula); supervision, V.A. (Vida Abedi), R.Z.; project administration, V.A. (Vida Abedi); funding acquisition, V.A. (Vida Abedi), R.Z., R.H., J.B.-R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was sponsored by funds from the Defense Threat Reduction Agency (DTRA) grant No. HDTRA1-18-1-0008 to J.B.-R., R.H. and V.A. (Vida Abedi) and funds from the National Institute of Health (NIH) grant No. R56HL116832 to V.A. (Vida Abedi), as well as funds from Geisinger Health Plan Quality to R.Z.

**Institutional Review Board Statement:** The study was reviewed and approved by the Geisinger Institutional Review Board as "non-human subject research" for using de-identified information.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data analyzed in this study are not publicly available due to privacy and security concerns. The data may be shared with a third party upon execution of a data sharing agreement for reasonable requests; such requests should be addressed to V.A. (Vida Abedi) or R.Z.

**Acknowledgments:** The authors would like to thank the Phenomic Analytics and Clinical Data Core at Geisinger—more specifically, Joseph B. Leader, Monika Ahuja, and Amy Kolinovsky—for helping with data extraction and deidentification from the Electronic Health Records. Special thanks to Alvaro E. Ulloa Cerna for the insightful discussion.

**Conflicts of Interest:** Authors J.B.-R. and R.H. were employed by BioTherapeutics, Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the study design, data collection and interpretation, or the decision to submit the work for publication.

#### **Appendix A**

**Figure A1.** Distribution of the laboratory values normalized for all the LOINC included in this study.



**Table A1.** Diagnosis codes used for inflammatory bowel disease and osteoarthritis.


**Table A2.** Various summary statistics for the laboratory variables included in this study. Empty cells represent a percentage missing higher than 75%.






**Table A3.** The RMSE difference between imputation applied with and without the integration of comorbidity information for the Cdiff dataset. Negative RMSE differences correspond to improvements by the hybrid approach. The pmm and rf models in MICE were used in this study. The *p*-value is reported based on 10 runs.

**Table A4.** The RMSE difference between imputation applied with and without the integration of comorbidity information for the IBD dataset. Negative RMSE differences correspond to improvements by the hybrid approach. The pmm and rf models in MICE were used in this study. The *p*-value is reported based on 10 runs.




**Table A5.** The RMSE difference between imputation applied with and without the integration of comorbidity information for the OA dataset. Negative RMSE differences correspond to improvements by the hybrid approach. The pmm and rf models in MICE were used in this study. The *p*-value is reported based on 10 runs.


#### **References**


*Article*

## **Convolutional Neural Network Classifies Pathological Voice Change in Laryngeal Cancer with High Accuracy**

**HyunBum Kim <sup>1,†</sup>, Juhyeong Jeon <sup>2,†</sup>, Yeon Jae Han <sup>3</sup>, YoungHoon Joo <sup>1</sup>, Jonghwan Lee <sup>2</sup>, Seungchul Lee <sup>2,4,</sup>\* and Sun Im <sup>3,</sup>\***


Received: 5 September 2020; Accepted: 22 October 2020; Published: 25 October 2020

**Abstract:** Voice changes may be the earliest signs of laryngeal cancer. We investigated whether automated voice signal analysis can be used to distinguish patients with laryngeal cancer from healthy subjects. We extracted features using the software package for speech analysis in phonetics (PRAAT) and calculated the Mel-frequency cepstral coefficients (MFCCs) from voice samples of a vowel sound of /a:/. The proposed method was tested with six algorithms: support vector machine (SVM), extreme gradient boosting (XGBoost), light gradient boosted machine (LGBM), artificial neural network (ANN), one-dimensional convolutional neural network (1D-CNN), and two-dimensional convolutional neural network (2D-CNN). Their performances were evaluated in terms of accuracy, sensitivity, and specificity, and the results were compared with human performance. A total of four volunteers, two of whom were trained laryngologists, rated the same files. The 1D-CNN showed the highest accuracy of 85%, with sensitivity and specificity levels of 78% and 93%, respectively. The two laryngologists achieved an accuracy of 69.9% but sensitivity levels of 44%. Automated analysis of voice signals could differentiate subjects with laryngeal cancer from healthy subjects with higher diagnostic performance than the four volunteers.

**Keywords:** voice change; larynx cancer; machine learning; deep learning; voice pathology classification

#### **1. Introduction**

Laryngeal cancer is one of the most debilitating forms of malignancy, with an average incidence of 3.3 per 100,000 from 2012 to 2016 in the USA [1]. In 2019, there were 12,370 new cases diagnosed in the USA alone. Despite the rising incidence, early diagnosis remains challenging, resulting in delayed treatment [2,3]. With a delay of diagnosis, laryngeal cancer may lead to the most severe debilitating disabilities in phonation, swallowing [4] and overall quality of life. An automated voice analysis tool could advance the time of diagnosis regardless of patients' location, in line with the idea of telemedicine. Though voice changes can indicate the first clinical signs of disease, subjective perception of early voice changes can be listener dependent and subject to intrajudge variations [5].

Image analysis based on the use of computational algorithms in radiology is now expanding to signal processing in other fields, such as electrodiagnosis [6]. Furthermore, the popularity of these new techniques has led to the automated detection of pathological voices using machine and deep learning algorithms. Successful voice pathology detection using a deep learning model has been reported [7]. Algorithms based on feature extraction, such as the Mel-frequency cepstral coefficients (MFCCs) from acoustic signals, have been used for many years to detect vocal fold disorders and dysphonia [8–10]. For example, Chuang et al. [9] used normalized MFCC features and showed that a deep neural network (DNN) can detect abnormal voice changes in voice disorders. Another study by Fang et al. [10], which included laryngeal cancer data, reported that the results of a DNN were superior to those of other machine learning algorithms.

However, in past studies, the number of cancer patients was either too small [10,11], or the cancer patients were assessed as a single group together with other voice disorders. Most recent studies that investigated the role of automatic detection of voice disorders [8–10] were based on open voice databases, such as the Massachusetts Eye and Ear Infirmary Database [12] or the Saarbrucken Voice Database [13], in which laryngeal cancer voices were rated as one group in combination with other voice disorders. In addition, past algorithms have not been validated against clinicians' judgement of voice change, and subjective perception of early voice changes can be difficult [5]. The possibility of an algorithm that can distinguish pathological voice changes at the early stages of laryngeal cancer from normal healthy voices, with the potential to overcome the limitations imposed by inter-subject human perception, remains to be explored.

Therefore, this study aims to investigate the role of computational algorithms including a support vector machine (SVM), extreme gradient boosting (XGBoost) and the recent popular convolutional neural network (CNN) in distinguishing voice signals of patients with laryngeal cancer against those obtained from healthy subjects. We also compared the performance levels of these algorithms to those obtained by four human raters who rated the same voice files.

#### **2. Materials and Methods**

#### *2.1. Study Subjects*

A retrospective review of medical records was performed at a single university center from July 2015 to June 2019. We identified patients who had undergone voice assessments at the time of laryngeal cancer diagnosis. Only the preoperative records were collected, whereas those obtained postoperatively or after radiotherapy were excluded.

Normal voice samples were acquired from otherwise healthy subjects who had undergone voice assessments for the evaluation of their vocal cords prior to general anesthesia for surgical procedures involving sites other than the head and neck region, such as the hands or legs. Subjects subsequently diagnosed with any benign laryngeal disease, sulcus vocalis, or one-sided vocal palsy were excluded from the data analysis of the healthy subjects. Any additional diagnosis of a voice disorder was excluded by a detailed review of the patients' medical records.

Patients' demographic information, including gender, age, and smoking history, were collected. In those diagnosed with laryngeal cancer, additional clinical information such as the TNM (Tumor Node Metastases Classification of Malignant Tumors) stage, a global standard for classifying the anatomical extent of tumor cancers, was recorded. The study protocols were approved by our institutional review board [HC19RES10098].

#### *2.2. Datasets from Voice Files*

The dataset comprises recordings of normal subjects and cancer patients. Voice samples were recorded with a Kay Computer Speech Lab (CSL) (Model 4150B; KayPENTAX, Lincoln Park, NJ, USA) supported by a personal computer, including a Shure-Prolog SM48 microphone with a Shure digital amplifier, located at a distance of 10–15 cm from the mouth and at an angle of 90°. Background noise was controlled below 45 dB. Analysis of a voice sample, directly recorded using digital technology with a sampling frequency of 50,000 Hz, was carried out using MDVP 515 software (version 2.3). Patients phonated the vowel sound /a:/ for over 4 s at a comfortable level of loudness (about 55–65 dB). The operator's experience dates back to 2011, and the voice testing protocol in the hospital was established in 2015.

#### *2.3. Experimental Setups*

The study used an NVIDIA GeForce RTX 2080 Ti (11 GB) graphics card. We examined normal versus cancer voice signal classification and tested the performance of the SVM, XGBoost, light gradient boosted machine (LightGBM), artificial neural network (ANN), one-dimensional convolutional neural network (1D-CNN), and 2D-CNN. The performance was evaluated via five-fold cross-validation, with accuracy, sensitivity, specificity, and area under the curve (AUC) values used as performance metrics. This study strictly used male voice samples to exclude gender effects, given that all laryngeal cancer cases were male and that male and female voices have different frequency ranges; otherwise, factors not directly related to a cancerous voice could undermine the integrity of the study design.
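The sensitivity and specificity used throughout are straightforward counting metrics over the binary cancer-vs-normal labels. A minimal sketch (the function name and the toy labels below are illustrative, not from the study):

```python
import numpy as np

def sens_spec(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)
    for binary labels, where 1 = cancer and 0 = normal."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```

In the five-fold setup, these counts are accumulated per held-out fold and the metrics averaged across folds.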

#### *2.4. Feature Extraction*

In this study, two common features in speech analysis were selected. First, PRAAT, named after the Dutch word for "talk", is a speech analysis program in phonetics (Paul Boersma and David Weenink, Institute for Phonetic Sciences, University of Amsterdam, The Netherlands) designed to extract key features in the voice [14]. The raw voice input was 4 s of a 50,000 Hz signal. The PRAAT features were extracted between a minimum value of 10 Hz and a maximum value of 8000 Hz to account for the spectral range of a human voice. The fourteen audio features include the mean and standard deviation of the fundamental frequency, the harmonic-to-noise ratio (HNR), and the jitter and shimmer variants. The HNR denotes the degree of acoustic periodicity in terms of energy. The jitter and shimmer variants are measures of perturbation in acoustic analysis: jitter captures the frequency instability, whereas shimmer captures the amplitude instability of a signal. In other words, jitter refers to the frequency variation from cycle to cycle, and shimmer represents the amplitude variation of the sound wave. The jitter variants are defined as follows [14]: localJitter is the average absolute difference between consecutive intervals, divided by the average interval, whereas localAbsoluteJitter uses the unnormalized absolute difference; rapJitter is the relative average perturbation, comparing each interval with the average of itself and its two adjacent neighbors, and ppq5Jitter accounts for four neighbors; ddpJitter is based on the differences between consecutive differences of the intervals. In a similar fashion, six shimmer variants were defined: localShimmer, localDbShimmer, apq3Shimmer, apq5Shimmer, apq11Shimmer, and ddaShimmer.
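The jitter definitions quoted above translate almost directly into code. The sketch below computes four of the variants from a sequence of glottal cycle durations; it follows the verbal definitions rather than PRAAT's exact implementation, and the synthetic 100 Hz example is an assumption.

```python
import numpy as np

def jitter_features(periods):
    """Cycle-to-cycle jitter variants from glottal cycle durations (seconds),
    following the PRAAT definitions quoted in the text (illustrative sketch)."""
    p = np.asarray(periods, dtype=float)
    mean_p = p.mean()
    # localAbsoluteJitter: mean absolute difference of consecutive intervals.
    local_absolute = np.abs(np.diff(p)).mean()
    # localJitter: the same, divided by the average interval.
    local = local_absolute / mean_p
    # rapJitter: deviation of each interval from its 3-point local mean.
    rap = np.mean([abs(p[i] - p[i - 1:i + 2].mean())
                   for i in range(1, len(p) - 1)]) / mean_p
    # ppq5Jitter: deviation from the 5-point local mean (four neighbors).
    ppq5 = np.mean([abs(p[i] - p[i - 2:i + 3].mean())
                    for i in range(2, len(p) - 2)]) / mean_p
    return {"localJitter": local, "localAbsoluteJitter": local_absolute,
            "rapJitter": rap, "ppq5Jitter": ppq5}

# A steady ~100 Hz voice (10 ms cycles) with small random period perturbation.
rng = np.random.default_rng(0)
feats = jitter_features(0.010 + rng.normal(0, 1e-4, 200))
```

A perfectly periodic signal yields zero for all variants, while increasing cycle-to-cycle perturbation raises them, which is what makes these features useful for pathological voices.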

Second, MFCCs, a set of numerical values resulting from a transformation of the time-series data, were obtained [15]. The MFCCs are based on a short-time Fourier transform with additional consideration of the distinct nature of the human voice in the lower frequency range, implemented through a biased bandpass filter design; forty triangular bandpass filters were used. Initially, we downsampled the input signal to 16,000 Hz, accounting for the Nyquist frequency of the human voice range. As a result of the transformation, the 200,000 data points of the input signal were converted to 64,000 points and then into a 40 × 126 time-spectral image. The graphic presentation of downsampling, normalization, and the MFCCs transformation is shown in Figures 1 and 2. In addition, a short-time Fourier transform (STFT), another common time-spectral representation, was obtained for comparative evaluation. For this conversion, we downsampled the input signal to 4000 Hz and processed it with a frame size of 0.02 s without overlap, to match the height of the MFCCs image; this produced a 40 × 199 time-spectral image.
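The 40-filter triangular mel bank that gives the MFCCs their low-frequency bias can be sketched as follows. The framing, FFT, log, and DCT steps of the full MFCC pipeline are omitted, and the FFT size of 512 is an assumption (the paper does not state it):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters equally spaced on the mel scale, so they are
    dense at low frequencies (where the voice is informative) and
    sparse at high frequencies."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()   # shape (40, 257): 40 filters over 257 FFT bins
```

Multiplying a power spectrogram frame by `fb.T` yields the 40 mel energies whose log and discrete cosine transform give one MFCC column.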

**Figure 1.** The graphic presentation of the transformation from raw signal into a Mel-frequency cepstral coefficients (MFCCs) image, a necessary process to comply with the two-dimensional convolutional neural network input shape. (**a**) Plot of signals downsampled to 16,000 Hz; (**b**) plot of signals normalized between −1 and 1; (**c**) image of signals after MFCCs transformation.

**Figure 2.** The flowchart of the Mel-frequency cepstral coefficients (MFCCs) transformation (**a**) and presentation of the Mel filter banks (**b**). The triangular filter banks are densely located towards the low frequency range, reflecting the distinctive nature of the human voice in that range. Abbreviations: FFT, fast Fourier transform; DFT, discrete Fourier transform.

#### *2.5. Preprocessing*

A series of preprocessing steps was introduced for an effective representation of the signal. The recordings represent continuous sounds. Normalization was performed to bring the numerical voice signals to a common scale, because the magnitudes vary with the recording distance. Each signal was divided by the maximum absolute value of the recording per patient, while taking peak outliers into account.
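The per-patient peak normalization can be sketched as below. The optional percentile-based peak is one plausible reading of "taking account of the peak outliers"; the exact outlier handling is not specified in the text:

```python
import numpy as np

def normalize(signal, clip_percentile=None):
    """Scale a recording into [-1, 1] by its maximum absolute value.
    If clip_percentile is given, a high percentile of |signal| is used
    as the peak instead, so a single outlier spike does not dominate
    the scale (an assumption; the paper does not detail this step)."""
    signal = np.asarray(signal, dtype=float)
    peak = (np.abs(signal).max() if clip_percentile is None
            else np.percentile(np.abs(signal), clip_percentile))
    return np.clip(signal / peak, -1.0, 1.0)

x = np.array([0.2, -3.0, 1.5, 0.1])
y = normalize(x)        # divided by the peak magnitude 3.0
```

After normalization every recording spans the same [−1, 1] range regardless of how far from the microphone it was captured.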

For accurate validation, the dataset was divided into two parts in each validation: one for training and another for testing. Performance metrics were calculated only on the testing dataset, i.e., the signals that the model did not see during training. A five-fold cross-validation method was used for reliable results, dividing the dataset into five subsets. In each validation fold, four subsets were used to train a model for appropriate representation and generalization power, and the model was validated on the remaining subset. Overall, five validations were conducted, and the reported performance metrics represent the average of all results. The process is illustrated in Figure 3.


**Figure 3.** Illustration of five-fold cross-validation. A given dataset is split into five subsets, where each fold is used once as a testing set; this is a useful method to use all of the data when data are limited.
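The five-fold procedure can be sketched with scikit-learn, which the study also uses for its ROC analysis. The data below are a synthetic stand-in for the 95-subject, 14-feature PRAAT table (45 normal, 50 cancer); the separable class shift is added only so the toy model has something to learn:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(95, 14))          # synthetic 14-feature table
y = np.array([0] * 45 + [1] * 50)      # 45 normal, 50 cancer
X[y == 1] += 1.0                       # make the toy classes separable

scores = []
for train_idx, test_idx in StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = SVC().fit(X[train_idx], y[train_idx])
    # Metrics are computed only on the held-out fold.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_accuracy = float(np.mean(scores))  # averaged over the five folds
```

Stratified splitting keeps the 45/50 class ratio roughly constant across folds, which matters when classes are nearly balanced but small.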

#### *2.6. Machine Learning Algorithms*

We tested four machine learning algorithms: SVM, XGBoost, LightGBM, and ANN. The SVM is one of the most frequently used methods for classification tasks. The SVM resolves the classification task by drawing a decision-boundary hyperplane that divides the space with the maximum margin from each class. However, not all cases can be resolved this way, as clusters often require a non-linear boundary; the kernel trick accommodates these cases by warping the feature space. In machine learning, a hyper-parameter is a high-level configuration value, chosen empirically before training, that controls the complexity, conservativeness, and overfitting of a model [16]. The governing decision equation and its classification rule are shown in the equations below, where ω<sub>0</sub> and ω represent the bias and the weights of the boundary.

$$\omega_0 + \boldsymbol{\omega}^T \mathbf{x} > 0 \ \rightarrow \ \mathbf{x} \ \text{belongs to normal}\tag{1}$$

$$\omega_0 + \boldsymbol{\omega}^T \mathbf{x} < 0 \ \rightarrow \ \mathbf{x} \ \text{belongs to pathological}\tag{2}$$
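The decision rule in Equations (1) and (2) reduces to checking the sign of ω<sub>0</sub> + ω·x. A minimal sketch with illustrative (not learned) bias and weights:

```python
import numpy as np

# Sign of w0 + w.x assigns x to the normal (>0) or pathological (<0)
# side of the learned hyperplane; the values here are illustrative.
w0 = -1.0                    # bias
w = np.array([2.0, 0.5])     # boundary weights

def classify(x):
    return "normal" if w0 + w @ x > 0 else "pathological"
```

In a trained SVM, `w0` and `w` are chosen to maximize the margin between the two classes (or, with a kernel, the rule is evaluated in the warped feature space).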

LightGBM and XGBoost are classifiers from the decision-tree family and are known to perform best in many applications. A decision tree is named after its shape, comprising a series of dividing rules. The model learns the optimal rules based on information gain and entropy: information gain quantifies the information produced by a given split, and entropy is a relative degree of disorder [17]. Since a single decision tree can easily overfit, a series of techniques is implemented to boost performance, such as bagging, boosting, tree pruning, and parallel processing. These techniques effectively combine predictions from multiple trees and multiple sequential models.
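Entropy and the information gain of one splitting rule can be computed directly from the label counts, a minimal sketch of the quantities the tree optimizes:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label set: the degree of disorder."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]                              # maximally mixed node
gain = information_gain(parent, [0, 0], [1, 1])    # a perfect split
```

A maximally mixed binary node has entropy 1.0 bit, and a split that perfectly separates the classes recovers all of it as information gain.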

An ANN is a basic form of a DNN, constituted by a series of fully connected layers. The model predicts the label of the input data with trained weights and biases through forward propagation. We consider this ANN a machine learning model, since its inputs are hand-crafted features and the network performs the classification task only.

#### *2.7. Deep Learning Algorithms*

The human voice exhibits distinct characteristics in the lower frequency range, so the biased filters are used in the MFCCs. Although a recent study has shown that MFCCs behave as consistent metrics [18], an inevitable information loss occurs at the conversion. Ten pieces of size 40 × 40 were randomly cropped from each image to lower the computational cost and to elicit a data-augmentation effect. In a similar fashion, ten 40 × 40 segments were prepared from the STFT spectrogram of each signal.
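The random 40 × 40 cropping can be sketched as below; the crop positions along the time axis are random, while the full 40-row frequency axis is kept (the sampling scheme is an assumption beyond the stated crop count and size):

```python
import numpy as np

def random_crops(spectrogram, n_crops=10, size=40, seed=0):
    """Randomly crop square time segments from a time-spectral image
    (e.g. a 40 x 126 MFCC image) for augmentation and lower compute."""
    rng = np.random.default_rng(seed)
    n_mels, n_frames = spectrogram.shape
    starts = rng.integers(0, n_frames - size + 1, size=n_crops)
    return np.stack([spectrogram[:, s:s + size] for s in starts])

mfcc_image = np.random.rand(40, 126)   # stand-in MFCC image
crops = random_crops(mfcc_image)       # ten 40 x 40 training pieces
```

Each source image thus contributes ten training examples, which multiplies the effective dataset size tenfold, useful given the small cohort.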

Zero padding and downsampling were implemented for the 1D-CNN. Zero padding ensures a stable frequency conversion and provides better resolution. Further, a recent study showed that the most contributive bands in both detection and classification ranged between 1000 and 8000 Hz [7]. Downsampling was set at 22,050 Hz for the 1D-CNN and 16,000 Hz for the 2D-CNN preprocessing.

The 1D-CNN structure is composed of six convolution blocks and three fully connected layers (Figure 4). The numbers of kernels are 16, 32, 64, 128, and 256; a kernel is equivalent to a filter, so, for example, the first layer outputs the filtered signals from 16 kernels. Max-pooling sizes of 8, 8, 8, 8, and 4 are used to compress the long signal; max pooling selects the maximum value within a given window to progressively reduce the spatial size and to provide an abstract representation. The dense layers are composed of 1536, 100, 50, and 2 nodes. Batch normalization and ReLU activation are used for faster and more stable training [19]. The detailed structure of the algorithm is shown in Figure 4.
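The paper does not name its deep learning framework, so the following PyTorch sketch is one plausible reading of the described architecture, not the authors' implementation. The text lists five kernel counts (16, 32, 64, 128, 256) alongside six blocks, so five blocks are built here; the convolution kernel size of 9 and the adaptive pooling to 6 time steps (to match the quoted 1536-node flatten, 256 × 6) are assumptions:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool):
    # Conv -> BatchNorm -> ReLU -> MaxPool, as described in the text.
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=9, padding=4),
        nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(pool))

class CNN1D(nn.Module):
    def __init__(self):
        super().__init__()
        channels = [1, 16, 32, 64, 128, 256]   # kernel counts from the text
        pools = [8, 8, 8, 8, 4]                # pooling sizes from the text
        self.features = nn.Sequential(*[
            conv_block(channels[i], channels[i + 1], pools[i])
            for i in range(5)])
        # Pooling to 6 time steps gives the quoted 256 * 6 = 1536
        # flattened size for the first dense layer (an assumption).
        self.pool = nn.AdaptiveMaxPool1d(6)
        self.classifier = nn.Sequential(
            nn.Linear(1536, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 2))

    def forward(self, x):        # x: (batch, 1, 88200) = 4 s at 22,050 Hz
        z = self.pool(self.features(x)).flatten(1)
        return self.classifier(z)

out = CNN1D()(torch.randn(2, 1, 88200))   # two-class logits per signal
```

The stacked pooling (8 · 8 · 8 · 8 · 4) collapses the 88,200-sample signal to a handful of abstract time steps before classification.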

**Figure 4.** Illustration of one-dimensional convolutional neural network model structure.

The 2D-CNN structure is composed of three convolution blocks and three fully connected layers. The number of kernels is 64, 64, and 128. The dense layer is composed of 500, 50, and 2 nodes. A dropout of 0.3 is used twice at a dense layer to prevent overfitting. A Glorot uniform initializer and ReLU activations are used [20]. Maximum pooling is done conservatively, only once, because the input image is already small. The detailed structure of the algorithm is shown in Figure 5.

**Figure 5.** Illustration of two-dimensional convolutional neural network model structure.

#### *2.8. Human Interpretation*

Two laryngologists with 3–10 years of experience in laryngoscopy and laryngeal cancer were asked to listen to the same files and classify the voice sounds as either normal or abnormal. In addition, two volunteers with no medical background were asked to perform the same tasks. All volunteers were informed that abnormal voices are from laryngeal cancer patients, prior to the evaluation. No prior demographic information was provided. All volunteers were allowed to hear the voice files multiple times. The diagnostic parameters obtained from the four volunteers were calculated.

#### *2.9. Statistical Analysis*

Data are expressed as mean ± standard deviation for continuous data and as counts (%) for categorical data. Bivariate analyses were conducted using a two-tailed Student's *t*-test for continuous data and a two-tailed χ² test or Fisher's exact test for categorical data, as appropriate.

All these statistical analyses were performed using IBM SPSS Statistics 20.0 (IBM Corp., Armonk, NY, USA), and *p*-values less than 0.05 were considered to indicate statistical significance.

Group differences between patients with cancer and healthy participants were determined using non-parametric tests. The AUC values, which reflect diagnostic accuracy and predictive ability, were calculated for each parameter. The performance of laryngeal cancer classification was evaluated with the AUC of receiver operating characteristic (ROC) curves, using the roc\_curve and auc functions of the Scikit-learn library and the matplotlib library in Python 3.5.2, and the R 2.15.3 software package (R Foundation for Statistical Computing, Vienna, Austria).
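The roc_curve and auc calls named above work as in this minimal sketch; the four subject scores are toy values, and the 0.35 threshold for the sensitivity/specificity illustration is arbitrary:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Toy classifier scores for 4 subjects: 0 = healthy, 1 = laryngeal cancer.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)                    # trapezoidal area under the curve

# Sensitivity and specificity at one operating threshold.
y_pred = (y_score >= 0.35).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

`roc_curve` sweeps every threshold at once, so the single AUC summarizes the sensitivity/specificity trade-off that any one threshold only samples.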

#### **3. Results**

#### *3.1. Demographic Features*

Using the medical records, we identified a total of 50 laryngeal cancer patients who had undergone voice analysis preoperatively. From the normal voices (*n* = 180), only the male voice data were selected (*n* = 45) and used for analysis. All cancer subjects were male. Laryngeal cancer included glottic (84%) and supraglottic (16%) types of cancer. The majority (84%) of patients were diagnosed at the T1–T2 stages when the voice recordings were performed. The characteristics of cancer and their staging are shown in Table 1. Compared with the healthy group, subjects with laryngeal cancer were significantly older, and showed higher smoking rates than the healthy subjects, as shown in Table 2.


**Table 1.** Clinicopathological characteristics of the cancer patients.

Abbreviations: SCC, squamous cell cancer; CIS, carcinoma in situ; TNM, tumor node metastases classification of malignant tumors.


**Table 2.** Demographic data from the 95 subjects, including the glottic cancer patients.

Values are presented as mean ± standard deviation (min–max) or number (%). \* Group comparison between normal cases versus laryngeal cancer patients. † Hazard ratio: 14.8. ‡ Hazard ratio: 11.6.

#### *3.2. Feature Selection*

Figure 6 shows the contribution of each of the 14 PRAAT features, obtained from XGBoost, to the classification of voice changes. The HNR, the standard deviation of F0, and apq11shimmer were the major features in the classification of abnormal voices.

**Figure 6.** Feature importance analysis of XGBoost. The plot demonstrates the relative information gain of each feature for the classification task on male voice samples.
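A gain-based feature ranking like Figure 6 can be reproduced as below. To stay dependency-light, scikit-learn's GradientBoostingClassifier stands in for XGBoost; the data are synthetic, with an artificial shift injected into the HNR column so the toy ranking has a known answer, and the feature names follow the PRAAT list in Section 2.4:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["meanF0", "stdevF0", "HNR", "localJitter",
                 "localabsoluteJitter", "rapJitter", "ppq5Jitter",
                 "ddpJitter", "localShimmer", "localdbShimmer",
                 "apq3Shimmer", "apq5Shimmer", "apq11Shimmer",
                 "ddaShimmer"]

rng = np.random.default_rng(0)
X = rng.normal(size=(95, 14))          # synthetic 95-subject table
y = np.array([0] * 45 + [1] * 50)
X[y == 1, 2] -= 2.0                    # make HNR informative in the toy data

model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda t: -t[1])   # impurity-gain-based importances
```

The importances are normalized to sum to one, so `ranked` reads directly as the relative-contribution bar chart of Figure 6.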

#### *3.3. Accuracy of the Automatic Detection in Male Voice Samples*

We performed the analysis with no female data for two reasons. First, male and female voices are known to fall within different frequency ranges [21]. Second, laryngeal cancer is rare in females, which is directly reflected in our dataset: no female data exist in the cancer class. In East Asian countries especially, the proportion of female patients with laryngeal cancer is reported to be less than 10%. Therefore, only the male voice signals were analyzed. Among the algorithms, the 1D-CNN again showed good accuracy, with sensitivity levels up to 85% (Table 3). Of interest, five out of eight supraglottic cancer patients were correctly diagnosed by the 1D-CNN model.

The accuracy values and receiver operating characteristic (ROC) curves for a set of evaluations are demonstrated in Table 3 and Figure 7.


**Table 3.** Evaluation metrics table of only male voice samples for classification task of abnormal voice signals in laryngeal cancer (*n* = 50) from normal healthy subjects (*n* = 45).

Abbreviations: AUC, area under curve; SVM, support vector machine; XGBoost, extreme gradient boosting; LightGBM, light gradient boosted machine; ANN, artificial neural network; MFCCs, Mel-frequency cepstral coefficients; STFT, short-time Fourier transform. \*: with 10 times augmented data.

**Figure 7.** ROC (receiver operating characteristic) curve analysis of the different models for the classification of laryngeal cancer, using only male voice samples. Abbreviations: LGBM, LightGBM; XGB, XGBoost; SVM, support vector machine; ANN, artificial neural network; 1D-CNN, one-dimensional convolutional neural network; 2D-CNN, two-dimensional convolutional neural network; MFCCs, Mel-frequency cepstral coefficients; STFT, short-time Fourier transform.

#### *3.4. Accuracy in Human Rating*

Results show a large variance in sensitivity across the four raters, with levels as low as 29% and as high as 50%. The two experts showed higher accuracy than the two non-experts but, compared with the machine learning and deep learning algorithms, both groups showed low sensitivity and accuracy. Table 4 summarizes the results.

**Table 4.** Evaluation metrics table of only male voice samples for classification task of abnormal voice signals in laryngeal cancer (*n* = 50) from normal healthy subjects (*n* = 45).


#### **4. Discussion**

The results of our study provide high accuracy levels of automated algorithms using machine learning and deep learning techniques that assess voice change in laryngeal cancer. The results are promising since the majority of the cancer subjects (84%) were at early stages of cancer. Among the algorithms, the 1D-CNN showed better performance than other algorithms, with accuracy levels of 85%. All the other computational algorithms showed promising levels of performance and some showed higher accuracy levels compared with the results obtained from two laryngologists, who showed sensitivity levels of 44%. To the best of our knowledge, this is one of the first studies that has compared the performance of automated algorithms against those performed by both clinicians and non-clinicians. Based on our results, automatic detection using the 1D-CNN and other computational algorithms may be considered as potential supplementary tools to distinguish abnormal voice changes in patients with laryngeal cancer.

Past studies have already used several machine learning techniques in attempts to distinguish pathological voice disorders in laryngeal cancer. Gavidia-Ceballos and Hansen [22] demonstrated accuracy levels of 88.7% in patients with vocal fold cancer, but their sample was limited to 20 glottic cancer and 10 healthy subjects. Previous studies employed an ANN in laryngeal cancer with accuracy levels of 92% [9]. However, their data included patients who were recovering from laryngeal cancer, mostly following surgery. The voice signals in the present study were obtained from laryngeal cancer patients preoperatively, and thus, our results are more appropriate for assistance in screening laryngeal cancer rather than detection of postoperative voice changes. The results are even more promising since our study also provided detailed clinical information about laryngeal cancer, mostly at the early stages.

Our results are also in accordance with previous studies that have suggested better performance of DNNs on some datasets compared with an SVM or Gaussian mixture model (GMM) in detecting pathological voice samples [9,10,23]. An unexpected finding was that the 1D-CNN showed better performance than the 2D-CNN, a more sophisticated algorithm. The processed signals contained 64,000 (4 [s] × 16,000 [Hz]) data points of acoustic information, whereas the MFCCs carried 15,640 (40 × 391) points. In addition, the 2D-CNN model has a limited scope of 40 × 40 kernel windows at a time. Through this series of feature conversion and windowing, the 2D-CNN method incurs an unfortunate information loss. Thus, the 1D-CNN is associated with a higher resolution than the 2D-CNN, given appropriate hyper-parameters such as the learning rate, kernel size, and number of layers. Although the 1D-CNN is more difficult to optimize, higher resolution implies higher learning capacity. Our results are consistent with a recent heart sound classification study that showed the superior performance of the 1D-CNN compared with the 2D-CNN [24]; nevertheless, one has to be conscious of the fact that the performance of these algorithms may change depending on the nature of the data.

Laryngeal cancer is known to show a skewed gender distribution [25,26], with an approximately four- to six-fold higher risk in males [27] and a poorer prognosis than in females [28]. Based on this gender difference, we assessed the performance levels of these algorithms when the analysis of voice features in laryngeal cancer was limited to males. In this gender-restricted analysis, the 1D-CNN showed good performance, with accuracy levels of 85.2%. This finding differs from that of Fang et al. [10], who showed that the SVM outscored the GMM and ANN when the data were analyzed without the female subjects; it was therefore unexpected that the 1D-CNN performed well even with the limited sample of male voices. Despite our results favoring the 1D-CNN, given the uneven gender distribution in laryngeal cancer, the gender composition of the data should be considered with caution when developing future deep learning algorithms, since the female voice has broader distributions in cepstral domain analysis [29]. Furthermore, one has to be mindful that the performance of these algorithms may differ depending on the feature selection and the amount of data, and therefore caution is needed before making direct comparisons of which algorithm is superior.

Among the many PRAAT features that played significant roles in classifying the voice changes in this study, the HNR proved to be an essential feature, followed by the F0 standard deviation and apq11shimmer, according to the XGBoost algorithm. Past studies [30,31] have shown changes in some acoustic parameters, including the HNR, jitter, and shimmer, in laryngeal cancer due to the structural effects on the vibration and wave movement of the vocal cord mucosa, but they have not shown which parameter contributes most to the classification of laryngeal cancer. Abnormal HNR values reflect an asthenic voice and dysphonia [32], which may be expected with mass effects. Controversies surround the role of the fundamental frequency, with some suggesting these values decrease in laryngeal cancer and smokers [31] compared with healthy groups [33]; rather, this feature is a more prominent marker of the voice change observed in smokers [34]. Of interest, no study has yet emphasized the role of the apq11shimmer values in classifying these voice changes with high accuracy. The clinical implication of the apq11shimmer value, in combination with changes of the F0 standard deviation, which reflects changes in voice tone, needs to be verified in future studies, including those on other voice disorders.

Voice changes that do not improve within two weeks are known to be one of the earliest signs of laryngeal cancer and mandate a visit to the laryngologist. Our results indicated that the average time from onset of voice change to the first visit to the laryngologist was 16 weeks. Though most patients in our study were in the early stages of cancer, in reality, patients failed to consult with the laryngologist within the initial month when the voice changes develop. Subjective perception of voice change can be challenging and our results from the ratings by the four volunteers showed that half of the early stage cases could also be missed by the human ear, even by expert laryngologists. The diagnostic parameters from the four volunteers showed overall high specificity levels, which indicate good performance levels in discerning those with normal voices. However, the low sensitivity levels indicate that human perception of subtle voice changes within the short 4 s voice file may be insufficient to discern the initial voice changes in laryngeal cancer. Though the two experts showed better performance than the two non-experts, the low sensitivity levels of this latter group are of concern and reflect real-world situations where the patients may misjudge and miss the initial changes as normal. Voice change can be the only first symptom, and if not considered as a serious sign, it can inadvertently result in a delay when making the initial visit to the laryngologist. Automated algorithms may be used to alert the "non-expert" patients when these voice changes appear to seek medical advice. Higher sensitivity levels are ideal for screening tools. Therefore, the higher sensitivity levels from the deep learning algorithms may support the use of these automated voice algorithms in the detection of voice changes in early glottic cancer. 
Though based on a limited amount of data, our results show the potential for future applications of these algorithms in digital health care and telemedicine.

One interesting point is that the 1D-CNN showed good accuracy levels even though most cancers were at the early stages. In addition, the inclusion of supraglottic cancer patients, who usually remain asymptomatic in the early stages and are difficult to diagnose [35], may be clinically relevant. Voice changes in advanced laryngeal cancer stages can be evident because of the involvement of the vocal cord or thyroid cartilage; by contrast, in the T1 stage, these changes may be too subtle and may go unnoticed. The encouraging results of classifying cases even in the early laryngeal cancer stages show the potential of automatic voice signal analysis to be used as part of future digital health tools for the noninvasive and objective detection of subtle voice changes at the early stages of laryngeal cancer. Future studies with more detailed separate analyses among the different tumor types and stages could be promising.

Significant new work has been reported recently using artificial intelligence techniques in the early detection of cancer, including skin and gastric cancer [36,37]. Similar attempts have also been made in oropharyngeal cancer with mixed results. A few studies have used features associated with hypernasalance in oropharyngeal cancer [38] and glottal flow in glottic cancer [39] and employed ANNs with mixed results. Recent studies have shown that the CNN can be used to predict the outcome automatically [40] or detect laryngeal cancer based on laryngoscope images with accuracy levels of 86% [41], which are similar to our accuracy levels of 85.2%. Our work differs from these past studies in that we employed voice signals rather than imaging data and compared the accuracy levels of the 1D-CNN to those rated by the human ear.

The algorithms presented in this study showed promising results. However, a few limitations remain to be addressed. The first is the non-inclusion of other voice disorders, such as those related to more common benign conditions like vocal polyps or vocal fold palsies. The main objective was to determine the accuracy of various algorithms, including the 1D-CNN, against human raters. The use of these algorithms to classify other voice disorders may require rebuilding the algorithm structure with additional hyper-parameters. Ongoing studies by our research group are currently attempting to design new CNN algorithms that may distinguish voice changes in cancer patients from various other voice disorders, such as those related to vocal palsy or polyps. Another factor to consider is the limited number of cancer cases. Machine learning requires a large amount of processed data, and its performance depends heavily on how well the features are crafted. A limited amount of data is a problem often encountered in medical data acquired from sources other than image files, and it is even more challenging to obtain voice data from patients with laryngeal cancer during the preoperative period. However, the number of cancer patients was similar to past studies [9–11] that employed automated algorithms in voice pathologies. The voice, a signal carrying rich information, can be represented in a simpler form by digital signal processing tools such as PRAAT or MFCCs, which improves the optimization potential despite small datasets [15]. Second, the proposed algorithms performed well only for datasets comprising males. Because all the cancer data came from male subjects, including female data might have inadvertently provided the model with a gender cue; we therefore excluded female data.
Although our results supported the high-performance levels of the 1D-CNN, the model proposed in this study may lose its diagnostic power when female cancer patients are included. Therefore, our algorithms require re-validation when adequate data are collected from female patients in the future. Furthermore, prospective studies are needed for large-scale validation of our model. Third, since the cancer group showed more elderly males with a higher proportion of smokers, one could question whether our algorithm classified voice changes related to the presence of laryngeal cancer or related to smoking and old age. Smoking and old age are the cardinal risk factors of laryngeal cancer. However, these two conditions manifest in distinctive voice changes. For example, according to a recent meta-analysis study [34] voice changes in smoking are manifested mostly in the fundamental frequency (F0). Likewise, voice changes in elderly males are characterized by an increase of jitter values [42]. Had our algorithms classified based solely on senile and smoking changes, these two features would have been the two most important features. Instead, other features, which may reflect the tumor effects on the voice, played a more prominent role. Nevertheless, the skewed distribution of gender, age, and smoking status are important factors to consider in future studies that intend to employ artificial intelligence in voice disorders that include laryngeal cancer. Finally, our results are by no means intended to replace current diagnostic tools and future studies using voice signals as a supplementary screening tool in the age of telemedicine in conjunction with current laryngoscope studies in laryngeal cancer are warranted.

The results presented in our study demonstrate the ability of the proposed computational algorithms to distinguish voice changes in early laryngeal cancer from healthy voices in normal participants. However, this study did not include other voice disorders, which may be more common in clinical practice than laryngeal cancer patients. Therefore, a high degree of prudence is required in interpreting the results. Nevertheless, the application of voice signals to digital algorithms as alternative methods to assess patients at difficult times [43] when direct physical contact with the laryngologist is not feasible may have important social implications in the future.

#### **5. Conclusions**

This study has shown that automated voice detection based on both machine learning and deep learning algorithms facilitates detection of voice changes in laryngeal cancer in a noninvasive yet objective manner, with accuracy levels that may surpass human performance. Future studies are warranted on techniques to implement and adopt these automated voice analyses, using the 1D-CNN, as part of the digital health system [44].

**Author Contributions:** Conceptualization, J.J. and Y.J.H.; data curation, H.K., Y.J.H., and Y.J.; funding acquisition, S.L. and S.I.; investigation, S.L.; methodology, J.J., H.K., and Y.J.; project administration, S.L. and S.I.; software, J.L.; supervision, S.I.; validation, J.J. and J.L.; visualization, J.J.; writing—original draft, J.J. and H.K.; writing—review and editing, S.L. and S.I. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Po-Ca Networking Groups funded by the Postech-Catholic Biomedical Engineering Institute (No. 5-2020-B0001-00050), National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) 2020R1F1A1065814 and 2020R1A2C1009744, and the Priority Research Centers Program funded by the Ministry of Education (2020R1A6A1A03047902).

**Conflicts of Interest:** The authors declare no conflict of interest.


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Early Detection of Septic Shock Onset Using Interpretable Machine Learners**

**Debdipto Misra <sup>1</sup> , Venkatesh Avula <sup>2</sup> , Donna M. Wolk <sup>3</sup> , Hosam A. Farag <sup>3</sup> , Jiang Li <sup>2</sup> , Yatin B. Mehta <sup>4</sup> , Ranjeet Sandhu <sup>1</sup> , Bipin Karunakaran <sup>1</sup> , Shravan Kethireddy <sup>4</sup> , Ramin Zand <sup>5</sup> and Vida Abedi 2,\* ,†**


**Abstract:** Background: Developing a decision support system based on advances in machine learning is one area for strategic innovation in healthcare. Predicting a patient's progression to septic shock is an active field of translational research. The goal of this study was to develop a working model of a clinical decision support system for predicting septic shock in an acute care setting for up to 6 h from the time of admission in an integrated healthcare setting. Method: Clinical data from the Electronic Health Record (EHR), at the encounter level, were used to build a predictive model for progression from sepsis to septic shock up to 6 h from the time of admission, that is, *T* = 1, 3, and 6 h from admission. Eight different machine learning algorithms (Random Forest, XGBoost, C5.0, Decision Trees, Boosted Logistic Regression, Support Vector Machine, Regularized Logistic Regression, and Bayes Generalized Linear Model) were used for model development. Two adaptive sampling strategies were used to address the class imbalance. Data from two sources (clinical and billing codes) were used to define the case definition (septic shock) using the Centers for Medicare & Medicaid Services (CMS) Sepsis criteria. The model assessment was performed using the Area under the Receiver Operating Characteristic curve (AUROC), sensitivity, and specificity. Model predictions for each feature window (1, 3, and 6 h from admission) were consolidated. Results: Retrospective data from April 2005 to September 2018 were extracted from the EHR, Insurance Claims, Billing, and Laboratory Systems to create a dataset for septic shock detection. The clinical criteria and billing information were used to label patients into two classes, septic shock patients and sepsis patients, at three different time points from admission, creating two different case-control cohorts.
Data from 45,425 unique in-patient visits were used to build 96 prediction models comparing the clinical-based definition versus billing-based information as the gold standard. Of the 24 consolidated models (based on eight machine learning algorithms and three feature windows), four reached an AUROC greater than 0.9, and all reached an AUROC of at least 0.8820. The best model, based on Random Forest, achieved an AUROC of 0.9483 with a sensitivity of 83.9% and a specificity of 88.1%. The sepsis detection window at 6 h outperformed the 1- and 3-h windows. The sepsis definition based on clinical variables showed improved performance compared with the definition based only on billing information. Conclusion: This study corroborated that machine learning models can be developed to predict septic shock using clinical and administrative data; however, models using clinical information to define septic shock outperformed those developed from administrative data alone. Intelligent decision support tools can be developed, integrated into the EHR, and used to improve clinical outcomes and facilitate the optimization of resources in real time.

**Citation:** Misra, D.; Avula, V.; Wolk, D.M.; Farag, H.A.; Li, J.; Mehta, Y.B.; Sandhu, R.; Karunakaran, B.; Kethireddy, S.; Zand, R.; et al. Early Detection of Septic Shock Onset Using Interpretable Machine Learners. *J. Clin. Med.* **2021**, *10*, 301. https://doi.org/10.3390/jcm10020301

Received: 12 November 2020 Accepted: 12 January 2021 Published: 15 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** healthcare; artificial intelligence; machine learning; interpretable machine learning; explainable machine learning; septic shock; clinical decision support system; electronic health record

#### **1. Introduction**

Sepsis is a life-threatening condition that arises when the body's response to an infection injures its own tissues and organs, as defined by the 1991 consensus [1–3]. Sepsis is a complex syndrome that is difficult to identify early, as its symptoms, such as fever and low blood pressure, overlap with those of other common illnesses. Without timely treatment, sepsis can progress to septic shock, which has a hospital mortality rate greater than 40%. Identifying sepsis patients who are at high risk of septic shock can help clinicians prioritize preventive care and improve survival rates; early diagnosis, prompt antibiotics, and supportive therapy are associated with improved outcomes [4–6]. Septic shock is a subset of sepsis with significantly increased mortality due to severe abnormalities of circulation and/or cellular metabolism: the heart and circulatory system begin to fail and blood pressure drops. Severe sepsis and septic shock are the leading causes of morbidity and mortality in the Intensive Care Unit (ICU) [7], costing the United States' healthcare system more than \$20 billion per year [8].

Translating recent advances in Artificial Intelligence (AI) to patient outcomes is an active area of research [9–11]. A few examples where AI has shown promise are interpreting chest radiographs [12], identifying malignancy in mammograms [10], and detecting incidental lung nodules in computed tomography scans [13,14]. Leveraging data collected from EHRs offers clinical insight that can augment favorable patient outcomes [15]. Data-driven AI models can also assign risk scores to transfer high-risk patients to intensive care units [16]. Increasingly advanced ML models are being used to develop clinical decision systems, predicting in-hospital mortality, length of stay, readmission risk, and discharge diagnoses [17], as well as supporting sepsis management [18,19]. In this study, we developed a working model for predicting septic shock in an acute care setting up to 6 h from the time of admission using real-time data. Predicting septic shock is challenging yet highly impactful, as timely diagnosis and prompt antibiotic and supportive therapy are associated with improved outcomes. This paper presents a practical working model for using ML to develop predictive models of septic shock in an Intensive Care Unit environment. The findings highlight how ML and large clinical and administrative data lakes can be leveraged to address practical challenges.

#### **2. Related Works**

Recent works have highlighted the unmet need for data-driven clinical decision systems for the identification of at-risk patients. For instance, in 2018, researchers [20] leveraged high-resolution time-series data to predict septic shock onset in the Intensive Care Unit 4 to 12 h before the event. In 2019, it was demonstrated [21] that an expert AI system could outperform clinicians in predicting sepsis onset. In 2020, Kim et al. [22] explored the possibility of predicting septic shock within 24 h using ML-based models. Even though septic shock has higher mortality than sepsis [23], identifying both sepsis and septic shock patients in a way that gives care providers more time (even a few hours) can lead to improved outcomes. Although there are many use cases of ML-based models of sepsis and septic shock, there is limited literature focusing on a working model in an integrated healthcare system that addresses scalability, real-time data access, and standardization of the evolving sepsis and septic shock phenotype definitions. Previous works have focused on clinical models using various datasets and characteristics [24], focusing on the effect of ML algorithms on outcomes of sepsis patients.

This project was part of an initiative to build a translational and interpretable decision support system as an assistive technology for our providers. In particular, we aimed to develop a prediction model of progression from sepsis and severe sepsis to septic shock using clinical data, and to compare the model performance when only billing data are used to define the cohort. Data extraction from administrative sources (such as billing codes), which are in structured form, is easier than extraction from unstructured clinical sources (such as notes, for identifying the source of infection); the latter requires more complex queries, including the integration of natural language processing pipelines. It was reported [25] that identifying sepsis or septic shock patients based on clinical data, as compared to administrative data, is more accurate; however, many studies still rely mainly on administrative data. For septic shock, administrative data can be inaccurate, as the patient's progression to septic shock can occur at any time. While earlier works [26,27] have demonstrated moderate success using tree-based models for visit-level prediction, recent works [26] leveraging temporal neural-network-based models have shown promising results for predicting septic shock at the visit and event levels. However, one of the challenges in defining cases and controls is the lack of consensus on definitions of sepsis and septic shock [7]. Cohort definition is the first and most important step of the modeling pipeline. In this study, we used clinical variables to map our cohort definition (cases: septic shock; controls: sepsis and severe sepsis [28]) with the Systemic Inflammatory Response Syndrome (SIRS) [29] criteria. The SIRS criteria, as specified by the Centers for Medicare & Medicaid Services [30], are summarized in Figure 1.


**Figure 1.** Case and control definition based on the SIRS criteria and Centers for Medicare & Medicaid Services (CMS) definition.

#### **3. Methods**

#### *3.1. Data Sources*

This study was approved by the Geisinger Institutional Review Board (IRB). Geisinger, an integrated multi-site health system in northeastern Pennsylvania with a catchment population of approximately 2.5 million, has been known as one of the most "wired" and innovative healthcare systems in the United States. Thirteen years of retrospective data, between April 2005 and September 2018, from the EHR (EPIC), Insurance Claims and Billing (AMISYS), and Laboratory Systems (Sunquest) were used to create a sepsis dataset for this study. The Systemic Inflammatory Response Syndrome (SIRS) [30] criteria, outlined by the Centers for Medicare & Medicaid Services (CMS) [31], were used to assign patients into case and control groups: septic shock patients (case group) and sepsis and severe sepsis patients (control group). In production, the system was designed to detect septic shock using real-time data to assist clinicians when treating high-risk sepsis patients in the ICU. In addition to the EHR data, billing codes were utilized to ascertain the correct diagnosis for a patient at a given encounter for comparative assessment.

The initial assessment of clinical features, which was based on input from the clinicians and the literature, resulted in 65 features in six different categories from the structured sources. The features included during the first assessment were broadly in the following categories: demographics, vitals, pathology and laboratory measurements, medications, comorbidities, and procedures. Additional variables, which are critical in sepsis and septic shock, were also considered. In particular, (1) use of vasopressors was part of the criteria to define septic shock (persistent hypotension), (2) antibiotic administration was included to indicate suspected infection, and (3) creatinine level was utilized to evaluate kidney function, since the use of urine output data, also an important parameter, was challenging; the latter is associated with a high error rate, given the need for visual assessment and manual data recording.

Data from structured and unstructured sources were extracted and processed. Clinical notes (unstructured sources) were used to ascertain clinical states, including the source of infection, focused exam, documentation of septic shock, and severe sepsis documentation. Medical ontology from the Unified Medical Language System (UMLS) [31] meta-thesaurus, including SNOMED [32], LOINC [33] and ICD-9/ICD-10 [34] were used in the data model abstraction. Technical details of the natural language processing (NLP) pipeline are provided in the data extraction section.

#### *3.2. Feature Assessment*

The list of features was further evaluated for the clinical implementation to ensure clear workflow integration. Stakeholders from data management, the EHR vendor (EPIC), Laboratory Medicine, and the clinicians reviewed the comprehensive feature list, and a decision was made to include actionable features with high clinical value. The final list included the following features: blood culture, diastolic and systolic blood pressure, creatinine, lactic acid, mean arterial pressure (MAP), platelet count, pulse, respiration, temperature, white blood cell count, age, gender, height, and weight. Association Rule Mining [35] was also performed as part of the feature exploration strategy to investigate the relationship between comorbidities using diagnosis codes. Results from this additional assessment are included in Appendix A (Figure A1) for the interested reader.
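To make the comorbidity exploration concrete, the sketch below mines simple one-to-one association rules (A → B) over sets of co-occurring diagnosis codes. It is a minimal, pure-Python illustration of the support/confidence idea behind Association Rule Mining, not the tooling used in the study; the encounter data and ICD-style codes are invented.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_conf=0.6):
    """Mine one-to-one rules (A -> B) from sets of co-occurring codes,
    keeping rules whose support and confidence clear the thresholds."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    support = {frozenset(c): sum(1 for t in transactions if set(c) <= t) / n
               for k in (1, 2) for c in combinations(sorted(items), k)}
    rules = []
    for a, b in combinations(sorted(items), 2):
        pair = frozenset((a, b))
        if support[pair] < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = support[pair] / support[frozenset((lhs,))]
            if conf >= min_conf:
                rules.append((lhs, rhs, support[pair], conf))
    return rules

# Toy encounters with illustrative ICD-style comorbidity codes
encounters = [{"I10", "E11"}, {"I10", "E11"}, {"I10"}, {"E11", "N18"}]
for lhs, rhs, sup, conf in association_rules(encounters):
    print(f"{lhs} -> {rhs} (support={sup:.2f}, confidence={conf:.2f})")
```

On this toy data the hypertension/diabetes pair (I10, E11) co-occurs in half the encounters, producing rules in both directions.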

#### *3.3. Cohort Selection*

Cohort definition involves establishing a reproducible process by which data elements from the EHR (both structured and unstructured) can be used to develop a longitudinal view of the patient. Deep phenotyping was performed to create different case and control cohorts based on structured and unstructured data sources. The Systemic Inflammatory Response Syndrome (SIRS) [30] criteria were used to group patients into the case (septic shock) and control (sepsis and severe sepsis) groups (see Figure 1). Three different case-control sets were also designed based on adult patients (>18 years old) progressing from sepsis to septic shock at three different time frames from admission: *T* = *1*, *3*, and *6 h* from the time of admission to septic shock progression (visit-level early diagnosis, based on a left-aligned design). Since vitals are extracted directly from sensors and fed into the system as they are generated, our data were time-stamped, which allowed us to collect data points preceding the observation window. For instance, if a patient had three data points at 0.5 h, 2.5 h, and 3.5 h, then for the *T* = *3 h* window the data point at 2.5 h was utilized; similarly, for *T* = *6 h*, the data point collected at 3.5 h was used, and so forth.
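The windowing rule in the example above (use the most recent time-stamped measurement preceding each window boundary) can be sketched as follows; the function and variable names are illustrative, not from the study's pipeline:

```python
def latest_before(timestamps, values, window_h):
    """Return the value of the most recent measurement taken before the
    window boundary (hours from admission), or None if none precedes it."""
    best_t, best_v = None, None
    for t, v in zip(timestamps, values):
        if t < window_h and (best_t is None or t > best_t):
            best_t, best_v = t, v
    return best_v

# The example from the text: readings at 0.5, 2.5, and 3.5 h
times = [0.5, 2.5, 3.5]
pulse = [88, 95, 102]       # hypothetical pulse values at those times

print(latest_before(times, pulse, 3))  # T = 3 h window uses the 2.5 h reading
print(latest_before(times, pulse, 6))  # T = 6 h window uses the 3.5 h reading
```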

#### *3.4. Data Extraction*

Analytics Infrastructure: The Unified Data Architecture (UDA) is the Enterprise Data Lake providing core integration, storage, and user-specific access and retrieval of information at Geisinger. It is an in-house 50-node cluster with the capability to ingest, store, and transform big data using a combination of Apache Spark and Apache Hive on an Apache Hadoop cluster. Data from heterogeneous source systems and vendors (e.g., clinical, billing, radiology, laboratory) are ingested into an Enterprise Data Warehouse (EDW) daily. The data model is used extensively for clinical reporting and advanced analytics. The EDW was used as the source for the extraction of retrospective data and clinical features.

Data extraction from unstructured sources: Patient notes, specifically nursing notes, were used to determine the source of infection, chronic conditions, fluid bolus, and acute kidney disease. Apache cTAKES [36] was used as the natural language processing (NLP) engine. The NLP engine was modified to be utilized in a big data environment using the Apache Spark framework on Hadoop [37]. Concepts related to chronic conditions, fluid bolus, and acute kidney disease were identified from in-patient provider notes using entities from the UMLS meta-thesaurus. Notes with the relevant concepts were selected for downstream analysis. A custom regular expression-based NLP pipeline was applied to extract additional information for the three SIRS criteria, including the source of infection, chronic conditions, and fluid bolus.
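As an illustration of the custom regular-expression layer described above, the sketch below flags note-level mentions of an infection source, a fluid bolus, and chronic kidney disease. The patterns, concept names, and example note are invented for illustration; the production pipeline used Apache cTAKES with UMLS concepts, not these regexes.

```python
import re

# Illustrative patterns (NOT the production cTAKES/UMLS pipeline) for
# note-derived concepts: source of infection, fluid bolus, and CKD.
PATTERNS = {
    "infection_source": re.compile(
        r"\b(pneumonia|urinary tract infection|uti|cellulitis|bacteremia)\b", re.I),
    "fluid_bolus": re.compile(
        r"\b\d+\s*(?:ml|cc)\s+(?:ns|normal saline|lr)\s+bolus\b", re.I),
    "ckd": re.compile(r"\bchronic kidney disease\b|\bckd\b", re.I),
}

def extract_concepts(note):
    """Return the set of concept names whose pattern matched the note."""
    return {name for name, pat in PATTERNS.items() if pat.search(note)}

note = "Pt with CKD stage 3, likely pneumonia; given 1000 mL NS bolus in ED."
print(sorted(extract_concepts(note)))
```

Notes matching any relevant concept would then be routed to the downstream classification logic.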

Data extraction from structured sources: Various data elements, including vitals, flowsheets, and medications, were processed, enhanced, and integrated into Geisinger's UDA platform. An Extract Transform Load (ETL) pipeline consolidated the data and aggregated clinical measures along with patients' encounters and demographic information. These data were combined with unstructured patient notes to determine various events such as SIRS and Organ Dysfunction (OD). Classification into sepsis, severe sepsis, and septic shock was performed based on the chronology of these medical events, as defined by the CMS guidelines. The classified data were integrated with patients' additional historical data, such as chronic conditions and medical history. Finally, a longitudinal chronological narrative of the various clinical measures and medical events from the time of admission was generated and used for model development.

#### *3.5. Data Processing*

Various data processing steps, such as exploratory data analysis, imputation, and sampling, were performed before training and testing the various models.

#### 3.5.1. Outlier Removal

The distribution of each feature was assessed to identify noise and outliers in the data. Units of the numeric variables were verified, and lower and upper bounds were applied (see Table A1). Furthermore, values falling outside six standard deviations were manually verified and removed if considered dubious.
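The two-stage filter described above can be sketched as follows; the bounds and readings are illustrative, not the study's Table A1 limits, and the second stage only flags candidates for the manual review the text describes:

```python
import statistics

def screen_outliers(values, lower, upper, n_sd=6):
    """Stage 1: drop values outside hard physiologic bounds.
    Stage 2: flag anything beyond n_sd standard deviations of the
    remaining values for manual review (not automatic removal)."""
    in_bounds = [v for v in values if lower <= v <= upper]
    mu = statistics.mean(in_bounds)
    sd = statistics.pstdev(in_bounds)
    kept = [v for v in in_bounds if abs(v - mu) <= n_sd * sd]
    flagged = [v for v in in_bounds if abs(v - mu) > n_sd * sd]
    return kept, flagged

# Pulse readings with one physiologically impossible value (bounds illustrative)
pulse = [72, 88, 95, 610, 101, 64]
kept, flagged = screen_outliers(pulse, lower=20, upper=300)
print(kept, flagged)  # 610 is dropped by the upper bound
```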

#### 3.5.2. Imputation

Variables with more than 40% missingness were excluded from the analysis. The only exception was lactic acid, which had higher overall missingness; however, given the importance of this variable, a decision was made to include it. The MICE package in R, with its random forest implementation, was used to impute missing values [38]. Given the large dataset, a custom pipeline was implemented using Apache Spark [39] and optimized for scalability. The distribution of variables before and after imputation was assessed to ensure consistency.
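As a deliberately simplified stand-in for the chained random-forest imputation above (the study used R's MICE package, not this code), the sketch below fills a missing laboratory value with the mean of the most similar complete encounters, judged on the other numeric features. The data and the k-nearest-rows heuristic are illustrative only.

```python
def impute_nearest(rows, target, k=2):
    """Fill missing `target` values with the mean of the k most similar
    complete rows (Euclidean distance on the remaining numeric features).
    A toy analogue of model-based imputation, not the MICE algorithm."""
    complete = [r for r in rows if r[target] is not None]
    feats = [f for f in rows[0] if f != target]
    dist = lambda a, b: sum((a[f] - b[f]) ** 2 for f in feats) ** 0.5
    for r in rows:
        if r[target] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            r[target] = sum(c[target] for c in nearest) / k
    return rows

encounters = [
    {"map": 65, "pulse": 110, "lactic_acid": 4.1},
    {"map": 70, "pulse": 105, "lactic_acid": 3.8},
    {"map": 90, "pulse": 80,  "lactic_acid": 1.2},
    {"map": 68, "pulse": 108, "lactic_acid": None},  # value to impute
]
impute_nearest(encounters, "lactic_acid")
print(encounters[-1]["lactic_acid"])  # mean of the two hypotensive neighbours
```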

#### 3.5.3. Class Imbalance

Given that the percentage of patients with septic shock (cases) was significantly smaller than that of patients with sepsis and severe sepsis (controls), three sampling techniques were applied in the following order. First, Edited Nearest Neighbors (ENN) [40] was used to smooth the data distribution by removing misclassified samples of the majority class based on their nearest neighbors. ENN was followed by the Synthetic Minority Over-sampling Technique (SMOTE) [41] to increase the size of the minority class; two variations (SMOTE and the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC)) were used for numeric and categorical features, respectively. Finally, a random under-sampling (RUS) algorithm was applied to balance the classes [42]. Up-sampling (synthetically increasing the sample size of the minority class) was performed separately for labels from the billing-based and CMS-based cohorts.
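The SMOTE step of the ENN → SMOTE → RUS sequence can be sketched as below: a synthetic minority sample is an interpolation between a minority point and one of its nearest minority-class neighbours. This is a minimal numeric-features-only sketch (the ENN cleaning and RUS steps are omitted, and the (MAP, lactic acid) points are invented), not the implementation used in the study.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: synthesize n_new minority samples by
    interpolating between a point and one of its k nearest
    minority-class neighbours (numeric features only)."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment p -> q
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(p, q)))
    return synthetic

# Septic-shock (minority) samples as hypothetical (MAP, lactic acid) pairs
cases = [(60, 4.2), (62, 4.0), (58, 4.8), (65, 3.6)]
new_cases = smote(cases, n_new=4)
print(len(cases) + len(new_cases))  # minority class doubled
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the minority class's feature range.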

#### *3.6. Modeling Strategy*

Geisinger's big data environment used for our modeling consisted of 34 physical nodes with 1140 vCores and 11.07 TB of memory. We also used the Yet Another Resource Negotiator (YARN) [37] cluster manager for jobs configured to use 200 executors with a 5 GB memory container size. The technology stack consisted of Spark jobs submitted to the YARN cluster resource manager.

As the list of features was limited to actionable features with the highest clinical utility, we did not perform data-driven feature selection; however, we used Pearson pairwise correlation analysis to corroborate that features in the cohort were not highly correlated. We split the data into training and testing sets (80/20 split) while retaining the proportion of classes. Model development was performed on 80% of the data, using 5-fold cross-validation, while model testing was performed on the remaining 20%. Synthetic sampling was applied only to the training data. Model performances were evaluated on the holdout test set (20% of the data) using the area under the receiver operating characteristic curve (AUROC), specificity, and sensitivity. Consolidated metrics for the 1, 3, and 6-h feature windows were also calculated: if a patient was assigned a septic shock label in any of the three time intervals, the consolidated prediction was septic shock.
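The key ordering constraint above (split first, then apply synthetic sampling to the training side only, so no synthetic points leak into the held-out test set) can be sketched with a minimal stratified splitter; the toy labels and 10% case rate are illustrative:

```python
import random

def stratified_split(y, test_frac=0.2, seed=42):
    """80/20 index split preserving class proportions. Any synthetic
    sampling must happen AFTER this split, on train_idx only, so the
    held-out test set contains no synthetic points."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Toy cohort: 10% septic-shock cases, mirroring a strong class imbalance
y = ["shock"] * 10 + ["sepsis"] * 90
train, test = stratified_split(y)
print(len(train), len(test))               # 80 20
print(sum(y[i] == "shock" for i in test))  # class proportion preserved
```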

The models were derived from the two cohorts (designed based on CMS criteria and billing information, respectively). Prediction of the onset of septic shock within *T* hours of admission was designed for *T* = *1*, *3*, and *6 h*. Time-dependent (dynamic) features were collected for each window, and the results of the model performance were compared.

A total of eight different algorithms were trained: Logistic Regression [43], Regularized Logistic Regression [44], Bayes Generalized Linear Model [45], Boosted Logistic Regression [46], C5.0 [46], Decision Trees [47], Support Vector Machine (SVM) [48], and Random Forest [49]. Grid search [50] was used to tune the hyperparameters for the classification models. A twenty-node cluster running Apache Spark was used for tuning the models in conjunction with sparkR and R [39].
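A grid search of the kind used above exhaustively evaluates every hyperparameter combination and keeps the best-scoring one. The skeleton below illustrates this; the grid values are invented, and the scoring function here is a mock stand-in for the real 5-fold cross-validated AUROC computed in R/sparkR.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every combination in the grid and return
    the (params, score) pair with the highest score."""
    names = list(param_grid)
    best = None
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

# Illustrative grid for a random-forest-style model; mock_auroc stands in
# for a 5-fold cross-validated AUROC evaluation.
grid = {"n_trees": [100, 300, 500], "max_depth": [5, 10]}
mock_auroc = {(100, 5): 0.90, (100, 10): 0.91, (300, 5): 0.92,
              (300, 10): 0.94, (500, 5): 0.92, (500, 10): 0.93}
best_params, best_score = grid_search(
    grid, lambda p: mock_auroc[(p["n_trees"], p["max_depth"])])
print(best_params, best_score)
```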

#### **4. Results**

#### *4.1. Patient Characteristics*

This study includes a total of 46,651 distinct adult patient (>18 years old) visits, extracted from Geisinger's data warehouse between April 2005 and September 2018. Each record corresponds to a unique encounter. A set of 1226 records were excluded due to data quality issues and excessive missingness of static features such as height, gender, and age. The remaining 45,425 records met the inclusion criteria.

Sepsis datasets for the 1, 3, and 6 h feature windows had labels from CMS criteria and billing, depending on the data extraction process. There were a total of 3179 septic shock encounters based on the CMS criteria (7% of the cohort), while billing-based septic shock records accounted for 6127 encounters (14% of the total records analyzed). Among the 45,425 records, 5784 were identified as septic shock (cases) while 30,192 were identified as sepsis and severe sepsis (controls) within a *T* = *1 h* window; similarly, 5845 were classified as septic shock (cases) while 31,668 were identified as controls within a window of *T* = *3 h*. A total of 5852 records were septic shock (cases) while 32,329 were sepsis (controls) within a *T* = *6 h* window. Overall, 51% of all the cases and controls were men. The mean age was higher in the case group than in the control group for the three case-control designs (*T* = *1*, *3*, and *6 h* from admission). The same trend was observed for the average weight of the patients; however, the difference was marginal. Table 1 illustrates the cohort statistics for the *T* = *1*, *3*, and *6 h* prediction windows. This study was based on 15 features, including vitals, laboratory values, and baseline demographics.


**Table 1.** Cohort Statistics based on CMS criteria.

<sup>1</sup> Mean Arterial Pressure; <sup>2</sup> Glasgow Coma Score; <sup>3</sup> Activated Partial Thromboplastin Time; <sup>4</sup> Prothrombin Time Test.

Our data showed that the average levels of lactic acid and creatinine were lower as the feature window was reduced to *T* = *3 h* and *T* = *1 h*. The average pulse followed the same trend (higher in the cases at *T* = *6 h* versus *T* = *1 h*). The average blood pressure had the opposite pattern; septic shock patients had, on average, lower blood pressure (both diastolic and systolic) at *T* = *6 h*. The average temperature was lowest in the *T* = *1 h* window for both case and control groups. Finally, the white blood cell count (WBC) was lower in the case group than in the control group for the three feature windows.

#### *4.2. Machine Learning Models Can Be Trained for the Detection of Septic Shock Using Administrative Datasets*

In this study, we used different case-control designs by focusing on different prediction windows, as well as labeling strategies (CMS criteria versus billing information to label the cases). We also used sampling techniques to address the data imbalance. Overall, consolidated results demonstrated that clinical decision support systems can be developed for the detection of septic shock in the ICU using administrative or clinical data. In the consolidated results, the final prediction label was determined based on whether at least one of the three case-control designs (based on the *T* = *1*, *3*, or *6 h* windows) was able to detect septic shock (Table 2). Overall, four of the modeling algorithms resulted in an AUROC above 0.92, with an average AUROC of 0.91. The parameters for the grid search for the different models are also listed in Table 2. The average sensitivity and specificity of the consolidated results were 0.82 and 0.86, respectively. Finally, the best performance (AUROC of 0.943) was achieved when Random Forest was used (Figure 2 and Table 2). The 95% confidence intervals of the AUROC, sensitivity, and specificity are provided in Appendix A Figure A2.


**Table 2.** Performance metrics for the best model for each machine learning algorithm.

RF: Random Forest, DT: Decision Trees, BL: Boosted Logistic, SVM: Support Vector Machine (Radial), LR: Logistic Regression, RLR: Regularized Logistic Regression, BGLM: Bayes Generalized Linear Model, HP: hyper-parameters.

**Figure 2.** Receiver Operating Characteristic plots for the best machine learning algorithms.

#### *4.3. Model Prediction Performance Improves as the Time from Admission Widens*

Analysis of performance metrics, comparing the different case-control designs based on the feature window, demonstrated that the average model performance—in terms of AUROC, accuracy, sensitivity, and specificity—increased monotonically as time elapsed from admission increased from *T* = *1 h* to *3* and *6 h* (Figure 3). Furthermore, our results on the best performing model using Random Forest also corroborated that the models based on the longer time frame (*T* = *6 h*) consistently outperformed the others in terms of all performance metrics used in this study (Figure 3).

The predictions of the models (at *T* = *1*, *3*, and *6 h*) were aggregated, such that the final prediction is positive if any one of the models labels it as positive. This strategy reduced false negatives at the cost of false positives. Model AUROC, specificity, and sensitivity are reported in Table 2. It is important to note that Table 2 presents the aggregated results for the best-performing model of each algorithm, and that model sensitivity and specificity are above 0.8 for all the models.
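The OR-style consolidation described above is a one-liner per patient: a patient is flagged as septic shock if any window-specific model flags them. A minimal sketch, with invented window-level predictions for four patients:

```python
def consolidate(pred_1h, pred_3h, pred_6h):
    """Logical-OR consolidation across the three window-specific models:
    a patient is flagged if ANY model predicts septic shock, trading
    extra false positives for fewer false negatives."""
    return [any(p) for p in zip(pred_1h, pred_3h, pred_6h)]

# Hypothetical window-level predictions for four patients
p1 = [False, True,  False, False]
p3 = [False, False, True,  False]
p6 = [True,  False, True,  False]
print(consolidate(p1, p3, p6))  # [True, True, True, False]
```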


**Figure 3.** Consolidated Metrics, using Random Forest-based models, comparing CMS and Billing-based cohort as well as models based on the different windows, *T* = *1*, *3*, and *6 h*.

#### *4.4. Models Based on CMS-Derived Information Have Better Detection Power*

Our results highlight that prediction models, when used in conjunction with labeling rules derived from CMS (clinical) information rather than billing (administrative) data, can improve the performance metrics in terms of model AUROC, accuracy, sensitivity, and specificity. Figure 3 shows that, on average, model AUROC, sensitivity, specificity, and accuracy were higher for the CMS-based cohort for all three case-control designs (*T* = *1*, *3*, and *6 h*). Model AUROC had the highest improvement for the 6 h window, with the CMS-based cohort reaching an average of 0.87, while the billing-based cohort for the same time frame reached an average of 0.77. Similarly, average model accuracy was highest for the same *T* = *6 h* cohort when CMS information was used to define the cohort (0.90 versus 0.78 average accuracy). Model sensitivity and specificity were also higher with the CMS-based cohort (model sensitivity for *T* = *6 h*: 0.66 versus 0.56; model specificity for *T* = *6 h*: 0.92 versus 0.82). The same pattern was observed for the cohorts where the time from admission was defined as *T* = *1* and *T* = *3 h*.

#### *4.5. Important Clinical Markers of Septic Shock*

Our results (Figure 4) demonstrated that all eight ML algorithms identified lactic acid as the most important feature. Furthermore, there was a consensus in feature importance ranking in three out of the eight algorithms (logistic regression, regularized logistic regression, and Bayes generalized linear model). Overall, the dynamic features, including laboratory values and vitals, were important clinical markers for the majority of the algorithms. Demographic variables such as sex, age, and weight were the least discriminative variables in most of the models.
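Comparing importance profiles across models, as in Figure 4, amounts to averaging each feature's normalized importance over the per-model rankings. A minimal sketch, with toy importance numbers (invented, merely in the spirit of the figure):

```python
def aggregate_importance(model_rankings):
    """Average each feature's normalized importance across models to
    obtain a consensus importance profile."""
    features = model_rankings[0].keys()
    return {f: sum(r[f] for r in model_rankings) / len(model_rankings)
            for f in features}

# Per-model importance profiles (toy numbers, not the study's values)
rankings = [
    {"lactic_acid": 0.40, "blood_culture": 0.25, "creatinine": 0.20, "age": 0.15},
    {"lactic_acid": 0.50, "blood_culture": 0.20, "creatinine": 0.20, "age": 0.10},
]
consensus = aggregate_importance(rankings)
top = max(consensus, key=consensus.get)
print(top, round(consensus[top], 2))
```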


**Figure 4.** Feature Importance Profile for the eight machine learning models, based on aggregated measures.

#### **5. Discussion**

This study demonstrated that machine learning models can be used to predict septic shock within the first 6 h of admission. Furthermore, model performance can be improved by aggregating the temporal models from each prediction window. Even though the rate of septic shock was between 7% and 14% (depending on how septic shock was defined), the presented pipeline achieved a good balance of sensitivity and specificity at 1, 3, and 6 h from the time of admission. The major contribution of this study is the use of a well-established framework, big data analytics, and a solid infrastructure in building interpretable decision support systems that can be integrated into the clinical workflow in the EHR.

#### *5.1. Design Consideration for Building a Clinical Decision Support System for Detection of Septic Shock Using Healthcare Data*

Our findings highlighted the value of data density for building predictive models. As the time from admission increases from *T* = *1 h* to *3* and *6 h*, more clinical variables become available for each patient, which had an impact on model performance. This observation, even though expected, (a) can help in designing models that balance performance improvement against how far in advance a practitioner can be notified of a patient's declining condition, and (b) corroborates the value of advances in laboratory technologies that reduce turnaround time, which could eventually facilitate the development of models targeting narrower windows as the data become available.

Our findings also demonstrated that the cohort definition for a clinical application can benefit if clinical information is leveraged as opposed to relying only on administrative (billing) information. The latter might be counter-intuitive, as billing codes may be more robust, at least for some conditions. Administrative data tend to be considered in many studies as a gold standard since billing codes are entered after chart review and have legal implications. However, as our results corroborated, clinically derived criteria using data from structured and unstructured sources, such as SIRS criteria, can exhibit higher fidelity in identifying septic shock patients when compared to leveraging only diagnostic codes.

Besides a carefully designed cohort definition and the selection of the optimal prediction window (based on clinical workflow settings and the turnaround time for patient-level data to reach the model), we discussed important technical considerations for building a successful ML-enabled decision support system. One such consideration was addressing the class imbalance between cases and controls. Our results denoted the value of applying robust sampling strategies to address the challenges posed by the imbalanced nature of the dataset. Even though we did not compare our model performance with and without sampling as a pre-processing step, evidence suggests that this design strategy likely aided our model performance. Fleuren et al. [51], in their systematic review of ML-based models of sepsis, identified that some of the included studies potentially suffered from selection bias. In particular, to label septic shock patients, one study [5] used discharge ICD-9 codes for acute infection to identify acute organ dysfunction and complemented that information with the need for vasopressors within 24 h of ICU transfer. In another study [27], the authors used deep learning models to assess risk scores 4 h before onset. In essence, since many patients present in the Emergency Department with imminent or overt symptoms of septic shock, it is important that a decision support system, when integrated into the clinical workflow, can detect such patients; therefore, in our design strategy, we ensured that patients with imminent or overt septic shock were included to mimic a realistic situation. Finally, as the EHR provides a valuable resource, it is important to leverage scalable analytical frameworks (pre-processing, data augmentation, use of ontologies, etc.) to provide assistive tools to providers in real time.

#### *5.2. Lactic Acid and Other Laboratory Measurements are Highly Important Indicators of Progression to Septic Shock*

Epidemiological studies have established that initial hypotension and lactic acid levels are important indicators of the progression of sepsis to septic shock [52,53]. Our results also highlighted that lactic acid is the most important indicator of septic shock, followed by blood culture, creatinine level, and systolic blood pressure. However, it should be mentioned that lactic acid demonstrated more than 40% overall missingness; nevertheless, it was decided to include this important variable in the model. In our dataset, lactic acid was not missing completely at random, as the level of missingness in the control group was higher; our data included 25,352 encounters out of 43,332 with lactic acid data available in the control group, versus 6037 encounters out of 6486 in the case group, for the 6-h window. Our team is leading comprehensive studies in the imputation of laboratory values [54], and we hope to better address this challenge in a follow-up study.

Overall, other laboratory values were also found to be relevant to the decision support system. Early warning scores do not consider laboratory values; however, a recent meta-analysis of 28 studies [53] observed that laboratory values play an important role overall. Static features (age, sex, height, and weight) were the least important variables in the majority of the models used in this study. Furthermore, as different algorithms demonstrated different patterns (see Figure 4), it is imperative to rely not on one modeling algorithm but on an ensemble of models [55] when building prediction models based on a limited set of variables for time-critical conditions.

#### *5.3. Strengths, Limitations, and Future Work*

This study had several strengths and some limitations. Using a large dataset from an integrated healthcare system was a clear strength; however, Geisinger's patient cohort is predominantly Caucasian, so models developed in this study may not be generalizable to other healthcare systems without further fine-tuning and optimization. Furthermore, the use of large clinical datasets also introduces a limitation: data from the EHR tend to be noisy. With a proper data extraction pipeline and close collaboration with the clinical team, however, it is possible to improve data quality and reduce systemic bias. Moreover, models developed using EHR-based data can be integrated and deployed into the same healthcare system more effectively, as ML models trained on data specific to a particular healthcare system (and population) can provide better specificity and sensitivity.

Another strength and key contribution of this study is the development and comparison of two cohorts, based on administrative and clinical data, using billing information and clinical information based on CMS guidelines. Other studies have relied on clinical markers such as lactic acid levels in combination with hypotension for determining septic shock [53]; however, the progression from sepsis to septic shock occurs on a spectrum, and specific criteria define this progression, from sepsis to severe sepsis and eventually to septic shock, with the latter clearly defined by the CMS guidelines [53,56]. Our results showed that the clinically derived cohort is more robust and that leveraging guideline recommendations can improve the performance of the models. However, since the use of SIRS criteria may also lead to labeling bias (over-diagnosed cases), it is important to work closely with the clinical team and consider additional guidelines and metrics as needed. It is also important to perform a careful evaluation and comparative analysis (such as a targeted chart review). Given the study limitation around the use of SIRS criteria, the strategy in this study was to align our decision support system with the contemporaneous roll-out of the CMS sepsis protocol, which did not include qSOFA or SOFA at the time this study was conceptualized. As in any healthcare system, as recommendations and guidelines change, we are working on adapting our models to the clinical workflow accordingly. Finally, since we used a multi-level approach in defining our cohort, our strategy is robust and can be updated relatively efficiently based on new guidelines. In particular, we used ICD codes as the first level, complemented with clinical data from notes and other sources of structured data. It is important to emphasize that diagnosis codes may carry a systemic bias, as they are intended for billing purposes. Furthermore, our case/control ratio was significantly imbalanced, which typically reduces model performance. However, as the field of machine learning is advancing at an unprecedented rate, we are exploring novel strategies, such as generative adversarial networks (GANs) [57], which could be used to address data imbalance and improve our models.

As future directions, our team is actively working on further refining our sepsis early-detection models based on technical and clinical advances. In particular: (1) some important data elements, such as the SOFA score (and its variations), were not captured routinely in our EHR at the time of this study. Given the clinical utility of such data elements, our system now captures these important variables more consistently; as part of future work, we will integrate these new variables and assess their predictive utility. (2) Certain variables, especially laboratory variables (such as blood cultures), have a long turn-around time (sometimes 48 to 72 h). In this study, we used the presence of a blood culture order as a binary variable; however, having the actual test results could improve the detection of septic shock. We are working on integrating more laboratory-based features as their turn-around time improves. Finally, (3) many other laboratory variables could be included in the model; however, laboratory values tend to be missing not at random and at high rates, and imputing them is a challenging task. Our team is developing imputation modeling designed specifically for laboratory-based features [54]. We believe better imputation and more targeted hyperparameter tuning, including sensitivity-based analysis, could further improve model performance.

One of the main limitations of this study design is that some patients who progress to septic shock might be mislabeled as controls in the cohort. Even though this could be mitigated by taking a larger time window and leveraging pathology results, the technical and clinical steps needed to address this limitation are manifold and beyond the scope of this study. Currently, the turnaround time for pathology reports makes it impractical to integrate such data into a decision support system aimed at assisting ICU providers in real time, within a few hours after the patient is admitted to the ICU. Another potential source of noise is intervention by care providers (e.g., administration of a fluid bolus based on capillary refill), which can suppress clinical markers (e.g., SBP, lactate) to baseline and thus mislead the model during training.

Furthermore, it is difficult to know the impact of antibiotics on the trajectory of an individual patient, as infection types differ and outcomes of progression are predicated on many dynamic variables. For instance, it has been shown that 30% of patients who received appropriate anti-infective therapy before the onset of hypotension still developed septic shock [12]. Thus, more targeted research is needed to assess the impact of medication at a personalized level before such information can be used in practical and time-sensitive applications.

Finally, this study is unique in that it operates directly on multiple sources of clinical data to build an ML-based decision support system for the detection of septic shock. The study also demonstrated that high-resolution, large, heterogeneous datasets from administrative sources can be used to develop assistive tools for time-sensitive conditions such as the progression of sepsis or severe sepsis to septic shock. Such technologies could be integrated into the electronic healthcare system to improve the detection of septic shock and to enable the optimization of resources. These models have the potential to improve clinical outcomes in real time.

**Author Contributions:** Conceptualization, V.A. (Vida Abedi), D.M., B.K. and S.K.; methodology, V.A. (Vida Abedi), V.A. (Venkatesh Avula), D.M. and R.S.; software, D.M., R.S. and V.A. (Venkatesh Avula); validation, H.A.F., Y.B.M., S.K. and D.M.W.; formal analysis, V.A. (Vida Abedi), V.A. (Venkatesh Avula) and D.M.; investigation, V.A. (Vida Abedi), R.Z., D.M. and S.K.; resources, V.A. (Vida Abedi), R.Z., B.K. and S.K.; data curation, D.M., R.S. and V.A. (Venkatesh Avula); writing—original draft preparation, D.M. and V.A. (Vida Abedi); writing—review and editing, V.A. (Vida Abedi), H.A.F., S.K., D.M.W., J.L. and D.M.; visualization, D.M. and V.A. (Venkatesh Avula); supervision, V.A. (Vida Abedi), S.K. and B.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This study was approved as a quality improvement project (IRB Number: 2020-0355) as well as exempt study (IRB Number: 2014-0242) by the institutional review board of Geisinger Health System.

**Data Availability Statement:** The data analyzed in this study are not publicly available due to privacy and security concerns. A GitHub link to the team's notebook with exploratory data analysis, additional metadata, and summary plots is provided for reference: https://github.com/TheDecodeLab/early_sepsis_detection_2020.

**Acknowledgments:** The authors would like to thank Arjun Lingala (Data Engineer-Sr), William Wenrich (Technical Analyst-Sr), Jody Pensyl (Data Analyst-Sr), and Dhruv Mathrawala (Data Architect-Lead) for their contribution to implementing the Geisinger Data Model and the data engineering pipeline. The authors would also like to thank Cheryl Fritzen (RNC, MSN), Quality Data and Reporting Coordinator, for her assistance in validating the data for dubious entries and for insightful discussions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Upper and lower limits on variables as part of the data pre-processing for variables that had outliers after considering six standard deviations.




<sup>1</sup> Prothrombin Time and International Normalized Ratio; <sup>2</sup> Activated Partial Thromboplastin Time.

**Figure A1.** Graphical Representation of Association Rules with Septic Shock (ICD9 = 785.52) as a consequent. Diagnosis codes for the patients in the cohort were obtained, and Association Rule Mining [35] was run to mine for relationships between comorbidities. In this study, "items" are diagnosis codes. Items are connected to rules using directional edges. For nodes representing rules, edges pointing from codes to rule vertices indicate antecedent items, and an edge from a rule to an item indicates the consequent item. The reader is referred to [58] for more details about the visualization. 511.9: Unspecified pleural effusion; 401.9: Unspecified essential hypertension; 995.2: Severe sepsis; 038.9: Unspecified septicemia; 518.81: Acute respiratory failure; EP751: Other congenital anomalies of digestive system; 584.9: Acute kidney failure.

**Figure A2.** Comparison of Performance Profiles based on aggregated data.

#### **References**


## *Article* **Classification of Monocytes, Promonocytes and Monoblasts Using Deep Neural Network Models: An Area of Unmet Need in Diagnostic Hematopathology**

**Mazen Osman <sup>1,</sup>\*, Zeynettin Akkus <sup>2,</sup>\*, Dragan Jevremovic <sup>3</sup>, Phuong L. Nguyen <sup>3</sup>, Dana Roh <sup>3</sup>, Aref Al-Kali <sup>4</sup>, Mrinal M. Patnaik <sup>4</sup>, Ahmad Nanaa <sup>4</sup>, Samia Rizk <sup>5</sup> and Mohamed E. Salama <sup>3,</sup>\***


**Abstract:** The accurate diagnosis of chronic myelomonocytic leukemia (CMML) and acute myeloid leukemia (AML) subtypes with monocytic differentiation relies on the proper identification and quantitation of blast cells and blast-equivalent cells, including promonocytes. This distinction can be quite challenging given the cytomorphologic and immunophenotypic similarities among the monocytic cell precursors. The aim of this study was to assess the performance of convolutional neural networks (CNN) in separating monocytes from their precursors (i.e., promonocytes and monoblasts). We collected digital images of 935 monocytic cells that were blindly reviewed by five experienced morphologists and assigned to three subtypes: monocyte, promonocyte, and blast. The consensus between reviewers was considered the ground truth reference label for each cell. In order to assess the performance of CNN models, we divided our data into training (70%), validation (10%), and test (20%) datasets, and also applied fivefold cross validation. The CNN models did not perform well in predicting the three monocytic subtypes, but their performance improved significantly with two subtypes (monocyte vs. promonocytes + blasts). Our findings (1) support the concept that morphologic distinction between monocytic cells at various levels of differentiation is difficult; (2) suggest that combining blasts and promonocytes into a single category is desirable for improved accuracy; and (3) show that CNN models can reach accuracy comparable to human reviewers (0.78 ± 0.10 vs. 0.86 ± 0.05). As far as we know, this is the first study to separate monocytes from their precursors using CNNs.

**Keywords:** digital imaging; artificial intelligence; improving diagnosis accuracy; monocytes; promonocytes and monoblasts; chronic myelomonocytic leukemia (CMML) and acute myeloid leukemia (AML) for acute monoblastic leukemia and acute monocytic leukemia; concordance between hematopathologists

#### **1. Introduction**

The classification of the monocytic subpopulations (monoblasts, promonocytes, and monocytes) is important for the proper diagnosis and classification of various monocytic-lineage leukemias, namely, chronic myelomonocytic leukemia (CMML) and acute myeloid leukemia (AML), including acute monoblastic leukemia, acute monocytic leukemia, and acute myelomonocytic leukemia [1].

**Citation:** Osman, M.; Akkus, Z.; Jevremovic, D.; Nguyen, P.L.; Roh, D.; Al-Kali, A.; Patnaik, M.M.; Nanaa, A.; Rizk, S.; Salama, M.E. Classification of Monocytes, Promonocytes and Monoblasts Using Deep Neural Network Models: An Area of Unmet Need in Diagnostic Hematopathology. *J. Clin. Med.* **2021**, *10*, 2264. https://doi.org/10.3390/jcm10112264

Academic Editor: Vida Abedi

Received: 18 April 2021; Accepted: 19 May 2021; Published: 24 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To meet the World Health Organization (WHO) diagnostic criteria, the peripheral blood (PB) or bone marrow (BM) of patients with acute monoblastic and monocytic leukemia must have ≥20% blasts (including promonocytes), and ≥80% of the leukemic cells must be of monocytic lineage, including monoblasts, promonocytes, and monocytes. Differentiation between acute monoblastic leukemia and acute monocytic leukemia is based on the relative proportions of monoblasts and promonocytes. In acute monoblastic leukemia, the majority of the monocytic cells (≥80%) are monoblasts, whereas in acute monocytic leukemia, the predominant populations are mature monocytes and promonocytes [1–3].

The diagnostic criteria for CMML include PB monocytosis (≥1 × 10<sup>9</sup>/L), in which >10% of the PB leukocytes are monocytes. In addition, a PB and BM blast count of <20%, counting both blasts and promonocytes (a blast-equivalent cell), must be ascertained [4,5]. Beyond diagnosis, CMML can be stratified into three subcategories based on accurate enumeration of blasts and equivalents (i.e., promonocytes) in the PB and BM: CMML-0, <2% in PB and <5% in BM; CMML-1, 2–4% in PB or 5–9% in BM; CMML-2, 5–19% in PB or 10–19% in BM [2,6].

As seen from the diagnostic criteria listed above, distinction between CMML and AML, and the staging of CMML, depend on accurate differentiation between blast equivalents (monoblasts and promonocytes) and mature monocytes. WHO classification still uses cytomorphology as the gold standard for the definition of blasts. In many cases, the expression of immature marker CD34 is used to supplement the enumeration of blasts. However, monoblasts are frequently negative for CD34 [7], and there are no other reliable immunophenotypic markers to distinguish monoblasts and promonocytes from mature monocytes. As a result, the differential diagnosis in these cases relies solely on cytomorphology.

In general, monocytes are mature cells with minimal morphologic atypia. However, atypical monocytes can be present with abnormal cellular features such as unusually fine chromatin but with prominent nuclear folds or convolutions that partially overlap with more immature forms, including monoblasts and promonocytes [8,9]. This renders distinguishing them from the immature forms notoriously difficult and might lead to under- or overestimation of blast cell numbers [10].

In this article, we present the applicability of artificial intelligence using a convolutional neural network architecture for separating monocytes from the spectrum of monocyte precursors (i.e., promonocytes and monoblasts), with reference labels generated based on experts' morphologic review consensus. Differentiating myeloblasts from monoblasts based solely on cytomorphology can be very difficult; therefore, we refer to monoblasts as blasts (monoblasts and/or myeloblasts) in this manuscript.

#### **2. Methods**

We trained convolutional neural network (CNN) architectures on digital images of monocytes, promonocytes, and blasts to separate monocytes from monocyte precursors (i.e., promonocytes and monoblasts). We experimented with and evaluated several data preprocessing configurations and assessed the performance of well-known CNN architectures in order to find the best-performing CNN model and preprocessing strategy for this classification task. The data were imbalanced; therefore, we used a weighted categorical cross-entropy loss function (see Equation (1)) to penalize the loss for each category during training [11–13]. We used the Adam optimizer [14] and initialized the learning rate to 1 × 10<sup>−4</sup>.

$$L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} -y_{ij} \log\left(\hat{y}_{ij}\right) w_{ij} \tag{1}$$

where *n* is the number of samples, *m* is the number of classes, *y* denotes the true labels, *y*ˆ the predicted labels, and *w<sub>ij</sub>* the weight for each sample of class *j*. *w<sub>ij</sub>* = max(*n*<sub>0</sub>, . . ., *n<sub>j</sub>*)/*n<sub>j</sub>* is defined to balance the impact of each class in the loss function.
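Equation (1) can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code; it assumes one-hot true labels, predicted class probabilities, and a per-class weight vector computed from the class counts:

```python
import numpy as np

def class_weights(counts):
    """Per-class weights w_j = max(n_0, ..., n_m) / n_j, as in Equation (1)."""
    counts = np.asarray(counts, dtype=float)
    return counts.max() / counts

def weighted_categorical_cross_entropy(y_true, y_pred, weights, eps=1e-12):
    """Mean weighted cross-entropy over n samples and m classes.

    y_true: one-hot labels (n, m); y_pred: predicted probabilities (n, m);
    weights: per-class weights (m,), broadcast across samples.
    """
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_sample = -(y_true * np.log(y_pred) * weights).sum(axis=1)
    return per_sample.mean()
```

With this weighting, a misclassified minority-class cell contributes more to the loss than a majority-class cell, which counteracts the class imbalance described above.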

#### *2.1. Data Collection*

After approval by the Mayo Clinic institutional review board (IRB protocol #19-001950), 935 consecutive monocytic cell images were acquired from the PB smear samples of 10 patients diagnosed with AML with monocytic differentiation or CMML. Digital images were obtained with a 100× objective lens under immersion oil using an Olympus BX53 microscope with an Olympus DP74 camera. Each cell was manually cropped by an experienced hematopathologist (M.E.S.) into 200 × 200 pixel images using HyperSnap V7 software (Hyperionics Technology, Murrysville, PA, USA). In order to eliminate the impact of non-relevant background information, which might include red blood cells, artifacts, and platelets, a manually segmented mask was provided for each monocytic cell. The cytoplasm and nucleus were labelled separately in each segmentation mask. All collected cells were split into three categories (i.e., monocyte, promonocyte, and blast) by five hematopathology experts. The consensus between the five experienced morphologists (four hematopathologists, D.J., P.L.N., S.R., and M.E.S., and an experienced pathologist assistant, D.R.) was considered the ground truth reference label for each cell.

#### *2.2. Experiments and Evaluations*

We split the data into 70%, 10%, and 20% for training, validation, and testing purposes, respectively, and assessed the performance of five well-known CNN architectures: InceptionV3 [15,16], Resnet50 [17], Inception\_resnet [18], VGG16 [19], and Densenet121 [20]. The training set was used for learning from the data, the validation set was employed to establish the reliability of the learning results, and the test set was used to assess the generalizability of a trained model on data not seen by the model. Furthermore, we applied stratified 5-fold cross validation to the best-performing model configuration to further assess the generalization ability of the model. In the 5-fold cross validation, the data were divided randomly into five equal-sized folds, with samples of each class distributed equally across folds. One fold was reserved for assessing the performance of a model, and the remaining four folds were used for training.
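The stratified 5-fold procedure described above can be sketched as follows. This is a minimal NumPy version for illustration; the authors' actual implementation is not specified, and the function name is an assumption:

```python
import numpy as np

def stratified_kfold_indices(y, k=5, rng=None):
    """Assign each sample a fold id in 0..k-1, distributing every class
    evenly across folds (a minimal sketch of stratified k-fold splitting)."""
    rng = np.random.default_rng(rng)
    folds = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # Round-robin assignment keeps per-class counts within 1 across folds.
        folds[idx] = np.arange(len(idx)) % k
    return folds
```

Each fold then serves once as the held-out evaluation set while the remaining folds are used for training, as in the study's cross-validation design.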

We generated five configurations based on pre-processing of the input data and assessed the impact of data pre-processing in order to select the best configuration for our classification task. In configuration 1, cell masks were applied to image patches to suppress the background (i.e., assigning zeros to non-cell pixels) and leave only the cell content in the image patches; afterwards, color normalization (i.e., RGB channel values normalized as a percentage of the sum of RGB values) was applied, and cells were centered and resized to 200 × 200 pixels. In configuration 2, cell masks were applied to suppress the background and leave only the cell content; next, z-scoring (also called the standard score), in which RGB channel values are scaled to zero mean and unit variance, was applied to the image patches; lastly, cells were centered and resized to 200 × 200 pixels. In configuration 3, image patches without background suppression (i.e., whole image patches including all the background information) were used as the input data for the CNN models. In configuration 4, cell masks were applied to suppress the background, leaving only the cell content (Figure 1). Lastly, in configuration 5, cell masks were applied to suppress both the background and the cells of interest, excluding their nuclei, leaving only the nuclei content in the image patches (Figure 1); we then centered and resized the nuclei to 200 × 200 pixels and applied z-scoring to standardize the RGB color distribution. For each configuration, we present accuracy, precision, recall, and F1-score metrics. In addition, we generated t-SNE plots using the features of the last convolutional layer of the best model to show the separation of monocytic cells on the test dataset.
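The masking and z-scoring steps of configuration 2 can be sketched roughly as below. The centering/resizing step is omitted for brevity, and this is an illustration rather than the authors' code; the function name and the choice to z-score over the whole masked patch are assumptions:

```python
import numpy as np

def mask_and_zscore(patch, mask, eps=1e-8):
    """Suppress background with a binary cell mask, then z-score each RGB
    channel (zero mean, unit variance), loosely following configuration 2.

    patch: float array (H, W, 3); mask: binary array (H, W).
    """
    masked = patch * mask[..., None]          # zero out non-cell pixels
    mean = masked.mean(axis=(0, 1))           # per-channel mean
    std = masked.std(axis=(0, 1))             # per-channel std
    return (masked - mean) / (std + eps)
```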


**Figure 1.** Examples of monocytes, promonocytes, and monoblasts with criteria.

In order to assess the inter-reviewer variability (i.e., the variability between the five expert reviewers), we compared the labels of each reviewer to the consensus labels, and the average performance and its standard deviation are presented. Similarly, to assess the intra-reviewer variability, reviewer 5 labeled the cells a second time (one month later) and a correlation matrix was calculated, as shown in the Results section below.
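The inter-reviewer agreement analysis amounts to a Pearson correlation matrix over the reviewers' label vectors. A small illustrative sketch with made-up labels (not the study data) follows:

```python
import numpy as np

# Hypothetical integer labels (0 = monocyte, 1 = promonocyte, 2 = blast)
# from three reviewers over the same ten cells -- illustrative values only.
labels = np.array([
    [0, 1, 2, 0, 1, 2, 0, 0, 1, 2],   # reviewer A
    [0, 1, 2, 0, 1, 1, 0, 0, 1, 2],   # reviewer B
    [0, 2, 2, 0, 1, 2, 1, 0, 1, 2],   # reviewer C
])

# np.corrcoef treats each row as one variable, yielding a 3x3 Pearson matrix.
corr = np.corrcoef(labels)
```

The same computation over five reviewers (plus reviewer 5's repeat labeling) yields the matrix reported in Table 5.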

#### **3. Results**

The performance of the five CNN models with different configurations and the resulting classification of the monocytic cells (i.e., monocyte, promonocyte, and blast) on the validation and test datasets are shown in Tables 1 and 2. Table 1 shows the results of CNN models with configurations 1–5 for the three-subcategory classification (monocyte vs. promonocyte vs. blast), while Table 2 shows results of CNN models with configurations 1–5 for the two-subcategory classification (monocyte vs. blast + promonocyte). Overall, the Inception\_resnet model [18], which is a version of the inception model with residual connection, using configuration 2, gave the best performance in terms of accuracy, precision, recall, and F1-score in the validation and test datasets of both the two-subcategory and the three-subcategory classifications. Densenet121 using configuration 2 was the second-best performing model.

Using configuration 2, the accuracy of the CNN models for predicting three subcategories (Table 1) on the test dataset ranged from 42% to 58%, while it ranged from 70% to 85% for predicting two subcategories (Table 2). In the three-subcategory classification (Table 1), the Inception\_resnet model achieved 81% accuracy in the validation dataset, but its performance dropped to 53% in the test dataset. In the two-subcategory classification (Table 2), the accuracy of the CNN models using configuration 2 ranged from 79% to 88% on the validation dataset. Inception\_resnet using configuration 2 also provided the most consistent performance in the two-subcategory classification in terms of accuracy, precision, recall, and F1-score in the validation and test datasets.

*Color Key for Tables 1–5:*


**Table 1.** Performance of CNN models using five pre-processing configurations on 3-subcategory (monocytes, promonocytes, and blasts) classification task.


Table 1 shows the Inception\_resnet model using configuration 2 performing the best in terms of accuracy, precision, recall, and F1-score in the validation and test datasets of the 3-subcategory classification.


**Table 2.** Performance of CNN models using five pre-processing configurations on 2-subcategory (monocytes and promonocytes + blasts) classification task.

Table 2 shows the Inception\_resnet model using configuration 2 performing the best in terms of accuracy, precision, recall, and F1-score in the validation and test datasets of the 2-subcategory classification.

In Tables 1 and 2, CNN models with configuration 1 showed less consistency between the validation and test datasets and had worse performance compared to those with configuration 2. The Inception\_resnet model using configurations 3 and 4 showed poor performance compared to the model using configuration 2 (Table 1). However, their performance improved with two-subcategory classification (Table 2). The overall performance of Inception\_resnet using configuration 5, which included only the nucleus in image patches, was slightly lower than the performance of the best model in both the two-subcategory and three-subcategory classification tasks, as shown in Tables 1 and 2.

Figure 2 shows the t-SNE plots for the learned features of the last convolutional layer of the Inception\_resnet model with configurations 1 and 2, generated from the test dataset. As shown in the t-SNE plot of the Inception\_resnet model with configuration 1, all promonocytes demonstrated features similar to blasts, and some monocytes were also not discernable from blasts. In the t-SNE plot of the model with configuration 2, promonocytes were distributed across the monocyte and blast classes, with only a narrow band differentiating promonocytes from the other two classes.

The average performance of the fivefold cross validation using the best-performing model, Inception\_resnet, is shown in Table 3. The average accuracy of the model and its standard deviation across the fivefold cross validation were 0.66 ± 0.12 and 0.78 ± 0.10 for the three-subcategory and two-subcategory classifications, respectively. The performances in the first two iterations were the lowest, while the performance in iteration three was the highest. In the two-subcategory classification, the average performance of the fivefold cross validation (Table 3) was slightly lower than the performance of the Inception\_resnet model (Table 2) on the test dataset (78% vs. 80%, respectively).

**Figure 2.** t-SNE plots for the performance of the Inception\_resnet model using configurations 1 and 2 on the test dataset. In the configuration 1 plot, all promonocytes demonstrated features similar to blasts, and some monocytes were also not discernable from blasts. In the configuration 2 plot, promonocytes were distributed across the monocyte and blast classes.

**Table 3.** Overall performance of fivefold cross validation using the Inception\_resnet CNN model.


The performance of the five human expert reviewers compared to the consensus reference labels is shown in Table 4. The mean and standard deviation of the reviewers' performance were 0.81 ± 0.07 and 0.86 ± 0.05 for the three-subcategory and two-subcategory classifications, respectively. Apart from reviewers 3 and 5, there was a strong consensus among the other three reviewers. Reviewer 3 achieved 72% accuracy, the lowest among the reviewers. As seen in Table 4, human performance could be as low as 72% and 80% accuracy for the three-subcategory and two-subcategory classifications, respectively. The overall results in the fivefold cross validation test (Table 3) were slightly lower than the human reviewers' performance in the two-subcategory classification task (0.78 ± 0.10 vs. 0.86 ± 0.05).


**Table 4.** Performance of human experts compared to consensus reference labels.

A Pearson's correlation matrix between reviewers and consensus reference labels is displayed in Table 5. The Pearson's correlation between the five reviewers ranged from 0.5 to 0.75. The correlation between reviewers and consensus reference labels ranged from 0.67 to 0.86. The correlation between the two labels of reviewer 5 (reviewer 5 vs. reviewer 5R) is 0.92 and represents the intra-reviewer variability.

**Table 5.** Pearson's correlation matrix between reviewers. Reference: consensus of 5 reviewers. Reviewer 5R: second repetition of reviewer 5.


#### **4. Discussion**

Monocyte assessment is frequently used in day-to-day practice to differentiate neoplastic processes from reactive monocytosis, such as that caused by infections. According to the WHO criteria, the diagnosis of monocytic neoplasms depends on quantitating monoblasts, promonocytes, and monocytes [2]. Specifically, accurate recognition and quantification of the two subtypes most characteristic of acute leukemia (promonocytes and monoblasts) is required to distinguish between the subtypes of AML with monocytic differentiation and CMML [2,21]. In addition, quantification of monoblasts is necessary for CMML staging, and quantification of monocytes is important for the differential diagnosis of other chronic myeloid neoplasms, including atypical CML [22].

Microscopic evaluation and enumeration of monoblasts, promonocytes, and monocytes by an experienced hematopathologist remains the only accepted gold standard; however, morphologic assessment alone can be difficult and subject to significant inter- and intra-observer variability. In fact, monocytes and monocytic precursors are the most difficult cells to identify and classify with confidence in the peripheral blood or in the bone marrow [8]. Other modalities such as multiparameter flow cytometry have been attempted to determine whether immunophenotypic markers such as anti-CD14 antibodies, which recognize the MO2 and MY4 epitopes, can identify monoblasts, promonocytes, and monocytes [23]. However, the adoption of alternatives to morphology requires technical expertise and remains limited in terms of widespread applicability.

It is imperative that diagnoses distinguish accurately between CMML, including the correct subcategory, and AML with monocytic differentiation, because an incorrect diagnosis has significant therapeutic ramifications. For instance, the management of CMML is guided by risk categories (high or low risk) based on a CMML-specific scoring system [24] that incorporates the percentage of PB and BM blasts as an important factor determining survival and prognosis [25]. Accordingly, high-risk groups are more subject to hematopoietic cell transplantation—which is associated with significant morbidity and mortality—than the low-risk groups, which are more subject to symptom-directed therapy (e.g., hydroxyurea, hypomethylating agents, and/or supportive care) [26]. Likewise, patients with AML follow a different therapeutic approach, because their treatment regimen usually begins with intensive remission–induction chemotherapy, which generally includes a seven-day continuous infusion of cytarabine along with anthracycline treatment on days 1–3 (the so-called "7 + 3" regimen) [27]. This induction therapy can be highly toxic and typically entails hospitalization for several weeks. Hence, precise identification and detailed characterization of monocytic cells is of major relevance not only for diagnosis but also for treatment. Other neoplastic myeloid conditions have been associated with monocytic abnormalities, including juvenile myelomonocytic leukemia, chronic myeloid leukemia with p190 fusion, and myeloid neoplasms with rearrangements of PDGFRA, PDGFRB, FGFR1, and PCM1-JAK2. In addition, monocytosis could be a sign of progression of Philadelphia-negative myeloproliferative neoplasms [28].

The evolution of digital imaging and AI applications offers promising potential for cell-based classification. As such, we sought to evaluate its applicability to monocytic cell-type classification. In this study, we assessed the performance of five well-known CNN architectures for separating monocytes from the spectrum of monocyte precursors (i.e., promonocytes and monoblasts). As mentioned before, the ground truth reference labels used to train these models were generated based on the consensus of five expert reviewers. Table 4 shows that the percentage of agreement between expert reviewers ranged from 72% to 86% for the three-subcategory classification task, which is good concordance for such a difficult task. These results are in line with previously reported concordance rates in the literature (76.6%) between expert hematopathologists [8,10]. This agreement improved further when monocyte precursors were combined. Importantly, consensus on the classification of cells, which was used as the gold standard, was achieved by individual classification of each cell by each of the evaluators. This is a higher standard than applied in regular clinical practice, where other parameters could be helpful in reaching the correct percentage of blast equivalents (for example, similarity between individual cells, bone marrow cellularity, and absence of other hematopoietic lineages).

The performance of the CNN models did not reach that of human experts in the three-subcategory classification, whereas it improved significantly in the two-subcategory classification and was therefore more comparable to the performance of human experts. The improvements in both inter-observer agreement and CNN model performance support the practice of combining blasts/promonocytes into a single subcategory. From our three-subcategory experiments (Table 1), conducted to find the best model and preprocessing approach, we conclude that Inception\_resnet using configuration 2 provides the best overall results on the validation and test datasets. However, in the two-subcategory classification, the performance of the other models and configurations, apart from configuration 1, was comparable, with only small differences (Table 2). Even though the results are comparable, configurations that use cell masking to suppress the impact of irrelevant background information on the prediction outcome are more reliable. Configuration 5, which uses nucleus-only data, also showed consistent results across the cross-validation and test datasets in both the two-subcategory and three-subcategory classifications (Tables 1 and 2). The impact of the cytoplasm and nucleus on predicting monocytic cells could be further investigated in a larger study to validate our preliminary findings.
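The cell-masking idea discussed above can be illustrated schematically. This is a minimal sketch assuming a binary segmentation mask is available for each image; the function names, array shapes, and toy data are illustrative assumptions, not the configurations actually used in the study.

```python
import numpy as np

def apply_mask(image, mask, background_value=0):
    """Zero out background pixels so the CNN sees only the cell.

    image: H x W x 3 uint8 array; mask: H x W boolean array (True = keep).
    """
    masked = image.copy()
    masked[~mask] = background_value
    return masked

def nucleus_only(image, cell_mask, nucleus_mask):
    """Keep only nucleus pixels (cf. a nucleus-only input configuration)."""
    return apply_mask(image, cell_mask & nucleus_mask)

# Toy 4x4 image: the "cell" occupies the central 2x2 region
img = np.full((4, 4, 3), 200, dtype=np.uint8)
cell = np.zeros((4, 4), dtype=bool)
cell[1:3, 1:3] = True
out = apply_mask(img, cell)
print(out[:, :, 0])  # border pixels are zeroed, central 2x2 keeps value 200
```

Masked images of this kind would then be fed to the CNN in place of the raw crops, so the network cannot rely on irrelevant background texture when making a prediction.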

The scope of this study was limited to the applicability of monocytic classification based on morphologic assessment by our expert hematopathology reviewers. Other cell types and immunohistochemical or flow cytometric immunophenotyping features were not collected to assess the reproducibility of the results presented in this article or their direct impact on diagnosis. Even though we obtained promising results in the identification of monocytes and their precursors using CNNs, these results still need to be validated in a larger study population. We used high-resolution cell images, which required manual image acquisition. Both image acquisition and cell classification posed challenges that limited the number of cells used in our study.

A larger study with higher numbers of cells could also help further improve the performance of the CNN models and achieve better generalization. A larger cohort would likely improve training of the CNN models and could possibly provide an improved ground truth reference. Furthermore, additional work is needed to explore the clinical applicability and clinical validity of such CNN models. Finally, our results underline the fact that monocytic cell differentiation is a difficult task, with relatively low concordance between expert reviewers.

#### **5. Conclusions**

In summary, we have shown that CNN models can perform almost as well as human experts in separating monocytes from their precursor cells. To the best of our knowledge, this is the first study to separate monocytes from their precursors using deep learning. Our promising results demonstrate that CNN models could be adopted for this task and further improved with a larger study population.

**Author Contributions:** Conceptualization, M.E.S., A.N. and M.M.P.; methodology, M.E.S., M.O. and Z.A.; software, Z.A.; validation, M.E.S., M.O., Z.A., D.J., P.L.N., D.R. and S.R.; formal analysis, M.E.S., M.O. and Z.A.; investigation, M.E.S., M.O. and Z.A.; resources, M.E.S., M.O. and Z.A.; data curation, M.E.S., M.O., D.J., P.L.N., D.R. and S.R.; writing— original draft preparation, M.E.S., M.O. and Z.A.; writing—review and editing, M.E.S., M.O., Z.A., D.J., P.L.N., D.R., S.R., A.A.-K., M.M.P. and A.N.; visualization, M.E.S. and M.O.; supervision, M.E.S.; project administration, M.E.S.; funding acquisition, M.E.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by the division of hematopathology research funds at Mayo Clinic, Rochester, MN, USA.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Mayo Clinic (IRB protocol #19-001950).

**Informed Consent Statement:** Informed consent was waived per IRB protocol #19-001950.

**Data Availability Statement:** The data presented in this study are contained within this article.

**Conflicts of Interest:** Mohamed Salama serves on the Board of Directors of and holds stock options at Techcyte Inc.

#### **References**
