**Using Bayesian Networks to Predict Long-Term Health-Related Quality of Life and Comorbidity after Bariatric Surgery: A Study Based on the Scandinavian Obesity Surgery Registry**

#### **Yang Cao <sup>1,\*</sup>, Mustafa Raoof <sup>2</sup>, Eva Szabo <sup>2</sup>, Johan Ottosson <sup>2</sup> and Ingmar Näslund <sup>2</sup>**


Received: 27 May 2020; Accepted: 15 June 2020; Published: 17 June 2020

**Abstract:** Previously published literature has identified a few predictors of health-related quality of life (HRQoL) after bariatric surgery. However, the performance of the predictive models was not evaluated rigorously using real-world data. To find better methods for predicting prognosis in patients after bariatric surgery, we examined the performance of the Bayesian network (BN) method in predicting long-term postoperative HRQoL and compared it with the convolutional neural network (CNN) and multivariable logistic regression (MLR). The patients registered in the Scandinavian Obesity Surgery Registry (SOReg) were used for the current study. In total, 6542 patients registered in the SOReg between 2008 and 2012, with complete demographic and preoperative comorbidity information and with both preoperative and 5-year postoperative HRQoL scores and comorbidities, were included in the study. HRQoL was measured using the RAND-SF-36 and the obesity-related problems scale. Thirty-five variables were used for the analyses, including 19 predictors and 16 outcome variables. The Gaussian BN (GBN), CNN, and a traditional linear regression model were used for predicting 5-year HRQoL scores, and the multinomial discrete BN (DBN) and MLR were used for 5-year comorbidities. Eighty percent of the patients were randomly selected as a training dataset and 20% as a validation dataset. The GBN performed better than the CNN and the linear regression model, with smaller mean squared errors (MSEs) than both; the MSE of the summary physical scale was only 0.0196 for the GBN, compared with 0.0333 for the CNN. The DBN showed excellent predictive ability for 5-year type 2 diabetes and dyslipidemia (area under the curve (AUC) = 0.942 and 0.917, respectively), good ability for 5-year hypertension and sleep apnea syndrome (AUC = 0.891 and 0.834, respectively), and fair ability for 5-year depression (AUC = 0.750).
Bayesian networks provide useful tools for predicting long-term HRQoL and comorbidities in patients after bariatric surgery. The hybrid network, which may involve variables from different probability distribution families, deserves investigation in the future.

**Keywords:** machine learning-enabled decision support system; improving diagnosis accuracy; Bayesian network; bariatric surgery; health-related quality of life; comorbidity

#### **1. Introduction**

Over the past two decades, obesity has been continuously increasing worldwide and has become a major public health issue [1]. Severe obesity, defined as a body mass index (BMI) over 35 kg/m<sup>2</sup> with obesity-related comorbidities, or a BMI > 40 kg/m<sup>2</sup>, has been associated with impaired health-related quality of life (HRQoL) and multiple comorbidities, including type 2 diabetes (T2D), hypertension, and cancer [2–4]. Gastric bypass and other weight-loss surgeries, known collectively as bariatric surgery, are currently considered the most effective treatment options for morbid obesity, helping severely obese patients lose excess weight and reduce the potentially life-threatening risk of weight-related health problems such as heart disease, stroke, hypertension, T2D, nonalcoholic fatty liver disease, and sleep apnea [5,6].

Based on the findings from several long-term (follow-up between 5 and 10 years) prospective studies, bariatric surgery patients' HRQoL improved considerably after surgery, and much of the initial improvement was maintained over the long term [7]. While bariatric surgery can offer many benefits, all forms of weight-loss surgery are major procedures that can pose serious risks and side effects, including acid reflux, chronic nausea and vomiting, infection, obstruction of the stomach, failure to lose weight, low blood sugar, malnutrition, and hernias, which in turn may adversely affect the HRQoL of patients with morbid obesity after surgery [8–10].

Previously published literature has identified a few predictors of HRQoL after bariatric surgery, including baseline demographic data and depression severity score [11–14]. However, none of these studies evaluated the models' performance or the predictors' predictive abilities rigorously using real-world data. In our previous study, we examined the performance of the convolutional neural network (CNN) for predicting 5-year HRQoL after bariatric surgery based on the available preoperative information from the Scandinavian Obesity Surgery Registry (SOReg) [15]. We found that, although the CNN is better than the traditional multivariable linear regression model at predicting long-term HRQoL after bariatric surgery, overfitting is still apparent and needs to be mitigated [15]. In two recently published studies using the same database, we found that patients with postoperative complications had significantly smaller improvements in all aspects of HRQoL compared with those without any form of postoperative complication [16], and that the ability of the multilayer perceptron and the CNN to predict serious postoperative complications after bariatric surgery is limited [17].

To find better methods for predicting prognosis and to provide evidence for patient management after bariatric surgery, in this study we examined the performance of the Bayesian network (BN) method in predicting long-term postoperative HRQoL and compared it with the CNN and multivariable linear regression. We also evaluated the performance of the BN in predicting postoperative comorbidities and compared it with the multivariable logistic regression (MLR) model.

#### **2. Materials and Methods**

#### *2.1. Subjects and Variables*

The patients registered in the Scandinavian Obesity Surgery Registry (SOReg) were used for the current study. The registry was launched in 2007 and has covered 98% of bariatric surgery in Sweden since 2009. It is validated regularly and shows high data quality [18–21]. In total, 6542 patients registered in the SOReg between 2008 and 2012, operated on with primary Roux-en-Y gastric bypass, with complete demographic and preoperative comorbidity information and with both preoperative and 5-year postoperative HRQoL scores and comorbidities, were included in the study. HRQoL was measured using the RAND-SF-36 [22] and the obesity-related problems (OP) scale [23]. In the present study, 35 variables were used for the analyses, including 19 predictors (i.e., sex, and preoperative age, BMI, physical functioning (PF), role physical (RP), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role emotional (RE), mental health (MH), summary physical scale (PCS), summary mental scale (MCS), OP, sleep apnea syndrome (SAS), hypertension, pharmaceutically treated T2D, depression, and dyslipidemia) and 16 outcome variables (i.e., postoperative 5-year PF, RP, BP, GH, VT, SF, RE, MH, PCS, MCS, OP, SAS, hypertension, T2D, depression, and dyslipidemia). All scale scores ranged from 0 to 100, with higher scores indicating better health status, except for OP, for which low values represent good health; the comorbidity variables are binary, with 1 indicating yes and 0 no.

The characteristics of the patients are shown in Table 1. Briefly, the average age and BMI of the patients were 42.7 years and 42.3 kg/m<sup>2</sup>, respectively. More than three quarters (78.8%) were female, and 45% had at least one of the five comorbidities before bariatric surgery. The prevalence of all comorbidities was reduced except for depression, and all HRQoL scores improved except for MCS five years after bariatric surgery (Table 1).


**Table 1.** Characteristics of the patients (*n* = 6542) included in the study, mean ± SD or *n* (%).

SD, standard deviation; BMI, body mass index; SAS, sleep apnea syndrome; T2D, type 2 diabetes; PF, physical functioning; RP, role-physical; BP, bodily pain; GH, general health; VT, vitality; SF, social functioning; RE, role-emotional; MH, mental health; PCS, summary physical scale; MCS, summary mental scale; OP, obesity-related problems.

The study was approved by the Regional Ethics Review Committee in Stockholm (approval number: 2013/535-31/5). The data that support the study are not publicly available because they contain information that could compromise research subjects' privacy and confidentiality. However, the data may be available upon reasonable request and with permission of the Committee of Scandinavian Obesity Surgery Registry in Örebro, Sweden.

#### *2.2. Statistical Methods*

A BN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Each node in the DAG represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. Given an event that has occurred, a BN can predict the likelihood that any one of its parent nodes was a contributing factor [24]. Applications of BNs have multiplied in recent years, spanning topics as diverse as systems biology, economics, social sciences, and medical informatics [25,26].
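The parent-likelihood reasoning described above is simply Bayes' rule applied over the joint distribution that the DAG factorizes. A minimal sketch with hypothetical numbers (a single preoperative parent node and its 5-year child; all probabilities are made up for illustration):

```python
# Minimal sketch of a two-node BN, parent -> child, and Bayes' rule to
# recover P(parent | observed child event). All numbers are hypothetical.

p_pre = 0.15                         # P(preop condition = 1), assumed
p_5y_given_pre = {1: 0.4, 0: 0.05}   # P(5-year condition = 1 | preop)

# Joint factorization over the DAG: P(pre, 5y) = P(pre) * P(5y | pre)
p_joint_11 = p_pre * p_5y_given_pre[1]
p_joint_01 = (1 - p_pre) * p_5y_given_pre[0]

# Marginal of the observed event, then Bayes' rule for the parent node
p_5y = p_joint_11 + p_joint_01
p_pre_given_5y = p_joint_11 / p_5y
print(round(p_pre_given_5y, 3))      # → 0.585
```

In words: observing the 5-year condition raises the probability that the preoperative condition was present from 15% to roughly 59%.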

In the current study, prediction of the 5-year HRQoL scores was conducted using a Gaussian BN (GBN), because the scores follow or approximate a normal distribution [25]. The GBN is a special directed graphical model that offers algorithms for prediction and inference when all variables can be described by a Gaussian prior distribution or a Gaussian conditional distribution [27,28]. Binary predictors were transformed into continuous propensity scores using MLR before they entered the GBN [29]. The performance of the GBN was compared with that of the previous CNN [15] and a traditional linear regression model.
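The binary-to-continuous transformation can be sketched as follows. This is a simplified illustration on synthetic data with hypothetical variable names: a logistic model of the binary comorbidity given the continuous covariates is fitted (here by plain gradient ascent in NumPy, standing in for the MLR used in the paper), and the predicted probability replaces the 0/1 variable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(43, 11, n)                     # synthetic covariates
bmi = rng.normal(42, 5, n)
true_logit = -12 + 0.10 * age + 0.18 * bmi      # assumed data-generating model
t2d = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Fit P(t2d = 1 | age, bmi) by logistic regression (plain gradient ascent
# on standardized covariates plus an intercept)
X = np.column_stack([np.ones(n),
                     (age - age.mean()) / age.std(),
                     (bmi - bmi.mean()) / bmi.std()])
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (t2d - p) / n              # log-likelihood gradient step

t2d_score = 1 / (1 + np.exp(-X @ w))            # continuous score in (0, 1)
```

The resulting `t2d_score` can then enter the GBN in place of the binary variable.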

Prediction of the 5-year comorbidities was conducted using both a multinomial discrete BN (DBN) and MLR, and the results from the two methods were compared. Before entering the DBN, the continuous predictors were discretized into ten categories using the information-preserving discretization method introduced by Hartemink [30]. Although it comes at the cost of losing some information, discretization can accommodate skewness of the variables and nonlinear relationships between them, and it speeds up the computation substantially [25,31].
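As a simplified stand-in for Hartemink's information-preserving method (which iteratively merges bins to preserve mutual information), the sketch below shows plain equal-frequency discretization of a synthetic HRQoL-like score into ten categories:

```python
# Simplified quantile (equal-frequency) discretization into ten categories.
# This is NOT Hartemink's method, only an illustrative stand-in; the scores
# are synthetic.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(50, 20, 1000).clip(0, 100)      # synthetic 0-100 scores

edges = np.quantile(scores, np.linspace(0, 1, 11))  # 10 bins -> 11 edges
cats = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, 9)
```

Each observation is mapped to one of the categories 0–9, with roughly equal numbers of observations per category.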

#### *2.3. Model Training and Validation*

In total, 80% of the patients were randomly selected as a training dataset for learning the structures of the GBN and the DBN. When learning the structures of the networks, only a black list was used, blocking edges directed from the postoperative variables to the preoperative variables; no other constraints were imposed. The hill-climbing (HC) algorithm was used to learn the structure of the networks: it starts from a network with no edges and then adds, removes, or reverses one edge at a time, picking the change that increases the network's Bayesian information criterion score the most [25].
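A toy version of this score-based search can be sketched as follows. The data and variable names are synthetic, the moves are limited to edge additions and removals (the full HC algorithm also reverses edges), and the score is the linear-Gaussian BIC; edges from the "5-year" variable back to the preoperative variables are black-listed, mirroring the paper's constraint:

```python
# Toy hill-climbing structure search with a blacklist (synthetic data,
# hypothetical variable names; illustrative only).
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 400
data = {"bmi_p": rng.normal(size=n)}
data["pcs_p"] = -0.5 * data["bmi_p"] + rng.normal(size=n)
data["pcs_5y"] = 0.7 * data["pcs_p"] + rng.normal(size=n)
names = list(data)
blacklist = {("pcs_5y", "bmi_p"), ("pcs_5y", "pcs_p")}  # no 5y -> preop edges

def node_bic(child, parents):
    """Linear-Gaussian BIC contribution of one node given its parents."""
    y = data[child]
    X = np.column_stack([np.ones(n)] + [data[p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (len(parents) + 2) * np.log(n)  # BIC penalty

def score(dag):                      # dag maps child -> set of parent names
    return sum(node_bic(c, sorted(ps)) for c, ps in dag.items())

def acyclic(dag):
    gray, black = set(), set()
    def visit(v):
        if v in black: return True
        if v in gray: return False   # back edge -> cycle
        gray.add(v)
        ok = all(visit(p) for p in dag[v])
        black.add(v)
        return ok
    return all(visit(v) for v in dag)

dag = {v: set() for v in names}      # start from the empty network
current = score(dag)
while True:
    best_s, best_dag = current, None
    for a, b in itertools.permutations(names, 2):
        if (a, b) in blacklist:
            continue
        trial = {v: set(ps) for v, ps in dag.items()}
        if a in trial[b]:
            trial[b].discard(a)      # candidate move: remove edge a -> b
        else:
            trial[b].add(a)          # candidate move: add edge a -> b
        if not acyclic(trial):
            continue
        s = score(trial)
        if s > best_s:
            best_s, best_dag = s, trial
    if best_dag is None:             # no single-edge change improves the score
        break
    dag, current = best_dag, best_s  # greedily take the best change
```

On this toy data, the search recovers the dependence of `pcs_5y` on `pcs_p` while the blacklist keeps all edges pointing forward in time.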

The remaining 20% of the patients were used as the validation dataset to evaluate the performance of the Bayesian networks, the CNN, and the multivariable linear and logistic regression models. The performance of the GBN was evaluated using the mean squared error (MSE), in view of the existence of zero values in the outcome variables [32]. The MSE of the min-max normalized scores (between 0 and 1) was used to compare the results of the GBN with those of the previous multivariable linear regression and the CNN [33]. The performance of the DBN and the MLR was evaluated using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve. The terminology and derivations of these metrics are given in detail elsewhere [33]. A successful prediction model for comorbidities was defined as one with an area under the ROC curve (AUC) greater than 0.7 [33–35].
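The two headline metrics can be sketched on synthetic predictions: MSE after min-max normalization for the continuous scores, and AUC for a binary outcome via the rank (Mann-Whitney) formula. The data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Continuous outcome: normalize truth and prediction to [0, 1], then MSE
y_true = rng.uniform(0, 100, 200)
y_pred = y_true + rng.normal(0, 8, 200)      # noisy synthetic predictions
lo, hi = y_true.min(), y_true.max()
mse = np.mean(((y_true - lo) / (hi - lo) - (y_pred - lo) / (hi - lo)) ** 2)

# Binary outcome: AUC from the ranks of a continuous risk score
# (rank formula, equivalent to the Mann-Whitney U; no tie handling)
y_bin = rng.binomial(1, 0.3, 200)
risk = (2 * y_bin - 1) + rng.normal(0, 1.0, 200)   # higher for true cases
ranks = risk.argsort().argsort() + 1.0             # 1-based ranks
n_pos, n_neg = y_bin.sum(), (1 - y_bin).sum()
auc = (ranks[y_bin == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Normalizing before computing the MSE is what makes errors comparable across scales with different ranges, which is how the GBN, linear regression, and CNN results are compared in Table 2.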

#### *2.4. Software and Hardware*

The descriptive statistical analyses were performed using Stata 16.0 (StataCorp LLC, College Station, TX, USA). The Bayesian networks were constructed using the *bnlearn* package [25,36] in R version 3.6.2 (R Foundation for Statistical Computing, Vienna, Austria), and the multivariable linear and logistic regression models were fitted in R as well.

All of the computation was conducted on a computer with a 64-bit Windows 7 Enterprise operating system (Service Pack 1), an Intel® Core™ i5-4210U CPU @ 2.40 GHz, and 16.0 GB of random access memory.

#### **3. Results**

#### *3.1. Structure of the GBN*

The structure of the GBN for predicting the postoperative 5-year HRQoL is shown as the DAG in Figure 1, which displays all the edges found by the HC algorithm. The DAG appears complex because it indicates all the contributors to each postoperative 5-year variable at the same time. For example, the possible direct contributors to the 5-year OP are preoperative T2D, BMI, age, OP, and PCS, together with 5-year GH, PCS, SF, PF, and MH. The conditional distribution of the 5-year OP can therefore be presented as:

$$\text{OP}_{5y} \mid (\text{age}_p = x_1,\ \text{BMI}_p = x_2,\ \text{DM}_p = x_3,\ \text{OP}_p = x_4,\ \text{PCS}_p = x_5,\ \text{GH}_{5y} = x_6,\ \text{PCS}_{5y} = x_7,\ \text{SF}_{5y} = x_8,\ \text{PF}_{5y} = x_9,\ \text{MH}_{5y} = x_{10}) \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10},\ \varepsilon^2)$$

where $N$ denotes a normal distribution with variance $\varepsilon^2$.

The probability distribution above is just one of the conditional Gaussian distributions proposed by the DAG in Figure 1, and we can construct the conditional distribution for any one of the eleven postoperative 5-year HRQoL scores based on the edges pointing to them.

**Figure 1.** The directed acyclic graph (DAG) of the GBN for predicting postoperative 5-year HRQoL scores. DAG, directed acyclic graph; GBN, Gaussian Bayesian network; PF, physical functioning; RP, role physical; BP, bodily pain; GH, general health; VT, vitality; SF, social functioning; RE, role emotional; MH, mental health; PCS, summary physical scale; MCS, summary mental scale; OP, obesity-related problems; BMI, body mass index; SAS, sleep apnea syndrome; HT, hypertension; DM, diabetes; Depr, depression; Dyslip, dyslipidemia; \_p, preoperation; \_5y, 5-year.

#### *3.2. Performance of the GBN in the Validation Dataset*

When the models were evaluated using the validation data, which the GBN had not previously seen, the GBN in general performed better than the CNN and the linear regression model (Table 2): all of its MSEs were smaller than those of the CNN, and eight of the eleven MSEs were smaller than those of the linear regression model. For example, the MSE of PCS was only 0.0196 for the GBN, compared with 0.0333 for the CNN (Table 2), meaning that the average prediction error of the GBN accounted for less than 3% of the normalized mean of the postoperative 5-year PCS (0.653). In general, the GBN provided better predictions of postoperative 5-year HRQoL than the CNN and multivariable linear regression did.


**Table 2.** Mean squared errors of the GBN, the CNN, and the multivariable linear regression model.

GBN, Gaussian Bayesian network; CNN, convolutional neural network; PF, physical functioning; RP, role physical; BP, bodily pain; GH, general health; VT, vitality; SF, social functioning; RE, role emotional; MH, mental health; PCS, summary physical scale; MCS, summary mental scale; OP, obesity-related problems.

#### *3.3. Structure of the DBN*

The structure of the DBN for predicting the postoperative 5-year comorbidities is shown as the DAG in Figure 2, which is much simpler than that of the GBN. The comorbidities could be predicted using far fewer preoperative variables. For example, the conditional probability of 5-year depression (Depr\_5y) depended only on sex and preoperative depression (Depr\_p), so it could be predicted from conditional probability tables relating preoperative and postoperative depression for men and women separately. The conditional probability tables needed for men and women were estimated in a Bayesian setting in the DBN. When a comorbidity involved more predictors, as with 5-year dyslipidemia, more conditional probability tables had to be consulted for prediction. Interestingly, preoperative BMI was not involved in any potential causal relationships in the network regarding the postoperative 5-year comorbidities (Figure 2).
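The table-lookup prediction described above can be sketched on synthetic data. Here the table is estimated by simple cross-tabulation (the paper estimates the tables in a Bayesian setting), and all prevalences are made up for illustration:

```python
# Sketch of a conditional probability table P(Depr_5y = 1 | sex, Depr_p),
# estimated by cross-tabulation on synthetic data; all numbers are invented.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
sex = rng.binomial(1, 0.79, n)            # 1 = female, roughly the SOReg share
depr_p = rng.binomial(1, 0.10, n)         # preoperative depression
p_true = 0.05 + 0.45 * depr_p + 0.02 * sex
depr_5y = rng.binomial(1, p_true)         # 5-year depression

# One table entry per (sex, Depr_p) configuration
cpt = {(s, d): depr_5y[(sex == s) & (depr_p == d)].mean()
       for s in (0, 1) for d in (0, 1)}

# Prediction for a new patient reduces to a table lookup
pred_female_depressed = cpt[(1, 1)]
```

Because Depr\_5y has only two discrete parents, a four-entry table fully specifies its conditional distribution; comorbidities with more parents simply need larger tables.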

**Figure 2.** The DAG of the DBN for predicting postoperative 5-year comorbidities. DAG, directed acyclic graph; DBN, discrete Bayesian network; PF, physical functioning; RP, role physical; BP, bodily pain; GH, general health; VT, vitality; SF, social functioning; RE, role emotional; MH, mental health; PCS, summary physical scale; MCS, summary mental scale; OP, obesity-related problems; BMI, body mass index; SAS, sleep apnea syndrome; HT, hypertension; DM, diabetes; Depr, depression; Dyslip, dyslipidemia; \_p, preoperation; \_5y, 5-year.

#### *3.4. Performance of the DBN in the Validation Dataset*

The DBN showed excellent predictive ability for 5-year T2D and dyslipidemia (AUC = 0.942 and 0.917, respectively), good ability for 5-year hypertension and SAS (AUC = 0.891 and 0.834, respectively), and fair ability for 5-year depression (AUC = 0.750) (Figure 3).

**Figure 3.** Receiver operating characteristic (ROC) curve of the discrete Bayesian network (DBN) for predicting 5-year comorbidity after bariatric surgery.

Compared with the results from the MLR, the DBN presented significant improvement in predicting 5-year comorbidities. All the AUCs from the DBN were larger than those from the MLR, and the differences were statistically significant (*p* < 0.05), except for SAS (Table 3). The sensitivity and specificity of the DBN in predicting postoperative 5-year T2D could be as high as 0.96 and 0.89, in contrast to the 0.78 and 0.68 of the MLR, respectively (Table 3).

**Table 3.** Performance metrics of the DBN and MLR model for predicting the 5-year comorbidities.


DBN, discrete Bayesian network; MLR, multivariable logistic regression; Sen, sensitivity; Spe, specificity; Acc, accuracy; AUC, area under the ROC curve; CI, confidence interval; SAS, sleep apnea syndrome; T2D, type 2 diabetes.

#### **4. Discussion**

In this study, we explored the application of Bayesian networks to predicting long-term outcomes after bariatric surgery in a national registry. The networks showed promising predictive ability for both continuous and binary outcomes. For the postoperative 5-year HRQoL, the GBN had smaller MSEs than the CNN for all scores and smaller MSEs than the traditional multivariable linear regression for most scores. The most accurate predictions from the GBN were seen for PCS, followed by PF and MCS; the average prediction errors were lower than 3%, 4%, and 6% of their normalized means, respectively. For the postoperative 5-year comorbidities, the DBN performed statistically significantly better than the MLR, showing good or even excellent predictive ability for four of the five comorbidities, with an AUC as high as 0.942 for postoperative T2D.

Bayesian networks use Bayesian inference to model conditional dependence, and, under appropriate assumptions, causation, via a DAG. They are well suited to taking an event that has occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. Experience has shown that Bayesian networks and associated methods support reasoning under uncertainty in a way that closely resembles how physicians reason [37–39]. Physicians who aim to develop computer-assisted systems for making clinical decisions are frequently confronted by the complexity and uncertainty of the models and predictions; in many cases, the situation is even worse, as many processes in medicine are only partly understood [38]. During the past decade, Bayesian networks have become important tools for building decision-support systems in the medical sciences and are now steadily becoming mainstream in some areas [40]. However, we should note that DAGs are not designed to capture cyclic patterns, such as depression causing increased BMI and vice versa [41]. Potential cyclic causal relationships may be explored using cyclic structural equation models [42] or Markov networks [43].

Many methods have been applied to predict outcomes in patients after bariatric surgery, including stepwise multivariable linear regression [44,45], MLR [46], and machine learning methods such as the decision tree [47] and the CNN [15,33]. Although an intelligent decision-support system involving Bayesian networks has been reported for the nutrition diagnosis of bariatric surgery patients [48], according to our literature search, no study has used the method for predicting outcomes after bariatric surgery. In our previous study, we illustrated that the CNN might be a useful tool for predicting long-term HRQoL after bariatric surgery; however, its overfitting on the external validation dataset was still noticeable. To further mitigate the overfitting issue commonly seen in the machine learning field, we explored the application and performance of Bayesian networks in the current study and achieved the desired results.

A significant clinical advantage of the study is that it provides a way to predict outcomes as far as 5 years after bariatric surgery. Giving realistic and relevant information about the long-term prognosis of bariatric surgery is currently challenging. This type of knowledge can be used in clinical practice to give scientifically based preoperative information to patients considering the surgery. It can also help provide scientifically based information to health care policy makers to explain the expected positive effects of bariatric surgery, and it can be used to customize the follow-up of individual patients. However, we would also note that this kind of prediction should not be used to exclude individual patients who otherwise fulfil the criteria for surgery from having an operation. Moreover, given the relatively small sample size compared with those usually recommended in statistical learning studies, it would be premature to use the models presented in this study in clinical decision-making right now.

There are several advantages to Bayesian networks. First, methods commonly used in epidemiological studies, such as logistic regression, do not take account of causal relationships that may exist between the covariates. Causal relationships between some of the risk factors may already be known, or may be regarded as plausible on biological grounds [49,50]. Such information was incorporated into our BN models to reveal the potential relationships between health or disease status and the associated risk factors [51]. Second, high correlation among predictors has long been an annoyance in regression analysis. The crux of the problem is that the linear regression model assumes each predictor has an independent effect on the response that can be encapsulated in the predictor's regression coefficient. Rather than creating problems of multicollinearity, the associations between candidate predictor variables are naturally accounted for when defining a BN's conditional probability distributions. The HC algorithm used in the study may start its search from an empty, full, or random DAG, or from an initial DAG chosen according to existing knowledge. The main loop then consists of attempting every possible single-edge addition, removal, or reversal relative to the current candidate network; the change that increases the score the most becomes the next candidate, and the process iterates until no single-edge change increases the score. By gradually taking into account the relationships between the variables, the problem of multicollinearity can therefore be reduced in a BN analysis [52]. Third, the DAG proposed by the BN method captures the dependence structure of multiple variables and, used appropriately, allows more robust conclusions about the direction of causation. The BN analysis revealed a richer structure of relationships than could be inferred using traditional multivariable regression methods, such as logistic regression, and highlighted a potential pathway, unseen previously, for further investigation [53]. Fourth, compared with the deep learning method (CNN) used in our previous study for predicting HRQoL scores, the GBN provided much faster computing, better performance, and interpretable results. Finding the final DAG with the HC algorithm using 35 variables and 6542 observations took only 2 min in the GBN analysis, in contrast to about 10 min in the CNN analysis [33]. Apart from the output HRQoL scores, the contributions of and relationships between the variables could not be explained, or were hard to explain, in the CNN analysis. In contrast, the GBN showed us all the potential causal relationships between the variables and estimated the strength of those relationships using linear regression coefficients.

However, there are limitations to our study. Our dataset includes both continuous and binary variables. To reduce the complexity of the networks and the computing time, we converted the binary variables to continuous propensity scores for the GBN analysis and discretized the HRQoL scores into categorical variables for the DBN analysis, which may distort or lose some information. A better solution would be a hybrid BN using Markov chain Monte Carlo techniques [25]. Although we were limited by the available software packages and adopted these compromise methods, we would like to explore the hybrid BN in the future and see whether it could further improve the predictive performance. In addition, even though HRQoL and comorbidities are important, we have not examined hard endpoints, such as survival, heart attack, stroke, and cancer, which warrants a subsequent study when more detailed data are available. We should also note that this study only included patients who underwent Roux-en-Y gastric bypass, since this was almost the only operation method used in Sweden during the study period. Whether the results apply to other methods, such as sleeve gastrectomy, is not yet known; however, we will be able to investigate this in the future, since SOReg has accumulated a large number of sleeve gastrectomy patients in recent years. Moreover, there are many more females than males in the database (80% vs. 20%), so the generalizability of the BN models might be limited by the sex imbalance. Meanwhile, the menopausal transition can be an important factor related to HRQoL in women [54]. Given that the average age at 5 years after surgery, which has a wide standard deviation, falls right around the typical age of menopause, this issue deserves clarification and assessment by incorporating menopause information for women. Therefore, the applicability and validity of the models need to be further explored using a larger, representative dataset with more covariates and longer follow-up.

#### **5. Conclusions**

Bayesian networks provide useful tools for predicting long-term HRQoL and comorbidities in patients after bariatric surgery, based on their preoperative health and disease status. The GBN and DBN used in our study outperformed the deep learning method (CNN) and multivariable logistic regression. The hybrid network, which may involve variables from different probability distribution families, deserves investigation in the future.

**Author Contributions:** All authors have read and agreed to the published version of the manuscript. Conceptualization, Y.C.; data curation, M.R.; formal analysis, Y.C.; funding acquisition, Y.C.; investigation, M.R., E.S., J.O., and I.N.; methodology, Y.C.; project administration, J.O. and I.N.; resources, E.S., J.O., and I.N.; software, Y.C.; validation, Y.C.; visualization, Y.C.; writing—original draft, Y.C. and J.O.; writing—review and editing, Y.C., M.R., E.S., J.O., and I.N.

**Funding:** Yang Cao's work was supported by Örebro Region County Council (OLL-864441). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Vida Abedi <sup>1,2</sup>, Venkatesh Avula <sup>1</sup>, Durgesh Chaudhary <sup>3</sup>, Shima Shahjouei <sup>3</sup>, Ayesha Khan <sup>3</sup>, Christoph J. Griessenauer <sup>3,4</sup>, Jiang Li <sup>1</sup> and Ramin Zand <sup>3,\*</sup>**


**Abstract:** Background: The long-term risk of recurrent ischemic stroke, estimated to be between 17% and 30%, cannot be reliably assessed at an individual level. Our goal was to study whether machine learning models can be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics can be optimized. Methods: We used patient-level data from electronic health records, six interpretable algorithms (Logistic Regression, Extreme Gradient Boosting, Gradient Boosting Machine, Random Forest, Support Vector Machine, Decision Tree), four feature selection strategies, five prediction windows, and two sampling strategies to develop 288 models for up to 5-year stroke recurrence prediction. We further identified important clinical features and different optimization strategies. Results: We included 2091 ischemic stroke patients. The model area under the receiver operating characteristic (AUROC) curve was stable for prediction windows of 1, 2, 3, 4, and 5 years, with the highest score for the 1-year window (0.79) and the lowest for the 5-year window (0.69). A total of 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC greater than 0.7. Among the 53 features analyzed, age, body mass index, and laboratory-based features (such as high-density lipoprotein, hemoglobin A1c, and creatinine) had the highest overall importance scores. The balance between specificity and sensitivity improved through sampling strategies. Conclusion: All six selected algorithms could be trained to predict long-term stroke recurrence, and laboratory-based variables were highly associated with stroke recurrence; the latter could be targeted for personalized interventions. Model performance metrics could be optimized, and the models can be implemented in the same healthcare system as intelligent decision support for targeted intervention.

**Keywords:** healthcare; artificial intelligence; machine learning; interpretable machine learning; explainable machine learning; ischemic stroke; clinical decision support system; electronic health record; outcome prediction; recurrent stroke

#### **1. Introduction**

Predictive modeling of stroke, the leading cause of death and long-term disability [1], is crucial due to its high individual and societal impact. Each year, about 800,000 people experience a new or recurrent stroke in the United States [2]. It has been estimated that the 5-year risk of stroke recurrence is between 17% and 30% [3,4]. Recurrent stroke carries higher rates of death and disability [5]. Therefore, identifying patients who are at a higher risk of recurrence can help care providers prioritize and define more vigorous secondary prevention plans for those at risk, especially when resources are limited.

**Citation:** Abedi, V.; Avula, V.; Chaudhary, D.; Shahjouei, S.; Khan, A.; Griessenauer, C.J.; Li, J.; Zand, R. Prediction of Long-Term Stroke Recurrence Using Machine Learning Models. *J. Clin. Med.* **2021**, *10*, 1286. https://doi.org/10.3390/jcm10061286

Academic Editor: Nandu Goswami

Received: 30 January 2021 Accepted: 16 March 2021 Published: 20 March 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

To date, several predictive models of recurrent stroke using regression or other statistical methods have been developed; however, the clinical utility of these models tends to be limited due to the narrow scope of variables used in them [6]. In a recent study, multivariable logistic models of 1-year stroke recurrence, developed on 332 patients using 20 clinical and retinal variables, showed promising results with an area under the receiver operating characteristic (AUROC) curve of 0.71–0.74 [7]. Large real-world patient-level data from electronic health records (EHR) and machine learning (ML) methods can be leveraged to capture a greater number of features and help build better prediction models [8]. In a recent study of 2604 patients, ML was successfully used to predict favorable outcome three months after an acute stroke [9]. We have also shown that ML can be used for flagging stroke patients in the emergency setting [10–12].

The present study aimed to use rich longitudinal data from EHR to construct an ML-enabled model of long-term (up to 5 years) recurrent stroke. We evaluated Extreme Gradient Boosting (XGBoost), Gradient Boosting Machine (GBM), Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT), and benchmarked these algorithms' performance against Logistic Regression (LR), as these are interpretable models whose feature importance can be extracted for further validation and assessment by care providers. We hypothesized that (1) all of the modeling algorithms can be trained to predict long-term stroke recurrence, (2) a wide range of clinical features associated with stroke recurrence can be identified, and (3) performance metrics can be improved through sampling processes.

#### **2. Methods**

All of the relevant codes developed as well as summary data generated for this project can be found at https://github.com/TheDecodeLab/GNSIS\_v1.0/tree/master/ ModelingStrokeRecurrence (accessed on 19 March 2021).

#### *2.1. Data Source*

Database description and processing: This study was based on data extracted from the Geisinger EHR system, the Geisinger Quality database, and the Social Security Death database to build a stroke database called "Geisinger Neuroscience Ischemic Stroke (GNSIS)" [13]. GNSIS includes demographic, clinical, and laboratory data from ischemic stroke patients from September 2003 to May 2019. The study was reviewed and approved by the Geisinger Institutional Review Board as "non-human subject research" because it used de-identified information.

The GNSIS database was created based on a high-fidelity, data-driven phenotype definition for ischemic stroke developed by our team. Patients were included if they had a primary hospital discharge diagnosis of ischemic stroke; a brain magnetic resonance imaging (MRI) scan during the same encounter to confirm the diagnosis; and an overnight stay in the hospital. The diagnoses were based on International Classification of Diseases, Ninth/Tenth Revision, Clinical Modification (ICD-9-CM/ICD-10-CM) codes. For each index stroke, the following data elements were recorded: (1) date of the event, (2) age of the patient at the index stroke, (3) encounter type, (4) ICD code and corresponding primary diagnosis of the index stroke, (5) presence or absence (and date) of recurrent stroke, and (6) ICD code and corresponding primary diagnosis for the recurrent stroke. Other data elements include sex, birth date, death date, last medical visit within the Geisinger system, presence or absence of comorbidities, presence or absence of a family history of heart disorders or stroke, and smoking status. In the case of multiple encounters due to recurrent cerebral infarcts, the first hospital encounter was considered the index (first-time) stroke. To improve the accuracy of comorbidity information based on ICD-9-CM/ICD-10-CM diagnoses, either two outpatient visits or one inpatient visit was required to assign a diagnosis code to a patient. Our database interfaces with the Social Security Death Index on a biweekly basis to reflect updated information on vital status. Manual validation of a random set of patients, including review of the MRI, indicated that the diagnosis of acute ischemic stroke in the GNSIS database had a specificity of 100%.

Data pre-processing: Units were verified and reconciled where needed, and distributions of variables were assessed over time to ensure data stability. The range for each variable was defined according to expert knowledge and the available literature, and outliers were assessed and removed. To ensure that patients were active, the last encounter of each patient was recorded.

#### *2.2. Study Population*

For this study, we excluded patients with recurrent stroke within 24 days of the index stroke. We organized the included patients into six groups: one control group and five case groups. The control group consisted of patients who did not have a stroke recurrence during the 5-year follow-up. Case groups 1, 2, 3, 4, and 5 comprised patients who had a recurrent stroke between 24 days and 1, 2, 3, 4, and 5 years, respectively. The 24-day cut-off was selected to ensure that the recurrent stroke was independent of the index stroke; as our data demonstrate, the number of stroke recurrences stabilizes after approximately 24 days (Figure 1A). Nevertheless, we repeated the analysis including the patients with a stroke recurrence within 24 days for comparison. Patients with stroke-related or other vascular death may have been excluded from this study if they did not meet the inclusion/exclusion criteria stated above.

**Figure 1.** (**A**) Flow-chart of inclusion-exclusion of subjects in cases and control group in the study. Patients in the control group had available records in the electronic health record for at least 5 years and no documented stroke recurrence within 5 years. Distribution panel shows the number of recurrences over time. At 24 days, the number of recurrent cases can be seen to approach a plateau. (**B**) The design strategy for predicting stroke recurrence using electronic health records (EHR), Geisinger Quality database as well as Social Security Death database.

#### *2.3. Data Processing, Feature Extraction, and Sampling*

Training-testing set: Each of the cases and control groups was randomly split into 80:20 training and testing sets.
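The per-group 80:20 split can be illustrated with a stratified random split, which preserves each group's recurrence rate in both partitions. This is a minimal scikit-learn sketch on synthetic data, not the authors' actual pipeline (whose implementation language is not stated here); all names and numbers in the snippet are illustrative.

```python
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.RandomState(42)
# Toy stand-in for one case/control group: 100 patients, 53 features each
X = rng.normal(size=(100, 53))
y = np.r_[np.zeros(80, dtype=int), np.ones(20, dtype=int)]  # 20% recurrences

# 80:20 split, stratified so the recurrence rate is the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 53) (20, 53)
```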

Imputation: A total of 53 features were used. Table 1 includes data on their missingness. Imputation of the missing values was performed separately on the training and testing sets using the Multivariate Imputation by Chained Equations (MICE) package [14]. The quality of the imputations was examined using t-tests, summary statistics, and strip and density plots of the features with missing values, to ensure that the distribution of each variable was comparable before and after imputation. Only four variables suffered from relatively high levels of missingness.
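The cited MICE package is an R tool; as an illustrative analogue only, scikit-learn's `IterativeImputer` applies a similar chained-equations idea. Following the paper's description, each set is imputed independently here; the data are synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
train = rng.normal(size=(200, 5))
test = rng.normal(size=(50, 5))
# Knock out ~10% of values to simulate missingness
for arr in (train, test):
    arr[rng.rand(*arr.shape) < 0.10] = np.nan

# The paper imputed the training and testing sets separately; each set is
# therefore imputed independently with a chained-equations-style imputer
train_imp = IterativeImputer(random_state=0).fit_transform(train)
test_imp = IterativeImputer(random_state=0).fit_transform(test)
print(np.isnan(train_imp).sum(), np.isnan(test_imp).sum())  # 0 0
```

After imputation, comparing summary statistics (means, quantiles) of each column before and after, as the authors did, guards against the imputer distorting the variable distributions.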

**Table 1.** Patient demographics, past medical and family history in different groups. Detailed description of the variables is provided in the Geisinger Neuroscience Ischemic Stroke (GNSIS) study [13]. IQR: interquartile range; HDL: high-density lipoprotein; LDL: low-density lipoprotein.




Feature selection: We performed feature selection using different strategies. The feature sets were: Set 1: all features; Set 2: all features except medication history; Set 3: features selected by at least two data-driven strategies; and Set 4: minimum set, obtained as the intersect of Set 2 and Set 3 (Table S1). The full set of features (Sets 1, 2) were selected based on clinical expertise and previous studies [6,15]. Feature selection (Sets 3, 4) was performed based on three data-driven approaches for each set of case-control.

The data-driven approaches were: (1) filter-based methods, including Pearson correlation [16] and univariate filtering; (2) embedded methods, including RF [17] and Lasso regression [18]; and (3) wrapper methods, including the Boruta algorithm [19] and recursive feature elimination. Feature importance scores were scaled between zero and 100, with higher scores representing higher variable contributions. Using the reduced set of features ensures that variables with high collinearity are removed.
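The "selected by at least two data-driven strategies" rule (Set 3) can be sketched as a simple vote across selectors. This Python sketch uses one representative of each family (a Pearson-correlation filter, an L1-penalized model, and RF importances); the thresholds and dataset are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# Approach 1: filter -- absolute Pearson correlation with the outcome
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
sel_filter = set(np.argsort(corr)[-5:])

# Approach 2: embedded -- L1-penalized logistic regression (Lasso-style)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
sel_lasso = set(np.flatnonzero(lasso.coef_[0]))

# Approach 3: embedded -- random forest impurity importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
sel_rf = set(np.argsort(rf.feature_importances_)[-5:])

# Keep features selected by at least two of the three strategies (Set 3 analogue)
votes = {j: (j in sel_filter) + (j in sel_lasso) + (j in sel_rf)
         for j in range(X.shape[1])}
set3 = sorted(j for j, v in votes.items() if v >= 2)
print("Selected features:", set3)
```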

Sampling: The training dataset, after applying the case-control definition, was imbalanced. Many classification models trained on class-imbalanced data are biased towards the majority class. To avoid poor performance on the minority class (recurrent stroke) compared with the dominant class, we balanced the number of cases and controls using up-sampling and down-sampling methods. We applied the up-sampling method to the prediction windows with the lowest and median rates of stroke recurrence, and down-sampling to the prediction window with the median rate of stroke recurrence. For up-sampling, we used the Synthetic Minority Over-sampling Technique (SMOTE) [19]. For down-sampling, we randomly selected patients from the control group.
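The two balancing directions can be sketched with plain resampling. Note the authors used SMOTE for up-sampling, which synthesizes new minority examples; the snippet below substitutes simple random over-sampling with replacement purely to illustrate the class-ratio mechanics, on synthetic data.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.r_[np.zeros(85, dtype=int), np.ones(15, dtype=int)]  # 15% minority

Xmaj, Xmin = X[y == 0], X[y == 1]

# Up-sampling: draw minority cases with replacement until a 1:1 ratio
# (the paper used SMOTE here, which interpolates new synthetic cases instead)
Xmin_up = resample(Xmin, replace=True, n_samples=len(Xmaj), random_state=0)
X_up = np.vstack([Xmaj, Xmin_up])
y_up = np.r_[np.zeros(len(Xmaj), dtype=int), np.ones(len(Xmin_up), dtype=int)]

# Down-sampling: randomly drop majority cases to match the minority count
Xmaj_dn = resample(Xmaj, replace=False, n_samples=len(Xmin), random_state=0)
X_dn = np.vstack([Xmaj_dn, Xmin])
print(np.bincount(y_up))  # balanced: [85 85]
```

Either direction is applied to the training set only, so the test set keeps the natural recurrence rate and the reported metrics remain honest.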

#### *2.4. Model Development and Testing*

We used six interpretable ML algorithms and four feature sets to develop classification models for the 1, 2, 3, 4, and 5-year recurrence prediction windows, yielding 24 models per prediction window. The ML algorithms included LR [20], XGBoost [21], GBM [22], RF [17], SVM [23], and DT [24]. We included SVM, LR, and DT because they provide benchmarking metrics as well as better flexibility for implementation with cloud-based EHR vendors. Simpler and faster models could therefore provide strategic alternatives for future implementation if the results from this study indicate, as in other studies [25], that including a large number of features lets models converge to the point of algorithm indifference (or marginal improvements). A parameter grid was built to train each model with 10-fold cross-validation (CV) repeated 10 times. In addition, 5-fold repeated CV was performed for the prediction window with the median rate of stroke recurrence. Model tuning was performed by an automatic grid search with 10 randomly chosen values for each algorithm parameter. For each model, we used 20% of the data for testing and calculated specificity, recall (sensitivity), precision (positive predictive value, PPV), AUROC, F1 score, accuracy, and the computation time for model training. The modeling pipeline is summarized in Figure 1B.
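The tuning scheme (repeated stratified CV plus a randomized parameter search scored by AUROC) can be sketched in scikit-learn. The fold counts and parameter values below are shrunk so the sketch runs quickly; the paper used 10-fold CV with 10 repeats and 10 random values per parameter, and its actual implementation may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=300, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

# Repeated stratified CV keeps the recurrence rate stable across folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_features": [2, 4, 8], "n_estimators": [50, 100]},
    n_iter=3, scoring="roc_auc", cv=cv, random_state=0,
)
search.fit(X, y)
print("best AUROC:", round(search.best_score_, 3))
```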

#### **3. Results**

All of the detailed summary results, with comprehensive performance metrics, feature importance, and computation time for the 288 models in this project, are provided as Supplementary Information (see Tables S1–S3).

#### *3.1. Patient Population and Characteristics*

A total of 2091 adult patients met the inclusion criteria; 114 patients had a recurrent stroke within 24 days of their index stroke and were excluded from the analysis (Figure 1A). Of the 2091 patients, 51.6% were men. The median age was 68.1 years (IQR (interquartile range) = 58–77). The three most common comorbidities were hypertension (72%), dyslipidemia (62%), and diabetes (29%). Table 1 includes the patients' demographics and past medical history. The rate of stroke recurrence was 11%, 16%, 18%, 20%, and 21% at the 1, 2, 3, 4, and 5-year windows, respectively.

This study was based on 53 features. Table S1 summarizes the results from the feature selection process. Age, sex, BMI, systolic blood pressure, hemoglobin, high-density lipoprotein (HDL), creatinine, smoking status, chronic heart failure, chronic kidney disease, diabetes, hypertension, and peripheral vascular disease were selected by all of the different data-driven approaches for the five different case-control designs.

#### *3.2. Models Can Be Trained to Predict the Long-Term Stroke Recurrence*

Model AUROC was stable across the five case-control designs, with the highest score for the 1-year prediction window and the lowest for the 5-year window (Figure 2, Table S2). The best AUROC for the 1-year prediction window was 0.79 (Table S2, model 63). The top ten models (AUROC: 0.79–0.74) were all from the 1-year prediction window. The best AUROCs for the 2, 3, 4, and 5-year prediction windows were 0.70, 0.73, 0.73, and 0.69, respectively. Furthermore, when comparing the feature sets included in the models, the AUROC was highest when all of the features were used. The variation in AUROC across the various study windows and feature sets was highest for DT, while the score variance was lowest for RF. The ROC curves for the different models are shown in Figure 3 for the 1-year prediction window.

Based on accuracy, the RF model (mtry = 14) using 26 features (Set 3) had the best performance for the 1-year prediction window (accuracy: 90% (95% CI: 86%–92%), PPV: 80%, specificity: 100%). The average accuracy across the six models and four feature sets was 88% (Table S2, models 1–24). The prediction accuracy decreased as the prediction window widened to 2 years (average accuracy: 85%), with the best accuracy reached by LR (86%, 95% CI: 82%–89%), a PPV of 80%, and a specificity of 99% (Table S2, model 79). The average accuracy was 82% for the 3- and 4-year prediction windows, and 78% for the 5-year prediction window.

Out of the 24 models for the 1-year prediction window, one model reached a perfect PPV, while 11 models reached a 100% specificity. For the 2-year prediction window, 7 out of the 24 models reached a PPV of 100% while 9 reached a specificity of 100%. Overall, models based on all features had higher PPV. Model sensitivity and specificity had the best tradeoff when GBM was used. The highest model sensitivity was achieved using both DT and GBM, while the best specificity was achieved using RF, SVM, and XGBoost. When we compared the 3-year prediction window with and without the 24 days cut-off, the average AUROC, sensitivity, and specificity were unaffected; however, the average model accuracy was reduced by 5% when excluding the 24 days interval. Detailed performance metrics for the 288 models are presented in Table S2.
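The metrics reported throughout this section derive directly from the confusion matrix on the held-out 20%. A small sketch with hypothetical labels and predictions (the numbers are invented for illustration and do not correspond to any model in Table S2):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and predictions for one model
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.array([0] * 78 + [1] * 2 + [0] * 9 + [1] * 11)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # recall
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                  # precision (positive predictive value)
accuracy = (tp + tn) / len(y_true)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
print(sensitivity, specificity)  # 0.55 0.975
```

With a low recurrence rate, a model can post high accuracy and specificity while its sensitivity and PPV stay modest, which is why the paper reports all of these metrics side by side.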

**Figure 2.** Model performance summaries for the five different prediction windows, six different classifiers, and four feature selection approaches. Performance metrics for (**A**–**F**) Decision tree, (**G**–**L**) Gradient Boost, (**M**–**R**) Logistic Regression, (**S**–**X**) Random Forest, (**Y**–**AD**) SVM, and (**AE**–**AJ**) XGBoost.

**Figure 3.** Area under the receiver operating characteristic (AUROC) curve using six classifiers for the 1-year prediction window. Feature Set 3 is used for this figure. (**A**) Model without sampling; (**B**) Model with up-sampling at a 1:2 ratio; (**C**) Model with up-sampling at a 1:1 ratio. The best-performing model (AUROC of 0.79) used up-sampling with the Random Forest algorithm (panel B).

#### *3.3. Age, BMI, and Laboratory Values Highly Associated with Stroke Recurrence*

Age and BMI had the highest overall feature importance, at 90% ± 5% and 58% ± 10%, respectively. Laboratory values, specifically LDL, HDL, platelets, hemoglobin A1c, creatinine, white blood cell count, and hemoglobin, were highly ranked in our different modeling frameworks. The importance of laboratory-based features ranged from 49% ± 10% for HDL to 39% ± 11% for platelets. Laboratory values had an average feature importance score of 44%, the highest among the different feature categories. Medications (statins, antihypertensives, warfarin, and antiplatelets) were also important features. Figure 4 (and Table S3) shows the feature importance for the different models and the overall average feature importance across models and prediction windows. The number of days between the last outpatient visit and the index date (45% ± 12%) and certain comorbidities were other important features for the recurrence models.
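The 0-to-100 importance scores quoted above come from rescaling each model's raw importances, as described in the Methods. A minimal sketch of that rescaling for one random forest, on synthetic data (the scaling formula is a min-max assumption consistent with "scaled between zero and 100"):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Min-max scale the raw impurity importances onto a 0-100 range,
# so scores are comparable across models and prediction windows
imp = rf.feature_importances_
scaled = 100 * (imp - imp.min()) / (imp.max() - imp.min())
ranking = np.argsort(scaled)[::-1]
print([(int(j), round(float(scaled[j]), 1)) for j in ranking[:3]])
```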

**Figure 4.** Feature importance based on the different trained models. (**A**–**E**) Six different classifiers (Gradient Boost, Random Forest, Extreme Gradient Boosting (XGBoost), Decision Trees, Support Vector Machine (SVM), and Logistic Regression) and five different prediction windows were used. (**F**) Average feature importance score across the different models and prediction windows.

#### *3.4. Models' Performance Metrics Improved through Sampling Strategies*

Given the low prevalence of recurrent stroke in our dataset (11–21%), we applied up- and down-sampling to the training dataset for the prediction window prior to the model training.

The application of up-sampling of the minority class at 1:2 and 1:1 ratios for the 1-year prediction window improved the sensitivity to 55% while only slightly reducing the specificity, to 91%. The model AUROC improved from an average of 0.67 before up-sampling to 0.68 after up-sampling, with five of the models reaching an AUROC above 0.75. The AUROC on the test set for the 3-year prediction window remained at 0.69, while the AUROC on the training set improved with up-sampling, as expected (Figure 5, Table S2).

**Figure 5.** Model performance summaries with sampling-based optimization for the 1 and 3-year prediction windows. Up-sampling was performed using the Synthetic Minority Over-sampling Technique (SMOTE). Feature Set 3 is used for this figure. (**A**–**F**) Model without sampling; (**G**–**L**) Model with down-sampling; (**M**–**R**) Model with up-sampling.

#### **4. Discussion**

We have taken a comprehensive approach to develop and optimize interpretable models of long-term stroke recurrence. We have shown that (1) the six algorithms used could be trained to predict the long-term stroke recurrence, (2) many of the clinical features that were highly associated with stroke recurrence could be actionable, and (3) model performance metrics could be optimized.

Multiple clinical scores have been developed for predicting recurrence after cerebral ischemia, with limited clinical utility [6]. Among them, only the Stroke Prognostic Instrument (SPI-II) [26] and the Essen Stroke Risk Score (ESRS) [27] were designed to predict the long-term (up to 2 years) risk of recurrence after an ischemic stroke. SPI-II can be applied to patients with transient ischemic attack (TIA) and minor strokes, whereas ESRS focuses on stroke. The main limitations of SPI-II are its focus on patients with suspected carotid TIA or minor stroke and its development in a cohort of only 142 patients. The ESRS, derived from the stroke subgroup of the clopidogrel versus aspirin in patients at risk of ischemic events (CAPRIE) trial, includes only eight parameters. In a validation study, the PPV of each tool was low, raising questions about their utility [28–30]. Previous validation studies of SPI-II demonstrated a *c*-statistic of 0.62 to 0.65, which can be judged as only fair [26,31,32]. In addition, SPI-II performs poorly in stratifying recurrent stroke in isolation as compared with the composite of recurrent stroke and death, which demonstrates that the SPI-II score's performance is driven mostly by its ability to predict mortality, not recurrence. There is an unmet need for better measures of long-term prediction given the high rate and devastating consequences of recurrent stroke. Other studies over the past few years have shown the power of ML in predicting short- and long-term outcomes in various complex diseases [8,9,25].

#### *4.1. Models Could Be Trained to Predict the Long-Term Stroke Recurrence*

Our results showed that a high-quality training dataset with a rich set of variables can be used to develop models of recurrent stroke. Among the 288 models, prediction of stroke recurrence within the 1-year prediction window reached an AUROC of 0.79, an accuracy of 88% (95% CI: 84%–91%), a PPV of 42%, and a specificity of 96% using RF with up-sampling of the training dataset (Table S2, model 63). The LR-based models had results similar to those of more complex algorithms such as XGBoost or RF. Our results showed that 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC above 0.7. Furthermore, the AUROCs for the training and testing datasets were within a similar range, indicating that the models were not over-fitting. As expected, a model based on LR took a fraction of the training time of XGBoost, RF, or SVM (Table S2).

We tested prediction windows of up to 5 years. Our results showed that the average model accuracy declined from 85% for the 1-year window to 78% for the 5-year window. However, the shorter prediction window had the lowest rate of recurrence and therefore the highest data imbalance, affecting model performance. The average model sensitivity increased as the prediction window widened, likely due to the increase in sample size and recurrent stroke rate. The optimal prediction window could depend on the richness of the longitudinal data used for model training; in our dataset, it was between 2 and 4 years.

#### *4.2. Clinical Features Highly Associated with Stroke Recurrence*

In this study, 53 features were used as the full set (Set 1), followed by a subset excluding medication history (Set 2, 31 features). We also applied feature selection to create a data-driven feature set (Set 3) and a minimum set of features (Set 4) for comparison. In most of the experiments, the more comprehensive feature sets led to higher model performance, even though some features had some level of collinearity. In general, baseline clinical features such as age, BMI, and laboratory values were among the most important features. Our results also highlighted that the timing of the last outpatient visit before the index stroke was important for the prediction of recurrence; patients in the control group had the lowest average number of days when compared to the five case groups.

Analyzing the feature importance revealed that, in general, laboratory values were highly influential in the prediction models. The pattern of feature importance was similar across prediction windows, with many comorbidities and medications having the lowest relative impact. Laboratory values (LDL, HDL, platelets, HbA1c, creatinine, and hemoglobin) and blood pressure were high-ranking for all five prediction windows and all modeling frameworks, with few exceptions. This finding highlights the fact that these potentially actionable features (e.g., HbA1c) may carry more predictive weight than the corresponding comorbidities in the patient's chart. The binary nature of medical history, without the corresponding measures, may have limited power in predicting recurrence. However, one of the main limitations of using more comprehensive laboratory values is missingness, especially when the missingness is not completely at random.

#### *4.3. Model Performance Metrics Optimized Based on the Target Goals*

We have also shown that model performance metrics, such as specificity and sensitivity, can be optimized based on the availability of resources and institutional priorities. We were able to improve the sensitivity of the models for the 1 and 3-year prediction windows by sampling the training dataset to address the data imbalance. The trade-off between specificity and sensitivity is of special interest given that different healthcare systems likely have different constraints, resources, and infrastructures for implementing preventive strategies to reduce stroke burden. Such resources may include the number of providers needed to schedule follow-up appointments, discuss medication plans, and ensure that patients are compliant, or the availability of home-care or telehealth for patients needing those services for continuity of care. Thus, optimizing sensitivity and specificity should be aligned with an institution's priorities. Here we demonstrated that sampling strategies can be useful tools for achieving optimal trade-offs, increasing the sensitivity of the models to up to 55% even with a low rate of stroke recurrence.

#### *4.4. Study Strengths, Limitations, and Future Directions*

The EHR data used in model development were longitudinally rich. However, this also leads to some of the study's limitations. There is inherent noise associated with administrative datasets such as EHR, including biased patient selection and the fact that stroke severity information was captured for only approximately half of the patients. However, separate logistic regression models examining the association of NIHSS with 1-year stroke recurrence did not show any association (OR: 1.01, 95% CI: 0.97–1.05, *p* = 0.625). Our phenotype definition for extracting stroke patients was strict, leading to 100% specificity on a randomly selected sample, which also means that our criteria likely missed some cases (for instance, if a patient had an MRI contraindication). Nevertheless, MRI is part of our stroke order set and is performed for every stroke patient unless the patient refuses or has a contraindication (e.g., a noncompatible pacemaker). We also did not include transient ischemic attacks, since they are associated with significant misdiagnosis [33].

As future directions, we are expanding this study at two levels: including additional layers of data and improving model development and optimization. We are expanding the GNSIS dataset by incorporating a larger number of laboratory-based features; unstructured data from clinical notes, such as signs and symptoms during the initial phases of patient evaluation; information about stroke subtypes; and genetic information from a subset of patients enrolled in the MyCode initiative [34]. We are also expanding our modeling strategies by (1) improving the imputation of laboratory values for EHR mining [35,36], which could improve patient representation and reduce algorithmic bias; (2) applying natural language processing to expand the feature set from clinical notes; (3) developing a polygenic risk score [37] using genetic information from a subset of our GNSIS cohort; (4) improving model parameter optimization using sensitivity analysis (SA)-based approaches [38–41]; and (5) incorporating more advanced methodologies, including deep learning models, to compare with the binary classification developed in this study. Finally, we are planning to develop models that account for the competing risk of death and other major vascular events in addition to ischemic stroke.

In conclusion, predicting long-term stroke recurrence is an unmet need with high clinical impact for improved outcomes. Using rich longitudinal data from EHR and optimized ML models, we were able to develop models of stroke recurrence for different prediction windows. Model performance metrics could be optimized, and the models could be implemented in the same healthcare system as an intelligent decision support system to improve outcomes. Even though validating the models on patients recruited at a later time point could be done within the Geisinger system, external validation will be necessary to determine how the model predictions are affected by other health care systems and patient demographics. External validation to assess generalizability and identify potential biases will therefore be an important next step of this study. Finally, based on our findings, we recommend that studies aimed at using ML for the prediction of stroke recurrence leverage more than one modeling framework, ideally including logistic regression as a benchmarking framework for comparison.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2077-0383/10/6/1286/s1, Table S1. Feature selection applied to cases and controls based on four criteria. Set 1: all features; Set 2: all features except medication history; Set 3: features selected by at least two data-driven strategies; Set 4: minimum set, obtained as the intersect of Set 2 and Set 3; Table S2. Comprehensive model performance measures for the 288 prediction models, https://www.dropbox.com/s/4h4qr6ivi1z9bt9/Final\_Table\_A2.xlsx?dl=0 (accessed on 19 March 2021); Table S3. Feature importance ranking for the different modeling frameworks.

**Author Contributions:** Conception and design of the study: V.A. (Vida Abedi) and R.Z. Acquisition and analysis of data: V.A. (Vida Abedi), V.A. (Venkatesh Avula), D.C., A.K., and R.Z. Interpretation of the findings: V.A. (Vida Abedi), V.A. (Venkatesh Avula), D.C., J.L., S.S., C.J.G., and R.Z. Drafting a significant portion of the manuscript or figures: V.A. (Vida Abedi), V.A. (Venkatesh Avula), and R.Z. Resources and Supervision: V.A. (Vida Abedi) and R.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study had no specific funding. Vida Abedi had financial support from the Defense Threat Reduction Agency (DTRA) grant No. HDTRA1-18-1-0008 and funds from the National Institute of Health (NIH) grant No. R56HL116832 during the study period. Ramin Zand had financial research support from Bucknell University Initiative Program, ROCHE–Genentech Biotechnology Company, the Geisinger Health Plan Quality fund, and receives institutional support from Geisinger Health System during the study period.

**Institutional Review Board Statement:** The study was reviewed and approved by the Geisinger Institutional Review Board to meet "non-human subject research", for using de-identified information.

**Informed Consent Statement:** The study was reviewed and approved by the Geisinger Institutional Review Board to meet "non-human subject research", for using de-identified information. Informed consent was not required.

**Data Availability Statement:** The data analyzed in this study are not publicly available due to privacy and security concerns. The data may be shared with a third party upon execution of data sharing agreement for reasonable requests; such requests should be addressed to Vida Abedi or Ramin Zand. Codes and additional meta-data, summary plots, and information can be found at https: //github.com/TheDecodeLab/GNSIS\_v1.0/tree/master/ModelingStrokeRecurrence (accessed on 19 March 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

## *Article* **SARS-CoV-2 Is a Culprit for Some, but Not All Acute Ischemic Strokes: A Report from the Multinational COVID-19 Stroke Study Group**

**Shima Shahjouei <sup>1</sup> , Michelle Anyaehie <sup>1</sup> , Eric Koza <sup>2</sup> , Georgios Tsivgoulis <sup>3</sup> , Soheil Naderi <sup>4</sup> , Ashkan Mowla 1,5 , Venkatesh Avula <sup>1</sup> , Alireza Vafaei Sadr <sup>6</sup> , Durgesh Chaudhary <sup>1</sup> , Ghasem Farahmand <sup>7</sup> , Christoph Griessenauer 1,8, Mahmoud Reza Azarpazhooh <sup>9</sup> , Debdipto Misra <sup>10</sup>, Jiang Li <sup>11</sup>, Vida Abedi 11,12 , Ramin Zand 1,\* and the Multinational COVID-19 Stroke Study Group †**

	- <sup>6</sup> 1211 Geneva, Switzerland; vafaei.sadr@gmail.com
	- <sup>7</sup> Iranian Center of Neurological Research, Neuroscience Institute, Tehran University of Medical Sciences, Tehran 14155-6559, Iran; Ghasem.farahmand89@gmail.com

**Citation:** Shahjouei, S.; Anyaehie, M.; Koza, E.; Tsivgoulis, G.; Naderi, S.; Mowla, A.; Avula, V.; Vafaei Sadr, A.; Chaudhary, D.; Farahmand, G.; et al. SARS-CoV-2 Is a Culprit for Some, but Not All Acute Ischemic Strokes: A Report from the Multinational COVID-19 Stroke Study Group. *J. Clin. Med.* **2021**, *10*, 931. https://doi.org/10.3390/jcm10050931

Academic Editor: Hugues Chabriat

Received: 16 January 2021; Accepted: 16 February 2021; Published: 1 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Abstract:** Background. SARS-CoV-2 infected patients are suggested to have a higher incidence of thrombotic events such as acute ischemic strokes (AIS). This study aimed to explore vascular comorbidity patterns among SARS-CoV-2 infected patients with subsequent stroke. We also investigated whether the comorbidities and their frequencies under each subclass of the TOAST criteria were similar to those in AIS population studies prior to the pandemic. Methods. This is a report from the Multinational COVID-19 Stroke Study Group. We present an original dataset of SARS-CoV-2 infected patients who had a subsequent stroke, recorded through our multicenter prospective study. In addition, we built a dataset of previously reported patients by conducting a systematic literature review. We demonstrated distinct subgroups by clinical risk scoring models and unsupervised machine learning algorithms, including hierarchical K-means (ML-K) and spectral clustering (ML-S). Results. This study included 323 AIS patients from 71 centers in 17 countries in the original dataset and 145 patients reported in the literature. The unsupervised clustering methods suggest a distinct cohort of patients (ML-K: 36% and ML-S: 42%) with no or few comorbidities. These patients were more than 6 years younger than those in the other subgroups and were more likely men (ML-K: 59% and ML-S: 60%). The majority of patients in this subgroup suffered from an embolic-appearing stroke on imaging (ML-K: 83% and ML-S: 85%) and had about a 50% risk of large vessel occlusions (ML-K: 50% and ML-S: 53%). In addition, there were two cohorts of patients with large-artery atherosclerosis (ML-K: 30% and ML-S: 43% of patients) and cardioembolic strokes (ML-K: 34% and ML-S: 15%) with consistent comorbidity and imaging patterns. Binomial logistic regression demonstrated that ischemic heart disease (odds ratio (OR), 4.9; 95% confidence interval (CI), 1.6–14.7), atrial fibrillation (OR, 14.0; 95% CI, 4.8–40.8), and active neoplasm (OR, 7.1; 95% CI, 1.4–36.2) were associated with cardioembolic stroke. Conclusions. Although a cohort of young and healthy men with cardioembolic and large vessel occlusions can be distinguished using both clinical sub-grouping and unsupervised clustering, stroke in other patients may be explained based on the existing comorbidities.

**Keywords:** cerebrovascular disorders; stroke; SARS-CoV-2; COVID-19; cluster analysis; risk factors; comorbidity

#### **1. Introduction**

Since the emergence of the Coronavirus Disease 2019 (COVID-19) pandemic, many cerebrovascular events have been reported among patients with SARS-CoV-2 infection. Some reports have highlighted strokes in critically ill and older patients with a higher number of comorbidities, while others have suggested a higher risk in younger and healthy individuals [1–5]. Studies have suggested that stroke patients with SARS-CoV-2 present with multiple cerebral infarcts [2,4,6], systemic coagulopathies [7], uncommon thrombotic events such as aortic [8] or common carotid artery thrombosis [9], and simultaneous arterial and venous thrombus formation [10].

Considering the hypercoagulable state as one of the main etiologies of stroke among SARS-CoV-2 infected patients, we would expect a similarly increased rate of cardiovascular thrombotic events and acute coronary syndrome after the start of the pandemic. However, the higher acute coronary syndrome case fatality rate and other adverse outcomes among cardiac patients compared with the pre-pandemic era have been attributed to public fear, reluctance to call for medical aid, and increased pre-hospital delay. A dramatic decline in guideline-indicated care, hospitalization rates, and revascularization procedures are other possible factors contributing to adverse outcomes in patients with acute coronary syndrome [11–15]. Studies have failed to show any difference among cardiovascular patients in terms of age, sex, comorbidities, clinical presentation, and diagnosis between the pre- and post-pandemic eras [14,16]. Similarly, a higher rate of coronary stent thrombosis than in the pre-pandemic era [17,18] was reported among patients with multiple comorbidities (about 44% with at least four vascular risk factors) and a median age of 65 years [18]. Acute myocardial injury (defined as a substantial increase in cardiac troponin level) is associated with underlying cardiac pathology in the majority of SARS-CoV-2 infected patients [19] rather than with a thrombotic event.

The first report from our Multinational COVID-19 Stroke Study Group and recent meta-analyses on reported infected patients presented a stroke incidence rate of 0.5–1.4% [20–22]. The odds of stroke after SARS-CoV-2 infection may not be greater than in non-infected patients [23]. In addition, meta-analyses of the reported patients showed that SARS-CoV-2 infected patients who experienced a stroke had a mean age of over 65 years, carried a load of comorbidities, and were affected by more severe infections [21,22]. Therefore, in some patients, stroke may be a coincidence or an indirect consequence of critical illness [24,25] rather than a direct complication of the SARS-CoV-2 infection. As an example, there is an increased risk of ischemic stroke (odds ratio (OR) > 28) and hemorrhagic stroke (OR > 12) within two weeks of sepsis [26]. This might be due to new-onset atrial fibrillation (6%), which puts patients at risk of in-hospital stroke (2.6%) [24].

Understanding the population at risk for having a stroke after SARS-CoV-2 infection can promote timely diagnosis and proper management of these patients.

We designed this study to explore the pattern of traditional vascular risk factors and stroke etiology among stroke patients with prior SARS-CoV-2 infection. We leveraged unsupervised hierarchical and spectral model-based clustering, in addition to clinical risk scoring models, to decipher patterns of comorbidity among these patients. We further expanded our analysis to corroborate whether the comorbidities under each subclass of TOAST (the Trial of Org 10172 in Acute Stroke Treatment [27]) were similar to the AIS population studies prior to the pandemic.

#### **2. Methods**

This report presents a multicenter prospective and observational study from our Multinational COVID-19 Stroke Study Group [20] and a cohort of patients extracted from the literature.

#### *2.1. Original Dataset*

Collaborators from 71 centers of 17 countries (Brazil, Canada, Croatia, Egypt, France, Germany, Greece, Iran, Israel, Italy, Portugal, Republic of Korea, Singapore, Spain, Switzerland, Turkey, and the United States) reported data on their patients for this study. We included consecutive SARS-CoV-2 infected adult patients who had imaging confirmed subsequent acute ischemic stroke.

The study protocol, details of eligibility criteria, data elements, and neurological investigations have been previously published [20]. The demographics, vascular risk factors, and comorbidities—i.e., hypertension, diabetes, ischemic heart disease, atrial fibrillation, carotid stenosis, chronic kidney disease, congestive heart failure with cardiac ejection fraction <40%, active neoplasms, rheumatological diseases, smoking status, and history of transient ischemic attack (TIA) or stroke—were recorded for the stroke patients [28–31]. We also recorded the neurological examinations, the National Institute of Health Stroke Scale (NIHSS), TOAST [27] subclasses, presence of large-vessel occlusions (LVOs), and brain imaging findings.

The study protocol was designed at the Neuroscience Institute of Geisinger Health System, Pennsylvania, United States, and received approval by the Institutional Review Board of Geisinger Health System and participating institutions, as needed. The study was conducted and reported according to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) [32] and Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [33] guidelines.

#### *2.2. Systematic Literature Review*

To compare our results with the available literature, we searched PubMed for reports of patients with subsequent stroke after SARS-CoV-2 infection. Various terms in addition to Medical Subject Headings (MeSH) were used to build the search protocol (Document S1). The search was last updated on 15 October 2020, with no limitation on study design, language, or document type. The search was augmented by forward and backward citation tracking in PubMed and Google Scholar. We additionally searched medRxiv to track documents ahead of publication and communicated with corresponding authors to include studies under peer review or in press prior to publication. Two reviewers (E.K. and S.S.) independently evaluated the titles/abstracts of the retrieved results and reviewed the full texts of candidate articles. Where possible, data from the literature were extracted using the same datasheet as the data collected in our original multicenter case series. The extracted data were further reviewed by two neurologists (G.F. and R.Z.).

#### *2.3. Comorbidity-Based Subgrouping: Expert Opinion*

The details of the subgroups are available in Document S2. In the risk scoring models based on EXpert opinion (EX), we considered the number of stroke-related comorbidities present: either All 11 collected comorbidities (EX-A), as listed above, or eight Selected comorbidities (EX-S, excluding congestive heart failure, active neoplasm, and rheumatological disorders) [27–30]. We considered equal weights for all comorbidities. We divided the patients based on EX-A and EX-S scores into two subgroups (EX-A2 and EX-S2): subgroup "a" included patients with zero or one stroke-related comorbidity, and subgroup "b" included patients with more than one comorbidity. In addition, we divided the patients based on EX-A and EX-S scores into three subgroups (EX-A3 and EX-S3). In this second classification, subgroup "a" comprised patients without any known comorbidity, subgroup "b" those with one or two comorbidities, and subgroup "c" those with more than two comorbidities.
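As a sketch, the EX-A/EX-S scoring and subgrouping described above could be implemented as follows; the comorbidity field names and the example patient are illustrative, not taken from the study data:

```python
# Hypothetical sketch of the equal-weight EX risk-scoring subgroups.
# Field names and the example record are illustrative only.

ALL_COMORBIDITIES = [
    "hypertension", "diabetes", "ischemic_heart_disease", "atrial_fibrillation",
    "carotid_stenosis", "chronic_kidney_disease", "congestive_heart_failure",
    "active_neoplasm", "rheumatological_disease", "smoking", "prior_tia_or_stroke",
]
# EX-S drops three of the eleven comorbidities.
SELECTED_COMORBIDITIES = [
    c for c in ALL_COMORBIDITIES
    if c not in {"congestive_heart_failure", "active_neoplasm", "rheumatological_disease"}
]

def ex_score(patient: dict, comorbidities) -> int:
    """Equal-weight count of the comorbidities present for this patient."""
    return sum(1 for c in comorbidities if patient.get(c, False))

def ex2_subgroup(score: int) -> str:
    """Two-level split: 'a' = 0-1 comorbidity, 'b' = more than one."""
    return "a" if score <= 1 else "b"

def ex3_subgroup(score: int) -> str:
    """Three-level split: 'a' = none, 'b' = 1-2, 'c' = more than two."""
    if score == 0:
        return "a"
    return "b" if score <= 2 else "c"

patient = {"hypertension": True, "diabetes": True, "active_neoplasm": True}
print(ex2_subgroup(ex_score(patient, ALL_COMORBIDITIES)))       # EX-A2 -> 'b'
print(ex3_subgroup(ex_score(patient, SELECTED_COMORBIDITIES)))  # EX-S3 -> 'b'
```

The example patient has three comorbidities under EX-A (subgroup "b" in both schemes) but only two under EX-S, illustrating how the two panels can diverge.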

#### *2.4. Comorbidity-Based Subgrouping: Unsupervised Modeling*

We explored probable similarities among the patients based on the presence of comorbidities in a data-driven approach; such patterns might remain hidden from experts using clinical risk scores. For this purpose, we leveraged unsupervised machine learning (ML) algorithms (Document S2). We applied hierarchical (complete linkage method) and K-means (Hartigan-Wong algorithm) clustering (ML-K models) to group the patients into two (ML-K2) and three (ML-K3) subgroups. We also used spectral clustering [34] (ML-S models) to cluster the patients into two (ML-S2) and three (ML-S3) subgroups. Tables S1 and S2 present the clustering of the patients into four and five subgroups. Patients from the original dataset and the literature review were clustered independently.
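The clustering step can be sketched with scikit-learn. The binary comorbidity matrix below is synthetic, and scikit-learn's default (Lloyd-style) K-means stands in for the Hartigan-Wong variant used in R, so this is an illustration of the approach rather than a reproduction of the study pipeline:

```python
# Illustrative sketch: hierarchical (complete linkage), K-means, and spectral
# clustering of patients described by binary comorbidity indicators.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering

rng = np.random.default_rng(0)
# 60 synthetic "patients" x 11 binary comorbidity indicators (not real data)
X = rng.integers(0, 2, size=(60, 11)).astype(float)

# ML-K-style models: K-means and complete-linkage hierarchical clustering
ml_k3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ml_h3 = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)

# ML-S-style model: spectral clustering into three subgroups
ml_s3 = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)

print(sorted(set(ml_k3.tolist())), sorted(set(ml_h3.tolist())), sorted(set(ml_s3.tolist())))
```

Each `fit_predict` call returns one subgroup label per patient, which is the form needed to build the contingency matrices used below for comparing models.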

We used the contingency matrix (also known as a contingency table) [35] to compare the subgroups of each model against those of the other models. The average similarity of two models in clustering the patients was calculated as

$$Sim = \frac{\sum_{i=1}^{I} (\text{maximum value in column } i)}{\sum_{k=1}^{K} (\text{value in cell } k)},$$

where $I$ is the number of columns and $K$ is the total number of cells in the contingency matrix. Similarity between models was considered mild (50–65%), moderate (65–80%), or strong (80–100%). The packages *stat* [36] and *gplots* [37] in R version 3.6.3 and the scikit-learn package [38] in Python version 3.7 were used.
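The similarity measure amounts to the sum of each column's maximum in the contingency matrix divided by the total count. A minimal sketch, with a made-up contingency matrix rather than one from the paper:

```python
# Model-agreement similarity: sum of column maxima over the total cell count.
import numpy as np

def model_similarity(contingency: np.ndarray) -> float:
    """Fraction of patients on which two clustering models agree,
    per the column-maximum definition in the text."""
    return contingency.max(axis=0).sum() / contingency.sum()

# Rows: subgroups from model 1; columns: subgroups from model 2 (made-up counts).
table = np.array([
    [40,  5,  0],
    [ 3, 30,  2],
    [ 0,  4, 16],
])
print(round(model_similarity(table), 3))  # -> 0.86, i.e., "strong" similarity
```

Here the column maxima (40, 30, 16) sum to 86 out of 100 patients, so the two models would be rated as strongly similar (80–100%).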

#### *2.5. Statistical Analysis*

We used descriptive statistics to summarize the data. Demographic data, comorbidities, laboratory findings, and neurological investigations were reported as medians (interquartile ranges (IQRs)) and means (standard deviations (SDs)). Categorical variables were reported as absolute frequencies and percentages. Comparisons between categorical variables were conducted with the Pearson chi-square test, while differences among continuous variables were assessed by the independent *t*-test. We explored the association of comorbidities with each subclass of the TOAST criteria by binary logistic regression. A *p*-value < 0.05 was considered significant in all analyses.
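These two comparisons can be sketched with SciPy; the counts and age samples below are synthetic, not the study data:

```python
# Sketch of the two hypothesis tests named above, on synthetic numbers.
import numpy as np
from scipy import stats

# Categorical variable (e.g., hypertension yes/no) across two subgroups -> 2x2 table
table = np.array([[203, 120],   # subgroup 1: with / without
                  [ 64,  53]])  # subgroup 2: with / without
chi2, p_cat, dof, expected = stats.chi2_contingency(table)

# Continuous variable (e.g., age) in two subgroups -> independent t-test
age_a = np.random.default_rng(1).normal(60, 14, 80)
age_b = np.random.default_rng(2).normal(67, 15, 120)
t_stat, p_cont = stats.ttest_ind(age_a, age_b)

print(f"chi-square p = {p_cat:.3f}, t-test p = {p_cont:.4f}")
```

With a significance threshold of 0.05, each *p*-value is compared against 0.05 exactly as described in the text.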

#### **3. Results**

#### *3.1. Patient Characteristics*

This study included 323 AIS patients from our original prospective multicenter case series, with a mean age of 67 ± 15 years and 60% men (Table S3). The most prevalent comorbidities were hypertension (63%), diabetes (35%), and ischemic heart disease (24%). In addition, through our systematic review of the literature, we retrieved data from an additional 412 stroke patients (including dural sinus thrombosis) post-SARS-CoV-2 infection (Figure 1). The data from the 412 patients were extracted from 81 articles (in 18 countries). Among the 412 patients, individual-level data of 145 AIS patients were reported from 36 centers in nine countries. The mean age of the retrieved AIS patients was 63 ± 14 years, and 57% were men (Table S3).

In comparison with our original multicenter dataset, patients reported in the literature were younger (mean age of 63 versus 67 years, *p* < 0.01), with a higher proportion of LVOs (83% versus 45%, *p* < 0.0001), and strokes of undetermined (38% versus 22%, *p* < 0.01) or other determined etiologies (31% versus 8%, *p* < 0.001). Although not statistically significant, reported patients in the literature had more severe strokes (median NIHSS of 15 versus nine, *p* = 0.11). Fewer patients of this cohort were reported to have had vascular risk factors; however, hypertension (55%), diabetes (37%), and atrial fibrillation (12%) were the most prevalent reported comorbidities among the patients from the published reports.

**Figure 1.** The process of literature review and main results.

#### *3.2. Clinical Risk Scoring Models Revealed a Large Cohort of Young Men with No Comorbidities Who Suffered from Large Vessel Occlusions (LVOs)*

Among the 323 AIS patients from the original dataset, 65 (22%) patients reported no known comorbidities, and 115 (39%) had at most one known comorbidity (Table 1). Among the 117 patients from the literature review who had a completed comorbidity panel, 33 (28%) reported no known comorbidity, and 71 (61%) had at most one known comorbidity (Table S4).

In both datasets, we identified a cohort of patients with no vascular risk factors with distinct features—subgroup "a" in all clinical risk scoring models; original dataset, EX-A3a: 22% and EX-S3a: 25% (Table 1); literature review, EX-A3a = EX-S3a: 28% (Table S4). These cohorts included patients with (1) younger age (over 8 years in comparison with other subgroups of the original dataset), (2) male predominance (original dataset, EX-A3a: 55% and EX-S3a: 54%; literature review, EX-A3a = EX-S3a: 59%), and (3) a higher proportion of embolic-appearing imaging stroke pattern (original dataset, 82%; literature review dataset, 67%). About half of the patients in the original dataset had LVOs (EX-A3a: 48% and EX-S3a: 49%), as did the majority of patients reported in the literature (EX-A3a = EX-S3a: 80%). In comparison with patients who carried a high load of comorbidities (subgroup "c"), the cohorts of patients without comorbidities (subgroup "a") had a longer length of hospital stay (original dataset EX-S3a, 16 days versus 11 days in EX-S3c, *p* = 0.03). Although not statistically significant, patients in subgroup "a" also had less severe strokes (median NIHSS in the original dataset, eight versus 12 in subgroup "c"; median NIHSS in the review dataset, six versus nine in subgroup "c"), but a higher likelihood of requiring mechanical ventilation (original dataset EX-A3a: 34% versus 28%, *p* = 0.39; EX-S3a: 37% versus 28%, *p* = 0.16).


**Table 1.** Characteristics of the patients grouped by clinical risk scoring models.

EX-A2: clinical risk-scoring (expert opinion) model including all comorbidities; a, 0–1 comorbidity; b, >1 comorbidity; EX-S2: clinical risk-scoring model including selected comorbidities; a, 0–1 comorbidity; b, >1 comorbidity; EX-A3: clinical risk scoring model including all comorbidities; a, 0 comorbidity; b, 1–2 comorbidities, c, >2 comorbidities; EX-S3: clinical risk scoring model including selected comorbidities; a, 0 comorbidities; b, 1–2 comorbidities, c, >2 comorbidities. Due to missingness, we provided the valid percentages in this table.

#### *3.3. Unsupervised Clustering Revealed Three Subgroups of Stroke Patients*

In addition to clinical risk scoring, we used unsupervised algorithms to potentially identify hidden comorbidity patterns among AIS patients. There were strong similarities (Sim > 80%) among the models in grouping the patients, except two sets that were moderately similar (Figure S1). Clustering the patients from the original dataset (Table 2) demonstrated a subgroup of patients with no or few comorbidities—subgroup "a" in all ML models (ML-K3a: 36% and ML-S3a: 42% of patients, Table 2). The latter is similar to subgroup "a" in all EX models (22–46% of patients, Table 1). The patients in these groups were (1) mainly men (ML-K3a: 59% and ML-S3a: 60%), (2) more than six years younger than other subgroups, (3) had a higher risk of embolic-appearing stroke on imaging (ML-K3a: 83% and ML-S3a: 85%), and (4) had about 50% risk of LVOs (ML-K3a: 50% and ML-S3a: 53%). Patients in the second subgroup (ML-K3b: 30% and ML-S3b: 43%; similar to EX-A3b: 47% and EX-S3b: 50%) presented with a high proportion of hypertension, diabetes, chronic kidney disease, and smoking. These patients had a higher risk of large artery atherosclerosis (ML-K3b: 40%, and ML-S3b: 31%). The third subgroup (ML-K3c: 34% and ML-S3c: 15% similar to EX-A3c: 31% and EX-S3c: 25%) presented mostly with hypertension, diabetes, ischemic heart disease, atrial fibrillation, congestive heart failure, carotid stenosis, neoplasm, and smoking. The majority of these patients (ML-K3c: 34% and ML-S3c: 60%) had cardioembolic strokes based on TOAST and imaging patterns consistent with an embolic ischemic stroke.

Similar patterns were observed among patients reported in the literature (Tables S4 and S5). The first group (subgroup "a" in all models, 28–61%) included the patients with no or few comorbidities. These patients were more likely men (63–100%), with over 80% LVOs, about 65% strokes of undetermined or other determined etiologies, and over 60% embolic-appearing strokes. In the second subgroup identified by unsupervised clustering (ML-K3b: 41% and ML-S3b: 66%, similar to EX-A3b: 33% and EX-S3b: 33%), the majority of the patients presented with hypertension and diabetes. Strokes of undetermined (ML-K3b: 39% and ML-S3b: 33%) and other determined (ML-K3b: 33% and ML-S3b: 37%) etiologies were more prevalent in these subgroups. The third subgroup (ML-K3c: 16% and ML-S3c: 26%, similar to EX-A3c: 39% and EX-S3c: 39%) included the patients with hypertension, diabetes, ischemic heart disease, atrial fibrillation, smoking, and prior stroke or TIA. The majority of the patients in the third subgroup of the literature review dataset had strokes of undetermined (ML-K3c, 46% and ML-S3c, 50%) or other determined etiologies (ML-K3c: 27% and ML-S3c: 18%).

#### *3.4. The TOAST Subtype Classification Was Consistent with the Patients' Risk Profile*

We observed significantly different proportions of hypertension, ischemic heart disease, atrial fibrillation, carotid stenosis, chronic kidney disease, and active neoplasms among the subclasses of TOAST (Table 3). Binomial logistic regression models demonstrated that atrial fibrillation (OR: 0.2; 95% CI: 0.04–0.8) and carotid stenosis (OR: 6.9; 95% CI: 2.2–21.4) were associated with large-artery atherosclerosis; ischemic heart disease (OR: 4.9; 95% CI: 1.6–14.7), atrial fibrillation (OR: 14.0; 95% CI: 4.8–40.8), and active neoplasm (OR: 7.1; 95% CI: 1.4–36.2) with cardioembolic stroke; chronic kidney disease (OR: 6.23; 95% CI: 1.8–21.5) with small-vessel occlusion; and ischemic heart disease (OR: 0.1; 95% CI: 0.01–0.5), carotid stenosis (OR: 0.1; 95% CI: 0.01–0.8), and chronic kidney disease (OR: 0.2; 95% CI: 0.04–0.9) with strokes of other determined etiology.
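For a single binary predictor, the odds ratio from a binomial logistic regression reduces to the 2×2 cross-product ratio, and a Wald 95% CI follows from the standard error of the log-OR. A minimal sketch with made-up counts (not the counts behind the ORs reported above):

```python
# Odds ratio and Wald 95% CI from a 2x2 exposure-outcome table.
# For one binary predictor this equals the univariate logistic-regression OR.
import numpy as np

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    or_ = (a * d) / (b * c)
    se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, or_ * np.exp(-z * se_log_or), or_ * np.exp(z * se_log_or)

# Made-up example: atrial fibrillation (exposure) vs. cardioembolic stroke (outcome)
or_, lo, hi = odds_ratio_ci(28, 22, 30, 243)
print(f"OR = {or_:.1f} (95% CI {lo:.1f}-{hi:.1f})")  # -> OR = 10.3 (95% CI 5.2-20.2)
```

A CI that excludes 1 (as here) corresponds to a statistically significant association at the 0.05 level, which is how the intervals in the paragraph above are read.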

Among the AIS patients reported in the literature, 120 patients had available TOAST criteria, 109 patients had available comorbidity panel, and 93 patients had data regarding both the TOAST criteria and the comorbidities. Because of the small sample size under each subgroup of TOAST, further analysis of the association of TOAST and comorbidities among these patients was not performed.


**Table 2.** Characteristics of the patients clustered with unsupervised machine learning algorithms.

ML-K2: machine learning model using K-means, dividing the patients into two subgroups; ML-S2: machine learning model using spectral clustering, dividing the patients into two subgroups; ML-K3: machine learning model using K-means, dividing the patients into three subgroups; ML-S3: machine learning model using spectral clustering, dividing the patients into three subgroups. Please note that a, b, and c in this table are not based on the number of comorbidities; they merely indicate distinct subgroups detected by the unsupervised algorithms. Due to missingness, we provided the valid percentages in this table.

**Table 3.** The proportion of comorbidities under each subgroup of TOAST in original dataset and literature review dataset. Due to missingness, the valid percentages are reported in this table.


\* Due to missingness, this value could not be computed. We provided the valid percentages in this table.

#### **4. Discussion**

The results of our study indicated that SARS-CoV-2 infection could cause AIS in a considerable number of patients, mostly young and male, who did not have vascular risk factors. The majority of these young patients had an embolic-appearing stroke on their neuroimaging. Stroke in older patients can be attributed to existing vascular risk factors.

#### *4.1. Unsupervised Clustering Identified Three Subgroups of SARS-CoV-2 Infected AIS Patients*

Despite several reports of special features and probable underlying coagulopathy in AIS with prior SARS-CoV-2 infection [2,4,6–10], similar reports are lacking in the literature regarding acute coronary syndrome and cardiovascular thromboembolic events. The majority of adverse outcomes among patients with stroke [39,40] or acute coronary syndrome [11–15] were related to the declining trend in seeking urgent care, hospitalization, and receiving guideline-indicated measures. On the other hand, meta-analyses of SARS-CoV-2 infected AIS patients reported a mean age of over 65 years and a high load of comorbidities [21,22]. Therefore, there might be a specific group of AIS patients with prior SARS-CoV-2 infection whose stroke can be attributed to the virus, while the incidence of stroke among other patients, especially older patients, might be related to their vascular risk factors or critical illness. On this basis, we analyzed the data from our Multinational COVID-19 Stroke Study Group [20] and a dataset of patients reported in the literature. The two cohorts facilitated the identification of three main subgroups. The first group included patients with no or very few comorbidities (EX-A3a, EX-S3a, ML-K3a, and ML-S3a). The majority of these patients were young men who had an embolic-appearing stroke. The second subgroup was distinguishable by a high proportion of hypertension, diabetes, chronic kidney disease, and carotid stenosis, a large-artery atherosclerosis origin of stroke, and an embolic-appearing stroke on imaging (ML-K3b, ML-S3b, EX-A3b, and EX-S3b). The third group presented with hypertension, diabetes, ischemic heart disease, atrial fibrillation, congestive heart failure, smoking, and prior TIA or stroke (ML-K3c, ML-S3c, EX-A3c, and EX-S3c). The majority of the patients in the third group had cardioembolic strokes based on the TOAST classification and a consistent imaging pattern.
Subgroups of patients identified by clinical risk scoring and by unsupervised clustering based on the comorbidity panels were similar in the original and literature review datasets. However, unlike in the original dataset, the etiology of stroke in the majority of patients in the second and third subgroups of the review dataset was reported as "stroke of undetermined etiology". Overall, the pattern demonstrated by all models may indicate that AIS in only one subgroup of patients (subgroup "a" in all models) can be attributed to the SARS-CoV-2 infection, while AIS in the second and third groups of patients may be explained by existing comorbidities.

#### *4.2. A Higher Proportion of SARS-CoV-2 Infected AIS Patients Lacked Comorbidities*

Our study identified a subgroup of patients with no known comorbidities among the SARS-CoV-2 infected stroke patients (22.0%). The result of our systematic literature review on SARS-CoV-2 infected stroke patients reported from 36 centers in nine countries similarly demonstrated that 24% of the patients had no prior comorbidities. The proportion of patients without known comorbidities was not available from large-scale studies on SARS-CoV-2 infected stroke patients reported from the UK [5] and the Global COVID-19 Stroke Registry [41]. However, a case series from New York presented that among 32 infected AIS patients, seven (22%) did not report hypertension, diabetes, dyslipidemia, coronary artery disease, congestive heart failure, atrial fibrillation, prior stroke or transient ischemic attack, or active smoking [42]. A series of 22 AIS patients with SARS-CoV-2 infection from the US demonstrated that 12 of the 22 (54%) patients did not report any comorbidities (i.e., hypertension, congestive heart failure, chronic lung disease, chronic kidney disease, diabetes, or atrial fibrillation) [43]. In a report of six consecutive SARS-CoV-2 infected AIS patients from the UK, one patient (16%) had no prior medical history [44]. All of these patients had LVO strokes and elevated D-dimer levels. Similarly, among five young patients in the US who had an LVO stroke after SARS-CoV-2 infection, two (40%) reported no prior comorbidities [1]. These findings may suggest that after SARS-CoV-2 infection, a higher percentage of patients without comorbidities suffer a stroke.

#### *4.3. The Proportion of Comorbidities under Each Subclass of TOAST Is Similar to Population Studies Prior to the Pandemic*

The second report from our Multinational COVID-19 Stroke Study Group [20] indicated a lower rate of small-vessel occlusion and lacunar infarcts and a higher risk of embolic-appearing stroke in patients with SARS-CoV-2 infection in comparison with population studies conducted prior to the pandemic. These findings were valid even after accounting for geographical region and countries' health expenditure. The results of the subgroup analyses and binary logistic regression in the current study showed that the comorbidity panel of the patients from the original dataset is consistent with the stroke subtypes. To assess whether the comorbidity panel of AIS patients infected with SARS-CoV-2 was consistent with large-scale population studies, we further investigated the proportion of comorbidities under each subclass of TOAST (Table 3). We observed that, in comparison with population studies, AIS patients infected with SARS-CoV-2 had similar rates of comorbidities under each subclass of TOAST [45–48]. Among patients with large-artery atherosclerosis in our study, 54% had hypertension (versus 54–85%), 36% had diabetes (versus 13–32%), and 20% were smokers (versus 17–50%). Among patients with cardioembolism, hypertension was recorded in 76% (versus 59–86%), diabetes in 33% (versus 13–32%), ischemic heart disease in 46% (versus 20–32%), and atrial fibrillation in 50% (versus 79–86%). Similarly, patients with small-vessel occlusion had 59% hypertension (versus 54–58%), 35% diabetes (versus 12–35%), and 18% ischemic heart disease (versus 15–20%) [45–48]. The results of the literature review presented similar findings, although we recognized selective reporting of patients with a lower comorbidity burden (Table 3).

These findings suggest that the comorbidities under each stroke etiology are not highly different from those in population studies prior to the pandemic, and we should still consider the possibility of bias in reporting patients with SARS-CoV-2 infection and stroke before concluding that the virus is a direct cause of stroke.

#### **5. Study Limitations**

To build the database of SARS-CoV-2 infected patients with stroke, we made several attempts in collaboration with multiple centers around the world. In addition, we reviewed all available reports to present a comprehensive overview of the topic. Despite this effort, these findings could largely be affected by selection bias and small sample sizes, as well as bias due to incomplete diagnostic workups. For instance, some of the included patients were reported before comprehensive diagnostic tests were completed, which may bias the determination of the TOAST subclasses. In addition, we could not include dyslipidemia in the comorbidity list because the data regarding lipid profiles did not pass the quality-control phase. We also detected publication bias among the patients reported in the literature (significantly lower age, more LVOs, more severe strokes, and more strokes of undetermined and other determined etiologies). Moreover, the clustering of patients in this study was limited to vascular risk factors and did not include laboratory findings. Lastly, unsupervised algorithms tend to be susceptible to outliers, especially when used on data with a small sample size.

#### **6. Conclusions**

Among patients with SARS-CoV-2 infection and acute ischemic stroke, there is a considerable number of young, predominantly male patients who did not report vascular risk factors. Therefore, young patients with SARS-CoV-2 infection should be monitored for the signs and symptoms of vascular events, including ischemic stroke. It is reasonable to ensure that these patients and their families are aware of the early signs of stroke (BE-FAST) [49]. Stroke in other patients can be attributed to the existing comorbidity panel. We also observed that the proportions of comorbidities under each subclass of the TOAST criteria were not different from those in population studies prior to the SARS-CoV-2 pandemic.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2077-038 3/10/5/931/s1. 1. Document S1. Search Strategy in PubMed. 2. Document S2. Detailed Description of the Clustering. 3. Table S1. Dividing the SARS-CoV-2 infected acute ischemic stroke patients into 4 and 5 subgroups based on K-Mean clustering. 4. Table S2. Dividing the SARS-CoV-2 infected acute ischemic stroke patients into 4 and 5 subgroups by Spectral clustering. 5. Table S3. The characteristics of SARS-CoV-2 infected stroke patients in original dataset and literature review. 6. Table S4. The characteristics of the patients from the literature review dataset divided based on clinical risk scoring models. 7. Table S5. The characteristics of the patients from the literature review dataset divided based on unsupervised machine learning models. 8. Figure S1. Contingency Matrices. 9. The COVID-19 Stroke Study Group collaborators affiliations

**Author Contributions:** Conceptualization, R.Z., V.A., C.G., J.L. and S.S.; methodology, R.Z., V.A. (Venkatesh Avula), S.S., G.T. and A.V.S.; software, V.A. (Vida Abedi), V.A. (Venkatesh Avula), D.M. and A.V.S.; validation, R.Z., A.M., G.T., S.N., S.S. and M.R.A.; formal analysis, S.S., E.K., J.L. and D.M.; investigation, S.S., M.A., E.K., D.C., C.G. and G.F.; resources, G.T., S.N., A.M., G.F., M.R.A. and the Multinational COVID-19 Stroke Study Group; data curation, S.S., G.F., S.N., V.A. (Venkatesh Avula) and D.M.; writing—original draft preparation, S.S., M.A., E.K., G.F. and D.C.; writing—review and editing, G.T., S.N., A.M., C.G., M.R.A., J.L., R.Z. and the Multinational COVID-19 Stroke Study Group; visualization, A.V.S., V.A. (Venkatesh Avula), D.M. and V.A. (Vida Abedi); supervision, R.Z., V.A. (Vida Abedi) and C.G.; project administration, S.S.; funding acquisition, none. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by Geisinger Institutional Review Board (IRB ID: 2020-0321, date of approval: 31 March 2020) and other participating institutions, as needed.

**Informed Consent Statement:** Informed consent was obtained from the majority of subjects involved in this study. Some centers waived the need for informed consent for studies regarding SARS-CoV-2 as a rapid response to the COVID-19 pandemic.

**Data Availability Statement:** The data presented in this study are available in the manuscript and supplemental materials. Additional data are available on request from the corresponding author. The data are not publicly available due to health information privacy.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Usefulness of Respiratory Mechanics and Laboratory Parameter Trends as Markers of Early Treatment Success in Mechanically Ventilated Severe Coronavirus Disease: A Single-Center Pilot Study**

**Daisuke Kasugai 1,\* , Masayuki Ozaki <sup>1</sup> , Kazuki Nishida <sup>2</sup> , Hiroaki Hiraiwa <sup>1</sup> , Naruhiro Jingushi <sup>1</sup> , Atsushi Numaguchi <sup>1</sup> , Norihito Omote <sup>3</sup> , Yuichiro Shindo <sup>3</sup> and Yukari Goto <sup>1</sup>**


**Abstract:** Whether a patient with severe coronavirus disease (COVID-19) can be liberated from mechanical ventilation (MV) early is an important question during the COVID-19 pandemic. This study aimed to characterize the time course of parameters and outcomes of severe COVID-19 in relation to the timing of liberation from MV. This retrospective, single-center, observational study was performed using data from mechanically ventilated COVID-19 patients admitted to the ICU between 1 March 2020 and 15 December 2020. Early liberation from ventilation (EL group) was defined as successful extubation within 10 days of MV. The trends of respiratory mechanics and laboratory data were visualized and compared between the EL and prolonged MV (PMV) groups using smoothing spline and linear mixed effect models. Of 52 admitted patients, 31 mechanically ventilated COVID-19 patients were included (EL group, 20 (69%); PMV group, 11 (31%)). The patients' median age was 71 years. While in-hospital mortality was low (6%), activities of daily living (ADL) at the time of hospital discharge were significantly impaired in the PMV group compared to the EL group (median Barthel index (range): 30 (7.5–95) versus 2.5 (0–22.5), *p* = 0.048). The trends in respiratory compliance differed between the EL and PMV groups. An increasing trend in the ventilatory ratio during MV until approximately 2 weeks was observed in both groups. The interaction between daily change and earlier liberation was significant in the trajectories of the thrombin–antithrombin complex, antithrombin 3, fibrinogen, C-reactive protein, lymphocyte, and positive end-expiratory pressure (PEEP) values. The indicator of physiological dead space increases during MV. The trajectories of markers of the hypercoagulation status, inflammation, and PEEP were significantly different depending on the timing of liberation from MV.
These findings may provide insight into the pathophysiology of COVID-19 during treatment in the critical care setting.

**Keywords:** COVID-19; mechanical ventilation; respiratory failure

#### **1. Introduction**

The number of patients with coronavirus disease (COVID-19) is increasing worldwide. In Japan, 8.1% of all COVID-19 cases require mechanical ventilation (MV), and the 30-day mortality rate has been reported to be 30% [1–3]. COVID-19 requires

**Citation:** Kasugai, D.; Ozaki, M.; Nishida, K.; Hiraiwa, H.; Jingushi, N.; Numaguchi, A.; Omote, N.; Shindo, Y.; Goto, Y. Usefulness of Respiratory Mechanics and Laboratory Parameter Trends as Markers of Early Treatment Success in Mechanically Ventilated Severe Coronavirus Disease: A Single-Center Pilot Study. *J. Clin. Med.* **2021**, *10*, 2513. https://doi.org/ 10.3390/jcm10112513

Academic Editors: Vida Abedi and Michela Sabbatucci

Received: 24 April 2021; Accepted: 4 June 2021; Published: 6 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

a longer treatment duration than other causes of viral pneumonia, with a median length of stay in the intensive care unit (ICU) of 10 days [4]. Once the capacity of ICU services for COVID-19 is overwhelmed, a significant increase in mortality and excess mortality from any cause may be expected [3,5]. Furthermore, prolonged MV is a risk factor for ICU-acquired weakness [6]. In this context, whether the patient with severe COVID-19 will be liberated from MV is of particular interest for improving patients' outcomes.

Thus far, little is known about the time course of COVID-19-related respiratory failure during ICU treatment. Previous studies have suggested that severe COVID-19 is characterized by excessive inflammation and hypercoagulation [7–9]. In addition to the conventional acute respiratory distress syndrome phenotype, there is another phenotype of high pulmonary compliance and increased physiologic dead space, which is thought to be due to pulmonary microthrombosis [10]. Meanwhile, lower compliance was reported to be associated with prolonged MV, which is similar to findings in other causes of acute respiratory distress syndrome [11]. Considering the complexity of the pathophysiology of severe COVID-19, knowledge of how time series data of clinical parameter changes is needed to assess the response to treatment and to make clinical decisions. However, it is poorly documented how respiratory and laboratory findings—including respiratory compliance, physiologic dead space, and inflammatory and coagulation biomarkers of severe COVID-19—change in response to empirical treatment, including anti-viral medication usage, anti-coagulation, or corticosteroid administration.

The aim of this study was to characterize the time course of the parameters and outcomes of severe COVID-19 in relation to the timing of liberation from MV.

#### **2. Materials and Methods**

#### *2.1. Ethics Statements*

The Nagoya University Hospital Institutional Review Board approved this study (registration number: 2020-0519). The requirement for informed consent was waived, and an opt-out method was adopted in accordance with the ethics guidelines.

#### *2.2. Study Design, Setting, and Population*

To characterize the time course of the parameters and outcomes of severe COVID-19 in relation to the timing of liberation from MV, we conducted a retrospective observational study at Nagoya University Hospital from 1 March 2020 to 15 December 2020. Nagoya University Hospital is a quaternary academic medical center with 1035 beds, including 10 emergency and medical ICU (EMICU) beds and 30 surgical ICU beds, located in the Aichi Prefecture, one of the epicenters of COVID-19 from the first wave of the pandemic in Japan. The EMICU usually treats 10–20 patients with extracorporeal membrane oxygenation (ECMO) annually for the management of severe respiratory failure or cardiogenic shock. All severe COVID-19 cases in the hospital and transfers from other hospitals, which are coordinated by the Infectious Disease Control Office in Nagoya City, were admitted to the air-isolated beds of the EMICU. Patients requiring less than 4 L of oxygen were transferred to another COVID-19 ward.

Eligible patients in this study had COVID-19 that required MV. Patients who required venovenous (VV)-ECMO were excluded. The diagnosis of COVID-19 was confirmed by a real-time polymerase chain reaction test for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in any specimen. Patients were categorized into the early liberation from ventilation group (EL group) or the prolonged MV group (PMV group). Early liberation from MV was defined as successful extubation within 10 days of MV, since 10 days is the widely adopted duration of antiviral and steroid treatment [12,13].

#### *2.3. Management of Coronavirus Disease*

All mechanically ventilated patients with COVID-19 were initially managed with pressure-controlled ventilation. Placement in the prone position was considered when the PaO2/FiO<sup>2</sup> ratio was less than 150, and was performed at the physicians' discretion.

Neuromuscular blockade was administered for less than 48 h when significant patient–ventilator desynchrony was observed. All patients received favipiravir or remdesivir as antiviral medications, depending on clinical availability. A 10-day course of intravenous dexamethasone (6.6 mg) once daily was initially started [12]. Antibiotics were administered to patients with suspected bacterial co-infections. Unfractionated heparin was administered and titrated to maintain the activated partial thromboplastin time ratio between 1.5 and 2.5 after MV in all patients [14]. Tracheostomy was considered if patients could not be extubated within 10 days [15]. Because of inadequate personal protective equipment and concerns about nosocomial infections, physiotherapists could not be directly involved in bedside rehabilitation [16]. Bedside rehabilitation was performed by a physiotherapist after negative conversion of the SARS-CoV-2 PCR test result, or by doctors and nurses under the supervision of a physiotherapist after the patient was liberated from MV.

#### *2.4. Data Collection*

Demographic information was extracted from patients' electronic medical records. The details of the parameters during ICU management were extracted from the ICU patient information system (Fortec ACSYS, Philips Japan). Ventilator parameters were recorded every minute by the IntelliVue MX800 (Philips Japan). Static compliance was calculated using the tidal volume and driving pressure. As an indicator of physiologic dead space, the ventilatory ratio was calculated using the following formula: [minute ventilation (mL/min) × partial pressure of carbon dioxide (mm Hg)]/(predicted body weight × 100 × 37.5) [17,18]. The following laboratory parameters were routinely monitored daily during MV and extracted from the database: coagulation markers (D-dimer, thrombin–antithrombin complex (TAT), plasmin-alpha2-plasmin inhibitor complex, fibrin degradation products (FDP), antithrombin 3 (AT3), fibrinogen, activated partial thromboplastin time ratio, and platelet count) and biomarkers of inflammation and lung injury (C-reactive protein level, procalcitonin (PCT) level, ferritin level, white blood cell count, neutrophil count, lymphocyte count, 50% hemolytic complement activity (CH50), and Krebs von den Lungen-6 (KL-6)). Activities of daily living (ADL) before admission and at the time of hospital discharge were measured using the Barthel index, which was routinely evaluated by the nurses and recorded in the nursing summary [19,20].
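The two derived indices above follow directly from routinely recorded values. The sketch below illustrates both calculations in Python; the function names and example numbers are illustrative, not taken from the study data:

```python
def static_compliance(tidal_volume_ml, driving_pressure_cmh2o):
    """Static compliance (mL/cmH2O) = tidal volume / driving pressure."""
    return tidal_volume_ml / driving_pressure_cmh2o

def ventilatory_ratio(minute_ventilation_ml_min, paco2_mmhg, predicted_bw_kg):
    """Ventilatory ratio, an indicator of physiologic dead space:
    [VE (mL/min) x PaCO2 (mmHg)] / (PBW x 100 x 37.5).
    A value near 1 reflects a roughly normal dead-space fraction."""
    return (minute_ventilation_ml_min * paco2_mmhg) / (predicted_bw_kg * 100.0 * 37.5)

# Illustrative values: 450 mL tidal volume at 15 cmH2O driving pressure;
# 7 L/min minute ventilation, PaCO2 37.5 mmHg, 70 kg predicted body weight.
print(static_compliance(450, 15))         # 30.0 mL/cmH2O
print(ventilatory_ratio(7000, 37.5, 70))  # 1.0
```

With these reference values the ventilatory ratio is exactly 1; values rising above 1 during MV, as reported here, indicate increasing dead-space ventilation.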

#### *2.5. Statistical Analysis*

Continuous data are summarized as medians and interquartile ranges (25th–75th percentiles). Categorical variables are expressed as numbers (%). Non-parametric variables were compared between the EL and PMV groups using the Mann–Whitney U test. The Barthel index at hospital discharge was compared between the groups, and the median Barthel index of each component in both groups was visualized using a radar chart. Nonparametric trends in each parameter in both groups were fitted by smoothing splines. Additionally, multivariable mixed effect linear regression models were used to evaluate the longitudinal associations between daily changes in each parameter during the initial 5 days and the EL group [21]. Variables were excluded from this evaluation when the linearity assumption appeared inappropriate, judging from the spline regression analysis. Within-subject changes were included in the model as random effects to adjust for patient factors. Early liberation, days after MV, and their interaction were assumed as fixed effects in the model. When the interaction term was statistically significant, we considered the trajectory of the parameter to differ between the two groups. Using the parameters whose daily changes showed significant interactions with early liberation in the linear mixed effect model, the trajectory of each parameter was converted into a coefficient using a linear regression model, and finally converted into the EL prediction score. The cutoff of each coefficient was determined by the results of the linear mixed effect model. Receiver operating characteristic (ROC) curve analysis was subsequently used to evaluate the performance of the predictive score. For missing data, the number of missing values was reported and complete-case analysis was performed. All analyses were performed using R software (version 4.0.2; The R Foundation).
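The key step above is testing whether the day × group interaction is significant, i.e., whether a parameter's daily slope differs between the EL and PMV groups. The study fit mixed models in R; the following simplified Python/NumPy sketch on synthetic data omits the random effects and keeps only the fixed-effects interaction, purely to show how a diverging trajectory appears as a nonzero interaction coefficient:

```python
import numpy as np

# Synthetic longitudinal data: 6 patients x 5 days; in EL patients (group=1)
# the marker falls steeply, in PMV patients (group=0) it barely changes.
rng = np.random.default_rng(0)
days = np.tile(np.arange(5), 6).astype(float)
group = np.repeat([1.0, 1.0, 1.0, 0.0, 0.0, 0.0], 5)
crp = 10 - 1.5 * days * group - 0.2 * days * (1 - group) + rng.normal(0, 0.1, 30)

# Design matrix: intercept, day, group, and the day x group interaction.
X = np.column_stack([np.ones(30), days, group, days * group])
beta, *_ = np.linalg.lstsq(X, crp, rcond=None)

# beta[3] estimates the between-group difference in daily slope
# (true value -1.3 here: EL slope -1.5 minus PMV slope -0.2).
interaction = beta[3]
```

In the full mixed model, a per-patient random effect would be added on top of this design, and the *p*-value of the interaction term would drive the variable selection described above.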

#### **3. Results**

#### *3.1. Patient Characteristics and Outcomes*

Of the 52 patients with COVID-19 admitted to the EMICU during the study period, 31 of 32 mechanically ventilated patients were included in this study; one patient required VV-ECMO and was excluded (Supplementary Figure S1: additional file S1). The details of the baseline characteristics are shown in Table 1. The median age of the patients was 71 years. Most patients did not require nursing care before admission (median Barthel index: 100), and most were male (86%). Common comorbidities included diabetes mellitus (58%) and hypertension (38%). Early liberation from MV was successful in twenty cases (69%). The median worst partial pressure of oxygen/fraction of inspired oxygen (P/F) ratio was 96. The initial ventilatory parameters did not differ between the groups. D-dimer levels were slightly elevated in the PMV group compared to the EL group. Overall, in-hospital mortality was low (6%); one patient developed a massive ischemic stroke after extubation and was withdrawn from care. Figure 1A,B shows the Barthel indexes of both groups at hospital discharge. ADL at the time of hospital discharge was significantly impaired in the PMV group compared to the EL group (median Barthel index (range): 30 (7.5–95) versus 2.5 (0–22.5), *p* = 0.048).


**Table 1.** Patient characteristics.




BMI, body mass index; HT, hypertension; DM, diabetes mellitus; SOFA, Sequential Organ Failure Assessment; APACHE II, acute physiology and chronic health evaluation II; MV, mechanical ventilation; PaO2/FiO2, partial pressure of oxygen/fraction of inspired oxygen; PEEP, positive end-expiratory pressure; NMB, neuromuscular blockade; ICU, intensive care unit.

#### *3.2. Ventilatory and Laboratory Parameters and Liberation from Mechanical Ventilation*

Figure 2 shows the trends in ventilatory parameters in each group. The EL group was managed with a lower PEEP throughout the period. Trends of compliance and the P/F ratio were different between the EL and PMV groups with an inflection point on day 5 of MV. The ventilatory ratio was higher in the PMV group than in the EL group. Of note, an increasing trend in the ventilatory ratio during MV until approximately 2 weeks was observed in both groups.

Figures 3 and 4 show the trends of laboratory parameters according to the duration of MV. Despite appropriate therapeutic anticoagulation, D-dimer and FDP levels gradually increased and the AT3 level decreased until day 14 in the PMV group. A decrease in the platelet count was not observed. In terms of inflammatory biomarkers, CRP levels were continuously high in the PMV group. PCT levels were initially high in patients with successful early liberation, and then they immediately became negative. The ferritin levels increased in both groups at about 2 weeks, but a significant difference in their trajectory was not observed. While the CH50 level decreased within the normal range in the PMV group, it increased in the EL group. KL-6 levels were significantly higher initially in the PMV group, but elevation of KL-6 levels was observed in both groups.

**Figure 2.** Trends of respiratory mechanics parameters in relation to the timing of liberation from mechanical ventilation. PEEP, positive end-expiratory pressure; EL group, early liberation from ventilation group; PMV group, prolonged mechanical ventilation group. The number of study timepoints: static compliance, 474,429; ventilatory ratio, 1813; PEEP, 474,941; PaO2/FiO<sup>2</sup> ratio, 1778.

**Figure 3.** Trends of the coagulation parameters. FDP, fibrin degradation products; TAT, thrombin-antithrombin complex; PIC, plasmin-alpha2-plasmin inhibitor complex; APTT-R, activated partial thromboplastin time ratio; AT3, antithrombin 3; EL group, early liberation from ventilation group; PMV group, prolonged mechanical ventilation group. The number of study timepoints: D-dimer, 393; FDP, 392; TAT, 355; PIC, 354; APTT-R, 394; AT3, 392; platelet, 394; FG, 394.

**Figure 4.** Trends of the laboratory parameters of inflammation. CH50, 50% hemolytic complement activity; CRP, C-reactive protein; PCT, procalcitonin; WBC, white blood cell; KL6, Krebs von den Lungen-6; EL group, early liberation from ventilation group; PMV group, prolonged mechanical ventilation group. The number of study timepoints: CRP, 394; PCT, 392; ferritin, 349; CH50, 198; WBC, 396; neutrophil, 393; lymphocyte, 393; KL6, 370.

The results of the longitudinal association between daily changes in each parameter during the initial 5 days and early liberation from MV are shown in Supplementary Figure S2 (Additional file S2). We found that CRP (*p* = 0.048), TAT (*p* = 0.019), fibrinogen (*p* = 0.002), AT3 (*p* < 0.001), lymphocyte (*p* = 0.009), and PEEP (*p* < 0.001) values showed significantly different daily changes that interacted with early liberation. An EL prediction score was developed using the trajectories of these variables (Supplementary Figure S3: Additional file S3). The area under the ROC curve for the prediction of early liberation (95% CI) was 0.913 (0.823–1), which was significantly higher than for other severity scales (0.573 (0.34–0.802), 0.47 (0.262–0.679), and 0.457 (0.225–0.689) for the APACHE II, SOFA, and 4C mortality scores, respectively).
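The AUC values reported here have a direct probabilistic reading: the AUC is the chance that a randomly chosen early-liberation patient receives a higher prediction score than a randomly chosen prolonged-MV patient. A small self-contained sketch (the scores are toy values, not the study's data):

```python
def auc_from_scores(scores_pos, scores_neg):
    """AUC as the Mann-Whitney probability that a positive case
    outscores a negative one; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy example: EL prediction scores for three EL vs. three PMV patients.
# One tied pair (3 vs 3) contributes half a win: 8.5 of 9 pairs.
print(auc_from_scores([3, 4, 5], [1, 2, 3]))  # 0.944...
```

An AUC of 0.913, as found for the EL prediction score, therefore means the score ranks an EL patient above a PMV patient in roughly 91% of patient pairs.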

#### **4. Discussion**

The main findings of this study are as follows: (1) prolonged MV was significantly associated with poor ADL at discharge in a setting of limited rehabilitation; (2) the trajectories of ventilatory and laboratory data differed between patients with early liberation and those with prolonged MV; and (3) early-phase differences in the trajectories of hypercoagulability markers, inflammatory markers, and PEEP were observed depending on the timing of liberation from MV, which can potentially be useful in identifying patients with early treatment success.

In this single-center observational study, the mortality rate was low compared to that in previous reports [1–3]. However, patients with prolonged MV showed significantly poorer ADL. The relationship between the length of MV and ADL is well known [22]. The ADL impairment may be due to characteristics of the clinical setting in the management of severe COVID-19; that is, bedside rehabilitation interventions were significantly restricted in our hospital because of concerns about exhausting personal protective equipment and about nosocomial infection. Although we could not evaluate the long-term quality-of-life outcome, our findings indicate that post-intensive care syndrome is particularly important in COVID-19 patients with prolonged MV. This finding may aid future clinical decision-making and policymaking in terms of staffing and resource allocation in critical care settings during ongoing pandemics. Our findings indicate the importance of direct intervention by physiotherapists in the management of COVID-19. The duration of MV may be used as a surrogate marker for ADL impairment after treatment.

We observed a decreasing trend in respiratory static compliance despite the higher PEEP setting after day 5, and a higher ventilatory ratio, in patients with prolonged MV than in those with early liberation. In patients with worsening COVID-19, two types of pathophysiology may explain this change: pulmonary micro-thromboembolism and organizing pneumonia [23]. Our findings were consistent with previous reports in that therapeutic anticoagulation did not fully control coagulopathy in COVID-19 [24]. The decrease in the AT3 level and the continuous elevation of TAT suggest poorly controlled thrombin activity during treatment. The role of complement activation in a thrombotic tendency uncontrolled by heparin has been previously documented [25]. In this study, complements were gradually consumed during the treatment phase, which is consistent with previous findings [26]. Meanwhile, the combination of an elevated KL-6 level, a marker of interstitial lung injury, worsening respiratory compliance with poor recruitment, and increased physiologic dead space may be explained by the ongoing fibrin deposition of organizing pneumonia [27,28]. This is consistent with the pathologic finding of acute fibrinous and organizing pneumonia-predominant histology in the later phase of treatment, and it may explain the downward trend in compliance despite the higher PEEP setting [29,30]. To further understand the underlying mechanism of the exacerbating conditions, prospective studies with computed tomography pulmonary angiography and/or bronchoalveolar lavage evaluation may be warranted.

Notably, an increasing trend in the ventilatory ratio was also observed in patients with early liberation and in patients with prolonged ventilation. Taken together, these findings may indicate that empirical therapeutic anticoagulation and a 10-day course of dexamethasone (6.6 mg) were not enough to manage the underlying mechanisms. Recently, the CoDEX trial showed that administration of a higher dose of dexamethasone for severe COVID-19 shortened the duration of MV [31]. Further evaluation of anticoagulation and more intensive anti-inflammatory management may be warranted in patients with prolonged ventilation. The trajectory of respiratory compliance and oxygenation was different between the two groups after day 5 of MV. Early tracheostomy was associated with early liberation from MV and preserved ADL [32,33]. It may be reasonable to make a clinical decision for additional treatment or earlier tracheostomy by reviewing the clinical time course until about day 5 of MV.

This study has several limitations. Firstly, because of the nature of the single-center observational study with a small sample size, we mainly focused on descriptive analysis. Furthermore, selection bias might have occurred in inter-hospital transfers, which may limit the external validity of our study. The prognostic value of each parameter and the predictive score should be evaluated in further multicenter studies. Secondly, the titration of PEEP was not protocolized and carried out according to bedside clinicians' preferences. Although a higher PEEP setting was used in patients with prolonged MV, it is unclear whether these patients require a higher PEEP setting because of poor oxygenation or if an unnecessarily high PEEP setting was prescribed, as this may worsen the ventilation perfusion mismatch [34–36].

#### **5. Conclusions**

Prolonged MV was associated with poor ADL at hospital discharge during COVID-19 infection. The indicator of physiological dead space increases during MV. The trajectory of markers of the hypercoagulation status, inflammation, and PEEP were significantly different depending on the timing of liberation from MV. These findings may provide insight into the pathophysiology of COVID-19 during treatment in a critical care setting.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/jcm10112513/s1, Supplementary Figure S1: Flow diagram of patient selection, Supplementary Figure S2: Association between daily change in each parameter during the initial 5 days and early liberation from mechanical ventilation. CRP, C-reactive protein; FDP, fibrin degradation products; TAT, thrombin-antithrombin complex; PIC, plasmin-alpha2-plasmin inhibitor complex; AT3, antithrombin 3; KL6, Krebs von den Lungen-6; CH50, 50% hemolytic complement activity; EL group, early liberation from ventilation group; PMV group, prolonged mechanical ventilation group. The numbers of missing values were as follows: KL6, 13; TAT, 16; PIC, 16; ferritin, 16; CH50, 63, Supplementary Figure S3: The components of the EL prediction score and its performance. The cutoff coefficient of each component was determined by the estimated effect of daily change and its interaction with early liberation (A). The area under the receiver operating characteristic curve of the EL prediction score was significantly high compared to other severity scales (B). PEEP, positive end-expiratory pressure; TAT, thrombin-antithrombin complex; AT3, antithrombin 3; CRP, C-reactive protein; EL, early liberation; APACHE II, acute physiology and chronic health evaluation II; SOFA, Sequential Organ Failure Assessment.

**Author Contributions:** D.K. and M.O. were responsible for conceptualization and design of the study and data extraction. D.K. and K.N. were responsible for data analyses. D.K. drafted the manuscript. K.N., M.O., H.H., N.J., A.N., N.O., Y.S. and Y.G. critically analyzed and reviewed the draft analyses. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Nakatani Foundation for Advancement of Measuring Technologies in Biomedical Engineering.

**Institutional Review Board Statement:** This study was conducted according to the guidelines of the declaration of Helsinki, and approved by The Nagoya University Hospital Institutional Review Board (registration number: 2020-0519).

**Informed Consent Statement:** Patient consent was waived due to the retrospective nature of this study.

**Data Availability Statement:** The dataset supporting the conclusions of this article is available from the corresponding author on reasonable request.

**Acknowledgments:** We thank all staff for treating coronavirus disease in our intensive care unit. Data extraction was supported by Philips.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

COVID-19, coronavirus disease; MV, mechanical ventilation; ICU, intensive care unit; EMICU, emergency and medical intensive care unit; ECMO, extracorporeal membrane oxygenation; VV, venovenous; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; EL group, early liberation from ventilation group; PMV group, prolonged mechanical ventilation group; CH50, 50% hemolytic complement activity; ADL, activities of daily living; P/F, partial pressure of oxygen/fraction of inspired oxygen; TAT, thrombin-antithrombin complex; FDP, fibrin degradation products; AT3, antithrombin 3; PCT, procalcitonin; KL-6, Krebs von den Lungen-6.

#### **References**


## *Article* **Artificial Neural Network for Predicting the Safe Temporary Artery Occlusion Time in Intracranial Aneurysmal Surgery**

**Shima Shahjouei 1,2,\*, Seyed Mohammad Ghodsi <sup>2</sup> , Morteza Zangeneh Soroush 3,4, Saeed Ansari <sup>5</sup> and Shahab Kamali-Ardakani <sup>2</sup>**


**Abstract:** Background. Temporary artery clipping facilitates safe cerebral aneurysm management but carries a risk of cerebral ischemia. We developed an artificial neural network (ANN) to predict the safe clipping time of temporary artery occlusion (TAO) during intracranial aneurysm surgery. Method. We devised a three-layer model to predict the safe clipping time for TAO. We considered age, the diameters of the right and left middle cerebral arteries (MCAs), the diameters of the right and left A1 segments of the anterior cerebral arteries (ACAs), the diameter of the anterior communicating artery, the mean flow velocities at the right and left MCAs and at the right and left ACAs, as well as the Fisher grading scale of brain CT scans as the input values for the model. Results. This study included 125 patients: 105 patients from a retrospective cohort for training the model and 20 patients from a prospective cohort for validating the model. The output of the neural network yielded an overall safe clipping time for TAO of up to 960 s. The input values with the greatest impact on safe TAO were the mean blood flow velocities at the left MCA and left ACA and the Fisher grading scale of the brain CT scan. Conclusion. This study presents an auxiliary framework to improve the accuracy of the estimated safe clipping time interval of temporary artery occlusion in intracranial aneurysm surgery.

**Keywords:** aneurysm surgery; temporary artery occlusion; clipping time; artificial neural network

### **1. Introduction**

Intracranial aneurysms have a prevalence of 3.2% in the general population [1]. Although the majority of patients remain asymptomatic, cerebral aneurysms carry a significant risk of rupture. Temporary artery occlusion (TAO) is an indispensable technique to facilitate aneurysm dissection and clipping and to reduce the risk of intra-operative rupture [2]. However, TAO may be complicated by detrimental consequences such as cerebral ischemia and postoperative neurological deficits [3]. Estimating a safe clipping time (SCT) for TAO is therefore essential to give surgeons the maximum window to perform the surgery while keeping patients safe from its complications. Although several intra-operative neurophysiologic monitoring and imaging methods have been proposed for determining the safe occlusion time [4,5], in practice the SCT is mostly estimated based on clinicians' expertise. The purpose of this study is to leverage machine learning to identify the prominent clinical features determining the outcome of TAO and to predict the SCT for intracranial aneurysm surgeries.

Machine learning can be used to extract meaningful relationships and patterns from a set of features (model inputs) for estimating the future values of a phenomenon (model outcome). An artificial neural network (ANN) is a type of data mining and pattern recognition method that reveals complex nonlinear relationships in addition to linear correlations. ANNs have been widely used in a variety of neurosurgical applications, such as predicting the occurrence of symptomatic cerebral vasospasm after aneurysmal subarachnoid hemorrhage [6], traumatic brain injury outcome and survival [7,8], recurrent lumbar disk herniation [9], and endoscopic third ventriculostomy success in childhood hydrocephalus [10]. Regarding cerebral aneurysm surgeries, the majority of studies have deployed these techniques to predict aneurysm rupture [11,12] or for automated detection of aneurysms on imaging [13]. In this study, we aimed to evaluate the feasibility and validity of ANN modeling in predicting the SCT and determining the prominent clinical features of cerebral aneurysm surgeries.

**Citation:** Shahjouei, S.; Ghodsi, S.M.; Zangeneh Soroush, M.; Ansari, S.; Kamali-Ardakani, S. Artificial Neural Network for Predicting the Safe Temporary Artery Occlusion Time in Intracranial Aneurysmal Surgery. *J. Clin. Med.* **2021**, *10*, 1464. https://doi.org/10.3390/jcm10071464

Academic Editors: Vida Abedi, George N. Kouvelos and Emmanuel Andrès

Received: 14 February 2021; Accepted: 31 March 2021; Published: 2 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **2. Methods**

This study was conducted in Shariati Hospital, Tehran, Iran. To develop the ANN, we used two separate datasets.

#### *2.1. Retrospective Cohort, Training, and Testing Set of the Model*

We retrospectively reviewed the medical records of all patients who underwent craniotomy and clipping for aneurysm management between 2004 and 2011. Clinical data were extracted, including demographic information, comorbidities, pre- and post-operative neurological examinations, the Fisher grading scale of computerized tomography (CT) imaging, pre- and post-operative transcranial Doppler (TCD), the location and diameter of the aneurysm(s), and the temporary artery clipping time and number(s). The presence or absence of flow through the vessels of the circle of Willis and possible anatomic variations were assessed by digital subtraction angiography (DSA), computed tomography angiography (CTA), magnetic resonance angiography (MRA), or T2-weighted magnetic resonance imaging (MRI). The mean velocity of flow in the cerebral arteries was measured from pre-operative TCD.

The information from the patients in the retrospective cohort was used to train the model. To obtain the SCT, we excluded all patients with unfavorable outcomes or any signs of ischemia. Patients with Glasgow coma scale (GCS) less than 11, presence of a neurologic deficit in the pre-operative examination, post-operative decline in either motor or sensory function, or any pathologic finding in neuroimaging other than the presence of aneurysm were excluded from the training set.
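The exclusion step above can be sketched as a simple predicate over patient records. The field names below are hypothetical (the paper does not describe its data schema); only the criteria themselves come from the text.

```python
# Sketch of the training-set exclusion criteria described above.
# Field names are illustrative assumptions, not the authors' code.

def eligible_for_training(patient: dict) -> bool:
    """Apply the exclusion criteria for the retrospective training set."""
    if patient["gcs"] < 11:                            # Glasgow coma scale < 11
        return False
    if patient["preop_neuro_deficit"]:                 # pre-operative deficit
        return False
    if patient["postop_motor_or_sensory_decline"]:     # post-operative decline
        return False
    if patient["imaging_pathology_other_than_aneurysm"]:
        return False
    return True

cohort = [
    {"gcs": 15, "preop_neuro_deficit": False,
     "postop_motor_or_sensory_decline": False,
     "imaging_pathology_other_than_aneurysm": False},
    {"gcs": 9, "preop_neuro_deficit": False,
     "postop_motor_or_sensory_decline": False,
     "imaging_pathology_other_than_aneurysm": False},
]
training_set = [p for p in cohort if eligible_for_training(p)]
```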

#### *2.2. Prospective Cohort and Validation Set of the Model*

Between 2011 and 2013, we devised a protocol to prospectively include patients undergoing surgical clipping of cerebral aneurysms (ruptured and un-ruptured). We only included those with aneurysms of the anterior communicating artery (AcomA) or middle cerebral arteries (MCA). Data were collected using the same protocol as the retrospective cohort. In addition, all patients in the prospective cohort underwent diffusion-weighted imaging (DWI) MRI within 6 h and 24 h of the surgery to rule out cerebral ischemia. We also measured the diameter of the arteries in the circle of Willis (anterior cerebral arteries (ACA), AcomA, and MCA) from CTA images, using ImageJ software (version 1.42q, U.S. National Institutes of Health, Bethesda, MD, USA). The information obtained from the prospective cohort was used to test and validate the model.

#### *2.3. Surgical Techniques*

The surgical procedure for clipping of the aneurysms was either a standard pterional craniotomy (MCA location) or a frontotemporal craniotomy (AcomA location). For AcomA aneurysms, the ipsilateral and contralateral A1 segments were exposed and temporarily clipped. Once the AcomA aneurysm was dissected and permanently clipped, the temporary clips on the A1 segments were removed. In MCA aneurysms, the MCA was exposed from proximal to distal to identify the location of the aneurysm. A temporary clip was applied to the proximal MCA at the M1 segment and the aneurysm was then dissected. Subsequently, the temporary clip on the MCA was removed. The duration of temporary vascular obstruction following clipping was measured in seconds.

#### *2.4. Feature Selection for ANN Model*

Through a comprehensive literature review and consultation with clinicians, a wide variety of related clinical and physiological parameters with a possible impact on the SCT were proposed. Based on the anatomical distribution of intracranial aneurysms and the importance of compensatory blood flow mechanisms in each segment of the circle of Willis, 11 features were selected as the input for the ANN model.

Age, the diameter of the right and left MCAs, the diameter of the right and left A1 segment of ACAs, the diameter of AcomA, mean velocity of flow at the right and left MCAs, mean velocity of flow at the right and left ACAs, and Fisher grading scale of brain CT scan were considered as the input values for the model. The diameter of the P1 segment of the right and left posterior cerebral arteries (PCAs) and flow in the posterior circulation were excluded in our final model due to the low prevalence of posterior aneurysms.

#### *2.5. Structure of the ANN Model*

A three-layer neural network was used in this study: an input layer, one hidden layer, and an output layer (Figure 1). The number of input units in the first layer of the model was 11, matching the number of selected features proposed to affect the outcome of clipping and subsequent ischemia. In determining the size of the hidden layer, we considered both training accuracy and generalization: too many hidden units (which can drive up training accuracy) may cause overtraining, resulting in a decline in generalization. To find the optimum number of neurons in the hidden layer, the model was run with different counts; the architecture with five units in the hidden layer was accompanied by the lowest error. The output layer consisted of a single neuron, representing the SCT as the outcome of the model.

**Figure 1.** The structure of the artificial neural network. I, input unit; H, hidden unit.

The units within each layer of the model were connected with the units of the adjacent layers through directed edges (weights). There were no connections between the units within the same layer [14]. A nonlinear Sigmoid function was applied to the hidden layer, and a linear function was applied to the output layer.
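A minimal sketch of the 11-5-1 feed-forward pass described above, with a sigmoid hidden layer and a linear output layer. The weights here are random placeholders, not the trained model.

```python
import numpy as np

# 11-5-1 feed-forward network: sigmoid hidden layer, linear output layer.
# Random placeholder weights; illustrative only.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 11)), np.zeros(5)   # input -> hidden edges
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)    # hidden -> output edges

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_sct(x):
    """Nonlinear sigmoid on the hidden layer, linear function on the output."""
    h = sigmoid(W1 @ x + b1)
    return (W2 @ h + b2).item()

x = rng.normal(size=11)          # one patient's 11 input features
sct_pred = predict_sct(x)
```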

#### *2.6. Training and Validating of the ANN Model*

Five-fold cross-validation was used for this model. In each run of the modeling, 80% of the retrospective cohort was randomly selected to train the ANN model. The remaining 20% of the dataset was used to test the performance of the model. During the training phase, the weights and interactions of the input variables were gradually determined during each run. For this purpose, each set of input features was broadcast to every unit in the hidden layer. After computing its activation, each unit in the hidden layer transferred the signal to the unit of the adjacent output layer. In this way, the response of the network was computed for a specific set of input values (feed-forward propagation phase). In the backward propagation phase, the computed activation in the output layer (predicted SCT) was compared with the observed SCT value (obtained from the patient's medical record), and the training error was calculated. The error was then propagated back to each unit of the hidden layer, and the weights between the output layer and the hidden units were updated. Correspondingly, the computed error in this layer was distributed to the input layer, and the weights between the hidden layer and the input layer were updated as well. This process was repeated several times, using different random sets of patients for training and testing the ANN model.
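The feed-forward/back-propagation cycle described above can be sketched as follows. The data, learning rate, and target normalization (e.g., SCT in seconds divided by the 960 s maximum) are illustrative assumptions, not the authors' MATLAB implementation.

```python
import numpy as np

# One-hidden-layer network trained by stochastic gradient descent with
# backpropagation, on toy data (normalized SCT targets in [0.1, 1]).

rng = np.random.default_rng(1)
W1, b1 = rng.normal(scale=0.1, size=(5, 11)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(1, 5)), np.zeros(1)
lr = 0.1                                          # illustrative learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W1 @ x + b1)                      # hidden activations
    return (W2 @ h + b2).item(), h                # linear output, hidden state

X = rng.normal(size=(20, 11))                     # toy input features
y = rng.uniform(0.1, 1.0, size=20)                # toy normalized SCT targets

def mse():
    return float(np.mean([(forward(x)[0] - t) ** 2 for x, t in zip(X, y)]))

initial_mse = mse()
for _ in range(200):                              # repeated training passes
    for x, t in zip(X, y):
        pred, h = forward(x)                      # feed-forward phase
        err = pred - t                            # output-layer error
        dh = err * W2[0] * h * (1.0 - h)          # error sent back to hidden
        W2 -= lr * err * h[None, :]; b2 -= lr * err
        W1 -= lr * dh[:, None] * x[None, :]; b1 -= lr * dh
final_mse = mse()
```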

Data from the prospective cohort of the patients were used to validate the model and provide the performance metrics for the model. We used the trained ANN model (based on data from the retrospective cohort) to predict the SCT for patients in the prospective cohort. This cohort was kept unseen from the ANN algorithms in the training phase to prevent bias and overfitting.

#### *2.7. Importance of Each Clinical Feature in Predicting the SCT*

To evaluate the importance of the input parameters in predicting the SCT, a model sensitivity test was implemented. For this purpose, we considered a fixed weight of 1 for all the input variables. In each turn, we increased the weight of one variable by up to 10% and evaluated the variation in the output value. After repeating this process for each input parameter, we ranked the obtained sensitivity values.
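The sensitivity procedure above can be illustrated with a stand-in model: fix all input weights at 1, inflate one input's weight by 10%, and rank the inputs by how much the output moves. The `model_output` function below is a placeholder for the trained ANN, not the study's model.

```python
import numpy as np

# Sensitivity test: perturb one input weight by +10% at a time and
# rank features by the resulting change in the model output.

def model_output(x, input_weights):
    # Stand-in for the trained ANN; any deterministic function of the
    # weighted inputs suffices to illustrate the procedure.
    return float(np.tanh(input_weights * x).sum())

rng = np.random.default_rng(2)
x = rng.normal(size=11)                  # one set of 11 input features
base = model_output(x, np.ones(11))      # all input weights fixed at 1

sensitivity = []
for i in range(11):
    w = np.ones(11)
    w[i] *= 1.10                         # +10% on feature i only
    sensitivity.append(abs(model_output(x, w) - base))

ranking = np.argsort(sensitivity)[::-1]  # most influential feature first
```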

#### *2.8. Estimating the Errors*

To evaluate the performance of the proposed pattern recognition model, two error measures were used. The mean absolute deviation (Equation (1)) quantifies the difference between the clinically assigned SCT values and those predicted by the model.

$$\text{Mean Absolute Deviation} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left| \text{SCT}_i - \widehat{\text{SCT}}_i \right| \tag{1}$$

where *N* is the total number of patients, *N<sub>tr</sub>* = 4*N*/*K* is the total number of training samples, *N<sub>te</sub>* = *N*/*K* is the total number of test samples, *K* = 5 is the number of folds of the *K*-fold validation algorithm in our model, SCT*<sub>i</sub>* is the observed safe clipping time, and SCT ˆ *<sub>i</sub>* is the value predicted by the model.

For the relative error (Equation (2)), the mean absolute deviation was normalized by the magnitude of each observed outcome. This criterion gives a better perception of the bias of the model, and the relative error was therefore used to report the bias of the model in this study.

$$\text{Relative Error} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left| \frac{\text{SCT}_i - \widehat{\text{SCT}}_i}{\text{SCT}_i} \right| \times 100\% \tag{2}$$

where the parameters are as above.
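The two criteria in Equations (1) and (2) can be computed directly for a test fold. The observed and predicted values below are made up for illustration; they are not the study's data.

```python
import numpy as np

# Mean absolute deviation (Equation (1)) and relative error (Equation (2))
# for a toy test fold of four patients.

observed  = np.array([300.0, 450.0, 600.0, 250.0])   # clinician-assigned SCT (s)
predicted = np.array([330.0, 420.0, 570.0, 280.0])   # model-predicted SCT (s)

mad = np.mean(np.abs(observed - predicted))                          # Eq. (1), seconds
relative_error = np.mean(np.abs((observed - predicted) / observed)) * 100  # Eq. (2), %
```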

MATLAB software was used for the mathematical modeling and the design of the ANN. Outputs are reported with 95% confidence intervals. We used a t-test to assess the independence of the outputs, considering a *p* value less than 0.05 as significant.
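A hedged sketch of the significance check, using SciPy's two-sample t-test in place of the study's MATLAB code; the two groups here are synthetic.

```python
import numpy as np
from scipy import stats

# Two-sample t-test with the study's 5% significance threshold,
# on synthetic groups with clearly different means.

rng = np.random.default_rng(3)
group_a = rng.normal(loc=300.0, scale=20.0, size=30)
group_b = rng.normal(loc=340.0, scale=20.0, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
significant = p_value < 0.05            # p < 0.05 treated as significant
```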

#### **3. Results**

A total of 131 patients were evaluated for this study (105 patients from the retrospective cohort and 26 patients from the prospective cohort). Six patients were excluded from the prospective cohort due to low GCS (4 patients) or positive DWI MRI indicating postoperative cerebral ischemia (2 patients; neither case could be directly related to temporary clipping). Demographic data of the included patients, the location of the aneurysms, and details of the Fisher grading scale for each cohort are available in Table 1.

**Table 1.** Demographic and surgical characteristics of the patients in retrospective and prospective datasets.


The overall predicted safe occlusion time based on the prospective cohort was 90–960 s: 120–932 s for the AcomA, 240–960 s for the right MCA, and 90–950 s for the left MCA (Table 2). The average deviation of the SCT predicted by the ANN model from the clinically observed SCT in the unseen prospective cohort was 12%, corresponding to a model accuracy of 88%.

**Table 2.** Output values. The safe clipping time interval is based on the aneurysm location.


A sensitivity analysis of the input values showed that mean velocity of the left M1, mean velocity of the left A1, and Fisher grading scale had the greatest impact on SCT (Table 3).



#### **4. Discussion**

Surgical management of aneurysms is among the most critical procedures in neurosurgery. Temporary artery occlusion (TAO) is a fundamental component in facilitating aneurysm dissection. The main purpose of this study was to introduce an intelligent predictive tool to complement commonly accepted clinical experience, rather than to provide an absolute value for the SCT. Notably, the ANN model in this study suggested that the clipping time might be considered safe for intervals longer than those practiced in the clinic. We observed that the mean velocity of flow at the left MCA and left ACA, in addition to the Fisher grading scale of brain CT scans, had the greatest impact on the outcome of the TAO.

Although a detrimental consequence of clipping is ischemia, several mechanisms such as redirection of blood flow from the contralateral side through communicating arteries of the Willis circle, leptomeningeal and collateral vessels, and cortical anastomosis can compensate for the hypo-perfusion and eliminate the cerebral ischemia [15–18]. Aging has been shown to reduce the efficacy of collateral flow and cortical anastomosis capacity by decreasing the collateral number and diameter, increasing tortuosity, and impairment of remodeling capacity [19–21].

In addition, the difference in predicted SCT for different vessels might have a biological basis. Predicted SCTs were higher in the left hemisphere. The difference in the origin of the right and left common carotid arteries (the aortic arch on the left versus the brachiocephalic artery on the right), the curvature of the vessels, carotid intima-media thickness (CIMT), and other hemodynamic characteristics of the vessels on the right and left side may result in variation between flow in the right and left circulation [22,23]. Blood flow in each vascular section is a function of the velocity of blood and the caliber of the vessel at that section (flow = velocity × cross-sectional area). Accordingly, for similar vessel diameters on both sides, the higher velocity of the blood on the left side might be representative of greater flow in the left circulation. This might be the underlying reason why the velocity of the left ACA and MCA has a major impact on the outcome. The higher incidence rate of aneurysm formation and greater wall shear stress (WSS) and wall shear stress gradient (WSSG) on the left side in comparison with the right in our study (not presented in this draft) may support this assumption. Additionally, the difference in blood flow may produce a higher compensatory potential for the dominant side in case of vessel occlusion, by redirecting the blood flow through the circle of Willis toward the site of obstruction. Consequently, the extra ten seconds of safe occlusion time in right MCA TAO (960 s versus 950 s for left MCA TAO), although clinically insignificant, may reflect this additional reperfusion provided by the contralateral dominant side.

#### *4.1. Selected Features as Input of the Model*

WSS and WSSG can affect the SCT by promoting aneurysm formation. Permanent pathologic alteration of vasculature, such as disruption of the internal elastic lamina or thinning of the media along with increasing the number and tortuosity of collateral vessels, were introduced as complications of WSS and WSSG [19,24–27]. Alteration in hemodynamic parameters such as the diameter of vessels and velocity of blood flow can change WSS and WSSG [28–30]. Thereby, we considered the diameter of ACAs, MCAs, and AcomA, and the mean velocity of ACAs and MCAs as an indirect measure of WSS and WSSG.

Primarily, we considered the diameter of the P1 segment of the right and left PCA and flow in the posterior circulation as other predictors of the SCT. Previous studies suggested that the AcomA is more prominent in maintaining blood flow after obstruction than the posterior communicating arteries [25,31]. Besides this, aneurysms are not uniformly distributed [32]. Less than 1% of intracranial aneurysms occur at the vertebrobasilar junction, basilar artery, and superior cerebellar artery bifurcation, while ACA and MCA bifurcations together account for more than 50% of intracranial aneurysms [33]. Consequently, we did not include the diameter of the right and left P1 or flow in the posterior circulation in our final model.

#### *4.2. Limitations and Error Estimation*

A strength of this study was the inclusion of patients from two different cohorts: model training used cross-validation on a retrospective cohort, and model testing used a prospective cohort. Using a small, select number of features with clinical value was important to ensure our study did not suffer from missingness, which could have introduced selection bias. We considered a comprehensive panel of clinical and imaging features as the input to assess the feasibility of our approach. However, the Institutional Review Board (IRB) prevented us from including sex as an input variable in the validation cohort due to the deidentification process for datasets covering less common pathologies with fewer than 100 patients. Despite considering various imaging modalities to monitor possible post-operative ischemia, determining the exact underlying cause of ischemia (e.g., the impact of final clipping rather than temporary clipping, vasospasm, and other intra- or post-operative complications) was challenging, and we did not include patients with cerebral ischemia in our models. Adding intraoperative variables and patients with adverse outcomes could improve the predictive value of our ANN model.

The average deviation of the predicted SCT from the clinically assigned SCT (the relative error of our ANN) was 12%. In the training phase, we used five-fold cross-validation, which resulted in average relative regression errors of 4.3% (training set) and 11.3% (test set). This training error can be considered quite low and acceptable for the training process. After finalizing our regression model, we employed it to estimate the SCT for the validation set (prospective cohort), which yielded the 12% relative error reported above. This result indicates that our model does not suffer from overfitting, underfitting, or an unequal distribution over different subsets. Although an 88% accuracy is a promising result for a pilot study with a total of 131 patients, the model would benefit from validation on larger datasets. However, considering the prevalence of cerebral aneurysms that need surgical intervention, assembling a large cohort of patients is not simple. Furthermore, in this pilot feasibility study we used an ANN as our machine learning framework; comparative analysis with other modeling tools and deep learning methods may provide better performance. We will employ commonly used regression models in a future study to better demonstrate the power of our model compared to the linear models previously used in the medical literature.

#### **5. Conclusions**

The main goal of this study was to present an auxiliary framework to improve the accuracy of the estimated safe clipping time interval of temporary artery occlusion during intracranial aneurysm surgery. The proposed method is an offline approach that can provide a prediction of the SCT for TAO before the surgery. However, to provide an accurate and precise SCT during the surgery, integration of online measurements and frequent updates of the predicted clipping time are required. To design a model with higher generalization, further studies with more clinical variables, larger sample sizes, and more diverse demographics are recommended.

**Author Contributions:** Conceptualization, S.K.-A., S.M.G., and S.A.; methodology, S.S., M.Z.S., S.A., S.K.-A.; software, M.Z.S.; validation, S.S., S.M.G., S.K.-A., and S.A.; formal analysis, S.S. and S.K.-A.; investigation, S.S., S.K.-A., and S.A.; resources, S.M.G.; data curation, S.S. and M.Z.S.; writing—original draft preparation, S.S. and S.K.-A.; writing—review and editing, S.M.G., S.A.; visualization, S.S. and M.Z.S.; supervision, S.M.G. and S.A.; project administration, S.K.-A., S.S.; funding acquisition, None. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board and Medical Ethics and History of Medicine Research Center at Tehran University of Medical Sciences (no. 2011-00611012N).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Areas of Interest and Attitudes towards the Pharmacological Treatment of Attention Deficit Hyperactivity Disorder: Thematic and Quantitative Analysis Using Twitter**

**Miguel Angel Alvarez-Mon 1,2,\*, Laura de Anta <sup>1</sup> , Maria Llavero-Valero <sup>3</sup> , Guillermo Lahera 2,4,5 , Miguel A. Ortega 2,6 , Cesar Soutullo <sup>7</sup> , Javier Quintero 1,8, Angel Asunsolo del Barco 6,9 and Melchor Alvarez-Mon 2,6,10,11**


**Abstract:** We focused on tweets containing hashtags related to ADHD pharmacotherapy between 20 September and 31 October 2019. Tweets were classified as to whether they described medical issues or not. Tweets with medical content were classified according to the topic they referred to: side effects, efficacy, or adherence. Furthermore, we classified any links included within a tweet as either scientific or non-scientific. We created a dataset of 6568 tweets: 4949 (75.4%) related to stimulants, 605 (9.2%) to non-stimulants and 1014 (15.4%) to alpha-2 agonists. Next, we manually analyzed 1810 tweets. In the end, 481 (48%) of the tweets in the stimulant group, 218 (71.9%) in the non-stimulant group and 162 (31.9%) in the alpha agonist group were considered classifiable. Stimulants accumulated the majority of tweets. Notably, the content that generated the highest frequency of tweets was that related to treatment efficacy, with alpha-2 agonist-related tweets accumulating the highest proportion of positive consideration. We found the highest percentages of tweets with scientific links in those posts related to alpha-2 agonists. Stimulant-related tweets obtained the highest proportion of likes and were the most disseminated within the Twitter community. Understanding the public view of these medications is necessary to design promotional strategies aimed at the appropriate population.

**Keywords:** ADHD; social media; Twitter; pharmacotherapy; stimulants; alpha-2-adrenergic agonists; non-stimulants

## **1. Introduction**

Attention deficit hyperactivity disorder (ADHD) is one of the most common neuropsychiatric disorders of childhood and adolescence, often persisting into adulthood [1]. The reported prevalence of ADHD in children varies from 2 to 18 percent [2,3]. ADHD is associated with negative health outcomes and marked impairment in academic, occupational and social functioning [4,5].

**Citation:** Alvarez-Mon, M.A.; de Anta, L.; Llavero-Valero, M.; Lahera, G.; Ortega, M.A.; Soutullo, C.; Quintero, J.; Asunsolo del Barco, A.; Alvarez-Mon, M. Areas of Interest and Attitudes towards the Pharmacological Treatment of Attention Deficit Hyperactivity Disorder: Thematic and Quantitative Analysis Using Twitter. *J. Clin. Med.* **2021**, *10*, 2668. https://doi.org/10.3390/jcm10122668

Academic Editor: Vida Abedi

Received: 15 March 2021; Accepted: 13 June 2021; Published: 17 June 2021

The treatment of ADHD is complex and may involve behavioral, psychological and educational interventions, as well as medication [6]. Different pharmacological treatments have shown efficacy in reducing ADHD symptoms and improving daily functioning [6]. As has been reported, however, the efficacy of these treatments is not homogenous, nor is the frequency and pattern of associated side effects [6]. The choice of the initial medication depends upon a number of factors, including the individual preferences of the clinician, patient and family [6]. Furthermore, adherence to the treatment regimen is critical for the efficacy of the medical intervention [7,8]. Determinants of patient behavior, including adherence to medication and one's own lifestyle habits, are influenced by patients' experiences, attitudes and opinions with regard to their treatment [7,8]. In order to better optimize medical treatments for the management of ADHD, analyses of the opinions of patients and their families are therefore required.

Social media platforms are increasingly being leveraged by researchers for public health surveillance, intervention delivery, the study of attitudes toward health behaviors and diseases, predictions on diseases, and insight into the medical experiences of patients [9–12]. In particular, Twitter is the most commonly used social media platform within health research, and content analysis is the most common approach [13,14]. In this context, the exploration of tweets discussing perceptions of medications for better understanding, compliance and therapeutic decision making has been sufficiently established [15,16].

Moreover, research on patients' beliefs and attitudes has traditionally relied on surveys, interviews and clinical trials [17,18]. However, social media may also allow for a wider range of patients' voices to be heard, including those perspectives from patients reluctant to participate in surveys or research. In addition, since social media posts are spontaneous in nature, they may be more reflective of what patients truly experience than surveys conducted by researchers, which rely on structured, formal interviews [19–21]. Moreover, postings can be collected nearly in real time, thereby avoiding recall bias. Consequently, platforms such as Twitter may provide a useful insight into patients' beliefs.

Finally, the analysis of tweets on psychiatric disorders has recently become a significant area of study for understanding the sentiments of society, patients and health professionals [22–24]. That being said, the topics of medical and non-medical interest among Twitter users in relation to ADHD treatment have not yet been established, and the dissemination of ADHD-related tweets remains unknown.

In this study, we have hypothesized that, firstly, the pharmacological treatment for ADHD is an area of interest for Twitter users and that, secondly, a diverse perception towards the different drug treatments available can be observed. More specifically, the aims of this multidisciplinary research were to investigate the interest and social considerations of Twitter users towards approved pharmacological treatments for ADHD. In addition, we investigated the dissemination of these tweets.

#### **2. Materials and Methods**

#### *2.1. Data Collection*

In this observational quantitative and qualitative study, we focused on searching for tweets that referred to medications approved for the treatment of ADHD: Adderall, Dexedrine, Dextrostat, Focalin, Methylin, Ritalin, Metadate CD (methylphenidate), Ritalin LA (methylphenidate), Adderall XR, Vyvanse (lisdexamfetamine), Concerta, Daytrana, Focalin XR, Quillivant XR (methylphenidate), Intuniv (guanfacine), Kapvay (clonidine) and Strattera (atomoxetine). The inclusion criteria for tweets were: (1) being posted publicly; (2) using any of the previously mentioned hashtags; (3) being posted between 20 September and 31 October 2019; (4) being written in English or Spanish. The six-week period was chosen to avoid any potential bias in the content of the tweets. We collected the number of likes each tweet generated, the date and time of each tweet, a permanent link to the tweet and each user's profile description. In addition, we obtained a list of the ten hashtags most frequently associated with the hashtags of our study.

#### *2.2. Search Tool*

We used the Twitter Firehose data stream, which is managed by Gnip and allows access to 100% of all public tweets that match certain criteria (a query). In our study, the search criteria were the previously mentioned hashtags.

#### *2.3. Content Analysis Process*

All 118,388 retrieved tweets were included in the dataset (Figure 1). First, we excluded tweets mentioning any of the aforementioned medications in an unrelated context. For example, Concerta is also the name of a political party in Chile, so any tweets referring to the political party were omitted. Secondly, we excluded all tweets whose hashtags and keywords were not related to health (e.g., political issues). Specifically, Concerta and Ritalin generated 10,773 and 13,987 tweets, respectively, but 10,127 (94%) and 13,567 (97%) of these, respectively, were not related to health. Indeed, most of them included hashtags (#mesacentral, #apoyofirmado, #tumbamadre, #lamarchamasgrande, #Pinerarenuncia) or keywords related to the political conflict occurring in Chile. Similarly, Adderall generated 87,808 tweets, of which 86,052 (98.7%) included hashtags or keywords related to political issues (e.g., Trump, impeachment).
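The exclusion step above amounts to filtering tweets against an off-topic lexicon. The sketch below uses example terms taken from the paper; the authors' full lexicon and tooling are not described, so this is illustrative only.

```python
# Drop tweets whose text contains off-topic hashtags or keywords.
# The OFF_TOPIC set holds example terms from the paper, not the full lexicon.

OFF_TOPIC = {"#mesacentral", "#apoyofirmado", "#tumbamadre",
             "#lamarchamasgrande", "#pinerarenuncia", "trump", "impeachment"}

def is_health_related(tweet_text: str) -> bool:
    text = tweet_text.lower()
    return not any(term in text for term in OFF_TOPIC)

tweets = [
    "Started #Ritalin this week, focus is so much better",
    "#Concerta #mesacentral debate tonight",
]
health_tweets = [t for t in tweets if is_health_related(t)]
```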

**Figure 1.** Flowchart of data management.

All 8642 remaining tweets were inspected by two raters (M.A.A.-M. and L.d.A.). First, we scanned all of the tweets and excluded 2074 that provided information that was too limited, contained only images or included hashtags of more than one treatment. This process led to the creation of a more concise dataset of 6568 tweets, which we divided into three groups: 4949 (75.4%) stimulants, 605 (9.2%) non-stimulants and 1014 (15.4%) alpha-2-adrenergic agonists.

Next, we created a codebook based on our research questions, our previous experience in analyzing tweets and what we determined to be the most common themes. M.A.A.-M. and L.d.A. analyzed 300 tweets separately to test the suitability of the codebook. Discrepancies were discussed between the raters and with another author (M.L.-V.). After revising the codebook, the raters then proceeded to perform a content analysis of 50% of the tweets in each group, limited to a maximum of 1000 randomly selected tweets per group. Thus, we manually analyzed 1000 tweets from the stimulant group, 303 from the non-stimulant group and 507 from the alpha-2 agonist group (Figure 1). Classification criteria and examples of tweets are shown in Table 1.
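The sampling rule above (half of each group, capped at 1000 tweets, selected at random) can be sketched as follows. Rounding the halves up is an assumption, chosen because it reproduces the counts reported in the paper (1000, 303 and 507).

```python
import math
import random

# Sample 50% of each group for manual coding, capped at 1000 tweets.

def sample_for_coding(tweet_ids, rng):
    n = min(math.ceil(len(tweet_ids) / 2), 1000)   # half the group, rounded up
    return rng.sample(tweet_ids, n)

rng = random.Random(0)
stimulant      = sample_for_coding(list(range(4949)), rng)   # group sizes
non_stimulant  = sample_for_coding(list(range(605)), rng)    # from the paper
alpha2_agonist = sample_for_coding(list(range(1014)), rng)
```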

**Table 1.** Category, definitions and examples of classification. Usernames and personal names were removed.


*2.4. Measuring Influence and Interest on Twitter*

We analyzed the number of likes generated by each tweet as an indicator of user interest on a given topic. We also measured the potential reach and impact of all analyzed hashtags. Impact is defined as a numerical value representing the potential views a tweet may receive, while reach is defined as a numerical value measuring the potential audience of the hashtag.

In addition, we measured how positive or negative a hashtag was on a scale from 1 (negative) to 100 (positive). Sentiment analysis tools analyze all words contained in a tweet, and each word has its own score that can vary depending on the context. The average score of all the tweets with a certain hashtag determines that hashtag's overall sentiment score.
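The hashtag-level aggregation described above can be sketched as a mean over per-tweet scores. The scores below are made up, and the paper does not name its scoring tool, so this shows only the averaging step.

```python
from collections import defaultdict

# Average per-tweet sentiment scores (1 = negative, 100 = positive)
# into one score per hashtag; scores here are illustrative.

tweet_scores = [
    ("#vyvanse", 72), ("#vyvanse", 64),
    ("#strattera", 55), ("#strattera", 61), ("#strattera", 49),
]

by_hashtag = defaultdict(list)
for tag, score in tweet_scores:
    by_hashtag[tag].append(score)

sentiment = {tag: sum(s) / len(s) for tag, s in by_hashtag.items()}
```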

#### *2.5. Ethical Considerations*

This study was approved by the Research Ethics Committee of the University of Alcala (OE 14\_2020).

#### *2.6. Statistical Analysis*

A descriptive study of the sample was performed, describing the variables by their absolute and relative frequencies. The percentages found were compared using the chi-square test. In the case of quantitative variables, we checked whether they followed a normal distribution using the Kolmogorov–Smirnov test. As this was not the case, non-parametric tests were used. The Kruskal–Wallis test was used for comparisons of median values among the three groups, followed by post hoc testing using a Bonferroni-adjusted alpha level.
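The pipeline just described (normality check, non-parametric omnibus test, Bonferroni-adjusted post hoc comparisons) can be sketched with SciPy. The like-count data below are simulated, not the study's, and the choice of Mann–Whitney U as the pairwise post hoc test is an assumption, since the paper only states that the post hoc alpha was Bonferroni-adjusted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated like counts per tweet for the three drug groups (right-skewed,
# as count data typically is; NOT the study's data)
stimulants = rng.poisson(8, 200)
non_stimulants = rng.poisson(5, 100)
alpha2 = rng.poisson(3, 100)

# Kolmogorov-Smirnov check against a normal fitted to the sample
ks = stats.kstest(stimulants, "norm", args=(stimulants.mean(), stimulants.std()))

# Non-normal data -> Kruskal-Wallis omnibus test across the three groups
h, p = stats.kruskal(stimulants, non_stimulants, alpha2)

# Post hoc: pairwise Mann-Whitney U tests at a Bonferroni-adjusted
# alpha of 0.05 / 3 comparisons
alpha_adj = 0.05 / 3
for name, a, b in [("stim vs non-stim", stimulants, non_stimulants),
                   ("stim vs alpha-2", stimulants, alpha2),
                   ("non-stim vs alpha-2", non_stimulants, alpha2)]:
    _u, p_pair = stats.mannwhitneyu(a, b)
    print(name, "significant:", p_pair < alpha_adj)
```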

#### **3. Results**

#### *3.1. Stimulants Accumulated the Most Interest among Twitter Users*

According to the codebook, 521 (52%) of the stimulant tweets, 85 (28.1%) of the non-stimulant tweets and 345 (68.1%) of the alpha-2 agonist tweets were considered unclassifiable. These tweets shared information or news about the commercialization of the medication, business-related information, or mentions of treatments for disorders other than ADHD. In the end, 481 (48%) of the tweets in the stimulant group, 218 (71.9%) in the non-stimulant group and 162 (31.9%) in the alpha-2 agonist group were considered classifiable (Figure 1). In terms of content, the mention of a specific medication was related to its efficacy, its side effects or adherence to treatment for ADHD (Table 1). Moreover, these coding categories were not mutually exclusive, in the sense that a tweet could be listed under more than one category.

There were significant differences in the percentage of tweets with content on medical efficacy between the three groups of drugs (Table 2). The percentage of tweets related to efficacy was higher in the alpha-2 agonist group than in the stimulant and non-stimulant groups. Furthermore, the alpha-2 agonist group also had the highest percentage of tweets containing a positive description of the efficacy of their use (74.1%). The stimulant and non-stimulant groups showed similar percentages of tweets addressing efficacy, as well as similar valuations of that efficacy.


**Table 2.** Descriptive characteristics of the tweets considered classifiable in the content analysis, categorized by total amount per drug and category.



For each category, total number of tweets (*n*) and relative proportions (%) are provided. Chi-square tests were conducted to assess for statistical differences.

The analysis of the content related to the side effects of the treatments also showed significant differences between the three groups of drugs (Table 2). The alpha-2 agonist group had the highest percentage of tweets with content related to side effects and accumulated the highest percentage of those tweets with a negative valuation (72.8%). In contrast, the stimulant group had the lowest percentage of negative valuations of side effects (49.1%). There were no significant differences between the three groups of drugs in the percentage of tweets mentioning treatment adherence, which was low in all of them.

#### *3.2. Scientific Links Were Mainly Found in Alpha-2 Agonist-Related Tweets*

We investigated the use of sources defined by the inclusion of links within the tweet. The links were categorized as scientific or non-scientific sources. Of the tweets related to ADHD, 185 out of the 861 (21.5%) included a reference source, the majority of which were scientific in nature (94.6%). We found significant differences between the percentages of tweets containing a reference link between the three groups of drugs (*p* < 0.001) (Table 2). Those tweets related to alpha-2 agonists had the highest percentage of links, of which most were scientific. In contrast, tweets related to the stimulant drug group had the lowest percentage of links (3.1%).

We observed that the percentages of tweets with negative or positive content related to the efficacy of treatments differed between tweets that included a link and those that did not (*p* < 0.001) (Table 3). Negative opinions of treatment efficacy were more frequent in tweets without a link (8.1%). In contrast, the percentage of tweets related to side effects was higher among those with a link than among those without one (*p* < 0.001). Interestingly, the use of links in tweets with adherence content was very low (0.5%).

We studied the use of links in the three groups of treatments. We found a different pattern of distribution of links within the different categories. Within the group of alpha-2 agonist tweets, we observed that the majority of the tweets with a link were focused on efficacy and side effects (Table 4). In contrast, within the non-stimulant group, references to efficacy were mainly posted without a link. Lastly, within the stimulant drug group, efficacy was mainly addressed using a link, whereas side effects were mainly addressed without one.


**Table 3.** Descriptive characteristics of the tweets considered classifiable in the content analysis, categorized by either including or not including a link.

For each category, total number of tweets (*n*) and relative proportions (%) are provided. Chi-square tests were conducted to assess for statistical differences.

**Table 4.** Use of links in the different content categories of the tweets related to the three different groups of pharmacological treatments.




Percentages (%) were calculated with respect to the total number of tweets generated without or with links in each group of treatments and content category. NM = no mention. + = positive. − = negative.

#### *3.3. Stimulant-Related Tweets Were the Most Disseminated within the Twitter Community*

We found that the probabilities of a tweet being liked differed significantly among the three groups (*p* < 0.001). Alpha-2 agonist tweets received a statistically significantly lower number of likes than both non-stimulant (*p* = 0.024) and stimulant (*p* < 0.001) tweets. Stimulant-related tweets accumulated the highest median number of likes per tweet. In addition, we analyzed the number of likes received per tweet as classified by the inclusion or absence of a link. We found that tweets not including a link had a significantly higher median number of likes per tweet than those including a link (*p* = 0.041).

Furthermore, we found that stimulant-related tweets had the highest potential reach and impact (Figure 2). Both parameters were markedly lower for non-stimulant and alpha-2 agonist-related tweets. Regarding the sentiment analyses of the content of the tweets, we found that it was positive for all three groups (Figure 3).

**Figure 2.** Potential reach and potential impact.

**Figure 3.** Sentiment score comparisons between alpha-2 agonists, non-stimulants and stimulants. Sentiment analysis classifies the polarity of each tweet on a scale of 0 (very negative) to 100 (very positive).

#### **4. Discussion**

#### *4.1. Principal Findings*

In this study, we have found that Twitter users show a great interest in ADHD drugs, mainly focusing on stimulants. These tweets are centered on the efficacy and side effects of ADHD treatment. Tweets containing a positive consideration of efficacy were mainly observed in those posts related to alpha-2 agonists. The frequency of tweets with content related to adherence to treatment was marginal. The highest percentages of tweets with scientific links were observed in those related to alpha-2 agonists. Furthermore, those tweets referencing stimulants obtained the highest potential reach and impact.

The treatment of ADHD is complex, involving both the use of non-pharmacological tools and the prescription of drugs [6]. Regarding pharmacological treatments, different variables condition their clinical results. For instance, the efficacy of a drug for controlling disease symptoms and the frequency and intensity of side effects are considered to be critical for a treatment's success [6]. Nevertheless, the subjective experience of a drug being used by a patient is pivotal too in terms of adherence to treatment [7,8]. Furthermore, a patient's experience when consuming a drug is conditioned by any information or social valuations received [25]. Thus, the study of patients' experiences with regard to the efficacy and side effects of, as well as adherence to, ADHD treatments is an area of intense focus, having been previously assessed mainly through the use of qualitative studies such as surveys and interviews [26,27]. However, contradictory results have also been reported on the different drugs employed in ADHD treatment [28].

Currently, Twitter serves as one of the predominant social platforms for disseminating perspectives publicly, giving anonymity to user testimonies and encouraging communication by people with real or perceived personal and social restrictions [29]. This anonymity also prevents the potential stigmatization of a Twitter user for his/her attitudes towards a disease, or for any physical or mental conditions they choose to disclose [30]. Thus, Twitter has become a relevant tool for the dissemination of medical information and an interesting resource for the study of individual experiences and opinions [31]. Furthermore, it has been shown that young people tend to hide information from their doctors, especially that information related to behaviors of which health care providers do not usually approve [32]. As a result, Twitter gives them the opportunity to express their experiences anonymously [33].

In this study, we have demonstrated that the use of Twitter for sharing information on patient experiences regarding ADHD treatment is significant, with tweets of this type maintaining a high frequency among those containing content more generally related to medical treatment. Nevertheless, Twitter users tend to be younger than the population at large; likewise, ADHD tends to affect a mostly younger demographic [1]. Moreover, the majority of the tweets with medical content on ADHD drug treatments were related to the stimulant group. Interestingly, these data uphold the elevated frequency of the use of these drugs in the treatment of ADHD patients globally [34].

The high frequency of tweets with content related to the efficacy and side effects of pharmacological treatments supports their significance to ADHD patients. Several studies have also examined the efficacy of each treatment on ADHD symptoms, albeit with contradictory results [28]. Various reasons might explain such a discrepancy, yet this strategy for obtaining patient information is critical regardless. Additionally, it has been proposed that infodemiology may overcome the Hawthorne effect as well as the memory recall biases common to cross-sectional surveys and questionnaire-based studies [19,35]. In terms of medical efficacy, our findings show that the alpha-2 agonist group of drugs accumulated the highest frequency of tweets with a positive valuation, even though the frequencies observed for stimulants and non-stimulants ranked similarly. However, the alpha-2 agonist group of drugs also received the highest frequency of tweets related to their side effects; interestingly, stimulants received the lowest frequency of tweets with regard to tolerability. It has previously been shown that some of the side effects of stimulant drugs are even considered positive and actively pursued by patients [36]. These results might support the designation of stimulant drugs as the first pharmacological option for treating ADHD, as evidenced by several guidelines [6,37].

Our data also show that adherence to a pharmacological treatment is not a relevant consideration for ADHD patients who are Twitter users. Notably, a similarly low frequency of tweets related to treatment adherence was found within all three groups of drugs, with positive valuations of adherence uncommon but slightly higher in the non-stimulant and alpha-2 agonist groups. Furthermore, the limited interest in treatment adherence found among Twitter users is consistent with previous studies employing other strategies [38,39].

Correct medical information is considered a cornerstone for the understanding of disease and subsequent patient treatment [40,41]. Currently, access to medical information has become generalized across the internet. For instance, we found that one fifth of the content related to medical treatment included a link, the majority of which was deemed scientific in nature. This low frequency of links in tweets related to ADHD pharmacological treatment contrasts with that found in a study on statins [19]. Moreover, tweets including a link were twenty times more frequent in posts referring to alpha-2 agonists than in those related to stimulants.

Different patterns in the use of links were also found between the different groups of drugs analyzed. Within the group of alpha-2 agonist tweets, for instance, the majority of tweets with a link were focused on efficacy and side effects. In contrast, among the nonstimulant group, the majority of tweets mentioning efficacy did not include a link. These results indicate the significant relevance of scientific information and medical research for ADHD patients who are Twitter users. As an example, most alpha-2 agonist medications have been approved for ADHD treatment over the last ten years, while stimulants have been used for decades. This finding therefore supports Twitter's value as a means of communicating scientific content. However, it is worrying that only a limited number of tweets referring to ADHD treatment adherence included a scientific link, especially considering that adherence is pivotal to treatment success [7,8]. That being said, trends in information exchanged over Twitter may still be important as studies have identified that certain health behaviors can be affected [42,43].

Our study also shows that the names of those drugs used for ADHD treatment coincided with tweets referencing political, social and other non-medical content. Furthermore, we observed pejorative uses of these names by Twitter users. These findings suggest that social stigmatization towards mental health, as previously described, still persists, producing deleterious effects in the lives of people suffering from mental health conditions [44,45]. Likewise, the non-medical use of psychostimulant drugs, which has not always been uncovered via traditional surveys, has nevertheless been revealed through the analysis of Twitter content [46].

Clinicians themselves should therefore take into consideration information posted over social media with regard to pharmacological treatment that otherwise may not be spontaneously reported during a patient interview. This is particularly important for medications commonly abused or consumed over the counter, behaviors commonly hidden by patients from doctors [47]. In this context, social media may be deemed a friendlier place to discuss the effects of medications, especially those usually rejected by doctors. As relates to this study, an increase in the dissemination of scientific information on ADHD treatment and, in particular, the importance of the adherence to said treatment appears to be a primary objective for the medical community at large.

#### *4.2. Limitations*

First, Twitter may not be reflective of the general population. Second, researchers cannot directly measure clinical outcomes from tweets. Third, the codebook design and text analysis imply a degree of subjectivity. However, this methodology is consistent with previous medical research studies using Twitter. Furthermore, to address this last issue, our study comprised a series of countermeasures, including an initial review, the design of the codebook and an agreement between coders. Although computerized machine learning methods have been tested to automatically identify and classify topics in medical research over social media, we used an analytical strategy based on the raters' clinical expertise in psychiatry, which constituted a qualitative advantage compared to other automated strategies [48]. Finally, we did not determine whether the date of FDA approval affected Twitter activity differently when comparing more recent medications to older ones.

#### **5. Conclusions**

This study identified interesting beliefs and opinions regarding the pharmacological treatment of ADHD that may affect patient behavior. Moreover, social media may be useful for investigating the public's prevailing attitudes when investigating particular medications, as well as when patients report on adverse events and efficacy since both issues can affect their choice of and adherence to treatment. Public perceptions about medications could in turn help inform clinicians, particularly when developing treatment guidelines. Specific to ADHD, public opinions elucidated by this study could be used to help update guidelines, improve communication between health care professionals and patients and ultimately help to build more viable bridges between both parties.

**Author Contributions:** Material preparation, data collection and analysis were performed by M.A.A.-M., L.d.A. and M.L.-V. A.A.d.B. conducted and reported the statistical analysis. The first draft of the manuscript was written by M.A.A.-M. Interpretation of data and revision of the manuscript for important intellectual content was carried out by M.A.O., G.L., C.S. and J.Q. M.A.-M. contributed as supervisor of all the stages. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by grants from the Fondo de Investigación de la Seguridad Social, Instituto de Salud Carlos III (PI18/01726), Spain, and the Programa de Actividades de I+D de la Comunidad de Madrid en Biomedicina (B2017/BMD-3804), Madrid, Spain.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Research Ethics Committee of the University of Alcala (OE 14\_2020).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data that support the findings of this study are available from the corresponding author upon reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Attitudes towards Trusting Artificial Intelligence Insights and Factors to Prevent the Passive Adherence of GPs: A Pilot Study**

**Massimo Micocci 1,2,\* , Simone Borsci 1,2,3, Viral Thakerar <sup>4</sup> , Simon Walne 1,2, Yasmine Manshadi <sup>5</sup> , Finlay Edridge <sup>5</sup> , Daniel Mullarkey <sup>5</sup> , Peter Buckle 1,2 and George B. Hanna 1,2**


**Citation:** Micocci, M.; Borsci, S.; Thakerar, V.; Walne, S.; Manshadi, Y.; Edridge, F.; Mullarkey, D.; Buckle, P.; Hanna, G.B. Attitudes towards Trusting Artificial Intelligence Insights and Factors to Prevent the Passive Adherence of GPs: A Pilot Study. *J. Clin. Med.* **2021**, *10*, 3101. https://doi.org/10.3390/jcm10143101

Academic Editors: Vida Abedi, Bahi Takkouche and Roberto Cuomo

Received: 30 April 2021 Accepted: 9 July 2021 Published: 14 July 2021


**Abstract:** Artificial Intelligence (AI) systems could improve system efficiency by supporting clinicians in making appropriate referrals. However, they are imperfect by nature, and misdiagnoses, if not correctly identified, can have consequences for patient care. In this paper, findings from an online survey are presented to understand the aptitude of GPs (*n* = 50) in appropriately trusting or not trusting the output of a fictitious AI-based decision support tool when assessing skin lesions, and to identify which individual characteristics could make GPs less prone to adhere to erroneous diagnostic results. The findings suggest that, when the AI was correct, the GPs' ability to correctly diagnose a skin lesion significantly improved after receiving correct AI information, from 73.6% to 86.8% (X<sup>2</sup> (1, *N* = 50) = 21.787, *p* < 0.001), with significant effects for both the benign (X<sup>2</sup> (1, *N* = 50) = 21, *p* < 0.001) and malignant cases (X<sup>2</sup> (1, *N* = 50) = 4.654, *p* = 0.031). However, when the AI provided erroneous information, only 10% of the GPs were able to correctly disagree with the indication of the AI in terms of diagnosis (d-AIW M: 0.12, SD: 0.37), and only 14% of participants were able to correctly decide the management plan despite the AI insights (d-AIW M: 0.12, SD: 0.32). The analysis of the difference between groups in terms of individual characteristics suggested that GPs with domain knowledge in dermatology were better at rejecting the wrong insights from the AI.

**Keywords:** artificial intelligence; trust; passive adherence; human factors

#### **1. Introduction**

Artificial Intelligence (AI)-based technologies used for medical purposes may have the ability to change the healthcare landscape, providing opportunities for the prioritization of patients who are most at risk [1] and for the support of clinicians making diagnostic conclusions [2].

A growing field of development for AI systems is dermatology, in which early detection of melanoma may benefit patients [3–5]. Every year in the UK, General Practitioners (GPs) see over 13 million patients for dermatological concerns [6]; melanoma is one of the most dangerous forms of skin cancer, with the potential to metastasise to other parts of the body via the lymphatic system and bloodstream. The current standard of care for skin cancer is set by the National Institute for Health and Care Excellence (NICE) [7], which adopts a 'risk threshold' of 3% positive predictive value (PPV) in primary care to underpin recommendations for suspected skin cancer pathway referrals and urgent direct access investigations in cancer. GPs are expected to refer under the two-week-wait (2WW) pathway if the probability of cancer is 3% or higher. Referral rates are also influenced by factors beyond clinical suspicion of the lesion, such as a clinician's individual risk tolerance and perceived patient expectations or concerns [8]. Dermatology is the speciality with the highest referral rate in the NHS [9]; however, of the half a million cases referred on this pathway, melanoma and squamous cell carcinoma (SCC) made up only 6.5% of referrals in 2019/20 [10]. This reflects the accepted behaviour amongst clinicians of referring with a very low threshold to facilitate detection in the early stages of the disease. The same data from the National Cancer Registration and Analysis Service (NCRAS) also indicate that only 64% of cancers are detected through 2WW referrals, suggesting that considerable numbers of skin cancer cases are detected through alternative pathways, potentially representing missed diagnoses by GPs and risking delays in diagnosis. These professionals, given their role as generalists rather than specialist dermatologists [11], represent the first line of defence against skin cancer, and they might benefit from the support of an accurate AI solution for the early detection of skin cancer and the identification of atypical presentations, with an overall beneficial impact for patients and the NHS [12].

The number of studies assessing the efficacy of intelligent systems for dermatology applications [13–18] is significant. However, to date, only a few of these AI-enabled medical devices have made it through to real-world deployment. This is also a result of a lack of randomized trials [18] and the absence of AI assessments for lesions with abnormal presentation and clinical features similar to melanoma that may produce erroneous diagnoses [19]. These tools are dependent on the quantity and quality of training data [12,20]. The introduction of algorithm-based tools into a complex socio-technical system may create friction and conflict in decision making; this is due to the intrinsic tendency of artificial intelligence to reach a certain 'conclusion' that may not be transparent to human decision-makers and the consequent alterations in practices.

Ultimately, the key issue with AI is how much decision makers will trust these medical devices once deployed in the market. The inclusion of AI systems in the healthcare field should be supported by the awareness that these systems, like the existing workforce, are imperfect. For decision support tools, the resilience of the diagnostic process is in the hands of the clinicians, even when an AI is involved, as they are the only ones who have a holistic view of each clinical scenario, and they can decide to agree or disagree with an AI [21]. Beyond the issue associated with having a 'black box' AI or a fully transparent tool to support decisions [22], the main risk could also be that professionals might over-trust the insights provided by these tools due to a lack of expertise in the use of the technology or the complexity around the cases [4,23,24].

In this paper, we present results from an online survey conducted on a pool of GPs who were presented with a combination of accurate and inaccurate results from a hypothetical AI-enabled diagnostic tool for the early detection of skin cancer. This study aimed to explore the attitudes of GPs when asked to trust (or not to trust) the AI diagnosis as appropriate. We also explore 'predicting factors to trust' that would make GPs resilient enough to prioritise their clinical opinion when an AI produces erroneous diagnoses.

#### **2. Materials and Methods**

A total of 73 GPs participated in the study. Among them, 23 were excluded because they were unable to finalise or correctly complete the test. The final sample of 50 GPs (mean age: 34.4, min = 26, max = 53; 76% female) completed the test online via QualtricsXM between 10 April 2020 and 10 May 2020. Participants were informed of the study and recruited by email through a clinical lead in primary care research at the NIHR LIVD; in addition, the link to the survey was posted on social media (Twitter and LinkedIn) and in a private WhatsApp group used by GPs, including GPs with special interests, working in the Greater London area.

The online test was composed of the following sections:

• Demographics. This section was composed of 15 items. It included qualitative questions regarding individual characteristics (age, gender, years of practice, etc.) and questions regarding the respondent's interest in dermatology and attendance at dermatology courses in the past three years, as well as their perceived confidence in dermatology and familiarity with tools for early skin cancer diagnosis. Three questions considered the GPs' overall trust attitude toward innovations in medical devices [25].

• Main test. This was composed of questions on 10 lesions (See Appendix A) purposively selected to be representative of commonly encountered lesions. The cases presented are realistic. Cases of misclassification were modified to explore GPs' attitudes when their diagnosis conflicted with those from the AI.

Each lesion was accompanied by vignettes of hypothetical patient details likely to be elicited in a routine GP consultation (age, gender, duration of the skin lesion, evolution/changes of the lesion, sensory changes, bleeding, risk factors, body location). Each lesion was associated with three questions pertaining to:


The 10 skin lesions were divided in terms of the type of decision making and type of case (benign and malignant) as follows:


#### *2.1. Procedure*

The study was presented to participants as a simulation—with fictitious patients' details—to assess their agreement with an AI system to better report diagnostic test results. Once the study was completed, a disclaimer email was sent to each participant clarifying that the provided combinations of lesions/diagnoses in the study were not always accurate; the study aim of assessing GPs' performance and attitudes with both accurate and inaccurate AI diagnoses was fully explained. After the demographic survey, each participant received ten blocks of questions (each related to one lesion) in a fully randomised order. Participants completed these questions regarding the diagnosis, the management plan and their confidence twice:


GPs were then asked to decide whether to change or to maintain their answers regarding the diagnosis, management plan and their confidence in their decision.

#### **Figure 1.** Example of one lesion with only patient information (fictitious).


**Figure 2.** Example of one lesion with a fictitious AI assessment.

#### *2.2. Data Analysis*

Descriptive statistics were used to observe participants' characteristics, the frequency of correct diagnoses and management plans, and the GPs' confidence in their decision making before and after receiving the AI-enabled information. The pre- and post-AI performance levels of the GPs, in terms of their diagnoses and management plans, were dichotomised (correct/incorrect), and McNemar's Chi-square test was used to analyse the effect of AI information in each decision-making group (EC, CU, DS), also accounting for the type of case (benign and malignant). The percentage of confidence was tested using a generalized linear mixed model.
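McNemar's test uses only the discordant pairs, i.e., cases that flipped between correct and incorrect after the AI insight. A stdlib-only sketch of the uncorrected statistic, with made-up counts (the exact variant used in the study, e.g., with or without continuity correction, is not stated here):

```python
from math import erf, sqrt

def mcnemar(b, c):
    """McNemar's chi-square (1 df, no continuity correction) for paired
    binary outcomes: b = correct before the AI insight / incorrect after,
    c = incorrect before / correct after. Returns (chi2, p)."""
    chi2 = (b - c) ** 2 / (b + c)
    # chi-square(1 df) survival function, written via the error function
    p = 1 - erf(sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant-pair counts: 4 GPs got worse, 70 improved
chi2, p = mcnemar(b=4, c=70)
print(round(chi2, 2), p < 0.001)  # 58.86 True
```

The concordant pairs (correct both times, or incorrect both times) do not enter the statistic, which is why only `b` and `c` are needed.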

The hit and false-alarm rates of the GPs for the diagnostic and management decision making before and after the wrong AI insights were used to model GPs' resilience when dealing with erroneous AI information (i.e., DS cases). In line with signal detection theory [27], a computation was used to compose a sensitivity index for when the AI was wrong (d-AIW, see Appendix B); the higher the index above zero, the better the GP's ability to ignore the wrong indication of the AI. The index was used to distinguish two groups: one included GPs who had a d-AIW over zero (hereafter called the 'resilient group') and the other included GPs with an index below or equal to zero (hereafter called the 'non-resilient group') for the management and diagnostics of patients with skin lesions. A Kruskal–Wallis test was performed to check whether resilient and non-resilient GPs performed significantly differently when the AI provided them with correct and incorrect answers and to observe the differences between the two groups in terms of individual characteristics.
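The exact d-AIW formula is given in the paper's Appendix B, which is not reproduced here; a sketch in the spirit of the signal-detection d' it is based on, with illustrative rates and an assumed reading of what counts as a hit, might look like:

```python
from statistics import NormalDist

def d_aiw(hit_rate, false_rate):
    """d'-style sensitivity index: z(hit rate) - z(false-alarm rate).

    Here a 'hit' is taken to be a GP keeping the correct answer despite a
    wrong AI insight, and a 'false alarm' is following the wrong insight
    (an assumed reading; the study's exact definition is in Appendix B).
    Rates are clipped away from 0 and 1 to keep the z-scores finite.
    """
    z = NormalDist().inv_cdf
    clip = lambda r: min(max(r, 0.01), 0.99)
    return z(clip(hit_rate)) - z(clip(false_rate))

idx = d_aiw(hit_rate=0.6, false_rate=0.3)   # illustrative rates
print(round(idx, 2), "resilient" if idx > 0 else "non-resilient")
```

A positive index places a GP in the 'resilient group' (more likely to keep a correct answer than to follow a wrong insight); zero or below places them in the 'non-resilient group'.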

#### **3. Results**

#### *3.1. Individual Characteristics*

In total, 76% of the participants had less than 5 years of experience, 16% from 5 to 10 years and 8% had more than 10 years of experience. Overall, the GPs in our cohort declared an average level of confidence in dermatology of 51.5 out of 100 (SD: 16.2), although 34% of them had attended specialisation courses on the topic in the previous three years. Seventy per cent of the participants stated that they had not used a dermatoscope in the previous 12 months, with only 4% of the GPs declaring weekly use of such an instrument. Thirty-eight per cent never used digital systems for skin lesions (e.g., taking pictures of patients' skin lesions to be uploaded into the system), while among those who used such digital systems for diagnostic purposes, 2% declared daily usage, 10% weekly and 50% stated that they used them at least once per month. The level of trust toward AI support systems declared by GPs for this application domain was sufficient (M: 61.2%; SD: 14.5%).

#### *3.2. General Practitioners' Correct Decision Making before and after AI Insights*

Table 1 shows the statistics of GPs' performances before and after receiving the fictitious AI-enabled information, which suggests that GPs tended to adhere to the indications of the AI. Specifically, when the AI was correct (EC and CU cases), there was a positive effect on GPs' performance and confidence. Correct diagnosis, supported by a trustworthy AI, went up by 13.2 points for EC cases and 16.5 points for CU cases. Similarly, the selection of the correct management plan went up by 7.6 points (EC) and 18.5 points (CU). GPs' confidence in their decision making went up by 12.7 points for EC cases after the insights of the AI, while this aspect only increased by 1.5 points when dealing with CU cases. Conversely, when the AI provided incorrect insights (DS cases), the correctness of diagnoses and management went down by 24 and 29 points respectively, with a positive boost of 5.7 points in the GPs' confidence in their decision making after receiving AI insights.

McNemar's Chi-square test clarified how the AI insights affected the GPs' decision making for each group.


**Table 1.** Statistics for GP performance before and after receiving the fictitious AI assessment.

Everyday cases (EC): GPs' ability to correctly diagnose a skin lesion significantly improved after receiving the AI information from 73.6% to 86.8% (X<sup>2</sup> (1, *N* = 50) = 21.787, *p* < 0.001), with significant effects for both the benign (X<sup>2</sup> (1, *N* = 50) = 21, *p* < 0.001) and malignant (X<sup>2</sup> (1, *N* = 50) = 4.654, *p* = 0.031) cases. The selection of the correct management plan was also positively affected by the AI information, going from 82.4% to 90% (X<sup>2</sup> (1, *N* = 50) = 3.78, *p* < 0.001), and it was particularly relevant for the plans regarding benign cases (X<sup>2</sup> (1, *N* = 50) = 22, *p* < 0.001), while no major improvement was observed for malignant cases. Confidence about decision making, independent of the type of skin lesion, significantly improved from 66.8% to 79.5% after receiving the AI information (X<sup>2</sup> (1, *N* = 48) = 107.2, *p* < 0.001).

Cases with uncertainties (CU): GPs' diagnostic accuracy improved significantly from 37.5% to 54% when supported by the AI (X<sup>2</sup> (1, *N* = 50) = 24.9, *p* < 0.001). This difference was significant for benign cases (X<sup>2</sup> (1, *N* = 50) = 31.03, *p* < 0.001), while no significant differences emerged in malignant cases before and after receiving AI information. Concurrently, the ability to correctly define a management plan significantly increased from 44% to 62.5% thanks to the AI (X<sup>2</sup> (1, *N* = 50) = 28.195, *p* < 0.001), and this effect was significant for benign cases (X<sup>2</sup> (1, *N* = 50) = 31, *p* < 0.001). GPs' confidence was not significantly affected by the AI information.

Dangerous situations (DS): When erroneous information was provided by the AI, it seems that GPs were significantly pushed to adhere to the erroneous suggestions of the AI. Correct diagnosis of the skin lesions significantly decreased from 46% to 22% (X<sup>2</sup> (1, *N* = 50) = 22.04, *p* < 0.001). Adherence to the wrong AI insights was significant for both benign (X<sup>2</sup> (1, *N* = 50) = 9.08, *p* = 0.026) and malignant (X<sup>2</sup> (1, *N* = 50) = 11.7, *p* = 0.009) cases. Similarly, decision making about management was significantly affected by wrong AI insights, decreasing the ability of GPs to correctly decide the plan for the patient from 54% to 25% (X<sup>2</sup> (1, *N* = 50) = 25.290, *p* < 0.001). This significantly affected GPs' decision making regarding both benign (X<sup>2</sup> (1, *N* = 50) = 12.07, *p* = 0.005) and malignant (X<sup>2</sup> (1, *N* = 50) = 11.52, *p* = 0.007) cases. Confidence was not affected by the information provided by the AI.
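The McNemar statistics reported above (X<sup>2</sup> with 1 degree of freedom) are computed from the discordant pairs alone, i.e., the cases that flipped from correct to incorrect (or vice versa) after the AI insight. A minimal Python sketch under that assumption; the function name and the counts are illustrative, not taken from the study:

```python
from math import erf, sqrt

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar chi-square for paired binary outcomes.

    b = answers correct before the AI but wrong after;
    c = answers wrong before the AI but correct after.
    Only these discordant pairs enter the statistic.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Two-sided p-value via the identity chi-square(1 df) = Z^2,
    # where Phi is the standard normal CDF.
    z = sqrt(chi2)
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return chi2, p

# Hypothetical counts: 2 answers flipped correct -> wrong,
# 15 flipped wrong -> correct after seeing the AI insight.
chi2, p = mcnemar_chi2(2, 15)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

With these toy counts the improvement is significant at the usual 0.05 level; the study's own tables report the real values.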

#### *3.3. Resilience to the Erroneous Insights of the Artificial Agent*

When the AI provided erroneous information (DS cases), only 10% of the GPs were able to correctly disagree with the indication of the AI in terms of diagnosis (d-AIW M: 0.12, SD: 0.37), and only 14% of participants were able to correctly decide the management plan despite the AI insights (d-AIW M: 0.12, SD: 0.32). These GPs were categorized as the resilient ones (i.e., the ones able to correctly reject the AI insights), as opposed to all the others, who were categorized as less resilient to the wrong indications of the AI.

The Kruskal–Wallis test, when carried out on EC and CU cases (when the AI provided correct results), suggested that the performance of the GPs in the resilient group was not significantly different to the performance of the less resilient group. Conversely, when the AI provided erroneous diagnoses (DS cases), a significant difference was found between the two groups in terms of diagnostic decision making (X<sup>2</sup> = 12.4, *p* < 0.001) and the correct management plan (X<sup>2</sup> = 6.8, *p* = 0.009).

The analysis of the differences between the groups in terms of individual characteristics suggested that GPs who declared regular usage of the dermatoscope were better at rejecting the wrong insights from the AI and making correct diagnoses (X<sup>2</sup> = 7.8, *p* = 0.005) and at managing patients (X<sup>2</sup> = 5.1, *p* = 0.023) compared to less resilient GPs. Some moderate but still significant effects also emerged concerning GPs' overall confidence in dermatology, indicating that resilient GPs were more confident than non-resilient doctors, and this may have played a role in their ability to correctly diagnose (X<sup>2</sup> = 3.8, *p* = 0.049) and define a management plan (X<sup>2</sup> = 5, *p* = 0.024) even when the AI provided erroneous insights. The other individual factors (e.g., age, sex, training, predisposition to trust, etc.) only showed some moderate tendencies.

#### **4. Discussion**

The results demonstrate high levels of trust among GPs towards results attributed to a fictitious AI system, a finding which has both positive and negative implications for the healthcare system. Whilst an accurate clinical decision support tool may support GPs in correctly identifying benign lesions, thus reducing the number of false positives referred to 2WW clinics, there is also a possibility that an erroneous result from the AI system could lead to a patient's case being under-triaged.

Adherence to an AI system that can provide correct insights about cases, even when there are uncertainties, can significantly improve the decision making (diagnosis and plan) of GPs. The correctness and confidence of GPs in their decision making were significantly improved by using the AI when a case presented no uncertainties. Given the pressure on the 2WW pathway, this result may be convenient for ruling out negative cases at the triage stage, with benefits on patient flow and for the individual patients who will avoid unnecessary anxiety associated with a suspected cancer referral. However, when dealing with some uncertainties (CU cases) or when the AI was wrong (DS cases), the confidence of the GPs in the final decision was not affected by the AI insights. This might suggest that when GPs had doubts on how to treat a case (CU cases) or when they were not convinced by the insights of the AI (DS cases), they were not completely reassured by the use of the AI; however, a large majority of the GPs continued to adhere to the indications of the AI. These findings are in alignment with previous studies [28] suggesting that over-reliance on automated systems may be triggered by confirmatory bias when participants direct their attention towards features consistent with the (inaccurate) advice. We also considered the variability of personal expertise and attitudes towards automated systems as having an influence by reducing passive adherence. The results suggest that the tendency to adhere, even when the AI is inaccurate, may be due to a lack of experience with the specific tasks or domain knowledge that may bring GPs to overestimate the insights of the intelligent systems. 
The small number of resilient GPs who were able to critically interpret the results of the AI declared significantly higher usage of essential dermatological tools (i.e., dermatoscope) and confidence in the specific domain of dermatology compared to the GPs who adhered to the suggestions of the mistaken AI.

The present pilot study is intended as an initial step in the understanding of the future relationship between AI and clinicians in the domain of dermatology.

#### *Limitations and Future Work*

Three main limitations of the present work should be considered for future studies.

First, the small sample surveyed may not be representative of the variety of expertise, exposure to dermatology cases and experience with similar technologies that GPs may have. A power analysis using SAS revealed a 95.9% power to detect the difference in correctness with and without AI support. Our sample size could have detected a minimum difference of 6.5% with 80% power.

Secondly, the participants of the present study were aware that the test was a simulation and that no real AI technology was involved; therefore, we cannot rule out that they may have changed their behaviour because of the attention they received [29] and because of the absence of implications for patients. This effect may have implications for the generalisability of our findings.

Finally, how information from an AI system is presented may impact the end-user. In future studies, we advocate a larger group of GPs, with different expertise, varying familiarity with AI systems, and different cultural backgrounds to expand the current results. Concurrently, a larger number of cases should be tested, with equal numbers of different types of lesions in each group. This may bring further insights into the mechanism that leads to adherence to information provided by AI. Mixed-methods studies [30] could help in mitigating the effects of bias and changes in the behaviour of research participants under the influence of observation and measurement. The risk of passive adherence to AI in the real world could also emerge due to the complexity of the healthcare system [21], and future longitudinal studies on real cases should be implemented to monitor such a possibility. As well as the user interface, the role of training and documentation, such as the 'Instructions for Use' (IFU), should be considered in future research, both academically and from the perspective of regulatory applications.

#### **5. Conclusions**

Well-designed, accurate and intelligent systems may be able to support GPs in managing patients in primary care with suspicious skin lesions confidently and appropriately, helping them to not only refer suspicious lesions but also manage other lesions in primary care, thus relieving pressure on busy dermatology departments and saving patients from the anxiety of an unnecessary 2WW referral.

Whilst standards of clinical evidence for AI systems should continue to improve, with more emphasis on prospective clinical trials, it is fair to assume that, much like the existing clinical workforce, no AI system will be 100% sensitive in a real-world deployment. Human expertise can be amplified by AI systems, but human decision-makers need to have the domain knowledge and confidence to disagree with such systems when it is necessary.

This counter-intuitively suggests that AI tools are better suited in the hands of clinicians with certain domain knowledge (senior or specialist clinicians) rather than less expert professionals, and this should perhaps be reflected in early deployments. For the specific case of skin cancer, the results suggested that the more clinicians practised dermatological skills, the more they were able to maximize the benefit of the AI systems.

How the new relationship between healthcare professionals and AI systems will be regulated in the future requires further exploration [31]. The risk of under- or overestimating the usefulness of AI tools during clinical decision making might lead to severe consequences for patients.

Designing safe, explainable, reliable and trustworthy AI systems based on fair, inclusive and unbiased data is a key element supporting the diffusion of such tools in the medical field. However, medical professionals will need to adapt, learn and put in place behaviour and strategies to accommodate the unavoidable uncertainties around the interaction with intelligent systems. In this sense, the diffusion and adoption of AI in clinical practice will inevitably impact the training and education of clinicians, who should learn how to interact with these systems, establish a practice to minimise and prevent system failure and learn how to operate when the system fails, misbehaves or malfunctions.

**Author Contributions:** Conceptualization, M.M., S.B., V.T. and S.W.; methodology, M.M., S.B. and P.B.; formal analysis, M.M. and S.B.; investigation, M.M.; resources, Y.M., F.E., D.M. and V.T.; writing original draft preparation, M.M.; writing—review and editing, M.M., S.B., V.T. and S.W.; supervision, G.B.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Biomedical Catalyst 2018 round 2: late stage; project no.: 25763.

**Institutional Review Board Statement:** Local approval for Service Evaluation was sought and obtained from Imperial College Healthcare NHS Trust (ICHNT)—registration no. 373.

**Informed Consent Statement:** Participants who consented to complete the survey were asked to read the Participants Information Sheet and to sign the Consent Form, by which they agreed to take part in the study and to have their personal opinions reflected, anonymously, in reports and academic publications.

**Data Availability Statement:** The data are not publicly available to protect the privacy and confidentiality of study participants.

**Acknowledgments:** The authors would like to thank the study participants for their input in this study and Anna McLister for her assistance in editing the paper.

**Conflicts of Interest:** Y.M., D.M. and F.E. declare non-financial competing interests. The other authors declare no conflicts of interest.

#### **Appendix A**

Table A1 shows the ten lesions used in the simulation study and their classification.




#### **Appendix B**

− Computation used to compose the sensitivity indexes:

$$d' = z(\text{Hit}) - z(\text{False Alarm})$$
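The sensitivity index d′ above is the standard signal-detection measure, where z is the inverse of the standard normal CDF applied to the hit and false-alarm rates. A minimal sketch using Python's standard library; the rates are illustrative:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(Hit) - z(False Alarm),
    where z is the inverse standard normal CDF (probit)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical rates: 80% hits, 20% false alarms
print(round(d_prime(0.80, 0.20), 3))  # → 1.683
```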


#### **References**


## *Review* **Artificial Intelligence (AI)-Empowered Echocardiography Interpretation: A State-of-the-Art Review**

**Zeynettin Akkus \* , Yousof H. Aly, Itzhak Z. Attia, Francisco Lopez-Jimenez, Adelaide M. Arruda-Olson, Patricia A. Pellikka, Sorin V. Pislaru, Garvan C. Kane, Paul A. Friedman and Jae K. Oh**

> Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN 55905, USA; Aly.Yousof@mayo.edu (Y.H.A.); attia.itzhak@mayo.edu (I.Z.A.); lopez@mayo.edu (F.L.-J.); ArrudaOlson.Adelaide@mayo.edu (A.M.A.-O.); Pellikka.Patricia@mayo.edu (P.A.P.); Pislaru.Sorin@mayo.edu (S.V.P.); Kane.Garvan@mayo.edu (G.C.K.); Friedman.Paul@mayo.edu (P.A.F.); Oh.Jae@mayo.edu (J.K.O.)

**\*** Correspondence: akkus.zeynettin@mayo.edu

**Abstract:** Echocardiography (Echo), a widely available, noninvasive, and portable bedside imaging tool, is the most frequently used imaging modality in assessing cardiac anatomy and function in clinical practice. On the other hand, its operator dependability introduces variability in image acquisition, measurements, and interpretation. To reduce these variabilities, there is an increasing demand for an operator- and interpreter-independent Echo system empowered with artificial intelligence (AI), which has been incorporated into diverse areas of clinical medicine. Recent advances in AI applications in computer vision have enabled us to identify conceptual and complex imaging features with the self-learning ability of AI models and efficient parallel computing power. This has resulted in vast opportunities such as providing AI models that are robust to variations with generalizability for instantaneous image quality control, aiding in the acquisition of optimal images and diagnosis of complex diseases, and improving the clinical workflow of cardiac ultrasound. In this review, we provide a state-of-the art overview of AI-empowered Echo applications in cardiology and future trends for AI-powered Echo technology that standardize measurements, aid physicians in diagnosing cardiac diseases, optimize Echo workflow in clinics, and ultimately, reduce healthcare costs.

**Keywords:** cardiac ultrasound; echocardiography; artificial intelligence; portable ultrasound

#### **1. Introduction**

Echocardiography (Echo), also known as cardiac ultrasound (CUS), is currently the most widely used noninvasive imaging modality for assessing patients with various cardiovascular disorders. It plays a vital role in evaluation of patients with symptoms of heart disease by identifying structural as well as functional abnormalities and assessing intracardiac hemodynamics. However, accurate echo measurements can be hampered by variability between interpreters, patients, and operators, and by image quality. Therefore, there is a clinical need for standardized methods of echo measurements and interpretation to reduce these variabilities. Artificial-intelligence-empowered echo (AI-Echo) can potentially reduce inter-interpreter variability and indeterminate assessment and improve the detection of unique conditions as well as the management of various cardiac disorders.

In this state-of-the-art review, we will provide a brief background on transthoracic echocardiography (TTE) and artificial intelligence (AI), followed by a summary of the advances in echo interpretation using deep learning (DL) with its self-learning ability. Since DL approaches have shown superior performance compared to machine-learning (ML) approaches based on hand-crafted features, we focus on DL progress in this review and refer the readers to other reviews [1,2] for ML approaches used to interpret echo. These AI advances could potentially allow objective evaluation of echocardiography, improve clinical workflow, and reduce healthcare costs. Subsequently, we will present currently available AI-Echo applications, delve into challenges of current AI applications using DL, and share our view on future trends in AI-Echo.

**Citation:** Akkus, Z.; Aly, Y.H.; Attia, I.Z.; Lopez-Jimenez, F.; Arruda-Olson, A.M.; Pellikka, P.A.; Pislaru, S.V.; Kane, G.C.; Friedman, P.A.; Oh, J.K. Artificial Intelligence (AI)-Empowered Echocardiography Interpretation: A State-of-the-Art Review. *J. Clin. Med.* **2021**, *10*, 1391. https://doi.org/10.3390/jcm10071391

Academic Editor: Vida Abedi

Received: 15 March 2021; Accepted: 24 March 2021; Published: 30 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### *1.1. Transthoracic Echocardiogram*

A transthoracic echocardiogram transmits and receives sound waves with frequencies higher than human hearing using an ultrasound transducer. The transducer generates ultrasound waves, transmits them into the tissue, and listens for the reflected sound wave (echo). The reflected echo signal is recorded to construct an image of the interrogated region. The sound waves travel through soft tissue at a speed of approximately 1540 m/s. The time of flight between the transmitted and received sound waves is used to locate reflectors and construct an image of the probed area. The recorded echo data can be either a single still image or a movie/cine clip over multiple cardiac cycles. CUS has several advantages compared to cardiac magnetic resonance, cardiac computed tomography, and cardiac positron emission tomography: it does not use ionizing radiation, is less expensive, is portable for point-of-care (POCUS) applications, and provides true real-time imaging. It can be carried to a patient's bedside for examining patients and monitoring changes over time. Disadvantages of TTE include its dependence on operator and interpreter skill, with variability in data acquisition and interpretation. Beyond operator variability, there is patient-specific variability (e.g., signal-to-noise ratio and limited acoustic window due to anatomical or body mass differences) and machine-specific variability (e.g., electronic noise and post-processing filters applied to acquired images). Image quality is an important factor for accurate measurements: suboptimal image quality can affect all measurements and can result in misdiagnosis.
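The time-of-flight relationship above can be sketched in a few lines. The 1540 m/s constant comes from the text; the example round-trip time is illustrative:

```python
SPEED_OF_SOUND = 1540.0  # m/s, assumed average speed in soft tissue

def echo_depth_m(round_trip_time_s):
    """Depth of a reflector from the pulse-echo time of flight.
    The pulse travels to the reflector and back, hence the factor of 2."""
    return SPEED_OF_SOUND * round_trip_time_s / 2

# A 130 microsecond round trip corresponds to roughly 10 cm of depth
print(echo_depth_m(130e-6))  # ≈ 0.1001 m
```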

Diverse image types are formed by using cardiac ultrasound (Figure 1). The most common types used in clinics are:

B-mode: Also called brightness mode, B-mode is the most well-known US image type. An ultrasound beam is scanned across the tissue to construct a 2D cross-sectional image of the tissue.

M-mode: Motion mode (M-mode) is used to examine motion over time. It records a single scan line of the heart, and all of the reflectors along this line are displayed along the time axis, capturing the motion of cardiac structures with high temporal resolution.

Doppler ultrasound: A change in the frequency of a wave occurs when the source and observer are moving relative to each other; this is called the Doppler effect. An US wave is transmitted with a specific frequency through an ultrasound probe (the observer). The US waves that are reflected from moving objects (e.g., red blood cells in vessels) return to the probe with a frequency shift, which is used to estimate the velocity of the moving object. In blood flow imaging, the velocity of red blood cells moving towards and away from the probe is recorded to construct Doppler signals. The velocity information is overlaid on top of a B-mode anatomical image to produce color Doppler images of blood flow.
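The frequency-shift-to-velocity conversion follows the standard Doppler equation v = c·Δf / (2·f0·cos θ), where f0 is the transmit frequency and θ the beam-to-flow angle. A minimal sketch; the carrier frequency, shift, and angle below are illustrative values, not from the text:

```python
from math import cos, radians

def doppler_velocity(f_shift_hz, f0_hz, angle_deg=0.0, c=1540.0):
    """Velocity estimate from the standard Doppler equation
    v = c * Δf / (2 * f0 * cos(θ)), with θ the beam-to-flow angle."""
    return c * f_shift_hz / (2 * f0_hz * cos(radians(angle_deg)))

# Hypothetical example: a 2 kHz shift at a 3 MHz carrier, beam aligned with flow
v = doppler_velocity(2000, 3e6)
print(f"{v:.3f} m/s")  # ≈ 0.513 m/s
```

Note how the estimate degrades as θ approaches 90°: cos θ in the denominator goes to zero, which is why operators align the beam with the flow as far as possible.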

Contrast enhanced ultrasound (CEUS): CEUS is a functional imaging technique that suppresses anatomical details but visualizes blood pool information. It exploits the non-linear response of ultrasound contrast agents (lipid-coated gas bubbles). Generally, two consecutive US signals are propagated through the same medium, and their echo responses are subtracted to obtain the contrast signal. Since the tissue generates a linear echo response, the subtraction cancels out the tissue signal, and only the difference signal from the non-linear responses of the bubbles remains. This imaging technique is used to enhance cardiac chamber cavities when B-mode US provides poor-quality images. It is also useful for detecting perfusion abnormalities in tissues and enhancing the visibility of tissue boundaries.
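The paired-pulse cancellation idea can be illustrated with a toy model. The sketch below uses the pulse-inversion variant, a common realization of the same principle in which the second transmit is inverted and the two echoes are summed; the quadratic echo model and the coefficients are purely illustrative:

```python
def echo(pulse, a1=1.0, a2=0.0):
    """Toy echo model: linear term a1*x plus a quadratic (non-linear) term a2*x^2."""
    return [a1 * x + a2 * x * x for x in pulse]

pulse = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
inverted = [-x for x in pulse]

# Tissue responds linearly: summing the echoes of the pulse and its
# inverted copy cancels the tissue signal completely.
tissue = [p + q for p, q in zip(echo(pulse), echo(inverted))]

# Microbubbles respond non-linearly (a2 != 0): the even-order component survives.
bubble = [p + q for p, q in zip(echo(pulse, a2=0.3), echo(inverted, a2=0.3))]

print(max(abs(x) for x in tissue))   # 0.0 -> tissue signal cancelled
print(max(abs(x) for x in bubble))   # ~0.6 -> residual contrast signal
```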

Strain imaging: This technique detects myocardial deformation patterns such as longitudinal, radial, and circumferential deformations, and early functional abnormalities before they become noticeable as regional wall motion abnormalities or reduced ejection fraction on B-mode cardiac images.

**Figure 1.** Sample US images showing different US modes. (**A**) B-mode image of the apical 4 chamber view of a heart. (**B**) Doppler image of mitral inflow. (**C**) Contrast enhanced ultrasound image of left ventricle. (**D**) Strain imaging of the left ventricle.

#### *1.2. Artificial Intelligence*

Artificial intelligence (AI) is considered to be a computer-based system that can observe an environment and take actions to maximize the success of achieving its goals. Examples of such systems, which may sense, reason, engage, and learn, include computer vision for understanding digital images, natural language processing for interaction between human and computer languages, voice recognition for detection and translation of spoken languages, robotics and motion, planning and organization, and knowledge capture. ML is a subset of AI that covers the ability of a system to learn about data using supervised or unsupervised statistical and ML methods such as regression, support vector machines, decision trees, and neural networks. Deep learning (DL), which is a subclass of ML, learns a sequential chain of pivotal features from input data that maximizes the success of the learning process through its self-learning ability. This differs from statistical ML algorithms, which require handcrafted feature selection [3] (Figure 2).

**Figure 2.** The context of artificial intelligence, machine learning, and deep learning. SVM: Support Vector Machine. CNN: convolutional neural networks, R-CNN: recurrent CNN, ANN: artificial neural networks.

Artificial neural networks (ANN) are the earliest DL network design, in which all nodes are fully connected to each other. An ANN mimics biological neurons to create representations from an input signal, with many consecutive layers that learn a hierarchy of features. ANNs and advances in graphics processing unit (GPU) processing power have enabled the development of deep and complex DL models that can multitask simultaneously. DL models can be trained with thousands or millions of samples to gain robustness to variations in data. The representation power of DL models is massive and can create a representation for any given variation of a signal. Recent accomplishments of DL, especially in image classification and segmentation applications, have made it very popular in the data science community. Traditional ML methods use handcrafted features extracted from data and process them in decomposable pipelines. This makes them more comprehensible, as each component is explainable; on the other hand, they tend to be less generalizable and less robust to variations in data. With DL models, we give up interpretability in exchange for robustness and greater generalization ability, while generating complex and abstract features.

State-of-the-art DL models have been developed for a variety of tasks such as object detection and segmentation in computer vision, voice recognition, and genotype/phenotype prediction. Model types include convolutional neural networks (CNNs), deep Boltzmann machines, stacked auto-encoders [4], and deep belief neural networks [5]. The most commonly used DL method for processing images is the CNN: fully connected ANNs are computationally heavy for 2D/3D images and require extensive GPU memory, so CNNs share weights across each feature map (convolutional layer) to mitigate this. CNN approaches gained enormous attention after achieving impressive results in the ImageNet competition in 2012 [6–8], which involves natural photographic images. They were utilized to classify a dataset of around a million images comprising a thousand diverse classes, achieving half the error rate of the most popular traditional ML approaches [8]. CNNs have been widely utilized for medical image classification and segmentation tasks with great success [3,9–12]. Since DL algorithms generally outperform ML algorithms and exploit GPU processing power, they allow real-time processing of US images. We will only focus on DL applications of AI-powered US cardiology in this review.
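The weight sharing that makes CNNs tractable for images can be seen in a minimal pure-Python convolution: one 3×3 kernel (9 parameters) is reused at every spatial position, whereas a fully connected layer producing the same output would need millions of weights. The image size and kernel below are illustrative:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in CNN frameworks):
    the same small kernel is reused at every spatial position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

# Hypothetical 64x64 single-channel image with a repeating intensity pattern
image = [[(i + j) % 7 for j in range(64)] for i in range(64)]
edge_kernel = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]  # simple horizontal-edge filter

feature_map = conv2d(image, edge_kernel)

shared_params = 3 * 3                 # one kernel, reused everywhere
dense_params = (64 * 64) * (62 * 62)  # fully connected layer with the same output size
print(len(feature_map), len(feature_map[0]))  # 62 62
print(shared_params, dense_params)            # 9 vs ~15.7 million
```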

To assess the performance of ML models, data are generally split into training, validation, and test sets. The training set is used for learning about the data, the validation set is employed to establish the reliability of the learning results, and the test set is used to assess the generalizability of a trained model on data never seen by the model. When training samples are limited, k-fold cross-validation approaches (e.g., leave-one-out, five-fold, or ten-fold cross-validation) are utilized. In cross-validation, the data are divided randomly into k equal-sized pieces. One piece is reserved for assessing the performance of a model, and the remaining k-1 pieces are utilized for training. The training process is typically performed in a supervised way, which involves ground truth labels for each input sample and iteratively minimizes a loss function over the training samples, as shown in Figure 3. Supervised learning is the most common training approach for ML, but it requires laborious ground truth label generation. In medical imaging, ground truth labels are generally obtained from clinical notes for diagnosis or quantification tasks. Furthermore, manual outlining of structures by experts is used to train ML models for segmentation tasks.
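The k-fold scheme described above can be sketched in a few lines of pure Python; the sample count and fold count are illustrative:

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k roughly equal folds; each fold serves once
    as the held-out set while the remaining k-1 folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Five-fold split of 100 hypothetical echo studies
for train, test in k_fold_indices(100, 5):
    print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the data are ordered (e.g., by acquisition date), so each fold sees a representative mix.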

**Figure 3.** A framework of training a deep-learning model for classification of myocardial diseases. Operations between layers are shown with arrows. SGD: Stochastic Gradient Descent.

#### **2. Methods and Results: Automated Echo Interpretation**

We performed a thorough analysis of the literature using Google Scholar and PubMed search engines. We included peer-reviewed journal publications and conference proceedings in this field (IEEE Transactions on Medical Imaging, IEEE Journal of Biomedical and Health Informatics, Circulation, Nature, and conference proceedings from SPIE, the Medical Image Computing and Computer Assisted Intervention Society, the Institute of Electrical and Electronics Engineers, and others) that describe the application of DL to cardiac ultrasound images before 15 January 2021. We included a total of 14 journal papers and three conference proceedings that are relevant to the scope of this review (see Figure 4 for the detailed flowchart for the identification, screening, eligibility, and inclusion). We divided reports into three groups on the basis of the task performed: view identification and quality control, image segmentation and quantification, and disease diagnosis.

**Figure 4.** The flowchart of systematic review that includes identification, screening, eligibility, and inclusion.

Current Echo-AI applications require several successive processing steps such as view labelling and quality control, segmentation of cardiac structures, echo measurements, and disease diagnosis (Figure 5). AI-Echo can be used for low-cost, serial, and automated evaluation of cardiac structures and function by experts and non-experts in cardiology, primary care, and emergency clinics. This would also allow triaging incoming patients with chest pain in an emergency department by providing preliminary diagnosis and longitudinally monitoring patients with cardiovascular risk factors in a personalized manner.

**Figure 5.** The flowchart of automated artificial-intelligence-empowered echo (AI-Echo) interpretation pipeline using a chain approach. QC: Quality Control.

With the advancing ultrasound technology, the current clinical cart-based ultrasound systems could be replaced with portable point-of-care ultrasound (POCUS) systems or could be used together. GE Vscan, Butterfly IQ, and Philips Lumify are popular POCUS devices. A single Butterfly IQ probe contains 9000 micro-machined semiconductor sensors and emulates linear, phased, and curved array probes. While the Butterfly IQ probe using ultrasound-on-chip technology could be used for imaging the whole body, Philips Lumify provides different probes for each organ (e.g., s4-1 phased array probe for cardiac applications). GE Vscan comes with two transducers placed in one probe and can be used for scanning deep and superficial structures. Using POCUS devices powered with cloud-based AI-Echo interpretation at point of care locations could significantly reduce the US cost and increase the utility of AI-Echo by non-experts in primary and emergency departments (see Figure 6). A number of promising studies using DL approaches have been published for classification of standard echo views (e.g., apical and parasternal views), segmentation of heart structures (e.g., ventricle, atrium, septum, myocardium, and pericardium), and prediction of cardiac diseases (e.g., heart failure, hypertrophic cardiomyopathy, cardiac amyloidosis, and pulmonary hypertension) in recent years [13–16]. In addition, several companies such as TOMTEC IMAGING SYSTEMS GMBH, Munich, Germany and Ultromics, Oxford, United Kingdom have already obtained premarket FDA clearance on auto ejection fraction (EF) and echo strain packages using artificial intelligence. The list of companies and their provided AI tools is shown in Table 1.

**Figure 6.** A schematic diagram of AI (artificial intelligence) interpretation of echocardiography images for preliminary diagnosis and triaging patients in emergency and primary care clinics. POCUS: point of care ultrasound.


**Table 1.** The list of commercial software packages that provide automated measurements or diagnosis.

EF: ejection fraction. CHD: coronary heart disease.

#### *2.1. View Identification and Quality Control*

A typical TTE study includes the acquisition of multiple cine clips of the heart's chambers from five standardized windows: the left parasternal window (i.e., parasternal long- and short-axis views), the apical window (i.e., two-, three-, four-, and five-chamber views), the subcostal window (i.e., four-chamber view and long-axis inferior vena cava view), the suprasternal notch window (i.e., aortic arch view), and the right parasternal window (i.e., ascending aorta view). In addition, the study includes several other cine clips of color Doppler, strain imaging, and 3D ultrasound, as well as still images of the valves, walls, and blood vessels (e.g., aorta and pulmonary veins). View identification and quality control are essential prerequisite steps for fully automated echo interpretation.

Zhang et al. [16,17] presented a fully automated echo interpretation pipeline that includes 23 view classifications. They trained a 13-layer CNN model with 7168 labelled cine clips and used five-fold cross validation to assess its performance. For evaluation, they selected 10 random frames per clip and averaged the resulting probabilities. The overall accuracy of their model was 84% at the individual image level. They also reported that distinguishing the various apical views was the greatest challenge when left ventricles were partially obscured. They made their source code and model weights publicly available at [18]. Madani et al. [19] presented the classification of 15 standard echo views using DL. They trained a VGG CNN network with 180,294 images from 213 studies and tested their model on 21,747 images from 27 studies. They obtained 91.7% overall accuracy on the test dataset at the single-image level and 97.8% overall accuracy when considering the model's top two guesses. Akkus et al. [20] trained a CNN inception model with residual connections on 5544 images from 140 patients to predict 24 Doppler image classes and automate Doppler mitral inflow analysis. They obtained an overall accuracy of 97% on a test set of 1737 images from 40 patients.
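The frame-averaging scheme described above can be illustrated with a minimal sketch. The `classify_clip` helper and the toy probability values are illustrative only, not taken from the original studies; in practice the per-frame vectors would be softmax outputs of a view-classification CNN.

```python
from statistics import mean

def classify_clip(frame_probs):
    """Average per-frame class probabilities into a clip-level prediction.

    frame_probs: list of per-frame probability vectors (one list per frame).
    Returns (predicted_class_index, averaged_probabilities).
    """
    n_classes = len(frame_probs[0])
    avg = [mean(f[c] for f in frame_probs) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Three sampled frames, three candidate views (e.g., A4C, A2C, PLAX)
frames = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
pred, avg = classify_clip(frames)  # pred == 0: the first view wins
```

Averaging over several frames smooths out single poor-quality frames, which is why the studies above sample multiple frames per cine clip rather than classifying one frame.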

Abdi et al. [21,22] trained a fully connected CNN with 6196 apical four-chamber (A4C) images that were scored between 0 and 5 to assess the quality of A4C echo images. They used three-fold cross validation and reported an error comparable to intra-rater reliability (mean absolute error: 0.71 ± 0.58). Abdi et al. [23] later extended their previous work and trained a CNN regression architecture comprising five regression models that share weights in the first few layers to assess the quality of cine loops across five standard view planes (i.e., apical two-, three-, and four-chamber views and parasternal short-axis views at the papillary muscle and aortic valve levels). Their dataset included 2435 cine clips, and they achieved an average accuracy of 85% compared to gold standard scores assigned by experienced echo sonographers on 20% of the dataset. Zhang et al. [16,17] calculated the averaged probability score of view classification across all videos in their study to define an image quality score for each view. They assumed that poor-quality cine clips tend to have a more ambiguous view assignment, so the view classification probability could be used for quality assessment. Dong et al. [24] presented a generic quality control framework for fetal ultrasound cardiac four-chamber planes (CFPs). Their proposed framework consists of three networks that roughly classify four-chamber views from the raw data, determine the gain and zoom of images, and detect the key anatomical structures on a plane. The overall quantitative score of each CFP was computed from the output of the three networks. They used five-fold cross validation to assess their model across 2032 CFPs and 5000 non-CFPs and obtained a mean average precision of 93.52%. Labs et al. [25] trained a hybrid model including CNN and LSTM layers to assess the quality of apical four-chamber view images for three proposed attributes (i.e., foreshortening, gain/contrast, and axial target). They split a dataset of 1039 unique apical four-chamber views in a 60:20:20% ratio for training, validation, and testing, respectively, and achieved an average accuracy of 86% on the test set.

View identification and quality assessment of cine clips are the most important pieces of a fully automated echo interpretation pipeline. As shown in Table 2, the current studies report error rates of 3–16% for both view identification and quality control. The proposed models were generally trained with datasets from a single vendor or a few vendors, or from a single center. Apart from Zhang et al. [16,17], none of the studies shared their source code and model weights for comparison. In some studies, customized CNN models were used, but not enough evidence or comparisons were provided to show that these choices perform better than state-of-the-art CNN models such as ResNet, Inception, and DenseNet.

**Table 2.** Deep-learning-based AI studies for view identification and quality assessment. MAE: mean absolute error.


#### *2.2. Image Segmentation and Quantification*

Partitioning of an identified view into regions of interest, such as the left/right ventricle or atrium, the ventricular septum, and the mitral/tricuspid valves, is necessary to quantify biomarkers such as ejection fraction, volume changes, and the velocity of the septal or distal annulus. Several studies have used DL methods to segment the left ventricle from apical four- and two-chamber views.

Zhang et al. [16,17] presented a fully automated echo interpretation pipeline that includes segmentation of the cardiac chambers in five common views and quantification of structure and function. They used five-fold cross validation on 791 images with manual segmentation of the left ventricle and reported intersection-over-union metrics ranging from 0.72 to 0.90 for their U-Net-based segmentation model. In addition, they derived automated measurements such as LV ejection fraction (LVEF), LV volumes, LV mass, and global longitudinal strain from the resulting segmentations. Compared to manual measurements, a median absolute deviation of 9.7% (*n* = 6407 studies) was achieved for LVEF; a median absolute deviation of 15–17% was obtained for LV volume and mass measurements; and median absolute deviations of 7.5% (*n* = 419) and 9.0% (*n* = 110) were obtained for strain. They concluded that their cardiac structure measurements were comparable with the values in study reports. Leclerc et al. [13] studied state-of-the-art encoder–decoder DL methods (e.g., U-Net [28]) for segmenting cardiac structures and made a large dataset (500 patients) publicly available with segmentation labels of end-diastole and end-systole frames. The full dataset is available for download at [29]. They showed that their U-Net-based model outperformed state-of-the-art non-deep-learning methods for measurements of end-diastolic and end-systolic left ventricular volumes and LVEF, achieving a mean correlation of 0.95 and a mean absolute error of 9.5 mL for LV volumes, and a mean correlation of 0.80 and a mean absolute error of 5.6% for LVEF. Jafari et al. [30] presented a recurrent CNN with optical flow for segmentation of the left ventricle in echo images. Jafari et al. [14] also presented biplane ejection fraction estimation with POCUS using multi-task learning and adversarial training; their model achieved an average Dice score of 0.92 for LV segmentation and an absolute error of around 6.2% for automated ejection fraction. Chen et al. [31] proposed an encoder–decoder CNN with multi-view regularization to improve LV segmentation; the method was evaluated on 566 patients and achieved an average Dice score of 0.88. Oktay et al. [32] incorporated anatomical prior knowledge into their CNN model, allowing it to follow the global anatomical properties of the underlying anatomy. Ghorbani et al. [33] used a custom CNN model, named EchoNet, to predict left ventricular end-systolic and end-diastolic volumes (R2 = 0.74 and R2 = 0.70, respectively) and ejection fraction (R2 = 0.50). Ouyang et al. [15] trained a semantic segmentation model using atrous convolutions on echocardiogram videos; the model obtained a Dice similarity coefficient of 0.92 for left ventricle segmentation in the apical four-chamber view. They also used a spatiotemporal 3D CNN model with residual connections to predict ejection fraction, with mean absolute errors of 4.1% and 6% on internal and external datasets, respectively. Ouyang et al. [15] de-identified 10,030 echocardiogram videos, resized them to 112 × 112 pixels, and made their dataset publicly available at [34].
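The Dice score reported by several of these studies measures the overlap between a predicted mask and a reference (manual) mask. A minimal sketch of the metric on flattened binary masks; the helper name and toy masks are illustrative:

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity between two binary masks given as flat 0/1 lists:
    2 * |intersection| / (|pred| + |true|)."""
    intersection = sum(p & t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 1.0 if total == 0 else 2.0 * intersection / total

predicted = [1, 1, 1, 0, 0, 0]   # toy LV mask from a model
reference = [0, 1, 1, 1, 0, 0]   # toy manual annotation
score = dice_coefficient(predicted, reference)  # 2*2 / (3+3) = 0.667
```

A Dice score of 0.88–0.92, as reported above, thus corresponds to a high but not pixel-perfect overlap with the expert annotation.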

U-Net is the most common DL model used for echo image segmentation. As shown in Table 3, the error for LVEF ranges between 4% and 10%, while for LV and LA volume measurements it ranges between 10% and 20%.
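The LVEF values these models estimate follow directly from the segmented volumes. As a reminder of the underlying arithmetic, a minimal sketch with illustrative volumes (not from any of the cited studies):

```python
def ejection_fraction(edv_ml, esv_ml):
    """Left ventricular ejection fraction (%) from end-diastolic (EDV)
    and end-systolic (ESV) volumes in millilitres:
    LVEF = (EDV - ESV) / EDV * 100."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

# Illustrative volumes: EDV = 120 mL, ESV = 50 mL
ef = ejection_fraction(120.0, 50.0)  # ~58.3%
```

Because LVEF is a ratio of two segmented volumes, errors in the end-diastolic and end-systolic masks partially cancel, which helps explain why the LVEF error range above is smaller than that of the raw volume measurements.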

#### *2.3. Disease Diagnosis*

Several studies have shown that DL models can be used to assess cardiac diseases (see Table 4). Zhang et al. [16,17] presented a fully automated echo interpretation pipeline for disease detection. They trained a VGG [26] network using three random images per video as input and provided two prediction outputs (i.e., diseased or normal). The areas under the ROC curve of their model for prediction of hypertrophic cardiomyopathy, cardiac amyloidosis, and pulmonary hypertension were 0.93, 0.87, and 0.85, respectively. Ghorbani et al. [33] trained a customized CNN model with inception connections, named EchoNet, on a dataset of more than 1.6 million echocardiogram images from 2850 patients to identify local cardiac structures, estimate cardiac function, and predict systemic risk factors. The proposed CNN model identified the presence of pacemaker leads with AUC = 0.89, an enlarged left atrium with AUC = 0.86, and left ventricular hypertrophy with AUC = 0.75. Ouyang et al. [15] trained a custom model that combines spatiotemporal 3D convolutions with a residual connection network and semantic segmentation of the left ventricle to predict the presence of heart failure with reduced ejection fraction. The outputs of the spatiotemporal network and the semantic segmentation were combined to classify heart failure with reduced ejection fraction, and the model achieved an area under the curve of 0.97. Omar et al. [35] trained a modified VGG-16 CNN model on a 3D dobutamine stress echo dataset to detect wall motion abnormalities and compared its performance to hand-crafted approaches: support vector machines (SVM) and random forests (RF). They achieved slightly better accuracy with the CNN model: RF (72.1%), SVM (70.5%), and CNN (75.0%). In another study, Kusunose et al. [36] investigated whether a CNN model could provide improved detection of wall motion abnormalities. They reported that the AUC produced by the deep-learning algorithm was comparable to that produced by the cardiologist and sonographer readers (0.99 vs. 0.98, respectively) and significantly higher than that of the resident readers (0.99 vs. 0.90, respectively). Narula et al. [37] trained SVM, RF, and artificial neural network (ANN) models with hand-crafted echo measurements (i.e., LV wall thickness, end-diastolic volume, end-systolic volume, ejection fraction, pulsed-wave Doppler-derived transmitral early diastolic velocity (E), late diastolic atrial contraction wave velocity (A), and the ratio E/A) to differentiate hypertrophic cardiomyopathy (HCM) from the physiological hypertrophy seen in athletes (ATH). They reported an overall sensitivity and specificity of 87% and 82%, respectively.
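Sensitivity and specificity, as reported by Narula et al., are computed from the entries of a confusion matrix. A minimal sketch with illustrative counts, chosen only so that the toy result matches the 87%/82% figures, not taken from the study's actual data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): the fraction of diseased cases found.
    Specificity = TN / (TN + FP): the fraction of normals correctly cleared."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical confusion-matrix counts for an HCM-vs-ATH classifier
sens, spec = sensitivity_specificity(tp=87, fn=13, tn=82, fp=18)
# sens = 0.87, spec = 0.82
```

Reporting both quantities matters clinically: a screening model for HCM should favour sensitivity (few missed cases), while specificity controls how many athletes would be falsely flagged.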

Unlike hand-crafted feature-based ML approaches, DL approaches may extract features from data that are beyond human perception. DL-based AI approaches therefore have the potential to support accurate diagnosis and to discover crucial features in echo images. In the near future, these tools may aid physicians in diagnosis and decision making and reduce the misdiagnosis rate.


**Table 3.** Deep-learning-based AI studies for image segmentation and quantification. MAD: mean absolute difference. LVEF: left ventricle ejection fraction.


**Table 4.** Deep-learning-based AI studies for disease diagnosis. AUC: area under the curve.

#### **3. Discussion and Outlook**

Automated image interpretation that mimics human vision with traditional machine learning has existed for a long time. Recent advances in parallel processing with GPUs and in deep-learning algorithms, which extract patterns in images through their self-learning ability, have changed the entire automated image interpretation practice with respect to computation speed, generalizability, and transferability. AI-empowered echocardiography has been advancing and moving closer to being used in routine clinical cardiology workflows, driven by the increased demand for standardizing the acquisition and interpretation of cardiac US images. Even though DL-based methods for echocardiography provide promising results in the diagnosis and quantification of diseases, AI-Echo still needs to be validated on larger study populations, including multi-center and multi-vendor datasets. High intra-/inter-observer variability in echocardiography makes standardization of image acquisition and interpretation challenging; AI-Echo can provide solutions that mitigate this operator-dependent variability. AI applications in cardiac US are more challenging than those in cardiac CT and MR imaging due to patient-dependent factors (e.g., obesity, limited acoustic windows, artifacts, and signal dropouts) and the natural US speckle noise pattern. These factors, which affect US image quality, will remain challenges for cardiac ultrasound.

Applications of DL in echocardiography are advancing rapidly, as evidenced by the growing number of recent studies. DL models have enormous representation power and are hungry for large amounts of data in order to achieve generalization ability and stability. Creating databases with large, curated datasets of good-quality data and labels is the most challenging and time-consuming part of the whole AI model development process. Although AI-Echo applications have shown superb performance compared to classical ML methods, most of the models were trained and evaluated on small datasets. It is important to train AI models on large multi-vendor and multi-center datasets to obtain generalization, and to validate them on large multi-vendor datasets to increase the reliability of a proposed model. An alternative way to overcome the limitation of small training datasets is to augment the dataset with realistic transformations (e.g., scaling, horizontal flipping, translations, adding noise, tissue deformation, and adjusting image contrast), which can help improve the generalizability of AI models. However, such transformations need to genuinely simulate variations in cardiac ultrasound images, and the transformed images should not contain artifacts. Alternatively, generative adversarial networks, in which a generator and a discriminator model are trained until the generator produces images that the discriminator cannot separate from real ones, could be used to generate realistic cardiac ultrasound B-mode images of the heart. Introducing such transformations during the training process makes AI models more robust to small perturbations in the input data space.
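Two of the transformations mentioned above, horizontal flipping and additive noise, can be sketched minimally. This is a toy example on a nested-list "image" standing in for an echo B-mode frame; the `augment` helper and its parameters are illustrative, and a real pipeline would also include scaling, translation, and contrast changes:

```python
import random

def augment(image, noise_std=0.05, rng=None):
    """Toy augmentation for a 2-D image (list of rows of floats):
    a random horizontal flip followed by additive Gaussian noise."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    return [[px + rng.gauss(0.0, noise_std) for px in row] for row in image]

frame = [[0.1, 0.5, 0.9],
         [0.2, 0.6, 1.0]]
augmented = augment(frame, noise_std=0.01, rng=random.Random(0))
```

Keeping the noise standard deviation small relative to pixel intensities is one way to respect the caveat above: the transformed frame should remain a plausible ultrasound image rather than acquire artifacts of its own.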

Making predictions and measurements based on only 2D echo images could be considered as a limitation of AI-powered US systems. Two-dimensional cross section images include limited information and do not constitute the complete myocardium. Training AI models on 3D cardiac ultrasound data that include the entire heart or the structure of interest would potentially improve the diagnostic accuracy of an AI model.

It is important to design AI models that are transparent when predicting any disease from medical images. AI models developed for the diagnosis of a disease must elucidate the reasons and motivations behind their predictions in order to build trust. Comprehension of the inner mechanism of an AI model necessitates interpreting the activity of the feature maps in each layer [39–41]. However, the extracted features are combinations across sequential layers and become more complicated and conceptual with deeper layers; their interpretation is therefore difficult compared to hand-crafted imaging features in traditional ML methods. Traditional ML methods are designed with separable components that are more understandable, since each component has an explanation, but they are usually less accurate and less robust. With DL-based AI models, interpretability is traded for robustness and for complex imaging features with greater generalizability. Recently, a number of methods have been introduced to reveal what DL models see and how they make their predictions. Several CNN architectures [26,28,38,42,43] have been examined with techniques such as deconvolutional networks [44], gradient back-propagation [45], class activation maps (CAM) [41], gradient-weighted CAM [46], and saliency maps [47,48] to make CNNs understandable. With these techniques, the gradients of a model are projected back into the input image space, showing which parts of the input image contribute most to the prediction outcome. Although making AI models understandable has been an active research topic in the DL community, much further research is still needed in this area. Despite the high prediction performances reported in the studies discussed in this review, none of the studies provided insight into which heart regions play an important role in disease prediction.
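The CAM idea cited above can be illustrated with a minimal sketch: in the original CAM formulation, the target class's linear classifier weights over the final convolutional feature maps form a spatial heat map highlighting the regions that drive the prediction. The `class_activation_map` helper below is a simplified, framework-free illustration; the tiny maps and weights are made up for the example:

```python
def class_activation_map(feature_maps, class_weights):
    """Simplified CAM: weighted sum of final-conv feature maps using the
    target class's weights from the final linear layer.

    feature_maps: list of K maps, each a list of rows (H x W floats).
    class_weights: K weights of the target class.
    Returns an H x W heat map.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wk * fm[i][j] for wk, fm in zip(class_weights, feature_maps))
             for j in range(w)]
            for i in range(h)]

# Two 2x2 feature maps, each activating at a different location
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 1.0], [0.0, 0.0]]]
weights = [0.5, 2.0]  # the class cares much more about the second feature
cam = class_activation_map(feature_maps, weights)
# cam == [[0.5, 2.0], [0.0, 0.0]]: the top-right location dominates
```

Upsampled to the input resolution and overlaid on an echo frame, such a map would indicate which cardiac regions contributed most to a disease prediction, which is exactly the kind of insight the paragraph above finds missing in current studies.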

Developing AI models that standardize image acquisition and interpretation with less variability is essential, considering that echocardiography is an operator- and interpreter-dependent imaging modality. AI guidance during data acquisition for the optimal angle, view, and measurements would make echocardiography smarter and less operator-dependent, while standardizing data acquisition. Cost-effective and easily accessible POCUS systems with AI capability would help clinicians and non-experts perform swift initial examinations on patients and make vital and urgent decisions in emergency and primary care clinics. In the near future, POCUS systems with AI capability could replace the stethoscopes that doctors use in their daily practice to listen to patients' hearts. Clinical cardiac ultrasound or POCUS systems empowered with AI, which can assess multi-mode data, steer sonographers during acquisition, and deliver objective qualifications, measurements, and diagnoses, will assist with decision making for diagnosis and treatment, improve echocardiography workflow in clinics, and lower healthcare costs.

**Author Contributions:** Conceptualization, Z.A., J.K.O. and F.L.-J.; methodology, Z.A.; writing—original draft preparation, Z.A., J.K.O. and F.L.-J.; writing—review and editing, Y.H.A., I.Z.A., A.M.A.-O., P.A.P., S.V.P., G.C.K., P.A.F., F.L.-J., and J.K.O.; visualization, Z.A. and Y.H.A.; supervision, J.K.O. and F.L.-J.; project administration, J.K.O. and F.L.-J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

