**Predictive Value of 18F-FDG PET/CT Using Machine Learning for Pathological Response to Neoadjuvant Concurrent Chemoradiotherapy in Patients with Stage III Non-Small Cell Lung Cancer**

**Jang Yoo 1, Jaeho Lee 2, Miju Cheon 1, Sang-Keun Woo 3, Myung-Ju Ahn 4, Hong Ryull Pyo 5, Yong Soo Choi 6, Joung Ho Han 7 and Joon Young Choi 8,\***


**Simple Summary:** The pathological complete response (pCR) after neoadjuvant concurrent chemoradiotherapy (CCRT) is an independent prognostic factor for progression-free and overall survival in non-small cell lung cancer (NSCLC). 18F-FDG PET/CT has been performed for initial staging work-up, treatment response, and follow-up in patients with NSCLC. Machine learning (ML) as an empirical data science has become relevant to nuclear medicine. We investigated the predictive performance of 18F-FDG PET/CT using an ML model to assess the treatment response to neoadjuvant CCRT in patients with stage III NSCLC, and compared the performance of the ML model predictions to predictions from conventional PET parameters and from physicians. The predictions from the ML model using radiomic features of 18F-FDG PET/CT provided better accuracy than predictions from conventional PET parameters and from physicians for the neoadjuvant CCRT response of stage III NSCLC.

**Abstract:** We investigated predictions from 18F-FDG PET/CT using machine learning (ML) to assess the neoadjuvant CCRT response of patients with stage III non-small cell lung cancer (NSCLC) and compared them with predictions from conventional PET parameters and from physicians. A retrospective study of 430 patients was conducted. They underwent 18F-FDG PET/CT before initial treatment and after neoadjuvant CCRT followed by curative surgery. We analyzed texture features from segmented tumors and reviewed the pathologic response. The ML model employed a random forest and was used to classify the binary outcome of the pathological complete response (pCR). The predictive accuracy of the ML model for the pCR was 93.4%. The accuracy of predicting pCR using the conventional PET parameters was up to 70.9%, and the accuracy of the physicians' assessment was 80.5%. The accuracy of the prediction from the ML model was significantly higher than those derived from conventional PET parameters and provided by physicians (*p* < 0.05). The ML model is useful for predicting pCR after neoadjuvant CCRT and showed higher predictive accuracy than conventional PET parameters and physicians.

**Citation:** Yoo, J.; Lee, J.; Cheon, M.; Woo, S.-K.; Ahn, M.-J.; Pyo, H.R.; Choi, Y.S.; Han, J.H.; Choi, J.Y. Predictive Value of 18F-FDG PET/CT Using Machine Learning for Pathological Response to Neoadjuvant Concurrent Chemoradiotherapy in Patients with Stage III Non-Small Cell Lung Cancer. *Cancers* **2022**, *14*, 1987. https://doi.org/10.3390/cancers14081987

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 17 February 2022 Accepted: 13 April 2022 Published: 14 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** non-small cell lung cancer; neoadjuvant concurrent chemoradiotherapy; 18F-FDG PET/CT; machine learning; random forest; pathologic complete response

#### **1. Introduction**

Lung cancer is the most common malignant tumor and remains the leading cause of cancer-related death worldwide in spite of major advances in prevention and multimodal treatment [1]. Non-small cell lung cancer (NSCLC) accounts for more than 85% of all lung cancers, and about 30% of NSCLC cases present with locally advanced disease in stage III [2]. Patients with stage III NSCLC are usually considered inoperable. Neoadjuvant concurrent chemoradiotherapy (CCRT) followed by surgery has been shown to improve the overall outcome by reducing the rate of local failures and distant metastasis [3,4].

In patients receiving neoadjuvant CCRT for stage III NSCLC, surgical resection allows for the identification of the histopathologic tumor response to determine the prognosis and to evaluate postoperative therapeutic options. According to previous studies, the pathologic complete response (pCR) after neoadjuvant CCRT is an independent prognostic factor for progression-free and overall survival in NSCLC [5,6]. Although several papers have reported a wide range of pCR values of 16–27%, it is clear that the pCR is highly correlated with patient survival [7–10].

18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT) has been performed for initial staging work-up, treatment response, and follow-up in patients with NSCLC. It has also been viewed as appropriate for the precise investigation of treatment response after CCRT [11,12]. Previous studies have focused on the comparison of quantitative PET parameters such as the standard uptake value (SUV) after neoadjuvant treatment and histopathologic findings after surgery [13,14]. Moreover, the PET response criteria in solid tumors (PERCIST 1.0) have been applied to the evaluation of 18F-FDG PET/CT to overcome the limitations of anatomic tumor response metrics [15,16]. The role of 18F-FDG PET/CT still needs to be explored because radiation-induced inflammation such as pneumonitis can lead to misinterpretation of 18F-FDG PET/CT images [17,18].

Machine learning (ML) as an empirical data science, which can learn patterns or characteristics from one set of given data and use them to evaluate new data, has become relevant to nuclear medicine. Our previous study demonstrated that ML is well suited to performing analyses of high dimensionality radiomic feature extraction from 18F-FDG PET/CT, and ML analysis provided better diagnostic performance than physicians for evaluating metastatic mediastinal lymph nodes in NSCLC [19]. Although assessing the radiomic features of a tumor in clinical practice has some challenges because of the time, effort, and skill involved, we have shown that ML can improve the diagnostic accuracy and its availability in NSCLC. However, there is still no study that has evaluated the predictive performance of ML for the neoadjuvant CCRT response using the radiomic features of 18F-FDG PET/CT.

Therefore, we investigated the predictive performance of 18F-FDG PET/CT using an ML model to assess the treatment response to neoadjuvant CCRT in patients with stage III NSCLC, and compared the performance of the ML model predictions to predictions from conventional PET parameters and from physicians.

#### **2. Materials and Methods**

#### *2.1. Subjects*

We retrospectively reviewed the medical records of all patients newly diagnosed with stage III NSCLC through imaging studies such as chest X-ray, enhanced chest CT, and 18F-FDG PET/CT, as well as pathologic studies including endobronchial ultrasound-guided transbronchial needle aspiration, mediastinoscopic biopsy, or thoracotomy, between November 2008 and October 2020. To be included in the study population, patients needed to complete a planned neoadjuvant CCRT and undergo curative-intent surgical treatment for stage III NSCLC according to the 7th edition of the TNM classification [20], and undergo a second 18F-FDG PET/CT within approximately 3 weeks following the completion of neoadjuvant CCRT for restaging work-up. Patients in poor cardiopulmonary condition that precluded surgery or who had previously been treated because of another malignant disease were excluded from the study population. Patients who received neoadjuvant chemotherapy or radiotherapy alone were also excluded.

This study was approved by the institutional review board of our institution (IRB No. 2020-09-185), and the requirement for informed patient consent was waived due to its retrospective design.

#### *2.2. Neoadjuvant CCRT and Histopathologic Evaluation*

The neoadjuvant CCRT consisted of chemotherapy and concurrent thoracic radiotherapy. Thoracic radiotherapy was delivered to patients with a total dose of 45 Gy with 1.8 Gy/fraction over 5 weeks from November 2008 to October 2009 or 44 Gy with 2.0 Gy/fraction over 4.5 weeks using 10-MV X-rays from October 2009 onward. The radiotherapy target volume included the known gross and clinical disease plus adequate peripheral margins. The chemotherapy regimens mostly consisted of intravenous administration of paclitaxel (50 mg/m<sup>2</sup> per week) or docetaxel (20 mg/m<sup>2</sup> per week) plus either cisplatin (25 mg/m<sup>2</sup> per week) or carboplatin (AUC, 1.5/week) for 5 weeks. The first dose of chemotherapy was delivered on the first day of thoracic radiotherapy [3,4,21].

Surgical procedures were planned for 4–6 weeks following the completion of neoadjuvant CCRT and comprised resection of the affected lung plus mediastinal lymph node dissection, depending on the clinical stage. Pulmonary resection included lobectomy, bilobectomy, pneumonectomy, or lobectomy with en bloc wedge resection according to the extent of the primary tumor. After surgical resection, the specimens were examined by pathologists for residual tumors based on hematoxylin and eosin-stained slides. They reported the percentage of residual tumor, which was determined by comparing the estimated cross-sectional area of the viable tumor foci with the estimated cross-sectional areas of necrosis, fibrosis, and inflammation on each slide. The absolute viable tumor extent was also assessed based on their calculation, and pathologic complete response (pCR) was defined as no residual viable tumor remaining in the post-therapy pathology specimen [22,23].

#### *2.3. 18F-FDG PET/CT Analysis*

All patients fasted for at least 6 h before 18F-FDG PET/CT was performed to keep their blood glucose level below 200 mg/dL. Torso PET and unenhanced CT images were acquired using a dedicated PET/CT scanner (Discovery STe, GE Healthcare, Waukesha, WI, USA) approximately 60 min after intravenous injection of 5.5 MBq/kg of 18F-FDG. CT images were obtained using a 16-slice helical CT with the following settings: 140 kVp, 30–170 mAs with Auto A mode, and a slice thickness of 3.75 mm. PET images were acquired from head to thigh, and attenuation-corrected PET images (voxel size, 3.9 × 3.9 × 3.3 mm<sup>3</sup>) were reconstructed using a 3D ordered-subset expectation-maximization algorithm (20 subsets, 2 iterations).

For quantitative analysis, the volume of interest (VOI) from the primary tumor was delineated using the gradient-based segmentation method (PET Edge) in MIM version 6.4 (MIM Software Inc., Cleveland, OH, USA) [19]. These VOIs were saved as a DICOM-RT structure that was imported into the Chang-Gung Image Texture Analysis toolbox (CGITA, http://code.google.com/p/cgita, accessed on 1 March 2020) facilitated by MATLAB software (version 2014b; MathWorks, Inc., Natick, MA, USA) to extract the radiomic features from the PET images (Supplemental Table S1) as well as conventional PET parameters, including the maximum SUV (SUVmax), mean SUV (SUVmean), metabolic tumor volume (MTV), and total lesion glycolysis (TLG). We also calculated the differences of these conventional parameters between PET1 and PET2 by subtracting PET2 parameters from those of PET1 and dividing by those of PET1.
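The relative change between the two scans described above can be written out as a short sketch. The function name and the SUVmax values are hypothetical and serve only to illustrate the calculation (PET1 minus PET2, divided by PET1).

```python
def relative_change(pet1_value: float, pet2_value: float) -> float:
    """Fractional decrease from PET1 to PET2: (PET1 - PET2) / PET1."""
    if pet1_value == 0:
        raise ValueError("PET1 value must be non-zero")
    return (pet1_value - pet2_value) / pet1_value


# Hypothetical SUVmax values, for illustration only (not study data).
d_suvmax = relative_change(12.0, 3.0)
print(f"dSUVmax = {d_suvmax:.1%}")
```

A positive value indicates a decrease in the parameter after neoadjuvant CCRT, and a negative value indicates an increase.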

Two nuclear medicine physicians (J.Y.C. and B.T.K.) with more than 15 years of experience in PET/CT interpretation assessed the neoadjuvant treatment response according to PERCIST 1.0 [16] by means of a baseline 18F-FDG PET/CT (PET1) and second PET/CT (PET2) undertaken before surgery. They categorized all patients into four response categories: complete metabolic response (CMR), partial metabolic response (PMR), stable metabolic disease (SMD), and progressive metabolic disease (PMD). After that, the accuracy of the predicted CMR results was compared to histopathologic pCR.

#### *2.4. Machine Learning (ML) Model*

The ML model was developed as a binary classifier. First, data were partitioned into a training dataset (70%) for model building and an independent testing dataset (30%) for internal validation. We developed a tree-based ensemble model for pCR prediction using a random forest (RF) algorithm, which consists of a multitude of decision trees and combines their outputs to decide the outcome. Our model was trained with the bagging method to predict the pCR. Different numbers of trees were tested to achieve the best performance score for the binary classification. The Gini impurity was used to measure the quality of a split. The maximum depth of the tree was 5, and the square root of the number of features was used as the maximum number of features considered when looking for the best split. We applied a randomized grid search method to determine the optimal hyperparameters of the RF model [24–27]. A 10-fold cross-validation in the training dataset, a technique for reducing the bias that can occur as a result of using a single training set, was applied for method validation. All ML statistical analyses were performed using Python (version 3.8.3).
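As an illustration of the split criterion mentioned above, the following minimal sketch computes the Gini impurity of a node and the impurity reduction achieved by one candidate split. The labels are hypothetical (1 = pCR, 0 = non-pCR), and the helper names are illustrative rather than taken from the study's code.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left: Sequence[int], right: Sequence[int]) -> float:
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical node with 3 pCR and 5 non-pCR cases, for illustration only.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
gain = gini_impurity(parent) - split_impurity(left, right)
print(f"impurity reduction = {gain:.5f}")
```

A tree grows by choosing, at each node, the candidate split with the largest impurity reduction among the randomly sampled features.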

In classic oversampling techniques, the minority data are simply replicated from the minority data population, so the ML model gains no additional variation from the oversampled data. Therefore, we used SMOTE (Synthetic Minority Oversampling Technique) to deal with this class imbalance problem. This technique helps with unbalanced data by creating new synthetic data to balance the class distribution. SMOTE starts by choosing a random sample from the minority class. Then it uses a K-Nearest Neighbor (KNN) algorithm to find neighboring minority samples. Next, a new synthetic sample is created between the random sample and one of the neighbors found by the KNN algorithm. This process is repeated until the minority class reaches the same size as the majority class. In this way, we added 322 synthetic participants derived from the existing raw data. A total of 752 participants were analyzed using this oversampling technique.
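The SMOTE steps described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions (Euclidean distance, a uniformly random interpolation factor, and hypothetical function and variable names), not the implementation used in the study.

```python
import math
import random
from typing import List, Sequence

Point = List[float]


def smote(minority: Sequence[Point], n_new: int, k: int = 5, seed: int = 0) -> List[Point]:
    """Create n_new synthetic points by interpolating toward minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Point] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (the base itself is excluded)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature minority samples, for illustration only.
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority_samples, n_new=6, k=2)
print(len(new_samples))
```

With 54 pCR cases among 430 patients, the majority class contains 376 cases, so 376 − 54 = 322 synthetic samples bring the two classes to equal size, which matches the total of 752.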

Several useful scaling techniques (Min–Max scaler, Normalization, Standardization) prevent overflow and underflow of the data. They help to compare dimensional data more efficiently through a scaling process. The process reduces the condition number of the covariance matrix of the independent variables. This reduction enhances the speed of convergence and the stability of the model during the optimization process. We used a standard scaler, which removes the mean and scales each feature to unit variance. Because the features are measured on different scales, standardization of the variables is a necessary preprocessing step.
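Standardization of a single feature column can be sketched as below. The use of the population standard deviation is an assumption, since the paper does not state which estimator was applied, and the values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Z-score a feature column: subtract the mean, divide by the population SD."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical feature values, for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(values)
print(z)
```

In a train/test setting, the mean and standard deviation would normally be estimated on the training data only and then reused to transform the test data, so that no information leaks from the test set.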

For feature selection, the top 10, 20, and 30 variables among 144 variables were selected according to variable importance based on the mean decrease impurity (MDI). MDI, or Gini importance, was calculated as the decrease in node impurity weighted by the probability of reaching the node. The variable importance was obtained by summing this weighted decrease over all splits that used the variable. A higher MDI value indicated a more important feature in the model.
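One common way to write the MDI is given below. The paper does not spell out the formula, so the notation is an assumption: $i(t)$ is the Gini impurity of node $t$, $N_t$ the number of samples reaching it, $N$ the total number of samples, $t_L$ and $t_R$ its child nodes, $v(t)$ the feature used to split node $t$, and $T$ the number of trees in the forest.

```latex
\Delta i(t) = \frac{N_t}{N}\left[\, i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R) \right],
\qquad
\mathrm{MDI}(f) = \frac{1}{T} \sum_{b=1}^{T} \; \sum_{t \in b \,:\, v(t) = f} \Delta i(t)
```

Here $N_t / N$ is the estimated probability of reaching node $t$, and the inner sum runs over all nodes of tree $b$ that split on feature $f$.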

#### *2.5. Statistical Analysis*

The association between conventional PET parameters and pCR was determined by an independent *t*-test or the Mann–Whitney test according to the Kolmogorov–Smirnov test. Receiver operating characteristic (ROC) curve analysis was performed to assess optimal cutoff values of continuous variables using the MedCalc software package (Ver. 9.5, MedCalc Software, Mariakerke, Belgium). The predictive performance of conventional PET parameters and physicians' diagnostic results were reported using sensitivity (Sen), specificity (Spe), positive predictive value (PPV), negative predictive value (NPV), and accuracy (ACC).
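The performance indices listed above all follow from the four cells of a 2 × 2 confusion matrix. A minimal sketch, using hypothetical counts that are not taken from the study:

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV, and accuracy from confusion-matrix counts."""
    return {
        "Sen": tp / (tp + fn),   # true positives among all actual positives
        "Spe": tn / (tn + fp),   # true negatives among all actual negatives
        "PPV": tp / (tp + fp),   # true positives among all predicted positives
        "NPV": tn / (tn + fn),   # true negatives among all predicted negatives
        "ACC": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, fn=10, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because pCR is the minority outcome, accuracy alone can look favorable even when sensitivity or PPV is low, which is why all five indices are reported together.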

For predictive performance of the ML model, we measured the areas under curve (AUCs), ACC, F1 score, precision (also called PPV), and recall (also known as Sen). We compared the measured values with those of predictions from conventional PET parameters and from physicians by using a McNemar test or Fisher's exact test. A *p*-value of less than 0.05 was considered statistically significant.
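A McNemar test compares two predictors evaluated on the same cases by looking only at the discordant pairs. The sketch below uses the exact binomial form, which is an assumption because the paper does not state whether the exact or the chi-square version was applied. The counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases that method A classified correctly and method B incorrectly
    c: cases that method A classified incorrectly and method B correctly
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of k or fewer discordant pairs in one direction under H0 (p = 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=10, c=2)
print(f"p = {p_value:.4f}")
```

Cases on which both methods agree do not enter the statistic, so the test is sensitive only to how often one method is right where the other is wrong.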

#### **3. Results**

#### *3.1. Subject Characteristics*

Among 484 consecutive patients, 430 patients were enrolled in this study. Fifty-four patients were excluded from the analysis due to a lack of surgical treatment after completion of neoadjuvant CCRT (Figure 1). The clinical characteristics of the 430 patients are summarized in Table 1. The patients were predominantly male (71.9%), and there was a high prevalence (67.2%) of adenocarcinoma among the patients. After neoadjuvant CCRT followed by surgery, the mean percentage of viable tumor in the pathologic specimen was 28.8% (range 0–95%). The pCR was observed in 54 patients (12.6%). According to PERCIST criteria, 16.7% of patients had CMR (*n* = 72).

**Figure 1.** Flowchart of the inclusion and exclusion criteria for the patients.

#### *3.2. Predictive Performance of ML Model for pCR*

The radiomic feature importance was obtained using the Gini index, which represents the contribution of each attribute to the prediction model, as listed in Figure 2. The overall prediction performance of the ML model was compared by analyzing the PET1 and PET2 features separately and by analyzing all variables from both PET1 and PET2 together (PET3) (Table 2). The AUCs determined by the ML model were 0.934 in PET1, 0.975 in PET2, and 0.977 in PET3. In the comparative ROC curve analysis (Figure 3), the AUCs of PET2 and PET3 were significantly higher than that of PET1 (*p* = 0.009, *p* = 0.006, respectively). However, there was no significant difference between the AUCs of PET2 and PET3 (*p* = 0.805). According to the other indices, PET3 showed better predictive performance than either the PET1 or the PET2 variables alone.


**Table 1.** Subjects' characteristics.

pCR, pathologic complete response; PERCIST, PET response criteria in solid tumors; CMR, complete metabolic response; PMR, partial metabolic response; SMD, stable metabolic disease; PMD, progressive metabolic disease.


**Table 2.** Comparisons in predictive performance of the ML models using a random forest algorithm for pCR prediction with the included PET data.

AUC, area under curve; ACC, accuracy; PET3, combining PET1 and PET2; \*, †, ‡, *p* < 0.05.

**Figure 3.** Comparisons of the ROC curves of the ML models according to the included PET data. It showed that the AUC of ML using PET/CT data obtained after neoadjuvant CCRT was significantly higher than that of using only baseline PET/CT data (*p* < 0.05).

Additionally, we investigated the predictive results from the ML model using four feature subsets with the top 10, 20, 30, and all features from PET3 (Supplemental Table S2 and Supplemental Figure S1). The ML model performed best when all features were included (AUC = 0.977, ACC = 0.934, F1 = 0.940, Precision = 0.937, Recall = 0.944).

#### *3.3. Predictive Performances of Conventional PET Parameters and Physicians for pCR Prediction*

Among the conventional PET parameters, the SUVmax, SUVmean, MTV, and TLG of PET1 and the SUVmax and SUVmean of PET2 were significantly associated with the pCR (*p* < 0.05). The differences between PET1 and PET2 in the SUVmax (*p* < 0.001), SUVmean (*p* < 0.001), MTV (*p* = 0.003), and TLG (*p* < 0.001) were also significantly associated with the pCR. In contrast, the MTV and TLG of PET2 were not statistically associated with the pCR (Table 3).


**Table 3.** Comparisons in conventional PET parameters according to the presence of pCR.

pCR, pathologic complete response; PET, positron emission tomography; SUV, standard uptake value; MTV, metabolic tumor volume; TLG, total lesion glycolysis; IQR, interquartile range; \*, *p* < 0.05.

The optimal cutoff values that allowed significant association with the pCR were PET1-SUVmax = 13.15, PET1-SUVmean = 4.70, PET1-MTV = 41.11, PET1-TLG = 142.97, PET2-SUVmax = 3.97, PET2-SUVmean = 1.83, dSUVmax = 56.5%, dSUVmean = 43.9%, dMTV = 55.4%, and dTLG = 86.2%. The predictive performance of the PET parameters using these cutoff values is listed in Table 4. The predictive performance of the physicians based on their diagnostic results is also presented in Table 4.

**Table 4.** Comparisons of predictive performance from conventional PET parameters, from physicians and from the ML model.


AUC, area under curve; Sen, sensitivity; Spe, specificity; PPV, positive predictive value; NPV, negative predictive value; ACC, accuracy.

#### *3.4. Comparisons of the ML Model with Conventional PET Parameters and Physicians*

A comparison of the predictive performances of conventional PET parameters, physicians, and the ML model is shown in Table 4. First, the performance of the ML model for pCR prediction was compared with those of conventional PET parameters by analyzing the AUCs. The ML model revealed higher AUC values than all of the single PET parameters (*p* < 0.001). When the pCR was predicted with a conventional single PET parameter, the AUC was only 0.588 to 0.745. By applying the ML model using various radiomic features, however, the AUC improved to 0.977. In terms of predictive performance, the ML model showed significantly higher Spe, PPV, and ACC than was achieved with any of the conventional PET parameters (*p* < 0.001). When comparing the predictive performances of physicians and of the ML model, the ACC of the ML model was significantly higher than that of physicians (93.4 vs. 80.5%, *p* < 0.001). Not only ACC but also Sen, Spe, and PPV were significantly higher for the ML model than for the physicians (94.4 vs. 33.9%, *p* < 0.001; 92.2 vs. 86.4%, *p* = 0.001; 93.7 vs. 29.2%, *p* < 0.001; respectively). NPV was the only index for which there was no significant difference between the ML model and prediction by physicians (93.1 vs. 90.8%, *p* = 0.155).

#### **4. Discussion**

We have demonstrated that the ML model using an RF algorithm could be robust and useful in determining the pCR following neoadjuvant CCRT by radiomic features of 18F-FDG PET/CT. Although several studies evaluating ML for treatment response have been published recently [28–31], they mainly conducted research with multiparametric MRI features and not with 18F-FDG PET/CT. Only a few studies have used 18F-FDG PET/CT features to assess neoadjuvant treatment response in breast and rectal cancer using ML models [26,27]. To the best of our knowledge, this is the first study to predict the response to neoadjuvant CCRT in patients with NSCLC using an ML model.

The response to neoadjuvant CCRT is critical because it affects postoperative treatment and individual prognosis. Furthermore, the correct prediction of the pCR can determine which patients will require more or less aggressive adjuvant treatment to reduce the risk of complications. Despite improvements in therapeutic modalities of neoadjuvant CCRT, the pCR rate remains variable. The gold standard for assessing the pCR is based on postoperative histopathologic findings, which could be inefficient to implement in all patients with advanced NSCLC. Therefore, it is necessary to develop a method of improving the predictive significance of non-invasive imaging modalities for establishing a personalized therapeutic strategy.

Radiomics is an emerging field where various imaging modalities are performed to extract features that may reflect changes in human tissues at the cellular levels and estimate detailed information on tumor biology and microenvironment in nuclear medicine [32,33]. The radiomic features delineated on PET/CT images can represent tumor heterogeneity including fractal dimension, tumor shape, and proliferation [34]. In our experiments, voxel statistics of radiomic features were highly ranked in the prediction for the pCR, followed by texture spectrum and co-occurrence matrix. Although there are differences in the feature importance of many radiomic variables, the ML model using them demonstrated better predictive performance for the pCR than the single conventional PET/CT parameters. Conventional PET parameters and their changes in FDG uptake before and after CCRT have been previously evaluated in determining the treatment response in patients with NSCLC [11]. We also performed these analyses; however, the ACC of the predictive performance using them was only shown to be 44.2–70.9%. Therefore, it seemed unfavorable to evaluate the predictive performance using single PET parameters even though they were statistically significantly correlated with the pCR.

The ML model significantly outperformed the physicians in terms of Sen, Spe, PPV, and ACC. The ML model with PET2 data showed higher predictive performance than the ML model with PET1 data. It appears that radiomic features obtained from PET/CT after neoadjuvant CCRT have more relevant clinical value in the prediction of the pCR. Compared to the results of the ML model with the variables from only one PET/CT time point, the predictive performance also increased when all variables from both PET1 and PET2 were input. We assume that the improvement in performance arises because the features important for predicting the pCR differ somewhat between the radiomics of PET1 and PET2. If more significant variables were input into the ML model, the predictive performance might be further improved. PET-based radiomics has the potential to characterize intratumoral heterogeneity indicating resistance to neoadjuvant CCRT. Therefore, when evaluating treatment response, it is clinically important not only to obtain baseline PET/CT images but also to examine PET/CT after neoadjuvant CCRT. As the current study demonstrated, the use of ML with radiomics features could be predictive of treatment response and thus help to select a more aggressive treatment for those with high-risk factors after curative surgery in patients with stage III NSCLC.

This study had several limitations. First, this study was conducted in a retrospective manner with a limited sample size from a single center. Because radiomic features can be highly dependent on reconstruction methods and imaging parameters [35], we plan to conduct a prospective multicenter trial to obtain more generalizable results in the future. Second, the study population was composed of patients with different therapeutic schemes. Although we addressed a homogeneous population of patients with stage III NSCLC, patients with a more uniform therapeutic modality based on a consistent guideline should also be selected. Third, various pulmonary side effects can arise after radiotherapy, such as pneumonitis or fibrosis, which may challenge the response assessment, although we tried our best to exclude the possibility of treatment-induced inflammatory changes based on the relative intensity and distribution of FDG uptake in the lung parenchyma and automatically generated tumor VOI [36]. Finally, although the proposed ML model was analyzed using a 10-fold cross-validation for minimizing overfitting instead of splitting the dataset into training and test sets, external validation using an independent dataset from a larger cohort is necessary to verify the clinical significance.

#### **5. Conclusions**

In conclusion, the developed ML model using an RF algorithm and 18F-FDG PET/CT radiomics features was useful for predicting the pCR after neoadjuvant CCRT in NSCLC. The predictions of the ML model had higher accuracy than predictions from conventional PET parameters and from physicians. The ML model using radiomics features can be used to facilitate the preoperative individualized prediction for the pCR. Our findings further highlight the potential, non-invasive, and effective clinical significance of an ML model to predict the pCR in patients with stage III NSCLC who had received neoadjuvant CCRT followed by surgery.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14081987/s1, Figure S1: Comparison of ROC curves from random forest according to ranking-based feature selection; Table S1: List of quantitative PET-based radiomic features from CGITA; Table S2: Predictive performance of random forest according to ranking-based feature selection.

**Author Contributions:** All authors contributed equally to this work. Conceptualization, J.Y. and J.Y.C.; data curation, J.Y.; formal analysis, J.Y., J.L. and J.Y.C.; funding acquisition, J.Y. and J.Y.C.; investigation, J.Y. and J.L.; methodology, J.Y., J.L. and J.Y.C.; project administration, J.Y.C.; resources, M.C., S.-K.W., M.-J.A., H.R.P., Y.S.C. and J.H.H.; software, J.Y. and J.L.; supervision, J.Y.C.; validation, J.Y. and J.Y.C.; visualization, J.Y. and J.L.; writing—original draft preparation, J.Y.; writing—review and editing, M.C., S.-K.W., M.-J.A., H.R.P., Y.S.C., J.H.H. and J.Y.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. NRF-2020M2D9A1094072), Future Medicine 20\*30 Project of the Samsung Medical Center (#SMO1220071), and VHS Medical Center Research Grant (No. VHSMC 22001).

**Institutional Review Board Statement:** This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the Samsung Medical Center (IRB No. 2020-09-185).

**Informed Consent Statement:** Patient consent was waived due to the retrospective design of this study.

**Data Availability Statement:** Restrictions apply to the availability of these data. Data were obtained from the Samsung Medical Center and are available from the corresponding author with the permission of the Samsung Medical Center.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Synaptophysin, CD117, and GATA3 as a Diagnostic Immunohistochemical Panel for Small Cell Neuroendocrine Carcinoma of the Urinary Tract**

**Gi Hwan Kim 1, Yong Mee Cho 1, So-Woon Kim 2, Ja-Min Park 3, Sun Young Yoon 3, Gowun Jeong 4, Dong-Myung Shin 5, Hyein Ju 5 and Se Un Jeong 1,\***


**Simple Summary:** While diagnosing a case of small cell neuroendocrine carcinoma (SCNEC) in the urinary tract, we found that the previous biopsy had been misdiagnosed as urothelial carcinoma (UC) because only chromogranin and synaptophysin were tested to define neuroendocrine differentiation and both tests were negative. This case led us to conduct this present study to define a panel of neuroendocrine markers to ensure the diagnosis of traditional neuroendocrine marker-negative SCNEC. We employed a decision tree classifier algorithm to analyze the expression of 17 immunohistochemical markers and found that the extent of synaptophysin (>5%) and CD117 (>20%) and the intensity of GATA3 (negative or weak) are major parameters. Since SCNEC is an aggressive tumor type and requires therapeutic approaches that differ from those used for UC, an accurate diagnosis of SCNEC is critical and this model may help pathologists accurately diagnose SCNEC in daily practice.

**Abstract:** Although the diagnosis of SCNEC is based on its characteristic histology, immunohistochemistry (IHC) is commonly employed to confirm neuroendocrine differentiation (NED). The challenge here is that SCNEC may yield negative results for traditional neuroendocrine markers. To establish an IHC panel for NED, 17 neuronal, basal, and luminal markers were examined on a tissue microarray construct generated from 47 cases of 34 patients with SCNEC as a discovery cohort. A decision tree algorithm was employed to analyze the extent and intensity of immunoreactivity and to develop a diagnostic model. An external cohort of eight cases and transmission electron microscopy (TEM) were used to validate the model. Among the 17 markers, the decision tree diagnostic model selected 3 markers to classify NED with 98.4% accuracy. The extent of synaptophysin (>5%) was selected as the initial parameter, the extent of CD117 (>20%) as the second, and then the intensity of GATA3 (≤1.5, negative or weak immunoreactivity) as the third for NED. The importance of each variable was 0.758, 0.213, and 0.029, respectively. The model was validated by TEM and by using the external cohort. The decision tree model using synaptophysin, CD117, and GATA3 may help confirm NED of traditional marker-negative SCNEC.

**Keywords:** carcinoma; neuroendocrine; urinary bladder; decision trees; immunohistochemistry; synaptophysin; negative results

**Citation:** Kim, G.H.; Cho, Y.M.; Kim, S.-W.; Park, J.-M.; Yoon, S.Y.; Jeong, G.; Shin, D.-M.; Ju, H.; Jeong, S.U. Synaptophysin, CD117, and GATA3 as a Diagnostic Immunohistochemical Panel for Small Cell Neuroendocrine Carcinoma of the Urinary Tract. *Cancers* **2022**, *14*, 2495. https://doi.org/10.3390/cancers14102495

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 21 April 2022 Accepted: 16 May 2022 Published: 19 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Small cell neuroendocrine carcinoma (SCNEC) is a rare entity in the urinary tract, representing 0.5–1% of urinary bladder cancers [1,2]. It usually presents as a high-stage tumor with more frequent muscularis propria invasion and metastasis than conventional urothelial carcinoma (UC) [3]. SCNEC follows an aggressive clinical course, and its 5-year survival rate is as low as 8% [4]. A recently reported combined therapeutic approach included neoadjuvant chemotherapy with cisplatin and etoposide, followed by either radiation therapy or cystectomy if no systemic disease is present; the overall survival was higher in patients who received the neoadjuvant chemotherapy than in those who did not receive it [5,6]. Therefore, accurate diagnosis of SCNEC is critical because of its poor prognosis and because its therapeutic approaches differ from those used for UC.

SCNEC is defined by its characteristic histology: sheets and large nests of relatively small cells with scant cytoplasm, speckled nuclei, and indistinct nucleoli. In the urinary bladder, SCNEC presents as a pure form or more frequently as a component of combined SCNEC and non-SCNEC [4,7]. The non-SCNEC component includes UC, invasive or in situ, and other divergent differentiation and histologic variants such as squamous, glandular, nested, plasmacytoid, sarcomatoid, and trophoblastic.

The diagnosis of SCNEC is classically based on the histologic features, but immunohistochemical (IHC) staining is commonly employed to confirm the diagnosis or to exclude an alternative diagnosis in cases with ambiguous histology. As with its more common counterpart in the lungs, synaptophysin, chromogranin, and CD56 are widely used neuroendocrine (NE) markers in a panel to compensate for the suboptimal sensitivity and specificity of each marker [8]. Synaptophysin has a relatively reliable diagnostic potential; chromogranin is less sensitive with weak and focal positivity; and CD56 is most sensitive but less specific [8,9]. However, SCNEC may yield negative results for all three of these markers [10]. In fact, up to two-thirds of small cell lung cancers may yield negative results for the relatively specific NE markers synaptophysin and chromogranin A [10,11]. The challenge is that SCNEC may have ambiguous or overlapping features with UC, especially in cases of combined SCNEC and UC [5]. In such cases, it might be difficult to accurately diagnose SCNEC, and when the traditional NE markers are negative, it could result in misdiagnosis as UC.

Follow-up biopsies are scheduled for bladder cancer patients to estimate treatment response and detect tumor recurrence. While diagnosing a case of SCNEC in the urinary bladder, we found that the previous bladder biopsy had been misdiagnosed as UC because only chromogranin and synaptophysin were tested to define NE differentiation and both tests were negative. This case led us to conduct the present study to define a panel of NE markers to ensure the diagnosis of traditional NE marker-negative SCNEC. We employed a decision tree classifier algorithm to analyze the expression of 17 IHC markers and finally propose a decision tree model using three markers: synaptophysin, CD117, and GATA3.

#### **2. Materials and Methods**

#### *2.1. Study Samples*

This retrospective study was approved by the Asan Medical Center Institutional Review Board (2013–0107). Initially, the cohort consisted of 47 patients who were diagnosed with SCNEC of the urinary tract (urinary bladder and ureter) as a pure form or combined with UC between May 2002 and October 2020 at Asan Medical Center, Seoul, Republic of Korea. The diagnosis of SCNEC was based on histologic features only or IHC expression analysis of NSE, CD56, chromogranin, and synaptophysin (alone or in combination). After exclusion of 13 patients for whom glass slides or paraffin blocks were not available, 34 patients with SCNEC were included in the discovery cohort. Among the 34 patients, 23 patients were biopsied once and accounted for one case each. Nine patients were biopsied twice (accounting for two cases each), and two patients were biopsied thrice (accounting for three cases each). Among the 11 patients who had been biopsied more than once, six patients had specimens diagnosed with UC during the period. The UC cases of these patients were also included in the analysis to compare their immunoprofile with that of SCNEC. Therefore, 34 patients and their 47 cases (40 cases of pure and combined SCNEC and 7 cases of UC) were finally included in the discovery cohort.

For an external validation of the diagnostic model, data for eight patients were retrieved at the Kyung Hee University Medical Center (KHMC), Seoul, Republic of Korea from 2000 to 2020. They had a confirmed or suspected diagnosis of SCNEC of the urinary bladder based on the IHC staining of NE markers.

Patients' clinicopathological information was obtained from electronic medical records and surgical pathology reports. Pathologic materials of both discovery and external validation cohorts were reassessed according to the 2016 World Health Organization Tumor Classification criteria and staged according to the American Joint Committee on Cancer Staging System, 8th edition.

#### *2.2. Tissue Microarray Construction*

Tissue microarray blocks with 2-mm-diameter cores were constructed from 10% neutrally buffered formalin-fixed, paraffin-embedded urinary bladder tumor blocks using a tissue microarrayer (Quick-Ray, Unitma Co. Ltd., Seoul, Republic of Korea). In general, three representative cores from each case were generated while trying to exclude necrotic and degenerative areas and to maximize tumor cell content. In cases showing histologically divergent or variant features of UC, each representative area was included, resulting in up to 11 cores generated for one case. As a result, a total of 211 cores were generated.

#### *2.3. IHC*

IHC analysis was performed using NE, basal, and luminal markers of bladder cancer [11]. The NE markers included in the present study were CD56, CD117, chromogranin, insulinoma-associated protein 1 (INSM1), neuron specific enolase (NSE), SRY (sex determining region Y)-box 2 (SOX2), synaptophysin, somatostatin receptor 2 (SSTR2), and tubulin beta 2B class IIB (TUBB2B). The loss of retinoblastoma-associated protein (Rb) and p53 was reported in bladder cancers with NE differentiation [11–14]. The basal markers were cytokeratin 5/6 (CK5/6) and cytokeratin 14 (CK14). High expression of epidermal growth factor receptor (EGFR) was reported in the basal subtype of bladder cancer [15]. Luminal markers were cytokeratin 20 (CK20), GATA binding protein 3 (GATA3), and forkhead box A1 (FOXA1) [11,16]. The primary antibodies used in this study, their dilutions, and the subcellular location of each antigen are summarized in Supplementary Table S1. IHC staining was performed using an automated staining system (BenchMark XT, Ventana Medical Systems, Tucson, AZ, USA). The nuclei were counterstained with hematoxylin.

The IHC staining results were assessed in a semiquantitative manner by two pathologists (G.H.K. and S.U.J.). The immunoreactivity of the markers was evaluated according to the intensity (negative (0), weak (1), moderate (2), or strong (3)) and the extent of positive tumor cells (percentage). A diffuse expression in a core was defined as immunoreactivity in more than half of tumor cells. The intensity and extent of marker expression were independently assessed in the decision tree analysis.

#### *2.4. Establishment of the Decision Tree Model*

All 17 IHC markers were included as variables and analyzed for their intensity and extent to classify the cases as neuroendocrine differentiation (NED) and non-neuroendocrine differentiation (non-NED). NED was defined as immunoreactivity to one or more NE markers in cores with SCNEC histology [11]. Based on histologic features and IHC results, the 211 cores were classified into 146 NED cores and 65 non-NED cores. In an attempt to overcome the small number of cases, each core type was analyzed separately to represent NED and non-NED. In cores with simultaneous expression of NE markers with luminal or basal markers, the core was classified as NED when it showed histologic features of SCNEC.

A decision tree model was constructed using a decision tree classifier algorithm in Python 3.8 with scikit-learn 1.0.2 and dtreeviz 1.3.2. The algorithm randomly assigned 147 cores to the training set and 64 cores to the validation set, a ratio of 7 to 3. To select a diagnostic IHC panel for NED using the intensity and extent of immunoreactivity of 17 markers, the algorithm repeatedly classified all cores into NED and non-NED to minimize incorrect classifications [17]. A decision tree-derived diagnostic model was visualized after the training procedure was finished. The finally classified cores are colored yellow for NED and green for non-NED in all plots.
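The split-and-search procedure described above can be illustrated with a minimal pure-Python sketch. It is not the authors' code, which relied on scikit-learn; the toy `extent` and `label` values and the helper names `random_split`, `gini`, and `best_threshold` are hypothetical, and Gini impurity is assumed as the split criterion because it is the scikit-learn default.

```python
import math
import random

def random_split(indices, test_fraction=0.3, seed=0):
    """Shuffle core indices and hold out a test fraction (test size rounded up)."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = non-NED, 1 = NED)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Return (cutoff, weighted impurity) of the best single-marker split."""
    pairs = list(zip(values, labels))
    xs = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2.0  # candidate cutoff midway between observed values
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best

# A 7:3 split of 211 cores reproduces the set sizes reported in the text.
train, test = random_split(range(211))

# Hypothetical toy data: extent (%) of one marker and the NED label per core.
extent = [0, 0, 2, 5, 10, 30, 60, 90]
label = [0, 0, 0, 0, 1, 1, 1, 1]
cut, impurity = best_threshold(extent, label)
print(len(train), len(test), cut, impurity)
```

A full tree repeats this search over every marker at every node and keeps the marker and cutoff with the lowest weighted impurity, which is how a small panel can emerge from 17 candidate markers.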

#### *2.5. Transmission Electron Microscopy (TEM) Analysis*

TEM analysis was performed using standard techniques. The submitted tissues were retrieved from paraffin blocks, deparaffinized, post-fixed in 1% buffered osmium tetroxide, dehydrated, and embedded in Epon. Ultrathin sections (1 μm) were stained with uranyl acetate-lead citrate and examined using a JEOL 1200 EX-II TEM (Jeol, Tokyo, Japan) [18].

#### **3. Results**

#### *3.1. Patients' Characteristics*

The clinicopathological features of the 47 cases from the 34 patients are summarized in Table 1. The median age of the 34 patients at the initial diagnosis of bladder cancer was 66 years (range, 31–86 years), with a 6:1 male to female ratio. Most cases were diagnosed by transurethral resection (34 cases, 72.3%), followed by partial or radical cystectomy (10 cases, 21.3%), ureterectomy (2 cases, 4.3%), and cystoscopic biopsy (1 case, 2.1%). The mean tumor size was 4.36 cm in its greatest dimension (range, 1.0–11.4 cm).

**Table 1.** Clinicopathological features of the discovery cohort.



\* Other organs: prostate, both seminal vesicles, and right vas deferens.

During the reassessment of the cases, we noted that four SCNEC cases from four patients had been misdiagnosed as UC. In three cases, the SCNEC histology was not recognized and IHC for NE markers was not performed. In the remaining case, the SCNEC with ambiguous histology was recognized but chromogranin and synaptophysin staining were negative (Figure 1).

**Figure 1.** Representative H&E and immunohistochemical images of small cell neuroendocrine carcinoma (SCNEC) of classic histology (**A**–**E**) and with ambiguous histology (**F**–**J**). SCNEC shows sheets of relatively small cells with scant cytoplasm, speckled nuclei, and indistinct nucleoli (**A**). It is typically immunoreactive for synaptophysin (**B**), chromogranin (**C**), and CD117 (**D**) and negative for GATA3 (**E**). SCNEC with ambiguous histology shows sheets of cells with small to medium nuclei, relatively abundant cytoplasm, mild pleomorphism and occasional nucleoli (**F**). Although this case is immunonegative for synaptophysin (**G**) and chromogranin (**H**), the tumor is diffusely immunoreactive for CD117 (**I**) and negative for GATA3 (**J**). (Original magnification: A–I, ×400).

After the reassessment of H&E slides and immunostained slides, the cases were classified as pure SCNEC (29 cases, 61.7%), combined SCNEC and UC (15 cases, 31.9%), and UC (3 cases, 6.4%). Divergent differentiation and variant histology were frequently noted and included glandular (6 cases, 12.7%) and squamous (3 cases, 6.4%) differentiation and micropapillary (4 cases, 8.5%), rhabdoid (1 case, 2.1%), and giant cell (1 case, 2.1%) variants. Tumor invasion into the muscularis propria was noted in 38 cases (80.9%). Twenty-five patients were treated with chemotherapy. Among the 10 cases involving partial or radical cystectomy, most were of high pathologic stages with pT3 (8 cases, 80%) and pT4 (1 case, 10%), and half of the patients had lymph node metastasis (5 patients, 50.0%).

#### *3.2. Expression of NE, Luminal, and Basal Markers in the Discovery Cohort*

The expression profile of 17 IHC markers in the 146 NED cores and 65 non-NED cores is summarized in Table 2. Detailed information on the IHC markers is presented in Supplementary Table S1. Representative IHC images are presented in Supplementary Figure S1.


**Table 2.** Immunoprofile of neuroendocrine cores and non-neuroendocrine cores from small cell neuroendocrine carcinomas of the urinary tract.

Data are expressed as number (%). Abbreviations: SYP, synaptophysin; CGA, chromogranin; INSM1, insulinoma-associated protein 1; NSE, neuron specific enolase; SOX2, SRY (sex determining region Y)-box 2; TUBB2B, tubulin beta 2B class IIB; SSTR2, somatostatin receptor 2; p53, tumor protein p53; Rb, retinoblastoma-associated protein; EGFR, epidermal growth factor receptor; CK5/6, cytokeratin 5/6; CK14, cytokeratin 14; CK20, cytokeratin 20; FOXA1, forkhead box A1; GATA3, GATA binding protein 3.

In the NED cores, synaptophysin was the most strongly and widely expressed NE marker, and approximately 80% of NED cores showed diffuse expression. CD56 and CD117 were also diffusely expressed in 61.0% and 58.2% of NED cores, respectively. However, a subset of NED cores was negative for the NE markers synaptophysin (12 cores, 8.2%), CD56 (30 cores, 20.5%), and CD117 (38 cores, 26.0%). Chromogranin and INSM1 were expressed less widely, and their diffuse expression was noted in 20.5% and 43.8% of NED cores, respectively. As expected, the expression of luminal (CK20 and GATA3) and basal (CK5/6 and CK14) markers was negative or weak in ≤5% NED cores. However, EGFR and FOXA1 were expressed in a significant number of NED cores and immunoreactive in 31.5% and 71.9% of NED cores, respectively, with varying intensities.

In the non-NED cores, most of the NE markers such as synaptophysin, chromogranin, CD56, INSM1, SSTR2, and CD117 were negative or weakly expressed (≤5%) in more than 95% of such cores. NSE, SOX2, and TUBB2B were immunoreactive in a significant extent (>5%) of non-NED cores (43.0%, 44.6%, and 13.8%, respectively) with varying intensities, although they were expressed as such in most NED cores (86.3%, 79.5%, and 53.4%, respectively). GATA3 and EGFR showed diffuse expression in 80.0% and 73.9% of non-NED cores, respectively.

#### *3.3. Decision Tree-Based Diagnostic NE IHC Model*

Given the lack of expression of NE markers in a significant number of NED cores, the decision tree classifier algorithm was employed to define a diagnostic IHC panel for NED. Among multiple models suggested by the algorithm, this model was selected because it was relatively simple, highly reproducible, and easy to apply in routine clinical practice. It consisted of three markers, applied in the following order: synaptophysin (cutoff >5% immunoreactive area), CD117 (cutoff >20% immunoreactive area), and GATA3 (cutoff of negative/weak intensity to be classified as NED). The relative importance of the markers was 0.758 for synaptophysin, 0.213 for CD117, and 0.029 for GATA3 in the model.

An overview of the decision tree model using 147 cores of the training set is shown in Figure 2. Synaptophysin immunoreactivity was noted in >5% of the tumor area in 94 cores, which were classified as NED (64.0%). Among the 53 cores with ≤5% synaptophysin-immunoreactive area, 43 cores had a CD117-immunoreactive area ≤20% and were classified as non-NED (81.1%). In the cores with a CD117-immunoreactive area >20%, the intensity of GATA3 immunoreactivity was considered: 9 cores with negative/weak intensity were classified as NED (90.0%) and 1 core with moderate to strong intensity as non-NED (10.0%) (Supplementary Figure S2). The overall accuracy and area under the receiver operating characteristic curve were 98.4% and 98.8%, respectively, according to the internal validation.
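The three-step diagnostic flow can be written as a short rule-based function. The following is a minimal sketch rather than the authors' implementation: the function name `classify_core`, its argument names, and the example cores are hypothetical, while the cutoffs (synaptophysin >5%, CD117 >20%, GATA3 intensity ≤1.5) and the training-set leaf counts are those reported in the text.

```python
def classify_core(syn_extent, cd117_extent, gata3_intensity):
    """Apply the panel in order: synaptophysin, then CD117, then GATA3.

    Extents are percentages of immunoreactive tumor area (0-100);
    GATA3 intensity is scored 0-3 (0 negative, 1 weak, 2 moderate, 3 strong).
    """
    if syn_extent > 5:
        return "NED"
    if cd117_extent <= 20:
        return "non-NED"
    # Synaptophysin <=5% and CD117 >20%: GATA3 intensity decides (cutoff 1.5).
    return "NED" if gata3_intensity <= 1.5 else "non-NED"

# Leaf counts of the 147 training cores, as reported in the text.
leaves = {
    "syn>5 (NED)": 94,
    "cd117<=20 (non-NED)": 43,
    "gata3 negative/weak (NED)": 9,
    "gata3 moderate/strong (non-NED)": 1,
}
total = sum(leaves.values())
syn_low = total - leaves["syn>5 (NED)"]  # cores passed on to the CD117 node

# Hypothetical example cores: (synaptophysin %, CD117 %, GATA3 intensity).
examples = {
    "classic SCNEC": (60, 70, 0),
    "synaptophysin-negative SCNEC": (0, 90, 0),
    "conventional UC": (0, 0, 3),
    "CD117-positive, GATA3-strong": (0, 50, 3),
}
results = {name: classify_core(*values) for name, values in examples.items()}
print(total, syn_low, results)
```

Ordering the checks this way mirrors the tree: most cores are resolved by synaptophysin alone, and GATA3 is consulted only for the small group that is synaptophysin-low and CD117-high.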

**Figure 2.** Decision tree model of the discovery cohort. Diagnostic flow of the training set is demonstrated with cutoff values (bold red arrow) and distribution plots of NED and non-NED cores. Each distribution plot stands for a split-by-condition node. The *x*-axis and *y*-axis represent the extent or intensity of the corresponding IHC marker and the number of NED or non-NED cores, respectively. The finally classified cores are colored yellow for NED and green for non-NED. The degrees of intensity of GATA3 are represented as follows: 0, negative; 1, weak; 2, moderate; 3, strong.

The distribution of expression and association of each marker in all cores of the discovery cohort are presented in Figure 3. When the decision tree model was applied to all 211 cores, 11 cores with ≤5% of synaptophysin-immunoreactive area were classified as NED. They expressed one or more NE markers such as CD117 (11/11 cores, 100%), CD56 (9/11 cores, 81.8%), TUBB2B (6/11 cores, 54.6%), SOX2 (9/11 cores, 81.8%), NSE (7/11 cores, 63.6%), SSTR2 (5/11 cores, 45.5%), and INSM1 (3/11 cores, 27.3%). According to the model, CD117 expression was identified in all NED cores with ≤5% of synaptophysin-immunoreactive area and showed a weak relationship with synaptophysin compared to other NE markers.

**Figure 3.** Distribution of the expression of 17 markers in NED and non-NED cores. Heatmap of 17 markers is presented. The white to red shades show increasing immunoreactivity from 5% to 100%, and the blue color represents less than 5% immunoreactivity of IHC markers including no expression. See color scale.

#### *3.4. Application of the Diagnostic NE IHC Model on an External Cohort*

Six SCNEC cases and two UC cases from the external cohort were immunostained for synaptophysin, CD117, and GATA3 using whole tumor sections in our institution. According to the model, five SCNEC cases were immunoreactive for synaptophysin in more than 20% of tumor cells and classified as NED. The remaining SCNEC case was negative for synaptophysin but immunoreactive for CD117 in more than 90% of tumor cells, being classified as NED. The two UC cases were immunonegative for all three markers and classified as non-NED. These results were consistent with the original diagnosis.

#### *3.5. Ultrastructural Validation of NE Differentiation*

TEM was performed on samples from five SCNEC cases (four cases in the discovery cohort from which the 11 cores with ≤5% of synaptophysin-immunoreactive area were derived and one such case from the external cohort). Two SCNEC cases with diffuse synaptophysin expression and two UC cases were also included as positive and negative controls, respectively.

All five cases showed varied numbers of electron dense neurosecretory granules in the cytoplasm of the tumor cells, similar to those of SCNEC (Figure 4). They ranged from 144.5 to 582.2 nm. The granules were round with a dense core, although the delimiting outer membrane and peripheral halos were not clearly observed probably due to the deparaffinization process. There were no neurosecretory granules in the two UC cases (data not shown).

**Figure 4.** Transmission electron microscopy image of synaptophysin-negative SCNEC. Arrows indicate neurosecretory granules (218.31–275.16 nm). (Original magnification, ×20,000).

#### **4. Discussion**

Herein, we propose a decision tree-based IHC model consisting of two inclusion markers (synaptophysin and CD117) and one exclusion marker (GATA3) for the diagnosis of SCNEC of the urinary bladder. It could detect NED of not only NE marker-positive SCNEC but also traditional marker-negative SCNEC. The model was validated using an external cohort and by TEM analysis.

Through this study, we emphasize the following points for the diagnosis of SCNEC. First, it is crucial to be familiar with the histological features of SCNEC. In cases with ambiguous histological features that are difficult to differentiate from UC, IHC for NE markers should be performed with a low threshold. Second, even focal (>5%) and weak synaptophysin immunoreactivity would be sufficient for the diagnosis of SCNEC. Third, in synaptophysin-negative cases, CD117 and GATA3 may be helpful to distinguish between SCNEC and non-SCNEC.

SCNEC is mainly diagnosed based on histology and may not require IHC confirmation. As reported previously, most of our cases including traditional NE marker-negative cases showed classic histological features of SCNEC. The tumor presented as solid sheets, nests, or trabeculae of small cells. Tumor cells have sparse cytoplasm, nuclear molding, finely granular stippled chromatin, inconspicuous nucleoli, high mitotic count, and frequent individual and geographic necrosis [4]. However, ambiguous histological features such as relatively abundant cytoplasm and the presence of nucleoli albeit inconspicuous were noted as shown in Figure 1. In such cases, IHC for NE markers might be useful to confirm NED.

Synaptophysin, chromogranin, and CD56 are widely used clinically in a diagnostic panel because of their suboptimal sensitivity and specificity as individual markers [9]. In the more common counterpart, lung cancer, synaptophysin is expressed in 41–75% of small cell lung carcinomas (SCLC) and 58–85% of large cell neuroendocrine carcinomas (LCNEC). Chromogranin may show weak and focal positivity and less sensitivity, being expressed in only 23–58% of SCLC and 42–69% of LCNEC. CD56 is expressed in most SCLC (72–99%) and LCNEC (72–94%) cases but at the cost of relatively low specificity (72%). As expected, synaptophysin was chosen as the most important NE marker in our model.

CD117 was chosen as the second most important marker for the diagnosis of SCNEC in preference to other traditional or emerging NE markers. This could be explained, at least in part, by the fact that other NE markers were often expressed simultaneously whereas CD117 was expressed in those NE marker-negative SCNEC cases. CD117 expression has been reported in SCNEC of various organs such as the lung, uterine cervix, and esophagus [19–21]. CD117 expression was also noted in 27% of cases of SCNEC in the urinary bladder [22]. The mechanisms of CD117 expression in NE carcinoma are largely unknown, but an autocrine growth loop has been suggested in SCLC cell lines [23]. As a member of the type III receptor tyrosine kinase family, CD117 activates several signaling pathways, such as the JAK/STAT, RAS/MAP kinase, PI3 kinase, PLCγ, and SRC pathways [24]. Consequently, it plays an important role in the proliferation, survival, differentiation, apoptosis, and migration of tumor cells [24]. Another hypothesis is that CD117 may increase cancer stem cell phenotype in SCNEC since it plays a key role in maintaining the stemness of cancer stem cells [24]. Because both UC and SCNEC arise from common multipotential cancer stem cells, SCNEC frequently coexists with conventional UC [25]. Therefore, CD117 expression may represent a marker of aggressive biologic behavior of SCNEC instead of NED in the model.

According to previous reports, a novel pan-NE marker INSM1 was superior to traditional NE markers with high sensitivity (93.9%) and specificity (97.4%) in the SCNEC of the genitourinary tract [26,27]. In our cases, INSM1 showed relatively lower sensitivity (78.1%) but similarly high specificity (96.9%) compared to the previous report. Nevertheless, this novel marker was not selected in our model. A decision tree selects, at each node, the variable that best separates the classes, so when multiple variables are correlated only the best one tends to be retained. As shown in Figure 3, there is a strong relationship between INSM1 and synaptophysin immunoreactivity, which may explain why synaptophysin rather than INSM1 was selected in the model.
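For readers who wish to check how the sensitivity and specificity figures relate to core counts, the arithmetic can be sketched as follows. The counts used here are an assumption: they are back-calculated from the reported percentages and the totals of 146 NED and 65 non-NED cores, and are not stated in the text. The function names are likewise hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of NED cores that are positive for the marker."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-NED cores that are negative for the marker."""
    return true_neg / (true_neg + false_pos)

# Assumed INSM1 counts, back-calculated from 78.1% of 146 NED cores
# and 96.9% of 65 non-NED cores (not reported directly in the text).
insm1_tp, insm1_fn = 114, 32
insm1_tn, insm1_fp = 63, 2

sens_pct = round(100 * sensitivity(insm1_tp, insm1_fn), 1)
spec_pct = round(100 * specificity(insm1_tn, insm1_fp), 1)
print(sens_pct, spec_pct)
```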

Among non-NE markers employed in the present study, GATA3 immunoreactivity was selected as an exclusion marker for NE differentiation probably because of its relatively higher specificity than that of the other non-NE markers. The basal markers CK5/6 and CK14 were not only negative in most NE cores (94.5% and 93.8%, respectively) but also not expressed in more than half of non-NE cores (63.1% and 66.2%, respectively). The luminal marker FOXA1 was expressed similarly in NE cores and non-NE cores (88.4% and 83.1%, respectively). In the remaining luminal markers, GATA3 was negative in more NE cores than CK20 (89.7% and 81.5%, respectively) and had stronger immunoreactivity in the non-NE cores (moderate to strong immunoreactivity in 89.3% and 75.3%, respectively). Therefore, basal markers CK5/6 and CK14 and luminal marker FOXA1 might offer suboptimal distinguishing power between NE cores and non-NE cores, and GATA3 might be a better exclusion marker than CK20.

Although the demand for TEM has decreased due to the development of IHC staining and molecular pathology, this technique is still used for accurate diagnosis. TEM is particularly useful for the differential diagnosis between malignant mesothelioma and serous carcinoma, for which immunostaining results alone cannot achieve an accurate diagnosis [28]. In the present study, neurosecretory granules were found in all synaptophysin-negative and inconspicuous (≤5%) cases and were useful for confirming NED in those cases, although the granules were fewer than in classic SCNEC cases.

Genomic analyses of bladder cancer have been used for the molecular characterization of variant histologic subtypes. The Cancer Genome Atlas (TCGA) and a report by Lund et al. have identified neuronal subtype or small cell/neuroendocrine (SC/NE) consensus cluster, accounting for 3–15% of bladder cancer by RNA-sequencing analysis [16,29,30]. A TCGA report has shown that tumors representing NED at the molecular level were not similar in histology to SCNEC in 85% of cases (17/20) [16]. A report by Lund et al. showed that only half of the SC/NE consensus cluster represented the enriched expression of neuronal markers such as synaptophysin, chromogranin, and CD56 [29]. Phenotypical UC with the absence of NE histology may also reveal transcriptomic patterns of NE carcinoma and be defined as neuroendocrine-like (NE-like) tumors [11]. These reports suggest that histological, molecular, and IHC results of SCNEC may not agree completely with each other. Combining our findings with previous results, continuous efforts should be made to define the diagnostic criteria for aggressive NE carcinoma that requires therapeutic approaches different from those used for UC.

The present study has limitations. Although the performance of the decision tree diagnostic model was excellent, the possibility of overfitting cannot be excluded. Since we performed core-based analysis to compensate for the small number of SCNEC cases, this model needs to be validated with larger numbers of SCNEC cases, preferably in a multicenter study.

#### **5. Conclusions**

Our study demonstrated that the decision tree model using synaptophysin, CD117, and GATA3 may help confirm NED of not only NE marker-positive SCNEC but also traditional marker-negative SCNEC.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14102495/s1, Figure S1: Representative immunohistochemical analysis of 17 markers used in the present study; Figure S2: Representative immunohistochemistry of GATA3; Table S1: Antibodies used in the study; Table S2: Immunoprofile of neuroendocrine cores and non-neuroendocrine cores from small cell neuroendocrine carcinomas of the urinary tract.

**Author Contributions:** Conceptualization, Y.M.C.; methodology, S.U.J. and Y.M.C.; software, G.J. and S.U.J.; validation, G.H.K. and S.-W.K.; formal analysis, S.U.J.; investigation, G.H.K.; resources, J.-M.P. and S.Y.Y.; data curation, G.J.; writing—original draft preparation, G.H.K.; writing—review and editing, S.U.J. and Y.M.C.; visualization, G.J., D.-M.S. and H.J.; supervision, Y.M.C.; project administration, Y.M.C.; funding acquisition, Y.M.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is funded by the Ministry of Science, ICT and Future Planning (2019R1A2C1088246) and a grant (2019IP0870-2) from the Asan Institute for Life Sciences, Asan Medical Centre, Seoul, Korea.

**Institutional Review Board Statement:** The study was approved by the Institutional Review Board of Asan Medical Center (2013-0107).

**Informed Consent Statement:** Patient consent was waived due to the retrospective nature of the study.

**Data Availability Statement:** The data are available on request from the corresponding author.

**Acknowledgments:** This work is supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Integrated Analysis of Tumor Mutation Burden and Immune Infiltrates in Hepatocellular Carcinoma**

**Yulan Zhao, Ting Huang and Pintong Huang \***

Department of Ultrasound in Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310000, China

**\*** Correspondence: huangpintong@zju.edu.cn; Tel.: +86-18857168333; Fax: +86-0571-87783934

**Abstract:** Tumor mutation burdens (TMBs) act as an indicator of immunotherapeutic responsiveness in various tumors. However, the relationship between TMBs and immune cell infiltrates in hepatocellular carcinoma (HCC) is still obscure. The present study aimed to explore the potential diagnostic markers of TMBs for HCC and analyze the role of immune cell infiltration in this pathology. We used OA datasets from The Cancer Genome Atlas database. First, the "maftools" package was used to screen the highest mutation frequency in all samples. R software was used to identify differentially expressed genes (DEGs) according to mutation frequency and perform functional correlation analysis. Then, the gene ontology (GO) enrichment analysis was performed with "clusterProfiler", "enrichplot", and "ggplot2" packages. Finally, the correlations between diagnostic markers and infiltrating immune cells were analyzed, and CIBERSORT was used to evaluate the infiltration of immune cells in HCC tissues. As a result, we identified a total of 359 DEGs in this study. These DEGs may affect HCC prognosis by regulating fatty acid metabolism, hypoxia, and the P53 pathway. The top 15 genes were selected as the hub genes through PPI network analysis. *SRSF1*, *SNRPA1*, and *SRSF3* showed strong similarities in biological effects. NCBP2 was demonstrated as a diagnostic marker of HCC, and high NCBP2 expression was significantly correlated with poor overall survival (OS) in HCC. In addition, NCBP2 expression was correlated with the infiltration of B cells (r = 0.364, *p* = 3.30 × 10<sup>−12</sup>), CD8+ T cells (r = 0.295, *p* = 2.71 × 10<sup>−8</sup>), CD4+ T cells (r = 0.484, *p* = 1.37 × 10<sup>−21</sup>), macrophages (r = 0.551, *p* = 1.97 × 10<sup>−28</sup>), neutrophils (r = 0.457, *p* = 3.26 × 10<sup>−19</sup>), and dendritic cells (r = 0.453, *p* = 1.97 × 10<sup>−18</sup>). Immune cell infiltration analysis revealed that the degree of central memory T-cell (Tcm) infiltration may be correlated with the HCC process. In conclusion, NCBP2 can be used as a diagnostic marker of HCC, and immune cell infiltration plays an important role in the occurrence and progression of HCC.

**Keywords:** hepatocellular carcinoma; tumor mutation burden; immune cells; The Cancer Genome Atlas; CIBERSORT

#### **1. Introduction**

Hepatocellular carcinoma (HCC) is one of the most common and aggressive malignancies of the digestive system and contributes to a severe global disease burden [1]. It ranked sixth in global incidence (4.7%) and was the third leading cause of cancer-related deaths (8.3%) in 2020, according to a recent study [2]. The prognosis of patients is usually driven by the tumor stage. The 5-year survival rates for local disease exceed 70%; however, the median survival time of advanced-stage HCC patients is only 1 year [3]. Although survival has improved thanks to advancements in medical treatments [4], approximately 2/3 of HCC patients are diagnosed at advanced stages, and the median overall survival remains low [5]. Therefore, there is an urgent need to explore the potential molecular mechanisms of tumor progression to develop better therapeutic strategies and investigate the potential benefits of adjuvant systemic therapies.

The molecular mechanisms contributing to the development of HCC are extremely complex and involve various genetic abnormalities, such as the dysregulation of signaling pathways, genomic instability, single-nucleotide polymorphisms (SNPs), and somatic mutations [6,7]. Somatic mutations are frequently reported among HCC patients, and their landscape is complicated: they occur in a multitude of genes and are accompanied by changes in multiple signaling pathways [8], which contributes to molecular heterogeneity that remains poorly understood. With the rise of high-throughput sequencing technology, a large number of databases based on TCGA (The Cancer Genome Atlas) and GEO (Gene Expression Omnibus) datasets have emerged, making it convenient to investigate the complex relationships between HCC and the underlying oncogenic somatic mutation mechanisms. Our results may provide new insight into novel diagnostic and prognostic values for HCC.

**Citation:** Zhao, Y.; Huang, T.; Huang, P. Integrated Analysis of Tumor Mutation Burden and Immune Infiltrates in Hepatocellular Carcinoma. *Diagnostics* **2022**, *12*, 1918. https://doi.org/10.3390/diagnostics12081918

Academic Editor: Gian Paolo Caviglia

Received: 26 June 2022; Accepted: 1 August 2022; Published: 8 August 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In addition, recent studies have demonstrated that tumor mutation burden (TMB) is correlated with immune cell infiltration and subtypes [9,10]. TMB is defined as the frequency of gene mutations in the somatic coding region of the tumor genome (total count of variants divided by the total length of exons), including translocation, deletion, insertion, and other mutations, and is expressed as the average number of mutations per megabase (Mb). It is used as a biomarker to predict the sensitivity, efficacy, and treatment outcomes of immune checkpoint inhibitors (ICPIs) [11,12]. Tumor cells carry new antigens generated by somatic mutations on the cell surface that may be recognized by the immune system, making the tumor cell a target for activated immune cells [13]. To date, numerous studies have focused on the relationship between TMB and immunotherapy in diverse cancers [14–16], and accumulating evidence indicates that a high tumor mutation burden confers an increased immune reaction to tumors and a better response to ICPI treatment [17]. However, the prognostic value of TMB in HCC has not yet been clearly determined.
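The TMB definition above (variant count divided by exon length, reported per Mb) can be illustrated with a minimal Python sketch. The function name, the variant count, and the 38 Mb exome length are hypothetical values chosen for illustration; they are not taken from the study, which used R.

```python
def tumor_mutation_burden(variant_count: int, exome_length_bp: int) -> float:
    """Return the number of somatic variants per megabase (Mb) of exonic sequence."""
    if exome_length_bp <= 0:
        raise ValueError("exome_length_bp must be positive")
    # Convert the exome length from base pairs to megabases before dividing.
    return variant_count / (exome_length_bp / 1_000_000)


# Hypothetical sample: 114 somatic variants over an assumed 38 Mb of exons.
tmb = tumor_mutation_burden(114, 38_000_000)
print(f"TMB = {tmb:.2f} mutations/Mb")
```

The exome length used as the denominator depends on the capture kit or annotation, so it has to be supplied explicitly rather than assumed.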

In the present study, we downloaded The Cancer Genome Atlas HCC datasets using R software packages and used other online databases to investigate the association of genes bearing important mutations contributing to TMBs with clinical and genomic features in HCC patients. We performed gene ontology (GO) term enrichment and protein–protein interaction (PPI) analysis and constructed functional networks related to NCBP2 in HCC. Finally, the relationship between NCBP2 and immune cell infiltration in HCC was also analyzed. The findings from the present study suggest that NCBP2 influences the prognosis of HCC patients via its interaction with infiltrating immune cells.

#### **2. Materials and Methods**

#### *2.1. Data Download*

The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) (accessed on 1 March 2021) database provides publicly available cancer genome datasets. The TCGA database contains 369 cases of LIHC tissue samples. We used the R package RTCGAToolbox to download the liver hepatocellular carcinoma (LIHC) gene expression profiles and clinical data from the TCGA database (https://portal.gdc.cancer.gov/) (accessed on 1 March 2021) as the training set. We included a total of 364 LIHC samples in the present study. We used the maftools package to screen the 20 genes with the highest mutation frequencies in all samples and to visualize the mutation status and frequency of each sample. We grouped all samples according to the genes with the highest mutation frequencies.
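As a rough illustration of the screening step, the following Python sketch ranks genes by the fraction of samples carrying at least one mutation. It is not the maftools implementation; the function name and the toy records are assumptions made only for this example.

```python
from collections import defaultdict


def mutation_frequencies(records, n_samples):
    """Rank genes by the fraction of samples with at least one mutation.

    records: iterable of (sample_id, gene) pairs, one per called variant.
    n_samples: total number of samples in the cohort (mutated or not).
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in records:
        # A set ensures that several variants in one sample count once.
        samples_by_gene[gene].add(sample_id)
    freqs = {gene: len(s) / n_samples for gene, s in samples_by_gene.items()}
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy variant calls for a hypothetical cohort of four samples.
toy_records = [
    ("S1", "TP53"), ("S1", "TP53"), ("S2", "TP53"), ("S3", "TP53"),
    ("S2", "TTN"), ("S4", "TTN"),
    ("S3", "CTNNB1"),
]
ranking = mutation_frequencies(toy_records, n_samples=4)
print(ranking)
```

Taking the first 20 entries of such a ranking would correspond to the "top 20 mutated genes" used for grouping.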

#### *2.2. Data Preprocessing and Differentially Expressed Gene (DEG) Screening*

We used the affy package (R version 3.6.3; TUNA Team, Tsinghua University, Beijing, China) to perform background correction and data normalization, and we screened differentially expressed genes (DEGs) using the limma package. The screening criteria were |log2 fold change (log2FC)| > 1 and adjusted *p* < 0.05. We used univariate Cox regression to screen for prognostic genes. For the intersecting genes, we used the Search Tool for the Retrieval of Interacting Genes (STRING; http://string-db.org; version 11.0; accessed on 1 March 2021) to predict the protein–protein interaction (PPI) network. We used Cytoscape to visualize complex networks and integrate them with data of any attribute type. Gene ontology (GO) is a common method used to annotate genes and their products on a large scale across three categories: molecular function (MF), biological process (BP), and cellular component (CC). We performed a GO analysis of the intersecting genes.
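The screening criteria can be sketched in Python as follows. The example assumes that the adjusted *p* values come from the Benjamini–Hochberg procedure, which the text does not state explicitly; the gene names and statistics are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_degs(genes, log2fc, p_values, fc_cutoff=1.0, alpha=0.05):
    """Keep genes with |log2FC| > fc_cutoff and adjusted p < alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, adj_p in zip(genes, log2fc, adjusted)
        if abs(fc) > fc_cutoff and adj_p < alpha
    ]


# Hypothetical per-gene statistics (not from the study).
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [2.1, -1.5, 0.4, 1.2]
p_values = [0.005, 0.01, 0.03, 0.9]
degs = screen_degs(genes, log2fc, p_values)
print(degs)
```

Both up- and downregulated genes pass the filter because the cut-off is applied to the absolute value of the fold change.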

#### *2.3. GSEA and GSVA Analysis*

We performed GSEA and GSVA to identify the pathways that were differentially enriched between the two groups. The reference gene set was h.all.v7.1.symbols.gmt. We ran 1000 permutations to obtain a normalized enrichment score for each analysis. We considered a nominal *p* < 0.05 and a false discovery rate < 0.05 as significant. We used the clusterProfiler and GSVA packages for the GSVA analysis, and we considered pathways with an adjusted *p* value < 0.05 to be meaningful.
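To make the idea of an enrichment score and a permutation-based *p* value concrete, here is a simplified Python sketch. It uses an unweighted running sum and permutes gene sets rather than sample labels, so it is a teaching approximation and not the algorithm implemented in the GSEA software or in clusterProfiler; all names and the toy ranking are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic: the maximum deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(gene in gene_set for gene in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


# Toy ranking from most upregulated (G1) to most downregulated (G8).
ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
es_top = enrichment_score(ranked, {"G1", "G2"})
es_bottom = enrichment_score(ranked, {"G7", "G8"})
p_top = permutation_p_value(ranked, {"G1", "G2"})
print(es_top, es_bottom, p_top)
```

A positive score indicates that the gene set is concentrated at the top of the ranking, and a negative score indicates concentration at the bottom.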

#### *2.4. Verification of Differential Expression of NCBP2*

We used GEPIA2 (http://gepia2.cancer-pku.cn/) (accessed on 1 March 2021) to verify the differential expression of NCBP2 between tumor and paracancerous samples for liver cancer and other cancers in the database. We applied the box plot module of the GEPIA2 database to explore the expression level of NCBP2 in various cancer datasets, including the GTEx and TCGA databases, and we also analyzed the expression levels of NCBP2 in different stages of liver cancer through the Stage Plot module. Then, we used the Survival Map module to investigate the overall survival (OS) rates in liver and other cancers. The significance level was set at 0.05.

#### *2.5. Prognostic Analysis*

The Kaplan–Meier Plotter platform is able to assess the effects of more than 50,000 genes on survival in 21 cancer types. The primary purpose of this tool is the discovery and validation of survival biomarkers based on meta-analysis. We used the Kaplan–Meier Plotter to explore the correlation between NCBP2 expression and liver cancer prognosis.
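For readers unfamiliar with the survival curves produced by such platforms, the Kaplan–Meier product-limit estimator can be sketched in a few lines of Python. The follow-up times and event indicators below are invented for illustration and do not come from any of the databases used in the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times: follow-up time for each patient.
    events: 1 if the event (death) was observed, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process all patients sharing the same follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical follow-up (months) for five patients.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Comparing two such curves (for example, high versus low NCBP2 expression) is what the log-rank test reported by the platform formalizes.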

#### *2.6. Expression Verification of NCBP2 in Cells and Tissues*

The Human Protein Atlas is an open-access database that maps all human proteins in organ tissues and cells by integrating various omics techniques. We examined the mRNA expression of NCBP2 in organ tissues and tumors using the Human Protein Atlas and TIMER databases. We used these databases to preliminarily verify the expression levels of NCBP2 in cells and tissues.

#### *2.7. Correlation Analysis between NCBP2 and Immunity*

We applied the "corrplot" package to further investigate immune cell infiltration and the relationship between NCBP2 and immune cells in liver cancer. We constructed a correlation heatmap to visualize the correlations among 22 types of infiltrating immune cells in liver cancer. Then, we performed Spearman correlation analyses using the "ggstatsplot" package (https://github.com/IndrajeetPatil/ggstatsplot) (accessed on 1 March 2021) to investigate the relationship between NCBP2 expression levels and immune cells.
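The Spearman correlation used here is the Pearson correlation of rank-transformed values. The following self-contained Python sketch shows the calculation on invented numbers; the variable names and data are hypothetical and the code is not the ggstatsplot implementation.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values and immune infiltration scores.
expression = [1.2, 2.5, 3.1, 4.8, 5.0]
infiltration = [0.10, 0.30, 0.20, 0.50, 0.40]
rho = spearman(expression, infiltration)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks are used, the coefficient is insensitive to the scale of the expression values and to monotonic transformations such as log2.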

#### **3. Results**

#### *3.1. Landscape of Gene Mutation Profiles in LIHC*

To investigate the mutation profile of the TCGA-LIHC cohort, we used the R package RTCGAToolbox to acquire the LIHC gene expression profiles and clinical data as the training set from the TCGA database (https://portal.gdc.cancer.gov/) (accessed on 1 March 2021). The maftools package was used to screen the top 20 genes with the highest mutation frequencies in all samples, and waterfall plots were used to visualize the mutation landscapes of these genes. Among the 364 LIHC samples included in the present study, 312 (85.71%) possessed somatic mutations. Among the top 20 mutated genes shown in Figure 1, *TP53* was mutated most frequently, accounting for approximately 28% of mutations, followed by *TTN* (25%), *CTNNB1* (24%), *MUC16* (16%), *ALB* (11%), *PCLO* (11%), *MUC4* (10%), *RYR2* (10%), *ABCA13* (9%), *APOB* (9%), *CSMD3* (8%), *FLG* (8%), *LRP1B* (8%), *OBSCN* (8%), *AXIN1* (8%), *XIRP2* (8%), *ARID1A* (7%), *HMCN1* (7%), *CACNA1E* (7%), and *SPTA1* (7%). Missense mutations were the most frequent among these alterations.

**Figure 1.** Landscape profile of the top 20 mutated genes in 364 LIHC samples from the TCGA database. Mutations of each gene in each sample are shown in a waterfall plot. Each column represents a sample and each row represents a mutated gene, whose name is listed on the left. The different forms of somatic mutations and the percentages of gene mutation types are shown on the right (a color version of the figure is available online). LIHC: liver hepatocellular carcinoma; TCGA: The Cancer Genome Atlas.

#### *3.2. Data Preprocessing and Screening of DEGs*

All samples obtained from above were divided into high- and low-TMB groups according to the median TMB threshold, and we further evaluated the missing data and normalization for data preprocessing. The box chart results showed that similar levels of data points were achieved after correcting the mean value of the gene expression, and the data homogenization was credible (Figure 2A,B). The gene expression matrix was then merged for further normalization. The PCA results indicated that the clustering of samples was more obvious between the two groups after homogenization (Figure 2C,D), and the results suggested that the sample data source included in the present study was reliable and could be used for further analysis. After data preprocessing, we identified 2171 DEGs between high- and low-TMB groups with |Log FC| > 1 and *p* value < 0.05 through the limma package of R software. The result was presented via a volcano map (Figure 2E), in which green dots represent downregulated genes, red dots represent upregulated genes, and black dots represent unchanged genes.
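The median-based grouping can be sketched in Python as below. How samples exactly at the median were handled is not stated in the text; assigning them to the low-TMB group is an assumption of this example, as are the sample identifiers and TMB values.

```python
from statistics import median


def split_by_median(tmb_by_sample):
    """Label each sample 'high' or 'low' relative to the cohort median TMB.

    Assumption: samples exactly at the median are labeled 'low'.
    """
    cutoff = median(tmb_by_sample.values())
    groups = {
        sample: ("high" if value > cutoff else "low")
        for sample, value in tmb_by_sample.items()
    }
    return cutoff, groups


# Hypothetical TMB values (mutations/Mb) for four samples.
toy_tmb = {"S1": 1.0, "S2": 2.5, "S3": 4.0, "S4": 7.5}
cutoff, groups = split_by_median(toy_tmb)
print(cutoff, groups)
```

A median split yields two groups of nearly equal size, which keeps the downstream differential expression comparison balanced.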

**Figure 2.** Data preprocessing and differential expression analysis. (**A**,**B**) Box chart of gene expression among high- and low-TMB groups. Black dots represent mean values of gene expression before (**A**) and after (**B**) sample normalization. (**C**,**D**) Principal component analyses (PCA) of gene expression between high- and low-TMB groups before (**C**) and after (**D**) normalization. (**E**) Volcano map of DEGs; red represents upregulated differential genes, green represents downregulated differential genes, and grey represents genes with no significant difference. TMB: Tumor Mutation Burden.

#### *3.3. Joint Screening of Genes, PPI Network Construction, Hub Genes Screening, and Similarities*

In order to identify genes more closely related to the prognosis of patients with HCC, we intersected the DEGs between the high- and low-TMB groups with the prognosis-related genes (*p* < 0.05 in univariate Cox analysis) obtained from the TCGA database. A total of 359 genes were identified at the intersection of the 2171 DEGs between the high- and low-TMB groups and the 2250 genes related to prognosis and survival (Figure 3A). The Search Tool for the Retrieval of Interacting Genes (STRING) (http://string-db.org; Version: 11.0) (accessed on 1 March 2021) is an online tool for predicting protein–protein interaction (PPI) networks. An analysis of functional interactions between proteins can provide more insight into the mechanisms of disease occurrence or development. Through Cytoscape and its plug-in cytoHubba, we constructed the PPI network of the prognosis-related DEGs obtained above (Figure 3B). The 15 genes with the highest scores from the MCC algorithm of the cytoHubba plug-in were selected as the hub genes in this PPI network: *USP39*, *RBM22*, *SNRPD1*, *CPSF3*, *SRSF1*, *SRSF3*, *HSPA8*, *HNRNPU*, *SRSF4*, *CWC27*, *EFTUD2*, *ALYREF*, *NCBP2*, *SNRPA1*, and *POLR2D* (Figure 3C). To further explore how closely the hub DEGs were related, we ranked them on the basis of average functional similarity. *SRSF1*, *SNRPA1*, *SRSF3*, *SRSF4*, *ALYREF*, *NCBP2*, *SNRPD1*, and *EFTUD2* had similarity values greater than the cut-off of 0.7, and *SRSF1*, *SNRPA1*, and *SRSF3* showed a strong similarity in biological effects (Figure 3D).
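The joint screening step amounts to a set intersection between the DEG list and the prognosis-related gene list. A minimal Python sketch with placeholder gene names (not the study's actual lists) is shown below.

```python
def intersect_gene_lists(degs, prognostic_genes):
    """Return the sorted genes present in both lists (the Venn overlap)."""
    return sorted(set(degs) & set(prognostic_genes))


# Placeholder gene names for illustration only.
toy_degs = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
toy_prognostic = ["GENE_C", "GENE_E", "GENE_A"]
overlap = intersect_gene_lists(toy_degs, toy_prognostic)
print(overlap)
```

Using sets removes duplicate symbols before counting, so the size of the overlap matches what a Venn diagram would display.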

**Figure 3.** Joint screening of DEGs, protein–protein interaction (PPI), hub DEGs, and functional similarity analysis of DEGs. (**A**) Venn diagram of DEGs between high- and low-TMB groups and the prognosis-related genes with *p* value less than 0.05 in Cox univariate analysis obtained from TCGA. Middle part represents overlap of two groups of data. (**B**) Gene interaction network of 359 prognosis-related DEGs visualized with PPI network. (**C**) Interaction network of 15 hub DEGs scored by the MCC algorithm of cytoHubba; the darker the color, the higher the MCC score. (**D**) Functional similarities of 11 hub genes; the dashed line represents the cut-off value of similarity. DEGs: Differentially Expressed Genes; TCGA: The Cancer Genome Atlas; MCC: Maximal Clique Centrality.

#### *3.4. Functional Correlation Analysis*

A total of 359 differentially expressed genes related to prognosis in HCC samples were further subjected to GO analysis. The results suggested that in the biological process (BP) category, these prognosis-related differentially expressed genes were mainly correlated with RNA localization and the transport and export of components in the nucleus (Figure 4A). In order to explore the important pathway of enrichment between the two groups, the gene set enrichment analysis (GSEA) of gene expression profiles was used to identify differentially enriched signaling pathways between patients in high- and low-TMB groups. The results suggested that the enriched functions and pathways in the high-TMB group mainly involved fatty acid metabolism, hypoxia, and the P53 pathway (Figure 4B). The results of gene set variation analysis (GSVA) revealed that androgen response, coagulation, bile acid metabolism, angiogenesis, pancreas beta cells, fatty acid metabolism, TNFA signaling via NFKB, and adipogenesis were enriched in the high-TMB group (Figure 4C).

**Figure 4.** GO, GSEA, and GSVA analyses. (**A**) Significantly enriched gene ontology terms in categories BP. (**B**) GSEA analysis based on h.all.v7.1.symbols.gmt. (**C**) GSVA analysis based on h.all.v7.1.symbols.gmt. GO: Gene ontology; GSEA: gene set enrichment analysis; GSVA: gene set variation analysis.

#### *3.5. The mRNA Expression Level of NCBP2 in Hepatocellular Carcinoma*

To further explore the mRNA expression level of NCBP2 in hepatocellular carcinoma, we investigated the differential mRNA expression between HCC tumor samples and adjacent normal samples in the GEPIA2 (http://gepia2.cancer-pku.cn/) (accessed on 1 March 2021) and TIMER databases (https://cistrome.shinyapps.io/timer/) (accessed on 1 March 2021). The GEPIA-based analysis, computed in transcripts per million, indicated that NCBP2 was upregulated relative to adjacent tissues in 17 of 33 cancer types, including hepatocellular carcinoma (Figure 5A). In addition, the mRNA expression of NCBP2 was significantly different among different stages of HCC (F value = 0.53, Pr(>F) = 0.0014) (Figure 5B). Finally, we evaluated NCBP2 mRNA expression using the RNA-seq data in the TIMER database. The result also indicated that NCBP2 mRNA was overexpressed in hepatocellular carcinoma tissues compared with adjacent tissues. NCBP2 mRNA was also overexpressed in other cancer types, such as BLCA (bladder urothelial carcinoma), BRCA (breast invasive carcinoma), CHOL (cholangiocarcinoma), COAD (colon adenocarcinoma), ESCA (esophageal carcinoma), GBM (glioblastoma multiforme), HNSC (head and neck squamous cell carcinoma), KIRP (kidney renal papillary cell carcinoma), LUAD (lung adenocarcinoma), LUSC (lung squamous cell carcinoma), PRAD (prostate adenocarcinoma), READ (rectum adenocarcinoma), STAD (stomach adenocarcinoma), and UCEC (uterine corpus endometrial carcinoma), but was downregulated in KICH (kidney chromophobe) and KIRC (kidney renal clear cell carcinoma) (Figure 5C). In summary, all these results indicate that NCBP2 mRNA is significantly overexpressed in HCC.

**Figure 5.** NCBP2 expression levels in HCC. (**A**) Expression patterns of NCBP2 in 33 cancer types and paired non-tumor samples. (**B**) Violin plots reveal relationship between NCBP2 expression and LIHC staging. (**C**) Human NCBP2 expression levels in different tumor types determined by TIMER (\* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001). TIMER: Tumor Immune Estimation Resource.

#### *3.6. Correlations between the mRNA Expression Level of NCBP2 and Survival in HCC Patients*

To further investigate the relationship between the mRNA expression level of NCBP2 and survival in HCC patients, the Kaplan–Meier Plotter, which is based on transcriptome data mainly extracted from GEO, EGA, and TCGA, was used to assess NCBP2-related survival. As a result, we first identified NCBP2 as a detrimental prognostic factor in LIHC (Overall Survival (OS): HR = 1.86, 95% CI from 1.31 to 2.63, log-rank *p* = 4 × 10<sup>−4</sup>) (Figure 6A). Then, we further investigated the prognostic value of NCBP2 expression for pan-cancer in another database. The correlation between NCBP2 expression and the prognosis of each cancer was investigated, and the result suggested that NCBP2 expression was significantly related to the prognosis of six cancer types, including KICH, KIRP, LIHC, LUAD, PAAD, and PRAD (Figure 6B), and that the expression level of NCBP2 was negatively correlated with overall survival. Among those cancers, NCBP2 played a detrimental role in LIHC according to the GEPIA2 database (OS: total number = 364, HR = 1.9, log-rank *p* = 0.00026) (Figure 6C). In summary, we identified NCBP2 as a detrimental biomarker for the survival prognosis of HCC.

**Figure 6.** Kaplan–Meier survival curves comparing high and low expressions of NCBP2 in different databases. (**A**) Kaplan–Meier survival curves of LIHC in PrognoScan. (**B**) Relationship between NCBP2 expression and survival prognosis of each cancer in TCGA. (**C**) Kaplan–Meier survival curves of LIHC in Kaplan–Meier Plotter. Numbers at risk represent the number of patients still at risk of the outcome at each time point.

#### *3.7. Protein Expression Level of NCBP2 in Human Tissue and Cell Lines*

After investigating the mRNA expression pattern of NCBP2 in various databases, we further explored the protein expression pattern of NCBP2 in cell lines and human tissue, including tumor samples and normal adjacent specimens, in The Human Protein Atlas database (THPA). The results confirmed that the NCBP2 protein level was moderately lower in normal liver tissue than in other normal tissues (Figure 7A), and the immunohistochemical analysis demonstrated that NCBP2 was overexpressed in HCC tissue relative to the normal adjacent sample (Figure 7B). The expression level of NCBP2 in liver cancer cell lines was analyzed using the CCLE online platform, and the result showed that, among liver cancer cell lines, NCBP2 expression was highest in HEP3B cells and lowest in JHH6 cells (Figure 7C).

**Figure 7.** NCBP2 protein expression in human tissues and cell lines. (**A**) NCBP2 protein expression in normal human tissues based on The Human Protein Atlas (THPA). (**B**) NCBP2 expression assessed using immunohistochemistry in normal and liver cancer tissues. (**C**) NCBP2 gene expression profiles of 19 liver cancer cell lines based on Cancer Cell Line Encyclopedia (CCLE) database.

#### *3.8. Relationship between the NCBP2 Expression and TP53 Mutation with Immune Markers*

Immune infiltration is involved in hepatocellular carcinoma progression. Since NCBP2 expression was related to the prognosis of hepatocellular carcinoma, the relationship between 22 infiltrating immune cells and NCBP2 expression was investigated using the TIMER database. The results suggested that, after adjustments for tumor purity, NCBP2 expression was positively associated with all immune cells, including B cells (r = 0.364, *p* = 3.30 × 10<sup>−12</sup>), CD8+ T cells (r = 0.295, *p* = 2.71 × 10<sup>−8</sup>), CD4+ T cells (r = 0.484, *p* = 1.37 × 10<sup>−21</sup>), macrophages (r = 0.551, *p* = 1.97 × 10<sup>−28</sup>), neutrophils (r = 0.457, *p* = 3.26 × 10<sup>−19</sup>), and dendritic cells (r = 0.453, *p* = 1.97 × 10<sup>−18</sup>) (Figure 8A). Intriguingly, we also found that the expression of NCBP2 was positively associated with TP53 (Figure 8B). Because the prognosis of hepatocellular carcinoma is related to genetic mutations, among which TP53 is of primary concern, we further investigated the relationship between the TP53 mutation and immune infiltration. The results showed that B cell and macrophage infiltration was significantly higher in TP53-mutant samples than in wild-type samples; however, the remaining immune cells, including CD8<sup>+</sup> T cells, CD4+ T cells, neutrophils, and dendritic cells, showed no statistically significant difference by TP53 status (Figure 8C). We further analyzed the relationship between NCBP2 expression and macrophage and CD4+ T cell infiltration levels in diverse cancer types using the TIMER 2.0 database. The results indicated that NCBP2 expression was positively correlated with the immune infiltration levels of macrophages (Figure 9A) and CD4+ T cells (Figure 9B) across most tumor types, with the highest correlation shown in LIHC.
Univariate and multivariate Cox regression also showed that the stage of HCC, CD8+ T cells, and the expression of NCBP2 were independent indicators for predicting OS in HCC patients (Table 1).

**Figure 8.** Correlation of NCBP2 expression and TP53 mutation with immune infiltration levels in LIHC. (**A**) Relationship of NCBP2 expression with immune infiltration. (**B**) Relationship of NCBP2 expression with TP53. (**C**) Correlation between TP53 mutation and immune infiltration.

**Figure 9.** Relationship of NCBP2 expression with immune infiltration level in diverse cancer types (TIMER 2.0). (**A**) Macrophage immune infiltration level. (**B**) CD4<sup>+</sup> T-cell immune infiltration level.



\* *p* < 0.05; \*\* *p* < 0.01; CI: Confidence Interval; HR: Hazard Ratio.

#### *3.9. Immune Cell Infiltration Analysis in LIHC*

Finally, we evaluated the infiltration of immune cells in LIHC. The correlation heatmap of the 22 types of immune cells revealed that T cells had a significant positive correlation with cytotoxic cells and type 1 T-helper cells (Th1), and that macrophages and immature dendritic cells (iDC) were also positively correlated. Type 2 T-helper cells (Th2) had a significant negative correlation with dendritic cells (DCs) and neutrophils, and T-helper cells also had a negative correlation with DCs (Figure 10A). The immune cell interaction network suggested that neutrophils, T cells, and follicular helper T cells (TFH) have strong relationships with other immune cells, but that regulatory T cells (TReg) and plasmacytoid dendritic cells (pDC) have weak relationships with other immune cells (Figure 10B). The violin plot of the immune cell infiltration results revealed that the degree of central memory T-cell (Tcm) infiltration was higher in the samples with a high TP53 mutation frequency than in those with a low TP53 mutation frequency (*p* < 0.05) (Figure 10C).

**Figure 10.** Correlation plots of immune cell infiltration analysis in LIHC. (**A**) Correlation heat map of 22 immune cells. Blue indicates positive correlation, red indicates negative correlation. Size of colored squares indicates strength of correlation. (**B**) Network diagram of 24 immune cell types. The circle size indicates the strength of interaction. (**C**) Violin diagram shows the difference of 24 types of immune cell infiltration in high mutation frequency of TP53 versus low mutation frequency of TP53.

#### **4. Discussion**

HCC is one of the most common malignant tumors. According to Global Cancer Statistics 2020, there were about 906,000 new cases of HCC worldwide in 2020, causing about 830,000 deaths [2]. The main risk factors for HCC are chronic infection with the hepatitis B (HBV) or C virus (HCV), alcoholic cirrhosis, aflatoxin-contaminated foods, and excess body weight [18,19]. Due to early detection and systemic therapy combining surgery with adjuvant chemotherapy, targeted treatment, or immunotherapy, the mortality rate of HCC has declined in the last three decades [4]. However, the 5-year survival rate of patients with advanced HCC is still low, mainly because of tumor progression [20]. Therefore, it is important to understand the molecular mechanisms underlying HCC to identify an effective target for prevention and treatment. Recent studies have focused on the relationships between HCC, TMBs, and immunity and have confirmed that HCC with a high tumor mutation burden (TMB-H) may generate immunogenic neoantigens. The increased production of neoantigens is positively related to the infiltration of immune cells, especially macrophages, CD4+ T cells, and central memory T cells [21,22]. Changes in immune cell infiltration are the basis for a good response to immunotherapy [23]. However, none of these findings have been applied clinically; therefore, we used bioinformatics tools to analyze HCC-associated TMBs and to identify potential immune biomarkers for the diagnosis and prognosis of HCC.

In the present study, we performed a comprehensive biological analysis of the relationship between tumor somatic mutational profiles and immunity in HCC. To understand the functions and associations of the TMB-associated DEGs, GO analyses were performed. The result showed that the DEGs are mainly enriched in nucleocytoplasmic and nuclear transport, and previous studies have confirmed that nucleocytoplasmic and nuclear transport are closely associated with tumorigenesis [24,25]. Further studies have confirmed that nucleocytoplasmic and nuclear transport are closely related to HCC metastasis [26]. These studies suggest that the DEGs of TMBs may be closely correlated with the metastasis of HCC. By constructing a PPI network, we found that *USP39*, *RBM22*, *SNRPD1*, *CPSF3*, *SRSF1*, *SRSF3*, *HSPA8*, *HNRNPU*, *SRSF4*, *CWC27*, *EFTUD2*, *ALYREF*, *NCBP2*, *SNRPA1*, and *POLR2D* may play pivotal roles in the development of HCC. No studies have investigated the relationship between HCC and *RBM22*, *SRSF4*, *CWC27*, or *POLR2D*, which offers a new research direction. In addition, we used GO annotation semantics to investigate the functional similarity of key DEGs, and a strong biological functional similarity was found between *SRSF1*, *SNRPA1*, *SRSF3*, *SRSF4*, and *ALYREF*. *SNRPA1* was reported to promote HCC proliferation by activating the mTOR-signaling pathway [27], and the phosphorylation of *SRSF3* by *PPM1G* could result in the proliferation, invasion, and metastasis of HCC [28]; furthermore, *ALYREF* was significantly correlated with both advanced tumor-node-metastasis stages and poor HCC prognosis [29], which is similar to our results. However, we did not find any reports on the effects of *NCBP2* in HCC. *NCBP2* may therefore help to identify new immunotherapy targets in HCC and is worth further investigation in future studies.
In addition, the pathways enriched by GSEA mainly involved fatty acid metabolism, hypoxia, and the P53 pathway. Fatty acid metabolism was reported to be correlated with the progression of HCC and to influence the infiltration of immune cells [30]. Both hypoxia and the mutation of P53 were also reported to lead to the metastasis of HCC [31,32]. These studies are consistent with our results, which supports the conclusions of the present study.

*NCBP2*, also known as *CBP20* or *NIP1*, can bind to the monomethylated 5′ cap of nascent pre-mRNA. *NCBP2* has an RNP domain usually found in RNA-binding proteins and contains the cap-binding activity [33,34]. It has been reported that *NCBP2* regulates proliferation, metastasis, and apoptosis in multiple cancers [35,36], and accumulating evidence suggests that *NCBP2* may serve as a biomarker for carcinogenesis and cancer progression. For example, NCBP2 was upregulated in childhood acute lymphoblastic leukemia (ALL) patients with rearrangement (r ALL) compared with non-r ALL patients. Childhood ALL patients with high expressions of NCBP2 had significantly poorer overall survival rates [37]. The latest study revealed that NCBP2 was overexpressed in the high-risk group of acute myeloid leukemia (AML) and was negatively correlated with survival [38]. In the present study, the results showed that NCBP2 was upregulated in multiple cancers and played a detrimental role in LIHC, and NCBP2 expression was significantly related to another five cancers, including KICH, KIRP, LUAD, PAAD, and PRAD, and was negatively correlated with the overall survival of those cancers. Moreover, the present study revealed that the expression of NCBP2 was significantly upregulated in HCC compared with adjacent liver tissues according to the Human Protein Atlas database, and NCBP2 played a detrimental role in the OS of HCC patients. The antisense gene protein NCBP2-AS2 (transcribed from the antisense DNA strand of the gene NCBP2) also plays an important role in multiple tumors. A study has revealed that NCBP2-AS2 was overexpressed in hypoxic-cancer-associated fibroblasts, and it can promote the secretion of pro-angiogenic factor VEGFA, consequently reducing VEGF/VEGFR downstream signaling, which leads to tumor metastasis and reduces the efficacy of therapy [39]. Furthermore, LncRNA NCBP2-AS2 was upregulated in lung squamous cell carcinoma samples compared with lung adenocarcinoma samples and adjacent tissues; it promoted the proliferation, metastasis, and invasion of SCC cells and inhibited their apoptosis via the TAp63/ZEB1-regulating pathway [40]. LncRNA NCBP2-AS2 could also promote HCC cell growth and proliferation by regulating KRASIM [41]. In conclusion, NCBP2 is overexpressed in multiple cancers compared with adjacent normal tissues, and high expressions of NCBP2 were significantly correlated with poor OS in HCC. However, further research is needed to establish the diagnostic accuracy and therapeutic value of NCBP2 in liver cancer.

To further investigate the role of immune cell infiltration in HCC, we performed a TIMER database analysis, which revealed that NCBP2 expression was most positively correlated with macrophages (r = 0.551, *p* = 1.97 × 10<sup>−28</sup>) and CD4<sup>+</sup> T cells (r = 0.484, *p* = 1.37 × 10<sup>−21</sup>). Studies have demonstrated that tumor-associated macrophages (TAMs) infiltrate HCC at a high level and that targeting TAM infiltration results in tumor growth inhibition in a mouse HCC model [42,43]. Higher infiltrating fractions of activated memory CD4<sup>+</sup> T cells were also found in high-risk groups of HCC patients [44,45]. These results showed that the expression level of NCBP2 may be associated with the immune response to the tumor microenvironment of HCC, especially with CD4<sup>+</sup> T cells and macrophages. In addition, our study investigated the details of 22 types of immune cell infiltrations in HCC, and the results showed that T cells were closely related to follicular helper T cells (TFH), whereas regulatory T cells (TReg) showed the weakest interactions with plasmacytoid dendritic cells (pDC). These findings provide ideas for further investigations into the regulation mechanisms of immune cells in HCC, for which no research currently exists. The degree of central memory T cell (Tcm) infiltration was higher in the high-mutation-frequency TP53 samples. Accumulating research has demonstrated that the infiltration of Tcm may help to discover novel treatments for more effective cancer immunotherapies [46,47]. Tcm are functionally and phenotypically distinct monitoring points in the liver, capable of long-lived retention, and well positioned for rapid and potent front-line immunosurveillance [48]. The above studies, combined with our research, have shown that immune cells, especially CD4<sup>+</sup> T cells, macrophages, and central memory T cells, play important roles in HCC and should be the focus of further studies.

In summary, comprehensive bioinformatic analyses were performed to analyze the predictive value of TMB in HCC prognosis and identified that the expression of NCBP2 was strongly correlated to HCC prognosis. Moreover, immune cell infiltration investigations also suggested that immune cells, especially CD4+ T cells, macrophages, and central memory T cells, play important roles in HCC. It is noteworthy that the systematic analysis of TMB-status hub genes in the present study will facilitate an understanding of the role played by TMBs in HCC and contribute to accurate immunotherapeutic treatment. Our findings may serve as a potential guide for targeted immunotherapy and provide ideas for the further development of new immunotherapies. Notwithstanding, more clinical studies and experimental research are needed to verify our findings and explore the molecular mechanisms of TMBs in HCC.

**Author Contributions:** Conceptualization, P.H.; data curation, T.H.; formal analysis, Y.Z.; funding acquisition, P.H.; investigation, Y.Z.; methodology, T.H.; project administration, P.H.; software, Y.Z.; supervision, P.H.; validation, T.H.; writing—original draft, Y.Z.; writing—review and editing, T.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The TCGA-LIHC dataset is available at The Cancer Genome Atlas (https://cancergenome.nih.gov/).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Clinically Applicable Pathological Diagnosis System for Cell Clumps in Endometrial Cancer Screening via Deep Convolutional Neural Networks**

**Qing Li 1,2,†, Ruijie Wang 3,†, Zhonglin Xie 3, Lanbo Zhao 1, Yiran Wang 1, Chao Sun 1, Lu Han 1, Yu Liu 4, Huilian Hou 4, Chen Liu 2, Guanjun Zhang 4, Guizhi Shi 5, Dexing Zhong 3,6,7,\* and Qiling Li 1,2,\***


**Simple Summary:** The soaring demand for endometrial cancer screening has exposed a huge shortage of cytopathologists worldwide. Deep learning algorithms, based on convolutional neural networks, have been successfully applied to the classification and segmentation of medical images. The aim was to establish an artificial intelligence system that automatically recognizes and diagnoses pathological images of endometrial cell clumps (ECCs). A total of 39,000 ECC patches (26,880 for training, 11,520 for testing, and 600 malignant for verification) were obtained by the segmentation network. The training set reached 100% accuracy, and the testing set achieved 93.5% accuracy, 92.2% specificity, and 92.0% sensitivity. Therefore, an artificial intelligence system was successfully built to classify malignant and benign ECCs, reducing pathologists' workload, providing decision-making assistance, and promoting the development of endometrial cancer screening.

**Abstract:** Objectives: The soaring demand for endometrial cancer screening has exposed a huge shortage of cytopathologists worldwide. To address this problem, our study set out to establish an artificial intelligence system that automatically recognizes and diagnoses pathological images of endometrial cell clumps (ECCs). Methods: We used Li Brush to acquire endometrial cells from patients. Liquid-based cytology technology was used to provide slides. The slides were scanned and divided into malignant and benign groups. We proposed two (a U-net segmentation and a DenseNet classification) networks to identify images. Another four classification networks were used for comparison tests. Results: A total of 113 (42 malignant and 71 benign) endometrial samples were collected, and a dataset containing 15,913 images was constructed. A total of 39,000 ECCs patches were obtained by the segmentation network. Then, 26,880 and 11,520 patches were used for training and testing, respectively. On the premise that the training set reached 100%, the testing set gained 93.5% accuracy, 92.2% specificity, and 92.0% sensitivity. The remaining 600 malignant patches were used for verification. Conclusions: An artificial intelligence system was successfully built to classify malignant and benign ECCs.

**Keywords:** endometrial cancer; deep learning; screening; pathological diagnosis system; cell clumps

**Citation:** Li, Q.; Wang, R.; Xie, Z.; Zhao, L.; Wang, Y.; Sun, C.; Han, L.; Liu, Y.; Hou, H.; Liu, C.; et al. Clinically Applicable Pathological Diagnosis System for Cell Clumps in Endometrial Cancer Screening via Deep Convolutional Neural Networks. *Cancers* **2022**, *14*, 4109. https://doi.org/10.3390/cancers14174109

Academic Editor: David Wong

Received: 26 July 2022 Accepted: 22 August 2022 Published: 25 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Endometrial cancer (EC) has become the second most common malignant tumor in the female reproductive system, with about 378,400 new cases in 2018 worldwide [1]. With increasing life expectancy and altered living habits, the incidence of EC is on the rise, and patients tend to be younger [2,3]. The 5-year survival rate with appropriate treatment is more than 85% for localized, 49% to 71% for regional, and less than 17% for distant stages of EC [4]. Women exposed to high risks have been recommended to be screened. Screening for EC and precancerous changes has been strongly suggested for early diagnosis and to reduce morbidity and mortality [5].

Researchers on the early detection of EC focus on minimally invasive histopathologic and cytopathologic procedures [6]. An endometrial cytologic test (ECT) has been carried out in many countries, including Italy, the United States, and Japan. ECT was added into Japanese Law on health care for the elderly in 1987. The mortality from EC among Japanese high-risk women fell from 20% in 1950 to 8% in 1999 [7]. In the past 20 years, academics from different regions have put forward the invention and improvement of endometrial samplers and have recommended diagnosis systems for endometrial cytopathology [8–10]. Confirmed by diagnostic curettage, the sensitivity, specificity, and coincidence rate of a well-designed endometrial sampling device, Li Brush, were 92.73%, 98.15%, and 92.73%, respectively [11]. On the other hand, a large number of endometrial cytopathological slides need to be identified, which exposes the lack of pathologists.

With the development of artificial intelligence (AI) technology and the improvement of hardware computing power in recent years, deep learning (DL) in medical analysis is considered as a third eye for doctors [12]. DL algorithms, based on deep convolutional neural networks (CNNs), have been proven to strongly boost the development of biomedical image analysis [13,14]. CNNs are becoming a reference tool for pathologists and have been successfully applied to the classification and segmentation of medical images, reducing the workload of pathologists and providing decision-making assistance [15–17].

AI has been successfully used in recognizing pathologic images and identifying malignant and benign tumors. However, there are relatively few studies on EC recognition. In one study, a computer-aided morphology program was established to distinguish benign and malignant cells. Geometric and densitometric nuclear features were measured for analysis. However, the typical three-dimensional shape (crowded and overlapping nuclei) of the endometrium increased miscalculation [18]. In another experiment, an endometrial histopathological AI recognition system was built, though it had a relatively high false-negative rate because a few subtle features were undetectable at the cellular level [19]. Inspired by these studies, we developed a recognition system based on CNNs to automatically identify benign and malignant endometrial cell clumps (ECCs). We aimed to overcome the shortcomings of the two studies above by analyzing the structure and cytological characteristics of the cell clumps.

#### **2. Materials and Methods**

#### *2.1. Ethics Statement and Patients*

The patients, who underwent curettage or hysterectomy, were recruited in the First Affiliated Hospital of Xi'an Jiaotong University from July 2015 to July 2020. This study was approved by the Ethics Committee of the First Affiliated Hospital of Xi'an Jiaotong University (XJTU1AHCR2014-007), and all patients signed written informed consent. The protocols were in compliance with the ethical principles for research that involves human subjects of the Helsinki Declaration for medical research [20].

Patients were excluded if they had suspected or confirmed pregnancy, acute inflammation of the reproductive system, cervical cancer, or dysfunctional clotting diseases. Women with a body temperature of 37.5 °C or higher, measured twice a day, were also excluded.

#### *2.2. Preparation of Pathological Slides*

We chose Li Brush (20152660054, Xi'an Meijiajia Medical Technology Co., Ltd., China) for endometrial cytological sampling (Figure 1a). Liquid-based cytology combined with Hematoxylin and Eosin staining was used for pathological slides of endometrial cells. The sampling, pathological slide, and staining procedures were described by Lu Han et al. [11]. Based on the endometrial cytological diagnostic criteria proposed by Chinese Expert Consensus [21], two experienced pathological professors (H.H. and G.S., with over 20 years of endometrial cytopathology experience) labeled all cytopathological slides and divided them into two classes: malignant (atypical cells of undetermined significance, suspected malignant tumor cells, and malignant tumor cells), and benign (non-malignant tumor cells). Slides with fewer than 10 or 5 ECCs were judged to be "unsatisfactory for evaluation" for premenopausal or postmenopausal women, respectively. Only a few isolated atypical or cancerous cells present were considered as satisfactory [22]. Histopathological diagnosis, acquired from the endometrium by curettage or hysterectomy, was regarded as the gold standard. Normal endometrium and endometrial hyperplasia without atypia were considered as benign; endometrial atypical hyperplasia and endometrial cancer were malignant. Only when consistent classification was reached between histology and the two pathologists' cytology on a sample was the sample considered for the study. Otherwise, it was suspended [22].

**Figure 1.** The process of obtaining images and recognition. (**a**) Sampling procedure; (**b**) cytological slides diagnosis; (**c**) classification using endometrial cytological images feature.

#### *2.3. Cytopathological Image Acquisition*

We used a MOTIC digital biopsy scanner (EasyScan 60, 20192220065, Motic, Xiamen, China) to scan cytopathological slides (Figure 1b), using a lens with 200 times magnification (20×) to obtain whole slide images. A counterclockwise spiral scan was performed with a camera exposure time of 0.65 s per slide and automatic focal adjustment. Each scanned slide image was segmented into 1360 small images (1816 × 1519 pixels) (Figure 1b).

#### *2.4. ECCs Image Annotation*

Adobe Photoshop CC (2019 v20.0.2.30, Adobe Inc., San Jose, CA, USA) was engaged to sketch the edge of the ECCs. There is no doubt that ECCs from negative slides were all negative, but some ECCs were negative in positive slides. Thus, the two pathologists voted on the labels of each ECC again; when discordant voting results happened, they would have a discussion. If the discussion failed to conclude with an accurate diagnosis, the ECC was discarded. A benign diagnosis was defined as cell clumps with neat edges, nuclei with oval or spindle shape, and evenly distributed, finely granular chromatin [23,24]. Malignant diagnosis referred to a three-dimensional appearance, irregular (including dilated, branched, protruding, and papillotubular) edge, with the nucleus poloidal disordering or disappearing (including megakaryocyte appearance, nuclear membrane thickness, and coarse granular or coarse block chromatin) [23,25].

#### *2.5. Segmentation Networks*

The U-Net with skip connections was selected to eliminate the interference of neutrophils and single cells, facilitating ECC extraction from each image. Figure 2 shows the U-Net architecture, which is based on fully convolutional networks. The U-Net architecture combined a down-sampling path to capture context and an up-sampling path to achieve precise localization. We calculated the probability that each pixel belonged to the cell clumps and normalized it. Each detected cell clump region was automatically marked as a region of interest (ROI). A total of 1000 images and their corresponding masks marked by pathologists were randomly selected for training. To evaluate the U-Net, we used the Dice coefficient (an index of image segmentation accuracy).

**Figure 2.** Segmentation network. The blue box represents the feature map. The yellow arrow represents 3 × 3 convolution and striding of 1 used for feature extraction; we set the padding as 1 to ensure that the size of the convolutional image at the same steps was stable. The gray arrow indicates skip-connection, which is used for feature fusion, and pure up-sampling will cause the loss of information. The red arrow indicates the 2 × 2 maximum pooling, which is used to reduce the dimensionality. The green arrow indicates up-sampling, which is used to restore the dimension. The cyan arrow indicates the convolution plus activation function, which is used to output the result.

$$\text{Dice} = \frac{2|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A}| + |\mathbf{B}|}$$

The Dice coefficient is computed at the pixel level; A represents the area where the real target appears, and B signifies the predicted target area (Figure 3a). The segmented mask often has small holes and residues (Figure 3b). We used morphological operations (erosion followed by dilation) to remove these small artifacts. The ROI set was then input into a subsequent neural network for endometrial cytopathological screening.
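As an illustration of the pixel-level Dice coefficient defined above, the following is a minimal sketch in Python with NumPy, not the authors' implementation. The function name `dice_coefficient`, the toy masks, and the convention that two empty masks score 1.0 are assumptions made for the example.

```python
import numpy as np


def dice_coefficient(truth, pred):
    """Pixel-level Dice score between two binary masks of equal shape."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    if truth.shape != pred.shape:
        raise ValueError("masks must have the same shape")
    total = truth.sum() + pred.sum()          # |A| + |B|
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    overlap = np.logical_and(truth, pred).sum()  # |A ∩ B|
    return 2.0 * overlap / total


# Toy 2 x 3 masks (hypothetical): A is the annotated mask, B the predicted one.
truth = np.array([[1, 1, 0],
                  [0, 1, 0]])
pred = np.array([[1, 0, 0],
                 [0, 1, 1]])
score = dice_coefficient(truth, pred)
print(f"Dice = {score:.3f}")
```

Here both masks contain three foreground pixels and share two of them, so the score is 2 · 2 / (3 + 3).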

**Figure 3.** The effect of segmentation. (**a**) Variation of segmentation accuracy with training epochs. Compared with the ground truth (mask was manually marked by the physician), the red areas were not predicted in the mask of the model training; compared with the ground truth, the green areas represent other predicted areas in the mask of model training. (**b**) The process of ECC acquisition.

#### *2.6. Data Preprocessing*

We input the cytopathologic images into a trained U-Net to obtain the patch set of cell clumps. The segmentation results were first obtained by the U-Net, and background images (free single cells and white cells) were removed. Then, we extracted all the cell clumps using the minimum outer rectangle. The size of all cell clusters was uniformly resized to 256 × 256 by filling the surrounding area with pixels of value 0.
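The cropping and zero-padding step described above can be sketched as follows. This is a hedged illustration only: the helper name `crop_and_pad`, the choice to center the clump in the output, and the decision to reject clumps larger than the target size are assumptions that the text does not specify.

```python
import numpy as np


def crop_and_pad(image, mask, size=256):
    """Crop an RGB image to the bounding rectangle of a binary mask,
    zero out background pixels, and zero-pad the result to size x size."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no clump pixels")
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1

    # Remove the background, then keep only the bounding rectangle.
    clump = np.where(mask[..., None], image, 0)[top:bottom, left:right]
    h, w = clump.shape[:2]
    if h > size or w > size:
        # Assumption: oversized clumps are not handled in this sketch.
        raise ValueError("clump is larger than the target patch size")

    # Assumption: the clump is centered; surrounding pixels stay 0.
    out = np.zeros((size, size, image.shape[2]), dtype=image.dtype)
    y0, x0 = (size - h) // 2, (size - w) // 2
    out[y0:y0 + h, x0:x0 + w] = clump
    return out


# Hypothetical input: a uniform image with a 100 x 160 rectangular "clump".
image = np.full((300, 400, 3), 200, dtype=np.uint8)
mask = np.zeros((300, 400), dtype=bool)
mask[50:150, 100:260] = True
patch = crop_and_pad(image, mask)
print(patch.shape)
```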

#### *2.7. Classification Network*

The CNNs were used to capture the characteristics of ECCs: nuclear heterogeneity, nuclear size, ratio between nucleus and plasma, chromatin homogeneity, cell polarity, isolation and aggregation of cell clumps, regularity of the cell clump's edge, etc. We constructed a DL model with DenseNet201 as the backbone to classify malignant and benign cell clumps. The training set was annotated by two cytopathologists. The final fully connected layer of DenseNet201 was replaced by a global average pooling layer followed by a single fully connected layer. The specific architecture is shown in Figure 4, and the output results were classified into two categories (Figure 1c). The classification network was pre-trained on ImageNet. Several comparative experiments were carried out to find the best patch input size and number of training epochs. The number of epochs was set to 50, 100, 150, and 500 in the training process. The results showed that the network converged at 100 epochs, and a longer training time was not necessary (Figure 5a). We changed the size of the input patch to 32 × 32, 64 × 64, 128 × 128, and 256 × 256, respectively (Figure 5b). When the input patch size was 256 × 256, the best result was achieved.

**Figure 4.** The recognition network architecture for classifying endometrial cell clusters. The size of the input image is 256 × 256, and each 3 × 3 convolution is preceded by a 1 × 1 convolution operation.

**Figure 5.** The performance of our model and four other common DL models on the same validation set. (**a**) Description of the AUC corresponding to the network with different numbers of iterations. (**b**) Description of the AUC corresponding to the network with different image input sizes. (**c**) The confusion matrix of different networks under the same hyperparameter conditions. The horizontal axis was a true label, the vertical axis was the predicted label, and the lower false-negative rate was preferred. (**d**) The ROC curves of different models. (**e**) The precision, accuracy, sensitivity, and specificity of different models.

#### *2.8. Network Evaluation*

We conducted comparative experiments on four CNNs (VGG16, InceptionV3, ResNet, and DenseNet) and one Support Vector Machine (SVM). The hyperparameters, all kept consistent, were as follows: loss function (binary cross-entropy), initial learning rate (0.0001), learning rate decay (0.5), batch size (8), and the Adam optimizer. In addition, the SVM classifier used a radial basis function kernel with parameters of 0.0078 and 2. DenseNet gained the best result due to its advantage of feature-map skip connections (Figure 5c–e).

All experiments were performed on a personal computer equipped with a GeForce GTX2080 super (NVIDIA) graphics processing unit. Python 3.6.12 (Python Software Foundation, Wilmington, DE, USA) with Keras 2.4.3 (Google Brain, Mountain View, CA, USA) and TensorFlow 2.2.0 (Google Brain, Mountain View, CA, USA) was used to train the neural networks.

#### *2.9. Statistical Analysis*

The following indexes were calculated by the four-lattice paired hypothesis test for statistical analysis: accuracy (Acc), sensitivity (Se), and specificity (Sp). The confusion matrix and receiver operating characteristic (ROC) curve were used to visualize the classification effect. The definition criteria were as follows:

$$\text{Acc} = \frac{(\text{TP} + \text{TN})}{(\text{TP} + \text{FP} + \text{TN} + \text{FN})}$$

$$\text{Se} = \frac{\text{TP}}{(\text{TP} + \text{FN})}$$

$$\text{Sp} = \frac{\text{TN}}{(\text{TN} + \text{FP})}$$
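The three definitions above translate directly into code. The sketch below is illustrative only; the function name `classification_metrics` and the confusion-matrix counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("each class needs at least one sample")
    return {
        "acc": (tp + tn) / total,  # Acc = (TP + TN) / (TP + FP + TN + FN)
        "se": tp / (tp + fn),      # Se  = TP / (TP + FN)
        "sp": tn / (tn + fp),      # Sp  = TN / (TN + FP)
    }


# Hypothetical counts: 100 malignant and 100 benign patches.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```

With these counts, 170 of 200 patches are classified correctly, 90 of 100 malignant patches are detected, and 80 of 100 benign patches are correctly rejected.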

#### *2.10. Plots and Charts*

All the drawings were performed using the matplotlib package in Python and Matlab. The ROC curve of model performance was shown with specificity being the X axis and sensitivity being the Y axis. We used a bar chart to show the predictions from different CNNs and SVM. Line graphs were drawn to illustrate the results and compare performance between different groups.
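For readers who want to reproduce an ROC curve and its AUC from classifier scores, a minimal NumPy sketch follows. It is not the authors' plotting code. It assumes the common convention of plotting sensitivity against 1 − specificity (the false-positive rate), assumes all scores are distinct, and uses hypothetical labels and scores; plotting itself is omitted.

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores.

    Assumption: scores are distinct, so each sample defines its own threshold.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if labels.all() or not labels.any():
        raise ValueError("both classes must be present")
    order = np.argsort(-scores, kind="stable")  # highest score first
    labels = labels[order]
    tps = np.cumsum(labels)     # true positives at each threshold
    fps = np.cumsum(~labels)    # false positives at each threshold
    tpr = np.concatenate([[0.0], tps / labels.sum()])      # sensitivity
    fpr = np.concatenate([[0.0], fps / (~labels).sum()])   # 1 - specificity
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Hypothetical labels (1 = malignant) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
auc = auc_trapezoid(fpr, tpr)
print(f"AUC = {auc:.3f}")
```

In this toy case, eight of the nine malignant-benign pairs are ranked correctly, which matches the trapezoidal area of 8/9.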

#### **3. Results**

#### *3.1. Baseline Characteristics*

A total of 113 patients who met the criteria were enrolled for final analysis, among which 42 were malignant and 71 were benign. Table 1 lists the demographic data of these patients.

#### *3.2. Dataset*

A total of 15,913 annotated cell clump images were segmented on ×20 magnification digital slides. The average image size was 1816 × 1519 pixels by width and height. We used a trained U-Net to extract ECC patches from the 15,913 images and obtained 39,000 ECC patches. Divided in 7:3, 26,880 and 11,520 patches were used for training and testing. The remaining 300 benign patches and 300 malignant patches were included in a verification set.
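The patch counts above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones reported in the text.

```python
total_patches = 39_000
verification = 300 + 300                   # benign + malignant verification patches
remaining = total_patches - verification   # patches left for training and testing

# 7:3 split of the remaining patches.
train = remaining * 7 // 10
test = remaining - train

print(remaining, train, test)
```

The remaining 38,400 patches split 7:3 into the 26,880 training and 11,520 testing patches stated above.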

#### *3.3. Verification Set and Test Set*

The prediction results of ECC patches were completely in accordance with the labels given by the pathologists. We randomly exhibit the results of eight (three malignant and five benign) validation patches (Figure 6A).


**Figure 6.** Presentation of true and false results. (**A**) A 100% consistency of results was achieved in the training set. Patches (**a**–**c**) showed the true positive, and patches (**d**–**h**) showed the true negative. (**B**) Analysis of false results in test set. The two false-positive (over diagnosis) patches (**a**,**b**) are exhibited. The six false-negative patches included one well-differentiated endometrial adenocarcinoma (**c**), three atypical hyperplasia (**d**–**f**), and two poorly differentiated adenocarcinomas (**g**,**h**).

In the test set, the accuracy and specificity of the classifier were 93.5% and 92.2%, respectively. The DenseNet achieved a 95.1% area under the curve score (AUC). In addition, we compared the results with four other common classification models (Figure 5c–e).

#### *3.4. False Results*

DenseNet obtained a 5% false-positive rate and an 8% false-negative rate in the test set (Figure 5c). We randomly listed eight common failure patches in the test set. The six false-negative (missed diagnosis) patches included one well-differentiated endometrial adenocarcinoma, three endometrial atypical hyperplasia, and two poorly differentiated endometrial adenocarcinomas. In addition, two over-diagnoses occurred (Figure 6B).

#### *3.5. Data Supporting*

The results of this study are available from the corresponding authors (Qiling Li and Dexing Zhong). Because of hospital policy, the data cannot be made public.

#### **4. Discussion**

#### *Principal Findings*

For the first time, we introduced two neural networks based on deep convolution, namely U-Net and DenseNet, to segment ECC images and recognize patches, respectively. The DenseNet achieved 93.5% accuracy and 92.2% specificity. At the same time, this system was developed for screening, and the sensitivity of our algorithm was better than that of all the comparison ones, reaching 92.0%. The results indicated that the neural network has great feasibility and potentiality in endometrial pathological image recognition.

#### **5. Results**

It is well known that a large amount of labeled data is often required to train a high-quality machine learning classifier through DL to complete a specific cancer classification task [26]. Due to the high amount of time and effort required for image annotation work, as well as the protection of patients' privacy, there are currently few endometrial image datasets available to the public. Despite the limited dataset, our classifier performed well in the 10-fold cross-validation and in the external validation of 15,913 images.

#### *5.1. Clinical Implications*

At the beginning of the experiment, we considered that DL was able to automatically learn cancer's information from pathological images [27]. We put the unlabeled benign and malignant images into the network for recognition and obtained 40–70% specificity (data not shown) from multiple networks, proving the method to be a failure. ECCs are quite different from non-cellular clumps in ecological appearance, cell morphological structure, and other pathological characteristics. U-net combines low-resolution information (to provide the basis for object category recognition) and high-resolution information (to provide the basis for precise segmentation and positioning), which is perfectly suitable for medical image segmentation. Combined with the pathological features in patients with ECCs, we chose the U-Net as the segmentation network to analyze and calculate the probability that each pixel belonged to the cell clumps. The detected cell clump images were automatically marked as ROI areas. The obtained ROI set was processed by a traditional image-processing algorithm to eliminate small holes. The ROI set was input into a subsequent neural network for cytopathological screening of the endometrium. We built a DL model with DenseNet201 as the backbone. The DL model was trained by the dataset annotated by cytopathologists, and the model was built to classify malignant and benign cell clumps. It turned out that our model alleviated the vanishing gradient problem, strengthened feature propagation, encouraged feature reuse, and outperformed ResNet50 with the same number of parameters. In order to compare the prediction performance of various DL algorithms on an experimental dataset, four commonly used CNNs were used to train different classifiers, namely VGG16, Inception-v3, ResNet, and DenseNet. The SVM classifier, which used features extracted by the CNN as input, had a better performance than the end-to-end CNN classifier [28,29]. 
Therefore, on the basis of previous experiments, the DenseNet had the best performance in extracting sample features to train the SVM classifier. In addition, a set of comparative experiments was also performed with the traditional PCA + SVM machine learning method.

The results of the test set showed that the false-negative rate was twice as high as the false-positive rate. We analyzed all the missed and over-diagnosis images and randomly selected eight patches to illustrate the common error that occurred. There were two false positives: one patch of secretory phase endometrium and one patch of complex hyperplasia. One reason for this was that the endometrial cells were clustered and seriously overlapped. It was difficult to distinguish well-differentiated EC from the proliferative endometrium, and it was difficult to distinguish complex hyperplasia from atypical hyperplasia. Another reason was that the dysplasia coincidence rate between the cytological and histological pathological diagnosis was relatively low, which was 56% in some studies [30]. This was the main reason for their miscalculation.

#### *5.2. Research Implications*

Due to the development of liquid-based cytology and endometrial cell sampling in recent years, ECT has been gradually accepted as a simple, rapid, and economical endometrial screening method [31]. Moreover, AI can be applied to the pathological recognition of endometrial cells to promote screening. AI works steadily and indefatigably, and can quickly screen out suspicious malignant results, allowing pathologists to focus on the malignant results and improve the accuracy and efficiency of diagnosis [32].

#### *5.3. Strengths and Limitations*

This study had some limitations. First, although our images were labeled in a randomized and blind way and histological diagnosis was used as the control, the two pathologists' diagnoses were still somewhat subjective. We hope that more recommendations from pathologists in different treatment centers will be included in follow-up studies regarding the proposed diagnostic system. Second, liquid-based endometrial cytological smears were used in our diagnostic system. At present, cell block technology can prepare slides with cell clumps and micro tissues, which is expected to further refine the diagnostic results and provide better diagnosis and treatment suggestions for clinical work [33]. In future research, we will focus on improving the performance of the classifier by training it with more samples, with the aim of subdividing endometrial pathological types.

#### **6. Conclusions**

This study confirmed that the recognition of DL has similar specificity and sensitivity to manual diagnosis. At the same time, the DL saves time and manpower. Therefore, the use of endometrial liquid-based cytology in combination with AI to identify ECC is reliable for EC screening and is able to reduce pathologists' workload. By carrying out this form of screening work, cross-population, big data will be rapidly established, and the participation of scholars from different regions will greatly promote the development of precision medicine.

**Author Contributions:** Conceptualization, D.Z. and Q.L. (Qiling Li); Data curation, R.W. and Z.X.; Funding acquisition, Q.L. (Qiling Li); Methodology, Y.L., H.H., C.L., G.Z. and G.S.; Resources, Y.W., C.S. and L.H.; Writing—original draft, Q.L. (Qing Li) and R.W.; Writing—review & editing, L.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by the Clinical Research Award of the First Affiliated Hospital of Xi'an Jiaotong University, China (No. XJTU1AF-CRF-2019-002); the Natural Science Basic Research Program of Shaanxi (2017ZDJC-11, 2018JM7073); the Clinical Research Award of the First Affiliated Hospital of Xi'an Jiaotong University, China (XJTU1AF-2018-017); the Key Research and Development Program of Shaanxi (2017ZDXM-SF-068, 2019QYPY-138); the Innovation Capability Support Program of Shaanxi (2017XT-026, 2018XT-002); and the Medical Research Project of Xi'an Social Development Guidance Plan (2017117SF/YX011-3). The sponsors had no involvement in the study's design, data collection, analysis, and interpretation, or in writing of the manuscript.

**Institutional Review Board Statement:** This study was approved by the Ethics Committee of the First Affiliated Hospital of Xi'an Jiaotong University (XJTU1AHCR2014-007), and informed written consent of all subjects was obtained prior to this study.

**Informed Consent Statement:** The participants provided their written informed consent to participate in this study.

**Data Availability Statement:** The raw data supporting the conclusions of this manuscript will be made available by the authors without undue reservation to any qualified researcher. All data generated or analyzed during this study are included either in this article or available from the corresponding authors.

**Acknowledgments:** We thank the staff in the Department of Obstetrics and Gynecology, the First Affiliated Hospital of Xi'an Jiaotong University for supporting with data collection.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Learn to Estimate Genetic Mutation and Microsatellite Instability with Histopathology H&E Slides in Colon Carcinoma**

**Yimin Guo 1,†, Ting Lyu 1,†, Shuguang Liu 1, Wei Zhang 1, Youjian Zhou 1, Chao Zeng 1,\* and Guangming Wu 2,\***


**Simple Summary:** Colorectal cancer is one of the most common malignancies and the third leading cause of cancer-related mortality worldwide. Identifying KRAS, NRAS, and BRAF mutations and MSI status is closely related to the individualized therapeutic judgment and oncologic prognosis of CRC patients. In this study, we introduced a cascaded network framework with an average voting ensemble strategy to sequentially identify the tumor regions and predict gene mutations and MSI status from whole-slide H&E images. Experiments on a colorectal cancer dataset indicated that the proposed method can achieve high fidelity in both gene mutation prediction and MSI status estimation. In our testing set, the AUCs for KRAS, NRAS, BRAF, and MSI ranged from 0.764 to 0.897. The results suggested that deep convolutional networks have the potential to assist pathologists in predicting gene mutations and MSI status in colorectal cancer.

**Abstract:** Colorectal cancer is one of the most common malignancies and the third leading cause of cancer-related mortality worldwide. Identifying KRAS, NRAS, and BRAF mutations and estimating MSI status are closely related to the individualized therapeutic judgment and oncologic prognosis of CRC patients. In this study, we introduce a cascaded network framework with an average voting ensemble strategy to sequentially identify the tumor regions and predict gene mutations and MSI status from whole-slide H&E images. Experiments on a colorectal cancer dataset indicate that the proposed method can achieve high fidelity in both gene mutation prediction and MSI status estimation. In the testing set, our method achieves AUCs of 0.792, 0.886, 0.897, and 0.764 for KRAS, NRAS, BRAF, and MSI, respectively. The results suggest that deep convolutional networks have the potential to provide diagnostic insight and clinical guidance directly from pathological H&E slides.

**Keywords:** deep convolutional network; H&E slice; gene mutation prediction; microsatellite instability; colon carcinoma

### **1. Introduction**

Colorectal cancer (CRC) is one of the most common lower gastrointestinal malignancies and is currently the third leading cause of cancer-related mortality worldwide [1,2]. Although the overall survival rate of colorectal cancer has increased in recent years due to improved treatment strategies [3], distant metastasis is still a significant cause of high morbidity and mortality for CRC patients [4]. So far, various predominant environmental risk factors for the development of CRC have been identified, including diet, obesity, lack of physical activity, and inflammatory bowel disease [5]. However, a module formed by the interaction of multiple genetic alterations determines individual differences and tumor progression in CRC patients.

In the past decades, a deep understanding of molecular profiles has become increasingly important for selecting appropriate therapies for metastatic CRC patients [6]. Numerous frequent genetic mutations have been identified as critical drivers responsible for comprehensive therapeutic judgment and oncologic prognosis [7]. Mutations of RAS (i.e., exon 2, 3, and 4 of KRAS, exon 2 and 3 of NRAS) are considered negative predictors for targeted therapy with anti-EGFR monoclonal antibodies (e.g., cetuximab and panitumumab) [8,9]. The BRAF V600E mutation is a biomarker of poor prognosis. Patients with the BRAF V600E mutation are less likely to respond to treatment with cetuximab and panitumumab unless it is combined with a BRAF inhibitor [10,11]. Moreover, the microsatellite instability (MSI) status of CRC patients is also an important marker closely related to the assessment of prognosis and the efficacy of chemotherapy and immunotherapy [12,13]. Therefore, testing for KRAS, NRAS, and BRAF mutations and MSI status is recommended for all metastatic CRC patients according to the National Comprehensive Cancer Network (NCCN) clinical practice guidelines in oncology (Colon Cancer, Version 2.2021) [14].

**Citation:** Guo, Y.; Lyu, T.; Liu, S.; Zhang, W.; Zhou, Y.; Zeng, C.; Wu, G. Learn to Estimate Genetic Mutation and Microsatellite Instability with Histopathology H&E Slides in Colon Carcinoma. *Cancers* **2022**, *14*, 4144. https://doi.org/10.3390/cancers14174144

Academic Editor: Luca Roncucci

Received: 30 July 2022 Accepted: 22 August 2022 Published: 27 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Common diagnostic procedures in molecular pathology include Sanger sequencing, next-generation sequencing (NGS), ARMS-PCR, and digital PCR [15]. In recent years, the accuracy and sensitivity of these methods have been significantly improved. However, molecular detection remains limited by various factors such as sample quality, mutated gene abundance, and laboratory conditions. Moreover, for the near term, high testing prices remain a heavy burden for most families.

With the development of big data and deep convolutional networks, artificial intelligence (AI)-assisted pathological diagnosis has attracted increasing attention. In 2018, Coudray et al. trained a deep convolutional neural network on whole-slide images (WSIs) to predict the cancer subtype and gene mutations in lung cancer [16]. Later, MSI status estimation of CRC from H&E histology was reported [17,18]. Furthermore, Skrede et al. reported promising results in survival risk assessment of tumor patients based on artificial intelligence [19]. These methods have significantly extended the application of deep convolutional networks. However, genetic mutation prediction from H&E slices in CRC, which has more clinical significance in precision diagnosis, is still very challenging. To meet this demand and further explore the potential of H&E slides, we propose a cascaded deep convolutional framework to simultaneously generate gene mutation predictions and MSI status estimates using WSIs in colorectal cancer. The proposed method consists of two tumor region classification models, gene mutation & MSI status estimation models, and an average voting ensemble strategy. The effectiveness of the proposed method is demonstrated on a CRC dataset collected from the GDC Data Portal and The Eighth Affiliated Hospital, Sun Yat-sen University (see Section 2.1). In qualitative and quantitative evaluation, the proposed method reveals promising accuracy in tumor classification (0.939–0.976 AUC), gene mutation prediction (0.792–0.897 AUC), and MSI status estimation (0.764 AUC).

The main contributions of this study can be summarized as follows:


The rest of the paper is organized as follows: First, we present the datasets and methods used for this research in Section 2. Then, we illustrate the quantitative and qualitative results in Section 3. Finally, the discussion and conclusions are presented in Sections 4 and 5, respectively.

#### **2. Materials and Methods**

#### *2.1. Data*

To explore the possibility of estimating somatic mutations and microsatellite instability (MSI) from Hematoxylin-Eosin (H&E) stained whole-slide images (WSIs), we downloaded diagnostic slides and corresponding clinical data of the TCGA-COAD cohort from the GDC Data Portal (https://portal.gdc.cancer.gov/projects/TCGA-COAD, accessed on 20 February 2022). The pre-compiled somatic mutation data and MSI status data were acquired from UCSC Xena (https://xenabrowser.net/datapages/, accessed on 10 March 2022) and MSIsensor-pro [20], respectively. The original WSIs were scanned at a magnification of either 20× or 40×. Prior to performing our experiments, we resized the 40× images to 20× using libvips (https://github.com/libvips/libvips) (see Figure 1A–D). There were 292 WSIs with corresponding somatic mutations and MSI statuses in the TCGA-COAD dataset. To achieve better generalization, we also collected the SYSU8H dataset in cooperation with The Eighth Affiliated Hospital, Sun Yat-sen University. The selected pathological specimens were fixed in formalin, embedded in paraffin wax blocks, and cut into several consecutive slices of 3–5 µm with a Leica HistoCore Autocut. The slices were then used for Hematoxylin-Eosin (H&E) staining, IHC staining, or gene sequencing, separately. Because the sequencing slices are offset from the scanned H&E slices by only a few microns, tumor genomic heterogeneity among these slices is negligible. In total, 104 WSIs were captured at 20× magnification with a PANNORAMIC 1000 scanner (3DHISTECH Ltd.) (see Figure 1E–H). Unlike TCGA-COAD, where the genetic information was obtained by next-generation sequencing (NGS), the genetic information in the SYSU8H dataset was obtained by Sanger sequencing. The binary masks of tumor areas of the WSIs were carefully annotated by experienced pathologists using QGIS (v3.22.7 LTR, https://qgis.org/).

As shown in Table 1, the 396 WSI samples were randomly divided into training, validation, and testing groups with ratios of 70% (278), 15% (59), and 15% (59), respectively. At 5× magnification, there were 283,126, 49,988, and 55,787 tiles in the training, validation, and testing sets, respectively. At 10× magnification, there were 1,152,481, 203,183, and 227,595 tiles in the training, validation, and testing sets, respectively. In our experiment, the size of each tile was set to 512 × 512 pixels.
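The slide-level split described above can be sketched as follows. This is a minimal illustration only: the function name `split_slides`, the fixed seed, and the rounding rule are assumptions, not the authors' code. The key point it shows is that whole slides, rather than individual tiles, are assigned to a group, so tiles from one WSI never appear in two groups.

```python
import random


def split_slides(slide_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle slide IDs and partition them into train/validation/test lists.

    Hypothetical helper: the rounding rule and seed are illustrative choices.
    """
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * ratios[1])
    n_test = round(len(ids) * ratios[2])
    n_train = len(ids) - n_val - n_test  # the remainder goes to training
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train, val, test = split_slides(range(396))
print(len(train), len(val), len(test))  # 278 59 59
```

With 396 slides, this rounding reproduces the 278/59/59 group sizes reported in the text.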


**Table 1.** Distribution of patients and whole-slide image samples.

**Figure 1.** Representative H&E stained whole-slide images (WSIs) from the SYSU8H and TCGA-COAD datasets. The (**A**–**D**) and (**E**–**H**) samples are randomly selected from the SYSU8H and TCGA-COAD datasets, respectively.

#### *2.2. Methodology*

In this study, we proposed a cascaded network framework to directly estimate somatic gene mutation and microsatellite instability status from H&E stained whole-slide images.

As shown in Figure 2, at the training stage, WSIs and corresponding binary masks of the training and validation set were partitioned into 5× or 10× tiles for training and validating the tumor classifier. The annotated tumor tiles and their somatic gene mutations or microsatellite instability (MSI) were used for training a binary classifier to discriminate wild type (i.e., W.T.) vs. mutant type (i.e., M.T.) of the gene or MSI-H vs. MSS/MSI-L, respectively. The top N highest probabilities of all tiles within a WSI were used to generate the final prediction for the patient.

**Figure 2.** Experimental workflow for estimating somatic gene mutation and microsatellite instability with H&E stained whole-slide images. The 5× or 10× tiles from WSIs are processed by a tumor classifier, a gene&MSI classifier, and a TopN ensemble classifier.

Through several cycles of training and validation, the hyperparameters, including batch size, the number of iterations, and learning rate, were optimized with the Adam stochastic optimizer [21]. Subsequently, the predictions generated by the optimized models were evaluated using the WSIs of the test set (see details in Table 1). For performance evaluations, we carefully measured the area under the receiver operator characteristic (ROC) curve [22] and its confidence interval (CI) [23].
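As an illustration of the evaluation metric, the sketch below computes the AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, and attaches a percentile bootstrap confidence interval. The bootstrap is an assumption made for this example; the text does not state which CI procedure from [23] was used, and the function names are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap CI (assumed method, not from the paper)."""
    point = roc_auc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

With few positive samples, as for the rarer mutations in this dataset, such an interval becomes wide, which is why the CI is reported alongside each AUC.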

#### 2.2.1. Data Preprocessing

First, the 396 pairs of whole-slide images (WSIs) and their corresponding clinical records were shuffled and partitioned into three groups: training (70%), validation (15%), and testing (15%). Within each pair, a binary tumor mask of the WSI was generated through polygon rasterization of its manually created tumor annotation. Then, a square window of 512 × 512 pixels was applied to the whole-slide image and the corresponding tumor mask to extract paired tiles of WSI and mask. Each WSI tile was labeled according to the ratio of positive pixels in the tumor mask. To focus on the tumor regions, tiles with positive ratios of less than 80% were marked as 0; otherwise, tiles were marked as 1. In total, 388,901 and 1,583,259 tiles were extracted at 5× and 10× magnification, respectively. As shown in Table 1, at 5× magnification, there were 283,126, 49,988, and 55,787 tiles in the training, validation, and testing sets, respectively. At 10× magnification, the numbers of tiles used for training, validation, and testing were 1,152,481, 203,183, and 227,595, respectively.
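The tile labeling rule above can be illustrated with a small pure-Python sketch. It assumes a non-overlapping window and drops partial tiles at the image border; neither detail is stated in the text, and the helper name `label_tiles` is hypothetical. A 2 × 2 window is used on a toy mask in place of the 512 × 512 window.

```python
def label_tiles(mask, tile=512, min_tumor_ratio=0.8):
    """Label each non-overlapping tile of a binary mask (list of rows of 0/1).

    Returns {(tile_row, tile_col): label}, where label is 1 if the fraction of
    tumor pixels in the tile is at least min_tumor_ratio, else 0.
    Assumptions: stride equals the tile size; incomplete border tiles are skipped.
    """
    height, width = len(mask), len(mask[0])
    labels = {}
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            positive = sum(sum(row[left:left + tile]) for row in mask[top:top + tile])
            ratio = positive / (tile * tile)
            labels[(top // tile, left // tile)] = 1 if ratio >= min_tumor_ratio else 0
    return labels


toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
print(label_tiles(toy_mask, tile=2))
# {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

The lower-left tile has a tumor ratio of 0.75, so it falls below the 80% threshold and is labeled 0 even though most of its pixels are tumor.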

#### 2.2.2. Network Architectures

For simplicity and efficiency, we adopted an advanced convolutional neural network (CNN) architecture, i.e., EfficientNet [24], as a backbone for tumor classification and gene&MSI classification.

In 1998, LeCun et al. introduced the classic CNN architecture, LeNet-5 [25], which consists of two sets of convolutional & pooling layers, a flattening convolutional layer, and two fully-connected layers. The CNN embodies two important concepts, sparse connectivity and shared weights, which significantly reduce memory occupation and improve computational efficiency. With the growing complexity of datasets and the rapid development of computational capacity, computer scientists have proposed more advanced CNN architectures for better generalization capacity and computational efficiency [26]. These architectures significantly improve CNN performance by introducing well-designed novel strategies, such as network in network (i.e., NIN) [27], residual learning (i.e., ResNet) [28], inception architecture [29], and dense connection (i.e., DenseNet) [30]. Unlike the above-mentioned models, which mainly focus on model accuracy, the EfficientNet architecture is designed to reach a given accuracy level with limited computational operations. EfficientNet introduces a uniform scaling method that scales all dimensions of depth, width, and resolution with a set of fixed scaling coefficients [24].

In our experiments, we chose an ImageNet-1K [31] pretrained EfficientNet B0 (https://pytorch.org/vision/master/models/generated/torchvision.models.efficientnet\_b0.html, accessed on 4 March 2022) as the backbone for both tumor classification and Gene&MSI classification. As shown in Table 2, we introduced a dropout layer (*p* = 0.5) [32] to prevent overfitting. Then, we changed the dimensions of the fully-connected (FC) layer from 1280 × 1000 to 1280 × 1.

Subsequently, the activation function was changed from softmax to sigmoid.

$$\begin{aligned} z_i &= b + \sum_{j=1}^{c} w_j \times x_{i,j} \\ p_i &= \frac{1}{1 + e^{-z_i}} \end{aligned} \tag{1}$$

Here, *w* ∈ ℝ<sup>*c*</sup> and *b* ∈ ℝ<sup>1</sup> denote the weights and bias, respectively. The range of the prediction *p<sub>i</sub>* is limited to [0, 1].

Instead of binary cross entropy [33], we adopted focal loss [34] as our objective function to focus learning on hard misclassified examples and address class imbalance. The equation can be formulated as:

$$\begin{aligned} p_t &= \begin{cases} p_i, & \text{if } y_i = 1 \\ 1 - p_i, & \text{if } y_i = 0 \end{cases} \\ \text{Loss}_{focal} &= -(1 - p_t)^{\gamma} \log(p_t) \end{aligned} \tag{2}$$

where *p<sub>i</sub>* and *y<sub>i</sub>* are the *i*th prediction and the corresponding ground truth. The value of *p<sub>t</sub>* is *p<sub>i</sub>* if the observation is in class 1; otherwise, the value is 1 − *p<sub>i</sub>*. The *γ* (≥ 0) is a tunable focusing parameter which reduces the relative loss for well-classified examples (i.e., *p<sub>t</sub>* > 0.5) and puts more focus on hard, misclassified examples.
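Equations (1) and (2) can be checked numerically with the short sketch below. It is a scalar, pure-Python illustration rather than the training code: the function names are hypothetical, the default *γ* = 2 is an assumption (the text does not report the value used), and the small `eps` clamp is added only to avoid log(0).

```python
import math


def sigmoid_head(features, weights, bias):
    """Equation (1): linear combination of features followed by a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Equation (2) for one sample; gamma=2.0 is an assumed default."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


p = sigmoid_head([0.5, -1.0], weights=[2.0, 1.0], bias=0.0)  # z = 0, so p = 0.5
easy = focal_loss(0.9, 1)  # well-classified positive: heavily down-weighted
hard = focal_loss(0.1, 1)  # misclassified positive: keeps most of its loss
print(p, easy < hard)
```

Setting `gamma=0` recovers the ordinary binary cross entropy, which makes the down-weighting effect of the focusing term easy to compare.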

**Table 2.** The backbone network for both tumor classification and Gene&MSI classification. Each row describes the stage, operation, input resolution, output channel, and the number of layers.


With all of the above layers being trained by mini-batch stochastic gradient descent (SGD) [35] to minimize the focal loss, the model learns how to map from the input 512 × 512 RGB image to a binary prediction.

#### 2.2.3. Model Ensemble

To reach a final decision on each whole-slide image (WSI) using the separate predictions of 5× and 10× tiles, we introduced a simple yet efficient average voting strategy that ensembles the models using the top N features. To ensure the high fidelity of selected features, a high threshold (i.e., 0.8) was used to filter out tiles with a low probability of being a tumor region. Then, tiles with a high probability of being tumor regions were passed to the corresponding gene&MSI classification models to generate predictions for 5× tiles (*Px*5) and 10× tiles (*Px*10). Next, the top N highest probabilities among the predictions from both 5× and 10× tiles were selected and averaged for the final estimation of the WSI (*Pwsi*). Finally, the *Pwsi* and corresponding ground truth (*Ywsi*) were used to calculate the area under the curve (AUC) for performance estimation.

$$\begin{aligned} P_{topN} &= \max\{ [P_{x5}, P_{x10}], N \} \\ P_{wsi} &= \frac{1}{N} \sum_{i=1}^{N} P_{topN} \end{aligned} \tag{3}$$
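The ensemble step in Equation (3) can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function name, the inclusive threshold comparison, and the fallback to fewer than N tiles when a slide has too few tumor tiles are all assumptions.

```python
def ensemble_wsi(tumor_probs_5x, gene_probs_5x, tumor_probs_10x, gene_probs_10x,
                 top_n=7, tumor_threshold=0.8):
    """Average the top-N gene/MSI tile probabilities pooled over 5x and 10x tiles.

    Tiles whose tumor probability is below tumor_threshold are discarded first.
    Returns None if no tile passes the tumor filter (assumed behavior).
    """
    kept = []
    for tumor_probs, gene_probs in ((tumor_probs_5x, gene_probs_5x),
                                    (tumor_probs_10x, gene_probs_10x)):
        kept += [g for t, g in zip(tumor_probs, gene_probs) if t >= tumor_threshold]
    if not kept:
        return None
    top = sorted(kept, reverse=True)[:top_n]  # P_topN in Equation (3)
    return sum(top) / len(top)                # P_wsi


p_wsi = ensemble_wsi(
    tumor_probs_5x=[0.95, 0.60, 0.85], gene_probs_5x=[0.70, 0.99, 0.20],
    tumor_probs_10x=[0.90, 0.99, 0.10], gene_probs_10x=[0.40, 0.90, 0.95],
    top_n=2,
)
print(p_wsi)  # mean of 0.90 and 0.70
```

In this toy call, the tiles with gene probabilities 0.99 and 0.95 are ignored because their tumor probabilities (0.60 and 0.10) fall below the 0.8 filter, which shows why the tumor classifier sits upstream of the gene&MSI classifier in the cascade.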

#### **3. Results**

A total of 396 colorectal cancer (CRC) patients with various gene mutations and MSI status from the SYSU8H and TCGA-COAD datasets were recruited in this study. The collected WSIs were randomly split into three sets: training, validation, and testing with ratios of 70%, 15%, and 15%, respectively. The tiles extracted from the training and validation sets were used for training and optimizing hyperparameters of the proposed classification models. To estimate the performance of the proposed classification models, we conducted extensive quantitative and qualitative comparisons on the testing set. All experiments were performed on the same dataset and processing platform.

#### *3.1. Tumor Classification*

The tumor regions annotated by the pathologist and probability maps generated by the tumor classification models using 5× and 10× tiles of WSIs are presented in Figure 3. Both 5× and 10× models display high fidelity in tumor recognition compared to manual annotations. Compared with the 5× model, the model trained with 10× tiles shows fewer false positives (e.g., orange and red patches outside the blue dashed curve of A, B, and C), fewer false negatives (e.g., blue and green patches inside the blue dashed curve of E and F), and better boundaries (e.g., around the blue dashed curve of A, B, and D). We selected 5 tiles from each of the four randomly selected whole slide images in the testing set, which present the highest probabilities of being tumor regions according to our trained 5× or 10× tumor classification models (Figures 4 and 5). The selected tiles show high consensus with the annotations by the pathologist. The receiver operator characteristic (ROC) curve and area under the curve (AUC) are used to evaluate the performance of tumor classification models using 5× and 10× tiles of the WSIs (Figure 6). The 5× classification model achieved AUCs of 0.939 (95% CI of 0.937–0.940), 0.910 (95% CI of 0.905–0.914), and 0.959 (95% CI of 0.957–0.961) for the training, validation, and testing sets, respectively. The 10× classification model performed slightly better, with AUCs of 0.971 (95% CI of 0.971–0.972), 0.973 (95% CI of 0.972–0.973), and 0.976 (95% CI of 0.975–0.977) for the training, validation, and testing sets, respectively. These values are consistent with our observations in Figures 3–5.

**Figure 3.** Probability maps of tumor classification using 5× and 10× tiles of the whole slide images (WSIs). The annotations created by the pathologist are marked with the blue dashed curve. The probability values are categorized into five groups with different color representations. The (**A**–**C**) and (**D**–**F**) samples are randomly selected from the testing set of TCGA-COAD and SYSU8H, respectively.

**Figure 4.** Representative tiles of tumor classification using 5× tiles of the whole slide images (WSIs). The (**A**–**D**) samples are randomly selected from the testing set. In each row, tiles 1–5 are patches from the same WSI.

**Figure 5.** Representative tiles of tumor classification using 10× tiles of the whole slide images (WSIs). The (**A**–**D**) samples are randomly selected from the testing set. In each row, tiles 1–5 are patches from the same WSI.

**Figure 6.** The receiver operator characteristic (ROC) curve and area under the curve (AUC) of tumor classification using 5× and 10× tiles of the whole slide images (WSIs). (**A**) The curves of the training, validation, and testing sets using 5× tiles. (**B**) The curves of the training, validation, and testing sets using 10× tiles.

#### *3.2. Gene&MSI Classification*

After model ensembling, the proposed method generates probabilities of gene mutations (i.e., KRAS, NRAS, and BRAF) and MSI status of every WSI.

As shown in Figure 7A–C, in the testing set, the proposed method reaches AUCs of 0.792 (95% CI of 0.669–0.914), 0.886 (95% CI of 0.688–1.00), and 0.897 (95% CI of 0.800–0.994) for gene mutation predictions of KRAS, NRAS, and BRAF, respectively. In Figure 7D, our method reaches an AUC of 0.764 (95% CI 0.563–0.965) for MSI status estimation in colorectal cancer.

**Figure 7.** The receiver operator characteristic (ROC) curve and area under the curve (AUC) of Gene&MSI classification using 5× and 10× tiles of the whole slide images (WSIs). (**A**) The curves of KRAS gene mutation classification. (**B**) The curves of NRAS gene mutation classification. (**C**) The curves of BRAF gene mutation classification. (**D**) The curves of MSI status classification.

To investigate the effect of the selected number of features (i.e., topN) used for model ensembling, we conducted a comparison experiment on the testing set using sequential values (i.e., [1, 3, 5, 7, 9]) of topN. Figure 8 shows the trend of the AUC values under sequential values of topN in the testing set. Among all values, the proposed method achieves the highest KRAS, NRAS, and BRAF gene mutation prediction accuracy when topN equals 7. In gene mutation predictions, as the value of topN increases, the AUC value first increases and then decreases. In MSI status estimation, the AUC increases gradually as the value of topN increases. Once the value of topN passes 7, the gain in AUC diminishes.

Figure 9 shows the top weighted tiles of whole slide images (WSIs) in gene mutation prediction and MSI status estimation by the proposed models.

**Figure 8.** The receiver operator characteristic (ROC) curve and area under the curve (AUC) of Gene&MSI classification using sequential values of topN. (**A**) The trend of KRAS gene mutation classification. (**B**) The trend of NRAS gene mutation classification. (**C**) The trend of BRAF gene mutation classification. (**D**) The trend of MSI status classification.

**Figure 9.** Top weighted tiles of whole slide images (WSIs) in gene mutation and MSI status estimation. In each row, tiles 1–5 are either 5× or 10× tiles extracted from the same WSI.

#### **4. Discussion**

#### *4.1. Regarding the Cascaded Framework*

In recent years, deep convolutional networks have demonstrated their potential in computer-aided cancer identification using clinical images such as CT [36], ultrasound [37], and MRI images [38]. Beyond tumor recognition, a growing number of studies are trying to look deeper into microsatellite instability estimation [39,40], gene mutation prediction [41], or survival risk evaluation [19], which are vital for precision pathological diagnosis and treatment.

To the best of our knowledge, the proposed cascaded framework is the first end-to-end method that simultaneously generates gene mutation prediction and MSI status estimation using the whole slide image (WSI) in colorectal cancer. Our method can produce high-fidelity gene mutation prediction and MSI status estimation for each WSI through a simple yet efficient average voting strategy to ensemble models. Predicting the gene mutations (KRAS, NRAS, and BRAF) and MSI status with deep convolutional networks provides pathologists with a more convenient way to evaluate prognosis and guide medication. For example, anti-EGFR monoclonal antibodies (cetuximab and panitumumab) are not recommended for advanced metastatic CRC patients with KRAS or NRAS mutations. The evaluation of BRAF mutation can stratify the prognosis and guide clinical treatment. Patients with a BRAF mutation are unlikely to respond to treatment with cetuximab or panitumumab. MSI is a predictor of the efficacy of immune checkpoint inhibitors; CRC patients with MSI-H are more likely to benefit from treatment with immune checkpoint inhibitors (e.g., pembrolizumab). Qualitative and quantitative results of the experiments demonstrated the effectiveness of our proposed framework. These results suggest that deep learning models have the potential to provide diagnostic insight and clinical guidance directly from pathological H&E slides. Additionally, as the gene mutation prediction and MSI status estimation are directly computed from histopathology H&E slides, in principle, the proposed method should apply not only to colorectal cancer but also to other malignant cancers (e.g., lung, breast, and liver cancer).

#### *4.2. Accuracies, Uncertainties, and Limitations*

The proposed framework achieved high values of area under the curve (AUC) in both tumor classification and gene&MSI classification tasks. In tumor classification, the 5× and 10× classification models achieved AUCs of 0.959 (95% CI of 0.957–0.961) and 0.976 (95% CI of 0.975–0.977) in the testing set, respectively. The values show close agreement between the pathologist and the proposed method, which suggests that the AI algorithm can potentially serve as a pre-screening tool. The performance will be further evaluated using a larger dataset with multiple tissue samples collected from varied pathology departments.

In gene mutation prediction, the proposed method achieved AUCs of 0.792 (95% CI of 0.669–0.914), 0.886 (95% CI of 0.688–1.00), and 0.897 (95% CI of 0.800–0.994) for gene mutation predictions of KRAS, NRAS, and BRAF, respectively. Because of the extremely imbalanced ratio of mutant type to wild type (i.e., 15 vs. 381 for NRAS, 43 vs. 353 for BRAF), the AUC values fluctuate over a wide range within the 95% confidence interval (see details in Figure 7). In terms of MSI status estimation, recent studies [39,40] have reported higher performance than ours (i.e., 0.764 AUC, 95% CI 0.563–0.965). Compared with these methods, our method is able to simultaneously generate gene mutation prediction (KRAS, NRAS, and BRAF) and MSI status estimation, which are all mandatory for metastatic CRC patients. With a view to future clinical application, improving the accuracy of our algorithm remains one of the main goals.

With the current cascaded classification-based scheme, the models are trained to generate tile-to-label predictions using features extracted from sequential convolutional layers. The lack of internal connectivity with adjacent tiles within the same WSI might lead to partial misclassification (e.g., red patches outside the blue dashed curve and green patches within the blue dashed curve in Figure 3B,D). Since the models are trained and optimized separately, the proposed framework requires extra computational time and storage for training and saving checkpoints of multiple models. For better computational efficiency, a unified model with shared parameters and objective functions should be explored in future work.

Regarding the H&E staining, different types of hematoxylin differ in stability, durability, and staining time, which may lead to distinct visual patterns. In the SYSU8H dataset, the H&E slices were stained using an identical form of hematoxylin (i.e., Harris hematoxylin) to ensure that both the nucleus and cytoplasm were clearly visible and distinguishable. Because the TCGA-COAD dataset was collected from multiple centers, the forms of hematoxylin used for staining were very likely to be different. However, as shown in Figure 3, in tumor classification, the differences in prediction accuracy among slices were not significant. The result indicates that our method can be adapted to different H&E staining approaches.

Another issue that should not be ignored is the tumor heterogeneity of the primary and metastatic lesions. Clinically, for both pathological diagnosis and target gene detection, the tumor specimen of the primary lesion is the first choice. However, there may be discrepancies in gene mutations between the primary and metastatic tumors. For advanced metastatic tumors, when the target gene mutation of the primary tumor is negative, target gene detection in the metastatic tumor can be carried out if conditions permit, which can give patients an additional opportunity to receive targeted drug treatment. In this study, limited by the publicly available clinical samples with gene mutation information for both the primary tumor and the corresponding metastases, our method focused exclusively on primary tumors. Further evaluation is still necessary to clarify the reliability and generalization of our model performance.

#### **5. Conclusions**

For colon carcinoma, we design a cascaded deep convolutional framework to simultaneously generate gene mutation prediction and MSI status estimation based on whole-slide images. The proposed method introduces a simple yet efficient average voting ensemble strategy to produce a high-fidelity prediction for each WSI. In the gene mutation & MSI status classification task, the proposed method achieves AUCs of 0.792 (95% CI of 0.669–0.914), 0.886 (95% CI of 0.688–1.00), 0.897 (95% CI of 0.800–0.994), and 0.764 (95% CI 0.563–0.965) for KRAS, NRAS, BRAF, and MSI, respectively. These results suggest that deep learning models have the potential to provide diagnostic insight and clinical guidance directly from pathological H&E slides. We plan to improve the architecture of the framework and apply it to other data sources to achieve better generalization capacity and diagnostic reliability.

**Author Contributions:** Conceptualization, C.Z. and G.W.; Formal analysis, S.L., W.Z. and Y.Z.; Funding acquisition, C.Z.; Investigation, Y.G. and T.L.; Methodology, Y.G. and T.L.; Writing—original draft, Y.G. and T.L.; Writing—review & editing, G.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of The Eighth Affiliated Hospital, Sun Yat-sen University (protocol code 2022d013, 20220222).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank all investigators and contributing pathologists from the TCGA (http://portal.gdc.cancer.gov), UCSC Xena (https://xenabrowser.net), and SYSU8H (http://www.sysu8h.com.cn/) for providing us samples and tools in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Sample Availability:** Samples of the SYSU8H whole-slide images and corresponding clinical information are available from the authors.


### *Article* **Prediction of Postoperative Pathologic Risk Factors in Cervical Cancer Patients Treated with Radical Hysterectomy by Machine Learning**

**Zhengjie Ou 1,†, Wei Mao 1,†, Lihua Tan 1, Yanli Yang 2, Shuanghuan Liu 1, Yanan Zhang 1, Bin Li <sup>1</sup> and Dan Zhao 1,\***

<sup>1</sup> Department of Gynecology Oncology, National Cancer Center, National Clinical Research Center for Cancer, Cancer Hospital, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing 100021, China

<sup>2</sup> Department of Gynecology Oncology, The Fifth People's Hospital of Qinghai Province, Xining 810007, China

**\*** Correspondence: zhaodan@cicams.ac.cn; Tel.: +86-010-8778-7384; Fax: +86-097-1636-0700

† These authors contributed equally to this work.

**Abstract:** Pretherapeutic serological parameters play a predictive role in pathologic risk factors (PRF), which correlate with treatment and prognosis in cervical cancer (CC). However, methods for preoperative prediction of PRF are limited, and the clinical applicability of machine learning methods in CC remains unknown. Overall, 1260 early-stage CC patients treated with radical hysterectomy (RH) were randomly split into training and test cohorts. Six machine learning classifiers, including Gradient Boosting Machine, Support Vector Machine with Gaussian kernel, Random Forest (RF), Conditional Random Forest (Cforest), Naive Bayes, and Elastic Net (EN), were used to derive diagnostic information from nine clinical factors and 75 parameters readily available from pretreatment peripheral blood tests. The best results were obtained by RF in deep stromal infiltration prediction, with an accuracy of 70.8% and an AUC of 0.767. The highest accuracy and AUC for predicting lymphatic metastasis, obtained with Cforest, were 64.3% and 0.620, respectively. The highest accuracy for predicting lymph-vascular space invasion, obtained with EN, was 59.7%, with an AUC of 0.628. Blood markers, including D-dimer and uric acid, were associated with PRF. Machine learning methods can provide critical diagnostic prediction of PRF in CC before surgical intervention. The use of predictive algorithms may facilitate individualized treatment options through diagnostic stratification.

**Keywords:** blood biomarker; cervical cancer; deep stromal infiltration; lymph node metastasis; lymph-vascular space invasion; machine learning methods

### **1. Introduction**

Cervical cancer remains one of the most frequent malignant tumors in women [1]. With the widespread application of human papillomavirus (HPV) vaccination and the popularity of screening, patients diagnosed at an early stage now account for the majority of cases. Radical hysterectomy (RH) is the standard-of-care treatment for these patients [2]. The unavoidable question after surgery is whether adjuvant treatment is required, which is judged in accordance with postoperative pathological risk factors. The likelihood of risk factors that increase the risk of recurrence is high, especially in stage IB3-IIA2 (the 2018 International Federation of Gynecology and Obstetrics, FIGO) due to large tumor bulk [2]. Previous studies have illustrated that neoadjuvant chemotherapy (NACT) plus surgery inhibited micro-metastasis and distant metastasis of tumors, and was associated with a reduced incidence of pathologic risk factors [3]. However, although NACT reduces the rate of adjuvant therapy after surgery, patients treated with NACT cannot be entirely spared radiotherapy and its adverse effects.

In addition, concurrent chemoradiotherapy (CCRT) is also an alternative initial treatment for early-stage cervical cancer, particularly for locally advanced cervical cancer. For a patient with several pathologic risk factors who meets the criteria for adjuvant therapy, CCRT rather than RH should be considered as the initial therapy, which shortens the treatment process for the same effect and reduces treatment costs [4]. With regard to patients staged IB-IIA, according to the National Comprehensive Cancer Network (NCCN) guidelines, concurrent chemoradiation and RH both serve as primary treatment options with nearly equivalent therapeutic effect. However, increased morbidity and complications have been specifically illustrated when surgery and radiotherapy are combined [5,6]. This multimodal treatment places a double treatment burden on patients and increases medical costs. In addition, the successive therapeutic process also prolongs the treatment period, aggravates side effects and affects quality of life in the long run. Accordingly, it is necessary to construct a model to predict pathologic risk factors before primary treatment, which will help select those for whom it is more appropriate to receive direct chemoradiation therapy rather than RH. Additionally, the development of a model to predict postoperative pathologic risk factors is an important element of individual prognosis stratification and personalized medicine.

**Citation:** Ou, Z.; Mao, W.; Tan, L.; Yang, Y.; Liu, S.; Zhang, Y.; Li, B.; Zhao, D. Prediction of Postoperative Pathologic Risk Factors in Cervical Cancer Patients Treated with Radical Hysterectomy by Machine Learning. *Curr. Oncol.* **2022**, *29*, 9613–9629. https://doi.org/10.3390/curroncol29120755

Received: 7 October 2022 Accepted: 29 November 2022 Published: 6 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Pathologic risk factors in cervical cancer include lymph node metastasis (LNM), parametria infiltration, positive surgical margins, lymph-vascular space invasion (LVSI), tumor size >4 cm and deep stromal infiltration (DSI) [2]. Previous studies illustrated that many clinicopathologic factors were related to pathologic risk factors by common statistical methods, but these methods were not suited to handle more complex data [7–9]. Machine learning is a branch of artificial intelligence (AI) technology that allows the computer to conclude potential rules from complicated data of retrospective examples. AI technology has been widely used to analyze clinical material to construct a model to predict clinicopathological factors and treatment outcome, acquiring a properly higher accuracy compared with traditional statistical methods [10–12]. Therefore, it is feasible and reasonable to apply machine learning to the prediction of postoperative pathologic risk factors.

Based on the successful application of AI technology and the discovery of factors related to pathologic risk factors, we hypothesized that pretreatment clinicopathological factors would be effective in predicting postoperative pathologic risk factors by machine learning analysis in FIGO stage IB-IIA cervical cancer. Because of the low incidence of positive margins and parametria infiltration in primary cohorts, and because tumor size is confirmed preoperatively via clinical palpation, this study's outcomes were limited to the remaining pathologic risk factors. Therefore, in the present study, we aimed to construct a model for predicting LNM, LVSI and DSI through machine learning combining clinicopathological biomarkers, and to explore previously unreported parameters significantly associated with these factors.

#### **2. Materials and Methods**

#### *2.1. Patients and Considered Features*

This was a retrospective cohort study of 1260 patients with FIGO stage (2003) IB and IIA cervical cancer who were treated with RH with retroperitoneal lymphadenectomy between 2003 and 2017 in our institution (National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences; CICAMS). We retrospectively collected clinicopathological parameters, including age at diagnosis, body mass index (BMI), menopausal status, clinical FIGO stage, gross type, histologic grade, clinical tumor diameter, 75 preoperative peripheral blood biomarkers, etc. (Table 1 and Table S1). Tumor diameter was obtained via clinical palpation before surgical intervention.


**Table 1.** Clinical and pathologic characteristics of 1260 patients with cervical cancer.

#### *2.2. Data Splitting*

We obtained 1260 samples after preliminary preprocessing: removing medically impossible data (containing obvious recording errors), removing the features with 10% missing values and removing the samples with missing values. The variables age, BMI, menopausal status, clinical tumor diameter, histology, FIGO stage, gross type, previous abdominal surgery, histologic grade (obtained via preoperative cervical biopsy) and 75 pretreatment peripheral blood markers were all incorporated into the model construction. The features were then processed as follows: continuous features were normalized, categorical features were one-hot encoded, and the LinearSVC method with an L1 penalty was used to select features.
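The feature handling described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the text does not say which normalization was used, so z-score standardization is assumed here; the helper names `zscore` and `one_hot` and the toy values are hypothetical; and the L1-penalized LinearSVC selection step is not reproduced.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize a continuous feature to zero mean and unit variance.
    Assumption: the paper only says 'normalized'; z-scoring is one option."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Return (sorted category names, one 0/1 indicator row per sample)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical toy records, not study data.
ages = [35.0, 45.0, 55.0]
stages = ["IB1", "IB2", "IB1"]

scaled_ages = zscore(ages)
stage_names, stage_rows = one_hot(stages)
```

One-hot encoding turns each category into its own indicator column, which is consistent with stage labels such as IB1 and IB2 appearing as separate predictors in the variable importance rankings.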

The dataset was split into training and test cohorts at a ratio of 1:1 by repeated random sampling until there was no significant difference (*p* value > 0.05) between the two cohorts with respect to the three tasks (Table 1). The *p* values were calculated using the Chi-square or Fisher exact test for categorical variables, and Student's *t*-test or the Mann–Whitney U test for normally or non-normally distributed continuous variables, respectively. This resulted in the training cohort and the test cohort each having 630 patients.
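The repeated-sampling procedure can be sketched as follows. This is a simplified illustration, not the authors' code: it checks only the three binary outcome labels with an uncorrected Pearson chi-square test, whereas the study also compared other variables with additional tests. The function names, the retry limit and the simulated labels are all assumptions.

```python
import math
import random

def chi2_p_2x2(a, b, c, d):
    """P value of Pearson's chi-square test (1 degree of freedom, no
    continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty margin means the cohorts cannot differ
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 d.o.f.
    return math.erfc(math.sqrt(stat / 2.0))

def balanced_split(labels_by_task, seed=0, alpha=0.05, max_tries=1000):
    """Reshuffle until a 1:1 split shows p > alpha for every task.
    labels_by_task maps a task name to 0/1 labels in patient order."""
    n = len(next(iter(labels_by_task.values())))
    rng = random.Random(seed)
    indices = list(range(n))
    for _ in range(max_tries):
        rng.shuffle(indices)
        train, test = indices[: n // 2], indices[n // 2:]
        balanced = True
        for labels in labels_by_task.values():
            a = sum(labels[i] for i in train)  # positives in training cohort
            c = sum(labels[i] for i in test)   # positives in test cohort
            if chi2_p_2x2(a, len(train) - a, c, len(test) - c) <= alpha:
                balanced = False
                break
        if balanced:
            return sorted(train), sorted(test)
    raise RuntimeError("no balanced split found")

# Simulated labels for illustration only (not study data).
_rng = random.Random(42)
toy = {task: [int(_rng.random() < 0.3) for _ in range(200)]
       for task in ("DSI", "LNM", "LVSI")}
train_idx, test_idx = balanced_split(toy)
```

Because a purely random split already passes each test about 95% of the time, the loop usually stops after very few iterations.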

#### *2.3. Supervised Machine Learning Classifiers*

In this study, we evaluated six types of supervised machine learning classifiers: GBM (Gradient Boosting Machine) [13,14], SVMRadial (Support Vector Machine with Gaussian kernel) [15], RF (Random Forest) [16], Cforest (Conditional Random Forest) [17], NB (Naive Bayes) [18] and EN (Elastic Net) [19]. In addition, a logistic regression classifier was used as a baseline. R software version 4.2.1 with the R package caret was used to implement all classifiers. One hundred independent training runs were conducted using different random seeds in order to calculate variable importance for prediction. We used the median of the variable importance values acquired from these runs as a representative value. The importance of each variable was calculated using the varImp function of the caret package. An RF classifier is a group of decision trees that combines two machine learning techniques: bagging and random feature selection. Cforest is an algorithm using conditional inference trees as base learners, implementing both the random forest and the bagging ensemble algorithm. EN is a logistic regression classifier trained with a regularization method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
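The aggregation of variable importance across runs can be sketched in Python. This is a hedged illustration only: the study used the varImp function of caret in R, and the text does not state whether scaling to the highest value was applied per run or after taking the median. The latter is assumed here, and the function name and toy values are hypothetical.

```python
from statistics import median

def summarize_importance(runs):
    """Median importance per variable across runs, rescaled so that the
    largest median equals 100 (assumed order of operations).
    runs is a list of dicts mapping variable name to raw importance."""
    variables = runs[0].keys()
    medians = {v: median(run[v] for run in runs) for v in variables}
    top = max(medians.values())
    return {v: 100.0 * m / top for v, m in medians.items()}

# Hypothetical raw importances from three training runs (not study data).
runs = [
    {"SCC": 8.0, "D-D": 4.0, "URIC": 1.0},
    {"SCC": 10.0, "D-D": 6.0, "URIC": 3.0},
    {"SCC": 12.0, "D-D": 5.0, "URIC": 2.0},
]
scaled = summarize_importance(runs)
```

Using the median rather than the mean keeps a single unusual seed from dominating the ranking.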

#### *2.4. Model Assessment*

To assess the performance of different models, we computed the accuracy (ACC) and the area under the ROC curve (AUC) on the test cohort as our evaluation metrics. Here, ACC was obtained by setting the threshold corresponding to the top left point of the ROC curve. As the AUC is independent of the chosen threshold, we used it as the main evaluation metric.
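A minimal sketch of these two metrics is given below. It assumes that the "top left point" is the ROC point with the smallest Euclidean distance to (FPR = 0, TPR = 1); other conventions, such as maximizing the Youden index, would be equally plausible readings. All function names and the toy scores are hypothetical.

```python
import math

def roc_points(y_true, scores):
    """(threshold, FPR, TPR) for each distinct score used as a cut-off;
    a sample is called positive when its score >= threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points

def auc(y_true, scores):
    """AUC as the probability that a random positive sample scores higher
    than a random negative one (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_left_threshold(y_true, scores):
    """Threshold whose ROC point lies closest to the top left corner."""
    best = min(roc_points(y_true, scores),
               key=lambda pt: math.hypot(pt[1], 1.0 - pt[2]))
    return best[0]

def accuracy(y_true, scores, threshold):
    """Fraction of samples classified correctly at the given threshold."""
    hits = sum(1 for y, s in zip(y_true, scores)
               if (s >= threshold) == (y == 1))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities (not study data).
y = [0, 0, 1, 1, 0, 1, 1, 0]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]
thr = top_left_threshold(y, s)
print(auc(y, s), thr, accuracy(y, s, thr))
```

The accuracy depends on the chosen threshold while the AUC does not, which is why the AUC serves as the main metric.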

#### *2.5. Confidence of Prediction and Shannon's Information Gain*

Shannon's information gain was used to assess the prediction confidence [20]. If no information is available about which of the *k* classes a patient *i* belongs to, the Shannon information entropy representing this uncertainty is:

$$H(i) = \log_2 k$$

If a classifier provides prediction probabilities for each class, the entropy becomes:

$$H_c(i) = -\sum_{j=1}^{k} p_j(i) \log_2 p_j(i)$$

Here, *p<sub>j</sub>*(*i*) is the predicted probability that patient *i* belongs to class *j*. The information gain, i.e., the information gained by the prediction, is therefore:

$$IG(i) = H(i) - H_c(i)$$

The individual information gain for each class is given by:

$$IG_j(i) = p_j(i) \times IG(i)$$
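As a worked illustration, the quantities above can be computed with a few lines of Python. The sketch uses the standard Shannon entropy with its leading minus sign, so that the gain is non-negative and measured in bits; the function names are hypothetical.

```python
import math

def class_entropy(probs):
    """Shannon entropy (bits) of the predicted class probabilities;
    zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(probs):
    """Reduction in uncertainty relative to knowing nothing (log2 k)."""
    return math.log2(len(probs)) - class_entropy(probs)

def per_class_gain(probs):
    """Information gain apportioned to each class by its probability."""
    ig = information_gain(probs)
    return [p * ig for p in probs]

print(round(information_gain([0.9, 0.1]), 3))  # 0.531
```

For a binary task the gain ranges from 0 bits (probabilities of 0.5 and 0.5) to 1 bit (a certain prediction). A gain above 0.2 bits corresponds to a predicted probability of roughly 0.76 or more for the favored class.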

#### **3. Results**

*3.1. Prediction of Deep Stromal Infiltration of Cervical Cancer Based on Multiple Preoperative Blood Markers Using Machine Learning Methods*

Depth of stromal invasion was evaluated by an experienced pathologist and was recognized as significant, with more than one millimeter of invasion in the depth of the stroma in a microscopic examination. The status of the depth of stromal infiltration was classified into two groups: "non-deep" and "deep". The "deep" group referred to patients who had an invasive carcinoma with greater than one-third stromal invasion according to the pathologic findings. "Non-deep" indicated a carcinoma infiltrating no more than one third of the cervical stroma. The values for the highest ACC of the prediction and the AUC were 70.8% and 0.767 with RF classifier, which achieved a 5.4% higher score than the traditional method of multiple logistic regression analysis in AUC (Figure 1A; Supplemental Table S2). It is notable that the best two classifiers, RF and GBM, both used ensemble methods that combine weak decision trees.

Next, we focused on the best model, RF, and examined its variables. The relative importance of each variable for segregating deep stromal infiltration patients from non-deep infiltration ones was calculated for RF (Figure 1B). We identified the top eight factors, including SCC, D-D, tumor diameter, URIC, age, neut%, ALP and TP, as important RF predictors for distinguishing deep infiltration from non-deep infiltration. Standard box plots presenting the distribution of each variable between deep and non-deep samples are shown in Figure 1C.

Interestingly, we found that D-D was a critical variable, in addition to SCC. From the confusion matrix (Figure 1D), RF predicted 81 patients with deep infiltration as ones with non-deep infiltration and predicted 108 patients with non-deep infiltration as ones with deep infiltration. When we considered the Shannon gain to represent the confidence of predictions and chose those patients with certain higher confidence of predictions, the predictions designated as higher confidence (>0.2 bits from Shannon information gain computation) contained only 21 mispredictions out of 148 instances (Figure 1E). In particular, for the predictions with higher confidence, if a patient was predicted as non-deep, this was right at a rate of 1 − 7/52 = 86.5%.
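The confidence filtering described above can be sketched as follows. This is an illustration under assumptions: predictions are binarized at a probability of 0.5 for simplicity, whereas the study set its threshold from the ROC curve, and the labels and probabilities are made up. Only the last two lines use counts reported in the text.

```python
import math

def information_gain(probs):
    """Shannon information gain (bits) of a predicted distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy

def high_confidence_report(y_true, p_deep, min_gain=0.2):
    """Count retained and mispredicted cases among predictions whose
    information gain exceeds min_gain. y_true uses 1 for deep and 0 for
    non-deep; p_deep is the predicted probability of deep infiltration."""
    kept = wrong = 0
    for y, p in zip(y_true, p_deep):
        if information_gain([p, 1.0 - p]) > min_gain:
            kept += 1
            wrong += int((p >= 0.5) != (y == 1))  # assumed 0.5 cut-off
    return kept, wrong

# Hypothetical labels and probabilities (not study data).
y_true = [1, 0, 1, 0, 1, 0]
p_deep = [0.95, 0.10, 0.55, 0.45, 0.20, 0.85]
kept, wrong = high_confidence_report(y_true, p_deep)

# Rates implied by the counts reported in the text.
overall_high_conf = round(100 * (1 - 21 / 148), 1)   # 85.8
non_deep_high_conf = round(100 * (1 - 7 / 52), 1)    # 86.5
```

In the toy data, the two predictions near 0.5 are discarded as low confidence, and the remaining four contain two errors.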

**Figure 1.** Prediction of deep stromal infiltration of cervical cancer based on multiple preoperative blood markers using machine learning methods. (**A**) ROC curves for predicting deep stromal infiltration of cervical cancer based on all 75 peripheral blood markers, comparing machine learning methods with logistic regression. (**B**) Relative importance of variables for prediction of deep stromal infiltration calculated in the RF. Variable importance is represented as a percentage of the highest value. (**C**) Box and jitter plots representing the distribution of the top eight important parameters for distinguishing deep infiltration from non-deep infiltration. (**D**,**E**) Confusion matrix indicating the prediction quality of the RF classification for all predictions (**D**) and for those predictions with high (>0.2 bits) confidence (**E**). Notes: SCC, squamous cell carcinoma antigen; D-D, D-dimer; URIC, uric acid; ALP, alkaline phosphatase; TP, total protein; IgA, immunoglobulin A; LDH, lactate dehydrogenase; TT, thrombin time; PT(A), plasma prothrombin time ratio (A); MONO%, percentage of monocytes; HCT, hematocrit; HGB, hemoglobin; CK-MB, creatine kinase-MB isoenzyme; b1-G, beta 1 globulin; PT(r), plasma prothrombin time ratio (r).

#### *3.2. Differentiation of Lymph Node Metastasis of Cervical Cancer with Machine Learning Methods*

The status of lymph node metastasis was classified into two groups: "metastasis" and "non-metastasis". We found that Cforest showed the best prediction performance with an ACC of 64.3% and an AUC of 0.620 (Figure 2A; Supplemental Table S2), which achieved a 5.8% higher score than LR in AUC.

Next, the relative importance of a variable for segregating metastatic patients from non-metastatic ones was calculated for Cforest (Figure 2B). We identified the top eight factors, including SCC, IB2, IB1, MONO%, diameter, PT(A), HCT and TT, as important Cforest predictors for distinguishing metastatic patients from non-metastatic ones. It should be noted that as the clinical stage progresses, SCC and tumor diameter can increase. Standard box plots that presented the distribution of each variable between metastatic and non-metastatic samples are shown in Figure 2C.

Interestingly, we found that SCC was a critical variable. From the confusion matrix (Figure 2D), Cforest predictions had 105 false negative samples and 13 false positive samples. However, predictions designated as higher confidence (>0.2 bits from Shannon information gain computation) contained only 29 mispredictions out of 230 instances (Figure 2E). In particular, for the predictions with higher confidence, if a patient was predicted as non-metastasis, this was right at a rate of 1 − 29/230 = 87.4%.

**Figure 2.** Differentiation of lymph node metastasis of cervical cancer with machine learning methods. (**A**) ROC curves for predicting lymph node metastasis of cervical cancer based on all 75 peripheral blood markers, comparing machine learning methods with logistic regression. (**B**) Relative importance of variables for prediction of lymph node metastasis calculated in the Cforest. Variable importance is represented as a percentage of the highest value. (**C**) Box and jitter plots representing the distribution of the top eight important parameters for distinguishing metastasis from non-metastasis. (**D**,**E**) Confusion matrix indicating the prediction quality of the Cforest classification for all predictions (**D**) and for those predictions with high (>0.2 bits) confidence (**E**). Notes: SCC, squamous cell carcinoma antigen; MONO%, percentage of monocytes; PT(A), plasma prothrombin time ratio (A); HCT, hematocrit; TT, thrombin time; LDH, lactate dehydrogenase; D-D, D-dimer; PT(r), plasma prothrombin time ratio (r); HGB, hemoglobin; ALP, alkaline phosphatase; TP, total protein; URIC, uric acid; neut%, percentage of neutrophils; b1-G, beta 1 globulin; CK-MB, creatine kinase-MB isoenzyme; IgA, immunoglobulin A.

#### *3.3. Prediction of Lymph-Vascular Space Invasion of Cervical Cancer Based on Preoperative Blood Markers Using Machine Learning Methods*

In the task of lymph-vascular space invasion, patients were labeled as "invasion" or "non-invasion". LVSI refers to the presence of epithelial tumor cells in the lumen of vessels. "Invasion" indicated positive pathologic findings of LVSI and "non-invasion" indicated no pathologic proof of LVSI. We found that EN showed the best prediction performance, with an ACC of 59.7% and an AUC of 0.628, and the traditional method of multiple logistic regression analysis was comparable, with an ACC of 59.5% and an AUC of 0.627 (Figure 3A; Supplemental Table S2).

Next, the relative importance of each variable for segregating invasion from non-invasion was calculated for EN (Figure 3B). We identified the top eight factors, including RDW-SD, CK-MB, PCT, A/G, PT(A), IB1, TT and TBIL, as important EN predictors for distinguishing invasion patients from non-invasion ones. Standard box plots that present the distribution of each variable between invasion and non-invasion are shown in Figure 3C.

Interestingly, we found that RDW-SD was a critical variable. From the confusion matrix (Figure 3D), EN predictions had 180 false negative samples and 36 false positive samples. However, predictions designated as higher confidence (>0.2 bits from Shannon information gain computation) contained only 15 mispredictions out of 98 instances (Figure 3E). In particular, for the predictions with higher confidence, if a patient was predicted as non-invasion, it was right at a rate of 1 − 15/98 = 84.7%.

**Figure 3.** Prediction of lymph-vascular space invasion of cervical cancer based on preoperative blood markers using machine learning methods. (**A**) ROC curves for predicting lymph-vascular space invasion of cervical cancer based on all 75 peripheral blood markers, comparing machine learning methods with logistic regression. (**B**) Relative importance of variables for prediction of lymph-vascular space invasion calculated in the EN. Variable importance is represented as a percentage of the highest value. (**C**) Box and jitter plots representing the distribution of the top eight important blood markers for distinguishing invasion from non-invasion. (**D**,**E**) Confusion matrix indicating the prediction quality of the EN classification for all predictions (**D**) and for those predictions with high (>0.2 bits) confidence (**E**). Notes: RDW-SD, standard deviation of red blood cell distribution width; CK-MB, creatine kinase-MB isoenzyme; PCT, plateletcrit; A/G, albumin to globulin ratio; PT(A), plasma prothrombin time ratio (A); TT, thrombin time; TBIL, total bilirubin; TP, total protein; TBA, total bile acid; MCV, mean corpuscular volume; abdo\_surgery\_0.0, previous abdominal surgery; MONO%, percentage of monocytes; LDL-CHO, low density lipoprotein cholesterol; D-D, D-dimer; b2-MG, beta 2 microglobulin.

#### **4. Discussion**

In recent years, machine learning algorithms based on AI technology have been widely accepted and extensively utilized for diagnostic and prognostic assessment of various types of cancers in the context of precision medicine [11,21,22]. This approach, which offers high accuracy and an efficient ability to process complex data, can identify key related factors to assist in clinical decision making for cervical cancer treatment. More importantly, hidden and embedded patterns within familiar clinical data can be revealed with the aid of AI models. However, so far, no studies have integrated readily accessible clinical blood markers into AI-based models for predicting pathologic risk factors in cervical cancer. Our study allowed for the comparison of various machine learning algorithms with traditional logistic regression analysis to identify the approach with the most favorable performance and to explore serologic biomarkers with potential diagnostic potency. In cervical cancer with FIGO stage IB-IIA, radical hysterectomy followed by tailored adjuvant radiotherapy and concurrent chemoradiotherapy are both recommended as suitable treatment modalities [21]. Postoperative adjuvant radiotherapy is warranted for women with histopathologically verified risk factors, such as LVSI, LNM, DSI, etc., to improve prognosis [22–24], but it increases the risk of morbidity [25–27]. It is beneficial and meaningful to predict pathologic risk factors so as to identify those more likely to receive postoperative adjuvant radiotherapy and avoid compounding treatment-related morbidity. Currently, the inability to accurately identify those with a higher chance of receiving postoperative radiotherapy, and thus to achieve individualized medical management instead of a "one-size fits all" approach, is a primary clinical limitation. 
Therefore, predicting pathologic risk factors through comprehensive use of laboratory blood tests and other pretreatment information is a fundamental step toward individualized optimal medical care. In this study, we explored the ability of multiple machine learning methods to predict pathologic risk factors of patients with cervical cancer by incorporating readily available blood biomarkers. We found that three classifiers, RF, Cforest and EN, were able to predict pathologic risk factors of early-stage cervical cancer, among which RF showed the best predictive performance, with an accuracy of 70.8% and an AUC of 0.767 for DSI. Cforest showed the most accurate predictive value for LNM (64.3% accuracy and 0.620 AUC), and EN for LVSI (59.7% accuracy and 0.628 AUC). Compared to the traditional approach of logistic regression analysis, the RF classifier achieved a 5.4% higher score of AUC in DSI prediction, Cforest achieved a 3.4% higher score of AUC in LNM prediction and EN showed almost the same performance in LVSI prediction. The underperformance of these classifiers with regard to LNM and LVSI may be attributable to the lack of particularly strong distinctions in serum biomarkers at an early stage of cervical cancer. Nevertheless, the results indicate that AI technology can provide valuable predictive information before primary treatment to facilitate individualized medical strategy. In addition, based on the optimal results of machine learning algorithms, this study may offer useful clinical information concerning the variables that are of most importance for identifying pathologic risk factors, like DSI, in early-stage patients.

Previous evidence has suggested that cancer is a metabolic disease associated with inflammation [28]. Cervical cancer harbors a unique collection of inflammatory and metabolic molecules in the serum [29]. In early-stage cervical cancer, local inflammatory processes may be at an initial state in which the peritumoral microenvironment perhaps alters the most, while distant and systemic metabolic features and cancer-target responses are immunosuppressed [30], so that differences in cancer invasiveness are only faintly reflected in serum markers. Understandably, as tumor bulk increases, tumor burden worsens, leading to cancer invasiveness. In this study, we found that squamous cell carcinoma antigen (SCC), D-dimer and uric acid (UA) levels were among the top five predictors of DSI. SCC has been considered the most important diagnostic and prognostic tumor marker in cervical cancer. Many studies demonstrated that an elevated level of pretreatment serum SCC was closely associated with disease progression and recurrence [31,32]. UA is a powerful antioxidant and is considered a protective factor against cancer [33]. It has been reported that an elevated level of UA was associated with cancer risk, aggressiveness and poor oncologic outcomes in various cancer types [34–36], but few studies have focused on gynecologic cancer. Interestingly, previous studies have also shown a prooxidant role of UA [37], and lower levels of UA were associated with an elevated risk of cancer-related mortality compared with high levels [38]. The precise relation of UA with cancer, especially cervical cancer, needs further study. D-dimer serves as a valuable marker of activation of coagulation and fibrinolysis, and is also known as a biomarker of cancer prognosis, especially in metastasized patients [39–41]. 
The pretreatment prediction model of DSI in cervical cancer performed well and revealed potential meaningful serum biomarkers that were readily available in clinical settings, which is also consistent with previous studies. This study's findings suggest that the supervised machine learning analysis serves as a feasible and effective approach that can aid in discovering more meaningful biomarkers that are correlated with PRF in cervical cancer and are not identified by conventional multiple regression analysis.

Identification of reliable pretreatment blood markers associated with pathologic risk factors helps clinicians in clinical decision making [42]. In this study, we found some serologic indicators, such as RDW-SD, that had scarcely been reported to be related to the diagnosis and prognosis of cervical cancer in previous studies. We found that RDW was the top predictive indicator for LVSI. RDW is a routinely measured hematological index, primarily reflecting the degree of anisocytosis. It has been reported that this simple and inexpensive parameter is a strong and independent risk factor for death in the general population [43]. Research has demonstrated that an aberrantly elevated level of RDW is associated with poor survival outcomes in most tumor types and stages, independent of age, gender or region [44]. However, little is known about RDW in cervical cancer. One recent study indicated that RDW was associated with worse prognosis in cervical cancer [45]. Excessive oxidative stress, inflammation, and cell senescence have been proposed as the conditions through which RDW associates closely with mortality [46,47]. Analysis of more datasets is still needed to confirm the predictive ability of these factors. Given the usefulness of pretreatment blood markers, the dynamic detection of serological indicators over multiple time periods may be more powerful in prediction. As the dynamic analysis of serological indicators is more complex, future studies should develop artificial intelligence-based machine learning algorithms to identify predictive features in time series of preoperative blood variables, which might significantly improve the accuracy of predicting clinical characteristics.

As tumors progress over time, the signal transduction and correlation between the tumor and its microenvironment, including fibroblasts, tumor-related immune cells and endothelial cells, become increasingly close [48]. The changes in peripheral blood parameters before surgery are inherently a combination of tumor-specific and microenvironment-specific factors and the result of the interaction between tumor and microenvironment. Given the importance of the tumor microenvironment in the process of tumor development, clinicians should make full use of preoperative peripheral blood indicators for treatment decision making, cancer progression evaluation and prognosis assessment. In previous studies, clinicians often ignored what regular blood biomarkers reflect about the biological characteristics of tumors and relied almost exclusively on tumor-specific factors as indicators for assessment, which was also a common problem in previous retrospective analyses of tumors. In this study, we used machine learning methods to identify a series of blood indicators related to pathologic risk factors that are readily available and necessary for preoperative evaluation, such as UA, D-dimer, thrombin time, AST, MONO%, RDW-SD, etc. These parameters have the potential to be related to the microenvironment in cancer progression or metastasis, and their changes will also influence treatment timing and selection.

There have been a few previous studies exploring the use of serologic biomarkers to predict PRF. One study [49] in 2016 incorporated clinical factors and three blood markers derived from pretreatment routine blood examination to predict LNM, patients' overall survival and recurrence-free survival. The authors found that the platelet/lymphocyte ratio was significantly associated with LNM. Another study [50] in 2020 found that the pretreatment albumin to fibrinogen ratio was significantly related to lymph node metastasis, depth of stromal infiltration, etc. Many studies focused on prediction of survival outcomes or a single PRF of cervical cancer based on clinical factors [51–53] and/or radiomic parameters [54,55]. However, no studies have attempted to predict three PRFs based on a series of clinically readily available blood markers. In addition to data analysis methods based on clinical factors, many studies are still exploring new approaches to predicting postoperative pathologic risk factors. It is clear that pathologic risk factors can only be accurately diagnosed from the postoperative pathology report of cervical cancer. Identification of reliable approaches that are able to predict pathologic risk factors in advance would facilitate more accurate diagnostic stratification and a more appropriate treatment strategy. A previous study indicated that DSI can be determined by combining 2D or 3D ultrasound with clinical variables before treatment, with over 70% accuracy and AUC [56]. However, this diagnostic approach depended more on subjective judgment than on objective parameters and was based on relatively few cases. It was reported that the assessment of cervical cancer with full-thickness stromal invasion by MRI examination was limited [57]. 
In Bidus's study, the conical method combined with clinical factors to determine DSI and LVSI before treatment also achieved good accuracy but this method is a destructive examination and may easily interfere with the complete resection of radical surgery [58]. In the study of LNM diagnosis, sentinel node staining is currently the most commonly developed method, but it is only used to determine whether complete lymph node resection is performed before surgery [59,60]. In this study, LNM was associated closely with primary tumor size as staging and tumor diameter were among the top five predictors for LNM. Results indicated that imaging materials, such as MRI, reflecting the visual size of the tumor itself and enlarged lymph nodes would potentially provide more accurate predictive information preoperatively. However, previous studies also used magnetic resonance imaging (MRI) and ultrasound to determine lymph node metastasis, but imaging data could only determine lymphadenectasis rather than tumor cell metastases in most cases, which leads to the unsatisfactory accuracy of the prediction model [56,61]. This is a reminder that traditional data analysis on simple integration of imaging information is not adequate enough to achieve LNM prediction. It is promising to achieve more comprehensive and precise prediction by virtue of effective integration of high-throughput extraction of a large amount of information from images based on AI technology, which will be the focus of our subsequent research. As the approach used in this study did not consider any information from pretreatment biopsies or imaging studies, there may be a limitation of the ability to predict pathologic risk factors before initial treatment; indeed, more independent datasets from other institutions are required to investigate how pretreatment blood signatures can be utilized for more accurate assessment of pathologic risk factors. 
High-throughput sequencing analysis, such as RNA sequencing, of pretreatment peripheral blood may improve predictive performance; however, incorporating RNA analysis information into preoperative assessment may be more complicated and expensive in current clinical settings. Further comprehensive investigation is needed in the hope of achieving the best clinical and socioeconomic benefits.

Our study has some limitations. Firstly, this study was a single-center retrospective study. The retrospective nature may result in inherent bias. Secondly, results from our database should be supplemented with external and prospective validation for prevention of overfitting as well as further spread of application in clinical practice. Thirdly, other machine learning approaches should be undertaken to manage the missing data in future work. Fourthly, our assessment of diagnostic ability to predict pathological risk factors was preliminary, and further study is warranted to better validate the accuracy of blood biomarkers. At present, our model is not sufficiently powerful and accurate to predict LVSI and LNM, but some blood biomarkers have been revealed for the first time that may be potentially useful predictors from a large number of variables. However, a positive prediction is not trivial; compared with traditional methods, the machine learning algorithms could serve as a feasible tool for clinicians to predict oncologic outcomes based solely on pretherapeutic information.

#### **5. Conclusions**

This study indicates that AI-based algorithms are useful tools that may aid in providing critical information for diagnostic evaluation of pathologic risk factors in patients with cervical cancer before initial treatment. The use of predictive algorithms may facilitate personalized treatment selection through pretherapeutic assessment.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/curroncol29120755/s1, Table S1: Pretreatment peripheral blood tests of 1260 cervical cancer patients included in the primary cohort; Table S2: Diagnostic accuracy of clinicopathological factors using machine learning algorithms.

**Author Contributions:** Conceptualization, D.Z., Y.Y. and B.L.; methodology, D.Z.; formal analysis, Z.O.; investigation, Z.O., W.M., L.T., S.L. and Y.Z.; resources, D.Z. and B.L.; data curation, Z.O., W.M., S.L. and Y.Z.; writing—original draft preparation, Z.O. and W.M.; writing—review and editing, D.Z. and W.M.; visualization, W.M.; supervision, D.Z.; project administration, D.Z. and B.L.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (D.Z., grant number 62176267), the Natural Science Foundation of Qinghai Province (D.Z., grant number 2021-ZJ-922); the CAMS Innovation Fund for Medical Sciences (D.Z., grant number 2021-I2M-C&T-B-048), the Beijing Hope Run Special Fund of Cancer Foundation of China (D.Z., grant number LC2021A10) and Capital's Funds for Health Improvement and Research (D.Z., grant number 2022-2-4026).

**Institutional Review Board Statement:** Ethical review and approval were waived for this study due to the retrospective nature of the data.

**Informed Consent Statement:** Patient consent was waived due to the retrospective nature of the study.

**Data Availability Statement:** The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


### *Article* **Using Whole Slide Gray Value Map to Predict HER2 Expression and FISH Status in Breast Cancer**

**Qian Yao 1,†, Wei Hou 1,†, Kaiyuan Wu 2, Yanhua Bai 1, Mengping Long 1, Xinting Diao 1, Ling Jia 1, Dongfeng Niu 1,\* and Xiang Li 2,\***


**Simple Summary:** HER2 expression is important for target therapy in breast cancer patients, however, accurate evaluation of HER2 expression is challenging for pathologists owing to the ambiguities and subjectivities of manual scoring. We proposed a deep learning framework using a Whole Slide gray value map and convolutional neural network model to predict HER2 expression level on immunohistochemistry (IHC) assay and predict HER2 gene status on fluorescence in situ hybridization (FISH) assay. Our results indicated that the proposed model is feasible for predicting HER2 expression and gene amplification and achieved high consistency with the experienced pathologists' assessment. This unique HER2 scoring model did not rely on challenging manual intervention and proved to be a simple and robust tool for pathologists to improve the accuracy of HER2 interpretation and provided a clinical aid to target therapy in breast cancer patients.

**Abstract:** Accurate detection of HER2 expression through immunohistochemistry (IHC) is of great clinical significance in the treatment of breast cancer. However, manual interpretation of HER2 is challenging, due to the interobserver variability among pathologists. We sought to explore a deep learning method to predict HER2 expression level and gene status based on a Whole Slide Image (WSI) of the HER2 IHC section. When applied to 228 invasive breast carcinoma of no special type (IBC-NST) DAB-stained slides, our GrayMap+ convolutional neural network (CNN) model accurately classified HER2 IHC level with mean accuracy 0.952 ± 0.029 and predicted HER2 FISH status with mean accuracy 0.921 ± 0.029. Our result also demonstrated strong consistency in HER2 expression score between our system and experienced pathologists (intraclass correlation coefficient (ICC) = 0.903, Cohen's *κ* = 0.875). The discordant cases were found to be largely caused by high intra-tumor staining heterogeneity in the HER2 IHC group and low copy number in the HER2 FISH group.

**Keywords:** breast cancer; HER2; artificial intelligence; deep learning; immunohistochemical (IHC) scoring

**Citation:** Yao, Q.; Hou, W.; Wu, K.; Bai, Y.; Long, M.; Diao, X.; Jia, L.; Niu, D.; Li, X. Using Whole Slide Gray Value Map to Predict HER2 Expression and FISH Status in Breast Cancer. *Cancers* **2022**, *14*, 6233. https://doi.org/10.3390/cancers14246233

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 10 November 2022 Accepted: 14 December 2022 Published: 17 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

Breast cancer is the most commonly diagnosed cancer and seriously threatens the life and health of women all over the world, with high morbidity and mortality rates of 24.5% and 15.5%, respectively [1]. The HER2 (human epidermal growth factor receptor-2) gene, located at chromosome 17q12–21, plays an important role in the development of breast cancer. Fifteen to twenty percent of breast cancer patients are HER2 positive, including HER2 gene amplification and/or overexpression. HER2-positive breast cancer has poor clinical outcomes [2,3], but fortunately, there is a targeted drug, trastuzumab (Herceptin), which can effectively improve the prognosis [4,5]. HER2 gene amplification assessed by in situ hybridization (ISH) or protein overexpression assessed by IHC remains the primary predictor of responsiveness to HER2-targeted therapies and a key prognostic biomarker in breast cancer [6]. According to the latest American Society of Clinical Oncology (ASCO)/College of American Pathologists (CAP) guideline [6], all newly diagnosed patients with breast cancer must have a HER2 test performed. In routine clinical practice, the IHC test is performed first. The IHC test gives a score of 0, 1+, 2+, or 3+ that measures the amount of HER2 receptor protein on the surface of cells in a breast cancer tissue sample. A score of 3+ indicates the strongest staining, and the patient is then diagnosed as HER2 positive. A score of 2+ is also known as the equivocal level. Fluorescence in situ hybridization (FISH) must be performed to further decide the HER2 status for patients with an IHC 2+ score. Therefore, accurate and efficient HER2 IHC evaluation is important for the diagnosis and treatment of breast cancer patients. In the HER2 IHC test, the HER2 receptor protein is commonly stained with 3,3′-diaminobenzidine (DAB), which has a brown color, while hematoxylin staining, which has a blue color, is applied to visualize the cell nuclei. The stained slide is manually assessed by pathologists under the microscope. Although many countries have implemented national testing guidelines to standardize testing procedures and make results more accurate, the procedure is subjective and semi-quantitative and quite often leads to high inter- and intra-observer variation [7–9]. Therefore, there is an urgent need for an objective and consistent HER2 evaluation system.

Many researchers are devoted to developing computer-aided solutions, semi-automatic or fully automatic, to address the ambiguities and subjectivities of manual scoring. Compared to manual scoring, a computer-aided solution can decrease human error, increase the accuracy of diagnosis, reduce the workload of pathologists, and standardize the scoring systems [10,11]. Pathology whole slide images (WSIs) have billions of pixels, which makes them too large to process in a single-shot end-to-end way, i.e., processing a WSI as a traditional image, even on modern computers. Fully automatic methods usually have the following three steps: the WSI is first split into small image patches, e.g., 512 × 512 pixels; information is then extracted from each patch; and finally the patch-level information is summarized into a WSI-level result. Semi-automatic methods, in contrast, need pathologists to manually select regions of interest in the WSI. Masmoudi et al. [12] presented a method for automated assessment of HER2 IHC staining. They first used a linear classification model on the color information of pixels to discriminate the membrane pixels and nuclei pixels; the watershed algorithm and adaptive ellipse fitting were then applied to segment the nuclei and cell membrane. Finally, slides were classified into one of the three scoring groups based on features describing the membrane staining intensity and completeness. In contrast to the work of Masmoudi et al., HER2CONNECT found that the distribution of the area of the connected brown color components (the stained membranes) in the core invasive cancer region correlated well with the HER2 expression level and can therefore be used to predict the HER2 score. Their method reached 92.3% agreement between the software and the score given by the pathologist [13]. Ruifrok et al. [14] proposed a color deconvolution method to deconvolute and quantify the contributions of each stain in the histochemical slide. 
Motivated by the color deconvolution method, many researchers were devoted to quantifying the gray level of the HER2 IHC slide. ImmunoMembrane, a web-based application, utilized color deconvolution to separate stained membranes and then designed the IM-score, which is the sum of a membrane completeness score and a membrane intensity score, to classify HER2 scores [15]. Kabakci et al. [16] characterized the cell membrane staining intensity in a comprehensive way using the so-called Membrane Intensity Histogram (MIH) method, which describes the distribution of the staining intensity in different directions.

Deep Learning (DL) models are increasingly being used in various application areas such as computer vision, natural language processing, text or image classification, sentiment analysis, recommender systems, user profiling, etc. [17,18]. Compared to handcrafted feature engineering, one of the major advantages of DL models is the automatic learning of feature representations with high representational power, which gives DL models much more versatility when dealing with large datasets and complex problems. Saha et al. [11] developed a cell segmentation model using Trapezoidal LSTM units and performed HER2 scoring based on the segmented membranes. However, Saha et al. used 2048 × 2048 patches rather than the entire WSI. Qaiser et al. [19] also achieved patch-level HER2 scoring with the help of reinforcement learning. Zhen Chen et al. [20] proposed a Focal-Aware Module to estimate diagnosis-related regions and a Relevance-enhanced Graph Convolutional Network to summarize information extracted from different levels of the original WSI.

Recently, DL models have attracted increasing attention for predicting gene expression status from WSIs [21–24]. The diagnosis label is usually provided at the WSI level and therefore cannot be treated as the label of each input to the underlying model. Multiple instance learning (MIL) is often implemented to overcome this issue. In this paper, we propose a new artificial intelligence (AI) method to predict HER2 protein expression level and gene status using the WSIs. Instead of using manual strong labels for patch-level images or using MIL on a slide-level labeled dataset, we first calculate unsupervised features for each patch image, i.e., the gray level and the gray level area fraction, and generate a slide-level feature map in which each patch is represented by its patch-level features. In this way, we can reduce the input size of the original slide. Then we build a multi-task deep learning model to predict HER2 protein expression level and gene amplification status simultaneously.

#### **2. Material and Methods**

Figure 1 shows the workflow of our study.

**Figure 1.** The workflow of our study includes the main steps for preprocessing slides and training the deep learning model. The numbers below the model block give the channel number respectively.

#### *2.1. Human Subjects*

We selected 228 biopsy cases of IBC-NST with both IHC and FISH information which were collected between 2010 and 2021 from the department of pathology, Peking University Cancer Hospital & Institute. All subjects were female. Our study obtained permission from the Peking University Cancer Hospital Institutional Review Board and Ethics Committee (Grant: 2022KT15).

#### *2.2. Immunohistochemical Staining*

A commercially available primary HER2 antibody (4B5, Roche Ventana) was applied. Immunohistochemical stains were performed on a Ventana Benchmark automated immunostainer (Tucson, Arizona), following the vendor's protocol. The appropriate positive and negative controls were included for each run. HER2 immunoexpression was evaluated as 0, 1+, 2+, and 3+ based on the 2018 ASCO/CAP guideline [6] by three experienced pathologists (Q.Y., D.N., and Y.B.). To limit intra-rater variability, the three pathologists were blinded to the initial manual evaluation and AI-based scores, and all the cases were reviewed a second time after a 4-week washout period. The discrepant cases were reviewed again to reach the final score.

#### *2.3. Fluorescence In Situ Hybridization*

HER2 FISH was carried out using the PathVysion HER2 DNA Probe Kit (Abbott Molecular, Abbott Park, Illinois) following the manufacturer's instructions. Two experienced pathologists (D.N. and Y.B.) evaluated the HER2 copy number, the CEP17 copy number, and their ratio in 20 tumor cells, independently and blinded to the IHC results. FISH results were recorded as negative or positive according to the 2018 ASCO/CAP guideline. In detail, HER2 FISH results were assigned to five groups: group one (G1, HER2/CEP17 ratio ≥ 2.0; average HER2 copy number ≥ 4.0/cell); group two (G2, HER2/CEP17 ratio ≥ 2.0; average HER2 copy number < 4.0/cell); group three (G3, HER2/CEP17 ratio < 2.0; average HER2 copy number ≥ 6.0/cell); group four (G4, HER2/CEP17 ratio < 2.0; 4.0 ≤ average HER2 copy number < 6.0/cell); and group five (G5, HER2/CEP17 ratio < 2.0; average HER2 copy number < 4.0/cell) [6]. G1 was considered FISH positive and G5 FISH negative. For G2 and G4, the HER2 IHC result is evaluated in addition; if it is not 3+, the case is considered HER2 negative. In G3 cases, when concurrent IHC results are negative (0 or 1+), it is recommended that the specimen be considered HER2 negative.
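The grouping rules above can be written as a short decision function. The following is only a sketch of the thresholds as listed in this section; the function names are hypothetical, and combinations whose final status is not spelled out in the text (for example G3 with IHC 2+ or 3+) are deliberately returned as `None` rather than guessed.

```python
from typing import Optional


def fish_group(ratio: float, her2_copies: float) -> str:
    """Assign the FISH group (G1-G5) from the HER2/CEP17 ratio and the
    average HER2 copy number per cell, using the thresholds listed above."""
    if ratio >= 2.0:
        return "G1" if her2_copies >= 4.0 else "G2"
    if her2_copies >= 6.0:
        return "G3"
    if her2_copies >= 4.0:
        return "G4"
    return "G5"


def her2_status(group: str, ihc_score: str) -> Optional[str]:
    """Final HER2 status for the cases that the text states explicitly."""
    if group == "G1":
        return "positive"
    if group == "G5":
        return "negative"
    if group in ("G2", "G4") and ihc_score != "3+":
        return "negative"
    if group == "G3" and ihc_score in ("0", "1+"):
        return "negative"
    return None  # not specified in this section; left undecided on purpose
```

For instance, a case with a ratio of 1.5 and 4.8 HER2 signals per cell falls into G4 and, with an IHC score of 2+, is treated as negative.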

#### *2.4. Image Processing*

The digitized whole-slide images (WSIs) were acquired using a Leica Aperio Versa pathologic scanner (Aperio, Leica Biosystems Imaging, Inc.) and viewed at 400× magnification using Leica ImageScope software. The order of magnitude of pixels was 10<sup>9</sup> ∼ 10<sup>10</sup>.

Figure 1 shows the flowchart of the method. The whole slide image was first partitioned into 512 × 512 patches. Then for each small patch image, we segment the membrane pixels using color deconvolution and the k-means method (k-means parameters: number of clusters is 3, the maximum number of iterations is 50, number of redos is 10). After the membrane segmentation, we evaluate the gray value and membrane pixels fraction of each patch. The original WSI is profiled into three maps. In the following, we describe the procedure in detail.
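As a rough illustration of the partitioning step, the sketch below enumerates the 512 × 512 patch origins for a slide of a given pixel size and reports the shape of the resulting slide-level maps. The function names are hypothetical, and discarding partial patches at the right and bottom edges is an assumption; the text does not state how edge regions are handled.

```python
from typing import Iterator, Tuple

PATCH = 512  # patch edge length in pixels, as stated in the text


def patch_grid(width: int, height: int, patch: int = PATCH) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row, col, x0, y0) for every full patch that fits in the slide."""
    for row in range(height // patch):
        for col in range(width // patch):
            yield row, col, col * patch, row * patch


def map_shape(width: int, height: int, patch: int = PATCH) -> Tuple[int, int]:
    """Shape (rows, cols) of a slide-level map with one value per patch."""
    return height // patch, width // patch
```

Under these assumptions, a slide of 100,000 × 50,000 pixels is reduced to three maps of 97 × 195 values each, which is what makes the slide small enough for a standard CNN.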

#### *2.5. Membrane Segmentation*

The DAB signal is mainly located at the membrane. In the following, we introduce the membrane segmentation method, which is based on color deconvolution and the k-means method. Ruifrok et al. applied the Beer-Lambert law to model the stained slide image and proposed the color deconvolution method to separate and quantify immunohistochemical staining [14]. According to the Beer-Lambert law,

$$I_{c} = I_{0,c} \, 10^{-A C_{c}} \tag{1}$$

where *I*<sub>*c*</sub> is the intensity of light detected after passing the specimen, *I*<sub>0,*c*</sub> is the intensity of light entering the specimen and *A* is the amount of the stain with absorption factor *C*. The subscript *c* indicates the detection channel. By assuming a linear relation between stain concentration and absorbance, Ruifrok proposed the following color deconvolution method,

$$A = -\log_{10}\left(\frac{I}{I_0}\right) \times OD^{-1} \tag{2}$$

where *A* is a vector representing the amount of different stains, *I* is the transmitted light intensity, i.e., the detected slide image, *OD* is the normalized optical density matrix, which can be measured experimentally. In the analysis of the HER2 IHC slide, because there are only two kinds of stains, we use the following normalized *OD* matrix

$$OD = \begin{pmatrix} 0.650 & 0.704 & 0.286 \\ 0.268 & 0.570 & 0.776 \\ 0.636 & -0.710 & 0.302 \end{pmatrix} \tag{3}$$

where the first two row vectors correspond to the *OD* vectors of hematoxylin and DAB [14] and the last row vector is the normalized cross product of the hematoxylin and DAB *OD* vectors. Following the convention of the color deconvolution code given in the Color Deconvolution 2 ImageJ plugin, we use $A = -\log_{10}\left(\frac{I}{255}\right) \times OD^{-1}$ to deconvolute the original slide image.
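To make Equations (2) and (3) concrete, the following minimal sketch deconvolutes a single 8-bit RGB pixel with the *OD* matrix given above. It uses only the Python standard library to stay dependency-free; in practice an array library would be used on whole patches. The helper names and the clamping of zero intensities to 1 (to avoid taking the logarithm of zero) are assumptions and are not taken from the paper or the ImageJ plugin.

```python
import math

# Rows: hematoxylin, DAB, and their normalized cross product (Equation (3)).
OD = [
    [0.650, 0.704, 0.286],
    [0.268, 0.570, 0.776],
    [0.636, -0.710, 0.302],
]


def normalized_cross(u, v):
    """Unit-length cross product of two 3-vectors."""
    c = [u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]]
    n = math.sqrt(sum(x * x for x in c))
    return [x / n for x in c]


def inverse_3x3(m):
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]


OD_INV = inverse_3x3(OD)


def deconvolve(rgb, floor=1.0):
    """Stain amounts (hematoxylin, DAB, residual) for one 8-bit RGB pixel."""
    # Optical density per channel; intensities are clamped to `floor` (assumed).
    od = [-math.log10(max(v, floor) / 255.0) for v in rgb]
    # Row vector times OD^{-1}, as in A = -log10(I / 255) x OD^{-1}.
    return [sum(od[k] * OD_INV[k][j] for k in range(3)) for j in range(3)]
```

The second component of the returned list corresponds to the DAB channel used in the rest of this section; `normalized_cross(OD[0], OD[1])` reproduces the third row of the matrix to three decimals.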

After color deconvolution, the value of the 2nd channel corresponds to the intensity of the DAB stain. We then apply the k-means method to the original image. The image is first converted from RGB to Luv color to get better perceptual uniformity which is more suitable for clustering analysis. Define the distance between pixels *p*, *q*:

$$D(p,q) = \sqrt{\left(L_p - L_q\right)^2 + \left(u_p - u_q\right)^2 + \left(v_p - v_q\right)^2} \tag{4}$$

where (*L*<sub>*p*</sub>, *u*<sub>*p*</sub>, *v*<sub>*p*</sub>) and (*L*<sub>*q*</sub>, *u*<sub>*q*</sub>, *v*<sub>*q*</sub>) are the Luv values of pixels *p* and *q*, respectively. Based on the distance *D*(*p*, *q*), we use the k-means algorithm to cluster the pixels in the slice into three clusters, which correspond to the stained cell membrane region, the nuclei region, and the complementary region, respectively. Finally, we calculate the mean gray value of each pixel group according to the DAB channel calculated previously. We select the group with the highest mean gray value as the cell membrane. Figure 2A–D gives an illustration of the cell membrane segmentation.
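The clustering and selection steps described above can be sketched as follows. This is a plain Lloyd's k-means on Luv triples with the distance of Equation (4), followed by picking the cluster with the highest mean DAB value. It is an illustration under assumptions: the inputs are taken to be already converted to Luv, the deterministic farthest-point seeding replaces the 10 restarts reported in Section 2.4 (the initialisation used in the paper is not stated), and all function names are hypothetical.

```python
import math


def luv_distance(p, q):
    """Euclidean distance between two (L, u, v) triples, Equation (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k=3, max_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-point seeding (assumed): start from the first point, then
    # repeatedly add the point farthest from all centres chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(luv_distance(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: luv_distance(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def membrane_cluster(labels, dab, k=3):
    """Index of the cluster with the highest mean DAB-channel value."""
    means = {}
    for j in range(k):
        vals = [d for d, lab in zip(dab, labels) if lab == j]
        if vals:
            means[j] = sum(vals) / len(vals)
    return max(means, key=means.get)
```

Here `points` holds one Luv triple per pixel and `dab` the matching DAB-channel values from the deconvolution step; the index returned by `membrane_cluster` marks the pixels treated as cell membrane.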

#### *2.6. Gray Value Map*

In this section, we describe the gray value map which integrates patch-level gray value information to get slide-level gray value information. After segmentation of the cell membrane of each patch image, we calculate the mean gray value and membrane pixel fraction of each patch image. We find that the value of the DAB channel cannot reflect well when the visual gray value is greater than 8, as shown in Figure 2E. By checking the RGB channel value of the membrane pixels, we find that this effect is partially caused by the saturation of the blue channel. It is unclear whether this is truly caused by the stain absorbing all blue light or whether there are some other effects of the hardware device. We notice that the Lightness channel of Luv color space generally reflects the visual gray level except the low gray value range. Therefore, we add the Lightness channel value to the gray value map and build the model to automatically fuse the information. In summary, the gray value *A*, membrane pixel fraction *F,* and Lightness value *L* at patch level are defined as:

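$$A = \operatorname{mean}_{i} A_i, \quad \text{where the mean is over all pixels } i \text{ in the membrane cluster,}$$

$$F = \frac{\text{number of pixels in membrane cluster}}{\text{total number of pixels}},$$

$$L = \operatorname{mean}_{i} L_i, \quad \text{where the mean is over all pixels } i \text{ in the membrane cluster.}$$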
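The three patch-level quantities defined above amount to two means and one ratio over the membrane cluster. A minimal sketch, assuming per-pixel DAB values, Luv lightness values and cluster labels are already available as flat lists; the function name and the all-zero return value for a patch without membrane pixels are assumptions, not details from the paper.

```python
def patch_features(dab, lightness, labels, membrane):
    """Return (A, F, L) for one patch.

    dab, lightness: per-pixel DAB-channel and Luv lightness values.
    labels: per-pixel cluster index; membrane: index of the membrane cluster.
    """
    idx = [i for i, lab in enumerate(labels) if lab == membrane]
    if not idx:  # no membrane pixels in this patch (assumed convention)
        return 0.0, 0.0, 0.0
    a = sum(dab[i] for i in idx) / len(idx)        # mean gray value A
    f = len(idx) / len(labels)                     # membrane pixel fraction F
    l = sum(lightness[i] for i in idx) / len(idx)  # mean lightness L
    return a, f, l
```

Writing one such triple per patch into the grid position of that patch yields the three slide-level maps that serve as the CNN input channels.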

Figure 3 shows the gray value map of IHC HER2 expression 0/1+, 2+, and 3+ cases.

**Figure 2.** Cell membrane segmentation and the schematic of Graymap. (**A**) raw section of HER2 3+ and HER2 0/1+. (**B**–**D**) are three groups of K-Means output. The gray values are labeled on the images respectively. (**E**) The mean RGB value of different gray value membrane pixels. The bottom color bar is an RGB color map of different gray values.

**Figure 3.** Examples of GrayMap of HER2 IHC expression. Typical examples of HER2 0/1+, 2+, 3+ cases in IBC-NST. From top to bottom: HER2 IHC raw images, magnified images, cell membrane segmentation, and pixels' gray value's distribution of the images.

#### *2.7. Multitask Convolutional Neural Network (CNN)*

After getting the gray value map of the whole slide, we further utilize a multi-task CNN model to classify the IHC HER2 expression level and the FISH status simultaneously. We use ResNet18 with a base channel number of 64 as our backbone network. After the backbone network, we attach two task branches corresponding to the IHC HER2 expression classification and the FISH status classification, respectively. For each task branch, we use the sigmoid cross-entropy loss as the classification loss and add a dropout layer before the last fully connected layer. All ReLU activations are replaced with PReLU to avoid the ReLU blow-up issue caused by the lack of pretrained weight initialization.
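The two-branch objective can be illustrated with a framework-free sketch of the sigmoid cross-entropy loss on raw logits. Several details here are assumptions rather than facts from the paper: the IHC branch is taken to output three logits (0/1+, 2+, 3+) with one-hot targets, the FISH branch a single logit, and the two losses are simply summed with equal weights.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Mean binary cross-entropy on raw logits (numerically stable form)."""
    terms = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(terms) / len(terms)


def multitask_loss(ihc_logits, ihc_onehot, fish_logits, fish_targets,
                   w_ihc=1.0, w_fish=1.0):
    """Weighted sum of the IHC and FISH branch losses (weights are assumed)."""
    return (w_ihc * sigmoid_cross_entropy(ihc_logits, ihc_onehot)
            + w_fish * sigmoid_cross_entropy(fish_logits, fish_targets))
```

Because both branches share the backbone, a single backward pass through this summed loss updates the shared features with gradients from both tasks.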

Data augmentation techniques and manually synthesized images are used to overcome the overfitting issue caused by the lack of training data samples. We add random rotation (−180, +180), random crop (512, 512) (raw training input size is (680, 680)), random horizontal flip, and random vertical flip data augmentations. We also manually synthesize an image for each original data sample by first manually drawing a mask on a random sample that has the same FISH status and the same fold-id, but a lower HER2 expression level than the target sample, and then pasting the masked part of the selected sample into the target sample's blank space. In this way, we partially increase our training dataset.

The model is implemented in PyTorch using the MMDetection framework and trained with the Adam optimizer with a cosine learning rate policy (learning rate parameters: base learning rate is 0.001, the minimum learning rate is 1.0 × 10<sup>−8</sup>). We utilized the 5-fold cross-validation method to evaluate the model. The mean and standard deviation were calculated using the predictions on each fold to demonstrate the model performance and stability. Evaluation metrics including precision, recall, F1-score, Jaccard Index, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve (AUC) were calculated for binary FISH status prediction. Evaluation metrics including accuracy, F1-score, Cohen's kappa coefficient (*κ*), and Matthews correlation coefficient (MCC) were calculated for multiclass IHC prediction using macro average mode.
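For reference, the binary metrics named above follow directly from the four confusion-matrix counts, and Cohen's *κ* from a square confusion matrix. The sketch below uses the standard textbook definitions; it is not the evaluation code of the paper, the function names are hypothetical, and it assumes non-zero denominators. MCC and AUC are omitted for brevity.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "jaccard": tp / (tp + fp + fn),
    }


def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix
    (rows: reference labels, columns: predicted labels)."""
    n = sum(map(sum, cm))
    k = len(cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals per class.
    expected = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)
```

For a single confusion matrix the Jaccard Index and F1-score are linked by J = F1 / (2 − F1); the relation need not hold exactly for values averaged over folds.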

#### **3. Results**

#### *3.1. HER2 IHC Status Classification Using GrayMax Model*

In the first step, we obtained the manual results of HER2 IHC and HER2 FISH. HER2 IHC was evaluated by three experienced pathologists. To further reduce the inter-observer variability, we used the median of the three pathologists' scores whenever the scores differed. The details of the HER2 status, including IHC and FISH results, are shown in Table 1. According to the 2018 ASCO/CAP clinical practice guideline, the cutoff of HER2 IHC staining is 10%, which means the 10% strongest staining of HER2 IHC can be chosen as the representative score of the whole slide. We therefore first used the maximum gray value of all patches to represent the gray value of the WSI (the GrayMax model) and compared it with the median HER2 scores of the pathologists. However, under 5-fold cross-validation, the GrayMax model showed relatively inferior performance, with an average accuracy of 0.842 ± 0.023, F1-score of 0.665 ± 0.078, Cohen's *κ* of 0.640 ± 0.063 and MCC of 0.663 ± 0.058 (Table 2). We analysed the details of our model and found errors in cases with heterogeneous staining, nonspecific cytoplasmic staining, an invasive micropapillary carcinoma component, a mucinous carcinoma component, a ductal carcinoma in situ (DCIS) component, or interference from necrotic regions.

**Table 1.** Summary of the cohort of the different HER2 statuses.



**Table 2.** Performance comparison of GrayMax and GrayMap + CNN methods by cross-validation classification.

Abbreviation: Avg, Average value; Std, Standard deviation.

#### *3.2. HER2 IHC Status Classification Using GrayMap + CNN Model*

To solve the issues of the GrayMax model, we developed a new method to classify the HER2 IHC status. The main issue of the GrayMax model is that a single maximum gray value cannot represent the information of the whole slide. Therefore, we first used the GrayMap of the original whole slide, which contains the gray value information of all the patches, as described in the Material and Methods section. Figure 2 shows the segmentation of the cell membrane and the schematic of GrayMap. Figure 3 shows typical examples of GrayMap in the 0/1+, 2+, and 3+ subgroups. Next, we utilized a multi-task CNN model to classify the IHC HER2 expression level as described in the Material and Methods section (Figure 1). We evaluated the model through 5-fold cross-validation and compared the results with those of three experienced pathologists. The experimental results show that the GrayMap model performs much better than the GrayMax model, with an average accuracy of 0.952 ± 0.029, F1-score of 0.860 ± 0.12, Cohen's *κ* of 0.891 ± 0.069 and MCC of 0.899 ± 0.062 (Table 2). The evaluation metrics for the 0/1+, 2+, and 3+ subgroups are shown in Figure 4A and Table S1. We further analyzed the intraclass correlation coefficient (ICC) among pathologists and found an ICC of 0.791 (95% confidence interval [CI], 0.749–0.829) (Figure 4B). This indicated the presence of inter-observer variability and suggested that manual interpretation by a single pathologist may carry a high risk of misdiagnosis. HER2-AI and HER2-pathologists were then compared to assess the consistency between the AI system and the pathologists. The median scores of the pathologists were used in the comparison. The results showed high consistency between HER2-AI and HER2-pathologists (ICC = 0.903) (Figure 4C).

#### *3.3. HER2 Gene Status Prediction Using GrayMap+ CNN Model*

Since HER2 IHC expression largely represents the HER2 gene amplification status [25], we also utilized the GrayMap model to predict HER2 gene status and compared the predictions with the FISH results. Our system demonstrated high performance in predicting HER2 gene status, with an accuracy of 0.921, specificity of 0.945, precision of 0.927, recall of 0.89, F1-score of 0.908, and Jaccard Index of 0.832 (Figure 5A and Table S2), and an AUC of 0.936 for the ROC curve, indicating high-quality FISH classification under 5-fold cross-validation (Figure 5B). These data further confirmed our model as a robust, high-performance system not only in HER2 IHC classification but also in HER2 gene status prediction.

**Figure 4.** Consistency of the pathologists and the AI system on HER2 IHC classification. (**A**) Histograms of GrayMap model performance in a subgroup of 0/1+, 2+, and 3+. (**B**) The intraclass consistency of HER2 IHC scores in pathologists. (**C**) Consistency of HER2 between AI system (IHC score-AI) and median IHC score in pathologists (median IHC score).

**Figure 5.** Performance of AI system on HER2 FISH classification. (**A**) Histograms of GrayMap model performance. (**B**) ROC curve of HER2 FISH status classification by cross-validation classification.

#### *3.4. The Analysis of Discordant Cases*

The proposed system correctly classified most of the WSIs. However, there were several discordant cases with false positive and negative samples (Figure 6A). We further analyzed the differences between the AI system and the pathologists. For the HER2 IHC results, 13 cases (13/228, 5.70%) were discordant between AI and pathologists. We investigated each case to identify the causes of the variability. Intra-tumor cell heterogeneity of HER2 staining was detected in six cases (6/13, 46.15%) (Figure 6B). Nonspecific cytoplasmic staining was found in four cases (Figure 6C). Another one was due to nonspecific staining in DCIS (Figure 6D). Our results showed that HER2 staining heterogeneity was the main driver of disagreement between AI and pathologists. Furthermore, cytoplasmic staining can interfere with the machine's extraction of cell membrane staining, resulting in misinterpretation. Nonspecific HER2 expression in DCIS can also lead to errors, especially in biopsy tissue with a substantial amount of DCIS. HER2 validation is supposed to be performed only in the IBC-NST component. Since we did not annotate the IBC-NST region on WSIs, we calculated the DCIS component and found that 75 cases (75/228, 32.89%) had a DCIS component with a ratio of 5–35%. Only one of these cases (1/75, 1.33%) was among the discordant cases; thus, our model was able to handle the potential problem posed by DCIS. For only two cases could no clear explanation for the discordance be found. For HER2 FISH status, there were 18 discordant cases (18/228, 7.89%). Five cases showed intra-tumor cell heterogeneity of the dual-color probes; for example, one case had HER2 amplification in only 2% of tumor cells and another in 5%. Seven cases had low HER2 copy numbers (average copy number range 4–6 signals/cell). Three cases that were manually evaluated as negative belonged to the G2 and G4 groups, which were the new FISH groups according to the 2018 ASCO/CAP guideline. 
Though the seven low-copy number cases were evaluated as positive and the new FISH group was regarded as negative, the efficacy of HER2-targeted therapy on these groups still needs to be investigated because of the limited evidence with a small subset of cases [6]. Only five cases were left without any explanation for discordance. Our results indicated that AI-based classification guaranteed high diagnostic accuracy and enabled us to reduce misinterpretation.

**Figure 6.** HER2 scoring discordance between pathologists and the AI system and the possible causes of the variability. (**A**) Top 2 lines: comparison between the GrayMap model and the pathologist assessment; bottom 4 lines: the possible causes of the variability; left: the discordant cases in HER2 IHC classification; right: the discordant cases in HER2 FISH classification. Vertical bars represent single cases, and the meaning of the different colors is listed at the bottom. Typical images of (**B**) HER2 staining heterogeneity, (**C**) nonspecific cytoplasmic staining, and (**D**) nonspecific staining in ductal carcinoma in situ (DCIS) with negative staining of the invasive component.

#### **4. Discussion**

In this paper, we proposed a new AI method to address the subjectivity and interobserver disagreement of manual interpretation of HER2 IHC slides. The experimental results showed that the new method could accurately predict the HER2 protein expression level (accuracy 0.95 ± 0.029, Cohen's *κ* 0.891 ± 0.069) and FISH status (AUC 0.936 ± 0.030). In the test of concordance with the interpretations of the three pathologists, the new method had the highest ICC (0.903, 95% confidence interval 0.875–0.924). Breast cancer (BC) has become the most common cancer diagnosed in women. Personalized medicine, especially drugs directed at target genes in BC, such as trastuzumab, has greatly improved survival. HER2 protein expression level and gene amplification status are the most important indicators for targeted therapy of BC. However, traditional manual interpretation of HER2 slides has been criticized for subjectivity and interobserver disagreement among pathologists. This is caused not only by the subjective decisions that clinical pathologists must make under the ASCO/CAP guideline, such as the completeness of membrane staining, the intensity of staining, and the percentage of positive cells, but also by the heterogeneity of BC. Because AI-based methods rely on parametrized models with deterministic behavior, they are a promising approach to solving the poor reproducibility of manual interpretation. However, on the one hand, the whole slide image is too large to be processed directly by a single model; on the other hand, a single patch-level image of a WSI cannot capture the heterogeneity of BC. Currently, there are several approaches to this problem. The first approach predicts the HER2 expression of each patch and summarizes the patch-level results with a statistical average. Compared with this approach, the method proposed in this work adopts a deep learning model to make slide-level predictions, which is more flexible and powerful than simple statistical averaging. Another approach generally follows the ASCO/CAP guideline and makes predictions at the cell level. This approach requires considerable human labeling, which is not only tedious but also prone to label errors, especially for weakly stained samples. Weakly supervised learning (WSL) is an attractive way to alleviate the need for patch-level labeling [26]. However, WSL needs a considerable amount of slide-level data, and its performance on a large HER2 IHC dataset is still unclear. The method proposed in this work could be another promising approach to slide-level prediction.

The proposed AI system can be applied in routine work in the pathology department. After the WSIs are uploaded into the system, the model automatically performs patch splitting, cell segmentation, extraction of gray value map information, and prediction of HER2 IHC and FISH results. The system assists pathologists by pre-reading HER2 IHC slides and presenting the calculated results as a second opinion, especially for cases with equivocal (2+) results. Our system is expected to mitigate interobserver discrepancy and thereby contribute to the efficacy and safety of HER2-targeted therapies in BC. Recently, a new HER2-low subtype was defined by a score of IHC 1+ or IHC 2+/FISH−; patients with this subtype may benefit from new HER2-directed antibody–drug conjugates (ADCs), such as trastuzumab deruxtecan (T-DXd) [27]. With accurate prediction of both IHC and FISH status, the current system has the potential to recognize HER2-low cases.

In our study, compared with the former GrayMax algorithm, the upgraded GrayMap + CNN model resolved most of the problems caused by nonspecific and heterogeneous staining, as well as by the special staining patterns of specific breast cancer subtypes, in HER2 IHC classification. However, inconsistency between the AI system and pathologists still exists. Consistent with a previous study, HER2 staining heterogeneity was identified as the main driver of disagreement [28]. Intratumoral heterogeneity of HER2 may be due to the intrinsic characteristics of BC, defined as regional heterogeneity and genetic heterogeneity [29]. It may also be caused by IHC procedures, tissue collection and processing, or the slide scanning procedure. In our dataset, most of the heterogeneously stained cases in the discordant cohort were weakly stained; thus, our model needs to improve its capability to deal with weak HER2 staining. For HER2 FISH classification, in addition to heterogeneity, a low copy number (average copy number of 4–6 signals/cell) was the most common cause of inconsistency. According to the 2018 guideline, an average HER2 copy number ≥4 signals/cell is regarded as FISH positive. However, a study using droplet digital PCR (ddPCR) and targeted next-generation sequencing (NGS) showed a clear difference in HER2 copy levels between the 4–6 copy number group and the ≥6 group, and it remains unclear whether patients in the 4–6 copy number group derive the same level of benefit from HER2-targeted therapy as those in the ≥6 group [30]. Furthermore, three cases belonged to the G2 and G4 groups, which are the new FISH groups in the 2018 guideline and should be recognized as FISH negative. However, previous research showed that the G2 group represents a biologically heterogeneous subset that differs from G1 (FISH positive) and G5 (FISH negative) [31]. The G4 group was also shown to be a distinct group with intermediate levels of RNA/protein expression, close to the positive/negative cut points [32]. Additional outcome information after HER2-targeted treatment is needed for the new FISH groups.
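The FISH groups discussed above can be written as a simple decision rule. The sketch below is illustrative only and is not part of the GrayMap code: it assumes the commonly cited dual-probe criteria of the 2018 ASCO/CAP update (a HER2/CEP17 ratio cut-off of 2.0 and average HER2 copy number cut-offs of 4.0 and 6.0 signals/cell), and the function name `fish_group` is hypothetical. It only assigns the group label; the final positive or negative call for G2 to G4 additionally depends on the concurrent IHC review described in the guideline.

```python
def fish_group(ratio: float, copy_number: float) -> str:
    """Assign a dual-probe ISH result to group G1-G5 (illustrative sketch).

    ratio       -- HER2/CEP17 ratio
    copy_number -- average HER2 copy number in signals/cell
    The cut-offs are assumptions taken from the 2018 ASCO/CAP update.
    """
    if ratio >= 2.0:
        # High ratio: G1 if the copy number is also raised, otherwise G2.
        return "G1" if copy_number >= 4.0 else "G2"
    if copy_number >= 6.0:
        return "G3"
    if copy_number >= 4.0:
        # Low ratio with 4-6 signals/cell: the low-copy-number group.
        return "G4"
    return "G5"


if __name__ == "__main__":
    # Hypothetical measurements, chosen to hit each branch once.
    for ratio, copies in [(2.4, 5.1), (2.2, 3.1), (1.5, 6.8), (1.6, 4.7), (1.1, 2.3)]:
        print(ratio, copies, fish_group(ratio, copies))
```

Writing the rule out makes it clear why cases with 4–6 signals/cell sit close to a decision boundary and are a plausible source of discordance.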

To improve the accuracy, precision, and reproducibility of HER2 IHC interpretation for BC where quantitative image analysis (QIA) is applied, the College of American Pathologists (CAP) developed a guideline with eleven recommendations [33]. The recommendations state that QIA and its procedures must be validated before implementation, followed by regular maintenance and ongoing evaluation of quality control and quality assurance. In addition, HER2 QIA performance, interpretation, and reporting should be supervised by pathologists with expertise in QIA. We studied the detailed description of the guideline and found that our AI model and procedures met most of the criteria, which suggests that the present model is a promising tool for HER2 interpretation. However, this study still had some limitations. First, this work uses the k-means method to segment the cell membrane. It may wrongly classify cytoplasmic pixels as membrane when the cell is weakly stained or shows cytoplasmic immunohistochemical staining. For most of the weakly stained cases, the method is still able to make correct predictions, because the intensity and percentage of positive cells are the major discriminating factors. However, for cases with cytoplasmic staining, as demonstrated in the analysis of discordant cases (four of the 13 discordant cases), more local features are needed to identify the misclassified cases. Second, we did not segment the invasive carcinoma region first. The current method relies on the deep learning model to automatically learn features from the data. In future work, we will collect more data and investigate the performance difference between the current method and a model that makes predictions based only on the carcinoma region. Third, the completeness of the cell membrane is not represented in the current method. The 2018 ASCO/CAP guidelines place more emphasis on the completeness of cell membrane staining in HER2 2+ and 3+ cases in order to reduce confusion among pathologists and allow greater discrimination between positive and negative results [6]. Our AI system achieved high performance without calculating membrane completeness; however, a feature representing the completeness of cell membrane staining according to the ASCO/CAP guideline still needs to be found to obtain better results.

In conclusion, the experimental results indicate that the proposed AI model is feasible for predicting the HER2 expression score and HER2 gene amplification status from IHC WSIs and achieves high consistency with the assessments of experienced pathologists. This HER2 scoring model does not rely on challenging manual intervention and proved to be a simple and robust tool that can help pathologists improve the accuracy of HER2 interpretation and provide a clinical aid to targeted therapy in BC patients.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14246233/s1, Table S1: HER2 IHC classification performance of GrayMap methods by cross-validation classification in the subgroup of 0/1+, 2+, and 3+. Table S2: HER2 FISH prediction performance of GrayMap methods on the subgroup of 0/1+, 2+, and 3+.

**Author Contributions:** Conceptualization, D.N., K.W., Q.Y., and W.H.; Methodology, X.D., and L.J.; Investigation, W.H., K.W., Y.B., and M.L.; Writing—Original Draft, K.W., Q.Y., and D.N.; Writing—Review & Editing, K.W., and Q.Y.; Funding Acquisition, W.H.; Resources, W.H., and Q.Y.; Supervision, D.N., and X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (No. 81301879 and No. 81702839), Hygiene and Health Development Scientific Research Fostering Plan of Haidian District Beijing (No. HP2022-31-503001), Science Foundation of Peking University Cancer Hospital (No. 2021-11).

**Institutional Review Board Statement:** This study obtained permission from the Peking University Cancer Hospital Institutional Review Board and Ethics Committee (Grant: 2022KT15).

**Informed Consent Statement:** The requirements for informed consent were waived due to the noninvasive nature of the study.

**Data Availability Statement:** All image data associated with this study can be downloaded at https://data.mendeley.com/datasets/3njjk252vc/draft?a=29e5963c-e2d6-4bdb-9c7b-b51d3741b6f0. The source code and the guideline are publicly available at https://github.com/KaiyuanWu/Her2 GrayMap. Any further information and requests for resources and materials should be directed to and will be fulfilled by the lead contact, Dongfeng Niu (dongfengniu@foxmail.com).

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


### *Article* **Predicting Tumor Perineural Invasion Status in High-Grade Prostate Cancer Based on a Clinical–Radiomics Model Incorporating T2-Weighted and Diffusion-Weighted Magnetic Resonance Images**

**Wei Zhang 1,2,†, Weiting Zhang 2,3,†, Xiang Li 2,3, Xiaoming Cao 1, Guoqiang Yang 2,3,4,\* and Hui Zhang 2,3,4,\***


**Simple Summary:** Perineural invasion (PNI) is present in 17–75% of prostate cancer patients and is an important mechanism for cancer progression, leading to poor prognoses. An optimized preoperative technique is needed to detect PNI in prostate cancer patients and administer the best treatment. The aim of our retrospective study was to develop a model based on high-throughput radiomic features of bi-parametric MRI combined with clinical factors that can predict PNI status in high-grade prostate cancers. In total, 183 high-grade PCa patients were included in this retrospective study, and the radiomics model based on 13 selected features of bi-parametric MRI showed better discrimination than did the conventional model in the test cohort (area under the curve (AUC): 0.908). Discrimination efficiency improved when the radiomics and clinical models were combined (AUC: 0.947). This improved model may help predict PNI in prostate cancer patients and allow more personalized clinical decision-making.

**Abstract:** Purpose: To explore the role of bi-parametric MRI radiomics features in identifying PNI in high-grade PCa and to further develop a combined nomogram with clinical information. Methods: 183 high-grade PCa patients were included in this retrospective study. Tumor regions of interest (ROIs) were manually delineated on T2WI and DWI images. Radiomics features were extracted from the segmented lesion areas. Univariate logistic regression analysis and the least absolute shrinkage and selection operator (LASSO) method were used for feature selection. A clinical model, a radiomics model, and a combined model were developed to predict PNI positivity. Predictive performance was estimated using receiver operating characteristic (ROC) curves, calibration curves, and decision curves. Results: The diagnostic efficiency of the clinical model did not differ significantly from that of the radiomics model (the area under the curve (AUC) values of the clinical model were 0.766 and 0.823 in the training and test groups, respectively). The radiomics model showed better discrimination in both the training and test cohorts (train AUC: 0.879 and test AUC: 0.908) than the model based on either single sequence (T2WI train AUC: 0.813 and test AUC: 0.827; DWI train AUC: 0.749 and test AUC: 0.734). Discrimination improved when the radiomics and clinical models were combined (train AUC: 0.906 and test AUC: 0.947). Conclusion: The model including radiomics signatures and clinical factors can accurately predict PNI positivity in high-grade PCa patients.

**Keywords:** prostate cancer; PNI; bi-parametric MRI; radiomics; nomogram

**Citation:** Zhang, W.; Zhang, W.; Li, X.; Cao, X.; Yang, G.; Zhang, H. Predicting Tumor Perineural Invasion Status in High-Grade Prostate Cancer Based on a Clinical–Radiomics Model Incorporating T2-Weighted and Diffusion-Weighted Magnetic Resonance Images. *Cancers* **2023**, *15*, 86. https://doi.org/10.3390/cancers15010086

Academic Editor: Hamid Khayyam

Received: 12 November 2022 Revised: 8 December 2022 Accepted: 17 December 2022 Published: 23 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Prostate cancer (PCa) is the most frequently diagnosed malignant tumor among males in 105 countries and the leading cause of cancer-related death among males in 46 countries [1]. Patients with the same stratification who receive the same treatment plan often differ significantly in prognosis [2]. In addition, many PCa cases diagnosed as localized, especially high-grade cases, are not truly localized at the time of diagnosis: cancer cells have already spread beyond the scope of surgery or radiotherapy, and these patients are prone to developing biochemical recurrence [3]. It is widely accepted that prostate-specific antigen (PSA), Gleason score (GS), and T stage are the main variables for evaluating the prognosis of localized PCa. Among the factors causing tumor spread, perineural invasion (PNI), which is invasion along or around nerves within the perineural space, also plays an important role [4]. PNI can be evaluated in a biopsy specimen or radical prostatectomy specimen, and it is present in 17–75% of prostate cancer patients [5]. The College of American Pathologists published a consensus statement on prognostic factors for PCa in which PNI was identified as a potential prognostic factor (category III) that needed additional study [6]. Therefore, identifying the PNI status of high-grade PCa is an urgent problem.

At present, magnetic resonance imaging (MRI) is widely used for diagnosing PCa and can help detect several prognostic factors; it has been used to increase T staging accuracy and predict positive surgical margins (PSMs) by detecting and localizing extracapsular extension (ECE) [7,8]. Radiomics, an extension of texture analysis, converts medical images into high-dimensional, mineable, quantitative features by means of high-throughput extraction algorithms. In recent years, quantitative analysis of prostate MRI images by means of radiomics has played a crucial role in pretreatment staging and is increasingly applied to determine invasion and prognosis in prostate cancer [9,10]. PNI is a pathological feature that can only be detected after an invasive biopsy or prostatectomy. This form of metastasis can affect peri-prostatic neurovascular fibers, the lumbosacral plexus, and the sciatic nerve, and MRI can visualize involvement of these nerve fibers as direct evidence of cancer cell spreading [11,12]. In the age of high-resolution imaging, a radiomics-based method for accurately assessing the PNI status of PCa is urgently needed.

In this study, we evaluated the relationship between MRI radiomics signature, as well as other clinical and pathological factors, and PNI in high-grade PCa. We hypothesized that the MRI radiomics signature may provide effective information and established a model for preoperatively predicting the probability of PNI in high-grade PCa patients.

#### **2. Materials and Methods**

#### *2.1. Patients*

This retrospective study was approved by the Institutional Review Board of the First Hospital of Shanxi Medical University (ethics code: K131). We retrospectively selected PCa patients with clinical and imaging data from January 2016 to May 2021 who underwent prostate MR examination before systematic prostate biopsy or radical prostatectomy (RP). Clinical data, including age, PSA level, prostate volume, prostate-specific antigen density (PSAD), GS, grading groups (GGs), and tumor location in the prostate, were collected from patient medical records. The inclusion criteria were as follows: (a) high-grade PCa patients who underwent prostate MRI examination; and (b) tumor perineural invasion status obtained on histopathology by biopsy or RP. The exclusion criteria were as follows: (a) treatment before MRI examination, such as androgen suppression therapy or any previous transurethral surgery; (b) poor image quality due to artifacts; (c) incomplete MR sequences; (d) incomplete clinical data; and (e) lesions too small for segmentation and analysis (maximum diameter <3 mm). Data from a total of 208 high-grade prostate cancer patients were collected, and 25 patients were excluded according to the exclusion criteria. Ultimately, 183 high-grade PCa patients were enrolled in the study. The patients were randomly divided into training and test groups at a ratio of 7 to 3 (training group: 128 patients; test group: 55 patients).
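As a minimal sketch of the 7:3 random split, the snippet below shuffles 183 placeholder patient indices and cuts them into training and test groups. The seed, the function name `split_patients`, and the use of a plain (non-stratified) shuffle are assumptions for illustration; the text does not state how the randomization was implemented.

```python
import random


def split_patients(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient identifiers into training and test groups.

    The seed and the non-stratified shuffle are illustrative assumptions.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_patients(range(183))
print(len(train_ids), len(test_ids))  # 128 55
```

With 183 patients, 70% rounds to 128, which reproduces the reported group sizes of 128 and 55.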

#### *2.2. MR Image Data*

The prostate MRI examination was performed according to the PI-RADS v2.1 protocol as follows. We used a 3.0-T scanner (GE Signa HDxt) with an 8-channel array coil to acquire multiplanar T2-weighted imaging (T2WI) and diffusion-weighted imaging (DWI). The images were obtained with a turbo spin-echo sequence and the following parameters: repetition time/echo time (TR/TE): 3360/68.16 ms; field of view (FOV): 220 × 220 mm; matrix: 320 × 256; slice thickness: 5 mm; and spacing between slices: 5.5 mm. A single-shot echo-planar sequence with four b-values was also acquired: 0 and 1500 s/mm² (TR/TE: 5250/78.6 ms; FOV: 100 × 100 mm; matrix: 128 × 160; and slice thickness: 5 mm).

#### *2.3. Histopathologic Analysis*

All patients underwent transrectal ultrasound-guided 12-core systematic prostate biopsy or RP after prostate MRI examination. The pathological diagnosis of the specimens was made by two pathologists with more than three years of experience in the diagnosis of prostate diseases. The GS was updated according to the 2014 International Society of Urological Pathology criteria. PNI was diagnosed when PCa infiltration was identified in any layer of the nerve sheath or when tumor invasion involved at least one-third of the nerve circumference. Pathologic information was collected, and, according to the outcomes, all patients were divided into two groups: a PNI-positive group and a PNI-negative group (Figure 1).

**Figure 1.** Preoperative MRI images, ROI delineation, and pathological comparison of prostate cancer with and without PNI, as indicated by the arrow.

#### *2.4. Tumor Segmentation*

All MR images were manually delineated by two independent readers with more than 5 years' experience in reading prostate MR images. ITK-SNAP software was used to process T2WI and high-b-value (b = 1500) DWI images. Tumors were targeted as the regions of interest (ROIs), defined as hypointense signal areas compared with the normal prostate area on T2WI and a higher signal intensity than that of the normal prostate area on DWI. For consistency between ROIs in both T2WI and DWI images, all depicted ROIs were strictly delineated with the same criteria and visually validated by the same expert. The ROIs were manually delineated layer-by-layer along the lesion boundary, obtaining three-dimensional data (Figure 1).

#### *2.5. Extraction of Radiomic Features*

Features were extracted from the T2WI and DWI ROIs with FAE software (version 0.5.2; East China Normal University, Shanghai, China; https://github.com/salan668/FAE, accessed on 16 December 2022), which was developed on the basis of the PyRadiomics package (version 3.0.1; https://github.com/Radiomics/pyradiomics, accessed on 2 June 2022). The extracted feature classes were first-order statistics, shape-based, GLCM, GLRLM, GLSZM, GLDM, and NGTDM. A total of 1702 features were extracted from the MRI data, 851 each from T2WI and DWI, comprising 14 shape features, 18 first-order features, 24 gray level co-occurrence matrix (GLCM) features, 16 gray level run length matrix (GLRLM) features, 16 gray level size zone matrix (GLSZM) features, 5 neighboring gray tone difference matrix (NGTDM) features, 14 gray level dependence matrix (GLDM) features, and 744 wavelet features [13].
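The reported feature counts can be checked with simple arithmetic. The sketch below assumes that the 744 wavelet features arise from computing the 93 non-shape features on each of the 8 sub-bands of a one-level three-dimensional wavelet decomposition, which is how PyRadiomics typically produces wavelet features; this breakdown is an inference, not something stated in the text.

```python
# Feature counts per sequence, as reported in the text.
original_counts = {
    "shape": 14,
    "first_order": 18,
    "GLCM": 24,
    "GLRLM": 16,
    "GLSZM": 16,
    "NGTDM": 5,
    "GLDM": 14,
}

n_original = sum(original_counts.values())            # 107
n_non_shape = n_original - original_counts["shape"]   # 93

# Assumption: 8 wavelet sub-bands (LLL ... HHH); shape features are not
# recomputed on filtered images because they depend only on the mask.
n_wavelet_subbands = 8
n_wavelet = n_non_shape * n_wavelet_subbands          # 744

per_sequence = n_original + n_wavelet                 # 851
total = 2 * per_sequence                              # T2WI + DWI = 1702

print(n_original, n_wavelet, per_sequence, total)  # 107 744 851 1702
```

Under this assumption, the totals of 851 features per sequence and 1702 overall agree with the numbers given above.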

#### *2.6. Feature Selection and Model Building*

Feature selection was based on the training set. Thirty patients were randomly selected for a double-blinded comparison of manual segmentations by two radiologists. Inter- and intra-observer intraclass correlation coefficients (ICCs) were calculated to select features with high stability and reproducibility; ICCs greater than or equal to 0.75 were considered to indicate good agreement. To correct the class imbalance of the training dataset, we used the synthetic minority oversampling technique (SMOTE) to balance the positive and negative samples. Before feature selection, each feature was normalized by subtracting its mean value and dividing by its standard deviation. Feature selection was then performed in two steps. In the first step, features with statistical significance for identifying PNI positivity were selected by univariate logistic regression analysis; this first stage of dimensionality reduction ensured that each retained feature had a significant effect on the outcome. In the second step, least absolute shrinkage and selection operator (LASSO) regression analysis was used for further dimensionality reduction, and the best features were determined for establishing the radiomics model. The hyperparameter lambda and the number of selected features were determined by tenfold cross-validation. After the radiomics model was established, each feature was multiplied by its corresponding coefficient, and an intercept value was added to calculate the radiomics score (Rad-score) for each patient, which constituted the radiomics signature (Appendix A).
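The normalization and Rad-score steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the feature values, coefficients, and intercept are made-up toy numbers (the real ones are in Appendix A), the helper names are hypothetical, and the sample standard deviation is used because the text does not say which variant was applied.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each feature column: subtract the mean, divide by the SD."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample SD; an assumption
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def rad_score(features, coefficients, intercept):
    """Linear combination of selected features plus an intercept."""
    return intercept + sum(f * c for f, c in zip(features, coefficients))


# Toy data: three patients, two selected features (hypothetical values).
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
coefs = [0.5, -0.25]   # hypothetical LASSO coefficients
intercept = 0.1        # hypothetical intercept

normalized = zscore_columns(raw)
scores = [rad_score(row, coefs, intercept) for row in normalized]
print([round(s, 3) for s in scores])
```

In practice the mean and standard deviation would be estimated on the training set only and then reused for the test set, so that no information from the test cohort leaks into the model.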

For the clinical features, we used univariate analysis, and the features with statistical significance for the outcome were selected to construct a clinical model. Finally, the combined model of clinical and radiomics features was established by multiple logistic regression analysis.

#### *2.7. Model Evaluation*

After the models were built, their performance was evaluated using receiver operating characteristic (ROC) curve analysis. The area under the ROC curve (AUC) was calculated for quantification of the performance. The accuracy, sensitivity, and specificity were also calculated at a cutoff value that maximized the value of the Youden index. A radiomic nomogram combining the Rad-score derived from T2WI and DWI scans and clinical factors was developed for predicting PNI. The calibration curves measured the consistency between the predicted probability of PNI and the actual probability of PNI. Decision curve analysis was applied to measure the clinical utility of the nomogram.
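To make the evaluation metrics concrete, the following sketch computes the AUC and the cutoff that maximizes the Youden index (sensitivity + specificity − 1) on a small toy dataset. The labels and scores are invented for illustration, the function names are hypothetical, and the study itself used SPSS and R rather than this code.

```python
def roc_points(labels, scores):
    """Return (cutoff, sensitivity, specificity) for each distinct score.

    A case is called positive when its score is >= the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        points.append((cutoff, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(labels, scores):
    """Cutoff with the largest Youden index J = sensitivity + specificity - 1."""
    return max(roc_points(labels, scores), key=lambda p: p[1] + p[2] - 1)


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = PNI positive, 0 = PNI negative (hypothetical values).
labels = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

print(auc(labels, scores))            # 0.875
print(youden_cutoff(labels, scores))  # (0.4, 1.0, 0.75)
```

Accuracy, sensitivity, and specificity are then reported at the selected cutoff, as described above.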

#### *2.8. Statistical Analysis*

Demographic data were compared by the chi-squared test, Mann–Whitney test, or *t*-test. Continuous variables are expressed as mean ± standard deviation, and categorical variables are expressed as median (25th percentile, 75th percentile). A value of *p* < 0.05 was considered statistically significant. Statistical analyses were performed using SPSS v22.0 (IBM SPSS Statistics, IBM Corp., Armonk, NY, USA) and R software (version 4.1.2; http://www.R-project.org, accessed on 17 December 2022). R is a language and environment for statistical computing and graphics; it is a GNU project similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.

#### **3. Results**

#### *3.1. Patient Characteristics*

PNI was diagnosed histologically on RP or biopsy specimens. The 183 patients were divided into a PNI-positive [PNI (+)] group and a PNI-negative [PNI (−)] group. The PNI (+) group contained 54 patients (29.51%), and the PNI (−) group contained 129 patients (70.49%). In the PNI (+) group, 42 cases were detected on RP and 12 on biopsy. Of the 42 cases, 27 were confirmed as PNI positive on both preoperative biopsy and RP; 8 had no PNI-positive result on biopsy, and the RP outcome was determinative; and 7 underwent biopsy at another center, so only the PNI-positive result after RP at our center was available. The 12 PNI-positive cases confirmed by biopsy did not undergo RP after biopsy at our center. The concordance rate of PNI-positive results between biopsy and RP was 64.29%. In the PNI (−) group, 98 cases were diagnosed as PNI negative on both preoperative biopsy and RP, and 31 cases underwent biopsy at another center, so only the PNI-negative RP outcome at our center was available. The concordance rate was 75.97%. The average ages were 69.7 ± 8.2 years and 72.0 ± 9.0 years in the two respective groups. The PSA levels were 15.9 ng/mL and 17.4 ng/mL in the two respective groups. In the PNI (+) group, the GS proportions were distributed as follows: 22.2% of patients (12/54) had a score of 8, 42.6% (23/54) had a score of 9, and 11.1% (6/54) had a score of 10. In the PNI (−) group, the GS proportions were distributed as follows: 41.1% of patients (53/129) had a score of 8, 39.5% (51/129) had a score of 9, and 19.4% (25/129) had a score of 10. The radiological and other clinical characteristics of the two groups are summarized in Table 1. There were no significant differences between the two groups in terms of age, PSA level, PSAD, or tumor location. However, there were significant differences in prostate volume, GS, and GG (*p* < 0.05). There were no significant differences between the training and test cohorts in any clinical characteristic (*p* > 0.05; Table 2).


**Table 1.** Clinicoradiological characteristics of patients in the PNI (+) and PNI (−) groups.



**Table 2.** Clinicoradiological characteristics of patients in the training and test cohorts.


PSA: prostate-specific antigen. Prostate volume: foot–head (FH) length × right–left (RL) length × anterior– posterior (AP) length × π/6. PSAD: prostate-specific antigen density, PSA value divided by MRI-estimated prostate volume. Grading groups (GG): GG1: Gleason scores ≤ 6; GG2: Gleason scores 3 + 4; GG3: Gleason scores 4 + 3; GG4: Gleason scores 4 + 4, 3 + 5, 5 + 3; GG5: Gleason scores 4 + 5, 5 + 4, 5 + 5. *p* < 0.05 indicates a statistically significant difference.
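The definitions in this footnote translate directly into a few small helpers. The sketch below is illustrative: the function names and the example measurements are hypothetical, and lengths are assumed to be in centimetres so that the ellipsoid volume comes out in millilitres, a unit the text does not state explicitly.

```python
import math


def prostate_volume(fh_cm, rl_cm, ap_cm):
    """Ellipsoid estimate: FH x RL x AP x pi/6 (mL if lengths are in cm)."""
    return fh_cm * rl_cm * ap_cm * math.pi / 6


def psad(psa_ng_ml, volume_ml):
    """PSA density: PSA value divided by the MRI-estimated prostate volume."""
    return psa_ng_ml / volume_ml


def grade_group(primary, secondary):
    """Map a Gleason pattern pair to grading groups GG1-GG5 as in the footnote."""
    total = primary + secondary
    if total <= 6:
        return 1
    if total == 7:
        return 2 if primary == 3 else 3   # 3 + 4 -> GG2, 4 + 3 -> GG3
    if total == 8:
        return 4                          # 4 + 4, 3 + 5, 5 + 3
    return 5                              # 4 + 5, 5 + 4, 5 + 5


# Hypothetical measurements for one patient.
volume = prostate_volume(4.0, 5.0, 3.5)
print(round(volume, 2), round(psad(15.9, volume), 3), grade_group(4, 5))
```

Because the cohort consists of high-grade cases, only the GG4 and GG5 branches would normally be reached for the patients in this study.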

#### *3.2. Feature Selection and Comparison of Models*

In total, 1193 stable features with ICCs ≥ 0.75 were retained (611 features from T2WI and 582 features from DWI). For the T2WI sequence, 10 features were selected at λ1se = 0.06478, which gave the highest AUC on the testing dataset; the AUC and accuracy of the model were 0.827 (95% CI 0.707–0.947) and 0.818, respectively. For the DWI sequence, four features were selected at λ1se = 0.11225, which gave the highest AUC on the testing dataset; the AUC and accuracy of the model were 0.734 (95% CI 0.593–0.975) and 0.746, respectively. For the combined T2WI + DWI sequences, 13 features were selected at λ1se = 0.06787, which gave the highest AUC on the testing dataset; the AUC and accuracy of the model were 0.908 (95% CI 0.821–0.996) and 0.855, respectively. These 13 features were found to have high stability for prediction of PNI and were chosen to construct the final model. The details of feature selection and the comparison of models are shown in Figures 2 and 3 and Tables 3 and 4.

The clinical model based on FH, RL, prostate volume, and GS obtained the highest AUC among the clinical feature combinations on the testing dataset. The AUC and accuracy of this model on the testing dataset were 0.823 (95% CI 0.712–0.933) and 0.673, respectively (Figures 2 and 3 and Table 4).

**Figure 2.** The lasso plots for radiomics feature selection: (**a**,**b**) for T2WI, 10 features were selected when the λ1se = 0.06478, (**c**,**d**) for DWI, 4 features were selected when the λ1se = 0.11225, and (**e**,**f**) for T2WI + DWI sequences, 13 features were selected when the λ1se = 0.06787.

**Figure 3.** The AUCs of the different models in the training (**a**) and test (**b**) cohorts.




**Table 3.** *Cont*.

**Table 4.** The diagnostic performance of models.


P: AUC value of T2WI model, DWI model, T2WI + DWI model, and radiomic combined clinical model, respectively, compared to AUC value of clinical model.

#### *3.3. Development of the Clinical–Radiomics Predictive Model*

After the independently associated risk factors of FH, RL, volume, and GS were selected, we combined them with the Rad-score of the 13 features to form a PNI predictive nomogram. This nomogram had better performance in predicting PNI: the AUCs were 0.906 (95% CI 0.866–0.947) in the training group and 0.947 (95% CI 0.884–1) in the test group (Figure 4 and Table 4).

**Figure 4.** Nomogram developed for prediction of PNI. Radiomic nomogram combining the Radscore derived from T2WI and DWI scans and clinical–radiological factors for predicting PNI. PNI: perineural invasion.

#### *3.4. Validation of the Clinical–Radiomics Predictive Nomogram*

The calibration charts showed that the actual probability of PNI occurrence was consistent with the predicted probability, and the Hosmer–Lemeshow test yielded *p* values of 0.907 and 0.689 in the training and test cohorts, respectively. As shown in Figure 5, decision curve analysis indicated that the PNI predictive nomogram model was the best method across the full range of reasonable threshold probabilities. In the training group, the net reclassification index (NRI) was 1.1252 (0.8659–1.3644, *p* < 0.01) comparing the clinical model and combined model, while the NRI was 0.886 (0.6271–1.449, *p* < 0.01) comparing the radiomic model and combined model. In the test group, the NRI was 1.2312 (0.7796–1.6829, *p* < 0.01) comparing the clinical model and combined model, while the NRI was 1.0691 (0.5958–1.5424, *p* < 0.01) comparing the radiomic model and combined model (Figure 6).
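The text does not state which NRI variant was computed. As an illustration only, the sketch below assumes the category-free (continuous) form, in which any increase in predicted probability counts as upward reclassification. The function name and the example probabilities are hypothetical.

```python
from typing import List, Sequence, Tuple


def continuous_nri(y_true: Sequence[int],
                   p_old: Sequence[float],
                   p_new: Sequence[float]) -> float:
    """Category-free NRI of a new model relative to an old model.

    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)]
    """
    events: List[Tuple[float, float]] = []
    nonevents: List[Tuple[float, float]] = []
    for y, old, new in zip(y_true, p_old, p_new):
        (events if y == 1 else nonevents).append((old, new))
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")

    def up_minus_down(pairs: List[Tuple[float, float]]) -> float:
        up = sum(1 for old, new in pairs if new > old)
        down = sum(1 for old, new in pairs if new < old)
        return (up - down) / len(pairs)

    # Events should move up; non-events should move down.
    return up_minus_down(events) - up_minus_down(nonevents)


# Hypothetical predicted probabilities for four patients.
labels = [1, 1, 0, 0]
clinical_only = [0.4, 0.6, 0.5, 0.3]
combined = [0.7, 0.8, 0.2, 0.4]
nri = continuous_nri(labels, clinical_only, combined)
```

Under this definition the index ranges from −2 to 2, and a positive value means the new model moves predictions in the correct direction more often than the old one.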

**Figure 5.** Calibration curve of the nomogram in the training (**a**) and test (**b**) groups.

**Figure 6.** Decision curve analysis.

#### **4. Discussion**

PNI is a histological phenomenon in which cancer cells surround and invade nerves in the tumor microenvironment and play a role in development and regeneration of cancer cells. Nerves and cancer cells communicate bidirectionally to each other, providing a mechanism that could induce cancer invasion and spread. Studies have shown that the sympathetic nervous system in cancer can regulate pathological gene expression, leading to DNA damage repair inhibition and oncogene activation to increase cancer cell metastasis and tumorigenesis [14,15]. On the other hand, cancer cells can secrete neurotrophic growth factors or chemokines, such as CCL2 and CXCL12, to promote development of neural progenitors, causing nerve growth [16,17]. PNI in cancer is associated with poor prognosis, likely because neoplastic cells hidden in the perineural space cannot be removed during tumor resection and cause recurrence.

In 1999, the College of American Pathologists published a consensus statement on prognostic factors for PCa in which PNI was classified as category III for risk of recurrence and needed additional study [6]. In multivariate analysis, PNI on biopsy showed significance for recurrence. The presence of PNI on target-biopsy associated with worse histopathologic features on RP and poorer outcomes might thus be useful for risk stratification [18]. As primary treatment decisions are often based on biopsy results, the additional PNI information may be relevant for optimal patient care [19]. PNI found on prostate biopsies has been shown to be an independent predictor of high-grade disease associated with a higher mean PSA, adverse pathologic features of higher GS, and extra-prostatic extension [20,21]. In our study, 54 PNI (+) patients among 183 high-grade PCa patients had higher GG and GS than PNI (−) patients, and the outcome was consistent with these studies. PCa patients with PNI positivity showed an increased risk of biochemical recurrence after prostatectomy or radiotherapy and worse survival outcomes, which have important implications for treatment decision-making and management of PCa [22–24].

The slowly progressive nature of nerve involvement can often make PNI difficult to diagnose, and PNI is always detected based on the pathological results of the biopsy and prostatectomy specimens of PCa patients. As not all PCa cases are diagnosed at the initial biopsy, PNI as an independent prognostic factor remains difficult to quantitatively measure in pathological samples because of its heterogenous presentations and the multifocal nature of RP specimens [25]. Recent research has shown that the distribution of nerves within the tumor-infiltrating microenvironment is not homogeneous. The neural density was significantly higher in the cancer periphery close to cancer infiltration than in the cancer core area, which suggests that nerves may drive tumor progression and invasion [26]. Many factors may influence the true pathological positive rate of PNI, such as the needle core number of biopsy and the processing method of RP specimen tissues [27]. Thus, the prognostic value of PNI evaluation in pathological analysis should be further assessed and a better method should be developed to provide a detailed spatial representation of heterogeneity.

MRI is a noninvasive diagnostic tool that can acquire entire anatomical images of the prostate for cancer staging, such as extra-prostatic extension. This is important for urologists to determine a treatment plan before surgery, such as preservation of the neurovascular bundle (NVB) [28]. In the era of high-resolution imaging, extra-prostatic extension on MR images already has a better ability to predict locally advanced-stage PCa than PNI positivity on biopsy [29]. Whether PNI, as a predominant mechanism and a predictor of PCa progression to an advanced stage, can be directly assessed on imaging measures needs further study to develop a visualization method. Jonathan J. Stone retrospectively reviewed the data of 3733 PCa patients from a medical database who had undergone both MRI and PET before surgery to identify direct radiological evidence of PNI. Fifteen patients who had perineural spread found on MRI presented enlargement of the spinal nerves, lumbosacral plexus, sciatic nerve on T1-weighted sequences, hyperintensity on T2-weighted sequences, and/or abnormal nerve enhancement after gadolinium administration [30]. Salvatore Siracusano evaluated a new MRI modality called diffusion tensor imaging (DTI), which can provide sharp depiction of peripheral nervous fibers to detect changes in peri-prostatic neuro-vasculature (PNF) before and after RP. DTI was able to detect quantitative changes in the number, length, and fractional anisotropy values of the PNF, and they observed that the fiber number in MRI images can serve as a recovery indicator of erectile dysfunction in nerve-sparing prostatectomy [31]. However, PNI is a microscopic-level finding in PCa. Huijuan You combined MRI and magnetic particle imaging involving superparamagnetic iron oxide nanoparticles to precisely distinguish high and low nerve densities of the PCa tissue microenvironment in a mouse model. 
Their method could visualize the nerve density, and they observed a positive correlation with the aggressiveness of PCa cancer cells, which can be a novel strategy for discovering biomarkers for neural tissue and tumor aggressiveness in PCa [32].

Although MR plays an important role in detecting and accurately evaluating PCa, image outcome reporting depends on the subjective judgment of radiologists, which causes high inter-reader variability. Recently, the quantitative analysis method based on machine learning techniques called radiomics was shown to automatically obtain high-throughput imaging features to overcome the above limitations and assess tumor biology characteristics. Several studies have reported use of MR-based radiomics to detect clinically significant PCa and assess aggressiveness and tumor staging [33]. Shuai Ma developed and validated a radiomics model that contains 17 stable radiomics features extracted from 1619 features based on T2WI to predict ECE in PCa. The AUC was 0.883 in the validation cohort, and the model was more sensitive than the radiologists' interpretations, especially for apical tumors, which would influence a nerve-sparing surgical plan [34]. PNI is a predominant mechanism of ECE in PCa; to the best of our knowledge, there is no radiomics model based on MRI for preoperatively predicting this histopathological phenomenon.

In our study, we constructed a model derived from clinical and imaging data, including radiomic features from T2WI and DWI, based on computer-aided analysis to evaluate the PNI status in high-grade PCa. Our best radiomics model contained three GLDM features, one GLRLM feature, two NGTDM features, three GLSZM features, two GLCM features, one first-order feature, and one shape feature from T2WI and DWI images, which have the best predictive ability for PNI status in high-grade PCa. Our results demonstrated that the NGTDM feature had the greatest weight of the features in the T2WI model, while, in the DWI model, it was the GLCM feature, which is associated with tumor invasion and is a predictor of PCa aggressiveness, consistent with recently published findings concerning risk stratification for PCa. This finding suggests that invading nerves in the tumor microenvironment may affect the homogeneous texture features and that these radiomics features associated with PNI positivity may provide some additional information related to PCa aggressiveness, as previous studies reported [35,36]. The feature with the greatest weight in the T2WI + DWI model was the higher-order feature GLDM; this feature describes the gray level intensity within the ROI between the PNI positive and PNI negative groups and is used to highlight local heterogeneity information. This texture feature was rarely mentioned in previous radiomics studies for PCa, but, for other tumors, such as rectal cancer and cervical cancer, GLDM was thought to be associated with locally advanced tumors and poor prognosis in recent studies [37,38]. Similar to those in nontumor tissues, the GLDM metrics were found to be significantly different among peritumoral fat between high-grade and low-grade clear cell renal carcinoma and urothelial carcinoma [39,40].
Therefore, whether radiomics feature GLDM could be a biomarker for predicting the heterogeneity of interstitial composition in urologic cancers requires more research. Similar to the study of B. De Santi, which showed that a difference in voxel intensity distribution could distinguish cancerous and normal prostatic tissues [41], our model led to the conclusion that differences in heterogeneity between PNI positive and PNI negative samples can be detected and, therefore, can help depict the tissue microstructure as PNI positive or PNI negative before surgery.

Our clinical–radiomics prediction model, which integrates clinical characteristics and the Rad-score derived from MRI, had good sensitivity (0.944) and good specificity (0.865) in the test cohort, indicating that it is superior to all the above-mentioned models for predicting PNI status. Comparing the AUC values in the independent test cohort, our clinical–radiomics prediction model (AUC 0.947; 95% CI 0.884–1) performed better than the radiomics model alone (AUC 0.908; 95% CI 0.821–0.996) and the clinical model alone (AUC 0.823; 95% CI 0.712–0.933). Decision curve analysis showed that the clinical–radiomics model had a better ability to predict PNI than the other two models at any given threshold probability. This finding confirms that assessment of PNI with clinical or radiomic information alone will not be comprehensive.
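Decision curve analysis compares models by their net benefit at each threshold probability. The sketch below shows the standard net-benefit formula, NB = TP/n − (FP/n) × pt/(1 − pt), together with the "treat all" reference strategy. The function names and example values are hypothetical, and the code is not the software used in the study.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int],
                p_pred: Sequence[float],
                threshold: float) -> float:
    """Net benefit of acting on predictions at or above `threshold`."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: every patient is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical PNI labels and predicted probabilities.
pni = [1, 1, 0, 0]
prob = [0.9, 0.6, 0.4, 0.1]
nb_model = net_benefit(pni, prob, threshold=0.5)
nb_all = net_benefit_treat_all(pni, threshold=0.5)
```

Plotting `net_benefit` over a grid of thresholds for each model, alongside the "treat all" and "treat none" (net benefit 0) lines, gives a decision curve of the kind referred to above.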

Several limitations should be noted when considering this study. First, we included GGs of high-grade patients only; those with GS ≤ 7 patterns were excluded, especially patients with GS 4 + 3 who have a much worse prognosis, and their PNI status was not assessed. Second, some GS values were based on biopsy rather than on RP in our study, possibly causing sampling error. Third, there was a lack of spatial co-registration of the histopathology slides and MR images, which may cause a mismatch in delineating the ROIs directly on the T2WI and DWI images. Fourth, FAE software can be used conveniently for binary classification, but it has not yet provided an integrated UI for multilabel classification and regression problems. Fifth, this study was a single-institutional retrospective study design without external validation.

#### **5. Conclusions**

In our study, the results showed that MRI-derived radiomic features can be independent predictors of PNI in high-grade PCa. The combination of radiomic features extracted from T2WI and DWI maps produced higher diagnostic power to predict PNI than a single pattern. Additionally, our clinical–radiomics model was superior to a single radiomics model and a clinical model, suggesting that, combined, the radiomic features and clinical pathology information may have considerable value in predicting PNI in high-grade PCa, which can aid clinicians in choosing appropriate treatment options and estimating prognoses for such patients.

**Author Contributions:** Conceptualization: W.Z. (Wei Zhang) and H.Z.; methodology: W.Z. (Wei Zhang) and G.Y.; software: W.Z. (Wei Zhang), W.Z. (Weiting Zhang), and X.L.; validation: W.Z. (Wei Zhang) and W.Z. (Weiting Zhang); formal analysis: W.Z. (Weiting Zhang) and G.Y.; investigation: W.Z. (Wei Zhang) and W.Z. (Weiting Zhang); resources: H.Z. and X.C.; data curation: W.Z. (Weiting Zhang); writing (original draft preparation): W.Z. (Wei Zhang); writing (review and editing): W.Z. (Wei Zhang) and G.Y.; project administration: H.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This retrospective study received Institutional Review Board approval of the First Hospital of Shanxi Medical University, ethic code: K131.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

```
Rad-score = −0.6252
  + T2_wavelet.HLH_gldm_SmallDependenceHighGrayLevelEmphasis × 0.9473
  + T2_wavelet.HLH_glrlm_RunPercentage × (−0.5091)
  + T2_wavelet.HLL_ngtdm_Coarseness × 0.7033
  + T2_wavelet.LHH_gldm_DependenceNonUniformityNormalized × 0.8344
  + T2_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized × 0.5365
  + T2_wavelet.LHH_ngtdm_Contrast × 0.3040
  + T2_wavelet.LHL_firstorder_RootMeanSquared × 0.3430
  + DWI_original_glszm_SizeZoneNonUniformityNormalized × 0.2708
  + DWI_original_shape_SurfaceArea × (−0.8964)
  + DWI_wavelet.HHH_glcm_DifferenceEntropy × 0.6870
  + DWI_wavelet.HLH_glcm_MaximumProbability × (−0.4943)
  + DWI_wavelet.HLL_gldm_LargeDependenceLowGrayLevelEmphasis × 0.3766
  + DWI_wavelet.LHH_glszm_ZoneEntropy × (−0.1266)
```
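For readers who want to evaluate the formula programmatically, the following sketch encodes the intercept and the 13 coefficients exactly as listed in this appendix. The function name `rad_score` is hypothetical. The assumption that the inputs are normalized in the same way as during model training is ours, because the appendix does not state the preprocessing.

```python
from typing import Mapping

INTERCEPT = -0.6252

# Coefficients copied from the Rad-score formula in Appendix A.
COEFFICIENTS = {
    "T2_wavelet.HLH_gldm_SmallDependenceHighGrayLevelEmphasis": 0.9473,
    "T2_wavelet.HLH_glrlm_RunPercentage": -0.5091,
    "T2_wavelet.HLL_ngtdm_Coarseness": 0.7033,
    "T2_wavelet.LHH_gldm_DependenceNonUniformityNormalized": 0.8344,
    "T2_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized": 0.5365,
    "T2_wavelet.LHH_ngtdm_Contrast": 0.3040,
    "T2_wavelet.LHL_firstorder_RootMeanSquared": 0.3430,
    "DWI_original_glszm_SizeZoneNonUniformityNormalized": 0.2708,
    "DWI_original_shape_SurfaceArea": -0.8964,
    "DWI_wavelet.HHH_glcm_DifferenceEntropy": 0.6870,
    "DWI_wavelet.HLH_glcm_MaximumProbability": -0.4943,
    "DWI_wavelet.HLL_gldm_LargeDependenceLowGrayLevelEmphasis": 0.3766,
    "DWI_wavelet.LHH_glszm_ZoneEntropy": -0.1266,
}


def rad_score(features: Mapping[str, float]) -> float:
    """Linear Rad-score; `features` must contain all 13 named features.

    Assumption: feature values are preprocessed as in model training.
    """
    missing = [name for name in COEFFICIENTS if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return INTERCEPT + sum(
        coef * features[name] for name, coef in COEFFICIENTS.items()
    )


# With every feature at zero, the score reduces to the intercept.
baseline = rad_score({name: 0.0 for name in COEFFICIENTS})
```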

#### **References**



### *Article* **Gut Microbial Shifts Indicate Melanoma Presence and Bacterial Interactions in a Murine Model**

**Marco Rossi 1,†, Salvatore M. Aspromonte 2,3,†, Frederick J. Kohlhapp 3, Jenna H. Newman 3, Alex Lemenze 4, Russell J. Pepe 2, Samuel M. DeFina 3, Nora L. Herzog 3, Robert Donnelly 4, Timothy M. Kuzel 1, Jochen Reiser 1, Jose A. Guevara-Patino 5,\* and Andrew Zloza 1,\***


**Abstract:** Through a multitude of studies, the gut microbiota has been recognized as a significant influencer of both homeostasis and pathophysiology. Certain microbial taxa can even affect treatments such as cancer immunotherapies, including the immune checkpoint blockade. These taxa can impact such processes both individually as well as collectively through mechanisms from quorum sensing to metabolite production. Due to this overarching presence of the gut microbiota in many physiological processes distal to the GI tract, we hypothesized that mice bearing tumors at extraintestinal sites would display a distinct intestinal microbial signature from non-tumor-bearing mice, and that such a signature would involve taxa that collectively shift with tumor presence. Microbial OTUs were determined from 16S rRNA genes isolated from the fecal samples of C57BL/6 mice challenged with either B16-F10 melanoma cells or PBS control and analyzed using QIIME. Relative proportions of bacteria were determined for each mouse and, using machine-learning approaches, significantly altered taxa and co-occurrence patterns between tumor- and non-tumor-bearing mice were found. Mice with a tumor had elevated proportions of *Ruminococcaceae*, *Peptococcaceae*.g\_rc4.4, and *Christensenellaceae,* as well as significant information gains and ReliefF weights for *Bacteroidales.f\_\_S24.7*, *Ruminococcaceae*, *Clostridiales*, and *Erysipelotrichaceae*. *Bacteroidales.f\_\_S24.7*, *Ruminococcaceae*, and *Clostridiales* were also implicated through shifting co-occurrences and PCA values. Using these seven taxa as a melanoma signature, a neural network reached an 80% tumor detection accuracy in a 10-fold stratified random sampling validation. These results indicated gut microbial proportions as a biosensor for tumor detection, and that shifting co-occurrences could be used to reveal relevant taxa.

**Keywords:** gut microbiota; machine learning; statistical algorithms; co-occurrence patterns; melanoma

#### **1. Introduction**

The gastrointestinal microbiota contains a diverse and dense collection of symbiotic organisms that contribute to intestinal homeostasis. Nutrient digestion, synthesis of vitamins, protection against pathologic organisms, and production of neurotransmitters are just a few of the biological functions that these organisms provide [1–3]. The host's immune system plays an essential role in controlling microbial growth and development in the microbiome to ensure that a mutual relationship is maintained between the host and organism.

**Citation:** Rossi, M.; Aspromonte, S.M.; Kohlhapp, F.J.; Newman, J.H.; Lemenze, A.; Pepe, R.J.; DeFina, S.M.; Herzog, N.L.; Donnelly, R.; Kuzel, T.M.; et al. Gut Microbial Shifts Indicate Melanoma Presence and Bacterial Interactions in a Murine Model. *Diagnostics* **2022**, *12*, 958. https://doi.org/10.3390/diagnostics12040958

Academic Editor: Sung Chul Lim

Received: 4 February 2022 Accepted: 4 March 2022 Published: 12 April 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

At the same time, the microbiota plays a role in adapting the host's immune system to various stressors [4]. In fact, evidence is accumulating that the intestinal microflora can respond to changes in host health status by sensing soluble host elements and local micro-environmental cues [5]. For this reason, the gastrointestinal microbiota is affected by the pathological immune responses derived from diseases such as diabetes mellitus, cancer, obesity, and inflammatory diseases, which impacts the body's immune response against disease [2,6,7].

It is increasingly being recognized that the gut microbiome composition differs significantly between healthy individuals and those with various pathological conditions. Dongmei et al. found that healthy individuals have a more diverse gut flora than those with colorectal cancer. In addition, certain bacterial populations were more likely to co-occur in patients with colorectal cancer than in healthy individuals [3]. While alterations in microbiome composition can be seen in pathologic conditions such as cancer, it is unclear whether these changes are a cause or a consequence of the disease [6]. Multiple studies that analyzed the composition of the gut microbiota in colorectal cancer patients suggested the presence of both "driver bacteria", or those that promote cancer growth, and "passenger bacteria", or those that solely flourish in the proinflammatory environment, but do not impact tumor progression. Geng et al. found that in their colorectal cancer patients, members of the *Enterobacteriaceae* family promoted cancer growth, whereas members of the *Streptococcaceae* family merely flourished in a proinflammatory environment [7].

The presence of these microbial mechanisms, in which bacterial taxa have a certain level of dependency, has wide implications for their use in modeling the respective pathological conditions. Connectivity and dependency between variables such as bacterial taxa have typically been a hindrance to the performance of predictive models [8–10]. It is widely understood that, for many kinds of algorithms and in various circumstances, variables with some manner of co-occurrence provide a certain level of redundant information and therefore reduce the variability explained by models [8]. This redundant information decreases the model's fit to the training dataset, as well as its prediction accuracy in the testing dataset [10–12].

Despite these limitations, co-occurrences in the context of pathological prediction with microbial taxa may still hold significance in the application of diagnostic signatures [8,13]. When co-occurrences shift between conditions, so does the direction of variability represented by relevant taxa in planes of higher dimensionality [9,10,14]. These shifts are reflected in principal component analysis, in which each principal component represents a different proportion of the total variability present [8,13]. They are also represented in ReliefF and information gain values, in which microbial taxa with these differences in variability have increased reliability as predictors [11,15]. Therefore, the identification of these shifts in co-occurrences in pathological conditions such as cancer is optimal for the implementation of gut microbial diagnostic signatures.

The implementation of machine-learning algorithms for the prediction of the presence of various cancers using the gut microbiome has been widely studied [16–18]. However, to date, relatively little work has been done regarding the use of the gut microbiome to predict the presence of melanoma. In addition, one of the challenges of predicting the presence of a specific disease with the gut microbiota is the variability in relative proportions of specific gut bacteria that can exist between patients and populations [12]. Through our analyses, we have indicated shifts in microbial co-occurrences as a potential method in accounting for such variability. Therefore, we hypothesized that models based on gut microbial proportion profiles of taxa involved in co-occurrence shifts could form a distinct diagnostic signature that effectively differentiated mice bearing mouse melanoma tumors from non-tumor-bearing mice. This implies that the intestinal microflora may function as a biosensor for the presence of cancer, and that its manipulation may alter cancer prognoses.

#### **2. Results**

#### *2.1. Shifts in Microbial Taxon Proportions of Melanoma-Bearing Mice*

Mice bearing melanoma tumors displayed significant shifts in gut microbial proportions compared to non-tumor-bearing mice, which: (1) implicated consistency in changes in gut microbiota data with tumors in the skin, distal to the gut; and (2) implied that such changes could be used by an algorithm to detect the presence of cancer. We compared the microbial composition of fecal samples of melanoma-bearing and tumor-free mice by terminal restriction fragment length polymorphism (T-RFLP) analysis [14,16]. This technique is commonly used to study complex microbial communities based on 16S rRNA gene variation, and has been applied in the study of microbial communities in soil and sludge systems [19]. T-RFLP analysis was carried out in a blinded fashion as previously described [4]. It was readily seen for the two mouse experiments (Figure 1) that the co-occurrences of relative taxon proportions shifted in the presence of B16 melanoma. In addition, *Peptococcaceae*.g\_rc4.4 was significantly increased (Wilcoxon *p* < 0.05) in both groups of mice (Figure 1). These data demonstrated that the intestinal flora developed detectable changes that discriminated a tumor-bearing from a tumor-free host. In order to more fully determine the extent to which these results distinguished between hosts that had a tumor and those that did not, the two mouse groups were combined and further analyzed as a single dataset (*n* = 56).

**Figure 1.** Shifted co-occurrences of microbial taxa and increased *Peptococcaceae*.g\_rc4.4 characterize tumor presence. (**A**) C57BL/6 (B6) male mice were injected with either 10<sup>5</sup> B16 melanoma cells (*n* = 19) or PBS (*n* = 16). After 10 days, fecal samples were collected and 16S rRNA genes were analyzed using terminal restriction fragment length polymorphism (T-RFLP) analysis. From individual taxon proportion and co-occurrence patterns, it could be seen that such patterns shifted with melanoma presence, and *Peptococcaceae*.g\_rc4.4 levels increased. (**B**) B6 male mice were injected with either 10<sup>5</sup> B16 melanoma cells (*n* = 11) or PBS (*n* = 10). After 16 days, fecal samples were collected and 16S rRNA genes were analyzed using terminal restriction fragment length polymorphism (T-RFLP) analysis. The results of these data directly corresponded with the mice in (**A**).

#### *2.2. Co-Occurrence between Bacteroidales.f\_\_S24.7, Clostridiales, and Ruminococcaceae Proportions in Mouse Melanoma*

Seeking to identify the specific bacterial co-occurrences that were altered in the presence of a tumor, we first used Cytoscape to map them in the B16-melanoma- and PBS-treated mice. From these diagrams (Figure 2A,B), it was found that the co-occurrences of *Bacteroidales.f\_\_S24.7* greatly differed between the two treatments. When looking further into this taxon, it was found that its co-occurrences with *Clostridiales* and *Ruminococcaceae* had changed the most between tumor and nontumor/PBS (Figure 2C,D), with Pearson correlation values of approximately −0.9 and −0.8 for tumor, as well as −0.15 and −0.13 for nontumor, respectively. Interestingly, however, when looking at the individual relative amounts of these taxa, the only one that was significantly different between tumor and nontumor was *Ruminococcaceae* (Wilcoxon *p* < 0.05, *T*-test *p* < 0.05; Figure 2E). Thus, we concluded that the potential for these taxa to predict tumor presence relied heavily on the extent to which their co-occurrences shifted in that condition, rather than changes in their individual relative amounts.
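A co-occurrence shift of this kind can be quantified by computing the Pearson correlation between two taxa separately in each group and then taking the difference. The sketch below illustrates the calculation with the standard library only. The proportions are hypothetical values chosen for illustration and are not data from this study, which used R and Cytoscape.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical relative proportions of two taxa in each group.
tumor_taxon_a = [0.1, 0.2, 0.3, 0.4]
tumor_taxon_b = [0.8, 0.6, 0.4, 0.2]
nontumor_taxon_a = [0.1, 0.2, 0.3, 0.4]
nontumor_taxon_b = [0.5, 0.3, 0.6, 0.4]

r_tumor = pearson_r(tumor_taxon_a, tumor_taxon_b)
r_nontumor = pearson_r(nontumor_taxon_a, nontumor_taxon_b)
# A large absolute difference marks a pair whose co-occurrence shifted.
shift = r_tumor - r_nontumor
```

Ranking taxon pairs by the absolute value of `shift` is one simple way to surface candidates such as those described above, although it does not by itself test whether a shift is statistically significant.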

**Figure 2.** Co-occurrence changes between *Bacteroidales.f\_\_S24.7*, *Clostridiales,* and *Ruminococcaceae* occur with tumor presence. (**A**,**B**) Pearson correlation matrices were determined for microbiotas from tumor and nontumor mice and displayed using Cytoscape. From these visualizations, *Bacteroidales.f\_\_S24.7* co-occurrences greatly changed with tumor presence. (**C**,**D**) Using the R programming language, it was found that the most dramatic shifts of *Bacteroidales.f\_\_S24.7* were in conjunction with *Clostridiales* and *Ruminococcaceae*. (**E**) When comparing each taxon individually between tumor and nontumor, only *Ruminococcaceae* was significantly different.

#### *2.3. Differences in Principal Components between Tumor and Nontumor*

Considering our results for both individual microbial taxa and co-occurrence shifts, we wanted to assess the relevance of each taxon in the context of predictive modeling. Thus, we calculated the information gains and ReliefF weights for each taxon (Figure 3A,B). In the scoring for information gains, *Ruminococcaceae*, *Peptococcaceae.g\_rc4.4*, and *Christensenellaceae* consistently scored higher than the majority of taxa (Figure 3A). For the ReliefF algorithm, *Bacteroidales.f\_\_S24.7* had a fairly high weight, along with *Peptococcaceae.g\_rc4.4* and *Christensenellaceae* (Figure 3B). Further, *Christensenellaceae* was found to be significantly different between tumor and nontumor (Wilcoxon *p* < 0.05, Figure 3A,B). Considering that *Bacteroidales.f\_\_S24.7* shifted its co-occurrences and its ReliefF weight indicated variable importance, we performed a principal component analysis (PCA) using this taxon (Figure 3C,D). Two PCAs were performed, one with *Clostridiales* and the other with *Ruminococcaceae* (Figure 3C,D). After performing the PCAs, we compared the resulting principal component coordinates between tumor and nontumor mice. From this comparison, we found that, although the first principal components did not differ between the two groups (Figure 3C), the second ones did (Wilcoxon *p* < 0.05, *T*-test *p* < 0.05; Figure 3D). These results indicated that the coordinates of these second principal components could be implemented in predictive modeling.
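For a PCA on just two taxa, the principal components have a closed form, which makes it easy to see what the second-component coordinates represent. The sketch below is an illustration under stated assumptions: the data are centered but not scaled, the function name `pca_2d` is hypothetical, and the input values are invented. The study itself performed the PCAs in R.

```python
import math
from typing import List, Sequence, Tuple


def pca_2d(x: Sequence[float],
           y: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (PC1 scores, PC2 scores) for two centered, unscaled variables."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(v * v for v in dx) / (n - 1)
    c = sum(v * v for v in dy) / (n - 1)
    b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam1 = (a + c) / 2.0 + math.hypot((a - c) / 2.0, b)
    if b != 0.0:
        v1 = (b, lam1 - a)
    elif a >= c:
        v1 = (1.0, 0.0)
    else:
        v1 = (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    # In two dimensions, PC2 is the direction orthogonal to PC1.
    v2 = (-v1[1], v1[0])
    pc1 = [u * v1[0] + v * v1[1] for u, v in zip(dx, dy)]
    pc2 = [u * v2[0] + v * v2[1] for u, v in zip(dx, dy)]
    return pc1, pc2


# Hypothetical centered proportions of two taxa in four samples.
taxon_a = [-1.0, 1.0, -0.5, 0.5]
taxon_b = [-1.0, 1.0, 0.5, -0.5]
pc1_scores, pc2_scores = pca_2d(taxon_a, taxon_b)
```

The sign of each component is arbitrary, so group comparisons on the scores should use tests that are unaffected by flipping the sign for all samples at once.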


**Figure 3.** Significant predictors of tumor presence include the second principal components involving *Bacteroidales.f\_\_S24.7*, *Clostridiales,* and *Ruminococcaceae*. (**A**,**B**) Using the CORElearn package in the R programming language, the information gains and ReliefF weights were calculated for each taxon. (**A**) *Ruminococcaceae*, *Peptococcaceae.g\_rc4.4*, and *Christensenellaceae* were found to be significantly altered with tumor presence and to have high information gains. (**B**) Along with *Peptococcaceae.g\_rc4.4* and *Christensenellaceae*, *Bacteroidales.f\_\_S24.7* and *Erysipelotrichaceae* had high ReliefF weights. (**C**,**D**) Two PCAs using *Bacteroidales.f\_\_S24.7*, one with *Ruminococcaceae* and the other with *Clostridiales*, were conducted using R. While their first principal components did not change with tumor, their second ones did (Wilcoxon *p* < 0.05, *T*-test *p* < 0.05 (**D**)).
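To make the information-gain score concrete, the sketch below computes it for a single taxon after splitting its proportion at a chosen cut point. This is a simplified illustration: the cut point is supplied by the caller, which is an assumption made here for brevity, and the code does not reproduce how CORElearn discretizes numeric attributes. All values are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(values: Sequence[float],
                     labels: Sequence[int],
                     cut: float) -> float:
    """Reduction in label entropy after splitting `values` at `cut`."""
    low = [lab for v, lab in zip(values, labels) if v <= cut]
    high = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    # Weighted entropy of the two partitions (empty partitions are skipped).
    conditional = sum(
        len(part) / n * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - conditional


# Hypothetical taxon proportions; label 1 = tumor, 0 = nontumor.
status = [0, 0, 1, 1]
informative_taxon = [0.1, 0.2, 0.8, 0.9]
uninformative_taxon = [0.1, 0.9, 0.2, 0.8]
gain_high = information_gain(informative_taxon, status, cut=0.5)
gain_low = information_gain(uninformative_taxon, status, cut=0.5)
```

A taxon whose split separates the two groups perfectly attains the full label entropy as its gain, whereas a taxon whose split leaves both groups equally mixed scores zero.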

#### *2.4. Prediction of Tumor Presence Using Microbial Taxa Involved in Altered Co-Occurrences*

Since the second principal components involving *Bacteroidales.f\_\_S24.7*, *Ruminococcaceae*, and *Clostridiales* were found to significantly differ with tumor presence, the proportions of those taxa, along with those of *Peptococcaceae.g\_rc4.4*, *Christensenellaceae*, and *Erysipelotrichaceae,* were implemented as a mouse melanoma signature (Figure 4A,B). The 10-fold stratified random sampling used to obtain melanoma prediction results with machine-learning algorithms was performed by randomly selecting 90% of the mouse samples to train the algorithms and then testing them with the remaining 10% of samples (Figure 4A). This process was repeated 10 times, and the prediction results were averaged over those repeats (Figure 4A). Using this protocol, the highest percent accuracy in melanoma prediction was achieved by the neural network, with 80% (Figure 4A,B). Thus, the implementation of microbial taxa indicated by the second principal components in the prediction signature allowed for the identification of melanoma presence.
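The sampling protocol described above (ten repeats of a random 90%/10% split that preserves the class ratio) can be sketched with the standard library as follows. This is not the Orange3 implementation; the function name, the rounding rule, and the seed are assumptions made for illustration. The group sizes (30 tumor, 26 PBS) are taken from the Figure 1 legend.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_shuffle_splits(labels: Sequence[int],
                              n_splits: int = 10,
                              test_fraction: float = 0.1,
                              seed: int = 0
                              ) -> List[Tuple[List[int], List[int]]]:
    """Return `n_splits` (train_indices, test_indices) pairs.

    Each class is shuffled and split separately so that the test set
    keeps roughly the same class ratio as the full dataset.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    splits = []
    for _ in range(n_splits):
        train: List[int] = []
        test: List[int] = []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Assumed rounding rule: at least one test sample per class.
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        splits.append((sorted(train), sorted(test)))
    return splits


# 30 tumor-bearing (1) and 26 PBS control (0) mice, n = 56.
mouse_labels = [1] * 30 + [0] * 26
splits = stratified_shuffle_splits(mouse_labels)
```

A classifier would be fitted on each `train` index set and scored on the matching `test` set, with the ten scores averaged. Because the test sets are drawn independently, the same mouse can appear in more than one test set, unlike in k-fold cross-validation.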

*Figure 4B panel title: 10-fold Stratified Random Sampling with Taxon Proportions.*

**Figure 4.** Implementation of the microbial taxa implicated in the second principal components accurately predicts tumor presence. (**A**) Using Orange3, 10-fold stratified shuffle splits were performed. (**B**) A prediction signature that included *Bacteroidales.f\_\_S24.7*, *Ruminococcaceae*, and *Clostridiales*, implicated in the second principal components, resulted in an average accuracy of 80% with a Neural Network classifier. AUC, area under the curve; CA, classification accuracy; F1, F1 score.

#### **3. Discussion**

Our findings demonstrated that the presence of a mouse melanoma tumor can be detected through the altered gut microbial proportions using classification algorithms. By using the gut microbial taxa to model tumor presence, it became apparent that such a condition manifested in more ways than just changes in individual amounts of certain taxa. Indeed, one of the main implications of this study is that considering gut microbial taxa co-occurrences and dependencies in predictive modeling can significantly increase predictive power in melanoma, more so than analyzing only statistical significance between groups. This concept of intertaxa correlations in modeling microbial-based conditions has wide applications in the interpretation of the gut microbiota, as it suggests that the role of an individual taxon in manifesting a biological phenotype is not solely attributed to its unique characteristics [17,18]. Rather, this role also depends on the extent to which a single taxon can communicate and affect other taxa through various mechanisms, from quorum sensing to metabolite production [20–23].

Despite this apparent predictive relationship between murine melanoma and the gut microbiota, certain experimental limitations still existed. The primary limitation for consideration was the external validity of these results. It is often the case that gut microbiota data do not directly correspond between murine and human subjects, with various mechanisms implicated, from general differences in GI physiology to lifestyle, epigenetics, and immune responses [24–26]. Thus, in order for gut microbial associations to be implemented in clinical cancer diagnoses, further work needs to be done to elucidate pertinent taxa in a variety of human populations and pathophysiological states, including cancer, as well as the interaction between shifts in gut microbial content and certain factors such as diet and lifestyle. Most pertinent to patient treatment is the level of interaction between host immune responses and the gut microbiota, as antitumor immunity and immunotherapies may affect prediction outcomes [27,28]. These studies would also need to consider the correlation between patient stool sampling and gut microbial content with cancer presence, as sampling variation may be a confounder [24]. Finally, since our gut microbiota data had a certain level of variation, other parameters should be considered in the future predictive modeling of human melanoma, such as biochemical and clinical observations [29].

In the statistical analysis of gut microbial taxa, algorithms have been developed to accurately detect the presence of these intertaxa co-occurrences [30–32]. Such algorithms for the detection of microbial "co-occurrence networks" include Sparse Inverse Covariance Estimation for Ecological Association Inference (SPIEC-EASI) and Sparse Correlations for Compositional Data (SparCC) [31–33]. However, despite these advances in the statistical detection of these interactions, there has not been as much work to determine their efficacy in different types of classification algorithms in conditions such as melanoma. In fact, their presence in predictive models has generally been discouraged, as the collinearity they create has been shown to compromise the performance of many model types [34–36]. Further, even for models that can more readily account for collinearity, the use of such interactions does not consistently increase performance [34–36]. Thus, a new statistical interpretation of intertaxa co-occurrences is needed for them to be optimally utilized in a predictive model. Perhaps new insights into such interpretations can eventually be made when taxa indicated by shifts in co-occurrence networks are further tested in more architecturally complex algorithms such as deep-learning neural networks.

Traditionally, one of the most common procedures in dealing with collinearity between variables such as microbial taxa is the use of principal components from principal component analysis (PCA) [34–37]. By construction, the resulting principal components are uncorrelated with each other, and are thus used in various model types [34–37]. These components are not usually interpretable from the perspective of the original data because they are linear transformations of that data [34–37]. However, if a small number of variables (e.g., two or three) is used, the principal components can be more easily interpreted [34–37]. In this study, PCA was able to differentiate the two groups of mice successfully; however, much work still needs to be done to characterize the significance of individual PCs in different situations, such as in other clinically relevant tumor types.
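The decorrelating property of PCA described above can be illustrated with a minimal sketch. It is not the study's analysis (which used the prcomp function in R); it assumes two hypothetical, collinear taxon proportions generated at random, and it uses the closed-form rotation that diagonalizes a 2 × 2 covariance matrix. The names `pca_2d`, `taxon_a`, and `taxon_b` are illustrative only.

```python
import math
import random

def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def pca_2d(xs, ys):
    """Return the scores on PC1 and PC2 for two variables."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cx = [x - mx for x in xs]          # center each variable
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Rotation angle that zeroes the off-diagonal covariance term.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2

# Hypothetical data: taxon_b depends strongly on taxon_a (collinearity).
rng = random.Random(1)
taxon_a = [rng.gauss(0.0, 1.0) for _ in range(50)]
taxon_b = [0.8 * a + rng.gauss(0.0, 0.3) for a in taxon_a]

pc1, pc2 = pca_2d(taxon_a, taxon_b)
print("covariance of raw taxa:", covariance(taxon_a, taxon_b))
print("covariance of PC scores:", covariance(pc1, pc2))
```

The raw variables covary strongly, whereas the covariance of the two score vectors is zero up to floating-point error, which is why principal components can be passed to models that are sensitive to collinearity.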

#### **4. Methods**

#### *4.1. Cell Culture*

B16-F10 cells (ATCC) were cultured in RPMI 1640 plus 10% heat-inactivated fetal bovine serum (Atlanta Biologicals, Flowery Branch, GA, USA), 2 mM L-glutamine (Mediatech, Manassas, VA, USA), and 1% penicillin/streptomycin (Mediatech).

#### *4.2. Mouse Experiments*

C57BL/6 mice (B6; no. 00664; Jackson Laboratory) were housed in a specific pathogen-free facility at the Rutgers Cancer Institute of New Jersey. Experiments involving animals were carried out in accordance with respective Institutional Animal Care and Use Committee (IACUC) and Institutional Biosafety Committee (IBC) guidelines.

In the first experiment, 35 B6 male mice from the Jackson Laboratory, aged 6 to 8 weeks, were intradermally challenged in the right flank with 10<sup>5</sup> cells of the highly aggressive and poorly immunogenic melanoma B16 cell line (*n* = 19) [17] or phosphate buffered saline (PBS) (*n* = 16) under isoflurane anesthesia. Mice were fed regular chow according to animal care institutional guidelines. Fecal sample collection to compare tumor-bearing to non-tumor-bearing mice was carried out on day 10, when tumors were approximately 25–50 mm<sup>2</sup>. Samples were stored immediately at −80 °C until DNA extraction [38] and sequencing.

The second experiment at this facility followed the identical protocol, using 21 B6 male mice aged 6 to 8 weeks that were intradermally challenged in the right flank with 10<sup>5</sup> cells of the highly aggressive and poorly immunogenic melanoma B16 cell line (*n* = 11) [17] or phosphate buffered saline (PBS) (*n* = 10) under isoflurane anesthesia. Fecal sample collection to compare tumor-bearing to non-tumor-bearing mice was carried out on day 16, when tumors were approximately 25–50 mm<sup>2</sup>. Samples were stored immediately at −80 °C until DNA extraction [38] and sequencing.

#### *4.3. DNA Extraction*

Fecal pellets were homogenized and extracted using the QIAamp PowerFecal DNA Extraction kit following the manufacturer's protocols [39].

#### *4.4. 16S rRNA Gene Sequencing and Data Analysis*

The 16S rRNA genes were amplified from purified DNA using PCR primers specific to the V3–V4 region of the 16S rRNA gene and sequenced by Illumina MiSeq in a 2 × 150 bp configuration at the Rutgers New Jersey Medical School Genomics Core. Quantitative Insights Into Microbial Ecology (QIIME) software was used for open-reference operational taxonomic unit (OTU) classification with OTU clustering at 0.97, followed by rarefaction and taxonomic classification of de novo OTUs [40].

#### *4.5. qPCR for Bacterial Load and Taxa Assays*

Bacterial loads of extracted fecal DNA were determined by qPCR. DNA was quantified against a standard curve, and the results were normalized to the weight of fecal samples [40].

#### *4.6. Taxon Comparisons, Analyses, and Statistical Modeling*

Using the R programming language, microbial taxa between tumor-bearing and PBS control mice were compared using Welch's *t*-test as well as the Mann–Whitney *U* test (a *p*-value of <0.05 was considered to denote statistically significant differences). Between these two groups of mice, general taxa and comparison attributes were determined using the Orange3 v3.27.1 data-mining program and the CORElearn package in CRAN. PCA and the resulting principal components were computed using the prcomp function in R. General machine-learning model analyses and cross-validation procedures were performed using the Orange3 program with these settings:

The neural network had a single hidden layer of 100 neurons and used the ReLU activation function and the Adam solver.

The support vector machine (SVM) used a radial basis function (RBF) kernel with a cost of 1.0 and a regression loss epsilon of 0.1.

The AdaBoost used a SAMME.R classification algorithm with a linear regression loss function, 50 estimators, and learning rate of 1.0.

The CN2 rule inducer used entropy as the evaluation measure, a beam width of 5, and a maximum rule length of 5.

The random forest used a 12-tree ensemble with subsets split no smaller than 5.

The k-nearest neighbor (kNN) used 5 neighbors and considered the Euclidean distance and uniform weights.

For the naïve Bayes, the attributes were not weighted.

The decision tree used a maximal tree depth of 100 and did not split subsets smaller than 5.

In the logistic regression, a ridge regularization was implemented.

Quality parameters for this model were determined using an internal 10-fold stratified shuffle split, with 90% of the samples selected for training and the remaining 10% for testing in Orange3. Results were graphed using the ggplot2, ggrepel, and ggpubr packages in CRAN, as well as Orange3 and Cytoscape v3.7.2. Heatmaps were generated using the ComplexHeatmap package in CRAN. Tables were formatted using the sjPlot package in CRAN.
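The sampling scheme described above (10 repeats, each holding out roughly 10% of the samples with class proportions preserved) can be sketched in plain Python. This is a minimal illustration under stated assumptions and is not Orange3's implementation: the function `stratified_shuffle_splits` is hypothetical, the group sizes are taken from the first experiment purely as an example, and the per-class rounding rule is an assumption.

```python
import random
from collections import defaultdict

def stratified_shuffle_splits(labels, n_splits=10, test_fraction=0.1, seed=0):
    """Yield (train_idx, test_idx) pairs; every split draws about
    test_fraction of each class at random, independently of other splits."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for _ in range(n_splits):
        train, test = [], []
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            # Assumed rounding rule: at least one test sample per class.
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)

# Illustrative group sizes: 19 tumor-bearing and 16 PBS control mice.
labels = ["tumor"] * 19 + ["pbs"] * 16
splits = list(stratified_shuffle_splits(labels))

for train_idx, test_idx in splits[:2]:
    print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

A classifier would be fitted on each training subset and scored on the matching test subset, and the scores averaged over the 10 splits. Unlike classical k-fold cross-validation, shuffle splits are drawn independently, so a sample may appear in several test sets or in none. With these group sizes, rounding per class holds out two mice from each group per split.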

**Author Contributions:** Conceptualization, J.A.G.-P. and A.Z.; formal analysis, M.R.; investigation, S.M.A., F.J.K., J.H.N., A.L., R.J.P., S.M.D., N.L.H. and R.D.; resources, T.M.K. and J.R.; data curation, M.R.; writing—original draft preparation, M.R.; writing—review and editing, S.M.A., F.J.K., J.H.N., A.L., R.J.P., S.M.D., N.L.H., R.D., T.M.K., J.R., J.A.G.-P. and A.Z.; supervision, J.A.G.-P. and A.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data supporting the findings of this study are available from the corresponding authors upon reasonable request.

**Acknowledgments:** We would like to thank those at the Rutgers Robert Wood Johnson Medical School, Rutgers New Jersey Medical School, and Rutgers Cancer Institute of New Jersey who offered technical assistance. We would also like to thank those at Rush University Medical Center who provided input and support in the writing of this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Communication* **Method for the Intraoperative Detection of IDH Mutation in Gliomas with Differential Mobility Spectrometry**

**Ilkka Haapala 1,\*, Anton Kondratev 2, Antti Roine 2,3, Meri Mäkelä 2,3, Anton Kontunen 2,3, Markus Karjalainen 2,3, Aki Laakso 4, Päivi Koroknay-Pál 4, Kristiina Nordfors 5, Hannu Haapasalo 6, Niku Oksala 2,3, Antti Vehkaoja <sup>2</sup> and Joonas Haapasalo <sup>1</sup>**


**Abstract:** Isocitrate dehydrogenase (IDH) mutation status is an important factor for surgical decision-making: patients with IDH-mutated tumors are more likely to have a good long-term prognosis, and thus favor aggressive resection with more survival benefit to gain. Patients with IDH wild-type tumors have generally poorer prognosis and, therefore, conservative resection to avoid neurological deficit is favored. Current histopathological analysis with frozen sections is unable to identify IDH mutation status intraoperatively, and more advanced methods are therefore needed. We examined a novel method suitable for intraoperative IDH mutation identification that is based on the differential mobility spectrometry (DMS) analysis of the tumor. We prospectively obtained tumor samples from 22 patients, including 11 IDH-mutated and 11 IDH wild-type tumors. The tumors were cut into 88 smaller specimens that were analyzed with DMS. With a linear discriminant analysis (LDA) algorithm, the DMS was able to classify tumor samples with 86% classification accuracy, 86% sensitivity, and 85% specificity. Our results show that DMS is able to differentiate IDH-mutated and IDH wild-type tumors with good accuracy in a setting suitable for intraoperative use, which makes it a promising novel solution for neurosurgical practice.

**Keywords:** differential mobility spectrometry; neuro-oncology; neurosurgery; glioma; classification; isocitrate dehydrogenase (IDH)

### **1. Introduction**

Gliomas represent the most clinically important group of primary brain tumors. Traditionally, they have been classified into WHO groups I–IV to evaluate their malignant potential by analysis of their morphological features. However, the past decades of research have led to the discovery of many molecular alterations in gliomas that have a great impact on the tumor's malignancy and, accordingly, to the patient's prognosis [1]. Among such alterations, the mutation of isocitrate dehydrogenase (IDH) enzymes 1 or 2 is highly correlated with the patient's overall survival, and the effect is present regardless of the tumor's histopathological WHO grade [2–5]. IDH mutation also seems to play a pivotal role in the carcinogenesis of other solid tumors, such as cholangiocarcinoma, where it is also a major target for medical therapy [6–8].

**Citation:** Haapala, I.; Kondratev, A.; Roine, A.; Mäkelä, M.; Kontunen, A.; Karjalainen, M.; Laakso, A.; Koroknay-Pál, P.; Nordfors, K.; Haapasalo, H.; et al. Method for the Intraoperative Detection of IDH Mutation in Gliomas with Differential Mobility Spectrometry. *Curr. Oncol.* **2022**, *29*, 3252–3258. https://doi.org/10.3390/curroncol29050265

Received: 21 February 2022 Accepted: 29 April 2022 Published: 4 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Normally, IDH enzymes catalyze the oxidative decarboxylation of isocitrate to form α-ketoglutarate (αKG) in the Krebs cycle. IDH1 and IDH2 localize differently in the cell but share the same function; hence, they are hereafter referred to collectively as IDH. The mutation of IDH confers a neomorphic enzyme activity that catalyzes the reduction of αKG into the putative oncometabolite D-2-hydroxyglutarate (D2HG) [9]. The accumulation of D2HG is further associated with the hypermethylation of DNA and chromatin, which is thought to dysregulate cell epigenetics [10,11].

IDH mutation status is an important factor for surgical decision-making: patients with IDH-mutated tumors are more likely to have a good long-term prognosis, and thus favor aggressive gross total resection with more survival benefit to gain. Patients with IDH wild-type tumors have a generally poorer prognosis and, therefore, conservative resection to avoid neurological deficit is favored [12–14]. The effect of gross total resection on survival also persists in recurrent disease [15,16]. Current histopathological analysis based on frozen sections is unable to identify molecular characteristics, including IDH mutation, within the time frame of surgery [17], thus creating a pressing need for new solutions.

We have previously shown that differential mobility spectrometry (DMS) is able to identify different brain tumors ex vivo [18]. DMS characterizes substances based on the mobility differences of ionized particles in high-frequency electrical fields, resulting in a substance-specific dispersion spectrum, or "smell fingerprint" [19]. The simplicity, speed, and cost-effectiveness of DMS make it a compelling emerging technology for clinical applications [18]. In this study, we demonstrate the rapid, preparation-free analysis of a tumor's IDH mutation status with DMS.

#### **2. Materials and Methods**

We prospectively obtained tumor samples from 22 patients who had neurosurgical operations at Tampere University Hospital between the years 2018 and 2021, and at Helsinki University Hospital in 2020. Patient recruitment was continued until we had a sufficient number of IDH-mutated tumors, which are rarer. To make balanced classes, an equal number of IDH wild-type tumors were randomly selected for the experiment. Eventually, we had 11 IDH-mutated tumors and 11 IDH wild-type tumors. IDH-mutated tumors included 5 WHO gr. II–III astrocytomas, 3 gr. II–III oligodendrogliomas, and 3 gr. IV glioblastomas (GBM). IDH wild-type tumors included 1 gr. III astrocytoma and 10 GBMs. Diagnoses were made by an experienced neuropathologist and IDH mutation was identified with immunohistochemistry. The study was approved by the ethics review board of Pirkanmaa Hospital District, Finland. The patients gave their written consent for the study.

All samples were stored in a freezer at −70 °C. The samples were carefully cut into 88 (44 IDH-mutated and 44 IDH wild-type) smaller specimens of macroscopically equal sizes. Blood, if macroscopically visible, was carefully rinsed from the samples before the analysis. The samples were randomly placed in a plastic well plate with each well containing 0.18 mL of agar in the bottom. Each sample was incised four times in a square pattern, with 1 mm gaps between the incisions, using a custom-built, computer-controlled, 40 W, 10.6 μm CO<sub>2</sub> laser evaporator. The total number of incisions was 352. The laser sampling was controlled by a graphical user interface. To provide a clean and controlled supply of carrier gas for the analyte gas, purified and humidified pressurized air was introduced to the sampling stage via a sampling nozzle. The sampling nozzle provided a protective stream of carrier gas around the sampling area and, after sample vaporization, transported the sample gas to the DMS inlet. The DMS used in the study was a commercial IonVision instrument (Olfactomics Oy, Finland). The measurement parameters for the DMS spectrum were: separation voltage (Usv), 200–1000 V with 20 increments; compensation voltage (Ucv), −2 to 10 V with 60 increments; separation field frequency, 1 MHz; and duty cycle of the field, 22%. With these parameters, the DMS measurement produced a total of 1200 data points (20 × 60), and the duration of the measurement was approximately 13 s, during which 250 laser pulses of 2 ms each were used to provide a sample stream of vaporized tissue to the DMS.
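The counts quoted above follow directly from the stated parameters, as the short sketch below shows. It assumes, for illustration only, that both voltage ranges are sampled at evenly spaced steps that include the end points; the instrument's actual step placement is not specified in the text, and the helper `linspace` is hypothetical.

```python
def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop (end points included)."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

# Assumed step placement: evenly spaced, end points included.
usv = linspace(200.0, 1000.0, 20)   # separation voltage, V
ucv = linspace(-2.0, 10.0, 60)      # compensation voltage, V

# One dispersion spectrum is the full grid of (Usv, Ucv) combinations.
grid = [(u_s, u_c) for u_s in usv for u_c in ucv]

points_per_scan = len(grid)      # 20 x 60 data points per measurement
total_incisions = 88 * 4         # 88 specimens, four incisions each
laser_on_time_s = 250 * 0.002    # 250 pulses of 2 ms each

print(points_per_scan, total_incisions, laser_on_time_s)
```

Under these assumptions each measurement is a 1200-dimensional feature vector, and the laser is active for about 0.5 s of the approximately 13 s measurement.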

An overview of the setup (A–D) and examples of the dispersion spectra (G) are presented in Figure 1.

**Figure 1.** The setup: (**A**) humidifier; (**B**) sampling unit; (**C**) DMS analyzer; (**D**) graphical user interface; (**E**) computing unit for data analytics; (**F**) workflow of the algorithm; (**G**) examples of IDH−positive and −negative dispersion spectra. Vc = compensation voltage; Vrf = peak-to-peak amplitude of the radiofrequency waveform voltage.

We evaluated the accuracy of several machine learning algorithms for the detection of differences in dispersion spectra and the classification of the analyzed samples. Linear discriminant analysis (LDA) was found to be the best performing algorithm. The main idea of training an LDA algorithm is the projection of data points to a lower dimensional space so that the between-class distance of class centers is maximized, and the within-class distance of data points is minimized, defining a decision boundary between the classes that is used to classify new samples. The other algorithms tested were K-nearest neighbors (KNN), random forest (RF), decision tree (DT), support vector machines (SVM) and XGBoost (XGB).
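The idea behind LDA described above can be sketched for two classes and two features using Fisher's formulation. This is a toy illustration, not the study's pipeline: the data points are hypothetical, equal class priors are assumed for the threshold, and a real 1200-dimensional DMS spectrum would require a library implementation, typically with regularization.

```python
def mean_vec(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Within-class scatter terms (sxx, sxy, syy) around the class mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fit_lda(class0, class1):
    """Return the projection vector w and the decision threshold."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1   # pooled scatter [[a, b], [b, c]]
    det = a * c - b * b
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^-1 (m1 - m0): maximizes between-class distance relative
    # to within-class spread along the projection.
    w = [(c * d[0] - b * d[1]) / det, (-b * d[0] + a * d[1]) / det]
    # Threshold halfway between projected class means (equal priors assumed).
    proj = lambda x: w[0] * x[0] + w[1] * x[1]
    return w, 0.5 * (proj(m0) + proj(m1))

def predict(w, threshold, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Hypothetical two-feature samples for the two classes.
wild_type = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
mutated = [(4.0, 4.5), (4.5, 4.0), (5.0, 5.2), (4.2, 4.8)]

w, threshold = fit_lda(wild_type, mutated)
print(predict(w, threshold, (1.3, 2.1)), predict(w, threshold, (4.6, 4.4)))
```

New samples are classified by projecting them onto `w` and comparing the result with the threshold, which corresponds to the linear decision boundary described in the text.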

#### **3. Results**

The data set revealed a temperature rise that caused baseline drift during the measurement of one well plate, biasing the data. A necessary preprocessing step was therefore to remove the dimension-wise linear trend associated with the well plate from each part of the data set. This preprocessing step improved the classification results compared to the classification of the raw data. The data set contained 352 samples taken from 22 patients. Group cross-validation was utilised to estimate the classification performance. In group cross-validation, each iteration holds out one group of samples for testing only, and the other groups are used for training. In this case, the nested variant was used: each iteration holds out one group for testing, and the other groups are used for training and validation. The next iteration holds out the second group for testing and uses the others for training and validation, and so on. This approach ensures that no data leak into the training phase. With the nested group cross-validation training, the LDA algorithm reached a classification accuracy of 86%, with 86% sensitivity and 85% specificity (Table 1). The workflow of the LDA algorithm is presented in Figure 1F. Further details of the cross-validation and classification results reached with other algorithms are presented in the Supplementary File.
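The nested group cross-validation described above can be sketched as an index generator. This is a minimal illustration under assumptions: each group is taken to be one patient, the inner loop is assumed to be leave-one-group-out as well (the exact inner scheme is described only in the Supplementary File), and the function `nested_group_splits` and the toy group labels are hypothetical.

```python
def nested_group_splits(groups):
    """Yield (test_idx, inner_folds) for nested leave-one-group-out CV.

    inner_folds is a list of (train_idx, val_idx) pairs built only from
    the groups that are not held out for testing.
    """
    unique = sorted(set(groups))
    for test_group in unique:
        test_idx = [i for i, g in enumerate(groups) if g == test_group]
        inner_folds = []
        for val_group in unique:
            if val_group == test_group:
                continue
            val_idx = [i for i, g in enumerate(groups) if g == val_group]
            train_idx = [i for i, g in enumerate(groups)
                         if g not in (test_group, val_group)]
            inner_folds.append((train_idx, val_idx))
        yield test_idx, inner_folds

# Toy example: 4 hypothetical patients with 3 measurements each.
groups = ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3 + ["p4"] * 3
folds = list(nested_group_splits(groups))

for test_idx, inner_folds in folds:
    print("test:", test_idx, "inner folds:", len(inner_folds))
```

Because all measurements from a patient stay together in the test, validation, or training partition, measurements from the same tumor can never appear on both sides of a split, which is the leakage the text refers to.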

In terms of the samples, out of the original 22 tumor samples (352 incisions), 8 samples had all their incisions correctly classified. In five samples, fewer than 10% of the incisions were misclassified. In four samples, 10–20% were misclassified. In five samples, 20–50% of the incisions were misclassified. The tumors with misclassified incisions included eight IDH wild-type tumors and six IDH-mutated tumors. The most difficult tumor type for the classifier was gr. IV GBM.

**Table 1.** Cross tabulation of the classification results (LDA).


#### **4. Discussion**

Our results show that the smoke generated from the IDH-mutated and IDH wild-type gliomas had distinct DMS profiles, and the DMS could differentiate them with good sensitivity and specificity. The laser evaporator platform is compact enough to be placed in the operating room and used for intermittent analysis of the tumor samples during surgery. The duration of measurement was approximately 13 s, so the DMS operates in almost real time. The DMS is also simpler and more economical than conventional mass spectrometer-based solutions. Conventional frozen section analysis is unable to identify molecular alterations in tumors, such as IDH mutation. In the latest WHO tumor classification, these alterations have become ever more prominent. This creates an increasing need for novel tumor identification methods in neurosurgical departments worldwide.

Recently, Raman spectroscopy has also been used for genotyping unprocessed glioma samples [20]. Raman spectroscopy is a modality that gives spectral tissue characteristics based on molecular signatures resulting from the inelastic scattering of incident light. Our results are comparable to those achieved with Raman spectroscopy, and the workflow in DMS is at least as fast and straightforward.

Our tumor sample set included both IDH-mutated and IDH wild-type gr. IV GBMs and gr. III malignant astrocytomas. Out of the tumors with an unusual IDH mutation status given their histology, one GBM had 25% (9 out of 36) of the incisions erroneously classified, but all the other tumors (two IDH-mutated gr. IV GBMs and one IDH wild-type gr. III astrocytoma) had all their incisions correctly classified, even though the opposite cluster had multiple histologically similar tumors. This indirectly indicates that the discriminating features in the classification process were actually due to the cellular metabolic changes driven by an IDH mutation. The phospholipid content of tissue has previously been identified as a key distinguishing factor in DMS analysis [18]. The metabolic changes associated with an IDH mutation include aberrations in phospholipid composition [10], which constitutes a plausible theoretical basis for the detection of IDH mutation by DMS.

A potential source of error in DMS analysis is intratumoral heterogeneity. This is especially true in GBMs, which vary in terms of cellular density, nuclear pleomorphism, necrosis, histologic architecture, vasculature, mitoses, and multifaceted microenvironments [21,22]. This can cause variance in tissue impedance and disturb the classifier [23]. An additional confounding factor in our study was 5-ALA, which was used only in the resections of tumors that radiologically appeared malignant. However, all three IDH-mutated GBMs were resected with 5-ALA guidance, and the classifier was still able to classify them correctly.

Our study was limited by a relatively small number of samples that we multiplied into smaller specimens. In order to achieve a setup resembling actual intraoperative use, we only minimally prepared the tumor samples for the analysis. This inevitably caused spatial variance in the specimens that affected the DMS signal strength, thus creating an additional confounding factor to the classifier. This issue could be addressed in future studies by processing the samples into a more homogeneous cell suspension by a centrifuge before the analysis. The suspension could then be pipetted into the well plate to obtain precisely equal sample sizes. We also used frozen samples instead of fresh tumors. In our earlier unpublished experiments, freezing of the samples was not found to affect the classification results. However, this should be verified in peer-reviewed studies in the future.

#### **5. Conclusions**

Our results show that the DMS is able to differentiate IDH-mutated and IDH wild-type tumors with good accuracy in a setting suitable for intraoperative use. The role of molecular alterations in classifying brain tumors and evaluating their prognosis is increasing. Additionally, the degree of survival benefit achieved with a gross-total resection varies even in histologically similar tumors based on their IDH mutation status, which is impossible to identify with conventional frozen section analysis. This makes the DMS a promising novel tool for neurosurgical practice.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/curroncol29050265/s1. The work includes a supplementary file: a detailed description of the data analysis and the classification results achieved with other algorithms. Figure S1: Nested cross-validation.

**Author Contributions:** Conceptualization, I.H., J.H., A.R., A.V. and N.O.; data curation, A.K. (Anton Kondratev), A.K. (Anton Kontunen), M.K. and M.M.; formal analysis, A.K. (Anton Kondratev), A.K. (Anton Kontunen), M.K. and M.M.; funding acquisition, J.H., K.N., A.R. and N.O.; investigation, I.H., A.K. (Anton Kontunen) and M.M.; methodology, I.H., J.H., A.K. (Anton Kontunen), M.K., A.R., N.O. and A.V.; project administration, J.H., K.N., A.R., A.V. and N.O.; resources, H.H., A.R., A.V., N.O., A.L. and P.K.-P.; software, A.K. (Anton Kontunen), M.K., A.R. and N.O.; supervision, J.H., K.N., A.R., A.V., N.O., A.L. and P.K.-P.; validation, A.V., A.R. and N.O.; visualization, I.H., A.K. (Anton Kondratev), A.K. (Anton Kontunen) and M.M.; writing—original draft, I.H. and A.K.; writing—review and editing, all authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** Ilkka Haapala declares funding from The Finnish Medical Foundation, Eka grant, decision number 3535. Anton Kondratev declares funding from Academy of Finland, Programmable Scent Environments project, decision number 323530. Anton Kontunen declares funding from the Doctoral School of Tampere University and Emil Aaltonen Foundation (Grant number 210073). Markus Karjalainen declares funding from The Finnish Cultural Foundation, Pirkanmaa Regional Fund. Kristiina Nordfors declares funding from Finnish Pediatric Research Foundation, Pediatric Cancer Research Foundation Väre and The Finnish Ministry of Social Affairs and Health. Hannu Haapasalo declares funding from Competitive State Research Financing of the Expert Responsibility area of Tampere University Hospital. Niku Oksala declares funding from Competitive State Research Financing of the Expert Responsibility area of Tampere University Hospital (Grant numbers 9s045, 9T044, 9U042, 150618, 9V044, 9X040, 9AA057, 9AB052, and MK301); from Competitive funding to strengthen university research profiles funded by Academy of Finland, decision number 292477; and from Tampere Tuberculosis Foundation. Joonas Haapasalo declares funding from Emil Aaltonen Foundation, Finnish Pediatric Research Foundation, Competitive State Research Financing of the Expert Responsibility area of Tampere University Hospital, Pediatric Cancer Research Foundation Väre and The Finnish Ministry of Social Affairs and Health. The study sponsors did not have any involvement in the study design; collection, analysis, and interpretation of data; the writing of the manuscript; or the decision to submit the manuscript for publication.

**Institutional Review Board Statement:** The study was approved by the Ethics Committee of Tampere University Hospital (ETL R10066) but was not listed in ClinicalTrials Database, because it was a non-randomized, non-interventional study.

**Informed Consent Statement:** Written informed consent has been obtained from the patients.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** A.R., N.O., A.K. (Anton Kontunen) and M.K. are shareholders and employees of Olfactomics Ltd. M.M. is an employee of Olfactomics. The other authors declare no conflict of interest with respect to this work.

#### **References**


### *Article* **Artificial Intelligence Predicted Overall Survival and Classified Mature B-Cell Neoplasms Based on Immuno-Oncology and Immune Checkpoint Panels**

**Joaquim Carreras 1,\*, Giovanna Roncador <sup>2</sup> and Rifat Hamoudi 3,4**


**Simple Summary:** Artificial intelligence (AI) is a field that combines computer science with robust datasets to solve problems. AI in medicine uses machine learning and deep learning to analyze medical data and gain insight into the pathogenesis of diseases. This study summarizes and integrates our previous research and advances the analyses of macrophages. We used artificial neural networks and several types of machine learning to analyze the gene expression and protein levels by immunohistochemistry of several hematological neoplasia and pan-cancer series. As a result, the patients' survival and disease subtype classification were achieved with high accuracy. Additionally, a review of the literature on the latest progress made by AI in the hematopathology field and future perspectives are given.

**Abstract:** Artificial intelligence (AI) can identify actionable oncology biomarkers. This research integrates our previous analyses of non-Hodgkin lymphoma. We used gene expression and immunohistochemical data, focusing on the immune checkpoint, and added a new analysis of macrophages, including 3D rendering. The AI comprised machine learning (C5, Bayesian network, C&R, CHAID, discriminant analysis, KNN, logistic regression, LSVM, Quest, random forest, random trees, SVM, tree-AS, and XGBoost linear and tree) and artificial neural networks (multilayer perceptron and radial basis function). The series included chronic lymphocytic leukemia, mantle cell lymphoma, follicular lymphoma, Burkitt, diffuse large B-cell lymphoma, marginal zone lymphoma, and multiple myeloma, as well as acute myeloid leukemia and pan-cancer series. AI classified lymphoma subtypes and predicted overall survival accurately. Oncogenes and tumor suppressor genes were highlighted (MYC, BCL2, and TP53), along with immune microenvironment markers of tumor-associated macrophages (M2-like TAMs), T-cells and regulatory T lymphocytes (Tregs) (CD68, CD163, MARCO, CSF1R, CSF1, PD-L1/CD274, SIRPA, CD85A/LILRB3, CD47, IL10, TNFRSF14/HVEM, TNFAIP8, IKAROS, STAT3, NFKB, MAPK, PD-1/PDCD1, BTLA, and FOXP3), apoptosis (BCL2, CASP3, CASP8, PARP, and pathway-related MDM2, E2F1, CDK6, MYB, and LMO2), and metabolism (ENO3, GGA3). In conclusion, AI with immuno-oncology markers is a powerful predictive tool. Additionally, a review of recent literature was made.

**Keywords:** non-Hodgkin lymphoma; mature B-cell neoplasms; immune checkpoint; immuno-oncology; immune microenvironment; 3D macrophages; artificial intelligence; machine learning; artificial neural networks; deep learning

**Citation:** Carreras, J.; Roncador, G.; Hamoudi, R. Artificial Intelligence Predicted Overall Survival and Classified Mature B-Cell Neoplasms Based on Immuno-Oncology and Immune Checkpoint Panels. *Cancers* **2022**, *14*, 5318. https://doi.org/10.3390/cancers14215318

Academic Editors: Kentaro Inamura and Sam Payabvash

Received: 23 September 2022 Accepted: 24 October 2022 Published: 28 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Lymphoid neoplasms are tumors of the hematopoietic system derived from immature and mature B lymphocytes, T lymphocytes, and natural killer (NK) cells that evoke the normal stages of cell differentiation. Nevertheless, some neoplasms (such as hairy cell leukemia) show lineage heterogeneity and plasticity, and their normal counterparts cannot be found [1–7]. The 2016 revision of the World Health Organization (WHO) classification of lymphoid neoplasms [3] and the International Consensus Classification (ICC) [6] describe around 45 different subtypes of mature lymphoid neoplasms [3,6,7]. In this research, we analyzed the gene expression of some of the most relevant and frequent ones.

Chronic lymphocytic leukemia/small lymphocytic lymphoma (CLL/SLL) develops from small mature CD5+ and CD23+ B-cells with mutated or unmutated *IGHV* genes [3,8].

Follicular lymphoma (FL) is a neoplasia of the germinal centers of follicles (centrocytes and centroblasts), with a follicular (nodular) pattern, and is frequently associated with the *IGH*/*BCL2* translocation t(14;18)(q32;q21) that occurs in the bone marrow [3,9,10].

Extranodal marginal zone lymphoma of mucosa-associated lymphoid tissue is an extranodal lymphoma (MALT lymphoma) composed of a heterogeneous population of small B-cells [3]. It originates in the marginal zones, but it extends into the interfollicular and follicular regions and infiltrates the epithelium, forming the lymphoepithelial lesions [3,11].

Mantle cell lymphoma (MCL) is characterized by monomorphic small to medium-sized lymphoid cells with irregular nuclei and the *CCND1* translocation, originating from peripheral B lymphocytes of the inner mantle zone, CD5+, and SOX11+ in the classical form [3,12,13].

Diffuse large B-cell lymphoma (DLBCL) is a neoplasm of medium or large B lymphoid cells that originate from the germinal center in the germinal center B-cell-like type, or from the post-germinal center in the activated B-cell-like type [3,14,15]. According to the clinical, morphological, and biological features, DLBCL can be subdivided into different subtypes; the remaining ones are not otherwise specified (NOS).

Burkitt lymphoma is a highly aggressive but curable lymphoma that often appears at extranodal sites or as acute leukemia. It is characterized by a monomorphic proliferation of medium-size B-cells, mitotic figures, and the *MYC* translocation to the immunoglobulin (IG) locus. It originates from the germinal centers. There are three epidemiological variants, with variable association with the Epstein-Barr virus (EBV): endemic, sporadic, and immunodeficiency-associated [3,16–18].

Figure 1 shows the stages of the B-lymphocyte differentiation, and the relationship with the different lymphoma subtypes [19].

In recent years, the field of artificial intelligence (AI) has advanced rapidly, and its role in medicine is gaining relevance. AI integrates computer science and datasets to make predictions or classifications based on input data.

There are two types of artificial intelligence, weak and strong AI. Weak AI, also known as narrow AI (NAI), is trained to perform specific tasks. Conversely, strong AI includes artificial general intelligence (AGI) or artificial super intelligence (ASI), and it is expected to surpass human abilities in the future [20–26].

In this research, we used weak artificial intelligence to predict the prognosis of the patients and to classify several subtypes of mature B-cell neoplasms (output). Gene expression (transcriptomics) and protein immunohistochemical data were used as predictors (input data). The research focused on artificial neural networks (mainly multilayer perceptron), but also used other neural networks such as the radial basis function and other machine learning techniques. Regarding the neural networks, "basic" but robust and reliable architectures were chosen as an elemental part of the analysis. Then, the "basic" networks were combined in more complex, multivariate analysis algorithms. Figure 2 describes the basic structure of the neural network.

**Figure 1.** Postulated cell of origin of the non-Hodgkin lymphoma subtypes. In the current theory of the pathogenesis of hematopoietic and lymphoid tissues, B-cell neoplasms correspond to various stages of B-cell differentiation. For example, follicular lymphoma, Burkitt lymphoma, and diffuse large B-cell lymphoma develop (or have a stage of differentiation) from mature B lymphocytes from the germinal centers of follicles of peripheral lymphoid tissues. Of note, follicular lymphoma is characterized by the *IGH*/*BCL2* translocation t(14;18)(q32;q21) that occurs in the bone marrow. Nevertheless, this genetic alteration is not sufficient to generate lymphoma, and additional cumulative changes are necessary.

The immune checkpoints are regulators of the immune system that belong to the self-tolerance pathways. Without them, the immune system would attack cells indiscriminately. Cancer uses several mechanisms to proliferate, including evading the host immune response using immune checkpoint molecules. There are two types of immune checkpoint molecules: stimulatory and inhibitory. Inhibitory checkpoint molecules inhibit the immune response and include several markers such as B7-H3 (CD276), BTLA, CTLA-4, LAG3, PD-1, TIM-3, and VISTA. Immune checkpoints are important because they are the basis of cancer immunotherapy. Currently approved checkpoint inhibitors target CTLA-4, PD-1, and PD-L1 [19,27–35]. In this research, artificial intelligence was used to classify and to predict the overall survival of different lymphoma subtypes using gene expression data, all the genes of the arrays, and specific panels of the immune checkpoint.

This manuscript integrates our previous publications to provide a general view of the results and adds new analysis on tumor-associated macrophages (TAMs).

**Figure 2.** The basic structure of a neural network. The network is a function of predictors (also called inputs or independent variables) that minimize the prediction error of target variables (outputs). In the case of a multilayer perceptron, it is a feed-forward architecture because the connections flow from the input to the output layer without loops. Here, four genes predict the overall survival of patients. The input layer contains these genes. The hidden layer contains the unobservable nodes (units). The output layer contains the responses; the overall survival is a categorical variable (dead vs alive).

#### **2. Materials and Methods**

#### *2.1. Machine Learning and Neural Networks*

This research integrates all the previous analyses that were obtained using conventional biostatistics, machine learning, and artificial neural networks. Machine learning included Bayesian network, C&R tree, C5 tree, CHAID tree, discriminant analysis, KNN algorithm, logistic regression, LSVM, Quest tree, random forest, random trees, SVM, tree-AS, XGBoost linear, and XGBoost tree. Two types of artificial neural networks were used: the multilayer perceptron and radial basis function. The digital image quantification of markers was performed using the Waikato Environment for Knowledge Analysis (Weka), and the training of the classifier included fast random forest. All the materials and methods were thoroughly described in the previous publications [19,27–35].

#### *2.2. Multilayer Perceptron Artificial Neural Network*

The multilayer perceptron architecture was chosen in most cases. Several parameters were chosen to optimize the neural network. The predictors were included in the input layer, the unobservable nodes or units in the hidden layer, and the responses in the output layer. Scale-dependent variables and covariates were rescaled to improve network training. The method for rescaling of covariates was standardized: subtract the mean and divide by the standard deviation, (*x*−mean)/*s*.
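The standardization step described above can be illustrated with a minimal sketch. It assumes that *s* is the sample standard deviation (n − 1 denominator); the function name `standardize` and the expression values are hypothetical and are not taken from the original analyses.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a covariate as (x - mean) / s."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("a constant covariate cannot be standardized")
    return [(x - m) / s for x in values]


# Hypothetical expression values of one gene across five cases
expression = [2.0, 4.0, 4.0, 6.0, 9.0]
z = standardize(expression)
print([round(v, 3) for v in z])
```

After rescaling, the covariate has a mean of 0 and a standard deviation of 1, which places all predictors on a comparable scale before network training.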

The series of cases was randomly partitioned into training (70%) and testing (30%) datasets. The best performance was found using one hidden layer. The activation function linked the weighted sums of units in a layer to the values of units in the succeeding layer. The hyperbolic tangent was usually used. This function has the form γ(*c*) = tanh(*c*) = (*e<sup>c</sup>* − *e<sup>−c</sup>*)/(*e<sup>c</sup>* + *e<sup>−c</sup>*). It takes real-valued arguments and transforms them into the range (−1, 1). When automatic architecture selection is used, this is the activation function for all units in the hidden layers. The number of units in each hidden layer was determined automatically by an estimation algorithm.
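A random 70%/30% partition of the cases can be sketched as follows. This is only an illustration under stated assumptions: the helper `partition`, the seed value, and the case identifiers are hypothetical, and the statistical software used in the original analyses may assign cases differently (for example, case by case with a given probability).

```python
import random


def partition(cases, train_fraction=0.7, seed=0):
    """Shuffle the cases and split them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical series of 100 case identifiers
train, test = partition(range(100))
print(len(train), len(test))
```

Changing the seed produces a different partition and therefore a different network solution, which is the property exploited by the random number generator-based strategy described in the Results.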

The output layer contained the target (dependent) variables and the activation function was softmax. This function has the form γ(*c<sub>k</sub>*) = exp(*c<sub>k</sub>*)/Σ<sub>*j*</sub> exp(*c<sub>j</sub>*). It takes a vector of real-valued arguments and transforms it into a vector whose elements fall in the range (0, 1) and sum to 1. Softmax is available only if all dependent variables are categorical.
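The feed-forward computation described above (one hidden layer with hyperbolic tangent activation and a softmax output layer) can be sketched in a few lines. The weights, biases, and input values below are arbitrary illustrative numbers, not fitted parameters, and the function names are hypothetical.

```python
import math


def weighted_sums(inputs, weights, biases):
    """c_k = bias_k + sum_j w_jk * a_j for each unit k of a layer."""
    return [b + sum(w * a for w, a in zip(row, inputs))
            for row, b in zip(weights, biases)]


def softmax(c):
    """exp(c_k) / sum_j exp(c_j), shifted by max(c) for numerical stability."""
    shift = max(c)
    exps = [math.exp(v - shift) for v in c]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> tanh hidden layer -> softmax output layer."""
    hidden = [math.tanh(c) for c in weighted_sums(inputs, hidden_w, hidden_b)]
    return softmax(weighted_sums(hidden, out_w, out_b))


# Hypothetical example: 4 standardized gene values, 2 hidden units,
# 2 output categories (dead vs alive)
genes = [0.5, -1.2, 0.3, 2.0]
hidden_w = [[0.1, -0.2, 0.3, 0.05], [-0.4, 0.1, 0.2, 0.3]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]
probs = forward(genes, hidden_w, hidden_b, out_w, out_b)
print(probs)
```

The two output values can be read as pseudo-probabilities of the two categories; the predicted category is the one with the largest value.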

The training type determines how the network processes the records; the training type was batch. The training options were initial lambda (0.0000005), initial sigma (0.00005), interval center (0), and interval offset (+/−0.5). The network performance was assessed by the classification results, receiver operating characteristic (ROC) curve, cumulative gains chart, lift chart, predicted by observed chart, and residual by predicted chart. Using a sensitivity analysis, the independent variables were ranked according to their importance for predicting the dependent variable and in determining the neural network (Figure 3).

$$d_{pm} = \max_{x_{p_1},\, x_{p_2} \in S_p} \left\| \hat{Y}_{p_1}^{(m)} - \hat{Y}_{p_2}^{(m)} \right\|$$

$$d_p = \frac{1}{M} \sum_{m=1}^{M} d_{pm}$$

**Figure 3.** Sensitivity analysis. Using a sensitivity analysis, the independent variables were ranked according to their importance for predicting the dependent variable and in determining the neural network.
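One reading of the sensitivity measure in Figure 3 is the following: for each predictor *p* and each record *m*, the predictor is swept over a set of candidate values while the other predictors are held fixed, the largest change in the network output is recorded, and the result is averaged over the *M* records. The sketch below implements this reading. The candidate grids, the toy `predict` function, and the normalization by the largest importance are assumptions made for illustration.

```python
from math import dist


def importance(predict, records, grids):
    """Return (d, normalized), where d[p] is the mean over records of the
    largest distance between predictions as predictor p sweeps grids[p]."""
    d = []
    for p, grid in enumerate(grids):
        total = 0.0
        for record in records:
            outputs = []
            for value in grid:
                x = list(record)
                x[p] = value  # vary predictor p only
                outputs.append(predict(x))
            total += max(dist(a, b) for a in outputs for b in outputs)
        d.append(total / len(records))
    top = max(d)
    return d, [v / top for v in d]


# Toy model (hypothetical): the output depends on the first predictor only
def predict(x):
    return [2.0 * x[0] + 0.0 * x[1]]


records = [[0.5, 0.5], [0.2, 0.9]]
grids = [[0.0, 1.0], [0.0, 1.0]]
d, normalized = importance(predict, records, grids)
print(d, normalized)
```

In this toy case the first predictor receives all the importance and the second receives none, because the output does not change when the second predictor is varied.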

The general architecture for a multilayer perceptron is as follows [34]:

Input layer: *J*<sub>0</sub> = *P* units,

$$a_{0:1}, \ldots, a_{0:J_0}; \quad a_{0:j} = x_j$$

Hidden layer *i*: *J<sub>i</sub>* units,

$$a_{i:1}, \ldots, a_{i:J_i}; \quad a_{i:k} = \gamma_i(c_{i:k}), \quad c_{i:k} = \sum_{j=0}^{J_{i-1}} w_{i:j,k}\, a_{i-1:j}, \quad \text{where } a_{i-1:0} = 1$$

Output layer: *J<sub>I</sub>* = *R* units,

$$a_{I:1}, \ldots, a_{I:J_I}; \quad a_{I:k} = \gamma_I(c_{I:k}), \quad c_{I:k} = \sum_{j=0}^{J_{I-1}} w_{I:j,k}\, a_{I-1:j}, \quad \text{where } a_{I-1:0} = 1$$

Notation [34]:


#### *2.3. Differential Gene Expression Using the GEO2R Software*

The GEO2R 1.0 software was used for a straightforward comparison of gene expression between subtypes. The Benjamini–Hochberg false discovery rate was applied to adjust the *p* values. Log transformation was applied if necessary. Limma precision weights and force normalization were not applied. The data were visualized using volcano and mean difference (MA) plots, with the significance cut-off set a priori at 0.05. This software runs in R 3.2.3, Biobase 2.30.0, GEOquery 2.40.0, limma 3.26.8. Webpage: https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html (accessed on 23 July 2022).
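The Benjamini–Hochberg adjustment mentioned above can be illustrated with a short sketch. GEO2R performs this step in R; the Python function below is an independent illustration of the standard procedure, and the *p* values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four genes
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```

With these values, three genes have a raw *p* < 0.05, but only the first remains below the 0.05 cut-off after adjustment.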

#### *2.4. Gene Set Enrichment Analysis*

The Gene Set Enrichment Analysis (GSEA) was used to determine if a pathway of interest was associated with a particular biological state (for example, dead vs alive) [36,37]. The pathways were obtained from the Molecular Signatures Database (MSigDB 7.0 and greater) or designed in-house. The software GSEA v4.2.3 was downloaded from the webpage of UC San Diego, Broad Institute: http://www.gsea-msigdb.org/gsea/index.jsp (accessed on 23 July 2022).

#### *2.5. Conventional Statistical Analyses*

Comparisons between groups were performed using crosstabulation with Pearson Chi-Square and Fisher's exact tests, and with nonparametric Mann–Whitney U (2 groups) and Kruskal-Wallis H (≥3 groups) tests. Survival analyses used the Kaplan–Meier and Log-rank tests, and the univariate and multivariate Cox Regression. The criteria of survival and response were the standard [38]. Overall survival was calculated from the time of diagnosis to the last contact with the patient (event recorded as alive vs dead).
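As an illustration of the Kaplan–Meier method named above, the following minimal sketch computes the product-limit survival estimate at each event time. The follow-up times and event indicators are hypothetical, and the function is not the implementation used in the original analyses.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one death.

    events: 1 = dead (event), 0 = alive at last contact (censored).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored cases leave the risk set
    return curve


# Hypothetical follow-up in months for five patients
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

Censored patients contribute to the number at risk until their last contact but do not lower the survival estimate.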

#### *2.6. Risk Groups*

Risk groups were created using the risk score (prognostic index), which was calculated by multiplying the beta coefficients of the Cox model by the gene expression values (Risk score = B1X1 + B2X2 + ... + BpXp, where Xi is the expression value of gene i and Bi is its beta coefficient from the Cox table). In the Cox model, all the genes are included in a single model [39].
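The risk score formula can be sketched as follows. The beta coefficients and expression values are hypothetical, and the use of the median score as the cut-off between high- and low-risk groups is an assumption made for illustration; the text above does not state which cut-off was applied.

```python
from statistics import median


def risk_score(betas, expression):
    """Risk score = B1*X1 + B2*X2 + ... + Bp*Xp for one case."""
    return sum(b * x for b, x in zip(betas, expression))


def risk_groups(betas, cases):
    """Score every case and split at the median score (assumed cut-off)."""
    scores = [risk_score(betas, case) for case in cases]
    cutoff = median(scores)
    return scores, ["high" if s > cutoff else "low" for s in scores]


# Hypothetical Cox beta coefficients for two genes and four cases
betas = [0.5, -0.2]
cases = [[2.0, 1.0], [4.0, 5.0], [1.0, 3.0], [6.0, 0.0]]
scores, groups = risk_groups(betas, cases)
print(scores, groups)
```

A positive beta raises the score as expression increases (poor prognosis), whereas a negative beta lowers it (good prognosis).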

#### *2.7. Hardware*

The analyses were performed on desktop computers equipped with an AMD Ryzen 5 1600 and NVIDIA GeForce GTX 1050 Ti [27], a Ryzen 7 3700X and GeForce GTX 1650 [30,33,34], or a Ryzen 9 5900X and GeForce RTX 3060 Ti [35], all with 16.0 GB of RAM.

Appendix A describes all the software that was used to perform the biostatistical analyses, including machine learning and artificial neural networks [19,27–35].

#### *2.8. Datasets and Immunohistochemical Procedures*

We used publicly available datasets downloaded from the Gene Expression Omnibus (GEO) repository, webpage: https://www.ncbi.nlm.nih.gov/geo/ (accessed on 23 July 2022) (Appendix B Table A1) [40–57], and our own Tokai University Hospital gene expression (transcriptomic) and immunohistochemical (proteomic) datasets for this research.

Several of the markers that were highlighted in the AI analyses (both machine learning and artificial neural network) were validated by immunohistochemistry at the protein level. The cases were selected from the lymphoma series of Tokai University Hospital. The series of cases ranged from 100 to 293 cases, depending on the project. Immunohistochemistry was performed using a Leica Bond Max autostainer following the manufacturer's instructions (Leica K.K., Tokyo, Japan). Table 1 details the primary antibodies that were used. The review section was made on the basis of PRISMA guidelines: https://prisma-statement.org/ (accessed on 29 September 2022), Carreras, J. (20 October 2022). Systematic review. https://doi.org/10.17605/OSF.IO/436JQ. The manuscripts were selected in PubMed using the keywords "lymphoma" and "artificial intelligence", and were organized according to the type of input data as PET/CT scan, histological images, immunophenotype, clinicopathological variables, and gene expression, mutational, and integrative analysis-based artificial intelligence.


**Table 1.** Immunohistochemical markers used in lymphoma cases of Tokai University, School of Medicine.

CNIO, Centro Nacional de Investigaciones Oncológicas (Spanish National Cancer Research Center).

#### **3. Results**

The different subtypes of hematological neoplasia (mainly non-Hodgkin lymphomas) were predicted using artificial neural networks, machine learning, and conventional biostatistics. The analysis used transcriptomic data and protein levels assessed by immunohistochemistry. The results are summarized as a bulleted list.

#### *3.1. Predictive Classification of Non-Hodgkin Lymphomas*


**Figure 4.** Prediction of lymphoma subtype by a neural network with high accuracy. (**A**) A multilayer perceptron predicted the different non-Hodgkin lymphoma subtypes, including follicular lymphoma, mantle cell lymphoma, diffuse large B-cell lymphoma, Burkitt's lymphoma, and marginal zone lymphoma. The predictors (inputs) were the gene expression values of a pan-cancer transcriptome panel. The architecture of the network had 1769 nodes in the input layer, a hidden layer of 16 nodes, and an output layer with 5 nodes (5 lymphoma subtypes). In this figure, the top 20 most relevant genes for predicting the lymphoma subtype are shown, based on their average normalized importance for prediction. The most relevant gene was *ARG1*, followed by *MAGEA3*, *AKT2*, and *IL1B*. (**B**) This multilayer perceptron had a high performance, as shown in the ROC curve that had an area under the curve near 1. (**C**–**F**) Interestingly, the top 30 genes of the neural network not only predicted the lymphoma subtype but also managed to predict the overall survival of a large pan-cancer series from the TCGA of 7441 cases. Using a risk score formula, the cases of each series were stratified into high- and low-risk groups. The risk scores were calculated by multiplying the beta values of the Cox regression by the gene expression values for each gene. The overall survival was calculated using the Kaplan–Meier and log-rank test and Cox regression analyses. These top 30 genes belonged to a pan-cancer transcriptome panel. Therefore, this may explain why they have predictive value in a pan-cancer series, and points out that there may be common cancer mechanisms in all human neoplasia.


**Figure 5.** Prediction of the overall survival of follicular lymphoma using an algorithm based on neural networks. The algorithm combined multilayer perceptron (MLP), radial basis function (RBF), and Cox regression to highlight 43 genes with prognostic relevance; finally, a correlation with immuno-oncology genes was also performed. This figure shows the algorithm (method) that was used to analyze the gene expression data of follicular lymphoma using artificial neural networks. From an initial set of 22,215 genes, a strategy of dimensionality reduction highlighted 43 genes, of which 18 were associated with poor and 25 with good overall survival of the patients. The first step consisted of several independent artificial neural networks. The network architecture included the 22,215 genes as predictors (inputs), a hidden layer, and an output layer with the predicted variable. The predicted variables were the overall survival of the patients (outcome dead vs alive), and other relevant clinicopathological variables of follicular lymphoma. The result of the neural network ranked all the genes according to their normalized importance for predicting the target variable. The results of the independent multiple neural networks were pooled resulting in 1005 genes, and the most relevant ones were highlighted using univariate and multivariate Cox regression analyses. The relevance of these genes was confirmed using gene set enrichment analysis (GSEA). Finally, these genes were also correlated with several immuno-oncology genes. The 43 genes were the following: 18 were associated with a poor prognosis (*FRYL, KIAA0100, CDC40, MED8, PTP4A2, BNIP2, TMEM70, MED6, SLC24A2, KLK10, RANBP9, PRB1, EVA1B, CBFA2T2, ALDH1L1, KRT19, BTN2A3P,* and *TRPM4*) and 25 were associated with a good prognosis of the patients (*HSF2, ATPAF2, SLC7A11, PTAFR, TTLL3, TCP10L, DNAAF1, PRH1, NSDHL, TAF12, TSPAN3, AKIRIN1, ITK, TDRD12, LPP, BTD, SIRT5, ZNF230, ABHD6, TOP2B, ARPC2, ASAP2, IDH3A, PSMF1,* and *ARFGEF1*) (Supplementary Tables S1–S5). LDH, lactate dehydrogenase; IPI, international prognostic index; IR ratio, immune response ratio; 5-y, five years; MLP, multilayer perceptron; RBF, radial basis function.

**Figure 6.** Prediction of the overall survival of follicular lymphoma using an algorithm based on neural networks. This figure shows the GSEA results of Figure 5 in detail. Gene set enrichment analysis (GSEA) was performed to confirm the results of the multivariate Cox regression for the overall survival analysis. The set of 43 genes was used in addition to genes of the immune response as well as oncogenes and tumor suppressor genes related to the pathogenesis of follicular lymphoma. Of note, genes related to macrophages were highlighted, such as *CD163*. NOM p–val, nominal *p* value (the nominal *p* value estimates the statistical significance of the enrichment score for a single gene set); FDR q–val, false discovery rate.

• Tridimensional (3D) analysis of tumor-associated macrophages (TAMs) showed that the progression of follicular lymphoma and its transformation to diffuse large B-cell lymphoma were associated with increased numbers of TAMs, which created a network-like structure (Figure 7).

**Figure 7.** Tridimensional analysis of tumor-associated macrophages (TAMs) in follicular lymphoma. The analysis of M2-like TAMs in follicular lymphoma showed that the progression from low grade to high grade, and the transformation to diffuse large B-cell lymphoma, were associated with increased numbers of TAMs, which created a physical network-like structure. This result points out that TAMs may contribute to the disease pathogenesis. In this figure, the macrophages are highlighted in pale blue (right) and green (left). B and T lymphocytes are in dark blue and red. The images were obtained using a LSM 700 laser scanning confocal microscope from Carl Zeiss (Carl-Zeiss-Strasse 22, 73447 Oberkochen, Germany), and Imaris software (version 8.4, Oxford Instruments, Belfast, United Kingdom). FL, follicular lymphoma; DLBCL, diffuse large B-cell lymphoma.

#### *3.3. Follicular Lymphoma, Random Number Generator-Based Strategy*

• The random number generation created 120 independent multilayer perceptron solutions and 22,215 gene probes were ranked according to their averaged normalized importance for predicting the overall survival [35].


**Figure 8.** Prediction of the overall survival of follicular lymphoma taking advantage of the random number generator. (**A**) By using the random number generator, 120 independent and different neural network solutions were calculated, and the averaged normalized importance of each gene for predicting the overall survival was recorded. Then, the minimal number of genes of a neural network with sufficient performance was selected, and a final neural network with 17 genes was defined. (**B**) This neural network (multilayer perceptron type) included 17 genes in the input layer, a hidden layer of 7 nodes, and an output layer of 2 nodes (overall survival, dead vs alive). (**C**) A new neural network was created with the highlighted 17 genes and known immuno-oncology genes. The resulting model had an acceptable accuracy, with an area under the curve (AUC) of 0.89. The predictors (inputs) were ranked according to their normalized importance in predicting the overall survival.
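The averaging step of this strategy can be sketched as follows. Each run is assumed to return the normalized importance of every gene; training the networks themselves is outside the scope of the sketch. The gene labels, the importance values, and the use of three runs instead of 120 are hypothetical.

```python
def average_importance(runs):
    """Average the normalized importance of each gene over independent runs
    and return (gene, mean importance) pairs, most important first."""
    genes = runs[0].keys()
    averaged = {g: sum(run[g] for run in runs) / len(runs) for g in genes}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized importances from three network solutions,
# each obtained with a different random seed
runs = [
    {"gene_A": 1.00, "gene_B": 0.40, "gene_C": 0.10},
    {"gene_A": 0.80, "gene_B": 1.00, "gene_C": 0.20},
    {"gene_A": 1.00, "gene_B": 0.70, "gene_C": 0.30},
]
ranking = average_importance(runs)
print(ranking)
```

Averaging over many solutions reduces the dependence of the gene ranking on any single random initialization or partition of the cases.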

#### *3.4. Mantle Cell Lymphoma, Use of Immuno-Oncology Panels to Predict Survival*

• The analysis algorithm combined several techniques, including neural networks (both multilayer perceptron and radial basis function), GSEA, and conventional statistics. In this analysis, 20,862 genes were correlated with 28 prognostic genes of mantle cell lymphoma. After dimensionality reduction, the patients' overall survival was predicted, and new markers were highlighted (Figure 9) [34].



**Figure 9.** Prediction of the overall survival of mantle cell lymphoma using an algorithm based on neural networks. Two methods (**A** and **B** algorithms) were designed. Method 1 used 20,862 genes as input to predict the overall survival outcome (dead vs alive) and other prognostic markers; after dimensionality reduction, a final set of 19 genes was highlighted. The analysis also included testing the final 19 genes with other machine learning analyses, and conventional overall survival analysis with the log-rank test. Method 2 used several gene panels as input to predict the overall survival. As a result, 125 pan-cancer and immuno-oncology genes were highlighted. The association with the patients' overall survival was confirmed by GSEA and conventional overall survival analysis with the log-rank test. OS, overall survival; MLP, multilayer perceptron; RBF, radial basis function; GSEA, gene set enrichment analysis; D/A, dead/alive; AUC, area under the curve; NI, normalized importance.


#### *3.5. Diffuse Large B-Cell Lymphoma, Identification of the 25 Genes Set*


**Figure 10.** A neural network predicted the overall survival of diffuse large B-cell lymphoma using gene expression data. (**A**) A multilayer perceptron predicted the overall survival and highlighted the most important 25 genes. (**B**) Using a risk score formula and the gene expression of the 25 genes, two groups of patients with different overall survival were found; this figure shows the different gene expression of the 25 genes between the two risk groups. (**C**) The two risk groups had different overall survival. (**D**) Among the 25 genes, *ENO3*, *MYC*, and *BCL2* were the most important, and only with these 3 genes the survival of the patients could be determined.

**Figure 11.** Immunohistochemical staining of ENO3, MYC, and BCL2 in diffuse large B-cell lymphoma. This figure shows six different lymphoma cases, with high or low expression of the 3 markers. Original magnification: 400× (scale bar = 50 µm).

#### *3.6. Diffuse Large B-Cell Lymphoma, Prognostic Value of the 25 Genes in Hematological Neoplasia, and TNFAIP8 Validation*



**Figure 12.** A set of 25 genes derived from a neural network predicted the overall survival of several lymphoma subtypes and acute myeloid leukemia, and high protein expression of TNFAIP8 correlated with poor survival of diffuse large B-cell lymphoma patients. (**A**) Using the gene expression values of 25 genes, previously identified using artificial neural networks, and a risk score formula, it was possible to predict the overall survival of several hematological neoplasia (lymphomas and acute myeloid leukemia). All Kaplan–Meier analyses with log-rank tests were statistically significant (*p* < 0.001). (**B**) Although all 25 genes were relevant, the strength and direction of the association were different in each subtype of hematological neoplasia. For example, *TNFAIP8* was more relevant for the overall survival of diffuse large B-cell lymphoma and chronic lymphocytic leukemia, but less relevant for acute myeloid leukemia and multiple myeloma. Nevertheless, *TNFAIP8* contributed to the survival of all these hematological neoplasia. (**C**) High TNFAIP8 protein expression, evaluated by immunohistochemistry using both conventional digital image analysis and AI-based methods, correlated with poor overall survival of diffuse large B-cell lymphoma patients. This figure shows two cases of diffuse large B-cell lymphoma. The case at the top expresses low TNFAIP8. On the left, the hematoxylin (dark blue) and DAB-based (brown) immunohistochemical image is shown. As shown in the inset, the TNFAIP8 staining was cytoplasmic. On the right, the AI-based digital image analysis is shown for the same case and area. TNFAIP8 is highlighted in red, cellular structures (B lymphocytes of the lymphoma, T lymphocytes, and macrophages) in pink, and intercellular tissue in green. The case at the bottom is characterized by high TNFAIP8 expression. After staining procedures, the immunohistochemical slides were digitalized and visualized (NanoZoomer S360 scanner and NDP.view2 viewing software, Hamamatsu KK.). Original magnification: 200×. High TNFAIP8 correlated with age > 60 years, high serum IL2RA, non-GCB phenotype, and high infiltration of CD163+ M2-like tumor-associated macrophages (CD163+TAMs). TNFAIP8 also moderately correlated with MYC (Spearman's correlation coefficient 0.389, *p* = 0.009) and Ki67 (proliferation index; Spearman's correlation coefficient 0.48, *p* = 0.001). High TNFAIP8 was also associated (trend) with worse progression-free survival (*p* = 0.052). Finally, a multivariate Cox analysis between TNFAIP8 (high vs low) and the international prognostic index (IPI) (low + low/intermediate vs high/intermediate + high) showed that only TNFAIP8 retained the prognostic value (HR = 3.5, *p* = 0.040). CLL, chronic lymphocytic leukemia; DLBCL, diffuse large B-cell lymphoma; FL, follicular lymphoma; MM, multiple myeloma; MCL, mantle cell lymphoma; AML, acute myeloid leukemia.

#### *3.7. Diffuse Large B-Cell Lymphoma, Prediction of Survival by Caspase-8*


**Figure 13.** High caspase-8 correlated with favorable survival of diffuse large B-cell lymphoma patients. The protein levels of caspase-8 (*CASP8*) were evaluated by immunohistochemistry, and later correlated with the survival of the patients. Two types of immunohistochemical staining were observed, low and high. In diffuse large B-cell lymphoma, high caspase-8 expression is associated with a favorable overall survival (*p* = 0.005). Additionally, other markers of the caspase-8 pathway, including caspase-3, cleaved PARP, BCL2, TP53, MDM2, MYC, Ki67, E2F1, CDK6, MYB, LMO2, and TNFAIP8, were evaluated by immunohistochemistry and quantified using digital image analysis. Caspase-8 was successfully predicted by the pathway markers, both using conventional statistics and several machine learning techniques and artificial neural networks. Of note, after staining procedures, the immunohistochemical slides were digitalized and visualized (NanoZoomer S360 scanner and NDP.view2 viewing software, Hamamatsu KK.). Original magnification: 400× (scale bar = 50 µm). OS, overall survival; ROC curve, the receiver operating characteristic curve.

**Figure 14.** High caspase-8 correlated with favorable survival of diffuse large B-cell lymphoma patients. This figure shows the immunohistochemical expression of active subunit p18 casp-8 (CASP8), which correlated with good prognosis of the patients when high. Other related markers, as shown in the protein–protein interaction analysis, were also analyzed by immunohistochemistry. After staining procedures, the immunohistochemical slides were digitalized and visualized (NanoZoomer S360 scanner and NDP.view2 viewing software, Hamamatsu KK.). All the markers were quantified using digital image analysis. This figure shows examples of low and high expressions for each marker. Original magnification: 400× (scale bar = 50 µm).

#### *3.8. Diffuse Large B-Cell Lymphoma, CD274 (PD-L1) and IKAROS*


**Figure 15.** An algorithm that included artificial neural networks and machine learning predicted the survival of diffuse large B-cell lymphoma, and highlighted *PD-L1* and *IKAROS* as prognostic markers. (**A**) Algorithm: This algorithm is similar to the one used for follicular lymphoma and mantle cell lymphoma. The basic structure of the analysis is an artificial neural network (multilayer perceptron). In this analysis, 54,613 gene probes were used as predictors for the overall survival, but also for other relevant clinicopathological variables. The basic neural network was composed of the input layer (predictors, 54,613 gene probes), a hidden layer (automatically computed), and an output layer (predicted variable; for example, the overall survival outcome as a dichotomic variable dead vs alive, or the cell of origin classification (GCB vs ABC), etc.). The dimensionality reduction included additional steps of machine learning, Cox regression, and GSEA. (**B**) Digital image quantification using AI-based strategy for PD-L1 (CD274) and IKAROS. (**C**) High protein expression of PD-L1 correlated with poor survival of the patients. Conversely, high IKAROS was associated with favorable survival. (**D**) AI-based quantification correlated well with conventional digital image quantification. Therefore, both techniques provide comparable results. (**E**) Modeling of the overall survival using a Bayesian network. The Bayesian network builds a probability model, a graphical model that shows variables (nodes) of the dataset, and the probabilistic (conditional) independences between them. The links of the network are called arcs and represent the relationship between the variables, but do not necessarily mean cause and effect. Original magnification: 200×. OS, overall survival; NCCN IPI, National Comprehensive Cancer Network International Prognostic Index; ECOG PS, Eastern Cooperative Oncology Group Performance Status; LDH, lactate dehydrogenase; R-CHOP, rituximab, cyclophosphamide, doxorubicin hydrochloride, vincristine, and prednisolone; AI, artificial intelligence.


**Figure 16.** Role of CSF1R in the prognosis of diffuse large B-cell lymphoma. CSF1R was analyzed by immunohistochemistry in a series of 198 cases, and two histological patterns were found. A CSF1R-positive B-cell pattern was characterized by favorable progression-free survival; this pattern was less frequent (around 30% of the cases). Conversely, the most frequent pattern was of CSF1R-positive tumor-associated macrophages (TAMs) and was associated with an unfavorable outcome. Additionally, the immunohistochemical expression of CSF1R was predicted from other CSF1R-related markers using neural networks. The CSF1R-related markers were CSF1, STAT3, NFKB, MYC, and Ki67. All markers were quantified using digital image analysis. Of note, the multilayer perceptron network analyses were performed to predict both the TAM and the B-cell patterns. Our data suggested that a CSF1R inhibitor such as pexidartinib could be used in the CSF1R + TAM pattern. CSF1R, macrophage colony-stimulating factor 1 receptor; DLBCL, diffuse large B-cell lymphoma; TAM, tumor-associated macrophage; PFS, progression-free survival.


**Figure 17.** Correlation between expression levels of CSF1R and SIRPA/CD47 in diffuse large B-cell lymphoma. The immunohistochemical pattern of CSF1R-positive tumor-associated macrophages (TAMs) suggested a relationship with other markers such as SIRPA. SIRPA is a relevant immune checkpoint marker that mediates negative regulation of phagocytosis. The histological pattern of SIRPA was of TAMs, similar to PD-L1, CD85A, and MARCO. A ligand for SIRPA is CD47. In our series, the histological pattern of CD47 was of B lymphocytes of the diffuse large B-cell lymphoma.


**Figure 18.** Gene expression analysis of *CD47* and *SIRPA* in the diffuse large B-cell lymphoma. In the series of the Lymphoma/Leukemia Molecular Profiling Project (LLMPP), when analyzing only the cases with R-CHOP-like treatment, high *CD47* but low *SIRPA* correlated with poor overall survival of the patients, and *SIRPA* positively correlated with *CSF1R*. CD47 is a ligand for SIRPA (SIRPα), a protein expressed by macrophages and dendritic cells. These two markers belong to the immune checkpoint pathway, and mediate a negative regulation of phagocytosis. R-CHOP, rituximab, cyclophosphamide, doxorubicin hydrochloride, vincristine, and prednisolone; LLMPP, Lymphoma/Leukemia Molecular Profiling Project; OS, overall survival.

#### *3.10. Diffuse Large B-Cell Lymphoma, Pan-Cancer Immuno-Oncology Panel*


**Figure 19.** An artificial neural network predicted the overall survival of the diffuse large B-cell lymphoma patients, and the cell of origin subtype using a pan-cancer immuno-oncology gene expression panel. The analysis consisted of the multilayer perceptron. The cell of origin characterization was assessed with the NanoString Lymph2Cx assay. The performance of the network was high, 0.89 for overall survival and 0.99 for the cell of origin phenotype. GSEA analysis confirmed enrichment toward the survival outcome of the dead and the cell of origin subtype of activated (ABC) + unspecified. Using a risk score formula, with 7 genes it was possible to predict the survival of diffuse large B-cell lymphoma. The association of phospho-MAPK with the germinal center B-cell (GCB) phenotype was also noted and confirmed by immunohistochemistry. GSEA, gene set enrichment analysis. ABC, activated B-cell type; GCB, germinal center B-cell type.

#### *3.11. Diffuse Large B-Cell Lymphoma, Integrative Analysis of Macrophage Markers*

Gene expression profiling of 233 DLBCL patients treated with chemotherapy plus Rituximab was obtained from the series GSE10846, present in the NCBI Gene Expression Omnibus database. The prognostic value for overall survival of the gene expression of *CD163* was first tested and 100 representative cases were selected, which contained high-risk (i.e., high *CD163*) and low-risk cases (i.e., low *CD163*) (Figure 20).

**Figure 20.** Analysis of macrophages in diffuse large B-cell lymphoma. The overall survival of diffuse large B-cell lymphoma was assessed based on the expression of *CD163*, which is an M2-like macrophage marker. High expression was associated with a poor prognosis of the patients. Then, a protein–protein functional network association analysis was performed using the macrophage markers of CD68 (pan-macrophages), CD16 (M1-like macrophages), CD163 (M2-like), PTX3 (M2c-like), and MITF (M2-like), and the regulatory T lymphocytes (Tregs) marker of FOXP3. The network created a macrophage pathway that was subsequently applied to a gene set enrichment analysis (GSEA). The GSEA confirmed the association of the macrophage pathway with the high-risk group, which was characterized by poor overall survival and high CD163-positive macrophages.

A functional protein association network was created using five macrophage markers and one regulatory T lymphocyte (Treg) marker, CD68, CD16, CD163, PTX3, MITF, and FOXP3, as the initial nodes (identifiers). Then, the resulting network (i.e., pathway), which contained 57 markers, was tested by GSEA in the GSE10846 series of gene expression of diffuse large B-cell lymphoma. We identified the most relevant pathological markers (i.e., genes) associated with the prognosis of the patients by comparing high-risk (bad prognosis, high *CD163* expression) vs low-risk (good prognosis, low *CD163*) cases. We found that this pathway was enriched in the high-risk phenotype with a NOM p-val < 0.001 and FDR q-val < 0.001. In the enrichment results, we could identify the markers *CD163* (2nd in the list with a rank metric score of 0.515), *CD16* (FCGR3B, 4th), *CD68* (10th), *PTX3* (15th), and *MITF* (23rd). Of note, *FOXP3* was outside the enrichment set of genes, so it was not associated with the high-risk group. Importantly, *IL10* was identified at the fifth position. GSEA with markers belonging to the immune regulatory M2c-like TAM pathway was also tested with similar results (Figure 20).
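The running-sum statistic behind a GSEA enrichment score can be sketched as follows. This is a minimal illustration of the general technique, not the Broad Institute implementation used in this study: the function name `enrichment_score`, the weighting exponent `p = 1`, and the toy ranked list (including the placeholder gene names and all rank metric values) are assumptions made for the example.

```python
def enrichment_score(ranked_genes, rank_metric, gene_set, p=1.0):
    """GSEA-style running-sum score for a gene set over a ranked gene list.

    ranked_genes: gene names sorted from most to least associated with the phenotype
    rank_metric:  the ranking metric of each gene, in the same order
    gene_set:     the pathway (set of gene names) to test
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = in_set.count(False)
    # Total weight of the hits, used to normalize the upward steps.
    hit_weight = sum(abs(r) ** p for r, hit in zip(rank_metric, in_set) if hit)
    if n_miss == 0 or hit_weight == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for r, hit in zip(rank_metric, in_set):
        if hit:
            running += abs(r) ** p / hit_weight  # step up at a pathway gene
        else:
            running -= 1.0 / n_miss              # step down at any other gene
        if abs(running) > abs(best):
            best = running                       # keep the maximum deviation from zero
    return best


# Hypothetical ranked list: pathway genes near the top give a positive score.
genes = ["GENE_A", "CD163", "GENE_B", "FCGR3B", "IL10", "GENE_C", "GENE_D", "GENE_E"]
metric = [0.60, 0.50, 0.45, 0.40, 0.35, -0.10, -0.30, -0.50]
print(enrichment_score(genes, metric, {"CD163", "FCGR3B", "IL10"}))
```

A positive score indicates that the pathway genes cluster toward the top of the ranked list (here, the high-risk phenotype); significance values such as NOM p-val and FDR q-val come from permutation testing, which this sketch omits.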

The macrophage markers were analyzed at protein level by immunohistochemistry in the series of Tokai University (*n* = 132) (Figure 21). The distribution of the markers in the normal reactive tonsil was also evaluated.

**Figure 21.** Immunohistochemical staining of macrophage markers and regulatory T lymphocytes (Tregs) in diffuse large B-cell lymphoma. The expression of macrophage markers and Tregs was evaluated using immunohistochemical procedures. The staining confirmed that when macrophages are present at a high concentration in the tissues, their shape is more elongated and dendriform-like. CD68 is a pan-macrophage marker, CD16 is macrophage polarization M1-like, and CD163, PTX3, and MITF are M2-like. FOXP3 is a specific marker of Tregs. Original magnification: 400×.

The histological analysis in reactive tonsil, a secondary lymphoid organ, showed a different distribution of the different markers. CD68-positive and MITF-positive macrophages were widely distributed in all areas. CD16-positive cells were scarce and only identified in the lympho-epithelium, the epithelial barrier. CD163-positive macrophages were mainly present in the interfollicular regions and infrequently in the germinal centers of the follicles. PTX3-positive cells were of macrophage morphology in all areas, and in the germinal centers PTX3-positive cells also had a morphology of B lymphocytes (mainly centroblasts). IL10-positive macrophages were scarce but present in all areas. Double IHC showed mutually exclusive distribution between CD163 and CD16 and partially exclusive with MITF.

The multilayer perceptron (MLP) procedure was performed to produce a predictive model for one target variable, using the values of several predictors. The target was the dead or alive variable for overall survival. The predictors were the same categorical variables used in the Cox multivariate analysis: CD163, PTX3 Total, MITF, FOXP3, and IL10. The normalized importance of the independent variables was as follows: PTX3 Total (100%), IL10 (95.9%), FOXP3 (48.9%), MITF (35.8%), and CD163 (6.3%) (Figure 22). This result is compatible with the Cox analysis. The same procedure was performed to predict the Hans classifier, and the normalized importance was IL10 (100%), PTX3 Total (67.1%), FOXP3 (44.8%), CD163 (39.8%), and MITF (32.8%) (Figure 22).
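Normalized importance is obtained by dividing each predictor's importance by the largest importance and expressing the result as a percentage, which is why the top predictor always reads 100%. The short sketch below illustrates this rescaling only; the raw importance values are hypothetical and the function name is our own.

```python
def normalized_importance(importance):
    """Rescale raw importances so that the largest one equals 100%."""
    top = max(importance.values())
    return {name: 100.0 * value / top for name, value in importance.items()}


# Hypothetical raw importances (not the values obtained in this study).
raw = {"PTX3 Total": 0.35, "IL10": 0.33, "FOXP3": 0.17, "MITF": 0.12, "CD163": 0.03}
for name, pct in normalized_importance(raw).items():
    print(f"{name}: {pct:.1f}%")
```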

Additional analysis consisted of validating the macrophage markers in an independent series of cases of diffuse large B-cell lymphoma from the Lymphoma/Leukemia Molecular Profiling Project (LLMPP), the GSE10846 dataset (webpage: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10846, accessed on 21 September 2022). Only the cases treated with R-CHOP-like therapy were selected (*n* = 233). Several machine learning techniques and artificial neural networks (multilayer perceptron) were used. The dependent (target) variable was the overall survival (outcome dead vs alive). As predictors, the macrophage genes of *CD163, CSF1R, PTX3, CD274 (PD-L1)*, and *IL10* were used. Additional immuno-oncology predictors were markers previously highlighted in the analyses, including *MYC, BCL2, TP53, FOXP3, CSF1, IL34, PDCD1 (PD-1), TNFRSF14, TNFAIP8, IKZF1, STAT3, NFKB1, MYD88, RELA, CASP8, CASP3, PARP1, MKI67, ENO3*, and *GGA3*. In total, 25 genes were analyzed and the overall survival was successfully predicted. Table 2 shows the machine learning and neural network models, the number of predictors used in the models, and the overall accuracy. Figure 16 shows the most relevant models and the most relevant genes. The models confirmed the importance of the immuno-oncology markers (Figure 23).


**Table 2.** Machine learning and artificial neural network analysis using gene expression data.

**Figure 22.** Prediction of the overall survival of diffuse large B-cell lymphoma by M2c-like macrophages using an artificial neural network. The overall survival of the patients was predicted by an artificial neural network from the histochemical data of the tissue samples. The network confirmed that the most relevant markers were PTX3 and IL10, which characterized the immune regulatory M2c-like macrophages. A conventional survival analysis using the Kaplan–Meier method with log-rank test confirmed the association of high M2c-like macrophages with poor overall and progression-free survival of the patients. Original magnification: 400×.

**Figure 23.** Prediction of the overall survival of diffuse large B-cell lymphoma using immune checkpoint and immuno-oncology markers. Using gene expression data of the GSE10846 dataset, the association of markers of immune regulatory M2c-like tumor-associated macrophages and other immune checkpoint markers was assessed. The methodology included several machine learning techniques and artificial neural networks. The overall accuracy of each method is shown in Table 2.

Using the random forest, the markers were ranked according to their significance for predicting the patients' overall survival. The random forest uses a tree model and a bagging method.
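The combination of a tree model with bagging can be illustrated with a deliberately small sketch. It is not the IBM SPSS Modeler implementation used here: each "tree" is reduced to a one-split stump, the function names are our own, the data are hypothetical, and the ranking of predictor importance is not reproduced.

```python
import random
from collections import Counter


def fit_stump(rows, labels, n_features_tried, rng):
    """Best single-threshold split among a random subset of features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in rng.sample(range(len(rows[0])), n_features_tried):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split in this sample: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features_tried=1, seed=0):
    """Bagging: every stump is trained on a bootstrap resample of the cases."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample cases with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx],
                                n_features_tried, rng))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= thr else right for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical expression values for two genes in six patients.
rows = [[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.8, 3.0], [0.9, 3.2], [1.0, 2.9]]
labels = ["alive", "alive", "alive", "dead", "dead", "dead"]
forest = fit_forest(rows, labels, n_features_tried=2)
print(predict(forest, [0.05, 0.5]), predict(forest, [1.5, 4.0]))
```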

The Bayesian network is a graphical model that shows variables (nodes) in a dataset and the probabilistic, or conditional, independences between them. It constructs a probability model by combining observed and recorded evidence. The network's links (arcs) do not always depict cause and effect.
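In standard notation, which is not taken from the software output, the probability model of a Bayesian network over the variables $X_1, \dots, X_n$ factorizes into one conditional distribution per node, where $\mathrm{Pa}(X_i)$ denotes the parents of node $X_i$ (the nodes with an arc pointing to it):

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)$$

A node without parents contributes its marginal probability $P(X_i)$. Missing arcs therefore encode conditional independences, which is consistent with the arcs describing statistical dependence and not necessarily cause and effect.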

The LSVM method classifies data using a linear support vector machine. LSVM is especially suitable for large datasets or datasets with numerous predictor fields. In this LSVM analysis, the predictors were ranked in order of relevance.

Nearest Neighbor Analysis classifies cases based on their resemblance to other cases; this chart is a lower-dimensional projection of the predictor space, which contains 25 predictors (genes).
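The idea of classifying a case by its resemblance to other cases can be sketched with a minimal k-nearest neighbors classifier. The choice of Euclidean distance and k = 3, the function name, and the toy data are assumptions made for the example, not the settings used in this analysis.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Assign the majority label among the k training cases closest to the query."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles with known outcome.
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
outcome = ["alive", "alive", "alive", "dead", "dead", "dead"]
print(knn_predict(train, outcome, [0.2, 0.2]))
```

With 25 predictors the distance is computed in a 25-dimensional space in the same way; only the plotted chart is reduced to fewer dimensions.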

#### **4. Discussion**

Artificial intelligence (AI) is a recently developed field that integrates computer science with datasets to perform calculations. In medicine, both machine learning and deep learning analyze medical data and provide insights into diseases. Artificial intelligence has many applications, including diagnosis, disease classification, and image analysis [20–24].

Machine learning is a specialty in artificial intelligence. By using statistics, algorithms are trained to make classifications or predictions [20–23]. An algorithm of machine learning is composed of three parts:


There are three categories of machine learning models:


A linear regression algorithm is used to predict numerical values based on a linear relationship between predictors. Logistic regression is a type of supervised learning that predicts a categorical variable (binary). The clustering analysis uses unsupervised learning and identifies patterns to group the cases. Decision trees can be used to predict numerical values or to classify the data into categories; they use a branching sequence of linked decisions that are represented in a tree diagram. Random forests predict a value or category by combining the results of decision trees [20]. Artificial neural networks (ANNs) are algorithms that, in essence, mimic the human brain. Many data mining applications use neural networks because they are flexible and powerful for complex processes [25].

A neural network is composed of an input layer, one or more hidden layers (multiple hidden layers in a deep neural network), and an output layer. Most neural networks are feed-forward, which means that the flow moves in one direction from the input to the output [20–24]. The term "deep" refers to the number of layers (inclusive of input, hidden, and output layers); a network with more than three layers can be considered a deep learning algorithm [21]. The multilayer perceptron (MLP) and radial basis function (RBF) are used in predictive applications, and are supervised because the results can be compared with the known values of the target variables [20–26]. The input layer contains the predictors (for example, the genes). The hidden layer contains unobservable nodes (units). The value of each hidden unit is some function of the predictors. The output layer contains the responses (Figure 2).
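The feed-forward flow from predictors to hidden units to responses can be written out in a few lines. The sketch below is illustrative only: the hyperbolic tangent hidden activation and softmax output are assumed choices, the weights are arbitrary numbers and not a trained model, and the function name is our own.

```python
import math


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input layer -> one hidden layer -> output probabilities."""
    # Each hidden unit is a function (here tanh) of a weighted sum of the predictors.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Each output unit is a weighted sum of the hidden units.
    scores = [
        sum(w * h for w, h in zip(weights, hidden)) + bias
        for weights, bias in zip(w_out, b_out)
    ]
    # Softmax turns the scores into class probabilities that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three predictors (e.g., gene expression values), two hidden units, two classes.
probabilities = forward(
    x=[0.8, -1.2, 0.3],
    w_hidden=[[0.5, -0.4, 0.1], [-0.3, 0.2, 0.7]],
    b_hidden=[0.0, 0.1],
    w_out=[[1.0, -1.0], [-1.0, 1.0]],
    b_out=[0.0, 0.0],
)
print(probabilities)  # [P(class 0), P(class 1)], e.g., alive vs dead
```

Training consists of adjusting the weights so that these predicted probabilities agree with the known values of the target variable, which is the sense in which the network is supervised.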

This research predicted the prognosis (mainly the overall survival) and classified the different subtypes of mature B-cell neoplasms (non-Hodgkin lymphomas) with high accuracy. Therefore, machine learning and artificial neural networks are useful biostatistical tools in biomedical research, and it is expected that the importance of artificial intelligence in medicine will increase in the future.

This research used basic types of neural networks to obtain reliable results. The single neural networks created the basis for more complex algorithms, making the analysis similar to a classical multivariate analysis. The neural networks were also complemented with other conventional biostatistical analyses, such as gene set enrichment analysis (GSEA) and Cox regression. Additionally, other machine learning techniques were used to complement the results. Each type of machine learning has special uses, and in the results, the information that is provided was complementary.

In the different algorithms, the input data comprised all the genes of the array or specific panels. The panels that were used were carefully selected, and included cancer transcriptome, pan-cancer, cancer progression, and metabolic pathways that incorporate many oncogenes and tumor suppressor genes, but also immune-related panels such as immune exhaustion, human inflammation, host response, autoimmune, and immuno-oncology. Nowadays, immuno-oncology panels are particularly relevant. This research highlighted many important immuno-oncology markers such as CD163, CSF1R, CSF1, PD-L1, IL10, TNFRSF14, TNFAIP8, PD-1, and FOXP3, which are markers of tumor-associated macrophages (TAMs), T lymphocytes, and regulatory T lymphocytes (Tregs). A complete discussion can be found in the previous publications [19,27–35]. Most of these markers can be targeted using inhibitors. In diffuse large B-cell lymphoma, the use of immunomodulatory drugs and immune checkpoint inhibitors is a new and promising field for treating the patients beyond the classical R-CHOP [58] (Table 3).

**Table 3.** Immuno-oncology and pathway-related markers that were highlighted in this research.


Tregs, regulatory T lymphocytes; TAMs, tumor-associated macrophages; DLBCL, diffuse large B-cell lymphoma; FL, follicular lymphoma. Information based on UniProt and GeneCards, and our results.

Interestingly, some of the identified markers were also relevant for the prognosis of nonhematological neoplasia, which suggests that there are common pathogenic mechanisms in all types of neoplasia.

AI analysis combined neural networks such as multilayer perceptron and radial basis function, and several machine learning techniques such as Bayesian network, C&R tree, C5 tree, CHAID tree, discriminant analysis, KNN algorithm, logistic regression, LSVM, Quest tree, random forest, random trees, SVM, tree-AS, XGBoost linear, and XGBoost tree. It is impossible to decide which technique is best because each method has some strengths and weaknesses, and its applicability depends on the type of data, number of cases, and number of variables (inputs).

The term neural network refers to a family of loosely related models that are characterized by large parameter spaces and flexible structures, derived from the study of brain function. Neural networks are the tools of choice in many data mining applications because of their power and flexibility, especially if the underlying process is complex [28].

Artificial neural networks used in prediction applications, such as multilayer perceptron (MLP) and radial basis function (RBF) networks, are supervised in the sense that the results predicted by the model are compared to known values of target variables. The choice between the MLP and RBF methods depends on the type of data and the level of complexity of the problem. The MLP method can find more complex relationships, while RBF is generally faster [30]. Deep neural networks have been criticized for being opaque because their predictions are incomprehensible to humans; their multi-layered nonlinear structure is a "black box model" [31].

We recently modeled celiac disease and ulcerative colitis using AI [59,60]. In the case of ulcerative colitis, we analyzed a series of 43 cases, including 13 healthy controls, 8 inactive ulcerative colitis, 7 non-involved active ulcerative colitis, and 15 involved active ulcerative colitis. As input, 734 genes were included. A total of 16 models were used to predict ulcerative colitis. The overall accuracy was as follows: C5 decision tree (100%, 2 fields used); logistic regression, discriminant analysis, LSVM, SVM, XGBoost linear, XGBoost tree, and neural network (100%, 734 fields); CHAID (97.7%, 2 fields); random forest (97.7%, 734); KNN algorithm (95.4%, 734); C&R tree (95.4%, 12); Quest (83.7%, 6); Bayesian network (65.1%, 734); random trees (0%, 734). In this research, most of the machine learning methods and neural networks had accuracy above 85%. Nevertheless, the number of fields that were used was variable. As also observed in the data of mature B-cell neoplasms, decision trees have difficulties in handling a large set of variables. Bayesian networks provide acceptable results, but are not superior to neural networks. Logistic regression accuracy is usually high and uses many variables. In the end, the most practical strategy is to test all methods and select the ones that predict better. In Table 2, the same 16 models are applied to our data of diffuse large B-cell lymphoma. Generally, the machine learning methods successfully predicted the overall survival of patients with diffuse large B-cell lymphoma using immuno-oncology and immune checkpoint markers. In this particular experiment, neural networks did not have high accuracy.

In conclusion, artificial intelligence analysis is a useful tool for analyzing the prognosis and classification of non-Hodgkin lymphomas.

#### **5. Review of the Literature and Future Perspective in Hematological Neoplasia Using AI**

Other groups have also used artificial intelligence in the field of hematopathology research. Table 4 provides precise updates on the latest progress made in hematological malignancies using machine learning and neural networks. The manuscripts were selected in PubMed using the keywords "lymphoma" and "artificial intelligence". Among all articles that were found within the past 3–4 years, a selection of the most recent research was made. Because of limited space, not all relevant manuscripts are included in Table 4.



**Table 4.** Update on the latest progress made in hematological malignancies using artificial intelligence.







*Cancers* **2022**, *14*, 5318




The manuscripts were organized according to the type of input data, i.e., PET/CT scan, histological images, immunophenotype, clinicopathological variables, and gene expression, mutational, and integrative analysis-based artificial intelligence [61–84].

Worth mentioning is the work of Schmitz R et al. published in the *New England Journal of Medicine* in 2018. The genetics and pathogenesis of diffuse large B-cell lymphoma were analyzed using random forest. The input data from 574 diffuse large B-cell lymphoma cases included exome and transcriptome sequencing, whole-genome copy-number array-based DNA analysis, and targeted amplicon resequencing of 372 genes to identify genetic subtypes [84].

A similar work was published by Xu-Monette ZY et al. in 2020 in *Blood Advances*. Based on targeted next-generation sequencing (NGS), a correlation with the cell of origin subtypes was made using AI in diffuse large B-cell lymphoma. The series of 418 cases included immunohistochemical, gene expression, DNA in situ hybridization, array CGH, and NGS sequencing. Using autoencoders and CPH models, the cases were classified according to the cell of origin and the patients' survival (overall survival and progression-free survival) [81].

Li D et al. reported in 2020 in *Nature Communications* a deep learning diagnostic platform for diffuse large B-cell lymphoma. The method included data from multiple hospitals. This research used histological images of H&E to classify diffuse large B-cell lymphoma (DLBCL) vs non-DLBCL. Non-DLBCL included cases of metastatic carcinoma, melanoma, and other lymphomas. The lymphoma subtypes were chronic lymphocytic leukemia, mantle cell lymphoma, follicular lymphoma, and classical Hodgkin lymphoma. Seventeen types of convolutional neural networks were used, and the model had an accuracy of 99.7–100% [74].

In the past five years, there has been a significant increase in the use of artificial intelligence in cancer research, and many applications in hematological neoplasia have been published [85]. Many studies have used convolutional neural networks to classify digitalized histological images. Machine learning and artificial neural networks have also been used to analyze gene expression and mutational data. It is expected that in the future, artificial intelligence techniques will become a standard part of the biostatistical analysis, and complementary to "conventional" bioinformatics.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14215318/s1, Table S1: Multilayer perceptron analysis (MLP). Table S2: Radial basis function analysis (RBF). Table S3: Genes associated to poor prognosis in the multivariate Cox survival analysis. Table S4: Genes associated to good prognosis in the multivariate Cox survival analysis. Table S5: Clinicopathological correlations with the final set of seven prognostic genes.

**Author Contributions:** Conceptualization, methodology, formal analysis, investigation, and writing, J.C.; resources, G.R.; resources, software, and validation, R.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** Joaquim Carreras was funded by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS) (grant numbers KAKEN 15K19061, 18K15100, and 24590430) and the Tokai University School of Medicine research incentive assistant plan (grant number 2021-B04). Rifat Hamoudi was funded by Al-Jalila Foundation (grant number AJF2018090) and University of Sharjah (grant number 22010902103).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Tokai University, School of Medicine (protocol code IRB14R-080, and IRB20-156).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** All the data are available upon request to Joaquim Carreras.

**Acknowledgments:** The authors thank all the colleagues who had previously contributed to the research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

The analyses used several software applications, including EditPad Lite (version 8.4.0 x64, Just Great Software Co. Ltd.); Fiji (version ImageJ 1.53u, NIH); GSEA (version 4.3.2, Broad Institute); GIMP (version 2.10.8, GNU); IBM SPSS 25 to 27; IBM modeler 18 (IBM); JMP Pro 14 (JMP Statistical Discovery LLC, SAS); Microsoft Excel 2016 (version 16.0.5317.1000, Microsoft Corporation); Minitab (version 21.1.0, Minitab, LLC); Morpheus matrix visualization and analysis software (version 1, https://github.com/cmap/morpheus.js, Broad Institute) (accessed date 25 October 2022); NSolver (version 4.0, NanoString); RapidMiner Studio (version 9.10.011, RapidMiner); R (version 4.2.1) (http://cran.r-project.org) (accessed date 25 October 2022); RStudio (version 2022.07.2, Build 576, RStudio, PBC); STRING protein–protein interaction networks (version 11.5, STRING Consortium 2022); and Xlstat (Premium 2018.1, Build 49320 x64, multilingual, Addinsoft).

#### **Appendix B**

**Table A1.** Publicly available datasets used in addition to the Tokai University series.


#### **Appendix C. Comments and Analysis of Breast Cancer Detection Using Deep Neural Networks**

Breast cancer is the second most frequent type of cancer in women, just after skin cancer. Worldwide, breast cancer represents 30% of all female cancers, and it has a mortality of about 15%, although in emerging countries mortality can reach up to 70% [86,87]. The worldwide incidence ranges from 27 to 97 cases per 100,000 [87], and in about 10% of the cases, there is a genetic predisposition or family history [87]. The most frequently associated germline mutations affect the *BRCA1* and *BRCA2* genes [88,89].

The development of strategies for the early detection of breast cancer is necessary to improve access to treatment and reduce the mortality rate. As described by Basurto-Hurtado JA et al. [90], breast cancer detection includes four steps: image acquisition, segmentation and pre-processing, feature extraction, and classification [90].

Images can be acquired through several methods, such as mammography, ultrasound, magnetic resonance imaging (MRI), and other approaches, including microwave, computed tomography (CT), and positron emission tomography (PET) [90].

The image processing and classification strategies include region of interest (ROI) estimation and feature extraction. The classifiers can be both unsupervised and supervised. Examples of unsupervised classifiers include K-means and hierarchical clustering. Examples of supervised classifiers are decision trees, random forests, AdaBoost, support vector machines, artificial neural networks, and convolutional neural networks [90]. Recently, new image generation techniques have been developed, such as infrared thermography (IRT). This technique has been successfully applied to breast cancer; the classification methods included several machine learning techniques and artificial neural networks, and the accuracy ranged from 90% to 100% [90–101].

Recently, new classification algorithms have been developed, including autoencoders, deep belief networks, ladder networks, and deep neural network (DNN)-based algorithms such as the deep Kronecker neural network [90,102].

Gene expression profiling is a useful tool in medical research, both for diagnosis and for the elucidation of the disease pathogenesis. Artificial neural networks can handle gene expression profiling data successfully, and we recently described their usability in hematological neoplasia [27–35]. In our research, we used conventional machine learning techniques and artificial neural networks because the aim was to identify prognostic factors in a reliable and systematic manner instead of developing new advanced mathematical algorithms. Nevertheless, the performance of the artificial neural networks can be improved with the use of adaptive activation functions (AAFs). Kronecker neural networks (KNNs) are a new type of neural network with adaptive activation functions described by Jagtap AD et al. [103]. Unlike the traditional neural network architecture, in a KNN, the output of the neuron passes to more than one activation function [103]. The use of the Kronecker product in the KNN made the network wide, while at the same time, the number of trainable parameters remained low [103]. Recently, a multi-level KNN approach was used in the analysis of MRI images of brain tumors (glioma) to develop an automated glioma segmentation system [104].

The research in this manuscript focuses on immuno-oncology markers, as we have recently described [85]. In relation to breast cancer, we tested the prognostic value of a set of 718 genes from a pan-cancer immune profiling panel on the overall survival of the patients. A series of 1215 breast cancer patients from The Cancer Genome Atlas (TCGA) was selected. Unfortunately, in this model, a multilayer perceptron analysis failed to properly predict the overall survival of the patients (83.7% overall percent of correct classification, AUC = 0.61). Next, the input was narrowed to 16 genes: macrophage markers (*CD68*, *CSF1R*, *CD163*, *CSF1R*, *CSF1*, *IL10*, *CD274 (PD-L1)*, and *TNFAIP8*), T helper cells (*PDCD1/PD-1*), Tregs (*FOXP3*), apoptosis (*BCL2*, *CASP3*, and *CASP8*), NFKB pathway (*STAT3*), and metabolism (*ENO3*, *GGA3*). The overall survival of breast cancer was predicted using 16 models, namely C5, logistic regression, Bayesian network, discriminant analysis, KNN algorithm, LSVM, random trees, SVM, tree-AS, XGBoost linear, XGBoost tree, CHAID, Quest, C&R tree, random forest, and neural network (multilayer perceptron). Among all models, only random forest provided suitable modeling (input = 16 fields, overall accuracy 98.4%). The order of predictor importance was *CD274*, *FOXP3*, *ENO3*, *IL10*, *CSF1R*, *CSF1*, *BCL2*, *GGA3*, *TNFAIP8*, *CASP8*, *PDCD1*, *CASP3*, *CD163*, *TNFRSF14*, *CD68*, and *STAT3*.
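The contrast between a high percentage of correct classification and a low AUC can be illustrated with a small sketch. The labels and scores below are hypothetical and are not the TCGA data, and the function names are our own; the point is only that, when one outcome dominates, a model whose scores barely separate the two outcomes can still classify most cases correctly.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at a fixed probability threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive case scores above a random negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical cohort: 8 patients alive (0) and 2 dead (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Predicted probability of death; every case falls below the 0.5 threshold.
scores = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45, 0.30, 0.20]

print(f"accuracy = {accuracy(labels, scores):.3f}")
print(f"AUC      = {auc(labels, scores):.3f}")
```

In this toy cohort the model predicts "alive" for everyone, so the accuracy equals the proportion of surviving patients while the AUC stays near chance level. This is one reason to report the AUC alongside the overall accuracy.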

Notably, further analysis was performed in the breast series of the TCGA using the pan-cancer immune profiling panel. In addition to the overall survival, other survival variables were tested, including the disease-specific survival, disease-free interval, and progression-free interval. The multilayer perceptron analysis also failed to predict the survival of the patients with good performance. Additional analyses were performed. Different types of training were tested: batch, online, and mini-batch. Two types of optimization algorithms were also tested: scaled conjugate gradient and gradient descent. The training options for the scaled conjugate gradient were the following: initial lambda (0.0000005), initial sigma (0.00005), interval center (0), and interval offset (±0.5). The training options for the gradient descent were initial learning rate (0.4), momentum (0.9), interval center (0), and interval offset (±0.5). Of note, batch training can use both scaled conjugate gradient and gradient descent, whereas online and mini-batch training are restricted to gradient descent. The training options of gradient descent in the case of online and mini-batch training were initial learning rate (0.4), lower boundary of learning rate (0.001), learning rate reduction, in epochs (10), momentum (0.9), interval center (0), and interval offset (±0.5). We tried to improve the network performance by changing all the training parameters, but no significant improvement in performance was achieved.
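The role of the learning rate and momentum options can be illustrated with the textbook gradient descent update with momentum. This is a generic sketch on a one-parameter toy error function, not the SPSS training procedure: the function name, the toy function, and the number of steps are assumptions, and the learning rate reduction schedule is omitted.

```python
def gradient_descent_momentum(grad, w0, learning_rate=0.4, momentum=0.9, steps=400):
    """Minimize a function of one weight, given its gradient, using momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The new step blends the previous step (momentum) with the current gradient.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w


# Toy error function E(w) = 0.5 * w**2, whose gradient is w and whose minimum is w = 0.
w_final = gradient_descent_momentum(grad=lambda w: w, w0=1.0)
print(w_final)
```

The learning rate scales the size of each step, while the momentum term carries over part of the previous step, which can speed up convergence but may also cause oscillation around the minimum.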



### *Article* **Regulation of Epithelial–Mesenchymal Transition Pathway and Artificial Intelligence-Based Modeling for Pathway Activity Prediction**

**Shihori Tanabe 1,\*, Sabina Quader 2, Ryuichi Ono 3, Horacio Cabral 4, Kazuhiko Aoyagi 5, Akihiko Hirose 1, Edward J. Perkins 6, Hiroshi Yokozaki <sup>7</sup> and Hiroki Sasaki <sup>8</sup>**


**Simple Summary:** Molecular network pathways are activated or inactivated under various conditions. Previously, we revealed that epithelial–mesenchymal transition (EMT) is a feature of diffuse-type gastric cancer. Here, we modeled the activation states of EMT in the development pathway using molecular pathway images and artificial intelligence (AI). The regulation of EMT in the development pathway was activated in diffuse-type gastric cancer (GC) and inactivated in intestinal-type GC. AI modeling with molecular pathway images generated a highly accurate Elastic-Net Classifier model that was validated with 10 additional activated and 10 additional inactivated pathway images.

**Abstract:** Because epithelial–mesenchymal transition (EMT) activity is involved in anti-cancer drug resistance and cancer malignancy, and EMT shares some characteristics with cancer stem cells (CSCs), we used artificial intelligence (AI) modeling to identify the cancer-related activity of the EMT-related pathway in datasets of gene expression. We generated images of gene expression overlaid onto molecular pathways with Ingenuity Pathway Analysis (IPA). A dataset of 50 activated and 50 inactivated pathway images of EMT regulation in the development pathway was then modeled by the DataRobot Automated Machine Learning platform. The most accurate models were based on the Elastic-Net Classifier algorithm. The model was validated with 10 additional activated and 10 additional inactivated pathway images. The generated models had false-positive and false-negative results. The misclassified images had significant features of the opposite labels, and the original data were related to Parkinson's disease. This approach reliably identified cancer phenotypes and treatments where EMT regulation in the development pathway was activated or inactivated, thereby identifying conditions where therapeutics might be applied or developed. As there are a wide variety of cancer phenotypes and CSC targets that provide novel insights into the mechanism of CSCs' drug resistance and cancer metastasis, our approach holds promise for modeling and simulating cellular phenotype transition, as well as predicting molecular-induced responses.

**Keywords:** artificial intelligence; epithelial–mesenchymal transition; Ingenuity Pathway Analysis; machine learning; molecular pathway network

**Citation:** Tanabe, S.; Quader, S.; Ono, R.; Cabral, H.; Aoyagi, K.; Hirose, A.; Perkins, E.J.; Yokozaki, H.; Sasaki, H. Regulation of Epithelial–Mesenchymal Transition Pathway and Artificial Intelligence-Based Modeling for Pathway Activity Prediction. *Onco* **2023**, *3*, 13–25. https://doi.org/10.3390/onco3010002

Academic Editor: Galatea Kallergi

Received: 17 November 2022 Revised: 29 December 2022 Accepted: 3 January 2023 Published: 6 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Molecular network pathways are activated or inactivated under many different conditions. Previously, we found that diffuse-type gastric cancer (GC) has a feature of epithelial–mesenchymal transition (EMT) [1–3]. EMT is involved in anti-cancer drug resistance, cancer malignancy, metastasis, and cancer stem cells (CSCs) [4–7]. Experiments in anticancer drug-resistant cancer cell lines indicate that EMT is involved in cancer cell drug resistance [8], highlighting the significance of EMT targeting in cancer treatment [6].

Several signaling pathways involved in EMT contribute to drug resistance [6]. Transforming growth factor beta (TGFβ) signaling activates SMAD2/3, which then complexes with SMAD4 to form a trimeric SMAD complex, leading to the transcription of EMT transcription factors [9]. Wnt/β-catenin signaling activates Snail transcription to induce EMT [6,10]. Recent studies have also revealed the role of EMT in autophagy and CSCs during metastasis [11,12]. However, the relationship between the EMT pathway activation state and therapeutic responsiveness is not fully understood.

Understanding the activity state of the EMT pathway in cancer cells may be an important clue for identifying therapeutic targets in malignant cancers. To effectively predict EMT activity and potential therapeutic responsiveness, molecular pathway images were used to capture activity of EMT-related pathways of datasets in Ingenuity Pathway Analysis (IPA), followed by artificial intelligence (AI) modeling with images of gene expression activity in the pathway.

#### **2. Materials and Methods**

#### *2.1. Data Analysis of Diffuse- and Intestinal-Type GC*

We used RNA sequencing data of diffuse- and intestinal-type GC, which are publicly available in The Cancer Genome Atlas (TCGA) of the cBioPortal for Cancer Genomics database at the National Cancer Institute (NCI) Genomic Data Commons (GDC) data portal [13–17]. Publicly available data on stomach adenocarcinoma in the TCGA (Stomach Adenocarcinoma, TCGA PanCancer Atlas) [13–16] were compared between diffuse-type GC, which is genomically stable (n = 50), and intestinal-type GC, which has a feature of chromosomal instability (n = 223), as classified in TCGA Research Network publications and as previously described [1,14,18].

#### *2.2. Network Analysis*

Data on intestinal- and diffuse-type GC from the TCGA cBioPortal for Cancer Genomics were uploaded and analyzed using IPA (Qiagen, CA, USA) [19,20]. The datasets of gene expression in diseases were searched in IPA, and datasets with absolute values in z-score in the top 60 for activated state and inactivated state (total of 120) in regulation of EMT in the development pathway were extracted for AI prediction modeling and evaluation. Among 120 analyses in the activity plot of regulation of EMT in the development pathway, 50 activated and 50 inactivated analyses (total of 100) were used to generate AI models and 10 activated and 10 inactivated analyses (total of 20) were withheld for use in validating the generated model. The 100 analyses (50 activated and 50 inactivated states) found in the database of IPA and newly used to generate AI-based models are summarized in Table 1.
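The selection and hold-out scheme described above can be sketched as follows. This is a minimal illustration on synthetic z-scores, not the IPA data; the function name `select_and_split`, the tuple layout, and the random assignment of analyses to the modeling and held-out groups are assumptions, since the text does not state how the 50/10 split within each state was made.

```python
import random


def select_and_split(analyses, top_n=60, n_train=50, seed=0):
    """Pick the top_n analyses by |z-score| for each state, then split each group.

    analyses: list of (analysis_id, z_score) tuples.
    Returns (train, held_out) lists of (analysis_id, z_score, label) tuples.
    """
    activated = sorted((a for a in analyses if a[1] > 0), key=lambda a: -a[1])[:top_n]
    inactivated = sorted((a for a in analyses if a[1] < 0), key=lambda a: a[1])[:top_n]

    rng = random.Random(seed)
    train, held_out = [], []
    for label, group in (("activated", activated), ("inactivated", inactivated)):
        group = list(group)
        rng.shuffle(group)  # assumption: random assignment within each state
        train += [(aid, z, label) for aid, z in group[:n_train]]
        held_out += [(aid, z, label) for aid, z in group[n_train:]]
    return train, held_out


# Synthetic z-scores standing in for the activity plot (not real data).
demo_rng = random.Random(42)
demo = [(i, demo_rng.uniform(-4.0, 4.0)) for i in range(6216)]
train, held_out = select_and_split(demo)
print(len(train), len(held_out))
```

Keeping the held-out analyses separate before any modeling is what allows the later validation to be read as an estimate of performance on unseen pathway images.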


**Table 1.** Analyses in the regulation of EMT in the development pathway for AI prediction modeling.

2-Normal control culture medium 1187 Normal control Culture medium TRUE





#### *2.3. AI Prediction Modeling*

To create a prediction model using multi-modal data including images and text descriptions of molecular networks, an enterprise AI platform (DataRobot Automated Machine Learning version 7.2; DataRobot Inc., Boston, MA, USA) was used. For the modeling, the 100 molecular networks on the regulation of EMT in the development pathway were collected and input as image data into DataRobot (50 images in the activated state and 50 images in the inactivated state), which automatically created and tuned prediction models using various machine-learning algorithms (e.g., eXtreme gradient-boosted trees, random forest, regularized regression such as Elastic Net, and neural networks) [21–23]. Finally, the AI model with the highest predictive accuracy on DataRobot was identified, and various insights (such as Permutation Importance or Partial Dependence Plot) obtained from the model were reviewed. To calculate the accuracy of the model, 20 additional images (10 in the activated state and 10 in the inactivated state) that were not used as training data for the AI model creation were used for validation.

#### *2.4. Statistical Analysis*

The RNA sequencing data on diffuse- and intestinal-type GC were analyzed via Student's *t*-test. The z-scores of intestinal- and diffuse-type GC samples were compared, and the difference was considered significant at *p* < 0.00001, following previous reports [1,18]. The activation z-score in each pathway was calculated in IPA to show the level of activation.

#### **3. Results**

#### *3.1. Regulation of the EMT in Development Pathway in Diffuse- and Intestinal-Type GC*

3.1.1. Gene Expression Mapping in Regulation of the EMT in the Development Pathway in Diffuse- and Intestinal-Type GC

Alterations in gene expression in diffuse- and intestinal-type GC were mapped to a canonical pathway, "Regulation of the EMT in development pathway" (Figure 1), based on the previous gene expression analysis results [1]. Red or green color indicates upregulated or downregulated genes, respectively. In the regulation of EMT in the development pathway, frizzled and adenomatous polyposis coli regulator of the WNT signaling pathway (APC) were upregulated, while SUFU negative regulator of hedgehog signaling (SUFU), pygopus family PHD finger 2 (PYGO2), and BRCA1 were downregulated in diffuse-type GC compared to intestinal-type GC. APC encodes a tumor suppressor protein that acts as an antagonist of the Wnt signaling pathway. APC is also involved in other processes, including cell migration and adhesion, transcriptional activation, and apoptosis. SUFU is associated with β-catenin binding, protein kinase binding, and transcription regulation.

**Figure 1.** Regulation of the epithelial–mesenchymal transition (EMT) in development pathway in diffuse- and intestinal-type gastric cancer (GC). (**a**) Gene expression alteration in diffuse-type GC in regulation of the EMT in development pathway; (**b**) Gene expression alteration in intestinal-type GC in regulation of the EMT in development pathway. Red or green color indicates upregulated or downregulated genes, respectively. The intensity of colors indicates the degree of up- or downregulation. A solid or dashed line indicates direct or indirect interaction, respectively.

3.1.2. Molecular Activity Prediction in Regulation of the EMT in Development Pathway in Diffuse- and Intestinal-Type GC

The prediction of molecular activity in the regulation of the EMT in the development pathway in diffuse- and intestinal-type GC was mapped (Figure 2). GSK3β, SNAI1, NFκB, LOX, and EMT are activated, whereas SNAI2 and E-cadherin are inactivated in diffuse-type GC compared to intestinal-type GC. Notch receptor 1 (NOTCH1) intracellular domain (NOTCHIC) was predicted to be activated in the CSL-HIF1A-MAML1-NICD complex, which consists of hypoxia-inducible factor 1 subunit alpha (HIF1A), mastermind-like transcriptional coactivator 1 (MAML1), NOTCH1, and recombination signal binding for immunoglobulin kappa J region (RBPJ) in the nucleus, and β-catenin (CTNNB1) was predicted to be activated in β-catenin-APC-AXIN-GSK3β complex in the cytoplasm in diffuse-type GC compared to intestinal-type GC.

**Figure 2.** Molecular activity prediction in regulation of the EMT in development pathway in diffuse- and intestinal-type GC. (**a**) Molecular activity prediction in diffuse-type GC; (**b**) molecular activity prediction in intestinal-type GC. Red or green color indicates upregulated or downregulated genes, respectively. The intensity of colors indicates the degree of up- or downregulation. A solid or dashed line indicates direct or indirect interaction, respectively. Orange or blue color indicates predicted activation or inhibition, respectively. The intensity of colors indicates the confidence level of the prediction.

#### *3.2. Activity Plot of Regulation of the EMT in Development Pathway*

In total, 6216 analyses were found to be involved in the regulation of the EMT in the development pathway (as of September 2021) (Figure 3). In subsequent AI modeling analyses, samples with "NA" in the case treatment and blank in the disease state were excluded.

**Figure 3.** Activity plot of regulation of EMT in development pathway (6216 analyses, as of September 2021).

#### *3.3. AI Modeling and Validation of the Prediction Model*

The activation state of regulation of EMT in the development pathway was modeled by machine learning, including deep learning, using 50 activated and 50 inactivated images of the regulation of EMT in development pathway (Figure 4). DataRobot was used for machine-learning modeling and 34 models were automatically created, including an Elastic-Net Classifier (L2/Binomial Deviance) model. DataRobot also highlighted the parts of the image data critical to the prediction accuracy of the model in an activation map (Figure 4).

**Figure 4.** Activation map of AI modeling (DataRobot).

To validate the Elastic-Net Classifier model, predictions were made using data on 10 activated and 10 inactivated pathway images that were not used to train the model (Table 2). The results showed that the prediction accuracy for the additional 20 images was 100% (AUC = 1.0).
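For readers less familiar with the two reported metrics, the following minimal sketch computes accuracy and AUC for a balanced set of 20 held-out items. The scores are hypothetical and were chosen so that the two classes are perfectly separated; they are not the model outputs from this study, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for 10 activated (1) and 10 inactivated (0) held-out images.
y_true = [1] * 10 + [0] * 10
scores = [0.60 + 0.04 * i for i in range(10)] + [0.04 * i for i in range(10)]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5

print(accuracy(y_true, y_pred), auc(y_true, scores))
```

An AUC of 1.0 only requires that every activated image scores higher than every inactivated one; with 20 images, such a result carries wide uncertainty and is best read alongside the misclassifications seen during model generation.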


**Table 2.** Validation of the model ElasticNet\_Classifier\_(L2/Binomial\_Deviance).



#### *3.4. Regulation of EMT in the Development Pathway in Diseases Other Than Cancer*

Modeling the regulation of EMT in the development pathway produced one false-positive and one false-negative result with the Elastic-Net Classifier model during model generation (Figure 5). The false-negative result was an analysis of Parkinson's disease with a z-score of 3 (Figure 5a). The false-positive result was an analysis of a genetic disease with a z-score of −2.646 (Figure 5b).

**Figure 5.** Regulation of EMT in development pathway in diseases. (**a**) Parkinson's disease (PD) (skin) differentiation medium 4389, *p* value = 1.89 × 10<sup>−2</sup>, z-score = 3; Gene identifiers marked with an asterisk (\*) indicate that multiple identifiers in the dataset file map to a single gene in the Global Molecular Network. (**b**) genetic disease (midbrain) 444, *p* value = 4.75 × 10<sup>−2</sup>, z-score = −2.646.

#### **4. Discussion**

Our results demonstrate that the canonical pathway of regulation of the EMT in the development pathway was activated in diffuse-type GC but not in intestinal-type GC. Specifically, the pathway mapping of gene expression revealed that Frizzled and APC were upregulated, while SUFU, PYGO2, and BRCA1 were downregulated in diffuse-type GC compared to intestinal-type GC. Frizzled proteins are a family of Wnt receptors involved in carcinogenesis [24]. It was previously shown that Frizzled-7 affected stemness and chemotherapeutic resistance in GC [25]. Accordingly, targeted inhibition of Frizzled-7, which attenuated spheroid formation and stemness as well as resistance to the anti-cancer drug cisplatin in GC cells, may have a therapeutic effect [25]. Besides Frizzled-7, the expression of Frizzled-10 was shown to have an interesting correlation with cancer evolution. Importantly, as Frizzled-10 is not expressed in fully proliferative healthy tissue, but is highly expressed in certain cancerous tissue, it has the potential to be used as a prospective receptor molecule for targeted therapy. Intriguingly, it was found that in GC, a decrease in cytoplasmic expression of Frizzled-10 is associated with increasing malignancy, whereas in colon cancer, the opposite is true: increased cytoplasmic expression of Frizzled-10 is crucial for the late stages of colon cancer progression and metastasis [24]. The co-localized expression of the Frizzled family in different sub-types of cancer would confer progressive features on cancer.

APC is essential as a tumor suppressor protein in colorectal cancer and for its destruction complex functions, though its specific molecular activity has not been fully resolved [26]. The modeling or simulation of the cellular phenotype transition in EMT and diseases and predicting the molecular-induced responses in diseases would be useful for future investigation.

SUFU, PYGO2, and BRCA1 were downregulated in diffuse-type GC compared to intestinal-type GC. Previous findings have reported that SUFU, a regulator of Wnt signaling, was downregulated in GC and inhibited by miRNA-324-5p [27]. It was suggested that miRNA-324-5p induces EMT by inhibiting SUFU in GC [27]. PYGO2 was reported to be increased in human breast cancer [28]. The expression of PYGO2 was also assessed in glioma tissue samples, and the results showed a positive correlation between tumor grade and PYGO2 overexpression [29]. PYGO2 was overexpressed in drug-resistant GC cell lines and in GC tissue after neoadjuvant chemotherapy [30]. It may be possible that PYGO2 has a different expression profile in diffuse-type GC compared to intestinal-type GC. BRCA1 was also downregulated in diffuse-type GC compared to intestinal-type GC. We have previously shown that the role of BRCA1 in the DNA damage response pathway was activated in intestinal-type GC compared to diffuse-type GC [18]. Accordingly, BRCA1 is rather important to intestinal-type GC.

The current study successfully generated AI-based models using 50 activated and 50 inactivated images of EMT gene regulation in the development pathway. The analyses in the database were selected based on the diseases and the treatment (Tables 1 and 2). Diseases in activated states of EMT regulation in the development pathway included bone osteosarcoma [31], breast carcinoma [32], and colon cancer [33]. AI application in gastrointestinal diseases would be a promising approach [34].

An interesting point of our current study is that the machine-learning modeling revealed that an IPA analysis of Parkinson's disease had a false-negative prediction result (Figure 5a). The color of the picture seems to be inactivated, which is in accordance with the prediction result as inactivated. Furthermore, it seems that EMT activation in the WNT pathway via SNAI2 resulted in the prediction being activated, whereas CSL-HIF1A-MAML1-NICD complex-induced EMT via SNAI1 was predicted as inactivated. In addition to Parkinson's disease, the machine-learning modeling revealed that an analysis of another unrelated genetic disease had a false-positive prediction result (Figure 5b). On the other hand, based on the analysis, GSK3β and SNAI1 were predicted as activated, while SNAI2 was inactivated (Figure 5b). The activation of GSK3β could be associated with the mediator role of GSK3β in the cross-talk of EMT signaling pathways [35].

#### **5. Conclusions**

The regulation of EMT in the development pathway was activated in diffuse-type GC and inactivated in intestinal-type GC. AI modeling with molecular pathway images generated the Elastic-Net Classifier model. The validation with 10 activated and 10 inactivated new pathway images, which were not used for the modeling, resulted in high accuracy. The modeling of the cellular phenotype transition in EMT and diseases will be studied in the near future.

**Author Contributions:** Conceptualization, S.T.; methodology, S.T.; formal analysis, S.T.; investigation, S.T.; writing—original draft preparation, S.T.; writing—review and editing, S.T., S.Q., R.O., H.C., K.A., A.H., E.J.P., H.Y. and H.S.; visualization, S.T.; project administration, S.T.; funding acquisition, S.T., S.Q., R.O. and A.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Japan Agency for Medical Research and Development (AMED) Grant Number JP20ak0101093 (S.T., R.O. and A.H.), JP21mk0101216 (S.T.), JP22mk0101216 (S.T.), and Strategic International Collaborative Research Program, Grant Number JP20jm0210059 (S.T. and S.Q.), Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 21K12133 (S.T. and R.O.).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to acknowledge Shinpei Ijichi for assisting with the DataRobot Automated Machine Learning platform. The authors are grateful to all colleagues including members of the National Institute of Health Sciences, Japan for their support. This research was supported by the Ministry of Health, Labour, and Welfare, Japan.

**Conflicts of Interest:** The authors declare no conflict of interest.



**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **Machine Learning Model to Stratify the Risk of Lymph Node Metastasis for Early Gastric Cancer: A Single-Center Cohort Study**

**Ji-Eun Na 1,2,†, Yeong-Chan Lee 3,†, Tae-Jun Kim 1,\*, Hyuk Lee 1,\*, Hong-Hee Won 3, Yang-Won Min 1, Byung-Hoon Min 1, Jun-Haeng Lee 1, Poong-Lyul Rhee <sup>1</sup> and Jae J. Kim <sup>1</sup>**


**Simple Summary:** Endoscopic resection (ER) is a treatment option for clinically T1a early gastric cancer (EGC) without suspicion of lymph node metastasis (LNM). In patients with non-curative resection after ER, additional surgery is recommended owing to the LNM risk. However, of those patients treated with additional surgery after ER, the actual rate of LNM was about 5–10%; that is, the other patients underwent unnecessary surgeries. Therefore, it is crucial to estimate LNM risk in EGC patients to determine additional management after ER. We derived a machine learning (ML) model to stratify the LNM risk in EGC patients and validate its performance. The constructed ML model, which showed good performance with an area under the receiver operating characteristic of 0.85 or higher, could stratify LNM risk into very low (<1%), low (<3%), intermediate (<7%), and high (≥7%) risk categories. These findings suggest that the ML model can stratify the LNM risk in EGC patients.

**Abstract:** Stratification of the risk of lymph node metastasis (LNM) in patients with non-curative resection after endoscopic resection (ER) for early gastric cancer (EGC) is crucial in determining additional treatment strategies and preventing unnecessary surgery. Hence, we developed a machine learning (ML) model and validated its performance for the stratification of LNM risk in patients with EGC. We enrolled patients who underwent primary surgery or additional surgery after ER for EGC between May 2005 and March 2021. Additionally, patients who underwent ER alone for EGC between May 2005 and March 2016 and were followed up for at least 5 years were included. The ML model was built based on a development set (70%) using logistic regression, random forest (RF), and support vector machine (SVM) analyses and assessed in a validation set (30%). In the validation set, LNM was found in 337 of 4428 patients (7.6%). Among the total patients, the area under the receiver operating characteristic (AUROC) for predicting LNM risk was 0.86 in the logistic regression, 0.85 in RF, and 0.86 in SVM analyses; in patients with initial ER, AUROC for predicting LNM risk was 0.90 in the logistic regression, 0.88 in RF, and 0.89 in SVM analyses. The ML model could stratify the LNM risk into very low (<1%), low (<3%), intermediate (<7%), and high (≥7%) risk categories, which was comparable with actual LNM rates. We demonstrate that the ML model can be used to identify LNM risk. However, this tool requires further validation in EGC patients with non-curative resection after ER for actual application.

**Keywords:** early gastric cancer; machine learning model; risk stratification; lymph node metastasis

**Citation:** Na, J.-E.; Lee, Y.-C.; Kim, T.-J.; Lee, H.; Won, H.-H.; Min, Y.-W.; Min, B.-H.; Lee, J.-H.; Rhee, P.-L.; Kim, J.J. Machine Learning Model to Stratify the Risk of Lymph Node Metastasis for Early Gastric Cancer: A Single-Center Cohort Study. *Cancers* **2022**, *14*, 1121. https://doi.org/10.3390/cancers14051121

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 14 January 2022 Accepted: 20 February 2022 Published: 22 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Early gastric cancer (EGC) describes a gastric tumor confined to the mucosa or submucosa, with or without lymph node metastasis (LNM). Endoscopic resection (ER) is recommended as a minimally invasive treatment for clinically mucosal EGC without suspicion of LNM [1–4]. In cases of non-curative resection after ER that do not satisfy the expanded criteria of curative resection, additional surgery is recommended, considering the risk of LNM [5,6]; however, LNM is found in only 5–10% of those patients after surgery [7–10]. Therefore, overtreatment is a concern. To address this, the recently revised guidelines excluded piecemeal resection and a positive lateral margin from the factors of non-curative resection after ER for which additional surgery is primarily recommended [1,4,11].

Furthermore, in Japan, patients who have non-curative resection after ER, excluding piecemeal resection and a positive lateral margin, are classified as "endoscopic curability (eCura) C-2"; patients in the eCura C-2 category are further stratified into low (2.5%), intermediate (6.7%), and high (22.7%) LNM risk categories based on the eCura scoring system [2,12,13]. In the low-risk category, there is no difference in cancer recurrence or cancer-specific mortality between patients who undergo no additional treatment and those who undergo additional surgery [14]. Hence, this LNM risk stratification system suggests that additional surgery after non-curative resection may be determined on an individual basis, considering the LNM risk, the patient's condition, and the benefits and limitations of additional surgery [11,12,14].

Another area of concern is that some patients with confirmed non-curative resection after ER but without actual LNM may be unnecessarily exposed to surgery-related risks. The rates of postoperative complications and overall mortality after gastric cancer surgery are 10–26% and 0.3–2.3%, respectively, and comorbidities, body mass index, and lymph node dissection have been reported as risk factors [15–21]. In addition, the potential for long-term health problems after gastric cancer surgery, such as reflux, gastroparesis, gallstones, and osteoporosis, must be considered [22,23]. Therefore, it is clinically significant to predict the LNM risk among EGC patients who undergo non-curative resection after ER to prevent unnecessary surgery.

To stratify the LNM risk in EGC patients, we created a machine learning (ML) model for predicting LNM risk and validated its performance.

#### **2. Materials and Methods**

#### *2.1. Patients*

We included patients who underwent surgery for EGC between May 2005 and March 2021 at Samsung Medical Center. Additionally, patients who underwent additional surgery after ER owing to complications or non-curative resection were included. Moreover, patients who underwent ER alone for EGC without surgery between May 2005 and March 2016 were included and followed up for at least 5 years. After excluding patients with missing data, a total of 14,760 patients who underwent surgery (*n* = 12,631) or ER alone (*n* = 2129) were included (Figure 1). The patients were randomly divided into the development set (70%) and validation set (30%).

#### *2.2. Definition, Outcome, Data Sources, and Study Variables*

LNM was defined based on surgical specimens of patients who underwent surgery. In patients who underwent ER alone, regional LN recurrence was determined based on computed tomography scans during follow-up.

The outcome consisted of establishing the ML model for predicting LNM risk in EGC patients and validating its performance. We divided the entire cohort into a development set (70%) for derivation of the ML model and a validation set (30%) for validation. Since the actual target participants were patients treated with ER for EGC, the performance of the ML model was evaluated for total patients and initial ER patients, respectively, using three methods in the development set and validation set. First, the area under the receiver operating characteristic (AUROC), sensitivity, and specificity of the ML model were analyzed. Second, we assessed whether the ML model could stratify the risk of LNM into very low-, low-, intermediate-, and high-risk categories. In the development set, we listed the predicted values calculated by the ML model and selected cutoffs at the points where the actual LNM rates were 1%, 3%, and 7%. An actual LNM rate <1% was allocated into the very low-, <3% into the low-, <7% into the intermediate-, and ≥7% into the high-risk categories. The 3% and 7% criteria for the low-, intermediate-, and high-risk categories were based on the previous literature [12]. Additionally, we set a very-low risk category of predicted LNM risk with <1%. This ML model for stratifying LNM risk was applied to the total patients and patients with initial ER in the validation set. Third, we evaluated the ability of the ML model to discriminate patients with negligible risk of LNM at a high-sensitivity cutoff of 100% to predict LNM. From a clinical perspective, the utility of a risk score depends on its ability to discriminate patients at low risk for LNM, i.e., it is ideal to identify patients who do not need surgery and those who need surgery.
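The stratification step can be sketched as follows. This is a minimal illustration, assuming that patients are binned by three cutoffs on the predicted value and that the observed LNM rate in each bin is then compared against the 1%, 3%, and 7% boundaries; the function names, the toy data, and the cutoff values in the demonstration are hypothetical and are not taken from the study.

```python
from bisect import bisect_right

CATEGORIES = ["very low", "low", "intermediate", "high"]


def category_from_rate(lnm_rate):
    """Map an observed LNM rate (as a fraction) to one of the four risk categories."""
    return CATEGORIES[bisect_right([0.01, 0.03, 0.07], lnm_rate)]


def observed_rates(predicted, has_lnm, cutoffs):
    """Observed LNM rate within each bin of predicted values defined by three cutoffs."""
    counts = [[0, 0] for _ in CATEGORIES]  # [patients, LNM cases] per bin
    for p, y in zip(predicted, has_lnm):
        b = bisect_right(cutoffs, p)
        counts[b][0] += 1
        counts[b][1] += y
    return [cases / n if n else None for n, cases in counts]


# Toy data (hypothetical): predicted values, LNM status, and cutoffs on the predictions.
predicted = [0.02, 0.04, 0.10, 0.20, 0.30, 0.50, 0.70, 0.90]
has_lnm = [0, 0, 0, 0, 0, 1, 0, 1]
rates = observed_rates(predicted, has_lnm, cutoffs=[0.05, 0.25, 0.60])
print(rates)
print([category_from_rate(r) for r in (0.005, 0.02, 0.05, 0.10)])
```

In practice the cutoffs would be fixed on the development set and then applied unchanged to the validation set, so that agreement between predicted category and observed rate is assessed on patients not used to choose the cutoffs.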

**Figure 1.** Diagram of patient selection.

Non-curative resection was defined as not satisfying an expanded criterion for curative resection. The expanded criteria for curative resection were en bloc resection, negative horizontal and vertical margins, absence of lymphovascular invasion, and one of the following: (a) differentiated mucosal cancer without ulcerative lesions, regardless of the tumor size; (b) differentiated mucosal cancer with ulcerative lesions that were ≤3 cm in size; (c) undifferentiated mucosal cancer without ulcerative lesions that were ≤2 cm in size; or (d) differentiated cancer invasion to the submucosa <500 μm from the muscularis mucosa that was ≤3 cm in size.
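The expanded criteria above can be written as a small rule-based check. This is a sketch under stated assumptions: the `Resection` fields and the depth labels (`"mucosa"`, `"sm1"`) are hypothetical names introduced here, and because criterion (d) as written does not mention ulcerative lesions, the check for submucosal invasion ignores ulcer status.

```python
from dataclasses import dataclass


@dataclass
class Resection:
    en_bloc: bool
    margins_negative: bool          # both horizontal and vertical margins negative
    lymphovascular_invasion: bool
    differentiated: bool
    depth: str                      # "mucosa", "sm1" (<500 um), or anything deeper
    ulcer: bool
    size_cm: float


def is_curative(r: Resection) -> bool:
    """Return True if the resection meets the expanded criteria for curative resection."""
    if not (r.en_bloc and r.margins_negative) or r.lymphovascular_invasion:
        return False
    if r.depth == "mucosa":
        if r.differentiated:
            # (a) no ulcer, any size; (b) ulcer present and size <= 3 cm
            return (not r.ulcer) or r.size_cm <= 3.0
        # (c) undifferentiated, no ulcer, size <= 2 cm
        return (not r.ulcer) and r.size_cm <= 2.0
    if r.depth == "sm1":
        # (d) differentiated, submucosal invasion <500 um, size <= 3 cm
        return r.differentiated and r.size_cm <= 3.0
    return False


example = Resection(True, True, False, True, "mucosa", True, 3.5)
print(is_curative(example))  # differentiated mucosal cancer with ulcer, larger than 3 cm
```

Non-curative resection is then simply `not is_curative(r)`, which matches the definition given above.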

Data were collected retrospectively from the electronic medical records, including age, sex, number of tumors, tumor location (upper third, middle third, and lower third), size (mm), gross type (non-depressed and depressed), differentiation (well, moderate, signet, and poor), Lauren classification (intestinal, diffuse, and mixed), depth of invasion (lamina propria, muscularis mucosa, submucosal invasion <500 μm from the muscularis mucosa (SM1), and submucosal invasion ≥500 μm from the muscularis mucosa (SM2/3)), lymphatic invasion, venous invasion, and perineural invasion.

#### *2.3. Establishment of the Machine Learning Model*

The ML model was implemented using three methods to produce an optimal model based on the development set (70%): logistic regression, support vector machine (SVM), and random forest (RF). We constructed the ML model in the cohort of total patients and in patients with initial ER, respectively. This design reflected our actual target population, EGC patients for whom ER was feasible. A randomized search algorithm with fivefold nested cross-validation in the development set was conducted for hyperparameter optimization of each method. The algorithm was optimized by randomly searching the given hyperparameter space 1000 times using the development set (Table S1). We selected this search algorithm rather than grid or Bayesian search algorithms because these three methods are fast enough to search all given spaces and have relatively few hyperparameters. The best hyperparameters for a model were those that yielded the highest AUROC. The performance of the models with the best hyperparameters was evaluated in the validation set (30%). We defined a weighting factor of 14.0 based on the imbalance ratio of the classes. We assessed feature importance by permuting each variable 100 times. The code and models are publicly available at https://github.com/YeongChanLee/Predict-LNM (accessed on 21 February 2022).
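The core idea of a randomized hyperparameter search can be sketched in a few lines. This is a simplified illustration that uses only the Python standard library; it is not the published code and does not reproduce the nested cross-validation. The search space, the function names, and the toy scoring function are hypothetical; in a real analysis the scoring function would return the mean cross-validated AUROC of a model trained with the sampled settings.

```python
import random


def random_search(space, score_fn, n_iter=1000, seed=0):
    """Sample n_iter settings from `space` and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical search space for a regularized classifier.
space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}


def toy_score(params):
    """Stand-in for a cross-validated AUROC; peaks at C = 1.0 with the l2 penalty."""
    bonus_c = 0.05 if params["C"] == 1.0 else 0.0
    bonus_penalty = 0.01 if params["penalty"] == "l2" else 0.0
    return 0.80 + bonus_c + bonus_penalty


best_score, best_params = random_search(space, toy_score)
print(best_params)
```

With a small discrete space such as this one, 1000 random draws visit every combination many times over, which is consistent with the rationale given above for preferring a randomized search to a grid or Bayesian search.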

#### *2.4. Statistical Analysis*

Baseline characteristics were compared between the development and validation sets and presented as means (standard deviation) and frequencies (%) for continuous and categorical variables, respectively. The performance of the ML model was evaluated using AUROC, sensitivity, and specificity. The sensitivity and specificity were derived using Youden's index. The risk probability was calculated for the stratification of LNM risk based on the logistic regression, RF, and SVM analyses in the development set. Predicted LNM risk was classified into very low-, low-, intermediate-, and high-risk categories according to the actual LNM rate with a cutoff <1%, <3%, and <7%. We analyzed whether the categories of predicted LNM risk correlated with the real LNM rate. As a subanalysis, the performance of the ML model was compared with the eCura system as a clinical model in cases defined as non-curative resection after ER for EGC in the validation set, using AUROC, net reclassification improvement (NRI), and specificity at a high-sensitivity cutoff of 95%. The ML model was developed using Scikit-learn 0.24.1 and Python 3.8.5. Statistical analyses were performed using R (version 3.5.1, Vienna, Austria).
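As an illustration of how sensitivity and specificity can be derived with Youden's index, the sketch below scans every observed score as a candidate cutoff and keeps the one maximizing J = sensitivity + specificity − 1. The function name and the toy labels and scores are hypothetical and are not taken from the study data.

```python
def youden_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximizing J = sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, 0.0, 0.0, -1.0)
    for cutoff in sorted(set(scores)):
        # A patient is called positive when the score is >= cutoff.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if j > best[3]:
            best = (cutoff, sens, spec, j)
    return best[:3]

# Hypothetical predicted risks for eight patients (1 = LNM present).
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
cutoff, sens, spec = youden_cutoff(labels, scores)
```

For these toy values the selected cutoff is 0.35, giving a sensitivity of 1.0 and a specificity of 0.8.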

#### **3. Results**

#### *3.1. Baseline Characteristics*

A total of 14,760 patients were eligible for analysis; 10,332 patients were randomly assigned to the development set and 4428 to the validation set. LNM was found in 794 of 10,332 patients (7.7%) in the development set and 337 of 4428 patients (7.6%) in the validation set. The baseline characteristics of the development and validation sets are shown in Table 1. They were comparable in most variables, including age, sex, number of tumors, size, gross type, differentiation, Lauren classification, depth of invasion, lymphatic invasion, venous invasion, and perineural invasion. However, the middle third of the stomach was the most frequent tumor location in the development set, whereas the lower third of the stomach was the most frequent tumor location in the validation set (*p* = 0.013).


**Table 1.** Baseline characteristics of the development set and validation set.

**†** Mean ± standard deviation presented for continuous variables. Values are expressed as *n* (%); unless otherwise specified. **<sup>a</sup>** *p*-value calculated using Student's *t*-test for continuous variables or Pearson's chi-square test for categorical variables for overall data. SM1: submucosal invasion <500 μm from the muscularis mucosa; SM2/3: submucosal invasion ≥500 μm from the muscularis mucosa.

#### *3.2. Derivation of the Machine Learning Model*

In the development set, LNM was found in 794 of 10,332 patients (7.7%) in the total patients, and in 42 of 2320 patients (1.8%) in patients with initial ER. The derived ML model showed good to excellent performance in the development set; in the total patients, logistic regression was AUROC (95% CI), 0.86 (0.85–0.88); sensitivity, 0.80; and specificity, 0.76; RF was AUROC (95% CI), 0.95 (0.94–0.95); sensitivity, 0.91; and specificity, 0.86; and SVM was AUROC (95% CI), 0.87 (0.85–0.88); sensitivity, 0.79; and specificity, 0.78. In patients with initial ER, logistic regression was AUROC (95% CI), 0.88 (0.83–0.92); sensitivity, 0.86; and specificity 0.82; RF was AUROC (95% CI), 0.95 (0.93–0.97); sensitivity, 0.93; and specificity, 0.88; and SVM was AUROC (95% CI), 0.88 (0.83–0.92); sensitivity, 0.93; and specificity, 0.73 (Figure 2).

**Figure 2.** AUROC of the ML model for the prediction of LNM in the development set (total number = 10,332, number of patients with initial ER = 2320).

In the development set, LNM risk was predicted using the ML model (logistic regression, RF, and SVM), and the cutoff for the categories of very low, low, intermediate, and high risk was set as the value of the actual LNM rate of <1%, <3%, and <7% in the total patients and initial ER patients, respectively (Table 2). As an example, in the total patients, LNM risk was stratified using logistic regression into very low (<1%)-, low (<3%)-, intermediate (<7%)-, and high (≥7%)-risk categories, and the cutoff was determined by the actual LNM rate. Each category showed a real LNM rate of 0.2%, 1.4%, 4.1%, and 18.4% (Table 2).
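The stratification described above can be sketched as a simple binning of predicted probabilities followed by a check of the observed LNM rate in each bin. The probability cutoffs used here are hypothetical placeholders; in the study they were chosen so that the observed rates fell below 1%, 3%, and 7%, and the actual values are those reported in Table 2.

```python
def stratify(pred_probs, labels, cutoffs):
    """Bin patients by predicted probability and report the observed event rate per bin.

    cutoffs: ascending upper bounds for every bin except the last, open-ended one.
    """
    names = ["very low", "low", "intermediate", "high"]
    counts = {name: [0, 0] for name in names}  # name -> [patients, events]
    for p, y in zip(pred_probs, labels):
        idx = sum(p >= c for c in cutoffs)  # number of cutoffs at or below p
        counts[names[idx]][0] += 1
        counts[names[idx]][1] += y
    return {name: (n, events / n if n else None) for name, (n, events) in counts.items()}

# Hypothetical toy cohort and hypothetical probability cutoffs.
pred = [0.01, 0.01, 0.03, 0.04, 0.07, 0.08, 0.20, 0.30]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
result = stratify(pred, labels, cutoffs=(0.02, 0.05, 0.10))
```

Each entry of `result` holds the number of patients in the category and the observed event rate, which is the quantity compared against the 1%, 3%, and 7% targets.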

**Table 2.** Determination of the cutoff for stratification of LNM risk based on the predictive value of the ML model and actual LNM rate in the development set. (**A**) Total patients. (**B**) Patients with initial ER.




LNM, lymph node metastasis.

#### *3.3. Validation of the Machine Learning Model*

In the validation set, LNM was found in 337 of 4428 patients (7.6%) in the total patients, and in 24 of 1016 patients (2.4%) in patients with initial ER. In the validation set, the ML model showed a good performance in the total patients and patients with initial ER. In total patients, logistic regression was AUROC (95% CI), 0.86 (0.84–0.88); sensitivity, 0.80; and specificity, 0.75; RF was AUROC (95% CI), 0.85 (0.83–0.87); sensitivity, 0.82; and specificity, 0.72; and SVM was AUROC (95% CI), 0.86 (0.84–0.88); sensitivity, 0.69; and specificity, 0.85. In patients with initial ER, logistic regression was AUROC (95% CI), 0.90 (0.86–0.94); sensitivity, 0.92; and specificity, 0.77; RF was AUROC (95% CI), 0.88 (0.82–0.92); sensitivity, 0.92; and specificity, 0.74; and SVM was AUROC (95% CI), 0.89 (0.85–0.93); sensitivity, 0.92; and specificity, 0.78 (Figure 3).

In the validation set, logistic regression and SVM showed the possibility of stratifying the risk of LNM for total patients and patients with initial ER. The predicted LNM risk was correlated with the actual LNM rate. In the total patients, the actual LNM rate according to the very low-, low-, intermediate-, and high-risk categories was 0.1%, 1.6%, 4.8%, and 17.7% based on logistic regression and 0.1%, 1.6%, 4.2%, and 18.1% based on SVM, respectively. In patients with initial ER, the actual LNM rate according to the very low-, low-, intermediate-, and high-risk categories was 0.2%, 2.5%, 0.0%, and 11.9% based on logistic regression and 0.2%, 1.7%, 4.5%, and 13.0% based on SVM, respectively. In contrast, in the analysis using RF, the actual LNM rate was 1.3%, 6.3%, 7.4%, and 23.1% of the total patients and 0.4%, 5.0%, 10.0%, and 12.0% of patients with initial ER, which was higher than that of the predicted category of LNM risk (Table 3).

**Figure 3.** AUROC of the ML model for the prediction of LNM in the validation set (total number = 4428, number with initial ER = 1016).


**Table 3.** Risk stratification of LNM by the ML model and the actual rate in the validation set. (**A**) Total patients. (**B**) Patients with initial ER.



In the total patients in the validation set, the specificities of the ML model at the high-sensitivity cutoff of 100% were 49%, 46%, and 49% in the logistic regression, RF, and SVM analyses, respectively. In patients with initial ER, the specificities of the ML model at the high-sensitivity cutoff of 100% were 71%, 57%, and 70% in the logistic regression, RF, and SVM analyses, respectively (Figure 4).

**Figure 4.** Identification of patients with negligible risk of lymph node metastasis at the high-sensitivity cutoff in the validation set.
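A specificity at a fixed high-sensitivity cutoff can be computed as sketched below: the cutoff is placed at the score of the lowest-ranked positive case that must still be captured, and specificity is the share of negative cases falling below it. This is a minimal illustration with hypothetical names and toy values, not the authors' code.

```python
import math

def specificity_at_sensitivity(labels, scores, target_sens=1.0):
    """Specificity at the highest cutoff whose sensitivity is >= target_sens."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Number of positives that must score at or above the cutoff
    # (small tolerance guards against floating-point overshoot).
    needed = max(1, math.ceil(target_sens * len(pos) - 1e-9))
    cutoff = pos[needed - 1]
    spec = sum(s < cutoff for s in neg) / len(neg)
    return cutoff, spec

# Hypothetical toy cohort: three patients with LNM, four without.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.5, 0.3, 0.4, 0.2, 0.1, 0.05]
cutoff, spec = specificity_at_sensitivity(labels, scores, target_sens=1.0)
```

Here every positive case is retained at a cutoff of 0.3, and three of the four negative cases fall below it, so the specificity at 100% sensitivity is 0.75.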

In the validation set, as a subanalysis in the patients with non-curative resection after ER for EGC, LNM was found in 21 of 362 patients (5.8%). The AUROC of the ML model was 0.76, 0.73, and 0.75 in the logistic regression, RF, and SVM analyses, respectively, and the AUROC of the eCura system was 0.72. Logistic regression (NRI, 0.46) and SVM (NRI, 0.21) improved the performance compared to the eCura system. The specificities of the ML model at the high-sensitivity cutoff of 95% were 39%, 38%, and 38% in the logistic regression, RF, and SVM analyses, respectively, which were higher than the specificity of 9% for the eCura system (Figure S1).
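For readers unfamiliar with the NRI, the sketch below computes a two-category version: the net proportion of events moved to the higher-risk class plus the net proportion of non-events moved to the lower-risk class. The text does not state which NRI variant was used, so the two-category form, the function name, and the toy classifications are assumptions made purely for illustration.

```python
def net_reclassification_improvement(labels, old_class, new_class):
    """Two-category NRI; old_class and new_class are 0 (low risk) or 1 (high risk)."""
    events = [(o, n) for y, o, n in zip(labels, old_class, new_class) if y == 1]
    nonevents = [(o, n) for y, o, n in zip(labels, old_class, new_class) if y == 0]
    up = lambda pairs: sum(n > o for o, n in pairs) / len(pairs)
    down = lambda pairs: sum(n < o for o, n in pairs) / len(pairs)
    # Events should move up; non-events should move down.
    return (up(events) - down(events)) + (down(nonevents) - up(nonevents))

# Hypothetical classifications by a reference model (old) and a new model.
labels = [1, 1, 0, 0, 0, 0]
old = [1, 0, 1, 1, 1, 0]
new = [1, 1, 1, 0, 0, 0]
nri = net_reclassification_improvement(labels, old, new)
```

A positive value indicates that the new model reclassifies patients in the correct direction more often than in the wrong one.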

#### **4. Discussion**

Here, we demonstrated the utility of an ML model for predicting the LNM risk in EGC patients. In the validation set, the AUROC of each ML model showed a good performance, ranging from 0.85 to 0.90. Furthermore, each ML model could stratify the LNM risk as very low, low, intermediate, and high risk, and those stratified groups showed a consistent actual LNM rate. In addition, these showed specificities of about 0.50 or higher at a matched sensitivity of 100%, indicating that it could discriminate patients with negligible risk of LNM while identifying the patients who needed surgery owing to the LNM risk with 100% sensitivity. This tool can easily be applied in clinical practice to categorize the LNM risk and identify patients with negligible LNM risk under the assumption of maximum sensitivity.

Non-curative resection after ER for EGC patients is a clinical concern. Physicians determine further strategies under careful consideration, accounting for the patient's comorbidities associated with surgical risk and individual preference, and the characteristics of the tumor and surgical procedure. Despite additional surgery owing to non-curative resection after ER, the rate of LNM is only 5–10%; hence, among the patients with non-curative resection, it is clinically significant to identify patients at low risk of LNM to prevent unnecessary surgery. The current guidelines have been revised to address these issues and recommend a more detailed strategy after non-curative resection [1,2,4,11]. In the JGCA guidelines (5th edition), among the factors of non-curative resection, piecemeal resection or a positive lateral margin is defined as eCura C-1, and other factors are described as eCura C-2. Based on these classifications, physicians can determine the appropriate therapeutic options, such as additional ER or coagulation for patients in eCura C-1. For eCura C-2, the eCura scoring system was built based on large-scale data and stratifies LNM risk as low (0–1 point), intermediate (2–4 points), or high (5–7 points) [11,12]. In patients in the low-risk category, there is no difference in cancer recurrence or cancer-specific mortality between patients who receive no additional treatment and those who undergo additional surgery [14]. Similarly, studies investigating LNM risk in patients with early colon cancer after ER have used AI systems and clinical guidelines to prevent unnecessary surgery or excess treatment [24–27]. This reflects the necessity for detailed guidance on additional strategies through the stratification of LNM risk in EGC patients with non-curative resection after ER; therefore, this study has clinical significance.

The strength of this study is that it is the first to develop an ML model to predict LNM in patients with EGC and validate its good performance. Furthermore, our study was based on a large sample size and investigated three models (logistic regression, RF, and SVM) to develop an optimal ML model. Considering that the target participants were patients who underwent ER for EGC, the performance of the ML model was verified not only for the total patients but also the patients who received ER as the initial treatment for EGC. In our study, the very low-risk group had an LNM rate of <1%. This is a stricter category than the classifications of previous reports that defined a low risk of LNM as <3%, including nomograms and the eCura system for predicting LNM in EGC patients [11,28]. In addition to the variables included in the nomogram and the eCura system, our ML model was constructed based on various variables, including the number of tumors, tumor location, Lauren classification, perineural invasion, age, sex, gross type, tumor size, differentiation, depth of invasion, lymphatic invasion, and venous invasion [12,28]. Moreover, we utilized the ability of the ML model to comprehensively interpret various factors by subdividing the data of the variables assessed in previous reports [12,28]. For example, the depth of invasion was subdivided into the lamina propria, muscularis mucosae, SM1, and SM2/3.

We evaluated the performance of the ML model using clinically relevant outcomes. In estimating LNM risk in patients with non-curative resection after ER for EGC, achieving a high sensitivity to predict LNM is essential for long-term outcomes. Furthermore, there is a need to identify patients at low risk for LNM to prevent unnecessary surgery. Our ML model showed specificities of 49% in the total patients and 71% in the patients with initial ER at the high-sensitivity cutoff of 100%. When examining only patients with non-curative resection after ER, our ML model showed specificities ranging from 38% to 39% at the high-sensitivity cutoff of 95%, which is significantly increased compared to the specificity of 9% for the eCura system. The sensitivity of 95% was set based on the highest sensitivity achieved by the eCura system. Therefore, the ML model has great clinical potential in that it had better specificity than the eCura system at a high-sensitivity cutoff, despite there being no significant difference in the value of AUROC.

This study had several limitations. First, there may be selection bias due to the exclusion of missing data and the study's retrospective nature; however, this study was designed to develop the ML model, including major factors without missing data. Second, this was a single-center study, and the results need to be validated in other institutions. In addition, it is necessary to validate the performance of the ML model in patients undergoing non-curative resection after ER for EGC. Through this additional validation, we anticipate that the ML model can be improved by reinforcement learning and can become a valuable tool in clinical applications. Third, most of the variables included in our ML model are based on the pathology after ER. For estimation of LNM risk, several major variables, such as lymphatic invasion, vertical margin, and the depth of invasion, could not be assessed by endoscopy alone. Fourth, the comparison of long-term survival was not analyzed according to the stratification of LNM risk, as there were some cases with insufficient follow-up because the follow-up ended in March 2021.

In conclusion, the ML model showed good performance in the prediction and stratification of LNM risk in patients with EGC. Based on this finding, we suggest that the ML model has the potential to be a clinically useful tool for estimating LNM risk among patients with non-curative resection after ER.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14051121/s1, Figure S1: Performance of the ML model and eCura system for predicting LNM in patients with non-curative resection after ER. AUROC, area under the receiver operating characteristic; NRI, net reclassification index. Table S1: Best hyperparameters selected from the search algorithm.

**Author Contributions:** Study concept and design: J.-E.N. and T.-J.K.; Acquisition, analysis, or interpretation of data: J.-E.N., Y.-C.L., T.-J.K. and H.L.; Writing and drafting of the manuscript: J.-E.N., Y.-C.L., T.-J.K. and H.L.; Critical revision of the manuscript for important intellectual content: T.-J.K., H.L., H.-H.W., Y.-W.M., B.-H.M., J.-H.L., P.-L.R. and J.J.K.; Statistical analysis: Y.-C.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Samsung Medical Center (2021-09-155; 30 September 2021).

**Informed Consent Statement:** Informed consent was waived for this study due to the retrospective and observational design.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to personal privacy.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Evaluation of Computer-Aided Detection (CAD) in Screening Automated Breast Ultrasound Based on Characteristics of CAD Marks and False-Positive Marks**

**Jeongmin Lee, Bong Joo Kang \*, Sung Hun Kim and Ga Eun Park**

Department of Radiology, Seoul Saint Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea; jmlee328@gmail.com (J.L.); rad-ksh@catholic.ac.kr (S.H.K.); hoonhoony@naver.com (G.E.P.) **\*** Correspondence: lionmain@catholic.ac.kr; Tel.: +82-2-2258-6253

**Abstract:** The present study evaluated the effectiveness of computer-aided detection (CAD) system in screening automated breast ultrasound (ABUS) and analyzed the characteristics of CAD marks and the causes of false-positive marks. A total of 846 women who underwent ABUS for screening from January 2017 to December 2017 were included. Commercial CAD was used in all ABUS examinations, and its diagnostic performance and efficacy in shortening the reading time (RT) were evaluated. In addition, we analyzed the characteristics of CAD marks and the causes of false-positive marks. A total of 1032 CAD marks were displayed based on the patient and 534 CAD marks on the lesion. Five cases of breast cancer were diagnosed. The sensitivity, specificity, PPV, and NPV of CAD were 60.0%, 59.0%, 0.9%, and 99.6% for 846 patients. In the case of a negative study, it was less time-consuming and easier to make a decision. Among 530 false-positive marks, 459 were identified clearly for pseudo-lesions; the most common cause was marginal shadowing, followed by Cooper's ligament shadowing, peri-areolar shadowing, rib, and skin lesions. Even though CAD does not improve the performance of ABUS and a large number of false-positive marks were detected, the addition of CAD reduces RT, especially in the case of negative screening ultrasound.

**Keywords:** computer-aided detection; automated breast ultrasound; breast

#### **1. Introduction**

Mammographic screening has reduced the rate of breast cancer mortality [1]. Recent guidelines for screening of breast cancer recommend mammography starting at age 45 or 50 years [2,3]. Although the incidence of breast cancer in Asian women is still lower than in Western countries, morbidity and mortality continue to increase in Asian countries [4]. The peak age of breast cancer in Asian countries is 40–49 years, whereas in Western countries the peak is around 60 to 70 years [5]. Asian women tend to have breasts with higher density compared with Western women [6]. Further, dense breast is an independent risk factor for developing breast cancer [7].

Real-time B-mode ultrasonography has emerged as an alternative imaging technique for breast cancer screening [8]. Ultrasound elastography can quantify stiffness distribution of tissue lesions and complements conventional B-mode ultrasonography. The development of computer-aided diagnosis has improved the reliability of the system, whilst the inception of machine learning, such as deep learning, has further extended its power by facilitating automated segmentation and tumor classification [9].

Automated breast ultrasonography (ABUS) was proposed as a supplementary screening modality recently, for increased cancer detection combined with digital mammography (DM), especially in dense breasts [10–12]. In addition, ABUS has been proposed in the diagnostic setting in a few recent studies [13].

However, due to the large number of images in a single scan, the reading time (RT) of a full ABUS examination can be prolonged and cancers may be easily overlooked [14]. For this reason, computer-aided detection (CAD) software for ABUS has been developed to facilitate the radiological interpretation of ABUS examinations [15]. Few studies investigated the effect of commercially available CAD systems for ABUS on the RT and screening performance of breast radiologists [16]. However, before using the CAD system clinically, it is necessary to analyze the characteristics of CAD marks. It could be useful for radiologists to have knowledge about the characteristics of CAD marks and the causes of false-positive marks.

**Citation:** Lee, J.; Kang, B.J.; Kim, S.H.; Park, G.E. Evaluation of Computer-Aided Detection (CAD) in Screening Automated Breast Ultrasound Based on Characteristics of CAD Marks and False-Positive Marks. *Diagnostics* **2022**, *12*, 583. https://doi.org/10.3390/diagnostics12030583

Academic Editor: Ralph A. Bundschuh

Received: 20 January 2022 Accepted: 23 February 2022 Published: 24 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this study, we evaluated the effectiveness of computer-aided detection (CAD) system in screening automated breast ultrasound (ABUS) through diagnostic performance and reading time (RT). We also investigated and analyzed the characteristics of CAD marks and the causes of false-positive marks, to distinguish between true and false marks.

#### **2. Materials and Methods**

This retrospective study was approved by the institutional review board (IRB) of our institution. The need for informed consent was waived by the ethics committee due to the retrospective design. All procedures involving human participants were in accordance with the ethical standards of IRB issued by our institution, and assessments were carried out in accordance with the tenets of the Declaration of Helsinki of 1975, and its revision in 2013.

#### *2.1. ABUS Acquisitions*

The ABUS examinations were performed with the ACUSON S2000 Automated Breast Volume Scanner system (Siemens, Erlangen, Germany). This ABUS system acquires 3D B-mode ultrasound volume data sets of the breast covering 15.4 × 16.8 × 6 cm³ in one sweep using a mechanically driven linear array transducer (14L5). Adequate depth and focus can be obtained using predefined settings for different breast cup sizes. All ABUS examinations were performed by a single trained radiographer. To ensure coverage of the entire breast, three overlapping acquisitions including antero-posterior, medial, and lateral views were performed. The scan thickness was displayed at 1 mm intervals without overlap. A dedicated ABUS workstation was used to reconstruct the transverse slices into a 3D volume that can be read in a multiplanar hanging, with sagittal and coronal reconstructions.

#### *2.2. CAD System*

A prototype workstation was designed and developed specifically for high-throughput ABUS screening in this observer study (MeVis Medical Solutions, Bremen, Germany). In this prototype, each user action was logged with timestamps, which were subsequently used to estimate the time spent per case. The workstation was integrated with a commercially developed CAD software (QVCAD, Qview Medical Inc., Los Altos, CA, USA), which is designed to detect suspicious candidate regions in an ABUS volume highlighted with the so-called CAD marks (Figure 1).

In addition, the QVCAD software provides an "intelligent" minimum intensity projection (MinIP) of the breast tissue in a 3D ABUS volume that can be used for rapid navigation through ABUS scans for enhancement of the possible suspicious regions. The CAD-based MinIP integrated with a multiplanar hanging protocol for ABUS displays the conventional ABUS planes. By clicking on the dark spot, the 3D multiplanar hanging automatically snaps to the corresponding 3D location. The crosshair is focused on a breast lesion that is marked by the CAD software with a green circular marker. The same lesion is also enhanced and visualized as a dark spot in the MinIP. A screenshot of the CAD-aided reading environment is presented in Figure 1.

**Figure 1.** Screening automated breast ultrasound (ABUS) of a 44-year-old woman shows a true-positive mark. (**a**) Computer-aided detection (CAD)-based minimum intensity projection (MinIP) of an ABUS scan of the antero-posterior (AP), medial, and lateral sides of the left breast. There is one dark spot (arrows) with a green circle. (**b**) The lesion showing a dark spot with a green circle laterally on the left breast confirms invasive ductal carcinoma.

The number of CAD markers displayed per ABUS volume could be adjusted by changing the values of the false-positive rate (FPR) in the configuration setting of the CAD software. According to the manual from the manufacturer, FPR was defined as the total number of false-positive CAD markers in non-cancer volumes divided by the total number of non-cancer volumes. In this study, we set the FPR to 0.2 (i.e., 1 false-positive CAD marker in non-cancer volume per 5 non-cancer volumes), which was its default setting as in previous studies [16–18].

#### *2.3. Study Design*

The study included a total of 846 women aged 40–49 years who underwent ABUS screening from January 2017 to December 2017. The CAD (QVCAD™) system was used in all ABUS examinations and its diagnostic performance was evaluated retrospectively.

We evaluated glandular tissue component (GTC), which was classified as minimal (<25% of the fibroglandular tissue (FGT)), mild (25–49% of the FGT), moderate (50–74% of the FGT), or marked (≥75% of the FGT) in each woman based on bilateral breast images [19].

We analyzed whether the addition of CAD shortened the RT. The RT was determined by the expert breast radiologists based on their subjective perception in each of the following cases: (1) CAD with ABUS = ABUS only, (2) CAD with ABUS > ABUS only, (3) CAD with ABUS < ABUS only. We considered there to be a difference when RT was shortened by more than 1 min.

Furthermore, we analyzed the characteristics of CAD marks including the size of the marked lesion, lesion type (mass or non-mass), tissue composition under ultrasound, and the causes of false-positive marks. The false-positive mark was defined as the mark located on the typical benign lesion or pseudo-lesions that require no additional studies following ABUS. The number of marks per patient and per lesion and the frequency of false-positive marks were also evaluated.

Two board-certified expert breast radiologists determined the characteristics of CAD marks based on consensus. In addition, the pseudo-lesions were also evaluated by two expert breast radiologists with consensus. The characteristics of pseudo-lesions were analyzed including the number, size, and location (right or left; antero-posterior, medial or lateral; upper, mid, or lower; inner, mid, or outer).

All women with suspicious lesions were recalled and US-guided 14G core-needle biopsy was performed. Patients who were not disease-positive were followed up in 2 years with radiologic examination using mammography or ultrasonography.

#### **3. Results**

A total of 846 women participated in the study, and the median age at enrollment was 44 years (mean age ± standard deviation = 43.9 ± 3.0 years). Based on ABUS screening, five breast cancers were diagnosed pathologically over a two-year follow-up (Figure 1). The sensitivity, specificity, PPV, NPV, and accuracy of CAD for cancer detection were 60.0%, 59.0%, 0.9%, 99.6% and 59.0%, respectively, for 846 patients, while those values for 1032 CAD marks were 60.0%, 48.3%, 0.6%, 99.6%, and 48.4%, respectively.
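The per-patient metrics follow directly from a 2 × 2 table. The sketch below recomputes them from one set of counts that is consistent with the reported percentages; these counts (3 true positives, 2 false negatives, 345 false positives, 496 true negatives) are back-calculated assumptions for illustration and are not stated in the text.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from a 2x2 table, returned as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
        "accuracy": 100 * (tp + tn) / (tp + fp + fn + tn),
    }

# Assumed counts (not reported in the text): 846 patients, 5 cancers,
# 3 of which carried a CAD mark, and 345 marked non-cancer patients.
metrics = screening_metrics(tp=3, fp=345, fn=2, tn=496)
rounded = {name: round(value, 1) for name, value in metrics.items()}
```

With these assumed counts, the rounded values reproduce the per-patient figures quoted above (60.0%, 59.0%, 0.9%, 99.6%, and 59.0%), which also shows why the PPV is so low: only 3 of 348 CAD-positive patients had cancer.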

Based on the lesion type detected, mass lesions outnumbered non-mass lesions (60 vs. 11). Based on tissue composition under ultrasound, the number of minimal-to-mild cases in GTC was higher than the number of moderate-to-marked cases (668 vs. 178). The rate of CAD positivity in moderate-to-marked cases was higher than in minimal-to-mild cases. Table 1 summarizes the screening performance of CAD for ABUS per patient and per lesion.

**Table 1.** Sensitivity (SEN), specificity (SPE), positive predictive value (PPV), negative predictive value (NPV), and accuracy per patient and per computer aided detection (CAD) mark.



