**Comparative Multicentric Evaluation of Inter-Observer Variability in Manual and Automatic Segmentation of Neuroblastic Tumors in Magnetic Resonance Images**

**Diana Veiga-Canuto 1,2,\*, Leonor Cerdà-Alberich 1, Cinta Sangüesa Nebot 2, Blanca Martínez de las Heras 3, Ulrike Pötschger 4, Michela Gabelloni 5, José Miguel Carot Sierra 6, Sabine Taschner-Mandl 4, Vanessa Düster 4, Adela Cañete 3, Ruth Ladenstein 4, Emanuele Neri <sup>5</sup> and Luis Martí-Bonmatí 1,2**


**Simple Summary:** Tumor segmentation is a key step in oncologic imaging processing and is a time-consuming process usually performed manually by radiologists. To facilitate it, there is growing interest in applying deep-learning segmentation algorithms. Thus, we explore the variability between two observers performing manual segmentation and use the state-of-the-art deep learning architecture nnU-Net to develop a model to detect and segment neuroblastic tumors on MR images. We were able to show that the variability between nnU-Net and manual segmentation is similar to the inter-observer variability in manual segmentation. Furthermore, we compared the time needed to manually segment the tumors from scratch with the time required for the automatic model to segment the same cases, followed by human validation with manual adjustment when needed.

**Abstract:** Tumor segmentation is one of the key steps in imaging processing. The goals of this study were to assess the inter-observer variability in manual segmentation of neuroblastic tumors and to analyze whether the state-of-the-art deep learning architecture nnU-Net can provide a robust solution to detect and segment tumors on MR images. A retrospective multicenter study of 132 patients with neuroblastic tumors was performed. The Dice Similarity Coefficient (DSC) and Area Under the Receiver Operating Characteristic Curve (AUC ROC) were used to compare segmentation sets. Two further metrics were elaborated to understand the direction of the errors: a modified version of the False Positive Rate (FPRm) and the False Negative Rate (FNR). Two radiologists manually segmented 46 tumors and a comparative study was performed. nnU-Net was trained-tuned with 106 cases divided into five balanced folds to perform cross-validation. The five resulting models were used as an ensemble solution to measure training (n = 106) and validation (n = 26) performance, independently. The time needed by the model to automatically segment 20 cases was compared to the time required for manual segmentation. The median DSC for the manual segmentation sets was 0.969 (±0.032 IQR). The median DSC for the automatic tool was 0.965 (±0.018 IQR). The automatic segmentation model achieved a better performance regarding the FPRm. MR image segmentation variability is similar between radiologists and nnU-Net. The time saved when using the automatic model with subsequent visual validation and manual adjustment corresponds to 92.8%.

**Citation:** Veiga-Canuto, D.; Cerdà-Alberich, L.; Sangüesa Nebot, C.; Martínez de las Heras, B.; Pötschger, U.; Gabelloni, M.; Carot Sierra, J.M.; Taschner-Mandl, S.; Düster, V.; Cañete, A.; et al. Comparative Multicentric Evaluation of Inter-Observer Variability in Manual and Automatic Segmentation of Neuroblastic Tumors in Magnetic Resonance Images. *Cancers* **2022**, *14*, 3648. https://doi.org/10.3390/cancers14153648

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 25 May 2022 Accepted: 26 July 2022 Published: 27 July 2022



**Keywords:** tumor segmentation; neuroblastic tumors; deep learning; manual segmentation; automatic segmentation; inter-observer variability

#### **1. Introduction**

Neuroblastic tumors are the most frequent extracranial solid cancers in children. They comprise ganglioneuroma, ganglioneuroblastoma and neuroblastoma. Ganglioneuroma is composed of gangliocytes and mature stroma and is the most benign. Ganglioneuroblastoma is formed by mature gangliocytes and immature neuroblasts and has an intermediate malignant potential [1]. The most frequent type is neuroblastoma, which is more immature and undifferentiated. It is a heterogeneous neoplasm that shows different behavior based on biological, clinical and prognostic features: some tumors undergo spontaneous regression, while others progress with fatal outcomes despite therapy [2]. Neuroblastic tumors show a wide range of variability in their position. The most common sites of origin are the adrenal region (48%), extra-adrenal retroperitoneum (25%) and the chest (16%), followed by the neck (3%) and the pelvis (3%) [3]. Furthermore, they show high variability in size, shape and boundaries, making it challenging to differentiate them from neighboring structures.

Tumor diagnosis, prognosis and decisions on treatment and disease management are mainly based on information obtained from imaging, including magnetic resonance (MR) imaging [4]. Additionally, multiparametric data, radiomic features and imaging biomarkers can provide the clinician with relevant information for disease diagnosis, characterization, and evaluation of aggressiveness and treatment response [5].

In order to ensure the best usability of imaging, it is essential to develop a robust and reproducible imaging processing pipeline. One of the most relevant steps is segmentation, which consists of placing a Region of Interest (ROI) on a specific area (e.g., a tumor), with the assignment and labeling of the voxels in the image that correspond to the ROI. Tumor segmentation can be performed in three different ways: manual, semiautomatic and automatic. Manual segmentation is usually performed by an experienced radiologist. This is usually done slice-by-slice, but is also possible in 3D, with the expert either encircling the tumor or annotating the voxels of interest. This is a reliable but time-consuming method that hinders the radiologists' workflow, especially in cases of mass data processing. Moreover, manual segmentation is observer-dependent and may show wide inter- and intra-observer variability [6,7]. This variability is influenced by objective factors, such as organ/tumor characteristics or contour, and by subjective factors related to the observer, such as their expertise or coordination skills [6].

Semiautomatic segmentation tries to solve some of the problems related to manual segmentation [8]. By assisting the segmentation with algorithms, for example, by growing the segmentation over a region or expanding the segmentation to other slices to eliminate the need for a slice-by-slice segmentation, the effort and time required from the user can be reduced. However, inter-observer variability is still present, as the manual part of the segmentation and the settings of the algorithm influence the result.

In the case of neuroblastic tumors, several studies have explored the development of semiautomatic segmentation algorithms. They have been performed on Computed Tomography (CT) or MR images, making use of mathematical morphology, fuzzy connectivity and other imaging processing tools [9–11]. However, these studies included very few cases, and their findings show little improvement with respect to manual approaches. To the best of our knowledge, a robust and generalizable solution for neuroblastoma segmentation has not yet been devised.

Nowadays, the most advanced tools are built as automatic segmentation methods, which, by definition, do not rely on user interaction. These solutions are built with deep-learning segmentation algorithms [12], usually based on convolutional neural networks (CNNs). CNNs apply sequences of convolution and pooling operations to extract features and recognize patterns, learning directly from the images themselves during training [13]. One of the most commonly used is the U-Net architecture, consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization, which achieves very good performance in the segmentation of different types of cancer [14,15]. Nevertheless, its applicability to specific image analysis tasks and its reproducibility across different structures or lesions has been observed to be limited [16].

Recently, a new solution based on CNNs algorithms called nnU-Net has been proposed. It consists of an automatic deep learning-based segmentation framework that automatically configures itself, including preprocessing, network architecture, training and postprocessing, and adapts to any new dataset, surpassing most existing approaches [16,17].

The aim of this study was to assess the inter-observer variability in manual and automatic segmentation of neuroblastic tumors. We hypothesize that the state-of-the-art deep learning framework nnU-Net can be used to automatically detect and segment neuroblastic tumors on MR images, providing a more robust, generalizable and less error-prone solution than the manual segmentation process. This comparison is performed by evaluating the inter-observer variability between two radiologists. The automatic segmentation model is trained, fine-tuned and validated with cases from different European institutions and then compared to manual segmentation. Expert tumor delineation was performed beforehand, as no open-access annotated data set dedicated to this specific tumor exists.

The automatic segmentation model is then applied to a group of patients from the training set. The time needed for the automatic segmentation (with manual adjustment when necessary) is compared to the time required to manually segment the same cases from scratch.

#### **2. Materials and Methods**

#### *2.1. Participants*

A retrospective, international, multicenter study of 132 pediatric patients with neuroblastic tumors who had undergone a diagnostic MR examination was conducted.

All patients had received a diagnosis of neuroblastic tumor with pathological confirmation between 2002 and 2021. Patients and MR data were retrospectively obtained from three centers: Spain (n = 73, La Fe University and Polytechnic Hospital, including 21 cases from the European Low and Intermediate Risk Neuroblastoma Protocol (LINES) clinical trial), Austria (n = 57, Children's Cancer Research Institute, from the SIOPEN High Risk Neuroblastoma Study (HR-NBL1/SIOPEN), with a current accrual of over 3000 patients from 12 countries) and Italy (n = 4, Pisa University Hospital). The study had the corresponding institutional Ethics Committee approvals from all involved institutions. This data set was collected within the scope of the PRIMAGE (PRedictive In-silico Multiscale Analytics to support cancer personalized diaGnosis and prognosis, empowered by imaging biomarkers) project [5]. Age at first diagnosis was 37.6 ± 39.3 months (mean ± standard deviation; range 0 to 252 months; median 24.5 months, interquartile range (IQR) 54), with a slight female predominance (70 females, 62 males).

Histology of the tumor was neuroblastoma (104 cases), ganglioneuroblastoma (18 cases) and ganglioneuroma (10 cases). Tumor location was classified as abdominopelvic (105 cases, 59 of them from the adrenal gland, 32 with abdominal non-adrenal location and 14 with a pelvic location) or cervicothoracic (27 cases, 18 of them thoracic, 2 with an exclusive cervical location and 7 affecting both thoracic and cervical regions).

Imaging data from the SIOPEN clinical trials (HR-NBL1 and LINES) were collected and centrally stored on an Image Management Server maintained by the Austrian Institute of Technology (AIT) in order to be properly pseudonymized with the European Unified Patient Identity Management (EUPID) system [18], enabling privacy-preserving record linkage and a secure data transition to the PRIMAGE context. Other imaging data not coming from a SIOPEN trial received a EUPID pseudonym through direct upload to the PRIMAGE platform. All collected images have been stored in the PRIMAGE platform to be used for further investigation.

The MR images accounted for a high data acquisition variability, including different scanners, vendors and protocols from the different institutions. MR images were acquired with either a 1.5 T (n = 116) or 3 T (n = 16) scanner, manufactured by General Electric Healthcare (Signa Excite HDxt, Signa Explorer) (n = 51), Siemens Medical (Aera, Skyra, Symphony, Avanto) (n = 54) or Philips Healthcare (Intera, Achieva, Ingenia) (n = 27). The MR protocol varied among the institutions. Essentially, MR studies consisted of T1-weighted (T1w), T2-weighted (T2w) and/or T2w with fat suppression (T2w fat-sat), diffusion-weighted (DW) and dynamic contrast-enhanced (CET) sequences. Chest images were acquired with respiratory synchronization. Mean FOV size was 410 mm, and median FOV was 440 mm (range of 225 to 500 mm).

#### *2.2. Manual Image Labeling*

Tumor segmentation was performed on the transversal T2w fat-sat images as they yield the maximum contrast between the tumor and the surrounding organs (48 cases). T2w images were used when T2w fat-sat images were not available (84 cases). All images were obtained in DICOM format. The open source ITK-SNAP (version 3.8.0) (www.itksnap.org) tool [19] was used for the manual tumor segmentation by two radiologists (with 30 (Radiologist 1) and 5 (Radiologist 2) years of experience in pediatric radiology, respectively) with prior experience with manual segmentation tools. All the tumors (132 cases) were manually segmented by Radiologist 2. For the inter-observer variability study, 46 cases were independently segmented by both radiologists after agreement on the best tumor definition criteria. To increase reproducibility, a restrictive segmentation methodology was established, excluding doubtful peripheral areas. If the tumor contacted or encased a vessel, the vessel was excluded from the segmentation. When the tumor infiltrated neighboring organs with ill-defined margins, the DWI and CE images were reviewed to exclude nontumoral areas. Lymph nodes separated from the tumor and metastases were also excluded. Each of the readers performed a blinded annotation of all the cases independently. Finally, the obtained segmentation masks were exported in NIfTI format (nii) and were considered the ground truth ROIs.

Tumor volume was obtained from the 132 masks performed by Radiologist 2. The median volume of all the masks was 116,518 mm<sup>3</sup> (±219,084 IQR) and the mean volume was 193,634 mm<sup>3</sup>.

#### *2.3. Study Design and Data Partitioning*

Our study consisted of two parts (Figure 1). Firstly, the inter-observer variability in manual MR segmentation was analyzed by comparing the performance of two radiologists in 46 cases of neuroblastic tumor. Secondly, the training and validation of the automatic segmentation model based on the nnU-Net architecture were performed, dividing the dataset into two cohorts: training-tuning and validation. A balanced and stratified split of the cases between both cohorts was implemented to eliminate sampling bias and to guarantee the heterogeneity of both datasets in order to construct a reproducible and universal model. Stratified sampling with the scikit-learn library [20] was used, considering four variables: manufacturer (Siemens/Philips/GE), magnetic field strength (1.5 T/3 T), tumor location (abdominopelvic/cervicothoracic) and segmented sequence (T2w/T2w fat-sat) (Table 1).

A first cohort (80% of cases, n = 106) was selected to train and fine-tune the model with a 5-fold cross-validation approach. A second cohort (20% of patients, n = 26) was used for validation.
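As an illustration of this balanced partitioning, the following is a minimal Python sketch of an 80/20 split stratified on the four variables. The paper confirms that scikit-learn was used, but the exact function is not stated; the case table below is synthetic and only mirrors the reported cohort composition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical case table mirroring the reported cohort composition;
# real values would come from the DICOM headers and clinical records.
df = pd.DataFrame({
    "case_id": range(132),
    "manufacturer": ["Siemens", "Philips", "GE"] * 44,
    "field": ["1.5T"] * 116 + ["3T"] * 16,
    "location": ["abdominopelvic"] * 105 + ["cervicothoracic"] * 27,
    "sequence": ["T2w"] * 84 + ["T2w_fatsat"] * 48,
})

# Composite label so the split is stratified on all four variables at once.
strata = df[["manufacturer", "field", "location", "sequence"]].agg("_".join, axis=1)

train_ids, val_ids = train_test_split(
    df["case_id"], test_size=26, stratify=strata, random_state=0
)
print(len(train_ids), len(val_ids))  # 106 training-tuning cases, 26 validation cases
```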

#### *2.4. Convolutional Neural Network Architecture*

The automatic segmentation model was developed using the state-of-the-art, self-configuring framework for medical segmentation, *nnU-Net* [16]. All the images were resampled to a new voxel spacing of [z, x, y] = [8, 0.695, 0.695], corresponding to the average values within the training data set. The model training was performed along 1000 epochs with 250 iterations each and a batch size of 2. The loss function optimized at each iteration was based on the Dice Similarity Coefficient (DSC). A z-score normalization was applied to the images.
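As a minimal sketch of the preprocessing described above (nnU-Net performs these steps internally and automatically; the helper functions here are illustrative, assuming volumes stored as NumPy arrays with known voxel spacing):

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING = (8.0, 0.695, 0.695)  # [z, x, y], dataset average

def resample(volume, spacing, new_spacing=TARGET_SPACING):
    """Resample a 3D volume to the target voxel spacing (linear interpolation)."""
    factors = [old / new for old, new in zip(spacing, new_spacing)]
    return zoom(volume, factors, order=1)

def zscore(volume):
    """Z-score intensity normalization, as applied to the MR images."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

# Example usage: vol = zscore(resample(vol, spacing=(6.0, 0.5, 0.5)))
```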

**Figure 1.** Study design. The first part consisted of the manual segmentation variability analysis, comparing the performance of two radiologists (n = 46). The second part included the training and validation of the nnU-Net using 132 cases manually segmented by Radiologist 2. Training-tuning with cross-validation was performed. The 5 segmentation models obtained with the cross-validation method were used as an ensemble solution to test all the cases of the training-tuning (n = 106) and validation (n = 26) data sets in order to measure training and validation performance independently. Prior to this, the cases were split into balanced groups considering vendor, magnetic field strength, location and segmented sequence.

The model employed a 3D net and was trained with a cross-validation strategy, a statistical technique frequently used to estimate the skill of a machine learning model on unseen data [21]. The training-tuning dataset (n = 106) was partitioned into 5 subsets or folds of 21 or 22 non-overlapping cases each. Each of the 5 folds was given an opportunity to be used as a held-back test set, whilst all other folds collectively were used as the training dataset. A total of 5 models were fit and evaluated on the 5 hold-out test sets, and performance metrics were reported. Median and IQR were reported, as the distribution of the results was not normal in all cases; confidence intervals (CI) were also calculated.

Additionally, the 5 resulting segmentation models obtained using the cross-validation method were used as an ensemble solution to test all the cases of the training-tuning (n = 106) and the validation (n = 26) data sets in order to measure training and validation performance independently.
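nnU-Net builds its ensemble by averaging the softmax probability maps of the fold models; the following is a simplified sketch of that idea (the threshold value is an assumption):

```python
import numpy as np

def ensemble_prediction(prob_maps, threshold=0.5):
    """Average the tumor-probability maps predicted by the 5 fold models
    for one case and threshold the mean to get the final binary mask."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return mean_prob >= threshold
```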


**Table 1.** Composition of validation dataset (20% of cases, n = 26), considering four variables for a balanced split: vendor, magnetic field strength, location and segmented sequence.

#### *2.5. Analysis and Metrics*

To compare segmentation results, different metrics have been described in the literature. The main metric used in this study for the evaluation of results was the DSC [22], a spatial overlap index and a reproducibility validation metric [23]. Its value ranges from 0 (no spatial overlap between two sets) to 1 (complete overlap). The DSC has been widely used to calculate the overlap between segmentation results and the ground truth, and is defined as [24,25]:

$$\text{DSC} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}}$$
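Computed voxel-wise from two binary masks, the DSC can be implemented as follows (a minimal NumPy sketch, not the evaluation code used in the study):

```python
import numpy as np

def dsc(gt, pred):
    """Dice Similarity Coefficient between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)   # true positive voxels
    fp = np.sum(~gt & pred)  # false positive voxels
    fn = np.sum(gt & ~pred)  # false negative voxels
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks overlap fully
```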

The ROC AUC metric was also calculated. The ROC curve, as a plot of sensitivity against 1-specificity, normally assumes more than one measurement. For the case where a test segmentation is compared to a ground truth segmentation, we consider a definition of the AUC as [26]:

$$\text{AUC} = 1 - \frac{1}{2} \left( \frac{\text{FP}}{\text{FP} + \text{TN}} + \frac{\text{FN}}{\text{FN} + \text{TP}} \right)$$

Since metrics have different properties, selecting a suitable one is not a trivial task, and therefore a wide range of metrics have been previously developed and implemented to approach 3D image segmentation [26]. For our study, two spatial overlap-based metrics were specifically designed to gain a deeper understanding of the direction of the errors encountered: the false positive (FP) and false negative (FN) rates, independently, with respect to the ground truth, which consisted of the manual segmentation performed by Radiologist 2 (Figure 2).

**Figure 2.** The ground truth (true positive and false negative voxels) corresponds to the manual segmentation performed by Radiologist 2, which was compared firstly to the manual segmentation performed by Radiologist 1 and then to the automatic segmentation obtained by the automatic segmentation model (non-ground truth mask, true positive and false positive voxels). The FPRm considered those voxels that were identified by the model as tumor but corresponded to other structures, divided by the voxels that actually corresponded to the ground truth mask. The FNR measured those voxels belonging to the tumor that the model did not include as such, divided by the ground truth voxels.

The rate of FP of the automatic segmentation to the ground truth (modified version of FPR) considered those voxels that were identified by the net as tumor but corresponded to other structures, divided by the voxels that actually corresponded to the ground truth mask (TP + FN voxels). This definition differs from the FPR used in standard statistical problems in the exclusion of the true negative (TN) term from the mathematical expression, as the TN voxels correspond to the image background in a segmentation task and not to the segmentation masks intended to be compared.

$$\text{FPRm} = \frac{\text{FP}}{\text{TP} + \text{FN}}$$

The rate of FN of the automatic segmentation to the ground truth (FNR) measured voxels belonging to the tumor that the net did not include as such, divided by the ground truth voxels.

$$\text{FNR} = \frac{\text{FN}}{\text{TP} + \text{FN}} = 1 - \text{Sensitivity (Recall)}$$

For consistency and to facilitate the interpretation of the results, the FPRm and FNR metrics are reported as their complements (1 − FPRm and 1 − FNR), so that a value of 1 indicates complete voxel-wise agreement and a value of 0 indicates no similarity.
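A sketch pulling the above definitions together for one pair of masks (again illustrative; the function and variable names are assumptions):

```python
import numpy as np

def direction_metrics(gt, pred):
    """Return 1-FPRm, 1-FNR and the single-measurement AUC
    for a predicted mask against the ground-truth mask."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    tn = np.sum(~gt & ~pred)
    fprm = fp / (tp + fn)              # modified FPR: no TN term
    fnr = fn / (tp + fn)               # 1 - sensitivity (recall)
    auc = 1 - 0.5 * (fp / (fp + tn) + fn / (fn + tp))
    return 1 - fprm, 1 - fnr, auc      # complements: 1 = full agreement
```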

#### *2.6. Time Saving*

For comparing the time savings, the final automatic segmentation model was applied to 20 cases from the training set, corresponding to 4 cases per fold, to account for the heterogeneity of the data set. Cases were independently segmented manually from scratch by Radiologist 2, and the mean time (in minutes) necessary to perform that task was compared to the mean time required to obtain the masks with the automatic model. As some variability may exist in the final automatic masks, a human-based validation by Radiologist 2 was performed, and the mean time required to visually validate and manually edit the resulting automatic masks (when needed) was compared to the time necessary to manually segment them from scratch. To remove a potential software-related bias, the open-source ITK-SNAP tool [19] was used for both the manual and the automatic-correction approaches.

#### **3. Results**

#### *3.1. Inter-Observer Comparison for Manual Segmentation*

The segmentation results obtained by Radiologist 1 were compared to those of Radiologist 2 to measure inter-observer variability (Table 2). The median DSC was found to be 0.969 (±0.032 IQR). The median FPRm was 0.939 (±0.063 IQR), indicating a high concordance between both radiologists with regard to the non-tumor voxels included. The median FNR was 0.998 (±0.008 IQR), meaning that Radiologist 1 missed virtually no tumor during the segmentation. The AUC ROC was 0.998 (Figure 3).

**Table 2.** Inter-observer variability. Performance metrics for inter-observer comparison for manual segmentation, considering DSC, AUC ROC, 1-FPRm and 1-FNR.


**Figure 3.** Comparison of two cases segmented by Radiologist 1 (red label) and Radiologist 2 (pink label), with mask superposition for comparison. Case 1 was segmented on T2w, while Case 2 was segmented on T2w fat-sat. In both cases, the DSC was 0.957.

#### *3.2. Comparison between Radiologist and nnU-Net*

As the 106 cases of the training group were divided into five folds of 21 or 22 cases to perform cross-validation, each fold achieved different DSC results (Table 3) (Figures 4 and 5).


**Table 3.** Performance metrics for comparison between nnU-Net and Radiologist 2. Cases were divided into 5 folds to perform cross-validation. DSC, AUC ROC, 1-FPRm and 1-FNR for each fold are described.

**Figure 4.** Original transversal and coronal MR images and examples of three cases automatically segmented by nnU-Net (blue label) and Radiologist 2 (pink label), with mask superposition for comparison. Case 1 was segmented on T2w fat-sat with a DSC of 0.869. Case 2 was segmented on T2w with a DSC of 0.954. Case 3 was segmented with a DSC of 0.617.

**Figure 5.** Box plots depicting the whole set of DSC for each fold of the training group and validation set.

Of the 106 cases, 27 had a DSC value < 0.8: folds 0, 2 and 4 had 6 cases each; folds 1 and 3 had 5 cases each. The mean age for these cases was 32.7 ± 30.3 months and the median age was 19.8 months. They had a median volume of 75,733 mm<sup>3</sup> (±42,882 IQR).

Of these 27 cases, 8 had a DSC from 0 to 0.19 (in all of them, the DSC was actually < 0.01); 3 cases had a DSC ≥ 0.2 to 0.39; 1 case had a DSC ≥ 0.4 to 0.59; and 11 cases had a DSC ≥ 0.6 to 0.8. Cases showing high variability (DSC < 0.8) after automatic segmentation were analyzed by Radiologist 2 to identify the reasons for the low level of agreement. Regarding the eight cases with DSC < 0.01, the net had segmented extensive lymph nodes instead of the primary tumor in three cases. In another two cases, the net segmented other structures instead of the tumor (the gallbladder or the left kidney). In the remaining three cases, the net did not identify any structure in the original DICOM images and thus did not produce any mask.

Of the remaining 19 cases with a DSC < 0.8, 18 cases showed differences as the net localized the tumor well but did not completely segment it or presented variability in the borders, especially in cases with surrounding lymph nodes. One case had bilateral tumors and the net only detected one of them.

Subsequently, the five segmentation models obtained with the cross-validation method were used as an ensemble solution to test all the cases of the training-tuning set (n = 106). We obtained a median DSC of 0.965 (±0.018 IQR) and an AUC ROC of 0.981. The FPRm for this ensemble solution was 0.968, and the FNR was 0.963. To compare mean DSC values according to location (abdominopelvic or cervicothoracic) and magnetic field strength (1.5 or 3 T) (Table 4), an ANOVA test was performed, showing no differences in mean DSC for either the location or the magnetic field factor; the results were unchanged after identifying and removing atypical values.

**Table 4.** The 5 resulting segmentation models obtained using the cross-validation method were used as an ensemble solution to test all the cases of the training-tuning (n = 106). Performance metrics for the final results are described. Results are detailed according to location (abdominopelvic or cervicothoracic) and magnetic field strength (1.5 T or 3 T).


When introducing age and volume as corrective factors in the evaluation of the effects of location and magnetic field on the DSC, no differences were observed in the results of the analyses. Age and volume have no significant effect and do not show any trend in the DSC (*p*-value = 0.052 for age and 0.169 for volume). Therefore, the effects of location and magnetic field, as well as their interaction, remain insignificant when the correction for age and volume is introduced.
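An analysis of this kind can be reproduced along the following lines with statsmodels (a sketch run on synthetic data, since the per-case table is not published; all column names are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the per-case results table (illustration only).
rng = np.random.default_rng(0)
n = 106
df = pd.DataFrame({
    "dsc": rng.uniform(0.6, 1.0, n),
    "location": rng.choice(["abdominopelvic", "cervicothoracic"], n),
    "field": rng.choice(["1.5T", "3T"], n),
    "age_months": rng.uniform(0, 252, n),
    "volume_mm3": rng.uniform(1e4, 5e5, n),
})

# Two-way ANOVA on DSC with interaction, plus age and volume as covariates.
model = smf.ols("dsc ~ C(location) * C(field) + age_months + volume_mm3", data=df).fit()
print(anova_lm(model, typ=2))
```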

Focusing on the direction of the errors between both sets (ground truth vs. automatic segmentation), the median FPRm is 0.968 (±0.015 IQR), meaning that the mask includes as tumor 3.2% of voxels that are not included in the ground truth. The median FNR is 0.963 (±0.021 IQR), so the automatic tool does not include 3.7% of the voxels included in the ground truth mask.

#### *3.3. Validation*

The validation was performed at the end of the model development to test for model overfitting, which could result in an overestimation of the model performance. The median DSC for validation was 0.918 (±0.067 IQR) and the AUC ROC was 0.968 (Table 5) (Figure 5).

**Table 5.** Performance metrics for the validation cohort results (n = 26) considering DSC, AUC ROC, 1-FPRm and 1-FNR. Results for Radiologist 2 vs. automatic model are shown. To compare these results to inter-radiologist agreement, Radiologist 1 segmented the 26 cases from the validation dataset and comparisons with Radiologist 2 and to the automatic model were made.


Of the 26 cases in the validation dataset, 4 had a DSC value < 0.8: 3 cases had a DSC ≥ 0.4 to 0.59, and 1 case had a DSC ≥ 0.6 to 0.8. These cases were analyzed by Radiologist 2 to identify the reasons for the low level of agreement. Regarding the three cases with DSC < 0.6, the net had segmented extensive lymph nodes besides the primary tumor in two cases and identified only a part of the tumor in one case. In the case with a DSC ≥ 0.6 to 0.8, the net segmented lymph nodes besides the primary tumor.

To compare the validation results to the inter-radiologist agreement, Radiologist 1 manually segmented the cases from the validation dataset. We compared the segmentation of Radiologist 1 to the segmentations of Radiologist 2 and the automatic model (Table 5).

To compare the time savings, we compared the mean time needed to manually segment 20 cases (418 slices) from scratch with the mean time required by the automatic model to segment them. Manual segmentation required a mean time of 56 min per case, while the mean time needed to obtain each mask with the automatic model was 10 s (0.167 min), resulting in a time reduction of 99.7%.

As some variability may exist in the final automatic masks, a human-based visual validation of the masks was performed. All the segmentations were visually validated, and manual editing and adjustment of the automatic masks were performed when needed (12 cases were edited, including 92 slices). The mean time to perform these processes was 4.08 min (±2.35 SD) and the median time was 4 min. This was compared to the time necessary to manually segment the masks from scratch, showing a time reduction of 92.8%.
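Using the rounded figures reported above, the two reductions can be verified directly (the small difference from the published 92.8% is a rounding effect of the reported means):

```python
manual = 56.0          # mean minutes per case, manual segmentation from scratch
automatic = 10.0 / 60  # ~10 s per case for the automatic model
validated = 4.08       # mean minutes for visual validation plus editing

print(f"{1 - automatic / manual:.1%}")  # ~99.7% reduction
print(f"{1 - validated / manual:.1%}")  # ~92.7%, reported as 92.8%
```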

#### **4. Discussion**

The inter-observer variability when performing manual segmentation of neuroblastic tumors on T2w MR images indicates that there is a high concordance between observers (median DSC overlap index of 0.969 (±0.032 IQR)). The discrepancies between observers may be due to the heterogeneous nature of neuroblastic tumors and to the intrinsic variability of manual segmentation related to individual skills and level of attention to detail. In our study, both observers were pediatric radiologists with previous experience in segmentation tasks. Radiologist 2 was considered the ground truth for practical reasons, as the segmentation of the whole dataset had been performed by this observer. Nevertheless, the manual ground truth mask may itself have some errors that are intrinsically associated with the human-based segmentation methodology. Joskowicz et al. [6] investigated the variability in manual delineations on CT for liver tumors, lung tumors, kidney contours and brain hematomas between 11 observers, and concluded that inter-observer variability is large and that even two or three observers may not be sufficient to establish its full range. Montagne et al. [27] compared the inter-observer variability for prostate segmentation on MR performed by seven observers and concluded that variability is influenced by changes in prostate morphology. Therefore, delineation volume overlap variability across different structures and observers is large [28].

In our study, expert tumor delineation served as the best (although not perfect) approximation of the truth. The evaluation of the voxel-wise similarity between the ground truth and the automatically segmented mask demonstrates that the state-of-the-art deep learning architecture nnU-Net can be used to detect and segment neuroblastic tumors on MR images, with a median DSC of 0.965 (±0.018 IQR), achieving a strong performance and surpassing the methods and results obtained in previous studies that approached the problem of neuroblastoma segmentation [9–11]. To date, no previous literature has demonstrated the performance of a CNN-based solution in neuroblastic tumors. nnU-Net sets a new state of the art in various semantic segmentation challenges and displays strong generalization characteristics for other structures [16,17]. Our results suggest that this automatic segmentation tool introduces a variability equivalent to that observed in the manual segmentation process in neuroblastic tumors. Previous studies on breast tumors showed that segmentation algorithms may improve on manual variability [29].

When analyzing the direction of the errors in a tumor segmentation problem, our recommendation is to give more relevance to the FPRm, aiming to minimize the included FP voxels with respect to the ground truth, as this metric represents those voxels that belong to adjacent organs or structures, which could introduce a strong bias in the extraction of quantitative imaging features for the development of radiomics models. The influence of FN in radiomics models seems less important, as it may not have a significant impact if some peripheral tumor voxels are missed. The effect of manual inter-observer segmentation variability on MR-based radiomics feature robustness has been described previously in other tumors such as breast cancer [30].

When assessing the FPRm and FNR between the manual segmentations performed by the two radiologists, the median FPRm is 0.939 (±0.063 IQR), indicating that 6.1% of the voxels were misclassified as tumor, while the median FNR is 0.998 (±0.008 IQR), meaning that the manual segmentation of Radiologist 1 failed to include only 0.2% of the voxels in the ground truth mask. Regarding the manual ground truth vs. the automatic segmentation, we observe that the median FPRm is 0.968 (±0.015 IQR); therefore, the automatic segmentation tool generates masks with an average of 3.2% non-tumoral voxels. The median FNR corresponds to a value of 0.963 (±0.021 IQR); therefore, the automatic tool fails to include 3.7% of tumoral voxels. These results demonstrate that the automatic segmentation model achieves a better performance regarding the FPRm, which is a great advantage in segmentation tasks for the posterior extraction of quantitative imaging features.

With regard to the time required for the segmentation process, an average time reduction of 99.7% was obtained when comparing the automatic model with the manual segmentation methodology. As some errors and variability may exist in the final automatic masks, this result is over-optimistic: in practice, the reader has to visually validate the quality of all the segmentations provided by the automatic tool before introducing corrections, if needed. A human-based visual validation of the masks is recommended to edit and adjust the automatic masks. In our study, this process of validation and correction of the automatic masks that needed adjustment reduced the time required for segmentation from scratch by 92.8%. Correction time was influenced by the intrinsic difficulty of segmenting each tumor: some tumors were easier to segment (e.g., more homogeneous, with sharper margins, without lymph nodes) and did not need corrections or required only slight adjustments, while more challenging tumors, with contrast variations close to organ borders and an appearance similar to surrounding structures, required more time to be corrected. Overall, the application of the automatic model greatly reduces the time required to perform the segmentation process, facilitating the workflow for radiologists.

In addition, the human-based analysis of the masks performed by the net is useful to gain insights and correct for potential human mistakes and biases/outliers within the data set, which could be used to retrain the model, increasing the model's overall accuracy and robustness.

There are some limitations to this study. Segmentations were performed by only two observers, so they may represent only a fraction of the full range of inter-observer variability and may not be enough to establish a reference standard. Furthermore, both were experienced radiologists, as prior medical knowledge and expertise are required to contour highly heterogeneous neuroblastic tumors. Therefore, manual segmentations performed by other users (less experienced radiologists, other clinical users, non-medical staff) are expected to show higher inter-observer variability. Another potential limitation is that tumors may be associated with, or be in contact with, extensive lymph nodes, making their differentiation difficult in some cases, which, as we have shown, can lead to errors in the segmentation performed by the net. Finally, as has been pointed out, the mask considered to be the ground truth may itself have some errors that are intrinsically associated with the manual segmentation process, and the comparison of the nnU-Net results was performed against the segmentations of a single radiologist.

#### **5. Conclusions**

MR image segmentation accuracy of neuroblastic tumors is observed to be comparable between radiologists and the state-of-the-art deep learning architecture nnU-Net. The automatic segmentation model achieves a better performance regarding the FPRm, which is a great advantage in segmentation tasks for the posterior extraction of quantitative imaging features. Moreover, the time saved when using the automatic model corresponds to 99.7%. A human-based validation with manual editing of the automatic masks is recommended and still corresponds to a 92.8% reduction in time compared to the fully manual approach, reducing the radiologist's involvement in this task.

**Author Contributions:** L.M.-B. and L.C.-A. conceived the idea for this study and supervised the work. D.V.-C. and C.S.N. performed the manual segmentations. L.C.-A. developed the mathematical method, D.V.-C. and L.C.-A. performed the computations. D.V.-C. took the lead in writing the manuscript in consultation with L.M.-B., L.C.-A. and J.M.C.S. U.P. verified the analytical and statistical methods. M.G., C.S.N. and E.N. contributed to the segmentation methodology and to the radiological point of view. B.M.d.l.H., S.T.-M., V.D., R.L. and A.C. supervised the findings of this work from an oncological and clinical point of view. All authors discussed the results and contributed to the preparation of the final manuscript submitted. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was funded by PRIMAGE (PRedictive In-silico Multiscale Analytics to support cancer personalized diaGnosis and prognosis, empowered by imaging biomarkers), a Horizon 2020|RIA project (Topic SC1-DTH-07-2018), grant agreement no: 826494.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki and approved by the Hospital's Ethics Committee (the Ethics Committee for Investigation with Medicinal Products of the University and Polytechnic La Fe Hospital, ethics code: 2018/0228, 27 March 2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Conflicts of Interest:** All authors certify that there is no actual or potential conflict of interest in relation to this article.

#### **References**


**Deep Learning-Based Classification of Uterine Cervical and Endometrial Cancer Subtypes from Whole-Slide Histopathology Images**

**JaeYen Song 1, Soyoung Im 2, Sung Hak Lee 3,\* and Hyun-Jong Jang 4,\***


**Abstract:** Uterine cervical and endometrial cancers have different subtypes with different clinical outcomes. Therefore, cancer subtyping is essential for proper treatment decisions. Furthermore, an endometrial and an endocervical origin for an adenocarcinoma should also be distinguished. Although various immunohistochemical markers can aid the discrimination, there is no definitive marker. Therefore, we tested the feasibility of deep learning (DL)-based classification of the subtypes of cervical and endometrial cancers and the site of origin of adenocarcinomas from whole-slide images (WSIs) of tissue slides. WSIs were split into 360 × 360-pixel image patches at 20× magnification for classification. Then, the average of the patch classification results was used for the final classification. The areas under the receiver operating characteristic curves (AUROCs) for the cervical and endometrial cancer classifiers were 0.977 and 0.944, respectively. The classifier for the origin of an adenocarcinoma yielded an AUROC of 0.939. These results clearly demonstrate the feasibility of DL-based classifiers for the discrimination of cancers from the cervix and uterus. We expect that the performance of the classifiers will be much enhanced with an accumulation of WSI data. Then, the information from the classifiers can be integrated with other data for more precise discrimination of cervical and endometrial cancers.
**Citation:** Song, J.; Im, S.; Lee, S.H.; Jang, H.-J. Deep Learning-Based Classification of Uterine Cervical and Endometrial Cancer Subtypes from Whole-Slide Histopathology Images. *Diagnostics* **2022**, *12*, 2623. https://doi.org/10.3390/diagnostics12112623

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 4 September 2022 Accepted: 26 October 2022 Published: 28 October 2022

**Keywords:** computational pathology; computer-aided diagnosis; convolutional neural network; digital pathology

#### **1. Introduction**

Uterine cervical and endometrial cancers are two major cancer types threatening women's health worldwide [1]. Although they originate from the same organ, i.e., the uterus, cervical and endometrial cancers have different subtypes with different clinical outcomes [2–6]. The main histologic subtypes of cervical cancers are squamous cell carcinoma and endocervical adenocarcinoma. The two major histologic subtypes of endometrial cancers are endometrioid adenocarcinoma and serous adenocarcinoma. Because management and prognosis differ between the subtypes, differential diagnosis is crucial for proper treatment decisions. Furthermore, an endometrial and an endocervical origin for an adenocarcinoma should be distinguished considering the marked differences in their management [7]. The first step in the discrimination of the subtypes of these cancers is the investigation of hematoxylin and eosin (H&E)-stained tissue slides by pathologists. However, the visual discrimination of subtypes is not always clear because some morphologic features overlap [7,8]. Furthermore, there is considerable inter- and intra-observer variation in histological subtyping by pathologists [8]. Although various immunohistochemical markers can help distinguish the subtypes, there is no definitive marker [7,8]. Therefore, ancillary methods for the discrimination of the subtypes of cervical and endometrial cancers, and also of the origin of the cancers, are necessary to improve treatment decisions.

Because whole-slide images (WSIs) have been approved for primary diagnostic purposes, many pathology laboratories have been adopting digitized diagnostic processes [9]. This digitization has enabled computer-aided analysis of pathologic tissues. Computer-aided analysis of H&E-stained WSIs could provide valuable information in a cost- and time-effective manner, considering the wide availability of H&E-stained pathologic tissue slides for most cancer patients. Recently, deep learning (DL) has been widely applied to various analysis tasks on H&E-stained WSIs [10]. DL usually performs better than many previous machine learning methods because it can automatically learn the most discriminative features directly from large datasets [11]. Many studies have shown that DL can correctly diagnose various cancers from WSIs [12]. Furthermore, DL can even detect molecular alterations of cancer tissues from H&E-stained WSIs [13]. Therefore, DL has tremendous potential to improve the precision of pathologic diagnosis with minimal additional cost.

In the present study, we applied sequential DL models for the subtyping of cervical and endometrial cancers. First, cervical and endometrial cancer regions were automatically selected with DL models. Then, two separate DL models were trained to discriminate cervical cancers into cervical squamous cell carcinoma and endocervical adenocarcinoma, and endometrial cancers into endometrioid endometrial adenocarcinoma and serous endometrial adenocarcinoma, respectively. Furthermore, we trained an additional DL model to discriminate whether an adenocarcinoma has an endocervical or an endometrial origin. The three models showed excellent performance, demonstrating the potential of DL for the discrimination of subtypes in gynecologic tumors.

#### **2. Materials and Methods**

#### *2.1. Datasets*

Classifiers for the subtypes of cervical and endometrial cancers and for the origin of adenocarcinomas were trained with the WSIs provided by The Cancer Genome Atlas (TCGA) program. From the TCGA cervical (TCGA-CESC) and endometrial (TCGA-UCEC) datasets, we collected formalin-fixed paraffin-embedded (FFPE) slides after basic slide quality reviews. The TCGA-CESC dataset provided slides from 255 patients for cervical squamous cell carcinoma and from 47 patients for endocervical adenocarcinoma. From the TCGA-UCEC dataset, tissue slides of 399 and 109 patients were obtained for endometrioid endometrial adenocarcinoma and serous endometrial adenocarcinoma, respectively. When class sizes differ greatly, performance evaluation can be skewed by the majority class. Therefore, we randomly selected 70 and 160 patients for cervical squamous cell carcinoma and endometrioid endometrial adenocarcinoma, respectively, to keep the difference between the major and minor classes under 1.5-fold.

The performance of the classifier for the subtypes of endometrial carcinoma was also evaluated on The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) endometrial cancer dataset (CPTAC-UCEC). There were 83 patients for endometrioid endometrial adenocarcinoma and 12 patients for serous endometrial adenocarcinoma.

#### *2.2. Deep Learning Model*

To fully automate the classification tasks, we sequentially applied different DL-based classifiers to the WSIs (Figure 1). The WSIs were divided into non-overlapping, 360 × 360-pixel image patches at 20× magnification because a WSI is too big to be analyzed by a current DL system as a whole. In a WSI, various artifacts can exist, including air bubbles, blurring, compression artifacts, pen markings and tissue folding. Patches with these artifacts should be discarded because they can interfere with proper learning of relevant features. In our previous study on gastric cancer subtyping, we trained a simple DL classifier that can discriminate various artifacts and white backgrounds all at once [14]. The DL network consisted of three convolution layers with 12 [5 × 5] filters, 24 [5 × 5] filters and 24 [5 × 5] filters, each followed by a [2 × 2] max-pooling layer. We reused this classifier, and only proper tissue image patches were selected for the next steps (Figure 1a).
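A minimal tf.keras sketch of the described artifact/background filter follows (the original was written for TensorFlow 1.15, and its exact input size and output head are not specified, so those details are assumptions):

```python
import tensorflow as tf

# Three [5 x 5] conv layers (12, 24, 24 filters), each followed by
# [2 x 2] max-pooling, as described for the tissue/non-tissue classifier.
artifact_filter = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(360, 360, 3)),
    tf.keras.layers.Conv2D(12, (5, 5), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(24, (5, 5), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(24, (5, 5), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),  # proper tissue vs. artifact
])
```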

**Figure 1.** Classification procedure. (**a**) Sequential application of tissue/non-tissue and normal/tumor classifiers can discriminate proper tumor tissues. (**b**) Three separate classifiers for subtypes of cervical cancers, subtypes of endometrial cancers, and site of origin for adenocarcinomas were trained from tumor tissue image patches.

Cancer subtype classifiers should be trained on the cancer tissues. Therefore, normal and tumor tissue classifiers are prerequisites for cancer subtyping. To train the normal/tumor classifiers, two pathologists (S.I. and S.H.L.) annotated normal and tumor regions for cervical and endometrial cancer tissue slides (Figure 2 left panels). Then, normal and tumor tissue image patches were collected based on the annotation. From these patches, classifiers to discriminate normal and tumor tissues for cervical and endometrial cancers were trained separately for each cancer type.

Next, we trained classifiers for the subtypes of cervical and endometrial cancers, and for the origin of adenocarcinomas, on prominent tumor tissue patches selected by the normal/tumor classifiers. To evaluate the general performance of the classifiers on the TCGA-CESC and -UCEC datasets, 5-fold cross-validation was adopted. Therefore, the WSIs were split into 5 non-overlapping patient-level subsets, and classifiers were trained and evaluated for each subset. As noted above, 70 and 160 patients for cervical squamous cell carcinoma and endometrioid endometrial adenocarcinoma were selected for evaluation. However, performance can be enhanced when the classifiers are exposed to more varied tissue images during training. Therefore, we randomly sampled tumor image patches from all cervical squamous cell carcinoma and endometrioid endometrial adenocarcinoma WSIs other than the test sets to match the 1.5-fold data ratio of major/minor class tissue patches for training, as this strategy could include a greater variety of tissue images. In other words, training included patches sampled from all patients outside the test sets, while testing used only the selected patients to avoid skewed test results. For the selection of the samples, we made a random selection to avoid selection biases from human selectors. The numbers of image patches used for the training of the classifiers are summarized in Supplementary Table S1.

**Figure 2.** Normal/tumor classification results for (**a**) cervical and (**b**) endometrial cancers. Left panels: annotation made by pathologists. Middle panels: classification results of the normal/tumor classifiers. Right panels: Receiver operating characteristic curves for normal/tumor classification results. AUC: area under the curve.

The Inception-v3 model was adopted for the normal/tumor, cancer subtype and origin classifiers because it yielded good results for normal/tumor classification and tissue subtype classification in our previous studies [14,15]. The models were implemented using the TensorFlow deep learning library version 1.15 (http://tensorflow.org (accessed on 22 January 2022)). The overall structure of the model is presented in Supplementary Figure S1. The RMSProp optimizer was adopted to optimize the model, and the hyperparameters were as follows: initial learning rate 0.1, number of epochs per decay 10.0, learning rate decay factor 0.16, RMSProp decay 0.9, RMSProp momentum 0.9, RMSProp epsilon 1.0. Tissue images were color-normalized before training and testing. During training, data augmentation techniques such as random rotation by 90° and random horizontal/vertical flipping were applied to the tissue patches. Four computer systems equipped with an Intel Core i9-12900K processor (Intel Corporation, Santa Clara, CA, USA) and dual NVIDIA RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA) were used for the training and testing of the models.
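A sketch of such a setup in modern tf.keras (the authors used TensorFlow 1.15, so this is not their code; the decay-step count and classifier head are assumptions, while the learning-rate, decay and RMSProp values mirror the reported hyperparameters):

```python
import tensorflow as tf

# Inception-v3 backbone for 360 x 360 RGB patches, two-class softmax head.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_shape=(360, 360, 3), pooling="avg"
)
out = tf.keras.layers.Dense(2, activation="softmax")(base.output)
model = tf.keras.Model(base.input, out)

lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,  # reported initial learning rate
    decay_steps=10_000,         # placeholder for "10 epochs per decay"
    decay_rate=0.16,            # reported decay factor
    staircase=True,
)
opt = tf.keras.optimizers.RMSprop(learning_rate=lr, rho=0.9, momentum=0.9, epsilon=1.0)
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```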

#### *2.3. Visualization and Statistics*

To visualize the distribution of different tissue types, heatmaps of the classification results of tissue patches were overlaid on the WSIs with the specific colors shown in Figure 1. To obtain the overall classification result of a WSI, the patch classification results were averaged. Receiver operating characteristic (ROC) curves and the areas under the ROC curves (AUROCs) are presented to demonstrate the performance of each classifier. For the 5-fold cross-validated datasets, ROC curves for the folds with the lowest and highest AUROCs and for the concatenated results of all 5 folds are provided for a more precise evaluation of the performance of the classifiers. For the concatenated results of all 5 folds, 95% confidence intervals (CIs) are presented. To obtain the accuracy, sensitivity, specificity and F1 score of the classification results, cutoff values yielding the maximal Youden index (sensitivity + specificity − 1) were adopted.
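The slide-level aggregation and cutoff selection can be expressed compactly (a sketch; the study's exact implementation is not published, and the function names are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve

def slide_score(patch_probs):
    """Slide-level score = mean of the patch-level class probabilities."""
    return float(np.mean(patch_probs))

def youden_cutoff(y_true, y_score):
    """Cutoff maximizing the Youden index J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
```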

When a comparison between the ROC curves is necessary, Venkatraman's permutation test with 1000 iterations was applied [16]. A *p*-value < 0.05 was considered significant.

#### *2.4. Ethical Statement*

Informed consent of patients in the TCGA cohorts was acquired by the TCGA consortium [17]. The Institutional Review Board of the College of Medicine at The Catholic University of Korea approved this study (XC21ENDI0031K).

#### **3. Results**

#### *3.1. Normal/Tumor Classification*

To classify the subtypes of cancer tissues, proper cancer tissue image patches should be selected (Figure 1). First, we removed image patches containing various artifacts and white background with the tissue/non-tissue classifier from our previous study [14]. Then, normal/tumor classifiers for cervical and endometrial cancers were trained based on the pathologists' annotations (Figure 2). The pathologists annotated 100 slides each for cervical and endometrial cancer. The normal/tumor classifiers were trained with 80 slides and tested on the remaining 20 slides. The representative WSIs in Figure 2 are cervical and endometrial cancer WSIs from the test sets. The classification results of the normal/tumor classifiers matched well with the pathologists' annotations. The AUROCs for the patch-level classification results of the normal/tumor classifiers were 0.982 and 0.999 for cervical and endometrial cancers, respectively.

#### *3.2. Cervical Cancer Subtypes Classification*

With the tissue/non-tissue and normal/tumor classifiers, we could collect proper tumor patches for the training of the cancer subtype classifiers. First, we trained classifiers for the cervical cancer subtypes. The patches from a WSI were labeled as either cervical squamous cell carcinoma or endocervical adenocarcinoma based on information obtained from cBioPortal for Cancer Genomics (https://www.cbioportal.org/ (accessed on 12 March 2022)). Then, separate classifiers were trained to distinguish the subtypes for each of the 5 folds. For each fold, four classifiers were trained repeatedly, and the classifier yielding the best AUROC was used to present the results. The classification results for cervical squamous cell carcinoma and endocervical adenocarcinoma are presented in Figure 3. The upper panels show representative WSIs of a clear cervical squamous cell carcinoma, a clear endocervical adenocarcinoma, and a confusing case with mixed classification results. The ROC curves of the slide-level classification results for the folds with the lowest and highest AUROCs and the concatenated results of all 5 folds are presented in the lower panels. The AUROCs were 0.979 and 1.000 for the folds with the lowest and highest AUROCs, respectively. The AUROC for the concatenated results was 0.977 (95% CI, 0.957–0.998).

**Figure 3.** Classification results for cervical cancer subtypes. Upper panels: representative whole-slide images of a clear cervical squamous cell carcinoma, a clear endocervical adenocarcinoma, and a confusing case with mixed classification results. Lower panels: receiver operating characteristic curves of slide-level classification results for the folds with the lowest and highest area under the curve (AUC) and the concatenated results of all 5 folds.

#### *3.3. Endometrial Cancer Subtypes Classification*

Next, we trained classifiers for the endometrial cancer subtypes. The patches from a WSI were labeled as either endometrioid endometrial adenocarcinoma or serous endometrial adenocarcinoma, also based on information obtained from cBioPortal. The classification results are presented in Figure 4a. Representative WSIs of a clear endometrioid endometrial adenocarcinoma, a clear serous endometrial adenocarcinoma, and a confusing case with mixed classification results are presented in the upper panels. The AUROCs were 0.923 and 0.982 for the folds with the lowest and highest AUROCs, respectively. The AUROC for the concatenated results was 0.944 (95% CI, 0.916–0.969).

It is of interest whether classifiers trained on the TCGA datasets generalize to other datasets. Therefore, we tested the classifier on the CPTAC-UCEC dataset. CPTAC-UCEC provides multiple WSIs per patient, including pure normal tissue WSIs (Figure 5a). We discarded the normal WSIs and selected all WSIs with more than 30% tumor tissue regions for testing. The classification results are presented in Figure 4b. The AUROC was 0.826 (95% CI, 0.727–0.925), markedly poorer than the AUROC for the TCGA dataset (*p* < 0.05 between CPTAC and TCGA by Venkatraman's permutation test).

**Figure 4.** Classification results for endometrial cancer subtypes. (**a**) Results for the TCGA-UCEC dataset. Upper panels: the representative whole slide images (WSIs) of clear endometrioid endometrial adenocarcinoma, clear serous endometrial adenocarcinoma, and confusing case with mixed classification results. Lower panels: the receiver operating characteristic (ROC) curves of slide-level classification results for folds with the lowest and highest area under the curve (AUC) and concatenated results of all 5-folds. (**b**) The classification results of the CPTAC-UCEC dataset by the classifier trained with the TCGA-UCEC dataset. Left two representative WSIs demonstrate clear endometrioid endometrial adenocarcinoma and clear serous endometrial adenocarcinoma. The ROC curve is obtained from all CPTAC-UCEC tissues with more than 30% of tumor tissue regions.

**Figure 5.** Characteristics of CPTAC-UCEC tissues. Examples of tissues from six patients indicated by IDs. (**a**) Patients with both pure tumor and pure normal tissue samples. (**b**) Patients with frozen tissue samples. (**c**) Patients with small curettage tissue samples.

#### *3.4. Tumor Origin Classification*

Lastly, we trained classifiers to distinguish the origin of adenocarcinomas: endocervical adenocarcinoma vs. endometrioid endometrial adenocarcinoma. The classification results are presented in Figure 6. The upper panels show the representative WSIs of a clear endocervical adenocarcinoma, a clear endometrioid endometrial adenocarcinoma, and a confusing case with mixed classification results. The AUROCs were 0.904 and 0.987 for the folds with the lowest and highest AUROCs, respectively. The AUROC for the concatenated results was 0.939 (95% CI, 0.896–0.982).

**Figure 6.** Classification results for the origin of adenocarcinomas. Upper panels: the representative whole slide images of clear endocervical adenocarcinoma, clear endometrioid endometrial adenocarcinoma, and confusing case with mixed classification results. Lower panels: the receiver operating characteristic curves of slide-level classification results for folds with the lowest and highest area under the curve (AUC) and concatenated results of all 5-folds.

In Table 1, the accuracy, sensitivity, specificity, and F1 score of the classification results for these classifiers are presented, with cutoff values yielding the maximal Youden index (sensitivity + specificity − 1).

**Table 1.** Accuracy, sensitivity, specificity, and F1 score of the classification results. The measures were obtained with cutoff values yielding maximal Youden index (sensitivity + specificity − 1).
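For reference, a minimal sketch of how such metrics can be derived at the Youden-optimal cutoff is shown below, using `sklearn`; the variable names are illustrative.

```python
# Sketch: choose the cutoff maximizing the Youden index and derive the metrics.
import numpy as np
from sklearn.metrics import roc_curve, f1_score

def metrics_at_youden(y_true, y_score):
    """y_true, y_score: 1-D numpy arrays of labels and predicted scores."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    cut = thresholds[np.argmax(tpr - fpr)]          # Youden index = sens + spec - 1
    y_pred = (y_score >= cut).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {"cutoff": float(cut),
            "accuracy": (tp + tn) / y_true.size,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "f1": f1_score(y_true, y_pred)}
```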


#### **4. Discussion**

In the present study, we investigated the feasibility of DL-based classification for the subtypes of cervical and endometrial cancers and the site of origin of adenocarcinomas. Although the performance of the classifiers was not perfect, the high AUROCs of all the classifiers revealed the potential of DL-based classification of H&E-stained tissue slides of cervical and uterine cancers. The performance can be further enhanced when more WSI data are collected for the training of the classifiers.

The DL-based classifiers for cervical cancer showed the best performance among the classifiers in the study. Pure adenocarcinoma and squamous cell carcinoma of the cervix are relatively clearly separable because their morphologies differ in many respects [5]. However, there are also confusing cases, including adenosquamous carcinoma, which is defined as a tumor with both glandular and squamous components. This explains why the classifier could not achieve perfect performance. In clinical practice, tissue slides with mixed classification results will need more careful attention by pathologists when a DL-based assistant system for tissue slides is adopted.

Serous endometrial adenocarcinoma represents only about 10% of endometrial carcinomas; however, it is responsible for almost 40% of cancer deaths [8,18]. The distinction between endometrioid and serous endometrial adenocarcinoma is not always clear. Serous carcinoma typically shows a predominant papillary growth pattern, but this pattern is also found in some endometrioid carcinomas. Antibodies for p53, p16, IMP2, and IMP3 can help to distinguish serous endometrial adenocarcinoma, but these markers are not definitive [19]. Therefore, there is an opportunity for DL-based classifiers to improve the diagnostic accuracy for the subtypes of endometrial cancers.

One of the important issues of DL application is the generalizability of trained classifiers to external datasets. The TCGA-trained classifiers did not perform well on the CPTAC dataset in the present study. There can be various reasons for the decreased performance. First, the quality of H&E-stained tissue slides can vary between the TCGA and CPTAC datasets because of differences in tissue processing, including tissue cutting, fixation, dye concentration, and staining time [20]. Furthermore, differences in the settings of the slide scanners can also affect the color features of the WSIs. Although we normalized color, normalization may not overcome the innate differences between the datasets. In addition, there are many other differences between the TCGA and CPTAC datasets. The CPTAC dataset contains not only FFPE tissues but also frozen tissue sections (Figure 5b). In our previous study, we clearly demonstrated that classifiers trained on either frozen or FFPE tissue did not perform well on the other tissue type [21]. Therefore, the classifiers trained on the TCGA-UCEC FFPE tissues cannot be expected to perform properly on the CPTAC frozen tissues. Furthermore, the CPTAC dataset also contains small tissue samples such as biopsy or small curettage specimens (Figure 5c). Dilatation and curettage may deform tissue morphology. In addition, because biopsy samples have fundamental limitations in reflecting the overall contour of tumor histomorphology, classifiers trained on resection specimens may not perform well on biopsy or small curettage tissues. Whatever the reason, the limited generalizability suggests that the TCGA dataset is not sufficient to train a classifier that performs well on real-world problems. More data from various institutes should be collected to establish high generalizability. Recently, many countries have started to construct large datasets of pathologic tissue slides [22,23]. Therefore, the performance and generalizability of DL-based tissue classifiers will be much enhanced with the accumulation of more training data in the near future.
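As a point of reference, one simple family of color-normalization methods matches the color statistics of a source slide to a reference slide; the Reinhard-style sketch below is illustrative only, since the normalization method used in this study is not detailed here.

```python
# Illustrative Reinhard-style color normalization in LAB space; not necessarily
# the method used in the study. Inputs are float RGB images in [0, 1].
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(src_rgb, ref_rgb):
    """Match per-channel LAB mean/std of src_rgb to those of ref_rgb."""
    src, ref = rgb2lab(src_rgb), rgb2lab(ref_rgb)
    out = np.empty_like(src)
    for c in range(3):
        z = (src[..., c] - src[..., c].mean()) / (src[..., c].std() + 1e-8)
        out[..., c] = z * ref[..., c].std() + ref[..., c].mean()
    return np.clip(lab2rgb(out), 0.0, 1.0)
```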

The distinction of the site of origin between cervical adenocarcinomas and endometrial adenocarcinomas is important for clinical decisions, especially for tumors involving both the endometrium and the endocervix or for tumors with multiple lesions [7]. The decision can be supported by immunohistochemistry for ER, p16, CEA, and vimentin, or by HPV in situ hybridization [5]. However, there is no decisive marker, and additional methods are necessary to support the distinction. It is strongly recommended that clinicopathologic, immunohistochemical, and molecular data be integrated for the proper differentiation of these cancers. We suggest that information from the DL-based classifier can also be integrated with these data for more accurate decisions.

In the present study, we applied DL to classify H&E-stained tissues of cervical and endometrial cancers. There have been other studies applying DL to assist the analysis of gynecologic tumors. Many studies have tried to improve cervical cancer screening results based on cervical cytology tests [24–26]. In these studies, DL can discriminate normal/cancer cells in conventional Pap smears or liquid-based cytology. Grades of cervical intraepithelial neoplasia can be determined by DL from either colposcopy images [27,28] or histology images [29]. DL can also analyze hysteroscopy images to discriminate different types of endometrial lesions [30,31]. Normal endometrium, endometrial polyp, endometrial hyperplasia, and endometrial adenocarcinoma can be discriminated by DL from H&E-stained histopathologic slides [32]. Molecular profiles, such as the molecular subtypes or microsatellite instability status of endometrial cancers, can be predicted by DL directly from H&E-stained WSIs [33]. These studies indicate that DL has tremendous potential to support the assessment of patients with gynecologic tumors.

However, there are also limitations of DL. First, it is almost impossible for human interpreters to understand how DL reaches its classification results. This "black-box" nature is one of the most important hurdles for the adoption of DL in clinical practice [34]. The effort to enhance the interpretability of DL is actively ongoing [35]. Second, DL does not perform well in settings it has not been trained on, even when the differences are not tremendous. For example, a classifier trained on FFPE tissues has limited performance on frozen tissues, although the same difference is not limiting for human observers. Therefore, separate DL models should be trained for slightly different settings. Otherwise, a huge dataset covering every variation should be used to train a widely applicable model.

In the present study, we demonstrated the feasibility of DL-based classifiers for the subtypes of cervical and endometrial cancers and the site of origin of adenocarcinomas. Although there is still room for improvement, our results showed that DL can capture selective features for the discrimination of cancer tissues. We believe the performance will be much enhanced with an accumulation of training data in the near future. The classification results of DL can be integrated with other clinical information for a more precise analysis of cervical and endometrial cancers.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics12112623/s1, Figure S1: Schematic view of the network structure for the classification of tissue subtypes; Table S1: Numbers of tissue image patches for the training of the classifiers.

**Author Contributions:** Study design, J.S. and H.-J.J.; methodology, S.H.L. and H.-J.J.; software, H.-J.J.; validation, S.I. and S.H.L.; formal analysis, J.S., S.I., S.H.L. and H.-J.J.; data curation, J.S., S.H.L. and H.-J.J.; writing—original draft preparation, J.S., S.I. and H.-J.J.; writing—review and editing, S.H.L. and H.-J.J.; visualization, H.-J.J.; supervision, S.H.L. and H.-J.J.; funding acquisition, S.H.L. and H.-J.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by a grant from the National Research Foundation of Korea (NRF-2021R1A4A5028966) and a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI21C0940).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of the College of Medicine at The Catholic University of Korea (XC21ENDI0031K, approved on 1 July 2021).

**Informed Consent Statement:** Informed consent of patients in the TCGA cohorts was acquired by the TCGA consortium.

**Data Availability Statement:** The TCGA data presented in this study are openly available in the GDC data portal (https://portal.gdc.cancer.gov/ (accessed on 11 April 2022)). The CPTAC data presented in this study are openly available in The Cancer Imaging Archive website (http://www.cancerimagingarchive.net/ (accessed on 29 April 2022)). Further information is available from the corresponding authors upon request.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


### *Review* **Artificial Intelligence-Driven Diagnosis of Pancreatic Cancer**

**Bahrudeen Shahul Hameed 1,2 and Uma Maheswari Krishnan 1,2,3,\***


**Simple Summary:** Pancreatic cancer poses a grave threat to mankind, due to its poor prognosis and aggressive nature. An accurate diagnosis is critical for implementing a successful treatment plan given the risk of exacerbation. The diagnosis of pancreatic cancer relies on medical imaging, which currently provides only limited prognostic information and makes it difficult for clinicians to select the optimal treatment. Data derived from medical imaging have been integrated with artificial intelligence, an emerging technology, to facilitate clinical decision making. This review explores the implementation of artificial intelligence for various imaging modalities to obtain a precise cancer diagnosis.

**Abstract:** Pancreatic cancer is among the most challenging forms of cancer to treat, owing to its late diagnosis and aggressive nature that reduces the survival rate drastically. Pancreatic cancer diagnosis has been primarily based on imaging, but the current state-of-the-art imaging provides limited prognostic information, thus restricting clinicians' treatment options. Cancer diagnosis has been enhanced through the integration of artificial intelligence and imaging modalities to make better clinical decisions. In this review, we examine how AI models can improve the diagnosis of pancreatic cancer using different imaging modalities, along with a discussion of the emerging trends in AI-driven diagnosis based on cytopathology and serological markers. Ethical concerns regarding the use of these tools are also discussed.

**Keywords:** pancreatic cancer; artificial intelligence; deep learning; cancer imaging; risk prediction

#### **1. Introduction**

Pancreatic cancer (PC) is among the most fatal and invasive tumors of the digestive system [1]. It has been referred to as the king of cancer, due to its aggressiveness, invasiveness and rapid metastasis, poor survival, and poor prognosis [2]. Recent decades have witnessed a surge in the incidence of pancreatic cancer across the globe that has been largely linked to ageing, alcohol consumption, smoking, sedentary lifestyle, obesity, chronic pancreatitis, diabetes, hereditary factors, long-term exposure to air and water pollutants, unhealthy lifestyle, and diet [1,3,4]. Surgery has been the main therapeutic intervention for these patients. However, several factors, including the absence of specific clinical manifestations and molecular markers, have resulted in the detection of the disease only at advanced stages, thereby making surgical options ineffective. Therefore, an early diagnosis and accurate stratification of pancreatic cancer stages are important for improved therapeutic outcomes. Pancreatic cancer diagnosis is challenging because the pancreas is a deep-seated retro-peritoneal organ with complex surrounding structures. The highly vascularized environment surrounding the pancreas facilitates rapid metastasis, making pancreatic cancer highly aggressive.

**Citation:** Hameed, B.S.; Krishnan, U.M. Artificial Intelligence-Driven Diagnosis of Pancreatic Cancer. *Cancers* **2022**, *14*, 5382. https://doi.org/10.3390/ cancers14215382

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 5 October 2022 Accepted: 28 October 2022 Published: 31 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The common symptoms of pancreatic cancer include abdominal pain, changes in the consistency of faeces, nausea, a bloated body, co-morbidities such as diabetes and jaundice, abnormal liver function parameters, loss of weight, etc. [5]. These symptoms usually become prominent only during the advanced stage of the disease and are often missed during the early stages. Further, serological markers for pancreatic cancer, such as CA-19-9 (carbohydrate antigen), are not highly specific and indicate only the advanced stage of the disease, thereby increasing the mortality risk of the affected individual. Several imaging tools, including magnetic resonance imaging (MRI), computed tomography (CT), endoscopic ultrasound (EUS), etc., have also been explored for the diagnosis of pancreatic cancer. Due to rapid advances in recent years, imaging technology has moved to the forefront of the diagnosis, staging, and prognosis of pancreatic cancer [6]. However, the distinction of a cancerous lesion from other pancreatic disorders, such as pancreatitis, a chronic inflammation of the pancreas, remains a major roadblock to the accurate and early diagnosis of pancreatic cancer. Despite the existence of advanced imaging equipment, pancreatic cancer is confirmed through biopsy after imaging. Not only is this time-consuming, but it also increases the probability of mortality in the affected individual, due to the inordinate delay. A study reported that nearly 90% of the misdiagnoses of pancreatic cancer were due to the inability to identify vascular invasion and the difficulty in spotting the underlying tumour mass because of the inflammation [7]. Table 1 lists some of the common imaging techniques used for the clinical diagnosis of pancreatic cancer, along with their merits and limitations.


**Table 1.** Major imaging techniques employed for the diagnosis of pancreatic cancer and their limitations.

Several approaches to improve the sensitivity and prediction accuracy of these imaging techniques have been reported in the literature. These include the use of image contrast agents to improve resolution and sensitivity and the use of image processing software for better diagnostic accuracy. In recent years, the emergence of artificial intelligence and deep learning has transformed the landscape of image-driven diagnosis of pancreatic cancer, with a dramatic improvement in prediction accuracy. The various attempts to integrate artificial intelligence into the diagnosis of pancreatic cancer are discussed in the following sections.

#### **2. Artificial Intelligence for Diagnostic Applications**

The advances in computer technology witnessed in recent decades, coupled with the development of effective image processing strategies, have ushered in a new era of digital medicine. As a result, clinical personnel can avoid laborious manual analysis of medical images, saving time and reducing diagnostic errors arising from differences in expertise and clinical exposure [8]. The 21st century has witnessed the widespread use of artificial intelligence (AI), which employs computer programs to perform tasks associated with human intelligence, such as learning and problem-solving. The phrase artificial intelligence was first coined by John McCarthy in the mid-1950s, and the field has since evolved from sets of if-then commands to more complex algorithms that mimic the human brain in some aspects [9,10]. The application of AI tools has resulted in the emergence of a new field of clinical diagnosis, namely precision oncology, which uses large volumes of data from genomics, proteomics, and metabolomics [11]. AI-based cancer diagnosis is mainly driven by machine learning (ML) and deep learning (DL) techniques. Machine learning uses computational methods to analyse large volumes of data and identify patterns for prediction [12]. ML can be supervised, where it uses data from previous trials/measurements to identify patterns or trends for making predictions. Thus, for pancreatic cancer diagnosis, CT or PET scans, ultrasonographs, and MRI data can be used to train the system to identify abnormalities that can be classified as pancreatic cancer. The prediction accuracy improves when larger datasets are used for training. Different mathematical models and algorithms can be used iteratively during the training period to identify the most efficient model, the accuracy of which can be validated using a testing dataset. The advantage of such supervised ML models is that they can extract meaningful features and identify patterns or subtle changes that could be missed by human personnel, due to oversight or exhaustion. Hence, the prediction accuracy of ML for cancer diagnosis is higher. ML can also be unsupervised, where it discerns patterns and trends from unclassified data; however, the accuracy of the prediction is slightly compromised compared to supervised models [13]. The 3D reconstruction of images has also been realized by ML models for superior diagnostic accuracy [6].
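A minimal sketch of this supervised workflow is shown below, assuming feature vectors have already been extracted from the images (synthetic features stand in here): candidate models are compared by cross-validated score, and the best one is confirmed on a held-out test set.

```python
# Sketch of supervised model selection with cross-validation; synthetic data
# stand in for features extracted from CT/PET/MRI/ultrasound images.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

candidates = [SVC(), LogisticRegression(max_iter=1000), RandomForestClassifier()]
best = max(candidates, key=lambda m: cross_val_score(m, X_tr, y_tr, cv=5).mean())
best.fit(X_tr, y_tr)
print(type(best).__name__, "held-out accuracy:", best.score(X_te, y_te))
```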

Another type of ML that is yet to be applied to cancer diagnosis is reinforcement learning, where the algorithm uses data to understand and respond to its environment, predominantly by a trial-and-error process [14]. In other words, reinforcement learning is an advanced concept that could also facilitate decision-making, in addition to prediction [15]. Thus, apart from diagnosing pancreatic cancer, reinforcement learning could be used to alert clinicians in remote locations or to trigger actuators for releasing a therapeutic agent. These concepts, though attractive, are yet to be realized, but could very well represent the diagnostic technology of the future. Deep learning is another sub-type of AI that uses large data sets and complex algorithms that mimic the human brain to enable prediction, forecasting, and decision-making [16,17]. Most DL is supervised and uses training data for the decision-making process, unlike reinforcement learning, which is a dynamic process relying on trial and error. Both DL and reinforcement learning are advanced concepts that require a longer duration for training and testing [18]. DL employs convolutional neural networks (CNNs) and artificial neural networks (ANNs) extensively for decision-making [19].

A plethora of supervised and unsupervised ML and DL models continue to be developed and explored for improving the accuracy of pancreatic cancer diagnosis at an early stage, which could be invaluable in enhancing the survival of the affected individual [20]. The complexity of the algorithms reflects the type of functions they can perform, ranging from feature extraction, simple clustering or segregation of data, and classification of data to prediction, forecasting, and decision-making [21]. Algorithms such as Naïve Bayes, support vector machines, linear regression analysis, ensemble methods, decision trees, K-modes, hidden Markov models, hierarchical and Gaussian mixture clustering, and neural networks have all been explored with different imaging data sets for distinguishing cancerous from noncancerous tissues [22]. The workflow for the detection of cancer using ML is depicted in Figure 1.

**Figure 1.** Work-flow of the stages during the training of the ML models for the diagnosis of cancer lesions.

The classification of images for diagnosis using various AI models can be broadly divided into one-stage and two-stage methods. The one-stage method segments the medical image into grids and applies the model for classification, while the two-stage method demarcates several candidate zones that are used for classification during training. Though time-consuming, the two-stage method identifies and screens regions of interest, resulting in more accurate predictions. Region-based convolutional networks (R-CNN), Fast R-CNN, and Faster R-CNN have been employed in the two-stage method as integrated networks for discriminative feature extraction, segmentation, and classification for improved cancer detection without compromising the spatial structures [6].
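For orientation, the sketch below runs torchvision's reference Faster R-CNN, a two-stage detector of the family referenced above, on a placeholder image; fine-tuning on annotated lesion regions is omitted, and the random input tensor is purely illustrative.

```python
# Sketch: a two-stage detector (Faster R-CNN) producing candidate regions with
# scores; the random tensor is a placeholder for a preprocessed image slice.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
image = torch.rand(3, 512, 512)                 # RGB-like tensor in [0, 1]
with torch.no_grad():
    out = model([image])[0]                     # dict with boxes, labels, scores
print(out["boxes"].shape, out["scores"][:5])
```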

#### **3. AI Models for the Diagnosis of Pancreatic Cancer**

Medical imaging has been widely used for locating and diagnosing cancerous tissue in the gastrointestinal tract. Current analysis is largely dependent upon the expertise and experience of the clinician. The quality of the images also influences diagnosis through conventional methods [23]. The field of digital pathology continues to evolve from the first generation of image processing, which involved the use of image processing tools to analyse a single slide, to much more advanced second-generation tools that can scan, analyse, and store records of whole tissue samples. The current paradigm in digital pathology involves the use of AI-based algorithms to analyse images, diagnose the condition with high accuracy, and even predict the possibility of developing the disease before its onset [24]. The development of AI-based tools has enabled the rapid and high-precision diagnosis of cancer using different medical images [25]. In the context of pancreatic cancer, AI-based diagnostic tools have been employed for risk prediction, survival prediction, and the distinction of cancer masses from other pancreatic lesions, as well as for the evaluation of the response post-therapy.

Machine learning tools, such as the k-nearest neighbour (k-NN), ANN, and SVM, have been extensively investigated for their ability to extract unique signatures from medical images that could be used for the identification of abnormalities [26] in different types of digestive system cancers, including pancreatic cancer [27]. The k-NN algorithm, first introduced in 1967 by Cover and Hart, calculates the distance between the values of the specified features in the sample data and the training data. Based on the calculated distance, the sample is assigned to its nearest neighbour class [28]. The k-NN concept was employed by Kilic et al. [29] to identify colonic polyps using region covariance in CT-colonography images as the distinguishing feature. In another report employing k-NN [30], the gray-level co-occurrence matrix was employed as the classifying feature in medical images of brain and pancreatic cancers. However, k-NN is limited by issues pertaining to local structure sensitivity and the possibility of over-fitting, leading to errors.
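A minimal k-NN sketch follows, with synthetic feature vectors standing in for texture descriptors such as region covariance or GLCM values:

```python
# Sketch: k-NN assigns each sample to the majority class of its k nearest
# neighbours in feature space.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```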

Artificial neural networks (ANNs), the concept of which was first proposed in the early 1940s by McCulloch and Pitts, attempt to mimic the human neuronal network. The input layer receives the input signal, which is then passed on to each of the inner hidden layers that transform it and pass it on to the next layer, until it reaches the final output layer [31], as shown in Figure 2. Unlike k-NN models that can only handle limited data, the ANN model is adaptive and can be trained using large volumes of data to become more robust and accurate. Progress in ANNs has accelerated due to advances in big data, affordable graphics processing units (GPUs), and the development of novel algorithms [32]. The ANN method used in diagnosing digestive cancers is the back-propagating (BP) network, first introduced in 1986 by Rumelhart [33]. This strategy enables error correction: if the output is found erroneous, the error is sent back through the inner layers to refine the weights during the training period. This iterative process minimizes errors and improves accuracy. In the context of pancreatic cancer diagnosis, Săftoiu et al. [34] successfully employed ANNs to differentiate chronic pancreatitis and pancreatic adenocarcinoma using endoscopic ultrasound images, with a sensitivity of 94%. The ANN method has the advantage of being able to handle large data sets and capture all types of interactions and inter-relationships between dependent and independent variables [35]. However, ANN algorithms are slow when large numbers of inputs are provided during the training period and require a large computational load, and their black-box nature makes it challenging to interpret multi-layer networks [36].
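The back-propagation mechanism described above can be illustrated with scikit-learn's `MLPClassifier`, which iteratively propagates output errors back through the hidden layers to update the weights; the layer sizes below are arbitrary.

```python
# Sketch: a small back-propagation network with two hidden layers.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32),   # two hidden layers
                    max_iter=500, random_state=0)
mlp.fit(X, y)                                      # weights refined by back-propagation
print("training accuracy:", mlp.score(X, y))
```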

To overcome some of the limitations of ANNs, Vapnik et al. [37] developed a supervised learning algorithm in 1995, known as the support vector machine (SVM), which uses boundary training examples, known as support vectors, to construct a separating hyperplane that is used to classify data [38]. The negative and positive boundaries and the maximum margin are defined based upon the training data fed as inputs. The SVM is capable of pattern recognition and regression analysis in addition to the classification of data [39]. Zhang et al. [40] effectively applied the SVM to identify pancreatic cancers from EUS images by classifying textural features, achieving a detection accuracy of 99.07%. Though SVM models display high accuracy and can work with remarkable efficiency when there is a clear demarcation of the data classes, their efficiency is reduced when the size of the data set increases or when there is extensive overlap of the data. In addition, despite being memory efficient, SVM algorithms are slow, both during the training and the testing phases.
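The corresponding SVM sketch: an RBF-kernel `SVC` learns a separating hyperplane in a transformed feature space, and the support vectors are the boundary training samples that define it.

```python
# Sketch: SVM classification; n_support_ reports the support vectors per class.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors per class:", svm.n_support_)
```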

**Figure 2.** Schematic representation of the process flow in a sample ANN model for the diagnosis of pancreatic cancer.

Deep learning networks exhibit superior diagnostic abilities compared to ML models, as they can extract all features from the medical images rather than the selected ones used in ML. As a result, DL models are preferred for the detection of digestive cancers and for image segmentation [41]. Convolutional neural networks (CNNs) are among the most extensively employed supervised DL techniques. These consist of input layers where different clusters of nodes, each for a specific feature, interact with hidden layers that share weights and biases and perform convolutional operations on the inputs. These are then pooled and transformed to give the final output [42]. A typical CNN comprises input, convolutional, activating, pooling, fully connected, and output layers [43]. CNNs are effective but consume substantial computational power and can be slow. CNNs provide a probabilistic depiction of the complete image, so they are preferably employed for image classification rather than segmentation [44]. Among the various types of CNNs, U-Net algorithms, which use fewer convolutional layers, have also been commonly employed for the diagnosis of digestive cancers, including pancreatic cancer, by classifying and segmenting specific features in the medical images [45]. LeNet, proposed by LeCun et al. [46] in 1989, is considered the basic structure of CNNs. Several other variants, such as AlexNet, VGGNet (visual geometry group), Inception Net, and ResNet, were introduced between 2012 and 2015, varying in the number of convolutional and pooling layers employed [47]. In the context of digestive cancers, Sharma et al. [48] classified and detected necrosis in medical images of gastric carcinoma using the AlexNet architecture, with a classification accuracy of 69.9% and a detection accuracy of 81%. Colonic polyps were automatically detected by Shin et al. [49] from colonoscopy images using the Inception-ResNet network. Long et al. [50] proposed a fully convolutional network (FCN) model in 2015 for semantic segmentation, in which every pixel of the image is classified. As the final fully connected layer is substituted by a convolutional layer in the FCN, resulting in superior segmentation effects, it has been extensively studied for the diagnosis of digestive cancers. Oda et al. [51] employed a three-dimensional FCN model to segment the pancreas automatically using CT images, and an average Dice score of 89.7 ± 3.8 was obtained. The Dice score quantifies the overlap between the predicted and reference segmentations, penalizing both false positives and false negatives, and is computed as follows:

$$\text{Dice score} = 2 \times \frac{\text{area of overlap between the two segmentations}}{\text{total number of pixels in both segmentations}} \tag{1}$$
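Equation (1) translates directly into code for binary masks:

```python
# Direct numpy implementation of Equation (1) for binary segmentation masks.
import numpy as np

def dice_score(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    overlap = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * overlap / total if total else 1.0  # two empty masks agree perfectly
```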

Generally, a Dice score above 88% is considered highly precise. In another study, Guo et al. [52] employed a Gaussian mixture model and applied morphological operations to a three-dimensional U-Net segmentation technique, achieving an improved segmentation accuracy with a Dice score of 83.2 ± 7.8%. It is also evident from the various reports that the type of AI tool employed differs for the various imaging techniques. The following sections highlight some recent AI-based strategies for different imaging modalities.

#### **4. Endoscopic Ultrasound (EUS)**

Endoscopic ultrasound (EUS) employs high-frequency ultrasound (US) for the visualization of the size and location of the primary tumor in the pancreas. The ultrasound probe can be maneuvered close to the pancreas to acquire images of the entire pancreas or of specific locations of suspicious masses or lesions [53]. Advances in transducer design and the advent of colour Doppler techniques have contributed to improved diagnosis and staging of pancreatic cancer. Currently, the sensitivity of EUS for identifying cancerous lesions in the pancreas lies in the range of 85–99%, which is superior to CT techniques. Specifically, EUS can detect small lesions in the range of 2–3 mm [54]. For instance, the accuracy of diagnosis for pancreatic tumors with a diameter of 3 cm was reported to be 93% for EUS images, significantly superior to CT (53%) and MRI (67%) techniques [55]. Though several literature reports have highlighted the effectiveness of EUS over other medical imaging techniques for the diagnosis and staging of pancreatic cancer, resectability has been found to be better predicted only using a combination of CT and EUS images [56,57]. The EUS-driven fine needle aspiration (EUS-FNA) technique has enabled tissue sampling and the evaluation of the primary tumour site, as well as of the neighbouring lymph nodes, with nearly 100% specificity; these otherwise pose a challenge for detection using other imaging modalities [58]. The EUS-FNA combination achieved diagnostic accuracies of up to 85%, a significant improvement over the 50% accuracy obtained using a CT-assisted diagnosis [59]. However, the EUS-FNA combination is not available in many healthcare institutions. Additionally, the combination requires experienced operators for the precise insertion of the needle, which has a major bearing on the diagnostic outcomes [60].

One of the major challenges for clinicians is to distinguish cancerous lesions in the presence of chronic pancreatitis (CP), as the neoplastic features are masked by the inflammation [61]. Norton et al., in 2001 [62], employed neural network models to analyse EUS images for differentiating pancreatic ductal adenocarcinoma (PDAC) and CP using four different image parameters. Though a high sensitivity was achieved, this strategy resulted in a poor specificity of only 50%. In another attempt, Zhu et al. [63] employed a support vector machine model to extract features from EUS images recorded for 262 individuals affected with pancreatic cancer and 126 individuals with CP. The model extracted 105 distinctive features, out of which 16 were selected to differentiate pancreatic cancer and CP with a 94% sensitivity. Similarly, the SVM was used by Zhang et al. [40] to differentiate PDAC and normal tissue using 29 features identified in EUS images, with a sensitivity of 97.98%. In another attempt, Das et al. [64] employed a combination of image analysis and ANNs to demarcate the cancerous zones in EUS images acquired from individuals affected with pancreatic cancer, with a high accuracy of 93%. In another effort employing multilayer perceptron neural networks (MNNs), a type of ANN, Ozkan et al. [65] categorized EUS images of non-malignant and malignant tissues based upon various age groups of the patients, namely <40, 40–60, and >60 years. The MNNs employ a visible layer that receives an input, which is passed on to inner units denoted as hidden layers, as they do not directly receive the input. The final layer produces the output. The error is calculated based on the deviations from the expected output, and these deviations are used to modify the layers to reduce the error during the training period. In another study [66], networks with both one and two hidden layers were employed, exhibiting a 97% accuracy with the training data set and a 95% accuracy with the testing data set for discriminating the malignant and non-malignant samples in the different age categories. Higher accuracy was achieved for the data sets that were initially segregated into different age groups compared to their uncategorized counterparts.

Yet another independent study employed MNNs for identifying pancreatic cancer from images of cell clusters obtained from individuals using fine needle aspiration (FNA). Post-training, the MNN model was found to match the accuracy of an experienced cytopathologist. Additionally, the MNN model was able to predict accurately even inconclusive images, with 80% sensitivity, clearly demonstrating the promise of this tool for the conclusive screening of FNA specimens for pancreatic cancer, especially those deemed inconclusive by cytopathologists. In an interesting study, a computer-assisted diagnosis (CAD) system was developed to analyse EUS images using deep learning models (EUS-CAD) to identify PDAC, CP, and a normal pancreas (NP). The training set used 920 EUS images and the testing set used 470 EUS images. The detection efficiency was 92% and 94% in the validation and testing phases, respectively. Errors in diagnosis were identified by multivariate analysis only in non-PDAC cases and were attributed to mass formation, resulting in an overdiagnosis of tumours [67].

EUS images of intraductal papillary mucinous neoplasms (IPMNs), which are precursors of PDAC, were analysed using deep learning algorithms to predict malignancy, using EUS images of patients acquired before a pancreatectomy. A total of 3970 images were used for the study, and the malignant probability was calculated. The probability of the deep learning algorithm diagnosing malignant IPMN was 0.98 (*p* < 0.001), with a sensitivity, specificity, and accuracy calculated to be 95.7%, 92.6%, and 94.0%, respectively. The accuracy was significantly superior to the corresponding human diagnosis (56.0%) [68]. A comparison of the literature on pancreatic cancer discrimination from EUS images using AI tools revealed that deep learning and ANN techniques exhibited the greatest accuracy, followed by CNNs and the SVM. However, some of the literature reports chosen for the comparison had used images contrasting normal pancreas and pancreatic cancer, while others had tried to differentiate pancreatic cancer from CP. Similarly, the size of the cancerous tissues varied between the studies [69]. Therefore, additional studies are required to address whether these differences could be reflected in the prediction accuracy of the AI tool employed.

#### **5. MRI**

MRI is used to visualise thin slices of two-dimensional or three-dimensional soft tissue, exploiting the water molecules present in the body. The shift in the precessional frequency and alignment of the proton nuclei of the water molecules, in the presence of an externally applied magnetic field and radiofrequency pulses, is used to acquire the image. The technique measures the relaxation times T1 and T2, which denote the spin-lattice and spin-spin relaxation, respectively, back to the original equilibrium position [70]. The relaxation rates (r1 and r2), which are the inverses of the respective relaxation times, are also measured. Most cases employ positive or negative contrast agents, such as gadolinium-based chelates or iron oxide, respectively, to significantly enhance the ratio of the relaxation rates for improved resolution and sensitivity [71].

Early detection of pancreatic cancer is essential to provide the affected individual with a fair chance of survival beyond five years. However, most imaging techniques, including MRI, fail to conclusively identify the subtle changes observed in the pre-malignant stages, such as pancreatic intraepithelial neoplasia, which is commonly associated with the tumorigenesis of PDAC [72]. Even an individual with stage I (localized) pancreatic cancer has only a 39% survival rate over a five-year period [73]. In a typical example of the use of AI for diagnosis using MRI images, a supervised machine learning (ML) algorithm was developed to predict the overall survival rates of PDAC-affected patients, using a cohort of 102 MRI images during training and a further 30 images during the testing period [74]. The algorithm was used to segment the images and extract features. The sensitivity of the ML algorithm was 87%, while the specificity was determined to be 80%. The considerable overlap between the clinical histopathological conclusions and the ML-driven predictions indicates the promise of this strategy for classifying pancreatic cancer sub-types and diagnosis.

Another study [75] investigated the ability of deep learning to distinguish between different pancreatic diseases from magnetic resonance (MR) images that were contrast-enhanced using the T1 contrast agent gadopentetate dimeglumine. The generative adversarial network (GAN) form of machine learning can generate new sets of data that resemble the data used for training. A GAN was employed to generate synthetic images that augmented the T1 contrast-enhanced MRI data of 398 subjects within the age range of 16 to 85 years, acquired before the commencement of any treatment at a single hospital centre. The Inception-V4 network, a type of CNN with multiple hidden layers, was trained on the GAN-augmented data set. Following the training, MRI images acquired from two different hospital centres, comprising 50 images from subjects in the age group 24–85 years and 56 images from patients aged between 26 and 80 years, were used to validate the performance of the Inception-V4 network for disease classification. The results were compared with the predictions made by the radiologist.

To augment the diagnostic accuracy of MRI for paediatric pancreatic cancer, Zhang et al. [76] used a quantum genetic algorithm to optimize the parameters of a traditional SVM classification model for improved prediction accuracy. In addition, this study acquired test samples from real-life cases and assessed the image processing performance of the algorithm for efficient detection. The results revealed that the model clearly distinguished the cancer features with a high accuracy when compared with the conventional detection algorithm. Another study employed a robust and intelligent method combining ANNs with the SVM for the classification of pancreatic cancer to improve the diagnostic process in terms of both accuracy and time [77]. Here, features of the MR images of the pancreas were extracted using the GLCM (gray-level co-occurrence matrix) method, a second-order image texture analysis technique that defines the spatial relationships among pixels in the region of interest. The best features extracted using the JAFER algorithm were analysed using five classification techniques: ANN BP (back-propagation ANN), ANN RBF (radial basis function ANN), SVM Linear, SVM Poly (polynomial kernel), and SVM RBF (radial basis function SVM). The two best features, selected using the ANN BP technique, were used for the classification of pancreatic cancer with a 98% accuracy.
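The GLCM features mentioned above can be computed with scikit-image; the sketch below extracts a few standard second-order statistics from a placeholder ROI, while the JAFER feature-selection step is not reproduced here.

```python
# Sketch: GLCM texture features from a region of interest with scikit-image.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

roi = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder for an MR ROI
glcm = graycomatrix(roi, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)
features = {prop: float(graycoprops(glcm, prop).mean())
            for prop in ("contrast", "homogeneity", "energy", "correlation")}
print(features)
```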

Corral et al. [78] employed a deep learning tool to identify neoplasia in intraductal papillary mucinous neoplasia (IPMN), using CNNs for the classification of MRI scans of the pancreas. The classification was based on the guidelines issued by the American Gastroenterology Association, as well as the Fukuoka guidelines. When tested on 139 MRI scans of individuals, of which 22% showed a normal pancreas, 34% low-grade dysplasia, 14% high-grade dysplasia, and the remaining 29% adenocarcinoma, the model exhibited a detection sensitivity of 92% and a specificity of 52% for the detection of dysplasia. The deep learning technique exhibited an accuracy of 78%, in comparison to the 76% obtained by classification using the American Gastroenterology Association guidelines.

For improving the accuracy, reliability, and efficiency of diagnosis, Chen et al. [79] developed an automated deep learning model (ALAMO) for the segmentation of multiple organs-at-risk (OARs) from clinical MR images of the abdomen. The model incorporated training strategies such as multiview input, deep connections, and auxiliary supervision. The model used multislice MR images as the input and generated segmented images as the output. The model was investigated using ten different OARs, including the pancreas, liver, spleen, stomach, duodenum, small intestine, kidneys, spinal cord, and vertebral bodies. The results from the model correlated well with those obtained using manual techniques. However, further studies integrating AI-based algorithms with these ALAMO-generated segmented MR images of the pancreas are required for the extraction of features to confirm the onset or progression of PC.

#### **6. Computed Tomography**

A computed tomography (CT) scan is a non-invasive clinical imaging technique that employs X-rays to obtain images at different angles. The resultant images are processed using customized software to obtain a reconstructed 3D image, which provides valuable anatomical information [80]. This technique is widely employed in healthcare centres for the diagnosis of tumours and internal injuries [81,82]. Despite its merits, CT scan images pose a challenge to clinicians for the accurate diagnosis of cancers, owing to the irregular contours presented by regions with lesions, vasculature, bony structures, and soft tissues that display a mosaic of densities and intensities [83]. Additional challenges in the precise prediction of the disease from CT scans are associated with fuzzy and noisy images that lack adequate contrast [84]. AI-driven methods that enable image segmentation, contour identification, and disease classification will therefore be invaluable in improving the prediction efficiency for pancreatic diseases from CT images [85]. The currently employed conventional image segmentation models consume considerable computational time and power, as they perform every operation for each pixel in the image [86]. Further, the quality of the resulting processed images is often poor, necessitating the development of more robust AI-driven tools for image segmentation and processing that may provide better diagnostic accuracy [87]. In an interesting study [88], about 19,500 non-contrast CT scan images, acquired from 469 scans, were segmented using CNNs, and the mean pancreatic tissue density, in terms of Hounsfield units (HU), as well as the pancreatic volume, were computed using the CNN algorithm. The comparison of pre-diagnostic scans from individuals who later developed PDAC with those who remained cancer-free revealed a significant reduction in the mean whole-gland pancreatic HU (0.2 vs. 7.8) in individuals who developed PDAC. This suggests that the attenuation of the HU intensity in CT images of the pancreas could imply a risk of PDAC. This study has opened new avenues for employing CNNs as a tool for the pre-diagnosis/very early diagnosis of PDAC from CT scan images.
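The downstream measurement in that study, mean gland density and volume from a segmentation mask, reduces to a few lines of numpy; the function below is an illustrative sketch, with the array names assumed.

```python
# Sketch: mean Hounsfield-unit density and volume inside a pancreas mask.
import numpy as np

def pancreas_stats(ct_hu, mask, voxel_volume_mm3):
    """ct_hu: 3-D array of HU values; mask: boolean segmentation of the gland."""
    mean_hu = float(ct_hu[mask].mean())
    volume_ml = float(mask.sum() * voxel_volume_mm3 / 1000.0)
    return mean_hu, volume_ml
```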

In another attempt to classify PDAC, a regular CNN algorithm with four hidden layers was trained using CT images obtained from 222 affected individuals and 190 noncancerous individuals. Though a diagnostic accuracy of 95% was achieved using CNNs, it was not superior to the predictions made by human experts, indicating the need for an appropriate AI architecture for the classification of pancreatic cancer [89]. Zhang et al. [90] employed feature pyramid networks with a recurrent CNN (R-CNN) that could identify the sequential patterns and predict the subsequent patterns of a given data set for classifying PDAC from CT scan images. A dataset of 2890 CT images was employed for training the network, achieving a classification accuracy of about 94.5%. Though this method proved to be superior to the existing methods, it was limited by the input uncertainty that is generally associated with closed-source data. This drawback could be eliminated by using a public data set for training. In a more advanced variant, a 16-layer VGG16 CNN model was employed along with R-CNN to diagnose PDAC from 6084 enhanced CT scans obtained from 338 PDAC-affected individuals. The combination of VGG16 and R-CNN exhibited a high prediction accuracy of about 96%. Each CT image was processed by the R-CNN within 0.2 s, considerably faster than a clinical imaging expert [6]. Additionally, a deep learning algorithm was developed by Chen et al. [91] for detecting pancreatic cancers smaller than 2 cm on CT scans. The study showed that the CNN was effective in distinguishing patients with pancreatic cancer from individuals with a normal pancreas, achieving an 89.7% sensitivity and a 92.8% specificity. It also showed a sensitivity of 74.7% for the identification of pancreatic cancer malignancies smaller than 2 cm.

An attempt to employ CNN models to distinguish different kinds of pancreatic cysts was made using CT images from 206 patients. Among these individuals, 64 suffered from intraductal papillary mucinous neoplasms (IPMNs), 66 had been diagnosed with serous cystic neoplasms (SCNs), 35 had mucinous cystic neoplasms (MCNs), and 41 individuals suffered from solid pseudopapillary epithelial neoplasms (SPENs). The feature extraction from the CT images and the classification of the type of pancreatic cyst were accomplished using a densely connected convolutional network (Dense-Net) architecture, in which dense blocks connect each layer to every other layer by the shortest route so that each layer receives inputs from all preceding layers. The Dense-Net algorithm performed better than the conventional CNN model in discriminating between the different types of cysts, with the highest accuracy of 81.3% observed for IPMNs, followed by 75.8% for SCNs and 61% for SPENs [92]. Though the Dense-Net model outperformed the CNNs in all categories, the study lacked information on the tumour size and failed to provide reasons for the positive and negative errors encountered in the identification of the type of pancreatic cyst. The model needs to be tested rigorously with a wider range of cysts to understand its capability for discriminating between different types of pancreatic cysts if it is to be adopted in the clinics.

#### **7. Positron Emission Tomography (PET)**

Positron emission tomography (PET) employs short-lived radioisotope tracers that emit positrons. These positrons annihilate with electrons to generate photons, which are recorded to generate the PET image. The tracer can be differentially localized in various tissues by conjugation with a biomolecule for better target specificity [93,94]. The PET scans provide additional information about the functioning of an organ. Commonly employed tracers include 18F, 15O, 13N, and 11C, with half-lives of 109.74 min, 122.24 s, 9.97 min, and 20.38 min, respectively [95]. PET imaging has also been used to diagnose the recurrence of pancreatic cancer, as well as to understand the response of the cancer tissue to different therapeutic interventions. Despite several studies that have shown the diagnostic efficiency of PET scans for pancreatic cancers, with sensitivities of 85% and above [96], several factors, such as dysregulated glucose metabolism and inflammation, interfere with the sensitivity of diagnosis from PET images, resulting in false positives [97]. PET scans are also ineffective in diagnosing pancreatic cancers when the tumour mass has a diameter below 2 cm [98]. This necessitates the use of advanced AI-aided algorithms for the discrimination and classification of cancerous masses from PET scan images.

For imaging cancers, 18F-substituted glucose, or fluorodeoxyglucose (FDG), has been frequently used, due to the high consumption of glucose by cancer cells to meet their metabolic requirements [97]. PET scans have frequently been employed in conjunction with MRI or non-contrast CT, owing to their poor spatial resolution, for the diagnosis and staging of cancers [99]. To overcome challenges in discriminating cancerous lesions from non-contrast CT images, 18F-FDG PET/CT imaging of pancreatic cancers was used by Li et al. [100] in conjunction with an SVM algorithm. The region of interest (ROI) identified in the CT image of the pancreas was initially segmented using simple linear iterative clustering (SLIC), followed by feature extraction using dual-threshold principal component analysis (DT-PCA). Finally, a hybrid feedback-SVM-random forest algorithm (HFP-SVM-RF) was used to classify the pancreatic cancerous lesions. The random forest model is a type of supervised machine learning model that is widely used for classification and decision making. The hybrid model exhibited an accuracy of 96.5% when tested using the PET/CT images of 40 patients with pancreatic cancer and 40 non-cancer individuals. When tested using 82 public PET/CT scans, the hybrid algorithm exhibited similarity scores of 78.9% and 65.4% against the ground-truth contours using the Dice coefficient and Jaccard index, respectively, suggesting there is scope for further improvement in the diagnostic performance.
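The first stage of that pipeline, superpixel segmentation with SLIC, is available in scikit-image; the sketch below is illustrative only, and the DT-PCA feature extraction and hybrid SVM-random-forest classifier are not reproduced.

```python
# Sketch: SLIC superpixel segmentation of a grayscale slice with scikit-image.
import numpy as np
from skimage.segmentation import slic

slice_img = np.random.rand(128, 128)            # placeholder for a CT/PET slice
segments = slic(slice_img, n_segments=200, compactness=0.1, channel_axis=None)
print("superpixels:", segments.max() + 1)
```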

Radiomics is a feature extraction method that has been widely used in image processing tools. A combination of radiomics with machine learning was employed for the prognostic prediction of the survival rate from the 18F-FDG-PET scans of 138 patients with pancreatic cancer. A random forest model was used for the classification of 42 features extracted from the PET images. The model revealed the gray-level zone length matrix (GLZLM) gray-level non-uniformity (GLNU) of the images as the top factor influencing one-year survival, while the total lesion glycolysis ranked second. This information was used to stratify individuals into poor-prognosis groups with a high risk of mortality [101].

It is thus evident that every imaging technique will require customized, robust algorithms to extract the subtle but distinctive features of pancreatic cancer for accurate identification and stratification. The evolution of new ML algorithms continues to improve the sensitivity and selectivity of the diagnosis of pancreatic cancer at an early stage, thereby improving the survival chances of the affected individual. Table 2 lists some of the major studies using various AI-driven models for the diagnosis of pancreatic cancer.

**Table 2.** Summary of AI-driven models for pancreatic cancer diagnosis.




#### **8. Pancreatic Cancer Risk Prediction Using AI**

Since pancreatic cancer is a highly aggressive form of cancer that is largely asymptomatic in its early stages and tends to spread rapidly, leading to poor post-diagnosis survival, AI-based prediction of the risk of developing pancreatic cancer could be an immensely useful strategy for improving an individual's prognosis. Muhammad et al. [112] successfully employed ANNs on personal health data to predict and stratify pancreatic cancer risk as low, medium, or high, with a sensitivity and specificity of 80.7%. This study highlights the ability of AI-based predictive tools to manage pancreatic cancer risk effectively, even before the manifestation of symptoms. Similarly, Corral et al. [78] employed an AI algorithm to identify pancreatic cysts that pose a high risk of transforming into cancerous lesions. Such a pre-diagnosis could help clinicians design adequate preventive interventions to save lives. The detection of subtle textural and morphological changes in CT and MRI scans of the pancreas could also be facilitated through customized AI algorithms [117]. Several attempts have also been reported to employ AI tools to predict the risk of developing pancreatic cancer from biomarker measurements, as well as from abdominal scans, to discern pre-cancerous abnormalities [117].
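A minimal sketch of risk stratification from tabular personal health data is given below, using a small multilayer perceptron. The feature set, labels and network size are assumptions for illustration only; the cited study's ANN architecture and inputs are not reproduced here.

```python
# Sketch: MLP mapping personal health data to low/medium/high risk tiers.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
# Assumed columns: age, BMI, smoking (0/1), diabetes (0/1), family history (0/1).
X = np.column_stack([
    rng.normal(60, 10, 1000),
    rng.normal(27, 4, 1000),
    rng.integers(0, 2, 1000),
    rng.integers(0, 2, 1000),
    rng.integers(0, 2, 1000),
])
y = rng.integers(0, 3, 1000)   # 0 = low, 1 = medium, 2 = high risk (placeholder)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:5]))    # predicted per-individual risk tier
```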

#### **9. AI-Driven Diagnosis Based on Cancer Biomarkers**

The serological detection of PC is based on the quantification of a biomarker whose levels are altered in cancerous conditions. However, a single marker cannot accurately diagnose a specific type of cancer, as several other conditions can modulate the levels of that biomarker. Hence, multiple biomarkers need to be analysed to diagnose PC accurately. In an earlier work, protein markers from the serum of 27 normal individuals and 27 individuals diagnosed with pancreatic cancer were profiled using surface-enhanced laser desorption ionization (SELDI) and classified using a decision tree algorithm, on the basis of which six serum proteins were identified as pancreatic cancer biomarkers [118]. Carbohydrate antigen 19-9 (CA19-9) is the most extensively explored protein biomarker of pancreatic cancer. However, several studies have indicated that CA19-9 by itself is not an effective predictor of pancreatic cancer, and hence the search for additional diagnostic protein markers in serum is underway [119]. The analysis of microarray and next-generation sequencing datasets for gene expression or serum protein expression, using deep learning and machine learning algorithms, could aid in identifying the most promising protein biomarkers for the early detection of pancreatic cancer. For instance, an SVM-based algorithm combined with recursive feature elimination (RFE) was employed to screen gene expression datasets of 78 samples for additional pancreatic cancer biomarkers. Seven gene targets were short-listed; among these, the genes encoding FOS (a leucine zipper protein), MMP7 (matrix metalloproteinase-7), and A2M (alpha-2-macroglobulin) were predicted to be more accurate diagnostic markers for pancreatic cancer, not only in serum but also in urine samples [120]. Similarly, ANN-based methods have been employed to analyse the levels of key serum biomarkers implicated in PC, such as CA19-9, CA125, and carcinoembryonic antigen (CEA), in 913 samples obtained from individuals with a normal and a cancerous pancreas. The results showed an improved detection accuracy when compared with single-marker-based prediction, clearly highlighting the benefits of an AI-integrated multi-analyte diagnosis [121]. Exosomes, which are vesicular structures containing miRNA specific to their source cells, are gaining importance for disease diagnosis. Several exosome-entrapped miRNAs have been identified in PC, such as miR-16, miR-20a, miR-21, miR-21-5p, miR-24, miR-25, miR-99a, miR-133a, miR-185, miR-191, miR-196a, miR-223, miR-642b-3p, miR-663a, miR-1290, miR-1246, miR-5100, and miR-8073 [122]. In a seminal work, exosomes obtained from a panel of mouse- and human-origin PC cell lines were captured using antibodies against the surface-expressed EpCAM (epithelial cell adhesion molecule). The RNA cargo was isolated from the exosomes and the miRNA was identified using qPCR. The cancer miRNA signatures were identified using a custom-developed machine learning algorithm. The system was validated using samples isolated from individuals with a normal pancreas and those with pancreatic cancer, with good prediction accuracy [123]. In another study, a neural network algorithm was employed to screen 140 datasets of individuals diagnosed with pancreatic cancer for biomarkers in urine samples, namely REG1A/1B, LYVE1, TFF1, and CA19-9. Following training, the neural network predicted REG1A/1B as the most important biomarker in the urine samples, with an importance ratio exceeding 80% [124]. With the discovery of new circulating markers, such as glycoproteins, and genetic markers, such machine learning-based diagnosis could herald the rapid and accurate detection of PC.
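The SVM-RFE screening step described above can be sketched directly with scikit-learn: a linear-kernel SVM exposes per-gene weights that RFE uses to iteratively discard the weakest features. The gene panel and expression matrix below are synthetic placeholders, not data from the cited study.

```python
# Sketch: SVM-based recursive feature elimination for biomarker shortlisting.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(3)
genes = ["FOS", "MMP7", "A2M"] + [f"GENE_{i}" for i in range(97)]
X = rng.normal(size=(78, len(genes)))   # 78 samples x 100 genes (placeholder)
y = rng.integers(0, 2, 78)              # 1 = pancreatic cancer, 0 = control

# The linear kernel provides coefficients for RFE's ranking criterion.
selector = RFE(SVC(kernel="linear"), n_features_to_select=7, step=5)
selector.fit(X, y)

shortlist = [g for g, keep in zip(genes, selector.support_) if keep]
print(shortlist)                        # seven candidate diagnostic markers
```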

Histological analysis of tissue biopsies has been conventionally employed for the identification and stratification of cancers. However, this is a time-consuming process. Further, there is a constant increase in the number of samples sent for analysis to anatomical pathology laboratories, which, coupled with a shortage of skilled pathologists, leads to long turnaround times [125]. Additionally, cytopathology requires accurate slide preparation and optimal staining of the tissue slices; however, the staining intensity of biopsy slides varies with the analyst, the sample thickness, and the laboratory protocol [125]. In this context, deep learning algorithms, such as VGG, DenseNet, and ResNet, and machine learning algorithms based on SVM and the random forest can be employed to extract specific tumour features from the tissue slices, improving the speed of detection and reducing the burden on clinical pathologists. Similarly, algorithms such as SA-GAN (stain acclimatization generative adversarial network) employ a generator that imports the input source image and generates a target image incorporating the features of the input sample and the colour intensity of a training sample. Two discriminators are also incorporated into this deep learning model, which ensure that the colour intensity of the desired training image and the textural features of the source image are maintained in the generated image, thus ensuring stain colour normalization across different images [126]. Such approaches have been attempted to identify various types of gastrointestinal and breast cancer using mammograms and tissue biopsies [127]. Using a similar concept, a deep learning-based spiral algorithm was employed to transform 3D MRI images of pancreatic tissue into 2D images without compromising the original image texture and edge parameters. CNN-based models were employed for feature extraction, and a bilinear pooling module was used to improve the prediction accuracy. Parameters such as size, shape, volume, texture, and intensity were employed to classify an image as pancreatic cancer with or without TP53 gene mutation. The prediction results agreed well with the actual mutation status. This approach overcomes the need for painful biopsies to classify a tumour as TP53-positive. In addition, this novel method offers a non-invasive approach for predicting gene mutations using AI-driven cytopathology that may also be extended to other forms of cancer or other gene mutations [128]. Similarly, ResNet and DenseNet models have been employed to identify *Helicobacter pylori*, a key causative pathogen in different gastric cancers, from stained tissue biopsy specimens [129]. The advantage of machine learning models here over conventional cytopathology is their ability to identify even small numbers of the bacteria, which is very tedious and time-consuming in the conventional mode. Abnormal goblet cells have been identified with 86% accuracy in tissue samples of individuals with Barrett's esophagus using VGG algorithms [130]. AI-driven algorithms can also be useful in detecting microsatellite instabilities in biopsy samples, which are a hallmark of many forms of cancer [125]. These studies clearly demonstrate that the integration of machine learning into cytopathology can enable a faster, more efficient, and earlier diagnosis of pancreatic cancer. This field is steadily gaining prominence and may soon establish digital cytopathology as a mainstay in the detection and stratification of cancers.
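A common way the ResNet/DenseNet-style models mentioned above are applied to biopsy slides is transfer learning: a network pretrained on natural images is fine-tuned on histology tiles. The following is a minimal sketch under stated assumptions (a hypothetical `tiles/{tumour,normal}` folder layout, two classes, one training epoch shown); it is not the pipeline of any specific cited study.

```python
# Sketch: fine-tuning a pretrained ResNet-18 on histology tiles (PyTorch).
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # new tumour/normal head
model = model.to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# Hypothetical folder layout: tiles/{tumour,normal}/*.png
train_set = datasets.ImageFolder("tiles", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:                   # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images.to(device)), labels.to(device))
    loss.backward()
    optimizer.step()
```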

#### **10. Ethics of Using AI for Diagnosis**

Though AI offers a plethora of benefits in improving the detection and stratification of PC, several ethical concerns have emerged in sections of society over the extensive use of AI-based diagnostics. Since AI tools require large datasets for training and validation, concerns about data privacy and confidentiality have been raised. Additionally, data security and safety issues have been associated with the use of AI-based diagnosis [131]. There exists a regulatory vacuum in the realm of AI-based tool development, and no structured white paper is available on data collection, storage, processing, and sharing. Furthermore, frequent comparisons between expert predictions by clinicians and AI algorithms have given rise to concerns about the inadequate training, or de-skilling, of future clinicians owing to over-dependence on AI-based detection. A weakened patient–doctor relationship and an erosion of trust are additional issues that have been associated with the deployment of AI-driven technologies in healthcare [132]. Accountability and professional responsibility in the case of a wrong diagnosis by AI-based tools, which may have disastrous consequences, is another facet being debated as a negative aspect of AI-driven cancer diagnosis.

#### **11. Concluding Remarks**

The use of AI for cancer detection and biomarker discovery is expected to be the target of numerous research studies over the next decade. Several studies in this direction have clearly demonstrated the benefits of AI-driven detection of pancreatic cancer, especially those employing imaging tools. However, the widespread clinical deployment of this technology is yet to be realized, owing to the lack of large datasets with which to convincingly train and validate the developed algorithms. Most AI-based models have been developed in a black-box mode and, as a result, clinicians are unable to understand or explain the basis of identification or stratification, leading to a reticence in employing this technology. Ethical issues concerning data privacy and security have further slowed the clinical translation of AI-based diagnosis. However, the exponential growth witnessed in computing resources, including open-source tools, has triggered an avalanche of studies focused on developing more robust algorithms for the accurate, rapid, and early diagnosis of PC. As this field continues to grow, new regulatory policies concerning its use and deployment will emerge so that the benefits of this technology can be harnessed to save lives.

**Author Contributions:** Conceptualization, U.M.K.; resources, B.S.H. and U.M.K.; data curation, B.S.H.; writing—original draft preparation, B.S.H. and U.M.K.; writing—review and editing, B.S.H. and U.M.K.; visualization, U.M.K.; supervision, U.M.K.; project administration, U.M.K.; funding acquisition, U.M.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Nano Mission, Department of Science & Technology (grant number SR/NM/NS-1095/2012) and FIST, Department of Science & Technology (SR/FST/LSI-622/2014).

**Acknowledgments:** The authors gratefully acknowledge the financial support from Nano Mission, Department of Science & Technology and FIST, Department of Science & Technology. The authors also thank SASTRA Deemed University for financial and infrastructural support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Deep Learning for Automated Elective Lymph Node Level Segmentation for Head and Neck Cancer Radiotherapy**

**Victor I. J. Strijbis 1,2,\*, Max Dahele 1,2, Oliver J. Gurney-Champion 3,4, Gerrit J. Blom 1, Marije R. Vergeer 1, Berend J. Slotman 1,2 and Wilko F. A. R. Verbakel 1,2**


**Simple Summary:** When treating patients with head-and-neck cancer (HNC), in addition to the primary tumour, commonly involved lymph node (LN) levels are often electively irradiated. This requires the definition of the elective LN target volume. Because the LN levels that will be included in the target depend on the clinical situation, and because manual contouring is a laborious task that can also introduce inter- and intra-observer variation, being able to automate the segmentation of individual LN levels would reduce the clinical burden and would allow use of contours regardless of the primary tumor location. We trained and evaluated three patch- and/or voxel-based deep learning frameworks to segment elective LN levels. Our results suggest that accurate segmentations can be obtained using an ensemble of patch-based UNets and that this result can be further refined by sequentially applying a 2.5D, multi-view voxel classification network.

**Abstract:** Depending on the clinical situation, different combinations of lymph node (LN) levels define the elective LN target volume in head-and-neck cancer (HNC) radiotherapy. The accurate auto-contouring of individual LN levels could reduce the burden and variability of manual segmentation and could be used regardless of the primary tumor location. We evaluated three deep learning approaches for segmenting individual LN levels I–V, which were manually contoured on CT scans from 70 HNC patients. The networks were trained and evaluated using five-fold cross-validation and ensemble learning for 60 patients with (1) 3D patch-based UNets, (2) multi-view (MV) voxel classification networks, and (3) sequential UNet+MV. The performance was evaluated using Dice similarity coefficients (DSC) between automated and manual segmentations for individual levels and for the planning target volumes (PTV) extrapolated from the combined levels I–V and II–IV, both in the cross-validation and in an independent test set of 10 patients. The median DSCs were 0.80, 0.66 and 0.82 for UNet, MV and UNet+MV, respectively. Overall, UNet+MV significantly (*p* < 0.0001) outperformed the other arrangements and yielded DSC = 0.87, 0.85, 0.86, 0.82, 0.77, 0.77 for the combined and individual level I–V structures, respectively. Both PTVs were also significantly (*p* < 0.0001) more accurate with UNet+MV, with DSC = 0.91 and 0.90, respectively. The accurate segmentation of individual LN levels I–V can be achieved using an ensemble of UNets. UNet+MV can further refine this result.

**Keywords:** computed tomography; deep learning; head-and-neck cancer; lymph nodes; radiation oncology; auto-contouring

**Citation:** Strijbis, V.I.J.; Dahele, M.; Gurney-Champion, O.J.; Blom, G.J.; Vergeer, M.R.; Slotman, B.J.; Verbakel, W.F.A.R. Deep Learning for Automated Elective Lymph Node Level Segmentation for Head and Neck Cancer Radiotherapy. *Cancers* **2022**, *14*, 5501. https://doi.org/10.3390/cancers14225501

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 30 September 2022 Accepted: 7 November 2022 Published: 9 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Head-and-neck cancer (HNC) radiotherapy (RT) planning frequently includes the contouring of neck lymph nodes (LN) as a part of the elective RT target volume. However, the manual delineation of the elective target volume is a labour-intensive task that is prone to inter-observer variation [1], despite the availability of delineation guidelines [2], making automated methods attractive as an alternative to manual segmentation.

Over the last few years, developments in deep learning have shown impressive results for the automated segmentation of organs at risk (OAR) using convolutional neural networks (CNN) [3–6] and for pathology detection [7], including the deep learning-based delineation of elective targets such as combinations of neck LN levels, which has only more recently been investigated [8]. Most studies that demonstrated automated LN segmentation with deep learning incorporated all LN levels, or all levels relevant to the primary HNC location, in one structure, rather than focusing on individual LN levels [7–11]. Methods that segment multiple lymph levels in one structure, however, are not generalizable to all primary HNC locations and tumour stages, and they require separate networks for contouring different combinations of lymph node levels. Therefore, it would be desirable to have a more general and flexible approach that concurrently and accurately contours individual LN levels and hence can be used for all HNC patients, regardless of the subtype and the specific lymph levels required for RT treatment planning.

The automated segmentation of the LN levels is a challenging task because of anatomical limitations in the manual reference. The guidelines prescribe delineation based on anatomical markers in axial slices and assume that no voxels of levels II, III and IV can exist in the same axial plane, irrespective of the curvature and pitch of the neck. In addition, the LN target volumes do not encompass anatomical structures, but rather the expansions of groups of LNs.

In this work, three combinations of deep learning networks were investigated to segment individual LN levels I–V as separate structures. To do this, we evaluated the performance of two CNNs, alone and in combination. First, since UNet is a widely established CNN that is used for a variety of imaging-related problems [12] and since it was used in two other studies for combined lymph structure segmentation [9,13], we included a patch-based UNet variant as a baseline model configuration. Other works have suggested the use of voxel-classification methods for individual LN level segmentation using a 3D multi-scale network [14], as well as 2.5D (multi-view; MV) networks for several segmentation challenges (multiple sclerosis [15], ocular structures [16], abdominal lymph structures [17], head-and-neck tumors [18]). Because 2.5D networks may more effectively learn features in the presence of little data [19] and because voxel classification may better resolve local ambiguities near level transitions, a multi-view convolutional neural network (MV-CNN) was included as our second configuration. This method, however, appears limited by a systematic over-estimation of foreground classes [18]. Therefore, as our third configuration, UNet was used for foreground segmentation, and subsequently MV was used for classifying the foreground voxels into individual LN levels. This way, the over-estimation of foreground classes seen in MV models was effectively eliminated.

This work expands the existing literature by demonstrating the feasibility of deep learning for auto-segmentation for the target definition of individual LN levels I–V towards a flexible RT planning for locally advanced HNC. Based on earlier work, we estimate that accurate performance levels are attained for the segmentation of individual LN levels I–V with Dice similarity coefficient (DSC) of at least 0.8 [9,13,14] and we hypothesize that the contours can be obtained with such accuracy levels for the majority of patients, using one or more of the proposed deep learning configurations.

#### **2. Materials and Methods**

#### *2.1. Data Acquisition*

This retrospective study was exempted from requiring participants' informed consent by the medical ethics committee and was performed using the three-dimensional (3D) planning computed tomography (CT) scans (GE Discovery 590RT, helically scanned) of 70 patients treated between 2019 and 2022 with (chemo-)radiotherapy for locally advanced HNC, of which 60 were used for training and testing, and 10 were retained for an independent test set. We used isotropic, in-plane acquisition resolutions of [0.92–1.56] mm and a 2.5 mm slice thickness, except for two cases in which the slice thickness was 1.25 mm. The CT acquisition dimensions were 512 × 512 × (147–361) voxels. Patient-specific radiotherapy head-and-neck moulds and immobilisation masks were used to position the patients in a neutral position. Ground truth contours were created for the specific purpose of this study by manual contouring of individual LN levels I–V according to contouring guidelines [1], by two experienced radiation oncologists (GJB, MRV). During contouring, levels IV and V are regarded as the combinations of IVa, IVb and Va, Vb, respectively. No HNC disease stages or patients with positive LNs were excluded, provided that they had elective LN levels contoured for at least one side. In patients with only one side contoured, the LN level contours of the side that contained no diagnosed disease were added, such that all patients had all individual levels at both sides contoured.

#### *2.2. Pre-Processing*

For all patients, planning CTs and structure sets were initially interpolated to the same isotropic voxel spacing of 1.25 mm by 3rd-order and nearest-neighbour interpolation, respectively. This spacing was chosen to minimize image interpolations, whilst making sure the network's filters were of equal size in each orthogonal plane for all patients.
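The resampling step can be sketched with SimpleITK, assuming a cubic B-spline interpolator for the CT (3rd-order) and nearest-neighbour for the label map; the file names are hypothetical.

```python
# Sketch: resample CT and structure set to an isotropic 1.25 mm grid.
import SimpleITK as sitk

def resample_isotropic(image, spacing=(1.25, 1.25, 1.25), is_label=False):
    """Resample an image (or structure set) to isotropic voxel spacing."""
    old_size, old_spacing = image.GetSize(), image.GetSpacing()
    new_size = [int(round(sz * sp / ns)) for sz, sp, ns in
                zip(old_size, old_spacing, spacing)]
    interp = sitk.sitkNearestNeighbor if is_label else sitk.sitkBSpline
    return sitk.Resample(image, new_size, sitk.Transform(), interp,
                         image.GetOrigin(), spacing, image.GetDirection(),
                         0, image.GetPixelID())

ct = resample_isotropic(sitk.ReadImage("planning_ct.nii.gz"))
labels = resample_isotropic(sitk.ReadImage("ln_levels.nii.gz"), is_label=True)
```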

#### *2.3. Experimental Outline*

We investigated the performance of three model configurations, i.e., UNet, MV and UNet+MV (all shown in Figure 1). As a baseline reference, we investigated a multiclass, patch-based UNet, which concurrently classifies all lymph levels as separate classes in a single step. This was compared to a per-voxel classification approach that uses an MV-CNN, which is a 2.5D network that uses multiple resolutions of orthogonal views to classify the voxel where the planes cross. In the interest of time, this model used a preconstructed mask to provide the network with the information on which voxels it should consider for segmentation (cyan in Figure 1). Lastly, we investigated a two-step approach, which is essentially a combination of UNet and MV: we used a single-class UNet for segmenting the combined structure of LN levels I–V in an initial step, after which MV was applied only to the detected foreground voxels, classifying each voxel in the combined structure into individual lymph levels (UNet+MV). Schematic representations of the UNet and MV networks used are displayed in the blue and red boxes in Figures 1 and 2, respectively.

#### *2.4. Model Training*

Model training, validation and evaluation were performed on four NVIDIA-GeForce GTX 2080 TI graphics processor units (GPUs), a 64 GB RAM system with an Intel® Core™ i9-9900KF CPU @3.6 GHz processor, using the GPU version of TensorFlow (Version 2.2.0) with Cuda 10.1 and Python (Version 3.8.10). The TensorBoard (Version 2.2.2) callback was used for tracking the training and validation scores, whilst only the best model in terms of DSC was saved. The models were trained using the Adam optimizer [20]. All models were trained using standard values in Keras, with an initial learning rate of 0.001, β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.999 and ε = 1 × 10<sup>−7</sup>. To reduce the divergence of the model weights at later stages of training, an exponential learning rate decay scheduler was used to decrease the learning rate by 5% with every epoch, down to a minimum of 0.0001. Dropout was switched off at test time. All models were trained using 5-fold cross-validation, with a train/test split of 48/12 cases in every fold. To minimize the training variation, we used ensemble learning [9,21–23], where the highest cumulated in-class segmentation probability of 5 sequentially trained networks decided the final segmentation map. The training and evaluation times were saved.
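The optimizer settings above translate directly into TensorFlow 2 Keras code; the following is a minimal sketch (not the authors' training script) of the Adam configuration and the 5%-per-epoch decay with a 1 × 10<sup>−4</sup> floor.

```python
# Sketch: Adam with Keras defaults plus an exponential LR decay schedule.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-7)

def decay_5pct(epoch, lr):
    """Multiply the learning rate by 0.95 each epoch, down to 1e-4."""
    return max(lr * 0.95, 1e-4)

lr_callback = tf.keras.callbacks.LearningRateScheduler(decay_5pct)
# model.fit(..., epochs=100, callbacks=[lr_callback,
#           tf.keras.callbacks.TensorBoard(log_dir="logs")])
```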

**Figure 1.** Schematic overview of the experimental outline. UNet (blue boxes) and MV (red boxes) were used to make three model configurations. In the first configuration, a patch-based UNet segments the background and LN levels I–V directly from the planning CT. In the second configuration, MV classifies the background and LN levels I–V voxels from within a preconstructed mask (cyan). In UNet+MV, a patch-based UNet first segments the combined structure of LN levels I–V. This is subsequently used as a mask (cyan) for MV to subsequently classify positive voxels into individual levels I–V. The details of both models are given in Figure 2. Abbreviations: MV: multi-view; CT: computed tomography. (Also shown in Figure S2.)

#### 2.4.1. UNet

The network that was used is an adaptation of a vanilla UNet [12], where residual blocks were added to reduce the effect of vanishing gradients in deeper layers of the model [24,25], similar to those used by Milletari et al. [26]. Batch normalization was performed after every (3 × 3 × 3) 3D convolution, before the non-linear activation function. We used patch-based training of the 3D UNet to ensure the network fit in the memory of our video card [27,28]. During training, patches of 64 × 64 × 64 voxels were sampled randomly from two pre-defined, unilateral regions of interest (ROI) of 280 × 200 × 280 mm<sup>3</sup> in volume that were known to contain the combined structure of LN levels I–V for every patient on each side. Binary and multi-class Dice loss functions were used for optimization. The multi-class DSC loss was defined as the weighted sum of individual foreground class losses (Equation (1)):

$$DSC_{loss} = \sum_{m=1}^{M} W_m \cdot DL_m \tag{1}$$

Here, *Wm* are the class weights, calculated using Python's scikit-learn module [29], and *m* ranges from 1 to *M*, where *m* denotes class indices and *M* is the number of classes. *DLm* is the *DSC* loss for class *m*, defined as 1 minus the *DSC* score (Equation (2)):

$$DL_m = 1 - \frac{2 \cdot |A_m \cap B_m|}{|A_m| + |B_m|} \tag{2}$$

Here, *Am* and *Bm* denote the predicted and manual reference binary sets of class *m*, respectively. In the case of binary segmentation, *DSCloss* reduces to the latter loss function. For patches that contain a limited number of foreground voxels, *DSCloss* becomes ill-defined (the denominator in *DLm* is not constrained to values larger than 0). To ameliorate this, we used a Gaussian sampling method, where the mean and standard deviation of the x, y and z coordinates are calculated from the centre of mass of the combined, binary structure of LN levels I–V of all patients. Subsequently, we used a truncated normal distribution to sample patches, such that they were constrained to be entirely within the region of interest. The weights were initialized using the standard initialization method in Keras (Glorot uniform initialization). The models were optimized for 100 epochs. However, it should be noted that the notion of an epoch in a patch-based setting is arbitrary, because patch sampling is performed at random, and thus a different sub-set of all data is seen by the network in each epoch. The number of training pairs seen by the network per epoch was set to 4096, which corresponded to roughly 34 training patches per side per patient.

**Figure 2.** Schematic overview of the UNet (blue box) and MV (red box) networks. UNet consists of an encoder (**left**) and a decoder (**right**) pathway that generates binary segmentation maps from 64 cubed voxel patches sampled from planning CTs. MV uses three multi-view branches that build up to each anatomical plane within a scale block, the output of which is concatenated and used as the input for the multi-scale branched architecture. The thickness of the convolutional blocks corresponds to the number of filters used. The number of output classes (M) was six for UNet in the UNet-only configuration and two for UNet in the UNet+MV configuration. M was six for MV in the MV-only configuration and five in the UNet+MV configuration. Abbreviations: MV: multi-view; ch: number of channels; BN: batch normalization; ReLU: rectified linear unit; f: number of output filters; M: number of output classes; K: convolution kernel size; S: convolution stride; p: dropout fraction; CT: computed tomography.
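A weighted multi-class Dice loss consistent with Equations (1) and (2) can be sketched as follows for TensorFlow/Keras, assuming one-hot targets of shape (batch, x, y, z, M) with channel 0 as background. The small epsilon shown is one common guard against the ill-defined denominator noted above; the authors instead addressed this via Gaussian patch sampling.

```python
# Sketch: weighted multi-class Dice loss, Equations (1)-(2).
import tensorflow as tf

def multiclass_dice_loss(weights, eps=1e-6):
    w = tf.constant(weights, dtype=tf.float32)   # shape (M-1,): foreground W_m
    def loss(y_true, y_pred):
        axes = [1, 2, 3]                         # sum over spatial dimensions
        inter = tf.reduce_sum(y_true * y_pred, axis=axes)
        sizes = tf.reduce_sum(y_true + y_pred, axis=axes)
        dice = (2.0 * inter + eps) / (sizes + eps)   # eps guards empty patches
        dl = 1.0 - dice[:, 1:]                   # per-class DL_m, skip background
        return tf.reduce_mean(tf.reduce_sum(w * dl, axis=-1))
    return loss

# Usage: model.compile(optimizer=optimizer, loss=multiclass_dice_loss(class_weights))
```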

#### 2.4.2. Multi-View

MV-CNN is a voxel-wise classification method, for which we predefined which voxels to classify. For the MV-only configuration, this information was provided by a pre-constructed mask, indicated in cyan in Figure 1, which was constructed by a uniform expansion of the manual reference by a margin of 15 mm. This margin was chosen as a balance such that no foreground voxels would be segmented at the border of this mask, while also minimizing the training and evaluation times. In contrast, for UNet+MV, the pre-constructed mask was determined by the foreground segmentation result of UNet. Our multi-view network was adapted from a previous classification study [16]. Batch normalization was applied after every (3 × 3) 2D convolution layer, before the non-linear activation function. Three context pyramid scales, 0, 1 and 2, were included to incorporate multi-view information from 4, 8 and 16 cm around the query voxel, respectively. This was done by sampling every, every other and every fourth voxel for scales 0, 1 and 2, respectively, for each view. Fewer pyramid scales yielded inferior results, and more pyramid scales would cause the field of view to fall far outside the ROI. The loss function used for voxel classification was categorical cross-entropy (CCE; Equation (3)):

$$H(p,q) = -\sum_{m=1}^{M} \sum_{a \in A} p(a,m) \log(q(a,m)) \tag{3}$$

where *p*(*a*,*m*) represents the reference distribution over *a* ∈ *A*, given by the manual annotations, *q*(*a*,*m*) is the query distribution, *A* is the set of observations, *m* denotes class indices and ranges from 1 to *M*, and *M* is the number of classes. The network was optimized for 1000 epochs (batch size = 32). In every epoch, a different random sub-set of at most 20% of all training pairs was sampled to allow for varied training and validation. Random over-sampling of minority classes was applied to reduce the effects of class imbalance.
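The multi-scale, multi-view sampling described above can be illustrated in NumPy: for a query voxel, 32 × 32 orthogonal patches are extracted at three strides (every, every other, every fourth voxel), spanning roughly 4, 8 and 16 cm at 1.25 mm spacing. The patch size and edge-padding behaviour are assumptions of this sketch.

```python
# Sketch: 2.5D multi-view, multi-scale patch extraction around a query voxel.
import numpy as np

def multiview_patches(volume, center, half=16, strides=(1, 2, 4)):
    """Return a list over scales; each entry stacks three orthogonal views."""
    pad = half * max(strides)
    vol = np.pad(volume, pad, mode="edge")       # guard against border voxels
    z, y, x = (c + pad for c in center)
    out = []
    for s in strides:
        r = np.arange(-half, half) * s           # sample every s-th voxel
        axial    = vol[z,               y + r[:, None], x + r[None, :]]
        coronal  = vol[z + r[:, None],  y,              x + r[None, :]]
        sagittal = vol[z + r[:, None],  y + r[None, :], x]
        out.append(np.stack([axial, coronal, sagittal], axis=-1))  # (32, 32, 3)
    return out
```

Each scale block would then feed a separate branch whose outputs are concatenated, per Figure 2.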

#### 2.4.3. Data Augmentation

Data augmentations were the same for all models and were performed on the fly. Augmentation involved random flipping, rotation and contrast adaptation, with the chances of each augmentation occurring being 50%, 40% and 40%, respectively. Flipping was carried out in the left–right direction. Rotation was applied in either the sagittal or the transversal plane, with an angle uniformly sampled from [−5, +5] degrees. Rotated images were acquired by 3rd-order spline interpolation for the CT image and by nearest-neighbour interpolation for the corresponding segmentation maps. The default window level center (C) [width (W)] was 0 [700], as was previously used for lymph structure segmentation [9]. If contrast adaptation was applied, alternative window level center and width were sampled from normal distributions with μ<sub>C</sub> = 0, σ<sub>C</sub> = 3% × 700 and μ<sub>W</sub> = 700, σ<sub>W</sub> = 3% × 700, respectively.
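A minimal sketch of these on-the-fly augmentations is given below, using scipy for the spline/nearest-neighbour rotations. The axis conventions (last axis = left–right) and the final windowing/normalization step are assumptions of this sketch.

```python
# Sketch: random flip (p=0.5), rotation (p=0.4), contrast jitter (p=0.4).
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(ct, seg, level=0.0, width=700.0):
    if rng.random() < 0.5:                       # left-right flip
        ct, seg = np.flip(ct, axis=-1), np.flip(seg, axis=-1)
    if rng.random() < 0.4:                       # +/- 5 degree rotation
        angle = rng.uniform(-5, 5)
        plane = [(0, 1), (1, 2)][rng.integers(0, 2)]   # sagittal or transversal
        ct = rotate(ct, angle, axes=plane, order=3, reshape=False)
        seg = rotate(seg, angle, axes=plane, order=0, reshape=False)
    if rng.random() < 0.4:                       # window level/width jitter
        level = rng.normal(0.0, 0.03 * 700)
        width = rng.normal(700.0, 0.03 * 700)
    lo, hi = level - width / 2, level + width / 2
    ct = np.clip((ct - lo) / (hi - lo), 0.0, 1.0)    # window, then normalize
    return ct, seg
```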

#### *2.5. Post-Processing*

In all segmentation maps, the combined structure of LN levels I–V was post-processed with hole filling and by subsequently removing all but the largest connected component. To investigate the agreement in the resulting planning target volumes (PTV), the resulting segmentations of the combined structures of LN levels I–V and II–IV were expanded by a margin of 4 mm and denoted as PI–PV and PII–PIV. These two PTVs were chosen because they are used when planning the majority of HNC sub-types.
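Both steps have compact scipy equivalents; the following sketch assumes scipy.ndimage and approximates the 4 mm margin by iterated binary dilation on the 1.25 mm grid.

```python
# Sketch: hole filling + largest connected component, then a PTV margin.
import numpy as np
from scipy import ndimage

def postprocess(mask):
    mask = ndimage.binary_fill_holes(mask)
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    return labeled == (np.argmax(sizes) + 1)     # keep largest component only

def expand_ptv(mask, margin_mm=4.0, spacing_mm=1.25):
    """Approximate isotropic expansion by repeated binary dilation."""
    iterations = int(round(margin_mm / spacing_mm))
    return ndimage.binary_dilation(mask, iterations=iterations)
```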

#### *2.6. Evaluation and Statistical Analysis*

The evaluation of a full 3D image by UNet was achieved by sliding the 64 × 64 × 64 UNet field of view over the image with stride 32 and subsequently only evaluating the central 32 × 32 × 32 voxels. By doing this, we ensured that the network had sufficient context for reliable inferences, while also making sure that each voxel was classified exactly once. The spatial performance of all models was measured by using DSC, Hausdorff distance (HD) and mean surface distance (MSD) between predictions and manual contours and between the PTVs that resulted from the predictions and manual contours. Because the measures were not normally distributed upon histogram inspection and an omnibus test of normality [30], the differences in spatial performance were evaluated by a two-sided Wilcoxon signed-rank test. Bonferroni correction was applied for each model and spatial metric separately to account for multiple comparisons. The volumetric agreement was assessed with intra-class correlation [31] (ICC; two-way mixed effects, single measurement, consistency) coefficients and the volume outside of the manual contour. Finally, cases with a median DSC in the lowest quartile of the UNet+MV configuration were qualitatively reviewed by GJB. Cases from each quartile (Q1–Q3), as well as several informative examples, were chosen for display, such as one patient who underwent laryngectomy surgery. This case was included during training to maximize the number of training samples but was omitted from the calculations of the model performance metrics, because the anatomical landmarks normally required for manual contouring were not present in this patient's anatomy.
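The tiled whole-image inference scheme (64-cubed window, stride 32, central 32-cubed crop) can be sketched as below; `predict_fn` is a hypothetical callable standing in for the trained network, and the padding strategy is an assumption of this sketch.

```python
# Sketch: sliding-window inference keeping only the central prediction.
import numpy as np

def tiled_inference(predict_fn, volume, patch=64, stride=32):
    """predict_fn maps a (64, 64, 64) patch to per-voxel class labels."""
    pad = (patch - stride) // 2                  # 16-voxel context rim
    dims = [int(np.ceil(d / stride)) * stride for d in volume.shape]
    grow = [nd - d for nd, d in zip(dims, volume.shape)]
    vol = np.pad(volume, [(pad, pad + g) for g in grow], mode="edge")
    out = np.zeros(dims, dtype=np.int64)
    for z in range(0, dims[0], stride):
        for y in range(0, dims[1], stride):
            for x in range(0, dims[2], stride):
                tile = vol[z:z + patch, y:y + patch, x:x + patch]
                pred = predict_fn(tile)          # full-patch prediction
                out[z:z + stride, y:y + stride, x:x + stride] = \
                    pred[pad:pad + stride, pad:pad + stride, pad:pad + stride]
    return out[:volume.shape[0], :volume.shape[1], :volume.shape[2]]
```

This way each voxel is predicted exactly once, always with a 16-voxel context rim around its tile.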

#### *2.7. Independent Validation*

To assess the model generalizability, the two best performing models (UNet and UNet+MV) were tested on the independent test set of 10 patients. These were unique samples that were not seen or used during the model development. For this independent testing, the UNet and UNet+MV models were re-trained using the complete cross-validation dataset (60 patients) as the training data. All training and evaluation settings were identical to the cross-validation setting.

#### **3. Results**

In the cross-validation set, the mean age ± standard deviation was 64.0 ± 10.4 (*N* = 49) and 58.5 ± 4.9 (*N* = 11) years for males and females, respectively. UNet and UNet+MV showed better agreement with the manual reference than MV for the complete LN structure, all individual LN levels and both PTVs (Table 1). UNet+MV typically showed better segmentation performance for the combined LN structure, individual levels II–IV and both PTV structures (Figures 3–6; Table 1). In addition, UNet+MV showed the highest volumetric agreement with the manual reference for all structures (Figure 4). Overall, UNet+MV significantly (*p* < 0.0001) outperformed the other models, with the DSCs (median [interquartile range (Q1–Q3)]) of all individual LN structures present in the dataset being 0.804 [0.763–0.814], 0.658 [0.616–0.678] and 0.821 [0.769–0.831] for the models UNet, MV and UNet+MV, respectively. Even with some deformation, e.g., a patient not aligned straight in the mask, median-level DSC results were attained (e.g., Figure 3, second column). MV often (Figure 3, Ax. 1 and Cor. 1 rows) overestimated the segmented combined LN volume medially.

UNet+MV showed significantly higher DSCs for the complete LN level I–V structure, individual levels II–IV and both PTV structures (Figure 5, *p*-values in figure). However, UNet showed higher spatial agreement with the manual reference for LN level I. The transitions of LN levels II–III by UNet+MV typically agreed most strongly with the manual reference. All models commonly disagreed with the manual reference on the caudal and cranial ends of LN level V. In addition, there existed a substantial disagreement on the lateral and dorsal ends of this structure in the model predictions. The models benefitted marginally from ensembling all model configurations for all classes, as the results from the model ensembles were more consistent (Table 1). The models were optimized for a median [range] of 10.3 [9.6–10.9] h, except for MV-only, which was optimized for 19.8 [18.7–21.2] h. The inference time for all UNet models was 2.1 [1.8–2.4] minutes, whereas the MV inference time, which is proportional to the size of the input mask, was 6.0 [5.4–7.0] and 1.0 [0.8–1.3] minutes per patient for the MV-only and UNet+MV configurations, respectively.

**Table 1.** The reported values denote the range of median DSCs produced by five individual models and ensemble model combinations of the UNet, MV and UNet+MV configurations after post-processing. Ensemble results that showed higher spatial agreement than the most accurate individual model are denoted in bold. Ensembles increased result consistency and typically outperformed any of the standalone models for all configurations. Abbreviations: MV: multi-view; Ind.: individual; Ens: ensemble; LN: lymph node.


By visually comparing the model and manual reference contour pairs in the worst-performing quartile (*N* = 15), several trends were observed. First, the manual reference was judged to be suboptimal (i.e., not according to the contouring guidelines; Figure 6A–E) for at least one level in 6/15 cases. In these six patients, one, four, two, one and three inaccuracies were found in the respective LN levels I–V. Second, the level II–III transitions predicted by UNet+MV were typically more accurate than those obtained from the manual reference, and UNet+MV also often outperformed UNet at this transition (Figure 6F–I). Third, the predictions of LN level II by UNet and UNet+MV were visually more accurate than those of the manual reference at the cranial limit (Figure 6J). Fourth, the automated methods showed a large variation in disagreement with the manual reference for LN level V (Figure 6A,E,H,I,L). In cases where the automated methods showed considerable disagreement with the manual reference, pitch, rotation and/or tilt were often underlying confounders (Figure 6K–M), especially for LN level V (Figure 6M), or there were anatomical variations such as malnourishment (Figure 6F) and laryngectomy (with fewer anatomical landmarks available; Figure 6N,O). Among cases of the first quartile, there were no particularities in the manual reference. Cases with a coronal tilt showed disagreement in contralateral structures of the same level (Figure 6M).

In the independent test, the mean age ± standard deviation was 66.3 ± 10.1 (*N* = 7) and 64.3 ± 13.6 (*N* = 3) years for males and females, respectively. The median [interquartile range (Q1–Q3)] DSCs of all individual LN level structures were 0.769 [0.703–0.834] and 0.809 [0.729–0.852] by the UNet and UNet+MV configurations, respectively, and differed significantly (*p* < 0.0001). UNet+MV showed significantly higher DSCs for the complete I–V structure, LN levels II and III, as well as both extrapolated PTVs (Table 1; Figure 7). For reference, volumetric performances of the independent test set are included in Figure S1.

**Figure 3.** Example segmentations selected from the first (Q1), second (Q2) and third (Q3) quartile in terms of DSC averaged over individual LN levels I–V. The filled region is the manual reference. The solid, dashed and dotted lines correspond to the predictions of the model configurations of UNet, MV and UNet+MV, respectively. LN levels I–V are indicated in pink, blue, green, red and yellow, respectively. The low average DSC in Q1 was in part attributed to an error in the manual reference level III–IV transition. Abbreviations: DSC: dice similarity coefficient; LN: lymph node.

**Figure 4.** Predicted and manual reference volumes for all structures. Abbreviations: ICC: intra-class correlation (two-way mixed, single measures, consistency).

**Figure 5.** *Cont.*

**Figure 5.** Spatial performances of UNet, MV and UNet+MV model configurations for DSC, HD and MSD measures. Statistical significance marking of the MV configuration was omitted because differences between MV and other model configurations were always significant. Structures for which differences between UNet and UNet+MV were statistically significant are denoted by significance bars. \*: *p* < 0.05; \*\*: *p* < 0.01; \*\*\*: *p* < 0.001; \*\*\*\*: *p* < 0.0001; Abbreviations: DSC: dice similarity coefficient; MV: multi-view; HD: Hausdorff distance; MSD: mean surface distance.

**Figure 6.** Examples from the worst-performing quartile samples in terms of DSC averaged over individual LN levels I–V. The filled region is the manual reference. The solid, dashed and dotted lines correspond to the predictions of the UNet, MV and UNet+MV model configurations, respectively. LN levels I–V are indicated in pink, blue, green, red and yellow, respectively. Arrows indicate specific locations of interest. Abbreviations: DSC: dice similarity coefficient; LN: lymph node.

**Figure 7.** UNet and UNet+MV spatial model performances in the independent test. Structures for which differences between model configurations were statistically significant are denoted by significance bars. \*\*\*: *p* < 0.001; \*\*\*\*: *p* < 0.0001; Abbreviations: DSC: Dice similarity coefficient; MV: multi-view.

#### **4. Discussion**

Our results suggest that accurate contours of individual LN levels I–V can be obtained using UNet (complete I–V structure median DSC = 0.859; individual structure DSC = 0.804), and that these results can be further refined by using a UNet+MV sequential model (complete I–V structure DSC = 0.866; individual structure DSC = 0.821). Despite a limited gain compared to UNet, UNet+MV exhibited a significantly better spatial performance for the complete I–V structure, individual levels II–IV and both PTV structures, and better volumetric performance for all structures. Comparable results were achieved using an independent test set for the model configurations UNet and UNet+MV, suggesting that the models have the ability to generalize beyond the data used for model training and development.

These results, however, should be interpreted with some care. A review of patients with a median DSC in the lowest quartile (*N* = 15) highlighted cases where the automated methods were factually closer to the truth than the manual reference, due to inconsistencies in the manual reference that arose from patient angulation and anatomical limitations in the contouring guidelines (Figure 6A–M). In addition, all models were considerably less accurate for levels IV and V. Several factors may have contributed to this. First, it is known that DSC depends on structure size [32]; therefore, the small volumes of levels IV–V likely negatively influenced DSC, which was especially true for malnourished patients (Figure 6M). Such a case was observed in the independent test, where LN level V had a manual reference volume below the typical range (5 mL) and was almost completely missed (Figure 7; Figure S1, LN level V). Second, despite the measures that were taken to prevent most patient angulation during scanning, considerable patient angulation was sometimes seen. This could be due to anatomical variations and to some patients' inability to lie with their head down. This may also have contributed to a larger variation in the manual reference and may have led to disagreements between the predictions and the manual reference. This problem has recently been addressed in another study by Weissmann et al. [13]. Because the contouring guidelines do not take into account the curvature of the neck and the patient's pitch, tilt and rotation, it can be argued that the predictions may be more factually "correct" than the manual reference when this is the case. Alternatively, if the goal of DL methods is to emulate the contouring guidelines, the networks could be trained using explicit information on slice orientation. Variations in slice plane orientation are especially problematic for level V, because, for example, the lower axial end of this structure's contour in the manual reference is defined by the "plane just below the transverse cervical vessel" [2,33]. This caused larger inconsistencies in the manual reference for LN level V in highly pitched patients, compared to the other levels. The same holds for the contralateral structures of the same level in patients with a coronal tilt. The current guidelines prescribe that, clinically, the level contours of both sides start at the same axial slice, even though such a tilt leads to different predictions for either side by the automated methods. Similarly, although the predictions generally show disagreement at the caudal end of level IV and at both axial ends and the dorsal border of level V, it should not be concluded that the predictions are inaccurate for these regions. Rather, the way the contouring guidelines were set up can cause peculiarities for patients with a large pitch and/or tilt when compared with more standardized, automated methods. Although it could seem like the rational step to take, it is not a given that redefining the contouring guidelines to be less dependent on anatomical landmarks in a certain slice and on patient angulation would be better for clinical practice. Such guidelines would be more labour-intensive for the clinician, who would need to consider the patient's 3D information more strongly. However, such an approach may result in more accurate data, which, in the long run, will be more informative to the network and result in more consistent contours.

To put the results of this study into perspective, we compared our results to others in the relevant literature on automated segmentation of combined lymph levels, which reported a mean DSC range of 0.64–0.82 [34]. Commercially available contouring software (Limbus Contour build 1.0.22) was evaluated for the neck lymph nodal structures [11], but it was reported that the performance could still be improved (mean DSC = 0.75). Cardenas et al. reported an accurate segmentation performance for the combined LN level I–V and II–IV clinical target volumes (CTV; both DSC = 0.90) [9], but it should be noted that an inspection of example segmentations suggested that these structures more closely resembled PTV structures from our institute. We believe that our finding of the PTV overlap of UNet and UNet+MV (PTV I–V and II–IV DSCs = 0.91 and 0.90, respectively) is in line with, if not better than, the segmented structures reported by Cardenas et al. To the best of our knowledge, the work of Van der Veen et al. [14] was the first to involve the automated segmentation of individual levels I and V and reported segmentation accuracies (without expert intervention) of DSC = 0.73, 0.61 and 0.79 for levels I and V and the combined II–IV structure, respectively. Interestingly, however, these results seem to more closely resemble the results obtained with our second configuration (level I, V DSCs = 0.70, 0.61, respectively). This is not unexpected, because the MV configuration involves a direct voxel classification method that uses multiple scales, similar to the method proposed by Van der Veen et al., but differs in its 2.5D convolution kernel, whereas Van der Veen et al. used a fully 3D kernel.

The model application times are sufficient for clinical use but can still be improved. Typical whole-image full segmentation by a UNet takes on the order of seconds, but since this UNet was trained in a patch-based fashion, it required application to all parts of the image, such that each part of the image was seen by the 32 × 32 × 32 center patch exactly once. This procedure was not optimized for speed and could likely still be accelerated considerably. Similarly, the MV models were not optimized for speed. For example, when processing neighbouring voxels, there existed much overlap between the extracted patches, even though each patch was extracted separately in the current implementation.

Our research has some limitations. First, we only indirectly investigated the implications of model predictions for RT treatment planning by investigating the overlap of the two predicted PTVs with the manual reference. Future work may investigate whether the predicted volumes lead to improved dose–volume histograms in OARs and target volumes when used in a treatment planning system. Second, we did not include LN levels VI and VII because these are less frequently used clinically. Since these are central levels and require a larger region of interest to be considered for learning, deep learning frameworks aiming to include these structures may focus on patch-based training with sampling from both sides simultaneously or on defining two left/right ROIs and one central ROI.

#### **5. Conclusions**

We demonstrated that a UNet can accurately (DSC > 0.8) segment individual LN levels I–V for the majority of patients and that this result can be further refined by using a UNet for the segmentation of foreground structures, followed by a sequential voxel classification network. With this generalized approach, any set of lymph levels can be combined to define patient-specific LN level target structures. When dealing with angulated patients, one should be aware that the current contouring guidelines can lead to situations where the LN level contours may become inconsistent, which may be prevented by using more standardized, automated deep learning methods. Future work should investigate whether clinically acceptable RT plans can be obtained using predicted contours.

**Supplementary Materials:** The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/cancers14225501/s1, Figure S1: Predicted and manual reference volumes for all structures resulting from the independent test set. Abbreviations: ICC: intra-class correlation (two-way mixed, single measures, consistency). Figure S2: Graphical representation of the summary of this work.

**Author Contributions:** Conceptualization, V.I.J.S., M.D. and W.F.A.R.V.; Methodology: V.I.J.S., M.D., O.J.G.-C. and W.F.A.R.V.; Software: V.I.J.S. and O.J.G.-C.; Validation: V.I.J.S.; Formal analysis: V.I.J.S.; Investigation: V.I.J.S., M.D., O.J.G.-C. and W.F.A.R.V.; Resources: G.J.B., B.J.S. and M.R.V.; Data curation: V.I.J.S., G.J.B. and M.R.V.; Writing—original draft preparation: V.I.J.S.; Writing—review and editing: V.I.J.S., M.D., O.J.G.-C., G.J.B., B.J.S. and W.F.A.R.V.; Visualization: V.I.J.S.; Supervision: M.D., O.J.G.-C. and W.F.A.R.V.; Project administration: W.F.A.R.V. and B.J.S.; Funding acquisition: M.D., B.J.S. and W.F.A.R.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by a research grant from Varian Medical Systems.

**Institutional Review Board Statement:** Due to the retrospective nature of this study, which did not involve tests or different treatments on human individuals, and the use of pseudonymized data, the medical ethics committee concluded that the Medical Research Involving Human Subjects Act (WMO) does not apply to this study and that official approval of this study by the committee is not required.

**Informed Consent Statement:** This retrospective study was exempted from requiring participants' informed consent by the medical ethics committee.

**Data Availability Statement:** The data used in this study are not made publicly available for reasons of participant consent.

**Conflicts of Interest:** This work was funded by a research grant from Varian Medical Systems. W.F.A.R.V. and M.D. have received speakers honoraria/travel expenses from Varian Medical Systems outside the current work. O.J.G. was funded by the Dutch cancer society (KWF) grant numbers KWF-UVA 2017.10873 and KWF-UVA 2021.13785. KWF funding was outside the scope of this work.

#### **References**


### *Article* **DECT-CLUST: Dual-Energy CT Image Clustering and Application to Head and Neck Squamous Cell Carcinoma Segmentation**

**Faicel Chamroukhi 1,\*, Segolene Brivet 2, Peter Savadjiev 3, Mark Coates <sup>2</sup> and Reza Forghani 3,4**


**Abstract:** Dual-energy computed tomography (DECT) is an advanced CT scanning technique enabling material characterization not possible with conventional CT scans. It allows the reconstruction of energy decay curves at each 3D image voxel, representing varied image attenuation at different effective scanning energy levels. In this paper, we develop novel unsupervised learning techniques, based on mixture models and functional data analysis models, for the clustering of DECT images. We design functional mixture models that integrate spatial image context in the mixture weights, with the mixture component densities being constructed upon the DECT energy decay curves as functional observations. We develop dedicated expectation–maximization algorithms for the maximum likelihood estimation of the model parameters. To our knowledge, this is the first article to develop statistical functional data analysis and model-based clustering techniques to take advantage of the full spectral information provided by DECT. We evaluate the application of DECT to head and neck squamous cell carcinoma. Current image-based evaluation of these tumors in clinical practice is largely qualitative, based on a visual assessment of tumor anatomic extent and basic one- or two-dimensional tumor size measurements. We evaluate our methods on 91 head and neck cancer DECT scans and compare our unsupervised clustering results to tumor contours traced manually by radiologists, as well as to several baseline algorithms. Given the inter-rater variability even among experts at delineating head and neck tumors, and given the potential importance of tissue reactions surrounding the tumor itself, our proposed methodology has the potential to add value in downstream machine learning applications for clinical outcome prediction based on DECT data in head and neck cancer.

**Keywords:** spectral image clustering; dual-energy CT imaging; mixture models; functional data analysis; HNSCC cancer

**Citation:** Chamroukhi, F.; Brivet, S.; Savadjiev, P.; Coates, M.; Forghani, R. DECT-CLUST: Dual-Energy CT Image Clustering and Application to Head and Neck Squamous Cell Carcinoma Segmentation. *Diagnostics* **2022**, *12*, 3072. https://doi.org/10.3390/diagnostics12123072

Academic Editor: Kalevi Kairemo

Received: 28 October 2022 Accepted: 1 December 2022 Published: 6 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Computed tomography (CT) has been one of the most common and widespread imaging techniques used in the clinic for the last few decades. There is increasing interest in a more advanced CT technique known as dual-energy CT (DECT) or spectral CT, which enables additional material or tissue characterization beyond what is possible with conventional CT. In conventional CT, X-rays are emitted at a single energy level, whereas in DECT they are emitted at two separate energy levels, which brings important benefits compared to standard CT. First, since different materials can have different attenuation coefficients at different energy levels, DECT allows for the separation of materials with different atomic numbers. In particular, DECT enables the computation of image attenuation levels at multiple effective energy levels. This results in the association of a decay curve with each reconstructed image voxel, representing energy-dependent changes in attenuation at that body location. Conventional CT imaging is the first-line modality for the clinical evaluation of many different types of known or suspected cancers in adults [1]. However, because of the properties described above, there has been an increased interest in the use of DECT in oncology in recent years, as it provides a new and exciting way of characterizing tumors as well as their surrounding tissues.

Current expert evaluation of CT scans of head and neck cancer patients in clinical practice is largely based on qualitative image evaluation for the delineation of the tumor anatomic extent and basic two- or three-dimensional measurements. However, an increasing body of evidence suggests that quantitative texture or radiomic features extracted from CT images can be used to enhance diagnostic evaluation, including the prediction of tumor molecular phenotypes, prediction of therapy response, and outcome prediction [2–5]. The typical radiomic approach for tumor evaluation can be separated into two steps: (i) identification and segmentation of the tumor in the image; and (ii) prediction of a clinical endpoint of interest based on features extracted from the segmented image region. As such, the ability of the segmentation algorithm to correctly target the tumor region is the first essential step in this process. If performed by an expert, manual processing is prohibitively time-consuming and prone to intra- and inter-observer variability. This step is therefore ideally suited for computerized analysis to make these types of analyses feasible and more reproducible. Whereas conventional single-energy CT (SECT) scans represent a 3D image of a patient, DECT scans may be viewed as a 4D image: a 3D body volume over a range of spectral attenuation levels. The latter dimension provides, for each voxel, a decay curve representing energy-dependent changes in attenuation, enabling tissue characterization beyond what is possible with conventional CT [6,7].

DECT has been shown to improve qualitative image interpretation for the evaluation of head and neck cancer, and preliminary results also suggest that the energy-dependent curves associated with each image voxel can be used to improve predictions using radiomic approaches [8–12]. However, there is currently no widely accepted method for the use of spectral data from DECT scans in radiomic-type studies. In this study, we propose a clustering method that incorporates the spectral tissue attenuation curves as a fourth dimension of the 3D representation of tissue voxels in head and neck DECT scans. We demonstrate that by combining spectral information from the voxel-associated curves and spatial information from the voxel coordinates, we can create a segmentation map with high concordance with tumoral tissue voxels. The proposed model provides a qualitatively coherent clustering that can act as the basis for identifying tumor or peritumoral regions to be used in subsequent radiomic studies on DECT scans.

In this paper, we evaluate the application of DECT to head and neck squamous cell carcinoma (HNSCC). Current image-based evaluation of HNSCC tumors in clinical practice is largely qualitative, based on a visual assessment of tumor anatomic extent and basic one- or two-dimensional tumor size measurements. However, the frequently complex shape of mucosal head and neck cancers, their at times poorly defined boundaries, and potential adjacent tissue reactions can result in high inter-observer variability in defining the extent of the tumor [13–16], especially among radiologists without sub-specialty expertise in head and neck imaging. Furthermore, there is strong evidence that alterations of gene expression and protein–protein interactions in the peri-tumoral tissue or normal adjacent tissue may play a critical role in the evolution and risk of recurrence of HNSCC tumors (e.g., [17,18]). Yet these adjacent regions may not be obviously salient upon visual image examination. For all these reasons, the accurate and consistent determination of a predictive region around the tumor is essential, both for conventional staging, which determines patient management, and for future automated quantitative image-based predictive algorithms based on machine learning. Such a predictive region may include the tumor itself, but also surrounding tissues of biological relevance.

With these considerations in mind, DECT provides new and unexplored opportunities to answer an important question: what can be learned from DECT about tumor heterogeneity and its associations with surrounding tissues? As a first step toward answering this question, in this paper we adapt functional data analysis (FDA) techniques to DECT data in order to explore underlying patterns of association in and around the tumor. FDA is a classical branch of statistics dedicated to the analysis of functional data, in situations where each data object is considered a function. This is particularly appropriate for DECT image data, as each 3D image voxel is associated with a curve of image intensity decay over multiple reconstructed energy levels (more details are provided below in Section 2.1). Thus, we adapt FDA statistical models for the clustering of 3D image voxels based on the full functional information provided by the decay curves associated with each voxel. More specifically, the architecture underpinning our proposed method is a functional mixture model, where the mixture component densities are built upon functional approximations of the spectral decay curves at each image voxel, and the mixture weights are constructed to integrate spatial constraints. We then derive an expectation–maximization (EM) algorithm for maximum likelihood estimation (MLE) of the model parameters.

To our knowledge, this is the first article to propose spatial clustering utilizing the full spectral information available in DECT data, based on an appropriate FDA statistical framework. Existing methods for automatic tumor delineation in DECT (reviewed in detail in Section 2.2) are mostly based on deep learning techniques and utilize only a small subset of the available information, due to the sheer amount of 4D (spatial + spectral) data available in a single DECT scan.

We apply the proposed methodology to 91 DECT scans of HNSCC tumors, and we compare our results to manually traced tumor contours performed by an experienced expert radiologist. We also compare against other baseline clustering methods. However, tumor segmentation on its own is not a clinical outcome. A full demonstration of the clinical utility of our method necessitates an analysis of its ability to predict actual clinical outcomes, and of how this prediction performance compares to that obtained with manually drawn contours, or with contours drawn using alternative automatic methods. We leave this prediction analysis for a subsequent paper. As the first article to adapt FDA statistical tools to DECT data, the main focus of the present paper is on the statistical methodology and on algorithm development. As such, we can summarize our contributions as follows:

- the adaptation of functional data analysis tools to DECT data, treating the spectral decay curve of each voxel as a functional observation;
- the construction of functional regression mixture models that integrate spatial information into the mixture weights, with two gating designs (softmax and normalized Gaussian gates);
- dedicated EM algorithms for maximum likelihood estimation, with closed-form updates for the Gaussian-gated models;
- an evaluation on 91 DECT scans of HNSCC tumors against manual expert contours and baseline clustering algorithms.
The rest of this paper is organized as follows. First, as background, we describe in Section 2 related work on dual-energy CT and dedicated segmentation methods. Then, in Section 3, we introduce the proposed methodology and present the developed algorithms. Section 4 is dedicated to the experimental study, and the obtained quantitative results are provided in Section 5. Finally, in Section 6, we discuss the proposed approach and the obtained results.

#### **2. Background and Related Work**

#### *2.1. Dual-Energy CT*

The use of DECT techniques has recently attracted interest in different clinical applications, including diagnostics; for example, using DECT for the improved detection of portal vein thrombosis via virtual monoenergetic reconstructions [19], for reducing visceral-motion-related artifacts on the liver by comparing different CT scanner techniques [20], or using DECT of the heart to study coronary artery disease [21].

DECT data may be viewed as a 4D image of a patient: a 3D body volume over a range of energy levels. The dual-energy image acquisition, using two X-ray energy peaks at the source, provides enough attenuation information to reconstruct attenuation values at multiple "virtual monochromatic" energy levels. These simulate what the attenuation (in Hounsfield units; HU) would be if the study had been acquired with a monochromatic X-ray beam at that energy value (in kilo-electron-volts; keV). The reconstructed curve of attenuation values over the energy levels captures the energy-dependent changes and is commonly called the spectral Hounsfield unit attenuation curve, or an *energy decay curve* [7]. In our method, we make use of this spectral information through functional approximations, and thus consider the curves as functional observations. An energy decay curve is calculated for each image voxel, and thus a DECT scan is represented as a 4D image with three dimensions for the X, Y and Z spatial coordinates and one dimension for the energy level. The virtual monochromatic image (VMI) is the 3D image representation at a given energy level. See Figure 1 (left) for examples of a 2D slice from different VMIs and Figure 1 (right) for examples of decay curves for different tissue characteristics.

**Figure 1.** (**Left**) 2D slices of VMIs at 40, 65, and 140 keV, with the tumor contour in red. At lower energy levels, VMIs show higher contrast; at higher levels, VMIs are less noisy. A VMI at 65 keV is similar to a standard CT scan. (**Right**) Examples of decay curves for different body locations. A blue (resp. red, green) curve represents the attenuation information stored at one voxel within bone (resp. tumor, tissue).
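To make this 4D representation concrete, the following minimal Python sketch builds a toy DECT volume and extracts the energy decay curve of a single voxel. The array shapes and names are illustrative only (the 21 energy levels from 40 to 140 keV match the acquisition protocol described later in Section 4.1) and are not taken from the authors' released code.

```python
import numpy as np

# A DECT scan as a 4D array: three spatial axes plus one spectral axis.
# Here: 21 virtual monochromatic images (VMIs) from 40 to 140 keV in 5 keV steps.
energy_levels_kev = np.arange(40, 145, 5)                   # m = 21 energy levels
dect = np.random.rand(150, 150, 6, energy_levels_kev.size)  # toy X*Y*Z*m volume

# The energy decay curve of one voxel is a 1D slice along the spectral axis;
# each entry is the attenuation (in HU) of the VMI at the corresponding keV.
decay_curve = dect[75, 75, 3, :]
print(decay_curve.shape)  # (21,)
```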

#### *2.2. Segmentation of Dual-Energy CT Data*

Segmentation is the process of delineating an image region of interest. For example, radiation oncologists usually manually segment tumors for radiation planning. Automatic tumor segmentation has a long history of development: from early knowledge-driven techniques to newer data-driven techniques, algorithms aim to extract image features in order to decide on region boundaries [22]. However, this process remains challenging in medical imaging due to heterogeneity across the image or the acquisition process; most current algorithms require manual adjustment of the result [23].

In head and neck CT imaging, the difficulty of precisely contouring a tumor region results in large inter-observer variability in the segmentation results, even among trained radiation oncologists. A study among radiologists from 14 different institutions obtained median Dice similarity scores (DSC) ranging from 0.51 to 0.82 [15], depending on the delineation criteria used. Another study assessing the same variability among three experienced radiologists over 10 tumors obtained a mean DSC of 0.57 [16].

To the best of our knowledge, only a few studies have focused on DECT segmentation, and these employ deep learning approaches [24–26]. The four dimensions of the data required workarounds in order to apply neural networks. For example, using two VMIs sampled from the energy level spectrum, one at a low and one at a high energy level, Chen et al. [24] merged the two VMIs in a layer connected to a U-Net architecture [27]. Wang et al. [26] learned features from two pyramid networks on the two VMIs independently and combined them through deep attention into a mask-scoring regional convolutional neural network (R-CNN). They achieved good performance in segmenting large organs (DSC larger than 0.8), while performance was less impressive for small organs (DSC between 0.5 and 0.8). Deep learning techniques are very popular in many domains and can provide good results in practice. However, our mixture model-based approach is not necessarily as complex as a deep learning one, and it is not a black box: it is interpretable, relies on the statistically sound foundations of mixture models and on the desirable properties of the EM algorithm (in particular, the monotonic improvement of the log-likelihood used as the objective function), and its implementation is comparatively simple and user-friendly, which can be useful in clinical practice.

#### *2.3. Decay Curve Clustering via Functional Data Analysis*

FDA aims to represent infinite-dimensional functional data by a finite-dimensional vector of coefficients [28]. To achieve this, FDA expands functional data into function bases. One approach, which we use in this paper, projects functional data onto finite-dimensional function bases (e.g., splines, B-splines, polynomials, Fourier, and wavelets), associating with each curve a finite vector of projection coefficients. Another common approach is to run a functional principal component analysis (fPCA) to obtain a basis of eigenfunctions of the covariance of the process describing the functional data, associating with each curve a (truncated) vector of PCA coefficients.
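As an illustration of this projection, the sketch below fits a synthetic decay curve by ordinary least squares onto a polynomial basis. The paper itself favors spline and B-spline bases; the curve, degree, and rescaling here are arbitrary choices made only to keep the example self-contained (the same coefficient formula reappears as Equation (19) below).

```python
import numpy as np

def polynomial_basis(x, degree):
    """Design matrix B(x) for a polynomial basis: columns 1, x, x^2, ..."""
    return np.vander(x, N=degree + 1, increasing=True)

# Energy levels (keV) and a synthetic, roughly decaying attenuation curve.
x = np.linspace(40.0, 140.0, 21)
y = 300.0 * np.exp(-0.02 * (x - 40.0)) + 40.0

# Rescale the covariate for numerical conditioning, then project:
B = polynomial_basis((x - 40.0) / 100.0, degree=3)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # finite coefficient vector
```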

Our objective is to partition our data, modeled with FDA, into groups of voxels having similar decay curve characteristics. Among the available clustering approaches (e.g., centroid-based clustering, such as k-means; connectivity-based clustering, such as hierarchical clustering; density-based clustering, such as DBSCAN [29]; and distribution-based clustering with model-based methods), since we have a model for each decay curve, a model-based approach is preferred.

Model-based clustering is a thoroughly developed field [30,31], particularly for multivariate analysis. Model-based clustering approaches rely on the finite mixture modeling framework [32] to represent the density of a set of independent multivariate observations and on an optimization algorithm to automatically find a partition into groups of such observations.

To represent different groups of data, mixture models assume each datum to follow a known distribution (e.g., Gaussian) and build a mean representation (i.e., a model) for each group. For each data point, the mixture model computes the sum, over $k \in \{1, \ldots, K\}$ groups, of the probability density function (pdf) of the group-$k$ model evaluated at that point, weighted by a term giving a higher or lower prior chance of belonging to that group (derived in Section 3.1). Mixture models have the advantage of being interpretable, parametric, and thus well understood, and flexible, as the pdf modeling the data in each cluster can be chosen explicitly.
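As a toy numerical illustration of this computation, the following sketch evaluates a univariate Gaussian mixture density and the resulting posterior membership probabilities. The weights, means, and standard deviations are arbitrary; the actual model replaces these univariate components with the spatial and functional components defined in Section 3.

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # mixture weights, sum to one
means = np.array([0.0, 2.0, 5.0])
sds = np.array([1.0, 0.5, 2.0])

def mixture_pdf(y):
    """Weighted sum of component pdfs, as in Equation (1) below."""
    return np.sum(weights * norm.pdf(y, loc=means, scale=sds))

def memberships(y):
    """Posterior probability of each component (Bayes rule, cf. Equation (4))."""
    joint = weights * norm.pdf(y, loc=means, scale=sds)
    return joint / joint.sum()

print(mixture_pdf(1.0), memberships(1.0))
```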

The expectation–maximization (EM) algorithm [33] is a popular and well-suited tool with desirable properties that can be used to iteratively estimate the mixture model parameters and, thus, the cluster membership probabilities.

Mixture models for clustering have been applied and adapted to different kinds of data, including time-series data [34], gene expression data [35], 3D noisy medical images [36], and (non-image) spatio-temporal data [37]. They have also recently been investigated for functional data [38], and thus provide an avenue for modeling the spectral decay curves; in this context of spectral images, however, we also need to incorporate the spatial information into the clustering.

A related idea was proposed in [39] to develop a spatio-temporal mixture of hidden process models for fMRI analysis. The authors built a temporal probabilistic model and reshaped the prior probability with spatial constraints to determine a "region of influence" for the temporal model. A specification of our model covers this approach, and our model generalizes it further via the construction of more flexible Gaussian mixture weights over the spatial coordinates. The resulting model enjoys better numerical learning properties, with faster convergence due to closed-form updating rules for the spatial weight parameters. Alternative constructions of the proposed mixture model are also presented in order to validate the method and to accommodate potential user specifications.

#### **3. Methodology**

#### *3.1. Generative Modeling Framework*

We adopt the framework of generative modeling for image clustering using different families of extended mixture distributions. The general form of the generative model for the image assumes that the *i*th datum (i.e., pixel or voxel) in the image has the general semi-parametric mixture density:

$$f_i(\boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_{ik}\, f_i(\boldsymbol{\theta}_k), \tag{1}$$

which is a convex combination of $K$ component densities $f_i(\boldsymbol{\theta}_k)$, $k \in [K] = \{1, \ldots, K\}$, weighted by non-negative mixture weights $\pi_{ik}$ that sum to one, that is, $\sum_{k=1}^{K} \pi_{ik} = 1$ for all $i \in [n]$. The unknown parameter vector $\boldsymbol{\theta}$ of density (1) is composed of the set of component density parameters $\{\boldsymbol{\theta}_k\}$ and their associated weights $\{\pi_{ik}\}$, i.e., $\boldsymbol{\theta} = \{\pi_{ik}, \boldsymbol{\theta}_k\}_{k=1}^{K}$.

From the perspective of model-based clustering of the image, each component density can be associated with a cluster, and the clustering problem hence becomes one of parametric density estimation. Suppose that the image has $K$ segments, and let $Z_i \in [K]$ be the random variable representing the unknown segment label of the $i$th observation in the image. Suppose that the distribution of the data within each segment $k \in [K]$ is $f(\boldsymbol{\theta}_k)$, i.e., $f_{i|Z_i=k}(\boldsymbol{\theta}) = f_i(\boldsymbol{\theta}_k)$. Then, from a generative point of view, model (1) is equivalent to *(i)* sampling a segment label according to the discrete distribution with parameters given by the mixture weights $\boldsymbol{\pi} = \{\pi_1, \ldots, \pi_K\}$, then *(ii)* sampling an observation $\mathrm{Im}_i$ from the conditional distribution $f(\boldsymbol{\theta}_k)$. A model of the form (1), represented by $\boldsymbol{\theta}$, is typically fitted by maximum likelihood estimation (MLE) from the $n$ observations composing the image $\mathrm{Im}_n$, as

$$\hat{\boldsymbol{\theta}} \in \arg\max_{\boldsymbol{\theta} \in \Theta} \log L(\boldsymbol{\theta} \,|\, \mathrm{Im}_n), \tag{2}$$

where $L(\boldsymbol{\theta} \,|\, \mathrm{Im}_n)$ is the likelihood of $\boldsymbol{\theta}$ given the image data $\mathrm{Im}_n$; then, the segment labels can be determined via the Bayes allocation rule,

$$\hat{Z}_i = \arg\max_{k \in [K]} \mathbb{P}\big(Z_i = k \,|\, \mathrm{Im}(i); \hat{\boldsymbol{\theta}}\big), \tag{3}$$

which consists of maximizing the conditional probabilities

$$\mathbb{P}\big(Z_i = k \,|\, \mathrm{Im}(i); \hat{\boldsymbol{\theta}}\big) = \frac{\hat{\pi}_{ik}\, f_i(\hat{\boldsymbol{\theta}}_k)}{f_i(\hat{\boldsymbol{\theta}})}, \tag{4}$$

that the *i*th observation originates from segment *k*, *k* ∈ [*K*], given the image data and the fitted model.

Model (1) has many different specifications in the literature, depending on the nature of the data-generating process, resulting in a multitude of choices for the mixture weights and the component densities. Mixtures of multivariate distributions [32] are particularly popular in model-based clustering of vectorial data. These include multivariate Gaussian mixtures [30,31], where $\pi_{ik} = \pi_k$, $\forall i$, are constant mixture weights, and the component densities $f_i(\boldsymbol{\theta}_k) = \phi_i(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ are multivariate Gaussians with means $\boldsymbol{\mu}_k$ and covariance matrices $\boldsymbol{\Sigma}_k$. Mixtures of regression models, introduced in [40], are common in the modeling and clustering of regression-type data. For example, in the widely used Gaussian regression mixture model [41], the mixing proportions are constant, i.e., $\pi_{ik} = \pi_k$, and the mixture components $f_i(\boldsymbol{\theta}_k)$ are Gaussian regressors $\phi(\cdot\,;\, \boldsymbol{\beta}_k^\top \boldsymbol{x}_i, \sigma_k^2)$ with typically linear means $\boldsymbol{\beta}_k^\top \boldsymbol{x}_i$ and variances $\sigma_k^2$ in the case of a univariate response.

In this paper, we consider a more flexible mixture of regressions model in which both the mixture weights and the mixture components are covariate-dependent and are constructed upon flexible semi-parametric functions. More specifically, in this fully conditional mixture model, the mixture weights are constructed upon parametric functions $\pi_{ik} = \pi_k(\boldsymbol{x}_i; \boldsymbol{\alpha})$ of some covariates $\boldsymbol{x}_i$, represented by a parameter vector $\boldsymbol{\alpha}$, and the regression functions $f_i(\boldsymbol{\theta}_k)$ are Gaussian regressors $\phi(\cdot\,;\, \mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k), \sigma_k^2)$ with semi-parametric (non-)linear mean functions $\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k)$. This flexible modeling allows us to better capture non-linear relationships in the functional data via the semi-parametric mean functions. Heterogeneity is accommodated via the mixture distribution, and spatial organization can be captured via spatially dependent mixture weights.

#### *3.2. Spatial Mixture of Functional Regressions for Dual-Energy CT Images*

We propose a spatialized mixture of functional regressions model, adapted to the given type of image data, for the model-based clustering of dual-energy CT scans. The images we analyze include spectral curves for each 3D voxel. Each image, denoted $\mathrm{Im}$, is represented as a sample of $n$ observations, $\mathrm{Im} = \{\boldsymbol{v}_i, \boldsymbol{x}_i, \boldsymbol{y}_i\}_{i=1}^{n}$, where $\boldsymbol{v}_i = (v_{i1}, v_{i2}, v_{i3})$ is the 3D spatial coordinate vector of the $i$th voxel. The $i$th voxel is represented by the curve $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ composed of HU attenuation values $\boldsymbol{y}_i = (y_{i1}, \ldots, y_{im})$ measured at energy levels (covariates) $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{im})$, with $m$ being the number of energy levels.

To accommodate the spatial organization of the image together with the functional nature of each of its voxels, we propose spatialized conditional extensions of the general family of model (1), in which we model the $i$th voxel of the image using the conditional density $f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta})$ that relates the attenuation values $\boldsymbol{y}_i$ to the associated energy levels $\boldsymbol{x}_i$ and the spatial location $\boldsymbol{v}_i$ via a convex combination of (non-)linear (semi-)parametric functional regressors $f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i; \boldsymbol{\theta}_k)$ with spatial weights $\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})$, that is,

$$f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})\, f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i; \boldsymbol{\theta}_k). \tag{5}$$

To this purpose, we consider two different spatial constructions of the mixing weights (gating functions) $\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})$: (i) softmax gates; and (ii) normalized Gaussian gates. The latter is an appropriate choice when more approximation quality is needed, and it facilitates the computations in the learning process. We also consider different families to model the functional regressors, including spline and B-spline regression functions, which enjoy better curve approximation capabilities than linear or polynomial regression functions.

#### 3.2.1. Functional Regression Components

We have a 3D image volume over a range of energy levels that provides, for each voxel $i$, an attenuation curve $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ representing energy-dependent changes in attenuation, which enables a better characterization of the tissue at voxel $i$. We therefore model the component densities $f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i; \boldsymbol{\theta}_k)$ as functional regression models constructed upon the attenuation curves as functional observations. This allows us to accommodate the spectral curve nature of the data. More specifically, in the case of univariate energy levels, we use smooth functional approximations to model, for the $i$th voxel, the mean spectral curve of the $k$th component, $\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k) = \mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{Y}_i \,|\, Z_i = k, \boldsymbol{x}_i]$, that is, $\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k) = \big(\mu(x_{i1}; \boldsymbol{\beta}_k), \ldots, \mu(x_{im}; \boldsymbol{\beta}_k)\big)$, using polynomial or (B-)spline functions whose coefficients are $\boldsymbol{\beta}_k$.

The conditional density model for each regression component is then a functional Gaussian regressor defined by $f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i; \boldsymbol{\theta}_k) = \phi_m\big(\boldsymbol{y}_i;\, \mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k),\, \sigma_k^2 \mathbf{I}\big)$, with $\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k) = \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k$ being the function approximation onto polynomial or (B-)spline bases $\mathbf{B}(\boldsymbol{x}_i)$; the matrix form of the functional regression model is then given by

$$f(\boldsymbol{y}_i \,|\, \boldsymbol{x}_i; \boldsymbol{\theta}_k) = \phi_m\big(\boldsymbol{y}_i;\, \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k,\, \sigma_k^2 \mathbf{I}\big), \tag{6}$$

where $\boldsymbol{\theta}_k = (\boldsymbol{\beta}_k^\top, \sigma_k^2)^\top \in \mathbb{R}^{p+q+2}$ is the unknown parameter vector of the $k$th regression component.

#### 3.2.2. Spatial Gating Functions

The constructed functional mixture of regressions model (5) specifically integrates the spatial constraints into the mixture weights $\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})$ via functions of the spatial locations $\boldsymbol{v}_i$, parametrized by vectors of coefficients $\boldsymbol{\alpha}$. We investigate two choices to this end. The first proposed model is a spatial softmax-gated functional mixture of regressions, defined by (5) with the softmax gating function:

$$\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha}) = \frac{\exp\big(\boldsymbol{\alpha}_k^\top \boldsymbol{v}_i\big)}{1 + \sum_{k'=1}^{K-1} \exp\big(\boldsymbol{\alpha}_{k'}^\top \boldsymbol{v}_i\big)}, \tag{7}$$

where $\boldsymbol{\alpha} = (\boldsymbol{\alpha}_1^\top, \ldots, \boldsymbol{\alpha}_K^\top)^\top$ is the unknown parameter vector of the gating functions. We will refer to this model, defined by (5)–(7), as the spatial softmax-gated mixture of functional regressions, abbreviated as **SsMFR**. The softmax modeling of the mixture weights is a standard choice, well known in the mixtures-of-experts community. However, its optimization, performed at the M step of the EM algorithm, is not analytical and requires a numerical Newton–Raphson procedure. This can become costly, especially in large image applications such as the one we address.

In the second proposed model, we use a spatial Gaussian-gated functional mixture of regressions, defined by (5) with a Gaussian gating function:

$$\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha}) = \frac{w_k\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)}{\sum_{\ell=1}^{K} w_\ell\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_\ell, \mathbf{R}_\ell)}, \tag{8}$$

in which $w_k$ are non-negative weights that sum to one, $\phi_d(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)$ is the density function of a multivariate Gaussian vector of dimension $d$ with mean $\boldsymbol{\mu}_k$ and covariance matrix $\mathbf{R}_k$, and $\boldsymbol{\alpha} = (\boldsymbol{\alpha}_1^\top, \ldots, \boldsymbol{\alpha}_K^\top)^\top$ is the parameter vector of the gating functions, with $\boldsymbol{\alpha}_k = \big(w_k, \boldsymbol{\mu}_k^\top, \mathrm{vech}(\mathbf{R}_k)^\top\big)^\top$.

We will refer to this model, defined by (5), (6) and (8), as the spatial Gaussian-gated mixture of functional regressions, abbreviated as **SgMFR**. This Gaussian gating function was introduced in [42] to bypass the need for an additional numerical optimization in the inner loop of the EM algorithm. We obtain a closed-form updating formula at the M-step, which is detailed in the next section presenting the derived EM algorithm.
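The two gating designs can be sketched in a few lines of NumPy/SciPy. This is an illustrative reading of Equations (7) and (8), not the authors' released implementation; it assumes the softmax gates use the $K$-th class as a zero-coefficient reference, matching the 1 in the denominator of (7).

```python
import numpy as np
from scipy.stats import multivariate_normal

def softmax_gates(V, alpha):
    """Softmax gating, Eq. (7). V: (n, d) voxel coordinates;
    alpha: (K-1, d) coefficients, with class K as the zero reference."""
    logits = np.hstack([V @ alpha.T, np.zeros((V.shape[0], 1))])
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)       # (n, K), rows sum to one

def gaussian_gates(V, w, mus, Rs):
    """Normalized Gaussian gating, Eq. (8)."""
    dens = np.column_stack([w[k] * multivariate_normal.pdf(V, mus[k], Rs[k])
                            for k in range(len(w))])
    return dens / dens.sum(axis=1, keepdims=True)
```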

#### 3.2.3. MLE of the SgMFR Model via the EM Algorithm

Based on Equations (5), (6) and (8), the SgMFR joint density $f(\boldsymbol{y}_i, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta})$ is then derived, and the joint log-likelihood we maximize via EM is

$$\log L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(\boldsymbol{y}_i, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta}) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)\, \phi_m\big(\boldsymbol{y}_i; \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k, \sigma_k^2 \mathbf{I}\big). \tag{9}$$

The complete-data log-likelihood, upon which the EM algorithm is constructed, is

$$\log L_c(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik} \log\big[w_k\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)\, \phi_m\big(\boldsymbol{y}_i; \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k, \sigma_k^2 \mathbf{I}\big)\big], \tag{10}$$

where $Z_{ik}$ is an indicator variable such that $Z_{ik} = 1$ if $Z_i = k$ (i.e., if the $i$th pair $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ is generated from the $k$th regression component) and $Z_{ik} = 0$ otherwise. The EM algorithm, after starting from an initial solution $\boldsymbol{\theta}^{(0)}$, alternates between the E and M steps until convergence (when there is no longer a significant change in the log-likelihood).

**The E-step**: Compute the conditional expectation of the complete-data log-likelihood (10), given the image $\mathrm{Im}_n$ and the current estimate $\boldsymbol{\theta}^{(t)}$:

$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \mathbb{E}\big[\log L_c(\boldsymbol{\theta}) \,|\, \mathrm{Im}_n; \boldsymbol{\theta}^{(t)}\big] = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(t)} \log\big[w_k\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)\, \phi_m\big(\boldsymbol{y}_i; \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k, \sigma_k^2 \mathbf{I}\big)\big], \tag{11}$$

where $\tau_{ik}^{(t)} = \mathbb{P}\big(Z_i = k \,|\, \boldsymbol{y}_i, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta}^{(t)}\big)$ is given by

$$\tau_{ik}^{(t)} = \frac{w_k^{(t)}\, \phi_3\big(\boldsymbol{v}_i; \boldsymbol{\mu}_k^{(t)}, \mathbf{R}_k^{(t)}\big)\, \phi_m\big(\boldsymbol{y}_i; \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k^{(t)}, \sigma_k^{2(t)} \mathbf{I}\big)}{f(\boldsymbol{y}_i, \boldsymbol{x}_i, \boldsymbol{v}_i; \boldsymbol{\theta}^{(t)})} \tag{12}$$

is the posterior probability that the observed pair $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ is generated by the $k$th regressor. This step therefore only requires computing the posterior component membership probabilities $\tau_{ik}^{(t)}$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.

**The M-step**: Calculate the parameter vector update $\boldsymbol{\theta}^{(t+1)}$ by maximizing the $Q$-function (11), i.e., $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$. By decomposing the $Q$-function as

$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{k=1}^{K} \Big[ Q(\boldsymbol{\alpha}_k; \boldsymbol{\theta}^{(t)}) + Q(\boldsymbol{\theta}_k; \boldsymbol{\theta}^{(t)}) \Big], \tag{13}$$

with $Q(\boldsymbol{\alpha}_k; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \tau_{ik}^{(t)} \log\big[w_k\, \phi_3(\boldsymbol{v}_i; \boldsymbol{\mu}_k, \mathbf{R}_k)\big]$ and $Q(\boldsymbol{\theta}_k; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \tau_{ik}^{(t)} \log\big[\phi_m\big(\boldsymbol{y}_i; \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k, \sigma_k^2 \mathbf{I}\big)\big]$, the maximization can be performed via $K$ separate maximizations with respect to the parameters of the gating and regression functions.

*Updating the gating function parameters:* Maximizing (13) with respect to the $\boldsymbol{\alpha}_k$'s corresponds to the M step of a Gaussian mixture model [32]. The closed-form expressions for updating the parameters are given by

$$w_k^{(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)} \Big/ n, \tag{14}$$

$$\boldsymbol{\mu}_k^{(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)} \boldsymbol{v}_i \Big/ \sum_{i=1}^{n} \tau_{ik}^{(t)}, \tag{15}$$

$$\mathbf{R}_k^{(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)} \big(\boldsymbol{v}_i - \boldsymbol{\mu}_k^{(t+1)}\big)\big(\boldsymbol{v}_i - \boldsymbol{\mu}_k^{(t+1)}\big)^\top \Big/ \sum_{i=1}^{n} \tau_{ik}^{(t)}. \tag{16}$$

*Updating the regression function parameters:* Maximizing (13) with respect to $\boldsymbol{\theta}_k$ corresponds to the M step of standard mixtures of experts with univariate Gaussian regressions. The closed-form updating formulas are given by

$$\boldsymbol{\beta}_k^{(t+1)} = \Big[\sum_{i=1}^{n} \tau_{ik}^{(t)}\, \mathbf{B}^\top(\boldsymbol{x}_i)\, \mathbf{B}(\boldsymbol{x}_i)\Big]^{-1} \sum_{i=1}^{n} \tau_{ik}^{(t)}\, \mathbf{B}^\top(\boldsymbol{x}_i)\, \boldsymbol{y}_i, \tag{17}$$

$$\sigma_k^{2(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)} \big\| \boldsymbol{y}_i - \mathbf{B}(\boldsymbol{x}_i)\boldsymbol{\beta}_k^{(t+1)} \big\|^2 \Big/ \sum_{i=1}^{n} \tau_{ik}^{(t)} m_i. \tag{18}$$
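The following compact sketch shows how these E and M steps fit together for the SgMFR model. It is a simplified illustration rather than the authors' released code (available at the GitHub repository linked in the Data Availability Statement): it assumes all voxels share the same energy grid, so a single design matrix B serves every curve; it initializes the responsibilities randomly instead of using the spatial k-means of Section 4.3; and it directly applies the covariance shrinkage of Equation (25), introduced in Section 4.2.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_sgmfr(V, B, Y, K, n_iter=100, lam=0.075, seed=0):
    """Minimal EM sketch for SgMFR (Eqs. (9)-(18)).
    V: (n, 3) voxel coordinates; B: (m, p) shared basis design matrix;
    Y: (n, m) decay curves. Returns the responsibilities tau, shape (n, K)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    tau = rng.dirichlet(np.ones(K), size=n)    # random init (paper: k-means)
    for _ in range(n_iter):
        # ---- M step: closed-form updates per cluster, Eqs. (14)-(18) ----
        w = tau.mean(axis=0)                                     # Eq. (14)
        mus = (tau.T @ V) / tau.sum(axis=0)[:, None]             # Eq. (15)
        Rs, betas, sig2 = [], [], []
        for k in range(K):
            sw = tau[:, k].sum()
            diff = V - mus[k]
            Rk = (tau[:, k, None] * diff).T @ diff / sw          # Eq. (16)
            Rs.append(lam * Rk + 1e-6 * np.eye(3))               # Eq. (25)
            # Weighted least squares for the coefficients, Eq. (17); with a
            # shared B, the weighted Gram matrix is simply sw * B^T B.
            beta_k = np.linalg.solve(sw * (B.T @ B), B.T @ (Y.T @ tau[:, k]))
            betas.append(beta_k)
            resid = Y - B @ beta_k                               # (n, m)
            sig2.append((tau[:, k] @ (resid ** 2).sum(axis=1)) / (sw * m))
        # ---- E step: posterior memberships, Eq. (12), in log space ----
        log_tau = np.empty((n, K))
        for k in range(K):
            resid = Y - B @ betas[k]
            log_curve = -0.5 * ((resid ** 2).sum(axis=1) / sig2[k]
                                + m * np.log(2.0 * np.pi * sig2[k]))
            log_tau[:, k] = (np.log(w[k])
                             + multivariate_normal.logpdf(V, mus[k], Rs[k])
                             + log_curve)
        log_tau -= log_tau.max(axis=1, keepdims=True)
        tau = np.exp(log_tau)
        tau /= tau.sum(axis=1, keepdims=True)
    return tau
```

Hard labels are then obtained as `tau.argmax(axis=1)`, which is exactly the Bayes allocation rule (3).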

#### *3.3. Alternative Two-Fold Approaches*

We also investigate an alternative, two-fold approach: rather than performing the functional approximation and the model estimation for segmentation simultaneously, we first construct approximations of the functional data onto polynomial or (B-)spline bases $\mathbf{B}(\boldsymbol{x}_i)$ via MLE (ordinary least squares in this case) to obtain

$$\hat{\boldsymbol{\beta}}_i = \big[\mathbf{B}(\boldsymbol{x}_i)^\top \mathbf{B}(\boldsymbol{x}_i)\big]^{-1} \mathbf{B}(\boldsymbol{x}_i)^\top \boldsymbol{y}_i. \tag{19}$$

Then, we model the density of the resulting coefficient vectors $\hat{\boldsymbol{\beta}}_i$, each regarded as the representative of the $i$th curve, by a mixture density with spatial weights of the form

$$f(\hat{\boldsymbol{\beta}}_i \,|\, \boldsymbol{v}_i; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})\, \phi_d\big(\hat{\boldsymbol{\beta}}_i; \boldsymbol{m}_k, \mathbf{C}_k\big), \tag{20}$$

where $\boldsymbol{m}_k$ and $\mathbf{C}_k$ are the mean and the covariance matrix of each component. The spatial weights $\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha})$ are normalized Gaussians as in (8) or softmax as in (7). We will refer to these methods as spatial Gaussian-gated (resp. softmax-gated) mixtures of vectorized functional regressions, **SgMVFR** (resp. **SsMVFR**).
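Step one of this two-fold variant reduces, on a shared energy grid, to a single linear solve applied to all curves at once; below is a minimal sketch of Equation (19) under that assumption. The resulting coefficient matrix can then be clustered with the spatially gated Gaussian mixture of Equation (20), reusing the gating machinery sketched earlier.

```python
import numpy as np

def curve_coefficients(B, Y):
    """Per-voxel OLS projection coefficients, Eq. (19).
    B: (m, p) shared basis design matrix; Y: (n, m) decay curves.
    Returns an (n, p) matrix whose i-th row is beta_hat_i."""
    # One pseudo-inverse serves every voxel because the energy grid is shared.
    pinv = np.linalg.solve(B.T @ B, B.T)   # (p, m)
    return Y @ pinv.T                      # (n, p)
```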

The EM algorithm for fitting this mixture of spatial mixtures, constructed upon pre-computed polynomial or (B-)spline coefficients, with its two variants for modeling the spatial weights, takes a form similar to the previously presented algorithm and is summarized as follows. The conditional memberships of the E step are given, for the softmax-gated model, by

$$\tau_{ik}^{(t)} = \frac{\pi_k(\boldsymbol{v}_i; \boldsymbol{\alpha}^{(t)})\, \phi_d\big(\hat{\boldsymbol{\beta}}_i; \boldsymbol{m}_k^{(t)}, \mathbf{C}_k^{(t)}\big)}{f(\hat{\boldsymbol{\beta}}_i \,|\, \boldsymbol{v}_i; \boldsymbol{\theta}^{(t)})}, \tag{21}$$

and for the Gaussian-gated model by

$$\tau_{ik}^{(t)} = \frac{w_k^{(t)}\, \phi_3\big(\boldsymbol{v}_i; \boldsymbol{\mu}_k^{(t)}, \mathbf{R}_k^{(t)}\big)\, \phi_d\big(\hat{\boldsymbol{\beta}}_i; \boldsymbol{m}_k^{(t)}, \mathbf{C}_k^{(t)}\big)}{f(\hat{\boldsymbol{\beta}}_i, \boldsymbol{v}_i; \boldsymbol{\theta}^{(t)})}. \tag{22}$$

The latter has the same advantage as explained above. In the M step, the gating function parameter updates are given by (14)–(16) for the Gaussian-gated model, or obtained through a Newton–Raphson optimization algorithm for the softmax-gated model. The component parameter updates are those of classical multivariate Gaussian mixtures:

$$\boldsymbol{m}_k^{(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)}\, \hat{\boldsymbol{\beta}}_i \Big/ \sum_{i=1}^{n} \tau_{ik}^{(t)}, \tag{23}$$

$$\mathbf{C}_k^{(t+1)} = \sum_{i=1}^{n} \tau_{ik}^{(t)} \big(\hat{\boldsymbol{\beta}}_i - \boldsymbol{m}_k^{(t+1)}\big)\big(\hat{\boldsymbol{\beta}}_i - \boldsymbol{m}_k^{(t+1)}\big)^\top \Big/ \sum_{i=1}^{n} \tau_{ik}^{(t)}. \tag{24}$$

In a nutshell, to compute a clustering of the image, the label of voxel $i$, given the fitted parameters $\hat{\boldsymbol{\theta}}$, is calculated by the Bayes allocation rule (3), in which $\mathrm{Im}(i)$ is the spatial coordinates of voxel $i$ together with either its direct spectral curve representation $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ or its pre-calculated functional approximation coefficients $\hat{\boldsymbol{\beta}}_i$ given by (19).

Appendix A contains the pseudo-codes summarizing the proposed method.

#### *3.4. Time Complexity of the Proposed Algorithms*

In this subsection, we investigate the time complexity of the proposed algorithms. The time complexity of the E-step of the proposed EM algorithms for the SgMFR and SsMFR models is $O(K d^2 p n m)$, with $n$ the number of voxels, $m$ the number of energy levels, $d$ the number of spatial coordinates (2 or 3), $p$ the number of regression coefficients, and $K$ the number of clusters. For the M step, the SgMFR and SgMVFR models have closed-form updates: the SgMFR requires the calculation of the regression coefficients via weighted least squares, with a complexity of $O(K p^2 n m)$, while the SgMVFR models require the calculation of the Gaussian means and covariance matrices, as in multivariate Gaussian mixtures, with a complexity of $O(K p^2 n)$. However, the SsMFR and SsMVFR algorithms require, at each iteration of the M step of the EM algorithm, an IRLS loop and the inversion of a Hessian matrix of dimension $d(K-1)$. The complexity of the IRLS is therefore approximately $O(I_{\mathrm{IRLS}}\, d^2 K^2)$, where $I_{\mathrm{IRLS}}$ is the average number of iterations required by the internal IRLS algorithm. The complexity can therefore become an issue for a large number of clusters, in which case the SgMFR and SgMVFR algorithms are to be preferred.

#### **4. Experimental Study**

In this section, we describe the evaluation of different versions of our method: mixtures of B-spline and polynomial functional regressions with spatial Gaussian gates (resp. SgMFR-Bspl and SgMFR-poly), mixture of B-spline regressions with softmax gates (SsMFR-Bspl), and mixture of vectorized B-spline regressions with Gaussian gates (SgMVFR-Bspl).

#### *4.1. Data*

In total, 91 head and neck DECT scans were evaluated, consisting of HNSCC tumors of different sizes and stages from different primary sites. In our dataset, 34% of tumors are located in the oral cavity, 26% in the oropharynx, 21% in the larynx, 8% in the hypopharynx, and 11% in other locations. The tumors' T-stage [43] ranges from T1 to T4. Of the patients, 75% presented for initial diagnosis, while 25% were recurrent cases. Institutional review board approval was obtained for this study with a waiver of informed consent. Tumors were contoured by an expert head and neck radiologist. All scans were acquired using a fast kVp-switching DECT scanner (GE Healthcare) after administration of IV contrast and reconstructed into 1.25 mm axial sections with a resolution of 0.61 mm, as previously described [11]. Multienergy VMIs were reconstructed at energy levels from 40 to 140 keV in 5 keV increments on the GE Advantage workstation (4.6; GE Healthcare).

In each DECT scan, we crop volumes of interest (VOIs) of size 150 × 150 × 6 voxels containing a tumor, along with the 21-point spectral curve associated with each selected voxel, in order to reduce the computational demands for an exploratory study and to exclude regions containing a majority of air voxels around the body. A pre-processing step is also applied to mask any remaining air voxels in the VOI, to focus the clustering on tissues.

#### *4.2. Regularization Parameter*

In our study, we augment the estimator in Equation (16) of the covariance matrix of the spatial coordinates within cluster $k$ with a regularization parameter $\lambda \in (0, 1]$, which controls the amount of spatial dispersion (neighborhood) taken into account in the spatial mixture weights:

$$\tilde{\mathbf{R}}_k^{(t+1)} = \lambda\, \mathbf{R}_k^{(t+1)}. \tag{25}$$

By doing so, we can numerically control the amount of data within cluster $k$ (i.e., its volume). Indeed, if we decompose $\mathbf{R}_k = \lambda \mathbf{D}\mathbf{A}\mathbf{D}^\top$, where $\mathbf{D}$ is the matrix of eigenvectors of $\mathbf{R}_k$ and $\mathbf{A}$ is a diagonal matrix whose diagonal elements are the eigenvalues in decreasing order, then $\lambda$ controls the volume of cluster $k$. Since the tumor cluster generally shows little spatial dispersion, in practice we take small values, on the order of 0.1.
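The regularization itself is a one-line rescaling. The short sketch below, with an arbitrary example covariance, illustrates how shrinking the spatial covariance by λ shrinks the cluster volume, here measured by the determinant.

```python
import numpy as np

lam = 0.075
R_k = np.array([[4.0, 0.5, 0.0],
                [0.5, 3.0, 0.0],
                [0.0, 0.0, 2.0]])   # example fitted spatial covariance
R_k_reg = lam * R_k                 # Eq. (25)

# The determinant ("volume") shrinks by lam**d in d dimensions:
print(np.linalg.det(R_k_reg) / np.linalg.det(R_k))  # = lam**3, about 4.2e-4
```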

#### *4.3. Parameter Initialization*

For the sake of reproducibility, we initialize the regression mixture and weight parameters of the EM algorithm with a coarse clustering solution given by a Voronoi diagram. We build Voronoi tiles over the selected voxels in the VOI (a square region where air voxels are removed) with the k-means algorithm applied only to the spatial coordinates. Then, to fix the number of clusters $K$ and the spatio-spectral hyperparameter $\lambda$, we assess, on a small training set (10 patients), the three metrics described in Section 4.5. In our experiments, $K$ is taken large enough (say 20, 30 or 40) so that we do not have to perform a full grid search, which could be computationally demanding, and a value of $\lambda$ around 0.075 works very well. The process is run similarly for both the mixtures of functional regressions and the mixtures of vectorized functional regressions. The range of search values is adapted for each method, with 5 search values taken in each range. In the end, we use the mean of the optimal values over the patients in the training set.
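A minimal Python sketch of this Voronoi-style initialization, assuming scikit-learn's KMeans in place of the authors' implementation: k-means is run on the spatial coordinates only, so the initial partition is a set of compact spatial tiles, and the one-hot encoding of the labels can serve as the initial responsibilities for the EM sketch above.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_tau(V, K, seed=0):
    """V: (n, 3) non-air voxel coordinates. Returns one-hot responsibilities."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(V)
    tau = np.zeros((labels.size, K))
    tau[np.arange(labels.size), labels] = 1.0
    return tau
```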

#### *4.4. Baselines*

We compare the quantitative and qualitative performance of our methodology with three baseline algorithms. First, we implement a Gaussian mixture model (GMM) with the iterative EM algorithm, using the standard non-reshaped algorithm [32] to cluster the spectral curves, thus excluding the spatial coordinates. Because several clusters can become empty during the optimization, we fix an initial number of clusters (*K* = 150) that ends up providing, on average, the same resulting number of clusters as our method (i.e., *K* = 40). Second, we implement k-means clustering using all available vector information; that is, the input vector is built from the spectral information (i.e., the energy decay curve points) concatenated with the spatial information (i.e., the 3D coordinates). The number of clusters is also set to *K* = 40, the number of clusters being stable throughout the optimization. We use the Matlab k-means implementation for images with a reproducible initialization through the built-in 'imsegkmeans' function. Third and last, we implement selective search, a graph-based machine learning segmentation method for object recognition that uses hierarchical region merging with an SVM classifier to select the hierarchical rank of the resulting regions; we use the authors' open-source code [44]. We apply selective search on low-, intermediate- and high-energy levels (resp. 40, 65 and 140 keV), used in place of the three RGB image channels. We note that selective search does not predetermine the number of clusters; instead, it specifies a minimum cluster size or a preference for smaller or larger clusters.

#### *4.5. Metrics*

Our clustering methods, as well as the baseline clustering algorithms, are all evaluated using the following three metrics: (i) the Dice score, measuring the overlap between the tumor cluster(s) and the manual ground truth contour; (ii) the Davies–Bouldin (DB) cluster separation index, computed on both spatial and spectral content; and (iii) a 'tumor' DB index focused on tumor separability.
To define tumor clusters, we select the cluster(s) that best cover the tumor, i.e., the ones that give the best Dice score when merged together. Several clusters segmenting the tumor area can indeed represent different tumor subparts, but only the primary tumor site contour is available as ground truth.

Let $\mathcal{C}$ be the set of clusters, $d(\cdot, \cdot)$ the Euclidean distance operator, and $\bar{c}_k$ the centroid of cluster $c_k$. The Davies–Bouldin index is defined as

$$\mathrm{DB}(\mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{c_k \in \mathcal{C}} \max_{c_l \in \mathcal{C},\, l \neq k} \frac{S(c_k) + S(c_l)}{d(\bar{c}_k, \bar{c}_l)}, \tag{26}$$

where $S(c_k) = \frac{1}{|c_k|} \sum_{x_i \in c_k} d(x_i, \bar{c}_k)$. The 'tumor' Davies–Bouldin index is adapted as

$$\mathrm{DB}_t(\mathcal{C}) = \max_{c_l \in \mathcal{C},\, c_l \neq c_{tum}} \frac{S(c_{tum}) + S(c_l)}{d(\bar{c}_{tum}, \bar{c}_l)}, \tag{27}$$

where $c_{tum}$ is the region formed by merging the tumor clusters. The Dice score is defined as

$$\mathrm{Dice}(c_{tum}, c_{truth}) = \frac{2\, |c_{tum} \cap c_{truth}|}{|c_{tum}| + |c_{truth}|}, \tag{28}$$

where $c_{truth}$ is the ground truth region. We summarize the distribution of these index values across the population by computing the mean, median, and interquartile range.
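The two tumor-focused metrics can be implemented in a few lines; the sketch below is an illustrative reading of Equations (27) and (28) for boolean voxel masks and an arbitrary feature matrix X (spatial or spectral content), not the authors' evaluation code.

```python
import numpy as np

def dice(pred, truth):
    """Dice overlap between two boolean voxel masks, Eq. (28)."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def tumor_davies_bouldin(X, labels, tumor_label):
    """'Tumor' Davies-Bouldin index, Eq. (27): worst-case separation between
    the merged tumor cluster and every other cluster in the feature space X."""
    def scatter(points, centroid):
        return np.linalg.norm(points - centroid, axis=1).mean()
    tum = X[labels == tumor_label]
    mu_tum = tum.mean(axis=0)
    ratios = []
    for l in np.unique(labels):
        if l == tumor_label:
            continue
        other = X[labels == l]
        mu_l = other.mean(axis=0)
        ratios.append((scatter(tum, mu_tum) + scatter(other, mu_l))
                      / np.linalg.norm(mu_tum - mu_l))
    return max(ratios)
```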

#### **5. Results**

Figure 2 shows an overview of our results for one tumor example. We visualize the results on a 2D slice when the model has been run on the 3D VOI containing this slice.

**Figure 2.** Clustering results for each approach in one tumor (DB and DB*t* are reported as spatial/spectral indices). Our proposed approaches are on the bottom row. One random color is assigned per cluster; the ground truth tumor contour is in blue (**a**) or white (**b**–**h**). (**a**) Original slice. (**b**) GMM baseline. (**c**) k-means: Dice = 0.409, DB = 3.8/15.3, DB*t* = 2.2/17.9, time = 3.47 s. (**d**) S.Search: Dice = 0.523, DB = 1.1/16.1, DB*t* = 2.7/15.8, time = 0.08 s. (**e**) SgMFR-Bspl: Dice = 0.809, DB = 1.4/7.6, DB*t* = 4.3/5.6, time = 1521 s. (**f**) SgMFR-poly: Dice = 0.838, DB = 1.2/6.5, DB*t* = 1.8/3.6, time = 603 s. (**g**) SgMVFR-Bspl: Dice = 0.761, DB = 1.3/9.0, DB*t* = 2.0/6.2, time = 199 s. (**h**) SsMFR-Bspl.

The top row shows the performance of the baseline algorithms, whereas the bottom row shows our proposed methods. While the baseline approaches attribute a high number of clusters to bone regions containing large spectral variations and miss smaller variations in tissue regions, our methods with Gaussian gates in Figure 2e–g are able to adapt to relative variations and split the image with more spatial coherence. The results also demonstrate that our method is able to capture tissue characteristics invisible in Figure 2b–d,h. Note that the DB and DB*t* scores depend on the number of clusters, and SsMFR in Figure 2h has a very low number of clusters (the softmax gates produce vanishing clusters during optimization). GMM and selective search in Figure 2b,d have around 40 clusters (a varying number, as explained in Section 4.4). The results in Figure 2c,e–g were obtained with 40 clusters.

Table 1 and Figure 3 present the quantitative results obtained with the three clustering metrics defined in Section 4.5. Among the three proposed methods that outperform the baseline methods in terms of Dice score (SgMFR-Bspl, SgMFR-poly, and SgMVFR-Bspl), we compared the Dice score distribution obtained with SgMFR-poly (which has the lowest median Dice score among the three) to that obtained with the k-means-like baseline (which has the highest median Dice score among the baselines) with a two-sample t-test, obtaining *p* = 0.0014.

Figure 4 showcases the clustering results obtained when varying the tuning of *λ*. A smaller *λ* gives higher preference to the spatial information: clusters are compact and define well-separated areas. A larger *λ*, on the other hand, gives higher priority to the spectral information: clusters more closely match tissue characteristics, but one cluster can be split into tiny voxel groups spread all over the image. The ideal choice is a *λ* that prioritizes spectral information while still achieving some spatial compactness of the clusters. We assess this through metrics calculated on spectral and spatial content, as explained in Section 4.5. The general tuning of *λ* = 0.075 is determined to be, on average, the optimal hyper-parameter. However, we can see strong improvements in tumor separation, on a case-by-case basis, with small variations of *λ*. As shown in Figure 4e,f, the Dice score increases from 0.36 to 0.64. This is one example among several results belonging to the lower quartile of the boxplot of Dice scores in Figure 3 that could be substantially improved simply with case-specific tuning. Some other results in the lower quartile may be due to small tumors (smaller than 1 cm), although half of these small tumors are actually well separated with our method, reaching a Dice score as high as 0.84 in the best case (see Figure 5).

The proposed algorithms clearly outperform the investigated standard clustering algorithms in terms of the considered metrics, such as the Dice score. In particular, the two approaches based on Gaussian-gated mixture weights, whether directly built upon a functional mixture model (SgMFR) or used with a prior functional data representation of the energy curves (SgMVFR), combine high segmentation capability with computational efficiency. The two proposed alternative approaches based on softmax-gated mixture weights (SsMFR and SsMVFR) can, however, be computationally expensive, given that they use a Newton–Raphson optimization at each EM iteration; they may also, in some situations, lead to less precise segmentations, typically due to numerical convergence issues, compared to the proposed Gaussian-gating-based approaches. As a result, we suggest that interested users prioritize the SgMFR and SgMVFR approaches when investigating our proposed techniques for DECT clustering. That being said, a dedicated statistical study with appropriate testing would be needed to establish the significance of the differences in the results of the proposed family of algorithms.

**Figure 3.** Boxplots for each metric per method. Note that the SsMFR-Bspl method gives a few outliers out of range (on the order of $10^{13}$) on the DB spectral index, and one outlier of 135 for the tumor DB spatial index. These values are shifted into the displayed range and marked with a top arrow.

**Figure 4.** Clustering results for our SgMFR method with different *λ* tunings. (**a**) Original slice. (**b**) *λ* = 0.075, Dice = 0.79, DB = 1.68/7.29, DB*t* = 1.47/6.79. (**c**) *λ* = 0.100, Dice = 0.39, DB = 2.59/4.27, DB*t* = 3.34/3.74. (**d**) Original slice. (**e**) *λ* = 0.075, Dice = 0.36, DB = 1.75/6.70, DB*t* = 1.28/23.71. (**f**) *λ* = 0.080, Dice = 0.64, DB = 1.87/6.57, DB*t* = 1.78/2.93. One random color is assigned per cluster; the ground truth tumor contour is in blue (**a**,**d**) or white (**b**,**c**,**e**,**f**).

**Figure 5.** Clustering results with our SgMFR method for a small tumor. Note the robustness of the result in the presence of a metallic artifact on the right-hand side of the anatomical image. (**a**) Original slice. (**b**) Dice = 0.84, DB = 1.64/6.92, DB*t* = 1.98/7.22, *λ* = 0.075.


**Table 1.** Median (interquartile range) of the metrics and runtime over 81 patient scans. The DB cluster separation index is computed for spatial content (spat-DB) and spectral content (spec-DB); likewise for the DB*t* indices focused on tumor separability. Runtime is given in seconds.


#### **6. Discussion and Conclusions**

In this paper, we developed a statistical methodology to cluster intensity attenuation curves in DECT scans. We applied our proposed methods, together with other alternative clustering algorithms used as baselines, to a set of 91 DECT scans of HNSCC tumors. The classical manner of evaluating algorithms for clustering/segmentation is via measures of overlap (such as the Dice score) with a ground truth segmentation. However, as mentioned in the Introduction above, the manual segmentations of HNSCC tumors that are used as "ground truth" can suffer from large inter-rater variability, and they do not incorporate in any systematic manner regions immediately adjacent to the tumor that may be biologically important for determining the course of evolution of the tumor. Because of this inherent uncertainty in the appropriate contours of an HNSCC tumor, the main objective of our paper was not only to compare our clustering results to the manual contouring, but also to explore associations between voxels within the ground truth tumor contour and voxels in the surrounding tissue areas.

Compared to the baseline algorithms, it is clear both visually and quantitatively that our methods using Gaussian gates (SgM(V)FR) produce results that better match the manual segmentation contours. Our method using softmax gates (SsMFR) is less flexible than the one with Gaussian gating functions and thus sometimes leads to unsatisfactory results. Although the clusters of SsMFR-Bspl are qualitatively more spatially compact, its quantitative performance in some situations remains similar to the GMM baseline. Thus, this variant of the algorithm does not appear to perform well in practice. Using Gaussian gates, however, the Dice score distributions are significantly better than those of the k-means-like algorithm, the best of our baseline methods.

That being said, it is also clear that, with Dice scores ranging from nearly 0 to nearly 1, our proposed methods do not recover the "ground truth" segmentations in a reliable and consistent manner. Several reasons may explain this finding. First, our clinical dataset of DECT scans is not uniform: it includes tumors with highly variable characteristics, sizes, locations, and environments, which makes it particularly challenging. Moreover, as seen in Figure 4, changes in parameter tuning can lead to substantial improvements in Dice scores for some tumors. Finally, because of their intricate morphology and often small size, HNSCC tumors are inherently difficult to segment. In a recent international challenge, Dice scores of head and neck tumor segmentation ranged across different competition entries between 0.56 and 0.76 [45].

Most importantly, as argued throughout this paper, the clinical value of recovering the manual segmentations of HNSCC tumors as an objective criterion for evaluating the algorithm is also not clear. In fact, it was recently argued in the clinical literature that AI methods in medical imaging would be more meaningful if evaluated against clinical outcomes, as opposed to an evaluation against radiologists' performance, due to the inherent subjectivity and variability of the latter [46]. For all these reasons, the objective of this paper moved away from reproducing the manual contours produced by the radiologist and focused instead on developing tools that discover patterns of association in the DECT data.

Our study has several limitations. As discussed above, in our view, the appropriate way of evaluating the methodology's clinical utility is not by computing Dice scores relative to manually drawn contours. Rather, a more clinically informative evaluation would determine the performance of the recovered clusters in predicting clinical outcome in a machine learning setting, compared to the same predictive algorithm applied with the manual tumor segmentations. Such an evaluation is missing from the present paper; it will be part of a subsequent paper in future work. Another limitation stems from the lack of an automated identification of those clusters that are associated with the tumor region. Currently, we choose those clusters that maximize overlap with the manually segmented tumor region. Ideally, however, the abnormal tumor clusters should be identified automatically, by selecting those clusters that have the highest association with clinical outcomes. In this manner, the automated cluster identification can naturally be made part of a single machine learning pipeline for predicting clinical outcomes. Yet another limitation comes from the very small size of the subset of tumors (*n* = 10) over which we estimated the algorithm parameters (*λ* and the number of clusters) before applying the algorithm to the remaining 81 tumors in our dataset. A larger dataset, together with additional patient-specific tuning, will help improve the algorithm's performance.

The need for the improvements described above is clear, and they will be made part of a subsequent publication. In the present article, we chose to focus on the theoretical and algorithmic developments. As mentioned in the Introduction, this is the first time to our knowledge that statistical tools from the functional data analysis field are put into practice with DECT data. As such, the present paper remains an inherently exploratory one in its experimental framework.

Nevertheless, we believe that we provide several important technical and methodological contributions. We constructed a functional regression mixture model that integrates spatial content into the mixture weights, and we developed a dedicated EM algorithm to estimate the optimal model parameters. Our mixture-based model is a highly flexible statistical approach allowing for many choices of the parametric form of the component densities. We proposed two candidate designs for the mixture weights: normalized Gaussian gates and softmax gates. The Gaussian-gate closed-form solution for the spatial mixture weight updates considerably reduces the computation time, while also providing solutions with better clustering index values, compared to the Newton–Raphson optimization algorithm needed at each update of the softmax-gating parameters.

**Author Contributions:** Conceptualization, F.C.; methodology, F.C.; software, F.C. and S.B.; validation, P.S., M.C. and R.F.; formal analysis, F.C., S.B. and P.S.; investigation, F.C., S.B. and P.S.; resources, R.F.; data curation, S.B. and F.C.; writing—original draft preparation, F.C. and S.B.; writing—review and editing, all authors; visualization, F.C. and S.B.; supervision, F.C., P.S., M.C. and R.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Fonds de Recherche du Québec - Santé (FRQS) and Fondation de l'Association des Radiologistes du Québec (FARQ) (R.F.); by the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference 260250] (M.C.) and by IRT SystemX (F.C.).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The source codes for the proposed methods are publicly available at https://github.com/fchamroukhi/DECT-CLUST (accessed on 28 October 2022), free of charge.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

- CT: computed tomography
- DECT: dual-energy computed tomography
- SECT: single-energy computed tomography
- HNSCC: head and neck squamous cell carcinoma
- HU: Hounsfield unit
- VMI: virtual monochromatic image
- VOI: volume of interest
- FDA: functional data analysis
- fPCA: functional principal component analysis
- pdf: probability density function
- MLE: maximum likelihood estimation
- EM: expectation–maximization
- IRLS: iteratively reweighted least squares
- GMM: Gaussian mixture model
- DSC: Dice similarity score
- DB: Davies–Bouldin (index)
- SsMFR: spatial softmax-gated mixture of functional regressions
- SgMFR: spatial Gaussian-gated mixture of functional regressions
- SsMVFR: spatial softmax-gated mixture of vectorized functional regressions
- SgMVFR: spatial Gaussian-gated mixture of vectorized functional regressions

#### **Appendix A. Pseudo-Codes for the Proposed Methods**


#### **Algorithm A2** Pseudo-code of DECT-CLUST with the alternative SgMVFR and SsMVFR models.

**Inputs:** 4D image ($n$ curves $(\boldsymbol{v}_i, \boldsymbol{x}_i, \boldsymbol{y}_i)_{i=1}^{n}$), number of clusters $K$, degree $p$ (and number of knots $q$).

**Outputs:** $\hat{\boldsymbol{\theta}} = (\boldsymbol{\alpha}^{(t)}, \boldsymbol{\theta}_1^{(t)}, \ldots, \boldsymbol{\theta}_K^{(t)})$, the MLE of $\boldsymbol{\theta}$, and the conditional probabilities $\tau_{ik}^{(t)}$.

#### **References**


### *Article* **Development and Validation of an Ultrasound-Based Radiomics Nomogram for Identifying HER2 Status in Patients with Breast Carcinoma**

**Yinghong Guo †, Jiangfeng Wu \*,†, Yunlai Wang and Yun Jin**


**Abstract:** (1) Objective: To evaluate the performance of ultrasound-based radiomics in the preoperative prediction of human epidermal growth factor receptor 2-positive (HER2+) and HER2− breast carcinoma. (2) Methods: Ultrasound images from 309 patients (86 HER2+ cases and 223 HER2− cases) were retrospectively analyzed, of whom 216 patients belonged to the training set and 93 patients were assigned to the time-independent validation set. The region of interest of the tumors was delineated, and the radiomics features were extracted. Radiomics features underwent dimensionality reduction analyses using the intra-class correlation coefficient (ICC), the Mann–Whitney U test, and the least absolute shrinkage and selection operator (LASSO) algorithm. The radiomics score (Rad-score) for each patient was calculated through a linear combination of the nonzero-coefficient features. Support vector machine (SVM), K nearest neighbors (KNN), logistic regression (LR), decision tree (DT), random forest (RF), naive Bayes (NB), and XGBoost (XGB) machine learning classifiers were trained to establish prediction models based on the Rad-score. A clinical model based on significant clinical features was also established. In addition, the logistic regression method was used to integrate the Rad-score and clinical features to generate the nomogram model. The leave-one-out cross-validation (LOOCV) method was used to validate the reliability and stability of the model. (3) Results: Among the seven classifier models, LR achieved the best performance in the validation set, with an area under the receiver operating characteristic curve (AUC) of 0.786, and was selected as the Rad-score model, while RF performed the worst. Tumor size showed a statistical difference between the HER2+ and HER2− groups (*p* = 0.028). The nomogram model had a slightly higher AUC than the Rad-score model (AUC, 0.788 vs. 0.786), but the difference was not statistically significant (DeLong test, *p* = 0.919). The LOOCV method yielded a high median AUC of 0.790 in the validation set. (4) Conclusion: The Rad-score model performs best among the seven classifiers. The nomogram model based on the Rad-score and tumor size has slightly better predictive performance than the Rad-score model, and it has the potential to be utilized as a routine modality for preoperatively determining HER2 status in BC patients non-invasively.

**Keywords:** ultrasound; HER2; breast carcinoma; radiomics

#### **1. Introduction**

Breast carcinoma (BC) is the most common malignancy and the most frequent cause of carcinoma mortality in women worldwide [1], and it is a complex and heterogeneous disease [2–4]. Currently, BC is mainly classified into hormone-receptor-positive, human epidermal growth factor receptor 2-positive (HER2+), and triple-negative BC on the basis of histopathological characteristics [5,6].

HER2+ BC, in which the cells do not express estrogen receptors and progesterone receptors, accounts for about 15% of all BC cases and presents a high rate of recurrence and poor prognosis compared with hormone-receptor-positive BC [7–9].

**Citation:** Guo, Y.; Wu, J.; Wang, Y.; Jin, Y. Development and Validation of an Ultrasound-Based Radiomics Nomogram for Identifying HER2 Status in Patients with Breast Carcinoma. *Diagnostics* **2022**, *12*, 3130. https://doi.org/10.3390/ diagnostics12123130

Academic Editor: Mauro Giuseppe Mastropasqua

Received: 27 September 2022 Accepted: 9 December 2022 Published: 12 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


Nevertheless, over the last two decades, as agents that target HER2, including trastuzumab and pertuzumab, have been extensively applied in clinical practice, significant advances have been made in the treatment of HER2+ BC and overall survival has improved [10–12]. Hence, HER2 status is one of the most significant and decisive factors in treatment decisions and prognosis for breast carcinoma patients.

So far, the evaluation of HER2 status in breast carcinoma patients mainly relies on immunohistochemistry (IHC) examination after surgical tumor excision or biopsy [13], whereas both biopsy and surgery are invasive procedures and may lead to an increased risk of complications such as seroma, local pain, and infection [14,15]. Moreover, the results of a few tissue biopsies do not necessarily represent the HER2 status of the whole tumor [16]. In addition, in our center, routine histopathological findings are analyzed, but patients still need to pay extra for IHC results. Therefore, it is urgent to develop an economical, non-invasive, and precise pretreatment technology to predict HER2 status in breast carcinoma patients.

Radiomics is a new research field based on quantitative imaging methods, which are mainly adopted to extract and analyze large numbers of imaging features that are hardly perceptible to radiologists and that reflect tissue information [17,18]. Recent studies demonstrate that radiomics features extracted from magnetic resonance imaging (MRI) and computed tomography (CT) images have been widely used in diagnosis, prediction of tumor stage and histological subtype, and prognostic evaluation [19–22]. However, MRI and CT are limited by economic cost and/or equipment availability. Compared with these imaging technologies, ultrasound, recognized as a radiation-free, convenient, and reasonably priced technology, is universally used for breast carcinoma screening and diagnosis [23]. A number of researchers have extended radiomics to ultrasound imaging [24,25]. Prior ultrasound radiomics studies have shown that molecular subtypes of BC are related to qualitative imaging characteristics and histopathologic features [26,27].

To the best of our knowledge, relatively few studies have attempted to predict the HER2 status of breast carcinoma using ultrasound-based radiomics. We hypothesized that ultrasound radiomics features might provide guidance for predicting HER2 status in patients with breast carcinoma, and we therefore aimed to develop and validate an ultrasound radiomics model capable of predicting HER2 status.

#### **2. Materials and Methods**

#### *2.1. Patient Cohorts*

The institutional review board approved this retrospective study, and the requirement for written informed consent was waived.

In total, 522 female patients with primary BC, confirmed by pathology examination of biopsy or surgical excision specimens, who were examined by ultrasound before treatment at our institution from March 2019 to November 2021, were retrospectively collected.

Exclusion criteria were as follows: (a) ultrasound images not suitable for radiomics study because of poor quality, artifacts, calcifications, or cystic changes (*n* = 48); (b) tumors larger than 50 mm in diameter (incompletely displayed in a single plane) (*n* = 27); (c) patients who underwent biopsy, radiotherapy, and/or chemotherapy before ultrasound examination (*n* = 65); (d) patients with multifocal lesions or non-mass BC (*n* = 41); and (e) patients with missing clinical characteristics and/or postoperative histopathology (*n* = 32). Finally, there were 309 eligible patients with BC, of whom those from March 2019 to November 2020 served as the training set (*n* = 216), while the remaining patients formed the time-independent validation set (*n* = 93). The flowchart of patient selection is shown in Figure 1.

#### *2.2. Pathological Assessment*

IHC is the leading clinical technology for immunostaining, which can precisely determine the molecular subtypes of BC with high specificity. The estrogen receptor (ER) and progesterone receptor (PR) status was considered positive if ≥1% of tumor cells had positively stained nuclei [28]. For HER2 status identification, an IHC score of 3+ was considered positive, while an IHC score of 0 or 1+ was considered negative. An IHC score of 2+ was considered indeterminate, in which case fluorescence in situ hybridization (FISH) was carried out to assess gene amplification, and HER2 was classified as positive if the ratio was ≥2.0 [6]. For Ki-67 status, tumors with greater than 14% positively stained nuclei were considered to have high expression, while other cases were considered to have low expression [29].

**Figure 1.** The patient enrollment process for this study.

#### *2.3. Clinical Characteristics*

Clinical data such as age, tumor size, and tumor location were obtained from patients' medical records. Status of ER, PR, and HER2, Ki-67 levels, molecular subtype, lymph node metastasis, and histological type of tumor were obtained by reviewing patients' pathology reports.

#### *2.4. Image Acquisition and Segmentation*

Breast ultrasound examinations were carried out by sonographers with more than 5 years of experience in breast ultrasound imaging, within 2 weeks before surgical resection. Ultrasound was performed using the LOGIQ E9 ultrasound system with a 6–15 L linear array probe and the Siemens Acuson S2000 with a 6–18 L linear array probe, with radial, transverse, and longitudinal scanning of both breasts. The imaging parameters were consistent among patients: the gain was about 50%, the image depth was about 3.0 cm to 5.0 cm, and the focus was set at the level of the lesion. The ultrasound image was 1164 × 873 pixels and 1024 × 768 pixels in size on the LOGIQ E9 and Siemens Acuson S2000 devices, respectively. The image of the largest section of the breast tumor with the clearest imaging was saved in the Digital Imaging and Communications in Medicine format to maximize the preservation of image information. Manual segmentation was performed on gray-scale ultrasound images of breast lesions. Sonographer 1 (with more than 5 years of experience in breast ultrasound imaging), blinded to the patients' clinical history, selected the largest plane of each breast lesion and drew an outline of the region of interest (ROI) using ITK-SNAP software (version 3.4.0).

#### *2.5. Radiomic Feature Extraction*

A total of 788 radiomics features, consisting of shape, statistics, texture, and wavelet features, were extracted using the "pyradiomics" package of Python (version 3.7.11). These ultrasound radiomics features comprised 14 two-dimensional shape-based features, 18 first-order statistics features, 22 gray-level co-occurrence matrix (GLCM) features, 16 gray-level run length matrix (GLRLM) features, 16 gray-level size zone matrix (GLSZM) features, 14 gray-level dependence matrix (GLDM) features, and 688 features derived from the first-order, GLCM, GLRLM, GLSZM, and GLDM features computed on wavelet-filtered images. Supplementary Material Data S1 contains details on the ultrasound radiomics extraction settings.
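For illustration, a minimal pyradiomics extraction sketch consistent with the setup described above is given below. The file names and the `force2D`/`binWidth` settings are assumptions; the authors' exact settings are in Supplementary Data S1.

```python
# Minimal 2-D ultrasound feature extraction with pyradiomics (illustrative).
from radiomics import featureextractor

settings = {
    'force2D': True,   # treat each ultrasound image as a single 2-D slice
    'binWidth': 25,    # gray-level discretization width (assumed value)
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.enableAllFeatures()               # shape, first-order, GLCM, GLRLM, GLSZM, GLDM
extractor.enableImageTypeByName('Wavelet')  # adds wavelet-filtered feature copies

# image and ROI mask exported from ITK-SNAP, e.g., as NRRD files (hypothetical names)
features = extractor.execute('lesion_image.nrrd', 'lesion_mask.nrrd')
radiomics_features = {k: v for k, v in features.items()
                      if not k.startswith('diagnostics_')}
print(len(radiomics_features), 'features extracted')
```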

#### *2.6. Evaluation of Inter- and Intra-Class Correlation Coefficient*

The inter- and intra-class correlation coefficients (ICCs) were adopted to test the reproducibility of feature extraction. Sonographers 1 and 2 (both with more than 5 years of experience in breast ultrasound imaging) drew ROIs on the same ultrasound images from the 50 randomly selected patients and extracted the radiomics features. Two weeks later, sonographer 1 repeated ROI segmentation on the same ultrasound images and extracted the radiomics features to assess the intra-observer reproducibility. An ICC greater than 0.75 suggested a good agreement for the feature extraction.
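A sketch of such an ICC filter, assuming the two sonographers' feature values are arranged in long format, might look as follows; the `pingouin` implementation and the column names are illustrative choices, not the authors' code.

```python
# Illustrative ICC-based reproducibility filter for one radiomics feature.
import pandas as pd
import pingouin as pg

def icc_for_feature(long_df: pd.DataFrame) -> float:
    """long_df columns: 'patient', 'rater', 'value' for a single feature."""
    icc = pg.intraclass_corr(data=long_df, targets='patient',
                             raters='rater', ratings='value')
    # ICC2 (single random raters, absolute agreement) is a common choice
    return float(icc.loc[icc['Type'] == 'ICC2', 'ICC'].iloc[0])

# keep only features whose ICC exceeds the 0.75 agreement threshold, e.g.:
# reproducible = [f for f in feature_names if icc_for_feature(tables[f]) > 0.75]
```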

#### *2.7. Radiomics Feature Selection*

All the radiomics features were standardized by the z-score algorithm, which rescales each feature to zero mean and unit variance, to ensure a uniform feature scale and improve the comparability between features. The features with ICCs less than 0.75 were excluded.

In the training set, the Kolmogorov–Smirnov test was first performed to assess whether the variables were normally distributed, and Levene's test was used to assess the equality of variances. An independent sample *t* test was used for variables with a normal distribution and homogeneity of variance; otherwise, the Mann–Whitney U test was used. The radiomics features that showed no significant differences were excluded. The remaining radiomics features were further screened using penalized logistic regression with the least absolute shrinkage and selection operator (LASSO) algorithm. The optimal lambda was selected through 10-fold stratified cross-validation, tuned to achieve the minimum mean square error. Features with a non-zero coefficient in the model were regarded as the most representative features.
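An illustrative scikit-learn sketch of this two-step selection is shown below. Note that `LogisticRegressionCV` optimizes log-loss here, whereas the authors tuned lambda by minimum mean square error, so this is an approximation of the described pipeline rather than their implementation.

```python
# Two-step feature selection: Mann-Whitney U filter, then L1 logistic regression.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

def select_features(X, y):
    # X: (n_samples, n_features) z-scored radiomics matrix; y: 0/1 HER2 labels
    keep = [j for j in range(X.shape[1])
            if mannwhitneyu(X[y == 1, j], X[y == 0, j]).pvalue < 0.05]
    X_f = X[:, keep]
    lasso = LogisticRegressionCV(
        Cs=20,
        cv=StratifiedKFold(10, shuffle=True, random_state=0),
        penalty='l1', solver='liblinear', scoring='neg_log_loss',
    ).fit(X_f, y)
    nonzero = np.flatnonzero(lasso.coef_.ravel())   # surviving LASSO features
    return [keep[j] for j in nonzero], lasso
```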

#### *2.8. Development and Validation of the Prediction Model*

The radiomics score (Rad-score) was calculated for each lesion using LASSO regression and a linear combination of the values of the selected features weighted by their respective non-zero coefficients. Based on the Rad-score, seven machine learning classifiers consisting of decision tree (DT), K nearest neighbors (KNN), random forest (RF), support vector machine (SVM), logistic regression (LR), naive Bayes (NB), and XGBoost were used to construct the prediction model in the training set. The classifier with the highest AUC value in the validation set was obtained as the Rad-score model.
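A minimal sketch of this classifier comparison with scikit-learn and xgboost follows; hyperparameters are library defaults rather than the authors' settings.

```python
# Train the seven Rad-score classifiers and pick the one with the best
# validation AUC (illustrative defaults, not the authors' tuned models).
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

classifiers = {
    'LR': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'DT': DecisionTreeClassifier(random_state=0),
    'RF': RandomForestClassifier(random_state=0),
    'SVM': SVC(probability=True, random_state=0),
    'NB': GaussianNB(),
    'XGB': XGBClassifier(eval_metric='logloss'),
}

def pick_rad_score_model(rad_train, y_train, rad_val, y_val):
    # rad_* are Rad-score column vectors of shape (n, 1)
    aucs = {}
    for name, clf in classifiers.items():
        clf.fit(rad_train, y_train)
        aucs[name] = roc_auc_score(y_val, clf.predict_proba(rad_val)[:, 1])
    best = max(aucs, key=aucs.get)
    return best, classifiers[best], aucs
```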

#### *2.9. Clinical Model and Nomogram Model*

Clinical features that showed a statistical difference between the HER2+ and HER2− BC in the training set were adopted to develop the clinical model by using the logistic regression method. In addition, the nomogram model combining significant clinical factors and the Rad-score was constructed for personalized HER2 status prediction.

We evaluated the performances of all the models in the time-independent validation set in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the area under the receiver operating characteristic (ROC) curve (AUC). To verify the robustness of the nomogram model, the calibration curve [25] was plotted. Furthermore, decision curve analysis (DCA) [26] was also utilized to select the model that maximized patient benefits. The flowchart of this research is shown in Figure 2.

**Figure 2.** Schematic representation of the radiomics analysis steps.
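As a concrete reference for the evaluation metrics listed above, they can all be computed from a 2 × 2 confusion matrix as in the following sketch; the 0.5 decision threshold is an assumption.

```python
# Sensitivity, specificity, PPV, NPV, accuracy, and AUC from predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'PPV': tp / (tp + fp),
        'NPV': tn / (tn + fn),
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'AUC': roc_auc_score(y_true, y_prob),
    }
```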

#### *2.10. Statistical Analysis*

R version 3.5.1 software was used for statistical analysis and figure plotting. Radiomics features were extracted from each ROI using the "pyradiomics" package of Python (version 3.7.11). The continuous variables with normal distribution and homogeneity of variance were shown as the mean (standard deviation) and tested by an independent sample *t* test; otherwise, the data were analyzed by the Mann–Whitney U test and expressed as the median (interquartile range). For categorical variables, the chi-square analysis or Fisher's exact tests were applied to compare the results. A two-tailed *p* < 0.05 indicated a significant difference.

#### **3. Results**

#### *3.1. Clinical and Pathological Characteristics*

The clinical and pathological characteristics of the training and validation sets were compared, and no statistically significant differences were found (*p* > 0.05) (Table 1), suggesting that the training and validation sets were comparable in these clinical and pathological characteristics.


**Table 1.** The baseline characteristics of the enrolled patients in the training and validation sets.

| Characteristic | Total Set (*n* = 309) | Training Set (*n* = 216) | Validation Set (*n* = 93) | *p*-Value |
|---|---|---|---|---|
| ER | | | | 0.973 |
| Positive | 228 | 160 | 68 | |
| Negative | 81 | 56 | 25 | |
| PR | | | | 0.597 |
| Positive | 188 | 134 | 54 | |
| Negative | 121 | 82 | 39 | |
| HER2 | | | | 1.000 |
| Positive | 86 | 60 | 26 | |
| Negative | 223 | 156 | 67 | |
| Histologic type | | | | 0.581 |
| Invasive ductal | 259 | 184 | 75 | |
| Invasive lobular | 14 | 9 | 5 | |
| Other | 36 | 23 | 13 | |
| Ultrasound equipment | | | | 0.636 |
| Siemens Acuson S2000 | 246 | 174 | 72 | |
| LOGIQ E9 | 63 | 42 | 21 | |
| US-reported LN | | | | 0.875 |
| Metastasis positive | 130 | 92 | 38 | |
| Metastasis negative | 179 | 124 | 55 | |
| Pathology-reported LN | | | | 0.868 |
| Metastasis positive | 170 | 120 | 50 | |
| Metastasis negative | 139 | 96 | 43 | |
| Ki-67 (%, mean ± SD) | 28.52 ± 22.16 | 28.16 ± 21.96 | 29.38 ± 22.72 | 0.663 |
| Radiomics score (median, IQR) | −0.0097 (−0.0975, 0.0794) | −0.0099 (−0.1030, 0.0787) | −0.0029 (−0.0883, 0.0808) | 0.678 |


ER, estrogen receptor; PR, progesterone receptor; HER2, human epidermal growth factor receptor 2; SD, standard deviation; IQR, interquartile range; LN, lymph node; US, ultrasound; BI-RADS, Breast Imaging Reporting and Data System.

#### *3.2. Radiomics Feature Extraction and Selection*

A total of 788 radiomics features were extracted from the ultrasound images of each patient. The reproducibility of ultrasound radiomics features extraction was assessed. The intra-observer correlation coefficient of sonographer 1 in two extractions was between 0.296 and 0.996, while the inter-observer correlation coefficient of extraction by sonographer 1 and sonographer 2 was between 0.323 and 0.989. Finally, 23 radiomics features (ICC < 0.75) were excluded. The ICC evaluation results are shown in Figure 3. The morphological characteristics of the randomly selected lesions for ICC assessment are provided as Supplementary Material Data S2. All of the following analyses were based on the radiomics features extracted by sonographer 1.

In the training set, after evaluating the differences of radiomics features by the Mann– Whitney U test, 321 radiomics features were used for further analysis. Then, the optimum Lambda (Lambda = 0.027464741148160516) was determined for the LASSO regression, and 12 radiomics features with nonzero coefficients were selected to differentiate HER2+ from HER2− BC (Figure 4).

**Figure 3.** Bar plots of intra- and inter-observer ICC. Upper: inter-rater agreement; Lower: intra-rater agreement. ICC: intra-class correlation coefficient.

**Figure 4.** Feature selection and Rad-score building by LASSO. (**A**) A 10-fold cross validation was used to predict mean square error of the Rad-score building by different Lambda values. (**B**) The coefficient profiles of the radiomics features determined by different Lambda values.

Detailed information on the HER2+ BC-related features is shown in Table 2, and the nonzero coefficients of the selected features based on the LASSO regression are shown in Figure 5A. Moreover, the Pearson correlation coefficient between any pair of selected features was computed, and the correlation coefficient matrix heatmap is shown in Figure 5B.

**Table 2.** List of features with nonzero coefficients.


**Figure 5.** (**A**) The coefficients of radiomics features to construct the Rad-score; (**B**) a Pearson correlation coefficient heatmap of the selected features for predicting HER2 status. Green color denotes a positive correlation, the red color denotes a negative correlation, and the shade of the color indicates the degree of correlation.

#### *3.3. Radiomics Score Calculation*

The radiomics score (Rad-score) for each patient in the training and validation sets was calculated through a linear combination of the nonzero-coefficient features based on the LASSO regression, as shown in Figure 6A,B. The corresponding fitting formula is listed in Supplementary Material Data S3. In the training set, the median Rad-score differed significantly between the HER2+ and HER2− BC groups (0.0838 vs. −0.0546, *p* < 0.001), and the same result was obtained in the validation set (0.0936 vs. −0.0518, *p* < 0.001) (Figure 6C,D, Table 3).
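Although the fitted coefficients are reported only in Supplementary Data S3, the Rad-score has the generic LASSO linear-combination form

$$\text{Rad-score} = \beta_0 + \sum_{i=1}^{12} \beta_i \, x_i,$$

where $x_i$ denotes the $i$-th selected feature, $\beta_i$ its nonzero LASSO coefficient, and $\beta_0$ the intercept.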

**Figure 6.** Radiomics score for each breast carcinoma patient in the training (**A**) and validation sets (**B**); Distribution of radiomics score values of the HER2+ and HER2− groups in the training (**C**) and validation sets (**D**).


**Table 3.** Rad-score for the training and validation sets.

IQR, interquartile range.

#### *3.4. Construction and Evaluation of Machine Learning Classifier*

Seven machine learning classifiers (KNN, DT, RF, SVM, LR, NB, and XGBoost) were then adopted to develop the prediction model based on the Rad-score. The sensitivity, specificity, accuracy, PPV, NPV, and AUC values of the seven machine learning classifiers are shown in Table 4.



DT, decision tree; RF, random forest; SVM, support vector machine; LR, logistic regression; NB, naive Bayes; KNN, K nearest neighbors; XGB, XGBoost; AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy.

Among the classifiers, the overall accuracies of RF and XGBoost were 100.0% and 94.0% in the training set but only 63.4% and 66.7% in the validation set, which suggested overfitting. In the validation set, the accuracy was 63.4% for the RF classifier and 77.4% for the SVM and NB classifiers, and the AUC values of the seven machine learning classifiers ranged from 0.593 to 0.786, with the LR classifier performing the best and the RF classifier performing the worst. The LR classifier, with the highest AUC value, was selected as the Rad-score model. In addition, a comparison of the ROC curves of the seven machine learning classifiers in the training and validation sets is shown in Figure 7. Furthermore, the AUC values between each pair of classifiers were compared, and the *p* values obtained by the DeLong test are shown in Table 5.

**Figure 7.** Receiver operating characteristic curves of seven machine learning classifiers predicting HER2+ status in training (**A**) and validation sets (**B**).


**Table 5.** *p* values for the AUC comparison between each pair of models tested by the DeLong method in the validation set.

LR, logistic regression; KNN, K nearest neighbors; DT, decision tree; RF, random forest; SVM, support vector machine; NB, naive Bayes; XGB, XGBoost; AUC, area under the curve. Bold values (<0.05) indicate a statistically significant difference.

#### *3.5. Clinical Model and Nomogram Model*

The clinical features of the HER2+ and HER2− BC groups in the training set were compared. Tumor size (*p* = 0.028) and Rad-score (*p* < 0.001) were significant factors for distinguishing HER2+ from HER2− BC. Other clinical features, such as age, tumor location, ultrasound equipment, and ultrasound-reported lymph node status, were not identified as potential factors for predicting the HER2+ type (Table 6). The clinical model based on tumor size was then constructed using logistic regression. At the same time, the nomogram model was established by combining tumor size and the Rad-score (Figure 8).

**Table 6.** Comparison of the clinical features between the HER2+ and HER2− BC groups in the training set.


SD, standard deviation; LN, lymph node; US, ultrasound; IQR, interquartile range.

**Figure 8.** Nomogram based on the combination of the tumor size and Rad-score was developed using logistic regression analysis.

Moreover, the predictive abilities of the clinical, Rad-score and nomogram models were compared. The results for each model are summarized in Table 7. The ROC curves of the three models to predict the HER2+ type are shown in Figure 9. In the time-independent validation set, the AUC value of the nomogram was significantly higher than that of the clinical model (AUC, 0.788 vs. 0.618; DeLong test, *p* = 0.016). Although the nomogram model performed slightly better than the Rad-score model, there was no statistically significant difference between them (AUC, 0.788 vs. 0.786; DeLong test, *p* = 0.919).

**Table 7.** Predictive performances of the models identifying HER2+ status in patients with BC.


AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy.

**Figure 9.** Receiver operating characteristic curves of the three models predicting HER2+ type in the training (**A**) and validation sets (**B**).

The LOOCV algorithm was carried out to validate the reliability and stability of the results, which yielded a high median AUC (0.790 in the validation set), indicating that the predictive performance of the nomogram model was reliable and stable.
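A minimal sketch of such a LOOCV estimate with scikit-learn is shown below; the authors' exact resampling protocol may differ, so this is illustrative only.

```python
# LOOCV AUC for a logistic model on the nomogram inputs (illustrative).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

def loocv_auc(X, y):
    # X: (n, 2) array of [tumor_size, rad_score]; y: 0/1 HER2 labels
    probs = cross_val_predict(LogisticRegression(), X, y,
                              cv=LeaveOneOut(), method='predict_proba')[:, 1]
    return roc_auc_score(y, probs)
```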

#### *3.6. Model Performance Evaluation*

The predictive performances of the nine models, including seven machine learning classifiers, a clinical model, and a nomogram model, in the validation set are shown in Figure 10. The nomogram model has the highest AUC value (0.788), sensitivity (73.1%), and accuracy (78.5%), and NB has the highest specificity (91.0%). To sum up, the overall discrimination performance of the nomogram model was better than that of other models.

#### *3.7. Clinical Application of the Prediction Models*

The calibration of the nomogram was tested using the Hosmer–Lemeshow test, which yielded nonsignificant results (*p* > 0.05 in both the training and validation sets), showing good agreement between the observed and predicted results (Figure 11).

**Figure 11.** Calibration curves of the nomogram model in the training (**A**) and validation sets (**B**).

Decision curve analysis of the clinical, Rad-score and nomogram models is shown in Figure 12. The gray line represents the assumption that all lesions were HER2+ type. The black line represents the assumption that all lesions were HER2− type. If the threshold probability was less than 56.9%, using the nomogram would add more benefit (red line).

**Figure 12.** Decision curves of the models. If the risk threshold is less than 56.9%, the nomogram model will obtain more benefit than all treatment (assuming all breast cancer patients were HER2+) or no treatment (assuming all breast cancer patients were HER2−).
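For reference, the net benefit plotted in decision curves such as Figure 12 follows the standard Vickers–Elkin formula $\mathrm{NB}(p_t) = \mathrm{TP}/N - (\mathrm{FP}/N)\,p_t/(1-p_t)$; a small sketch, not the authors' plotting code, is given below.

```python
# Decision curve analysis: net benefit of a model and of "treat all".
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    y_true = np.asarray(y_true)
    n = len(y_true)
    nb = []
    for pt in thresholds:                     # pt: threshold probability
        pred = np.asarray(y_prob) >= pt
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        nb.append(tp / n - fp / n * pt / (1 - pt))
    return np.array(nb)

def net_benefit_treat_all(y_true, thresholds):
    # reference line: every lesion assumed HER2+
    prev = np.mean(y_true)
    return np.array([prev - (1 - prev) * pt / (1 - pt) for pt in thresholds])
```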

#### **4. Discussion**

Mineable data can be extracted from digital medical images by radiomics and analyzed to improve detection, diagnosis, staging, and prognosis prediction [20–22,24]. Ultrasound radiomics might help answer questions such as which molecular subtype a BC belongs to, and this might affect the treatment strategy in patients with BC.

In our study, seven machine learning classifiers, namely KNN, LR, SVM, DT, NB, RF, and XGBoost, were established based on the Rad-score in the training set and tested in the time-independent validation set. Among them, the LR classifier, with an AUC value of 0.786, performed the best, possibly because more complex classifiers need larger training samples. The LR classifier was therefore selected as the Rad-score model. The results indicated that the ultrasound-derived Rad-score could predict the HER2+ status of patients with breast carcinoma. In addition, by establishing a nomogram model combining the Rad-score with clinical risk factors, we found that the nomogram model had significantly improved predictive performance compared with the model involving only clinical risk factors (AUC, 0.788 vs. 0.618, in the validation set) and slightly improved performance compared with the Rad-score model (AUC, 0.788 vs. 0.786, in the validation set). The consistency between the nomogram model's predicted probability of HER2 status and the actual results was evaluated by the calibration curve, and the *p*-values in the training and validation sets were all >0.05, suggesting good model stability. In addition, patients with BC could obtain a pronounced net benefit from the nomogram model when the threshold probability is less than 56.9%, as shown in the decision curve analysis, demonstrating the good clinical utility of this model. The nomogram model could potentially be utilized as a routine tool to assist clinicians in preoperatively predicting HER2 status non-invasively.

In recent years, radiomics studies have mainly been carried out based on computed tomography or magnetic resonance imaging [19–22], demonstrating that radiomics features can reflect the heterogeneity of tumors and have become a reliable potential biomarker for improving diagnosis and treatment decisions. In recent radiomics studies on breast ultrasound imaging, researchers have mainly focused on the differential diagnosis of benign and malignant breast tumors [27,30,31], the prediction of preoperative axillary lymph node metastasis [26,32,33], and the prediction of molecular subtypes [28], with mixed findings that might be due to the heterogeneity of ultrasound machines, algorithms, and extracted features. The results of our study support a possible clinical role for the nomogram model in the identification of HER2 status in BC, in accordance with the ultrasound radiomics studies mentioned above.

In the present study, the ultrasound images of breast carcinomas were analyzed by radiomics, and 12 features were finally screened out to calculate the radiomics score. A majority of the selected ultrasound radiomics features were wavelet-based features, which are thought to reveal tumor characteristics hidden behind the speckle and to show discriminative ability [32,34]. Among the 12 features, original\_glszm\_SmallAreaEmphasis revealed the strongest correlation with HER2+, while wavelet-LHL\_glcm\_Idn and wavelet-HLL\_gldm\_DependenceNonUniformityNormalized also showed strong correlations. GLCM texture features quantify the relationships between combinations of gray levels in the image; they have been widely used in many texture analysis applications and can reflect the internal spatial heterogeneity of tumor lesions [35,36]. In the present study, GLCM features extracted from ultrasound images of BC were correlated with HER2 status. Radiomics features extracted from ultrasound images of BC could thus detect the invisible heterogeneity of tumors and be used to predict HER2 status in patients with BC.

Generally, a single feature selection method is adopted in conventional radiomics analysis. In the study by Xu et al. [37], six ultrasound radiomics features were selected by recursive feature elimination, and a random forest model including 90 trees was built for the prediction of HER2 status, with AUCs of 0.780 and 0.740 in the training and validation sets. In order to reduce overfitting effectively, we used the ICC and Mann–Whitney U test for feature selection in the first step and LASSO regression in the second step, and we achieved better predictive performance with the LR classifier than the study by Xu et al., with AUC values of 0.804 and 0.786 in the training and validation sets, respectively. In addition, the statistical power of our study might be more robust because our sample size was significantly larger than theirs (309 vs. 114).

A prior study by Wu et al. based on ultrasound radiomics developed models to predict the expression of molecular biomarkers of the mass type of breast ductal carcinoma in situ (DCIS) [29]. Based on 41 ultrasound radiomics features, they generated a model predictive of the HER2+ type in BC patients, with AUC values of 0.940 in the training set and 0.740 in the validation set. Given the markedly lower AUC in the validation set and the 41 selected radiomics features (much more than 10% of the training set sample size) used to establish the model, we speculate that overfitting should be taken into account. Moreover, in their study, only patients with the mass type of DCIS were enrolled, whereas in this study, tumors such as invasive ductal carcinoma, invasive lobular carcinoma, and mucinous breast carcinoma were included, which expanded the range of tumor types. Furthermore, the sample size of their retrospective study was much smaller than ours (116 vs. 309). Hence, compared with the study by Wu et al., a major highlight of our study is the larger sample size and diversity of tumor types, which might increase the generalizability of the nomogram model. We obtained a higher AUC value than the aforementioned studies on the prediction of HER2 status using radiomics and machine learning algorithms [29,37]. The most probable explanation is that we adopted seven machine learning classifiers to develop seven prediction models and selected the one with the highest AUC value. Furthermore, the nomogram model combining the Rad-score with the clinical risk factor of tumor size was constructed and achieved better predictive performance than the LR classifier alone.

Despite the significance of the present research, there are several shortcomings in our study. Firstly, the prediction model based on ultrasound radiomics features was established and tested for distinguishing HER2+ from HER2− BC in a single hospital, with only 216 patients in the training set and 93 patients in the validation set. In addition, as all data were collected retrospectively and limited to Chinese patients, bias was inevitable. Therefore, further prospective studies need to involve a larger patient population and perform multicenter external validation. Secondly, in our study, the extraction of radiomics features required time-consuming tumor boundary segmentation and human-defined features, and we believe that a deep learning algorithm might detect and segment lesions accurately and automatically and achieve more objective results [38,39]. Thirdly, only gray-scale ultrasound images were adopted to develop the radiomics model, and other types of images, such as elastosonography or color Doppler ultrasound, might be taken into account for multi-modal imaging to improve the predictive performance. Finally, radiomics studies based on gray-scale ultrasound images still lack reproducibility, as researchers often select different ultrasound images of the same lesion for radiomics analysis. Three-dimensional ultrasound images for feature extraction might be more objective than conventional two-dimensional images, which could be considered in future studies.

#### **5. Conclusions**

In summary, the Rad-score model performs best among the seven classifiers. The nomogram model based on Rad-score and tumor size has slightly better predictive performance than the Rad-score model, and it has the potential to be utilized as a routine modality for preoperatively determining HER2 status in BC patients non-invasively. However, further studies with a prospective design and a larger population are required to validate the conclusions.

**Supplementary Materials:** The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/diagnostics12123130/s1. Data S1: The ultrasound radiomics extraction settings; Data S2: The morphological characteristics of the randomly selected 50 lesions for ICC assessment; Data S3: The corresponding fitting formula for calculating the Rad-score.

**Author Contributions:** Conceptualization, J.W.; Methodology, Y.W.; Software, Y.G. and Y.W.; Data curation, Y.G., Y.W. and Y.J.; Writing—original draft, Y.G., J.W. and Y.J.; Writing—review & editing, J.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Jinhua Science and Technology Bureau Scientific Research Project (2022-3-019).

**Institutional Review Board Statement:** The study was approved by the Ethics Committee of Dongyang People's Hospital (2022-YX-024).

**Informed Consent Statement:** This retrospective study was approved by our institutional review board, which waived the requirement for patients' informed consent.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **The Development of an Intelligent Agent to Detect and Non-Invasively Characterize Lung Lesions on CT Scans: Ready for the "Real World"?**

**Martina Sollini 1,2, Margarita Kirienko 3, Noemi Gozzi 4, Alessandro Bruno 1, Chiara Torrisi 2, Luca Balzarini 2, Emanuele Voulaz 1,2, Marco Alloisio 1,2 and Arturo Chiti 1,2,\***


**Simple Summary:** An "intelligent agent" based on deep learning solutions is proposed to detect and non-invasively characterize lung lesions on computed tomography (CT) scans. Our retrospective study aimed to assess the effectiveness of Retina U-Net and the convolutional neural network for computer-aided detection (CADe) and computer-aided diagnosis (CADx) purposes. CADe and CADx were trained, validated, and tested on the publicly available LUNA challenge dataset and two local low-dose CT datasets from the IRCCS Humanitas Research Hospital.

**Abstract:** (1) Background: Once lung lesions are identified on CT scans, they must be characterized by assessing the risk of malignancy. Despite the promising performance of computer-aided systems, some limitations related to the study design and technical issues undermine these tools' efficiency; an "intelligent agent" to detect and non-invasively characterize lung lesions on CT scans is proposed. (2) Methods: Two main modules tackled the detection of lung nodules on CT scans and the diagnosis of each nodule into benign and malignant categories. The computer-aided detection (CADe) and computer-aided diagnosis (CADx) modules relied on deep learning techniques such as Retina U-Net and a convolutional neural network. (3) Results: Tests were conducted on one publicly available dataset and two local datasets featuring CT scans acquired with different devices to reveal deep learning performance in "real-world" clinical scenarios. The CADe module reached an accuracy rate of 78%, while the CADx's accuracy, specificity, and sensitivity stood at 80%, 73%, and 85.7%, respectively. (4) Conclusions: Two different deep learning techniques were adapted for CADe and CADx purposes on both publicly available and private CT scan datasets. Experiments showed adequate performance in both detection and diagnosis tasks. Nevertheless, some drawbacks still characterize the supervised learning paradigm employed in networks such as CNN and Retina U-Net in real-world clinical scenarios, with CT scans from different devices with different sensor fingerprints and spatial resolutions. Continuous reassessment of CADe and CADx performance is needed during their implementation in clinical practice.

**Keywords:** CT scans; lung nodules; artificial intelligence; deep learning

#### **1. Introduction**

Lung lesions are common. The overall incidence of lung nodules has increased 10-fold from 1959 to 2015 [1], but, fortunately, the diagnosis of lung cancer has not risen accordingly [2].

**Citation:** Sollini, M.; Kirienko, M.; Gozzi, N.; Bruno, A.; Torrisi, C.; Balzarini, L.; Voulaz, E.; Alloisio, M.; Chiti, A. The Development of an Intelligent Agent to Detect and Non-Invasively Characterize Lung Lesions on CT Scans: Ready for the "Real World"? *Cancers* **2023**, *15*, 357. https://doi.org/10.3390/ cancers15020357

Academic Editor: Yasushi Shintani

Received: 1 November 2022 Revised: 4 December 2022 Accepted: 2 January 2023 Published: 5 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The increasing use of "modern" imaging techniques, the higher adherence to screening programs, and the regular follow-up of patients suffering from other cancers result in a greater number of lung lesions being incidentally detected in asymptomatic people [2]. Several factors should be considered when dealing with the first diagnosis of lung nodules, including the patient's pre-test probability of malignancy (e.g., smoking habits and family or previous history of lung cancer) and the lesion's characteristics (e.g., size, spiculation, and pleural indentation) [2]. Based on these risk assessments, patients are assigned to a class of risk and are managed accordingly [2]. The workup of patients with incidentally detected pulmonary lesions comprises actions ranging from no further steps to computed tomography (CT) surveillance, [18F]FDG positron emission tomography (PET)/CT, and invasive procedures (biopsy, surgery, radiation therapy, or interventional radiology treatment). From a practical point of view, once identified, lung lesions must be characterized by assessing the risk of malignancy. Several qualitative CT features have been reported to be associated with malignancy (e.g., size and attenuation characteristics) [2,3], and standardized criteria to describe pulmonary nodules have been proposed (number, size, and pattern) [3]. Nonetheless, there are still several hurdles to be overcome concerning the applicability and reproducibility of these criteria (i.e., inter-operator and intra-operator variability due to misinterpretation and different experience and expertise), ultimately affecting the management of patients diagnosed with lung nodule(s).

In recent years, artificial intelligence, acting as "another pair of eyes", has gained popularity. Computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems have recently been developed [4–6] to support imagers in both lung lesion detection and diagnosis tasks. A number of models have been developed for the purpose of lung nodule detection and segmentation [7,8]. Many lung nodule segmentation algorithms based on either general or multiview neural network architectures have been proposed. Most studies adopting multiview neural networks have introduced new architectures by taking multiple lung nodule views and using those views as inputs to the neural networks. In contrast, the general neural-network-based methods rely primarily on the U-Net architecture. Moreover, different lung nodule segmentation methods can be used for different types of lung nodules. Additionally, many techniques have been proposed for the classification of lung nodules (e.g., whether they are benign or malignant), focused on supervised, as opposed to semi-supervised, learning [7,8]. Despite the promising performance of these computer-aided systems, there are still limitations related to study design (e.g., retrospective trials), technical issues (e.g., manual labeling of images and high cost), and efficiency (e.g., low computational efficiency).

The study presented in this paper aimed to develop an "intelligent agent" to detect and non-invasively characterize lung lesions on CT scans. Our goal was to apply a CNN for lung cancer identification on CT scans, inspired by the available literature; more importantly, we aimed to test the tool in a "real-world" setting. In greater detail, the project involved two main modules: the first addressed the detection of lung nodules on CT scans; the second dealt with the diagnosis (CADx) of each nodule into benign and malignant categories. The "intelligent agent" relied on deep learning techniques, which are described in the following sections.

#### **2. Materials and Methods**

#### *2.1. Study Design*

The study was a retrospective, single-institution trial.

We used public and local datasets to develop the CADe-CADx. CADe and CADx were independently developed. The study was approved by the institutional Ethics Committee.

#### *2.2. Datasets and Image Analysis*

This subsection provides details for both publicly available and local datasets for our CADe-CADx. Tables 1–3 set out lung abnormalities within the LUNA challenge dataset, CT scans used for CADs' development, and the number of nodules used for CADx.


**Table 1.** Lung abnormalities annotated within the LUNA challenge dataset.

**Table 2.** CT series datasets used for the CADs' development.


**Table 3.** The number of lung nodules included in each dataset used for the CADx development.


#### 2.2.1. LUNA Challenge Dataset

The open-source LUNA challenge dataset [9] and the local ICH\_s1 and ICH\_s2 datasets were used for the detection task.

The LUNA dataset consists of 805 series of diagnostic and lung cancer screening chest CT scans along with XML annotation files. Lung abnormalities have been annotated by four thoracic radiologists. Each abnormality is classified as a nodule or not, and annotated according to size, as detailed in Table 1.

The mask of the region of interest (ROI) for nodules of at least 3 mm was based on a 50% consensus criterion on four radiologists' segmentations.

#### 2.2.2. Local Datasets—ICH\_s1 and ICH\_s2

ICH\_s1 is a local dataset consisting of 1189 low-dose CT series. The images were independently analyzed by two expert chest radiologists, and all of the nodules were segmented on non-contrast-enhanced images regardless of size. ICH\_s2 consisted of 92 annotated lesions close to the mediastinum. The "ground truth" for the CADe was the segmentation performed by imagers (full concordance between radiologists). Collectively, local datasets included 1281 CT scans (441 with at least one nodule). The above-mentioned datasets were split into three subsets (training, validation, and test), as detailed in Table 2. Therefore, test set images for both ICH\_s1 and ICH\_s2 were used neither for training nor validation purposes.

The 234 test series from the ICH\_s1 dataset comprise 104 nodules. One nodule per series is present in the 19 test series from the ICH\_s2 dataset. Image segmentation and labelling were performed using a dedicated plug-in implemented for the 3D-slicer software tool (version 4.10.2, Slicer.org, Boston, MA, USA) [8].

#### 2.2.3. CADx—Datasets and Image Analysis

The local datasets, ICH\_x1 and ICH\_x2, were used for classification tasks. The ICH\_x1 subset comprised 349 low-dose CT images with nodules, with 29 confirmed to be malignant. The images were analyzed by an expert chest radiologist (CT), and all of the nodules were segmented on non-contrast-enhanced images regardless of size. There was a partial overlap between the series included in ICH\_s1 and ICH\_x1. The ICH\_x2 subset consists of 957 CT scans (all with at least one nodule) annotated by marking the lesion centroid. ICH\_x2 samples were annotated on non-contrast-enhanced images by experienced imagers (CT and MS). ICH\_x2 comprises any type of CT scan acquired at our institution, including co-registered images of PET/CT (*n* = 301), biopsy-guiding CT scans (*n* = 305), and diagnostic CT scans (*n* = 351). Collectively, 1346 nodules in 1306 CT scans were segmented and labelled. Radiological follow-up and pathology were used as reference standards in 350/1346 and 996/1346 cases, respectively (Table 3). Specifically, complete resolution of lung lesions was used as a radiological reference standard to define a nodule as benign. The final radiological diagnosis was used to classify 567/632 benign nodules. In the other 65/632 cases, benign nodules were pathologically confirmed. All malignant nodules were pathologically confirmed. Malignancy included primary lung cancer (adenocarcinoma = 392/714, squamous cell carcinoma = 113/714, carcinoid tumor = 31/714, and other = 35/714) and lung metastases (*n* = 133/714). In ten patients, the primary lung tumor subtype was not specified. The final diagnosis was collected from electronic medical records. Image segmentation and labelling were performed using a dedicated plug-in implemented with the 3D slicer tool.

#### *2.3. CADe and CADx Architectures*

As briefly mentioned in the previous sections, deep learning paradigms are behind the proposed CADe and CADx systems. One of the main challenges in our work was to test the effectiveness of deep learning architectures in real scenarios accounting for several variables, such as different CT devices, images with different spatial resolutions, and device fingerprints.

Due to the different nature of detection and diagnosis tasks, we opted for two different deep neural network architectures. CADe relies on pixel-wise segmentation to reveal whether a pixel is part of a lung lesion. To this end, it is necessary to obtain a full-resolution output binary mask to retrieve both the coordinates and the region of the lung lesion.

Conversely, CADx focuses on the final diagnosis of a given lung lesion. The system is meant to return a label indicating 'benign nodule' or 'malignant nodule'. Thus, the system does not need to return a full-resolution output mask; only an output label is needed. The following two subsections provide further technical details regarding the two different architectures for CADe and CADx.

Furthermore, it is necessary to point out that deep learning networks must ingest many images to deliver a model with knowledge inference and generalization capabilities for a specific domain task. Biomedical image analysis is afflicted by a dimensionality problem due to the lack of manually annotated data: the amount of hand-labelled data is rarely large enough to train a deep neural network from scratch.

That is where data augmentation comes into play; applying image transformations without altering the meaningful content of the image itself makes a given dataset bigger in size by generating new samples. Examples of primary data augmentation are the following: flipping, mirroring, rotation, translation, and scaling.
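As a concrete illustration of these basic augmentations, the following sketch applies random flips, rotations, and translations to a 2-D CT slice; the parameter ranges are illustrative assumptions (scaling is omitted for brevity).

```python
# Minimal augmentation sketch for a 2-D CT slice stored as a NumPy array.
import numpy as np
from scipy.ndimage import rotate, shift

def augment_slice(img, rng=np.random.default_rng()):
    if rng.random() < 0.5:                        # flipping / mirroring
        img = np.flip(img, axis=int(rng.integers(2)))
    angle = rng.uniform(-15, 15)                  # small random rotation (degrees)
    img = rotate(img, angle, reshape=False, mode='nearest')
    dy, dx = rng.uniform(-5, 5, size=2)           # translation in pixels
    img = shift(img, (dy, dx), mode='nearest')
    return img
```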

In the following two subsections, a further description of the deep learning techniques for CADe and CADx tasks is given.

#### 2.3.1. CADe Architecture and Development

The main goal of a CADe system is to return a full-resolution mask highlighting the suggested regions of interest for a given input image. That is why we opted for a fully convolutional neural network (FCNN) architecture. CADe tasks are, therefore, accomplished in a pixel-wise manner to extract information related to both the ROI (region of interest) and the corresponding targets. An FCNN can return a full-resolution mask for a given input image: it ingests an input image of size M × N and returns an output mask with the exact same dimensions. This makes it suitable for critical biomedical image analysis tasks, such as segmentation and detection.

One of the most popular and cited FCNNs for biomedical image segmentation is the so-called U-Net [10], which owes its name to the U-shape of the network architecture. In this section, we provide readers with an overall description of U-Net, including the main layers and operations throughout the network. For the sake of clarity, we do not address the most complex mathematical concepts and instead point readers toward the reference articles for further details [10].

The overall U-Net architecture is depicted in Figure 1. The encoder is responsible for extracting hidden information within the pixel domain. This is achieved with a stack of filters that down-sample the input images. In simpler terms, the network architecture is organized in levels, with each level consisting of two Conv (convolutional) layers, each followed by a ReLU (rectified linear unit), and a max pooling layer characterized by a 'stride' parameter that tunes the down-sampling factor for the input image.

**Figure 1.** U-Net architecture.

All of the encoder levels are meant to extract the most meaningful features from the input images, all the way down to the network's bottom level. Each level returns outputs through feature maps (or channels), which represent intermediate stages of the network layers and feed the following level in the stack. From a graphical viewpoint, blue rectangles indicate the input, feature maps, and output of the network. Going through consecutive layers of the encoder, it is noticeable how the rectangles change in size, turning into shorter but wider blue rectangles. This is a descriptive representation of what happens inside the network: convolutional layers work as image feature extractors; ReLU is an activation function whose primary role is to give neural networks non-linear representation capabilities for more accurate results; and max pooling is a "pooling" operator that extracts the maximum value from image patches, yielding down-sampled feature maps.

Purple downward arrows in Figure 1 show max pooling coming into play, while orange arrows represent the sequence Conv + ReLU. The encoder is responsible for extracting "what" is in the images, while the decoder deals with the "where".

The features extracted by the contracting path are then progressively reconstructed by the expanding path (decoder), with layers consisting of transpose convolution (deconvolution), Conv + ReLU, and a final Conv. Transpose convolution up-samples the feature maps from the previous layers; Conv + ReLU are then applied in combination with skip connections to refine the results at each level. Skip connections help to retrieve missing information from the encoder feature maps standing at the same level. The top left corner of the network returns a segmentation map through a 1 × 1 convolutional layer, which can return labels in a pixel-wise fashion.
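To make the encoder/decoder pattern concrete, here is a compact PyTorch sketch of a two-level U-Net with the ingredients described above (paired Conv + ReLU layers, max pooling, transpose convolutions, skip connections, and a final 1 × 1 convolution); the channel counts are illustrative and much smaller than those of the network in Figure 1.

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # two Conv + ReLU pairs, as in each U-Net level
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)                        # stride-2 down-sampling
        self.bottom = double_conv(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)        # 4x input: skip concat
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)          # 1x1 conv: pixel labels

    def forward(self, x):                                  # H, W divisible by 4
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                                  # (B, n_classes, H, W)
```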

The network employed for our CADe, namely, Retina U-Net [11], is a variant of two pre-existing networks, Retina Net [12] and U-Net [10].

#### 2.3.2. Retina U-Net

Retina U-Net [11] integrates elements from Retina Net and U-Net to combine object detection and semantic segmentation. Taking after most state-of-the-art object detectors, Retina U-Net complements the U-Net architecture by introducing object-level predictions through feature pyramid networks (FPNs) [13]. FPNs are feature extractors with bottom-up and top-down paths. The overall Retina U-Net architecture is graphically represented in Figure 2. The overall pipeline is mainly characterized by FPNs, coarse feature detectors, skip connections, Conv + Softmax, and Conv + ReLU + MaxPool stacks.

**Figure 2.** Architecture of the Retina U-Net neural network used to segment lung nodules in CT scans. The number to the left of each layer represents the number of output channels.

Coarse feature detectors, indicated by red rectangles in Figure 2, are responsible for detecting small objects using sub-network operations such as the so-called bounding box regressor (a well-known object detection technique) [14]. Skip connections support the network in retrieving missing information from the encoder feature maps standing at the same level. The Conv + ReLU + MaxPool stack consists of a convolutional filter, a rectified linear unit function, and a max pooling filter. These are key to the contracting path of the FCNN, as the Conv and MaxPool filters down-sample the input feature map, while the ReLU allows the network to generalize and infer knowledge from non-linear input (as it is a piecewise linear function).

Conv + SoftMax consists of a sequence of a convolutional filter and a SoftMax function returning a probability map for every possible class to be detected in the images. The Up-pool and Deconv layers are responsible for the image reconstruction starting from the network bottleneck (the bottom layer in the U-shaped architecture).

In this work, Retina U-Net was implemented to segment lung nodules. It comprises 6 levels in the contracting path (see Figure 2), with 18 feature maps in the first level and 576 in the deepest one. In the expansive path, on the other hand, the number of channels is halved at each of the first 4 levels, starting from 576, and is then kept at 18 in the last 2 upper levels, consistent with the contracting path.

#### 2.3.3. CADx Architecture and Development

The neural network architecture adopted to classify lung nodules is a convolutional neural network (CNN) adapted from [15] (Figure 3).

The CNN consists of several layers responsible for feature extraction (four convolutional blocks) and classification (three fully connected layers and a SoftMax layer). The SoftMax function returns probability values for a given lung lesion, which is then classified as benign or malignant.

In Figure 3, the CNN layers are grouped into three blocks: the convolutional block, linear block, and SoftMax layer.

The convolution block consists of a convolutional layer, a ReLU (rectified linear unit), and 2D dropout. Unlike the FCNN, the CNN has no expanding path because it is not designed to return full-resolution images; its output labels relate to the classification task. As can be seen in Figure 3, a stack of convolution blocks down-samples the input image (CT scan) into feature maps that are subsequently ingested by a linear block. The latter consists of fully connected layers, which are paramount to the classification task and ingest high-level features from the down-sampled feature maps of the previous layers. The last layer is characterized by the SoftMax function, which returns probability values for the input belonging to the category of interest.
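A hedged PyTorch sketch of this CADx pattern follows: four convolution blocks, a linear block of three fully connected layers, and a final SoftMax. The use of 3-D convolutions and all layer sizes are assumptions keyed to the 8 × 100 × 100 input volumes described below, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # convolutional layer + ReLU + dropout, pooling only in-plane
    return nn.Sequential(nn.Conv3d(c_in, c_out, 3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.Dropout3d(0.2),
                         nn.MaxPool3d((1, 2, 2)))

class NoduleClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(conv_block(1, 8), conv_block(8, 16),
                                      conv_block(16, 32), conv_block(32, 64))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 6 * 6, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, n_classes))

    def forward(self, x):                    # x: (B, 1, 8, 100, 100)
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)  # benign vs. malignant probabilities
```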

Training was performed using an equally balanced cross-entropy loss and the Adam optimizer. Each series was preprocessed to extract the pixels belonging to lung nodules: the series was multiplied by the binary segmentation mask of each nodule.

As a result, any pixel not belonging to a lung nodule is treated as a background pixel. In the inference phase, the binary mask of each nodule is produced by the segmentation network described in the previous section and then fed to the CNN.
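A minimal sketch of this masking step, with illustrative array shapes:

```python
# Every voxel outside the binary nodule mask is zeroed, i.e., treated as background.
import numpy as np

series = np.random.rand(8, 100, 100).astype(np.float32)  # cropped CT sub-volume
mask = np.random.rand(8, 100, 100) > 0.9                 # binary nodule segmentation

masked_input = series * mask  # element-wise product keeps only nodule voxels
```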

The input volumes are centrally cropped around the lesion to a target size of eight slices, each with a 100 × 100-pixel mask. During training, image augmentation is performed by applying random rotations, flipping, and brightness variation. This step increases the effective size of the training set and helps prevent the resulting model from overfitting.
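As a hedged example, the listed augmentations could be implemented slice-wise with standard torchvision transforms; the parameter ranges below are assumptions.

```python
# Random rotation, flipping, and brightness variation applied to one CT slice.
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=10),   # small rotations avoid distorting lesion shape
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.ColorJitter(brightness=0.2),  # brightness variation
])

slice_img = torch.rand(1, 100, 100)  # one single-channel 100 x 100 slice
augmented = augment(slice_img)
```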

As shown in Figure 3, the last layer of the network stack is a SoftMax function, which returns probability values. The likelihood value is then used to derive the classification target, which is the network output.

The following section focuses on the system infrastructure and depicts the healthcare scenario we adopted in this study.

#### **3. System Infrastructure**

DICOM series identified from the institutional PACS as chest CT scans acquired and stored according to good clinical practice were retrieved and downloaded from the PACS AI Invariant. Data were stored anonymously in this layer to comply with privacy requirements. Each series retrieved from the PACS AI Invariant was added daily to a DICOM series queue and preprocessed in cascade by the neural networks previously described. The Invariant AI Runtime module (Figure 4) was used to run the models. The results were then transferred back to the PACS AI Invariant to be processed, consulted, and displayed on radiological workstations. The model results and the manual annotations performed using the 3D-slicer plugin were stored in the PACS AI Invariant and in a data warehouse.

**Figure 4.** System infrastructure components: PACS Humanitas, PACS AI Invariant, Invariant AI Runtime, Data Warehouse, Radiomix Station, and Radiological Workstation.
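Schematically, and with purely hypothetical object names (the Invariant API is not public), the daily queue processing can be summarized as:

```python
# Hypothetical sketch of the cascade: anonymized series are queued, segmented
# by the CADe, classified by the CADx, and the results pushed back to the PACS.
def process_daily_queue(queue, cade, cadx, pacs):
    for series in queue:
        masks = cade.segment(series)                        # nodule segmentation
        labels = [cadx.classify(series, m) for m in masks]  # benign/malignant
        pacs.store(series.uid, masks, labels)               # back to PACS AI Invariant
```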

#### **4. Metrics**

The detection rate was computed to evaluate the performance of the CADe, while accuracy, specificity, and sensitivity were computed for the CADx. Specifically, the "ground truth" for the CADe was the segmentation performed by the imagers (complete concordance between the imagers). The detection rate was calculated as the ratio between the number of nodules correctly identified by the CADe and the total number of nodules segmented by the imagers. The Dice score was calculated to compare the CADe's and the imagers' segmentations. The final diagnosis (radiological follow-up or pathology) represented the reference standard to evaluate the CADx's performance. Accordingly, each CADx prediction was classified as a true positive, true negative, false positive, or false negative. A confidence analysis was used to evaluate the distribution of the probability of each predicted nodule belonging to its class. The abovementioned metrics were calculated for the training, validation, and test sets.
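Under these definitions, the metrics can be sketched as follows (function and variable names are illustrative):

```python
# Hedged sketch of the evaluation metrics described above.
import numpy as np

def detection_rate(n_detected: int, n_ground_truth: int) -> float:
    """Fraction of imager-segmented nodules found by the CADe."""
    return n_detected / n_ground_truth

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Overlap between the CADe and imager binary segmentation masks."""
    intersection = np.logical_and(pred, gt).sum()
    return 2 * intersection / (pred.sum() + gt.sum())

def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, sensitivity, and specificity from CADx confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```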

#### **5. Results**

As mentioned above, the CADe and the CADx were independently developed, trained, and tested. The results of the CADx (i.e., classification) were not dependent on the CADe's predictions (i.e., segmentation). We report the performance obtained on the test set.

#### *5.1. CADe*

The CADe correctly identified 96/123 nodules (78%) and missed 27/123 nodules. Specifically, 90/104 and 6/19 nodules of the ICH\_s1 and ICH\_s2 datasets, respectively, were detected correctly. Failures mainly involved ground glass opacities (*n* = 6) and very small or very large nodules close to vessels, the pleura (Figure 5), and/or the mediastinum (*n* = 6, *n* = 9, and *n* = 4, respectively).

**Figure 5.** Example of a nodule close to the pleura in the right lung correctly predicted by the CADe and of a small nodule, near the previous one, missed by the CADe. Left panel: axial CT slice with prediction (yellow) and/or mask (pink); right panel: original CT image.

An average of 10.84 nodules per series was falsely identified. The number of false positives dropped to 6.5 per series when nodules smaller than 3 mm were excluded.
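A hedged sketch of this size-based filtering, using made-up per-series false-positive diameters:

```python
# Excluding false findings below 3 mm lowers the per-series false-positive rate.
fp_sizes_per_series = [[2.1, 4.0, 2.8], [5.2], [2.5, 3.6, 2.9, 7.0]]  # mm, illustrative

n_series = len(fp_sizes_per_series)
fp_all = sum(len(s) for s in fp_sizes_per_series) / n_series
fp_ge_3mm = sum(sum(1 for d in s if d >= 3) for s in fp_sizes_per_series) / n_series
print(fp_all, fp_ge_3mm)  # average false positives per series, before/after filtering
```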

#### *5.2. CADx*

The CADx correctly classified 109/136 nodules (43 true negatives and 66 true positives) and failed to classify 27 nodules (11 false negatives and 16 false positives, Figures 6 and 7). The size of the wrongly classified nodules was between 3 and 6 mm in 7/27 cases (6/7 solid and all falsely classified as benign), greater than 6 mm but smaller than 8 mm in 5/27 cases (3/5 solid and 2/5 falsely classified as malignant), between 8 and 10 mm in 2/27 cases (both ground glass opacities resulting in false positives), bigger than 10 mm but less than 15 mm in 2/27 cases (both solid, one resulting in a false negative and one in a false positive), between 15 and 25 mm in 9/27 cases, and greater than 25 mm in the remaining 2/27 cases. Specifically, false negative nodules were small, with a median size of 4.85 mm (range 3–11.3 mm), and solid in the majority of cases (8/11). Considering only solid nodules, the median size of lesions falsely classified as negative was 4.7 mm (range 3–11.2 mm). Three ground glass opacities (median size 7 mm, range 3.3–7 mm) were wrongly classified as benign. False positive nodules were relatively large, with a median size of 20 mm (range 7.2–55 mm). Nodules wrongly classified as malignant were mainly solid (10/16), with a median size of 22 mm (range 7.2–55 mm). Considering only this class (i.e., solid nodules resulting in false positives), a consistent number of nodules (7/10) were bigger than 15 mm. The other false positive results were ground glass opacities (*n* = 3/16) and part-solid nodules (3/16), with median sizes of 10 mm (range 9–15 mm) and 23 mm (range 20–23 mm), respectively.

Our CADx system achieved an 80% accuracy rate. The sensitivity and specificity rates were equal to 85.7% and 73%, respectively.
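These rates follow directly from the confusion counts reported above, as this quick check shows:

```python
# 66 TP, 43 TN, 16 FP, and 11 FN, as reported in the text.
tp, tn, fp, fn = 66, 43, 16, 11
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 109/136 ~ 0.801
sensitivity = tp / (tp + fn)                # 66/77  ~ 0.857
specificity = tn / (tn + fp)                # 43/59  ~ 0.729
print(f"{accuracy:.1%} {sensitivity:.1%} {specificity:.1%}")  # 80.1% 85.7% 72.9%
```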

The graphs in Figure 8 show that the probability of each predicted nodule belonging to its class was similar for correctly classified lesions and for nodules misclassified as benign (mean = 0.84, standard deviation = 0.09, and mean = 0.84, standard deviation = 0.10, respectively). In contrast, the mean confidence of the CAD for lung lesions incorrectly predicted as malignant was lower (mean = 0.72, standard deviation = 0.08, Figure 7).

**Figure 6.** Example of a solid nodule of 6.3 mm (inside the red box) wrongly predicted as benign by the CADx.

**Figure 7.** Example of a lesion of 55 mm (in yellow) wrongly predicted as malignant by the CADx.

**Figure 8.** Confidence mean and standard deviation for correct classifications, false benign nodules, and false malignant nodules.

#### **6. Discussion**

We developed an "intelligent agent" to detect and non-invasively characterize lung lesions using any type of CT scan. Big nodules detected incidentally are typically not a challenge for clinicians, since their size and radiological characteristics rarely leave room for doubt. In contrast, nodules smaller than 1 cm may be uncertain and difficult to characterize. In this setting, based on patient risk assessment (low versus high), nodule number (solitary versus multiple), pattern (solid, part-solid, and ground glass), and nodule size, radiological follow-up, [18F]FDG PET/CT, and biopsy are recommended [3]. However, these actions might not be feasible and/or can yield inconclusive results. Therefore, a tool able to correctly classify at least small nodules (3–8 mm) as benign or malignant addresses an unmet clinical need. As mentioned, our CADe missed some nodules (22%), mainly ground glass opacities or nodules close to vessels, the pleura, or the mediastinum. Notably, all misclassified nodules of 15 mm or greater resulted in false positives, while the majority of misclassified nodules smaller than 10 mm (77%) resulted in false negatives. Collectively, our CADx was more sensitive than specific and wrongly classified 20% of nodules (8% as false negatives and 12% as false positives).

Performant algorithms capable of detecting lung lesions and discriminating benign from malignant nodules with great accuracy have been described [4,6]. Our CADe and CADx exhibited lower accuracy for the detection and classification tasks (78% and 80%, respectively) than the algorithms reported in the literature (up to 95% [6] and 96% [4], respectively). Our CADe missed some ground glass opacities and nodules close to vessels, the pleura, and/or the mediastinum. Similar failures have been reported for deep-learning-based algorithms in the literature [16]. Nonetheless, our CADe-CADx benefitted in some respects. Firstly, they were developed and tested using a local dataset of real-scenario data including different types of CT images (co-registered CT from PET/CT = 23%, biopsy-guiding CT scans = 23%, low-dose CT = 27%, and fully diagnostic CT = 27%). The performance achieved on highly selected and homogeneous datasets may lead to overestimated model reliability. Therefore, continuous "real-world" re-validation is necessary for the clinical implementation of DL-based tools.

Secondly, our dataset consists of well-balanced classes of benign and malignant nodules (47% and 53%, respectively). Thirdly, the final diagnosis did not rely on subjective interpretative criteria to assess malignancy risk.

Conversely, we used pathology or a rigorous radiological criterion to determine whether a nodule was benign or malignant (approximately 60% and 40% of cases, respectively). Several deep-learning-based algorithms developed to detect and classify lung nodules relied on public datasets consisting of low-dose CT images collected within lung screening programs [4,6], which deal with a low prevalence of relatively small nodules. Many publicly available databases use the malignancy risk assessed by expert imagers as the "ground truth" [17–19]. Nonetheless, the latter has recently been shown to affect CADx reliability and performance [16]. Moreover, in many experiments, malignant nodules accounted for approximately one-third of the total number of nodules [20–22], potentially causing overfitting and ultimately affecting the model's reliability. Lastly, malignancy in our datasets comprised primary lung tumors and lung metastases (81% and 19%, respectively). The patterns recognized by CNNs have shown similarities to typical image-feature-based learning [23]. Still, different imaging-based features have been reported for primary lung tumors and metastases [24], suggesting specific histology-based descriptors.

On the one hand, all these factors, although theoretically positive, generated a widely heterogeneous dataset that was analyzed using the gold standard as a reference, which possibly explains why our tool was less performant than those reported in the literature. On the other hand, the greater heterogeneity of the dataset reduced overfitting and improved the generalizability of the CADe-CADx in the "real world". Therefore, we can realistically consider our CADx a tool, albeit one to be further improved, for a "virtual biopsy". It could prove worthwhile in several circumstances, including, among others, lung nodules of undetermined significance. Giles et al. [25] reported that lung nodules of unknown significance were malignant in 86% of cases. Notably, in this series of 500 surgically treated patients, the percentage of lung metastases was not negligible with respect to the total number of malignant lesions (22% metastases versus 78% primary lung tumors) [25], thus underlining the potential additional value of our CADx. Moreover, synchronous and metachronous tumors incidentally detected during staging or follow-up examinations have increased [26], making it imperative to exclude malignancy in a patient with a newly diagnosed lung nodule and a history of cancer.

Despite the abovementioned positive aspects, this study also presented some limitations. Firstly, the CADe and CADx were independently developed, and the presented results refer to the detection and classification tasks separately. The next step will be to test the end-to-end tool on independent data. Furthermore, the algorithm architectures used for the CADe-CADx were modified from pre-existing neural networks. That is common in real-scenario-oriented deep learning, where methodological and theoretical contributions are fewer and the novelty often lies in applying pre-existing deep learning techniques to new scenarios and research fields through context-based modifications.

The considerations above raise a crucial point concerning the reliability of so-called supervised deep learning for certain tasks. Two main questions arise from our experimental results: Can CNNs and FCNNs be considered reliable tools for CADe and CADx? Is the supervised learning paradigm gradually going to be left behind in favor of semi- or self-supervised deep learning architectures?

The adopted paradigm might not be the most suitable for a scenario with several constraints, namely images with different spatial resolutions and various sensors' fingerprints. The latest progress in AI sees new architectures reliant on self-supervised learning, moving toward AGI (artificial general intelligence) and capable of inferring hidden properties from the input data before being fine-tuned on a specific target with only a limited number of annotated samples. The results bring up other aspects that deserve further investigation. For example, our experimental campaign used only basic data augmentation, to prevent lung lesion shape distortion. Nonetheless, more advanced augmentation techniques based on generative deep learning, such as GANs (generative adversarial networks), appear promising for providing datasets with many more samples to be reused for training purposes.

All that said, as for other domains of image pattern recognition (e.g., animal photos) [27], we are convinced that sophisticated algorithms alone are insufficient in the setting of "real-world" data, and that a huge number of observations (a million? a billion?) is needed to reach satisfactory results in terms of sensitivity and specificity. Moreover, we should keep in mind that our final goal is to develop a tool able to reach 100% accuracy, since even a single misclassified case is a misdiagnosed patient.

#### **7. Conclusions**

We have presented a specific case study on the detection and classification of lung lesions on CT scans to test the effectiveness of two of the most popular deep learning architectures, the FCNN and the CNN. To this end, we employed data from datasets with different features and specifications. The first was the LUNA16 Challenge dataset; the second consisted of images locally acquired and labelled. Furthermore, the CT scans were acquired with different scanners, bringing the case study close to real scenarios, in which information about the sensors generating the images undergoing CADe and CADx checks is likely unknown. The experimental campaign confirmed the promise of these approaches for automated lung nodule assessment on CT, alongside some drawbacks of the supervised learning paradigm employed in networks such as the CNN and Retina U-Net in real-world clinical scenarios, with CT scans from different devices with different sensors' fingerprints. Collectively, we showed that these tools, although promising, are not yet "mature" enough to successfully analyze "real-world" data and be implemented in clinical practice.

**Author Contributions:** Conceptualization, M.S., A.C. and M.K.; study design, M.S., M.K. and A.C.; patient screening, identification, and management, M.A., L.B. and E.V.; data analysis M.S., M.K., N.G. and C.T.; writing—original draft preparation, M.S., M.K. and N.G.; writing—review and editing, A.B. and A.C.; supervision, A.C.; project administration, A.C.; funding acquisition M.A. and M.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Lega Italiana per la Lotta contro i Tumori (LILT) "Bando di ricerca sanitaria LILT 2017 Prof. Umberto Veronesi Bando 5xmille anno 2015" and by the "Bando 5xmille Ministero della Salute 2019" assigned to Arturo Chiti from the IRCCS Humanitas.

**Institutional Review Board Statement:** The study was approved by the institutional Ethics Committee of IRCCS Humanitas Research Hospital (approval number 3/18, on 17 April 2018, protocol ADVIMAG-Thorax). All of the procedures performed in the studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and in accordance with the principles of the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

**Informed Consent Statement:** In view of the retrospective study design, informed consent was waived.

**Data Availability Statement:** Four local and private datasets, namely, ICH\_s1, ICH\_s2, ICH\_x1, and ICH\_x2, were collected and labelled in IRCCS Humanitas. They are stored in the institutional archive; they are available from the corresponding author on reasonable request. The LUNA Challenge dataset is publicly available [9].

**Acknowledgments:** Luca Antiga, Federica Moretti, Pietro Rota, and Gerlando Scibetta from Orobix SpA are acknowledged for the development of algorithms. The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health and their critical role in the creation of the free, publicly available LIDC/IDRI Database (LUNA Challenge dataset) used in this study.

**Conflicts of Interest:** Chiti reports unrestricted grants from Novartis, personal fees from AAA, Blue Earth Diagnostics, and General Electric Healthcare, outside of the submitted work. The other authors do not report any conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
