# **Machine Learning/Deep Learning in Medical Image Processing**

Printed Edition of the Special Issue Published in *Applied Sciences*

Edited by

Mizuho Nishio

www.mdpi.com/journal/applsci

## **Machine Learning/Deep Learning in Medical Image Processing**


Editor

**Mizuho Nishio**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor:* Mizuho Nishio, Kobe University and Kyoto University, Japan

*Editorial Office:* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Applied Sciences* (ISSN 2076-3417) (available at: https://www.mdpi.com/journal/applsci/special_issues/ML_Medical_Image).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-2664-5 (Hbk) ISBN 978-3-0365-2665-2 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **About the Editor**

**Mizuho Nishio** currently serves as a program-specific assistant professor at the Department of Radiology, Kobe University Hospital, Kobe, Japan. He also works at the Department of Diagnostic Imaging and Nuclear Medicine, Kyoto University Hospital, Kyoto, Japan. His research area is the application of machine learning/deep learning to medical image analysis. Recently, he has focused on the automatic diagnosis of COVID-19 from chest X-ray images using deep learning. Dr. Nishio earned his medical degree from the Kobe University School of Medicine. He completed his radiology residency at Nishi-Kobe Medical Center in Kobe, where he received the hospital's award for the installation of a picture archiving and communication system in the Department of Radiology. He obtained his Ph.D. from the Kobe University Graduate School of Medicine and subsequently served as a program-specific assistant professor at the Department of Radiology, Kobe University Graduate School of Medicine, and at Kyoto University Hospital.

## *Editorial* **Special Issue on "Machine Learning/Deep Learning in Medical Image Processing"**

**Mizuho Nishio**

Department of Diagnostic Imaging and Nuclear Medicine, Kyoto University Graduate School of Medicine, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan; nishiomizuho@gmail.com or nishio.mizuho.3e@kyoto-u.jp; Tel.: +81-75-751-3760; Fax: +81-75-771-9709

Many recent studies on medical image processing have involved the use of machine learning (ML) and deep learning (DL) [1,2]. In ML, handcrafted features are frequently extracted from medical images to capture useful information; however, this process may prevent the full information content of the images from being utilized. In contrast to ML, DL does not require such feature extraction, and it outperforms the combination of ML and handcrafted features in computer vision [3]. Therefore, DL has been used more frequently in recent medical image studies.

This special issue, "Machine Learning/Deep Learning in Medical Image Processing", was launched to provide an opportunity for researchers in medical image processing to highlight recent developments in their fields with ML/DL. Seven excellent papers covering a wide variety of medical/clinical aspects were selected for this special issue [4–10]. Of these, four papers are related to radiology (computed tomography (CT) and nuclear medicine) and two to pathology (prostate carcinoma and oral squamous cell carcinoma). The seven papers are summarized as follows:


**Citation:** Nishio, M. Special Issue on "Machine Learning/Deep Learning in Medical Image Processing". *Appl. Sci.* **2021**, *11*, 11483. https://doi.org/10.3390/app112311483

Received: 5 November 2021 Accepted: 30 November 2021 Published: 3 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

These seven papers should greatly benefit readers across various aspects of medical image processing. I believe that this special issue contains a series of excellent research works on medical image processing with ML and DL.

**Funding:** JSPS KAKENHI (grant numbers 19H03599 and JP19K17232).

**Institutional Review Board Statement:** Institutional review board approval was not required.

**Informed Consent Statement:** Informed consent was not required.

**Data Availability Statement:** No data are used in this editorial.

**Acknowledgments:** I would like to thank the authors of the seven papers, the many reviewers, and the editorial team of *Applied Sciences*—especially the Assistant Managing Editor, Seraina Shi—for their valuable contributions towards making this special issue a success. For writing this editorial, the author was partly supported by JSPS KAKENHI (grant number: 19H03599 and JP19K17232).

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **Automatic Diagnosis of Coronary Artery Disease in SPECT Myocardial Perfusion Imaging Employing Deep Learning**

**Nikolaos Papandrianos \* and Elpiniki Papageorgiou**

Department of Energy Systems, Faculty of Technology, University of Thessaly, Geopolis Campus, Larissa-Trikala Ring Road, 41500 Larissa, Greece; elpinikipapageorgiou@uth.gr

**\*** Correspondence: npapandrianos@uth.gr

**Abstract:** Focusing on coronary artery disease (CAD) patients, this research paper addresses the problem of the automatic diagnosis of ischemia or infarction from single-photon emission computed tomography (SPECT) (Siemens Symbia S Series) myocardial perfusion imaging (MPI) scans and investigates the capabilities of deep learning and convolutional neural networks (CNNs). Considering the wide applicability of deep learning in medical image classification, a robust CNN model, whose architecture was previously determined for nuclear image analysis, is introduced to recognize myocardial perfusion images by extracting the insightful features of an image and using them to classify it correctly. In addition, a deep learning classification approach using transfer learning is implemented to classify cardiovascular images from SPECT MPI scans as normal or abnormal (ischemia or infarction). The present work differs from other studies in nuclear cardiology in that it utilizes SPECT MPI images directly. To address the two-class classification problem of CAD diagnosis with adequate accuracy, simple, fast and efficient CNN architectures were built through a CNN exploration process and then employed to identify the category of CAD diagnosis, demonstrating their generalization capabilities. The results revealed that the applied methods are sufficiently accurate and able to differentiate infarction or ischemia from healthy patients (overall classification accuracy = 93.47% ± 2.81%, AUC score = 0.936). To strengthen these findings, the proposed deep learning approaches were compared with other popular state-of-the-art CNN architectures on the same dataset. The prediction results show the efficacy of the new deep learning architecture for CAD diagnosis using SPECT MPI scans over existing approaches in nuclear medicine.

**Keywords:** coronary artery disease; SPECT MPI scans; deep learning; convolutional neural networks; transfer learning; classification models

#### **1. Introduction**

Coronary artery disease (CAD) is one of the most frequent pathological conditions and the primary cause of death worldwide [1]. Characterized by its inflammatory nature [2], CAD is an atherosclerotic disease that usually develops through the interaction of genetic and environmental factors [3] and leads to cardiovascular events including stable angina, unstable angina, myocardial infarction (MI) and sudden cardiac death [4]. Coronary heart disease (CHD) has a significant impact on mortality and morbidity in Europe, and its management requires a large proportion of national healthcare budgets. Thus, an accurate diagnosis of CAD (ischemia, infarction, etc.) is crucial on a socioeconomic level. CAD diagnosis usually requires suitable diagnostic imaging [5,6]. In this direction, non-invasive imaging techniques are the preferred methods for diagnosing CAD, prognostication, selection for revascularization and assessing acute coronary syndromes [7]. Even though they raise the direct expenditure on investigation, they are likely to reduce overall costs, leading to greater cost-effectiveness [7].

To achieve a reliable and cost-effective CAD diagnosis, a variety of modern imaging techniques such as single-photon emission computed tomography (SPECT) myocardial perfusion imaging (MPI), positron emission tomography (PET) and cardiovascular computed tomography (CT) have been utilized in clinical practice [8–12]. According to the EANM guidelines [13], radionuclide MPI with the contribution of SPECT imaging is a remarkably efficient technique for CAD diagnosis [14,15].

**Citation:** Papandrianos, N.; Papageorgiou, E. Automatic Diagnosis of Coronary Artery Disease in SPECT Myocardial Perfusion Imaging Employing Deep Learning. *Appl. Sci.* **2021**, *11*, 6362. https://doi.org/10.3390/app11146362

Academic Editor: Mizuho Nishio

Received: 20 May 2021 Accepted: 7 July 2021 Published: 9 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Myocardial perfusion scintigraphy (MPS) is a well-established, non-invasive imaging technique proven effective in diagnosing angina and myocardial infarction. Specifically, SPECT MPI depicts the three-dimensional distribution of a radioactive compound within the heart and is considered the most frequently performed procedure in nuclear cardiology [16]. Among other uses, it serves to predict future CAD events and to identify the severity of coronary artery disease. Regarding the detection of myocardial ischemia, MPS outperforms the ECG in terms of accuracy [7,17]. Furthermore, applying the MPS imaging method reduces the number of angiographies, supporting proper treatment planning [18].

In the discipline of cardiovascular imaging using nuclear cardiology techniques, Slart et al. explored the deployment of artificial intelligence (AI) based on modern machine learning, focusing on the methods and computational models currently used [19]. The goal is to enhance diagnostic performance through complex image analysis and interpretation [20]. However, image interpretation is a demanding and time-consuming task which relies mainly on physicians' experience [21]. On this basis, interpretation can be standardized with the contribution of computer-aided diagnosis tools, providing enhanced overall objectivity. Diagnostic accuracy is also improved, whereas diagnostic time and healthcare costs are significantly reduced. Since an extensive, standardized imaging database already exists in the realm of nuclear cardiac imaging, AI is a natural candidate for this domain [22]. More specifically, AI is currently spreading throughout three main areas related to cardiac SPECT/CT and PET/CT imaging: the automation of image detection and segmentation, the identification of patients suffering from obstructive coronary artery disease (CAD), and the risk assessment of coronary syndromes [23]. Overall, computer-aided diagnosis can serve as a supportive tool that assists not only in the recognition of unusual and difficult medical cases but also in training inexperienced clinical staff [23].

Computer-aided diagnosis is well served by the deployment of machine learning, including deep learning algorithms, which possess an extraordinary capability for interpretation in the realm of medical image analysis [24–27]. Acknowledging that deep learning has shown remarkable efficacy in visual object detection and classification, researchers are highly intrigued by the potential of deep learning tools to improve the accuracy of CAD classification, helping nuclear physicians in this direction [28–30].

#### *1.1. Deep Learning in Image Analysis*

In the domain of medical imaging, deep learning is mainly implemented by convolutional neural networks (CNNs) [25,31–33], which are considered an efficient and dynamic approach for extracting features for image classification and segmentation tasks. Before CNNs, feature extraction had to be accomplished with less capable machine learning models or by hand; after their entrance into medical imaging, CNNs could learn these features directly from the data. Among the marked characteristics of CNNs is their capability of analyzing and classifying images, making them powerful deep learning models for image analysis [24,25,32,34]. A CNN resembles a standard artificial neural network (ANN) in its characteristics, utilizing backpropagation and gradient descent for training; however, its structure additionally contains convolutional and pooling layers.

Among the most common CNN architectures in medical image analysis are:

AlexNet (2012): Developed by Alex Krizhevsky et al. [35], this network has a similar architecture to LeNet [36] but consists of more filters per layer, including stacked convolutional layers. Its specifications include 11 × 11, 5 × 5 and 3 × 3 convolutions, max pooling, dropout, data augmentation, rectified linear unit (ReLU) activations and stochastic gradient descent (SGD) with momentum [37]. It attaches ReLU activations after each convolutional and fully connected layer. The deep learning boom is largely attributed to AlexNet, which won the 2012 ILSVRC competition by a considerable margin. Features worth mentioning are the computational split across multiple GPUs, dropout regularization, data augmentation and the ReLU activation function.

ZFNet (2013): This network constitutes a minor adaptation of AlexNet and won the 2013 ILSVRC competition [38].

VGGNet16 (2014): This network incorporates 16 weight layers (13 convolutional and 3 fully connected) and is popular due to its consistent structure [39]. Similar to AlexNet, this model uses only 3 × 3 convolutions but comprises many filters. Currently, VGGNet16 is among the most preferred choices in the community for feature extraction from images. At the same time, it popularized the notion of creating deeper networks by using smaller filter kernels [40].

GoogLeNet (2014): This network involves stacked convolutional layers and one or more fully connected layers [41]. It also introduced Inception modules, which apply different-sized filters to the input and concatenate the results, so that the module can extract different levels of features simultaneously. Another notion introduced by GoogLeNet, which won the 2014 ILSVRC competition, is the use of global average pooling instead of fully connected layers at the network's end, reducing the number of model parameters.

ResNet (2015): This type of CNN architecture introduced the "identity shortcut connection" to handle the well-known "vanishing gradients" issue that characterizes deep networks. This technique revealed that extremely deep networks could use standard SGD along with residual modules for their training [42].

DenseNet (2017): Another important CNN architecture, DenseNet stands out because it addresses the vanishing-gradient problem by directly connecting all of its layers. Feature propagation is thereby improved, and features can be reused more efficiently. It is a widespread choice for disease diagnosis and, more recently, efficiently addressed the task of cardiac disease classification, as reported in [43].

#### *1.2. Machine Learning and Deep Learning in SPECT Nuclear Cardiology Imaging*

Currently, regarding SPECT MPI, one of the established imaging methods in nuclear cardiology, researchers face the challenge of developing an algorithm that can automatically characterize the status of patients with known or suspected coronary artery disease. The accuracy of such an algorithm must be extremely high, since patients' lives are at stake. Since deep learning algorithms have the capacity to improve the accuracy of CAD screening, they have been broadly explored in the domain of nuclear cardiovascular imaging analysis.

ML and DL methods have both been explored to assess the likelihood of obstructive CAD. In the context of ML algorithms for CAD diagnosis, ANNs, SVMs and boosted ensemble methods have been investigated. In a single-center study on the detection of obstructive CAD, ML was applied to SPECT myocardial perfusion imaging (MPI) combined with the clinical data of 1181 patients and provided AUC values (0.94 ± 0.01) significantly better than total perfusion deficit (0.88 ± 0.01) or visual readout [44].

ML was also explored in the multi-center REFINE SPECT (REgistry of Fast Myocardial Perfusion Imaging with NExt generation SPECT) registry [45]. In this study, 1980 patients with possible CAD underwent stress/rest 99mTc-sestamibi/tetrofosmin MPI. An ML algorithm embedding 18 clinical, 9 stress-test and 28 imaging variables from these 1980 patients produced an AUC of 0.79 [0.77, 0.80], higher than that of total perfusion deficit (TPD, 0.71 [0.70, 0.73]) or ischemic TPD (0.72 [0.71, 0.74]) in the prediction of early coronary revascularization.

In [46], an ANN was applied to interpret MPS for suspected myocardial ischemia and infarction in 418 patients who underwent ECG-gated MPS at a single hospital. The ANN-based method was compared against a conventional automated quantification software package, and the results showed that the neural-network-based model produces interpretations more similar to those of experienced clinicians than the other method examined. Using clinical and other quantification data, the authors of [47] deployed a boosted ensemble machine learning algorithm and an ANN, achieving classification accuracies of up to 90%. SVMs were exploited in [48], trained on a group of 957 patients with either correlating invasive coronary angiography or a low probability of CAD. The AUC produced by the SVM classifier combining quantitative perfusion (TPD and ISCH) and functional data was as high as 86%.

Moreover, several recent research studies explore ML and DL methods for diagnosing CAD in nuclear cardiology using polar maps instead of SPECT MPI scans. The studies devoted to polar maps are as follows: in [49], Perfex and an ANN were used with polar maps, while in [50–52], polar maps were utilized along with DL methods. In [49], polar maps of stress and rest examinations of 243 patients who underwent SPECT and coronary angiography within three months were used as input images to train ANN models. The AUC produced by receiver operating characteristic (ROC) analysis for the neural networks was 0.74, surpassing the corresponding AUC of the physicians.

Regarding the application of DL for CAD prediction, the authors in [50] employed deep learning, trained on polar maps, to predict obstructive disease from myocardial perfusion imaging (MPI). The outcome is an improved automatic interpretation of MPI compared to the total perfusion deficit (TPD). As a result, a pseudo-probability of CAD was computed per vessel region and per individual patient. An AUC of 0.80 was calculated for the detection of 70% stenosis or higher, again outperforming TPD. The DL procedure automatically predicted CAD from 2-view (upright and supine) polar map data obtained from dedicated cardiac scanners in 1160 patients, improving current perfusion analysis in the prediction of obstructive CAD [50].

The same team of authors presented another interesting application of DL to the prediction of CAD: a three-fold feature extraction convolutional layer joined with three fully connected layers was deployed to analyze SPECT myocardial perfusion clinical data and polar maps from 1638 patients [51]. These works investigated the integration of clinical and imaging data and show how to formulate new autonomous systems for the automatic interpretation of SPECT and PET images. The authors in [52] proposed a graph-based convolutional neural network (GCNN) using Chebyshev polynomials, achieving the highest accuracy (91%) compared with other neural-network-based methods.

Recently, the authors in [53] were the first to study CAD diagnosis using solely SPECT MPI scans with deep learning. They developed two different classification models to classify MPI scans automatically into two types, ischemia or healthy, exclusively employing the SPECT MPI scans at the input level: the first is based on deep learning (DL), while the second is knowledge-based. The DL model exploits different pre-trained deep neural networks (DNNs) well known in medical image analysis (AlexNet, GoogleNet, ResNet, DenseNet, VGG16, VGG19), combined with a traditional SVM classifier applied to the deep and shallow features extracted from these networks; in terms of classification accuracy, the best DL model was determined to be VGG16 with an SVM on the extracted features. The knowledge-based model, in its turn, focuses on converting knowledge extracted from domain experts into image processing methods such as color thresholding, segmentation, feature extraction and heuristics to classify SPECT images. First, the images were divided into six segments (A, B, C, D, E, F), and features measuring the shapes were extracted from each segment. Next, a classification rule set assigned by experts was applied; the parameters were empirically identified and fine-tuned on the training and validation images. The produced overall two-class classification accuracy was 93% for both methods.

As has emerged from the related literature on SPECT MPI, PET and PET-CT [54], a system based on deep learning provides performance similar to that of a nuclear physician in standalone mode. However, the overall performance improves notably when it is used as a supportive tool for the physician [33,55–58]. Although SPECT MPI scans are of high importance for diagnosing CAD in nuclear cardiology, only one study in the literature has applied CNNs to these types of MPI images. Thus, there is ample room for further investigation of the advantageous capabilities of CNNs in the field of nuclear cardiology.

#### *1.3. Contribution of This Research Work*

In previous work by the authors, new, fast and powerful CNN architectures were proposed to classify bone scintigraphy images of prostate and breast cancer patients in nuclear medicine [33,56,57,59]. More specifically, an RGB-based CNN model was proposed to automatically identify whether a patient has bone metastasis by viewing whole-body scans. The results showed the superior performance of the RGB-CNN model against other state-of-the-art CNN models in this field [56,57]. The advantageous features of the model, such as robustness, efficacy, low time cost, simple architecture and trainability on a relatively small dataset [57], along with the promising results, motivated the authors to test the generalization capabilities of this methodology further. This research work investigates the performance of the new models by applying them in nuclear cardiology, with the necessary parameterization and regularization, to the recognition of myocardial perfusion from SPECT scans of patients with ischemia or infarction. Hence, the objectives are: (i) to diagnose cardiovascular disease by exploring efficient and robust CNN-based methods and (ii) to evaluate their performance on SPECT MPI scans. A straightforward comparative analysis between the proposed RGB-based CNN methods and state-of-the-art deep learning methods found in the literature, such as VGG16, DenseNet and MobileNet, was also conducted to show their classification performance. The produced results reveal that the proposed deep learning models achieve high classification accuracy with small datasets in SPECT MPI analysis and could point to future research directions, with a view to further applying the classification method to other malignant medical conditions.

Overall, the innovation of this paper is two-fold: (a) the application of a predefined CNN-based structure, namely RGB-CNN [56,57], recently proposed in bone scintigraphy after a meticulous exploration of its architecture, and (b) the implementation of a deep learning classification model utilizing transfer learning to improve CAD diagnosis. The produced fast, robust and highly efficient model, in terms of accuracy and AUC score, can automatically identify patients with known or suspected coronary artery disease from SPECT MPI scans.

This work is structured as follows: Section 2 includes the methods and materials used in this study. Section 3 provides the proposed deep learning architectures for SPECT MPI classification. In Section 4, a rigorous analysis is conducted, including the exploration of different configurations to determine the most accurate of the proposed classification models. Finally, the discussion of results and the conclusions follow in Section 5.

#### **2. Materials and Methods**

#### *2.1. Patients and Imaging Protocol*

The dataset used in this study corresponds to a retrospective review of 224 patients (age 32–85, average age 64.5, 65% men, 55% with CAD), whose SPECT MPI scans were provided by the Nuclear Medicine Department of the Diagnostic Medical Center "Diagnostiko-Iatriki A.E.", Larisa, Greece. The dataset consists of images from patients who underwent stress and rest SPECT MPI on suspicion of ischemia or infarction for CAD assessment between June 2013 and June 2017. The participating patients underwent invasive coronary angiography (ICA) 40 days after MPI.

The set of stress and rest images collected from the records of these 224 patients constitutes the dataset used in this retrospective study. It includes eight patients with infarction, 142 patients with ischemia, and eight patients with both infarction and ischemia, while the remaining 61 patients were normal. Indicative image samples are illustrated in Figure 1. This dataset is available only upon request to the nuclear medicine physician and only for research purposes.

**Figure 1.** Image samples of SPECT MPI from our dataset: (**a**) ischemia, (**b**) infarction, (**c**) healthy.

A 1-day stress–rest injection protocol was used for Tc-99m tetrofosmin SPECT imaging. Patients underwent either symptom-limited Bruce protocol treadmill exercise testing (n = 154 [69%]) or pharmacologic stress (n = 69 [31%]) with radiotracer injection at peak exercise or during maximal hyperemia, respectively.

Stress SPECT images were collected within 20 min after an injection of 7 to 9 mCi 99mTc-tetrofosmin, following either an exercise test or pharmacological stress with dipyridamole. In the case of the exercise test, a treadmill test following the Bruce protocol was performed; when at least 85% of the age-predicted maximum heart rate was reached, the 99mTc-tetrofosmin injection was administered, and the exercise stopped 1 min later. Rest imaging followed 40 min after a dose of 21–27 mCi 99mTc-tetrofosmin had been injected. SPECT data were acquired in 32 projections of 30 s each for both the stress and the rest SPECT MPI. The remaining acquisition settings were a 140 keV photopeak, a 180-degree arc and a 64 × 64 matrix.

#### *2.2. Visual Assessment*

Patients were scanned on a Siemens Symbia S series SPECT system (with a dedicated workstation and Syngo VE32B software, Siemens Healthcare GmbH, Erlangen, Germany) comprising two heads with low-energy high-resolution (LEHR) collimators. The Syngo software was used for standard SPECT MPI processing, allowing the nuclear medicine specialist to automatically produce SA (short axis), HLA (horizontal long axis) and VLA (vertical long axis) slices from the raw data [55]. Afterwards, two expert readers with considerable clinical expertise in nuclear cardiology (N. Papandrianos, the first author of this work, and N. Papathanasiou, Assistant Professor in Nuclear Medicine, University Hospital of Patras) provided visual assessments solely for the series of stress and rest perfusion images in color scale, without functional or clinical data [18]. A case is labeled as normal when there is homogeneous uptake of 99mTc-tetrofosmin in the left ventricular walls.

On the contrary, a defect is defined when the radiopharmaceutical is taken up less in any part of the myocardium. A comparative visual analysis is conducted between the images collected after the stress SPECT MPI and those after the rest SPECT MPI. In this context, the condition is described as ischemia when a perfusion defect is detected in the SPECT images obtained after exercise but not in the rest images. Infarction, instead, is the condition in which both stress and rest images include evidence of the defect. The classification of all SPECT MPI images carried out by the expert readers uses a two-class label (1 denotes normal and 2 abnormal) for the subsequent tasks [53].

#### *2.3. Overview of Convolutional Neural Networks*

Convolutional neural networks are among the most dominant deep learning methodologies, designated as techniques with a remarkable capacity for image analysis and classification. Their architecture originates from the perceptron model, in which a series of fully connected layers is established and all neurons of consecutive layers are individually interconnected. A detailed description of the different types of layers follows.

The "convolutional" layer, after which this type of neural network is named, plays a substantial role in CNNs since it is responsible for the formation of activation maps. In particular, specific patterns of an image are extracted, helping the algorithm detect characteristics essential for image classification [34,60]. A pooling layer follows, whose duty is image downsampling, whereby unwanted noise that might confuse the algorithm is discarded. Within each pooling window, only the elements that satisfy the pooling criterion, i.e., the maximum or the average value, are retained, and the remaining values are rejected.

The last part of a CNN architecture comprises one or more fully connected layers, preceded by a "flattening" of the previous layer's output. The final output layer takes the form of a vector, and according to the values of this vector, a specific label is assigned by the algorithm to every image. The fully connected layers can be classified into distinct subcategories according to their role: vectorization is attained by the first layer, whereas the class assignment is performed by the final layer [61,62].

Concerning the activation functions in CNN models, the rectified linear unit (ReLU) is deployed in all convolutional and fully connected layers, while the sigmoid function is the most common activation function in the output nodes [62]. It is worth mentioning that selecting the most suitable activation function is crucial and depends on the desired outcome. The Softmax function is efficiently utilized for multiclass classification: it produces class probabilities through a normalization of the raw output values derived from the last fully connected layer [62].
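As a toy illustration of this normalization, the following minimal Python sketch (the logit values are invented for the example) converts the raw outputs of a final fully connected layer into class probabilities:

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    exps = np.exp(logits - np.max(logits))
    # Normalize so the outputs sum to 1 and can be read as class probabilities.
    return exps / exps.sum()

# Hypothetical raw outputs of the last fully connected layer for one image.
logits = np.array([2.0, 0.5])
print(softmax(logits))  # -> [0.818 0.182]: the first class is the most likely
```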

#### *2.4. Methodology*

This work discusses the recently proposed RGB-CNN model as a new efficient method in scintigraphy/nuclear medical image analysis, here applied to the classification of SPECT MPI scans of coronary artery disease patients. This two-class classification task involves cases with ischemia or infarction as well as those labeled as normal in a sample of 224 patients. It comprises three distinct stages: pre-processing, network design and testing/evaluation, which have been presented in previous publications (see [30,50,53]). The pre-processing step consists of data normalization, data shuffling, data augmentation and data splitting into training, validation and testing sets. Data augmentation involves specific image operations such as shifting, zoom, rotation and flipping, and is performed before the data enter the CNN exploration and training. Concerning the data split, the training dataset comprises 85% of the provided dataset of 275 MPI images, whereas the remaining 15% is used for testing purposes. Next, the network design stage deals with the construction of a proper architecture through an exploration process. Then the testing phase follows, utilizing the best CNN model derived; in this final stage, the produced CNN model is tested on data unknown to the model.
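A minimal sketch of this pre-processing stage is given below. The stand-in arrays and the augmentation ranges are illustrative assumptions; only the normalization, shuffling, 85%/15% split and the use of the Keras ImageDataGenerator mirror the description above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in data: in the study these would be the SPECT MPI scans and
# their one-hot two-class labels (normal / abnormal).
images = np.random.rand(100, 250, 250, 3).astype("float32")
labels = np.eye(2)[np.random.randint(0, 2, size=100)]

# Normalization, shuffling and the 85%/15% train/test split described above.
images /= images.max()
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.15, shuffle=True, random_state=0)

# Augmentation (rotation, shifts, zoom, flips) is applied only to training data;
# the ranges below are placeholders, not the authors' exact settings.
train_gen = ImageDataGenerator(
    rotation_range=15, width_shift_range=0.1, height_shift_range=0.1,
    zoom_range=0.2, horizontal_flip=True)
train_flow = train_gen.flow(X_train, y_train, batch_size=8)
```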

Likewise, the respective classification approach is deployed for the tasks of image pre-processing, network training and testing, and is applied to the new dataset. The process for the examined dataset of SPECT MPI scans is visually represented in Figure 2.

**Figure 2.** The proposed methodology for SPECT MPI classification including RGB-CNN architectures.

#### **3. Methodology**

#### *3.1. RGB-Based CNN Architecture for Classification in Nuclear Medical Imaging*

In this research study, we apply an efficient and robust CNN model, the RGB-CNN (proposed in a recent study in the domain of bone scintigraphy), to categorize MPI images precisely as normal or abnormal (CAD). The developed CNN demonstrates its capacity for high accuracy utilizing a fast yet straightforward architecture for MPI classification. A number of experiments were performed for different values of parameters such as pixel size, epochs, drop rate, batch size, and number of nodes and layers, as described in [56–58]. In classic approaches, appropriate features are extracted and selected manually, following common feature extraction techniques. CNNs, in contrast, which resemble ANNs, achieve automatic feature extraction by applying multiple filters to the input images and then select the features most suitable for image classification through an advanced learning process.

A deep-layer network is constructed within this framework, embodying five convolutional-pooling blocks, two dense layers and a dropout layer, followed by a final two-node output layer (see Figure 3).

**Figure 3.** RGB-CNN architecture for CAD classification using SPECT MPI scans.

The dimensions of the input images vary from 250 × 250 to 400 × 400 pixels. According to the structure of the proposed CNN, the initial convolutional layer includes 3 × 3 filters (kernels) followed by a 2 × 2 max-pooling layer and a dropout layer with a dropout rate of 0.2. The first convolutional layer is formed by 16 filters, and each subsequent layer includes double the number of filters of the previous one; the max-pooling layers that follow take the same form. A flattening operation is then utilized to transform the two-dimensional matrices into one-dimensional arrays so that they can be fed into the hidden dense layer of 64 nodes. The role of the dropout layer that follows is to randomly drop the learned weights by 20% to avoid overfitting. The two-node output layer comes last in the proposed CNN model architecture.

ReLU, the most common activation function in CNNs, is applied to all convolutional and fully connected (dense) layers. The algorithm was tested through multiple runs with the number of epochs varying from 200 to 700 to determine the most suitable number of epochs for CNN training. In this context, the ImageDataGenerator class from Keras is used, providing specific augmentation operations on the images, such as rotation, shifting, flipping and zoom. Finally, the categorical cross-entropy function is applied at the output nodes for the calculation of loss, together with the ADAM optimizer, an adaptive learning-rate optimization algorithm [36].
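The description above can be translated into the following Keras sketch. This is a minimal reconstruction based solely on the textual description (five 3 × 3 convolution/2 × 2 max-pooling blocks with 16–32–64–128–256 filters, hidden dense layers, 20% dropout and a two-node softmax output trained with categorical cross-entropy and the ADAM optimizer), not the authors' released code; the exact placement of the dropout layer and the 64–64 dense-node default are choices among the configurations explored in Section 4:

```python
from tensorflow.keras import layers, models, optimizers

def build_rgb_cnn(input_shape=(250, 250, 3), dense_nodes=(64, 64), drop_rate=0.2):
    """Minimal reconstruction of the RGB-CNN described in this subsection."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Five 3x3-convolution / 2x2-max-pooling blocks; filters double per block.
    for filters in (16, 32, 64, 128, 256):
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)                      # 2-D feature maps -> 1-D vector
    for units in dense_nodes:                    # hidden dense layers (e.g. 64-64)
        x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(drop_rate)(x)             # randomly drop 20% of activations
    outputs = layers.Dense(2, activation="softmax")(x)  # normal / abnormal
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_rgb_cnn()
model.summary()
```

Training would then consume the augmented `train_flow` from the pre-processing sketch, e.g. `model.fit(train_flow, validation_data=(X_test, y_test), epochs=500)`.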

#### *3.2. Deep Learning Models, Including Transfer Learning for CAD Classification in Medical Imaging*

In this subsection, we introduce the process followed in this study on applying deep learning architectures, including transfer learning for benchmark CNN models in CAD diagnosis.

In deep learning model development, the traditional pipeline is to train the neural network from scratch, which depends highly on the size of the available data. Transfer learning is a widely preferred alternative in developing deep learning architectures [63]. It offers the capability to exploit the existing knowledge of a CNN pre-trained on the ImageNet dataset so as to produce competent predictions.

For an accurate classification process, an improved model training process is required, which derives from the incorporation of transfer learning during the training phase of the proposed CNN architectures. More specifically, the ImageNet dataset [63,64] is utilized for network pre-training, resulting in accurate classification of SPECT myocardial perfusion imaging scans into two categories, namely normal and abnormal (patients with ischemia or infarction). According to the relevant literature, the ImageNet dataset, which includes 1.4 million images in 1000 classes, is employed by the popular CNN methods for model pre-training. Based on this pre-training, the VGG16 and DenseNet models extract particular features from images through frozen (constant) weights. The number of weight layers determines the depth of the model, along with the steps needed for feature extraction.
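The following hedged Keras sketch illustrates this transfer-learning setup for VGG16 (DenseNet is analogous): the ImageNet-pre-trained backbone is kept frozen and a small two-node head is trained on top. The head size and the momentum value are illustrative assumptions; the SGD optimizer and the 0.0001 initial learning rate are the training settings reported in Section 4.2:

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

# ImageNet-pre-trained backbone without its original 1000-class head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(250, 250, 3))
base.trainable = False  # keep the pre-trained weights constant during training

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation="relu"),    # task-specific head (size illustrative)
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),  # normal / abnormal
])
# SGD with momentum and the 0.0001 initial learning rate from Section 4.2
# (the momentum value itself is an illustrative assumption).
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
```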

The training dataset, representing 85% of the provided dataset of 224 SPECT MPI images, is loaded into the pre-trained models after undergoing a proper augmentation process. Hence, an improved CNN model is produced, which enters the subsequent testing phase. The remaining 15% of the provided dataset is incorporated into the evaluation process. The proposed transfer-learning methodology for the state-of-the-art CNN models is graphically presented in Figure 4 for the examined dataset of 224 patients.

**Figure 4.** The proposed methodology for SPECT MPI classification including transfer learning of state-of-the-art CNN architectures.

After the benchmark CNN models were selected for the classification task, suitable, robust and efficient architectures of these models were explored and identified for the specific problem, namely the identification of the correct category of CAD diagnosis. On this basis, the fine-tuning of the model parameters and the configuration of several other hyperparameters were attained through a thorough investigation of the appropriate deep learning architecture. For comparison purposes, various common deep learning architectures such as DenseNet, VGG16, MobileNet and InceptionV3 were investigated.

#### **4. Results**

This study addresses the image classification problem of assigning SPECT MPI images to two categories: normal and abnormal (ischemia or infarction patient cases). The classification processes were individually repeated 10 times to produce the overall classification accuracy.

All the simulations were performed in Google Colab [65], a cloud-based environment that supports free GPU acceleration. The Keras 2.0.2 and TensorFlow 2.0.0 frameworks were utilized to develop the employed deep learning architectures. Image augmentations (such as rotations, shifting, zoom and flips) took place only during the training process of the deep networks and were accomplished using the ImageDataGenerator class from Keras. The investigated deep learning architectures were coded in the Python programming language. Sci-Kit Learn was used for data normalization, data splitting, and the calculation of confusion matrices and classification reports. It should be noted that all images produced by the scanning device and used as the dataset in this research were in RGB format, providing 3-channel color information.
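As a hedged illustration of the evaluation tooling mentioned above, the following snippet computes a confusion matrix, a classification report and an AUC with Sci-Kit Learn for one test run; the stand-in labels and scores replace what `model.predict` would return on the held-out 15% of scans:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

np.random.seed(0)
# Stand-in predictions; in practice y_score comes from model.predict(X_test).
y_true = np.random.randint(0, 2, size=34)   # ~15% of the 224-patient dataset
y_score = np.random.rand(34)                # predicted probability of "abnormal"
y_pred = (y_score >= 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))     # as reported in Table 4
print(classification_report(y_true, y_pred, target_names=["normal", "abnormal"]))
print("AUC:", roc_auc_score(y_true, y_score))  # as in the ROC analysis of Figure 10
```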

#### *4.1. Results from RGB-CNN*

In this study, a meticulous exploration of the RGB-CNN deep learning architecture was accomplished. In particular, an experimental analysis was conducted in which various drop rates (0.1–0.9), numbers of epochs (200 to 700), numbers of dense nodes (16–16, 32–32, 64–64, 128–128, 256–256, 512–512 and 1024–1024), pixel sizes (from 200 × 200 × 3 up to 450 × 450 × 3) and batch sizes (8, 16, 32 and 64) were tested. To prevent overfitting [62] in the proposed deep learning architectures, the authors conducted an exploratory analysis for different dropouts and numbers of epochs. According to the results of this analysis, a dropout value of 0.2 and 500 epochs were adequate to produce satisfactory results for the investigated RGB-CNN architecture.
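A sketch of how such a hyperparameter sweep might be organized is shown below. It assumes the `build_rgb_cnn` helper from the sketch in Section 3.1 and stand-in data, and it shortens the grids and the number of epochs so the loop stays illustrative rather than a reproduction of the full exploration:

```python
import itertools
import numpy as np
import tensorflow as tf

# Stand-in data; in the study these would be the SPECT MPI scans.
raw_images = np.random.rand(40, 350, 350, 3).astype("float32")
labels = np.eye(2)[np.random.randint(0, 2, size=40)]

# Shortened value grids from the exploration described above.
drop_rates = [0.1, 0.2, 0.5]
dense_nodes = [(32, 32), (64, 64)]
batch_sizes = [8, 16, 32]
pixel_sizes = [250, 300, 350]

best_cfg, best_acc = None, -1.0
for rate, nodes, batch, px in itertools.product(drop_rates, dense_nodes,
                                                batch_sizes, pixel_sizes):
    x = tf.image.resize(raw_images, (px, px)).numpy()
    model = build_rgb_cnn(input_shape=(px, px, 3),
                          dense_nodes=nodes, drop_rate=rate)
    hist = model.fit(x, labels, validation_split=0.15, epochs=5,  # 500 in the paper
                     batch_size=batch, verbose=0)
    val_acc = max(hist.history["val_accuracy"])
    if val_acc > best_acc:
        best_cfg, best_acc = (rate, nodes, batch, px), val_acc

print("best configuration:", best_cfg, "validation accuracy:", best_acc)
```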

Moreover, an exploration analysis involving the testing of various pixel sizes was conducted, and the best pixel size of the input images was determined with regard to classification accuracy and loss. Figures 5 and 6 illustrate the produced results in terms of accuracy for the examined pixel sizes for both CNN-based architectures. These figures support the selection of the appropriate pixel size for each architecture, which is 250 × 250 × 3 for RGB-CNN.

Following this exploration process, several configurations were investigated, including dropout = 0.2, three batch sizes (8, 16 and 32), various pixel sizes and dense nodes in the RGB-CNN model consisting of 5 layers (16–32–64–128–256). Tables 1–3 illustrate the results produced for the relevant pixel sizes for the best-performing batch sizes of 8, 16 and 32. These results helped in the selection of the most appropriate pixel size, which is 250 × 250 × 3.

Regarding the two dense blocks that come last in both architectures, a rigorous exploration process was performed to determine the best configuration in terms of accuracy (validation and testing) and loss. The results are depicted in Figures 5 and 6 for 4 and 5 convolutional layers, respectively. From these figures, it emerges that 64–64 is the optimum combination for the CNN model.

**Figure 5.** RGB-CNN architecture with 4 layers, various dense nodes and batch sizes for CAD classification problem.

**Figure 6.** RGB-CNN architecture with 5 layers, various dense nodes and batch sizes for CAD classification problem.

**Table 1.** Results for various pixel sizes and dense nodes in RGB-CNN with 5 layers (16–32–64–128–256), dropout = 0.2 and batch size = 8.



**Table 2.** Results for various pixel sizes and dense nodes in RGB-CNN with 5 layers (16–32–64–128–256), dropout = 0.2 and batch size = 16.

**Table 3.** Results for various pixel sizes and dense nodes in RGB-CNN with 5 layers (16–32–64–128–256), dropout = 0.2 and batch size = 32.


Next, for the selected pixel size (250 × 250 × 3), different batch sizes (8, 16 and 32) with various configurations of dense nodes were investigated, also utilizing the two previously best-performing architectures concerning the number of convolutional layers (16–32–64–128 and 16–32–64–128–256), as presented in recent research studies [56–58]. The outcomes of this exploration are presented in Figures 5 and 6. These figures show that the best CNN configuration corresponds to batch size 8, five convolutional layers (16–32–64–128–256) and dense nodes 32–32. It emerges that dense 32–32 is the most suitable configuration concerning the dense nodes.

Figure 7 shows the accuracy, loss and AUC values for various dense nodes regarding the best batch size (8) and the number of convolutional layers (16–32–64–128–256).

Additionally, further exploration analysis was performed for various numbers of convolutional layers. Some indicative results are presented in Figure 8. It is observed that the model increased its classification accuracy significantly with 5 convolutional layers.

**Figure 7.** Results for the best RGB-CNN configuration concerning (**a**) accuracy and AUC score and (**b**) loss.

**Figure 8.** CNN results for different numbers of convolutional layers concerning (**a**) validation and testing accuracies and (**b**) AUC values.

To sum up, the best RGB-CNN architecture for this problem is: pixel size 250 × 250 × 3, batch size = 8, dropout = 0.2, convolutional layers 16–32–64–128–256, dense nodes 32–32 and epochs = 500 (average run time = 1125 s).

In addition, Table 4 depicts the confusion matrix of the best RGB-CNN architecture. Figure 9 illustrates the classification accuracies (validation and testing) with their respective loss curves for the proposed RGB-CNN architecture. Figure 10 depicts the diagnostic performance of the RGB-CNN model in SPECT MPI interpretation assessed by ROC analysis for CAD patients.

In the proposed method, an early stopping condition for RGB-CNN was investigated considering 100 epochs, providing adequate accuracy, higher than that of the other CNNs. In particular, the accuracy produced with early stopping was approximately 89% in most of the examined runs. However, using the minimum-error stopping condition, the capacity of the algorithm was explored further, increasing the accuracy of the RGB-CNN model up to approximately 94%. Figure 9a illustrates the precision curves, presenting a smooth change in accuracy for the proposed model.
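Expressed with standard Keras callbacks, the two stopping strategies compared above might look as follows. Whether "100 epochs" denotes a fixed budget or a patience window is not fully specified in the text, so the patience reading is used here as an assumption, together with an illustrative checkpoint filename:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Early stopping: halt when validation loss has not improved for 100 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=100,
                           restore_best_weights=True)
# Minimum-error condition: keep the weights from the epoch with the lowest
# validation loss seen over the full 500-epoch run.
checkpoint = ModelCheckpoint("best_rgb_cnn.h5", monitor="val_loss",
                             save_best_only=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, batch_size=8, callbacks=[early_stop, checkpoint])
```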

**Table 4.** Best confusion matrix for the proposed RGB-CNN.


**Figure 9.** Precision curves for best RGB-CNN model showing (**a**) accuracy and (**b**) loss.

#### *4.2. Results from Deep Learning Architectures Applying Transfer Learning and Comparative Analysis*

In this subsection, the second deep learning classification approach for CAD patients, using transfer learning, is presented, followed by a comparative analysis. Following the process discussed in Section 2.4, transfer learning was utilized employing several pre-trained CNN models, avoiding the training of a new network from randomly initialized weights. In this way, the classification of SPECT MPI scans is faster and more efficient given the limited number of training images.

This approach includes efficient state-of-the-art CNNs from the medical image analysis domain, mainly reported in previous studies on similar classification tasks. In particular, for the purposes of this research work, the following SoA CNN architectures were used: (i) VGG16 [39], (ii) DenseNet [43], (iii) MobileNet [59] and (iv) Inception V3 [60].

Concerning the training characteristics of this approach, the stochastic gradient descent with momentum algorithm was used, and the initial learning rate was set to 0.0001. It is worth mentioning that an exploratory analysis of the SoA CNNs [25,33] had previously been conducted in the reported literature, paying particular attention to the avoidance of overfitting [62]. Overfitting is a common issue for most state-of-the-art CNNs that work with small datasets; thus, a meticulous exploration with various dropout rates, dense layers and batch sizes was applied to avoid it. Overall, CNN selection and hyperparameter optimization were performed through an exploration process considering combinations of batch size (8, 16, 32, 64 and 128), dropout (0.2, 0.5, 0.7 and 0.9), the flatten layer, the number of trainable layers and various pixel sizes (200 × 200 × 3 up to 350 × 350 × 3). Moreover, a range of dense-node counts (16, 32, 64, 128, 256 and 512) was explored, and the number of epochs ranged from 200 up to 500. The best-performing CNN models in terms of accuracy and loss in the validation phase were selected as the optimum for classifying the test dataset [24,56].
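The same sweep can be organized across the pre-trained backbones. The sketch below is an assumption about tooling rather than the authors' code; it builds each candidate on its frozen ImageNet-pre-trained base as in Section 3.2, to which the dense head and the hyperparameter grid above would then be attached:

```python
from tensorflow.keras.applications import VGG16, DenseNet121, MobileNet, InceptionV3

# DenseNet121 is one common DenseNet variant; the paper only says "DenseNet".
backbones = {
    "VGG16": VGG16,
    "DenseNet121": DenseNet121,
    "MobileNet": MobileNet,
    "InceptionV3": InceptionV3,
}

for name, constructor in backbones.items():
    base = constructor(weights="imagenet", include_top=False,
                       input_shape=(250, 250, 3))
    base.trainable = False  # freeze the ImageNet weights, as in Section 3.2
    # ...attach the same two-node head and sweep batch size, dropout,
    # dense nodes and pixel size as described above.
    print(name, "frozen parameters:", base.count_params())
```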

After the extensive exploration of all the provided architectures of popular CNNs, the authors defined the optimum values for the respective models' parameters, as follows:


Concerning the dropout value, 0.2 performed best for the investigated CNN configurations according to the exploration process. The testing image dataset was used to evaluate the network's performance; however, it was not involved in the training phase.

The results of the SoA CNN architectures explored in this second approach are compared to the best-performing RGB-CNN model and are gathered in the following three figures. More specifically, Figure 11 depicts the classification accuracy in the validation and testing phases for the best-performing deep learning architectures. Figure 12 illustrates the respective loss for all SoA CNNs. Finally, Figure 13 presents the AUC scores for all performed CNNs.

**Figure 11.** Comparison of the classification accuracies for all performed CNNs.

**Figure 12.** Comparison of loss (validation and testing) for all performed CNNs.

**Figure 13.** AUC score values for all performed CNNs.

#### **5. Discussion of Results and Conclusions**

Due to their ability to track complex visual patterns, powerful and widely used CNN algorithms are employed in the medical image analysis domain to address the problem of CAD diagnosis in nuclear cardiology. In this research study, two different deep learning classification approaches, namely the RGB-based CNN model and the transfer-learning-based CNN models (benchmark CNN models pre-trained on the ImageNet dataset), were adopted to identify perfusion defects in SPECT MPI scans. The first classification approach is based on the RGB-CNN algorithm, previously proposed for image classification in nuclear medicine bone scintigraphy. The second approach utilizes transfer learning incorporated into well-known deep learning architectures. The provided dataset, comprising stress and rest images from 224 subjects, is employed to assess the proposed models with respect to their performance. The problem was formulated as a two-class classification problem.

For an in-depth assessment of the results, a comparative analysis of the classification performance of the proposed model against that of other CNNs reported in the literature was performed (even if indirectly) in the examined field of CAD diagnosis in nuclear medicine using solely SPECT MPI images. The relevant research studies in this field were gathered in the Introduction and are presented in Table 5, together with the classification accuracies and evaluation metrics of the respective models. As regards the previous works of [50–52], where polar map images were used for CAD classification, deep CNNs and graph-based CNNs were employed for normal/abnormal classification; these are not directly comparable to this research study and provide classification accuracies up to 91%.




**Table 5.** Related ML/DL studies on CAD diagnosis in nuclear cardiology, with the classification accuracies and evaluation metrics of the respective models.

It is worth mentioning that only one previous work is highly related to the current research study; it treats the presence of coronary artery stenosis (normal or abnormal) as a two-class classification problem. This work employed well-known CNNs with transfer learning to classify normal/abnormal patient cases [53]. The authors employed pre-trained deep neural networks as well as an SVM classifier using deep and shallow features derived from the respective networks. Most of the applied DL-based methods (AlexNet, GoogleNet, DenseNet, ResNet, VGG-16) provided accuracies below 87% on this dataset, and only VGG-19 with an SVM on shallow features increased the accuracy slightly. The knowledge-based classification model, which uses extracted shape features and empirically verified parameters fine-tuned on the training and validation images, provided the highest classification accuracy, up to 93%. The comparative analysis of the proposed RGB-CNN method with the related ML and deep learning techniques listed in Table 5 shows that the proposed RGB-CNN model outperforms the previous techniques in MPI imaging: it provides slightly better classification accuracy (94%) and AUC score (93%), making it a competitive solution for this diagnosis task.

After rigorously exploring the possible hyperparameters and regularization methods of the proposed RGB-CNN architecture, the best overall classification accuracy for the deep network model (best RGB-CNN) was established (see Figures 11–13). The authors selected the RGB-CNN model with 5 convolutional layers, batch size = 16, dropout = 0.2 and 64–64 dense nodes as the simplest and best-performing CNN with respect to testing accuracy and loss. Moreover, from the results above, the best RGB-CNN model is characterized by an overall classification accuracy of 93.47% ± 2.81%, with an overall test loss of approximately 0.18 (see Figure 12). To emphasize the classification performance of the CNN approaches presented in this study, the authors conducted a comparative analysis between the proposed RGB-CNN model and other SoA CNNs commonly used for image classification, with reference to accuracy and other metrics such as the AUC score. Regarding the produced AUC values for the RGB-CNN model and the other SoA CNNs, as depicted in Figure 13, RGB-CNN has the highest AUC score, making it possibly the best classifier in terms of performance for the given problem. The average run time of the best architecture for the proposed model is 1125 s, which is considered fast for such networks. As with the other CNN-based methods, this method presents a fast run time, as shown in the previous works of the same team of authors [33,56] on bone scintigraphy.

The results indicate that the proposed RGB-CNN is an efficient, robust and straightforward deep neural network able to detect perfusion abnormalities related to myocardial ischemia and infarction on SPECT images in nuclear medicine image analysis. It was also demonstrated to be a model of low complexity with good generalization capabilities compared to state-of-the-art deep neural networks. Moreover, it exhibits better performance than the SoA CNN architectures applied to this specific problem in terms of accuracy and AUC. The proposed CNN-based classification approach can be employed on SPECT MPI scans in nuclear cardiology to support CAD diagnosis, and it can also serve as a clinical decision support system in nuclear medicine imaging.

To sum up, the major differences of RGB-CNN compared to other conventional CNNs include (i) its ability to efficiently train a model on a small dataset without network pre-training on the ImageNet dataset, (ii) its ability to be optimized through an exploratory analysis that helps to avoid overfitting and to generalize well to unknown input images, and (iii) its less complex architecture, which yields an efficient run time [33,57].

Regarding the limitations reported in previous studies, the models proposed in this work do not depend on specific characteristics, such as gender or camera specifications, that would inflate the number of inputs [34]. In addition, they perform sufficiently well even when few training images are available. A distinguishing advantage of the proposed models over other studies is their ability to use SPECT images as input without the need for any additional data. Finally, less experienced physicians can improve their diagnostic accuracy by supporting their opinion with the results of such systems. However, some limitations need to be addressed in future work: (i) the limited number of normal cases in the dataset, which makes it unbalanced, and (ii) the disregard of clinical and other functional data in the classification process, which would improve the diagnosis.

According to the overall results of this study, the proposed RGB-CNN deep learning structures are extremely efficient in classifying SPECT MPI scans in nuclear medicine. Even though these CNN-based approaches rely on a relatively limited number of patients, this study also considers a deep learning classification methodology incorporating transfer learning, in combination with well-known CNN models, as a technique that can have a considerable impact on myocardial perfusion detection.

As a typical black-box AI-based method, deep learning lacks transparency and reasoning for its decisions, which is highly important in medical diagnosis. Since DL models are often criticized for their opaque internal decision-making process, explainable AI systems should come with causal models of the world that support explanation and understanding. Recent research efforts are therefore directed toward developing more interpretable models, focusing on the understandability of DL-based methods.

Future work is also oriented toward the acquisition of more scans from patients suffering from CAD, with a view to expanding the current research and validating the efficacy of the proposed architecture. Overall, however, the findings of this work are highly reassuring, particularly where computer-aided diagnosis is involved, establishing the proposed CNN-based models as a suitable tool for everyday clinical work.

**Author Contributions:** Conceptualization, N.P.; methodology, E.P. and N.P.; software, E.P.; validation, N.P., E.P.; formal analysis, N.P. and E.P.; investigation, E.P. and N.P.; resources, N.P.; data curation, N.P.; writing—original draft preparation, N.P.; writing—review and editing, E.P., N.P.; visualization, E.P.; supervision, N.P. and E.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This research work does not report human experimentation and did not involve human participants in an experimental procedure. All procedures in this study were in accordance with the Declaration of Helsinki.

**Informed Consent Statement:** This study was approved by the Board Committee Director of the Diagnostic Medical Center "Diagnostiko-Iatriki A.E." Vasilios Parafestas and the requirement to obtain informed consent was waived by the Director of the Diagnostic Center due to its retrospective nature.

**Data Availability Statement:** The datasets analyzed during the current study are available from the nuclear medicine physician on reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Deep Learning Based Airway Segmentation Using Key Point Prediction**

**Jinyoung Park 1,†, JaeJoon Hwang 2,3,†, Jihye Ryu 1, Inhye Nam 1, Sol-A Kim 1, Bong-Hae Cho 2,3, Sang-Hun Shin 1,3 and Jae-Yeol Lee 1,3,\***


**Abstract:** The purpose of this study was to investigate the accuracy of airway volume measurement by a regression neural network-based deep-learning model. A set of manually outlined airway data was used to build an algorithm for fully automatic segmentation via a deep learning process. Manual landmarks of the airway were determined by one examiner using the mid-sagittal plane of cone-beam computed tomography (CBCT) images of 315 patients. Clinical dataset-based training with data augmentation was conducted. Based on the annotated landmarks, the airway passage was measured and segmented. The accuracy of our model was confirmed by measuring the following between the examiner and the program: (1) the difference in the volumes of the nasopharynx, oropharynx, and hypopharynx, and (2) the Euclidean distance. For the agreement analysis, 61 samples were extracted and compared. The correlation test showed a range of good to excellent reliability. The difference between volumes was analyzed using regression analysis. The slope of the two measurements was close to 1 and showed a linear regression correlation (r2 = 0.975, slope = 1.02, *p* < 0.001). These results indicate that fully automatic segmentation of the airway is possible by training via deep learning of artificial intelligence. Additionally, a high correlation between manual data and deep learning data was estimated.

**Keywords:** airway volume analysis; deep learning; artificial intelligence

#### **1. Introduction**

Recently, artificial intelligence has been used in the medical field to predict risk factors through correlation analyses and genomic analyses, phenotype-genotype association studies, and automated medical image analysis [1]. Recent advances in machine learning, particularly deep learning, are contributing to research on identifying, classifying, and quantifying patterns in medical images. Since convolutional neural networks (CNNs), which are based on artificial neural networks, began to be used in medical image analysis, research on various diseases has been increasing rapidly [2,3]. The use of deep learning in the medical field helps diagnose and treat diseases by extracting and analyzing medical images, and its effectiveness has been proven [4].

However, studies related to deep learning in the areas of oral and maxillofacial surgery are limited [5]. For oral and maxillofacial surgery, radiology serves as an important evaluation criterion in the diagnosis of diseases, treatment planning, and follow-up after treatment. However, the evaluation process is performed manually, and the assessment can differ among examiners, or even for the same examiner.

**Citation:** Park, J.; Hwang, J.; Ryu, J.; Nam, I.; Kim, S.-A; Cho, B.-H.; Shin, S.-H.; Lee, J.-Y. Deep Learning Based Airway Segmentation Using Key Point Prediction. *Appl. Sci.* **2021**, *11*, 3501. https://doi.org/10.3390/ app11083501

Academic Editor: Mizuho Nishio

Received: 22 March 2021 Accepted: 12 April 2021 Published: 14 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

This may result in an inefficient and time-consuming procedure [6]. In particular, the airway is difficult to analyze due to its anatomical complexity and the limited gray-scale difference between soft tissue and air [7–9]. Airway analysis is essential for diagnosing and assessing the treatment progress of obstructive sleep apnea patients and for predicting the tendency of airway changes after orthognathic surgery [10–21].

In most previous studies, the airway was segmented semi-automatically using software systems for volumetric measurements using cone-beam computed tomography (CBCT) images [21–23]. These studies evaluated the reliability and reproducibility of the software systems on the measurement of the airway [7,24–27] and compared the accuracy between the various software systems [9,24,25,27]. However, in all cases, the software systems require manual processing by experts.

In this study, a regression neural network-based deep-learning model is proposed to enable fully automatic segmentation of the airway using CBCT. The differences between the manually measured data and the data measured by deep learning will be analyzed. Using a manually positioned data set, training and deep learning will be performed to determine the possibility of fully automatic segmentation of the airway and to introduce the method and its proposed future use.

#### **2. Materials and Methods**

#### *2.1. Sample Collection and Information*

Images from 315 patients who underwent CBCT for orthognathic surgery were collected retrospectively from 2017 to 2019. The CBCT data were acquired using PaX-i3D (Vatech Co., Hwaseong-si, Korea) at 105–114 kVp and 5.6–6.5 mA, with a 160 mm × 160 mm field of view and a voxel size of 0.3 mm. The scanning conditions were automatically determined by the machine according to the patient's age and gender. The CBCT images were converted to DICOM 3.0 and stored on a Windows-10-based graphic workstation (Intel Core i7-4770, 32 GB). The patients were all placed in a natural head position. All image processing was performed using the MATLAB 2020a (MathWorks, Natick, MA, USA) programming language.

#### *2.2. Coordinate Determination in the Mid-Sagittal Plane*

Five coordinates for each original image were obtained manually in the midsagittal plane of the CBCT images (Figure 1). The definitions of the points and planes for the airway division are presented in Table 1, referring to Lee et al. [28]. These five coordinates were predicted by a 2D convolutional neural network for airway segmentation in the sagittal direction.

**Figure 1.** Coordinate and plane determination in the midsagittal plane of the cone-beam computed tomography (CBCT) image.


**Table 1.** Definition of reference points and planes for airway division. (Abbreviations: PNS, posterior nasal spine; VP, posterior point of vomer; CV1, 1st cervical vertebra; CV2, 2nd cervical vertebra; CV4, 4th cervical vertebra).

#### *2.3. Airway Segmentation*

First, the image was binarized; it was then filled through a 3D close operation and hole filling, and the binarized image was subtracted from the filled image to obtain an airway image. Next, the image outside the area connecting the five reference points and the 1/4 and 3/4 points of the inferior border was erased. Only the largest object was kept to obtain the airway image (Figure 2).

**Figure 2.** Airway segmentation process. (**A**) Binarization image. (**B**) Hole filled image after close operation. (**C**) Difference image between (**A**,**B**). (**D**) An image that erases the area outside the area where 5 reference points, and 1/4 and 3/4 of the inferior border are connected. (**E**) Segmented airway.
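
For illustration, the morphological pipeline described above can be sketched with scikit-image and SciPy (the study itself used MATLAB). The Otsu binarization is taken from the Discussion (Section 4), whereas the structuring-element size is an assumption, and the cropping based on the five reference points is omitted for brevity.

```python
# Sketch of the described pipeline: Otsu binarization [33], 3D close
# operation and hole filling, subtraction of the binarized image from
# the filled image, and retention of the largest connected object.
# ball(3) is an assumed structuring-element size; the erasure step
# based on the five reference points is not shown.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_closing, ball
from scipy.ndimage import binary_fill_holes, label

def segment_airway(volume: np.ndarray) -> np.ndarray:
    binary = volume > threshold_otsu(volume)        # binarization
    filled = binary_fill_holes(binary_closing(binary, ball(3)))
    airway = filled & ~binary                       # filled minus binarized -> air
    labels, n_objects = label(airway)               # keep only the largest object
    if n_objects == 0:
        return np.zeros_like(airway)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                    # ignore the background label
    return labels == sizes.argmax()
```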

#### *2.4. Training via Regression Neural Network and Metrics for Accuracy Comparison*

The 315 midsagittal images obtained from the patients' cone-beam computed tomography (CBCT) data were split into training and test sets at a ratio of 4:1. During clinical dataset-based training, a separate hold-out validation set was not used because the sample size was too small; instead, five-fold cross-validation was applied. First, the image size was set to 200 × 200 pixels, and 16 convolution layers were stacked for feature extraction. To generate the regression model, the regression layer was connected to a fully connected layer. Mean squared error was used as the loss function. Data augmentation was then conducted, including rotation from −6° to +6°, uniform (isotropic) scaling by a factor of 0.5 to 1, Poisson noise addition, and contrast and brightness adjustment. An NVIDIA Titan RTX GPU with CUDA (version 10.1) acceleration was used for network training. The models were trained for 243 epochs using an Adam optimizer with an initial learning rate of 1e-4 and a mini-batch size of 8.
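
As an illustration of this training setup (the study itself used MATLAB), a minimal Keras analogue of a landmark-regression network with 16 convolution layers, an MSE loss, and the reported optimizer settings might look as follows; the block layout and filter counts are assumptions.

```python
# Illustrative Keras analogue of the landmark-regression network:
# 200x200 inputs, 16 convolution layers, and a fully connected
# regression head predicting the (x, y) positions of the 5 reference
# points (10 outputs). Loss, optimizer, learning rate, batch size, and
# epoch count follow the text; the rest is assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(200, 200, 1))
x = inputs
filters = 16
for _block in range(4):                       # 4 blocks x 4 conv = 16 conv layers
    for _ in range(4):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    filters *= 2
x = layers.Flatten()(x)
outputs = layers.Dense(10)(x)                 # 5 landmarks x (x, y), linear output
model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")                     # mean squared error, as reported
# model.fit(train_images, train_coords, batch_size=8, epochs=243)
```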

The prediction accuracy of the model was calculated using (a) the volume differences between the predicted and manually determined nasopharynx, oropharynx, and hypopharynx, and (b) the Euclidean distances between the predicted and the manually determined points.

#### **3. Results**

#### *3.1. Measurements of the Differences between Manual Analysis and Deep Learning Analysis*

The five coordinates manually pointed and predicted by the deep learning model are shown in Figure 3. The Euclidean distance between the predicted and manually determined points was largest at CV4 (4.156 ± 2.379 mm) and smallest at CV1 (2.571 ± 2.028 mm). The other Euclidean distances were 2.817 ± 1.806 mm at PNS, 2.837 ± 1.924 mm at Vp, and 2.896 ± 2.205 mm at CV2. When the volumes were compared for each part, the hypopharynx showed the largest difference (50.010 ± 57.891 mm3) and the oropharynx the smallest (37.987 ± 43.289 mm3). The difference in the nasopharyngeal area was 48.620 ± 49.468 mm3. The difference in total volume was measured as 137.256 ± 146.517 mm3. All measurements of the differences are shown in Table 2. Volume differences among parts of the airway are shown in Figure 4.

**Figure 3.** (**A**) Example of manually pointed data and its volume segmentation. (**B**) Example of deep learning pointed data and its volume segmentation.


**Table 2.** Measurements of the differences between manual analysis and deep learning analysis (N = 61).

**Figure 4.** Boxplots of the differences between manual analysis and deep learning analysis (N = 61). In the boxplots, 'x' within the box marks the mean of volume differences.

#### *3.2. Agreement Analysis*

In the agreement analysis, 61 samples were extracted, and the manually measured values and the values predicted by the deep learning network were compared for both volumes and coordinates. Among the volumes, the total volume showed the most correlated intra-class correlation coefficient (ICC) value, followed by the oropharynx (0.986), the hypopharynx (0.964), and the nasopharynx (0.912). Among the coordinates, the ICC value was highest for CV2(x) (0.963) and lowest for CV4(y) (0.868). All ICC values are presented in Table 3.


**Table 3.** Agreement analysis of the volume and point via intra-class correlation coefficient (ICC) (Two-way random effects, absolute agreement, single rater/measurement) (N = 61).
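
For illustration, the ICC setting stated in the caption of Table 3 (two-way random effects, absolute agreement, single rater/measurement) corresponds to ICC(2,1) and can be computed, for example, with the pingouin package; the data-frame layout below is a hypothetical reconstruction, not the authors' code.

```python
# Illustrative computation of ICC(2,1) -- two-way random effects,
# absolute agreement, single rater -- using pingouin. `manual` and
# `predicted` are hypothetical arrays holding the same measurement
# obtained by the two methods.
import numpy as np
import pandas as pd
import pingouin as pg

def icc_2_1(manual: np.ndarray, predicted: np.ndarray) -> float:
    n = len(manual)
    df = pd.DataFrame({
        "target": np.tile(np.arange(n), 2),               # sample index
        "rater": ["manual"] * n + ["deep_learning"] * n,  # measurement method
        "score": np.concatenate([manual, predicted]),
    })
    icc = pg.intraclass_corr(data=df, targets="target",
                             raters="rater", ratings="score")
    return float(icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0])
```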

#### *3.3. Linear Regression Scatter Plots and Bland-Altman Plot for the Total Volume Data Set*

The total volume measured by deep learning was compared with the manually measured volume using regression analysis (Figure 5). The slope of the two measurements was close to 1, showing a linear regression correlation (r<sup>2</sup> = 0.975, slope = 1.02, *p* < 0.001). Bland-Altman plots and analyses were used to compare the total volumes of the two methods, and the results are presented in Figure 6. The Bland-Altman plot comparing the level of agreement between the manual and deep learning measurements indicates an upper limit of agreement of 0.261 cm3 and a lower limit of agreement of −0.207 cm3. The range of the 95% confidence interval was 0.468 cm3.

**Figure 5.** Scatter plot of total volume: manual measurement versus deep learning (r2 = 0.975, slope = 1.02, *p* < 0.001). The line indicates the linear regression fit. There is a strong correlation between the two methods (N = 61).

**Figure 6.** Bland-Altman plot of the total volume data set. The green line indicates the upper limit of agreement, while the red line indicates the lower limit of agreement (N = 61).
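
As a minimal sketch of the agreement statistics reported above (regression fit plus Bland-Altman limits of agreement), assuming `manual` and `predicted` are hypothetical 1D arrays of total volumes (N = 61):

```python
# Minimal sketch: linear regression of deep-learning vs. manual total
# volumes and Bland-Altman 95% limits of agreement.
import numpy as np
from scipy import stats

def agreement_stats(manual: np.ndarray, predicted: np.ndarray) -> dict:
    slope, intercept, r, p, _ = stats.linregress(manual, predicted)
    diff = predicted - manual
    bias = diff.mean()
    half_range = 1.96 * diff.std(ddof=1)     # 95% limits of agreement
    return {"slope": slope, "r2": r ** 2, "p": p, "bias": bias,
            "upper_loa": bias + half_range, "lower_loa": bias - half_range}
```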

#### **4. Discussion**

In the medical field, many studies have used artificial intelligence via deep learning in radiology [29,30]. There are studies on fully automated airway segmentation of the lungs in volumetric computed tomographic images using a convolutional neural network (CNN) [31] and on automatic segmentation and 3D reconstruction of the inferior turbinate and maxillary sinus in otorhinolaryngology [32]. Due to the complex anatomical structure of the airway, researching the airway using manual measurements is difficult: it is a time-consuming process that entails inter-examiner error, intra-examiner error, and a lack of certainty because of the small gray-scale differences [23]. For these reasons, automated measurement and analysis are necessary, but fully automatic segmentation of the airway is challenging, and a study of airway segmentation using deep learning in the area of oral and maxillofacial surgery has not previously been reported.

Therefore, in this study, we performed fully automated segmentation of the airway using artificial intelligence to enable faster and more practical measurement and analysis in clinical practice. The correlations between the coordinates and volumes measured manually and by the deep learning network were evaluated and compared. The distance between the coordinates of each of the five airway reference points was between 2.5 mm and 4.1 mm, and the differences between the measured volumes were 48.620 mm3 in the nasopharynx, 37.987 mm3 in the oropharynx, and 50.010 mm3 in the hypopharynx. The difference in total volume was 137.256 mm3. Therefore, the correlation between each coordinate and volume is considered to show good to excellent reliability.

In this study, the threshold was defined by the Otsu method [33], the binarized image was extracted, and deep learning performed fully automatic division of the airway, separating it into the nasopharynx, oropharynx, and hypopharynx through the reference planes.

The difference between the total volumes in this study, 0.46 cm3, was evaluated as acceptable when compared to the study of Torres et al. [25], in which the difference between the water volume of an actual prototype and the volume on the CT software ranged from 0.2 cm3 to 1.0 cm3. The difference in the volume of the oropharynx was the smallest, in line with the results of El et al. [34]. According to Alsufyani et al. [23], since the oropharynx airway is a completely empty space like a tube, its volume is straightforward to measure. The more complex and narrow shape of the airway's soft tissue, due to anatomical structures such as the epiglottis, has the highest error in volumetric measurements [35]. Therefore, a simpler anatomical structure can be considered to result in a smaller difference between the measurement methods.

When comparing the distance of each point, the result of this study is not yet clinically applicable: a clinically acceptable difference between landmarks is approximately 2 mm, according to Lee et al. [36]. There are several possible sources of error, including the limited number of training data sets and the need for more precise data preparation, such as setting more reference points on each slice segmentation. In setting the reference points for precise training, the reference points were selected on bony parts to reduce the error due to the variety of soft tissue shapes. This allows clear determination of the anatomical point, aided by the large gray-scale difference, and a simpler comparison of the relationship before and after surgery. Hence, this study applied the reference points of the Lee et al. study [28]. Nevertheless, in the present study, the distance at CV4 had a larger error, which may be because the shape of CV4 appears in various ways in the sagittal plane compared to CV1 or CV2. It is necessary to set an additional reference point for defining the hypopharynx that appears more consistently in the midsagittal plane.

The limitation of most airway segmentation research is possibly due to an inconsistent patient head position [23,27,37]. Since patients underwent CBCT in the natural head position in this study, errors may occur; it has been reported that the shape of the airway can vary greatly depending on the angle of the head [38]. However, as concluded in most research, this is not a significant error when comparing airway volumes rather than evaluating the volume itself [25]. When performing CBCT, the patient's head position is consistently adjusted to a natural head position by the examiner using the head strap, chin support, and guide light. In addition, the natural head position has been proven to be reproducible [39]; hence, there should be no major error in comparisons. Errors may also occur in volumetric measurements due to breathing and tongue position [35,37]; therefore, controlled and consistent scanning is required for each variable. This study divided the airway volume using 5 points in the 2D mid-sagittal image, and the accuracy of these points affects the accuracy of airway segmentation. Therefore, larger datasets are needed to raise the accuracy of coordinate determination for clinical application of our algorithm.

In the agreement analysis, according to Koo et al. [40], "Based on the 95% confident interval of the ICC estimate, values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.90, and greater than 0.90 are indicative of poor, moderate, good, and excellent reliability, respectively." In the present study, oropharynx, hypopharynx, total volume, PNS(y), CV1(y), CV2(x), and CV4(x) indicated excellent reliability, and all other variables indicated good reliability based on the Koo et al. report [40].

These results indicate that fully automatic segmentation of the airway is possible through training via deep learning of artificial intelligence. In addition, a high correlation between manual data and deep learning data was estimated. To improve the accuracy, validity, and reliability of auto-segmentation, further data collection and optimal training on large datasets will be required before future clinical application. Furthermore, to raise the robustness of our algorithm, larger datasets are needed for accurate coordinate determination. Transfer learning with other datasets, such as facial coordinates, can also be useful. We plan to develop more robust algorithms with larger datasets.

#### **5. Conclusions**

In this study, using a manually positioned data set, fully automatic segmentation of the airway was achieved with artificial intelligence by training a deep learning algorithm, and a high correlation between manual data and deep learning data was estimated.

As the first study to utilize artificial intelligence to achieve fully automatic segmentation of the airway, this paper is meaningful in showing the possibility of a more accurate and quicker way of producing airway segmentations. For future clinical application, more robust algorithms built on larger and more diverse datasets are required.

**Author Contributions:** J.P. and J.H. carried out the analysis of the data and prepared the manuscript. J.R. and I.N. helped in the collection and analysis of the data. S.-A.K. helped with the visualization and analysis of the data in the revised manuscript. B.-H.C. and S.-H.S. conceived the study, participated in its design and coordination, and helped to draft the manuscript. J.-Y.L. designed the study and drafted the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI19C0824).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Pusan National Dental Hospital (PNUDH-2021-008).

**Informed Consent Statement:** Patient consent was waived because of the retrospective nature of the study and the analysis used anonymous clinical data.

**Data Availability Statement:** The data presented in this study are openly available in Github at: https://github.com/JaeJoonHwang/airway\_segmentation\_using\_key\_point\_prediction, accessed on 13 April 2021.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Characterization of Optical Coherence Tomography Images for Colon Lesion Differentiation under Deep Learning**

**Cristina L. Saratxaga 1,2,\*, Jorge Bote 3, Juan F. Ortega-Morán 3, Artzai Picón 1, Elena Terradillos 1, Nagore Arbide del Río 4, Nagore Andraka 5, Estibaliz Garrote 1,6 and Olga M. Conde 2,7,8**


#### **Featured Application: Automatic diagnosis of colon polyps on optical coherence tomography (OCT) images for the development of computer-aided diagnosis (CADx) applications.**

**Abstract:** (1) Background: Clinicians demand new tools for early diagnosis and improved detection of colon lesions that are vital for patient prognosis. Optical coherence tomography (OCT) allows microscopical inspection of tissue and might serve as an optical biopsy method that could lead to in-situ diagnosis and treatment decisions; (2) Methods: A database of murine (rat) healthy, hyperplastic and neoplastic colonic samples with more than 94,000 images was acquired. A methodology that includes a data augmentation processing strategy and a deep learning model for automatic classification (benign vs. malignant) of OCT images is presented and validated over this dataset. Comparative evaluation is performed both over individual B-scan images and C-scan volumes; (3) Results: A model was trained and evaluated with the proposed methodology using six different data splits to present statistically significant results. Considering this, 0.9695 (±0.0141) sensitivity and 0.8094 (±0.1524) specificity were obtained when diagnosis was performed over B-scan images. On the other hand, 0.9821 (±0.0197) sensitivity and 0.7865 (±0.205) specificity were achieved when diagnosis was made considering all the images in the whole C-scan volume; (4) Conclusions: The proposed methodology based on deep learning showed great potential for the automatic characterization of colon polyps and future development of the optical biopsy paradigm.

**Keywords:** colon cancer; colon polyps; OCT; deep learning; optical biopsy; animal rat models; CADx

#### **1. Introduction**

Colon cancer is the second most common cause of cancer death in Europe both for women and men, and the third most common cancer worldwide [1]. About 1.8 million new cases of colorectal cancer were recorded globally in 2018 [2], being the third most common cancer in men and second in women. The five-year survival rate is 90 percent for colorectal cancers diagnosed at an early stage, but unfortunately only 4 out of 10 cases are found this early [3].

**Citation:** Saratxaga, C.L.; Bote, J.; Ortega-Morán, J.F.; Picón, A.; Terradillos, E.; del Río, N.A.; Andraka, N.; Garrote, E.; Conde, O.M. Characterization of Optical Coherence Tomography Images for Colon Lesion Differentiation under Deep Learning. *Appl. Sci.* **2021**, *11*, 3119. https://doi.org/10.3390/ app11073119

Academic Editor: Mizuho Nishio

Received: 12 February 2021 Accepted: 25 March 2021 Published: 1 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Clinicians demand new non-invasive technologies for early diagnosis of colon polyps, especially to distinguish between benign and malignant or potentially malignant lesions that must be resected immediately. New methods should also provide information on safe resection margins and on inspection of the remaining tissue after resection, to decrease the possibility of tumor recurrence and improve patient prognosis. The current gold-standard imaging technique during patient examination is colonoscopy with narrow-band "red-flag" technology for improved lesion visualization. During the procedure, lesions can be classified with the Paris (morphology) [4] and NICE (vessel and surface) [5] classification patterns based on the physician's experience. As this superficial information is not enough, the final diagnosis of the lesion is determined by the histopathological analysis after biopsy, meaning that all suspicious polyps are resected. Bleeding-related problems usually occur after biopsies are performed, with the risks that this entails for the patient. In fact, most problems occur when the biopsy is performed on a blood vessel, and the incidence is higher in patients with abnormal blood coagulation function [6]. In relation to the latter, the rate of perforation associated with colonoscopies with polypectomy is 0.8/1000 (95% confidence interval (CI) 0.6–1.0) and the rate of bleeding related to polypectomies is 9.8/1000 (95% CI 7.7–12.1) [7]. However, it has been demonstrated that hyperplastic polyps are of a benign nature and can be left untouched, avoiding the underlying bleeding risk of resection and saving diagnosis time, costs, and patient trauma during that period [8]. On the other hand, pre-malignant lesions and adenomatous polyps cannot be distinguished from neoplastic lesions such as adenocarcinoma with the current diagnosis methods. In this sense, new imaging techniques and interpretation methods could allow real-time diagnosis and facilitate better in-situ treatment of lesions, improving patient prognosis, especially if the diagnosis is made at early stages of the disease.

In recent years, different advanced imaging technologies that allow sub-surface microscopical inspection of tissue in an "optical-biopsy" manner have been under study for colonic polyps [9], such as reflectance confocal microscopy (RCM) [10], multi-photon microscopy (MPM) [11], and optical coherence tomography (OCT) [12], among others. Of the mentioned techniques, a device called Cellvizio, based on confocal laser endomicroscopy (CLE), is the only one commercially available. Using confocal mini-probes inserted into the working channel of flexible endoscopes, the system is used to study the cellular and vascular microarchitecture of tissue. Diagnosis of colorectal lesions [13–15] is one of the targeted applications, and the corresponding probe reports a field-of-view (FOV) of 240 μm, 1 μm resolution, 55 to 65 μm confocal depth, and a maximum of 20 uses. The drawback of this system is that successful usage by clinicians depends on specific training in image interpretation. Moreover, its main limitation is that the technology requires an exogenous fluorophore, which results in a more invasive procedure for the patient. In the case of MPM [16,17], which relies on absorption by an external or endogenous (such as collagen) tissue fluorophore, high-resolution images at the sub-cellular level can also be obtained to study structural, and also functional, information. The mentioned ex vivo studies using this technology have revealed significant morphological differences between healthy and cancerous tissue. However, the interpretation of MPM images by clinicians also remains a challenge and relies on their expertise in histopathology.

In contrast, OCT provides sub-surface structural information of the lesion under a label-free approach, with reported resolutions less than 10 μm and penetration capacities up to 2 mm. OCT can be used in combination with MPM, as both technologies provide complementary information useful for diagnosis assessment. While RCM and MPM 2D images are obtained horizontally in the transversal plane (also called "en-face"), OCT 2D images (B-scan) are obtained axially in depth in the coronal or sagittal plane. Furthermore, since OCT also allows obtaining 3D images (C-scan), lesions can be studied volumetrically from different points or axes of visualization. Although OCT images have less resolution than RCM and MPM images, the penetration capacity is higher, and the acquisition time is generally lower. This OCT aspect is of great importance to evaluate lesion margins and tumor infiltration into the mucosa under real-time situations in clinical environments.

OCT technology capabilities in the diagnosis of colon polyps have been investigated in recent years with promising results regarding future adoption in clinical practice. Several studies [18–21], in both murine and human models, have reported the identification of tissue layers and the capacity of the technology to discriminate between different types of benign (including healthy) and malignant tissue. When analyzing 44 polyps from 24 patients [18], endoscopists detected fewer subsurface structures and a lower degree of light scattering in adenomas, and found that, in comparison, hyperplastic polyps were closer in structure and light scattering to healthy mucosa. The scattering property was calculated by a computer program applying statistical analysis (Fisher–Freeman–Halton test and Spearman rank correlation test), confirming the previous observation. A comparison of OCT images with respect to histopathological images was performed in [19] using previously defined criteria for OCT image interpretation in the identification of tissue layers. According to these observations, hyperplastic polyps are characterized by a three-layer structure (with mucosa thickening), whereas adenomas are characterized by the lack of layers. Under these assumptions, measured over a group of 116 polyps from patients, lesions could be visually differentiated in OCT images with 0.92 sensitivity and 0.84 specificity. Later, a fluorescence-guided study performed on 21 mice [20] after administering a contrast agent showed the ability of OCT to differentiate healthy mucosa, early dysplasia, and adenocarcinoma. Visual analysis of normal tissue revealed that the submucosa layer is very thin in some specimens and not always well appreciated in the OCT images, although the tissue boundaries remain distinguishable. In adenomatous polyps, a thickening of the mucosa (in the first stages) or a disappearance of the boundary between layers is detected, whereas in the case of adenocarcinoma, the OCT images showed a loss of tissue texture, an absence of layers, and the presence of dark spots caused by the high absorption in necrotic areas. The most recent study [21] goes further, proposing a diagnosis criterion over micro-OCT images with some similarities to the Kudo pit pattern [22] and demonstrating the diagnostic capacity of the OCT technology, with clinicians reaching 0.9688 sensitivity and 0.9231 specificity in the identification of adenomas over 58 polyps from patients.

Both the cross sectional and the en-face images have been shown to provide clinically relevant information in the mentioned studies, and the combination of both views for the detailed study of tissue features suggests an important advance [23–25]. In addition to previous studies, the calculation of the angular spectrum of the scattering coefficient map has also revealed quantifiable variances on the different tissue types [26].

The clinical characteristics of the lesions that can be observed on the OCT images can be further exploited by image-based analysis. Image and signal processing methods can allow dealing with the noisy nature of the signal, whereas machine learning algorithms are able to exploit the spatial correlation of the biological structures to make the most of them. These types of algorithms can detect, and quantify, subtle variations on images that the naked human eye cannot and can be applied with the goal of performing automatic interpretation of the images for image enhancement, lesion delimitation, or classification tasks. However, as seen in previously reviewed studies, few attempts of applying these methods for colon polyps on OCT images have been found, showing that there are opportunities of research in the area.

The main limitation of traditional machine learning methods is the need to transform the original data from their natural form into another representation appropriate for the targeted problem. Image processing methods must be carefully applied to extract the most representative features of the data, aiming to resemble how experts analyze the images. The extracted features are then passed as input to the selected classifier. Unlike deep learning approaches, traditional machine learning methods require tailored feature extraction followed by a shallow machine learning method, which makes them less prone to generalization and leads to lower discriminative power [27]. Under the deep learning paradigm, image feature extraction and classification are performed simultaneously through a network architecture representing all possible solution domains, optimized by means of a loss function minimization that seamlessly drives the network parameters towards a suitable solution. Convolutional neural networks (CNN) [28,29] have surpassed classical machine learning methods [30,31], and even medical expert capabilities [32–34]. They have also been successfully applied to colon cancer histopathological classification [35,36], MPM classification [37], polyp detection on colonoscopy [38–40], and histological colon tissue staining [41].

The application of deep learning methods to OCT medical images is a recent trend, and only a few examples of application are available. Ophthalmology being the oldest context of application of OCT, most examples are found in this area, and some others in cardiology and breast cancer [42–45]. In gastroenterology of the lower tract (colon), only one recent work has been identified [46]. A pattern recognition network called RetinaNet [47] was trained to distinguish normal from neoplastic tissue with 1.0 sensitivity and 0.997 specificity. The success of the model is based on a dentate structural pattern, identified in normal tissue in previous studies, used as a structural marker on the images during training and evaluation. In this sense, the B-scan images in the dataset (26,000 images acquired from 20 tumor areas) were manually inspected to identify "teeth" samples representing normal colonic mucosa and "noisy" samples representing malignant tissue. On evaluation, the network provides a list of boxes where these patterns are found, along with their probabilities, and average scores are calculated over a sequence of N adjacent B-scan images. The drawback of this approach resides in the identification of the "teeth" pattern in normal tissue only; no other patterns have been identified for malignant tissue, which is simply assumed to lack the "teeth" pattern.

The work presented in this paper further investigates the application of deep learning methods on a collected database of more than 94,000 OCT images of murine (rat) colon polyps, to study the discrimination capacity of this imaging technique for its future adoption as a real-time optical biopsy method. The aim of this proposal is to contribute to setting the basis for the automatic analysis of images with the latest state-of-the-art techniques, which could lead to the development of new computer-aided diagnosis (CADx) applications. Once image analysis methods demonstrate this capacity, colon polyp diagnosis with OCT can be progressively mastered by clinicians and the adoption of the technology naturally accomplished. With this aim, this work implements a classification (benign vs. malignant) approach based on an Xception deep learning model that is trained and tested on a large dataset of OCT images from murine (rat) samples collected for this purpose. We propose a pre-processing method for data augmentation and validate the application of deep learning methods for colon polyp classification as benign or malignant. In addition, to further investigate the diagnosis capacity of the proposed approach, evaluation is performed twice, over individual B-scan images and over C-scan volumes, for comparison. Finally, a strategy to maximize results when evaluating individual B-scans is applied.

In comparison with previous studies [46], this work proposes a general diagnosis strategy based on classification instead of pattern recognition, which avoids time-consuming manual annotation of the database by providing automatic identification of the characteristics representing the polyp tissue type. The classification strategy can generalize better to new polyp categories than the segmentation strategy, whose performance is biased by the available annotations of the database. A classification strategy can help in identifying subtle characteristics present in noisy OCT images that are not easily distinguished by the naked eye, and, with proper visualization, can help clinicians to better understand the OCT imaging technique. In the future, the combination of both approaches could be considered to maximize automatic diagnosis results.

#### **2. Materials and Methods**

#### *2.1. Animal Models*

Sixty animals with colorectal cancer (CRC) from the strain PIRC (polyposis in the rat colon) rat F344/NTac-Apcam1137 model (sex ratio: 50/50) from the Rat Resource and Research Centre (RRRC) were used for the extraction of neoplastic colonic samples. This animal model was used in the study for the following main reasons: (a) it is an excellent model for studying human familial colon cancer; (b) the ENU (N-ethyl-N-nitrosourea)-induced point mutation results in a truncating mutation in the APC (adenomatous polyposis coli) gene at a site corresponding to the human mutation hotspot region of the gene; (c) heterozygotes develop multiple tumors in the small intestine and colon by 2–4 months of age; (d) PIRC tumors closely resemble those in humans in terms of histopathology and morphology as well as distribution between intestine and colon; (e) it provides a longer lifespan (10–15 months) compared to related mouse models; and (f) tumors may be visualized by CT (computerized tomography), endoscopy, or dissection. Moreover, the absolute incidence and multiplicity of colonic tumors are higher in F344-PIRC rats than in carcinogen-treated wild-type F344 rats, or in mice [48,49].

Additionally, thirty rats from the strain Fischer344—F344 wildtype model (sex ratio: 50/50) were used for the development and extraction of hyperplastic colonic samples. A rat surgical model of hyperplasia in the colon was developed de novo for endoscopic applications. It recreates important features of human hyperplasia, such as the generation of new cells in the colonic mucosa and tissue growth, as well as the corresponding angiogenesis. It consists of an extracolonic suture on which lesions are inflicted with biopsy extraction forceps over a period established in weekly follow-ups for the correct induction of the model [50,51].

Finally, as a control group, ten healthy tissue samples from three specimens were extracted from the colon of rats from the strain Fischer344—F344 wildtype model (sex ratio: 50/50). Uninvolved areas of the hyperplasia animals (ascending colon, transverse colon, and regions of the descending colon without lesions) were used as healthy tissue samples. This meets one of the three Rs of animal research (reduction), which aims to maximize the information obtained per animal, making it possible to limit or avoid the further use of other animals without compromising animal welfare.

#### *2.2. Equipment*

The equipment used for imaging the murine (rat) samples was a CALLISTO from Thorlabs (CAL110C1) [52], a spectral-domain system with a central wavelength of 930 nm, a field of view of 6 × 6 mm2, 7 μm axial resolution, 4 μm lateral resolution, 1.7 mm measurement depth, 107 dB sensitivity at 1.2 kHz measurement speed, and 7.5 mm working distance. Samples were scanned using the high-resolution scan lens (18 mm focal length) and a standard probe head with a rigid scanner for a stable and easy-to-operate setup.

#### *2.3. Acquisition Procedure*

#### 2.3.1. Sample Acquisition Procedure

Rats were acclimatized before surgery in individually housed cages at 22–25 °C with food and water ad libitum. All surgical procedures were performed under general inhalation anesthesia [53–55] by placing the animals in an induction chamber to administer sevoflurane 6–8% in oxygen with a high flow of fresh gas (1 L/min). They were then connected to a face mask to continue the administration of sevoflurane (3–3.5%) in oxygen (300 mL/min) and placed in dorsal decubitus to carry out the endoscopic procedure. Atropine (0.05 mg/kg), meloxicam (1 mg/kg/24 h), and pethidine (10–20 mg/kg) were injected subcutaneously before beginning the surgical procedure. A thermal blanket was used throughout the procedure. Once the animals had reached the appropriate surgical plane, a colonoscopy was performed to rule out the presence of abnormalities that could interfere with the study. The aim was to locate all lesions that could be found through observation using white light and a rigid cystourethroscope of 2.9 mm in diameter, which reached a diameter of up to 5 mm when working with an intermediate sheath and an external sheath (a size appropriate for this animal model), with the objective of not damaging said structures at the start of the procedure. After shaving the abdomen and preparing the area with povidone-iodine and 70% ethanol, the animals were covered with an open sterile cloth. Then, a laparotomy of, on average, 4–5 cm in length was performed. A retraction device with hooks (Lonestar®) was used as a support tool to make this section circular and externalize all the necessary intestinal content outside the abdomen. The animals were kept at a constant temperature thanks to successive peritoneal washes with tempered serum. Then, the block of the colon was fixated with a suture to prevent the reversion of the content throughout the colon and cecum. Three areas (ascending colon, transverse colon, and descending colon) were studied consecutively, taking advantage of the anatomical division of the colon. They were divided with the help of ligatures (silk 4/0) through the mesentery of each portion and scanned in the proximal-to-distal direction using the rigid cystoscope to check the number of polyps.

At each point with lesions, a disposable bulldog clamp was used to mark the distribution of the lesions, thus avoiding cutting the lesions in the subsequent colotomy of the ascending and transverse portions. After that, the colon was extracted en bloc, and the animals were then euthanized under general inhalation anesthesia by rapid intracardiac injection of potassium chloride (KCl) (2 mEq/kg, KCl 2 M), according to the ethical committee recommendations. The colon was opened by a longitudinal colotomy with the help of scissors to eliminate the tube shape of the colon, thus exposing the mucosa with the localized polyps to improve their visualization, handling, and analysis. At this stage, magnification was provided by a STORZ VITOM® HD for better location of the lesions on the extended organ.

For each localized lesion, a sample was extracted for later ex vivo analysis with the OCT equipment. Instead of acquiring the images directly on the fresh sample after resection, samples were fixed and then preserved for several further analyses while maintaining the properties of the tissue. Based on [56], the fixation procedure for each sample consisted of immersing the sample in 4% formaldehyde for at least 14 h at about 4 °C. Then, after two 30-min washes with 0.01 M phosphate-buffered saline (PBS), the sample was submerged in PBS with 0.1% sodium azide and stored refrigerated at 4 °C. This method was established to provide safer handling of samples, avoiding the adverse effects of manipulating formaldehyde-embedded samples in a surgical environment. Additionally, it was checked with histopathological analysis that this fixation procedure did not alter the properties of the tissue, showing no noticeable differences from fresh tissue.

#### 2.3.2. Image Acquisition Protocol

First, each sample was placed on a plate, secured, and fixed for correct exposure of the tissue. Once the sample was placed on the platform under the OCT probe, a B-scan was acquired for further calibration of the equipment. While scanning, the sample was focused by approaching the OCT probe; the super-fine focus allows a high-quality OCT signal to be acquired with the best penetration depth. Due to the anatomical differences of the samples, this step always had to be repeated for each new sample. Once the sample was properly focused and the 2D signal quality optimized, the next step was the acquisition of a C-scan of the sample. Here, the software allowed drawing a rectangle (Figure 1) indicating where to perform the 3D acquisition on the sample. When appropriate, several 3D scans covering different parts of the same lesion were recorded.

**Figure 1.** Pre-view of tissue/lesions with C-scan scanning area selected in red. (**A**): healthy sample; (**B**): neoplastic polyp 1; (**C**): neoplastic polyp 2.

#### 2.3.3. Dataset Summary

The database consists of healthy, hyperplastic, and neoplastic (adenomatous and adenocarcinoma) samples. Following the previously described acquisition procedure, the subsequent number of cases were included in the database for each tissue type: 10 healthy samples with 48 C-scans, 13 hyperplastic samples with 53 C-scans, and 75 neoplastic samples with 245 C-scans. As a result, the database contains a total of 94,687 B-scan images.

The database was visually inspected before training the model, discarding all C-scans or B-scan images acquired with errors, large aberrations, or artifacts, to ensure the quality of the data. Note that this database is a preliminary version of an ongoing larger dataset that will be made openly available. Access to the database used in this article is possible upon request to the corresponding author.

#### *2.4. Ethical Considerations*

Ethical approvals for the acquisition of murine (rat) samples were obtained from the relevant ethics committees. The research with animals was approved by the Ethical Committee of animal experimentation of the Jesús Usón Minimally Invasive Surgery Centre (Number: ES 100370001499) and was in accordance with the welfare standards of the regional government, which are based on European regulations.

#### *2.5. Deep Learning Architecture*

The proposed architecture was based on the Xception classification model [57] previously trained over the ImageNet dataset [58]. Then, a global average pooling layer and a final layer with 2 neurons and softmax activation were added, representing the classification classes: benign vs. malignant. A schematic view of the architecture, generated with a visual grammar tool [59], is provided in Figure 2.

**Figure 2.** Schematic diagram of deep learning architecture based on the Xception model.

This pre-trained network accepts images of size 299 × 299 pixels, which are randomly sampled from the original OCT images as detailed in the next section, "Data Preparation and Augmentation". The OCT images in the database (B-scan images) have variable lateral sizes in the range of 512–2000 pixels due to differences in the sizes of the polyps and the selected scanning area. For this reason, the B-scan images were pre-processed to extract regions of interest of smaller size (299 × 299 pixels) to make the most of the images and avoid losing lesion structural features from the bigger images, which would happen with image rescaling. Directly rescaling the whole image would be comparable to reducing the lateral and axial resolution of the images, and hence losing information about the smaller structures. The proposed data preparation approach also serves as a data augmentation strategy. Moreover, a strategy for dealing with data imbalance in the dataset was also adopted.

#### 2.5.1. Data Preparation and Augmentation

As a data augmentation strategy, during the training process, the algorithm processes the dataset images in the following manner: image pre-processing; air-tissue delimitation; random selection of region of interest (ROI); ROI extraction; and ROI preparation. These steps are illustrated in Figure 3.

**Figure 3.** Proposed image data preparation methodology. 1. Image pre-processing, 2. air-tissue delimitation, 3. random selection of region of interest (ROI), 4. ROI extraction, and 5. ROI preparation.

#### 1. Image pre-processing

The original grayscale OCT image contains a single channel, which is duplicated to generate the 3-channel image expected by the network when using the ImageNet pre-trained weights. As an additional data augmentation strategy, the image is randomly flipped horizontally to produce alternative input images. No additional geometric transformations are applied, as these would alter the structural features of the lesion and lead to misclassification.

2. Air-tissue delimitation

The aim of this step is to automatically detect in the image the boundary between the air and the tissue. The final goal of this operation is to obtain ROI images adjusted to the tissue, so that the noise present in the air region and the varying distances from the scanning tip to the tissue across the database images do not provide ambiguous information to the network. Conversely, the shape of the lesion is preserved and flattening is discarded, as the shape could be a clinically interesting feature for differentiating the lesion's diagnostic nature.

This step was implemented with the following sub-steps: automatic calculation of the Otsu threshold [60] to differentiate between the air and tissue regions; generation of a binary mask by applying the calculated Otsu threshold to the image; a morphological operation to remove small objects from the binary mask; then, for each column in the mask image, extraction of the location (row) of the first positive (true) value, if available, to obtain a 1D array containing the delimitation path; and application of a median filter (kernel size = 69) to the delimitation array to eliminate or smooth possible noise in the signal.
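
For illustration, these sub-steps can be sketched with scikit-image and SciPy for a single grayscale B-scan; the `min_size` value for small-object removal is an assumption, while the Otsu threshold and the kernel size of 69 come from the text.

```python
# Sketch of the air-tissue delimitation sub-steps: Otsu threshold,
# binary mask, small-object removal, first-true-row extraction per
# column, and median filtering (kernel size = 69).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects
from scipy.signal import medfilt

def air_tissue_boundary(bscan: np.ndarray) -> np.ndarray:
    mask = bscan > threshold_otsu(bscan)               # air vs. tissue mask
    mask = remove_small_objects(mask, min_size=64)     # drop small noise blobs
    # row index of the first positive value per column (image height if none)
    boundary = np.where(mask.any(axis=0), mask.argmax(axis=0), bscan.shape[0])
    return medfilt(boundary.astype(float), kernel_size=69)
```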

#### 3. Random selection of region of interest

Considering that the total width of the input image (number of A-scans) is highly variable across the dataset due to sample size and scanning conditions, a random number indicating where to start the region of interest is drawn. A preliminary sub-image (column block) is obtained considering a width of 512 px for the region of interest.

#### 4. ROI extraction

The values of the delimitation array are applied to the previously extracted sub-image to align the tissue to the top, generating an ROI of 512 px width and 224 px depth, which is equivalent to approximately 0.71 mm in width and 0.75 mm in depth considering the optics of the device. Preliminary experiments with smaller widths or larger depths yielded worse results: smaller ROIs reduce the retained information, worsening feature extraction and classification performance, so a trade-off between both aspects is needed.

#### 5. ROI preparation (post-processing)

The extracted ROIs are resized to 299 px width and 299 px depth to match the default input size of the network (pre-trained with ImageNet).
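
Steps 3–5 can be combined, for illustration, into a single helper; the exact cropping logic is an assumption, and `air_tissue_boundary` refers to the hypothetical helper sketched above.

```python
# Sketch of steps 3-5: draw a random lateral start, crop a 512 px wide
# by 224 px deep ROI aligned to the air-tissue boundary, optionally
# flip it, and resize to the 299x299 network input with the single
# channel replicated to three.
import numpy as np
from skimage.transform import resize

def random_roi(bscan: np.ndarray, rng: np.random.Generator,
               width: int = 512, depth: int = 224) -> np.ndarray:
    boundary = air_tissue_boundary(bscan).astype(int)
    x0 = rng.integers(0, bscan.shape[1] - width + 1)    # random lateral start
    cols = []
    for x in range(x0, x0 + width):                     # shift each column so the
        top = min(boundary[x], bscan.shape[0] - depth)  # tissue starts at the top
        cols.append(bscan[top:top + depth, x])
    roi = np.stack(cols, axis=1)
    if rng.random() < 0.5:                              # random horizontal flip
        roi = roi[:, ::-1]
    roi = resize(roi, (299, 299), preserve_range=True)  # match network input size
    return np.repeat(roi[..., None], 3, axis=-1)        # 1 channel -> 3 channels
```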

#### 2.5.2. Data Imbalance Management

This work aims at differentiating benign samples, including healthy tissue and hyperplastic polyps, from malignant/neoplastic samples, including adenomatous and adenocarcinomatous samples. Unfortunately, in our dataset, healthy and hyperplastic samples are under-represented with respect to neoplastic samples. Data imbalance is a common problem, and for the moment there is no single best strategy for dealing with it, as the choice mostly depends on the problem to solve and on the data characteristics. In this work, a resampling strategy was implemented. This strategy was preferred over weight-balance compensation, where per-class weights are calculated and specified when fitting the network, as in the authors' experience it provides better results.

Resampling is a classical strategy for dealing with data imbalance. Over-sampling means adding more samples to the minority class, whereas under-sampling means removing samples from the majority class. Over-sampling and under-sampling can be achieved following different strategies, with the weaknesses that these may imply. The simplest way is to randomly duplicate or remove samples.

In this work, we implemented an over-sampling strategy by adding new samples to the minority class. However, these new samples were not exact copies of the original data, as small variations were introduced to create a diverse set of samples. As described in the previous section, dataset images were manipulated to randomly obtain ROIs (see Figure 3), which in addition were randomly flipped horizontally, introducing this variability into the training and validation sets.
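
A minimal sketch of such an over-sampling scheme, under the assumption that class balance is restored at the index level and the random ROI extraction then turns duplicated indices into distinct crops:

```python
# Repeat minority-class B-scan indices with replacement until classes
# are balanced; duplicated indices yield distinct crops downstream
# because ROIs are sampled randomly.
import numpy as np

def oversample_indices(labels: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    balanced = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        if count < target:                      # duplicate minority-class indices
            idx = np.concatenate([idx, rng.choice(idx, target - count)])
        balanced.append(idx)
    return rng.permutation(np.concatenate(balanced))
```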

#### 2.5.3. Training Process

The implemented network was based on an Xception model [57], with a global average pooling layer followed by a dense layer (with two outputs and softmax activation) added at the end to deal with the 2-class problem (benign vs. malignant). Pre-trained ImageNet weights were used [58].

Categorical cross-entropy loss was minimized by an Adam optimizer with a learning rate of 0.0001 during the training process. The batch size was 24, training ran for up to 100 epochs, and the validation loss was monitored for early stopping (with patience 20). The training process was repeated 6 times over different data splits to make sure that the reported results were not biased.
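A minimal Keras sketch of the described setup (Xception backbone with ImageNet weights, global average pooling, 2-way softmax head, Adam at 1e-4, early stopping with patience 20); the dataset variables in the commented call are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.Xception(weights="imagenet",
                                      include_top=False,
                                      input_shape=(299, 299, 3))
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),   # benign vs. malignant
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=20,
                                              restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds,
#           batch_size=24, epochs=100, callbacks=[early_stop])
```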

#### 2.5.4. Data Evaluation and Test-Time Augmentation

As described before, OCT C-scans were acquired from murine (rat) polyp samples and adjacent healthy tissue. The C-scans are 3D volumes that consist of consecutive, adjacent B-scan images. For some of the polyps, several C-scans covering different parts of the lesion (upper, center, and bottom) were obtained and included in the same data split. As one of the aims of this work was to study the diagnostic capacity and limitations of OCT in more detail, the evaluation of the model was designed to compare the discrimination capacity of individual B-scan classification with that of C-scan classification.

A test-time augmentation (TTA) strategy was applied to both B-scan and C-scan evaluation. This was implemented by performing 10 augmentations over the data following the random ROI extraction strategy previously described (see Figure 3) and then calculating the mean prediction. By applying this strategy, we estimated a richer posterior probability distribution of the prediction for the larger (wider) B-scans. We present a comparison of the results without TTA (called standard) and with TTA to study how this technique contributed to the proposed approach.
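The TTA evaluation can be sketched as follows, assuming the ROI helpers above; the `preprocess` step (channel replication and resizing to the network input) is an illustrative assumption.

```python
import numpy as np
import tensorflow as tf

def preprocess(roi):
    """Placeholder: replicate channels and resize to the 299 x 299 x 3 input."""
    x = np.stack([roi] * 3, axis=-1).astype("float32")
    return tf.image.resize(x, (299, 299)).numpy() / 255.0

def predict_tta(model, bscan, n_aug=10):
    """Mean prediction over 10 randomly extracted ROIs of one B-scan."""
    path = delimitation_path(bscan)
    rois = np.stack([preprocess(extract_roi(bscan, path))
                     for _ in range(n_aug)])
    return model.predict(rois).mean(axis=0)   # mean class probabilities
```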

#### **3. Results**

#### *3.1. OCT and H&E Histology Comparative Analysis*

Before performing the analysis, it was important to consider that some anatomical differences exist between the human and murine colon structure. According to [61], in both species the colon maintains the same mural structure as the rest of the gastrointestinal tract: mucosa, submucosa, inner circular and outer longitudinal tunica muscularis, and serosa. The mucosa and submucosa layers in mice are relatively thin in comparison with the human ones. Furthermore, human mucosa has transverse folds through the entire colon, whereas in mice the folding varies along the colon: at the cecum and proximal colon the mucosa has transverse folds, in the mid colon it is flat, and in the distal colon it has longitudinal folds. However, in both species the mucosa is composed of tubular glands. Taking this into account, and considering that the database used in this work consists of murine (rat) samples, it was expected that the model would also learn these anatomical differences present in the mucosa, especially for the healthy samples. A detailed comparison of the anatomical differences (extracted from reference [61]) is provided in Table A1.

According to previous studies analyzing features on OCT images [18–21], in normal tissue, well-defined layers can be visualized with uniform intensity. In the presence of hyperplasia, thickening of the mucosa layer occurs, but the intensity is similar to that of healthy tissue and the tissue layers are still visible. In the case of adenomatous polyps, however, both thickening of the mucosa and reduced intensity are observed. Finally, adenocarcinomatous lesions show blurred boundaries and non-uniform intensity. In the presence of large polyps, the disappearance of the boundaries should be clearly observed, independently of the lesion nature.

Visual inspection of the dataset images was performed to look for the features mentioned above. Figures 4 and 5 provide a detailed analysis of the visible features on the OCT images (of the Figure 1 samples) with respect to the histopathological hematoxylin–eosin (H&E) images annotated by a pathologist (scanned at 5×). Regions of interest (with the same FOV as the OCT images, in mm) were extracted from the H&E slide images and later rescaled to fit the axial and lateral resolution of the OCT images for better comparison. In these figures, it can be observed that the main features present in the H&E images can also be observed in the OCT images. On the one hand, Figure 4, representing healthy tissue, illustrates (as indicated by arrows and manual segmentation lines on the B-scans on the left, Figure 4A,B) that the mucosa layers can be observed very clearly, confirming what has been reported in previous studies. Muscularis mucosae and submucosa layers are also observed, although clear differentiation in all parts of the image is more difficult. On the other hand, when analyzing Figure 5, which contains neoplastic lesions, it is also possible to confirm that the boundaries of the layers have totally disappeared, making it impossible to find any difference among them. Differences in the noise pattern are also observed. In addition, as indicated using circles and arrows on the B-scans (Figure 5A,B), new underlying structures appeared in the mucosa and can be identified as bright spots or dark areas in the images. These new structures (in comparison with healthy tissue) are also clearly observed in the corresponding annotated histopathology images (Figure 5C,D), where cystic crypts (CC) have been identified by the pathologist and appear as dark spots in the B-scan, and tumoral gland (TG) clusters as bright spots.

**Figure 4.** Comparison of features identified in optical coherence tomography (OCT) images (**A**,**B**) with respect to pathologists' annotations on H&E images (**C**,**D**) on healthy sample (Figure 1A). MU: mucosa, MM: muscularis mucosae, SM: submucosa, ME: muscularis externa.

**Figure 5.** Comparison of features identified in OCT images (**A**,**B**) with respect to pathologists' annotations on H&E images (**C**,**D**) on neoplastic samples (Figure 1B,C). CC: cystic crypt, TG: tumoral glands.

#### *3.2. Dataset Partitioning and Testing*

The dataset was split such that 80% was dedicated to training, 10% to validation, and 10% to testing. It was ensured that images coming from the same lesion (both B-scans and C-scans) were included in only one of the sets. The animal models employed in the creation of the database were genetically modified replicas of a single specimen; hence, no per-specimen separation was necessary in the split, and lesions could be considered independent.

The model was tested on 6 different folds to ensure that the reported evaluation metrics were not biased by one random dataset split. A random-state seed parameter was established for each fold to obtain different training, validation, and testing sets each time.

#### *3.3. Performance Metrics and Evaluation*

Given that both B-scan and C-scan data were available for the murine (rat) samples in the database, the clinical discrimination capability of the model in differentiating benign from malignant polyps was calculated for both types of data. To evaluate each C-scan, the mean of the individual predictions for the B-scan images that form the volume was calculated, as sketched below. The performance of the model was measured using the conditions provided by the confusion matrix (see Table 1).
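The C-scan evaluation reduces to averaging the per-B-scan predictions over the volume; a one-function sketch, assuming the `predict_tta` helper from the earlier sketch:

```python
import numpy as np

def predict_cscan(model, bscans):
    """Volume-level prediction: mean of the per-B-scan predictions."""
    return np.mean([predict_tta(model, b) for b in bscans], axis=0)
```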

**Table 1.** Confusion matrix conditions for metrics calculation.


In the clinical context being analyzed in this work, these conditions can be seen as:


The metrics that were employed to measure the model performance based on the previous conditions are described below.


The desired value for these metrics is as close as possible to 1, with 1 meaning a perfect test. Additionally, as accuracy (a measure of the number of samples that were correctly classified in the expected class) is a misleading metric on imbalanced datasets, the balanced accuracy was calculated. This metric normalizes the true positive and true negative predictions by the number of positive and negative samples, respectively, and then divides the sum by two, providing an accuracy value as if the class frequencies were equal.

• Balanced accuracy (BAC). Measures the number of samples that were correctly classified in the expected class, considering class frequencies (number of correct assessments over all assessments, adjusted for class frequencies). BAC = (TPR + TNR)/2 = (Sensitivity + Specificity)/2. A sketch of these metric computations follows.
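The metrics follow directly from the confusion-matrix conditions; a minimal sketch (the function name is illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)            # true positive rate (TPR)
    specificity = tn / (tn + fp)            # true negative rate (TNR)
    ppv = tp / (tp + fp)                    # positive predictive value
    npv = tn / (tn + fn)                    # negative predictive value
    bac = (sensitivity + specificity) / 2   # balanced accuracy
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "bac": bac}
```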

#### *3.4. Thresholds*

Considering the prediction values provided by the model, the threshold that maximizes the BAC (in the range 0–1) was calculated over the validation subset of each fold split, both for the B-scan and the C-scan data. This threshold was then applied to the test subset of each fold split to calculate the metrics of the model (BAC, sensitivity, specificity, PPV, and NPV).
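The threshold search can be sketched as a scan over candidate values in [0, 1] on the validation predictions; the step count of 101 is an assumption.

```python
import numpy as np

def best_threshold(y_true, y_prob, n_steps=101):
    """Return the threshold in [0, 1] maximizing balanced accuracy."""
    best_t, best_bac = 0.5, -1.0
    for t in np.linspace(0.0, 1.0, n_steps):
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        bac = 0.5 * (tp / max(tp + fn, 1) + tn / max(tn + fp, 1))
        if bac > best_bac:
            best_t, best_bac = t, bac
    return best_t, best_bac
```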

#### *3.5. Classification Results*

The evaluation of the model was performed on 6 folds, over different training, validation, and testing splits of the dataset each time, with the aim of obtaining a model ensemble. As a result, the mean and standard deviation (std) were calculated for each of the selected metrics. Table 2 provides a summary of the results, where the first number reports the mean and the second the std (mean ± std). In this summary, the results obtained with B-scan and C-scan images, with standard and TTA test-split evaluation, are included for comparison. The complete list of results for each fold is included in Table A2 at the end of the document. Additionally, a graph illustrating a fair comparison of the fold results following the sum of ranking differences (SRD) method [62] is provided in Figure A1. After calculating the SRD coefficients for each of the options on the different folds, a graph comparing the performance of the different options can be generated. The smaller the SRD value, the closer to the reference value, meaning better performance.

**Table 2.** Summary of results by the network for the different imaging modalities (B-scan vs. C-scan), applying different evaluation techniques (standard vs. test time augmentation (TTA)) and resampling imbalance strategy. Note that the numbers report "mean ± std" values.


#### **4. Discussion and Conclusions**

Analyzing the results in general terms and considering the mean results reported in Table 2, when using the standard evaluation technique, the prediction over C-scan volumes was slightly better than the prediction over individual B-scan images. This impression is confirmed by the SRD analysis (Figure A1), where smaller values were obtained for the C-scan image analysis. This result makes sense: when evaluating the lesion volumetrically (C-scan) by considering the mean prediction of all the B-scan images contained in the C-scan, the probability of a bad prediction is lower. If the volume contains some individual B-scans with poor information representing the class sample, the (expected) bad predictions do not greatly influence the final diagnosis. In any case, the small differences in the prediction metrics suggest the high quality of the database used in this study, as shown in the detailed results for each fold provided in Table A2.

It can also be observed that the TTA evaluation technique slightly benefitted the prediction over individual B-scan images in terms of sensitivity and specificity, but not the C-scan volume prediction. These results make sense for two reasons: the data preparation strategy and the volumetric evaluation of the lesion. On the one hand, due to the nature of the images, no geometrical transformations were applied for data augmentation, as described in the data preparation section; instead, ROIs at different locations of the image were extracted. Depending on the location of the extracted ROIs, the clinical features can be more or less representative of the lesion, affecting the corresponding prediction. When TTA was applied, different ROIs from the B-scan were extracted, allowing analysis of the overall sample across its width, and hence a better prediction was obtained. This is particularly beneficial in the case of large, wide B-scan images, as it allows the different parts of the tissue/lesion to be analyzed in detail. Considering this, and although no improvement was observed in the C-scan evaluation, the TTA strategy was preferred during evaluation, since in this way the intrinsic clinical variability of the lesions was captured and the model prediction was more robust.

Interpretation of new imaging techniques, such as OCT, can be complicated at the beginning, which can prevent their adoption in clinical practice. However, advanced image processing techniques, such as deep learning, can be used to facilitate automatic image analysis or diagnosis and the development of optical biopsy. A previous work [46] proposed a pattern recognition network that requires prior manual annotation of the dataset, where diagnosis depends on whether the expected pattern is found in the image. Alternatively, this work proposes a classification strategy, which can help in the identification of subtle clinical characteristics in the images and is not biased by dataset annotations. This work investigates the application of an Xception deep learning model for the automatic classification of colon polyps from murine (rat) samples acquired with OCT imaging. The developed database is accessible upon request and is part of a bigger database in the process of being published. A strategy for processing B-scan images and extracting regions of interest was proposed as a data augmentation strategy. A test-time augmentation strategy, implemented with the aim of improving model prediction, was also evaluated. In addition, this work compares the diagnostic capacity of the proposed method when evaluated on B-scan images and on C-scan volumes, and for this purpose different clinical metrics were compared. The trained model was evaluated 6 times using different training, validation, and testing sets to provide an unbiased assessment of the results. We obtained a model with a mean sensitivity of 0.9695 (±0.0141) and a mean specificity of 0.8094 (±0.1524) when diagnosis was performed over individual B-scans, and a mean sensitivity of 0.9821 (±0.0197) and a mean specificity of 0.7865 (±0.205) when diagnosis was performed on the whole C-scan volume.

Considering the future application of a deep learning method to assist clinical diagnosis with OCT, and in view of the results of this work, successful diagnosis can be achieved both on B-scan images and on C-scan volumes. The evaluation of the lesion over a C-scan volume was preferred over the evaluation of an individual B-scan image, as the prediction is more robust. However, this will often not be possible in the daily clinical routine, for example during a patient colonoscopy examination, where in vivo real-time information is necessary for diagnosis and in-situ treatment decisions. In this sense, clinical procedures based on the accumulated predictions of several B-scan images could be defined to facilitate clinicians' decision-making during examination. The promising results of the proposed approach suggest that the implemented deep learning based method can identify, on the OCT images, the clinical features reported in previous clinical studies and, more importantly, that the amount of data and the features present in the image database are sufficient to allow automatic classification. These results are part of ongoing work that will be further extended; nevertheless, they demonstrate that deep learning-based strategies seem to be the path toward the "optical biopsy" paradigm. Raw interpretation of new imaging modalities is difficult for clinicians, but assisted by an image analysis method, interpretation can be eased, and reliable diagnostic suggestions can facilitate the adoption of the technology. Consequently, the CADx market can benefit from this progress in the short term, as the latest market forecast studies suggest.

This work will be further extended and tested with a larger and more balanced version of the collected murine dataset. More sophisticated models accepting larger image sizes will be tested to check whether classification is improved. The optical properties of the different lesions will be studied in detail with the aim of finding scattering patterns for each type of lesion. OCT volumetric (C-scan) information will also be studied in further detail to make the most of it, analyzing both the cross-sectional and en-face views.

**Author Contributions:** Conceptualization, C.L.S., J.B., and J.F.O.-M.; methodology, C.L.S., A.P., and E.T.; software, C.L.S.; validation, C.L.S., J.B., N.A.d.R., and N.A.; formal analysis, C.L.S., A.P., and E.T.; investigation, C.L.S.; resources, J.B., J.F.O.-M., E.G., O.M.C., N.A.d.R., and N.A.; data curation, C.L.S., J.B., J.F.O.-M., N.A.d.R., and N.A.; writing—original draft preparation, C.L.S., J.B., and J.F.O.-M.; writing—review and editing, C.L.S., J.B., J.F.O.-M., A.P., E.T., E.G., and O.M.C.; visualization, C.L.S.; supervision, E.G. and O.M.C.; project administration, C.L.S., A.P., and E.G.; funding acquisition, C.L.S. and A.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the PICCOLO project. This project has received funding from the European Union's Horizon2020 Research and Innovation Programme under grant agreement No. 732111. The sole responsibility of this publication lies with the authors. The European Union is not responsible for any use that may be made of the information contained therein. This research has also received funding from the Basque Government's Industry Department under the ELKARTEK program's project ONKOTOOLS under agreement KK-2020/00069 and the industrial doctorate program UC-DI14 of the University of Cantabria.

**Institutional Review Board Statement:** Ethical approvals for murine (rat) sample acquisition were obtained from the relevant Ethics Committees. The research with animals was approved by the Ethical Committee of animal experimentation of the Jesús Usón Minimally Invasive Surgery Centre (Number: ES 100370001499) and was in accordance with the welfare standards of the regional government, which are based on European regulations.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset used in this study is available upon request. This dataset is part of a more extensive dataset that is under collection and will be made publicly available in the future.

**Acknowledgments:** The authors would also like to thank Ainara Egia Bizkarralegorra from Basurto University hospital (Spain) for the processing of the samples.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**

**Table A1.** Comparison of anatomical differences of human and murine colon (adapted from reference [61]).


#### **Appendix B**


**Table A2.** Detail of the results of each fold for the different imaging modalities (B-scan vs. C-scans).

**Figure A1.** Fair comparison of folds results with sum of ranking differences (SRDs) method.

#### **References**


## *Article* **Deep Learning-Based Pixel-Wise Lesion Segmentation on Oral Squamous Cell Carcinoma Images**

### **Francesco Martino 1, Domenico D. Bloisi 2,\*, Andrea Pennisi 3, Mulham Fawakherji 4, Gennaro Ilardi 1, Daniela Russo 1, Daniele Nardi 4, Stefania Staibano 1,† and Francesco Merolla 5,†**


Received: 10 October 2020; Accepted: 18 November 2020; Published: 23 November 2020

**Abstract:** Oral squamous cell carcinoma is the most common oral cancer. In this paper, we present a performance analysis of four different deep learning-based pixel-wise methods for lesion segmentation on oral carcinoma images. Two diverse image datasets, one for training and another for testing, are used to generate and evaluate the models, thus allowing us to assess the generalization capability of the considered deep network architectures. An important contribution of this work is the creation of the Oral Cancer Annotated (ORCA) dataset, containing ground-truth data derived from the well-known Cancer Genome Atlas (TCGA) dataset.

**Keywords:** oral carcinoma; medical image segmentation; deep learning

#### **1. Introduction**

Malignant tumors of the head and neck region include a large variety of lesions, the great majority of which are squamous cell carcinomas of the oral cavity [1]. According to GLOBOCAN 2018 data on cancer [2], oral cavity malignant neoplasms, together with lip and pharynx malignancies, account for more than half a million new cases per year worldwide, with an estimated incidence of 5.06 cases per 100,000 inhabitants. Moreover, Oral Squamous Cell Carcinoma (OSCC) is characterized by high morbidity and mortality and, in most countries, the five-year survival rate after diagnosis is below 50% [3].

The histology examination is the gold standard for the definition of these tumors. Surgical pathologists use both clinical and radiological evidence to complement their diagnoses, differentiating between benign and malignant lesions. In recent years, surgical pathology has been undergoing a digital transformation thanks to (1) the increased processing speed of Whole Slide Image (WSI) scanners [4] and (2) lower storage costs and better compression algorithms [5]. Consequently, WSI digital analysis is one of the most prominent and innovative topics in anatomical pathology, attracting attention from both academia and industry. An example of an image obtained using a WSI scanner is shown in Figure 1.

**Figure 1.** An example of an image generated by a Whole Slide Image (WSI) scanner. The image has a dimension of 35,862 × 32,195 pixels and the file size is 213.3 MB.

However, WSI (and associated datasets) are characterized by three important limitations:


Due to the above-discussed limitations, the research activity based on Artificial Intelligence (AI) algorithms applied to WSI is still limited compared to other diagnostic imaging branches, such as radiology, but the scientific literature on the topic is growing fast and we are observing the appearance of public datasets of unannotated and annotated histopathology WSI.

In this paper, we present a performance evaluation of four different image segmentation architectures based on deep learning to obtain a pixel-wise separation between benign and malignant areas in WSI samples. In particular, we test four widely used Semantic Segmentation deep neural Networks (SSNs) on publicly available data for the detection of carcinomas. Unlike classification neural networks, SSNs take as input images of arbitrary size and produce a correspondingly sized segmented output, without relying on local patches.

The contributions of this work are three-fold:


The paper is organized as follows. Section 2 contains a discussion of similar approaches present in the literature. Section 3 describes the details of the proposed method, while Section 4 shows both qualitative and quantitative results obtained on publicly available data. Finally, conclusions are drawn in Section 5.

#### **2. Related Work**

Artificial intelligence (AI) algorithms have been proposed to address a wide variety of questions in medicine; e.g., for prostate Gleason score classification [9], renal cancer grading [10], breast cancer molecular subtyping [11] and their outcome prediction. Moreover, AI-based methods have been applied to the segmentation of various pathological lesions in the fields of neuropathology [12], breast cancer [13], hematopathology [14], and nephrology [15].

The above-cited studies have been conducted mainly on the most common tumors (i.e., breast or prostate), while AI-based methods have been scarcely adopted for other types of cancer, despite their high incidence and mortality rates, such as Oral Squamous Cell Carcinoma (OSCC). The analysis of a recent systematic review by Mahmood et al. [16] shows that few applications of automatic WSI analysis algorithms are yet available for OSCC. In particular, the survey reports 11 records on the employment of AI-based methods for the analysis of specific histological features of oral lesions: out of 11, only four papers refer to OSCCs, namely [17–20], one paper is about oral epithelial dysplasia [21], five are about oral submucous fibrosis, i.e., [22–26], and one paper is about oropharyngeal squamous cell carcinoma [19]. Another recent application of machine learning algorithms to oral lesion histopathology is based on immunohistochemistry (IHC) positivity prediction [27].

Segmentation methods on WSI images have been developed mainly for nuclei segmentation [28–30], epithelium segmentation [19,24], microvessels and nerves [31], and colour-based tumour segmentation [17]. Recently, Shaban et al. [32] proposed an indirect segmentation method, through small-tile classification, with an accuracy of 95.12% (sensitivity 88.69%, specificity 97.37%).

To the best of our knowledge, there are no published results on direct OSCC segmentation using deep learning, and none employing the TCGA as a source of histopathological images. This work represents a first attempt at applying well-known deep learning-based segmentation methods to the publicly available TCGA images, also providing annotations to quantitatively validate the proposed approach.

#### *Datasets*

Concerning WSI datasets, most have been made available for challenges, such as Camelyon [33] and HEROHE, on Kaggle, or as part of larger databases. The Cancer Genome Atlas (TCGA) [34] has contained, since 2006, publicly available data provided by the National Cancer Institute (NCI), the U.S. federal government's principal agency for cancer research and training. In particular, it contains clinicopathological information and unannotated WSI of over 20,000 primary cancers covering 33 different cancer types [35].

#### **3. Methods**

Our aim is to use a supervised method to automatically segment an input WSI sample into three classes:


Figure 2 shows the functional architecture of our approach. We worked on input images with a large dimension of 4500 × 4500 pixels, which is about ten times greater than the input dimension supported by existing segmentation SSNs. Thus, a preprocessing step is needed in order to fit the input format of the deep neural network. We tested two different pre-processing functions:



**Figure 2.** Functional architecture of the proposed approach.

Segmentation results obtained using the different pre-processing functions are discussed in Section 4.

#### *3.1. Network Architectures*

The core of the proposed approach consists in the use of a semantic segmentation network. We have selected four different network architectures among the most used ones:


#### 3.1.1. Segnet

SegNet is made of an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer (Figure 3). The encoder network consists of the first 13 convolutional layers of the VGG16 [38] network designed for object classification, without the fully connected layers, in order to retain higher-resolution feature maps at the deepest encoder output. In such a way, the number of parameters to train is significantly reduced. Each encoder layer has a corresponding decoder layer, for a total of 13 decoder layers. The final decoder output is fed to a multi-class soft-max classifier to produce class probabilities for each pixel independently.

#### 3.1.2. U-Net

The architecture of the net is shown in Figure 4. The input image is downsampled to obtain a 512 × 512 resized image.

The encoding stage is needed to create a 512-dimensional feature representation and is made of ten 3 × 3 convolutional layers and four 2 × 2 max pooling operations with stride 2. In particular, there is a repeated application of two unpadded convolutions, each followed by a rectified linear unit (ReLU), and a max pooling operation. The decoding stage (see the right side of Figure 4) is needed to obtain the predicted mask at 512 × 512 pixels. It is made of eight 3 × 3 convolutional layers and four 2 × 2 transpose layers. There is a repeated application of two unpadded convolutions, each followed by a ReLU, and a transpose operation. Figure 4 also shows the concatenation arcs from the encoding side to the decoding side of the network. Cropping is necessary due to the loss of border pixels in every convolution layer.

**Figure 4.** U-Net architecture.

#### 3.1.3. U-Net with Different Encoders

The original U-Net consists of two paths, the first (left side) composed of four blocks. Each block follows the typical architecture of a convolutional network: two repeated 3 × 3 convolutional layers followed by a rectified linear unit (ReLU) and a 2 × 2 max pooling layer with stride 2 as the down-sampling step. The second path (right side) represents the up-sampling path, also composed of four blocks. Each block consists of an up-sampled feature map, followed by concatenation with the corresponding cropped feature map from the first path, followed by a double 3 × 3 convolutional layer. Each convolutional layer is paired with batch normalization. Between the two paths there is a bottleneck block, built from two convolutional layers with batch normalization and dropout. In the models we applied, we replaced the first path, i.e., the encoder path of the network, with two different models, namely VGG16 [38] and ResNet50 [39], in order to obtain higher accuracy. A sketch of the building blocks follows.
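The blocks described above can be sketched in Keras as follows; this is an illustrative reconstruction using `padding="same"` (the original U-Net uses unpadded convolutions and cropping), and the block and variable names are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Double 3x3 convolution, each with batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def down_block(x, filters):
    f = conv_block(x, filters)
    return f, layers.MaxPooling2D(2, strides=2)(f)   # skip tensor + pooled map

def up_block(x, skip, filters):
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])              # concatenation arc
    return conv_block(x, filters)

inp = layers.Input((512, 512, 3))
s1, x = down_block(inp, 64)
s2, x = down_block(x, 128)
s3, x = down_block(x, 256)
s4, x = down_block(x, 512)
x = conv_block(x, 1024)                              # bottleneck block
x = layers.Dropout(0.5)(x)                           # dropout in the bottleneck
x = up_block(x, s4, 512)
x = up_block(x, s3, 256)
x = up_block(x, s2, 128)
x = up_block(x, s1, 64)
out = layers.Conv2D(3, 1, activation="softmax")(x)   # 3 classes per pixel
unet = tf.keras.Model(inp, out)
```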

#### *3.2. Training and Test*

Two diverse datasets are used to train, validate, and test the networks. The training dataset consists of 188 annotated images of advanced OSCC derived from the digital data acquired in the Surgical Pathology Unit of the Federico II Hospital in Naples (Italy). The study was performed in agreement with Italian law. Moreover, according to the Declaration of Helsinki for studies based only on retrospective analyses of routine archival FFPE tissue, written informed consent was acquired from the living patients at the time of surgery. The validation dataset consists of 100 annotated images from the newly created ORCA dataset (described below), while the test dataset consists of a further 100 images, also from the ORCA dataset. All the images in the two datasets have been manually annotated by two of the authors of this paper, who are expert pathologists (D. Russo and F. Merolla, both MD PhD and Board in Pathology), using three color labels:


The above-listed labels represent the classes learned by the deep network during the training stage. Figure 5 shows an example of an annotation mask.

**Figure 5.** Example of annotation mask. Carcinoma pixels are colored in white, tissue pixels not belonging to a carcinoma are colored in grey, and non-tissue pixels are colored in black.

#### 3.2.1. Training Data

The dataset used for training consists of a set of images collected by the Surgical Pathology Unit of the Federico II Hospital in Naples (Italy). In particular, cases were assembled as four Tissue Micro Arrays (TMAs) and, after H&E staining, the slides were scanned with a Leica Aperio AT2 at 40× (20× optical magnification plus a 2× optical multiplier).

Slides were annotated using the Leica Aperio ImageScope software. In the first instance, to define each core, we used the ImageScope rectangle tool with a fixed size of 4500 × 4500 pixels, corresponding to nearly 1.125 mm, i.e., the size of each TMA core. Then, an expert pathologist used the pen tool to contour the tumor areas within each core. ImageScope provides an XML file as output for each WSI. RGB core images were extracted using the OpenCV and OpenSlide libraries.

Concerning the annotation masks, we started by automatically detecting the tissue portion of each core using the OpenCV contouring and convex hull functions, to isolate the tissue pixels (colored in grey, as mentioned beforehand) from the background (non-tissue pixels, colored in black). Then, the tumor masks (colored in white) were superimposed over the tissue pixels to draw the carcinoma pixels. Finally, the masks were manually refined at pixel-size resolution. A sketch of the mask generation follows.
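A hedged OpenCV sketch of this mask-generation step; the threshold value and the grey-level codes are assumptions, and the manual refinement pass is omitted.

```python
import cv2
import numpy as np

def tissue_mask(core_bgr, thresh=220):
    """Grey (127) = tissue, black (0) = background, via contours + convex hull."""
    gray = cv2.cvtColor(core_bgr, cv2.COLOR_BGR2GRAY)
    # Tissue is darker than the bright slide background in H&E scans.
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(gray)
    for cnt in contours:
        hull = cv2.convexHull(cnt)
        cv2.drawContours(mask, [hull], -1, 127, thickness=cv2.FILLED)
    return mask  # tumor pixels (white, 255) are superimposed afterwards
```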

An example of a couple <original image, annotated mask> used for the training stage is shown on the top of Figure 6.

**Figure 6.** Top: Example of annotated data from the training dataset. (**a**) Original image. (**b**) Manually generated annotation mask. Bottom: Example of annotated data from the Oral Cancer Annotated (ORCA) dataset. (**c**) Original image. (**d**) Manually generated annotation mask.

#### 3.2.2. ORCA Dataset

An important contribution of this work is the creation of a fully annotated dataset, called ORCA, consisting of couples of WSI samples plus corresponding annotation masks. An example of a couple <original image, annotated mask> from the ORCA dataset is shown on the bottom of Figure 6.

The original WSI samples belong to the Cancer Genome Atlas (TCGA), a landmark cancer genomics program that molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types [40]. TCGA data are publicly available for anyone in the research community to use. For each WSI included in the TCGA dataset, we defined one or two cores containing a representative tumor area. The dataset contains WSI scanned both at 20× and at 40×.

TCGA whole slide images were annotated using the Leica Aperio ImageScope software. For each image, a circular core region was selected using the ImageScope ellipse tool, with a fixed size of 4500 × 4500 pixels for WSI scanned at 40× or 2250 × 2250 pixels for WSI acquired at 20×. This allowed us to keep the same spatial dimension for all the selected cores. Then, the tumor contours were drawn using the pen tool. The RGB images and the corresponding annotation masks were extracted using the OpenCV and OpenSlide libraries.

All the couples of WSI samples plus corresponding annotation masks used in this work are available for downloading from the ORCA dataset web page at: https://sites.google.com/unibas.it/orca.

#### **4. Experimental Results**

In order to train and evaluate the networks, we split the image data into training, validation, and test sets. The training set consists of the 188 images from the Federico II Hospital plus a set of images obtained by applying a simple augmentation technique, which increases the number of available training samples. In particular, the data augmentation has been achieved by flipping the images vertically, horizontally, and in both ways. The final cardinality of the augmented training set is 756 images.
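This flip-based augmentation yields four variants per image (original, vertical flip, horizontal flip, and both); a minimal NumPy sketch, applied identically to image and mask:

```python
import numpy as np

def augment_flips(image, mask):
    """Return the original pair plus its three flipped variants."""
    both = lambda a: np.flipud(np.fliplr(a))
    variants = [(image, mask)]
    for flip in (np.flipud, np.fliplr, both):
        variants.append((flip(image), flip(mask)))
    return variants
```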

Each of the validation and test sets includes 100 images from the ORCA dataset. It is worth noticing that the validation and test sets are completely disjoint. In this way, we tested the capability of the networks to generalize the segmentation problem.

As stated above, we trained four different models:


The source code of the used SegNet network is available at https://github.com/apennisi/segnet, while the source code for U-Net and its modifications is available at https://github.com/Mulham91/Deep-Learning-based-Pixel-wise-Lesion-Segmentationon-Oral-Squamous-Cell-Carcinoma-Images.

The four networks have been trained using the Adam optimizer, with a learning rate of 1e−4, without adopting any decay technique. The input image size has been set to 512 × 512 pixels.

For computing the loss, we used the Cross-Entropy function [41], a loss function widely used in deep learning. Cross entropy helps speed up the training of neural networks in comparison to other losses [42]. The cross entropy for multi-class error computation is defined as follows:

$$s = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}), \tag{1}$$

where *M* is the number of classes, *c* is the class id, *o* is the observation id, *y* is the binary indicator of whether class *c* is the correct label for observation *o*, and *p* is the predicted probability. Such a definition may also be called categorical cross entropy.
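As a worked example of Equation (1) for a single observation with M = 3 classes, the loss reduces to the negative log-probability assigned to the true class:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot ground truth: class 2 is correct
p = np.array([0.2, 0.7, 0.1])        # predicted class probabilities
loss = -np.sum(y * np.log(p))        # = -log(0.7), approximately 0.357
```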

The training has been manually stopped after 60 epochs for all the considered network architectures because we noticed a tendency of all the networks to overfit the training set.

We performed three kinds of experiments:


An example of the different color models mentioned above is shown in Figure 7. The original RGB image is decomposed into single channels using the RGB and HSV color models. Then, selected channels are combined as input for the deep network. In particular, we used H + R and R + V. A sketch of this channel stacking follows.
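Building the two-channel inputs is a matter of stacking the selected channels; an illustrative OpenCV sketch (the file path is a placeholder, and OpenCV loads images in BGR order):

```python
import cv2
import numpy as np

bgr = cv2.imread("core_image.png")              # placeholder path
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
red = bgr[:, :, 2]                              # R channel (BGR order)
hue, value = hsv[:, :, 0], hsv[:, :, 2]
h_plus_r = np.stack([hue, red], axis=-1)        # (H + R) input
r_plus_v = np.stack([red, value], axis=-1)      # (R + V) input
```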

**Figure 7.** An example of the input images used for training the networks. First row: Original input and corresponding annotated mask. Second row: Red, Green, and Blue channels from the original RGB image. Third row: Hue, Saturation, and Value channels from the transformation of the original RGB image into Hue, Saturation, and Value (HSV).

The idea of using a multi-spectral input derives from the application of deep learning techniques in the area of precision farming, where multi-spectral inputs have shown good performance on SSNs for the segmentation of crop and weed plants in images acquired by farming robots [43].

The prediction masks produced by the deep networks have been compared with the ground-truth annotation masks in order to measure the capability of our pipeline to generate accurate results. In particular, we have used the number of false-positive (FP), false-negative (FN), and true-positive (TP) carcinoma pixel detections as indicators for the evaluation of the results. Figure 8 shows an example of how we computed FP, FN, and TP pixels to evaluate the predicted mask.

**Figure 8.** An example of error analysis for a test image. (**a**) The original image included in the test set. (**b**) The corresponding annotation mask, where carcinoma pixels are coloured in white. (**c**) The predicted mask generated from U-Net with ResNet50 encoder. (**d**) The coloured error mask, where green pixels are true-positive (TP) carcinoma pixels, blue are false-positives (FPs), and red are false-negatives (FNs).

#### *4.1. Qualitative Evaluation*

We evaluated the output of the trained network, in the first instance, by visual inspection. This allowed us to interpret the results and judge them in the light of the pathologist's experience and diagnostic reasoning in histopathology.

In most cases, we observed that the areas reported as false positives (blue pixels in Figure 8d) actually corresponded to small nests of stromal infiltration, located in correspondence with the tumor invasion front.

Vice versa, we observed that a portion of the pixels reported as false negatives actually referred to small areas of stroma within the tumor area. For example, by observing the colored mask in Figure 8d, it is visible that the stromal area within the tumor is considered a set of false-negative pixels (coloured in red), even though these should be considered true positives.

Given the discussion above, from a qualitative point of view, the algorithm often reported pixels relating to areas of intense inflammatory infiltrate as false positives. However, single interspersed lymphocytes in the peritumoral stroma were not per se a problem in the interpretation of the image.

Tumor grading was another determining factor in the efficiency of the algorithm. In fact, the trained network had difficulty recognizing highly differentiated tumor areas with a prevalence of keratin pearls. This last factor could be attributable to the fact that the dataset used for training was mainly composed of a series of high-grade, advanced squamous carcinomas.

#### *4.2. Quantitative Results*

To obtain a quantitative measure of the segmentation performance achieved by the four deep networks, we used the Mean Intersection-Over-Union (mIOU) metric, one of the most common metrics for evaluating semantic image segmentation tasks. In particular, we first computed the IOU for each semantic class, namely carcinoma, tissue non-carcinoma, and non-tissue, and then computed the average over all classes. The *mIOU* is defined as follows:

$$mIOU = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i}, \tag{2}$$

where *TP*, *FP*, and *FN* are the true-positive, false-positive, and false-negative pixel counts for each class *i*, and *C* is the total number of classes.
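Equation (2) translates directly into a per-class loop over the predicted and ground-truth label maps; a minimal NumPy sketch:

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes=3):
    """Mean Intersection-Over-Union across classes (Equation (2))."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```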

Table 1 shows the quantitative results of the semantic segmentation for three different inputs: RGB, (Red + Hue), and (Red + Value). The results show that the (Red + Value) combination generates better results than the (Red + Hue) input. Moreover, a deeper network, such as U-Net modified with a ResNet50 encoder, performs better than the original U-Net (which has a shallower encoder).


**Table 1.** Pixel-wise segmentation results.

#### *4.3. Discussion*

The histological evaluation of Hematoxylin–Eosin stained slides from tumor samples, carried out by an experienced pathologist on an optical microscope, is a mandatory step in the diagnostic, prognostic, and therapeutic pathway of patients suffering from squamous cell carcinoma of the oral cavity. To date, for the histological diagnosis of OSCCs, the gold standard is the visual analysis of histological preparations stained with hematoxylin and eosin; tumor recognition basically takes place on the basis of a qualitative assessment of the architectural characteristics of the neoplastic tissue, drawing on the pathologist's own wealth of knowledge. This qualitative assessment is subjective and can suffer from inter-individual variability, especially in borderline situations that are difficult to interpret. The use of a segmentation algorithm could minimize the interpretative variability and speed up the pathologists' work, providing them with a screening tool. This is particularly useful in those cases in which the histopathological diagnosis must be carried out on the extensive sampling of complex surgical samples, which involves the generation of multiple blocks of formalin-fixed and paraffin-embedded tissue samples, from which numerous hematoxylin- and eosin-stained slides are obtained.

Based on the presented results, our contribution is composed of: (i) a novel dataset, the ORCA set, which will allow us to conduct new studies on Oral Squamous Cell Carcinoma. In particular, the dataset is composed of annotations from the TCGA dataset, a comprehensive dataset enriched not only with diagnostic slides but also with clinicopathological information and molecular biology data; this could facilitate the development of deep learning algorithms for molecular characterization; (ii) our method relies on 2250 × 2250 and 4500 × 4500 images, without tiling. Even though an improvement in its accuracy is mandatory for clinical practice, the use of such large images can hugely reduce the processing time for a WSI, making our approach easily scalable to the clinical routine, where hundreds of slides need to be processed each day; (iii) after the required improvements and a clinical trial, this kind of algorithm may become part of clinical practice via L.I.S. integration, speeding up OSCC diagnosis and helping pathologists to identify OSCC areas on WSI. Indeed, we foresee extending our method to lymph node metastases, giving the pathologist an easy way to detect small tumor islands, and to distant metastases, supporting the pathologist in cases of suspected metastasis of an OSCC primary tumor.

We intend to propose this artificial intelligence algorithm as a Computer-Aided Diagnosis tool, aware that it cannot replace the pathologist in their routine activity, but that it will be able to provide them with valid help, especially for those working in general diagnostic centres in the community, which are not specialized in the diagnosis of an infrequent but extremely lethal disease.

#### **5. Conclusions**

In this work, we created a dataset called ORCA, containing annotated data from the TCGA dataset, to compare four different deep learning-based architectures for oral cancer segmentation, namely SegNet, U-Net, U-Net with a VGG16 encoder, and U-Net with a ResNet50 encoder. The peculiarity of this work consists in the use of a training set completely different from the test data. In such a way, we tested the capability of the networks to generalize the problem, providing promising segmentation results.

Despite the non-optimal results, to the best of our knowledge, this is the first attempt to use an automatic segmentation algorithm for oral squamous cell carcinoma, and it represents an important novelty for this pathology. Furthermore, the publicly available ORCA dataset will facilitate the development of new algorithms and will boost research on computational approaches to OSCC.

As future directions, we aim at enlarging the training set and at making it publicly available. In this work, we considered color transformation, using a combination of the HSV and RGB color models, as a method for creating a multi-channel input. This was done because the group of pathologists among the authors of this work noticed that the HSV color space contains many visually distinguishing features of tumor cells. We did not use color modifications for augmenting the data; however, this is an interesting aspect that will be investigated in future work. Moreover, we plan to improve our model to achieve a result that may be transferred to clinical practice.

**Author Contributions:** D.D.B., A.P. and F.M. (Francesco Merolla) conceived and designed the experiments; M.F. and F.M. (Francesco Martino) performed the experiments; F.M. (Francesco Merolla) and D.R. analyzed the data; G.I. contributed reagents/materials/analysis tools; S.S. and D.N. provided a critical review of the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** Our research has been supported by a POR Campania FESR 2014-2020 grant; "Technological Platform: eMORFORAD-Campania" grant PG/2017/0623667.

**Acknowledgments:** We thank Valerio Pellegrini for his contribution in the annotation of the dataset images.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

## **An Efficient Lightweight CNN and Ensemble Machine Learning Classification of Prostate Tissue Using Multilevel Feature Analysis**

### **Subrata Bhattacharjee 1, Cho-Hee Kim 2, Deekshitha Prakash 1, Hyeon-Gyun Park 1, Nam-Hoon Cho 3 and Heung-Kook Choi 1,\***


Received: 23 September 2020; Accepted: 10 November 2020; Published: 12 November 2020

**Abstract:** Prostate carcinoma is caused when cells and glands in the prostate change their shape and size from normal to abnormal. Typically, the pathologist's goal is to classify the stained slides and differentiate normal from abnormal tissue. In the present study, we used a computational approach to classify images and features of benign and malignant tissues using artificial intelligence (AI) techniques. Here, we introduce two lightweight convolutional neural network (CNN) architectures and an ensemble machine learning (EML) method for image and feature classification, respectively. Moreover, classification using pre-trained models and handcrafted features was carried out for comparative analysis. Binary classification was performed to distinguish between two grade groups (benign vs. malignant), and quantile-quantile plots were used to show the predicted outcomes. Our proposed models for deep learning (DL) and machine learning (ML) classification achieved promising accuracies of 94.0% and 92.0%, respectively, based on non-handcrafted features extracted from CNN layers. Therefore, these models were able to predict with near-perfect accuracy using few trainable parameters or CNN layers, highlighting the importance of DL and ML techniques and suggesting that the computational analysis of microscopic anatomy will be essential to the future practice of pathology.

**Keywords:** prostate carcinoma; microscopic; convolutional neural network; machine learning; deep learning; handcrafted

#### **1. Introduction**

Image classification and analysis have become popular in recent years, especially for medical images. Cancer diagnosis and grading are often performed and evaluated using AI, as these processes have become increasingly complex because of the growth in cancer incidence and in the number of specific treatments. The analysis and classification of prostate cancer (PCa) are among the most challenging of these tasks. PCa is the second most commonly diagnosed cancer among men in the USA and Europe, affecting approximately 25% of patients with cancer in the Western world [1]. PCa has always been an important challenge for pathologists and medical practitioners with respect to detection, analysis, diagnosis, and treatment. Recently, researchers have analyzed PCa in young Korean men (<50 years of age), considering the pathological features of radical prostatectomy specimens and biochemical recurrence of PCa [2].

In the United States, thousands of people exhibit PCa. In 2017, there were approximately 161,360 new cases and 26,730 deaths, constituting 19% of all new cancer cases and 8% of all cancer deaths [3]. Therefore, it is important to detect PCa at an early stage to increase the survival rate. Currently, for the clinical diagnosis of PCa, methods performed in hospitals include the prostate-specific antigen test, digital rectal exam, trans-rectal ultrasound, and magnetic resonance imaging. Core needle biopsy examination is a common and useful technique, performed by inserting a thin, hollow needle into the prostate gland to remove a tissue sample [4–6]. However, PCa diagnosis via microscopic biopsy images is challenging; therefore, diagnostic accuracy may vary among pathologists.

Generally, in histopathology sections, pathologists categorize stained microscopy biopsy images into benign and malignant. To carry out PCa grading, pathologists use the Gleason grading system, which was originally based on the sum of the two Gleason scores for the most common so-called Gleason patterns (GPs). Many studies conclude that this is the recommended methodology for grading PCa [7]. The Gleason grading system defines five histological patterns, from GP 1 (well differentiated) to GP 5 (poorly differentiated), with a focus on the shapes of atypical glands [8–11]. During the grossing study, the tumor-affected part of the prostate gland is extracted by the pathologist for examination under a microscope for cancerous cells [12,13]. In this process, the tissues are stained with hematoxylin and eosin (H&E) compounds, yielding a combination of dark blue and bright pink colors, respectively [14–18]. In digital pathology, there are protocols that every pathologist follows for preparing and staining the tissue slides; however, the acquisition systems and staining process vary from one pathologist to another. The resulting variations in colour intensity and artifacts in the tissue images can impact the classification accuracy of the analysis [19,20].

DL and ML in AI have recently shown excellent performance in the classification of medical images. These techniques are used for computer vision tasks (e.g., segmentation, object detection, and image classification) and for pattern recognition exploiting handcrafted features from large-scale databases, thus allowing new predictions from existing data [21–24]. DL is a class of ML algorithms in which multiple layers are used to gradually extract higher-level features from the raw input. ML is a branch of AI focused on building applications that learn from data. ML algorithms are trained to learn features and patterns in huge amounts of data to make predictions on new data. Both DL and ML have shown promising results in the field of medical imaging and have the potential to assist pathologists and radiologists with an accurate diagnosis; this may save time and minimize the costs of diagnosis [25–28]. For image classification, DL models are built to train, validate, and test thousands of images of different types for accurate prediction. These models consist of many layers through which a CNN transforms the images using functions such as convolution, kernel initialization, pooling, activation, padding, batch normalization, and stride.

The combination of image-feature engineering and ML classification has shown remarkable performance in medical image analysis and classification. In contrast, a CNN adaptively learns various image features to perform image transformation, focusing on features that are highly predictive for a specific learning objective. For instance, images of benign and malignant tissues could be presented to a network composed of convolutional layers with different numbers of filters that detect computational features and highlight the pixel pattern in each image. Based on these patterns, the network could use sigmoid and softmax classifiers to learn the extracted and important features, respectively. In DL, the "pipeline" of a CNN's processing (i.e., from inputs to any output prediction) is opaque, performed automatically like a passage through a "black box" tunnel, where the user remains fully unaware of the process details. It is difficult to examine a CNN layer by layer; therefore, each layer's visualization results and prediction mechanism are challenging to interpret.

The present paper proposes a pipeline for tissue image classification using DL and ML techniques. We developed two lightweight CNN (LWCNN) models for the automatic detection of the GP in histological sections of PCa and extracted non-handcrafted texture features from the CNN layers to classify them using an ensemble ML (EML) method. Color pre-processing was performed to enhance the images. To carry out a comparative analysis, two types of hand-designed features [29], namely opposite color local binary patterns (OCLBP) [30] and improved OCLBP (IOCLBP) [30], were extracted, and pre-trained models (VGG-16, ResNet-50, Inception-V3, and DenseNet-121) [31] were used for EML and DL classification, respectively. To avoid complexity and build lightweight DL models, we used few hidden layers and trainable parameters; therefore, the models were named LWCNN.

The DL models were trained several times on the same histopathology dataset using different parameters and filters. For each round of training, we fine-tuned the hyperparameters, optimization function, and activation function to improve the model performance, including its accuracy. Binary classification is critical for PCa diagnosis because the goal of the pathologist is to identify whether each tumor is benign or malignant [32]. We generated a class activation map (CAM) using predicted images and created a heat map to visualize how the LWCNN learned to recognize the pixel pattern (image texture) based on the activation functions, thus interpreting the decision of the neural network. The CAM visualization results of the training and testing were difficult to interpret because CNNs are black-box models [33,34].
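For a model ending in global average pooling and a dense softmax layer, the CAM for a class is the weighted sum of the last convolutional feature maps, using that class's dense-layer weights. A hedged Keras sketch, not the authors' implementation; the layer name passed as `conv_layer_name` is an assumption.

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name, class_idx):
    """CAM for models ending in GlobalAveragePooling2D -> Dense(softmax)."""
    conv_model = tf.keras.Model(model.input,
                                model.get_layer(conv_layer_name).output)
    fmaps = conv_model(image[None, ...])[0].numpy()   # (H, W, K) feature maps
    weights = model.layers[-1].get_weights()[0]       # (K, num_classes)
    cam = np.einsum("hwk,k->hw", fmaps, weights[:, class_idx])
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    return cam / (cam.max() + 1e-8)                   # normalized heat map
```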

#### **2. Related Work**

A CNN was first used on medical images by Lo et al. [35,36]. Their model (LeNet) succeeded in a real-world application and could recognize hand-written digits [37]. Subsequent CNN-based methods showed the potential for automated image classification and prediction, especially after the introduction of AlexNet, the system that won the ImageNet challenge. Since then, the categorization and automatic detection of cancer in histological sections using machine assistance have shown excellent performance in the field of early cancer detection.

Zheng et al. [38] developed a new CNN-based architecture using the 3D multiparametric MRI data provided by the PROSTATEx challenge. Data augmentation was performed through 3D rotation and slicing to incorporate the 3D information of the lesion. They achieved the second-highest AUC (0.84) in the PROSTATEx challenge, which shows the great potential of deep learning for cancer imaging.

Han et al. [39] used breast cancer samples from the BreaKHis dataset to perform multi-classification over the subordinate classes of breast cancer (ductal carcinoma, fibroadenoma, lobular carcinoma, adenosis, phyllodes tumor, tubular adenoma, mucinous carcinoma, and papillary carcinoma). The authors developed a new deep learning model and achieved remarkable performance, with an average accuracy of 93.2% on a large-scale dataset.

Kumar et al. [12] performed k-means segmentation to separate the background cells from microscopy biopsy images. They extracted morphological and textural features for automated detection and classification of cancer, and used different machine learning classifiers (random forest, support vector machine, fuzzy k-nearest neighbor, and k-nearest neighbor) to classify connective, epithelial, muscular, and nervous tissues. The authors obtained an average accuracy of 92.19% with their proposed approach using a k-nearest neighbor classifier.

Abraham et al. [40] used multiparametric magnetic resonance images and presented a novel method for the grading of prostate cancer. They used a VGG-16 CNN and an ordinal class classifier with J48 as the base classifier, working with the PROSTATEx-2 2017 grand challenge dataset. Their method achieved a positive predictive value of 90.8%.

Yoo et al. [3] proposed an automated CNN-based pipeline for prostate cancer detection using diffusion-weighted magnetic resonance imaging (DWI). Their dataset comprised 427 patients: 175 with PCa and 252 without. The authors used five CNNs based on the ResNet architecture and extracted first-order statistical features for classification, carrying out the analysis at both the slice and patient levels. Their proposed pipeline achieved its best result (AUC of 0.87) using CNN1.

Turki [41] performed machine learning classification for cancer detection using data samples of colon, liver, and thyroid cancer. Different ML algorithms were applied, including deep boost, AdaBoost, XGBoost, and support vector machines, and their performance was evaluated on real clinical data using the area under the curve (AUC) and accuracy.

Veta et al. [42] reviewed methods for the analysis of breast cancer histopathology images, discussing techniques for tissue image analysis and processing such as tissue-component segmentation, nuclei detection, tubule segmentation, mitosis detection, and computer-aided diagnosis. Before discussing the image analysis algorithms, the authors gave an overview of tissue preparation, slide staining, and the digitization of histological slides. Their approach performs clustering or supervised classification to acquire binary or probability maps for the different stains.

Moradi et al. [43] performed prostate cancer detection based on different image analysis techniques. The authors considered ultrasound, MRI, and histopathology images, and selected ultrasound images for cancer detection. For the classification of prostate cancer, feature extraction was carried out using ultrasound echo radio-frequency (RF) signals, B-scan images, and Doppler images.

Alom et al. [44] proposed a deep CNN (DCNN) model for breast cancer classification. The model combined the strengths of three powerful CNN architectures: the inception network (Inception-v4), the residual network (ResNet), and the recurrent convolutional neural network (RCNN); their proposed model was accordingly named the inception recurrent residual convolutional neural network (IRRCNN). They used two publicly available datasets, BreakHis and the Breast Cancer (BC) classification challenge 2015, and compared the test results against existing state-of-the-art models for image-based, patch-based, image-level, and patient-level classification.

Wang et al. [45] proposed a novel method for the classification of colorectal cancer histopathological images. The authors developed a bilinear convolutional neural network (BCNN) model consisting of two CNNs whose layer outputs are combined by an outer product at each spatial location. Color deconvolution was performed to separate the tissue stains (hematoxylin and eosin) for BCNN classification. Their proposed model outperformed a traditional CNN in classifying colorectal cancer images into eight different classes.

Bianconi et al. [20] compared the combined effect of six colour pre-processing methods and 12 colour texture descriptors on the patch-based classification of H&E stained images. They found that classification performance was poor with pure colour descriptors, whereas texture descriptors such as co-occurrence matrices, Gabor filters, and local binary patterns achieved promising results with some of the pre-processing methods.

Kather et al. [31] investigated the usefulness of image texture features, comparing pre-trained convolutional networks against variants of local binary patterns for classifying different types of tissue sub-regions, namely stroma, epithelium, necrosis, and lymphocytes. Using seven datasets of histological images and standard classifiers (e.g., support vector machines) on the handcrafted and non-handcrafted features, they obtained overall accuracies between 95% and 99%.

#### **3. Tissue Staining and Data Collection**

#### *3.1. Tissue Staining*

For the identification of cancerous cells, the prostate tissue was sectioned at a thickness of 4 μm. Deparaffinization (i.e., removal of paraffin wax from the slides prior to staining) is especially important after tissue sectioning because, otherwise, only poor staining may be achieved. Accordingly, each tissue section was deparaffinized and rehydrated in an appropriate manner, and H&E staining was carried out using an automated stainer (Autostainer XL, Leica). Hematoxylin and eosin are positively and negatively charged, respectively. The nucleic acids in the nucleus are negatively charged, basophilic components; hematoxylin reacts with these components. The amino groups of cytoplasmic proteins are positively charged, acidophilic components; eosin reacts with these components [46–48]. Figure 1 shows the visualization of an H&E stained biopsy image, analyzed using the QuPath open-source software. The results of H&E staining are shown separately, with their respective chemical formulas.

**Figure 1.** The visualization result of hematoxylin and eosin (H&E) staining slide. (**a**) Hematoxylin staining slide. (**b**) Eosin staining slide. (**c**) H&E staining slide obtained by combining (**a**,**b**). Note that the two slides (**a**,**b**) are highly dissimilar in texture, which is useful for analysis and classification.

#### *3.2. Data Collection*

The whole-slide H&E stained images, of size 33,584 × 70,352 pixels, were acquired from the pathology department of the Severance Hospital of Yonsei University. The slides were digitized at 40× optical magnification with a 0.3 NA objective using a digital camera (Olympus C-3000) attached to a microscope (Olympus BX-51), and the slide images were further processed to generate 2D patches of multiple sizes (256 × 256, 512 × 512, and 1024 × 1024). The extracted regions of interest (ROIs) were sent to a pathologist for prostate cancer (PCa) grading. Figure 2 shows an example of the cropped patches extracted from a whole-slide image; regions containing background and adipose tissue were excluded. After the labeled patches were received, 6000 samples of size 256 × 256 pixels (24 bits/pixel) were selected and divided equally into two classes: cancerous and non-cancerous. The tissue samples used in our research were extracted from 10 patients and used an RGB color scheme (8 bits each for red, green, and blue).

**Figure 2.** Data preparation of a sample histopathology slide from a prostatectomy. (**a**) An example of a whole-slide image to which a sliding-window method was applied to generate patch images. (**b**) Cropped patches obtained from (**a**), corresponding to the lowest through highest Gleason patterns, from well differentiated to poorly differentiated. Among all patches in (**b**), the stroma, benign, and malignant patches were selected for PCa analysis and classification.

#### **4. Materials and Methods**

#### *4.1. Proposed Pipeline*

Image and feature classification based on DL and ML methods has shown promising results in categorizing microscopic images of benign and malignant tissues. Our proposed pipeline is shown in Figure 3. The analysis of the tissue image dataset was carried out in five phases: image pre-processing, CNN model analysis, feature analysis, model classification, and performance evaluation. In this study, we developed two LWCNN models (model 1 and model 2) and used state-of-the-art pre-trained models to carry out 2D image classification and perform a comparative analysis among the models. EML classification was also performed on the handcrafted (OCLBP and IOCLBP) and non-handcrafted (CNN-based) colour texture features extracted from the tissue images.

**Figure 3.** Proposed pipeline for image and feature classification based on a lightweight convolutional neural network (LWCNN) and ensemble machine learning (EML). LR: logistic regression, RF: random forest.

#### *4.2. Image Preprocessing*

In this phase, the patches were resized to 224 × 224 pixels for CNN training, and a power-law (gamma) transformation [49,50] was applied to the resized images to adjust their contrast. Gamma is conventionally used to encode and decode luminance values in imaging systems. Figure 4 illustrates the clarity of the images before and after this operation.

The dataset was split for training, validating, and testing the CNN models. The data samples were labeled 0 (non-cancerous) and 1 (cancerous) and randomly assigned to one of three groups, as shown in Table 1. The dataset used for DL and ML classification holds a total of 6000 samples: 3600 for training, 1200 for validation, and 1200 for testing. Before the samples were fed to the network, data augmentation was performed on the training set, which enabled analysis of model performance, reduced overfitting, and improved generalization [51]. The transformations applied included rotation by 90°, transposition, and random brightness, contrast, hue, and saturation adjustments, as shown in Figure 5c,d. Keras and TensorFlow functions were used to execute the data augmentation.

**Figure 4.** Image preprocessing using smoothing and gamma correction. (**a**,**c**) Original images of benign and malignant tissues, respectively. Here, the images are blurry and exhibit low contrast. (**b**,**d**) Images after removal of random noise, smoothing, and gamma correction. (**e**) Transformation curve for images with low and high contrast. Because the images in (**a**,**c**) have low contrast, γ = 2 was applied to adjust their intensities, obtaining images in (**b**,**d**) that appear clear and "fresh." Therefore, the tissue components were more visible after transformation, which was important for CNN classification.

**Table 1.** Assignment of benign and malignant samples into datasets for training, validation, and testing.


**Figure 5.** Randomly selected samples from the training dataset demonstrating data augmentation. (**a**,**b**) Images of benign and malignant tissues, respectively, before the transformation. (**c**,**d**) Transformed images from (**a**,**b**), respectively, after data augmentation.

#### *4.3. Convolution Neural Network*

To classify images of PCa, this paper introduces two LWCNN models that perform GP classification and distinguish between two classes. Both model 1 and model 2 comprised standard CNN layers: input, convolution, rectified linear unit (ReLU), max pooling, dropout, flattening, GAP, and classification. Model 1 contained four convolutional blocks, with a depth of 10 layers, which interleaved two-dimensional (2D) convolutional layers (3 × 3 kernels, with strides and padding) and ReLU and batch normalization (BN) layers with three max-pooling (2 × 2) and three dropout layers. To form the fully connected part of the network [52,53], a flattening layer and a sequence of three dense layers containing 1024, 1024, and 2 neurons were attached for feature classification and the two probabilistic outputs. The sigmoid activation function [54,55] was used as the binary classifier. The numbers of filters in the four blocks were 32, 64, 128, and 256; each filter acts as a sliding window over the entire image.

Model 2 contained three convolutional blocks, with a depth of seven layers, in which the 2D convolutional, ReLU, and BN layers were identical to those of model 1 but were interleaved with two max-pooling (2 × 2) layers and one dropout layer. The numbers of convolutional filters in this model were 92, 192, and 384. A GAP layer was used instead of flattening, and the classification section again had three dense layers, containing 64, 32, and 2 neurons. Here, a softmax classifier [56,57] was used to reduce the binary loss. The input shape was set to 224 × 224 × 3 when building the model. The detailed design and specification of our LWCNN models are shown in Figure 6 and Table 2, respectively. Model 2 was derived from model 1 through multilevel feature analysis to improve classification accuracy and reduce validation loss, as shown in Figure 7.

**Figure 6.** Structure of our lightweight convolutional neural networks for cancer image classification between two Gleason grade groups of prostate carcinoma. Spatial features are extracted from an image by convolving through one of the networks. Classification layers (flatten, global average pooling [GAP], dense-1, dense-2, and output) were used to find the required response based on features that were extracted by the convolutional neural network.




**Table 2.** Detailed design specification of the lightweight CNN models (model 1 and model 2).

The multilevel feature maps were extracted after each convolutional block for pattern analysis and to understand the pixel distributions that the CNN detected, given the number of convolution filters applied for edge detection and feature extraction. The convolution operation was performed by sliding a filter (kernel) over the input image: element-wise matrix multiplication was performed at each location of the image matrix, and the results were summed to generate the feature map. Max pooling was applied to reduce the input shape, prevent the network from simply memorizing the data, and extract the maximum information from each feature map. The feature maps from the first block held most of the information present in the image; that block acted as an edge detector. Deeper in the network, however, the feature maps became more abstract and less similar to the original image (see Figure 7). In block-3, the image pattern was still somewhat visible; by block-4, it became unrecognizable. This occurs because deeper features encode high-level concepts, such as coarse 2D information about the tissue (e.g., spatial values of only 0 or 1), whereas the CNN detects edges and shapes in the low-level feature maps. Therefore, based on the observation that block-4 yielded unrecognizable maps, model 2 was built with three convolutional blocks and selected as the model proposed in this paper.

To validate the performance of model 2 (LWCNN), we also included pre-trained CNN models (VGG-16, ResNet-50, Inception-V3, and DenseNet-121) for histopathology image classification. These models are powerful and effective for extracting and classifying deep CNN features. For each pre-trained network, the dense (classification) block was configured according to the model's specification, and the sigmoid activation function was used to perform binary classification.

**Figure 7.** Multilevel feature map analysis for tissue image classification using a lightweight convolutional neural network. Visual analysis was performed by observing the pixel patterns in the feature maps extracted from each block; each block holds different information that is useful for classification. The output shapes of the feature maps from blocks 1−4 were 56 × 56 × 92, 28 × 28 × 192, 14 × 14 × 384, and 7 × 7 × 512, respectively. Four of the 92, 192, 384, and 512 feature maps in each block are shown for analysis. The analysis reveals that block-4 contains the maximum information regarding the image, but its maps are the least visually interpretable. Deeper in the network, the feature maps become sparser, indicating that the convolution filters detect fewer features; therefore, block-4 was removed in model 2.

#### *4.4. Feature Engineering*

Handcrafted and non-handcrafted texture features were extracted for ensemble machine learning (EML) classification. First, the non-handcrafted (CNN-based) features were extracted from the GAP layer of the proposed LWCNN (model 2): a different number of feature maps is generated by each CNN layer, and the GAP mechanism computes the average value of each feature map. Second, a total of 20 handcrafted colour texture features were extracted using the OCLBP and IOCLBP techniques, 10 features each. The hand-designed features were analyzed by EML classification and compared against the classification results for the non-handcrafted features.

After the colour texture maps were generated, the LBP technique was applied separately to each colour channel (red/green/blue) for OCLBP and IOCLBP. These state-of-the-art methods are extensions of local binary patterns (LBP) and are effective for colour image analysis. OCLBP and IOCLBP are intra- and inter-channel descriptors with dissimilar local thresholding schemes (i.e., the peripheral pixels of OCLBP are thresholded at the central pixel value, whereas IOCLBP thresholding is based on the mean value) [30]. For each method, the feature vector was obtained using general rotation-invariant operators (i.e., a neighbor set of *p* pixels placed on a circle of radius *R*) that can capture both the spatial pattern and the contrast of the local image texture. The operators *p* = 8 and *R* = 2 were used to extract the colour features from the H&E stained tissue images.

#### *4.5. DL and ML Classification*

Prior to training and testing the LWCNN, pre-trained, and EML [58] models, we fine-tuned different types of parameters for better prediction and to minimize model loss. To compute the feature maps in each convolutional layer, a non-linear activation function (ReLU) was used, and the equation can be defined as:

$$A_{i,j,n} = \max\left\{ w_n^T I_{i,j} + b_n,\ 0 \right\} \tag{1}$$

where *Ai,j,n* is the activation value of the *n*th feature map at location (*i*, *j*), *Ii,j* is the input patch, and *wn* and *bn* are the weight vector and bias term, respectively, of the *n*th filter.

BN was also used after each convolutional layer to regularize the model, reducing the need for dropout. BN was chosen because it is more effective than global data normalization: the latter transforms the entire dataset to zero mean and unit variance, whereas BN computes approximations of the mean and variance over each mini-batch. BN therefore enables the use of the ReLU activation function without saturating the model. Typically, BN is performed using the following equation:

$$\mathrm{BN}(x_n) = \frac{x_n - \mu_{mb}}{\sqrt{\sigma_{mb}^2 + c}} \tag{2}$$

where *xn* is the *d*-dimensional input, *μmb* and *σ²mb* are the mean and variance, respectively, of the mini-batch, and *c* is a small constant added for numerical stability.

To optimize the network weights and analyze the performance of the LWCNN models, we performed a comparative analysis of four optimizers: stochastic gradient descent (SGD), Adadelta, Adam, and RMSprop. The results of this comparison are shown in the next section. Classification performance was measured using the cross-entropy, or log, loss, which scores predicted probability values between 0 and 1 against the true labels. To train our network, we used binary cross-entropy, the standard loss function for binary classification:

$$\mathrm{Binary}_{loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ Y_i \log(M_w(X_i)) + (1 - Y_i) \log(1 - M_w(X_i)) \right] \tag{3}$$

where *N* is the number of training samples, *Xi* and *Yi* are the input samples and target labels, respectively, and *Mw* is the model with network weights *w*.

The hyperparameters were tuned using the ReduceLROnPlateau function with a minimum learning rate of 0.001, a factor of 0.8, and a patience of 10; thus, if no improvement in validation loss was observed for 10 consecutive epochs, the learning rate was reduced by a factor of 0.8. The batch size was set to eight for training, and regularization was applied by dropping out 25% and 50% of the weights in the convolutional and dense blocks of the LWCNN, respectively. The probabilistic outputs of the dense layer were computed using the sigmoid and softmax classifiers.

In addition to the CNN methods, traditional ML algorithms, namely logistic regression (LR) [59] and random forest (RF) [60], were used for feature classification. We propose an ensemble voting method in which the LR and RF classifiers are combined into an EML model; this ensemble was used to classify the handcrafted and non-handcrafted features and compare their classification performance. The LWCNN, pre-trained, and EML models were tested on unseen data samples. For ML classification, cross-validation was used by splitting the training data into *k* folds (*k* = 5) to determine model generalizability, and the result was computed by averaging the accuracies over the *k* trials. Prior to ML classification [61–63], the feature values for training and testing were standardized to zero mean and unit variance, which can be expressed as:

$$P_{i,\mathrm{normalized}} = \frac{P_i - \mu}{\sigma} \tag{4}$$

where *Pi* is the *i*th feature value of an individual tissue image, and μ and σ are the mean and standard deviation of the dataset.

The DL and ML models were built in Python 3 using the Keras and TensorFlow libraries. Approximately 36 h were invested in fine-tuning the hyperparameters to achieve better accuracy. Figure 8 shows the entire process flow diagram for the DL and ML classification, and Table 3 lists the hyperparameters used for the DL and ML models.

The models were trained, validated, and tested on a PC with the following specifications: an Intel Core i7 CPU (2.93 GHz), an NVIDIA GeForce RTX 2080 GPU, and 24 GB of RAM.

**Figure 8.** Flow diagram for DL and ML classification. Handcrafted and non-handcrafted colour texture descriptors were extracted for EML classification.


#### **5. Experimental Results**

This study mainly focuses on AI-based image classification. The proposed LWCNN (model 2) for tissue image classification and EML for feature classification produced reliable results that met our requirements at an acceptable speed. A CNN approach was used to develop the DL models because of its proven performance in detecting specific regions for multiclass and binary classification. When splitting the dataset, a ratio of 8:2 was set for training and testing; to validate the model after each epoch, the training set was further divided, with 75% of the data allocated for training and 25% for validation. Five-fold cross-validation was used during EML training. The algorithms for preprocessing, data analysis, and classification were implemented in the MATLAB R2019a and PyCharm environments.

#### *5.1. Performance Analysis*

In this study, a binary classification approach was used to classify benign and malignant samples of prostate tissue. Two levels of classification were performed: DL (based on images) and ML (based on features). Table 4 compares the optimizers for model 1 and model 2. The LWCNN models were trained several times, changing the optimizer for each run.


**Table 4.** Comparison of the optimizers for tissue image classification.

The comparison in Table 4 shows that Adadelta performed best, giving the highest test accuracies for both architectures; SGD and Adam came close to Adadelta for model 2, while RMSProp came close for model 1. Adadelta (an extension of Adagrad) is a robust optimizer that restricts the window of accumulated past gradients to some fixed size *w* instead of accumulating all past squared gradients. The comparison revealed that Adadelta is more stable and faster, hence an overall improvement over SGD, RMSProp, and Adam here. The behavior and performance of the optimizers were analyzed using the receiver operating characteristic (ROC) curve, a probabilistic curve that represents the diagnostic ability of a binary classifier system, including an indication of its effective threshold value. The area under the ROC curve (AUC) summarizes the extent to which a model can separate the two classes. Figure 9a,b show the ROC curves and corresponding AUCs of the different optimizers for model 1 and model 2, respectively. For model 1, the AUCs were 0.95, 0.94, 0.96, and 0.93, and for model 2, 0.98, 0.97, 0.98, and 0.97, using Adadelta, RMSProp, SGD, and Adam, respectively.

Further, based on the optimum accuracy in Table 4, we carried out EML classification using the CNN features extracted from model 2 to analyze the efficiency of the ML algorithms, and handcrafted-feature classification was performed for comparison with the non-handcrafted results. The EML model achieved promising results using the CNN-based features. Model 2 outperformed model 1 in overall accuracy, precision, recall, F1-score, and MCC, with values of 94.0%, 94.2%, 92.9%, 93.5%, and 87.0%, respectively. A confusion matrix (Figure 10) was generated for the LWCNN model that yielded the optimum results and thus most reliably distinguished malignant from benign tissue. Benign tissue was labeled "0" and malignant tissue "1" for this binary classification. The four squares of the confusion matrix represent true positives, true negatives, false positives, and false negatives; their values were calculated on the test dataset from the expected outcomes and the number of predictions for each class. Tables 5 and 6 show the overall comparative analysis of the DL and ML classification. The performance metrics used were accuracy, precision, recall, F1-score, and the Matthews correlation coefficient (MCC).

**Figure 9.** ROC curves for analyzing the behavior of different optimizers, generated by plotting predicted probability values (i.e., model's confidence scores). (**a**) Performance of model 1 based on sigmoid activation. (**b**) Performance of model 2 based on softmax activation function.

**Figure 10.** Confusion matrix of model 2, generated using the test dataset, showing results of binary classifications between benign (0) and malignant (1) tumors. Blue boxes at top-left and bottom-right represent true positive and true negative, respectively; white boxes at top-right and bottom-left represent false negative and false positive, respectively.

**Table 5.** Comparative analysis of lightweight and pre-trained CNN models based on non-handcrafted features. Metrics are for the test dataset.



**Table 6.** Comparative analysis of non-handcrafted and handcrafted features classification. Metrics are for the test dataset.

#### *5.2. Visualization Results*

The CAM technique was used to visualize the results from an activation layer (softmax) of the classification block. CAM is used to deduce which regions of an image a CNN uses to recognize the precise class it contains [22,64]. Typically, it is difficult to visualize the results from the hidden layers of a black-box CNN model: the feature maps become more complex with increasing depth in the network, and each map becomes increasingly abstract, encoding less directly visible information than the initial layers and appearing more blurred. Figure 11 shows the CAM results, indicating how our DL network detected important regions; the network learned a built-in mechanism to determine which regions merited attention, and this decision process was extremely useful in the classification network.

**Figure 11.** Class activation maps extracted from one of the classification layers of our convolutional neural network. These show how images are classified and predicted by the neural network, even though it is a black-box model. The top and bottom pairs of rows depict benign and malignant tissue images, respectively. (**a**) Input images with an RGB color scheme, visualized as grayscale. (**b**) Activation maps of the classification block, showing the detection of different regions in each tissue image. (**c**) Overlays of (**a**,**b**), with spots indicating the significant regions that the convolutional neural network used to identify a specific class in each image.

Our CNN detected specific regions using the softmax classifier by incorporating the spatially averaged information extracted by the GAP layer from the last convolutional layer, which had an output shape of 14 × 14 × 384. The detected regions depicted in Figure 11c were generated by applying a heat map to the CAM image in Figure 11b and overlaying it on the original image from Figure 11a. A heat map is highly effective for tissue image analysis; here, it showed how the CNN detected the regions of each image that are important for cancer classification. Doctors can use this information to better understand the classification (i.e., how the neural network predicted the presence of cancer in an image, based on the relevant regions). The visualization was carried out by feeding the test dataset into the trained network of model 2.

In this study, supervised classification was performed for cancer grading, with the dataset labeled "0" and "1" to categorize benign and malignant tissue separately and independently. The probability distributions of the training and test sets were similar, but the test dataset was independent of the training dataset. After the model had been trained on many binary-labeled cancer images, the unseen dataset was fed to the network for prediction between the binary classes. Figure 12 shows examples of the binary classification results of our proposed model 2, including images that were and were not predicted correctly. Notably, some images of benign tissue were similar to malignant tissue, and vice versa, in terms of nuclei distribution, intensity variation, and tissue texture; it was challenging for the model to classify these images correctly.

**Figure 12.** Cancer prediction using a binary labeled test dataset. Examples of images that were (**a**) correctly and (**b**) incorrectly classified, showing their actual and predicted labels.

#### **6. Discussion**

The main aim of this study was to develop an LWCNN for benign and malignant tissue image classification based on multilevel feature map analysis and to show the model's effectiveness. Moreover, we developed an EML voting method for classifying the non-handcrafted features (extracted from the GAP layer of model 2) and the handcrafted features (extracted using OCLBP and IOCLBP). Generally, in DL, features are extracted automatically from raw data and further processed for classification by a neural network, whereas for ML algorithms the features are extracted manually using different mathematical formulae; these are also regarded as handcrafted features. A CNN is suitable for complex detection tasks, such as analyses of scattered and finely drawn patterns in data. Of particular interest, in the malignant and benign classification task, model 2 was more effective than model 1. Indeed, model 1 performed below expectation, so we modified it to improve performance, resulting in model 2. The modification comprised removal of the fourth convolutional block, the flattening layer, and the sigmoid activation function, as well as alterations of the filter numbers and kernel sizes. GAP replaced flattening after the third convolutional block, minimizing overfitting by reducing the total number of parameters in the model, and the softmax activation function replaced the sigmoid activation function in the third dense layer. These modifications, based on the multilevel feature map analysis, improved the overall accuracy and localization ability of tissue image classification.

Furthermore, we compared our proposed CNN model with well-known pre-trained models: VGG-16, ResNet-50, Inception-V3, and DenseNet-121. Among these, DenseNet gave the highest accuracy, 95%, followed by Inception-V3 with 94.6%; the pre-trained VGG-16 and ResNet-50 achieved 92% and 93%, respectively. Although DenseNet attained the highest accuracy among the pre-trained models as well as our proposed model 2, this does not conflict with the aim of this paper. The ultimate goal was to develop a lightweight CNN without a complicated structure, using the minimum possible number of convolutional layers, while achieving good classification performance; model 2 supported this hypothesis by achieving an overall accuracy of 94%. The pre-trained models, by contrast, are trained on a huge dataset (ImageNet) covering 1000 classes, so accurate classification by such models is to be expected. Nevertheless, we compared the computational cost of the proposed LWCNN and the pre-trained models in terms of memory usage, trainable parameters, and learning (training and testing) time, as shown in Table 7. First, the number of trainable parameters in the LWCNN model was more than 75% smaller than in VGG-16, ResNet-50, and Inception-V3, and 2% smaller than in DenseNet-121. Second, the memory usage of the proposed model was significantly lower than that of the other models. Third, the time taken to train the proposed model was also drastically shorter. Among the pre-trained models, VGG-16 and ResNet-50 align most closely with the objective of this work. From Tables 5 and 7, it is evident that our LWCNN (model 2) is competitive and inexpensive, whereas the state-of-the-art models were computationally expensive while achieving comparable results. From this perspective, model 2 performed better than VGG-16 and ResNet-50 in terms of accuracy, besides employing a simpler architecture.


**Table 7.** Comparison of the performance and computational cost of model 2 and the pre-trained models.

Through fine-tuning of the hyperparameters, the CNN layers were determined to be optimal using the validation and test datasets. The modified model 2 was adequate for the classification of benign and malignant tissue images. Our study examined the capability of the proposed LWCNN model to detect and predict histopathology images; a single activation map was extracted from each block (see Figure 13) to visualize the detection results using a heat map. Notably, we used an EML method for classifying the non-handcrafted and handcrafted features; the EML model was sufficiently powerful to classify the computational features extracted by the optimal LWCNN model, which predicted the benign and malignant tissue samples almost perfectly. Tissue samples classified and predicted using the softmax classifier are shown in quantile-quantile (Q−Q) plots of the prediction probability confidence for the benign and malignant states in Figure 14a,b, respectively. These Q−Q plots allowed for the analysis of predictions: true and predicted probabilistic values were plotted for the true positive and true negative classifications of samples (see Figure 9), respectively.

**Figure 13.** Visualizations of class activation maps generated from model 2, created using different numbers of filters. Outputs of (**a**) first convolutional, (**b**) second convolutional, (**c**) third convolutional, and (**d**) classification blocks. Colors indicate the most relevant regions for predicting the class of these histopathology images, as detected by the convolutional neural network.

**Figure 14.** Quantile-quantile plot for true and predicted probabilistic values. (**a**) Samples that were benign and had true positive predictions. (**b**) Samples that were malignant and had true negative predictions.

In the Q−Q plots, the black bar at the top, parallel to the x-axis, shows the true probabilistic values; red (true positive) and blue (true negative) markers show the prediction confidence for each sample of a specific class. We used a softmax classifier, which normalizes the output of each unit to lie between 0 and 1 and ensures that the probabilities always sum to 1. Of the 600 samples per class, 565 and 557 were correctly classified as true positives and true negatives, respectively. A predicted probability value above 0.5 signifies an accurate classification, and one below 0.5 a misclassification.

As noted above, the combination of image-feature engineering and ML classification has shown remarkable performance in medical image analysis, whereas a CNN adaptively learns image features that are highly predictive for a specific learning objective [65]. Images of benign and malignant tissues can be presented to a network whose convolutional layers, with different numbers of filters, detect computational features and highlight the pixel pattern in each image; based on these patterns, the network can use sigmoid or softmax classifiers to weigh the extracted features. In DL, the CNN's processing pipeline (i.e., from inputs to output prediction) is opaque [66], performed automatically like a passage through a "black box" tunnel in which the user remains unaware of the process details. Because it is difficult to examine a CNN layer by layer, each layer's visualization results and the prediction mechanism are challenging to interpret.

Overall, all models performed well in tissue image classification, achieving comparable results, and the EML method likewise worked well with the CNN-extracted features. Models with very deep layers classified the data samples slightly more accurately; our aim, however, was to build an LWCNN model with few feature-map layers and hyperparameters for predicting cancer grade as a binary classification (benign vs. malignant). Our proposed methods show that lightweight models can achieve good results if the parameters are tuned appropriately. Model 2 effectively recognized the histologic differences in the tissue images and predicted their statuses with nearly perfect accuracy. The application of DL to histopathology is relatively new, yet it performs well and delivers accurate results. DL methods provide outstanding performance through black-box layers whose outputs can be visualized using heat maps. In this study, our model provided insights into the histologic patterns present in each tissue image and can thus assist pathologists as a practical tool for analyzing the tissue regions relevant to the worst prognosis. The heat map analyses suggested that the LWCNN can learn visual patterns of histopathological images relating to nuclear morphology, cell density, gland formation, and variations in the intensity of stroma and cytoplasm. Performance improved significantly when the first model was modified based on the feature map analysis.

#### **7. Conclusions**

In this study, 2D image classification of PCa samples was performed by leveraging non-handcrafted and handcrafted texture features to distinguish malignant from benign tissue. We presented LWCNN- and EML-based image and feature classification using feature map analysis. The DL models were designed with only a few CNN layers and trained with a small number of parameters. The computed feature maps of each model were fed into the fully connected layers through flattening or GAP, enabling binary classification with the sigmoid and softmax classifiers. GAP and softmax were used for model 2, the optimal network in this paper. The GAP layer was used instead of flattening to minimize overfitting by reducing the total number of parameters in the model: it computes the mean value of each feature map, whereas flattening reshapes all feature maps from the final convolution or pooling layer from a 2D matrix of features into a one-dimensional array for passage to the fully connected classifier. A comparative analysis was performed between the DL and EML classification results, and the computational cost was also compared among the models. The optimal LWCNN (model 2) and EML model (a combination of LR and RF classifiers) achieved highly accurate results with significantly fewer trainable parameters. The proposed LWCNN achieved an overall accuracy of 94%, an average precision of 94.2%, an average recall of 92.9%, an average F1-score of 93.5%, and an MCC of 87%. Using the CNN-based features, the EML model achieved an overall accuracy of 92%, an average precision of 92.7%, an average recall of 91%, an average F1-score of 91.8%, and an MCC of 83.5%.

To conclude, the analysis presented in this study is very encouraging. However, a model built for medical images may not work well for other types of images, and the hyperparameters must be fine-tuned to control overfitting and loss and thereby improve accuracy. The 2D LWCNN (model 2) developed in this study performed well; accordingly, the predicted true positive and true negative samples for benign and malignant tissue, respectively, were plotted using Q-Q plots, and the CAM technique was used to visualize the results of the black-box CNN model. In the future, we will consider other methods, develop a more complex DL model, and compare it with our optimal LWCNN model and other transfer-learning models. We will also extend the research from binary to multi-class classification, to simultaneously classify benign tissues as well as grades 3–5.

**Author Contributions:** Funding acquisition, H.-K.C.; Methodology, S.B.; Resources, N.-H.C.; Supervision, H.-K.C.; Validation, H.-G.P.; Visualization, C.-H.K.; Writing—original draft, S.B.; Writing—review and editing, C.-H.K. and D.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was financially supported by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the "Regional Specialized Industry Development Program (R&D, P0002072)" supervised by the Korea Institute for Advancement of Technology (KIAT).

**Ethical Approval:** The requirement for subjects' written informed consent was waived for this study, which was approved by the Institutional Ethics Committee of the College of Medicine, Yonsei University, Korea (IRB no. 1-2018-0044).

**Conflicts of Interest:** The authors declare that they have no conflicts of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
