**Preface to "Artificial Intelligence in Cancer Diagnosis and Therapy"**

Cancer is the second leading cause of death worldwide. According to the World Health Organization (WHO), around 10 million people died from cancer globally in 2020. The early detection of cancer is of utmost importance for effective treatment and for preventing the spread of cancer cells to other parts of the body (metastasis). However, this task, and the assignment of effective therapies in clinical cancer settings, is highly complex due to inter- and intra-tumor heterogeneity. The detection, diagnosis, and therapy of cancer are challenged by hidden patterns in seemingly irregular, chaotic medical events, requiring methodologies that capture the complexity of cancer in order to design effective diagnostic systems and therapies.

Artificial Intelligence (AI) and machine learning have been revolutionizing discovery, diagnosis, and treatment design. AI can aid not only in cancer detection but also in cancer therapy design, in identifying new therapeutic targets to accelerate drug discovery, and in improving cancer surveillance through the analysis of patient and cancer statistics. AI-guided cancer care could also be effective in clinical screening and management, with better health outcomes. Machine Learning (ML) algorithms grounded in the biological and computer sciences can significantly help scientists uncover the biological systems behind cancer initiation, growth, and metastasis. They can also be used by physicians and surgeons for the effective diagnosis and treatment of different types of cancer, and by the biotechnology and pharmaceutical industries for more efficient drug discovery. AI may be defined as the branch of computer science concerned with intelligent behavior: AI techniques learn from the data they are trained on, and learning algorithms are designed to generalize from those data.

This book covers significant recent research on AI and machine learning in both the private and public sectors of cancer diagnosis and therapy. The book is divided into five groups:

The first group is AI in prognosis, grading, and prediction. AI is a powerful tool for prognosis, the branch of medicine that aims to predict the future health of patients. It performs well in assisting cancer prognosis because of its high level of accuracy.

The second group is AI in clinical image analysis. Image-based methods are among the most powerful applications of AI and of recent deep learning methods. AI provides real-time and highly accurate image analytics to improve image quality and localize anatomical features (pre/post-processing), facilitates powerful augmented and virtual reality applications in the medical domain, and improves the classification and diagnosis of diseases from medical images.

The third group is AI models for pathological diagnosis. With the impressive growth in the application of AI in health and in pathology, the specific role of AI in supporting routine diagnosis, particularly for patients with cancer, is evident from many published works. AI can handle the enormous quantity of data created throughout the patient care lifecycle to improve pathologic diagnosis.

The fourth group is ML and statistical models for molecular cancer diagnostics and genetics. Molecular diagnosis involves processing samples of tissue, blood, or other body fluids to look for the presence of certain genes, proteins, or other molecules that might be a sign of a disease or condition such as cancer. AI methods provide many opportunities for analyzing such detailed and voluminous data with high accuracy and short lead times.

The fifth group is AI in triage, risk stratification, and cancer screening. Because cancer treatment involves complex and expensive procedures, triage and risk stratification assign levels of priority to patients to determine the most effective order in which they should be treated. Health providers can then identify the right level of care and services for distinct subgroups of patients. AI enables this prediction to be made rapidly and accurately.

This book is aimed at serving researchers, physicians, biomedical engineers, scientists, engineering graduates, and Ph.D. students of medical, biomedical engineering, and physical science, together with interested individuals in medical, engineering, and general science. It focuses on the application of artificial intelligence and machine learning methods in cancer diagnosis and therapy, including prognosis, grading and prediction, clinical image analysis, pathological diagnosis, molecular cancer diagnostics and genetics, and triage, risk stratification, and cancer screening, with approaches representing a wide variety of disciplines, including medical, engineering, and general science. Throughout the book, great emphasis is placed on medical applications of cancer diagnosis and therapy, as well as on methodologies using artificial intelligence and machine learning. The selected research is of high interest in cancer diagnosis and therapy as complex systems. An attempt has been made to expose the reading audience of physicians, engineers, and researchers to a broad range of theoretical and practical topics. The topics are of specific interest to physicians and engineers seeking expertise in cancer diagnosis and therapy via artificial intelligence and machine learning methods. The primary audience comprises researchers, graduate students, and engineers working on applications of AI to CT and X-ray images, in computer engineering and science, and in the medical disciplines. In particular, the book can be used to train graduate students, as well as senior undergraduate students, through a graduate or advanced undergraduate course in the areas of cancer diagnosis and therapy and engineering applications. The covered research topics are also of interest to researchers in medicine, biomedical engineering, and academia seeking to expand their expertise in these areas.

#### Acknowledgments

This book has been made possible through the effective collaboration of all the enthusiastic chapter contributors, who have expertise and experience in various disciplines in computer science, clinical cancer settings, and related fields. They deserve the sincerest gratitude for their motivation in creating such a book, their encouragement in completing it, their scientific and professional attitude in constructing each of its chapters, and their continuous efforts to improve its quality. Without the collaboration and consistent efforts of the chapter contributors, including authors and anonymous reviewers, the completion of this book would not have been possible.

### **Hamid Khayyam, Ali Madani, Rahele Kafieh, and Ali Hekmatnia** *Editors*

### *Review* **Current Value of Biparametric Prostate MRI with Machine-Learning or Deep-Learning in the Detection, Grading, and Characterization of Prostate Cancer: A Systematic Review**

**Henrik J. Michaely 1,\*, Giacomo Aringhieri 2,3, Dania Cioni 2,3 and Emanuele Neri 2,3**


**Abstract:** Prostate cancer detection with magnetic resonance imaging is based on a standardized MRI protocol according to the PI-RADS guidelines, including morphologic imaging, diffusion-weighted imaging, and perfusion. To facilitate data acquisition and analysis, the contrast-enhanced perfusion is often omitted, resulting in a biparametric prostate MRI protocol. The intention of this review is to analyze the current value of biparametric prostate MRI in combination with methods of machine learning and deep learning in the detection, grading, and characterization of prostate cancer; where available, a direct comparison with human radiologist performance was performed. PubMed was systematically queried and 29 appropriate studies were identified and retrieved. The data show that detection of clinically significant prostate cancer and differentiation of prostate cancer from non-cancerous tissue using machine learning and deep learning is feasible, with promising results. Some machine-learning and deep-learning techniques currently seem to be as good as human radiologists in terms of the classification of single lesions according to the PIRADS score.

**Keywords:** prostate cancer; multiparametric prostate MRI; biparametric prostate MRI; deep-learning; radiomics; artificial intelligence; cancer detection; PIRADS

#### **1. Introduction**

#### *1.1. Prostate Cancer*

Prostate cancer (PCA) is the second most common cancer in men worldwide, and it accounts for up to 25% of all malignancies in Europe [1]. It is the third leading cause of cancer-related death in the United States and Europe [2,3]. The incidence of prostate cancer increases with patient age, and prostate cancer and its management are becoming a major public health challenge. PCA aggressiveness can be linked to specific genes such as BRCA and to behaviors such as smoking [4,5]. Accurate and early detection of prostate cancer is therefore paramount to achieving good overall patient outcomes. The tools available for assessing and detecting prostate cancer are digital rectal examination (DRE), PSA screening, transrectal ultrasound, and MRI, of which the latter has received the most attention in the past decade due to its accuracy [6–8].

In contrast to ultrasound and digital rectal examination, MRI offers an operator-independent tool for objectively assessing the entire prostate gland from base to apex and from the posterior peripheral zone (PZ) to the anterior fibromuscular stroma (AFMS), areas that are barely assessable with DRE [6,9].

Magnetic resonance imaging of the prostate has a long history going back more than 20 years. In the initial phase, high-resolution T2-weighted (T2w) imaging and spectroscopy were mainly used as tools for detecting prostate cancer. Yet, spectroscopy is slow, susceptible to artefacts, and was not well received.


In the recent decade, further developments have taken hold, including diffusion-weighted imaging (DWI) and dynamic contrast-enhanced imaging (DCE). The entire prostate exam has been standardized worldwide and its reporting has been harmonized by the PIRADS (Prostate Imaging Reporting and Data System) system [10]. This classification system allows the prostate and potentially cancerous zones to be assessed objectively, and it standardizes reporting across sites so that the overall performance of MRI is increased and more reproducible than in previous periods. With this development, MRI of the prostate follows the trend to standardize the entire radiological procedure, from image acquisition to data reporting, to achieve higher reliability, enhanced reproducibility, and a direct implication for radiology-based treatments, as has previously been successfully demonstrated in breast imaging with BIRADS (Breast Imaging Reporting and Data System) [11].

The report structuring provided by PIRADS is already a condensation of the imaging information and standardizes reporting and its output. This is one major step toward a more automated and operator-independent radiology. Moreover, the image acquisition parameters, slice orientations, and sequences with their specific characteristics are governed by PIRADS [12]. This automatically sets the stage for potential automated image analysis. In the past decade, artificial intelligence (AI), with its subdivisions of machine learning (ML), radiomics, and deep learning (DL), has become more prevalent. At this point in time, ML and DL are still not clinical standards. Radiomics, for example, uses quantitative imaging features that are often unrecognizable to the human eye. It thereby increases the number of potential parameters in the multi-parametric approach to prostate MRI, with potential benefits for PCA detection and grading and beyond. DL techniques such as convolutional neural networks (CNN) are currently considered the gold standard in computer vision and pattern recognition and hence have potential benefits for PCA detection and grading. With large data sets as a basis, they have the potential to automatically learn and deduce conclusions, so that PCA recognition based on features imperceptible to the human eye might be possible. Despite the numerous experimental studies discussed further in this review, there is to date no standardized approach to using and implementing DL and ML for prostate imaging.

The aim of this study is to elucidate the status of artificial intelligence in prostate imaging with a focus on the so-called bi-parametric (bp) approach of prostate MRI (bpMRI).

#### *1.2. Prostate Imaging Reporting and Data System*

PIRADS was established by key global experts in the field of prostate imaging from America and Europe (European Society of Urogenital Radiology (ESUR), American College of Radiology (ACR)) to facilitate and standardize prostate MRI with the aim of assessing the risk of clinically significant prostate cancer (csPCA). The first version of the PIRADS recommendations was published in December 2011; the latest update was published in 2019 (PIRADS v2.1) [10,12,13].

Various studies have compared the predictive performance of PI-RADS v1 for the detection of csPCA against image-guided biopsy or radical prostatectomy (RP) specimens as the standard of reference. In a 2015 study, Thompson et al. reported that multi-parametric MRI detection of csPCA had a sensitivity of 96%, a specificity of 36%, and negative and positive predictive values of 92% and 52%; when PI-RADS was incorporated into a multivariate analysis (PSA, digital rectal exam, prostate volume, patient age), the area under the curve (AUC) improved from 0.776 to 0.879, *p* < 0.001 [14]. A similar paper showed that PI-RADS v2 correctly identified 94–95% of prostate cancer foci ≥ 0.5 mL but was limited for the assessment of Gleason Score (GS) ≥ 4 + 3 csPCA ≤ 0.5 mL [15]. An experienced radiologist using PIRADS v2 is reported to achieve an AUC of 0.83 with 77% sensitivity and 81% specificity [16].
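For readers less familiar with these diagnostic metrics, all of the quantities reported above derive from a 2×2 confusion matrix against the standard of reference. Below is a minimal illustrative sketch; the counts are invented and are not data from any of the cited studies.

```python
# Illustrative only: sensitivity, specificity, PPV, and NPV from a 2x2
# confusion matrix of MRI calls vs. biopsy-confirmed csPCA.
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return the four standard diagnostic test metrics."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts, not the data from [14]:
print(diagnostic_metrics(tp=96, fp=64, fn=4, tn=36))
```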

#### *1.3. Sequences for Prostate MRI*

The initial protocol for MRI of the prostate as provided by PIRADS included high-resolution multiplanar T2w-imaging, DWI, and DCE after the intravenous administration of a paramagnetic gadolinium chelate contrast agent. This so-called multi-parametric prostate MRI (mpMRI) is considered the gold standard. T2w-imaging is used to demonstrate the zonal anatomy of the prostate. Tumors can be well delineated, and their relation to the prostate capsule can be examined. Benign changes such as benign prostate hyperplasia, post-prostatic changes of the peripheral zone, or scars can be identified. T2w-imaging is considered the gold standard for the transitional zone (TZ) of the prostate gland. In addition, T2w-imaging can be used to measure the volume of the prostate. The high anatomic information content of T2w-imaging makes this sequence the perfect roadmap for image-guided biopsy [12,17].

DWI serves as an indirect measure of cellular density. In the case of a malignant tumor with high cellular density, the ability of water to move freely in the interstitial compartment is decreased; hence, diffusion is impaired. Images with high b-values, and even the increasingly common interpolated (calculated) b-values, allow quick and easy depiction of these suspicious areas in the prostate. The calculated ADC maps give a quantitative measure of cellular density and can be considered a molecular imaging tool for tumor aggressiveness. DWI is considered the reference sequence for the peripheral zone (PZ) of the prostate [12,17].
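To make the quantitative link between b-values and the ADC map concrete, the mono-exponential signal model S_b = S_0 · exp(−b · ADC) can be inverted voxel-wise. The sketch below uses synthetic numbers, not data from any cited study; b-values are in s/mm².

```python
# Sketch: ADC map from two DWI acquisitions via the mono-exponential model
# S_b = S_0 * exp(-b * ADC)  =>  ADC = ln(S_low / S_high) / (b_high - b_low)
import numpy as np

def adc_map(s_low, s_high, b_low=0.0, b_high=800.0):
    """ADC in mm^2/s from signal arrays at two b-values (s/mm^2)."""
    eps = 1e-6  # avoid division by zero / log(0) in background voxels
    return np.log((s_low + eps) / (s_high + eps)) / (b_high - b_low)

s0 = np.array([1000.0, 1000.0])    # signal at b = 0
s800 = np.array([600.0, 250.0])    # a tumor voxel retains more signal at b = 800
print(adc_map(s0, s800))           # lower ADC suggests higher cellularity
```

In this toy example, the first voxel yields an ADC of roughly 0.64 × 10⁻³ mm²/s (tumor-like restriction) and the second roughly 1.73 × 10⁻³ mm²/s (normal-tissue-like diffusion).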

Dynamic contrast enhancement (DCE) is considered the weakest of the three approaches used for prostate imaging. In contrast to T2w-imaging and DWI, DCE is not considered a dominant sequence for any of the prostate zones. It only serves as a tiebreaker in very specific questions in the PIRADS system. In addition, it requires the intravenous administration of contrast agent, with the risk of side effects such as allergies, nephrogenic systemic fibrosis, or gadolinium deposition in the body [18–21]. While the risk of nephrogenic systemic fibrosis is controllable by using small amounts of macrocyclic Gd-chelates, no harmful consequences of Gd-chelate deposition in the body have been found [22,23]. Nevertheless, patients often try to avoid contrast agent if feasible. Physicians equally embrace the idea of non-enhanced exams, as they speed up the acquisition and reduce the number of potential complications. In addition, omitting contrast agent saves money.

#### *1.4. Multiparametric and Biparametric MRI of the Prostate*

With this in mind, and given that DCE often yielded limited added value over T2w-imaging and DWI in mpMRI of the prostate, bi-parametric MRI (bpMRI) of the prostate is gaining considerable support [15]. Meanwhile, there are several high-ranking studies, such as the PROMIS trial, and meta-analyses comparing mpMRI and bpMRI of the prostate [24–26]. Current data underline the high negative predictive value of bpMRI in biopsy-naïve patients, at up to 97% [27,28]. Whether bpMRI might be slightly less accurate in the hands of less-experienced readers is not yet clearly proven [29,30]. A currently accepted position is that bpMRI of the prostate seems to be as good as mpMRI of the prostate for patients at low and high risk for csPCA, but DCE might be of worth in patients with intermediate risk and PIRADS 3 lesions [25,26,31–35] (Figure 1). bpMRI of the prostate is also commonly used for computer-based postprocessing using artificial intelligence. This is because DCE contains a fourth dimension (time), which makes those images harder to align and match with anatomical images such as T2w-imaging and DWI. Another drawback of DCE is that its image information is not directly apparent. The information on contrast media arrival and distribution, which is seen as a surrogate marker for microvascular density, has to be extracted using semi-quantitative or quantitative pharmacokinetic models, which adds another layer of complexity to postprocessing, along with increasing the time needed to report the exams.

**Figure 1.** Overview of the performance of mpMRI and bpMRI based on data from Woo et al. [33] and Alabousi et al. [25] demonstrating the near equal performance of bpMRI to mpMRI (reprinted with permission from [17], Copyright 2020 Gland Surgery).

#### *1.5. Artificial Intelligence (AI) for Image Postprocessing*

The availability of cheap, high computing power, together with the advent of postprocessing technologies and artificial intelligence such as machine learning techniques and deep neural networks, has fostered the application of these techniques to radiology tasks such as tumor detection. The current hierarchical concept of AI is depicted in Figure 2.

*Machine learning* (ML) is a subfield of AI in which algorithms are trained to perform tasks by learning rules from data rather than through explicit programming. *Radiomics* is a method that extracts large numbers of features from radiological images using data-characterization algorithms such as first-order statistics and shape-based and histogram-based analyses, as well as the Gray Level Co-occurrence Matrix, Gray Level Run Length Matrix, Gray Level Size Zone Matrix, Gray Level Dependence Matrix, and Neighboring Gray Tone Difference Matrix, to name a few [36–39]. These features are said to have the potential to uncover disease characteristics that are hard to appreciate with the naked eye. The hypothesis of radiomics is that distinctive imaging features between disease forms may be useful for detecting changes and potentially predicting prognosis and therapeutic response for various conditions, such as the detection of csPCA. These radiomic features are then often further analyzed using ML techniques. An example of a radiomics ML workflow is shown in Figure 3. An issue with ML techniques is that they often require the manual placement of a region of interest, thereby introducing a potential source of errors and biases.

**Figure 2.** Hierarchical structure of AI-techniques. Whereas ML requires human feature engineering as guidance for learning, DL is based on self-learning algorithms that can detect and process simple and complex image features.

**Figure 3.** Sample radiomics workflow (reprinted with permission from [40], Copyright 2019 Springer Nature).
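As an illustration of how such handcrafted features are computed in practice, the open-source pyradiomics package implements the feature classes listed above. The following is a hedged sketch, not a pipeline from any cited study; file paths are placeholders and pyradiomics is assumed to be installed (`pip install pyradiomics`).

```python
# Sketch of radiomics feature extraction with pyradiomics.
import SimpleITK as sitk
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
# Enable the feature classes named in the text above.
for cls in ["firstorder", "shape", "glcm", "glrlm", "glszm", "gldm", "ngtdm"]:
    extractor.enableFeatureClassByName(cls)

image = sitk.ReadImage("t2w.nii.gz")          # bpMRI series (placeholder path)
mask = sitk.ReadImage("lesion_mask.nii.gz")   # manually contoured ROI (placeholder)
features = extractor.execute(image, mask)     # dict: feature name -> value
print(len(features), "features extracted")
```

Note that the manually contoured mask in this sketch is exactly the interaction step the text flags as a potential source of error and bias.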

*Deep learning* (DL) is a subfield of AI in which algorithms are trained to perform tasks by learning patterns from data rather than through explicit programming. The key factors behind the increasing attention DL has attracted in recent years are the availability of large quantities of labelled data, inexpensive and powerful computing hardware (particularly graphics-processing units), and improvements in training techniques and architectures. DL is a type of representation learning in which the algorithms learn a composition of features that reflects the hierarchy of structures in the data. The current state of the art for medical image recognition using DL techniques is the so-called convolutional neural network (CNN). These networks are characterized by an architecture of connected non-linear functions that learn multiple levels of representations of the input data, thereby extracting possibly millions of features [41]. CNNs, in which a series of convolutional filter layers is exploited, are especially suitable for image processing [42]. Newer techniques such as transfer learning and data augmentation, or the application of generative methods, help mitigate existing limitations of CNNs [43]. The entire process of data processing within the multiple layers of a CNN, with convolution filters, pooling, and maximum filtering, is beyond the scope of this study. Greatly simplified, one might say that the bottom layers of a CNN act as a feature extractor while the top layers act as a classifier. An overview is given in Figure 4, in which the DL workflow is compared to radiomics and the standard radiology reading process [44]. The reason CNN-based approaches are considered superior to radiomics is that radiomics depends on hand-crafted features, which is limiting, whereas a CNN can generate the features that are most appropriate to the problem itself [45].

**Figure 4.** Workflow of standard radiology reporting compared to AI-based methods of radiomic and DL. The entire complexity of deep learning is only schematically shown. There is an abundance of different network architectures or CNN which are beyond the scope of this study. This figure only demonstrates a schematic CNN (reprinted under common creative license 4.0 from [44], Copyright 2021 Springer Nature).
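To make the feature-extractor/classifier split concrete, a schematic PyTorch CNN is sketched below. The channel counts, the 64 × 64 patch size, and the two-channel T2w + ADC input are illustrative assumptions, not the architecture of any cited study.

```python
# Schematic CNN: convolutional "bottom" layers act as a feature extractor,
# fully connected "top" layers act as a classifier (cf. Figure 4).
import torch
import torch.nn as nn

class LesionClassifier(nn.Module):
    def __init__(self, in_channels: int = 2, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(            # feature extractor
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(          # classifier head
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Two input channels: a 64 x 64 T2w patch and the matching ADC patch.
logits = LesionClassifier()(torch.randn(1, 2, 64, 64))
print(logits.shape)  # torch.Size([1, 2]): e.g., benign vs. suspicious
```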

#### **2. Materials and Methods**

The literature research for this study took place in August 2021. A PubMed query with the search terms "prostate" and "magnetic" and "deep learning" or "machine learning" or "radiomics" was performed. The aim was to retrieve those studies that made use of ML or DL techniques to facilitate tumor detection and grading. To ensure that only current techniques were included in the analysis, only publications from 2019 to 2021 were considered; particularly in the field of CNNs, the technology is evolving rapidly, so older publications might not represent the current state of the art. A total of 95 publications were initially retrieved. Of these, 66 were omitted for several reasons, so that 29 publications were available for analysis (see Figure 5). Clinical data (question to be answered, number of patients, age, AI technique, lesion segmentation, MRI technique, sensitivity, specificity, accuracy, AUC) were then manually extracted and transferred to a Microsoft Excel 365 spreadsheet (Microsoft, Redmond, WA, USA). PRISMA guidelines were followed [46]. An overview of the study according to the PRISMA guidelines can be found in Appendix A.

**Figure 5.** Literature selection workflow. ML–machine learning. DL–deep learning. up–uniparametric. bp–biparametric. mp–multiparametric.
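For reproducibility, a query of this kind can be scripted against the NCBI E-utilities, for example with Biopython. The sketch below is an illustration only; the term string is a reconstruction of the search described above, not the authors' verbatim query, and the email address is a placeholder.

```python
# Illustrative PubMed query via Biopython's Entrez module (pip install biopython).
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address
term = ('prostate AND magnetic AND '
        '("deep learning" OR "machine learning" OR radiomics)')
handle = Entrez.esearch(db="pubmed", term=term,
                        mindate="2019", maxdate="2021", datetype="pdat")
record = Entrez.read(handle)
print(record["Count"], "records found")
```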

This paper focuses on bpMRI. The current PIRADS guidelines state: "Given the limited role of DCE, there is growing interest in performing prostate MRI without DCE, a procedure termed "biparametric MRI" (bpMRI). A number of studies have reported data that supports the value of bpMRI for detection of csPCA in biopsy-naïve men and those with a prior negative biopsy". The potential benefits of bpMRI include: (1) elimination of adverse events and gadolinium, (2) faster MRI exam times, and (3) overall reduced costs [47]. These factors will potentially make bpMRI easily accessible. A remaining concern is that the DCE sequence may serve as a backup in case of image degradation of the DWI or T2w sequence. DCE may be of less value for the assessment of treatment-naïve prostate patients but remains essential in the assessment of local recurrence following prior treatment, which, however, is a setting in which current PI-RADS assessment criteria do not apply. The PIRADS steering committee therefore advocates the use of mpMRI particularly in (1) patients with prior negative biopsies and unexplained raised PSA values, (2) those in active surveillance who are being evaluated for fast PSA doubling times or changing clinical/pathologic status, (3) men who previously underwent a bpMRI exam that did not show findings suspicious for csPCA but who remain at persistent suspicion of harboring disease, (4) biopsy-naïve men with a strong family history, known genetic predispositions, elevated urinary genomic scores, or higher-than-average risk-calculator scores for csPCA, and (5) men with a hip implant or other consideration that will likely degrade DWI [47].

For this paper, bpMRI was selected because most studies dealing with ML or DL techniques rely solely on T2w-imaging and DWI. DCE data were rarely included. In contrast to T2w-imaging and DWI, the DCE data must first be postprocessed to generate parameter maps. This process is not yet standardized, as several pharmacokinetic models, and software implementations derived from them, exist for postprocessing. Without the generation of parameter maps, a huge number of images would have to be fed into the ML/DL algorithms, a step that most research groups evidently did not want to undertake.

#### **3. Results**

All included studies are listed with an abbreviated overview in Table 1.







A total of 29 studies were included in this review. Thirteen of them used ML (44.8%), 14 used DL techniques (48.2%), and 2 used a combination of ML and DL (6.9%). The data for 27 of the studies were acquired at 3 T (93.1%); 2 were acquired at 1.5 T (6.9%). A total of 7466 patients were analyzed within this data set. The ProstatEx data set from Radboud University, The Netherlands, was used seven times. The smallest study had a sample size of 25 patients; the largest had a sample size of 834 patients. The MRI technique used most often for AI postprocessing was T2w-imaging in combination with the ADC map and DWI (15 studies/53.6%). Next were T2w-imaging and ADC map (8 studies, 28.6%) and T2w-imaging and DWI (2 studies, 7.1%).

#### *3.1. Tumor Detection and Grading*

As seen in Table 1, the results (AUC, sensitivities, and specificities) were comparable, and no clear trend favoring either ML or DL approaches could be detected. Most studies required manual interaction, in which a radiologist had to segment the region of interest.

Overall, the rate of detection and correct tumor grading using AI techniques was comparable to the performance of trained radiologists in most studies. Studies were often hard to compare, as they differed in terms of the standard of reference (e.g., Gleason score (GS) vs. PIRADS vs. National Comprehensive Cancer Network guidelines vs. ISUP guidelines) and used different cut-off values within the same grading system (e.g., GS 7 was considered an intermediate-grade tumor in one study but a high-grade tumor in most studies). Some studies focused on the PZ only, while others accepted the entire gland as target tissue.

In a small study with 33 patients aiming to predict IMRT response, GS, and PCA stage, GS prediction using T2w radiomic models was found to be more predictive (mean AUC 0.739) than ADC models (mean AUC 0.70), while for stage prediction, ADC models had higher prediction performance (mean AUC 0.675) than T2w radiomic models (mean AUC 0.625) [40].

Using T2w-imaging and 12 b-values from diffusion, along with kurtosis analysis and T2 mapping, an AUC of 0.88 (95% CI 0.82–0.95) could be reached for differentiating GS ≤ 3 + 3 vs. GS > 3 + 3. This study with 72 patients was the only one to employ T2 mapping, which, after all, was deemed of little value [51].

In a stringent ML-radiomics study, an equally high AUC for tumor grading according to National Comprehensive Cancer Network guidelines into low-risk vs. high-risk (i.e., GS ≥ 8) was found for the PIRADS assessment and for the ML approach (0.73 vs. 0.71, *p* > 0.05) [49]. Interestingly, precision and recall were higher with the ML approach than with the PIRADS assessment (0.57 and 0.86 vs. 0.45 and 0.61). Similar results were found for the discrimination of ciPCA and csPCA of the PZ using an ML-radiomics approach with extreme gradient boosting [62]. In this study, performed on the ProstatEx dataset, an AUC of 0.816 for the detection of csPCA using bpMRI was found. Adding DCE slightly increased the AUC to 0.870, though this was not statistically significant. Based on the same data set but using optimized CNNs, Zong et al. [63] concluded that adding ktrans from DCE deteriorated sensitivity and specificity compared to bpMRI alone, from 100%/83% to 71%/88%. The optimal reported AUC of this study was 0.84.

Extremely good ML-radiomics results for differentiating ciPCA vs. csPCA, with an AUC of 0.999, were found in a study by Chen et al. They could also show that ML-radiomics exhibited a higher efficacy in differentiating ciPCA from csPCA than PIRADS. A potential explanation for this outstanding result, compared to the other studies, might be the study's inclusion/exclusion criteria: small lesions (<5 mm) and lesions not well delineable on MRI were excluded [52].

Somewhat poorer results were presented in a study by Gong et al. [61]. Their ML-radiomics approach, built on T2w-imaging and b800 DWI images, yielded an AUC of 0.787 and an accuracy of 69.9% for the discrimination between ciPCA and csPCA. Adding clinical data to the MRI-based data slightly degraded the results, with an AUC of 0.780 and an accuracy of 68.1%. A potential reason for this poorer outcome might be a different set of inclusion parameters.

Zhong et al. compared the performance of DL and deep transfer learning (DTL) with experienced radiologists. They found that DTL further improves on DL. The DTL results were comparable to radiologists' performance using PIRADS v2. They concluded that DTL might serve as an adjunct technique to support non-experienced radiologists [54]. A study using a CNN-trained algorithm to automatically attribute PIRADS scores to suspicious lesions found similar results: a performance comparable to that of a human radiologist was described [64]. The lowest agreement was found for low PIRADS scores, improving with higher PIRADS scores. There was no statistically significant difference between the radiologist-assigned and AI-assigned PIRADS scores with regard to the presence of csPCA for PIRADS 3–5.

In contrast, for Gleason score prediction, one study found better results for AI-based approaches than for radiologists in the PZ and TZ [76]. This could be particularly useful in the context of active surveillance.

A different study looking into aggressiveness prediction (GS ≥ 8) found equal AUCs for AI and radiologists but higher precision and recall rates for AI than for PIRADS, mitigating the problem of inter-reader variability [49].

An uncommon approach was presented in [65]. The authors combined radiomics and DL based on bpMRI consisting of DCE and T2w-imaging; no ADC/DWI images were used. In the few patients included, promising results, with an AUC of 0.96–0.98 for Gleason score prediction, were found. No further study used this subset of DCE and T2w-imaging.

The prospective IMPROD trial also examined whether the addition of clinical data and RNA expression profiles of genes associated with prostate cancer increased the accuracy of detection of csPCA [58]. In this study, the bpMRI-based data yielded the highest AUC, 0.92. Adding RNA-based or clinical data neither improved the results nor yielded better results by itself.

Cao et al. developed FocalNet to automatically detect and grade PCA (Figure 6) [72]. A similar work was presented by Schelb et al. [75], in which a U-Net was trained to detect, segment, and grade PCA. In comparison with radiologists' PIRADS assessment, the U-Net sensitivities and specificities for the detection of PCA at different sensitivity levels (PIRADS ≥ 3 and PIRADS ≥ 4) were comparable.

Positive results for DL-based techniques with a larger number of patients (*n* = 312) were found in a DL study by Schelb et al. using a U-Net [57]. They reported sensitivities/specificities for radiologists using PIRADS for the detection of lesions PIRADS ≥ 3 and ≥ 4, respectively, of 96%/88% and 22%/50%, while the U-Net approach yielded 96%/92% and 31%/47% (*p* > 0.05). In their study, the U-Net also auto-contoured the prostate and the lesion, with Dice coefficients of 0.89 (very good) and 0.35 (moderate), respectively.
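The Dice coefficient quoted here measures the voxel overlap between a predicted and a reference contour. A minimal sketch with synthetic masks (not data from [57]):

```python
# Dice similarity coefficient: 2|A intersect B| / (|A| + |B|) for binary masks.
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True  # predicted contour
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True  # reference contour
print(round(dice(a, b), 3))  # 0.562: moderate overlap, cf. 0.35 vs. 0.89 above
```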

An ML approach generating "attention boxes" for the detection of csPCA was published by Mehralivand et al. [60]. Their multicentric approach, with data from five institutions, showed an AUC of 0.749 for PIRADS assessment of csPCA and a statistically non-significant AUC of 0.775 for the ML-based approach. For the TZ only, the ML approach yielded a higher sensitivity for the detection of csPCA than PIRADS (61.8% vs. 50.8%, *p* = 0.001). Interestingly, the reading time for the ML approach was on average 40 s longer.

An uncommon approach to CNNs was published by Chen et al. [66]. They used U-Net CNNs to segment the prostate and intraprostatic lesions, hereby delineating the PZ, TZ, CZ, and AFMS. Their approach demonstrated impressive results: a Dice coefficient of 63% and a sensitivity and specificity of 74.1% and 99.9%, respectively, for correctly segmenting the prostatic zones and the suspicious lesion. Yet, in contrast to most other studies, no grading or discrimination of the suspected PCA lesion was performed. This segmentation study was included in this review as it included segmentation of the prostate and detection of the tumor within the prostate.

**Figure 6.** "Examples of lesion detection. The left two columns show the input T2WI and ADC map, respectively. The right two columns show the FocalNet-predicted lesion probability map and detection points (green crosses) with reference lesion annotation (red contours), respectively. (**a**) Patient at age 66, with a prostate cancer (PCa) lesion at left anterior peripheral zone with Gleason Group 5 (Gleason Score 4 + 5). (**b**) Patient at age 68, with a PCa lesion at left posterolateral peripheral zone with Gleason Group 2 (Gleason Score 3 + 4). (**c**) Patient at age 69, with a PCa lesion at right posterolateral peripheral zone with Gleason Group 3 (Gleason Score 4 + 3). ADC = apparent diffusion coefficient; T2WI = T2-weighted imaging"(reprinted with permission from [72], Copyright 2021 John Wiley and Sons).

In a screening study with 3 T bpMRI, Winkel et al. [67] included and analyzed 48 patients, all above 45 years of age. In a biopsy-correlated reading, two human readers and a commercial prototype DL algorithm were compared in terms of the detection of tumor-suspicious lesions and grading according to PIRADS. The DL approach had a sensitivity and specificity of 87% and 50%. Noteworthy, the DL analysis required just 14 s.

Different ML-based models were tested and found to be highly accurate for the diagnosis of TZ PCA (sensitivity/specificity/AUC: 93.2%/98.4%/0.989) and its discrimination from BPH nodules. Reproducibility of segmentation was excellent (DSC 0.84 for tumors and 0.87 for BPH). Subgroup analyses of TZ PCA vs. stromal BPH (AUC = 0.976) and of <15 mm lesions (AUC = 0.990) remained highly accurate [48].


A DL approach for the detection of csPCA in patients under active surveillance was presented by Arif et al. [68]. Initially, 366 low-risk patients were included, of whom 292 were included in the final study. Sensitivities and specificities for csPCA segmentation rose with increasing tumor volume: for tumor volumes > 0.03 cc, sensitivity was 82%, specificity 43%, and AUC 0.65; for tumor volumes > 0.1 cc, sensitivity was 85%, specificity 52%, and AUC 0.73; for tumor volumes > 0.5 cc, sensitivity was 94%, specificity 74%, and AUC 0.89.

A total of six of the included studies compared DL/ML approaches to human radiologists [52,57,60,64,72,75]. Overall, due to the small number of studies and their different approaches, the results cannot be analyzed together. What these studies had in common, however, was the finding that, at this point, AI-based methods revealed a performance similar to that of radiologists. No study could show an advantage of AI methods over radiologists or vice versa. An overview of the results can be seen in Table 2.


**Table 2.** Display of study results comparing human and AI-based performance.

#### *3.2. PIRADS 3 Lesions*

Radiomics can detect csPCA in PI-RADS 3 lesions with high accuracy [59,77]. Hou et al. examined, in an ML-radiomics approach, the ability of bpMRI to identify csPCA in PIRADS 3 lesions in a group of 253 patients with PIRADS 3 lesions in the TZ and PZ, of whom 59 (22.4%) had csPCA [59]. The ML-radiomics approach, including T2w imaging, DWI, and ADC, had an AUC of 0.89 (95% CI 0.88–0.90) for predicting the presence of csPCA in a PIRADS 3 lesion.

#### *3.3. Extracapsular Extension and Biochemical Recurrence*

He et al. set up a large study including 459 patients who underwent 3 T bpMRI before prostate biopsy and/or prostatectomy [69]. The aims of the study were, first, to differentiate between benign and malignant tissue; second, to predict extracapsular extension (ECE) of prostate tumors; and third, to predict positive surgical margins (PSM) after RP. Using radiomics, they developed and tested an algorithm that achieved an AUC of 0.905 for the discrimination of benign and malignant tissue, 0.728 for the prediction of ECE, and 0.766 for the prediction of PSM. Similarly, Hout et al. found an identical AUC of 0.728 for the prediction of ECE in a DL-based approach using different CNN architectures [73]. Hence, one can infer from prostate imaging not only the current situation in the gland but also future developments that might take place under therapy.

Biochemical recurrence (BCR) prediction based on radiomics features was examined in T2w images only, with higher prediction of BCR (C-index 0.802) than conventional scores, in particular higher than the Gleason scoring system (C-index 0.583) [74]. This work is of particular interest because it was one of the few multicentric studies (three centers) with a relatively large number of patients (485), and because it demonstrated the ability of DL-based CNNs to look beyond the prostate and infer predictions on the future course of the disease and patient.

#### **4. Discussion**

Prostate cancer is a growing medical condition, already being the second most common cancer in men in the western world. The detection and grading of prostate cancer are shifting more toward MRI, demanding a higher number of MRI studies to be performed and read. Currently, prostate MRI is considered a specialized exam and requires highly specific experience to be performed and reported with high quality. A first step toward facilitating mpMRI prostate acquisition, reading, and reporting was PIRADS, but surely not the last step [10,12,13]. To put it in a nutshell: prostate MRI is developing from a holy grail that only a few radiologists were able to read competently into a commodity in radiology. This is one of the key drivers behind the growing demand for computer-assisted diagnostic tools, such as tumor detection and grading, to facilitate the diagnostic interpretation of prostate MRI also for less-trained radiologists. As the prostate is a densely packed organ with much more information than, for example, the sparsely packed lung, simple machine learning tools based on, e.g., density differences cannot be successfully employed. To distinguish the different prostatic tissues, such as the normal transitional and peripheral zones and malignant tissue, more highly developed machine learning tools are required, often based on radiomics or even deep learning techniques. In the papers included in this review, most approaches using either ML or DL were similar to radiologists in their performance [49,54,57,64,75]. For some specific applications, such as tumor detection in the TZ or the detection of clinically significant cancers in PIRADS 3 lesions, AI-based methods might even be superior to radiologists' performance [48,59].

These AI-based approaches should enable less well-trained radiologists to read and report prostate MRI with good quality [57,75]. The literature review showed that different approaches to tumor grading and characterization, whether via ML or DL, are capable of differentiating between cancerous and non-cancerous tissue. New approaches are even able to autonomously segment the prostate and the tumor within the gland, overcoming a limitation of the older approaches, in which radiologists often had to manually segment the lesions, a highly time-consuming task [72,75]. Apart from many site-specific implementations of radiomics, ML, and DL, another sign of the maturation of AI-based approaches is that a first commercial tool has already been presented [67]. Compared to the other algorithms, this commercial tool was trained on large data sets for its initial training. This development again underlines the trend toward the commoditization of imaging and the democratization of information technology, enabling every radiologist to deliver high-quality imaging.

Yet, there are some obstacles still to overcome. First, MRI is a tricky imaging tool. A major drawback of MRI is the lack of standard quantification of image intensities. Within the same image, the intensities for the same material vary, as they are affected by bias field distortions and imaging acquisition parameters that are not always perfectly standardized. In addition, not only do MR images taken on different scanners vary in image intensities, but images of the same patient on the same scanner at different times may appear different from each other due to a variety of scanner- and patient-dependent variables [45]. Therefore, the initial step in ML/DL image postprocessing is to normalize the MR intensity [45]. This process could itself induce errors, however. Finally, the reproducibility of CNNs also varies, resulting in interscan differences, though with less impact [78]. Second, most studies rely on single-site source data. Multicentric studies are very rare, making it harder to compare the results of AI-based algorithms across different vendors and sequence parameters. Third, the choice of imaging sequences and their specific parameters is variable. This work focused on bpMRI of the prostate. Even though, for a radiologist, imaging with T2w-imaging and DWI would be seen as biparametric, things look different in the world of AI-based postprocessing: sometimes T2w and ADC, sometimes T2w and a single high b-value, T2w and multiple b-values, or T2, ADC, and b-values were used (hereby neglecting uncommon outlier studies using DCE and T2 or T1 and T2). Even though DWI source data and ADC are based on the same acquisition, their information content seems to differ. It was observed in one study that the use of CHB-DWI led to higher specificity, while the use of ADC led to the highest sensitivity, making the choice of sensing modality useful for different clinical scenarios [79]. For example, maximizing specificity is important when surgery for removal of the prostate is considered, where minimizing false-positive rates is required to avoid unnecessary surgeries. On the other hand, for cancer screening, maximizing sensitivity may be useful to avoid missing cancerous patients [79]. A clear definition of what would be considered truly bpMRI, or standards for AI postprocessing, has not been set up. Yet, there is a first European initiative on the development and standardization of AI-based tools for prostate MRI [44]. Fourth, DL-based CNNs are notorious for being a "black box" in terms of how a decision is reached. While this may not be entirely true (CNNs can be monitored at any level, at some expense), they might never be as transparent as ML-based approaches, deterring some physicians from using them on real patients outside of studies. Here, too, commercialization of the techniques might be helpful, as larger companies have the means and money to certify algorithms with the FDA or the EU and thus make them broadly (commercially) available.
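As a concrete illustration of the intensity-normalization step discussed above, one common recipe is to z-score each volume over a foreground mask. The sketch below shows this recipe under stated assumptions; it is not the method of any specific cited study, and the foreground mask is a crude placeholder.

```python
# Z-score intensity normalization of an MR volume within a body mask,
# making intensities roughly comparable across scanners and sessions.
import numpy as np

def zscore_normalize(volume, mask=None):
    """Normalize to zero mean / unit variance over the masked voxels."""
    voxels = volume[mask] if mask is not None else volume
    return (volume - voxels.mean()) / (voxels.std() + 1e-8)

vol = np.random.default_rng(0).gamma(2.0, 200.0, size=(16, 16, 16))  # toy volume
body = vol > 50.0                         # crude foreground mask (assumption)
normed = zscore_normalize(vol, body)
print(normed[body].mean().round(3), normed[body].std().round(3))  # ~0.0, ~1.0
```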

As seven studies made use of the ProstatEx data, it is worth looking at the overall conclusions published by the creators of the dataset and initiators of the contest [80]: in the first contest (classifying prostate lesions as clinically significant or not), the majority of the 71 submitted methods outperformed random guessing. They conclude that automated classification of clinically significant cancer seems feasible. In the second contest (computationally assigning lesions to a Gleason grade group), only two out of 43 methods did marginally better than random guessing. The creators also conclude that more images and larger data sets with better annotations might be necessary to draw significant conclusions, which again brings up the question of means and money. Another conclusion that can be drawn when looking at the included studies is that 3 T imaging seems to be the standard. This is partly because there is substantial overlap in the source data (ProstatEx) and because, of course, studies are being conducted at university medical centers, which most often have state-of-the-art equipment. Radiology departments in smaller hospitals or private practices are less likely to have a 3 T system. How far the results of 3 T imaging, e.g., DWI, can be transferred to 1.5 T, and how supportive the technological improvement of 1.5 T in the field of signal reception and processing is, remain unclear. One might speculate that a state-of-the-art 1.5 T system will yield image quality comparable to an older 3 T system. Looking at the source data of the different studies, one can roughly estimate that 30% of these were acquired on older (>14 years) 3 T systems.

There are some unexpected studies with novel approaches to patient care that should be highlighted. One was therapy assessment with pre- and post-IMRT T2w-imaging [40] for "delta radiomics", using radiomic features extracted from MR images to predict response in prostate cancer patients. While there was only one study with this specific design, extrapolating ECE or BCR follows roughly the same line of thought: might it not be possible to predict future changes with imaging features measured today [69,73,74]? The AUC values of these studies were unexpectedly high (0.801–0.905), as was the number of included patients.

#### *Limitations*

This review has several limitations that need to be mentioned. First, ML and DL are extremely fast-evolving techniques. The data provided in this review simply display a snapshot of the ongoing development. With ever more powerful hardware and algorithms, future improvements seem likely. Most results are based on small feasibility studies, and larger applications of ML and DL in prostate imaging are not yet available. Whether their results will match the promising initial studies remains unclear. Second, the inclusion criteria were narrow, so that only 29 studies could be included. With the small sample size, different targets, and different foci of the studies, no holistic analysis could be performed. Opening up the time window for the included studies would have led to the inclusion of older techniques, potentially biasing the results.

#### **5. Conclusions**

In summary, this study investigated the current status of bpMRI of the prostate with postprocessing using ML and DL, with a focus on tumor detection and grading. The presented results are very promising in terms of the detection of csPCA and the differentiation of prostate cancer from non-cancerous tissue. ML and DL seem to be equally good in terms of the classification of single lesions according to the PIRADS score. Most approaches, however, rely on human interaction for contouring the lesions. Only a few newer approaches automatically segment the entire gland and its lesions, along with lesion grading according to PIRADS. There is still large variability in methods, and only a few multicentric studies exist. No AI-postprocessing technique is considered the gold standard at this time, although there seems to be a trend toward CNNs. Regarding the actual MRI sequences, most studies used T2w-imaging and either b-values from DWI or the ADC maps from DWI. The application of ML and DL to bpMRI postprocessing, and their assistance in the reading process, surely represent a step into the future of radiology. Currently, however, these techniques remain at an experimental level and are not yet ready or available for broader clinical application.

**Author Contributions:** Conceptualization, H.J.M., D.C., E.N. and G.A.; methodology, H.J.M., D.C., E.N. and G.A.; formal analysis, H.J.M. and G.A.; data curation, H.J.M. and G.A.; writing—original draft preparation, H.J.M. and G.A.; writing—review and editing, H.J.M., D.C., E.N. and G.A.; visualization, H.J.M.; supervision, H.J.M., D.C., E.N. and G.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research for this paper was done within the University of Pisa Master Course in Oncologic Imaging as part of a research trial of the NAVIGATOR project (funded by Bando Salute 2018, Tuscany Region, Italy).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable as only literature research was performed.

**Data Availability Statement:** All data can be found in the original publications as listed above.

**Acknowledgments:** The research for this paper was done within the University of Pisa Master Course in Oncologic Imaging.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**



#### **Appendix A**

**Table A1.** Display of PRISMA items.



#### **References**


### *Article* **Discovery of Pre-Treatment FDG PET/CT-Derived Radiomics-Based Models for Predicting Outcome in Diffuse Large B-Cell Lymphoma**

**Russell Frood 1,2,3,\*, Matthew Clark 2, Cathy Burton 4, Charalampos Tsoumpas 5,6, Alejandro F. Frangi 6,7,8, Fergus Gleeson 7,9, Chirag Patel 1,2 and Andrew F. Scarsbrook 1,2,3**


**Simple Summary:** Diffuse large B-cell lymphoma (DLBCL) is the most common type of lymphoma. Even with improvements in the treatment of DLBCL, around a quarter of patients will experience recurrence. The aim of this single-centre retrospective study was to predict which patients would have recurrence within 2 years of their treatment using machine learning techniques based on radiomics extracted from staging PET/CT images. Our study demonstrated, in a dataset of 229 patients (training data = 183, test data = 46), that a combined radiomic and clinical model performed better than a simple model based on metabolic tumour volume, and that it had a good predictive ability which was maintained when tested on an unseen test set.

**Abstract:** Background: Approximately 30% of patients with diffuse large B-cell lymphoma (DLBCL) will have recurrence. The aim of this study was to develop a radiomic-based model derived from baseline PET/CT to predict 2-year event-free survival (2-EFS). Methods: Patients with DLBCL treated with R-CHOP chemotherapy undergoing pre-treatment PET/CT between January 2008 and January 2018 were included. The dataset was split into training and internal unseen test sets (ratio 80:20). A logistic regression model using metabolic tumour volume (MTV) and six different machine learning classifiers built from clinical and radiomic features derived from the baseline PET/CT were trained and tuned using four-fold cross-validation. The model with the highest mean validation receiver operating characteristic (ROC) curve area under the curve (AUC) was tested on the unseen test set. Results: 229 DLBCL patients met the inclusion criteria, with 62 (27%) having 2-EFS events. The training cohort had 183 patients, with 46 patients in the unseen test cohort. The model with the highest mean validation AUC combined clinical and radiomic features in a ridge regression model, with a mean validation AUC of 0.75 ± 0.06 and a test AUC of 0.73. Conclusions: Radiomics-based models demonstrate promise in predicting outcomes in DLBCL patients.

**Keywords:** diffuse large B-cell lymphoma; lymphoma; predictive modelling; radiomics; machine learning


#### **1. Introduction**

Diffuse large B-cell lymphoma (DLBCL) is the commonest subtype of non-Hodgkin lymphoma (NHL), accounting for approximately 30–40% of adult cases [1]. The gold standard treatment is immunochemotherapy with rituximab, cyclophosphamide, doxorubicin hydrochloride, vincristine (Oncovin), and prednisolone (R-CHOP) [2]. Radiotherapy can be added if there is bulky or residual disease. Prophylactic intrathecal methotrexate or intravenous treatment with chemotherapy that crosses the blood-brain barrier may be included if there is a high risk of central nervous system (CNS) involvement [3]. Even with current therapy regimes, approximately 20–30% of patients will recur following treatment [4,5]. Staging and response assessment are performed using 2-deoxy-2-[fluorine-18]-fluoro-D-glucose (FDG) positron emission tomography/computed tomography (PET/CT). Treatment stratification based on mid-treatment (interim) PET/CT is commonly used in the management of patients with Hodgkin lymphoma but is less established in DLBCL due to the reduced ability to accurately predict treatment outcome mid-treatment in this lymphoma subtype [6,7]. There is increasing interest in the use of PET/CT-derived metrics for treatment stratification at baseline in lymphoma to improve patient outcomes. A number of groups have explored the potential utility of baseline metabolic tumour volume (MTV) for predicting event-free survival (EFS), with promising results, but this has yet to be adopted clinically [8–17]. Others have explored the potential utility of radiomic features extracted from PET/CT for modelling purposes [8,18]. Initial results are promising; however, the published studies, with relatively small numbers of patients, are heterogeneous.

The aim of this study was to develop and test models combining baseline clinical information and radiomic features extracted from PET/CT imaging in DLBCL patients to predict 2-year EFS (2-EFS) using data from our tertiary centre. The secondary aim was to compare model performance with the predictive ability of baseline MTV.

#### **2. Materials and Methods**

The transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines were adhered to as part of this study (Supplementary Material).

#### *2.1. Patient Selection*

Radiological and clinical databases were retrospectively reviewed to identify patients who underwent baseline PET/CT for DLBCL at our institution between January 2008 and January 2018. A cut-off of January 2018 was chosen to allow a minimum of 2 years' follow-up without interference or confounding factors introduced by the COVID-19 pandemic. Patients were excluded if they did not have DLBCL, were under 16 years of age, had no measurable disease on PET/CT, had hepatic involvement, had a concurrent malignancy, were not treated with R-CHOP or if the images were degraded or incomplete. A 2-EFS event was defined as disease progression, recurrence or death from any cause within the 2-year follow-up period.

#### *2.2. PET/CT Acquisition*

All imaging was performed as part of routine clinical practice. Patients fasted for 6 h prior to administration of intravenous Fluorine-18 FDG (4 MBq/kg). PET acquisition and reconstruction parameters for the four scanners used at our institution are detailed in Table 1. Attenuation correction was performed using a low-dose unenhanced diagnostic CT component acquired using the following settings: 3.75 mm slice thickness; pitch 6; 140 kV; 80 mAs.


**Table 1.** Reconstruction parameters for the different scanners used.

BLOB-OS-TF = an ordered subset iterative TOF reconstruction algorithm using blobs instead of voxels; DLYD = delayed event subtraction; OSEM = ordered subsets expectation maximisation; SS-Simul = single-scatter simulation; VPFX = Vue Point FX (OSEM including point spread function and time of flight).

#### *2.3. Image Segmentation*

All PET/CT images were reviewed and contoured using a specialised multimodality imaging software package (RTx v1.8.2, Mirada Medical, Oxford, UK). FDG-positive disease segmentation was performed by either a clinical radiologist with six years' experience or a research radiographer with two years' experience. Contours were then reviewed by dual-certified Radiology and Nuclear Medicine Physicians with >15 years' experience of oncological PET/CT interpretation. Any discrepancies were resolved by consensus.

Two different semi-automated segmentation techniques were used. The first applied a fixed standardised uptake value (SUV) threshold of 4.0, and the second used a threshold derived from 1.5 times mean liver SUV. The 4.0 SUV threshold was selected based on previous work assessing different segmentation techniques in a cohort of DLBCL patients by Burggraaff et al., which found it had a higher interobserver reliability [19] and requires less adaptation than techniques such as 41% SUVmax. The 1.5 times mean liver SUV threshold was chosen as an adaptive technique which has been used in different cancer types [20,21] and which takes into consideration background SUV uptake, which can vary between individuals. Mean liver SUV was calculated by placing a 110 cm<sup>3</sup> spherical region of interest (ROI) in the right lobe of the liver. The PET image contour was translated to the CT component of the study with the contours matched to soft tissue with a value of −10 to 100 Hounsfield units (HU). Contours were saved and exported as digital imaging and communications in medicine (DICOM) radiotherapy (RT) structures. Both the images and contours were converted to Neuroimaging Informatics Technology Initiative (NIfTI) files using the python library SimpleITK (v2.0.2) (https://simpleitk.org/, accessed on 1 December 2021).
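
For illustration, the sketch below shows how the two thresholding techniques could be implemented with SimpleITK, assuming the PET volume has already been converted to SUV units and the liver ROI mask is supplied separately; the function and variable names are hypothetical, not the study's code.

```python
# Hypothetical sketch of the two semi-automated thresholding techniques
# described above; assumes the PET volume is already in SUV units and a
# spherical liver ROI mask has been placed separately.
import SimpleITK as sitk

def segment_fixed_threshold(pet_suv, threshold=4.0):
    """Binary mask of all voxels with SUV >= threshold (fixed 4.0 SUV technique)."""
    return sitk.BinaryThreshold(pet_suv, lowerThreshold=threshold,
                                upperThreshold=1e6, insideValue=1, outsideValue=0)

def segment_liver_adaptive(pet_suv, liver_roi_mask, factor=1.5):
    """Adaptive mask thresholded at `factor` times the mean SUV in the liver ROI."""
    suv = sitk.GetArrayFromImage(pet_suv)
    liver = sitk.GetArrayFromImage(liver_roi_mask).astype(bool)
    threshold = factor * float(suv[liver].mean())  # 1.5 x mean liver SUV
    return segment_fixed_threshold(pet_suv, threshold)
```

The resulting binary mask could then be transferred to the CT component and exported as an RT structure, as described above.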

#### *2.4. Feature Extraction*

Feature extraction was performed using PyRadiomics (v2.2.0) (https://pyradiomics.readthedocs.io/en/latest/index.html, accessed on 1 December 2021). Both the CT and PET images were resampled to a uniform voxel size of 2 mm<sup>3</sup>. Radiomic features were extracted from the entire segmented disease for each patient. A fixed bin width of 2.5 HU was used for the CT component. Two different bin widths were used when extracting the radiomic features from the PET component: the first was derived by finding the contour with the maximum range of SUVs and dividing this range by 130; the second by dividing the maximum range by 64. This methodology was based on previous work by Orlhac et al. and on the PyRadiomics documentation [22]. First and second order features were extracted from both the PET and CT components. Further higher order features were explored by extracting the first and second order features following application of wavelet, log-sigma, square, square root, logarithm, exponential, gradient and local binary pattern (lbp)-3D filters to the images. All the features extracted and the filters applied are detailed in Table S1. The mathematical definition of each of the radiomic features can be found within the PyRadiomics documentation [23]. PyRadiomics deviates from the image biomarker standardisation initiative (IBSI) by applying a fixed bin width starting from 0 rather than from the minimum segmentation value, and by calculating first order kurtosis with a +3 offset [24,25]. Otherwise, PyRadiomics adheres to IBSI guidelines. Patient age, disease stage and sex were also included as clinical features in the models. Disease stage and sex were dummy encoded using Pandas (v1.2.4) (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.4.html, accessed on 1 December 2021). This resulted in a total of 3935 features extracted per patient. ComBat harmonisation was applied to account for the different scanners used within the study (https://github.com/Jfortin1/ComBatHarmonization, accessed on 1 December 2021) [26].
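
As a rough illustration of this configuration, a PyRadiomics extractor could be set up as below; the maximum SUV range and the file paths are placeholder assumptions, and only the wavelet filter is enabled for brevity.

```python
# Illustrative PyRadiomics configuration along the lines described above;
# `max_suv_range` and the file paths are placeholders, not study values.
from radiomics import featureextractor

max_suv_range = 30.0  # assumed maximum range of SUVs across contours
extractor = featureextractor.RadiomicsFeatureExtractor(
    resampledPixelSpacing=[2, 2, 2],   # resample to 2 mm isotropic voxels
    binWidth=max_suv_range / 64,       # bin width from maximum SUV range / 64
)
extractor.enableAllFeatures()                # first- and second-order features
extractor.enableImageTypeByName("Wavelet")   # higher-order wavelet-filtered features

features = extractor.execute("pet.nii.gz", "contour.nii.gz")  # image + mask NIfTI paths
```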

#### *2.5. Machine Learning*

The dataset was split into a training and test set stratified around 2-EFS, disease stage, age and sex with an 80:20 split using scikit-learn (v0.24.2) (https://scikit-learn.org/stable/whats\_new/v0.24.html, accessed on 1 December 2021). Concordance between the demographics of the training and test groups was assessed using a *t*-test for continuous data and a χ<sup>2</sup> test for categorical data. A *p*-value of <0.05 was regarded as significant. Continuous data were normalised using a standard scaler (scikit-learn v0.24.2) which was trained and fit on the training set and subsequently applied to the test set. Highly correlated features were removed from the training and test sets if they had a Pearson coefficient over 0.8. This reduced the number of features from 3935 down to 130 per patient.
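
A minimal sketch of this preprocessing with scikit-learn and pandas follows; `X`, `y` and `strata` are assumed placeholder objects (the feature table, 2-EFS labels and a combined stratification label), not the study's actual variables.

```python
# Sketch of the split, scaling and correlation pruning described above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=strata, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit on the training set only
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# drop one feature from every pair with Pearson correlation above 0.8
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
X_train, X_test = X_train.drop(columns=to_drop), X_test.drop(columns=to_drop)
```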

Six different machine learning (ML) classifiers were used: logistic regression with lasso, ridge and elasticnet penalties, support vector machine (SVM), random forest and k-nearest neighbour. A maximum of five features was included within each model, apart from the lasso and elasticnet models, where these classifiers determined the optimum number of features. The maximum of five features was chosen to avoid false discoveries (Type 1 errors), guided by the rule of one feature per ten events within the training set. Feature selection for the remaining models was performed using three different methods: a forward wrapper method (mlxtend 0.18.0), a univariate analysis method (scikit-learn v0.24.2) and a recursive feature elimination method (where applicable) (scikit-learn v0.24.2). Each method was used to create lists of from two up to the maximum of five features to be explored in the training phase. The features selected were based on the highest mean receiver operating characteristic (ROC) curve area under the curve (AUC) in a four-fold stratified cross validation with 25 repeats.
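
The forward wrapper method, for example, might look like the following mlxtend sketch; the estimator and settings shown are assumptions for illustration rather than the study's exact configuration.

```python
# Illustrative forward-wrapper feature selection, one of the three methods
# named above; the paper also used univariate selection and recursive
# feature elimination, and repeated its four-fold CV 25 times.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression

sfs = SFS(LogisticRegression(penalty="l2", solver="liblinear"),
          k_features=(2, 5),        # explore from two up to the maximum of five
          forward=True,
          scoring="roc_auc",
          cv=4)                     # four-fold CV (repeats omitted for brevity)
sfs = sfs.fit(X_train, y_train)
selected = list(sfs.k_feature_names_)
```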

Training of the ML models and tuning of hyperparameters were performed using a grid search with four-fold cross validation stratified around 2-EFS with 25 repeats. The hyperparameters explored within the grid search are detailed in Table S2. The features and hyperparameters with the highest mean validation AUC that was within 0.05 of the mean training AUC were selected. A 0.05 cut-off was chosen to try and minimise selection of an overfitted model. The model with the highest mean validation AUC overall was tested once on the unseen test set. The Youden index was used to determine the optimum cut-off value from the ROC curve, and the accuracy, sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) were calculated from this for the unseen test set. The pipeline for patient inclusion, feature selection and predictive model creation and testing is depicted in Figure 1.
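
A hedged sketch of such a tuning loop in scikit-learn is shown below; the hyperparameter grid is a placeholder and `selected` is the feature list from the selection step above.

```python
# Sketch of the tuning described above: grid search over a ridge (L2)
# logistic regression with stratified four-fold CV repeated 25 times,
# keeping candidates whose mean validation AUC sits within 0.05 of the
# mean training AUC.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=25, random_state=0)
grid = GridSearchCV(LogisticRegression(penalty="l2", solver="liblinear"),
                    param_grid={"C": [0.01, 0.1, 1, 10]},  # placeholder grid
                    scoring="roc_auc", cv=cv, return_train_score=True)
grid.fit(X_train[selected], y_train)

results = grid.cv_results_
within_margin = (results["mean_train_score"] - results["mean_test_score"]) <= 0.05
```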

Given the growing evidence surrounding MTV as a predictor of outcome, two further logistic regression models were derived from the MTVs obtained using the two different segmentation techniques. A comparison of results across the cross validation splits between the radiomic model with the highest mean AUC and the MTV model with the highest mean AUC was performed using a Wilcoxon signed-rank test.
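
This comparison can be expressed in a couple of lines with SciPy, assuming the two arrays hold the per-split validation AUCs aligned split-for-split:

```python
# Paired comparison of the 100 cross-validation AUCs (4 folds x 25 repeats)
# between the radiomic and MTV models; both AUC arrays are placeholders.
from scipy.stats import wilcoxon

stat, p_value = wilcoxon(radiomic_cv_aucs, mtv_cv_aucs)
```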

**Figure 1.** Pathway for patient inclusion, feature selection and model creation. \* = initially applied to the training data and then to the test data.

#### **3. Results**

A total of 229 DLBCL patients met the inclusion criteria (136 male, 93 female) with 62 2-EFS events. There were 183 patients within the training cohort and 46 patients in the unseen test cohort. No statistically significant differences were identified between the training and test sets (Table 2).

None of the machine learning models created using elasticnet regression, lasso regression or k-nearest neighbour algorithms had a mean validation AUC within 0.05 of the mean training AUC. The remaining model results are presented in Tables 3 and 4.


**Table 2.** Demographics of the training and testing groups.

2-EFS = 2-year event free survival. The *p*-values were calculated using a *t*-test for age and a χ<sup>2</sup> test for the remaining demographic features.

**Table 3.** Mean training and validation scores for the best performing machine learning models using the 4.0 SUV threshold segmentation technique.


l2 = Ridge regression penalty, liblinear = A library for large linear classification, GLSZM = grey level size zone matrix, GLDM = grey level dependence matrix, lbp-3D-m1 = local binary pattern filtered image at level 1, lbp-3D-k = local binary pattern kurtosis image, GLCM = grey level co-occurrence matrix, rbf = radial basis function.

**Table 4.** Mean training and validation scores for the best performing machine learning models using the 1.5 times mean liver SUV thresholding segmentation technique.


l2 = Ridge regression penalty, liblinear = A library for large linear classification, GLSZM = grey level size zone matrix, GLDM = grey level dependence matrix, lbp-3D-m1 = local binary pattern filtered image at level 1, lbp-3D-k = local binary pattern kurtosis image, GLCM = grey level co-occurrence matrix, rbf = radial basis function.

The model with the highest mean validation ROC AUC was the ridge regression model created using radiomic features extracted from a fixed threshold of 4.0 SUV segmentation using a bin width of the maximum range of SUVs divided by 64. The mean training AUC was 0.77 ± 0.02, the mean validation AUC was 0.75 ± 0.06 and the AUC when tested on the unseen dataset was 0.73 (Figure 2). The features selected, with their coefficients and intercept, are presented in Table 5. A threshold of 0.5 was chosen and led to an accuracy of 0.70, sensitivity of 0.44, specificity of 0.86, positive predictive value of 0.67, and a negative predictive value of 0.71. The confusion matrix is presented in Table 6.
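
For reference, the Youden index mentioned in the methods can be derived from a ROC curve as in the sketch below; `y_true` and `y_prob` are placeholder label and predicted-probability arrays.

```python
# Illustrative Youden index computation from a ROC curve.
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
youden_j = tpr - fpr                      # J = sensitivity + specificity - 1
optimal_cutoff = thresholds[youden_j.argmax()]
```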

The logistic regression model created solely from MTV using the 4.0 SUV fixed threshold segmentation technique had a mean training AUC of 0.66 ± 0.03 and a mean validation AUC of 0.66 ± 0.08. The logistic regression model derived from MTV using the 1.5 times mean liver SUV segmentation technique had a mean training AUC of 0.67 ± 0.03 and a mean validation AUC of 0.67 ± 0.08. There was a statistically significant difference when comparing the cross validation AUCs for the 100 splits between the highest performing MTV-based model and the radiomic-based ridge regression model, *p* < 0.001 (Figure 3).

**Figure 2.** ROC Curve of the training and unseen test data AUCs for the model derived using a 4.0 SUV thresholding segmentation technique with a bin width derived from SUVmax/64.


**Table 5.** The features selected and their associated coefficients and intercept in the ridge regression model tested on the unseen test dataset.

**Table 6.** Confusion matrix for the threshold of 0.5.


Positive = recorded 2-EFS event, Negative = no recorded 2-EFS event, Predicted Positive = predicted to have had a 2-EFS event, Predicted Negative = predicted to not have had a 2-EFS event.

**Figure 3.** Mean ROC Curve of the MTV-based logistic regression model and the radiomic-based logistic regression model.

#### **4. Discussion**

Our study found that a prediction model combining clinical and radiomic features derived from pre-treatment PET/CT using a ridge regression model had the highest mean validation AUC when predicting 2-EFS in DLBCL patients. This model had significantly higher validation AUCs than those achieved by a model solely derived from MTV and achieved an AUC of 0.73 on the unseen test set. The radiomic features used within the model that led to the highest mean validation AUC were extracted from a segmentation derived from a fixed threshold of 4.0 SUV using a bin width calculated from the maximum range of SUVs divided by 64. The model was formed using five features (Stage Four, PET original GLSZM large area emphasis, PET wavelet-HHL GLSZM Small Area Emphasis, PET wavelet-HHH GLSZM Grey Level Non-Uniformity normalised, PET square 10th percentile).

The biological correlate of radiomic features and how these relate to the lesion or disease process can often be overlooked, and can become more complex when image filtering is involved [27]. Three of the radiomic features included in the best model were derived from the GLSZM, which is a matrix formed by the number of connected voxels with the same grey level intensity. The first was PET GLSZM Large Area Emphasis, which is a measure of the distribution of large area size zones and was extracted from the PET data without any filter applied. This feature is higher in lesions which have a coarser texture on the original image. The other two GLSZM features are calculated after applying a wavelet filter. Wavelet filters highlight or suppress certain spatial frequencies within an image. In PyRadiomics a combination of high- and low-pass filters is passed in each of the different dimensions, which results in eight different decompositions. PET wavelet-HHL GLSZM Small Area Emphasis is a measure of the distribution of small size zones, which is higher in lesions with fine textures following the application of the wavelet filter. PET wavelet-HHH GLSZM Grey Level Non-Uniformity is a measure of the variability of the grey level intensity within the image; a lower value indicates a higher number of similar SUVs on the wavelet filtered image. The last radiomic feature included was PET square 10th percentile, which is the 10th percentile value of the SUV after the square of the image SUVs has been taken and normalised to the original SUV range. Interestingly, none of the CT-derived radiomic features were selected as part of the best performing radiomic models. This is likely due to the transposition of the segmentations from the PET on to the unenhanced CT including more areas of non-lymphomatous tissue.

Other studies which have explored the use of radiomic features in outcome prediction in DLBCL are not always directly comparable [12,28–32]. This is mainly due to differences in segmentation methodology, modelling techniques and outcome measures between groups. Aide et al. studied the use of radiomic features in predicting 2-EFS in 132 patients (training = 105, validation = 27) and found that MTV as well as four second-order metrics and five third-order metrics were selected from ROC analyses. However, long-zone high-grey level emphasis was the only independent predictor when analysed with the international prognostic index (IPI) and MTV [29]. In our study long-zone high-grey level emphasis was discarded when checking for multicollinearity. This highlights a potential issue of radiomic model development when applying a methodology to different datasets. It may be that the same features would be chosen between the different datasets, but each method removes the alternate correlated feature and, therefore, appears to create an entirely new model. Both Zhang et al. and Ceriani et al. used lasso in their Cox regression models to select the most appropriate features [31,32]. Zhang et al., in a study of 152 patients (training = 100, validation = 52) treated with R-CHOP or R-EPOCH (rituximab, etoposide, prednisone, vincristine, cyclophosphamide, and doxorubicin), found that a survival model created with radiomic features and MTV had a validation time-dependent ROC AUC of 0.748 (95% CI 0.596–0.886). A model created with radiomic features and metabolic bulk volume had a validation time-dependent ROC AUC of 0.759 (95% CI 0.595–0.888). Ceriani et al. reported that a radiomic score derived from a training set of 133 patients and tested on an external dataset of 107 patients had an AUC of 0.71 in both the test and validation datasets. The features selected within their Cox regression model were GLCM sum squares, maximum 3D diameter, GLDM grey level variance and GLSZM grey level non-uniformity normalised.

In our study both the lasso and elasticnet methods failed to produce a model that achieved mean training and validation scores within 0.05 of each other. Even when allowing for a more generous difference between the training and validation scores, mean validation scores remained below 0.65. The 0.05 cut-off is arbitrary and was applied to try and reduce the impact of overfitting and allow selection of a potentially more generalisable model. Despite this, there is still a risk that both training and validation datasets are overfitted, and the model would need validation on an external dataset.

One of the largest published studies, by Decazes et al. in 215 DLBCL patients, explored the use of tumour volume surface ratio and total tumour surface as outcome predictors for 5-year progression free survival (PFS), but found that MTV outperformed both features, with MTV having an AUC of 0.67 [12]. This AUC for MTV is similar to the findings in our study, with the mean validation AUC for MTV prediction of 2-EFS being 0.66 for the 4.0 SUV threshold and 0.67 for the 1.5 times liver threshold segmentation techniques, respectively. Although there is growing interest in the use of MTV as an imaging biomarker, Adams et al. reported, in a study of 73 DLBCL patients, that the prognostic ability of MTV does not add anything to that of the clinical scoring system National Comprehensive Cancer Network-International Prognostic Index (NCCN-IPI) [33]. Unfortunately, due to missing clinical data it was not possible to compare IPI performance in our patient cohort. However, this does highlight the potential impact of confounders on the generalisability of predictive models. Although causality is not generally considered in predictive modelling, its use in future models could allow for greater transparency. The issues of generalisability may be compounded by learnt biases towards groups of patients in the training process.

The TRIPOD checklist was completed to increase transparency of model development [34,35]. However, there are limitations to our study, including its retrospective nature and uncertainty surrounding the exact timing and recording of recurrence. Use of 2-EFS partially mitigates against this by allowing a wider window for the relapse to be recorded; however, it does mean that data which could have been included in a time-to-event survival model are lost. 2-EFS was chosen as the majority of patients relapse within the first 2 years. Time-to-event ML models could be used in future studies to reduce the need to exclude data. The lesions were not re-segmented as part of the study and, therefore, calculations of inter- or intra-observer reliability, as well as robustness of the features, have not been performed. ComBat harmonisation was used to help mitigate against scanner variation in the extracted features. However, this limits the ability to apply this model prospectively to patients not scanned using a protocol used to train the model. Lack of clinical data surrounding the IPI and cell of origin (COO) meant that these could not be used as direct comparators to the radiomic models created.

#### **5. Conclusions**

A combined clinical and PET/CT-derived radiomics model using ridge regression demonstrated the highest mean validation AUC (0.75) when predicting 2-EFS in DLBCL patients treated with R-CHOP, outperforming a model derived solely from MTV (AUC = 0.67).

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14071711/s1, TRIPOD Checklist: Prediction Model Development, Table S1: Radiomic features extracted for both the PET and CT components, Table S2: The hyperparameters explored within the grid search.

**Author Contributions:** Conceptualization, R.F., M.C., C.B., C.T., A.F.F., F.G., C.P. and A.F.S.; methodology, R.F., M.C., C.P. and A.F.S.; formal analysis, R.F., M.C., C.P. and A.F.S.; investigation, R.F. and A.F.S.; data curation, R.F., M.C. and C.B.; writing—original draft preparation, R.F. and A.F.S.; writing—review and editing, R.F., M.C., C.B., C.T., A.F.F., F.G., C.P. and A.F.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** R.F., F.G., C.P. and A.F.S. are partly funded by Innovate UK via the National Consortium of Intelligent Medical Imaging (NCIMI) (104688), and A.F.F. is funded by the Royal Academy of Engineering (CiET1819/19). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Institutional Review Board Statement:** Following discussion with the Research and Innovation Department at LTHT it was agreed that this represented a service improvement project. The study was approved by the University of Leeds School of Medicine Research Ethics Committee (SoMREC) (MREC 19-043).

**Informed Consent Statement:** Informed written consent was obtained prospectively from all patients at the time of imaging for use of their anonymised FDG PET/CT images in research and service development projects.

**Data Availability Statement:** The data are not publicly available due to institutional data sharing restrictions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Development of an Image Analysis-Based Prognosis Score Using Google's Teachable Machine in Melanoma**

**Stephan Forchhammer <sup>1,\*</sup>, Amar Abu-Ghazaleh <sup>1</sup>, Gisela Metzler <sup>2</sup>, Claus Garbe <sup>1</sup> and Thomas Eigentler <sup>3</sup>**


**Simple Summary:** The increase in adjuvant treatment of melanoma patients makes it necessary to provide the most accurate prognostic assessment possible, even at early stages of the disease. Although conventional risk stratification correctly identifies most patients in need of adjuvant treatment, there are some patients who, despite having a low tumor stage, have a poor prognosis and could therefore benefit from early therapy. To close this gap in prognosis estimation, deep learning-based image analyses of histological sections could play a central role in the future. The aim of this study was to investigate whether such an analysis is possible using only basic image analysis of 831 H&E-stained melanoma sections with Google's Teachable Machine. Although the classification obtained does not provide a prognostic estimate additional to conventional melanoma classification, this study shows that prognostic prediction is possible at the mere cellular image level.

**Citation:** Forchhammer, S.; Abu-Ghazaleh, A.; Metzler, G.; Garbe, C.; Eigentler, T. Development of an Image Analysis-Based Prognosis Score Using Google's Teachable Machine in Melanoma. *Cancers* **2022**, *14*, 2243. https://doi.org/10.3390/cancers14092243

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 4 April 2022 Accepted: 28 April 2022 Published: 29 April 2022


**Abstract:** Background: The increasing number of melanoma patients makes it necessary to establish new strategies for prognosis assessment to ensure follow-up care. Deep-learning-based image analysis of primary melanoma could be a future component of risk stratification. Objectives: To develop a risk score for overall survival based on image analysis through artificial intelligence (AI) and validate it in a test cohort. Methods: Hematoxylin and eosin (H&E) stained sections of 831 melanomas, diagnosed from 2012 to 2015, were photographed and used to perform deep-learning-based group classification. For this purpose, the freely available software Google's Teachable Machine was used. Five hundred patient sections were used as the training cohort, and 331 sections served as the test cohort. Results: Using Google's Teachable Machine, a prognosis score for overall survival could be developed that achieved a statistically significant prognosis estimate with an AUC of 0.694 in a ROC analysis, based solely on image sections of approximately 250 × 250 μm. The prognosis group "low-risk" (*n* = 230) showed an overall survival rate of 93%, whereas the prognosis group "high-risk" (*n* = 101) showed an overall survival rate of 77.2%. Conclusions: The study supports the possibility of using deep learning-based classification systems for risk stratification in melanoma. The AI assessment used in this study provides a significant risk estimate in melanoma, but it does not considerably improve the existing risk classification based on the TNM classification.

**Keywords:** melanoma; prognosis; risk score; deep learning; artificial intelligence; Google's Teachable Machine

#### **1. Introduction**

Over the past years and decades, there has been a significant increase in the incidence of malignant melanoma [1]. Despite major advances in the treatment of metastatic melanoma, including targeted therapy with BRAF inhibitors or immune checkpoint blockade, malignant melanoma remains the skin tumor responsible for the highest number of skin tumor-associated deaths worldwide, with approximately 55,500 cases [2]. Histologically, different subtypes of malignant melanoma can be distinguished. According to the currently valid World Health Organization Classification published in 2018, a distinction is made between melanomas that typically occur in chronically sun-damaged (CSD) skin and those that typically do not. These differ in the underlying genetic pathways. The most common subtypes, superficial spreading melanoma (low-CSD) and lentigo maligna melanoma (high-CSD), but also desmoplastic melanoma, are found in association with sun-damaged skin. Representatives of melanomas that do not occur in chronically sun-damaged skin (no-CSD) are acral, mucosal and uveal melanomas, Spitz melanomas, and melanomas originating from congenital nevi or blue nevi. Nodular melanomas, on the other hand, can be found in both groups with different underlying genetic pathways [3].

Prognosis prediction and staging of melanoma are mainly based on histologic diagnosis in primary tumors. In this context, tumor thickness (according to Breslow) and ulceration are included in the 8th AJCC classification [4]. Additional histological features, such as regression and mitotic rate, have an established impact on prognosis [5–7]. Other prognostic factors result from the primary staging diagnosis, which includes sonography, CT section imaging and sentinel node biopsy depending on the stage [4,8]. With the advent of adjuvant therapy options for patients with high-risk tumors, the most accurate prognostic prediction possible is already necessary for the primary tumor. Since adjuvant immune checkpoint therapy has a non-negligible side effect profile, it is crucial to identify patients who may particularly benefit from such therapy. Various gene-expression-based assays are in development to distinguish high-risk patients from those with only low risk of metastasis [9–12]. However, these assays are cost-intensive and therefore cannot yet be widely used. In addition, these examinations consume tissue that may be needed for further diagnostic workup. The morphology of melanoma already shows a very high diversity in the H&E section, which goes far beyond the detection of tumor thickness, ulceration, mitotic rate and regression. A grading, which is common for most other tumor types, such as cutaneous squamous cell carcinoma, does not exist for melanoma.

With the onset of digitalization in pathology, artificial intelligence (AI)-based image analysis has created new opportunities in the evaluation of histological sections. AI is a set of technologies that enables computer systems to acquire intelligent capabilities. One branch of AI is the concept of machine learning, which gives computers the ability to learn without being explicitly programmed [13]. Deep learning, which is popular today, is characterized by greater network depth in terms of multiple layers of neurons; therefore, it makes it possible to learn and solve even complex tasks. Remarkably, artificial neural networks make this possible without having to deposit specific rules or instructions beforehand [14]. It has been shown that programs based on artificial intelligence are able to achieve high diagnostic accuracy in diagnosing melanoma from dermatoscopic images [15–18].
Diagnosis by artificial neural networks also seems to lead to very reliable results for histological sections. For epithelial skin tumors, but especially for melanomas, programs have been developed that enable robust diagnostics [19–24]. More exciting, though, is the question of whether image analysis with artificial neural networks can not only confirm a diagnosis but also detect subvisual structures or patterns in histological sections, leading to improved prognosis assessment. First studies in melanoma have shown that it may be possible to achieve prognosis prediction, prediction of sentinel positivity and prediction of response to immunotherapy using artificial intelligence-assisted image analysis [25–27]. In particular, the work of Kulkarni et al. was able to make an impressive prognosis prediction based on image analysis; however, a complex algorithm was used here which, in addition to the mere morphological tumor cell information, evaluates in part the distribution of inflammatory cells [26]. Since a clear impact on melanoma prognosis has been well studied, especially for tumor-infiltrating lymphocytes, it remains unclear whether a purely morphological image analysis of tumor cells allows melanoma prognosis [28–30].

The aim of our study was to develop a prognosis score based purely on histological photographs to predict survival in melanoma. Since our score should be made publicly available, easy to use and based solely on morphological image information, Google's Teachable Machine was used as a deep learning program. This is a pre-trained neural network for image analysis that allows the classification of images into certain groups after previous training [31]. Google's Teachable Machine uses the basic framework of TensorFlow, a platform released in 2015 that was created to make artificial intelligence and its training accessible to the public. The use of this program has already been investigated in initial studies of medical image analysis [32].
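
For orientation, a rough Keras analogue of this kind of transfer-learning setup is sketched below: a frozen MobileNet-style feature extractor with a small trainable classification head. This is an illustrative stand-in, not Teachable Machine's actual internals; the commented `fit` call mirrors the training settings reported in Section 2.2.

```python
# Illustrative transfer-learning classifier in the spirit of Teachable Machine:
# a frozen pre-trained backbone plus a small trainable head for two classes.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         input_shape=(224, 224, 3))
base.trainable = False                                # keep pre-trained features fixed
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),   # two classes: "alive" / "dead"
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=1000, batch_size=16)
```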

#### **2. Materials and Methods**

#### *2.1. Study Population*

All 2223 patients diagnosed with primary melanoma at the University Dermatological Clinic Tübingen between 1 January 2012 and 31 December 2015 who provided written informed consent to the nationwide melanoma registry were considered for the study. The 831 patients with follow-up data of at least 2 years and histological sections in our archive were included in the further analysis. The group "dead" consists of all patients who died due to melanoma during the observation period of up to 114 months. The group "alive" consists of all patients who were alive, lost to follow-up or died of another cause. Alive patients with follow-up of less than 2 years were excluded from the study. The diagnosis of melanoma was made by at least two experienced, board-certified dermatopathologists (SF, GM).

#### *2.2. Digitization of H&E Sections and AI-Based Evaluation*

All H&E sections of primary melanoma were photographed at the site of the highest tumor thickness according to Breslow using 100× magnification (Figure 1). Pictures were taken using a Nikon Eclipse 80i microscope mounted with a Nikon Digital Sight DS-FI2 camera. The program Nikon NIS Elements D Version 4.13.04 was used, and the exposure time was set to 3 ms. The data were saved in JPG format. Images were analyzed using Google's Teachable Machine, a pre-trained neural network [31]. Sixty percent of the 831 images served as the training cohort, and 40% of the images were subsequently evaluated as the test cohort. The allocation of the 500 images to the training cohort or the 331 images to the test cohort was random. The training dataset contains images that were only used for training Google's Teachable Machine. An analysis of these data was not performed later. Of these 500 patients, 429 were alive; thus, these images were used for the training of the "alive" group. Of the 500 patients, 71 were deceased; these were used for the training of the group "dead". The training was carried out twice and separately for the groups "whole images" and "area of interest". The training curves for accuracy and loss were obtained for both training groups and are shown in Figures S1 and S2. The model that emerged from the initial training was used for further assessment. As Google's Teachable Machine does not provide a verification function, a separate set of verification data was not assigned. The remaining 331 patients were used as the test cohort. The images of these patients were not previously seen by Google's Teachable Machine. These 331 patient images were then classified by the program into the categories "dead" and "alive". Patients who were classified as "dead" were given the label "high-risk" in the further study, and patients who were classified as "alive" were given the label "low-risk".

**Figure 1.** H&E section of a malignant melanoma. (**a**) overview with annotation (star) of the highest tumor thickness (Breslow). The scale is 500 μm. (**b**) Magnification of (**a**) (see square in (**a**)). The image represents one picture of the category "whole image". The scale is 100 μm. (**c**) Magnification of (**b**) (see square in (**b**)). This image represents one picture of the category "area of interest". The scale is 30 μm.

The evaluations of the whole images or the "area of interest" images were performed separately. During the evaluation of whole images, the uploaded images in landscape format 4:3 were cut by Google's Teachable Machine into a square format. To balance the training groups "alive" and "dead", the images of the group "dead" were used 6 times.

For the "area of interest" evaluation, representative image sections of about 250 × 250 μm were selected from the images by a dermatopathologist showing representative tumor areas (file size from 103 kB to 622 kB). Whenever possible, we selected representative areas from the dermal tumor compartment. Only in cases with a very small tumor thickness were areas with an epidermal component included (see Figure 1). To balance the training groups "alive" and "dead" the images of the group "dead" were cut into 6 representative tumor areas. In the advanced settings of the "Teachable Machine" the epochs were set to 1000, the batch size to 16 and the learning rate to 0.001. The 334 images of the test cohort were uploaded individually, and the group allocation of Google's Teachable Machine and the indicated percentage were collected.

#### *2.3. Statistics*

Statistical calculations were performed using IBM SPSS Statistics Version 23.0 (IBM SPSS, Chicago, IL, USA). Numerical variables were described by mean value and standard deviation or median values and interquartile range (IQR). Receiver operating characteristic (ROC) curve analyses and corresponding *p*-value calculations were performed using the ROC-Analysis tool in SPSS. *p*-values in Kaplan–Meier curves were calculated using the log-rank (Mantel-Cox) test. Throughout the analysis, *p* values < 0.05 were considered statistically significant.
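
For readers working in Python rather than SPSS, the two analyses could be approximated as below with scikit-learn and lifelines; all variable names are placeholders.

```python
# Hedged Python equivalents of the analyses named above (the study used SPSS):
# ROC/AUC via scikit-learn and the log-rank test via lifelines.
from sklearn.metrics import roc_auc_score
from lifelines.statistics import logrank_test

auc = roc_auc_score(died, classifier_score)            # ROC analysis
result = logrank_test(times_low_risk, times_high_risk,
                      event_observed_A=events_low_risk,
                      event_observed_B=events_high_risk)
print(auc, result.p_value)                             # p < 0.05 considered significant
```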

#### **3. Results**

To create a prognosis score for melanoma, 60% (*n* = 500) of the images were used as a training cohort. For this purpose, the images were categorized as "alive" and "dead", according to the actual survival of the patients. Google's Teachable Machine was used to create an algorithm from these training groups, which was then applied to the test cohort. The training curves of the models showed an overfitting (see Figure S1); therefore, the training was repeated with a new randomized training set to avoid possible bias caused by the grouping (Figure S2). Since the repetition also showed comparable overfitting, the evaluation was continued with the initial trained model. Subsequently, the remaining 40% of the images (*n* = 331) were used as a test of the previously created score. The overall cohort had a median age of 62 years at diagnosis, a preponderance of 55.6% men versus 44.4% women, and a median tumor thickness of 1.05 mm. Ulceration was detectable in 21.3% of the patients. The most common histological subtype was superficial spreading melanoma with 59.3%, followed by nodular melanoma with 16.1%, lentigo maligna melanoma with 9.1%, acrolentiginous melanoma with 6.0%, other melanomas (5.7%) and melanomas of an unknown subtype (3.5%). Most melanomas were found on the trunk (41.4%), followed by melanomas of the lower extremity (26.4%), head and neck (17.7%), and upper extremity (14.1%). At initial diagnosis, 64.3% of patients were classified as stage I, 21% as stage II, 13.4% as stage III, and 1.3% as stage IV. The staging, subtype classification, and epidemiologic data showed comparable values in the training and test cohorts, confirming the randomization of the groups (see Table 1).

**Table 1.** Demographics, tumor parameters, stage of disease (AJCC 2017), tumor subtype and survival of the cohort.




Figure 1 shows the procedure for photographing the melanoma sections. In many melanomas, tumors were present in numerous blocks and slides. The H&E section with the highest tumor thickness according to Breslow was selected (see Figure 1a). Here, an image was taken at 100× magnification at the site of the highest tumor thickness. This image was used for the "whole image" analysis. From these "whole images", small image sections (about 250 × 250 μm) were selected that showed representative parts of the tumor. The generation of a prognosis score was initially performed on both groups. These were compared by ROC analysis (see Figure 2a). We investigated how reliably a prognostic prediction of overall survival could be made based only on the AI classifier. When analyzing the "whole images", no significant result (*p* = 0.101) could be obtained in the prediction of overall survival. The classifier showed an AUC of 0.581, which was only slightly better than a random classification (AUC of 0.5). In contrast, however, a significant prediction estimate with an AUC of 0.694 (*p* < 0.001) could be obtained with the analysis of the AOI images. Therefore, further evaluation was performed using the classifier generated by the analysis of the area of interest images.

Using only the classifier, generated solely by image analysis of an H&E-stained melanoma section, already allows a good prognosis estimate of overall survival. Of the 331 patients in the test cohort, 230 patients were assigned the AI-classifier "low-risk" and 101 patients were given the AI-classifier "high-risk". Malignant melanoma-related overall survival was 88.2% in the test cohort, with 39 deaths in the observation period of up to 114 months. The AI-classifier "low-risk" group showed a statistically significantly better overall survival of 93% with 16 deaths, compared to a survival of 77.2% and 23 deaths in the AI-classifier "high-risk" group (*p* < 0.001). Figure 3a shows the Kaplan–Meier survival curves of the total test cohort, related to melanoma-specific overall survival. Considering recurrence-free survival, there is also a statistically significant distinction by grouping into AI-classifier "low-risk" and "high-risk" (*p* < 0.001). Of the 230 patients in the "low-risk" group, an event such as recurrence, metastasis or death from the disease was recorded in 43 cases. This leads to a recurrence-free survival rate of 81.3%. In contrast, 37 events were recorded in the AI-classifier "high-risk" group out of 101 patients, resulting in a recurrence-free survival of only 63.4% (Figure 3b).

**Figure 2.** Average receiver operating characteristic (ROC) curves of overall survival prognosis. (**a**) Black line = AI-classifier with "area of interest" analysis. Gray line = AI-classifier with "whole image" analysis. (**b**) Black line = pT stage combined with AI-classifier (AOI). Gray line = pT stage (tumor thickness and presence of ulceration).

**Figure 3.** Kaplan–Meier curves of overall survival (**a**) and relapse-free survival (**b**). Green line = AI-classifier "low risk". Red line = AI-classifier "high risk".

Next, we questioned whether the AI classifier could complement the existing prognosis prediction based on the AJCC 2017 classification. Here, we first performed a ROC analysis. Comparing the prognosis estimate resulting from the existing T-classification (according to AJCC 2017) of the primary tumor (tumor thickness according to Breslow and the presence of an ulceration) (AUC = 0.872) with the prognosis estimate resulting from the addition of the AI-classifier (AUC = 0.881), only a slightly improved risk stratification was shown (see Figure 2b). This was also evident in the analysis of the Kaplan–Meier curves of overall survival for the individual stages of the AJCC 2017 classification. Looking at AI-based risk classification in stage I, the following picture emerges: of the 207 patients in stage I of the test cohort, 163 (79%) received the label AI-classifier "low-risk". Of these 163 patients, 2 died during the observation period, corresponding to an overall survival rate of 98.8%. Forty-four patients (21%) were classified as "high-risk". In this group, there were also two deaths, which corresponds to an overall survival rate of 95.5%. However, with a *p*-value of 0.154, this does not reach statistical significance (Figure 4a). Regarding stage II, of 74 patients, 39 (53%) were classified as "low-risk" and 35 patients (47%) were marked as "high-risk". There were 8 deaths in the "low-risk" group, resulting in an overall survival of 79.5%. The "high-risk" group had 10 deaths, resulting in an overall survival of 71.4%. However, this difference did not reach statistical significance, with a *p*-value of 0.378 (Figure 4b). Stage III demonstrated the clearest differences in prognosis estimation. In our test cohort, there were 43 patients in stage III, of which 11 patients died during the observation period, resulting in an overall survival of 74.4%. Twenty-five of these patients (58%) were considered "low-risk", and in fact, only 4 deaths occurred in this group, resulting in an overall survival of 84%. Of the 18 patients (42%) designated as "high-risk" by the AI-classifier, 7 patients died, resulting in an overall survival of only 61.1%. Although an early and quite clear separation of the Kaplan–Meier curves is seen in stage III, no statistically significant difference (*p* = 0.159) results, due to the rather small number of cases in this group (Figure 4c). Seven patients were found to be stage IV at initial diagnosis. Four of these were identified as AI-classifier "high-risk" and 3 were classified as "low-risk". All patients in the "high-risk" group died during the observation period, resulting in an overall survival of 0%. In the "low-risk" group, 2 melanoma-specific deaths were recorded, resulting in a melanoma-specific survival of 33.3%. Patients in the "high-risk" group, in contrast, survived longer than those in the "low-risk" group. This leads to a statistically significant difference in the group classification at this stage (*p* = 0.018) (Figure 4d).

**Figure 4.** Kaplan–Meier curves of overall survival in AJCC (2017) substages I (**a**), II (**b**), III (**c**) and IV (**d**). Green line = AI-classifier "low-risk". Red line = AI-classifier "high-risk".

#### **4. Discussion**

#### *4.1. Results*

The present study demonstrates the possibilities offered by deep learning-based image analysis in the risk stratification of melanoma. Although the program for risk assessment merely has a tiny image of about 250 × 250 μm at its disposal and no further information is available, a quite reliable and statistically significant risk stratification can be achieved. However, the AI classifier used here does not significantly improve the existing risk classification based on the TNM classification. Nevertheless, it seems possible that such a classifier may add prognostic value to conventional prognostic factors. In particular, our survival data in stage III show a tendency toward improved prognosis estimation with the addition of the AI-classifier, even if this does not reach statistical significance. Further studies with a larger cohort from this advanced tumor stage are needed to confirm this.

The first published studies have investigated the use of AI-based neural networks in melanoma. It has been shown that such image analysis can reliably detect melanomas and differentiate them from benign melanocytic nevi [19–22]. Predicting prognosis, though, is much more complex than mere diagnostic classification of nevus and melanoma. Hence, a study by Brinker et al., published in 2021, failed to predict sentinel lymph node status in malignant melanoma to a clinically meaningful extent using deep learning-based image analysis [25]. In a 2020 study, Kulkarni et al. created a risk classifier that was significantly associated with the occurrence of recurrence in melanoma [26]. However, this score includes other factors for calculation, such as density and distribution of the immune cell infiltrate and nucleus morphology. Therefore, the impressive AUC values of 0.905 and 0.880 achieved in this study are not comparable to the results obtained here. Since other information besides the RGB image had to be included, tumor areas containing lymphocytes in addition to the tumor cells had to be available and the sections should not be too pigmented to allow detection of cellular components [26]. Another unique feature of our risk classifier is that it is a score that can be calculated with an image of only 103 kB to 622 kB in size. There is still a low availability of so-called whole-slide scanners, which can scan and digitize entire histological slides in high resolution in just a few minutes. Although this technology has been established for years, only a few pathological institutes have switched their routine settings to digital reporting, especially because of the high investment costs. Possibly in the coming years, the amount of memory and access to whole-slide scanners will no longer be limiting factors. Currently, a freely available, easy-to-use classifier operating on small data offers massive advantages when it comes to the question of validating that classifier in a large multicenter setting.

#### *4.2. Limitations*

The present study has several limitations. One potential point of criticism is the choice of deep learning tool. It is conceivable that an even better prediction of the prognosis could be made with different programs, although this was not investigated in this study. The focus of this research lies in the proof-of-concept, which shows that it is possible to make a prognosis prediction on the histological section with as simple as possible an AI application and as small as possible an amount of data. Due to its straightforward transferability as well as its user-friendly interface, the publicly available Google's Teachable Machine was chosen as the deep learning tool. Overfitting describes learning by memorization of the correct answers by the AI model instead of the establishment of a generally applicable assignment rule in the sense of generalization. Such overfitting was evident in our trained models, even when training was repeated with reassigned image groups. It is possible that this overfitting could be minimized by various fine adjustments in the AI model, especially by adjusting the number of epochs. However, this was not further investigated in the present study. It is also conceivable that the pre-trained algorithm of the Teachable Machine is not suitable for this complex histological challenge and thus represents the limiting factor in model performance. Further limitations are that the training and test cohorts are retrospective evaluations and that the number of cases in the groups, and especially the number of events included (39 deaths in the test cohort), is quite small. Another point of criticism is that all sections used originate from one and the same pathological institute. Possibly, the results show only limited transferability to other institutes, as a slightly different staining pattern in H&E staining may be evident. In addition, the manual selection of the areas of interest by the pathologist offers the possibility of an influence. A trade-off must be made between large datasets with automated selection and small datasets with manual selection. Additionally, the use of similar images in the "dead" group of the training cohort may have restricted the learning curve of the artificial intelligence. The melanoma treatment of the patients in our study was not examined. It is possible that changes in treatment regimens during the study period may have limited the predictive accuracy of the AI prognosis. To obtain more meaningful results, a larger, prospectively designed, multicenter study would be necessary. One possibility for such studies in the future could be the use of so-called "swarm learning". This newly described approach uses blockchain-based peer-to-peer networking to decentralize the use of machine learning [33].

Another problem with the method used here is the lack of explainability. A program with a built-in explanatory approach would be desirable, so that the black box of the AI could be illuminated. A study by Courtiol et al. from 2019 shows such a program, which not only forecasts the prognosis of mesothelioma but can also show, via a heat map analysis, that the decision basis of the AI is to be found in the area of the tumor stroma [34].

#### **5. Conclusions**

Finally, the study presented here must be understood as proof-of-concept. It could be shown that prognostic information is contained in tiny image sections of a melanoma, which allows prognosis estimation. To establish a prognosis score that can be used in clinical practice, it must be clearly shown that such a score complements the current classification systems and may in the future be an alternative to invasive diagnostic methods, such as sentinel node biopsy or expensive gene-expression-based prognosis scores.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14092243/s1, Figure S1: Training curves for initial training; Figure S2: Training curves for repeated training.

**Author Contributions:** Conceptualization, S.F., C.G. and T.E.; formal analysis, S.F. and A.A.-G.; resources, S.F., A.A-G., G.M. and T.E.; writing—original draft preparation, S.F.; writing—review and editing, S.F., A.A-G., G.M., C.G. and T.E.; visualization, S.F.; supervision, T.E. and C.G.; project administration, S.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The retrospective study on "Prediction of recurrence-free survival in patients with malignant melanoma using image analysis of histological sections by artificial neural networks" was approved by the Ethics Committee of the Medical Faculty of the University of Tübingen (No. 874/2019/BO2).

**Informed Consent Statement:** Informed consent was obtained from all subjects from the nationwide melanoma registry Tübingen involved in the study.

**Data Availability Statement:** The trained model for the "whole image" analysis used in this study is available at https://teachablemachine.withgoogle.com/models/q3W4kP4zk/ (accessed on 3 January 2022). The trained model for the "area of interest" analysis used in this study is available at https://teachablemachine.withgoogle.com/models/EWFL98pti/ (accessed on 3 January 2022).

**Acknowledgments:** The authors thank Peter Martus from the Institute for Clinical Epidemiology and Applied Biometry, Eberhardt Karls Universität, Tübingen, for statistical consulting.

**Conflicts of Interest:** S.F. received personal fees from Kyowa Kirin and Takeda Pharmaceuticals (speaker's honoraria), as well as institutional grants from NeraCare, SkylineDX and BioNTech outside the submitted work. A.A-G. has no conflict of interest to declare. G.M. has no conflict of interest to declare. C.G. reports personal fees from Amgen, personal fees from MSD, grants and personal fees from Novartis, grants and personal fees from NeraCare, grants and personal fees from BMS, personal fees from Philogen, grants and personal fees from Roche, grants and personal fees from Sanofi, outside the submitted work. T.E. acted as a consultant for Almiral Hermal, Bristol-Myers Squibb, MSD, Novartis, Pierre Fabre and Sanofi outside the submitted work.

#### **References**


### *Article* **Deep Learning Using CT Images to Grade Clear Cell Renal Cell Carcinoma: Development and Validation of a Prediction Model**

**Lifeng Xu 1,2,†, Chun Yang 2,3,†, Feng Zhang 1, Xuan Cheng 2,3, Yi Wei 4, Shixiao Fan 2,3, Minghui Liu 2,3, Xiaopeng He 4,5,\*, Jiali Deng 2,3, Tianshu Xie 2,3, Xiaomin Wang 2,3, Ming Liu 2,3 and Bin Song 4,\***


**Simple Summary:** Clear cell renal cell carcinoma (ccRCC) pathologic grade identification is essential both to monitoring patients' conditions and to constructing individualized subsequent treatment strategies. However, biopsies are typically used to obtain the pathological grade, entailing tremendous physical and mental suffering as well as a heavy economic burden, not to mention the increased risk of complications. Our study explores a new way to provide grade assessment of ccRCC on the basis of the tumor's appearance on CT images. A deep learning (DL) method that includes self-supervised learning is constructed to identify patients with high-grade ccRCC. We confirmed that our grading network can accurately differentiate between different grades of ccRCC on CT scans using a cohort of 706 patients from West China Hospital. The promising diagnostic performance indicates that our DL framework is an effective, non-invasive and labor-saving method for decoding CT images, offering a valuable means for ccRCC grade stratification and individualized patient treatment.

**Abstract:** This retrospective study aimed to develop and validate deep-learning-based models for grading clear cell renal cell carcinoma (ccRCC) patients. A cohort enrolling 706 patients (*n* = 706) with pathologically verified ccRCC was used in this study. A temporal split was applied to verify our models: the first 83.9% of the cases (years 2010–2017) for development and the last 16.1% (years 2018–2019) for validation (development cohort: *n* = 592; validation cohort: *n* = 114). Here, we demonstrate a deep learning (DL) framework initialized by a self-supervised pre-training method and developed with the addition of a mixed loss strategy and sample reweighting to identify patients with high-grade ccRCC. Four types of DL networks were developed separately and further combined with different weights for better prediction. The single DL model achieved an area under the curve (AUC) of up to 0.864 in the validation cohort, while the ensembled model yielded the best predictive performance with an AUC of 0.882. These findings confirm that our DL approach performs either favorably or comparably in terms of grade assessment of ccRCC compared with biopsies, whilst being non-invasive and labor-saving.

**Keywords:** clear cell renal cell carcinoma; deep learning; tumor grading; self-supervised learning; label noise; class imbalance

**Citation:** Xu, L.; Yang, C.; Zhang, F.; Cheng, X.; Wei, Y.; Fan, S.; Liu, M.; He, X.; Deng, J.; Xie, T.; et al. Deep Learning Using CT Images to Grade Clear Cell Renal Cell Carcinoma: Development and Validation of a Prediction Model. *Cancers* **2022**, *14*, 2574. https://doi.org/10.3390/cancers14112574

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 1 April 2022 Accepted: 29 April 2022 Published: 24 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Renal cell carcinoma (RCC) is one of the most common deadly tumors of the urinary system, originating from the renal parenchymal urinary tubule epithelial system and accounting for 4% of human malignant tumors [1]. Clear cell renal cell carcinoma (ccRCC) is the most common subtype of RCC, accounting for about 75% of all RCC cases [2]. The Fuhrman grading system is highly recognized in the clinical oncology community and is widely used for diagnosing the pathological grade of ccRCC. In the Fuhrman grading system, the tumor is classified into one of four grades (I, II, III, and IV) [3], with higher grades indicating a more serious patient condition. However, to obtain the pathological grade, a biopsy is most often carried out, using a sharp tool to remove a small amount of tissue. Inevitably, this invasive procedure may entail great physical and mental pain whilst imposing a heavy economic burden on patients' families and society. A recent study [4] also demonstrated that biopsy may increase the risk of complications, including hemorrhage, infection, and even tumor rupture. Furthermore, considering the shortage of specialized doctors and the conceivably poor condition of equipment in some rural areas, patients in these areas may be unable to receive timely and appropriate treatment.

In recent years, deep learning (DL) has defined state-of-the-art performance in many computer vision tasks, such as image classification [5], object detection [6,7], and segmentation [7]. DL models perform satisfactorily once they have learned from sufficient high-quality data [8]. Thus, given sufficient data, the accuracy of a deep-learning-enabled diagnosis system often matches or even surpasses the level of expert physicians [9,10]. A myriad of studies have validated the utility of DL in various clinical settings, including the reduction of false-positive findings in the interpretation of breast ultrasound exams [11], the detection of intensive care unit patient mobilization activities [12], and the improvement of medical technology [13]. In the same way, DL makes it possible to non-invasively and automatically assess the pathological grade of ccRCC, monitor patients' conditions, and construct personalized subsequent treatment strategies.

However, to better apply the DL model, there are a few problematic issues that should not be lightly dismissed. The first is the domain shift problem. In most deep-learning-enabled medical systems, transfer learning is a common practice [14], where researchers use models pretrained on some other dataset, such as ImageNet [15]. Although ImageNet contains a large variety of images, they all depict everyday scenes and do not overlap with medical images in terms of content. This shift between the two datasets means that the pattern-recognition abilities acquired from large natural-image datasets may not transfer well to our medical task. The second is the noisy label problem [16]. Inevitably, there are always some cancerous lesions that come from high-grade patients but do not exhibit characteristics sufficient to discriminate them from low-grade patients, resulting in a mismatch between the manual labels and the actual labels. The third is the dataset imbalance problem. In most medical tasks, images for the abnormal class can be challenging to find. Developing on such an unbalanced dataset can wreak havoc on the utility of the DL model. To combat these issues, our study explores a new DL framework initialized by a self-supervised pre-training method and developed with the addition of a mixed loss strategy and sample reweighting to identify high-grade ccRCC patients.

Several studies are related to ours. Zhu and collaborators [17] proposed a system that can accurately discriminate between five related classes, including clear cell RCC, papillary RCC, chromophobe RCC, renal oncocytoma, and normal, based on digitized surgical resection slides and biopsy slides. In contrast, we focus only on ccRCC and explore a non-invasive tool that provides grade assessment to replace biopsy. Zheng [18], Cui [19], and Gao [20] had the same intention as ours, but their works are mainly based on radiomics, which requires a high-throughput feature extraction method and a series of data-mining algorithms [21,22]. By contrast, our work does not need additional procedures such as feature extraction, which saves labor to some extent. Most closely related to our work is that of [14], which also attempted to use a deep learning model to predict the Fuhrman grade of ccRCC patients. However, it is worth noting that this study still used ImageNet pretraining and did not pay attention to the noise and imbalance problems that may induce performance degradation in most cases, while our framework provides a new solution to these issues with the addition of the proposed mixed loss strategy and sample reweighting, providing increased power to the common practice. To the best of our knowledge, our study is the first attempt to identify the pathological grades of patients with ccRCC in the context of a large population whilst dealing with the domain shift, noisy label, and dataset imbalance problems simultaneously.

The specific objective of this study was to develop and validate a new DL framework to identify patients with high-grade ccRCC based on CT images, and the results indicate that this is feasible. Beyond applying deep learning to ccRCC pathology grading [14], we focused on solving the three problems above. To improve the network's capabilities, we proposed an innovative self-supervised pre-training methodology, as well as a mixed loss strategy and sample reweighting to address the label noise and class imbalance problems. To develop and validate our framework, we applied a temporal split to the cases, as done in [23]: the first 83.9% of the cases (years 2010–2017) for development and the last 16.1% (years 2018–2019) for validation. Assigning patients from different years to different groups helps avoid bias that may stem from the machines and radiologic technologists, and it is also good practice for demonstrating the generalization ability of our method. In addition, to improve the model's generalization ability, we combined several strong single models, which achieved more reliable results. This project provides a convenient, harmless, and accurate means of Fuhrman grading, which will not only relieve patients from the suffering of biopsies, but also assist radiologists in making diagnostic decisions in routine clinical practice, even in rural areas.

#### **2. Materials and Methods**

The institution's research ethics board approved our study. The ethics board waived informed consent because the data were obtained from preexisting institutional or public databases.

#### *2.1. Patient Cohort*

The patient cases covered in this study are all from West China Hospital, with a total case load of 759. We excluded 53 patients for the following reasons: (1) the CT images were incomplete or had poor image quality (*n* = 24); (2) the patients had incomplete indicators (*n* = 29). Therefore, 706 patients were finally enrolled in this study. All 706 patients were admitted to the hospital from April 2010 to January 2019. Based on the acquisition date of the CT images, we assigned the 592 patients scanned before 2018 to the development cohort and the 114 patients scanned in or after 2018 to the validation cohort. The characteristics of the included patients are shown in Table 1.

All the pathological grades of the ccRCC patients were reconfirmed by three independent pathologists with extensive pathology experience. The labels of the CT images in the validation cohort were verified by professional pathologists. This study employed the Fuhrman grading system as the benchmark. Grades I and II were assessed as low grade, and grades III and IV were assessed as high grade. Usually, low grade has a better prognosis than high grade [24].


**Table 1.** Patient characteristics.

#### *2.2. Image Acquisition*

All CT scans used in this study were obtained on one of six different CT scanners. The PCP, CMP, and NP phases of the MDCT (multidetector CT) examination were acquired for each ccRCC patient under strict rules. A total of 70–100 mL of contrast agent was injected into the antecubital vein using a high-pressure injector at a rate of 3.5 mL/s. The PCP is the precontrast phase. The CMP is the corticomedullary phase, a contrast-enhanced scan starting 30 s after injection. The NP is the nephrographic phase, a contrast-enhanced scan starting 90 s after injection. Spiral scanning and thin-slice reconstruction were used for all three phases. The CT scanning parameters for the three phases were as follows: the tube voltage was 120 kV, the reconstruction thickness was 1 mm to 5 mm, and the matrix was 512 × 512. Only the CMP CT images were used as experimental data most of the time, because these images are the clearest and the most conducive to analyzing the patient's condition. Using only CMP images somewhat reduces the model developing time; although this could in principle impair the generalization of the model, our dataset includes a sufficiently large number of cases, so this restriction had no appreciable impact.

#### *2.3. Image Preprocessing*

The original CT image contains interfering information, and only the tumor area is truly relevant for grading, so the region of interest (ROI) needs to be delineated for each image. With 706 patients contributing more than 12,000 CT images, it is clearly not feasible to have a radiologist process every image.

We utilized DL models for object detection and segmentation to extract tumor regions from the renal CT images. In the detection and segmentation part, we used VGG-16 [25] pre-trained on ImageNet [26] as the backbone for feature extraction. A small number of images for detection and segmentation training were annotated by experienced doctors. The network was trained for 6000 epochs until its output converged. We used the trained network to detect and segment the tumors in the overall CT images, and the results were checked by an experienced radiologist and largely met the criteria. Figure 1 shows the tumor segmentation process. The segmented CT pictures eliminate interference from other bodily regions, allowing the content to be focused on the tumor area of the kidney. The CT images involved in subsequent experiments (including the pre-training and developing processes) refer to those after detection and segmentation processing. Since the size of the tumor area varies, the sizes of the cropped CT images differ. We performed resize or padding operations before the data entered the network to make the image size uniform at 224 × 224 × 3.
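The paper does not provide its preprocessing code; the following is a minimal PyTorch sketch of the uniform-sizing step described above, assuming zero-padding for crops smaller than the target and bilinear resizing otherwise (the function name `to_uniform_size` is ours):

```python
import torch
import torch.nn.functional as F

def to_uniform_size(crop: torch.Tensor, size: int = 224) -> torch.Tensor:
    """crop: (3, H, W) tensor holding a segmented tumor region."""
    _, h, w = crop.shape
    if h <= size and w <= size:
        # Pad symmetrically with zeros so the tumor stays centered.
        pad_h, pad_w = size - h, size - w
        return F.pad(crop, (pad_w // 2, pad_w - pad_w // 2,
                            pad_h // 2, pad_h - pad_h // 2))
    # Otherwise, resize the crop to the target size.
    return F.interpolate(crop.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)
```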

**Figure 1.** Segmentation model concentrates the CT image's content on the tumor.

#### *2.4. Self-Supervised Learning*

We used a self-supervised learning (pre-training) method to equip the network with better awareness of the CT images before developing. In the pre-training and developing processes, we used RegNetY400MF, RegNetY800MF [27], SE-ResNet50 [28], and ResNet-101 [29]. Traditional pre-training models are often obtained by developing on ImageNet [15] and then using transfer learning to satisfy specific classification tasks. Such an approach suffers from the problem that the pre-training and the actual classification task are disconnected, with little correlation between the image contents. We used a simpler and more efficient approach to pre-train the network. The images used in pre-training are the same as those used in developing, with the difference that during pre-training, we rotate each input image clockwise in space in one of four ways (0°, 90°, 180°, 270°), and the images are labeled with the number of 90° rotations applied (0, 1, 2, 3), while during developing, CT images are labeled with the ccRCC grade of the relevant patient (0 for low-grade, 1 for high-grade). Such a pre-training method allows the network to develop feature extraction capability based on the developing images without revealing their original semantics. We pre-trained the different deep learning models using the stochastic gradient descent (SGD) algorithm and the common cross-entropy loss function. The DL models were pre-trained for 60 epochs. The overall structure of the pre-training network is shown in the top half of Figure 2.
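As a sketch of how such a rotation pretext task can be set up (our own illustration, not the authors' code), each CT crop is served four times with a rotation label in {0, 1, 2, 3}:

```python
import torch
from torch.utils.data import Dataset

class RotationPretextDataset(Dataset):
    """Wraps a list of (3, 224, 224) image tensors for rotation prediction."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return 4 * len(self.images)  # four rotated copies per image

    def __getitem__(self, idx):
        img, k = self.images[idx // 4], idx % 4
        # torch.rot90 rotates counter-clockwise; a negative k gives clockwise turns.
        return torch.rot90(img, k=-k, dims=(1, 2)), k
```

Pre-training then proceeds as ordinary supervised classification over these four labels with SGD and cross-entropy loss, exactly as described above.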


**Figure 2.** The overall flow of pre-training and developing. The top part of the figure shows the pre-training process. In the pre-training process, the original images are expanded into four images after rotation transformation, and their labels are 0, 1, 2, and 3, representing that they are obtained by quarter-turning the original image 0, 1, 2, and 3 times, clockwise. The bottom part shows the developing process. The developing process network is initialized from the pre-training process network.

#### *2.5. Mixed Loss Strategy*

There are two pervasive problems in image classification (including medical image classification) tasks: one is the presence of label noise, and the other is the imbalanced data distribution. Both of these problems can be found in the data of our study.

Some malignant lesions that come from higher-grade patients do not exhibit enough characteristics to distinguish them from lower-grade patients, resulting in a mismatch between the manual labels and the actual labels. In simple terms, there are errors in the labels of CT images of some high-grade patients. To tackle the noise problem, we applied a mixed loss strategy similar to that in [30]. Suppose the labeled CT image dataset is *D* = {(*xi*, *yi*)}, *i* = 1, . . . , *N*. During developing, the ordinary cross-entropy loss is as follows:

$$L\_{CE} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{j \in \{0, 1\}} l\_{ij} \log p\_{ij} \tag{1}$$

where *lij* = 1 if *yi* = *j*, and 0 otherwise, and *pij* is the network output probability that the *i*-th sample belongs to category *j*. Since the true labels of some high-grade CT images are actually supposed to be low-grade, we add the loss *LCE\_2* to alleviate the effect of noise in the developing process. Specifically, in the developing phase, under the assumption that the noise rate is *α* (0 ≤ *α* ≤ 1), with the mixing coefficient *κ* set according to the assumed noise rate, the loss is as follows:

$$L\_{total} = \kappa L\_{CE\\_1} + (1 - \kappa)L\_{CE\\_2} \tag{2}$$

$$L\_{CE\\_1} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{j \in \{0, 1\}} l\_{ij} \log p\_{ij} \tag{3}$$

$$L\_{CE\\_2} = -\frac{1}{N} \sum\_{i=1}^{N} l\_{i0} \log p\_{i0} \tag{4}$$

where *li0* = 1 if *yi* = 0, and 0 otherwise, and *pi0* is the network output probability that the *i*-th sample belongs to category 0 (low-grade). The larger the noise rate *α*, the higher the noise level. In the experiments, the noise rate was set to 0.4 for the best results, which is probably closest to the real noise rate of the data. Through the mixed loss strategy, we made the network learn from the modified data according to a certain probability during developing, so as to counter label noise.

#### *2.6. Sample Reweighting*

Class imbalance is inevitable. For example, the proportion of patients with mild disease among detected cancer cases is small, because cancer patients usually only notice physical abnormalities in the middle or even late stages of the disease. The sample reweighting method is used to tackle this problem. To account for class imbalance when calculating the cross-entropy loss, each class was weighted according to its frequency, with rare samples contributing more to the loss function [23]. Specifically, we assigned lower weights to the categories with a larger proportion of the sample size. Since we bias toward the low-grade patient samples when dealing with the noise problem, we need to take this into account when calculating the proportions of low-grade and high-grade CT images. Suppose the weight of the low-grade patient samples is *λ*0 and the weight of the high-grade patient samples is *λ*1; the new weighted cross-entropy is

$$L\_{\text{CE\\_weight}} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{j \in \{0, 1\}} \lambda\_j l\_{ij} \log p\_{ij} \tag{5}$$

By Equation (5), we made the network learn more from categories with smaller sample sizes. Finally, in order to comprehensively address the label noise and class imbalance problems, the overall optimization objective *Ltotal\_weight* is

$$L\_{\text{total\\_weight}} = -\kappa \frac{1}{N} \sum\_{i=1}^{N} \sum\_{j \in \{0, 1\}} \lambda\_j l\_{ij} \log p\_{ij} - (1 - \kappa) \frac{1}{N} \sum\_{i=1}^{N} \lambda\_0 l\_{i0} \log p\_{i0} \tag{6}$$
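Equation (6) is straightforward to express in PyTorch. The sketch below is our own reading of the combined objective: the second term pushes samples toward the low-grade class, matching the description of "learning from the modified data" in Section 2.5. `kappa` is the mixing coefficient (the exact relation between *κ* and the reported noise rate of 0.4 is not fully specified, so the default here is an assumption), and `class_weights` holds (*λ*0, *λ*1):

```python
import torch
import torch.nn.functional as F

def mixed_weighted_loss(logits, targets, kappa=0.6, class_weights=(1.0, 1.0)):
    """Sketch of Eq. (6): kappa * class-weighted CE on the given labels plus
    (1 - kappa) * weighted CE that pushes samples toward class 0 (low-grade)."""
    w = torch.tensor(class_weights, device=logits.device)
    log_p = F.log_softmax(logits, dim=1)
    # Term 1: ordinary cross-entropy on the (possibly noisy) labels, class-weighted.
    ce_1 = -(w[targets] * log_p[torch.arange(len(targets)), targets]).mean()
    # Term 2: cross-entropy toward the low-grade class, countering noisy
    # high-grade labels.
    ce_2 = -(w[0] * log_p[:, 0]).mean()
    return kappa * ce_1 + (1 - kappa) * ce_2
```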

#### *2.7. Developing*

After pre-training, we obtained the DL model with feature extraction capability. Then, all models were developed iteratively and used to grade CT images of ccRCC patients.

It is worth noting that during the pre-training process, the classifiers of the four networks are linear, i.e., one fully connected layer (preceded by average pooling). During the developing process, we converted the classifier of the original network into a nonlinear projection head, which can perform more complex mappings and makes the dimension reduction of the feature map smoother.

The weights of the DL models were initialized from the networks that had been developed to classify the four image rotation angles (0°, 90°, 180°, 270°), except for the projection part. The weights of the projection part were initialized in a common and efficient way [31]. To match the number of classes in our study, the output units were modified to two (low-grade and high-grade). The developing process is shown in the bottom half of Figure 2.
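A minimal sketch of this initialization for the ResNet-101 variant (ours, with a hypothetical checkpoint path): load the rotation-pretrained weights, then replace the four-way rotation classifier with a freshly initialized two-class projection head. He initialization is shown as one "common and efficient" choice, an assumption on our part:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

model = resnet101(num_classes=4)                    # 4 rotation classes in pre-training
model.load_state_dict(torch.load("rotation_pretrained.pth"))  # hypothetical path

# Swap the linear classifier for a nonlinear projection ending in two units
# (low-grade vs. high-grade) and re-initialize it.
model.fc = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(inplace=True),
                         nn.Linear(512, 2))
for m in model.fc.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
        nn.init.zeros_(m.bias)
```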

After five warm-up epochs, the learning rate started at 0.1 and then varied following a cosine function. It is worth noting that the pre-trained backbone already has some feature extraction capability, unlike the untrained projection. Therefore, during network developing, these two parts of the network should adopt different learning rates, i.e., a small learning rate for the backbone and a relatively larger one for the projection. Specifically, we set the learning rate of the backbone to 0.1 times that of the projection. In addition, a weight decay rate of 0.0001 was set to inhibit overfitting by keeping the weights of the neural network from becoming too large. Data augmentation, including random rotation and horizontal flipping, was performed on the development cohort to avoid overfitting and to emulate the diversity of data observed in the real world. Four NVIDIA Tesla M40 graphics cards with 24 GB of memory were used in the development process. We used the SGD algorithm and the cross-entropy loss defined in Equation (6) to develop the network. The DL model was developed for 100 epochs. PyTorch (1.0.1) and Python (3.5.7) were the main tools used in our experiments.
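The optimizer setup described above might look as follows (a sketch under the stated hyperparameters; `model` is the network from the previous sketch, and the momentum value is our assumption):

```python
import math
import torch

head_params = list(model.fc.parameters())
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.SGD(
    [{"params": backbone_params, "lr": 0.01},   # backbone: 0.1x the head rate
     {"params": head_params, "lr": 0.1}],       # projection head
    momentum=0.9, weight_decay=1e-4)

# Five warm-up epochs, then cosine decay over the remaining epochs.
warmup, total = 5, 100
def lr_factor(epoch):
    if epoch < warmup:
        return (epoch + 1) / warmup
    return 0.5 * (1 + math.cos(math.pi * (epoch - warmup) / (total - warmup)))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```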

#### *2.8. Validation and Statistics Analysis*

After the developing phase, we used the validation cohort to check the generalizability of the developed model. Since each patient in the experiment has multiple images and each image yields a probability vector, each patient has a set of probability vectors. We statistically aggregated each patient's probability vectors to obtain the grade judgment for that patient. When analyzing a patient's condition, the focus is usually on the most severe part of the CT images, which is reasonable because the most severe lesion best identifies the patient's condition. Therefore, in the statistical calculation for each patient, we used the highest probability of the network output among each patient's CT images as the basis for grading. Suppose the *i*-th patient has *M* CT images, and the output of the model for each CT image is *gj* (*j* = 1, 2, . . . , *M*). The grading judgment *Gi* is

$$G\_i = \max(g\_1, g\_2, \dots, g\_M) \tag{7}$$

During validation process, the accuracy (ACC), sensitivity (SEN) and specificity (SPC) were calculated to assess the capability of the DL model. In addition, we used the area under the receiver operating characteristic (ROC) curve (AUC) to show the diagnostic ability of the DL model in grading ccRCC patients.
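A sketch of the patient-level aggregation in Equation (7) together with these metrics, using scikit-learn (variable and function names are ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def patient_scores(image_probs_per_patient):
    """Each entry: per-image high-grade probabilities of one patient; Eq. (7)
    takes their maximum as the patient-level grading score."""
    return np.array([np.max(p) for p in image_probs_per_patient])

def validation_metrics(scores, labels, threshold=0.5):
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
    return {"ACC": (tp + tn) / len(labels),
            "SEN": tp / (tp + fn),   # sensitivity
            "SPC": tn / (tn + fp),   # specificity
            "AUC": roc_auc_score(labels, scores)}
```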

#### *2.9. Model Ensemble*

Following the developing method described in Section 2.7, we developed a total of four classes of DL models with different structures on the development cohort. To improve the reliability of the DL models, we combined the models with different weights according to their performance in order to obtain the best-performing prediction. During the experiments, we found that the single models performed close to each other. To increase the diversity of the weights of the different models in the ensemble process, we proposed an innovative weight calculation method that uses each model's AUC as the reference for its ensemble weight: since all four types of models have AUCs with the same first decimal digit, the ensemble weight of each model is the value of its AUC with that shared leading digit removed. Then, for each patient, we weighted the four models' outputs by their respective weights and summed them to obtain the patient's final grading judgment. This weight calculation method lets the models with relatively good performance occupy a larger weight in the ensemble, increasing the difference between the weights of different models and achieving better ensemble results. Assume the weights of the four models are *γ*1, *γ*2, *γ*3, *γ*4, and the *i*-th patient's predictions are *Gi*1, *Gi*2, *Gi*3, *Gi*4. The composite prediction *Fi* is

$$F\_i = \frac{\sum\_{k=1}^{4} \gamma\_k G\_{ik}}{\sum\_{k=1}^{4} \gamma\_k} \tag{8}$$
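Under our reading of the weight rule above (e.g., an AUC of 0.864 would yield the weight 64 once the shared leading digit is dropped), Equation (8) can be sketched as:

```python
import numpy as np

def auc_to_weight(auc: float) -> int:
    """Drop the shared first decimal digit of the AUC, keep the rest: 0.864 -> 64."""
    return round(auc * 1000) % 100

def ensemble_prediction(per_model_scores, per_model_aucs):
    """per_model_scores: (4, n_patients) array of patient-level predictions G_ik."""
    gammas = np.array([auc_to_weight(a) for a in per_model_aucs])
    return gammas @ np.asarray(per_model_scores) / gammas.sum()
```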

#### **3. Results**

We divided the CT images of 706 patients into a development cohort and validation cohort according to the acquisition date, where the development cohort contains 592 patients and the validation cohort contains 114 patients.

Four different kinds of networks (plus the ensemble model) were validated after developing according to our method, and the relevant metrics were calculated statistically; the validation results are shown in Table 2. The results show that our developing method performs satisfactorily on different networks, which illustrates its effectiveness; in contrast with the subsequent ablation experiments, it can also be seen that our method effectively mitigates the label noise and class imbalance problems in the data. In addition, our ensemble method effectively improves the prediction accuracy and enhances the reliability of the DL model's predictions. This is like combining the opinions of multiple specialists during diagnosis to arrive at a more accurate and reliable judgment about the patient. We selected a model with good performance from each of the four types of models and recorded their receiver operating characteristic (ROC) curves, as shown in Figure 3. We also recorded the DL model's output probability for each patient in the validation cohort (0 for low-grade, 1 for high-grade), and the results are shown in Figure 4. Most high-grade patients have larger lesion areas and a more severe condition on CT images and are more likely to receive a larger network output probability. The CT images of some high-grade and low-grade patients are similar, and the corresponding network output probabilities do not differ significantly. Low-grade patients are more likely to receive a relatively smaller network output probability, and their CT images reflect a better condition. The percentage of patients graded as low grade or high grade by the ensemble model, broken down by their Fuhrman grades (I, II, III, IV), is displayed in Figure 5. Figure 5 shows that the ensemble model can accurately classify patients in grades I and II as low grade and patients in grades III and IV as high grade. This binary grouping is pathologically justified: grades I and II share relatively similar characteristics, as do grades III and IV, which allows the network to distinguish between low-grade and high-grade patients.

**Table 2.** Results of different network models and ensemble models in the validation cohort.


ACC = Accuracy; SEN = Sensitivity; SPC = Specificity; AUC = Area under the receiver operating characteristic curve.

**Figure 3.** Receiver operating characteristic (ROC) curve of the four different models and the ensemble model.

We also performed a series of ablation experiments to illustrate the effectiveness and necessity of each part of our proposed method. First, we conducted baseline experiments, i.e., base-model experiments without self-supervised pre-training, the mixed loss strategy, or sample reweighting; the results are shown in Table 3. From Table 3, we can see that the overall performance of the base models is poor and biased toward the low-grade patients. The overall poor performance is mainly due to the lack of our self-supervised pre-training method: the feature extraction ability of the network is insufficient to accurately identify low-grade and high-grade patients. The base models are biased toward low-grade patients because they do not address the label noise and class imbalance problems.

Without the mixed loss strategy and sample reweighting approaches, we performed experiments with self-supervised pre-training, and the results are shown in Table 4. Compared with the baseline, the self-supervised pre-training method effectively improves the performance of the models, but the problem of excessive bias remains. Lacking the mixed loss strategy and sample reweighting, the network is more influenced by low-grade patients during development; that is, the number of CT images of low-grade patients is larger than that of high-grade patients, which biases the network toward the low grade.

We then conducted experiments with the mixed loss strategy and sample reweighting methods but without self-supervised pre-training, and the experimental results are shown in Table 5. From Table 5, we can see that the mixed loss strategy and sample reweighting effectively solve the bias problem and improve the performance of the model, which is consistent with their ability to address the label noise and class imbalance problems. However, due to the lack of the self-supervised pre-training method, the different networks exhibit a large gap in overall performance relative to Table 2, which once again proves that our self-supervised pre-training method effectively improves the network's feature extraction capability and thus the overall network performance.

To validate the effect of different pre-training methods, we pre-trained the SE-ResNet50 model on ImageNet with the other settings consistent with the experiments in Table 2. The experimental results are shown in Table 6. Compared with the ImageNet-based pre-training method, our proposed self-supervised pre-training method achieves better experimental results. The ImageNet dataset contains everyday images that have minimal association with the CT images used during developing, whereas our pre-training method lets the network use the same images in the developing process as in the pre-training process without revealing their original semantics. This makes the pre-training and developing processes more closely related, allowing the pre-training process to better assist the developing process.

We also conducted experiments to compare our method with different traditional machine learning methods [32], including support vector machine (SVM) [33–35], K-nearest neighbor (KNN), decision tree [35], random forest [35], and gradient boosting [35]. The degree and tolerance of the SVM were 3 and 0.001. We set the number of neighbors in KNN to 5. For the decision tree, the minimum numbers of samples required to split an internal node and to be at a leaf node were 2 and 1. The number of trees in the random forest was set to 10. The learning rate of gradient boosting was 0.1, and the number of boosting stages was 100. The experimental results are shown in Table 7. As we can see, our method clearly outperforms all the ML methods. It is worth noting that in our experiments, we did not introduce additional feature extraction methods for the ML methods, saving labor to a great extent while retaining reliable accuracy. The poor performance of the ML methods may be due to their inability to deal with the noise and imbalance problems intrinsic to the data. By contrast, our framework explores a new way to deal with these issues with the help of the proposed mixed loss strategy and sample reweighting, providing increased power to the common practice.

**Figure 4.** Network output probabilities for low-grade and high-grade patients. The left subplot is the network output probability distribution of low-grade and high-grade patients. The right subplot is the CT images of low-grade and high-grade patients with different network output probabilities.



**Table 3.** Performance of the four basic models in the validation cohort.

**Table 4.** Performance of four types of self-supervised pre-trained models without mixed loss strategy and sample reweighting methods in the validation cohort.


**Table 5.** Performance of four types of basic models with mixed loss and sample reweighting methods in the validation cohort.


**Table 6.** Comparison of the SE-ResNet50 model performance based on different pre-training methods in the validation cohort.



#### **4. Discussion**

In this work, we proposed a radiologist-level diagnostic model based on a DL approach that is capable of automatically grading ccRCC patients based on CT images. We improved the network's capabilities using an innovative self-supervised pre-training approach. Based on the data in our research, we also proposed solutions to the label noise and class imbalance problems that exist in real-world datasets, and the experimental results demonstrate the effectiveness and necessity of our work.

Our best-performing DL model is highly reliable, with an AUC of 88.2%, an ACC of 82.0%, an SEN of 85.5%, and an SPC of 75.0%. These results confirm that our DL method performs comparably to biopsy in the grade evaluation of ccRCC while being non-invasive and labor-saving, offering a valuable means for ccRCC grade stratification and individualized patient treatment.

There are four major advantages to our research. Above all, we pre-train the model with the same images (but different labels) as the developing process, in order to provide the network with better knowledge of the images before developing. Compared with [36–38], which use models pre-trained on ImageNet, our method does not suffer from the low correlation of image contents between the pre-training and developing processes, and it allows the network to use the same images during pre-training and developing without revealing their original semantics.

Furthermore, label noise is a common problem in medical image datasets. Label noise degrades the label quality of medical images [39,40], causing images to mismatch their real labels and negatively affecting DL development. Manually filtering all the samples undoubtedly raises labor costs and is inefficient for large datasets. We adopted the mixed loss strategy against label noise, incurring no extra labor cost while obtaining good results. The satisfactory experimental results verify that our method can bias the DL model toward the correct samples during development. Obviously, the actual problem cannot be exactly the same across datasets; for example, the noise rate differs from one dataset to another. Different real situations require different approaches, and we believe that our approach to these challenges will aid future study in this area.

In addition, class imbalance also occurs frequently in medical image datasets. The class imbalance problem may negatively affect the performance of ML models [41] and DL models [42,43], as most classification methods assume an equal occurrence of different classes. To address this problem, we used the sample reweighting method, which yielded promising benefits. As can be seen from the experimental results, sample reweighting effectively prevents the DL model from favoring a certain category during development; that is, it balances the contributions of samples with different quantity proportions to the loss function. We also expect that our approach to class imbalance will aid future study in this area.

Finally, DL models with different structures have different independent parameters and develop different perceptions of the dataset. We combined the developed network models with various architectures and obtained more accurate predictions. The model ensemble approach can make up for the shortcomings of individual models in prediction, enhance the network's generalization ability, and improve the reliability of the results.

In terms of practical significance, our design can help patients in remote areas to further understand their individual conditions, assist doctors in making more accurate clinical judgments on patients' conditions, compensate to a certain extent for the lack of specialized doctors, and promote the treatment of patients. With sufficient noise-free data and reliable developing, our method can reduce or even replace patient biopsy tests, giving patients a safer and more convenient way to be tested.

Despite the contributions of our study to the grading of ccRCC, it has some potential limitations. First, although we used a model ensemble to improve the generalization ability of the network, there are many other DL network architectures that could be utilized, such as VGGNet [25] and GoogleNet [44]; nevertheless, our experiments demonstrated the effectiveness of applying DL to the pathology grading of ccRCC patients. Next, although all cases included in our data were confirmed by professional doctors, a certain human factor remains, so if our system is to be applied in practice, a large amount of quality data will be needed to improve the model and make the results more reliable. Moreover, the WHO/ISUP grading system has superseded the Fuhrman grading system in terms of prognosis assessment and interpretability [45]. Lastly, we resize tumor images of different sizes to a uniform size (224 × 224 × 3), which is necessary for network developing and validation; however, when such an operation is applied to images of small size, it may affect the original semantics of the images, which is a common problem in the image processing field. Nevertheless, the intention of using cropped tumors is to exclude the interference of irrelevant information entailed by other normal regions. Such normal regions do not contribute positively to the grading of ccRCC. On the contrary, the redundant information may include a bias or shortcut that would otherwise force the model to solve the problem differently than intended. For example, [46] reports an observation in which a network learned to detect a metal token that radiology technicians place on the patient in the corner of the image field of view at the time they capture the image.

For the clinical validation of our method, we look forward to applying our algorithm in real-world practice to spare as many patients as possible the suffering of biopsies. Unfortunately, such a step needs special approval from the corresponding authorities, which cannot easily be acquired at short notice. We will actively pursue this in our future work. In addition, we hope to develop a better algorithm to solve the semantic loss problem caused by resizing all images to a uniform size in DL.

#### **5. Conclusions**

In this paper, we proposed a DL model that can effectively discriminate between different grades of ccRCC patients. Based on the innovative self-supervised pre-training method, different semantics are assigned to the images so that the same images can be used in the pre-training and development tasks, which gives the network certain feature extraction capabilities before developing and keeps the pre-training task from being fragmented from the development task. In addition, we improved the accuracy of the model based on our proposed self-supervised pre-training method and alleviated the effects of the label noise and class imbalance problems commonly found in datasets; the necessity and effectiveness of the proposed method were proven by ablation experiments. With richer and cleaner samples and sufficient developing, the model may become a routine clinical tool that reduces the emotional and physical toll of biopsy on patients.

**Author Contributions:** Conceptualization, B.S., M.L. (Ming Liu) and X.H.; Data curation, B.S., Y.W. and X.H.; Formal analysis, M.L. (Minghui Liu), J.D. and Y.W.; Funding acquisition, B.S., M.L. (Ming Liu) and X.W.; Investigation, L.X., C.Y. and M.L. (Minghui Liu); Methodology, C.Y., X.C., S.F. and F.Z.; Project administration, L.X., F.Z. and X.C.; Resources, B.S., L.X. and M.L. (Ming Liu); Software, C.Y., S.F., T.X. and J.D.; Supervision, B.S., L.X. and M.L. (Ming Liu); Validation, X.W., M.L. (Ming Liu) and X.C.; Visualization, C.Y., Y.W., S.F. and X.W.; Writing—original draft, C.Y., X.C. and T.X.; Writing—review & editing, B.S., L.X. and X.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Science and Technology Program of Quzhou under Grant 2021D007, Grant 2021D008, Grant 2021D015, and Grant 2021D018, as well as the project LGF22G010009.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of The Quzhou Affiliated Hospital of Wenzhou Medical University (ethical code number 2020-03-002).

**Informed Consent Statement:** The ethics board waived informed consent because the data were obtained from preexisting institutional or public databases.

**Data Availability Statement:** The de-identified CT images data used in this study are not publicly available due to restrictions in the data-sharing agreement.

**Conflicts of Interest:** The funder had no role in the study design, data collection, data analysis, data interpretation, or writing of the report.

#### **References**


### *Review* **Machine Learning Tools for Image-Based Glioma Grading and the Quality of Their Reporting: Challenges and Opportunities**

**Sara Merkaj 1,2,†, Ryan C. Bahar 1,†, Tal Zeevi 1, MingDe Lin 1,3, Ichiro Ikuta 1, Khaled Bousabarah 4, Gabriel I. Cassinelli Petersen 1, Lawrence Staib 1, Seyedmehdi Payabvash 1, John T. Mongan 5, Soonmee Cha <sup>5</sup> and Mariam S. Aboian 1,\***


**Simple Summary:** Despite their prevalence in research, ML tools that can predict glioma grade from medical images have yet to be incorporated clinically. The reporting quality of ML glioma grade prediction studies is below 50% according to TRIPOD—limiting model reproducibility and, thus, clinical translation—however, current efforts to create ML-specific reporting guidelines and risk of bias tools may help address this. Several additional deficiencies in the areas of ML model data and glioma classification hamper widespread clinical use, but promising efforts to overcome current challenges and encourage implementation are on the horizon.

**Abstract:** Technological innovation has enabled the development of machine learning (ML) tools that aim to improve the practice of radiologists. In the last decade, ML applications to neuro-oncology have expanded significantly, with the pre-operative prediction of glioma grade using medical imaging as a specific area of interest. We introduce the subject of ML models for glioma grade prediction by remarking upon the models reported in the literature as well as by describing their characteristic developmental workflow and widely used classifier algorithms. The challenges facing these models—including data sources, external validation, and glioma grade classification methods—are highlighted. We also discuss the quality of how these models are reported, explore the present and future of reporting guidelines and risk of bias tools, and provide suggestions for the reporting of prospective works. Finally, this review offers insights into next steps that the field of ML glioma grade prediction can take to facilitate clinical implementation.

**Keywords:** artificial intelligence; glioma; machine learning; deep learning; reporting quality

**Citation:** Merkaj, S.; Bahar, R.C.; Zeevi, T.; Lin, M.; Ikuta, I.; Bousabarah, K.; Cassinelli Petersen, G.I.; Staib, L.; Payabvash, S.; Mongan, J.T.; et al. Machine Learning Tools for Image-Based Glioma Grading and the Quality of Their Reporting: Challenges and Opportunities. *Cancers* **2022**, *14*, 2623. https://doi.org/10.3390/cancers14112623

Academic Editor: Axel Pagenstecher

Received: 26 March 2022 Accepted: 23 May 2022 Published: 25 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

#### *1.1. Artificial Intelligence, Machine Learning, and Radiomics*

Innovations in computation and imaging have rapidly enhanced the potential for artificial intelligence (AI) to impact diagnostic neuroradiology. Emerging areas of implementation include AI in stroke (e.g., early diagnosis, detection of large vessel occlusion, and outcome prediction) [1], AI in spine (fracture detection and vertebrae segmentation), and detection of intracranial aneurysms and hemorrhage [2], among other disciplines. Machine learning (ML) and its subfield, deep learning (DL), are branches of AI that have received particular attention. ML algorithms, including DL, decipher patterns in input data and independently learn to make predictions [3]. The advent of radiomics—which mines data from images by transforming them into features quantifying tumor phenotypes—has fueled the application of ML methods to imaging, including radiomics-based ML analysis of brain tumors [4–6]. Commonly extracted radiomic features include shape and size, texture, first-order, second-order, and higher-order features, etc. (Table 1).

#### *1.2. Machine Learning Applications in Neuro-Oncology*

As the most common primary brain tumors, gliomas constitute a major focus of ML applications to neuro-oncology [7,8]. Prominent domains of glioma ML research include the image-based classification of tumor grade and prediction of molecular and genetic characteristics. Genetic information is not only instrumental to tumor diagnosis in the 2021 World Health Organization classification, but also significantly affects survival and underpins sensitivity to therapeutic interventions [9,10]. ML-based models for predicting tumor genotype can therefore guide earlier diagnosis, estimation of prognosis, and treatment-related decision-making [11,12]. Other significant areas of glioma ML research relevant to neuroradiologists include automated tumor segmentation on MRI, detection and prediction of tumor progression, differentiation of pseudo-progression from true progression, glioma survival prediction and treatment response, distinction of gliomas from other tumors and non-neoplastic lesions, heterogeneity assessment based on imaging features, and clinical incorporation of volumetrics [13–15]. Furthermore, ML tools may optimize neuroradiology workflow by expediting the time to read studies from image review to report generation [16]. As an image interpretation support tool, ML importantly may improve diagnostic performance [17,18]. Prior works demonstrate that AI alone can approach the diagnostic accuracy of neuroradiologists and other sub-specialty radiologists [19–21].

#### *1.3. Image-Based Machine Learning Models for Glioma Grading*

This review is concerned with the growing body of studies developing predictive ML models for image-based glioma grading, a fundamentally heterogeneous area of literature. While numerous ML models exist to predict high-grade gliomas and low-grade gliomas, they vary in their definitions of high- and low-grade [22–24]. Other models predict individual glioma grades (e.g., 2 vs. 3, 3 vs. 4), but few have combined glioma grading with molecular classification despite the incorporation of both grade and molecular subtype in the 2016 World Health Organization central nervous system tumor classification [25,26]. While most studies focus on MRI, they are diverse in the sequences used for prediction, with earlier publications relying on conventional imaging and later ones increasingly incorporating advanced MRI sequences [27–30]. Finally, studies vary considerably in their feature extraction and selection methods, datasets, validation techniques, and classification algorithms [31].

It is our belief that the ML models with potential to support one of the most fundamental tasks of the neuroradiologist—glioma diagnosis—present obstacles and opportunities relevant to the radiology community, especially as radiologists endeavor to bring ML models into clinical practice. In this article, we aim to introduce the subject of developing ML models for glioma grade prediction, highlight challenges facing these models and their reporting within the literature, and offer insights into next steps the field can take to facilitate clinical implementation.

#### **2. Workflow for Developing Prediction Models**

Despite their heterogeneity, ML glioma grade prediction studies follow similar steps in developing their models. The development workflow starts with acquisition, registration, and pre-processing (if necessary) of multi-modal MR images. Common pre-processing tasks include data cleaning, normalization, transformation, and dealing with incomplete data, among other tasks [32]. An in-depth exploration of pre-processing is beyond the scope of this review and readers should refer to Kotsiantis et al. for further explanation. Next, tumors undergo segmentation—the delineation of tumor, necrosis, and edema borders—which can be a manual, semi-automatic, or fully automatic process. Manual segmentations rely on an expert delineating and annotating Regions of Interest (ROIs) by hand. Semi-automated segmentations generate automated ROIs that need to be checked and modified by experts. Fully automatic segmentations, on the other hand, are DL-generated (most frequently by convolutional neural networks (CNNs)), which automatically delineate ROIs and omit the need for manual labor [33]. In general, semi-automated segmentations are considered to be more reliable and transparent than fully automatic segmentations. However, they are less time-efficient than automatic segmentations and always require manual input from experts in the field. Whereas manual segmentation is laborious, time-consuming, and subject to inter-reader variability, fully automatic deep-learning generated segmentations may potentially overcome these challenges [34].

Feature extraction is then performed to extract qualitative and quantitative information from imaging. Commonly extracted data include radiomic features (shape, first-order, second-order, higher-order features, etc.), clinical features (age, sex, etc.), and tumor-specific Visually AcceSAble Rembrandt Images (VASARI) features. Feature types and their explanations are presented in Table 1.


**Table 1.** Overview of commonly extracted feature types in studies developing ML prediction models.

Open-source packages such as PyRadiomics have been developed as a reference standard for radiomic feature extraction [36]. Clinical features are known to be important markers for predicting glioma grades and molecular subtypes [37]. VASARI features, developed by The Cancer Imaging Archive (TCIA), are frequently found in studies that qualitatively describe tumor morphology using visual features and controlled vocabulary/standardized semantics [38].
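As an illustration of radiomic feature extraction with the open-source PyRadiomics package [36] (a sketch of typical usage; the file paths are placeholders):

```python
from radiomics import featureextractor

# Default extractor; individual feature classes can also be enabled explicitly.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.enableFeatureClassByName("firstorder")
extractor.enableFeatureClassByName("glcm")    # a second-order (texture) feature class
extractor.enableFeatureClassByName("shape")

# `image.nrrd` is the MR volume, `mask.nrrd` the tumor segmentation (ROI).
features = extractor.execute("image.nrrd", "mask.nrrd")
radiomic = {k: v for k, v in features.items() if not k.startswith("diagnostics")}
```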

Current technology permits extraction of over 1000 features per image. As a high number of features may lead to model overfitting, model developers commonly reduce the number of features used through feature selection. Feature selection methods, including Filter, Wrapper, and Embedded methods, remove non-informative features that reduce the model's overall performance [39].
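A minimal example of a filter-method feature selection step (scikit-learn, on toy data; not drawn from any specific study): keep only the *k* features most associated with grade by ANOVA F-score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((120, 1000))               # 120 patients x 1000 radiomic features (toy)
y = rng.integers(0, 2, size=120)          # 0 = low-grade, 1 = high-grade (toy)

selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)  # (120, 20): non-informative features dropped
```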

The final set of features is fed into a glioma grade classification algorithm(s)—for example, support vector machine (SVM) and CNN—during the training process. The classifier performance is then measured through performance metrics such as accuracy, area under the curve receiver operating characteristic, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score. The model is validated internally, usually through hold-out or cross-validation techniques. Ideally, the model is externally validated as a final step to ensure reproducibility, generalizability, and reliability in a different setting (Figure 1).
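Concretely, the classification and internal-validation steps might look like the following scikit-learn sketch (an SVM with cross-validation on toy data; any of the metrics above can be added to `scoring`):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((120, 20))                 # selected features (toy data)
y = rng.integers(0, 2, size=120)          # binary glioma grade labels (toy)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
results = cross_validate(clf, X, y, cv=5, scoring=("accuracy", "roc_auc", "recall"))
print({k: v.mean() for k, v in results.items() if k.startswith("test_")})
```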

**Figure 1.** Characteristic workflow for developing ML glioma grade prediction models. VASARI = Visually AcceSAble Rembrandt Images, AUC = area under the curve receiver operating characteristic, CNN = convolutional neural network, ML = machine learning, NPV = negative predictive value, PPV = positive predictive value, and SVM = support vector machine.

#### **3. Algorithms for Glioma Grade Classification**

The most common high-performing ML classifiers for glioma grading in the literature are SVM and CNN [13]. SVM is a classical ML algorithm that represents objects as points in an n-dimensional space, with features serving as coordinates. SVMs use a hyperplane, or an n-1 dimensional subspace, to divide the space into disconnected areas [40]. These distinct areas represent the different classes that the model can classify. Unlike CNNs, SVMs require hand-engineered features, such as from radiomics, to serve as inputs. This requirement may be advantageous for veteran diagnostic imagers, whose knowledge of brain tumor appearance may enhance feature design and selection. Hand-engineered features also can undergo feature reduction to mitigate the risks of overfitting, and prior works demonstrate better performance for glioma grading models using a smaller number of quantitative features [41]. However, hand-engineered features are limited since they cannot be adjusted during model training, and it is uncertain if they are optimal features for classification. Moreover, hand-engineered features may not generalize well beyond the training set and should be tested extensively prior to usage [42,43].

CNNs are a form of deep learning based on image convolution. Images are the direct inputs to the neural network, rather than the manually engineered features of classical ML. Numerous interconnected layers each compute feature representations and pass them on to subsequent layers [43,44]. Near the network output, features are flattened into a vector used for the classification task. CNNs appeared for glioma grading in 2018 and have risen quickly in prevalence while exhibiting excellent predictive accuracies [45–48]. To a greater extent than classical ML, they are suited for working with large amounts of data, and their architecture can be modified to optimize efficiency and performance [46]. Disadvantages include the opaque "black box" nature of deep learning and the associated difficulty of interpreting model parameters, along with problems that variably apply to classical ML as well (e.g., the high amount of time and data required for training, hardware costs, and necessary user expertise) [49,50].
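For contrast with the SVM, the sketch below shows the image-in, grade-out pattern of a CNN in deliberately miniature form (PyTorch; real glioma-grading networks are far deeper, and all names here are ours):

```python
import torch
import torch.nn as nn

class TinyGliomaCNN(nn.Module):
    def __init__(self, n_classes: int = 2):   # e.g., low-grade vs. high-grade
        super().__init__()
        self.features = nn.Sequential(         # stacked convolution + pooling layers
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Linear(32 * 56 * 56, n_classes)  # flattened features

    def forward(self, x):                      # x: (B, 1, 224, 224) MR slice
        return self.classifier(self.features(x).flatten(1))

logits = TinyGliomaCNN()(torch.randn(2, 1, 224, 224))  # -> shape (2, 2)
```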

In our systematic review of 85 published ML studies developing models for imagebased glioma grading, we found SVM and CNN to have mean accuracies of 90% and 91%, respectively [51]. Mean accuracies for these algorithms were similar across classification tasks regardless of whether the classification was binary or multi-class (e.g., 90% for the 24 studies whose best models performed binary classification of grades 1/2 vs. 3/4 compared to 86% for the 5 studies classifying grade 2 vs. 3 vs. 4). No consensus has been reached regarding the optimal ML algorithm for image-based glioma classification.

#### **4. Challenges in Image-Based ML Glioma Grading**

#### *4.1. Data Sources*

Since 2011, a significant number of ML glioma grade prediction studies have used open-source multi-center datasets to develop their models. BraTS [52] and TCIA [53] are two prominent public datasets that contain multi-modal MRI images of high- and low-grade gliomas and patient demographics. BraTS was first made available in 2012, with the 2021 dataset containing 8000 multi-institutional, multi-parametric MR images of gliomas [52]. TCIA first went online in 2011 and contains MR images of gliomas collected across 28 institutions [53]. These datasets were developed with the aim of providing a unified multi-center resource for glioma research. A variety of predictive models have been trained and tested on these large datasets since their initial release [54]. Despite their value as public datasets for model development, several limitations should be considered. Images are collected across multiple institutions with variable protocols and image quality. Co-registration and imaging pre-processing integrate these images into a single system. Although these techniques are necessary, they may reduce heterogeneity within the datasets [52]. Models developed on these datasets may perform well in training and testing. Nevertheless, the results may not be reproducible in the real-world clinical setting, where images and tumor presentations are heterogeneous. We strongly support large multi-center datasets in order to demonstrate model performance across distinct hospital settings. We recommend, however, that such initiatives incorporate images of various diagnostic qualities into their training datasets, to more closely resemble what is seen in daily practice.

#### *4.2. External Validation*

Publications have reported predictive models for glioma grading throughout the last 20 years, with the majority relying on internal validation techniques, of which cross-validation is the most popular. While internal validation is a well-established method for measuring how well a model will perform on new cases from the initial dataset, additional evaluation on a separate dataset (i.e., external validation) is critical to demonstrate model generalizability. External validation mitigates site bias (differences amongst centers in protocols, techniques, scanner variability, level of experience, etc.) and sampling/selection bias (performance only applicable to the specific training set population/demographics) [55]. Not controlling for these two major biases undermines model generalizability, yet few publications externally validate their models [13]. Therefore, normalizing external validation is a crucial step in developing glioma grade prediction models that are suitable for clinical implementation.

#### *4.3. Glioma Grade Classification Systems*

The classification of glioma subtypes into high- and low-grade gliomas is continuously evolving. In 2016, an integrated histological–molecular classification replaced the previous, purely histopathological classification [56]. In 2021, the Consortium to Inform Molecular and Practical Approaches to CNS Tumor Taxonomy (cIMPACT-NOW) once more accentuated the diagnostic value of molecular markers, such as the isocitrate dehydrogenase mutation, for glioma classification [57]. As a result of the evolving glioma classification system, definitions of high- and low-grade gliomas vary across ML glioma grade prediction studies and publication years. This reduces the comparability of both the models themselves and the grade-labeled datasets used for model development. We recommend that future glioma grade prediction studies focus on both glioma grade and molecular subtypes for more comprehensive and reliable results over time. Neuropathologic diagnostic emphasis has shifted from an approach based purely on microscopic histology to one that combines morphologic and molecular genetic features of the tumor, including gene mutations, chromosomal copy number alterations, and gene rearrangements, to yield an integrated diagnosis. Rapid developments in next-generation sequencing techniques, multimodal molecular analysis, large-scale genomic and epigenomic analyses, and DNA methylation methods promise to fundamentally transform pathologic CNS tumor diagnostics, including glioma diagnosis and grading, to a whole new level of precision and complexity.

Current and future ML methods must keep abreast of the rapid progress in tissue-based integrated diagnostics in order to contribute to and make an impact on the clinical care of glioma patients (Figure 2).

**Figure 2.** Challenges for clinical implementation of ML glioma grade prediction models. ML = machine learning. WHO = World Health Organization.

#### *4.4. Reporting Quality and Risk of Bias*

#### 4.4.1. Overview of Current Guidelines and Tools for Assessment

It is critical that studies detailing prediction models, such as those for glioma grading, exhibit a high caliber of scientific reporting in accordance with consensus standards. Clear and thorough reporting enables more complete understanding by the reader and unambiguous assessment of study generalizability, quality, and reproducibility, encouraging future researchers to replicate and use models in clinical contexts. Several instruments have been designed to improve the reporting quality (defined here as the transparency and thoroughness with which authors share key details of their study to enable proper interpretation and evaluation) of studies developing models. The Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) Statement was created in 2015 as a set of recommendations for studies developing, validating, or updating diagnostic or prognostic models [58]. The TRIPOD Statement is a checklist of 22 items considered essential for transparent reporting of a prediction model study. In 2017, with a concurrent rise in radiomics-based model studies, the radiomics quality score (RQS) emerged [59]. RQS is an adaptation of the TRIPOD approach geared toward a radiomics-specific context. The tool has been used throughout the literature for evaluating the methodological quality of radiomics studies, including applications to medical imaging [60]. Radiomics-based approaches for interpreting medical images have evolved to encompass the AI techniques of classical ML and, most recently, deep learning models. In recognition of the growing need for an evaluation tool specific to AI applications in medical imaging, the Checklist for AI in Medical Imaging (CLAIM) was published in 2020 [61]. The 42 elements of CLAIM aim to be a best practice guide for authors presenting their research on applications of AI in medical imaging, ranging from classification and image reconstruction to text analysis and workflow optimization. Other tools—the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [62] and Prediction model Risk Of Bias ASsessment Tool (PROBAST) [63]—importantly evaluate the risk of bias in studies based on what is reported about their models (Table 2). Bias relates to systematic limitations or flaws in study design, methods, execution, or analysis that distort estimates of model performance [62]. High risk of bias discourages adaptation of the reported model outside of its original research context and, at a systemic level, undermines model reproducibility and translation into clinical practice.

**Table 2.** Overview of major reporting guidelines and bias assessment tools for diagnostic and prognostic studies.


<sup>1</sup> AI = artificial intelligence, <sup>2</sup> CLAIM = Checklist for AI in Medical Imaging, <sup>3</sup> PROBAST = Prediction model Risk Of Bias ASsessment Tool, <sup>4</sup> QUADAS-2 = Quality Assessment of Diagnostic Accuracy Studies, <sup>5</sup> RQS = radiomics quality score, and <sup>6</sup> TRIPOD = Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis.

#### 4.4.2. Reporting Quality and Risk of Bias in Image-Based Glioma Grade Prediction

Assessments of ML-based prediction model studies have demonstrated that risk of bias is high and reporting quality is inadequate. In their systematic review of prediction models developed using supervised ML techniques, Navarro et al. found that the high risk of study bias, as assessed using PROBAST, stems from small study size, poor handling of missing data, and failure to deal with model overfitting [64]. Similar findings have been reported for the glioma grade prediction literature. In our prior study conducting a TRIPOD analysis of more than 80 such model development studies, we reported a mean adherence rate to TRIPOD of 44%, indicating poor quality of reporting [51]. Areas for improvement included reporting of titles and abstracts, justification of sample size, full model specification and performance, participant demographics, and missing data. Sohn et al.'s meta-analysis of radiomics studies differentiating high- and low-grade gliomas estimated a high risk of bias according to QUADAS-2, attributing this to the fact that all their analyzed studies were retrospective (and have the potential for bias because patient outcomes are already known), the lack of control over acquisition factors in the studies using public imaging data, and unclear study flow and timing due to poor reporting [41]. Readers should refer directly to Navarro et al., Bahar et al., and Sohn et al. for more detailed discussion of shortcomings in study reporting and risk of bias.

#### 4.4.3. Future of Reporting Guidelines and Risk of Bias Tools for ML Studies

Efforts by authors to refine how they report their studies depend upon existing reporting guidelines. In their systematic review, Yao et al. identified substantial limitations to neuroradiology deep learning reporting standardization and reproducibility [65]. They recommended that future researchers propose a reporting framework specific to deep learning studies. This call for an AI-targeted framework parallels contemporary movements to produce AI extensions of established reporting guidelines. TRIPOD creators have discussed the challenges with ML not captured in the TRIPOD Statement [66]. The introduction of more relevant terminology and movement away from regression-based model approaches will be part of the forthcoming extension of TRIPOD for studies reporting ML-based diagnostic or prognostic models (TRIPOD-AI) [66,67]. QUADAS-2 creators also announced a plan for an AI extension (QUADAS-AI), noting that their tool similarly does not accommodate AI-specific terminology and further documenting sources of AI study bias that are not signaled by the tool [68]. An AI extension of PROBAST (PROBAST-AI) is also in development [66].

#### 4.4.4. Recommendations

Systematic reviews and meta-analyses in the field [41,51,64] reveal various aspects of reporting and bias risk that need to be addressed in order to promote complete understanding, rigorous assessment, and reproducibility of image-based ML glioma grading studies. Based on the problems identified in this literature (discussed in Section 4.4.2), we encourage future works to closely adhere to the reporting and risk of bias tools and guidelines most relevant to them, with particular attention to the areas identified above (e.g., sample size justification, handling of missing data, full model specification, and participant demographics).


For prediction model studies that involve applications of AI to medical imaging, CLAIM is the only framework that is specific to AI and able to capture the nuances of such model reporting—including data preprocessing steps, model layers/connections, software libraries and packages, initialization of model parameters, performance metrics of models on all data partitions, and public access to full study protocols. We, therefore, recommend that future studies developing ML models for the prediction of glioma grade from imaging use CLAIM to guide how they present their work. Authors should remain vigilant regarding the release of other AI-specific frameworks that may better suit their studies and should seek out AI-specific risk of bias tools to supplement CLAIM once available.

#### **5. Future Directions**

ML models present an attractive solution towards overcoming current barriers and accelerating the transition to patient-tailored treatments and precision medicine. Novel algorithms combine information derived from multimodal imaging with molecular markers and clinical information, with the aim of bringing personalized, patient-level predictions into routine clinical care. Relatedly, multi-omic approaches that integrate a variety of advanced techniques such as proteomics, transcriptomics, epigenomics, etc., are increasingly gaining importance in understanding cancer biology and will play a key role in the facilitation of precision medicine [70,71]. The growing presence of ML models in research settings is indisputable, yet several strategies should be considered to facilitate clinical implementation: PACS-based image annotation tools, data-sharing and federated learning, ML fairness, ML transparency, and FDA clearance and real-world use (Figure 3).

**Figure 3.** Future directions for clinical implementation of ML glioma grade prediction models, ML = machine learning.

#### *5.1. PACS-Based Image Annotation Tools*

Large, annotated datasets that are tailored to the patient populations of individual hospitals and practices are key to training clinically applicable prediction algorithms. An end-to-end solution for the generation of these datasets, in which all steps of the ML workflow are performed automatically in the clinical picture archiving and communication system (PACS) as the neuroradiologist reads a study, is considered the "holy grail" of AI workflow in radiology [72]. A mechanism for achieving this is through automated/semi-automated segmentation, feature extraction, and prediction algorithms embedded into clinical PACS that provide reports in real time. The accumulation of saved segmentations through this workflow could accelerate the generation of large, annotated datasets, in addition to providing a decision-support tool for neuroradiologists in daily practice. Under these circumstances, establishing strong academic–industry partnerships for the development of clinically useful image annotation tools is fundamental.

#### *5.2. Data-Sharing and Federated Learning*

Multi-institutional academic partnerships are also critical for maximizing clinical applications of ML. Data-sharing efforts are under way in order to accelerate the pace of research [73]. Cross-institutional collaborations not only enrich the quality of the input that goes into training the model, but also provide datasets for externally validating other institutions' models. However, data-sharing across institutions is often hindered by technical, regulatory, and privacy concerns [74]. A promising solution for this is federated learning, an up-and-coming collaborative algorithm training approach that does not require cross-institutional data-sharing. In federated learning, models are trained locally inside an institution's firewalls, and the learned weights or gradients are transferred from participating institutions for aggregation into a more robust model [75]. This overcomes the barriers of data-sharing and has been shown to be superior to algorithms trained on single-center datasets [76]. Federated learning is not without drawbacks, however; it depends on existing standards for data quality, protocols, and the heterogeneity of data distribution, and researchers do not have access to the model training data and may face difficulty interpreting unexpected results.
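
The aggregation step at the heart of federated learning can be sketched in a few lines. The toy example below performs FedAvg-style weighted averaging of locally updated linear-model weights; the single-step least-squares updates and synthetic site data are simplifications for illustration, not a production federated learning framework.

```python
import numpy as np

def local_update(w, X, y, lr=0.1):
    """One local least-squares gradient step; stands in for full local training."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
w_global = np.zeros(10)
# Three synthetic "institutions", each holding private data behind its firewall.
sites = [(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(3)]

for _ in range(20):  # communication rounds
    # Each site trains locally and shares only its learned weights, not its data.
    local_ws = [local_update(w_global.copy(), X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites])
    # The server aggregates: a sample-size-weighted average of the local weights.
    w_global = np.average(local_ws, axis=0, weights=sizes)
```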

#### *5.3. ML Fairness*

A common misconception about AI algorithms is that they are not vulnerable to biases during decision-making. In reality, algorithm unfairness, defined as prejudice or discrimination that skews decisions toward or against individuals or groups based on their characteristics, has been extensively documented across AI applications. A well-known example is the Correctional Offender Management Profiling for Alternative Sanctions score, a tool that assisted judges with the decision to release an offender or keep them in prison. The software was found to be biased against African Americans, judging them to be at higher risk of recommitting crimes compared with Caucasian individuals [77]. Additional examples of bias have been demonstrated across widely deployed biobanks [78], clinical trial accrual populations [79], and ICU mortality and 30-day psychiatric readmission prediction algorithms [80], among other medical domains. Publicly available tools, including Fairlearn and AI Fairness 360, assess and correct for algorithm unfairness ranging from allocation harms and quality-of-service harms to feature and racial bias [81,82]. These tools have yet to be applied widely in medical contexts despite their promising utility. Future works on AI in neuro-oncology should consider implementing evidence-based bias detection and mitigation tools tailored to their algorithm development setting and target population prior to clinical integration.
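
As an illustration of how such tooling is used, the sketch below applies Fairlearn's MetricFrame to compare accuracy and selection rate across groups defined by a sensitive attribute; the labels, predictions, and group assignments are synthetic placeholders.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)     # placeholder outcome labels
y_pred = rng.integers(0, 2, size=500)     # placeholder model predictions
group = rng.choice(["A", "B"], size=500)  # placeholder sensitive attribute

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(mf.by_group)      # each metric broken down per group
print(mf.difference())  # largest between-group gap for each metric
```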

#### *5.4. ML Transparency*

The opaqueness of ML models—DL in particular—poses a barrier to their acceptance and usage. In addition, traditional measures such as software validation are insufficient for fulfilling legal, compliance, and/or other requirements for ML tool clarification [83,84]. Explainable artificial intelligence (xAI) approaches may address these concerns by explaining particular prediction outputs and overall model behavior in human-understandable terms [85]. A recent study demonstrates the successful use of state-of-the-art xAI libraries incorporating visual analytics for glioma classification [83]. Other approaches such as Grad-CAM generate visual explanations of DL model decisions and, therefore, enhance algorithm transparency [86]. These tools can support the interpretability of ML model outputs for future research as well as prime ML for dissemination and acceptance in clinical neuroradiology. Guidelines for authors, along with reporting quality assessment and risk of bias tools, should consider encouraging such approaches to further the transparency of literature in the field.
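
For concreteness, the Grad-CAM idea can be hand-rolled with forward and backward hooks in PyTorch: feature maps from the last convolutional stage are weighted by their average gradients and passed through a ReLU to form a class-discriminative heatmap. The ResNet-18 backbone and random input below are hypothetical stand-ins for a trained glioma-grading CNN and a preprocessed MRI slice.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None)  # stand-in for a trained glioma classifier
model.eval()

store = {}
model.layer4.register_forward_hook(
    lambda m, i, o: store.update(act=o.detach()))
model.layer4.register_full_backward_hook(
    lambda m, gi, go: store.update(grad=go[0].detach()))

x = torch.randn(1, 3, 224, 224)        # placeholder for a preprocessed MRI slice
scores = model(x)
scores[0, scores.argmax()].backward()  # gradient of the top class score

# Grad-CAM: weight each feature map by its average gradient, ReLU, upsample.
weights = store["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```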

Of relevance to ML model transparency are the concepts of usability and causability. Usability can be defined as the ease of use of a computer system for users, or in other words, the extent to which a user and a system may communicate through an interface without misunderstanding [87,88]. Highly usable tools are associated with positive user satisfaction and performance in the field of human–computer interaction [89]. Causability is a parallel concept to usability and is foundational for human–AI interaction. Causability reflects the understandability of an AI model (e.g., a CNN) to a human as communicated by an explanation interface [89]. Causability, furthermore, helps determine what should be explained, how, and with what relative importance [90]. Embracing causability in the development of human–AI interfaces will help people understand the decision-making process of ML algorithms and improve trust. We believe this will lower the threshold for clinical ML utilization.

#### *5.5. FDA Clearance and Real-World Use*

Thousands of studies pertaining to applications of AI and ML in medical imaging have been published [15,82]. Yet, few imaging AI/ML algorithms have been cleared by the FDA as medical products [91], perhaps due in part to the lack of standardization and transparency in the FDA clearance process [92]. Bridging the gap between AI/ML research and FDA clearance—as well as between FDA clearance and real-world algorithm use—will streamline the adoption of ML models for glioma grading into clinical settings. To this end, Lin presents several suggestions [93]. Partnering of the FDA with professional societies could facilitate the standardization of algorithm development and evaluation. A key focus would be resolving the split between how results are communicated in the literature (e.g., performance metrics) and what is relevant for AI product assessment (e.g., return on investment, integration and flexibility with PACS, ease of use, etc.). Moreover, reporting of post-marketing surveillance could help track real-world use and algorithm performance drift.

#### **6. Conclusions**

ML glioma grade prediction tools are increasingly prevalent in research but have yet to be incorporated clinically. The reporting quality of ML glioma grade prediction studies is low, limiting model reproducibility and thus preventing reliable clinical translation. However, current efforts to create ML-specific reporting guidelines and risk of bias tools may help address these issues. Future directions for supporting clinical implementation of ML prediction models include data-sharing, federated learning, and development of PACS-based image annotation tools for the generation of large image databases, among other opportunities.

**Author Contributions:** Conceptualization, S.M., R.C.B. and M.S.A.; methodology, S.M., R.C.B. and M.S.A.; investigation, J.T.M., S.C. and M.S.A.; resources, M.S.A.; writing—original draft preparation, S.M., R.C.B., T.Z. and M.S.A.; writing—review and editing, M.L., I.I., K.B., G.I.C.P., L.S., S.P., J.T.M. and S.C.; visualization, T.Z., I.I. and S.C.; supervision, G.I.C.P., S.C. and M.S.A.; project administration, M.S.A.; funding acquisition, M.S.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** Sara Merkaj receives funding in part from the Biomedical Education Program (BMEP). Ryan Bahar receives funding in part from the National Institute of Diabetes and Digestive and Kidney Disease of the National Institutes of Health under Award Number T35DK104689. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Mariam Aboian received funding from the American Society of Neuroradiology Fellow Award 2018. This publication was made possible by KL2 TR001862 (Mariam Aboian) from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH), and the NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NIH. MingDe Lin is an employee and stockholder of Visage Imaging, Inc. and, unrelated to this work, receives funding from NIH/NCI R01 CA206180 and is a board member of the Tau Beta Pi engineering honor society. Khaled Bousabarah is an employee of Visage Imaging, GmbH. Seyedmehdi Payabvash has grant support from NIH/NINDS K23NS118056, the Foundation of the American Society of Neuroradiology (ASNR) #1861150721, the Doris Duke Charitable Foundation (DDCF) #2020097, and NVIDIA.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We thank Yale Department of Radiology for their research resources and space.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


### *Article* **Machine Learning Based on MRI DWI Radiomics Features for Prognostic Prediction in Nasopharyngeal Carcinoma**

**Qiyi Hu 1,†, Guojie Wang 2,†, Xiaoyi Song 3, Jingjing Wan 1, Man Li 3, Fan Zhang 4, Qingling Chen 1, Xiaoling Cao 1, Shaolin Li 2,\* and Ying Wang 1,\***


**Simple Summary:** In the past, radiomics studies of nasopharyngeal carcinoma (NPC) were based only on basic MR sequences. Previous studies have shown that radiomics methods based on T2-weighted imaging and contrast-enhanced T1-weighted imaging can successfully predict the prognosis of patients with nasopharyngeal carcinoma. The purpose of this study was to explore the predictive efficacy of radiomics analyses based on readout-segmented echo-planar diffusion-weighted imaging (RESOLVE-DWI), which quantitatively reflects the diffusion motion of water molecules, for prognosis evaluation in nasopharyngeal carcinoma. Several prognostic radiomics models were established using diffusion-weighted imaging, apparent diffusion coefficient maps, and T2-weighted and contrast-enhanced T1-weighted imaging to predict the risk of recurrence or metastasis of nasopharyngeal carcinoma, and the predictive effects of the different models were compared. The results show that the model based on MRI DWI can successfully predict the prognosis of patients with nasopharyngeal carcinoma and has higher predictive efficiency than the model based on the conventional sequences, suggesting that MRI DWI radiomics can provide a useful alternative approach for survival estimation.

**Abstract:** Purpose: This study aimed to explore the predictive efficacy of radiomics analyses based on readout-segmented echo-planar diffusion-weighted imaging (RESOLVE-DWI) for prognosis evaluation in nasopharyngeal carcinoma in order to provide further information for clinical decision making and intervention. Methods: A total of 154 patients with untreated NPC confirmed by pathological examination were enrolled, and the pretreatment magnetic resonance images (MRI)—including diffusion-weighted imaging (DWI), apparent diffusion coefficient (ADC) maps, T2-weighted imaging (T2WI), and contrast-enhanced T1-weighted imaging (CE-T1WI)—were collected. The Random Forest (RF) algorithm was used to select radiomics features and establish the machine-learning models. Five models, namely model 1 (DWI + ADC), model 2 (T2WI + CE-T1WI), model 3 (DWI + ADC + T2WI), model 4 (DWI + ADC + CE-T1WI), and model 5 (DWI + ADC + T2WI + CE-T1WI), were constructed. The average area under the curve (AUC) of the validation set was determined in order to compare the predictive efficacy for prognosis evaluation. Results: After adjusting the parameters, RF machine learning models based on extracted imaging features from the different sequence combinations were obtained. The validation sets of model 1 (DWI + ADC) yielded the highest average AUC of 0.80 (95% CI: 0.79–0.81). The average AUCs of the model 2, 3, 4, and 5 validation sets were 0.72 (95% CI: 0.71–0.74), 0.66 (95% CI: 0.64–0.68), 0.74 (95% CI: 0.73–0.75), and 0.75 (95% CI: 0.74–0.76), respectively. Conclusion: A radiomics model derived from the MRI DWI of patients with nasopharyngeal carcinoma was generated in order to evaluate the risk of recurrence and metastasis. The model based on MRI DWI can provide an alternative approach for survival estimation, and can reveal more information for clinical decision-making and intervention.


**Keywords:** radiomics; nasopharyngeal carcinoma; diffusion-weighted imaging; prognostic prediction; heterogeneity

#### **1. Introduction**

Nasopharyngeal carcinoma (NPC) is an epithelial malignancy with distinctive geographic distribution [1]. Over 130,000 patients were newly diagnosed with NPC in 2020, among which more than 70% were located in East and South East Asia [1,2]. Even with advancements in screening and treatments, approximately 5–15% of patients exhibit local recurrence, and 15–30% of NPC patients experience metastatic spread after standard treatment [3]. Therefore, identifying the reliable predictive factors associated with prognosis is necessary. In the last few decades, tumor heterogeneity has continued to be a crucial factor influencing prognosis [4]. At present, the clinical formulation of treatment primarily depends on the TNM staging system. However, similar clinical treatment can result in distinct clinical outcomes for NPC patients with the same TNM stage [5], indicating that the system merely reflects the anatomic invasion and fails to adequately unmask tumor heterogeneity.

Moreover, some specific blood metabolites and cellular and genetic parameters are used to predict the prognosis of nasopharyngeal carcinoma patients, such as EBV-DNA, LDH, ALP, HOPX, miRNAs, and gene expression [6–10]. Importantly, EBV-DNA and several pretreatment inflammatory biomarkers, including lymphocyte and neutrophil counts and the neutrophil-to-lymphocyte ratio (NLR), have been considered independent prognostic factors for patients with NPC [11]. Nevertheless, the former biomarkers suffer from instability and non-specificity, whereas the routine application of the latter is restricted by its high cost. Therefore, a low-cost, convenient, and accurate approach that can evaluate heterogeneity and prognosis is urgently needed.

The radiomics technique has emerged as a promising approach for converting images into high-dimensional, quantitative features [12]. Radiomics analysis based on clinical images can provide additional information about tumor heterogeneity steadily and accurately, and can thus offer support for clinical decision making, thereby improving tumor treatment with an economic and non-invasive approach [13]. Radiomics models based on MRI for predicting the prognosis of patients with nasopharyngeal carcinoma have been investigated and have demonstrated great value in risk stratification and prognosis evaluation [14–16]. However, related studies have only extracted image features from basic MRI sequences. As a functional imaging technique, DWI can quantitatively demonstrate the diffusion motion of water molecules in the tissue microenvironment, and can detect tissue cellularity, microstructure, and microvasculature at the sub-voxel level, thereby revealing additional internal features of the tumor that may uncover vital prognostic information [17]. It has been frequently used in clinical trials for differential diagnosis, staging, therapeutic evaluation, and prognostic prediction in oncology [18].

In the past, DWI images suffered from insufficient image quality, including obvious artifacts, limited resolution, and blurring, which hindered their routine application in radiomics of the head and neck [19]. However, readout-segmented imaging (RS-EPI) approaches have now been introduced to perform high-resolution diffusion-weighted MRI (HR-DWI), and have greatly improved image quality, with higher resolution and fewer artifacts than the extensively adopted single-shot (SS-EPI) DWI [20]. This improvement is achieved by shortening the data-acquisition time and dividing the k-space into multiple interleaved acquisitions in order to diminish the accumulation of phase errors in the phase-encoding direction. Previous studies have shown that a radiomics model based on DWI MRI can accurately reveal the individual prognosis in several cancers, such as bladder, hepatocellular, and prostate cancers [21–23].

Based on our literature search, whether radiomics based on a DWI sequence can capture the tumor heterogeneity of nasopharyngeal carcinoma and evaluate the risk of recurrence and metastasis remains uncertain. Accordingly, we performed the present study to visualize the heterogeneity and disclose the prognosis of nasopharyngeal carcinoma through radiomics analyses based on the RESOLVE-DWI sequence. Furthermore, we sought to compare and combine the radiomics model based on the RESOLVE-DWI sequence and the conventional sequences (T2WI and CE-T1WI) in order to provide more information for clinical decision-making and intervention.

#### **2. Materials and Methods**

#### *2.1. Patients*

Approval for this retrospective study was obtained from the Ethics Review Committee of the Fifth Affiliated Hospital of Sun Yat-sen University (ClinicalTrials.gov Identifier: NCT05112510), and the Committee concurrently waived the requirement for informed consent. A total of 154 patients with untreated NPC confirmed by pathological examination between March 2014 and June 2018 were enrolled, including 15 patients with local or regional tumor recurrence and 28 patients with distant metastasis (1 patient had local recurrence and metastases simultaneously).

The collected clinical features included age, gender, tumor size (T), nodal status (N), metastases (M), TNM staging, and histological subtypes. The staging was based on the Eighth American Joint Committee on Cancer TNM staging manual [24]. According to the criteria from the World Health Organization (WHO), the histological subtypes were classified into three patterns: keratinizing squamous cell carcinoma (type I), nonkeratinizing differentiated carcinoma (type II), and nonkeratinizing undifferentiated carcinoma (type III) [25].

#### *2.2. Inclusion and Exclusion Criteria*

The eligibility criteria for patient enrollment were as follows: (1) patients with NPC confirmed by pathological examination; (2) patients with complete MR images and clinical data; and (3) patients who did not receive chemotherapy, radiotherapy, or surgery before their MRI scans. Patients were excluded according to the following criteria: (1) incomplete periodic follow-up data; (2) poor image quality; and (3) a concomitant or previous history of cancer.

#### *2.3. Endpoints*

Failure-free survival (FFS) was defined as the primary endpoint in this study; it was measured from the date of the first MR scan to disease progression. Local recurrence was diagnosed through pathological examinations. If any medical report indicated distant metastasis, the suspected site of involvement was subjected to additional histological confirmation. In the case of a failed biopsy or no biopsy, regular follow-up was performed, and distant metastasis was diagnosed when enlargement of the lesions was observed.

#### *2.4. MRI Acquisition*

All 154 patients underwent a series of MRI scans. The sequences included axial T2-weighted imaging (T2WI), contrast-enhanced T1-weighted imaging (CE-T1WI), axial DWI (*b* = 800 s/mm²), and ADC mapping. The MRI scanning was performed on a Magnetom Trio 3.0T MRI scanner (Siemens Medical, Erlangen, Germany). An eight-channel head and neck coil was adopted in order to collect the signals. The scanning range was from the skull base to the subclavian region. The conventional MRI sequence included axial T2WI and CE-T1WI. The contrast agent was a Gadobutrol injection.

The following parameters were set for the axial T2WI: TR/TE, 3760 ms/72 ms; field of view (FOV), 230 mm; matrix size, 320 × 224; layer thickness, 5 mm; interlayer spacing, 1 mm; bandwidth, 340 Hz; acquisition time, 3 min and 23 s; number of excitations (NEX), 2; and resolution, 0.7 × 0.7 mm.

The following parameters were set for CE-T1WI: TR/TE, 4660 ms/10 ms; FOV, 230 mm; matrix size, 320 × 224; layer thickness, 5 mm; interlayer spacing, 1 mm; bandwidth, 347 Hz; acquisition time, 2 min 49 s; NEX, 3; and resolution, 0.7 × 0.7 mm.

The following parameters were set for RESOLVE-DWI: RS-EPI, TR/TE, 3800 ms/65 ms; matrix size, 192 × 192; layer thickness, 4 mm; interlayer spacing, 0.6 mm; bandwidth, 521 Hz; acquisition time, 2 min 55 s; segmented readout times, 9; and *b* = 0, 800 s/mm². The ADC maps were automatically generated from the MRI system.

#### *2.5. Segmentation and Feature Extraction*

All of the regions of interest (ROIs) of the images were manually segmented in all of the slices by two radiologists: one with 5 years of clinical experience and the other with 15 years. A total of 5636 features were extracted. Manual segmentation and relative feature extraction were both conducted in the Radcloud platform (https://mics.radcloud.cn, accessed: 23 May 2022). The intraclass correlation coefficient (ICC) in 20 patients was calculated in order to assess the intra- and inter-observer variability for consistency. Features with an ICC below 0.75 were excluded.

#### *2.6. Radiomics Feature and Model Selection*

All of the feature columns with identical values across patients were eliminated, and normalization by order of magnitude was performed on all of the features. The extracted features were screened by Random Forest (RF), which builds decision trees whose splits incorporate randomness; it has been adopted extensively in radiomics because of its excellent performance in classification tasks [26]. The workflow for feature selection by Random Forest can be summarized as follows. First, the differential clinical characteristics were added and set as dummy variables. The top 100 features were screened according to importance. Then, the top 10 features in terms of improving the model's predictive power were retained after the cyclical inclusion of each feature with a forward stepwise approach by the RF method. Finally, the features of each model were limited to 10. The training set was randomly split with the k-fold cross-validation method: it was divided into five subsets, with a fold size of *N* = 26 (two folds: *N* = 27).
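
A schematic reimplementation of this two-step selection in scikit-learn might look as follows; the synthetic feature matrix, forest sizes, and AUC criterion are assumptions for illustration, not the Radcloud platform's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 500))    # placeholder radiomic feature matrix
y = rng.integers(0, 2, size=130)   # placeholder recurrence/metastasis labels

# Step 1: rank all candidate features by RF importance and keep the top 100.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top100 = list(np.argsort(rf.feature_importances_)[::-1][:100])

# Step 2: forward stepwise selection, retaining the 10 features whose
# cyclical inclusion most improves cross-validated AUC.
selected = []
while len(selected) < 10:
    def cv_auc(f):
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(clf, X[:, selected + [f]], y,
                               cv=5, scoring="roc_auc").mean()
    candidates = [f for f in top100 if f not in selected]
    selected.append(max(candidates, key=cv_auc))

print("selected feature indices:", selected)
```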

The differences in clinical factors between the two groups were investigated by one-way analysis of variance in SPSS (version 25.0, IBM Corp., Armonk, NY, USA). The Chi-square test was used for categorical variables, and the Mann–Whitney U test was used for continuous variables. Ordinal variables were compared using the Wilcoxon signed-rank test. Python was used to screen and select features and to build the machine learning models based on the screened features.

Five mainstream algorithms (Logistic Regression, kNN, Naive Bayes, Random Forest, and XGB Classifier) were chosen for training and validation. In order to obtain a more robust model, we applied five-fold cross-validation and calculated the average AUC of the training sets and the average AUC of the validation sets. The results were presented as the average AUC of the cross-training set and the average AUC of the cross-validation set. The major parameters of the corresponding models were adjusted using GridSearchCV, and the model was chosen according to the average AUC of the cross-validation set [27]. A schematic version of this comparison is sketched below.
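
A minimal version of this comparison, assuming scikit-learn and xgboost are installed and using synthetic data in place of the ten selected radiomic features, might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 10))     # placeholder: the 10 selected features
y = rng.integers(0, 2, size=130)   # placeholder labels

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "XGB Classifier": XGBClassifier(eval_metric="logloss"),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean validation AUC = {auc.mean():.2f}")
```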

#### *2.7. Prediction Model Building*

The selected models mentioned above were trained and validated based on the screened features from different sequence combinations, and the parameters were adjusted. All of the major parameters, such as criterion, max\_depth, min\_samples\_leaf, min\_samples\_split, max\_features, and min\_impurity\_decrease, were adjusted within a large range. The out-of-bag score (oob\_score) was chosen as the evaluation criterion, yielding the parameters of all of the final models.
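
The tuning step could be reproduced schematically with scikit-learn's GridSearchCV; the candidate values below are placeholders, not the ranges searched by the authors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 10))     # placeholder: the 10 selected features
y = rng.integers(0, 2, size=130)   # placeholder labels

param_grid = {                     # parameters named in the text; values assumed
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
    "min_impurity_decrease": [0.0, 0.01],
}
search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0),  # OOB score enabled
    param_grid, cv=5, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```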

After parameter adjustment, the models were used for five-fold cross-validation and were compared in order to obtain the optimal AUC for the different sequence combinations. Accordingly, the optimal machine learning models based on the extracted imaging features from different sequence combinations were built, including model 1 (DWI + ADC), model 2 (T2WI + CE-T1WI), model 3 (DWI + ADC + T2WI), model 4 (DWI + ADC + CE-T1WI), and model 5 (DWI + ADC + T2WI + CE-T1WI). The study workflow is briefly displayed in Figure 1.

**Figure 1.** Workflows: (1) MRI acquisition and segmentation; (2) quantitative feature extraction; (3) radiomic feature and model selection; (4) prediction models built based on the extracted imaging features from different sequence combinations.

#### **3. Results**

#### *3.1. Clinical Characteristics Analysis*

In the present study, 154 patients were included, comprising 43 females (29%) and 111 males (71%), with a median age of 47 years (range, 19–68). The most common histopathological subtype was undifferentiated nonkeratinizing carcinoma (80.6%). The relapsed or metastatic group and the non-relapsed or metastatic group differed significantly in N, M, and TNM stages (*p* < 0.05). The patient characteristics are presented in Table 1.

**Table 1.** Clinical characteristics of the patients with NPC in the relapsed or metastatic group and the non-relapsed or metastatic group.


#### *3.2. Machine Learning Model Selection*

Five-fold cross-validation was carried out using Logistic Regression, kNN, Naive Bayes, Random Forest, and XGB Classifier, and the results show that the AUC obtained using the RF method is the highest among the different sequence combinations. The results are shown in Figure 2. Therefore, the RF machine learning model was chosen to compare the predictive performances of the different sequence combination models.

**Figure 2.** Five existing mainstream algorithms (Logistic Regression, kNN, Naive Bayes, Random Forest, and XGB Classifier) were chosen for the training and validation, which showed that AUC values obtained using the RF method are the highest among all of the models of different sequence combinations: (**a**) DWI + ADC; (**b**) T2WI + CE-T1WI; (**c**) DWI + ADC + T2WI; (**d**) DWI + ADC + CE-T1WI; (**e**) DWI + ADC + T2WI + CE-T1WI.

#### *3.3. Prediction Performance of the Models*

Concerning the construction and results of different sequence-combination models, the N and M stages were added according to the dissimilarity tests of the clinical variables, and they were set as dummy variables. The top 100 features were screened by the importance of the RF method. Then, the top 10 features in terms of improving the model's predictive power were retained after the cyclical inclusion of each feature with a forward stepwise approach. The selected features and importances are shown in Figure 3. The selected features were used to construct the RF machine learning prediction model. In order to obtain a more robust outcome, we applied five-fold cross-validation, and the AUC of the validation set in the machine learning model was obtained based on different sequence combinations using the RF method.

**Figure 3.** The importance of selected features derived from different sequence combinations: (**a**) DWI + ADC; (**b**) T2WI + CE-T1WI; (**c**) DWI + ADC + T2WI; (**d**) DWI + ADC + CE-T1WI; (**e**) DWI + ADC + T2WI + CE-T1WI.

In order to obtain a more robust outcome, we applied five-fold cross-validation to train and validate the RF machine learning model. After adjusting the parameters, the average AUC of the validation set in the RF machine learning model was obtained based on the extracted imaging features from the different sequence combinations. The mean AUCs of the five-fold cross-validation sets of model 1 (DWI + ADC), model 2 (T2WI + CE-T1WI), model 3 (DWI + ADC + T2WI), model 4 (DWI + ADC + CE-T1WI), and model 5 (DWI + ADC + T2WI + CE-T1WI) were 0.80 (95% CI: 0.79–0.81), 0.72 (95% CI: 0.71–0.74), 0.66 (95% CI: 0.64–0.68), 0.74 (95% CI: 0.73–0.75), and 0.75 (95% CI: 0.74–0.76), respectively. The average AUC of each model in the validation set is shown in Figure 4. The performances of the radiomics models in the validation set are shown in Table 2.

**Table 2.** The performance metrics for five models in the validation set.


**Figure 4.** Average AUC values in the validation set of the RF machine learning model based on selected features of model 1 (**a**), model 2 (**b**), model 3 (**c**), model 4 (**d**), and model 5 (**e**).

Based on the results, the RF model based on the extracted features from the DWI and ADC images has higher prognostic prediction efficacy than the RF model based on T2WI and CE-T1WI images. Moreover, the RF model based on the extracted features from DWI, ADC, and CE-T1WI presents better predictive performance for prognosis than the RF model based on DWI, ADC, and T2WI. Finally, the results indicated that the RF model based on the extracted features from the multiple-sequence combination of DWI, ADC, T2WI, and CE-T1WI did not display optimal effects in the prediction of the recurrence and metastasis of nasopharyngeal carcinoma.

#### **4. Discussion**

Radiomics models based on MRI features in nasopharyngeal carcinoma (NPC) can predict the prognosis and therapeutic responses [28], but these models were constructed based on basic MR sequences (e.g., T2WI, T1WI, and CE-T1WI). Studies with a radiomics approach based on DWI images in nasopharyngeal carcinoma remain to be explored. Considering that the foregoing radiomics research focuses on tumor heterogeneity and the prognosis of NPC mainly based on T2WI and CE-T1WI [15,16,29–32], we attempted to compare and combine the radiomics model based on RESOLVE-DWI simultaneously with T2WI and CE-T1WI. This process aims to determine the optimal machine learning model for the prognostic prediction of NPC.

Extracted features from various MR sequence combinations were adopted in order to predict the recurrence and metastasis risks of NPC patients in the present study. The results show that the average cross-validated AUC of the RF model based on radiomics features extracted from the DWI and ADC sequences reached 0.80, whereas the AUC of the RF model based on the conventional MR sequences was 0.72. The AUC of model 2 (T2WI + CE-T1WI) of this study in the validation set closely resembled that of Kim et al.'s study [16], which reported an AUC of 0.71 for a radiomics model combining T2WI and CE-T1WI sequences for the prediction of progression-free survival in patients with NPC. At the same time, no data from previous studies were comparable to the results of the radiomics model based on the DWI sequence of the present study, on account of the rare usage of DWI in radiomics. Nevertheless, the radiomics features extracted from the DWI and ADC sequences had higher prediction efficacy for the recurrence and metastasis risks of patients. This is potentially because DWI provides more sub-voxel image information about tumor heterogeneity, reflecting the restricted Brownian motion and microarchitecture within tumors [17,33]. Moreover, the machine learning model based on features extracted from the DWI, ADC, and CE-T1WI sequences presented higher predictive performance than the model based on the DWI, ADC, and T2WI sequences. This is potentially because CE-T1WI sequences can reflect the blood supply and angiogenesis of tumors [34] and unmask their proliferation state better than T2WI, making CE-T1WI sequences more relevant to tumor heterogeneity. Finally, we combined the DWI, ADC, T2WI, and CE-T1WI sequences in NPC and extracted the relative features from this combination in order to establish RF machine learning models. The average cross-validated AUC of this model was 0.75 for the prediction of the prognosis of NPC, which is not higher than that of the RF model based on the DWI and ADC sequences. This finding can be attributed to the increase in confounding factors with the increase in sequences.

Notably, high-resolution DWI was applied to extract related features and build the machine learning model for the prediction of the recurrence and metastasis of nasopharyngeal carcinoma. DWI is a proven non-contrast imaging technology that has become a mature quantitative measurement approach for the identification of benign and malignant lesions in routine clinical work [19,35,36]. In malignant tumors, the diffusion of water molecules is often restricted by the high cell density, which manifests as high signal intensity on DWI and low values on ADC maps. DWI technology can provide quantitative as well as qualitative interpretations, thereby increasing the specificity of disease diagnosis [17]. Single-shot (SS) EPI-DWI, the technology extensively used to collect DWI images, is easily compromised in radiomics applications by magnetic susceptibility artifacts, chemical shift and geometric distortion, limited spatial resolution, and relatively thick sections, especially in head and neck tumors such as nasopharyngeal carcinoma, where skull-base artifacts are prominent [19]. With the improvement of readout-segmented imaging (RS-EPI) technologies, high-resolution DWI (HR-DWI) has been applied in clinical work. It remarkably alleviates the abovementioned problems by using the same diffusion preparation as SS EPI but dividing the k-space into several segments in the phase-encoding direction in order to decrease the echo time [20]. Therefore, readout-segmented imaging (RS-EPI) has obvious advantages over SS EPI-DWI for the diagnosis of tumors of the head and neck [19], and the machine learning model based on DWI collected by RS-EPI is more reliable and robust, providing a good foundation for promoting its clinical applications.

Additionally, the acquisition of HR-DWI does not require a contrast agent, making it safer than CE-T1WI in daily clinical work. In current practical applications, it offers the technological advantages of increased speed and decreased artifacts, supporting its extensive use in clinical practice. Based on the above analysis, the radiomics method based on RESOLVE-DWI has higher prediction efficacy than the conventional MR sequences regarding the recurrence and metastasis of NPC. The application of high-resolution DWI in radiomics might be complementary to—and might even replace—the currently used sequences (T2WI, T1WI, and CE-T1WI) in order to provide more high-specificity information and support for clinical decisions.

The present study has some limitations. First, it involved a relatively small number of cases and was carried out at a single hospital; a prospective study should therefore be carried out to support the conclusions. Moreover, small lesion details are difficult to delineate, which might influence feature extraction. Finally, the relationship between radiomics features and prognostic outcomes was not explored further in the present study. The relevant data were collected, and the next step of our research is to explore this relationship and perform a survival analysis based on the DWI radiomics model.

#### **5. Conclusions**

The results confirmed that the machine learning model based on features extracted from RESOLVE-DWI and the corresponding ADC maps could be used as a prognostic prediction tool. These features can help quantify the heterogeneity of tumors in patients with NPC and evaluate the risk of recurrence and metastasis, thereby quickly providing supporting evidence to aid sound clinical decision making in practice.

**Author Contributions:** Segmentation and Feature Extraction, Q.H., G.W. and J.W.; radiomics Feature and Model Selection, Q.H. and X.S.; prediction Model Building, X.S. and M.L.; imaging data collection, Q.H., G.W. and Y.W.; clinical data collection, Q.C. and X.C.; manuscript writing and revision, Q.H., G.W., F.Z. and Y.W.; study design, G.W., F.Z. and Y.W.; supervision, Y.W. and S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (8210072058), Medical Scientific Research Foundation of Guangdong Province (A2021449), Key Project of Guangdong Province (2018B030335001), the Medical Science and Technology Research Fund of Guangdong province (2018B030322006) and Investigator-Initiated Clinical Trial of the Fifth Affiliated Hospital of Sun Yat-sen University (YNZZ2020-04).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the Fifth Affiliated Hospital of Sun Yat-sen University (protocol code No. ZDWY(2021) Lunzi No. (39-1), approved on 15 July 2021).

**Informed Consent Statement:** Patient consent was waived due to the retrospective nature of this study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Prediction of Nodal Metastasis in Lung Cancer Using Deep Learning of Endobronchial Ultrasound Images**

**Yuki Ito 1, Takahiro Nakajima 2,\*, Terunaga Inage 1, Takeshi Otsuka 3, Yuki Sata 1, Kazuhisa Tanaka 1, Yuichi Sakairi 1, Hidemi Suzuki <sup>1</sup> and Ichiro Yoshino <sup>1</sup>**


**Simple Summary:** Endobronchial ultrasound-guided transbronchial aspiration is a minimally invasive and highly accurate modality for the diagnosis of lymph node metastasis and is useful for pre-treatment biomarker test sampling in patients with lung cancer. Endobronchial ultrasound image analysis is useful for predicting nodal metastasis; however, it can only be used as a supplemental method to tissue sampling. In recent years, deep learning-based computer-aided diagnosis using artificial intelligence technology has been introduced in research and clinical medicine. This study investigated the feasibility of computer-aided diagnosis for the prediction of nodal metastasis in lung cancer using endobronchial ultrasound images. The outcome of this study may help improve diagnostic efficiency and reduce invasiveness of the procedure.

**Abstract:** Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is a valid modality for nodal lung cancer staging. The sonographic features of EBUS help identify suspicious lymph nodes (LNs). To facilitate the use of this method, machine-learning-based computer-aided diagnosis (CAD) of medical imaging has been introduced in clinical practice. This study investigated the feasibility of CAD for the prediction of nodal metastasis in lung cancer using endobronchial ultrasound images. Image data of patients who underwent EBUS-TBNA were collected from video clips. Xception was used as a convolutional neural network to predict the nodal metastasis of lung cancer. The prediction accuracy of nodal metastasis through deep learning (DL) was evaluated using both the five-fold cross-validation and hold-out methods. Eighty percent of the collected images were used in five-fold cross-validation, and all the images were used for the hold-out method. Ninety-one patients (166 LNs) were enrolled in this study. A total of 5255 and 6444 images extracted from the video clips were analyzed using the five-fold cross-validation and hold-out methods, respectively. The prediction of LN metastasis by CAD using EBUS images showed high diagnostic accuracy with high specificity. CAD during EBUS-TBNA may help improve the diagnostic efficiency and reduce the invasiveness of the procedure.

**Keywords:** EBUS-TBNA; echo B-mode imaging; deep learning-based computer-aided diagnosis; nodal staging

#### **1. Introduction**

Endobronchial ultrasound-guided transbronchial aspiration (EBUS-TBNA) is a minimally invasive and highly accurate modality for the diagnosis of lymph node (LN) metastasis and is useful for pre-treatment biomarker test sampling in patients with lung cancer [1].


According to the current guidelines for lung cancer staging, EBUS-TBNA is recommended as the best first test for nodal staging prior to considering surgical procedures [2].

During EBUS-TBNA, multiple LNs are often encountered within the same nodal station. In this setting, selecting the most suspicious LN for sampling is important, considering the difficulty of sampling all LNs using EBUS-TBNA under conscious sedation. EBUS image analysis is thus useful for predicting nodal metastasis; however, it can only be used as a supplement to tissue sampling [3]. We have previously reported the utility of six distinctive ultrasound and Doppler features of EBUS images for predicting nodal metastasis [4,5]. However, categorization of image characteristics was not reliable because it was subjective and varied significantly between operators. Therefore, we sought an objective method to predict nodal metastasis. Elastography is a potential solution, since it can visualize the relative stiffness of targeted tissues within the region of interest, helps to predict LN metastases, and uses objective parameters such as the stiff area ratio [6,7]. However, elastography requires additional operations during the procedure, and its parameters do not reflect real-time values.

In recent years, deep learning (DL)-based computer-aided diagnosis (CAD) using artificial intelligence (AI) technology has been introduced in research and clinical medicine. CAD has been used for radiology, primarily in the areas of computed tomography (CT), positron emission tomography-CT (PET-CT), and ultrasound images, and for the diagnosis of several tumors, such as breast cancer and gastrointestinal tumors [8–11].

If real-time CAD-based prediction of nodal metastasis during EBUS-TBNA were made possible, the operator could easily identify the most suspicious node for diagnosis, thereby reducing the procedure time of EBUS-TBNA. A well-experienced EBUS operator can predict benign lymph nodes with approximately 90% accuracy through subjective categorization of EBUS ultrasound characteristics. AI-CAD technology might make such expert-level prediction of nodal diagnosis possible even for non-experts. The purpose of this study was to investigate the feasibility of CAD for the prediction of LN metastasis in lung cancer using endobronchial ultrasound images and DL technology.

#### **2. Materials and Methods**

#### *2.1. Participants*

Patients with lung cancer or those suspected of suffering from lung cancer who underwent EBUS-TBNA for the diagnosis of LN metastasis were enrolled in this study. We prospectively collected clinical information and images related to bronchoscopy since April 2017 (registry ID: UMIN000026942), and the ethical committee allowed prospective case accumulation with written consent (ethical committee approval ID: No. 2563, Chiba University Graduate School of Medicine). The EBUS-TBNA video clips from April 2017 to December 2020 were retrospectively reviewed, and the patient's clinical information was obtained from electronic medical records (ethical committee approval ID: No. 3538, Chiba University Graduate School of Medicine). This was a collaborative study between the Chiba University Graduate School of Medicine and Olympus Medical Systems Corp. (Tokyo, Japan). All patient identifiers were deleted, and the image data were sent to the Olympus Medical Systems Corp.'s laboratory and analyzed using DL technology (ethical committee approval ID: OLET-2019-008, Olympus Medical Systems Corp.). This study was conducted in accordance with the principles of the Declaration of Helsinki.

#### *2.2. EBUS-TBNA Procedure*

The patients underwent EBUS-TBNA under local anesthesia with moderate conscious sedation using midazolam and pethidine hydrochloride. An Olympus BF-UC290F bronchoscope with EU-ME1 and EU-ME2 PREMIER PLUS ultrasound processors was used to observe LNs. Systematic nodal observation starting from the N1, N2, and N3 stations using B-mode, Doppler mode, and elastography was performed first. The size of each LN was measured, and EBUS-TBNA was performed for LNs > 3 mm along the short axis on the EBUS image. TBNA was initiated at the N3, N2, and N1 stations, in that order, to avoid overstaging. For TBNA, a dedicated 22-gauge or 21-gauge needle (NA-201SX-4022, NA-201SX-4021, Olympus Medical Systems Corp., Tokyo, Japan) was used, and rapid on-site evaluation was performed during the procedure. All EBUS procedures were performed by skilled operators (T.N. and Y. Sakairi) or under their supervision.

#### *2.3. Confirmation Diagnosis of EBUS-TBNA*

Rapid on-site evaluation by Diff-Quik staining and conventional cytology by Papanicolaou staining were performed and diagnosed by a cytopathologist. The histological core was collected in CytoLyt solution and fixed in 10% neutral buffered formalin. The formalin-fixed paraffin-embedded specimens were stained with hematoxylin and eosin (H&E) and subjected to immunohistochemistry. Cytology and histology were evaluated by independent pathologists, who provided the pathological diagnosis [12]. The reference final diagnoses were as follows: (1) malignant cells proven by EBUS-TBNA, (2) histological diagnosis of surgically resected samples after EBUS-TBNA, or (3) clinical follow-up by radiology after 6 months.

#### *2.4. EBUS Image Extraction and Image Data Sets*

Ultrasound images were recorded as video clips in the MP4 format; divided into shorter clips featuring each LN using video editing software, XMedia Recode 3.4.3.0 (Sebastian Dörfler, Eschenbergen, Germany); and subsequently anonymized using the dedicated software VideoRectFill (Olympus Medical Systems Corp.). All patient information was manually masked on the software. An anonymized video clip was provided to Olympus Medical Systems Corp. with diagnostic information linked to each LN.

In this study, we retrospectively and prospectively collected cases and investigated the detection of metastasis in each LN. The evaluation methods are illustrated in Figure 1. We attempted both five-fold cross-validation and hold-out methods for evaluation. Because the video clips came from different ultrasound processors (EU-ME1 and EU-ME2 PREMIER PLUS) and had different image sizes, the images were allocated equally to the training, validation, and testing groups (Figure S1).



**Figure 1.** The concept of the deep learning algorithm.

#### *2.5. Adjustment of Images for DL*

Prior to image analysis, the videos were decomposed into time-series images, from which images of different scenes were extracted. The areas in which the B-mode image was drawn were cropped from the images, and the cropped images were resized to a common size. To increase the generalizability of the DL algorithm, data augmentation was applied only to the training images, increasing the number of training images. Scaling and horizontal flipping were used in the data augmentation process.
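As a concrete illustration, the two augmentations named above could be implemented as in the following minimal sketch using torchvision; the scaling range, flip probability, and target size are assumed values, not the study's actual settings.

```python
# Hypothetical training-time augmentation pipeline (scaling + horizontal flip).
from torchvision import transforms

train_augmentation = transforms.Compose([
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # random scaling (range assumed)
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flipping
    transforms.Resize((256, 256)),                         # common input size (assumed)
    transforms.ToTensor(),                                 # PIL image -> float tensor
])
```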

#### *2.6. DL Algorithm Design*

The Convolutional Neural Network (CNN) structure used in this study for LN metastasis detection is shown in Figure S2. The metastasis detection CNN comprises a feature extraction CNN and a detection CNN. The feature extraction CNN comprises multiple stages, each with multiple blocks and one downsampling layer; the final stage does not include a downsampling layer. We used the Xception block for each block [13]. Each downsampling layer is a convolution layer with a stride of two or more. The detection CNN comprises two convolution layers: one for classification and another for positioning.

Initially, the ultrasound image was input to the feature extraction CNN, and local features, such as edges and textures, were extracted from the input image in the first block. As the signal progressed through the network, these features were integrated and finally converted into features useful for detection.

Subsequently, the features useful for metastasis detection were input into the detection CNN, which outputs the probability, coordinates, and size of a bounding box for both metastasis and non-metastasis. The bounding box with the highest probability was selected from among all the metastatic and non-metastatic bounding boxes in the sequence. Finally, the metastasis or non-metastasis label and the coordinates and size of the bounding box were obtained as the detection result.
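A minimal PyTorch sketch of this two-part design is given below. The Xception-style separable convolutions, strided-convolution downsampling, and the two-convolution detection head follow the description above, but the stage count, channel widths, and input size are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise-separable convolution, the building block of Xception."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class XceptionBlock(nn.Module):
    """Two separable convolutions with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(), SeparableConv(ch, ch), nn.BatchNorm2d(ch),
            nn.ReLU(), SeparableConv(ch, ch), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class MetastasisDetector(nn.Module):
    def __init__(self, widths=(32, 64, 128), blocks_per_stage=2, n_classes=2):
        super().__init__()
        layers, in_ch = [], 1                      # grayscale ultrasound input
        for i, ch in enumerate(widths):
            # Downsampling layer: a convolution with stride >= 2; the final
            # stage keeps stride 1, as described in the text.
            stride = 2 if i < len(widths) - 1 else 1
            layers.append(nn.Conv2d(in_ch, ch, 3, stride=stride, padding=1))
            layers += [XceptionBlock(ch) for _ in range(blocks_per_stage)]
            in_ch = ch
        self.features = nn.Sequential(*layers)     # feature extraction CNN
        # Detection CNN: one convolution for classification, one for positioning.
        self.cls_head = nn.Conv2d(in_ch, n_classes, 1)  # metastasis / non-metastasis
        self.box_head = nn.Conv2d(in_ch, 4, 1)          # box x, y, width, height

    def forward(self, x):
        f = self.features(x)
        return self.cls_head(f), self.box_head(f)      # per-location predictions

logits, boxes = MetastasisDetector()(torch.randn(1, 1, 256, 256))
print(logits.shape, boxes.shape)  # (1, 2, 64, 64) and (1, 4, 64, 64)
```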

#### *2.7. Five-Fold Cross-Validation Method and the Hold-Out Method*

For the five-fold cross-validation method, 80% of all the images were used for training and validation. The images were divided into five sections: four sections were used for training, and the last section was used for validation. By changing the validation section, the training and validation were repeated five times. The prediction yield was calculated as the average of the results of each validation.

In the hold-out method, all images were used for either training or testing: the 80% of images used in the five-fold cross-validation method were used for training, the remaining 20% were used for testing, and the prediction yield was then calculated.

The images of different sizes from the two ultrasound scanners (EU-ME1 and EU-ME2 PREMIER PLUS) were allocated proportionately in each section to avoid selection bias.
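Under the description above, the split logic might look like the following scikit-learn sketch; grouping images by lymph node and stratifying by processor/image-size group are assumptions consistent with the text, and all data here are toy placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupKFold

rng = np.random.default_rng(0)
n_images = 200
node_ids = rng.integers(0, 40, n_images)   # lymph-node ID per image (toy data)
strata = rng.integers(0, 2, n_images)      # processor / image-size group per image

# Hold-out: 20% of images kept aside for testing, stratified by processor group.
idx = np.arange(n_images)
train_idx, test_idx = train_test_split(idx, test_size=0.20,
                                       stratify=strata, random_state=0)

# Five-fold CV inside the 80% portion; grouping keeps all images of one lymph
# node in the same fold (an assumption, consistent with node-level reporting).
gkf = GroupKFold(n_splits=5)
for fold, (tr, va) in enumerate(gkf.split(train_idx, groups=node_ids[train_idx])):
    print(f"fold {fold}: {len(tr)} training images, {len(va)} validation images")
```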

#### *2.8. Statistical Analysis*

The "Image" represents "per image" basis analysis and the "Lymph node" represents "per lymph node" basis analysis. The "per image" analysis was based on the accuracy of nodal metastasis prediction for each image. Due to limited number of still images, we used the video clips for analysis. However, in this case, multiple images with varying ultrasound features were included for each targeted lymph node, resulting in variation in the judgement of the AI-CAD system. Therefore, in addition to "per image" analysis, we included "per lymph node" analysis in which multiple images were evaluated for each lymph node. The "per lymph node" analysis included (1) calculation of the ratio between the number of images judged benign and malignant, (2) predicting as benign or malignant based on the ratio >50%, (3) analysis of the accuracy of nodal metastasis prediction for each lymph node.

Sensitivity, specificity, positive predictive value, negative predictive value, and diagnostic accuracy were calculated using standard definitions. Statistical analysis was performed using Fisher's exact test and the chi-square test for categorical outcomes, and Student's t-test for continuous variables. Data were analyzed using the JMP Pro 15 software (SAS Institute Inc., Cary, NC, USA). Statistical significance was set at *p* < 0.05.
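For reference, the named tests are available in SciPy, as in the sketch below; the 2×2 table and samples are invented, not study data (the authors used JMP Pro rather than Python).

```python
import numpy as np
from scipy import stats

table = np.array([[20, 5],
                  [10, 15]])                                 # invented 2x2 contingency table
odds, p_fisher = stats.fisher_exact(table)                   # Fisher's exact test
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)  # chi-square test

rng = np.random.default_rng(0)
a, b = rng.normal(10, 2, 30), rng.normal(11, 2, 30)          # toy continuous variables
t, p_t = stats.ttest_ind(a, b)                               # Student's t-test

print(f"Fisher p = {p_fisher:.3f}, chi-square p = {p_chi2:.3f}, t-test p = {p_t:.3f}")
```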

#### **3. Results**

Ninety-five cases with a total of 170 LNs were enrolled in the study. Two cases (two LNs) were excluded because of a history of malignant lymphoma. Cases of large-cell carcinoma and large-cell neuroendocrine carcinoma (one LN each) were also excluded because they could not be assigned to both the training and testing sets. Finally, 91 cases and 166 LNs were analyzed in this study (Figure 2). In this cohort, 64 LNs (38.5%) were diagnosed as metastatic and 102 LNs (61.5%) as non-metastatic by pathology. The characteristics of the enrolled patients and LNs are listed in Table 1.

**Figure 2.** Study cohort flow chart. One hundred sixty-six lymph nodes and 6444 images from 91 patients were enrolled in the final analysis.

Pathological diagnosis, including cytology and histology, was performed for all lymph nodes. The success rate of each diagnosis is shown in Table 2. For adenocarcinoma cases, molecular biomarker testing was performed in selected cases; for non-small cell lung cancer cases, PD-L1 (22C3) immunohistochemistry was evaluated in selected cases. The success, detection, and testing rates are shown in Table 2.


**Table 1.** Patients' and nodal characteristics.

**Table 2.** Detailed results of pathological diagnosis and biomarker testing in this study.


First, we evaluated the ability of AI-CAD to detect LN metastasis using endobronchial ultrasound images. Totals of 5255 and 6444 images extracted from the video clips were analyzed using the five-fold cross-validation and hold-out methods, respectively (Figure 1). Representative EBUS images judged by AI-CAD in this study are shown in Figure S3.

Using the five-fold cross-validation method, the LN-based diagnostic accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of the AI-CAD were measured to be 69.9% (95% CI, 32.4–75.2%), 37.3% (95% CI, 27.8–49.1%), 90.2% (95% CI, 82.9–92.3%), 70.4%, and 69.8%, respectively (Figure 3). Although the specificity was high, the sensitivity of this method was low.

**Figure 3.** Results of the AI-CAD accuracy analysis for lung cancer lymph node diagnosis using echo images with the five-fold cross-validation method. (**a**) Diagnostic yield on a per-image and per-lymph-node basis. (**b**) ROC curve.

Using the hold-out method, the LN-based diagnostic accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of the AI-CAD were measured to be 87.9% (95% CI, 75.4–94.1%), 76.9% (95% CI, 58.9–92.9%), 95.0% (95% CI, 79.3–100%), 90.9%, and 86.4%, respectively (Figure 4).

Regarding the diagnostic yield by lung cancer subtypes, the diagnostic accuracy rates were 90.5% for no malignancy, 76.9% for adenocarcinoma, 61.1% for squamous cell carcinoma, and 93.9% for small cell lung cancer (Figure 5).

**Figure 5.** The accuracy rates of the hold-out method by lung cancer subtype.

#### **4. Discussion**

The potential applications of AI technology are growing rapidly in the medical field and are expected to ease the demanding work of medical staff. The concept of AI, including machine learning and DL, has been growing in popularity since the evolution of graphics processing units. AI-CAD is one AI application that has been actively developed in radiology. Significant work has been done on combining radiomics and AI-CAD technology to support the diagnosis of benign and malignant tumors and the prediction of histology, stage, genetic mutations, treatment response, and recurrence using CT and PET-CT images [14–18]. AI-CAD is highly useful for analyzing huge amounts of extracted information, including information invisible to humans, and it produces objective indicators based on the judgment, knowledge, and experience of experts. During EBUS-TBNA, a highly skilled operator can select the most suspicious LN to sample based on a subjective categorization of ultrasound image characteristics. By applying AI-CAD technology in EBUS, even a trainee can easily select the target LN for sampling, with the dual advantages of a more efficient and less invasive procedure. In this study, we used a CNN algorithm with Xception blocks to predict nodal metastasis from ultrasound images of LNs. Using the hold-out method, AI-CAD exhibited a feasible diagnostic accuracy of 84.7%, on average, on a per-LN basis. In this study, the combination of Xception and the hold-out method resulted in the highest diagnostic yield.

The comparison between the five-fold cross-validation and hold-out methods demonstrated that the hold-out method exhibited a superior diagnostic yield in this study setting. We first evaluated the system using five-fold cross-validation and then used the hold-out method as the standard for developing the AI-CAD technology. The number of evaluated images was 20% larger for the hold-out method than for five-fold cross-validation; this increase helped cover the image variation more comprehensively and contributed to better AI-CAD accuracy. The images used in this study were obtained using two different ultrasound image processors (EU-ME1 and EU-ME2 PREMIER PLUS). In addition, a certain proportion of the collected images (approximately 10% of all images) were of different sizes owing to the different screen sizes of the various video clips. These variations might affect the diagnostic yield of both the five-fold cross-validation and hold-out methods. Images of different sizes had to be resized before analysis, which could produce adversarial examples (AEs). An AE is an event in which AI misrecognizes an image as completely different data owing to the addition of insignificant noise that is imperceptible to humans [19]. In this study, these problems were addressed by allocating images of different sizes in equal proportions for AI-CAD analysis.

The final diagnostic accuracy and specificity for the prediction of LN metastasis using AI-CAD in this study were 87.9% and 95.0%, respectively. Previous studies have reported comparable but lower values: Ozcelik et al. reported an accuracy of 82% and a specificity of 72% for the diagnosis of lung cancer LN metastasis in 345 LNs by a CNN using MATLAB [20], and Churchill et al. reported an accuracy of 72.8% and a specificity of 90.7% in 406 LNs by a CNN using NeuralSeg [21]. It is noteworthy that the specificity of CNN-based prediction of nodal metastasis was high, which might help avoid futile biopsies and reduce examination time as well as the risk of procedure-related complications.

Furthermore, we examined the diagnostic yield of lung cancer subtypes (Figure 5). The diagnostic yield was highest for small cell lung cancer, while the accuracy rate was relatively low for squamous cell carcinoma. Squamous cell carcinoma is often accompanied by signs of coagulation necrosis at the center of the LN, which might affect diagnostic accuracy.

In this study, the prediction rate for squamous cell carcinoma was relatively lower than that for other histologies. One possible reason is that squamous cell carcinoma often shows varied histological characteristics, such as necrosis and fibrosis, which are reflected on EBUS ultrasound images as necrosis signs and heterogeneous echograms. These varied ultrasound features might hamper learning and validation by AI-CAD, resulting in a lower prediction rate. Although better AI-CAD analysis would require more squamous cell carcinoma cases to cover their image variation comprehensively, the number of such cases in this study was relatively low. Increasing the number of squamous cell carcinoma cases could improve the diagnostic yield in the future.

This study has several limitations. First, the study population was limited, and we used video clips to overcome the small sample size; some cases underwent multiple LN assessments, and multiple LN images obtained from a single case might share similar image characteristics. Second, we used only B-mode images, although several reports have demonstrated the utility of other imaging modalities such as Doppler-mode imaging and elastography [5,6]. Finally, Xception was used for the CNN in this study, although there is currently no consensus as to which algorithm should be used to analyze echo images. To develop the optimal AI-CAD method for EBUS imaging, a larger prospective cohort study is required. In addition, AI-CAD diagnosis using other imaging modalities such as Doppler mode and elastography should be examined to improve the diagnostic yield of AI-CAD for EBUS imaging.

In this study cohort, the prevalence of nodal metastasis was 38.5%, which is relatively low compared with previous reports, because most of the enrolled patients were referred to the surgical department as resectable lung cancer patients. In a real clinical setting, AI-CAD technology will be useful when the operator cannot decide which LN to sample during EBUS-TBNA; image analysis support is not needed when the target lymph node is obviously enlarged. Thus, this study demonstrated that AI-CAD can support nodal staging in surgically treatable patients.

#### **5. Conclusions**

In this study, we found that AI-CAD (a combination of Xception and the hold-out method) for the prediction of LN metastasis using endobronchial ultrasound images is feasible and exhibits high diagnostic accuracy and specificity. AI-CAD for EBUS may reduce futile biopsies of LNs, shorten examination time, and make EBUS-TBNA a less invasive procedure, regardless of operator experience.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14143334/s1.

**Author Contributions:** Conceptualization, Y.I., T.N., T.O. and I.Y.; methodology, Y.I., T.N., T.I., Y.S. (Yuki Sata) and T.O.; software, T.O.; validation, Y.I., T.N., T.I., T.O., K.T., Y.S. (Yuichi Sakairi) and H.S.; formal analysis, Y.I.; investigation, Y.I., T.N. and T.O.; resources, Y.I., T.N. and T.O.; data curation, Y.I., T.N. and T.O.; writing—original draft preparation, Y.I.; writing—review and editing, T.N., T.O. and Y.I.; visualization, Y.I.; supervision, T.N. and I.Y.; project administration, Y.I. and T.N.; funding acquisition, T.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is a contract-based collaboration between the Chiba University Graduate School of Medicine and Olympus Medical Systems Corp. (Tokyo, Japan). Funding support was not provided by Olympus Medical Systems Corp., nor did Olympus Medical Systems Corp. influence the study design or interpretation of the results. This work was supported by JSPS KAKENHI, grant number 21K08880 (PI: Takahiro Nakajima).

**Institutional Review Board Statement:** We prospectively collected clinical information and images related to bronchoscopy since April 2017 (registry ID: UMIN000026942), and the ethical committee allowed prospective case accumulation with written consent (ethical committee approval ID: No. 2563, Chiba University Graduate School of Medicine). The EBUS-TBNA video clips from April 2017 to December 2020 were retrospectively reviewed, and the patient's clinical information was obtained from electronic medical records (ethical committee approval ID: No. 3538, Chiba University Graduate School of Medicine). This was a collaborative study between the Chiba University Graduate School of Medicine and Olympus Medical Systems Corp. (Tokyo, Japan). All patient identifiers were deleted, and the image data were sent to the Olympus Medical Systems Corp.'s laboratory and analyzed using DL technology (ethical committee approval ID: OLET-2019-008, Olympus Medical Systems Corp.). This study was conducted in accordance with the principles of the Declaration of Helsinki.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patients to publish this paper.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We would like to thank Junichi Ichikawa, Hironaka Miyaki, Fumiyuki Shiratani, and Jun Ando (Olympus Medical Systems Corporation) for their support with deep learning analysis. We thank the Department of Diagnostic Pathology, Chiba University Hospital, for their support in pathological diagnosis.

**Conflicts of Interest:** This is a collaborative research project between the Chiba University Graduate School of Medicine and Olympus Medical Systems Corp. (Tokyo, Japan) based on the contract. T.N. received honoraria and lecture fees from Olympus Medical Systems Corp. and AstraZeneca for CME activities related to EBUS-TBNA. The authors other than T.N. have no conflicts of interest related to this study.

#### **References**


### *Article* **Novel Harmonization Method for Multi-Centric Radiomic Studies in Non-Small Cell Lung Cancer**

**Marco Bertolini 1, Valeria Trojani 1,\*, Andrea Botti 1, Noemi Cucurachi 1, Marco Galaverni 2, Salvatore Cozzi 3, Paolo Borghetti 4, Salvatore La Mattina 4, Edoardo Pastorello 4, Michele Avanzo 5, Alberto Revelant 6, Matteo Sepulcri 7, Chiara Paronetto 7, Stefano Ursino 8, Giulia Malfatti 8, Niccolò Giaj-Levra 9, Lorenzo Falcinelli 10, Cinzia Iotti 3, Mauro Iori 1 and Patrizia Ciammella 3**


**Abstract:** The purpose of this multi-centric work was to investigate the relationship between radiomic features extracted from pre-treatment computed tomography (CT) and positron emission tomography (PET) imaging and clinical outcomes for stereotactic body radiation therapy (SBRT) in early-stage non-small cell lung cancer (NSCLC). One hundred and seventeen patients who received SBRT for early-stage NSCLC were retrospectively identified from seven Italian centers. The tumor was identified on pre-treatment free-breathing CT and PET images, from which we extracted 3004 quantitative radiomic features. The primary outcome was 24-month progression-free survival (PFS) based on cancer recurrence (local/non-local) following SBRT. A harmonization technique was proposed for CT features considering lesion and contralateral healthy lung tissues, using the LASSO algorithm as a feature selector. Models with harmonized CT features (B models) demonstrated better performance than those using only original CT features (C models). A linear support vector machine (SVM) with harmonized CT and PET features (A1 model) showed an area under the curve (AUC) of 0.77 (0.63–0.85) for predicting the primary outcome in an external validation cohort. The addition of clinical features did not enhance model performance. This study provides the basis for validating our novel CT data harmonization strategy, involving delta radiomics. The harmonized radiomic models demonstrated the capability to properly predict patient prognosis.

**Keywords:** imaging biomarkers and radiomics; quantitative imaging/analysis; computed tomography (ct); multi-modality ct-positron emission tomography (pet); machine learning; non-small-cell lung cancer; stereotactic body radiation therapy (sbrt)

**Citation:** Bertolini, M.; Trojani, V.; Botti, A.; Cucurachi, N.; Galaverni, M.; Cozzi, S.; Borghetti, P.; La Mattina, S.; Pastorello, E.; Avanzo, M.; et al. Novel Harmonization Method for Multi-Centric Radiomic Studies in Non-Small Cell Lung Cancer. *Curr. Oncol.* **2022**, *29*, 5179–5194. https:// doi.org/10.3390/curroncol29080410

Received: 2 May 2022 Accepted: 14 July 2022 Published: 22 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Non-small-cell lung cancer (NSCLC) is, overall, the second-most-common cancer and a leading cause of cancer-related death worldwide, despite recent therapeutic advances [1]. Stage I disease represents approximately 25% of patients receiving diagnoses of NSCLC and accounts for the most curable cohort of the population [2]. Surgery is the gold standard for these patients: lobectomy with hilar and mediastinal lymph node dissection is the preferred approach, given the Lung Cancer Study Group (LCSG) trial results [3]. Sublobar resection, in contrast, has shown inferior local control and a trend toward decreased survival, although its evaluation in selected patients is currently underway. The historical standard therapy for unresectable early-stage NSCLC was conventionally fractionated radiation therapy (RT) (e.g., 2 Gy per fraction, for a total dose of 54–60 Gy). However, the reported long-term local control (LC; 30–70%) and overall survival (OS; 15–30%) rates with this approach are suboptimal [4–6].

Advances in imaging and radiation treatment planning and delivery (e.g., with image guidance and motion management) made the delivery of "ablative doses" of radiation to small targets possible with better results in terms of local control [7–11].

Stereotactic body radiation therapy (SBRT) has proven to be the first therapeutic option for inoperable stage I NSCLC patients or those who refuse surgical treatment, with similar rates of local tumor control and overall clinical outcomes [12,13]. Recently, a meta-analysis by Li et al. reported significantly superior local control and 3-year and 5-year OS (54.73% and 29.30% vs. 39.5% and 27.47%) in the SBRT group compared with conventionally fractionated RT [14].

SBRT was reported to have a local control rate in excess of 85% at 3 years [14,15]. Despite consistent clinical outcomes, it is well known that dose fractionation heterogeneity and technical expertise may influence the outcome with SBRT [16–18]. A recent study reported that the factors affecting outcomes after SBRT for early-stage NSCLC are Biological Effective Dose (BED) and tumor size [19].

Radiomics is a recent technique introduced in medicine to describe the characteristics of medical images quantitatively. Radiomics belongs to the family of artificial intelligence (AI) applications, but it is based on the calculation of features using well-defined mathematical formulas applied directly to the image pixel values (or to a filtered version of the original images). The mathematical definitions of radiomic features are based on the distribution of, and relationships between, pixels and voxels in an image's region of interest. The concept behind this method lies in the fact that the human eye cannot appreciate all the characteristics of a medical image. Haralick et al. [20] described how textural features, highlighting the behavior of gray-level dependencies, can identify different areas in an image; textural information was later proposed for application in medical imaging [21,22]. Improvements in hardware calculation power made it possible to compute a large number of medical imaging biomarkers in an acceptable span of time; such indices should help the physician in treatment decisions, allowing a personalized care pathway for each patient. However, these biomarkers are not yet ready to be used in oncology without robust validation or a demonstration of their reliability [23]. Among them, radiomic indices and feature signatures are increasingly present in the modern scientific literature [24,25]. The main challenge to date is understanding how to overcome the limits of this approach [26,27].

To date, several studies in the literature have investigated the ability of radiomic features to differentiate tumor from healthy tissue, both for computed tomography (CT) and positron emission tomography (PET) datasets; for example, Chu et al. [28] used feature values in a random forest classifier for diagnostic purposes. In another study [29], healthy-tissue features were used as additional information for an automatic segmentation algorithm. More recently, features extracted from CT images were combined with BED values to predict tumor response to SBRT [30].

Despite the great work done to date, to our knowledge, there is still no characterization of the ability of radiomic features to provide information specific to healthy tissue, as opposed to diseased tissue, when machine learning models for prognosis are involved. One of the main challenges in the field of radiomics, which makes its clinical application difficult, is the harmonization of the features to be analyzed.

Our present multicentric work aims to propose a novel concept for harmonizing the CT radiomic signal using a combination derived from both the tumor and the healthy contralateral tissue, to overcome the variability typical of each patient under different conditions (i.e., manufacturer/technical characteristics, acquisition and reconstruction protocol, and differing anatomy).

#### **2. Materials and Methods**

#### *2.1. Study Design*

This study was a retrospective multicentric work. It was approved by the Area Vasta Emilia Nord (AVEN) Ethics Committee (ID: 817/2018/OSS\*/IRCCSRE) and by the ethics committees of all the participating institutions, and it was performed in accordance with the principles of Good Clinical Practice (GCP), the ICH GCP guidelines, and the ethical principles contained in the Declaration of Helsinki and its subsequent updates. Each patient gave informed consent to join the study.

#### 2.1.1. Patient Cohort

Patients who underwent SBRT for a histologically proven diagnosis of primary early-stage NSCLC between January 2010 and December 2019 were retrospectively collected. A multicenter research project named "TEXture Analysis of PET/CT in lung cancer patients treated with Stereotactic body radiation therapy (TEXAS)" was designed to involve seven Italian centers.

Inclusion criteria were: (1) histologically proven diagnosis of NSCLC; (2) early-stage T1–T3N0M0 (TNM 7th edition); (3) patients who underwent SBRT, with treatment biological effective dose BED10 ≥ 100 Gy; and (4) age > 18 years.

Exclusion criteria were: (1) lung tumor greater than 7 cm; (2) histologically proven diagnosis of small cell lung cancer or metastasis; (3) previous thoracic irradiation; (4) presence of bone, lymph node, or visceral metastatic lesions; (5) patients with secondary pulmonary nodules from non-NSCLC or NSCLC; (6) past non-NSCLC tumors with evidence of active disease at the time of SBRT, or synchronous non-NSCLC tumors (arising within six months of the SBRT diagnosis of NSCLC), excepting in both cases non-melanomatous skin tumors.

The patient cohort was divided into training (76 patients from three centers) and external validation (41 patients from the other four centers) datasets. This strategy for the distribution of centers among datasets was made to balance the two groups according to the patients' outcomes as described in the following sections. The external validation step was a fundamental part of the study in order to confirm the performances obtained in the training phase.

#### 2.1.2. SBRT Details

Conventional computed tomography (CT) simulation scans were obtained, and the radiation oncologist contoured the gross tumor volume (GTV) on the CT as part of the therapeutic pathway. A 5–10 mm isotropic margin was added to the GTV to generate the planning target volume (PTV). Intensity-modulated radiation therapy (IMRT) was delivered to all patients, with dose normalization ensuring that at least 95% of the PTV received 100% of the prescribed dose with a homogeneous distribution. For all patients, the ipsilateral and contralateral lung, heart, chest wall, esophagus, spinal cord, and bronchial tree were contoured as organs at risk (OARs).

#### *2.2. Image Acquisition*

All patients included in the study had PET/CT images, previously acquired as part of their care pathways, and a pre-treatment CT used for SBRT planning. The planning CT acquisition protocols and scanning devices differed among institutions, as reported in Table 1. PET image sets, corrected for attenuation, were acquired no more than three months before the start of treatment. Patients fasted at least 6 h before injection of the 18F-FDG tracer, and the serum glucose level measured at injection time was below 160 mg/dL in all patients. PET examinations were performed 60 min after intravenous administration of the radiotracer using the institution-specific protocols shown in Table 1.

**Table 1.** Protocol acquisition parameters for simulation CT and PET examinations stratified for centers. Whenever two scanners were used, a "|" indicated the different configurations.


#### *2.3. Image Segmentation*

Computed tomography and PET image sets were exported in DICOM format into a dedicated research computer for radiomics analysis. For the present study, gross tumor volume contouring was separately performed on the CT (manually, referred to as GTVCT) and PET (automatically, hereinafter named GTVPET) images of the pre-treatment PET/CT studies.

Two radiation oncologists with experience in lung cancer contoured each lesion on every sequential slice of the planning CT using standardized window settings for parenchyma (W = 1600 and L = −600), according to EORTC guidelines [31], for all patients. For GTVPET delineation, the radiation oncologists placed a region of interest (ROI) on the area of tumor FDG uptake on the PET images, and an automatic contour (the region encompassed by a given fixed percent intensity level relative to the maximum registered tumor activity, 40% of SUVmax) was generated. We used this approach because a previous study showed that GTVPET delineation with this fixed threshold correlated better with the gross tumor (based on pathologic examination) than delineation based on the manually contoured GTVCT [32].

In order to perform radiomic feature harmonization, we used an ROI from healthy tissue, obtained by copying the GTVCT into a healthy lung region, i.e., the contralateral lung at the same level as the GTV (named Contra\_Lung). The initial Contra\_Lung volume was also shifted by 0.3 and 0.6 cm in six directions, for a total of 12 shifts, while avoiding the inclusion of tissues surrounding the healthy lung. These shifts simulated the uncertainty in the positioning of the healthy ROI (for future reproducibility of the harmonization method) and were chosen in accordance with the PET image resolution, since PET imaging can be used to localize the GTV before treatment.
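The twelve shift vectors can be generated mechanically; the sketch below assumes the six directions are the axis-aligned ones (±x, ±y, ±z) at the two stated magnitudes.

```python
import numpy as np

directions = np.array([[1, 0, 0], [-1, 0, 0],
                       [0, 1, 0], [0, -1, 0],
                       [0, 0, 1], [0, 0, -1]])   # six axis-aligned directions (assumed)
magnitudes_cm = [0.3, 0.6]

# 6 directions x 2 magnitudes = 12 displacement vectors (in cm) applied to the
# contralateral ROI before re-extracting features.
shift_vectors = np.array([m * d for m in magnitudes_cm for d in directions])
print(shift_vectors.shape)  # (12, 3)
```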

An example picture showing the location of the ROIs mentioned above is shown in Figure 1.

**Figure 1.** Visualization of the CT ROIs in a patient. The contralateral ROI was shifted in 12 different positions (shown in red).

#### *2.4. Outcome*

In this study, PFS was the primary endpoint and was converted into a binary outcome: 1 for patients alive and without disease progression at 24 months, 0 otherwise. PFS was defined as the time from the start of SBRT to documented relapse or death. The 2-year threshold was chosen because it properly describes treatment effectiveness: a preliminary analysis of the Kaplan–Meier PFS curves after SBRT for our cohort showed that the majority of progressions occurred between 2 and 3 years.

#### *2.5. Radiomics Analysis*

Our analysis followed the steps defined for our radiomic study (Figure 2), beginning with image preprocessing. The first phase consisted of spatial resampling to an isotropic voxel size to obtain reproducible and rotationally invariant features. Image range re-segmentation then updated the ROI voxels according to a chosen intensity range, removing all voxels whose intensity values fell outside it. Finally, the images were intensity-discretized, grouping the original intensity values (256) into specific ranges (bins) to reduce image noise and computational burden. The intensity discretization process fixed the width of the re-segmentation interval and the bin width, defining a new bin for each intensity interval; selecting the bin width allowed direct control of the absolute range represented in each bin. The image preprocessing of intensity and spatial discretization is described in Supplementary Material Table S1. Intensity discretization parameters were chosen according to the guidelines proposed by Orlhac et al. [33,34].

**Figure 2.** Radiomic pipeline description of the implemented steps in our evaluation process.

#### 2.5.1. Feature Extraction

After the image preprocessing steps, feature calculation and extraction were performed. Features (intensity-based, shape-based, and second-order) were extracted from original images and filtered images (using wavelets, Laplacian of Gaussian (LoG), and gamma modifier filters) [35]. Radiomic features were calculated using in-house software employing the widely used pyRadiomics library to apply predetermined filters to the original images and compute features from the filtered results. The list of extracted radiomic features can be found at https://pyradiomics.readthedocs.io/en/latest/features.html (accessed on 15 July 2022).
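A hedged pyRadiomics sketch of this pipeline follows: isotropic resampling, intensity re-segmentation, fixed-bin-width discretization, and wavelet/LoG filtered images. The parameter values and file names are illustrative assumptions; the study's actual settings are listed in its Supplementary Table S1.

```python
from radiomics import featureextractor

settings = {
    'resampledPixelSpacing': [1.0, 1.0, 1.0],  # isotropic spatial resampling (mm, assumed)
    'resegmentRange': [-1000, 400],            # intensity re-segmentation range (HU, assumed)
    'binWidth': 25,                            # fixed bin width for discretization (assumed)
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.enableAllFeatures()                                             # intensity, shape, texture
extractor.enableImageTypeByName('Wavelet')                                # wavelet-filtered images
extractor.enableImageTypeByName('LoG', customArgs={'sigma': [1.0, 3.0]})  # Laplacian of Gaussian

# 'patient_ct.nrrd' and 'gtv_mask.nrrd' are hypothetical file names.
features = extractor.execute('patient_ct.nrrd', 'gtv_mask.nrrd')
print(len(features), "features/diagnostics returned")
```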

#### 2.5.2. Harmonization Process

The harmonization process consisted of calculating, for each of our two image modalities, features for 14 different ROIs: one coinciding with the GTVCT and the other 13 with the duplicated GTVCT positioned in the healthy region and its shifts, as described in Section 2.3. This allowed us to account for operator variability in the positioning of the healthy ROI. The general idea for this harmonization formula was inspired by another work [36] and is shown in Equation (1):

$$f_{HARM}(i) = \frac{f_{GTV}(i) - f_{HEALTHY}(i)}{\sigma(i)} \tag{1}$$

where *fHARM*(*i*) is the *i*th harmonized feature, *fGTV*(*i*) is the *i*th feature value calculated on the lesion, *fHEALTHY*(*i*) is the median *i*th feature value calculated on the 13 healthy tissue samples for each patient and modality, and *σ*(*i*) is the difference between the 75th and 25th percentiles of the *i*th feature distribution. Shape features were not harmonized. We applied this harmonization only to CT data because of its intrinsic dependence on the acquisition protocol parameters. Furthermore, CT is used for morphologic and anatomical characterization, and its pixel values are related to a physical characteristic of the tissue (the linear attenuation coefficient). PET, in contrast, being a functional imaging modality, is less sensitive to small signal changes in spatial coordinates; in particular, for the healthy lung, PET pixel values carry no useful physical information in a region where no signal from radiotracer absorption is registered.
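Equation (1) translates directly into code. The minimal sketch below takes the median of the 13 healthy-ROI values and an interquartile range for *σ*(*i*); since the text does not pin down the population over which that distribution is taken, it is left as an explicit argument.

```python
import numpy as np

def harmonize(f_gtv, f_healthy_samples, feature_distribution):
    """Equation (1): f_gtv is the lesion value of feature i; f_healthy_samples
    are the 13 contralateral-ROI values; feature_distribution supplies the
    values over which the interquartile range sigma(i) is computed (population
    assumed, as the text does not specify it)."""
    f_healthy = np.median(f_healthy_samples)                 # median of the 13 healthy ROIs
    q75, q25 = np.percentile(feature_distribution, [75, 25])
    return (f_gtv - f_healthy) / (q75 - q25)                 # harmonized feature value

# Toy usage with invented numbers:
print(harmonize(4.2, [1.0, 1.1, 0.9] * 4 + [1.0], np.linspace(0, 5, 100)))
```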

#### 2.5.3. Feature Selection

LASSO feature selection was applied, in which the following function is minimized (Equation (2))

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{2}$$

where $y_i$ is the observed value, $\beta_j$ are the LASSO coefficients, and $\lambda \sum_{j=1}^{p} |\beta_j|$ is the shrinkage penalty [37].

The parameter λ was chosen using 10-fold cross-validation (CV) by computing its error. The LASSO penalty shrinks to zero the weight coefficients (*βj*) of irrelevant features that are not predictive of the chosen outcome. In addition, LASSO handles sets of collinear features by increasing the weight of one of them while setting the other weights to zero. Because the considered outcome was binary, we used a binomial function for the LASSO regression. Table S2 reports the shrinkage penalties for our trained models.
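A scikit-learn analogue of this selection step is sketched below, using an L1-penalized logistic regression (a binomial LASSO) with the regularization grid searched by 10-fold CV; the toy data dimensions are assumptions, and the original analysis may have used a different implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(76, 300))   # toy: 76 training patients x 300 candidate features
y = rng.integers(0, 2, 76)       # binary 24-month PFS outcome

# Cs is a grid of inverse regularization strengths (C = 1/lambda); CV picks
# the value minimizing the 10-fold error, mirroring the lambda search.
lasso = LogisticRegressionCV(penalty='l1', solver='liblinear',
                             cv=10, Cs=20, scoring='roc_auc').fit(X, y)
selected = np.flatnonzero(lasso.coef_.ravel())   # features with non-zero weights
print(f"{selected.size} features kept out of {X.shape[1]}")
```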

#### 2.5.4. Model Building

The original and harmonized features were used to develop a supervised machine learning binary classifier. A linear support vector machine (SVM, Model 1) [38] and an Ensemble Subspace Discriminant (ESD, Model 2) [39] were trained by optimizing their performance in 10-fold cross-validation in the training dataset.

Linear SVM classifiers provide low generalization error, even with small training datasets. ESD classifiers combine discriminant learners trained on explicit low-dimensional subspaces of the feature set.

The two described model types were applied to five different combinations of input features: (A) harmonized CT + PET, (B) harmonized CT, (C) original CT, (D) only original PET, and (E) harmonized CT + PET + selected clinical variables in order to assess the effect of harmonization on the performance of the predictive models. The interested reader can find more information in Text S1 in Supplementary Materials.

The clinical variables in method (E) were chosen among the available ones by using Kaplan–Meier survival curves as described in Section 2.5.5.
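A hedged scikit-learn sketch of the two model types follows; the random-subspace bagging of linear discriminants approximates MATLAB's Ensemble Subspace Discriminant, and the toy feature matrix stands in for any of the A–E input combinations.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(76, 12))   # toy: 76 training patients x 12 selected features
y = rng.integers(0, 2, 76)      # binary 24-month PFS outcome

model_1 = SVC(kernel='linear')                               # Model 1: linear SVM
model_2 = BaggingClassifier(LinearDiscriminantAnalysis(),    # Model 2: subspace ensemble
                            n_estimators=30, bootstrap=False,
                            max_features=0.5)                # random 50% feature subspaces

for name, model in (('linear SVM', model_1), ('ESD analogue', model_2)):
    auc = cross_val_score(model, X, y, cv=10, scoring='roc_auc')
    print(f"{name}: mean 10-fold CV AUC = {auc.mean():.2f}")
```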

#### 2.5.5. Statistical Analysis

For each model, a 95% confidence interval (CI) of the AUC was calculated for the training and external validation sets. Furthermore, accuracy (95% CIs are reported), precision, and recall were calculated.
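One common way to obtain such intervals is a nonparametric bootstrap over the validation cases, as sketched below; the paper reports bootstrapped 95% CIs, though its exact resampling procedure is not specified here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for the AUC over resampled cases."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Toy usage: 41 hypothetical validation patients with random labels and scores.
rng = np.random.default_rng(1)
y_val, scores = rng.integers(0, 2, 41), rng.random(41)
print(bootstrap_auc_ci(y_val, scores))
```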

Subsequently, Kaplan–Meier survival curves based on PFS were computed to select the clinical features. A clinical feature with a log-rank *p*-value below 0.05 was considered significant and included in model E. MATLAB R2021b (MathWorks, Natick, MA, USA) and R (Vienna, Austria), available at https://www.R-project.org (accessed on 15 July 2022), were used to perform the statistical analysis.

The *p*-values for statistical differences among the AUC values of the models were calculated using the two-sided DeLong test.

#### **3. Results**

#### *3.1. Clinical Results*

One-hundred and seventeen early-stage NSCLC patients met the inclusion criteria. The baseline characteristics of the patients are summarized in Table 2. The median age was 78 years and there were more male (72.6%) than female patients. With a median follow-up of 29.8 months, the median PFS was 24.2 months, and 2-year PFS percentage was 51.2%. Median OS and 2-year OS percentage were 28.5 months and 64%, respectively. The clinical characteristics, including age, gender, Charlson comorbidity index (CCI), diffusing capacity of carbon monoxide (DLCO), tumor size, Eastern Cooperative Oncology Group (ECOG) performance status, and biological equivalent dose to PTV, showed no significant differences between the training and external validation cohorts (Table 2).

**Table 2.** Statistical analysis of clinical variables. Abbreviations: PS: performance status according to the ECOG scale; BPCO: chronic obstructive pulmonary disease; ADK: adenocarcinoma; SCC: squamous cell carcinoma; Fr: fraction; RT: radiotherapy; VMAT: volumetric arc therapy; IMRT: intensity-modulated radiotherapy; TOMO: Tomotherapy; PTV: planning target volume. *p*-values in bold indicate statistical significance.




No clinical or treatment-related features were shown to be significantly related to PFS in the univariate analysis of the whole population, except for gender (*p* = 0.04 in favor of female) and lung site (right vs. left in favor of the right one, *p* = 0.006).

#### *3.2. PFS Models*

The PFS predictive performance of the linear SVM and ESD models using radiomic and clinical features is reported in Table 3. In Figure 3, all models are compared graphically along with their confidence intervals. Models using harmonized features and PET (A,E) achieved AUCs greater than 0.70 in both training and validation. The performance of models using CT-only harmonized features (B models) was not confirmed on the validation dataset (training AUC > 0.75, validation AUC < 0.60), while adding PET features led to better stability between the training and validation sets. Only C-type models (original CT-only features) showed a low mean AUC (<0.62) in both training and validation.

**Figure 3.** Performances (AUC) of the studied models. The boxplot shows the minimum, maximum, and average values of the bootstrapped 95% CIs.


**Table 3.** Models' results in terms of AUC, accuracy, precision, and recall.

\* AUCs in square brackets are their bootstrapped 95% CIs. \*\* Precision and recall are presented for class 1. \*\*\* *p*-values are calculated with respect to conditions C1 and C2 for the linear SVM and ESD models, respectively. Values in bold indicate statistical significance.

It is worth noting that both the A1 and A2 models significantly outperformed the C1 and C2 models in both the training and external validation datasets (*p* = 0.0001, *p* = 0.01 and *p* = 0.02, *p* = 0.046, respectively, for the linear SVM and subspace discriminant models), and likewise for the E and C models (E1: *p* < 0.0001 and *p* = 0.02; E2: *p* = 0.01 and *p* = 0.02, respectively, for the training and external validation datasets). C models outperformed B models, but only in the training dataset (*p* < 0.0001 and *p* = 0.01, respectively, for the linear SVM and subspace discriminant models), and the D and C models had the same performance metrics. In summary, models using clinical information (E models) do not add a significant improvement over A models. This effect is also appreciable in Figure 4, where all the *p*-values among the models are represented.

**Figure 4.** *p*-values calculated using the two-sided DeLong test. Numbers in bold indicate statistical significance.

#### **4. Discussion**

In this work, a multi-centric cohort of early-stage NSCLC patients treated with SBRT was used to build and validate models predicting PFS beyond 24 months using radiomic features from CT and PET exams together with clinical information. Several works in the literature [40] describe acquisition-protocol variability in multi-centric studies, which can affect the performance of radiomic models: since radiomics computes features from the pixel values of the images, differences in acquisition protocol can lead to biased results. The rationale behind our harmonization method is that retrospective multicenter radiomic studies are challenging but necessary, as gathering data from several centers for a centralized analysis is complex for legal, ethical, administrative, and technical reasons. Most of the time, the centers involved do not follow standardized acquisition and reconstruction protocols; the collected data therefore suffer from intra- and inter-center variability, making radiomic features sensitive to multicenter effects. Our novel harmonization technique aims to reduce the bias caused by the absence of standardized protocols. Generally, feature analysis is performed by calculating features inside an ROI that coincides with the lesion target. Our study attempts to reduce this effect by using a healthy region of the patient as the baseline from which to harmonize the radiomic data computed from the lesion. Our results show that harmonization improves model performance when used on CT image sets. Conversely, as expected, the harmonization method is not easily applicable to PET-only images owing to the functional aim of this imaging modality: several healthy tissues (i.e., lungs) are not 18F-FDG-avid, and a harmonization based on healthy radiotracer accumulation has, to our knowledge, yet to be studied. Our work investigated and evaluated the feasibility of this technique, which could be employed and analyzed further in future studies.

When using original feature values, PET features were preferred over CT features during feature selection, resulting in a PET-only model. Furthermore, in the A models, two CT features were included in the final prediction score (Log\_Sigma30mm\_GLDM\_Small-Dependence-High-Gray-Level-Emphasis (SDHGLE) and Wavelet\_LHH\_NGTDM\_Busyness), which were also selected in the B models. In the same way, a subset of the features selected in the A method (Log\_sigma10mm\_GLSZM\_Size-Zone-Low-Gray-Level-Zone-Emphasis (LGLZE), Exponential\_FIRST ORDER\_Median, Square\_GLSZM\_ZoneEntropy (ZE)) is also present among the features selected in the PET-only method. This could mean that our approach, also given the higher performance metric of the A method, was able to merge the information hidden in the CT and PET image sets in a multi-centric cohort of patients, highlighting the importance of properly handling hybrid imaging in radiomic models.

Other harmonization techniques have been described in the literature [41]. Among these, the ComBat harmonization method, which removes batch effects based largely on an empirical Bayes framework, is one of the most used. However, the ComBat method has some limitations: for instance, the homogeneous groups cannot be too small (in our study it would not have been applicable, as most centers provided fewer than 15 patients). Recently, some methods have been proposed to overcome these limitations, e.g., using bootstrap and Monte Carlo techniques to improve the robustness of the estimation [42].

Even if Monte Carlo and bootstrap strategies aim to overcome the cohort-size limitation, the objective of ComBat still focuses on removing differences in radiomic feature distributions among different labels (corresponding to different centers). ComBat thus relies on the individual distributions, and the changes made to feature values depend on a group of patients. While data support the effectiveness of this method (especially in making the feature distributions uniform), we wanted to tackle the multicenter-study issue from a different angle: accounting for individual patients' differences (caused both by the scanner/institution protocols and by their anatomy) taken directly from the lesion imaging. This makes the method easier, both computationally and in terms of cohort eligibility (the ComBat method requires each center's group to be homogeneous, this being an assumption of the method). In fact, if validated further, our method can be applied even in heterogeneous cohorts, since it uses only the single image set of the patient.

Our approach aimed to use all the information present in the CT data, both from cancer lesions and the contralateral healthy tissue, simulating the radiologists' skill in subjectively evaluating a lesion and adding this information in quantifiable and statistical terms (through the radiomic features).

In our work, the well-known and studied concept of delta radiomics was implemented not in a temporal sense but spatially (cancer vs. healthy tissue), which is an approach that, to our knowledge, was not applied in other prognostic oncological works. Traditional radiomics uses absolute values extracted from regions of interest to predict a clinical outcome. On the other hand, delta radiomics predicts a clinical outcome through the combination of radiomic values computed from image sets acquired at different time points (i.e., radiographs to monitor follow-ups or differences between basal PET and interim PET), which is a rationale also used in clinical practice to assess lesion progression (i.e., PERCIST). In our manuscript, we decided to apply delta radiomics not between different time points but between different anatomic locations (healthy vs. tumor tissues). The assumption behind this use of delta radiomics is that each patient can have an intrinsic "baseline" value for a certain radiomic feature (caused by individual anatomy and institution protocols) that needs to be accounted for when building predictive models. Comparison between normal and tumor tissue behavior (even in terms of pixel values) is also common in clinical practice (i.e., SUV values typical of physiological metabolism or HU/density values of healthy tissue). Some authors [43–45] explain that delta radiomics—which is the use of textural indices associated with different time points or anatomical regions—is more successful than traditional radiomics. Our work aims to provide the basis for a framework where the study of simple absolute feature values can make room for the analysis of their relationship to a reference, used as a threshold or as a comparison.

There are several limitations to the current work. Our study suffered from a restricted number of retrospectively selected patients. Nonetheless, the number of patients seemed reasonable for the current phase of our study, and it assured homogeneity in terms of pathology, as including only NSCLC lesions prevented possible biases created by evaluating different diseases, even within the lung anatomical district. Our study highlighted the necessity of monitoring and carefully using features related to pixel values and their relationships; we can assume that the importance of radiomic features lies not only in their numerical values but also in the intrinsic relationships among those values. The ability of textural indices to perform more complex clinical tasks (i.e., predicting toxicity and its grade) could be examined in the next phase of our work or in a prospective study design, which could also assess the robustness of our method. We believe that a prospective study will be able to validate these models within a cohort gathered with a better strategy.

Interestingly enough, we found that the improved performance in models employing harmonized features in the training phase was also confirmed in the validation dataset; the use of an external dataset is becoming more and more crucial to radiomics studies to assure and facilitate their introduction in clinical practice.

In our study, no clinical or treatment-related features were significantly related to PFS except gender and lung site. It is well known in the literature that gender is a prognostic factor for PFS [46–48]. Owing to the size of our cohort, we did not find significant correlations between PFS and the other studied clinical prognostic variables, such as age or histology (also due to the inclusion criteria). To our knowledge, no other study has reported a significant correlation between PFS and tumor laterality; we will therefore investigate this finding, together with our model's generalization power, in a future prospective study. Indeed, many studies in the literature have shown significant correlations between clinical or treatment-related characteristics and outcomes (PFS and OS), and some have created predictive models. Among the various statistical prediction models, nomograms can be accurate and feasible prognostic instruments with high utility in estimating individual patient risk and may thus help guide treatment decisions in clinical practice. At present, some nomograms based on clinical features have been developed for early-stage NSCLC treated with SBRT [49–51], but the clinical variables found in those studies and their experimental results still need validation in more robust cohorts, such as prospective ones. Therefore, a robust recurrence-related prediction model is needed to help select high-risk candidates who may benefit from additional systemic therapies.

In this scenario, a predictive model based on radiomics and clinical and treatment-related characteristics can improve the prediction of clinical outcomes, as already demonstrated by other works. We also found that clinical variables did not improve the radiomics models; only the proposed harmonization process statistically significantly improved the models' performance.

As previously stated, our future aim is to apply our method to a prospective multicentric cohort to further validate the framework's stability. In addition, other anatomical regions should be explored to generalize the harmonization, even when the healthy area is not so easily defined as in the lung case. Regarding the employment of this method also in PET datasets, it could be interesting to explore the feasibility of applying our harmonization to 18F-FDG-avid anatomical regions, such as the liver or the brain, which are, however, not related to a pathologic response. Such a method could be especially useful where CT-PET is the only exam included in the care pathway of the patient.

#### **5. Conclusions**

A novel strategy of CT data harmonization involving delta radiomics, considering both cancer and healthy tissue in the contralateral lung, was tested and externally validated in a multi-centric study for NSCLC patients, to initially assess its feasibility.

The radiomics models with harmonized features better predicted the selected patient outcome in our cohort, providing valuable additional information to the clinician.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/curroncol29080410/s1, Table S1: The image preprocessing of intensity and spatial discretization, Table S2: Lasso feature selection, Text S1: Models' description.

**Author Contributions:** M.B.: study design, literature search, data analysis/interpretation, writing original draft, supervision; V.T.: study design, literature search, data analysis/interpretation, writing original draft; A.B.: data analysis/interpretation, data validation, reviewing draft; N.C.: literature search, data analysis/interpretation, data validation, figures, writing; M.G.: study design, data collection; S.C.: literature search, reviewing draft; P.B.: resources, data collection; S.L.M.: resources, data collection; E.P.: resources, data collection; M.A.: resources, data collection, reviewing draft; A.R.: resources, data collection; M.S.: resources, data collection; C.P.: resources, data collection; S.U.: resources, data collection; G.M.: resources, data collection; N.G.-L.: resources, data collection; L.F.: resources, data collection; C.I.: resources, data collection; M.I.: resources, data collection; P.C.: resources, data collection, supervision. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was partially supported by the Italian Ministry of Health—Ricerca Corrente.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Area Vasta Emilia Nord (AVEN) (protocol code 817/2018/OSS\*/IRCCSRE, approved on 16 July 2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The training weights of the models proposed in this work (namely, models E and A described in Section 2.5.4) are available in a GitHub repository at https://github.com/ausl-re/TEXAS, accessed on 15 July 2022. Further instructions on how to perform the harmonization and to use the model will be added.

**Acknowledgments:** A heartfelt thanks to all the people involved in the various centers in collecting and managing the data, especially Simona Marani, Maria Paola Ruggieri, Giulia Mascari, and Cinthia Aristei. A special thanks to the AIRO Lung Group, especially Vieri Scotti, Stefano Vagge, and Alessio Bruni.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **ViSTA: A Novel Network Improving Lung Adenocarcinoma Invasiveness Prediction from Follow-Up CT Series**

**Wei Zhao 1,†, Yingli Sun 2,†, Kaiming Kuang 3, Jiancheng Yang 3,4, Ge Li 5, Bingbing Ni 4, Yingjia Jiang 1, Bo Jiang 1, Jun Liu 1,6,\* and Ming Li 2,7,\***


**Simple Summary:** Assessing follow-up computed tomography (CT) series is of great importance in clinical practice for lung nodule diagnosis. Deep learning is a thriving data mining method in medical imaging and has obtained surprising results. However, previous studies mostly focused on the analysis of single static time points instead of the entire follow-up series and required regular intervals between CT examinations. In the current study, we propose a new deep learning framework, named ViSTA, that can better evaluate tumor invasiveness using irregularly sampled serial follow-up CT images, helping to avoid aggressive procedures or delayed diagnosis in clinical practice. ViSTA provides a new solution for irregularly sampled data and delivers superior performance compared with other static or serial deep learning models. The proposed ViSTA framework approaches human-level performance in predicting the invasiveness of lung adenocarcinoma while being transferrable to other tasks analyzing serial medical data.

**Abstract:** To investigate the value of the deep learning method in predicting the invasiveness of early lung adenocarcinoma based on irregularly sampled follow-up computed tomography (CT) scans. In total, 351 nodules were enrolled in the study. A new deep learning network based on temporal attention, named Visual Simple Temporal Attention (ViSTA), was proposed to process irregularly sampled follow-up CT scans. We conducted substantial experiments to investigate the supplemental value of serial CTs in predicting invasiveness. A test set composed of 69 lung nodules was reviewed by three radiologists. The performances of the model and the radiologists were compared and analyzed. We also performed a visual investigation to explore the inherent growth pattern of the early adenocarcinomas. Among counterpart models, ViSTA showed the best performance (AUC: 86.4% vs. 60.6%, 75.9%, 66.9%, 73.9%, 76.5%, 78.3%). ViSTA also outperformed the model based on Volume Doubling Time (AUC: 60.6%). ViSTA scored higher than two junior radiologists (accuracy of 81.2% vs. 75.4% and 71.0%) and came close to the senior radiologist (85.5%). Our proposed model using irregularly sampled follow-up CT scans achieved promising accuracy in evaluating the invasiveness of early stage lung adenocarcinoma. Its performance is comparable with senior experts and better than junior experts and traditional deep learning models. With further validation, it can potentially be applied in clinical practice.

**Citation:** Zhao, W.; Sun, Y.; Kuang, K.; Yang, J.; Li, G.; Ni, B.; Jiang, Y.; Jiang, B.; Liu, J.; Li, M. ViSTA: A Novel Network Improving Lung Adenocarcinoma Invasiveness Prediction from Follow-Up CT Series. *Cancers* **2022**, *14*, 3675. https:// doi.org/10.3390/cancers14153675

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 7 June 2022 Accepted: 20 July 2022 Published: 28 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Keywords:** adenocarcinoma; invasiveness; X-ray computed tomography; deep learning; temporal attention

#### **1. Introduction**

Low-dose computed tomography (LDCT) is recommended for lung cancer screening in high-risk populations based on the National Lung Cancer Screening Trial (NLST) report and is now included in US screening guidelines [1]. Owing to LDCT, more and more early stage lung adenocarcinomas are diagnosed and treated. In clinical practice, most patients require follow-up CT scans due to an indeterminate diagnosis or a low probability of malignancy on baseline CT. Assessing the changes in size, CT value, and other imaging features can substantially help the diagnosis and invasiveness evaluation of early stage lung adenocarcinomas. However, the evaluation process is tedious and lacks objectivity, and radiologists can be overwhelmed by numerous serial CT image evaluations. Moreover, the features indicating malignancy may not be present in the early stages of lung adenocarcinoma, as biological changes may precede morphological changes. Therefore, an efficient tool for objectively evaluating the changes and mining the internal patterns of lung nodules on serial CTs is of great importance.

Deep learning is a thriving data mining method in medical imaging and has obtained surprising results [2–4]. It can efficiently and automatically process medical images and has achieved promising performances on par with clinicians on various clinical tasks, including disease classification, medical image registration, and organ segmentation [5–9]. Previous studies have shown that deep learning could aid clinical decision-making for early lung cancer in disease management and invasiveness prediction [10–14]. However, most prior studies only included single-time CT scan images, while serial CT scan images were not fully investigated. Several powerful deep learning methods have been invented to process serial data, e.g., Long Short-term Memory, Gated Recurrent Unit Network, and Transformer [15–17]. Equipped with the aforementioned tools, a deep learning system can include serial images, better evaluate the biological behavior and changes, and then better predict different clinical events, such as prognosis, therapeutic effect, and subsequent growth patterns [18].

Serial deep learning models have achieved great success in serial data domains, including natural language processing, video classification, and speech recognition [15,19,20]. Nonetheless, it is important to note that medical serial data, such as electronic health records [21] or medical examinations, are almost always sampled irregularly in time, separating them from the aforementioned modalities. Since the progression of the disease is strongly correlated with the time intervals between two time points, the asynchronous (irregularly sampled) nature of medical data requires special treatment. For example, by limiting sampling time intervals to 1, 3, and 6 months, deep learning methods proved effective in integrating multiple time points and improving the prediction of lung cancer treatment response [22]. However, this restriction on time intervals still limits the use of deep learning methods in processing clinical serial data, especially irregularly sampled serial data.

In this article, we propose ViSTA (Visual Simple Temporal Attention), a deep learning framework capable of predicting the tumor invasiveness of pulmonary adenocarcinomas from follow-up CT series. The main contributions are three-fold. First, by introducing a simple temporal attention mechanism, we propose a new deep learning network, named ViSTA, to evaluate the invasiveness of early stage lung adenocarcinoma using irregularly sampled serial CT images. ViSTA is able to gather information throughout the entire series and improve prediction performance. Compared with serial analysis using traditional recurrent neural networks [22], ViSTA is not limited by different time intervals and can process completely irregularly sampled serial data. ViSTA was trained and validated on a dataset of 1121 CT scans from 282 follow-up series and evaluated on a hold-out test set of 113 CT scans from 69 follow-up series. Second, ViSTA delivers superior performance compared with other static or serial deep learning models. ViSTA also outperforms size-based predictive methods (Volume Doubling Time [23]) by a large margin. Third, ViSTA achieved higher scores than two junior radiologists and came close to one senior radiologist in the observer study. Our results demonstrate ViSTA's superiority in processing irregularly sampled series and its great potential for real-world clinical application. Additionally, ViSTA is fully transferrable to other medical imaging tasks where analyzing serial data should yield better performance.

#### **2. Materials and Methods**

#### *2.1. Data Collection*

From January 2011 to October 2017, a search of the electronic medical records and the radiology information systems of the hospital was performed by one author (Yingli Sun). The inclusion criteria were as follows: (1) two or more available CT examinations with thin-slice (≤1.5 mm) images before resection; if there were only two CT examinations, the interval between them had to be at least 30 days; (2) complete pathologic reports. The exclusion criteria for this analysis were: (1) prior treatment before surgery; (2) poor-quality CT images; (3) lesions that were difficult to clearly delineate. Finally, a total of 351 nodules from 347 patients (mean age, 58.41 years ± 11.79 (SD); range, 22–84 years) were enrolled in the study. Among the 351 lung nodules, 191 nodules were pathologically identified as preinvasive lesions, including 1 atypical adenomatous hyperplasia (AAH), 39 adenocarcinomas in situ (AIS), and 151 minimally invasive adenocarcinomas (MIA), whereas 160 nodules were identified as invasive adenocarcinomas (IA). In total, 1234 serial CT scans of the 351 nodules were enrolled in this study. The median interval between the first and the last CT examinations was 366 ± 500 days (range, 30–2813 days; interquartile range, 165–852 days). The 351 nodules were randomly separated into a training set (245 nodules), validation set (37 nodules), and test set (69 nodules) (see Table 1).


**Table 1.** Number of CT scans/nodules in training, validation, and test set.

#### *2.2. CT Scanning Parameters*

Preoperative chest CT in our department was performed using the following four scanners: GE Discovery CT750 HD and 64-slice LightSpeed VCT (GE Medical Systems, Chicago, IL, USA), and Somatom Definition Flash and Somatom Sensation-16 (Siemens Medical Solutions, Erlangen, Germany), with the following parameters: 120 kVp; 100–200 mAs; pitch, 0.75–1.5; and collimation, 1–1.5 mm. All imaging data were reconstructed using a medium sharp reconstruction algorithm with a thickness of 1–1.5 mm.

#### *2.3. Nodule Labeling, Segmentation and Imaging Preprocessing*

The medical image processing and navigation software 3D Slicer (v4.8.0, Brigham and Women's Hospital, Boston, MA, USA) was used to manually delineate the volume of interest (VOI) of the included nodules at the voxel level by one radiologist (Yingli Sun, with 5 years of experience in chest CT interpretation); the VOI was then confirmed by another radiologist (Ming Li, with 12 years of experience in chest CT interpretation). Large vessels and bronchioles were excluded as much as possible from the volume of the nodule. The lung CT DICOM (Digital Imaging and Communications in Medicine) format images were imported into the software for delineation, and the images with VOI information were then exported in NIfTI format for the next analysis step. Each segmented nodule was assigned a specific pathological label (AAH, AIS, MIA, IA) according to the detailed pathological report. Two steps were performed to preprocess CT images before patch extraction. First, the whole-volume CT image was resampled to a spacing of 1 mm in all three dimensions to guarantee isotropy. Second, HU values were clipped to the range of (−1000, 400) and normalized to (0, 1) using minimum–maximum normalization. Normalization can accelerate convergence in the training of the deep learning model and improve its generalization ability.
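
The two preprocessing steps can be summarized in a short sketch. The following is a minimal illustration, assuming a CT volume already loaded as a NumPy array together with its voxel spacing; the function name and the choice of trilinear interpolation are our own, not taken from the authors' pipeline.

```python
import numpy as np
from scipy import ndimage

def preprocess_ct(volume, spacing):
    """Resample to 1 mm isotropic voxels, clip HU to (-1000, 400), scale to (0, 1)."""
    # Step 1: resample; the zoom factor per axis is current spacing / target spacing (1 mm).
    zoom_factors = [s / 1.0 for s in spacing]
    volume = ndimage.zoom(volume, zoom_factors, order=1)  # order=1: trilinear interpolation
    # Step 2: clip HU values, then apply minimum-maximum normalization over the clip range.
    volume = np.clip(volume, -1000.0, 400.0)
    return (volume + 1000.0) / 1400.0  # maps [-1000, 400] onto [0, 1]
```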

#### *2.4. Development of the Deep Learning Model*

We developed a deep learning model named ViSTA to classify IA/non-IA lung nodules. The overall architecture of ViSTA is presented in Figure 1. ViSTA first extracts features from CT image patches using a CNN backbone and then integrates information from the time series using a lightweight attention module named SimTA [24], which is designed specifically for analyzing asynchronous time series. Details regarding the architecture of ViSTA are provided in Supplementary Section S1, and a single SimTA layer is shown in Figure S1. To avoid over-optimization, we did not heavily tune the hyperparameters of our deep learning model and simply adopted common settings. ViSTA and all its counterparts were trained end-to-end for 100 epochs using the AdamW optimizer [25]. We used a cosine decay learning-rate schedule from 10<sup>−3</sup> to 10<sup>−6</sup>. The batch size of each update was 32. The drop-out probability and weight decay were set at 0.2 and 0.01, respectively, to avoid overfitting.
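
The SimTA module itself is specified in [24] and the Supplementary Materials; purely as a hypothetical sketch of the exponential-decay temporal attention idea (see Section 4), the pooling step might look as follows in PyTorch. The class name, the learnable decay rate, and the feature dimension are illustrative assumptions on our part, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimTAPool(nn.Module):
    """Attention pooling whose weights decay exponentially with the gap to the latest scan."""
    def __init__(self, feat_dim, num_classes=2):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(1))  # learnable, positive decay rate
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats, days):
        # feats: (batch, T, feat_dim) CNN features; days: (batch, T) exam times in days.
        gaps = days[:, -1:] - days                           # gap of each scan to the latest one
        attn = torch.exp(-torch.exp(self.log_decay) * gaps)  # weights grow toward the current time
        attn = attn / attn.sum(dim=1, keepdim=True)          # normalize over the series
        pooled = (attn.unsqueeze(-1) * feats).sum(dim=1)     # attention-weighted pooling
        return self.head(pooled)

# Training settings reported above: AdamW, cosine decay from 1e-3 to 1e-6 over 100 epochs,
# weight decay 0.01 (batch size 32 and dropout 0.2 would sit in the data loader and backbone).
model = SimTAPool(feat_dim=128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
```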

#### *2.5. Counterpart Methods*

For comparison with ViSTA, we conducted experiments on several counterparts: the VDT-based method, static CNN models (CNN-first, CNN-last, CNN-all-first, and CNN-all-last), and CNN+LSTM (see Section 3.1).


#### *2.6. Evaluation and Statistical Analysis*

We evaluated the proposed ViSTA model both quantitatively and qualitatively. To evaluate each method's performance, we used a variety of metrics, including accuracy, precision, sensitivity, F1 score, and AUC. Formulas of evaluation metrics are presented in Supplementary Section S1.

To explore the visual representation and interpretability of ViSTA, we followed Simonyan, K. et al. [26], plotted our model's saliency maps through backpropagation, and investigated the mechanism underlying ViSTA and where it directed its attention.
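
A minimal sketch of this gradient-based saliency technique follows, assuming a trained PyTorch classifier and a single preprocessed patch; `model` and `patch` are placeholders, and the clipping constant mirrors the visualization range reported in Section 3.3.

```python
import torch

def saliency_map(model, patch, target_class):
    """Return |d(class score)/d(input)|, the backpropagation-based heatmap of [26]."""
    model.eval()
    patch = patch.clone().requires_grad_(True)   # track gradients w.r.t. the input voxels
    score = model(patch.unsqueeze(0))[0, target_class]
    score.backward()                             # backpropagate the class score
    return patch.grad.abs().clamp(max=0.01)      # absolute value, clipped to (0, 0.01)
```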

#### *2.7. Observer Study*

To further evaluate the performance of ViSTA, we conducted an observer study comparing the performance of radiologists on the same task against the models. In the observer study, all 69 CT series in the test set were evaluated by three radiologists. One is a senior radiologist with 22 years of experience, and the other two are junior radiologists with 5 and 3 years of experience, respectively. The radiologists gave their results based on the evaluation of all available serial CTs. The reviewed results were analyzed and compared with the performance of our proposed model. Radiologists' performances were evaluated using accuracy, sensitivity, precision, and F1 score.

#### **3. Results**

#### *3.1. Performance of Deep Learning Models in Predicting the Invasiveness of Early Lung Adenocarcinoma*

To validate the effectiveness of ViSTA in predicting IA/non-IA nodules, we evaluated its performance using a variety of metrics against its counterparts: VDT (cutoff value set at best Youden index or 400 days), CNN (including CNN-last, CNN-first, CNN-all-first and CNN-all-last), and CNN+LSTM.
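
For reference, the best-Youden-index cutoff can be read off an ROC curve; a minimal scikit-learn sketch follows, with placeholder arrays, and noting that shorter VDTs indicate faster growth, so the score's sign may need flipping so that larger values indicate invasiveness.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_youden_threshold(labels, scores):
    """Return the threshold maximizing the Youden index J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return thresholds[np.argmax(tpr - fpr)]

# cutoff = best_youden_threshold(y_test, -vdt_days)  # negated VDT: shorter doubling time = higher risk
```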

Tables S1 and S2 show their performances on the training and validation datasets. Figure 2 provides the ROC curves of all models on the test dataset. Our proposed model outperformed all deep learning models and VDT-based methods in every metric by considerable margins (the best among models are highlighted with an underscore). It is worth noting that VDT is far from effective for invasiveness classification: it underperformed almost all deep learning models in terms of AUC, accuracy, and F1 score. Secondly, sequential models (ViSTA and CNN+LSTM) delivered better performances than CNN models that utilize static data points. ViSTA outperformed CNN+LSTM by considerable margins in all metrics. This performance gap can be attributed to ViSTA's suitability for analyzing asynchronous time series. Unlike CNN+LSTM, which treats all time points as if they were regularly sampled, ViSTA takes time intervals into account and is better at processing follow-up series. Furthermore, we trained CNN on all time points (CNN-all-first and CNN-all-last) to investigate whether sequential models gain their superiority from larger training datasets. It turned out that ViSTA and CNN+LSTM still outperformed CNN even when it was trained on all data.

**Figure 2.** ROC curves of different models compared with performances of radiologists. The gray dotted line indicates the performance of a random classifier with no predictive ability.

#### *3.2. Performance Comparison against Radiologists*

In the observer study, we compared the performances of ViSTA and its counterparts against three radiologists (Table 2). One is a senior radiologist with 22 years of experience, and the other two are junior radiologists with 5 and 3 years of experience, respectively. All 69 follow-up series from the test set were included in the observer study. We evaluated radiologists' performances using accuracy, sensitivity, precision, and F1 score and compared them against the proposed model. For metrics that require specifying a threshold, we chose the threshold delivering the best Youden index on the validation set as the cutoff value. Figure 2 plots the deep learning models' ROC curves against the radiologists' metrics. In terms of accuracy and F1 score, ViSTA scored higher than the two junior radiologists (accuracy of 81.2% vs. 75.4% and 71.0%; F1 score of 81.7% vs. 73.0% and 65.5%) and came close to the senior radiologist (accuracy of 81.2% vs. 85.5%; F1 score of 81.7% vs. 84.8%).

#### *3.3. Visual Presentation Investigation*

To investigate the mechanism of ViSTA, we used a neural network visualization technique [26] to visualize the attention heatmap of the model, which was mostly attributed to the predicted results and potentially correlated with the biological behavior (Figure 3). We took the absolute value of the raw heatmap and clipped it to the range of (0, 0.01) for better visualization and interpretation. From the resulting heatmaps, we can see that the "attention" of the deep learning system was mostly focused on the nodule. Areas surrounding the nodule drew the attention of ViSTA as well, suggesting that they, like the nodule itself, carry valuable information (Figure 3A,B). Figure 3A shows a long follow-up series of 11 time points. We observed that heatmaps stay blank in the first half of the series, during which both nodule volume and IA probability remain relatively stable. In the latter half, heatmaps begin to show along with significant increases in nodule volume and IA probability. Heatmaps are sometimes only lit up at the last time point (Figure 3B). We attribute this to the sudden increase in nodule volume between the third and the fourth time points, which provides sufficient information for the model. This argument is supported by the spike of IA probability at the fourth time point. In some rare cases, heatmaps at all time points are close to invisible (Figure 3C). We conjecture that this is because the lung nodule had almost no progression, which was proven by the fact that both nodule volume and IA probability stayed almost unchanged throughout the entire series.

**Table 2.** The performance of different models and radiologists on the test dataset. The highest among all is highlighted in bold, and the highest among models and VDT (Volume Doubling Time)-based methods is highlighted with an underscore.


**Figure 3.** Visualization investigation of ViSTA. The top row shows CT slices of each time point in the follow-up series. The middle row shows attention heatmaps extracted using the technique proposed by Simonyan, K. et al. [26]. The bottom row masks heatmaps on top of CT slices. (**A**) Attention gradually grew along with the nodule volume and IA probability as the nodule progressed to the end of the series. (**B**) The heatmap only lit up at the last time point as it is considered the one carrying valuable information. (**C**) All time points are allocated with little to no attention, which may be caused by the slow progress of the nodule.

#### **4. Discussion**

In the current study, we proposed a deep learning framework named ViSTA to predict the invasiveness of lung adenocarcinomas using serial CT images. Our results showed that models fed with serial CT images substantially and consistently outperformed models fed with single CT images. Moreover, our proposed model can effectively process asynchronous time series and outperform the traditional serial network, i.e., LSTM. Our models achieved an AUC of 86.4% and an F1 score of 81.7% in the test dataset, which were higher than those of all its counterparts. In the observer study, ViSTA achieved higher accuracy and F1 scores than two junior radiologists (accuracy of 81.2% vs. 75.4% and 71.0%, F1 score of 81.7% vs. 73.0% and 65.5%). When compared with the senior radiologist, our proposed model delivered close performance (accuracy of 81.2% vs. 85.5%, F1 score of 81.7% vs. 84.8%).

Timely and accurate assessment of the biological behavior of early stage lung adenocarcinomas has been a continuous focus of attention in clinical practice. In contrast to traditional radiographic features and handcrafted features, the deeper, higher-dimensional features mined by deep learning methods present promising advantages in many tasks, including predicting the invasiveness of early lung adenocarcinoma. Kim et al. performed a comparison study and revealed that the predictive accuracy of the deep learning method was superior to that of the size-based logistic model [11]. We also analyzed the predictive value of VDT [27], a size-based key parameter used in clinical practice to differentiate aggressive tumors from slow-growing ones [24]. Not surprisingly, the performance of our proposed model substantially exceeded that of the VDT-based methods. This indirectly verified the conjecture that a deep learning system can extract and learn deeper and more valuable features, and thereby better discover the biological behavior of tumors and predict the invasiveness of early stage lung adenocarcinoma.

Although deep learning methods can obtain better performance, most previous studies only used the single CT scan acquired prior to surgery for training and feature extraction, which cannot reveal the internal growth pattern of the nodules. In clinical scenarios, internal growth is a vital component of Lung-RADS, a guideline to standardize image interpretation by radiologists and dictate management recommendations. Including serial CTs can facilitate medical tasks such as differentiating benign tumors from malignant ones [28] and monitoring and predicting treatment response [22,29]. Our findings support this: by modelling serial CTs, the predictive performance of ViSTA substantially surpassed its counterparts analyzing static data. In clinical practice, sequential medical data are generally sampled irregularly, i.e., with different follow-up time intervals. To address the irregular sampling issue, we adopted SimTA in our proposed model to process irregularly sampled time series. This lightweight module enables modeling sequential information in an efficient way. The proposed ViSTA significantly outperformed the standard serial framework, i.e., CNN+LSTM, with considerably fewer parameters and a smaller computation and memory footprint. ViSTA can better exploit the complete information of all time-point CTs by modelling a simple yet effective exponentially decaying attention over the time series. This was proved by our experiments comparing ViSTA, CNN+LSTM, and pure CNN models trained with all time-point CTs (CNN-all). ViSTA's superiority over CNN-all proved that its performance gain does not come from a larger training dataset.

In the visualization analysis, we found that ViSTA directs its attention to the nodule and the surrounding tissue and allocates more attention when the probability of invasiveness increases. This can partly explain the mechanism of the deep learning system. We also found some cases where the model appeared to use features close to the nodule, such as the vasculature and parenchyma surrounding it. In fact, peritumoral tissue may possess valuable information, such as tumor-infiltrating status. Features extracted from peritumoral tissues can improve the efficiency of intranodular radiomic analysis [30,31]. However, we still cannot fully interpret whether the model incorporates other abnormalities, such as background emphysema, in its predictions. Further investigation using more comprehensive model attribution techniques may allow clinicians to take advantage of the same visual features used by the model to assess the biological status of tumors. It is worth noting that some of the heatmaps in the time series are completely blue, meaning that the deep learning model allocated close to zero attention to these time points. Though this phenomenon is not completely interpretable, we argue that it can be attributed to two facts: these time points are too far from the current one, and they lack findings informative for the deep learning model.

Even though the proposed ViSTA proved effective in processing irregularly sampled CT series in our experiments, several limitations remain. First, due to the difficulty of collecting complete lung nodule follow-up series, we only included data from a single center in this study. In clinical practice, it is preferable for the proposed method to generalize to multiple data domains. Furthermore, a single follow-up series may contain CT scans from different centers, which would be an important challenge to solve if the proposed model were to be put into clinical use. In future studies, we will include CT series from external centers to validate the generalization performance of ViSTA. Second, the SimTA module in ViSTA models a simple temporal attention mechanism that monotonically increases weights as the time point gets closer to the current time. However, it is viable to model more complicated attention relations using deep learning models such as Transformer or Informer [15,32]. These temporal models can capture non-monotonic and dynamic temporal attention that could be useful in predicting invasiveness. Last but not least, even though we conducted a visual investigation of ViSTA, the interpretation of deep learning model predictions remains a major challenge, and the final clinical decision is still up to clinicians. In our future research, we will further investigate the underlying mechanism of ViSTA and similar attention mechanisms.

#### **5. Conclusions**

To summarize, we designed a deep learning model processing irregularly sampled CT series to predict the invasiveness of early stage lung adenocarcinoma from follow-up CT scans. The model achieved promising accuracy comparable with senior experts and better than junior experts and its counterparts. With further validation, the proposed model could better evaluate the invasiveness of early stage lung adenocarcinoma, avoiding aggressive procedures or delayed diagnosis and helping precise management in clinical practice.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14153675/s1. References [24,33–36]. Section S1: Supplementary Methods. Figure S1: The illustration of a single SimTA layer. The colormap of the attention matrix indicates the magnitude of attention allocated at a particular timepoint. Gray represents zero attention. The darker the orange color is, the more attention the timepoint gets; Table S1: Different models' performances in predicting invasiveness on the training dataset. The highest among all is highlighted in bold; Table S2: Different models' performances in predicting invasiveness on the validation dataset. The highest among all is highlighted in bold.

**Author Contributions:** Conceptualization, K.K. and J.L.; Data curation, W.Z., Y.S. and M.L.; Formal analysis, W.Z., Y.S. and J.Y.; Funding acquisition, M.L.; Investigation, W.Z., K.K., Y.S., G.L., B.N., Y.J. and B.J.; Methodology, W.Z., K.K., Y.S., J.Y., Y.J. and B.J.; Project administration, M.L.; Resources, Y.S.; Software, K.K., B.N. and M.L.; Supervision, J.Y., J.L. and M.L.; Validation, G.L.; Visualization, B.N.; Writing—original draft, W.Z. and Y.S.; Writing—review and editing, W.Z., J.L. and M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by the National Natural Science Foundation of China 82102157 (Wei Zhao) and 61976238 (Ming Li), the Hunan Provincial Natural Science Foundation for Excellent Young Scholars 2022JJ20089 (Wei Zhao), the Hunan Provincial Natural Science Foundation of China 2021JJ40895 (Wei Zhao), the Research Project of Postgraduate Education and Teaching Reform of Central South University 2021JGB147 (Jun Liu) and 2022JGB117 (Wei Zhao), the Clinical Research Center for Medical Imaging in Hunan Province 2020SK4001 (Jun Liu), the Science and Technology Innovation Program of Hunan Province 2021RC4016 (Jun Liu), the Clinical Medical Technology Innovation Guidance Project in Hunan Province 2020SK53423 (Wei Zhao), the Science and Technology Planning Projects of the Shanghai Science and Technology Commission 22Y11910700 and 20Y11902900 (Ming Li), and the Shanghai "Rising Stars of Medical Talent" Youth Development Program "Outstanding Youth Medical Talents" SHWJRS [2021]-99 (Ming Li).

**Institutional Review Board Statement:** This retrospective study was approved by the institutional review board of Huadong Hospital, affiliated with Fudan University (No. 2019K134).

**Informed Consent Statement:** The institutional review board waived the requirement for patients' written informed consent.

**Data Availability Statement:** Data are available for bona fide researchers who request it from the authors.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Development and Validation of Novel Deep-Learning Models Using Multiple Data Types for Lung Cancer Survival**

**Jason C. Hsu 1,2,3,4, Phung-Anh Nguyen 1,2,3, Phan Thanh Phuc 4, Tsai-Chih Lo 5, Min-Huei Hsu 6,7, Min-Shu Hsieh 8,9, Nguyen Quoc Khanh Le 10,11, Chi-Tsun Cheng 3, Tzu-Hao Chang 2,5,\* and Cheng-Yu Chen 11,12,\***


**Simple Summary:** Previous survival-prediction studies have had several limitations, such as a lack of comprehensive clinical data types, testing of only a limited set of machine-learning algorithms, or the lack of a sufficient external testing set. Our lung-cancer-survival-prediction model is based on multiple data types, multiple novel machine-learning algorithms, and external testing. This prediction model demonstrated higher performance (ANN: AUC, 0.89; accuracy, 0.82; precision, 0.91) than previous similar studies.

**Abstract:** A well-established lung-cancer-survival-prediction model that relies on multiple data types, multiple novel machine-learning algorithms, and external testing is absent in the literature. This study aims to address this gap and determine the critical factors of lung cancer survival. We selected non-small-cell lung cancer patients from a retrospective dataset of the Taipei Medical University Clinical Research Database and Taiwan Cancer Registry between January 2008 and December 2018. All patients were monitored from the index date of cancer diagnosis until the event of death. Variables, including demographics, comorbidities, medications, laboratories, and patient gene tests, were used. Nine machine-learning algorithms with various modes were used. The performance of the algorithms was measured by the area under the receiver operating characteristic curve (AUC). In total, 3714 patients were included. The best performance of the artificial neural network (ANN) model was achieved when integrating all variables, with an AUC, accuracy, precision, recall, and F1-score of 0.89, 0.82, 0.91, 0.75, and 0.65, respectively. The most important features were cancer stage, cancer size, age at diagnosis, smoking and drinking status, the EGFR gene, and body mass index. Overall, the ANN model improved predictive performance when integrating different data types.

**Keywords:** lung cancer; survival; prediction models; real-world data; artificial intelligence; machine learning

**Citation:** Hsu, J.C.; Nguyen, P.-A.; Phuc, P.T.; Lo, T.-C.; Hsu, M.-H.; Hsieh, M.-S.; Le, N.Q.K.; Cheng, C.-T.; Chang, T.-H.; Chen, C.-Y. Development and Validation of Novel Deep-Learning Models Using Multiple Data Types for Lung Cancer Survival. *Cancers* **2022**, *14*, 5562. https://doi.org/10.3390/ cancers14225562

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 21 September 2022 Accepted: 10 November 2022 Published: 12 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Lung cancer is the leading cause of cancer deaths worldwide [1]. Globally, there were around 2.21 million new cases of lung cancer and 1.80 million fatalities in 2020 [2]. One study reported that lung cancer incidence and mortality rates were 22.2 and 18.0 per 100,000 people in 2020, respectively [3,4]. Lung cancer can be divided clinically into two types based on histological features: non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC). NSCLC is the most common among them, accounting for 80–90% of lung cancers [5]. Cell deterioration and metastasis are slower in NSCLC than in SCLC. Around 70% of patients are diagnosed at an advanced stage, making surgical resection and complete treatment challenging [6,7].

Artificial intelligence (AI) has been increasingly used in medical research and clinical practice [8,9]. The accurate prediction of disease prognosis and drug treatment outcomes, which may serve as a reference for treatment decision-making and drug selection, has become an essential topic in clinical medicine [9,10]. Developing disease-risk and prognosis-prediction models using machine-learning or deep-learning algorithms with big data is a major area of AI-based academic research in the medical field [10,11]. Studies have used machine-learning and/or deep-learning algorithms to develop lung cancer risk and prognosis-prediction models [12–15]. Among them, Lai et al. [16] used 15 biomarkers together with clinical data (including gene expression) from 614 patients to develop a deep neural network to predict the five-year overall survival of NSCLC patients.

This study aimed to develop survival-prediction models for lung cancer patients using a large number of samples, different data types, various machine-learning algorithms, and external testing. In addition to the basic clinical data (including demographic information, disease condition, comorbidity, and current medication), we examined the role of laboratory and genomic test results, which are generally not easy to obtain in predicting lung cancer survival. Moreover, we also explored the important predictors for developing prediction models.

#### **2. Methods**

#### *2.1. Study Design and Data Source*

We conducted a retrospective study in which we obtained data from the Taiwan Cancer Registry (TCR) database and the Taipei Medical University Clinical Research Database (TMUCRD). The TCR database was established in 1979 and is managed by Taiwan's Health Promotion Administration, Ministry of Health and Welfare. It covers 98% of Taiwanese cancer patients and includes diagnosis and other related information. The TMUCRD integrates data retrieved from the electronic health records (EHRs) of three hospitals: Taipei Medical University Hospital (TMUH), Wan-Fang Hospital (WFH), and Shuang-Ho Hospital (SHH). The database contains the electronic medical record data of 3.8 million people from 1998 to 2020, including structured data (e.g., basic patient information, medical information, test reports, diagnosis results, treatment process, surgery, and medication history) and unstructured data (e.g., progress notes, pathology reports, and medical imaging reports) [17]. This study was approved by the Joint Institutional Review Board of Taipei Medical University (TMU-JIRB), Taipei, Taiwan (approval number N202101080). All data were anonymized before analysis.

#### *2.2. Cohort Selection*

This study selected patients with lung cancer (ICD-O-3 code: C33, C34) from 2008 to 2018 in the TCR database. Exclusion criteria included individuals under 20 years old, SCLC patients, and patients who did not have any medical history in the three hospitals (TMUH, WFH, SHH). Thus, a total of 3714 patients were included in this study, including 960 patients from TMUH, 1320 from WFH, and 1434 from SHH (Figure S1 in the Supplementary Materials).

#### *2.3. Outcome Measurement*

We ascertained the study outcomes using TMUCRD EHR and vital status data from the Taiwan Death Registry (TDR) [18]. We used the diagnosis date of NSCLC as the index date, and the outcome of this study was death within two years following diagnosis. Data were censored at the date of death or loss to follow-up, insurance termination, or the study's end on 31 December 2018.

#### *2.4. Feature Selection*

Based on a literature review and consultation with clinicians, we selected features that may lead to the mortality of NSCLC patients to build prediction models. These features consisted of:


#### *2.5. Development of the Algorithms*

This study established prediction models based on four modes and different algorithms:


This study aims to predict the survival of lung cancer patients; the problem can therefore be formulated as a classification task. We used machine-learning techniques including logistic regression (LR), linear discriminant analysis (LDA), light gradient-boosting machine (LGBM), gradient-boosting machine (GBM), extreme gradient boosting (XGBoost), random forest (RF), AdaBoost, support vector machine (SVC), and artificial neural network (ANN). These methods are briefly introduced below.

Logistic Regression (LR): This is a discrete choice model that models the relationship between a response and multiple explanatory variables and is based on the concept of probability [19]. It is widely used and more practical in fields such as biostatistics, clinical medicine, and quantitative psychology. Its Equation (1) is:

$$y = \frac{e^{(b_0 + b_1 x)}}{1 + e^{(b_0 + b_1 x)}} \tag{1}$$

where *x* is the input value, *y* is the predicted output, *b*<sub>0</sub> is the bias or intercept term, and *b*<sub>1</sub> is the coefficient for the input *x*. In this study, we used the LR function with the parameter C (inverse of regularization strength) set to 0.0001 to reduce the model's overfitting.
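
In scikit-learn terms, this setting corresponds to something like the following minimal sketch; the iteration cap is our own default and is not stated in the text.

```python
from sklearn.linear_model import LogisticRegression

# Very small C = very strong regularization, as described above.
lr_model = LogisticRegression(C=0.0001, max_iter=1000)
# lr_model.fit(X_train, y_train)  # X_train/y_train are placeholder arrays
```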

Linear Discriminant Analysis (LDA): This is generally used to classify patterns between two classes; however, it can be extended to multiple classes. LDA assumes that all classes are linearly separable; accordingly, multiple linear discriminant functions, representing hyperplanes in the feature space, are created to distinguish the classes [20]. In this study, we set the *shrinkage* parameter to 0 and the *solver* to 'lsqr' to improve estimation and classification accuracy.

Light Gradient-Boosting Machine (LGBM): This is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed and higher efficiency; lower memory usage; better accuracy; support of parallel, distributed, and GPU learning; and capability to handle large-scale data [21]. The model's *class\_weight* parameter was set as 'balanced', which uses the output's value to automatically adjust weights inversely proportional to class frequencies in the input data. The *learning\_rate*, l1 regularization—*reg\_alpha*, and l2 regularization—*reg\_lambda* parameters were set as 0.05, 0.1, and 0.1, respectively.

Gradient-Boosting Machine (GBM): Gradient-boosting regression trees produce competitive, highly robust, and interpretable procedures for regression and classification. The ability of TreeBoost procedures to give a quick indication of potential predictability, coupled with their extreme robustness, makes them a useful preprocessing tool that can be applied to imperfect data [22]. The default parameters were used in this model.

Extreme Gradient Boosting (XGBoost): XGBoost, an efficient and scalable implementation of the gradient-boosting framework, is a machine-learning system for tree boosting. The scalability of XGBoost is attributed to several critical systems and algorithmic optimizations. These innovations include a novel tree-learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that allows the handling of instance weights in approximate tree learning [23]. The default parameters were used in this model.

Random Forest (RF): RF is an ensemble-learning method that operates by constructing many small classification modules (most often decision trees) at training time. The model outputs the class obtained by combining the results of the individual modules via a voting algorithm [24]. In this study, we set the parameters as follows: *n\_estimators* (the number of trees) of 500, *max\_depth* of 10, *min\_samples\_split* of 400, and *class\_weight* of 0.5 for each class.

AdaBoost: The AdaBoost algorithm is an iterative procedure that combines several weak classifiers to approximate the Bayes classifier C∗(*x*). AdaBoost builds a classifier, e.g., a classification tree that produces class labels, starting with the unweighted training sample. If a training data point is misclassified, the weight of that data point is increased (boosted). A second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights boosted, and the procedure is repeated [25]. The number of estimators (*n\_estimators*) used was 100.

Support Vector Machine (SVC): This is a machine-learning algorithm that can be applied to linear and nonlinear data. SVC transforms the original data to a higher dimension, from which it can use the support vectors in the training data set to find the hyperplane for categorizing the data. An SVC identifies the hyperplane with the largest margin, i.e., the maximum-margin hyperplane, to achieve higher accuracy [26]. The SVC can be represented by the following Equation (2):

$$f(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i^* - \alpha_i) K(\mathbf{x}, \mathbf{x}_i) + B \tag{2}$$

where *K*(*x*, *x*<sub>i</sub>) is the kernel function, *α*<sub>i</sub>, *α*<sub>i</sub><sup>∗</sup> ≥ 0 are the Lagrange multipliers, and B is a bias term. In this study, we used a *linear* kernel for computations.
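
The reported hyperparameters for LDA through SVC translate into instantiations like the following sketch; unspecified values are left at library defaults, and `probability=True` for SVC is our own addition to allow AUC scoring.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

models = {
    "LDA": LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.0),
    "LGBM": LGBMClassifier(class_weight="balanced", learning_rate=0.05,
                           reg_alpha=0.1, reg_lambda=0.1),
    "GBM": GradientBoostingClassifier(),   # default parameters, as reported
    "XGBoost": XGBClassifier(),            # default parameters, as reported
    "RF": RandomForestClassifier(n_estimators=500, max_depth=10,
                                 min_samples_split=400,
                                 class_weight={0: 0.5, 1: 0.5}),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "SVC": SVC(kernel="linear", probability=True),
}
```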

Artificial Neural Network (ANN): This is a learning algorithm vaguely inspired by biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, and these neurons process information using a connectionist approach to computation. They are usually used to model complex relationships between inputs and outputs, find patterns in data, or capture the statistical structure [27]. The number of hidden layers and the number of neurons in each layer were set at 3 and 16, respectively. Additionally, for each layer, an *l2 regularization* of 0.01 and the 'relu' *activation* were used in the study. We set the 'softmax' activation for the output layer. We also used the 'Adam' *optimizer*, a highly performant stochastic gradient descent algorithm, and 'binary\_crossentropy' as the *loss* function for the binary classification outcome.
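
Read as a Keras configuration, the ANN description above might be sketched as follows; the input dimension and the two-unit one-hot output encoding are assumptions on our part.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_ann(n_features):
    # Three hidden layers of 16 neurons, relu activation, l2(0.01) per layer.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(2, activation="softmax"),  # softmax output layer; one-hot targets assumed
    ])
    # Adam optimizer with binary cross-entropy loss, as described in the text.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model
```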

#### *2.6. Evaluating the Algorithms*

The training dataset contained the data of patients from TMUH and WFH. Stratified 5-fold cross-validation was applied in the training set to assess the different machine-learning models' performance and generalization errors. In other words, patients in the training set were divided into five groups, each used in turn as the internal validation set. Data from SHH were used as the external testing dataset to assess the models' generalizability.

The performance of the algorithms was measured by the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity (recall), specificity, positive predictive value (PPV, precision), negative predictive value (NPV), and F1-score. We defined the best model as the one with the highest AUC on the external testing set. Furthermore, we analyzed the feature contributions (i.e., feature importance) of the best model using SHAP (SHapley Additive exPlanations) values [28].
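
The validation protocol amounts to the sketch below: stratified 5-fold cross-validation on the training hospitals and a single AUC evaluation on the external site. The arrays and the model object are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validated_auc(model, X, y, n_splits=5):
    """Mean AUC over stratified folds of the training set (TMUH + WFH)."""
    aucs = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_splits, shuffle=True,
                                              random_state=0).split(X, y):
        model.fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))
    return float(np.mean(aucs))

# External testing: refit on all TMUH + WFH data, then score once on the SHH hold-out set.
# model.fit(X_train, y_train)
# auc_external = roc_auc_score(y_shh, model.predict_proba(X_shh)[:, 1])
```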

All the data processing was performed using MSSQL server 2017 (Redmond, WA, USA), and the model training and testing were performed using Python version 3.8 (Wilmington, DE, USA) with scikit-learn version 1.1 (Paris, France) [29].

#### **3. Results**

#### *3.1. Baseline Characteristics of Patients*

We identified 3714 eligible lung cancer patients diagnosed for the first time and registered at the TCR. Among those patients, 2280 patients were included in the training dataset, whereas 1434 were in the testing dataset. Demographic characteristics, comorbidities, tumor size, tumor stage, genomic tests, medication uses, and laboratory tests are presented in Table 1. The mean (standard deviation, SD) ages and BMI of cohort patients were 68 (13.7) and 23.4 (4.33), respectively. Most of the patients were male (57.5%) with late-stage lung cancer (i.e., stage IV, 54.8%), and patients were less likely to smoke (26.7%) or drink (11%). The cohort of patients had comorbidities related to hypertension (19.8%), hyperlipidemia (13.9%), COPD (16.1%), and CVD problems (11.6%). The follow-up durations for the cohort patients were a mean (SD) of 2.25 (2.47) years and a median (interquartile range (IQR)) of 1.41 [0.46–3.04] years. Detailed information is shown in Table S1 in the Supplementary Materials.

**Table 1.** Basic Characteristics of the Study Cohort.





**Note**: SD, Standard deviation; yrs., Years; IQR, Interquartile Range; BMI, Body mass index; COPD, Chronic obstructive pulmonary disease; PUD, Peptic ulcer disease; CVD, Cardiovascular; DM, Diabetes; BUN, Blood urea nitrogen; HCT, Hematocrit; HGB, Hemoglobin; K, Potassium; MCH, Mean corpuscular hemoglobin; MCHC, Mean corpuscular hemoglobin concentration; MCV, Mean corpuscular volume; Na, Sodium; PLT, Platelet; RBC, Red blood count; WBC, White blood count; <sup>a</sup> The training set included the data from Taipei Medical University and Wan-Fang hospitals; <sup>b</sup> The testing set included the data from Shuang Ho hospital.

#### *3.2. The Performances of Different Prediction Models*

The performances of the different prediction models are shown in Table 2. In Mode 1, the highest AUC of 0.88 was observed for the ANN model (accuracy, 0.82; precision, 0.90; recall, 0.75; F1-score, 0.64), followed by the GBM and RF models with AUCs of 0.83 and 0.82, respectively. In Mode 3, the best performance was found with an AUC of 0.89 for the ANN model (accuracy, 0.83; precision, 0.89; recall, 0.81; F1-score, 0.64); the next-best AUCs were 0.85 for the LGBM and GBM models and 0.84 for the RF model. Moreover, when considering all features in Mode 4, we found that the best model was again the ANN model, with an AUC of 0.89 (accuracy, 0.82; precision, 0.91; recall, 0.75; F1-score, 0.65). Figures 1 and 2 show the ROC curves of the different prediction models in the four modes. Detailed information on the various models' measurements (i.e., sensitivity, specificity, PPV, NPV, accuracy, and F1-score) is shown in Table S2 in the Supplementary Materials.


**Table 2.** Performance of various Prediction Models by Modes.

**Note**: LR, Logistic Regression; LDA, Linear Discriminant Analysis; LGBM, Light Gradient Boosting Machine; GBM, Gradient Boosting Machine; XGBoost, Extreme Gradient Boosting; RF, Random Forest; SVC, Support Vector Machine; ANN, Artificial Neural Network; \*, Best model based on AUC values.

Figure 3 shows the top 20 important features of the ANN model in Mode 4. The most important features were cancer stage, cancer size, age at diagnosis, smoking status, and the EGFR gene. In other words, patients with an advanced cancer stage, large cancer size, older age, and smoking behavior had a higher risk of death within two years. The SHAP values for the GBM model in Mode 4 identified important features consistent with those of the ANN model, such as cancer stage, age at diagnosis, cancer size, and smoking status (Figure S2 in the Supplementary Materials).
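
A hypothetical snippet for the SHAP analysis behind Figure S2 follows, assuming a fitted tree-based model (e.g., the Mode 4 GBM) and a feature matrix with named columns; `gbm_model` and `X_test` are placeholders.

```python
import shap

# gbm_model and X_test stand in for the fitted Mode 4 GBM and the testing features.
explainer = shap.TreeExplainer(gbm_model)
shap_values = explainer.shap_values(X_test)             # per-feature contributions per patient
shap.summary_plot(shap_values, X_test, max_display=20)  # top-20 features, mirroring Figure 3
```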

**Figure 1.** The Performance of the Prediction Models in the Testing dataset by different Modes. **Note**: (**A**), Mode 1; (**B**), Mode 2; (**C**), Mode 3; (**D**), Mode 4.

**Figure 2.** The Performance of the ANN Prediction Models in the Testing dataset by different Modes. **Note**: (**A**), Mode 1; (**B**), Mode 2; (**C**), Mode 3; (**D**), Mode 4.

**Figure 3.** Feature Importance of the ANN Prediction Model in Mode 4. **Note**: BMI, Body mass index; EGFR, Epidermal growth factor receptor; WBC, White blood cell; PD-L1, Programmed death-ligand 1; COPD, Chronic obstructive pulmonary disease; CCI, Charlson comorbidity index.

#### **4. Discussion**

In recent years, the prediction of cancer patients' survival has attracted the medical community's attention in various countries because it can facilitate medical decision making, strengthen the relationship between doctors and patients, and improve the quality of medical care. Rapid progress in the development of AI based on machine learning has led to more diversified applications of AI in the field of precision medicine. Based on previously published studies on machine-learning algorithms to build prediction models for the survival of lung cancer patients [12,14–16], this study further compared the performance of various novel machine-learning algorithms. In addition, we also analyzed the relationship between the diversity of features and the accuracy of prediction results and determined the most important features affecting lung cancer survival.

Studies using multiple data types and multiple novel machine-learning algorithms simultaneously are limited. Most previous studies on lung cancer prediction used a single machine-learning (e.g., RF [30]) or deep-learning (e.g., NN [14–16]) algorithm or a few basic machine-learning algorithms (e.g., LR, SVM, decision tree, RF, GBM [12,31]) to develop prediction models. Our results showed that the ANN model had the highest AUC value (it was the most suitable tool for survival prediction). In contrast, the traditional LR algorithm exhibited the lowest AUC (it had the lowest predictive ability). Lai Y.H. et al. [16] presented a deep neural network to predict the overall survival of NSCLC patients. They obtained a good predictive performance (AUC = 0.82, accuracy = 75.4%) by integrating microarray and clinical data. Using only basic clinical data (demographics, comorbidities, and medications), our prediction model demonstrated higher performance (ANN: AUC, 0.88; accuracy, 0.82; precision, 0.90; recall, 0.75; F1-score, 0.64). Furthermore, when combining other variables, such as laboratory and genomic tests, the AUC values of the prediction models improved (based on the external testing, the AUCs of the ANN model in Modes 1 and 4 were 0.88 and 0.89, respectively; those of the LGBM model were 0.81 and 0.86; and those of the RF model were 0.82 and 0.85).

In this study, we explored the variables that might affect the predictive performance of the survival model. As expected, variables such as advanced cancer stage, tumor size, age at diagnosis, and smoking and drinking status were highly correlated with the mortality of lung cancer patients [32]. Our findings also showed that lymphocyte, platelet, and neutrophil tests were associated with the likelihood of lung cancer survival [33]. Lymphocytes play an essential role in producing cytokines, inhibiting the proliferation of cancer cells, and provoking cytotoxic cell death [34]; in other words, a decrease in lymphocyte count may predict worse survival in cancer patients. Neutrophils are recruited by cytokines released by the tumor microenvironment, enhancing carcinogenesis and cancer progression [35]. Platelets modulate the tumor microenvironment by releasing factors contributing to tumor growth, invasion, and angiogenesis [36]. Another study, by Wang J. et al. [37], reported that lung cancer patients with a higher BMI have prolonged survival compared to those with a lower BMI. The same was true for our study's results, which may be due to the poor nutrition and weight loss caused by respiratory diseases [38], such as COPD.

There are limitations to this study. First, although the study used data from various clinical settings (e.g., TMUH and WFH for establishing the prediction model and SHH for conducting an external test) located in the north of Taiwan, the results may not directly apply to lung cancer patients in other regions. Future studies may need to consider validating the model using data from other areas. Second, this study used retrospective data for development and validation. Further experiments with a prospective study design in clinical settings are needed. Third, to obtain a highly accurate prediction, we developed the machine-learning algorithms with binary outcomes (i.e., survival and death) rather than expected continuous outcomes (i.e., length of survival) for the NSCLC patients. Further studies should be conducted with larger sample sizes to deal with continuous outcomes for lung cancer survival.

#### **5. Conclusions**

In summary, to observe the expected survival of NSCLC patients during a two-year period, we designed an artificial neural network model with high AUC, precision, and recall. Moreover, integrating different data types (especially laboratory and genomic data) led to better predictive performance. Further research is necessary to determine the feasibility of applying the algorithm in the clinical setting and explore whether this tool could improve care and outcomes.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14225562/s1, Figure S1: Cohort Selection Process; Figure S2: Feature Importance of the GBM Prediction Model of Mode 4; Table S1: Detailed Demographic Characteristics of Cohort Patients; Table S2: Detailed Performance of Various Prediction Models by Modes.

**Author Contributions:** T.-H.C., P.-A.N. and J.C.H. conceptualized and designed the study. P.-A.N., P.T.P. and T.-C.L. collected the data, performed the analysis, and drafted the manuscript. C.-Y.C. and T.-H.C. provided suggestions for the research design and article content. M.-H.H., M.-S.H., N.Q.K.L., C.-T.C. and J.C.H. reviewed all data and revised the manuscript critically for intellectual content. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by Taiwan Ministry of Science and Technology grants (grant numbers: MOST109-2321-B-038-004; MOST110-2321-B-038-004). The funders had no role in the study design, data collection and analysis, publication decision, or manuscript preparation.

**Institutional Review Board Statement:** This study has been approved by the TMU-Joint Institutional Review Board (Project number: TMU-JIRB N202101080).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The authors obtained data from the Taiwan Cancer Registry (TCR) database and the Taipei Medical University Clinical Research Database (TMUCRD).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**



#### **References**


### *Article* **Prognostication in Advanced Cancer by Combining Actigraphy-Derived Rest-Activity and Sleep Parameters with Routine Clinical Data: An Exploratory Machine Learning Study**

**Shuchita Dhwiren Patel 1,\*, Andrew Davies 2, Emma Laing 3, Huihai Wu 3, Jeewaka Mendis 4 and Derk-Jan Dijk 5,6**


**Simple Summary:** Survival prediction is an important aspect of oncology and palliative care. Measures of night-time relative to daytime activity, derived from a motion sensor, have shown promise in patients receiving chemotherapy. Measuring rest-activity and sleep may, therefore, result in improved prognostication in advanced cancer patients. Fifty adult outpatients with advanced cancer were recruited, and rest-activity, sleep, and routine clinical variables were collected over a period of just over one week and used in machine learning models. Our findings confirmed the importance of some well-established survival predictors and identified new ones. We found that sleep-wake parameters may be useful for prognostication in advanced cancer patients when combined with routinely collected data.

**Abstract:** Survival prediction is integral to oncology and palliative care, yet robust prognostic models remain elusive. We assessed the feasibility of combining actigraphy, sleep diary data, and routine clinical parameters to prognosticate. Fifty adult outpatients with advanced cancer and an estimated prognosis of <1 year were recruited. Patients were required to wear an Actiwatch® (wrist actigraph) for 8 days and to complete a sleep diary. Univariate and regularised multivariate regression methods were used to identify predictors from 66 variables and construct predictive models of survival. A total of 49 patients completed the study, and 34 patients died within 1 year. Forty-two patients had disrupted rest-activity rhythms (dichotomy index (I < O) ≤ 97.5%), but the I < O did not have prognostic value in univariate analyses. The Lasso-regularised algorithm was optimal and able to differentiate participants with shorter/longer survival (log rank *p* < 0.0001). Predictors associated with increased survival time were: time of awakening, sleep efficiency, subjective sleep quality, clinician's estimate of survival, global health status score, and haemoglobin. A shorter survival time was associated with self-reported sleep disturbance, neutrophil count, serum urea, creatinine, and C-reactive protein. Applying machine learning to actigraphy and sleep data combined with routine clinical data is a promising approach for the development of prognostic tools.

**Keywords:** biomarkers; circadian; machine learning; palliative care; prognosis; survival

**Citation:** Patel, S.D.; Davies, A.; Laing, E.; Wu, H.; Mendis, J.; Dijk, D.-J. Prognostication in Advanced Cancer by Combining Actigraphy-Derived Rest-Activity and Sleep Parameters with Routine Clinical Data: An Exploratory Machine Learning Study. *Cancers* **2023**, *15*, 503. https://doi.org/10.3390/cancers15020503

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 10 August 2022; Revised: 23 December 2022; Accepted: 6 January 2023; Published: 13 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

Prognostication (i.e., estimation of survival) is an important aspect of the management of patients with cancer. It is of particular importance in advanced cancer, where it has immediate implications for clinicians' decisions about the treatment of the cancer, treatment of co-morbidities, so-called "ceilings of care", and referral to palliative care services [1,2]. Furthermore, it has implications for patients (and families) in terms of current decision-making, advance care planning, and "getting one's affairs in order".

Healthcare professionals are inaccurate prognosticators, often overestimating survival [3], and the accuracy of their estimates decreases as the length of survival increases [2]. Healthcare professionals are relatively good at predicting whether patients will die within a couple of days, but not so good at predicting whether patients will live for a couple of months or longer.

Various prognostic tools/algorithms have been developed to improve prognostication in patients with cancer [2,4]: these tools vary in their content (e.g., objective items only; subjective items only; objective and subjective items). However, none of these tools have been shown to be consistently better than clinicians' predictions of survival [2]. Current prognostication tools often include measures such as performance status, symptoms, venous blood sample data, and clinician-predicted survival [2,5]. The integration of other physiological and behavioural parameters, such as rest-activity rhythms ("diurnal or circadian") and sleep parameters, is yet to be considered in prognostic models. (The term 'circadian' strictly refers to rhythms that persist in constant conditions. Rhythms assessed in the presence of environmental rhythms, as in the present study, are referred to as diurnal or 24 h rhythms, although increasingly these rhythms are also referred to as 'circadian'.)

Sleep-wake cycles and circadian rhythms have a key role in sustaining normal body function and homeostasis [6]. Deterioration of rest-activity rhythmicity (loss of rhythmicity) and fragmentation of the sleep-wake cycle may be a marker of deterioration of health and, indeed, a predictor of illness including cancer, as well as cancer survival [7–9].

Several studies in cancer patients have incorporated actigraphy to objectively assess daytime activity, 24 h variation in rest-activity, as well as nocturnal and daytime sleep [7]. A number of actigraphy-derived parameters have been used to quantify rest-activity rhythms in this population including acrophase (time of peak activity), amplitude (peak to nadir difference, i.e., height of activity rhythm peak), mesor (average activity over a 24 h period), and the "dichotomy index" (I < O). Of these parameters, the I < O is one of the most commonly studied rest-activity measures in cancer studies. The I < O has been identified as an independent prognostic biomarker for overall survival, particularly in patients with metastatic colorectal cancer [10,11]. The I < O is defined as the percentage of the activity counts measured when the patient is in bed that are inferior to the median of the activity counts measured when the patient is out of bed [12]. An I < O of ≤97.5% is indicative of a disrupted rest-activity circadian rhythm (i.e., increased fragmented sleep and reduced daytime activity patterns) [7]. However, the I < O has not been used to prognosticate per se, either alone or in combination with other items. Furthermore, few studies have explored the potential of actigraphy-derived sleep parameters as prognostic markers in advanced cancer patients [13].
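Because the I < O is defined purely in terms of in-bed and out-of-bed activity counts, it is straightforward to compute. The sketch below is a minimal illustration of the definition just given, using synthetic epoch counts rather than real actigraphy data.

```python
# Minimal sketch of the dichotomy index (I < O) as defined above: the
# percentage of in-bed activity counts that fall below the median of the
# out-of-bed activity counts. Epoch data here are synthetic.
import numpy as np

def dichotomy_index(in_bed_counts, out_of_bed_counts):
    """Return I < O as a percentage (0-100)."""
    median_out = np.median(out_of_bed_counts)
    return 100.0 * np.mean(np.asarray(in_bed_counts) < median_out)

rng = np.random.default_rng(0)
in_bed = rng.poisson(5, size=480)        # one-minute epochs during the sleep period
out_of_bed = rng.poisson(120, size=960)  # one-minute epochs during wakefulness

i_lt_o = dichotomy_index(in_bed, out_of_bed)
print(f"I < O = {i_lt_o:.1f}%  (<= 97.5% suggests a disrupted rest-activity rhythm)")
```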

The first aim of this study was to investigate the feasibility of using the I < O and other actigraphy-derived parameters, as stand-alone items, to prognosticate in patients with advanced cancer. The second aim of the study was to determine whether the I < O and other actigraphy and sleep parameters should be combined with established prognostic indicators, e.g., Eastern Cooperative Oncology Group performance status (ECOG-PS), the modified version of the Glasgow Prognostic Score (mGPS), and the Prognosis in Palliative Care Study (PiPS)-B, as well as putative prognostic variables from routine clinical data derived from blood samples, to improve prognostic accuracy. To achieve this second aim we deployed regularised regression, a supervised machine learning approach that overcomes some of the limitations of classical multiple regression, to identify effective prognostic indicators and develop more robust prognostic algorithms [14].

#### **2. Materials and Methods**

#### *2.1. Study Design and Setting*

The study was a prospective observational study conducted in a medium-sized district general hospital/cancer centre in the United Kingdom. The study was sponsored by the Royal Surrey County Hospital and received ethical approval from the London–Bromley REC (reference number—16/LO/0243). The study was registered on the ClinicalTrials.gov registry (reference number—NCT03283683). The study was funded by the Palliative Care Research Fund (Prof. Davies—Royal Surrey County Hospital), including an unrestricted donation from the family of Mr. John Spencer.

#### *2.2. Study Participants*

Participants were recruited from outpatients at the study site. All patients that met the criteria for the study were eligible for entry into the study (convenience sampling, consecutive recruitment). The inclusion criteria were: (a) age ≥ 18 years; (b) diagnosis of locally advanced/metastatic cancer; (c) clinician estimated prognosis of more than 2 weeks but less than 1 year; and (d) known to a specialist palliative care team. The exclusion criteria were: (a) cognitive impairment; (b) physical disability that affected general activity; and (c) physical disability that affected non-dominant arm movement.

Patients were diagnosed with locally advanced/metastatic cancer according to NHS guidelines, which consider TNM staging. All patients who met the inclusion criteria were deemed eligible for entry into the study. Potentially eligible patients were identified by the clinical team and approached by a member of the research team and invited to participate in the study. Any patient referred to the specialist palliative care team was expected to die within the next twelve months (as per the General Medical Council definition for end-of-life care [14]).

#### *2.3. Routine Data Collection*

Written informed consent was obtained from participants prior to entry into the study. The initial review (day 0) involved a collection of routine clinical data: patient demographics, information about cancer diagnosis/treatment, information about co-morbidities/medication, assessment of Eastern Cooperative Oncology Group performance status (ECOG-PS) (by clinician and patient) [15], and completion of the Abbreviated Mental Test Score [16], the Memorial Symptom Assessment Scale—Short Form (MSAS-SF) [17], and the Global Health Status question from the PiPS-B algorithm [18]. The participant's pulse was measured (as part of the PiPS), and a venous blood sample was taken to measure haemoglobin, white blood cell count (WBC), neutrophil count, lymphocyte count, platelet count, sodium, potassium, urea, creatinine, albumin, alanine aminotransferase (ALT), alkaline phosphatase (ALP), and C-reactive protein (CRP). The final review (day 8) involved further assessment of ECOG-PS (by clinician and patient), completion of the MSAS-SF, the Pittsburgh Sleep Quality Index (PSQI) [19], and a patient acceptability questionnaire. The blood test results were used to complete the PiPS-B scoring algorithm, and serum CRP and albumin were used to calculate the mGPS [20].

#### *2.4. Wrist Actigraphy and Consensus Sleep Diary*

Wrist actigraphy was used to measure physical activity and standard sleep measures. Participants were fitted with the Actiwatch Spectrum Plus® (Philips Respironics, Bend, OR, USA) on the non-dominant arm after the initial review (day 0) and were instructed to wear the device for eight consecutive 24 h periods. The Actiwatch Spectrum Plus® is a CE-marked device with an accelerometer (i.e., motion sensor) that samples movement at 32 Hz [21] with a sensitivity of 0.025 G (at the 2 count level). Participants were also given a Consensus Sleep Diary in order to provide confirmatory information about specific sleep parameters (e.g., number of awakenings, time of final awakening) [22]: the "diary" was completed for eight consecutive sleep periods. The Actiwatches were configured and data were retrieved using device-specific software (Actiware version 6.0.9: Philips Respironics, Bend, OR, USA). The Actiwatches were adjusted to provide an epoch length (sampling interval) of one minute, which is the most common epoch length used in studies of cancer patients [23]. The Consensus Sleep Diary was used in conjunction with the Actiwatch to assist in actigraphy data interpretation (i.e., to determine the major sleep/wake periods) [24].

The data from the Actiwatches were downloaded into an Excel spreadsheet, and the following rest-activity parameters were calculated using a study-specific SAS programme (SAS® Version 9.4 Statistical Analysis Software, SAS Institute, Cary, NC, USA): I < O; r24 (an autocorrelation coefficient at 24 h, that is, "a measure of the regularity and reproducibility of the activity pattern over a 24 h period from one day to the next") [25]; mean daily activity (MDA); and mean activity during daytime wakefulness. MDA was calculated as the average number of wrist movements per minute throughout the recording time [25], and the mean activity during wakefulness was calculated as the mean activity score (counts/minute) during the time period between two major sleep period intervals [26]. In addition, the following sleep parameters were calculated both automatically from the Actiwatches (using the Actiware sleep scoring algorithm) and manually from the sleep diary [27]: bedtime (BT), get-up time (GUT), time in bed (TIB), sleep onset latency (SOL), total sleep time (TST), sleep efficiency (SE), wake after sleep onset (WASO), and number of awake episodes (NA). The sleep parameters derived solely from the sleep diary were: time tried to sleep, time of final awakening, and terminal awakening (TWAK) [22]. See Table 1 for definitions of the sleep parameters.
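As an illustration of two of these parameters, the sketch below computes MDA and the 24 h autocorrelation (r24) from a synthetic one-minute-epoch activity series. The authors' actual computations used a study-specific SAS programme, so this Python version is only an assumed equivalent.

```python
# Illustrative sketch: mean daily activity (MDA) and the 24 h autocorrelation
# coefficient (r24) from a one-minute-epoch activity series (synthetic data).
import numpy as np

def mean_daily_activity(counts):
    """Average wrist movements per minute over the whole recording."""
    return float(np.mean(counts))

def r24(counts, epochs_per_day=1440):
    """Autocorrelation of the activity series at a lag of 24 h."""
    x, y = counts[:-epochs_per_day], counts[epochs_per_day:]
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(1)
minutes = np.arange(8 * 1440)                      # eight consecutive 24 h periods
diurnal = 60 + 50 * np.sin(2 * np.pi * minutes / 1440)
activity = rng.poisson(np.clip(diurnal, 1, None))  # synthetic rest-activity pattern

print("MDA:", round(mean_daily_activity(activity), 1), "counts/min")
print("r24:", round(r24(activity), 2))
```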


**Table 1.** Definitions of actigraphy-derived sleep/consensus sleep diary parameters [22,26,28].

#### *2.5. Follow-Up*

During the study period (from time of first patient recruited to six months after last patient recruited), participants' survival status (and date of death, if applicable) was determined every three months by reviewing the hospital clinical records, and/or contacting the general practitioner.

#### **3. Statistical Analyses**

The sample size for the study (*n* = 50) was derived from guidance on sample sizes for feasibility studies (and represents the upper range) [29]. Statistical support was provided by statisticians within the Research Design Service South-East (based in the Clinical Trials Unit at the University of Surrey). Descriptive statistics were used to summarise the data (e.g., mean and standard error; median and range). The Intraclass Correlation Coefficient (ICC) was used to assess the robustness of the I < O as a marker of the rest-activity rhythm, and its stability throughout the actigraphy recording. Spearman's rank correlation coefficient was used to measure the association between the I < O and other actigraphy-derived parameters. The Spearman's rank correlation '*r*' values were defined as follows: 0 ≤ *r* < 0.3 indicated a negligible correlation, 0.3 ≤ *r* < 0.5 a low correlation, 0.5 ≤ *r* < 0.7 a moderate correlation, 0.7 ≤ *r* < 0.9 a high correlation, and 0.9 ≤ *r* ≤ 1 a very high correlation [30]. Kaplan–Meier plots, a non-parametric statistical method, were used to estimate the probability of survival past a given time point, along with the log rank test to compare the survival distributions of two groups. Statistical significance was evaluated at the 5% level.
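The Kaplan–Meier/log-rank workflow described above can be reproduced with the lifelines Python package; the sketch below uses synthetic survival times for two hypothetical groups, censored at one year.

```python
# Minimal sketch of a Kaplan-Meier estimate plus a log rank comparison
# (lifelines); group labels and survival times are synthetic.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(2)
t_a = rng.exponential(120, 20); e_a = t_a < 365   # event observed if death < 1 year
t_b = rng.exponential(240, 20); e_b = t_b < 365
t_a, t_b = np.minimum(t_a, 365), np.minimum(t_b, 365)  # censor at one year

kmf = KaplanMeierFitter()
kmf.fit(t_a, event_observed=e_a, label="group A")
print("median survival, group A:", kmf.median_survival_time_)

result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print("log rank p =", result.p_value)             # significance evaluated at 5%
```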

The "per protocol set" refers to participants that wore the Actiwatch for the eight consecutive 24 h periods with the corresponding sleep diary, whilst the "full analysis set" refers to participants that wore the Actiwatch for at least three consecutive 24 h periods (i.e., 72 h) and completed the corresponding sleep diary for the actigraphy rest-activity and sleep analysis, or for at least three consecutive or non-consecutive nights in the sleep diary for the subjective sleep analysis (i.e., calculation of the sleep diary parameters).

#### **4. Machine Learning Methods and Data Analysis**

Cox regression has been the standard approach to survival analysis in oncology. However, Cox regression has a number of limitations. In particular, it is not an adequate approach for situations in which the number of predictors is high relative to the number of observations, as is the case in this feasibility study. We therefore opted to use simple alternative methods that can (1) adequately deal with situations in which the number of predictors is large relative to the number of observations and (2) yield models that are interpretable, i.e., are not 'black box models'. Penalised (Regularised) regression models represent such an approach.

A supervised machine learning algorithm was used to develop a predictive model, where the collated subjective and objective parameters (i.e., routine clinical data and actigraphy-derived rest-activity and sleep parameters) were individual predictor variables and survival was the 'response' variable [31]. Sixty-six predictor variables were tested for potential predictive value (Appendix A, see Table A1 for descriptive statistics of the numerical predictor variables). Overall survival was defined as the time from initial review (day 0) to death or until 14 May 2020 for patients that remained alive until the end of the study.

#### *4.1. Machine Learning Dataset*

All patients recruited into the study (*n* = 50) were used for the machine learning analysis. The predictor variables were classified into the relevant variable type (e.g., binary, categorical\_nominal, etc.) and entered into a .csv file in Excel. Binary variables, such as 'use of opioid analgesia', were transformed into dummy variables (0 or 1). Categorical\_ordinal variables with a numerical ranking, such as ECOG-PS, were labelled using the 'LabelEncoder' approach, where the output integer value from the LabelEncoder function was used to reflect the ordering of the original integer. Categorical\_ordinal variables with non-numerical values, such as PSQI sleep disturbance, were assigned a numerical ranking. Numerical\_continuous variables involving sleep/wake times were entered in the 24 h format. Missing data values were imputed with the average of the group or with the corresponding subjective/objective data from the same participant. Missing data accounted for <4% of the dataset.
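A minimal sketch of this preprocessing pipeline, assuming illustrative column names and pandas/scikit-learn rather than the authors' exact tooling, is shown below.

```python
# Sketch of the preprocessing described above: dummy coding, ordinal label
# encoding, manual ranking, and mean imputation. Column names are illustrative.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "opioid_analgesia": ["yes", "no", "yes", "no"],          # binary
    "ecog_ps":          [1, 3, 2, None],                     # ordinal, numeric
    "psqi_quality":     ["good", "bad", "very good", "bad"]  # ordinal, non-numeric
})

df["opioid_analgesia"] = (df["opioid_analgesia"] == "yes").astype(int)  # 0/1 dummy
df["ecog_ps"] = df["ecog_ps"].fillna(df["ecog_ps"].mean())              # mean imputation
quality_rank = {"bad": 0, "good": 1, "very good": 2}                    # manual ranking
df["psqi_quality"] = df["psqi_quality"].map(quality_rank)

# LabelEncoder, as used in the paper for numerically-ranked ordinal variables
df["ecog_ps_encoded"] = LabelEncoder().fit_transform(df["ecog_ps"])
print(df)
```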

#### *4.2. Regularised Regression Methods*

Regularised regression was used to reduce "overfitting" and aid the generalisability of the model. 'Regularisation' corresponds to a penalty that limits the overall weight that can be assigned across all predictor variables in the model, which reduces model complexity (compared to traditional multivariate regression). For some regularised regression approaches, the penalty can drive the weight of a variable to zero, effectively selecting the optimal combination of predictor variables that can be used to predict the given outcome.

Here, three regularised multivariate regression methods were applied and compared: ridge regression, least absolute shrinkage and selection operator (Lasso), and elastic net. The ridge regression algorithm includes all the predictor variables, shrinking the coefficients towards (but not to) zero in a continuous manner [32]. The Lasso-derived algorithm combines the method of shrinkage with the sub-selection of predictor variables, using an 'L1 norm' penalty [32,33], creating a 'sparse' model (i.e., selecting only a few variables from the dataset) [32]. The elastic net algorithm is broadly a combination of the ridge and Lasso [34]. This method simultaneously performs continuous shrinkage and feature selection, selecting groups of correlated variables, using both 'L1 norm' and 'L2 norm' penalties [34]. Highly correlated predictor variables are averaged and entered into the model to remove any deviances caused by extreme correlations [35]. Since survival data are censored, i.e., at the end of the observation period some participants may still be alive, we applied regularised Cox regression using the glmnet package in R.
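The paper fitted these models with the glmnet package in R; the sketch below shows an assumed Python equivalent of a penalised Cox model using lifelines, where `l1_ratio` moves the penalty between ridge-like (0) and Lasso-like (1) behaviour. The data frame is synthetic, not the study data.

```python
# Hedged sketch of a penalised Cox regression with lifelines; the original
# analysis used glmnet in R. Covariate names are illustrative only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 50
df = pd.DataFrame(rng.normal(size=(n, 5)),
                  columns=["sleep_eff", "haemoglobin", "urea", "crp", "neutrophils"])
df["duration"] = rng.exponential(200, n)          # survival time in days
df["event"] = rng.integers(0, 2, n)               # 1 = died, 0 = censored

cph = CoxPHFitter(penalizer=0.5, l1_ratio=1.0)    # Lasso-type penalty
cph.fit(df, duration_col="duration", event_col="event")
print(cph.params_)                                # sparse coefficient vector
```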

#### *4.3. Model Development*

The models were validated using a *k*-fold (10 folds) cross-validation approach [32]. For each of the 50 individuals, the predicted survival was based on a model constructed on the other *k* − 1 folds, i.e., the model was blind to the participant and the participant did not contribute to the estimation of the prediction. All analyses were carried out within the statistical computing environment R (version 3.6.2). For the machine learning (ridge, Lasso, and elastic net regression), the package glmnet (version 2.0) was used. An exhaustive search was performed for the lambda producing the minimum Mean Cross-Validated Error (CVM). All subjects were used as the training set to build a final model, and *k*-fold cross-validation was then performed to assess performance (CVM). Analyses were performed with different settings of the elastic net mixing parameter (alpha): elastic net (alpha = 0.5), Lasso (alpha = 0.99), and ridge (alpha = 0.01). The models generated a predicted hazard, which was compared to the actual survival in days using Pearson's correlation coefficient. To estimate the intra-variable variation in their contribution to the predictor, we computed the mean cross-validated error of the weights of each of the variables that were consistently identified in all 50 participants.
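A hedged sketch of the cross-validated search over the penalty strength (lambda) might look as follows in Python; the authors used cv.glmnet's CVM, whereas this version scores each held-out fold by its partial log-likelihood, and the data are synthetic.

```python
# Sketch of a k-fold (k = 10) search over the penalty strength, scoring each
# fold out-of-sample with lifelines; not the authors' R/glmnet pipeline.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 50
df = pd.DataFrame(rng.normal(size=(n, 5)),
                  columns=["sleep_eff", "haemoglobin", "urea", "crp", "neutrophils"])
df["duration"] = rng.exponential(200, n)
df["event"] = rng.integers(0, 2, n)

def cv_score(data, penalizer, l1_ratio, k=10):
    """Mean out-of-fold partial log-likelihood over k folds."""
    scores = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(data):
        cph = CoxPHFitter(penalizer=penalizer, l1_ratio=l1_ratio)
        cph.fit(data.iloc[tr], duration_col="duration", event_col="event")
        scores.append(cph.score(data.iloc[te]))
    return float(np.mean(scores))

lambdas = np.logspace(-2, 1, 10)
best = max(lambdas, key=lambda lam: cv_score(df, lam, l1_ratio=0.99))  # Lasso-like alpha
print("selected penalty strength:", best)
```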

#### **5. Results**

A total of 50 patients were recruited to the study, and 49 participants completed the study (Figure 1): the full analysis set consisted of 44 participants, whilst the per protocol set consisted of 37 participants. See Table 2 for characteristics of the participants. A total of 46 participants were followed up for 12 months (40 in the full analysis set, 33 in the per protocol set), and 34 died within this time period (28 in the full analysis set, 22 in the per protocol set). Unless otherwise stated, the following results relate to the full analysis set.

**Figure 1.** Study flow chart.

**Table 2.** Participant characteristics.


Note: Percentages may not sum to 100 due to rounding.

#### *5.1. Acceptability of Actigraphy and Sleep Diary*

Actigraphy data were missing from one participant due to a technical problem. Forty-two (84%) participants reported that the Actiwatch was "comfortable to wear", and only four (8%) reported that the Actiwatch interfered with their normal activities. No adverse effects were reported from using the Actiwatch. Fourteen (28%) participants reported that the Consensus Sleep Diary was difficult to complete, and two (4%) subjects reported that the diary interfered with their normal activities.

#### *5.2. Univariate Analyses of Actigraphy Parameters*

5.2.1. Characteristics of the Dichotomy Index (I < O) and Correlation with Other Actigraphy and Sleep Parameters

Table 3 shows the results for the I < O. Forty-two (95%) participants had an I < O of ≤97.5%, indicating a disrupted rest-activity circadian rhythm [7]. The I < O can be considered a stable variable since the intraclass correlation coefficient for values obtained over eight days using the per protocol set was 0.93 (95% CI: 0.88–1.00; *p* < 0.0005), which is considered an "excellent" correlation [36]. In fact, there was a "high" positive correlation between the I < O for the first three days (72 h) and for the full eight days (Spearman's correlation: *r* = 0.82; *p* < 0.0005) [31]. Moreover, there was a "high" positive correlation between the I < O on weekdays and on the weekend (Spearman's correlation: *r* = 0.76; *p* < 0.0005). Additionally, there was a "very high" positive correlation between the I < O calculated using 24 h of data and the I < O calculated using 20 h of data, i.e., excluding the one-hour periods before/after going to bed and the one-hour periods before/after getting out of bed (Spearman's correlation: *r* = 0.98; *p* < 0.0005).

**Table 3.** Dichotomy Index (I < O) data.


There was a "moderate" positive correlation between the I < O and the r24 (Spearman's correlation: *r* = 0.66; *p* < 0.0005), and the mean activity during wakefulness (Spearman's correlation: *r* = 0.51; *p* < 0.0005). However, there was only a "low" positive correlation between the I < O and the mean daily activity (Spearman's correlation: *r* = 0.43; *p* = 0.003). Other standard actigraphy parameters correlated with the I < O were SE, i.e., number of minutes of sleep divided by total number of minutes in bed (Spearman's correlation: *r* = 0.47, "low" correlation; *p* = 0.001), and WASO, i.e., number of minutes awake after sleep onset during sleep period (Spearman's correlation: *r* = −0.51, "moderate" correlation; *p* < 0.0005).

#### 5.2.2. I < O: Predictor of Survival and Correlation with ECOG-PS

Amongst participants that completed one year of follow-up (*n* = 40), there was no significant difference in overall survival between those separated into two groups (based on the median I < O; log rank test, *p* = 0.917) or four groups (based on the quartiles of the I < O; log rank test, *p* = 0.838). However, the I < O had a "moderate" negative correlation with the physician-assessed ECOG-PS (Spearman rank correlation: *r* = −0.63; *p* < 0.0005). The ECOG-PS was an independent prognostic indicator in this cohort of patients (log rank test, *p* < 0.0005). The median survival for participants with an ECOG-PS of 1 (end of study) was 141 days, ECOG-PS of 2 was 135 days, ECOG-PS of 3 was 57 days, and ECOG-PS of 4 was 17 days.

#### 5.2.3. Autocorrelation Coefficient at 24 h (r24)

The median r24 was 0.16 (range 0.04–0.37). Amongst participants that completed one year of follow-up (*n* = 40), there was no significant difference in overall survival between those separated into two groups (based on the median r24; log rank test, *p* = 0.318), or four groups (based on the quartiles of the r24; log rank test, *p* = 0.800).

#### 5.2.4. Other Actigraphy Parameters

None of the other actigraphy-derived sleep parameters were associated with a decreased overall survival: (a) TIB (log rank, *p* = 0.574: based on group median of 9 h 29 min); (b) TST (log rank, *p* = 0.147: based on normative cut-off value of ≥6.5 h [28]); (c) SOL (log rank, *p* = 0.283: based on normative cut-off value of ≤30 min [28]); (d) SE (log rank, *p* = 0.224: based on normative cut-off value of ≥85% [28]); (e) WASO (log rank, *p* = 0.549: based on normative cut-off value of >30 min [28]); and (f) NA (log rank, *p* = 0.972: based on group median of 23 episodes).

#### *5.3. Multivariate Predictors of Survival: Machine Learning Results*

In the machine learning dataset, 46 participants had died within the specified follow-up period (i.e., by 14 May 2020). The Lasso model selected 22 predictor variables, with 14 variables consistently selected in all 50 participants during the process of validation (Figure 2). These comprised eight predictor variables associated with a greater survival time and six predictor variables associated with a reduced survival time. The predictor variables associated with increased survival time, i.e., a smaller hazard (in order of the coefficient associated with the predictor variable), were: later sleep diary time of final awakening, later actigraphy get-up time, longer PiPS-B clinician's estimate of survival, better PSQI subjective sleep quality, greater PiPS-B global health status score (indicating better health), better actigraphy sleep efficiency, and higher haemoglobin values. The variables associated with reduced survival time were: more frequent PSQI sleep disturbance (waking in the middle of the night/early morning), higher neutrophil count, and higher serum urea, serum creatinine, and serum C-reactive protein. Unexpectedly, a larger MSAS-SF total symptom distress was associated with a lower risk of death, and a higher I < O was associated with a worse prognosis. The predicted median hazard was 0.00052, and the model was able to successfully differentiate between participants with a shorter/longer overall survival (log rank *p* < 0.0001) (Figure 3). Figure A1 shows the correlation between the actual survival and predicted hazard (Pearson's correlation coefficient *r* = −0.5; *p* = 0.0002).

The ridge model consistently identified 28 predictor variables in all 50 participants (Figure 4). During the process of validation, the top 10 variables consistently selected involved seven predictors associated with a longer survival time and three predictors associated with a shorter survival time. The seven predictor variables associated with longer survival time (in order of the coefficient associated with the predictor variable) were: actigraphy get-up time, sleep diary time of final awakening, sleep diary get-up time, PSQI usual get-up time, PiPS-B clinician's estimate of survival, PSQI subjective sleep quality, and PiPS-B global health status score. The three predictor variables associated with shorter survival time were: use of opioid analgesia, modified Glasgow Prognostic Score, and physician-assessed ECOG-PS (day 8). The predicted median hazard was 0.44; however, there was no significant difference in overall survival when a median split was applied (log rank, *p* = 0.0914) (Figure 5). Figure A2 shows the correlation between the actual survival and predicted hazard (Pearson's correlation coefficient *r* = −0.5; *p* = 0.0002).

**Figure 2.** The mean cross-validated error (CVM) of predictor variables for hazard selected by the Lasso model.

**Figure 3.** Kaplan–Meier curve comparing survival probability predicted by the Lasso-derived algorithm (log rank, *p* < 0.0001).

**Figure 4.** The mean cross-validated error (CVM) of predictor variables for hazard selected by the ridge model.

**Figure 5.** Kaplan–Meier curve comparing survival probability predicted by the ridge-derived algorithm (log rank, *p* = 0.0914).

The elastic net model selected 10 predictor variables, with six variables being consistently selected during the process of validation. The two consistently selected predictor variables associated with a longer survival time were (in order of the coefficient associated with the predictor variable): later actigraphy get-up time and greater PiPS-B global health status score. The four consistently selected predictor variables associated with a shorter survival time were: higher serum urea, neutrophil count, serum C-reactive protein, and serum creatinine (Figure A3). The predicted median hazard was 0.408, but there was no significant difference in overall survival (log rank, *p* = 0.9877) (Figure A4). Figure A5 shows the correlation between the actual survival and predicted hazard (Pearson's correlation coefficient *r* = −0.08; *p* = 0.5808).

#### **6. Discussion**

The results of this study show that univariate approaches to survival prediction, based on, for example, the I < O, are not very powerful, whereas multivariate approaches appear to hold promise. To the best of our knowledge, this is the first study describing the application of supervised machine learning methods, involving a combination of actigraphy-derived rest-activity and sleep parameters and data collected in routine clinical practice (i.e., simple questionnaires such as the MSAS-SF, ECOG-PS, and PSQI, and venous blood sampling), to prognosticate in patients with advanced cancer receiving supportive and palliative care [37]. Our study confirmed certain established predictors of survival in this group of patients, identified some novel ones, and points to the importance of sleep characteristics for prognostication. The results of the study also confirm that clinicians are inaccurate prognosticators [3], since 11 (24%) participants were still alive at 1 year (despite the inclusion criterion of a clinician-estimated prognosis of more than 2 weeks but less than 1 year).

The literature had suggested that actigraphy-derived parameters, and the I < O index in particular, could be used as predictors because a low I < O is associated with increased morbidity (worse symptoms, worse quality of life) and with decreased survival [7]. At the outset of this study, we therefore focused on the I < O and other parameters describing the robustness of the rest-activity rhythm. We indeed observed a very high prevalence (i.e., 95%) of disrupted rest-activity rhythms in these advanced cancer patients, which is much higher than the reported prevalence of 19.1–54.9% [7]. This disparity undoubtedly reflects different populations, with our population having more advanced disease (and worse performance status) than previous studies [11,38]. However, in the univariate analyses of the data in our study, there was no direct association between the I < O and survival. Furthermore, other actigraphy-derived parameters, when used in isolation, were also not very accurate predictors in this population.

However, the results of the study suggest that novel models developed through machine learning can facilitate improvements in prognostication. Penalised regression methods implement a feature selection strategy, providing a combination of subjective and objective predictor variables of survival that are ranked based on their contribution to the model. The models manage collinearity within the dataset, which is particularly useful in datasets involving terminal cancer patients, where the number of features often exceeds the relative sample size. The best performing method was Lasso regression, which reduces the coefficients of variables with a minor contribution to zero and thereby creates a simple 'model' with only a few variables. Sleep parameters were amongst the most important variables, not only in the Lasso model but also in the more complex elastic net and ridge models. These measures primarily represented positive predictors of survival. Sleep diary final awakening (Lasso and ridge) and actigraphy-derived GUT (all models) were found to have particular prognostic relevance in our study, suggesting that a later sleep diary-determined 'time of final awakening' and a later actigraphy-derived 'get-up time' are associated with a lower risk of death and improved survival. Furthermore, actigraphy-derived SE, which may be considered an objective measure of sleep quality, was selected as a positive predictor of survival in the Lasso model (i.e., greater sleep efficiency was associated with enhanced survival) for our population. Whilst actigraphy-derived sleep quality, as opposed to sleep quantity, has been reported to have prognostic significance in advanced breast cancer patients [13], we identified quantitative sleep measures as important contributors to survival prediction.

Studies have reported actigraphy-derived circadian disruption [10,12,39] and fragmented sleep [13] to have prognostic implications in cancer patients, yet little is known about the prognostic impact of subjective sleep measures. A recent study identified the PSQI sleep duration component as a prognostic indicator in a cohort of advanced hepatobiliary/pancreatic cancer patients [40], yet a novel finding in our study was the selection of other sleep parameters from the PSQI: (1) usual get-up time and (2) subjective sleep quality, where a later get-up time and very good sleep quality were associated with longer overall survival, and (3) PSQI sleep disturbance components—pain, cannot breathe comfortably, and wake up in the middle of the night or early morning—were associated with poorer survival. Furthermore, subjective sleep parameters, as opposed to actigraphy-derived sleep parameters, were more commonly identified in all participants in the ridge model.

Venous blood sample measurements were also significant contributors to predicting survival in our study. Previous studies have reported moderate evidence for the prognostic significance of an elevated C-reactive protein (CRP) and leucocytosis being associated with a shorter survival [1,5,18,41]. Whilst our study was able to echo these findings, we were able to further identify novel biomarkers, such as elevated urea and serum creatinine, that may also be associated with a poorer survival, and raised haemoglobin, which may be associated with a lower risk of death. Blood sampling is generally deemed 'inappropriate' when patients are in their last days/weeks of life [42]; regardless, only one of the 94 patients screened for our study declined participation. Our findings endorse further evaluation of biological parameters from venous blood sample data, as they may be beneficial in improving prognostication in these patients.

Although our multivariate findings counterintuitively imply that a higher I < O is associated with a shorter predicted survival time, all participants in our population had poor health, i.e., an I < O of <99%, which has recently been identified as an optimal cut-off for distinguishing between healthy controls and patients with advanced cancer [43]. Further inspection of our data identified that all our participants, whether they had shorter or longer survival, had disrupted rest-activity rhythms; equally, both groups had moderate symptom distress as measured by the TMSAS, as would be expected in an advanced cancer population. Therefore, whilst it may be a simple way of quantifying rest-activity rhythms, the I < O may be a more meaningful prognostic indicator during the earlier trajectories of cancer, as opposed to the progressive stages.

In summary, our data suggest that subjective sleep parameters, measured using the Consensus Sleep Diary and the PSQI, and actigraphy-derived sleep parameters may be especially useful when combined with routine clinical data using machine learning approaches, with no substantial additional costs or burden to the health service. Thus, further investigation of these parameters as prognostic indicators is warranted. Indeed, we plan to undertake a larger (definitive) study in the near future. Sleep-wake disturbances and circadian dysregulation are deemed to have a reciprocal relationship [43,44], and our findings are suggestive of sleep/circadian rhythm parameters as potential prognostic indicators. Whether improving a patient's sleep disturbance may improve overall survival remains an open question. Rehabilitation of the circadian system by means of behavioural and pharmacological strategies, to re-synchronise the circadian system, may ultimately improve circadian function and sleep, as well as overall survival [44,45].

The Lasso model was the only model able to successfully differentiate between long and short survival in our study, and the correlation between observed and predicted hazard was only significant for the Lasso and ridge models. The Lasso model is 'sparse' (i.e., only a few variables from the dataset are selected) [32] and therefore may be favourable if a consolidated model were needed to aid prognostication. However, the Lasso selects at most '*n*' variables before it saturates; therefore, the number of predictors is restricted by the number of observations [32]. The ridge model, therefore, may be beneficial due to its greater inclusivity of variables, at the expense of an increased risk of overfitting. Indeed, the absence of significant results cannot be overlooked given the small sample size in this feasibility study. In the definitive study, all three supervised machine learning methods would be deployed after the recruitment of a larger sample size, as well as the inclusion of additional variables that may be clinically relevant (e.g., stage of disease, number of comorbidities, nutritional status, presence of specific symptoms/problems) [1,2]. More data would enable the robustness of the predictive ability of the models to be assessed, as well as enabling generalisability of our findings with further confidence in our observations.

Interestingly, a recent systematic review described the prediction of survival as a process as opposed to an event, noting that predictors of survival may develop as the disease progresses [5]. Therefore, there may be added value in predicting the trajectory of death, as opposed to the time of death, in future studies. Machine learning approaches would be particularly valuable in such cases, where relevant predictor variables may be identified as the disease trajectory evolves, ultimately enhancing our understanding of prognostication.

A few limitations need consideration. Firstly, our small sample size is unlikely to capture the true variance of the population. Secondly, the Lasso and elastic net models involve only a subset of predictors, and the value of the coefficient associated with each of these predictor variables is dependent on the presence of the other (non-zero) predictor variables in the model. Our results are essentially correlational and demonstrate that the relevant predictor variables (above a non-zero coefficient value) may be associated in a positive or negative way with the risk of death. Thirdly, imputation of missing data values with the sample population average may not have been a true reflection of the individual's actual score; the same applies to using subjective data to impute objective values, particularly if the tools were measuring different timeframes, i.e., actigraphy (over a one-week duration) versus the PSQI questionnaire (which measures, on average, the previous month). The *k*-fold cross-validation approach also has some limitations. As it is executed '*k*' times (where '*k*' is the number of subsets of observations), this approach may not be efficient in a small dataset. Furthermore, *k*-fold cross-validation is likely to have a high variance as well as a higher bias, given the small size of the training set drawn from a small dataset. Therefore, the number '*k*' strongly influences the estimation of the prediction error, and the presence of outliers can lead to a higher variation. Indeed, it can be a challenge to find the appropriate '*k*' to reach a good 'bias-variance' trade-off. In future studies, it will be essential to include an independent validation set.

#### **7. Conclusions**

This study suggests that subjective sleep parameters, measured using the Consensus Sleep Diary and the PSQI, and actigraphy-derived sleep parameters may be useful for prognostication in patients with advanced cancer, and that they may be especially useful when combined with routine clinical data and machine learning approaches.

**Author Contributions:** Conceptualization, A.D. and D.-J.D.; methodology, S.D.P., A.D. and D.-J.D.; data collection, S.D.P.; data interpretation, S.D.P.; supervision, A.D., D.-J.D., E.L., J.M. and H.W.; dichotomy index software and analysis, J.M. and S.D.P.; machine learning software and data analysis, H.W.; data curation, S.D.P. and H.W.; writing—original draft preparation, S.D.P. and A.D.; writing review and editing, S.D.P., A.D., D.-J.D., E.L., J.M. and H.W.; graphical figures—S.D.P., H.W., J.M. and D.-J.D.; supervision, A.D. and D.-J.D.; project administration, S.D.P.; funding acquisition, A.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Palliative Care Research Fund (Davies—Royal Surrey County Hospital), including an unrestricted donation from the family of John Spencer.

**Institutional Review Board Statement:** The study was sponsored by the Royal Surrey County Hospital and received ethical approval from the London-Bromley REC (reference number—16/LO/0243 on 9 May 16). The study was registered on the ClinicalTrials.gov registry (reference number—NCT03283683).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patients to publish this paper.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the nature of the consent obtained from participants.

**Acknowledgments:** The authors would like to thank Simon Skene for his statistical expertise and Victoria Robinson for her invaluable administrative support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Prognostic Parameters for Machine Learning**



**Table A1.** Mean and standard deviation for prognostic parameters for machine learning.




**Figure A1.** Scatterplot showing correlation between actual survival and predicted hazard using the Lasso model.

**Figure A2.** Scatterplot showing correlation between actual survival and predicted hazard using the ridge regression model.

**Figure A3.** The mean cross-validated error (CVM) of predictor variables for hazard selected by the elastic net model.

**Figure A4.** Kaplan–Meier curve comparing survival probability predicted by the elastic net-derived algorithm (log rank, *p* = 0.9877).

**Figure A5.** Scatterplot showing correlation between actual survival and predicted hazard using the elastic net model.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **3D Convolutional Neural Network-Based Denoising of Low-Count Whole-Body 18F-Fluorodeoxyglucose and 89Zr-Rituximab PET Scans**

**Bart M. de Vries 1,\*, Sandeep S. V. Golla 1, Gerben J. C. Zwezerijnen 1, Otto S. Hoekstra 1, Yvonne W. S. Jauw 1,2, Marc C. Huisman 1, Guus A. M. S. van Dongen 1, Willemien C. Menke-van der Houven van Oordt 3, Josée J. M. Zijlstra-Baalbergen 1,2, Liesbet Mesotten 4,5, Ronald Boellaard 1 and Maqsood Yaqub 1**


**Abstract:** Acquisition time and injected activity of 18F-fluorodeoxyglucose (18F-FDG) PET should ideally be reduced. However, this decreases the signal-to-noise ratio (SNR), which impairs the diagnostic value of these PET scans. In addition, 89Zr-antibody PET is known to have a low SNR. To improve the diagnostic value of these scans, a Convolutional Neural Network (CNN) denoising method is proposed. The aim of this study was therefore to develop CNNs to increase the SNR for low-count 18F-FDG and 89Zr-antibody PET. Super-low-count, low-count, and full-count 18F-FDG PET scans from 60 primary lung cancer patients and full-count 89Zr-rituximab PET scans from five patients with non-Hodgkin lymphoma were acquired. CNNs were built to capture the features of and to denoise the PET scans. Additionally, Gaussian smoothing (GS) and bilateral filtering (BF) were evaluated. The performance of the denoising approaches was assessed based on the tumour recovery coefficient (TRC), coefficient of variance (COV; level of noise), and a qualitative assessment by two nuclear medicine physicians. The CNNs had a higher TRC and a comparable or lower COV than GS and BF, and were also the preferred method of the two observers for both 18F-FDG and 89Zr-rituximab PET. The CNNs improved the SNR of low-count 18F-FDG and 89Zr-rituximab PET, with clinical performance almost similar to, or better than, that of the full-count PET. Additionally, the CNNs showed better performance than GS and BF.

**Keywords:** low-count; CNN; denoising; 18F-FDG; 89Zr-antibody

**Citation:** de Vries, B.M.; Golla, S.S.V.; Zwezerijnen, G.J.C.; Hoekstra, O.S.; Jauw, Y.W.S.; Huisman, M.C.; van Dongen, G.A.M.S.; Menke-van der Houven van Oordt, W.C.; Zijlstra-Baalbergen, J.J.M.; Mesotten, L.; et al. 3D Convolutional Neural Network-Based Denoising of Low-Count Whole-Body 18F-Fluorodeoxyglucose and 89Zr-Rituximab PET Scans. *Diagnostics* **2022**, *12*, 596. https://doi.org/10.3390/diagnostics12030596

Academic Editor: Ayman El-Baz

Received: 20 January 2022; Accepted: 24 February 2022; Published: 25 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) is essential in the staging of a broad spectrum of malignancies [1–3]. Currently, a whole-body 18F-FDG PET scan is acquired using a scan duration of 2 min per bed position and an injected activity of 3.7 MBq/kg. A shorter scan duration per bed position could decrease the total scan duration and, therefore, minimize movement artefacts and increase patient comfort and throughput. A reduction of the injected activity would decrease the radiation burden for the patient, and would therefore make it possible to perform more frequent 18F-FDG PET scans per patient for restaging and therapy-response assessments, when scanning children, and/or for non-oncological cases. However, a shorter scan duration and lower injected activity would result in a lower signal-to-noise ratio (SNR). Poor scan quality due to low count (LC) is also observed for 89Zr-antibody PET scans, which are obtained after relatively low injected activity imposed by the radiation burden of 89Zr [4]. Therefore, denoising LC whole-body 18F-FDG and 89Zr-antibody PET scans is of interest for improving image quality.

Traditionally, Gaussian smoothing (GS) has been used to denoise PET images [5]. However, GS reduces the spatial resolution of the images, and therefore, could impair detectability and quantification of small (tumour) lesions [6]. Bilateral filtering (BF) exhibits superior properties in comparison to the more commonly used GS for noise reduction in PET [7]. BF reduces the noise of PET scans, while preserving spatial information (e.g., edges). However, BF parameters are difficult to optimize in a generic way because both an optimized intensity and spatial parameter need to be determined, which depend on both the tracer and site of interest. Therefore, another adaptive/data-driven denoising method with high accuracy is warranted.
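To make the contrast concrete, the sketch below applies a 3D Gaussian filter (scipy) and a small brute-force bilateral filter to a synthetic noisy volume; it is illustrative only, with arbitrary filter parameters rather than the optimised settings evaluated in this study.

```python
# Sketch contrasting the two classical denoisers discussed above: 3D Gaussian
# smoothing and an edge-preserving bilateral filter (synthetic volume, not PET).
import numpy as np
from scipy.ndimage import gaussian_filter

def bilateral_3d(vol, sigma_spatial=1.0, sigma_intensity=0.2, radius=2):
    """Brute-force bilateral filter: weights combine distance and intensity."""
    pad = np.pad(vol, radius, mode="reflect")
    num = np.zeros_like(vol)
    den = np.zeros_like(vol)
    offs = range(-radius, radius + 1)
    for dz in offs:
        for dy in offs:
            for dx in offs:
                shifted = pad[radius+dz:radius+dz+vol.shape[0],
                              radius+dy:radius+dy+vol.shape[1],
                              radius+dx:radius+dx+vol.shape[2]]
                w = np.exp(-(dz*dz + dy*dy + dx*dx) / (2 * sigma_spatial**2)
                           - (shifted - vol)**2 / (2 * sigma_intensity**2))
                num += w * shifted
                den += w
    return num / den

rng = np.random.default_rng(4)
vol = np.zeros((16, 16, 16)); vol[6:10, 6:10, 6:10] = 1.0  # synthetic "lesion"
noisy = vol + rng.normal(0, 0.2, vol.shape)

gs = gaussian_filter(noisy, sigma=1.0)   # smooths noise but blurs edges
bf = bilateral_3d(noisy)                 # preserves edges better
print(gs[8, 8, 8], bf[8, 8, 8])
```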

Convolutional Neural Networks (CNNs) are a specialized type of neural network that use convolution to extract features from the PET scan. This is done by convolution filters, which assign importance/weights to (learnable) features present in the PET scan. A CNN can therefore learn and detect features such as PET intensities, edges, and shapes, and CNNs have proven highly beneficial in various medical image processing/segmentation tasks [8–11]. A CNN-based deep-learning algorithm may also be superior in denoising tasks, since it can learn non-linear latent/hidden (not observable by humans) features, which one wants to preserve, from LC PET scans and increase the SNR [12–18]. Denoising LC whole-body 18F-FDG and 89Zr-antibody PET scans using a CNN may therefore improve the SNR and thus their diagnostic value. Previous studies [12–14,17,19] have presented successful applications of CNNs for improving 18F-FDG PET scans. However, these studies were performed in small (oncology) patient cohorts, were based on unsupervised deep learning networks, and focused on improving full-count (FC) 18F-FDG PET scans or on a longer scan duration per bed position.

Therefore, the aim of this study was to develop, train, and extensively evaluate the performance of CNNs to denoise LC whole-body 18F-FDG and 89Zr-antibody PET scans. A secondary aim was to compare the diagnostic value of the CNNs to that of GS and BF.

#### **2. Materials and Methods**

#### *2.1. Participants*

We included PET scans of 60 patients with stage I–IV non-small-cell lung carcinoma (NSCLC) (40 patients from Limburg PET-Center Hasselt, Belgium (LPC) [20], and 20 patients from Amsterdam UMC, location VUmc), as well as five patients with diffuse large B cell lymphoma (DLBCL), a non-Hodgkin lymphoma (Amsterdam UMC, location VUmc) [21] (Table 1). The study at LPC was registered at ClinicalTrials.gov, NCT02024113. The data from the patients with lung cancer at Amsterdam UMC were retrospectively obtained from medical records, with a waiver for informed consent from the Medical Ethics Review Committee of Amsterdam UMC, location VUmc. This study was registered as IRB2018.029. The patients with non-Hodgkin lymphoma were included as part of studies performed by Jauw et al. These patients provided written informed consent, and the studies were approved by the Medical Ethics Review Committee of Amsterdam UMC, location VUmc. This study was registered at the Dutch Trial Register http://www.trialregister.nl (accessed on 19 January 2022), NTR3392.


**Table 1.** Characteristics of the patients from Amsterdam UMC included in this study.

#### *2.2. Data Acquisition*

Whole-body 18F-FDG PET scans were acquired at LPC with a Gemini Big Bore TF PET/CT scanner, and at Amsterdam UMC with Ingenuity TF PET/CT and Vereos Digital PET/CT scanners (Philips Medical Systems, Best, The Netherlands). 89Zr-rituximab PET scans (patients with non-Hodgkin lymphoma) were acquired with an Ingenuity TF PET/CT. For the 18F-FDG PET scans, 60 min after injection of 259.4 ± 43.8 MBq of tracer, a low-dose computed tomography (LDCT) scan was performed for attenuation correction and anatomical localisation, and subsequently a 20 min static whole-body 18F-FDG PET scan (exact time depends on patient height) was acquired (2 min per bed position). Six days after the injection of 73.7 ± 0.3 MBq of 89Zr-rituximab, an LDCT scan was obtained, directly followed by a 60 min static whole-body PET scan (5 min per bed position). Corrections for decay, dead time, normalization (detector sensitivities), attenuation, random coincidences, and scatter were applied.

Amsterdam UMC 18F-FDG PET data were reconstructed with a 10 s (super-low-count (SLC), 92% scan time reduction), 30 s (low-count (LC), 75% scan time reduction), and 2 min (full-count (FC)) scan duration per bed position. The (S)LC PET scans were reconstructed using multiple time points/delays, which were later used for data augmentation during training. These scans were reconstructed using the blob-basis function ordered-subsets time-of-flight (BLOB-OS-TF) algorithm for the Ingenuity TF PET/CT scanner and ordered subset expectation maximization (OSEM 3i15s, 1i6r-PSF, 4 mm FWHM GAUSS; OSEM 3i15s, 3 mm FWHM GAUSS) for the Vereos Digital PET/CT scanner. The 89Zr-rituximab scans and the 18F-FDG PET data from LPC were reconstructed only with BLOB-OS-TF, with FC scan durations of 5 min and 2 min per bed position, respectively.

The 18F-FDG and 89Zr-rituximab PET scans from Amsterdam UMC were reconstructed according to the current European Association of Nuclear Medicine Research Ltd. (Vienna, Austria) EARL1 standards and the settings associated with EARL accreditation [22], respectively. Matrix and voxel sizes were 144 × 144 and 4 mm in all directions, respectively. 18F-FDG PET scans from LPC were reconstructed according to EARL1 standards, with matrix and voxel sizes of 169 × 169 and 4 mm, respectively.

#### *2.3. Image Processing*

For each FC whole-body 18F-FDG PET scan from LPC, SLC PET scans were simulated using the SiMulAtion and ReconsTruction (SMART)-PET package [23]. Simulation-reconstruction settings were chosen such that the simulated noisy 18F-FDG PET scans from LPC showed a coefficient of variation close to that of the SLC-reconstructed 18F-FDG PET scans from Amsterdam UMC. The simulated PET images were used to initially train the CNN, while parts of the actual reconstructed images were used for further fine-training of the CNN; details are explained below.

#### *2.4. Model Architecture*

A supervised U-Net-based [11] 3D-CNN (Figure 1 and Appendix B) was used to denoise the (S)LC 18F-FDG PET scans while maintaining their diagnostic value. However, instead of the max-pooling layer that is traditionally used, in this study the downsampling layers consisted of convolution layers with a stride of two [24]. Although such a convolution layer compresses the feature image just as a max-pooling layer does, it does not exclude voxels by only keeping the maximum values. It therefore not only reduces computation time (although less than max-pooling) but, most importantly, increases the model's ability to learn [24]. Additionally, in contrast to conventional CNNs, a kernel size of 6 × 6 × 6 instead of 3 × 3 × 3 was applied to learn inter-slice morphological features [25].
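As a minimal Keras sketch of this design choice (the filter count and activation are illustrative assumptions, not taken from the study):

```python
from tensorflow.keras import layers

def down_block(x, filters):
    # 6 x 6 x 6 kernels to capture inter-slice morphological features
    x = layers.Conv3D(filters, kernel_size=6, padding="same", activation="relu")(x)
    # stride-2 convolution halves each spatial dimension: a learnable
    # alternative to max-pooling that weighs all voxels instead of
    # keeping only local maxima
    return layers.Conv3D(filters, kernel_size=6, strides=2, padding="same",
                         activation="relu")(x)

inp = layers.Input((80, 192, 192, 1))   # (z, y, x, channels)
out = down_block(inp, 16)               # output shape: (40, 96, 96, 16)
```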

**Figure 1.** Architecture of the U-net shaped 3D-CNN used in the study. It consists of an encoding and decoding path, which are connected with concatenation layers at each resolution block.

#### *2.5. Model Performance*

#### 2.5.1. Quantitative Performance

The simulated SLC whole-body 18F-FDG PET scans from LPC were used to pre-train a 3D-CNN. Next, the reconstructed SLC and LC 18F-FDG PET data from Amsterdam UMC were used to fine-train (transfer learning) the pre-trained model, which yielded two additional models (SLC-CNN and LC-CNN) tailored to super-low-count and low-count images, respectively. These two models were subsequently used for further evaluation. Training the CNN on the simulated noisy LPC 18F-FDG data was performed to avoid overfitting on the small dataset. Noise characteristics of the 89Zr-rituximab and LC 18F-FDG PET scans were similar. However, we used the SLC-CNN rather than the LC-CNN to denoise the 89Zr-rituximab PET scans, because of its higher level of noise reduction.
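The pre-train/fine-train step can be pictured as follows; this is a sketch only, in which the file names, the placeholder arrays, and the MSE loss are assumptions rather than the authors' actual pipeline:

```python
import numpy as np
from tensorflow import keras

# Placeholder volumes standing in for the Amsterdam UMC fine-training data,
# shaped (patients, z, y, x, channels); inputs are SLC scans, targets FC scans.
slc_inputs = np.random.rand(8, 80, 192, 192, 1).astype("float32")
fc_targets = np.random.rand(8, 80, 192, 192, 1).astype("float32")

# Load the CNN pre-trained on the simulated LPC data (hypothetical file name)
model = keras.models.load_model("pretrained_lpc_cnn.h5")

# Fine-train (transfer learning) on the real reconstructed data to obtain
# a noise-specific model such as the SLC-CNN
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="mean_squared_error")
model.fit(slc_inputs, fc_targets, batch_size=2, epochs=50)
model.save("slc_cnn.h5")
```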

The 18F-FDG PET data from LPC were split into a training (80%, *n* = 32) and a validation (20%, *n* = 8) set. Thereafter, the 18F-FDG and 89Zr-rituximab data from Amsterdam UMC were used for the further refinement, validation, and testing of the two CNN models. During this training, the data were split into a training (32%, *n* = 8), a validation (8%, *n* = 2), and an independent test (60%, *n* = 15) set. The training and validation sets from Amsterdam UMC consisted only of 18F-FDG PET scans from the Ingenuity TF PET/CT scanner. The test set, however, consisted of 18F-FDG PET scans from both the Ingenuity TF PET/CT and the Vereos Digital PET/CT scanners, as well as 89Zr-rituximab PET scans from the Ingenuity TF PET/CT scanner. PET data augmentation was applied during each training epoch (training data only) by randomly sampling the different (time points/delays) (S)LC 18F-FDG PET scans for each patient. In other words, instead of traditional augmentation (shifts, zoom, translation, rotation, etc.), each training epoch presented minor differences in noise characteristics.
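This noise-realization augmentation can be expressed as a data generator along the following lines (a sketch under assumed array layouts, not the authors' code):

```python
import numpy as np
from tensorflow import keras

class NoiseRealizationSequence(keras.utils.Sequence):
    """For each patient, several (S)LC reconstructions (different time
    points/delays) exist for the same FC target; each batch draws a random
    realization, so the network sees slightly different noise every epoch."""

    def __init__(self, lc_realizations, fc_targets, batch_size=2):
        self.lc = lc_realizations   # list per patient: (n_realizations, z, y, x, 1)
        self.fc = fc_targets        # array: (n_patients, z, y, x, 1)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.fc) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.stack([r[np.random.randint(len(r))] for r in self.lc[sl]])
        return x, self.fc[sl]
```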

To compare the performance of the CNNs with other denoising methods, the (S)LC test PET scans were also denoised using traditional GS (18F-FDG) and the more advanced BF [17] (18F-FDG and 89Zr-rituximab) denoising methods (Table 2); an illustrative sketch of both filters follows the table caption. A Mann–Whitney U test (*p* < 0.05) was used to compare tumour recovery coefficients (TRC) and the levels of noise in the images after applying the denoising methods.

**Table 2.** Gaussian smoothing (GS) and Bilateral Filtering (BF) settings evaluated for the denoising of the low-count whole-body PET scans.
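For illustration, Gaussian smoothing is available in SciPy and a 3D bilateral filter in SimpleITK; the FWHM-to-sigma conversion is standard, but the placeholder volume and the rangeSigma value below are assumptions made for this example:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
import SimpleITK as sitk

voxel_mm = 4.0
fwhm_to_sigma = 1.0 / 2.355          # sigma = FWHM / (2 * sqrt(2 * ln 2))
pet = np.random.rand(80, 192, 192).astype("float32")   # placeholder volume

# Gaussian smoothing (GS), here with an 8 mm FWHM kernel
gs = gaussian_filter(pet, sigma=8.0 * fwhm_to_sigma / voxel_mm)

# Bilateral filtering (BF): averages voxels that are close in space AND
# similar in intensity, suppressing noise while preserving edges
img = sitk.GetImageFromArray(pet)
img.SetSpacing((voxel_mm,) * 3)
bf = sitk.GetArrayFromImage(sitk.Bilateral(img, domainSigma=4.0, rangeSigma=0.5))
```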


We calculated the TRC for the 18F-FDG and 89Zr-rituximab PET scans (test data) using Equation (1). The TRC was computed for the test data post-processed with the 3D-CNN, GS, or BF denoising methods and compared to the FC data. PET uptake features of the tumour volumes (*UX*, with *X* = average, maximum, or 3Dpeak) were extracted for both the denoised (*UX denoised*) and the FC (*UX FC*) PET scans, using the in-house built and open-access ACCURATE tool (quAntitative onCology moleCUlaR Analyses SuiTE) [26,27]. From the 18F-FDG scans, only the primary lung tumour was extracted, using a 50% SUV3Dpeak isocontour (Table 1 and Figure A1). For the 89Zr-rituximab PET scans, tumours were extracted using manual delineation (Table 1 and Figure A1). For patients with non-Hodgkin lymphoma with more than three tumours, bootstrapping was applied to randomly select three tumours for analysis.

$$\text{TRC} = \frac{U_{X,\ \text{denoised}}}{U_{X,\ \text{FC}}} \tag{1}$$

The level of noise was expressed as the coefficient of variance (*COV*; Equation (2)). Four spherical volumes of interest (VOIs) were drawn in the liver, because the liver showed homogeneous tracer uptake in this cohort and could therefore be used to reliably assess the level of noise. The average standard deviation *σ liver* and the average uptake *Uavg liver* were extracted from these four VOIs.

$$\text{COV} = \frac{\overline{\sigma_{\text{liver}}}}{\overline{U_{\text{avg, liver}}}} \tag{2}$$
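In NumPy terms, Equations (1) and (2) amount to the following (a sketch; in the study, the uptake features and VOIs themselves were extracted with the ACCURATE tool):

```python
import numpy as np

def trc(u_denoised, u_fc):
    """Tumour recovery coefficient (Equation (1)) for one uptake feature X
    (average, maximum, or 3Dpeak)."""
    return u_denoised / u_fc

def cov(liver_vois):
    """Coefficient of variance (Equation (2)): the mean standard deviation
    over the mean uptake of the four spherical liver VOIs."""
    mean_sd = np.mean([voi.std() for voi in liver_vois])
    mean_uptake = np.mean([voi.mean() for voi in liver_vois])
    return mean_sd / mean_uptake
```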

#### 2.5.2. Qualitative Performance

For a qualitative assessment of the denoising methods (CNN and BF), the denoised images were independently evaluated by two experienced nuclear medicine physicians (BZ and OH). GS was not included in this assessment because of its mostly significantly (*p* < 0.05) lower quantitative performance compared with the CNNs and BF. A questionnaire was drafted to assess the reliability and effectiveness of the denoising methods. The assessment was blinded, i.e., the scans presented to the physicians were a random combination (without labels) of the FC, SLC, and LC (with and without denoising) PET scans per patient. The 18F-FDG and 89Zr-rituximab PET scans were scored per patient (1–5: low to high) on the level of noise, tumour detectability, and overall scan quality, as well as on clinical acceptability (yes/no) and overall best performance (1st/2nd/3rd/4th/(5th)). A Mann–Whitney U test (*p* < 0.05) was used to compare the performance of the denoising methods.

#### **3. Results**

#### *3.1. Quantitative Assessment*

#### 3.1.1. 18F-FDG

The BF and the proposed CNN (SLC- and LC-CNN) denoised PET scans showed an overall higher TRC and a COV more similar to that of the FC PET scans than GS (Figure 2 and Table A1). Compared with BF, the SLC-CNN denoised PET scans showed a higher average uptake TRC and a higher 3Dpeak uptake TRC, but a lower maximum uptake TRC. With regard to the LC scans, the LC-CNN denoised PET scans showed a trend (0.05 < *p* < 0.1) towards a higher TRC than the BF denoised PET scans for the average and 3Dpeak uptake. Additionally, the LC-CNN showed a higher, but not significantly higher, maximum uptake TRC than BF. The LC-CNN denoised PET scans also showed a COV similar to that of the FC PET scans, with the SLC-CNN having the second closest COV to the FC PET scans.

**Figure 2.** The performance of the denoising methods for low-count 18F-FDG PET. The (**A**) average, (**B**) maximum, (**C**) 3Dpeak TRC of the SLC-CNN, GS (8 mm, 10 mm and 12 mm) and BF (4 mm and 5 mm) denoising methods of the SLC 18F-FDG PET from the Ingenuity TF PET/CT and the Vereos Digital PET/CT scanner. The (**D**) average, (**E**) maximum, (**F**) 3Dpeak TRC of the LC-CNN, GS (4 mm, 6 mm and 8 mm) and BF (3 mm and 4 mm) denoising methods of the LC 18F-FDG PET from the Ingenuity TF PET/CT and the Vereos Digital PET/CT scanner.

#### 3.1.2. 89Zr-Rituximab

The SLC-CNN denoised PET scans showed a predominant trend towards a higher TRC than the 3 mm and 4 mm BF (Figure 3 and Table A2). The SLC-CNN even showed a significantly (*p* < 0.05) higher average uptake TRC than the 3 mm and 4 mm BF. The 3 mm and 4 mm BF showed a COV comparable to that of the SLC-CNN, whereas the 2 mm BF had a significantly (*p* < 0.05) higher COV than the 3 mm and 4 mm BF and the SLC-CNN.

**Figure 3.** The performance of the denoising methods for low-count 89Zr-rituximab PET. The (**A**) average, (**B**) maximum, (**C**) 3Dpeak TRC of the SLC-CNN and BF (2 mm, 3 mm and 4 mm) denoising methods of the FC 89Zr-rituximab PET from the Ingenuity TF PET/CT scanner.

#### *3.2. Qualitative Assessment*

For the 18F-FDG scans, the observers found lower levels of noise, better tumour detectability, better overall scan quality, and higher clinical acceptability for all the CNN models in comparison with the BF denoising methods (Figures 4 and A2, Table 3), with the only exception being the SLC-CNN in terms of noise level and tumour detectability.

**Figure 4.** Illustration of a (**A**) FC, (**B**) SLC, (**C**) SLC-CNN, (**D**) BF 4 mm, (**E**) GS 10 mm denoised 18F-FDG PET scan (axial orientation) from the Ingenuity TF PET/CT scanner. Illustration of a (**F**) LC, (**G**) LC-CNN, (**H**) BF 4 mm, (**I**) GS 6 mm denoised 18F-FDG PET scan (axial orientation) from the Ingenuity TF PET/CT scanner.


**Table 3.** Scores provided by the Nuclear Medicine Physicians as part of qualitative assessment of the denoised 18F-FDG PET scans. The best performing method is indicated in bold based on the average score of both physicians.

\* significantly (*p* < 0.05) higher/lower than (S)LC-CNN. \*\* trend (0.05 < *p* < 0.1).

For the 89Zr-rituximab scans, the observers found a comparable level of noise, but similar or better tumour detectability, better overall scan quality, and higher clinical acceptability for the SLC-CNN in comparison with the BF denoising methods (Figure 5 and Table 4).

**Figure 5.** Illustration of a (**A**) FC, (**B**) SLC-CNN, (**C**) BF 3 mm, (**D**) BF 4 mm, (**E**) BF 5 mm denoised 89Zr-rituximab PET scan (coronal orientation) from the Ingenuity TF PET/CT scanner.

**Table 4.** Qualitative assessment of the 89Zr-rituximab PET scans. Scores were given for the PET scans with (SLC-CNN and BF) and without (FC) denoising by both Nuclear Medicine Physicians. In bold the best performing method (or scan) is indicated based on the average score of both physicians.


\* significantly (*p* < 0.05) higher/lower than SLC-CNN.

#### **4. Discussion**

CNN models to denoise (S)LC 18F-FDG and 89Zr-rituximab PET scans were trained and extensively evaluated. Overall, the CNN models performed better than the conventional GS and the more advanced BF denoising methods for both the 18F-FDG and 89Zr-rituximab PET scans. As such, the CNN models show promise for reducing the acquisition time and injected activity of 18F-FDG PET scans and increasing the image quality of 89Zr-rituximab PET scans.

In this study, we trained noise-specific CNN models to address the different noise levels associated with different scan acquisition times and injected activities in 18F-FDG PET scans. However, for PET tracers such as 89Zr-labelled antibodies, training a noise-specific CNN model was not feasible: due to the dose limits of 89Zr, the overall image quality was impaired (low SNR), and no high-quality 89Zr-antibody PET images were therefore available for training a CNN. The only possible solution was thus to apply the SLC-CNN (trained using SLC 18F-FDG PET scans) directly to the 89Zr-rituximab PET scans and test its performance. The main advantage of this approach is that it constitutes the ultimate external test of the CNN, on data obtained with a different tracer. Although the SLC-CNN was not trained on 89Zr-rituximab PET scans, it obtained a higher TRC than the 3 mm and 4 mm BF denoising methods (Figure 3 and Table A2).

With regard to 18F-FDG PET scans with a low injected tracer activity, a lower SNR will be observed, just as for scans with a shorter scan duration, impairing both the quantitative and qualitative value of these scans. The CNNs could therefore also be useful to maintain good image quality when reducing the injected 18F-FDG activity in whole-body 18F-FDG PET studies, thereby reducing the radiation burden for the patient while maintaining diagnostic value. However, further assessment is necessary to evaluate the performance of the CNNs when used to reduce the injected activity for whole-body 18F-FDG PET acquisitions.

The qualitative assessment also showed that the proposed CNNs were preferred over BF. However, the CNN denoised (S)LC 18F-FDG PET scans did show an overall lower qualitative performance than the FC 18F-FDG PET scans. Yet, the LC-CNN denoised LC 18F-FDG PET scans obtained a clinical acceptability score similar to that of the FC 18F-FDG PET scans (Table 3), while for the 89Zr-rituximab PET scans, the SLC-CNN increased the overall image quality relative to the FC 89Zr-rituximab PET scans (Table 4). The observers preferred the SLC-CNN denoised 89Zr-rituximab PET scans over both the BF denoised and the FC 89Zr-rituximab PET scans. This can be explained by the higher ratio between tumour signal and background signal in the SLC-CNN denoised 89Zr-rituximab PET scans (Figure 5). This indicates that the SLC-CNN shows promise for establishing an optimal denoising setup for 89Zr-antibody PET scans.

In this study, several strategies to prevent overfitting were applied. First, data augmentation was applied by randomly sampling the different (S)LC 18F-FDG PET scans for each patient. With traditional augmentation, the interpolation may differ between the training data ((S)LC) and the training labels (FC); it was therefore not applied in this study. Another way in which overfitting was reduced was through the symmetric connections in the U-Net based 3D-CNN [28]. As shown in previous studies [12,13], training a model using a small dataset can result in overfitting. Since acquiring sufficient real (S)LC 18F-FDG PET data was not feasible, SLC 18F-FDG PET data were generated using the already available LPC data. The SLC 18F-FDG PET data from LPC were simulated using SMART, which facilitated the development of a pre-trained model familiar with the relevant morphological features. This resulted in a shorter learning time, a lower probability of overfitting, and a more accurate and robust model. Even though pre-training was only performed on the SLC 18F-FDG PET data from LPC, the fine-trained LC-CNN showed a higher performance than an LC-CNN without pre-training.

The main limitation of this study is the size of the patient cohort. Small tumours in the (S)LC PET scans are more prone to being underestimated by the proposed CNNs, because the training data contained few small tumours; the model may therefore classify this signal as noise rather than as a tumour-specific signal [29]. Nevertheless, the proposed CNNs showed better correspondence with the FC PET scans than GS and BF. So, although small tumours were only sparsely present in the training data, the proposed CNNs retained more quantitative information than GS and BF. Even though the differences in performance between the proposed CNNs and BF were small, contrary to BF, a CNN retains the ability to learn and improve as more patients are incorporated. Thus, further evaluation in a larger and more heterogeneous cohort could further improve the CNNs' performance. However, although the proposed method showed promising results for denoising low-count 18F-FDG PET scans, fully reaching the quantitative and qualitative value of the FC 18F-FDG PET scans may not be feasible; we therefore foresee that the main application of the CNNs lies in denoising and improving the image quality of 89Zr-antibody PET studies.

#### **5. Conclusions**

The 3D-CNNs used in this study to denoise (S)LC whole-body 18F-FDG and 89Zr-rituximab PET scans were constructed and tested. The CNN denoised (S)LC 18F-FDG and 89Zr-rituximab PET scans showed clinical performance similar to or better than that of the FC scans, respectively. The proposed CNNs therefore show promise for reducing the PET scan duration or lowering the injected activity of whole-body 18F-FDG PET scans, and are particularly useful for increasing the quantitative and qualitative image quality of 89Zr-rituximab PET scans.

**Author Contributions:** Conceptualization, B.M.d.V., M.Y., S.S.V.G. and R.B.; methodology, B.M.d.V., M.Y., S.S.V.G. and R.B.; software, B.M.d.V.; validation, B.M.d.V., M.Y., S.S.V.G. and R.B.; formal analysis, B.M.d.V., M.Y., S.S.V.G. and R.B.; investigation, B.M.d.V., M.Y., S.S.V.G. and R.B.; resources, R.B.; data curation, B.M.d.V.; writing—original draft preparation, B.M.d.V.; writing—review and editing, B.M.d.V., M.Y., S.S.V.G., R.B., G.J.C.Z., O.S.H., Y.W.S.J., M.C.H., G.A.M.S.v.D., W.C.M.-v.d.H.v.O., J.J.M.Z.-B. and L.M.; visualization, B.M.d.V.; supervision, S.S.V.G., M.Y. and R.B.; project administration, B.M.d.V.; funding acquisition, R.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of Amsterdam UMC, location VUmc.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data can be made available on reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A2.** Illustration of a (**A**) FC, (**B**) SLC, (**C**) SLC-CNN, (**D**) BF 4 mm, (**E**) BF 5 mm, (**F**) GS 10 mm denoised 18F-FDG PET scan (coronal orientation) from the Ingenuity TF PET/CT scanner. Illustration of a (**G**) LC, (**H**) LC-CNN, (**I**) BF 3 mm, (**J**) BF 4 mm, (**K**) GS 6 mm denoised 18F-FDG PET scan (coronal orientation) from the Ingenuity TF PET/CT scanner.

**Table A1.** Overview of the performance of the GS, the BF and the proposed CNNs denoising methods using the external test 18F-FDG PET scans from the Ingenuity TF PET/CT and Vereos Digital PET/CT scanners. For the FC (green), SLC and LC (with (blue) and without (orange) post-processing) PET scans, the TRC and COV values are shown in each column.


\* significant (*p* < 0.05) higher/lower than (S)LC-CNN. \*\* trend (0.05 < *p* < 0.1).

**Table A2.** Overview of the performance of the BF and the SLC-CNN on the external test 89Zr-rituximab PET scans from the Ingenuity TF PET/CT scanner. For the FC (orange) and post-processed (blue) PET scans, the TRC and COV values are shown in each column.


\* significant (*p* < 0.05) higher/lower than SLC-CNN. \*\* trend (0.05 < *p* < 0.1).

#### **Appendix B**

#### *Appendix B.1. Image Processing*

Matrix dimensions of the PET scans varied between centres, scanners, and patients. Therefore, EARL1 scans from the Ingenuity TF PET/CT, the Vereos Digital PET/CT, and the Gemini Big Bore TF PET/CT scanners were zero-padded to a uniform matrix size of 192 × 192 × 320, and EARL2 scans from the Vereos Digital PET/CT scanner were zero-padded to a uniform matrix size of 384 × 384 × 640. Next, to overcome capacity limitations of the computer system, all PET scans were rebinned to a matrix size of 192 × 192 × 80, with a voxel size of 4 mm for EARL1 and 2 mm for EARL2 scans in all directions.
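A sketch of these two steps (the axis order, centred offsets, and rebin factor are assumptions for this example, not the exact implementation):

```python
import numpy as np

def pad_and_rebin(vol, target=(320, 192, 192), z_factor=4):
    """Zero-pad a (z, y, x) volume to a uniform matrix size, then average
    groups of z_factor axial slices to reduce the matrix size."""
    padded = np.zeros(target, dtype=vol.dtype)
    off = [(t - s) // 2 for t, s in zip(target, vol.shape)]
    padded[off[0]:off[0] + vol.shape[0],
           off[1]:off[1] + vol.shape[1],
           off[2]:off[2] + vol.shape[2]] = vol
    z, y, x = target
    return padded.reshape(z // z_factor, z_factor, y, x).mean(axis=1)

rebinned = pad_and_rebin(np.random.rand(300, 170, 170))  # -> (80, 192, 192)
```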

#### *Appendix B.2. CNN Architecture*

Noise reduction inevitably degrades some of the quantitative features of a PET image. To counteract this, the network uses symmetric connections (concatenating two layers) during decoding (upsampling) to alleviate the loss of detail during encoding (downsampling). The decoding layers thus use the features from the previous layer (encoded scans) as well as the retained details from the downsampling layers (uncompressed scans). This results in a network that increases the SNR of the PET scans while simultaneously retaining the quantitative and qualitative features of the scan.

The proposed model was implemented with the Keras library (v2.2) in Python (v3.6), with Tensorflow (v1.13.1) as backend. The model was trained, validated, and tested on two NV-linked Nvidia 11 GB RTX 2080Ti GPUs. For the optimisation of the weights, an Adam optimizer was used with a low learning rate of 1 × 10<sup>−5</sup> and a decay of 1 × 10<sup>−6</sup>. The batch size for training the CNNs was set to 2.

**Box A1.** Python code of the architecture of the U-Net
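The listing can be summarized by a minimal Keras sketch consistent with the architecture described above (stride-2 downsampling, 6 × 6 × 6 kernels, concatenation skip connections); the depth, filter counts, and MSE loss are illustrative assumptions:

```python
from tensorflow.keras import Model, layers, optimizers

def conv(x, filters, strides=1):
    return layers.Conv3D(filters, kernel_size=6, strides=strides,
                         padding="same", activation="relu")(x)

def build_unet(shape=(80, 192, 192, 1), f=16):
    inp = layers.Input(shape)

    # Encoding path: stride-2 convolutions replace max-pooling
    e1 = conv(inp, f)
    e2 = conv(conv(e1, f, strides=2), f * 2)
    e3 = conv(conv(e2, f * 2, strides=2), f * 4)

    # Decoding path with concatenation (symmetric) skip connections
    d2 = layers.Conv3DTranspose(f * 2, 2, strides=2, padding="same")(e3)
    d2 = conv(layers.Concatenate()([d2, e2]), f * 2)
    d1 = layers.Conv3DTranspose(f, 2, strides=2, padding="same")(d2)
    d1 = conv(layers.Concatenate()([d1, e1]), f)

    return Model(inp, layers.Conv3D(1, kernel_size=1)(d1))

model = build_unet()
# Optimizer settings as reported in the text; `decay` follows the legacy
# Keras interface used in v2.2
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5, decay=1e-6),
              loss="mean_squared_error")
```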


#### *Appendix B.3. Model Performance*

For training and validation, the performance of the CNNs was measured using the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR) [30]. The PSNR represents the peak signal error, whereas the SSIM is a measure of the similarity between two scans that has been shown to be consistent with human visual perception. Based on these two metrics, the optimally trained (highest PSNR and SSIM) SLC-CNN and LC-CNN weights were chosen for further assessment on the test PET scans.
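Both metrics are available in scikit-image; a brief sketch with placeholder volumes:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

fc = np.random.rand(80, 192, 192).astype("float32")        # stand-in FC scan
denoised = fc + 0.01 * np.random.randn(*fc.shape).astype("float32")

rng = float(fc.max() - fc.min())
psnr = peak_signal_noise_ratio(fc, denoised, data_range=rng)
ssim = structural_similarity(fc, denoised, data_range=rng)
print(f"PSNR = {psnr:.1f} dB, SSIM = {ssim:.3f}")
```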

#### **References**


### *Article* **Thermal Ablation of Liver Tumors Guided by Augmented Reality: An Initial Clinical Experience**

**Marco Solbiati 1, Tiziana Ierace 2, Riccardo Muglia 3, Vittorio Pedicini 2, Roberto Iezzi 4, Katia M. Passera 1, Alessandro C. Rotilio 1, S. Nahum Goldberg <sup>5</sup> and Luigi A. Solbiati 2,6,\***


**Simple Summary:** We report the first clinical use of Endosight, a new guidance system for percutaneous interventional procedures based on augmented reality, to guide percutaneous thermal ablations. The new system was demonstrated to be precise and reliable, with a targeting accuracy of 3.4 mm, and clinically acceptable, rapid setup and procedural times were achieved.

**Abstract:** Background: Over the last two decades, augmented reality (AR) has been used as a visualization tool in many medical fields in order to increase precision, limit the radiation dose, and decrease the variability among operators. Here, we report the first in vivo study of a novel AR system for the guidance of percutaneous interventional oncology procedures. Methods: Eight patients with 15 liver tumors (0.7–3.0 cm, mean 1.56 ± 0.55) underwent percutaneous thermal ablations using AR guidance (i.e., the Endosight system). Prior to the intervention, the patients were evaluated with US and CT. The targeted nodules were segmented and three-dimensionally (3D) reconstructed from CT images, and the probe trajectory to the target was defined. The procedures were guided solely by AR, with the position of the probe tip subsequently confirmed by conventional imaging. The primary endpoints were the targeting accuracy, the system setup time, and the targeting time (i.e., from target visualization to correct needle insertion). The technical success was also evaluated and validated by co-registration software. Upon completion, the operators were assessed for cybersickness or other symptoms related to the use of AR. Results: Rapid system setup and procedural targeting times were noted (mean 14.3 min, range 12.0–17.2 min, and mean 4.3 min, range 3.2–5.7 min, respectively). The high targeting accuracy (mean 3.4 mm, range 2.6–4.2 mm) was accompanied by technical success in all 15 lesions (i.e., the complete ablation of the tumor, with 13/15 lesions achieving a >90% 5-mm periablational margin). No intra/periprocedural complications or operator cybersickness were observed. Conclusions: AR guidance is highly accurate, and allows for the confident performance of percutaneous thermal ablations.

**Keywords:** augmented reality; three-dimensional (3D) reconstruction; interventional oncology; computed tomography; liver

#### **1. Introduction**

Precision and targeting accuracy are key for the success of all image-guided interventional procedures. Over the last 20 years, several new navigational tools have been added

**Citation:** Solbiati, M.; Ierace, T.; Muglia, R.; Pedicini, V.; Iezzi, R.; Passera, K.M.; Rotilio, A.C.; Goldberg, S.N.; Solbiati, L.A. Thermal Ablation of Liver Tumors Guided by Augmented Reality: An Initial Clinical Experience. *Cancers* **2022**, *14*, 1312. https://doi.org/10.3390/cancers14051312

Academic Editor: David Wong

Received: 21 January 2022 Accepted: 27 February 2022 Published: 3 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

to conventional imaging modalities (ultrasound, CT, MRI) with the purpose of increasing precision, favouring dose reduction, decreasing variability among operators, and thus promoting the diffusion of diagnostic and therapeutic interventional procedures based on ever-increasing reliability. Image fusion platforms based on electromagnetic or optical devices [1–4], CT with laser marker systems [5], CT fluoroscopy [6], cone-beam CT [7,8], CT with electromagnetic tracking [9], and robotic systems [10] have been incorporated into clinical practice in many centers. However, these tools still have some limitations, such as the inability to provide a real, live, 3D visualization of the target and the surrounding structures, the need for the operator to alternate their gaze between the interventional field and the instrumentation screen(s), a steep learning curve, and, for CT-guided procedures, potentially substantial radiation doses to patients and operators [11]. Recently, spatial computing technology has allowed the development of simulated reality environments, virtual reality (VR) and augmented reality (AR), which enable real-time interaction by the user. VR completely immerses the user in an artificial, digitally created 3D world through head-mounted displays (HMDs), with the user having no direct interaction with the real world. Therefore, in the medical field, VR can be used for surgical planning and simulation, but not for the direct guidance of interventional procedures [11,12]. By contrast, AR overlays digital content onto the visualized real world through an external device [12–14], enhancing reality with superimposed information, using optical see-through head-mounted displays (HMDs or "goggles"), screens, smartphones, tablets, and videoprojectors, such that digital and physical objects are visualized simultaneously. This permits their interaction with each other, thus allowing the guidance of interventional procedures. The capability for computers to enhance visibility and navigate through 3D coordinates during minimally invasive interventional procedures was first noted in 1997 [15]. Since then, AR has been clinically applied as a visualization tool to augment anatomical [16] and pathological structures in neurosurgery [17–19] and vascular [20,21], orthopedic [22,23], urologic [24–26], plastic [27], and abdominal surgery [28,29]. This was achieved by creating 3D anatomic volumes from cross-sectional scans or angiographic images, and manually overlapping them over patients positioned in the real operating field [3] through electromagnetic or optical tracking systems and computer vision algorithms. In Interventional Oncology, AR was initially tested on phantoms to assist with percutaneous biopsies [30,31], and subsequently for the assessment of its potential role in the augmentation of minimally invasive surgery for the accurate localization of organs, or the guidance of radiofrequency ablation (RFA) or irreversible electroporation (IRE) electrodes on phantoms [32,33], but not for the direct guidance of interventional procedures in humans. To our knowledge, this is the first report of the targeting and ablation of small hepatic malignancies in human patients using AR as the sole modality of guidance.

#### **2. Materials and Methods**

This study was performed at two tertiary referral centres for liver diseases (Humanitas Research Hospital and IRCCS Policlinico Universitario A. Gemelli), with the approval of the local Institutional Ethics Committees. Written informed consent was obtained from all of the subjects involved in the study.

#### *2.1. Patient Population*

Fifteen hepatic malignancies (9 hepatocellular carcinomas (HCCs), 3 metastases from breast carcinoma, and 3 from pancreatic adenocarcinoma) in eight patients (5 males and 3 females, median age 72.5 years, range 56–83) underwent AR-guided percutaneous thermal ablation. The treated nodule size ranged from 0.7 to 3.0 cm (mean 1.56 ± 0.55 cm).

For all of the cases, the treatment decision was determined by the consensus of an Institutional Multidisciplinary Liver Team. According to the BCLC classification, the nine HCCs in five patients were either very early (8/9 cases) or early stage (1/9), in a subset of HCV-related early stage cirrhosis (Child-Pugh A, ECOG PS 0) [34]. These were located in segments VIII (*n* = 4), V (*n* = 2), II (*n* = 2) and VI (*n* = 1); the sizes ranged from 1.2 to 3.0 cm (mean 1.69 ± 0.53 cm). One patient had four HCCs, and another had two HCCs; all of these nodules were treated in the same session. The other three patients had a single HCC each. All of the HCCs were diagnosed through a non-invasive radiological work-up, following the European Association for the Study of the Liver (EASL) 2018 clinical practice guidelines [35].

The six metastases in the three patients ranged from 0.7 to 2.1 cm (mean 1.35 ± 0.56 cm) in size, and were diagnosed by percutaneous US-guided biopsies using 20 G Menghini-modified needles (Sterylab, Milan, Italy).

#### *2.2. Pre-Treatment Diagnostic Assessment*

All of the patients were initially evaluated with a baseline ultrasound of the liver, which included contrast-enhanced ultrasound (CEUS) after the intravenous administration of 2.4 to 4.8 mL of a second-generation contrast agent (SonoVue, Bracco, Milan, Italy) (Figure 1A), and an abdominal contrast-enhanced computed tomography (CECT) in the arterial, portal, and late phases (Figure 1B). In order to achieve registration for the orientation reference of the AR display, twenty radiopaque markers with no repetitive pattern were applied to the abdominal skin in the right hypochondrium, surrounding the area of interest (Figure 1C), immediately prior to the treatment. A new CECT in the arterial and portal phases was acquired during free breathing (i.e., normal respiration), paying particular attention to include all of the markers within the scanning area. In seven of the eight patients (14 of the 15 lesions), CT scans were acquired with two different machines (Ingenuity, Philips Healthcare, Cleveland, OH, USA for four patients, and Revolution EVO, General Electric, Boston, MA, USA for three patients) following the injection of Iopamidol (Iopamiro 370, Bracco, Milan, Italy) at 4 mL/s, using a 2-mm slice thickness, a matrix of 512 × 512 pixels, an in-plane pixel size of 0.48–0.78 mm, 1:1 pitch, 120 kVp and 180 mA. In the remaining patient, 70 mL of Iomeprol (Iomeron 400 mg/mL, Bracco, Milan, Italy) was injected at 3 mL/s using a Lightspeed VCT 64 (General Electric, Boston, MA, USA), using a 2.5-mm slice thickness, a matrix of 512 × 512 pixels, an in-plane pixel size of 0.48–0.78 mm, 1:1 pitch, 120 kVp and 180 mA.

#### *2.3. Augmented Reality Settings*

The AR set-up comprised a proprietary augmented reality system (Endosight, R.A.W. Srl, Milan, Italy) that features a 27" medical display (ACL, Leipzig, Germany), a laptop (Dell Technologies, Round Rock, TX, USA) with installed proprietary image processing and augmented reality software, and a commercially available head-mounted display (HMD) (Oculus Rift-S, Facebook Technologies, Menlo Park, CA, USA) paired with a binocular camera (Zed Mini, Stereolabs, San Francisco, CA, USA) (Figure 2).

The binocular camera viewed the patient from two different angles in order to register the patient model in the camera frame using the markers visible in both video images while tracking the ablation applicator. The software enabled the 3D reconstruction (from CT scans to 3D volumes), the co-registration, and the AR intervention. Specifically, after uploading the CECT scans into the system, followed by the automatic segmentation and 3D reconstruction of the liver, spleen, bones, liver blood vessels, and radiopaque markers, the semi-automatic segmentation of the target lesions occurred using proprietary reconstruction algorithms. In addition, the most suitable trajectory path from the skin to the target was defined. Subsequently, by moving the HMD around the patient, the system software co-registered (matched) all of the radiopaque markers segmented on the CT scans with all of the real markers applied to the patient's skin. This allowed for the simultaneous visualization of the patient's surface and internal anatomy, the target lesion, and the trajectory path to the target in 3D, by superimposing, in real time, virtual images on the operator's real field of sight (Figure 1D). Next, in order to allow the visualization of the probe position during the procedure, a clip with five markers with no repetitive pattern was attached to either a 14 G (for 14 ablations) or an 11 G (for one ablation) coaxial needle, 7.8 cm in length (Bard Inc., Murray Hill, New York, NY, USA), that was used as a coaxial ablation device introducer (Figure 1E).
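The Endosight registration algorithm itself is proprietary; as an illustration of the underlying principle, a standard least-squares rigid alignment (the Kabsch algorithm) of corresponding marker positions can be written as:

```python
import numpy as np

def rigid_register(ct_markers, cam_markers):
    """Least-squares rigid transform mapping marker centroids segmented on
    CT onto the same markers seen by the HMD camera (Kabsch algorithm).
    Both inputs are (n, 3) arrays of corresponding 3D points."""
    ct_c, cam_c = ct_markers.mean(axis=0), cam_markers.mean(axis=0)
    H = (ct_markers - ct_c).T @ (cam_markers - cam_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cam_c - R @ ct_c
    return R, t                               # cam_point ~ R @ ct_point + t
```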

**Figure 1.** Augmented reality guided ablation: a 1.5-cm pancreatic carcinoma metastasis at segment VIII, poorly visible on B-mode US and clearly seen by CEUS (**a**), and seen on pre-ablation CT scan (arrow) (**b**). Radiopaque markers with no repetitive pattern applied to the patient's skin (**c**). View through the operator's HMD: ribs (in white), major hepatic blood vessels (light blue), liver (red), and target lesion (green, in a yellow circle) (**d**). View through the HMD, showing that the operator can see the virtual needle (blue line) and the line that connects the tip of the needle to the center of the target (in green) (**e**). Following the trajectory line permits successful tumor targeting with AR guidance alone (**f**). The 5.4-mm distance between the tip of the coaxial needle and the target center by US (**g**). Subsequently, the microwave antenna is inserted into the coaxial needle (**h**). On a post-ablation CT scan, a large ablation volume completely surrounds the metastasis (**i**). Using ablation confirmation software (Ablation-fitTM), the technical success achieved was precisely demonstrated. The margins of the target tumor are shown in orange, the 5-mm ablation margin is shown in green, and the margins of the necrosis volume are shown in blue. Complete tumor ablation with only 5.4% of the safety margin out of the necrosis volume was achieved (**j**).

**Figure 2.** Endosight system overview: cart, medical display, laptop, and Oculus Rift-S paired with a Zed Mini camera.

#### *2.4. Treatment Procedure*

All of the procedures were performed by three interventional radiologists with more than 15 years of experience in percutaneous thermal ablations. In 14 of the 15 ablations, the procedure was performed in the CT room coupled with real-time ultrasound, under assisted ventilation, during short-acting anaesthesia using propofol (AstraZeneca, Cambridge, UK) (10 mg/mL) and alfentanil (Hameln Pharma, Gloucester, UK) (0.5 mg/mL), with continuous hemodynamic monitoring throughout the procedure. In the remaining patient, the ablation was performed under direct CT control (Lightspeed VCT 64) after local anesthesia and deep sedation with 0.2 mg Fentanyl (Janssen-Cilag, Beerse, Belgium), without additional ultrasound guidance. Using AR guidance alone, the coaxial needle was inserted following the predefined trajectory line planned during the setup (Figure 1F). This was facilitated by color coding, in that when the predefined trajectory line overlapped the virtual needle line, this path turned from blue to green in the AR visual field, highlighting and denoting the correct alignment. The insertion was conducted during the patient's free breathing (as in the pre-ablation acquisition of the CT scans) in order to minimize the organ displacement caused by breathing. The depth from the entry point (i.e., the skin) to the target centre was measured in real time by the software, and was visualized on the operator's HMD. Before the introduction of the ablation device into the coaxial needle, the position of the coaxial needle and its correspondence with the real location of the target nodule was verified using real-time US when the target nodule was visible with US, or with CT when the target was invisible on US. In order to assess the precision of the AR, the distance from the real target centre visualized on the US or CT and the virtual target centre shown by the trajectory line starting from the tip of the coaxial needle was measured (Figure 1G). The ablation probe was then inserted, positioning its tip 5–7 mm beyond the deep margin of the target in order to achieve sufficient ablative margins (Figure 1H). Then, the coaxial needle was partly retracted while maintaining the positioning of the ablation device in order to achieve the complete exposition of the active tip. Microwave ablations (MWA) were performed with 13 G, 15 cm-long antennae (Medtronic, Dublin, Ireland) for three malignancies of three patients, and 14 G, 15 cm-long antennae (HS Hospital Service, Aprilia, Italy) for eleven nodules of five patients. The remaining patient received RFA performed with a 14 G, 15 cm-long electrode with a 3-cm exposed tip (RF Medical, Seoul, Korea). The treatment power and duration, and the total amount of energy delivered, were selected based upon the size and location of each nodule, according to the device manufacturer's technical recommendations and operator experience. Figure 3 shows the complete treatment procedure workflow.

#### *2.5. Post-Procedural Assessment*

The CECT was performed immediately after withdrawing the ablation device (Figure 1I). A proprietary ablation-confirmation software package (Ablation-fitTM, R.A.W. Srl, Milan, Italy) [36], which enables the automatic segmentation of the liver and intrahepatic blood vessels and semi-automatically co-registers the target nodules on pre-ablation CT scans with the volumes of necrosis achieved on post-ablation scans using a non-rigid registration tool, was used in order to assess the precision and completeness of the achieved ablation volume (Figure 1J). Using a 3D model, the software verified whether the volume of ablative necrosis entirely or partially included the tumor and a pre-defined ablative margin (5-mm thick, in these cases), and quantified, as a percentage, the amount of tumor and ablative margin (if any) external to the ablation volume, thus allowing us to assess the technical success of the procedure [37,38]. After the ablation, all of the operators were interviewed regarding the need for manual adjustments of the HMDs and the occurrence of eye fatigue, dizziness, or cybersickness.
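Ablation-fitTM is proprietary software; the margin check it performs can be illustrated on co-registered binary masks with a sketch such as the following (the voxel size and margin width are parameters of the example):

```python
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def margin_coverage(tumor, necrosis, voxel_mm=1.0, margin_mm=5.0):
    """Percentage of the periablational margin shell left outside the
    necrosis volume, plus any residual unablated tumour voxels.
    tumor, necrosis: co-registered boolean masks of identical shape."""
    iters = int(round(margin_mm / voxel_mm))
    shell = binary_dilation(tumor, structure=generate_binary_structure(3, 1),
                            iterations=iters) & ~tumor
    residual_margin_pct = 100.0 * np.count_nonzero(shell & ~necrosis) \
                          / np.count_nonzero(shell)
    residual_tumor_vox = np.count_nonzero(tumor & ~necrosis)
    return residual_margin_pct, residual_tumor_vox
```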

**Figure 3.** Workflow of the AR-guided thermal ablations.

#### *2.6. Statistical Analysis*

The primary endpoints evaluated included the time required to set up the system and to position the antenna tip inside the nodule, the mean depth of the target centre from the needle entry point on the skin, and the deployment accuracy, defined as the mean distance between the geometric center of the target and the ablation device tip measured on unenhanced CT or US. Secondary endpoints included the technical success, i.e., the complete ablation of the entire tumor and the achievement of a >90% 5-mm periablational margin ablation [36], complications, and operator-reported symptoms regarding the procedure. The data were analyzed with statistical software (SPSS, version 17.0) and reported as the mean ± standard deviation (SD), or as the mean and range.

#### **3. Results**

The time required to set up the system ranged from 12.0 to 17.2 min (12.3 ± 2.1 min), and the time required to perform each insertion and tumor targeting ranged from 3.2 to 5.7 min (4.3 ± 0.9 min). In 7 of the 15 (46.7%) cases, the target nodule was visible on US, and the real location of the target nodule and the position of the coaxial needle tip with respect to the target centre were verified using real-time US. In the remaining 8 of the 15 (53.3%) cases, unenhanced CT was employed for verification. The mean depth of the target centre from the needle entry point on the skin was 76.0 ± 28.2 mm. The distance between the geometric center of the target and the ablation device tip, measured on unenhanced CT or US, ranged from 2.1 to 4.5 mm (3.2 ± 0.7 mm). Table 1 shows, for each target, the size, the distance of the interventional device tip from the tumor center, the time taken to reach the target, and the modality used for verification.

**Table 1.** Sizes of the targets, the distance of the interventional device tip from the center of each target tumor, the time needed to reach the target, and the modality used for the distance measurement.


For the MWA, the power delivered ranged from 50 to 60 W, with a treatment duration of 5 min for four HCCs, and 6 min for the remaining five HCCs and the five metastases. For the case of radiofrequency ablation (RFA), the power delivered was 1500 mA for 12 min. A single ablation device insertion was performed for each target tumor. Technical success was achieved in each case. After the automatic co-registration of the 3D volumes of the pre-ablation tumors and post-ablation necrotic changes, achieved with the Ablation-fitTM software, the complete ablation of the tumors (i.e., no residual unablated portion of the target tumors) was found. The residual 5-mm ablative margin percentage ranged from 0 to 14.1% (5.5 ± 4.3%), with 13 of the 15 (86.7%) lesions showing >90% ablation of this margin. Table 2 shows the residual 5-mm safety margin (in percentage) of each target lesion.

**Table 2.** Residual 5-mm safety margin (as a percentage) of each target tumor, calculated by the Ablation-fitTM software.


No intra- or periprocedural adverse events occurred. No user-dependent calibration or adjustment of the HMD was needed, and no significant eye fatigue or "cybersickness" was reported by any of the users.

#### **4. Discussion**

Modern imaging modalities enable the visualization of increasingly small target lesions, often in difficult-to-target locations, which is particularly suitable for local, image-guided treatments (IGTs). Consequently, the requests for image-guided therapy, accompanied by expectations of favorable outcomes, are constantly increasing. However, some problems remain unsolved. First of all, the learning curve for the use of these technologies is often long, and this limits the diffusion of interventional procedures, particularly among young operators and/or in low-referral centers. The lack of a real, live, 3D visualization of targets, and the poor working ergonomics (the need to check many screens simultaneously, restricted line-of-sight to screens, and the need to alternate the operator's gaze between the interventional field and the instrumentation screens) are additional important limitations. The mental registration of the target position seen in the reference image (US, CT, MRI) with the corresponding position in the patient is often challenging, particularly for liver dome lesions requiring non-orthogonal or out-of-plane approaches, even when CT guidance is used. The difficulty and subjectivity of this process may also increase the risks for patients. Thus, a technically easy combination of "real-world" visualization with virtual objects precisely superimposed upon the scene is increasingly desired. This can be achieved with AR technology in the actual interventional field, where the operator can visualize and interact simultaneously with the real world (patient, interventional instrumentation) and virtual objects (hidden organs and targets, surrounding structures seen on CT and MRI, etc.) based on the superimposition of the "two worlds", as displayed on HMDs, smartphones, tablets, screens, or videoprojectors. Moreover, HMDs can be relatively advantageous compared to the direct line of sight through the lens display [39].

The most critical issue for the use of AR in medical applications is the superimposition precision, i.e., the registration accuracy. Multiple studies on phantoms, animal models, and human cadavers have primarily focused upon the assessment of registration accuracy, either for AR navigation [40] or the AR guidance of needles [19,31–33,41,42]. Hecht et al. [41] reported a smartphone-based AR system for needle trajectory planning and real-time guidance on phantoms. In their first experiment, the mean needle insertion error was 2.69 ± 2.61 mm, which was 78% lower than that of the CT-guided freehand procedure. In their second experiment, the operators successfully navigated the needle tip to within 5 mm on each first attempt under the guidance of the AR system, which eliminated the need for further needle adjustments. In addition, the procedural time was 66% lower than that of the CT-guided freehand procedure. Long et al. [42] compared the accuracy and the placement time needed by five interventional radiologists and a resident with a range of clinical experience (3–25 years) to place biopsy needles on millimetric targets positioned in an anthropomorphic abdominal phantom at different depths, using cone-beam CT (CBCT)-guided fluoroscopy and smartphone- and smartglasses-based AR navigation platforms. The placement error was extremely small and virtually identical for all three modalities (4–5 mm), and the placement time was significantly shorter for smartphones and HMDs (38% and 55%, respectively) than for CBCT. Additionally, these results were achieved by AR without intra-procedural radiation, and with a learning curve of only 15 min.

Using the same system employed for the present study, Solbiati et al. recently published a proof-of-concept study on phantoms, animal models, and human cadavers targeted with AR guidance. In the rigid phantom, sub-5-mm accuracy (2.0 ± 1.5 mm, mean ± standard deviation) was achieved. In a porcine model with small (2 × 1 mm) embedded metallic targets, the accuracy was 3.9 ± 0.4 mm when the targeting was performed with respiration suspended at maximum expiration, as in the initial CT scan, and 8.0 ± 0.5 mm when the procedure was performed without breathing control. In a human cadaver attached to a ventilator to induce simulated respiration, two liver metastases (1.8 cm and 3.0 cm) were targeted with an accuracy of 2.5 mm and 2.8 mm, respectively [43].

Here, we note the similar accuracy of 3.4 mm in living, breathing patients. Regarding AR-guided needle insertions in human patients, De Paolis et al. [32] reported their preliminary experience in locating a focal liver lesion in the operating room just before open surgery. The surgeon was able to determine the correct position of the real tumor by touching it and applying the ablation applicator to it in order to verify the correct overlap between the virtual and the real tumor. Although an excellent accuracy of 2 mm was reported, problems of depth perception and instrument visibility occurred whenever the surgeon's body was located between the tracker and the instrument, both of which related to the use of the optical tracker.

The AR system used for our current report is specifically designed to guide percutaneous biopsies and ablation procedures. It is based on disposable markers with no repetitive pattern affixed to the patient's abdominal skin before performing the CT scans. The associated software enables us to visualize and segment the markers on the patient (virtual objects) and the target tumor, to automatically register and superimpose virtual and real images in real time, to define a safe and accurate trajectory line to the target center, to depict the guided movements of the interventional device without the need for additional imaging, and to show the whole procedure on a display, HMD, or screen [43,44]. The main advantage of an HMD is its 3D visualization, which surpasses the 2D visualization of smartphones and tablets. The results achieved were very promising: the accuracy of the antenna tip with respect to the center of the target was well below the 5-mm threshold (mean 3.2 ± 0.7 mm), and technical success of the ablation was achieved in all cases. The mean times required to set up the system and to perform each insertion were 14.3 min and 4.3 min, respectively, and were independent of the type of ablation system used. This is not substantially different from the time usually required by expert radiologists to perform CT-guided procedures, even after a long learning curve. Moreover, the software for the assessment and quantification of the tumor ablation margins in 3D was integrated into the AR system, enabling the immediate and accurate evaluation of technical success [36].

In recent years, two issues have been raised regarding the technology of HMDs used for AR, i.e., the field of view (FOV) and the need for calibration [39]. The binocular FOV of human eyes is naturally about 200° in the horizontal plane and 135° in the vertical plane, while commercially available HMDs initially had FOVs of around 30–40°. This has recently increased to 90°, both horizontally and vertically. Nevertheless, the calibration of HMDs is needed to tailor projections to the user's interpupillary distance. Given that most HMDs have fixed focal planes, when the calibration is inaccurate the eyes can focus and converge at separate distances, causing distorted depth perception, eye fatigue, and "cybersickness" due to discrepancies between the visual and vestibular senses. Nowadays, commercially available HMDs are provided with two videocameras, which has eliminated the need for user-dependent calibration and adjustment. This has limited the previously common occurrence of cybersickness, as noted in our study.

The patient's respiratory movement and motion remain one of the largest technical and practical hurdles, as AR guidance systems are currently unable to follow respiratory excursions in mobile organs with real-time corrections, bearing a risk of shifting the intended target relative to the expected location. Other target-related limitations arise from the abilities of lesions to warp or move within their environments. Respiratory motion tracking and the monitoring of respiration during deep sedation or general anesthesia seem to offer the best solutions to date. The guiding information is provided periodically at the point of breathing that matches the respiratory phase during which the preoperative CT image was acquired (the middle respiratory or expiratory phase). In this time window, the operator can move the needle toward the target as rapidly as possible. In our study, the insertion was conducted during the patient's free breathing (as in the pre-ablation acquisition of the CT scans) in order to minimize the organ displacement caused by breathing. Given that this was the initial study of AR-guided thermal ablation, we selected only tumors that were visible on US or CT, in order to be able to check the position of the device tip after its insertion, before starting the ablation. Probe repositioning was never required, as the position achieved with AR guidance was always sufficiently accurate. Nevertheless, we acknowledge that this will not always be the case, and note that, should minor placement corrections be needed, the virtual system can potentially save a substantial amount of radiation exposure compared to fully CT-guided procedures, be they CT-guided freehand, cone-beam CT, or CT fluoroscopy guidance [20]. Indeed, in the experimental study conducted by Park et al. [39], comparing a HoloLens-based 3D AR-assisted navigation system with CT-guided simulations, the AR system reduced the radiation dose by 41%.

An additional potential challenge of AR-guided interventional procedures is needle bending during the insertion, exacerbated by increased applied pressure or the use of thinner needles. The solution we successfully utilized was the use of a rigid coaxial needle to maintain the interventional device fixed in 3D space during its advancement, minimizing the bending of the ablation device inserted into the coaxial needle. We further demonstrated that the attachment of a clip with markers with no repetitive pattern to the coaxial needle permits precise monitoring by AR of the probe advancement towards the target, and the interventional device subsequently inserted into the coaxial needle can easily hit the target center. Coaxial needles have been used for interventional procedures for decades, and do not appreciably increase the risk of bleeding because their construction is engineered to result in an ultimate size only 1–2 G larger than that of ablation devices or biopsy needles.

With respect to other navigation systems, AR guidance offers the ergonomic advantage that the overlay of treatment information (anatomy, target, trajectory line, etc.) is shown directly in the procedural environment, and not on a monitor away from the patient, as occurs with CT- or CBCT-guided fluoroscopy. Additional advantages of AR guidance are its ease of use, the reduced procedural time compared to more traditional guidance systems, and the short learning curve (compared to that of CT-guided procedures), which is particularly useful for young operators with limited experience, who perform equally well or even better than senior operators with long experience. Furthermore, AR guidance systems are significantly less expensive than all of the other needle guidance systems. This may favour the diffusion of AR, and consequently of image-guided procedures, in small centers and in developing countries that cannot afford complex and expensive guidance technologies (the so-called "democratization" of interventional procedures).

With AR, the same images seen by the operator can also be visualized on monitors inside and outside the interventional room, and can be broadcast on a larger scale, allowing interventional radiologists to visualize live or recorded procedures performed by experts. AR can provide not only an excellent opportunity for physician training and education but also a very useful tool to exchange collaborative experiences among various centers for remote real-time instruction or expert assistance [12,45].

We acknowledge that this study has some limitations, most notably the small number of patients in the cohort and the non-randomized selection of the treated lesions, all of which were visible on both US and CT despite their small size. Nonetheless, we believe that it will encourage new prospective studies and serve as the basis for the development of AR technology in the clinical field.

#### **5. Conclusions**

In this retrospective study, we obtained high standards of targeting accuracy, technical efficacy, procedural time, and radiation dose reduction using AR as the sole guidance method for percutaneous thermal ablation, without encountering any complications. In spite of the small cohort analyzed, we propose that our preliminary data demonstrate the potential for AR, with further validation, to become a leading and low-cost modality for the guidance of interventional procedures worldwide.

**Author Contributions:** Conceptualization, M.S. and L.A.S.; methodology, L.A.S. and M.S.; software, M.S., K.M.P. and A.C.R.; validation, M.S., A.C.R. and R.M.; formal analysis, S.N.G. and L.A.S.; investigation, L.A.S., T.I., V.P., R.M. and R.I.; resources, A.C.R.; data curation, L.A.S.; writing—original draft preparation, L.A.S., K.M.P. and M.S.; writing—review and editing, R.M., R.I. and S.N.G.; visualization, M.S. and A.C.R.; supervision, S.N.G.; project administration, M.S. and A.C.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Ethics Committee of Humanitas Research Hospital and IRCCS Policlinico Universitario A. Gemelli (protocol code Rad-HCC Eudra CT # 2016-004004-60).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** M.S., K.M.P. and A.C.R. are employees of R.A.W. Srl, and were the developers of the described technique. T.I., L.A.S., R.M., V.P. and R.I. declare that they have no conflict of interest to disclose. S.N.G. performs unrelated consulting for Angiodynamics and Cosman Instruments.

#### **References**


### *Article* **System for the Recognizing of Pigmented Skin Lesions with Fusion and Analysis of Heterogeneous Data Based on a Multimodal Neural Network**

**Pavel Alekseevich Lyakhov 1,2, Ulyana Alekseevna Lyakhova 3,\* and Nikolay Nikolaevich Nagornov 2**


**Simple Summary:** Skin cancer is one of the most common cancers in humans. This study aims to create a system for recognizing pigmented skin lesions by analyzing heterogeneous data with a multimodal neural network. Fusing patient statistics and multidimensional visual data makes it possible to find additional links between dermoscopic images and medical diagnostic results, significantly improving neural network classification accuracy. The use of the proposed neural network system for recognizing pigmented skin lesions by specialists will enhance the efficiency of diagnosis compared with visual diagnostic methods.

**Abstract:** Today, skin cancer is one of the most common malignant neoplasms of the human body. Diagnosis of pigmented lesions is challenging even for experienced dermatologists due to the wide range of morphological manifestations. Artificial intelligence technologies are capable of equaling and even surpassing the capabilities of a dermatologist in terms of efficiency. The main problem in implementing intelligent analysis systems is low accuracy. One possible way to increase this indicator is to use stages of preliminary processing of visual data together with heterogeneous data. The article proposes a multimodal neural network system for identifying pigmented skin lesions with a preliminary stage of identifying and removing hair from dermatoscopic images. The novelty of the proposed system lies in the joint use of a preliminary stage of cleaning hair structures and a multimodal neural network system for the analysis of heterogeneous data. The accuracy of pigmented skin lesion recognition across 10 diagnostically significant categories in the proposed system was 83.6%. The use of the proposed system by dermatologists as an auxiliary diagnostic method will minimize the impact of the human factor, assist in making medical decisions, and expand the possibilities of early detection of skin cancer.

**Keywords:** digital image processing; pattern recognition; convolutional neural networks; multimodal neural networks; heterogeneous data; metadata; dermatoscopic images; pigmented skin lesions; hair removal; melanoma

#### **1. Introduction**

According to World Health Organization statistics, the incidence of non-melanoma and melanoma skin cancer has significantly increased over the past decade [1]. Up to three million cases of non-melanoma skin cancer [2] and about 140,000 cases of melanoma skin cancer are recorded annually [3]. According to the Skin Cancer Foundation Statistics [4], every third cancer diagnosed is a skin cancer, making it one of the most common types of malignant lesions in the body [5]. This is because the bulk of the population of the countries of the Northern Hemisphere have skin of phototypes I and II according to Fitzpatrick's classification [6]. A feature of these phototypes is a genetically limited ability to tan under ultraviolet (UV) radiation [7] and the greatest tendency to develop melanoma [8]. Under modern conditions of a thinning atmospheric ozone layer, UV radiation reaches the skin more directly and acts as a factor in the activation of oncogenes. It is estimated that a 10% decrease in the ozone layer will lead to an additional 300,000 non-melanoma and 4500 melanoma skin cancer cases [9]. In regions with high sun exposure, skin cancer is preceded by solar keratosis, the diagnosis of which can help prevent the transformation of pigmented skin lesions into a cancerous form [10].

Rapid and highly accurate early diagnosis of skin cancer can reduce patients' risk of death [11]. When detected early, the 5-year survival rate for patients with melanoma is 99%. In the later stages of diagnosis, when the disease reaches the lymph nodes and metastasizes to distant organs, the survival rate in patients is only 27% [3]. Dermatoscopy is the most common method for diagnosing pigmented skin lesions visually [12]. This method is based on the visual acuity and experience of the practitioner and can only be effectively used by qualified professionals [13]. With the help of dermatoscopy, an experienced dermatologist can achieve an average accuracy in the classification of pigmented skin lesions that ranges from 65% to 75% [14]. The early manifestations of malignant and benign neoplasms are visually indistinguishable [15].

Today, medicine is considered one of the strategic and promising areas for the effective implementation of systems based on artificial intelligence [16]. Mathematical models and methods are improving, and the amount of digital information in various fields of medicine is growing due to the accumulation of data from electronic medical records, the results of laboratory and instrumental studies, mobile devices for monitoring human physiological functions, etc. [17]. The development of artificial intelligence technologies has allowed computer data-analysis algorithms to equal, and in some tasks surpass, human capabilities [18]. A comparison of the classification accuracy of pigmented skin lesions between dermatologists with different levels of experience and a computer program using an artificial intelligence algorithm is presented in articles such as [19–21]. Studies show that artificial intelligence can outperform 136 out of 157 dermatologists and achieve higher accuracy in recognizing pigmented lesions. Despite the higher recognition quality of artificial intelligence systems compared with visual diagnostics by physicians, the problem of generally low accuracy in neural network classification systems remains relevant. One possible way to improve recognition accuracy is to use an image pre-processing stage [22].

There are many methods for pre-processing dermoscopic images to improve and visually highlight diagnostically significant features. One of these methods is segmentation to highlight the contours of pigmented skin lesions. Segmentation can be performed using a biorthogonal two-dimensional wavelet transform and the Otsu algorithm [23]. Edges can also be extracted using Gaussian contrast enhancement and saliency map construction [24]. Saliency maps use inner and outer non-overlapping windows, making the foreground and background distinct. A significant disadvantage of segmentation methods using filters is the lack of versatility in selecting contours in images of different quality. Illumination, skin color, and the sharpness of the contours of a pigmented skin lesion significantly reduce the accuracy of these algorithms. Another way to highlight contours on dermoscopic images is contrast stretching with further detection using a Faster Region-Based Convolutional Neural Network (Faster R-CNN) [25,26]. Segmentation based on neural network algorithms makes it possible to accurately identify the contours of pigmented skin lesions, separate a pigmented neoplasm from the surrounding skin, and exclude the influence of skin color type on recognition by artificial intelligence. At the same time, the problem of the presence of hair structures remains: hair can be perceived by both neural network algorithms and filter-based algorithms as part of a pigmented skin lesion.

The presence of hair in dermatoscopic images can drastically change the size, shape, color, and texture of the lesion, which significantly affects the automatic analysis of the neural network [27]. Removing hair from images during digital pre-processing is an important step in improving the accuracy of automated diagnostic systems [28]. Today, several methods are designed for pre-processing dermatoscopic images of pigmented skin lesions to remove hair or other noise elements [29]. For example, the essence of the DullRazor method [30] is to use the morphological operation of closing. A significant drawback of DullRazor is the distortion of the dark areas of pigmented lesions, which can change diagnostic signs and have a substantial impact on the quality of recognition. In [31], another method for hair removal on dermatoscopic images is presented, based on non-linear Partial Differential Equation diffusion (PDE-diffusion). The algorithm is designed to fill in linear hair structures by diffusion. This method is also used in [32,33].

Another way to improve the accuracy of intelligent classification systems is to combine heterogeneous data and further analyze them to find additional relationships. In dermatology, heterogeneous data mining makes it possible to combine patient statistical metadata and dermoscopic images, greatly improving the recognition of pigmented skin lesions. The use of multimodal neural network systems [34–37], as well as methods for combining metadata and multidimensional visual data [38], has significantly improved the accuracy of recognizing pigmented skin lesions.

Despite significant progress in implementing artificial intelligence technologies to analyze dermatological data, developing neural network systems of varying complexity is relevant to achieving higher recognition accuracy. The main hypothesis of the manuscript is a potential increase in the quality of neural network systems for analyzing medical data due to the emerging synergy when using various methods to improve recognition accuracy together. This study aims to develop and model a multimodal neural network system for analyzing dermatological data through the preliminary cleaning of hair structures from images. The proposed system makes it possible to achieve higher recognition accuracy levels than similar neural network systems due to the preliminary cleaning of hair structures from dermoscopic images. The use of the proposed system by dermatologists as an auxiliary diagnostic method will minimize the impact of the human factor in making medical decisions.

The rest of the work is structured as follows. Section 2 is divided into several subsections. Section 2.1 proposes a method for identifying and cleaning hair structures as a pre-processing step for dermatoscopic images of pigmented skin lesions. Section 2.2 describes the method for pre-processing statistical metadata about patients. Section 2.3 defines a multimodal neural network system for processing statistical data and dermatoscopic images of pigmented skin lesions. Section 3 presents practical modeling of the proposed multimodal neural network system for classifying pigmented neoplasms with a preliminary stage of hair removal from dermatoscopic images. Section 4 discusses the results obtained and compares them with known works on the neural network classification of dermatoscopic skin images. The conclusion summarizes the results of the work.

#### **2. Materials and Methods**

The paper proposes a multimodal neural network system for recognizing pigmented skin lesions with a stage of preliminary processing of dermatoscopic images. The proposed multimodal neural network system for analysis and classification combines heterogeneous diagnostic data represented by multivariate visual data and patient statistics. The scheme of a multimodal neural network system for the classification of dermatoscopic images of pigmented skin lesions with preliminary processing of heterogeneous data is shown in Figure 1.

**Figure 1.** Multimodal neural network system for the classification of dermatoscopic images of pigmented skin lesions with preliminary heterogeneous data processing.

The multidimensional visual data undergo a pre-processing stage, which identifies and removes hair structures from the dermatoscopic images of pigmented skin lesions. The patient statistics undergo one-hot encoding to generate a feature vector. The multimodal neural network system for recognizing pigmented skin lesions consists of two neural network architectures. Dermatoscopic images are processed using the specified Convolutional Neural Network (CNN) architecture, while the statistical metadata are processed using a linear multilayer neural network. The feature vector at the CNN output and the output signal of the linear neural network are combined on the concatenation layer, and the combined signal is fed to the classification layer. The output signal of the proposed multimodal neural network system is, for each of the 10 diagnostically significant categories, the percentage probability that the recognized dermatoscopic image belongs to that category.

#### *2.1. Hair Removal*

The main diagnostic method in the field of dermatology is visual analysis. Today, many imaging approaches have been developed to help dermatologists overcome the problems caused by the visual assessment of tiny skin lesions. The most widely used imaging technique in dermatology is dermatoscopy, a non-invasive technique for imaging the skin surface using a light magnifying device and immersion fluid [39]. Statistics show that dermatoscopy has increased the efficiency of diagnosing malignant neoplasms by 50% [40]. A significant problem when working with this method is the possible presence of hair over the area of the pigmented lesion, which causes occlusion.

The presence of such noisy structures as hair significantly complicates the work of dermatologists and specialists. It can also cause errors in recognizing pigmented skin lesions in automatic analysis systems. Hair violates the geometric properties of the pigmented lesion areas, which negatively affects the diagnostic accuracy [41]. Figure 2 shows dermatoscopic images of pigmented skin lesions with hair structures present that cause occlusion by altering the size, shape of the lesion, and texture of the image.

**Figure 2.** Examples of pigmented skin lesions images with hairy structures: (**a**) vascular lesions; (**b**) nevus; (**c**) solar lentigo; (**d**) dermatofibroma; (**e**) seborrheic keratosis; (**f**) benign keratosis; (**g**) actinic keratosis; (**h**) basal cell carcinoma; (**i**) squamous cell carcinoma; (**j**) melanoma.

The most common way to solve the occlusion problem of pigmented skin lesions is to remove the visible part of the hair with a cutting instrument before performing a dermatoscopic examination. However, this approach leads to skin irritation. It also causes diffuse changes in the color of the entire pigmented lesion, which distorts diagnostically significant signs to a greater extent than the presence of the hair itself. An alternative solution is the digital processing of dermatoscopic visual data to remove hair structures. The essence of hair pre-cleaning methods is to identify each pixel of the image as a hair pixel or a skin pixel and then replace the pixels of the hair structures with skin pixels [42]. Preliminary digital processing of dermatoscopic images using morphological operations is one possible method for identifying and replacing the pixels of hair structures.

This paper proposes a method for digital pre-processing dermoscopic images using morphological operations on multidimensional visual data. A step-by-step scheme of the proposed method is shown in Figure 3.

**Figure 3.** Scheme of the proposed method of identification and hair removal from dermatoscopic images of pigmented skin lesions.

Image processing of pigmented skin lesions consists of four main stages. At the first stage, the RGB image is decomposed into color components. The second stage is to locate the hair structures. At the third stage, the hair pixels are replaced with neighboring pixels. The fourth stage is to reconstruct the RGB color dermatoscopic image.

The input of the proposed method is an RGB dermatoscopic image of a pigmented skin neoplasm, *P*(*x*,*y*). The color components *PR*, *PG*, and *PB* are extracted from the image, and the following processing steps are performed separately for each color component. The structuring elements *L*1 and *L*2 are defined as follows:

$$L_{1,2} = \{(x, y) : \rho(T, (x, y)) \le r\} \tag{1}$$

where *ρ* is the distance from the center *T* of the set *L*1,2 in the chosen metric, and *r* is the radius of the set specified by the user. The next stage is a morphological closing operation with the element *L*1 to determine the location of hair structures on dermatoscopic images:

$$H_{CC}^{3} = P_{CC} \bullet L_{1} = (P_{CC} \oplus L_{1}) \ominus L_{1} \tag{2}$$

where *CC* stands for the color channel, *CC* ∈ {*R*, *G*, *B*}, ⊕ is the operation of dilation of the set *P* by *L*1, and ⊖ is the operation of erosion by the element *L*1. The closing operation smooths the contours of the hair structures in dermatoscopic images, eliminates voids, and fills in narrow gaps and long, small-width depressions.

At the next stage, the original image *PCC* is subtracted from the image *HCC*3 obtained as a result of the closing operation:

$$H_{CC}^{2} = H_{CC}^{3} - P_{CC} \tag{3}$$

The operator *δ*, which zeroes the pixels of the image *P*(*x*,*y*) for further operations, is defined as follows:

$$\delta\left(P_{(x,y)}\right) = \begin{cases} P_{(x,y)}, & \text{if } P_{(x,y)} > K \\ 0, & \text{if } P_{(x,y)} \le K \end{cases} \tag{4}$$

where *K* is the user-defined threshold of pixel intensity values. The next stage is the threshold zeroing of the pixels of the detected hair structures. For this, the zeroing operator *δ* is applied to the resulting dermatoscopic image *HCC*2:

$$H_{CC}^{1} = \delta\left(H_{CC}^{2}\right) \tag{5}$$

After the threshold zeroing of pixels, a morphological dilation with the element *L*2 is performed to expand the boundaries of the hair structures:

$$H_{CC} = H_{CC}^{1} \oplus L_{2} \tag{6}$$

The next step is to replace the pixels of the hair structures with neighboring pixels. Using the Laplace equation, the pixels inside the selected hair regions are interpolated from the pixels on the borders of those regions; the border pixels themselves remain unchanged. The last step is the reverse construction of the RGB color image from the processed color components: the color channels *PR*\*, *PG*\*, and *PB*\* are combined.
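For illustration, the pipeline of Equations (2)–(6) can be expressed with standard image-processing primitives. The sketch below is a minimal, hypothetical rendering in OpenCV (not the authors' code): the kernel radii and threshold follow the values reported later in Section 3 (*r* = 5, *r* = 3, *K* = 40), and OpenCV's Telea inpainting is used as a stand-in for the Laplace-based interpolation, applied here to one combined mask rather than per channel.

```python
# A minimal sketch (not the authors' code) of the hair-removal pipeline in
# Equations (2)-(6), assuming OpenCV. Kernel radii and the threshold K follow
# the values reported in Section 3; Telea inpainting stands in for the
# Laplace-based interpolation and is applied to one combined mask.
import cv2
import numpy as np

def remove_hair(image_bgr, r1=5, r2=3, K=40):
    L1 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r1 + 1, 2 * r1 + 1))
    L2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r2 + 1, 2 * r2 + 1))
    masks = []
    for P in cv2.split(image_bgr):                     # each color component
        H3 = cv2.morphologyEx(P, cv2.MORPH_CLOSE, L1)  # Equation (2): closing
        H2 = cv2.subtract(H3, P)                       # Equation (3)
        _, H1 = cv2.threshold(H2, K, 255, cv2.THRESH_TOZERO)  # Equations (4)-(5)
        H = cv2.dilate(H1, L2)                         # Equation (6)
        masks.append(H > 0)
    mask = np.logical_or.reduce(masks).astype(np.uint8) * 255
    # Replace hair pixels by interpolating from the borders of the hair regions.
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)
```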

An example of the step-by-step work of the proposed method for identifying and cleaning hair structures from dermatoscopic images of pigmented skin lesions is shown in Figure 4. To improve the visual perception of the intermediate results of each method stage, Figure 4d–f were inverted.

**Figure 4.** Images obtained as a result of passing each stage of the method of identification and hair removal: (**a**) input RGB image *PRGB*; (**b**) the color component *PR*, presented in shades of gray; (**c**) the result of the closing operation *HR*3; (**d**) the result of the subtraction operation *HR*2 (inverted image); (**e**) the result of zeroing pixels *HR*1 (inverted image); (**f**) the result of the dilation operation *HR* (inverted image); (**g**) pixel interpolation result *PR*\*; (**h**) output RGB image *PRGB*\*.

#### *2.2. Metadata Pre-Processing*

Today, in medicine, there is an increase in the volume of digital information due to the accumulation of data from electronic medical records, the results of laboratory and instrumental studies, mobile devices for monitoring human physiological functions, and others [17]. Patient biomedical statistics are structured data that describe the characteristics of research subjects. Statistical data includes gender, age, race, predisposition to various diseases, bad habits, etc. Such information facilitates the search for connections between research objects and the analysis result.

Metadata pre-processing converts statistical data into the format required by the selected data mining method. Since the proposed multimodal system for recognizing pigmented skin lesions uses a fully connected neural network, the data must be encoded as a feature vector. A corresponding metadata information vector is generated for each image in the dataset, depending on the amount and type of statistical information. One-hot encoding can sometimes outperform more complex encoding systems [43]. All multi-categorical variables (discrete variables with more than two categories) are converted to a new set of binary variables for one-hot encoding. For example, the categorical variable denoting the location of a pigmented lesion on the patient's body is replaced by 8 dummy variables indicating whether the pigmented lesion is located on the anterior torso, head/neck, lateral torso, lower extremity, oral/genital area, palms/soles, posterior torso, or upper extremity.

Suppose the metadata *M* includes various statistics *M* = {*M*1, *M*2, ... , *Mn*}, with *mn* ∈ *Mn*, where *mn* is a pointer to a specific patient parameter. If *mn* is a pointer to the gender of the patient, then *M*1 = {*male*, *female*}. For each set *Mn*, which is one of the patient's indicators, its cardinality *μn* = |*Mn*| is calculated. For metadata pre-processing, a feature vector $\vec{m}$ of dimension $\sum_n \mu_n$ is generated. The first *μ*1 coordinates of the metadata vector $\vec{m}$ encode the statistical data *m*1, the next *μ*2 coordinates encode the statistical data *m*2, and so on.

One-hot encoding is used to encode a statistic *mn* ∈ *Mn* as follows. The set *Mn* is ordered in an arbitrary but fixed way for all considered cases. After that, the binary code 100...0 of length *μn* is reserved for the first element of the set *Mn*, the binary code 010...0 of length *μn* for the second element, and so on. The statistical metadata pre-processing scheme is shown in Figure 5.

**Figure 5.** Metadata pre-processing scheme.
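As a concrete illustration of this encoding, the following minimal sketch builds the vector $\vec{m}$ from hypothetical category lists; the method only requires that each set *Mn* be ordered in some arbitrary but fixed way, so the orderings below are assumptions.

```python
# A minimal sketch of the one-hot metadata encoding described above. The
# category lists and their orderings are illustrative; the method only
# requires an arbitrary but fixed ordering of each set M_n.
import numpy as np

METADATA_CATEGORIES = {
    "sex": ["male", "female"],
    "site": ["anterior torso", "head/neck", "lateral torso", "lower extremity",
             "oral/genital", "palms/soles", "posterior torso", "upper extremity"],
}

def encode(record):
    """Concatenate one one-hot block of size mu_n per statistic M_n."""
    parts = []
    for key, categories in METADATA_CATEGORIES.items():
        block = np.zeros(len(categories))           # mu_n = |M_n| coordinates
        block[categories.index(record[key])] = 1.0  # codes 100...0, 010...0, ...
        parts.append(block)
    return np.concatenate(parts)                    # dimension: sum of mu_n

# encode({"sex": "female", "site": "posterior torso"})
# -> array([0., 1., 0., 0., 0., 0., 0., 0., 1., 0.])
```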

#### *2.3. Multimodal Neural Network*

In deep learning, multimodal fusion, or heterogeneous synthesis, combines different data types obtained from various sources [44]. In the field of diagnosis of pigmented skin lesions, the most common types of data are dermatoscopic images and patient statistics such as age, sex, and the location of the pigmented lesion on the patient's body. Combining visual data, signals, and multidimensional statistical data about patients makes it possible to create heterogeneous medical information databases that can be used to build intelligent systems for diagnostics and decision support for specialists, doctors, and clinicians [45]. The rationale for using heterogeneous databases is that the fusion of heterogeneous data can provide additional information and increase the efficiency of neural network analysis and classification systems [46]. The use of heterogeneous data in training multimodal neural network systems will improve the accuracy of diagnostics by searching for connections between visual objects of research and statistical metadata [47].

For the recognition of multidimensional visual data, the most suitable neural network architecture is the CNN [48]. The input of the proposed multimodal system for the neural network classification of pigmented skin lesions is supplied with dermatoscopic images *P*(*img*), pre-processed metadata in the vector form $\vec{m} = (m_1, m_2, \ldots, m_n)$, and diagnosis labels *l* ∈ {1, . . . , *Nlab*}, where *Nlab* is the number of diagnostic categories.

The dermatoscopic image includes *R* rows, *C* columns, and *D* color components. For the RGB format, *D* = 3, and the color components are represented by the levels of red, green, and blue of the image pixels. The input of the convolutional layer receives a dermatoscopic image *P*(*img*) as a three-dimensional function *P*(*x*, *y*, *z*), where 0 ≤ *x* < *R*, 0 ≤ *y* < *C*, and 0 ≤ *z* < *D* are spatial coordinates, and the amplitude of *P* at any point with coordinates (*x*, *y*, *z*) is the pixel intensity at that point. The procedure for obtaining feature maps in the convolutional layer is then as follows:

$$P_f(x, y) = g + \sum_{i=-\frac{w-1}{2}}^{\frac{w-1}{2}} \; \sum_{j=-\frac{w-1}{2}}^{\frac{w-1}{2}} \; \sum_{k=0}^{D-1} w_{ijk}^{(1)} P(x + i, \, y + j, \, k), \tag{7}$$

where $P_f$ is a feature map; $w_{ijk}^{(1)}$ is the coefficient of a filter of size *w* × *w* for processing *D* arrays; and *g* is the offset.
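To make the indexing in Equation (7) explicit, the following minimal numpy sketch computes a single feature map for an odd filter size *w*, without padding (border pixels are skipped); the variable names are illustrative.

```python
# A direct numpy rendering of Equation (7) for one feature map, assuming an
# odd filter size w and no padding (border pixels are skipped).
import numpy as np

def feature_map(P, W, g):
    """P: (R, C, D) image; W: (w, w, D) filter coefficients w_ijk; g: offset."""
    w = W.shape[0]
    h = (w - 1) // 2
    R, C, D = P.shape
    Pf = np.full((R - 2 * h, C - 2 * h), float(g))
    for i in range(-h, h + 1):          # filter rows
        for j in range(-h, h + 1):      # filter columns
            for k in range(D):          # color components
                Pf += W[i + h, j + h, k] * P[h + i:R - h + i, h + j:C - h + j, k]
    return Pf
```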

The concatenation layer receives at its input the feature map $P_f$ obtained on the last layer intended for processing dermatoscopic images, together with the metadata vector $\vec{m}$. The feature map $P_f$ contains a set of values $x_{ijk}$, where *i* is the height coordinate, *j* is the width coordinate, and *k* is the number of the map obtained on the last of the layers intended for processing dermatoscopic images. The operation of combining heterogeneous data on the concatenation layer can be represented as follows:

$$f_l = \sum_{i} \sum_{j} \sum_{k} x_{ijk} w_{ijkl}^{(2)} + \sum_{i=1}^{n} m_i w_{il}^{(3)}, \tag{8}$$

where $w_{ijkl}^{(2)}$ is a set of weights for processing the feature maps of dermatoscopic images, and $w_{il}^{(3)}$ is a set of weights for processing the metadata vectors.

The activation of the last layer of the multimodal neural network is computed through the *softmax* function with the distribution $P(y \mid x, \theta)$ and has the form:

$$P(y \mid x, \theta) = \mathrm{softmax}(x; \theta) = \frac{\exp\left(\left(w_l^n\right)^T x^n + g_l^n\right)}{\sum_{k=1}^{K} \exp\left(\left(w_k^n\right)^T x^n + g_k^n\right)}, \tag{9}$$

where $w_l^n$ is the weight vector leading to the output node associated with class *l*. The proposed multimodal system for recognizing pigmented skin lesions based on CNN AlexNet is shown in Figure 6.

**Figure 6.** Neural network architecture for multimodal classification of pigmented skin lesions based on CNN AlexNet.
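A minimal PyTorch sketch of the fusion scheme in Figure 6 is given below: a pre-trained CNN branch for images and a linear branch for the metadata vector, concatenated before the classification layer as in Equation (8). The layer sizes and module names are assumptions for illustration, not the configuration used by the authors.

```python
# A minimal PyTorch sketch of the fusion scheme in Figure 6. Layer sizes are
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn
from torchvision import models

class MultimodalNet(nn.Module):
    def __init__(self, meta_dim, n_classes=10):
        super().__init__()
        backbone = models.alexnet(weights="DEFAULT")   # pre-trained CNN branch
        self.cnn = nn.Sequential(backbone.features,
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten())          # -> 256-dim features
        self.mlp = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU())
        self.classifier = nn.Linear(256 + 64, n_classes)

    def forward(self, image, metadata):
        fused = torch.cat([self.cnn(image), self.mlp(metadata)], dim=1)  # Eq. (8)
        return self.classifier(fused)  # softmax of Eq. (9) applied in the loss
```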

#### **3. Results**

Data from the open archive of The International Skin Imaging Collaboration (ISIC), the largest available set of dermatoscopic data, was used for the simulations [49]. The main clinical goal of the ISIC project is to support efforts to reduce melanoma-related mortality and to reduce biopsies by improving the accuracy and efficiency of early melanoma detection. ISIC develops proposed digital imaging standards and engages the dermatological and bioinformatics communities to improve diagnostic accuracy using artificial intelligence. While the initial focus of the ISIC collaboration is on melanoma, diagnosing non-melanoma skin cancer and inflammatory dermatoses is equally important. ISIC has developed an open-source platform for hosting images of skin lesions under Creative Commons licenses. The dermatoscopic photos are associated with reliable diagnoses and other clinical metadata and are available for public use. The ISIC archive contains 41,725 dermatoscopic photographs of various sizes, representing a database of digital representative images of the 10 most important diagnostic categories. Most of the photographs are digitized transparencies from the Rosendahl Skin Cancer Clinic in Queensland, Australia, and the Department of Dermatology at the Medical University of Vienna, Austria [50]. The dataset also contains statistical meta-information about the patient's age group (in five-year increments), anatomical site (eight possible sites), and gender (male/female).

Figure 7 shows a diagram of the distribution of dermatoscopic images across the 10 diagnostically significant categories. The categories are divided into the groups "benign" and "malignant" and are arranged in order of increasing risk and severity of the course of the disease. Since actinic keratosis can be considered intraepithelial dysplasia of keratinocytes and, therefore, a "precancerous" skin lesion, or as in situ squamous cell carcinoma, this category was assigned to the group of "malignant" pigmented neoplasms [51–53]. The diagram shows how strongly the available images of pigmented skin lesions are skewed towards the "nevus" category. Figure 8 shows diagrams of the distribution of the dermatoscopic image base according to the statistical data of patients. The database is dominated by male patients and patients aged 15 to 20 years. At the same time, pigmented skin lesions were most often found on the back (posterior torso).

**Figure 7.** Diagram of the distribution of the number of dermatoscopic images in 10 diagnostically significant categories.

**Figure 8.** Diagrams of the distribution of the base of dermatoscopic images according to the statistical data of patients: (**a**) by gender; (**b**) by age; (**c**) by the location of the pigmented lesion on the body.

The modeling was performed using the high-level programming language Python 3.8.8. All calculations were performed on a PC with an Intel(R) Core(TM) i5-8500 CPU @ 3.00 GHz and 16 GB of RAM, running a 64-bit Windows 10 operating system. Multimodal CNN training was carried out on a graphics processing unit (GPU) based on the NVIDIA GeForce GTX 1050 Ti video chipset.

Preliminary heterogeneous data processing was carried out at the first stage of the proposed multimodal classification system. Dermatoscopic image pre-processing consisted of stepwise hair removal and image resizing. The removal of hair structures was carried out using the developed method based on morphological operations, presented in Section 2.1. An empirical analysis of the application of Formula (1) showed that the best result of identification and cleaning of hair structures is achieved at *r* = 5 for the element *L*1 and at *r* = 3 for the element *L*2. In the calculations, the Euclidean (ℓ2) norm was used as the metric. It was also found empirically that the optimal threshold value in Formula (4) is *K* = 40. Examples of pre-cleaned dermatoscopic images are shown in Figure 9. Figure 9b was inverted to improve the visual perception of the results of the hair-extraction stage.

**Figure 9.** Examples of identification and cleaning of hair structures from dermatoscopic images of pigmented skin lesions using the proposed method: (**a**) original dermatoscopic image; (**b**) the result of extracting hair in the image (inverted image); (**c**) dermatoscopic image cleared of hair structures.

The pre-processing of patient metadata consisted of one-hot encoding to convert the metadata into the vector format required for further mining. The coding tables for each patient metadata index are presented in Tables 1–3. An example of pre-processing statistical patient metadata using one-hot encoding is shown in Figure 10.

**Table 1.** A coding table for patient gender metadata.



**Table 2.** A coding table for localization of pigmented lesion on the patient body.

**Table 3.** A coding table for patient age metadata.


**Figure 10.** An example of pre-processing statistical patient metadata using one-hot encoding.

CNN AlexNet [54], SqueezeNet [55], and ResNet-101 [56], pre-trained on the ImageNet set of natural images, were selected to simulate a multimodal neural network system for recognizing pigmented skin lesions. The most common size of dermatoscopic images in the ISIC database is 450 × 600 × 3, where 3 is the number of color channels. For the AlexNet and SqueezeNet architectures, the images were transformed to a size of 227 × 227 × 3; for CNN ResNet-101, the images were converted to 224 × 224 × 3. For further modeling, the base of dermatoscopic photographs was divided into training and validation images in a ratio of 80 to 20 percent. Since the ISIC dermatoscopic image base is strongly unbalanced towards the "nevus" category, the training images were augmented using affine transformations.

Large volumes of training data make it possible to increase the classification accuracy of automated systems for the neural network recognition of dermatoscopic images of pigmented skin lesions. Creating large-scale medical imaging datasets is costly and time-consuming, because diagnosis and further labeling require specialized equipment and trained practitioners. It also requires the consent of patients to process and provide personal data. Existing training datasets for the intelligent analysis of pigmented skin lesions, including the ISIC open archive, are imbalanced, with benign lesion classes over-represented. All of this leads to inaccurate classification results due to CNN overfitting.

Affine transformations are one of the main methods for increasing and balancing the amount of multidimensional visual data in each class. The possible affine transformations include rotation, displacement, reflection, and scaling. The selected dermatoscopic images of pigmented skin lesions include multidimensional visual data of various sizes, while different CNN architectures require input images of a certain size. Scaling using affine transformations converts the visual data into a set of images of the same size; scaling is usually combined with cropping to achieve the desired image size.
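For illustration, the listed affine augmentations could be expressed with torchvision transforms as in the following sketch; the parameter ranges are assumptions, as the paper does not report them.

```python
# A minimal torchvision sketch of the affine augmentations listed above
# (reflection, rotation, displacement, scaling, cropping). The parameter
# ranges are assumptions; the paper does not report them.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),            # reflection
    transforms.RandomAffine(degrees=45,           # rotation
                            translate=(0.1, 0.1), # displacement
                            scale=(0.8, 1.2)),    # scaling
    transforms.Resize(256),
    transforms.CenterCrop(227),  # scaling combined with cropping (AlexNet input)
])
```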

Augmentation of the dermatoscopic images of pigmented skin lesions included all of the above affine transformations, examples of which are shown in Figure 11.

**Figure 11.** Images obtained as a result of affine transformations: (**a**) original image; (**b**) image after rotation by a given angle; (**c**) image after the shift operation; (**d**) image after the scaling operation; (**e**) image after the reflection operation.

New multidimensional visual data were created from existing data using augmentation for more effective training, which increased the number of training images. Training data augmentation has proven effective in improving the accuracy of neural network recognition systems for medical data [57]. When trained, neural network classifiers tend to lean towards the classes containing the largest number of images [58]. The use of data augmentation made it possible to minimize the imbalance and achieve uniform learning across all of the diagnostically significant classes presented. An example of transformed dermatoscopic images from the database for training the multimodal neural network for recognizing pigmented skin lesions is shown in Figure 12.

**Figure 12.** Examples of dermatoscopic training images that have been previously cleaned and augmented using affine transformations.

Pre-processed images of pigmented skin lesions were fed into the CNN architectures. The vector of pre-processed metadata was provided to the input of a linear neural network, which consisted of several linear layers and ReLU activation layers. After the different input signals passed through the CNN and the linear neural network, the heterogeneous data were fused on the concatenation layer. The combined data were fed to the softmax layer for classification. Figures A1–A3 in Appendix A show graphs of the learning outcomes of the multimodal neural network system for recognizing pigmented skin lesions based on the various CNNs.

Table 4 presents the results of assessing the recognition accuracy of dermatoscopic images of pigmented skin lesions. The highest recognition accuracy for pigmented skin lesions, 83.56%, was achieved by the multimodal neural network system with a preliminary hair-cleaning stage based on the pre-trained AlexNet architecture [54]. When training each multimodal neural network architecture using the method of preliminary identification and cleaning of hair structures, the obtained recognition accuracy was higher than when training the original CNNs without a pre-processing stage. The increase in recognition accuracy when training multimodal neural network recognition systems for pigmented skin lesions with a preliminary hair-cleaning stage was 4.93–6.28%, depending on the CNN architecture. The largest improvement, 6.28%, was obtained when training the multimodal neural network classification system with a preliminary hair-cleaning stage based on the pre-trained ResNet-101 architecture [56]. The smallest increase in recognition accuracy, 4.93%, was shown by the multimodal system based on AlexNet [54]. Adding each of the components to the system improves the accuracy by 2.13–4.11%. As a result of modeling the original CNN architecture with the stage of preliminary cleaning of hair structures based on SqueezeNet, the increase in recognition accuracy was 2.13%; adding the stage of neural network analysis of statistical data increased the accuracy by another 4.11%. For the AlexNet architecture, these increases were 2.18% and 2.75%, respectively. For the ResNet-101 architecture, recognition accuracy increased by 3.17% and 3.11%, respectively. The results obtained indicate that the combined use of various methods for improving recognition accuracy can significantly increase the accuracy of neural network data analysis.


**Table 4.** Results of modeling a multimodal neural network classification system for dermatoscopic images of pigmented skin lesions. Bold font indicates the best result in each column of the table.

The results predicted by the multimodal neural network on the test sample were converted to binary form to construct the Receiver Operating Characteristic (ROC) curve; each predicted class label was represented as a binary vector of length 10. The ROC curve plots the proportion of correctly classified positive cases against the proportion of incorrectly classified negative cases:

$$TPR = \frac{TP}{TP + FN} \times 100\% \tag{10}$$

$$FPR = \frac{FP}{TN + FP} \times 100\% \tag{11}$$

where *TP* is the number of true positive cases; *TN* is true negative cases; *FN* is false negative cases; and *FP* is false positive cases. The ROC curve is plotted so that the *x*-axis is the proportion of false positives, *FPR*, and the *y*-axis is the proportion of true positives, *TPR*. The AUC is the area under the ROC curve and is calculated as follows:

$$AUC = \int_0^1 TPR \; d(FPR). \tag{12}$$
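In practice, Equations (10)–(12) correspond to a one-vs-rest ROC analysis; a minimal scikit-learn sketch (with hypothetical variable names) is shown below.

```python
# A minimal scikit-learn sketch of Equations (10)-(12): the class predictions
# are binarized one-vs-rest and the ROC curve and AUC are computed per class.
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def per_class_auc(y_true, y_score, n_classes=10):
    """y_true: (N,) integer labels; y_score: (N, n_classes) softmax outputs."""
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))
    aucs = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])  # FPR and TPR
        aucs[c] = auc(fpr, tpr)                              # Equation (12)
    return aucs
```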

Table 5 shows the results of testing the proposed multimodal neural network system for recognizing pigmented lesions with a stage of preliminary cleaning from hair structures. Figures 13–15 show confusion matrices resulting from testing multimodal neural network systems for identifying pigmented skin lesions based on various CNNs.

**Table 5.** Testing results of the proposed multimodal neural network system to recognize pigmented lesions. Bold font indicates the best result in each column of the table.



**Figure 13.** Confusion matrix in the testing results in a multimodal neural network system for recognizing pigmented skin lesions based on CNN AlexNet.


**Figure 14.** Confusion matrix in the testing results in a multimodal neural network system for recognizing pigmented skin lesions based on CNN SqueezeNet.


**Figure 15.** Confusion matrix in the testing results in a multimodal neural network system for recognizing pigmented skin lesions based on CNN ResNet-101.

Following the analysis of the confusion matrices in Figures 13–15, it can be concluded that the most frequent erroneous prediction results concern the different categories of malignant skin neoplasms (see percentages at the top of the columns). As summarized in Figure 16, part of these errors are benign lesions predicted as malignant (i.e., false positives). In addition, the malignant categories of "basal cell carcinoma" and "melanoma" are often predicted as pigmented neoplasms of benign categories. Based on the rows of the confusion matrices in Figure 16, malignant pigmented neoplasms are falsely recognized as benign in an average of 19.6% of cases.

The McNemar *χ*2 statistic was calculated as follows:

$$\chi^2 = \frac{(b - c)^2}{b + c} \tag{13}$$

where *b* is the number of images that the proposed multimodal system predicted incorrectly while the original CNN predicted correctly, and *c* is the number of images that the original CNN predicted incorrectly while the multimodal system predicted correctly.
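For reference, Equation (13) can be computed directly from the discordant counts *b* and *c*; the following minimal sketch (illustrative, not the authors' code) uses scipy for the p-value under a chi-squared distribution with one degree of freedom.

```python
# A minimal sketch of Equation (13); b and c are the discordant counts defined
# above. Illustrative only, not the authors' code.
from scipy.stats import chi2

def mcnemar_chi2(b, c):
    stat = (b - c) ** 2 / (b + c)   # Equation (13)
    p_value = chi2.sf(stat, df=1)   # survival function of chi^2 with df = 1
    return stat, p_value
```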

The results of the analysis of the McNemar test in Figure 17 show that the proposed multimodal neural network system correctly recognized pigmented neoplasms in 825–1238 images that were incorrectly classified by the original CNN with a hair pre-cleaning step, while 86–181 images were misclassified in contrast to the correct results of the original CNN with a hair pre-cleaning step. Based on the results of the McNemar test, the proposed multimodal neural network system correctly classifies images of pigmented neoplasms, on average, 12% more often than the original convolutional neural network architectures with a hair pre-cleaning step.

**Figure 16.** The confusion matrix of the test results of the proposed multimodal neural network system based on CNN is divided into two groups: (**a**) AlexNet; (**b**) SqueezeNet; (**c**) ResNet-101.


**Figure 17.** Classification tables of the neural network systems for recognizing pigmented skin lesions, used for the McNemar analysis, based on CNN: (**a**) AlexNet; (**b**) SqueezeNet; (**c**) ResNet-101.

Even though the proposed multimodal neural network system with a preliminary hair-cleaning stage achieves higher recognition accuracy than existing similar systems, as well as than the visual diagnostic methods used by physicians in the field of dermatology, the proposed system cannot be used as an independent diagnostic tool because of the presence of false-negative responses in cases of malignant neoplasms. This system can only be used as a high-precision auxiliary tool for physicians and specialists.

Figure 18 shows the ROC curve when testing a multimodal neural network system to identify pigmented skin lesions based on various CNNs.

**Figure 18.** Receiver operating characteristic (ROC) curves when testing the multimodal neural network system for recognizing pigmented skin lesions based on CNN: (**a**) AlexNet; (**b**) SqueezeNet; (**c**) ResNet-101.

The AlexNet deep neural network architecture is superior to other architectures in the following ways: it does not require specialized hardware and works well with a limited GPU; training AlexNet is faster than training other, deeper architectures; more filters are used on each layer; a pooling layer follows each convolutional layer; and ReLU is used as the activation function, which is more biologically plausible and reduces the likelihood of vanishing gradients [59]. These characteristics explain why the best training result of the multimodal neural network for recognizing pigmented skin lesions was obtained with the AlexNet architecture.

#### **4. Discussion**

As a result of modeling the proposed multimodal neural network system, the best recognition accuracy was 83.6%. The preliminary cleaning of hair structures and the analysis of heterogeneous data made it possible to significantly exceed the classification accuracy of simple neural network architectures for recognizing dermoscopic images. In [20], CNN GoogleNet Inception v3 was trained on dermoscopic images from nine diagnostically significant categories. The recognition accuracy of CNN GoogleNet Inception v3 was 72.1%, which is 11.46% lower than that of the multimodal neural network system proposed in this paper. In [21], the authors present the training of CNN ResNet50 on benign and malignant pigmented skin lesions. The trained ResNet50 CNN achieved 82.3% accuracy, which is 1.26% lower than the recognition accuracy of the proposed system with the hair pre-cleaning step. The superior recognition accuracy of the multimodal neural network system proposed in this paper compared to the results of pre-trained CNNs is explained by the different data processing methods, which, when used together, act in synergy.

In [60], preliminary hair cleaning is performed using the DullRazor method, and the skin lesion images are classified using a neural network classifier. The best recognition accuracy was 78.2%. The analysis of heterogeneous data using the proposed multimodal neural network system made it possible to increase the recognition accuracy by 5.4% compared to recognition using that neural network classifier. In [61], a skin cancer detection system is presented in which the preliminary cleaning of dermatoscopic images from hair was performed at the first stage using the DullRazor method, and classification was performed using the K-Nearest Neighbor (KNN) algorithm. The system's accuracy was 82.3%, which is 1.3% lower than the recognition accuracy of the proposed multimodal neural network system with the stage of preliminary cleaning of hair structures. The authors of [62] proposed a neural network system for classifying benign and malignant pigmented skin lesions with a stage of preliminary hair removal. This approach made it possible to achieve a classification accuracy of 79.1%, which is 4.5% lower than the recognition accuracy of the proposed multimodal neural network system. Combining and analyzing heterogeneous dermatological data allows the multimodal neural network algorithm to find additional links between images and metadata and to improve recognition accuracy compared to the classification of visual data alone by neural network algorithms.

A comparison of the recognition accuracy of various multimodal neural network systems for recognizing pigmented skin lesions with the proposed system is presented in Table 6.


**Table 6.** Results of recognition accuracy of various multimodal neural network systems for recognizing pigmented skin lesions.

In [34], the authors solved two problems for the neural network classification of pigmented skin lesions. The modeling was carried out on the open archive ISIC 2019, which is currently the most suitable for research in this area since it contains the largest amount of visual and statistical data. The authors selected 25,331 dermatoscopic images for modeling, divided into eight diagnostically significant categories. For the first task, the authors used various CNNs to classify dermatoscopic images. For the second task, statistical metadata about patients was used along with the photos. The multimodal neural network system for the second task consisted of a CNN for dermatoscopic imaging and a dense neural network for metadata. In the first step, the authors trained the CNN only on visual multivariate data, then fixed the CNN weights and connected a neural network with metadata. The core CNN architecture was a pre-trained EfficientNet family consisting of eight different models. Pre-trained SENet154 and ResNeXt models were also used for modeling variability. As a pre-processing stage, the images were cropped to the required size of 224 × 224 × 3 and augmented. Metadata pre-processing consisted of simple numeric coding, with missing values coded as "−5". Most of the training was done on an NVIDIA GTX 1080TI graphics card. The use of metadata improved the accuracy by 1–2%, with the increase observed mainly on smaller models. On the test set, the accuracy of the neural network recognition system in the first task was 63.4%. For the second task using metadata, the accuracy on the test set was 63.4%. At the same time, the most optimal results were 72.5 ± 1.7 and 74.2 ± 1.1 for the first and second tasks, respectively.

The identical modeling conditions, hardware resources, image base, and number of diagnostic categories make it possible to compare the results of the proposed multimodal neural network system with the stage of preliminary hair removal against the results of [34]. The recognition accuracy of the proposed multimodal system with the stage of preliminary hair removal on the test set was 83.6%, which is about 20.2% higher than the results of testing the system from [34]. The main difference of the multimodal neural network system proposed in this work is the use of the hair removal method at the stage of preliminary processing of the visual data, which significantly increased the accuracy.

In [35], a multimodal convolutional neural network (IM-CNN) is presented: a model for the multiclass classification of dermatoscopic images with patient metadata as input for diagnosing pigmented skin lesions. The modeling was carried out on the open dataset HAM10000 ("Human Against Machine with 10,000 training images"), part of the ISIC Melanoma Project open database, which consists of seven diagnostic categories. This set includes statistical metadata about patients such as age, gender, location of pigmented lesions, and diagnosis. The pre-trained DenseNet and ResNet architectures were used as the CNNs for classifying the dermatoscopic images. The best test result of the proposed model was 72% recognition accuracy, which is about 11.6% lower than that of the proposed multimodal system with a stage of preliminary hair removal. The main differences in the operation of the proposed multimodal system for the recognition of pigmented skin lesions are, firstly, the stage of preliminary hair removal and, secondly, the use of a larger number of diagnostically significant recognition classes and a more substantial amount of training data. These distinctive features made it possible to improve the visual quality of diagnostically significant signs on dermatoscopic images by removing hair structures and to improve the correctness and balance of the training of the neural network system.

The authors of [36] presented a method combining visual data and patient metadata to improve the efficiency of automatic diagnosis of pigmented skin lesions. The modeling was carried out on the ISIC Melanoma Project database, which consisted of 2917 dermatoscopic images of five classes (nevi, melanoma, basal cell carcinoma, squamous cell carcinoma, and pigmented benign keratoses). For image recognition, a modified ResNet-50 CNN architecture was used. Simulation results have shown that the combination of dermatoscopic images and metadata can improve the accuracy of the classification of skin lesions. The best average recognition accuracy (mAP) using metadata on the test set was 72.9%. This result is 10.7% lower than the recognition accuracy of the proposed multimodal system for recognizing pigmented skin lesions with a stage of preliminary removal of hair structures. The limited size and variety of the database of dermatoscopic examples used for training in [36] can significantly affect the reliability of the neural network classification system.

In [38], two methods for classifying pigmented skin lesions were proposed. The first method used a CNN to recognize dermatoscopic images. The authors selected 1000 images from the International Skin Imaging Collaboration (ISIC) archive, divided into two categories (benign and melanoma). The recognition accuracy for the two categories on the validation set was 82.2%. The second method used 600 images from the ISIC archive together with patient metadata. The metadata were added to the bottom of the dermatoscopic image pixel matrix in each RGB layer; after the metadata were added repeatedly, a colored bar appeared on the images. The accuracy of CNN recognition with the metadata on the validation set was 79.0%, which is 4.6% lower than the recognition accuracy of the proposed multimodal neural network system. Although adding metadata directly to the image matrix allowed the authors of [38] to improve the classification accuracy, using a separate full-fledged classifier for the statistical data is a more rational solution: convolutional layers in a CNN extract features from dermatoscopic images such as contour, color, and size, whereas the metadata added to the pixel matrix of each dermatoscopic image do not require feature extraction.

The main limitation in using the proposed multimodal neural network system for recognizing pigmented lesions in the skin is that specialists can only use the system as an additional diagnostic tool. The proposed system is not a medical device and cannot independently diagnose patients. Since the major dermatoscopic training databases are biased towards benign image classifications, misclassification is possible. The use of augmentation based on affine transformations makes it possible to minimize this factor but not completely exclude it.

A promising direction for further research is the construction of more complex multimodal systems for the neural network classification of pigmented skin neoplasms. The use of segmentation together with the preliminary cleaning of hair structures from the visual data will help highlight the contour of the pigmented skin lesion. Distortion of the shape of a skin neoplasm is an important diagnostic sign that may indicate the malignancy of the lesion.

#### **5. Conclusions**

The article presents a multimodal neural network system for recognizing pigmented skin lesions with a stage of preliminary cleaning of hair structures. The fusion of dissimilar data made it possible to increase the recognition accuracy by 4.93–6.28%, depending on the CNN architecture. The best recognition accuracy for the 10 diagnostically significant categories was 83.56%, obtained with the pre-trained AlexNet CNN architecture. At the same time, the best improvement in accuracy, 6.28%, was obtained with the pre-trained ResNet-101 architecture. The preliminary processing of the visual data made it possible to prepare the dermatoscopic images for further analysis and to improve the quality of diagnostically important visual information. At the same time, the fusion of patient statistics and visual data made it possible to find additional links between dermatoscopic images and the results of medical diagnostics, which significantly increased the classification accuracy of the neural networks.

Creating systems for automatically recognizing the state of pigmented lesions of patients' skin can be a good stimulus for cognitive medical monitoring systems and can reduce the consumption of financial and labor resources in the medical industry. At the same time, the creation of mobile monitoring systems for potentially dangerous skin neoplasms will make it possible to automatically receive feedback on the condition of patients.

**Author Contributions:** Conceptualization, P.A.L.; methodology, P.A.L.; software, U.A.L.; validation, U.A.L. and N.N.N.; formal analysis, N.N.N.; investigation, P.A.L.; resources, U.A.L.; data curation, N.N.N.; writing—original draft preparation, U.A.L.; writing—review and editing, P.A.L.; visualization, N.N.N.; supervision, P.A.L.; project administration, U.A.L.; funding acquisition, P.A.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work of Pavel Lyakhov (Section 2.1) is supported by the Ministry of Science and Higher Education of the Russian Federation 'Goszadanie' №075-01024-21-02 from 29 September 2021 (project FSEE-2021-0015). Research in Section 2.2 was supported by the Presidential Council for grants (project no. MK-3918.2021.1.6). Research in Section 2.3 was supported by the Presidential Council for grants (project no. MK-371.2022.4). Research in Section 3 was supported by Russian Science Foundation, project 21-71-00017.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** A publicly available dataset was analyzed in this study. These data can be found at https://www.isic-archive.com/#!/topWithHeader/wideContentTop/main, accessed on 7 February 2022. Both the data analyzed during the current study and the code are available from the corresponding author upon request.

**Acknowledgments:** The authors would like to thank the North-Caucasus Federal University for its support within the competition of projects of scientific groups and individual scientists of the North-Caucasus Federal University.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Appendix A shows the training and testing graphs of the proposed multimodal neural network systems based on various CNN architectures with preliminary cleaning of hair structures.

**Figure A1.** Graph of learning outcomes of a multimodal neural network system for classifying dermatoscopic images of pigmented skin lesions based on CNN AlexNet: (**a**) loss function; (**b**) recognition accuracy.

**Figure A2.** Graph of learning outcomes of a multimodal neural network system for classifying dermatoscopic images of pigmented skin lesions based on CNN SqueezeNet: (**a**) loss function; (**b**) recognition accuracy.

**Figure A3.** Graph of learning outcomes of a multimodal neural network system for classifying dermatoscopic images of pigmented skin lesions based on CNN ResNet-101: (**a**) loss function; (**b**) recognition accuracy.

#### **References**


### *Review* **Virtual Reality Rehabilitation Systems for Cancer Survivors: A Narrative Review of the Literature**

**Antonio Melillo 1,2, Andrea Chirico 3, Giuseppe De Pietro 4, Luigi Gallo 4, Giuseppe Caggianese 4, Daniela Barone 5, Michelino De Laurentiis 6,\* and Antonio Giordano 2**


**Simple Summary:** To the best of our knowledge, this is the first review aiming to assess the impact of VR on the rehabilitation care of cancer survivors. We conducted a general review of the current evidence on the efficacy of virtual reality rehabilitation (VRR) systems on cancer-related impairments as retrieved through a systematic search of the main research databases. VRR systems may improve adherence to rehabilitation training programs and be better tailored to cancer patients' needs, but more data is needed.

**Abstract:** Rehabilitation plays a crucial role in cancer care, as the functioning of cancer survivors is frequently compromised by impairments that can result from the disease itself but also from the long-term sequelae of the treatment. Nevertheless, the current literature shows that only a minority of patients receive physical and/or cognitive rehabilitation. This lack of rehabilitative care is a consequence of many factors, one of which includes the transportation issues linked to disability that limit the patient's access to rehabilitation facilities. The recent COVID-19 pandemic has further shown the benefits of improving telemedicine and home-based rehabilitative interventions to facilitate the delivery of rehabilitation programs when attendance at healthcare facilities is an obstacle. In recent years, researchers have been investigating the benefits of the application of virtual reality to rehabilitation. Virtual reality is shown to improve adherence and training intensity through gamification, allow the replication of real-life scenarios, and stimulate patients in a multimodal manner. In our present work, we offer an overview of the present literature on virtual reality-implemented cancer rehabilitation. The existence of wide margins for technological development allows us to expect further improvements, but more randomized controlled trials are needed to confirm the hypothesis that VRR may improve adherence rates and facilitate telerehabilitation.

**Keywords:** virtual reality; cancer; rehabilitation; disability; robotics; lymphedema; pain; fatigue; telemedicine

**Citation:** Melillo, A.; Chirico, A.; De Pietro, G.; Gallo, L.; Caggianese, G.; Barone, D.; De Laurentiis, M.; Giordano, A. Virtual Reality Rehabilitation Systems for Cancer Survivors: A Narrative Review of the Literature. *Cancers* **2022**, *14*, 3163. https://doi.org/10.3390/cancers14133163

Academic Editor: Hamid Khayyam

Received: 25 March 2022 Accepted: 24 June 2022 Published: 28 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Cancer ranks as a leading healthcare issue, with 19.3 million new cases worldwide in 2020 alone and an estimated projection of 28.4 million new cases for 2040 [1]. Concurrently with this increase in incidence, mainly explained by the world population's growth and aging, cancer mortality rates have been steadily decreasing by 1% per year, in both high- and low-income countries and for both sexes [2]. Thanks to both diagnostic and therapeutic advancements, the 5-year survival rate of cancer patients in the US has indeed increased from 49% in 1979 to roughly 67% in 2015 [3,4]. As a consequence of these trends, the population of individuals who have received a cancer diagnosis in their life is set to increase rapidly, with the latest projections for the US showing an increase from 16.9 million to 26.1 million people in 2040 [5]. "Cancer survivors" is a term generally used to define anyone living with the physical and/or psychological consequences of a recent or past cancer diagnosis and its treatment, with some researchers advocating for the inclusion of cancer patients' caregivers and family members under the term as well [6]. These consequences have a long and significant impact on the physical functioning of this population, as the disease itself, the long-term toxicity of chemotherapeutic drugs and radiotherapy, and surgical procedures can all result in chronic symptoms and long-standing physical and cognitive impairment.

Pain is by far one of the most common chronic symptoms cancer survivors experience, with prevalence rates of 55.0% during anticancer treatment, 39.3% after curative treatment, and 66.4% in advanced, metastatic, or terminal disease [7]. Persistent pain not only significantly undermines quality of life but also causes functional limitations and hence disability. Cancer-related fatigue (CRF) is another extremely common symptom in cancer patients, with a prevalence ranging from 25% to 99% depending on the specific disease, the treatment, and age [8]. Lymphedema is an extremely frequent consequence of cancer treatment, as it can be secondary to the surgical removal of lymph nodes, radiation therapy, chemotherapy, or a combination of these [9]. The condition may severely impact patients' lives, as it causes both pain and functional limitations. Its incidence is influenced by both the cancer and the intervention type: rates range from 75% of breast cancer patients after axillary node removal, to between 14.5% and 41.4% after chest and breast radiation therapy depending on the extension of the area involved, to 50% for melanoma patients and 16% for genitourinary cancers [10,11]. Many cancer survivors experience not only physical but also cognitive impairment, in particular in areas such as memory, attention span, word-finding, and speed of processing and execution. This impairment is sometimes colloquially referred to as "chemo brain", in reference to the well-known neurotoxicity of many chemotherapeutic drugs [12]. However, recent findings of mild cognitive impairment already present before chemotherapy cast doubt on the true cause(s) of this condition [13]. Chemotherapy-induced peripheral neuropathy (CIPN) is a severe collateral effect of chemotherapy, as many chemotherapeutic drugs can cause different types of nerve damage depending on the exact chemical compound [14]. Its incidence also varies depending on the treatment, ranging from 19% to 85%. Clinically, CIPN usually manifests mainly as a distal sensory deficit, with symptoms of dysesthesia, paresthesia, pain, or complete anesthesia. Motor symptoms occur less frequently, also usually involve the distal limbs, and cause balance and gait problems as well. CIPN usually develops gradually in the months after chemotherapeutic treatment and may affect the patient for years.

These conditions have been shown to benefit from rehabilitation, and in recent years many systematic reviews and guidelines have contributed to the establishment of specific recommendations for the prescription of exercise programs for different cancer types [15–19]. Despite this indication, many studies have shown that only a minority of cancer survivors are referred to rehabilitation programs. Reporting data collected from 163 breast cancer survivors, Cheville et al. found that 91% of women had physical impairments, but only 30% were receiving proper rehabilitative care [20]. Concordantly, a study by Hansen et al. examining a cohort of 3439 cancer survivors reported that a total of 60% of patients referred to an unmet need for either physical or psychological rehabilitation [21]. In a more recent 20-year follow-up of pediatric brain cancer survivors in Norway, this percentage rose to as high as 86% [22]. Through a non-systematic review of the previous literature, Cheville attempted to explain the lack of proper rehabilitative care, mentioning as possible causes the insidious and gradual genesis of these impairments as well as the incapability of the cancer care system to detect the impairing symptoms early [23]. However, even when a program is initiated, it is often discontinued as early as within the first twelve months, mainly as a result of the difficulty traditional training programs have in motivating patients' adherence [24]. In addition, the recent pandemic has clearly exposed another cause of this underutilization of rehabilitative cancer care: the inadequacy of the present rehabilitative care system in delivering home-based interventions [25,26]. Indeed, many cancer survivors suffer from disabilities or transportation issues which may limit their attendance at rehabilitation facilities. Therefore, in recent years many studies have investigated the role of telerehabilitation in the rehabilitative care of cancer survivors to improve adherence and as a safe and more accessible alternative to traditional rehabilitation [27–29]. One of the latest technologies proposed to remotely connect patients and rehabilitation professionals is Virtual Reality (VR) [26,30–34]. Virtual Reality Rehabilitation (VRR) has been tested in various clinical conditions, such as stroke-related deficits [35], spinal cord injuries [36], multiple sclerosis [37], Parkinson's disease [32], cerebral palsy [38–40], and cancer rehabilitation. Many studies have argued that VRR may improve both adherence rates and training intensity thanks to its entertaining and game-like nature [41–43].

The purpose of the present narrative review is to contribute to the investigation of whether VR may be a useful addition to the cancer rehabilitation field and to give an overview of the current evidence on this application. At the moment, the scientific literature contains either attempts to evaluate the advantages of VR implementation in the rehabilitation field in general [41,44] or reviews of the implementation of VR in palliative care for single cancer symptoms, mainly during acute cancer care, as highlighted by Zeng et al. [45,46]. From our perspective, the former fails to assess the advantages of VR-integrated rehabilitation when applied to the specifics of cancer survivor disabilities, which often result from the slow and insidious accrual of multiple symptoms and physical impairments [20]. The latter, on the other hand, does not examine the potential application of VR technology to cancer survivors with chronic symptoms and its role in an impairment-driven rehabilitation of disabilities resulting from a cancer history. Hence, to the best of our knowledge, this is the first review aiming to assess the impact of VR on the rehabilitation care of cancer survivors.

#### **2. Methods**

#### *Database Search*

The main online databases (PubMed, Scopus) were searched from inception until May 2022. The query string was the following: "Cancer Survivor\*" OR "cancer" OR "cancer patient\*" AND "Lymphedema" OR "cancer-related fatigue" OR "Fatigue" OR "Chronic Pain" OR "Cancer Pain" OR "cognitive" OR "motor" OR "symptom management" OR "peripheral neuropathy" AND "Rehabilitation" OR "Telerehabilitation" OR "Exercise" OR "physical therapy" OR "sensorimotor rehabilitation" OR "exercise training" OR "postural balance" OR "sensorimotor" AND "Virtual Reality" OR "body sensors" OR "avatar\*". The first author performed the literature search. The first and second authors independently screened titles and abstracts as well as the full texts' reference lists against the eligibility criteria, and the final selection of articles was discussed by the first and second authors. Study eligibility was assessed using the PICOS tool [47]: to be included, studies had to fulfill the following inclusion criteria: (1) population: individuals with a history of cancer; (2) intervention: Virtual Reality-based rehabilitation; (3) comparison for RCTs: standard physiotherapy; (4) outcomes for clinical trials: functional parameters, pain, lymphedema volume, cancer-related fatigue, program adherence, exercise performance; and (5) study design: RCT with or without control, prospective studies, comparative studies, feasibility studies. Studies published in English, Spanish, or Italian were all considered.

#### **3. Results**

The search of the main databases (PubMed, Scopus) produced a total of 7733 results. Duplicate detection led to the elimination of 149 results. After screening against the eligibility criteria, a total of nine studies were selected for our review (Figure 1). We therefore review below the design of the included studies, summarized in Table 1.

**Figure 1.** PRISMA flowchart of the study selection.


**Table 1.** Features of the included studies.

\* Table 1: Features of the included studies. VR: virtual reality; VAS: visual analogue scale; DASH: Disability of the Arm, Shoulder, and Hand questionnaire; ROM: range of motion; TKS: Tampa Kinesiophobia Scale; CRF: cancer-related fatigue; NRS: numeric rating scale; FMA: Fugl-Meyer assessment; CAHAI-9: Chedoke Arm and Hand Activity Inventory; JHFT: Jebsen hand function test; ADL: activities of daily living; UEFI-20: upper extremity function index; BDI-II: Beck Depression Inventory, Second Edition; NAB: Neuropsychological Assessment Battery; HVLT-R: Hopkins Verbal Learning Test, Revised; BVMT-R: Brief Visuospatial Memory Test, Revised; TMT: Trail Making Test; FES-I: Falls Efficacy Scale-International; BPI: Brief Pain Inventory scale; EQ-5D-5L: EuroQol five-dimension, five-level quality-of-life scale; FACIT-Fatigue: Functional Assessment of Chronic Illness Therapy Fatigue scale; DASS-SF: Depression, Anxiety, and Stress Scales, short form.

Atef et al. conducted a quasi-randomized clinical trial comparing the efficacy of VRR and proprioceptive neuromuscular facilitation (PNF) on post-mastectomy lymphedema-related excess upper-arm volume and upper arm function recovery, measured through the QuickDASH-9 scale [48]. The experimental procedure consisted of a 30 min exercise program using a Wii Fit non-immersive VR game. Both the VRR and the PNF procedures were conducted two times per week for a total of 4 weeks. During these sessions, both groups, consisting of 15 women each, also received pneumatic compression for the treatment of lymphedema.

Axenie and Kurz conducted a prospective study on the combination of Virtual Reality avatars and Machine Learning to drive patient-tailored CIPN-related motor deficit compensation [49]. They proposed a closed-loop system based on wearable devices designed to precisely assess the kinematics of the sensorimotor deficits. Furthermore, they conceptualized a VR avatar designed to reproduce the patient's movements and to display the discrepancies between the desired movement and the measured/executed one, so as to trigger deficit compensation.

Basha et al. conducted a randomized clinical trial comparing the therapeutic efficiency of non-immersive VR training and resistance exercise training on breast cancer-related lymphedema [50]. The experimental protocol consisted of an exercise program conducted through Xbox Kinect games involving upper arm motion. Both rehabilitation groups, consisting of 30 patients each, received five rehabilitation sessions per week for 8 weeks. The outcome measures included excessive limb volume and pain, measured through the visual analog scale (VAS); the impairment of the upper arm, measured through the Disability of the Arm, Shoulder, and Hand (DASH) questionnaire; shoulder range of motion (ROM); shoulder muscle strength; and hand grip strength.

Feyzioğlu et al., 2019 presented a prospective randomized controlled trial comparing the efficacy of a non-immersive VRR intervention with standard physiotherapy in breast cancer survivors who had undergone surgery with axillary dissection [51]. The experimental and control groups, each consisting of 20 individuals, received the treatment for 45 min per session, two times a week for 6 weeks. The experimental intervention consisted of playing Xbox Kinect games involving upper arm motion in the presence of a trained physiotherapist. In addition, the intervention group also received a 5 min scar tissue massage and 5 min of passive shoulder joint mobilization, performed by the same physiotherapist assisting them. The outcomes considered were pain (VAS), grip strength, functionality (assessed through the DASH questionnaire), muscle strength, ROM, and fear of movement, measured through the Tampa Kinesiophobia Scale (TKS).

Hoffman et al. (2014) conducted a non-controlled trial investigating the feasibility of a home-based VRR intervention on seven lung cancer patients who had received thoracotomy [52]. The home-based rehabilitation program, divided into two phases of 5 and 10 weeks, respectively, consisted of playing Nintendo Wii Fit Plus exergames of gradually increasing intensity and duration 5 days a week. The VRR sessions did not require the presence of rehabilitation professionals. The outcomes considered were the levels of adherence, measured as the days of actual training, exercise performance, cancer-related fatigue (0–10 scale), perceived self-efficacy for fatigue self-management (0–10 scale), and perceived self-efficacy for walking 30 min (%).

House et al. conducted a trial on a sample of six patients to investigate the feasibility of a rehabilitative intervention based on a novel technology, named BrightArm Duo, on breast cancer survivors with post-surgical pain and depression [53]. The tested tool consisted of a combination of a robotic table for forearm rehabilitation and a computer executing non-immersive VR rehabilitation games. The rehabilitation program consisted of training sessions lasting 20 to 50 min, twice a week for a period of 8 weeks. The outcomes considered were pain, measured through the Numeric Rating Scale (NRS); arm, hand, and bimanual function, measured through the Fugl-Meyer assessment, the Chedoke Arm and Hand Activity Inventory, and the Jebsen hand function test; upper arm autonomy in the activities of daily living, measured through the Upper Extremity Function Index (UEFI-20); depression, measured through the Beck Depression Inventory (BDI-II); and cognitive function, measured through the Neuropsychological Assessment Battery (NAB), the Hopkins Verbal Learning Test (HVLT-R), the Brief Visuospatial Memory Test (BVMT-R), and the Trail Making Test (TMT).

Reynolds et al. conducted a pilot study to evaluate the efficacy of two different VRR interventions on pain, CRF, and quality of life [54]. The study involved two groups of 19 and 20 women with metastatic breast cancer who were asked to participate in an immersive home-based VR intervention. The technology involved consisted of a Pico Goblin VR headset playing two different relaxing scenarios. The outcomes considered were pain, measured through the Brief Pain Inventory scale (BPI); quality of life, measured through the EQ-5D-5L scale; fatigue, measured through the Functional Assessment of Chronic Illness Therapy Fatigue scale (FACIT-Fatigue); and depression, anxiety, and stress levels, measured through the short version of the Depression, Anxiety, and Stress Scales (DASS-SF).

Schwenk and colleagues conducted a randomized trial on VR-based balance training [55]. The authors used inertial sensors equipped with gyroscopes and accelerometers on the lower limbs to assess positions and joint angles and a multi-step balance retraining virtual game based on the inputs of the sensors. In particular, the intervention group, consisting of 11 individuals with chemotherapy-induced polyneuropathy, conducted exercises and balance retraining tasks while receiving visual and auditory feedback on their motor errors. The outcomes measured were the sway of the hip, the sway of the ankle, the center of mass movement, gait speed, and fear of falling, measured through the Falls Efficacy Scale (FES-I).
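
For orientation, sway metrics of this kind are commonly obtained by integrating the gyroscope's angular velocity into angular displacement and summarizing the result as a confidence-ellipse area. The sketch below shows one generic version of such a computation; it is an illustrative assumption, not the computation implemented in the system used in [55].

```python
import numpy as np

def sway_ellipse_area(gyro_deg_s: np.ndarray, fs: float) -> float:
    """Generic postural-sway metric from a wearable gyroscope: integrate
    angular velocity (deg/s, columns = sagittal and coronal planes) into
    angular displacement and return the 95% confidence ellipse area (deg^2)."""
    theta = np.cumsum(gyro_deg_s, axis=0) / fs   # (N, 2) angle estimates
    theta -= theta.mean(axis=0)                  # remove the mean posture
    cov = np.cov(theta, rowvar=False)            # 2x2 sway covariance
    lam = np.linalg.eigvalsh(cov)                # variances along ellipse axes
    return float(np.pi * 5.991 * np.sqrt(lam.prod()))  # chi^2(2, 0.95) = 5.991
```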

Tsuda et al. conducted a preliminary study on a VR-based exercise program on over 60-year-old hospitalized patients with hematological malignancies receiving chemotherapy [56]. The virtual reality exercise program involved Nintendo Wii Fit games, which were played for 20 min a day, five times a week until hospital discharge. The primary outcomes were adherence rates, physical performance (measured through the Barthel index), muscle strength, and emotive state (hospital anxiety and depression scale).

In summary, eight of the considered studies were clinical trials, and one study conducted a preclinical investigation [49]. Of the clinical trials, four compared VRR to a standard rehabilitation program [48,50,51,55]. One study involved an immersive VR program [54], while the remaining eight studies used non-immersive VR technology. As for the populations considered by the clinical trials, five of the included studies involved breast cancer survivors [48,50,51,53,54]. As for the outcomes considered, four of the retrieved studies tested VRR on more than one physical impairment [50,51,53,54]. Overall, we found four studies testing the efficacy of VRR on chronic pain [50,51,53,54], two studies on cancer fatigue [52,54], two studies on lymphedema-related excessive arm volume [48,50], one on cognitive function [53], four on motor performance impairment [48,50,51,53], and two on chemotherapy-induced polyneuropathy [49,55]. Finally, two of the included studies considered adherence rates as an outcome [52,56]; their results are reported below.

#### *3.1. Pain*

Feyzioğlu et al. did not find a statistically significant difference in pain [51]. The study, however, found significant differences in the decrease in fear of movement as measured through the Tampa Kinesiophobia Scale. House et al. reported a 20% decrease in pain after treatment, which did not reach statistical significance (*p* = 0.1) [53]. Basha and colleagues, comparing non-immersive VR exercise with regular resistance exercise in patients with breast cancer-related lymphedema, found significant differences in pain intensity (*p* = 0.002) between groups [50]. Reynolds et al. found that both scenarios significantly reduced pain (mean difference = −6.01, *p* = 0.004) [54]. To summarize, four of the included studies considered pain as their outcome, but only two found a statistically significant effect.

#### *3.2. Fatigue*

Hoffman et al. reported statistically significant improvements in both CRF severity and perceived self-efficacy for walking [52]. Reynolds et al. found a statistically significant difference in fatigue at follow-up compared to before the intervention (mean difference −5.00, *p* < 0.001) [54]. To summarize, two of the included studies found statistically significant effects of VR on cancer-related fatigue.

#### *3.3. Lymphedema*

Atef et al. found that both VR and PNF exercise reduced edema, with no significant differences between the two (*p* = 0.902) [48]. Basha et al.'s trial showed no significant differences between groups for lymphedema-related excessive shoulder volume (mean difference = −11.1 mL, *p* = 0.15) [50]. In conclusion, none of the included studies found statistically significant evidence in favor of a VRR intervention compared to standard rehabilitation.

#### *3.4. Cognitive Impairment*

House et al.'s study on VR rehabilitation found it effective on cognitive function, with 10 out of 11 parameters improved (*p* = 0.004) [53].

#### *3.5. Motor Performance*

The Feyzioğlu trial on arm rehabilitation following mastectomy recorded improvements in range of motion, grip strength, and arm muscle strength but did not find any significant differences with the control group [51]. House et al.'s study, also considering arm rehabilitation in breast cancer patients following surgery, reported a significant improvement of the affected shoulder in 17 of 18 range-of-motion metrics (*p* < 0.01), of which five were above the Minimal Clinically Important Difference [53]. The study also reported a recovery in 13 out of 15 strength and function metrics (*p* = 0.02). Basha et al.'s trial also found statistical differences in physical and motility outcomes (shoulder flexion strength, external rotation strength, abduction strength, and handgrip strength) in favor of the control group, which performed regular resistance exercises [50]. The trial also reported, however, that VRR was significantly superior to standard rehabilitation for the range-of-motion outcome (*p* < 0.001). Lastly, the Atef et al. trial reported statistically significant differences between the VRR group and the control group regarding the functional improvement of the arm following mastectomy (*p* = 0.045) [48]. To summarize, four trials considered motor impairment as their outcome, but only two reported a statistically significant effect of VRR, while one trial found it inferior to standard rehabilitation on some of the considered outcomes.

#### *3.6. Chemotherapy-Induced Peripheral Neuropathy*

Schwenk et al. reported how the sway of the hip, ankle, and center of mass while standing with eyes opened and in a semi-tandem position was significantly reduced in the intervention group compared to the control (*p* = 0.010–0.022 and *p* = 0.008–0.035, respectively, for the two positions) [55]. No significant effects were found for balance with eyes closed, gait speed, and fear of falling (*p* > 0.05).

#### *3.7. Adherence to Rehabilitation Programs*

Tsuda et al. recorded an adherence rate of 66.5% across 88 sessions among 16 hospitalized patients and noted the maintenance of physical performance [56]. The Hoffman et al. study reported a mean adherence rate of 96.6% (SD: 3.4%) at the end of Phase I and of 87.6% (SD: 12.2%) at the end of Phase II [52]. To summarize, two studies considered adherence rates as an outcome, but neither compared them to standard rehabilitation adherence rates.

In summary, VRR was found to be significantly effective for cancer-related fatigue, cognitive impairment, and CIPN-related balance impairment. VRR was found to be effective for cancer survivors' pain, but only two studies found it significantly superior to standard rehabilitation. The included studies showed mixed results for the motor impairment outcome, with two studies reporting statistically significant data in favor of VRR and one study reporting statistically significant results in favor of the control group for some of the motor performance outcomes. None of the included studies found a statistically significant effect on lymphedema.

#### **4. Discussion**

The present review aimed to offer an overview of the current evidence regarding the benefits of integrating VR into the rehabilitation of the chronic symptoms and impairments of a specific population, cancer survivors. As previously discussed, the impairments and chronic symptoms considered by the present review are indications for and can be treated through rehabilitation programs [15–17]. The studies retrieved by our database search found VRR effective on cancer survivors' pain, in accordance with previous reviews which found VR interventions effective not only for acute but also for chronic pain [57–59]. However, only two of the included studies found VRR significantly superior to standard rehabilitation for cancer survivors, so more studies will need to address this comparison. Two of the included studies found statistically significant effects of VR on cancer-related fatigue. This is consistent with the previous literature, which found VRR effective for the treatment of chronic fatigue in other conditions, such as multiple sclerosis [60]. Regarding cancer-related fatigue specifically, however, previous studies have focused on testing the effects of VR on acute cancer fatigue, for example during procedures such as chemotherapy infusions. Indeed, a 2020 systematic review concluded that VR had a statistically significant beneficial effect on cancer-related fatigue immediately after VR-assisted chemotherapy infusions [61]. Consequently, it must be concluded that more studies are needed to confirm the efficacy of VRR for the long-term treatment of chronic cancer-related fatigue. One study found VRR effective for the treatment of CIPN-related balance impairment, consistent with the results of previous studies on the use of VRR for the treatment of balance impairment secondary to other conditions such as diabetic neuropathy, stroke, and senility [62–64]. Two of the included studies considered lymphedema-related excessive arm volume as an outcome, but neither found statistically significant evidence in favor of a VRR intervention compared to standard rehabilitation. The included studies also showed mixed results for the motor impairment outcome, with two studies reporting statistically significant data in favor of VRR and one study reporting statistically significant results in favor of the control group for some motor performance outcomes. This result is inconsistent with previous studies showing the efficacy of VRR compared to regular exercise for motor performance and strength outcomes in different conditions, such as cerebral palsy, senility, and after stroke [65–67]. One study found VRR effective for the treatment of cognitive impairment in cancer survivors, consistent with the previous literature stating the efficacy of VRR interventions for cognitive impairment [68–72].

Among the included studies, three conducted a home-based intervention [51,52,54]. This area of research is particularly crucial for cancer survivors: as previously discussed, one of the factors contributing to the limited access that cancer patients have to rehabilitative care seems to be the transportation issues resulting from the patients' disability [16,23,73]. For this reason, many studies have been investigating the potential role of telerehabilitation in improving cancer patients' access to rehabilitative care [29]. Furthermore, the previous literature has addressed how virtual reality may more generally improve and facilitate remote-assisted and home-based healthcare interventions [26,33,74,75]. Considering more particularly the studies included in our review, Hoffman et al. employed a Wii Fit device to deliver a rehabilitative program of increasing intensity. The program involved only two home visits by a rehabilitation professional, one of which was before the start of the training program to set up the device, with only remote phone assistance afterwards. The study showed promising results in terms of adherence rates; however, its single-arm design did not allow the authors to conclude whether the VR-implemented program actually improved adherence rates compared to standard facility-based or home-based training programs. Reynolds and colleagues also reported the results of a VRR home-based intervention that did not require assistance from a rehabilitation professional but did not report adherence rates. However, discussing the acceptability of their intervention, they reported a feedback comment which may be found suggestive, although of course far from acceptable as evidence:

"With my lack of mobility that's resulted from my illness, I really enjoyed the VR as it made me feel like I'm not house bound . . . "

Feyzioğlu et al., on the other hand, conducted a randomized controlled trial comparing two home-based interventions: an Xbox 360 Kinect-based intervention and a standard physiotherapy intervention. However, the experimental intervention involved a combination of standard physiotherapy and VRR, as it consisted of a phase of active training through a VRR gaming session plus passive mobilization and scar tissue massaging, both performed by the trained physiotherapist. As such, this home-based intervention required the constant physical presence of a rehabilitation professional rather than involving remote assistance. It must therefore be concluded that more studies are needed to examine whether VR implementation would facilitate remote supervision and whether the implementation of this technology in home-based interventions would improve cancer survivors' adherence. A possible limitation emerging from the overview of the included studies is, however, the compatibility of some of the applied VRR systems, and especially some of their more complex additional devices, with home-based interventions in terms of both cost and usability. On the other hand, several of the included studies did test VR devices that are already commercially available, mainly for entertainment and gaming purposes, and which may even already be present in patients' houses [48,50,51,54,56]. As previously reported, two of the included trials considered adherence as an outcome [52,56]. However, both were single-arm studies, so more studies are needed to confirm the hypothesis that VRR may actually improve adherence in cancer patients compared to traditional rehabilitation. Such a result would be consistent with previous studies reporting how VRR may benefit both adherence rates and training intensity [41–43,62,76]. More evidence on this subject would be very significant, as many studies have highlighted how cancer survivors often discontinue rehabilitation programs as early as within the first 12 months [24]. One of the contributing factors to these statistics seems to be the patients' lack of confidence and motivation, as standard rehabilitation programs typically require high numbers of repetitions of exercises, which are found to be tiring and boring, when not outright frustrating [77]. On this subject, it has been theorized that VRR may increase the patients' enjoyment of and excitement about the administered rehabilitation task, which many researchers argue may benefit both adherence rates and training intensity [41–43]. Part of the excitement added by the VR implementation may be explained by the novelty of interacting with a virtual world or even simply wearing an HMD instead of using standard training tools. However, part of its potential in terms of increased engagement seems to derive from the possibility of adding game-like features, rules, and designs to the training tasks, a process named gamification [34,78–80]. Indeed, the virtually unlimited possibilities of virtual scenario design allow adding positive feedback and an exciting narrative to the training activities through the setting of goals, challenges, and competition elements such as score points and badges [79,81–83]. In addition, VR scenarios can replicate real-life tasks and situations, with the result of greater physical and cognitive fidelity of the trained task to the everyday task the patient needs to reacquire.
So, it may be argued that VRR may improve motivation by structuring a more goal-oriented training program compared to the execution of physical exercises in the context of a rehabilitation facility.

Another possible advantage of VRR comes from the multisensorial nature of VR experiences, which allows the stimulation of the patient in a multimodal manner [74]. This is particularly important when it comes to cancer-related disabilities, which, as previously discussed, often derive from the sum of more than one impairment. On this subject, we stress that four of the retrieved studies tested VRR on more than one physical impairment [50,51,53,54]. In addition, three of the included studies considered the effects of VRR on both psychological and physical outcomes [53,54,56], with one also considering cognitive outcomes [53]. Furthermore, we would also like to note that two of the included studies tested VRR systems integrating VR with other technologies [53,55]. In particular, House et al. tested a system consisting of a low-friction robotic rehabilitation table, computerized forearm supports, and a display delivering the non-immersive VR scenario. Schwenk et al. used inertial sensors equipped with gyroscopes and accelerometers on the lower limbs, connected to the VRR software, to deliver error-based retraining in the required motor tasks. Many previous studies have also integrated VR with other technologies, using the VR software to process data sent live from different digital rehabilitation tools, including treadmills [40,84–88], data gloves [89–91], and robotically assisted orthoses [92–96]. In this regard, we stress that VR software can serve as an integration platform for many devices currently being tested or already in clinical use in the rehabilitation field and for cancer survivors.

#### **5. Conclusions**

The included studies and the previous literature suggest that VRR may be better tailored to cancer survivors' needs, such as the need for home-based rehabilitation, the need for incentives for adherence and motivation, and the need for a multimodal approach. More randomized controlled trials are needed to produce evidence on the possible advantages of VRR compared to standard rehabilitative care. In particular, it would be crucial to confirm the hypothesis that VRR may improve adherence rates thanks to its more entertaining nature and multimodal stimulation. Lastly, we wish to encourage the development of new VRR systems and VRR training programs structured to support remote connections in order to allow patients to more easily reach the assistance of healthcare and rehabilitation professionals. Nonetheless, the existence of wide margins for technological development allows us to expect further improvements in the clinical efficacy and usability of VRR systems as well as a reduction in their prices.

**Author Contributions:** Conceptualization, A.C., A.M. and L.G.; methodology, A.C., G.C. and L.G.; investigation, G.C., L.G., D.B. and A.M.; writing—original draft preparation, A.M.; writing—review and editing, D.B., A.C. and A.M.; visualization, G.C. and D.B.; resources, G.D.P., M.D.L. and A.G.; supervision, A.G., G.D.P. and M.D.L.; project administration, A.G. and G.D.P.; funding acquisition, M.D.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Review* **Diagnostic Strategies for Breast Cancer Detection: From Image Generation to Classification Strategies Using Artificial Intelligence Algorithms**

**Jesus A. Basurto-Hurtado 1,2, Irving A. Cruz-Albarran 1,2, Manuel Toledano-Ayala 3, Mario Alberto Ibarra-Manzano 4, Luis A. Morales-Hernandez 1,\* and Carlos A. Perez-Ramirez 2,\***


**Simple Summary:** With the recent advances in the field of artificial intelligence, it has become possible to develop robust and accurate methodologies that can deliver noticeable results in different health-related areas; oncology is one of the hottest research areas nowadays, as it is now possible to fuse the information contained in images with the patient's medical records in order to offer a more accurate diagnosis. In this sense, understanding how an AI-based methodology is developed can offer helpful insight for developing such methodologies. In this review, we comprehensively guide the reader through the steps required to develop such a methodology, starting from image formation and moving on to its processing and interpretation using a wide variety of methods; further, some techniques that can be used in next-generation diagnostic strategies are also presented. We believe this insight will give students and researchers in the related areas a deeper comprehension of the advantages and disadvantages of every method.

**Abstract:** Breast cancer is one of the main causes of death for women worldwide, as around 16% of the malignant lesions diagnosed worldwide are its consequence. In this sense, it is of paramount importance to diagnose these lesions in the earliest stage possible in order to have the highest chances of survival. While there are several works that present selected topics in this area, none of them presents a complete panorama, that is, from the image generation to its interpretation. This work presents a comprehensive state-of-the-art review of the image generation and processing techniques used to detect Breast Cancer, where potential candidates for the image generation and processing are presented and discussed. Novel methodologies should consider the adroit integration of artificial intelligence concepts and categorical data to generate modern alternatives that can offer the accuracy, precision, and reliability expected to mitigate misclassifications.

**Keywords:** breast cancer; mammography; magnetic resonance; ultrasound; thermography; image processing; artificial intelligence

#### **1. Introduction**

According to the World Health Organization, Breast Cancer (BC) represents around 16% of the malignant tumors diagnosed worldwide [1]. In Mexico, BC is the leading cause of death from cancer in the female population [2].

**Citation:** Basurto-Hurtado, J.A.; Cruz-Albarran, I.A.; Toledano-Ayala, M.; Ibarra-Manzano, M.A.; Morales-Hernandez, L.A.; Perez-Ramirez, C.A. Diagnostic Strategies for Breast Cancer Detection: From Image Generation to Classification Strategies Using Artificial Intelligence Algorithms. *Cancers* **2022**, *14*, 3442. https://doi.org/10.3390/cancers14143442

Academic Editor: David Wong

Received: 21 May 2022 Accepted: 12 July 2022 Published: 15 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

BC develops when a lump begins an angiogenesis process, that is, the process that causes the development of new blood vessels and capillaries from the existing vasculature [3]. Unfortunately, BC has a mortality rate of 69% in emerging countries, which is greater than that in developed countries [1]. This difference is explained by the cancer being detected at a later stage, which makes treatment a financial obstacle, as its price increases considerably when the disease is detected in an advanced stage [4]. Hence, the development of strategies that can perform an early detection of BC is a priority topic for governments, as early detection increases the survival chances and lowers the financial burden the disease imposes on families and health systems [4].

A methodology for BC detection can be composed of 4 steps: (1) image acquisition, (2) segmentation and preprocessing, (3) feature extraction, and (4) classification. An illustration of the abovementioned concepts is shown in Figure 1.

**Figure 1.** BC detection using image processing strategies.

From this figure, it can be seen that the first step uses the different available technologies to acquire the internal tissue dynamics of the breast so that they can be expressed as an image; the second step executes algorithms that perform basic tasks on the images (for instance, correcting the color scale) so that segmentation, i.e., the detection of Regions of Interest (ROIs), can be done; the third step quantifies the differences between images that contain abnormalities and those that do not; finally, once the differences are quantified, it is necessary to classify them to provide a diagnosis. With the rapid development of novel technologies that can capture the dynamics of breast tissues more accurately, numerous advances have been made in all the aforementioned fields; even so, detecting all abnormalities without generating false alarms is still a highly desirable feature for all proposals [5,6]. Recently, some articles have reviewed proposals regarding feature classification and its interpretation [6–9]; yet an article that presents the main technologies used to form the breast image as well as the processing stages required to provide a diagnosis is still missing. This article presents a state-of-the-art review of both the technologies used to create the breast image and the strategies employed to perform the image processing and classification. The article is organized as follows: Section 2 describes the main technologies used for image generation; Section 3 describes the methods used to perform segmentation, feature extraction, and interpretation; next, Sections 4 and 5 present some emerging techniques that can be used to improve image formation and the algorithms used for interpretation. The article ends with some concluding remarks.
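
To make the four steps concrete, the following deliberately minimal Python sketch strings them together on a grey-scale image; Otsu thresholding and two geometric features are illustrative stand-ins for the richer methods reviewed in Section 3, not recommended choices.

```python
import numpy as np
from skimage import exposure, filters, measure

def bc_pipeline(image: np.ndarray) -> list:
    """Minimal, illustrative instance of the four-step methodology of
    Figure 1 (acquisition is assumed already done)."""
    # Step 2: preprocessing (grey-scale equalization) and ROI segmentation.
    eq = exposure.equalize_hist(image.astype(float))
    mask = eq > filters.threshold_otsu(eq)
    regions = measure.regionprops(measure.label(mask))
    # Step 3: feature extraction for each candidate region.
    feats = [{"area": r.area, "eccentricity": r.eccentricity} for r in regions]
    # Step 4: a toy rule in place of a trained classifier (Section 3.3).
    for f in feats:
        f["suspicious"] = f["area"] > 100 and f["eccentricity"] > 0.8
    return feats
```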

#### **2. Technologies Used to Obtain Breast Tissue Images**

One of the steps required to develop a diagnostic system is the representation of the breast tissue dynamics. In this sense, there are several technologies that are commonly used to represent the tissue by means of images. This section presents the most widely used ones.

#### *2.1. Mammography*

Mammography is a study used to screen the breast tissue in order to detect abnormalities that could indicate the presence of cancer or other breast diseases [10]. This technique has a sensitivity of up to 85% in the recommended population. Essentially, mammography uses low doses of X-rays to form a picture of the internal breast tissues [11]. To form the picture, the breasts are compressed by two plates with the aim of mitigating the dispersion of the rays, making it possible to obtain a better picture without using a high X-ray dose [11]; tissue changes might appear as white zones on a grey contrast [11]. On average, the total radiation dose for a typical mammogram with 2 views of each breast is about 0.4 mSv [11]. Figure 2 illustrates the mammography procedure.

**Figure 2.** Mammography procedure.

Several works have focused on the processing of digital mammographies to detect the most common signs that could indicate the presence of cancer: calcifications or masses [12]. Traditionally, the specialist looks for zones that have a different appearance (size, shape, contrast, edges, or bright spots) than the normal tissue. The automatization of this task has been proposed through the employment of segmentation algorithms [13–15], and some attempts using neural networks have been made [12,16,17], delivering encouraging results.

Recently, the utilization of Breast Tomosynthesis (BT) and Contrast-Enhanced Mammography (CEM) [10] has been proposed as an improvement to traditional digital mammography. The former is a 3D breast reconstruction that further improves the image resolution, whereas the latter improves it by injecting a contrast agent, thereby exposing the anatomic and vascular definition of the abnormalities. These techniques provide some improvement for patients with dense breast tissue; yet the detection of clustered microcalcifications is still an issue [10], and additional screening tests are required to determine whether an abnormality detected by CEM is cancer, besides requiring more expensive equipment.

#### *2.2. Ultrasound*

Ultrasound is a non-invasive and non-irradiating technique that uses sound waves to create images of organs, in this case the breasts, to detect changes in their form. To create the images, a transducer sends high-frequency sound waves (>20 kHz) and measures the reflected ones [10]. The image is formed using the sound waves reflected from the internal tissues. This procedure is depicted in Figure 3.

Ultrasound is used for three purposes: (1) assessing and determining the condition of an abnormality, that is, helping doctors establish whether an abnormal mass is solid, which might require further examination, is fluid-filled, or has both features; (2) as an auxiliary screening tool, used when the patient has dense breasts and mammography is not reliable enough; or (3) as a guide for performing a biopsy of the suspected abnormality [10]. Several computer-aided diagnosis (CAD) systems that analyze ultrasound images have been proposed [18]. One of the points they note as needing improvement is the resolution of the images, for example through specially designed filters [19]. Another proposed modification is the utilization of micro-bubbles that are injected into the abnormalities detected at first sight [20].

**Figure 3.** Ultrasound procedure.

It should be noted that masses tend to stay in position when compressed, i.e., they do not displace. Elastography is the technique employed to measure tumor displacement under compression using a special transducer [21]. These developments have led to the discovery of masses that usually require a biopsy to determine their nature, which delays the confirmation of the diagnosis [10,21]; moreover, the image interpretation requires a well-trained specialist, who is not always available to perform all the studies.

#### *2.3. Magnetic Resonance Imaging (MRI)*

Breast MRI (BMRI) uses a magnetic field and radio waves to create a detailed image of the breast. Usually, a 1.5 T magnet is used along with a contrast agent, usually gadolinium, to generate the images of both breasts [22]. To acquire the images, the patient is placed in a prone position, in order to minimize respiratory movement and to allow the expansion of the breast tissue [10,22]. When the magnet is turned on, the magnetic field temporarily realigns the water molecules; thus, when radio waves are applied, the emitted radiation is captured using specially designed coils, located at the breast positions, which transform the captured radiation into electrical signals. The coils' position must ensure an appropriate field of view from the clavicle to the infra-mammary fold, including the axilla [10]. An illustration of the patient position is depicted in Figure 4.

**Figure 4.** BMRI procedure.

The main objective of acquiring the images is to assess the breast symmetry and possible changes in the parenchymal tissue, since those changes might indicate the presence of lesions that can be malignant. In general, malignant lesions have irregular margins (or asymmetry), whereas benign ones usually have a round or oval geometrical shape with well-defined margins (symmetry). To deliver the best possible result, it is necessary to suppress the homogeneous fat around the breast and parenchyma, since fat can render images uninterpretable, especially for detecting subtle lesions [10,22].

On the other hand, one of the problems of BMRI is its false-positive (specificity) rate, as the technique can detect small masses (lesions less than 5 mm in size) that are benign [10,22]. To mitigate this issue, nanomaterials that stick to cancerous masses but not to benign ones have been developed [23], as well as new contrast agents [24]. Recently, a multiparametric approach has been suggested as a strategy to improve the specificity rate [10].

#### *2.4. Other Approaches*

Recently, microwave radiation has been employed as an alternative way to obtain information about the breast tissue. The microwaves, whose frequency ranges from 1 to 20 GHz, are applied to the breast, and the reflected waves are measured using specially designed antennas. To obtain the best possible results, some works propose that the tissue must be immersed in a liquid [25], and acquisition systems that deal with this issue have been proposed [26–29].

When a biopsy is necessary to confirm the diagnosis, images of the cells that form the abnormalities are obtained using, among other techniques, fine needle aspiration cytology (FNAC) and core or excisional biopsy. Once the cell images are captured, image processing techniques are applied to detect the differences between normal and malignant cells, which are classified using modern strategies [30–32] such as neural networks, probabilistic algorithms, and association rules coupled with neural networks.

It should be pointed out that other imaging alternatives are also employed, such as Computed Tomography (CT) and Positron Emission Tomography (PET). The former employs X-rays to form images of the chest from different angles; using image processing and reconstruction algorithms, a 3D image of the chest (including the breasts) is obtained [33,34]. The latter uses a small amount of tracer, a specially designed sugar with radioactive properties known as fluorodeoxyglucose-18. The main idea of using this type of sugar is that cancer cells have an increased glucose consumption compared with normal cells; in this sense, the tracer accumulates in the zones where there is increased glucose consumption [35,36]. It is worth noticing that these techniques are recommended for determining the cancer stage rather than as a first-line diagnosis scheme [10,37]. In this way, they complement the three main techniques by providing more information on the tissues surrounding the breasts [37]. Table 1 summarizes the abovementioned methods.


**Table 1.** Summary of the used breast image generation technologies.



As can be seen in Table 1, numerous advances in imaging techniques have been achieved in the last years; still, there is a need to develop strategies that allow obtaining sharp images, even for dense breast tissues. In this sense, the obtained images can be used to perform a focused surveillance of the patients that have a higher risk of developing the disease, allowing cancer detection in the earliest possible stage. On the other hand, these novel imaging techniques should be able to operate without imposing additional requirements, such as specific electrical or mechanical conditions, so that they can be easily adopted in hospitals or in ambulatory settings.

#### **3. Image Processing and Classification Strategies**

#### *3.1. ROI Estimation*

Once the image is acquired, the next required step is its interpretation. To this purpose, it is necessary to identify the suspicious regions that might contain masses or calcifications, where model-, region-, or contour-based algorithms for image segmentation are employed [45]. It should be noticed that these approaches often rely on manual input to refine the segmentation zones, which limits the applicability of the proposals to different datasets [45], making it necessary to develop novel strategies that can automatically detect all the zones of interest. Recently, Sha et al. [46] proposed a convolutional neural network (CNN)-based method for segmentation. The authors developed an optimization scheme to determine the best parameters for the CNN in order to segment the suspicious zones. The results presented show the proposal has a reasonable sensitivity and specificity (89% and 88%, respectively) for determining whether a mammogram presents cancerous tumors. Wang et al. [47] present a CNN-based strategy in which the convolutional layer is modified to increase the detection of multiple suspicious zones. Heidari et al. [48] employ a Gaussian band-pass filter to detect suspicious zones using local properties of the image. On the other hand, Suresh et al. [49] and Sapate et al. [50] employ a fuzzy-based strategy to cluster all the pixels with similar features in order to detect all the zones that present differences. Other strategies involve the utilization of mathematical morphology [51–55], image contrast and intensity [56,57], geometrical features [58,59], correlation and convolution [60,61], nonlinear filtering [62,63], texture features [64], deep learning [65–69], and entropy [70,71], among others. It is worth noticing that, given the diversity of the employed strategies, some of them still require initial guidance to detect the suspicious zones, either by manually selecting pixels inside the zone or by using the radiologist's notes about the localization. An effective approach for automatic detection should employ a denoising stage to remove the residual noise generated during acquisition and equalization, so that the pixel-intensity disparities associated with the environmental light can be mitigated as much as possible.
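
As an illustration of the filter-based detection mentioned above, the sketch below implements a difference-of-Gaussians band-pass in the spirit of the approach of Heidari et al. [48]; the spatial scales and threshold are illustrative assumptions, not the parameters of the original work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bandpass_suspicious_zones(img: np.ndarray, s_low: float = 2.0,
                              s_high: float = 12.0, k: float = 2.5) -> np.ndarray:
    """Difference-of-Gaussians band-pass: keep structures between the two
    spatial scales (e.g., masses brighter than their surroundings) and flag
    pixels more than k standard deviations above the mean response."""
    img = img.astype(float)
    band = gaussian_filter(img, s_low) - gaussian_filter(img, s_high)
    return band > band.mean() + k * band.std()   # boolean mask of suspicious pixels
```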

#### *3.2. Feature Extraction*

After the suspicious zones are detected and segmented, it is necessary to extract features from them to generate the information needed to classify the detected lesions as cancerous or benign. To this purpose, Fourier transform-based methods [48,72], wavelet transform-based strategies [73–76], geometric features [77,78], information theory algorithms [79], co-occurrence matrix features [47,80–82], histogram-based values [46,83–85], and morphology [86,87], among others, have been employed. On the other hand, with the increased capabilities (the number of simultaneous operations that can be performed) of new-generation graphics processing units, it is now possible to execute computationally heavy algorithms faster than on a multicore processor [88]; in consequence, novel neural network algorithms that perform the feature extraction and quantification are now being proposed. For instance, Xu et al. [89] use a CNN to extract features from ultrasound images and classify suspicious areas into four categories: skin, glandular tissue, masses, and fat. They modify the convolutional filters to speed up the process. Arora et al. [90] also use an ensemble of CNN architectures to extract the suspicious zones directly, modifying only the final layers to speed up the training process. Gao et al. [91] use a deep neural network to generate features from mammograms, employing a modified architecture where the outputs and inputs of the network are used to update the model parameters during training. Similar approaches are described in [92–95]. A co-occurrence-based texture descriptor from the classical route is sketched below.
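A minimal sketch of co-occurrence-matrix texture features, one of the feature families cited above [47,80–82]; it assumes an 8-bit grayscale ROI, and the chosen distances, angles, and properties are illustrative.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(roi):
    """Compute a small texture-feature vector from a uint8 ROI."""
    glcm = graycomatrix(roi, distances=[1, 3], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    # Each property is averaged over the chosen distances and angles.
    return np.array([graycoprops(glcm, prop).mean()
                     for prop in ("contrast", "homogeneity", "energy", "correlation")])

roi = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in for a segmented ROI
print(glcm_features(roi))                              # 4-dimensional feature vector
```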

It should be pointed out that a reduction of the estimated features is often applied to reduce the computational resources used in the training scheme and to mitigate overfitting, which reduces the algorithm's efficacy. This step is known as dimensionality reduction [45], and the most employed algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA). PCA uses eigenvalue-based algorithms to find uncorrelated combinations of features with maximum variance, as this indicates the maximum variation of the information contained, whereas LDA projects the samples so as to maximize the distance between the class means; the greater the distance between the means, the better separated the classes are [96]. Nevertheless, these algorithms use global properties of the values, which might deliver suboptimal results [96]. For these reasons, hybrid strategies have been proposed, such as neurofuzzy algorithms [97,98], diffusion maps [99], deep learning [100–102], independent component analysis (ICA) [103], clustering-based approaches [104], and multidimensional scaling [105], among others. It should be pointed out that hybrid approaches, such as the abovementioned ones, are particularly effective when a non-linear relationship between the features exists. The contrast between the two classical reducers is sketched below.
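A minimal sketch contrasting PCA (unsupervised, variance-driven) with LDA (supervised, class-separation-driven); the synthetic data and the number of retained components are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for an (n_samples, n_features) matrix with benign/malignant labels.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8, random_state=0)

X_pca = PCA(n_components=10).fit_transform(X)             # keep 10 directions of maximum variance
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)  # project to maximize class-mean separation
print(X_pca.shape, X_lda.shape)  # (300, 10) (300, 1): LDA yields at most n_classes - 1 components
```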

To the best of the authors' knowledge, there are no papers that evaluate the abovementioned techniques on the same database to compare their efficacy. This is an interesting research topic, since the results of such a comparison could provide guidelines about which image modality (mammogram, ultrasound, or MRI) and which technique deliver the best performance.

#### *3.3. Classifiers*

The last step of this stage is the classification of the extracted features to make a diagnosis. Broadly speaking, a classifier uses the input data to find relationships that can be used to determine the class to which the input data belongs. The evaluation of the classifier is done using four basic measurements: accuracy, specificity, sensitivity, and the area under the curve [106,107]. Accuracy refers to the percentage of images that are correctly classified into their corresponding classes; sensitivity is the percentage of images classified as malignant that truly are malignant; specificity is the percentage of images classified as benign that truly are benign; and the area under the curve is a parameter that allows choosing the optimal model, taking a value between 0 and 1, with a good classifier having a value close to 1 [108]. Depending on the training algorithm required by the strategy, classifiers can be divided into unsupervised and supervised [45,106,107]. The four measurements are illustrated below.
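A minimal sketch of the four evaluation measures described above, computed from hypothetical ground-truth labels and classifier scores (1 = malignant); the numbers are placeholders, not results from any cited study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.1, 0.7, 0.6])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct classifications
sensitivity = tp / (tp + fn)                 # malignant cases correctly flagged
specificity = tn / (tn + fp)                 # benign cases correctly cleared
auc = roc_auc_score(y_true, y_score)         # threshold-free ranking quality, between 0 and 1
print(accuracy, sensitivity, specificity, auc)
```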

#### 3.3.1. Unsupervised Classifiers

An unsupervised classifier aims to find the underlying structures of the input data without making explicit the class to which the input data belongs [109]. In this sense, input data with similar values are assigned to the same class [109]. Dubey et al. [110] studied the effect of the initialization scheme and the number of clusters in the K-means algorithm. To this purpose, the random and foggy initialization methods were employed. They note that the foggy initialization method and Euclidean-type distances produced the best results, with an accuracy of 92%. K-means and K-nearest neighbor classifiers have also been employed by Singh et al. [58] and Hernandez-Capistran et al. [111]. This family of classifiers is effective when the distance between the clusters is reasonable; when the clusters overlap, the accuracy is highly degraded. For this reason, Onan [112] introduced fuzzy-logic concepts to measure the distance between the set of input features and the clusters, with mutual information, an information theory measure, chosen as the distance. The author reports an accuracy of 99%, and a specificity and sensitivity of 99% and 100%, respectively. Similar results are achieved using the fuzzy c-means algorithm [113,114], a fuzzy-based classifier for time series [115], and fuzzy rule classifiers [116,117], among others. Other clustering-based approaches employed for classification are hierarchical clustering [118] and unsupervised test vector optimization [119]. It should be pointed out that unsupervised classifiers require a careful selection of the features used to train the algorithm, since an incorrect mix of features will degrade the performance of the classifier. The basic clustering step is sketched below.
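A minimal sketch of unsupervised grouping with K-means, the algorithm used in [58,110,111]; the feature matrix is synthetic, and in practice the two clusters would only later be mapped to benign/malignant by inspecting their contents.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated synthetic clusters stand in for extracted lesion features.
X, _ = make_blobs(n_samples=200, centers=2, n_features=6, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # cluster index assigned to the first ten samples
```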

#### 3.3.2. Supervised Classifiers

Supervised classifiers require a priori knowledge of the class to which the input data belongs, that is, the input data must be labeled. The Decision Tree (DT) is an algorithm that uses a set of rules to determine the class of the input data. A DT has been employed by Mughal et al. [71], who detect masses in mammograms using texture features in the region of interest, obtaining an accuracy, specificity, and sensitivity of 89%, 89%, and 88.5%, respectively. Shan et al. [120] employ geometrical features to classify abnormalities detected in ultrasound images, obtaining an accuracy, sensitivity, and specificity of 77.7%, 74.0%, and 82.0%, respectively. An improvement on the DT is the Random Forest (RF). During the training stage, an RF builds several DTs and favors the ones with the lowest error, thereby enhancing the accuracy. RFs are considered ensemble classifiers, and several applications have been reported [121–124] whose accuracy, specificity, and sensitivity show an improvement. Another type of ensemble classifier is the Adaptive Boosting (AdaBoost) algorithm. It combines weak classifiers, whose individual classification accuracy is only somewhat greater than 50%, by iteratively reweighting the training samples so that subsequent weak classifiers focus on the previously misclassified (outlier) cases, improving the ensemble accuracy. AdaBoost applications have been reported [125–127] that achieve good results (accuracy, specificity, and sensitivity values greater than 90%); yet, the authors note that extensive investigation is still required to ensure that these results can be obtained with different types of images (mammograms, ultrasound, and MRI). The three tree-based families are compared in the sketch below.
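A minimal sketch comparing the three supervised classifiers discussed above on synthetic data; the resulting scores say nothing about breast-image performance and only illustrate the API pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
for clf in (DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(n_estimators=100, random_state=0),  # ensemble of trees
            AdaBoostClassifier(n_estimators=100, random_state=0)):     # reweighted weak learners
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, round(acc, 3))
```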

Another classification algorithm widely used for BC detection is the support vector machine (SVM). An SVM finds the hyperplane that divides the zones where the values of the input features are located. In this regard, Liu et al. [52] use morphological and edge features combined with an SVM classifier with a linear kernel to detect benign and malignant masses in ultrasound images. They obtain an accuracy, sensitivity, and specificity of 82.6%, 66.67%, and 93.55%, respectively. It should be noted that most of the reviewed works use the term malignant to describe masses or lesions that are cancerous, regardless of the cancer type. To improve these results, Sharma and Khanna [128] use Zernike moments as features and an SVM classifier with a non-linear kernel function, obtaining a specificity and sensitivity of 99%. Similar approaches have been reported [87,129–133]. It is worth noticing that if the features have a strong nonlinear relationship, other classifiers could deliver better results. Both kernel choices are sketched below.
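A minimal sketch of the two SVM variants mentioned above: a linear kernel as in Liu et al. [52] and a non-linear (RBF) kernel in the spirit of Sharma and Khanna [128]; data and scores are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=15, random_state=1)
for kernel in ("linear", "rbf"):
    # Feature scaling matters for SVMs, since the hyperplane is distance-based.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```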

#### *3.4. Artificial Intelligence-Based Classifiers*

Artificial Intelligence (AI) is the branch of computer science that develops algorithms to perform complex tasks previously solved with human knowledge [134]. Evidently, since classification is a task usually solved by the physician, AI can provide automated solutions. In this sense, Artificial Neural Networks (ANN) are a type of AI algorithm employed to perform classification into different classes. ANNs are brain-inspired algorithms that store the knowledge contained in the input data through a training process [135]. An ANN consists of a three-layer scheme: input, hidden, and output, as depicted in Figure 5.

**Figure 5.** Artificial Neural Network.

The training process takes the information contained in the input variables and adjusts the values of the variables (weights) that connect all the layers in order to match each input with its respective class; in this way, the hidden pattern shared by all the inputs and their corresponding classes is detected and stored. Consequently, it is necessary to use a sufficiently large database, with representative scenarios, to train the ANN. Beura et al. [136] present a methodology that employs mammograms to detect masses (benign and malignant) using the two-dimensional discrete wavelet transform (2D-DWT) with normalized gray-level co-occurrence matrices (NGLCM). The images are segmented using a cropping-based strategy to obtain the ROIs, which are analyzed with the symmetric biorthogonal 4.4 mother wavelet and a decomposition level of 2. All the frequency bands are processed to obtain the features (NGLCM), and the *t*-test is used to select the most discriminant ones. The obtained results show that the proposal achieves an accuracy, sensitivity, and specificity of 94.2%, 100%, and 90%, respectively, using the ANN classifier, whereas an RF classifier using the same database obtains an 82.4% accuracy. Mohammed et al. [137] use fractal dimension values as features to classify ultrasound breast images as benign or malignant. They obtain the ROIs using a cropping-based algorithm and process them to obtain multifractal dimension features, achieving an accuracy, sensitivity, and specificity of 82.04%, 79.4%, and 84.76%, respectively, with an ANN classifier; they point out that the ROI extraction algorithm must be improved. Gallego-Ortiz and Martel [138] classify MRI breast images using graph-based features, the Deep Embedded Clustering algorithm to select the most relevant features, and an ANN classifier. The ROIs are obtained using a graph model, and they obtain an area under the curve of 0.80 (the closer to 1, the better). ANN classifiers have also been used in [139–142]. A minimal ANN classifier is sketched below.
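A minimal sketch of the three-layer ANN of Figure 5 applied to a feature vector (e.g., wavelet or NGLCM features such as those of Beura et al. [136]); the hidden-layer size and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
# One hidden layer of 32 neurons; training adjusts the weights connecting the
# layers so that each input feature vector is matched to its class.
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", ann.score(X_te, y_te))
```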

Deep neural networks (DNN) are a specific type of AI algorithm based on the architecture of an ANN [134]. DNNs resemble how the brain stores the acquired knowledge in multiple layers to solve a specific task [8]. The Convolutional Neural Network (CNN) is a DNN that emulates the visual processing cortex to determine the class to which an image belongs [8,134]. A typical CNN scheme is depicted in Figure 6.

**Figure 6.** Convolutional Neural Network.

From the figure, it is seen that a CNN consists of kernel (convolutional), pooling, and fully connected layers. The purpose of the kernel layer is to detect and extract the spatial features of the image, which is usually done with the convolution operator. The output of this layer, known as the feature map, might contain negative values that can cause numerical instabilities in the training stage; thus, the map is processed with an activation function to remove the negative values. Once the feature map is processed, the pooling layer reduces the amount of information it contains in order to eliminate redundancy; finally, the output of the pooling layer goes to the fully connected layer to be classified. In this sense, several works [143–148] have employed CNNs to detect benign and malignant tissues in either mammography or MRI images. They note that the depth of the network (i.e., the number of layers), the fine-tuning of some of the kernel or pooling layers, and the number of images all affect the classifier performance. This pipeline is sketched below.
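A minimal sketch of the convolution, activation, pooling, and fully connected pipeline of Figure 6 for a single-channel 64×64 patch with two output classes (benign/malignant); the architecture is purely illustrative, not any of the cited networks.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # kernel layer: spatial feature maps
            nn.ReLU(),                                   # removes negative map values
            nn.MaxPool2d(2),                             # pooling: discard redundant detail
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, 2)     # fully connected output layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(4, 1, 64, 64))  # batch of four random patches
print(logits.shape)                            # torch.Size([4, 2])
```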

Ribli et al. [149] add an additional layer to implement filters specifically designed for mammograms. The CNN they employ has 16 layers and classifies the detected lesions as benign or malignant, obtaining an area under the curve of 0.85. A similar approach is proposed in [150]; the modification there is that a fully connected layer is placed as the first layer of the CNN, so that when the images are noise-corrupted, the feature extraction process is not degraded. They obtain an accuracy, sensitivity, and specificity of 98.7%, 98.65%, and 99.57% for the detection of benign and malignant lesions in mammograms. Zhang et al. [151] carried out a test to find the best-suited process for the pooling layer. They found that rank-based stochastic pooling is the best-suited algorithm, obtaining an accuracy, sensitivity, and specificity of 94.0%, 93.4%, and 94.6%, respectively, for classifying lesions as normal or abnormal using mammograms. Similar approaches have been proposed [152–155]. Table 2 presents a summary of the classifiers discussed above. It should be noted that a mix of images from mammograms, ultrasound, and MRI is usually employed, and these images usually come from private databases.

From the data shown in Table 2, it can be seen that it is necessary to standardize the minimum requirements regarding the number of images that the databases must have. In this way, the performance metrics employed, i.e., accuracy, specificity, and sensitivity, can be compared more fairly. Moreover, even when the presented approaches show interesting results, they highlight the necessity of a considerable database containing a significant number of labeled images to obtain the best possible results, which in many real-life scenarios is not possible. For these reasons, algorithms that can work with both labeled and unlabeled images are still a necessity.



**Table 2.** Summary of the used image classification algorithms.



#### **4. Recent Image Generation Techniques**

*Infrared Thermography (IRT) Applied to Breast Cancer*

Temperature has been documented as an indicator of health [156]. Speaking specifically of breast cancer, an existing tumor promotes the formation of new blood vessels (angiogenesis) to supply the nutrients for its growth, resulting in an increase in metabolism; thus, the temperature around the tumor increases in all directions [157]. To detect these temperature changes, IRT has been used, as it measures the intensity of the thermal radiation that bodies emit and converts it into temperature [158]. The emitted energy can be located in the electromagnetic spectrum, as shown in Figure 7, where it is seen that the infrared (IR) band ranges from 0.76 to 1000 μm and is in turn divided into near-IR, mid-IR, and far-IR. The available technology to measure IR allows performing this task with non-invasive, contactless, safe, and painless equipment [159–161], making it suitable for developing screening technologies.

**Figure 7.** Electromagnetic spectrum.

To obtain the best possible images, three main factors that influence thermographic imaging in humans must be considered [162,163].


Considering all the above-discussed aspects, a suitable setup for a controlled scenario to acquire thermographic images focused on breast cancer is depicted in Figure 8.

**Figure 8.** Proposed experimental set up for the breast thermal images acquisition.

Once the room is conditioned for obtaining the thermographic images, the acquisition can be done. The reported results make use of the previously discussed image processing and classification algorithms. Table 3 presents a brief summary of the most recent proposed works.


**Table 3.** Summary of the breast lesions detection using infrared thermography.

Recently, dynamic infrared thermography (DIT) has been proposed as an alternative to further improve the image quality and sharpness [64]. DIT is a sequence of thermograms captured after stimulating the breasts by means of a cold stressor [176]. The objective of this stressor is to generate a contrast between areas with abnormal vascularity and metabolic activity and areas free of abnormalities; it is thus possible to analyze the breasts' thermal response after removing this stimulus, enhancing the image sharpness. Silva et al. [177] proposed a technology that analyzes the information from the DIT to identify patients at risk of breast cancer, segmenting the area of interest (breast) and analyzing the changes in temperature through the different thermograms acquired. Saniei et al. [178] proposed a system that segments both breasts to obtain the branching points of the vascular network, which represent the pattern of the veins; these patterns are then classified to obtain the diagnosis. As can be seen, DIT requires robust systems for analyzing the acquired thermograms over time, which should be considered in order to generate the next generation of equipment capable of early detection of the angiogenesis process. By doing this, patients can be properly monitored so that changes in the patterns of the angiogenesis process are detected.

#### **5. Recent Classification Algorithms**

As pointed out in the Classifiers subsection, it is necessary to overcome the lack of large databases of diagnosed images (mammograms, ultrasound, or MRI) to generate robust and efficient classifiers. In this sense, semi-supervised methods can be an attractive choice to explore. They usually combine an unsupervised algorithm that clusters the available images, so that a representation of the dataset is obtained, with a supervised classifier that assigns the classes to the images [109,179]. The unsupervised stage assumes that unlabeled images that are close to labeled ones in the input space share the same labels [109]. Some of the most recent developments that could be applied to breast cancer detection are presented below, starting with a sketch of the basic self-training idea.
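A minimal sketch of the semi-supervised idea described above: samples whose label is hidden (marked -1) are progressively labeled by a supervised base classifier trained on the few labeled ones. The data, the 90% unlabeled fraction, and the SVM base learner are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1   # hide 90% of the labels (-1 = unlabeled)

# The base classifier must output probabilities so confident pseudo-labels can be chosen.
model = SelfTrainingClassifier(SVC(probability=True)).fit(X, y_partial)
print("accuracy on all samples:", accuracy_score(y, model.predict(X)))
```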

#### *5.1. Autoencoders*

An autoencoder is a neural network with one or more hidden layers that is used to reconstruct the input compactly, as the hidden layers have fewer neurons than the input. The autoencoder is depicted in Figure 9.

**Figure 9.** Autoencoder structure.

From the figure, it is seen that it has two parts: the encoder, which maps the input to its compact representation, and the decoder, which performs the inverse operation, that is, uses the compact representation to recover the original data. The most common training scheme employs a loss function that reduces the error between the original and reconstructed data. For breast cancer detection, autoencoders can be used in feature extraction stages, as the encoder obtains the compact representation, or features, of the input image, which is then followed by a supervised classifier. Recently, this approach has been explored [79,94,180–183], showing promising results for generating robust methodologies, with accuracy values above 95%. The structure and training loop are sketched below.
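A minimal sketch of the encoder/decoder structure of Figure 9; the 30-dimensional input and 8-dimensional bottleneck are illustrative. After training, `encoder(x)` would serve as the compact feature vector fed to a supervised classifier.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(30, 8), nn.ReLU())  # compress the input to 8 features
decoder = nn.Sequential(nn.Linear(8, 30))             # reconstruct the input from them
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction error between original and decoded data

x = torch.randn(256, 30)   # stand-in for a batch of image feature vectors
for _ in range(100):       # training loop minimizing reconstruction loss
    opt.zero_grad()
    loss = loss_fn(decoder(encoder(x)), x)
    loss.backward()
    opt.step()
print("final reconstruction loss:", loss.item())
```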

#### *5.2. Deep Belief Networks (DBN)*

They are based on the usage of restricted Boltzmann machines (RBMs). An RBM uses only two layers, input and hidden, to represent, as in the case of the autoencoders, the most important features of the input data, but in a stochastic way [99]. This ensures that outliers do not affect the network performance. Detailed information can be found in [184,185]. The main idea in employing DBNs is that the image segmentation can be done without external guidance; thus, a totally automated methodology can be proposed. Recent works have explored this idea to perform liver segmentation [186], lung lesion detection [187], and fusion of medical images [188]. Its use could deliver promising results for detecting BC. The RBM building block is sketched below.
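A minimal sketch of a single restricted Boltzmann machine, the two-layer stochastic building block of a DBN; stacking several such layers with greedy layer-wise training yields the full network. The binary data and layer sizes are illustrative.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = (np.random.rand(200, 64) > 0.5).astype(float)  # binary stand-in for image patches
rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)
H = rbm.fit_transform(X)   # hidden-layer activations: a 16-dimensional stochastic feature code
print(H.shape)             # (200, 16)
```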

#### *5.3. Ladder Networks*

The Ladder Network, proposed by Rasmus et al. [189], uses an autoencoder as the first part of a feedforward network to denoise the inputs; furthermore, by determining the minimal set of features that represent the inputs, the classification can be done with simple algorithms. The network uses a penalization term in the training algorithm to ensure the maximum similarity between the original and reconstructed inputs.

#### *5.4. Deep Neural Network (DNN)-Based Algorithms*

Recently, DNN-based classification strategies have been proposed to maximize the accuracy of the classifiers while reducing the computational resources required for training and execution, with the physics-informed neural network and, more recently, the Deep Kronecker neural network [190] among the most recent proposals. In particular, these networks are designed to take full advantage of adaptive activation functions. Traditional activation functions, such as the unipolar and bipolar sigmoid and the ReLU, might have problems when dealing with low-amplitude features, as the training algorithm fails to reach the lowest point of the error surface, generating classifiers prone to generalization issues [190].

In this sense, by introducing a parameter into the activation function that can be modified during the training process, the gradient descent can avoid stalling in a local minimum of the error surface [191]; thus, the highest accuracy can be obtained once the global minimum is reached [192]. The results presented in [190–192] suggest that this type of activation function might increase the classifier accuracy without increasing the computational burden required to train the network, as the geometrical shape defined by the activation function can adapt during training to the decision boundary where classification is required. It should be noted that the proposed Rowdy family of activation functions could be an interesting research topic for designing classification algorithms, as the presented results demonstrate that the lowest error is achieved in a prediction task. The core idea is sketched below.
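A minimal sketch of an adaptive activation in the spirit of [190–192]: a trainable scaling parameter `a` inside the activation lets the optimizer reshape the nonlinearity during training. The exact Rowdy and Deep Kronecker formulations in [190] are more elaborate than this illustration.

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """tanh(n * a * x) with a trainable slope parameter a, initialized to 1/n."""
    def __init__(self, n=10.0):
        super().__init__()
        self.n = n
        self.a = nn.Parameter(torch.tensor(1.0 / n))

    def forward(self, x):
        # Gradient descent updates `a` together with the weights, steepening or
        # flattening the activation where the decision boundary needs it.
        return torch.tanh(self.n * self.a * x)

layer = nn.Sequential(nn.Linear(8, 16), AdaptiveTanh(), nn.Linear(16, 2))
print(layer(torch.randn(5, 8)).shape)  # torch.Size([5, 2])
```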

#### **6. Concluding Remarks**

This paper presents a state-of-the-art review of the technologies used to acquire images of the breast and the algorithms used to detect BC. To the best of the authors' knowledge, this is the first review article that deals with all the steps required to propose a reliable methodology for BC detection. This is important, as the earliest detection of the disease can save a considerable amount of money in treatment and, most importantly, potentially save numerous lives.

The analyzed papers focus on the processing of images obtained using non-invasive methods, X-ray, ultrasound, or magnetic resonance, as they are the most accessible technologies in hospitals. The strategy used in most of the papers has four steps: image acquisition, ROI estimation, feature extraction, and interpretation. For the ROI estimation, the proposed strategies are based on radiologist annotations or require external help in order to be executed; this is an opportunity area for developing automatic algorithms that can detect the abnormalities. The feature estimation quantifies the detected zones into numerical values. In this sense, texture-based and geometrical features are by far the most employed due to their estimation simplicity; still, frequency or spatial features have recently begun to be explored and can detect minimal changes that might provide the sensitivity required to further improve the classification accuracy. It should be noticed that feature reduction strategies are commonly employed to reduce the training time and avoid potential misclassifications, the most popular being LDA and PCA. On the other hand, classification strategies employ either supervised or unsupervised algorithms. The selection of the type of classifier heavily depends on the nature of the extracted features: if they are highly discriminant, an unsupervised classifier is usually selected; when the features have an overlap zone, it is necessary to employ a supervised classifier. It should be noticed that AI-based algorithms, especially those based on deep learning, have the edge in terms of performance, at the expense of being very expensive in terms of the computational resources employed.

Emerging imaging technologies such as microwave imaging and thermography have been explored recently. In particular, the latter has attracted the attention of researchers, as it is easy to use and, with a proper cooling protocol, can reach an interesting level of accuracy in detecting, at least, suspicious masses that might evolve into malignant ones. With the development of semi-supervised strategies, some of the employed stages can be integrated into one, allowing the development of effective feature extraction, selection, and classification strategies that match the performance of supervised classifiers with lower computational cost, even in the presence of limited labeled images, which is a major obstacle to the training of classifiers.

Modern BC detection strategies should rely on AI-based algorithms that can use both the information from the acquired images and categorical data [193–195], i.e., information about the daily life of the patients, with the aim of determining whether the patient has malignant lesions with higher certainty and the lowest false-alarm rate at the earliest possible stage, so that an effective treatment can prevent the disease's propagation. To achieve this goal, it is necessary to develop a database that contains the aforementioned features and whose size reflects the main scenarios found in real life. Further, with algorithms that can handle this information, it will be possible to design personalized surveillance and clinical screening strategies that offer the best health outcome for every patient.

**Author Contributions:** Conceptualization: I.A.C.-A., L.A.M.-H. and C.A.P.-R.; methodology: J.A.B.-H. and M.A.I.-M.; investigation: C.A.P.-R. and M.T.-A.; writing—original draft preparation, C.A.P.-R., J.A.B.-H. and I.A.C.-A.; writing—review and editing: M.A.I.-M., M.T.-A. and L.A.M.-H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Review*

**Efficacy of Artificial Intelligence-Assisted Discrimination of Oral Cancerous Lesions from Normal Mucosa Based on the Oral Mucosal Image: A Systematic Review and Meta-Analysis**

**Ji-Sun Kim <sup>1</sup>, Byung Guk Kim <sup>1</sup> and Se Hwan Hwang <sup>2,</sup>\***


**Simple Summary:** Early detection of oral cancer is important to increase the survival rate and reduce morbidity. For the past few years, the early detection of oral cancer using artificial intelligence (AI) technology based on autofluorescence imaging, photographic imaging, and optical coherence tomography imaging has been an important research area. In this study, diagnostic values including sensitivity and specificity data were comprehensively confirmed in various studies that performed AI analysis of images. The diagnostic sensitivity of AI-assisted screening was 0.92. In subgroup analysis, there was no statistically significant difference in the diagnostic rate according to each image tool. AI shows good diagnostic performance with high sensitivity for oral cancer. Image analysis using AI is expected to be used as a clinical tool for early detection and evaluation of treatment efficacy for oral cancer.

**Abstract:** The accuracy of artificial intelligence (AI)-assisted discrimination of oral cancerous lesions from normal mucosa based on mucosal images was evaluated. Two authors independently reviewed the database until June 2022. Oral mucosal disorder, as recorded by photographic images, autofluorescence, and optical coherence tomography (OCT), was compared with the reference results by histology findings. True-positive, true-negative, false-positive, and false-negative data were extracted. Seven studies were included for discriminating oral cancerous lesions from normal mucosa. The diagnostic odds ratio (DOR) of AI-assisted screening was 121.66 (95% confidence interval [CI], 29.60; 500.05). Twelve studies were included for discriminating all oral precancerous lesions from normal mucosa. The DOR of screening was 63.02 (95% CI, 40.32; 98.49). Subgroup analysis showed that OCT was more diagnostically accurate (324.33 vs. 66.81 and 27.63) and more negatively predictive (0.94 vs. 0.93 and 0.84) than photographic images and autofluorescence on the screening for all oral precancerous lesions from normal mucosa. Automated detection of oral cancerous lesions by AI would be a rapid, non-invasive diagnostic tool that could provide immediate results on the diagnostic work-up of oral cancer. This method has the potential to be used as a clinical tool for the early diagnosis of pathological lesions.

**Keywords:** mouth neoplasms; imaging; optical image; precancerous conditions; artificial intelligence; screening

#### **1. Introduction**

Oral cancer accounts for 4% of all malignancies and is the most common type of head and neck cancer [1]. The diagnosis of oral cancer is often delayed, resulting in a poor prognosis. It has been reported that early diagnosis increases the 5-year survival rate to 83%, but if a diagnosis is delayed and metastasis occurs, the survival rate drops to less than 30% [2]. Therefore, there is an urgent need for early and accurate detection of oral lesions and for distinguishing precancerous and cancerous tissues from normal tissues.

The conventional screening method for oral cancer is visual examination and palpation of the oral cavity. However, the accuracy of this method is highly dependent on the subjective judgment of the clinician. Diagnostic methods such as toluidine blue staining, autofluorescence, optical coherence tomography (OCT), and photographic imaging have proven useful as adjunctive methods for oral cancer screening [3–6].

Over the past decade, studies have increasingly shown that artificial intelligence (AI) technology is consistent with or even superior to human experts in identifying abnormal lesions in images of various organs [7–11]. These results give us hope for the potential of AI in the screening of oral cancer. However, large-scale statistical assessments of the diagnostic power of AI-assisted oral imaging are lacking. Therefore, in this study, the sensitivity and specificity were analyzed through meta-analysis to evaluate the accuracy of detecting oral precancerous and cancerous lesions in AI-assisted oral mucosa images. We also performed subgroup analysis to determine whether accuracy differs between imaging tools.

#### **2. Materials and Methods**

#### *2.1. Literature Search*

Searches were performed in six databases: PubMed, Embase, Web of Science, SCOPUS, Cochrane Central Register of Controlled Trials, and Google Scholar. The search terms were: "artificial intelligence", "photo", "optical image", "dysplasia", "oral precancer", "oral cancer", and "oral carcinoma". The search period was set to June 2022, and data written in English were reviewed. Two independent reviewers reviewed all abstracts and titles of candidate studies. Among studies diagnosing oral cancer using images, studies that did not deal with AI were excluded.

#### *2.2. Selection Criteria*

The inclusion criteria were: (1) use of AI; (2) prospective or retrospective study protocol; (3) comparison of AI-assisted screening of oral mucosal lesions with the reference test (histology); and (4) sensitivity and specificity analyses. The exclusion criteria were: (1) case report format; (2) review article format; (3) diagnosis of other tumors (laryngeal cancer or nasal cavity tumors); and (4) lack of diagnostic AI data. The search strategy is summarized in Figure 1.

#### *2.3. Data Extraction and Risk of Bias Assessment*

All data were collected using standardized forms. As measures of diagnostic accuracy, the diagnostic odds ratio (DOR), area under the curve (AUC), and summary receiver operating characteristic (SROC) were identified. The diagnostic performance was compared with histological examination results.

A random-effects model was used in this study. DOR represents the effectiveness of a diagnostic test. DOR is mathematically defined as (true positive/false positive)/(false negative/true negative), equivalently (TP × TN)/(FP × FN). When DOR is greater than 1, higher values indicate better performance of the diagnostic method. A value of 1 means that the presence or absence of a disease cannot be determined and that the method provides no diagnostic information. To obtain an approximately normal distribution, we calculated the logarithm of each DOR and then calculated 95% confidence intervals [12]. SROC is a statistical technique used when performing a meta-analysis of studies that report both sensitivity and specificity. As the diagnostic ability of the test increases, the SROC curve shifts towards the upper-left corner of the ROC space, where both sensitivity and specificity are 1. AUC ranges from 0 to 1, with higher values indicating better diagnostic performance. We collected data on the number of patients and the true-positive, true-negative, false-positive, and false-negative values in all included studies, and calculated AUCs and DORs from these values. The methodological quality of the included studies was evaluated using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. The DOR computation is sketched below.
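A minimal sketch of the DOR and its 95% CI computed from a single 2×2 table, following the log-transform described above; the counts are hypothetical, not taken from any included study, and the standard-error formula is the usual one for a log odds ratio.

```python
import math

tp, fp, fn, tn = 90, 10, 8, 92
dor = (tp / fp) / (fn / tn)                    # equivalently (tp * tn) / (fp * fn)
se_log = math.sqrt(1/tp + 1/fp + 1/fn + 1/tn)  # standard error of log(DOR)
lo = math.exp(math.log(dor) - 1.96 * se_log)   # lower 95% confidence bound
hi = math.exp(math.log(dor) + 1.96 * se_log)   # upper 95% confidence bound
print(f"DOR = {dor:.1f} (95% CI {lo:.1f}; {hi:.1f})")
```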

**Figure 1.** Summary of the search strategy.

#### *2.4. Statistical Analysis and Outcome Measurements*

R statistical software (R Foundation for Statistical Computing, Vienna, Austria) was used to conduct a meta-analysis of the studies. Homogeneity analyses were then performed using the Q statistic. Forest plots were drawn for the sensitivity, specificity, and negative predictive values, and for the SROC curves. A meta-regression analysis was performed to determine the potential influence of imaging tools on AI-based diagnostic accuracy for all premalignant lesions.

#### **3. Results**

This analysis included 14 studies [6,13–25]. Table 1 presents the assessment of bias. The characteristics of the studies are attached in Table S1.

#### *3.1. Diagnostic Accuracy of AI-Assisted Screening of Oral Mucosal Cancerous Lesions*

Seven prospective and retrospective studies were included for discriminating oral cancerous lesions from normal mucosa. The diagnostic odds ratio (DOR) of AI-assisted screening was 121.6609 (95% confidence interval [CI], 29.5996; 500.0534, I<sup>2</sup> = 93.5%) (Figure 2A).




**Figure 2.** Forest plot of the diagnostic odds ratios for (**A**) screening only oral cancerous lesions [13,16, 17,21–23,25] and (**B**) screening all premalignant mucosal lesions [13–21,23,24].

The area under the summary receiver operating characteristic curve was 0.948, suggesting excellent diagnostic accuracy (Figure 3A).

The correlation between the sensitivity and the false-positive rate was 0.437, indicating the absence of heterogeneity. AI-assisted screening exhibited good sensitivity (0.9232 [0.8686; 0.9562]; I<sup>2</sup> = 81.9%), specificity (0.9494 [0.7850; 0.9897], I<sup>2</sup> = 98.3%), and negative predictive value (0.9405 [0.8947; 0.9671], I<sup>2</sup> = 83.6%) (Figure 4). Begg's funnel plot (Supplementary Figure S1) shows that a source of bias was not evident in the included studies. The Egger's test result (*p* > 0.05) also shows that the possibility of publication bias is low.

**Figure 3.** Area under the summary receiver operating characteristic for (**A**) screening only the oral cancerous lesions and (**B**) screening all premalignant mucosal lesions. SROC; summary receiver operating characteristic, CI; confidence interval.

Subgroup analyses were performed to determine which AI-assisted image tool had the higher discriminating power between oral cancer lesions and normal mucosa. This analysis showed that there were no significant differences between the photographic image, autofluorescence, and OCT in AI-based screening for oral cancer lesions (Table 2).

**Table 2.** Subgroup analysis regarding image tool in discriminating oral cancerous lesions from normal mucosa.


DOR; diagnostic odds ratio, AUC; area under the curve, NPV; negative predictive value.

#### *3.2. Diagnostic Accuracy of AI-Assisted Screening of Oral Mucosal Precancerous and Cancerous Lesions*

Twelve prospective and retrospective studies were included for discriminating oral precancerous and cancerous lesions from normal mucosa. The diagnostic odds ratio (DOR) of AI-assisted screening was 63.0193 (95% confidence interval [CI], 40.3234; 98.4896, I<sup>2</sup> = 88.2%) (Figure 2B). The area under the summary receiver operating characteristic curve was 0.943, suggesting excellent diagnostic accuracy (Figure 3B). The correlation between the sensitivity and the false-positive rate was 0.337, indicating the absence of heterogeneity. AI-assisted screening exhibited good sensitivity (0.9094 [0.8725; 0.9364]; I<sup>2</sup> = 92.3%), specificity (0.8848 [0.8400; 0.9183], I<sup>2</sup> = 93.8%), and negative predictive value (0.9169 [0.8815; 0.9424], I<sup>2</sup> = 92.8%) (Figure 5).


**Figure 4.** Forest plots of (**A**) sensitivity, (**B**) specificity, and (**C**) negative predictive values for screening oral cancerous lesions [13,16,17,21–23,25].

The Egger's test results of sensitivity (*p* = 0.02025) and negative predictive value (*p* < 0.001) also show that the possibility of publication bias is high. To compensate for the publication bias using statistical methods, trim-and-fill methods (trimfill) were applied to the outcomes. After implementation of trimfill, sensitivity dropped from 0.9094 [0.8725; 0.9364] to 0.8504 [0.7889; 0.8963] and NPV also dropped from 0.9169 [0.8815; 0.9424] to 0.7815 [0.6577; 0.8694]. These results could mean that the diagnostic power of AI-assisted screening of precancerous and cancerous lesions would be overestimated and clinicians would need to be careful when interpreting these outcomes.

Subgroup analyses were performed to determine which AI-assisted image tool had the higher discriminating power for oral mucosal cancerous and precancerous lesions. Subgroup analysis showed that OCT was more diagnostically accurate (324.3335 vs. 66.8107 and 27.6313) and more negatively predictive (0.9399 vs. 0.9311 and 0.8405) than photographic images and autofluorescence in AI-based screening for oral precancerous and cancerous lesions from normal mucosa (Table 3). Meta-regression of AI diagnostic accuracy for oral precancerous and cancerous lesions on the basis of imaging tool revealed a significant correlation (*p* = 0.0050).


**Figure 5.** Forest plots of (**A**) sensitivity, (**B**) specificity, and (**C**) negative predictive values for screening all premalignant mucosal lesions [6,13–21,23,24].


**Table 3.** Subgroup analysis regarding image tool in discriminating oral precancerous and cancerous lesions from normal mucosa.

DOR; diagnostic odds ratio, AUC; area under the curve, NPV; negative predictive value.

#### **4. Discussion**

Oral cancer is a malignant disease with high disease-related morbidity and mortality due to its advanced loco-regional status at diagnosis. Early detection of oral cancer is the most effective means to increase the survival rate and reduce morbidity, but a significant number of patients experience delays between noticing the first symptoms and receiving a diagnosis from a clinician [26]. In clinical practice, a conventional visual examination is not a strong predictor of oral cancer diagnosis, and a quantitatively validated diagnostic method is needed [27]. Radiographic imaging, such as magnetic resonance imaging and computed tomography, can help determine the size and extent of oral cancer before treatment, but these techniques are not sensitive enough to distinguish precancerous lesions. Accordingly, various adjunct clinical imaging techniques such as autofluorescence and OCT have been used [28].

AI has been introduced in various industries, including healthcare, to increase efficiency and reduce costs, and the performance of AI models is improving day by day [29]. For the past few years, the early detection of oral cancer using AI technology based on autofluorescence imaging, photographic imaging, and OCT imaging has been an important research area. In this study, diagnostic values including sensitivity and specificity data were comprehensively confirmed across studies that performed AI analysis of images. The diagnostic sensitivity for oral cancer analyzed by AI was as high as 0.92, and the sensitivity of the analysis including precancerous lesions was slightly lower, but still exceeded 90%. In subgroup analysis, there was no statistically significant difference in the diagnostic rate according to each image tool. In particular, the sensitivity of OCT for all precancerous lesions was found to be very high, at 0.94.

Autofluorescence images exploit the fact that the autofluorescence naturally emitted under blue or ultraviolet light by collagen, elastin, and other endogenous fluorophores, such as nicotinamide adenine dinucleotide, in mucosal tissues is expressed differently in cancerous lesions [30,31]. Although autofluorescence has been widely used in the dental field for screening abnormal lesions in the oral cavity, its accuracy has been reported to be low, with a sensitivity of only 30–50% [32,33], and a low diagnostic rate has been noted when it is used in oral cancer screening. Most previous clinical studies on autofluorescence imaging used differences in spectral fluorescence signals between normal and diseased tissues. Recently, time-resolved autofluorescence measurements, which exploit the different fluorescence lifetimes of endogenous fluorophores, have been used to solve the problem of broadly overlapping fluorophore spectra, improving image accuracy [34]. Using various AI algorithms on advanced autofluorescence images, the diagnostic sensitivity for precancerous and cancerous lesions has been reported to be as high as 94% [15]. As confirmed in our study, AI diagnostic sensitivity using autofluorescence images was 85% for all precancerous lesions, showing relatively low diagnostic accuracy compared to the other imaging tools in this study. However, autofluorescence imaging is of sufficient value as an adjunct diagnostic tool. Efforts are also being made to improve the diagnostic accuracy for oral cancer by using AI to analyze images obtained with other tools alongside the autofluorescence image [19].

The photographic image is a fast and convenient method with high accessibility compared to other adjunct methods. However, there is a disadvantage in that the image quality varies greatly depending on the camera, lighting, and resolution used while obtaining the image. Unlike external skin lesions, the oral cavity is surrounded by a complex, three-dimensional structure including the lips, teeth, and buccal mucosa, which may decrease the image accuracy [6]. In a recent study introducing a smartphone-based device, it was reported that this problem was addressed through a probe that can easily access the inside of the mouth and through increased image resolution [35]. Image diagnosis using a smartphone is very accessible in the current era of billions of phone subscribers worldwide, and, in particular, accurate and efficient screening is expected to be possible by diagnosing a vast number of these images with AI. According to our analysis, AI-aided diagnosis from photographic images was confirmed to have a diagnostic sensitivity of over 91% for precancerous and cancerous lesions.

OCT is a medical technology that images tissues using the difference in physical properties between a reference light path and the sample light path reflected after interaction with the tissue [13]. OCT is non-invasive and uses infrared light, unlike radiology tests that use X-rays, and it allows real-time image verification. Since its introduction in 1991 [36], OCT has been developed to provide high-resolution images at ever faster speeds and has played an important role in the biomedical field. In an AI analysis study of OCT images published by Yang et al., the sensitivity and specificity of oral cancer diagnosis were reported to be 98% or more [22]. In our study, OCT images were found to be the most accurate diagnostic test, with a sensitivity of 94% in AI diagnosis, compared to sensitivities of 89% and 91% for autofluorescence and photographic images, respectively. Therefore, AI diagnosis using OCT images is considered of sufficient value as a screening method for oral lesions. Each image tool included in our study has its own pros and cons to be considered in actual clinical practice. In addition, the accessibility of equipment or systems that can be used on patients in actual outpatient settings will be an important factor.

Based on our results, AI analysis of images in cancer diagnosis is thought to be helpful in making fast decisions regarding further examination and treatment. The accuracy of discriminating between precancerous lesions and normal tissues showed a high sensitivity of over 90%, demonstrating good accuracy as a screening method. Although the question of whether AI can replace experts still exists, it is expected that oral cancer diagnosis using AI will substantially improve disease-related mortality and morbidity in low- and middle-income countries with poor health care systems. Acquisition of large-scale image datasets to improve AI analysis accuracy will be a clinically important key.

Our study has several limitations. First, our results include data from multiple imaging tools analyzed at once, which created heterogeneity in the results; therefore, the sensitivity of each imaging tool was checked separately. The study is nevertheless meaningful as the first meta-analysis to judge the accuracy of AI-based image analysis. Second, even with the same imaging tool, differences in the quality of the devices used in each study and differences between techniques may affect the accuracy of diagnosis, and the images used to train the AI algorithms may not fully represent the diversity of oral lesions. Third, there is a limit to the interpretation of the results due to the absolute lack of prospective studies comparing conventional examination with AI imaging diagnosis. It is our task to study this in various clinical fields in order to prepare for a future in which AI-assisted healthcare will be successful.

#### **5. Conclusions**

AI shows good diagnostic performance with high sensitivity for oral cancer. Through the development of image acquisition devices and the grafting of various AI algorithms, the diagnostic accuracy is expected to increase. As new studies in this field are published frequently, a comprehensive review of the clinical implications of AI in oral cancer will be necessary again in the future.

**Supplementary Materials:** The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cancers14143499/s1, Figure S1: Begg's funnel plot; Table S1: Study characteristics.

**Author Contributions:** Conceptualization, J.-S.K., B.G.K. and S.H.H.; methodology, J.-S.K. and S.H.H.; software, S.H.H.; validation, S.H.H.; formal analysis, J.-S.K. and S.H.H.; investigation, J.-S.K. and S.H.H.; data curation, J.-S.K. and S.H.H.; writing—original draft preparation, J.-S.K. and S.H.H.; writing—review and editing, J.-S.K. and S.H.H.; visualization, J.-S.K. and S.H.H.; supervision, J.-S.K., B.G.K. and S.H.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2022R1F1A1066232). The sponsors had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

