Mark No. denotes the number of CAD marks per lesion. † Mark Number indicates the number of CAD marks per patient.

When no CAD mark was present, the readers found that the reading time for CAD with ABUS was shorter than for ABUS alone and that decisions were easier to make (Table 2). Table 2 summarizes the number and characteristics of CAD marks per patient.

**Table 2.** Number and characteristics of computer-aided detection (CAD) marks per patient and reading time (RT).


Values are expressed as numbers (percentages) for categorical variables and as means (SD) or medians (IQR) for others. \* *p*-value was calculated between the no-mark group (0) and the combined mark groups (1, 2, 3) using the Chi-square test, Fisher's exact test, or *t*-test. # Mark No. indicates the number of CAD marks per lesion.

Of 846 patients, 1032 CAD marks were placed on 534 lesions in 348 patients, with a mean of 0.8 CAD marks per person (SD ± 1; range 0–6) (Table 3). No CAD mark was detected in 498 patients (58.9%).

The characteristics of the CAD marks were determined by two reviewers in consensus as suspicious malignant lesions (0.8%, n = 4), benign lesions (13.3%, n = 71), and clear pseudo-lesions (86%, n = 459).

Among 530 false-positive marks, 459 were placed on clear pseudo-lesions (Figures 2–4); the most common cause was marginal shadowing (209, 39.1%), followed by Cooper's ligament shadowing (143, 26.8%), peri-areolar shadowing (64, 12%), ribs (37, 6.9%), and skin lesions (6, 1.1%).


**Table 3.** Number and characteristics of computer-aided detection (CAD) mark.

Values are expressed as numbers (percentages) for categorical variables and as means (SD) or medians (IQR) for others.



**Figure 2.** Screening automated breast ultrasound (ABUS) of a 45-year-old woman reveals false-positive marks due to shadowing. (**a**) CAD-based minimum intensity projection (MinIP) of an ABUS scan of the AP, medial, and lateral sides of both breasts. There are three dark spots with green circles. (**b**) The lesion showing a dark spot with a green circle on the AP side of the right breast confirms the pseudolesion due to periareolar shadowing in the transverse scan. (**c**,**d**) The lesion showing a dark spot with a green circle on the AP side of the left breast confirms the pseudolesion due to Cooper's ligament shadowing in the transverse scan. The lesion showing a dark spot with a green circle laterally on the left breast confirms the pseudolesion due to marginal shadowing in the transverse scan.



**Figure 3.** Screening automated breast ultrasound (ABUS) of a 42-year-old woman shows false-positive marks due to ribs. (**a**) CAD-based minimum-intensity projection (MinIP) of an ABUS scan in the AP, medial, and lateral sides of both breasts. There are four dark spots with green circles. (**b**–**d**) The lesions showing dark spots with green circles in both the AP and right medial sides of both breasts confirm pseudolesions due to ribs in the transverse scan.

**Figure 4.** Screening automated breast ultrasound (ABUS) of a 48-year-old woman reveals false-positive marks due to skin lesions. (**a**) CAD-based minimum intensity projection (MinIP) of an ABUS scan in the AP, medial, and lateral sides of both breasts. There is a dark spot with a green circle. (**b**) The lesion showing a dark spot with a green circle on the AP side of the left breast confirms the pseudolesion due to a skin lesion in the transverse scan.

False-positive marks on pseudo-lesions were detected more frequently in the upper than in the mid-to-lower portion, and in the outer than in the mid-to-inner portion of the breast (Table 4). There were more marks in the lateral view than in the AP or medial views (Table 4).


**Table 4.** Characteristics of false-positive marks associated with pseudolesions (n = 459).

Values represent numbers (percentages) for categorical variables. *p*-value was calculated between the Mark No. 1 group and the Mark No. 2–3 group using the Chi-square test. \* *p*-value was calculated only within a group using the Chi-square test. # Mark No. indicates the number of CAD marks per lesion.

#### **4. Discussion**

In this study, we evaluated the effectiveness of a computer-aided detection (CAD) system in screening automated breast ultrasound (ABUS) in terms of diagnostic performance and reading time (RT). In 846 patients, 1032 CAD marks were placed on 534 lesions. On a per-patient basis (n = 846), the sensitivity, specificity, PPV, NPV, and accuracy of CAD were 60.0%, 59.0%, 0.9%, 99.6%, and 59.0%, respectively; for the 1032 CAD marks, they were 60.0%, 48.3%, 0.6%, 99.6%, and 48.4%, respectively. The relatively high NPV compared with the other parameters indicates that the examination can be concluded as negative if no CAD mark is detected on ABUS. The presence of marks in multiple views did not suggest malignancy in this study. When no CAD mark was present, the readers found that the reading time for CAD with ABUS was shorter than for ABUS alone and that decisions were easier to make.
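The per-patient figures can be reproduced from a standard 2 × 2 confusion matrix. The counts below are not taken from the paper's tables; they are back-derived for illustration from the reported percentages (5 cancers among 846 patients, 3 of them CAD-marked):

```python
def metrics(tp, fp, tn, fn):
    """Standard screening metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical per-patient counts, back-derived from the reported
# percentages: 5 cancers (3 CAD-marked), 345 false-positive patients,
# 496 true negatives, 2 false negatives; total 846 patients.
m = metrics(tp=3, fp=345, tn=496, fn=2)
# m["sensitivity"] = 0.60, m["npv"] ~ 0.996, m["accuracy"] ~ 0.59
```

The same function applied per mark (n = 1032) would reproduce the second set of values, given the corresponding counts.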

Several studies have reported that the performance of ABUS was comparable to that of hand-held ultrasound [20–22]. In addition, four prospective studies using ABUS demonstrated an increased cancer detection of 1.9–7.7 per 1000 examinations similar to hand-held ultrasound [10,11,14,23].

However, while ABUS can yield standardized and structured images regardless of the operator's experience, it takes much more time and effort to interpret the exams [24]. For this reason, the CAD system has been suggested as a supplementary method for interpreting ABUS results. The CAD system showed a high negative predictive value, but there were many false-positive CAD marks, most of which corresponded to typically benign lesions or pseudo-lesions that do not require further investigation. Usually, false-positive imaging results can affect the recall rate of a screening modality; the recall rate varied from 8.8% in the J-START study to 10.7% in the American College of Radiology Imaging Network (ACRIN) study [25,26]. However, few studies have reported the causes and characteristics of false-positive marks.

In addition to the diagnostic performance of CAD on ABUS, previous studies evaluated the RT of CAD on ABUS [27–29]. Yang et al. reported that, using CAD in concurrent-reading mode, all readers saved 32% in RT (16 s of 50 s per volume) with higher area under the receiver operating characteristic curve values compared with the non-CAD mode [28]. Jiang et al. reported that although not all studies were interpreted faster with the CAD system, the average savings were approximately 1 min per case [29]. In our study, interpretation with CAD was less time-consuming and clinical decisions were easier to make, especially in the case of a negative study.

In this study, we investigated and analyzed the characteristics of CAD marks and the causes of false-positive marks in order to distinguish between true and false marks. Among 530 false-positive marks, 459 were clearly identified as pseudo-lesions; the most common cause was marginal shadowing, followed by Cooper's ligament shadowing, peri-areolar shadowing, ribs, and skin lesions, all of which were easily distinguishable radiologically. The false marks for pseudo-lesions were detected more frequently in the upper than in the mid-to-lower portion and in the outer than in the mid-to-inner portion, probably because of the bulkiness and flexibility of the upper-outer portion of the breast.

ABUS is a standardized examination with multiple advantages in both screening and diagnostic settings, including increased detection of breast cancer, improved workflow, and reduced examination time. However, ABUS also has disadvantages and limitations. Disadvantages regarding image acquisition are the inability to assess the axilla, vascularization, and lesion elasticity. The limitations of interpretation include motion- or lesion-related artifacts due to poor positioning and lack of contact [30]. In the review article on the pros and cons of ABUS by Boca et al., marginal shadowing and Cooper's ligament shadowing were defined as artifacts due to insufficient compression [30]. Peri-areolar shadowing is defined as a nipple artifact [30]. Despite the promising detection rate of CAD software in breast cancer, radiologists should determine whether a lesion marked by the CAD software is a true- or false-positive lesion, given the low positive predictive value and high false-positive rate [17]. Knowledge of these artifacts improves the diagnostic performance of radiologists.

There are several limitations to this study. First, we used only image data obtained with equipment from a single vendor, with a small number of participants. In addition, this study was performed only in academic institutions by a limited number of users, board-certified expert breast radiologists, and therefore does not represent varying clinical environments. Second, the absence of a numerical result for RT is a limitation of this study: the RT was assessed by the expert breast radiologists based on their subjective perception. Finally, in our study, the expert radiologists' decision served as the gold standard for suspicious lesions or pseudo-lesions. A large number of marks still require the radiologist's judgment; therefore, CAD users should become familiar with the appearance of marks in various situations before relying on them, and this review summarizes the characteristics of CAD marks without radiological evaluation. Knowledge of the characteristics of CAD marks and the causes of false-positive marks could improve the diagnostic performance of radiologists.

#### **5. Conclusions**

In conclusion, even though the addition of CAD does not improve the performance of screening ABUS and is associated with a large number of false-positive marks, it improves the negative predictive value and reduces RT, especially for negative screening examinations.

**Author Contributions:** Conceptualization, B.J.K.; methodology, J.L.; formal analysis, G.E.P. and J.L.; investigation, S.H.K. and B.J.K.; resources, B.J.K. and S.H.K.; data curation, S.H.K. and G.E.P.; writing—original draft preparation, J.L.; writing—review and editing, B.J.K.; visualization, J.L. and G.E.P.; supervision, B.J.K. and S.H.K.; project administration, B.J.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of Seoul Saint Mary's Hospital (protocol code KC16ECMI0552 and date of approval 1 September 2016).

**Informed Consent Statement:** Patient consent was waived due to the retrospective design of this study.

**Data Availability Statement:** All data generated and analyzed during this study are included in this published article. Raw data supporting the findings of this study are available from the corresponding author on request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Cost-Effectiveness of Artificial Intelligence Support in Computed Tomography-Based Lung Cancer Screening**

**Sebastian Ziegelmayer \*,†, Markus Graf †, Marcus Makowski, Joshua Gawlitza † and Felix Gassert †**

Institute of Diagnostic and Interventional Radiology, School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Ismaninger Straße 22, 81675 Munich, Germany; markus.m.graf@tum.de (M.G.); marcus.makowski@tum.de (M.M.); joshua.gawlitza@tum.de (J.G.); felix.gassert@tum.de (F.G.)

**\*** Correspondence: s.ziegelmayer@tum.de

† These authors contributed equally to this work.

**Simple Summary:** Lung cancer screening with low-dose CT (LDCT) has been shown to significantly reduce cancer-related mortality and is recommended by the United States Preventive Services Task Force (USPSTF). With a recommendation pending in Europe and millions of patients enrolling in the program, deep learning algorithms could reduce the number of false-positive and false-negative findings. Therefore, we evaluated the cost-effectiveness of using an AI algorithm for the initial screening scan using a Markov simulation. We found that AI support at the initial screening is a cost-effective strategy up to a cost of USD 1240 per patient screening, given a willingness-to-pay of USD 100,000 per quality-adjusted life year (QALY).

**Abstract:** Background: Lung cancer screening is already implemented in the USA and strongly recommended by European radiological and thoracic societies as well. Upon implementation, the total number of thoracic computed tomography (CT) scans is likely to rise significantly. As shown in previous studies, modern artificial intelligence-based algorithms are on par with or even exceed radiologists' performance in lung nodule detection and classification. Therefore, the aim of this study was to evaluate the cost-effectiveness of an AI-based system in the context of baseline lung cancer screening. Methods: In this retrospective study, a decision model based on a Markov simulation was developed to estimate the quality-adjusted life years (QALYs) and lifetime costs of the diagnostic modalities. A literature search was performed to determine the model input parameters. Model uncertainty and possible costs of the AI system were assessed using deterministic and probabilistic sensitivity analyses. Results: In the base-case scenario, CT + AI resulted in a negative incremental cost-effectiveness ratio (ICER) compared to CT only, showing lower costs and higher effectiveness. Threshold analysis showed that the ICER remained negative up to a cost of USD 68 for the AI support. The willingness-to-pay of USD 100,000 was crossed at a value of USD 1240. Deterministic and probabilistic sensitivity analyses showed model robustness for varying input parameters. Conclusion: Based on our results, the use of an AI-based system in the initial low-dose CT scan of lung cancer screening is a feasible diagnostic strategy from a cost-effectiveness perspective.

**Keywords:** lung cancer screening; deep learning; cost-effectiveness analysis; AI-support system

#### **1. Introduction**

Based on the findings of the National Lung Screening Trial (NLST), in 2014 the United States Preventive Services Task Force recommended annual lung cancer screening for patients between 55 and 80 years of age with a 30-pack-year smoking history [1,2]. In contrast to the high and further increasing incidence of lung cancer globally, the incidence of lung cancer was relatively low in the NLST. Nonetheless, the NLST was able to show a significant reduction in lung cancer-related mortality due to annual screening with low-dose computed tomography (CT). Consequently, a European Position Statement followed

**Citation:** Ziegelmayer, S.; Graf, M.; Makowski, M.; Gawlitza, J.; Gassert, F. Cost-Effectiveness of Artificial Intelligence Support in Computed Tomography-Based Lung Cancer Screening. *Cancers* **2022**, *14*, 1729. https://doi.org/10.3390/ cancers14071729

Academic Editors: Hamid Khayyam, Ali Madani, Rahele Kafieh and Ali Hekmatnia

Received: 1 March 2022 Accepted: 23 March 2022 Published: 29 March 2022



in 2017, strongly recommending the CT-based lung cancer screening as well [3]. This recommendation is further supported by the Dutch-Belgian lung-cancer screening trial (Nederlands-Leuvens Longkanker Screenings Onderzoek (NELSON)), which also showed a significant reduction in lung cancer mortality for high-risk patients who participated in the screening [4]. With several ongoing pilot projects in Europe, the widespread introduction of lung cancer screening seems to be only a matter of time.

Nevertheless, the benefits of lung cancer screening are limited by false-negative and false-positive findings, which not only result in high costs but also affect clinical outcome and quality of life [2,5,6]. Currently, low-dose CT scans in the screening setting are evaluated based on standardized systems such as Lung-RADS (Lung Imaging Reporting and Data System), which improves diagnostic accuracy for radiologists and reduces costs by decreasing the need for further diagnostic tests [7,8]. Even after a recent revision of the reporting system, observer variability will remain a relevant limitation [9,10].

The rapid development of artificial intelligence (AI) in the medical field has shown promising results for cancer screening and recent AI-models may achieve or exceed the diagnostic performance of sub-specialized experts, for example in breast cancer screening [11]. While long-standing CAD (computer aided diagnosis/detection) systems show mixed results for lung cancer detection [12–14], novel neural networks, convolutional neural networks (CNN) in particular, seem to have a positive effect on the diagnostic performance of radiologists [15]. Ardila et al. showed that a 3D-CNN outperformed radiologists in low-dose CT screening scans when no prior scans were available, indicating a favorable benefit for screening initiation.

Among other constraints, the health economic impact of AI systems is an important factor in the decision to implement models in routine clinical practice. Despite the imminent deployment of lung cancer screening and the promising results of AI systems, no study has evaluated the use of neural networks in lung cancer screening, compared with the stand-alone low-dose CT scan, from an economic point of view. Therefore, the aim of our study was to evaluate the cost-effectiveness of an AI system for the initial scan of annual lung cancer screening and to present first results identifying a cost margin for clinical integration.

#### **2. Materials and Methods**

#### *2.1. Model Structure*

A decision model including the diagnostic strategies of conventional CT and CT augmented by AI was created and used as a decision tree, as shown in Figure 1.

**Figure 1.** Markov model with possible states of disease and transition probabilities between states. BC = bronchial cancer; LT = life tables.

For the calculation of costs and benefits in the different iterations, a Markov transition state model was created. The model included the disease states shown in Figure 1.


Additionally, for better simulation and understanding of the model, the states "BC delayed detection" and "BC early detection" were created, which served only for transition. The Markov model reflects the different states a patient can be assigned to. Taking into account the transition probabilities between states, as well as the costs and effectiveness (expressed as quality of life) in those states, cumulative costs and cumulative effectiveness within a defined time horizon can be calculated by summing them over the iterations.
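As a minimal sketch of such a cohort simulation, the following uses a hypothetical three-state model; the transition probabilities, per-cycle costs, and QALY weights are illustrative placeholders, not the paper's calibrated inputs:

```python
# Hypothetical 3-state Markov model: 0 = well, 1 = cancer, 2 = dead.
# All numbers below are illustrative placeholders, not study inputs.
P = [[0.97, 0.02, 0.01],
     [0.00, 0.80, 0.20],
     [0.00, 0.00, 1.00]]
COST = [100.0, 20000.0, 0.0]     # USD per cycle in each state
UTILITY = [0.93, 0.63, 0.0]      # QALY weight per cycle in each state

def run_markov(start, cycles=20, discount=0.03):
    """Accumulate discounted lifetime costs and QALYs over annual cycles."""
    dist = list(start)
    total_cost = total_qaly = 0.0
    for t in range(cycles):
        d = (1.0 + discount) ** t                       # discounting
        total_cost += sum(p * c for p, c in zip(dist, COST)) / d
        total_qaly += sum(p * u for p, u in zip(dist, UTILITY)) / d
        # advance the cohort one cycle: dist' = dist @ P
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
    return total_cost, total_qaly

costs, qalys = run_markov([1.0, 0.0, 0.0])
```

Dedicated tools such as TreeAge Pro, as used in the study, implement this accumulation together with sensitivity-analysis machinery.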

Analysis of the model was performed using a dedicated decision analysis software (TreeAge Pro Version 19.1.1, Williamstown, MA, USA).

#### *2.2. Input Parameters*

There was no requirement for ethical approval for this analysis, which is based on commonly available data. Model input parameters were based on the current literature. Age-specific risk of death was derived from the US life tables [16]. Age at the diagnostic procedure was set to 60 years and willingness-to-pay was set to USD 100,000 per quality-adjusted life year (QALY) at a discount rate of 3%, as reported previously [17,18]. The discount rate reflects the loss in economic value or effectiveness when there is a delay in realizing a benefit or incurring costs. The pre-test probability of BC was set to 2.635% for a risk group of female and male smokers over a 30-year interval, according to published data from Jacob et al. [19]. All input parameters and corresponding references are listed in Table 1.


**Table 1.** Input parameters.


**Table 1.** *Cont.*

AI = artificial intelligence; BC = bronchial cancer; CT = computed tomography; QALY = quality adjusted life year; WTP = willingness-to-pay.

#### *2.3. Diagnostic Test Performances*

Sensitivity and specificity values for CT detection of BC with and without AI were derived from the literature (Table 1).

#### *2.4. Costs*

From a United States (US) healthcare perspective, costs were estimated based on Medicare data and the available literature (Table 1). The long-term costs of follow-up in the case of a false-positive finding were estimated at USD 2256, including the costs of a follow-up CT examination and a possible bronchoscopy and biopsy [21]. The resection costs of BC were set to USD 36,305, according to Cowper et al. [22]. Annual costs of palliative BC patients were estimated at USD 60,000 [21].

#### *2.5. Utilities*

Utility is measured in the additional quality-adjusted life years (QALY) which are gained through each diagnostic procedure. According to previous studies, quality of life (QOL) for curative BC patients was set to 0.79 for the first year after resection and 0.933 for the following years [24,25]. In accordance with the literature, QOL for palliative BC patients was set to 0.63 [26]. These values were then used for calculations in a Markov model specifically designed as mentioned above.

#### *2.6. Transition Probabilities*

Transition probabilities were derived from a systematic review of the recent literature and are shown in Table 1. Probability of successful resection of (early) detected BC was estimated at 75%, according to the national lung screening trial research team [2]. Risk of secondary occurrence of cancer/metastases after resection of the primary tumor was assumed to be 9.80% [29]. Annual mortality rate of curative patients was set to 4.7% and to 36.0% for palliative patients [28,32,33].

#### *2.7. Cost-Effectiveness Analysis*

The cost-effectiveness analysis was performed based on Markov simulations with a run time of 20 years (20 iterations) after initial diagnostic procedure. The discount rate was set to 3.0% and willingness-to-pay was set to USD 100,000 per QALY according to current recommendations [18].

In the base-case scenario, cost-effectiveness was determined with costs of CT + AI identical to costs of CT only, meaning costs of USD 0 for additional use of AI. Based on these results, maximum costs for AI were calculated for several willingness-to-pay thresholds. For evaluation of model uncertainty and influence of alteration of each variable on the model, a deterministic sensitivity analysis was performed. Results were visualized in a tornado diagram.

Based on the Markov model, Monte-Carlo simulations were used to perform a probabilistic sensitivity analysis with a total of 30,000 iterations. This method is used to account for the variation of input-parameters among different individuals.
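A probabilistic sensitivity analysis of this kind can be sketched as follows. The sampling distributions for the incremental costs and effectiveness are assumptions centered on the base-case increments, not the study's actual sampling model; cost-effectiveness is tested via the equivalent net-monetary-benefit criterion:

```python
import random

random.seed(0)
WTP = 100_000                      # willingness-to-pay, USD per QALY

def cost_effective(d_cost, d_eff, wtp=WTP):
    """Net monetary benefit criterion: equivalent to ICER < wtp for
    d_eff > 0, and well defined even when d_eff is negative."""
    return wtp * d_eff - d_cost > 0

# Illustrative sampling distributions centered on the base-case
# increments (dC ~ -67.6 USD, dE ~ 0.01 QALY); the spreads are
# assumptions, not the study's calibrated parameter distributions.
n_iter = 30_000
hits = sum(
    cost_effective(random.gauss(-67.6, 50.0), random.gauss(0.01, 0.005))
    for _ in range(n_iter)
)
share_cost_effective = hits / n_iter
```

Plotting the sampled (incremental effectiveness, incremental cost) pairs yields a scatter plot of the kind shown in Figure 3.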

#### **3. Results**

#### *3.1. Cost-Effectiveness Analysis*

Simulations over a time horizon of 20 years resulted in average cumulative costs of USD 4310.82 for CT + AI and USD 4378.44 for CT when the additional diagnostic costs for the use of AI were set to USD 0 in the base-case scenario. In this scenario, average cumulative effectiveness was 13.76 QALYs for CT + AI and 13.75 QALYs for CT. To better understand the impact of the input parameters on the model, the costs and effectiveness as well as the distribution of the different outcomes are shown in Figure 2. The different overall costs and effectiveness derive from the different distributions of the outcomes "true positive", "false negative", "true negative", and "false positive", based on the different sensitivity and specificity of the two methods. The incremental cost-effectiveness ratio in the base-case scenario was negative, meaning both lower costs and higher effectiveness for CT + AI.
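The reported ICER follows directly from its definition; because the QALY values above are rounded to two decimals, the resulting figure is indicative of sign and magnitude only:

```python
def icer(cost_new, cost_ref, eff_new, eff_ref):
    """Incremental cost-effectiveness ratio in USD per QALY."""
    return (cost_new - cost_ref) / (eff_new - eff_ref)

# Reported base-case values with AI assumed to add USD 0 in costs.
base_icer = icer(cost_new=4310.82, cost_ref=4378.44,
                 eff_new=13.76, eff_ref=13.75)
# base_icer ~ -6762 USD/QALY: negative, i.e., CT + AI is cheaper
# and more effective ("dominant") than CT alone.
```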

**Figure 2.** Roll-back of the economic model showing costs and effectiveness of the different outcomes. Distributions leading to overall costs and effectiveness are different for CT and CT + AI depending on sensitivity and specificity of the two methods and indicated as probabilities. BC = bronchial cancer; CT = computed tomography; TP = true positive; TN = true negative; FP = false positive; FN = false negative; Prob = probability.

#### *3.2. Sensitivity Analysis*

A probabilistic sensitivity analysis using Monte Carlo simulation was performed to determine the distribution of the resulting ICER values; it is visualized in Figure 3. The Monte Carlo simulation reflects the difference between costs (incremental costs) and effectiveness (incremental effectiveness) for a number of notional scenarios/iterations. All iterations with an ICER value below the willingness-to-pay of USD 100,000 per QALY were considered cost-effective.

**Figure 3.** Probabilistic sensitivity analysis utilizing Monte Carlo simulations (30,000 iterations). Incremental cost-effectiveness scatter plot for CT + AI vs. CT. Iterations with an ICER value below the willingness-to-pay of USD 100,000 per QALY are shown as green crosses. WTP = willingness-to-pay.

Deterministic sensitivity analysis was performed to account for variability of input parameters in the base case scenario. Results are displayed as a tornado diagram in Figure 4A.

Applying wide ranges of variation for the different input parameters, ICER stayed below USD 0/QALY for the sensitivities of the diagnostic modalities and the probabilities of resectability in early and delayed diagnosis. Although ICER turned positive when varying the specificity of CT and CT + AI, the willingness-to-pay threshold of USD 100,000/QALY was not crossed in any of the cases.

#### *3.3. Threshold Analysis*

To determine the maximum possible costs for the use of AI at a willingness-to-pay of USD 100,000/QALY, a threshold analysis was performed. As shown in Figure 5, ICER remained negative until costs of AI were raised to USD 68.

**Figure 4.** (**A**) Tornado diagram showing the impact of input parameters on the incremental cost-effectiveness ratio (ICER) in the base case scenario. Assuming a willingness-to-pay threshold of USD 100,000 per QALY, CT + AI remained cost-effective in all cases. (**B**) Tornado diagram showing the impact of input parameters on the incremental cost-effectiveness ratio (ICER) when costs of AI were set to USD 1240 with an expected value of USD 100,000 per QALY. Blue bars show changes when decreasing the value of an input parameter as compared to the base case scenario and red bars when increasing the respective value. Sens = sensitivity; Spec = specificity; CT = computed tomography; AI = artificial intelligence; P = probability.

**Figure 5.** One-way sensitivity analysis for costs of AI (USD) and the corresponding incremental cost effectiveness ratio (ICER in USD/QALY). Thresholds indicate values at an ICER of USD 0/QALY and USD 100,000/QALY. ICER = incremental cost-effectiveness ratio; AI = artificial intelligence; QALY = quality adjusted life year.

When the costs of AI are raised further, the assumed willingness-to-pay threshold of USD 100,000/QALY is only crossed at a value of USD 1240. The influence of different input parameters in this second base-case scenario, setting the costs of AI to USD 1240, is shown in Figure 4B. To account for possible variation of the willingness-to-pay, Table 2 displays the possible costs of AI depending on different willingness-to-pay thresholds. Because these costs depend on the ICER, the admissible cost of AI is further influenced by the system's performance: a better-performing system justifies a higher price through the increased ICER.
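The relation between willingness-to-pay and the admissible AI cost can be made explicit: the maximum AI cost equals the base-case saving plus WTP times the incremental effectiveness. The unrounded incremental effectiveness below is back-solved from the reported USD 68 and USD 1240 thresholds for illustration and is not a published input:

```python
def max_ai_cost(wtp, d_cost_base=-67.62, d_eff=0.0117238):
    """Largest per-scan AI cost keeping the ICER at or below wtp:
    solve (d_cost_base + cost_ai) / d_eff <= wtp for cost_ai."""
    return wtp * d_eff - d_cost_base

# d_cost_base: base-case incremental cost of CT + AI (a saving);
# d_eff: incremental effectiveness, back-solved for illustration.
at_zero_icer = max_ai_cost(wtp=0)     # ICER turns positive above ~USD 68
at_wtp = max_ai_cost(wtp=100_000)     # WTP threshold crossed near USD 1240
```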

**Table 2.** Cost of AI at different WTP-thresholds.


#### **4. Discussion**

The widespread integration of lung cancer screening is proving to be a complex and challenging undertaking. Nevertheless, lung cancer screening is a cost-effective method to reduce lung cancer mortality. AI-models for cancer detection and classification have proved to be of benefit in lung cancer screening in several studies [15,34].

In the present study, we show that a state-of-the-art AI-model (3D-convolutional neural network according to Ardila et al.) is a cost-effective method for the baseline screening scan [15]. Despite promising results of AI in the health care sector, studies evaluating the economic impact and cost effectiveness remain sparse [35]. To our knowledge, no study has been conducted to investigate the cost-effectiveness of an AI-system in lung cancer screening. Based on the superior performance of the AI-model without prior imaging, we simulated an implementation for the initial screening scan using input parameters derived from published screening cohorts [2,15,36,37], to ensure comparability to the standard screening setting.

Our base-case estimate for screening with an AI system compared with current low-dose CT screening yielded a negative ICER up to costs of USD 68 for the AI system, indicating that using an AI system in the screening setting results in lower costs and higher effectiveness up to these costs per patient scan. Furthermore, the ICER remained below the applied willingness-to-pay up to costs of USD 1240. To account for variations in input parameters, we performed a deterministic sensitivity analysis for the base-case scenario and for the maximum cost-effective costs (USD 1240). The specificity of the diagnostic strategy had the greatest influence in both scenarios, due to the low lung cancer rate in screening cohorts. For the base-case scenario, all input variations resulted in an ICER below the willingness-to-pay by a large margin, indicating robust cost-effectiveness. Adding AI support reduced the number of false positives and increased the number of true negatives in our simulation. In particular, the reduction of false positives strongly affects the value of a screening method: not only are the costs of unnecessary follow-up examinations and possibly further, partly invasive examinations reduced, but patients are also spared the psychological distress of a possible cancer diagnosis [38]. Additionally, false-positive findings and invasive diagnostic procedures were more frequent at the baseline CT, ranging from 7.9% to 49.3% for the false-positive rate, with 3.7% for additional invasive procedures [2,39], further emphasizing the benefit of AI support for the initial screening. As shown by Audelan et al., the sensitivity and specificity of AI in lung cancer screening can be further improved, consequently allowing for an additional reduction in costs and increased effectiveness [40].

Despite the promising results, our study has several limitations. First, cost-effectiveness was only evaluated for the initial scan of the lung cancer screening. This reflects the published literature, which focuses on the superiority of AI lung nodule detection and classification in initial chest CT without prior imaging for comparison. According to Ardila et al., deep-learning algorithms are superior to radiologists in lung cancer screening detection when no prior imaging is available for comparison, but on par once previous examinations are available to the reader. Consequently, further research has to be conducted to evaluate the cost-effectiveness of AI-based computer-aided diagnosis systems in longitudinal screening, beyond the initial scan [15]. Further, our evaluation focused on the stand-alone AI system performance in comparison to the human reader, the radiologist. However, several studies have shown promising results for the collaboration of both, often referred to as the "Centaur model" [33]. Such systems were shown to be not only beneficial in patient care but cost-effective as well [41]. Despite dealing with different challenges compared to lung cancer, for thyroid nodule detection, AI systems outperform radiologists specialized in thyroid cancer in nodule classification, but the combination of specialized radiologists with AI support showed an even higher specificity and positive predictive value compared with the AI system alone [42]. Therefore, further research is needed to evaluate the combination of AI models and specialized thoracic radiologists in lung cancer detection and its cost-effectiveness. Lastly, cost-effectiveness analysis with decision-based models is highly dependent on the input parameters; while deterministic sensitivity analysis may incorporate parameter variation to a certain degree, recommendations for each individual case cannot be derived from the model.

#### **5. Conclusions**

To conclude, our study shows that applying an AI model to the initial screening scan is a cost-effective strategy in low-dose CT lung cancer screening, with robustness to variation of the input parameters. Defining cost thresholds at which AI systems remain cost-effective may help translate them into clinical use faster.

**Author Contributions:** Conceptualization, S.Z. and F.G.; methodology, F.G. and J.G.; validation, M.G., S.Z.; formal analysis, F.G.; investigation, S.Z., M.G. and J.G.; resources, M.M.; data curation M.G.; writing—original draft preparation, S.Z. and J.G.; writing—review and editing, M.G., F.G. and M.M.; visualization, F.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Ethical review and approval were waived for this study because the analysis is based on commonly available data.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data that support the findings of this study are listed in Table 1.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Review* **Advancements in Oncology with Artificial Intelligence—A Review Article**

**Nikitha Vobugari 1, Vikranth Raja 2, Udhav Sethi 3, Kejal Gandhi 1, Kishore Raja 4 and Salim R. Surani 5,\***


**Simple Summary:** With the advancement of artificial intelligence, including machine learning, the field of oncology has seen promising results in cancer detection and classification, epigenetics, drug discovery, and prognostication. In this review, we describe what artificial intelligence is and its function, as well as comprehensively summarize its evolution and role in breast, colorectal, and central nervous system cancers. Understanding the origin and current accomplishments might be essential to improve the quality, accuracy, generalizability, cost-effectiveness, and reliability of artificial intelligence models that can be used in worldwide clinical practice. Students and researchers in the medical field will benefit from a deeper understanding of how to use integrative AI in oncology for innovation and research.

**Abstract:** Well-trained machine learning (ML) and artificial intelligence (AI) systems can provide clinicians with therapeutic assistance, potentially increasing efficiency and improving efficacy. ML has demonstrated high accuracy in oncology-related diagnostic imaging, including screening mammography interpretation, colon polyp detection, and glioma classification and grading. By utilizing ML techniques, the manual steps of detecting and segmenting lesions are greatly reduced. ML-based tumor imaging analysis is independent of the experience level of evaluating physicians, and the results are expected to be more standardized and accurate. One of the biggest challenges is generalizability worldwide. The current detection and screening methods for colon polyps and breast cancer have a vast amount of data, so they are ideal areas for studying the global standardization of artificial intelligence. Central nervous system cancers are rare and have poor prognoses under current management standards. ML offers the prospect of unraveling undiscovered features from routinely acquired neuroimaging to improve treatment planning, prognostication, monitoring, and response assessment of CNS tumors such as gliomas. Studying AI in such rare cancer types may improve standard management by augmenting personalized/precision medicine. This review aims to provide clinicians and medical researchers with a basic understanding of how ML works and its role in oncology, especially in breast cancer, colorectal cancer, and primary and metastatic brain cancer. Understanding AI basics, current achievements, and future challenges is crucial to advancing the use of AI in oncology.

**Keywords:** artificial intelligence; machine learning; deep learning; convolutional neural networks; support vector machine; breast oncology; brain tumors; colon cancer

**Citation:** Vobugari, N.; Raja, V.; Sethi, U.; Gandhi, K.; Raja, K.; Surani, S.R. Advancements in Oncology with Artificial Intelligence—A Review Article. *Cancers* **2022**, *14*, 1349. https://doi.org/10.3390/cancers14051349

Academic Editors: Hamid Khayyam and Rahele Kafieh

Received: 21 February 2022 Accepted: 28 February 2022 Published: 6 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Artificial intelligence (AI) is a field in which computers are programmed to mimic human intelligence. The abundance of data in the field of medicine makes it a good candidate for problem solving using machine learning (ML). In oncology, ML can be used to diagnose and classify tumors, detect early-stage tumors, gather genetic and histopathological data, assist in pre- and post-operative planning, and predict overall survival outcomes [1]. Deep Learning (DL), a type of ML, has proven to be effective in automating time-consuming steps such as detection and segmentation of lesions [2–4].

AI-based models have demonstrated excellent accuracy in cancer detection on screening mammography and in breast cancer (BC) prediction based on genetic and hormonal factors [5–7]. AI also plays a crucial role in early detection, classification, histopathological assessment, and genetic and molecular marker detection in colorectal cancer (CRC) [8–10]. Given the extensive data from present-day screening and the improvements in life expectancy achieved by early detection of breast and colon cancer, we review the potential of AI-based diagnostics and therapeutics. Because mammograms and colonoscopies are widely used in the general population worldwide, AI can be used extensively in future studies on cancer screening to build generalizable AI systems [11]. AI has made its way into other cancer types, which we do not review here; for instance, lung cancer screening is reserved for smokers (the United States Preventive Services Task Force (USPSTF) approved low-dose chest computed tomography (CT) scans in 2013), and prostate cancer screening has not yet been approved universally [11,12]. CNS cancers are relatively rare and have a poor prognosis. Studying AI in such rare tumors can show how precisely AI integration can improve the current standard of management. In the area of central nervous system (CNS) tumors, AI and radiomics have notably enhanced detection rates and reduced several time-consuming steps in glioma grading, pre- and intraoperative planning, and postoperative follow-up [13–15].

This review article outlines how AI works in simple terminology that medical professionals can understand, how it has improved breast cancer screening, colon polyp detection, and colorectal cancer screening, and the implications it has for the management of CNS tumors. A literature search was conducted on PubMed, Google Scholar, arXiv, and Scopus. This is not a systematic review but a narrative review of the literature. We conclude with existing obstacles to and future prospects for standardizing AI screening in oncology, as well as proposals for integrating AI basics into medical school curricula.

#### **2. How Does Artificial Intelligence Work?**

AI is a broad concept that aims to simulate human cognitive ability. ML, an approach to AI, is the study of how computer systems can learn to perform a task or predict an outcome without being explicitly programmed [16]. Mitchell et al. (1997) succinctly defined this learning process as follows: a computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P) if its performance at tasks in T, as measured by P, improves with experience E. A simple example of such a task is the classification of a suspicious abnormality on a screening mammogram as probably malignant or benign [17]. To learn to perform this task, a computer program would experience a dataset containing correctly classified examples of benign and malignant breast lesions and come up with a model that can generalize beyond these data. Its ability to then classify previously unseen examples of breast lesions correctly would be evaluated through quantitative measures of its performance, such as accuracy, sensitivity, and specificity.
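The performance measures named here can be computed directly from a confusion matrix. A minimal sketch in plain Python, on a hypothetical set of labels (1 = malignant, 0 = benign); the data are invented for illustration:

```python
# Evaluate a learned task T (classify lesions) by measure P
# (accuracy, sensitivity, specificity) on held-out labeled data.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # fraction of malignant cases found
        "specificity": tn / (tn + fp),  # fraction of benign cases cleared
    }

# Hypothetical ground truth and model predictions:
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(evaluate(y_true, y_pred))  # all three metrics are 0.75 here
```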

#### *2.1. Subtypes of Machine Learning*

Algorithms for ML are typically categorized into supervised, unsupervised, or reinforcement learning. Supervised learning algorithms experience a dataset that contains a label (or correct answer) for each data point. Examples of supervised learning algorithms include support vector machines (SVMs) [18,19], linear regression, logistic regression, and k-nearest neighbors [20,21]. In contrast, unsupervised algorithms such as k-means clustering [22,23], affinity propagation [24], and Gaussian mixture models [25] study a dataset that does not contain labels and learn to derive structure from the given data. A reinforcement learning system trains an agent to behave in an environment by assigning it a reward for desired behaviors or penalizing it for undesired ones. The overall objective of an ML algorithm can be interpreted as learning an approximate function of the data. This function should take as input a set of features that describe the data and output a prediction corresponding to the learning task. Classical ML algorithms are generally good at approximating linear or simple non-linear functions [13,26].
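As a concrete contrast with supervised learning, here is a toy unsupervised example: one-dimensional k-means with k = 2, written in plain Python. The data points and starting centroids are illustrative, not from any cited study.

```python
# 1-D k-means sketch: no labels are given; the algorithm alternates between
# assigning points to the nearest centroid and recomputing the centroids.
def kmeans_1d(xs, c1, c2, iters=10):
    for _ in range(iters):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1 = sum(g1) / len(g1)  # new centroid = mean of assigned points
        c2 = sum(g2) / len(g2)
    return c1, c2

# Two obvious groups, around 1 and around 10, are recovered without labels:
print(kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], c1=0.0, c2=5.0))
```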

#### *2.2. Deep Learning*

DL is a type of ML that enables the learning of complex non-linear functions of the data. Most modern DL methods use neural networks as their learning model, which are loosely inspired by neuroscience [27]. The fundamental computational unit of a neural network is called a neuron. It computes a weighted sum of its inputs and then applies a non-linear operation (often called the activation function) to the sum to compute the output (see Figure 1a). Common activation functions include the sigmoid, tanh, and rectified linear unit (ReLU) functions. A neural network comprises one or more layers of neurons, with each layer feeding on the outputs of the previous layer. Information flows forward through the network from the input, through a series of intermediate layers (called hidden layers), and finally to the output (see Figure 1b). As the number of layers and of units within a layer increases, a neural network can represent functions of increasing complexity. This architecture gives neural networks the ability to learn their own complex features instead of being constrained to the hand-picked features provided as input to the model.
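The weighted-sum-plus-activation computation of a single neuron is small enough to write out directly. A sketch in plain Python; the inputs and weights are arbitrary illustrative values:

```python
import math

# One neuron: weighted sum of inputs plus bias, passed through an activation.
def neuron(inputs, weights, bias, activation):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))  # squashes to (0, 1)
relu = lambda z: max(0.0, z)                    # zero for negative inputs

x = [1.0, 2.0, -1.0]      # inputs X1, X2, X3
w = [0.5, 0.25, 1.0]      # learned weights (illustrative)
print(neuron(x, w, 0.0, relu))     # 0.0  (weighted sum is exactly 0)
print(neuron(x, w, 0.0, sigmoid))  # 0.5  (sigmoid of 0)
```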

**Figure 1.** (**a**): Neuron, the fundamental computational unit of a neural network, computes the weighted sum of its inputs (X1, X2, X3) and applies a non-linear operation to give the output (Y). (**b**): An example of a feedforward neural network with two hidden layers, with five and four neurons, respectively. (**c**): An example of a convolutional neural network (CNN) applied to the classification of a screening mammogram as probably malignant or benign.

During training, the parameters of the neural network are learned in order to fit the dataset for a given task. This corresponds to minimizing some notion of a cost function, which measures the model's performance on the task. After each forward pass through the network, the cost function is used to compute the error between the predicted and expected output. An algorithm called backpropagation allows this cost information to flow backward through the neural network while adjusting the network parameters. Backpropagation computes the gradients of the cost function with respect to the network parameters, which determine the level of adjustment to be made to the parameters in each iteration [28]. These gradients are then used to update the network parameters using an optimization algorithm such as stochastic gradient descent (SGD) [29,30].
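The train-by-gradient loop described above can be shown end-to-end for the smallest possible network: a single sigmoid neuron with cross-entropy loss, where the backpropagated gradient reduces to (prediction − y) times the input. A toy sketch under illustrative assumptions (invented data, arbitrary learning rate):

```python
import math

# Task: predict y = 1 when x > 0. For a sigmoid output with cross-entropy
# loss, backpropagation gives dLoss/dw = (p - y) * x and dLoss/db = (p - y).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.5

def loss():
    total = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

before = loss()
for _ in range(100):               # training epochs
    for x, y in data:              # per-example updates = stochastic gradient descent
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # forward pass
        w -= lr * (p - y) * x      # backward pass: apply the gradients
        b -= lr * (p - y)
after = loss()
print(before, "->", after)         # the cost decreases as the parameters fit the data
```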

Apart from the simple feed-forward model discussed above, there are other specialized neural network architectures suited to specific tasks. For instance, convolutional neural networks (CNNs) have a grid-like topology and are well suited to processing two- or three-dimensional inputs such as images [31]. CNNs are designed to capture spatial context and learn correlations between local features, due to which they yield superior performance on image tasks, such as the classification of breast lesions in a screening mammogram as probably malignant or benign (see Figure 1c). CNN-based architectures have also been applied to biomedical segmentation applications [32]. However, CNNs face computational and memory efficiency limitations in three-dimensional (3D) segmentation tasks. More efficient methods have been proposed for the segmentation of 3D data, such as magnetic resonance imaging (MRI) volumes [33]. A recent architecture, occupancy networks for semantic segmentation (OSS-Net) [34], is built upon occupancy networks (O-Net) and contains efficient representations of 3D geometry, which allows for more accurate and faster 3D segmentation [35].
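The core operation of a CNN layer can be sketched without any framework: slide a small kernel over the image and take weighted sums at each position (most DL libraries implement cross-correlation under the name "convolution"). The toy image and edge-detecting kernel below are illustrative:

```python
# "Valid" 2-D convolution (cross-correlation): the output at (i, j) is the
# weighted sum of the image patch under the kernel at that position.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel responds only at the dark-to-bright boundary:
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d(img, edge))  # [[0, 2, 0], [0, 2, 0]]
```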

Another family of neural networks, called recurrent neural networks (RNNs), is designed to operate on sequential data. RNNs are well equipped to process sequential inputs of variable length for tasks such as machine translation and language modeling. Long short-term memory networks (LSTMs) are a special kind of RNN capable of learning long-term dependencies between inputs [36]. Another technique, called attention, allows a model to selectively focus on parts of the input data as needed by enhancing specific parts of the input and diminishing others [37]. Recently, a network architecture called the Transformer has achieved state-of-the-art performance in a number of machine learning tasks [38]. Transformers discard recurrence and convolutions entirely, relying exclusively on attention mechanisms. Attention-based transformers have demonstrated state-of-the-art segmentation performance and may prove relevant to the field of oncology [39].
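In its scaled dot-product form, the attention mechanism reduces to a softmax over query-key similarities followed by a weighted sum of value vectors. A minimal single-query sketch in plain Python; the vectors are arbitrary illustrative values:

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Scaled dot-product attention for one query over a sequence of key/value
# pairs: weights = softmax(q . k / sqrt(d)), output = weighted sum of values.
def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the second key most strongly, so the output is pulled
# toward the second value vector: attention "focuses" on that position.
q = [0.0, 1.0]
keys = [[1.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(q, keys, values))
```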

#### **3. Breast Cancer**

BC is the most prevalent cancer reported in National Cancer Institute statistics, 2020 [40], and a major cause of cancer-related mortality after lung cancer [41]. BC death rates decreased annually from 1989 to 2017, attributed to advancements in screening and therapies [41]. AI has shown enormous benefit in screening mammography, the formulation of BC predictive tools, and drug development [5,6,42–44].

#### *3.1. Screening Mammogram*

A screening mammogram is one of the most widely performed screening tests, but it is limited by high false-positive and false-negative rates [14,42]. AI models have reduced the workload and achieved a 69% reduction in false-positive rates together with higher sensitivity in screening mammograms [2,42]. AI in BC screening has good accuracy rates, albeit with some methodological issues and evidence gaps [14,45].

In the context of mammography, DL algorithms such as CNNs are principally used; the mechanism of the algorithm is illustrated in Figure 1c. The performance of AI is measured by sensitivity, specificity, the area under the curve (AUC), and computation time [46]. Different DL models have been studied with various classification systems to identify abnormalities in mammograms, with overall sensitivity rates ranging from 88% to 96% [47–49]. Detection rates are augmented by the positive reinforcement of an AUC over 0.96 after biopsy confirmation [50]. A new AI model (Transpara 1.4.0, ScreenPoint Medical BV, Nijmegen, The Netherlands) expedites interpretation and reduces workload by 20–50% by excluding mammograms with a low likelihood of cancer, allowing radiologists to concentrate on challenging cases [2,51]. The detection performance of radiologists using AI-aided systems was compared to that of radiologists using conventional systems: radiologists with AI-aided systems achieved higher AUCs, sensitivity, and classification performance [52,53].

Conventional computer-aided detection (CADe) in mammography is hampered by high false-positive and false-negative rates. AI-based CAD systems have been shown to reduce false-positive rates by 69%, with sensitivity ranging from 84% to 91% [42,54]. The concept of double reading (a mammogram read by two radiologists independently or together) is used in Europe to reduce false positives and false negatives. Using AI in place of the second reader maintained non-inferior performance and reduced the workload by 88% in a simulation study [55]. In another study, combining a single radiologist's assessment with an AI algorithm achieved higher interpretive accuracy, with a specificity of 92% vs. 85.9% for a single radiologist's interpretation; however, no single AI algorithm outperformed radiologists' accuracy rates [14]. Double reading is not standard practice in the United States, but cost-effective integration of AI with radiologists could increase overall sensitivity, although the acceptable miss-rate threshold should be carefully considered. Another study used the breast imaging reporting and data system (BI-RADS) to incorporate radiologists' subjective thresholds while using evidence-based data to train AI; it showed a 47.3% reduction in false positives and a slight 26.7% increase in false negatives [56]. AI also has the advantage of not increasing interpretation time: AI-based CADe takes 20% less time than traditional CADe and the same amount of time as radiologists [57]. Although further studies are required to assess the exact costs of AI mammography, the overall reduction in false positives could make it cost-effective [57]. DL models are being incorporated into digital breast tomosynthesis and contrast-enhanced digital mammography datasets for volumetric assessment of the breasts in three dimensions, to further increase detection accuracy and reduce workload by 70% [7,58,59].
Radiomics is an approach to extract relevant quantitative properties, also known as features, from clinical, histopathological, and radiological data. It has been applied to breast imaging to further improve accuracy rates [60]. A more detailed description of radiomics is given in Section 5.2.

#### *3.2. Genetics and Hormonal Aspects in Breast Cancer Prediction*

Artificial neural networks (ANNs) achieved remarkable accuracy, with AUCs of 0.909, 0.886, and 0.883, when assessed for their ability to predict 5-, 10-, and 15-year BC-related survival, respectively, based on factors such as age, tumor size, axillary nodal status, histological type, mitotic count, and nuclear pleomorphism [61]. Hybrid DL models incorporate genetics, histopathology, and radiology data and outperform traditional models such as the Gail model (which calculates BC risk over the next five years based on medical and reproductive history and does not take BRCA gene status into account) and the Tyrer–Cuzick model (which calculates the likelihood of carrying BRCA1 or BRCA2 mutations based on personal and family history) [5,6].

#### **4. Colonic Polyps and Colorectal Cancer**

CRC is the third most common cancer in the United States, with approximately 147,950 new cases in 2020. AI has shown great success in the screening, diagnosis, and treatment of CRC, ushering in a new era of computer-assisted techniques for adenoma detection and characterization, computer-aided drug delivery, and robotic surgery. Other benefits include the use of ANNs to screen effectively with personal health data [62].

#### *4.1. Colorectal Cancer Screening*

By detecting adenomas and preventing progression to carcinoma, screening has significantly reduced the incidence of CRC over the past decade. This has resulted in recommendations for routine screening starting at age 45 [63]. The current screening methods for CRCs include invasive procedures (colonoscopy (gold standard) and flexible sigmoidoscopy), minimally invasive procedures (capsular endoscopy), and non-invasive procedures (CT colonography or virtual colonoscopy, stool for occult blood, fecal immunochemical test, and multitarget stool DNA).

A few AI models have been tested to predict the risk of CRC and high-risk colonic polyps (CPs) from historical data and complete blood counts (CBCs). One such software product, ColonFlag, predicts polyps and CRCs according to age, sex, CBC, and demographic information; scores were compared to gold-standard colonoscopy, converted to percentiles, and grouped into categories such as CRC, high-risk polyps, and benign polyps [64]. Another retrospective study (MeScore, Calgary, Alberta, Canada) compared CBC results obtained 3–6 months before colonoscopy with colonoscopy findings in two unrelated cohorts (Israeli and UK). The AUC for CRC diagnosis was 0.82 ± 0.01, and the specificity at 50% detection was 87 ± 2% a year before diagnosis and 85 ± 2% for localized cancers [65]. These results point to the possibility of an early, noninvasive preliminary screen that can be integrated into electronic medical records to flag high-risk patients, who can then be screened aggressively, balancing the risks and benefits of colonoscopy in young people. Another ANN model, designed to screen a large population based only on personal health information from big data, also achieved optimal results [62]. However, these models are not in current practice and require further validation for generalizability.

#### *4.2. Colonic Polyps Detection*

Colonoscopy is the gold-standard invasive test for the detection of colonic adenomas and CRC. An adenoma is the most common precancerous lesion. The adenoma detection rate (ADR) measures a gastroenterologist's ability to detect adenomas; it is inversely related to the adenoma miss rate (AMR) and the risk of post-colonoscopy CRC. ADRs range from 7% to 53%, while AMRs vary from 6% to 27% across healthcare facilities. Several factors have been postulated to explain these differences, including the quality of preprocedural bowel preparation, withdrawal time, operator experience and training, procedural sedation, cecal intubation rate, visualization of flexures (blind spots), use of image-enhanced endoscopy, and the presence of flat, diminutive (<5 mm), and small (5–10 mm) polyps. Studies show that endoscopists with higher ADRs during screening colonoscopy are more effective at reducing subsequent CRC risk for their patients [66,67].

In recent years, CADe and computer-aided diagnosis (CADx) systems have been developed to automate polyp detection during colonoscopy and to further characterize the lesions found. Because of its ability to detect diminutive polyps, real-time AI-aided colonoscopy achieves a greater ADR than conventional colonoscopy (OR 1.53, 95% CI 1.32–1.77; *p* < 0.001), as derived from meta-analysis data [4,68,69]. One AI system, GI Genius, highlights suspicious lesions with green squares on the live endoscopy video and generates a sound for each marker. Several meta-analyses demonstrate excellent detection rates for polyp detection using AI-assisted algorithms, with an AUC of 0.90, sensitivity of 95%, and specificity of 88% [8].

#### *4.3. Colon Polyps Classification*

AI-based classification of CPs into cancerous vs. non-cancerous lesions on CT colonography and capsule endoscopy is a fascinating development. On CT colonography, texture analysis based on the gradient and curvature of high-order images, combined with random forest models, significantly improved the accuracy of CP classification [70,71]. An AI-assisted CAD model revealed an inverse correlation of CP sphericity with adenoma detection sensitivity and a direct correlation with adenoma detection accuracy; this model can effectively detect flat colonic lesions and CRCs on CT colonography [72]. Capsule endoscopy is another noninvasive diagnostic tool for gastrointestinal tract inspection, but processing its large volumes of image data is time-consuming. Stacked sparse autoencoding with image manifold constraint, a DL-based AI, correctly identifies polyps in capsule endoscopy images with 98% accuracy and good time effectiveness [73]. An ANN model with logistic regression predicted the risk of distant metastasis in CRC patients based on several clinical factors, such as pathologic stage grouping, first treatment, sex, age at diagnosis, ethnicity, marital status, and high-risk behavior variables [74]. With DL models, tumors can be segmented and delineated more accurately, and faster region-based CNNs trained to read MRI images enable faster and more accurate diagnosis of CRC metastasis [75,76].

#### *4.4. Histopathological Aspects, Genetics, and Molecular Marker Detection*

Histopathological characterization is the gold standard for the classification of polyps [77]. However, one of the biggest challenges is the significant intra- and interobserver variability. The use of DL and CNN models to automate image analysis can allow pathologists to classify CPs with an overall accuracy of 95% or more [10]. These DL models analyze whole-slide images of hematoxylin- and eosin-stained sections to identify four different stages: normal mucosa, early preneoplastic lesions, adenomas, and cancer [9,10,78].

AI-based models have been used to identify gene expression profiles and non-coding microRNAs (miRNAs) for diagnosis, prognosis, and targeted therapy planning [79–81]. The use of near-infrared (NIR) spectroscopy and counter-propagation artificial neural networks (CP-ANNs) to distinguish mutant from wild-type B-rapidly accelerated fibrosarcoma (BRAF) genes was shown to be highly accurate, specific, and sensitive [79]. Mutant BRAF is associated with a poor prognosis, and this AI model can assist in prognosticating and managing these patients aggressively. Backpropagation and learning vector quantization (LVQ) neural networks have demonstrated a remarkable role in assessing genetic profiling data from the cancer genome atlas (TCGA) to improve CRC diagnosis [81]. Several neural networks, including S-Kohonen, backpropagation, and SVM, were compared for predicting the risk of relapse after surgery; the S-Kohonen network was found to be the most accurate [82]. Non-coding miRNAs play an important role in tumorigenesis and cancer progression by interfering with various cell signaling pathways, including WNT/beta-catenin, phosphoinositide-3-kinase (PI3K)/protein kinase B (Akt), epidermal growth factor receptor (EGFR), NOTCH1, mechanistic target of rapamycin (mTOR), and TP53. The identification of miRNAs through AI models aids in the diagnosis, prognosis, and targeted treatment of CRCs [80,83–86].

In the early detection of CRC, ML-based AI can help isolate circulating tumor cells in peripheral smear and analyze serum specific biomarkers, such as leucine-rich alpha-2 glycoprotein 1 (LRG1), EGFR, inter-alpha trypsin inhibitor heavy chain family member 4 (ITIH4), hemopexin (HPX), and superoxide dismutase 3 (SOD3) [87,88].

#### **5. Central Nervous System Cancers**

In the United States, primary brain tumors have an annual incidence of 14.8 per 100,000 people, with a male predominance. Despite significant advances in imaging modalities, surgical techniques, chemotherapy, radiotherapy, and radiosurgery, primary brain tumors such as glioblastoma multiforme (GBM) remain challenging to manage [89]. GBM is one of the most common primary intracranial neoplasms and accounts for nearly 60% of all primary brain tumors worldwide. Primary and metastatic CNS cancers are challenging to manage because of their rapid proliferation, prominent neovascularization, invasion of distant sites, and poor response to chemotherapy due to the blood–brain barrier. Clinical management includes initial observation, grading, assessing the depth of infiltration, segmentation and localization of the tumor, histopathological evaluation, and identification of molecular markers. As a result, clinicians have to compile all these data manually in order to formulate a treatment plan. In this regard, AI has proven useful in the diagnosis and management of CNS malignancies [26].

#### *5.1. Central Nervous System Neoplasm Detection*

AI has made significant advances in the detection and classification of brain tumors in recent years. MRI is currently the gold-standard tool for tumor detection and characterization [90]. Conventional MRI methods such as T1- and T2-weighted imaging and fluid-attenuated inversion recovery (FLAIR) sequences have the disadvantages of nonspecific contrast enhancement and a high likelihood of missing foci of tumor infiltration. To improve detection, perfusion MRI with dynamic susceptibility-weighted contrast enhancement, dynamic contrast enhancement, and arterial spin labeling is also used to evaluate the neoangiogenic properties of brain tumors such as GBM. In addition to characterizing tissue microstructure, diffusion-weighted imaging shows neoplastic infiltration in areas of the brain that appear normal on conventional magnetic resonance (MR) images. MR spectroscopy can identify chemical metabolites such as choline, creatine, and N-acetyl aspartate, which are useful for glioma grading and for identifying tumor-infiltrated regions [91]. By automating these steps, AI has enhanced the detection rates and efficiency of radiologists, which, in turn, has reduced the time traditionally spent in diagnosing a disease. CNN-based DL can also detect millimeter-sized brain tumors and can distinguish GBMs from metastatic brain lesions [3,92]. MRI provides structured anatomical information on tumors, but tumor differentiation is always based on histopathological evaluation, which is invasive, time-consuming, and expensive. It remains challenging to distinguish low-grade from high-grade gliomas on imaging, even with AI systems. Attention-based transformers are currently being investigated for the first time in glioma classification, and their use may offer a breakthrough [39,93].

#### *5.2. Radiomics*

A comprehensive analysis of clinical, histopathological, and radiological data combined with ML/DL image processing has paved the way for a new translational field in neuro-oncology called radiomics [60,94,95]. AI-based radiomics provides enhanced noninvasive tumor characterization by enabling histopathologic classification/grading within minutes, even at the time of surgery, as well as prognostication, monitoring, and treatment response evaluation [96,97]. AI algorithms are able to analyze these images at the pixel level, so they can provide information not visible to the human eye and allow for more accurate grading [3]. Radiomics involves a set of complex multi-step processes with manual, automatic, and semi-automatic segmentation. Two main types of radiomics are described, feature-based and DL-based, and both provide more accurate and reliable results than human readers. Feature-based radiomics algorithms translate subsets of specific features from segmented regions and volumes of interest (VOIs) into mathematical representations. This multi-step process includes image pre-processing (noise reduction, spatial resampling, and intensity modification), precise tumor segmentation (manual vs. DL-based techniques), feature extraction (histogram-based, textural, and higher-order statistical features), feature selection (filter methods, wrapper approaches, and embedded techniques), and model generation and evaluation (neural networks, SVMs, decision trees/random forests, linear regression, and logistic regression models) [95,98]. DL-based radiomics uses CNNs, in which the model learns in a cascading fashion without any prior description of features and requires a large amount of data for the learning process. The cascading technique processes data to obtain useful information, removes redundancies, and prevents overfitting [27,31,98].
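As a concrete illustration of the histogram-based (first-order) feature-extraction step, such statistics can be computed directly from the intensities inside a segmented region. A minimal sketch in plain Python; the voxel values are hypothetical, and real radiomics pipelines extract many more standardized features:

```python
import math

# First-order (histogram-based) features from the intensities in a
# segmented volume of interest (VOI). Illustrative subset only.
def first_order_features(intensities):
    n = len(intensities)
    mean = sum(intensities) / n
    var = sum((x - mean) ** 2 for x in intensities) / n
    std = math.sqrt(var)
    # Skewness > 0 indicates a tail of bright voxels (e.g. enhancing tissue).
    skew = (sum((x - mean) ** 3 for x in intensities) / n) / std ** 3 if std else 0.0
    return {"mean": mean, "std": std, "skewness": skew,
            "min": min(intensities), "max": max(intensities)}

# Hypothetical voxel intensities from a segmented tumor region:
voi = [52, 55, 60, 61, 63, 64, 64, 70, 88, 95]
print(first_order_features(voi))
```

In a feature-based pipeline, vectors like this (extended with textural and higher-order features) would then feed the feature-selection and model-generation stages described above.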

#### *5.3. Histopathological Aspects, Genetics, and Molecular Marker Detection*

Traditional histopathological evaluation of cranial tumors identifies microscopic features such as areas of neovascularization, central necrosis, endothelial hyperplasia, and regions of infiltration. These sometimes overlap and can lead to false-positive results [99]. To overcome this complexity, digital slide scanners are now used to convert microscopic slides into image files interpreted by AI-based algorithms such as SVMs and decision trees; SVMs have shown higher precision rates [98]. AI-based algorithms analyze pathological specimens of gliomas and predict outcomes based on genetic and molecular markers, including isocitrate dehydrogenase (IDH) mutation status, 1p/19q co-deletion status, O-6-methylguanine-DNA methyltransferase (MGMT) methylation status, epidermal growth factor receptor splice variant III (EGFRvIII), Ki-67 marker expression, prediction of p53 status in gliomas, and prediction of mutations in BRAF and catenin β-1 in craniopharyngiomas [96,98,100–103]. IDH mutation leads to the accumulation of an oncometabolite called D-2-hydroxyglutarate and is an important prognosticator in GBM; CNN-based AI has detected this biomarker from conventional MRI modalities [100]. MGMT promoter hypermethylation (MGMT encodes a DNA repair protein), which is exhibited in about 33–57% of diffuse gliomas, is a favorable prognostic factor owing to increased sensitivity to alkylating agents such as temozolomide [98,101,104]. Supervised machine learning combined with texture features has been found to detect this methylation status. Performing principal component analysis on the final layer of a CNN indicated that features such as nodular and heterogeneous enhancement and "mass-like FLAIR edema" predicted MGMT methylation status with up to 83% accuracy [105]. EGFRvIII mutation is found in about 40% of GBMs.
Tumors with this mutation have been found to exhibit deep peritumoral infiltration, which is consistent with a more aggressive phenotype; EGFR mutation is also associated with increased neovascularization and cell density [106]. 1p/19q co-deletion, observed in oligodendrogliomas, has been shown to have a protective effect on prognosis [102], and CNN-based AI can be employed to detect it. Ki-67 marker expression indicates tumor cell proliferation. Traditionally, this marker is detected via immunohistochemical studies on the extracted tumor sample, a method that is invasive and time-consuming, yet identifying it is essential for the differential diagnosis and treatment plan. AI-based radiomics has been developed to detect this marker from fluorodeoxyglucose positron emission tomography (FDG-PET) and MRI images [107].

#### *5.4. AI in Pre- and Intra-Operative Planning, Postoperative Follow-Up, and Metastasis*

#### 5.4.1. Preoperative Assessment

Segmentation, volumetric assessment, differentiation of the tumor from healthy brain tissue and peripheral edema, and quantitative measures such as risk stratification, treatment response, and outcome prognosis are essential elements in the treatment planning of CNS tumors [108,109]. In traditional radiographic imaging, contrast-enhanced images are used to estimate tumor volume or burden; however, single-dimension imaging may not be accurate for the volumetric assessment of nonuniform tumors, such as high-grade tumors including GBMs. Another challenge is differentiating tumor borders from surrounding edema [110]. AI algorithms such as random forests, CNNs, and SVMs have been applied to tumor segmentation to overcome these challenges and have been shown to provide precise and accurate localization of the tumor. A two-step protocol with CNN and transfer-learning models led to precise and accurate localization of gliomas [111]. A 3D U-Net CNN applied to 18F-fluoroethyl-tyrosine PET for automated segmentation of gliomas showed 88% sensitivity, 78% positive predictive value, 99% negative predictive value, and 99% specificity [32,112].
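Segmentation performance figures of this kind reduce to voxel-wise confusion-matrix counts. A minimal sketch, assuming binary masks flattened to equal-length 0/1 sequences (the function name is illustrative):

```python
def segmentation_metrics(pred_mask, true_mask):
    """Voxel-wise evaluation of a binary tumor segmentation.

    pred_mask / true_mask: equal-length sequences of 0/1 voxel labels,
    e.g. flattened 3D masks from an automated segmentation and the
    manual reference. Returns the four metrics reported in the text.
    """
    tp = sum(p and t for p, t in zip(pred_mask, true_mask))        # tumor voxels found
    fp = sum(p and not t for p, t in zip(pred_mask, true_mask))    # background called tumor
    fn = sum(t and not p for p, t in zip(pred_mask, true_mask))    # tumor voxels missed
    tn = len(pred_mask) - tp - fp - fn                             # background correctly left
    return {
        "sensitivity": tp / (tp + fn),  # fraction of tumor voxels detected
        "specificity": tn / (tn + fp),  # fraction of background voxels spared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Perfect overlap drives all four metrics to 1.0
m = segmentation_metrics([0, 1, 1, 0, 1], [0, 1, 1, 0, 1])
```

Note that in whole-brain volumes background voxels vastly outnumber tumor voxels, which is why specificity and negative predictive value tend to sit near 99% even for imperfect segmentations.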

#### 5.4.2. Intraoperative Modalities

High-grade tumors such as GBM have a rapid proliferation rate and invade surrounding regions beyond the enhancing areas on radiological images, so these areas can be missed during excision [26,113]. AI-based DL algorithms have been developed to help surgeons remove as much tumor as possible while sparing normal healthy brain tissue. Three-dimensional CNNs have shown promising results in aiding stereotactic radiation therapy planning. It is often difficult to differentiate among primary brain tumors, primary CNS lymphoma, and brain metastases; however, AI-based algorithms such as decision trees and multivariate logistic regression models have been developed to differentiate among these entities using diffusion tensor imaging and dynamic susceptibility-weighted contrast-enhanced MRI [114–116].

#### 5.4.3. Postoperative Surveillance

MRI with gadolinium contrast is the standard for determining postoperative tumor growth and treatment response [117]. CNN-based AI techniques determine tumor size more accurately than linear measurement methods. The ability of CNN models to differentiate true progression from pseudo-progression, and of ML algorithms to differentiate radiation necrosis from tumor recurrence, is revolutionary [109,110,118]. Additionally, CNNs and SVMs create superior models to predict treatment response and survival outcomes from clinical, imaging, genetic, and molecular marker data [26].

#### **6. Precision and Personalized Medicine**

AI has moved oncology towards an era of personalized treatment, with remarkable aid in oncologic drug development, clinical decision support systems, chemotherapy, immunotherapy, and radiation therapy [43]. AI algorithms have been developed to assess factors such as oncogenic mutation profile and drug sensitivity, predicting the expected prognosis, efficacy, and adverse effects of a particular treatment option in a patient with a particular cancer [43,119]. In one study, an ML algorithm was designed to predict the effects of chemotherapy drugs, including gemcitabine and taxols, in correlation with patients' genetic signatures [120]. In another, an AI-based screening system was developed to detect cancer cells with homologous recombination (HR) defects, which can narrow down the BC patients who would benefit from poly ADP-ribose polymerase (PARP) inhibitors [44]. A DL algorithm was used to identify anticancer drugs that inhibit PI3K alpha and tankyrase, promising targets for CRC treatment [121]. ML-based analysis of protein–protein interactions between anticancer drugs and S100A9, a calcium-binding protein, points to a potential therapeutic target for CRC [122]. These avenues of discovering new targeted anticancer therapies via ML models are a fascinating step towards more effective therapeutic options. ML models can also be trained to interpret screening data to predict responses to new drugs or combination therapies [123]. The ability to synthesize and assess large amounts of chemical data also aids cancer drug development by narrowing predictions towards a specific formula, going beyond traditional experimental methods; DL systems are currently being explored for this purpose [124,125].
Applying AI to the clinical big data of cancer patients can generate personalized treatment options based on DL-assessed factors, including the clinical profile, genetics, cancer type, and stage of a patient's disease [126]. Moreover, the application of AI in radiotherapy is quite distinct: AI-based automation software can help plan radiation treatment regimens as effectively as conventional treatment layouts in a robust, time-effective manner [127,128]. With the growing role of immunotherapy in managing various cancers, ML-based platforms are being trained to predict therapeutic response to immunotherapy in programmed cell death protein 1 (PD-1)-sensitive advanced solid tumors [129,130]. AI can thus support, and even surpass, human capability in anticancer drug development and aid in personalized treatment planning in a time-effective manner.

#### **7. Generalizing Artificial Intelligence, Barriers, and Future Directions**

A number of factors challenge the generalizability of AI systems, including possible bias, the need for external validation of AI performance, the requirement for heterogeneous data, and the lack of standardized techniques [46].

#### *7.1. AI Performance Interpretation*

For AI to perform in clinical practice, it must be both internally and externally validated. In internal validation, the accuracy of AI is compared to expected results when algorithms are tested on previously used data [131]. Internal validation relies on performance measures such as sensitivity, specificity, and AUC. The problem with interpreting AUC alone is that it does not consider the clinical context; for instance, different combinations of sensitivity and specificity can yield similar AUCs. To measure AI performance meaningfully, studies should report AUC along with sensitivities and specificities at clinically relevant thresholds, a practice referred to as assessing "net benefit" [132]. As an example, high false-positive and false-negative rates remain a challenge in DL screening mammography, for which balancing the net benefit would be important [42]. Thus, before concluding that an AI system can outperform a human reader, it is important to carefully interpret its diagnostic performance. Furthermore, the sensitivity, specificity, and accuracy of diagnostic tests are independent of real-life prevalence. As a result, robust verification of clinical diagnostic and predictive performance requires external validation, for which a representative patient population and prospectively collected data would be necessary to test the trained AI algorithms [131]. Moreover, internal validation poses the risk of overestimating AI performance when the model becomes too familiar with its training data, known as overfitting [131]. Holding out unused datasets, including newly recruited patients, and comparing results with those of independent investigators at different sites can improve generalizability and minimize overfitting [131]. In a recent study, curated large mammogram screening datasets from the UK and the US revealed a promising path to generalizing AI performance [55].
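The prevalence point can be made concrete with Bayes' rule: a test's sensitivity and specificity are fixed properties, but its positive predictive value in deployment is not. A short sketch with a hypothetical 90%-sensitive, 90%-specific classifier:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule: P(disease | positive result)."""
    true_pos = sensitivity * prevalence              # diseased and flagged
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but flagged
    return true_pos / (true_pos + false_pos)

# The same 90%/90% classifier looks very different in a low-prevalence
# screening population versus a high-risk referral population:
print(round(ppv(0.90, 0.90, 0.005), 3))  # screening, 0.5% prevalence → 0.043
print(round(ppv(0.90, 0.90, 0.30), 3))   # referral, 30% prevalence → 0.794
```

This is why internally validated sensitivity/specificity figures alone cannot establish clinical usefulness, and why external validation on a representative population is required.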

#### *7.2. Standardization of Techniques*

To be universally applicable, an AI model must be trained on a large amount of heterogeneous clinical data [3,54,107]. AI-ready infrastructure and data storage systems are not available at all institutes, which is one of the biggest barriers [133]. There is also a lack of standardization of staining reagents, protocols, and section thicknesses, as well as of radiologic image acquisition, which can further hinder the generalizability of AI in clinical practice worldwide [1,54]. A number of automated CNN-based tools such as HistoQC, DeepFocus, and GAN-based image generators are being developed by societies such as the American College of Radiology Data Science Institute to standardize image sections [1,91]. In radiomics, another challenge involves compliance with appropriate quality controls, ranging from image processing and feature extraction to the algorithms used for prediction [134]. Several emerging initiatives use DL and CNNs to normalize or standardize images, including the "image biomarker standardization initiative" [134,135]. ML algorithms are often treated as a "black box" because of a lack of understanding of their inner workings, which can pose a challenge when dealing with regulated healthcare data. This necessitates transparent AI algorithms and careful interpretation of AI-based results to ensure no mistakes are made [26,136]. A few recently developed methods, such as saliency maps and principal component analysis, are helping to interpret the workings of these algorithms [105,137].
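As a minimal illustration of image standardization, z-score intensity normalization rescales each scan to zero mean and unit variance before feature extraction. This is one common preprocessing choice sketched here for illustration, not the specific method of the cited initiatives:

```python
import math

def zscore_normalize(intensities):
    """Rescale image intensities to zero mean and unit variance so that
    scans acquired under different protocols or scanners become
    comparable before feature extraction."""
    n = len(intensities)
    mean = sum(intensities) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in intensities) / n)
    if std == 0:
        return [0.0] * n  # constant image: nothing to rescale
    return [(x - mean) / std for x in intensities]

# Two scans with different brightness/contrast map onto the same scale
normed = zscore_normalize([10.0, 20.0, 30.0])
```

In practice such normalization is applied per scan (or per modality channel) as one step of the quality-control chain described above.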

#### *7.3. Bias in Artificial Intelligence*

Quality and quantity of data are key factors that determine the performance and objectivity of an ML system. AI can be biased in a number of ways—from assumptions made by engineers who develop AI to bias in the data used to train it. When training data are derived from a homogenous population, they may be poorly generalizable, which can potentially exacerbate racial/ethnic disparities, for example [138]. Thus, when training the AI, it is important to include diverse ethnic, age, and sex groups, as well as examples of benign and malignant tumors. Similarly, to integrate precision medicine and AI in real-world clinical settings, it is necessary to consider environmental factors, limitations of care in resource-poor locations, and co-morbidities [139]. There is also the possibility of bias introduced when radiologists' opinion is regarded as the "gold standard" rather than the actual ground truth or the absolute outcome of the case, benign or malignant [46]. As an example, several AI models in screening mammography are compared with radiologists instead of the gold standard biopsy results, introducing bias [46]. In order to overcome this problem, including interval cancers in testing sets and relying on reports from experienced radiologists might be helpful.

#### *7.4. Ethical and Legal Perspectives*

Creating future models that address the ethical issues and challenges of incorporating AI into preexisting systems requires an awareness of these issues. A few bodies, such as the UK Department of Health and Social Care, the US Food and Drug Administration, and other global partnerships, oversee and regulate the use of AI in medicine [46,140]. The National Health Service (NHS) Trusts in the United Kingdom regulate the use of patient care data in AI in an anonymized format for research purposes [46]. For AI in oncology to achieve global standardization, more international organizations must be formed that can oversee future AI studies within ethical and legal boundaries to protect patient privacy.

#### **8. Integrative Training of Computer Science and Medical Professionals**

For AI to be effectively integrated into healthcare in general, as well as oncology, formal training of medical professionals and researchers will be critical. Numerous societies and reviews have recommended formal training, but current medical education and health informatics standards do not include mandatory AI education, and competency standards have yet to be established [141,142]. There have been efforts in the radiology community to survey students' opinions about AI applications in radiology in order to develop formal training tools. A few of these are frameworks for teaching, principles for regulating the use of AI tools, special training for evaluating AI technology, and integrating computer science, health informatics, and statistics curricula into medical school [143–145]. A few institutes in the United States have proposed initiatives for AI in medical education, originally put forward by the American Medical Association. Among these initiatives are medical students working with data specialists, radiology residents working with technology companies to develop computer-aided detection in mammography, summer courses offered by scientists or engineers on new technologies, and involving medical students in engineering labs to create innovative ideas in health care [136]. Another framework would provide AI training for students in various fields, including medical students, health informatics students, and computer science students [142]. To improve patient care, medical students should become proficient in interpreting AI technologies, comparing their efficiency in patient care, and discussing ethical issues related to the use of AI tools [142]. Furthermore, medical professionals should understand the limitations and barriers of AI in clinical applications, as well as how to distinguish correct from incorrect information [146,147].
In health informatics, students should be taught how to apply appropriate ML algorithms to analyze complicated medical data, integrate data analytics, and formulate questions to visualize large data sets. Students studying computer science should be trained in Python, R, and SQL programming in order to solve complex medical problems [142]. Education tools that integrate medical professionals, health informatics students, and computer science students can pave the way for further developments in the fields of medicine and oncology.

#### **9. Conclusions**

Through AI, computer systems are capable of learning tasks and predicting outcomes without being explicitly programmed. DL, a subset of ML, utilizes neural networks and enables learning complex, non-linear functions from data. CNNs are well suited to processing two- to three-dimensional inputs such as images, while RNNs can handle sequential inputs of variable length such as textual data. Recently developed attention-based DL systems are capable of selectively focusing on data, resulting in better accuracy in cancer detection. AI has shown promising results in several areas of oncology, including detection and classification, molecular characterization of tumors, cancer genetics, drug discovery, and prediction of treatment outcomes and survival rates, moving the field towards personalized medicine. In screening mammography, various DL models have demonstrated non-inferior cancer detection performance, with overall sensitivity rates of 88–96%. Radiologists using AI-assisted systems have achieved higher AUCs and reduced workloads. Various real-time CADe and CADx AI systems have demonstrated a higher ADR by automating polyp detection and detecting diminutive polyps during colonoscopy. The use of machines to improve early cancer detection on screening mammograms and colonoscopies could be tested for application across the globe for more efficient patient care. Several AI-based cancer detection methods have also been developed for other cancer types, including lung, prostate, and cervical cancer, and future work may pursue implementing AI worldwide across all cancer types.

CNS tumors such as GBM continue to have a poor prognosis. AI-based radiomics, which is widely used for CNS tumor identification and grading, enables noninvasive tumor classification and grading within minutes. State-of-the-art attention-based transformers are currently being studied to improve glioma classification. AI can also ease the analysis of histopathological, genetic, and molecular markers. With the advancement of AI, oncology has moved into a more personalized era: AI has revolutionized drug development, clinical decision support systems, chemotherapy, immunotherapy, and radiotherapy.

A better understanding of the ethical implications of the use of AI, including its performance interpretation, standardization of techniques, and the identification and correction of bias, is required for more reliable, accurate, and generalizable AI models. Global organizations must be formed to provide guidance and regulation of AI in oncology. Formal integrated training for medical, health informatics, and computer science students could drive further advances of AI in medicine and oncology.

**Author Contributions:** N.V., V.R. and U.S. were involved in the literature search and writing the manuscript. K.G. and K.R. contributed to the literature search. S.R.S. conceptualized the idea, was involved in writing, review, and revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research study received no external funding.

**Conflicts of Interest:** All authors declare that they have no conflict of interest.

#### **References**


MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland. Tel: +41 61 683 77 34. www.mdpi.com

ISBN 978-3-0365-6673-3