1. Introduction
The rapid development of artificial intelligence (AI) technologies, particularly machine learning (ML) algorithms, has created opportunities for improving early prevention, diagnosis, prognosis, and the personalization of cancer treatment. In the hierarchy of concepts, AI represents the general field that includes all methods capable of reproducing human reasoning, with an emphasis on decision-making. ML is a subfield of AI focused on developing algorithms that learn patterns directly from data. Deep Learning (DL) is in turn a branch of ML based on multi-layered artificial neural networks. These techniques are used particularly in medical imaging, histopathological analysis, and the recognition of complex signals, such as genomic or radiomic ones. In oncology, these algorithms assist in diagnosis. When the data volume is modest, ML algorithms are preferred over DL algorithms, which require a much larger volume of data. The specialized literature reveals differences between the performance reported in studies and their actual applicability in clinical practice. These differences stem from data quality and availability, class imbalances, lack of external validation, and inconsistent reporting of results. Such gaps justify the need for a systematic evaluation of ML applications in oncology.
The paper is addressed to several categories of readers:
Researchers in AI applied to medicine, interested in the current state of ML integration in oncology;
Clinicians and oncologists seeking objective information about models with real potential for clinical use;
Decision-makers and developers of digital medical solutions who can identify opportunities and risks in adopting ML;
The academic community to which it aims to provide a replicable methodological framework for other medical fields.
The research questions (RQs) of the study are as follows:
RQ1: What are the most frequently addressed cancer types using ML algorithms between 2020 and 2025?
RQ2: What are the datasets and ML models mainly used in the studies from the literature?
RQ3: What performance levels are reported for different types of cancer, and what factors influence these results?
RQ4: How do the reported performances correlate with the actual potential for clinical implementation?
Study objectives are outlined below:
To conduct a systematic analysis of the recent literature on ML applications in oncology;
Classification of studies based on cancer type, datasets used, ML models applied, and achieved performance;
Assessing the reproducibility and generalizability of studies based on the information reported in article abstracts by proposing novel indicators associated with reproducibility and generalizability;
Identifying gaps and future research directions for the responsible implementation of ML in oncological practice.
The main contributions of this study are mentioned next:
Developing a rigorous literature filtering methodology which, to the authors' knowledge, has not previously been used in review articles;
Classification of studies into four analytical directions: type of cancer, dataset characteristics, ML algorithms used, and the performance achieved;
Comparative analysis of performance indicators, such as accuracy, precision (positive predictive value), recall (sensitivity), F1-score, and Area Under the Curve (AUC).
The paper is structured into five sections.
Section 2 presents the methodology used.
Section 3 focuses on the results and is structured into four subsections that analyze the distribution of publications that employ ML models by cancer type, the dataset assessment, ML models used in papers by cancer type, and the performance metrics for cancer types. The discussion and the limitations are depicted in
Section 4, whereas the conclusions and future research are described in
Section 5.
This review follows the PRISMA 2020 structure, and the results are reported in accordance with the selection stages described in the methodological section.
The originality of this review lies in the integration of a structured comparative framework along four axes: type of cancer, dataset characteristics, algorithm used, and reported performance. Unlike previous reviews, which focus on a single type of cancer or a single type of algorithm, this study proposes a cross-sectional analysis with its own indicators for reproducibility and generalizability. Additionally, the authors integrate into the PRISMA analysis innovative eligibility criteria, which they consider unique to date in the literature, aimed at filtering for research that demonstrates quality in both development and content.
2. Methodology
This paper presents a systematic review whose main objective is to investigate applications that record technological progress through ML algorithms in the field of cancer. The paper analyzes original studies published between 1 January 2020 and 31 December 2025. The literature selection process adheres to the following standards: novelty, quality, analysis, comparability, and expertise of the extracted data. The methodology presents a systematic approach with a detailed description of the most important papers identified after applying the entire filtration process.
2.1. Search Strategy and Data Source
The search strategy included the following three databases, i.e., Web of Science (WOS), Scopus, and PubMed. These three databases include papers from fields such as medical sciences, life sciences, biomedical engineering, and computer science. This ensures a complete review of the ML subject in cancer. The search focused on two key concepts: cancer and ML. The preliminary search strategy was restricted to article titles. Thus, all initially selected articles included the words “cancer” and “machine learning” or “ML” in their titles. The authors wanted to ensure the relevance of the initial results in this approach. The title of an article best reflects the central subject. Limiting the search to the title reduces irrelevant results where the terms appear only in a sentence within the text or in references. This increases the probability that the extracted articles will directly address the two concepts. Although this strategy proposal may omit some articles where the two subjects appear in the abstract or full text, the advantage of this proposal is to obtain an initial qualitative set that justifies this choice for the systematic review.
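For illustration, a title-restricted query of this kind can be expressed roughly as follows; the field tags and Boolean structure shown here are typical examples only, and the exact syntax actually used for each database is given in Table S1.

```
TI=("cancer*" AND ("machine learning" OR "ML"))
```

Restricting the query to the title field (here `TI=`) is what keeps out records where the terms appear only in the body text or the reference list.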
The time frame of 1 January 2020 to 31 December 2025 is justified by the rapid evolution of the ML field in recent years. Limiting the review to the last six years ensures that it reflects the latest discoveries, methods, implementations, trends, and evaluations. Secondly, the pandemic period beginning in 2020 accelerated medical research, including digital technological applications involving the integration of ML technologies in healthcare. Therefore, the authors expect that many innovations emerged during this period. Thirdly, the integration of ML techniques has matured in recent years, meaning that studies from this period are more likely to use advanced models that report up-to-date performance.
After conducting the initial searches, only open-access articles were selected. The motivation for choosing this was to be able to access the full text of the articles. This allowed for a detailed analysis of the content to understand the methods, context, results, future directions, and conclusions obtained.
Furthermore, review articles were excluded, and this exclusion supports the purpose of this research. The main objective is to review the original studies that apply ML in cancer; including reviews would create a circular loop, distorting the analysis. The paper focuses on the direct analysis of primary information sources to extract details about the types of cancer addressed, as well as the methods, datasets, analyzed ML models, and numerical results identified in the models’ performance metrics. Three lists of open-access articles were obtained after applying individual filters to each database. In the next step, the articles common to all three databases were identified. This intersection ensures the identification of the most highly indexed, most cited articles with high visibility within the scientific community, which represent benchmark scientific elements for the field. In this way, confidence in the quality of the selected items is increased. The authors also believe that articles indexed in multiple sources are better recognized, which reduces the risk of including peripheral articles or ones that are not sufficiently representative of the field.
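The intersection step can be sketched as a set operation over article identifiers. The DOI values below are hypothetical placeholders standing in for the exported result lists of each database.

```python
# Sketch of the database-intersection step; the DOIs are hypothetical
# placeholders, not real records from the review.
wos = {"10.1000/a", "10.1000/b", "10.1000/c"}
scopus = {"10.1000/b", "10.1000/c", "10.1000/d"}
pubmed = {"10.1000/b", "10.1000/c", "10.1000/e"}

# Keep only the articles indexed simultaneously in all three databases.
common = wos & scopus & pubmed
print(sorted(common))
```

In practice the exported records would first be deduplicated and keyed by a stable identifier such as the DOI before intersecting.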
From the set of common articles, a further selection was made by keeping only those articles whose abstracts contain numerical values. With this decision, the technical nature of the review is imposed. One of the goals is to provide a technical perspective on the performance of ML models in cancer. The reason for the restriction applied to the abstract is that it is a concise source of information. By identifying articles with numerical data, analysis efforts are focused on the articles most likely to contain detailed technical information in the rest of the text. This optimization in the data extraction process for the systematic review helped filter the large volume of data existing in the literature up to this point.
The criterion of analyzing numerical values in the abstract has been maintained as a deliberate methodological contribution. The authors of this paper believe that a scientific abstract should highlight the quantifiable results of the study. Excluding articles without numerical data in the abstract is not a cognitive bias; it is a quality filter that prioritizes studies with transparent performance reporting. This approach brings an element of novelty to this research, through a vision different from the standardized one of systematic reviews. The authors acknowledge the risk of excluding important articles through this restriction; on the other hand, they consider that a quality study adheres to the standard norms of developing scientific material. This criterion therefore ensures the selection of studies that communicate technical results from the synthesis onward, and it strengthens the argument that the approach represents a methodological novelty for rigorous filtering of the technical literature.
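A minimal sketch of such a numeric-abstract filter is shown below; the regular expression is an assumption for illustration, since the authors do not specify how the check was implemented.

```python
import re

# Matches integers, decimals, and percentages, e.g. "0.94" or "92.5 %".
# This pattern is an illustrative assumption, not the authors' implementation.
NUMBER = re.compile(r"\d+(?:\.\d+)?\s?%?")

def has_numeric_result(abstract: str) -> bool:
    """Keep an article only if its abstract reports at least one number."""
    return bool(NUMBER.search(abstract))
```

A real pipeline would likely also exclude incidental numbers (years, sample identifiers), which this sketch does not attempt.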
Articles filtered by applying the set of limitations are analyzed from the perspective of oncological classification. In this analysis, the articles were categorized based on the type of cancer addressed. This is how research areas that prioritize the integration of ML technologies are defined. Classification by cancer type allows for a detailed thematic analysis.
The U.S. National Cancer Institute (NCI) groups cancer types by human organs in numerous categories, but the most important are the 35 depicted in
Figure 1 [
1]. According to the American Cancer Society’s estimates, the cancer types expected to record the highest number of cases in 2025 are prostate cancer for males (colored in blue in
Figure 1), uterine cancer for females (colored in red in
Figure 1), and bladder, breast, colorectal, kidney, leukemia, lung, lymphoma, pancreatic, skin, and thyroid cancer for both genders (colored in green in
Figure 1) [
2].
In the second analysis, the characteristics of the datasets used are evaluated. The implications of different types of datasets are also discussed: the context and validity of the results, what constitutes a well-constructed dataset, and the influence the dataset has on an ML application that achieves good metric-level performance for practical use. Dataset analysis is particularly important due to the following characteristics:
Reproducibility ensures that the data description allows other researchers to replicate the study and potentially improve upon the findings presented in the paper;
Generalizability is studied using an indicator that reflects the model’s ability to function on populations different from those in the training cohort. Thus, the evaluation was based on the volume of the cohort, the disparity of the dataset, its diversity, and the presence of external validation.
The analysis of these two characteristics, which should describe the datasets used in training and validation, underlines the importance of assessing the datasets of the studies included in this systematic review.
The third analysis inventories the ML models mentioned in the extracted articles for mapping the algorithms used in the investigations. These provide an overview of the methodological trends of the algorithms. Inventorying these models is important because it highlights technological trends that allow for the identification of the most popular models at a specific point in time. They also provide information on which models are considered suitable for certain data types, support the identification of specific types of cancer, and offer a comparison of the performance of models applied to the same data types and of models investigating the same type of cancer.
Finally, an evaluation of the performance metrics reported in the articles is conducted to understand the degree of integration of ML models into practical applications. The most commonly used performance metrics are accuracy, precision, recall, F1-score, and AUC. Performance metrics are indicators of an application’s success, quantifying how well a model performs the specific task it was designed for, such as diagnosing a particular type of cancer. These metrics provide an objective comparison between multiple models with the same objective. Low values also signal limitations that indicate the study’s boundaries. Finally, when correlated with a specific type of cancer, they contextualize the work within the field and reveal the difficulties associated with a particular application typology.
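For reference, the four threshold-based metrics named above can be computed directly from the binary confusion counts; AUC additionally requires ranking the predicted scores and is omitted from this sketch. The label vectors in the usage note are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # positive predictive value
    recall = tp / (tp + fn) if tp + fn else 0.0      # sensitivity
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

For example, `binary_metrics([1, 1, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 1, 1, 0, 0])` gives 0.75 for all four metrics, illustrating that the metrics coincide only for particular error patterns and generally diverge on imbalanced oncology datasets.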
Figure 2 presents the synthesis of the methodology that includes the WOS, Scopus, and PubMed databases. Thus, a series of successive filters is applied for title, period, open access, and the exclusion of reviews. Subsequently, only the common articles that include numerical data in the abstract are retained, and the selected articles are analyzed in four directions: cancer type, datasets, algorithms, and performances.
This analysis methodology, based on the automatic extraction of information from abstracts and extended papers, systematizes the technical content of a large number of scientific articles. The complete search strategy used for each database, including the exact syntax of the query, Boolean operators, queried fields, and applied filters, is presented in
Table S1 from the Supplementary Materials section.
2.2. Review Protocol in Accordance with the PRISMA Guidelines
The articles extracted from the three databases, WOS, Scopus, and PubMed, investigate applications that integrate ML algorithms in the oncological field. In the WOS database, the initial search filter was applied based on the expression “cancer*” and “machine learning” in the title, for the period 1 January 2020–31 December 2025. The search returned 4577 articles, as shown in
Figure 3. The use of the asterisk in the word cancer includes all lexical derivatives (e.g., cancer, cancers, cancerous), practically extending the coverage area. This is a strategic choice to capture all terms in the field of cancer without losing potential works due to semantic restrictions. Review articles were then removed, which reduced the set to 4330 articles. The additional application of the open-access filter reduced the final set to 2237 articles. This filter is motivated by the need for full access to the complete text for in-depth technical analysis.
The Scopus search returned an initial number of 5514 articles. This value indicates greater coverage of recent works publishing frontier research in ML and cancer. The removal of review articles reduced the number of results to 5235. After applying the open-access filter, a total of 2098 results were obtained.
Regarding the PubMed database, it generated 3201 articles in the initial search. This large number is justified by the fact that the database is associated with publications in biomedicine, bioinformatics, and clinical research, so it includes many works specific to the medical field. The removal of reviews led to a minor decrease, with 2965 results. After applying the open-access filter, 2234 results were obtained.
The differences between the databases reflect their complementarity as they are focused on different objectives. Filtering by title is a strategy that ensures the selected papers have the two reference concepts as central elements. The elimination of review papers avoids the inclusion of syntheses that do not present the specific metrics of ML models employed in the conducted research. The setting of open access allows for the validation of discussions regarding content transparency.
Figure 3 presents the PRISMA diagram, which initially identifies 13,292 articles. After applying the review-removal filter, the total was reduced to 12,530, and after excluding articles that are not open access, 6569 remained. Of these articles, 295 are common to WOS–Scopus, 298 are common to WOS–PubMed, 82 are common to Scopus–PubMed, and 1503 are common to all three databases, WOS, Scopus, and PubMed. After applying the filter requiring metric values in the abstract, 1364 articles remained. Only these articles are analyzed from the four perspectives that address cancer types, datasets, ML models, and performance metrics.
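As a quick arithmetic check, the per-database counts reported in this subsection sum exactly to the stage totals of the PRISMA diagram:

```python
# Per-database article counts at each filtering stage (from the text above).
identified = {"WOS": 4577, "Scopus": 5514, "PubMed": 3201}
after_review_removal = {"WOS": 4330, "Scopus": 5235, "PubMed": 2965}
after_open_access = {"WOS": 2237, "Scopus": 2098, "PubMed": 2234}

assert sum(identified.values()) == 13292          # records identified
assert sum(after_review_removal.values()) == 12530  # after removing reviews
assert sum(after_open_access.values()) == 6569      # after open-access filter
```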
The systematic review was developed in accordance with the recommendations of the PRISMA 2020 guideline [
3]. The methodological protocol of the study was generated prior to the article selection process. This included: defining the research questions, eligibility criteria, search strategy, and selection stages.
The methodological structure follows the PRISMA principles for reporting the study selection flow. The PRISMA diagram, presented in
Figure 3, illustrates the process of identification, screening, eligibility, and inclusion of studies.
The search strategy was constructed using Boolean operators and domain-specific terms, with queries adapted to each database. The queries used are presented in full in
Table S1 from the Supplementary Material section. In this way, the authors ensure the transparency of the literature review process.
The search was limited to the title to sharpen the specificity of the results. Equally, this approach reduced the inclusion of peripheral studies where the terms appeared only incidentally. By prioritizing conceptual relevance over the raw volume of results, the authors extracted studies whose authors treated the two concepts as central and adhered to the standards of conducting research in the medical field.
The selection process was carried out following the stages below: removal of duplicates, exclusion of review articles, application of the open-access filter, and application of predefined eligibility criteria.
Inclusion criteria include:
Original studies published between 1 January 2020–31 December 2025;
Articles indexed simultaneously in Web of Science, Scopus, and PubMed;
Studies that explicitly apply ML algorithms in an oncological context;
Articles that report at least one performance indicator (accuracy, precision, recall, F1-score, or AUC);
Articles are available in full, open access.
Exclusion criteria comprise:
Review articles, meta-analyses, editorials, or letters;
Studies without explicit reporting of ML model performance;
Studies in which ML is not the main methodological component;
Studies with insufficient information regarding the dataset used.
The selection process was carried out sequentially, and the justification for each criterion was established prior to data extraction to reduce the risk of selection bias.
To ensure compliance with the PRISMA 2020 guidelines, duplicate articles across the three databases, ineligible publication types (reviews, editorials, letters, etc.), and articles lacking an explicit application of an ML algorithm, lacking reported performance indicators, or containing insufficient information about the dataset were removed.
Figure 3 illustrates these stages in the PRISMA diagram.
The filtering strategy used involves searching in the title, selecting open-access articles, intersecting three databases, and using information from the abstract for preliminary classification. This approach was designed to maximize the identification of important information in the final set of studies, although it may introduce certain methodological limitations. However, the authors proposed this framework deliberately, and it distinguishes the study from similar ones. These methodological proposals, which still align with the PRISMA 2020 standard, constitute the novel elements of this systematic review and contribute the authors’ own insights to the research.
3. Results
The results section exclusively presents the synthesis of the included studies after applying the previously described methodological criteria, without introducing additional search strategy elements.
ML models can perform classification or prediction tasks. In a medical context, prediction refers to risk, recurrence, survival chances, or other outcomes that are often cast as a binary classification. This does not limit the possibility of expanding the number of classes for which the ML model can make predictions [
4]. Prediction covers risk over time, the chance of survival, the likelihood of recurrence, and the identification of the degree of organ damage, and is sometimes treated as a classification. At the ML level, everything that involves evolution over time is considered a prediction task.
Table 1 presents a series of papers that mention the type of cancer, the task, and the objective of the paper.
Table 1 synthesizes the clinical objectives and task typology (classification vs. prediction, survival analysis, metastasis detection), thereby contextualizing how ML is being applied in oncology from a clinical decision-making perspective.
The development of predictive models for risk diagnosis in cancer includes cervical cancer, analyzed with hrHPV genotyping, cervical cytology, and clinical data [
24], as well as through the analysis of simple hematological tests for screening [
25]. Lung cancer [
26], breast cancer [
27], gastric cancer [
28], and pancreatic cancer [
16] can also be included in the category of predictive diagnostic models. Survival prediction in cancer is studied using ML models in lung cancer [
29], prostate cancer with bone metastases [
30], breast cancer [
31], and colorectal cancer [
32].
The analysis of the specialized literature shows that ML models are used for predicting metastases and surgical complications. This is the case of axillary metastases in breast cancer [
33], lymph node metastases [
34], and post-complete mesocolic excision (colon) heart failure [
35,
36]. ML models are also used to assist experts in optimizing a personalized treatment plan without delays that would be detrimental to the patient [
37,
38].
For ML models to function with the highest possible accuracy, ensuring the quality of the data used for training is a fundamental step in their development phase [
39]. Combining data that have a real contribution makes the models classify or predict as accurately as possible to reality [
40]. Another strategy to increase accuracy is by using multiple ML models simultaneously [
41,
42]. A final modern strategy that helps reduce the medical workload is the use of techniques for explaining the decisions produced by the model. This way, the medical professional can more easily understand the reasoning performed by the model and decide whether it has suggested a correct result [
9]. This approach ensures that the doctor does not miss a detail that could lead to an incorrect decision, but it also helps verify the model’s decision in case it might have mislabeled something [
31].
3.1. Cancer Types and ML Models
Figure 4 shows how the scientific articles are distributed according to the type of cancer analyzed. After extracting the 1364 original contributions, the goal was to classify them according to the type of cancer addressed in each paper. The representation in
Figure 4 is important because it reflects researchers’ interest in applying ML techniques to each type of cancer individually. Since articles do not always mention the type of cancer in a consistent form in the title or abstract, normalization was necessary to categorize the articles by cancer type. A concrete example is the different ways the same type of cancer is written: breast cancer is referred to as carcinoma of the breast, mammary carcinoma, invasive ductal carcinoma (IDC), breast malignancy, neoplasm of the breast, etc. Out of the initial 1364 articles identified, those in which the type of cancer was not specified were excluded. The remaining articles were then grouped by cancer type, and the number of articles corresponding to each type was counted. Finally, a figure was designed to summarize the types of cancer and the number of articles that address each using an ML approach.
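This normalization step can be sketched as a synonym lookup table that maps terminology variants, such as those listed above for breast cancer, to one canonical label. The table below is a small illustrative excerpt, not the authors' full mapping.

```python
# Illustrative excerpt of a synonym-to-canonical mapping for cancer types;
# the real mapping used in the review would cover all cancer types analyzed.
SYNONYMS = {
    "carcinoma of the breast": "breast cancer",
    "mammary carcinoma": "breast cancer",
    "invasive ductal carcinoma": "breast cancer",
    "breast malignancy": "breast cancer",
    "neoplasm of the breast": "breast cancer",
}

def normalize_cancer_type(term: str) -> str:
    """Map a terminology variant to its canonical label; pass through unknowns."""
    key = term.lower().strip()
    return SYNONYMS.get(key, key)
```

Counting articles per canonical label after this lookup yields the distribution shown in the figure.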
An analysis of the article distribution shows a concentration of ML research in the oncological context predominantly on breast cancer, which dominates with 350 articles. This value reflects the abundance of available data, the high global incidence, the interest in improving early detection, and the need to predict the presence of cancer cells using ML techniques. The next most studied types are colorectal cancer, with 337 articles, and lung cancer, with 151 articles. Similarly, prostate cancer is identified with 83 articles, gastric cancer with 60, ovarian cancer with 49, bladder cancer with 36, pancreatic cancer with 34, head and neck cancer with 30, cervical cancer with 27, etc. Cancer types that were less frequently investigated using ML technologies are brain and CNS cancer (3), childhood cancer (2), bone cancer (2), kidney cancer (2), laryngeal cancer (2), testicular cancer (2), lymphatic cancer (1), nasopharyngeal cancer (1), and salivary gland tumor cancer (1).
The distribution of articles in
Figure 5 shows that ML applications in oncology are focused on cancer types with high incidence, data availability, early detection possibilities, major clinical impact, and substantial financial support for this research. This trend is understandable, although research equity would call for extending ML applications to less studied types of cancer.
Figure 5 does not include all 35 types of cancer mentioned in
Figure 1, but only the types of cancer for which the papers employed ML algorithms.
3.2. Dataset Assessment
The indicators proposed by the authors for evaluating reproducibility and generalizability are conceptually inspired by the principles of Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) and Findable, Accessible, Interoperable, Reusable (FAIR) [
45,
46,
47,
48]. These principles address the issue of transparency, accessibility, and data replicability in the context of AI-based medical research [
46,
49]. The categories of the generalizability indicator (low, medium, and high) were defined according to cohort size, following methodological conventions in recent medical ML studies. Datasets with <100 samples were considered exploratory (low), those with 100–1000 samples moderate-scale (medium), and those with >1000 samples large-scale (high) cohorts. Restricted-access datasets were categorized based on sample size, with an additional note regarding data availability. The thresholds defined by the authors for cohort size are related to classifications used in clinical and medical studies that employ ML methods [
50,
51,
52,
53]. The reproducibility indicators proposed in the present study represent an element of originality that the authors introduced to align conceptually with existing standards in the field of medical ML. These indicators provide a systematic way to quantify research reproducibility.
The reproducibility rate was used to measure the extent to which the research can be replicated in another study using the dataset and methodology employed in the article. To establish the degree of reproducibility, three labels were employed: high, medium, and low. For the high label, the dataset must be publicly available from standard sources, accessible free of charge, and described within the article. For the medium label, the dataset is private, but its volume and structure must be large enough to be adapted to ML algorithms, and it must be described within the paper. For the low label, the dataset is either not specified, lacks described labels, or is reported with details insufficient for replication. This labeling is justified by the fact that a public dataset allows anyone to replicate the study, while a private one, even if well described, is much harder to replicate, although it partially provides a basis for replication. At the opposite end, a dataset without details cannot be reproduced at all.
The analyzed papers were classified according to the types of cancer treated. It should be mentioned that the dataset was analyzed at the level of the abstract and not by directly accessing the entire article. This choice rests on the importance of the dataset, which should be highlighted from the very first stage, the abstract. The reproducibility rate is computed as the ratio of the number of articles with high and medium labels to the total number of articles, according to Equation (1):

RR_j = (Σ_i H_ij + Σ_i M_ij) / (Σ_i H_ij + Σ_i M_ij + Σ_i L_ij)    (1)

where
RR_j — reproducibility rate of cancer type j;
H_ij — paper i labeled high from cancer type j;
M_ij — paper i labeled medium from cancer type j;
L_ij — paper i labeled low from cancer type j.
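In code form, Equation (1) reduces to a simple ratio over the per-label article counts of one cancer type; for example, one article labeled high out of three articles in total gives a rate of 33.33%.

```python
def reproducibility_rate(high: int, medium: int, low: int) -> float:
    """Equation (1): share of articles labeled high or medium, in percent."""
    total = high + medium + low
    return 100.0 * (high + medium) / total if total else 0.0
```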
Table 2 summarizes the number of articles that meet each criterion in relation to the type of cancer they treat. Out of the total analyzed papers, only 276 have details about the dataset and were labeled as high and medium. The results from
Table 2 can be ranked into three groups. The first group comprises the cancers with a reproducibility rate over 50%, namely bone cancer (100%), bladder cancer (50%), and kidney cancer (50%). The second group includes 19 cancer types that recorded a reproducibility rate between 5% and 34%, with higher values in the case of brain and CNS cancer (33.33%), CUP cancer (29.41%), endometrial cancer (25%), ovarian cancer (24.49%), pancreatic cancer (23.53%), gastric cancer (23.33%), etc. The third group consists of the cancer types that registered a null reproducibility rate, indicating that no article used a publicly available dataset; this applies to esophageal cancer, childhood cancer, laryngeal cancer, nasopharyngeal cancer, lymphatic cancer, and salivary gland tumor cancer.
For the generalizability study, labeling with high, medium, and low was used. For the high label, the cohort was set at over 1000 patients. The medium label was used for a cohort of 100 to 1000 patients, and the low label for a cohort of fewer than 100 patients (
Figure 6).
The larger and more diverse the cohort, the greater the chances that the model can be implemented in different contexts. Small cohorts are often prone to overfitting. The generalizability study was conducted on the three patient limit intervals, with classification for each type of cancer. The results are summarized in
Table 3. According to the methodology, from the total number of articles reported for each type of cancer, those that mentioned the number of patients for whom the experiment was conducted in the abstract were retained for analysis.
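The cohort-size thresholds above can be expressed as a small labeling helper; the function name is illustrative, while the intervals follow the methodology described in the text:

```python
def generalizability_label(n_patients):
    """Cohort-size label used in the generalizability analysis:
    >1000 patients -> high, 100-1000 -> medium, <100 -> low."""
    if n_patients > 1000:
        return "high"
    if n_patients >= 100:
        return "medium"
    return "low"

print(generalizability_label(250))  # medium
```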
Analyzing
Table 3, an unequal distribution of generalizability among cancer types is observed. Most studies in the high category address colorectal cancer (33), breast cancer (14), lung cancer (6), bladder cancer (6), and prostate cancer (5), whereas the hierarchy of the medium category is slightly different: breast cancer (191), colorectal cancer (185), lung cancer (107), prostate cancer (52), and gastric cancer (42). Large cohorts receive significant emphasis because they inspire confidence in the applicability of the resulting models. Conversely, there are no papers in the high category and very few in the medium category for CUP cancer, leukemia and hematologic cancer, multiple cancer types, esophageal cancer, lymphatic cancer, laryngeal cancer, nasopharyngeal cancer, kidney cancer, bone cancer, and brain and CNS cancer. Cancers such as endometrial, liver, thyroid, skin, and childhood cancer have a small number of studies; however, these are labeled high due to the use of national databases such as The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), and Surveillance, Epidemiology, and End Results (SEER).
An important aspect highlighted by
Table 3 is that only studies explicitly mentioning the number of patients in the abstract were included in the analysis. Studies with large cohorts are likely to mention this, while those with small cohorts are prone to omit such information in the abstract. Generalizability in oncology is generally moderate, as analyzed in
Table 3. High generalizability values are concentrated among high-incidence cancers, while low-incidence cancers remain underrepresented. Increasing the clinical impact of ML models requires a focus on large cohorts and on dataset transparency and diversity, regardless of the type of cancer addressed.
In this analysis, the degree of generalizability was estimated based on cohort size, as this is the most frequently reported variable in clinical ML studies. However, the generalizability of an AI model also relates to qualitative factors such as:
Data diversity, understood as multi-center origin, ethnic distribution, demographic age groups, etc. This diversity is specific to each model's subject, which is why it could not be assessed uniformly in this general approach covering all cancer types;
The existence of external validation, which confirms the model’s performance on independent datasets. This aspect is difficult to evaluate, considering that most articles do not present clinical studies, but only innovative methods validated on small sets of real subjects;
The integration of multi-modal data, which combines imaging, clinical, and molecular information, can increase the model’s adaptability to real clinical situations. Again, most articles do not take such combined approaches, either due to a lack of data or a deliberate focus on an isolated issue.
These aspects were mentioned in the comparative discussions, where the information provided by the authors allowed for it, but they could not be uniformly quantified due to the heterogeneous way they were reported in the analyzed literature.
Figure 7 presents the top 10 most used datasets. At the top of this ranking are TCGA, GEO, and SEER. The TCGA dataset is reported in 133 articles and used for the identification of over 10 types of cancer [
54] that employ ML techniques, including colorectal cancer [
55], bladder cancer [
56], endometrial cancer [
57], etc. The second most used dataset, GEO, is reported in 94 articles addressing bladder cancer [
58], kidney cancer [
59], breast cancer [
60], etc. Some studies combine datasets for training ML models [
61]. The third most used dataset is the SEER dataset, encountered in 72 papers. It is used for identifying several types of cancer, such as gastric cancer [
62], breast cancer [
63], esophageal cancer [
64], etc. Other datasets are used with lower frequency such as The Cancer Imaging Archive (TCIA) in 11 papers, IMvigor210 in 10 papers, Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Cancer Cell Line Encyclopedia (CCLE) in 8 papers each, Genotype-Tissue Expression (GTEx) in 7 papers, UK Biobank in 6 papers, and Medical Information Mart for Intensive Care (MIMIC) and Clinical Proteomic Tumor Analysis Consortium (CPTAC) in 4 papers each.
This analysis highlights standardized public databases, such as TCGA, GEO, and SEER, in cancer research using ML techniques. The quality of the datasets is a fundamental characteristic in obtaining results that allow the use of ML models in assisting medical professionals [
65].
The synthesis of the analysis regarding reproducibility and generalizability is presented in
Table 4. This table summarizes the characteristics of the main databases used in the included studies. Here, the approximate size of the cohorts, the type of data (genomic, imaging, clinical), the degree of accessibility, and the validation practices reported in the analyzed literature are presented.
The three datasets provide extensive cohorts and public access and are the most used in the literature. However, the repeated use of the same datasets can generate an internal validation bias, so the performance of the models can be overestimated.
3.3. ML Models Employed in Cancer Types
To identify the ML models studied in the literature in relation to each type of cancer, the abstracts of the articles were examined, and based on this analysis,
Table 5 was designed. Thus, breast and colorectal cancers are studied using 19 ML models each, followed by lung cancer with 17 models, prostate cancer with 16, cervical and ovarian cancers with 15 each, bladder and head and neck cancers with 14 each, gastric and thyroid cancers with 13 each, pancreatic cancer with 12, CUP cancer with 11, and endometrial, esophageal, liver, and skin cancers with 9 each. Subsequently, eight ML models are associated with leukemia, hematologic cancer, and pan-cancer; seven ML models are used in papers that focus on cell, multiple-type, and nasopharyngeal cancers; and six ML models are used for brain and CNS, laryngeal, and lymphatic cancers. Finally, bone cancer is studied with five ML models and kidney cancer with four, whereas childhood and salivary gland tumor cancers are studied with two models each.
Table 5 provides a methodological inventory of the specific ML algorithms used across cancer types, in order to express technological trends and model distribution.
Table 6 presents the most frequently used ML models for each type of cancer. The most used model is RF for breast, colorectal, lung, prostate, gastric, ovarian, bladder, pancreatic, head and neck, cervical, liver, thyroid, pan-cancer, CUP, endometrial, esophageal, multiple-type, skin, leukemia and hematologic, cell, childhood, kidney, and laryngeal cancers. Furthermore, LogisticReg is used with the same frequency as RF for bone cancer, alongside DT, GB, RF, SVM, and XGBoost for lymphatic cancer, and alongside Clustering, GB, NB, NN, RF, and XGBoost for nasopharyngeal cancer. In addition to RF, DT is employed for lymphatic and salivary gland tumor cancers. Moreover, for brain and CNS cancers, studies most frequently use NN. Recent progress in glioma research integrates spatiotemporal heterogeneity with multimodal fusion strategies, guided by ML techniques that identify aligned and collective multicellular bundles within high-grade gliomas [
66]. At the same time, the paper by Bahar et al. [
67] analyzes the ML methods used in glioma grading. These radiomic models outperform clinical data in predicting the progression of patients with high-grade glioma [
68]. Furthermore, Redlich et al. [
69] provide a synthesis of the use of AI in histopathological imaging of gliomas. This highlights the need for model validation on data from multiple centers to improve generalizability. Finally, integrative models like Deep Orthogonal Fusion show how combining pathology imaging data, genomic data, and clinical variables improves predictive performance compared to unimodal models [
70].
Regardless of the cancer types, the top five most frequently used ML models are RF, NN, LogisticReg, GB, and SVM.
RF is suitable for medical data analysis due to features that limit overfitting, tolerance of missing or incomplete data (after suitable preprocessing), built-in estimation of feature importance, and solid performance on small datasets. RF is frequently used because it handles mixed data, such as clinical, genetic, imaging, and histopathological variables, and because it does not assume strict statistical distributions, unlike models such as classic LR. Further advantages are its accessibility compared to DL models, which require large datasets and advanced computational resources, and the balanced performance reporting it enables in the medical field. In other words, RF is favored in medical data analysis for its ability to work with complex, incomplete, and heterogeneous data, provided the data is preprocessed before training.
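As a hedged sketch of this workflow, the snippet below trains a scikit-learn RandomForest on synthetic "clinical" data with missing entries; a median-imputation step is added because scikit-learn's forests traditionally require complete inputs, and all names, sizes, and parameters are illustrative rather than taken from any cited study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic "clinical" table: 300 patients, 10 mixed features, 20% positives
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing entries

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Median imputation + RF; RF also exposes per-feature importances
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.2f}")
importances = model.named_steps["randomforestclassifier"].feature_importances_
print("most informative feature index:", int(np.argmax(importances)))
```

The `feature_importances_` attribute is what makes RF attractive for identifying influential clinical variables, as discussed above.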
3.4. Performance Metrics for Cancer Types
To perform a comparative analysis of the performance of ML models in oncology and beyond, standardized evaluation metrics are needed. In the context of cancer detection, a patient can be classified as positive or negative. Performance metrics are calculated based on the following parameters:
True positive (TP) means a patient with cancer is correctly classified as “cancer”;
A false positive (FP) is a patient without cancer incorrectly classified as “cancer” (false alarm);
True negative (TN) represents a patient without cancer correctly classified as “no cancer”;
A false negative (FN) corresponds to a patient with cancer incorrectly classified as “no cancer” (missed diagnosis).
The most important performance indicators are calculated using these parameters. The category of performance indicators includes accuracy, precision, recall, and F1-score [
71,
72]. In addition, when model output scores are available, the AUC is computed to measure the model’s discrimination ability across different decision thresholds.
Accuracy (overall correctness) measures the proportion of all patients correctly classified (both cancer and non-cancer). It is the measure that indicates the model gave the correct diagnosis, and it is computed with Equation (2).
Precision (positive predictive value) represents the probability that a patient marked with cancer has cancer in reality (important to avoid unnecessary biopsies or treatment) and it is calculated with Equation (3).
Recall (sensitivity or true positive rate) represents the indicator of real cancer patients detected by the model (critical for early detection or screening), and it is computed with Equation (4).
F1-score is a single-number summary useful when classes are imbalanced and both FP and FN matter, and it is calculated with Equation (5).
AUC is the probability that a randomly chosen patient with cancer receives a higher model score (risk score or probability) than a randomly chosen patient without cancer. It summarizes discrimination ability across all possible decision thresholds.
Consider the hypothetical case of breast cancer screening in a sample of 1010 patients. Out of the total number of patients, 80 patients had a confirmed cancer diagnosis, meaning they were correctly identified by the model as being affected (TP = 80). The model misclassified 20 cancer-free patients as having cancer (FP = 20). From the dataset, 900 healthy patients were correctly identified by the model as unaffected (TN = 900). Out of the total patients, 10 cancer patients were not detected as positive by the model (FN = 10).
The performance indicators are as follows: accuracy = (80 + 900)/1010 = 97.03%; precision = 80/(80 + 20) = 80%; recall = 80/(80 + 10) = 88.89%; F1-score = 2 × (0.8 × 0.8889)/(0.8 + 0.8889) = 84.21%.
Based on the performance metrics, it can be concluded that the model has a high overall accuracy of 97.03%. This value indicates good classification capability, considering the model is intended to assist, not replace, a human expert.
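The metrics of Equations (2)-(5) can be checked against the hypothetical screening example directly from the four confusion-matrix counts; the helper function is illustrative:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical breast cancer screening example: TP=80, FP=20, TN=900, FN=10
acc, prec, rec, f1 = metrics(80, 20, 900, 10)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} F1={f1:.2%}")
# accuracy=97.03% precision=80.00% recall=88.89% F1=84.21%
```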
An ideal classification model must strike a balance between accuracy, sensitivity, specificity, and interpretability. In the authors’ opinion, the values should mandatorily be above 90% for accuracy, and above 85% for precision, recall, and F1-score. It is important to acknowledge that a perfect model cannot be achieved in any study. However, this recommendation should represent a reference standard for real clinical applicability.
3.4.1. Accuracy Metric for Cancer Types
Accuracy measures the total proportion of correct classifications, but it can be misleading when the datasets are imbalanced. In cancer datasets this situation is common, as the number of healthy patients is much larger than that of sick patients.
Table 7 presents the maximum and minimum values of accuracy by cancer types. A 100% performance was achieved for breast, colorectal, and lung cancers. These models, used to obtain the values in
Table 7 are highly accurate; however, in the absence of external validation, there is a risk that they reflect overfitting on the training data. On the opposite end, the models with minimal performance reported values of 45.8% for gastric cancer, 48% for colorectal cancer, and 58% for breast cancer. These values indicate either an intrinsic difficulty of the dataset or a generalization problem of the models. As previously stated in the paper, accuracy must be correlated with the other metrics to obtain a correct evaluation: a model may have high accuracy simply because the data are imbalanced, or it may indeed offer good performance in detecting positive cases. To discern between these scenarios, the model must be evaluated through the lens of all indicators.
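A minimal illustration of this pitfall, with hypothetical counts: a degenerate classifier that never predicts cancer still attains high accuracy on an imbalanced cohort while its recall is zero.

```python
# 1000 patients, only 50 with cancer; a classifier that always predicts
# "no cancer" detects nobody, yet its accuracy looks strong.
tp, fp, tn, fn = 0, 0, 950, 50
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 0.95
recall = tp / (tp + fn)                     # 0.0
print(accuracy, recall)  # 0.95 0.0
```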
RF models are used in various applications, such as predicting pulmonary metastases in thyroid cancer [
112], classifying tumor tissue in high-grade glioma [
75], predicting lateral lymph node metastases in papillary thyroid cancer [
112], predicting HER2 status in bladder cancer [
74], predicting survival in resectable pancreatic cancer [
91], classifying colorectal cancer [
82], categorizing liver tumors using optical biopsy [
97], predicting pathological risk factors in cervical cancer [
81], etc. The accuracy of these works varies between 90% and 99%. The best-performing model is reported by Liu et al. [
111], where the accuracy was 99%, AUC 99%, F1-score 72%, and recall 88%. In the study by Lai et al. [
112], RF achieved an AUC of 80%, accuracy of 74%, F1-score of 81%, and sensitivity of 89%. Overall, all the indicators reported in these papers demonstrate the possibility of their use in clinical applications, but they cannot replace human expertise because the values are not perfect [
113].
For heterogeneous data, XGBoost performs a classification that indicates the possibility of integration into medical applications through performance metrics. For example, for liver cancer prediction, in the paper by Vekariya et al. [
96], XGBoost achieved an AUC of 85.2% and an accuracy of 87.5%.
Alongside RF, XGBoost, and SVM models, the LR model is frequently studied for integration into such applications due to its simplicity in terms of interpretability. The prediction of bone metastases in esophageal cancer is studied by Wan and Zhou [
87], in whose research LR had an AUC of 83.1% and an accuracy of 72.1%. Also, for the progression of post-nephrectomy renal dysfunction, LR achieved an AUC of 81.5% and an accuracy of 78.7% [
93]. The response to paclitaxel in advanced gastric cancer was investigated in the paper by Choi et al. [
89], in which LR had an AUC of 67.9%, accuracy of 82.3%, F1-score of 46.1%, sensitivity of 63.8% and predicted a longer survival trend.
3.4.2. Precision Metric for Cancer Types
Precision measures the percentage of cases classified as positive for a certain type of cancer that actually prove to be positive. The adapted calculation relationship is presented in Equation (3). This metric penalizes false positive diagnoses, which entail additional investigations, increased stress for the patient, higher costs, congestion in the medical system, and many other auxiliary inconveniences.
Table 8 depicts the maximum and minimum precision by cancer types. The precision is perfect for breast and colorectal cancer. These results are very useful for both doctors and patients in avoiding unjustified treatments, although the indicator must be carefully analyzed to exclude overfitting. The lowest reported precisions are 36% for breast cancer and 47.1% for lung cancer. Such values indicate a high false positive rate, with implications that could compromise the model's acceptance in practice. Depending on the purpose for which the model is used, screening or confirmatory diagnosis, the priority between precision and recall shifts.
Thyroid cancer with lung metastases is investigated by Liu et al. [
111]. The authors propose an RF model for predicting pulmonary metastases, using data from the SEER database. The results yielded an F1-score of 72%, a precision of 61%, and a recall of 88%. The early detection of melanoma was investigated using specific image-processing filters (Color Layout Filter) and a classifier with attribute selection [
130]. Thus, an F1-score of 91%, precision of 91%, and recall of 91% were obtained.
The research by Jeong et al. [
114] on bladder cancer integrates an electrochemical sensor combined with ML for discriminating normal cells from cancerous ones [
131]. In this case, the RF model achieved an F1-score of 93.8%, with an accuracy of 91.7% and a sensitivity of 92.9%. A colorectal cancer study proposes an SVM classifier based on circulating tumor cells (CTCs). The model achieved 100% accuracy, 100% specificity, and an implicit F1-score of 88.9% (calculated from 80% sensitivity and 100% precision) [
119].
For ovarian cancer (progression-free survival), Arezzo et al. [
126] demonstrate the possibility of predicting 12-month survival with 90% accuracy. This means that 9 out of 10 patients predicted to be progression-free actually had such an outcome. The RF model generated a precision and recall of 90%, an accuracy of 93.7%, and an AUC of 92%. The response to chemotherapy in colorectal cancer is predicted using several models: the Multilayer Perceptron (MLP) achieved an accuracy of 94.6% for a favorable prognosis, while the Gradient Boosting Decision Tree (GBDT) model had an accuracy of 86.3% for an unfavorable prognosis. These values identify patients who will respond to treatment [
120], making them an extremely useful tool in the healthcare system [
132].
The toxicity of cisplatin in head and neck cancer is being investigated using the GLM model. It achieved 75% precision, meaning that three out of four patients predicted to have severe toxicity will actually develop it [
123]. Zhu et al. [
116] employed the LightGBM model to study breast cancer. The authors achieved 100% precision, meaning all cases classified as malignant were indeed cancerous. This performance makes the model ideal as a screening tool due to its ability to minimize false positives [
133].
Undifferentiated early gastric cancer is being studied for non-curative resection using the XGBoost model, which achieved a precision of 92.6% [
88]. In the case of post-urostomy urinary tract infections in bladder cancer, the SVM model achieved a precision of only 58.3%, meaning that barely more than half of the patients predicted to have an infection will actually develop one. Although the AUC is high (83.5%), the low precision indicates limitations in practical application without further adjustments. The model is available online and includes a visualization of variable importance, confirming that precision can be improved by selecting suitable clinical features [
115].
3.4.3. Recall Metric for Cancer Types
Recall measures the proportion of positive cases correctly detected; practically, this indicator is associated with a correctly made diagnosis. The adapted calculation formula for cancers is presented in Equation (4). At the oncological level, the indicator is useful to avoid false negative situations, which delay treatment and therefore reduce survival chances.
Table 9 depicts the maximum and minimum recall by cancer types. An almost complete sensitivity is reported for breast, gastric, and lung cancers of 99.49%, 99%, and 98.9%, respectively. These values indicate that almost all real cases are detected. Low sensitivity is associated with breast cancer (50%), but also with head and neck cancer (55%), when the ML model is not suitable for the context. Models with a high recall value are preferred in the screening stages when an early evaluation is conducted. They need to be adjusted later to reduce the false positive rate by increasing precision.
RF is one of the most frequently used models in the analyzed studies and reported the best performance for several types of cancer. For lung cancer (post-lobectomy complications), the RF model predicts cardiopulmonary complications with a recall of 73.8% and an AUC of 85.6% [
138]. Even in the case of CUP, the RF-based Cancer of Unknown Primary Location Resolver (CUPLR) model was able to identify the tissue of origin for 35 cancer subtypes with a recall of 90% and a precision of 90%. It was trained on genomic data (6756 tumors) and resolved 58% of CUP cases [
118]. The second most used ML model is XGBoost. It stands out for its performance in classifying disease severity and predicting treatment response. For lung cancer, the XGBoost model achieved a recall of 98.9%, a precision of 99%, and an accuracy of 98.9%. It was trained on clinical data from Ethiopia [
124].
Breast cancer uses the BreCML model based on XGBoost to identify new key genes. It achieved a recall of 99.49%, a precision of 99.15%, and an F1-score of 99.79% [
135]. Bladder cancer (RB1 mutation) also uses XGBoost, achieving an 80% recall, 84% accuracy, and an AUC of 84%. It was trained on radiomics features from computed tomography urography (CTU) [
134].
3.4.4. F1-Score Metric for Cancer Types
The F1-score is the harmonic mean of precision and recall, with the calculation relationship adapted to the oncological context according to Equation (5). This indicator is informative in scenarios with imbalanced classes, which are typical in oncology: the number of positive cases is much smaller than that of negative ones, i.e., the number of diagnosed cancer cases is much smaller than the number of negative results, so a discrepancy arises between the two classes. In
Table 10, which presents the maximum and minimum F1-score by cancer types, the observed results show very good values for breast cancer, 99.79%, colorectal cancer, 98.2%, gastric cancer, 97.5%, and thyroid cancer, 96.7%. From these values, it can be deduced that for these types of cancer, the models have the ability to simultaneously maintain high values for both precision and recall.
Extremely low values were reported for thyroid cancer (21.6%), head and neck cancer (30%), and breast cancer (37%). Such severe imbalances are associated with models that either raise many false alarms while detecting positives or miss real cases altogether. Extreme values reported for a single indicator do not guarantee good overall performance [
155]. Thus, reporting multiple metrics becomes mandatory in this context for the comprehensive evaluation of ML models in oncology.
Nayan et al. [
151] evaluated disease progression in patients on active surveillance (AS) using ML models. The study achieved an F1-score of 58.6% using the SVM model. Compared to this value, the traditional model had an F1-score of 18.2%. Another study [
150] used RF and XGBoost for prostate cancer prediction based on targeted or combined biopsy. In this case, the F1-score achieved a value between 94% and 97%. Mahmud et al. [
92] used computed tomography (CT) images and clinical data to classify four types of kidney cancer (ccRCC, chRCC, pRCC, oncocytoma). The model combines DL techniques with clinical data. After training, it achieved an F1-score of 84.92% for all types. For renal cell carcinoma (RCC), the F1-score increased to 90.50%.
Two other studies that addressed breast cancer were selected from the literature for their F1-score values. The first research is by Ke et al. [
135] in which BreCML, an XGBoost-based model, was developed. It achieved an F1-score of 99.79% in cell subtype classification. The second research is by Nguyen et al. [
140], which predicted five-year survival. The best model was the Artificial Neural Network (ANN), which achieved an F1-score of 37%, despite an AUC of 95%. The class distribution imbalances are the cause of these anomalies. Nair et al. [
147] applied Short Term Fourier Transform (STFT), LASSO, and Elephant Herding Optimisation (EHO) for feature extraction from gene expression data. The classification used several ML algorithms. The best result was achieved by Flower Pollination Optimization–Gaussian Mixture Model (FPO-GMM), with an F1-score of 97.5%. Schöneck et al. [
148] studied the prediction of Kirsten Rat Sarcoma viral oncogene homolog (KRAS) mutation in non-small cell lung cancer (NSCLC) using radiomics and ML models, but the performance was modest (maximum F1-score of 67% internally, with 41% externally). These values suggest difficulties in model transferability.
Jeong et al. [
114] proposed a non-invasive method based on electrochemical impedance and ML. RF was the model against which the best performance metrics were reported. The results reported an F1-score of 93.8% in discriminating normal cells from cancerous ones. For papillary thyroid cancer (PTC), ML models (SVM, XGBoost, RF) outperformed American Thyroid Association (ATA) classification, with F1-scores ranging from 33.1% to 42.9%. The best model was RF [
154].
Yan et al. [
144] optimized the gastric cancer screening score using GBM, Distributed Random Forest (DRF), and DL. In binary classification, the models achieved an AUC higher than 99%, but in triple classification, the F1-score for high risk was only 53.34% for GBM. The value indicates difficulties in discriminating between intermediate and high risk. The study by Hsu et al. [
149] predicted muscle mass loss in ovarian cancer patients. The best results were reported for RF, which achieved an F1-score of 72.6% (internal) and 74.1% (external validation).
3.4.5. AUC Metric for Cancer Types
Performance metrics in the oncological field refer to the ability of ML algorithms to identify clinical patterns. The values centralized in
Table 11 show the maximum and minimum AUC by cancer types. These values reflect the quality of the dataset used, the model architecture, the degree of applicability of the model for the type of cancer, the evaluation protocol, and the usability in practice.
AUC measures the model’s ability to discriminate between sick and healthy patients [
186]. In oncology, the value of this parameter indicates a higher probability that a positive patient will be correctly classified as having that specific type of cancer. Data from
Table 11 show that the maximum AUC values obtained by the ML models are nearly perfect, reaching 1 for colorectal cancer, gastric cancer, ovarian cancer, and pancreatic cancer. Such values indicate a clear separation between classes. The lowest values were obtained for lung cancer at 0.45, head and neck cancer at 0.51, and ovarian cancer at 0.56. These values indicate that the respective studies had insufficient data, imbalanced data, low-quality data, or poor generalization of the ML model itself. In relation to the methodology of this study, the differences between the maximum and minimum values highlight the importance of selection and diversity in datasets. High AUC values are associated with cancers covered by standardized databases such as TCGA and SEER, while low AUC values are associated with reduced cohorts.
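The pairwise interpretation of AUC described earlier can be sketched directly, without an ML library; the function name and scores below are illustrative:

```python
from itertools import product

def pairwise_auc(scores_pos, scores_neg):
    """AUC = P(score of a random cancer patient > score of a random
    healthy patient); ties are counted as half a win."""
    pairs = list(product(scores_pos, scores_neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

pos = [0.9, 0.8, 0.6]   # model scores for patients with cancer
neg = [0.7, 0.4, 0.2]   # model scores for healthy patients
print(pairwise_auc(pos, neg))  # 8 of 9 pairs correctly ranked -> 8/9
```

This pairwise count is equivalent to the area under the ROC curve across all decision thresholds, which is why AUC is threshold-independent.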
In the specialized literature, a multitude of studies have been presented that utilized ML models for the early diagnosis of cancer in various forms. Thyroid cancer achieved the best results with the help of the RF model. In the paper by Liu et al. [
111], the AUC indicator had a level of 99%, with a precision of 61%, a recall of 88%, and an accuracy of 99%. On the opposite end, the lowest result was obtained for the GBDT in diagnosing central lymph node metastases. In this case, the AUC had a performance of 73.1%, surpassing the performance of ultrasound, which had an AUC of 62.3% [
185]. Practically, the weakest result obtained for the AUC performance indicator surpassed the performance of classical diagnostic methods.
For breast cancer, ML models as well as hybrid models were explored in computerized diagnosis. Shaikh and Ali [
158] use a hybrid model that includes the SVM model, achieving an AUC of 99.41% on a local dataset and 99.21% on the BCDR-F03 dataset, with an overall accuracy of 99.89%. For radiation-associated breast cancer, the RF model achieves an AUC of 62% in predicting contralateral cancer, according to the study [
157]. Ovarian cancer achieved the best AUC of 100%, while on the opposite end, an AUC of 56% was obtained through natural language processing (NLP) on CT reports for operators. The AUC value of 56% improved the prediction of post-operative readmission, reaching 70% when integrated with NLP [
179]. In the research by Hamidi et al. [
178], the Boruta-based model, combined with five other ML algorithms, identified 10 miRNAs. The models achieved an AUC of 100% and over 94% in external validation sets.
Bone metastases are studied using the XGBoost model. Ji et al. [
171] achieved an AUC of 82.69% in the internal cohort and 91.23% in the external cohort. Pain associated with bone metastases is investigated using a gene-based nomogram that achieved an AUC of 99% in the study by Li et al. [
182]. Pancreatic cancer is studied through the RF model, which eliminates features in order to identify a panel of biomarkers. It achieved an AUC of 100% for classifying post-operative complications [
180]. Iwatate et al. [
181] use radiogenomics in training the model that predicts the expression of the Integrin subunit alpha V (ITGAV) gene with an AUC of 69.7%.
Colorectal cancer is studied through NN and RF models. It achieved performance in predicting metastases with an AUC of 100%, a sensitivity of 100%, and an accuracy of 99% on the balanced set by Talebi et al. [
161]. For peripheral nerve invasion, CT radiomics-based models achieved an AUC between 61.1% and 66.3% in the study by Liu et al. [
162]. Lymph node metastases in endometrial cancer are studied with a model based on the apparent diffusion coefficient (ADC) and radiomic features. It achieved an AUC of 85%, demonstrating an improvement over classical criteria [
163]. An H2O AutoML-based GBM investigated prostate cancer; it achieved an AUC of 72% and a specificity of 84% in assisting with case selection for biopsy [
183]. For bone metastases, a gene-based nomogram from the Stimulator of INterferon Genes (STING) pathway achieved an AUC of 99% [
182].
Bladder cancer is studied using atomic force microscopy combined with an ML model, achieving an AUC of 97% and an accuracy of 91% in a controlled system. In cases with multiple image channels, the model achieved an AUC of 99% and an accuracy of 93% in the study by Petrov et al. [73]. In another study by Petrov and Sokolov [159], RF is employed to separate precancerous cervical cells from cancerous ones; the model achieved an AUC of 93% and a sensitivity of 92%. Peritumoral radiomic models are studied for proximal esophageal cancer: the dual-region model achieves an AUC of 96.63% in the training phase and 94.71% in the validation phase, and a radiomic-clinical nomogram outperforms the clinical model, reporting a net reclassification improvement of 34.4% [165].
The XGBoost model has been demonstrated to be effective in predicting cervical lymph node metastases in patients with early-stage supraglottic laryngeal cancer: the model proposed by Wang et al. [172] achieved an AUC of 87% in internal validation and 80% in external validation. Similar ML models, including XGBoost and LightGBM, have been used in early-stage gastric cancer as well, achieving an AUC between 73.6% and 83% in the study by Yang et al. [168]. A panel of six genes associated with cancer-associated fibroblasts (CAFs) achieved an AUC of 75.4% and 100% in the validation sets [187]. For lung cancer, Wang et al. [176] employed an NN model, achieving an AUC of 99.4% and an accuracy of 99.3% using inflammatory markers. In the case of esophageal squamous cell carcinoma, Cui et al. [166] employed combined models, achieving an AUC of 85.6%. As for laryngeal cancer, Nakajo et al. [173] implemented the NB model, achieving an AUC of 84.2% in predicting progression, while Random Survival Forest (RSF), a modified version of RF, achieves a C-index of 80.8%.
The most intensively studied cancers are breast, colorectal, gastric, and lung. These are associated with high values of performance indicators due to large and standardized datasets. Cancers with low incidence, such as thyroid, head and neck, and salivary gland, show greater variability and lower reported indicator values because they require additional data collection. Studies that use public and diverse databases report balanced values for performance indicators, whereas private or small-cohort datasets have either generated apparently good performance due to overfitting or reported low values. The large differences between the maximum and minimum values for the same type of cancer indicate a mismatch of the model used, a lack of uniform reporting and validation protocols, or difficulties concerning the datasets used, making direct comparison of studies difficult. Thus, data from
Table 7, Table 8, Table 9, Table 10 and Table 11 demonstrate that certain ML models have high performance.
Within the analysis of the performance indicators reported by various studies, it was found that some models had an accuracy or AUC of 100%. Such perfect values suggest that the results may stem from overfitting, small datasets, the inclusion of features not relevant to the task the model was trained on, the absence of external validation, or an improper separation of the training and validation groups (data leakage). In many cases, validation was performed only on subsets of the same dataset, an approach that leads to an overestimation of the model’s actual performance. Also, the lack of cohort diversity, the absence of multicenter testing, information leakage during feature processing and selection, and the lack of evaluation on datasets never seen by the model can, in some cases, lead to seemingly perfect results that, when applied in practice, significantly reduce the model’s performance. The performances reported in the specialized literature should therefore be viewed as indicators of the potential of ML models, not as irrefutable evidence of immediate clinical applicability. Model validation should also include an evaluation of reproducibility and generalizability, in line with the authors’ proposals in this article.
The variability of performance indicators reported between studies is explained by the following methodological factors:
Class imbalance in oncology arises because the proportion of positive patients is much smaller than that of negative patients. This leads to high accuracy values even when recall or the F1-score is low.
Small sample sizes in small cohorts increase the risk of overfitting and of overestimating internal performance. This aspect is not specific to oncology; it is encountered in all fields where datasets are small.
The absence of external validation: many studies rely only on internal validation (cross-validation), without testing on independent cohorts.
Differences in data preprocessing, generated by normalization methods, feature selection, the handling of missing values, and the quality of the final data obtained, modify performance indicators.
The heterogeneity of clinical objectives: binary classification, multiclass classification, and survival prediction involve different challenges.
Therefore, the direct comparison of maximum values between cancer types must be done with caution, as they reflect distinct experimental contexts. However, they provide an overview of the performance progress of ML models.
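The first factor, class imbalance, can be illustrated with a minimal sketch on synthetic data (scikit-learn is assumed available; the 5% prevalence is an illustrative assumption, not taken from any cited study). A trivial majority-class baseline reaches high accuracy while detecting no positive case:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical oncology-style cohort: roughly 5% positive (cancer) prevalence.
n = 1000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 4))  # features are irrelevant for this baseline

# A baseline that always predicts the majority class ("no cancer").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)                  # high despite zero clinical value
rec = recall_score(y, pred)                    # 0.0: no positive patient detected
f1 = f1_score(y, pred, zero_division=0)        # 0.0

print(f"accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

This is exactly the situation in which accuracy overstates performance, and why recall and F1-score must be reported alongside it.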
4. Discussion
This study analyzes a large volume of research that integrates ML models in the field of oncology. Thus, articles from recent oncological research, covering the period between 1 January 2020 and 31 December 2025, were analyzed. The analysis included the types of cancer investigated, the datasets used to train the models, the ML models studied in these articles, and the performance metrics obtained in relation to the type of cancer explored. This review paper aims to inventory the studies based on a critical evaluation of the reproducibility, generalizability, transferability, and clinical trends introduced by these ML models.
Unlike other existing synthesis articles, the authors’ proposal makes a distinct methodological contribution. First, the selection process was conducted simultaneously across three different scientific databases, which ensured broad coverage and allowed duplicates to be eliminated. The second direct contribution focused on extracting open-access articles that contain numerical values for performance metrics in the abstract; this approach is, to our knowledge, unique in the specialized literature and has allowed for a uniform quantitative analysis of all the extracted articles. A third original contribution of the study is marked by the two indicators, reproducibility rate and generalizability rate, proposed by the authors as a new perspective on the quality of results in the field of oncological ML. Furthermore, a comparative analysis of the reported ML model performances by cancer type and by algorithm type provides a unique perspective in the specialized literature, detailing technical aspects whose reproducibility is currently lacking in previous reviews.
The selection process aligns with international standards for systematic reviews. Additionally, the article proposes a novel element in the analysis: an additional criterion that ensures scientific rigor from the initial screening stage. Although excluding articles that do not contain numerical values in the abstract represents a potential source of bias, the proposed strategy meets contemporary requirements regarding information accessibility and the ability to quickly identify important elements. The selection criteria were designed to prioritize technical reproducibility over volume: by requiring numerical data in the abstract, a uniform quantitative synthesis was ensured, which is often lacking in narrative reviews. The Methodology and Limitations sections explicitly justify these choices as a deliberate methodological contribution aimed at reducing ambiguity in performance reporting. In this way, this research distinguishes itself from other reviews.
The work addresses a real need in the field: identifying ML technologies that can be implemented in oncological practice. From the analysis of the studied materials, the following pitfalls were extracted: imbalanced data, insufficiently documented training sets, lack of external validation, the inappropriate choice of an ML model, and the analysis of a limited set of performance metrics, all of which can create a false illusion of high model performance. The paper’s novelty lies in the proposed methodology, which is, to our knowledge, unique in the literature. Articles were extracted from the WOS, Scopus, and PubMed databases, applying open-access availability as a strict filter so that they could be accessed. From an initial total of 13,292 articles, the set was reduced to 1364 articles common to all three databases; indexing in all three databases supports the quality of the analyzed material. Furthermore, within the methodology, a technical criterion was applied: a new filter requiring numerical values in the abstract. In this way, the articles for which a comparative technical analysis can be performed were extracted.
Their classification was based on four analytical axes corresponding to the type of cancer, dataset characteristics, the type of ML algorithm, and the performance metrics obtained. Regarding the characteristics of the datasets, the authors introduced, in addition to the classic data analysis, two new indicators, reproducibility and generalizability, in order to allow for a relevant comparison between study results. The reproducibility analysis assigns high, medium, and low labels for each cancer type. For example, bone cancer, bladder cancer, and kidney cancer have a reproducibility of 100%, 50%, and 50%, respectively. Conversely, esophageal cancer, childhood cancer, laryngeal cancer, nasopharyngeal cancer, lymphatic cancer, and salivary gland tumors had null reproducibility. The generalizability analysis refers to the cohort size used in the studies: the high category includes cancer types such as colorectal, breast, lung, bladder, and prostate; the medium category comprises breast, colorectal, lung, prostate, and gastric cancer; and the low category contains cancer types such as CUP, leukemia and hematologic, multiple types, esophageal, lymphatic, laryngeal, nasopharyngeal, kidney, bone, and brain and CNS. Larger cohorts of over 1000 patients provide greater performance stability.
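The cohort-size-based generalizability labelling described above can be sketched as a simple classification rule. The text only indicates that cohorts of over 1000 patients fall in the high category, so the 300-patient cut-off between medium and low below is an illustrative assumption, not the authors’ exact threshold:

```python
def generalizability_label(cohort_size: int) -> str:
    """Assign a generalizability category from cohort size.

    Only the >1000 = "high" boundary is suggested by the text;
    the 300-patient cut-off is an illustrative assumption.
    """
    if cohort_size > 1000:
        return "high"
    if cohort_size >= 300:
        return "medium"
    return "low"

# Hypothetical study cohorts (not values from the reviewed articles).
studies = {"colorectal": 2400, "gastric": 450, "salivary gland": 80}
labels = {cancer: generalizability_label(n) for cancer, n in studies.items()}
print(labels)
# {'colorectal': 'high', 'gastric': 'medium', 'salivary gland': 'low'}
```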
The performance metrics analyzed, for which extreme values were reported to highlight variability, were:
Accuracy ranges from 45.8% (gastric cancer) to 100% (breast, colorectal, and lung cancers);
Precision ranges from 36% (breast cancer) to 100% (breast and colorectal cancers);
Recall ranges between 50% (breast cancer) and 99.49% (breast cancer);
F1-score ranges between 21.6% (thyroid cancer) and 99.79% (breast cancer);
AUC ranges between 45% (lung cancer) and 100% (colorectal, gastric, ovarian, and pancreatic cancers).
The most frequently analyzed ML models were identified in the inventory of ML models. RF is the most applied ML model in 82.14% of the cancer types, and it is employed with the same weight as other ML models in an additional 14.28% of the cancer types. This can be justified by the model’s tolerance to incomplete training datasets. Regardless of the cancer type, the top five most frequently used ML models are RF, NN, LogisticReg, GB, and SVM. Through this approach, the paper investigates studies that extract the direct relationship between data quality, algorithm type, and the resulting outcome, making this study a critical tool for selecting ML technologies with real clinical potential.
The RF algorithm is the most analyzed in these studies due to characteristics that allow for easy implementation. The model does not require time-consuming hyperparameter configuration, and it performs well when the data is incomplete or heterogeneous. Furthermore, it provides a simple analysis of variable importance, which is extremely important in medical research. These claims are also supported by other studies, where comparative experiments on hundreds of datasets show that RF performs well even with minimal hyperparameter tuning, especially when the data is tabular and the variables are numerous [188,189,190]. It is often preferred by clinical researchers due to the transparency of its decisions; the results are also stable, meaning that successive training runs yield similar results. This dominance can introduce a certain bias, as simpler and more accessible models are favored over complex ones, such as deep neural networks or generative models, which require increased computational resources as well as extensive datasets. Studies show that RF allows for interpretability through variable-importance analyses and local explainability, which is particularly important for clinical acceptance; at the same time, in contexts with heterogeneous or incomplete data, RF has provided exceptional results [191]. The popularity of the RF model reflects researchers’ accessibility and familiarity with this model, not necessarily its absolute performance superiority. However, the literature also points out the risk that simple models may be unfairly favored over others that are superior in performance but more difficult to implement [192].
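The two properties discussed above, reasonable performance with default hyperparameters and a built-in measure of variable importance, can be sketched on synthetic tabular data (scikit-learn assumed available; this is an illustration, not a reproduction of any cited study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data mimicking a biomarker panel: many variables,
# only a few informative, mild class imbalance.
X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Default hyperparameters: no tuning, as discussed in the text.
rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

# Built-in variable importance, often used in medical research
# to inspect which predictors drive the model.
top = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:5]
print(f"AUC={auc:.3f}, top feature indices={[i for i, _ in top]}")
```

The importances sum to 1 and rank the input variables, which is the transparency property clinicians value; the untuned AUC illustrates why RF is a common first choice on tabular data.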
The most studied cancer type in the literature in which ML models are employed is breast cancer. From the category of cancers with a high mortality rate, the one that most affects women is breast cancer [6]. Studies show that RF is used by numerous researchers for risk prediction and early diagnosis; this ML model achieved up to 99.3% accuracy in breast cancer prediction based on multifactorial, genetic, biochemical, and demographic factors [5]. Alongside models like RF, the XGBoost model is also present, predicting tumor type [6]. Other approaches, such as integrating fluorescence spectroscopy with ML, achieve 98.78% accuracy in intraoperative diagnosis [27]. Interpretable models have also been developed for assessing risk in pre-survivors [8].
The synthesis of dataset-related aspects identified in the literature addresses class imbalance, which is common in oncology (low prevalence of many cancers). Under imbalance, metrics behave differently, leading to the following recommendations:
Accuracy may be misleading (a model that always predicts “no cancer” can have high accuracy if prevalence is low);
Precision depends strongly on disease prevalence; therefore, precision should be reported alongside prevalence, or prevalence-adjusted metrics should be computed when appropriate;
When multiple metrics are used for clinical purposes, the minimum set that should be reported comprises accuracy, recall, precision, F1-score, and AUC;
Threshold selection matters: the decision threshold should be chosen based on clinical priorities, for example by maximizing recall for screening or maximizing precision for diagnostic confirmation;
Confidence intervals and statistical tests should be reported, and external validation on independent cohorts is always required to check generalizability;
An overfitting warning arises when accuracy, precision, or AUC values are near 100% on the internal test set; this may indicate overfitting, especially when no external validation is present.
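The threshold-selection recommendation can be made concrete with a sketch (synthetic data, scikit-learn assumed): moving the decision threshold on predicted probabilities trades precision against recall, so a screening setting would lower the threshold and a diagnostic-confirmation setting would raise it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic cohort: ~10% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

metrics = {}
for thr in (0.2, 0.5, 0.8):  # screening / default / confirmation
    pred = (proba >= thr).astype(int)
    metrics[thr] = (precision_score(y_te, pred, zero_division=0),
                    recall_score(y_te, pred))
    print(f"threshold={thr}: precision={metrics[thr][0]:.2f}, "
          f"recall={metrics[thr][1]:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold drops; precision typically moves in the opposite direction, which is the clinical trade-off described above.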
One of the limitations identified in the literature concerns class imbalance, which is frequently encountered in oncological studies where the number of positive cases is much lower than that of negative ones. This leads to an artificial overestimation of accuracy and distorts the evaluation of ML models in clinical studies. For a fair assessment, it is recommended to use a balanced set of performance indicators, including both AUC and calibration methods, such as calibration curves or the Brier score [193]. In addition, independent external tests for validation are required in the case of clinical studies. These tools provide a realistic estimate of performance under various clinical conditions, which will reduce the risk of overfitting in the long term. Furthermore, the clinical utility of a model must be reported against context-specific performance thresholds; for example, a recall of at least 90% and a precision of over 85% are considered feasibility standards for cancer screening [194]. Therefore, model evaluation should not be based on specific accuracy values but should include an analysis of the balance between sensitivity, specificity, and external calibration. This ensures real applicability in medical practice.
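The calibration recommendation can be illustrated numerically: the Brier score is the mean squared error between predicted probabilities and observed outcomes, so an overconfident model is penalized more than a well-calibrated one even when both rank the cases similarly. The probability vectors below are hypothetical, chosen only to show the effect (scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical binary outcomes and two probability forecasts for them.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])

calibrated = np.array([0.1, 0.2, 0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.1, 0.2])
# Overconfident: hard 0/1-style probabilities, with one confident mistake.
overconfident = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.9, 1.0, 0.0, 0.0])

b_cal = brier_score_loss(y_true, calibrated)      # ~0.038
b_over = brier_score_loss(y_true, overconfident)  # ~0.081
b_perfect = brier_score_loss(y_true, y_true.astype(float))  # 0.0

print(f"calibrated={b_cal:.3f}, overconfident={b_over:.3f}, perfect={b_perfect:.3f}")
```

A single confident error (probability 0.9 for a negative case) outweighs many small, honest deviations, which is why calibration metrics complement AUC in clinical evaluation.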
The paper is addressed to researchers in AI applied to oncology. They will find in this study a mapping of the most used models and data sources. The work is also aimed at clinicians and oncology specialists interested in understanding how to use ML models in prevention, diagnosis, prognosis, and treatment personalization. This review is also aimed at decision-makers and medical solution developers who can identify the models with the greatest chance of integration into clinical systems. Finally, the paper targets the academic community, which benefits from a replicable methodological framework for systematic evaluations in other development directions.
The results of this study confirm that the literature trends are focused on cancers with high incidence and standardized databases. The article differs from other similar works by quantifying the performance and major differences between studies analyzing the same type of cancer. For example, for breast cancer, accuracy ranges between 58% and 100%, and AUC between 62% and 99.89%.
The limitations of the study stem from the methodological constraints used to narrow the research framework: the exclusive analysis of abstracts, the restriction to open-access articles, and the requirement of numerical values in the abstract. These three conditions, applied simultaneously, can lead to the exclusion of research whose contributions may be major for the field of oncology. However, the authors believe that a quality article should include the most valuable aspects of the research in its abstract, which is why this methodology was proposed.
In recent years, generative and self-supervised learning models have shaped a new direction in ML research applied to oncology. In addition to classical ML approaches, a range of advanced techniques is discussed to evaluate the state of the art in relation to potential future research directions [195]. Makhlouf et al. [196] show that GANs produce synthetic samples in medical imaging when the data is imbalanced or there is a small number of examples for a specific class. These models capture the diversity of the data distribution, which helps stabilize predictions and reduces overfitting in CNNs. Frid-Adar et al. [197] employ GANs to generate synthetic augmented images in a set of 182 liver lesions, adding new data that increases the sensitivity of the CNN compared to classical augmentation. The paper by Yang et al. [198] presents a model based on diffusion models for semantic enhancement, which is reflected in superior performance in medical imaging. Dai et al. [199] outline the advantage of text-guided generation when a cancer is rare; their approach produces synthetic samples that reflect subtle clinical variations. He and McMillan [200] state that DL offers better performance than traditional models when the dataset is large, whereas traditional models remain competitive in moderate data regimes, also offering a trade-off between interpretability and computational cost. Traditional models, such as RF, are the best classifiers in terms of the balance between performance, requirements, and computational resources [201]. This analysis reveals that generative or diffusion models require intensive processing, high-quality and balanced data, significant computing power, advanced knowledge, and dedicated development time that is extensive compared to traditional models.
This review also identifies a series of inherent limitations in the field of ML applied to oncology. Among these are the following:
Publication bias (the tendency to report high performance);
Lack of external validation in numerous studies;
Overestimation of performance in contexts with class imbalance;
Heterogeneity of training and validation protocols;
Exclusion of articles that do not mention numerical values in the abstract;
Exclusion from detailed analysis of articles that are not open access.
Additionally, the predominant use of public datasets (TCGA, GEO, SEER) can generate a bias due to reusing the same cohorts, limiting true clinical generalizability. The interpretation of performances must be done with caution, especially in the absence of multicenter and prospective validation.
The answers to the RQs stated in Section 1 are as follows:
RQ1: The types of cancer frequently investigated in the literature using an ML approach are: breast (350 articles), colorectal (337), lung (151), prostate (83), and gastric (60);
RQ2: The datasets used in most studies are TCGA (133 papers), GEO (94), and SEER (72). The most investigated ML model is RF;
RQ3: Performance levels are investigated using the maximum and minimum values for accuracy, precision, recall, F1-score, and AUC, simultaneously. Analyzing a single performance indicator is not conclusive regarding the quality of the model;
RQ4: Correlation with clinical potential refers to the fact that models with external validation and diverse datasets have the best chance of implementation, whereas models with extreme scores without validation are at risk of overfitting.
Incorporating ML models in oncology is opening up new horizons for prevention, early diagnosis, accurate prognosis, and the personalization of treatment plans. The variability in performance across studies, ranging from almost perfect results to very low values, highlights that success depends directly on data quality, cohort size, the suitability of the ML model, and its external validation. The major contribution of this study lies in the methodology applied in selecting and analyzing how ML models are used, identifying gaps that need to be addressed in future research.
Future research directions should include:
The development of models based on federated learning, which can train on multicentric data without transferring sensitive information, taking into account the medical context of these studies;
Multimodal integration (imaging, genomic, clinical) for predictions at a level superior to the current one;
The systematic implementation of Explainable AI (XAI) techniques, as a facilitator in clinical acceptance;
The use of prospective validation and randomized studies to confirm real-world applicability;
The standardization of performance reporting according to the TRIPOD-AI guidelines.
As in most recent review articles, managing the bias present in datasets is a key future research direction, since ML models amplify the imbalances generated by the uneven distribution of data. Standardizing the data collection and annotation process, including multicenter and multiethnic cohorts, and studying model interpretability in a clinical context also represent future research directions.