Review

Reviewing CAM-Based Deep Explainable Methods in Healthcare

1 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
2 Sichuan Province Engineering Technology Research Center of Support Software of Informatization Application, Chengdu 610225, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4124; https://doi.org/10.3390/app14104124
Submission received: 16 April 2024 / Revised: 10 May 2024 / Accepted: 10 May 2024 / Published: 13 May 2024

Abstract:
The use of artificial intelligence within the healthcare sector is consistently growing. However, the majority of deep learning-based AI systems are of a black box nature, causing these systems to suffer from a lack of transparency and credibility. Due to the widespread adoption of medical imaging for diagnostic purposes, the healthcare industry frequently relies on methods that provide visual explanations, enhancing interpretability. Existing research has summarized and explored the usage of visual explanation methods in the healthcare domain, providing introductions to the methods that have been employed. However, existing reviews tend to place Class Activation Mapping (CAM) under the broader umbrella of visual explanations without delving into its specific applications in the healthcare sector, so a comprehensive review dedicated to CAM methods is still missing. Therefore, this study primarily aims to analyze the specific applications of CAM-based deep explainable methods in the healthcare industry, following the PICO (Population, Intervention, Comparison, Outcome) framework. Specifically, we selected 45 articles for systematic review and comparative analysis from three databases—PubMed, ScienceDirect, and Web of Science—and then compared eight advanced CAM-based methods using five datasets to assist in method selection. Finally, we summarized current hotspots and future challenges in the application of CAM in the healthcare field.

1. Introduction

In the healthcare industry, artificial intelligence assists in medical diagnostics: models based on deep learning have demonstrated their ability to support a wide range of medical diagnoses [1]. Specifically, deep learning has been utilized in diagnosing and treating diverse medical conditions, such as breast cancer detection [2], the assessment of thyroid nodules [2], and the analysis of diabetic retinopathy symptoms and progression [3]. However, transparency is often lacking in the use of deep learning within the medical domain. The primary reason for this lack of transparency is the complexity of neural networks: the multitude of parameters, the numerous hidden layers, and the fact that these parameters are learned rather than explicitly designed make it challenging for individuals to understand how these systems produce results in tasks like image classification or localization. As a result, there is a lack of trust in the predictions and localization analyses they provide.
Among the explainable methods in the medical field, CAM (Class Activation Mapping) [4] stands out as a representative technique for visual explanation. Specifically, CAM, a technique in computer vision and deep learning, visualizes the areas of an image crucial for a deep neural network’s prediction of a specific class or category. The basic concept of CAM is to combine the feature maps of the last convolutional layer of a Convolutional Neural Network (CNN) with the class-specific weights of its final classification layer to generate a heatmap highlighting key regions in an image for classification purposes. Specifically, after the last convolutional layer of the CNN, global average pooling (GAP) transforms each feature map into a single average value, and the learned weight connecting that value to a class score reflects the impact of the corresponding feature map on that class. The feature maps, weighted accordingly, are then aggregated, and the resulting heatmap represents the important regions guiding the classification for each class in the neural network. However, CAM has limited applicability because it requires modifications to the model structure. Therefore, several advanced methods have been developed to overcome the limitations of CAM, including Grad-CAM (gradient-weighted class activation mapping) [5], Grad-CAM++ [6], Ablation-CAM [7], and Score-CAM [8].
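To make the mechanics concrete, the sketch below shows a minimal Grad-CAM implementation in PyTorch; Grad-CAM generalizes CAM by weighting each feature map with its spatially pooled gradient instead of a classifier weight, which is why no architectural modification is needed. This is an illustrative sketch only, not code from any of the reviewed studies; the ResNet50 backbone and the random input are assumptions made for the example.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

feats = {}

def hook(module, inputs, output):
    feats["act"] = output                                  # feature maps [1, C, h, w]
    output.register_hook(lambda g: feats.update(grad=g))   # their gradients

model.layer4.register_forward_hook(hook)  # last convolutional block

def grad_cam(x, class_idx=None):
    """Return an [H, W] heatmap in [0, 1] for the given (or predicted) class."""
    scores = model(x)                                      # [1, num_classes]
    if class_idx is None:
        class_idx = int(scores.argmax(dim=1))
    model.zero_grad()
    scores[0, class_idx].backward()
    acts, grads = feats["act"].detach(), feats["grad"]
    weights = grads.mean(dim=(2, 3), keepdim=True)         # GAP over gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]

heatmap = grad_cam(torch.randn(1, 3, 224, 224))  # dummy input for illustration
```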
In the medical field, an extensive body of research has employed CAM-based methods to explain deep learning models. However, no systematic survey has been conducted to analyze the progress and challenges of CAM in the medical domain. Therefore, this study conducted a review of research on visual explanation methods in the medical field from 2016 to 2023.
Materials and Methods: Relevant literature using CAM-based explanation methods in the medical domain was retrieved through searches on three databases: PubMed, ScienceDirect, and Web of Science. Because the activation mapping interpretability method was first proposed in 2016, only papers published from 1 January 2016 to 30 June 2023 are included. Specifically, papers were reviewed and evaluated according to publication year, medical relevance, neural network models, datasets used for model development, visualization explanation methods, and application types. In total, 45 articles were included in the review.
This systematic review thoroughly outlines the use of visual explanation methods in medicine. The integration of visual explanation methods with CNN models has the potential to enhance the accuracy of medical diagnosis and increase the transparency of the diagnostic process. Our research revealed that among the 45 selected articles that used activation mapping explanation methods in the healthcare industry, 82% (37 articles) utilized the Grad-CAM method, while only seven articles employed methods proposed after Grad-CAM. The methods employed are relatively simple, and there is no unified standard for measuring the quality of these explainable methods. Furthermore, the primary objectives of applying activation mapping methods in the selected articles were categorized as follows: (1) using CAM techniques only as explanatory tools for CNNs; (2) employing CAM techniques for both CNN explanation and model validation, assessment, and comparison; (3) enhancing the interpretability of newly developed models with CAM techniques; (4) establishing a foundation for creating new activation mapping methods; and (5) innovatively using CAM for feature selection.
The structure of this paper is organized into the following sections: Section 2 presents an extensive survey of recent applications of visual explanation methods within the medical industry. Section 3 details the specific methodologies, criteria, and outcomes used for the literature selection. Section 4 conducts an in-depth analysis of multiple aspects of the chosen articles. Section 5 designs a comparative experiment using five datasets to compare the explainable ability of eight different CAM methods across four models. Section 6 explores current trends and anticipates future challenges in the context of CAM methods. Finally, Section 7 provides a comprehensive summary.

2. Related Work

Within the realm of computer vision, explainable artificial intelligence (XAI) has emerged. Alicioglu et al. [9] offer a detailed exposition of XAI from the perspective of visual analysis and review post hoc explanation methods, including LIME [10], SHAP [11], and CAM [4].
In medical image analysis, there have been some advancements in interpretability research as well. Loh [12] performed an extensive review of XAI research in healthcare, encompassing technologies such as SHAP, LIME, Grad-CAM, LRP, fuzzy classifiers, EBM, CBR, and rule-based systems. Their focus was primarily on providing a comprehensive overview of computational and visual explainable methods, though they overlooked the detailed analysis of these methods’ applications in the medical field. Groen [13] explored explainable methods in the healthcare sector in more detail, aiding the comprehension of interpretability within radiology, with a focus on computer-aided diagnosis research employing end-to-end deep learning. Their study covered methods such as CAM, feature activation mapping, and t-distributed stochastic neighbor embedding (t-SNE). Similarly, Allgaier [14] outlined the categorization of explainable methods and reviewed the most frequently employed XAI methods in specific medical supervised machine learning, primarily focusing on Shapley Additive Explanations (SHAP) and Grad-CAM. However, these two studies primarily focused on introducing methods, overlooking an exhaustive examination of the progress and real-world implementations of these methods in the field of medicine.
In summary, existing review articles have offered broad introductions to the mechanisms of XAI methods. Still, with the growing prevalence of CAM techniques in medical applications, there is a pressing need for a comprehensive retrospective analysis of the specific developments in the application of CAM-based methods in the medical field.

3. Methodology and Results

3.1. Article Source

This review focuses on the application of class activation mapping visual explanation methods for neural networks in the healthcare industry, so the articles selected for this paper come from three databases containing a large body of healthcare-related literature, namely PubMed, ScienceDirect, and Web of Science. The search keywords for PubMed were medical & machine learning & explainable classification & CAM, and the search terms for ScienceDirect and Web of Science were medical & machine learning & explainable classification & “CAM”. All search results were limited to a publication timeframe between 2016 and 2023 and included only human studies.

3.2. Inclusion Criteria

The following criteria were specified to gather relevant information from the various studies and to keep this review as sound and objective as possible.
(1) The research paper must be drafted in the English language.
(2) The research should focus exclusively on human medicine and pertain to classification tasks.
(3) All articles must have been published during the period from 2016 to 2023.
(4) Research papers must detail the dataset size used and its source.
(5) Research papers must include an application of an interpretable method for activation mapping visualization.
(6) The research paper must include data on various metrics for classification and segmentation tasks, such as the accuracy of disease diagnosis.
(7) The paper may be a conference publication from IEEE.

3.3. Criteria for Risk of Bias: Assessing Study Quality

The studies’ quality was evaluated using the following key items.
(1) Criteria for inclusion
(2) Criteria for feature extraction
(3) Description of disease classification
(4) Medical image examination for disease diagnosis
(5) Description of classification model
(6) Samples of the dataset > 1000
(7) Whether the dataset is publicly available
(8) Description of the dataset
(9) Description of interpretable methods
(10) Analysis of the results after using the interpretable methods
(11) Whether the code is publicly available

3.4. Data Extraction

The selected research papers were all healthcare-related, involved deep learning techniques, and analyzed the prediction of human diseases and the parameters associated with them. The following data were therefore deemed important for literature assessment, and each research paper was reviewed for these parameters: author name, year of publication, country/region, size of the dataset used, whether the dataset is open source, algorithms, classifiers, target, language, and accuracy.
Other factors (e.g., whether features were extracted from the dataset, whether the dataset was re-preprocessed) are also relevant: reducing data duplication can directly improve learning speed and is a necessary step in the machine learning process. However, these parameters could not be included in the analysis, as not all of the retrieved research papers reported them.

3.5. Included Studies

A total of 85 articles were extracted from the search results: 13 from PubMed, 43 from Web of Science, and 29 from ScienceDirect. Articles were then excluded according to the steps outlined below. Firstly, duplicates were removed, leaving 80 articles. In the second step, the results were screened for relevance to healthcare, reducing the number of articles to 66; after excluding an additional 12 articles, 54 remained to progress to the next phase. The third step was the assessment of abstracts, from which 48 articles were selected. The final step was the screening of the full text, after which 45 articles remained. The PRISMA flowchart for the systematic review is shown in Figure 1.
The risk assessment table is shown in Table 1. It summarizes the criteria for bias risk in the selected articles, with a checkmark placed for each criterion that is met. Among all 11 criteria, meeting more than six is defined as “Low” risk, and meeting six or fewer is defined as “Moderate” risk.
The feature analysis table is shown in Table 2. The feature table contains detailed information about the 45 selected articles. This information includes the research area of each article, the machine learning model algorithms used, the language used in the article, the size of the dataset, as well as performance comparison parameters used in the article’s research. It also includes the explainable methods used. Due to the lack of uniformity in the comparison methods and the number of models used in the articles, the performance parameters listed in the table are relatively comprehensive and may include data from multiple models.

4. Results and Discussion of Selected Articles

4.1. Discussion of Publication Years

To analyze the trend in the popularity of activation mapping in the medical industry, we utilize the annual publication count of articles as a metric to gauge the practical application and popularity of this research direction.
Figure 2 shows a clear upward trend in the number of related articles over time. Specifically, 2016 marked the introduction of the CAM method, but there were no related articles published between 2016 and 2018, indicating a relatively slow start for this research in the medical field. In 2019, a few scholars began to explore this kind of method, but only one related article was published. With the rise of convolutional network technologies, the number of related articles gradually increased from 2019 to 2021, although the growth was relatively slow. By 2020, the number of related articles reached three, and this figure rose to six by 2021, indicating an increasingly strong interest in this field within the academic community.
However, it is particularly significant to highlight the marked rise in publications from 2021 to 2022, reaching 25 articles, far exceeding the total published in the previous three years. The data for 2023 include only 10 articles because the statistics run only up to June 2023. This phenomenon suggests that the application of activation mapping in the medical industry is gradually receiving more attention, and the academic community is engaging in deeper and more extensive discussions. This notable growth trend likely benefits from the continuous development and improvement of technologies like convolutional networks, offering new possibilities for research in the medical field. In summary, through the analysis of the number of articles, we can clearly observe the trend in the popularity of activation mapping in the medical field, providing valuable insights for future research directions in this area.

4.2. Analysis of Dataset Size

The size of the dataset has a significant and far-reaching impact on research, affecting several key aspects including credibility, interpretability, generalization ability, and computational resource requirements. Through in-depth analysis of the dataset sizes used in each article, we can gain profound insights into the data scale adopted in current medical field research on activation mapping methods. In Figure 3 below, the sizes of the datasets used in each article are presented, ranging from a minimum of 68 samples to a maximum of 107,038 samples.
Figure 4 displays the results of a categorized analysis of the dataset sizes in 45 articles related to the application of activation mapping in the medical field.
From Figure 4, it is evident that 22 articles used datasets larger than 5000 samples, and these are concentrated in recent years: as shown in Figure 5, 18 of them were published in 2022 and 2023. Over time, the scale of medical datasets available to researchers is increasing, and this development trend holds significant importance for the application of activation mapping methods in the medical field.
The increase in large-scale datasets provides researchers with more diverse and abundant data resources, thereby enhancing the credibility, reliability, and applicability of their research. Consequently, it can be anticipated that the future prospects for the application of activation mapping interpretation methods in the healthcare sector will become more promising, with the potential for greater breakthroughs and advancements in medical image interpretation.
Specifically, the size of the dataset has the following four important impacts:
(1) The analysis of dataset size reflects the data resource situation in the medical field when applying activation mapping methods.
(2) In research, the choice of dataset size can directly influence the credibility of research results. Larger datasets typically offer more robust results while capturing the data distribution better, thereby enhancing research credibility.
(3) Additionally, dataset size is related to interpretability since larger datasets may contain more rich features and patterns, aiding in a deeper understanding of model behavior.
(4) Furthermore, dataset size is crucial for model generalization. Larger datasets can typically support more complex models that can generalize better to various clinical scenarios, thereby improving model efficacy in practical applications.
Despite the advantages of larger datasets mentioned above, they also require more computational resources and time to process. Balancing the benefits and complexities of large datasets is, therefore, an important consideration for researchers and practitioners.

4.3. Application Field Analysis

Furthermore, we conducted a statistical analysis of the medical directions and diseases covered in these 45 articles. The applications are diverse, encompassing a total of 24 different diseases. Figure 6 displays the distribution of these diseases. In Figure 6, we observe that the disease directions covered in these papers include the following: atrial fibrillation and myocardial infarction, lung disease, diabetic retinopathy, fractures, classification of heart valves, human immunodeficiency virus and acquired immunodeficiency syndrome (HIV/AIDS), HPV-related head and neck cancers, Parkinson’s disease (PD), mental disorders, electroencephalogram (EEG) analysis, intracerebral hemorrhage (ICH), coronavirus, Lyme disease, oral squamous cell carcinoma, muscular dystrophy, coronary artery disease, dental caries (tooth decay), breast cancer, glaucoma, malaria, sleep assessment, gastric cancer/endoscopic image classification, clinical gland morphometric information, and middle ear (ME) infection. These articles cover a wide range of medical conditions, highlighting the diverse applications of activation mapping techniques in the medical field.
In Table 3, we can observe that CAM-based interpretability research provides strong support for medical image analysis and diagnosis in various fields, offering new possibilities to enhance patient care and diagnosis.
Firstly, the field of cardiovascular diseases is one of the research hotspots for activation mapping methods, with a total of 10 articles focusing on this area. Cardiovascular diseases encompass a variety of cardiac and vascular conditions that often require rapid and accurate diagnosis and classification. Therefore, the high interpretability and feature extraction capabilities of activation mapping methods make them powerful tools in this field.
Secondly, pulmonary diseases, especially those related to COVID-19, have also attracted widespread attention from researchers, which is particularly crucial in the current global pandemic context. Seven articles focus on the detection of lung damage caused by the COVID-19 virus, highlighting the application value of activation mapping methods in crisis situations. They can expedite the diagnosis and classification of pulmonary lesions.
Furthermore, the field of eye diseases has also benefited from research using activation mapping methods, particularly in the medical image classification and diagnosis of diabetic retinopathy. This is a domain where precision is highly required. Activation mapping methods can help medical professionals more accurately identify and classify retinal lesions, facilitating early intervention and treatment.
The application of activation mapping methods in the medical field demonstrates a clear distribution across different disease domains, each with its own unique demands. This reflects the diversity and potential of these methods in improving the accuracy of medical image classification and diagnosis.

4.4. Performance Analysis

In all selected studies, CNNs, Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs) served as the primary components of the models for disease detection and classification. Specific models used in these studies included Residual Networks (ResNet) [61], Dense Networks (DenseNet), Dilated Convolutional Neural Networks (DCNNs), Bag of Features (BoF), Visual Geometry Group (VGG), Efficient Networks (EfficientNet) [62], 3D ResNet, U-Net, Res2Net, and Inception-ResNet-v2.
Through a comprehensive review of 45 relevant articles, we collected and analyzed the performance of various models mentioned in these articles, as shown in Figure 7. Figure 7 presents the highest accuracy achieved in the models used in each article. It is worth noting that 82% of the studies demonstrated high-level model accuracy exceeding 90%, indicating that the application of activation mapping methods in various domains can significantly enhance model performance. Only 5% of the articles reported a model accuracy below 80%. However, one study on EEG classification for Parkinson’s disease showed lower performance with an accuracy of only 61%, mainly due to the relatively small size of the dataset used, containing only 90 samples.
Additionally, some studies, apart from using accuracy as a performance reference metric, also employed other performance reference metrics, as shown in Table 4.
In Table 4, Intersection over Union (IoU) [63] is a metric used to measure the performance of computer vision tasks such as object detection and semantic segmentation. Its purpose is to assess the degree of overlap between the predicted bounding boxes (or segmentation masks) and the actual target bounding boxes.
AUC (Area Under the ROC Curve) [64] is used to measure the performance of binary classification models at different thresholds. It provides a comprehensive evaluation of a model’s ability to classify positive and negative instances and is commonly used for comparing the performance of different models.
The F1 score [65] is the harmonic mean of precision and recall and is used to provide a comprehensive assessment of a model’s accuracy and coverage. It is a performance metric particularly suitable for problems with imbalanced class distributions.
Specificity refers to the proportion of correctly identified negative cases out of all actual negative cases in binary classification. It is used to evaluate the model’s performance in correctly recognizing actual negative cases, i.e., its ability to avoid falsely flagging negative instances as positive.
Sensitivity, in binary classification, represents the proportion of correctly identified positive cases out of all actual positive cases. It is used to evaluate the model’s performance in detecting positive cases, i.e., the model’s ability to capture actual positive instances.
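For reference, these metrics follow their standard definitions. Writing TP, TN, FP, and FN for true/false positives/negatives, and A and B for the predicted and ground-truth regions:

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP},

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}
```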

4.5. Analysis and Discussion

Computer-aided diagnosis (CAD) systems are employed in the medical field to support doctors in their decision-making processes, but the design and adjustment of traditional CAD technologies have been very challenging. Today, artificial intelligence (AI) technology has been incorporated into these computer-driven diagnostic systems. Deep learning is one of its advantages, but its black box nature limits its applicability. Therefore, a plethora of explainable methods have been suggested to improve the transparency of deep learning, among which CAM has its merits.
Firstly, CAM technology, as an explanatory tool, presents heatmaps in a way that intuitively shows the regions of interest in medical images that deep learning models focus on. It can assist in locating areas of pathology in medical images and also help improve model performance through observations of these areas. For example, A. Singh [25] and colleagues developed a high-performance deep learning model based on CNN that can distinguish scaphoid fractures, the most common type of wrist bone fracture, in X-ray images. In this research, they used Grad-CAM technology to assist in locating the position and area of the fracture, achieving a classification accuracy of 95% in a binary setting. Rapid diagnosis and treatment are crucial for brain hemorrhage. Kim [29] used an explainable deep learning model in their article to detect brain hemorrhages and their locations in images obtained from computed tomography (CT) scans. They examined images of normal brain scans as well as scans showing subarachnoid, intraventricular, subdural, epidural, and intraparenchymal hemorrhages. Using Grad-CAM for heatmap-like visualization explanations at the bleeding locations, they achieved an accuracy of 81% in predicting the location of hemorrhages in the medical images.
Furthermore, Jahmunah [43] and colleagues employed electrocardiograms (ECG) for the classification of myocardial infarction (MI) in their article. They improved DenseNet and CNN models for classification and used Grad-CAM technology for the output of both models. This was done to observe and visualize specific ECG leads and waveform segments that had the most significant impact on the predictive decisions made by the models. The final classification result reached an accuracy of 95%.
Furthermore, CAM technology can also serve as a tool to achieve validation, assessment, and comparison in model usage. Through activation mapping, we can verify the accuracy of the location in the model for classification or segmentation, assess its classification effectiveness, and compare whether its interpretability is truly feasible. Afify [53] and colleagues proposed a new model for predicting oral squamous cell carcinoma (OSCC) tissue pathology images processed using deep transfer learning. Grad-CAM technology was employed for validation to pinpoint lesion areas in OSCC images based on the optimal model, significantly enhancing the prediction model’s robustness. Penso [18] and others used activation mapping technology for the performance evaluation of architectures. They introduced a deep learning-based coronary artery disease reporting and data system (CAD-RADS) for the classification of coronary artery lesions in multiplanar reconstructed images from coronary computed tomography angiography (CCTA). Ruengchaijatuporn [23] and colleagues used bedside tasks such as the clock drawing test (CDT) to detect mild cognitive impairment (MCI) and dementia. They developed their own deep learning framework, applied soft labels and self-attention to enhance model performance, and provided visual explanations. They utilized the Grad-CAM technique to evaluate their interpretable approach. However, it is important to note that since their classification subjects are not traditional medical images, their proposed interpretable methods may not necessarily be applicable in other domains.
In addition to the two approaches mentioned above, the primary purpose of using CAM technology in most medical research is to achieve model interpretability, overcoming the black box issue in the model. Hossain and colleagues [48] conducted a study on the efficacy of CNNs in diagnosing Lyme disease from images. To enhance model interpretability, they applied Grad-CAM to highlight the key input regions essential for CNN predictions, thus making the model more robust. Dabass and colleagues [52] designed a multitask deep U-Net model that provides grade classification services for clinical glandular morphometric information and cancer. They employed Grad-CAM technology to provide further evidence of the self-learning capability of the model, and the visual results obtained were evaluated by expert pathologists. Ho and Ding [39] conducted research on stroke, building a CNN-based binary classifier to distinguish ECGs with stroke symptoms. They also leveraged Grad-CAM to aid model interpretation by illuminating subtle ECG patterns recognized by the model. In order to diagnose diabetic retinopathy (DR), Chetoui and Akhloufi [44] developed a deep learning algorithm capable of detecting DR in retinal fundus images. They used an interpretable algorithm based on Grad-CAM to visually display the symbols chosen by the model, classifying retinal images for DR. This effectively demonstrated the model’s interpretability, suggesting it can effectively pinpoint various indicators of DR and identify related health concerns.
Innovative uses of activation mapping technology also exist. Y. Li and colleagues [38] introduced a recurrent convolutional neural network model for intent recognition by learning decomposed spatiotemporal representations. They applied the Grad-CAM visualization technique for channel selection, achieving an impressive accuracy of 97.36% across all channels, surpassing many leading models and baselines.
In summary, the application of activation mapping methods in healthcare primarily focuses on the following aspects:
(1) CAM technology serves as a visualization tool, with its heatmap visualization providing an intuitive way to identify the regions of medical images that require the most attention. This not only helps in locating areas of pathology in medical images but also aids in observing these regions for model improvement.
(2) CAM technology can function as a tool to achieve validation, assessment, and comparison in model usage. Activation mapping allows for the verification of the accuracy of the location of classification or segmentation in the model, assessment of its classification effectiveness, and comparison of its interpretability.
(3) Using CAM technology provides models with interpretability. Given the black box nature of neural networks, CAM technology is a straightforward and efficient choice to achieve interpretability in medical image classification or segmentation tasks.
(4) Some authors may base their new activation mapping methods on traditional CAM methods, often tailored to specific medical images, such as lung CT scans.
(5) Innovative usage involves leveraging the characteristics of activation mapping for feature selection and other advanced applications in healthcare.

5. Experiment

To compare existing advanced CAM methods, we designed a comparative experimental study, structured in three distinct phases:
Dataset Selection: We selected five datasets, dividing them into two categories. The first category encompasses datasets that include both classification and segmentation tasks, while the second category comprises those containing only classification tasks.
Model Performance Evaluation on Datasets: Multiple models were employed to perform classification tasks on these datasets. We performed a comparative evaluation of each model’s performance, ensuring that only those models with superior classification performance (accuracy above 0.9) were selected. This criterion was established to mitigate the potential negative impact on CAM interpretation that could arise from suboptimal model classification effectiveness.
CAM Method Comparison and Assessment: In the concluding phase of our study, we implemented a range of CAM methods within a single classification model framework. This phase entailed an analytical comparison of each CAM method’s efficacy, targeted at pinpointing the most proficient CAM method for model interpretation. Our structured experimental strategy is designed to identify optimal CAM methods that align with distinct model architectures and dataset specifics, thus guiding the strategic selection of CAM techniques to boost model explainability. Specifically, we carried out the following:
(1) Analyzed CAM techniques as visualization tools, focusing on the clarity and effectiveness of their heatmap visualizations;
(2) Tested the precision of CAM in both classification and segmentation tasks across various models;
(3) Explored how CAM techniques contribute to the interpretability of models;
(4) Assessed the compatibility and enhancement effects of CAM methods when integrated with different models;
(5) Investigated potential innovative uses of CAM techniques.

5.1. Datasets

HAM10000 [66]: This dataset pertains to dermatoscopy and is sourced from the Medical University of Vienna. It comprises a diverse collection of skin disease images from various demographics, focusing primarily on pigmentary disorders. The conditions covered include, but are not limited to, various cancers such as melanoma and hemangioma. Additionally, the dataset provides mask segmentation images. In total, it contains 10,015 images.
Diabetic Retinopathy Arranged: The second dataset pertains to diabetic retinopathy, comprising 25,800 retinal images. These images are categorized on a scale from 0 to 4, with the numbers representing the severity of the condition. A higher number indicates more severe pathological changes, whereas a 0 rating signifies a normal, healthy state without any pathological alterations. This dataset is exclusively focused on classification tasks and does not include segmentation tasks. The dataset address is https://www.kaggle.com/datasets/amanneo/diabetic-retinopathy-resized-arranged/data (accessed on 15 April 2024).
Breast Ultrasound Images Dataset [67]: This dataset features ultrasound images of women’s breasts, designed for use in diagnosing and classifying breast cancer. The contributors to this collection range in age from 25 to 75, totaling 600 individuals. Images are organized into three categories reflecting breast cancer severity: normal, benign, and malignant. The set includes 780 images, all provided with corresponding mask images. On average, each image measures 500 by 500 pixels.
Brain Tumor Classification (MRI) [68]: This dataset comprises medical diagnostic images used for brain tumor diagnosis, featuring a total of 394 images. These images are categorized into four distinct classes based on different types of brain tumors.
Chest X-ray Images (Pneumonia) [69]: This dataset consists of chest X-ray diagnostic images, categorized into two types based on pneumonia diagnosis: normal and pneumonia. It includes a total of 5863 images, all derived from pediatric patients aged 1 to 5 years old.
In this study, image datasets were divided into training and validation subsets at a 70:30 ratio. The training subset underwent preprocessing such as random rescaling to 224 × 224 pixels, horizontal flips, and normalization using ImageNet’s mean and standard deviation. Conversely, the validation subset images were first resized to 256 × 256 pixels, then center-cropped to 224 × 224 using the same normalization. These processes enhance the model’s resilience against image variations and ensure data alignment with the original training conditions of pretrained models. Moreover, the training subset used random sampling, and the validation subset was shuffled before each iteration to ensure thorough and randomized data loading.
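The sketch below expresses this preprocessing with torchvision transforms. The mapping of the description to specific transforms (e.g., RandomResizedCrop for the random rescaling) is our assumption for illustration; the ImageNet normalization statistics are the standard published values.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random rescale/crop to 224 x 224
    transforms.RandomHorizontalFlip(),   # horizontal flip augmentation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_tf = transforms.Compose([
    transforms.Resize((256, 256)),       # resize to 256 x 256 first
    transforms.CenterCrop(224),          # then center-crop to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```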

5.2. Classification Models

To better validate the effectiveness of CAM methods, this study selected GoogleNet [70], ResNet50 [61], DenseNet121 [71], and EfficientNet-B3 [62] as the four models because they represent different deep learning architectural paradigms in the area of computer vision. Each model possesses a unique structure and working principle, offering diverse perspectives for understanding and interpreting visual data. Due to their structural differences, these models vary in their methods of extracting and processing image features, which may result in different focal points and interpretative patterns when CAM methods are applied. By employing CAM methods on these diverse models, we can explore and compare their performance across various architectures, thereby enhancing the generalizability and practicality of the research findings.
DenseNet121, introduced in 2016 by Gao Huang and colleagues, is a DenseNet-series deep learning model for image classification and computer vision. Its key feature is the dense connections in the network, which facilitate information flow and allow deeper structures. In our study, we used this model to complete the classification tasks for five datasets, generating corresponding confusion matrices as illustrated in Figure 8.
EfficientNet-B3, part of the EfficientNet series, is a scaled deep neural network model optimized for performance and complexity. It is distinctive in balancing depth, width, and resolution efficiently, enabling high performance with a smaller size and less computational demand. In our study, it was applied to classification tasks on five datasets. The confusion matrices generated from these classifications are presented in Figure 9. These matrices visually depict the performance of the model, highlighting the accuracy of predictions across different classes within each dataset. Confusion matrices are particularly useful for identifying classes that the model may find challenging to classify correctly.
ResNet50 is an architecture within the deep convolutional neural network framework and is a part of the ResNet series. The ResNet (Residual Network) series is renowned for its innovative residual connections, which were developed to tackle the vanishing and exploding gradients commonly encountered during the training of deep convolutional neural networks. ResNet50, with its 50 layers, has shown exceptional performance in image classification and computer vision tasks. In our research, ResNet50 was employed to classify five different datasets. The corresponding confusion matrices generated from these classification tasks are presented in Figure 10. These matrices visually depict the model’s accuracy and effectiveness in correctly classifying the images from each dataset. ResNet50’s deep layered structure, combined with residual connections, allows it to learn complex features without losing important information through the depth of the network, thus making it a powerful tool for such tasks.
GoogleNet, or Inception network, introduced in 2014 by Google, marked a significant development in convolutional neural networks. It integrated a novel structure, the inception module, to enhance performance and efficiency. This module features parallel convolutions of varying scales for capturing diverse spatial information, boosting the network’s depth and width with minimal parameter increase. In our study, GoogleNet was applied to classify five datasets. The confusion matrices derived from these classification results are displayed in Figure 11.
For each model’s performance, we have conducted a detailed comparison based on the results from the aforementioned confusion matrices, as shown in Table 5, Table 6, Table 7, Table 8 and Table 9. These tables include the performance of each model in each dataset, with parameters such as accuracy, precision, recall, and F1 score. These measures were chosen for their comprehensive and detailed reflection of model performance. Firstly, accuracy, as the most direct performance indicator, provides the proportion of correct predictions in the overall sample and is fundamental for assessing the overall efficacy of the model. However, relying solely on accuracy might overlook the impact of class imbalance. To address this, the introduction of precision and recall becomes necessary. Precision measures the ratio of true positives among the predicted positives, essential in situations where false positives carry a high cost. Conversely, recall quantifies the percentage of actual positives accurately identified, vital in contexts where minimizing false negatives is paramount. Lastly, the F1 score, calculated as the harmonic mean of precision and recall, provides a balanced metric between these two measures, especially suitable for scenarios that require reducing both false positives and negatives.
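As an illustration of how such table entries can be computed, the hypothetical snippet below derives accuracy, precision, recall, and F1 score from a model's predictions with scikit-learn; the labels shown are placeholders, and macro averaging is an assumption, since the averaging scheme is not always stated.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([0, 1, 2, 2, 1, 0])   # hypothetical ground-truth labels
y_pred = np.array([0, 1, 2, 1, 1, 0])   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))              # as in Figures 8-11
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```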
Table 5, Table 6, Table 7, Table 8 and Table 9 clearly illustrate the performance of various models across different dataset classification tasks. According to the data, the DenseNet121 model achieves the highest accuracy of 98.14% in the HAM dataset, and the greatest precision of 97.5% in the chest X-ray dataset. The ResNet50 model attains the highest recall and F1 score, each at 97%, also within the chest X-ray dataset. The data further indicate that the type of dataset significantly affects classification outcomes; for instance, while models generally excel in the chest X-ray dataset with baseline performances around 95%, performance in the breast cancer dataset typically falls below 80%. The DR dataset notably scores lower in all metrics except accuracy, likely due to data imbalance. Additionally, DenseNet121 consistently outperforms other models across all datasets, especially in the HAM dataset. In contrast, the ResNet50 model maintains stable precision across various tasks, with rates ranging from 67.2% to 97%. The lowest accuracies recorded are in the breast cancer dataset, with the GoogleNet and EfficientNet-B3 models achieving only 81.64%; however, accuracies in other datasets range between 90% and 99%. Overall, these results affirm that our chosen models satisfy our classification requirements and exhibit stable performance across diverse datasets.

5.3. Comparison of CAM Methods

In our selection of CAM methods, we chose eight different CAM methods: Grad-CAM, GradCAM++, XGradCAM [72], AblationCAM, EigenCAM [73], EigenGrad-CAM, LayerCAM [74], and FullGrad [75]. Among these, Grad-CAM is currently the most frequently used CAM method in the medical industry, with the other seven being derivatives of the Grad-CAM method. The descriptions of these methods are shown in Table 10.
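For readers who wish to reproduce such a comparison, all eight methods are implemented behind a common interface in the open-source pytorch-grad-cam package. The sketch below is a usage example under that assumption (class names and signatures follow the package's documented API and may vary between versions), not the exact code used in our experiments.

```python
import torch
from torchvision import models
from pytorch_grad_cam import (GradCAM, GradCAMPlusPlus, XGradCAM, AblationCAM,
                              EigenCAM, EigenGradCAM, LayerCAM, FullGrad)
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layers = [model.layer4[-1]]           # last convolutional block
input_tensor = torch.randn(1, 3, 224, 224)   # placeholder preprocessed image

for cam_cls in (GradCAM, GradCAMPlusPlus, XGradCAM, AblationCAM,
                EigenCAM, EigenGradCAM, LayerCAM, FullGrad):
    with cam_cls(model=model, target_layers=target_layers) as cam:
        heatmap = cam(input_tensor=input_tensor,
                      targets=[ClassifierOutputTarget(0)])[0]  # [H, W] in [0, 1]
        print(cam_cls.__name__, heatmap.shape)
```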
In our study, we employed a novel experimental design to assess the efficacy of various CAM (Class Activation Mapping) methods on segmentation task datasets. We compared CAM-generated images with corresponding mask images by calculating their Intersection over Union (IoU). IoU is a standard metric that quantifies the overlap between two areas, critical for evaluating the interpretative effectiveness of CAM methods. A higher IoU value typically signifies greater interpretative reliability.
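A minimal sketch of this IoU computation is given below; the 0.5 binarization threshold used to turn the continuous heatmap into a region is an assumption for illustration.

```python
import numpy as np

def cam_iou(heatmap: np.ndarray, mask: np.ndarray, thr: float = 0.5) -> float:
    """heatmap: [H, W] CAM values scaled to [0, 1]; mask: [H, W] binary array."""
    pred = heatmap >= thr                       # binarize the CAM heatmap
    gt = mask.astype(bool)                      # ground-truth segmentation
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0
```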
The experimental results are displayed in Figure 12, and the corresponding IoU values are presented in Table 11. Figure 12 reveals the heatmaps obtained using four different models (DenseNet121, EfficientNet-B3, ResNet50, and GoogleNet) with the eight CAM methods for breast cancer classification tasks. These results clearly indicate that different CAM methods have varying interpretive effects on different models. Specifically, using the DenseNet121 model, the EigenCAM method showed the highest IoU score (0.372), indicating its superior interpretative capability on this model. In the EfficientNet-B3 model, GradCAM and XGradCAM methods both achieved the highest score (0.307), demonstrating their effectiveness on this model. For the ResNet50 model, AblationCAM performed the best, with an IoU score of 0.271. Finally, for the GoogleNet model, the EigenGradCAM method showed the best effect, scoring 0.192.
We extended the experimental scope to include the classification task of skin cancer, continuing to evaluate the eight different CAM methods on four different models (DenseNet121, EfficientNet-B3, ResNet50, and GoogleNet). The results of this part of the experiment are also recorded in Figure 12, particularly in rows 6 to 9, showing the heatmaps obtained by applying these CAM methods. In this skin cancer dataset, we found that using the EfficientNet-B3 model combined with the FullGrad interpretation method achieved the highest IoU score, reaching 0.762. The corresponding heatmap for this result is located in row 3, column 7 of Figure 12. This finding indicates that on the EfficientNet-B3 model, the FullGrad method has a significant advantage in interpreting skin cancer classification tasks. Notably, in Figure 12, the FullGrad method achieved the highest IoU scores in the DenseNet121, EfficientNet-B3, and ResNet50 models. However, in the GoogleNet model, the FullGrad method failed to produce effective heatmaps, resulting in a blank display in row 8, column 7 of Figure 12. This phenomenon is due to the unique architecture of the GoogleNet model. The core feature of GoogleNet is its inception module, which applies convolutional kernels of different sizes and pooling operations in parallel. This complex multi-branch structure makes it difficult for the FullGrad method to effectively attribute the contribution of each specific branch, leading to poor performance on the GoogleNet model.

5.4. Discussion

Overall, these results underscore the significance of selecting a suitable CAM method to improve the interpretability of classification models. The structure and characteristics of each model may be better suited to certain specific CAM methods, and this adaptability plays a key role in achieving optimal model interpretation. This finding provides important guidance for future research: the characteristics and specific architecture of a model must be taken into account when implementing CAM methods, since different models may adapt and respond differently to different CAM methods. Moreover, this also suggests that when dealing with models with complex architectures, such as GoogleNet, more advanced or specialized visualization techniques may need to be developed to obtain more accurate and comprehensive interpretations. The compatibility between specific CAM methods and model structures plays a crucial role in achieving optimal interpretability.
Our study commenced by selecting five medical image datasets across various disease domains: skin cancer, breast cancer, chest X-rays, diabetic retinopathy (DR), and brain tumors, all publicly available. We evaluated the stability of four contemporary classification models across these datasets. Each model performed classification tasks on all five datasets. The results indicated that the type of dataset influences the classification outcomes. Notably, the HAM dataset consistently demonstrated robust performance across all models. However, the DR and breast cancer datasets showed lower performance metrics in areas other than accuracy, likely due to their imbalanced nature. In contrast, the models exhibited stable performance across the remaining datasets, efficiently managing the corresponding disease classifications. Overall, our findings suggest that despite real-world case imbalances, the chosen models can reliably perform classification tasks.
Subsequently, we assessed various Class Activation Mapping (CAM) methods on datasets with segmentation tasks. This involved comparative experiments analyzing CAM images against corresponding mask images within the datasets. We focused on the breast cancer and HAM datasets, testing eight different CAM methods. The results varied across different models and datasets, indicating that no single method consistently outperformed the others. In particular, the EfficientNet-B3 model, combined with the FullGrad explanation method, achieved the highest Intersection over Union (IoU) score of 0.762 in the skin cancer dataset, underscoring FullGrad’s significant interpretive advantage in this classification task. Notably, FullGrad performed well across all models except GoogleNet, whose complex multi-branch structure it struggled to interpret. GradCAM consistently performs well across the various models, maintaining IoU metrics at or above the median level. With DenseNet, GradCAM++ significantly surpasses the other methods, a performance attributed to DenseNet’s unique architecture. Conversely, EigenCAM and EigenGradCAM yield suboptimal results on the EfficientNet model, achieving only one-tenth the IoU scores of alternative methods. In the analysis of the breast cancer dataset, EigenGradCAM, when combined with DenseNet, achieves the highest scores. Given the variability in datasets and models, it is currently infeasible to definitively identify an optimal CAM method. Nevertheless, we can distinctly identify unsuitable CAM methods, facilitating the selection of experimental strategies.
In conclusion, selecting optimal CAM methods requires critically preprocessing datasets to balance disease classes, a factor that significantly impacts model performance and interpretability outcomes. Additionally, understanding the unique structures of the models used and integrating the operational principles of the various CAM methods is crucial. This comprehensive approach will allow us to identify the most suitable CAM methods for specific disease domains, ensuring superior results.

6. Research Hotspots and Challenges

6.1. Research Hotspots

The application of activation mapping methods in the healthcare industry has shown a steady increase over the years. This article discusses the specific hotspots of its application in the following sections:
(1) Disease Domain Hotspots
Currently, activation mapping interpretative methods are increasingly used in the healthcare industry. In specific disease domains, they are particularly prevalent in cardiovascular diseases, lung diseases (especially COVID-19), and eye diseases (especially diabetic retinopathy, DR). Firstly, cardiovascular diseases often impact the electrophysiological characteristics of the heart, leading to abnormal changes in cardiac electrical signals. Activation mapping methods are valuable in this context, as they not only facilitate the rapid identification of features for classifying cardiovascular diseases but also provide significant assistance in enhancing the model’s interpretability. Once important features are identified, they can also contribute to the model’s accuracy in disease classification. Secondly, the application in lung diseases is second only to cardiovascular diseases. This can be attributed to the impact of COVID-19, which has swept the globe, making activation mapping methods widely applied in the diagnosis of the COVID-19 virus. Additionally, these methods have proven effective in the classification and diagnosis of lung diseases using CT images. While activation mapping methods have found some application in other disease domains, it is relatively limited. This may be due to the diverse nature of medical images and the limited information that can be obtained through images for diagnosing certain diseases.
(2) Usage Method Hotspots
CAM technology, as a vital tool for interpreting medical images, presents visual information in the form of heatmaps, providing an intuitive way to accurately identify regions of interest in medical images. This particular feature is one of the key hotspots for the current application of CAM technology in the healthcare industry. This method not only aids in pinpointing the specific location of potential abnormalities in images but also offers robust support for observing and improving deep learning models. Additionally, CAM technology plays a crucial role in model validation, performance evaluation, and comparison. Through activation mapping, we can verify the accuracy of models in classification or segmentation tasks, assess their classification effectiveness, and examine their interpretability. CAM technology has excelled, particularly in medical image classification and segmentation tasks, where it addresses the opacity issues of neural networks. Some researchers use traditional CAM methods as the foundation for new activation mapping techniques, especially in specific medical imaging fields, such as lung CT scans.
Furthermore, CAM technology plays an important role in improving model accuracy. Through CAM technology, we can accurately identify diagnostic features relevant to disease classification. This is particularly significant in medical diagnostics that require real-time decision-making and feedback. Efficient models can generate decisions more rapidly, offering timelier medical assistance and recommendations. This approach is of crucial importance in enhancing the accuracy of medical image diagnosis and increasing the efficiency of medical decision-making, which holds the potential to drive advancements and innovations in the field of medicine. This is also one of the current research hotspots.
(3) Large-Scale Data Research
Based on trends observed in the literature, the selection of datasets in research papers is increasingly leaning toward the use of large datasets. This trend is driven by a combination of factors. Firstly, advancements in medical image acquisition technology have resulted in higher-quality medical images, creating a more feasible foundation for the application of large datasets. Secondly, continuous improvements in hardware technology enable us to handle larger-scale data, leading to an exponential growth in data volume and, consequently, creating more opportunities for leveraging large datasets. Furthermore, large datasets are evidently more informative for disease studies, as they provide more features to the models and typically encompass a greater number of samples and diverse data. This aids in training more accurate and robust machine learning models.
Having more data helps models better capture the underlying data distribution, thereby enhancing predictive performance. With the additional instances provided by large datasets, it is also easier to elucidate the model’s decision-making process and the importance of features. For the healthcare industry, large datasets are particularly advantageous in capturing rare events or exceptional cases, which, in turn, improves model robustness. In smaller datasets, rare events may be overlooked or, conversely, given undue weight as features.

6.2. Challenges

Regarding the future research directions and challenges, the following summary is provided:
(1) Lack of Quantitative Analysis of Explainable Methods
The utilization of explainability methods in research has become widespread; however, a clear and quantitative analysis of how these methods enhance the explanatory power of models is lacking. Many researchers employ techniques such as CAM to extract features, enhance model efficiency, or identify model-improving characteristics. Despite their demonstrated positive effects, a standardized tool or unit for quantifying their performance is currently absent. Researchers typically assess the effectiveness of using CAM techniques by measuring the improvement in model performance. Yet, a definitive and quantitative measure for the explanatory power of CAM techniques is currently unavailable, making it challenging to objectively evaluate their performance. Additionally, comparing the interpretability of different explainability methods fairly is complex, hindering the continued development and iteration of activation mapping methods. This complexity stems from the fact that interpretability assessment not only relates to model performance but also encompasses factors such as consistency, effectiveness, and practical feasibility in real-world applications. Therefore, future research should prioritize the development of more accurate and standardized quantitative methods to comprehensively evaluate and compare the performance of various explainability techniques. This will contribute to the advancement of the field and underscore the importance of using interpretability techniques judiciously, while considering various factors for a comprehensive evaluation of their impact on the model.
(2) Selection of Explainability Methods
Multiple CAM methods are available, and selecting the most appropriate one is a primary focus of current research. Medical images take different forms under different instruments and disease conditions, so for disease-specific studies, which CAM method yields the best results is a question worthy of further exploration. In fact, among the articles selected in this paper, 82% used Grad-CAM exclusively, without comparing it to other explainability methods. This may be attributed to Grad-CAM’s efficient and stable performance in practical research; alternatively, other CAM methods may produce superior imaging results but lack a standardized metric for evaluating the quality of explanatory imaging. Selecting a CAM method for a specific disease study is a complex research topic in its own right: the optimal choice depends on multiple factors, including the characteristics of the specific disease, the feature extraction methods, the choice of classifier, and the characteristics of the dataset. Future research should therefore conduct systematic evaluations of CAM methods to establish a comprehensive reference framework that helps researchers select the most suitable CAM method, thereby better serving the research needs of different diseases. This will advance medical image explainability research and enhance its application value in the medical field.
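In practice, several of the methods compared in this review can be generated side by side from a single trained model, which makes such disease-specific comparisons straightforward to set up. The sketch below assumes the open-source pytorch-grad-cam package and a torchvision ResNet-50; the chosen target layer, the random input tensor, and the class index are illustrative placeholders rather than a recommended configuration.

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import (GradCAM, GradCAMPlusPlus, XGradCAM,
                              ScoreCAM, EigenCAM, LayerCAM)
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(weights="IMAGENET1K_V1").eval()
target_layers = [model.layer4[-1]]          # last convolutional block of ResNet-50
input_tensor = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed medical image
targets = [ClassifierOutputTarget(0)]       # class of interest (placeholder index)

methods = {
    "Grad-CAM": GradCAM,
    "Grad-CAM++": GradCAMPlusPlus,
    "XGrad-CAM": XGradCAM,
    "Score-CAM": ScoreCAM,
    "Eigen-CAM": EigenCAM,
    "Layer-CAM": LayerCAM,
}

# One normalized heatmap per method; these can then be overlaid on the image
# or scored against annotations (e.g., with the IoU sketch above).
heatmaps = {}
for name, cam_cls in methods.items():
    cam = cam_cls(model=model, target_layers=target_layers)
    heatmaps[name] = cam(input_tensor=input_tensor, targets=targets)[0, :]
```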
(3) Imbalance Issue
For a model, explainability is one of the factors that determines its utility. Using CAM methods enhances a model’s explainability, but introducing them inevitably adds complexity. The balance between this added complexity and the explainability obtained is a key point for consideration and research: how to handle the tradeoff between explainability and predictive classification performance, so that explainability can be provided without sacrificing accuracy, remains a challenge for the future.
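One concrete way to reason about the added complexity is to measure the computational overhead a CAM method imposes on top of a plain forward pass. The sketch below times a bare prediction against a Grad-CAM call for the same input; it reuses the pytorch-grad-cam and ResNet-50 assumptions from the previous sketch, and the number of repetitions is arbitrary.

```python
import time
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(weights="IMAGENET1K_V1").eval()
input_tensor = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

def mean_time(fn, repeats: int = 10) -> float:
    """Average wall-clock time of fn() over several runs, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

t_forward = mean_time(lambda: model(input_tensor))
t_cam = mean_time(lambda: cam(input_tensor=input_tensor,
                              targets=[ClassifierOutputTarget(0)]))
print(f"plain forward pass: {t_forward * 1e3:.1f} ms")
print(f"forward + Grad-CAM: {t_cam * 1e3:.1f} ms")
```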
(4) Clinical Significance of Research
Ensuring that a model’s explanations have practical relevance in actual clinical applications is a critical issue, since model explainability holds true value only when it serves a real purpose in clinical practice. This may be a challenging aspect of future research. Current studies focus primarily on explaining heatmaps for the classification results of existing models or on improving model performance; few articles provide actual data demonstrating clinical significance or showing how these methods assist healthcare professionals in their work.
(5) Data Challenges
First, the scarcity and difficulty of acquiring medical data present significant challenges in training deep learning models. Medical data is often limited and not easily obtainable. Acquiring a sufficiently large-scale medical dataset can take years or even longer, mainly due to privacy and ethical concerns related to patient information. Second, data quality and labeling are critical issues in medical data. Medical data demands highly accurate labels, as medical decisions and disease diagnoses have stringent requirements for data accuracy. Incorrect labels or low-quality data can result in reduced performance of deep learning models and adversely affect patient diagnoses and treatments. Furthermore, the sensitive nature of medical data raises concerns about patient privacy. Since it includes personal medical information, data security and privacy protection are of paramount importance. This challenge is particularly prominent in data-sharing environments, where data transmission and storage must comply with strict regulations and privacy laws to ensure the comprehensive protection of patient privacy. Medical institutions and researchers must implement measures to secure medical data, including encryption, authentication, and data access control.

7. Conclusions

AI-assisted diagnosis is now widely applied in the medical field. These technologies help medical professionals quickly access patient health information and expedite the diagnosis and treatment planning process, thereby increasing the efficiency of healthcare providers. Additionally, they provide highly accurate disease diagnoses, particularly benefiting patients in underserved medical areas, at a relatively low cost, ensuring timely treatment recommendations tailored to their medical conditions. This selection of 45 research articles demonstrates the broad scope of activation mapping methods in the medical industry, covering a wide range of diseases.
In these 45 articles, although methods such as Hires-CAM, Layer-CAM, Grad-CAM++, XGrad-CAM, and Score-CAM are mentioned in some studies, Grad-CAM remains the most frequently used activation mapping method in the medical industry. The specific applications of CAM methods fall mainly into the following categories. CAM as a visualization tool: through heatmap visualization, CAM provides an intuitive means of identifying critical areas within medical images, which aids in locating potential pathological regions and supports observations that improve model performance. Validation and evaluation using CAM: CAM serves as a tool for model validation, evaluation, and comparison; activation mapping allows verification of the accuracy of model classification or segmentation, assessment of classification performance, and evaluation of the feasibility of an interpretation. Obtaining interpretability through CAM: by mitigating the black box nature of neural networks, CAM offers a straightforward and efficient way to achieve interpretability in medical image classification and segmentation tasks. Innovative applications of CAM: some researchers build upon traditional CAM methods to develop new activation mapping methods, particularly in specific medical imaging domains such as lung CT scans; in addition, the characteristics of CAM can be exploited for feature selection, showing promise in tasks that require neural network-based feature selection.
When selecting a CAM method for model interpretation, the model’s architecture must be evaluated carefully. Our experiments show that different models respond differently to the various CAM methods, so each model’s structural characteristics need to be taken into account. How well a CAM method matches a model’s architecture can significantly influence the quality of the interpretative results; the adaptability of a CAM method to a particular architecture therefore plays a pivotal role in obtaining reliable interpretations, a finding that is important for advancing the precision and utility of CAM-based analysis in complex modeling scenarios.
In the medical industry, CAM explanation methods increasingly target large datasets and cover a wide range of application areas, with applications to cardiovascular and pulmonary diseases being relatively mature. However, the absence of standardized metrics for measuring the effectiveness of interpretability remains a challenge, necessitating the development of equitable evaluation standards across CAM methods. The balance between model interpretability and predictive classification performance also requires increased attention. Finally, the clinical significance of CAM methods in the medical field, and how to maximize their explanatory capabilities, are topics deserving careful consideration.

Author Contributions

Conceptualization, J.C.; methodology, J.C. and D.T.; software, L.R. and D.L.; validation, H.Z. and X.W.; formal analysis, J.C., X.W. and L.R.; investigation, J.C., X.W. and L.R.; resources, D.T.; data curation, J.C.; writing—original draft preparation, J.C. and D.T.; writing—review and editing, D.T. and L.R.; visualization, L.R. and X.W.; supervision, D.T. and H.Z.; project administration, D.L.; funding acquisition, D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Major Special Projects of the Science and Technology Department of Sichuan Province (2022ZDZX0001) and the Key R&D Projects of the Sichuan Science and Technology Department (2022YFG0037, 2022YFG0033).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data reported in this study can be found in the sources cited in the reference list.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gao, J.; Jiang, Q.; Zhou, B.; Chen, D. Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: An overview. Math. Biosci. Eng. 2019, 16, 6536–6561.
2. Liang, X.; Yu, J.; Liao, J.; Chen, Z. Convolutional Neural Network for Breast and Thyroid Nodules Diagnosis in Ultrasound Imaging. BioMed Res. Int. 2020, 2020, 1763803.
3. Gayathri, S.; Gopi, V.P.; Palanisamy, P. Diabetic retinopathy classification based on multipath CNN and machine learning classifiers. Phys. Eng. Sci. Med. 2021, 44, 639–653.
4. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
5. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
6. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847.
7. Desai, S.; Ramaswamy, H.G. Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020.
8. Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 111–119.
9. Alicioglu, G.; Sun, B. A survey of visual analytics for Explainable Artificial Intelligence methods. Comput. Graph. 2021, 102, 502–520.
10. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
11. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
12. Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161.
13. Groen, A.M.; Kraan, R.; Amirkhan, S.F.; Daams, J.G.; Maas, M. A systematic review on the use of explainability in deep learning systems for computer aided diagnosis in radiology: Limited use of explainable AI? Eur. J. Radiol. 2022, 157, 110592.
14. Allgaier, J.; Mulansky, L.; Draelos, R.L.; Pryss, R. How does the model make predictions? A systematic literature review on the explainability power of machine learning in healthcare. Artif. Intell. Med. 2023, 143, 102616.
15. Talpur, S.; Azim, F.; Rashid, M.; Syed, S.A.; Talpur, B.A.; Khan, S.J. Uses of Different Machine Learning Algorithms for Diagnosis of Dental Caries. J. Healthc. Eng. 2022, 2022, 5032435.
16. Cui, J.; Lan, Z.; Liu, Y.; Li, R.; Li, F.; Sourina, O.; Müller-Wittig, W. A compact and interpretable convolutional neural network for cross-subject driver drowsiness detection from single-channel EEG. Methods 2022, 202, 173–184.
17. Li, J.; Huang, J.; Jiang, T.; Tu, L.; Cui, L.; Cui, J.; Ma, X.; Yao, X.; Shi, Y.; Wang, S.; et al. A multi-step approach for tongue image classification in patients with diabetes. Comput. Biol. Med. 2022, 149, 105935.
18. Penso, M.; Moccia, S.; Caiani, E.G.; Caredda, G.; Lampus, M.L.; Carerj, M.L.; Babbaro, M.; Pepi, M.; Chiesa, M.; Pontone, G. A token-mixer architecture for CAD-RADS classification of coronary stenosis on multiplanar reconstruction CT images. Comput. Biol. Med. 2023, 153, 106484.
19. Zhang, X.; Han, L.; Zhu, W.; Sun, L.; Zhang, D. An Explainable 3D Residual Self-Attention Deep Neural Network for Joint Atrophy Localization and Alzheimer’s Disease Diagnosis Using Structural MRI. IEEE J. Biomed. Health Inform. 2022, 26, 5289–5297.
20. Niranjan, K.; Shankar Kumar, S.; Vedanth, S.; Chitrakala, S. An Explainable AI driven Decision Support System for COVID-19 Diagnosis using Fused Classification and Segmentation. Int. Conf. Mach. Learn. Data Eng. 2023, 218, 1915–1925.
21. Shorfuzzaman, M.; Hossain, M.S.; El Saddik, A. An Explainable Deep Learning Ensemble Model for Robust Diagnosis of Diabetic Retinopathy Grading. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–24.
22. Oztekin, F.; Katar, O.; Sadak, F.; Yildirim, M.; Cakar, H.; Aydogan, M.; Ozpolat, Z.; Talo Yildirim, T.; Yildirim, O.; Faust, O.; et al. An Explainable Deep Learning Model to Prediction Dental Caries Using Panoramic Radiograph Images. Diagnostics 2023, 13, 226.
23. Ruengchaijatuporn, N.; Chatnuntawech, I.; Teerapittayanon, S.; Sriswasdi, S.; Itthipuripat, S.; Hemrungrojn, S.; Bunyabukkana, P.; Petchlorlian, A.; Chunamchai, S.; Chotibut, T.; et al. An explainable self-attention deep neural network for detecting mild cognitive impairment using multi-input digital drawing tasks. Alzheimers Res. Ther. 2022, 14, 111.
24. Krishna, S.; Suganthi, S.S.; Bhavsar, A.; Yesodharan, J.; Krishnamoorthy, S. An interpretable decision-support model for breast cancer diagnosis using histopathology images. J. Pathol. Inform. 2023, 14, 100319.
25. Singh, A.; Ardakani, A.A.; Loh, H.W.; Anamika, P.V.; Acharya, U.R.; Kamath, S.; Bhat, A.K. Automated detection of scaphoid fractures using deep neural networks in radiographs. Eng. Appl. Artif. Intell. 2023, 122, 106165.
26. Khan, M.A.; Kwon, S.; Choo, J.; Hong, S.M.; Kang, S.H.; Park, I.-H.; Kim, S.K.; Hong, S.J. Automatic detection of tympanic membrane and middle ear infection from oto-endoscopic images via convolutional neural networks. Neural Netw. 2020, 126, 384–394.
27. Vafaeezadeh, M.; Behnam, H.; Hosseinsabet, A.; Gifani, P. Automatic morphological classification of mitral valve diseases in echocardiographic images based on explainable deep learning methods. Int. J. Comput. Assist. Radiol. Surg. 2022, 17, 413–425.
28. Woo, Y.; Kim, D.; Jeong, J.; Lee, W.-S.; Lee, J.-G.; Kim, D.-K. Automatic Sleep Stage Classification Using Deep Learning Algorithm for Multi-Institutional Database. IEEE Access 2023, 11, 46297–46307.
29. Kim, K.H.; Koo, H.-W.; Lee, B.-J.; Yoon, S.-W.; Sohn, M.-J. Cerebral hemorrhage detection and localization with medical imaging for cerebrovascular disease diagnosis and treatment using explainable deep learning. J. Korean Phys. Soc. 2021, 79, 321–327.
30. Baseri Saadi, S.; Moreno-Rabié, C.; van den Wyngaert, T.; Jacobs, R. Convolutional neural network for automated classification of osteonecrosis and related mandibular trabecular patterns. Bone Rep. 2022, 17, 101632.
31. Khan, M.A.; Azhar, M.; Ibrar, K.; Alqahtani, A.; Alsubai, S.; Binbusayyis, A.; Kim, Y.J.; Chang, B. COVID-19 Classification from Chest X-ray Images: A Framework of Deep Explainable Artificial Intelligence. Comput. Intell. Neurosci. 2022, 2022, 4254631.
32. Hamza, A.; Attique Khan, M.; Wang, S.-H.; Alhaisoni, M.; Alharbi, M.; Hussein, H.S.; Alshazly, H.; Kim, Y.J.; Cha, J. COVID-19 classification using chest X-ray images based on fusion-assisted deep Bayesian optimization and Grad-CAM visualization. Front. Public Health 2022, 10, 1046296.
33. Suri, J.S.; Agarwal, S.; Chabert, G.L.; Carriero, A.; Pasche, A.; Danna, P.S.C.; Saba, L.; Mehmedovic, A.; Faa, G.; Singh, I.M.; et al. COVLIAS 2.0-cXAI: Cloud-Based Explainable Deep Learning System for COVID-19 Lesion Localization in Computed Tomography Scans. Diagnostics 2022, 12, 1482.
34. Lombardo, E.; Hess, J.; Kurz, C.; Riboldi, M.; Marschner, S.; Baumeister, P.; Lauber, K.; Pflugradt, U.; Walch, A.; Canis, M.; et al. DeepClassPathway: Molecular pathway aware classification using explainable deep learning. Eur. J. Cancer 2022, 176, 41–49.
35. Toğaçar, M.; Muzoğlu, N.; Ergen, B.; Yarman, B.S.B.; Halefoğlu, A.M. Detection of COVID-19 findings by the local interpretable model-agnostic explanations method of types-based activations extracted from CNNs. Biomed. Signal Process. Control 2022, 71, 103128.
36. Hadj Bouzid, A.I.; Yahiaoui, S.; Lounis, A.; Berrani, S.-A.; Belbachir, H.; Naïli, Q.; Abdi, M.E.H.; Bensalah, K.; Belazzougui, D. DIAG a Diagnostic Web Application Based on Lung CT Scan Images and Deep Learning. Stud. Health Technol. Inform. 2021, 281, 332–336.
37. Yang, M.; Huang, X.; Huang, L.; Cai, G. Diagnosis of Parkinson’s disease based on 3D ResNet: The frontal lobe is crucial. Biomed. Signal Process. Control 2023, 85, 104904.
38. Li, Y.; Yang, H.; Li, J.; Chen, D.; Du, M. EEG-based intention recognition with deep recurrent-convolution neural network: Performance and channel selection by Grad-CAM. Neurocomputing 2020, 415, 225–233.
39. Ho, E.S.; Ding, Z. Electrocardiogram analysis of post-stroke elderly people using one-dimensional convolutional neural network model with gradient-weighted class activation mapping. Artif. Intell. Med. 2022, 130, 102342.
40. Mukhtorov, D.; Rakhmonova, M.; Muksimova, S.; Cho, Y.-I. Endoscopic Image Classification Based on Explainable Deep Learning. Sensors 2023, 23, 3176.
41. Taniguchi, H.; Takata, T.; Takechi, M.; Furukawa, A.; Iwasawa, J.; Kawamura, A.; Taniguchi, T.; Tamura, Y. Explainable Artificial Intelligence Model for Diagnosis of Atrial Fibrillation Using Holter Electrocardiogram Waveforms. Int. Heart J. 2021, 62, 534–539.
42. Ganeshkumar, M.; Ravi, V.; Sowmya, V.; Gopalakrishnan, E.A.; Soman, K.P. Explainable Deep Learning-Based Approach for Multilabel Classification of Electrocardiogram. IEEE Trans. Eng. Manag. 2021, 70, 2787–2799.
43. Jahmunah, V.; Ng, E.Y.K.; Tan, R.-S.; Oh, S.L.; Acharya, U.R. Explainable detection of myocardial infarction using deep learning models with Grad-CAM technique on ECG signals. Comput. Biol. Med. 2022, 146, 105550.
44. Chetoui, M.; Akhloufi, M.A. Explainable end-to-end deep learning for diabetic retinopathy detection across multiple datasets. J. Med. Imaging 2020, 7, 044503.
45. Deperlioglu, O.; Kose, U.; Gupta, D.; Khanna, A.; Giampaolo, F.; Fortino, G. Explainable framework for Glaucoma diagnosis by image processing and convolutional neural network synergy: Analysis with doctor evaluation. Future Gener. Comput. Syst. 2022, 129, 152–169.
46. Draelos, R.L.; Carin, L. Explainable multiple abnormality classification of chest CT volumes. Artif. Intell. Med. 2022, 132, 102372.
47. Islam, R.; Nahiduzzaman; Goni, O.F.; Sayeed, A.; Anower, S.; Ahsan, M.; Haider, J. Explainable Transformer-Based Deep Learning Model for the Detection of Malaria Parasites from Blood Cell Images. Sensors 2022, 22, 4358.
48. Hossain, S.I.; Herve, J.d.G.d.; Hassan, M.S.; Martineau, D.; Petrosyan, E.; Corbin, V.; Beytout, J.; Lebert, I.; Durand, J.; Carravieri, I.; et al. Exploring convolutional neural networks with transfer learning for diagnosing Lyme disease from skin lesion images. Comput. Methods Programs Biomed. 2022, 215, 106624.
49. Singh, P.; Sharma, A. Interpretation and Classification of Arrhythmia Using Deep Convolutional Network. IEEE Trans. Instrum. Meas. 2022, 71, 044503.
50. Choi, Y.; Lee, H. Interpretation of lung disease classification with light attention connected module. Biomed. Signal Process. Control 2023, 84, 104695.
51. Altuve, M.; Pérez, A. Intracerebral hemorrhage detection on computed tomography images using a residual neural network. Phys. Med. 2022, 99, 113–119.
52. Dabass, M.; Vashisth, S.; Vig, R. MTU: A multi-tasking U-net with hybrid convolutional learning and attention modules for cancer classification and gland Segmentation in Colon Histopathological Images. Comput. Biol. Med. 2022, 150, 106095.
53. Afify, H.M.; Mohammed, K.K.; Ella Hassanien, A. Novel prediction model on OSCC histopathological images via deep transfer learning combined with Grad-CAM interpretation. Biomed. Signal Process. Control 2023, 83, 104704.
54. Sunija, A.P.; Kar, S.; Gayathri, S.; Gopi, V.P.; Palanisamy, P. OctNET: A Lightweight CNN for Retinal Disease Classification from Optical Coherence Tomography Images. Comput. Methods Programs Biomed. 2021, 200, 105877.
55. Shabanpour, M.; Kaboodvand, N.; Iravani, B. Parkinson’s disease is characterized by sub-second resting-state spatio-oscillatory patterns: A contribution from deep convolutional neural network. NeuroImage Clin. 2022, 36, 103266.
56. Li, H.; Dong, X.; Shen, W.; Ge, F.; Li, H. Resampling-based cost loss attention network for explainable imbalanced diabetic retinopathy grading. Comput. Biol. Med. 2022, 149, 105970.
57. Lin, Q.-H.; Niu, Y.-W.; Sui, J.; Zhao, W.-D.; Zhuo, C.; Calhoun, V.D. SSPNet: An interpretable 3D-CNN for classification of schizophrenia using phase maps of resting-state complex-valued fMRI data. Med. Image Anal. 2022, 79, 102430.
58. Cai, J.; Xing, F.; Batra, A.; Liu, F.; Walter, G.A.; Vandenborne, K.; Yang, L. Texture analysis for muscular dystrophy classification in MRI with improved class activation mapping. Pattern Recognit. 2019, 86, 368–375.
59. Cruz-Bastida, J.P.; Pearson, E.; Al-Hallaq, H. Toward understanding deep learning classification of anatomic sites: Lessons from the development of a CBCT projection classifier. J. Med. Imaging 2022, 9, 045002.
60. Jiang, M.; Qiu, Y.; Zhang, W.; Zhang, J.; Wang, Z.; Ke, W.; Wu, Y.; Wang, Z. Visualization deep learning model for automatic arrhythmias classification. Physiol. Meas. 2022, 43, 085003.
61. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
62. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946.
63. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
64. Majnik, M.; Bosnić, Z. ROC analysis of classifiers in machine learning: A survey. Intell. Data Anal. 2013, 17, 531–558.
65. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
66. Tschandl, P. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 1–9.
67. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of breast ultrasound images. Data Brief 2020, 28, 104863.
68. Chitnis, S.; Hosseini, R.; Xie, P. Brain tumor classification based on neural architecture search. Sci. Rep. 2022, 12, 19206.
69. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.S.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131.e9.
70. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842.
71. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993.
72. Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; Li, B. Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs. arXiv 2020, arXiv:2008.02312.
73. Muhammad, M.B.; Yeasin, M. Eigen-CAM: Class Activation Map using Principal Components. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–7.
74. Jiang, P.-T.; Zhang, C.-B.; Hou, Q.; Cheng, M.-M.; Wei, Y. LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE Trans. Image Process. 2021, 30, 5875–5888.
75. Srinivas, S.; Fleuret, F. Full-Gradient Representation for Neural Network Visualization. arXiv 2019, arXiv:1905.00780.
Figure 1. PRISMA flowchart for the systematic review [15].
Figure 2. Annual publication count.
Figure 3. Dataset size distribution [16,17,18,19,20,21,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60].
Figure 4. Dataset size statistics.
Figure 5. Publication years of articles using large datasets.
Figure 6. Distribution of diseases.
Figure 7. The accuracy of each study [16,17,18,20,21,22,23,24,25,26,27,28,29,30,31,32,33,35,36,37,38,39,40,41,42,43,45,47,48,50,51,52,53,54,55,56,58,59].
Figure 8. DenseNet121 confusion matrices. ((a–e) represent the confusion matrices obtained from the classification tasks performed by DenseNet121 on the HAM10000, Diabetic Retinopathy Arranged, Breast Ultrasound Images, Brain Tumor Classification (MRI), and Chest X-ray Images datasets, respectively).
Figure 9. EfficientNet-B3 confusion matrices. ((a–e) represent the confusion matrices obtained from the classification tasks performed by EfficientNet-B3 on the HAM10000, Diabetic Retinopathy Arranged, Breast Ultrasound Images, Brain Tumor Classification (MRI), and Chest X-ray Images datasets, respectively).
Figure 10. ResNet50 confusion matrices. ((a–e) represent the confusion matrices obtained from the classification tasks performed by ResNet50 on the HAM10000, Diabetic Retinopathy Arranged, Breast Ultrasound Images, Brain Tumor Classification (MRI), and Chest X-ray Images datasets, respectively).
Figure 11. GoogleNet confusion matrices. ((a–e) represent the confusion matrices obtained from the classification tasks performed by GoogleNet on the HAM10000, Diabetic Retinopathy Arranged, Breast Ultrasound Images, Brain Tumor Classification (MRI), and Chest X-ray Images datasets, respectively).
Figure 12. Comparison of the effects of different CAM methods (in the saliency map, colors transition from cooler to warmer tones, such as blue to red; areas depicted in cooler colors are less influential or important to the model’s decision-making process, whereas warmer colors highlight regions of greater relevance).
Table 1. Risk of bias (✓).
Criteria assessed for each study: Inclusion Criteria; Feature Extraction Standard; Description of Disease Classification; Medical Image Inspection for Disease Diagnosis; Classification Model Description; Data Set Samples > 1000; Public Dataset; Datasets Description; Description of Interpretable Methods; Analysis of the Results after Using the Interpretable Method; Code Public.
Study | Risk of Bias
(Cui et al., 2022) [16] | Low
(J. Li et al., 2022) [17] | Moderate
(Penso et al., 2023) [18] | Low
(Zhang et al., 2022) [19] | Low
(Niranjan et al., 2023) [20] | Low
(Shorfuzzaman et al., 2021) [21] | Low
(Oztekin et al., 2023) [22] | Low
(Ruengchaijatuporn et al., 2022) [23] | Moderate
(Krishna et al., 2023) [24] | Low
(A. Singh et al., 2023) [25] | Low
(Khan et al., 2020) [26] | Low
(Vafaeezadeh et al., 2022) [27] | Low
(Woo et al., 2023) [28] | Moderate
(Kim et al., 2021) [29] | Low
(Baseri Saadi et al., 2022) [30] | Moderate
(Khan et al., 2022) [31] | Low
(Hamza et al., 2022) [32] | Low
(Suri et al., 2022) [33] | Low
(Lombardo et al., 2022) [34] | Low
(Toğaçar et al., 2022) [35] | Low
(Hadj Bouzid et al., 2021) [36] | Moderate
(Yang et al., 2023) [37] | Low
(Y. Li et al., 2020) [38] | Low
(Ho & Ding, 2022) [39] | Low
(Mukhtorov et al., 2023) [40] | Low
(Taniguchi et al., 2021) [41] | Low
(Ganeshkumar et al., 2021) [42] | Low
(Jahmunah et al., 2022) [43] | Low
(Chetoui & Akhloufi, 2020) [44] | Low
(Deperlioglu et al., 2022) [45] | Low
(Draelos & Carin, 2022) [46] | Low
(Islam et al., 2022) [47] | Low
(Hossain et al., 2022) [48] | Low
(P. Singh & Sharma, 2022) [49] | Low
(Choi & Lee, 2023) [50] | Low
(Altuve & Pérez, 2022) [51] | Low
(Dabass et al., 2022) [52] | Low
(Afify et al., 2023) [53] | Low
(A P et al., 2021) [54] | Low
(Shabanpour et al., 2022) [55] | Low
(H. Li et al., 2022) [56] | Low
(Lin et al., 2022) [57] | Low
(Cai et al., 2019) [58] | Moderate
(Cruz-Bastida et al., 2022) [59] | Moderate
(M. Jiang et al., 2022) [60] | Moderate
Table 2. Data summary table for the studies included.
Study | Objective | Model | Language | Dataset Size | Accuracy (Some Expressed by Other Properties) | Interpretable Methods
(Cui et al., 2022) [16] | Detect sleep status | CNN | English | 2022 | Accuracy of 73.22% | CAM
(J. Li et al., 2022) [17] | Diabetic tongue image classification method | ViT | English | 732 | 87.8% (best), 84.4% (average) | Grad-CAM
(Penso et al., 2023) [18] | Multiplanar reconstruction images for classification of coronary artery lesions | Res2Net | English | 288 | 0.87 | Grad-CAM
(Zhang et al., 2022) [19] | Early diagnosis of Alzheimer’s disease (AD) | DNN | English | 1407 | Enhances performance over existing models | Grad-CAM
(Niranjan et al., 2023) [20] | Localize the part of the lesion on the CT scan | VGG-16 | English | 14,000 | 0.9851 | Grad-CAM and Guided Grad-CAM
(Shorfuzzaman et al., 2021) [21] | Classification of diabetic retinopathy (DR) | CNN | English | 5590 | 0.97 | Grad-CAM
(Oztekin et al., 2023) [22] | End-to-end caries detection | EfficientNet-B0, DenseNet-121, ResNet-50 | English | 13,870 | 92%; ResNet-50 achieves the highest accuracy, with a sensitivity of 87.33% and an F1-score of 91.61% | Grad-CAM
(Ruengchaijatuporn et al., 2022) [23] | Detecting mild cognitive impairment | VGG16 | English | 918 | 0.75 (baseline) to 0.81; F1-score improved from 0.36 to 0.65 | Grad-CAM
(Krishna et al., 2023) [24] | Classification of histopathology images | ABN-DCN | English | 9109 | 98.7% accuracy; 3–4% performance boost | CAM
(A. Singh et al., 2023) [25] | Detection of obvious and less obvious occult scaphoid fractures from plain wrist radiographs | CNN | English | 525 | 85% sensitivity, 91% specificity, 90% accuracy, and an AUC of 0.88 | Grad-CAM
(Khan et al., 2020) [26] | Automatic detection of tympanic membrane (TM) and middle ear (ME) infections with CNN models | CNN | English | 2484 | 0.95 | CAM
(Vafaeezadeh et al., 2022) [27] | Automatic mitral valve classification using echocardiographic images | ResNeXt50 | English | 1773 | 80% | Grad-CAM
(Woo et al., 2023) [28] | Five-category sleep stage classification using images generated from temporal signal displays of the PSG dataset | CNN-LSTM | English | / | 0.83 | CAM
(Kim et al., 2021) [29] | Detect cerebral hemorrhages and their locations | ResNet | English | 200 | Accuracy of 0.81 with a sensitivity of 0.67 and specificity of 0.86 | Grad-CAM
(Baseri Saadi et al., 2022) [30] | Automated classification of normal, affected, and osteonecrosis mandibular trabecular bone patterns | CNN | English | 402 | 0.96 | Grad-CAM
(Khan et al., 2022) [31] | Classification of COVID-19 from chest X-ray images | EfficientNet and VGG16/Whales-Elephants | English | 10,000 | Accuracies of 99.1%, 98.2%, and 96.7% | Grad-CAM
(Hamza et al., 2022) [32] | Classification of COVID-19 from chest X-ray images | DCNN/BO | English | 6000 | The highest is 99.4% | Grad-CAM
(Suri et al., 2022) [33] | COVID-19 lung diagnosis | DenseNet-121, DenseNet-169, and DenseNet-201 | English | 6000 | 0.96 (Dice similarity), 0.93 (Jaccard index), 0.99 (correlation coefficient), 95.99% (figure of merit) | Grad-CAM, Grad-CAM++, Score-CAM, FasterScore-CAM
(Lombardo et al., 2022) [34] | Predicts HPV status and allows patient-specific identification | CNN-DeepClassPathway | English | 300 | 0.96 (ROC-AUC), 0.90 (PR-AUC) | Grad-CAM
(Toğaçar et al., 2022) [35] | Classification of COVID-19 from chest X-ray images | ResNet | English | 1500 | 0.9962 | Grad-CAM
(Hadj Bouzid et al., 2021) [36] | Classification of COVID-19 from chest X-ray images | CNN (ResNet-50 + EfficientNet-b7 + DenseNet-161) | English | 36,000 | More than 92% accuracy | Grad-CAM and Fast-CAM
(Yang et al., 2023) [37] | Detect Parkinson’s disease | 3D ResNet | English | 259 | 96.1 | Grad-CAM
(Y. Li et al., 2020) [38] | Intent identification and channel selection | RNN | English | 1500 | 97.36% at the full channel; decoding performance of 92.31% | Grad-CAM
(Ho & Ding, 2022) [39] | Electrocardiogram (ECG) analysis to differentiate post-stroke from non-stroke | DNN | English | 35,000 | The stroke model is about 90% accurate with an ROC area of 0.95 | Grad-CAM
(Mukhtorov et al., 2023) [40] | Improve the diagnostic accuracy of endoscopic examinations | ResNet-152 | English | 8000 | 89.29% to 93.46% | Grad-CAM, Hires-CAM, Layer-CAM, Grad-CAM++, XGrad-CAM, Score-CAM
(Taniguchi et al., 2021) [41] | Diagnosis of atrial fibrillation | CNN | English | 57,273 | 0.953 | Grad-CAM
(Ganeshkumar et al., 2021) [42] | Multi-label classification of ECG signals | CNN | English | 6300 | 0.962 | Grad-CAM
(Jahmunah et al., 2022) [43] | Classification according to location of myocardial involvement | DenseNet and CNN | English | 209 | More than 95% | Grad-CAM
(Chetoui & Akhloufi, 2020) [44] | Diagnosis of diabetic retinopathy (DR) | CNN/Inception-ResNet-v2 | English | 90,000 | AUC of 0.986, sensitivity of 0.958, and specificity of 0.971 | Grad-CAM
(Deperlioglu et al., 2022) [45] | Diagnosis of glaucoma | CNN | English | 781 | 93.5% | CAM
(Draelos & Carin, 2022) [46] | Explainable multiple abnormality classification in volumetric medical images | AxialNet | English | 36,316 | Imaging renderings; abnormal organ localization improved by 33% | HiResCAM
(Islam et al., 2022) [47] | Diagnosis of malaria parasites in blood cell images | CNN | English | 27,588 | 0.9641 | Grad-CAM
(Hossain et al., 2022) [48] | Diagnosing Lyme disease from images | 23 CNN models | English | 1672 | 84.42% ± 1.36 (accuracy), 0.9189 (AUC), 83.1% (precision), 87.93% (sensitivity), 80.65% ± 3.59 (specificity) | Grad-CAM
(P. Singh & Sharma, 2022) [49] | Explainable electrocardiogram (ECG) signal analysis | DL | English | 107,038 | / | K-GradCam
(Choi & Lee, 2023) [50] | Classification of lung diseases | CNN | English | 1021 | 92.56% | Grad-CAM
(Altuve & Pérez, 2022) [51] | Detection of intracerebral hemorrhage (ICH) | ResNet-18 | English | 200 | 0.9593 | Grad-CAM
(Dabass et al., 2022) [52] | Clinical gland morphometric information and cancer grade classification | U-Net | English | 10,165 | Highest accuracy rate of 99.97% for cancer classification | Grad-CAM
(Afify et al., 2023) [53] | Predicting oral squamous cell carcinoma (OSCC) from histopathology images | ResNet-101/EfficientNet-b0 | English | 1224 | 95.65% | Grad-CAM
(A P et al., 2021) [54] | Computer-aided classification of retinal disorders from normal retinal OCT images | CNN | English | 83,484 | 99.69% (accuracy), 99.69% (specificity), 99.69% (sensitivity) | Grad-CAM
(Shabanpour et al., 2022) [55] | EEG data analysis | DCNN | English | 90 | 0.61 | Grad-CAM
(H. Li et al., 2022) [56] | Explainable imbalanced diabetic retinopathy grading | CNN/ResNeXt-50 | English | 90,450 | More than 95% | Grad-CAM
(Lin et al., 2022) [57] | Classification of schizophrenia using phase maps | CNN | English | 82 | SSP achieves better performance than SSM | Grad-CAM
(Cai et al., 2019) [58] | Accurate MD image classification | CNN | English | 68 | 0.917 | ICAM
(Cruz-Bastida et al., 2022) [59] | Predict anatomical categories based on a single X-ray projection | VGG-16 | English | 6850 | Sensitivity ≥91% for head, neck, and thorax classes, and ≥82% for abdomen and pelvis classes | Grad-CAM
(M. Jiang et al., 2022) [60] | Automatic arrhythmia classification | ResNet | English | 6877 | 0.821 (F1-score) | Grad-CAM++
Table 3. The areas of diseases in the included studies.
Medical Area | Related Diseases | Quantity
Cardiovascular Diseases | Atrial fibrillation and myocardial infarction, classification of heart valves, coronary artery disease | 10
Pulmonary Diseases | Lung disease | 7
Ophthalmic Diseases | Diabetic retinopathy | 6
Orthopedic Conditions | Fractures | 2
Immunology/Infectious Diseases | HIV and AIDS (MIC and AD) | 2
Cancer | HPV-related head and neck cancer, oral squamous cell carcinoma, breast cancer | 3
Neurological Diseases | Parkinson’s disease (PD), mental disorders | 3
Neurosurgery | Intracerebral hemorrhage (ICH) | 2
Infectious Diseases | Coronavirus, Lyme disease, malaria | 4
Musculoskeletal Disorders | Muscular dystrophy | 1
Oral Health | Dental caries | 1
Sleep Disorders | Sleep assessment | 1
Gastrointestinal Diseases | Gastric cancer/endoscopic image classification | 1
Clinical Disease Research | Clinical gland morphometric information | 1
Ear, Nose, and Throat Diseases | Middle ear (ME) infection | 1
Table 4. Performance metrics.
Article Citation | Sensitivity | Specificity | F1-Score | AUC | IoU
(Penso et al., 2023) [18] | 90 | / | / | / | /
(Niranjan et al., 2023) [20] | / | / | / | / | 59.5
(Shorfuzzaman et al., 2021) [21] | 98 | / | / | 97.8 | /
(Oztekin et al., 2023) [22] | 87.33 | / | 91.61 | / | /
(Ruengchaijatuporn et al., 2022) [23] | / | / | 65 | 84 | /
(A. Singh et al., 2023) [25] | 85 | 91 | / | 88 | /
(Kim et al., 2021) [29] | 67 | 86 | / | / | /
(Baseri Saadi et al., 2022) [30] | / | 98 | 93 | / | /
(Suri et al., 2022) [33] | / | / | / | 99 | /
(Lombardo et al., 2022) [34] | / | / | / | 96 | /
(Taniguchi et al., 2021) [41] | 97.1 | / | 93.1 | / | /
(Ganeshkumar et al., 2021) [42] | / | / | 96.7 | / | /
(Chetoui & Akhloufi, 2020) [44] | 95.8 | 97.1 | / | 98.6 | /
(Deperlioglu et al., 2022) [45] | 97.7 | 92.6 | 95.7 | 95.1 | /
(Islam et al., 2022) [47] | / | / | 96.44 | 99.11 | /
(Hossain et al., 2022) [48] | 87.93 ± 1.47 | 80.65 ± 3.59 | / | 91.89 ± 1.15 | /
(Choi & Lee, 2023) [50] | 92.22 | 98.5 | 92.29 | / | /
(Altuve & Pérez, 2022) [51] | 95.65 | 96.2 | 95.91 | / | /
(A P et al., 2021) [54] | 99.69 | / | / | / | /
(Cruz-Bastida et al., 2022) [59] | 91 | / | / | / | /
(Jiang et al., 2022) [60] | / | / | 82.1 | / | /
Table 5. Comparison of model performance on the BT dataset.
Classification | Model | Accuracy | Precision | Recall | F1 Score
glioma_tumor | DenseNet121 | 0.85 | 0.98 | 0.42 | 0.59
meningioma_tumor | DenseNet121 | 0.91 | 0.76 | 0.99 | 0.86
no_tumor | DenseNet121 | 0.93 | 0.78 | 1 | 0.88
pituitary_tumor | DenseNet121 | 0.97 | 0.97 | 0.88 | 0.92
AVERAGE | DenseNet121 | 0.915 | 0.8725 | 0.8225 | 0.8125
glioma_tumor | EfficientNet_b3 | 0.86 | 1 | 0.44 | 0.61
meningioma_tumor | EfficientNet_b3 | 0.9 | 0.74 | 1 | 0.85
no_tumor | EfficientNet_b3 | 0.93 | 0.79 | 1 | 0.88
pituitary_tumor | EfficientNet_b3 | 0.97 | 1 | 0.84 | 0.91
AVERAGE | EfficientNet_b3 | 0.915 | 0.8825 | 0.82 | 0.8125
glioma_tumor | GoogleNet | 0.84 | 1 | 0.38 | 0.55
meningioma_tumor | GoogleNet | 0.89 | 0.72 | 1 | 0.84
no_tumor | GoogleNet | 0.94 | 0.81 | 1 | 0.9
pituitary_tumor | GoogleNet | 0.98 | 1 | 0.92 | 0.96
AVERAGE | GoogleNet | 0.9125 | 0.8825 | 0.825 | 0.8125
glioma_tumor | ResNet50 | 0.82 | 1 | 0.28 | 0.44
meningioma_tumor | ResNet50 | 0.86 | 0.68 | 1 | 0.81
no_tumor | ResNet50 | 0.94 | 0.81 | 1 | 0.89
pituitary_tumor | ResNet50 | 0.98 | 1 | 0.91 | 0.95
AVERAGE | ResNet50 | 0.9 | 0.8725 | 0.7975 | 0.7725
Table 6. Comparison of model performance on the chest X-ray dataset.
Classification | Model | Accuracy | Precision | Recall | F1 Score
normal | DenseNet121 | 0.97 | 0.99 | 0.92 | 0.95
pneumonia | DenseNet121 | 0.97 | 0.96 | 0.99 | 0.97
AVERAGE | DenseNet121 | 0.97 | 0.975 | 0.955 | 0.96
normal | EfficientNet_b3 | 0.95 | 0.98 | 0.89 | 0.93
pneumonia | EfficientNet_b3 | 0.95 | 0.94 | 0.99 | 0.96
AVERAGE | EfficientNet_b3 | 0.95 | 0.96 | 0.94 | 0.95
normal | GoogleNet | 0.96 | 0.99 | 0.9 | 0.94
pneumonia | GoogleNet | 0.96 | 0.94 | 0.99 | 0.97
AVERAGE | GoogleNet | 0.96 | 0.97 | 0.95 | 0.955
normal | ResNet50 | 0.97 | 0.97 | 0.96 | 0.96
pneumonia | ResNet50 | 0.97 | 0.97 | 0.98 | 0.98
AVERAGE | ResNet50 | 0.97 | 0.97 | 0.97 | 0.97
Table 7. Comparison of model performance on the Breast Ultrasound dataset.
Classification | Model | Accuracy | Precision | Recall | F1 Score
benign | DenseNet121 | 0.78 | 0.92 | 0.67 | 0.77
malignant | DenseNet121 | 0.83 | 0.62 | 0.91 | 0.74
normal | DenseNet121 | 0.93 | 0.76 | 0.89 | 0.82
AVERAGE | DenseNet121 | 0.8467 | 0.7667 | 0.8233 | 0.7767
benign | EfficientNet_b3 | 0.74 | 0.88 | 0.62 | 0.73
malignant | EfficientNet_b3 | 0.77 | 0.54 | 0.91 | 0.68
normal | EfficientNet_b3 | 0.94 | 0.88 | 0.75 | 0.75
AVERAGE | EfficientNet_b3 | 0.8167 | 0.7667 | 0.76 | 0.72
benign | GoogleNet | 0.75 | 0.88 | 0.63 | 0.74
malignant | GoogleNet | 0.8 | 0.59 | 0.91 | 0.71
normal | GoogleNet | 0.9 | 0.69 | 0.71 | 0.7
AVERAGE | GoogleNet | 0.8167 | 0.72 | 0.75 | 0.7167
benign | ResNet50 | 0.75 | 0.95 | 0.58 | 0.72
malignant | ResNet50 | 0.78 | 0.56 | 0.95 | 0.71
normal | ResNet50 | 0.93 | 0.75 | 0.86 | 0.8
AVERAGE | ResNet50 | 0.82 | 0.7533 | 0.7967 | 0.7433
Table 8. Comparison of model performance on the DR dataset.
Classification | Model | Accuracy | Precision | Recall | F1 Score
0 | DenseNet121 | 0.86 | 0.87 | 0.96 | 0.91
1 | DenseNet121 | 0.93 | 0.25 | 0.02 | 0.04
2 | DenseNet121 | 0.89 | 0.64 | 0.6 | 0.62
3 | DenseNet121 | 0.97 | 0.48 | 0.48 | 0.48
4 | DenseNet121 | 0.99 | 0.78 | 0.52 | 0.62
AVERAGE | DenseNet121 | 0.928 | 0.604 | 0.516 | 0.534
0 | EfficientNet_b3 | 0.86 | 0.85 | 0.98 | 0.91
1 | EfficientNet_b3 | 0.93 | 0.34 | 0.03 | 0.05
2 | EfficientNet_b3 | 0.89 | 0.68 | 0.56 | 0.61
3 | EfficientNet_b3 | 0.98 | 0.58 | 0.29 | 0.38
4 | EfficientNet_b3 | 0.99 | 0.83 | 0.55 | 0.66
AVERAGE | EfficientNet_b3 | 0.93 | 0.656 | 0.482 | 0.522
0 | GoogleNet | 0.8 | 0.85 | 0.98 | 0.91
1 | GoogleNet | 0.93 | 0.38 | 0.03 | 0.06
2 | GoogleNet | 0.88 | 0.64 | 0.53 | 0.58
3 | GoogleNet | 0.98 | 0.49 | 0.26 | 0.34
4 | GoogleNet | 0.99 | 0.85 | 0.43 | 0.57
AVERAGE | GoogleNet | 0.916 | 0.642 | 0.446 | 0.492
0 | ResNet50 | 0.87 | 0.87 | 0.97 | 0.92
1 | ResNet50 | 0.93 | 0.5 | 0.02 | 0.03
2 | ResNet50 | 0.89 | 0.64 | 0.63 | 0.63
3 | ResNet50 | 0.98 | 0.57 | 0.29 | 0.39
4 | ResNet50 | 0.99 | 0.78 | 0.54 | 0.64
AVERAGE | ResNet50 | 0.932 | 0.672 | 0.49 | 0.522
Table 9. Comparison of model performance on the HAM10000 dataset.
Classification | Model | Accuracy | Precision | Recall | F1 Score
MEL | DenseNet121 | 0.98 | 0.66 | 0.7 | 0.68
NV | DenseNet121 | 0.99 | 0.85 | 0.9 | 0.87
BCC | DenseNet121 | 0.97 | 0.88 | 0.8 | 0.83
AKIEC | DenseNet121 | 1 | 0.82 | 0.78 | 0.8
BKL | DenseNet121 | 0.97 | 0.9 | 0.8 | 0.85
DF | DenseNet121 | 0.96 | 0.96 | 0.98 | 0.97
VASC | DenseNet121 | 1 | 1 | 0.86 | 0.93
AVERAGE | DenseNet121 | 0.9814 | 0.8671 | 0.8314 | 0.8471
MEL | EfficientNet_b3 | 0.98 | 0.7 | 0.71 | 0.71
NV | EfficientNet_b3 | 0.99 | 0.89 | 0.9 | 0.89
BCC | EfficientNet_b3 | 0.97 | 0.88 | 0.8 | 0.84
AKIEC | EfficientNet_b3 | 1 | 0.8 | 0.87 | 0.83
BKL | EfficientNet_b3 | 0.96 | 0.86 | 0.8 | 0.86
DF | EfficientNet_b3 | 0.95 | 0.95 | 0.98 | 0.97
VASC | EfficientNet_b3 | 1 | 0.96 | 0.86 | 0.91
AVERAGE | EfficientNet_b3 | 0.9786 | 0.8629 | 0.8457 | 0.8586
MEL | GoogleNet | 0.98 | 0.78 | 0.64 | 0.7
NV | GoogleNet | 0.98 | 0.8 | 0.88 | 0.84
BCC | GoogleNet | 0.96 | 0.86 | 0.75 | 0.8
AKIEC | GoogleNet | 1 | 0.82 | 0.78 | 0.8
BKL | GoogleNet | 0.96 | 0.84 | 0.79 | 0.81
DF | GoogleNet | 0.95 | 0.94 | 0.98 | 0.96
VASC | GoogleNet | 1 | 1 | 0.83 | 0.91
AVERAGE | GoogleNet | 0.9757 | 0.8629 | 0.8071 | 0.8314
MEL | ResNet50 | 0.98 | 0.77 | 0.76 | 0.76
NV | ResNet50 | 0.99 | 0.88 | 0.88 | 0.88
BCC | ResNet50 | 0.97 | 0.9 | 0.85 | 0.87
AKIEC | ResNet50 | 1 | 1 | 0.74 | 0.85
BKL | ResNet50 | 0.96 | 0.87 | 0.79 | 0.83
DF | ResNet50 | 0.96 | 0.95 | 0.98 | 0.97
VASC | ResNet50 | 1 | 0.9 | 0.9 | 0.9
AVERAGE | ResNet50 | 0.98 | 0.8957 | 0.8429 | 0.8657
Table 10. Comparison of different CAM methods.
Technique | Description | Requires Network Modification
Grad-CAM | Uses convolutional layer gradients to create heatmaps, highlighting crucial image regions for decision-making. | No
GradCAM++ | Improved Grad-CAM with refined localization and gradient weighting for multi-class tasks. | No
XGradCAM | Adjusts Grad-CAM gradients and activation maps for better consistency and interpretability. | No
Ablation-CAM | Generates activation maps by ablating convolutions; focuses on feature importance for decision-making. | Yes
EigenCAM | Uses PCA for gradient-independent visualization of key features in convolutional layers. | No
EigenGradCAM | Hybrid of EigenCAM’s PCA and Grad-CAM’s gradients for comprehensive activation maps. | Yes
LayerCAM | Accumulates information across layers for a broader perspective on model decisions. | No
FullGrad | Combines input image gradients and convolutions for detailed class activation maps. | Yes
Table 11. Average IoU results of eight explainable methods.
Model | GradCAM | GradCAM++ | XGradCAM | AblationCAM | EigenCAM | EigenGradCAM | LayerCAM | FullGrad
DenseNet121-a | 0.29 | 0.269 | 0.178 | 0.297 | 0.36 | 0.372 | 0.24 | 0.085
EfficientNet_b3-a | 0.307 | 0.266 | 0.307 | 0.295 | 0.005 | 0.007 | 0.239 | 0.273
GoogleNet-a | 0 | 0 | 0 | 0.011 | 0.181 | 0.192 | 0 | \
ResNet50-a | 0.27 | 0.263 | 0.27 | 0.271 | 0.26 | 0.269 | 0.254 | 0.192
DenseNet121-b | 0.376 | 0.599 | 0.369 | 0.457 | 0.566 | 0.594 | 0.587 | 0.651
EfficientNet_b3-b | 0.46 | 0.471 | 0.46 | 0.471 | 0.039 | 0.046 | 0.495 | 0.762
GoogleNet-b | 0.132 | 0.174 | 0.132 | 0.133 | 0.056 | 0.094 | 0.211 | \
ResNet50-b | 0.023 | 0.244 | 0.023 | 0.036 | 0.017 | 0.019 | 0.376 | 0.455