Review

Interpretable Medical Imagery Diagnosis with Self-Attentive Transformers: A Review of Explainable AI for Health Care

by
Tin Lai
School of Computer Science, The University of Sydney, Camperdown, NSW 2006, Australia
BioMedInformatics 2024, 4(1), 113-126; https://doi.org/10.3390/biomedinformatics4010008
Submission received: 30 July 2023 / Revised: 20 December 2023 / Accepted: 3 January 2024 / Published: 8 January 2024

Abstract
Recent advancements in artificial intelligence (AI) have facilitated its widespread adoption in primary medical services, addressing the demand–supply imbalance in healthcare. Vision Transformers (ViT) have emerged as state-of-the-art computer vision models, benefiting from self-attention modules. However, compared to traditional machine learning approaches, deep learning models are complex and are often treated as a “black box” that can cause uncertainty regarding how they operate. Explainable artificial intelligence (XAI) refers to methods that explain and interpret machine learning models’ inner workings and how they come to decisions, which is especially important in the medical domain to guide healthcare decision-making processes. This review summarizes recent ViT advancements and interpretative approaches to understanding the decision-making process of ViT, enabling transparency in medical diagnosis applications.

1. Introduction

Artificial intelligence (AI) has made significant strides in various domains in recent years, revolutionizing industries and shaping how we approach complex problems. One of AI’s most remarkable applications is in medical imaging [1], where it has brought about unprecedented advancements in automated image analysis, diagnosis, and decision making. Medical imaging is one of the most common clinical diagnostic methods [2]. These images vary in properties based on the medical diagnosis and specific anatomical locations, such as the skin [3,4,5,6], chest [3,7,8,9,10], brain [11,12], liver [13], and others. Deep learning algorithms have found numerous critical applications in the healthcare domain, ranging from diabetes detection [14] to genomics [15] and mental health support [16]. Among the latest breakthroughs in computer vision models, Vision Transformers (ViT) [17] have emerged by leveraging self-attention mechanisms to achieve state-of-the-art performance in various visual tasks.
As medical professionals increasingly rely on AI-powered systems to aid in diagnosis and treatment planning [18], the need for interpretability and transparency in AI models becomes paramount [19]. Deep learning models, including ViTs, often exhibit highly complex and intricate internal representations, making it challenging for experts to comprehend their decision-making process. The opaque nature of these models raises concerns about their reliability and safety, especially in critical applications such as medical diagnostics, where accurate and trustworthy results are of the utmost importance [20]. Explainable artificial intelligence (XAI) is a burgeoning field that seeks to bridge the gap between the black box nature of AI algorithms and the need for understandable and interpretable decision-making processes [21]. XAI addresses a fundamental challenge: How can we make AI’s decision-making process more transparent and comprehensible to both experts and non-experts? While complex models might achieve impressive accuracy, their inability to provide human-readable explanations hinders their adoption in critical applications such as healthcare, finance, and legal domains. This limitation not only undermines users’ trust but also poses ethical and regulatory concerns. The integration of XAI can also lead to improved collaboration between AI systems and human experts, as well as the identification of novel patterns and insights that might have been overlooked otherwise.
In the realm of XAI, several techniques contribute to enhancing the transparency and trustworthiness of complex machine learning models. Local interpretable model-agnostic explanations (LIMEs) offer insights into any model’s predictions by approximating its behavior with interpretable surrogate models [22]. LIME is model-agnostic, meaning it is applicable to most AI models without relying on any specific model architecture. Gradient-based saliency methods, like Grad-CAM, illuminate the model-specific regions of input data that contribute most to predictions, fostering an understanding of where the model focuses its attention [23]. Furthermore, in the medical domain, decision understanding is often achieved through interactive dashboards that visualize model outcomes and insights, allowing end users to assess predictions, contributing factors, and uncertainties for informed decision making. These concepts collectively illuminate the intricate inner workings of machine learning models, promoting transparency and user confidence.
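To make the LIME idea concrete, the sketch below shows a simplified LIME-style explanation for an image classifier: it perturbs a regular grid of patches (a simplification of the superpixel segmentation used by the original method), queries a PyTorch classifier `model` on the perturbed images, and fits a weighted linear surrogate whose coefficients score each patch. The model, grid size, zero baseline, and similarity weighting are illustrative assumptions rather than the reference LIME implementation.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge

def lime_patch_importance(model, image, target_class, grid=8, n_samples=500, device="cpu"):
    """Approximate LIME: perturb a grid of image patches and fit a linear surrogate.

    image: float tensor of shape (3, H, W), already normalized for `model`.
    Returns a (grid, grid) array of patch importance scores for `target_class`.
    """
    model.eval()
    _, H, W = image.shape
    ph, pw = H // grid, W // grid

    # Random binary masks: 1 keeps a patch, 0 replaces it with zeros (the "baseline").
    masks = np.random.randint(0, 2, size=(n_samples, grid * grid))
    probs = np.empty(n_samples)

    with torch.no_grad():
        for i, m in enumerate(masks):
            perturbed = image.clone()
            for p, keep in enumerate(m):
                if not keep:
                    r, c = divmod(p, grid)
                    perturbed[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = 0.0
            logits = model(perturbed.unsqueeze(0).to(device))
            probs[i] = torch.softmax(logits, dim=1)[0, target_class].item()

    # Weight samples by similarity to the unperturbed image (fraction of kept patches).
    weights = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, probs, sample_weight=weights)

    # Surrogate coefficients estimate the contribution of each patch to the prediction.
    return surrogate.coef_.reshape(grid, grid)
```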
This article explores recent advancements in Vision Transformers for medical image analysis and presents a concise review of interpretative approaches for understanding the decision-making process of ViT models. Through a detailed analysis of state-of-the-art techniques, we aim to pave the way for transparent and reliable AI systems beneficial for medical diagnosis, ultimately improving patient outcomes and healthcare practices. In the following, we delve into general explainable AI in the medical industry and methods for quantifying model outputs. Subsequently, we provide preliminaries on Vision Transformers, followed by the latest state-of-the-art approaches for interpreting and visualizing model outputs. Finally, we discuss the current direction of interpretable medical models and explore potential future directions.

2. Interpretable Model: eXplainable AI (XAI)

In recent years, AI has witnessed remarkable advancements, showcasing its potential to transform industries and reshape the way we interact with technology. From autonomous vehicles to medical diagnostics, AI systems are increasingly being integrated into various domains to make complex decisions and predictions. However, as AI applications become more ubiquitous, concerns have arisen regarding their transparency, accountability, and trustworthiness [24]. When an AI model is used in mission-critical applications such as medical diagnosis [25], cybersecurity [26], or autonomous driving [27], transparency enables stakeholders to comprehend the rationale behind the model’s predictions, ensuring accountability, identifying potential biases, and facilitating trust in high-stakes scenarios.
Transparency in machine learning (ML), also known as interpretability or explainability, aims to uncover the inner workings of intricate models [28]. From a human-centered design standpoint, transparency is not an inherent property of the ML model but rather a relationship between the algorithm and its users. Therefore, achieving transparency requires prototyping and user evaluations to develop solutions that promote transparency [25]. In specialized and high-stakes domains like medical image analysis, adhering to human-centered design principles proves challenging due to restricted access to end users and the knowledge disparity between users and ML designers.
The concept of eXplainable AI (XAI) [21] has emerged as a crucial research area that seeks to shed light on the inner workings of AI models and elucidate the factors contributing to their decisions. By providing interpretable insights into the decision-making process, XAI bridges the gap between the complexity of deep learning models and the human understanding required in critical domains like medical imaging. In the context of medical imaging, XAI holds immense potential to revolutionize how medical professionals interact with AI systems [29]. By uncovering the underlying rationales behind AI-generated diagnoses and highlighting the relevant features driving the decisions, XAI enhances medical practitioners’ diagnostic accuracy and confidence and enables them to make informed and responsible clinical decisions [30]. Ultimately, this gives medical professionals and patients greater confidence to trust diagnostic recommendations and to let them guide their decisions. Moreover, interpretable AI models facilitate the identification and rectification of biases, enabling fair and unbiased decision making.
Several key concepts are crucial to ensuring that an AI model is interpretable and explainable. Each plays a pivotal role in shaping the development, acceptance, and effectiveness of AI systems across various domains, ensuring their reliability, ethical use, and practicality in real-world applications.
Data Requirements: Data form the foundation of AI models. Adequate and high-quality data are essential for training robust and accurate AI systems across various domains. The requirements for data encompass both quantity and quality. Sufficient volumes of diverse, well-annotated data are necessary to develop models that can generalize effectively [31]. Data quality directly impacts the performance and reliability of AI systems. Moreover, considerations such as representativeness, balance, and relevance of the data are pivotal to ensure unbiased and effective learning [32]. In domains like healthcare or medical diagnosis, access to comprehensive and diverse datasets is crucial to train AI models that accurately reflect real-world scenarios.
Computational Costs: The development and deployment of AI models often entail substantial computational costs. Training complex models, especially those employing deep learning techniques, demands significant computational resources [33]. These include high-performance computing infrastructures, specialized hardware like GPUs or TPUs, and considerable energy consumption. These computational demands can pose challenges, particularly for smaller organizations or research groups with limited resources. Balancing model performance with computational costs becomes pivotal in optimizing AI systems for practical usage in resource-constrained environments, such as inside a hospital [34]. Techniques like model compression, transfer learning, or advancements in hardware aim to mitigate these computational burdens.
Importance of Interpretability: Interpretability in AI refers to the capability of understanding and explaining how an AI model arrives at its decisions or predictions. In critical domains where human lives and safety are at stake, the ability to comprehend the reasoning behind AI-generated outcomes is paramount [35]. Interpretability enhances user trust, facilitates model debugging, enables error analysis, and aids in identifying biases. Different stakeholders, including regulators, domain experts, and end users, benefit from interpretability, fostering confidence and acceptance of AI systems [36].
Transparency: Transparency in AI systems involves making the decision-making process of AI models understandable and accessible to relevant stakeholders. It includes revealing model architecture, data sources, and algorithmic processes [37]. Transparent AI models contribute to accountability, enabling scrutiny and identification of potential biases or errors. Transparency fosters compliance and regulatory adherence. It also facilitates better collaboration between AI developers and end users, improving the overall reliability and usability of AI applications. Trustworthy AI systems not only fulfill their intended functions but also consider ethical norms, fairness, and user expectations [38]. Trust instills confidence in users, ensuring acceptance and uptake of AI technologies.
XAI methods allow end users, such as clinicians, to understand, verify, and troubleshoot the decisions made by these systems [39]. The interpretability of ML models is crucial for ensuring accountability and trust from physicians and patients. For instance, a model detecting pneumonia is less likely to be trusted if it cannot explain why a patient received that diagnosis. In contrast, a model that provides insights into its reasoning is more likely to be appreciated and accepted. Interpretable ML systems offer explanations that enable users to assess the reliability of forecasts and recommendations, helping them make informed decisions based on the underlying logic. Moreover, addressing the potential biases in machine learning systems is essential to ensure fair ratings for individuals of all racial and socioeconomic backgrounds [40]. The widespread use of predictive algorithms, as seen in streaming services and social networks, has raised concerns about their societal impact, including the deskilling of professionals like doctors [41]. Therefore, while the application of machine learning techniques in healthcare is inevitable, establishing standardized criteria for interpretable ML in this field is urgently needed to enhance transparency, fairness, and safety.

3. Preliminaries on Vision Transformers

Vision Transformer (ViT) [17] is a deep learning model that has gained significant attention in computer vision. In contrast to traditional convolutional neural networks (CNNs), which have been the dominant architecture for image recognition tasks, ViT adopts a transformer-based architecture inspired by its success in natural language processing (NLP) [42]. ViT breaks down an image into fixed-size patches, which are then linearly embedded and processed using a transformer encoder to capture global contextual information. This approach allows ViT to handle local and global image features effectively, leading to remarkable performance in various computer vision tasks, including image classification and object detection.
Generally, a Vision Transformer consists of a patch embedding layer and several consecutively connected encoders, as depicted in Figure 1. The self-attention layer is the key component that enables ViT to achieve state-of-the-art performance on many vision recognition tasks. The self-attention layer first transforms each input token into three different vectors: the query, key, and value vectors. The attention layer then computes scores between each pair of vectors to determine the degree of attention each token assigns to the others.
Formally, given an image $x \in \mathbb{R}^{H \times W \times 3}$, the patch embedding layer first splits and flattens the sample $x$ into sequential patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot d)}$, where $(H, W)$ represents the height and width of the input image, $(P, P)$ is the resolution of each image patch, $d$ denotes the output channel, and $N = HW/P^2$ is the number of image tokens. The list of patch tokens is then fed into the Transformer encoders for attention calculation.
Each Transformer encoder mainly consists of two types of sub-layers: a multi-head self-attention (MSA) layer and an MLP layer. In MSA, the tokens are linearly projected and reformulated into three vectors, namely $Q$, $K$, and $V$. The self-attention calculation on $Q$, $K$, and $V$ is given by
$$ x'_{\ell} = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, $$
where $x'_{\ell}$ are the tokens produced by MSA at the $\ell$-th layer. The output tokens $x'_{\ell}$ are further normalized with Layer Normalization (LN) and sent to an MLP block consisting of two fully connected layers with a GELU activation [43] in between:
$$ x_{\ell} = \mathrm{MLP}(\mathrm{LN}(x'_{\ell})) + x'_{\ell}, $$
where $x_{\ell}$ is the output of the $\ell$-th encoder block. At the last Transformer layer, the high-dimensional embeddings are used for various downstream tasks, e.g., training an object recognition model.
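The following is a compact PyTorch sketch of the patch embedding and one encoder block described above. The pre-norm placement of Layer Normalization, the embedding dimension, and the number of heads follow the common ViT configuration and are illustrative assumptions rather than choices taken from any specific paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly embed them (N = HW / P^2 tokens)."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim)

class EncoderBlock(nn.Module):
    """One ViT encoder block: MSA followed by an MLP with GELU, both with residuals."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)              # Q, K, V are projections of the same tokens
        x = x + a                              # x'_l = Attention(Q, K, V) + x_{l-1}
        return x + self.mlp(self.norm2(x))     # x_l = MLP(LN(x'_l)) + x'_l

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 768)
out = EncoderBlock()(tokens)                         # (1, 196, 768)
```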

4. Explainability Methods in XAI

The importance of interpretability in machine learning models is widely acknowledged, but defining what constitutes interpretability remains a challenge [44]. Various definitions have been proposed, emphasizing openness, accuracy, reliability, and understandability [44,45]. However, these definitions often overlook the user’s perspective, and their needs are not adequately addressed in the produced explanations [46]. This is especially relevant in interpretable machine learning systems, where the audience’s understanding and trust in the models are crucial.
Interpretability becomes even more critical in medical imaging as it influences clinicians’ decision making and patients’ acceptance of the model’s predictions. Interpretable machine learning systems offer valuable insights into their reasoning, helping users, such as clinicians, comprehend and verify predictions, ensuring fairness and unbiased outcomes for diverse populations. As deep learning algorithms find numerous applications in healthcare, the demand for interpretable models grows, necessitating the establishment of uniform criteria for interpretable ML in this vital domain. The following summarizes explainability methods that are commonly used in the XAI field.

4.1. Gradient-Weighted Class Activation Mapping (Grad-CAM) Method

Grad-CAM is a gradient-based interpretability technique introduced by Selvaraju et al. [23] that aims to generate a localization map of the significant regions in an image that contribute the most to the decision made by a neural network. Leveraging the spatial information retained in convolutional layers, Grad-CAM utilizes the gradients propagated to the last convolutional layer to attribute importance values to each network neuron with respect to the final decision. An appealing advantage of Grad-CAM over similar methods is its applicability without requiring re-training or architectural changes, making it readily adaptable to various CNN-based models. Moreover, combined with guided back-propagation through element-wise multiplication, known as Guided Grad-CAM, it enables the generation of high-resolution and class-discriminative visualizations [23].
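As an illustration, the sketch below implements Grad-CAM with a forward hook on the last convolutional block of a ResNet-18; the choice of model and target layer is an assumption, and any CNN layer that retains spatial feature maps can be used instead.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()     # use pretrained weights in practice
target_layer = model.layer4               # last convolutional block (assumed target layer)

feats, grads = {}, {}

def save_activations(module, inputs, output):
    feats["a"] = output                                   # (1, C, h, w) feature maps
    output.register_hook(lambda g: grads.update(a=g))     # gradient w.r.t. those maps

target_layer.register_forward_hook(save_activations)

def grad_cam(x, class_idx=None):
    """Return an (H, W) Grad-CAM heatmap for `class_idx` (default: the predicted class)."""
    logits = model(x)                                  # forward pass stores the activations
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                    # backward pass stores the gradients

    # Channel weights: spatially averaged gradients (Grad-CAM's alpha coefficients).
    alpha = grads["a"].mean(dim=(2, 3), keepdim=True)              # (1, C, 1, 1)
    cam = F.relu((alpha * feats["a"]).sum(dim=1, keepdim=True))    # (1, 1, h, w)
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalise to [0, 1]
    return cam[0, 0].detach()

heatmap = grad_cam(torch.randn(1, 3, 224, 224))   # overlay on the input image to inspect
```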

4.1.1. Saliency Maps

Saliency Maps, introduced by Simonyan et al. [47], are a gradient-based visualization technique that sheds light on the contribution of individual pixels in an image to its final classification by a neural network. This method involves a backward pass through the network to calculate the gradients of the loss function with respect to the input’s pixels [48]. Doing so reveals the impact of each pixel during the back-propagation step, providing insights into how much each pixel affects the final classification, particularly concerning a specific class of interest. The result can be interpreted as another image, either the same size as the input image or easily projectable onto it, highlighting the pixels that most strongly attribute the image to a specific class [47].
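A minimal PyTorch sketch of a vanilla gradient saliency map is shown below; the reduction over colour channels by taking the maximum absolute gradient follows the original formulation, while the model and input are placeholders.

```python
import torch

def saliency_map(model, x, class_idx=None):
    """Vanilla gradient saliency: |d score / d pixel|, reduced over colour channels.

    x: input tensor of shape (1, 3, H, W). Returns an (H, W) saliency map.
    """
    model.eval()
    x = x.clone().requires_grad_(True)
    scores = model(x)
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()
    model.zero_grad()
    scores[0, class_idx].backward()        # gradient of the class score w.r.t. the input
    return x.grad.abs().amax(dim=1)[0]     # max over channels, as in Simonyan et al.
```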

4.1.2. Concept Activation Vectors (CAVs)

Concept Activation Vectors (CAVs) [49] represent an interpretability technique that offers global explanations for neural networks based on user-defined concepts [48]. To leverage CAVs, two datasets need to be gathered: one containing instances relevant to the desired concept and the other comprising unrelated images serving as a random reference. For a chosen layer of the network, a binary classifier is trained on the corresponding activations to separate instances related to the concept of interest from unrelated ones. The CAV is then derived as the coefficient vector of this binary classifier. Testing with CAVs (TCAV) averages the concept-based contributions from the relevant dataset and compares them to the contributions from the random dataset with respect to the class of interest. Consequently, CAVs establish connections, both positive and negative, between high-level user-defined concepts and classes. This approach is particularly useful in the medical field, where medical specialists can conveniently relate the defined concepts to existing classes without delving into the intricacies of neural networks [49].
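The sketch below illustrates the CAV/TCAV workflow under simplifying assumptions: activations are taken from a single user-chosen layer, the linear probe is a scikit-learn logistic regression, and the TCAV score is the fraction of class examples with a positive directional derivative along the CAV. All model, layer, and image inputs are placeholders.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def layer_activations(model, layer, images):
    """Collect flattened activations of `layer` for a batch of images."""
    store = {}
    handle = layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    with torch.no_grad():
        model(images)
    handle.remove()
    return store["a"].flatten(1)                      # (batch, features)

def compute_cav(model, layer, concept_imgs, random_imgs):
    """Linear probe separating concept from random activations; its weights form the CAV."""
    acts = torch.cat([layer_activations(model, layer, concept_imgs),
                      layer_activations(model, layer, random_imgs)]).numpy()
    labels = np.array([1] * len(concept_imgs) + [0] * len(random_imgs))
    clf = LogisticRegression(max_iter=1000).fit(acts, labels)
    return torch.tensor(clf.coef_[0], dtype=torch.float32)

def tcav_score(model, layer, cav, class_imgs, class_idx):
    """Fraction of class examples whose class score increases along the CAV direction."""
    store = {}
    handle = layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    positives = 0
    for img in class_imgs:
        logits = model(img.unsqueeze(0))
        grad = torch.autograd.grad(logits[0, class_idx], store["a"])[0].flatten()
        positives += int(torch.dot(grad, cav) > 0)    # sign of the directional derivative
    handle.remove()
    return positives / len(class_imgs)
```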

4.1.3. Deep Learning Important Features (DeepLift)

Deep Learning Important Features, commonly known as DeepLift [50], is an explainability method capable of determining contribution scores by comparing the difference in neuron activation to a reference behavior. By employing back-propagation, DeepLift quantifies the contribution of each input feature when decomposing the output prediction. By comparing the output difference between the original input and a reference input, DeepLift can assess how much an input deviates from the reference. One of the significant advantages of DeepLift is its ability to overcome issues related to gradient zeroing or discontinuities, making it less susceptible to misleading biases and capable of recognizing dependencies that other methods may overlook. However, carefully considering the reference input and output is essential for achieving meaningful results using DeepLift [50].
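A short usage sketch with the DeepLift implementation from the Captum library is given below; the tiny stand-in classifier and the all-zero baseline are illustrative assumptions, and the choice of reference input should be made carefully in practice, as noted above.

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift

# A small stand-in classifier (assumption); DeepLift is applied the same way to larger CNNs.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
).eval()

x = torch.randn(1, 3, 64, 64)
baseline = torch.zeros_like(x)          # reference input: an all-zero image

dl = DeepLift(model)
target = model(x).argmax(dim=1).item()
# Contribution scores measure how much each pixel moves the output away from the
# reference behaviour obtained at the baseline input.
attributions = dl.attribute(x, baselines=baseline, target=target)
print(attributions.shape)               # same shape as the input: (1, 3, 64, 64)
```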

4.1.4. Layer-Wise Relevance Propagation (LRP)

Layer-wise Relevance Propagation (LRP) [51] is an explainability technique that provides transparent insights into complex neural network models, even across different input modalities such as text, images, and videos. LRP propagates the prediction backward through the model, redistributing the relevance received by each neuron to the neurons of the lower layers so that the total relevance is conserved. With an appropriate choice of parameters and LRP rules, high-quality explanations can be obtained even for intricate models.
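The sketch below implements the basic epsilon rule of LRP for a small fully connected ReLU network; it is a didactic simplification (no sign-stabilised epsilon, no special input-layer rule) rather than a full LRP implementation, and the toy network is an assumption.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lrp_epsilon(layers, x, target, eps=1e-6):
    """LRP with the basic epsilon rule for a small fully connected ReLU network.

    `layers` is a list of nn.Linear modules (a ReLU is assumed between them).
    Returns the relevance of each input feature for the `target` output unit.
    """
    # Forward pass, storing the activation that enters each linear layer.
    activations = [x]
    a = x
    for i, lin in enumerate(layers):
        a = lin(a)
        if i < len(layers) - 1:
            a = torch.relu(a)
        activations.append(a)

    # Initialise relevance with the score of the chosen output unit only.
    relevance = torch.zeros_like(activations[-1])
    relevance[0, target] = activations[-1][0, target]

    # Propagate relevance backwards, layer by layer (epsilon-stabilised z-rule).
    for i in reversed(range(len(layers))):
        a_in = activations[i]                        # (1, n_in): input to layer i
        w = layers[i].weight                         # (n_out, n_in)
        z = a_in @ w.t() + layers[i].bias + eps      # stabilised pre-activations
        s = relevance / z                            # (1, n_out)
        relevance = a_in * (s @ w)                   # (1, n_in): redistributed relevance
    return relevance

mlp = [nn.Linear(10, 16), nn.Linear(16, 3)]          # toy network (assumption)
rel = lrp_epsilon(mlp, torch.randn(1, 10), target=1)
```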

4.1.5. Guided Back-Propagation

Guided back-propagation [52] is an explanation method that combines back-propagation with the deconvolution approach: a gradient is only propagated back through a ReLU unit if both the forward activation and the backward gradient are positive, i.e., values for which at least one of the two is negative are masked out. By introducing a guidance signal from the higher layer into the back-propagation process, guided back-propagation prevents the backward flow of negative gradients, which correspond to neurons that decrease the activation of the higher-layer unit being visualized. The technique does not require the switches used in deconvolution and allows visualization of both the intermediate and final layers of a neural network.
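A minimal PyTorch sketch of guided back-propagation is shown below: backward hooks on every ReLU clamp negative gradients so that only signals increasing the visualized activation flow back to the input. The small CNN in the usage lines is a placeholder, and non-inplace ReLU modules are assumed.

```python
import torch
import torch.nn as nn

def guided_backprop(model, x, class_idx):
    """Guided back-propagation: at every ReLU, block negative gradients on the way back."""
    handles = []
    for module in model.modules():
        if isinstance(module, nn.ReLU):
            # grad_in[0] is the gradient flowing from this ReLU towards the input; keeping
            # only its positive part retains signals that increase the visualized activation.
            handles.append(module.register_full_backward_hook(
                lambda m, grad_in, grad_out: (torch.clamp(grad_in[0], min=0.0),)))

    x = x.clone().requires_grad_(True)
    model.zero_grad()
    model(x)[0, class_idx].backward()
    grad = x.grad.detach().clone()

    for h in handles:
        h.remove()
    return grad

# Usage with a small placeholder CNN.
cnn = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 62 * 62, 2))
grad_map = guided_backprop(cnn.eval(), torch.randn(1, 3, 64, 64), class_idx=0)
```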

5. Vision Transformer for Medical Images

ViTs have proven to be effective in solving a wide range of vision problems, thanks to their capability to capture long-range relationships in data. Unlike CNNs, which rely on the inductive bias of locality within their convolutional layers, vanilla ViTs directly learn these relationships from the data. However, the success of ViTs has also brought challenges in interpreting their decision-making process, mainly due to their long-range reasoning capabilities. In this section, we will first review some applications of ViTs in medical domains, where they are used as black box methods. Subsequently, we will discuss other models that aim to interpret ViT’s model output and provide explanations for their predictions.

5.1. Black Box Methods

TransMed [53] is a pioneering work that introduces Vision Transformers (ViTs) for medical image classification. The architecture combines the strengths of convolutional neural networks (CNNs) for extracting low-level features with ViTs for encoding global context. TransMed focuses on classifying parotid tumors in multi-modal MRI medical images and employs a novel image fusion strategy to effectively capture mutual information from different modalities, yielding competitive results on the authors’ privately collected parotid tumor classification dataset.
Lu et al. [54] propose a two-stage framework for glioma sub-type classification in brain images. The framework performs contrastive pre-training and then uses a transformer-based sparse attention module for feature aggregation. Their approach demonstrates its effectiveness through ablation studies on the TCGA-NSCLC dataset [55]. Gheflati and Rivaz [56] systematically evaluated pure and hybrid pre-trained ViT models for breast cancer classification. Their experiments on two breast ultrasound datasets show that ViT-based models outperform CNNs in classifying images into benign, malignant, and normal categories.
Several other works employ hybrid Transformer–CNN architectures for medical image classification in different organs. For instance, Khan and Lee [57] propose Gene-Transformer to predict lung cancer subtypes, showcasing its superiority over CNN baselines on the TCGA-NSCLC dataset [55]. Chen et al. [58] present a multi-scale GasHis-Transformer for diagnosing gastric cancer in the stomach, demonstrating strong generalization ability across other histopathological imaging datasets. Jiang et al. [59] propose a hybrid model combining convolutional and transformer layers for diagnosing acute lymphocytic leukemia, utilizing a symmetric cross-entropy loss function.

5.2. Interpretable Vision Transformer

Interpretable vision models aim to reveal the most influential features contributing to a model’s decision. We can visualize the regions that contribute most to ViT’s predictions with methods such as saliency-based techniques and Grad-CAM. Thanks to their interpretability, these models are particularly valuable in building trust among physicians and patients, making them suitable for practical implementation in clinical settings. Table 1 provides a high-level overview of existing state-of-the-art interpretability methods that are specifically designed for transformer models. A naïve method that only visualizes the last attentive block will often be uninformative. In addition, some interpretability methods are class-agnostic, which means the visualization remains the same regardless of the predicted class (e.g., rollout [60]). In contrast, other attribution methods can produce different interpretations for different target classes (e.g., transformer attribution [61]).
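For illustration, the sketch below implements attention rollout [60] given the per-layer attention maps of a ViT; the equal weighting of the attention map and the identity (modeling the skip connection) follows the common formulation, and the randomly generated attention maps in the usage lines are placeholders for maps collected with forward hooks on the attention layers.

```python
import torch

def attention_rollout(attentions, residual=True):
    """Attention rollout [60]: multiply (head-averaged) attention maps across layers.

    attentions: list of tensors of shape (batch, heads, tokens, tokens), one per layer.
    Returns (batch, tokens, tokens); the [CLS] row gives per-patch relevance.
    """
    result = None
    for attn in attentions:
        a = attn.mean(dim=1)                               # average over heads
        if residual:
            # Model the skip connection as an identity contribution, then re-normalise rows.
            eye = torch.eye(a.size(-1), device=a.device)
            a = 0.5 * a + 0.5 * eye
            a = a / a.sum(dim=-1, keepdim=True)
        result = a if result is None else torch.bmm(a, result)
    return result

# Usage sketch: `attn_maps` would normally be collected with forward hooks on each layer.
attn_maps = [torch.softmax(torch.randn(1, 12, 197, 197), dim=-1) for _ in range(12)]
rollout = attention_rollout(attn_maps)
cls_to_patches = rollout[0, 0, 1:]          # relevance of the 196 patch tokens to [CLS]
```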
ViT-based methods have been applied to COVID-19 diagnosis [62], where low-level CXR features are extracted from a pre-trained self-supervised backbone network. SimCLR [63] is a popular backbone trained with contrastive learning. The backbone network extracts abnormal CXR feature embeddings from the CheXpert dataset [64], which the ViT model then uses for high-level COVID-19 diagnosis. Extensive experiments on three CXR test datasets from different hospitals show the approach’s superiority over CNN-based models; the authors also validate its generalization ability and use saliency map visualizations [61] for interpretability. Similarly, COVID-ViT [65] is another ViT-based model for classifying COVID from non-COVID images in the MIA-COVID19 challenge [66]. Experiments on 3D CT lung images demonstrate the ViT-based approach’s superiority over the DenseNet baseline [67] in terms of F1 score.
In another work, Mondal et al. [10] introduce xViTCOS for COVID-19 screening from lung CT and X-ray images (see Figure 2). The xViTCOS model is first pre-trained on ImageNet to learn generic image representations and then fine-tuned on a large chest radiographic dataset. Additionally, xViTCOS employs an explainability-driven saliency-based approach [61] with clinically interpretable visualizations to highlight the critical factors behind its predictions. The model is experimentally evaluated on the COVID CT-2A dataset [68] as well as on chest X-rays and is effective in identifying abnormal cases.
Table 1. Summary of interpretability approaches for transformer models. Class-specific refers to whether the approach can attribute different attentive scores specific to the predicted class (in multi-class predictions). The metrics used to evaluate each method are pixel accuracy, mean average precision (mAP), mean F1 score (mF1), and mean intersection over union (mIoU).
| Interpretability Method | Class-Specific? | Pixel Acc. | mAP | mF1 | mIoU | Highlights and Summary |
|---|---|---|---|---|---|---|
| Raw Attention | No | 67.87 | 80.24 | 29.44 | 46.37 | Only considers the attention map of the last block of the transformer architecture |
| Rollout [60] | No | 73.54 | 84.76 | 43.68 | 55.42 | Assumes a linear combination of tokens and quantifies the influence of skip connections with an identity matrix |
| GradCAM [23] | Yes | 65.91 | 71.60 | 19.42 | 41.30 | Provides a class-specific explanation by adding weights to the gradient-based feature map |
| Partial LRP [69] | No | 76.31 | 84.67 | 38.82 | 57.95 | Considers the information flow within the network by identifying the most important heads in each encoder layer through relevance propagation |
| Transformer Attribution [61] | Yes | 76.30 | 85.28 | 41.85 | 58.34 | Combines relevancy and attention-map gradients, using the gradient as a weight on the relevance for a given prediction task |
| Generic Attribution [70] | Yes | 79.68 | 85.99 | 40.10 | 61.92 | Extends Transformer attribution to co-attention- and self-attention-based models with a generic relevancy update rule |
| Token-wise Approx. [71] | Yes | 82.15 | 88.04 | 45.72 | 66.32 | Uses head-wise and token-wise approximations to visualize token interactions in the pooled vector with a noise-decreasing strategy |
Shome et al. [72] introduce another ViT-based model for the diagnosis of COVID-19 infection at a large scale. They combined multiple open-source COVID-19 CXR datasets to form a comprehensive multi-class and binary classification dataset, and implemented Grad-CAM-based visualization [23] to enhance visual representation and model interpretability. The Transformer-based Multiple Instance Learning (TransMIL) architecture proposed by Shao et al. [73] aims to address whole slide brain tumor classification. Their approach embeds patches from whole slide images (WSI) into the feature space of a ResNet-50 model. The sequence of embedded features then undergoes a series of processing steps in their proposed pipeline, including squaring the sequence, correlation modeling, conditional position encoding using the Pyramid Position Encoding Generator (PPEG) module, local information fusion, feature aggregation, and mapping from the transformer space to the label space. This approach holds promise for accurate brain tumor classification, as illustrated in their work [73]. The self-attention module in transformers can leverage global interactions between encoder features, while cross-attention in the skip connections allows a fine spatial recovery. For example, Figure 3 highlights the attention level across the whole image for a segmentation task in a U-Net Transformer architecture [74,75].
In whole slide imaging (WSI)-based pathology diagnosis, annotating individual instances can be expensive and laborious. Therefore, a label is assigned to a set of instances known as a “bag”. This type of weakly supervised learning is called Multiple Instance Learning (MIL) [76], where a bag is labeled positive if at least one instance is positive, and negative when all instances in the bag are negative. However, most current MIL methods assume that the instances in each bag are independent and identically distributed, overlooking any correlations among different instances.
To address this limitation, Shao et al. [73] propose TransMIL, a novel approach that explores morphological and spatial information in weakly supervised WSI classification. Their method aggregates morphological information using two transformer-based modules and a position encoding layer. To encode spatial information, they introduce a pyramid position encoding generator. TransMIL achieves state-of-the-art performance on three computational pathology datasets: CAMELYON16 (breast) [77], TCGA-NSCLC (lung) [55], and TCGA-R (kidney). Their approach demonstrates superior performance and faster convergence than CNN-based state-of-the-art methods, making it a promising and interpretable solution for histopathology classification. Attention-based ViT can further derive instance probabilities to highlight regions of interest; for example, AB-MIL [78] uses the derived instance probabilities for feature distillation, as shown in Figure 4. Attentive methods have also been used to interpret the classification of retinal images [79].
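The sketch below shows a generic attention-based MIL pooling module in the spirit of AB-MIL [78]: attention weights aggregate instance (patch) embeddings into a bag embedding and simultaneously serve as per-instance relevance scores. Dimensions and the patch encoder are assumptions, and this is not the TransMIL pipeline itself.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention pooling for Multiple Instance Learning (in the spirit of AB-MIL).

    Each bag is a set of instance embeddings; the attention weights both aggregate the
    bag and indicate which instances (e.g., WSI patches) drive the bag-level prediction.
    """
    def __init__(self, in_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                     # bag: (num_instances, in_dim)
        scores = self.attention(bag)            # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)  # attention over instances in the bag
        bag_embedding = (weights * bag).sum(dim=0, keepdim=True)    # (1, in_dim)
        return self.classifier(bag_embedding), weights.squeeze(-1)  # logits, instance weights

# Usage: instance embeddings could come from a frozen CNN/ViT patch encoder.
bag = torch.randn(100, 512)                     # 100 patch embeddings from one slide
logits, instance_weights = AttentionMILPooling()(bag)
```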
For the diagnosis of lung tumors, Zheng et al. [80] propose the graph transformer network (GTN), leveraging a graph-based representation of WSI. GTN consists of a graph convolutional layer [81], a transformer, and a pooling layer. Additionally, GTN utilizes GraphCAM [61] to identify regions highly associated with the class label. Thorough evaluations on the TCGA dataset [55] demonstrate the effectiveness of GTN in accurately diagnosing lung tumors. This graph-based approach provides valuable insight into the spatial relationships among regions, enhancing the interpretability of the classification results.

6. Conclusions

In this review, we explored the advancements and applications of interpretability techniques in the context of deep learning algorithms applied to various domains, with a particular focus on the medical field. Interpretable machine learning models have become increasingly essential as complex models like Transformers and Vision Transformers (ViTs) gain popularity due to their impressive performance in various tasks.
We discussed several interpretability methods, such as Grad-CAM and Saliency Maps, which provide insights into the decision-making processes of deep learning models. These methods allow users, including medical practitioners and researchers, to understand and trust the predictions made by these models. Moreover, they enable the identification of biases, contributing factors, and regions of importance, leading to more informed and accountable decision making.

Medical images play a crucial role in clinical diagnostics, offering valuable insights into various anatomical areas. Deep learning algorithms have shown exceptional capabilities in healthcare applications, including medical image analysis and patient risk prediction. However, interpretability is particularly important in medical image analysis as it empowers end users, such as clinicians and patients, to understand and trust the model’s decisions. Interpretable machine learning models provide explanations behind their predictions, allowing users to assess and validate the output before making critical decisions. This transparency is essential for ensuring fairness, mitigating biases, and building trust in machine learning systems, especially in healthcare. By fostering transparency and accountability, interpretable ML in medical imaging has the potential to revolutionize patient care and enhance healthcare practices globally.
In the medical domain, interpretable deep learning models have shown promising results in various applications, such as COVID-19 diagnosis, brain tumor classification, breast cancer detection, and lung tumor subtyping. By visualizing the learned features and attention scores, these models achieve high accuracy and provide valuable insights into disease detection and classification, ultimately improving patient care and treatment decisions.

7. Limitations and Future Directions

There are several limitations that ViTs might exhibit when applied in the medical imagery domain. While global context understanding is one of the strengths of self-attentive ViTs, it can also become a limitation. In tasks where local features are crucial, ViTs might struggle to focus on specific localized patterns, potentially leading to suboptimal performance. Therefore, ViTs are often paired with convolutional layers to account for feature extraction at the local scale [82]. Self-attentive ViTs also often require large amounts of training data to generalize well, which is a limitation in the medical domain, where collecting large annotated datasets is challenging. In addition, self-attention mechanisms, especially when applied to large input images, can be computationally expensive. The quadratic complexity of self-attention hampers its scalability to high-resolution images, making it less efficient than convolutional neural networks (CNNs) for some tasks. Training large ViT models on medical images can therefore pose a practical difficulty due to the computational resources required when working with high-resolution inputs. As a result, small clinics without access to high-performance computational resources might need to rely on external resources to train or run inference with ViT models.
While interpretable machine learning has made significant progress, several exciting avenues exist for future research and development. For example, developing unified frameworks that combine multiple interpretability techniques can provide a more comprehensive understanding of model decisions. These frameworks should be scalable and adaptable to various deep learning architectures, enabling users to choose the most suitable methods for their specific applications. The uncertainty within the dataset should also be quantifiable. Integrating uncertainty estimation into interpretable models can enhance their reliability and robustness. Uncertainty quantification is crucial, especially in critical medical decision-making scenarios, where a model’s confidence can significantly impact patient outcomes. Figure 5 illustrates an example of using uncertainty estimation to improve medical image segmentation with a transformer-based model. In particular, the proposed model in [83] takes medical images as input and generates an uncertainty map representation in an unsupervised manner. The uncertain region in the uncertainty map is then used to reduce the possibility of noise sampling and to encourage consistent network outputs. Such an approach can predict classes with a greater degree of separation (see the right of Figure 5), which in turn improves transparency in medical diagnosis and reduces misclassification. Similar approaches that account for uncertainty within the dataset can encourage models to produce grounded predictions. By addressing these future directions, we can enhance the trust, transparency, and effectiveness of deep learning models in medical applications, ultimately improving patient outcomes and healthcare practices.
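As a simple illustration of uncertainty estimation, the sketch below uses a Monte-Carlo dropout baseline (not the uncertainty-guided contrastive approach of [83]): dropout is kept active at test time, and the spread of the stochastic predictions serves as an uncertainty estimate. The toy network is a placeholder.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Monte-Carlo dropout: keep dropout active at test time and average stochastic passes.

    Returns the mean class probabilities and their per-class standard deviation,
    a simple (and commonly used) uncertainty estimate.
    """
    model.eval()
    for m in model.modules():                  # re-enable dropout layers only
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 3))
mean_prob, uncertainty = mc_dropout_predict(net, torch.randn(1, 32))
```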

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Esteva, A.; Chou, K.; Yeung, S.; Naik, N.; Madani, A.; Mottaghi, A.; Liu, Y.; Topol, E.; Dean, J.; Socher, R. Deep learning-enabled medical computer vision. NPJ Digit. Med. 2021, 4, 5. [Google Scholar] [CrossRef] [PubMed]
  2. Shung, K.K.; Smith, M.B.; Tsui, B.M. Principles of Medical Imaging; Academic Press: Cambridge, MA, USA, 2012. [Google Scholar]
  3. Hu, B.; Vasu, B.; Hoogs, A. X-MIR: EXplainable Medical Image Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 440–450. [Google Scholar]
  4. Lucieri, A.; Bajwa, M.N.; Braun, S.A.; Malik, M.I.; Dengel, A.; Ahmed, S. ExAID: A Multimodal Explanation Framework for Computer-Aided Diagnosis of Skin Lesions. arXiv 2022, arXiv:2201.01249. [Google Scholar] [CrossRef] [PubMed]
  5. Stieler, F.; Rabe, F.; Bauer, B. Towards Domain-Specific Explainable AI: Model Interpretation of a Skin Image Classifier using a Human Approach. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 1802–1809. [Google Scholar] [CrossRef]
  6. Lucieri, A.; Bajwa, M.N.; Braun, S.A.; Malik, M.I.; Dengel, A.; Ahmed, S. On interpretability of deep learning based skin lesion classifiers using concept activation vectors. In Proceedings of the 2020 international joint conference on neural networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–10. [Google Scholar]
  7. Lenis, D.; Major, D.; Wimmer, M.; Berg, A.; Sluiter, G.; Bühler, K. Domain aware medical image classifier interpretation by counterfactual impact analysis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2020; pp. 315–325. [Google Scholar]
  8. Brunese, L.; Mercaldo, F.; Reginelli, A.; Santone, A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput. Methods Programs Biomed. 2020, 196, 105608. [Google Scholar] [CrossRef]
  9. Corizzo, R.; Dauphin, Y.; Bellinger, C.; Zdravevski, E.; Japkowicz, N. Explainable image analysis for decision support in medical healthcare. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 4667–4674. [Google Scholar]
  10. Mondal, A.K.; Bhattacharjee, A.; Singla, P.; Prathosh, A. xViTCOS: Explainable vision transformer based COVID-19 screening using radiography. IEEE J. Transl. Eng. Health Med. 2021, 10, 1–10. [Google Scholar] [CrossRef]
  11. Bang, J.S.; Lee, M.H.; Fazli, S.; Guan, C.; Lee, S.W. Spatio-Spectral Feature Representation for Motor Imagery Classification Using Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3038–3049. [Google Scholar] [CrossRef] [PubMed]
  12. Li, J.; Shi, H.; Hwang, K.S. An explainable ensemble feedforward method with Gaussian convolutional filter. Knowl.-Based Syst. 2021, 225, 107103. [Google Scholar] [CrossRef]
  13. Mohagheghi, S.; Foruzan, A.H. Developing an explainable deep learning boundary correction method by incorporating cascaded x-Dim models to improve segmentation defects in liver CT images. Comput. Biol. Med. 2022, 140, 105106. [Google Scholar] [CrossRef]
  14. Hu, H.; Lai, T.; Farid, F. Feasibility Study of Constructing a Screening Tool for Adolescent Diabetes Detection Applying Machine Learning Methods. Sensors 2022, 22, 6155. [Google Scholar] [CrossRef]
  15. Yang, S.; Zhu, F.; Ling, X.; Liu, Q.; Zhao, P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front. Genet. 2021, 12, 607471. [Google Scholar] [CrossRef]
  16. Lai, T.; Shi, Y.; Du, Z.; Wu, J.; Fu, K.; Dou, Y.; Wang, Z. Psy-LLM: Scaling up Global Mental Health Psychological Services with AI-based Large Language Models. arXiv 2023, arXiv:2307.11991. [Google Scholar]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  18. Ker, J.; Wang, L.; Rao, J.; Lim, T. Deep learning applications in medical image analysis. IEEE Access 2017, 6, 9375–9389. [Google Scholar] [CrossRef]
  19. MacDonald, S.; Steven, K.; Trzaskowski, M. Interpretable AI in healthcare: Enhancing fairness, safety, and trust. In Artificial Intelligence in Medicine: Applications, Limitations and Future Directions; Springer: Berlin/Heidelberg, Germany, 2022; pp. 241–258. [Google Scholar]
  20. Ghosh, A.; Kandasamy, D. Interpretable artificial intelligence: Why and when. Am. J. Roentgenol. 2020, 214, 1137–1138. [Google Scholar] [CrossRef] [PubMed]
  21. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  22. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  23. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  24. Shin, D. User perceptions of algorithmic decisions in the personalized AI system: Perceptual evaluation of fairness, accountability, transparency, and explainability. J. Broadcast. Electron. Media 2020, 64, 541–565. [Google Scholar] [CrossRef]
  25. Balasubramaniam, N.; Kauppinen, M.; Hiekkanen, K.; Kujala, S. Transparency and explainability of AI systems: Ethical guidelines in practice. In Proceedings of the International Working Conference on Requirements Engineering: Foundation for Software Quality, Birmingham, UK, 21–24 March 2022; pp. 3–18. [Google Scholar]
  26. Lai, T.; Farid, F.; Bello, A.; Sabrina, F. Ensemble Learning based Anomaly Detection for IoT Cybersecurity via Bayesian Hyperparameters Sensitivity Analysis. arXiv 2023, arXiv:2307.10596. [Google Scholar]
  27. Imai, T. Legal regulation of autonomous driving technology: Current conditions and issues in Japan. IATSS Res. 2019, 43, 263–267. [Google Scholar] [CrossRef]
  28. Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar]
  29. Banerjee, A.; Chakraborty, C.; Rathi Sr, M. Medical imaging, artificial intelligence, internet of things, wearable devices in terahertz healthcare technologies. In Terahertz Biomedical and Healthcare Technologies; Elsevier: Amsterdam, The Netherlands, 2020; pp. 145–165. [Google Scholar]
  30. Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161. [Google Scholar] [CrossRef]
  31. He, Z.; Tang, X.; Yang, X.; Guo, Y.; George, T.J.; Charness, N.; Quan Hem, K.B.; Hogan, W.; Bian, J. Clinical trial generalizability assessment in the big data era: A review. Clin. Transl. Sci. 2020, 13, 675–684. [Google Scholar] [CrossRef]
  32. Autio, L.; Juhola, M.; Laurikkala, J. On the neural network classification of medical data and an endeavour to balance non-uniform data sets with artificial data extension. Comput. Biol. Med. 2007, 37, 388–397. [Google Scholar] [CrossRef]
  33. Chen, C.; Zhang, P.; Zhang, H.; Dai, J.; Yi, Y.; Zhang, H.; Zhang, Y. Deep learning on computational-resource-limited platforms: A survey. Mob. Inf. Syst. 2020, 2020, 8454327. [Google Scholar] [CrossRef]
  34. Abirami, S.; Chitra, P. Energy-efficient edge based real-time healthcare support system. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2020; Volume 117, pp. 339–368. [Google Scholar]
  35. Zhang, Z.; Genc, Y.; Wang, D.; Ahsen, M.E.; Fan, X. Effect of ai explanations on human perceptions of patient-facing ai-powered healthcare systems. J. Med. Syst. 2021, 45, 64. [Google Scholar] [CrossRef] [PubMed]
  36. Hong, S.R.; Hullman, J.; Bertini, E. Human factors in model interpretability: Industry practices, challenges, and needs. Proc. Acm -Hum.-Comput. Interact. 2020, 4, 1–26. [Google Scholar] [CrossRef]
  37. Felzmann, H.; Fosch-Villaronga, E.; Lutz, C.; Tamò-Larrieux, A. Towards transparency by design for artificial intelligence. Sci. Eng. Ethics 2020, 26, 3333–3361. [Google Scholar] [CrossRef] [PubMed]
  38. World Health Organization. Ethics and Governance of Artificial Intelligence for Health: WHO Guidance; World Health Organization: Geneva, Switzerland, 2021. [Google Scholar]
  39. Ahmad, M.A.; Teredesai, A.; Eckert, C. Interpretable Machine Learning in Healthcare. In Proceedings of the 2018 IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; p. 447. [Google Scholar] [CrossRef]
  40. Burrell, J. How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data Soc. 2016, 3, 2053951715622512. [Google Scholar] [CrossRef]
  41. Arnold, M.H. Teasing out artificial intelligence in medicine: An ethical critique of artificial intelligence and machine learning in medicine. J. Bioethical Inq. 2021, 18, 121–139. [Google Scholar] [CrossRef] [PubMed]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  43. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  44. Lipton, Z.C. The Mythos of Model Interpretability. arXiv 2016, arXiv:1606.03490. [Google Scholar] [CrossRef]
  45. Freitas, A.A. Comprehensible Classification Models: A Position Paper. SIGKDD Explor. Newsl. 2014, 15, 1–10. [Google Scholar] [CrossRef]
  46. Miller, T. Explanation in Artificial Intelligence: Insights from the Social Sciences. arXiv 2017, arXiv:1706.07269. [Google Scholar] [CrossRef]
  47. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps. arXiv 2013, arXiv:1312.6034. [Google Scholar] [CrossRef]
  48. Molnar, C. Interpretable Machine Learning; Lulu Press: Morrisville, NC, USA, 2020. [Google Scholar]
  49. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; Sayres, R. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). arXiv 2017, arXiv:1711.11279. [Google Scholar] [CrossRef]
  50. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 3145–3153. [Google Scholar]
  51. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar] [CrossRef] [PubMed]
  52. Springenberg, J.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for Simplicity: The All Convolutional Net. arXiv 2014, arXiv:1412.6806. [Google Scholar]
  53. Dai, Y.; Gao, Y.; Liu, F. Transmed: Transformers advance multi-modal medical image classification. Diagnostics 2021, 11, 1384. [Google Scholar] [CrossRef] [PubMed]
  54. Lu, M.; Pan, Y.; Nie, D.; Liu, F.; Shi, F.; Xia, Y.; Shen, D. SMILE: Sparse-Attention based Multiple Instance Contrastive Learning for Glioma Sub-Type Classification Using Pathological Images. In Proceedings of the MICCAI Workshop on Computational Pathology, PMLR, Virtual Event, 27 September 2021; pp. 159–169. [Google Scholar]
  55. Napel, S.; Plevritis, S.K. NSCLC Radiogenomics: Initial Stanford Study of 26 Cases. The Cancer Imaging Archive, 2014. Available online: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=6883610 (accessed on 29 July 2023).
  56. Gheflati, B.; Rivaz, H. Vision Transformers for Classification of Breast Ultrasound Images. arXiv 2021, arXiv:2110.14731. [Google Scholar]
  57. Khan, A.; Lee, B. Gene Transformer: Transformers for the Gene Expression-based Classification of Lung Cancer Subtypes. arXiv 2021, arXiv:2108.11833. [Google Scholar]
  58. Chen, H.; Li, C.; Li, X.; Wang, G.; Hu, W.; Li, Y.; Liu, W.; Sun, C.; Yao, Y.; Teng, Y.; et al. GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification. arXiv 2021, arXiv:2104.14528. [Google Scholar]
  59. Jiang, Z.; Dong, Z.; Wang, L.; Jiang, W. Method for Diagnosis of Acute Lymphoblastic Leukemia Based on ViT-CNN Ensemble Model. Comput. Intell. Neurosci. 2021, 2021, 7529893. [Google Scholar] [CrossRef]
  60. Abnar, S.; Zuidema, W. Quantifying attention flow in transformers. arXiv 2020, arXiv:2005.00928. [Google Scholar]
  61. Chefer, H.; Gur, S.; Wolf, L. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 782–791. [Google Scholar]
  62. Park, S.; Kim, G.; Oh, Y.; Seo, J.B.; Lee, S.M.; Kim, J.H.; Moon, S.; Lim, J.K.; Ye, J.C. Vision Transformer for COVID-19 CXR Diagnosis using Chest X-ray Feature Corpus. arXiv 2021, arXiv:2103.07055. [Google Scholar]
  63. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  64. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597. [Google Scholar]
  65. Gao, X.; Qian, Y.; Gao, A. COVID-VIT: Classification of COVID-19 from CT chest images based on vision transformer models. arXiv 2021, arXiv:2107.01682. [Google Scholar]
  66. Kollias, D.; Arsenos, A.; Soukissian, L.; Kollias, S. MIA-COV19D: COVID-19 Detection through 3-D Chest CT Image Analysis. arXiv 2021, arXiv:2106.07524. [Google Scholar]
  67. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  68. Gunraj, H.; Sabri, A.; Koff, D.; Wong, A. COVID-Net CT-2: Enhanced Deep Neural Networks for Detection of COVID-19 from Chest CT Images Through Bigger, More Diverse Learning. arXiv 2021, arXiv:2101.07433. [Google Scholar] [CrossRef] [PubMed]
  69. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  70. Chefer, H.; Gur, S.; Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021; pp. 397–406. [Google Scholar]
  71. Chen, J.; Li, X.; Yu, L.; Dou, D.; Xiong, H. Beyond Intuition: Rethinking Token Attributions inside Transformers. Transactions on Machine Learning Research. 2023. Available online: https://openreview.net/pdf?id=rm0zIzlhcX (accessed on 29 July 2023).
  72. Shome, D.; Kar, T.; Mohanty, S.N.; Tiwari, P.; Muhammad, K.; AlTameem, A.; Zhang, Y.; Saudagar, A.K.J. COVID-Transformer: Interpretable COVID-19 Detection Using Vision Transformer for Healthcare. Int. J. Environ. Res. Public Health 2021, 18, 11086. [Google Scholar] [CrossRef]
  73. Shao, Z.; Bian, H.; Chen, Y.; Wang, Y.; Zhang, J.; Ji, X.; Zhang, Y. TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification. arXiv 2021, arXiv:2106.00908. [Google Scholar]
  74. Huang, J.; Xing, X.; Gao, Z.; Yang, G. Swin deformable attention u-net transformer (sdaut) for explainable fast mri. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 538–548. [Google Scholar]
  75. Petit, O.; Thome, N.; Rambour, C.; Themyr, L.; Collins, T.; Soler, L. U-net transformer: Self and cross attention for medical image segmentation. In Proceedings of the Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, 27 September 2021; Proceedings 12. Springer: Berlin/Heidelberg, Germany, 2021; pp. 267–276. [Google Scholar]
  76. Fung, G.; Dundar, M.; Krishnapuram, B.; Rao, R.B. Multiple instance learning for computer aided diagnosis. Adv. Neural Inf. Process. Syst. 2007, 19, 425. [Google Scholar]
  77. Bejnordi, B.E.; Veta, M.; Van Diest, P.J.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J.A.; Hermsen, M.; Manson, Q.F.; Balkenhol, M.; et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama 2017, 318, 2199–2210. [Google Scholar] [CrossRef]
  78. Zhang, H.; Meng, Y.; Zhao, Y.; Qiao, Y.; Yang, X.; Coupland, S.E.; Zheng, Y. DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18802–18812. [Google Scholar]
  79. Playout, C.; Duval, R.; Boucher, M.C.; Cheriet, F. Focused attention in transformers for interpretable classification of retinal images. Med. Image Anal. 2022, 82, 102608. [Google Scholar] [CrossRef]
  80. Zheng, Y.; Gindra, R.; Betke, M.; Beane, J.; Kolachalama, V.B. A deep learning based graph-transformer for whole slide image classification. medRxiv 2021. [Google Scholar] [CrossRef]
  81. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  82. Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part III 24. Springer: Abingdon, UK, 2021; pp. 171–180. [Google Scholar]
  83. Wang, T.; Lu, J.; Lai, Z.; Wen, J.; Kong, H. Uncertainty-guided pixel contrastive learning for semi-supervised medical image segmentation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, Vienna, Austria, 23–29 July 2022; pp. 1444–1450.
Figure 1. The basic framework of Vision Transformer (ViT) [17] and its encoder architecture.
Figure 2. Visualization of the interpretation results of xViTCOS [10] using the explainability method in [61]. The figures highlight the associated critical factors that explain the model’s decision making. (a) CXR of a patient with pneumonia; (b) CT scan of a patient with COVID-19.
Figure 3. Cross-attention maps with U-Transformer [75] for the yellow-crossed pixel (left image). The attention maps at each level highlight the different regions contributing to the segmentation. “Cross-attention Level 1” is an earlier layer focusing on a wide image region. In contrast, we can see that “Cross-attention Level 3”, which is closer to the model output, corresponds to high-resolution feature maps and focuses on more specific regions that explain its predictions.
Figure 4. Visualization of the probability derivation output from [78] for lung cancer region detection. Each pair of images contains (left) the ground truth with tumor regions delineated by blue lines and (right) the probability derivation output. Brighter cyan colors indicate higher probabilities of tumor at the corresponding locations. Most high-cyan regions localize the positive detection regions.
Figure 5. Visualization of the results of incorporating contrastive learning with a CNN–Transformer decoder [83]. (Left) Using the uncertainty region as guidance to implement contrastive learning on medical images in an unsupervised manner. (Right) Visualizing the separation of predicted categories with and without contrastive learning (dimensionality reduction using the t-SNE algorithm; marker colors represent pixel categories).