1. Introduction
The performance of machine learning systems heavily relies on data representation. Traditionally, in computer vision, data were represented using handcrafted methods designed to extract specific image features. However, this type of representation requires considerable effort from experts to design and develop techniques that do not always provide the expected return or performance [1]. In recent years, with the increase in the computational power of hardware devices, deep learning algorithms have emerged as a relevant alternative to handcrafted methods. Deep learning automatically transforms raw data, such as pixels from a digital image, into a feature vector, allowing the best data representation to be learned through training. This approach is advantageous because it learns hierarchical representations that capture intricate patterns and structures through multiple layers of abstraction. Also, deep learning models can be fine-tuned for various tasks through transfer learning, making them highly versatile across different domains [1,2].
Training representation learning methods, particularly advanced architectures like neural networks and transformers, can be challenging because they require a large amount of labeled data. Many real application contexts, including the medical context, suffer from low availability of labeled images and adverse conditions, such as class imbalance and a lack of standardization in dimensions and formats [3,4]. These facts compromise adequate training and may cause overfitting. One way to address these problems is by utilizing automatic data generation methods, such as generative adversarial networks (GANs) [5,6]. GANs allow the generation of synthetic images through competitive training between two neural networks, the generator (G) and the discriminator (D) [5]. During training, G tries to generate images that resemble authentic data, while D tries to classify correctly whether the images are original or generated. The adversarial training strategy is based on game theory, in which the objective is to reach the Nash equilibrium [7], where neither G nor D can unilaterally improve their outcomes. This situation occurs when G generates images that are so realistic that D cannot reliably determine if an image is real or fake. At this point, G has learned to produce data that effectively fools the discriminator [8]. Data augmentation through GANs is particularly important in small medical image datasets because it helps to address the issue of overfitting, where models become too tailored to the limited training data and fail to generalize well to new, unseen cases. By generating a wider variety of images, GANs create a more representative training set, which can lead to more robust and accurate models.
However, training GANs is challenging due to issues related to backpropagation, particularly in how G's weights are updated. During backpropagation, the gradients of G are derived from D's predictions, but G has no information on how the images' features contribute to the classification. Because of this, adversarial training tends to be an unbalanced game in which D generally has the advantage over G [9,10]. Consequently, D tends to assign higher scores to original images during training, and G fails to fool D even after the model converges [8,9]. Studies in the literature have traditionally aimed to improve GAN training by modifying the discriminator. For instance, in DCGAN, the first GAN proposal with convolutional layers, the authors used batch normalization and leaky ReLU activations between D's intermediate layers to make it more stable [11]. The loss function, however, was the Jensen–Shannon divergence [5], which can lead to mode collapse and vanishing gradients [8,11]. The WGAN-GP [12] addresses these problems by using the Wasserstein distance as a loss function, which provides a more meaningful measure of the difference between the probability distributions. WGAN-GP also introduces a gradient penalty term that penalizes the norm of D's gradients, encouraging more diverse generated samples by enforcing a more uniform distribution of gradients throughout the data space. The RAGAN [13] introduces a relativistic discriminator that estimates the probability that an authentic sample is more realistic than a fake sample and vice versa. It considers the relative realness of real and fake samples, providing more informative feedback to G. RAGAN delivers a more nuanced signal to G, allowing it to better understand how to generate plausible samples that approximate the actual data distribution.
Recent techniques have focused on improving the generator's training rather than the discriminator's. These approaches involve providing more information about the images' features to G and making the competition between G and D more balanced. For instance, Wang et al. [9] proposed a training technique that raises the spatial awareness of G. The strategy consists of sampling multi-level heatmaps from D using Grad-CAM and integrating them into the feature maps of G via a spatial encoding layer. The authors used D as a regularizer, aligning the spatial awareness of G with D's attention maps. Bai et al. [10] argue that since G's weights are updated only with gradients derived from D, D acts as a referee rather than a player. The authors propose a new training approach with a generator-leading task to make the adversarial game fairer. In this task, D must extract features that G can decode to reconstruct the input.
Another possible solution is to combine GANs with explainable artificial intelligence (XAI) in the loss function. In a classification task, XAI generates explanations indicating which parts of the input were considered the most critical in assigning the input to a label. Moreover, it is possible to use the model's gradients to derive these explanations [14]. A widely used method is Saliency [15], which creates explanations by calculating the gradients of the output with respect to the input features. It highlights regions where a slight change in the input would significantly change the prediction. However, this approach can produce noisy explanations. Gradient⊙Input [16] addresses this issue by calculating the elementwise multiplication of the gradients by the input. The input acts as a model-independent filter, which reduces noise and smooths the explanations. Another widely used method is DeepLIFT [16], which attributes importance to the input features by comparing the activations that the actual input and a reference input cause in each neuron. It uses the difference between the activations as importance scores for the input features. Gradient-based XAI generates detailed, fine-grained explanations at the pixel level with significant computational efficiency compared to XAI based on feature perturbation [17,18].
The images generated by a combination of GAN and XAI in the loss function can be used in a data augmentation task and potentially improve the classification performance of state-of-the-art methods, such as Transformer-based models [19,20,21,22]. These methods leverage self-attention mechanisms to enable holistic image understanding and achieve top-notch performance in visual recognition tasks [23,24,25,26,27,28,29]. ViT [19] is a prime example. It operates on a patch-based representation of images, using the self-attention mechanism to capture global dependencies and learn long-range relationships between the patches. PVT [20] introduces a hierarchical approach operating at different spatial resolutions to capture fine-grained details and high-level semantic information. It uses local–global and global–local attention modules: the local–global module attends to local and global features within the same input region, while the global–local module attends to the global representation while incorporating local information. DeiT [21] focuses on training efficiency on smaller datasets. It employs data augmentation and knowledge distillation, in which a teacher model guides the training of the Transformer-based student model. Learnable distillation and class tokens allow the student to learn from both the original labels and the teacher's predictions. CoAtNet [22] is a hybrid architecture that uses convolutional layers to extract local features in the initial stages and self-attention layers to model long-range dependencies and global context within the image in later stages. Positional encodings are added to the embeddings in the attention stages to retain spatial information, which helps the model understand the relative positions of features within the image.
Therefore, considering the advances previously described, we present a new way to train GANs using XAI in backpropagation. Our approach involves extracting XAI explanations from D to identify the most critical features of the input and feeding this information into G via the loss function. We used traditional architectures as a basis and modified the loss function to propagate a matrix instead of just an error value. This matrix was derived from the explanations and the discriminator error. We investigated the proposal's relevance in the histopathology context, which is known to be challenging due to the low availability of labeled images caused by privacy concerns and labeling costs [30,31,32,33,34]. We performed data augmentation on relevant datasets, comprising breast, colon, and liver histological images, and classified them using Transformers. Through experiments, we show that our proposal improves the quality and variability of the artificial images compared to traditional GANs, promoting an increase in the generalization and classification performance of state-of-the-art Transformer-based methods. This focus is essential to understanding the added value that XAI brings to GANs, particularly in providing valuable feature information to G during training. Our work aims to fill a specific gap in the literature by demonstrating how XAI can be a powerful tool for enhancing GAN performance. This research makes the following significant contributions:
An approach that feeds G with substantial information concerning the images’ features, increasing the quality and variability of the generated images.
A training strategy that produces images with more realistic features, promoting an increase in the generalization and classification performance of state-of-the-art Transformer-based methods.
The indication of the best combination between the GAN and XAI to generate and classify the histological images explored here.
2. Methodology
We named the proposed method XGAN, and a schematic summary of its structure is illustrated in Figure 1. It comprises a generator (G) and a discriminator (D). G receives a random signal vector z and outputs an image $\hat{x} = G(z)$, while D classifies authentic images x and artificial images $\hat{x}$. The model uses XAI to extract feature information from D and feed it back to G to perform a new form of training called educational training. To conduct this training, we propose a new loss function $\mathcal{L}_{XG}$ that uses the traditional adversarial losses ($\mathcal{L}_{D}$ and $\mathcal{L}_{G}$, Section 2.1) combined with XAI explanations (E) to backpropagate important information to the generator. The new loss function was defined as follows:

$$\mathcal{L}_{XG} = \mathcal{L}_{G} \ast E,$$

in which ∗ is the multiplication operation.
The gradient determines how much to adjust each weight of G so that the loss function moves toward its optimum. Incorporating E within the gradient emphasizes areas corresponding to objects of interest while dampening the influence of less relevant regions. Our method proposes a student–teacher relationship in which E corresponds to a graded test: the professor (D) informs the student (G) of their score, indicating which features were drawn close to reality and which do not resemble the original images. Thus, instead of propagating a single value that indicates D's classification error, we propagate a matrix with relevant information for each pixel of the image.
To propagate a matrix, an operation known as the vector–Jacobian product is required, defined as

$$v^{\top} J,$$

where $v$ is a multidimensional vector of the same dimension as the explanations E with 1 in all positions, and J is the Jacobian matrix, a matrix of partial derivatives that indicates how the output changes with respect to the input. AutoDiff uses the Jacobian matrix to perform the backpropagation process by stacking the partial derivatives of each output with respect to each input variable. Considering that a neural network is a function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ that maps n-dimensional input vectors (x) to m-dimensional output vectors (y), the matrix J is defined as

$$J = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{m}}{\partial x_{n}} \end{bmatrix}.$$
The Jacobian matrix indicates how the output changes when the input changes by a small amount. In the proposed method, J also captures the change that each pixel of the artificial image causes in the prediction of D, and E assigns greater weights to the more relevant pixels.
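As an illustration, the matrix backpropagation described above can be expressed in a few lines of PyTorch. This is a minimal sketch of the idea, assuming E has already been computed from D and has the same shape as the generated batch; the function and variable names are illustrative, not the authors' implementation:

```python
import torch

def educational_backward(adv_loss_g, explanations):
    # adv_loss_g   : scalar adversarial generator loss, still attached to G's graph
    # explanations : XAI attribution map E extracted from D for the fake images
    #
    # Expand the scalar loss into a matrix L_XG = L_G * E and backpropagate it with
    # an all-ones vector, i.e., the vector-Jacobian product v^T J with v = 1, so that
    # every pixel contributes an importance-weighted signal to G's gradients.
    loss_matrix = adv_loss_g * explanations
    loss_matrix.backward(gradient=torch.ones_like(loss_matrix))
```

In practice, the explanations may be detached from the computation graph or normalized before the product; these details depend on the specific XAI method used.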
We used the DCGAN, WGAN-GP, and RAGAN models to define the adversarial loss functions $\mathcal{L}_{D}$ and $\mathcal{L}_{G}$, and the XAI methods Saliency, DeepLIFT, and Gradient⊙Input to generate the explanations E. We give more details about the models and methods in the following sections.
2.1. Adversarial Loss Functions
2.1.1. DCGAN
We calculate the DCGAN adversarial loss for D ($\mathcal{L}_{D}$) through the binary cross-entropy:

$$\mathcal{L}_{D} = -\,\mathbb{E}_{x}\left[\log D(x)\right] - \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right],$$

where $D(x)$ is D's output for real samples x, z is a random noise vector, and $D(G(z))$ is D's output for the generated images $\hat{x} = G(z)$. For real samples x, D tries to maximize the probability of assigning them a value close to 1, while for artificial samples $\hat{x}$, D tries to assign a value close to zero. In contrast, the generator loss $\mathcal{L}_{G}$ tries to maximize the probability of D assigning a value close to 1 to the generated samples; it was defined as

$$\mathcal{L}_{G} = -\,\mathbb{E}_{z}\left[\log D(G(z))\right].$$
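For reference, a compact sketch of these two terms in PyTorch is given below. It assumes D ends with a sigmoid and returns one probability per image; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def dcgan_losses(D, G, real, z):
    fake = G(z)
    d_real = D(real)              # D(x), probabilities in [0, 1]
    d_fake = D(fake.detach())     # D(G(z)) without gradients flowing into G

    # L_D = -E[log D(x)] - E[log(1 - D(G(z)))]
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # L_G = -E[log D(G(z))]  (non-saturating form)
    d_fake_for_g = D(fake)
    loss_g = F.binary_cross_entropy(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return loss_d, loss_g
```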
2.1.2. WGAN-GP
The $\mathcal{L}_{D}$ was defined in terms of the Wasserstein distance $\mathcal{L}_{W}$ and the gradient penalty $\mathcal{L}_{GP}$:

$$\mathcal{L}_{D} = \mathcal{L}_{W} + \mathcal{L}_{GP},$$

in which $\mathcal{L}_{W}$ is the difference between the expected values of D's output for real and generated samples:

$$\mathcal{L}_{W} = \mathbb{E}_{z}\left[D(G(z))\right] - \mathbb{E}_{x}\left[D(x)\right],$$

and

$$\mathcal{L}_{GP} = \lambda\,\mathbb{E}_{\tilde{x}}\left[\left(\left\lVert \nabla_{\tilde{x}} D(\tilde{x}) \right\rVert_{2} - 1\right)^{2}\right],$$

where $\tilde{x}$ is a sample along a straight line between a real sample and a generated sample, and $\lambda$ is a hyperparameter that controls the strength of the penalty.

For G, $\mathcal{L}_{G}$ was defined as the negation of the expected value of D's output for generated samples:

$$\mathcal{L}_{G} = -\,\mathbb{E}_{z}\left[D(G(z))\right].$$
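A corresponding sketch of the critic and generator losses with the gradient penalty follows; here `lam` stands for the penalty weight λ, and the default value of 10 is the common choice from the WGAN-GP paper, not necessarily the one used in our experiments:

```python
import torch

def wgan_gp_losses(D, G, real, z, lam=10.0):
    fake = G(z)

    # Wasserstein term: E[D(G(z))] - E[D(x)]  (D is an unbounded critic here)
    loss_w = D(fake.detach()).mean() - D(real).mean()

    # Gradient penalty on samples interpolated between real and fake images.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_tilde = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    grads = torch.autograd.grad(D(x_tilde).sum(), x_tilde, create_graph=True)[0]
    gp = lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    loss_d = loss_w + gp

    # L_G = -E[D(G(z))]
    loss_g = -D(fake).mean()
    return loss_d, loss_g
```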
2.1.3. RAGAN
The $\mathcal{L}_{D}$ was defined as the sum of the DCGAN loss and the relativistic discriminator loss $\mathcal{L}_{R}$:

$$\mathcal{L}_{D} = \mathcal{L}_{D}^{DCGAN} + \mathcal{L}_{R},$$

where

$$\mathcal{L}_{R} = -\,\mathbb{E}_{x}\left[\log \sigma\!\left(D(x) - \mathbb{E}_{z}\left[D(G(z))\right]\right)\right] - \mathbb{E}_{z}\left[\log\left(1 - \sigma\!\left(D(G(z)) - \mathbb{E}_{x}\left[D(x)\right]\right)\right)\right],$$

and $\sigma$ denotes the sigmoid function. The $\mathcal{L}_{G}$ was defined as

$$\mathcal{L}_{G} = -\,\mathbb{E}_{z}\left[\log \sigma\!\left(D(G(z)) - \mathbb{E}_{x}\left[D(x)\right]\right)\right] - \mathbb{E}_{x}\left[\log\left(1 - \sigma\!\left(D(x) - \mathbb{E}_{z}\left[D(G(z))\right]\right)\right)\right].$$
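The relativistic terms can be sketched as follows. This follows the standard relativistic average formulation and treats D's outputs as raw scores, so the exact composition with the DCGAN loss may differ from the implementation used in our experiments; the helper `dcgan_loss_d` is assumed to come from the sketch in Section 2.1.1:

```python
import torch
import torch.nn.functional as F

def ragan_losses(D, G, real, z, dcgan_loss_d):
    fake = G(z)
    c_real = D(real)              # raw critic scores for real images
    c_fake = D(fake.detach())     # raw critic scores for fake images

    # How much more realistic a real image is than the average fake, and vice versa.
    rel_real = c_real - c_fake.mean()
    rel_fake = c_fake - c_real.mean()
    loss_rel = F.binary_cross_entropy_with_logits(rel_real, torch.ones_like(rel_real)) + \
               F.binary_cross_entropy_with_logits(rel_fake, torch.zeros_like(rel_fake))

    loss_d = dcgan_loss_d + loss_rel          # the sum described in the text

    # Generator: swap the roles of real and fake in the relativistic terms.
    c_fake_g = D(fake)
    rel_fake_g = c_fake_g - c_real.detach().mean()
    rel_real_g = c_real.detach() - c_fake_g.mean()
    loss_g = F.binary_cross_entropy_with_logits(rel_fake_g, torch.ones_like(rel_fake_g)) + \
             F.binary_cross_entropy_with_logits(rel_real_g, torch.zeros_like(rel_real_g))
    return loss_d, loss_g
```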
2.2. XAI Methods
For this work, we used gradient-based XAI techniques to extract the most critical features from D's gradients. We opted for this type of XAI due to its computational efficiency and capacity for creating fine-grained pixel-level explanations [14]. We used the Saliency, DeepLIFT, and Gradient⊙Input methods to generate the explanations E.
To calculate the Saliency [15] explanation ($E_{S}$) for a fake image $\hat{x}$, we calculated the partial derivative of the associated output $D(\hat{x})$ with respect to the input $\hat{x}$:

$$E_{S}(\hat{x}) = \frac{\partial D(\hat{x})}{\partial \hat{x}}.$$

To determine the Gradient⊙Input [16] explanation ($E_{GI}$), we calculated the elementwise multiplication of the gradients by the input:

$$E_{GI}(\hat{x}) = \hat{x} \odot \frac{\partial D(\hat{x})}{\partial \hat{x}}.$$

Finally, to define DeepLIFT [16], we calculated the importance scores of the input ($\hat{x}$) by comparing its contributions to the output against a reference input ($\hat{x}^{0}$). We used the minimal activation, that is, all zeros, as a reference. Thus, considering t as an output neuron and $x_{1}, \ldots, x_{n}$ as the set of neurons necessary to calculate t, $\Delta t$ is the difference between the outputs caused by $\hat{x}$ and $\hat{x}^{0}$. We calculated the explanation $E_{DL}$ as follows:

$$\sum_{i=1}^{n} C_{\Delta x_{i} \Delta t} = \Delta t,$$

where $\Delta x_{i}$ is the difference between neuron activations caused by $\hat{x}$ and $\hat{x}^{0}$, and $C_{\Delta x_{i} \Delta t}$ is the contribution score of $\Delta x_{i}$ to $\Delta t$.
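The three explanation maps can be obtained directly from D's gradients. The sketch below uses plain autograd for Saliency and Gradient⊙Input and, as one possible option, the Captum library for DeepLIFT; it assumes D returns a single score per image, and the function names are illustrative:

```python
import torch
from captum.attr import DeepLift   # optional dependency, used only for DeepLIFT

def saliency_explanation(D, fake):
    # E_S = dD(x_hat)/dx_hat; the absolute value keeps the gradient magnitude,
    # which is a common convention for saliency maps.
    x = fake.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(D(x).sum(), x)[0]
    return grad.abs()

def gradient_x_input_explanation(D, fake):
    # E_GI = x_hat (elementwise *) dD(x_hat)/dx_hat; the input filters the gradient.
    x = fake.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(D(x).sum(), x)[0]
    return x.detach() * grad

def deeplift_explanation(D, fake):
    # Contribution scores against an all-zero reference input (minimal activation).
    return DeepLift(D).attribute(fake, baselines=torch.zeros_like(fake))
```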
2.3. Datasets
The CR dataset [35] (Figure 2) consists of 165 RGB images of colorectal tissue obtained from 16 representative sections of colorectal cancer at stages T3 or T4. The samples are divided between benign (74 images) and malignant (91 images) tumors. Image acquisition was performed by digitally photographing histological sections with a Zeiss MIRAX MIDI slide scanner. The pixel resolution was 0.620 µm, corresponding to a 20× magnification. The images have varying sizes.
The LA and LG datasets [36] (Figure 3 and Figure 4) comprise RGB images of liver tissue obtained from mice. The LG dataset consists of 265 images obtained from male (150) and female (115) mice subjected to calorie-restricted diets. The LA dataset is composed of 529 images divided into four classes, each representing a different age group of female mice on ad libitum diets: 1 month (100), 6 months (115), 16 months (162), and 24 months (152) of age. The samples were obtained using a Carl Zeiss Axiovert 200 microscope and a 40× objective. All images have the same resolution. Both datasets were made available through the Atlas of Gene Expression in Mouse Aging Project (AGEMAP).
The UCSB dataset [37] (Figure 5) is a critical case of image scarcity. It consists of 58 RGB images of breast tissue divided into two groups: benign breast cancer (32) and malignant breast cancer (26). The samples, provided by the Center for Bio-Image Informatics at the University of California, Santa Barbara, have a quantization rate of 24 bits.
2.4. Performance Evaluation
We evaluated the performance of the proposed model in two steps. First, we quantitatively assessed the quality of the artificial images using the Fréchet inception distance (FID) and the inception score (IS) metrics. Second, we performed the data augmentation and classified the images using the Transformer models ViT, PVT, DeiT, and CoAtNet for each dataset, evaluating the accuracy of each case. The idea was to assess how the XAI-based models improve the quality of the generated images relative to the original architectures and how this impacts the classification performance of the Transformer models.
2.4.1. Image Quality Evaluation
Many metrics are available to evaluate the quality of artificial images, and each has strengths. However, interpreting the results can be challenging and requires careful consideration. For example, a specific FID variation (Gromov–Fréchet Distance) was designed to compare metric space shapes, particularly in shape analysis and topology. It has valuable applications such as shape matching or comparing complex geometric objects. However, GAN-generated images are typically evaluated based on the similarity of their distributions to authentic images, where the focus is on pixel values, textures, and high-level features captured by neural networks rather than the geometric structure of the underlying data.
Some assessments could also be performed via precision–recall (PR) metrics. These metrics are derived from the F1 score or area under the PR curve and can provide insights into a GAN’s performance in generating realistic and diverse images. For instance, PR metrics can provide a valuable perspective by assessing the trade-off between the precision (how many of the generated images are relevant or high-quality) and recall (how many of the relevant images are generated) of the model. However, some disadvantages compared to FID and IS are observed. PR metrics require the definition of thresholds to determine what constitutes a realistic or diverse image, which can be complex and context-dependent. The choice of threshold can significantly affect the results, making PR metrics less consistent across different models and datasets. The interpretation of PR curves might be more complex than single-valued metrics like FID or IS.
The FID and IS are particularly pertinent for assessing image quality because they have been widely adopted in the field and are well-suited for comparing generative models by evaluating the fidelity and diversity of the generated images. FID, for instance, compares the distribution of generated images with real images, capturing the similarity in a way that aligns with human perception. IS measures how distinct and meaningful the generated images are using the classification confidence of a pre-trained network.
In this work, we applied the FID metric [38] to assess the quality of artificial images quantitatively. This metric measures the distance between the distributions of real and generated images. Thus, lower FID scores indicate higher similarity between the distributions, meaning that the generated images closely resemble the original ones. The FID measures the similarity between two multivariate Gaussian distributions, defined by the mean and covariance matrix of the activation features extracted from Inception v3's 2048-dimensional pooling layer. Mathematically, the FID score is defined by

$$\mathrm{FID} = \left\lVert \mu_{r} - \mu_{g} \right\rVert^{2} + \mathrm{Tr}\!\left(\Sigma_{r} + \Sigma_{g} - 2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right),$$

where $\mu_{r}$ and $\mu_{g}$ are the mean features of real and fake images, $\Sigma_{r}$ and $\Sigma_{g}$ are the covariance matrices of real and fake image features, and $\mathrm{Tr}$ denotes the trace of a matrix.
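For concreteness, the FID can be computed from the extracted Inception-v3 features as in the sketch below, which uses NumPy and SciPy; `feat_real` and `feat_fake` are assumed to be N × 2048 feature arrays:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(feat_real, feat_fake):
    # FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    mu_r, mu_g = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_fake, rowvar=False)

    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny imaginary parts
        covmean = covmean.real

    return float(np.sum((mu_r - mu_g) ** 2) +
                 np.trace(sigma_r + sigma_g - 2 * covmean))
```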
We also applied the IS metric [39] to estimate the diversity of the generated images. A higher IS suggests greater variety in the assigned classes, although it does not necessarily indicate a high degree of realism. In the IS calculation, fake images were evaluated based on the activations of the final classification layer of a pre-trained Inception v3 model. This model assigns to each image a probability distribution over the predefined classes of the ImageNet dataset. Diverse images are expected to have probabilities spread across multiple classes. The IS was calculated by taking the exponential of the average Kullback–Leibler divergence between the conditional and marginal class distributions over all generated images:

$$\mathrm{IS} = \exp\!\left(\mathbb{E}_{x}\left[D_{KL}\!\left(p(y \mid x)\,\|\,p(y)\right)\right]\right),$$

where $p(y \mid x)$ is the probability of class y being assigned to the generated image x, $p(y)$ is the marginal probability of class y in the dataset, $\mathbb{E}_{x}$ denotes the expectation taken over all generated images, $D_{KL}$ is the Kullback–Leibler divergence between $p(y \mid x)$ and $p(y)$, and $\exp$ represents the exponential function.
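Given the softmax outputs of the pre-trained Inception v3 for the generated images, the IS reduces to a few lines; in this sketch, `probs` is assumed to be an N × 1000 array of class probabilities:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # IS = exp( E_x[ KL( p(y|x) || p(y) ) ] )
    p_y = probs.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```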
2.4.2. Classification Evaluation
To train and evaluate the GAN and XGAN models, we employed a strategy based on regions of interest (ROIs). Figure 6 illustrates the classification evaluation process. Initially, we reserved 20% of the dataset for testing and divided the remaining 80% into five stratified folds. Each fold's images were cropped into fixed-size ROIs to ensure that ROIs from the same image did not appear in both the training and validation sets. We conducted classification using cross-validation, with four folds for training and one fold for validation. For data augmentation, we used the GAN and XGAN models exclusively on the training group to prevent overfitting, with an augmentation rate equal to 100% of the training set size. We fine-tuned Transformer models, initially trained on the ImageNet dataset, for the datasets under investigation. Classification performance was evaluated using the accuracy metric, which measures the proportion of correctly classified samples. Finally, we compared the classification performance of the Transformer models with and without data augmentation using the proposed XGAN and the original GAN architectures.
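A sketch of the image-level split that precedes ROI extraction is shown below, using scikit-learn; the function and variable names are illustrative, and the random seeds are assumptions rather than the values used in our experiments:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def make_splits(image_ids, labels, seed=0):
    # Hold out 20% of the whole images for testing, stratified by class.
    train_ids, test_ids, train_y, _ = train_test_split(
        image_ids, labels, test_size=0.2, stratify=labels, random_state=seed)

    # Split the remaining images into five stratified folds. ROIs are cropped
    # afterwards, per fold, so patches from the same image never appear in both
    # the training and validation sets of a cross-validation round.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    folds = [([train_ids[i] for i in tr], [train_ids[i] for i in va])
             for tr, va in skf.split(train_ids, train_y)]
    return test_ids, folds
```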
2.5. Execution Environment
The proposed method was implemented using Python 3.9.16 and the Pytorch 1.13.1 API. The experiments were performed on a computer with a 12th Generation Intel® Core™i7-12700, 2.10 GHz, NVIDIA® GeForce RTX™3090 card, 64 GB of RAM, and a Windows operating system with 64-bit architecture.
It is important to consider that we have developed our code in PyTorch, which is well known for its flexibility and strong support within the MLOps ecosystem. PyTorch’s integration capabilities with various MLOps tools and platforms, such as TensorBoard, MLflow, and Kubernetes, make incorporating continuous integration and deployment (CI/CD) pipelines, model versioning, and automated monitoring straightforward. This compatibility ensures that our work can be efficiently transitioned into production environments, facilitating real-time processing and decision making.
3. Results and Discussion
We used XAI explanations to improve the training of GANs and generate artificial images with higher quality and variability. Figure 7, Figure 8, Figure 9 and Figure 10 show examples of the original images and those generated by the GAN and XGAN models for the CR, LA, LG, and UCSB datasets, respectively. We conducted experiments using the FID and IS metrics to quantitatively assess the quality and variability of the artificial images. Table 1 shows the results of these experiments. The results are organized by base architecture: DCGAN, RAGAN, and WGAN-GP. Scores in bold indicate the best results for each base architecture. The green and red arrows indicate whether the FID and IS obtained with XGAN are better or worse than those obtained with the original architecture.
Considering the CR dataset (Figure 7), it is possible to note in Table 1 that XDCGAN + Saliency achieved the best FID and IS among the DCGAN-based architectures. The values were 57.01 and 2.43, respectively. Considering the RAGAN-based models, it is worth noting that all combinations of XRAGAN improved the FID and IS compared with RAGAN. XRAGAN + DeepLIFT was the highlight, providing the lowest FID, 46.02, representing a 32.70% decrease compared to RAGAN (68.39). XWGAN-GP also improved the FID in all cases. The combination XWGAN-GP + Saliency provided the lowest FID, 72.01, 13.04% lower than WGAN-GP (82.81).
On the LA dataset (Figure 8), XDCGAN did not improve the FID and IS. DCGAN achieved the best quality, with an FID of 127.94 and an IS of 1.57. However, considering the RAGAN-based models, XRAGAN + DeepLIFT and XRAGAN + Gradient⊙Input slightly improved the FID and IS. The best combination was XRAGAN + DeepLIFT, which provided an FID improvement of 2.8%, 104.08 against 107.15 with RAGAN. On the other hand, XWGAN-GP improved the FID in all cases. XWGAN-GP + Saliency provided the best FID, 152.19, representing a 20.56% decrease compared to WGAN-GP.
Regarding the LG dataset (Figure 9) and the performance of the DCGAN-based architectures (Table 1), XDCGAN + Gradient⊙Input achieved the best FID, 93.88, representing a 13.22% decrease compared to DCGAN (108.18). Considering the RAGAN-based models, XRAGAN + Saliency and XRAGAN + Gradient⊙Input improved the FID and IS. XRAGAN + Saliency provided the best FID, 88.30, representing an 8.99% decrease compared to RAGAN (95.91). When using WGAN-GP as the base architecture, XWGAN-GP + Saliency provided the best FID, 128.26, 6.52% less than WGAN-GP (137.46).
Finally, considering the UCSB dataset (Figure 10), XDCGAN + DeepLIFT achieved an FID of 62.34 and an IS of 2.83. This combination provided the best FID and IS among the DCGAN-based architectures, representing about a 14.33% FID improvement. Considering the RAGAN-based models, XRAGAN + Saliency and XRAGAN + Gradient⊙Input improved the FID and IS. XRAGAN + Saliency provided the best FID, 63.21, representing a 7.61% decrease compared to RAGAN (68.42). This combination also produced more diverse images, with an IS of 2.79. XWGAN-GP showed no FID improvement compared to WGAN-GP; the best case was WGAN-GP, with an FID of 81.07. However, XWGAN-GP provided the best IS, 2.74.
We used the generated images to augment the datasets and train the Transformer models. Our goal was to verify whether the feature information provided by the XAI methods could improve the Transformer models' classification performance. We followed the strategy described in Section 2.4.2. Table 2, Table 3, Table 4 and Table 5 show the classification results on the CR, LA, LG, and UCSB datasets, respectively. Each table's first row presents the classification accuracy without data augmentation (without DA), and the results using a GAN for data augmentation are organized by base architecture: DCGAN, RAGAN, and WGAN-GP. We considered the classification performance without data augmentation and with data augmentation using the original architectures (DCGAN, RAGAN, and WGAN-GP with no XAI) as baselines. We compared the results obtained with XGAN (XDCGAN, XRAGAN, and XWGAN-GP) against these baselines to determine whether the proposed method can improve the classification performance of Transformer-based models via data augmentation and whether educational training using XAI can enhance the quality and classification performance of a given base GAN architecture. Thus, bold values indicate the best accuracy for a given base architecture.
Considering the classification results for the CR dataset (Table 2), the combination with the best FID among the DCGAN-based architectures, XDCGAN + Saliency, also provided the best classification performance. This combination achieved the best accuracy with all Transformer models: 84.14% with ViT, 87.70% with DEiT, 92.24% with PVT, and 93.44% with CoAtNet. With CoAtNet, XDCGAN + Gradient⊙Input also achieved an accuracy of 93.44%, the same as XDCGAN + Saliency. The IS value of this combination (2.43) may have influenced this result, promoting greater variability in the artificial images and, consequently, better model generalization. Considering the RAGAN-based models, XRAGAN + DeepLIFT provided the lowest FID (46.02) and the highest accuracy in most cases: 84.30% with ViT and 91.92% with PVT. Also, XWGAN-GP + Saliency provided the lowest FID (72.01) and achieved the highest accuracy with ViT (84.41%), DEiT (88.09%), and CoAtNet (93.64%). XWGAN-GP + Gradient⊙Input provided the second-lowest FID (74.01) and achieved the highest accuracy with PVT (92.13%).
Regarding the LA dataset, XDCGAN could not improve the FID (Table 1). XDCGAN + Saliency, however, achieved the second-lowest FID (128.07), and it is worth noting in Table 3 that this combination provided the best accuracy with DEiT (95.45%). It is also worth noting that no GAN-based augmentation improved the classification performance with ViT, which achieved 95.04% without data augmentation. Nevertheless, taking into account the RAGAN-based architectures, the combination with the best FID, XRAGAN + DeepLIFT, provided the best classification performance with PVT: 97.71% against 97.32% with RAGAN and 96.44% without data augmentation. XWGAN-GP improved the DEiT, PVT, and CoAtNet classification performance compared to WGAN-GP. For instance, XWGAN-GP + Saliency achieved an accuracy of 97.79% with DEiT and 98.24% with PVT. These results represent improvements of 3.81% and 1.87% compared to the models without data augmentation, respectively, and of 3.01% and 1.23% compared to WGAN-GP. Also, the combination XWGAN-GP + DeepLIFT, which provided the second-best FID and IS, achieved the best accuracy with CoAtNet, 99.34%.
Table 4 shows the accuracy on the LG dataset. Once again, no architecture was able to improve the classification with ViT. However, the two best XDCGAN models in terms of FID (Table 1), XDCGAN + Gradient⊙Input (99.33) and XDCGAN + DeepLIFT (98.24), achieved the highest accuracy with PVT (99.20%) and DEiT (97.43%), respectively. The XRAGAN model showed improvements in FID and IS when combined with Saliency (Table 1). This combination achieved the highest accuracy with DEiT (96.77%) and PVT (99.08%). Also, the combination XRAGAN + Gradient⊙Input achieved 99.79% with CoAtNet. The XWGAN-GP model also improved the FID and IS when combined with Saliency. It achieved the highest accuracy with DEiT (97.17%) and CoAtNet (99.81%). The XWGAN-GP + Gradient⊙Input combination provided the second-lowest FID and achieved the highest accuracy with PVT (98.94%).
Table 5 shows the accuracy on the UCSB dataset. XDCGAN + Gradient⊙Input achieved the best results with DEiT (75.14%) and PVT (78.36%). Regarding the XRAGAN models, XRAGAN + Saliency and XRAGAN + Gradient⊙Input achieved the lowest FID (Table 1). XRAGAN + Saliency achieved the highest accuracy with DEiT (75.88%), and XRAGAN + Gradient⊙Input with PVT (77.36%). The XWGAN-GP + Gradient⊙Input combination provided the lowest FID (87.65) and achieved the highest accuracy with DEiT (77.42%), PVT (77.77%), and CoAtNet (76.59%).
Based on the classification results, we analyzed the best combinations of XAI and GAN for the histological datasets. This analysis aimed to determine the number of cases in which the XGAN combinations provided the best performance for a given base architecture. Table 6 shows a ranking of the best combinations. Considering all datasets, the combination XWGAN-GP + Saliency outperformed WGAN-GP in seven cases. The second and third places also included the Saliency method: XDCGAN + Saliency, with six cases, and XRAGAN + Saliency, with four cases. The subsequent positions were combinations with the Gradient⊙Input and DeepLIFT methods: XWGAN-GP + Gradient⊙Input with five cases, XDCGAN + Gradient⊙Input with four cases, and XRAGAN + Gradient⊙Input and XRAGAN + DeepLIFT both with three cases. Finally, the two worst cases were combinations with DeepLIFT: XWGAN-GP + DeepLIFT and XDCGAN + DeepLIFT, with only one case each. Based on these results, the XAI method that provided the best information for the generator was Saliency, followed by Gradient⊙Input and DeepLIFT. These findings can guide researchers and experts in using GANs and XAI to develop artificial augmentation techniques.
It is important to observe that to expand our investigation, new experiments can be performed with different models and datasets, including other types of medical images. On the other hand, incorporating more datasets from various medical fields poses significant challenges, especially when considering GAN architectures. For instance, an architecture that works well for one type of medical image may not be optimal for another, requiring extensive experimentation and customization to achieve the best results. In addition, acquiring and processing datasets from different medical domains involves overcoming various logistical, ethical, and technical hurdles, which adds to the complexity of such an expansion. In this context, our approach opens up new possibilities for enhancing data augmentation techniques and improving the overall performance of Transformer-based models in histopathological datasets. It provides new patterns and insights for specialists interested in machine learning.
Finally, we recognize the importance of discussing the computational complexity of the methods employed. However, performing a complexity analysis of algorithms, especially when integrating XAI with GANs, can be challenging due to the intricate nature of these models and the variability in computational demands across different setups. Moreover, the primary focus of our research was to explore the integration of XAI techniques with GANs to improve the quality of generated images rather than to provide a comprehensive analysis of the computational complexity. Despite the complexity analysis challenges, we consciously used gradient-based XAI methods in our research. Gradient-based XAI is the fastest among the various XAI types, making it a practical choice for our study. These methods work by calculating gradients concerning the input features, which helps to identify the input areas that most influence the model’s output. This approach is computationally efficient because it leverages the gradients already computed during the backpropagation process in neural networks.
4. Conclusions
In this work, we proposed a new approach for training GANs using XAI to improve generation quality and data augmentation performance on histopathological datasets. We used XAI methods, such as Saliency, DeepLIFT, and Gradient⊙Input, to extract feature information from the discriminator and feed it to the generator during the training. We evaluated the proposed method on four histopathological datasets, CR, LA, LG, and UCSB, using the FID and IS metrics to assess the quality of generated images and the accuracy metric to compare the classification performance of four Transformer models, ViT, DEiT, PVT, and CoAtNet, with and without data augmentation. The multiple experiments provided a solid foundation to understand the effectiveness of our approach in this specific domain.
The results showed that the proposed method increased the quality and diversity of the generated images. In most cases, the XGAN provided better FID and IS values than traditional GAN models. For instance, it was possible to decrease the FID by up to 32.70% compared to the traditional architectures. This gain in quality positively affected the classification performance of the Transformer models. Accuracy was increased by up to 3.81% compared to the models without data augmentation and up to 3.01% compared to the models with traditional GAN data augmentation. We also showed that Saliency was the best method for providing information to the generator, followed by Gradient⊙Input and DeepLIFT. The XWGAN-GP + Saliency combination was a highlight, as it outperformed WGAN-GP in seven cases.
These results are significant because they show XAI's potential to improve the quality of image generation by using GANs on histopathological datasets. We demonstrated that the features provided by XAI explanations contribute to better generalization in the training of Transformer models, promoting an improvement in their classification power. Also, we identified that the Saliency method provided the best features and was the most relevant method for composing a combination with GAN models. These findings provide insights and guidelines for researchers and experts interested in developing artificial augmentation techniques for histopathological datasets.
In future work, several directions can be investigated: (1) new tests using different combinations of GAN models; (2) the application of other evaluation metrics for generative models, such as precision–recall metrics; (3) the integration of MLOps approaches and pipelines for biomedical image processing; (4) tests using larger datasets with a significantly higher number of samples, including those from different medical domains, to comprehensively evaluate the performance of our proposed method and its generalization capabilities; (5) a deeper analysis of the reasons behind the varying performance of different GAN and XAI combinations to provide a more thorough examination of their specific advantages and limitations; and (6) statistical analysis, including significance testing, to verify the improvements in classification performance, as well as a complexity analysis of the proposed method to guide optimization processes.