This section presents the outcomes of two distinct experiments conducted to evaluate the efficacy of training CNNs on datasets balanced with GAN-based and non-GAN-based techniques. The first experiment used GANs, specifically the ProGAN, SNGAN, BEGAN, and ReGAN, to balance the dataset, with the goal of mitigating class imbalance and enhancing the diversity of the training data; the impact of these GAN-based balancing techniques on the performance of CNNs trained on the balanced datasets was then examined. In the second experiment, CNNs were trained on datasets balanced without GAN augmentation, serving as a benchmark against which the GAN-trained models could be compared. The non-GAN-based techniques employed were random transformations of the images and downsampling of the malignant class to achieve an equal number of images in both classes. By contrasting the outcomes of these two approaches, insights were gained into the effectiveness of GAN-based dataset balancing in improving CNN performance and robustness in image classification tasks.
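For reference, the sketch below illustrates how the two non-GAN-based balancing techniques can be implemented. This is a minimal sketch assuming a torchvision pipeline; the transform choices, function names, and parameters are illustrative and not necessarily the exact configuration used in the experiments.

```python
import random
from torchvision import transforms

# Illustrative sketch of the two non-GAN balancing strategies described above.
# Transform parameters and helper names are assumptions, not the authors'
# exact configuration.

# (1) Traditional augmentation: random transformations applied to the
#     minority class until both classes contain the same number of images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])

def oversample(minority_images, target_count):
    """Randomly transform minority-class images until target_count is reached."""
    augmented = list(minority_images)
    while len(augmented) < target_count:
        augmented.append(augment(random.choice(minority_images)))
    return augmented

# (2) Downsampling: randomly discard malignant (majority) images so that
#     both classes end up with the same number of samples.
def downsample(majority_images, target_count, seed=42):
    rng = random.Random(seed)
    return rng.sample(majority_images, target_count)
```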
3.2. CNN Training
The graphs presented in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 offer a comprehensive overview of the performance dynamics of the CNNs across epochs and folds in a fivefold cross-validation setup. Each graph provides insights into the mean accuracy and standard deviation of the CNN models trained using different datasets and methodologies. The upper graph illustrates the performance metrics obtained by training the CNN models on the original dataset, while the lower graph shows the corresponding metrics obtained with the respective balancing technique applied to the dataset (traditional augmentation, downsampling, or GAN-based). In both graphs, three distinct CNN architectures (Inception, ResNet, and VGG16) are evaluated, with the mean accuracy and standard deviation depicted for each architecture. The shaded regions surrounding the lines represent the standard deviation, offering insight into the variability in model performance across folds and epochs. These visualizations provide a comprehensive understanding of the impact of the data augmentation techniques on CNN training and performance variability, informing further analysis and optimization efforts in image classification tasks.
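For clarity, the following sketch shows how such curves can be produced: the per-epoch accuracies from the five folds are stacked, and the mean and standard deviation across folds are plotted with a shaded band. The `histories` arrays below are random placeholders, not the results reported here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of the mean-accuracy curves with shaded standard-deviation bands
# from fivefold cross-validation. `histories` maps each architecture to a
# (n_folds, n_epochs) accuracy array; the values are placeholders.
n_folds, n_epochs = 5, 50
histories = {
    "Inception": np.random.rand(n_folds, n_epochs),
    "ResNet": np.random.rand(n_folds, n_epochs),
    "VGG16": np.random.rand(n_folds, n_epochs),
}

epochs = np.arange(1, n_epochs + 1)
for name, acc in histories.items():
    mean = acc.mean(axis=0)   # mean accuracy per epoch across folds
    std = acc.std(axis=0)     # fold-to-fold variability per epoch
    plt.plot(epochs, mean, label=name)
    plt.fill_between(epochs, mean - std, mean + std, alpha=0.2)  # shaded band

plt.xlabel("Epoch")
plt.ylabel("Validation accuracy")
plt.legend()
plt.show()
```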
A comparison of the two graphs in Figure 4 reveals an improvement in mean accuracy when traditional data augmentation techniques are applied to the dataset. However, despite this gain in accuracy, the standard deviation remained relatively high across epochs and folds. This observation suggests that while traditional data augmentation improves overall performance, it does not fully address the variability in model performance across folds and epochs. Consequently, although the mean accuracy was positively influenced by the traditional transformations, the persistently high standard deviation highlights the need for further investigation and refinement of the training methodology to achieve greater consistency and stability in CNN performance.
A comparison of the results obtained from the original dataset (upper) with those obtained by downsampling the malignant class to balance the number of images (Figure 5) revealed distinct trends across the CNN architectures. For the Inception and ResNet models, the results were quite similar between the two datasets, with slightly higher accuracy observed for the downsampled dataset. This indicates that both architectures benefitted marginally from the balanced dataset, likely due to the reduction in class imbalance. However, the results for the VGG16 model were significantly worse on the downsampled dataset. This suggests that the reduced number of images in the malignant class was insufficient for effective training, indicating that VGG16 was more sensitive to the size of the training dataset and required a larger number of images to achieve optimal performance.
Figure 6 presents a comparison graph illustrating the training mean accuracy and standard deviation when using BEGAN as the dataset balancer. The graph reveals a notable improvement in mean accuracy across all three CNN architectures, indicating enhanced model performance when trained on the BEGAN-balanced dataset. Additionally, there was an overall reduction in standard deviation, suggesting improved stability and consistency in model performance. Nevertheless, the standard deviation remained relatively high for the Inception architecture, indicating variability in performance across epochs and folds. Despite this exception, the general trend reflects the efficacy of BEGAN in improving both the accuracy and stability of the CNN models, underscoring its potential as a dataset balancing technique in image classification tasks.
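For reference, the standard BEGAN formulation, which we assume matches the variant used here, trains an autoencoder discriminator $D$ with reconstruction loss $\mathcal{L}(v) = \lVert v - D(v) \rVert_1$ and balances the generator and discriminator through a proportional control term:
\[
\mathcal{L}_D = \mathcal{L}(x) - k_t\,\mathcal{L}(G(z_D)), \qquad
\mathcal{L}_G = \mathcal{L}(G(z_G)), \qquad
k_{t+1} = k_t + \lambda_k\bigl(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z_G))\bigr),
\]
where $\gamma \in [0, 1]$ is the diversity ratio trading image quality against diversity; this equilibrium mechanism is often credited with stable training.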
Figure 7 presents a comparison between the CNNs trained on the original dataset (upper) and those trained with the ProGAN as the dataset balancer (lower). While the accuracy improved with the ProGAN-balanced dataset, the higher standard deviation indicates a noticeable increase in instability. Despite the ProGAN having the lowest FID among the GAN architectures considered, the results suggest a trade-off between image quality and variability: the ProGAN produced high-quality synthetic images, as reflected by its low FID, yet the CNNs trained on the ProGAN-balanced dataset displayed greater variability in performance across epochs and folds, as evidenced by the elevated standard deviation. These findings underscore the complex interplay between image quality and stability in CNN training, highlighting the need for further exploration and optimization to balance these competing factors.
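For context, the FID cited above measures the distance between the Inception-feature statistics of real and generated images, with lower values indicating higher-quality synthetic images:
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\bigl( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \bigr),
\]
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception embeddings of the real and generated images, respectively.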
Figure 8 presents a comparison between the CNNs trained on the original dataset and those trained on the ReGAN ProGAN-balanced dataset. While the accuracy achieved with the ReGAN ProGAN was not as high as that achieved with the ProGAN, there was a notable improvement in stability, as evidenced by the reduced variability in accuracy across epochs and folds. However, despite this improvement, the standard deviation remained high, indicating ongoing variability in model performance. Although the ReGAN ProGAN struck a balance between accuracy and stability, the persistence of a high standard deviation indicates that further optimization may be necessary to achieve greater consistency in model performance.
Table 2 presents the mean accuracy and mean loss, alongside their respective standard deviations, for the CNN models trained with the different data augmentation techniques, illustrating the impact of each augmentation strategy on model performance. Traditional image augmentation demonstrated a modest enhancement in mean accuracy coupled with a reduction in the standard deviation, implying improved consistency in model performance. Conversely, the GAN-based augmentation techniques exhibited significant improvements in both mean accuracy and standard deviation, indicating enhanced model robustness and stability. Notably, the ReGAN ProGAN emerged as the top performer, exhibiting the highest mean accuracy and the lowest standard deviation among the evaluated methods.
The outcomes for the Inception model indicate that conventional image transformations resulted in a marginal improvement in both accuracy and standard deviation. Conversely, the GAN-based augmentation techniques demonstrated a substantial improvement in accuracy together with a reduced standard deviation, underscoring their efficacy in improving model performance. Among the GAN-based methods, the ProGAN stood out with the highest accuracy while maintaining a standard deviation similar to that of the other techniques.
The ResNet model presented analogous results, corroborating the trends observed for the Inception model. In all cases, the GAN-based augmentation techniques provided higher average accuracy than the original dataset and the traditional augmentation methods. It should be noted that although GAN-based augmentation improved the accuracy, it was often accompanied by a slightly higher standard deviation, indicating greater variability in model performance. Nevertheless, the ReGAN ProGAN model (highlighted in the table) exhibited a notable improvement in accuracy, accompanied by only a marginal increase in the standard deviation.
The VGG16 model emerged as the worst-performing CNN among the architectures considered, with a lower average accuracy and a higher standard deviation overall than the other CNN models. Even so, remarkable performance improvements were achieved with the ReGAN ProGAN and BEGAN augmentation techniques: despite the inherent limitations of VGG16, the ReGAN ProGAN (highlighted in the table as the best performer) and, to a lesser extent, BEGAN achieved higher mean accuracies and relatively lower standard deviations than the other augmentation methods. These results underscore the resilience and effectiveness of the ReGAN ProGAN and BEGAN models in improving the performance of even the worst-performing CNN architecture.
The results, presented in Table 3, provide an overview of the precision, recall, and F1 score for each CNN and augmentation technique. The metrics were computed by aggregating all true labels and predictions across every test fold, which is why standard deviations are not included. The data indicate that the ProGAN and the ProGAN with ReGAN regularization consistently demonstrated the highest performance across all metrics, indicating that these GAN-based augmentation techniques significantly enhanced the models’ ability to generalize from the training data. Conversely, the original dataset and traditional augmentation methods exhibited lower precision, indicating a lack of generalization capability and an increased likelihood of false positives. Downsampling demonstrated improved generalization compared with the original dataset and traditional augmentation, but still fell short of all the GAN-based techniques. These findings highlight the potential of GAN-based augmentation, particularly the ProGAN and its regularized variant, in enhancing the robustness and accuracy of histopathological image classification models.
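The aggregation described above can be reproduced with a few lines of scikit-learn. This is a minimal sketch; the `fold_results` container of (y_true, y_pred) pairs is an assumed placeholder for the cross-validation outputs.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder folds; in practice these come from the cross-validation loop.
fold_results = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 1]),
]

# Pool the true labels and predictions from every test fold before computing
# the metrics, so a single value (without a standard deviation) is obtained
# per model/technique, as described above.
y_true_all, y_pred_all = [], []
for y_true, y_pred in fold_results:
    y_true_all.extend(y_true)
    y_pred_all.extend(y_pred)

precision = precision_score(y_true_all, y_pred_all)
recall = recall_score(y_true_all, y_pred_all)
f1 = f1_score(y_true_all, y_pred_all)
```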
In order to highlight the performance of our model, we present the confusion matrices for the Inception CNN, which demonstrated the most favorable results in our study. The confusion matrix serves as an effective visualization tool, illustrating the model’s classification accuracy and its ability to distinguish between benign and malignant histopathological breast cancer images. By focusing on the Inception CNN, we aim to provide a clear representation of its efficacy, emphasizing its potential as a robust tool for augmenting histopathological image analysis.
The confusion matrix for the original dataset, presented in Table 4, reveals relatively low precision in classifying the benign class, as evidenced by a significant number of false positives (12%). Additionally, while the model demonstrated high recall for the malignant class (94%), the overall precision of the benign classifications was compromised, indicating room for improvement in distinguishing between benign and malignant cases.
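To make the reading of Tables 4–9 concrete, the sketch below builds a row-normalized confusion matrix for a hypothetical two-class run; the label counts are invented, chosen only to produce rates of the magnitude discussed here rather than to reproduce the actual data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# With rows normalized over the true class, entry [i, j] is the fraction of
# class-i samples predicted as class j. The labels below are placeholders.
y_true = np.array([0] * 100 + [1] * 100)   # 0 = benign, 1 = malignant
y_pred = y_true.copy()
y_pred[:12] = 1        # some benign samples misclassified as malignant
y_pred[100:106] = 0    # some malignant samples misclassified as benign
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
# [[0.88 0.12]
#  [0.06 0.94]]
```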
The confusion matrix for the dataset with traditional augmentation, shown in Table 5, indicates that the precision for benign classifications was even lower than that of the original dataset, with a false positive rate of 16%. While there was an improvement in recall for malignant lesions (96%), the overall precision for benign cases decreased. This suggests that traditional augmentation is not an efficient technique for this problem, as it failed to enhance the model’s ability to accurately differentiate between benign and malignant lesions.
The confusion matrix for the dataset with downsampling, shown in Table 6, demonstrates a notable improvement over the previous two methods: both benign and malignant classifications achieved a precision of 94%. This indicates that an equal number of images per class significantly enhanced the model’s performance. Nevertheless, while this balanced approach yielded superior overall results, the false positive and false negative rates (both at 5%) indicate that there is still potential to enhance the model’s accuracy.
Table 7 presents the confusion matrix for the dataset augmented using the BEGAN model, which demonstrates a notable improvement in model performance. The precision for both benign and malignant classifications was exceptionally high, with only 2% false positives for benign lesions and 1% false negatives for malignant lesions. This evidence indicates that the BEGAN-based augmentation technique is highly effective, substantially enhancing the model’s ability to accurately distinguish between benign and malignant histopathological breast cancer images.
The confusion matrix in Table 8, for the dataset augmented with the ProGAN-based technique, demonstrates outstanding performance. The model achieved near-perfect precision for both benign and malignant classifications, with only 1% false positives and 1% false negatives. This demonstrates the efficacy of the ProGAN-based augmentation, markedly enhancing the model’s capacity to accurately differentiate between benign and malignant histopathological breast cancer images.
The confusion matrix depicted in Table 9 illustrates the outcomes of the ReGAN ProGAN-based augmentation technique. Notably, the model achieved an impressive level of precision for both benign and malignant classifications, with only 2% false positives and 2% false negatives. This highlights the efficacy of the ReGAN ProGAN approach in enhancing the model’s capacity to distinguish between benign and malignant histopathological breast cancer images, and reinforces the utility of this approach in augmenting image datasets to improve classification accuracy.
In conclusion, the results presented in the confusion matrices demonstrate that the ProGAN-based augmentation technique consistently outperformed the other methods, including the ReGAN ProGAN approach. In a series of experiments, the ProGAN demonstrated near-perfect precision for both benign and malignant classifications, with minimal false positives and false negatives. While the ReGAN ProGAN method also demonstrated impressive performance, the ProGAN consistently maintained a slight edge in accuracy. These findings reinforce the robustness and efficacy of the ProGAN in enhancing the classification accuracy of histopathological breast cancer images, thereby reaffirming its status as a leading augmentation technique in this domain.