1. Introduction
Breast cancer is one of the most frequently diagnosed cancers worldwide and one of the leading causes of cancer-related death. According to the World Health Organization’s International Agency for Research on Cancer (IARC), an estimated 2.3 million new breast cancer cases were recorded globally in 2022, resulting in 669,418 deaths. Although men are not immune to the disease, studies show that women are roughly 100 times more likely to develop it than men. A more recent study by the American Cancer Society [
1] predicted that in 2025, breast cancer will remain the most common form of cancer among women, accounting for approximately 31% of all female cancers. In addition, female breast cancer incidence rates have been continuously increasing since the mid-2000s by 1% per year overall, a trend that has been at least in part attributed to changing risk factors, such as increased excessive body weight [
2].
Breast tumors can be classified as either benign, which are considered non-hazardous and non-life-threatening, or malignant. Malignant cancer begins with abnormal cell growth in the lining of the breast and can spread to neighboring tissues quite rapidly; the nuclei of malignant tissue are often considerably larger than those of normal tissue, and the disease can be fatal in advanced stages. If the cancer is found early, before it grows to a size of 10 mm, the patient has an 85% probability of complete remission. Timely detection and accurate prediction of breast cancer are therefore of utmost importance for improving patient outcomes, treatment planning, and survival rates.
The availability of appropriate screening technologies is crucial for spotting the first signs of breast cancer on time. The number one tool for detecting breast cancer in its earliest stages is mammography. Mammography screening has been proven effective in reducing breast cancer mortality rates [
3]. However, detecting subclinical breast cancer through screening mammography poses significant challenges. Tumors often occupy only a small fraction of the breast image; for instance, a full-field digital mammogram (FFDM) typically comprises
pixels, while a region of interest (ROI) indicating potential malignancy can be as small as
pixels. As a result, despite its advantages, screening mammography remains susceptible to false negatives and to overdiagnosis, in which breast cancer that would never have progressed to clinical disease during a woman’s lifetime is detected at screening. Numerous factors—such as fatigue, eye strain, and the varying experience levels of the professionals, including doctors and radiologists, who analyze the images—can affect the diagnostic accuracy of mammography. To enhance the predictive accuracy of screening mammography, computer-assisted detection (CADe) and diagnosis (CADx) software [
4] have been in clinical use since the 1990s. Unfortunately, earlier versions of these systems did not significantly improve diagnostic performance [
5], and progress stalled for well over a decade following their introduction.
In recent years, the remarkable successes of machine learning, especially deep learning—which has revolutionized the field of computer vision with a wide range of applications, from image classification and visual object detection to semantic segmentation—have attracted much attention in the medical community. Deep learning shows great potential for assisting health professionals by enhancing mammogram interpretation accuracy and supporting clinical decision-making, thus improving patient outcomes [
6,
7].
It is a well-known fact that deep learning algorithms generally require large amounts of training data to reach their optimal performance level [
8]. This is a major drawback, as assembling comprehensive mammography databases with ROI annotations requires considerable labor and time. Indeed, only a handful of publicly available mammography databases are fully annotated, while larger datasets often merely indicate the cancer status of each image [
9]. Some studies [
10,
11] have sought to train deep learning algorithms using whole mammograms without relying on any annotations. Nonetheless, it remains unclear whether these algorithms can effectively identify clinically significant lesions and make predictions based on the relevant sections of the mammograms.
Distinguishing between benign and malignant tumorous cases poses another set of challenges within this domain, stemming from the fact that diverse breast abnormalities—including masses, architectural distortions, and calcifications—appear in a wide variety of shapes and sizes. Effective breast cancer detection requires that the model maintain a high level of accuracy in recognizing intricate patterns, avoiding pitfalls such as over-fitting to specific datasets or mistakenly categorizing non-cancerous formations as concerning. Moreover, in clinical settings, instances of malignancy are often outnumbered by benign cases, resulting in heavily imbalanced datasets. This discrepancy causes detection models to favor non-cancerous outcomes, diminishing their effectiveness in identifying genuine cancer cases. As a result, training these models becomes more complex, and specialized methodologies—such as data augmentation, synthetic data generation, or cost-sensitive learning—need to be applied to enhance performance and achieve more decisive results.
To alleviate the problems of data scarcity and imbalance, the current study introduces a reliable metric-based few-shot deep learning framework for diagnosing breast cancer patients from a limited number of training mammograms. Few-shot learning is a form of meta-learning, a paradigm encompassing techniques that transfer generic, accumulated knowledge (meta-data) from prior experience so that a model can adapt quickly to new tasks from only a few data points, without requiring training from scratch. More precisely, few-shot learning requires only a small number
n of samples for every given image class, to prepare (train) the model that in turn can classify unseen images in the future [
12,
13]. This style of learning corresponds to what we normally think of as true intelligence. For example, a person can recognize someone’s face after only seeing them a few times, and this ability scales to thousands of different faces.
At present, few if any comparable studies treating breast cancer diagnosis as a k-way, n-shot classification problem—where k and n denote the number of class labels and of data samples used for model training—have been published. Owing to its recent success in facilitating few-shot learning (particularly in the one-shot scenario), we adopt a Siamese network to learn an embedding space in which learning from few samples is efficient. Specifically, a Siamese network is an architecture that contains two or more identical deep convolutional subnetworks, which generate feature embeddings from input images that are then compared to verify the similarity between them. We consider three diverse fine-tuned, pre-trained CNN models—namely, GoogLeNet, ResNet-50, and MobileNetV3—and examine their effectiveness as backbone encoder sub-networks for obtaining unbiased feature representations. To further improve the classification margin, we replace the traditional binary cross-entropy (CE) loss with a triplet-based loss function. Triplet loss offers significant advantages by bringing intraclass samples closer together while pushing interclass samples further apart in the embedding space. During training, triplets are constructed by selecting anchor images, positive images belonging to the same class, and negative images belonging to different classes. The triplet loss is then jointly optimized with the multi-task loss. By integrating these components, the network learns enhanced classification margins, which ultimately results in improved performance. The efficacy of the proposed framework is validated on two publicly available mammogram image datasets: INbreast and the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM).
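To make the pairing of a shared-weight encoder with the triplet objective concrete, the following is a minimal PyTorch sketch; the tiny encoder, embedding size, margin, and batch shapes are purely illustrative and do not reflect the backbones or settings used later in this paper.

```python
import torch
import torch.nn as nn

# Minimal illustration: one shared-weight encoder processes anchor, positive,
# and negative images; the triplet loss pulls same-class pairs together and
# pushes different-class pairs apart in the embedding space.
class TinyEncoder(nn.Module):
    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embedding_dim),
        )

    def forward(self, x):
        # L2-normalise so distances live on a common scale
        return nn.functional.normalize(self.features(x), dim=1)

encoder = TinyEncoder()                      # weights are shared across the three branches
loss_fn = nn.TripletMarginLoss(margin=1.0)   # hinge on d(a, p) - d(a, n) + margin

anchor   = torch.randn(4, 3, 224, 224)       # images of some class
positive = torch.randn(4, 3, 224, 224)       # same class as the anchors
negative = torch.randn(4, 3, 224, 224)       # a different class
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

In practice, the encoder branch would be one of the fine-tuned, pre-trained backbones considered in this study, with the same weights applied to anchor, positive, and negative images.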
In summary, the salient contributions of this research are outlined below:
- (a)
Propose a framework leveraging few-shot deep metric learning techniques for breast cancer diagnosis using whole mammogram images.
- (b)
Design and implement a Siamese network model, equipped with a triplet-based loss function, to generate bias-free feature encoding vectors from the input mammograms.
- (c)
Examine the efficacy of the proposed framework with a limited dataset across multiple domains, from binary to multi-class classification.
The remainder of this paper is laid out as follows.
Section 2 provides a concise overview of relevant literature.
Section 3 describes the proposed models and outlines the methodology and datasets utilized in this research. The experimental setup and results are presented and discussed in
Section 4 and
Section 5. Finally,
Section 6 summarizes the paper and reflects on potential directions for future research.
2. Related Work
In recent years, deep learning has become a common tool, widely applied in breast cancer screening. Moreover, studies have indicated that these advanced methods can diagnose breast cancer up to 12 months earlier than traditional clinical approaches [
14]. Furthermore, deep learning excels at identifying the features most relevant to the task at hand. This section provides a comprehensive overview of deep learning-based techniques specifically designed for analyzing mammography images to detect breast cancer. Several deep learning models have been utilized for this purpose, including convolutional neural networks (CNN), regional convolutional neural networks (R-CNN), generative adversarial networks (GAN), and vision transformers (ViT).
CNN [
15] is generally regarded as the leading deep learning technique for breast cancer detection. CNNs are deep neural networks designed to automatically and adaptively filter inputs for useful information through back-propagation, using multiple building blocks—such as convolution, pooling, and fully connected layers—stacked on top of each other. Since mammography data is not available in abundance, a large proportion of studies resort to transfer learning, reusing the learned weights of previously trained CNNs and applying them to breast cancer screening by fine-tuning only their fully connected layers. In [
16], the authors selected three deep learning classifiers—namely, regular CNN, ResNet-50, and Inception-v2—and evaluated their diagnostic performances using DDSM and INbreast datasets. Khamparia et al. [
17] proposed a hybrid transfer learning model—a fusion of modified VGG and ImageNet—which yielded an accuracy of 94.3% on DDSM dataset. Ragab et al. [
18] also used transfer learning with AlexNet architecture, but they replaced the last layer, responsible for final classification, with a Support Vector Machine (SVM) classifier. Their proposed architecture achieved an accuracy of 87.2% on the CBIS-DDSM dataset. A comparative study on mammogram classification performance of different networks on small datasets is reported in [
19]. Apart from directly using off-the-shelf models, other researchers have sought more effective transfer learning methods that improve the pre-training process and fully utilize the knowledge learned from the pre-training dataset [
20].
Several studies adopted and modified off-the-shelf end-to-end detectors, which take as input the whole mammography image and output bounding box coordinates for lesions with scores indicating the likelihoods of different lesion types. In particular, Ribli et al. [
21] used Faster R-CNN to detect and classify lesions on the INbreast and CBIS-DDSM datasets. R-CNNs, as their name indicates, combine a convolutional neural network architecture with specialized components aimed at detecting, localizing, and classifying objects within images. A key feature of these models is the Region Proposal Network (RPN), which functions as a specialized branch of convolutional layers positioned atop the final convolutional layer of the original network. The RPN is trained to identify and localize objects in an image, independently of their class. The system developed by Ribli et al. achieved a remarkable detection rate of 90% for malignant lesions in the INbreast dataset, while maintaining an impressively low rate of only 0.3 false positives per image. Another study by Antari et al. [
22] presented a fully integrated CAD system, comprising a You-Only-Look-Once (YOLO) regional network for detection, a full-resolution CNN (FrCNN) for segmentation, and a deep CNN for classification of breast lesions. The system was evaluated on the INbreast dataset, producing an overall accuracy of 95.64%.
Another important deep learning model used for breast cancer detection is the GAN. GAN [
23] is a deep learning-based generative model comprising two sub-models: the generator model, which is trained to generate new samples, and the discriminator model that tries to classify samples as either real (from the domain) or fake (generated). During training, the two models compete against each other in a zero-sum game, where one agent’s gain is another agent’s loss, until the discriminator model is fooled about half the time, meaning the generator model is generating plausible samples. In the work [
24], the authors introduced DiaGRAM (Deep GeneRAtive Multi-task), an innovative end-to-end system that leverages the capabilities of GAN alongside CNN to improve mammogram classification performance. Their approach utilizes the GAN to enhance feature learning by extracting features useful both for the discriminative tasks—i.e., patch and image classification—and for the GAN’s generative task, which involves distinguishing between real and generated patches. This dual functionality ensures that the learned features capture the essential data characteristics, thereby supporting the classification task. In a separate study, Singh et al. [
25] proposed a conditional GAN specifically designed to segment breast tumors within an ROI in mammograms. Their generative network is adept at identifying the tumor areas and generating the binary masks that outline these regions. In turn, the adversarial network is trained to differentiate between actual (ground truth) and synthetic segmentations, thus compelling the generative network to produce binary masks that closely emulate the real-world representations. In the second stage, a shape descriptor based on a CNN is utilized to classify the generated binary masks into four breast tumor shapes (i.e., irregular, lobular, oval, and round). The proposed shape descriptor was trained on the DDSM database, achieving an overall accuracy of 80%, which outperforms the current state-of-the-art. On the other hand, Guan and Loew [
26] employed GAN as a data augmenting device to generate synthetic mammographic images. Though these generated images are not exactly like the original ones, they can retain some of the essential features, structures, or patterns of the ROIs in the original images. The obtained results demonstrate that to classify the normal and abnormal ROIs from DDSM, adding GAN ROIs to the training data yields approximately 3.6% better classification performance than using the affine transformation augmented ROIs.
When analyzing mammograms, the previously described techniques in the literature mostly tend to focus on specific regions (patches) where tumors are suspected, simultaneously disregarding the rest of the image. This targeted approach can, however, cause them to overlook significant details, which potentially could have been revealed if the entire image was examined at once. Due to its ability to surpass the limitations of models focusing only on a small portion of an image, ViT has recently gained prominence in the field of computer vision, offering encouraging results in terms of accuracy, efficiency, and the aptitude to capture complex image features.
The ViT builds upon the underlying concept of the original transformer architecture, which was initially developed for text processing. By implementing a few adjustments to accommodate different data types, the ViT applies transformer methodology to the realm of images. This model utilizes various tokenization and embedding strategies, yet its general architecture is the same as that of traditional transformers. ViT is characterized by weaker inductive biases, which allows it to scale effectively with much larger datasets compared to CNN. This scalability comes with a hefty price, though. ViT generally requires a substantial amount of training data to achieve optimal performance. To mitigate this challenge, researchers have started to explore hybrid approaches that combine convolutional layers with ViT, thereby enhancing performance even when working with limited image datasets. Additionally, strategies such as transfer learning and self-supervised learning have been extensively utilized to alleviate data constraints.
Recent comparisons between transformer-based models and CNN—pertaining to mammographic image interpretation—have yielded varying results, largely influenced by differences in experimental designs and model architectures [
27,
28,
29,
30]. Most research has focused on comparing ViT to CNN for single-view image classification in a transfer learning setting. Notably, architectures like the Data-efficient Image Transformer (DeiT) and Swin transformer emerge as strong contenders for high-resolution medical image processing [
30,
31,
32,
33]. The Swin model, in particular, stands out due to its hierarchical architecture, which provides advantages in computational efficiency. Still, current research has not demonstrated that ViT consistently outperforms CNN in every scenario, particularly when it comes to low-dimensional and few-shot medical image processing [
34].
Nevertheless, the observed performance gap between ViT and CNN on established mammography datasets, including CBIS-DDSM, tends to be minimal or even slightly in favor of CNN. For instance, research conducted by Cantone et al. [
32] revealed that the Swin-v2 transformer was the only model to achieve competitive results that improved with higher input resolutions, suggesting a benefit in leveraging locality bias. Additionally, Miller et al. [
28] demonstrated that in a self-supervised framework, ViT pre-trained with masked autoencoder techniques exhibited subpar performance compared to CNNs that were pre-trained using contrastive self-supervised methods.
4. Experimental Setup
In this study, we systematically evaluate the effects of using different n-shot variants and loss functions separately for the CBIS-DDSM and INbreast datasets. These datasets provided a benchmark to assess the reliability and efficiency of the proposed n-shot meta-learning approach in the context of breast cancer detection. By utilizing the same training and testing settings, we were able to conduct a direct performance comparison between the CBIS-DDSM and INbreast datasets, thus ensuring a comprehensive analysis of our method.
4.1. Data Preparation
We deliberately avoid extensive pre-processing of the mammograms in our datasets, in an effort to improve the generalization ability of the proposed Siamese network model. This should make the model more robust to the varying image quality and noise present in the scans while it extracts feature embeddings from the input images. Hence, only a few standard pre-processing steps were performed to streamline model training. First, adaptive histogram equalization was applied to the input mammograms to enhance image contrast. Because the native resolution of the mammograms in the CBIS-DDSM and INbreast datasets far exceeds the input size required by the selected network models, all images had to be rescaled accordingly. In addition, image pixel values were rescaled and subsequently standardized so that all values in the data have zero mean and unit variance; this is achieved by subtracting the mean and dividing by the standard deviation of the image pixel values in the particular dataset. Standardization is a crucial pre-processing step that speeds up model convergence by removing biases among features and ensuring a uniform distribution throughout the dataset.
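As a rough illustration of the pre-processing pipeline described above, the sketch below applies contrast-limited adaptive histogram equalization, resizing, and standardization with OpenCV and NumPy; the CLAHE parameters, target resolution, and per-image statistics are assumptions for illustration, not the exact settings used in this study.

```python
import cv2
import numpy as np

def preprocess(mammogram: np.ndarray, target_size: int = 224) -> np.ndarray:
    """Contrast enhancement, resizing, and standardisation of a grayscale scan.
    target_size is illustrative; it must match the encoder's expected input."""
    # adaptive histogram equalisation (CLAHE) to enhance local contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = clahe.apply(mammogram.astype(np.uint8))
    # rescale to the input resolution expected by the backbone network
    img = cv2.resize(img, (target_size, target_size), interpolation=cv2.INTER_AREA)
    img = img.astype(np.float32)
    # standardise to zero mean and unit variance (dataset-level statistics
    # would be used in practice; per-image statistics are a simplification)
    return (img - img.mean()) / (img.std() + 1e-8)
```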
To avoid any bias or overfitting that might result in misleading classification accuracies, we ensured that the number of mammograms was balanced across all three categories for both datasets. Stratified random sampling was then used to divide each dataset into training, validation, and test subsets.
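A stratified split of this kind can be expressed with scikit-learn as sketched below; the placeholder arrays and the 70/15/15 ratio are illustrative only, since the exact ratio used in this study is not repeated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data: 300 mammograms with labels 0 = normal, 1 = benign, 2 = malignant
images = np.random.rand(300, 224, 224).astype(np.float32)
labels = np.random.randint(0, 3, size=300)

# illustrative 70/15/15 split that preserves class proportions in every subset
X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```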
Table 1 provides an overview of the data distributions of the normal, benign, and malignant classes for a particular dataset. Samples of the mammograms from the CBIS-DDSM and INbreast datasets are given in
Figure 2.
4.2. Training Strategy
We train our model in an episodic fashion, where each episode contains multiple training tasks. For each training task, a mini-batch of triplets with n-shot anchors from each of the 3 classes—thus totaling 3 × n triplets—is created. During training, the loss function assesses performance for each task in turn, given the respective batch of triplets. At the end of each episode, the model parameters are refined through backpropagation based on the computed loss to optimize performance. This iterative process allows the model to learn a metric embedding by gaining experience across a series of training tasks. For validation, we use a completely different set of tasks and evaluate performance on mini-batches of triplets sampled from the validation subset. Note that there is no overlap between examples in the training and validation subsets, so the algorithm must learn a distance metric that generalizes, rather than one tailored to a particular image subset.
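The construction of one training task—n-shot triplets per class, 3 × n triplets in total—could be sketched as follows; the uniform random choice of positives and negatives is an assumption made for illustration rather than the exact sampling procedure used here.

```python
import random
from collections import defaultdict

def sample_triplet_batch(dataset, n_shot: int, classes=(0, 1, 2)):
    """Build one training task: n_shot triplets per class (3 * n_shot in total).
    `dataset` is assumed to be a list of (image, label) pairs."""
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)

    triplets = []
    for c in classes:
        negatives = [img for other, imgs in by_class.items() if other != c for img in imgs]
        for _ in range(n_shot):
            anchor, positive = random.sample(by_class[c], 2)   # same class, distinct images
            negative = random.choice(negatives)                # any other class
            triplets.append((anchor, positive, negative))
    return triplets
```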
Figure 3 shows an example episode training scenario, in which several tasks are created, each defined by a mini-batch of triplets containing sample images from the meta-training dataset. The detailed training strategy is presented in Algorithm 1.
Algorithm 1 Few-shot learning training strategy

Input: dataset D
Output: trained model

Initialize base encoder network f
Define parameters: number of epochs (episodes), number of training tasks per episode, number of anchors, loss function
for each epoch do
    for each training task do
        Create a mini-batch of triplets
        for each triplet (anchor, positive, negative) in the mini-batch do
            Compute the anchor, positive, and negative embeddings with f
            Compute the similarity scores (distances) for the anchor–positive and anchor–negative pairs
            Calculate the triplet's loss according to the selected loss function definition
        end for
        Calculate the total loss (e.g., the average across all triplets)
        Update the model parameters using backpropagation
    end for
    if the early-stopping criterion is met then break
    end if
end for
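For completeness, a compact PyTorch loop corresponding to Algorithm 1 might look as follows; `encoder`, `loss_fn`, and `sample_triplet_batch` refer to the earlier sketches, the images are assumed to be tensors, and the hyperparameter values are placeholders rather than the reported configuration.

```python
import torch

def train(encoder, loss_fn, dataset, n_epochs=100, tasks_per_episode=10, n_shot=4):
    # placeholder optimizer settings, not the reported ones
    optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3, momentum=0.9)
    for epoch in range(n_epochs):
        for _ in range(tasks_per_episode):
            # one training task: a mini-batch of 3 * n_shot triplets
            triplets = sample_triplet_batch(dataset, n_shot)
            # each triplet element is assumed to be an image tensor (C, H, W)
            anchors, positives, negatives = (torch.stack(t) for t in zip(*triplets))
            loss = loss_fn(encoder(anchors), encoder(positives), encoder(negatives))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()      # parameters are updated after each task
        # an early-stopping check on validation tasks would go here
```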
4.3. Implementation Details
The base encoder networks and the proposed Siamese network model were implemented using PyTorch 2.6.0 with CUDA 12.4. All experiments were run in a cloud-based Google Colab notebook environment, which provides free access to computing resources, including GPUs and TPUs. Google Colab currently features an NVIDIA Tesla T4 GPU that contains 40 streaming multiprocessors with a 6 MB L2 cache shared by all, along with 16 GB of high-bandwidth RAM (GDDR6), and comes with pre-installed Python 3.x packages. Using this environment made it possible to complete all experimental runs for a full iteration cycle within an average of 3 min per test.
To compensate for the lack of a large annotated medical dataset, we utilized a transfer learning approach and initialized the trainable parameters of the base encoder networks with weights learned on the ImageNet dataset [
47]. Additionally, the pre-trained base encoders were fine-tuned to adapt them to the specific task. For this purpose, we discarded the last two layers of the pre-trained models and replaced them with two newly initialized layers that we trained from scratch, thus enabling them to capture the unique characteristics of selected datasets, while retaining the powerful feature extraction capabilities of the pre-trained models. Embeddings (feature vectors) were generated by stripping the last fully-connected layer of the base encoder network.
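A possible realization of this fine-tuning step, using ResNet-50 as an example backbone, is sketched below; the two replacement layers, the 128-dimensional embedding size, and the freezing of the pre-trained feature extractor are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision import models

# Illustrative fine-tuning of one backbone (ResNet-50); freezing the earlier
# layers and the chosen layer sizes are assumptions made for this sketch.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False            # keep the ImageNet feature extractor fixed

# replace the classifier head with two freshly initialised, trainable layers
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 512),
    nn.ReLU(),
    nn.Linear(512, 128),               # hypothetical 128-d embedding for the Siamese branches
)
for p in backbone.fc.parameters():
    p.requires_grad = True
```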
To train the proposed Siamese network model, we employed a Stochastic Gradient Descent (SGD) optimizer with chosen values for the initial learning rate, momentum, and weight decay. Furthermore, we used the MultiStepLR scheduler with a decay factor deemed best after hyperparameter tuning. The model was trained for 100 epochs (episodes) per experiment, and an early stopping procedure was implemented to halt training if the monitored performance measure showed no improvement for a “patience” number of epochs.
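The optimizer, scheduler, and early-stopping arrangement could be wired up as follows; the learning rate, momentum, weight decay, milestones, gamma, and patience values are placeholders (the reported values are not reproduced here), and the stand-in model and `run_one_episode` helper are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 3)   # stand-in for the Siamese encoder

# placeholder hyperparameters, not the reported configuration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

best_val, patience, wait = float("inf"), 10, 0
for epoch in range(100):
    val_loss = run_one_episode(model, optimizer)   # assumed training/validation helper
    scheduler.step()
    if val_loss < best_val:        # early stopping on the monitored measure
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break
```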
4.4. Evaluation Metrics
The proposed Siamese network model is rigorously scrutinized using five key performance metrics: accuracy, precision, recall, specificity, and F1-score. Here, accuracy is defined as the proportion of correct predictions relative to the total number of instances assessed. Precision, often referred to as positive predictive value (PPV), represents the ratio of true positive cases over all predictions made within the positive class. Recall or sensitivity—also called true positive rate (TPR)—which is particularly important in medical applications, refers to the percentage of actual positive cases that are accurately identified. On the other hand, specificity, commonly known as the true negative rate (TNR), is the percentage of true negative instances that are correctly classified. Last but not least, F1-score denotes the harmonic mean between precision and recall, calculated by taking their weighted average that captures both metrics’ balance effectively.
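For reference, the five metrics can be computed from a confusion matrix as in the following sketch, shown here for a binary case with placeholder predictions; scikit-learn provides accuracy, precision, recall, and F1 directly, while specificity is derived from the true-negative and false-positive counts.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# placeholder predictions (1 = malignant, 0 = non-malignant)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy":    accuracy_score(y_true, y_pred),
    "precision":   precision_score(y_true, y_pred),   # PPV = TP / (TP + FP)
    "recall":      recall_score(y_true, y_pred),      # sensitivity / TPR
    "specificity": tn / (tn + fp),                    # TNR, no direct sklearn helper
    "f1":          f1_score(y_true, y_pred),          # harmonic mean of P and R
}
```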
Also, the receiver operating characteristic (ROC) curve, which visualizes the trade-off between sensitivity and specificity at various classification thresholds, with its area under the curve (AUC) score, is utilized. The AUC score reflects how good the model is in distinguishing between classes and was proven, both theoretically and empirically, to be more suitable than the accuracy metric [
48] for evaluating classification performance. The AUC score can be computed as follows:
where,
is the sum of all positive instances ranked, while
and
denote the total number of positive and negative instances, respectively.
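A small sanity check of this rank-based formula against scikit-learn's implementation, using made-up scores, might look like this:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def auc_from_ranks(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Rank-based AUC: S_p is the sum of ranks of the positive instances
    among all scores; n_p and n_n count positives and negatives."""
    ranks = rankdata(scores)                 # average ranks handle ties
    s_p = ranks[y_true == 1].sum()
    n_p, n_n = (y_true == 1).sum(), (y_true == 0).sum()
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.7, 0.6, 0.2, 0.8, 0.4])
assert np.isclose(auc_from_ranks(y, s), roc_auc_score(y, s))   # both give 1.0 here
```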
5. Results and Discussion
To assess the efficacy of the proposed framework for diagnosing breast cancer cases, the following approach was adopted. Since our framework is based on a few-shot learning paradigm, first we examine the effect that the number of shots has on the overall performance, and to this end we conduct multiple experiments in various 3-way, n-shot learning settings—where n varies between 3 and 7—with both triplet and circle losses. In order to get even better insight, we subsequently scrutinize the efficiency of the proposed framework in distinguishing between tumorous and normal mammograms (2-class problem variant) with similar n-shot learning settings. Finally, for comparison purposes, we present the results obtained by using the three selected, fine-tuned, pre-trained base encoder networks as standalone classifiers.
During evaluation, the distances between each held-out test instance and all remaining images are compared against one another. Thresholding and the 5-nearest-neighbour algorithm are then applied to assign each instance to a category. In the experiments focusing on multi-class classification, the macro average is employed to derive the final performance metrics. This technique ensures that all categories receive the same weight when calculating the mean, so the final value reflects the arithmetic mean of the individual metrics associated with each category.
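The distance-based assignment step can be sketched as a simple majority vote over the five nearest reference embeddings; the Euclidean distance and the omission of the thresholding step are simplifications for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(test_emb: np.ndarray, ref_embs: np.ndarray,
                ref_labels: np.ndarray, k: int = 5):
    """Classify one test embedding by majority vote among its k nearest
    reference embeddings (Euclidean distance)."""
    dists = np.linalg.norm(ref_embs - test_emb, axis=1)   # distance to every reference
    nearest = np.argsort(dists)[:k]                       # indices of the k closest
    return Counter(ref_labels[nearest]).most_common(1)[0][0]
```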
5.1. Classification Results for 3-Way Detection Problem
The evaluation results of the proposed Siamese network model are presented independently for each dataset and each loss function. In all experiments, we employed a consistent architecture of base encoder networks, alongside uniform training and test configurations, to objectively validate the model’s performance on both the CBIS-DDSM and INbreast datasets.
Table 2 and
Table 3 summarize the classification results for all three base encoder networks over various
n-shot settings on the INbreast dataset using triplet and circle loss, respectively. It is evident from the results that better AUC scores and other performance metric values are generally achieved with a lower number of shots (3–4) and appear to decline as the number of shots increases. Only for the MobileNetV3 base encoder with triplet loss is the best performance recorded when training with mini-batches containing 6 triplets. Overall, the results obtained with the circle loss function are slightly better than those yielded with the standard triplet loss function, which is in agreement with the findings reported in the literature. The highest sensitivity and specificity on the INbreast dataset are obtained with the ResNet50 base encoder in a 4-shot setting.
The average performance evaluation results for different loss functions and
n-shot settings on the CBIS-DDSM dataset are presented in
Table 4 and
Table 5. Here, the performance results across all experiments are far more homogeneous—showing small variations among themselves based on loss function and number of shots used for training—and are, for the most part, higher than those on the INbreast dataset. On the CBIS-DDSM dataset, GoogLeNet, ResNet50, and MobileNetV3 yield an average AUC score of
,
, and
, respectively, using triplet loss, and an average AUC score of
,
, and
, respectively, for the circle loss function. In addition, the average sensitivities of
,
, and
, respectively, are achieved for GoogLeNet, ResNet50, and MobileNetV3 base encoders with triplet loss. Using circle loss, the average recorded sensitivities for the GoogLeNet, ResNet50, and MobileNetV3 base encoders are
,
, and
, respectively. Again, the averaged performance results over all experiments for different base encoder networks in combination with circle loss consistently narrowly surpass those obtained through the use of the standard triplet loss function, in terms of all evaluation metrics.
Figure 4 and
Figure 5 show corresponding ROC curves for all the 3-way experiments carried out on both INBreast and CBIS-DDSM datasets.
5.2. Classification Results for 2-Way Detection Problem
Table 6,
Table 7,
Table 8 and
Table 9 show the performance results with respective loss functions and various
n-shot settings in classifying healthy and breast cancer patients on INbreast and CBIS-DDSM datasets. As expected, the model generally shows improved performance for diagnosing breast cancer patients on both datasets in this simplified 2-class scenario.
In
Table 6 and
Table 7, we observe the results from the INbreast dataset, where normal mammograms are accurately classified with average specificity of
,
, and
for the GoogLeNet, ResNet50, and MobileNetV3, with triplet loss, and
,
, and
, with circle loss, respectively. For tumorous cases, the average sensitivities are commendably high, standing at
and
for GoogLeNet and
flat for both ResNet50 and MobileNetV3, with triplet and circle loss, respectively. Notably, GoogLeNet combined with circle loss in a 4-shot setting demonstrates near-perfect performance with the highest AUC score. These results clearly illustrate that the proposed Siamese network model, particularly when using GoogLeNet as a base encoder, is able to effectively rule out the presence of disease in a patient.
When evaluating the CBIS-DDSM dataset, the classification of normal cases yielded average specificities exceeding those in 3-way experiments, namely and for GoogLeNet, and for ResNet50, and and for MobileNetV3, with triplet and circle loss, respectively. The average sensitivities for tumorous cases also reflect strong outcomes, with , , and for GoogLeNet, ResNet50, and MobileNetV3 with triplet loss, and , , and with circle loss, respectively. Furthermore, GoogLeNet paired with circle loss achieved two top AUC scores of and . These findings underscore the effectiveness of the proposed Siamese network model in distinguishing between healthy and cancer patients across the datasets.
Corresponding ROC curves for all 2-way experiments conducted on the INBreast and CBIS-DDSM datasets are illustrated in
Figure 6 and
Figure 7.
5.3. Siamese Network Model vs. Standard Pre-Trained CNN Classifiers
The performance results attained by utilizing the three selected, fine-tuned, pre-trained base encoder networks as standalone classifiers in 2-way and 3-way scenarios are presented in
Table 10 and
Table 11. By comparing these results to those achieved by the proposed Siamese network model, it can be concluded that the latter consistently outperforms the pre-trained CNN models in every scenario and across all evaluation metrics for both the INbreast and CBIS-DDSM datasets. There could be many reasons for this, but the main point is that there is not nearly enough mammogram data available to efficiently train deep neural networks. The somewhat satisfactory performance results the three networks achieved can mostly be attributed to the extensive pre-training they underwent using the ImageNet dataset. On the other hand, the Siamese-based model exploits the benefits of being given triplets of images, where it learns to distinguish a similar image from different ones, and exhibits stronger generalization capabilities. Specifically, on the INbreast dataset for a binary classification task, GoogLeNet, ResNet-50, and MobileNetV3, paired with triplet loss, have improved their performance in terms of AUC scores by an average
,
, and
, respectively. For the circle loss, the corresponding average AUC score improvements stand at
,
, and
, in a 2-way scenario. Similarly, for a 3-class problem, the average recorded boosts in AUC score amount to
and
for GoogLeNet,
and
for ResNet50, and
and
for MobileNetV3, with triplet and circle loss, respectively. In the context of the CBIS-DDSM dataset, considering a 2-way scenario, the three selected pre-trained base encoders have enhanced their AUC scores by an average of
,
, and
, in case of triplet loss, and an average of
,
, and
, in case of circle loss. In a 3-class setting, the average AUC score enhancements for GoogLeNet are approximately
and
, while ResNet50 shows improvements of around
and
. Meanwhile, MobileNetV3 demonstrates increases of
and
when utilizing triplet and circle loss, respectively.
These achievements are particularly significant, knowing that the Siamese model is trained with only a (very) limited number of training examples from each category of images. Furthermore, the proposed model is relatively compact, featuring fewer trainable parameters, which highlights the advantages of employing a few-shot learning approach for diagnosing breast cancer patients compared to traditional, supervised learning techniques.
5.4. Statistical Inference Study
To assess the means and standard deviations of TPR values and AUC scores, we conducted a statistical analysis using the Kruskal–Wallis test. This non-parametric method is particularly effective for comparing two or more independent samples, regardless of whether they are of equal or differing sizes. A significant result from the Kruskal–Wallis test implies that at least one sample exhibits a stochastic dominance over the others. We utilized this approach to investigate whether there are significant statistical differences—at a confidence level—in the outcomes yielded by the proposed Siamese network model, depending on the number of shots or base encoders used, specifically for the INbreast and CBIS-DDSM datasets in 2-way and 3-way scenarios.
The computed
p values, presented in
Table 12 and
Table 13, indicate that even though the results differ insignificantly
based on the number of training triplets (
n-shot), significant differences are in some cases recorded, in both TPR values and AUC scores, between the three selected base encoder networks for the INbreast and CBIS-DDSM datasets. While the
p value from the Kruskal–Wallis test confirms the existence of a statistically significant difference, the Mann–Whitney U tests are further employed to identify which specific CNN model(s) produce significant results. This analysis is also conducted at a
confidence level, with the findings detailed in
Table 14 and
Table 15. For a binary classification task with triplet loss, MobileNetV3 exhibits a significantly inferior average AUC score compared to the best-performing base encoder, GoogLeNet, on the INbreast dataset. When analyzing the same scenario on the CBIS-DDSM dataset, we find that all three base encoders differ significantly in their TPR values, whereas when circle loss is utilized, GoogLeNet appears to have an edge over MobileNetV3 and ResNet50 in terms of TPR values and AUC scores, respectively. Similar conclusions can be drawn for the 3-class problem using triplet loss.
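Both tests are available in SciPy; the sketch below, with made-up AUC scores standing in for the per-run results, shows how the Kruskal–Wallis test and the pairwise Mann–Whitney U follow-ups described above would be applied.

```python
from scipy.stats import kruskal, mannwhitneyu

# placeholder AUC scores per base encoder across n-shot runs (not real results)
googlenet = [0.91, 0.93, 0.90, 0.92, 0.94]
resnet50  = [0.90, 0.91, 0.89, 0.92, 0.90]
mobilenet = [0.86, 0.88, 0.85, 0.87, 0.86]

h_stat, p_val = kruskal(googlenet, resnet50, mobilenet)   # omnibus test
if p_val < 0.05:                                          # assumed 95% confidence level
    # pairwise follow-up tests to locate the differing encoder(s)
    _, p_gm = mannwhitneyu(googlenet, mobilenet, alternative="two-sided")
    _, p_gr = mannwhitneyu(googlenet, resnet50, alternative="two-sided")
```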
6. Conclusions and Future Research
Breast cancer classification falls within a data-scarce domain, where acquiring an adequate number of training samples to effectively train a conventional neural network is often impractical. To circumvent the problem of insufficient data, this paper presents a few-shot learning approach to breast cancer detection, evaluated on two digital X-ray mammogram datasets. In particular, we test three different pre-trained, fine-tuned CNN encoders for their ability to capture unbiased feature representations resilient against overfitting, and we adapt a Siamese network architecture to perform the final classification of normal and tumorous cases. Our findings indicate that the proposed model, utilizing a triplet-based loss, delivers a highly accurate and practical mechanism for the automated diagnosis of breast cancer in various n-shot learning settings. Moreover, our model not only matches but exceeds the performance of the fine-tuned pre-trained CNN models employed as standalone classifiers, by a considerable margin. These results may encourage health professionals to leverage the model in the early detection of breast cancer, thereby easing their workload and enabling them to allocate more time to crafting personalized treatment plans for their patients.
Moving forward, we intend to extend this work by performing data augmentation to artificially increase the size and diversity of the training data, in an attempt to further improve the performance and generalization of the proposed model. In addition, more specific and meaningful representations could potentially be derived through the use of multi-scale feature maps, generated by superimposing features produced from annotated ROI patches containing different types of breast lesions on top of features obtained from the entire breast images. Finally, the presented model can be applied to the detection of other common cancer types, such as brain, lung, or prostate cancer.