1. Introduction
Breast cancer is one of the most frequently diagnosed cancers worldwide and one of the leading causes of cancer-related death. According to the World Health Organization’s International Agency for Research on Cancer (IARC), an estimated 2.3 million new breast cancer cases were recorded globally in 2022, resulting in 669,418 deaths. Although men are not immune to the disease, studies show that women are roughly 100 times more likely to develop it than men. A more recent study by the American Cancer Society [
1] predicted that in 2025, breast cancer will remain the most common form of cancer among women, accounting for approximately 31% of all female cancers. In addition, female breast cancer incidence rates have been continuously increasing since the mid-2000s by 1% per year overall, a trend that has been at least in part attributed to changing risk factors, such as increased excessive body weight [
2].
Breast tumors can be classified as either benign, which are considered non-hazardous and non-life-threatening, or malignant. Malignant cancer begins with abnormal cell growth in the lining of the breast and can spread to neighboring tissues quite rapidly; the nuclei of malignant tissue are often considerably larger than those of normal tissue, and the disease can be fatal in advanced stages. If the cancer is found early, before it grows to a size of 10 mm, the patient has an 85% probability of complete remission. Timely detection and accurate prediction of breast cancer are therefore of utmost importance for improving patient outcomes, treatment planning, and survival rates.
The availability of appropriate screening technologies is crucial for spotting the first signs of breast cancer on time. The number one tool for detecting breast cancer in its earliest stages is mammography. Mammography screening has been proven effective in reducing breast cancer mortality rates [
3]. However, detecting subclinical breast cancer through screening mammography poses significant challenges. Tumors often occupy only a small fraction of the breast image; for instance, a full-field digital mammogram (FFDM) typically comprises
pixels, while a region of interest (ROI) indicating potential malignancy can be as small as
pixels. As a result, despite its advantages, screening mammography remains susceptible to false negatives and to overdiagnosis, in which breast cancer that would never have progressed to clinical disease during a woman’s lifetime is detected at screening. Numerous factors—such as fatigue, eye strain, and the varying experience levels of the professionals, including doctors and radiologists, who analyze the images—can affect the diagnostic accuracy of mammography. To enhance the predictive accuracy of screening mammography, computer-assisted detection (CADe) and diagnosis (CADx) software [
4] have been in clinical use since the 1990s. Unfortunately, earlier versions of these systems did not significantly improve diagnostic performance [
5], and progress stalled for well over a decade following their introduction.
In recent years, the remarkable successes of machine learning, especially deep learning—which has revolutionized the field of computer vision with a wide range of applications, from image classification and visual object detection to semantic segmentation—have attracted much attention in the medical community. Deep learning shows great potential for assisting health professionals by enhancing mammogram interpretation accuracy and supporting clinical decision-making, thus improving patient outcomes [
6,
7].
It is a well-known fact that deep learning algorithms generally require large amounts of training data to reach their optimal performance level [
8]. This is a major drawback, as assembling comprehensive mammography databases with ROI annotations requires considerable labor and time. Indeed, only a handful of publicly available mammography databases are fully annotated, while larger datasets often merely indicate the cancer status of each image [
9]. Some studies [
10,
11] have sought to train deep learning algorithms using whole mammograms without relying on any annotations. Nonetheless, it remains unclear whether these algorithms can effectively identify clinically significant lesions and make predictions based on the relevant sections of the mammograms.
Distinguishing between benign and malignant tumorous cases poses another set of challenges within this domain, stemming from the fact that diverse breast abnormalities—including masses, architectural distortions, and calcifications—appear in a wide variety of shapes and sizes. Effective breast cancer detection requires that the model maintain a high level of accuracy in recognizing intricate patterns, avoiding pitfalls such as over-fitting to specific datasets or mistakenly categorizing non-cancerous formations as concerning. Moreover, in clinical settings, instances of malignancy are often outnumbered by benign cases, resulting in heavily imbalanced datasets. This discrepancy causes detection models to favor non-cancerous outcomes, diminishing their effectiveness in identifying genuine cancer cases. As a result, training these models becomes more complex, and specialized methodologies—such as data augmentation, synthetic data generation, or cost-sensitive learning—need to be applied to enhance performance and achieve more decisive results.
To alleviate the problems of data scarcity and imbalance, the current study introduces a reliable metric-based few-shot deep learning framework for diagnosing breast cancer patients from a limited number of training mammograms. Few-shot learning is a form of meta-learning, a paradigm encompassing techniques that transfer generic, accumulated knowledge (meta-data) from prior experience so that a model can adapt quickly to new tasks from only a few data points, without requiring training from scratch. More precisely, few-shot learning requires only a small number
n of samples for every given image class, to prepare (train) the model that in turn can classify unseen images in the future [
12,
13]. This style of learning corresponds to what we normally think of as true intelligence. For example, a person can recognize someone’s face after only seeing them a few times, and this ability scales to thousands of different faces.
At present, few if any comparable studies treating breast cancer diagnosis as a k-way, n-shot classification problem—where k and n denote the number of class labels and of data samples used for model training—have been published. Owing to its recent success in facilitating few-shot learning (particularly in the one-shot scenario), we adopt a Siamese network to learn an embedding space in which learning from few samples is efficient. Specifically, a Siamese network is an architecture that contains two or more identical deep convolutional subnetworks, which generate feature embeddings from input images that are then compared to verify the similarity between them. We consider three diverse fine-tuned, pre-trained CNN models—namely, GoogLeNet, ResNet-50, and MobileNetV3—and examine their effectiveness as backbone encoder sub-networks for obtaining unbiased feature representations. To further improve the classification margin, we replace the traditional binary cross-entropy (CE) loss with a triplet-based loss function. Triplet loss offers significant advantages by bringing intraclass samples closer together while pushing interclass samples further apart in the embedding space. During training, triplets are constructed by selecting anchor images, positive images belonging to the same class, and negative images belonging to different classes. The triplet loss is then jointly optimized with the multi-task loss. By integrating these components, the network learns enhanced classification margins, which ultimately results in improved performance. The efficacy of the proposed framework is validated on two publicly available mammogram image datasets: INbreast and the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM).
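To make the pairing of a shared-weight encoder with the triplet objective concrete, the following is a minimal PyTorch sketch; the tiny encoder, embedding size, margin, and batch shapes are purely illustrative and do not reflect the backbones or settings used later in this paper.

```python
import torch
import torch.nn as nn

# Minimal illustration: one shared-weight encoder processes anchor, positive,
# and negative images; the triplet loss pulls same-class pairs together and
# pushes different-class pairs apart in the embedding space.
class TinyEncoder(nn.Module):
    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embedding_dim),
        )

    def forward(self, x):
        # L2-normalise so distances live on a common scale
        return nn.functional.normalize(self.features(x), dim=1)

encoder = TinyEncoder()                      # weights are shared across the three branches
loss_fn = nn.TripletMarginLoss(margin=1.0)   # hinge on d(a, p) - d(a, n) + margin

anchor   = torch.randn(4, 3, 224, 224)       # images of some class
positive = torch.randn(4, 3, 224, 224)       # same class as the anchors
negative = torch.randn(4, 3, 224, 224)       # a different class
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

In practice, the encoder branch would be one of the fine-tuned, pre-trained backbones considered in this study, with the same weights applied to anchor, positive, and negative images.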
In summary, the salient contributions of this research are outlined below:
- (a)
Propose a framework leveraging few-shot deep metric learning techniques for breast cancer diagnosis using whole mammogram images.
- (b)
Design and implement a Siamese network model, equipped with a triplet-based loss function, to generate bias-free feature encoding vectors from the input mammograms.
- (c)
Examine the efficacy of the proposed framework with a limited dataset across multiple domains, from binary to multi-class classification.
The remainder of this paper is laid out as follows.
Section 2 provides a concise overview of relevant literature.
Section 3 describes the proposed models and outlines the methodology and datasets utilized in this research. The experimental setup and results are presented and discussed in
Section 4 and
Section 5. Finally,
Section 6 summarizes the paper and reflects on potential directions for future research.
2. Related Work
In recent years, deep learning has become a common tool, widely applied in breast cancer screening. Moreover, studies have indicated that these advanced methods can diagnose breast cancer up to 12 months earlier than traditional clinical approaches [
14]. Furthermore, deep learning excels at identifying the features most relevant to the task at hand. This section provides a comprehensive overview of deep learning-based techniques specifically designed for analyzing mammography images to detect breast cancer. Several deep learning models have been utilized for this purpose, including convolutional neural networks (CNN), regional convolutional neural networks (R-CNN), generative adversarial networks (GAN), and vision transformers (ViT).
CNN [
15] is generally regarded as the leading deep learning technique for breast cancer detection. CNNs are deep neural networks designed to automatically and adaptively filter inputs for useful information through back-propagation, using multiple building blocks—such as convolution, pooling, and fully connected layers—stacked on top of each other. Since mammography data is not available in abundance, a large proportion of studies resort to transfer learning, reusing the learned weights of previously trained CNNs and applying them to breast cancer screening by fine-tuning only their fully connected layers. In [
16], the authors selected three deep learning classifiers—namely, regular CNN, ResNet-50, and Inception-v2—and evaluated their diagnostic performances using DDSM and INbreast datasets. Khamparia et al. [
17] proposed a hybrid transfer learning model—a fusion of modified VGG and ImageNet—which yielded an accuracy of 94.3% on DDSM dataset. Ragab et al. [
18] also used transfer learning with AlexNet architecture, but they replaced the last layer, responsible for final classification, with a Support Vector Machine (SVM) classifier. Their proposed architecture achieved an accuracy of 87.2% on the CBIS-DDSM dataset. A comparative study on mammogram classification performance of different networks on small datasets is reported in [
19]. Apart from directly using off-the-shelf models, other researchers have sought more effective transfer learning methods that improve the pre-training process and fully utilize the knowledge learned from the pre-training dataset [
20].
Several studies adopted and modified off-the-shelf end-to-end detectors, which take as input the whole mammography image and output bounding box coordinates for lesions with scores indicating the likelihoods of different lesion types. In particular, Ribli et al. [
21] used Faster R-CNN to detect and classify lesions on the INbreast and CBIS-DDSM datasets. R-CNNs, as their name indicates, combine a convolutional neural network architecture with specialized components aimed at detecting, localizing, and classifying objects within images. A key feature of these models is the Region Proposal Network (RPN), which functions as a specialized branch of convolutional layers positioned atop the final convolutional layer of the original network. The RPN is trained to identify and localize objects in an image, independently of their class. The system developed by Ribli et al. achieved a remarkable detection rate of 90% for malignant lesions in the INbreast dataset, while maintaining an impressively low rate of only 0.3 false positives per image. Another study by Antari et al. [
22] presented a fully integrated CAD system, comprising a You-Only-Look-Once (YOLO) regional network for detection, a full-resolution CNN (FrCNN) for segmentation, and a deep CNN for classification of breast lesions. The system was evaluated on the INbreast dataset, producing an overall accuracy of 95.64%.
Another important deep learning model used for breast cancer detection is the GAN. GAN [
23] is a deep learning-based generative model comprising two sub-models: the generator model, which is trained to generate new samples, and the discriminator model that tries to classify samples as either real (from the domain) or fake (generated). During training, the two models compete against each other in a zero-sum game, where one agent’s gain is another agent’s loss, until the discriminator model is fooled about half the time, meaning the generator model is generating plausible samples. In the work [
24], the authors introduced DiaGRAM (Deep GeneRAtive Multi-task), an innovative end-to-end system that leverages the capabilities of GAN alongside CNN to improve mammogram classification performance. Their approach utilizes the GAN to enhance feature learning by extracting features useful both for the discriminative tasks—i.e., patch and image classification—and for the GAN’s generative task, which involves distinguishing between real and generated patches. This dual functionality ensures that the learned features capture the essential data characteristics, thereby supporting the classification task. In a separate study, Singh et al. [
25] proposed a conditional GAN specifically designed to segment breast tumors within an ROI in mammograms. Their generative network is adept at identifying the tumor areas and generating the binary masks that outline these regions. In turn, the adversarial network is trained to differentiate between actual (ground truth) and synthetic segmentations, thus compelling the generative network to produce binary masks that closely emulate the real-world representations. In the second stage, a shape descriptor based on a CNN is utilized to classify the generated binary masks into four breast tumor shapes (i.e., irregular, lobular, oval, and round). The proposed shape descriptor was trained on the DDSM database, achieving an overall accuracy of 80%, which outperforms the current state-of-the-art. On the other hand, Guan and Loew [
26] employed GAN as a data augmenting device to generate synthetic mammographic images. Though these generated images are not exactly like the original ones, they can retain some of the essential features, structures, or patterns of the ROIs in the original images. The obtained results demonstrate that to classify the normal and abnormal ROIs from DDSM, adding GAN ROIs to the training data yields approximately 3.6% better classification performance than using the affine transformation augmented ROIs.
When analyzing mammograms, the previously described techniques in the literature mostly tend to focus on specific regions (patches) where tumors are suspected, simultaneously disregarding the rest of the image. This targeted approach can, however, cause them to overlook significant details, which potentially could have been revealed if the entire image was examined at once. Due to its ability to surpass the limitations of models focusing only on a small portion of an image, ViT has recently gained prominence in the field of computer vision, offering encouraging results in terms of accuracy, efficiency, and the aptitude to capture complex image features.
The ViT builds upon the underlying concept of the original transformer architecture, which was initially developed for text processing. By implementing a few adjustments to accommodate different data types, the ViT applies transformer methodology to the realm of images. This model utilizes various tokenization and embedding strategies, yet its general architecture is the same as that of traditional transformers. ViT is characterized by weaker inductive biases, which allows it to scale effectively with much larger datasets compared to CNN. This scalability comes with a hefty price, though. ViT generally requires a substantial amount of training data to achieve optimal performance. To mitigate this challenge, researchers have started to explore hybrid approaches that combine convolutional layers with ViT, thereby enhancing performance even when working with limited image datasets. Additionally, strategies such as transfer learning and self-supervised learning have been extensively utilized to alleviate data constraints.
Recent comparisons between transformer-based models and CNN—pertaining to mammographic image interpretation—have yielded varying results, largely influenced by differences in experimental designs and model architectures [
27,
28,
29,
30]. Most research has focused on comparing ViT to CNN for single-view image classification in a transfer learning setting. Notably, architectures like the Data-efficient Image Transformer (DeiT) and Swin transformer emerge as strong contenders for high-resolution medical image processing [
30,
31,
32,
33]. The Swin model, in particular, stands out due to its hierarchical architecture, which provides advantages in computational efficiency. Still, current research has not demonstrated that ViT consistently outperforms CNN in every scenario, particularly when it comes to low-dimensional and few-shot medical image processing [
34].
Nevertheless, the observed performance gap between ViT and CNN on established mammography datasets, including CBIS-DDSM, tends to be minimal or even slightly in favor of CNN. For instance, research conducted by Cantone et al. [
32] revealed that the Swin-v2 transformer was the only model to achieve competitive results that improved with higher input resolutions, suggesting a benefit in leveraging locality bias. Additionally, Miller et al. [
28] demonstrated that in a self-supervised framework, ViT pre-trained with masked autoencoder techniques exhibited subpar performance compared to CNNs that were pre-trained using contrastive self-supervised methods.
4. Experimental Setup
In this study, we systematically evaluate the effects of using different n-shot variants and loss functions separately for the CBIS-DDSM and INbreast datasets. These datasets provided a benchmark to assess the reliability and efficiency of the proposed n-shot meta-learning approach in the context of breast cancer detection. By utilizing the same training and testing settings, we were able to conduct a direct performance comparison between the CBIS-DDSM and INbreast datasets, thus ensuring a comprehensive analysis of our method.
4.1. Data Preparation
We deliberately avoid extensive pre-processing of the mammograms in our datasets, in an effort to improve the generalization ability of the proposed Siamese network model. This should make the model more robust to the varying image quality and noise present in the scans while it extracts feature embeddings from the input images. Hence, only a few standard pre-processing steps were performed to streamline model training. First, adaptive histogram equalization was applied to the input mammograms to enhance image contrast. Because the native resolution of the mammograms in the CBIS-DDSM and INbreast datasets far exceeds the input size required by the selected network models, all images had to be rescaled accordingly. In addition, image pixel values were rescaled and subsequently standardized so that all values in the data have zero mean and unit variance; this is achieved by subtracting the mean and dividing by the standard deviation of the image pixel values in the particular dataset. Standardization is a crucial pre-processing step that speeds up model convergence by removing biases among features and ensuring a uniform distribution throughout the dataset.
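As a rough illustration of the pre-processing pipeline described above, the sketch below applies contrast-limited adaptive histogram equalization, resizing, and standardization with OpenCV and NumPy; the CLAHE parameters, target resolution, and per-image statistics are assumptions for illustration, not the exact settings used in this study.

```python
import cv2
import numpy as np

def preprocess(mammogram: np.ndarray, target_size: int = 224) -> np.ndarray:
    """Contrast enhancement, resizing, and standardisation of a grayscale scan.
    target_size is illustrative; it must match the encoder's expected input."""
    # adaptive histogram equalisation (CLAHE) to enhance local contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = clahe.apply(mammogram.astype(np.uint8))
    # rescale to the input resolution expected by the backbone network
    img = cv2.resize(img, (target_size, target_size), interpolation=cv2.INTER_AREA)
    img = img.astype(np.float32)
    # standardise to zero mean and unit variance (dataset-level statistics
    # would be used in practice; per-image statistics are a simplification)
    return (img - img.mean()) / (img.std() + 1e-8)
```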
To avoid any bias or overfitting that might result in misleading classification accuracies, we ensured that the number of mammograms was balanced across all three categories for both datasets. Stratified random sampling was then used to divide each dataset into training, validation, and test subsets.
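A stratified split of this kind can be expressed with scikit-learn as sketched below; the placeholder arrays and the 70/15/15 ratio are illustrative only, since the exact ratio used in this study is not repeated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data: 300 mammograms with labels 0 = normal, 1 = benign, 2 = malignant
images = np.random.rand(300, 224, 224).astype(np.float32)
labels = np.random.randint(0, 3, size=300)

# illustrative 70/15/15 split that preserves class proportions in every subset
X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```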
Table 1 provides an overview of the data distributions of the normal, benign, and malignant classes for a particular dataset. Samples of the mammograms from the CBIS-DDSM and INbreast datasets are given in
Figure 2.
4.2. Training Strategy
We train our model in an episodic fashion, where each episode contains multiple training tasks. For each training task, a mini-batch of triplets with n-shot anchors from each of the 3 classes—thus totaling 3 × n triplets—is created. During training, the loss function assesses performance for each task in turn, given the respective batch of triplets. At the end of each episode, the model parameters are refined through backpropagation based on the computed loss to optimize performance. This iterative process allows the model to learn a metric embedding by gaining experience across a series of training tasks. For validation, we use a completely different set of tasks and evaluate performance on mini-batches of triplets sampled from the validation subset. Note that there is no overlap between examples in the training and validation subsets, so the algorithm must learn a distance metric that generalizes, rather than one tailored to a particular image subset.
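The construction of one training task—n-shot triplets per class, 3 × n triplets in total—could be sketched as follows; the uniform random choice of positives and negatives is an assumption made for illustration rather than the exact sampling procedure used here.

```python
import random
from collections import defaultdict

def sample_triplet_batch(dataset, n_shot: int, classes=(0, 1, 2)):
    """Build one training task: n_shot triplets per class (3 * n_shot in total).
    `dataset` is assumed to be a list of (image, label) pairs."""
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)

    triplets = []
    for c in classes:
        negatives = [img for other, imgs in by_class.items() if other != c for img in imgs]
        for _ in range(n_shot):
            anchor, positive = random.sample(by_class[c], 2)   # same class, distinct images
            negative = random.choice(negatives)                # any other class
            triplets.append((anchor, positive, negative))
    return triplets
```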
Figure 3 shows an example episode training scenario, in which several tasks are created, each defined by a mini-batch of triplets containing sample images from the meta-training dataset. The detailed training strategy is presented in Algorithm 1.
Algorithm 1 Few-shot learning training strategy

Input: dataset D
Output: trained model

Initialize base encoder network f
Define parameters: number of epochs (episodes), number of training tasks per episode, number of anchors, loss function
for each epoch do
    for each training task do
        Create a mini-batch of triplets
        for each triplet (anchor, positive, negative) in the mini-batch do
            Compute the anchor, positive, and negative embeddings with f
            Compute the similarity scores (distances) for the anchor–positive and anchor–negative pairs
            Calculate the triplet's loss according to the selected loss function definition
        end for
        Calculate the total loss (e.g., the average across all triplets)
        Update the model parameters using backpropagation
    end for
    if the early-stopping criterion is met then break
    end if
end for
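For completeness, a compact PyTorch loop corresponding to Algorithm 1 might look as follows; `encoder`, `loss_fn`, and `sample_triplet_batch` refer to the earlier sketches, the images are assumed to be tensors, and the hyperparameter values are placeholders rather than the reported configuration.

```python
import torch

def train(encoder, loss_fn, dataset, n_epochs=100, tasks_per_episode=10, n_shot=4):
    # placeholder optimizer settings, not the reported ones
    optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3, momentum=0.9)
    for epoch in range(n_epochs):
        for _ in range(tasks_per_episode):
            # one training task: a mini-batch of 3 * n_shot triplets
            triplets = sample_triplet_batch(dataset, n_shot)
            # each triplet element is assumed to be an image tensor (C, H, W)
            anchors, positives, negatives = (torch.stack(t) for t in zip(*triplets))
            loss = loss_fn(encoder(anchors), encoder(positives), encoder(negatives))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()      # parameters are updated after each task
        # an early-stopping check on validation tasks would go here
```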
4.3. Implementation Details
The base encoder networks and the proposed Siamese network model were implemented using PyTorch 2.6.0 with CUDA 12.4. All experiments were run in a cloud-based Google Colab notebook environment, which provides free access to computing resources, including GPUs and TPUs. Google Colab currently features an NVIDIA Tesla T4 GPU that contains 40 streaming multiprocessors with a 6 MB L2 cache shared by all, along with 16 GB of high-bandwidth RAM (GDDR6), and comes with pre-installed Python 3.x packages. Using this environment made it possible to complete all experimental runs for a full iteration cycle within an average of 3 min per test.
To compensate for the lack of a large annotated medical dataset, we utilized a transfer learning approach and initialized the trainable parameters of the base encoder networks with weights learned on the ImageNet dataset [
47]. Additionally, the pre-trained base encoders were fine-tuned to adapt them to the specific task. For this purpose, we discarded the last two layers of the pre-trained models and replaced them with two newly initialized layers that we trained from scratch, thus enabling them to capture the unique characteristics of selected datasets, while retaining the powerful feature extraction capabilities of the pre-trained models. Embeddings (feature vectors) were generated by stripping the last fully-connected layer of the base encoder network.
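A possible realization of this fine-tuning step, using ResNet-50 as an example backbone, is sketched below; the two replacement layers, the 128-dimensional embedding size, and the freezing of the pre-trained feature extractor are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision import models

# Illustrative fine-tuning of one backbone (ResNet-50); freezing the earlier
# layers and the chosen layer sizes are assumptions made for this sketch.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False            # keep the ImageNet feature extractor fixed

# replace the classifier head with two freshly initialised, trainable layers
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 512),
    nn.ReLU(),
    nn.Linear(512, 128),               # hypothetical 128-d embedding for the Siamese branches
)
for p in backbone.fc.parameters():
    p.requires_grad = True
```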
To train the proposed Siamese network model, we employed a Stochastic Gradient Descent (SGD) optimizer with chosen values for the initial learning rate, momentum, and weight decay. Furthermore, we used the MultiStepLR scheduler with a decay factor deemed best after hyperparameter tuning. The model was trained for 100 epochs (episodes) per experiment, and an early stopping procedure was implemented to halt training if the monitored performance measure showed no improvement for a “patience” number of epochs.
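The optimizer, scheduler, and early-stopping arrangement could be wired up as follows; the learning rate, momentum, weight decay, milestones, gamma, and patience values are placeholders (the reported values are not reproduced here), and the stand-in model and `run_one_episode` helper are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 3)   # stand-in for the Siamese encoder

# placeholder hyperparameters, not the reported configuration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

best_val, patience, wait = float("inf"), 10, 0
for epoch in range(100):
    val_loss = run_one_episode(model, optimizer)   # assumed training/validation helper
    scheduler.step()
    if val_loss < best_val:        # early stopping on the monitored measure
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break
```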
4.4. Evaluation Metrics
The proposed Siamese network model is rigorously scrutinized using five key performance metrics: accuracy, precision, recall, specificity, and F1-score. Here, accuracy is defined as the proportion of correct predictions relative to the total number of instances assessed. Precision, often referred to as positive predictive value (PPV), represents the ratio of true positive cases over all predictions made within the positive class. Recall or sensitivity—also called true positive rate (TPR)—which is particularly important in medical applications, refers to the percentage of actual positive cases that are accurately identified. On the other hand, specificity, commonly known as the true negative rate (TNR), is the percentage of true negative instances that are correctly classified. Last but not least, F1-score denotes the harmonic mean between precision and recall, calculated by taking their weighted average that captures both metrics’ balance effectively.
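For reference, the five metrics can be computed from a confusion matrix as in the following sketch, shown here for a binary case with placeholder predictions; scikit-learn provides accuracy, precision, recall, and F1 directly, while specificity is derived from the true-negative and false-positive counts.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# placeholder predictions (1 = malignant, 0 = non-malignant)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy":    accuracy_score(y_true, y_pred),
    "precision":   precision_score(y_true, y_pred),   # PPV = TP / (TP + FP)
    "recall":      recall_score(y_true, y_pred),      # sensitivity / TPR
    "specificity": tn / (tn + fp),                    # TNR, no direct sklearn helper
    "f1":          f1_score(y_true, y_pred),          # harmonic mean of P and R
}
```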
Also, the receiver operating characteristic (ROC) curve, which visualizes the trade-off between sensitivity and specificity at various classification thresholds, with its area under the curve (AUC) score, is utilized. The AUC score reflects how good the model is in distinguishing between classes and was proven, both theoretically and empirically, to be more suitable than the accuracy metric [
48] for evaluating classification performance. The AUC score can be computed as follows:
where,
is the sum of all positive instances ranked, while
and
denote the total number of positive and negative instances, respectively.
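A small sanity check of this rank-based formula against scikit-learn's implementation, using made-up scores, might look like this:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def auc_from_ranks(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Rank-based AUC: S_p is the sum of ranks of the positive instances
    among all scores; n_p and n_n count positives and negatives."""
    ranks = rankdata(scores)                 # average ranks handle ties
    s_p = ranks[y_true == 1].sum()
    n_p, n_n = (y_true == 1).sum(), (y_true == 0).sum()
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.7, 0.6, 0.2, 0.8, 0.4])
assert np.isclose(auc_from_ranks(y, s), roc_auc_score(y, s))   # both give 1.0 here
```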
5. Results and Discussion
To assess the efficacy of the proposed framework for diagnosing breast cancer cases, the following approach was adopted. Since our framework is based on a few-shot learning paradigm, first we examine the effect that the number of shots has on the overall performance, and to this end we conduct multiple experiments in various 3-way, n-shot learning settings—where n varies between 3 and 7—with both triplet and circle losses. In order to get even better insight, we subsequently scrutinize the efficiency of the proposed framework in distinguishing between tumorous and normal mammograms (2-class problem variant) with similar n-shot learning settings. Finally, for comparison purposes, we present the results obtained by using the three selected, fine-tuned, pre-trained base encoder networks as standalone classifiers.
During evaluation, the distances between each held-out test instance and all remaining images are compared against one another. Thresholding and the 5-nearest-neighbour algorithm are then applied to assign each instance to a category. In the experiments focusing on multi-class classification, the macro average is employed to derive the final performance metrics. This technique ensures that all categories receive the same weight when calculating the mean, so the final value reflects the arithmetic mean of the individual metrics associated with each category.
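The distance-based assignment step can be sketched as a simple majority vote over the five nearest reference embeddings; the Euclidean distance and the omission of the thresholding step are simplifications for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(test_emb: np.ndarray, ref_embs: np.ndarray,
                ref_labels: np.ndarray, k: int = 5):
    """Classify one test embedding by majority vote among its k nearest
    reference embeddings (Euclidean distance)."""
    dists = np.linalg.norm(ref_embs - test_emb, axis=1)   # distance to every reference
    nearest = np.argsort(dists)[:k]                       # indices of the k closest
    return Counter(ref_labels[nearest]).most_common(1)[0][0]
```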
5.1. Classification Results for 3-Way Detection Problem
The evaluation results of the proposed Siamese network model are presented independently for each dataset and each loss function. In all experiments, we employed a consistent architecture of base encoder networks, alongside uniform training and test configurations, to objectively validate the model’s performance on both the CBIS-DDSM and INbreast datasets.
Table 2 and
Table 3 summarize the classification results for all three base encoder networks over various
n-shot settings on the INbreast dataset using triplet and circle loss, respectively. It is evident from the results that better AUC scores and other performance metric values are generally achieved with a lower number of shots (3–4) and appear to decline as the number of shots increases. Only for the MobileNetV3 base encoder with triplet loss is the best performance recorded when training with mini-batches containing 6 triplets. Overall, the results obtained with the circle loss function are slightly better than those yielded with the standard triplet loss function, which is in agreement with the findings reported in the literature. The highest sensitivity and specificity on the INbreast dataset are obtained with the ResNet50 base encoder in a 4-shot setting.
The average performance evaluation results for different loss functions and
n-shot settings on the CBIS-DDSM dataset are presented in
Table 4 and
Table 5. Here, the performance results across all experiments are far more homogeneous—showing small variations among themselves based on loss function and number of shots used for training—and are, for the most part, higher than those on the INbreast dataset. On the CBIS-DDSM dataset, GoogLeNet, ResNet50, and MobileNetV3 yield an average AUC score of
,
, and
, respectively, using triplet loss, and an average AUC score of
,
, and
, respectively, for the circle loss function. In addition, the average sensitivities of
,
, and
, respectively, are achieved for GoogLeNet, ResNet50, and MobileNetV3 base encoders with triplet loss. Using circle loss, the average recorded sensitivities for the GoogLeNet, ResNet50, and MobileNetV3 base encoders are
,
, and
, respectively. Again, the averaged performance results over all experiments for different base encoder networks in combination with circle loss consistently narrowly surpass those obtained through the use of the standard triplet loss function, in terms of all evaluation metrics.
Figure 4 and
Figure 5 show corresponding ROC curves for all the 3-way experiments carried out on both INBreast and CBIS-DDSM datasets.
5.2. Classification Results for 2-Way Detection Problem
Table 6,
Table 7,
Table 8 and
Table 9 show the performance results with respective loss functions and various
n-shot settings in classifying healthy and breast cancer patients on INbreast and CBIS-DDSM datasets. As expected, the model generally shows improved performance for diagnosing breast cancer patients on both datasets in this simplified 2-class scenario.
In
Table 6 and
Table 7, we observe the results from the INbreast dataset, where normal mammograms are accurately classified with average specificity of
,
, and
for the GoogLeNet, ResNet50, and MobileNetV3, with triplet loss, and
,
, and
, with circle loss, respectively. For tumorous cases, the average sensitivities are commendably high, standing at
and
for GoogLeNet and
flat for both ResNet50 and MobileNetV3, with triplet and circle loss, respectively. Notably, GoogLeNet combined with circle loss in a 4-shot setting demonstrates near-perfect performance with the highest AUC score. These results clearly illustrate that the proposed Siamese network model, particularly when using GoogLeNet as a base encoder, is able to effectively rule out the presence of disease in a patient.
When evaluating the CBIS-DDSM dataset, the classification of normal cases yielded average specificities exceeding those in 3-way experiments, namely and for GoogLeNet, and for ResNet50, and and for MobileNetV3, with triplet and circle loss, respectively. The average sensitivities for tumorous cases also reflect strong outcomes, with , , and for GoogLeNet, ResNet50, and MobileNetV3 with triplet loss, and , , and with circle loss, respectively. Furthermore, GoogLeNet paired with circle loss achieved two top AUC scores of and . These findings underscore the effectiveness of the proposed Siamese network model in distinguishing between healthy and cancer patients across the datasets.
Corresponding ROC curves for all 2-way experiments conducted on the INBreast and CBIS-DDSM datasets are illustrated in
Figure 6 and
Figure 7.
5.3. Siamese Network Model vs. Standard Pre-Trained CNN Classifiers
The performance results attained by utilizing the three selected, fine-tuned, pre-trained base encoder networks as standalone classifiers in 2-way and 3-way scenarios are presented in
Table 10 and
Table 11. By comparing these results to those achieved by the proposed Siamese network model, it can be concluded that the latter consistently outperforms the pre-trained CNN models in every scenario and across all evaluation metrics for both the INbreast and CBIS-DDSM datasets. There could be many reasons for this, but the main point is that there is not nearly enough mammogram data available to efficiently train deep neural networks. The somewhat satisfactory performance results the three networks achieved can mostly be attributed to the extensive pre-training they underwent using the ImageNet dataset. On the other hand, the Siamese-based model exploits the benefits of being given triplets of images, where it learns to distinguish a similar image from different ones, and exhibits stronger generalization capabilities. Specifically, on the INbreast dataset for a binary classification task, GoogLeNet, ResNet-50, and MobileNetV3, paired with triplet loss, have improved their performance in terms of AUC scores by an average
,
, and
, respectively. For the circle loss, the corresponding average AUC score improvements stand at
,
, and
, in a 2-way scenario. Similarly, for a 3-class problem, the average recorded boosts in AUC score amount to
and
for GoogLeNet,
and
for ResNet50, and
and
for MobileNetV3, with triplet and circle loss, respectively. In the context of the CBIS-DDSM dataset, considering a 2-way scenario, the three selected pre-trained base encoders have enhanced their AUC scores by an average of
,
, and
, in case of triplet loss, and an average of
,
, and
, in case of circle loss. In a 3-class setting, the average AUC score enhancements for GoogLeNet are approximately
and
, while ResNet50 shows improvements of around
and
. Meanwhile, MobileNetV3 demonstrates increases of
and
when utilizing triplet and circle loss, respectively.
These achievements are particularly significant, knowing that the Siamese model is trained with only a (very) limited number of training examples from each category of images. Furthermore, the proposed model is relatively compact, featuring fewer trainable parameters, which highlights the advantages of employing a few-shot learning approach for diagnosing breast cancer patients compared to traditional, supervised learning techniques.
5.4. Statistical Inference Study
To assess the means and standard deviations of TPR values and AUC scores, we conducted a statistical analysis using the Kruskal–Wallis test. This non-parametric method is particularly effective for comparing two or more independent samples, regardless of whether they are of equal or differing sizes. A significant result from the Kruskal–Wallis test implies that at least one sample exhibits a stochastic dominance over the others. We utilized this approach to investigate whether there are significant statistical differences—at a confidence level—in the outcomes yielded by the proposed Siamese network model, depending on the number of shots or base encoders used, specifically for the INbreast and CBIS-DDSM datasets in 2-way and 3-way scenarios.
The computed
p values, presented in
Table 12 and
Table 13, indicate that even though the results differ insignificantly
based on the number of training triplets (
n-shot), significant differences are in some cases recorded, in both TPR values and AUC scores, between the three selected base encoder networks for the INbreast and CBIS-DDSM datasets. While the
p value from the Kruskal–Wallis test confirms the existence of a statistically significant difference, the Mann–Whitney U tests are further employed to identify which specific CNN model(s) produce significant results. This analysis is also conducted at a
confidence level, with the findings detailed in
Table 14 and
Table 15. For a binary classification task with triplet loss, MobileNetV3 exhibits a significantly inferior average AUC score compared to the best-performing base encoder, GoogLeNet, on the INbreast dataset. When analyzing the same scenario on the CBIS-DDSM dataset, we find that all three base encoders differ significantly in their TPR values, whereas when circle loss is utilized, GoogLeNet appears to have an edge over MobileNetV3 and ResNet50 in terms of TPR values and AUC scores, respectively. Similar conclusions can be drawn for the 3-class problem using triplet loss.
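Both tests are available in SciPy; the sketch below, with made-up AUC scores standing in for the per-run results, shows how the Kruskal–Wallis test and the pairwise Mann–Whitney U follow-ups described above would be applied.

```python
from scipy.stats import kruskal, mannwhitneyu

# placeholder AUC scores per base encoder across n-shot runs (not real results)
googlenet = [0.91, 0.93, 0.90, 0.92, 0.94]
resnet50  = [0.90, 0.91, 0.89, 0.92, 0.90]
mobilenet = [0.86, 0.88, 0.85, 0.87, 0.86]

h_stat, p_val = kruskal(googlenet, resnet50, mobilenet)   # omnibus test
if p_val < 0.05:                                          # assumed 95% confidence level
    # pairwise follow-up tests to locate the differing encoder(s)
    _, p_gm = mannwhitneyu(googlenet, mobilenet, alternative="two-sided")
    _, p_gr = mannwhitneyu(googlenet, resnet50, alternative="two-sided")
```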
6. Conclusions and Future Research
Breast cancer classification falls within a data-scarce domain, where acquiring an adequate number of training samples to effectively train a conventional neural network is often impractical. To circumvent the problem of insufficient data, this paper presents a few-shot learning approach to breast cancer detection, evaluated on two digital X-ray mammogram datasets. In particular, we test three different pre-trained, fine-tuned CNN encoders for their ability to capture unbiased feature representations resilient against overfitting, and we adapt a Siamese network architecture to perform the final classification of normal and tumorous cases. Our findings indicate that the proposed model, utilizing a triplet-based loss, delivers a highly accurate and practical mechanism for the automated diagnosis of breast cancer in various n-shot learning settings. Moreover, our model not only matches but exceeds the performance of the fine-tuned pre-trained CNN models employed as standalone classifiers, by a considerable margin. These results may encourage health professionals to leverage the model in the early detection of breast cancer, thereby easing their workload and enabling them to allocate more time to crafting personalized treatment plans for their patients.
Moving forward, we intend to extend this work by performing data augmentation to artificially increase the size and diversity of the training data, in an attempt to further improve the performance and generalization of the proposed model. In addition, more specific and meaningful representations could potentially be derived through the use of multi-scale feature maps, generated by superimposing features produced from annotated ROI patches containing different types of breast lesions on top of features obtained from the entire breast images. Finally, the presented model can be applied to the detection of other common cancer types, such as brain, lung, or prostate cancer.