1. Introduction
Breast ultrasound (BUS) has been widely used in breast cancer diagnosis [1,2]. In recent years, auxiliary diagnosis research based on Deep Learning (DL) has developed rapidly and plays an essential role in breast cancer diagnosis [3,4]. Generally, DL requires large-scale annotated datasets to produce an effective model. However, such datasets are difficult to obtain for medical data, especially BUS data. On the one hand, although a large amount of BUS data exists, its collection is complicated by patient privacy concerns, and labeling is very time-consuming and requires professional physicians [5,6]. On the other hand, BUS data follow a long-tailed distribution: images of benign masses are far more prevalent than those of malignant masses or normal tissue.
Extensive research shows that data augmentation overcomes challenges posed by small and imbalanced datasets [7]. Two main approaches are used: Traditional Augmentation (TA) and Synthetic Augmentation (SA). TA applies common operations such as translation, rotation, brightness adjustment, scaling, and cropping to expand the dataset [8,9,10]. However, TA cannot generate entirely new patterns. SA instead uses existing data to synthesize new samples, employing techniques such as interpolation-based methods (e.g., SMOTE [11] and SamplePairing [12]) and generation-based augmentation (e.g., Generative Adversarial Networks, GANs [13]). SA thus incorporates information that is not present in the original data, overcoming TA's limitations.
Figure 1 depicts results from a liver tumor classification study by Frid-Adar et al. [10] on a CT dataset. While TA and additional real data yield modest accuracy improvements (red and green curves), SA (blue curve) improves accuracy significantly.
As a powerful generation-based augmentation method, the GAN has played a remarkable role in modeling medical data distributions. In recent medical research, several GAN variants, including DCGAN [14], LSGAN [15], WGAN [16], and others, have been developed and applied to diverse tasks such as classification [17], segmentation [3], reconstruction [18], and synthesis [19]. These methods have demonstrated their effectiveness in augmenting BUS datasets and improving the performance of deep learning models, particularly in classification and segmentation tasks.
For instance, Al-Dhabyani et al. [19] employed a GAN for BUS dataset augmentation and showed superior performance compared with traditional augmentation methods when used with CNNs and transfer learning. However, a limited amount of training data can lead the GAN to generate similar results, limiting the effectiveness of dataset augmentation. One practical solution to the small-dataset problem is to augment the dataset with unlabeled data. Pang et al. [9] proposed a semi-supervised deep learning model based on GANs that utilizes unlabeled BUS images for classification. They demonstrated that this approach can generate high-quality BUS images, significantly enrich the dataset, and achieve more accurate breast cancer classification results. Saha and Sheikh [20] tackled the small-dataset challenge by proposing a BUS classification model using the Auxiliary Classifier Generative Adversarial Network (ACGAN) [21]. They trained the GAN after augmenting the BUS dataset with traditional augmentation methods, improving classification performance.
In a different approach, Zhang et al. [18] utilized partially masked BUS images to generate new images, employing an image data augmentation strategy based on mask reconstruction. This method effectively increased the interpretability and diversity of the disease data while maintaining structural integrity. However, it is important to note that datasets augmented with these methods may deviate significantly from the original data distribution, which can impact downstream tasks.
The performance of most BUS segmentation methods relies heavily on accurately labeled data, which can be challenging to obtain. As a solution, many segmentation methods leverage unlabeled data to expand the dataset and use GAN-based approaches to enhance model performance.
For instance, Han et al. [17] proposed a semi-supervised deep learning segmentation model for BUS images, feeding unlabeled BUS images into a GAN to generate corresponding labeled data. Similarly, Zhai et al. [3] employed an asymmetric generative adversarial network (ASSGAN) comprising two generators and a discriminator. This architecture enables mutual supervision between the generators and produces reliable predicted segmentation masks as guidance for unlabeled images. Comparative studies demonstrated that these methods effectively improve segmentation performance even with a limited number of labeled images.
However, these semi-supervised methods require a substantial amount of unlabeled data. Moreover, if a classification task is also involved, the augmented data cannot be reused and must be relabeled.
In summary, several challenges remain when applying generation-based augmentation to BUS image data:
Image quality: BUS images exhibit diverse and random tumor morphologies and surrounding-tissue textures, which leads to unrealistic details and a lack of structural legitimacy in generated images.
Data constraints: The limited size of BUS datasets affects the performance of both deep learning models and GANs. Since GANs learn the underlying data distribution, a small dataset may fail to represent the true distribution accurately, resulting in insufficient diversity and limited information in the generated results.
Application limitations: Current GAN augmentation methods are typically tailored to a specific task, such as classification or segmentation. If the augmented data are needed for other tasks, they must be relabeled or a new GAN must be trained, increasing costs. Consequently, a universal generation-based augmentation method is meaningful.
In response to these challenges, our research objectives are as follows: we aim to address the small size and class imbalance of ultrasound breast tumor datasets by expanding them through data generation, and to enhance the robustness and generalization of ultrasound breast tumor image recognition and segmentation.
Therefore, we propose a two-stage GAN framework, called 2s-BUSGAN, to address the above challenges. The framework consists of a Mask Generation Stage (MGS) and an Image Generation Stage (IGS), which together enable the generation of realistic BUS images and their corresponding tumor contour images, eliminating the need for relabeling. To overcome the data constraints, we employ a Differentiable Augmentation strategy (DiffAug) [22], which enhances the generalization of GANs on small datasets. Additionally, we incorporate a feature-matching loss that minimizes the discrepancy between real and generated images at a high-level feature representation, thereby improving the quality of synthesized images.
Our 2s-BUSGAN departs from common GAN models by adopting a two-stage framework consisting of the MGS and IGS. This design provides a straightforward means of distinguishing between surrounding tissues and tumor regions, enhancing the interpretability of the generated images. Moreover, our generated results include both breast tumor images and their corresponding tumor contour images. This avoids re-annotating tumor regions in generated images and can effectively enhance the performance of models that require tumor region annotations for training, such as segmentation models.
2. Materials and Methods
In this section, we present the 2s-BUSGAN model. The goal is to generate realistic images with tumor contour annotations and structural legitimacy when training on small and imbalanced BUS datasets. Using a single GAN to achieve this goal would make training unstable. Therefore, we divide the model into two stages, the Mask Generation Stage (MGS) and the Image Generation Stage (IGS), implementing the mapping from random noise to tumor contours and then to the corresponding BUS images.
2.1. Overview of 2s-BUSGAN
The flowchart of our method is shown in Figure 2. Our proposed 2s-BUSGAN consists of four components: the Mask Generation Stage (MGS), the Image Generation Stage (IGS), the Differentiable Augmentation Module (DAM), and the Feature Matching Loss (FML). First, the MGS maps random noise to realistic pseudo-tumor contour images. Second, the IGS generates BUS images conditioned on real and generated tumor contour images. Then, the FML enriches the detail information of the synthesized images, and the DAM enhances the generalization of 2s-BUSGAN. The following sections describe these components.
2.2. Mask Generation Stage (MGS)
The MGS is the initial phase of 2s-BUSGAN and generates synthetic tumor contour images from random noise. Achieving stable GAN training is an ongoing challenge, and even the simplest models can be difficult to train stably. Therefore, in designing the MGS, we conducted extensive experiments covering the choice of model, parameter adjustments, GAN loss functions, normalization layers, and activation functions. The MGS is a simplified noise-to-image synthesis GAN comprising a generator G1 and a discriminator D1; the architectural configuration is depicted in Figure 3a. The generator takes random noise drawn from a normal distribution as input and outputs a pseudo tumor contour image. As in other general generative models, the G1 architecture comprises five up-sampling blocks, each consisting of a deconvolution layer with a stride of 2, a batch normalization (BN) layer [23], and a ReLU [24] activation layer, as illustrated in the left section of Figure 3a. All deconvolution layers employ a kernel size of 4 × 4, and the number of kernels progressively decreases from 512 to 32. Additionally, the output layer employs the hyperbolic tangent (tanh) activation function, bounding values between −1 and 1.
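For concreteness, the following is a minimal PyTorch sketch of such an up-sampling block and of a generator assembled from it. The block structure (stride-2 deconvolution, BN, ReLU, 4 × 4 kernels, tanh output) follows the description above; the initial projection of the noise vector and the exact channel schedule are our assumptions, and names such as `UpBlock` are illustrative.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One G1 up-sampling block: stride-2 deconvolution -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class G1(nn.Module):
    """Noise-to-contour generator (a sketch; channel widths are assumptions)."""
    def __init__(self, z_dim=100):
        super().__init__()
        # Project the noise vector to an 8x8 feature map before up-sampling.
        self.project = nn.Linear(z_dim, 512 * 8 * 8)
        # Five up-sampling steps: 8 -> 16 -> 32 -> 64 -> 128 -> 256.
        self.ups = nn.Sequential(
            UpBlock(512, 256), UpBlock(256, 128), UpBlock(128, 64),
            UpBlock(64, 32), UpBlock(32, 32),
        )
        self.out = nn.Sequential(nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.Tanh())

    def forward(self, z):
        x = self.project(z).view(z.size(0), 512, 8, 8)
        return self.out(self.ups(x))  # values bounded in [-1, 1] by tanh

# Example: a batch of 4 noise vectors -> 4 pseudo contour images of size 256x256.
# g1 = G1(); masks = g1(torch.randn(4, 100))  # shape (4, 1, 256, 256)
```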
In discriminator D1, we incorporate spectral normalization (SN) into the architecture. Spectral normalization is an effective technique for stabilizing GAN training: it normalizes the parameters of each convolution layer so that the discriminator satisfies the 1-Lipschitz condition [25]. It is a principled, easy-to-implement approach that has been used successfully in many strong models. The structure of D1 consists of six down-sampling blocks, each composed of a convolution layer, a spectral normalization layer, a LeakyReLU [26] activation function, and a dropout layer, as depicted in the right section of Figure 3a. The convolution kernel size is set to 4 × 4, and the number of filters progressively decreases from 512 to 1. As the first stage of 2s-BUSGAN, the MGS is optimized with the following losses:
$$L_{D_1} = -\,\mathbb{E}_{y}\left[\log D_1(y)\right] - \mathbb{E}_{z}\left[\log\left(1 - D_1(G_1(z))\right)\right],$$
$$L_{G_1} = -\,\mathbb{E}_{z}\left[\log D_1(G_1(z))\right],$$
where $L_{G_1}$ and $L_{D_1}$ represent the losses of generator G1 and discriminator D1, respectively, z denotes the random noise vectors, y denotes the tumor contour images of ground-truth BUS images, and G1(z) denotes the tumor contour images synthesized from input z. This objective drives the synthesized images toward the quality of the real images.
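A minimal sketch of one such down-sampling block and of the corresponding loss computation is shown below, assuming the standard non-saturating (binary cross-entropy) formulation written above; the dropout rate is our assumption. Note that in PyTorch, spectral normalization is applied as a reparameterization of the convolution weights rather than as a separate layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

def down_block(in_ch, out_ch, p_drop=0.25):
    """One D1 down-sampling block: SN convolution -> LeakyReLU -> Dropout."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Dropout2d(p_drop),
    )

def mgs_losses(d1, g1, y_real, z):
    """One evaluation of L_D1 and L_G1 on raw discriminator logits."""
    y_fake = g1(z)
    real_logits = d1(y_real)
    fake_logits = d1(y_fake.detach())        # detach: the D step does not update G
    loss_d = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    gen_logits = d1(y_fake)                  # G step: try to fool the discriminator
    loss_g = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return loss_d, loss_g
```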
2.3. Image Generation Stage (IGS)
After the MGS, the generated tumor contour image is a binary matrix, where 0 represents the missing region and 1 represents the background. It captures only the morphological and structural information of the tumor and lacks texture details. Hence, the Image Generation Stage (IGS) is introduced to generate realistic BUS images by incorporating texture information. This stage addresses an image reconstruction problem: producing complete BUS images from the provided structural information.
Similar to the MGS, the IGS comprises a generator G2 and a discriminator D2. In contrast to generative models that take noise as input, this stage performs image-to-image generation with tumor contour images as input. Generator G2 employs an encoder-decoder architecture with residual blocks [27]. It consists of four down-sampling blocks and four up-sampling blocks, as shown in the left part of Figure 3b; each block incorporates a convolution or deconvolution layer with a stride of 2, an instance normalization (IN) layer [28], and a ReLU activation function. The architecture includes n residual blocks (n = 5 in our experiments; see Section 3.2), each consisting of two convolution layers with a stride of 1. The first and last convolution layers adapt the channel size using a 1 × 1 kernel, while all other convolution and deconvolution layers use a 3 × 3 kernel.
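The residual blocks can be sketched in PyTorch as follows; the two stride-1 convolutions follow the description above, while the exact convolution/IN/ReLU ordering inside the block is our assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A G2 residual block: two stride-1 convolutions with instance
    normalization and an identity skip connection [27]."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip connection
```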
For discriminator D2, each input is a concatenation of the tumor contour image and its corresponding BUS image. To assess image quality at a local scale, we utilize a discriminator inspired by PatchGAN [29]. Unlike general discriminators that output a single scalar as a global judgment, this discriminator produces a matrix that evaluates image quality at the patch level: each unit in the output matrix represents the likelihood that the corresponding image patch is authentic, capturing local information. This design encourages the generator to reproduce high-frequency details. The discriminator comprises four down-sampling blocks, as depicted in the right part of Figure 3b. Each block includes a convolution layer, a spectral normalization (SN) layer, a LeakyReLU activation function, and a dropout layer; as in discriminator D1 of the MGS, SN is applied to every convolution layer to improve stability. The convolution kernel size is set to 4 × 4, and the number of filters decreases from 512 to 1 sequentially.
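A PatchGAN-style discriminator matching this description can be sketched as follows; the base channel width and dropout rate are our assumptions, and the final 3 × 3 convolution is chosen so that a 256 × 256 input yields the 16 × 16 patch-logit map used in our experiments (Section 3.2).

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """D2: maps a (contour, BUS image) pair to a 16x16 map of per-patch logits."""
    def __init__(self, in_ch=2, base=64, p_drop=0.25):
        super().__init__()
        chs = [in_ch, base, base * 2, base * 4, base * 8]
        layers = []
        for i in range(4):  # four SN down-sampling blocks: 256 -> 16 spatially
            layers += [
                spectral_norm(nn.Conv2d(chs[i], chs[i + 1], kernel_size=4,
                                        stride=2, padding=1)),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Dropout2d(p_drop),
            ]
        layers += [nn.Conv2d(chs[-1], 1, kernel_size=3, padding=1)]  # patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, contour, image):
        # The input is the channel-wise concatenation of contour mask and BUS image.
        return self.net(torch.cat([contour, image], dim=1))

# d2 = PatchDiscriminator()
# logits = d2(torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256))  # (1, 1, 16, 16)
```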
The loss functions can be expressed as:
$$L_{D_2} = -\,\mathbb{E}_{(y,x)}\left[\log D_2(y, x)\right] - \mathbb{E}_{y}\left[\log\left(1 - D_2(y, G_2(y))\right)\right],$$
$$L_{G_2} = -\,\mathbb{E}_{y}\left[\log D_2(y, G_2(y))\right],$$
where $L_{G_2}$ and $L_{D_2}$ represent the losses of generator G2 and discriminator D2, respectively, y denotes the tumor contour images of ground-truth BUS images, x denotes the corresponding real BUS images, and G2(y) denotes the BUS images synthesized from the tumor contour images y.
2.4. Feature Matching Loss (FML)
The IGS faces challenges in learning an accurate mapping for tumor regions in BUS images due to the inherent randomness in tumor location and size. Moreover, the generated images exhibited blurriness and noise, particularly in the regions surrounding the tumors, compared with the ground-truth images. To address these issues and enhance synthesis quality, we introduce a feature-matching loss that imposes additional constraints through the discriminator of the IGS.
The feature-matching loss plays a crucial role in stabilizing training and significantly improves GAN performance. It is closely related to the perceptual loss [30,31], which has found widespread application in style transfer and image super-resolution [32,33,34]. This loss enforces similarity between generated and real BUS images across multiple scales, ensuring that the generator captures the key characteristics of the dataset. To extract features from BUS images at different scales, we use the discriminator of the IGS as a feature extractor, exploiting its multi-layer architecture.
Thus, the feature-matching loss can be described as follows:
$$L_{FM} = \mathbb{E}_{(y,x)} \sum_{i=1}^{M} \frac{1}{N_i} \left\| D_2^{(i)}(y, x) - D_2^{(i)}(y, G_2(y)) \right\|_1,$$
where M is the total number of layers, $N_i$ is the number of elements in the i-th layer, and $D_2^{(i)}$ is discriminator D2's i-th-layer feature extractor.
The full objective combines the adversarial loss of generator G2 with the feature-matching loss:
$$L_{G_2}^{full} = L_{G_2} + \lambda L_{FM},$$
where λ is the weight of the feature-matching loss and is set to 10.0 in our study.
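As a sketch, the feature-matching term can be computed from the discriminator's intermediate activations as follows; we assume here that D2 exposes its per-block feature maps (e.g., collected with forward hooks), which is an implementation choice rather than something dictated by the architecture.

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    """L_FM: mean L1 distance between D2's intermediate features for the real
    and the generated (contour, image) pairs. Each argument is a list of
    per-layer feature maps; torch.mean realizes the 1/N_i normalization."""
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss

# Full generator objective with lambda = 10.0, matching the equation above:
# loss_g2_full = loss_g2 + 10.0 * feature_matching_loss(feats_real, feats_fake)
```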
2.5. Differentiable Augmentation Module (DAM)
The performance of GANs depends on sufficient and well-distributed training data. A lack of training data can cause the GAN to overfit and limits the diversity of the generated samples. Traditional augmentation (TA) methods modify the dataset with non-differentiable operations; this alters the data distribution and can degrade the performance of GANs trained on such augmented data.
To overcome this limitation, we employ Differentiable Augmentation (DiffAug) [22], which augments the real and generated images simultaneously using invertible and differentiable operations. Because the operations are differentiable, DiffAug allows the networks to be fully trained without altering the distribution of the original data. The DAM therefore incorporates DiffAug and updates the loss functions of both stages as follows:
$$L_{D} = -\,\mathbb{E}_{x}\left[\log D(T(x))\right] - \mathbb{E}_{z}\left[\log\left(1 - D(T(G(z)))\right)\right],$$
$$L_{G} = -\,\mathbb{E}_{z}\left[\log D(T(G(z)))\right],$$
where T denotes the differentiable augmentation operations, such as brightness adjustment and translation, and x and z denote real samples and input noise, respectively. Note that the generated and real samples are subjected to the same differentiable augmentation operations.
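A minimal sketch of two differentiable policies (brightness adjustment and translation) is given below; the jitter magnitude and shift fraction are our assumptions, not values taken from DiffAug [22]. The key point is that every operation is differentiable in the image, so generator gradients flow through T.

```python
import torch
import torch.nn.functional as F

def diff_augment(x, brightness=0.2, shift_frac=0.125):
    """Differentiable brightness jitter + random translation for a batch x
    (applied identically to the real batch and the generated batch)."""
    # Brightness: add a random per-sample offset (differentiable in x).
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * 2 * brightness
    # Translation: zero-pad, then crop at a random offset (differentiable in x).
    b, c, h, w = x.shape
    dh, dw = int(h * shift_frac), int(w * shift_frac)
    x = F.pad(x, (dw, dw, dh, dh))
    ty = torch.randint(0, 2 * dh + 1, (1,)).item()
    tx = torch.randint(0, 2 * dw + 1, (1,)).item()
    return x[:, :, ty:ty + h, tx:tx + w]

# Both discriminator terms see augmented inputs, e.g.:
# the losses use d(diff_augment(x_real)) and d(diff_augment(g(z)))
```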
3. Experiments
In this section, we conducted qualitative and quantitative evaluations. First, we conducted experiments on the image generation performance of the MGS and IGS within the 2s-BUSGAN framework and evaluated 2s-BUSGAN's overall generative capacity. Second, we submitted the generated images for expert medical evaluation, focusing on image quality and the discernment of benign and malignant features. Lastly, we employed the images produced by 2s-BUSGAN as augmented data to gauge their influence on model performance in breast malignancy classification and breast segmentation.
3.1. Datasets
In our study, we use two different and representative datasets of breast ultrasound (BUS) images: the BUSI dataset [19,35] and the Collected dataset (obtained from a de-identified source at a hospital). Both datasets contain BUS images and corresponding segmentation masks annotated by experienced medical professionals, and both have undergone data cleaning to ensure the removal of any patient information. Figure 4 provides examples of the images and their corresponding segmentation maps; in each category, the real US image is on the left and the annotated tumor contour image is on the right. The details of each dataset are described as follows:
BUSI [19,35]. This dataset was curated by the National Cancer Center of Cairo University in Egypt in 2018. It consists of 780 PNG images obtained using the LOGIQ E9 and LOGIQ E9 Agile ultrasound systems, acquired from 600 female patients aged between 25 and 75. Among the images, 437 represent benign tumors, 210 represent malignant tumors, and 133 contain normal breast tissue. The benign and malignant labels were determined from pathological examination of biopsy samples. The average image size is 500 × 500 pixels, and each case includes the original ultrasound image and the corresponding breast tumor boundary image annotated by an expert imaging diagnostic physician. The annotated tumor contour images serve as the gold standard for segmentation, providing a reference for training and evaluation.
Collected. The China-Japan Friendship Hospital in Beijing collected and organized this dataset in 2018. Before data collection, informed written consent was obtained from each patient after a comprehensive explanation of the study's procedures, purpose, and nature. Multiple acquisition systems were used to capture the images. The dataset comprises 920 images in total, including 470 benign and 450 malignant cases. The benign/malignant labels are derived from pathological examination of puncture biopsy samples, and a professional imaging diagnostic physician annotated each case. Furthermore, the dataset includes notable misdiagnosis and missed-diagnosis cases, providing valuable examples for analysis and evaluation.
3.2. Implementation Details
Our experimental setup utilized the PyTorch framework on a single NVIDIA GeForce RTX 2080 Ti GPU with 10 GB of GPU memory. We employed a batch size of 16 and the Adam optimizer [36]. Generator G2 and discriminator D2 had learning rates of 0.0001 and 0.0004, respectively. For both the BUSI and Collected datasets, the weight λ of the feature-matching loss term in generator G2 was set to 10, the number of residual blocks in G2 to 5, and the output size of D2 to 16 × 16.
We set the weight λ to 10, a value chosen with reference to related literature on the feature-matching loss. In practice, we compared four values (0.1, 1, 10, and 100) and found that λ = 10 produced the best results.
Due to GPU memory limitations, we were unable to train models with more than 5 residual blocks. We compared models with 1 to 5 residual blocks and found that 5 gave the best results, so we set this parameter to 5.
All images in both datasets were resized to 256 × 256 pixels. Larger images generally contain more information, so resizing can significantly affect a model's learning and performance; however, our experimental setup was constrained by GPU memory capacity. Given these constraints, resizing to 256 × 256 pixels was the most suitable compromise: it optimized GPU memory utilization and streamlined downstream processing for the various tasks.
The models were trained for 6000 epochs across all stages. To ensure distinctive class features in the generated images, we trained separately on the benign and malignant BUS images for all datasets.
Furthermore, our experiments involved the use of classification and segmentation models, as well as comparative generative models. For these models, we adhered to the hyperparameter settings found in their respective reference literature to ensure experimental correctness. Therefore, no further hyperparameter tuning was conducted for these models.
We comprehensively evaluated our proposed method on the two BUS datasets, with both quantitative and qualitative analyses. For the qualitative assessment, we visually compared real BUS images with the synthesized images. For the quantitative assessment, we used three numerical metrics: Fréchet Inception Distance (FID) [37], Kernel Inception Distance (KID) [38], and multi-scale structural similarity (MS-SSIM) [39]. FID and KID measure the similarity between image sets in a learned feature space; both assess feature-level distance, but the KID estimator is unbiased. Lower FID and KID values indicate better image similarity. MS-SSIM quantifies the similarity between paired images on a scale from 0 to 1, with higher values indicating greater similarity. In our experiments, we calculated the internal MS-SSIM of the generated data to gauge the diversity of the generated samples [40]; lower internal MS-SSIM values indicate richer diversity. The evaluation metrics are described as follows:
$$\mathrm{FID} = \left\| \mu_x - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_x + \Sigma_g - 2\left( \Sigma_x \Sigma_g \right)^{1/2} \right),$$
$$\mathrm{KID} = \frac{1}{n(n-1)} \sum_{i \neq j} k(v_{x_i}, v_{x_j}) + \frac{1}{m(m-1)} \sum_{i \neq j} k(v_{g_i}, v_{g_j}) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(v_{x_i}, v_{g_j}),$$
$$\mathrm{MS\text{-}SSIM}(x, g) = \left[ l_M(x, g) \right]^{\alpha_M} \prod_{j=1}^{M} \left[ c_j(x, g) \right]^{\beta_j} \left[ s_j(x, g) \right]^{\gamma_j},$$
where x denotes real images, g denotes generated images, μ and Σ denote the means and covariances of intermediate-layer features of the real and generated images, n is the sample size of the real images, m is the sample size of the generated images, k denotes the kernel function, v denotes the feature vectors of the real and generated images, M is the number of scales used by the feature extractor, l, c, and s represent the luminance, contrast, and structure comparisons between images, respectively, and α, β, and γ represent their weights.
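For reference, the FID and unbiased KID estimators can be computed from Inception features as sketched below; the polynomial kernel is the one commonly used for KID [38], and extracting the features with an Inception network is assumed to happen beforehand.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID from the means/covariances of real and generated Inception features."""
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

def kid(f_x, f_g):
    """Unbiased KID with the polynomial kernel k(a, b) = (a.b / d + 1)^3.
    f_x: (n, d) real features; f_g: (m, d) generated features."""
    n, d = f_x.shape
    m = f_g.shape[0]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    kxx = (k(f_x, f_x).sum() - np.trace(k(f_x, f_x))) / (n * (n - 1))
    kgg = (k(f_g, f_g).sum() - np.trace(k(f_g, f_g))) / (m * (m - 1))
    return float(kxx + kgg - 2.0 * k(f_x, f_g).mean())
```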
3.3. Generation Experiments
3.3.1. MGS Mask Synthesis
The MGS effectively maps random noise samples from a normal distribution to the distribution of tumor contour images. In Figure 5, we compare real tumor contour images with the images generated by the MGS, for both benign and malignant data. The benign and malignant models were each trained for the same number of epochs: 3000.
In addition, we utilized the t-SNE (t-distributed stochastic neighbor embedding) algorithm [41] to visualize the disparities between synthetic and real data, as illustrated in Figure 6. t-SNE is a nonlinear dimensionality reduction technique that visualizes high-dimensional data in two or three dimensions.
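Such a visualization can be reproduced with scikit-learn along the following lines; the feature representation (here, placeholder arrays standing in for image features) and the t-SNE hyperparameters are our assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for real and generated image features
# (e.g., flattened images or CNN embeddings).
feats_real = np.random.rand(200, 512)
feats_gen = np.random.rand(200, 512)

# Embed both sets jointly so the two point clouds share one 2-D space.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(
    np.vstack([feats_real, feats_gen]))
plt.scatter(emb[:200, 0], emb[:200, 1], s=8, label="real")
plt.scatter(emb[200:, 0], emb[200:, 1], s=8, label="generated")
plt.legend()
plt.show()
```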
3.3.2. IGS Image Synthesis
The generator of the IGS can capture the texture feature distribution of BUS images by inpainting both the internal and external regions of the tumor contour. By reconstructing real BUS images from real masks, the IGS demonstrates its effectiveness. Accordingly, we compared real BUS images with the results reconstructed from real masks by the IGS and by the IGS without the Differentiable Augmentation Module (IGS w/o DAM).
The main task of the IGS is to convert the breast tumor contour images generated by the MGS into realistic BUS images. Therefore, we compared images generated from random masks by the IGS, by the IGS without the feature-matching loss (IGS w/o FML), and by the IGS without the Differentiable Augmentation Module (IGS w/o DAM). The number of training epochs was set to 6000 for both the Collected and BUSI datasets.
3.3.3. 2s-BUSGAN Image Synthesis
In this experiment, we used the full 2s-BUSGAN, incorporating the MGS and IGS, to generate BUS images from random noise and compared its performance with conventional generative augmentation methods. Furthermore, we employed t-SNE to compare the distribution differences between the original data and the data generated by the various methods.
3.4. Evaluation by Doctors
To assess the quality of the synthesized BUS images, we conducted a subjective evaluation on 120 randomly selected images of different types: 60 ground-truth (GT) images and 60 images generated by different methods, namely LSGAN (the most effective conventional generative augmentation method in Section 3.5) and 2s-BUSGAN, with each type consisting of 15 benign and 15 malignant images. Three experienced ultrasonographers independently evaluated the images and classified them as benign or malignant.
3.5. Augmentation Experiments for BUS Segmentation and Classification
The original dataset was divided into training, validation, and test subsets for the segmentation and classification experiments, with a 60%/20%/20% split. Due to the imbalanced distribution of benign and malignant cases in the BUSI dataset, we used oversampling [7] to balance the dataset. Three-fold cross-validation was employed when training the classification models to evaluate their performance. Online augmentation [42] was applied in both the segmentation and classification experiments, with each augmentation applied with a probability of 0.3 during training. In the classification experiments, we used augmentations that preserve the morphological characteristics of the tumor, such as translation, horizontal flipping, and rotation. In the segmentation experiments, the augmentation operations included rotation, blurring, translation, gamma transformation, scaling, and adding Gaussian noise; a sketch of such a pipeline is given below.
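As referenced above, such an online pipeline can be sketched with torchvision; the specific magnitudes (rotation degrees, translation fraction, blur kernel) are our assumptions, and for segmentation the image and its mask must be transformed jointly (e.g., via the functional API), which this sketch does not show.

```python
import torchvision.transforms as T

p = 0.3  # each operation fires independently with probability 0.3 during training

classification_aug = T.Compose([
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.1, 0.1))], p=p),  # translation
    T.RandomHorizontalFlip(p=p),
    T.RandomApply([T.RandomRotation(degrees=15)], p=p),
])

segmentation_aug = T.Compose([
    T.RandomApply([T.RandomRotation(degrees=15)], p=p),
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=p),
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.1, 0.1),
                                  scale=(0.9, 1.1))], p=p),  # translation + scaling
    # Gamma transformation and additive Gaussian noise would be added as custom ops.
])
```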
3.5.1. Classification Experiments
We compared methods based on various augmentation settings: Baseline (no augmentation), traditional augmentation (TA), traditional augmentation combined with LSGAN-based augmentation (TA + LSGAN), and traditional augmentation combined with 2s-BUSGAN (TA + 2s-BUSGAN). Our classification model of choice was VGG16 [43], pre-trained on ImageNet and fine-tuned using only real data. During training, we kept the parameters of the first three layers of VGG16 frozen and trained only the remaining layers and the fully connected layers, as sketched at the end of this subsection. To evaluate classification performance in the presence of class imbalance, we used precision, recall, F1 score, and accuracy as evaluation metrics:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP, FP, FN, and TN denote the numbers of true positive, false positive, false negative, and true negative samples, respectively.
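A sketch of the fine-tuning setup described above, using torchvision, follows; exactly which modules correspond to the "first three layers" is our interpretation (here, the first three convolutions of `model.features`).

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(pretrained=True)  # ImageNet-pre-trained weights

# Freeze the early feature layers; features[:7] spans the first three
# convolution layers (with their ReLUs and the first max-pool).
for param in model.features[:7].parameters():
    param.requires_grad = False

# Replace the final classifier layer for binary benign/malignant prediction.
model.classifier[6] = nn.Linear(4096, 2)
```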
3.5.2. Segmentation Experiments
We conducted experiments to assess the impact of different augmentation methods on segmentation performance. Three settings were compared: no augmentation (Baseline), traditional augmentation (TA), and TA plus the proposed 2s-BUSGAN (TA + 2s-BUSGAN). The segmentation model was U-Net [44], and performance was evaluated using the Dice coefficient:
$$\mathrm{Dice} = \frac{2 \times TP}{2 \times TP + FP + FN},$$
where TP denotes the number of true positive samples, FN the number of false negative samples, and FP the number of false positive samples, counted pixel-wise over the predicted and ground-truth masks.
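For completeness, a small sketch of these metrics computed from confusion-matrix counts (and pixel-wise for Dice) follows; the epsilon guard for empty masks is our addition.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

def dice(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2*TP / (2*TP + FP + FN)."""
    inter = (pred * target).sum()              # TP, counted pixel-wise
    return float((2 * inter + eps) / (pred.sum() + target.sum() + eps))
```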
6. Conclusions
In this paper, we addressed the challenges of imbalanced and insufficient BUS image data by proposing a two-stage generative augmentation method called 2s-BUSGAN. It consists of two stages, the MGS and the IGS, which generate tumor contour images from random noise and then transform them into BUS images. Additionally, we incorporated the FML to enhance the performance of 2s-BUSGAN and enrich the background texture of the generated images, and we employed the DAM to enhance the generalization ability of 2s-BUSGAN, particularly for small sample datasets.
Through extensive qualitative and quantitative experiments on the BUSI and Collected datasets, we demonstrated the effectiveness of 2s-BUSGAN. The results confirmed that our proposed method is a versatile augmentation technique that significantly improves the performance of deep learning models for both segmentation and classification, while circumventing the need for data relabeling.
Overall, our study contributes to addressing the challenges associated with imbalanced and limited BUS image data, and our proposed 2s-BUSGAN method shows promise in augmenting BUS image datasets and enhancing the performance of deep learning models in the field.