1. Introduction
Breast cancer is the most prevalent malignancy, accounting for 12.5% of new cancer cases globally as of 2020 [1,2]. The rate of survival for at least five years after a breast cancer diagnosis varies with the economic standing of nations: it exceeds 90% in high-income nations, is 40–65% in middle-income nations, and is below 40% in low-income nations [2,3]. This difference is primarily attributable to the ability of high-income nations to detect breast cancer early [1]. Early detection expands treatment options and considerably lowers the risk of breast-cancer-related deaths. The primary imaging modalities for the early diagnosis of breast cancer are mammography and ultrasound [4].
Previously, ultrasonography was considered helpful only for diagnosing cysts. However, it also improves local preoperative staging, guided interventional diagnosis, and the differential diagnosis of benign and malignant tumors [5]. Mammography has low sensitivity in dense breasts [6], and women with dense parenchyma are far more likely to develop breast cancer [7]. Dense breast tissue can be examined using ultrasound [8]. Recent research has demonstrated that high-resolution ultrasonography increases the detection of small tumors by three to four tumors per 1000 women without clinical or mammographic abnormalities [9]. Carcinomas observed on mammography and sonography have a similar stage distribution [10]. To overcome the limitations of mammography, ultrasound is frequently employed for definitive diagnosis [11]. Ultrasound is noninvasive, widely available, easy to use, less expensive, and provides real-time imaging [12]. Moreover, ultrasound is safer because it does not use ionizing radiation [13]. More importantly, ultrasound helps detect tumors in women with dense breasts and detects and classifies breast lesions that cannot be interpreted adequately through mammography alone [14]. Physicians can use ultrasound to identify many clinically relevant findings, such as benign cysts or normal tissue. Most women over the age of 30 years typically undergo ultrasonography along with mammography; for women under 30, ultrasonography alone is usually sufficient to decide whether a biopsy of a suspicious area of the breast is necessary.
However, early breast ultrasound diagnosis has limitations [15]. First, follow-up ultrasound, aspiration, or biopsy may be required after a breast ultrasound image is interpreted [16]. A biopsy may be recommended to determine whether a suspicious abnormality is cancerous, yet most of the suspicious locations detected by ultrasonography that require biopsy turn out to be noncancerous (false positives) [17]. Second, although ultrasound is a breast imaging method, annual mammography is still recommended [1]. Many tumors cannot be detected via ultrasonography, and numerous calcifications detected by mammography are invisible on ultrasonography [18]. In mammography, some early breast tumors manifest only as calcifications [19,20]. Even for women with dense breasts, many institutions do not offer ultrasound screening, and some insurance plans might not cover it. As a real-time examination, ultrasound depends on the anomaly being detected during the scan; therefore, sophisticated tools and experienced professionals are required.
Deep learning (DL) has been used to overcome the limitations of ultrasound in the early detection of breast cancer [21]. Many studies have applied DL to synthetic imaging, object detection, segmentation, and image classification of breast lesions [22]. Many of these approaches have received the necessary regulatory certifications and are used in clinical settings [23]. Additionally, DL approaches have demonstrated the ability to perform at levels comparable to those of human experts in a variety of breast cancer detection tasks and have the potential to help physicians with little training improve the diagnosis of breast cancer in clinical settings [24].
All prior investigations regarding DL for ultrasound-based early breast cancer diagnosis have used convolutional neural networks (CNNs) [25]. Vision transformers (ViTs), developed by Dosovitskiy et al. [26] in 2020, outperform state-of-the-art (SOTA) CNN algorithms in natural image classification. ViTs achieve this superior performance by incorporating more global information than CNNs do in their lower layers and by having stronger skip connections than ResNet. A CNN requires several layers to compute features that ViTs compute in a smaller set of lower layers. All these attributes help ViTs outperform CNNs in classifying natural images. However, little research has been conducted on the application of ViTs to medical image analysis, mainly because of their heavy reliance on extensive training data: ViTs do not outperform CNNs on small-scale datasets. This reliance on big data has limited their effective use in domains such as medical imaging, where substantial training datasets are difficult to obtain. To overcome this challenge, transfer learning has been used extensively in CNN-based medical image analysis [25,27,28]. Using ViT models pretrained on large natural image datasets, and applying transfer learning to them, may likewise increase the performance of DL in medical image analysis. Hence, we present BUViTNet (breast ultrasound detection via ViTs), a new transfer-learning strategy based on ViTs for the classification of breast ultrasound images.
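To make the strategy concrete, the sketch below outlines the multistage transfer-learning idea in PyTorch with the timm library. It is a minimal illustration, not our exact training configuration: the model name, optimizer, learning rate, and the `microscopy_loader`/`ultrasound_loader` data loaders are assumptions introduced for the example.

```python
import timm
import torch
import torch.nn as nn

# Stage 1: start from a ViT pretrained on ImageNet (natural images).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)

def fine_tune(model, loader, epochs=10, lr=1e-4):
    """One transfer stage: fine-tune all weights on the given dataset."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Stage 2 (intermediate): adapt the ImageNet weights to microscopic
# cancer-cell images (hypothetical loader).
# model = fine_tune(model, microscopy_loader)

# Stage 3 (target): reuse the stage-2 weights as initialization for
# breast ultrasound classification (hypothetical loader).
# model = fine_tune(model, ultrasound_loader)
```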
The proposed ViT-based transfer learning has multiple advantages over CNN-based transfer learning. The disadvantages of CNNs include their high processing cost, narrow focus on a portion of the image rather than the entire image, inability to encode relative spatial information, and inability to handle rotation and scale invariance without augmentation. CNNs are computationally expensive owing to their operation on raw pixel arrays [29]. Because deeper layers are required to extract features from the entire image, CNNs demand more training time and hardware. In addition, because of the information lost during operations such as pooling (average or max pooling), CNNs become slower and less accurate. The convolutional layer is the primary building block of a CNN; its task is to identify significant details in the image pixels. Earlier layers (closer to the input) learn to recognize simple features such as edges and color gradients, whereas higher layers combine these simple characteristics into more complex features. Finally, dense layers at the top of the network combine very high-level features to produce classification predictions. All higher-level neurons in a CNN receive low-level information and perform additional convolutions to determine whether specific features are present, repeating this computation across all the individual neurons within the receptive field. The location and orientation of an object are not considered by a CNN when forming predictions: CNNs lose all internal information about the position and orientation of the object and route all the data to the same neurons, which may not be able to process this type of input [30]. A CNN predicts the output by observing an image and determining whether specific elements are present; if they are, the image is categorized accordingly. An artificial neuron produces only a single scalar. Convolutional layers, by contrast, produce a 2D matrix in which each number is the result of convolving a kernel with a portion of the input volume, and the weights of each kernel are duplicated across the entire input volume. The 2D matrix can therefore be considered the output of a replicated feature detector, and the output of a convolutional layer is produced by stacking the 2D matrices of all kernels on top of one another. The next step is to attempt to achieve viewpoint invariance in the neuronal activity. This is accomplished using max pooling, which examines each region of the 2D matrix in turn and selects the greatest value in that region. When the output is invariant, the input can be slightly altered without affecting the results; in other words, max pooling ensures that the network activities (neuron outputs) remain constant even if the object to be detected is slightly shifted in the input image, allowing the network to continue to recognize it. However, because max pooling discards important data and does not encode the relative spatial relationships between features, this approach is ineffective [31]. Consequently, CNNs are not genuinely immune to significant alterations of the input data. Using ViTs enables us to overcome these limitations of CNNs for detecting breast cancer in ultrasound images. In this study, we also introduce a novel transfer-learning method to compensate for the ViT's data hunger: transfer learning enables the ViT to be pretrained on a large number of natural and microscopic images and then used to classify ultrasound images without overfitting.
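The pooling argument can be seen in a toy example: the sketch below (plain PyTorch, illustrative only) feeds two inputs whose only difference is the position of a feature inside a 2 × 2 pooling window; max pooling returns identical outputs, so the relative position is discarded.

```python
import torch
import torch.nn.functional as F

a = torch.zeros(1, 1, 4, 4)
b = torch.zeros(1, 1, 4, 4)
a[0, 0, 0, 0] = 1.0  # feature at the top-left corner of the first window
b[0, 0, 1, 1] = 1.0  # feature shifted within the same 2x2 window

# Both calls print the same pooled map: the position of the feature
# inside each window is lost after max pooling.
print(F.max_pool2d(a, kernel_size=2))
print(F.max_pool2d(b, kernel_size=2))
```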
Generally, the proposed method provides the following contributions:
Developed the first multistage transfer-learning method using vision transformers for breast cancer detection.
Utilized microscopic image datasets whose image features are related to those of ultrasound images for intermediate-stage transfer learning to improve the performance of early breast cancer detection.
Carefully studied the characteristics of different pretrained vision transformer models for transfer to ultrasound-image-based breast cancer detection.
Investigated the effectiveness of the proposed BUViTNet method when applied to datasets from different sources as well as to mixed datasets of different origins.
Compared the performance of the BUViTNet method against vision transformers trained from scratch, conventional vision-transformer-based transfer learning, and convolutional neural networks for breast cancer detection.
3. Results
The proposed method was evaluated using two datasets from different sources and their mixture, as listed in Table 3. The proposed method achieved an AUC of 1 ± 0, an MCC of 1 ± 0, and a kappa score of 1 ± 0 on the Mendeley dataset with the vitb_16, vitb_32, and vitl_32 ViT models. Furthermore, it achieved the highest AUC of 0.968 ± 0.02, MCC of 0.961 ± 0.01, and kappa score of 0.959 ± 0.02 on the BUSI dataset with the vitb_16 model. For the mixed dataset, the vitb_16 model achieved the highest AUC of 0.937 ± 0.03, MCC of 0.924 ± 0.02, and kappa score of 0.919 ± 0.03.
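For reference, the reported metrics can be computed with scikit-learn as in the sketch below; the labels and probabilities are dummy values for illustration, not our experimental outputs.

```python
from sklearn.metrics import roc_auc_score, matthews_corrcoef, cohen_kappa_score

y_true = [0, 0, 1, 1, 1, 0]              # ground truth (0 = benign, 1 = malignant)
y_prob = [0.1, 0.4, 0.8, 0.9, 0.6, 0.3]  # predicted malignancy probabilities
y_pred = [int(p >= 0.5) for p in y_prob] # thresholded class predictions

print(roc_auc_score(y_true, y_prob))     # area under the ROC curve (AUC)
print(matthews_corrcoef(y_true, y_pred)) # Matthews correlation coefficient (MCC)
print(cohen_kappa_score(y_true, y_pred)) # Cohen's kappa
```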
Figure 7 shows the receiver operating characteristic (ROC) curves of the proposed BUViTNet method on the three datasets, using vitb_16 as the base model.
We compared the proposed method with ViT models trained from scratch using ultrasound images. This was performed to determine whether the ViT-based transfer-learning model performs better than a ViT model trained directly from scratch on ultrasound images. The highest AUCs recorded with the models trained from scratch were 0.73 ± 0.2, 0.71 ± 0.07, and 0.7 ± 0.1 using vitb_16 on the Mendeley, BUSI, and mixed datasets, respectively (Table 4). The proposed method significantly outperformed the ViT models trained from scratch, with a p-value of less than 0.01 in all cases on all datasets.
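The text above does not name the statistical test behind the reported p-values; one plausible choice, shown purely as an assumed example with hypothetical per-run AUCs, is a paired t-test over repeated runs.

```python
from scipy import stats

# Hypothetical per-run AUCs for the two methods (illustrative values only).
auc_buvitnet = [0.96, 0.97, 0.95, 0.98, 0.96]
auc_scratch  = [0.71, 0.73, 0.69, 0.72, 0.70]

t_stat, p_value = stats.ttest_rel(auc_buvitnet, auc_scratch)
print(p_value)  # p < 0.01 would indicate a statistically significant difference
```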
Furthermore, the proposed transfer-learning method was compared with conventional ImageNet-pretrained ViT models. This experiment validates the superiority of the proposed method over traditional transfer learning. The highest AUCs achieved for conventional transfer learning using ViT models were 1 ± 0, 0.9548 ± 0.0183, and 0.9116 ± 0.0156 on the Mendeley, BUSI, and mixed datasets, respectively (Table 5). The performances of the proposed method and traditional transfer learning on the Mendeley dataset were comparable; however, the proposed method outperformed the traditional transfer-learning method on the BUSI and mixed datasets, with a p-value of less than 0.01.
Finally, we compared the proposed ViT-based transfer-learning method with transfer learning using CNNs. To do so, we used three SOTA CNN architectures: ResNet50, EfficientNetB2, and InceptionNetV3. All implementation parameters were kept the same as those of the ViT-based transfer-learning method for a fair comparison. ResNet50-based transfer learning provided the highest AUCs of 0.972 ± 0.01, 0.879 ± 0.2, and 0.836 ± 0.08 on the Mendeley, BUSI, and mixed datasets, respectively (Table 6). The proposed ViT-based transfer-learning method performed better than CNN-based transfer learning for breast ultrasound images, with a p-value of less than 0.01.
4. Discussion
In this study, we proposed a novel transfer-learning approach for ViT-based breast ultrasound image classification. Our approach first takes a ViT model pretrained on the ImageNet dataset and transfers it to the classification of cancer cell images. This model, trained on both ImageNet and cancer cell images, is then used to classify breast ultrasound images. This novel transfer-learning approach enables the model to learn from a large number of natural and medical images before being applied to breast ultrasound classification, leveraging the features learned in the previous transfer-learning stages for the target task. As a result, our proposed model achieved the best performance in terms of all the metrics used. We compared the proposed approach with ViT models trained from scratch, conventional ViT-based transfer learning, and CNN-based transfer learning, and it outperformed all of these models.
Regarding the performance of BUViTNet with various base models, vitb_16-based BUViTNet performed better than vitb_32 and vitl_32 while also having lower computational complexity. The main reason for this is likely the patch size used in these base models: vitb_16 uses an input patch size of 16 × 16, whereas vitb_32 and vitl_32 use 32 × 32. The smaller the patch size, the more efficient and effective the attention of the transformer encoder, which leads to better extraction of local and global features and, consequently, better tumor classification. Another reason, especially in the case of vitl_32, is that the larger network overfits the data compared with the vitb_16 and vitb_32 models. Moreover, the vitb_16 model is less computationally complex than the vitb_32 and vitl_32 models, making it preferable in our case.
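The patch-size argument reduces to simple arithmetic: smaller patches yield more tokens for the transformer encoder to attend over. The sketch below assumes the standard 224 × 224 ViT input resolution.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches (tokens) for a square input."""
    return (image_size // patch_size) ** 2

print(num_patches(224, 16))  # vitb_16: 196 tokens per image
print(num_patches(224, 32))  # vitb_32 and vitl_32: 49 tokens per image
```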
Despite the superior performance of the proposed method in all the experiments, it has some disadvantages. As can be observed from the results in Table 4, the ViT trained from scratch performed poorly because ViTs are data intensive: owing to their large number of parameters, they require a large number of images to perform well when trained from scratch. The ViT base models have 86 million parameters, and the ViT large model has approximately 300 million parameters. Training from scratch with only hundreds or thousands of ultrasound images, as in our study, is therefore difficult, which is why the model trained from scratch performed poorly. Another observation from the experiments is that ViTs are more computationally expensive than CNNs; models using ViTs require more time to train owing to the large size of their architectures. Comparing the training times reported in Table 3 and Table 6, one can observe that CNNs were mostly faster to train than ViTs. Nevertheless, despite the larger number of parameters and higher computational cost compared with CNNs, our proposed ViT-based models performed better than CNNs in all cases, and the transfer-learning method proposed in this study exhibited superior performance.
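The parameter counts quoted above can be checked with timm, as in the sketch below; the exact timm model variants are assumptions based on the base-model names used in this paper.

```python
import timm

for name in ["vit_base_patch16_224", "vit_base_patch32_224", "vit_large_patch32_224"]:
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")  # roughly 86M / 88M / 306M
```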
Future work will focus on optimizing the model using different parameters not included in this paper. Previous studies of multistage learning methods for ultrasound images have shown the effect of different deep-learning parameters [25,28]; thus, we will further optimize the proposed model using different deep-learning parameters. Furthermore, we observed slight performance differences when using datasets from different sources, as can be seen in Table 3. This is due to differences in imaging equipment, personnel, location, and other related factors. A good deep-learning model should tolerate such variations and perform uniformly, irrespective of the source of the dataset. However, this requires large and diverse datasets from different locations, which are not currently available. Therefore, our next task will be to train the proposed model on datasets collected from different locations across the globe. The proposed method could also be translated to early breast cancer diagnosis via other modalities, such as mammography and magnetic resonance imaging (MRI). Multistage transfer learning using natural and microscopic images has been shown to improve CNN performance for breast ultrasound image classification [27]; this could also hold for the performance of vision transformers on mammographic images. Therefore, the proposed method could be translated to early breast cancer detection using other modalities, such as mammograms.