Article

Evaluating Deep Learning Architectures for Breast Tumor Classification and Ultrasound Image Detection Using Transfer Learning

by Christopher Kormpos, Fotios Zantalis, Stylianos Katsoulis and Grigorios Koulouras *,†
TelSiP Research Laboratory, Department of Electrical and Electronic Engineering, School of Engineering, University of West Attica, Ancient Olive Grove Campus, 250 Thivon Str., GR-12241 Athens, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(5), 111; https://doi.org/10.3390/bdcc9050111
Submission received: 26 February 2025 / Revised: 9 April 2025 / Accepted: 17 April 2025 / Published: 23 April 2025

Abstract
The intersection of medical image classification and deep learning has garnered increasing research interest, particularly in the context of breast tumor detection using ultrasound images. Prior studies have predominantly focused on image classification, segmentation, and feature extraction, often assuming that the input images, whether sourced from healthcare professionals or individuals, are valid and relevant for analysis. To address this assumption, we propose an initial binary classification filter to distinguish between relevant and irrelevant images, ensuring that only meaningful data proceeds to subsequent analysis. However, the primary focus of this study lies in investigating the performance of a hierarchical two-tier classification architecture compared to a traditional flat three-class classification model, employing a well-established breast ultrasound images dataset. Specifically, we explore whether sequentially breaking down the problem into binary classifications, first identifying normal versus tumorous tissue and then distinguishing benign from malignant tumors, yields better accuracy and robustness than directly classifying all three categories in a single step. Using a range of evaluation metrics, the hierarchical architecture demonstrates notable advantages in certain critical aspects of model performance. The findings of this study provide valuable guidance for selecting the optimal architecture for the final model, facilitating its seamless integration into a web application for deployment. These insights are further anticipated to inform future algorithm development and broaden the applicability of this research across diverse fields.

1. Introduction

The integration of artificial intelligence into various aspects of modern life, coupled with unprecedented advancements in computational resources such as processors and graphics cards, has driven remarkable progress across numerous domains, including biomedical technology. Machine learning algorithms for disease recognition and classification, operating on either simple numerical data or complex data such as images, have been applied successfully for many years, particularly in tumor identification [1]. Specifically, numerous studies have focused on detecting breast cancer, which remains one of the leading causes of death among women worldwide [2]. Building on this foundation, breast cancer diagnosis through medical imaging and machine learning has emerged as a critical component of modern healthcare, fueled by growing interest in the application of artificial intelligence to this field. Studies have consistently shown that early detection significantly improves survival rates among women [3], and models, whether designed to assist doctors or operate autonomously, are aimed at improving early diagnoses while alleviating the workload of medical professionals.
The motivation behind this study arises from practical challenges encountered during the development of a real-world web application. This research explores the potential of hierarchical and flat classification approaches to identify the most effective method for breast tissue classification, specifically examining whether a two-branch binary architecture can outperform the traditional three-class classification paradigm. Additionally, the integration of a filtering layer designed to distinguish between relevant and irrelevant input images aims to enhance system reliability and input relevance. The goal of this study is to determine whether the development of advanced binary classification systems represents a more effective strategy than optimizing traditional three-class classification algorithms, thereby contributing to the advancement of diagnostic methodologies in healthcare and the field of image recognition.

2. Related Work

Breast cancer detection has seen significant advancements through the application of machine learning and imaging techniques in recent years. Most existing studies have primarily focused on using mammography images to extract useful features, whether for tumor detection, segmentation, or directly predicting the malignancy of the identified tumor based on the input image [4,5]. However, one challenge is that some hospitals and medical centers may lack the necessary equipment for digital mammography. Additionally, in cases where women have dense breast tissue, mammography may fail to detect cancers [6]. A common supplementary or alternative imaging modality in such scenarios is breast ultrasound, which, despite being prone to noise, is a safe, cost-effective, and portable method frequently employed by radiologists [1].
Convolutional Neural Networks (CNNs) have long been the dominant deep learning approach for breast ultrasound analysis, with numerous architectures demonstrating strong performance on their respective datasets [7,8,9]. While Transformer-based vision models with attention mechanisms have recently emerged as promising alternatives [10], CNNs remain the industry standard, with research primarily emphasizing deeper architectures and greater complexity to extract more intricate features from ultrasound images [11]. This approach seeks to improve the accuracy of predictions for the three-class classification task of distinguishing between malignant, benign, and normal cases. While traditional flat classification approaches, such as those referenced, dominate medical imaging tasks, hierarchical classification architectures have garnered attention for their potential to enhance both accuracy and interpretability by breaking down complex problems into simpler subtasks [12,13]. However, the application of such architectures to breast ultrasound images remains underexplored, presenting an opportunity to investigate their benefits over flat three-class models.
Despite continuous advancements over the years, the field faces persistent challenges, particularly concerning the interpretability of CNN models and the computational demands associated with training complex architectures [14]. Large CNN architectures, which are typically designed to classify multiple categories, often demand significant computational resources and can take hours or even days to train on extensive datasets using Graphics Processing Units (GPUs). Given these constraints, along with the limited availability of medical data, which is largely restricted by patient privacy concerns, many recent studies have adopted transfer learning as an effective approach [6]. In transfer learning, a pretrained model, along with the parameters, weights, and knowledge gained from a large dataset, is repurposed to perform a different task than originally intended [15]. Hyperparameter tuning is also a critical aspect of this process, as it optimizes the performance of the model for the specific task and dataset, ultimately enhancing its accuracy [16].
On an application level, numerous studies have focused primarily on mammography or histopathology images, as these datasets are open-source and more readily available. One such application is ABCanDroid [17], which utilizes a high-accuracy ResNet-101 model with transfer learning in an Android environment. Another notable example is Branet [18], a newly developed mobile application that integrates both segmentation and classification techniques for mammography and ultrasound images to predict whether a tumor is benign or malignant. While these applications are distinguished by their strong performance, their scalability when incorporating new data remains a concern, particularly in cases where unrelated images are input into the models.

3. Methodology

The objective of this study is to assess the performance of a hierarchical two-tier binary classification architecture in comparison to the traditional three-class classification approach, which has been widely adopted in previous studies. Despite limited application in medical data, hierarchical architectures have demonstrated superior performance compared to flat models in image classification across various domains [13]. Notably, they have been successful in analyzing small bowel biopsy images [13] and distinguishing between healthy eyes and those affected by glaucoma, including classifying four glaucoma subtypes [19]. To further address the challenges identified on an application level, after determining the superior method, a set of models will be trained and evaluated to serve as the filter stage. These models will be designed to exclude irrelevant images, ensuring that only diagnostically relevant data progresses to the subsequent stages of analysis. The two approaches, which will be compared and evaluated following the filter level, are presented in Figure 1.
In the literature on hierarchical classification, the first approach depicted in Figure 1 is known as the Local Classifier per Node (LCN) approach, which is by far the most widely used method in hierarchical architectures [20,21]. This technique involves training a binary classifier for each node within the class hierarchy, progressing from the root to the leaf nodes. The second approach, also illustrated in Figure 1, is the flat classification method, commonly referred to in earlier studies as the Global Classifier [22]. Unlike LCN, this approach disregards the class hierarchy entirely, focusing solely on predicting the classes at the leaf nodes. While this method simplifies the classification task, it does so at the expense of leveraging hierarchical relationships. Both approaches offer distinct advantages and disadvantages when applied to classification tasks. However, their comparative effectiveness in the context of breast ultrasound image classification remains to be evaluated.
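To make the two approaches concrete, the sketch below (in Python, assuming two already-trained Keras binary models and one flat three-class model; all function and variable names are illustrative, not the study's actual code) shows how a single preprocessed image would be routed through each architecture.

```python
import numpy as np

def predict_hierarchical(img, normal_vs_tumor_model, benign_vs_malignant_model, threshold=0.5):
    """Local Classifier per Node: two binary decisions from root to leaf."""
    x = np.expand_dims(img, axis=0)                       # add a batch dimension
    p_tumor = float(normal_vs_tumor_model.predict(x)[0][0])
    if p_tumor < threshold:
        return "normal"                                   # first level: normal vs. tumorous
    p_malignant = float(benign_vs_malignant_model.predict(x)[0][0])
    return "malignant" if p_malignant >= threshold else "benign"

def predict_flat(img, three_class_model, class_names=("benign", "malignant", "normal")):
    """Global (flat) classifier: a single softmax over the three leaf classes."""
    x = np.expand_dims(img, axis=0)
    probs = three_class_model.predict(x)[0]
    return class_names[int(np.argmax(probs))]
```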

4. Data Sources and Image Preprocessing

The experimental analysis was conducted using two primary datasets. The first and most critical dataset, utilized for classifying benign, malignant, and normal breast tissues in the final layers of the three hierarchical architectures, is the widely recognized benchmark Breast Ultrasound Images Dataset [23]. This dataset comprises 780 ultrasound images, including 133 classified as normal, 437 as benign, and 210 as malignant, along with their corresponding tumor segmentation mask images. However, it has been consistently noted across multiple sources that in addition to the class imbalance present in the Breast Ultrasound Images Dataset, the number of images available for training is insufficient for effective model performance [9,24]. Studies have also shown that applying image augmentation techniques not only addresses this limitation, but also leads to improved classification performance [25,26,27]. Consequently, an image augmentation step was deemed essential to enhance the dataset and improve the training process. Representative samples from this dataset, illustrating its diverse image categories, are presented in Figure 2.
For the filter layer, designed to eliminate irrelevant images, a combination of publicly available datasets was utilized. To represent random irrelevant images, the “Unsplash Random Images Collection” from Kaggle, featuring 802 images of landscapes, animals, and street scenes, was used alongside four medical datasets: the “Brain MRI Images” dataset with 14,700 images, the “Mammography Small Dataset” with 106 mammograms, the “Chest X-Ray Images” dataset [28] containing 5863 normal and pneumonia chest X-rays, and the “BreaKHis” dataset [29] with 780 biopsy images of benign and malignant breast tumors.
In the image preprocessing stage for the first dataset, segmentation mask images were discarded, as they were not relevant for classification. All malignant, benign, and normal images were horizontally mirrored, doubling the dataset to 1560 images. However, a class imbalance remained, with 56% benign, 26% malignant, and 18% normal images, which could affect model performance, particularly in tasks involving pattern recognition and classification [30]. To address this, data augmentation techniques were applied to reach 1150 images per class, generating an additional 276 benign, 730 malignant, and 884 normal images. A portion of the existing images was randomly selected for cropping/zooming with a 1.2× factor. Common distortions in medical imaging, such as blurring, contrast variations, and noise, which can affect diagnostic accuracy [31], were simulated across the augmented dataset. Some images had their contrast reduced, others were enhanced using Contrast Limited Adaptive Histogram Equalization (CLAHE), and still others were subjected to Gaussian blur or salt-and-pepper noise. The zooming/cropping step was applied to normal images after the other augmentations due to their smaller quantity. Figure 3 shows the distortion types applied during the ultrasound image data augmentation process.
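A minimal sketch of these augmentation operations using OpenCV and NumPy (both part of the software stack listed in Section 6) is given below; the specific parameter values, such as the CLAHE clip limit and the salt-and-pepper noise ratio, are illustrative assumptions rather than the exact settings used in the study.

```python
import cv2
import numpy as np

def mirror(img):
    return cv2.flip(img, 1)                       # horizontal mirroring

def zoom_crop(img, factor=1.2):                   # 1.2x zoom/crop as described in the text
    h, w = img.shape[:2]
    nh, nw = int(h / factor), int(w / factor)
    y0, x0 = (h - nh) // 2, (w - nw) // 2
    return cv2.resize(img[y0:y0 + nh, x0:x0 + nw], (w, h))

def lower_contrast(img, alpha=0.6):               # reduce contrast by scaling intensities
    return cv2.convertScaleAbs(img, alpha=alpha, beta=0)

def clahe(img, clip_limit=2.0):                   # Contrast Limited Adaptive Histogram Equalization
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
    return cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8)).apply(gray)

def gaussian_blur(img, ksize=5):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def salt_and_pepper(img, ratio=0.02):             # simulate speckle-like acquisition noise
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < ratio / 2] = 0                   # pepper
    noisy[mask > 1 - ratio / 2] = 255             # salt
    return noisy
```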
In architectures such as the ones examined in this study, algorithms trained on imbalanced datasets commonly demonstrate a tendency to favor the majority classes [21]. To mitigate this issue and achieve balance with the 3460 breast ultrasound images, the irrelevant image dataset was appropriately adjusted. After eliminating duplicates, the dataset was augmented primarily using the “Unsplash Random Images Collection”, employing techniques such as mirroring, random cropping, and converting images to black and white, which increased the dataset size to 2184 images. Additionally, 350 images each were randomly selected and prepared from three of the four medical datasets; in the case of the mammography dataset, all 212 images, after mirroring, were included, resulting in a final dataset of 3446 samples.

5. Selected Models for Evaluation

The three pretrained models selected for evaluation within the two proposed architectures were chosen based on their established prominence and frequent application in transfer learning for ultrasound imaging in recent years [27]. These models are recognized for their ability to deliver high levels of accuracy and efficiency during training, making them well-suited for tasks involving complex feature extraction and precise classification.

5.1. VGG-16

When discussing deep learning architectures, one of the most prominent and foundational Convolutional Neural Networks is VGG-16. This architecture gained recognition in the 2014 ImageNet Challenge, securing first place in the localization task and achieving second place in classification. Structurally, VGG-16 comprises 13 convolutional layers and 3 fully connected layers, distinguished from earlier models by its consistent use of small 3 × 3 filters in all convolutional layers, which enhances feature extraction efficiency and overall performance [32]. VGG-16 has demonstrated strong performance in transfer learning applications for breast cancer classification, both with and without data augmentation [33,34]. Furthermore, it has shown high accuracy in distinguishing relevant from irrelevant images, making it effective for filtering social media content related or unrelated to natural diseases [35].

5.2. InceptionV3

Together with ResNet [36], the Inception architecture [37] introduced a transformative approach in computer vision, challenging the trend of deepening VGG-based architectures by effectively addressing the vanishing gradients problem [38]. This issue, where network derivatives diminish and impede learning from new data, was mitigated by the structure of Inception models. Known for its unique multi-scale feature extraction capabilities, the Inception architecture achieved success by employing layers with varying filter sizes within each module. In later years, this design further reduced the parameter count and incorporated regularization techniques, including batch-normalized auxiliary classifiers and label smoothing, allowing efficient training even with moderate-sized datasets [39]. InceptionV3, in particular, has demonstrated robust performance in transfer learning applications on the Breast Ultrasound Images Dataset (BUSI) utilized in this study [40].

5.3. NASNet

Unlike earlier models with architectures grounded in traditional design principles for deepening neural networks, NASNet, developed by the Google Brain team, introduces a unique approach derived from Neural Architecture Search (NAS). In this process, a Recurrent Neural Network (RNN) controller explores the architecture space, initially training and optimizing candidate models on the CIFAR-10 dataset. The best-performing architecture is then scaled and adapted for the larger ImageNet dataset, ensuring efficient feature extraction and optimal performance [41]. The structure of NASNet is composed of convolutional layers interspersed with pooling layers, enhancing feature extraction across varying levels of representation. When applied to the Breast Ultrasound Images Dataset with data augmentation techniques, NASNet has demonstrated superiority in Transfer Learning (TL) applications, outperforming many other models in this domain [23,42].

6. Training Process

This section provides a comprehensive overview of the training process, including a detailed description of the hardware and software employed in conducting the experiments, the normalization techniques applied to the data, and the final adjustments made to the hyperparameters of the networks to optimize performance prior to training.

6.1. Experimental Setup

The proposed models, architectures and final application were implemented on an NVIDIA DGX Workstation (supplied by NVIDIA Corporation, Santa Clara, CA, USA) equipped with a 2.2 GHz Intel Xeon E5-2698 v4 CPU featuring 20 cores (Intel Corporation, Santa Clara, CA, USA), four Tesla V100 Tensor Core GPUs (NVIDIA Corporation, Santa Clara, CA, USA), and 256 GB of RDIMM DDR4 RAM. The software environment included Python 3.10.12, with TensorFlow 2.15.0 and Keras 2.15.0 for model development, alongside supporting libraries such as scikit-learn 1.4.2, Matplotlib 3.10.1, OpenCV 4.11.0, Pillow 11.1.0, and Seaborn 0.13.2.

6.2. Data Normalization

The data normalization step is crucial for improving processing speed, accuracy, and feature extraction in neural networks. Although data augmentation and cleaning enhance the dataset, it may still fall short in providing sufficient, relevant information, potentially complicating model training across epochs instead of improving it [43]. After augmentation, the data for each classification level in the hierarchical architectures are divided into 70% training, 15% validation, and 15% test sets. Pixel values are then normalized to a 0 to 1 range by dividing each pixel by the maximum value (255), and finally, images are resized to meet the dimensional requirements of each model.
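A brief sketch of this preparation step is shown below, using scikit-learn's train_test_split for the 70/15/15 partition; the 224 × 224 target size, the stratified splitting, and the helper name are assumptions for illustration, since each model imposes its own input dimensions.

```python
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_data(images, labels, target_size=(224, 224)):
    # Resize to the model's expected input dimensions and scale pixels to [0, 1].
    x = np.array([cv2.resize(img, target_size) for img in images], dtype=np.float32) / 255.0
    y = np.array(labels)
    # 70% training, 30% temporary; the temporary set is halved into validation and test (15% each).
    x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.30, stratify=y, random_state=42)
    x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```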

6.3. Fine Tuning

After data augmentation and normalization, each model is initialized as a base model using pretrained models of TensorFlow, reducing susceptibility to errors and eliminating the need to train models from scratch, saving both time and computational resources [44]. The VGG-16, InceptionV3, and NASNet models are loaded with weights from the ImageNet dataset, excluding their final layers. For VGG-16, the final layers consist of a Flatten layer and a Dense layer with 256 neurons using a ReLU activation function. In InceptionV3 and NASNet, the final layers include a 2D Global Average Pooling layer followed by a Dense layer with 1024 neurons, also using ReLU activation. With the inclusion of the extra layers, the total number of parameters for VGG-16 is 21,137,729, for InceptionV3 the total number is 23,901,985, and for NASNet the total number is 89,047,635, demonstrating the expanded complexity of each model. The final activation layer for each model is adjusted based on the classification task: for three-class classification, a softmax layer with three units is used, while binary classification uses a single sigmoid unit. In all models, early layers are frozen to retain pretrained weights for foundational feature extraction. For the high-level task of irrelevant/relevant image classification, a grid search was performed using the AdamW, Adam, Adamax, and SGD optimizers with learning rates of 10⁻², 10⁻³, 10⁻⁴, and 10⁻⁵. The results indicated that the Adam optimizer with a learning rate of 10⁻³ was the most effective choice for every scenario. This configuration, along with 5 training epochs, proved ideal for achieving fast convergence. For the breast tumor classification task, hyperparameters were selected based on the findings of Al-Dhabyani et al. [24], who successfully trained models for 10 epochs using the Adam optimizer with a learning rate of 10⁻³, yielding satisfactory results with or without data augmentation. Batch sizes for all image generators were consistently set to 32.
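As an illustration of the transfer-learning configuration described above, the following sketch assembles the VGG-16 variant in Keras; the 224 × 224 input size and the decision to freeze the entire convolutional base are simplifying assumptions, since the text specifies only that early layers were frozen.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg16_classifier(num_classes):
    # Load VGG-16 with ImageNet weights, excluding its original fully connected top.
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3))
    base.trainable = False                        # freeze pretrained convolutional layers

    x = layers.Flatten()(base.output)
    x = layers.Dense(256, activation="relu")(x)   # added Dense layer, as described in the text
    if num_classes == 2:
        outputs = layers.Dense(1, activation="sigmoid")(x)     # binary stages
        loss = "binary_crossentropy"
    else:
        outputs = layers.Dense(num_classes, activation="softmax")(x)  # flat three-class stage
        loss = "categorical_crossentropy"

    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=loss, metrics=["accuracy"])
    return model
```

The InceptionV3 and NASNet variants would instead append a 2D Global Average Pooling layer followed by a 1024-neuron ReLU Dense layer, as described above.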
The three models within each level of the two architectures are trained using the training data, with their performance evaluated based on the corresponding validation data. By leveraging TensorFlow’s history object and the Matplotlib library, visualizing the changes in accuracy and loss for both the training and validation sets across each epoch becomes straightforward. This visualization is essential for monitoring the training process and identifying potential issues, such as overfitting or underfitting, that may impact each model [45]. Figure 4, Figure 5 and Figure 6 depict the variations in accuracy and loss over epochs for both training and validation sets, offering valuable insights into how the three models perform at each level of the two architectures.
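The curves themselves can be produced directly from the history object returned by model.fit, as in the minimal sketch below; the model and the two image generators are assumed to exist already, and their names are placeholders.

```python
import matplotlib.pyplot as plt

# Train the model; the batch size is configured inside the image generators.
history = model.fit(train_generator, validation_data=val_generator, epochs=10)

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="training")
ax_acc.plot(history.history["val_accuracy"], label="validation")
ax_acc.set(title="Accuracy", xlabel="Epoch")
ax_acc.legend()

ax_loss.plot(history.history["loss"], label="training")
ax_loss.plot(history.history["val_loss"], label="validation")
ax_loss.set(title="Loss", xlabel="Epoch")
ax_loss.legend()
plt.tight_layout()
plt.show()
```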

7. Evaluation

After training the three models, their performance was evaluated on the previously reserved testing set to assess each model’s ability to correctly classify each class. It is important to note that the training, validation, and testing datasets for all architectures tasked with either binary or three-class classification of breast ultrasound images were consistently fixed across all experiments. This approach ensures that the same data subsets were used for both training and evaluation, thus eliminating potential inaccuracies that could arise from randomly selecting data in each iteration. Based on the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values obtained from each model’s confusion matrix, common classification metrics such as accuracy, precision, recall (sensitivity), and F1 score were used to assess model performance at each stage of the architectures. Precision and recall were specifically calculated for the class of interest at each level: relevant images for the filter layer, and malignant/tumorous images for the later layers. The formulas for these metrics are outlined as follows:
$$\text{Accuracy} = \frac{TP + TN}{T}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where T = TP + TN + FP + FN is the total number of test samples.
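In practice, these metrics can be computed directly from the predictions with scikit-learn, as in the short sketch below; y_true and y_pred stand for the held-out test labels and the model's predicted labels, and the positive-class encoding is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# y_true and y_pred are the test-set labels and the corresponding model predictions.
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
# For the binary stages, precision and recall refer to the class of interest
# (relevant, tumorous, or malignant), assumed here to be encoded as 1.
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)
print(cm)
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```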
To assess whether the observed differences in model performance across different training processes and data splits are meaningful and not due to random variation, a paired t-test was conducted. The paired t-test, a variant of Student’s t-test, is a well-established statistical method that evaluates whether the mean difference between two related groups is statistically significant [46]. By comparing the observed t-statistic to the t-distribution, the p-value, which quantifies the likelihood that the observed differences occurred by chance, is calculated. A p-value smaller than a predefined threshold (typically 0.05) suggests that the differences are statistically unlikely to have arisen randomly, supporting the hypothesis that the observed performance disparity is meaningful [47]. The equation for calculating the paired t-test is provided below:
$$t = \frac{\sum_{i=1}^{n} d_i / n}{s_d / \sqrt{n}}$$
where $d_i$ represents the difference between the paired values for the i-th observation, n is the number of paired observations, and $s_d$ is the standard deviation of the paired differences. In this study, the paired t-test was specifically applied to compare the accuracy metric between the two best-performing models, in order to confirm the significance of their performance difference.
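The test itself amounts to a single SciPy call on the paired accuracy scores, as sketched below; the two accuracy lists are placeholders standing in for the ten per-split results, not the study's measured values.

```python
from scipy import stats

# Accuracy of each model over the same ten random data splits (placeholder values only).
acc_hierarchical = [0.927, 0.921, 0.930, 0.918, 0.925, 0.929, 0.923, 0.931, 0.920, 0.926]
acc_flat         = [0.931, 0.928, 0.934, 0.925, 0.930, 0.933, 0.929, 0.936, 0.927, 0.932]

t_stat, p_value = stats.ttest_rel(acc_hierarchical, acc_flat)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value below 0.05 indicates the accuracy difference is unlikely to have arisen by chance.
```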

8. Results

Before presenting the final results and comparing the two architectures, it is important to note that previous studies have demonstrated the superior performance of NASNet in binary classification tasks involving breast tumor data through transfer learning [48], particularly in distinguishing between benign and malignant cases. Based on these findings, NASNet was chosen as the model for the final layer of the two-level binary classification architecture. For the comparison of the three models within this architecture, the final layer remains fixed as NASNet, while the model responsible for the binary classification between tumorous and non-tumorous tissue varies across the configurations.
The testing results of the two architectures, along with the integrated models following the training process, are presented in Table 1. This table provides a detailed comparison, including the evaluation metrics discussed earlier along with the training time required for each model, and the time taken to generate a prediction for a single image, offering a comprehensive assessment of their performance and efficiency.
Following the detailed comparison presented in Table 1, the best-performing model combination from Architecture 1 and the most accurate flat three-class classification model were selected to generate confusion matrices for each, as presented in Figure 7. The confusion matrix, often referred to as the error matrix, is a well-established tool used to visualize the performance of a classification algorithm. It offers valuable insights into the types of errors made by the model, facilitating the calculation of various evaluation metrics [49].
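Confusion matrices such as those in Figure 7 can be rendered with scikit-learn and Seaborn, both listed in the software environment above; a brief illustrative sketch follows (the class order and the variable names y_true and y_pred are assumptions).

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_true and y_pred are the test labels and the selected model's predictions.
labels = ["benign", "malignant", "normal"]        # assumed class order
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.tight_layout()
plt.show()
```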
The performance metrics for the models in the filter layer, which distinguishes between irrelevant and relevant images, are shown in Table 2. This table includes evaluation metrics such as training and inference times for each model, similar to the data presented in Table 1.

9. Discussion

The results presented in Table 1 highlight that the "NASNet & NASNet" architecture is the most effective choice for the two-level binary classification strategy, achieving an accuracy of 92.7%. Although not explicitly shown in Table 1, this decision was primarily guided by the performance of the three algorithms when trained to distinguish between normal breast tissue and a combined category of malignant and benign tissue. In this binary task, NASNet exhibited exceptional performance, achieving an impressive accuracy of 97.6% and a sensitivity of 97.8%. Regarding the three-class classification architecture, the NASNet model achieved the highest accuracy of 93.1%, demonstrating its reliability in handling multi-class classification scenarios. Ten training sessions were conducted on distinct random data splits to ensure that the accuracy of the two models was not the result of chance. The models showed consistent performance across runs, and a paired t-test yielded a p-value of 0.04, which is below the conventional significance threshold of 0.05, indicating that the observed difference in accuracy between the models is statistically significant. In terms of sensitivity, the VGG-16 model performs best, although it has a significantly lower precision score. The two-level binary classification model ranks second in sensitivity, but the key deciding factor in determining the best architecture is the F1 score. With an impressive F1 score of 92.1%, the two-level binary classification model demonstrates consistent performance in both the sensitivity and precision metrics. Given its near-equal accuracy to the NASNet three-class model and its superior balance of sensitivity and precision, the two-level binary classification model emerges as the ideal choice for the final layer in the application. Additionally, with a sensitivity of 91.9% on malignant cases, it demonstrates strong performance in identifying critical instances, further solidifying its suitability for deployment.
In contrast, the task of classifying relevant and irrelevant images, as shown in Table 2, involves a simpler classification challenge compared to distinguishing between malignant, benign, and normal breast tissues, where the data tend to share more overlapping features. In this case, the models exhibit near-perfect performance, as the distinguishing characteristics of irrelevant and relevant images are much more straightforward and less nuanced than those of different tissue types. Among the models tested, NASNet stands out for its superior accuracy, correctly classifying all 1043 test samples.
This outstanding performance on both datasets emphasizes the versatility of NASNet and its ability to excel in tasks of varying complexity, ranging from more challenging classifications to simpler ones. Despite the fact that this model requires more time for both training and inference compared to the other two, its ability to provide responses within milliseconds, particularly given that training occurs offline, makes it a viable choice for deployment. However, in scenarios involving online training, where quicker training and response times are critical, it would be necessary to reassess the choice of architecture and select the model best suited to meet the specific performance demands of the application.

10. Conclusions and Future Work

This paper explored the potential benefits of a hierarchical two-tier binary classification architecture compared to the traditional flat three-class classification approach for breast tumor detection in ultrasound images. Although the results show that breaking the task into two stages, first distinguishing normal tissue from tumorous tissue, and then categorizing benign and malignant tumors, did not significantly improve accuracy, the sensitivity and F1 score metrics, combined with the hierarchical approach’s potential for improved modularity and scalability, suggest that this architecture may be more effective for the task. Among the models tested, NASNet demonstrated superior performance across almost all metrics, both in the tumor classification and the filter stage. The introduction of a filter stage, which effectively eliminates most irrelevant images before classification, plays a crucial role in enhancing the overall performance and user experience of the application developed, ensuring that only diagnostically relevant data proceeds to subsequent analysis.
Future work can build on these findings by exploring more diverse datasets and further optimizing each stage of the two-tier binary classification architecture, which is strongly believed to lead to improvements in accuracy as well as other performance metrics. Additionally, combining both architectures with other models in the same application through majority voting could enhance the final result by leveraging the strengths of each model. The results of this study provide valuable insights into the advantages of hierarchical architectures in medical image analysis, emphasizing their potential to improve diagnostic accuracy and efficiency in clinical settings, while also broadening their applicability to various medical and nonmedical imaging domains.

Author Contributions

Conceptualization, C.K., F.Z., S.K. and G.K.; methodology, C.K., F.Z., S.K. and G.K.; software, C.K.; validation, C.K., F.Z., S.K. and G.K.; formal analysis, C.K. and F.Z.; investigation, C.K. and S.K.; resources, C.K. and S.K.; data curation, C.K. and F.Z.; writing—original draft preparation, C.K., F.Z. and S.K.; writing—review and editing, C.K., F.Z., S.K. and G.K.; visualization, C.K. and G.K.; supervision, G.K.; project administration, G.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code used to support the findings of this study is available at: https://github.com/christopherkormpos/TumorNet. In this study, no new data was generated or analyzed.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BraNet: A mobile application for breast image classification
BUSI: Breast Ultrasound Images Dataset
CLAHE: Contrast Limited Adaptive Histogram Equalization
CNN: Convolutional Neural Network
FN: False Negative
FP: False Positive
GPU: Graphics Processing Unit
LCN: Local Classifier per Node
NAS: Neural Architecture Search
ResNet: Residual Neural Network
ResNet-101: Convolutional Neural Network model with a 101-layer depth
TL: Transfer Learning
TN: True Negative
TP: True Positive
VGG: Visual Geometry Group
VGG-16: Visual Geometry Group model with a 16-layer depth

References

  1. Tenajas, R.; Miraut, D.; Illana, C.I.; Alonso-Gonzalez, R.; Arias-Valcayo, F.; Herraiz, J.L. Recent advances in artificial intelligence-assisted ultrasound scanning. Appl. Sci. 2023, 13, 3693. [Google Scholar] [CrossRef]
  2. Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer statistics, 2018. CA Cancer J. Clin. 2018, 68, 7–30. [Google Scholar] [CrossRef] [PubMed]
  3. Roslidar, R.; Rahman, A.; Muharar, R.; Syahputra, M.R.; Arnia, F.; Syukri, M.; Pradhan, B.; Munadi, K. A review on recent progress in thermal imaging and deep learning approaches for breast cancer detection. IEEE Access 2020, 8, 116176–116194. [Google Scholar] [CrossRef]
  4. Micucci, M.; Iula, A. Recent advances in machine learning applied to ultrasound imaging. Electronics 2022, 11, 1800. [Google Scholar] [CrossRef]
  5. Le, E.; Wang, Y.; Huang, Y.; Hickman, S.; Gilbert, F. Artificial intelligence in breast imaging. Clin. Radiol. 2019, 74, 357–366. [Google Scholar] [CrossRef] [PubMed]
  6. Sushanki, S.; Bhandari, A.K.; Singh, A.K. A review on computational methods for breast cancer detection in ultrasound images using multi-image modalities. Arch. Comput. Methods Eng. 2024, 31, 1277–1296. [Google Scholar] [CrossRef]
  7. Ragab, M.; Albukhari, A.; Alyami, J.; Mansour, R.F. Ensemble deep-learning-enabled clinical decision support system for breast cancer diagnosis and classification on ultrasound images. Biology 2022, 11, 439. [Google Scholar] [CrossRef] [PubMed]
  8. Joshi, R.C.; Singh, D.; Tiwari, V.; Dutta, M.K. An efficient deep neural network based abnormality detection and multi-class breast tumor classification. Multimed. Tools Appl. 2022, 81, 13691–13711. [Google Scholar] [CrossRef]
  9. Raza, A.; Ullah, N.; Khan, J.A.; Assam, M.; Guzzo, A.; Aljuaid, H. DeepBreastCancerNet: A novel deep learning model for breast cancer detection using ultrasound images. Appl. Sci. 2023, 13, 2082. [Google Scholar] [CrossRef]
  10. Gheflati, B.; Rivaz, H. Vision transformers for classification of breast ultrasound images. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 480–483. [Google Scholar] [CrossRef]
  11. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
  12. Trojacanec, K.; Madjarov, G.; Loskovska, S.; Gjorgjevikj, D. Hierarchical classification architectures applied to Magnetic Resonance Images. In Proceedings of the ITI 2011, 33rd International Conference on Information Technology Interfaces, Cavtat, Croatia, 27–30 June 2011; pp. 501–506. [Google Scholar]
  13. Kowsari, K.; Sali, R.; Ehsan, L.; Adorno, W.; Ali, A.; Moore, S.; Amadi, B.; Kelly, P.; Syed, S.; Brown, D. Hmic: Hierarchical medical image classification, a deep learning approach. Information 2020, 11, 318. [Google Scholar] [CrossRef]
  14. Qamar, T.; Bawany, N.Z. Understanding the black-box: Towards interpretable and reliable deep learning models. PeerJ Comput. Sci. 2023, 9, e1629. [Google Scholar] [CrossRef] [PubMed]
  15. Mukhlif, A.A.; Al-Khateeb, B.; Mohammed, M.A. Incorporating a novel dual transfer learning approach for medical images. Sensors 2023, 23, 570. [Google Scholar] [CrossRef]
  16. Alzubaidi, L.; Fadhel, M.A.; Al-Shamma, O.; Zhang, J.; Santamaría, J.; Duan, Y.R.; Oleiwi, S. Towards a better understanding of transfer learning for medical imaging: A case study. Appl. Sci. 2020, 10, 4523. [Google Scholar] [CrossRef]
  17. Chowdhury, D.; Das, A.; Dey, A.; Sarkar, S.; Dwivedi, A.D.; Rao Mukkamala, R.; Murmu, L. ABCanDroid: A cloud integrated android app for noninvasive early breast cancer detection using transfer learning. Sensors 2022, 22, 832. [Google Scholar] [CrossRef] [PubMed]
  18. Jiménez-Gaona, Y.; Álvarez, M.J.R.; Castillo-Malla, D.; García-Jaen, S.; Carrión-Figueroa, D.; Corral-Domínguez, P.; Lakshminarayanan, V. BraNet: A mobil application for breast image classification based on deep learning algorithms. Med Biol. Eng. Comput. 2024, 62, 2737–2756. [Google Scholar] [CrossRef]
  19. An, G.; Akiba, M.; Omodaka, K.; Nakazawa, T.; Yokota, H. Hierarchical deep learning models using transfer learning for disease detection and classification based on small number of medical images. Sci. Rep. 2021, 11, 4250. [Google Scholar] [CrossRef]
  20. Silla, C.N.; Freitas, A.A. A survey of hierarchical classification across different application domains. Data Min. Knowl. Discov. 2011, 22, 31–72. [Google Scholar] [CrossRef]
  21. Pereira, R.M.; Costa, Y.M.; Silla, C.N., Jr. Handling imbalance in hierarchical classification problems using local classifiers approaches. Data Min. Knowl. Discov. 2021, 35, 1564–1621. [Google Scholar] [CrossRef]
  22. Xiao, Z.; Dellandrea, E.; Dou, W.; Chen, L. Automatic hierarchical classification of emotional speech. In Proceedings of the Ninth IEEE International Symposium on Multimedia Workshops (ISMW 2007), Taichung, Taiwan, 10–12 December 2007; pp. 291–296. [Google Scholar] [CrossRef]
  23. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of breast ultrasound images. Data Brief 2020, 28, 104863. [Google Scholar] [CrossRef]
  24. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Aly, F. Deep learning approaches for data augmentation and classification of breast masses using ultrasound images. Int. J. Adv. Comput. Sci. Appl 2019, 10, 1–11. [Google Scholar] [CrossRef]
  25. Shareef, B.M.; Xian, M.; Sun, S.; Vakanski, A.; Ding, J.; Ning, C.; Cheng, H.D. A Benchmark for Breast Ultrasound Image Classification. SSRN Electron. J. 2023; preprint. [Google Scholar] [CrossRef]
  26. Byra, M.; Galperin, M.; Ojeda-Fournier, H.; Olson, L.; O’Boyle, M.; Comstock, C.; Andre, M. Breast mass classification in sonography with transfer learning using a deep convolutional neural network and color conversion. Med. Phys. 2019, 46, 746–755. [Google Scholar] [CrossRef]
  27. Ayana, G.; Dese, K.; Choe, S.W. Transfer learning in breast cancer diagnoses via ultrasound imaging. Cancers 2021, 13, 738. [Google Scholar] [CrossRef] [PubMed]
  28. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 2018, 172, 1122–1131. [Google Scholar] [CrossRef]
  29. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 2015, 63, 1455–1462. [Google Scholar] [CrossRef]
  30. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef] [PubMed]
  31. Sun, Y.; Mogos, G. Impact of visual distortion on medical images. IAENG Int. J. Comput. Sci. 2022, 49, 36–45. [Google Scholar]
  32. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  33. Hijab, A.; Rushdi, M.A.; Gomaa, M.M.; Eldeib, A. Breast cancer classification in ultrasound images using transfer learning. In Proceedings of the 2019 Fifth International Conference on Advances in Biomedical Engineering (ICABME), Tripoli, Lebanon, 17–19 October 2019; pp. 1–4. [Google Scholar] [CrossRef]
  34. Hossain, A.A.; Nisha, J.K.; Johora, F. Breast cancer classification from ultrasound images using VGG16 model based transfer learning. Int. J. Image Graph. Signal Process. 2023, 13, 12. [Google Scholar] [CrossRef]
  35. Nguyen, D.T.; Alam, F.; Ofli, F.; Imran, M. Automatic image filtering on social networks using deep learning and perceptual hashing during crises. arXiv 2017, arXiv:1704.02602. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  38. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. arXiv 2013, arXiv:1211.5063. [Google Scholar] [CrossRef]
  39. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  40. Rao, K.S.; Terlapu, P.V.; Jayaram, D.; Raju, K.K.; Kumar, G.K.; Pemula, R.; Gopalachari, V.; Rakesh, S. Intelligent ultrasound imaging for enhanced breast cancer diagnosis: Ensemble transfer learning strategies. IEEE Access 2024, 12, 22243–22263. [Google Scholar] [CrossRef]
  41. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8697–8710. [Google Scholar] [CrossRef]
  42. Reguieg, F.Z.; Benblidia, N. Ultrasound breast tumoral classification by a new adaptive pre-trained convolutive neural networks for computer-aided diagnosis. Multimed. Tools Appl. 2024, 83, 46249–46282. [Google Scholar] [CrossRef]
  43. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87. [Google Scholar] [CrossRef]
  44. Pang, B.; Nijkamp, E.; Wu, Y.N. Deep learning with tensorflow: A review. J. Educ. Behav. Stat. 2020, 45, 227–248. [Google Scholar] [CrossRef]
  45. Li, H.; Rajbahadur, G.K.; Lin, D.; Bezemer, C.P.; Jiang, Z.M. Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting. IEEE Access 2024, 12, 70676–70689. [Google Scholar] [CrossRef]
  46. Kim, T.K. T test as a parametric statistic. Korean J. Anesthesiol. 2015, 68, 540–546. [Google Scholar] [CrossRef]
  47. Biau, D.J.; Jolles, B.M.; Porcher, R. P value and the theory of hypothesis testing: An explanation for new researchers. Clin. Orthop. Relat. Res. 2010, 468, 885–892. [Google Scholar] [CrossRef]
  48. Kormpos, C. Design and Implementation of a Web Application for Breast Tumors Classification through Convolutional Neural Network. Master’s Thesis, University of West Attica, Aigaleo, Greece, 2024. [Google Scholar] [CrossRef]
  49. Sarvamangala, D.; Kulkarni, R.V. Convolutional neural networks in medical image understanding: A survey. Evol. Intell. 2022, 15, 1–22. [Google Scholar] [CrossRef]
Figure 1. The two architecture types to be evaluated in this study. (a) Two-level binary hierarchical architecture. (b) Flat three-class classification model.
Figure 2. Representative ultrasound images from the Breast Ultrasound Images Dataset. From left to right: (a) benign tissue, (b) malignant tissue, (c) normal tissue.
Figure 3. The different types of distortions applied to the dataset to augment the available data for training. Augmentation techniques: (a) Lower Contrast image, (b) CLAHE image, (c) Blurred image and (d) Noisy image.
Figure 4. Performance of the three pretrained models from the first architecture on the binary classification task of detecting normal/tumorous images, demonstrated through (a) accuracy and (b) loss curves during training and validation.
Figure 5. Performance of the three pretrained models from the first architecture on the binary classification task of detecting malignant/benign images, demonstrated through (a) accuracy and (b) loss curves during training and validation.
Figure 6. Performance of the three pretrained models from the second architecture on the multiclass classification task of detecting malignant, benign, and normal breast tissue, demonstrated through (a) accuracy and (b) loss curves during training and validation.
Figure 7. Confusion matrices for the best-performing models from (a) Architecture 1 and (b) Architecture 2. Although the first architecture uses two binary classifications, its final output aligns with that of the three-class model.
Table 1. Performance metrics of the two architectures.
| Architecture | Model | Accuracy (%) | Sensitivity (%) | Precision (%) | F1 Score (%) | Training Time | Inference Time |
|---|---|---|---|---|---|---|---|
| Two-level binary classification | VGG-16 & NASNet | 89.1 | 89.1 | 91.1 | 90.1 | 11 min | 120 ms |
| | InceptionV3 & NASNet | 86.8 | 83.2 | 94.7 | 88.6 | 11 min | 130 ms |
| | NASNet & NASNet | 92.7 | 91.9 | 92.4 | 92.1 | 15 min | 200 ms |
| Flat three-class classification | VGG-16 | 88.7 | 97.1 | 80.7 | 88.1 | 5 min | 30 ms |
| | InceptionV3 | 87.3 | 83.8 | 93.5 | 88.4 | 5 min | 40 ms |
| | NASNet | 93.1 | 88.4 | 95.1 | 91.6 | 7 min | 80 ms |
Table 2. Performance metrics for the models in the filter layer.
| Model | Accuracy (%) | Sensitivity (%) | Precision (%) | F1 Score (%) | Training Time | Inference Time |
|---|---|---|---|---|---|---|
| VGG-16 | 99.8 | 99.9 | 99.9 | 99.9 | 4.3 min | 30 ms |
| InceptionV3 | 99.5 | 99.4 | 99.6 | 99.5 | 4.0 min | 40 ms |
| NASNet | 99.9 | 100 | 99.8 | 99.9 | 6.0 min | 80 ms |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

