1. Introduction
Breast cancer is the most prevalent malignancy among women and stands as one of the leading causes of cancer-related mortality worldwide [
1]. Therefore, the early detection of tumors and the accurate differentiation between benign and malignant masses using various medical imaging modalities are crucial for patient treatment [
2]. This is not only because malignant tumors require the prompt initiation of therapy, but also because accurate differentiation supports the efficient allocation of medical resources and enhances patients’ quality of life by avoiding unnecessary invasive procedures or excessive follow-up for benign tumors. Consequently, improving the accuracy of medical imaging-based tumor differentiation is a paramount objective in the clinical field.
Among the diverse medical imaging techniques, ultrasound imaging is a key diagnostic modality widely used for the detection and characterization of tumors due to its advantages of being non-invasive, free of radiation exposure, and capable of real-time image acquisition [
3]. Conventional analysis of ultrasound images involves individual radiology experts comprehensively evaluating various morphological and functional information—such as tumor size, shape, margin clarity, internal echo patterns, and backscattering properties—to predict the likelihood of malignancy.
However, this method of ultrasound analysis has limitations, as the quality of images and interpretation results can vary depending on the operator’s skill and experience. Furthermore, it suffers from a significant drawback of inter- and intra-observer variability, where interpretations of the same image can differ between readers or even for the same reader at different times [
4]. Accurate differential diagnosis is also often challenging due to the subtle overlap in sonographic findings between benign and malignant tumors, and technical factors like image noise or artifacts can further impede precise interpretation.
To overcome the limitations of conventional ultrasound interpretation and to enhance diagnostic accuracy, Computer-Aided Diagnosis (CADx) systems have garnered significant attention [
5]. CADx systems assist medical professionals in their diagnostic decisions by applying various computer algorithms to medical image analysis. While initially developed based on image feature extraction and machine learning classifiers, recent exponential advancements in machine learning have brought revolutionary changes to the field of medical image analysis. In particular, deep learning models such as Convolutional Neural Networks (CNNs) have demonstrated high-level analytical performance by automatically learning complex and subtle patterns within images, proving their potential in the field of ultrasound-based tumor diagnosis [
6]. Indeed, the OASBUD dataset, which is used in our study, has served as a benchmark for various computer vision tasks. Prior works have utilized it for tasks ranging from classification to detection. For instance, Byra et al. [
7] applied transfer learning for mass classification, while Wei et al. [
8], a study more relevant to our own, demonstrated lesion detection using a two-stage detector, Faster R-CNN. However, in the medical imaging domain, the relatively limited availability of datasets for machine learning imposes significant constraints on image classification and decision-making using deep learning [
9]. As a method to overcome this, curriculum learning, which mimics human learning processes, is gaining traction [
10]. This approach aims to use limited datasets efficiently by starting with relatively simple tasks or subjects and progressively increasing the difficulty and complexity. It has been shown to improve model training stability and achieve better generalization performance in various applications [
11]. Specifically, applying curriculum learning to medical image classification can lead to more efficient convergence during the weight training of CNNs with small datasets and can enhance the classification accuracy for diverse medical images that may include noise [
12]. A critical aspect of applying such a curriculum is determining the sequence and procedure of the learning process, reflecting the characteristics of the application’s images. For medical ultrasound B-mode images, in particular, the segmentation of regions and their relationship with surrounding areas are important variables.
Recently, curriculum learning (CL), a methodology that progressively increases the difficulty of training data, has been actively investigated to enhance performance in medical image analysis [
13]. Existing studies have defined learning difficulty based on various criteria. A representative approach is model-centric, where the model’s prediction uncertainty or classification error dynamically defines difficult samples [
13]. Other studies have adopted a data-centric approach, defining difficulty based on intrinsic data properties such as image noise, lesion size, or boundary ambiguity [
14]. Meanwhile, in other computer vision domains, task complexity itself has served as a difficulty metric. For instance, a ‘part-to-whole’ strategy, which learns individual objects before their structural relationships, has proven effective for solving complex problems [
15].
In parallel, among deep learning-based object detection models, the YOLO family, particularly YOLOv5, has garnered attention in the medical imaging field for its practical balance between speed and accuracy [
16]. YOLOv5 offers the advantages of efficient training even with relatively limited data and is easily scalable with various model sizes [
17]. Owing to this practicality, it is being utilized to detect lesions in real-time from various medical images, such as those for breast and skin cancer, thereby improving the clinical diagnostic workflow. Most related studies tend to focus on the ‘application’ of the model to specific medical datasets or the ‘optimization’ of the model architecture, for instance, by creating lightweight versions for specific tasks [
9].
However, a review of existing literature reveals a distinct research gap at the intersection of these two domains. First, previous CL studies in medical imaging have largely overlooked a systematic difficulty definition that considers both the fine-grained details of a lesion and its broader context with surrounding tissues—a crucial aspect in ultrasound imaging. Second, while YOLOv5 has been widely applied, most works have adopted standard training protocols; innovations in the training framework itself to enhance efficiency and stability, especially in data-constrained environments, have been less explored.
In this study, we designed a curriculum learning strategy using hierarchical zoomed-in images to improve tumor detection accuracy with a YOLO object detection model, which is suitable for real-time application. Generally, applying a curriculum learning strategy requires pre-establishing a difficulty criterion for the training data and then systematically dividing the data into easier and more difficult stages. In this research, we defined the training difficulty based on the ratio of the tumor area to the background area in each B-mode image. Based on the tumor’s bounding box in the full image, we sequentially cropped regions by expanding the width and height by factors of 2, 3, and 4, respectively, creating a total of four versions (zoom2, zoom3, zoom4, and full images). Through this cropping process, up to three additional hierarchical zoomed-in images were generated for each B-mode image.
For the prepared dataset, we implemented the sequential learning strategy into our curriculum, starting from the visually least difficult zoom2 images, progressively moving to zoom3, zoom4, and, finally, the full images, which contain broader contextual information. At each stage, the model weights with the best object detection performance were transferred as the initial weights for the next stage. This approach guides the model to first learn the morphological features of the tumor and then gradually integrate background information. The finally trained deep learning model’s object detection performance was ultimately analyzed using a test dataset that was not involved in the training process.
The remainder of this paper is organized as follows. First, we describe the deep learning model for object detection and the dataset used, and then we detail the implementation of our proposed framework for detection applying curriculum learning. Subsequently, we present the experimental results and a quantitative performance analysis of the proposed model. Finally, we discuss the interpretation and limitations of our findings and suggest future research directions.
2. Materials and Methods
2.1. Deep Learning-Based Object Detection Model: YOLO (You Only Look Once)
Deep learning models can autonomously learn hierarchical features from large-scale data, overcoming the limitations of traditional methodologies based on hand-crafted feature engineering. In specific diagnostic tasks, they have demonstrated performance approaching or even surpassing that of human experts. Their success has been prominent in various computer vision tasks such as image recognition, classification, and segmentation, and this has rapidly extended into diverse fields of medical image analysis. In ultrasound imaging analysis, deep learning models are also being actively researched for a range of applications including tumor classification, detection, and segmentation [
6].
Object detection, which is frequently utilized in medical imaging applications, differs from image classification—a method that simply predicts a benign or malignant label for the entire image. Instead, it performs the dual task of identifying the location of an object of interest (in this case, a tumor) with a bounding box while simultaneously classifying the detected object’s class (benign or malignant). Therefore, applying an object detection model to ultrasound tumor analysis explicitly provides spatial information, such as the clarity of the tumor’s boundary and its relationship with surrounding tissues. This allows the model to leverage not only the intrinsic features of the tumor but also its interaction with the surrounding area when determining its characteristics. Furthermore, it acts as an explicit ‘attention’ mechanism, focusing feature extraction and classification on the detected tumor region. This can enhance diagnostic performance by minimizing the influence of background noise or irrelevant surrounding organs present in the overall image. It also improves the clinical interpretability of the applied deep learning model by clearly visualizing the area on which the model based its decision, allowing medical professionals to more intuitively understand and assess the reliability of the model’s predictions [
5].
YOLO (You Only Look Once) is a representative family of deep learning-based object detection models developed for real-time object detection [
16]. YOLO employs a single-stage detector architecture that passes an input image through the neural network only once to simultaneously predict the object’s location and class. This design boasts significantly faster processing speeds compared to traditional two-stage detectors (e.g., the R-CNN series) [
16]. Indeed, one study applied the Faster R-CNN model to the same OASBUD dataset used in our research for breast lesion detection [
8]. This speed and efficiency enhance its potential for real-time clinical applications.
In this study, we employed a curriculum learning strategy that incorporates the training difficulty of B-mode ultrasound images to enhance the efficiency of breast cancer tumor detection, using the YOLOv5 object detection model. Among the YOLO series, YOLOv5 offers a favorable balance between detection accuracy and inference speed, along with improved stability and ease of use compared to earlier versions [
17]. These characteristics make it particularly suitable for applications in medical imaging, where both high accuracy and real-time performance are critical requirements [
17].
Subsequent iterations of the YOLO architecture—namely YOLOv6, YOLOv7, and YOLOv8—have focused on enhancing the backbone architecture, which is responsible for feature extraction. For example, YOLOv6 integrates the EfficientRep module, YOLOv7 introduces the E-ELAN module, and YOLOv8 adopts the C2f structure [
18,
19,
20]. While these architectural advancements have shown improved performance on large-scale, general-purpose datasets such as COCO, their complexity may not offer the same advantages for domain-specific datasets. In medical imaging tasks, such increased architectural complexity can lead to overfitting or introduce unnecessary computational overhead.
This trend is evident in the substantial rise in computational requirements. For instance, YOLOv5s contains approximately 7.2 million parameters and requires 16.5 GFLOPs, while YOLOv8s demands 11.2 million parameters and 28.6 GFLOPs [
20]. This corresponds to an increase of approximately 55% in parameters and 73% in computational cost. However, these increases do not necessarily result in proportional improvements in detection performance on specialized datasets. In fact, several comparative studies have demonstrated that YOLOv5 can outperform or match its successors in terms of efficiency and accuracy in specific tasks [
20].
Therefore, YOLOv5 remains a highly viable option for ultrasound image analysis, offering sufficient robustness and significantly faster inference speed. These advantages enhance its practicality for clinical applications, particularly in real-time diagnostic settings. The general architecture of the YOLO model is illustrated in
Figure 1.
2.2. OASBUD Dataset
OASBUD (Open Access Series of Breast Ultrasonic Data) is a publicly available ultrasound image dataset [
21] for research on breast tumor diagnosis and classification. This dataset consists of data from 100 breast lesions obtained from a total of 100 patients between November 2013 and October 2015, comprising 52 malignant and 48 benign tumors.
The ultrasound data were acquired using an Ultrasonix SonixTouch Research ultrasound scanner with an L14-5/38 linear array transducer at a center frequency of 10 MHz and were digitized at a sampling frequency of 40 MHz. For each lesion, two orthogonal scans (vertical and horizontal) are provided, with each scan consisting of 510 RF echo lines. The dataset also includes mask information with the Region of Interest (ROI) for each tumor manually delineated by an expert radiologist, making it suitable for lesion segmentation studies. All malignant lesions were pathologically confirmed through core needle biopsy.
In addition, the OASBUD dataset includes BI-RADS (Breast Imaging-Reporting and Data System) classification information for each lesion, providing crucial clinical evidence for researchers developing and evaluating breast cancer diagnosis and classification models. Because of these features, the OASBUD dataset is widely utilized in various breast ultrasound image analysis fields, including Quantitative Ultrasound (QUS) research, ultrasound image processing algorithm development, and Computer-Aided Diagnosis (CAD) system construction.
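Because OASBUD provides raw RF echo data rather than display-ready images, B-mode images must first be reconstructed before any image-based analysis. The following minimal Python sketch illustrates one standard envelope-detection and log-compression pipeline; the function name, the 50 dB dynamic range, and the omission of any axial decimation or scan conversion are illustrative assumptions rather than a description of the exact preprocessing used in this work.

```python
import numpy as np
from scipy.signal import hilbert

def rf_to_bmode(rf, dynamic_range_db=50.0):
    """Convert a 2-D RF frame (depth samples x scan lines) to a log-compressed
    B-mode image. Envelope detection uses the Hilbert transform along the axial
    axis, followed by normalization, log compression, and clipping to a fixed
    dynamic range (50 dB here, an illustrative choice)."""
    envelope = np.abs(hilbert(rf, axis=0))           # analytic-signal envelope per scan line
    envelope /= envelope.max() + 1e-12               # normalize to [0, 1]
    bmode_db = 20.0 * np.log10(envelope + 1e-12)     # log compression in dB
    bmode_db = np.clip(bmode_db, -dynamic_range_db, 0.0)
    # Map the [-dynamic_range_db, 0] dB range to 8-bit grayscale for storage/training.
    return ((bmode_db + dynamic_range_db) / dynamic_range_db * 255.0).astype(np.uint8)
```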
2.3. Curriculum Learning Considering the Training Difficulty of Ultrasound Images
In object detection applications using deep learning algorithms, curriculum learning—a strategy that proceeds from relatively simple examples to progressively more complex ones—generally accelerates model convergence and demonstrates faster and more stable high-level performance than standard training methods [
10]. This learning framework helps the deep learning network find a better local minimum, thereby improving generalization performance [
22]. It is particularly effective for complex datasets where traditional methods struggle or for tasks that must be trained on a relatively small amount of data efficiently [
12].
However, directly applying this curriculum learning strategy to ultrasound imaging presents several challenges. First, the size and shape of tumors within B-mode images are highly variable, and inherent image distortions such as speckle noise and artifacts like shadowing or enhancement are present throughout the ultrasound images [
3]. Furthermore, acquiring large-scale datasets with high-quality annotations is difficult. Most importantly, there are no established criteria for defining the difficulty of B-mode images for the application of a curriculum learning strategy. Therefore, in this study, we determined the learning difficulty based on the ratio of areas between the tumor and the background within the B-mode image. This strategy allows the deep learning model to first learn the shape and texture characteristics of the tumor itself and subsequently utilize contextual information such as its position and relationship with surrounding tissues.
In this study, based on the ratio of the tumor area to the background in B-mode images, we created an ‘N-fold Expanded Bounding Box’ by expanding the width and height of the tumor’s bounding box by a factor of N (where N = 2, 3, and 4 in this study). We then cropped these areas to generate a total of four versions of one B-mode image: zoom2, zoom3, zoom4, and full images. During this process, expanded regions that extended beyond the original image boundaries were omitted to prevent unnecessary distortion of information. Through this cropping method, up to three additional hierarchically zoomed-in images are generated from a single B-mode image.
Figure 2 shows an original B-mode image and the three additional zoomed-in images based on the cropping ratios.
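For illustration, a minimal Python sketch of this hierarchical cropping is given below; the function and variable names are ours, and zoom levels whose expanded box exceeds the image boundaries are skipped, consistent with the omission rule described above. When exporting YOLO-format labels, the tumor’s bounding-box coordinates would additionally have to be re-expressed relative to each crop.

```python
def expand_and_crop(image, bbox, factors=(2, 3, 4)):
    """Generate hierarchical zoomed-in crops around a tumor bounding box.

    image : H x W (or H x W x C) array containing the full B-mode image.
    bbox  : (x_min, y_min, x_max, y_max) of the tumor in pixel coordinates.
    For each factor N, the box width and height are expanded N-fold around the
    box center; crops whose expanded box leaves the image are omitted."""
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    bw, bh = x_max - x_min, y_max - y_min

    crops = {"full": image}                      # the original image is always kept
    for n in factors:
        ex0, ex1 = cx - n * bw / 2.0, cx + n * bw / 2.0
        ey0, ey1 = cy - n * bh / 2.0, cy + n * bh / 2.0
        if ex0 < 0 or ey0 < 0 or ex1 > w or ey1 > h:
            continue                             # expanded box exceeds the image; skip this zoom level
        crops[f"zoom{n}"] = image[int(ey0):int(ey1), int(ex0):int(ex1)]
    return crops
```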
2.4. Preprocessing for the Size Normalization of an Input Image
Most deep learning models, including YOLO, require a fixed-size input image. In the case of the YOLOv5 model, if an input image is smaller than the default 640 × 640 resolution, it is normalized through interpolation, and if larger, the model divides it into 640 × 640 segments to perform object detection. Applying this detection method directly to ultrasound images has a drawback: it can reduce training efficiency because a segmented image might contain only a part of a tumor or no tumor at all. Therefore, in this study, we preprocessed all images from every stage of the curriculum (zoom2, zoom3, zoom4, and full images) by consistently normalizing them into the default input resolution of 640 × 640. This ensures that the object detection algorithm is applied only once to each image.
During this preprocessing, if the aspect ratio of the image to be resized deviates significantly from 1:1, the resized image may suffer from non-isotropic compression, which can cause distortion of critical pattern information such as the tumor’s size, shape, or texture. To address this, our study employed the following preprocessing steps to preserve the features of the original image as much as possible within the hierarchically zoomed-in images, thereby enhancing object detection performance. The resizing steps applied in this research are as follows:
To normalize all images to a fixed 640 × 640 input size while minimizing distortion, a consistent two-step process was applied, regardless of the original image dimensions.
The image was resized so that its longest dimension became 640 pixels, while preserving the original aspect ratio.
Zero-padding was applied to the shorter dimension to create the final 640 × 640 pixel image.
This preprocessing method ensures compatibility with the network architecture of object detection models when applied to ultrasound images. YOLOv5 generates a feature map with a stride of 32 through five stages of downsampling. When the input resolution is a multiple of 32, such as 640, the accumulation of misalignment errors caused by padding can be prevented [
23]. This maintains proper alignment with anchor ratios and the receptive field, which can particularly improve the accurate detection performance for small tumors.
Furthermore, unifying the input resolution to 640 × 640 significantly enhances the model’s training stability by normalizing the data distribution. If image resolutions are irregular, a tumor of the same physical size can have different pixel representations, increasing the learning burden on the model. In contrast, a normalized resolution maintains consistency in object-to-background ratios and spatial frequency distributions. This helps to stabilize the variance in batch normalization layers and mitigate internal covariate shift, which contributes to preventing information loss for small objects [
23].
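A minimal sketch of this two-step normalization is shown below: an aspect-preserving resize so that the longer side equals 640 pixels, followed by zero-padding of the shorter side. The use of OpenCV, bilinear interpolation, and symmetric padding placement are our assumptions.

```python
import cv2

def resize_and_pad(image, target=640):
    """Resize so the longer side equals `target` while preserving the aspect
    ratio, then zero-pad the shorter side to produce a square target x target
    input. The returned scale and offsets are needed to remap bounding boxes."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    pad_h, pad_w = target - new_h, target - new_w
    top, left = pad_h // 2, pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                                cv2.BORDER_CONSTANT, value=0)
    return padded, scale, (left, top)
```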
2.5. Framework of the Curriculum Learning-Driven YOLO
The model is trained using a curriculum learning strategy designed to accommodate the varying difficulty of ultrasound images. This approach follows an ‘easy-to-hard’ or ‘detail-to-context’ progression [
15]. Initially, the model learns from closely cropped images where the tumor is prominent, allowing it to first master local features such as fine texture and echo patterns. In subsequent stages, it is trained on images with a progressively wider field of view, enabling it to learn contextual information like the tumor’s boundary, shape, and its relationship with surrounding tissues.
The training is divided into four sequential stages. At the conclusion of each stage, the network weights that achieve the highest detection performance (saved as best.pt) are transferred to initialize the model for the next. This strategy ensures that each subsequent phase builds upon the most optimized feature representations learned previously, facilitating a progressive refinement of the model’s capabilities.
The training curriculum is structured as follows:
Stage 1: Initial training begins on the zoom2 image dataset, where images are cropped to twice the size of the tumor.
Stage 2: Further training is conducted on the zoom3 image dataset, cropped to three times the tumor size.
Stage 3: Additional training is performed on the zoom4 image dataset, cropped to four times the tumor size.
Stage 4: Final training is completed using the entire original ultrasound images.
This structured curriculum helps the model more robustly learn the features critical for malignant/benign classification. By gradually introducing complexity, it prevents the model from being confused by irrelevant background information in the early stages, which is a significant advantage over the standard approach of training on all image types randomly. The entire framework of the proposed curriculum learning-driven YOLO network is shown in
Figure 3.
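A simplified sketch of this four-stage schedule is given below, assuming the standard YOLOv5 repository and its command-line options (--img, --epochs, --data, --weights, --optimizer, --name). The dataset YAML file names, the run names, the default runs/train output directory, and the use of COCO-pretrained yolov5s.pt to initialize Stage 1 are assumptions for illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical per-stage dataset definitions, ordered from easy to hard.
STAGES = ["zoom2.yaml", "zoom3.yaml", "zoom4.yaml", "full.yaml"]

def train_curriculum(epochs_per_stage=70, img_size=640):
    """Run the four curriculum stages sequentially, passing the best weights
    (best.pt) of each stage as the initialization of the next stage."""
    weights = "yolov5s.pt"                       # assumed starting point for Stage 1
    for i, data_yaml in enumerate(STAGES, start=1):
        run_name = f"cl_stage{i}"
        subprocess.run([
            "python", "train.py",
            "--img", str(img_size),
            "--epochs", str(epochs_per_stage),
            "--data", data_yaml,
            "--weights", weights,
            "--optimizer", "SGD",
            "--name", run_name,
        ], check=True)
        # The best checkpoint of this stage initializes the next stage.
        weights = str(Path("runs/train") / run_name / "weights" / "best.pt")
    return weights

if __name__ == "__main__":
    final_weights = train_curriculum(epochs_per_stage=70)
```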
2.6. Performance Metrics
To quantitatively evaluate the performance of the object detection model proposed in this study, precision, recall, and mAP50 (mean Average Precision at IoU of 0.5) were used as key metrics [
24]. These metrics are calculated based on how well the model’s predictions align with the actual ground truth. Based on the Intersection over Union (IoU) threshold between a predicted bounding box and a ground truth box, prediction outcomes are classified as True Positive (TP), False Positive (FP), or False Negative (FN); True Negatives (TN) are not explicitly counted in object detection, since the background contains an unbounded number of candidate negative regions.
First, Precision is a metric that indicates the proportion of actual objects among the results predicted as objects by the model. It signifies the accuracy of the predictions and increases as the number of false positives (FP) decreases. Conversely, Recall represents the proportion of objects correctly detected by the model out of all actual existing objects. It indicates how comprehensively the model finds objects, and its value increases as the number of false negatives (FN) decreases. Generally, Precision and Recall are in a trade-off relationship, making it important to evaluate both metrics comprehensively.
The mAP50 metric, which sets the Intersection over Union (IoU) threshold to 0.50, is a core performance indicator that comprehensively represents how accurately and completely the model detects various objects. It is based on Average Precision (AP), which is calculated as the area under the Precision-Recall Curve across varying confidence thresholds of the detection model. The mAP50 performance metric quantitatively assesses the model’s ability to both ‘classify’ the tumor’s class and accurately ‘localize’ its position within the ultrasound image. The formulas for each performance metric are shown below.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{mAP50} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of classes and $AP_i$ is the average precision of class $i$, computed as the area under the Precision-Recall curve at an IoU threshold of 0.50.
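To make the counting of TP, FP, and FN concrete, the following sketch matches the predictions of a single image (and class) to the ground-truth boxes at an IoU threshold of 0.5. The full mAP50 computation, which additionally sweeps confidence thresholds to build the Precision-Recall curve and averages the resulting AP over classes, is omitted for brevity; the function names are ours.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedily match predictions (ideally sorted by descending confidence) to
    ground truths; each ground truth may be matched once. Unmatched predictions
    count as FP, unmatched ground truths as FN."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(pred_boxes) - tp, len(gt_boxes) - tp
    return tp / (tp + fp + 1e-12), tp / (tp + fn + 1e-12)   # precision, recall
```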
2.7. Implementation Details
To clearly validate the effectiveness of the proposed curriculum learning framework, we adopted the standard and widely validated default hyperparameters for the baseline model, YOLOv5s. This approach was taken to exclude any confounding performance improvements that could arise from hyperparameter tuning and to ensure a fair evaluation focused solely on the impact of the training strategy on model performance. The details of each key hyperparameter are as follows:
Optimizer: Stochastic Gradient Descent (SGD) was employed as the optimizer. SGD is known to exhibit better generalization performance compared to more recent optimizers. Preventing overfitting and ensuring stable training are critical, especially when working with limited-scale medical image datasets such as OASBUD. Therefore, adopting SGD, with its proven stability, was a rational choice.
Loss Function: The default composite loss function of YOLOv5 was used without modification. This function consists of the Complete IoU (CIoU) loss for bounding box regression, and the Binary Cross-Entropy (BCE) with Logits loss for objectness and classification. This combination is designed to holistically optimize the three core components of object detection and has been well-established for its effectiveness in the YOLO family of models.
Epochs: The total number of epochs was set to 200 for the standard training (Trad_YOLO), and 70 epochs per stage for the curriculum learning model (CL_YOLO_70). These values were determined empirically based on monitoring the training and validation loss curves during preliminary experiments. The 200-epoch mark was identified as the point where the loss converged sufficiently, allowing the model to learn the data features while minimizing the risk of overfitting. The 70-epoch setting was intentionally chosen—representing only 35% of the total training epochs—to demonstrate the efficiency of the curriculum learning approach.
All experiments were conducted based on the environments summarized in
Table 1. The dataset was partitioned into training (65%), validation (15%), and test (20%) sets for each run.
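The repeated, seed-dependent partitioning can be sketched as follows. Splitting at the lesion level, so that both orthogonal scans of a lesion fall into the same subset, and the use of consecutive integer seeds are our assumptions; the experiments reported below were repeated over 20 different seeds.

```python
import random

def split_lesions(lesion_ids, seed, train_frac=0.65, val_frac=0.15):
    """Seeded 65/15/20 split of lesion identifiers into train/val/test sets."""
    ids = list(lesion_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(round(n * train_frac)), int(round(n * val_frac))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Example: the 100 OASBUD lesions, repeated over 20 seeds as in the experiments.
splits = [split_lesions(range(100), seed) for seed in range(20)]
```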
The total training time for a standard 200-epoch model was approximately 3 min and 54 s. For the curriculum learning model, the total training time was approximately 15 min and 37 s (200 epochs per stage) or 5 min and 49 s (70 epochs per stage). The final model achieved an average inference speed of 13.30 ms per image.
3. Results
In this study, we implemented a curriculum learning strategy where the training difficulty of ultrasound images was hierarchically stratified based on the ratio of tumor-to-background areas. The deep learning model was trained by progressively increasing the difficulty from easy to hard samples. The training data were arranged as zoom2, zoom3, and zoom4 images based on visual complexity and object size. The initial stage with zoom2 images, characterized by consistent object locations, clear boundaries, and limited background information, was designed for the model to rapidly capture key features. In subsequent learning stages, the background noise, resolution, and object variations were configured to increase gradually. This difficulty sequence was determined through preliminary experiments and data distribution analysis to minimize arbitrariness.
Figure 4 compares the performance across the hierarchically zoomed-in image datasets.
As shown in
Figure 4, training on the zoom2 dataset alone yielded the highest scores across all metrics, with performance tending to decrease sequentially for zoom3 and zoom4 images. This implies that as complexity increases, the model requires greater representational capacity, which aligns with the hypothesis that the curriculum first learns core representations from easier stages before extending generalization to more difficult ones.
An analysis of the performance metrics related to the input image resizing preprocessing is shown in
Figure 5. It revealed that the model trained on images resized to 640 × 640 outperformed the model trained with original-resolution images of various sizes. The precision of both models was nearly identical at approximately 0.928, indicating a similar level of prediction accuracy. However, the critical difference emerged in recall, which measures how effectively the model identifies actual lesions without missing them. The model trained on original images had a recall of 0.829, whereas the preprocessed model with 640 × 640 resizing achieved a recall of 0.848, a 2.3% increase. For the mAP50 metric, the original image model scored an average of 0.888, while the resized model reached 0.896.
What is more noteworthy is that beyond the improvement in average performance, the standard deviation of individual results also decreased significantly, leading to a substantial enhancement in model stability. The standard deviation of recall dropped from 0.077 in the original model to 0.058 in the resized model, a reduction of approximately 24.9%. Similarly, the standard deviation of the comprehensive mAP50 metric decreased from 0.054 to 0.043, a reduction of about 21.4%. This numerically demonstrates that the preprocessed model with resizing provides far more consistent and reliable predictions compared to using various input sizes. Although the standard deviation of precision showed a slight increase from 0.049 to 0.051, the overall reliability of the model is clearly enhanced, given the markedly improved stability of the more critical metrics, recall and mAP50.
In most medical fields, including ultrasound diagnosis for breast cancer, recall is more critical than any other metric because low recall signifies a higher risk of False Negative errors—classifying an actual cancerous lesion as ‘normal’ [
25]. Such a misdiagnosis can pose a fatal threat to a patient’s life by delaying treatment. In contrast, even if precision is slightly lower, a False Positive (wrongly predicting normal tissue as cancerous) can be re-evaluated through follow-up examinations. Therefore, the top priority for a medical AI model is to not miss a single cancer, i.e., to secure a high recall rate.
Consequently, the high recall and dramatically reduced deviation of the model trained with resized images represent a significant advantage for medical ultrasound tumor detection. It is expected to reduce the risk of clinically critical misdiagnoses and serve as a more stable and reliable diagnostic support tool overall.
Detailed performance metrics for each condition are summarized in
Table 2. While Precision remained largely unchanged, the mean values of Recall and mAP50 showed slight improvements. More notably, the standard deviations of both metrics decreased, indicating that the model with resizing applied demonstrates more stable and consistent performance.
To evaluate the proposed curriculum learning framework combined with the resizing preprocessing step, we compared the object detection performance of a traditionally trained YOLO model (Trad_YOLO) and our curriculum learning-driven YOLO model (CL_YOLO) on the same network architecture, as shown in
Figure 6. The simulation was repeated using 20 different seeds to split the training and test datasets. Trad_YOLO was trained for a total of 200 epochs, while CL_YOLO was trained in two scenarios: 70 epochs per stage and 200 epochs per stage. We then compared the median and dispersion metrics (standard deviation, IQR) of mAP50, precision, recall, and F1-score.
The simulation results using the OASBUD dataset suggested that curriculum learning offers the potential for rapid performance convergence in resource-limited environments. With just 70 epochs of training (only 35% of the total 200 epochs), the CL_YOLO model achieved an average mAP50 of 0.874, reaching 97.2% of the final average mAP50 of the traditional training (0.899). This demonstrates a significantly higher efficiency per epoch compared to the standard training method and highlights the clear temporal advantage that curriculum learning can offer in rapid prototyping or initial model development stages.
Furthermore, when the training was extended to 200 epochs, the curriculum learning-driven YOLO model demonstrated comprehensive performance on par with the standard model. While the average mAP50 for the curriculum model was 0.906, a negligible difference of only 0.001 from the standard model’s 0.905, qualitative advantages unique to curriculum learning were observed in detailed metrics. In terms of average precision and recall, the proposed curriculum learning framework achieves values of 0.946 and 0.851, respectively, compared to 0.927 and 0.843 obtained using the traditional approach. Furthermore, the number of seeds yielding high-performance, high-reliability outcomes—defined as cases where both precision and recall exceed 0.9—increased by 50% under the curriculum learning model, indicating an enhanced capacity to achieve superior performance under optimal conditions.
The quantitative comparison of the three training methods can be found in
Table 3.
4. Discussion
This study proposed and validated a curriculum learning (CL) based training framework to enhance the training stability and efficiency of the YOLOv5 model in a limited medical imaging dataset environment. Experimental results demonstrated that the proposed CL framework not only achieved a final detection performance (mAP@0.5: 0.905) comparable to the traditional training method but also showed significant potential to improve training efficiency, reaching 97.2% of the final performance with only 35% of the total training epochs.
To contextualize the contributions of our work, a comparison with prior studies that utilized the same OASBUD dataset is presented in
Table 4. As these studies address different primary tasks, such as classification or detection, and employ distinct evaluation metrics, a direct numerical comparison is challenging.
For example, the work by Byra et al. focused on a classification task, using a transfer learning approach to distinguish between benign and malignant masses, achieving an AUC of 0.936 [
7]. The study most relevant to ours is by Wei et al., which addresses the detection task using a two-stage Faster R-CNN model. While their approach demonstrates strong performance, our framework, built on a single-stage YOLOv5, achieves a competitive mAP@0.5 of 0.905 and offers a significant advantage in inference speed, enhancing its suitability for real-time clinical applications [
8].
However, the most significant contribution of this study lies not in the final performance metrics themselves, but in the methodological innovation of ‘how’ that performance was achieved. The proposed curriculum learning framework reached 97.2% of the final performance using only 35% of the total training epochs. While prior studies have focused on the ‘application’ of different models to the OASBUD dataset, our work innovates the fundamental ‘training method’ itself to enhance efficiency and stability in a data-constrained environment. This marks a clear and practical distinction from previous works.
Despite these contributions, a rigorous evaluation of the academic and clinical utility of our framework requires a thorough examination of its key limitations and future research directions.
First, the empirical validation of this study is inherently limited as it was conducted exclusively on a small-scale public dataset (OASBUD). This constraint in dataset size presents a significant vulnerability to overfitting, which could impair the model’s generalization performance on diverse data encountered in real-world clinical settings. Therefore, to corroborate the robustness and clinical applicability of the proposed methodology, subsequent validation is imperative. This should involve leveraging large-scale, multi-institutional datasets composed of data from heterogeneous equipment and protocols, or participating in relevant public challenges.
Second, the current stage-transition mechanism within the curriculum is a static structure that relies on a predefined, fixed number of epochs. This rigidity fails to adequately reflect the dynamic nature of the training process and can lead to inefficiency, such as continuing redundant computations even after the model has mastered the features of a particular difficulty stage. As a solution, we propose the adoption of an adaptive stage-transitioning mechanism. This could be inspired by Early Stopping principles or involve performance-triggered transitions that dynamically advance stages based on validation metrics. Such a mechanism would maximize training efficiency and facilitate the exploration of an optimal learning path tailored to the data’s characteristics.
Finally, moving beyond the technical validity of the proposed model, several practical challenges must be addressed for its successful clinical adoption. For instance, incorporating Explainable AI (XAI) techniques to provide visual evidence for the model’s decisions is crucial for building trust among clinicians. Concurrently, research into system integration is needed to ensure the developed AI can be seamlessly embedded into existing clinical workflows to enhance diagnostic efficiency. Ultimately, clinical validation, including ethical and regulatory approval for patient data utilization, will be the final gateway to the technology’s commercialization. While these tasks extend beyond the scope of this paper, they represent essential future research directions to maximize the clinical impact of the proposed technology.
Beyond the aforementioned methodological considerations, ensuring the clinical efficacy of the proposed framework necessitates addressing the following real-world challenges inherent to the medical imaging domain.
Heterogeneity of Data Acquisition Environments: The data utilized in this study were acquired from a single model of an ultrasound scanner. However, real-world clinical environments are characterized by extreme heterogeneity, with a coexistence of equipment from various manufacturers and different image processing pipelines. This ‘domain shift’ problem is a primary factor that can drastically degrade the performance of a model optimized for a single environment. Therefore, developing models that ensure consistent performance regardless of the equipment source is a pressing need.
Operator-Dependent Data Variability: Ultrasound imaging has an intrinsic characteristic where image quality varies significantly depending on the operator’s skill and understanding of the protocol. This inter- and intra-operator variability acts as a major source of noise, compromising data consistency. Consequently, to enhance model reliability, strategies are required such as integrating an upstream quality assurance module that quantitatively assesses and provides feedback on low-quality input images, or applying data augmentation techniques that simulate a wide range of image qualities.
Predictive Uncertainty for Atypical and Rare Cases: Similarly to most medical datasets, the data used in this study are predominantly composed of tumors with relatively typical morphological features. This introduces a risk of prediction failure for out-of-distribution (OOD) data, such as lesions with atypical patterns or rare tumor types. To minimize the possibility of such critical errors, it is crucial to not only expand databases of rare cases but also to introduce paradigms such as anomaly detection or one-shot/few-shot learning. These would enable the model to recognize its own predictive uncertainty and flag unseen types of lesions as ‘anomalies’ for expert review.
In conclusion, while this study presents meaningful results at the proof-of-concept stage, integrated research to address the multifaceted challenges discussed above is essential for the proposed technology to establish itself as a reliable clinical decision support tool. Through such efforts, we anticipate opening new horizons for AI-based medical diagnostic technologies.
5. Conclusions
In the field of medical ultrasound image analysis, deep learning-based object detection technology is still in its nascent stages of research. It is not intended to replace human experts but holds significant potential as a sophisticated decision-support tool for enhancing the accuracy and efficiency of clinical diagnosis. An AI model can reduce the risk of misdiagnosis by evaluating subtle features with consistent criteria, highlighting suspicious lesions or assisting in diagnoses that a human might miss. Furthermore, it contributes to improving diagnostic consistency by reducing subjective interpretation variability among sonographers.
To optimize ultrasound tumor detection performance, this study proposed a framework that combines a preprocessing step, which resizes B-mode images to a normalized size, with a curriculum learning strategy that considers training difficulty. This structure features a standardized preprocessing step to minimize distortion of the medical imaging data and a systematic learning sequence based on difficulty, defined by the tumor-to-background area ratio.
The experimental results showed that the proposed curriculum learning method demonstrated a comparable level of performance in overall metrics such as mAP50 when compared to the traditional training method. However, it showed the potential to significantly increase training efficiency in resource-limited environments by reaching 97.2% of the final model’s performance with only 35% of the total training epochs. Additionally, it exhibited some advantages in achieving a more balanced profile between precision and recall.
Nevertheless, several critical limitations must be considered when interpreting the contributions of this study. First, it was not clearly verified whether the performance improvements observed in this study are statistically significant. Furthermore, as the study was limited to a single, small-scale public dataset (OASBUD), there is a potential weakness in the model’s generalization performance. These factors make it difficult to assert that the proposed framework would exhibit the same effectiveness on diverse data from real-world clinical settings.
Therefore, to establish the practical clinical utility of the proposed framework, further validation using large-scale, multi-institutional datasets composed of data from heterogeneous equipment and protocols is essential. Future research should aim to move beyond the proof-of-concept presented in this study to develop a reliable clinical decision-support tool by including such statistical significance testing and evaluating robustness in various clinical environments.