5.1. Dataset
In this study, we used the DBTex challenge dataset [23], which is the only publicly available DBT image dataset. It contains 1000 breast tomosynthesis scans from 985 patients. However, it is important to note that not all images are fully annotated. Specifically, only 208 DBT images from 101 patients have been annotated with 223 tumor class annotations with location bounding boxes, which limits the number of usable annotated tumors. Furthermore, none of the images in the DBTex dataset have subjective quality score annotations; this absence of quality score labels is a significant constraint of the dataset for image quality assessment tasks.
For DBT quality assessment, we generated distorted images with corresponding objective quality scores following the synthetic data generation strategy discussed in Section 4.1. The detailed distribution of the resulting DBT quality assessment dataset is shown in Table 1. The DBT images of 50 patients were used to generate the synthetic image quality dataset, and the images were divided patient-wise into training and testing sets. It should be noted that we randomly selected a single DBT image from each patient. For each image, we generated a blurred image, two gamma-corrected images, and a speckle-noise image, so each of the 50 images yields four distorted versions, for a total of 250 images including the pristine originals.
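For concreteness, the following is a minimal sketch of such a distortion generator, assuming slices normalized to [0, 1]; the function name and all parameter values (blur sigma, gamma pair, noise variance) are illustrative placeholders rather than the settings of Section 4.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_distortions(slice_img, sigma=2.0, gammas=(0.6, 1.6), noise_var=0.05):
    """Create one blurred, two gamma-corrected, and one speckle-noise
    version of a DBT slice normalized to [0, 1]. All parameter values
    here are illustrative placeholders, not the paper's settings."""
    blurred = gaussian_filter(slice_img, sigma=sigma)
    gamma_low, gamma_high = (np.power(slice_img, g) for g in gammas)
    # Speckle noise: multiplicative Gaussian noise, clipped back to [0, 1].
    noise = np.random.normal(0.0, np.sqrt(noise_var), slice_img.shape)
    speckle = np.clip(slice_img * (1.0 + noise), 0.0, 1.0)
    return blurred, gamma_low, gamma_high, speckle
```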
For breast tumor malignancy prediction, in terms of ROI selection, the tumors are center-cropped and extracted from the annotated DBT images. The selected ROIs are then resized to identical dimensions to fit the input of the various deep learning-based radiomics extraction networks. Then, we divide the DBT images and the corresponding patches patient-wise into training and testing sets.
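A minimal sketch of this ROI extraction step is shown below, assuming bounding boxes in (x, y, width, height) form; the helper name and the target size of 224 pixels are placeholders, since the exact input dimensions depend on the chosen network.

```python
import cv2

def extract_roi(slice_img, box, out_size=224):
    """Center-crop a square patch around an annotated tumor bounding box
    and resize it to a fixed network input size (out_size is a placeholder)."""
    x, y, w, h = box                      # top-left corner plus width/height
    cx, cy = x + w // 2, y + h // 2       # box center
    half = max(w, h) // 2
    H, W = slice_img.shape[:2]
    # Clamp the crop window to the image bounds.
    x0, x1 = max(cx - half, 0), min(cx + half, W)
    y0, y1 = max(cy - half, 0), min(cy + half, H)
    patch = slice_img[y0:y1, x0:x1]
    return cv2.resize(patch, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```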
Table 2 summarizes the DBT dataset used in the breast tumor malignancy prediction task. The dataset is split patient-wise, with approximately 80% of the patches used for training and 20% for testing. Of the 223 tumor patches utilized in our study, 138 were categorized as benign and 85 as malignant. This class imbalance creates a significant bias toward the predominant class (benign), reducing the predictive capability of the proposed method and yielding insufficient predictions for the minority class (malignant); indeed, the number of benign tumors in the training set is approximately twice that of the malignant tumors. To address this challenge, we increased the number of tumor patches in the training data through multiple augmentation processes to balance the dataset, as sketched below. We doubled the number of malignant tumor patches by flipping each malignant patch in the training set jointly in the horizontal and vertical directions. This eventually resulted in 114 benign and 126 malignant patches, for a total of 240 tumor patches in the training set. In addition, we used 46 tumor patches for testing, consisting of 23 benign and 23 malignant patches.
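The balancing step could be implemented as follows, assuming patches stored as NumPy arrays with label 1 for malignant; the joint horizontal and vertical flip (a 180° rotation) is one way to obtain exactly one extra image per minority patch, matching the doubling described above.

```python
import numpy as np

def balance_by_flipping(patches, labels, minority_label=1):
    """Double the minority class by adding one flipped copy of each
    minority patch; flipping along both axes equals a 180-degree rotation."""
    augmented_patches, augmented_labels = list(patches), list(labels)
    for patch, label in zip(patches, labels):
        if label == minority_label:
            augmented_patches.append(np.flip(patch, axis=(0, 1)))  # h + v flip
            augmented_labels.append(label)
    return augmented_patches, augmented_labels
```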
5.3. Performance Evaluation of the Proposed Malignancy Prediction Approach
Indeed, the empirical evidence from studies [20,24,26], specifically within the context of tumor feature extraction from DBT images, underscores the strong performance of AlexNet as a feature extractor. Given its demonstrated ability to deliver remarkable results and achieve high classification performance, AlexNet is a natural choice as the baseline model for our task.
To demonstrate the effectiveness of the proposed method, Table 3 compares the baseline model, which does not incorporate DBT image quality-aware features, with the proposed quality-aware model. The comparison focuses on the classification of breast tumors using tumor descriptor features alone versus integrating quality-based features, highlighting the added value and performance enhancements achieved by incorporating quality-aware features in our model.
Comparing the baseline model with our proposed quality-aware model shows that our approach significantly enhances classification performance across metrics, including accuracy, precision, and F1-score. Notably, the proposed method achieves an accuracy of 78.26%, an 8% improvement over the baseline, and a precision of 88.24%, a 23% increase over the baseline. This substantial gain in precision underscores our model's proficiency in accurately identifying positive instances. While the baseline achieves a higher true positive rate, the proposed method excels in terms of the F1-score, indicating superior overall performance.
To obtain a more reliable estimate of the model's performance than a single train-test split provides, we also employed the k-fold cross-validation technique. This resampling procedure randomly divides the dataset's images into k groups, or folds, of approximately equal size; each fold in turn serves as the validation set while the model is trained on the remaining folds. It thereby assesses the performance of a predictive model and helps evaluate how well the model generalizes to unseen data.
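A minimal sketch of this evaluation protocol using scikit-learn is given below; the model factory and metric set are placeholders, not the paper's exact training code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def cross_validate(build_model, X, y, k=5, seed=0):
    """Evaluate a classifier with k-fold cross-validation. build_model is a
    placeholder factory returning a fresh model with fit/predict; X and y
    are NumPy arrays of samples and binary labels."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = {"accuracy": [], "precision": [], "recall": [], "f1": []}
    for train_idx, val_idx in kf.split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        scores["accuracy"].append(accuracy_score(y[val_idx], pred))
        scores["precision"].append(precision_score(y[val_idx], pred))
        scores["recall"].append(recall_score(y[val_idx], pred))
        scores["f1"].append(f1_score(y[val_idx], pred))
    # Report mean and standard deviation over the k folds for each metric.
    return {m: (np.mean(v), np.std(v)) for m, v in scores.items()}
```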
Figure 7 presents boxplots visualizing the distribution of evaluation metrics across k-fold cross-validation with k = 5 for both the baseline and proposed methods. Each boxplot summarizes the data using a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The red horizontal line within each box represents the median value, offering a quick comparison of central tendencies between the two methods for each metric.
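Such boxplots can be reproduced with a few lines of matplotlib, as sketched below for one metric at a time (a hypothetical helper, not the code used to produce Figure 7).

```python
import matplotlib.pyplot as plt

def plot_fold_metrics(baseline_scores, proposed_scores, metric="Accuracy"):
    """Side-by-side boxplots of per-fold scores; each box shows the
    five-number summary, with the median drawn as a red line."""
    fig, ax = plt.subplots()
    ax.boxplot([baseline_scores, proposed_scores], medianprops={"color": "red"})
    ax.set_xticklabels(["Baseline", "Proposed"])
    ax.set_ylabel(metric)
    plt.show()
```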
Notably, the proposed model surpasses the baseline in terms of accuracy, precision, and F1-score across the validation folds. As depicted in Figure 7, the minimum, maximum, and median values of the proposed model consistently exceed those of the baseline, indicating superior predictive capacity: the proposed model not only achieves higher accuracy on average but also performs better across the validation folds, from the lowest to the highest observed values. Regarding recall, the baseline exhibits a broader distribution, indicating a wider range of performance outcomes, whereas the proposed model shows a more concentrated distribution. Despite the baseline's higher maximum, the proposed model attains a higher median recall, suggesting that although the baseline's recall performance is variable, the proposed model consistently achieves superior recall rates.
In Figure 8, we present a visual representation of malignancy scores for four DBT images, comprising two benign and two malignant cases. The malignancy score ranges from 0 to 1 and serves as an indicator of tumor carcinogenicity: benign tumors are characterized by low scores approaching 0, while scores approaching 1 indicate a growing probability of malignancy.
As we can see from Figure 8, for the benign tumors, both the baseline and our proposed model correctly classified the upper image, but the proposed method achieved a lower malignancy score of 0.039 compared with the baseline. For the lower image, which shares a shape similarity with malignant tumors, the proposed model obtained a score of 0.44 < 0.5, correctly classifying the tumor, whereas the baseline model made an incorrect classification, assigning a score of 0.53 > 0.5. Similar results can be observed for the malignant tumor images: the proposed method obtained high malignancy scores of 0.81 and 0.77 for the two images, superior to the baseline model, and the baseline produced an incorrect result for the lower image, which resembles a benign tumor in appearance. This analysis underscores the improved classification capability of our proposed model, particularly in scenarios with borderline malignancy scores, enhancing its precision in distinguishing between benign and malignant tumor patterns.
Table 4 compares the proposed malignancy prediction method with existing breast tumor classification methods for DBT images. To conduct a fair comparative study, we trained and tested all methods using the same training and testing DBT datasets.
Table 4 demonstrates that the proposed method outperforms all other compared methods in terms of accuracy and precision. Thanks to the proposed dual-branch approach, quality-aware features help the classifier focus on the most relevant regions of the image and provide complementary information to the tumor descriptor branch, which discerns meaningful patterns associated with the tumor, improving classification accuracy.
As we can see, the proposed method outperforms the method presented in [26], which achieved competitive performance over the other deep learning-based methods, with an accuracy of 75% and an F1-score of 79%. Following [26], we used the INbreast mammogram dataset [46] to fine-tune the AlexNet model in the first stage of the method, and then used the DBTex training set to further fine-tune the model. A potential drawback of [26] is its requirement for a mammography dataset, which may not always be accessible for training in the initial stage. Additionally, the methodology proposed in [22] demonstrated commendable performance, achieving a competitive accuracy of 80.67% and specificity of 77.12%. However, it is crucial to mention a significant aspect of the classic machine learning technique of Moghadam et al. [22]: it relies heavily on hand-crafted features tailored to a specific dataset. Although the authors stated that the 15 selected features represented the optimum point found via trial and error, they did not specify what these features are or how they were extracted, which raises a reproducibility limitation; moreover, the effectiveness of the selected features cannot be guaranteed to achieve the same level of performance on other DBT datasets. Furthermore, the method does not operate as an end-to-end system; instead, it involves a series of seven intricate sequential steps, unlike deep learning methods.
5.4. Ablation Study
To demonstrate the efficacy of our proposed method, we conducted a comprehensive evaluation by applying the quality-aware framework to various CNN-based classification networks, including VGG16, ResNet50, DenseNet121, EfficientNet, DarkNet53, GhostNet, HRNet, CoAtNet, and ConvNext. The performance of each model, when employed with our quality-aware approach, was compared against its respective baseline in terms of accuracy.
The outcomes illustrated in Figure 9 reveal notable improvements across all evaluated classification models. Specifically, the proposed method improves the classification accuracy of ResNet50, EfficientNet, and ConvNext by 6%, 9%, and 9%, respectively, underscoring the broad applicability and consistent performance gains of our quality-aware approach across diverse CNN architectures.
Statistical analysis: As shown above, the performance of the proposed method surpasses that of the baseline models (VGG16, ResNet50, DenseNet121, EfficientNet, DarkNet53, GhostNet, HRNet, CoAtNet, and ConvNext), underscoring its efficacy in enhancing classification outcomes across all evaluated models. In this study, we employed McNemar's statistical test to ascertain the statistical significance of the accuracy differences between the proposed method and each baseline model. McNemar's test is specifically designed for comparing paired nominal data, which makes it suitable for comparing the performance of two classifiers in a binary classification setting. In particular, we employed the continuity-corrected version of McNemar's test, the more commonly used variant, governed by the following equation:

$$\chi^2 = \frac{\left(|B - C| - 1\right)^2}{B + C},$$

where B denotes the count of instances correctly predicted by the baseline but incorrectly by the proposed method, and C signifies the count of instances correctly predicted by the proposed method but incorrectly by the baseline. The $-1$ in the numerator adjusts for the continuity correction. Once McNemar's statistic has been calculated, it is compared with the chi-squared distribution with 1 degree of freedom to obtain the p-value, which indicates the probability of observing the discrepancy between the two models by chance alone.
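For reference, the test can be computed from paired per-sample correctness indicators as sketched below (a hypothetical helper assuming SciPy, not the authors' evaluation code).

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(baseline_correct, proposed_correct):
    """Continuity-corrected McNemar's test on paired per-sample correctness
    indicators (boolean arrays of equal length over the same test set)."""
    baseline_correct = np.asarray(baseline_correct, dtype=bool)
    proposed_correct = np.asarray(proposed_correct, dtype=bool)
    b = int(np.sum(baseline_correct & ~proposed_correct))  # baseline right only
    c = int(np.sum(~baseline_correct & proposed_correct))  # proposed right only
    if b + c == 0:                 # the two models disagree on no sample
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)  # upper tail of chi-squared with 1 dof
    return stat, p_value
```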
Table 5 shows the statistical analysis of the accuracy values of the proposed model and each baseline model. In this table, a p-value lower than 0.05 indicates statistical significance. As can be seen in Table 5, for each baseline model, the improvement of the proposed method over the baseline is statistically significant.
Figure 10 shows the skill score (SS) of different variants of the proposed approach. The skill score measures the accuracy improvement of a model relative to the accuracy of a reference model and can be expressed as follows:

$$SS = \frac{A_{\mathrm{eval}} - A_{\mathrm{ref}}}{1 - A_{\mathrm{ref}}},$$

where $A_{\mathrm{ref}}$ and $A_{\mathrm{eval}}$ stand for the accuracy of the reference and evaluated models, respectively. As the baseline AlexNet model obtained the highest classification accuracy, we used it as the reference model. As shown, the proposed method based on the AlexNet model achieves the highest skill score.
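Under this definition (the standard accuracy-based skill score, where a perfect model scores 1), the computation is a one-liner, sketched here as a hypothetical helper:

```python
def skill_score(acc_eval, acc_ref):
    """Skill score: fraction of the reference model's remaining error that
    the evaluated model removes (1.0 = perfect accuracy, 0.0 = no gain)."""
    return (acc_eval - acc_ref) / (1.0 - acc_ref)
```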
5.6. Enhancing Classification Accuracy through Ensemble Models
Aggregating multiple deep learning models can significantly enhance classification accuracy by leveraging the diversity and complementary strengths of individual models. This ensemble approach aims to mitigate the weaknesses of individual models and capitalize on their collective predictive power. By combining the predictions of diverse models, the ensemble can capture a broader range of patterns and features in the data, leading to a more robust and accurate classification performance.
Based on the results shown in Section 5.3 and Section 5.4, we selected the top-performing breast tumor malignancy prediction models in terms of accuracy, precision, and F1-score (AlexNet, ResNet50, EfficientNet, GhostNet, CoAtNet, and DenseNet121). To construct the ensemble classification approach, we aggregate the malignancy scores of these top-performing models using the average and median aggregation functions.
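The score-level fusion can be sketched as follows, assuming each model outputs a malignancy score in [0, 1] and the 0.5 decision threshold implied in Section 5.3; the helper is illustrative, not the authors' implementation.

```python
import numpy as np

def ensemble_malignancy(score_lists, aggregation="median", threshold=0.5):
    """Aggregate per-model malignancy scores (shape: models x samples)
    with the mean or median, then threshold into benign/malignant labels."""
    scores = np.asarray(score_lists)
    if aggregation == "median":
        fused = np.median(scores, axis=0)
    else:
        fused = np.mean(scores, axis=0)
    return fused, (fused >= threshold).astype(int)  # 1 = malignant
```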
As presented in Table 6, the average aggregation for the ensembled models achieves an accuracy of , better than the individual proposed AlexNet-based quality-aware method, with a enhancement. Although the mean aggregation function yields acceptable performance, the median aggregation leads to slightly better results: for the ensembled models, we obtain an accuracy of , a precision of , and an F1-score of .
Figure 12 shows the false positive rate (FPR) for the proposed method and the best-performing ensemble method (ensembled models based on the median aggregation function). Here, we can see that both methods efficiently classify the benign tumors in the DBT images with an accuracy of . In the case of malignant tumors, the ensembled models with the median aggregation function achieved a higher classification accuracy of , which is better than the AlexNet-based classification method.
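For completeness, the FPR reported in Figure 12 can be computed from the predictions as sketched below (a hypothetical helper, with label 0 for benign and 1 for malignant).

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): the fraction of benign (label 0) tumors
    incorrectly classified as malignant (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return fp / (fp + tn)
```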
Based on the analysis presented here and in Section 5.3, Section 5.4, Section 5.5 and Section 5.6, it is clear that the use of quality-aware features can significantly enhance breast tumor classification on DBT images, and the proposed ensemble approach can achieve accurate malignancy prediction while outperforming the state-of-the-art methods.
5.7. Evaluating the Proposed Method on the Breast Mammography Modality
To further validate the effectiveness of our proposed quality-aware tumor classification approach and ensure its generalizability across different datasets, we conducted evaluations on an alternative breast imaging modality, namely breast mammography. Since there are no other publicly available DBT datasets, our assessment focused on the performance of the proposed method on the INbreast dataset [46]. Comprising 410 images from 115 women, INbreast serves as a valuable resource for research in mammogram-based breast cancer diagnosis and has been used in several studies for the development and evaluation of breast tumor classification methodologies.
Following the implementation of our proposed pipeline, we split the dataset as follows: about for training and for testing (randomly). It is worth noting that we did not retrain the TomoQA model on mammogram images. Instead, we leveraged the same model previously trained on DBT images, owing to the similarities between mammogram images and 2D DBT slices.
Table 7 highlights the clear advantage of our quality-aware model over the baseline model lacking image quality-aware features. Comparing their performance shows a significant enhancement in classification achieved by our method: the proposed method improves the accuracy by over the baseline, with an enhancement of approximately for each of the precision, recall, and F1-score metrics.