1. Introduction
Breast cancer (BC) is the most frequently diagnosed cancer among women, accounting for about 2.26 million new cases worldwide in 2020 [1]. The incidence of female BC continued to increase slowly, by about 0.5% annually, from 2014 through 2018 [2]. In developed countries, such as the United States, BC incidence rates have risen in most of the past four decades, largely driven by localized-stage and hormone receptor-positive disease [3]. Compared to developed countries, the rates are relatively lower in China, while the actual numbers of new cases and deaths have been considerable [4]. BC poses a serious threat to women's health and places a heavy burden on finance and healthcare systems. Improvements in medical imaging, cancer screening and diagnosis, therapeutic planning and delivery, and follow-up monitoring contribute to the wide coverage of the health system and insurance and to advances in treatment and management [1,2,3,4].
Medical imaging devices are routinely used for BC screening and diagnosis, including mammography (MAM) and ultrasound (US). MAM remains the gold standard due to its high-resolution imaging of internal anatomy and its sensitivity for early-stage BC detection [5]. Its effectiveness has been demonstrated in large-scale clinical trials, leading to improved treatment outcomes, higher survival rates, and reduced mortality through early intervention [2,3]. However, it is associated with issues such as over-diagnosis, radiation exposure, and decreased sensitivity in patients with dense breast tissue. MAM has been found to be less suitable for young women and Asian women, who often have denser breast tissue [6]. Hand-held US (HHUS) eliminates the risk of radiation exposure and provides more detailed imaging for women with dense breast tissue. It effectively differentiates between solid tumors and fluid-filled cysts, thereby reducing unnecessary biopsies [7]. To address operator dependence and the limited field of view and to enhance imaging quality, three-dimensional (3D) ultrasound imaging has been developed [8,9,10,11].
As an emerging technique, 3D automated breast ultrasound (ABUS) serves as a supplementary method for evaluating women with heterogeneously and extremely dense breasts [11]. It offers several advantages in screening and diagnostic settings, including an increased BC detection rate, improved workflow efficiency, and reduced examination time. Notably, ABUS separates image acquisition from image interpretation, thereby decreasing operator dependence and time cost. Vourtsis and Kachulis investigated the performance of ABUS and HHUS in a large cohort of 1886 women and found that ABUS enhances the sensitivity of cancer detection [12]. Additionally, Klein et al. conducted a retrospective clinical study comparing the performance of ABUS and HHUS in cancer diagnosis, identifying that ABUS results in lower recall and biopsy rates, as it provides multiple perspectives of suspicious regions for examination [13]. Therefore, ABUS has significant potential to be routinely used as a standardized, reproducible, and reliable tool for whole-breast visualization, screening, and diagnosis [12,13,14], offering added value for patients with dense breasts [10,11].
Accurate diagnosis of breast lesions observed in ABUS enables the determination of tumor malignancy and the formulation of treatment plans. However, few deep learning models have been developed for this purpose [15,16]. Tan et al. extracted spiculation patterns in coronal planes and designed spiculation and other characteristic features for classifying lesions as malignant or benign using a support vector machine [17]. Wang et al. modified the Inception-v3 architecture for efficient feature extraction, integrating features from both transverse and coronal views for cancer diagnosis [18]. Xiang et al. combined residual blocks, capsule neural structures, and group normalization for ABUS tumor classification [19]. Zhou et al. designed a multi-task learning framework for joint ABUS tumor segmentation and classification, incorporating multi-scale feature extraction and iterative feature refinement [20]. Wang et al. added an automatic segmentation network for morphological analysis along with ResNet-based tumor diagnosis [21]. Ding et al. proposed a multi-view attention network that utilizes a localization unit for lesion region cropping and a classification unit for malignancy prediction based on the Transformer architecture [22]. Yang et al. developed a 2.5D deep model that fine-tunes a pre-trained network for tumor classification, using the ten slices with the largest lesion regions along with adjacent slices as input [23]. Despite these efforts, such studies remain scarce, and there is still an urgent need to explore deep learning networks for ABUS tumor classification.
In 3D high-resolution medical image analysis using deep learning, a key challenge in ABUS tumor classification is achieving a balance between time efficiency and classification accuracy. However, 3D-input deep networks face several significant problems [20]. Firstly, the high dimensionality of 3D images increases data processing complexity, computational demands, and resource intensity compared to 2D images. Secondly, ABUS tumors exhibit a wide range of shapes, sizes, and characteristics, making it difficult to develop a model that can accurately classify all types of lesions. Thirdly, the quality of ABUS images is influenced by the imaging devices used, and these images often suffer from noise and various artifacts [15]. Fourthly, many clinical settings require rapid image processing and analysis for timely decision-making, which adds pressure to achieve high classification accuracy while maintaining fast processing speeds. One of the most significant challenges is the limited availability of annotated datasets; very few ABUS cases are publicly accessible for training [16]. Specifically, an overview of recent advancements in BC image analysis [15] indicates that only one 3D ABUS database, with 100 volumetric cases, is accessible for algorithm development and fair comparison.
In practice, using volumetric images as input for deep networks necessitates iterative optimization of hyper-parameters, which requires a large-scale, high-quality database and leads to significant time costs [19,20]. While noisy data can be utilized for model training, the robustness of the resulting models must be thoroughly examined in the context of medical image analysis [24,25,26]. On the other hand, while using slice images as input allows for faster slice-wise lesion predictions, effectively combining the slice-wise probabilities into benign and malignant predictions remains an open question. In summary, deep learning-based 3D-input ABUS tumor classification faces challenges related to high time costs, a lack of sufficient training samples, and the complexity of hyper-parameter optimization.
To the best of our knowledge, few studies have specifically addressed the aforementioned issues in ABUS tumor classification [16]. In this study, a soft voting (SV) strategy based on voxel-level weighting of slice predictions is proposed. It acts as a post-processing step after image slice prediction by a 2D-input deep learning model. It should be noted that, for pediatric brain tumor classification, Bianchessi et al. also proposed soft voting for per-slice class prediction [27]. However, there are key differences between the SV strategy in our work and that in Ref. [27]. Firstly, our SV directly yields the per-volume class prediction, while that in Ref. [27] pertains to slice-level class prediction. Secondly, our SV operates on the predicted probabilities of the benign and malignant classes for each voxel, whereas that in Ref. [27] aims to assign each slice to one of three classes. Thirdly, our SV uses only a single classification model, while the SV in Ref. [27] involves multiple classification models. To verify its effectiveness, the proposed SV strategy is compared with a hard voting (HV) strategy that uses slice-based weighting. Furthermore, the baseline deep networks are compared to those equipped with the proposed strategies. Experimental results suggest that the proposed SV strategy improves tumor classification performance owing to its utilization of tumor sizes and slice-level predicted probabilities.
2. Materials and Methods
This section introduces the ABUS database, the proposed post-processing strategy, the deep learning models for evaluation, the experiment design, the performance metrics used, and the implementation details.
2.1. The Database
The database is the training set of the Tumor Detection, Segmentation and Classification Challenge on ABUS 2023 (TDSC-ABUS2023). It contains 100 ABUS volume images. Currently, it is the only ABUS database available online to the community, paving the way for improved data availability and 3D ultrasound image analysis [15].
In the database, the matrix size of volumetric images ranges between [843, 546, 270] and [865, 682, 354], the physical in-plane spacing is [0.200 mm, 0.073 mm], and the between-slice spacing is around 0.476 mm. The volumes are acquired by using an Invenia ABUS system (GE Healthcare) at Harbin Medical University Cancer Hospital, Harbin, China. The data are stored in nrrd format, and the pixel intensity ranges from 0 to 255. An experienced radiologist checked the data cases and annotated the tumor regions.
Figure 1 shows the distribution of case numbers in terms of the voxel numbers of the tumor regions. The horizontal axis presents the base-10 logarithm of the tumor voxel number v (i.e., log10(v)), equally divided into seven bins, and the vertical axis displays the number of ABUS cases. It shows that most cases (90 out of 100) contain roughly 10^4 to 10^6 voxels in the annotated tumor regions.
Moreover, the database provides the volume images and corresponding biopsy labels (42 benign and 58 malignant). The voxel number of the tumor regions is (1.06 ± 1.36) × 10^5 (benign) and (3.91 ± 9.38) × 10^5 (malignant), and the mean voxel intensity is 67.10 ± 11.96 (benign) and 70.19 ± 10.58 (malignant). In addition, the smallest tumor contains 3539 voxels, and the largest contains 6,863,915 voxels. The similar intensity distributions of benign and malignant lesions, together with the widely varying tumor shapes and sizes, present significant challenges for accurate lesion classification.
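To illustrate how such per-case statistics can be derived from the released files, the following is a minimal sketch assuming the pynrrd package and a binary tumor mask per case; the file names (DATA_001.nrrd, MASK_001.nrrd) are hypothetical placeholders, not the official naming convention.

```python
# Minimal sketch (assumptions: pynrrd is installed; each case has an image
# volume and a binary tumor mask in nrrd format; file names are hypothetical).
import numpy as np
import nrrd

image, _ = nrrd.read("DATA_001.nrrd")   # voxel intensities in [0, 255]
mask, _ = nrrd.read("MASK_001.nrrd")    # 1 inside the annotated tumor, 0 elsewhere

tumor_voxels = int((mask > 0).sum())             # tumor size in voxels
mean_intensity = float(image[mask > 0].mean())   # mean intensity inside the tumor
log10_size = np.log10(tumor_voxels)              # quantity binned in Figure 1
print(tumor_voxels, mean_intensity, log10_size)
```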
2.2. The Proposed Soft-Voting Strategy
The proposed SV strategy is a post-processing step following a 2D-input image classification model. As shown in Figure 2, after slice-wise malignancy prediction, the proposed voxel-weighting strategy takes both tumor sizes and predicted probabilities (benign and malignant) into consideration for tumor classification.
At first, a convolutional neural network (CNN) is employed for slice-wise prediction. After the CNN model is trained, for an unseen input image slice s_i, its output consists of two prediction probabilities (benign, p_b(s_i); malignant, p_m(s_i); with p_b(s_i) + p_m(s_i) = 1).
Then, the proposed SV strategy is implemented with voxel-level weighting. Each voxel in slice s_i of the tumor region inherits the predicted probabilities p_b(s_i) and p_m(s_i). Since a tumor is a volume with many slices and voxels, we assume that each voxel contributes to the prediction of tumor malignancy, and thus voxel weighting becomes reasonable.
Specifically, assume that a volumetric tumor v contains n slices {s_1, s_2, ..., s_n} and that the corresponding voxel numbers of the tumor region in these slices are {m_1, m_2, ..., m_n}. Thereby, the benign probability P_b and the malignant probability P_m of the tumor can be defined as in Equation (1):

$$P_b = \frac{\sum_{i=1}^{n} m_i \, p_b(s_i)}{\sum_{i=1}^{n} m_i}, \qquad P_m = \frac{\sum_{i=1}^{n} m_i \, p_m(s_i)}{\sum_{i=1}^{n} m_i}. \quad (1)$$

The numerators stand for the contributions of the voxel-weighted benign or malignant probabilities, and the denominators denote the number of voxels, i.e., the tumor size. Since the denominators are the same (the sum of m_i over all slices), the values of the numerators are directly related to the final tumor classification.
In the end, the classification C(v) of the volumetric tumor is determined by the larger of P_b and P_m, as shown in Equation (2):

$$C(v) = \begin{cases} \text{malignant}, & P_m \ge P_b, \\ \text{benign}, & \text{otherwise}. \end{cases} \quad (2)$$

The core idea of voxel weighting, or the SV strategy, is that both the voxel numbers of a tumor and the voxel-level malignancy probabilities derived from slice-wise prediction are utilized.
Compared to the baseline CNN model used in the workflow, voxel weighting, as a post-processing strategy, only slightly increases the computing time of tumor classification, while it enhances prediction robustness and decision-making confidence.
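The voxel-weighted computation in Equations (1) and (2) reduces to a few array operations. The snippet below is a minimal sketch, assuming the per-slice malignant probabilities and the per-slice tumor voxel counts are already available; the function and variable names are illustrative and are not taken from the original implementation.

```python
# Sketch of the soft-voting (SV) post-processing of Equations (1)-(2);
# names and conventions are assumptions for illustration only.
import numpy as np

def soft_vote(slice_prob_malignant, slice_voxels):
    """Voxel-weighted (soft-voting) prediction for one volumetric tumor."""
    p_m = np.asarray(slice_prob_malignant, dtype=float)  # per-slice malignant probability
    p_b = 1.0 - p_m                                      # per-slice benign probability
    m = np.asarray(slice_voxels, dtype=float)            # tumor voxels per slice
    P_m = np.sum(m * p_m) / np.sum(m)                    # Equation (1), malignant
    P_b = np.sum(m * p_b) / np.sum(m)                    # Equation (1), benign
    return "malignant" if P_m >= P_b else "benign"       # Equation (2)

# Example: three slices with 1200, 300, and 2500 tumor voxels.
print(soft_vote([0.9, 0.4, 0.8], [1200, 300, 2500]))
```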
In contrast to voxel weighting, a more straightforward strategy is HV, or slice-level weighting. It depends only on the number of slices predicted as benign or as malignant. In other words, for a tumor with n slices containing lesion regions, if k slices are predicted as malignant (i.e., p_m(s_i) ≥ p_b(s_i)) and k > n − k, the volumetric tumor is voted as malignant, and vice versa. Equation (3) formalizes this comparison of the number of malignant (k) and benign (n − k) slices in the volumetric tumor:

$$C(v) = \begin{cases} \text{malignant}, & k > n - k, \\ \text{benign}, & \text{otherwise}. \end{cases} \quad (3)$$
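For comparison, the slice-level HV rule of Equation (3) can be sketched in the same style, again under the assumption that a slice counts as malignant when its malignant probability is at least 0.5.

```python
# Sketch of the hard-voting (HV) baseline of Equation (3): each slice casts one
# vote according to its predicted class, and the majority decides the volume label.
import numpy as np

def hard_vote(slice_prob_malignant):
    p_m = np.asarray(slice_prob_malignant, dtype=float)
    k = int(np.sum(p_m >= 0.5))                    # slices predicted as malignant
    n = p_m.size
    return "malignant" if k > n - k else "benign"  # Equation (3)

print(hard_vote([0.9, 0.4, 0.8]))
```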
2.3. Involved CNN Models
To verify the effectiveness and efficiency of the proposed strategy, several CNN models are explored. This part briefly describes the four 2D models (ResNet34 [28], MedViT [29], HiFuse [30], and MedMamba [31]) and the four 3D models (M3T3D [32], ResNeXt3D [33], DenseNet3D [34], and ResNet3D [35]). The proposed post-processing strategy is added to the 2D CNN models to evaluate its effectiveness in ABUS lesion classification.
2.3.1. 2D CNN Models
The first 2D CNN model is ResNet34 [28], which is widely used as the backbone of many advanced networks in image classification and medical diagnosis [5,36]. Figure 3 shows the repeated residual blocks that use convolution layers for hierarchical representation and skip connections to mitigate the vanishing gradient problem. Consequently, deep neural networks can be built in a straightforward and computationally efficient manner and trained with fast convergence.
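As an illustration of how such a backbone can be adapted to slice-wise malignancy prediction, the following sketch builds a two-class classifier from torchvision's ResNet34; the two-unit head, the input size, and the absence of pretrained weights are assumptions rather than the authors' exact configuration.

```python
# Sketch of a two-class slice classifier based on torchvision's ResNet34
# (configuration details are assumptions, not the original setup).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(weights=None)            # optionally load pretrained weights
model.fc = nn.Linear(model.fc.in_features, 2)    # benign vs. malignant head

x = torch.randn(8, 3, 224, 224)                  # a batch of ABUS slices (grayscale replicated to 3 channels)
probs = torch.softmax(model(x), dim=1)           # per-slice (p_b, p_m) consumed by the SV/HV strategies
```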
The second 2D CNN model is MedMamba [31], which combines convolutional layers and structured state-space models (SSMs). It is made up of patch embedding layers, stacked SS-Conv-SSM blocks, patch merging layers, and a feature classifier, as shown in Figure 4. It is similar to Vision Transformers [37] in that MedMamba splits the input image into non-overlapping patches. It builds hierarchical representations using four SS-Conv-SSM blocks with image down-sampling. Specifically, the basic SS-Conv-SSM block includes channel-split, convolutional layers, SSM layers, and channel-shuffle operations in two branches, a Conv-Branch and an SSM-Branch, for local and global information processing.
The third 2D CNN model is MedViT [29]. It introduces multi-head convolutional attention (MHCA), a local feed-forward network (LFNN), and efficient self-attention (ESA), and the model is built from efficient convolutional blocks (ECBs) and local-token blocks (LTBs), as shown in Figure 5. Specifically, MHCA decomposes an image into multiple regions or tokens and captures long-range dependencies; the LFNN rearranges, token by token, the feature maps and token sequences converted by Seq2Img and Img2Seq; and the ECBs benefit from residual blocks for detail preservation while integrating Transformer components for deep feature representation. At the same time, the LTB combines local features from the ECBs with global features from MHCA and ESA, and a patch momentum changer is used to augment data diversity and improve model robustness. After progressive feature extraction and fusion, the malignancy is predicted by using simple batch normalization, global average pooling, and a fully connected layer.
The fourth 2D CNN model is HiFuse [30]. It develops a three-branch hierarchical integration of multi-scale features, in which a self-attention-based Transformer and a CNN are combined without destroying their respective modeling capabilities. As shown in Figure 6, a parallel hierarchical structure with local and global feature blocks is designed for efficient representation of local and global semantic cues, and an adaptive hierarchical feature fusion (HFF) block is proposed to integrate the multi-scale features comprehensively. Specifically, the HFF block uses small modules, including spatial attention, channel attention, a residual inverted multi-layer perceptron (MLP), and a shortcut, to integrate the semantic features of each branch. Finally, global average pooling and a layer-normalized linear classifier are used for lesion malignancy prediction.
It is observed that ResNet34 uses four different sizes of residual blocks (Figure 3), while MedViT, HiFuse, and MedMamba each repeat four specifically designed blocks (Figure 4, Figure 5 and Figure 6) for hierarchical data representation, progressive feature fusion, and object classification. Full technical details of these 2D-input networks can be found in the relevant publications and code implementations.
2.3.2. 3D CNN Models
The first 3D CNN model is M3T3D [32], which extracts feature representations of the input 3D data samples by using two convolution layers, batch normalization, and ReLU activation functions. Meanwhile, it extracts 2D features from each slice of the coronal, sagittal, and axial planes. After that, the 2D features are concatenated, projected, and embedded as tokens passed to the Transformer encoder [38] for global information integration and long-range dependency capture. In the end, all the features are aggregated into a global feature representation for malignancy prediction.
The second 3D CNN model is ResNeXt3D [33], which was designed for Alzheimer's disease classification. It combines ResNeXt [39] and Bi-LSTM [40] for 3D magnetic resonance brain image analysis by replacing 2D convolution kernels with 3D kernels. The 3D representation features are flattened into a 1D signal as input to the Bi-LSTM, and thereby the spatial information of the 3D medical images is thoroughly learned for disease classification.
The third 3D CNN model is DenseNet3D [34], which combines both global and local features by using a 3D densely connected CNN and prior shape information. Specifically, it improves the representation capacity by connecting each layer with the other convolution layers, and consequently, low-level features and high-level shape features are connected. Meanwhile, vanishing gradients are relieved by feature fusion.
The fourth 3D CNN model is ResNet3D [35], which designs spatio-temporal convolutions for action recognition. It proposes mixed convolutions to model object motion with low- and mid-level operations in the early layers. Most importantly, its spatio-temporal variant explicitly divides the 3D convolution block into a 2D spatial convolution and a 1D temporal convolution, with an additional nonlinear rectification embedded between the 2D and 1D operations. This decomposition enables ResNet3D to learn representations of more complex functions.
2.4. Experiment Design
The database contains 100 volumes and 2028 slices with tumor regions. To assess generalization performance, five-fold cross-validation is employed. The database is randomly partitioned into five mutually exclusive folds of equal size, and each model is then trained and evaluated five times. In other words, four folds of 80 ABUS volumes (and their slices) are selected for 3D (and 2D) model training, and the remaining fold of volumes (and slices) is used for performance evaluation. This kind of splitting avoids data leakage and potential over-fitting, providing a more reliable estimation of classification performance.
After random data splitting, for the 2D CNN models, the proposed SV strategy as well as the HV approach are verified by re-computing the prediction results for volumetric tumor classification. Moreover, for fair comparison, a total of 100 epochs are conducted for training these 2D and 3D deep networks.
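A minimal sketch of such a volume-level split is given below; it assumes that every slice record carries the ID of its parent volume, so that all slices of a case always fall into the same fold. The fold assignment itself is an illustrative choice, not the authors' exact partition.

```python
# Sketch of the volume-level five-fold split (assumption: slices are later
# selected by their parent volume ID, which prevents slice-level leakage).
import numpy as np
from sklearn.model_selection import KFold

volume_ids = np.arange(100)  # the 100 ABUS volumes
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_vols, test_vols) in enumerate(kf.split(volume_ids)):
    # 2D models: train on slices whose parent volume is in train_vols;
    # evaluate slice-wise on test_vols, then aggregate per volume via SV/HV.
    print(f"fold {fold}: {len(train_vols)} training volumes, {len(test_vols)} test volumes")
```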
2.5. Performance Metrics
Assume that true positive (TP) is the number of correctly predicted positive samples, true negative (TN) is the number of correctly predicted negative samples, false positive (FP) is the number of negative samples incorrectly predicted as positive, and false negative (FN) is the number of positive samples incorrectly predicted as negative.
In the current study, six metrics, namely accuracy (ACC), sensitivity (SEN), specificity (SPE), the area under the curve (AUC), the training time in minutes (time), and the Score, are used to measure the classification performance. Equation (4) shows how to compute these metrics. The Score is the official metric of the challenge.
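For reference, the standard confusion-matrix definitions of ACC, SEN, and SPE are sketched below; the exact formula of the challenge Score is not reproduced here and should be taken from the official challenge description.

```python
# Standard confusion-matrix metrics (generic definitions, not the paper's code).
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)
```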
2.6. Implementation Details
The code of the CNN models is available online, and the deep learning models are evaluated without code modification. The networks run on a platform with an NVIDIA RTX 4090 GPU (24 GB; NVIDIA, Santa Clara, CA, USA), a 15-vCPU Intel(R) Xeon(R) Platinum 8474C CPU, and 80 GB of RAM. For iterative hyper-parameter optimization, 100 epochs are conducted for the 2D-input CNN models, and, due to their much greater number of parameters, 300 epochs are used for the 3D-input CNN models.
3. Results
This section presents and compares the performance of the 2D-input models, the 2D-input models with different post-processing strategies, and the 3D-input models on tumor classification by using different metrics. The receiver operating characteristic (ROC) curves and t-distributed stochastic neighbor embedding (t-SNE) visualization [41] are used for visual comparison of feature learning. In the end, the state-of-the-art results (the top-15 scores) on the database are presented.
3.1. Performance of 2D-Input CNN Models
Table 1 shows the average metric values of the 2D-input CNN models with and without the HV and SV strategies after five-fold cross-validation. It indicates that both voting strategies improve the prediction results, and the SV strategy achieves better performance than the HV approach. Specifically, the SV strategy increases the SPE of ResNet34 from 0.757 to 0.963 (0.206 ↑), the SEN of MedViT from 0.808 to 0.987 (0.179 ↑), the SEN of HiFuse from 0.685 to 0.864 (0.179 ↑), and the SEN of MedMamba from 0.726 to 0.975 (0.249 ↑). It is also found that the HV approach yields slightly inferior AUC values when ResNet34 or HiFuse serves as the baseline for lesion classification.
Among the 2D-input CNN models, ResNet34 achieves the highest AUC (0.936), followed by HiFuse (0.883) and MedMamba (0.799), while MedViT yields the lowest AUC (0.772) and SPE (0.547) on the classification task. In addition, ResNet34 requires the least training time (63.2 min per 100 epochs). The training stage of each 2D-input CNN model takes more than 1 h, and that of MedViT lasts about 88.3 min.
3.2. ROC Curves of 2D-Input CNN Models
Figure 7 presents ROC curves from a single experimental run for the four 2D-input CNN models, using the baseline (red, solid line), the baseline with HV (green, dashed line), and the baseline with SV (blue, dot-dashed line) for identification.
According to the ROC curves, the baseline ResNet34 achieves the best performance, while both voting strategies, HV and SV, enable further improvement in the classification results. In addition, the SV strategy enhances the prediction performance of HiFuse and MedMamba, whose AUC values are dramatically increased.
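Such ROC curves and AUC values can be computed directly from the per-volume soft-voted probabilities; the sketch below assumes that y_true holds the biopsy labels (1 = malignant) and y_score the voxel-weighted malignant probabilities P_m of Equation (1), with dummy values used here purely for illustration.

```python
# Sketch of per-volume ROC/AUC computation with scikit-learn (dummy data).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0])              # biopsy labels (1 = malignant), dummy values
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # soft-voted P_m per volume, dummy values

fpr, tpr, _ = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))
```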
3.3. Visualization with t-SNE of 2D-Input CNN Models
Figure 8 presents the t-SNE visualization of the learned features of the 2D-input CNN models. In each plot, large blue circles represent benign slices, and small red circles indicate malignant slices.
In the projection space, the learned features from the 2D-input models can effectively separate the benign and malignant slices, since the blue and red circles are visually separable. Figure 8a and Figure 8d indicate that four and five malignant slices are misclassified as benign by ResNet34 and MedMamba, respectively, while Figure 8b shows that three benign slices are wrongly predicted as malignant by MedViT.
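A t-SNE plot of this kind can be generated with scikit-learn; the sketch below assumes that `features` holds the learned (e.g., penultimate-layer) slice representations and `labels` the corresponding benign/malignant slice labels, with random placeholders used here instead of real model outputs.

```python
# Sketch of the t-SNE projection used for visual comparison (placeholder data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(200, 512)             # placeholder for learned slice features
labels = np.random.randint(0, 2, size=200)      # placeholder labels: 0 = benign, 1 = malignant

embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(embedded[labels == 0, 0], embedded[labels == 0, 1], c="blue", label="benign")
plt.scatter(embedded[labels == 1, 0], embedded[labels == 1, 1], c="red", label="malignant")
plt.legend()
plt.show()
```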
3.4. Performance Comparison to 3D-Input CNN Models
The performance of tumor classification using the 2D-input CNN models with the SV strategy and the 3D-input CNN models is summarized in Table 2. The best and the worst values of the metrics are in boldface and underlined, respectively.
When the models are trained for 100 epochs, the metric values of the 2D-input models with the SV strategy are much better than those of the 3D-input models. In general, all the metric scores of the 2D-input CNN models are larger than 0.84, and the training times are less than 90 min. Among the 3D-input CNN models, DenseNet3D achieves the highest AUC (0.622), M3T3D yields the worst SEN (0.078), and ResNeXt3D achieves a relatively good balance between SEN and SPE. Comparatively, the metric values of the 3D-input CNN models are much worse than those of the ResNet34 + SV framework.
To further understand the effect of the number of training epochs on classification performance, the results of the 3D-input networks trained for 300 epochs are also shown. In comparison to the 3D-input models trained for 100 epochs, the metric values increase correspondingly, but only slightly. For instance, ResNeXt3D achieves 0.033 ↑ in ACC, 0.060 ↑ in SEN, 0.142 ↑ in SPE, and 0.038 ↑ in AUC. However, this minor performance improvement comes at the cost of 200 additional epochs of iteration and 409.6 min of model training.
In addition, the 3D-input CNN models take over 3 h (>180 min) to complete 100 epochs of model training, which is significantly longer than the 2D-input CNN models (≈60 min). ResNet3D, a 3D extension of the ResNet18 model, is the fastest among the 3D-input models, requiring 181.1 min, which is nearly three times the time taken by ResNet34 for model training.
3.5. Visualization with t-SNE of 3D-Input Networks
The t-SNE visualization of the learned features of the 3D-input CNN models is shown in Figure 9. In each plot, a large blue circle stands for a benign volume case, and a small red circle denotes a malignant volume case. Because of the limited number of volumetric samples, five-fold cross-validation leads to around 20 cases in each plot.
The projection space of the single experimental run shows that the blue circles and red circles are mixed with each other. This indicates that the benign and malignant volumetric cases are difficult to separate and that the intrinsic features of benign and malignant lesions are not well learned.
3.6. The State-of-the-Art Achievement on the Database
Figure 10 illustrates the state-of-the-art performance on the database, displaying the results of the top 15 teams in the ABUS challenge and our ResNet34 + SV model. It should be noted that these teams used the 100 samples for model training and that the trained models were evaluated on the unreleased testing set. The horizontal axis represents the rankings of the top teams, with the 16th position designated for the ResNet34 + SV model, while the vertical axis indicates the score values.
Four teams achieved scores greater than 0.90, with the top team reaching 0.9686, which is slightly lower than that of the ResNet34 + SV model (score = 0.986). The fifth-ranked team obtained a score of 0.8278, while the remaining teams scored below 0.80. This finding indicates that the majority of the prediction models (11 out of 15) struggle to effectively classify the ABUS volumetric images into benign and malignant groups.
4. Discussion
BC remains the leading cause of cancer-related deaths among women worldwide, underscoring the urgent need for effective screening and diagnostic tools. Three-dimensional ABUS has emerged as a promising method for improving the screening and diagnosis of women with dense breast tissue, offering numerous advantages over traditional imaging techniques. While a limited number of studies have begun to explore deep learning-based approaches for tumor classification in ABUS, the challenge of balancing time efficiency with predictive accuracy has yet to be adequately addressed.
This study introduces a novel strategy, termed the SV strategy, which employs voxel weighting to enhance classification performance. This method can be seamlessly integrated into any 2D-input CNN model. To the best of our knowledge, this is the first application of voxel-level probabilities for distinguishing between benign and malignant tumors in the context of volumetric tumor classification, marking a significant advancement in the use of deep learning for BC diagnosis.
The SV strategy significantly enhances tumor classification performance when applied to 2D-input CNN models. As shown in Table 1, this strategy leads to notable improvements in various metrics compared to the baseline models. For instance, MedMamba achieves reasonable results (ACC, 0.738; SEN, 0.726; SPE, 0.755; AUC, 0.799), while the SV-powered MedMamba demonstrates excellent performance (ACC, 0.954; SEN, 0.975; SPE, 0.943; AUC, 0.954). Several factors may contribute to this improvement. First and foremost, the SV strategy incorporates voxel probabilities into the computation of volumetric tumor classification (see Figure 2). This approach effectively balances both the number of voxels and the slice-wise probabilities in the final prediction. Secondly, the baseline 2D-input CNN models are proficient at slice-wise tumor classification. When a slice contains sufficient voxels for deep learning-based hierarchical feature representation, these models excel in slice-level classification. This is supported by the t-SNE visualization (Figure 8), which illustrates a clear separation between benign and malignant image slices. However, tumors exhibit diverse shapes, sizes, and textures, and the baseline models may struggle with slices containing small lesion regions, leading to inaccurate predictions. Furthermore, the SV strategy outperforms the HV strategy, indicating that voxel-level weighting is more effective than slice-level weighting. Both the number of voxels and the predicted probabilities play crucial roles in the accurate classification of tumor volumes (Figure 7). It is important to note that this concept has parallels in the artificial intelligence community [42,43]. Unlike approaches that weight multiple classifiers or networks [27], the proposed strategy re-weights voxel importance to enhance prediction performance.
The 3D-input CNN models demonstrate poor performance in ABUS tumor classification, as indicated by their metrics (AUC < 0.65) shown in Table 2. Although these models have achieved success in disease classification using a multi-plane multi-slice Transformer (M3T3D [32]), a fusion of ResNeXt and Bi-LSTM (ResNeXt3D [33]), and densely connected global and local features (DenseNet3D [34]), several factors contribute to their shortcomings in this context. Firstly, the limited number of 300 training epochs hinders the iterative optimization of representation learning. The training parameters, such as the learning rate, patch size, and loss function, are critical for prediction performance [44]. However, a grid search for optimal parameters dramatically increases the time cost of model training. As shown in Table 2, when the number of iteration epochs increases, the time cost increases correspondingly and linearly. Compared to 2D-input CNN models, 3D data input increases computational complexity, requiring more memory and longer training time. Secondly, the insufficient number of ABUS tumor cases restricts the effective training of 3D-input CNN models. Deep 3D-input networks involve numerous hidden parameters that necessitate a large-scale database for optimal hyper-parameter tuning. Whether data augmentation, transfer learning, and domain adaptation can address this issue [45,46,47] requires further investigation. It is important to note that the ABUS database, containing 100 volumetric cases, remains the only publicly available dataset in the field of 3D US imaging. The release of significantly larger databases in the future could greatly advance research in this domain, facilitating everything from algorithm development to fair performance comparison, and ultimately paving the way for more accurate and robust 3D US image analysis. Lastly, due to the imaging principles involved, the quality of ABUS images is generally inferior to that of magnetic resonance images and computed tomography images. This poor image quality is also a reason for the poor prediction performance. The quality of ABUS images heavily relies on the device and acquisition procedure and is further affected by artifacts, noise, and nipple shadow [16]. Enhancing ABUS imaging quality remains a significant challenge that warrants further attention.
Several limitations remain in the current study. Firstly, the proposed SV strategy can only be integrated into 2D-input networks for 3D object classification. It benefits ABUS lesion prediction, while its potential for improving the performance of other 3D object classification tasks will be explored in our future studies. Secondly, a limited number of epochs were used for hyper-parameter optimization, which may hinder the deep representation learning process [48]. Consequently, the performance of the 3D-input CNNs involved may be underestimated, although this does not affect our finding that the proposed SV strategy improves ABUS lesion classification. However, increasing the number of training epochs, along with hyper-parameter optimization (e.g., of the learning rate and batch size), requires substantial additional time, effort, and funding. Thirdly, the number of training cases is insufficient due to the challenge of ABUS data collection. To alleviate this issue, five-fold cross-validation is conducted for robust evaluation and selection of better 2D-input CNN models. Most importantly, the release of larger databases is highly desirable, since diverse and large-scale databases can significantly enhance research progress, improve model generalization, and facilitate more comprehensive evaluations. To reduce labor costs, incorporating noisy and augmented data could be considered, although their effects on model training are still under investigation [24]. Fourthly, in the TDSC-ABUS2023 challenge, the prediction score of the first-ranked team is slightly inferior to that of our ResNet34 + SV model. However, due to the limited open resources of the challenge, the details of these models are not yet available. Finally, data samples could be stratified into different groups based on factors such as tumor size and image quality, which would help elucidate the elements contributing to tumor classification.