4.3. Model Performance Evaluation
The primary goal of our experiment was to identify BLCA, aiming to improve patient outcomes, simplify the diagnostic process, and significantly reduce both patient time and costs. In the experiment, 80% of the EBTC dataset (1403 images) was designated for training, while the remaining 20% (351 images) was set aside for testing. We employed six DL models—EfficientNet-B3, ConvNeXtBase, DenseNet-169, MobileNet, ResNet-101, and VGG-16—each pre-trained on the ImageNet dataset in a supervised manner.
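For concreteness, the following minimal sketch (our reconstruction, not the authors' released code) shows how these six ImageNet-pretrained backbones can be instantiated with tf.keras.applications; the 300×300 input resolution and single softmax head are illustrative assumptions.

```python
# Minimal reconstruction (not the authors' released code) of how the six
# ImageNet-pretrained backbones can be instantiated in TensorFlow/Keras.
# The 300x300 input size and single softmax head are illustrative assumptions.
import tensorflow as tf

BACKBONES = {
    "EfficientNet-B3": tf.keras.applications.EfficientNetB3,
    "ConvNeXtBase": tf.keras.applications.ConvNeXtBase,
    "DenseNet-169": tf.keras.applications.DenseNet169,
    "MobileNet": tf.keras.applications.MobileNet,
    "ResNet-101": tf.keras.applications.ResNet101,
    "VGG-16": tf.keras.applications.VGG16,
}

def build_model(name, num_classes=4, input_shape=(300, 300, 3)):
    """Attach a softmax classification head to an ImageNet-pretrained backbone."""
    base = BACKBONES[name](include_top=False, weights="imagenet",
                           input_shape=input_shape, pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    return tf.keras.Model(base.input, outputs, name=name)
```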
Additionally, in our experiment, we utilized the five-fold cross-validation technique. This method involved dividing the training dataset into five equal subsets. To prevent data leakage, all data from a single patient were assigned exclusively to either the training or test set, and similarly, each patient appeared in only one fold during cross-validation. This ensured that no patient's data was shared across training and validation/test sets.
In each iteration, one subset was used as the validation set while the remaining four subsets were utilized for training the model. Each iteration represented a unique training and validation process that updated the model’s parameters. This procedure was repeated five times, with each subset serving as the validation set once. The average performance across all five iterations was calculated to evaluate the model’s generalization ability. At the conclusion of the experiment, we applied the measured metrics (Equations (1)–(7)) to the six DL models.
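The patient-grouped splitting described above can be enforced with scikit-learn's GroupKFold. The sketch below is illustrative only; since the paper does not publish its splitting code, the labels and patient IDs are synthetic placeholders.

```python
# Illustrative sketch of the patient-grouped five-fold split described above,
# using scikit-learn's GroupKFold. The EBTC loading code is not part of the
# paper, so labels and patient IDs below are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_images = 1403                                    # training portion of EBTC
y = rng.integers(0, 4, size=n_images)              # HGC / LGC / NTL / NST
patient_ids = rng.integers(0, 120, size=n_images)  # hypothetical patient IDs
X = np.arange(n_images).reshape(-1, 1)             # stand-in for image indices

cv = GroupKFold(n_splits=5)
for fold, (tr, va) in enumerate(cv.split(X, y, groups=patient_ids), start=1):
    # Each patient's images land entirely in either the training or the
    # validation subset, so no patient is shared across folds.
    assert set(patient_ids[tr]).isdisjoint(patient_ids[va])
    print(f"fold {fold}: {len(tr)} train / {len(va)} validation images")
```

Where class proportions also need to be preserved in each fold, scikit-learn's StratifiedGroupKFold offers the same leakage guarantee with approximate stratification.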
The outcomes of the five-fold cross-validation process for the six DL models, along with the evaluation metrics, are detailed in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9 and Figure 3. The average accuracy results from the five-fold cross-validation process were as follows: EfficientNet-B3 achieved an accuracy rate of 99.03%, ConvNeXtBase reached 98.29%, DenseNet-169 attained 98.32%, MobileNet recorded 98.66%, ResNet-101 reached 98.52%, and VGG-16 achieved 98.49%. Based on these results, EfficientNet-B3 demonstrated the highest accuracy among the models evaluated.
Table 4 and Figure 4 show that the EfficientNet-B3 model was assessed using five cross-validation folds, with the following performance metrics reported:
1. Accuracy and Specificity
   - The model's accuracy ranged from 98.58% to 99.72% across the five folds, yielding an overall mean accuracy of 99.03%.
   - The specificity (true-negative rate) was consistently high, fluctuating between 98.97% and 99.80% and averaging 99.31%.
   - These results indicate that the model rarely misclassified negative cases, demonstrating a strong ability to accurately identify healthy samples.
2. FNR and NPV
   - The FNR (the proportion of actual positives missed) ranged from 0.45% (fold 4) to 4.43% (fold 5), with an average of 3.15%.
   - The corresponding NPV (the probability that a negative prediction was truly negative) varied between 99.04% and 99.79%, averaging 99.36%.
   - This indicates that when the network predicted a negative class, it was almost always correct.
3. Precision, Recall, and F1-Score
   - Precision (positive predictive value) ranged from 96.98% to 99.52%, with a mean of 97.95%, indicating that most of the model's positive predictions were true positives.
   - Recall (sensitivity) showed wider variability, ranging from 95.57% (fold 5) to 99.55% (fold 4), averaging 96.85%.
   - The F1-score, which balances precision and recall, spanned from 96.50% to 99.53%, with an overall mean of 97.37%.
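These fold-level metrics follow the standard one-vs-rest definitions referenced as Equations (1)–(7). The sketch below shows one way to compute them from a multi-class confusion matrix; macro-averaging over the four classes is our assumption, since the paper does not restate its averaging scheme here.

```python
# One way to compute the fold-level metrics above (cf. Equations (1)-(7))
# from a multi-class confusion matrix. Macro-averaging over the four classes
# is an assumption; the paper does not restate its averaging scheme here.
import numpy as np
from sklearn.metrics import confusion_matrix

def fold_metrics(y_true, y_pred, n_classes=4):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as class c but actually another
    fn = cm.sum(axis=1) - tp        # actually class c but predicted as another
    tn = cm.sum() - tp - fp - fn
    precision = np.mean(tp / (tp + fp)) * 100
    recall = np.mean(tp / (tp + fn)) * 100
    return {
        "accuracy": tp.sum() / cm.sum() * 100,
        "specificity": np.mean(tn / (tn + fp)) * 100,
        "FNR": np.mean(fn / (fn + tp)) * 100,
        "NPV": np.mean(tn / (tn + fn)) * 100,
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }
```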
Overall, the EfficientNet-B3 model demonstrated extremely high specificity and precision, a low false-negative rate, and very strong accuracy. Slight variability in recall across folds suggests occasional missed positive instances, but the model’s average F1-score above 97% reflects an excellent balance between sensitivity and precision.
Table 5 and Figure 5 show that the average accuracy of ConvNeXtBase reached 98.29%, demonstrating the model's overall reliability in classification tasks. Specificity was also high across all folds, averaging 98.77%, indicating effective identification of true negative cases. The FNR averaged 3.94%, indicating a low likelihood of missing positive cases, which is vital in clinical or sensitive contexts. The NPV was strong at 98.83%, reflecting high confidence in negative predictions. Regarding precision, the model achieved an average of 96.56%, showcasing a low rate of FPs. The recall, measuring the model's ability to detect all actual positive cases, averaged 96.06%, indicating general effectiveness in identifying positive instances. The F1-score, representing the harmonic mean of precision and recall, stood at 96.29%, confirming the model's balanced performance.
In terms of fold-wise performance, Fold 2 excelled across most metrics, particularly in precision (98.27%), recall (98.01%), and F1-score (98.12%). In contrast, Fold 4 exhibited the weakest performance, with an accuracy of 97.44%, recall of 93.16%, and F1-score of 93.52%, suggesting some variability in model generalization based on data splits. Overall, the ConvNeXtBase model demonstrated excellent and stable performance across different folds, with only minor fluctuations in metrics.
The results in Table 6 and Figure 6 showed that the ConvNeXtBase model demonstrated consistently strong performance across all five folds. Fold 2 achieved the highest metrics, with an accuracy of 98.86%, specificity of 99.15%, NPV of 99.20%, precision of 98.27%, recall of 98.01%, and F1-score of 98.12%. In contrast, Fold 4 recorded the lowest values: 97.44% accuracy, 98.20% specificity, 98.27% NPV, 93.92% precision, 93.16% recall, and 93.52% F1-score. The FNR varied from a low of 1.99% in Fold 2 to a high of 6.84% in Fold 4, indicating greater class imbalance or more challenging cases in that split. On average, the model achieved an accuracy of 98.29%, specificity of 98.77%, FNR of 3.94%, NPV of 98.83%, precision of 96.56%, recall of 96.06%, and F1-score of 96.29%.
These averages suggest that ConvNeXtBase was highly effective at accurately identifying both positive and negative cases, with particularly strong specificity and NPV reflecting a very low rate of false positives. The slightly lower average recall and F1-score, influenced mainly by Fold 4, indicated that a small proportion of true positive cases were missed in the most difficult partition. Nevertheless, the overall performance of the model remained robust and well balanced.
Table 7 and Figure 7 show that the MobileNet model achieved consistently strong performance across all five cross-validation folds. Accuracy ranged from 98.01% (fold 3) to 99.00% (fold 1), resulting in an average accuracy of 98.66%. Specificity was also high, varying between 98.60% and 99.30%, with a mean of 99.08%. The FNR was low overall, ranging from 2.96% to 4.52% (mean 3.45%), indicating that the model rarely missed positive cases. Correspondingly, the NPV remained above 98.56% for every fold, averaging 99.08%, which reflects strong confidence in negative predictions. On the positive side, precision varied from 94.90% (fold 3) to 97.60% (fold 1), averaging 96.27%. Recall (sensitivity) ranged from 95.48% to 97.04%, with a mean of 96.55%. The resulting F1-scores ranged between 95.73% and 97.19%, averaging 96.37%, signifying that the model effectively balanced precision and recall. Overall, MobileNet maintained high discriminatory power (average specificity 99.08%) and robust detection capability (average recall 96.55%), with very few false negatives, underscoring its reliability for this classification task.
Table 8 and Figure 8 report that the ResNet-101 model demonstrated strong performance across all five cross-validation folds. Its accuracy varied from 98.01% in Fold 1 to 99.15% in Fold 4, resulting in an average accuracy of 98.52%. Specificity was consistently high, ranging between 98.58% and 99.45% (mean = 98.97%), indicating that the network reliably identified negative cases with very few false positives. The FNR reached as low as 1.52% in Fold 4 and never exceeded 6.08%, averaging only 3.32% across the folds—showing that the model rarely missed positive instances. The NPV remained above 98.66% in every fold (mean = 98.98%), confirming that when the model predicted a negative outcome, it was almost always correct. Although precision dropped to 94.77% in Fold 2, it stayed above 95% in the other folds (mean = 96.25%), indicating that most positive predictions were true positives. Recall (sensitivity) was also strong, ranging from 93.92% to 98.48% (mean = 96.68%), reflecting the model's ability to detect actual positives. Finally, the F1-score—a harmonic mean of precision and recall—varied from 94.86% to 97.59%, averaging 96.40%, confirming a balanced trade-off between these two metrics throughout the validation process.
Table 9 and Figure 9 record that the VGG-16 model demonstrated consistently high performance across all five cross-validation folds. The accuracy ranged from 97.86% in Fold 1 to 98.72% in Folds 2 and 4, yielding an overall mean accuracy of 98.49%. Specificity was similarly robust, varying only between 98.46% and 99.14% (mean = 98.93%), which indicated that the model very reliably identified negative cases. The FNR ranged from 2.94% to 8.09% across folds, with an average of 4.36%, reflecting that only a small proportion of positive cases were missed. Correspondingly, the NPV remained consistently high (mean = 98.99%), confirming that predictions of "negative" were almost always correct. In terms of positive-case detection, precision values spanned from 94.85% (Fold 3) to 98.15% (Fold 2), averaging 96.67%. Recall (sensitivity) was equally strong, ranging from 91.91% to 97.06% (mean = 95.64%), demonstrating that the model captured the vast majority of true positives. The resulting F1-scores (harmonic mean of precision and recall) fell between 93.81% and 97.55%, with an average of 96.03%, underscoring a well-balanced trade-off between precision and recall. Overall, these results showed that VGG-16 achieved excellent discrimination ability, maintained low error rates, and delivered highly reliable positive and negative predictions across all folds.
Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 illustrate the training and validation loss for the six DL techniques: EfficientNet-B3, ConvNeXtBase, DenseNet-169, MobileNet, ResNet-101, and VGG-16. From Figure 10, the EfficientNet-B3 model achieved successful convergence during the 25-epoch training period, with both training and validation losses consistently decreasing throughout the process. The most significant improvements were observed in the first 10 epochs.
At the start, the training loss was approximately 4.0 (epoch 0) with a steep decline noted during the initial epochs (0–5). After 15 epochs, the loss stabilized around 1.0, and by epoch 25, the final training loss reached approximately 0.5. The validation loss began slightly higher than the training loss (around 4.0+) and followed a similar decreasing trend. A small but consistent gap remained between the training and validation curves, with the final validation loss settling around 1.0 at epoch 25, without any significant divergence between the curves.
The parallel downward trend indicated good generalization, and the consistent gap of approximately 0.5 suggested mild but acceptable overfitting. The most rapid improvements occurred between epochs 0–10, with a noticeable slowdown in loss reduction after epoch 15. The model achieved its lowest loss values in the final five epochs, exhibiting stable and consistent learning behavior. The validation loss maintained a reasonable correlation with the training loss, indicating successful training and good convergence.
No signs of underfitting or severe overfitting were detected. The loss scale (0.5–4.0) was typical for well-initialized models, and the convergence pattern aligned with expectations for EfficientNet architectures. The 25-epoch duration proved sufficient for near-complete convergence. Overall, the EfficientNet-B3 model completed training with successful convergence, demonstrating a characteristic learning curve marked by steady loss reduction and effective learning, with final loss values indicating a stable and well-performing state by the end of the training cycle.
The EfficientNet-B3 model showed impressive learning capabilities over its 25-epoch training cycle. Training accuracy started at around 75% and displayed rapid, near-linear improvement during the first 10 epochs, exceeding 90% accuracy by epoch 10. Validation accuracy began at a slightly lower value (~73%) but closely followed the training curve, maintaining a consistent 2–3% gap throughout the training period. Both curves moved in stable parallel alignment after epoch 5, with no significant divergence noted. The most substantial gains happened between epochs 0–15, after which the rate of improvement slowed as the model approached its performance ceiling. By epoch 25, training accuracy leveled off at approximately 94–95%, while validation accuracy stabilized at 92–93%. The minimal and consistent gap between the curves suggested excellent generalization with negligible overfitting. The model reached its target validation accuracy benchmark (90%) by epoch 12 and continued to improve consistently through the rest of the training. The smooth, stable progression of both accuracy curves reflected well-tuned hyperparameters and effective learning dynamics characteristic of the EfficientNet architecture.
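Curves of this kind can be reproduced from the history object returned by Keras training; the helper below is a hedged sketch, with `history` assumed to be the return value of `model.fit` with accuracy tracked as a metric.

```python
# Hedged sketch of how curves like Figures 10-15 can be drawn; `history` is
# assumed to be the object returned by Keras model.fit with accuracy tracked.
import matplotlib.pyplot as plt

def plot_curves(history, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history.history["loss"], label="training loss")
    ax1.plot(history.history["val_loss"], label="validation loss")
    ax1.set(xlabel="epoch", ylabel="loss", title=title)
    ax1.legend()
    ax2.plot(history.history["accuracy"], label="training accuracy")
    ax2.plot(history.history["val_accuracy"], label="validation accuracy")
    ax2.set(xlabel="epoch", ylabel="accuracy")
    ax2.legend()
    # A persistent train/validation gap signals overfitting; parallel curves,
    # as described in the text, signal good generalization.
    fig.tight_layout()
    return fig
```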
From Figure 11, the ConvNeXtBase model showed effective learning behavior over its 20-epoch training cycle. Training loss started at approximately 4.0 and experienced a sharp, consistent decline during the initial phase, reaching about 1.0 by epoch 10. Validation loss began at a similar starting point but exhibited slightly more fluctuation, particularly between epochs 5–15, where it maintained a consistent gap of 0.3–0.5 above the training loss. Both curves progressed in parallel after epoch 15, converging to approximately 0.5–0.7 by the final epoch. The model achieved its most significant loss reduction in the first 10 epochs, after which improvements slowed considerably. While a persistent but narrowing gap existed between training and validation losses throughout the training, no concerning divergence occurred, indicating the model generalized reasonably well without severe overfitting. The final convergence pattern suggested that the 20-epoch duration was sufficient for this architecture to approach stability.
The ConvNeXtBase model showed strong learning behavior during its 20-epoch training cycle. Training accuracy started at about 75% and quickly improved in the early stages, exceeding 85% by epoch 5. Validation accuracy began at a slightly lower level (~72%) but followed the training curve closely. After epoch 5, both metrics maintained a consistent parallel trajectory, with validation accuracy lagging behind training accuracy by a steady 2–4% margin throughout the training period. The most significant gains happened between epochs 0–10, where accuracy increased at an almost linear rate of ~2% per epoch. After epoch 15, the rate of improvement slowed considerably as the model approached its performance limit. By the final epoch, training accuracy settled at approximately 94–95%, while validation accuracy stabilized at 91–92%. The model reached the 90% validation benchmark by epoch 12 and continued to show small gains in the following epochs. The persistent but narrow accuracy gap suggested mild overfitting, although the parallel curve alignment indicated that this did not significantly affect generalization capabilities.
From Figure 12, the DenseNet-169 model underwent a thorough 30-epoch training cycle, during which both training and validation losses showed characteristic convergence behavior. The training loss started at a relatively high value and displayed a steep, consistent decline throughout the first 15 epochs, gradually stabilizing thereafter. The validation loss began slightly higher than the training loss and followed a similar downward path, although with slightly greater fluctuation, especially during the mid-training phase (epochs 10–20). A noticeable but moderate gap persisted between the two curves throughout the entire training period, narrowing to approximately 0.3–0.5 units by epoch 30. The most significant loss reduction occurred in the first 20 epochs, after which both curves progressed in near-parallel alignment, indicating stabilization. Final loss values settled around 1.0 for training and 1.3–1.5 for validation, suggesting that the extended 30-epoch duration was beneficial for this architecture's convergence. The consistent downward trend without significant divergence reflected effective learning with no severe overfitting concerns.
The DenseNet-169 model showed steady but gradual learning progress over its 30-epoch training cycle. Training accuracy started at about 60% and increased consistently throughout the training period. Validation accuracy also began at a similar level but showed slightly more fluctuations, especially during the mid-training period (epochs 10–20). Both accuracy curves kept a consistent gap of 5–8 percentage points, with validation accuracy consistently lagging behind training accuracy. The most significant improvements happened between epochs 5–20, where training accuracy rose from around 65% to 75%. By epoch 25, training accuracy neared 80%, while validation accuracy reached about 72–74%. The model continued to show slight gains in the final epochs, although the rate of improvement slowed significantly after epoch 25. The persistent accuracy gap suggested moderate overfitting challenges, while the stabilization of both curves indicated that the model was nearing its performance ceiling. The 30-epoch duration proved essential for this architecture to achieve meaningful accuracy gains, even though the final validation accuracy remained below standard benchmarks for modern architectures.
From Figure 13, MobileNet's architecture showed a strong learning curve throughout its 25-epoch training cycle. Training loss started at about 6.0 and had a steep, almost linear decline during the first 10 epochs, reaching around 2.0 by epoch 10. Validation loss began at a similar high value but showed more fluctuations during the early training phase, briefly spiking around epoch 5 before continuing its downward trend. Both curves moved closely together after epoch 10, with training loss keeping a steady 0.2–0.5 unit advantage over validation loss. The most significant improvements happened in the first 15 epochs, after which the rate of loss reduction slowed considerably. By epoch 25, training loss ended up at about 0.5 while validation loss settled at around 1.0. The ongoing but narrowing gap between curves suggested mild overfitting, though the similar final convergence implied that the model achieved stable generalization. The extended 25-epoch duration was advantageous for this architecture, allowing it to fully utilize its efficient design features.
MobileNet’s architecture showed strong learning capabilities during its 25-epoch training cycle, although with noticeable early fluctuations. Training accuracy started at around 60% and demonstrated rapid, near-linear improvement during the first 10 epochs, exceeding 70% by epoch 5 and reaching 80% by epoch 15. Validation accuracy began at a similar level but experienced significant variations in the early training phase (epochs 0–10), including a temporary dip around epoch 5 before recovering robustly. After epoch 10, both curves moved in stable parallel alignment, with validation accuracy keeping a consistent 3–5% gap below training accuracy. The most significant gains took place between epochs 5–15, where accuracy improved at about 2% per epoch. After epoch 20, the rate of improvement slowed as the model neared its performance ceiling, with training accuracy converging to around 85–86% and validation accuracy stabilizing at 81–82% by epoch 25. The model reached the 80% validation benchmark by epoch 15 and continued to show incremental gains throughout the remaining training. The early fluctuations in validation accuracy indicated initial sensitivity to hyperparameters or data variations, while the stable late-stage convergence with a moderate gap suggested mild but manageable overfitting. The final accuracy values illustrated the efficiency of the MobileNet architecture for the given task, with validation accuracy demonstrating respectable performance despite the early fluctuations.
From Figure 14, the ResNet-101 model showed a typical learning trend over its 25-epoch training period. The training loss started near 5.0 and experienced a quick initial drop, decreasing to about 2.0 within the first 5 epochs. The validation loss began at a similar level but showed slightly more fluctuations during this early stage. Both curves followed a steady downward path throughout the training cycle, with the most significant reductions occurring before epoch 15. A consistent but moderate gap of about 0.5 to 1.0 units remained between the training and validation losses, with the validation curve staying consistently above the training curve. After epoch 15, the rate of improvement slowed significantly, with both losses moving in near-parallel alignment toward their final values. By epoch 25, the training loss reached about 0.5 while the validation loss settled around 1.5. The stable convergence pattern without late-stage divergence suggested that the model achieved effective learning with no significant overfitting issues. The 25-epoch duration proved sufficient for considerable convergence.
The ResNet-101 model showed strong learning performance during its 25-epoch training cycle. Training accuracy started at about 75% and quickly improved in the first phase, exceeding 85% by epoch 5 and reaching 90% by epoch 10. Validation accuracy began at a slightly lower level (~73%) but followed the training curve closely, maintaining a consistent 2–4% gap throughout the training period. Both metrics followed a nearly parallel path after epoch 5, showing no significant divergence. The most significant improvements happened in the first 15 epochs, with accuracy growing by about 1.5% per epoch. After epoch 15, the rate of improvement slowed as the model neared its performance ceiling, with training accuracy leveling off around 94–95% and validation accuracy stabilizing at 91–92% by epoch 25. The model reached the important 90% validation benchmark by epoch 12 and continued to show small gains in the remaining training cycles. The persistent but narrow accuracy gap suggested mild overfitting, while the absence of late-stage fluctuations indicated stable learning dynamics. The parallel movement of both curves implied that the model generalized well to validation data, with the final validation accuracy reflecting strong performance for this architecture. The 25-epoch duration proved adequate to approach the model’s maximum capability, as shown by the plateau effect in the final training phase.
From Figure 15, the VGG-16 model went through a challenging training process over about 20 epochs, as shown by its unique loss profile. Training loss started at a relatively high value near 20.0 and showed a steady downward trend throughout the training cycle. Validation loss began at a similar high level and exhibited more variability, especially during the middle training phase (epochs 5–15). A significant gap of 5–7 units remained between the training and validation curves during the first 15 epochs, narrowing slightly but still being considerable (2–4 units) by the final epochs. The most significant reduction in loss happened in the first 5 epochs, after which the rate of improvement slowed considerably. By epoch 20, training loss settled at around 5.0 while validation loss stabilized between 7.0–9.0. The consistent difference between the curves indicated notable overfitting challenges, while the relatively high final loss values suggested that the model had difficulty achieving optimal convergence within the 20-epoch cycle.
The VGG-16 model showed challenging learning dynamics during its 20-epoch training cycle. Training accuracy started at approximately 70% and steadily improved throughout the training period, progressing at a near-linear rate to reach 85% by epoch 15. Validation accuracy began at a significantly lower starting point (~65%) and diverged from the training curve. A substantial performance gap of 15–20 percentage points persisted during the early training phase (epochs 0–10), narrowing only modestly to 10–12 points by epoch 20. The validation curve displayed notable volatility, particularly between epochs 5–15, where it experienced multiple fluctuations instead of consistent improvement. While training accuracy reached 90% by epoch 17.5, validation accuracy stalled around 78–80% during the same period. The most significant gains occurred before epoch 10, after which validation accuracy showed minimal improvement despite continued training progress. By the final epoch, training accuracy converged to approximately 92–93%, while validation accuracy plateaued at 80–82%. The persistent and substantial accuracy gap indicated significant overfitting challenges, while the validation curve’s volatility suggested sensitivity to hyperparameters or data characteristics. The model did not achieve the 85% validation benchmark within the training cycle, with late-stage stagnation implying that additional epochs would yield diminishing returns. These results reflected the architectural limitations of VGG-16 for this specific task, particularly its tendency toward overfitting without extensive regularization techniques.
Hence, EfficientNet-B3, DenseNet-169, and ConvNeXtBase demonstrated the best overall performance. These architectures achieved the highest training and validation accuracies (above 94% on average) while maintaining consistently low validation losses across folds, indicating excellent generalization and stable learning. MobileNet and ResNet-101 also performed well, achieving strong accuracy levels above 92% with relatively low losses, although they exhibited slightly more fluctuations in validation performance compared to the top models. In contrast, VGG-16 showed the weakest results, with lower validation accuracy (around 85–90%) and higher, more unstable validation losses, suggesting less robust generalization and potential overfitting. Overall, EfficientNet-B3 emerged as the most reliable model due to its combination of rapid convergence, high validation accuracy, and minimal loss variations, making it the best candidate for applications requiring high precision and stability.
Table 10 presents a comparative performance analysis of the six DL techniques.
Figure 16 shows the confusion matrices for EfficientNet-B3, ConvNeXtBase, DenseNet-169, MobileNet, ResNet-101, and VGG-16, evaluated on the test set of the EBTC dataset. Of the dataset, 1403 images (80%) were designated for training, and the remaining 20% (351 images) were set aside for testing. The EBTC dataset includes four classes: HGC (469 images), LGC (647 images), NTL (134 images), and NST (504 images). The test set consists of 102 images from the HGC class, 124 images from the LGC class, 25 images from the NTL class, and 100 images from the NST class.
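As an illustration of how matrices like those in Figure 16 can be generated, the sketch below uses scikit-learn; since the trained models and test arrays are not reproduced here, random placeholder predictions with roughly the reported error count stand in for real model outputs.

```python
# Sketch of how the Figure 16 matrices can be produced with scikit-learn.
# The trained models and test arrays are not reproduced in the paper, so
# random placeholder predictions with roughly the reported error count
# stand in for real model outputs.
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

CLASSES = ["HGC", "LGC", "NST", "NTL"]

rng = np.random.default_rng(0)
y_test = rng.integers(0, 4, size=351)            # placeholder test labels
y_pred = y_test.copy()
flip = rng.choice(351, size=9, replace=False)    # ~97% accuracy, as reported
y_pred[flip] = rng.integers(0, 4, size=flip.size)

cm = confusion_matrix(y_test, y_pred, labels=range(len(CLASSES)))
ConfusionMatrixDisplay(cm, display_labels=CLASSES).plot(
    cmap="Blues", values_format="d")             # integer counts per cell
```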
The EfficientNet-B3 model showed strong classification performance with notable strengths and minor limitations across the four tissue classes (HGC, LGC, NST, and NTL). The model achieved exceptional precision in identifying LGC (123 correct predictions) and NST (99 correct), with near-perfect recall for NST (99/100 = 99%) and solid recall for NTL (22/25 = 88%). However, it displayed moderate confusion between the HGC and LGC classes, where 4 HGC samples were misclassified as LGC (FN) and 1 LGC sample was incorrectly assigned to HGC (FP). The NTL class encountered minor challenges, with 3 samples misidentified as NST, indicating potential morphological similarities between these categories. Importantly, no cross-confusion occurred between biologically distinct classes (e.g., HGC vs. NST or LGC vs. NTL), reflecting the model's effective learning of key discriminative features. The overall accuracy reached 97.4% (342 correct predictions out of 351 total samples), although the HGC class exhibited reduced recall (98/102 ≈ 96.1%) due to the LGC misclassifications. These results highlighted the model's proficiency while emphasizing HGC-LGC differentiation as the primary area for potential improvement.
The ConvNeXtBase demonstrated strong classification abilities with minimal confusion between the four tissue categories (HGC, LGC, NST, and NTL). The model achieved outstanding results for LGC (125 correct predictions) and NTL (28 out of 29, approximately 96.6% recall), showing near-perfect specificity for these classes. However, it encountered significant difficulties in identifying HGC, with 4 samples misclassified (1 as LGC, 2 as NST, and 1 as NTL), resulting in the lowest class recall (91 out of 95, approximately 95.8%). Similarly, the NST class experienced moderate confusion, with 5 errors (1 misassigned to HGC, 3 to LGC, and 1 to NTL), leading to a recall of 96 out of 101, approximately 95.0%. Cross-class confusion primarily involved histologically similar categories: HGC-NST misclassifications hinted at difficulties in distinguishing high-grade features, while NST-LGC errors suggested possible overlap in stromal patterns. The model maintained strong diagonal dominance with 340 correct predictions out of 351 total samples (96.9% accuracy), although the error distribution revealed opportunities to enhance feature extraction for differentiating HGC and NST. Importantly, no FPs were found for LGC against NTL, confirming effective learning of essential diagnostic boundaries.
The DenseNet-169 model showed strong classification performance with excellent precision across most tissue classes, though minor inter-class confusion remained. The architecture achieved remarkable results for HGC (94 correct predictions, 1 misclassified as LGC) and NST (104 correct, 2 errors), reflecting near-perfect recall rates of 98.9% and 99.0%, respectively. However, it faced slight challenges in distinguishing the LGC and NTL categories: LGC experienced 4 misclassifications (2 as HGC, 1 as NST, 1 as NTL), while NTL encountered 3 errors (1 as LGC, 2 as NST), resulting in the lowest class recall (22/25 = 88.0%). Notably, the model preserved the critical diagnostic boundary between HGC and NTL, with no cross-confusion between that pair. The confusion between LGC and NTL implied potential similarities in stromal presentation, while NST's misclassification as LGC suggested minor feature overlap in necrotic patterns. With 340 correct predictions out of 351 total samples (96.9% accuracy), the model's performance was highly competitive, though the NTL class represented the primary opportunity for improvement. The error distribution underscored the architecture's proficiency in high-grade cancer identification while revealing subtle challenges in lower-grade and normal tissue differentiation.
The MobileNet model demonstrated outstanding classification performance with minimal confusion between classes across the four tissue categories (HGC, LGC, NST, and NTL). The model achieved nearly perfect results for HGC (94 correct predictions, 1 misclassified as NTL) and NST (105 correct, 2 minor errors), resulting in impressive recall rates of 98.9% and 99.1%, respectively. LGC identification was particularly effective, with only 3 misclassifications (2 as HGC, 1 as NST) out of 124 samples (97.6% recall). The NTL class showed slight difficulties, with 2 errors (1 as HGC, 1 as NST) against 23 correct predictions (92.0% recall), marking the highest error rate. Importantly, the model maintained excellent diagnostic boundaries between biologically distinct categories: no FPs were recorded between LGC-NTL or HGC-NST pairs, and no confusion was found between LGC and NTL classes. The single HGC-NTL misclassification implied potential ambiguity in normal tissue boundaries, while the NST-LGC error reflected minor feature overlap in stromal patterns. With 343 correct predictions out of 351 total samples (97.7% accuracy), the performance exceeded that of other architectures in overall precision. The highly diagonal confusion matrix illustrated the model’s superior feature discrimination capabilities, with NTL differentiation remaining the only area for potential enhancement despite its already strong 92% class recall.
The ResNet-101 model demonstrated excellent classification capabilities, showing particularly strong performance in critical diagnostic categories. The architecture achieved near-perfect results for NTL (27 correct predictions, 100% recall) and HGC (96 correct, 99.0% recall), with only one HGC sample misclassified as NST. LGC identification was robust (123 correct) but showed minor confusion with NST (3 misclassifications). The NST class faced moderate challenges, with 5 errors (1 as HGC, 3 as LGC, 1 as NTL) against 95 correct predictions (95.0% recall), representing the lowest performance among the classes. Critically, the model maintained essential diagnostic boundaries: no FPs occurred between HGC-NTL or LGC-NTL pairs, and no NTL samples were misassigned to other categories. The primary confusion occurred between histologically similar LGC and NST classes (accounting for 6 out of 10 errors), suggesting feature overlap in stromal presentation. With 341 correct predictions out of 351 samples (97.2% accuracy), the overall performance was highly competitive, though the error distribution highlighted opportunities to enhance NST differentiation. The isolated HGC-NST misclassification indicated rare edge cases where high-grade and necrotic features proved ambiguous, while the perfect NTL recall underscored the model’s proficiency in identifying normal tissue boundaries.
The VGG-16 model demonstrated strong classification capabilities, achieving excellent performance in three categories but exhibiting notable challenges in identifying NTL. The architecture achieved near-perfect results for HGC (95 correct predictions, 100% recall) and LGC (124 correct, 100% recall), with no misclassifications observed in these critical diagnostic categories. NST identification was similarly robust (99 correct predictions, 99% recall), with only one sample misclassified as LGC. However, the model faced significant difficulties with NTL recognition, where 2 of 25 samples were misidentified as NST, resulting in 92% recall for this class. The primary confusion occurred between histologically similar NST and NTL categories, suggesting challenges in distinguishing necrotic patterns from normal tissue boundaries. Importantly, the model maintained perfect diagnostic boundaries between cancerous (HGC/LGC) and normal (NTL) tissue types, with no FPs in these critical categories. The overall accuracy reached 97.4% (342 correct out of 351 samples), though the NTL errors highlighted ongoing challenges in normal tissue identification that aligned with observations from the architecture’s loss curve.
4.4. Statistical Analysis of Experimental Results
To evaluate the dependability of the results from EfficientNet-B3, a comprehensive statistical analysis was conducted. This analysis focused on the variance, standard deviation, standard error of the mean (SEM), and the confidence interval (CI) of the accuracy. The variance measures the average squared deviation of each data point from the mean of the data set; it quantifies how spread out the values are. The standard deviation is the square root of the variance; it expresses the average distance of the data points from the mean in the same units as the original data, making it more interpretable than the variance. The SEM measures how much the sample mean differs from the actual population mean; it indicates the extent to which the sample mean would change if we collected multiple samples [28]. A CI is a range of values, derived from sample data, that is likely to contain the true population parameter (e.g., mean, accuracy) with a specified level of confidence (commonly 95% or 99%). CIs provide statistical reliability and robustness to model evaluation metrics (e.g., accuracy, F1-score, AUC). Instead of reporting a single number, CIs give a range that reflects uncertainty due to sample variation, which is particularly important in DL models trained on limited or imbalanced data [29]. The recorded measurements were obtained after performing the five-fold cross-validation procedure. The detailed evaluations of the six DL models for the EBTC dataset can be found in Table 10, Table 11 and Table 12, as well as Figure 17, Figure 18 and Figure 19.
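The statistics in Tables 11 and 12 follow directly from these definitions. As a check, the sketch below recomputes EfficientNet-B3's SEM, margin of error, and 95% CI from its published mean accuracy and standard deviation over k = 5 folds; the normal z = 1.96 approximation is assumed, which matches the reported interval.

```python
# Recomputing EfficientNet-B3's interval statistics (Tables 11 and 12) from
# its published mean accuracy and standard deviation over k = 5 folds; the
# normal z = 1.96 approximation is assumed, matching the reported interval.
import math

mean, sd, k = 99.03, 0.3519, 5
variance = sd ** 2                     # 0.1238
sem = sd / math.sqrt(k)                # 0.1574
margin = 1.96 * sem                    # 0.3084, the margin of error
ci = (mean - margin, mean + margin)    # (98.72, 99.34), as in Table 12
print(f"variance={variance:.4f}, SEM={sem:.4f}, "
      f"95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```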
Table 11 shows that the EfficientNet-B3 model delivered peak performance with a 99.03% mean accuracy (range: 98.58–99.72%), supported by moderate consistency (SD = 0.3519) and precise estimation (SEM = 0.1574) reflecting tight CIs.
The ConvNeXtBase achieved a slightly lower mean accuracy (98.29%, range: 97.44–98.86%) with greater result dispersion, evidenced by higher variability metrics (SD = 0.4681, variance = 0.2192) and reduced precision of the mean (SEM = 0.2094).
The DenseNet-169 attained 98.32% mean accuracy (range: 97.29–98.72%) but showed the most substantial performance fluctuations across trials, indicated by peak variability measures (SD = 0.5284, variance = 0.2792) and reduced mean reliability (SEM = 0.2363).
The MobileNet achieved robust 98.66% mean accuracy (range: 98.01–99.00%) with high result stability (SD = 0.3442) and confident mean estimation (SEM = 0.1540), demonstrating reliable performance.
The ResNet-101 yielded 98.52% mean accuracy (range: 98.01–99.15%) with moderate outcome spread (SD = 0.4089) and acceptable mean uncertainty (SEM = 0.1829).
The VGG-16 produced the most statistically stable results: 98.49% mean accuracy within the narrowest range (97.86–98.72%), minimal variability (lowest SD = 0.3198, variance = 0.1023), and the tightest mean confidence (lowest SEM = 0.1430).
Hence, EfficientNet-B3 delivered the highest average accuracy, while VGG-16 exhibited the most stable and consistent performance based on its low σ and SEM. DenseNet-169 and ConvNeXtBase showed the highest variability, implying less reliability across different test scenarios.
Table 12 shows that EfficientNet-B3 had a standard deviation of 0.3519, indicating relatively moderate variation in performance across test runs. Its SEM was 0.1574, leading to a margin of error of 0.3084 and a confidence interval ranging from 98.72% to 99.34%. This implied the model achieved high accuracy with relatively tight precision.
The ConvNeXtBase exhibited a higher standard deviation (0.4681) and SEM (0.2094) than other models, resulting in a wider margin of error of 0.4103. The corresponding CI [97.88%, 98.70%] indicated greater uncertainty in its estimated performance, suggesting variability in results across different test sets.
The DenseNet-169 had the highest standard deviation (0.5284) and SEM (0.2363), which produced the widest margin of error of 0.4632 among all models. Its CI, ranging from 97.86% to 98.78%, showed that although its average performance was competitive, it suffered from significant variability, which reduced the reliability of its mean estimate.
The MobileNet maintained low variability, with a standard deviation of 0.3442 and a SEM of 0.1540, producing a margin of error of 0.3017. Its CI [98.36%, 98.96%] was among the narrower ones, indicating high precision and stability in its results.
The ResNet-101 had a standard deviation of 0.4089 and an SEM of 0.1829, yielding a margin of error of 0.3584. The confidence interval [98.16%, 98.88%] suggested moderate variability, and while the accuracy was fairly high, its reliability was slightly lower than that of MobileNet or VGG-16.
The VGG-16 showed the lowest variability overall, with the smallest standard deviation (0.3198), SEM (0.1430), and margin of error of 0.2803. The CI of [98.21%, 98.77%] was the narrowest, indicating that VGG-16’s performance was the most consistent and statistically stable among all models tested.
Hence, the EfficientNet-B3 attained the highest estimated accuracy with robust statistical precision. The VGG-16 exhibited the most dependable and consistent performance, showing the least uncertainty regarding its average accuracy. In contrast, the DenseNet-169 and ConvNeXtBase displayed more performance variability, indicating that their results may be more affected by different data splits or conditions.
From Table 13, the reported CI analysis based on the F1-score demonstrated the reliability of performance metrics across the evaluated models. EfficientNet-B3 achieved a narrow CI of [96.35, 98.38] with a margin of error of 1.02, indicating both high accuracy and stability. ConvNeXtBase showed a wider CI of [94.94, 97.63] and a larger margin of error (1.34), reflecting higher variability in its performance. DenseNet-169 produced a CI of [94.49, 96.42] with a margin of error of 0.97, suggesting moderate consistency. MobileNet exhibited the smallest margin of error (0.46) and the tightest CI [95.92, 96.83], indicating that it yielded the most stable performance among the models. ResNet-101 achieved a CI of [95.51, 97.29] with a margin of error of 0.89, reflecting balanced accuracy and reliability. VGG-16, on the other hand, produced a CI of [94.89, 97.17] with a margin of error of 1.14, suggesting greater variability compared to MobileNet and ResNet-101.
Overall, EfficientNet-B3 and MobileNet were characterized by the most consistent and robust results, while ConvNeXtBase and VGG-16 showed relatively higher variability.
Moreover, we implemented the paired t-test, as shown in Table 14. From Table 14, the performance of different models was compared against the baseline model, EfficientNet-B3, which achieved an accuracy of 99.03%. All other models showed lower accuracy values. ConvNeXtBase reached 98.29%, representing a difference of −0.74%, and the difference was statistically significant. DenseNet-169 obtained an accuracy of 98.32%, with a difference of −0.71%, which was also statistically significant. MobileNet achieved 98.66%, differing by −0.37% from the baseline, and this difference was significant. ResNet-101 showed an accuracy of 98.52%, with a −0.51% difference, which was significant. VGG-16 achieved 98.49%, differing by −0.54% from EfficientNet-B3, and the difference was statistically significant. The paired t-test yielded a t-statistic of 8.42 and a p-value of 0.0011. The t-statistic of 8.42 indicated a substantial difference, suggesting that the performance of EfficientNet-B3 is consistently better than that of the other models. The p-value of 0.0011 (< 0.05) implies that EfficientNet-B3's accuracy is statistically significantly higher than that of the other models.
Overall, EfficientNet-B3 consistently outperformed all other models, and the observed differences in accuracy were statistically significant at α = 0.05. This result confirms that EfficientNet-B3’s superior performance is not due to random chance and represents a meaningful improvement over the other architectures.
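The exact inputs to the paired t-test are not spelled out in the text; pairing the five competing models' mean accuracies against the EfficientNet-B3 baseline, as sketched below, reproduces the reported statistics.

```python
# Hedged reconstruction of the Table 14 paired t-test: the text does not
# spell out the pairing, but pairing the five competitors' mean accuracies
# against the EfficientNet-B3 baseline reproduces t = 8.42 and p = 0.0011.
import numpy as np
from scipy import stats

baseline = np.full(5, 99.03)  # EfficientNet-B3 mean accuracy, repeated
competitors = np.array([98.29, 98.32, 98.66, 98.52, 98.49])
# ConvNeXtBase, DenseNet-169, MobileNet, ResNet-101, VGG-16

t_stat, p_value = stats.ttest_rel(baseline, competitors)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # t = 8.42, p = 0.0011
```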
4.6. Discussion of Results in Light of Recent Advances
BLCA is recognized as a significant type of urological cancer, resulting in approximately 196,500 deaths. It ranks as the 9th leading cause of cancer deaths among men and the 19th among women [5,6]. The manual classification of muscular tissues by pathologists is a labor-intensive task that heavily relies on their expertise. This dependency can introduce variability among observers, particularly due to the similarities in the morphology of cancerous cells. Traditional methods for analyzing endoscopic images are often time-consuming and resource-intensive, complicating the efficient differentiation of various tissue types. Despite progress in early detection, robotic surgical techniques, and immunotherapy that have contributed to improved survival rates, BLCA continues to be a significant and increasing health concern globally, particularly in developed countries [1]. Consequently, there is a pressing need for a fully automated and reliable system to categorize BLCA images.
To tackle these challenges, this research presented a refined EfficientNet-B3 model specifically for detecting BLCA. This model aims to assist clinicians in identifying BLCA at an earlier stage, thereby decreasing both diagnostic time and costs. Additionally, the research employed a five-fold cross-validation method to improve the accuracy of the EfficientNet-B3 model. This method allows for effective parameter adjustments by partitioning the data into five subsets and repeating the process five times, with each subset serving as the validation set once. Five-fold cross-validation is a notable example of k-fold cross-validation, where k can be any integer greater than 1, with common values being 3, 5, or 10. This technique is widely recognized for providing a more accurate and reliable evaluation of DL models on unseen data.
The main goal of our experiment was to identify BLCA to enhance patient outcomes, streamline the diagnostic process, and significantly lower both patient time and costs. In our experiment, we utilized the EBTC dataset, allocating 80% (1403 images) for training and reserving the remaining 20% (351 images) for testing. We again employed the six DL models—EfficientNet-B3, ConvNeXtBase, DenseNet-169, MobileNet, ResNet-101, and VGG-16—each pre-trained on the ImageNet dataset in a supervised manner.
Furthermore, we incorporated a five-fold cross-validation technique. This process involved dividing the training dataset into five equal subsets. In each iteration, one subset served as the validation set while the other four were used for training the model. Each iteration represented a distinct training and validation cycle that updated the model’s parameters. This procedure was repeated five times, with each subset acting as the validation set once. We calculated the average performance across all iterations to evaluate the model’s generalization capability.
Via the five-fold cross-validation, EfficientNet-B3 achieved an accuracy between 98.58% and 99.72% (mean 99.03%) and a specificity from 98.97% to 99.80% (mean 99.31%), indicating exceptional discrimination of healthy cases. Its FNR varied from 0.45% to 4.43% (mean 3.15%), while the NPV ranged from 99.04% to 99.79% (mean 99.36%), showing that negative predictions were almost always correct. Positive predictive value spanned 96.98% to 99.52% (mean 97.95%), and sensitivity ranged from 95.57% to 99.55% (mean 96.85%), with the F1-score balancing these at 96.50–99.53% (mean 97.37%). Overall, the model demonstrated outstanding specificity and precision, a low FNR, and robust accuracy, with only minor variability in recall across folds.
Maintaining an FNR between 0.45% and 4.43%, with an average of 3.15%, demonstrates high sensitivity. This means that the model rarely overlooks TP cases, enabling timely identification of patients with health conditions. As a result, earlier interventions can be made, leading to improved health outcomes. In healthcare, a low FNR minimizes the risk of denying necessary treatments to patients, which in turn lowers morbidity and mortality rates associated with delayed diagnoses. Additionally, a low FNR increases confidence in negative results, reducing the need for costly or invasive retests and improving the efficiency of patient management. From an algorithmic perspective, consistently maintaining a low FNR ensures that the model effectively detects positive cases across various datasets.
To evaluate the dependability of the results from EfficientNet-B3, a comprehensive statistical analysis was conducted. This analysis focused on the variance, standard deviation, and the CIs of the accuracy.
It is crucial to understand that the proposed system serves as a support tool rather than a substitute for human expertise. Every negative case flagged by the system must undergo clinical assessment, especially for high-risk individuals or ambiguous situations. Establishing thresholds informed by patient history, symptoms, and other risk elements may lead to additional testing or referrals to specialists for borderline cases. Moreover, consistently updating the model with new data from FN cases can reduce these mistakes over time.
Table 16 and Figure 21 show that the state-of-the-art studies consistently demonstrated strong classification performance on the EBTC dataset, but our EfficientNet-B3 approach outperformed them all. The comparison in Table 16 is limited to six references because the EBTC dataset, the primary focus of this work, was only introduced in 2023. Lazo et al. [15] attained 90% accuracy using a semi-supervised GAN, while Yıldırım [16] achieved 99.0% with a CBIR framework. Sunnetci et al. [17] evaluated various CNN-based and hybrid DL plus ML pipelines as well as a ViT, reaching 92.57%. Sharma et al. [18] combined a CNN with a Vision Transformer to yield 97.21%, and Kaushik et al. [19] reported 80% accuracy with a conventional CNN. Lutviana et al. [20] achieved 96.29% using another CNN architecture. In contrast, our model—fine-tuned EfficientNet-B3 with five-fold cross-validation—delivered 99.03% accuracy, surpassing prior methods in both mean performance and consistency across folds.
Compared to the approach in Yildirim, M. [16], which involved extracting deep features from multiple pre-trained CNNs—with DenseNet201 identified as the best feature extractor—and combining these 1000-dimensional vectors with traditional classifiers like Subspace KNN, as well as a content-based image retrieval (CBIR) pipeline, our method employed an end-to-end fine-tuned EfficientNet-B3 model. This model was trained directly on the raw images using five-fold cross-validation.
While Yildirim, M. [16] assessed seven CNN backbones along with texture descriptors (such as LBP and HOG) and seven similarity metrics to support both classification and CBIR tasks, our work concentrated solely on enhancing classification performance with a single architecture. This focus allowed us to achieve a comparable overall accuracy of 99.03% without the added complexity of feature-selection stages or external similarity-based retrieval. Furthermore, Yildirim, M. [16] processed images in their original form and separated feature extraction from classification, whereas our pipeline integrated image representation learning and decision-making into one cohesive model, thereby simplifying deployment and reducing preprocessing steps.
Our fine-tuned EfficientNet-B3 model outperforms the multi-stage pipeline in Yildirim, M. [16] for the following reasons:
EfficientNet employs compound scaling that balances depth, width, and resolution, resulting in significantly higher accuracy per parameter and per FLOP compared to traditional ConvNets. Specifically, EfficientNet-B3 outperforms ResNeXt-101 while using 18× fewer FLOPs, which leads to faster inference and reduced computing costs in deployment environments.
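For reference, the compound-scaling rule from the original EfficientNet paper (Tan and Le, 2019) couples the three dimensions through a single coefficient φ:

```latex
% Compound scaling (Tan & Le, 2019): one coefficient \phi jointly scales
% network depth, width, and input resolution.
\text{depth } d = \alpha^{\phi}, \qquad
\text{width } w = \beta^{\phi}, \qquad
\text{resolution } r = \gamma^{\phi},
\qquad \text{s.t.}\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha, \beta, \gamma \ge 1
```

With the grid-searched base values α = 1.2, β = 1.1, and γ = 1.15 reported in that paper, total FLOPs grow roughly as 2^φ, which is why a B3-sized model can match far larger networks at a fraction of the compute.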
EfficientNet models are smaller yet more accurate, making them less susceptible to overfitting with limited datasets. Research indicates that EfficientNets achieve state-of-the-art accuracy on transfer-learning tasks with fewer parameters compared to larger architectures. This characteristic makes them particularly suitable when labeled data is scarce.
Instead of using pre-extracted deep features combined with classical classifiers, as done in Yildirim, M. [16], our approach fine-tuned the entire EfficientNet-B3 backbone directly on BLCA images. This method enabled the network to learn domain-specific representations across all layers, often resulting in better classification performance compared to "frozen" feature plus separate-classifier pipelines.
Optimizing our model’s performance has been a priority; however, achieving 100% accuracy remains difficult. This is due to several factors, including variability in imaging quality, differences in image scanners, and inherent limitations within the dataset. Additional challenges arise from noise, artifacts, and interobserver variability. Despite these obstacles, our model demonstrates competitive performance when compared to existing methods, and we have thoroughly assessed its accuracy using standard metrics. The proposed DL model has shown potential to outperform other recent classifiers, particularly following parameter tuning. We believe our approach lays a strong foundation for further advancements in this field.