In this chapter, the results of hyperparameter tuning are presented first, including the final parameters used to train and test each model. A statistical power analysis is then presented to determine the minimum sample size required to reliably evaluate and compare the performance of the machine learning models. Leveraging these optimized parameters, the statistical results are then reported: descriptive statistics and performance comparisons for metrics such as test accuracy, test F1, test ROC AUC, and fit time. Finally, the results of the statistical analyses are provided, encompassing McNemar's test on the entire dataset, the 5 × 2 cross-validation paired t-test, the standard and corrected 10-fold cross-validation paired t-tests, the corrected repeated and resampled variants, the Wilcoxon signed-rank test, and the Friedman test with Nemenyi post-hoc analysis.
3.4. Statistical Tests
3.4.1. McNemar’s Test
McNemar's test was employed to assess statistically significant differences in classification performance across the referent models: Random Forest (RF), Artificial Neural Network (ANN), Logistic Regression (LR), Support Vector Machine (SVM), LightGBM (LGBM), CatBoost (CB), and XGBoost (XGB) (Table 6). A pronounced divergence was observed in all pairwise comparisons involving LR, which exhibited consistent underperformance relative to the other models, as evidenced by universally significant p-values (p < 0.001) and elevated chi-squared statistics (e.g., χ² = 47.70 vs. RF; χ² = 50.08 vs. ANN; χ² = 54.28 vs. CB). These results underscore LR's inferior discriminative capacity within the evaluated framework.
Statistically significant differences (α = 0.05) were further identified in exactly two additional pairwise comparisons: CB versus RF (p = 0.0315, χ² = 4.63) and CB versus LGBM (p = 0.0036, χ² = 8.49). The absence of significance in the CB/XGB comparison (p = 0.210) was noted, with the chi-squared value reported as 'nan', a condition typically arising when contingency table cells contain zero counts, precluding conventional test computation. This outcome, however, aligns with the non-significant p-value derived via the exact binomial variant of the test, suggesting parity in classification efficacy between CB and XGB.
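For illustration, the following is a minimal sketch of a pairwise McNemar comparison in Python, assuming statsmodels and two already-fitted scikit-learn-style classifiers (clf_a, clf_b, hypothetical names) evaluated on a held-out set; the fallback to the exact binomial variant mirrors the small-cell situation that produced the 'nan' chi-squared for the CB/XGB pair.

```python
# A minimal sketch of one pairwise McNemar test (statsmodels assumed).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(clf_a, clf_b, X_test, y_test):
    """Build the 2x2 contingency table of per-sample (in)correctness and test it."""
    correct_a = clf_a.predict(X_test) == y_test
    correct_b = clf_b.predict(X_test) == y_test
    table = np.array([
        [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    # Fall back to the exact binomial variant when a discordant cell is small,
    # the condition under which the chi-squared statistic is not computed.
    exact = min(table[0, 1], table[1, 0]) < 25
    result = mcnemar(table, exact=exact, correction=True)
    return result.statistic, result.pvalue
```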
3.4.2. 5 × 2 Cross-Validation Paired t-Test
A pairwise 5 × 2 cross-validation paired t-test was conducted to evaluate all model pairs included in this research, and the resulting outcomes are presented in Table 7, which is divided into two parts. The lower triangular section of Table 7, situated below the main diagonal, provides the total count of statistically significant differences (p < 0.05) detected across the five examined performance metrics (accuracy, precision, recall, F1-score, and ROC-AUC) for each model pair. For instance, the cell corresponding to the Random Forest (RF) and Logistic Regression (LR) comparison contains the value 5, indicating that statistically significant performance gaps (p < 0.05) were identified for all five metrics under the 5 × 2 paired t-test. This count ranges from 0 to 5, with zeros omitted from the table for readability. The upper triangular region of Table 7, located above the main diagonal, displays letters corresponding to the metrics for which a statistically significant difference emerged for a given pair of models: A for accuracy, P for precision, R for recall, F for F1-score, and C for ROC-AUC. Uppercase letters (A, P, R, F, C) denote instances where the column-based model (model1) significantly outperforms the row-based counterpart (model2), while lowercase letters (a, p, r, f, c) indicate the opposite, where model2 outperforms model1. In the RF/LR example above, the upper cell lists all five lowercase letters, meaning that LR significantly underperformed RF on every measured metric. This nomenclature remains consistent across the subsequent statistical tests and is therefore not elaborated again. Complete statistical test results for all model pairs, including exact t-scores and p-values, are available in Appendix A, Table A1.
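As a concrete illustration, the 5 × 2 CV paired t-test can be run with mlxtend's implementation; the sketch below is an assumption about tooling (the study's own implementation may differ) and substitutes a synthetic dataset for the study's data, with the scoring strings selecting each of the five metrics.

```python
# A minimal sketch of the pairwise 5x2 CV paired t-test via mlxtend.
from mlxtend.evaluate import paired_ttest_5x2cv
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data
rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(max_iter=1000)

# One test per metric, mirroring the A/P/R/F/C columns of Table 7
for metric in ("accuracy", "precision", "recall", "f1", "roc_auc"):
    t, p = paired_ttest_5x2cv(estimator1=rf, estimator2=lr,
                              X=X, y=y, scoring=metric, random_seed=42)
    print(f"{metric}: t = {t:.4f}, p = {p:.4f}")
```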
The conducted 5 × 2 CV paired t-test revealed that all comparisons involving Logistic Regression (LR) produced statistically significant differences (p < 0.05). This outcome corroborates the findings in both Figure 3 and Table 5, which indicate that Logistic Regression yields considerably poorer classification performance relative to the other methods, and confirms that the visibly weaker performance of LR observed in earlier tables and figures is robust to formal hypothesis testing. Among the other model comparisons, Support Vector Machine (SVM) was found to differ significantly from several counterparts on the ROC-AUC metric, while no statistically significant differences were detected for SVM under the remaining metrics.
Across the accuracy metric, LR was significantly outperformed by all models (for instance, RF vs. LR: t = 6.3684, p = 0.0014; ANN vs. LR: t = 7.4070, p = 0.0007; CB vs. LR: t = 5.2129, p = 0.0034). Inspection of precision revealed similar patterns; for example, LR showed a t-score of 4.1429 (p = 0.0090) relative to RF and a t-score of 5.5737 (p = 0.0026) when tested against CB. Under the recall metric, the disadvantages of LR were again confirmed by statistically significant comparisons such as XGB vs. LR (t = 2.0102, p = 0.0034) and SVM vs. LR (t = 2.9035, p = 0.0337). The F1 results likewise underscore the general weakness of LR, with significant p-values (typically below 0.03) for all pairs (e.g., RF vs. LR: t = 5.4417, p = 0.0028; CB vs. LR: t = 3.9013, p = 0.0114). Finally, the ROC-AUC analyses provided particularly strong evidence of LR's weakness, as seen in RF vs. LR (t = 8.5058, p = 0.0004), LGBM vs. LR (t = 9.1800, p = 0.0003), and XGB vs. LR (t = 7.6399, p = 0.0006).
Under ROC-AUC, SVM demonstrated inferior performance. Comparative analyses against models such as RF, LGBM, and CB frequently resulted in statistically significant differences, as indicated by small p-values and relatively large t-scores. These findings suggest that SVM's ranking capability deviated substantively from that of its counterparts (except LR), highlighting potential limitations in its discriminative effectiveness within the given classification framework (for example, RF vs. SVM: t = 4.8027, p = 0.0049; LGBM vs. SVM: t = 5.8171, p = 0.0021; XGB vs. SVM: t = 6.3247, p = 0.0015). By contrast, the other metrics did not show marked differences for SVM when tested against the same methods. These findings are consistent with the graphical depictions in Figure 3, which illustrate a noticeable inferiority for SVM under the ROC-AUC metric.
The overall impression is that most classifiers outside of LR did not differ significantly from one another across multiple measures, but that SVM stands out in its area-under-curve behavior, while LR sits at a clear disadvantage on virtually every metric.
3.4.3. 10-Fold Cross-Validation Paired t-Test
The results of the pairwise 10-fold cross-validated paired t-test are presented in Table 8, which summarizes the number of statistically significant differences across all models and all pairs evaluated in this study. Full and detailed 10-fold CV results are shown in Appendix A, Table A2.
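A minimal sketch of the underlying procedure follows, assuming scikit-learn and scipy; pairing is preserved by evaluating both models on the same StratifiedKFold splits and applying a paired t-test to the per-fold scores.

```python
# A minimal sketch of the standard 10-fold CV paired t-test:
# both models share the same folds, so the per-fold scores are paired.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_rf = cross_val_score(RandomForestClassifier(random_state=0),
                            X, y, cv=cv, scoring="accuracy")
scores_lr = cross_val_score(LogisticRegression(max_iter=1000),
                            X, y, cv=cv, scoring="accuracy")

t, p = stats.ttest_rel(scores_rf, scores_lr)  # paired over 10 folds, df = 9
print(f"t = {t:.3f}, p = {p:.3f}")
```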
A substantially higher number of performance differences was observed compared with the 5 × 2 cross-validation test: nearly every model pair exhibited a statistically significant difference on at least one metric. The sole exception was the XGBoost/CatBoost (XGB/CB) pairing, for which no differences were identified on any metric.
The disparities were predominantly concentrated in pairs involving the Logistic Regression (LR) model, which demonstrated systemic deficiencies in performance relative to the other algorithms. However, the 10-fold CV paired t-test also highlighted discrepancies in pairs containing CatBoost (CB), with CB exhibiting discriminative superiority over multiple counterparts, whereas LR consistently ranked as the lowest-performing model (Table 5).
Under the accuracy metric, CB exhibited consistently superior performance, shown by its significantly better outcomes against most of the competing models. XGB approached CB, with no conclusive difference established between them (t = 1.707, p = 0.122), suggesting comparable accuracy levels. Logistic Regression, on the other hand, consistently ranked lower than the other methods (with t-scores at or above roughly 5) and showed significant inferiority in the vast majority of pairwise comparisons. The other models demonstrated intermediate performance, frequently producing accuracy scores that were neither statistically distinguishable from the leading methods nor definitively inferior.
With respect to precision, three algorithms—ANN, CB, and XGB—emerged as top performers. Their pairwise comparisons did not yield statistically significant disparities, indicating that they occupied a similar performance tier. Each of these three methods, however, significantly surpassed RF, LR, SVM, and LGBM in multiple tests.
The recall results differ from the patterns observed in accuracy and precision. SVM proved to be the most outstanding model for this metric, recording a significantly higher recall than most of its counterparts. RF also demonstrated moderately high recall (RF vs. ANN: t = 2.949, p = 0.016; RF vs. LR: t = 14.933, p < 0.001) and did not differ significantly from strong competitors such as CB, XGB, SVM, and LGBM. Logistic Regression performed poorly once more, losing decisively in the majority of comparisons.
CB achieved prominent F1 scores, significantly surpassing many competing classifiers, including RF, ANN, LR, and LGBM. However, the differences between CB and certain other methods were statistically insignificant, suggesting that SVM and XGB perform at a comparable level for F1. Evaluation under the ROC-AUC metric indicated that CB, XGB, and ANN each occupied the top tier without significant differences among them. These methods were uniformly more effective than RF, LR, and SVM in multiple comparisons. LGBM performed acceptably but lagged behind the leading group, suggesting that CB and XGB offered the strongest separation capacity.
3.4.4. Corrected 10-Fold Cross-Validation Paired t-Test
The summarized results for the corrected version of the 10-fold cross-validation method, as proposed by [24,35], are presented in Table 9, while the complete statistical data for all pairwise comparisons and metrics are provided in Appendix A (Table A3).
The corrected 10-fold CV test effectively reduces bias and Type I error, leading to a lower number of model pairs exhibiting statistically significant differences compared to the standard 10-fold CV test. However, the overall pattern remains consistent, with the logistic regression (LR) model demonstrating clear inferiority across all evaluated metrics.
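The correction replaces the naive 1/n variance scaling of the standard paired t-test with (1/n + n_test/n_train), which accounts for the overlap between cross-validation training sets; a minimal sketch under that formulation follows (for 10-fold CV the ratio is 1/9, and the difference values below are placeholders).

```python
# A minimal sketch of the Nadeau-Bengio variance correction; `diffs` holds the
# per-fold score differences between two models (placeholder values below).
import numpy as np
from scipy import stats

def corrected_paired_ttest(diffs, test_train_ratio):
    """Corrected paired t-test: (1/n + n_test/n_train) replaces the naive 1/n."""
    diffs = np.asarray(diffs, dtype=float)
    n = diffs.size
    var_d = diffs.var(ddof=1)                      # sample variance of differences
    se = np.sqrt((1.0 / n + test_train_ratio) * var_d)
    t = diffs.mean() / se
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)         # two-sided p-value, n - 1 df
    return t, p

# For 10-fold CV each test fold is 1/9 the size of its training set
diffs = [0.010, 0.020, -0.005, 0.015, 0.000, 0.010, 0.020, 0.005, 0.010, 0.015]
t, p = corrected_paired_ttest(diffs, test_train_ratio=1.0 / 9.0)
print(f"t = {t:.3f}, p = {p:.3f}")
```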
RF occupied a middling position across the metrics. It showed notable superiority over LR on all metrics, for example, on accuracy (t = 6.967, p < 0.001) and precision (t = 2.787, p = 0.005), while no statistically significant differences were detected against ANN in most comparisons, except on ROC-AUC (t = −4.140, p < 0.001), where RF underperformed. CB consistently surpassed RF on almost all metrics (all except recall), for example, on accuracy (t = −4.783, p < 0.001) and precision (t = −4.461, p < 0.001), and XGB likewise outperformed RF in precision (t = −2.797, p = 0.005) and ROC-AUC (t = −4.552, p < 0.001). On most metrics, RF did not differ statistically from SVM, though it performed better on ROC-AUC (t = 3.923, p < 0.001). Overall, RF was neither conclusively the best nor the worst, frequently ranking in the mid-lower tier.
ANN demonstrated strong performance across multiple metrics, outperforming SVM (t = 2.888, p = 0.004) and LGBM (t = 2.020, p = 0.043) on precision, as well as surpassing SVM on ROC-AUC. However, it underperformed relative to SVM in recall (t = −12.188, p < 0.001). ANN significantly outperformed LR across all metrics, including accuracy (t = 7.090, p < 0.001) and precision (t = 4.338, p < 0.001). Comparisons with CB and XGB revealed no statistically significant differences (e.g., accuracy: ANN vs. CB, t = −0.696, p = 0.486), indicating that while ANN ranks among the stronger classifiers, it does not decisively outperform the top models.
LGBM was statistically significantly inferior in the vast majority of cases where differences were identified; the exceptions were RF and SVM on the ROC-AUC metric and LR across all metrics, over which LGBM held the advantage. CB outperformed LGBM on four out of five metrics, while XGB showed superiority on three out of five. ANN achieved a statistically significant advantage only in precision (t = −2.020, p = 0.043), whereas no statistically significant differences were observed in the other pairwise comparisons involving these models.
SVM demonstrated notable strengths in recall, outperforming all models except RF. It significantly surpassed leading classifiers such as CB (t = 3.329, p = 0.001) and XGB (t = 3.786, p < 0.001); its edge over RF did not reach statistical significance (p = 0.107). Across all other metrics, SVM either underperformed or yielded statistically insignificant differences, with the exception of LGBM, over which it exhibited a significant advantage on the F1 metric (t = 2.547, p = 0.011).
CB emerged as one of the strongest classifiers across nearly all metrics, demonstrating significant advantages wherever statistical differences were identified, except against SVM in recall. It outperformed RF on four out of five metrics, LR on all metrics, and LGBM on all but recall, where the difference was statistically insignificant (t = 0.455, p = 0.649) but still in the same direction. Comparisons with XGB and ANN generally did not reach significance (e.g., XGB accuracy: t = 1.046, p = 0.296; ANN precision: t = 0.405, p = 0.685), suggesting that CB, XGB, and ANN occupy a similarly strong position among top-performing models.
XGB likewise ranked among the best performers, particularly rivaling CB and ANN. It consistently outperformed LR on all five metrics (e.g., accuracy: t = 8.457, p < 0.001; precision: t = 5.121, p < 0.001). Comparisons with CB on accuracy and ROC-AUC showed no significant gap (t = −1.046, p = 0.296; t = −0.518, p = 0.604), suggesting an equivalently strong capacity, and XGB likewise showed no statistical superiority over ANN. XGB held advantages over RF (precision: t = −2.797, p = 0.005; ROC-AUC: t = −4.552, p < 0.001), affirming its position as a consistent top-tier method across the examined metrics. It also statistically outperformed LGBM on three of the five metrics.
LR consistently lagged behind its counterparts across accuracy, precision, recall, F1, and ROC-AUC. It was decisively outperformed by RF (accuracy: t = −6.967, p < 0.001; precision: t = −2.787, p = 0.005) and ANN (accuracy: t = −7.090, p < 0.001; precision: t = −4.338, p < 0.001). Comparisons against CB, XGB, and the remaining models likewise revealed statistically significant disadvantages, with p-values below 0.001 in most pairwise tests. These patterns indicate that LR occupied the lowest tier among the evaluated models.
3.4.5. Corrected Repeated (Ten-Times) 10-Fold Cross-Validation Paired t-Test
Unlike the previous analysis of the corrected 10-fold CV test, this statistical evaluation is based on 100 estimates derived from 10 repetitions of 10-fold CV with different splits, thereby increasing statistical power. The summarized results for the pairwise corrected-repeated 10-fold CV paired t-test, as proposed by Bouckaert and Frank [21], are presented in Table 10, while the complete statistical data for all pairwise comparisons and metrics (including p-values and t-scores) are provided in Appendix A (Table A4).
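A minimal sketch of how the 100 paired estimates can be generated, assuming scikit-learn: sharing a single RepeatedStratifiedKFold splitter keeps the fold-level scores paired across models, and the resulting differences feed the corrected t-test from the previous sketch (with ratio 1/9 and 99 degrees of freedom).

```python
# A minimal sketch of producing 100 paired fold-level estimates via
# ten-times repeated 10-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

# Reusing one splitter keeps the 100 scores paired across the two models
scores_rf = cross_val_score(RandomForestClassifier(random_state=0),
                            X, y, cv=rkf, scoring="accuracy")
scores_lr = cross_val_score(LogisticRegression(max_iter=1000),
                            X, y, cv=rkf, scoring="accuracy")

# 100 paired differences; feed corrected_paired_ttest with ratio 1/9, df = 99
diffs = scores_rf - scores_lr
```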
As shown in Table 10, the results align closely with those of the previous corrected 10-fold CV test (Table 9), maintaining a consistent overall pattern, particularly the concentration of significant differences among pairs involving LR, SVM, and RF.
LR consistently occupied the lowest tier, with highly significant disadvantages (p < 0.001) against all other models on every metric. Its performance gap was particularly evident in accuracy (e.g., LR vs. ANN: t = 6.1035, p < 0.0001) and ROC-AUC (LR vs. CB: t = −9.9764, p < 0.0001).
RF did not exhibit significant differences from most models in accuracy or F1. However, it was statistically outperformed in precision by CB (t = −3.6917, p = 0.0004), ANN (t = −2.9728, p = 0.0037), and XGB (t = −3.3647, p = 0.0011). In recall, RF ranked among the top alongside SVM (no significant difference, p = 0.1702), yet in ROC-AUC it was surpassed by CB, XGB, and ANN (all p ≤ 0.0008).
SVM demonstrated a pronounced advantage in recall, significantly surpassing CB, XGB, and LGBM (p ≤ 0.001). Its lead over RF was not conclusive (p = 0.170). Nonetheless, for precision and ROC-AUC, SVM was consistently outperformed by ANN, CB, and XGB (p < 0.001 in most pairwise comparisons). For the remaining SVM pairwise comparisons across metrics (excluding LR), no statistically significant differences were observed. However, regarding the F1 metric, while no significant differences were detected, SVM exhibited a dominant tendency. This is reflected in its positive t-scores when compared to top-performing models such as CB (t = 0.348, p = 0.728) and XGB (t = 0.490, p = 0.625).
ANN ranked among the top three for precision (e.g., ANN vs. SVM: t = 2.9467, p = 0.0040) and ROC-AUC, where it was statistically indistinguishable from CB and XGB (all p > 0.0937). Meanwhile, it showed no notable differences from RF and LGBM in accuracy or F1, implying a strong yet not dominant position. Against RF and SVM, ANN exhibited mixed results, outperforming RF on certain metrics while underperforming on others, such as recall. However, ANN demonstrated a clear advantage over LGBM in cases where statistical differences were identified, including precision (t = 2.151, p = 0.034) and ROC-AUC (t = 2.448, p = 0.014).
LGBM showed no significant gaps from the high-performing models on accuracy (e.g., LGBM vs. CB: p = 0.1244, LGBM vs. XGB: p = 0.1893). However, in precision and ROC-AUC, it was eclipsed by ANN, CB, and XGB (p ≤ 0.034). Its F1 performance remained on par with other classifiers except LR.
CB and XGB consistently emerged as leading classifiers across most metrics. Neither displayed significant superiority over the other (p = 0.8158 in precision; p = 0.9001 in ROC-AUC), and both attained strong positions in accuracy and precision. Their outperformance of LR, RF, and SVM reached statistical significance in multiple comparisons (e.g., XGB vs. LR in precision: t = 4.0168, p < 0.0001; CB vs. SVM in precision: t = 4.3526, p < 0.0001). Regarding ANN, no statistically significant differences indicated superiority in either direction, although the t-values suggest a tendency for ANN to trail CB and XGB.
Overall, LR was conclusively the weakest, while CB and XGB formed a top tier alongside ANN in precision and ROC-AUC. SVM and RF excelled primarily in recall, whereas LGBM maintained competitive yet slightly less dominant results.
3.4.6. Corrected Random Resampled Cross-Validation Paired t-Test
The corrected random resampled CV paired t-test, proposed by Nadeau and Bengio [20], was introduced to mitigate the Type I error inherent in the classical resampled CV paired t-test, which has been shown to exhibit increasing bias as the number of repetitions grows. Similar to repeated 10-fold cross-validation, this corrected test generated 100 estimates, aligning with the recommendations of the statistical power analysis to ensure sufficient power and minimize the risk of Type II error.
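A minimal sketch of generating the 100 resampled estimates follows, assuming scikit-learn's StratifiedShuffleSplit and an illustrative 90/10 split (the exact proportions are an assumption here); the Nadeau-Bengio correction from the earlier sketch is then applied with ratio n_test/n_train = 0.1/0.9.

```python
# A minimal sketch of the corrected random resampled variant with an
# illustrative 90/10 split repeated 100 times (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

scores_rf = cross_val_score(RandomForestClassifier(random_state=0),
                            X, y, cv=sss, scoring="accuracy")
scores_lr = cross_val_score(LogisticRegression(max_iter=1000),
                            X, y, cv=sss, scoring="accuracy")

# Correction term uses the test/train ratio of the resampling, here 0.1/0.9
diffs = scores_rf - scores_lr
```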
The summarized pairwise results are presented in Table 11, while a detailed breakdown of the statistical test, including t-scores and p-values, is provided in Appendix A (Table A5). The results reveal a structure consistent with previous analyses, particularly the corrected ten-times repeated 10-fold CV paired t-test (Table 10), with most statistically significant differences observed among LR, SVM, and RF model pairs.
The test again affirmed that LR performed significantly worse than all other models across all metrics (e.g., accuracy vs. RF: t = −4.769, p < 0.001; precision vs. CB: t = −3.591, p < 0.001; recall vs. XGB: t = −3.384, p = 0.001), consistently positioning it in the lowest tier.
Regarding accuracy, all classifiers except LR displayed mostly comparable results, with no statistically significant differences observed among RF, ANN, SVM, LGBM, CB, and XGB (e.g., ANN vs. CB: p = 0.444; CB vs. XGB: p = 0.888). In contrast, for precision, CB and XGB exhibited significant advantages over RF (p = 0.020 and p = 0.027, respectively), as well as over SVM and LGBM. The recall results underlined the dominance of SVM, which outperformed ANN, LGBM, CB, and XGB (p < 0.05 in each pairwise test) and also significantly outperformed LR. Meanwhile, on the F1-score, all models except LR formed a statistically indistinguishable group. Lastly, the ROC-AUC analyses confirmed the overall strength of CB and XGB, as each surpassed RF and SVM (e.g., CB vs. RF: t = −3.915, p < 0.001; XGB vs. SVM: t = −4.064, p < 0.001) while not differing significantly from each other (p = 0.1789). ANN also showed good results under the ROC-AUC metric, outperforming RF (t = 2.129, p = 0.036) and SVM (t = 4.065, p < 0.001). In sum, CB, XGB, and ANN formed a top-performing cluster; RF, SVM, and LGBM resided in a low-to-mid-range position; and LR consistently placed at the lower bound of the comparative evaluation.
3.4.7. Wilcoxon Non-Parametric Signed-Rank Test
In addition to the parametric paired t-test, the Wilcoxon pairwise non-parametric signed-rank test was conducted to ensure more robust results, addressing potential violations of normality assumptions, small sample sizes, and the influence of outliers. Unlike the t-test, the Wilcoxon test operates on rank values rather than raw numerical differences, allowing it to capture consistent directional differences between models even when absolute values vary.
The aggregated results are presented in Table 12, while the full set of detailed statistical results is available in Appendix A (Table A6). The results indicate a substantially higher number of statistically significant pairwise differences compared to the t-test. Notably, the test identified significant differences (on at least three metrics) for nearly all model pairs, with the sole exception of CB and XGB, where no statistically significant difference was detected.
The sample for the Wilcoxon test was derived from repeated 10-fold cross-validation, resulting in 100 estimates. Initially, two approaches were considered: a standard 10-fold CV (yielding 10 estimates) and a repeated 10-fold CV (yielding 100 estimates). The latter was selected as it better aligns with the sample size requirements determined by statistical power analysis. With a sample size of only 10, the results were considerably more modest in terms of detecting statistically significant differences (higher Type II error).
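A minimal sketch of one pairwise comparison on the 100 repeated-CV estimates, assuming scipy; because the Wilcoxon W statistic is non-negative, the direction of the difference is read off the medians, as discussed next.

```python
# A minimal sketch of a pairwise Wilcoxon signed-rank test on the 100 paired
# estimates; medians supply the direction of the difference, since W >= 0.
import numpy as np
from scipy import stats

def wilcoxon_compare(scores_a, scores_b):
    """Return W, the two-sided p-value, and which model's median is higher."""
    w, p = stats.wilcoxon(scores_a, scores_b)
    better = "model A" if np.median(scores_a) > np.median(scores_b) else "model B"
    return w, p, better
```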
Which model performs better cannot be derived directly from the W value, as it is always positive (unlike the t value). Instead, the medians of the model outputs are compared, which respects the rank-based nature of the Wilcoxon test. From Table 12, it is evident that CB and XGB once again dominate across nearly all metrics and model comparisons, with the exception of the recall metric, where SVM, RF, and LGBM demonstrate stronger performance. As observed in previous analyses, SVM maintains its dominance in recall across all model pairs. Additionally, on the F1 metric, SVM outperforms all models except CB and XGB, where no statistically significant difference is observed, making it inconclusive which model has the advantage. ANN, RF, and LGBM exhibit mixed results, while LR consistently ranks as the weakest performer, surpassing only RF on a few metrics and ANN on ROC-AUC.
3.4.8. Non-Parametric Friedman Test
All previously applied statistical methods have been pairwise comparison tests, aiming to identify statistically significant differences between specific model pairs. The results were presented in tables, each populated with the outcomes of all possible two-model comparisons. In contrast, the Friedman test employs a multiple-model comparison approach, simultaneously evaluating all models to provide a broader perspective on their relative performance.
As a non-parametric test, the Friedman test imposes no prior assumptions about the input distribution. Its results indicate whether a statistically significant difference exists among the tested models; however, it does not specify which models differ. To determine specific pairwise differences, a post-hoc test, such as Nemenyi’s test, is required.
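A minimal sketch of the Friedman test with the Nemenyi post-hoc follows, assuming scipy and the scikit-posthocs package, with a placeholder score matrix of 100 estimates per model in place of the study's data.

```python
# A minimal sketch of the Friedman test followed by Nemenyi's post-hoc on a
# placeholder 100 x 7 score matrix (scipy and scikit-posthocs assumed).
import numpy as np
import pandas as pd
import scikit_posthocs as sp
from scipy import stats

models = ["RF", "ANN", "LR", "SVM", "LGBM", "CB", "XGB"]
scores = pd.DataFrame(np.random.default_rng(0).random((100, len(models))),
                      columns=models)  # placeholder for one metric's estimates

# Friedman: one sample (column) per model, blocked by the 100 CV estimates
chi2, p = stats.friedmanchisquare(*(scores[m] for m in models))
print(f"Friedman chi-square = {chi2:.2f}, p = {p:.4g}")

# Nemenyi post-hoc: a 7 x 7 matrix of pairwise p-values (cf. Table A7)
nemenyi_p = sp.posthoc_nemenyi_friedman(scores)
print(nemenyi_p.round(4))
```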
The Friedman test was performed using 100 estimates per metric, obtained through a repeated cross-validation procedure to ensure alignment with the required sample size. The statistical results, including chi-square values and p-values, are presented in Table 13. Across all metrics, statistical significance was achieved with p < 0.001, and the large chi-square values (>480) provide strong evidence of performance differences among the models. The highest chi-square value was observed for ROC-AUC (568.36), while the lowest was recorded for the recall metric (481.6).
Table 14 presents a summary of performance results across multiple metrics in the same format as previously shown. The results closely resemble those obtained from the corrected repeated cross-validation t-test (Table 10) and the corrected resampled t-test (Table 11). The most statistically significant differences were identified for the LR model, followed by SVM, LGBM, and RF. The table further highlights the dominance of CB and XGB across most metrics, except for recall, while LR consistently underperforms across all metrics. Additionally, SVM maintains its superiority in recall over all models except RF, where no statistically significant difference was observed.
A complete set of results for all model-metric combinations, including exact p-values from Nemenyi's post-hoc test, is provided in Table A7 (Appendix A).
Table 15 presents a summary of the average rankings of the classification models across the various metrics. Each model was evaluated 100 times, with the corresponding metric computed in each iteration. Rather than using raw scores, models were ranked per iteration, with 1 assigned to the best-performing model for a given metric and 7 to the worst. After 100 iterations, the average rankings were calculated and are displayed in Table 15. The results indicate that LR is the poorest-performing model, consistently ranking above 6 across all metrics. In terms of accuracy, precision, and ROC-AUC, the best-performing model is CB (2.71, 2.39, 1.90), with XGB holding slightly larger (i.e., marginally worse) average ranks. ANN follows closely behind, with average rankings around 3. However, SVM is the best model for recall and F1, followed by CB and XGB. These findings align with previous observations regarding recall (Table 8, Table 9, Table 10 and Table 11) and F1 (Table 12), reinforcing the overall ranking trends. Additionally, RF demonstrated strong performance in the recall and F1 metrics, securing second- and third-place rankings, respectively.
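A minimal sketch of how such average ranks can be computed with pandas, using a placeholder score frame (higher scores assumed better, so rank 1 is best per iteration).

```python
# A minimal sketch of the per-iteration ranking behind Table 15 (pandas assumed).
import numpy as np
import pandas as pd

models = ["RF", "ANN", "LR", "SVM", "LGBM", "CB", "XGB"]
scores = pd.DataFrame(np.random.default_rng(0).random((100, len(models))),
                      columns=models)  # placeholder for one metric's estimates

ranks = scores.rank(axis=1, ascending=False)  # rank 1 = best model per row
print(ranks.mean(axis=0).sort_values())       # average rank per model
```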
The rankings in Table 15, in contrast to those in Table 14, are average ranks without statistical tests or significance analyses. To conclude this evaluation, the Friedman statistical test followed by the post-hoc Nemenyi test is graphically presented in Figure 5 in the form of critical difference (CD) diagrams, providing a visual comparison of model performance across multiple metrics.
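A minimal sketch of one CD-diagram panel is given below, assuming a recent scikit-posthocs release (which, as an assumption, provides critical_difference_diagram) and a placeholder score frame; models joined by a bar are statistically indistinguishable under the Nemenyi test.

```python
# A minimal sketch of a single CD-diagram panel; critical_difference_diagram
# is assumed available in recent scikit-posthocs releases.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scikit_posthocs as sp

models = ["RF", "ANN", "LR", "SVM", "LGBM", "CB", "XGB"]
scores = pd.DataFrame(np.random.default_rng(0).random((100, len(models))),
                      columns=models)  # placeholder for one metric's estimates

avg_ranks = scores.rank(axis=1, ascending=False).mean(axis=0)  # 1 = best
nemenyi_p = sp.posthoc_nemenyi_friedman(scores)                # pairwise p-values

sp.critical_difference_diagram(avg_ranks, nemenyi_p)  # bars join indistinct models
plt.title("Accuracy")  # one panel per metric, as in Figure 5
plt.show()
```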
In the critical difference (CD) diagram for accuracy (Figure 5), four distinct performance groups of ML models can be observed. The top-performing group consists of CB and XGB (positioned on the right side of the scale), followed by a middle group comprising SVM and ANN. A lower-middle group includes RF and LGBM, while LR stands as a clear outlier on the far-left side of the chart, indicating the worst performance.
For precision, three performance groups can be identified: CB, XGB, and ANN as the top-performing models; RF, SVM, and LGBM forming a middle-tier group; and LR again positioned as the lowest-performing model on the far left. The ROC-AUC metric follows a similar grouping pattern, except that SVM shifts from the middle group to the lower-performing group alongside LR, while RF and LGBM remain in the middle tier.
In recall, LR continues to exhibit the worst performance, whereas SVM emerges as the best-performing model by a significant margin. The middle-tier models are more evenly distributed, with RF and LGBM showing the strongest results, followed by CB and XGB. ANN, in contrast, performs poorly on this metric.
For F1, the top-performing group consists of SVM, CB, and XGB, while RF, LGBM, and ANN cluster closely together in the middle tier. As expected, LR remains the worst-performing model.
It is important to note that the black lines connecting models in the CD diagrams indicate that no statistically significant difference exists between the grouped models for the selected metrics. For example, in the precision metric sub-chart, CB, XGB, and ANN are not statistically different in performance, just as RF, LGBM, and SVM form a statistically indistinguishable group.