In this section, we present and discuss the results of the SLEDgeH index in its CVI function, applied to both synthetic and real-world data sets. First, we analyze the data clustered using distance-based algorithms, followed by an evaluation using categorical data-specific algorithms.
6.2.1. Distance-Based Clustering with Synthetic Data
The results presented in this section relate to the data sets outlined in Section 5.1.1 and employ the evaluation method described in Section 5.
Table 8 illustrates the overall performance of the indices using hit rate, average MRE, and STD measures. The remaining tables focus solely on the MRE value to facilitate a univariate analysis.
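To make these measures concrete, the sketch below shows one plausible way to compute them from an index's suggested k values, assuming the per-data-set relative error is |k̂ − k*|/k* and a "hit" is an exact match; the helper name and sample values are illustrative, not the paper's actual data.

```python
import numpy as np

def summarize_index(k_suggested, k_true):
    """Summarize an index's performance across a collection of data sets.

    Assumes the per-data-set relative error is |k_hat - k*| / k*, so a
    run that picks the exact k contributes zero error (a "hit").
    """
    k_suggested = np.asarray(k_suggested, dtype=float)
    k_true = np.asarray(k_true, dtype=float)
    errors = np.abs(k_suggested - k_true) / k_true  # relative error per data set
    return {
        "hit_rate": float(np.mean(errors == 0)),  # fraction of exact matches
        "mre": float(np.mean(errors)),            # average relative error
        "std": float(np.std(errors)),             # stability of the errors
    }

# Example: an index that recovers 3 of 4 true k values.
print(summarize_index(k_suggested=[3, 5, 7, 4], k_true=[3, 5, 7, 3]))
# {'hit_rate': 0.75, 'mre': 0.083..., 'std': 0.144...}
```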
The comprehensive evaluation of validation indices on synthetic data sets (Table 8) reveals distinct performance patterns. The proposed SLEDgeH index demonstrates superior performance across all metrics, achieving the highest hit rate (85%), the lowest mean relative error (MRE = 0.05), and the most stable results (STD = 0.13). Its predecessor SLEDge shows comparable effectiveness (hit rate = 80%, MRE = 0.06), confirming the robustness of the semantic pattern approach. Among conventional indices, ASW emerges as the strongest competitor (hit rate = 76%, MRE = 0.07), though with marginally higher variability (STD = 0.15).
Notably, the density-based CDR index fails completely (0% hit rate), while the information-theoretic CUBAGE and the consensus-based CI show moderate performance (31–35% hit rates). The DB index presents an interesting paradox—while achieving a respectable 57% hit rate, it exhibits the highest variability (STD = 0.51), suggesting inconsistent reliability. These results collectively demonstrate that pattern-based methods (SLEDge/SLEDgeH) outperform traditional distance and density measures in synthetic data scenarios, with the weighted version (SLEDgeH) providing a consistent 5–10% improvement over its unweighted counterpart.
The performance analysis across different clustering algorithms reveals distinct patterns in index effectiveness. As shown in Table 9, hierarchical clustering methods generally yield better results for most validation indices compared to partitional approaches. The SLEDgeH index demonstrates particularly robust performance, achieving the lowest MRE of 0.04 with k-means and an exceptional 0.01 with Ward's hierarchical method, outperforming all other indices in these configurations. Notably, while the original SLEDge index already shows competitive results (0.06 with k-means and 0.02 with Ward's), its weighted version SLEDgeH provides consistent improvements across all algorithms. ASW maintains strong performance, particularly with average-linkage clustering (MRE = 0.10), where it matches both SLEDge and SLEDgeH. However, the DB index exhibits unstable behavior, with performance varying dramatically from 0.03 (h-ward) to 0.80 (h-average), suggesting high sensitivity to the choice of clustering method. Among specialized indices, CUBAGE shows moderate performance (MRE 0.32–0.39), while the automated CNAK method achieves 0.52, indicating room for improvement in parameter-free approaches. These results collectively suggest that while hierarchical methods generally provide more reliable cluster structures for validation, the choice of validation index remains crucial, with SLEDgeH emerging as the most consistently accurate option across different algorithmic approaches.
The performance analysis across different cluster numbers (k = 3, 5, 7) reveals distinct patterns in index effectiveness. As shown in Table 10, the MRE generally increases with higher values of k for most indices, with the notable exception of the DB index, which shows an inverse relationship. This trend aligns with previous findings [31] demonstrating that conventional indices like CNAK, CH, and ASW perform better with fewer clusters. Among all evaluated methods, SLEDgeH consistently achieves the lowest MRE values (0.02, 0.05, and 0.08 for k = 3, 5, and 7, respectively), outperforming both its unweighted counterpart SLEDge and the traditionally strong ASW index. While ASW maintains relatively stable performance (0.04 to 0.11), its accuracy still degrades slightly as cluster complexity increases, a limitation that SLEDgeH appears to mitigate more effectively through its weighted indicator approach. The density-based CDR and information-theoretic CUBAGE indices show particularly strong sensitivity to the cluster number, with their MRE values nearly doubling between k = 3 and k = 7. These results suggest that while traditional indices remain useful for simple cluster structures, SLEDgeH, with its weighting mechanism, provides more robust performance across varying cluster complexities in synthetic data sets.
The experimental results demonstrate distinct performance patterns among validation indices when applied to both balanced and imbalanced synthetic data sets across varying cluster numbers (k = 3, 5, 7). As shown in Table 11, SLEDgeH consistently achieves the lowest MRE values in all tested scenarios, outperforming both traditional indices and its predecessor SLEDge. Notably, while ASW shows competitive performance in balanced configurations (e.g., MRE = 0.03 for k = 3, 5, 7), its advantage diminishes significantly in imbalanced scenarios, where SLEDgeH maintains superior robustness (0.02 vs. 0.05 for k = 3, imbalanced). The density-based CDR index exhibits particularly poor performance as k increases, reaching its worst MRE at k = 7, while the information-theoretic CUBAGE shows variable results that degrade sharply in imbalanced conditions. The consensus-based CI demonstrates moderate performance but fails to match the precision of SLEDgeH, particularly at larger k values. Importantly, the weighted approach of SLEDgeH proves consistently effective regardless of cluster balance, showing MRE ≤ 0.05 in balanced cases and MRE ≤ 0.11 in the challenging imbalanced configurations with k = 7, confirming its reliability across diverse data distributions. These findings highlight the dual advantage of SLEDgeH: maintaining the interpretability of SLEDge while significantly improving accuracy, especially in realistic scenarios with uneven cluster sizes.
Finally, for the statistical significance test, we select the top-performing indices for each data set across different algorithms. These indices are ranked (using average ranks in cases of ties), and the Wilcoxon–Mann–Whitney test [36] is applied. At a significance level of 0.05, the null hypothesis—that the performance of all indices is similar—is rejected.
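As a rough illustration of this procedure, the snippet below compares the per-data-set MREs of two hypothetical indices with SciPy's rank-based Mann–Whitney U test, which assigns average ranks to ties as in the ranking step above; all numbers are made up for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-data-set MREs for two competing indices
# (illustrative numbers, not the paper's actual values).
mre_index_a = np.array([0.02, 0.05, 0.00, 0.08, 0.04, 0.01])
mre_index_b = np.array([0.04, 0.11, 0.03, 0.10, 0.07, 0.06])

# mannwhitneyu is rank-based and assigns average ranks to ties,
# matching the ranking procedure described above.
stat, p_value = mannwhitneyu(mre_index_a, mre_index_b, alternative="two-sided")

alpha = 0.05
print(f"U = {stat}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the indices' performances differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```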
6.2.2. Distance-Based Clustering with Real-World Data
The results presented in this section relate to the data sets described in Section 5.1.2, under the evaluation method described in Section 5. As in the synthetic data analysis section, the overall performance of the indices is summarized in Table 12 using the hit rate, average MRE, and STD metrics, while the remaining tables report only the MRE values.
The comparative evaluation of clustering validation indices on real-world data sets reveals distinct performance patterns among the different methods. The CH index demonstrates the strongest overall performance, achieving the lowest mean relative error (MRE = 0.28) while maintaining reasonable consistency (STD = 0.32). However, the SLEDgeH index emerges as a particularly competitive alternative, securing second place in overall performance with an MRE of 0.33 and notably achieving the best standard deviation (STD = 0.28) among all indices, indicating superior stability in its evaluations. While the CNAK method shows the highest hit rate (50%), its higher MRE (0.60) and substantial standard deviation (1.11) suggest less reliable performance overall. The traditional SLEDge index, though showing moderate consistency (STD = 0.80), performs poorly on real-world data with an MRE of 1.16 and a zero hit rate, highlighting the significant improvement achieved by its weighted version, SLEDgeH. Interestingly, the commonly used ASW index, which performed well on synthetic data, shows dramatically reduced effectiveness on real-world data sets (MRE = 1.04, hit rate = 11%), suggesting limitations in handling complex, real-world data structures. The density-based CDR index presents middling performance (MRE = 0.34), while the consensus-based (CI) and information-theoretic (CUBAGE) approaches show particularly weak results, with CI exhibiting the worst MRE (1.41) of all methods tested. These findings collectively suggest that while CH remains the top performer for real-world data validation, SLEDgeH offers a compelling alternative, particularly when measurement stability is prioritized, demonstrating the value of its semantic approach combined with weighted indicators.
The analysis of the clustering algorithm's influence on validation index performance (Table 13) reveals distinct methodological preferences among the evaluated measures. Hierarchical clustering algorithms (particularly Ward's method and average linkage) demonstrate superior performance with most indices, as evidenced by the lowest MRE scores for CDR (0.21–0.26), SLEDgeH (0.32), and SLEDge (1.03–1.26) in these configurations. This pattern aligns with established findings that hierarchical methods better preserve local data structures [24]. However, notable exceptions emerge: the CH index achieves optimal performance (MRE = 0.17) with k-means, consistent with its design for evaluating centroid-based partitions [25], while CUBAGE shows degraded results (MRE = 0.57–0.80) under hierarchical clustering, corroborating its known sensitivity to distributional assumptions [28]. The Consensus Index (CI) and CNAK exhibit algorithm-dependent behaviors by design, with CI aggregating multiple solutions (MRE = 1.41) and CNAK, built on k-means++, achieving moderate performance (MRE = 0.60). Notably, SLEDgeH maintains consistently low error (MRE = 0.32–0.37) across all algorithms, demonstrating greater robustness than its predecessor SLEDge (MRE = 1.03–1.26) and conventional indices like DB (MRE = 1.84–2.22), whose results vary strongly with the algorithmic choice. These results suggest that index selection should consider both the clustering algorithm's properties and the index's inherent theoretical assumptions, with newer semantic approaches like SLEDgeH offering more stable evaluation across methodologies.
The performance analysis in Table 14 reveals distinct patterns across validation indices as the number of clusters (k) increases. While CDR and CH exhibit deteriorating performance with higher k values (MRE increasing from 0.23 to 0.60 and from 0.20 to 0.60, respectively), DB demonstrates an inverse relationship, improving from MRE = 2.60 at k = 2 to MRE = 0.13 at k = 5. This aligns with previous observations that distance-based indices like DB may favor larger numbers of clusters [10]. The SLEDge variants show competitive performance across all k values, with SLEDgeH achieving the lowest MRE (0.29) at k = 4, suggesting particular robustness in mid-range cluster configurations. However, these trends should be interpreted cautiously given the imbalanced distribution of data sets across different k values (ranging from 10 data sets at k = 2 to just one at k = 5), which may introduce statistical artifacts [24]. Notably, CNAK shows exceptional performance at k = 3 (MRE = 0.13), though its effectiveness varies substantially across other cluster counts, consistent with its known sensitivity to data characteristics [31]. The comparative stability of the information-theoretic CUBAGE (MRE range: 0.42–0.65) versus the volatility of the consensus-based CI (MRE range: 0.20–1.75) underscores the fundamental methodological differences between these approaches to cluster validation.
The performance analysis of cluster validation indices across real-world data sets (Table 15) reveals several key insights about their relative effectiveness. While CNAK achieves the highest number of optimal results (11 data sets with minimal MRE), followed by CDR (10 data sets) and CH (8 data sets), this raw count alone does not fully capture their comparative reliability. As demonstrated in Table 12, CH emerges as the top performer overall despite its lower count of individual wins, a phenomenon explained by its consistently small error magnitudes when it does not select the exact k value [25]. This pattern is even more pronounced in SLEDgeH, which shows the smallest standard deviation of errors among all indices, confirming its robustness. The SLEDge index, while not leading in either metric, demonstrates intermediate performance that still outperforms several conventional indices, such as DB and ASW.
This apparent paradox between per-data-set wins and overall rankings stems from fundamental differences in how indices handle marginal cases. Some indices may occasionally guess k correctly but produce wildly inaccurate estimates when they fail, while others maintain stable near-optimal performance [24]. Our results extend this finding, showing that CH and SLEDgeH belong to the latter category—their errors, when they occur, deviate less from the true k than those of indices like CNAK, which alternate between perfect guesses and substantial misses. This makes them particularly valuable for applications requiring consistent performance across diverse data sets, though the higher hit rate of CNAK may prove preferable in scenarios where exact k determination is critical [31]. The density-based CDR maintains its reputation for handling irregular cluster structures [27], while the weighted approach of SLEDgeH successfully balances precision and stability across data types.
6.2.3. Categorical Clustering with Real-World Data
In this section, we evaluate the performance of the indices when applying the ROCK algorithm [12], described in Section 2.2. Although we use the same data sets as in the real-world data analysis section, we treat this analysis separately because ROCK is specific to applications involving categorical data. Since CNAK relies on k-means++, we do not include it in this analysis.
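For readers who want to reproduce this setup, a minimal sketch follows using the ROCK implementation from the pyclustering library (an assumption; the paper does not specify an implementation), with categorical attributes one-hot encoded and an illustrative connectivity radius eps.

```python
import numpy as np
from pyclustering.cluster.rock import rock  # assumed implementation

def rock_labels(data, k, eps=1.5):
    """Cluster one-hot-encoded categorical data into k groups with ROCK.

    eps is the connectivity radius used to build the link graph; the
    value here is illustrative and should be tuned per data set.
    """
    instance = rock(data.tolist(), eps, k)
    instance.process()
    labels = np.empty(len(data), dtype=int)
    for cluster_id, members in enumerate(instance.get_clusters()):
        labels[members] = cluster_id
    return labels

# Toy one-hot-encoded data with two obvious groups.
data = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 0, 1],
                 [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]])
print(rock_labels(data, k=2))  # index values are then computed per k
```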
The results presented in this section relate to the data sets described in Section 5.1.2, under the evaluation method described in Section 5. As in the real-world data analysis section, the overall performance of the indices appears in Table 16 through the hit rate, average MRE, and STD measures, while the remaining tables contain only the MRE.
The analysis of the overall performance of the indices (Table 16) reveals notable differences compared to the previous evaluation (Table 12). SLEDgeH demonstrates the best overall performance, achieving the highest hit rate (61%) and the lowest MRE (0.16), while also maintaining excellent stability (STD = 0.23). This represents a significant improvement over its performance in the previous analysis, where it had a lower hit rate (22%) and a higher MRE (0.33), despite still showing good stability. The CDR index now performs comparably to SLEDgeH in terms of stability (STD = 0.23) and hit rate (50%), while its MRE (0.21) remains competitive. The ASW index shows better results, achieving a higher hit rate (56%) and a lower MRE (0.34) compared to its previous performance (hit rate = 11%, MRE = 1.04).
In contrast, the CH index, which had the best performance in the previous analysis (MRE = 0.28), now performs worse, exhibiting a higher MRE (0.45) and significantly poorer stability (STD = 0.68). The CI and CUBAGE indices show mixed results—CI improves markedly in MRE (0.43 vs. 1.41), while CUBAGE maintains a high MRE (0.61) but with greater stability than before. The DB index remains one of the worst performers, with a high MRE (1.16) and a low hit rate (33%), though it shows some improvement over its previous performance (MRE = 2.04, hit rate = 0%).
These results suggest that the choice of the clustering algorithm type significantly impacts index performance. Although CH performed best in the previous evaluation, SLEDgeH and CDR emerge as more reliable options, primarily due to their balance of accuracy, stability, and hit rate. The improved performance of ASW indicates that some indices are more sensitive to the underlying clustering method than others. Overall, SLEDgeH stands out as the most robust index, demonstrating high precision and consistency.
In the analysis of index performance for different numbers of clusters (Table 17), we identify significant differences compared to the previous evaluation (Table 14). SLEDgeH is the best index for k = 2 and k = 4, with MREs of 0.00 and 0.17, respectively, clearly outperforming the other indices in these scenarios. For k = 3, CDR presents the best performance (MRE = 0.33), while for k = 5 all indices converge to the same value (MRE = 0.60), a behavior different from the previous evaluation, where DB showed the best result (MRE = 0.13).
When comparing with the index behavior in the previous analysis, we note that ROCK tends to produce more balanced results among the indices, especially for larger k values. Examining the previous evaluation in Table 14, we observe that CH performed very well for k = 2 (MRE = 0.20), but with ROCK its result is slightly worse (MRE = 0.30). The behavior of DB is particularly interesting—while in the previous analysis it improved as k increased, this pattern is not observed with ROCK. The consistency of SLEDgeH under ROCK reinforces its robustness across different cluster number configurations.
The performance analysis of the indices on real-world data sets clustered by ROCK (Table 18) reveals significant differences compared to the distance-based approach presented in Table 15. SLEDgeH, which showed robust performance with small and consistent errors in the previous analysis, also performs well with ROCK, achieving the minimum MRE (equal to 0) on 11 data sets. Furthermore, examining Table 18, we observe that all indices improve in correctly identifying the number of clusters, indicating that, for these data sets, ROCK generates cluster structures that are more easily identifiable by the different metrics.
Finally, the SLEDgeH index proves consistently robust in both approaches—distance-based and specifically for categorical data—with low standard deviation, confirming its usefulness as a reliable metric regardless of the clustering algorithm. While other indices show improved performance with ROCK, they do not maintain the same superiority across different contexts.
6.2.4. Sensitivity Analysis of Weight Configurations
To evaluate the impact of different weight configurations on the SLEDgeH indicators, we select five weight combinations for sensitivity analysis based on rigorous methodological criteria. We include the default configuration [0.3, 0.1, 0.5, 0.1], obtained through systematic optimization as described in Section 4.1, and four variations that individually emphasize each indicator (Support, Length, Exclusivity, and Difference). This approach allows us to assess both the robustness of the default configuration and the isolated impact of each indicator, while maintaining the unit-sum constraint and avoiding excessive combinatorial complexity. The results demonstrate that this strategy validates the default configuration as optimal while clearly characterizing each component's influence on the overall index performance, as evidenced in Table 19.
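The sketch below illustrates this setup, assuming SLEDgeH aggregates its four indicators as a unit-sum weighted average in [Support, Length, Exclusivity, Difference] order; the indicator values are placeholders, and only the configurations explicitly quoted in this section are spelled out.

```python
import numpy as np

def sledgeh_score(indicators, weights):
    """Aggregate the four indicators as a unit-sum weighted average.

    Assumes indicator values are already normalized to [0, 1] and ordered
    as [Support, Length, Exclusivity, Difference].
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must satisfy the unit-sum constraint"
    return float(np.dot(indicators, weights))

# The default configuration plus the two variations quoted in this section;
# the remaining emphasized variations follow the same pattern.
configs = {
    "default [S,L,E,D]":   [0.3, 0.1, 0.5, 0.1],
    "support-emphasis":    [0.5, 0.1, 0.3, 0.1],
    "difference-emphasis": [0.2, 0.1, 0.2, 0.5],
}

indicators = np.array([0.72, 0.40, 0.85, 0.55])  # placeholder S, L, E, D values
for name, w in configs.items():
    print(f"{name:20s} -> {sledgeh_score(indicators, w):.3f}")
```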
The performance analysis of SLEDgeH with different weight configurations (Table 19) reveals important patterns about indicator suitability for various types of categorical data sets. The default configuration demonstrates the best overall balance, achieving the lowest MRE (0.33) and STD (0.28). This superior performance remains consistent across diverse data sets, ranging from small, well-structured ones like Balloons (16 instances) to complex cases like Mushroom (8124 instances).
The high weight for Exclusivity (0.5) proves crucial for identifying semantically distinct clusters, as evidenced by perfect results (MRE = 0) on Balloons, Chess, and Mushroom. These data sets contain well-defined categories with clearly exclusive patterns between clusters. However, for data sets with natural category overlap, such as SPECT Heart and Hayes Roth, configurations with less emphasis on Exclusivity (e.g., [0.5, 0.1, 0.3, 0.1]) show slightly better performance, suggesting that indicator importance varies according to data characteristics.
The moderate weight for Support (0.3) provides an appropriate balance, ensuring relevance without excessive dominance in the evaluation. In data sets like Indian Diabetes and Mushroom, where pattern frequency strongly indicates cluster quality, configurations with a higher Support weight [0.5, 0.1, 0.3, 0.1] achieve MRE = 0. Conversely, for sparse or noisy data like Survey Lung Cancer and Chess, slightly increasing the Difference weight could improve performance, as suggested by the good results of the [0.2, 0.1, 0.2, 0.5] configuration on this specific data set.
The minimal weights for Length (0.1) and Difference (0.1) reflect their secondary role in most scenarios, though alternative configurations show marginally better results for data sets like Nursery and Students Adapt. This reinforces the need for weight adaptation in specific contexts, while maintaining the default configuration as a general starting point.
These empirical results, combined with theoretical foundations, support the conclusion that the default configuration [0.3, 0.1, 0.5, 0.1] represents the best general-purpose balance for categorical data applications. However, the observed performance variation across different data sets reinforces the recommendation to adjust weights for specific domains, particularly when dealing with characteristics like category overlap, noise, or sparsity.
6.2.5. Graphical Analysis of Cluster Validity Indices
In this section, we analyze how some of the indices behave using a graphical tool. To do so, we select a few data sets, apply the hierarchical algorithm with average linkage for k varying in the same range from 2 to 10, and calculate the indices, with the suggested value of k determined by identifying the maximum value in the series of points. The selected data sets are Chess, Balloons, Car Evaluation, and Nursery. The graphics are presented in Figure 3, Figure 4, Figure 5, and Figure 6.
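The procedure behind these figures can be sketched as follows; here ASW (silhouette) stands in for any of the indices and synthetic blobs stand in for the selected data sets, since each index simply contributes one point per k and the maximum of the series gives the suggested k.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in data

ks = range(2, 11)
scores = []
for k in ks:
    # Hierarchical clustering with average linkage, as in the figures.
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    scores.append(silhouette_score(X, labels))  # one point of the index series

best_k = list(ks)[int(np.argmax(scores))]  # maximum of the series suggests k
print(f"suggested k = {best_k}")
```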
In the Chess data set (Figure 3), the expected k is equal to 2. As we can see, CDR, DB, ASW, and SLEDgeH suggest the correct value of k. It is also interesting to note that, apart from their different scales, and taking the interval as a reference, these four indices have a similar distribution of points. Analyzing the emphasis of the suggestion, CDR and DB suggest k equal to 2, but ASW and SLEDgeH emphasize this suggestion more strongly: their suggestion point is, proportionally, larger than the other points in the series. Regarding CH and CUBAGE, the performance was lower, but it is interesting to see that the distribution of points between them is also quite similar.
Observing the Balloons data set (Figure 4), where the expected k is equal to 2, we can identify some interesting behaviors. The CH, CDR, and SLEDgeH indices suggest k correctly. CUBAGE and DB suggest k equal to 10, and although ASW suggests a value of 4, the proximity of its points shows that ASW identifies potential values of k at 2, 4, and 8. The SLEDgeH index also identifies the same points as potential clustering configurations.
Regarding the Car Evaluation data set (Figure 5), the expected k is equal to 4. Among the indices that suggest the correct value of k, we highlight CH, CUBAGE, ASW, and SLEDgeH. We also observe the same pattern of behavior in the distribution of points of some indices: CH and CUBAGE, which, despite suggesting k equal to 4, present k equal to 3 as a close option; and ASW and SLEDgeH, which emphatically suggest the correct value.
Finally, the indices applied to the Nursery data set (Figure 6), which has an expected k equal to 5, demonstrate interesting results. CH, CDR, CUBAGE, and ASW suggest a k of 2, while DB suggests a k of 6 and SLEDgeH a k of 4. Although all of them appear to be wrong, analyzing the data, we find that out of a total of 12,960 instances, there is a class representing only 0.015% of the data. Therefore, we believe this extreme class imbalance has influenced the final results.