In this section, we present and discuss the results of the SLEDgeH index in its CVI function, applied to both synthetic and real-world data sets. First, we analyze the data clustered using distance-based algorithms, followed by an evaluation using categorical data-specific algorithms.
6.2.1. Distance-Based Clustering with Synthetic Data
The results presented in this section relate to the data sets outlined in Section 5.1.1 and employ the evaluation method described in Section 5.
Table 8 illustrates the overall performance of the indices using hit rate, average MRE, and STD measures. The remaining tables focus solely on the MRE value to facilitate a univariate analysis.
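To make these measures concrete, the sketch below shows one plausible way to compute them from an index's suggested k values, assuming the per-data-set relative error is |k̂ − k*|/k* and a "hit" is an exact match; the helper name and sample values are illustrative, not the paper's actual data.

```python
import numpy as np

def summarize_index(k_suggested, k_true):
    """Summarize an index's performance across a collection of data sets.

    Assumes the per-data-set relative error is |k_hat - k*| / k*, so a
    run that picks the exact k contributes zero error (a "hit").
    """
    k_suggested = np.asarray(k_suggested, dtype=float)
    k_true = np.asarray(k_true, dtype=float)
    errors = np.abs(k_suggested - k_true) / k_true  # relative error per data set
    return {
        "hit_rate": float(np.mean(errors == 0)),  # fraction of exact matches
        "mre": float(np.mean(errors)),            # average relative error
        "std": float(np.std(errors)),             # stability of the errors
    }

# Example: an index that recovers 3 of 4 true k values.
print(summarize_index(k_suggested=[3, 5, 7, 4], k_true=[3, 5, 7, 3]))
# {'hit_rate': 0.75, 'mre': 0.083..., 'std': 0.144...}
```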
The comprehensive evaluation of validation indices on synthetic data sets (Table 8) reveals distinct performance patterns. The proposed SLEDgeH index demonstrates superior performance across all metrics, achieving the highest hit rate (85%), the lowest mean relative error (MRE = 0.05), and the most stable results (STD = 0.13). Its predecessor SLEDge shows comparable effectiveness (hit rate = 80%, MRE = 0.06), confirming the robustness of the semantic pattern approach. Among conventional indices, ASW emerges as the strongest competitor (hit rate = 76%, MRE = 0.07), though with marginally higher variability (STD = 0.15).
Notably, the density-based CDR index fails completely (0% hit rate), while the information-theoretic CUBAGE and the consensus-based CI show moderate performance (31–35% hit rates). The DB index presents an interesting paradox—while achieving a respectable 57% hit rate, it exhibits the highest variability (STD = 0.51), suggesting inconsistent reliability. These results collectively demonstrate that pattern-based methods (SLEDge/SLEDgeH) outperform traditional distance and density measures in synthetic data scenarios, with the weighted version (SLEDgeH) providing a consistent 5–10% improvement over its unweighted counterpart.
The performance analysis across different clustering algorithms reveals distinct patterns in index effectiveness. As shown in Table 9, hierarchical clustering methods generally yield better results for most validation indices compared to partitional approaches. The SLEDgeH index demonstrates particularly robust performance, achieving the lowest MRE of 0.04 with k-means and an exceptional 0.01 with Ward's hierarchical method, outperforming all other indices in these configurations. Notably, while the original SLEDge index already shows competitive results (0.06 with k-means and 0.02 with Ward's), its weighted version SLEDgeH provides consistent improvements across all algorithms. ASW maintains strong performance, particularly with average-linkage clustering (MRE = 0.10), where it matches both SLEDge and SLEDgeH. However, the DB index exhibits unstable behavior, with performance varying dramatically from 0.03 (h-ward) to 0.80 (h-average), suggesting high sensitivity to the choice of clustering method. Among specialized indices, CUBAGE shows moderate performance (MRE 0.32–0.39), while the automated CNAK method achieves 0.52, indicating room for improvement in parameter-free approaches. These results collectively suggest that while hierarchical methods generally provide more reliable cluster structures for validation, the choice of validation index remains crucial, with SLEDgeH emerging as the most consistently accurate option across different algorithmic approaches.
The performance analysis across different cluster numbers (k = 3, 5, 7) reveals distinct patterns in index effectiveness. As shown in Table 10, the MRE generally increases with higher values of k for most indices, with the notable exception of the DB index, which shows an inverse relationship. This trend aligns with previous findings [31] demonstrating that conventional indices like CNAK, CH, and ASW perform better with fewer clusters. Among all evaluated methods, SLEDgeH consistently achieves the lowest MRE values (0.02, 0.05, and 0.08 for k = 3, 5, and 7, respectively), outperforming both its unweighted counterpart SLEDge and the traditionally strong ASW index. While ASW maintains relatively stable performance (0.04 to 0.11), its accuracy still degrades slightly as cluster complexity increases, a limitation that SLEDgeH appears to mitigate more effectively through its weighted indicator approach. The density-based CDR and information-theoretic CUBAGE indices show particularly strong sensitivity to the cluster number, with their MRE values nearly doubling between k = 3 and k = 7. These results suggest that while traditional indices remain useful for simple cluster structures, SLEDgeH, with its weighting mechanism, provides more robust performance across varying cluster complexities in synthetic data sets.
The experimental results demonstrate distinct performance patterns among validation indices when applied to both balanced and imbalanced synthetic data sets across varying cluster numbers (k = 3, 5, 7). As shown in Table 11, SLEDgeH consistently achieves the lowest MRE values in all tested scenarios, outperforming both traditional indices and its predecessor SLEDge. Notably, while ASW shows competitive performance in balanced configurations (e.g., MRE = 0.03 for k = 3, 5, 7), its advantage diminishes significantly in imbalanced scenarios, where SLEDgeH maintains superior robustness (0.02 vs. 0.05 for k = 3, imbalanced). The density-based CDR index exhibits particularly poor performance as k increases, reaching its worst MRE at k = 7, while the information-theoretic CUBAGE shows variable results that degrade sharply in imbalanced conditions. The consensus-based CI demonstrates moderate performance but fails to match the precision of SLEDgeH, particularly at larger k values. Importantly, the weighted approach of SLEDgeH proves consistently effective regardless of cluster balance, showing MRE ≤ 0.05 in balanced cases and MRE ≤ 0.11 in the challenging imbalanced configurations with k = 7, confirming its reliability across diverse data distributions. These findings highlight the dual advantage of SLEDgeH: maintaining the interpretability of SLEDge while significantly improving accuracy, especially in realistic scenarios with uneven cluster sizes.
Finally, for the statistical significance test, we select the top-performing indices for each data set across different algorithms. These indices are ranked (using average ranks in cases of ties), and the Wilcoxon–Mann–Whitney test [36] is applied. At a significance level of 0.05, the null hypothesis—that the performance of all indices is similar—is rejected.
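As a rough illustration of this procedure, the snippet below compares the per-data-set MREs of two hypothetical indices with SciPy's rank-based Mann–Whitney U test, which assigns average ranks to ties as in the ranking step above; all numbers are made up for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-data-set MREs for two competing indices
# (illustrative numbers, not the paper's actual values).
mre_index_a = np.array([0.02, 0.05, 0.00, 0.08, 0.04, 0.01])
mre_index_b = np.array([0.04, 0.11, 0.03, 0.10, 0.07, 0.06])

# mannwhitneyu is rank-based and assigns average ranks to ties,
# matching the ranking procedure described above.
stat, p_value = mannwhitneyu(mre_index_a, mre_index_b, alternative="two-sided")

alpha = 0.05
print(f"U = {stat}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the indices' performances differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```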
6.2.2. Distance-Based Clustering with Real-World Data
The results presented in this section relate to the data sets described in Section 5.1.2, under the evaluation method described in Section 5. As in the synthetic data analysis section, the overall performance of the indices is summarized in Table 12 using the hit rate, average MRE, and STD metrics, while the remaining tables report only the MRE values.
The comparative evaluation of clustering validation indices on real-world data sets reveals distinct performance patterns among the different methods. The CH index demonstrates the strongest overall performance, achieving the lowest mean relative error (MRE = 0.28) while maintaining reasonable consistency (STD = 0.32). However, the SLEDgeH index emerges as a particularly competitive alternative, securing second place in overall performance with an MRE of 0.33 and notably achieving the best standard deviation (STD = 0.28) among all indices, indicating superior stability in its evaluations. While the CNAK method shows the highest hit rate (50%), its higher MRE (0.60) and substantial standard deviation (1.11) suggest less reliable performance overall. The traditional SLEDge index, though showing moderate consistency (STD = 0.80), performs poorly on real-world data with an MRE of 1.16 and a zero hit rate, highlighting the significant improvement achieved by its weighted version, SLEDgeH. Interestingly, the commonly used ASW index, which performed well on synthetic data, shows dramatically reduced effectiveness on real-world data sets (MRE = 1.04, hit rate = 11%), suggesting limitations in handling complex, real-world data structures. The density-based CDR index presents middling performance (MRE = 0.34), while the consensus-based (CI) and information-theoretic (CUBAGE) approaches show particularly weak results, with CI exhibiting the worst MRE (1.41) of all methods tested. These findings collectively suggest that while CH remains the top performer for real-world data validation, SLEDgeH offers a compelling alternative, particularly when measurement stability is prioritized, demonstrating the value of its semantic approach combined with weighted indicators.
The analysis of the clustering algorithm's influence on validation index performance (Table 13) reveals distinct methodological preferences among the evaluated measures. Hierarchical clustering algorithms (particularly Ward's method and average linkage) demonstrate superior performance with most indices, as evidenced by the lowest MRE scores for CDR (0.21–0.26), SLEDgeH (0.32), and SLEDge (1.03–1.26) in these configurations. This pattern aligns with established findings that hierarchical methods better preserve local data structures [24]. However, notable exceptions emerge: the CH index achieves optimal performance (MRE = 0.17) with k-means, consistent with its design for evaluating centroid-based partitions [25], while CUBAGE shows degraded results (MRE = 0.57–0.80) under hierarchical clustering, corroborating its known sensitivity to distributional assumptions [28]. The Consensus Index (CI) and CNAK exhibit algorithm-dependent behaviors by design, with CI aggregating multiple solutions (MRE = 1.41) and CNAK, built on k-means++, achieving moderate performance (MRE = 0.60). Notably, SLEDgeH maintains consistently low error (MRE = 0.32–0.37) across all algorithms, demonstrating greater robustness than its predecessor SLEDge (MRE = 1.03–1.26) and conventional indices like DB (MRE = 1.84–2.22), whose results vary strongly with the algorithmic choice. These results suggest that index selection should consider both the clustering algorithm's properties and the index's inherent theoretical assumptions, with newer semantic approaches like SLEDgeH offering more stable evaluation across methodologies.
The performance analysis in Table 14 reveals distinct patterns across validation indices as the number of clusters (k) increases. While CDR and CH exhibit deteriorating performance with higher k values (MRE increasing from 0.23 to 0.60 and from 0.20 to 0.60, respectively), DB demonstrates an inverse relationship, improving from MRE = 2.60 at k = 2 to MRE = 0.13 at k = 5. This aligns with previous observations that distance-based indices like DB may favor larger numbers of clusters [10]. The SLEDge variants show competitive performance across all k values, with SLEDgeH achieving the lowest MRE (0.29) at k = 4, suggesting particular robustness in mid-range cluster configurations. However, these trends should be interpreted cautiously given the imbalanced distribution of data sets across different k values (ranging from 10 data sets at k = 2 to just one at k = 5), which may introduce statistical artifacts [24]. Notably, CNAK shows exceptional performance at k = 3 (MRE = 0.13), though its effectiveness varies substantially across other cluster counts, consistent with its known sensitivity to data characteristics [31]. The comparative stability of the information-theoretic CUBAGE (MRE range: 0.42–0.65) versus the volatility of the consensus-based CI (MRE range: 0.20–1.75) underscores the fundamental methodological differences between these approaches to cluster validation.
The performance analysis of cluster validation indices across real-world data sets (Table 15) reveals several key insights about their relative effectiveness. While CNAK achieves the highest number of optimal results (11 data sets with minimal MRE), followed by CDR (10 data sets) and CH (8 data sets), this raw count alone does not fully capture their comparative reliability. As demonstrated in Table 12, CH emerges as the top performer overall despite its lower count of individual wins, a phenomenon explained by its consistently small error magnitudes when it does not select the exact k value [25]. This pattern is even more pronounced in SLEDgeH, which shows the smallest standard deviation of errors among all indices, confirming its robustness. The SLEDge index, while not leading in either metric, demonstrates intermediate performance that still outperforms several conventional indices, such as DB and ASW.
This apparent paradox between per-data-set wins and overall rankings stems from fundamental differences in how indices handle marginal cases. Some indices may occasionally guess k correctly but produce wildly inaccurate estimates when they fail, while others maintain stable near-optimal performance [24]. Our results extend this finding, showing that CH and SLEDgeH belong to the latter category—their errors, when they occur, deviate less from the true k than those of indices like CNAK, which alternate between perfect guesses and substantial misses. This makes them particularly valuable for applications requiring consistent performance across diverse data sets, though the higher hit rate of CNAK may prove preferable in scenarios where exact k determination is critical [31]. The density-based CDR maintains its reputation for handling irregular cluster structures [27], while the weighted approach of SLEDgeH successfully balances precision and stability across data types.
6.2.3. Categorical Clustering with Real-World Data
In this section, we evaluate the performance of the indices when applying the ROCK algorithm [12], described in Section 2.2. Although we use the same data sets as in the real-world data analysis section, we treat this analysis separately because ROCK is specific to applications involving categorical data. Since CNAK relies on k-means++, we do not include it in this analysis.
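For readers who want to reproduce this setup, a minimal sketch follows using the ROCK implementation from the pyclustering library (an assumption; the paper does not specify an implementation), with categorical attributes one-hot encoded and an illustrative connectivity radius eps.

```python
import numpy as np
from pyclustering.cluster.rock import rock  # assumed implementation

def rock_labels(data, k, eps=1.5):
    """Cluster one-hot-encoded categorical data into k groups with ROCK.

    eps is the connectivity radius used to build the link graph; the
    value here is illustrative and should be tuned per data set.
    """
    instance = rock(data.tolist(), eps, k)
    instance.process()
    labels = np.empty(len(data), dtype=int)
    for cluster_id, members in enumerate(instance.get_clusters()):
        labels[members] = cluster_id
    return labels

# Toy one-hot-encoded data with two obvious groups.
data = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 0, 1],
                 [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]])
print(rock_labels(data, k=2))  # index values are then computed per k
```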
The results presented in this section relate to the data sets described in Section 5.1.2, under the evaluation method described in Section 5. As in the real-world data analysis section, the overall performance of the indices appears in Table 16 through the hit rate, average MRE, and STD measures, while the remaining tables contain only the MRE.
The analysis of the overall performance of the indices (Table 16) reveals notable differences compared to the previous evaluation (Table 12). SLEDgeH demonstrates the best overall performance, achieving the highest hit rate (61%) and the lowest MRE (0.16), while also maintaining excellent stability (STD = 0.23). This represents a significant improvement over its performance in the previous analysis, where it had a lower hit rate (22%) and a higher MRE (0.33), despite still showing good stability. The CDR index now performs comparably to SLEDgeH in terms of stability (STD = 0.23) and hit rate (50%), while its MRE (0.21) remains competitive. The ASW index shows better results, achieving a higher hit rate (56%) and a lower MRE (0.34) compared to its previous performance (hit rate = 11%, MRE = 1.04).
In contrast, the CH index, which had the best performance in the previous analysis (MRE = 0.28), now performs worse, exhibiting a higher MRE (0.45) and significantly poorer stability (STD = 0.68). The CI and CUBAGE indices show mixed results—CI improves markedly in MRE (0.43 vs. 1.41), while CUBAGE maintains a high MRE (0.61) but with greater stability than before. The DB index remains one of the worst performers, with a high MRE (1.16) and a low hit rate (33%), though it shows some improvement over its previous performance (MRE = 2.04, hit rate = 0%).
These results suggest that the choice of the clustering algorithm type significantly impacts index performance. Although CH performed best in the previous evaluation, SLEDgeH and CDR emerge as more reliable options, primarily due to their balance of accuracy, stability, and hit rate. The improved performance of ASW indicates that some indices are more sensitive to the underlying clustering method than others. Overall, SLEDgeH stands out as the most robust index, demonstrating high precision and consistency.
In the analysis of index performance for different numbers of clusters (Table 17), we identify significant differences compared to the previous evaluation (Table 14). SLEDgeH is the best index for k = 2 and k = 4, with MREs of 0.00 and 0.17, respectively, clearly outperforming the other indices in these scenarios. For k = 3, CDR presents the best performance (MRE = 0.33), while for k = 5 all indices converge to the same value (MRE = 0.60), a behavior different from the previous evaluation, where DB showed the best result (MRE = 0.13).
When comparing with the index behavior in the previous analysis, we note that ROCK tends to produce more balanced results among the indices, especially for larger k values. Examining the previous evaluation in Table 14, we observe that CH performed very well for k = 2 (MRE = 0.20), but with ROCK its result is slightly worse (MRE = 0.30). The behavior of DB is particularly interesting—while in the previous analysis it improved as k increased, this pattern is not observed with ROCK. The consistency of SLEDgeH under ROCK reinforces its robustness across different cluster number configurations.
The performance analysis of the indices on real-world data sets clustered by ROCK (Table 18) reveals significant differences compared to the distance-based approach presented in Table 15. SLEDgeH, which showed robust performance with small and consistent errors in the previous analysis, also performs well with ROCK, achieving the minimum MRE (equal to 0) on 11 data sets. Furthermore, examining Table 18, we observe that all indices improve in correctly identifying the number of clusters, indicating that, for these data sets, ROCK generates cluster structures that are more easily identifiable by the different metrics.
Finally, the SLEDgeH index proves consistently robust in both approaches—distance-based and specifically for categorical data—with low standard deviation, confirming its usefulness as a reliable metric regardless of the clustering algorithm. While other indices show improved performance with ROCK, they do not maintain the same superiority across different contexts.
6.2.4. Sensitivity Analysis of Weight Configurations
To evaluate the impact of different weight configurations on the SLEDgeH indicators, we select five weight combinations for sensitivity analysis based on rigorous methodological criteria. We include the default configuration [0.3, 0.1, 0.5, 0.1], obtained through systematic optimization as described in Section 4.1, and four variations that individually emphasize each indicator (Support, Length, Exclusivity, and Difference). This approach allows us to assess both the robustness of the default configuration and the isolated impact of each indicator, while maintaining the unit-sum constraint and avoiding excessive combinatorial complexity. The results demonstrate that this strategy validates the default configuration as optimal while clearly characterizing each component's influence on the overall index performance, as evidenced in Table 19.
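The sketch below illustrates this setup, assuming SLEDgeH aggregates its four indicators as a unit-sum weighted average in [Support, Length, Exclusivity, Difference] order; the indicator values are placeholders, and only the configurations explicitly quoted in this section are spelled out.

```python
import numpy as np

def sledgeh_score(indicators, weights):
    """Aggregate the four indicators as a unit-sum weighted average.

    Assumes indicator values are already normalized to [0, 1] and ordered
    as [Support, Length, Exclusivity, Difference].
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must satisfy the unit-sum constraint"
    return float(np.dot(indicators, weights))

# The default configuration plus the two variations quoted in this section;
# the remaining emphasized variations follow the same pattern.
configs = {
    "default [S,L,E,D]":   [0.3, 0.1, 0.5, 0.1],
    "support-emphasis":    [0.5, 0.1, 0.3, 0.1],
    "difference-emphasis": [0.2, 0.1, 0.2, 0.5],
}

indicators = np.array([0.72, 0.40, 0.85, 0.55])  # placeholder S, L, E, D values
for name, w in configs.items():
    print(f"{name:20s} -> {sledgeh_score(indicators, w):.3f}")
```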
The performance analysis of SLEDgeH with different weight configurations (Table 19) reveals important patterns about indicator suitability for various types of categorical data sets. The default configuration demonstrates the best overall balance, achieving the lowest MRE (0.33) and STD (0.28). This superior performance remains consistent across diverse data sets, ranging from small, well-structured ones like Balloons (16 instances) to complex cases like Mushroom (8124 instances).
The high weight for Exclusivity (0.5) proves crucial for identifying semantically distinct clusters, as evidenced by perfect results (MRE = 0) on Balloons, Chess, and Mushroom. These data sets contain well-defined categories with clearly exclusive patterns between clusters. However, for data sets with natural category overlap, such as SPECT Heart and Hayes Roth, configurations with less emphasis on Exclusivity (e.g., [0.5, 0.1, 0.3, 0.1]) show slightly better performance, suggesting that indicator importance varies according to data characteristics.
The moderate weight for Support (0.3) provides an appropriate balance, ensuring relevance without excessive dominance in the evaluation. In data sets like Indian Diabetes and Mushroom, where pattern frequency strongly indicates cluster quality, configurations with a higher Support weight [0.5, 0.1, 0.3, 0.1] achieve MRE = 0. Conversely, for sparse or noisy data like Survey Lung Cancer and Chess, slightly increasing the Difference weight could improve performance, as suggested by the good results of the [0.2, 0.1, 0.2, 0.5] configuration on this specific data set.
The minimal weights for Length (0.1) and Difference (0.1) reflect their secondary role in most scenarios, though alternative configurations show marginally better results for data sets like Nursery and Students Adapt. This reinforces the need for weight adaptation in specific contexts, while maintaining the default configuration as a general starting point.
These empirical results, combined with theoretical foundations, support the conclusion that the default configuration [0.3, 0.1, 0.5, 0.1] represents the best general-purpose balance for categorical data applications. However, the observed performance variation across different data sets reinforces the recommendation to adjust weights for specific domains, particularly when dealing with characteristics like category overlap, noise, or sparsity.
6.2.5. Graphical Analysis of Cluster Validity Indices
In this section, we analyze how some of the indices behave using a graphical tool. To do so, we select a few data sets, apply the hierarchical algorithm with average linkage for k varying in the same range from 2 to 10, and calculate the indices, with the suggested value of k determined by identifying the maximum value in the series of points. The selected data sets are Chess, Balloons, Car Evaluation, and Nursery. The graphics are presented in Figure 3, Figure 4, Figure 5, and Figure 6.
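The procedure behind these figures can be sketched as follows; here ASW (silhouette) stands in for any of the indices and synthetic blobs stand in for the selected data sets, since each index simply contributes one point per k and the maximum of the series gives the suggested k.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in data

ks = range(2, 11)
scores = []
for k in ks:
    # Hierarchical clustering with average linkage, as in the figures.
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    scores.append(silhouette_score(X, labels))  # one point of the index series

best_k = list(ks)[int(np.argmax(scores))]  # maximum of the series suggests k
print(f"suggested k = {best_k}")
```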
In the Chess data set (Figure 3), the expected k is equal to 2. As we can see, CDR, DB, ASW, and SLEDgeH suggest the correct value of k. It is also interesting to note that, apart from their different scales, and taking the interval as a reference, these four indices have a similar distribution of points. Analyzing the emphasis of the suggestion, CDR and DB suggest k equal to 2, but ASW and SLEDgeH emphasize this suggestion more strongly: their suggestion point is, proportionally, larger than the other points in the series. Regarding CH and CUBAGE, the performance was lower, but it is interesting to see that the distribution of points between them is also quite similar.
Observing the Balloons data set (Figure 4), where the expected k is equal to 2, we can identify some interesting behaviors. The CH, CDR, and SLEDgeH indices suggest k correctly. CUBAGE and DB suggest k equal to 10, and although ASW suggests a value of 4, the proximity of its points shows that ASW identifies potential values of k at 2, 4, and 8. The SLEDgeH index also identifies the same points as potential clustering configurations.
Regarding the Car Evaluation data set (Figure 5), the expected k is equal to 4. Among the indices that suggest the correct value of k, we highlight CH, CUBAGE, ASW, and SLEDgeH. We also observe the same pattern of behavior in the distribution of points of some indices: CH and CUBAGE, which, despite suggesting k equal to 4, present k equal to 3 as a close option; and ASW and SLEDgeH, which emphatically suggest the correct value.
Finally, the indices applied to the Nursery data set (Figure 6), which has an expected k equal to 5, demonstrate interesting results. CH, CDR, CUBAGE, and ASW suggest a k of 2, while DB suggests a k of 6 and SLEDgeH a k of 4. Although all of them appear to be wrong, analyzing the data, we find that out of a total of 12,960 instances, there is a class representing only 0.015% of the data. Therefore, we believe this extreme class imbalance has influenced the final results.