To further illustrate the effectiveness of the algorithm in detecting clusters of different shapes and densities in close proximity to each other, experiments were conducted on several synthetic datasets and on real datasets from the UCI (University of California, Irvine) repository, a public repository widely used in machine learning and data mining research, and the proposed method was compared with current state-of-the-art clustering algorithms using three external evaluation metrics. The selected datasets, notably R15, Pathbased, D31, S1, and DS577, are characterized by ambiguous boundaries and highly overlapping samples, making them well suited for demonstrating the applicability of the proposed method. Finally, the effect of the nearest neighbor parameter K on the clustering results is analyzed. All experiments were conducted on a PC with an Intel Core i9-12900H 2.50 GHz processor, 16 GB RAM, the Windows 11 operating system, and Python 3.11.
4.1. Preparation
Ten synthetic datasets and ten UCI real datasets were prepared for this experiment; detailed information on these datasets, including data size, dimensionality, and the number of true clusters, is given in Table 3 and Table 4. All datasets were obtained from https://github.com/Chelsea547/Clustering-Datasets (accessed on 21 January 2025).
This section also compares several classical and state-of-the-art algorithms to demonstrate the advantages of the proposed algorithm: DPC, DBSCAN, K-means, RNN-DBSCAN, LDP-MST, and HCDC. These six algorithms have already been described in the introduction and related work, so their details are not repeated here. In particular, this paper uses three traditional external evaluation metrics, clustering accuracy [36] (ACC), adjusted Rand index [37] (ARI), and normalized mutual information [38] (NMI), to measure the algorithm's performance. ACC is an important metric for evaluating classification performance; it is the ratio of correctly predicted samples to the total number of samples and takes values in [0,1]. In this paper, ACC is used to evaluate the consistency between the clustering results and the true labels. ARI measures the similarity between two clusterings while correcting for chance assignments of element pairs and takes values in [−1,1]. NMI compares the consistency of different clustering results and takes values in [0,1]. For all three metrics, larger values indicate more effective clustering.
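For reference, the three metrics can be computed as in the following minimal Python sketch (not part of the original experimental code): ARI and NMI are taken directly from scikit-learn, while ACC is obtained by finding the optimal one-to-one mapping between predicted clusters and true labels via the Hungarian algorithm, which is the standard construction for clustering accuracy.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one match between predicted clusters and true classes,
    # found with the Hungarian algorithm, divided by the number of samples.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    clusters = np.unique(y_pred)
    classes = np.unique(y_true)
    # Contingency matrix between predicted clusters and true classes.
    w = np.zeros((clusters.size, classes.size), dtype=np.int64)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == k))
    # linear_sum_assignment minimizes cost, so negate to maximize matches.
    row, col = linear_sum_assignment(-w)
    return w[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return {
        "ACC": clustering_accuracy(y_true, y_pred),               # in [0, 1]
        "ARI": adjusted_rand_score(y_true, y_pred),               # in [-1, 1]
        "NMI": normalized_mutual_info_score(y_true, y_pred),      # in [0, 1]
    }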
4.2. Experiments on Synthetic Datasets
This section analyzes comparative tests against the six other state-of-the-art algorithms on ten synthetic datasets. The original distributions of the synthetic datasets are shown in Figure 7a–j, where data points belonging to the same cluster are marked with the same color. In particular, Figure 7d,g exhibit uneven density, and the datasets D31, S1, DS577, and T4.8K all feature indistinct cluster boundaries, especially D31 and S1. To clearly illustrate the experimental setup, Table 5 lists the experimental parameters of each algorithm on the synthetic datasets, following the parameter selections in the corresponding original papers. Table 6 presents the ACC, ARI, and NMI results of each algorithm on the different synthetic datasets.
Figure 8 shows the clustering results of the proposed CPDD-ID algorithm on the synthetic datasets. On popular datasets such as Spiral, Jain, and Zelink1, the results match the distribution of the original data, and all three evaluation metrics equal 1. On datasets with a high degree of overlap between clusters, such as D31 and S1, the CPDD-ID algorithm accurately separates clusters that do not belong to the same category and also scores well on the evaluation metrics. Finally, the CPDD-ID algorithm correctly recognizes the different cluster structures in datasets with ambiguous boundaries such as Pathbased and DS577.
Additionally, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 display the clustering results of the comparison algorithms DPC, DBSCAN, K-means, RNN-DBSCAN, LDP-MST, and HCDC on the ten synthetic datasets. Note that for K-means, the cluster centers detected on each dataset are marked with red pentagrams; because variations in the initial parameters can lead to different results, the reported K-means results are averaged over 30 runs under the same experimental setup.
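As an illustration of this averaging protocol (the actual experimental scripts are not shown in the paper), the following sketch runs scikit-learn's K-means 30 times with different random seeds and averages the three scores, reusing the evaluate() helper from the sketch in Section 4.1:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_averaged(X, y_true, n_clusters, n_runs=30):
    # Run K-means n_runs times with different seeds and average the metrics,
    # since the initial centers can change the final partition.
    scores = []
    for seed in range(n_runs):
        y_pred = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed).fit_predict(X)
        scores.append(evaluate(y_true, y_pred))  # evaluate() from Section 4.1 sketch
    return {m: float(np.mean([s[m] for s in scores]))
            for m in ("ACC", "ARI", "NMI")}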
Figure 9 presents the clustering results of DPC. DPC is applied to the various datasets by setting different truncation distances and gives good results on spherical datasets: it performs well on datasets such as R15 but poorly on datasets with uneven density such as Pathbased and Jain. The clustering results of DBSCAN are shown in Figure 10. DBSCAN can handle datasets of arbitrary shape, and it particularly excels on datasets such as Jain and Zelink1, whose clusters have clear structures and are far apart from each other. However, on datasets such as R15 and S1, whose inter-cluster structure is unclear and highly overlapping, it incorrectly merges nearby clusters into a single cluster.
As shown in Figure 11, K-means performs well on the Zelink1 dataset but poorly on Aggregation, Spiral, and Pathbased, incorrectly grouping different clusters together and even detecting the wrong cluster centers. It also performs poorly on noisy datasets such as T4.8K.
RNN-DBSCAN is a density-based clustering algorithm derived from DBSCAN that uses reverse nearest neighbors to compute the local density, and it shows better clustering performance than DBSCAN, as shown in Figure 12. In particular, it correctly identifies the underlying cluster structure of the Jain and Aggregation datasets. However, it is not as good as DBSCAN at handling noisy datasets such as T4.8K, and it is not suitable for datasets with ambiguous boundaries such as R15, D31, and Pathbased.
The clustering results of LDP-MST and HCDC are shown in Figure 13 and Figure 14, respectively. Both algorithms handle arbitrarily shaped and highly overlapping datasets well, but both perform poorly on datasets with uneven density such as Pathbased and DS577.
Unlike the six comparison algorithms above, the CPDD-ID algorithm uses two-phase clustering. In the partitioning phase, local density maxima are detected using kernel density estimation, which effectively captures the density distributions of dense and sparse regions and avoids incorrectly assigning sparse clusters to dense clusters; on the Pathbased, Aggregation, and Jain datasets, for example, correctly identifying subclusters with different density distributions in the partitioning phase provides strong support for the subsequent merging phase. The merging phase resembles hierarchical clustering: starting from structural similarity, the subclusters with maximum similarity are iteratively merged according to the interaction degree of shared nearest neighbors between subclusters. The proposed merging strategy performs well on datasets whose cluster structures are highly overlapping and ambiguous, achieving first-place results on the Jain, Spiral, Zelink1, Pathbased, and DS577 datasets and second-place results even on datasets with high overlap between clusters such as R15 and S1. In summary, the CPDD-ID algorithm combines the advantages of density-based and hierarchical clustering, showing more universal performance than the other algorithms.
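To make the merging criterion concrete, the following sketch computes a simple shared-nearest-neighbor similarity between two subclusters by counting how many of each subcluster's K-nearest neighbors fall in the other subcluster. This is only an illustrative stand-in; the exact interaction degree used by CPDD-ID is the one defined earlier in the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, labels, k, a, b):
    # Illustrative SNN similarity between subclusters a and b: count, over the
    # points of each subcluster, how many of their k nearest neighbors fall in
    # the other subcluster, normalized by the combined subcluster size.
    # NOT the exact interaction degree of CPDD-ID, only a simplified stand-in.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]                      # drop each point itself
    in_a, in_b = labels == a, labels == b
    cross_ab = np.sum(in_b[idx[in_a]])    # neighbors of a's points inside b
    cross_ba = np.sum(in_a[idx[in_b]])    # neighbors of b's points inside a
    return (cross_ab + cross_ba) / (in_a.sum() + in_b.sum())

Iteratively merging the pair of subclusters with the largest similarity until the target number of clusters is reached then yields a hierarchical-style merging phase of the kind described above.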
4.3. Experiments on Real Datasets
In this section, the performance of the CPDD-ID algorithm is further evaluated against the six other algorithms on ten real datasets, all taken from the UCI machine learning repository. The parameters of the seven algorithms on all the real datasets are given in Table 7, and Table 8 shows the clustering results of the selected real datasets under the different algorithms.
The results show that the CPDD-ID algorithm ranks first in ACC, NMI, and ARI on the Ionosphere and Pima datasets, ranks first in two of the three metrics on the Wine, Satimage, and Balance datasets, and places in the top three on the remaining datasets. Overall, the CPDD-ID algorithm performs excellently on low-dimensional datasets, and its strategy of reasonable partitioning and merging based on the correlation between all dimensions also proves effective on high-dimensional datasets. As a result, its clustering performance on high-dimensional datasets such as Ionosphere and Satimage is better than that of the other algorithms.
4.4. Parameter Analysis of Algorithms
To verify the effect of the shared nearest neighbor parameter K on the performance of the CPDD-ID algorithm, a further analysis was performed on the synthetic datasets. The value of K was varied from 1 to 50, and Figure 15 shows the effect of this variation on the ACC, NMI, and ARI scores.
It can be seen that the results on the Aggregation dataset smooth out as K increases, remaining constant especially in the range of 11 to 33 and trending downward when K exceeds about 43. The Spiral dataset shows a decreasing and then stabilizing trend, the Pathbased dataset shows only small fluctuations, and the scores also stabilize on DS577, a dataset with uneven density. In particular, the CPDD-ID algorithm performs very smoothly and well on datasets with ambiguous and highly overlapping boundaries such as R15, D31, and S1, which shows that it is effective on this type of dataset.
On the basis of the above analysis, we suggest setting K within [5,12] for datasets with a stream-like (manifold) structure such as Spiral, Jain, and Pathbased. Such datasets generally contain many local density peaks and may be partitioned into many subclusters during division; if K is too large, subclusters that do not belong to the same cluster structure may be incorrectly merged. For datasets with highly overlapping and interconnected clusters, such as Aggregation, R15, D31, and S1, the merging criterion must be made stricter by requiring a larger number of shared K-nearest neighbors; otherwise, highly overlapping cluster structures may be merged incorrectly. Moreover, smaller K values tend to merge clusters that are connected only by a few samples into a single cluster. Therefore, K in [10,25] is recommended for such datasets.
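A parameter sweep of this kind can be scripted as follows; here cpdd_id is a hypothetical callable standing in for the proposed algorithm's entry point (no public implementation is assumed), and evaluate() is the metrics helper sketched in Section 4.1:

def sweep_k(X, y_true, cpdd_id, k_values=range(1, 51)):
    # cpdd_id(X, k) is a hypothetical entry point returning predicted labels.
    results = {}
    for k in k_values:
        y_pred = cpdd_id(X, k)
        results[k] = evaluate(y_true, y_pred)  # evaluate() from Section 4.1 sketch
    return results

# A stable plateau of ACC/ARI/NMI across a wide K range (e.g., 11 to 33 on
# Aggregation) indicates low sensitivity to the parameter on that dataset.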