To further illustrate the effectiveness of the algorithm in detecting clusters of different shapes and densities in close proximity to each other, experiments were conducted on several synthetic datasets and on real datasets from the UCI (University of California, Irvine) repository, a public repository widely used in machine learning and data mining research, and the proposed method was compared with current state-of-the-art clustering algorithms using three external evaluation metrics. The selected datasets, notably R15, Pathbased, D31, S1, and DS577, are characterized by ambiguous boundaries and highly overlapping samples, making them well suited for demonstrating the applicability of the proposed method. Finally, the effect of the nearest neighbor parameter K on the clustering results is analyzed. All experiments were conducted on a PC with an Intel Core i9-12900H 2.50 GHz processor, 16 GB RAM, the Windows 11 operating system, and Python 3.11.
4.1. Preparation
Ten synthetic datasets and ten UCI real datasets were prepared for this experiment; detailed information on these datasets, including data size, dimensionality, and the number of true clusters, is given in Table 3 and Table 4. All datasets were obtained from https://github.com/Chelsea547/Clustering-Datasets (accessed on 21 January 2025).
This section also compares several classical and state-of-the-art algorithms to demonstrate the advantages of the proposed algorithm: DPC, DBSCAN, K-means, RNN-DBSCAN, LDP-MST, and HCDC. These six algorithms have already been described in the introduction and related work, so their details are not repeated here. In particular, this paper uses three traditional external evaluation metrics, clustering accuracy [36] (ACC), adjusted Rand index [37] (ARI), and normalized mutual information [38] (NMI), to measure the algorithm's performance. ACC is an important metric for evaluating classification performance; it is the ratio of correctly predicted samples to the total number of samples and takes values in [0,1]. In this paper, ACC is used to evaluate the consistency between the clustering results and the true labels. ARI measures the similarity between two clusterings while correcting for chance assignments of element pairs and takes values in [−1,1]. NMI compares the consistency of different clustering results and takes values in [0,1]. For all three metrics, larger values indicate more effective clustering.
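For reference, the three metrics can be computed as in the following minimal Python sketch (not part of the original experimental code): ARI and NMI are taken directly from scikit-learn, while ACC is obtained by finding the optimal one-to-one mapping between predicted clusters and true labels via the Hungarian algorithm, which is the standard construction for clustering accuracy.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: best one-to-one match between predicted clusters and true classes,
    # found with the Hungarian algorithm, divided by the number of samples.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    clusters = np.unique(y_pred)
    classes = np.unique(y_true)
    # Contingency matrix between predicted clusters and true classes.
    w = np.zeros((clusters.size, classes.size), dtype=np.int64)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == k))
    # linear_sum_assignment minimizes cost, so negate to maximize matches.
    row, col = linear_sum_assignment(-w)
    return w[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return {
        "ACC": clustering_accuracy(y_true, y_pred),               # in [0, 1]
        "ARI": adjusted_rand_score(y_true, y_pred),               # in [-1, 1]
        "NMI": normalized_mutual_info_score(y_true, y_pred),      # in [0, 1]
    }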
4.2. Experiments on Synthetic Datasets
This section analyzes comparative tests against the six other state-of-the-art algorithms on ten synthetic datasets. The original distributions of the synthetic datasets are shown in Figure 7a–j, where data points belonging to the same cluster are marked with the same color. In particular, Figure 7d,g exhibit uneven density, and the datasets D31, S1, DS577, and T4.8K all feature indistinct cluster boundaries, especially D31 and S1. To clearly illustrate the experimental setup, Table 5 lists the experimental parameters of each algorithm on the synthetic datasets, following the parameter selections in the corresponding original papers. Table 6 presents the ACC, ARI, and NMI results of each algorithm on the different synthetic datasets.
Figure 8 shows the clustering results of the proposed CPDD-ID algorithm on the synthetic datasets. On popular datasets such as Spiral, Jain, and Zelink1, the results match the distribution of the original data, and all three evaluation metrics equal 1. On datasets with a high degree of overlap between clusters, such as D31 and S1, the CPDD-ID algorithm accurately separates clusters that do not belong to the same category and also scores well on the evaluation metrics. Finally, the CPDD-ID algorithm correctly recognizes the different cluster structures in datasets with ambiguous boundaries such as Pathbased and DS577.
Additionally, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 display the clustering results of the comparison algorithms DPC, DBSCAN, K-means, RNN-DBSCAN, LDP-MST, and HCDC on the ten synthetic datasets. Note that for K-means, the cluster centers detected on each dataset are marked with red pentagrams; because variations in the initial parameters can lead to different results, the reported K-means results are averaged over 30 runs under the same experimental setup.
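As an illustration of this averaging protocol (the actual experimental scripts are not shown in the paper), the following sketch runs scikit-learn's K-means 30 times with different random seeds and averages the three scores, reusing the evaluate() helper from the sketch in Section 4.1:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_averaged(X, y_true, n_clusters, n_runs=30):
    # Run K-means n_runs times with different seeds and average the metrics,
    # since the initial centers can change the final partition.
    scores = []
    for seed in range(n_runs):
        y_pred = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed).fit_predict(X)
        scores.append(evaluate(y_true, y_pred))  # evaluate() from Section 4.1 sketch
    return {m: float(np.mean([s[m] for s in scores]))
            for m in ("ACC", "ARI", "NMI")}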
Figure 9 presents the clustering results of DPC. DPC is applied to the various datasets by setting different truncation distances and gives good results on spherical datasets: it performs well on datasets such as R15 but poorly on datasets with uneven density such as Pathbased and Jain. The clustering results of DBSCAN are shown in Figure 10. DBSCAN can handle datasets of arbitrary shape, and it particularly excels on datasets such as Jain and Zelink1, whose clusters have clear structures and are far apart from each other. However, on datasets such as R15 and S1, whose inter-cluster structure is unclear and highly overlapping, it incorrectly merges nearby clusters into a single cluster.
As shown in Figure 11, K-means performs well on the Zelink1 dataset but poorly on Aggregation, Spiral, and Pathbased, incorrectly grouping different clusters together and even detecting the wrong cluster centers. It also performs poorly on noisy datasets such as T4.8K.
RNN-DBSCAN is a density-based clustering algorithm derived from DBSCAN that uses reverse nearest neighbors to compute the local density, and it shows better clustering performance than DBSCAN, as shown in Figure 12. In particular, it correctly identifies the underlying cluster structure of the Jain and Aggregation datasets. However, it is not as good as DBSCAN at handling noisy datasets such as T4.8K, and it is not suitable for datasets with ambiguous boundaries such as R15, D31, and Pathbased.
The clustering results of LDP-MST and HCDC are shown in Figure 13 and Figure 14, respectively. Both algorithms handle arbitrarily shaped and highly overlapping datasets well, but both perform poorly on datasets with uneven density such as Pathbased and DS577.
Unlike the six comparison algorithms above, the CPDD-ID algorithm uses two-phase clustering. In the partitioning phase, local density maxima are detected using kernel density estimation, which effectively captures the density distributions of dense and sparse regions and avoids incorrectly assigning sparse clusters to dense clusters; on the Pathbased, Aggregation, and Jain datasets, for example, correctly identifying subclusters with different density distributions in the partitioning phase provides strong support for the subsequent merging phase. The merging phase resembles hierarchical clustering: starting from structural similarity, the subclusters with maximum similarity are iteratively merged according to the interaction degree of shared nearest neighbors between subclusters. The proposed merging strategy performs well on datasets whose cluster structures are highly overlapping and ambiguous, achieving first-place results on the Jain, Spiral, Zelink1, Pathbased, and DS577 datasets and second-place results even on datasets with high overlap between clusters such as R15 and S1. In summary, the CPDD-ID algorithm combines the advantages of density-based and hierarchical clustering, showing more universal performance than the other algorithms.
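To make the merging criterion concrete, the following sketch computes a simple shared-nearest-neighbor similarity between two subclusters by counting how many of each subcluster's K-nearest neighbors fall in the other subcluster. This is only an illustrative stand-in; the exact interaction degree used by CPDD-ID is the one defined earlier in the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, labels, k, a, b):
    # Illustrative SNN similarity between subclusters a and b: count, over the
    # points of each subcluster, how many of their k nearest neighbors fall in
    # the other subcluster, normalized by the combined subcluster size.
    # NOT the exact interaction degree of CPDD-ID, only a simplified stand-in.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]                      # drop each point itself
    in_a, in_b = labels == a, labels == b
    cross_ab = np.sum(in_b[idx[in_a]])    # neighbors of a's points inside b
    cross_ba = np.sum(in_a[idx[in_b]])    # neighbors of b's points inside a
    return (cross_ab + cross_ba) / (in_a.sum() + in_b.sum())

Iteratively merging the pair of subclusters with the largest similarity until the target number of clusters is reached then yields a hierarchical-style merging phase of the kind described above.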
4.3. Experiments on Real Datasets
In this section, the performance of the CPDD-ID algorithm is further evaluated against the six other algorithms on ten real datasets, all taken from the UCI machine learning repository. The parameters of the seven algorithms on all the real datasets are given in Table 7, and Table 8 shows the clustering results of the selected real datasets under the different algorithms.
The results show that the CPDD-ID algorithm ranks first in ACC, NMI, and ARI on the Ionosphere and Pima datasets, ranks first in two of the three metrics on the Wine, Satimage, and Balance datasets, and places in the top three on the remaining datasets. Overall, the CPDD-ID algorithm performs excellently on low-dimensional datasets, and its strategy of reasonable partitioning and merging based on the correlation between all dimensions also proves effective on high-dimensional datasets. As a result, its clustering performance on high-dimensional datasets such as Ionosphere and Satimage is better than that of the other algorithms.
4.4. Parameter Analysis of Algorithms
To verify the effect of the shared nearest neighbor parameter K on the performance of the CPDD-ID algorithm, a further analysis was performed on the synthetic datasets. The value of K was varied from 1 to 50, and Figure 15 shows the effect of this variation on the ACC, NMI, and ARI scores.
It can be seen that the results on the Aggregation dataset smooth out as K increases, remaining constant especially in the range of 11 to 33 and trending downward when K exceeds about 43. The Spiral dataset shows a decreasing and then stabilizing trend, the Pathbased dataset shows only small fluctuations, and the scores also stabilize on DS577, a dataset with uneven density. In particular, the CPDD-ID algorithm performs very smoothly and well on datasets with ambiguous and highly overlapping boundaries such as R15, D31, and S1, which shows that it is effective on this type of dataset.
On the basis of the above analysis, we suggest setting K within [5,12] for datasets with a stream-like (manifold) structure such as Spiral, Jain, and Pathbased. Such datasets generally contain many local density peaks and may be partitioned into many subclusters during division; if K is too large, subclusters that do not belong to the same cluster structure may be incorrectly merged. For datasets with highly overlapping and interconnected clusters, such as Aggregation, R15, D31, and S1, the merging criterion must be made stricter by requiring a larger number of shared K-nearest neighbors; otherwise, highly overlapping cluster structures may be merged incorrectly. Moreover, smaller K values tend to merge clusters that are connected only by a few samples into a single cluster. Therefore, K in [10,25] is recommended for such datasets.
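A parameter sweep of this kind can be scripted as follows; here cpdd_id is a hypothetical callable standing in for the proposed algorithm's entry point (no public implementation is assumed), and evaluate() is the metrics helper sketched in Section 4.1:

def sweep_k(X, y_true, cpdd_id, k_values=range(1, 51)):
    # cpdd_id(X, k) is a hypothetical entry point returning predicted labels.
    results = {}
    for k in k_values:
        y_pred = cpdd_id(X, k)
        results[k] = evaluate(y_true, y_pred)  # evaluate() from Section 4.1 sketch
    return results

# A stable plateau of ACC/ARI/NMI across a wide K range (e.g., 11 to 33 on
# Aggregation) indicates low sensitivity to the parameter on that dataset.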