1. Introduction
Clustering is a process of “grouping things by their nature”. As an important research topic in data mining and artificial intelligence, clustering is an unsupervised pattern recognition method that works without the guidance of prior information [
1]. It aims at finding potential similarities and grouping data accordingly, so that the distance between two points in the same cluster is as small as possible, while the distance between data points in different clusters is as large as possible. A large number of studies have been devoted to solving clustering problems. Generally speaking, clustering algorithms can be divided into multiple categories such as partition-based, model-based, hierarchical-based, grid-based, density-based, and their combinations. Various clustering algorithms have been successfully applied in many fields [
2].
With the emergence of big data, there is an increasing demand for clustering algorithms that can automatically understand and summarize data. Traditional clustering algorithms cannot handle data with hundreds of dimensions well, which leads to low efficiency and poor results. It is urgent to improve existing clustering algorithms or to propose new ones so as to raise the stability of the algorithm and ensure the accuracy of clustering [
3].
In June 2014, Rodriguez et al. published clustering by fast search and find of density peaks (referred to as CFSFDP) in
Science [
4]. This is a clustering algorithm based on density and distance. The performance of the CFSFDP algorithm is superior to that of many traditional clustering algorithms in several respects. First, the CFSFDP algorithm is efficient and straightforward: it requires no iterative optimization of an objective function, which significantly reduces the calculation time. Second, cluster centers can be found intuitively with the help of a decision graph. Third, the CFSFDP algorithm can recognize groups regardless of their shape and the dimension of the space. Therefore, shortly after the algorithm was proposed, it was broadly adopted in computer vision [
5], image recognition [
6], and other fields.
Although the CFSFDP algorithm has distinct advantages over other clustering algorithms, it also has some disadvantages, which are as follows:
The measurement method for calculating local density and distance still needs to be improved. It does not consider the processing of complex datasets, which makes it difficult to achieve the expected clustering results when dealing with datasets with various densities, multi-scales, or other complex characteristics.
The allocation strategy of the remaining points after finding the cluster centers may lead to a “domino effect”, whereby once one point is assigned wrongly, there may be many more points subsequently mis-assigned.
As the cut-off distance $d_c$ corresponding to different datasets may differ, it is usually challenging to determine $d_c$. Additionally, the clustering results are sensitive to $d_c$.
In 2018, Rui Liu et al. proposed a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm to solve some fundamental problems of the CFSFDP algorithm [
7]. The SNN-DPC algorithm features the following innovations.
A novel metric to indicate local density is put forward. It makes full use of neighbor information to show the characteristics of data points. This standard not only works for simple datasets, but also applies to complex datasets with multiple scales, cross twining, different densities, or high dimensions.
By improving the distance calculation approach, an adaptive measurement method of distance from the nearest point with large density is proposed. The new approach considers both distance factors and neighborhood information. Thus, it can compensate for the points in the low-density cluster, which means that the possibilities of selecting the correct center point are raised.
For the non-center points, a two-step point allocation is carried out on the SNN-DPC algorithm. The unassigned points are divided into “determined subordinate points” and “uncertain subordinate points”. In the process of calculation, the points are filtered continuously, which improves the possibility of correctly allocating non-center points and avoids the “domino effect”.
Although the SNN-DPC algorithm makes up for some deficiencies of the CFSFDP algorithm, it still has some unsolved problems. Most obviously, the number of clusters in the dataset needs to be input manually. If the number of clusters is unknown, cluster centers can instead be selected through the decision graph. It is worth noting that both methods require some prior knowledge; therefore, the clustering results are less reliable because they are strongly affected by human judgment. In order to solve the problem of adaptively determining the cluster centers, a fast searching density peak clustering algorithm based on shared nearest neighbors and adaptive clustering centers, called DPC-SNNACC, is proposed in this paper. The main innovations of the algorithm are as follows:
The original local density measurement method is updated, and a new density measurement approach is proposed. This approach can enlarge the gap between decision values to prepare for automatically determining the cluster center. The improvement can not only be applied to simple datasets, but also to complex datasets with multi-scale and cross winding.
A novel and fast method is proposed that can select the number of clustering centers automatically according to the decision values. The “knee point” of decision values can be adjusted according to information of the dataset, then the clustering centers are determined. This method can find the cluster centers of all clusters quickly and accurately.
The rest of this paper is organized as follows. In
Section 2, we introduce some research achievements related to the CFSFDP algorithm. In
Section 3, we present the basic definitions and processes of the CFSFDP algorithm and the SNN-DPC algorithm. In
Section 4, we propose the improvements of SNN-DPC algorithm and introduce the processes of the DPC-SNNACC algorithm. Simultaneously, the complexity of the algorithm is analyzed according to processes. In
Section 5, we first introduce some datasets used in this paper, and then discuss some arguments about the parameters used in the DPC-SNNACC algorithm. Further experiments are conducted in
Section 6. In this part, the DPC-SNNACC algorithm is compared with other classical clustering algorithms. Finally, the advantages and disadvantages of the DPC-SNNACC algorithm are summarized, and we also point out some future research directions.
2. Related Works
In this section, we briefly review several kinds of widely used clustering methods and elaborate on the improvement of the CFSFDP algorithm.
K-means is one of the most widely used partition-based clustering algorithms because it is easy to implement, is efficient, and has been successfully applied to many practical case studies. The core idea of K-means is to update the cluster center represented by the centroid of the data point through iterative calculation, and the iterative process will continue until the convergence criterion is met [
8]. Although K-means is simple and has a high computing efficiency in general, there still exist some drawbacks. For example, the clustering result is sensitive to K; moreover, it is not suitable for finding clusters with a nonconvex shape. PAM [
9] (Partitioning Around Medoids), CLARA [
10] (Clustering Large Applications), CLARANS [
11] (Clustering Large Applications based upon RANdomized Search), and AP [
12] (Affinity Propagation) are also typical partition-based clustering algorithms. However, they also fail to find non-convex shape clusters.
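As an illustration of the partition-based approach (not part of the method proposed in this paper), a minimal K-means run with scikit-learn on synthetic data might look as follows; the dataset and parameter values are purely illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three convex clusters (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K must be chosen in advance; the result is sensitive to this choice and to the initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # centroids after the iterative updates converge
print(kmeans.labels_[:10])       # cluster assignments of the first ten points
```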
The hierarchical-based clustering method merges the nearest pair of clusters from bottom to top. Typical algorithms include BIRCH [
13] (Balanced Iterative Reducing and Clustering using Hierarchies) and ROCK [
14] (RObust Clustering using linKs).
The grid-based clustering method divides the original data space into a grid structure with a certain size for clustering. STING [
15] (A Statistical Information Grid Approach to Spatial Data Mining) and CLIQUE [
16] (CLustering In QUEst) are typical algorithms of this type, and their complexity relative to the data size is very low. However, it is difficult to scale these methods to higher-dimensional spaces.
The density-based clustering method assumes that an area with a high density of points in the data space is regarded as a cluster. DBSCAN [
17] (Density-Based Spatial Clustering of Applications with Noise) is the most famous density-based clustering algorithm. It uses two parameters to determine whether the neighborhood of a point is dense: the radius $Eps$ of the neighborhood and the minimum number of points $MinPts$ in the neighborhood.
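For comparison, a minimal DBSCAN run with scikit-learn (again purely illustrative; the parameter values are assumptions for this toy example) shows the two parameters in use.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a non-convex shape that density-based methods handle well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the minimum number of points in that neighborhood.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(np.unique(labels))   # label -1 marks points treated as noise
```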
Based on the CFSFDP algorithm, many algorithms have been put forward to make some improvements. Research on the CFSFDP algorithm mainly involves the following three aspects:
1. The density measurements of the CFSFDP algorithm.
Xu proposed a method to adaptively choose the cut-off distance [
18]. Using the characteristics of the Improved Sheather–Jones (ISJ) method, the cut-off distance $d_c$ can be estimated accurately.
The Delta-Density-based Clustering with a Divide-and-Conquer strategy (3DC) algorithm has also been proposed [
19]. It is based on the Divide-and-Conquer strategy and the density-reachable concept in Density-Based Spatial Clustering of Applications with Noise (referred to as DBSCAN).
Xie proposed a density peak searching and point assigning algorithm based on the fuzzy weighted K-nearest neighbor (FKNN-DPC) technique to solve the problem of the non-uniformity of point density measurements in the CFSFDP algorithm [
20]. This approach uses K-nearest neighbor information to define the local density of points and to search and discover cluster centers.
Du proposed density peak clustering based on K-nearest neighbors (DPC-KNN), which introduces the concept of K-nearest neighbors (KNN) to CFSFDP and provides another option for computing the local density [
21].
Qi introduced a new metric for density that eliminates the effect of $d_c$ on clustering results [
22]. This method uses a cluster diffusion algorithm to distribute the remaining points.
Liu suggested calculating two kinds of densities, one based on k nearest neighbors and one based on local spatial position deviation, to handle datasets with mixed density clusters [
23].
2. Automatic determination of the number of clusters.
Bie proposed the Fuzzy-CFSFDP algorithm [
24]. This algorithm uses fuzzy rules to select centers for different density peaks, and then the number of final clustering centers is determined by judging whether there are similar internal patterns between density peaks and merging density peaks.
Li put forward the concept of potential cluster center based on the CFSFDP algorithm [
25], and considered that if the shortest distance between a potential cluster center and a known cluster center is less than the cut-off distance $d_c$, then the potential cluster center is redundant. Otherwise, it is considered as the center of another group.
Lin proposed an algorithm [
26] that used the radius of neighborhood to automatically select a group of possible density peaks, then used potential density peaks as density peaks, and used CFSFDP to generate preliminary clustering results. Finally, single link clustering was used to reduce the number of clusters. The algorithm can avoid the clustering allocation problem in CFSFDP.
3. Application of the CFSFDP algorithm.
Zhong and Huang applied the improved density and distance-based clustering method to the actual evaluation process to evaluate the performance of enterprise asset management (EAM) [
27]. This method greatly reduces the resource investment in manual data analysis and performance sorting.
Shi et al. used the CFSFDP algorithm for scene image clustering [
5]. Chen et al. applied it to obtain a possible age estimate according to a face image [
6]. Additionally, Li et al. applied the CFSFDP algorithm and entropy information to detect and remove the noise data field from datasets [
25].
3. Clustering by Fast Search and Find of Density Peaks (CFSFDP) Algorithm and Shared-Nearest-Neighbor-Based Clustering by Fast Search and Find of Density Peaks (SNN-DPC) Algorithm
3.1. Clustering by Fast Search and Find of Density Peaks (CFSFDP) Algorithm
It is not necessary to consider the probability distribution or multi-dimensional density in the CFSFDP algorithm, as the performance is not affected by the space dimension, which is why it can handle high-dimensional data. Furthermore, it requires neither an iterative process nor additional parameters, and it is robust with respect to the choice of the cut-off distance $d_c$, its only parameter.
This algorithm is based on the critical assumption that points with a higher local density and a relatively large distance from other high-density points are more likely to be cluster centers. Therefore, for each data point $i$, only two variables need to be considered: its local density $\rho_i$ and its distance $\delta_i$.
The local density $\rho_i$ of data point $i$ is defined as Equation (1):
$$\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \tag{1}$$
where $d_{ij}$ is the Euclidean distance between points $i$ and $j$, and $d_c$ is the cut-off distance, which represents the neighborhood radius of a point and is a hyper-parameter that needs to be specified by users. Equation (1) means that the local density of a point is equal to the number of data points whose distance to it is less than the cut-off distance.
The distance $\delta_i$ is computed as the minimum distance between point $i$ and any other point $j$ with a higher density than $i$, as defined in Equation (2):
$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} \tag{2}$$
It should be noted that for the point with the highest density, its $\delta_i$ is conventionally obtained by Equation (3):
$$\delta_i = \max_{j} d_{ij} \tag{3}$$
Then, points with both high $\rho_i$ and high $\delta_i$ are taken as cluster centers, and the CFSFDP algorithm uses the decision value $\gamma_i$ of each data point $i$ to express the possibility of its becoming a cluster center. The calculation method is shown in Equation (4):
$$\gamma_i = \rho_i \, \delta_i \tag{4}$$
From Equation (4), the higher $\rho_i$ and $\delta_i$ are, the larger the decision value $\gamma_i$ is. In other words, a point with higher $\rho_i$ and $\delta_i$ is more likely to be chosen as a cluster center. The algorithm introduces a representation called the decision graph to help users select centers. The decision graph is the plot of $\delta_i$ as a function of $\rho_i$ for each point.
There is no need to specify the number of clusters in advance, as the algorithm can find the density peaks and identify them as cluster centers; however, users still need to determine the number of clusters by identifying outliers in the decision graph.
After finding the centers, each remaining point is assigned to the cluster to which its nearest neighbor of higher density belongs. The information needed for this step is obtained when calculating $\delta_i$.
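To make the above definitions concrete, the following minimal sketch (using only NumPy and SciPy) computes $\rho$, $\delta$, and $\gamma$ for a small data matrix and assigns every remaining point to the cluster of its nearest higher-density neighbor. It is a simplified reading of Equations (1)–(4); in particular, the cut-off distance and the number of centers are passed in by hand here, whereas the original algorithm reads the number of centers off the decision graph.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cfsfdp(X, d_c, n_centers):
    """Simplified CFSFDP: density by cut-off distance (Eq. 1), delta (Eqs. 2-3),
    decision value gamma (Eq. 4), and nearest-higher-density assignment."""
    d = cdist(X, X)                              # Euclidean distance matrix
    n = len(X)
    rho = (d < d_c).sum(axis=1) - 1              # Eq. (1): points within d_c, excluding the point itself
    order = np.argsort(-rho)                     # indices sorted by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = d[order[0]].max()          # Eq. (3): highest-density point
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]                     # points ranked before i, i.e., with higher density
        j = higher[np.argmin(d[i, higher])]
        delta[i], nearest_higher[i] = d[i, j], j # Eq. (2)
    gamma = rho * delta                          # Eq. (4)
    centers = np.argsort(-gamma)[:n_centers]     # number of centers supplied by hand in this sketch
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                              # assign remaining points in decreasing density order
        if labels[i] == -1:
            j = nearest_higher[i]
            labels[i] = labels[j] if j >= 0 else labels[centers[np.argmin(d[i, centers])]]
    return rho, delta, gamma, labels
```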
3.2. Shared-Nearest-Neighbor-Based Clustering by Fast Search and Find of Density Peaks (SNN-DPC) Algorithm
The SNN-DPC algorithm introduces an indirect distance and density measurement method, taking the influence of neighbors of each point into consideration, and using the concept of shared neighbors to describe the local density of points and the distance between them.
Generally speaking, the larger the number of neighbors shared by two points, the higher the similarity between them.
For each pair of points $i$ and $j$ in the dataset, the intersection of the K-nearest-neighbor set of point $i$ and the K-nearest-neighbor set of point $j$ is defined as the set of shared nearest neighbors, referred to as $SNN(i,j)$. Equation (5) expresses the definition of $SNN(i,j)$:
$$SNN(i,j) = \Gamma(i) \cap \Gamma(j) \tag{5}$$
where $\Gamma(i)$ represents the set of K-nearest neighbors of point $i$ and $\Gamma(j)$ represents the set of K-nearest neighbors of point $j$.
For each pair of points $i$ and $j$ in the dataset, the SNN similarity $Sim(i,j)$ can be defined as Equation (6).
In other words, the SNN similarity is calculated only if points $i$ and $j$ appear in each other's K-nearest-neighbor sets; otherwise, the SNN similarity is zero.
Then, the local density is calculated by Equation (7):
$$\rho_i = \sum_{j \in L(i)} Sim(i,j) \tag{7}$$
where $L(i)$ is the set of the $k$ points with the highest similarities to point $i$. The local density of point $i$ is thus obtained by adding up the similarities of the $k$ points most similar to it.
From the definition of local density, it can be seen that the calculation of local density not only uses the distance information, but also obtains the information about the clustering structure by SNN similarity, which fully reveals the internal relevance between points.
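The following sketch shows one way to compute the SNN similarity and the resulting local density from a data matrix, following the descriptions of Equations (5) and (7); the exact normalization used inside the similarity (squared number of shared neighbors divided by the summed distances to them) is our assumption about Equation (6), not a quotation of it.

```python
import numpy as np
from scipy.spatial.distance import cdist

def snn_density(X, k):
    """Shared-nearest-neighbor similarity and local density in the spirit of Eqs. (5)-(7).
    The normalization inside the similarity is an assumption, not a quotation of the paper."""
    d = cdist(X, X)
    n = len(X)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]          # K-nearest-neighbor sets Gamma(i), excluding the point itself
    knn_sets = [set(row) for row in knn]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if j in knn_sets[i] and i in knn_sets[j]:          # similarity defined only for mutual neighbors
                shared = knn_sets[i] & knn_sets[j]             # Eq. (5): SNN(i, j)
                if shared:
                    denom = sum(d[i, p] + d[j, p] for p in shared)
                    sim[i, j] = sim[j, i] = len(shared) ** 2 / denom   # assumed form of Eq. (6)
    rho = np.sort(sim, axis=1)[:, -k:].sum(axis=1)   # Eq. (7): sum of the k largest similarities per point
    return sim, rho
```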
Regarding $\delta_i$, the SNN-DPC algorithm also uses the distance from the nearest point of higher density, but adds a compensation mechanism based on neighborhood distances, so that the $\delta_i$ value of a point in a low-density cluster may also be high. The definition of this distance is shown in Equation (8).
The $\delta_i$ value of the highest-density point is set to the largest $\delta_j$ among all the remaining points in the dataset, as given by Equation (9):
$$\delta_i = \max_{j \neq i} \delta_j \tag{9}$$
The distance from the nearest point of higher density thus considers not only the distance factor, but also the neighbor information of each point, thereby compensating the points in low-density clusters and improving the possibility that their centers are selected correctly. That is to say, this method can be adapted to datasets with different densities.
The definition of the decision value $\gamma_i$ is the same as in the CFSFDP algorithm. The formula is displayed in Equation (10):
$$\gamma_i = \rho_i \, \delta_i \tag{10}$$
According to the decision graph, the cluster centers can then be determined.
In the SNN-DPC algorithm, the unallocated points are divided into two categories: inevitable subordinate point and possible subordinate point.
Points $p$ and $q$ are two different points in the dataset. Only when at least half of the $k$ nearest neighbors of $p$ and $q$ are shared can they be assigned to the same cluster. The condition is expressed by Equation (11):
$$\left| \Gamma(p) \cap \Gamma(q) \right| \geq \frac{k}{2} \tag{11}$$
If an unassigned point does not meet the criterion for an inevitable subordinate point, it is defined as a possible subordinate point. This condition is reflected by Equation (12):
$$\left| \Gamma(p) \cap \Gamma(q) \right| < \frac{k}{2} \tag{12}$$
According to Equation (11), inevitable subordinate points can be allocated first. For the points that do not meet the equation conditions, the neighborhood information is used to further determine the belonging cluster. During the whole process of the algorithm, the information will be updated continuously to achieve better results.
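A minimal sketch of the membership test behind Equations (11) and (12): an unassigned point $q$ is treated as an inevitable subordinate point of an assigned point $p$ when they share at least half of their $k$ nearest neighbors. The `knn_sets` argument is assumed to hold the K-nearest-neighbor set of every point, as in the earlier sketch.

```python
def is_inevitable_subordinate(knn_sets, p, q, k):
    """Eq. (11): p and q share at least half of their k nearest neighbors."""
    return len(knn_sets[p] & knn_sets[q]) >= k / 2
```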
3.3. Analysis of the SNN-DPC Algorithm
Compared with the CFSFDP algorithm, the SNN-DPC algorithm is a significant improvement. By considering the information of shared neighbors, the SNN-DPC algorithm is applicable to datasets under different conditions. For the non-center points, a two-step allocation method is adopted. With this approach, the algorithm can cluster variable-density and non-convex datasets. However, a method for determining the cluster centers without prior knowledge has not been explored in the SNN-DPC algorithm; in other words, manual intervention is still needed. Moreover, this approach is more effective for datasets with one unique density peak per cluster or with a small number of clusters, because in such cases it is easier to identify the data points with relatively large $\rho$ and $\delta$ from the decision graph. However, in the following cases, the decision graph shows obvious limitations:
For the datasets with unknown cluster numbers, choosing the cluster number is greatly affected by human subjectivity.
In addition to the apparent density peak points in the decision graph, some data points with relatively large $\rho$ and small $\delta$, or relatively small $\rho$ and large $\delta$, may also be cluster centers. These points are easily overlooked when centers are chosen manually, so too few cluster centers may be selected. As a result, data points belonging to different clusters may be mistakenly merged into the same group.
If there are multiple density peaks in the same group, these points can be wrongly selected as redundant cluster centers, resulting in the same cluster being improperly divided into sub-clusters.
When dealing with datasets with many clustering centers, it is also easier to choose the wrong clustering centers.
In summary, the SNN-DPC algorithm is sensitive to the selection of cluster centers, and it is easy to manually select too few or too many of them. This defect is more prominent when dealing with certain particular datasets.
4. Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center (DPC-SNNACC) Algorithm
Through the analysis in
Section 3.3, it can be seen that the SNN-DPC algorithm needs to know the number of clusters in advance. A method that can adaptively find the number of clusters on the basis of the SNN-DPC algorithm was proposed to solve this problem. The improvements in the DPC-SNNACC algorithm are mainly reflected in two aspects. On one hand, an improved calculation method of local density is proposed. On the other hand, the position of the knee point is obtained by calculating the change of the difference between decision values. Then, we can get the center-points and a suitable number of clusters. In this part, we will elaborate on the improvements.
4.1. Local Density Calculation Method
The local density of the SNN-DPC algorithm is defined through the similarity and the number of shared neighbors between two points. For some datasets with unbalanced sample numbers and different densities between certain clusters, the distinction between center and non-center points in the decision graph is vague when choosing the cluster centers. In order to identify the cluster centers more clearly, we use the squared value of the original local density. The enhanced local density, which replaces the density of Equation (7), is defined as Equation (13):
$$\rho_i = \left( \sum_{j \in L(i)} Sim(i,j) \right)^{2} \tag{13}$$
where $L(i)$ is the set of the $k$ points with the highest similarities to point $i$, and $Sim(i,j)$ stands for the SNN similarity, which is calculated based on a symmetric distance matrix.
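In code, the change relative to the SNN-DPC density amounts to squaring the summed similarities; a minimal sketch, assuming the SNN similarity matrix `sim` has already been computed as in Section 3.2:

```python
import numpy as np

def enhanced_density(sim, k):
    """Eq. (13): square the SNN-based local density to widen the gaps between decision values."""
    rho = np.sort(sim, axis=1)[:, -k:].sum(axis=1)   # original SNN-DPC density (Eq. 7)
    return rho ** 2                                  # squared density used by DPC-SNNACC
```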
Through experimental analysis, we can draw the conclusion that changing the density $\rho$ has a greater impact on the decision value $\gamma$ than changing $\delta$. Thus, in order to limit the complexity of the algorithm, we only modify $\rho$ and keep $\delta$ unchanged. In addition, the $\gamma$ values follow the change of $\rho$ according to Equation (10): if the difference in $\rho$ between points increases, so does the difference in $\gamma$. We sorted the $\gamma$ values in ascending order; subsequently, a novel decision graph can be plotted using the rank of each point as the x-axis and the sorted $\gamma$ value as the y-axis. The subsequent analysis of the decision values is based on this new decision graph. The specific analysis is illustrated in
Section 5.
4.2. Adaptive Selection of Cluster Centers
Through observation, it can easily be inferred that the $\gamma$ values of the cluster center points are relatively large and lie far away from the decision values of the non-center points. Moreover, the $\gamma$ values of non-center points are usually small and remain basically the same. According to this characteristic, we propose a method to find the knee point, which is described by Equations (14)–(17).
The search range of the clustering centers can be restricted to the several points with the largest $\gamma$ values. On one hand, reducing the search range of the decision values reduces the calculation time, as we do not need to search the complete dataset. On the other hand, discarding elements that have little impact on the results also has a positive effect on the accuracy of the clustering results. Taking the square root is a quick and straightforward way to reduce the size of a value. If the number of data points in the dataset is defined as $n$, and $DN$ is the integer closest to $\sqrt{n}$, then we restrict the processing range to the $DN$ points with the largest $\gamma$ values.
Specifically,
$$DN = \left[ \sqrt{n} \right] \tag{15}$$
$$\theta_i = \gamma_{i+1} - \gamma_i \tag{17}$$
where $\left[ \cdot \right]$ is the rounding symbol, whose value is the integer closest to the value inside the brackets, and $\theta_i$ is the difference between adjacent sorted $\gamma$ values. Equation (14) finds the point with the largest sort position whose change in the adjacent $\gamma$ difference exceeds a certain threshold; as the threshold, Equation (16) uses the average of the changes of these differences among the $DN$ largest $\gamma$ values. The specific reasons for the selection of $DN$ are explained in
Section 5.4.
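Since Equations (14)–(17) are only described in words above, the following sketch is one plausible reading of the procedure rather than a verbatim implementation: keep the $DN = \left[\sqrt{n}\right]$ largest sorted decision values, take the differences between adjacent values (Equation (17)), use the average change of these differences as the threshold (our reading of Equation (16)), and return the highest-ranked position whose change exceeds that threshold (our reading of Equation (14)).

```python
import numpy as np

def knee_point_clusters(gamma):
    """Adaptively choose the number of clusters from the decision values.
    One plausible reading of Eqs. (14)-(17); the threshold definition is an assumption."""
    gamma = np.sort(np.asarray(gamma, dtype=float))      # ascending decision values
    n = len(gamma)
    dn = int(round(np.sqrt(n)))                          # Eq. (15): DN = [sqrt(n)]
    if dn < 3:
        return 1
    top = gamma[n - dn:]                                 # the DN largest decision values
    theta = np.diff(top)                                 # Eq. (17): DN - 1 adjacent differences
    increments = np.diff(theta)                          # DN - 2 changes of the differences
    threshold = increments.mean()                        # assumed reading of Eq. (16)
    above = np.where(increments > threshold)[0]
    if len(above) == 0:
        return 1                                         # degenerate case: one cluster
    knee = above.max() + 2                               # index of the knee point inside `top`
    return dn - knee                                     # centers = values from the knee point upward
```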
After obtaining the position of the knee point, the number of clusters is the number of sorted decision values from the knee point up to the largest value, and the points corresponding to these values are the cluster centers.
We took the Aggregation dataset (the details of the Aggregation dataset are described in
Section 5.1) as an example. It includes 788 data points (n = 788) in total, so DN = $\left[\sqrt{788}\right]$ = 28. The subsequent calculation only needs to focus on the 28 largest $\gamma$ values. First, the 27 difference values $\theta_i$ were calculated from the 28 largest $\gamma$ values according to Equation (17); then the threshold was calculated as the average of the 26 increments of $\theta_i$ by Equation (16). Next, the largest sort position whose increment is greater than this threshold was found according to Equation (14) to determine the knee point. For this dataset, the knee point lies at position 782, so seven cluster centers were selected, corresponding to the points with decision values $\gamma_{782}$ to $\gamma_{788}$.
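As a quick check of the arithmetic above:

```python
import numpy as np

n = 788                        # number of points in the Aggregation dataset
DN = int(round(np.sqrt(n)))    # sqrt(788) is about 28.07, so DN = 28
print(DN)                      # -> 28
```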
As shown in
Figure 1, the red horizontal line in the figure distinguishes the selected group center points from the non-center points. The color points above the red line are the clustering center points chosen by the algorithm, and the blue points below the red line are the non-center points.
4.3. Processes
The whole process of the DPC-SNNACC algorithm still includes three major steps: the calculation of $\rho$ and $\delta$, the selection of cluster centers, and the distribution of non-center points.
Suppose the dataset and the number of neighbors $k$ are given. Clustering aims to divide the dataset into several classes, the number of which equals the number of clusters; the outputs are the set of cluster centers and the final clustering result. In the following steps, the Euclidean distance matrix, the local density $\rho$, the distance from the nearest point of higher density $\delta$, the decision value $\gamma$, and the ascending sorted decision values are used; the initial clustering covers the determined subordinate points, while the remaining non-center points are temporarily unassigned. $M$ is an ergodic matrix whose rows correspond to the unallocated points and whose columns correspond to the clusters. The various symbols used in the paper are explained in the attached
Table A1.
- Step 1:
Initialize the dataset and standardize all data points so that their values are in the range of [0, 1]. Then, calculate the Euclidean distance matrix.
- Step 2:
Calculate the similarity matrix of SNN according to Equation (5).
- Step 3:
According to Equation (13), calculate the local density $\rho$.
- Step 4:
Calculate the distance $\delta$ from the nearest point of higher density according to Equations (8) and (9).
- Step 5:
Calculate the decision value $\gamma$ according to Equation (10), and arrange the values in ascending order to obtain the sorted decision values.
- Step 6:
According to the knee point condition defined in Equation (14), determine the number of clusters, and then the set of cluster centers.
- Step 7:
Initialize a queue $Q$ and push all center points into $Q$.
- Step 8:
Take the head of $Q$ as point $p$, and find the set $\Gamma(p)$ of the $k$ nearest neighbors of $p$.
- Step 9:
Take an unallocated point $q \in \Gamma(p)$. If $q$ meets the condition defined by Equation (11), then classify $q$ into the cluster where $p$ is located and add $q$ to the end of the queue $Q$. Otherwise, continue with the next point in $\Gamma(p)$. After all the points in $\Gamma(p)$ have been judged, return to Step 8 to proceed with the next determined subordinate point in queue $Q$.
- Step 10:
When $Q$ is empty, the initial clustering result is obtained.
- Step 11:
Find all unallocated points and re-number them. Then, define an ergodic matrix $M$ for the distribution of the possible subordinate points; the rows of $M$ indicate the order numbers of the unassigned points, and the columns represent the clusters.
- Step 12:
Find the maximum value in each row of $M$, and use the cluster where this maximum is located as the cluster of the unallocated point in this row. Update matrix $M$ until all points are assigned (a sketch of Steps 7–12 is given after this list).
- Step 13:
Output the final clustering results.
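A condensed sketch of Steps 7–12, assuming the K-nearest-neighbor lists `knn`, the neighbor sets `knn_sets`, and the center indices have already been computed, and that `labels` is an integer array holding a cluster index at each center and −1 elsewhere. The queue pass allocates the inevitable subordinate points; the ergodic-matrix pass then allocates the remaining possible subordinate points by counting how many of their neighbors already belong to each cluster (ties and empty rows are resolved arbitrarily in this sketch).

```python
from collections import deque
import numpy as np

def allocate(labels, knn, knn_sets, centers, k):
    """Steps 7-12 (sketch): a queue pass for the inevitable subordinate points,
    then an ergodic-matrix pass for the remaining possible subordinate points."""
    n_clusters = len(centers)
    queue = deque(centers)                                   # Step 7: push all centers
    while queue:                                             # Steps 8-10
        p = queue.popleft()
        for q in knn[p]:                                     # neighbors of the head point
            if labels[q] == -1 and len(knn_sets[p] & knn_sets[q]) >= k / 2:   # Eq. (11)
                labels[q] = labels[p]
                queue.append(q)
    while np.any(labels == -1):                              # Steps 11-12
        unassigned = np.where(labels == -1)[0]
        # Ergodic matrix M: one row per unassigned point, one column per cluster,
        # counting how many of the point's neighbors already belong to that cluster.
        M = np.zeros((len(unassigned), n_clusters))
        for r, q in enumerate(unassigned):
            for nb in knn[q]:
                if labels[nb] != -1:
                    M[r, labels[nb]] += 1
        r, c = np.unravel_index(np.argmax(M), M.shape)       # strongest point/cluster pair
        labels[unassigned[r]] = c                            # assign it, then rebuild M
    return labels
```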
We used a simple dataset as an example. This simple dataset contains 16 data points, which can be divided into two categories according to their relative positions. Suppose the number of neighbors $k$ is 5. As shown in
Figure 2, there are two clusters: the red points numbered 1 to 8 belong to one cluster, and the blue points numbered 9 to 16 belong to the other. Clustering aims to divide the dataset into two classes.
According to Steps 1 to 6 above, we can calculate the $\rho$ and $\delta$ of every point in the dataset. Then, the $\gamma$ of each point can be calculated and arranged in ascending order; the distribution of the sorted decision values is shown in
Figure 3. Through Step 6, we can easily obtain the number of clusters. Furthermore, the cluster center set contains two points, one corresponding to each of the two clusters.
To determine the inevitable subordinate points, we pushed all center points into the queue $Q$. First, we take the first center as $p$ and simultaneously pop it out of $Q$. Then, we find the five nearest neighbors of $p$. Every neighbor that meets the condition defined by Equation (11) is added to the cluster of $p$ and pushed into $Q$. Next, the points in $Q$ are checked in order, and the same steps are repeated until $Q$ is empty. This yields the initial clustering result.
The remaining unallocated points and their five nearest neighbors are then examined. We can define the ergodic matrix $M$ for these points as shown in
Table 1. In matrix $M$, the rows represent the unallocated points, and the columns stand for Cluster 1 and Cluster 2, each centered on its respective cluster center.
Table 1 shows that most of the neighbors of the first unallocated point belong to Cluster 1, so this point is assigned to Cluster 1. Similarly, the other unallocated point is assigned to Cluster 2. Finally, all points in the dataset are assigned, and we obtain the clustering results.
4.4. Analysis of Complexity
In this section, we analyze the time complexity of the algorithm according to the algorithm steps in
Section 4.3. The time complexity corresponding to each step is analyzed as follows.
- Step 1:
Normalize the points into the range of [0, 1], so the time complexity is about $O(n)$. Calculate the distance between every pair of points; the time complexity is $O(n^2)$.
- Step 2:
Calculate the K-nearest neighbors of each point and the number of shared nearest neighbors between every two points, computing the intersections with a hash table; this determines the time complexity of Step 2.
- Step 3:
Calculate the initial local density according to the number of shared nearest neighbors and square it; this gives the time complexity of Step 3.
- Step 4:
Calculate the distance from the nearest point of higher density, so the time complexity is $O(n^2)$.
- Step 5:
Calculate the decision values and sort them in ascending order, so the complexity is $O(n \log n)$.
- Step 6:
Calculate the changes in the differences between adjacent sorted decision values among the $DN$ largest points to locate the knee point.
- Steps 7–10:
The total time complexity of Steps 7–10 is the number of loop iterations multiplied by the highest complexity inside the loop. As the K-nearest neighbors of each point have already been obtained in Step 2, each point only needs to be checked against at most $k$ neighbors.
- Steps 11–13:
The total time complexity of Steps 11–13 is the number of loop iterations multiplied by the highest complexity inside the loop, which depends on the number of remaining unallocated points and the number of clusters.
In summary, the time complexity of the entire DPC-SNNACC algorithm is dominated by the computation of the pairwise distance and similarity matrices.
5. Discussion
Before the experiments, some parameters should be discussed. The purpose was to find the best value of each parameter related to the DPC-SNNACC algorithm. First, some datasets and metrics are introduced. Second, we discuss the performance of the DPC-SNNACC algorithm from several aspects, including the number of neighbors $k$, the decision values $\gamma$, and the search range $DN$. The optimal parameter values corresponding to the optimal metrics were found through comparative experiments.
5.1. Introduction to Datasets and Metrics
The performance of the clustering algorithm is usually verified with some datasets. In this paper, we applied 14 commonly used datasets [
28,
29,
30,
31,
32,
33,
34,
35,
36] containing eight synthetic datasets in
Table 2 and four UCI (University of California, Irvine) real datasets in
Table 3. The tables list the basic information including the number of data records, the number of clusters, and the data dimensions. The datasets in
Table 2 are two-dimensional for the convenience of graphic display. Compared with synthetic datasets, the dimensions of the real datasets in
Table 3 are usually bigger than 2.
The evaluation metrics of the clustering algorithm usually include internal and external metrics. Generally speaking, internal metrics are suitable for the situation of unknown data labels, while external metrics have a good reflection on the data with known data labels. As the datasets used in this experiment had already been labeled, several external evaluation metrics were used to judge the accuracy of clustering results including normalized mutual information (NMI) [
37], adjusted mutual information (AMI) [
38], adjusted Rand index (ARI) [
38], F-measure [
39], accuracy [
40], and Fowlkes–Mallows index (FMI) [
41]. The maximum values of these metrics are 1, and the larger the values of the metrics, the higher the accuracy.
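Most of these external metrics are available in scikit-learn; a short sketch of how they might be computed for a predicted partition against ground-truth labels (the label arrays below are placeholders):

```python
from sklearn import metrics

# Placeholder ground-truth and predicted cluster labels.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

print("NMI:", metrics.normalized_mutual_info_score(true_labels, pred_labels))
print("AMI:", metrics.adjusted_mutual_info_score(true_labels, pred_labels))
print("ARI:", metrics.adjusted_rand_score(true_labels, pred_labels))
print("FMI:", metrics.fowlkes_mallows_score(true_labels, pred_labels))
```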
5.2. The Influence of k on the Metrics
The only parameter that needs to be determined in the DPC-SNNACC algorithm is the number of nearest neighbors $k$. In order to analyze the impact of $k$ on the algorithm, the Aggregation dataset was used, as shown in
Figure 4.
To select the optimal number of neighbors, we increased the number of neighbors from 5 to 100. For the lower boundary, if the number of neighbors is too low and the density is sparse, hardly any similarity can be measured; furthermore, a small $k$ may cause errors for some datasets. Thus, the lower limit was set to 5. For the upper limit, if the value of $k$ is much too high, on one hand, the algorithm becomes complex and runs for a long time; on the other hand, an excessively high value affects the results of the algorithm. The analysis of $k$ shows that an exorbitant $k$ brings no further improvement to the results, so testing beyond this range is of little significance. We set 100 as the upper limit.
When $k$ ranges from 5 to 100, the corresponding metrics oscillate noticeably, and the trends of the selected metrics “AMI”, “ARI”, and “FMI” are more or less the same. Therefore, we can represent the changes of multiple metrics by one of them. For example, when the AMI metric reaches its optimal value, the other external metrics float near their optimal values as well, and the corresponding $k$ is the best number of neighbors. Additionally, it can be seen from the change in the metrics that the value of each metric tends to stabilize with the increase in $k$. However, an exorbitant $k$ value leads to a decrease in the metrics. Therefore, if a certain $k$ value is defined in advance without experiments, the optimal clustering result cannot be obtained; furthermore, the significance of the similarity measurement is lost. In the case of the Aggregation dataset, when $k$ = 35, each metric value was higher, so 35 could be selected as the best number of neighbors for the Aggregation dataset.
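The experiment described above amounts to sweeping $k$ and recording an external metric for each run; a sketch of such a sweep, assuming a `dpc_snnacc(X, k)` function (a hypothetical name standing in for the algorithm of Section 4) that returns predicted labels:

```python
from sklearn.metrics import adjusted_mutual_info_score

def best_k(X, true_labels, dpc_snnacc, k_values=range(5, 101)):
    """Sweep the number of neighbors k and keep the value giving the highest AMI."""
    scores = {k: adjusted_mutual_info_score(true_labels, dpc_snnacc(X, k)) for k in k_values}
    best = max(scores, key=scores.get)
    return best, scores
```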
5.3. The Comparison of γ Values between SNN-DPC and DPC-SNNACC
As we change the calculation method of the local density, the difference between adjacent $\gamma$ values increases. Two different datasets were used to illustrate the comparison between the two algorithms in
Figure 5 and
Figure 6. The values of $k$ used were the best parameters in the respective algorithms.
After comparing and analyzing the above figures, whether using the SNN-DPC algorithm or the DPC-SNNACC algorithm, the correct number of cluster centers could be selected on the Jain and Spiral datasets; the distinction was in the differences between the $\gamma$ values. In other words, the improved method showed a bigger gap between the cluster centers and the non-center points, thus indicating that we can use the DPC-SNNACC algorithm to identify the cluster centers and reduce unnecessary errors.
5.4. The Influence of Different DN on the Metrics
As mentioned above, $DN$ represents the search range of the clustering centers and is the integer closest to $\sqrt{n}$. In order to reduce the search range of the decision values of the cluster centers, we used the integer closest to $\sqrt{n}$ as the search scale of the clustering centers. In this part, the effects of using several other values instead of $\left[\sqrt{n}\right]$ as $DN$ are analyzed further. This section applies the Aggregation and Compound datasets to illustrate this problem.
As can be seen from
Figure 7,
Figure 8 and
Figure 9, $DN = \left[\sqrt{n}\right]$ achieves a better number of clusters than the other values of $DN$ in terms of the distributions of the sorted $\gamma$ values and the clustering results. Furthermore, with the other choices of $DN$, the algorithm loses the exact number of clusters.
Table 4 uses a number of metrics to describe the problem objectively; they reflect the clustering situations when $DN$ takes different values. The bold type indicates the best clustering situation. $DN = \left[\sqrt{n}\right]$ obtained higher metrics than the other two candidate values, indicating that $\left[\sqrt{n}\right]$ is the best value for determining $DN$.
It can be seen from
Figure 10,
Figure 11 and
Figure 12 that different values of $DN$ obtained the same number of clusters, but the corresponding metrics were quite different. Comparing
Figure 10b and
Figure 11b, it can be clearly seen that the case of $DN = \left[\sqrt{n}\right]$ correctly separated each cluster, while the other case divided a complete cluster into three parts and merged two upper-left clusters that should have been separated.
The conclusions of
Table 5 are similar to those in
Table 4; when the other candidate values were selected as $DN$, the performances were not as good as with $\left[\sqrt{n}\right]$. This means that choosing the integer closest to $\sqrt{n}$ as the search range of the decision values is reasonable for determining the cluster centers.
7. Conclusions
In this paper, in order to solve the problem that the SNN-DPC algorithm needs to select cluster centers through a decision graph or needs to input the cluster number manually, we proposed an improved method called DPC-SNNACC. By optimizing the calculation method of local density, the difference in the local density among different points becomes larger as does the difference in the decision values. Then, the knee point is obtained by calculating the change in decision values. The points with a high decision value are selected as clustering centers, and the number of clustering centers is adaptively obtained. In this way, the DPC-SNNACC algorithm can solve the problem of clustering for unknown or unfamiliar datasets.
The experimental and comparative evaluation on several datasets from diverse domains established the viability of the DPC-SNNACC algorithm. It correctly obtained the clustering centers, and almost every metric reached the standard of the SNN-DPC algorithm, which is superior to the traditional CFSFDP and K-means algorithms. Moreover, the DPC-SNNACC algorithm is applicable to datasets of different dimensions and sizes. Although it has some shortcomings, such as a long running time, it remains feasible within an acceptable range. In general, the DPC-SNNACC algorithm not only retains the advantages of the SNN-DPC algorithm, but also solves the problem of self-adaptively determining the number of clusters. Furthermore, the DPC-SNNACC algorithm is applicable to datasets of any dimension and size, and it is robust to noise and to differences in cluster density.
In future work, first, we can further explore clustering algorithms based on shared neighbors, find a more accurate method to automatically determine the number of nearest neighbors $k$, and simplify the process of determining the algorithm parameters. Second, the DPC-SNNACC algorithm can be combined with other algorithms to give full play to their advantages and make up for its shortcomings. Third, the algorithm can be applied to practical problems to increase its applicability.