Article

Optimization of Density Peak Clustering Algorithm Based on Improved Black Widow Algorithm

1 College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China
2 School of Computer Science & Technology, China University of Mining and Technology, Xuzhou 221116, China
3 Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, Guangxi Minzu University, Nanning 530006, China
* Author to whom correspondence should be addressed.
Biomimetics 2024, 9(1), 3; https://doi.org/10.3390/biomimetics9010003
Submission received: 25 November 2023 / Revised: 14 December 2023 / Accepted: 14 December 2023 / Published: 21 December 2023

Abstract

Clustering is an unsupervised learning method. Density Peak Clustering (DPC), a density-based algorithm, intuitively determines the number of clusters and identifies clusters of arbitrary shapes. However, it cannot function effectively without the correct parameter, referred to as the cutoff distance (dc). The traditional DPC algorithm exhibits noticeable shortcomings in the initial setting of dc when confronted with different datasets, necessitating manual readjustment. To address this defect, we propose a new algorithm that integrates DPC with the Black Widow Optimization Algorithm (BWOA), named Black Widow Density Peaks Clustering (BWDPC), which automatically optimizes dc so as to maximize clustering accuracy. In the experiments, BWDPC is compared with three other algorithms on six synthetic datasets and six University of California Irvine (UCI) datasets. The results demonstrate that the proposed BWDPC algorithm more accurately identifies density peak points (cluster centers) and achieves superior clustering results. Therefore, BWDPC represents an effective improvement over DPC.

1. Introduction

Clustering is a type of unsupervised learning method [1] that plays a crucial role in extracting essential and potentially valuable information from data [2]. It aggregates data objects into clusters based on some similarity measure without the need for any prior knowledge about the data: data objects within the same cluster exhibit high similarity, while those in different clusters exhibit low similarity. Leveraging these advantages, clustering has been applied in various fields, such as community detection [3], pattern recognition [4], image processing [5], financial services [6], and security detection [7], and has achieved success in these domains.
With the rapid advancement of data science and machine learning, cluster analysis has evolved into a fundamental technique in the realms of data mining and pattern recognition. The development of clustering algorithms has progressed through several crucial stages. Initially, clustering algorithms primarily revolved around distance-based hierarchical methods such as Clustering Using REpresentatives (CURE) [8] and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [9], and partition-based approaches such as K-means [10], K-medoids [11], and the Gaussian Mixture Model (GMM) [12]. Despite their effectiveness on simple datasets, these algorithms faced limitations in handling large-scale and high-dimensional data. As data sizes and complexity increased, researchers started developing more sophisticated clustering algorithms. A notable advancement came with the introduction of density-based clustering algorithms, exemplified by Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [13] and Ordering Points to Identify the Clustering Structure (OPTICS) [14]. These algorithms eschew reliance on global distance measures; instead, they ascertain cluster structures based on the density of data points. Compared with traditional clustering algorithms, density-based clustering algorithms have greater advantages in finding clusters of arbitrary shapes and sizes and in processing data of complex shapes [15,16]. Nonetheless, density-based clustering algorithms have their drawbacks [17,18], including sensitivity to parameters, issues with density differences, and challenges in managing high-dimensional data environments.
Density Peak Clustering (DPC) [19] is a density-based clustering algorithm proposed by Alex Rodriguez and Alessandro Laio in 2014. This algorithm identifies density peaks in a dataset by calculating the local density and distance of data points and assigns them to different clusters. DPC is a valuable tool in data analysis, revealing the underlying structure and patterns that facilitate a deeper understanding of the data. In computer vision, DPC finds applications in image segmentation, aiding in the identification of distinct objects or areas within an image. DPC can quickly discover density peak points (cluster centers) of datasets with arbitrary shapes, efficiently allocating sample points to clusters and removing outliers. However, DPC has limitations in capturing complex-shaped multimodal clusters [20,21] because the assumption about cluster centers does not provide a reliable criterion for identifying true density peaks. As a result, incorrect center selection leads to poor clustering results. Furthermore, DPC exhibits high sensitivity to parameters, with the clustering outcomes frequently affected by the value of dc. Varied dc values may result in distinct clustering outcomes. Additionally, the DPC algorithm tends to underperform on high-dimensional data, failing to produce satisfactory clustering results. To address this issue, researchers have conducted studies to enhance DPC. Chen [22] proposed a DPC algorithm based on Chebyshev inequality and differential privacy (DP-CDPC); the DP-CDPC algorithm can effectively select the clustering centers, improve the quality of clustering results, and provide good privacy protection performance. Wu [23] proposed a quantum DPC (QDPC) algorithm based on a quantum DistCalc circuit and a Grover circuit. It can effectively improve the efficiency of dealing with a large number of data scenarios, and it is easier to pinpoint the parameter range. In order to further improve the operation efficiency of DPC, Jiang [24] combined the artificial bee colony algorithm with density peak clustering to propose an enhanced clustering algorithm. This algorithm effectively utilized the global and local search capabilities of the artificial bee colony algorithm, resulting in higher quality and more stable clustering results. Li [25] integrated the particle swarm optimization algorithm with density peak clustering, presenting a dynamic optimization algorithm based on automatic and fast density peak clustering. It created parameter-insensitive subpopulations based on particle self-density and relative distance, improving the accuracy and robustness of clustering results.
The Black Widow Algorithm [26] is an optimization algorithm based on principles from biology, inspired by the mating behavior of black widow spiders. This algorithm simulates different movement strategies employed by black widow spiders during the mating process to search for optimal solutions. By introducing various motion strategies such as crawling, crouching, and swinging, the algorithm imitates the spider’s different behaviors in the mating process. These motion strategies enable the spider to explore the search space locally and globally, gradually optimizing the individual’s position to find the best solution. Moreover, the Black Widow Algorithm considers the role of pheromones; spiders release and perceive pheromones to influence the movement strategies and behaviors of other spiders, thus cooperating and collaborating within the group to achieve better search results. The Black Widow Algorithm exhibits good search performance and convergence and is suitable for various optimization problems, including function optimization, parameter optimization, and machine learning tasks. By simulating the mating behavior and pheromone communication of black widow spiders, this algorithm efficiently finds optimal solutions and demonstrates promising performance in practical applications.
The reference [19] does not provide a method for calculating the cutoff distance “dc” but leaves it to the users to define manually. To overcome the problem of manual selection and achieve automatic optimization of the DPC algorithm’s cutoff distance “dc”, this paper proposes a Density Peak Clustering algorithm based on the Black Widow Optimization Algorithm (BWDPC). The BWDPC clustering algorithm aims to select the optimal “dc” value using the Silhouette Coefficient (Sil) as the optimization objective. The algorithm follows a selection process where, within a certain number of iterations, it chooses the “dc” value corresponding to the highest Sil. This process helps to identify the best density centers under a reasonable “dc” setting. The results obtained from synthetic datasets and UCI real datasets demonstrate that the BWDPC algorithm can correctly select density centers. The main contributions of the BWDPC algorithm are as follows:
  • Intelligent Optimization with Sil Objective: By using an intelligent optimization algorithm with the Silhouette Coefficient as the objective, the BWDPC algorithm overcomes the problem of inaccurate density center selection in previous DPC algorithms, which could lead to chain errors in the clustering results.
  • Improved Black Widow Algorithm: The traditional Black Widow Algorithm has been modified by incorporating search factors, making it more suitable for optimizing the DPC algorithm. Multiple rounds of swarm intelligence search have been conducted to address issues such as the algorithm’s limited search paths and slow convergence.
  • Automatic Selection of “dc”: BWDPC requires only the initialization of “dc”, and then it automatically selects the appropriate “dc” value during the clustering process. This feature makes it well-suited for handling large-scale datasets.
The main content of this paper is as follows: Section 2 introduces related works. Section 3 primarily discusses the proposed BWDPC method. Section 4 presents experiments and discussions. Section 5 presents the conclusions.

2. Related Works

2.1. The DPC Algorithm

The DPC algorithm is based on two intuitive assumptions:
  • Points around a clustering center have lower densities than the center itself.
  • Clustering centers are farther away from other points with higher densities.
The algorithm requires the input of a cutoff distance parameter dc. It then automatically selects clustering center points on the decision graph based on the given dc value. Afterward, using a one-step allocation strategy, it assigns the remaining points to the clusters represented by the identified centers to complete the clustering process.

2.1.1. The Relevant Parameters of the DPC Algorithm

In the DPC algorithm, there are two different ways to calculate the local density based on the size of the dataset. When the dataset is large, the calculation of local density is as follows:
$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$  (1)
$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$  (2)
where dc is the cutoff distance and $\rho_i$ represents the local density of point i. $d_{ij}$ denotes the Euclidean distance between point i and point j, and x is equivalent to $d_{ij}$ minus dc; $\chi(x) = 1$ when x < 0 and $\chi(x) = 0$ when x ≥ 0. $d_{ij}$ is defined as follows:
$d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$  (3)
where i represents point i, j represents point j, and k represents the dimension of a certain point. When the dataset size is small, the definition of local density is as follows:
$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right)$  (4)
where dc, the cutoff distance, is the only parameter of DPC. In DPC, dc = Msort(round(p·n)), where Msort denotes the set of all values in the distance matrix M sorted in ascending order. The value of p is typically between 1% and 2%.
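As an illustration of this heuristic, the following minimal Python sketch picks dc from the sorted list of pairwise distances; it assumes that the argument of Msort indexes that sorted list (so n counts its entries) and that p is set to roughly 1–2%.

```python
import numpy as np
from scipy.spatial.distance import pdist

def default_cutoff_distance(X, p=0.02):
    """Heuristic dc = Msort(round(p*n)): take the entry at position round(p*n)
    of the ascending list of all pairwise Euclidean distances (p about 1-2%)."""
    msort = np.sort(pdist(X))                 # Msort: all pairwise distances, ascending
    pos = max(int(round(p * len(msort))), 1)  # clamp so the index points at least to the first entry
    return msort[pos - 1]
```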
Let δ_i denote the center deviation distance of point xi. The center deviation distance represents the minimum distance between point xi and the set of points whose local density is larger than that of xi. For the point xi with the largest local density, δ_i represents the maximum distance between that point and the other points. The center deviation distance is defined in Equation (5).
$\delta_i = \begin{cases} \min\limits_{j:\, \rho_j > \rho_i} (d_{ij}), & \text{if } \exists j \text{ such that } \rho_j > \rho_i \\ \max\limits_{j} (d_{ij}), & \text{otherwise} \end{cases}$  (5)
Given a dataset X = {x1, x2, …, xn}, for ∀xi ∈ X, the decision value for xi is calculated as follows:
$\gamma_i = \rho_i \cdot \delta_i$  (6)
where $\rho_i$ is the local density of xi, $\delta_i$ is the relative distance of xi, and $\gamma_i$ is the decision value of xi.
As shown in Figure 1, the blue point is a class cluster, and the red point is another class cluster. After calculating the ρ and δ values based on the aforementioned steps for each point, points 1 and 10 have the highest γ values, and it can be seen that sample points 1 and 10 are located at the upper right corner of the decision diagram. Therefore, points 1 and 10 are identified as cluster centers. The black points 26, 27, and 28 have a relatively high δ and a low ρ because they are isolated, and they can be considered outliers. Clustering centers should have large δ and ρ .
The clustering process of DPC is as follows (a minimal Python sketch is given after the list):
  • Calculate the local density ( ρ ) and relative distance (δ) of each sample point using Equations (1)–(5).
  • Calculate the decision value for each sample point using Equation (6).
  • Select the points with higher decision values as the cluster centers.
  • Once the cluster centers are identified, allocate the remaining points in descending order of their local density; each remaining point is assigned to the same cluster as its nearest neighbor with higher local density.
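The following Python sketch illustrates these steps. It uses the Gaussian-kernel density of Equation (4), the center deviation distance of Equation (5), and the decision value of Equation (6); the number of cluster centers n_centers is assumed to be supplied, standing in for the manual decision-graph selection.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc(X, dc, n_centers):
    """Minimal sketch of the DPC steps using the Gaussian-kernel density of Eq. (4)."""
    D = squareform(pdist(X))                           # Euclidean distance matrix, Eq. (3)
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0     # local density; subtract the self term
    order = np.argsort(-rho)                           # indices sorted by descending density
    n = len(X)
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    delta[order[0]] = D[order[0]].max()                # densest point: max distance to others
    for k in range(1, n):
        i, denser = order[k], order[:k]                # points with higher density than i
        j = denser[np.argmin(D[i, denser])]
        delta[i], nearest_denser[i] = D[i, j], j       # center deviation distance, Eq. (5)
    gamma = rho * delta                                # decision value, Eq. (6)
    centers = np.argsort(-gamma)[:n_centers]           # points with the highest decision values
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    if labels[order[0]] == -1:                         # guard: tie the densest point to its nearest center
        labels[order[0]] = labels[centers[np.argmin(D[order[0], centers])]]
    for i in order:                                    # one-step allocation in descending density order
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers
```

For example, labels, centers = dpc(X, default_cutoff_distance(X), n_centers=7) would cluster a dataset such as Aggregation under the heuristic dc sketched above.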
Compared with DPC, K-means shows greater sensitivity to the initial selection of cluster centers, and different initial values can yield different clustering results. In contrast, DPC eliminates the need for initializing cluster centers, rendering its clustering outcomes insensitive to initial conditions. DBSCAN may produce poorer clustering results in situations with significant density variations, while DPC excels in handling clusters with different densities because it determines clusters through local density peaks.

2.1.2. The Limitations of DPC

The idea of the DPC algorithm is relatively simple, as it can recognize clusters of arbitrary shapes and intuitively determine the number of clusters. However, it still has the following shortcomings:
The cutoff distance dc value needs to be manually set, and its selection is quite sensitive. To illustrate this issue more intuitively, take the clustering results obtained from the Aggregation dataset shown in Figure 2 as an example. It can be observed that different values of dc lead to significantly different clustering results. Therefore, optimizing the cutoff distance dc becomes particularly important. The "+" in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7 represents the cluster center, and samples of different types of clusters are represented by different colors.
As shown in Figure 2, when the cutoff distance (dc) is set to 3.1, DPC correctly divides the Aggregation dataset into seven clusters, yielding satisfactory clustering results. However, with dc set to 3.2, 3.3, and 3.4, the clustering quality deteriorates. For instance, with dc = 3.2, points from the same cluster in the upper-left corner are incorrectly split into two clusters, and points from two clusters in the lower-left corner are merged into one, resulting in suboptimal clustering performance. Similar issues arise at dc = 3.3, where points belonging to the red cluster are incorrectly assigned to the black cluster. Hence, optimizing the cutoff distance dc is crucial for DPC, and the optimal value varies across datasets, influencing clustering outcomes.

2.2. BWOA Algorithm

2.2.1. Spider Movement

The movement of spiders in the spiderweb is modeled in two forms: linear and spiral. The position update formula is as follows:
$x_i(t+1) = x^* - m \cdot x_{r1}(t), \quad \text{if rand} < 0.3$  (7)
$x_i(t+1) = x^* - \cos(2\pi\beta) \cdot m \cdot x_i(t), \quad \text{otherwise}$  (8)
where $x^*$ is the position of the current best individual, m is a random floating-point number between 0.4 and 0.9, and β is a random floating-point number between −1 and 1. m and β are random parameters whose purpose is to ensure the randomness of the black widow's movement and to prevent the search from falling into local optima. t represents the generation of the black widow spider, $x_{r1}(t)$ is the position of the r1-th black widow, and $x_i(t)$ is the current position of the i-th black widow. The rand value falls within the range (0, 1), ensuring the randomness of the black widow spider's movement. Assuming the spider moves in either a linear or a spiral fashion within the web, it follows Formula (7) when rand < 0.3; otherwise, it follows Formula (8).
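A minimal sketch of this position update, assuming a NumPy random generator rng supplies the random quantities:

```python
import numpy as np

def move_spider(x_i, x_best, x_r1, rng):
    """One BWOA position update following Eqs. (7) and (8)."""
    m = rng.uniform(0.4, 0.9)        # random step scale in [0.4, 0.9]
    beta = rng.uniform(-1.0, 1.0)    # random factor in [-1, 1] for the spiral move
    if rng.random() < 0.3:
        return x_best - m * x_r1                              # linear movement, Eq. (7)
    return x_best - np.cos(2.0 * np.pi * beta) * m * x_i      # spiral movement, Eq. (8)
```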

2.2.2. Pheromone

Pheromones play a crucial role in the mating process of spiders. A spider's diet is closely correlated with variations in its pheromone signals and with the quality and quantity of its silk. Male spiders are more sensitive to the sex pheromones secreted by well-nourished females, both because such signals indicate higher reproductive capability and because responding to them helps males avoid the cost of mating with potentially starved females. Therefore, male spiders tend to avoid females with low pheromone content. The pheromone deposition rate for black widow spiders can be calculated according to Formula (9).
$pheromone(i) = \frac{fitness_{\max} - fitness(i)}{fitness_{\max} - fitness_{\min}}$  (9)
When the pheromone value is less than or equal to 0.3, the individual will be replaced, and the position update formula is as follows:
$x_i(t) = x^*(t) + \frac{1}{2}\left[x_{r1}(t) - (-1)^{\sigma} \cdot x_{r2}(t)\right]$  (10)
where $x_{r1}$ and $x_{r2}$ are two different individuals, and σ is a random number that is either 0 or 1, introduced to randomize the updated position. $x_i(t)$ indicates the new location of a female black widow with a low pheromone level, $x_{r1}(t)$ is the position of the r1-th black widow, and $x_{r2}(t)$ is the position of the r2-th black widow. It is specified that r1 must not be equal to r2.
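The two formulas can be sketched as follows; fit, fit_max, and fit_min denote the fitness of individual i and the extreme fitness values in the population, and rng is again a NumPy random generator:

```python
import numpy as np

def pheromone(fit, fit_max, fit_min):
    """Pheromone deposition rate as defined in Formula (9)."""
    return (fit_max - fit) / (fit_max - fit_min)

def replace_low_pheromone(x_best, x_r1, x_r2, rng):
    """Reposition an individual whose pheromone value is <= 0.3, following Eq. (10)."""
    sigma = rng.integers(0, 2)                            # sigma is randomly 0 or 1
    return x_best + 0.5 * (x_r1 - (-1.0) ** sigma * x_r2)
```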

2.3. Abbreviations and Their Descriptions

In this paper, the Silhouette Coefficient is employed as the objective function, and several commonly used clustering evaluation metrics are used to assess the performance of the clustering algorithms. The indices used include the Silhouette Coefficient [27], the Fowlkes–Mallows Index (FMI) [28], the Adjusted Rand Index (ARI) [29], and Adjusted Mutual Information (AMI) [30]. The silhouette reflects the consistency of a clustering result and can be used to evaluate the separation between clusters after clustering; the other three are external metrics that quantify the agreement between clustering results and the true class labels. The Fowlkes–Mallows Index (FMI) exhibits sensitivity to outliers, while the Adjusted Rand Index (ARI) demonstrates robustness in handling random clustering results, rendering it more practical in stochastic scenarios. The Adjusted Mutual Information (AMI) remains insensitive to dataset size, making it suitable for datasets of various scales. The four evaluation indices are described in detail as follows:
(1)
Silhouette Coefficient (Sil):
The Silhouette Coefficient is a method to examine how similar an object is to its own cluster compared to other clusters. A dataset D with n sample points is divided into k clusters C = {C1, C2, …, Ck}. Let a(t) be the average dissimilarity of sample t to the other samples in its own cluster Cj, and let d(t, Ci) be the average dissimilarity of sample t to all samples in another cluster Ci; then b(t) = min{d(t, Ci)}, where i ≠ j. The Silhouette index Sil(t) of sample t is calculated by Formula (11).
$Sil(t) = \frac{b(t) - a(t)}{\max\{a(t), b(t)\}}$  (11)
The Sil(t) value reflects how compact the cluster containing sample t is and how well separated it is from the other clusters. The average of the Sil(t) values of all samples reflects the quality of the clustering result: the greater the average Sil value, the more compact the clusters and the better the clustering quality. Sil has a value range of [−1, 1].
$Sil = \frac{1}{N}\sum_{t=1}^{N} Sil(t)$  (12)
Sil represents the average silhouette value of all sample points, and N represents the total number of samples.
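In practice, this quantity can be obtained directly from scikit-learn, whose silhouette_score returns the mean of Equation (12) over all samples; the toy data below is only for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])  # toy data matrix
labels = np.array([0, 0, 1, 1])                                  # cluster assignment to evaluate
sil = silhouette_score(X, labels)   # average of Sil(t) over all samples, i.e., Eq. (12)
print(sil)                          # close to 1 for this well-separated assignment
```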
(2)
Fowlkes–Mallows Index (FMI):
The Fowlkes–Mallows Index is defined as follows:
$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$  (13)
TP represents the count of sample pairs correctly assigned to the same cluster. FP represents the count of sample pairs that, according to the true labels, do not belong to the same category but are incorrectly assigned to the same cluster. FN represents the count of sample pairs that, according to the true labels, belong to the same category but are incorrectly not assigned to the same cluster. The value of FMI ranges from 0 to 1, with a higher value indicating a better clustering result.
(3)
The Adjusted Rand Index (ARI):
The Rand Index (RI) is defined as follows:
$RI = \frac{TP + TN}{TP + FP + TN + FN}$  (14)
TN represents the number of sample pairs that do not belong to the same category in the true labels and are correctly not assigned to the same cluster. The Adjusted Rand Index (ARI) is adjusted, and its definition is as follows:
$ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)}$  (15)
where E(⋅) represents the expected value, and the ARI has a range of [−1, 1]. The closer the value is to 1, the higher the clustering accuracy is indicated.
(4)
Adjusted Mutual Information (AMI):
Similar to ARI, AMI is another widely used cluster evaluation indicator, and its definition is as follows:
$AMI = \frac{MI - E(MI)}{\max(H(A), H(B)) - E(MI)}$  (16)
where H(A) and H(B) denote the entropies of the two label assignments, and AMI assesses the clustering effect based on mutual information (MI). E(MI) represents the mathematical expectation of MI. MI, which measures the agreement between two label distributions, is computed with the following formula:
$MI(y_{true}, y_{pred}) = \sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij} \log\left(\frac{p_{ij}}{p_i \cdot p_j}\right)$  (17)
pij = mij/N, where k is the total number of clusters, mij is the number of samples in the intersection of the sample sets for cluster i in the true labels and cluster j in the clustering algorithm, and N is the total number of samples. pi represents the ratio of the number of samples in cluster i in the true labels to the total number of samples N, and pj represents the ratio of the number of samples in cluster j in the clustering algorithm to the total number of samples, N. The AMI has the same range as the ARI, and a higher value indicates more accurate clustering results.
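All three external indices are available in scikit-learn, which is one way the comparisons in Section 4 could be reproduced; y_true denotes the ground-truth labels and y_pred the labels produced by a clustering algorithm (the values below are only illustrative).

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             fowlkes_mallows_score)

y_true = [0, 0, 0, 1, 1, 1]   # ground-truth class labels
y_pred = [0, 0, 1, 1, 1, 1]   # labels produced by a clustering algorithm
fmi = fowlkes_mallows_score(y_true, y_pred)       # Eq. (13), in [0, 1]
ari = adjusted_rand_score(y_true, y_pred)         # Eq. (15), at most 1
ami = adjusted_mutual_info_score(y_true, y_pred)  # Eq. (16), at most 1
print(fmi, ari, ami)
```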
The DPC algorithm, based on density peaks, operates without the need for a predefined number of clusters. It dynamically determines the number of clusters by analyzing density relationships between data points, showcasing adaptability to clusters of diverse shapes and sizes. Furthermore, as DPC identifies clusters through density peaks, it demonstrates strong adaptability to clusters with irregular shapes. Compared to certain agglomerative methods, DPC excels in capturing clusters with varying density distributions. Nevertheless, as illustrated in Figure 2, the DPC algorithm exhibits high sensitivity to the dc parameter. In the subsequent work, the project team optimized this parameter using the Black Widow Algorithm and conducted the corresponding experiments.

3. The Clustering of BWDPC

For the traditional DPC algorithm, the selection of the cutoff distance dc heavily relies on manual configuration and lacks intelligent optimization. Therefore, this paper proposes the BWDPC algorithm, which can automatically acquire more optimal values, thereby achieving accurate classification. In this section, the improvements of the BWOA algorithm and its combination with the DPC algorithm are elaborately presented.

3.1. The Shortcomings of the BWOA

The Black Widow Algorithm is an innovative population optimization algorithm inspired by the unique mating behavior of black widow spiders. It features minimal control parameters, straightforward operation, and strong optimization performance. Despite showing good performance in certain cases, this algorithm also has some drawbacks and limitations, including difficulties in parameter selection, susceptibility to local optima, problem-specific nature, and slow convergence speed, among others.

3.2. The RBWOA Model

The conventional BWOA algorithm exhibits slow convergence, particularly in addressing complex and high-dimensional optimization problems. It is susceptible to entrapment in local optima, with a weak search capability. To tackle these challenges, and informed by the issues identified in the proposed DPC algorithm, we devise a strategy to dynamically update the spider population range, capitalizing on the characteristics of the sigmoid function [30].
$u = \frac{1}{1 + e^{-t}}$  (18)
$S_{i+1} = S_i - r \times u \times S_i$  (19)
where u represents the search factor, t is the number of iterations, S represents the spider population range, and r is a random number between 0 and 1. Si denotes the maximum population range of the i-th generation of black widow spiders. From these equations, it can be observed that the spider population range decreases as the number of iterations increases, and r is introduced to enhance the randomness of the search. In the initial iterations, with small values of t and a small search factor u, the spider population range is relatively large. As the number of iterations grows and the search factor u increases, the spider population range decreases nonlinearly, gradually approaching 0. This effectively speeds up convergence while preventing entrapment in local optima.
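A minimal sketch of this update, under the assumption that the search factor follows the standard sigmoid u = 1/(1 + e^(−t)) described above:

```python
import numpy as np

def shrink_population_range(S_i, t, rng):
    """Update the spider population range with the sigmoid search factor, Eqs. (18)-(19)."""
    u = 1.0 / (1.0 + np.exp(-t))   # search factor approaches 1 as the iteration count t grows
    r = rng.random()               # random number in (0, 1) that adds randomness to the shrink
    return S_i - r * u * S_i       # the range decreases nonlinearly toward 0
```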

3.3. The Pseudocode of RBWOA

The RBWOA algorithm (Algorithm 1) is as follows.
Algorithm 1: RBWOA algorithm
Input: the initial population range S
Output: the optimal value of x
1. Initialize the population, and evaluate the fitness function values and the population S based on Formula (9).
2. Generate the random parameters m and β.
3. r1 = random.uniform(0, S), r2 = random.uniform(0, S), fitmax = 1, fitmin = −1
4. if rand < 0.3:
5.   xi(t + 1) = x* − m·xr1(t)
6. else:
7.   xi(t + 1) = x* − cos(2πβ)·m·xi(t)
8. fit = Sil  // calculate the fitness value "fit" using the Silhouette Coefficient
9. pheromone = (fitmax − fit) / (fitmax − fitmin)  // calculate the pheromone value using Formula (9)
10. if pheromone ≤ 0.3:
11.   xi(t) = x*(t) + ½[xr1(t) − (−1)^σ·xr2(t)]
12. if xi(t) ≤ 0:
13.   // when xi(t) is less than or equal to 0, the current population Si has been fully searched; iterate to the next population Si+1
14.   u = 1 / (1 + e^(−t))
15.   Si+1 = Si − r × u × Si
16.   return to step 2
17. else:
18.   output xi(t + 1)

3.4. Algorithm Flow Steps

The specific flow of the BWDPC algorithm is presented below, incorporating search factors and a path optimization search strategy to enhance the algorithm's convergence speed and efficiency. The BWDPC algorithm (Algorithm 2) is as follows; a minimal Python sketch of this loop is given after the listing.
Algorithm 2: BWDPC algorithm
Input: Experimental Dataset X = {x1, x2, …, xn}
Output: Clustering Results C = {c1, c2, …, cm}, m Is the Number of Data Cluster Results
1. Set the population size S and the maximum number of iterations T for the BWDPC algorithm
2. Data preprocessing: Calculate the distance matrix for all data points and determine the range of dc values
3. Enter S into BWDPC and set the output x of BWDPC to dc
4. Substitute dc into Equations (4) and (5) to calculate the local density ρi and the relative distance δi for all points
5. Formula (6) is employed to calculate γi, and the initial m points with the highest γi values are automatically chosen as the cluster centers
6. Introduce the evaluation metric Sil as the objective function for BWDPC and record the dc value d* corresponding to the maximum Sil
7. Check whether the termination condition is met. If t > T, end the iteration and proceed to step 9
8. If not, go back to step 3 for further optimization
9. Output the optimal dc value and obtain the final clustering results to complete the clustering process
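The following Python sketch ties the pieces together under several simplifying assumptions: it reuses the dpc() function sketched in Section 2.1.1, restricts the search range of dc to an assumed percentile band of the pairwise distances, applies only the movement rules of Equations (7) and (8) with greedy replacement, and omits the pheromone step for brevity.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def bwdpc(X, n_centers, pop_size=20, max_iter=50, seed=0):
    """Sketch of the BWDPC loop: black-widow search over dc with Sil as the objective."""
    rng = np.random.default_rng(seed)
    d = pdist(X)
    lo = max(np.percentile(d, 0.5), 1e-9)             # assumed lower bound of the dc search range
    hi = np.percentile(d, 20.0)                        # assumed upper bound of the dc search range
    dc_pop = rng.uniform(lo, hi, pop_size)             # spider positions: candidate dc values

    def fitness(dc):
        labels, _ = dpc(X, dc, n_centers)              # dpc() from the sketch in Section 2.1.1
        return silhouette_score(X, labels) if len(set(labels)) > 1 else -1.0

    fits = np.array([fitness(dc) for dc in dc_pop])
    best_dc = dc_pop[np.argmax(fits)]
    for t in range(max_iter):
        for i in range(pop_size):
            m, beta = rng.uniform(0.4, 0.9), rng.uniform(-1.0, 1.0)
            r1 = rng.integers(pop_size)
            if rng.random() < 0.3:
                cand = best_dc - m * dc_pop[r1]                          # linear move, Eq. (7)
            else:
                cand = best_dc - np.cos(2 * np.pi * beta) * m * dc_pop[i]  # spiral move, Eq. (8)
            cand = np.clip(cand, lo, hi)
            f = fitness(cand)
            if f > fits[i]:                             # greedy replacement (simplification)
                dc_pop[i], fits[i] = cand, f
        best_dc = dc_pop[np.argmax(fits)]               # record the dc with the highest Sil
    labels, centers = dpc(X, best_dc, n_centers)
    return best_dc, labels, centers
```

A call such as best_dc, labels, centers = bwdpc(X, n_centers=7) would then return the optimized cutoff distance together with the final clustering for a dataset like Aggregation.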

3.5. Algorithm Time Complexity

For a dataset with a sample size of N, the time complexity of the DPC algorithm mainly consists of calculating the distance matrix D with a complexity of O(N^2), sorting the Euclidean distances with a complexity of O(N^2 log_2 N), and computing the local density ρ and relative distance δ with a complexity of O(N^2). Assuming the maximum range of the black widow population is M, the maximum number of iterations is T, and the dimension of the optimized cutoff distance dc is 1, the complexity of optimizing dc is O(M × T). During the optimization process, as dc changes, it affects the local density ρ and relative distance δ, resulting in a complexity of O(N^2 × T) for this process. In summary, the time complexity of the algorithm proposed in this paper is O(N^2(log_2 N + T)).

4. Experimental Results and Analysis

4.1. Experimental Dataset and Experimental Environment

This section uses 12 datasets, including synthetic and UCI datasets, to validate the proposed clustering algorithm. Table 1 details the attributes of the artificial and real datasets, while Table 2 shows the parameter settings of the BWDPC, DPC, and DBSCAN algorithms. The experimental environment consists of a LENOVO (Riyadh, Saudi Arabia) desktop computer with the Windows 10 64-bit operating system, an Intel i7-10700 processor (Santa Clara, CA, USA), Python 3.9 as the programming environment, PyCharm as the development tool, and 8 GB of RAM.

4.2. Experiments on Synthetic Datasets

For the datasets provided in Table 1, BWDPC, DPC, DBSCAN, and K-means were used for clustering. Figure 3, Figure 4 and Figure 5 show the clustering results of the four algorithms on the R15 dataset, Aggregation dataset, and D31 dataset, respectively. These three datasets have different overall distributions and numbers of clusters, which can more intuitively reflect the clustering performance of the four algorithms. Points with different colors in the figures are assigned to different clusters.
Figure 3. The clustering results of the four algorithms on the R15 dataset.
Figure 4. The clustering results of the four algorithms on the Aggregation dataset.
Figure 5. The clustering results of the four algorithms on the D31 dataset.
The clustering results for the Aggregation dataset in Figure 4 reveal that only the BWDPC algorithm accurately clusters the dataset, while the other three algorithms fall short of achieving precise clustering. Due to an incorrect choice of dc, the DPC algorithm erroneously divides the blue cluster in the top-left corner into two, and the red cluster in the bottom-left corner, originally two clusters, is treated as one, resulting in substantial errors. While K-means correctly determines the number of clusters, it falls short of achieving accurate clustering. Specifically, K-means generates two cluster centers within one cluster and places cluster centers between the black clusters, whereas points in the red and pink clusters should belong to the same cluster. This incorrect setting of cluster centers by K-means results in substantial errors. Even though DBSCAN accurately identifies the number of clusters, some points are erroneously marked as noise. For instance, certain boundary points of the purple cluster in the top-left corner are incorrectly classified as noise, leading to a slightly inferior clustering result.
Figure 3 reveals that only BWDPC and K-means can accurately cluster the R15 dataset. The DPC algorithm exhibits clear errors in choosing cluster centers, leading to unsatisfactory clustering outcomes. DBSCAN erroneously designates certain boundary points as noise, diminishing clustering accuracy. In the D31 dataset, both BWDPC and K-means successfully achieve accurate clustering, whereas DBSCAN not only mislabels certain boundary points as noise but also errs in determining the number of clusters. For instance, the red cluster, which originally belongs to two different clusters, is incorrectly consolidated into one by DBSCAN. DPC also faces analogous problems due to the absence of a suitable dc value, leading to substantial errors in cluster centers. In comparison, BWDPC not only determines the correct number of clusters but also identifies the positions of cluster centers more accurately, resulting in superior clustering performance.
Figure 6 and Figure 7 show the clustering results of the four algorithms on the Two_cluster dataset and Five_cluster dataset, respectively. These datasets further demonstrate the accuracy of the BWDPC algorithm in clustering.
Figure 6. The clustering results of the four algorithms on the Two_cluster dataset.
Figure 7. The clustering results of the four algorithms on the Five_cluster dataset.
From the clustering results in Figure 6 and Figure 7, it can be seen that the BWDPC, DPC, and K-means algorithms accurately cluster these datasets and find the cluster centers. However, the DBSCAN algorithm treats boundary points as noise and misclassifies the data points in the lower-left corner, indicating that DBSCAN may produce incorrect classifications when the data density is uneven, resulting in suboptimal clustering performance.
Table 3 provides clustering evaluation metrics for BWDPC and other comparison algorithms on six datasets. From the evaluation metrics in Table 3, it can be observed that BWDPC, with the improvement in cluster center selection strategy and the optimization of the cutoff distance using the Black Widow Optimization Algorithm, achieved good results on most datasets. Furthermore, all the clustering metrics of the BWDPC algorithm outperformed those of the DPC algorithm, indicating the significant effect of optimizing the cutoff distance dc in BWDPC. The best results are shown in bold, and the clustering metrics used in this section are FMI, ARI, and AMI.

4.3. Experiments on Real-World Datasets

The experiment used six real-world datasets to test the performance of the BWDPC algorithm. These datasets have different sample sizes, feature numbers, and cluster quantities. Table 1 provides specific information for each real dataset. The experiment conducted clustering using BWDPC, DPC, DBSCAN, and K-means on these six datasets, and the results are shown in Table 4, with the best results highlighted in bold.
As shown in Table 4, BWDPC excels over the other clustering algorithms on the six real datasets and shows substantial improvements on certain UCI datasets. For example, BWDPC performs exceptionally well, securing a leading position in clustering results, especially on the widely used Iris and Sym datasets. In comparison to the DPC algorithm, BWDPC improves the FMI, ARI, and AMI scores by 0.08, 0.2, and 0.1, respectively, on the Iris dataset; the Sym dataset sees improvements of 0.01, 0.02, and 0.01 in the FMI, ARI, and AMI scores, respectively. These gains stem from DPC's sensitivity to the cutoff distance parameter (dc) on small-sample datasets. However, on the Segment dataset, despite BWDPC's improved clustering performance over DPC, with increases of 0.10, 0.12, and 0.20 in FMI, ARI, and AMI scores, respectively, the overall clustering performance remains unsatisfactory. This is because BWDPC faces challenges when handling high-dimensional data, which may lead to less satisfying results. Besides the mentioned datasets, the BWDPC algorithm achieves the best clustering results on the Segment and Zoo datasets. Specifically, on the Zoo dataset, BWDPC outperforms the DPC algorithm with increases of 0.31, 0.34, and 0.2 in FMI, ARI, and AMI scores, respectively, showcasing robust performance on low-dimensional data. In comparison to the DPC algorithm, BWDPC automatically optimizes the dc value, resulting in optimal clustering results; the enhanced BWDPC algorithm can precisely identify true cluster centers. In contrast to the K-means algorithm, both BWDPC and DPC can accurately identify cluster centers in streaming datasets. In certain datasets, DBSCAN might misclassify boundary points as noise, leading to less accurate clustering results.

5. Conclusions

In this study, BWDPC utilizes the Black Widow Optimization Algorithm to dynamically determine the optimal cutoff distance dc, thereby improving clustering performance. Moreover, by introducing a search factor and dynamically updating the spider population range, the algorithm addresses the challenge of parameter specificity, allowing it to avoid local optima and expedite convergence. The results obtained from six artificial datasets and six UCI real datasets illustrate that, based on the comparison of the Fowlkes–Mallows Index (FMI), Adjusted Rand Index (ARI), and Adjusted Mutual Information (AMI), BWDPC consistently and accurately identifies cluster centers, yielding optimal clustering outcomes in most cases. It can be asserted that BWDPC outperforms existing algorithms on the majority of datasets, demonstrating higher accuracy. While the BWDPC algorithm achieves automatic optimization of the cutoff distance dc, the parameter K for the number of cluster centers still requires manual determination. Consequently, additional research is necessary to implement the adaptive selection of the parameter K, aiming to decrease empirical input and human involvement while preserving algorithm accuracy and robustness, thus enhancing clustering efficiency. Nonetheless, BWDPC has its limitations. While the algorithm demonstrates proficiency on low-dimensional data, its clustering performance diminishes on high-dimensional data, which also constitutes a key area for future research. BWDPC excels in swiftly identifying irregular-shaped clusters and adeptly adjusting to clusters with varying densities in real-world applications. In summary, by leveraging the strengths of DPC and attaining automatic determination of the dc parameter, BWDPC has delivered exceptional clustering results.

Author Contributions

Software, H.W.; data curation, Y.Z.; writing—original draft preparation, H.H.; writing—review and editing, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62266007); the Guangxi Natural Science Foundation (2021GXNSFAA220068); and the Innovation Project of Guangxi Graduate Education (JGY2022104, JGY2023116).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ding, S.F.; Du, W.; Xu, X.; Shi, T.; Wang, Y.; Li, C. An improved density peaks clustering algorithm based on natural neighbor with a merging strategy. Inf. Sci. 2023, 624, 252–276. [Google Scholar] [CrossRef]
  2. Guan, J.; Li, S.; He, X.; Chen, J. Clustering by fast detection of main density peaks within a peak digraph. Inf. Sci. 2023, 628, 504–521. [Google Scholar] [CrossRef]
  3. Shi, T.H.; Ding, S.; Xu, X.; Ding, L. A community detection algorithm based on Quasi-Laplacian centrality peaks clustering. Appl. Intell. 2021, 51, 1–16. [Google Scholar] [CrossRef]
  4. Gao, M.; Shi, G.-Y. Ship-handling behavior pattern recognition using AIS sub-trajectory clustering analysis based on the T-SNE and spectral clustering algorithms. Ocean Eng. 2020, 205, 106919. [Google Scholar] [CrossRef]
  5. Yan, X.Q.; Ye, Y.; Qiu, X.; Yu, H. Synergetic information bottleneck for joint multi-view and ensemble clustering. Inf. Fusion 2020, 56, 15–27. [Google Scholar] [CrossRef]
  6. Morris, K.; McNicholas, P.D. Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures. Comput. Stat. Data Anal. 2016, 97, 133–150. [Google Scholar] [CrossRef]
  7. Lv, Z.; Di, L.; Chen, C.; Zhang, B.; Li, N. A Fast Density Peak Clustering Method for Power Data Security Detection Based on Local Outlier Factors. Processes 2023, 11, 2036. [Google Scholar] [CrossRef]
  8. Guha, S.; Rastogi, R.; Shim, K. CURE: An efficient clustering algorithm for large databases. ACM Sigmod Rec. 1998, 27, 73–84. [Google Scholar] [CrossRef]
  9. Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM Sigmod Rec. 1996, 25, 103–114. [Google Scholar] [CrossRef]
  10. MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Los Angeles, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
  11. Park, H.S.; Jun, C.-H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 2009, 36, 3336–3341. [Google Scholar] [CrossRef]
  12. Dempster, A.P.; Lanird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22. [Google Scholar] [CrossRef]
  13. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 1996, 96, 226–231. [Google Scholar]
  14. Ankerst, M.; Breunig, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod 1999, 28, 49–60. [Google Scholar] [CrossRef]
  15. Ding, S.F.; Li, C.; Xu, X.; Ding, L.; Zhang, J.; Guo, L.L.; Shi, T. A sampling-based density peaks clustering algorithm for large-scale data. Pattern Recognit. 2023, 136, 109238. [Google Scholar] [CrossRef]
  16. Quyang, T.; Witold, P.; Nick, J.; Pizzi, N.J. Rule-based modeling with DBSCAN-based information granules. IEEE Trans. Cybern. 2019, 51, 3653–3663. [Google Scholar]
  17. Smiti, A.; Elouedi, Z. Dbscan-gm: An improved clustering method based on gaussian means and dbscan techniques. In Proceedings of the 2012 IEEE 16th International Conference on Intelligent Engineering Systems (INES), Lisbon, Portugal, 13–15 June 2012; pp. 573–578. [Google Scholar]
  18. Tran, T.N.; Drab, K.; Daszykowski, M. Revised DBSCAN algorithm to cluster data with dense adjacent clusters. Chemom. Intell. Lab. Syst. 2013, 120, 92–96. [Google Scholar] [CrossRef]
  19. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef]
  20. Pizzagalli, D.U.; Gonzalez, S.F.; Krause, R. A trainable clustering algorithm based on shortest paths from density peaks. Sci. Adv. 2019, 5, eaax3770. [Google Scholar] [CrossRef]
  21. Guan, J.Y.; Li, S.; He, X.; Chen, J. Peak-graph-based fast density peak clustering for image segmentation. IEEE Signal Process. Lett. 2021, 28, 897–901. [Google Scholar] [CrossRef]
  22. Chen, H.; Zhou, Y.; Mei, K.; Wang, N.; Tang, M.; Cai, G. An Improved Density Peak Clustering Algorithm Based on Chebyshev Inequality and Differential Privacy. Appl. Sci. 2023, 13, 8674. [Google Scholar] [CrossRef]
  23. Wu, Z.; Tingting, S.; Yanbing, Z. Quantum Density Peak Clustering Algorithm. Entropy 2022, 24, 237. [Google Scholar] [CrossRef] [PubMed]
  24. Jiang, R.; Jianhuab, J. Density Peaks Clustering Algorithm Based on CDbw and ABC Optimization. J. Jilin Univ. Sci. Ed. 2018, 56, 1469–1475. [Google Scholar]
  25. Li, F.; Yue, Q.; Pan, Z.; Sun, Y.; Yu, X. Dynamic particle swarm optimization algorithm based on automatic fast density peak clustering. J. Comput. Appl. 2023, 43, 154. [Google Scholar]
  26. Hayyolalam, V.; Kazem, A.A.P. Black widow optimization algorithm: A novel meta-heuristic approach for solving engineering optimization problems. Eng. Appl. Artif. Intell. 2020, 87, 103249. [Google Scholar] [CrossRef]
  27. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  28. Fowlkes, E.B.; Mallows, C.L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 1983, 78, 553–569. [Google Scholar] [CrossRef]
  29. Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Is A Correction for Chance Necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1073–1080. [Google Scholar]
  30. Han, J.; Moraga, C. The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning. In International Workshop on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 1995; pp. 195–201. [Google Scholar]
Figure 1. (a) Data distribution diagram; (b) decision graph based on ρ and δ.
Figure 2. Clustering result graphs corresponding to different values of dc.
Table 1. Description of artificial synthetic datasets and UCI datasets.

Dataset        Instances   Attributes   Clusters
R15            600         2            15
Aggregation    788         2            7
D31            3100        2            31
Two_cluster    400         2            2
Five_cluster   2000        2            5
Flame          240         2            2
Iris           150         4            3
Wine           178         13           3
Sym            350         2            3
Waveform3      5000        21           3
Segment        2310        18           7
Zoo            266         2            3
Table 2. Parameter settings.

Dataset        BWDPC (dc)   DPC (dc)   DBSCAN (eps)   DBSCAN (mpts)
R15            0.5533       2.0000     0.3200         3
Aggregation    3.1000       3.2000     1.7600         14
D31            0.2121       2.4000     0.8000         24
Two_cluster    1.6802       1.6000     0.2500         2
Five_cluster   1.2126       1.3000     0.4200         7
Flame          0.3447       4.0000     1.4800         16
Iris           0.2932       3.0000     1.6300         2
Wine           32.3152      3.6000     4.3000         2
Segment        3.6890       1.6000     0.1000         2
Waveform3      3.7420       0.1200     3.6000         2
Zoo            6.9286       1.6000     2.5000         6
Sym            0.0784       0.5000     0.1000         6
Table 3. Clustering results on the artificial synthetic datasets.

Dataset        Metric   BWDPC    DPC       DBSCAN   K-Means
R15            FMI      0.9932   0.9220    0.9394   0.9932
               ARI      0.9913   0.9144    0.9347   0.9927
               AMI      0.9967   0.9672    0.9450   0.9938
Aggregation    FMI      0.9982   0.8796    0.9110   0.8159
               ARI      0.9978   0.8454    0.8853   0.7624
               AMI      0.9956   0.9152    0.8865   0.8776
D31            FMI      0.9390   0.6040    0.7901   0.9538
               ARI      0.9370   0.5369    0.7826   0.9522
               AMI      0.9558   0.8325    0.8882   0.9653
Two_cluster    FMI      0.9950   0.9950    0.9651   0.9950
               ARI      0.9900   0.9900    0.9315   0.9900
               AMI      0.9772   0.9772    0.8833   0.9772
Five_cluster   FMI      0.9921   0.9316    0.9387   0.9940
               ARI      0.9905   0.9033    0.9137   0.9915
               AMI      0.9754   0.8823    0.8480   0.9809
Flame          FMI      0.7942   0.6002    0.7511   0.7363
               ARI      0.5734   −0.0302   0.5453   0.4534
               AMI      0.5672   0.1064    0.5985   0.3969
Table 4. Clustering results on the UCI real-world datasets.

Dataset     Metric   BWDPC    DPC      DBSCAN    K-Means
Iris        FMI      0.8407   0.7567   0.7714    0.5835
            ARI      0.7592   0.5609   0.5681    0.3711
            AMI      0.8032   0.7050   0.7316    0.4227
Wine        FMI      0.5834   0.5674   0.5354    0.5039
            ARI      0.3715   0.3016   −0.0003   0.2536
            AMI      0.4131   0.4169   0.0502    0.3600
Segment     FMI      0.4618   0.3808   0.3047    0.4370
            ARI      0.3204   0.2013   0.0001    0.3133
            AMI      0.4927   0.3920   0.0780    0.4518
Waveform3   FMI      0.5338   0.5011   0.4152    0.5039
            ARI      0.2817   0.1584   0.0029    0.2536
            AMI      0.3512   0.2467   0.0587    0.3620
Zoo         FMI      0.8136   0.5076   0.4839    0.7741
            ARI      0.7197   0.3796   0.0075    0.7087
            AMI      0.7598   0.5687   0.0079    0.7527
Sym         FMI      0.8333   0.8239   0.4793    0.7369
            ARI      0.7357   0.7178   0.2888    0.5335
            AMI      0.7727   0.7626   0.4719    0.5645
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, H.; Wu, H.; Wei, X.; Zhou, Y. Optimization of Density Peak Clustering Algorithm Based on Improved Black Widow Algorithm. Biomimetics 2024, 9, 3. https://doi.org/10.3390/biomimetics9010003

