1. Introduction
Clustering has become a useful technique in many fields, including science, engineering, medicine, and everyday life. The term “clustering” refers to the process of grouping a collection of elements so that elements are more similar to others in their own cluster (group) than to those in other clusters. A clustering theory should provide a unified framework for the various approaches. Firstly, it should state its foundations and limits of applicability. Secondly, it should show connections between ideas and methods inside and outside of clustering. Finally, it should offer a cornerstone for resolving emerging challenges in data processing and for developing applications.
In the last few decades, clustering techniques for data mining have come a long way. Researchers have developed several methods [1,2,3] to cluster data effectively. Despite many successful applications of clustering methods, some challenges remain in data mining. For instance, most conventional clustering approaches function well with low-dimensional data but encounter difficulties with high-dimensional data. This phenomenon is known as the “curse of dimensionality” [4]. Moreover, overfitting occurs when the feature space is high-dimensional and the fitted model is too complex [5]. Furthermore, the available data may not accurately reflect the whole ground-truth model. When this occurs, learning algorithms tend to fit a model to the available data samples while overlooking the underlying structure; in other words, memorizing takes the place of learning. Practical signal processing and learning tasks must also contend with noise and outliers. Non-Gaussian noise is particularly prevalent in applications involving measurements, whereas outliers are observations incongruous with the overall population. These idiosyncrasies pose significant hurdles for problems involving both linear and non-linear systems, especially when coupled with specific output requirements. Effective investigations into the consequences of these extra concerns can be found in several recent works, e.g., [6,7]. Moreover, raw data are often in crude formats, so clustering techniques need a preprocessing phase to deal with high dimensionality and undesirable sampling problems. For high-dimensional settings, preprocessing strategies [8,9,10] have been proposed to improve performance. To observe the dataset more precisely and make it better suited for subsequent processing, these methods often restructure the sample space using transformations or eliminations. For example, principal component analysis (PCA) projects sample features onto the directions of largest variance, making them more appropriate for classification tasks with the added bonus of reduced dimensionality [11]. This idea extends naturally to feature transformation in general. It is crucial to remember that these strategies keep all features and restructure them using (non-linear) combinations rather than discarding unimportant ones. The complementary approach of dimensionality reduction by feature selection eliminates unimportant features and retains the subset of features that appears most crucial under predetermined optimality criteria. Subsequent classification or clustering problems may benefit from either strategy through more accurate representations.
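As a brief illustration of PCA-based dimensionality reduction, the sketch below uses scikit-learn on synthetic data; the dataset and the choice of two components are illustrative assumptions, not part of this study.

```python
# Minimal PCA sketch: project 50-dimensional samples onto the two
# directions of largest variance (synthetic data, illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples, 50 features

pca = PCA(n_components=2)               # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # variance captured per component
```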
Although feature selection can be an efficient solution to high-dimensional challenges, the elimination process may discard critical information of great significance in diverse contexts. Having considered the difficulties of handling high-dimensional data, it is worth listing a few methods for avoiding overfitting. To address the issue of high-dimensional datasets, researchers have proposed various techniques. For instance, Yan et al. [12] presented a novel supervised multi-view hash model that utilized deep learning and a multi-view stability evaluation method to enhance hash learning. The approach included multi-data fusion methods and a memory network to reduce computing resources during retrieval. The proposed method outperformed state-of-the-art single-view and multi-view hashing methods on three datasets, showing its potential for improving hash learning. In [13], the authors proposed a group-based nuclear norm and learning graph (GNNLG) model for image restoration, which exploits patch similarity and low-rank properties. An optimized learning strategy was developed to impose smoothing priors on the denoised depth image, and the alternating direction method of multipliers was used to improve speed and convergence. Experimental results showed that the proposed method outperformed state-of-the-art denoising methods. In Reference [14], the authors proposed a task-adaptive attention module for image captioning that learned implicit non-visual clues and alleviated misleading problems. The authors also introduced diversity regularization to enhance the expression ability of the module. Experiments on the MSCOCO captioning dataset showed that using the proposed module with a vanilla Transformer-based image captioning model improved performance. The authors of [15] proposed a precise no-reference image quality assessment (NR IQA) scheme with two steps: distortion identification and targeted quality evaluation. They used Inception-ResNet-v2 to train a classifier and designed a specific approach to quantify image distortion. The method outperformed state-of-the-art NR IQA methods in the experiments, indicating its potential for practical applications. In the case of clustering, a common way to deal with overfitting is to minimize within-cluster variance [16].
The authors of [17] proposed a technique that merges DBSCAN with affinity propagation (APSCAN), enabling users to benefit from both methods. The density is described by two quantities: radius and number. The radius is the average distance from the exemplar to all points, and the number is the count of points located inside the radius neighborhood. Both parameters, radius and number, are employed in the strategy as an alternative to Eps and MinPts. The evolutionary parameter-free clustering algorithm (EPFC) was developed by Ding and Li [18]; the average distance to the nearest neighbor is calculated over the whole dataset. Depending on the computed distance between each data point and its closest neighbor, one of two decisions is made, i.e., to divide the group or to combine the data points. In [19], a parameter-free approach was described in which the radius was determined automatically based on the window size and the data distribution.
When faced with all obstacles at once, including high dimensionality, high nonlinearity, parameter interaction, and disturbances due to randomness, it may be more efficient to sample a group of solutions than to search the solution space exhaustively. Most previously investigated clustering methods for data mining used manual parameter selection to find the clusters in raw data. Moreover, the methods applied to high-dimensional datasets achieved comparatively lower performances. To address this concern, this study developed a novel clustering method (DPPA) to find the clusters in a raw dataset. The proposed method was tested using well-known, publicly available datasets.
The key contributions of this paper are as follows:
- 1.
In this study, we propose a new non-parametric clustering algorithm (DPPA), which can calculate 1-NN and Max-NN by analyzing the positions of data in the dataset without any initial manual parameter assignment.
- 2.
We use two well-known publicly available datasets to evaluate the proposed method’s ability to find clusters. In addition, the proposed method is less time-consuming because it removes the dependence of the analysis on manually selected parameters. Finally, four popular algorithms (DBSCAN, K-means, affinity propagation, and Mean Shift) are implemented to compare the performance of the proposed model.
The rest of the manuscript is arranged as follows: an overview of related work is presented in Section 2. Existing clustering algorithms are reviewed in Section 3. Section 4 presents a detailed description of the proposed algorithm. Section 5 describes the dataset, and the experimental performance is reported in Section 6. Section 7 presents the conclusion of the present study.
2. Related Work
Numerous clustering algorithms have been introduced in past years, as evidenced by the literature [20,21]. For instance, in Reference [22], a method utilizing association rules is proposed to cluster customer transactions within a market database. The study conducted in [23] focused on exploring algorithms for clustering categorical databases using non-linear dynamical systems.
According to Rokach [24], the process of clustering splits data patterns into subsets in such a manner that patterns with comparable characteristics are grouped in the same cluster. The patterns are thereby organized into a well-formed representation that characterizes the population from which the sample was drawn. The K-means algorithm was implemented for clustering the data in [25,26]. Likas et al. [27] presented the global k-means algorithm, an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search consisting of N (the size of the dataset) executions of the k-means algorithm from suitable initial positions. The authors of [28] presented an affinity propagation clustering technique for pattern recognition. In their approach, each data point exchanges signals (called responsibility and availability) in an iterative manner to identify acceptable exemplars. To ascertain the affinities of the exemplars in the network, certain preference values must be established. The success of this algorithm relies on well-defined criteria to attain its ideal aim, even though no initial parameters must be set. Its inability to handle clusters of arbitrary shape is another issue.
The DNo, DNg, and DNs indices have been recognized as simple internal validity measures for clustering [29,30,31]. These indices assess the relationship between cluster size and inter-cluster distance. Specifically, they calculate the ratio of the minimum distance between two clusters to the size of the largest cluster, aiming to maximize the index values. The DB index, proposed by Davies and Bouldin [32], quantifies the average similarity between each cluster and its most similar one. It aims to maximize the distances between clusters while minimizing the distance between the cluster centroid and other data objects. The silhouette value, introduced by Rousseeuw [33], measures the similarity of an object to its own cluster (cohesion) compared to other clusters (separation). Ranging from −1 to +1, a higher silhouette value indicates a better fit to the object’s own cluster and a poorer fit to neighboring clusters. Large positive and large negative silhouette widths (SWs) signify proper and improper clustering, respectively. Objects with an SW value close to zero indicate a lack of clear discrimination between clusters. Another cluster validity measure is the gap statistic, which is based on a statistical hypothesis test [34]. It assesses the change in within-cluster dispersion relative to the expected change under an appropriate reference null distribution.
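As a brief illustration, both the silhouette value and the DB index described above are available in scikit-learn; the sketch below is a minimal example on synthetic data, with k = 3 as an illustrative choice.

```python
# Hedged sketch: internal validity measures for a K-means partition.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))       # in [-1, +1]; higher is better
print(davies_bouldin_score(X, labels))   # >= 0; lower is better
```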
In Reference [35], a comparative study was conducted to evaluate five different clustering algorithms using gene expression time series datasets of the yeast Saccharomyces cerevisiae. The experiments employed a k-fold cross-validation procedure to compare the performance of the algorithms. The results showed that k-means, dynamical clustering, and SOM consistently achieved high accuracy across all experiments. In Reference [36], the performances of k-means, single linkage, and simulated annealing (SA) were assessed using various partitions obtained through validation indices. The authors introduced a novel validation index, called the I index, which measures separation based on the maximum distance between clusters and compactness based on the sum of distances between objects and their respective centroids. The study concluded that the I index was the most reliable among the considered indices, attaining its maximum value when the number of clusters was appropriately chosen.
It is worth noting that most existing clustering methods require some form of supervision or expert knowledge, such as specifying the number of clusters or setting similarity thresholds and algorithm-specific hyperparameters. The research gap is that most previously investigated clustering methods for data mining relied on manual parameter selection to identify clusters in raw data. Moreover, these methods often faced challenges in high-dimensional datasets and provided relatively lower performance. This suggests the need for a clustering method that overcomes these limitations and offers improved performance in high-dimensional settings.
To address this research gap, this study extends the novel clustering method DPPA introduced in [37]. This method is non-parametric and calculates 1-NN and Max-NN by analyzing the positions of data points in the dataset, eliminating the need for initial manual parameter assignment. The proposed method is evaluated using well-known publicly available datasets and offers the advantage of being less time-consuming by reducing the dependence on artificial parameter selection. Furthermore, the performance of the proposed method is compared with four popular clustering algorithms (DBSCAN, K-means, affinity propagation, and Mean Shift) to assess its effectiveness.
In summary, the research gap pertains to the limitations of existing clustering methods in high-dimensional datasets, while the proposed solution is the development of the DPPA algorithm that addresses these limitations and provides improved performance.
3. Overview of Clustering Algorithms
3.1. DBSCAN
DBSCAN, which stands for density-based spatial clustering of applications with noise, is widely used in machine learning and data science [38]. The algorithm identifies clusters of points in a dataset based on their density and handles noise and outliers effectively. DBSCAN is particularly useful for datasets with non-linear or complex structures and can detect clusters of different shapes and sizes.
The main idea behind DBSCAN is to identify regions in the dataset where the data points are densely packed together. The algorithm works by defining two key parameters, epsilon (Eps) and minimum points (MinPts). The Eps parameter defines the radius of a neighborhood around each data point, and the MinPts parameter specifies the minimum number of points required to form a dense region.
The DBSCAN algorithm starts by randomly selecting a data point that has not yet been assigned to a cluster. It then identifies all the neighboring points within the radius Eps and checks whether the number of neighboring points is greater than or equal to MinPts. If so, the algorithm considers the data point part of a cluster and explores its neighboring points to find all other data points that belong to the same cluster. This process continues until no more points can be added to the cluster, at which point the algorithm selects a new unassigned point and repeats the process.
If the number of neighboring points is less than MinPts, the algorithm marks the data point as a noise point or an outlier, which means that it does not belong to any cluster. The algorithm then moves on to the next unassigned data point and repeats the process until all data points have been assigned to a cluster or marked as noise.
One of the advantages of DBSCAN is that it can detect clusters of different shapes and sizes, including clusters that are not linearly separable. The algorithm also handles noise and outliers effectively, as they are simply marked as noise points and do not interfere with the clustering process. DBSCAN is also relatively easy to implement and requires only two input parameters, Eps and MinPts.
However, DBSCAN also has some limitations. One of the main challenges is setting appropriate values for the Eps and MinPts parameters. Choosing the right values can be difficult, especially for large and complex datasets, and may require some trial and error. DBSCAN is also sensitive to the choice of the distance metric used to calculate the distance between data points, and some distance metrics may perform better than others, depending on the dataset.
In summary, DBSCAN is a powerful clustering algorithm that can detect clusters of different shapes and sizes in complex datasets. It manages noise and outliers effectively, making it particularly useful for real-world applications. However, choosing appropriate values for the Eps and MinPts parameters can be challenging, and the choice of distance metric can also impact the performance of the algorithm (Algorithm 1).
Algorithm 1: DBSCAN clustering algorithm.
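As a hedged illustration of the procedure above, the following sketch uses scikit-learn’s DBSCAN implementation; the two-moons dataset and the eps/min_samples values are illustrative assumptions, not settings used in this paper.

```python
# DBSCAN sketch on a non-linearly separable structure (two half-moons).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # Eps and MinPts from the text
labels = db.labels_                          # -1 marks noise points/outliers

print(np.unique(labels))                     # cluster ids, plus -1 for noise
```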
3.2. Affinity Propagation
The affinity propagation (AP) technique, mentioned in [28,39], is a distance-dependent approach that can identify exemplars in a dataset. AP can cluster facial images, locate genes, and cluster spatial datasets on a map, among other tasks [40,41,42]. Unlike K-means, AP selects existing points as exemplars and does not generate new points. It does not need a predetermined number of clusters or initial candidate exemplars. Instead, AP considers every point in the dataset and chooses exemplars through competitive iterations.
AP employs two types of “messages” to compute responsibility and availability, based on the distance between any two distinct points, $x_i$ and $x_k$, in the dataset. Responsibility, represented by $r(i,k)$, is a “message” sent from $x_i$ to its candidate exemplar point $x_k$ and indicates how well $x_k$ can represent $x_i$ as an exemplar. Availability, represented by $a(i,k)$, is the “message” sent from $x_k$ to $x_i$ and represents the accumulated evidence for how suitable $x_k$ is for being selected as an exemplar by $x_i$. AP initializes the values of these “messages” between any two distinct points and then iteratively updates the values of responsibility and availability. Finally, AP selects the point $x_k$ with the highest sum of responsibility and availability as the exemplar for each point $x_i$. Thus, AP clusters $x_i$ and $x_k$ into the same cluster and assigns a label based on the sequence index number of $x_k$ in the dataset. As a result, AP can identify exemplars and assign class labels to each point based on its exemplar’s sequence index number.
Here is a detailed description of the affinity propagation algorithm:
- 1.
Initialization: The algorithm begins by initializing two matrices: the “similarity matrix” and the “responsibility matrix”. The similarity matrix contains the pairwise similarity scores between all data points. The responsibility matrix is a matrix of the same size as the similarity matrix and is initialized to 0.
- 2.
Calculate responsibility matrix: In this step, the algorithm iteratively updates the responsibility matrix. The responsibility matrix represents how well-suited each data point is to be an “exemplar” or representative of a cluster. The update rule is as follows:
$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\left\{ a(i,k') + s(i,k') \right\},$$
where $r(i,k)$ is the responsibility of point $i$ to exemplar $k$, $s(i,k)$ is the similarity score between point $i$ and exemplar $k$, and $a(i,k)$ is the “availability” of point $i$ to exemplar $k$, which is updated in step 3.
- 3.
Calculate availability matrix: In this step, the algorithm updates the availability matrix. The availability matrix represents how much “support” a data point receives from other data points for being an exemplar. The update rule is as follows:
$$a(i,k) \leftarrow \min\left\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\left(0,\, r(i',k)\right) \right\}, \quad i \neq k,$$
where $a(i,k)$ is the availability of point $i$ to exemplar $k$, $r(k,k)$ is the self-responsibility of exemplar $k$, and the summation term represents the total responsibility that other points have assigned to exemplar $k$.
- 4.
Calculate cluster exemplars: The algorithm uses the responsibility and availability matrices to calculate which data points are the best exemplars for each cluster. The exemplars are chosen as the data points with the highest sum of their responsibility and availability scores:
$$k_i^{*} = \arg\max_{k}\left[ r(i,k) + a(i,k) \right].$$
- 5.
Assign data points to clusters: Finally, the algorithm assigns each data point to its nearest exemplar, which forms the final clusters (Algorithm 2).
The affinity propagation algorithm has two main advantages: it determines the number of clusters automatically and handles non-spherical clusters. However, it can be computationally expensive and sensitive to the initial exemplar preferences.
Algorithm 2: Affinity propagation algorithm.
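As a short illustration, the message-passing scheme above is implemented in scikit-learn’s AffinityPropagation; the sketch below is a minimal example on synthetic data, with the default similarity (negative squared Euclidean distance) and preference as assumptions.

```python
# Affinity propagation sketch: exemplars are chosen from the data itself,
# and the number of clusters emerges from the message passing.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

ap = AffinityPropagation(random_state=0).fit(X)
print(ap.cluster_centers_indices_)  # indices of the chosen exemplars
print(ap.labels_[:10])              # cluster label of the first 10 points
```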
3.3. Mean Shift Algorithm
The authors of [43,44] explored kernel-based clustering for a dataset $X = \{x_1, \ldots, x_n\}$ in an $s$-dimensional Euclidean space $\mathbb{R}^s$. This approach transforms the data space into a high-dimensional feature space $F$ using a kernel function to represent inner products. Another kernel-based clustering method in the data space is kernel density estimation, which estimates the density over $X$ and identifies modes that correspond to the densest regions [45]. These modes can serve as estimates of cluster centers. To identify the modes in kernel density estimation, the mean shift, a basic gradient approach, is used. In the following section, we delve into the mean shift procedure.
3.3.1. Mean Shift Procedures
Mean shift techniques are utilized to identify the modes of the kernel density estimate. The kernel, denoted as $K$, is defined by $K(x) = k(\|x\|^2)$, where $k$ is the profile of the kernel. The kernel density estimate is derived using the following equation [46]:
$$\hat f(x) = \sum_{i=1}^{n} K(x - x_i)\, w(x_i).$$
In this equation, $w(x_i)$ is a weight function. Fukunaga and Hostetler [47] first introduced the statistical properties of the gradient of the density estimate using a uniform weight, which include asymptotic unbiasedness, consistency, and uniform consistency.
Assume the existence of a kernel $H(x) = h(\|x\|^2)$ such that $h'(r) = -c\,k(r)$, where $c$ is a constant. If kernel $H$ is considered a shadow of kernel $K$, as defined in Ref. [48], then the following equation holds:
$$\nabla \hat f_H(x) = 2c \left[\sum_{i=1}^{n} K(x - x_i)\, w(x_i)\right] m(x).$$
The generalized mean shift is expressed by the formula
$$m(x) = \frac{\sum_{i=1}^{n} K(x - x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} K(x - x_i)\, w(x_i)} - x,$$
which measures the estimated density gradient. This equation was first introduced in Reference [49] under the assumption of uniform weights. When the gradient estimator $\nabla \hat f_H(x)$ is zero, the mode estimate can be determined by
$$x = \frac{\sum_{i=1}^{n} K(x - x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} K(x - x_i)\, w(x_i)}. \tag{6}$$
Here, $K$ is the kernel, and $H$ is its shadow. Equation (6) is known as the sample weighted mean using kernel $K$. The mean shift can be implemented in three ways. The first method, called blurring mean shift, assigns an initial value to each data point and updates each data point with its sample weighted mean, so that the data points and the density estimate are modified with each iteration. The second method, called non-blurring mean shift, updates only the query point $x$ with its sample weighted mean, while the original data points and the density estimate remain unchanged. The third method, called general mean shift, involves selecting $c$ starting values, which can be more or fewer than $n$, and updating each of them with its sample weighted mean. Cheng [48] discussed the convergence properties of blurring mean shift using Equation (6), while Comaniciu and Meer [49] presented various mean shift properties for discrete data and their relationship with the Nadaraya–Watson estimator through kernel regression and the robust $M$-estimator.
If a kernel $K$ has a shadow $H$, the mean shift method can identify the modes of the known density estimate $\hat f_H$, allowing for direct identification of the estimated density shape modes. In cases where the shadow of the kernel $K$ is unknown or does not exist, the mean shift method can still be useful for estimating alternative modes, such as cluster centers, for a dataset with an unknown density function.
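To make Equation (6) concrete, the following rough sketch iterates the non-blurring mean shift with a Gaussian kernel and uniform weights; the parameter beta, the tolerance, and the synthetic data are illustrative assumptions.

```python
# Non-blurring mean shift sketch: move a query point x to a mode of the
# kernel density estimate over X by repeatedly taking the sample weighted
# mean of Equation (6), with K(x - x_i) = exp(-beta * ||x - x_i||^2).
import numpy as np

def mean_shift_mode(x, X, beta=1.0, tol=1e-6, max_iter=500):
    for _ in range(max_iter):
        k = np.exp(-beta * np.sum((X - x) ** 2, axis=1))  # kernel weights
        x_new = k @ X / k.sum()                           # sample weighted mean
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
print(mean_shift_mode(X[0].copy(), X))   # converges near one cluster center
```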
3.3.2. Some Special Kernels and Their Shadows
In this section, the main focus will be on investigating special kernels that possess shadows. More specifically, we delve into Gaussian kernels, the most frequently utilized kernels, which possess their own shadows [46]:
$$K_G(x) = e^{-\beta \|x\|^2},$$
with their shadows $H_G$ defined as
$$H_G(x) = \frac{1}{\beta}\, e^{-\beta \|x\|^2},$$
so that a Gaussian kernel is its own shadow up to a multiplicative constant.
The process of mean shift, where $x$ is reassigned as the sample weighted mean under $K_G$, is utilized to identify the modes of the density estimate $\hat f_G$. Cheng conducted a study on the behavior of mean shift in cluster analysis by employing a Gaussian kernel. The maximum entropy clustering algorithm is a specialized Gaussian-kernel mean shift with a particular weight function. Chen and Zhang employed a Gaussian kernel-induced distance measure to perform robust image segmentation using spatially constrained fuzzy c-means (FCM). Yang and Wu utilized the total similarity as an objective function and derived a similarity-based clustering method (SCM) that could autonomously organize the number and size of clusters based on the data structure [46].
The Cauchy kernels, which are obtained from the Cauchy probability density function $f(x) = \frac{1}{\pi(1 + x^2)}$, where $x \in \mathbb{R}$, are noteworthy kernels:
$$K_C(x) = \frac{1}{(1 + \beta \|x\|^2)^2},$$
with their shadows defined in the following manner [46]:
$$H_C(x) = \frac{1}{\beta\,(1 + \beta \|x\|^2)}.$$
The process of mean shift using $K_C$ is used to locate the modes of the density estimate $\hat f_C$. The use of Cauchy kernels is not as prevalent as that of other kernels. To address the limitations of FCM in noisy environments, Krishnapuram and Keller proposed the possibilistic c-means (PCM) clustering algorithm by relaxing the constraint that the fuzzy c-partition sums to 1. The possibilistic membership functions were subsequently utilized as Cauchy kernels. This is the only instance of Cauchy kernels being utilized in clustering that we are aware of.
The flat kernel is the most basic kernel and is defined as follows [46]:
$$K_F(x) = \begin{cases} 1, & \|x\|^2 \le 1, \\ 0, & \text{otherwise}, \end{cases}$$
with the Epanechnikov kernel
$$K_E(x) = \begin{cases} 1 - \|x\|^2, & \|x\|^2 \le 1, \\ 0, & \text{otherwise}, \end{cases}$$
as its shadow. In addition, the Epanechnikov kernel’s shadow is the biweight kernel
$$K_B(x) = \begin{cases} (1 - \|x\|^2)^2, & \|x\|^2 \le 1, \\ 0, & \text{otherwise}. \end{cases}$$
To provide a more general explanation, we generalize the Epanechnikov kernel $K_E$ into the form of generalized Epanechnikov kernels $K_E^{(p)}$ with the parameter $p$, defined as follows [46]:
$$K_E^{(p)}(x) = \begin{cases} (1 - \|x\|^2)^{p}, & \|x\|^2 \le 1, \\ 0, & \text{otherwise}. \end{cases}$$
As a result, the generalized Epanechnikov kernels $K_E^{(p)}$ have corresponding shadows $H_E^{(p)}$, defined as [46]:
$$H_E^{(p)}(x) = \begin{cases} (1 - \|x\|^2)^{p+1}, & \|x\|^2 \le 1, \\ 0, & \text{otherwise}. \end{cases}$$
When $p = 1$, the kernels $K_E^{(p)}$ and their shadows $H_E^{(p)}$ reduce to the Epanechnikov kernel $K_E$ and the biweight kernel $K_B$, respectively. To identify the modes of the estimated density function $\hat f$, we employ the mean shift process, with $x$ replaced by its sample weighted mean. In total, we have three categories of kernels and their corresponding counterparts, namely Gaussian kernels $K_G$ and their shadows $H_G$, Cauchy kernels $K_C$ and their shadows $H_C$, and generalized Epanechnikov kernels $K_E^{(p)}$ and their shadows $H_E^{(p)}$. The modes of the corresponding density estimates can be found using the mean shift process with any of these three kernel categories. The performance of the mean shift procedure for kernel density estimation is significantly affected by the normalization and stabilization parameters, denoted by $\beta$ and $p$, respectively (see the sketch below). We will explore this topic in the next section.
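The kernel profiles and shadows above can be tabulated numerically; the sketch below evaluates them on a few values of $r = \|x\|^2$, with beta and p as the normalization and stabilization parameters. The specific values are illustrative, and the shadow relation $h'(r) = -c\,k(r)$ holds up to the constant $c$.

```python
# Profiles k(r) of the three kernel families and their shadows h(r).
import numpy as np

beta, p = 1.0, 2.0
r = np.linspace(0.0, 0.99, 5)              # r = ||x||^2, restricted to [0, 1)

k_gauss  = np.exp(-beta * r)               # Gaussian profile (own shadow)
k_cauchy = (1.0 + beta * r) ** -2          # Cauchy profile
h_cauchy = (1.0 + beta * r) ** -1 / beta   # shadow of the Cauchy profile
k_epan_p = (1.0 - r) ** p                  # generalized Epanechnikov profile
h_epan_p = (1.0 - r) ** (p + 1)            # its shadow (up to a constant)

print(k_gauss, k_cauchy, h_cauchy, k_epan_p, h_epan_p, sep="\n")
```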
3.4. K-Means
K-means clustering is a popular unsupervised machine learning algorithm used for clustering data points into groups based on their similarity [50,51]. The algorithm aims to minimize the sum of squared distances between the data points and their assigned cluster centers. Here are the main steps of the algorithm:
- 1.
Choose the number of clusters K that you want to identify and randomly initialize K cluster centers.
- 2.
Assign each data point to its nearest cluster center. This can be done using Euclidean distance or other distance measures.
- 3.
Calculate the mean of the data points in each cluster to obtain the new cluster centers.
- 4.
Repeat steps 2 and 3 until the cluster centers converge, i.e., when the assignments of the data points to clusters no longer change or change minimally.
Here are the equations for the steps:
- Step 2:
Assign each data point $x_i$ to its nearest cluster center $\mu_j$:
$$c_i = \arg\min_{j \in \{1, \ldots, K\}} \| x_i - \mu_j \|^2,$$
where $\| \cdot \|$ denotes the Euclidean distance.
- Step 3:
Calculate the mean of the data points in each cluster to obtain the new cluster centers:
$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,$$
where $C_j$ is the set of data points assigned to cluster $j$.
- Step 4:
Repeat steps 2 and 3 until convergence.
The algorithm may converge to a local optimum rather than a global one, so it is often run multiple times with different random initializations to find the best result.
K-means clustering can be extended to handle different data types and distances, as well as more complex clustering problems, such as non-convex clusters or variable cluster sizes. However, the algorithm is sensitive to the choice of K and the initial cluster centers, and it may not work well with noisy or high-dimensional data.
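As a short illustration of the steps above, the sketch below runs scikit-learn’s K-means with several random initializations and keeps the best solution; K = 4 and the synthetic data are illustrative assumptions.

```python
# K-means sketch: n_init=10 reruns the algorithm from 10 random
# initializations and keeps the partition with the lowest within-cluster
# sum of squares, mitigating convergence to a poor local optimum.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the final means mu_j
print(km.inertia_)           # sum of squared distances to assigned centers
```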
4. Data Point Positioning Analysis Algorithm
The DPPA algorithm is required because clustering is an essential data analysis technique that is used in various fields, such as data mining, machine learning, and pattern recognition. Clustering helps to identify patterns, group similar data points, and extract useful information from large datasets. However, traditional clustering algorithms, like k-means, DBSCAN, Mean Shift, and affinity propagation, have some limitations that DPPA aims to address.
The k-means algorithm requires the number of clusters to be specified beforehand, which may not be known in advance. The DBSCAN algorithm assumes that the clusters have similar densities, which may not be valid in all cases. The Mean Shift algorithm is sensitive to the choice of the bandwidth parameter and may lead to overfitting or underfitting. The affinity propagation algorithm requires the selection of a damping factor, which can significantly affect the clustering results.
In contrast, the DPPA algorithm does not require any prior knowledge of the number of clusters or the density of the data. The algorithm works by calculating the distances between each pair of adjacent data points and then using this information to identify neighborhoods of data points. For each data point, the algorithm counts the number of neighboring data points that fall within a specified distance range and builds a neighbor-link table (Table 1) that includes information about the 1-NN and Max-NN of each data point. The algorithm then sorts the data points in ascending order of their number of neighbors and clusters the data points, starting with the data point with the fewest neighbors and following its Max-NN, until no more points can be added to the cluster. The algorithm repeats this process for all remaining data points until all have been assigned to a cluster. Algorithm 3, presented below, outlines the complete set of steps involved in the DPPA clustering algorithm:
- 1.
Compute the distance between every pair of data points in the dataset.
- 2.
For each data point, determine the distance to its nearest neighbor.
- 3.
Calculate the range of radius, denoted by $R = [r_{\min}, r_{\max}]$, between the minimum and maximum values of a scalar radius $r$.
- 4.
For each point $x_i$,
- (a)
Compute $N(x_i)$, the count of neighboring data points within the specified radius range $R$, for the given data point $x_i$. The calculation involves summing a series of ones or zeros, depending on whether the distance between $x_i$ and a particular neighboring data point falls within or outside the specified range.
- (b)
Construct a table (Table 1) that shows the nearest neighboring data points for each data point $x_i$, including the nearest neighbor (1-NN) and the maximum neighbor (Max-NN).
- 5.
Arrange the data points in the neighbor-link table in ascending order of their corresponding $N(x_i)$ values.
- (a)
To form a cluster $C$, first place the data point $x_i$, and then add all the data points that are in the Max-NN linkage of $x_i$.
- (b)
Add all the data points that are 1-NN (nearest neighbors) of $x_i$ to the cluster $C$.
- (c)
If the next data point $x_j$ belongs to cluster $C$, then set $x_i$ to $x_j$ and repeat the process, starting from step (a).
- (d)
Continue the process until there are no more data points left.
Algorithm 3: Data point positioning analysis algorithm.
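As a rough, illustrative sketch only, the code below follows our reading of the steps above; the radius range, the Max-NN definition, and all helper names are hypothetical, and the actual Algorithm 3 may differ in details.

```python
# Hypothetical DPPA-style sketch: build 1-NN/Max-NN neighbor links, sort
# points by neighbor count, and grow clusters by following the links.
import numpy as np

def dppa_sketch(X):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(D, np.inf)
    one_nn = D.argmin(axis=1)                  # 1-NN of each point
    r_min, r_max = D.min(), np.median(D)       # illustrative radius range R
    counts = ((D >= r_min) & (D <= r_max)).sum(axis=1)          # N(x_i)
    max_nn = np.where(D <= r_max, D, -np.inf).argmax(axis=1)    # farthest in-range neighbor

    labels = np.full(n, -1)
    cluster = 0
    for i in np.argsort(counts):               # start from fewest neighbors
        if labels[i] != -1:
            continue
        stack = [i]
        while stack:                            # follow Max-NN and 1-NN links
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            stack.extend([one_nn[p], max_nn[p]])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(2, 0.2, (50, 2))])
print(np.unique(dppa_sketch(X)))                # discovered cluster labels
```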
7. Conclusions
This study proposed a novel clustering algorithm (DPPA) to efficiently cluster higher-dimensional datasets. The proposed algorithm does not depend on any predefined parameter value, meaning there is no need to analyze the values of different parameters, as in DBSCAN. To this end, the approach calculates 1-NN and Max-NN without requiring initial manual parameter assignment by examining the positions of data points in the dataset. Therefore, it is no longer required to define an appropriate radius Eps and a minimum number of points MinPts, since the radius range $R$ is automatically computed by evaluating the positions and distances of the data points. The proposed algorithm was validated using two publicly available higher-dimensional datasets. To compare the performance of the proposed model, this study also implemented four popular clustering algorithms. The experimental outcomes demonstrated that the proposed algorithm could identify the right number of clusters in higher-dimensional datasets.
The following are the main benefits of our proposed method over previous techniques that are not parameter-free:
This study proposed a novel approach for clustering data, called data point positioning analysis (DPPA), to enhance the efficiency of high-dimensional dataset clustering.
In this method, there is no need to pre-specify the number of clusters, whereas traditional clustering methods often require the number of clusters to be determined beforehand. This makes parameter-free methods more flexible and adaptable to different datasets.
This proposed parameter-free clustering algorithm is better able to handle noisy or outlier data points since it uses density-based clustering techniques that do not depend on distance measures alone.
The study compared the proposed method to four popular clustering algorithms and demonstrated that the proposed method achieves superior performance in identifying clusters.
In the future, we intend to evaluate the performance of the proposed method on a wider variety of datasets. Future research directions for decision-dependent uncertainty in clustering algorithms in critical fields, such as electricity grids and offshore petroleum platforms, include exploring ensemble clustering techniques to improve the reliability of results and incorporating prior knowledge or domain expertise to enhance the accuracy and effectiveness of the results. Developing methods for quantifying and evaluating uncertainty, as well as investigating the impacts of different evaluation metrics on decision-making processes, are also important areas of future work. Furthermore, applying clustering in other critical fields and evaluating its effectiveness would contribute to the advancement of clustering algorithms and improve decision-making processes in these fields.