2. Literature Survey
The focus of this study is to improve the robustness of the classical k-means algorithm in recognizing clusters of different shapes. To this end, this section summarizes studies relevant to this manuscript and divides them into three categories.
The first category improves typical similarity measures, including the Euclidean, Manhattan, Minkowski, city-block, cosine, and correlation distances. Among these studies, Singh et al. [21] compared the performance of the k-means algorithm under the Euclidean, Manhattan, and Minkowski distances: k-means with the Minkowski distance distorted the similarity of the samples least, the best classification results were obtained with the Euclidean distance, and the Manhattan distance gave the worst results. Singh et al. stressed the need for careful selection of similarity metrics.
Chakraborty et al. [22] reported the performance of the k-means algorithm with the city-block, cosine, and correlation distances on the Iris dataset; with the city-block and cosine distances, the algorithm achieved 98% accuracy.
Sebayang et al. [23] applied the Canberra, city-block, and Euclidean distances within the k-means algorithm; their results show that each distance is superior to different degrees on different datasets. This line of research aims to improve the robustness of the k-means algorithm across clustering shapes in different application scenarios and to achieve better real-world results, but it ignores disadvantages (1)–(3) of the k-means algorithm, and such improvements also require some prior knowledge of the data distribution.
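As a toy illustration of why the choice of metric matters (a hypothetical Python sketch, not code from the cited studies), the same point can be assigned to different centers under different distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

# A single sample and two candidate cluster centers (hypothetical values).
x = np.array([[0.0, 0.0]])
centers = np.array([[3.0, 0.0], [2.0, 2.0]])

for metric, kw in [("euclidean", {}), ("cityblock", {}), ("minkowski", {"p": 3})]:
    D = cdist(x, centers, metric=metric, **kw)
    print(f"{metric:9s} distances={np.round(D[0], 3)} -> center {D.argmin()}")
# euclidean: [3.0, 2.828] -> center 1
# cityblock: [3.0, 4.0]   -> center 0 (Manhattan disagrees with Euclidean)
# minkowski: [3.0, 2.52 ] -> center 1
```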
The second category improves the k-means algorithm with weighted distances, typically weighted Euclidean distances. Among these studies, Tang et al. [24] introduced minimum and maximum principles on weighted data to balance the effects of distance and density on clustering, thereby automatically determining the cluster centers and their number.
Wang et al. [25] measured similarity between samples with a dimension-weighted Euclidean distance in their improved k-means algorithm and, to keep the algorithm from falling into local optima, also proposed an algorithm that escapes local optima through stochastic and evolutionary processes, following the idea of evolutionary algorithms [26]. Addressing the effect of initial centers and outliers on clustering results, Zhang et al. [27] proposed an adaptive k-means variant based on weighted Gaussian distances, which better handles the effect of outliers on clustering results. This line of research centers on avoiding the local optima of the k-means algorithm and improving the robustness of cluster shapes: the initial cluster centers are discovered automatically while the weights are determined, and the weighted distances then improve clustering accuracy. It has achieved a certain degree of practical application, but it neglects the effects of outliers and noisy data on the clustering results and cannot determine the number of clusters well in advance.
The third category studies unconventional similarity or dissimilarity measures between samples, or combines variants of the k-means algorithm with other methods. Among these studies, Zhu et al. [17] improved the k-means algorithm with evidence distance: they first replaced the original sample values using a probability assignment method and then clustered with the evidence distance; compared with the classical k-means algorithm, an aggregated-distance-parameter k-means algorithm, and a Gaussian mixture model, their method produced better clustering results and also converged well.
Chen and Yang [28] proposed a diffusion k-means algorithm based on diffusion distance that maximizes the proximity of samples within the same cluster; it has clear advantages on nonlinear datasets and mixed-dimensional data with non-Euclidean geometric features.
Dinh et al. [29] proposed K-CMM, which fuses interpolation into the clustering process based on means and a kernel approach, measuring distances with an information-theoretic dissimilarity and the squared Euclidean distance. By redefining the similarity metric and using it in conjunction with the k-means algorithm, this category greatly improves the robustness of cluster-shape detection, but it still fails to address disadvantages (1)–(3).
Recent studies [17,18] show that the traditional k-means algorithm still suffers from four problems that have not been solved systematically, and existing studies address only some of them. Our goal is therefore to solve the four problems of traditional k-means with different methods and to integrate these methods into a systematic solution that weakens the influence of outliers and missing values on the clustering results, keeps the choice of initial centers from falling into the trap of local optima, and enhances the robustness of the algorithm in identifying the shapes of the clusters.
3. Pan-Factor Space Theory and Contour Similarity
Pan-factor space is a mathematical theory of concept discovery and representation built on factors and their operations. We therefore give the following definitions:
Definition 1. The domain, denoted $U=\{u_1,u_2,\ldots\}$, is a non-empty countable set composed of all objects studied in a problem, where $u_i$ is the $i$-th study subject.
In studying a problem with mathematical tools, one must resort to indicators or variables, which in the pan-factor space we call factors; a factor is generally denoted $f$.
Definition 2. For every $u\in U$ there exists a value $f(u)$ satisfying $f:u\mapsto f(u)$. The set $X(f)=\{f(u)\mid u\in U\}$ is called the phase space of the factor $f$.
The set consisting of all factors defined on $U$ is denoted $F$. The factor $f$ is a surjection from the domain $U$ to the phase space $X(f)$, and in applications a feature may be vacant ($f(u)$ is a missing value); missing values receive special treatment in the theory of pan-factor spaces.
Since the factor is a special mapping, it has a pre-image $f^{-1}$, defined mathematically as follows:
Definition 3. The pre-image $f^{-1}$ is a mapping from the phase space $X(f)$ to the power set $\mathcal{P}(U)$ of $U$, satisfying $f^{-1}(x)=\{u\in U\mid f(u)=x\}$, where $x\in X(f)$. The set $U/f=\{f^{-1}(x)\mid x\in X(f)\}$ is the quotient set. We then call $f^{-1}$ Recall. The purpose of Recall is to form a perception of class in the discourse. Obviously, $\bigcup_{x\in X(f)}f^{-1}(x)=U$.
For domains with a finite number of samples, assuming $U=\{u_1,u_2,\ldots,u_n\}$, the quotient set $U/f=\{f^{-1}(x)\mid x\in X(f)\}$ is a partition of $U$.
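A small self-contained Python illustration of Recall and the quotient set (the domain and factor below are hypothetical): a factor, viewed as a mapping, induces the partition $U/f$ of the domain.

```python
# Hypothetical domain U = {u1,...,u6} and factor f = "color", u -> phase.
f = {"u1": "red", "u2": "blue", "u3": "red",
     "u4": "green", "u5": "blue", "u6": "red"}

# Recall f^{-1}: phase -> class of objects; its image is the quotient set U/f.
recall = {}
for u, x in f.items():
    recall.setdefault(x, set()).add(u)

print(recall)  # e.g., {'red': {'u1','u3','u6'}, 'blue': {'u2','u5'}, 'green': {'u4'}}
# The classes are pairwise disjoint and their union is U: U/f is a partition of U.
```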
Based on the definitions of the factor $f$ and Recall $f^{-1}$, there are two special factors by convention:
1. The zero-factor $\mathbf{0}$: the phase space of $\mathbf{0}$ contains a single phase, and the quotient set of $\mathbf{0}$ is $\{U\}$.
2. The full-factor $\mathbf{1}$: the phase space of $\mathbf{1}$ distinguishes every object, and the quotient set of $\mathbf{1}$ is $\{\{u\}\mid u\in U\}$.
Thus, the following axiom can be obtained:
Axiom 1. (Discovering Axioms)
Further, based on the foregoing, we define the logical operations of the factors as follows:
Definition 4. Given two factors $f$ and $g$, $f$ is equal to $g$ iff $U/f=U/g$, denoted as $f=g$.
Definition 5. Given two factors $f$ and $g$, for every $B\in U/g$, if there exists $A\in U/f$ s.t. $B\subseteq A$, then we call $f$ less than $g$, denoted as $f<g$.
According to Definitions 4 and 5, we can combine “$<$” with “$=$”, denoted as $f\le g$.
Definition 6. For $f,g\in F$, if there exists $h\in F$ satisfying $U/h=\{A\cap B\mid A\in U/f,\ B\in U/g,\ A\cap B\neq\varnothing\}$, then $h$ is called the Analysis-Factor of $f$ and $g$, denoted as $h=f\vee g$.
Definition 7. For $f,g\in F$, if there exists $h\in F$ whose quotient set $U/h$ is the finest partition of $U$ that is coarser than both $U/f$ and $U/g$, then $h$ is called the Composite-Factor of $f$ and $g$, denoted as $h=f\wedge g$.
Theorem 1. The algebraic system $(F,\vee,\wedge)$ is a lattice.
Proof. First, we prove that $F$ with “$\le$” is a partially ordered set. According to Definitions 4 and 5, obviously: $f\le f$ (reflexivity); if $f\le g$ and $g\le f$, then $f=g$ (antisymmetry); if $f\le g$ and $g\le h$, then $f\le h$ (transitivity). Hence $F$ with “$\le$” is a partially ordered set.
Second, for any $f,g\in F$, we prove the existence of $f\vee g$ and $f\wedge g$. According to the calculation rules for quotient sets and the existence theorem for suprema and infima on quotient sets, and by Definition 5, $f\vee g$ is the supremum, that is, $f\vee g=\sup\{f,g\}$, and $f\wedge g$ is the infimum, that is, $f\wedge g=\inf\{f,g\}$.
So the algebraic system $(F,\vee,\wedge)$ is a lattice. □
By Theorem 1, the Analysis-Factor and Composite-Factor are the natural operations on $F$. Further, the first absorption law, the law of order, the idempotent law, the commutative law, the associative law, and the second absorption law hold for factors under the operations $\vee$ and $\wedge$. It can further be shown that the algebraic system $(F,\vee,\wedge)$ is a bounded lattice.
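To make the two lattice operations concrete, the following Python sketch computes the quotient sets of $f\vee g$ and $f\wedge g$ for two hypothetical factors, under the reading that the Analysis-Factor is the common refinement of the two partitions and the Composite-Factor merges overlapping classes:

```python
from itertools import product

# Hypothetical quotient sets of two factors on U = {0,...,5}.
Uf = [{0, 1, 2}, {3, 4, 5}]        # U/f
Ug = [{0, 1}, {2, 3}, {4, 5}]      # U/g

# Analysis-Factor f v g: the common refinement of the two partitions.
analysis = [A & B for A, B in product(Uf, Ug) if A & B]
print(analysis)  # [{0, 1}, {2}, {3}, {4, 5}]

# Composite-Factor f ^ g: the finest partition coarser than both,
# obtained by merging classes that overlap (connected components).
blocks = [set(B) for B in Uf + Ug]
composite = []
while blocks:
    cur, changed = blocks.pop(), True
    while changed:
        changed = False
        for b in blocks[:]:
            if cur & b:
                cur |= b
                blocks.remove(b)
                changed = True
    composite.append(cur)
print(composite)  # [{0, 1, 2, 3, 4, 5}]
```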
Definition 8. We refer to the combination of the four elements $(U,\,F,\,\{X(f)\}_{f\in F},\,X)$ as a pan-factor space, where $F$ is the set consisting of all factors defined on $U$, and $X$ is the data space constructed by all the factors, denoted as $X=\prod_{f\in F}X(f)$.
In practice, the number of factors and the number of objects of study in the domain are finite. Denote the finite set of factors taken from $F$ by $F_n=\{f_1,f_2,\ldots,f_n\}$, and denote by $\overline{F}_n$ the set consisting of the factors contained in $F_n$ together with all factors obtained from them under the operations $\vee$ and $\wedge$. We can prove that $\overline{F}_n$ is a sublattice of $F$.
Based on the foregoing, we give the following definition:
Definition 9. $\overline{F}_n$ is called the finite factor scalar frame, or lattice coordinate frame. The corresponding Cartesian product $\widetilde{X}(f_1)\times\widetilde{X}(f_2)\times\cdots\times\widetilde{X}(f_n)$ is the lattice coordinate system, where $\widetilde{X}(f_i)$ is the normalized phase space of $X(f_i)$.
According to Definition 9, if $X(f)=\{x_1,x_2,\ldots,x_m\}$, then the normalized phase space is $\widetilde{X}(f)=\{0,1,\ldots,m\}$.
The difference between the lattice coordinate system and an affine coordinate system is that an affine coordinate system cannot uniformly handle data on different measurement scales.
Definition 10. For $u\in U$, the tuple $(x_1(u),x_2(u),\ldots,x_n(u))$, with $x_i(u)\in\widetilde{X}(f_i)$, is called the lattice coordinates of the object $u$ (abbr. L.c). The L.c has the following properties:
- (1)
The coordinates of the object on each factor are natural numbers, and $x_i(u)=0$ represents no actual statistical value, i.e., a missing value.
- (2)
For $u,v\in U$, the magnitudes of $x_i(u)$ and $x_i(v)$ represent only the order of precedence and have no quantitative meaning. In general, $x_i(u)-x_i(v)$ is meaningless as a quantity, but it represents the potential difference between the two objects $u$ and $v$ on the factor $f_i$.
- (3)
For $i\neq j$, $x_i(u)$ and $x_j(u)$ are not directly comparable.
Therefore, before similarity analyses are performed, the data of the original samples must be converted to the corresponding "lattice point" data in L.c. This preprocessing step is known as the lattice transformation; the detailed conversion process is described in Section 4.1.
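The following Python sketch illustrates the idea of the lattice transformation under one simple assumption, a rank-based normalization of each factor's phases; the authors' exact procedure is the one given in Section 4.1.

```python
import numpy as np

def lattice_transform(X):
    """Map each column of X to natural-number lattice coordinates.

    Assumed (rank-based) normalization: the distinct observed values of a
    column are numbered 1..m in ascending order; NaN (a missing value) is
    sent to 0, the origin of the lattice coordinate system.
    """
    X = np.asarray(X, dtype=float)
    L = np.zeros(X.shape, dtype=int)
    for j in range(X.shape[1]):
        col = X[:, j]
        phases = np.unique(col[~np.isnan(col)])          # sorted distinct phases
        code = {v: r + 1 for r, v in enumerate(phases)}
        L[:, j] = [0 if np.isnan(v) else code[v] for v in col]
    return L

X = [[1.7, np.nan], [0.3, 2.0], [1.7, 5.5]]
print(lattice_transform(X))
# [[2 0]
#  [1 1]
#  [2 2]]
```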
Going from multidimensional data to low-dimensional data is a common means of data analysis, but going from low to high dimensions makes it possible to obtain more information, with kernel transformations as a commonly used means. Such a means has been lacking for similarity metrics.
In describing the similarity of samples, there are essentially four perspectives: first, describing the degree of mutual influence between samples, e.g., the inner product; second, comparing the sizes, shapes, volumes, or areas of the samples, e.g., the 2-norm; third, describing similarity by the distance between samples within a coordinate system, e.g., the Minkowski distance; and fourth, describing similarity by the sameness of the objects with which the samples interact, e.g., the correlation coefficient. In this paper, following the idea and principle of simple shapes in topology, we construct, within the lattice coordinate frame, factor contour indexes that describe the basic shape of the sample data, and on this basis we construct the similarity between the factor contours of samples and its algorithms.
Definition 11. For the lattice coordinates $x=(x_1(u),\ldots,x_n(u))$ of an object $u$, define $\alpha=Tx$ as the factor contour transformation of the data $x$, where $\alpha$ is the factor contour that represents the shape of the data $x$ in L.c. and $T$ is the contour operator matrix. Obviously, the factor contour transformation is a dimension-ascending transform of the data. The set $C$ of all factor contours is called the Factor Contour Space.
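The specific contour operator matrix $T$ is fixed by Definition 11; the Python sketch below is purely illustrative and uses a hypothetical pairwise-difference operator instead, which is linear (so it can be written as a matrix) and raises the dimension from $n$ to $n+n(n-1)/2$:

```python
import numpy as np

def contour_transform(x):
    """Toy, dimension-ascending 'contour' of a lattice point x in N^n.

    Hypothetical stand-in for the contour operator matrix T: keep the
    coordinates and append every pairwise difference x[i] - x[j] (i < j),
    so the relative shape of the point's profile is encoded explicitly.
    """
    x = np.asarray(x)
    n = len(x)
    diffs = [x[i] - x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x, diffs])

print(contour_transform([3, 1, 2]))  # [ 3  1  2  2  1 -1]
```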
Definition 12. Let $\alpha,\beta\in C$. If $\beta=\lambda\odot\alpha$ componentwise, then $\beta$ is the zoom of $\alpha$, and the vector $\lambda$ is the zoom-operator.
Definition 13. Let $\alpha,\beta\in C$. An order relation on $C$ is defined case by case from the comparison of the corresponding components of $\alpha$ and $\beta$, distinguishing equality, strict precedence in either direction, and weak precedence; in all remaining cases, $\alpha$ and $\beta$ are incomparable. The ordinal structure of the factor contour space will not be discussed in depth here.
Definition 14. Let $\alpha,\beta,\gamma\in C$, and consider a functional $S:C\times C\to\mathbb{R}$. If $S$ satisfies the following conditions:
- (1)
Boundedness: $0\le S(\alpha,\beta)\le 1$;
- (2)
Regularity: $S(\alpha,\alpha)=1$;
- (3)
Symmetry: $S(\alpha,\beta)=S(\beta,\alpha)$;
- (4)
Rank preservation: if $\alpha\le\beta\le\gamma$, then $S(\alpha,\gamma)\le S(\alpha,\beta)$;
then $S$ is said to be a similarity measure on $C$, and $S(\alpha,\beta)$ is the contour similarity between $\alpha$ and $\beta$, denoted $s_{\alpha\beta}$.
In Definition 14, the closer the value of $S(\alpha,\beta)$ is to 1, the more similar the two factor contours $\alpha$ and $\beta$ are. A variety of contour similarities conforming to Definitions 12 and 14 can be constructed. In Section 4.2, we give an algorithm for computing factor contour similarity.
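As an example of a functional satisfying conditions (1)–(3) of Definition 14 (an illustration only, not the contour similarity algorithm of Section 4.2):

```python
import numpy as np

def similarity(a, b):
    """S(a, b) = 1 / (1 + ||a - b||_2): bounded in (0, 1], equal to 1
    iff a == b (regularity), and symmetric in its arguments."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 / (1.0 + np.linalg.norm(a - b))

alpha = np.array([3, 1, 2, 2, 1, -1])   # two hypothetical factor contours
beta  = np.array([3, 1, 3, 2, 0, -2])
print(similarity(alpha, alpha))  # 1.0
print(similarity(alpha, beta))   # < 1; closer contours score nearer to 1
```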
5. Experimental Validation of the CSk-means Algorithm
To evaluate the performance of the proposed CSk-means algorithm, six datasets were downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/, accessed on 21 April 2024) as testing data. The experimental environment is a 12th Gen Intel(R) Core(TM) i5-12500 3.00 GHz processor, an NVIDIA GeForce RTX 3080 graphics card, 16.0 GB of RAM, and the Windows 11 operating system; all experiments are programmed and run on the MATLAB 2023b (64-bit) platform. The comparison algorithms are the traditional k-means algorithm, a hierarchical clustering model, spectral clustering, and k-means++. The CSk-means algorithm to be validated is implemented in custom code.
Table 1 describes the basic statistical characteristics of the six datasets. In this paper, these datasets are used to validate the performance of the proposed algorithm, and no division into training and testing sets is made in this process. The datasets were selected mainly because: (1) they allow the validity and reliability of the algorithm to be characterized with validated metrics; and (2) the improved clustering algorithm must focus on real clustering scenarios with varying degrees of complexity and distribution, which is reflected in the numbers of categories and samples of the datasets as well as in the characteristics of their data distributions.
The metrics chosen for evaluating the performance of the model are Precision, Recall, Accuracy, Running Time, Number of Iterations, and Normalized Mutual Information (NMI).
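For reproducibility, two of these metrics can be computed as follows (a Python sketch assuming scikit-learn and SciPy; the paper's experiments were run in MATLAB). Clustering accuracy requires the predicted labels to be optimally matched to the true classes first, e.g., with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster labels to true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)            # count[t, p] = matches
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)     # maximize matched counts
    return count[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                        # same clusters, permuted labels
print(clustering_accuracy(y_true, y_pred))              # 1.0
print(normalized_mutual_info_score(y_true, y_pred))     # 1.0
```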
Table 2 reports the iteration time, number of iterations, and accuracy of the clustering results of the CSk-means algorithm and the classical k-means algorithm on the six datasets. The results show that the CSk-means algorithm takes more time for the clustering process than the k-means algorithm, since its lattice transformation is time-consuming; however, its number of iterations and clustering results are better than those of the classical k-means algorithm.
Table 3 reports the F-measure, Precision, and Recall values of the clustering results of the CSk-means algorithm and the classical k-means algorithm on the six datasets. The CSk-means algorithm is more stable than the classical k-means algorithm, and its overall clustering performance is better. It is noteworthy that accuracy is lower on the Yeast and Abalone datasets. We analyzed the reasons for this in depth: the two datasets contain a large number of classes, the samples in different classes are not clearly differentiated, and the data distributions show strong consistency across classes. The methods used in this analysis include, but are not limited to, principal component analysis, inspection of the data distribution, and other clustering models.
Figure 3 and Table 4 show the Accuracy, F-measure, Precision, Recall, and NMI values of the clustering results of the CSk-means algorithm, hierarchical clustering, and spectral clustering on the six datasets. In comparison, the CSk-means algorithm shows better stability and more accurate clustering results.
6. Discussion
To systematically solve the four problems of the traditional k-means algorithm raised in the recent research literature, we propose the CSk-means algorithm. Experimental results demonstrate the effectiveness and potential of the model. Following the order of the processing pipeline, this section discusses the key findings, significance, and limitations of each technique involved in the CSk-means algorithm.
To eliminate the negative effects of outliers and missing values on clustering results, we redefine data preprocessing using the lattice transformation. The lattice transformation essentially constructs a new lattice coordinate system, mapping the original continuous data to discrete lattice coordinates and missing values to the origin of those coordinates; it can be interpreted as an integerizing or discretizing transform, and it may provide a new strategy and method for data discretization and for handling missing values and outliers. Its advantage is that it reduces, to a certain extent, the adverse effects of outliers and missing values on model results, which some of the experimental results of this study support. Its non-negligible limitation, however, is that the lattice transformation is a homomorphic migration transform, so the adverse effect of outliers still persists to some degree. In addition, the degree of discretization of the lattice transformation is limited, which may lead to poor generalization of the mined knowledge if it is applied in data preprocessing for rule learning.
Following the idea of hierarchical clustering, our strategy for predetermining the number of clusters is to build a vector of contour similarities between samples and to apply Fisher optimal segmentation to it. From the results, the number of clusters predetermined by this strategy is consistent with the actual number, and the strategy also provides a new idea and method for predetermining the number of classes in clustering applications. Its limitation is obvious, however: predetermining the number of classes on a large sample dataset takes considerable time, which is unfavorable for deploying the algorithm online.
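A minimal Python sketch of the Fisher optimal segmentation step (dynamic programming over contiguous segments of an ordered sequence; the input here is hypothetical, standing in for the ordered contour-similarity vector):

```python
import numpy as np

def fisher_optimal_partition(seq, k):
    """Split the ordered sequence into k contiguous segments minimizing the
    total within-segment squared deviation; returns the segment start indices."""
    x = np.asarray(seq, float)
    n = len(x)
    D = np.zeros((n, n))                     # D[i, j]: cost of one segment x[i..j]
    for i in range(n):
        for j in range(i + 1, n):
            seg = x[i:j + 1]
            D[i, j] = ((seg - seg.mean()) ** 2).sum()
    e = np.full((n + 1, k + 1), np.inf)      # e[m, c]: best cost, first m points, c segments
    cut = np.zeros((n + 1, k + 1), dtype=int)
    e[0, 0] = 0.0
    for m in range(1, n + 1):
        for c in range(1, min(m, k) + 1):
            for j in range(c - 1, m):        # last segment is x[j..m-1]
                cand = e[j, c - 1] + D[j, m - 1]
                if cand < e[m, c]:
                    e[m, c], cut[m, c] = cand, j
    starts, m, c = [], n, k
    while c > 0:                             # backtrack the optimal cut points
        starts.append(int(cut[m, c]))
        m, c = cut[m, c], c - 1
    return sorted(starts)

print(fisher_optimal_partition([1.0, 1.2, 0.9, 5.0, 5.1, 9.0, 9.2], 3))  # [0, 3, 5]
```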
For the selection of initial clustering centers, we adopt isometric interpolation, a new method for determining initial centers. Isometric interpolation embodies the idea of dissimilarity between classes and is essentially a balanced iterative search. It can prevent the clustering model from falling into the trap of local optima and also provides a new clustering strategy, which will be the focus of our next work. Its limitation is obvious, however: in determining the initial clustering centers, each selected center must be evaluated for its balance with the centers of the other classes. This balancing strategy deeply affects the selection of clustering centers, which in turn causes fluctuations in the clustering results; although the centers can be determined by trial and error, the time cost increases.
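One plausible reading of the isometric-interpolation seeding (a hypothetical Python sketch, not the paper's exact procedure; `order` is an assumed ordering of the samples, e.g., by contour similarity to a reference contour): pick the k seeds at equal intervals along that ordering, so they start spread apart.

```python
import numpy as np

def isometric_initial_centers(X, k, order):
    """Pick k samples at equal intervals along an ordering of the data
    (a hypothetical interpretation of isometric interpolation)."""
    X, order = np.asarray(X), np.asarray(order)
    idx = np.linspace(0, len(order) - 1, k).round().astype(int)
    return X[order[idx]]

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [20.0]])
order = np.argsort(X[:, 0])                 # here the samples are already ordered
print(isometric_initial_centers(X, 3, order).ravel())  # [ 0. 10. 20.]
```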
We improved the traditional k-means algorithm by using contour similarity instead of the Euclidean distance to enhance its ability to recognize clusters of different shapes, and the experimental results show that this improvement is effective. Contour similarity is a new measure of similarity between samples and is essentially a dimension-ascending transformation. Compared with other similarity measures, contour similarity incurs no information loss from dimensionality reduction, which would otherwise make the description of similarity inaccurate. We believe that contour similarity enriches the methodological theory of similarity measures. Its limitations are that at least three samples are needed to use contour similarity and that its calculation is relatively complicated, which increases the time cost, as the designed comparison experiments show.
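A generic Python sketch of the similarity-driven k-means loop (assignment by maximum similarity instead of minimum Euclidean distance; the similarity below is the Definition 14-style example from Section 3, not the paper's contour similarity):

```python
import numpy as np

def similarity_kmeans(X, centers, sim, iters=100):
    """k-means-style iteration with an arbitrary similarity function:
    assign each sample to the MOST similar center, then recompute the
    centers as cluster means, until the centers stop moving."""
    X, centers = np.asarray(X, float), np.asarray(centers, float)
    for _ in range(iters):
        S = np.array([[sim(x, c) for c in centers] for x in X])
        labels = S.argmax(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

sim = lambda a, b: 1.0 / (1.0 + np.linalg.norm(a - b))
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], float)
labels, centers = similarity_kmeans(X, X[[0, 2]], sim)
print(labels)    # [0 0 1 1]
print(centers)   # [[ 0.   0.5] [10.  10.5]]
```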
Although the improved CSk-means algorithm improves the accuracy of the clustering results to a certain degree, its limitations and implications must also be recognized, especially the increased time cost of the clustering process. The lattice transformation of the data, the selection of the initial clustering centers, and the computation of inter-sample similarity all add to this cost. If the calculation of contour similarity between samples is further optimized, and if the effect of outliers on the clustering results can be ignored so that the lattice transformation can be removed from the algorithm, the time cost of the CSk-means algorithm will decrease.
In addition, the comparison algorithms in our experiments comprise the traditional k-means algorithm, hierarchical clustering, and spectral clustering, chosen to verify the feasibility of the proposed method; comparisons with other clustering models, such as SOM, were neglected. In future studies, we will therefore include these models to provide a more comprehensive analysis of the practical applications of clustering models.