Article

An Improved K-Means Algorithm Based on Contour Similarity

1 Key Laboratory of Industrial Automation and Machine Vision of Qiannan, School of Mathematics and Statistics, Qiannan Normal University for Nationalities, Duyun 558000, China
2 College of Science, Liaoning Technical University, Fuxin 123000, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(14), 2211; https://doi.org/10.3390/math12142211
Submission received: 10 June 2024 / Revised: 30 June 2024 / Accepted: 6 July 2024 / Published: 15 July 2024
(This article belongs to the Special Issue Optimization Algorithms in Data Science: Methods and Theory)

Abstract: The traditional k-means algorithm is widely used in large-scale data clustering because of its easy implementation and efficient process, but it also suffers from the disadvantages of local optimality and poor robustness. In this study, a Csk-means algorithm based on contour similarity is proposed to overcome the drawbacks of the traditional k-means algorithm. For the traditional k-means algorithm, which results in local optimality due to the influence of outliers or noisy data and random selection of the initial clustering centers, the Csk-means algorithm overcomes both drawbacks by combining data lattice transformation and dissimilar interpolation. In particular, the Csk-means algorithm employs Fisher optimal partitioning of the similarity vectors between samples for the process of determining the number of clusters. To improve the robustness of the k-means algorithm to the shape of the clusters, the Csk-means algorithm utilizes contour similarity to compute the similarity between samples during the clustering process. Experimental results show that the Csk-means algorithm provides better clustering results than the traditional k-means algorithm and other comparative algorithms.

1. Introduction

Different researchers proposed the k-means algorithm in the 1950s and 1960s [1], including Lloyd [2], MacQueen [3], Jancey [4], and Steinhaus [5]. Due to its simplicity of implementation and low computational complexity, the k-means algorithm has become one of the top ten most popular algorithms for clustering and data mining problems [6,7]. Its application areas include data mining [8], cluster analysis [9], data clustering [10], pattern recognition [11], financial risk control [12], data science [13], intelligent marketing [14], and data compression [15,16].
From the recent works [17,18], the following problems with the traditional k-means algorithm remain a hot research topic:
(1)
Clustering results are susceptible to noisy data and outliers;
(2)
Random selection of initial centers tends to make clustering results locally optimal;
(3)
In realistic clustering application scenarios, it is difficult to predetermine the number of clusters in advance;
(4)
Furthermore, the reliance on Euclidean distance in the classic k-means algorithm limits its capacity to discern non-linear or irregularly shaped clusters.
In this case, the measure of similarity between samples is the core of the k-means clustering algorithm [19,20]. Therefore, this research attempts to improve the traditional k-means algorithm through a series of data processing methods and innovative similarity metrics to solve the above problems.
The contributions of this research work are as follows: 1. We solve the aforementioned problems (1), (2), and (3) using data lattice transformation, isometric interpolation, and Fisher optimal segmentation, respectively; 2. We improve the traditional k-means algorithm using contour similarity to solve the aforementioned problem (4); and 3. We design numerical clustering experiments to verify the feasibility of the work performed. This research provides several valuable and implementable methods for handling data outliers and missing values, predetermining the number of clusters, selecting initial centers in the clustering process, and measuring similarity between samples.
The subsequent sections of this paper are structured as follows: Section 2 categorically discusses the research on improving the k-means algorithm with different sample similarity metrics. Section 3 describes in detail the similarity basics used in this paper. Section 4 presents algorithmic improvement ideas and processes. Section 5 conducts numerical experiments to verify the performance of the proposed algorithm in this paper. Section 6 discusses the limitations and implications of the model proposed in this paper. Finally, Section 7 concludes the paper.

2. Literature Survey

The focus of this study is to improve the robustness of the classical k-means algorithm in recognizing alternative clustering result shapes. To this end, this section summarizes some of the studies relevant to the work of this manuscript and divides these studies into three categories.
The first type is based on improving typical similarity distance determination methods, including the Euclidean metric, Manhattan distance, Minkowski distance, city block distance, cosine distance, correlation distance, and so on. Among them, Singh et al. [21] compared the performance of the k-means algorithm using Euclidean, Manhattan, and Minkowski distances in the classification process. Their conclusion showed that the k-means algorithm using Minkowski distance had the smallest distortion in describing the similarity of the samples, the best classification results were obtained using Euclidean distance, and the k-means algorithm using Manhattan distance had the worst classification results; they therefore stressed the need for careful selection of similarity metrics. A. Chakraborty et al. [22] reported the performance of the k-means algorithm using city block, cosine, and correlation distances in classifying the Iris dataset, and the k-means algorithm using city block distance and cosine distance achieved 98% accuracy. F. A. Sebayang et al. [23] used Canberra, city block, and Euclidean distances with the k-means algorithm; the results show that the k-means algorithm with different distances shows different degrees of superiority on different datasets. This kind of research aims to improve the robustness of the k-means algorithm across clustering shapes in different application scenarios and to achieve better real-world results, but it ignores the impact of disadvantages (1)–(3) of the k-means algorithm, and such improvements also require the researcher to have some prior knowledge of the data distribution.
The second type improves the k-means algorithm based on weighted distances, typically weighted Euclidean distances. Among them, Tang et al. [24] introduced the minimum and maximum principles based on weighted data to balance the effects of distance and density on clustering, as a way to automatically determine the clustering centers and their number. Wang et al. [25] used dimensionally weighted Euclidean distance to measure the degree of similarity between samples when improving the k-means algorithm, and to ensure that the algorithm does not fall into a local optimum, they also proposed a new algorithm that escapes local optima using stochastic and evolutionary processes based on the idea of evolutionary algorithms [26]. Addressing the effect of initial centers and outliers on clustering results, Zhang et al. [27] proposed an adaptive k-means variant based on weighted Gaussian distances, which better handled the effect of outliers on the clustering results. This kind of research revolves around avoiding the local optimum of the k-means algorithm and improving robustness to the clustering shape: the initial clustering centers are discovered automatically through the process of determining the weights, and the accuracy of clustering is then improved by the weighted distances. It achieves a certain degree of practical applicability, but it ignores the effects of outliers and noisy data on the clustering results and does not determine the number of clusters in advance very well.
The third type studies non-conventional sample similarity or dissimilarity measures, or variants of the k-means algorithm combined with other methods. Among them, Zhu et al. [17] proposed to improve the k-means algorithm using evidence distance: they first replaced the original values of the sample data with a probability assignment, then clustered using evidence distance; compared with the classical k-means algorithm, the aggregated-distance-parameter k-means algorithm, and the Gaussian mixture model, the clustering results are better and the algorithm converges well. Chen and Yang [28] proposed a diffusion k-means algorithm based on diffusion distance to maximize the proximity of sample data from the same cluster, which has advantages in dealing with nonlinear datasets and mixed-dimensional data with non-Euclidean geometric features. Dinh et al. [29] proposed K-CMM, which fuses interpolation into the clustering process based on means and a kernel approach, and uses an information-theoretic dissimilarity together with the squared Euclidean distance as the distance metric. Research in this category redefines the similarity metric and uses it in conjunction with the k-means algorithm for clustering, which greatly improves the robustness of the algorithm in detecting the shape of the clusters, but it still fails to address problems (1), (2), and (3).
Through recent studies [17,18], it has been found that the traditional k-means algorithm still suffers from four problems that have not been solved systematically, and existing studies have proposed solutions around some of the four problems. Therefore, our goal is to solve the four problems that exist in traditional k-means using different methods, and ultimately integrate them to form a systematic solution to help weaken the influence of outliers and missing values on the clustering results, avoid the choice of initial centers falling into the trap of local optimum, and enhance the robustness of the algorithm to identify the shape of the clustering results.

3. Pan-Factor Space Theory and Contour Similarity

Pan-factor space is a mathematical theory of concept discovery and representation based on theories, factors, and their operations. Therefore, we give the following definition:
Definition 1. 
The domain, represented by $U = \{u_i\}_{i=1}^{n}$, is a non-empty countable set composed of all objects studied in a problem, where $u_i$ is the $i$-th study object.
In the process of studying a problem with mathematical tools, it is necessary to resort to some indicators or variables, which in the pan-factor space we call factors; a factor is generally denoted as $f$.
Definition 2. 
For $u_i \in U$, $i = 1, 2, \ldots, n$, there exists a value $d_i$ satisfying $d_i = f(u_i)$. The set $I_f = \{ d_i \mid u_i \in U,\ d_i = f(u_i) \}$ is called the phase space of the factor $f$.
The set consisting of all factors defined on $U$ is denoted as $F$. The factor $f$ is a surjection from the domain $U$ to the phase space $I_f$, and the feature $d_i$ may be vacant in applications ($d_i$ is a missing value); missing values are treated specially in the theory of pan-factor spaces.
Since the factor $f$ is a special mapping, it admits a pre-image mapping $f^{-1}$, defined mathematically as follows:
Definition 3. 
The $f^{-1}$ is a mapping from the phase space $I_f$ to the power set of $U$ satisfying
$$\forall d \in I_f,\quad f^{-1}(d) = [d]_f \subseteq U,\qquad [d]_f \in U/f \subseteq 2^{U},$$
where $[d]_f = \{ u_i \mid f(u_i) = d \in I_f,\ u_i \in U \} \subseteq U$, $i = 1, 2, \ldots, n$, and $U/f$ is the quotient set. Then $f^{-1}$ is called the Recall.
The purpose of Recall is to form a perception of class in the discourse. Obviously:
$$1.\ \forall d \in I_f,\ f(f^{-1}(d)) = d. \qquad 2.\ \forall [d]_f \subseteq U,\ f^{-1}(f([d]_f)) = [d]_f. \qquad 3.\ \forall x, y \in I_f,\ x \neq y,\ f^{-1}(x) \cap f^{-1}(y) = \emptyset.$$
For domains with a finite number of samples, assuming $I_f = \{d_1, d_2, \ldots, d_s\}$, the quotient set is
$$U/f = \big\{ [d_1]_f, [d_2]_f, \ldots, [d_s]_f \big\},$$
and $U/f$ is a partition of $U$.
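As a small illustrative example (ours, not from the original paper): take five objects and a single grading factor $f$ with three phases; the Recall of each phase collects the objects sharing that phase, and the quotient set partitions $U$:
$$U = \{u_1, \ldots, u_5\},\quad I_f = \{A, B, C\},\quad f(u_1) = f(u_3) = A,\ f(u_2) = f(u_5) = B,\ f(u_4) = C,$$
$$f^{-1}(A) = [A]_f = \{u_1, u_3\},\quad [B]_f = \{u_2, u_5\},\quad [C]_f = \{u_4\},\quad U/f = \big\{[A]_f, [B]_f, [C]_f\big\}.$$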
Based on the definitions of the factor $f$ and the Recall $f^{-1}$, two special factors are adopted by convention:
1. The zero-factor $o$: the phase space of $o$ is $I_o = \{\mathrm{NoN}\}$, and the quotient set of $o$ is $U/o = \{U\}$.
2. The full-factor $e$: the phase space of $e$ is $I_e = U$, and the quotient set of $e$ is $U/e = \{\{u\} \mid u \in U\}$.
Thus, the following axiom can be obtained:
Axiom 1. 
(Discovering Axiom) $o^{-1}(\mathrm{NoN}) = U$, and $e^{-1}(u) = \{u\}$ for every $u \in U$.
Further, based on the foregoing, we define the logical operations of the factors as follows:
Definition 4. 
Two factors $f$ and $g$ are said to be equal, iff $I_f = I_g$ and $\forall u \in U$, $f^{-1}(f(u)) = g^{-1}(g(u))$; this is denoted as $f = g$.
Definition 5. 
For two factors $f$ and $g$ and $u \in U$, let $f(u) = x \in I_f$ and $g(u) = y \in I_g$. If $\forall x \in I_f$, $\exists y \in I_g$ s.t. $f^{-1}(x) \subseteq g^{-1}(y)$, then $g$ is said to be less than $f$, denoted as $g < f$.
According to Definitions 4 and 5, $f = g$ and $g < f$ can be combined and denoted as $g \le f$.
Definition 6. 
For $u \in U$, let $f(u) = x \in I_f$, $g(u) = y \in I_g$, and $h(u) = (x, y) \in I_f \times I_g$. If $\forall (x, y) \in I_f \times I_g$,
$$h^{-1}(x, y) = f^{-1}(x) \cap g^{-1}(y),$$
then $h$ is called the Analysis-Factor of $f$ and $g$, denoted as $h = f \vee g$.
Definition 7. 
For $u \in U$, let $f(u) = x \in I_f$, $g(u) = y \in I_g$, and $l(u) = (x, y) \in I_f \times I_g$. If $\forall (x, y) \in I_f \times I_g$,
$$l^{-1}(x, y) = f^{-1}(x) \cup g^{-1}(y),$$
then $l$ is called the Composite-Factor of $f$ and $g$, denoted as $l = f \wedge g$.
Theorem 1. 
The algebraic system $(F; \vee, \wedge)$ is a lattice.
Proof. 
Firstly, we prove that $F$ with “$\le$” is a partially ordered set. According to Definitions 4 and 5, obviously: $f \le f$ (reflexivity); if $f \le g$ and $g \le f$, then $f = g$ (antisymmetry); if $f \le g$ and $g \le h$, then $f \le h$ (transitivity). Hence $F$ with “$\le$” is a partially ordered set.
Secondly, for $f, g \in F$, we prove the existence of $\mathrm{l.u.b.}\{f, g\}$ and $\mathrm{g.l.b.}\{f, g\}$. According to the calculation rules for quotient sets,
$$U/(f \vee g) = U/f \cap U/g, \qquad U/(f \wedge g) = U/f + U/g,$$
together with the existence theorem for least upper and greatest lower bounds on quotient sets and Definition 5, it follows that $f \vee g$ is the supremum, that is, $\mathrm{l.u.b.}\{f, g\} = f \vee g$, and $f \wedge g$ is the infimum, that is, $\mathrm{g.l.b.}\{f, g\} = f \wedge g$.
So the algebraic system $(F; \vee, \wedge)$ is a lattice. □
From Theorem 1, the Analysis-Factor and Composite-Factor are the natural operations $\vee$ and $\wedge$ on factors. Further, the first absorption law, the law of order, the idempotent law, the commutative law, the associative law, and the second absorption law hold for factors $f$ and $g$ under the $\vee$ and $\wedge$ operations. It can further be shown that the algebraic system $(F; \vee, \wedge)$ is a bounded lattice.
Definition 8. 
Let’s refer to the combination of the four elements  U , F , D , 2 U  as a pan-factor space, where  F  is a set that consists of all factors defined on  U D  is the data space that be constructed by all factors, denoted as:
D = f i F I f i
In practice, the number of factors and the number of objects of study in the domain are finite. Denote the set of finitely many factors taken out of $F$ as $F_n = \{ f_i \mid f_i \in F \}_{i=1}^{n}$, and denote by $\overline{F_n}$ the set consisting of the factors in $F_n$ together with all factors obtained from them under the operations $\vee$ and $\wedge$. It can be proved that $(\overline{F_n}; \vee, \wedge)$ is a sublattice of $(F; \vee, \wedge)$.
Based on the foregoing, we give the following definition:
Definition 9. 
$(o; F_n)$ is called the finite factor scalar frame, or lattice coordinate frame. The corresponding Cartesian product $D^{*} = I_1^{*} \times I_2^{*} \times \cdots \times I_n^{*}$ is the lattice coordinate system, where $I_j^{*}$, $j = 1, 2, \ldots, n$, is the phase space $I_{f_j}$ normalized by $\mathrm{NoN}$.
According to Definition 9, if $I_{f_j} = \{x_1, x_2, \ldots, x_{n_j}\}$, then the normalized phase space is $I_j^{*} = \{0, 1, 2, \ldots, n_j\}$.
The difference between the lattice coordinate system and the affine coordinate system is that the affine coordinate system cannot uniformly handle data of different metric scales.
Definition 10. 
For $f_1, f_2, \ldots, f_n \in F_n$ and $u \in U$,
$$(f_1, f_2, \ldots, f_n)(u) = (x_1, x_2, \ldots, x_n) \in D^{*}$$
is called the lattice coordinates of the object $u$ (abbr. L.c.).
The L.c has the following properties:
(1)
The coordinate $f_j(u_i) = x_{ij}$ of the object $u_i \in U$ on the factor $f_j$ is a natural number, and $x_{ij} = 0$ represents no actual statistical value, i.e., a missing value.
(2)
For $f_k(u_i) = x_{ik}$ and $f_k(u_j) = x_{jk} \in I_k^{*}$, the magnitudes of $x_{ik}$ and $x_{jk}$ represent only an order of precedence and have no quantitative meaning. In general, $x_{ik} + x_{jk}$ is meaningless, but $x_{ik} - x_{jk}$ represents the potential difference between the two objects $u_i$ and $u_j$ on the factor $f_k$.
(3)
For $f_j(u_i) = x_{ij} \in I_j^{*}$ and $f_k(u_i) = x_{ik} \in I_k^{*}$, $x_{ij}$ and $x_{ik}$ are not directly comparable.
Therefore, before similarity analyses are performed, the data corresponding to the original samples need to be converted to the corresponding "lattice point" data in L.c. This step of data preprocessing is known as the lattice transformation. The detailed conversion process is described in detail in Section 4.1.
Reducing multidimensional data to lower dimensions is a common means of data analysis, whereas mapping data from low to high dimensions, for example via kernel transformations, makes it possible to obtain more information. Such a means has been lacking for similarity metrics.
In describing the similarity of samples, there are essentially four perspectives: the first describes the degree of mutual influence between samples, such as the inner product. The second compares the sizes, shapes, volumes, or areas of the samples and then determines the degree of similarity between two samples, e.g., the 2-norm. The third describes the similarity between samples by their distance within a coordinate system, such as the Minkowski distance. The fourth describes the similarity of samples from the perspective of the sameness of the objects with which the samples interact, e.g., the correlation coefficient. In this paper, following the idea of simple shapes in topology and working within the lattice coordinate frame, we construct factor contour indexes that describe the basic shape of the sample data, and on this basis we construct the degree of similarity between the factor contours of samples and its computation.
Definition 11. 
For $x_{1 \times n} = (x_1, x_2, \ldots, x_n) \in D$, define
$$q_{1 \times n} = x_{1 \times n} P_{n \times n}$$
as the factor contour transformation of the data $x_{1 \times n}$ (abbr. $q_x$), where $q_{1 \times n}$ is the factor contour that represents the shape of the data $x_{1 \times n}$ in L.c., and $P_{n \times n}$ is the contour operator matrix with the following form:
$$P_{n \times n} = \begin{pmatrix} -1 & 0 & \cdots & 0 & 1 \\ 1 & -1 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{pmatrix}$$
Obviously, for $x_{1 \times n} = (x_1, x_2, \ldots, x_n) \in D$, the factor contour transformation $q_x = (x_2 - x_1, x_3 - x_2, \ldots, x_n - x_{n-1}, x_1 - x_n)$ is a dimension-raising transform of the data. The set $L = \{ q_x \mid q_x = x_{1 \times n} P_{n \times n},\ x_{1 \times n} \in D \}$ is called the Factor Contour Space.
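A minimal numerical sketch of Definition 11 (our illustration; the signs of $P_{n \times n}$ follow the reconstruction above): multiplying a sample in lattice coordinates by the contour operator yields the cyclic first differences that make up its factor contour.

```python
import numpy as np

def contour_operator(n: int) -> np.ndarray:
    """Contour operator P = R - E: -1 on the diagonal, 1 on the subdiagonal,
    and 1 in the top-right corner (cyclic difference matrix)."""
    R = np.zeros((n, n))
    R[np.arange(1, n), np.arange(n - 1)] = 1.0   # cyclic shift part
    R[0, n - 1] = 1.0
    return R - np.eye(n)

x = np.array([3.0, 1.0, 4.0, 2.0])      # a sample in lattice coordinates
q_x = x @ contour_operator(4)           # its factor contour
print(q_x)                              # [-2.  3. -2.  1.] = (x2-x1, x3-x2, x4-x3, x1-x4)
```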
Definition 12. 
Let $q_x, q_y \in L$ and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n) \in \mathbb{R}_{+}^{n}$. If
$$q_y = \lambda \circ q_x = \big( \lambda_1 (x_2 - x_1),\ \lambda_2 (x_3 - x_2),\ \ldots,\ \lambda_n (x_1 - x_n) \big),$$
then $q_y$ is the zoom of $q_x$, and the vector $\lambda$ is the zoom-operator.
Definition 13. 
Let $q_y = \lambda \circ q_x$, $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n) \in \mathbb{R}_{+}^{n}$. Define:
  • If $\lambda_k = 1$, $k = 1, 2, \ldots, n$, then $q_y = q_x$.
  • If $\lambda_k = m > 0$, $k = 1, 2, \ldots, n$, then $q_y \sim q_x$.
  • If $\lambda_k > 1$, $k = 1, 2, \ldots, n$, then $q_y \succ q_x$.
  • If $\lambda_k < 1$, $k = 1, 2, \ldots, n$, then $q_y \prec q_x$.
  • Otherwise, $q_y$ and $q_x$ are not comparable.
The ordinal structure of the factor contour space will not be discussed in depth.
Definition 14. 
Let $x, y, z \in D$ and $q_x, q_y, q_z \in L$, and consider the functional
$$\rho : L \times L \to [0, 1].$$
If $\rho$ satisfies the following conditions:
(1)
Boundedness: $0 \le \rho(q_x, q_y) \le 1$;
(2)
Regularity: $\rho(q_x, q_x) = 1$;
(3)
Symmetry: $\rho(q_x, q_y) = \rho(q_y, q_x)$;
(4)
Rank preservation: if $q_y \prec q_x \prec q_z$, then $\rho(q_y, q_x) > \rho(q_y, q_z)$;
then $\rho$ is said to be a similarity measure on $L$. $\rho(q_x, q_y)$ is the contour similarity between $q_x$ and $q_y$, denoted $\rho_{xy}$.
In Definition 14, the closer the value of $\rho_{xy}$ is to 1, the more similar the two factor contours $q_x$ and $q_y$. A variety of contour similarities conforming to Definition 12 and Definition 14 can be constructed. In Section 4.2, we give an algorithm for the computation of factor contour similarity.

4. Algorithmic Ideas and Improvements

The design thought process of the improved Csk-means algorithm is shown below (Figure 1).
In particular, the basic data format obtained herein is illustrated by the following images (Figure 2).

4.1. The Lattice Transformation Process

The data lattice transformation is a data preprocessing step applied to the original data matrix $X = (x_{ij})_{m \times n}$, aiming at establishing a unified reference point for each factor, transforming the missing values of the data into the origin of the lattice coordinate system, and integerizing the original data. That is, there exists a one-to-one mapping
$$I_{f_j} \to N_j = \{0, 1, 2, \ldots\},\quad j = 1, 2, \ldots, n.$$
In this paper, the steps to perform the data lattice transformation are as follows:
(1)
For the original data matrix $X = (x_{ij})_{m \times n}$, extract the number of rows $m$ and the number of columns $n$.
(2)
Find the reference point vector
$$\mathrm{INF} = \big( \inf I_{f_1}, \inf I_{f_2}, \ldots, \inf I_{f_n} \big),$$
where $\inf I_{f_j}$, $j = 1, 2, \ldots, n$, represents the infimum of the phase space $I_{f_j}$ of the factor $f_j$. If the factor $f_j$ is a continuous variable and $\inf I_{f_j}$ is unknown, let
$$\inf I_{f_j} = \min_{i} x_{ij} - \varepsilon_j,\quad \varepsilon_j \in (0, 1),\ i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, n,$$
where $\varepsilon_j$ is a slack variable.
(3)
For $x_{ij} \in I_{f_j}$, let
$$y_{ij} = \begin{cases} 0, & \text{if } x_{ij} = \mathrm{NoN}, \\ (x_{ij} - \inf I_{f_j}) \times 10^{k_j} + 1, & \text{if } x_{ij} \neq \mathrm{NoN}, \end{cases} \qquad i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, n,$$
where $\mathrm{NoN}$ indicates that $x_{ij}$ is a missing value and $k_j$ is the scale-accuracy parameter (it can be viewed as the number of decimal places in the fractional part of the data).
After the previous three steps, the original data matrix $X = (x_{ij})_{m \times n}$ is lattice transformed into a matrix $Y = (y_{ij})_{m \times n}$.
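As a minimal illustration of steps (1)–(3), the following NumPy sketch (ours, not the authors' MATLAB code) lattice-transforms a small matrix; the scalar slack eps, the per-factor accuracy vector k, and the use of np.nan to encode NoN are illustrative assumptions.

```python
import numpy as np

def lattice_transform(X: np.ndarray, k: np.ndarray, eps: float = 0.5) -> np.ndarray:
    """Lattice transformation of an m-by-n data matrix X (steps (1)-(3)).
    Missing values (NoN) are encoded as np.nan and mapped to the origin 0;
    k[j] is the scale-accuracy parameter (decimal places) of factor j."""
    m, n = X.shape                              # step (1): matrix dimensions
    inf_I = np.nanmin(X, axis=0) - eps          # step (2): reference points
    Y = (X - inf_I) * 10.0 ** k + 1.0           # step (3): shift, scale, offset
    Y[np.isnan(X)] = 0.0                        # missing values -> lattice origin
    return np.rint(Y)

X = np.array([[5.1, 3.5],
              [4.9, np.nan],
              [6.3, 2.8]])
print(lattice_transform(X, k=np.array([1, 1])))   # [[ 8. 13.] [ 6.  0.] [20.  6.]]
```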

4.2. The Similarity Calculation Process of Contour Similarity

For the data matrix $Y = (y_{ij})_{m \times n}$ that has been lattice transformed, the procedure for calculating the degree of similarity between samples based on contour similarity is as follows:
(1)
Calculate the scale vector
$$S = \big( \sup M_1, \sup M_2, \ldots, \sup M_n \big),$$
where $\sup M_j = \max_{i} y_{ij} + \delta_j$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. If the factor $f_j$ is a continuous variable, then $\delta_j > 0$; otherwise $\delta_j = 0$.
(2)
Generate the contour data matrix
$$Z_{m \times n} = Y (R - E),$$
where $R_{n \times n}$ is the cyclic shift matrix
$$R_{n \times n} = \begin{pmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}_{n \times n}$$
and $E$ is the identity matrix of dimension $n \times n$, so that $R - E$ is the contour operator $P_{n \times n}$ of Definition 11.
(3)
Measure the potential of each sample:
$$W = \lvert D Y \rvert \, \operatorname{diag}(S)^{-1},$$
where $\lvert D Y \rvert$ represents taking the absolute value of each element of the matrix $DY$, and $D$ is the pairwise difference matrix with $\tfrac{m(m-1)}{2}$ rows and $m$ columns, whose rows enumerate all sample pairs $(i, j)$ with $i < j$, with $-1$ in column $i$ and $+1$ in column $j$ (arranged in successive blocks of $m-1, m-2, \ldots, 2, 1$ rows):
$$D = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ -1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & 0 & \cdots & 1 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -1 & 1 \end{pmatrix}_{\frac{m(m-1)}{2} \times m},$$
and $\operatorname{diag}(S)$ represents the expansion of the scale vector $S$ into a diagonal matrix.
(4)
Calculate the pose metric for each sample:
$$H = \lvert D Z \rvert \, \operatorname{diag}\!\big( S (R + E) \big)^{-1}.$$
(5)
Calculate the factor contour distance matrix for the samples:
$$P = \operatorname{sum}\big( (W \circ H)^{T} \big) / n,$$
where $W \circ H$ represents the Hadamard product of the matrices $W$ and $H$, and $\operatorname{sum}\big((W \circ H)^{T}\big)$ stands for summing the columns of the matrix $(W \circ H)^{T}$.
(6)
Calculate the contour similarity matrix between individual samples:
$$M = \operatorname{sum}\big( ((1 - W) \circ (1 - H))^{T} \big) / n.$$
The concept of contour similarity was first proposed in [30], where a simple calculation case is given; a specific application is presented in [31]. The contour similarity calculation steps proposed here are an improvement and enhancement that is more convenient to compute.
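To make steps (1)–(6) concrete, the following sketch (ours, based on the formulas as reconstructed above, not the authors' reference code) computes the contour similarity of a single pair of lattice-transformed samples; the full matrix $M$ is obtained by applying it to every pair.

```python
import numpy as np

def contour_similarity(y_i: np.ndarray, y_j: np.ndarray, S: np.ndarray) -> float:
    """Contour similarity of two lattice-transformed samples y_i and y_j,
    given the scale vector S from step (1); follows steps (2)-(6) above."""
    n = len(S)
    R = np.roll(np.eye(n), 1, axis=0)              # cyclic shift matrix R
    P = R - np.eye(n)                              # contour operator R - E
    z_i, z_j = y_i @ P, y_j @ P                    # step (2): contour data
    w = np.abs(y_i - y_j) / S                      # step (3): scaled potential difference
    h = np.abs(z_i - z_j) / (S @ (R + np.eye(n)))  # step (4): scaled pose difference
    # step (5) would give the contour distance: np.sum(w * h) / n
    return float(np.sum((1.0 - w) * (1.0 - h)) / n)   # step (6): contour similarity

y1 = np.array([8.0, 13.0, 6.0])     # lattice-transformed samples (illustrative)
y2 = np.array([6.0, 11.0, 7.0])
S = np.array([25.0, 20.0, 15.0])    # scale vector (illustrative)
print(contour_similarity(y1, y2, S))   # approx. 0.87
```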
In particular, for predetermining the number of clusters, the following strategy is adopted to build a sample similarity neighbor vector: find the two most similar samples a and b using contour similarity and record their contour similarity as the initial value of the similarity neighbor vector; take one of the sample points (say b) as the reference point, find the most similar sample point c among the remaining samples using contour similarity, and append the contour similarity of these two to the similarity neighbor vector; then update the reference point to c and repeat the process until all samples have been traversed. The number of clusters is found by applying Fisher optimal partitioning to the sample similarity neighbor vector.
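A sketch of this neighbor-vector construction, assuming a precomputed contour similarity matrix M (symmetric, with ones on the diagonal); the Fisher optimal partitioning routine is only referenced as a hypothetical helper.

```python
import numpy as np

def similarity_neighbor_vector(M: np.ndarray) -> np.ndarray:
    """Greedy chain of nearest-neighbor contour similarities used to
    predetermine the number of clusters. M is a symmetric m-by-m contour
    similarity matrix with ones on the diagonal."""
    m = M.shape[0]
    sim = M - np.eye(m)                     # ignore self-similarity
    a, b = np.unravel_index(np.argmax(sim), sim.shape)   # most similar pair
    visited = {int(a), int(b)}
    chain = [sim[a, b]]
    b = int(b)
    while len(visited) < m:
        row = sim[b].copy()
        row[list(visited)] = -np.inf        # restrict to unvisited samples
        c = int(np.argmax(row))             # most similar remaining sample to b
        chain.append(sim[b, c])
        visited.add(c)
        b = c                               # move the reference point forward
    return np.array(chain)

# The number of clusters k would then come from Fisher optimal partitioning
# of this ordered vector, e.g. (hypothetical helper, not shown):
# k = fisher_optimal_partition(similarity_neighbor_vector(M))
```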

4.3. Design of k-Means Algorithm Based on Contour Similarity

The classical k-means algorithm uses Euclidean distance to determine the similarity between samples and assumes that the samples need to be divided into k categories; its basic algorithmic flow is as follows:
(1)
Randomly select k samples in the sample data set as the initial center of mass;
(2)
The distance from each sample to these k centers of mass is calculated using the Euclidean distance;
(3)
Divide each sample into the nearest center of mass to form a collection of classes;
(4)
Update the center of mass of the k classes;
(5)
Repeat steps (2)–(4) until the maximum number of iterations is reached or the centers of mass no longer change between consecutive iterations;
(6)
Output the k classes of the division.
The idea of the algorithmic improvement in this paper lies in using contour similarity instead of Euclidean distance to judge the similarity between samples. We call this improved algorithm CSk-means. It should be noted that the contour similarity between samples must be calculated on the matrix $Y = (y_{ij})_{m \times n}$ obtained from the original data matrix by the lattice transformation. The pseudo-code of CSk-means is summarized as Algorithm 1.
Algorithm 1: The CSk-means.
Input: The lattice transformed matrix $Y = (y_1, y_2, \ldots, y_m)^{T}$ (where $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $i = 1, 2, \ldots, m$, represents a vector of sample data), the value of $k$, and the maximum number of iterations.
Output: The k classes that have been classified.
1. Select k sample points as the initial clustering centers.
2.  repeat
3.  for $i = 1, 2, \ldots, m$
4.    Calculate the contour similarity from each sample point $y_i$ to each class center.
5.  Class labeling is decided based on maximum contour similarity.
6.  Divide the sample points into the appropriate clusters.
7.  end for
8. Updating the clustering center.
9. until the maximum number of iterations is reached or the clustering center no longer changes.
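A compact Python sketch of Algorithm 1 (our illustration, not the authors' MATLAB implementation); contour_similarity is the pairwise routine sketched in Section 4.2, S is the scale vector, and the initial centers are assumed to come from the isometric interpolation strategy described next.

```python
import numpy as np

def csk_means(Y: np.ndarray, centers: np.ndarray, S: np.ndarray,
              max_iter: int = 100) -> np.ndarray:
    """CSk-means (Algorithm 1): assign each sample to the center with the
    maximum contour similarity, then update the centers, until convergence.
    contour_similarity(...) is the pairwise routine sketched in Section 4.2."""
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(max_iter):
        # assignment step: maximum contour similarity instead of minimum distance
        sims = np.array([[contour_similarity(y, c, S) for c in centers] for y in Y])
        labels = sims.argmax(axis=1)
        # update step: mean of each cluster (keep the old center if a cluster is empty)
        new_centers = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```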
Algorithm 1 is also affected by the initial clustering centers; for this reason, the initial clustering center selection in this paper takes the following strategy:
Use contour similarity to find the two least similar sample points and take them as reference points. If the number of clusters needed is two, use these two points as the initial set of clustering centers; if the number of clusters needed is three, use contour similarity to find the sample point that is similar to both of these two reference points and insert it into the initial set of clustering centers; if the number of clusters needed is four, take the three sample points found so far as reference points, use contour similarity to find the sample point that is similar to all three of them, and insert it into the initial set of clustering centers. Repeat this process until the initial set of clustering centers reaches the required number of clusters. We call this process isometric interpolation.
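A sketch of isometric interpolation on the precomputed contour similarity matrix M; interpreting "similar to all chosen centers" as a maximin choice (the candidate whose smallest similarity to the current centers is largest) is our reading, not a detail given in the paper.

```python
import numpy as np

def isometric_interpolation(M: np.ndarray, k: int) -> list:
    """Select k initial cluster centers (as sample indices) from the contour
    similarity matrix M: start with the two least similar samples, then
    repeatedly insert the sample most evenly similar to all chosen centers."""
    m = M.shape[0]
    off_diag = M + np.eye(m)                 # push the diagonal out of the argmin
    a, b = np.unravel_index(np.argmin(off_diag), off_diag.shape)
    centers = [int(a), int(b)]
    while len(centers) < k:
        candidates = [i for i in range(m) if i not in centers]
        scores = [M[i, centers].min() for i in candidates]   # worst-case similarity
        centers.append(candidates[int(np.argmax(scores))])   # maximin insertion
    return centers

# Usage with the CSk-means sketch above (hypothetical):
# centers_idx = isometric_interpolation(M, k)
# labels = csk_means(Y, Y[centers_idx], S)
```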

5. Experimental Validation of the Csk-means Algorithm

To evaluate the performance of the proposed CSk-means algorithm, this study downloaded six datasets used as testing data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/, accessed on 21 April 2024). The experimental environment is a 12th Gen Intel(R) Core(TM) i5-12500 3.00 GHz processor, an NVIDIA GeForce RTX 3080 graphics card, 16.0 GB of memory, the Windows 11 operating system, and MATLAB 2023b (64-bit), on which the whole experiment was conducted. The comparison algorithms are the traditional k-means algorithm, a hierarchical clustering model, spectral clustering, and k-means++. The CSk-means algorithm to be validated was implemented as custom code.
Table 1 describes the basic statistical characterization information of the above six datasets. In this paper, these datasets are used for validating the performance of the proposed algorithm, and in this process, there will be no division between the training set and the testing set. The main reasons for the selection of the reported datasets were: (1) the ability to characterize the validity and reliability of the algorithm using validated metrics; and (2) the need for the improved clustering algorithm to focus on real clustering scenarios, taking into account both varying degrees of complexity as well as varying degrees of distribution, which is reflected in the categorical and sample sizes of the dataset as well as in the characteristics of the data distribution.
The metrics for evaluating the performance of the model were chosen as Precision, Recall, Accuracy, Running Time, Number of Iterations, and Normalized Mutual Information (NMI).
Table 2 reports the running time, the number of iterations, and the accuracy of the clustering results of the Csk-means algorithm and the classical k-means algorithm on the six datasets. The results show that the Csk-means algorithm takes more time than the k-means algorithm for the clustering process, since the lattice transformation of the Csk-means algorithm takes additional time. However, its number of iterations and clustering results are better than those of the classical k-means algorithm.
Table 3 reports the F-measure, Precision, and Recall values for the clustering results of the CSk-means algorithm and the classical k-means algorithm on the six datasets. It can be seen that the CSk-means algorithm is more stable than the classical k-means algorithm, and its overall clustering performance is better. It is noteworthy that accuracy is lower on the Yeast and Abalone datasets; analyzing the reasons for this in depth, these two datasets contain a large number of classes, the degree of differentiation between samples of different classes is not clear enough, and the data distributions of the classes are strongly consistent. The methods used in this analysis include, but are not limited to, principal component analysis, inspection of the data distribution, and comparison with other clustering models.
Figure 3 and Table 4 show the Accuracy, F-measure, Precision, Recall, and NMI values of the clustering results of the Csk-means algorithm, hierarchical clustering, spectral clustering, and k-means++ on the six datasets. In comparison, the Csk-means algorithm shows better stability and more accurate clustering results.

6. Discussion

To systematically solve the four problems involved in the traditional k-means algorithm proposed in the recent research literature, we propose the Csk-means algorithm. Experimental results demonstrate the effectiveness and potential of the model. Based on the process and related techniques involved in the proposed Csk-means algorithm, in this section, we discuss the corresponding key findings, significance, and limitations of each of the techniques involved in the Csk-means algorithm in the order of the process.
To eliminate the negative effects of outliers and missing values on the clustering results, we redefine the preprocessing of the data through the lattice transformation. The lattice transformation essentially constructs a new system of lattice coordinates, transforming the original continuous data into discrete lattice coordinates and mapping missing values to the origin of the lattice coordinate system; it can be interpreted as an integer or discrete transform. The lattice transformation may provide a new strategy and method for discretizing data and handling missing values and outliers. Its advantage is that it can reduce, to a certain extent, the adverse effects of outliers and missing values on the model results, which is supported by some of the experimental results of this study. However, a limitation that cannot be ignored is that the lattice transformation is a homomorphic shift transformation, so the adverse effect of outliers still exists to a certain extent. In addition, the degree of discretization of the lattice transformation is limited, which may lead to poor generalization of the mined knowledge if it is applied in the data preprocessing of rule learning.
Following the idea of hierarchical clustering, for predetermining the number of clusters our strategy is to use contour similarity to build a vector of contour similarities between samples and then apply Fisher optimal segmentation to predetermine the number of clusters. From the results, the number of clusters predetermined by this strategy is consistent with the actual number. This strategy also provides a new idea and method for predetermining the number of classes in clustering applications. However, its limitation is obvious: predetermining the number of classes on a large sample dataset takes more time, which is not conducive to deploying the algorithm online.
For the selection of initial clustering centers, we adopt the method of isometric interpolation, which is a new method for determining the initial clustering centers. Isometric interpolation embodies the idea of dissimilarity between classes and is essentially a balanced iterative search. It can prevent the clustering model from falling into the trap of local optimality and also provides a new clustering strategy, which will be the focus of our next work. However, its limitation is obvious: in determining the initial clustering centers, each selected center needs to be evaluated for its balance with the centers of the other classes. The balancing strategy deeply affects the selection of clustering centers, which in turn leads to fluctuations in the clustering results. Although the clustering centers can be determined by a trial-and-error strategy, the time cost will increase.
We improved the traditional k-means algorithm by using contour similarity instead of Euclidean distance to enhance its ability to recognize differently shaped clusters. From the experimental results, this improvement is effective. Contour similarity is a new measure of similarity between samples and is essentially a dimension-raising transformation. Compared with other similarity measures, contour similarity does not suffer the information loss caused by dimensionality reduction, which would make the description of similarity inaccurate. We believe that contour similarity enriches the methodological theory of similarity measures. However, it has limitations: contour similarity requires at least three samples, and its calculation process is relatively complicated, which increases the time cost, as can be seen from the results of the designed comparison experiments.
Although the improved Csk-means algorithm improves the accuracy of the clustering results to a certain degree, we must also recognize its limitations, especially the increase in time cost during the clustering process. The lattice transformation of the data, the selection of the initial clustering centers, and the computation of the similarity between samples all increase the time cost of the Csk-means algorithm. If the calculation of contour similarity between samples is further optimized, and the effect of outliers on the clustering results can be ignored, then the lattice transformation of the data can be removed from the Csk-means algorithm, which would reduce its time cost.
In addition, the choice of comparison algorithms in our designed comparison experiments contains the traditional k-means algorithm, hierarchical clustering, and spectral clustering, which aims at verifying the feasibility of the proposed method. However, the comparison with other clustering models, such as SOM, was neglected. Therefore, in future studies, we will include these models to provide a more comprehensive analysis of the practical applications of clustering models.

7. Conclusions

In the clustering process using the traditional k-means algorithm, in order to predetermine the number of clusters, weaken the adverse effects of outliers and noise on the clustering results, avoid falling into a local optimum due to the random selection of the initial centers, and improve the robustness of the algorithm in detecting the shape of the clusters, this paper proposes the Csk-means algorithm based on the k-means algorithm. The experimental results show that predetermining the number of clusters is still a thorny problem and that the adverse effect of the selection of the initial centers on the clustering results still exists; however, the adverse effects of outliers and noisy data on the clustering results have been weakened to a certain extent, and the Csk-means algorithm effectively improves the accuracy of the clustering results.

Author Contributions

J.Z.: Writing—original draft, Writing—review and editing, Methodology. Y.B.: Editing, Methodology; D.L. and X.G.: Investigation. J.Z., Y.B., D.L. and X.G. report financial support was provided by the Department of Education of Guizhou Province and the Department of Science and Technology of Guizhou. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Project for Growing Youth Talents of the Department of education of Guizhou Province (Qianjiaoji [2022] No.378,377,380,386; Qianjiaoji [2024] No.234,236,238); the Foundation Project for Talents of Qiannan Science and Technology Cooperation Platform Supported by the Department of Science and Technology, Guizhou ([2019]QNSYXM-05); the Guizhou Provincial Department of Education 2024 Humanities and Social Sciences Research Program for Colleges and Universities (2024RW101).

Data Availability Statement

These data were derived from the following resources available in the public domain: https://archive.ics.uci.edu/, accessed on 21 April 2024.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Pérez-Ortega, J.; Almanza-Ortega, N.N.; Vega-Villalobos, A.; Pazos-Rangel, R.; Zavala-Díaz, C.; Martínez-Rebollar, A. The K-means algorithm evolution. Introd. Data Sci. Mach. Learn. 2019, 69–90. [Google Scholar] [CrossRef]
  2. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  3. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Volume 1, No. 14; pp. 281–297. [Google Scholar]
  4. Jancey, R.C. Multidimensional group analysis. Aust. J. Bot. 1966, 14, 127–130. [Google Scholar] [CrossRef]
  5. Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801. [Google Scholar]
  6. Kapoor, A.; Singhal, A. A comparative study of K-Means, K-Means++ and Fuzzy C-Means clustering algorithms. In Proceedings of the 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 9–10 February 2017; pp. 1–6. [Google Scholar] [CrossRef]
  7. Ezugwu, A.E.-S.; Agbaje, M.B.; Aljojo, N.; Els, R.; Chiroma, H.; Elaziz, M.A. A Comparative Performance Study of Hybrid Firefly Algorithms for Automatic Data Clustering. IEEE Access 2020, 8, 121089–121118. [Google Scholar] [CrossRef]
  8. Annas, M.; Wahab, S.N. Data Mining Methods: K-Means Clustering Algorithms. Int. J. Cyber IT Serv. Manag. 2023, 3, 40–47. [Google Scholar] [CrossRef]
  9. Hu, H.; Liu, J.; Zhang, X.; Fang, M. An Effective and Adaptable K-means Algorithm for Big Data Cluster Analysis. Pattern Recognit. 2023, 139, 109404. [Google Scholar] [CrossRef]
  10. Mussabayev, R.; Mladenovic, N.; Jarboui, B.; Mussabayev, R. How to use K-means for big data clustering? Pattern Recognit. 2023, 137, 109269. [Google Scholar] [CrossRef]
  11. Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 3rd ed.; Academic Press: Cambridge, MA, USA, 2006. [Google Scholar]
  12. Guedes, P.C.; Müller, F.M.; Righi, M.B. Risk measures-based cluster methods for finance. Risk Manag. 2023, 25, 4. [Google Scholar] [CrossRef]
  13. Yudhistira, A.; Andika, R. Pengelompokan Data Nilai Siswa Menggunakan Metode K-Means Clustering. J. Artif. Intell. Technol. Inf. 2023, 1, 20–28. [Google Scholar] [CrossRef]
  14. Navarro, M.M.; Young, M.N.; Prasetyo, Y.T.; Taylar, J.V. Stock market optimization amidst the COVID-19 pandemic: Technical analysis, K-means algorithm, and mean-variance model (TAKMV) approach. Heliyon 2023, 9, 2–3. [Google Scholar] [CrossRef]
  15. Foster, J.; Gray, R.; Dunham, M. Finite-state vector quantization for waveform coding. IEEE Trans. Inf. Theory 1985, 31, 348–359. [Google Scholar] [CrossRef]
  16. Liaw, Y.-C.; Lo, W.; Lai, J.Z. Image restoration of compressed image using classified vector quantization. Pattern Recognit. 2002, 35, 329–340. [Google Scholar] [CrossRef]
  17. Zhu, A.; Hua, Z.; Shi, Y.; Tang, Y.; Miao, L. An improved K-means algorithm based on evidence distance. Entropy 2021, 23, 1550. [Google Scholar] [CrossRef] [PubMed]
  18. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210. [Google Scholar] [CrossRef]
  19. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  20. Li, Y.; Wu, H. A clustering method based on K-means algorithm. Phys. Procedia 2012, 25, 1104–1109. [Google Scholar] [CrossRef]
  21. Singh, A.; Yadav, A.; Rana, A. K-means with Three different Distance Metrics. Int. J. Comput. Appl. 2013, 67, 1–3. [Google Scholar] [CrossRef]
  22. Chakraborty, A.; Faujdar, N.; Punhani, A.; Saraswat, S. Comparative Study of K-Means Clustering Using Iris Data Set for Various Distances. In Proceedings of the 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 29–31 January 2020. [Google Scholar]
  23. Sebayang, F.A.; Lydia, M.S.; Nasution, B.B. Optimization on Purity K-Means Using Variant Distance Measure. In Proceedings of the 2020 3rd International Conference on Mechanical, Electronics, Computer, and Industrial Technology (MECnIT), Medan, Indonesia, 25–27 June 2020; pp. 143–147. [Google Scholar]
  24. Tang, Z.K.; Zhu, Z.Y.; Yang, Y.; Caihong, L.; Lian, L. DK-means algorithm based on distance and density. Appl. Res. Comput. 2020, 37, 1719–1723. [Google Scholar]
  25. Wang, Z.L.; Li, J.; Song, Y.F. Improved K-means algorithm based on distance and weight. Comput. Eng. Appl. 2020, 56, 87–94. [Google Scholar]
  26. Wang, Y.; Luo, X.; Zhang, J.; Zhao, Z.; Zhang, J. An Improved Algorithm of K-means Based on Evolutionary Computation. Intell. Autom. Soft Comput. 2020, 26, 961–971. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Zhang, D.; Shi, H. K-means clustering based on self-adaptive weight. In Proceedings of the 2012 2nd International Conference on Computer Science and Network Technology, Changchun, China, 29–31 December 2012; IEEE: Piscataway, NJ, USA, 2012. [Google Scholar]
  28. Chen, A.; Yang, Y. Diffusion K-means clustering on manifolds: Provable exact recovery via semidefinite relaxations. Appl. Comput. Harmon. Anal. 2021, 52, 303–347. [Google Scholar] [CrossRef]
  29. Dinh, D.-T.; Huynh, V.-N.; Sriboonchitta, S. Clustering mixed numerical and categorical data with missing values. Inf. Sci. 2021, 571, 418–442. [Google Scholar] [CrossRef]
  30. Bao, Y. Contour similarity and metric of samples of finite dimensional state vector. J. Liaoning Tech. Univ. 2011, 30, 603–660. [Google Scholar]
  31. Zhao, F.; Sun, M.; Bao, Y. Similarity Measure of Geometric Contours about Multi-Sale Data and Its Application. Math. Pract. Theory 2013, 43, 178–182. [Google Scholar]
  32. Fisher, R.A. Iris. UCI Machine Learning Repository. 1988. Available online: https://archive.ics.uci.edu/dataset/53/iris (accessed on 21 April 2024).
  33. Aeberhard, S.; Forina, M. Wine. UCI Machine Learning Repository. 1991. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 21 April 2024).
  34. Nakai, K. Ecoli. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/39/ecoli (accessed on 21 April 2024).
  35. Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.; Lukasik, S. Seeds. UCI Machine Learning Repository. 2012. Available online: https://archive.ics.uci.edu/dataset/236/seeds (accessed on 21 April 2024).
  36. Nakai, K. Yeast. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/110/yeast (accessed on 21 April 2024).
  37. Nash, W.; Sellers, T.; Talbot, S.; Cawthorn, A.; Ford, W. Abalone. UCI Machine Learning Repository. 1995. Available online: https://archive.ics.uci.edu/dataset/1/abalone (accessed on 21 April 2024).
Figure 1. The Csk-means algorithm design process.
Figure 2. Basic data format image presentation. The dark red dashed box represents the number of the sample data. The red rounded box represents the jth conditional factor described in the previous section. The red box represents the phase distribution of the ith sample under each factor; and the blue dashed box represents the original data matrix consisting of all the samples.
Figure 3. Comparison of NMI of the Csk-means algorithm with other clustering algorithms on 6 datasets.
Table 1. The information of the six datasets downloaded from the UCI Machine Learning Repository.
Datasets | No. of Samples | No. of Features | No. of Clusters
Iris [32] | 150 | 4 | 3
Wine [33] | 178 | 13 | 3
Ecoli [34] | 336 | 8 | 8
Seeds [35] | 420 | 7 | 3
Yeast [36] | 1484 | 8 | 10
Abalone [37] | 4177 | 8 | 28
Table 2. The number of iterations, the running time (s), and the accuracy for classifying the six datasets.
Datasets | Time Spent (CSk-means) | Time Spent (k-means) | Iterations (CSk-means) | Iterations (k-means) | Accuracy (CSk-means) | Accuracy (k-means)
Iris | 0.0076 | 0.0064 | 3 | 4 | 0.92 | 0.89
Wine | 0.0754 | 0.0182 | 7 | 5 | 0.94 | 0.70
Ecoli | 0.0822 | 0.0418 | 10 | 23 | 0.76 | 0.58
Seeds | 0.4331 | 0.2781 | 6 | 7 | 0.90 | 0.89
Yeast | 0.1946 | 0.0358 | 10 | 40 | 0.42 | 0.40
Abalone | 0.2286 | 0.2092 | 30 | 54 | 0.18 | 0.22
Table 3. The F-measure (F), Precision (P), and Recall (R) for classifying the six datasets.
Datasets | F (CSk-means) | F (k-means) | P (CSk-means) | P (k-means) | R (CSk-means) | R (k-means)
Iris | 0.94 | 0.92 | 0.94 | 0.88 | 0.94 | 0.98
Wine | 0.94 | 0.78 | 0.95 | 0.83 | 0.97 | 0.74
Ecoli | 0.76 | 0.57 | 0.76 | 0.90 | 0.76 | 0.42
Seeds | 0.91 | 0.91 | 0.92 | 0.91 | 0.91 | 0.92
Yeast | 0.40 | 0.35 | 0.35 | 0.34 | 0.50 | 0.36
Abalone | 0.30 | 0.24 | 0.33 | 0.29 | 0.28 | 0.21
Table 4. Evaluation results of the Csk-means algorithm, hierarchical clustering, spectral clustering, and k-means++ on the six datasets.
Datasets | Csk-Means (Accuracy, F, P, R) | Hierarchical Clustering (Accuracy, F, P, R) | Spectral Clustering (Accuracy, F, P, R) | k-Means++ (Accuracy, F, P, R)
Iris | 0.92, 0.94, 0.94, 0.94 | 0.35, 0.51, 0.34, 1.00 | 0.91, 0.94, 0.88, 1.00 | 0.91, 0.92, 0.88, 0.98
Wine | 0.94, 0.96, 0.95, 0.97 | 0.43, 0.60, 0.61, 0.58 | 0.32, 0.49, 0.33, 0.93 | 0.74, 0.80, 0.86, 0.76
Ecoli | 0.76, 0.76, 0.76, 0.76 | 0.45, 0.61, 0.44, 0.98 | 0.56, 0.55, 0.85, 0.41 | 0.60, 0.57, 0.91, 0.43
Seeds | 0.90, 0.91, 0.92, 0.91 | 0.92, 0.94, 0.93, 0.95 | 0.90, 0.92, 0.91, 0.94 | 0.91, 0.93, 0.93, 0.94
Yeast | 0.42, 0.40, 0.35, 0.50 | 0.32, 0.48, 0.31, 0.97 | 0.29, 0.01, 0.05, 0.00 | 0.41, 0.36, 0.35, 0.36
Abalone | 0.18, 0.30, 0.33, 0.28 | 0.22, 0.33, 0.29, 0.36 | 0.13, 0.22, 0.31, 0.17 | 0.23, 0.25, 0.30, 0.21
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, J.; Bao, Y.; Li, D.; Guan, X. An Improved K-Means Algorithm Based on Contour Similarity. Mathematics 2024, 12, 2211. https://doi.org/10.3390/math12142211

AMA Style

Zhao J, Bao Y, Li D, Guan X. An Improved K-Means Algorithm Based on Contour Similarity. Mathematics. 2024; 12(14):2211. https://doi.org/10.3390/math12142211

Chicago/Turabian Style

Zhao, Jing, Yanke Bao, Dongsheng Li, and Xinguo Guan. 2024. "An Improved K-Means Algorithm Based on Contour Similarity" Mathematics 12, no. 14: 2211. https://doi.org/10.3390/math12142211
