1. Introduction
The c-means (CM) algorithm proposed by MacQueen [1] is the most commonly used clustering algorithm across various research fields, but it cannot accurately partition data objects whose membership to any specific cluster is uncertain [2]. As a general extension of CM, fuzzy clustering has been proposed to address this problem [3]. Most fuzzy clustering algorithms originate from Bezdek's fuzzy c-means (FCM) algorithm [4], which has been successfully applied in numerous applications, including image segmentation [5], feature extraction [6], and pattern recognition [7,8].
However, the membership degrees computed by FCM are relative numbers: for a given data object, its membership degrees to all clusters in the fuzzy partition matrix must sum to 1 to avoid a trivial solution. This sum-to-one constraint makes FCM sensitive to noise and unsuitable for applications in which membership degrees should express the typicality or compatibility of data points with clusters under flexible constraints [9,10]. Consequently, the clustering results of FCM are often inaccurate. To address these problems, various variants of FCM have been developed. Krishnapuram et al. [11] proposed a possibilistic c-means method that abandons the relative-number constraint of FCM and thus captures the typicality or compatibility of data objects with clusters. Pedrycz et al. [12] assigned an importance degree to individual data points, thereby significantly mitigating the influence of relative numbers in FCM. More recently, numerous FCM-type algorithms have been proposed. Yu [13] proposed a general c-means algorithm by extending the definition of means from statistical analysis; this algorithm generalizes most variants of the fuzzy clustering method into a common model. Huang et al. [14] proposed a feature-weighted c-means algorithm to address the difficulty that FCM faces in detecting clusters distributed across various subspaces.
Despite this progress, these algorithms mainly focus on finding clusters according to the inherent distribution of, and distances among, data objects [15,16]; they cannot exploit constraint information attached to each data object and cluster. In particular, in many clustering applications, the number of data points in a specific cluster is a mandatory constraint that the clustering results must obey; if these constraints cannot be met, the clustering results may be unacceptable [17,18]. To address this problem, Ng et al. [19] proposed a constrained c-means (CCM) algorithm that assigns a fixed number of data points to each cluster, which means that each column in the partition matrix is constrained by an equation. Nevertheless, at least three problems remain unsolved in CCM. First, CCM must be solved as a transportation problem in practice, which is not feasible for large-scale datasets due to its high computational complexity. Second, CCM is a CM-type clustering algorithm rather than a fuzzy clustering algorithm, so the issues that FCM can handle are not addressed by CCM. Finally, CCM cannot combine the importance degrees of both clusters and data objects [20,21]. More recently, efforts have been made to determine the number of clusters during the fuzzy clustering process. There are two approaches to determining the optimal number of clusters. One uses one or several clustering indices to select the optimal number through trial and error over all possible numbers of clusters [22,23]. The other attempts to determine the number of clusters within the iterative process of fuzzy clustering, such as a Bayesian probabilistic model and inference algorithm for fuzzy clustering [24], which provides expanded capabilities compared with traditional FCM. However, these algorithms fail to address the problem of typicality or compatibility inherent in FCM.
In this study, to address the typicality or compatibility of both data points and clusters, we incorporate dual constraints on each data point and each cluster. The objective function of FCM is reformulated accordingly, and its update equations are derived mathematically. The proposed method is as easy to operate as FCM and requires only minimal additional parameters. We discuss the theoretical framework of the algorithm and analyze the clustering effectiveness of representative patterns in each cluster. In the proposed method, clustering quality is enhanced by satisfying the mandatory constraints on data objects and clusters. Our experimental results validate the effectiveness of the proposed algorithm and demonstrate its applicability and limitations.
2. Related Work
Let X = {xi | i = 1, 2, …, n} be a dataset of n data objects distributed in c clusters, where xi ∈ Rd is a point in a d-dimensional data space. Four typical fuzzy clustering algorithms are reviewed as follows.
- (1) Bezdek's FCM: The objective function of FCM can be stated as
$$J_m(U,V)=\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{m}\left\|x_j-v_i\right\|^{2},\quad \text{s.t.}\quad \sum_{i=1}^{c}u_{ij}=1,\ \ j=1,2,\dots,n, \tag{1}$$
where vi is the prototype (center) of the ith cluster, uij is the membership degree of the jth point to the ith cluster, and m is a fuzziness exponent ranging in the interval [1, 3]. By the Lagrange multiplier optimization method [25], the optimal membership and prototype functions of (1) are
$$u_{ij}=\left[\sum_{k=1}^{c}\left(\frac{\left\|x_j-v_i\right\|}{\left\|x_j-v_k\right\|}\right)^{2/(m-1)}\right]^{-1},\qquad v_i=\frac{\sum_{j=1}^{n}u_{ij}^{m}x_j}{\sum_{j=1}^{n}u_{ij}^{m}}. \tag{2}$$
All fuzzy membership degrees together form the c × n fuzzy partition matrix U = [uij]; a minimal numerical sketch of these update rules is given after this list.
FCM has frequently been criticized as it cannot show the typicality (importance) or compatibility of points with clusters [6,12], and thus the following algorithm was proposed.
- (2) Pedrycz's conditional FCM (CFCM): Let wj be the importance degree of the jth point. Equation (1) in CFCM turns into
$$J_m(U,V)=\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{m}\left\|x_j-v_i\right\|^{2},\quad \text{s.t.}\quad \sum_{i=1}^{c}u_{ij}=w_j,\ \ j=1,2,\dots,n. \tag{3}$$
The membership degree of the jth point to the ith cluster in CFCM is derived as
$$u_{ij}=\frac{w_j}{\sum_{k=1}^{c}\left(\dfrac{\left\|x_j-v_i\right\|}{\left\|x_j-v_k\right\|}\right)^{2/(m-1)}}. \tag{4}$$
The equation for the center vi in CFCM is the same as in FCM, i = 1, 2, …, c. CFCM can enhance the typicality of different clusters and increase the accuracy of FCM; however, it focuses only on the typicality of points, not of clusters.
- (3) Krishnapuram et al.'s possibilistic c-means (PCM): Following the objective function of FCM, PCM is formulated as
$$J_m(U,V)=\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{m}\left\|x_j-v_i\right\|^{2}+\sum_{i=1}^{c}\eta_i\sum_{j=1}^{n}\left(1-u_{ij}\right)^{m}. \tag{5}$$
From (5), the optimal membership function is
$$u_{ij}=\frac{1}{1+\left(\dfrac{\left\|x_j-v_i\right\|^{2}}{\eta_i}\right)^{1/(m-1)}}, \tag{6}$$
where ηi is associated with the size of each cluster and is computed as
$$\eta_i=\frac{\sum_{j=1}^{n}u_{ij}^{m}\left\|x_j-v_i\right\|^{2}}{\sum_{j=1}^{n}u_{ij}^{m}}. \tag{7}$$
PCM can effectively stress the typicality of points but depends strongly on an initialization procedure; in practice, FCM can serve this purpose [12].
- (4) Ng et al.'s constrained CM (CCM): Equation (1) in FCM is turned into the crisp, size-constrained problem
$$\min_{U,V}\ \sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}\left\|x_j-v_i\right\|^{2},\quad \text{s.t.}\quad \sum_{j=1}^{n}u_{ij}=p_i,\ \ \sum_{i=1}^{c}u_{ij}=1,\ \ u_{ij}\in\{0,1\}, \tag{8}$$
where pi is the prescribed number of data points in the ith cluster. From (8), the clustering center is
$$v_i=\frac{\sum_{j=1}^{n}u_{ij}x_j}{\sum_{j=1}^{n}u_{ij}}. \tag{9}$$
The optimal membership uij in CCM is obtained by solving a typical transportation problem (see [19]), for which a set of existing algorithms is available; however, the computational complexity of these algorithms is too high for them to be applied to large datasets.
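For readers who want to experiment with the three fuzzy update rules reviewed above, the following is a minimal NumPy sketch of the membership updates in (2), (4), and (6), written against their standard textbook forms; the function names and array layout are ours, and the outer alternating loop and center updates are omitted.

```python
import numpy as np

def _sq_dist(X, V):
    """Squared distances d_ij^2 = ||x_j - v_i||^2, shape (c, n)."""
    return ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + 1e-12

def fcm_memberships(X, V, m=2.0):
    """FCM update (Eq. (2)): memberships of every point to all clusters sum to 1."""
    U = _sq_dist(X, V) ** (-1.0 / (m - 1.0))
    return U / U.sum(axis=0, keepdims=True)

def cfcm_memberships(X, V, w, m=2.0):
    """CFCM update (Eq. (4)): column sums equal the point weights w_j instead of 1."""
    return w[None, :] * fcm_memberships(X, V, m)

def pcm_memberships(X, V, eta, m=2.0):
    """PCM update (Eq. (6)): typicalities controlled by the cluster scales eta_i."""
    D2 = _sq_dist(X, V)
    return 1.0 / (1.0 + (D2 / eta[:, None]) ** (1.0 / (m - 1.0)))
```

For example, `fcm_memberships(X, V)` with an (n, d) data matrix `X` and a (c, d) center matrix `V` returns a (c, n) membership matrix whose columns sum to 1.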
These algorithms have their own applicable ranges and limitations, but none of them can exploit constraint information on both data objects and clusters. Such constraints are not only helpful for boosting clustering quality but are also mandatory requirements in many applications. In this paper, we propose a new method that addresses these problems by deriving the iterative update equations through a rigorous mathematical optimization process.
3. Double-Constraint Fuzzy Clustering
Let X = {xj} be a dataset with n data objects distributed in c clusters in a d-dimensional data space, xj ∈ Rd. According to the fuzzy partition matrix U in FCM, we define two symbols, pi and qj, in (10). The value of pi is the constraint for the ith cluster when that constraint is known a priori, whereas the value of qj is the constraint for the jth data object.
The meaning of qj in DFCM is illustrated as follows:
qj < 1: the jth data object lies in a sparse region and is likely a noisy point or an outlier;
qj = 1: the jth data object has no additional importance degree, and thus the point has the same membership degree as it would in FCM;
qj > 1: the jth data object lies in a high-density aggregation of data, for example near a clustering center; such points form the main structure of the clusters.
In contrast, the meaning of pi, as used in CCM, is the number of data points in the ith cluster, which is usually a mandatory requirement on the clustering results of any clustering algorithm. To date, no existing fuzzy clustering algorithm can combine the constraints on both data objects and clusters to enhance clustering quality.
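The exact form of Equation (10) is not reproduced above, so the following block records the reading assumed in the sketches below: pi and qj are taken to be the row and column sums of the partition matrix, which is consistent with the descriptions that qj equals 1 under the FCM constraint and that pi counts the data points in the ith cluster.

```latex
% Assumed reading of the two constraint symbols (cf. Equation (10)):
\[
  p_i \;=\; \sum_{j=1}^{n} u_{ij}, \quad i = 1,\dots,c,
  \qquad
  q_j \;=\; \sum_{i=1}^{c} u_{ij}, \quad j = 1,\dots,n .
\]
% In FCM the constraint \sum_i u_{ij} = 1 forces q_j = 1 for every point and
% leaves p_i free; DFCM instead prescribes both quantities a priori.
```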
Since qj reflects the importance of each point whereas pi reflects that of each cluster, together they can represent the typicality or compatibility of points and clusters. Based on these constraints on both point and cluster importance degrees, a double-constraint fuzzy clustering algorithm, abbreviated as DFCM, is proposed; its objective function is formulated in (11). Taking n + c Lagrange multipliers λj, j = 1, 2, …, n, and μi, i = 1, 2, …, c, the typical alternating optimization approach [25] is used to solve (11), and the Lagrange function is formulated in (12).
Equation (12) is solved by alternating between the following two optimization subproblems.
Fix uij and solve for the kth cluster center vk: setting the derivative of Fm with respect to vk to zero yields (13).
Fix vk and solve for ukg, the membership of the gth data vector with respect to vk: setting the derivative of Fm with respect to ukg to zero gives (14).
Substituting (14) into the two constraint sets yields the system of equations (15).
Since the number of equations in (15) is (n + c), equal to the total number of unknowns μk and λg, its solution is uniquely determined. However, the power 1/(m − 1) prevents an analytic solution, so we solve (15) with an iterative numerical optimization process. According to the Newton iteration method [25], ukg is iteratively solved by (16), where t is the iteration index. In this way, DFCM is optimized alternately as follows: given the initial (v0, λ0, μ0), (λ1, μ1) is calculated by (16); then u1 is calculated by (14) and v1 is obtained by (13); next, (v1, λ1, μ1) is used to calculate (λ2, μ2), and this process is repeated until a stop criterion is met.
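To make the derivation above easier to follow, the next block sketches a plausible form of the Lagrangian in (12) and of the stationarity condition behind (14)–(16), under the assumed reading of pi and qj given earlier; the exponent 1/(m − 1) mentioned above arises naturally here, but the paper's exact equations may differ in detail.

```latex
% Assumed Lagrangian with the n + c multipliers \lambda_g (points) and \mu_k (clusters):
\[
  F_m \;=\; \sum_{k=1}^{c}\sum_{g=1}^{n} u_{kg}^{\,m}\,\lVert x_g - v_k\rVert^2
        \;+\; \sum_{g=1}^{n}\lambda_g\Bigl(\sum_{k=1}^{c}u_{kg}-q_g\Bigr)
        \;+\; \sum_{k=1}^{c}\mu_k\Bigl(\sum_{g=1}^{n}u_{kg}-p_k\Bigr).
\]
% Setting \partial F_m / \partial u_{kg} = 0 gives
% m\,u_{kg}^{m-1}\lVert x_g - v_k\rVert^2 + \lambda_g + \mu_k = 0, i.e.
\[
  u_{kg} \;=\; \left(\frac{-(\lambda_g+\mu_k)}{m\,\lVert x_g - v_k\rVert^{2}}\right)^{1/(m-1)},
\]
% and substituting this expression into the two constraint sets yields the
% (n + c)-equation system in (\lambda, \mu) that is solved numerically,
% e.g., by Newton iteration as in (16).
```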
In particular, when m = 2, (15) reduces to the simpler form (17). According to (17), DFCM can be iteratively solved as follows: given the initial v0, λ0 and μ0 are solved; then v1 is determined by (13), followed by the computation of λ1 and μ1, and this sequence continues iteratively until a stop criterion is met. According to the principle of alternating optimization, the convergence of this process is guaranteed.
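The following is a minimal numerical sketch of one DFCM iteration for the m = 2 case, offered as an illustration rather than a reference implementation: it assumes the stationarity condition u_kg = −(λg + μk)/(2‖xg − vk‖²) from the Lagrangian sketched above, under which the two constraint sets become a linear system in (λ, μ); the helper name `dfcm_step_m2` and the array layout are ours.

```python
import numpy as np

def dfcm_step_m2(X, V, p, q):
    """One alternating step of the assumed m = 2 case (hypothetical helper).

    X : (n, d) data matrix; V : (c, d) current centers;
    p : (c,) per-cluster constraints; q : (n,) per-point constraints.
    Returns the updated membership matrix U (c, n) and centers V (c, d).
    """
    n, c = X.shape[0], V.shape[0]
    # Squared distances d_kg^2 between every center k and point g (the small
    # epsilon avoids division by zero when a point coincides with a center).
    D2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + 1e-12
    W = 1.0 / (2.0 * D2)                       # W_kg = 1 / (2 d_kg^2)

    # Assumed stationarity condition: u_kg = -(lambda_g + mu_k) * W_kg.
    # Substituting it into sum_k u_kg = q_g and sum_g u_kg = p_k gives a
    # linear system in z = [lambda (n); mu (c)].
    A = np.zeros((n + c, n + c))
    b = np.concatenate([q, p])
    A[:n, :n] = -np.diag(W.sum(axis=0))        # point equations, lambda block
    A[:n, n:] = -W.T                           # point equations, mu block
    A[n:, :n] = -W                             # cluster equations, lambda block
    A[n:, n:] = -np.diag(W.sum(axis=1))        # cluster equations, mu block

    # The shift lambda -> lambda + t, mu -> mu - t leaves U unchanged, so the
    # system is rank-deficient; a least-squares solve handles this gracefully.
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    lam, mu = z[:n], z[n:]

    U = -(lam[None, :] + mu[:, None]) * W      # membership update (Eq. (17) analogue)
    Um = U ** 2                                # m = 2 weighting for the center update
    V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V_new
```

Note that nothing in this sketch forces the resulting memberships into [0, 1]; handling infeasible constraint combinations is left to the full algorithm.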
In practice, the weighting value qj of each point can be evaluated from the importance degree of that point. The weighting value pi, however, cannot be directly associated with any particular cluster, since the clusters are unknown before the clustering process is completed.
To address this problem, we implement DFCM in two steps: coarse partitioning and fine partitioning. In the first step, FCM is used to obtain c clusters C1, C2, …, Cc, ordered so that |C1| < |C2| < … < |Cc|. In the second step, the prescribed cluster sizes p1, p2, …, pc, ordered so that p1 < p2 < … < pc, are attached as constraints to the corresponding clusters. Starting from the clustering centers obtained by FCM, DFCM is then used to repartition the data and obtain the final clustering results.
The computational time of DFCM comprises the computation of the weighting values as well as the coarse-partitioning and fine-partitioning steps, and these two partitioning steps account for the majority of the runtime of the entire clustering process. However, because DFCM begins with the clustering results (centers) derived from FCM, it reaches an optimal solution more rapidly and effectively.
Algorithm 1. DFCM algorithm
Input: dataset X, number of clusters c, fuzziness exponent m, and acceptable error ε.
Output: partitioned clusters of X.
Method:
(1) Determine p1, p2, …, pc and q1, q2, …, qn;
(2) Partition X into C1, C2, …, Cc by FCM;
(3) Determine p1, p2, …, pc from |C1|, |C2|, …, |Cc|;
(4) Initialize the clustering centers of DFCM with v1, v2, …, vc from FCM;
(5) Solve uij of the jth point to the ith cluster by (16) or (17), i = 1~c, j = 1~n;
(6) Solve vi by (13), i = 1~c;
(7) Stop if the change between two successive iterations is smaller than ε; otherwise go to step (5);
(8) Partition X into C1, C2, …, Cc according to the final membership degrees.
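As a usage-level illustration of Algorithm 1, the sketch below strings the two stages together: the compact `fcm` helper implements the standard updates in (2) for the coarse step, the fine step reuses the hypothetical `dfcm_step_m2` helper sketched in Section 3, and the per-point weights q are assumed to be supplied (for instance, from the density-based weighting discussed in Section 4).

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Plain FCM (standard updates, cf. Eqs. (1)-(2)) for the coarse step."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        D2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = D2 ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)          # enforce sum_i u_ij = 1
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        if np.linalg.norm(V_new - V) < eps:
            return V_new, U
        V = V_new
    return V, U

def dfcm(X, c, q, p_target, eps=1e-3, max_iter=100):
    """Two-step DFCM driver following Algorithm 1 (illustrative sketch, m = 2)."""
    # Steps (2)-(4): coarse partitioning by FCM, then match the prescribed
    # cluster sizes p_target to the FCM clusters ordered by size.
    V, U = fcm(X, c)
    hard = U.argmax(axis=0)
    order = np.argsort(np.bincount(hard, minlength=c))
    p = np.empty(c)
    p[order] = np.sort(np.asarray(p_target, dtype=float))

    # Steps (5)-(7): fine partitioning with the double-constraint update,
    # using the dfcm_step_m2 helper sketched in Section 3.
    for _ in range(max_iter):
        U, V_new = dfcm_step_m2(X, V, p, q)
        if np.linalg.norm(V_new - V) < eps:        # step (7): stop criterion
            V = V_new
            break
        V = V_new

    # Step (8): final crisp partition from the membership degrees.
    return U.argmax(axis=0), V, U
```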
4. Experiment
Four synthetic low-dimensional datasets with different clustering features (e.g., density, size, and overlap) and eight actual datasets from UCI were used to assess the effectiveness and efficiency of DFCM. We applied DFCM to partition all data in these datasets and compared the results with three typical clustering algorithms: FCM, PCM, and CFCM.
4.1. Four Synthetic Datasets
Each cluster in the four synthetic datasets was generated with the “randn()” function of the Matlab® toolbox, so each cluster is regular and centered on the mean specified for the corresponding call. After labeling these “randn()” calls, the correct cluster label of any data point is simply the label of the call that generated it. These original cluster labels do not take part in any clustering process; they are only used to examine the accuracy of the algorithms after clustering is completed. The four synthetic datasets are denoted Set 1–Set 4.
The effectiveness of a clustering algorithm is typically assessed on datasets with various characteristics, such as density differences, size differences, and noise effects, and Sets 1–4 were constructed accordingly. Set 1 contains 1300 data points distributed across three clusters: a high-density cluster with 1000 data points and two low-density clusters with 100 and 200 data points, respectively (see Figure 1a); these clusters differ in density. Set 2 contains 1200 data points distributed across slightly overlapping clusters (see Figure 1b). Set 3 contains 1900 data points distributed across three size-diverse clusters, where the largest cluster has a diameter twice that of the two smaller clusters (see Figure 1c). Set 4 contains 1500 data points distributed across partially overlapping spherical clusters (see Figure 1d). In these figures, the centers derived from FCM and DFCM are marked by small green and red circles, respectively.
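To reproduce a dataset with the same cluster-size profile as Set 1, the snippet below generates an analogous sample with NumPy rather than Matlab's randn(); the cluster centers and spreads are illustrative choices of ours, since only the point counts and the density contrast are specified above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Set-1-like data: one dense cluster of 1000 points and two sparse clusters of
# 100 and 200 points. Centers and standard deviations are illustrative choices.
dense   = rng.normal(loc=[0.0, 0.0],  scale=0.5, size=(1000, 2))
sparse1 = rng.normal(loc=[4.0, 4.0],  scale=1.0, size=(100, 2))
sparse2 = rng.normal(loc=[4.0, -4.0], scale=1.0, size=(200, 2))

X = np.vstack([dense, sparse1, sparse2])             # 1300 x 2 data matrix
labels = np.repeat([0, 1, 2], [1000, 100, 200])      # ground-truth labels
```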
4.2. Eight Real Datasets from UCI
Eight real datasets from UCI [26] were used to assess the clustering accuracy and the mandatory constraint on the volume of data in each cluster. These datasets were selected because they are representative of different clustering structures and characteristics; other UCI datasets have features similar to these eight. The correct clustering labels and the volume of data in each cluster are known a priori. These labels remain separate from the clustering process and are only used to evaluate the accuracy of the various clustering results. Table 1 shows the number of clusters, the volume of data in each cluster, and the dimensionality of each dataset.
The Iris dataset contains 150 data points, each characterized by four attributes and distributed across three clusters of 50 data vectors each. Two clusters overlap, while the third is linearly separable from them; over the past decades, this dataset has frequently been used to assess the results of different clustering algorithms. The Tea and Breast datasets exhibit characteristics similar to those of Iris. The Wisconsin dataset is higher-dimensional, containing 683 instances after 16 instances with missing values were removed; each instance has nine attributes. This dataset contains two clusters: 444 samples are categorized as “Benign” and 239 as “Malignant”. These two clusters lie mainly in two different hyperplanes corresponding to different components (attributes). The Cancer and Appendicitis datasets have characteristics similar to those of Wisconsin.
4.3. Clustering Results
The clustering results were assessed using three indices: accuracy, the number of data points in each cluster, and runtime. Accuracy was determined as the percentage of correctly partitioned data points in each dataset. The total number of incorrectly partitioned data points over the clusters was measured with the index Sum defined in (18), where the two compared quantities are the actual and the computed numbers of data points in the ith cluster, respectively. On the other hand, following CFCM, the weighting value qj of the jth data point was determined by its density as in (19), where N(xj) denotes the set of neighboring data points around xj and the distance term denotes the distance between any pair of points.
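Because Equations (18) and (19) are not reproduced above, the snippet below shows one plausible reading of both quantities, offered purely as an illustration: Sum as the summed absolute difference between actual and computed cluster sizes, and qj as a neighborhood-count density weight rescaled to average 1 (so that qj = 1 matches plain FCM); the paper's exact definitions may differ.

```python
import numpy as np

def sum_index(actual_sizes, computed_sizes):
    """Plausible form of the Sum index: total absolute deviation of cluster sizes."""
    return int(np.abs(np.asarray(actual_sizes) - np.asarray(computed_sizes)).sum())

def density_weights(X, radius):
    """Plausible density-based weights q_j: neighbor counts within `radius`,
    rescaled so that the average weight is 1."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    counts = (dist < radius).sum(axis=1) - 1      # exclude the point itself
    return counts / max(counts.mean(), 1e-12)
```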
All clustering results are presented in Table 2, and the clustering centers of Sets 1–4 are shown in Figure 1. Based on the clustering accuracy and the Sum index, we compared the four clustering algorithms FCM, PCM, CFCM, and DFCM. In all algorithms, the fuzziness exponent was uniformly set to 1.5 and the stop error to 10^−3.
For the four synthetic datasets, Figure 1 illustrates that the cluster centers derived from FCM deviate from the actual centers. This deviation is attributed to FCM's limited capacity to discern the typicality or compatibility of points during clustering. In contrast, DFCM determines the clustering centers more accurately. Table 2 presents a detailed comparative analysis of the four clustering algorithms. In terms of clustering accuracy, DFCM surpasses the other three algorithms, with CFCM ranking second and FCM achieving the lowest accuracy. These results demonstrate that DFCM is effective and valuable: the incorporation of dual constraints in DFCM improves the accuracy of the clustering centers and of the corresponding membership degrees. PCM and CFCM exhibit slight deviations, whereas DFCM demonstrates negligible deviations. Hence, the Sum value of DFCM is the lowest among the four algorithms, PCM and CFCM rank as intermediate, and FCM has the highest value; moreover, the Sum value of CFCM is nearly the same as that of DFCM, as shown in Table 2. However, FCM has the shortest runtime when the number of clusters is fixed, and both PCM and DFCM depend on FCM for their initialization.
For the eight real datasets, Table 2 shows that DFCM outperformed the other three algorithms on five of the datasets, whereas it fell short in terms of accuracy and Sum on the remaining three; even there, the DFCM results are very close to the best results. Hence, DFCM exhibits a slight overall superiority over the other three algorithms. Our conclusions from these results are as follows. First, most clusters in these eight datasets are non-spherical, whereas the four algorithms can, in principle, work well only on datasets with spherical clusters. Second, the complex structures of the eight datasets lead to generally low clustering accuracy for any of the applied algorithms; in recent decades, the clustering accuracy achieved on these datasets by existing algorithms has remained below 50% [6].
In terms of average runtime, FCM was the fastest of the four algorithms, followed by PCM, CFCM, and DFCM. These results are consistent with those for the four synthetic datasets.
Figure 2 shows the convergence of DFCM on the four synthetic datasets compared with the other three clustering algorithms, with the number of iterations fixed at 40; the objective function value reflects the convergence speed. Because the objective functions of the four algorithms have different orders of magnitude, Figure 2 plots their relative values normalized to the interval [0, 1]. As shown in Figure 2, PCM has the fastest convergence among the four algorithms, followed by DFCM and FCM, while CFCM shows instability across the four datasets. Given that the convergence speed of an iterative algorithm reflects its runtime and its varying tendency, Figure 2 also illustrates the runtime behavior of the four algorithms on the various datasets.
4.4. Comparison and Discussion
When confronted with datasets of different characteristics, the four algorithms reveal their respective limitations. Following the general approach to evaluating a clustering algorithm, Table 3 summarizes their applicable ranges with respect to five common features: cluster overlap, density difference, size difference, cluster shape, and time complexity. The term “applicable” indicates that the given algorithm can correctly cluster the data points in such a dataset; “inapplicable” indicates that it cannot cluster the dataset correctly at all; and “partially applicable” refers to limited effectiveness in clustering a dataset with the relevant feature.
Furthermore, the clustering results of the four clustering algorithms are explained and discussed as follows. Although FCM, PCM, CFCM, and DFCM can all be used to cluster datasets with overlapping clusters, CFCM and DFCM perform better because they stress the typicality of each point, while PCM must rely on FCM for initialization and otherwise cannot separate overlapping clusters. Differences in density and size cause larger errors for FCM and PCM, and they partially affect the clustering results of CFCM and DFCM because of the constraints these two algorithms place on the clusters. In principle, none of the four algorithms can cluster data distributed in arbitrarily shaped clusters, but CFCM and DFCM produce smaller errors than FCM and PCM when all clusters are convex. Consequently, DFCM has an advantage over the other three algorithms in terms of clustering accuracy; however, its runtime is longer than that of FCM and PCM, though not longer than that of CFCM, and FCM remains the most efficient.