Article

Three-Way Ensemble Clustering Based on Sample’s Perturbation Theory

Jiachen Fan, Xiaoxiao Wang, Tingfeng Wu, Jin Zhu and Pingxin Wang
1 School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212003, China
2 School of Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(15), 2598; https://doi.org/10.3390/math10152598
Submission received: 21 June 2022 / Revised: 7 July 2022 / Accepted: 15 July 2022 / Published: 26 July 2022

Abstract

The complexity of data types and distributions increases the uncertainty in the relationships between samples, which makes it challenging to effectively mine the potential cluster structure of data. Ensemble clustering aims to obtain a unified cluster division by fusing multiple different base clustering results. This paper proposes a three-way ensemble clustering algorithm based on the sample's perturbation theory to solve the problem of inaccurate decision making caused by inaccurate information or insufficient data. The algorithm first combines the natural nearest neighbor algorithm to generate two perturbed data sets, randomly extracts feature subsets of the samples, and uses a traditional clustering algorithm to obtain different base clusterings. The sample's stability is obtained by using the co-association matrix and a determinacy function, and the samples are then divided into a stable region and an unstable region according to a threshold on the sample's stability. The stable region consists of high-stability samples and is divided into the core regions of the clusters using the K-means algorithm. The unstable region consists of low-stability samples and is assigned to the fringe regions of the clusters. A three-way clustering result is thus formed. The experimental results show that, compared with other ensemble clustering algorithms, the proposed algorithm obtains better clustering results on data sets from the UCI Machine Learning Repository and can effectively reveal the clustering structure.

1. Introduction

Clustering is a powerful data analysis technology that is widely used in fields such as information granulation [1,2,3], information fusion [4,5,6], attribute reduction [7,8,9,10], feature selection [11,12,13] and image analysis [14,15,16]. The purpose of clustering is to divide samples with high similarity into one cluster and samples with low similarity into different clusters [17]. There are many clustering methods, such as DPC [18], DBSCAN [19] and spectral clustering [20]. In addition to the aforementioned clustering algorithms, non-iterative artificial neural network (ANN) approaches can also be used to solve the clustering problem [21,22,23].
The aforementioned clustering algorithms are all hard clustering algorithms, which means that an object in the set definitely belongs to the cluster and an object not in the set definitely does not belong to it. A hard clustering algorithm cannot solve the problem of inaccurate decision making caused by inaccurate information or insufficient data. In recent years, a new clustering algorithm based on the three-way decision [24,25], named three-way clustering [26], has been proposed. The three-way decision is an extension of the two-way decision, in which a definite decision is made for objects with definite information, and a deferment decision is adopted when an object's information is insufficient, in order to avoid decision risk. Three-way clustering divides the data set into the core region, the fringe region and the trivial region. By further processing the data objects in the fringe region, we can clearly understand their impact on the cluster.
Since the idea of three-way clustering was put forward, it has received wide attention. Wang et al. [27] proposed the CE3 framework for three-way clustering and a three-way clustering method based on the dynamic domain. Zhang [28] proposed a three-way c-means algorithm, which assigns a three-way weight according to the inherent characteristics of each data point, making the weights more accurate, and then uses a three-way allocation to assign data points to the clusters. Liu [29] combined gray relational clustering with three-way decision making and constructed a three-way gray relational clustering method based on the principle of complementary advantages.
As the amount of data increases, it is difficult for a single clustering algorithm to identify all the different structures in large and complex data sets. To solve this problem and improve the stability and robustness of the algorithm, ensemble clustering [30] was proposed. The concept of the clustering ensemble was introduced by Strehl and Ghosh [31] in 2003. Ensemble members can be generated by setting different parameters, using different clustering algorithms, using different feature representations, or using weak clusterings. By constructing a matrix from the ensemble results, analyzing the differences among them and choosing an appropriate algorithm to fuse them, a better clustering result can finally be obtained. A clustering ensemble combines multiple partitions of a data set into a unified clustering result. Cluster ensembles mainly consist of two stages: the cluster member generation stage and the consistency function design stage. The design of the consistency function has a great impact on the quality of the clustering result. At present, consistency functions are mainly divided into the voting method [30], the hypergraph method [31], the evidence accumulation method [32] and methods based on mutual information theory [33]. Liu et al. [34] constructed a new algorithm model that uses a neural-network-based clustering ensemble algorithm to gather majority votes. Yu et al. [35] developed a framework of three-way ensemble clustering based on Spark and proposed a consensus clustering algorithm based on cluster units.
In this paper, a group of base clustering results with differences is obtained by setting different parameters for the same algorithm. In a given data set, some samples belong to the same cluster in one base clustering but to different clusters in other base clusterings; that is, these samples tend to change their cluster assignment. To quantify this tendency, the measure of a sample's stability was proposed [36]. The algorithm first combines the natural nearest neighbor algorithm to generate two perturbed data sets, randomly extracts feature subsets of the samples, and uses a traditional clustering algorithm to obtain different base clusterings. The sample's stability is obtained by using the co-association matrix and a determinacy function, and the samples are then divided into a stable region and an unstable region according to a threshold on the sample's stability. The stable region consists of high-stability samples and is divided into the core regions of the clusters using the K-means algorithm [37]. The unstable region consists of low-stability samples and is assigned to the fringe regions of the clusters. A three-way clustering result is thus formed.
The main contributions of this paper are as follows:
(1)
Compared with other ensemble clustering algorithms, this paper proposes a sample's perturbation method based on the natural nearest neighbor algorithm, which makes full use of the characteristics of the samples in the process of generating the base clusterings.
(2)
This paper proposes a three-way ensemble clustering algorithm. A hard clustering algorithm cannot solve the problem of inaccurate decision making caused by inaccurate information or insufficient data, whereas the proposed algorithm can defer decisions on uncertain samples and thus avoid decision risk.
The main contents of this paper are organized as follows. In Section 2, we briefly introduce related concepts such as three-way clustering, the natural nearest neighbor, the sample's perturbation theory and ensemble clustering. In Section 3, we propose a three-way ensemble clustering algorithm based on the sample's perturbation theory. In Section 4, we analyze the performance of the proposed three-way ensemble clustering algorithm. Conclusions and future work are given in Section 5.

2. Preliminaries

This section mainly introduces the background of the algorithm in this paper, including three-way clustering, natural nearest neighbor, sample’s perturbation theory, and ensemble clustering.

2.1. Three-Way Clustering

The three-way decision [38,39,40] was proposed by adding one more uncommitted option to the traditional two-way decision theory. In the traditional binary-decision model, there are only two decisions to select from: acceptance or rejection. When the current information is sufficient, we can easily make an acceptance or a rejection decision. However, when information is limited or insufficient, making either an acceptance or a rejection decision may cause inaccurate decisions. The three-way decision adds a deferment (non-commitment) decision to address the problem of inaccurate decision making. The main idea of the three-way decision is to divide a universe into three disjoint regions and to adopt different strategies for the different regions, which represent three levels of uncertainty.
Recently, by integrating the three-way decision with clustering, a new type of clustering algorithm [41,42], named three-way clustering, was proposed. Three-way clustering uses a core region and a fringe region to represent a cluster. These two sets divide the universe into three parts, $Co(C_i)$, $Fr(C_i)$ and $Tr(C_i)$, which correspond to the three types of relationships between objects and a cluster, namely objects belonging to the cluster, objects not belonging to the cluster, and objects partially belonging to the cluster. The samples in $Co(C_i)$ definitely belong to the cluster $C_i$ and have higher within-class similarities. The samples in $Fr(C_i)$ may belong to the cluster $C_i$ and have lower similarity to the core samples. The samples in $Tr(C_i)$ definitely do not belong to the cluster $C_i$. In contrast to using a single set to represent a cluster, three-way clustering uses $Co(C_i)$ and $Fr(C_i)$ to represent one cluster. Such a representation can clearly distinguish the core and fringe parts of a cluster, just as a concept has an intension and an extension.
Given a set of data objects $U = \{x_1, x_2, \ldots, x_n\}$, $C = \{C_1, C_2, \ldots, C_k\}$ is a finite set of clusters, and U is divided into k clusters. The idea of three-way clustering is to use a pair of sets to represent each cluster: the core region (Co) and the fringe region (Fr), while the remaining objects form the trivial region (Tr) [43]. Then, a three-way cluster is expressed as,
$$C_i = \big( Co(C_i),\ Fr(C_i) \big).$$
Three-way clustering results of data object set U are expressed as,
$$\mathbf{C} = \big\{ (Co(C_1), Fr(C_1)),\ (Co(C_2), Fr(C_2)),\ \ldots,\ (Co(C_k), Fr(C_k)) \big\}.$$
According to the definition of clustering results, C o ( C i ) and F r ( C i ) must meet the three following conditions,
(1) $Co(C_i) \neq \emptyset,\ i = 1, 2, \ldots, k$;
(2) $\bigcup_{i=1}^{k} \big( Co(C_i) \cup Fr(C_i) \big) = U$;
(3) $Co(C_i) \cap Co(C_j) = \emptyset,\ i \neq j$.
Among them, condition (1) indicates that any cluster is non-empty; condition (2) indicates that each sample $x_i \in U$ belongs to at least one cluster; and condition (3) indicates that the core regions of the clusters are pairwise disjoint.
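To make this representation concrete, the following minimal Python sketch (illustrative only, not the authors' code) stores each cluster as a pair of index sets and checks conditions (1)-(3); all names are hypothetical.

```python
# A minimal sketch of the three-way representation C_i = (Co(C_i), Fr(C_i))
# together with a check of conditions (1)-(3).
from typing import List, Set, Tuple

def check_three_way_clustering(universe: Set[int],
                               clusters: List[Tuple[Set[int], Set[int]]]) -> bool:
    """Each cluster is a pair (core, fringe) of sample indices."""
    covered = set()
    for i, (core_i, _) in enumerate(clusters):
        if not core_i:                                   # condition (1): non-empty core
            return False
        for core_j, _ in clusters[i + 1:]:
            if core_i & core_j:                          # condition (3): disjoint cores
                return False
    for core, fringe in clusters:
        covered |= core | fringe
    return covered == universe                           # condition (2): full coverage

# Toy example: two clusters sharing one fringe sample.
U = {0, 1, 2, 3, 4, 5}
C = [({0, 1}, {2}), ({3, 4}, {2, 5})]
print(check_three_way_clustering(U, C))  # True
```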
Recently, many three-way clustering approaches have been developed. For example, Afridi et al. [44] proposed a novel three-way clustering method using game-theoretic rough sets (GTRS) to handle missing data. Wang et al. [45] improved the K-means algorithm by integrating the three-way decision and developed a three-way k-means method. Yu et al. [46] presented a three-way density peak clustering method based on evidence theory. Jia et al. [47] developed an automatic three-way clustering approach by combining a threshold selection method based on the roughness degree, computed from the samples' similarity, with a cluster number selection method. In addition to the papers listed above, there are other contributions that enrich the theories and models of three-way clustering [48,49,50,51,52].

2.2. Natural Nearest Neighbor

In the study of the properties of data structures, the concept of the nearest neighbor has been proposed, and the commonly used neighborhood definitions are the K-neighborhood and the ε-neighborhood [53]. The natural nearest neighbor is a new neighbor definition proposed by Zou [54] in 2011. Compared with the K-nearest neighbor and the ε-nearest neighbor, its biggest difference is that it is generated by the data structure itself without any parameters. Its main idea is that if a data point a appears in the r-neighborhood of a data point b, then point b is a natural nearest neighbor of point a.
Definition 1.
Natural nearest neighbor [55]. For a data object X, if there is a data object Y that has X as its neighbor, and even the most outlying data object Z in the data set has some data object that regards Z as its neighbor, then the data object Y is called a natural nearest neighbor of the data object X. Furthermore, the outlier data point Z is the last point in the data set to appear for the first time in the neighborhood of other points.
Definition 2.
Natural characteristic value [55]. The natural characteristic value of the data set is the minimum r value that makes every data point x be included in the r-neighborhood of another data point y ($y \neq x$). Its mathematical definition is as follows:
$$supk = \min \left\{ r \mid \forall x \in X,\ \exists y \in X,\ y \neq x,\ \text{s.t.}\ x \in NN_r(y) \right\}$$
where $NN_r(y)$ represents the r-neighborhood of point y (the set of its r nearest neighbors), and $supk$ is also known as the average number of natural neighbors.

2.3. Sample’s Perturbation Theory

We first generate the perturbed data sets by using the natural nearest neighbor. Given a data set $U = \{v_1, v_2, \ldots, v_n\}$, the natural nearest neighbors of each element can be obtained by Algorithm 1. For an element v, we can obtain two new elements by the following formulas:
$$v_1 = v + \alpha \times std(v)$$
$$v_2 = v - \alpha \times std(v)$$
where $\alpha$ is a given parameter and $std(v)$ is the standard deviation of the natural nearest neighbors of the element v. The process of generating the perturbed data sets is shown in Algorithm 2.
Algorithm 1: Natural nearest neighbor algorithm.
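The pseudocode of Algorithm 1 appears as an image in the original article. The following Python sketch is one possible reading of the textual description above: the neighborhood radius r is grown until every point appears in the r-neighborhood of some other point. The function name and implementation details are assumptions, not the authors' code.

```python
# A hedged sketch of a natural nearest neighbor search, following only the
# textual description of Algorithm 1.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbors(X: np.ndarray):
    """Grow r until every point appears in the r-neighborhood of some other point.
    Returns the natural characteristic value supk and, for each point, the set of
    points whose r-neighborhoods contain it."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=n).fit(X)
    _, order = nbrs.kneighbors(X)          # order[i, 0] is i itself
    reverse = [set() for _ in range(n)]    # reverse[j]: points that list j as a neighbor
    r = 0
    while True:
        r += 1
        for i in range(n):
            reverse[order[i, r]].add(i)    # order[i, r] is the r-th neighbor of i
        if all(reverse[j] for j in range(n)) or r == n - 1:
            return r, reverse              # r plays the role of supk

# Example usage on random 2-D data.
rng = np.random.default_rng(0)
supk, rev = natural_neighbors(rng.normal(size=(50, 2)))
print("supk =", supk)
```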
Algorithm 2: Generation of perturbed data sets.
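Algorithm 2 is likewise presented as an image. A minimal sketch of the perturbation step is given below, assuming that $std(v)$ is the per-feature standard deviation of the natural nearest neighbors of v; the function name and the fallback for neighbor-less points are our assumptions.

```python
# A sketch of the perturbed data set generation v1 = v + a*std(v), v2 = v - a*std(v).
import numpy as np

def perturbed_datasets(X: np.ndarray, neighbors: list, alpha: float = 3.0):
    """neighbors[i] is an iterable of indices of the natural neighbors of X[i]."""
    std = np.zeros_like(X)
    for i, nb in enumerate(neighbors):
        nb = list(nb) if nb else [i]             # fall back to the point itself
        std[i] = X[nb].std(axis=0)               # per-feature std of the neighbors
    return X + alpha * std, X - alpha * std      # the two perturbed data sets

# Example: reuse the reverse-neighbor sets from the natural neighbor sketch.
# X1, X2 = perturbed_datasets(X, [list(s) for s in rev], alpha=3.0)
```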

2.4. Ensemble Clustering

In cluster analysis, different clustering algorithms will lead to different clustering results. As we all know, a single clustering algorithm cannot always achieve a good clustering result, especially when the data set has complex structures, indistinguishable boundaries, non-spherical distribution, and high dimensionality.
The ensemble clustering algorithm [31,32] uses a consistency function to effectively combine the base clusterings and integrates them to achieve a clustering result with high performance. Ensemble clustering has two main steps, namely base cluster generation and consistency integration. In the base cluster generation stage, different clustering algorithms, or the same clustering algorithm under different parameters, are used to generate multiple different base clustering results. In the consistency ensemble stage, the multiple base clustering results are merged into a single cluster division by designing a consistency function. That is, given a data set $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \in \mathbb{R}^d$, d is the dimension of the attributes of $x_i$ and n is the number of samples, a clustering algorithm is used to divide the data set X into M base partitions. $\Pi = \{\pi_1, \pi_2, \ldots, \pi_M\}$ represents the set of base partitions, where $\pi_m = \{C_1^m, C_2^m, \ldots, C_k^m\}$ is the m-th partition of the data set X, $C_j^m$ represents the j-th cluster in $\pi_m$, and k is the number of clusters in the partition $\pi_m$. The clustering ensemble searches for a new partition $\pi^{\#} = \{C_1^{\#}, C_2^{\#}, \ldots, C_k^{\#}\}$, which is the consensus clustering result. The process of ensemble clustering is shown in Figure 1.

3. Processes of Algorithms

Assume a data set $U = \{v_1, v_2, \ldots, v_n\}$ and a group of base clustering results $\Pi = \{C_1, C_2, \ldots, C_M\}$ obtained by setting different parameters of a clustering algorithm for the clustering ensemble. Then, the co-association (relation) matrix [36] is constructed, in which the relationship between any two points is calculated as:
$$p_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta\big( C_m(v_i),\ C_m(v_j) \big)$$
where M is the number of base clustering results, $v_i$ and $v_j$ are two samples, and $C_m(v_i)$ represents the cluster label of point $v_i$ in the m-th clustering result. The indicator function is defined as:
$$\delta\big( C_m(v_i),\ C_m(v_j) \big) = \begin{cases} 0, & C_m(v_i) \neq C_m(v_j) \\ 1, & C_m(v_i) = C_m(v_j) \end{cases}$$
A linear method is used to measure the stability. First, define a function f with respect to the variables p and t, where $p \in [0, 1]$ and $t \in [0, 1]$, such that:
(1) if $p < t$, then $f(p) < 0$; if $p > t$, then $f(p) > 0$;
(2) if $p_i < t < p_j$ and $\frac{t - p_i}{p_j - t} = \frac{t}{1 - t}$, then $f(p_i) = -f(p_j)$.
Suppose a data set contains n samples. Based on this function f, for each point $v_i$, the stability [36] is defined as follows:
$$s(v_i) = \frac{1}{n} \sum_{j=1}^{n} f(p_{ij})$$
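As an illustration, the following sketch computes the co-association matrix and the sample stability. The piecewise-linear f used here is one function satisfying properties (1) and (2) above, not necessarily the authors' exact determinacy function.

```python
# A sketch of the co-association matrix p_ij and the sample stability s(v_i).
import numpy as np

def co_association(labels: np.ndarray) -> np.ndarray:
    """labels: (M, n) array, row m = cluster labels of the m-th base clustering."""
    M, n = labels.shape
    P = np.zeros((n, n))
    for lab in labels:
        P += (lab[:, None] == lab[None, :]).astype(float)
    return P / M

def stability(P: np.ndarray, t: float) -> np.ndarray:
    # One valid choice of f: negative below t, positive above t, and
    # f(p_i) = -f(p_j) when (t - p_i)/(p_j - t) = t/(1 - t).
    f = np.where(P < t, (P - t) / t, (P - t) / (1.0 - t))
    return f.mean(axis=1)                        # s(v_i) = (1/n) * sum_j f(p_ij)

# Example with three toy base clusterings of five samples.
labels = np.array([[0, 0, 1, 1, 1],
                   [0, 0, 1, 1, 0],
                   [0, 0, 0, 1, 1]])
print(stability(co_association(labels), t=0.5))
```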
A pair consisting of a core region and a fringe region is usually used to represent each cluster in a three-way clustering result. In this section, we propose a three-way ensemble clustering algorithm based on the natural nearest neighbor and the samples' stability. The proposed algorithm first randomly extracts feature subsets of the samples and uses the K-means algorithm to generate multiple different base clusterings. Algorithm 3 shows the steps of generating the base clustering results.
Algorithm 3: Base cluster generation.
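Algorithm 3 is shown as an image in the original article. A sketch of base cluster generation that matches the textual description (random feature subsets plus K-means) might look as follows; the parameter defaults mirror the experimental settings (ensemble size 60, feature ratios between 50% and 90%), but the function itself is an assumption.

```python
# A sketch of base cluster generation: each base clustering runs K-means on a
# random feature subset; the perturbed data sets can be clustered the same way.
import numpy as np
from sklearn.cluster import KMeans

def base_clusterings(X: np.ndarray, k: int, n_members: int = 60,
                     feature_ratio: float = 0.8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(round(feature_ratio * d)))
    labels = np.empty((n_members, n), dtype=int)
    for i in range(n_members):
        feats = rng.choice(d, size=m, replace=False)        # random feature subset
        km = KMeans(n_clusters=k, n_init=10,
                    random_state=int(rng.integers(10**6)))
        labels[i] = km.fit_predict(X[:, feats])
    return labels

# Example: labels = base_clusterings(X, k=3, n_members=60, feature_ratio=0.8)
```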
In the second step, we compute each sample's stability from the multiple clustering results and divide the data set into two regions according to a threshold t on the stability, where t is obtained by the Otsu algorithm [56]:
$$O = \{ v_i \mid s(v_i) > t,\ i = 1, 2, \ldots, n \}$$
$$H = \{ v_i \mid s(v_i) \leq t,\ i = 1, 2, \ldots, n \}$$
where the elements of O have higher stability and the elements of H have lower stability.
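A minimal sketch of this step is given below: Otsu's method is re-implemented on a histogram of the stability values, and the samples are split into O and H. This is a self-contained illustration rather than the authors' implementation.

```python
# Choose the stability threshold t with Otsu's method, then split into O and H.
import numpy as np

def otsu_threshold(values: np.ndarray, bins: int = 64) -> float:
    hist, edges = np.histogram(values, bins=bins)
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:i] * centers[:i]).sum() / w0
        mu1 = (prob[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2      # between-class variance
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t

def split_regions(s: np.ndarray):
    t = otsu_threshold(s)
    O = np.where(s > t)[0]      # stable samples
    H = np.where(s <= t)[0]     # unstable samples
    return O, H, t
```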
The final problem to solve is determining how a three-way clustering result can be obtained. In this paper, we adopt different strategies for the different regions. For the stable data set O, we obtain the clusters $C_i\ (i = 1, 2, \ldots, k)$ of the elements of O by the K-means algorithm, and these clusters are taken as the core regions of the three-way clustering result, namely $Co(C_i)$. For each element v of the unstable data set H and the k cluster centers $x_1, x_2, \ldots, x_k$ obtained on O, the minimum distance from v to the k cluster centers is $d(v, x_i) = \min_{1 \leq j \leq k} d(v, x_j)$, where $d(v, x_j)$ is the Euclidean distance between v and $x_j$. Given a parameter p, we form the set $T = \{ j : d(v, x_j) - d(v, x_i) \leq p,\ j \neq i \}$, from which there are two possible situations (a sketch of this assignment is given after the two cases below):
(1)
If $T = \emptyset$, then $v \in Fr(C_i)$;
(2)
If $T \neq \emptyset$, then $v \in Fr(C_i)$ and $v \in Fr(C_j)$ for each $j \in T$.
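The sketch below illustrates the fringe assignment for a single unstable sample. The inequality defining T follows the reconstruction above and should be treated as an assumption, since only the value of p (e.g., 0.8) is reported in the text.

```python
# A sketch of the fringe assignment for one unstable sample v.
import numpy as np

def assign_fringe(v: np.ndarray, centers: np.ndarray, p: float = 0.8) -> list:
    d = np.linalg.norm(centers - v, axis=1)        # Euclidean distances to the k centers
    i = int(np.argmin(d))                          # nearest core region
    T = [j for j in range(len(centers)) if j != i and d[j] - d[i] <= p]
    return [i] + T                                 # indices of the fringe regions Fr(C_*) for v

# Example: v joins Fr(C_i) alone if T is empty, otherwise also Fr(C_j) for j in T.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(assign_fringe(np.array([0.4, 0.1]), centers, p=0.8))  # [0, 1]
```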
A three-way ensemble clustering result is thus naturally formed. Algorithm 4 shows all the steps of the three-way ensemble clustering based on the sample's perturbation theory.
Algorithm 4: Three-way Ensemble Clustering Based on Sample's Perturbation Theory (3WESP).
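Algorithm 4 is presented as an image in the original article. Purely as an illustration of the overall flow, the sketch below composes the hypothetical helpers from the previous sketches; the provisional threshold used inside f and the way the perturbed data sets enter the ensemble are assumptions.

```python
# A high-level sketch of the 3WESP flow, built from the earlier sketches
# (natural_neighbors, perturbed_datasets, base_clusterings, co_association,
# stability, split_regions, assign_fringe are the hypothetical helpers above).
import numpy as np
from sklearn.cluster import KMeans

def three_way_ensemble(X: np.ndarray, k: int, alpha: float = 3.0, p: float = 0.8):
    _, rev = natural_neighbors(X)
    X1, X2 = perturbed_datasets(X, [list(s) for s in rev], alpha)
    labels = np.vstack([base_clusterings(D, k) for D in (X, X1, X2)])
    s = stability(co_association(labels), t=0.5)           # provisional t inside f
    O, H, _ = split_regions(s)                             # stable / unstable samples
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[O])
    core = [list(O[km.labels_ == i]) for i in range(k)]    # Co(C_i)
    fringe = [[] for _ in range(k)]
    for idx in H:
        for j in assign_fringe(X[idx], km.cluster_centers_, p):
            fringe[j].append(idx)                          # Fr(C_j)
    return core, fringe
```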

4. Experimental Analyses

We mainly verify the performances of our proposed 3WESP algorithm in this section. In the base cluster generation, we use the K-means algorithm to generate diverse base clusters by selecting different percentages of feature subsets such as 50%, 60%, 70%, 80% and 90%. We demonstrate and explore the performances of the proposed algorithm and other clustering algorithms on 12 UCI data sets. We discuss the influence of selecting different percentages of feature subsets on the proposed algorithm.

4.1. Evaluation Indices

In the evaluation of clustering, we compare the proposed algorithm with other clustering algorithms by calculating several cluster evaluation indices such as AMI [57], NMI [31], ARI [57] and ACC [45]. For the cluster evaluation indices used in this paper, a good clustering result should have a higher value.
1. Adjusted Mutual Information (AMI)
The AMI index is based on information theory. The larger the AMI value, the more consistent the clustering result is with the ground truth. AMI is calculated as follows:
$$AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}{\max\{H(U), H(V)\} - E\{MI(U, V)\}}$$
where $E\{MI(U, V)\}$ is the expectation of the mutual information $MI(U, V)$, and $H(U)$ is the information entropy, calculated as:
$$H(U) = -\sum_{i=1}^{R} p_i \log p_i$$
2. Normalized Mutual Information (NMI)
$$NMI = \frac{I(X, Y)}{\sqrt{H(X) H(Y)}}$$
where X is the predicted label and Y is the real label. $H(X)$ and $H(Y)$ represent the entropies of X and Y, respectively, and $I(X, Y)$ is the mutual information between X and Y.
3. Adjusted Rand Index (ARI)
$$ARI = \frac{2(TP \cdot TN - FN \cdot FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)}$$
where TP is the number of pairs of data points that belong to the same cluster in both the real and the experimental partitions; FN is the number of pairs that belong to the same cluster in the real partition but not in the experimental one; FP is the number of pairs that belong to the same cluster in the experimental partition but not in the real one; and TN is the number of pairs that do not belong to the same cluster in either partition.
4. Accuracy (ACC)
$$ACC = \frac{1}{N} \sum_{i=1}^{k} n_i$$
where N is the total number of elements, $n_i$ is the number of elements correctly assigned to the corresponding cluster i, and k is the number of clusters. ACC represents the ratio of the number of correctly partitioned elements to the total number of elements.
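For reference, the four indices can be computed as in the following sketch: AMI, NMI and ARI via scikit-learn, and ACC via a Hungarian matching between predicted clusters and true classes. The helper names are ours.

```python
# A sketch of the four evaluation indices used in this paper.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

def clustering_acc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-cost)          # maximize matched counts
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"AMI": adjusted_mutual_info_score(y_true, y_pred),
            "NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "ACC": clustering_acc(np.asarray(y_true), np.asarray(y_pred))}

# Example: evaluate([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])
```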

4.2. Performances of 3WESP

We employ 12 UCI [58] data sets to show the performance of the proposed algorithm. Table 1 presents the basic information of the data sets used. For each data set, we obtain a three-way clustering result through Algorithm 4, where, in the base cluster generation process, we use the K-means algorithm to generate diverse base clusterings by selecting different percentages of feature subsets, namely 50%, 60%, 70%, 80% and 90%.
Cluster evaluation indices are designed for hard clustering results, whereas the 3WESP ($\alpha = 3$, $p = 0.8$) algorithm produces a soft (three-way) clustering result that cannot use these indices directly. Therefore, in this paper, we use only the core regions to represent the clustering result when calculating the AMI, ARI, NMI and ACC values. By running the algorithms 50 times on all data sets, we obtain the average AMI, ARI, NMI and ACC values, where the ensemble size is 60. The performances of K-means, Voting [30] and CSPA [31] are also presented in Table 2, Table 3, Table 4 and Table 5. The best performance for each data set is highlighted in bold.
It can be seen from the data in the tables that the proposed algorithm has advantages in these four evaluation indices on most data sets. For example, on the Wine data set, the four index values of the proposed algorithm are 0.8851, 0.9068, 0.8873 and 0.9697, respectively, which are clearly better than those of the other algorithms.

4.3. Experimental Results for Selected Feature Percentage

In order to analyze the influence of different percentages of features on the ensemble clustering results, the following experiments were carried out. On 12 different UCI data sets, this paper analyzes the overall performance of the ensemble algorithm under different feature percentages. The number of base clusterings is fixed at 60, and feature ratios of 50%, 60%, 70%, 80% and 90% are used to generate the base clusterings. Fifty repeated experiments were conducted for each feature subset size, and the evaluation indices were averaged to analyze their changes. Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 show the specific experimental results.
From the experimental results recorded in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13, we can see that different data sets achieve their best performances at different percentages. It can also be seen that, as the proportion of the feature subset increases, the four evaluation indices fluctuate only slightly on each data set. Although individual data sets are slightly affected when the feature proportion changes, the overall change is small.

5. Conclusions and Future Work

It has been recognized that a single clustering algorithm cannot identify all types of data structures, and ensemble clustering is an effective way to solve this problem. A traditional clustering algorithm may also lead to high decision-making risk caused by inaccurate information or insufficient data. Therefore, in this paper, we propose a three-way ensemble clustering algorithm based on the sample's perturbation theory. In this algorithm, we use the sample's perturbation theory to obtain perturbed data sets, randomly extract feature subsets of the samples, and use traditional clustering algorithms to obtain different base clusterings. We use the set of base clustering results as input, obtain the stability of each sample through the co-association matrix and the determinacy function, and divide the samples into a stable region and an unstable region using the stability threshold obtained by the Otsu algorithm. Different regions are handled with different strategies: the stable region is composed of samples with high stability, which are assigned to the core regions of the clusters by the K-means algorithm, while the unstable region is composed of samples with low stability, which are assigned to the fringe regions of the clusters. Therefore, a three-way ensemble clustering result is obtained. The experimental results on UCI data sets show that the new algorithm can effectively reveal the data structure compared with traditional ensemble clustering algorithms.
The following topics deserve further investigation:
(1)
The parameter selection of the clustering algorithm is a complicated problem. The parameters (such as α and p) have a significant effect on the clustering results in this paper. Thus, studying the effect of dynamic parameter changes on the results of the algorithm will be the subject of our future work.
(2)
The base clusterings generated by different feature subsets may have low quality, which may affect the final ensemble clustering result. Determining how to handle bad base clusterings will be a good avenue for future research.

Author Contributions

Conceptualization, P.W.; Data curation, X.W.; Funding acquisition, J.Z.; Project administration, J.Z.; Software, T.W.; Supervision, P.W.; Writing—original draft, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China (nos. 62076111, 62006099) and the Key Laboratory of Oceanographic Big Data Mining & Application of Zhejiang Province (no. OBDMA202002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yang, X.; Qi, Y.; Song, X.; Yang, J. Test cost sensitive multigranulation rough set: Model and minimal cost selection. Inf. Sci. 2013, 250, 184–199. [Google Scholar] [CrossRef]
  2. Xu, W.H.; Guo, Y.T. Generalized multigranulation double-quantitative decision-theoretic rough set. Knowl.-Based Syst. 2016, 105, 190–205. [Google Scholar] [CrossRef]
  3. Li, W.; Xu, W.; Zhang, X.; Zhang, J. Updating approximations with dynamic objects based on local multigranulation rough sets in ordered information systems. Artif. Intell. Rev. 2022, 55, 1821–1855. [Google Scholar] [CrossRef]
  4. Xu, W.H.; Yu, J.H. A novel approach to information fusion in multi-source datasets: A granular computing viewpoint. Inf. Sci. 2017, 378, 410–423. [Google Scholar] [CrossRef]
  5. Chen, X.W.; Xu, W.H. Double-quantitative multigranulation rough fuzzy set based on logical operations in multi-source decision systems. Int. J. Mach. Learn. Cybern. 2022, 13, 1021–1048. [Google Scholar] [CrossRef]
  6. Xu, W.H.; Yuan, K.H.; Li, W.T. Dynamic updating approximations of local generalized multigranulation neighborhood rough set. Appl. Intell. 2022, 52, 9148–9173. [Google Scholar] [CrossRef]
  7. Yang, X.B.; Yao, Y.Y. Ensemble selector for attribute reduction. Appl. Soft Comput. 2018, 70, 1–11. [Google Scholar] [CrossRef]
  8. Jiang, Z.; Yang, X.; Yu, H.; Liu, D.; Wang, P.; Qian, Y. Accelerator for multi-granularity attribute reduction. Knowl. Based Syst. 2019, 177, 145–158. [Google Scholar] [CrossRef]
  9. Li, J.; Yang, X.; Song, X.; Li, J.; Wang, P.; Yu, D.J. Neighborhood attribute reduction: A multi-criterion approach. Int. J. Mach. Learn. Cybern. 2019, 10, 731–742. [Google Scholar] [CrossRef]
  10. Liu, K.; Yang, X.; Yu, H.; Fujita, H.; Chen, X.; Liu, D. Supervised information granulation strategy for attribute reduction. Int. J. Mach. Learn. Cybern. 2020, 11, 2149–2163. [Google Scholar] [CrossRef]
  11. Liu, K.; Yang, X.; Fujita, H.; Liu, D.; Yang, X.; Qian, Y. An efficient selector for multi-granularity attribute reduction. Inf. Sci. 2019, 505, 457–472. [Google Scholar] [CrossRef]
  12. Liu, K.; Yang, X.; Yu, H.; Mi, J.; Wang, P.; Chen, X. Rough set based semi-supervised feature selection via ensemble selector. Knowl.-Based Syst. 2020, 165, 282–296. [Google Scholar] [CrossRef]
  13. Sun, L.; Zhang, X.; Qian, Y.; Xu, J.; Zhang, S. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inf. Sci. 2019, 502, 18–41. [Google Scholar] [CrossRef]
  14. Zhang, C.; Fu, H.; Hu, Q.; Cao, X.; Xie, Y.; Tao, D.; Xu, D. Generalized latent multi-view subspace clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 86–99. [Google Scholar] [CrossRef]
  15. Singh, P.; Bose, S.S. A quantum-clustering optimization method for COVID-19 CT scan image segmentation. Expert Syst. Appl. 2021, 185, 115637. [Google Scholar] [CrossRef]
  16. Singh, P.; Bose, S.S. Ambiguous D-means fusion clustering algorithm based on ambiguous set theory: Special application in clustering of CT scan images of COVID-19. Knowl. Based Syst. 2021, 231, 107432. [Google Scholar] [CrossRef]
  17. Ji, X.; Liu, S.; Zhao, P.; Li, X.; Liu, Q. Clustering ensemble based on sample’s certainty. Cogn. Comput. 2021, 13, 1034–1046. [Google Scholar] [CrossRef]
  18. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [Green Version]
  19. He, H.; Liu, W.; Zhao, Z.; He, S.; Zhang, J. Vulnerability of regional aviation networks based on DBSCAN and complex networks. Comput. Syst. Sci. Eng. 2022, 43, 643–655. [Google Scholar] [CrossRef]
  20. Zhong, G.; Shu, T.; Huang, G.; Yan, X. Multi-view spectral clustering by simultaneous consensus graph learning and discretization. Knowl. Based Syst. 2022, 235, 107632. [Google Scholar] [CrossRef]
  21. Tkachenko, R.; Izonin, I. Model and principles for the implementation of neural-like structures based on geometric data transformations. In Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2018; Volume 754. [Google Scholar]
  22. Tkachenko, R.; Izonin, I.; Tkachenko, P. Neuro-Fuzzy diagnostics systems based on SGTM neural-like structure and t-controller. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Cham, Switzerland, 2021; Volume 77. [Google Scholar]
  23. Tkachenko, R. An integral software solution of the SGTM neural-like structures implementation for solving different data mining tasks. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Cham, Switzerland, 2021; Volume 77. [Google Scholar]
  24. Yao, Y.Y. Tri-level thinking: Models of three-way decision. Int. J. Mach. Learn. Cybern. 2020, 11, 947–959. [Google Scholar] [CrossRef]
  25. Yao, Y.Y. The geometry of three-way decision. Appl. Intell. 2021, 51, 6298–6325. [Google Scholar] [CrossRef]
  26. Yu, H.; Zhang, C.; Wang, G.; Zeng, X. A tree-based incremental overlapping clustering method using the three-way decision theory. Knowl. Based Syst. 2016, 91, 189–203. [Google Scholar] [CrossRef]
  27. Wang, P.X.; Yao, Y.Y. CE3: A three-way clustering method based on mathematical morphology. Knowl.-Based Syst. 2018, 155, 54–65. [Google Scholar] [CrossRef]
  28. Zhang, K. A three-way c-means algorithm. Appl. Soft Comput. 2019, 82, 105536. [Google Scholar] [CrossRef]
  29. Liu, Y.; Du, J.L.; Zhang, R.S. Three way decisions based grey incidence analysis clustering approach for panel data and its application. Kybernetes 2019, 48, 2117–2137. [Google Scholar] [CrossRef]
  30. Zhou, Z.H.; Tang, W. Cluster Ensemble. Knowl.-Based Syst. 2006, 19, 77–83. [Google Scholar] [CrossRef]
  31. Strehl, A.; Ghosh, J. Cluster ensembles: A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2003, 3, 583–617. [Google Scholar]
  32. Fred, A.L.N.; Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 835–850. [Google Scholar] [CrossRef]
  33. Vega-Pons, S.; Ruiz-Shulcloper, J. A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell. 2011, 25, 337–372. [Google Scholar]
  34. Liu, Y.; Xu, L.; Li, M. The parallelization of back propagation neural network in MapReduce and Spark. Int. J. Parallel Program. 2017, 45, 760–779. [Google Scholar] [CrossRef]
  35. Yu, H.; Chen, Y.; Lingras, P.; Wang, G.Y. A three-way cluster ensemble approach for large-scale data. Int. J. Approx. Reason. 2019, 115, 32–49. [Google Scholar] [CrossRef]
  36. Li, F.; Qian, Y.; Wang, J.; Dang, C.; Jing, L. Clustering ensemble based on sample’s stability. Artif. Intell. 2019, 273, 37–55. [Google Scholar] [CrossRef]
  37. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
  38. Yao, Y.Y. The superiority of three-way decisions in probabilistic rough set models. Inf. Sci. 2011, 181, 1080–1096. [Google Scholar] [CrossRef]
  39. Yao, Y.Y. Three-way decisions and cognitive computing. Cogn. Comput. 2016, 8, 543–554. [Google Scholar] [CrossRef]
  40. Yao, Y.Y. Three-way decision and granular computing. Int. J. Approx. Reason. 2018, 103, 107–123. [Google Scholar] [CrossRef]
  41. Yu, H. A framework of three-way cluster analysis. In Proceedings of the International Joint Conference on Rough Sets, Olsztyn, Poland, 3–7 July 2017; pp. 300–312. [Google Scholar]
  42. Shah, A.; Azam, N.; Ali, B.; Khan, M.T.; Yao, J. A three-way clustering approach for novelty detection. Inf. Sci. 2021, 569, 650–668. [Google Scholar] [CrossRef]
  43. Wang, P.X.; Yang, X.B. Three-way clustering method based on stability theory. IEEE Access 2021, 9, 33944–33953. [Google Scholar] [CrossRef]
  44. Afridi, M.K.; Azam, N.; Yao, J.T.; Alanazi, E. A three-way clustering approach for handling missing data using GTRS. Int. J. Approx. Reason. 2018, 98, 11–24. [Google Scholar] [CrossRef]
  45. Wang, P.X.; Shi, H.; Yang, Y.B.; Mi, J.S. Three-way k-means: Integrating k-means and three-way decision. Int. J. Mach. Learn. Cybern. 2019, 10, 2767–2777. [Google Scholar] [CrossRef]
  46. Yu, H.; Chen, L.Y.; Yao, J.T. A three-way density peak clustering method based on evidence theory. Knowl.-Based Syst. 2021, 211, 106532. [Google Scholar] [CrossRef]
  47. Jia, X.; Rao, Y.; Li, W.; Yang, S.; Yu, H. An automatic three-way clustering method based on sample similarity. Int. J. Mach. Learn. Cybern. 2021, 12, 1545–1556. [Google Scholar] [CrossRef]
  48. Yu, H.; Chen, L.Y.; Yao, J.T.; Wang, X.N. A three-way clustering method based on an improved DBSCAN algorithm. Phys. A Stat. Mech. Its Appl. 2019, 535, 122289. [Google Scholar] [CrossRef]
  49. Yu, H.; Wang, X.C.; Wang, G.Y.; Zeng, X.H. An active three-way clustering method via low-rank matrices for multi-view data. Inf. Sci. 2020, 507, 823–839. [Google Scholar] [CrossRef]
  50. Chu, X.; Sun, B.; Li, X.; Han, K.; Wu, J.; Zhang, Y.; Huang, Q. Neighborhood rough set-based three-way clustering considering attribute correlations: An approach to classification of potential gout groups. Inf. Sci. 2020, 535, 28–41. [Google Scholar] [CrossRef]
  51. Shah, A.; Azam, N.; Alanazi, E.; Yao, J.T. Image blurring and sharpening inspired three-way clustering approach. Appl. Intell. 2022, 1–25. [Google Scholar] [CrossRef]
  52. Wu, T.F.; Fan, J.C.; Wang, P.X. An improved three-way clustering based on ensemble strategy. Mathematics 2022, 10, 1457. [Google Scholar] [CrossRef]
  53. Stevens, S.S. Mathematics, measurement, and psychophysics. In Handbook of Experimental Psychology; Wiley: London, UK, 1951; pp. 1–49. [Google Scholar]
  54. Zou, X.L.; Zhu, Q.S.; Jin, Y.F. An adaptive neighborhood graph for LLE algorithm without free-parameter. Int. J. Computer Appl. 2011, 16, 20–33. [Google Scholar] [CrossRef]
  55. Zhu, Q.S.; Feng, J.; Huang, J.L. Natural neighbor: A self-adaptive neighborhood method without parameter K. Pattern Recognit. Lett. 2016, 80, 30–36. [Google Scholar] [CrossRef]
  56. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef] [Green Version]
  57. Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
  58. Blake, C.L.; Merz, C.J. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 15 May 2022).
Figure 1. Flow chart of ensemble clustering.
Figure 2. Results of Cardiotocography by 3WESP.
Figure 3. Results of Congressional Voting by 3WESP.
Figure 4. Results of Dermatology by 3WESP.
Figure 5. Results of Forrest by 3WESP.
Figure 6. Results of Iris by 3WESP.
Figure 7. Results of Musk recognition by 3WESP.
Figure 8. Results of Parkinson by 3WESP.
Figure 9. Results of Seeds by 3WESP.
Figure 10. Results of Segmentation by 3WESP.
Figure 11. Results of Waveform by 3WESP.
Figure 12. Results of Wdbc by 3WESP.
Figure 13. Results of Wine by 3WESP.
Table 1. A description of UCI data sets.
ID | Data Sets | Samples | Attributes | Classes
1 | Cardiotocography | 2126 | 21 | 10
2 | Congressional voting | 435 | 16 | 2
3 | Dermatology | 366 | 34 | 6
4 | Forrest | 523 | 27 | 4
5 | Iris | 150 | 4 | 3
6 | Musk | 476 | 166 | 2
7 | Parkinson | 1208 | 26 | 2
8 | Seeds | 210 | 7 | 3
9 | Segmentation | 2310 | 19 | 7
10 | Waveform | 5000 | 21 | 3
11 | Wdbc | 569 | 30 | 2
12 | Wine | 178 | 13 | 3
Table 2. Experimental results of AMI on UCI data sets.
ID | Data Sets | 3WESP | K-Means | Voting | CSPA
1 | Cardiotocography | 0.3710 | 0.3404 | 0.2227 | 0.2367
2 | Congressional voting | 0.6577 | 0.6531 | 0.6531 | 0.6321
3 | Dermatology | 0.8639 | 0.7811 | 0.4815 | 0.4965
4 | Forrest | 0.6565 | 0.5418 | 0.4778 | 0.5003
5 | Iris | 0.8231 | 0.7387 | 0.7387 | 0.6895
6 | Musk | 0.0519 | 0.0098 | 0.0154 | 0.0125
7 | Parkinson | 0.0183 | 0.0133 | 0.0133 | 0.0154
8 | Seeds | 0.6655 | 0.3412 | 0.5263 | 0.6605
9 | Segmentation | 0.5327 | 0.5245 | 0.3275 | 0.5194
10 | Waveform | 0.3650 | 0.3640 | 0.1535 | 0.3102
11 | Wdbc | 0.6242 | 0.6126 | 0.6024 | 0.6143
12 | Wine | 0.8851 | 0.7893 | 0.5683 | 0.8131
Table 3. Experimental results of ARI on UCI data sets.
ID | Data Sets | 3WESP | K-Means | Voting | CSPA
1 | Cardiotocography | 0.1760 | 0.1750 | 0.1245 | 0.1236
2 | Congressional voting | 0.7064 | 0.6612 | 0.6602 | 0.5247
3 | Dermatology | 0.7056 | 0.6426 | 0.2377 | 0.6448
4 | Forrest | 0.6878 | 0.5029 | 0.4165 | 0.4372
5 | Iris | 0.8017 | 0.7163 | 0.7234 | 0.6652
6 | Musk | 0.0427 | 0.0023 | 0.0057 | 0.0034
7 | Parkinson | 0.0243 | 0.0204 | 0.0302 | 0.0236
8 | Seeds | 0.7008 | 0.3180 | 0.4256 | 0.7004
9 | Segmentation | 0.4393 | 0.4158 | 0.2135 | 0.3928
10 | Waveform | 0.2555 | 0.2536 | 0.0943 | 0.2530
11 | Wdbc | 0.7589 | 0.7302 | 0.7203 | 0.7002
12 | Wine | 0.9068 | 0.7842 | 0.4625 | 0.7954
Table 4. Experimental results of NMI on UCI data sets.
ID | Data Sets | 3WESP | K-Means | Voting | CSPA
1 | Cardiotocography | 0.3601 | 0.3364 | 0.2302 | 0.3250
2 | Congressional voting | 0.7429 | 0.7194 | 0.7034 | 0.4558
3 | Dermatology | 0.8729 | 0.6837 | 0.4934 | 0.8005
4 | Forrest | 0.6995 | 0.5448 | 0.4817 | 0.4958
5 | Iris | 0.8319 | 0.7419 | 0.7235 | 0.6934
6 | Musk | 0.0593 | 0.0114 | 0.0170 | 0.0133
7 | Parkinson | 0.0292 | 0.0139 | 0.0135 | 0.0158
8 | Seeds | 0.6900 | 0.3488 | 0.5286 | 0.6679
9 | Segmentation | 0.7193 | 0.5845 | 0.3597 | 0.4958
10 | Waveform | 0.3653 | 0.3642 | 0.1537 | 0.3649
11 | Wdbc | 0.6456 | 0.6231 | 0.6140 | 0.6237
12 | Wine | 0.8873 | 0.7915 | 0.5712 | 0.8243
Table 5. Experimental results of ACC on UCI data sets.
ID | Data Sets | 3WESP | K-Means | Voting | CSPA
1 | Cardiotocography | 0.4831 | 0.4026 | 0.4210 | 0.3833
2 | Congressional voting | 0.9100 | 0.4706 | 0.4567 | 0.7625
3 | Dermatology | 0.8769 | 0.8126 | 0.6913 | 0.7366
4 | Forrest | 0.8448 | 0.7839 | 0.8375 | 0.7086
5 | Iris | 0.9268 | 0.8867 | 0.8767 | 0.8200
6 | Musk | 0.6820 | 0.6534 | 0.6639 | 0.5412
7 | Parkinson | 0.5820 | 0.5728 | 0.5678 | 0.5781
8 | Seeds | 0.8950 | 0.8857 | 0.8905 | 0.8895
9 | Segmentation | 0.7778 | 0.5905 | 0.4810 | 0.6285
10 | Waveform | 0.5401 | 0.5530 | 0.5178 | 0.5014
11 | Wdbc | 0.9351 | 0.9279 | 0.9256 | 0.9079
12 | Wine | 0.9697 | 0.9270 | 0.9071 | 0.9352
