Article

Multi-Objective Automatic Clustering Algorithm Based on Evolutionary Multi-Tasking Optimization

1 Air Traffic Control and Navigation School, Air Force Engineering University, Xi’an 710038, China
2 Key Laboratory of Collaborative Intelligence Systems, School of Electronic Engineering, Ministry of Education, Xidian University, Xi’an 710071, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(10), 1987; https://doi.org/10.3390/electronics13101987
Submission received: 14 April 2024 / Revised: 7 May 2024 / Accepted: 15 May 2024 / Published: 19 May 2024

Abstract: Data mining is the process of extracting hidden knowledge and potentially useful information from large volumes of incomplete, noisy, and random real-world data. Clustering algorithms based on multi-objective evolution have clear advantages over traditional single-objective methods. In order to further improve the performance of evolutionary multi-objective clustering algorithms, this paper proposes a multi-objective automatic clustering model based on evolutionary multi-task optimization. Building on a multi-objective clustering algorithm that automatically determines the value of k, evolutionary multi-task optimization is introduced to handle multiple clustering tasks simultaneously. A set of non-dominated clustering solutions is obtained by concurrently optimizing the overall deviation and the connectivity index. Multi-task adjacency coding based on a locus adjacency graph was designed to encode the clustered data. Additionally, an evolutionary operator based on relevance learning was designed to drive the evolution of individuals within the population; it also facilitates information transfer between individuals assigned to different tasks, effectively avoiding negative transfer. Finally, the proposed algorithm was tested on both artificial datasets and UCI datasets and compared with traditional clustering algorithms and other multi-objective clustering algorithms. The results verify the advantages of the proposed algorithm in clustering accuracy and algorithm convergence.

1. Introduction

Clustering is an indispensable technique in data mining [1]. It has attracted widespread interest in image segmentation [2], social network analysis [3], text analysis [4], and other fields. Clustering is an important unsupervised classification technique [5] that is suitable for revealing the underlying distribution of data; it can divide vectors in a multidimensional space into different categories. Clustering techniques can be roughly divided into several categories, such as partition-based, density-based, grid-based, hierarchical, and model-based clustering [6].
Among the partition-based clustering algorithms, representative methods are k-means (KM) [7], mixed density clustering [8], fuzzy clustering represented by fuzzy C-means (FCM) [9], and graph clustering algorithms [10]. Hierarchical clustering is usually represented as a hierarchical tree structure [11]; according to the direction in which the tree is constructed, it can be divided into bottom-up [12] and top-down [13] methods. Traditional clustering algorithms are often limited to a single criterion evaluation function, and an effective clustering result can only be obtained by selecting appropriate clustering evaluation criteria [14]. In single-objective clustering algorithms [15], the evaluation function is often designed according to the dataset and the actual problem to be solved, and the number of clusters needs to be specified in advance; neither the cluster centers nor the number of clusters can be determined adaptively [16,17]. Since clustering is usually an unsupervised task with a general lack of prior knowledge [18], coupled with exploding data dimensionality and sparse high-dimensional datasets, it is difficult for traditional single-objective clustering algorithms to obtain efficient clustering solutions [19].
Such clustering algorithms are simple, fast, and efficient [20], but they easily fall into local optima, require the number of clusters to be specified in advance, and lack robustness [21]. With the increasing scale and dimensionality of real-world data, single-objective algorithms can no longer model increasingly complex problems. Therefore, it is necessary to build multi-objective algorithms with stronger problem description ability and to seek fast and efficient multi-objective optimization methods [22]. In [23], Handl et al. proposed the multi-objective clustering algorithm MOCK and compared it with three single-objective clustering algorithms, finding that clustering can benefit from multiple objectives. Gong et al. [24] proposed an improved multi-objective clustering framework, IMCPSO, using particle swarm optimization. This framework designs a novel particle representation of clustering problems to help PSO search for clustering solutions in continuous space, and designs a leader selection strategy according to the distribution of the Pareto set, so that the algorithm can avoid falling into local optima; extensive experiments demonstrated the advantages of the multi-objective clustering algorithm over single-objective clustering algorithms. In [25], Wang et al. took the number of clusters and the sum of squared distances between data points and their cluster centroids as the objectives, and established a bi-objective clustering model, EMO-KC, to ensure that clustering results with different k values are effectively obtained in a single run.
Different from single-objective optimization problems, multi-objective problems contain multiple conflicting optimization objectives [26], and a single solution cannot optimize all objectives at the same time. Under the given constraints, effectively balancing multiple conflicting objective values to solve multi-objective optimization problems (MOPs) is critical [27]. From the perspective of optimization, the clustering problem can be regarded as an NP-hard problem, and traditional optimization methods cannot be directly applied to solve multi-objective problems [28]. As meta-heuristics, evolutionary algorithms are often used to solve NP-hard problems, so multi-objective optimization methods for clustering problems are very meaningful [29]. Handl et al. [23] first proposed the multi-objective adaptive clustering algorithm MOCK, using PESA-II as the optimization tool. The algorithm takes the intra-class homogeneity and inter-class connectivity of clustering as optimization objectives, and initializes the population with the minimum spanning tree and k-means. In the clustering phase, MOCK can adaptively adjust the number of clusters without specifying it in advance. Finally, the gap statistic method is used to select the best result from the non-dominated solutions generated by the multi-objective optimization algorithm. MOCK does not require a priori knowledge of the samples and can adaptively determine the number of clusters, which makes it better than single-objective clustering methods. Hruschka et al. [30] gave a general view of evolutionary multi-objective clustering (EMOC) in their review of evolutionary clustering algorithms. In 2012, Bong et al. [31] proposed a multi-objective clustering method applied to image segmentation.
In 2013, Mukhopadhyay et al. [32] conducted a survey on multi-objective evolutionary methods for data mining, in which the authors introduced the general characteristics of EMOC and proposed some algorithms.
At present, most multi-objective clustering algorithms only target datasets with simple data distributions and can only optimize one clustering task [33]. For multi-domain datasets with multiple related clustering tasks, because of limitations in the expression of genotypes, the design of crossover operators, and population search strategies, multi-objective clustering algorithms cannot perform well [34]. Simply applying a multi-objective clustering algorithm to such complex clustering tasks will only increase the degree of negative transfer of genetic material between individuals and reduce the ability of the population to find the global optimal solution [35]. Therefore, this paper focuses on the positive transfer effect of evolutionary multi-tasking algorithms [36,37,38,39] among multiple clustering tasks. Gupta et al. [40] proposed the multifactorial evolutionary algorithm (MFEA), also known as an evolutionary multi-tasking algorithm, which uses a single evolutionary population to solve multiple optimization problems simultaneously by exploiting the implicit parallelism of population search.
In order to overcome the above difficulties, this paper proposes a multi-task multi-objective automatic clustering model based on the MFEA framework and a multi-objective clustering algorithm. The algorithm has multi-task solving ability to handle multi-domain datasets. The overall deviation and connectivity index are used as the two clustering objective functions to obtain a set of non-dominated solutions with respect to the clustering parameters. Moreover, a locus-based adjacency graph (LAG) encoding method was designed, and mutation and crossover operators based on relevance learning between tasks were designed for information transmission within the population, reducing the impact of negative transfer on population convergence [41]. The specific contributions are as follows:
1. Firstly, the idea of evolutionary multi-task learning is integrated into the framework of multi-objective clustering, and the model of a multi-task multi-objective automatic clustering algorithm (MFMOCK) is established. Through the exchange of genetic material between individuals in the population, the relevant knowledge between different tasks is implicitly transferred, so as to optimize multiple tasks at the same time;
2. Based on the locus-based adjacency graph (LAG) encoding method, an individual representation suitable for evolutionary multi-tasks was designed, which provides a basis for the exchange of genetic information between tasks;
3. The crossover operator between individuals with different skill factors (solving different clustering tasks) was designed to limit the search direction of the algorithm, reduce the adverse evolution caused by random crossover, and accelerate the convergence speed of each task at the same time.
The remainder of this paper is organized as follows. Section 2 introduces the technical basis of the algorithm framework, including multi-objective optimization, MOCK, the multi-tasking evolutionary algorithm, and model correlation learning. Section 3 introduces the specific steps and implementation details of the multi-tasking and multi-objective clustering algorithm, including the innovation and improvement of the coding method, local search strategy, and individual crossover and mutation strategy. Section 4 records the setup and results of the comparison test, and analyzes the experimental results. Section 5 is the conclusion of this paper, which summarizes and describes the algorithms and experimental results proposed in this paper.

2. Background

In this section, multi-objective optimization and the MOCK clustering model are first introduced in detail, and then the task correlation learning model is derived and its application in multi-task clustering is introduced.

2.1. Multi-Objective Optimization

Without loss of generality, the multi-objective optimization problems (MOPs) [42] discussed in this paper are formulated as minimization problems. A MOP can be expressed as follows:
$$\min F(x) = \left(f_1(x), \ldots, f_m(x)\right)^T \quad \text{subject to } x \in \Omega$$
where $\Omega$ is the feasible space, $x$ is a solution of the MOP, $\mathbb{R}^m$ is the objective space, and $F: \Omega \to \mathbb{R}^m$ consists of $m$ real-valued objective functions. In most cases, the objectives in a MOP conflict with each other, which means that no point in the feasible space can minimize all objectives at the same time. Therefore, the goal of multi-objective optimization is to find the best trade-offs among the objectives. For minimization, a solution $x_u$ dominates another solution $x_v$ if and only if
$$\forall i \in \{1, 2, \ldots, m\}: f_i(x_u) \le f_i(x_v) \quad \wedge \quad \exists j \in \{1, 2, \ldots, m\}: f_j(x_u) < f_j(x_v)$$
If there is no solution $x \in \Omega$ such that $F(x)$ dominates $F(x^*)$, we call $x^*$ a Pareto-optimal solution and $F(x^*)$ a Pareto-optimal vector. The objectives in a Pareto-optimal vector are such that a reduction in one objective leads to an increase in at least one other objective. All Pareto-optimal points constitute a set, called the Pareto-optimal set, and the corresponding Pareto-optimal objective vectors constitute the Pareto-optimal front (PF) [43].
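To make the dominance relation above concrete, the following is a minimal Python sketch (not from the paper; function names are illustrative) that checks dominance between two objective vectors and extracts the non-dominated subset of a set of solutions under minimization:

```python
from typing import List, Sequence

def dominates(fu: Sequence[float], fv: Sequence[float]) -> bool:
    """Return True if objective vector fu Pareto-dominates fv (minimization)."""
    no_worse = all(a <= b for a, b in zip(fu, fv))
    strictly_better = any(a < b for a, b in zip(fu, fv))
    return no_worse and strictly_better

def non_dominated(front: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the objective vectors not dominated by any other vector."""
    return [p for p in front
            if not any(dominates(q, p) for q in front if q is not p)]

# Example with two conflicting objectives (e.g., overall deviation vs. connectivity)
points = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
print(non_dominated(points))   # (3.0, 3.0) is dominated by (2.0, 2.0) and removed
```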
Evolutionary algorithms are usually used as an important tool for solving multi-objective problems: because of the inherent parallelism of population iteration in multi-objective evolutionary algorithms, a near-optimal solution set covering the entire Pareto front can be generated in a single run [44]. Deb et al. proposed the NSGA-II algorithm using a fast non-dominated sorting strategy [45]. The MOPSO algorithm proposed by Coello and Pulido adopts an adaptive grid-based selection strategy and applies a new mutation strategy to the population evolution to ensure the diversity of solutions [46]. Zhang et al. proposed MOEA/D, a multi-objective evolutionary algorithm based on decomposition [47]. The algorithm decomposes the multi-objective problem into multiple single-objective sub-problems using, for example, the Chebyshev method, the weighted-sum method, or the boundary intersection method; through the iterative optimization of the sub-problems, it alleviates the issues of a large search space and slow iterative evolution in multi-objective optimization and reduces the time complexity of the algorithm.

2.2. MOCK Clustering Model

MOCK is a multi-objective clustering algorithm that can automatically determine the number of clusters. The algorithm consists of two main stages: the clustering stage and model selection stage.
Clustering stage: The clustering phase of MOCK is based on the multi-objective evolutionary algorithm: the Pareto envelope-based selection algorithm version 2 (PESA-II), which optimizes the two clustering objectives of overall deviation and connectivity. In this stage, the minimum spanning tree and k-means algorithm are used to generate the initial solution set to obtain different numbers and shapes of clusters. The overall deviation is calculated as follows:
$$\mathrm{Dev}(C) = \sum_{C_k \in C} \sum_{i \in C_k} D(i, \mu_k)$$
where $C$ is the set of all clusters, $\mu_k$ is the centroid of cluster $C_k$, and $D(\cdot, \cdot)$ is the chosen distance function.
The connectivity calculation is as follows:
$$\mathrm{Conn}(C) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i, nn_{ij}}, \quad \text{where } x_{r,s} = \begin{cases} \dfrac{1}{j}, & \text{if } \nexists\, C_k : r \in C_k \wedge s \in C_k \\ 0, & \text{otherwise} \end{cases}$$
where $nn_{ij}$ is the $j$-th nearest neighbor of datum $i$, $N$ is the size of the clustered dataset, and $L$ is a parameter determining the number of neighbors that contribute to the connectivity measure.
In MOCK, each individual is encoded using locus-based adjacency representation, where each data point is represented by a gene, and the allele value of the gene defines the connection between the data points. The offspring are generated by combining the genetic information of the two parents using the uniform crossover method. In addition, based on the concept of nearest neighbor, the neighborhood-based mutation is designed to change the connection relationship of data points.
Model selection stage: The model selection phase uses gap statistics to determine the number of clusters in the dataset. A Poisson model is used to generate control data in the feature space to obtain the expected connectivity and overall deviation values of unstructured data. Then, the solution front and the control front are aligned and normalized, and the score of each point on the solution front is calculated. By analyzing the relationship between the score and the number of clusters, locally optimal solutions are identified as promising solutions, and the global maximum is taken as the estimated best solution.
The specific implementation of MOCK can be found in [23].

2.3. Task Correlation Learning Models

Multi-task clustering improves the clustering performance of each task by transferring useful knowledge between related tasks [48]. In many practical applications, only part of the latent classes can be shared between tasks; if the correlation between tasks is not considered, negative transfer may occur. Zhang et al. proposed multi-task clustering with model relation correlation learning [49]. Within each task, symmetric non-negative matrix factorization is combined with a linear regression model to cluster the task's data. Across tasks, the parameters of the linear regression model of each task are updated by learning the correlations of the model parameters [50]. The framework of the task relevance learning algorithm is as follows:
Given $m$ tasks, each with a dataset $X_t = \{x_1^t, x_2^t, \ldots, x_{n_t}^t\} \in \mathbb{R}^{d \times n_t}$, $t = 1, \ldots, m$, where $n_t$ is the number of data points of the $t$-th task and $d$ is the dimension of the feature vector. The similarity matrix of the $t$-th task is $M_t \in \mathbb{R}^{n_t \times n_t}$. The data $X_t$ of each task are divided into $h_t$ clusters, i.e., $C_t = \{C_1^t, C_2^t, \ldots, C_{h_t}^t\}$. The cluster label matrix of the $t$-th task is $Y_t \in \mathbb{R}^{n_t \times h_t}$.
The objective function of multi-task model relevance learning is formulated as follows:
$$\begin{aligned} \min_{Y_t, W_t, G^{(t,s)}} J = {} & \sum_{t=1}^{m} \frac{1}{2}\left\| M_t - Y_t Y_t^T \right\|_F^2 + \sum_{t=1}^{m} \left( \lambda \left\| Y_t - X_t^T W_t \right\|_F^2 + \mu \left\| W_t \right\|_F^2 \right) \\ & + \alpha \sum_{t=1}^{m} \sum_{s \ne t} \left( \sum_{i=1}^{h_t} \sum_{j=1}^{h_s} \left\| W_i^t - W_j^s \right\|_2^2 G_{ij}^{(t,s)} + \beta \left\| G^{(t,s)} \right\|_F^2 \right) \\ \text{s.t.} \quad & Y_t \ge 0, \quad \sum_{i=1}^{h_t} \sum_{j=1}^{h_s} G_{ij}^{(t,s)} = 1, \quad 0 \le G_{ij}^{(t,s)} \le 1 \end{aligned}$$
This formulation is mainly composed of two parts: intra-task clustering and inter-task correlation learning. The first part of intra-task clustering is mainly used to cluster within each task and update the clustering clusters through the knowledge transferred from other tasks. The formula for this part is as follows:
$$\min_{Y_t, W_t} J_{in}^{t} = \frac{1}{2}\left\| M_t - Y_t Y_t^T \right\|_F^2 + \lambda \left\| Y_t - X_t^T W_t \right\|_F^2 + \mu \left\| W_t \right\|_F^2 \quad \text{s.t. } Y_t \ge 0.$$
The first term of the formula is symmetric non-negative matrix factorization (SNMF), where $M_t$ is the similarity matrix of task $t$. The second term is a linear regression model used to fit the clusters in each task, in which the features of the input vectors are used to fit and predict the cluster indicator vectors. The third term, $\mu\|W_t\|_F^2$, is a regularization term that prevents overfitting.
The second part is inter-task correlation learning, which learns cross-task correlation by computing the correlation of model parameters between multiple clusters in each pair of tasks. The relationship of the same cluster across different tasks is explored by learning the correlation of the model parameters corresponding to each cluster, and then the model parameters of each task can be updated by the model parameters of other tasks.
$$\min_{G^{(t,s)}} J_{cross}^{(t,s)} = \sum_{i=1}^{h_t} \sum_{j=1}^{h_s} \left\| W_i^t - W_j^s \right\|_2^2 G_{ij}^{(t,s)} + \beta \left\| G^{(t,s)} \right\|_F^2, \quad \text{s.t. } \sum_{i=1}^{h_t} \sum_{j=1}^{h_s} G_{ij}^{(t,s)} = 1, \; 0 \le G_{ij}^{(t,s)} \le 1 \; (t \ne s),$$
The first term of this formula ensures that models with a smaller parameter distance receive a higher corresponding similarity, while the second term prevents the degenerate solution in which the similarity between one pair of models is 1 and the similarity between all other pairs is 0. $W_j^s$ is the linear regression parameter of the $j$-th cluster of task $s$, and $G^{(t,s)}$ is the cluster correlation coefficient matrix between task $t$ and task $s$, where $G_{ij}^{(t,s)}$ is the correlation coefficient between $W_i^t$ and $W_j^s$.
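For fixed regression parameters, the subproblem in $G^{(t,s)}$ above reduces to a Euclidean projection onto the probability simplex, since $\sum_{ij} d_{ij} G_{ij} + \beta \|G\|_F^2$ with $d_{ij} = \|W_i^t - W_j^s\|_2^2$ is, up to a constant, $\beta\|G + d/(2\beta)\|_F^2$. The following Python sketch (an illustration under that observation, not code from the paper; variable names are assumptions) solves this one subproblem; in the full model this update would alternate with updates of $W_t$ and $Y_t$.

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sorting-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def update_G(W_t: np.ndarray, W_s: np.ndarray, beta: float) -> np.ndarray:
    """Solve min_G sum_ij d_ij*G_ij + beta*||G||_F^2 s.t. G lies on the simplex,
    where d_ij = ||W_t[:, i] - W_s[:, j]||_2^2 (columns are per-cluster parameters)."""
    d = ((W_t[:, :, None] - W_s[:, None, :]) ** 2).sum(axis=0)   # h_t x h_s distances
    g = project_to_simplex((-d / (2.0 * beta)).ravel())
    return g.reshape(d.shape)

# Hypothetical example: 2 clusters in task t, 3 clusters in task s, 4-dim parameters
rng = np.random.default_rng(0)
G = update_G(rng.normal(size=(4, 2)), rng.normal(size=(4, 3)), beta=0.5)
print(G, G.sum())   # non-negative entries summing to 1; mass concentrates on close cluster pairs
```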

3. Multi-Task Multi-Objective Automatic Clustering Algorithm

In order to make full use of the parallelism implied by a population search, and extend multi-objective optimization to multi-tasking multi-objective optimization to optimize multiple similar clustering tasks at the same time, this section proposes a multi-objective clustering learning framework based on evolutionary multi-tasking optimization, which is also a multi-tasking multi-objective clustering framework. It tries to improve the clustering performance of each task while accelerating the iteration speed of each task. This section introduces the specific MFMOCK framework according to the process and technical points, and presents the complexity analysis of the algorithm. Table 1 and Table 2 are the abbreviation declaration table and the nomenclature declaration table, respectively.

3.1. Framework of MFMOCK

This part gives the basic framework and specific algorithm steps of the multi-task multi-objective automatic clustering algorithm (MFMOCK). NSGA-II [45] is used as the basic framework of MOEA, and the algorithm extends the single-task multi-objective optimization algorithm into a multi-tasking evolutionary algorithm. The individual coding method, local search strategy, genetic information transmission method between tasks, and crossover and mutation operators are mainly modified for the evolutionary multi-tasking framework. The framework of the MFMOCK is shown in Algorithm 1.
The $m$ clustering tasks are optimized simultaneously, and the $j$-th task is denoted as $T_j$, whose objective function is defined as $f_j: X_j \to \mathbb{R}$, where $X_j$ is the search space. The evolutionary multi-tasking clustering optimization proposed in this paper can then be expressed as $\{x_1, \ldots, x_m\} = \arg\min \{f_1(x), \ldots, f_m(x)\}$. In order to solve this problem with an evolutionary algorithm, in the evolutionary multi-tasking framework, each individual $p_i$, $i \in \{1, 2, \ldots, |P|\}$, in population $P$ has a set of attribute characteristics, and $p_i$ can be decoded for the $m$ clustering optimization problems as $\{x_1^i, \ldots, x_m^i\}$, where $x_1^i \in X_1, \ldots, x_m^i \in X_m$. Taking individual $p_i$ as an example, its attribute characteristics are as follows:
1. Individual evaluation value: The individual evaluation value of $p_i$ on task $T_j$ is $\Psi_j^i = f_j^i$, and the corresponding ranking position on that task is $r_j^i$;
2. Individual fitness: The individual fitness of $p_i$ is defined as $\varphi_i = 1 / \min_{j \in \{1, \ldots, m\}} r_j^i$;
3. Skill factor: The skill factor of $p_i$ is defined as $\tau_i = \arg\min_{j \in \{1, \ldots, m\}} r_j^i$, that is, the index of the task on which the individual performs best (a small sketch of these attributes follows this list).
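As an illustration of the three attributes above, the following Python sketch (hypothetical helper, not from the paper) computes per-task ranks, scalar fitness $\varphi$, and skill factors $\tau$ from a matrix of per-task evaluation values; each task is assumed here to have a single scalar cost for simplicity, whereas in MFMOCK the per-task ranking would come from non-dominated sorting of the two clustering objectives.

```python
import numpy as np

def mfea_attributes(costs: np.ndarray):
    """costs[i, j] = evaluation value of individual i on task j (lower is better).
    Returns per-task ranks r[i, j], scalar fitness phi[i], and skill factors tau[i]."""
    # rank 1 = best individual on that task
    ranks = costs.argsort(axis=0).argsort(axis=0) + 1
    phi = 1.0 / ranks.min(axis=1)              # fitness from the best rank over all tasks
    tau = ranks.argmin(axis=1)                 # index of the task the individual is best at
    return ranks, phi, tau

costs = np.array([[0.2, 0.9],    # individual 0: good at task 0
                  [0.8, 0.1],    # individual 1: good at task 1
                  [0.5, 0.5]])
ranks, phi, tau = mfea_attributes(costs)
print(ranks)   # [[1 3] [3 1] [2 2]]
print(phi)     # [1.  1.  0.5]
print(tau)     # [0 1 0]  (ties resolved toward the first task)
```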
Algorithm 1 Pseudocode for the MFMOCK algorithm.
Input: maximum number of generations $gen_{max}$, population size $pop$, crossover and mutation probabilities $p_c$ and $p_m$, the dataset containing $m$ tasks.
 1: Initialization
 2: Initialize the population P.
 3: Assign skill factors τ and evaluate individuals.
 4: Initialize LearningPool.
 5: while $gen \le gen_{max}$ do
 6:     Individuals are selected to enter LearningPool based on the evaluation results
 7:     The parents are selected through a binary tournament
 8:     Generation of offspring population C, refer to Algorithm 2.
 9:     Individuals are evaluated according to the skill factor.
10:    Merging populations P C .
11:    Update the fitness φ and skill factor τ .
12:    The next generation of individuals is selected according to the φ and τ .
13: end while
14: A solution evaluation strategy based on controlled frontier distance is used to select the final solution.
Firstly, based on MOCK, MFMOCK uses the idea of evolutionary multi-tasking, in which each task contributes an additional factor that affects the evolution of the population. Moreover, the skill factor of an individual represents the task number that the individual is best at, and individuals with different skill factors contribute evolutionary factors to the population through chromosome crossover. Due to the unified individual coding method, the data and corresponding features in different tasks are mapped in the same search space, so that in the process of population evolution, each task also provides information for the optimization of other tasks. The unified search space means that the solutions of different tasks are contained in a unified genetic information base, which enables MFMOCK to exploit the potential genetic complementarity between multiple clustering tasks, so as to effectively discover useful genetic information and implicitly transfer it from one task to another. Secondly, because the amount of data in different clustering tasks may be inconsistent, the genotype length of individuals in different tasks obtained by the classical LAG encoding method is inconsistent, which makes it impossible for individuals in the population to cross. The multi-task adjacency representation method is used to solve this problem.
The algorithm also adopted the elite retention strategy of LearningPool, which set a fixed size of LearningPool to maintain a large number of different non-dominated solutions with different skill factors found in the search process. Using the histogram method based on density estimation, the solutions cover the whole objective space instead of gathering in the same region. At the end of each generation, the non-dominated solutions in the LearningPool are updated, new non-dominated solutions are added, and the dominated solutions are eliminated. Multi-objective clustering does not generate a single solution, but a set of clustered solution sets, with different solutions corresponding to different trade-offs between the two objectives. The model selection strategy of the algorithm adopts the solution selection strategy based on the control front (CF). According to the Poisson model in the feature space of the clustered data, the control data are obtained. Specifically, the principal component analysis method is applied to the covariance matrix of the original data to obtain the eigenvectors and eigenvalues, and then the control data are generated. Thus, an estimate of the connectivity and overall bias of unstructured data, which do not contain the original data but are located in the same data space, is obtained. For the non-dominated solution on the actual Pareto front (PF), the distance between the solutions and the point on the control front is calculated, and the non-dominated solution with the largest shortest distance is considered to be the optimal solution. The calculation formula is as follows:
$$\mathrm{BestSolution} = \arg\max_{i \in PF} \; \min_{j \in CF} \; \mathrm{distance}(i, j)$$
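The following Python sketch (illustrative only; the point sets and the Euclidean distance are assumptions) implements this selection rule: for each non-dominated solution on the actual front, compute its shortest distance to the control front, and return the solution for which this shortest distance is largest.

```python
import numpy as np

def select_best_solution(pareto_front: np.ndarray, control_front: np.ndarray) -> int:
    """pareto_front: (n_pf, 2) objective vectors of non-dominated clustering solutions.
    control_front: (n_cf, 2) objective vectors obtained from the control (random) data.
    Returns the index of the solution with the largest distance to its nearest control point."""
    diff = pareto_front[:, None, :] - control_front[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise distances, shape (n_pf, n_cf)
    nearest = dists.min(axis=1)                        # shortest distance per PF point
    return int(nearest.argmax())                       # largest shortest distance

pf = np.array([[0.1, 0.9], [0.4, 0.5], [0.9, 0.1]])    # hypothetical normalized fronts
cf = np.array([[0.3, 1.0], [0.7, 0.6], [1.0, 0.3]])
print(select_best_solution(pf, cf))
```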
In addition, the crossover operator based on multi-tasking correlation learning is incorporated. It not only enables individuals to exchange genetic information across tasks, but also uses the inter-task correlation learning model to align, via quadratic programming, the linear regression model parameters of different tasks during individual crossover, so as to extract the correlation between task models. The correlation coefficients are used to minimize the differences between the parameters of different regression models, so that related tasks obtain similar clustering results and task-related knowledge is transferred, which is ultimately reflected in the change of individual genotypes. Moreover, as a population search strategy, the crossover operator avoids meaningless crossover between individuals with different skill factors and accelerates the population's search for the global optimal solution during evolution.

3.2. Multi-Task Adjacency Coding Based on Locus Adjacency Graph

In order to apply the evolutionary multi-tasking framework to multi-objective clustering problems, it is necessary to modify and design the population coding method, the crossover operator between individuals with different skill factors, and the evolution strategy. These elements are introduced in detail in this subsection.
For the population coding, this algorithm adopts a multi-tasking adjacency coding method based on the locus adjacency graph (LAG). In the LAG, the encoding of each individual can be represented as a data connection graph and decoded into a clustering result. Each data point in the task is represented as a node in the graph, connected data points are placed in the same cluster, and each individual corresponds to $k$ clusters after decoding. For example, individual $p_i$ contains $n$ alleles $p_i^j$, $j \in \{1, \ldots, n\}$; if the allele at position $f$ takes the value $l$, denoted $p_i^f = l$, then data points $f$ and $l$ are connected and belong to the same cluster. The main advantage of the locus-based adjacency coding scheme for clustering is that the number of clusters does not need to be fixed in advance, since it is determined automatically in the decoding step.
In Figure 1, taking a clustering task with nine data points as an example, each data point is represented by a node $i \in \{1, 2, \ldots, 9\}$, and the genotype of the individual shown in the figure is $(2, 4, 2, 1, 5, 5, 9, 7, 8)$. The solid lines represent the connection links between samples, the dotted lines outline the corresponding clusters, and all the data are divided into three clusters. Each gene position in the genotype stores the index of the node that the current node points to; in this representation, each node has exactly one outgoing link to another node.
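As a concrete illustration of the decoding step (a minimal sketch, not the paper's implementation), the genotype from Figure 1 can be decoded into clusters by treating each gene as an edge and extracting connected components with a union-find structure:

```python
def decode_lag(genotype):
    """Decode a locus-based adjacency genotype (1-indexed link targets) into cluster labels."""
    n = len(genotype)
    parent = list(range(n))

    def find(i):                                 # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, target in enumerate(genotype):
        ri, rj = find(i), find(target - 1)       # gene i points to node `target`
        if ri != rj:
            parent[ri] = rj

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Genotype of the individual in Figure 1: three clusters {1,2,3,4}, {5,6}, {7,8,9}
print(decode_lag([2, 4, 2, 1, 5, 5, 9, 7, 8]))   # [0, 0, 0, 0, 1, 1, 2, 2, 2]
```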
In a single multi-objective clustering task, since the number of samples is fixed, the genotype length of each individual in the population is the same, and gene length does not need to be considered for crossover between individuals. However, in the multi-tasking environment, the number of samples of each task may differ, so when the same individual is decoded for different tasks, the required genotype length is different. Therefore, a unified encoding method, namely, the multi-tasking adjacency representation, needs to be designed. As shown in Figure 2, the genotypes of individuals consist of non-negative real numbers, which are mapped into [0, 1]. When optimizing $m$ tasks simultaneously, assume that the dimension of the $j$-th task is $D_j$; then, in the unified search space, the dimension is $D_{multitask} = \max_j(D_j)$. In the population initialization step, each individual is a vector of random variables of dimension $D_{multitask}$, with each gene lying in the fixed range [0, 1]. If the skill factor of individual $i$ is $\tau_i = j$, the first $D_j$ genes of individual $i$ are used for crossover, mutation, encoding, and decoding.
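A minimal sketch of this unified representation (names are illustrative): the first $D_j$ genes in [0, 1] are linearly mapped to node indices in $[1, n_j]$ and can then be decoded with a helper such as `decode_lag` above.

```python
import numpy as np

def unified_to_task_genotype(unified: np.ndarray, n_task: int) -> list:
    """Map the first n_task genes in [0, 1] to integer node indices in [1, n_task]."""
    genes = unified[:n_task]
    # linear mapping [0, 1] -> {1, ..., n_task}
    return [int(np.clip(np.floor(g * n_task), 0, n_task - 1)) + 1 for g in genes]

rng = np.random.default_rng(1)
unified = rng.random(12)                           # unified genotype, D_multitask = 12
print(unified_to_task_genotype(unified, 9))        # task-specific genotype for a 9-point task
```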
In the initialization process of the algorithm, an initialization strategy based on the MST and k-means is used, because different single-objective clustering algorithms achieve good clustering performance in different regions of the Pareto front: connectivity-based algorithms tend to generate optimal solutions in regions of the Pareto front with low connectivity values, whereas compactness-based algorithms perform well in regions with low overall deviation. Since the MST and k-means are clustering algorithms based on connectivity and compactness, respectively, both are used to generate the initial solutions.
Since different clustering objectives can reflect completely different optimization criteria for clustering solutions, two complementary objective functions are selected in this algorithm: one is based on compactness and the other is based on cluster connectivity. Cluster compactness is expressed by calculating the overall deviation (Dev) of different cluster partitions, which can be calculated as the total distance between a data point and its corresponding cluster center:
$$\mathrm{Dev}(C) = \sum_{C_k \in C} \sum_{i \in C_k} \delta(i, u_k)$$
where $C$ is the set of clusters, $u_k$ is the center of cluster $C_k$, and $\delta(\cdot, \cdot)$ is the distance function (Euclidean distance is used here). As an objective function, the overall deviation should be minimized. The other objective function, which reflects the cluster connectivity criterion, is the connectivity index (Con); it evaluates the degree to which neighboring data points are placed in the same cluster and is formulated as follows:
$$\mathrm{Con}(C) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i, n_{ij}}, \quad \text{where } x_{r,s} = \begin{cases} \dfrac{1}{j}, & \text{if } \nexists\, C_k : r \in C_k \wedge s \in C_k \\ 0, & \text{otherwise} \end{cases}$$
where $n_{ij}$ is the $j$-th nearest neighbor of datum $i$, $N$ is the size of the clustered dataset, and $L$ is a parameter determining the number of neighbors that contribute to the connectivity measure. As an objective function, the connectivity index should also be minimized.
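The two objectives can be computed directly from a cluster assignment; the following Python sketch (an illustration, not the paper's code) evaluates the overall deviation and the connectivity index for a labeled dataset, penalizing each of the $L$ nearest neighbors that falls in a different cluster by $1/j$.

```python
import numpy as np

def clustering_objectives(X: np.ndarray, labels: np.ndarray, L: int = 10):
    """Return (overall deviation, connectivity index) for data X with cluster labels."""
    dev = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        dev += np.linalg.norm(members - centroid, axis=1).sum()

    # pairwise distances -> L nearest neighbors of each point (excluding itself)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :L]

    conn = 0.0
    for i in range(len(X)):
        for j, nb in enumerate(neighbors[i], start=1):
            if labels[nb] != labels[i]:          # neighbor assigned to a different cluster
                conn += 1.0 / j
    return dev, conn

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
labels = np.array([0] * 30 + [1] * 30)
print(clustering_objectives(X, labels, L=5))     # both objectives are to be minimized
```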

3.3. Evolutionary Operators Based on Correlation Learning

According to the design principle of MFEA, two randomly selected parents must meet certain conditions before they can cross. During crossover, parents with the same cultural background (skill factor) preferentially exchange genetic material. Therefore, if two randomly selected parents have the same skill factor $\tau$, they can cross freely; otherwise, crossover between them is performed only with the random mating probability ($rmp$), and mutation is applied instead when no crossover occurs. The pseudocode of the mating selection procedure is as follows (Algorithm 2):
Algorithm 2 Mating selection among individuals
 1: Two parent individuals P a and P b are randomly picked from the contemporary population P.
 2: Generate a random number rand between 0 and 1.
 3: if  τ a = τ b  then
 4:     Parent individuals P a and P b perform intra-task crossover to produce offspring individuals C a and C b .
 5: else if rand < rmp then
 6:     Parent individuals P a and P b perform inter-task crossover according to Algorithm 3 to produce offspring individuals C a and C b .
 7: else
 8:     Mutation of parent individual P a produces offspring individual C a .
 9:     Mutation of parent individual P b produces offspring individual C b .
10: end if
In the proposed algorithm, individuals with the same skill factor correspond to the same clustering task. If two parents have the same skill factor, they correspond to the same clustering task, so the number of samples, the coding length, and the data distribution are the same, and crossover between them can be carried out using simulated binary crossover (SBX). In addition, transfer learning between tasks is also crucial, because related knowledge exists across tasks. Although the evolutionary multi-tasking framework has great advantages in solving multi-tasking problems with evolutionary algorithms, the exchange of genetic material between tasks in the original MFEA cannot accelerate the convergence of the clustering tasks, and crossover between individuals with different skill factors may introduce negative transfer. It is therefore necessary to design a new crossover operator for individuals with different skill factors. The new multi-tasking crossover operator introduces the idea of learning model parameters across tasks, so that when individuals with different skill factors exchange genetic material, the direction of individual change is restricted toward the convergence direction of clustering learning. The pseudocode of this operator is as follows (Algorithm 3):
Algorithm 3 Evolutionary operators based on correlation learning
 1: Step1: Decoding.
 2: The genotypes of parent individuals P a and P b are decoded according to the number of samples in the task, whose skill factors are τ a and τ b , respectively, and the corresponding clustering results r a and r b are obtained by decoding.
 3: Step2: Initialization.
 4: The cluster labels h t of the inter-task correlation learning model are initialized according to the clustering results r a and r b .
 5: Initialize the similarity matrix M t based on the similarity measure of the dataset (in this case, Euclidean distance).
 6: The category indicator label Y t of the task is initialized according to the k-means algorithm.
 7: Two parent individuals P a and P b are randomly picked from the contemporary population P.
 8: W t is initialized as an all-ones matrix of size n t × h t .
 9: Step3: Learning inter-task parameter relevance.
10: for  t = 1 m  do
11:     for  s = 1 m  do
12:         if  s t  then
13:             The task relevance parameters G ( t , s ) are calculated.
14:         end if
15:     end for
16:     Calculating model parameters W t .
17:     Compute the cluster label matrix Y t .
18: end for
19: Get the task τ a and task τ b clustering results C i and C j .
20: The genotypes of P a and P b are updated according to the clustering results C i and C j to obtain the offspring individuals O a and O b .
When individuals with different skill factors are crossed, the sample numbers $n_a$ and $n_b$ of the corresponding tasks are obtained from the skill factors of parents $P_a$ and $P_b$, the first $n$ genes of each individual are decoded, the genes in [0, 1] are linearly mapped to [1, n], and the clustering results $r_a$ and $r_b$ of the parents are obtained. The cluster label of each sample in the task is initialized according to the clustering result, and the other related matrices are initialized according to the initialization strategy of the algorithm. In step 3, inter-task correlation learning, the clustering results are updated through intra-task linear model learning and inter-task model parameter correlation learning. In this step, genetic information is exchanged between individuals with different task preferences, and the convergence of the tasks corresponding to the individuals is also accelerated. In addition, in the design of the crossover operator, the task correlation learning model does not need to be optimized to full convergence; it is only iterated a fixed number of times. A local search evolution strategy is thus implied in the process of individual crossover, which both limits the random drift of the population individuals and reduces the time complexity of the algorithm. After the local search strategy, new clustering results $C_i$ and $C_j$ are obtained, where $C_i$ is a clustering result of the $i$-th task and $C_j$ is a clustering result of the $j$-th task. $C_i$ and $C_j$ then need to be encoded as the new offspring individuals $O_a$ and $O_b$. In the encoding process, for example, any connection in $P_a$ that disagrees with the clustering result indicated by $C_i$ is disconnected, the nearest sample belonging to the same cluster is found in the nearest-neighbor matrix of the disconnected sample, and the adjacency link is redirected to that sample, forming a new offspring individual. The specific operation is shown in Figure 3. Figure 3a is the genotype representation of parent individual $P_a$: sample point 3 belongs to the same cluster as sample point 2, and its link points to sample point 2. Figure 3b shows the clustering result obtained after inter-task model relationship learning in the crossover operator: sample point 3 is now assigned to another cluster, so the adjacency link between sample points 3 and 2 must be disconnected, and a new adjacency link for sample point 3 must be found within its new cluster. Since sample point 9 is the closest point to sample point 3 in the new cluster, the gene at position 3 of offspring $O_a$'s genotype is changed to 9. Finally, the genes in [1, n] are linearly mapped back to [0, 1], completing the crossover between individuals and the generation of offspring individuals.
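A minimal Python sketch of this re-encoding step (illustrative names and assumptions, not the paper's implementation): for every sample whose current link disagrees with the new cluster assignment, the link is redirected to the nearest sample in the same new cluster.

```python
import numpy as np

def reencode_genotype(parent_genotype, new_labels, X):
    """Repair a 1-indexed locus-based genotype so that it is consistent with new_labels.

    parent_genotype: list of 1-indexed link targets (one per sample).
    new_labels: new cluster label of each sample (e.g., from correlation learning).
    X: (n, d) data matrix used to find the nearest same-cluster neighbor.
    """
    n = len(parent_genotype)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)

    child = list(parent_genotype)
    for i in range(n):
        target = parent_genotype[i] - 1
        if new_labels[target] != new_labels[i]:            # link crosses a cluster boundary
            same_cluster = np.where(new_labels == new_labels[i])[0]
            same_cluster = same_cluster[same_cluster != i]
            if len(same_cluster) > 0:                      # redirect to nearest same-cluster sample
                child[i] = int(same_cluster[np.argmin(d[i, same_cluster])]) + 1
            else:                                          # singleton cluster: self-link
                child[i] = i + 1
    return child

rng = np.random.default_rng(0)
X = rng.random((9, 2))                                     # toy coordinates for nine samples
new_labels = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2])
print(reencode_genotype([2, 4, 2, 1, 5, 5, 9, 7, 8], new_labels, X))
```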
When parent individuals with the same skill factor are crossed, uniform crossover is selected. Because there is no bias in gene ordering for individuals belonging to the same task, uniform crossover is used to generate arbitrary combinations of alleles from the two parents. When an individual undergoes genotype mutation, a mutation operator based on a finite set of neighbors is used. For a dataset of size $N$, the search space during mutation is $N^N$. Using the neighbor-restricted mutation operator, each gene can only connect to one of its $L$ nearest neighbors when it mutates, where $L \ll N$; the search space then becomes $L^N$. This not only effectively reduces the amount of computation during mutation, but also avoids generating many meaningless connections. Since an individual generated in MFEA is unlikely to perform well on all tasks, the individual is evaluated only on the task on which it performs best. Therefore, the changes of individual genetic factors during the whole evolution process follow the concept of vertical cultural transmission through selective imitation: an offspring imitates the skill factor of either of its parents.
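The neighborhood-restricted mutation can be sketched as follows (illustrative Python, with an assumed per-gene mutation probability): a mutated gene is re-linked to a uniformly chosen member of its $L$ nearest neighbors.

```python
import numpy as np

def neighborhood_mutation(genotype, neighbors, p_m, rng):
    """Mutate a 1-indexed locus-based genotype.

    neighbors: (N, L) array; neighbors[i] holds the 0-indexed L nearest neighbors of sample i.
    p_m: per-gene mutation probability.
    """
    child = list(genotype)
    for i in range(len(genotype)):
        if rng.random() < p_m:
            child[i] = int(rng.choice(neighbors[i])) + 1   # re-link to a random near neighbor
    return child

rng = np.random.default_rng(42)
neighbors = np.array([[1, 2], [0, 2], [1, 3], [2, 4], [3, 2],
                      [4, 3], [7, 8], [6, 8], [7, 6]])      # toy neighbor lists, L = 2
print(neighborhood_mutation([2, 4, 2, 1, 5, 5, 9, 7, 8], neighbors, p_m=0.2, rng=rng))
```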

3.4. Analysis of Algorithm Complexity

Given a dataset with $m$ tasks, and assuming that the data of each task are of size $n$ and dimension $d$, the complexity of the task correlation learning step is $O(n^2 d)$, and the time complexity of fast non-dominated sorting is $O(pop^2)$, where $pop$ is the size of the subpopulation. In the initialization process, the time complexity of generating the initial solutions with the MST is $O(mn^2)$, and that of generating the initial solutions with k-means is $O(n(k+2)m/d)$. In addition, the time complexity of each individual function evaluation is $O(\max\{n^2, knd\})$, where $k \in \{1, \ldots, K_{max}\}$ is the number of clusters. The overall time complexity of the MFMOCK algorithm is $O(gen \cdot \max\{pop \cdot n, \; pop \cdot knd, \; pop \cdot n^2 d, \; pop^2\})$.

4. Experiment Setup and Results

In this section, the three groups of clustering datasets are first introduced, then the comparison algorithms and evaluation indicators are described, and the parameters are analyzed and set. Finally, the clustering results and algorithm convergence are compared and analyzed. All programs were run on a 13th Gen Intel(R) Core(TM) i9-13900K CPU (3.00 GHz) with 32 GB of RAM, and were implemented in MATLAB R2023a.

4.1. Datasets

In order to evaluate the proposed algorithm model, this experiment used three different sets of experimental data for testing. The characteristics of all datasets are described in Table 3, Table 4 and Table 5. The first set of datasets was a manually generated two-dimensional dataset, which was used to verify the robustness of the algorithm to the overlap between clusters, and the size and the shape of clusters. The data in these datasets are two-dimensional normally distributed data with a fixed cluster size, mean vector, and standard deviation vector. The instance distribution of the manual dataset is shown in Figure 4.
The second set of datasets was a standard cluster model generated by a random generator and conforming to a multivariate normal distribution. In the first random generator, randomly generated low-dimensional datasets have random directions, whereas in higher dimensions, the clusters are more spherical in shape. In the second generator, the data distribution is more biased towards arbitrarily distributed ellipsoids. The generator and the dataset used in this article are available at https://github.com/garzafabre/Delta-MOCK (accessed on 14 May 2024). The third set of datasets was UCI datasets, all available on the UCI website https://archive.ics.uci.edu/ (accessed on 14 May 2024).

4.2. Comparison Algorithm and Evaluation Index

The comparison algorithms in the experiment are as follows: k-means (KM) is a traditional clustering method based on Euclidean distance; single linkage (SL) is a hierarchical clustering algorithm that defines cluster proximity as the distance between the two closest points of two different clusters; average linkage (AL) is a hierarchical clustering algorithm that defines cluster proximity as the average proximity of pairs of points taken from two different clusters; spectral clustering (SC) is a clustering algorithm based on spectral graph theory that transforms the clustering problem into an optimal graph partition problem [51]; fuzzy C-means (FCM) is a fuzzy clustering algorithm in which each data point belongs to clusters with graded memberships; MOCK is a multi-objective clustering algorithm that automatically determines the number of clusters; and the multiple information exchange multi-objective clustering algorithm based on MOCK (MIE-MOCK) is a MOCK variant with a random crossover operator pool and a random mutation operator pool [52].
To accurately evaluate the clustering results, this paper uses performance metrics such as clustering accuracy, normalized mutual information, and the adjusted Rand index, which are standard metrics widely used for clustering. These performance indexes are introduced as follows:
Clustering accuracy (ACC): Clustering accuracy discovers the one-to-one relationship between clustering labels and actual class labels, and measures the extent to which each cluster contains data points from the actual classes. The clustering accuracy is defined as follows:
$$Acc = \frac{\sum_{i=1}^{n} \delta\left(map(r_i), l_i\right)}{n}$$
where $r_i$ is the cluster label of datum $x_i$, $l_i$ is the actual class label of datum $x_i$, $n$ is the total number of data points, $\delta(x, y)$ is the delta function, equal to 1 if $x = y$ and 0 otherwise, and $map(r_i)$ is the mapping function that maps each cluster label $r_i$ to an equivalent class label of the dataset.
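In practice, the mapping function is usually obtained by solving an optimal assignment between cluster labels and class labels; the following Python sketch (one common way to do this, using SciPy's Hungarian solver, not necessarily the paper's implementation) computes the clustering accuracy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred: np.ndarray, truth: np.ndarray) -> float:
    """ACC with the optimal one-to-one mapping between cluster and class labels."""
    clusters, classes = np.unique(pred), np.unique(truth)
    # contingency[i, j] = number of points with cluster i and true class j
    contingency = np.array([[np.sum((pred == c) & (truth == t)) for t in classes]
                            for c in clusters])
    row, col = linear_sum_assignment(-contingency)     # maximize matched counts
    return contingency[row, col].sum() / len(truth)

pred = np.array([0, 0, 1, 1, 1, 2, 2])
truth = np.array([1, 1, 0, 0, 0, 2, 2])
print(clustering_accuracy(pred, truth))    # 1.0 after relabeling
```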
Normalized mutual information (NMI): The second metric is normalized mutual information (NMI), which is used to determine the quality of the clusters. Given the clustering results, NMI is defined as follows:
$$NMI = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{i,j} \log \dfrac{n \cdot n_{i,j}}{n_i \hat{n}_j}}{\sqrt{\left(\sum_{i=1}^{c} n_i \log \dfrac{n_i}{n}\right)\left(\sum_{j=1}^{c} \hat{n}_j \log \dfrac{\hat{n}_j}{n}\right)}}$$
where $n_i$ denotes the number of data points contained in cluster $C_i$ ($1 \le i \le c$), $\hat{n}_j$ denotes the number of data points contained in class $L_j$ ($1 \le j \le c$), and $n_{i,j}$ denotes the number of data points in the intersection of the true class $L_j$ and cluster $C_i$. A larger NMI indicates a better clustering result.
Adjusted Rand index (ARI): This measures the agreement between two partitions based on whether pairs of instances are placed in the same or different clusters, and is defined as follows:
$$\begin{aligned} ARI_1 &= \sum_{i=1}^{k_a} \sum_{j=1}^{k_b} \binom{n_{ij}}{2} - \left[\sum_{i=1}^{k_a} \binom{n_i}{2} \cdot \sum_{j=1}^{k_b} \binom{n_j}{2}\right] \Big/ \binom{n}{2} \\ ARI_2 &= \frac{1}{2}\left[\sum_{i=1}^{k_a} \binom{n_i}{2} + \sum_{j=1}^{k_b} \binom{n_j}{2}\right] - \left[\sum_{i=1}^{k_a} \binom{n_i}{2} \cdot \sum_{j=1}^{k_b} \binom{n_j}{2}\right] \Big/ \binom{n}{2} \\ ARI &= ARI_1 / ARI_2 \end{aligned}$$
where $n_{ij}$ is the number of instances shared by cluster $c_i$ of solution $x_a$ and cluster $c_j$ of solution $x_b$, $n_i$ is the number of instances in cluster $c_i$ of solution $x_a$, $n_j$ is the number of instances in cluster $c_j$ of solution $x_b$, and $k_a$ and $k_b$ are the numbers of clusters in solutions $x_a$ and $x_b$, respectively.
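A small Python sketch of the ARI computed from the contingency table, following the definition above (illustrative only):

```python
import numpy as np

def comb2(x):
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """ARI between two labelings, computed from the contingency table."""
    a_vals, b_vals = np.unique(labels_a), np.unique(labels_b)
    n_ij = np.array([[np.sum((labels_a == a) & (labels_b == b)) for b in b_vals]
                     for a in a_vals])
    sum_a = comb2(n_ij.sum(axis=1)).sum()     # sum_i C(n_i, 2)
    sum_b = comb2(n_ij.sum(axis=0)).sum()     # sum_j C(n_j, 2)
    expected = sum_a * sum_b / comb2(len(labels_a))
    ari_1 = comb2(n_ij).sum() - expected
    ari_2 = 0.5 * (sum_a + sum_b) - expected
    return ari_1 / ari_2

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])
print(adjusted_rand_index(a, b))   # 1.0: identical partitions up to relabeling
```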
Hypervolume [53] is an indicator consistent with the Pareto dominance concept; its value represents the volume of the region in the objective space enclosed by the individuals in the solution set and a reference point. Therefore, hypervolume can be used to evaluate both the convergence and the distribution of the solution set: a higher hypervolume value indicates better convergence and distribution.
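For the bi-objective minimization case used in this paper, the hypervolume can be computed by sorting the non-dominated points and summing rectangles up to the reference point; the following Python sketch (a simple illustration, with an assumed reference point) shows this.

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Hypervolume of a 2-objective minimization front w.r.t. reference point ref."""
    pts = points[np.all(points < ref, axis=1)]           # keep points dominating the reference
    pts = pts[np.argsort(pts[:, 0])]                      # sort by first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                                  # skip dominated points
            hv += (ref[0] - f1) * (prev_f2 - f2)          # rectangle up to the reference point
            prev_f2 = f2
    return hv

front = np.array([[0.1, 0.9], [0.4, 0.5], [0.9, 0.1]])
print(hypervolume_2d(front, ref=np.array([1.0, 1.0])))    # 0.37 for this toy front
```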

4.3. Parameter Analysis and Setup

The experiments in this chapter were carried out on the datasets described in Section 4.1. The characteristics of each dataset are shown in Table 3, Table 4 and Table 5. The number of clusters in all experiments was set to the number of real classes of the data. Each algorithm was run independently 10 times, and the average of the final results was calculated. The proportion of correctly classified samples was taken as the clustering accuracy, and the accuracy of the algorithm was reflected by the accuracy and mutual information. Taking the Sizes and 50d-20c datasets as examples, Figure 5 shows the parameter analysis of MFMOCK, i.e., the relationship between the random mating probability $rmp$ and the population size $pop$, where the number of generations $gen$ was set to 100. Under different random mating probabilities, the clustering accuracy was low when the population size $pop$ was small, and when $pop$ was not less than 50, the clustering accuracy of MFMOCK gradually became stable. Because the data distribution of the Sizes dataset is relatively simple, with few classes and a low dimension, the effect of $rmp$ on the clustering results is difficult to observe on it. However, as can be clearly seen from the clustering results of MFMOCK on the 50d-20c dataset, under the same population size the clustering accuracy of the algorithm was higher when $rmp$ was 0.9. Based on these parameter experiments, and considering the stability and time complexity of the MFMOCK algorithm, $pop$, $gen$, and $rmp$ were set to 100, 100, and 0.9, respectively.

4.4. Comparison Experiment of Clustering Results

In this clustering performance test, the comparison algorithms were single-task clustering algorithms; that is, there was only one clustering task in each run, and the dataset of that task conformed to the assumption of independent and identical distribution. The MFMOCK algorithm is a clustering algorithm based on evolutionary multi-tasking. In order to verify that MFMOCK effectively shares data features between multiple tasks while learning the clustering tasks, it was necessary to design multi-task clustering data for each test dataset. According to the distribution characteristics of each dataset, a multi-tasking auxiliary test dataset containing relevant knowledge (the data distribution of some categories is consistent with the original dataset) was generated. The number of categories and samples in the multi-tasking auxiliary test dataset may not be consistent with the original dataset. Moreover, random sample points were added to the auxiliary test set to test the multi-tasking clustering learning performance of the algorithm.
In this experiment, each algorithm was run 10 times, and the average accuracy of the clustering results under the optimal parameters was reported. This section compares common single-objective and multi-objective clustering algorithms, including KM, SL, AL, SC, FCM, MOCK, and MIE-MOCK. The results are shown in the following tables, where the performance indicators Acc and NMI are given in percentages, and the ARI index lies in the range [−1, 1].
Table 6 shows the clustering results of all algorithms on the manually generated datasets. This test mainly examined the clustering performance of each algorithm on datasets with different shapes and distributions. As shown in the table, different classical algorithms had different preferences for clustering datasets with different shapes. The KM algorithm showed excellent performance in clustering the Square and Sizes datasets, and the single linkage algorithm performed well on the Smile and Long datasets. Based on the objective functions of overall deviation and connectivity index, the MFMOCK algorithm makes no assumptions about the shape of different clusters, so it can detect clusters of arbitrary shapes; however, it performed worse on datasets with overlapping data from different categories. In addition, the MFMOCK algorithm produced higher-quality clustering results for the Sizes and Triangle datasets than the other algorithms. For the Square dataset, because the dataset is relatively simple and the amounts of data and categories are small, the accuracy of the classical clustering algorithms was already high, and the accuracy improvement of the evolutionary MOCK-like algorithms was not obvious. The clustering results on the whole set of datasets show that the MFMOCK algorithm achieved a partial improvement in clustering performance compared with the traditional single-objective clustering algorithms.
Table 7 shows the results on random datasets with specific dimensions and numbers of cluster categories, generated by the random generators to test the clustering performance of the algorithms under different combinations of dimension and number of categories. In the manually generated datasets, the data distributions present many geometric shapes that rarely appear in practical applications, and the individual variables are uncorrelated with each other, so the second, randomly generated set of datasets was used. Compared with the manual datasets, the random datasets contain test data with higher dimensions and more categories: the dimension varies from 2 to 100, and the number of cluster categories varies from 4 to 40. This also tested whether the clustering advantages of MFMOCK on manual datasets can be extended to more realistic, general, high-dimensional, large-scale datasets. The experimental results show that the classical single-objective clustering algorithms (e.g., KM and SC) had a better clustering effect when the data samples had low dimensionality and few categories. When the dimension was increased to 10 and the number of classes was 10 or 20, the performance of classical single-objective clustering decreased significantly compared with the evolutionary multi-objective clustering algorithms. As the sample dimension and the number of categories continued to increase, MFMOCK showed excellent clustering performance compared with single-task evolutionary multi-objective clustering algorithms such as MOCK and MIE-MOCK. Compared with low-dimensional datasets with few classes, MFMOCK achieved a more obvious performance improvement when solving high-dimensional clustering problems with a large number of classes. The introduction of evolutionary multi-tasking allowed the evolutionary clustering algorithm to exhibit better performance on complex clustering datasets.
Table 8 compares the experimental results of the algorithms on seven UCI datasets in terms of accuracy, mutual information, and adjusted Rand index, where Append is short for the Appendicitis dataset and Aggregate is short for the Aggregation dataset. Across the various indicators, MFMOCK obtained the best results among the comparison algorithms on four datasets: Thyroid, Aggregate, Cancer, and Wine. For the Zoo dataset, MFMOCK was about 0.02 worse than the best algorithm (AL) in terms of clustering accuracy, but it was optimal in terms of the ARI index. For the Jain dataset, the clustering accuracy of MFMOCK was about 0.05 worse than the best algorithm (MOCK), and the other indicators were also lower than those of MOCK. For NMI and ARI, MFMOCK was higher than the other algorithms on most datasets, and on the datasets where its clustering performance was suboptimal, the difference between MFMOCK and the optimal algorithm was small, i.e., within 0.07.

4.5. Comparison Experiment of Algorithm Convergence

In order to verify that MFMOCK effectively exchanges genetic factors among multiple clustering tasks, thereby promoting the convergence of the problem and improving clustering performance, a comparison experiment between MOCK and MFMOCK was carried out. MOCK clusters only one dataset, while MFMOCK simultaneously clusters the dataset and an auxiliary clustering test set. The auxiliary clustering test set contains some categories with the same data distribution as the clustering test set, and the number of categories and samples it contains may not be the same as those of the clustering dataset.
To judge the output of MFMOCK in terms of solution quality and algorithm robustness, the following statistics were considered (a small computational sketch follows this list):
1. Average (Avg): the average objective function value of MFMOCK over all evolution rounds;
2. Coefficient of variation (CV): the ratio of the standard deviation of the objective value to its mean;
3. Average gap (AG): the difference between the mean and the true value, expressed as a percentage of the true value.
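As a concrete illustration of these three statistics, the short routine below computes Avg, CV, and AG from a series of objective values; the per-round Dev values and the reference "true" objective are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def summarize(values, true_value):
    """Return Avg, CV, and AG (the latter two as percentages) of a series of objective values."""
    values = np.asarray(values, dtype=float)
    avg = values.mean()                             # Avg: mean objective value over all rounds
    cv = values.std(ddof=1) / avg * 100             # CV: standard deviation divided by the mean
    ag = abs(avg - true_value) / true_value * 100   # AG: gap between the mean and the true value
    return avg, cv, ag

dev_per_round = [3810.2, 3862.7, 3851.1, 3855.4]    # hypothetical Dev values over evolution rounds
avg, cv, ag = summarize(dev_per_round, true_value=3774.0)
print(f"Avg={avg:.2f}  CV={cv:.3f}%  AG={ag:.3f}%")
```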
In all of the following experiments, the population consisted of 100 individuals and evolved for 200 generations. Both MFMOCK and MOCK used the LAG-based chromosome representation, and the corresponding genetic operators, namely the SBX crossover operator and the Gaussian mutation operator, were used to search the solution space; moreover, the random mating probability rmp in MFMOCK was set to 0.7. Table 9 lists the test data used in this comparison. To test whether MFMOCK can optimize multiple similar clustering tasks at the same time and improve the performance of each task, three groups of multi-tasking test sets were constructed, each stressing MFMOCK in a different way. MOCK performed only the clustering problem of the dataset corresponding to task 1. The specific differences are shown in Table 9.
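For context, the role of rmp can be sketched as follows. This follows the generic MFEA assortative-mating rule and is not MFMOCK's exact operator (which additionally gates transfer by task correlation, as discussed below); parent objects with a skill_factor attribute and externally supplied crossover/mutation operators are assumptions of the sketch.

```python
import random

def assortative_mating(pa, pb, crossover, mutate, rmp=0.7):
    """MFEA-style mating: cross-task recombination is allowed only with probability rmp."""
    if pa.skill_factor == pb.skill_factor or random.random() < rmp:
        c1, c2 = crossover(pa, pb)                # intra-task, or permitted inter-task transfer
        for child in (c1, c2):                    # offspring imitate one parent's skill factor
            child.skill_factor = random.choice((pa.skill_factor, pb.skill_factor))
    else:
        c1, c2 = mutate(pa), mutate(pb)           # no transfer: each parent is only mutated
        c1.skill_factor, c2.skill_factor = pa.skill_factor, pb.skill_factor
    return c1, c2
```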
The two tasks in dataset 1, Square and Sizes, have the same number of samples, feature dimensions, and classes. Both contain spherical Gaussian clusters, but because they were manually generated, their distributions differ. Figure 6a shows the convergence trend of the hypervolume values of MFMOCK and MOCK on dataset 1. Already in the initial evolution stage, the curve of MFMOCK was better than that of MOCK, indicating better overall convergence. Since MOCK and MFMOCK are identical in the encoding of population individuals and in the design of the crossover and mutation operators, the performance improvement can be attributed to the transfer of relevant clustering knowledge between tasks through implicit genetic transfer, and to the ability of multi-tasking correlation learning to mine similar data features. The hypervolume value of MFMOCK exceeded that of MOCK in most cases, so the Pareto solution set of MFMOCK was more evenly distributed. In clustering problems, this dominated area usually reflects the distribution and degree of separation of the different clusters, so a larger value may indicate higher separation between clusters. Hypervolume also captures the convergence and diversity of the solution set; in cluster analysis, this means that the algorithm can find clusters of different shapes and sizes while keeping their centers as close as possible to the true ones. These advantages allowed MFMOCK to achieve better performance on clustering problems. In terms of the quality of the solutions generated during multi-tasking evolution, the results in Table 10 show that MFMOCK performed well on the multi-tasking clustering of the manual datasets: the coefficients of variation of the overall deviation and connectivity index in the two tasks were within 0.02, and the final clustering solution was within 0.03 of the objective function values of the true classification. This illustrates the excellent performance of MFMOCK in clustering handcrafted datasets.
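For reference, a minimal two-dimensional hypervolume routine for a bi-objective minimization front such as (Dev, Con) is sketched below. The front values and the reference point are illustrative assumptions; the paper does not state the reference point used for Figure 6.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Area dominated by a 2-D Pareto front (both objectives minimized) inside the reference box."""
    pts = np.asarray(sorted(front))           # ascending in f1; f2 decreases along a non-dominated front
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f1 >= ref[0] or f2 >= prev_f2:
            continue                           # point outside the box or dominated in f2
        hv += (ref[0] - f1) * (prev_f2 - f2)   # horizontal strip contributed by this point
        prev_f2 = f2
    return hv

front = [(1.0, 4.0), (2.0, 2.5), (3.0, 1.0)]   # hypothetical (Dev, Con) values of one front
print(hypervolume_2d(front, ref=(5.0, 5.0)))    # 4.0 + 4.5 + 3.0 = 11.5
```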
The two tasks in dataset 2 were the 100d-10c and 100d-40c datasets, which share only the feature dimension and differ in the numbers of samples and classes. As shown in Figure 6b, this simulates a multi-domain test environment with different numbers of categories and inconsistent data distributions. Compared with MOCK, the convergence trend of the function values in the figure shows the superior convergence of MFMOCK. Table 11 reports the performance of MFMOCK on dataset 2: the algorithm clustered task 2 well, with the coefficient of variation of the objective values within 0.025 and a deviation of about 0.02 between the true and final values. Owing to the greater complexity of the data and categories, the clustering performance on task 1 was worse than on task 2.
As shown in Table 12, the two tasks in dataset 3 were the 50d-20c-1 and 50d-20c-2 datasets, which differ only in the number of samples; they have the same feature dimensionality and number of classes, and similar data distributions because they were produced by the same random generator. As shown in Figure 6c, this simulates a multi-domain test environment in which the tasks have the same number of categories but different sample sets. Compared with MOCK, MFMOCK showed good convergence speed and clustering accuracy. In MFEA, knowledge transfer is completely oblivious to task relatedness, and evolutionary multi-tasking therefore does not necessarily guarantee that the performance of every task improves, because not all genetic transfer is useful. While certain tasks benefit from the implicit genetic transfer available during multi-tasking, others may be negatively affected. In the MFMOCK algorithm proposed in this paper, for multiple related multi-objective clustering problems, the strategy of random crossover between arbitrary individuals is abandoned, and multi-tasking correlation learning is introduced into the crossover operator. Knowledge is transferred only between strongly correlated tasks, which reduces the possibility of negative transfer. The experimental results show that the algorithm effectively transfers useful genetic information and accelerates convergence.
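The gating idea can be summarized in a few lines. The threshold, the example similarity values, and the way the task-similarity matrix M (see the nomenclature table) is learned are illustrative assumptions of this sketch, not the paper's exact correlation-learning procedure.

```python
def transfer_allowed(task_a, task_b, M, threshold=0.5):
    """Permit inter-task crossover only when the learned task similarity is high enough."""
    return task_a == task_b or M[task_a][task_b] >= threshold

M_strong = [[1.0, 0.8], [0.8, 1.0]]   # hypothetical similarity matrix for two related tasks
M_weak   = [[1.0, 0.2], [0.2, 1.0]]   # hypothetical matrix for two weakly related tasks
print(transfer_allowed(0, 1, M_strong))   # True: knowledge transfer permitted
print(transfer_allowed(0, 1, M_weak))     # False: fall back to intra-task variation
```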

5. Conclusions

In this paper, a multi-objective clustering framework based on evolutionary multi-tasking is proposed. By introducing MFEA into an evolutionary multi-objective optimization algorithm, evolutionary multi-objective clustering is extended from single-task to multi-task learning, and similar knowledge in different clustering tasks is shared through the implicit parallelism of the population search. First, by building an evolutionary multi-tasking framework, the algorithm improves the encoding of population individuals on the basis of MOCK, which gives individuals with different skill factors a common basis for exchanging genetic material. In addition, a new crossover operator between individuals is designed, and the idea of correlation learning between tasks is introduced into it. By controlling the number of iterations of the learning algorithm, the crossover operator acts as a local search, so that individuals representing different clustering tasks can transfer relevant knowledge between tasks when exchanging genetic material and evolve towards the convergence direction of the clustering algorithm. By restricting the search direction of individual evolution, the evolution of the population is accelerated. Comparative experiments on manual, randomly generated, and UCI datasets verified the multi-tasking learning ability and the final clustering quality of the model, as well as the feasibility and effectiveness of evolutionary multi-tasking in accelerating multi-objective clustering. In the future, we will study more objective functions and verify the proposed framework on more complex datasets. We also plan to improve MFMOCK so that it can solve more diverse real-world clustering problems, and to study more efficient individual encodings to improve its optimization efficiency.

Author Contributions

Conceptualization, Y.W., R.Y. and M.G.; software, K.D., L.L. and H.L.; formal analysis, K.D.; investigation, Y.W. and K.D.; resources, Y.W. and K.D.; data curation, Y.W. and K.D.; writing—original draft preparation, Y.W., K.D. and L.L.; writing—review and editing, R.Y. and M.G.; visualization, K.D. and L.L.; supervision, Y.W. and H.L.; project administration, R.Y. and M.G.; funding acquisition, H.L. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62036006, and in part by the Fundamental Research Funds for the Central Universities.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Schematic representation of trace-based adjacency.
Figure 2. Schematic representation of multi-task adjacency.
Figure 3. Schematic diagram of the genotype update.
Figure 4. Handcrafted dataset sample instance distribution diagram. (a) Square, (b) Long, (c) Smile, (d) Triangle, (e) Sizes.
Figure 5. Plot of the variation of parameters pop and gen in (a) Sizes and (b) 50d-20c.
Figure 6. Comparison of hypervolume value curves between MOCK and MFMOCK on three datasets: (a) dataset 1, (b) dataset 2, (c) dataset 3.
Table 1. Abbreviation declaration table.
Abbreviation | Meaning
MOP | multi-objective optimization problem
MOEA | multi-objective evolutionary algorithm
LAG | locus-based adjacency graph
TAG | trajectory adjacency graph
PF | Pareto front
CF | control front
SNMF | symmetric non-negative matrix factorization
Dev | overall deviation
Con | connectivity index
SBX | simulated binary crossover
ACC | clustering accuracy
ARI | adjusted Rand index
Avg | average
CV | coefficient of variation
AG | average gap
Table 2. Nomenclature declaration table.
Nomenclature | Meaning
m | amount of tasks
X | dataset
n | amount of data in dataset
M | similarity matrix between tasks
C | clusters vectors
D | dimension of task
h | amount of cluster
Y | cluster label matrix
r | cluster label
l | actual class label
W | regularization term
G | cluster correlation coefficient matrix
T | task list
f | objective function
p | individual in evolution algorithm
P | population in evolution algorithm
Ψ | individual evaluation value
φ | individual fitness
τ | skill factor
r | clustering result
O | gene in evolution algorithm
N | dataset size
pop | population size
k | amount of cluster
gen | population evolution generation
rmp | random mating probability
Table 3. Manual dataset characteristics.
Dataset | Samples | Feature Dimension | Classes
Square | 1000 | 2 | 4
Smile | 1000 | 2 | 4
Long | 1000 | 2 | 2
Sizes | 1000 | 2 | 4
Triangle | 1000 | 2 | 4
Table 4. Characteristics of the randomly generated datasets.
Dataset | Generator | Samples | Feature Dimensions | Classes
10d-4c | 1 | 966 | 10 | 4
50d-4c | 2 | 351 | 50 | 4
100d-4c | 2 | 1286 | 100 | 4
2d-10c | 1 | 3408 | 2 | 10
10d-10c | 1 | 2161 | 10 | 10
50d-10c | 2 | 2919 | 50 | 10
100d-10c | 2 | 2103 | 100 | 10
2d-20c | 1 | 1231 | 2 | 20
10d-20c | 1 | 1279 | 10 | 20
50d-20c | 2 | 1254 | 50 | 20
100d-20c | 2 | 1220 | 100 | 20
2d-40c | 1 | 2563 | 2 | 40
10d-40c | 1 | 1937 | 10 | 40
50d-40c | 2 | 2169 | 50 | 40
100d-40c | 2 | 1964 | 100 | 40
Table 5. UCI dataset features.
Dataset | Samples | Feature Dimensions | Classes
Iris | 150 | 4 | 3
Thyroid | 215 | 5 | 3
Aggregate | 788 | 2 | 7
Cancer | 683 | 9 | 2
Jain | 373 | 2 | 2
Wine | 178 | 13 | 3
Append | 106 | 7 | 2
Table 6. Clustering results for the manually generated datasets.
ACC
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Square | 98.00 | 25.30 | 97.50 | 99.00 | 98.90 | 99.10 | 99.20 | 99.00
Long | 90.60 | 93.30 | 68.40 | 93.20 | 93.40 | 89.50 | 91.70 | 90.80
Smile | 98.60 | 99.10 | 99.10 | 98.70 | 98.50 | 100.00 | 100.00 | 100.00
Sizes | 96.00 | 76.80 | 90.10 | 63.50 | 68.00 | 96.30 | 84.60 | 98.40
Triangle | 86.80 | 97.40 | 97.30 | 97.50 | 83.30 | 99.10 | 99.20 | 99.60
NMI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Square | 95.60 | 30.30 | 94.24 | 95.60 | 95.23 | 96.10 | 96.30 | 95.87
Smile | 79.86 | 74.63 | 54.21 | 78.84 | 79.35 | 72.12 | 76.05 | 73.52
Long | 98.40 | 99.50 | 99.50 | 99.10 | 99.40 | 100.00 | 100.00 | 100.00
Sizes | 87.39 | 69.33 | 68.19 | 37.91 | 46.96 | 88.02 | 80.81 | 88.56
Triangle | 95.98 | 99.06 | 98.42 | 93.28 | 92.79 | 93.19 | 95.33 | 99.58
ARI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Square | 0.96 | 0.25 | 0.95 | 0.97 | 0.97 | 0.98 | 0.98 | 0.97
Smile | 0.84 | 0.82 | 0.52 | 0.83 | 0.84 | 0.74 | 0.79 | 0.77
Long | 0.97 | 0.98 | 0.98 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00
Sizes | 0.91 | 0.31 | 0.85 | 0.24 | 0.39 | 0.91 | 0.82 | 0.95
Triangle | 0.88 | 0.97 | 0.97 | 0.96 | 0.81 | 0.99 | 0.99 | 1.00
Table 7. Clustering results for the randomly generated datasets.
ACC
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
10d-4c | 98.76 | 54.14 | 65.73 | 99.17 | 96.58 | 99.79 | 99.69 | 99.90
50d-4c | 65.24 | 58.40 | 52.42 | 78.35 | 65.81 | 94.87 | 99.29 | 99.72
100d-4c | 76.91 | 67.96 | 43.86 | 77.99 | 58.79 | 83.83 | 86.00 | 91.52
2d-10c | 70.28 | 28.35 | 75.35 | 77.58 | 81.04 | 55.19 | 75.76 | 92.08
10d-10c | 75.01 | 23.69 | 71.03 | 74.36 | 48.22 | 82.97 | 95.05 | 92.32
50d-10c | 61.53 | 17.37 | 38.92 | 79.10 | 35.90 | 91.71 | 92.26 | 99.55
100d-10c | 64.15 | 17.78 | 42.61 | 80.84 | 29.58 | 85.97 | 82.45 | 89.28
2d-20c | 91.63 | 82.94 | 91.31 | 77.17 | 79.04 | 85.62 | 84.32 | 97.32
10d-20c | 80.92 | 75.53 | 94.76 | 69.27 | 22.99 | 96.72 | 99.98 | 99.92
50d-20c | 63.72 | 16.67 | 31.82 | 60.53 | 44.98 | 58.53 | 61.08 | 82.38
100d-20c | 54.51 | 17.30 | 36.64 | 54.43 | 36.31 | 70.16 | 70.33 | 78.03
2d-40c | 67.69 | 78.97 | 46.20 | 82.40 | 56.46 | 82.25 | 76.71 | 86.73
10d-40c | 84.82 | 88.07 | 94.90 | 49.20 | 11.20 | 96.01 | 87.61 | 87.35
50d-40c | 61.78 | 20.01 | 22.22 | 33.93 | 26.37 | 55.19 | 43.34 | 77.59
100d-40c | 66.80 | 12.88 | 25.15 | 35.79 | 41.14 | 51.93 | 59.41 | 76.78
NMI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
10d-4c | 95.06 | 36.41 | 63.26 | 96.50 | 87.86 | 99.02 | 99.45 | 98.48
50d-4c | 57.81 | 12.87 | 42.47 | 61.48 | 57.58 | 95.27 | 91.58 | 98.44
100d-4c | 65.68 | 64.20 | 38.64 | 64.24 | 56.24 | 85.07 | 91.28 | 97.94
2d-10c | 77.85 | 18.98 | 75.57 | 81.24 | 82.02 | 59.94 | 74.47 | 90.79
10d-10c | 82.23 | 25.57 | 71.46 | 79.28 | 39.09 | 83.55 | 94.27 | 89.56
50d-10c | 61.22 | 23.42 | 35.81 | 77.72 | 32.75 | 92.35 | 91.24 | 99.29
100d-10c | 66.09 | 23.92 | 42.66 | 78.90 | 22.40 | 80.22 | 78.40 | 91.87
2d-20c | 94.23 | 89.18 | 94.23 | 81.03 | 89.00 | 90.57 | 90.24 | 95.23
10d-20c | 92.82 | 84.02 | 95.67 | 76.39 | 27.49 | 99.85 | 100.00 | 97.56
50d-20c | 66.58 | 15.32 | 34.43 | 63.70 | 53.77 | 64.00 | 67.89 | 82.84
100d-20c | 69.25 | 12.87 | 31.96 | 46.81 | 45.17 | 74.35 | 73.53 | 79.06
2d-40c | 82.87 | 92.12 | 66.01 | 92.51 | 72.91 | 92.12 | 90.83 | 93.22
10d-40c | 94.71 | 93.04 | 98.29 | 63.82 | 17.81 | 96.54 | 92.69 | 90.75
50d-40c | 72.06 | 21.82 | 33.69 | 48.33 | 35.31 | 65.84 | 44.25 | 82.19
100d-40c | 70.66 | 14.55 | 33.77 | 49.65 | 56.89 | 61.44 | 62.14 | 84.91
ARI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
10d-4c | 0.96 | 0.29 | 0.52 | 0.97 | 0.91 | 0.99 | 1.00 | 0.99
50d-4c | 0.42 | 0.09 | 0.23 | 0.49 | 0.42 | 0.88 | 0.85 | 0.89
100d-4c | 0.47 | 0.54 | 0.20 | 0.53 | 0.37 | 0.83 | 0.84 | 0.88
2d-10c | 0.64 | 0.28 | 0.71 | 0.70 | 0.74 | 0.51 | 0.67 | 0.88
10d-10c | 0.74 | 0.24 | 0.71 | 0.69 | 0.39 | 0.82 | 0.95 | 0.91
50d-10c | 0.34 | 0.27 | 0.09 | 0.62 | 0.26 | 0.94 | 0.94 | 1.00
100d-10c | 0.40 | 0.11 | 0.13 | 0.63 | 0.16 | 0.90 | 0.86 | 0.99
2d-20c | 0.73 | 0.34 | 0.56 | 0.70 | 0.67 | 0.81 | 0.44 | 0.81
10d-20c | 0.82 | 0.70 | 0.93 | 0.58 | 0.13 | 1.00 | 1.00 | 0.97
50d-20c | 0.28 | 0.02 | 0.06 | 0.43 | 0.32 | 0.33 | 0.36 | 0.66
100d-20c | 0.22 | 0.01 | 0.03 | 0.11 | 0.17 | 0.08 | 0.04 | 0.24
2d-40c | 0.66 | 0.80 | 0.38 | 0.82 | 0.48 | 0.82 | 0.76 | 0.85
10d-40c | 0.87 | 0.82 | 0.98 | 0.46 | 0.06 | 0.95 | 0.89 | 0.90
50d-40c | 0.27 | 0.02 | 0.04 | 0.24 | 0.09 | 0.30 | 0.08 | 0.54
100d-40c | 0.26 | 0.01 | 0.03 | 0.24 | 0.31 | 0.23 | 0.28 | 0.68
Table 8. Clustering results on the UCI datasets.
ACC
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Iris | 89.33 | 77.33 | 91.33 | 88.67 | 89.33 | 86.66 | 88.60 | 89.30
Thyroid | 73.40 | 75.81 | 76.74 | 57.16 | 79.63 | 79.50 | 79.60 | 82.30
Aggregate | 77.55 | 86.04 | 94.54 | 76.41 | 75.85 | 98.69 | 98.06 | 99.71
Cancer | 44.93 | 60.76 | 62.37 | 21.92 | 30.69 | 57.57 | 58.43 | 81.99
Jain | 77.88 | 80.16 | 93.83 | 75.23 | 77.48 | 99.40 | 80.48 | 94.73
Wine | 63.93 | 60.67 | 54.49 | 61.74 | 68.54 | 53.54 | 56.40 | 71.35
Append | 80.66 | 82.08 | 78.30 | 75.47 | 79.25 | 65.94 | 84.81 | 87.74
NMI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Iris | 75.15 | 65.60 | 80.54 | 73.64 | 74.50 | 72.69 | 70.74 | 74.60
Thyroid | 31.82 | 21.85 | 16.90 | 19.16 | 32.81 | 32.15 | 34.10 | 35.60
Aggregate | 79.61 | 89.94 | 92.34 | 98.45 | 77.18 | 80.08 | 72.58 | 99.07
Cancer | 32.65 | 48.12 | 49.69 | 17.28 | 26.42 | 45.26 | 44.31 | 67.72
Jain | 32.96 | 37.63 | 60.23 | 96.92 | 32.49 | 97.56 | 78.06 | 94.63
Wine | 41.24 | 33.07 | 35.61 | 41.95 | 41.63 | 19.65 | 33.53 | 43.54
Append | 26.26 | 28.24 | 22.31 | 26.27 | 26.21 | 22.62 | 31.88 | 32.07
ARI
Dataset | KM | SL | AL | SC | FCM | MOCK | MIE-MOCK | MFMOCK
Iris | 0.73 | 0.58 | 0.77 | 0.72 | 0.73 | 0.81 | 0.72 | 0.83
Thyroid | 0.34 | 0.37 | 0.23 | 0.16 | 0.45 | 0.42 | 0.43 | 0.48
Aggregate | 0.71 | 0.91 | 0.93 | 0.99 | 0.69 | 0.81 | 0.60 | 0.97
Cancer | 0.20 | 0.29 | 0.32 | 0.14 | 0.18 | 0.30 | 0.31 | 0.46
Jain | 0.31 | 0.36 | 0.75 | 0.99 | 0.30 | 0.99 | 0.75 | 0.92
Wine | 0.36 | 0.27 | 0.32 | 0.36 | 0.35 | 0.22 | 0.30 | 0.41
Append | 0.32 | 0.28 | 0.30 | 0.29 | 0.29 | 0.22 | 0.33 | 0.32
Table 9. Comparison of algorithm test datasets.
Dataset | Task | Dataset Name | Samples | Features | Classes
Dataset 1 | Task 1 | Sizes | 1000 | 2 | 4
Dataset 1 | Task 2 | Square | 1000 | 2 | 4
Dataset 2 | Task 1 | 100d-40c | 1964 | 100 | 40
Dataset 2 | Task 2 | 100d-10c | 1220 | 100 | 10
Dataset 3 | Task 1 | 50d-20c-1 | 1245 | 50 | 20
Dataset 3 | Task 2 | 50d-20c-2 | 2334 | 50 | 20
Table 10. Average performance of MFMOCK on dataset 1 problems.
Metric | Task 1 Overall Deviation (Dev) | Task 1 Connectivity Index (Con) | Task 2 Overall Deviation (Dev) | Task 2 Connectivity Index (Con)
Avg | 3844.85 | 1.948 | 3188.89 | 7.291
CV | 1.279% | 1.602% | 1.267% | 1.09%
AG | 1.866% | 1.306% | 2.879% | 1.967%
Table 11. Average performance of MFMOCK on dataset 2 problems.
Metric | Task 1 Overall Deviation (Dev) | Task 1 Connectivity Index (Con) | Task 2 Overall Deviation (Dev) | Task 2 Connectivity Index (Con)
Avg | 1562.44 | 4.02 | 904.83 | 8.95
CV | 3.682% | 7.019% | 2.208% | 0.91%
AG | 5.83% | 16.37% | 1.04% | 2.09%
Table 12. Average performance of MFMOCK on dataset 3 problems.
Metric | Task 1 Overall Deviation (Dev) | Task 1 Connectivity Index (Con) | Task 2 Overall Deviation (Dev) | Task 2 Connectivity Index (Con)
Avg | 875.34 | 5.14 | 825.75 | 6.07
CV | 5.981% | 3.267% | 2.014% | 4.459%
AG | 15.12% | 11.2% | 17.43% | 9.71%