1. Introduction
Networks are used to represent complex systems in many fields, such as computer science, physics, and mathematics [1]. Common examples include biological networks [2], social networks [3], and information networks [4]. Complex networks can reveal latent rules and features, such as community structure. In social networks, community detection can discover groups of friends with common hobbies and interests. In protein networks, community detection can uncover proteins with the same function, which is of great significance for gene repair.
From the perspective of graph theory, a network is a graph of arbitrary size, where the vertices represent the objects in the network and the edges represent the direct relationships between those objects. A community is defined as a subset of nodes that are densely connected to each other, while nodes in different communities are sparsely connected. This characteristic of community structure has motivated research in many fields. Community detection has been widely used in social relationship analysis [5], recommendation systems [6,7], link prediction [8], and virus transmission [9,10].
In recent years, many community detection methods have been proposed. In 2002, Girvan and Newman proposed the GN algorithm [11], which obtains the community structure of a network by repeatedly removing the edge with the highest edge betweenness. Newman [12] introduced the concept of modularity, which allows community detection to be modeled as an NP-hard optimization problem. Subsequently, more metrics have been proposed to assess the quality of community detection, such as community fitness [13] and community score [14]. Building on these metrics, many algorithms based on intelligent optimization have been applied to the community detection problem. Pizzuti [14] proposed a genetic algorithm that obtains the best partition by optimizing the community score; Li [15] designed an extended compact genetic algorithm using modularity as the optimization objective; Gong et al. proposed a memetic algorithm for community detection, called Meme-Net, using modularity density as the optimization criterion [16]. All of the above methods optimize only one metric, and they achieve better results than the GN algorithm and the FN algorithm. However, the literature [17] showed that community detection based on single-objective optimization suffers from a flaw known as the resolution limit: single-objective optimization tends to find the larger communities in the network while ignoring small communities that really exist. In addition, a single metric does not fully capture community structure. Some metrics attempt to strengthen intra-community connections, while others attempt to weaken inter-community connections, and single-objective optimization does not allow for trade-offs between multiple metrics [18].
In response, scholars have begun to use multi-objective optimization to weigh multiple conflicting metrics and improve the accuracy of community partitioning. Pizzuti proposed a multi-objective genetic algorithm for community detection, known as MOGA-Net [19]. Rahimi employed a discrete particle swarm algorithm to optimize the community structure within a multi-objective optimization framework [20]. Messaoudi proposed a multi-objective bat-based optimization algorithm for the dynamic community detection problem [21]. Li designed an adaptive evolutionary algorithm to extract communities in the network [22]. Chen [23] proposed the MODTLBO/D algorithm for community detection, based on a multi-objective teaching–learning-based optimization algorithm combined with a decomposition mechanism. Ji [24] integrated a weighted simulated annealing local search operator into multi-objective ant colony optimization to expand the search range and introduced decomposition mechanisms to enhance the accuracy of community detection. Li [25] designed a decomposition-based multi-objective chemical reaction optimization algorithm that improves the efficiency of community mining by dynamically changing the population of the algorithm. The authors of [26] proposed a metaheuristic approach based on variable neighborhood search, which leverages the combination of quality and diversity of a constructive procedure inspired by the greedy randomized adaptive search procedure for detecting communities. Many other scholars have also made notable contributions to multi-objective community detection. Ma [27] proposed a two-stage multi-objective community detection algorithm with local search and global search that merges local communities through a boundary control strategy. In the literature [28], a new optimization objective, namely “balanced modularity”, is introduced. Liu [29] introduced network embedding to map nodes to a low-dimensional space, which effectively reduces the search space through a consensus propagation strategy. Pizzuti [30] proposed a multi-objective genetic framework that integrates the topological and compositional dimensions to uncover community structure in attributed networks. The approach allows experimentation with different structural measures to search for densely connected communities and with similarity measures between attributes to obtain high intra-community feature homogeneity. In the literature [31], the Grey Wolf optimization algorithm and the Label Propagation algorithm were improved and combined for better performance.
In this paper, we propose a community detection algorithm based on multi-objective pigeon-inspired optimization. The contribution of our work consists of three main aspects:
- (1) We exploit the strong optimization capability of the pigeon-inspired optimization algorithm and combine it with a multi-objective optimization strategy to form a novel algorithm for community detection in complex networks.
- (2) We re-discretize the pigeon-inspired optimization algorithm for the community detection problem. The velocity and position update formulas are redefined to suit the community structure representation.
- (3) We provide a definition of a boundary node. The misclassification of boundary nodes is a key factor affecting community detection. Separate mutation strategies are proposed for boundary nodes and non-boundary nodes to improve the accuracy of community partitioning.
2. Background and Related Works
2.1. Community Definition
There is no universally agreed definition of a community [32]. The generally accepted consensus is that a community is a subset of nodes that are tightly connected within the set and sparsely connected to nodes in other sets [33]. Nodes form communities among themselves based on functional or other shared characteristics.
A network is usually represented in the form of an undirected graph:
$$G = (V, E)$$
where $V$ represents the set of vertices in the network and $E$ the set of connections between pairs of vertices. From a mathematical point of view, a network can be represented by an adjacency matrix $A = (a_{ij})_{N \times N}$, where $N$ denotes the number of nodes in the network. When there is a real connection between $v_i$ and $v_j$, $a_{ij} = 1$ and $v_i$ and $v_j$ are neighbor nodes of each other; otherwise $a_{ij} = 0$. The degree $k_i = \sum_{j} a_{ij}$ is the sum of all valid connected edges of $v_i$.
Accordingly, if $v_i$ belongs to a community $S \subset G$, the degree of $v_i$ with respect to $S$ is $k_i(S) = k_i^{in}(S) + k_i^{out}(S)$, where $k_i^{in}(S) = \sum_{j \in S} a_{ij}$ is the number of edges connecting $v_i$ to the other vertices in $S$, and $k_i^{out}(S) = \sum_{j \notin S} a_{ij}$ is the number of edges connecting $v_i$ to vertices not in $S$. When $k_i^{in}(S) > k_i^{out}(S)$ for every $v_i \in S$, $S$ is a community in the strong sense. When only $\sum_{i \in S} k_i^{in}(S) > \sum_{i \in S} k_i^{out}(S)$ holds, $S$ is a weak community. A strong community is more tightly connected internally than a weak community.
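The strong and weak conditions can be checked directly from an adjacency list. The following is a minimal sketch, assuming the in-degree and out-degree definitions given above; the function names and the small example graph (two triangles joined by one edge) are hypothetical and only serve as an illustration.

```python
from typing import Dict, Set, Tuple

Graph = Dict[int, Set[int]]  # adjacency list of an undirected graph


def in_out_degree(graph: Graph, community: Set[int], v: int) -> Tuple[int, int]:
    """Return (k_in, k_out) of node v with respect to the given community."""
    k_in = sum(1 for u in graph[v] if u in community)
    return k_in, len(graph[v]) - k_in


def is_strong(graph: Graph, community: Set[int]) -> bool:
    """Strong sense: every member has more internal than external edges."""
    return all(k_in > k_out
               for k_in, k_out in (in_out_degree(graph, community, v) for v in community))


def is_weak(graph: Graph, community: Set[int]) -> bool:
    """Weak sense: the summed internal degree exceeds the summed external degree."""
    pairs = [in_out_degree(graph, community, v) for v in community]
    return sum(p[0] for p in pairs) > sum(p[1] for p in pairs)


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
graph: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

print(is_strong(graph, {0, 1, 2}))     # a triangle satisfies the strong condition
print(is_strong(graph, {0, 1, 2, 3}))  # node 3 has more external than internal edges
print(is_weak(graph, {0, 1, 2, 3}))    # but the summed condition still holds
```

In this example the set {0, 1, 2, 3} is weak but not strong, which shows that the weak condition is the less demanding of the two.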
2.2. Multi-Objective Optimization
A multi-objective optimization problem seeks a set of solutions that balance a set of conflicting objective functions. Taking minimization as an example, a multi-objective optimization problem can be described as follows:
$$\min F(x) = \left(f_1(x), f_2(x), \ldots, f_m(x)\right)^T, \quad x \in \Omega$$
where $f_i(x)$ is the $i$th objective function, $x$ is the decision variable, $\Omega$ is the feasible decision space, and $m$ represents the number of objective functions. Solution $x_a$ dominates solution $x_b$ (written $x_a \prec x_b$) if the following condition is met:
$$\forall i \in \{1, \ldots, m\}: f_i(x_a) \le f_i(x_b) \;\wedge\; \exists j \in \{1, \ldots, m\}: f_j(x_a) < f_j(x_b)$$
Multi-objective optimization returns a set of trade-off non-dominated solutions rather than a single optimal solution. This non-dominated solution set is called the Pareto optimal set of the multi-objective optimization problem. If there is no solution $x$ dominating $x^*$, then $x^*$ is referred to as a Pareto optimal solution or non-dominated solution. The Pareto optimal set, or set of non-dominated solutions, is defined as:
$$P^* = \{x^* \in \Omega \mid \nexists\, x \in \Omega : x \prec x^*\}$$
Reference [19] illustrates that the solutions in the Pareto optimal set correspond to different partitions of a network, composed of different numbers of communities. This provides better opportunities for analyzing communities at different levels. In the multi-objective solution space, the Pareto optimal front (POF) is obtained by mapping these non-dominated solutions into objective space [20].
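To make the dominance relation concrete, the following minimal sketch filters a list of objective vectors down to its non-dominated subset under minimization. The function names and the sample objective values are hypothetical; the quadratic-time filter is chosen for readability, not efficiency.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective values, one tuple per candidate partition.
points = [(-4.0, 0.67), (-3.0, 0.0), (-2.0, 1.5), (-4.0, 1.0)]
front = non_dominated(points)
print(front)  # [(-4.0, 0.67), (-3.0, 0.0)]
```

The two surviving points trade one objective against the other, which is why neither can be discarded without additional preference information.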
Due to the general applicability of multi-objective optimization, many excellent multi-objective methods have been proposed recently. Leung [34] proposed a collaborative neurodynamic approach for multi-objective optimization that uses the weighted Chebyshev method to scalarize multiple objectives. In the reconstruction, the multi-projective neural network searches the POF with the help of the PSO algorithm and achieves good performance. Xu [35] designed a fuzzy decision variable framework for large-scale multi-objective optimization to alleviate the problem of too many decision variables slowing the convergence of evolutionary algorithms. The framework improves the performance and computational efficiency of the algorithm in large-scale multi-objective optimization through two steps, fuzzy evolution and exact evolution. Liu [36] proposed an accelerated evolutionary search strategy to address the inefficient decision-space search of existing multi-objective evolutionary algorithms on large-scale multi-objective optimization problems. The main idea is to learn a gradient descent direction vector, i.e., the fastest possible convergence direction, for each solution through a specially trained feed-forward neural network in order to reconstruct the solution efficiently. Experimental results demonstrate that the strategy has clear advantages on large-scale multi-objective optimization problems with 1000–10,000 dimensions. These methods perform very well but cannot be applied directly to the discrete community detection problem.
2.3. The Pigeon-Inspired Optimization Algorithm
The pigeon-inspired optimization algorithm (PIO) [37] is a heuristic biomimetic intelligent optimization algorithm proposed by Duan in 2014. The algorithm simulates the homing flight of pigeons with two search operators: the map and compass operator and the landmark operator. In the map and compass operator, pigeons move toward the best-positioned pigeon in the group, guided by each individual's own sense of direction toward the destination. In the landmark operator, the lost half of the pigeons is discarded, and the remaining pigeons move toward their destination under the leadership of the elite.
In the PIO algorithm, the position of virtual pigeon $i$ in the solution space is given by $X_i$, and $V_i$ denotes the flight velocity of the pigeon, for $i = 1, 2, \ldots, n$, where $n$ denotes the number of pigeons. In the early stages of the algorithm, the pigeons rely on the sun as well as the earth’s magnetic field for navigation. Each pigeon moves according to the following rules:
$$V_i(t) = V_i(t-1)\, e^{-Rt} + rand \cdot \left(X_g - X_i(t-1)\right)$$
$$X_i(t) = X_i(t-1) + V_i(t)$$
where $t$ denotes the current iteration number; $R$ is the map and compass factor, a positive real number that normally takes a value of 0.2; $rand$ denotes a random number in [0, 1] that follows a normal distribution; and $X_g$ denotes the global optimal solution.
The map and compass operators attempt to exploit the exploratory power of the pigeon to prevent the algorithm from falling into a local optimal solution. The landmark operator attempts to accelerate the convergence of the algorithm near the optimal solution.
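The landmark operator is determined by the position of the center of the current pigeon group:
$$X_c(t) = \frac{\sum_{i=1}^{N_p(t)} X_i(t)\, w_i}{N_p(t) \sum_{i=1}^{N_p(t)} w_i}$$
where $N_p(t)$ denotes the number of pigeons in the current population and $w_i$ represents the weight coefficient of the $i$th pigeon, calculated according to the following equation:
$$w_i = \frac{1}{f(X_i(t)) + \varepsilon}$$
where $f$ is the objective function to be minimized and $\varepsilon$ is a positive real number. The formula for updating the pigeon position in the landmark operator is as follows:
$$X_i(t) = X_i(t-1) + rand \cdot \left(X_c(t) - X_i(t-1)\right)$$
Pigeons that are unfamiliar with the surrounding environment are gradually eliminated from the group according to their fitness values. The number of pigeons remaining in the population after the elimination in each iteration is:
$$N_p(t) = \frac{N_p(t-1)}{2}$$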
The basic process of the PIO algorithm (see Figure 1) is summarized as follows:
Step 1: Initialize the position information x and velocity information v of the population, as well as the other parameters;
Step 2: Calculate the fitness value of each pigeon;
Step 3: Select the global optimal solution $X_g$;
Step 4: If the termination condition is not met, go to Step 5; otherwise, go to Step 6;
Step 5: Update the individual position and velocity information according to Formulas (7) and (8);
Step 6: Eliminate pigeons and update the position information of the remaining pigeons according to Formula (9);
Step 7: If the termination condition is met, output the position information of the pigeon; otherwise, set $t = t + 1$ and return to Step 2.
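The two stages described above can be sketched for a continuous toy problem as follows. This is a hedged illustration, not the reference implementation of [37]: the sphere test function, the iteration counts `t1` and `t2`, the search range, and the use of uniform random numbers are all assumptions made here, and the centre is computed as a plain fitness-weighted mean without any additional scaling by population size.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def sphere(x: Vector) -> float:
    """Hypothetical test objective with its minimum at the origin."""
    return sum(c * c for c in x)


def pio(f: Callable[[Vector], float], dim: int = 2, n: int = 20, t1: int = 30,
        t2: int = 10, R: float = 0.2, eps: float = 1e-12, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    vs = [[0.0] * dim for _ in range(n)]
    best = min(xs, key=f)[:]

    # Stage 1: map and compass operator (velocity decays with exp(-R * t)).
    for t in range(1, t1 + 1):
        decay = math.exp(-R * t)
        for x, v in zip(xs, vs):
            for d in range(dim):
                v[d] = v[d] * decay + rng.random() * (best[d] - x[d])
                x[d] += v[d]
        cand = min(xs, key=f)
        if f(cand) < f(best):
            best = cand[:]

    # Stage 2: landmark operator (halve the flock, move toward the centre).
    for _ in range(t2):
        xs.sort(key=f)
        xs = xs[: max(1, len(xs) // 2)]          # discard the worse half
        ws = [1.0 / (f(x) + eps) for x in xs]    # weight of each pigeon
        total = sum(ws)
        center = [sum(w * x[d] for w, x in zip(ws, xs)) / total for d in range(dim)]
        for x in xs:
            for d in range(dim):
                x[d] += rng.random() * (center[d] - x[d])
        cand = min(xs, key=f)
        if f(cand) < f(best):
            best = cand[:]
    return best


best = pio(sphere)
print(best, sphere(best))
```

Because `best` is only replaced by a strictly better candidate, the returned solution is never worse than the best pigeon of the initial population.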
At this stage, there has been little research on community detection based on the PIO algorithm; only the literature [38] has conducted a related study. However, that algorithm performs poorly: compared with most multi-objective algorithms for community detection, both the accuracy and the stability of its community partitioning lag far behind the mainstream. The strong optimization power of the PIO algorithm is not properly exploited, which is the starting point of our research in this paper.
3. Proposed Method
Traditional community detection algorithms are mainly based on clustering methods, which have been found to suffer from both limited accuracy and high complexity. With the introduction of numerous community structure evaluation functions, intelligent optimization algorithms began to be applied to community detection. Since single-objective optimization algorithms suffer from the resolution limit when optimizing modularity, multi-objective optimization was adopted. The PIO algorithm is an intelligent optimization algorithm with the advantage of strong search capability. In this section, the proposed multi-objective pigeon-inspired optimization community detection method, called MOPIO-Net, is described in detail. The framework of MOPIO-Net consists of three main steps: initialization, search, and mutation. In the initialization stage, a specific representation is used to construct each solution, which encodes a community structure of the network so that the structure can be displayed and updated clearly and easily. Using this representation, the solutions are initialized by the PGLG method [39]. Then, for each pigeon, two objective functions, the Negative Ratio Association (NRA) [40] and the Ratio Cut (RC) [41], are calculated. In the search phase, inspired by the search strategy of the PIO algorithm, we developed a discretized map and compass operator. We try to obtain the local optimal and global optimal solutions by computing the Normalized Mutual Information (NMI) value of each pigeon. In the mutation phase, as with the landmark operator in the PIO algorithm, each pigeon moves toward the best community structure under the lead of the globally optimal pigeon. This is realized as genetic learning between each pigeon and the best global individual: if the community labels at the same gene locus are inconsistent, a mutation is carried out based on the neighbors’ community labels. Finally, considering that suboptimal community partitioning is often caused by misclassification of nodes at community boundaries, we apply a realignment strategy to these boundary nodes to reduce misclassification. The flowchart of the proposed method is illustrated in Figure 2, and additional details are described in the following subsections.
3.1. Solution Representation and Initialization
A complex network is essentially a graph structure, and mining its community structure based on intelligent optimization algorithms requires a reasonable representation. To accommodate the discrete optimization problem, the position and velocity of the pigeon swarm are redefined.
3.1.1. Location Representation
Label-based representation and locus-based adjacency representation [42] are two common encoding methods. Both methods consider each solution as a string of genes, each of which corresponds to a node in the graph. In the locus-based adjacency method, each gene locus is randomly linked to a neighbor node, where $g_i = j$ denotes the existence of a linking edge between $v_i$ and $v_j$. This method obtains the number of communities automatically by decoding, but frequent encoding and decoding operations need to be performed. In the label-based representation method, each gene locus is generated by label propagation: $x_i = k$ denotes that $v_i$ belongs to the community $C_k$. If $x_i = x_j$, then $v_i$ and $v_j$ are members of the same community. However, the label-based method has the drawbacks of redundant representation and a blind search space. The two representations are shown in Figure 3.
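The decoding step of the locus-based adjacency representation amounts to finding connected components of the links encoded by the genes. A minimal sketch with a union-find structure is given below; the function name and the example chromosome are hypothetical, and nodes are assumed to be indexed from 0.

```python
from typing import Dict, List


def decode_locus(genes: List[int]) -> List[int]:
    """Decode a locus-based chromosome (genes[i] = j links node i to node j)
    into community labels, one connected component per community."""
    parent = list(range(len(genes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in enumerate(genes):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    labels: Dict[int, int] = {}
    # Components are numbered 0, 1, 2, ... in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(len(genes))]


# Hypothetical chromosome: nodes 0-2 link among themselves, as do nodes 3-5.
print(decode_locus([1, 2, 0, 4, 5, 3]))  # [0, 0, 0, 1, 1, 1]
```

The number of communities falls out of the decoding, which is the advantage the text attributes to this encoding, at the cost of running the decoder every time a solution is evaluated.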
As Figure 3a shows, although the representations label 1 and label 2 are different, they represent the same community structure. Such redundancy enlarges the search space, causes repeated searches, and harms solution quality. Therefore, we perform a redundancy-removal operation on solutions in the label-based representation. We obtain the number of communities represented by the current individual from the community codes at the individual’s loci and recode the individual based on the number of communities. As shown in Figure 3a, the position codes of label1 contain only 3 and 8, thus indicating two communities. We force the community code that appears first in the individual to be 0 and add 1 for each subsequent new community code, so that label1 is recoded using only the codes 0 and 1. The specific process is shown in Algorithm 1.
Algorithm 1 Location Representation
begin
  for each solution in the population do
    Count the number of clusters in the network
    if … then
      for each label in the solution do
        Renumber according to the principle that smaller nodes receive smaller numbers
      end for
    end if
  end for
end
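The renumbering idea can be sketched in a few lines: labels are replaced by 0, 1, 2, ... in order of first appearance, so that two encodings of the same partition collapse to the same vector. The function name and the example label vectors are hypothetical and are not taken from Figure 3.

```python
from typing import Dict, List


def canonical_labels(labels: List[int]) -> List[int]:
    """Renumber community labels so the first label seen becomes 0,
    the next new label becomes 1, and so on."""
    mapping: Dict[int, int] = {}
    return [mapping.setdefault(label, len(mapping)) for label in labels]


# Two different encodings of the same two-community partition.
print(canonical_labels([3, 3, 8, 8, 3]))  # [0, 0, 1, 1, 0]
print(canonical_labels([8, 8, 3, 3, 8]))  # [0, 0, 1, 1, 0]
```

After this step, equality of two label vectors is equivalent to equality of the partitions they encode, which removes the redundant part of the search space.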
3.1.2. Velocity Representation
Velocity guides the flight of pigeons, and a suitable velocity determines whether the pigeons can reach their destination and how fast they arrive. Excessive velocity can cause pigeons to overshoot their destination, while too low a velocity narrows the range of the pigeon’s activity. The velocity is discretized and expressed as $V_i = \{v_{i1}, v_{i2}, \ldots, v_{in}\}$ with $v_{ij} \in \{0, 1\}$. If $v_{ij} = 1$, the label of $x_{ij}$ in the corresponding position will change; otherwise, the element remains unchanged.
3.2. Fitness Computation
The choice of fitness function is key to solution quality, both in multi-objective optimization and in community detection. We use RA and RC as the objective functions. RA measures the average number of edges between nodes inside the communities. The value of RA is inversely proportional to the number of communities in the network: the larger the value of RA, the more the network tends to be divided into a small number of large communities with a high density of internal connections. RC sums, over all communities, the average number of connections between the nodes of a community and the nodes of other communities. The value of RC is proportional to the number of communities in the network: the smaller the value of RC, the sparser the edges between communities and the greater the number of nodes within each community, which again divides the network into a smaller number of communities with high internal connection density. We chose these two metrics because, as mentioned in the previous sections, the tighter the intra-community connections and the sparser the inter-community connections, the clearer the community structure and the higher the recognition accuracy of the algorithm. From Equations (13) and (14), we can observe that RC sums, over the communities, the ratio of the number of inter-community edges to the community size, and RA sums the ratio of the number of intra-community edges to the community size. To obtain a clearer community structure when grouping nodes, the number of intra-community edges should be as high as possible (RA) and the number of inter-community edges as low as possible (RC). To formulate the problem as a minimization problem, we take the negative of the objective function RA, called the negative ratio association (NRA). Both objective functions are minimized simultaneously, which steers the partitioning results toward the community structure we expect to obtain (internally tight, externally sparse).
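Assume an undirected acyclic graph $G = (V, E)$ contains $n = |V|$ nodes and $|E|$ edges. The corresponding adjacency matrix is $A$. A community structure $C = \{C_1, C_2, \ldots, C_m\}$ denotes the division of the graph $G$ into $m$ communities. In the non-overlapping community detection study, $L(C_a, C_b) = \sum_{i \in C_a,\, j \in C_b} A_{ij}$ defines the number of edge connections that exist between two communities. The two objective functions are formulated as follows:
$$NRA = -\sum_{i=1}^{m} \frac{L(C_i, C_i)}{|C_i|}$$
$$RC = \sum_{i=1}^{m} \frac{L(C_i, \overline{C_i})}{|C_i|}$$
where $\overline{C_i} = V \setminus C_i$ is the set of nodes outside $C_i$.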
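A minimal sketch of how the two objectives might be evaluated for a label vector is shown below, assuming the standard ratio association and ratio cut definitions in which the internal term sums adjacency entries over ordered pairs (so each internal edge is counted twice). The function name and the six-node example graph are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]


def nra_rc(graph: Graph, labels: List[int]) -> Tuple[float, float]:
    """Return (NRA, RC) for a partition given as one label per node."""
    communities: Dict[int, Set[int]] = {}
    for node, label in enumerate(labels):
        communities.setdefault(label, set()).add(node)

    ra = rc = 0.0
    for members in communities.values():
        # Adjacency entries inside the community (each edge counted twice).
        internal = sum(1 for u in members for w in graph[u] if w in members)
        # Adjacency entries leaving the community.
        external = sum(len(graph[u]) for u in members) - internal
        ra += internal / len(members)
        rc += external / len(members)
    return -ra, rc


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
graph: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

split = nra_rc(graph, [0, 0, 0, 1, 1, 1])   # two communities
merged = nra_rc(graph, [0, 0, 0, 0, 0, 0])  # one community
print(split, merged)
```

On this graph the two-community split has the lower NRA while the single community has the lower RC, so neither partition dominates the other; this is the kind of conflict a multi-objective search exploits.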
3.3. Search Strategy
In the PIO algorithm, the pigeons follow the global optimal solution at the map and compass operator and the central solution at the landmark operator. In the community detection problem, we utilize the mutation operation instead of the original search strategy for the second stage.
In the discrete setting, the update rule for the pigeon’s velocity is redefined as:
$$V_i(t) = \varsigma\left(V_i(t-1)\, e^{-Rt} + rand \cdot \left(X_g \oplus X_i(t-1)\right)\right)$$
where ⊕ denotes the XOR operator. The role of the function $\varsigma$ is to map the velocity into [0, 1] space, and $\varsigma$ is defined as:
$$\varsigma(x) = \begin{cases} 1, & rand < sigmoid(x) \\ 0, & \text{otherwise} \end{cases}$$
where the sigmoid function rule is:
$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$
Based on the redefined velocity update rule, we now represent the pigeon’s position update rule in the following discrete form:
$$X_i(t) = X_i(t-1) \otimes V_i(t)$$
The above equation indicates that during the $t$th iteration, $X_i(t-1)$ generates new position information $X_i(t)$ guided by the velocity $V_i(t)$. The specific computation rules for the ⊗ operator are:
$$x_{ij}(t) = \begin{cases} x_{ij}(t-1), & v_{ij}(t) = 0 \\ Nbest_j, & v_{ij}(t) = 1 \end{cases}$$
Among them, $Nbest_j$ is a positive integer that represents the label with the highest frequency in the neighbor set of $v_j$. We choose this method to update the position information because the more of a node’s neighbors belong to its community, the tighter the community structure is internally and the sparser it is externally. This is exactly the community structure we expect to detect. The schematic diagram of the overall search process in the first stage can be found in Figure 4.
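One possible reading of the discretized search step is sketched below. It is an assumption-laden illustration rather than the authors' implementation: the XOR is taken element-wise as "the pigeon's label differs from the global best's label", the binarization draws one uniform random number per node, and ties in the neighbor vote are broken toward the smallest label. All function names and the example graph are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def update_velocity(v: List[int], x: List[int], x_best: List[int],
                    t: int, R: float, rng: random.Random) -> List[int]:
    """Binary velocity: 1 marks a node whose label should be reconsidered."""
    decay = math.exp(-R * t)
    new_v = []
    for vj, xj, gj in zip(v, x, x_best):
        differs = 1 if xj != gj else 0            # XOR of the two labels
        raw = vj * decay + rng.random() * differs
        new_v.append(1 if rng.random() < sigmoid(raw) else 0)
    return new_v


def update_position(x: List[int], v: List[int], graph: Graph) -> List[int]:
    """Where v is 1, adopt the most frequent label among the node's neighbors."""
    new_x = list(x)
    for j, flag in enumerate(v):
        if flag == 1 and graph[j]:
            counts = Counter(x[u] for u in graph[j])
            new_x[j] = max(sorted(counts), key=counts.get)  # ties: smallest label
    return new_x


# Hypothetical example: two triangles joined by the edge 2-3; node 2 is mislabeled.
graph: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = [0, 0, 1, 1, 1, 1]
print(update_position(x, [0, 0, 1, 0, 0, 0], graph))  # [0, 0, 0, 1, 1, 1]
```

Here node 2 has two neighbors labeled 0 and one labeled 1, so flagging it in the velocity vector moves it into the community of nodes 0 and 1.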
Boundary nodes connect multiple communities, and their neighbors belong to different communities. Compared with non-boundary nodes, boundary nodes are more prone to misclassification, which is one of the main factors leading to poor community structure. Therefore, to improve the quality of the partitioning results, we apply different mutation strategies to boundary nodes and non-boundary nodes. Let $p_m$ denote the mutation probability. If a node is a boundary node, its mutation probability is increased according to the number of distinct community labels among its neighbors. In the mutation rule, $k$ is the number of different communities to which the neighbor nodes of $v_i$ belong. The second phase of the search update strategy is summarized in Figure 5.
In the second search phase, the algorithm pseudo-code is shown in Algorithm 2.
Algorithm 2 Mutation
begin
  for each node in the solution do
    for each neighbor of the node do
      count the number of different labels
    end for
    if the mutation condition for the node holds then
      mutate the label of the node
    end if
    if the mutated solution dominates the original solution then
      keep the mutated solution
    else
      rollback
    end if
  end for
end
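The mutate-then-rollback loop can be sketched as follows. Several details are assumptions made for illustration only: the boundary-node probability is taken as `min(1, p_m * k)`, the new label is drawn uniformly from the neighbors' labels, and the two objectives are the standard NRA and RC. None of these choices is confirmed by the text above, and all names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple

Graph = Dict[int, Set[int]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return (all(p <= q for p, q in zip(a, b))
            and any(p < q for p, q in zip(a, b)))


def objectives(graph: Graph, labels: List[int]) -> Tuple[float, float]:
    """(NRA, RC), both minimized; internal edges are counted twice."""
    groups: Dict[int, Set[int]] = {}
    for node, label in enumerate(labels):
        groups.setdefault(label, set()).add(node)
    nra = rc = 0.0
    for members in groups.values():
        internal = sum(1 for u in members for w in graph[u] if w in members)
        external = sum(len(graph[u]) for u in members) - internal
        nra -= internal / len(members)
        rc += external / len(members)
    return nra, rc


def mutate(labels: List[int], graph: Graph, p_m: float, rng: random.Random) -> List[int]:
    current = list(labels)
    for v in range(len(current)):
        neighbor_labels = sorted({current[u] for u in graph[v]})
        k = len(neighbor_labels)          # distinct communities among neighbors
        if k == 0:
            continue
        # Assumed rule: boundary nodes (k > 1) mutate with a probability scaled by k.
        prob = min(1.0, p_m * k) if k > 1 else p_m
        if rng.random() >= prob:
            continue
        old, before = current[v], objectives(graph, current)
        current[v] = rng.choice(neighbor_labels)
        if not dominates(objectives(graph, current), before):
            current[v] = old              # rollback
    return current


# Hypothetical example: two triangles joined by the edge 2-3; node 2 starts mislabeled.
graph: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
start = [0, 0, 1, 1, 1, 1]
result = mutate(start, graph, p_m=0.5, rng=random.Random(0))
print(result, objectives(graph, result))
```

Because a mutation is kept only when it strictly dominates the previous solution, the returned partition is either unchanged or dominates the starting one.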
The MOPIO-Net algorithm has three main processes: initialization, search, and mutation. The complexity of the first process follows from Algorithm 1; the complexity of the second process is dominated by the fitness computation and the position update; and the complexity of the third process, mutation, follows from Algorithm 2. The overall complexity of MOPIO-Net therefore depends on the population size, the number of nodes n in the network, and the number of iterations.
5. Conclusions
In this paper, a novel multi-objective community detection algorithm based on a discrete PIO, named MOPIO-Net, has been proposed. The proposed method uses a multi-objective optimization strategy to solve the community detection problem. It minimizes two conflicting objective functions, NRA and RC, to obtain a partition with tight intra-community connectivity and sparse inter-community connectivity. We changed the movement strategy of the pigeon in the PIO algorithm: in the new strategy, the pigeon performs a crossover-like operation to move closer to the optimal solution. To address the misclassification caused by boundary nodes, we implemented separate strategies for the community classification of boundary nodes. To verify the performance of the MOPIO-Net algorithm, a synthetic network and three real networks were tested, and the results were compared with 11 excellent community detection algorithms. The experimental results show that MOPIO-Net detects partitions closer to the real community structure on all networks. This verifies that our discretization strategy is feasible, that the algorithm avoids the resolution limit problem, and that the proposed boundary node mutation strategy further improves the recognition accuracy.
The work in this paper validates the effectiveness of the MOPIO-Net algorithm in static network community detection. We hope to further explore the possibilities of MOPIO-Net in overlapping networks and dynamic networks in the future. We will also consider how MOPIO-Net should deal with special networks such as signed networks and weighted networks.