1. Introduction
Many complex systems exist in the form of networks in the real world, such as social networks [
1,
2], traffic networks [
3,
4], network sparsification [
5] and protein interaction networks [
6,
7]. These complex systems can be characterized as complex networks and stored as symmetric matrices for analysis and research. Entities in a complex network are represented by nodes, and the relationships between the entities are represented by edges [
8,
9]. Much research based on complex networks has been conducted, such as social computing [
10], network computation [
11], and community discovery [
12]. The community structure (module or cluster) is an important feature of a complex network: the network is composed of several communities, where the connections between nodes within a community are dense, while the connections between communities are relatively sparse [
13]. The purpose of community discovery (or community detection [
14]) is to mine community structures in a complex network. Community discovery can reveal the universal features of a complex network and help in understanding its topology accurately, which provides guidance for the use and transformation of the network and promotes the practical application of the network. Hence, community discovery has become one of the hotspots of complex network research [
15], and various lines of research have been pursued, such as disjoint community detection [
16,
17], overlapping community detection [
18], and multiobjective community detection [
19].
Early research on community discovery mainly focused on nonoverlapping communities, assuming that each node belongs to only one community and that no two communities overlap. Many representative algorithms have been proposed, such as the graph-partitioning-based method [
20], label-propagation-based method [
21], clustering method [
22,
23], and optimization method [
24,
25]. However, real-world entities are often diverse: one entity may belong to multiple categories, so communities may overlap. Therefore, overlapping community discovery has become a new research hotspot in recent years. Research on overlapping community discovery can be divided into two categories: global-network-information-based and local-network-information-based methods.
The methods based on global network information aim to find the community structure in the whole network by optimizing a certain global objective function using whole connection information, which mainly include the link-based method [
26,
27] and the clique percolation method [
28]. These methods can get better results in community discovery, but they have high time complexity and are not suitable for large-scale complex networks with numerous nodes. The methods based on local information aim to find the community structure starting from a node in the network by optimizing a certain local objective function using local connection information, which mainly include the label propagation method [
29,
30] and the local community expansion method [
31,
32]. Since the process of community discovery is only related to the local information in the network, the time complexity is low. Thus, these methods are suitable for large-scale complex networks. However, their disadvantage is that when the parameters of the algorithms change slightly, the results of community discovery change remarkably.
To tackle this problem, this paper proposes an overlapping community discovery method based on Two Expansions of Seeds (TES). The main feature of this method is that a topological feature of the network (node degree centrality) is used to define the gravitational degree, and the local maximum nodes are taken as the seeds. The reason is that the greater the gravitational degree of a node, the greater its influence and the stronger its information transmission ability in the network, which is beneficial for robust community discovery. Then, each seed is expanded by a greedy strategy based on the fitness function. When new nodes are added to a community, the community structure may change, so some nodes may end up with negative fitness. To remove such nodes, this paper adopts a community cleaning strategy. After the expansion based on the fitness function, the communities cover most of the nodes in the network, but a small number of nodes still cannot be assigned to any community because assigning them would reduce the community fitness. To solve this problem, this paper uses a gravitational function to expand, for the second time, the nodes that are not included in any community. Thus, all nodes belong to at least one community. Finally, by calculating the distance between communities and merging similar communities, we effectively reduce the redundant communities. The main contributions of this paper are as follows:
We propose an overlapping community discovery algorithm named TES.
TES employs the gravitational degree to find the local maximum nodes as the seeds and expands these seeds by the greedy strategy.
Experimental results verify that our algorithm has better performance than other competitive algorithms.
The rest of this paper is organized as follows:
Section 2 briefly summarizes the related work.
Section 3 proposes our algorithm, named TES, which is composed of three parts: seed selecting, twice node expanding, and overlapping community merging.
Section 4 reports the performance of TES. We draw the conclusion in
Section 5.
2. Related Work
In this section, we first briefly review the categories of overlapping community discovery methods. Then, we introduce the methods of local community optimization and expansion in detail and analyze the shortcomings of the state-of-the-art algorithms. This paper aims to deal with the problem of unreasonable seed selection for local community optimization and expansion.
The overlapping community discovery methods can be divided into four categories: link-based method, clique percolation method, label-propagation-based method, and local-community-optimization-and-expansion-based method.
The link-based method converts the clustered objects into network edges (or links) and partitions these edges into nonoverlapping groups. Since a node is usually an endpoint of multiple edges, if these edges belong to different link communities, the node is an overlapping node. The LINK algorithm [
27] is representative of this method. In addition,
k-means was employed to expand seeds twice in dynamic community detection [
33].
The clique percolation method considers that a community is composed of a number of fully connected subgraphs, each defined as a clique, and adjacent cliques form a community. Since a node may belong to more than one clique, it may be an overlapping node. However, the algorithm imposes strict constraints on connectivity and depends on the selection of parameter
k. The CPM algorithm [
28] is representative of this method.
The label-propagation-based method assigns a unique label to each node during initialization; updates the label and its membership by iteration; and finally, assigns the nodes with the same label to the same community. Apparently, if a node has multiple labels, the node is an overlapping node. The COPRA algorithm [
29] is a representative of this method.
The local-community-optimization-and-expansion-based method starts from the local communities, expands the communities gradually based on the optimization function, and forms cross-regions between multiple extensions, thus finding overlapping community nodes. The representative algorithms are LFM [
31] and GCE [
32]. In addition to the above algorithms, there are some classical methods, such as the semisupervised learning method [
34]; deep learning method [
35]; and the CONGA algorithm [
36], which splits the clone node by itself and adds a virtual edge between the split nodes to find the overlapping nodes.
Among the abovementioned methods, the fourth one, local community optimization and expansion, has become increasingly popular. For example, the research in [21] found that taking the local maximum node defined by the degree centrality as the seed can discover higher quality communities and avoid instability at the same time. The research in [37] used two measures to define node influence, the community structure of social networks and an influence-based measure of node intimacy center, and took the nodes with great influence as the seeds. The EAGLE algorithm took the largest clique in the network as the seed and ignored the second largest one, which has high time complexity [
38]. Another paper [
39] selected a group of nodes as seeds that were closely connected in the network, namely, an Egonet (hawk-eye network), but this method is more suitable for networks with a large global clustering coefficient. A seed set expansion method based on graph partitioning was proposed in [
40] to find a group of nodes with low conductance, and the node closest to the cluster was taken as a seed. The online social network (OSN) algorithm, a multilevel community discovery algorithm, combined user interests and cohesiveness to coarsen the initial network and found an initial community assignment using stochastic inference in the coarsest network [41]. All these methods use the local topology information of the network to optimize a local objective function and thereby find the community structure in the network. They do not need to know the global topology of the network and show certain advantages in large-scale networks. Seed selection is the foundation of this kind of method and affects the quality of community structure mining. The LFM algorithm [
31] and the DEMON algorithm [
32] expand the community by random seed selection, which inevitably causes the instability of community discovery. The GCE algorithm improved the LFM algorithm by mining
k-cliques as the seed through the classic Bron–Kerbosch algorithm in the network [
42]. In this method, cliques are fixed, but the seed selection depends on the selection of parameter
k, which can easily cause the problem of low network coverage.
To solve the problem of unreasonable seed selection for local community optimization and expansion, this paper proposes an overlapping community discovery algorithm based on two expansions of seeds. A node with the local maximum gravitational degree defined by degree centrality is taken as a seed. This method has the advantages of a high-quality community and robust results, but the disadvantage is that these communities cannot cover the whole network. To overcome this problem, the communities are expanded for the second time to ensure that each node belongs to at least one community.
3. Proposed Method
In this section, we propose the TES algorithm, which is composed of three parts. The first part employs the gravitational degree defined by the network topological feature (degree centrality) to find the local maximum nodes as the seeds. The second part expands these seeds by the greedy strategy based on the fitness function. Then, the communities are expanded for the second time based on the gravitational function. The third part calculates the distance between the communities and merges the similar communities to get the final communities.
3.1. Seed Selection
In actual networks, some nodes, called central nodes, are closely connected with other nodes and contribute greatly to information transmission. They are usually scattered across the whole network and located in regions where the nodes are more closely connected. This is consistent with the fact that the nodes in a community are closely connected, while the connections between communities are sparse. Hence, the central nodes can be taken as the seeds. The centrality of a node reflects its importance in the network [43]. Inspired by the gravitational relationships in the dynamic social network [44], this paper proposes a gravitational degree based on degree centrality to measure the influence of the central nodes on other nodes.
Newton's law of universal gravitation holds that any two particles attract each other with a force directed along the line between them. The gravitation is proportional to the product of their masses and inversely proportional to the square of their distance, as shown in Equation (1):
$$F = g\,\frac{m_1 m_2}{r^2} \qquad (1)$$
where $g$ is the gravitational constant, $m_1$ and $m_2$ are the masses of the two particles, and $r$ is the distance between the two particles.
In this paper, a network is represented by an undirected graph $G = (V, E)$, where $V$ is a set of $n$ vertices and $E$ is a set of $m$ edges.
Definition 1. Node centrality is the degree of a node; the degree of node $v_i$ is denoted by $k_i$.
Definition 2. If there is an edge between nodes $v_i$ and $v_j$, then node $v_j$ is a neighbor of node $v_i$. All neighbors of node $v_i$ are denoted by $N(v_i)$.
Definition 3. To measure the similarity between nodes $v_i$ and $v_j$, this paper employs the Jaccard similarity coefficient [45], denoted by $J(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}$.
Definition 4. The distance between node $v_i$ and its neighbor $v_j$ is denoted by $d(v_i, v_j)$.
Definition 5. The gravitation of node $v_i$ to its neighbor $v_j$ is denoted by $F(v_i, v_j)$. Using the node degree as the mass of a node reflects its ability to transmit information to its neighbor. The gravitation of $v_i$ to its neighbor $v_j$ is directly proportional to the node degree and inversely proportional to the distance between them.
Definition 6. The gravitational degree of node $v_i$ is the sum of its gravitation to all nodes in the network. The greater the gravitational degree of node $v_i$, the greater its influence on the network. The stronger the information transmission ability of a node, the more likely it is to become a seed node.
An illustrative example is shown as follows:
Example 1. In Figure 1, node 1 has 7 neighbors, i.e., {2, 3, 4, 5, 6, 7, 8}, and a second node has 3 neighbors, i.e., {1, 2, 3}. Thus, the node centralities of the two nodes are 7 and 3, respectively. The intersection and the union of the two neighbor sets are {2, 3} and {1, 2, 3, 4, 5, 6, 7, 8}, respectively, so the Jaccard similarity of the two nodes is 2/8 = 0.25.
Definition 7. If the gravitational degree of a node is no less than that of all its neighbors, the node is called the local maximum degree node of the network.
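The similarity computation in Example 1 can be checked with a few lines of Python. This is only a minimal sketch: the function name `jaccard` and the variable names are our own, and only the two neighbor sets are taken from the example.

```python
def jaccard(neighbors_a: set, neighbors_b: set) -> float:
    """Jaccard similarity of two neighbor sets: |A & B| / |A | B|."""
    union = neighbors_a | neighbors_b
    if not union:
        # Two isolated nodes share nothing; treat their similarity as 0.
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Neighbor sets quoted in Example 1.
first_neighbors = {2, 3, 4, 5, 6, 7, 8}
second_neighbors = {1, 2, 3}

similarity = jaccard(first_neighbors, second_neighbors)
print(similarity)  # 0.25
```

The two sets share nodes 2 and 3 and jointly cover 8 nodes, which gives 2/8.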
The local maximum node has a large gravitational degree and strong information transmission ability, and such nodes are mostly scattered across the network. Therefore, this paper selects the local maximum nodes as the seeds. The seed selection algorithm is shown in Algorithm 1. First, all nodes are marked as 0 and the gravitational degree of each node is calculated. The node with the largest gravitational degree is put into the seed set. Then, the node with the local maximum degree is marked as 1, and the node and its neighbors are moved out of the vertex set. The search for the next seed is repeated until all nodes have been marked and moved out of the vertex set.
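Algorithm 1 GetSeed.
Require: network G;
Ensure: seed set S;
1: initialize S as the empty set;
2: for each node v in V do
3:     mark v as 0;
4:     calculate the gravitational degree of v;
5: end for
6: while V is not empty do
7:     select the node v in V with the largest gravitational degree;
8:     if v is marked as 0 then
9:         add v to S;
10:        mark v as 1;
11:        move v and its neighbors out of V;
12:    end if
13: end while
14: return S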
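The seed selection loop described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the gravitational degree is passed in as a precomputed dictionary (`grav_degree`, with hypothetical values in the usage example), the graph is a dictionary of neighbor sets, and ties are broken by the smallest node identifier, which the text does not specify.

```python
def select_seeds(adj: dict, grav_degree: dict) -> list:
    """Greedy seed selection following the prose description of Algorithm 1.

    adj maps each node to the set of its neighbors; grav_degree maps each
    node to a precomputed score (assumed to be its gravitational degree).
    """
    remaining = set(adj)
    seeds = []
    while remaining:
        # Sorting first makes tie-breaking deterministic (smallest id wins).
        best = max(sorted(remaining), key=lambda v: grav_degree[v])
        seeds.append(best)
        # The seed and its neighbors leave the candidate pool.
        remaining -= adj[best] | {best}
    return seeds


# Hypothetical graph: two triangles (1, 2, 3) and (4, 5, 6) joined by edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = {1: 1.0, 2: 1.0, 3: 3.0, 4: 3.0, 5: 1.0, 6: 1.0}  # made-up values
print(select_seeds(adj, scores))
```

Because each selected node removes its whole neighborhood from the pool, two adjacent nodes are never both chosen as seeds in this sketch.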
3.2. Community Discovery
For each seed in seed set
S, this paper iteratively adds its neighbors to the community to discover natural communities. There are many ways to expand the community, including the minimum one norm [
20], the label propagation method [
29], and the fitness function method [
31,
42]. This paper employs the fitness function method since it can provide good results on real datasets.
Definition 8. Community C is a subset of V. For community C in network G, its neighbor is defined as
$$N(C) = \{v \mid v \in V \setminus C,\ \exists u \in C,\ (u, v) \in E\} \qquad (7)$$
Definition 9. For community C in network G, its fitness is defined as
$$f_C = \frac{k_{in}^{C}}{\left(k_{in}^{C} + k_{out}^{C}\right)^{\alpha}} \qquad (8)$$
where $k_{in}^{C}$ and $k_{out}^{C}$ are the total internal and external degrees of the nodes in community C, respectively, $k_{in}^{C} = 2E_{in}^{C}$, where $E_{in}^{C}$ is the number of edges inside community C, and $\alpha$ is an adjustment parameter. $\alpha$ in the fitness function is the resolution parameter, which adjusts the scale of the communities discovered. The smaller $\alpha$ is, the more rapidly $f_C$ increases after a node is added to community C, so community C can accept more nodes. When $\alpha$ tends to 0, the community may expand to cover the entire network. Conversely, the larger $\alpha$ is, the smaller the increase of $f_C$ after a node is added, so a small community is formed. When $\alpha = 1$, $f_C = k_{in}^{C} / (k_{in}^{C} + k_{out}^{C})$. The sparser the connection between community C and the outside is, the smaller $k_{out}^{C}$ is and the larger $f_C$ is, which reflects the local connection density of community C.
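To make the role of the resolution parameter concrete, the following sketch evaluates a fitness of the form used by the fitness function method of [31], that is, the internal degree divided by the total degree raised to the power of the resolution parameter. The exact form is an assumption here, as are the names `community_fitness` and `alpha` and the illustrative degree values.

```python
def community_fitness(k_in: float, k_out: float, alpha: float = 1.0) -> float:
    """Assumed LFM-style fitness: k_in / (k_in + k_out) ** alpha."""
    total = k_in + k_out
    if total == 0:
        # A community without any incident edge has no measurable density.
        return 0.0
    return k_in / total ** alpha


# Illustrative community: two internal edges and three edges leaving it.
k_in = 2 * 2   # every internal edge is counted from both endpoints
k_out = 3

for alpha in (0.5, 1.0, 1.5):
    print(alpha, round(community_fitness(k_in, k_out, alpha), 4))
```

Under this assumed form, a smaller resolution parameter yields a larger fitness for the same community, which is why communities tend to grow larger when the parameter is small.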
Example 2. In Figure 1, suppose community C is composed of nodes 5, 6, and 7. Then $N(C)$ = {1}, since node 1 connects with community C. Node 8 does not belong to $N(C)$ since it does not connect with community C. If there were an edge between nodes 5 and 8, node 8 would belong to $N(C)$. $k_{in}^{C}$ is 4 since the degrees of nodes 5, 6, and 7 inside community C are 1, 1, and 2, respectively, and 1 + 1 + 2 = 4. Another way is to count the number of edges in community C: there are 2 edges in community C, thus $k_{in}^{C} = 2 \times 2 = 4$. $k_{out}^{C}$ is obtained similarly.
Definition 10. The fitness of node $v$ with respect to community C can be obtained as follows:
$$f_C^{v} = f_{C \cup \{v\}} - f_{C \setminus \{v\}} \qquad (9)$$
The disadvantage of this method is that although most nodes can be assigned to the corresponding communities, some nodes fail to be assigned, resulting in low network coverage. Therefore, this paper expands, for the second time, the nodes that have not been assigned to any community. This is in accordance with the actual situation. For example, in a social network, everyone has friends and belongs to a circle of friends [37]. This paper assumes that each node belongs to at least one community. A gravitational function is defined by the ratio of the gravitation between nodes and the gravitational degree of nodes. The gravitation of node $v$ is the sum of the gravitational degrees between node $v$ and its neighbors. The more neighbors of node $v$ the community C contains, the greater the gravitation between the community and node $v$ is. The gravitational function is given as follows:
Definition 11. The gravitation of community C to node $v$ is measured by the gravitational function, that is, the ratio of the gravitation between $v$ and its neighbors in C to the gravitational degree of $v$.
When the seed set is found in the first stage, each seed is expanded by the greedy strategy, that is, the local objective function of the community is maximized by adding a node to the temporary community or deleting it from the community. The principle of the algorithm is as follows. We first put a seed into temporary community C. Then, we calculate the fitness of all its neighbors and add the neighbor with the maximum fitness into C, as shown in lines 3–7 of Algorithm 2. After this neighbor is added, the structure of the community changes, so the fitness of each node with respect to the new temporary community should be updated. If a node has a negative fitness, it is removed from the community, as shown in lines 9–14 of Algorithm 2. The above expansion is iterated until adding any node decreases the fitness. We then store temporary community C into the community set and remove these nodes from the network.
Obviously, when a community is expanded, the fitness of the nodes in the community and of its neighbors needs to be recalculated. To do this efficiently, we adopt the following steps. If there is an edge between $v_i$ and $v_j$, then $a_{ij}$ is 1; otherwise, it is 0. The initial values of $k_{in}^{C}$ and $k_{out}^{C}$ are 0. If node $v_i$ is added into the community, we adopt Equations (10) and (11). If node $v_i$ is removed from the community, we adopt Equations (12) and (13).
Example 3. In Figure 1, according to Example 2, community C = {5, 6, 7}. Let node 1 be added into community C. According to Equation (10), $k_{in}^{C} = 4 + 2 \times 3 = 10$, since when node 1 is added, three edges are added into community C. $k_{out}^{C}$ is updated according to Equation (11). Let node 5 be removed from community C. According to Equation (12), $k_{in}^{C} = 4 - 2 \times 1 = 2$, since there is one edge in community C that connects with node 5. $k_{out}^{C}$ is updated according to Equation (13).
Equation (8) is used to update the community fitness. In this way, we only need to know the degree of node $v_i$ and calculate $a_{ij}$ for the nodes $v_j$ that are both in community C and neighbors of $v_i$. To further speed up the calculation, we store $k_{in}^{C}$ and $k_{out}^{C}$, which are updated when temporary community C adds or removes a node, as shown in lines 7–8 and 11–12 of Algorithm 2.
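Algorithm 2 GetNaturalcoms.
Require: network G, seed set S, and the resolution parameter of the fitness function;
Ensure: community set;
1: initialize the community set as empty;
2: for each seed s in S do
3:     put s into temporary community C;
4:     while C has neighbors do
5:         select the neighbor v of C with the maximum fitness;
6:         if adding v does not decrease the fitness of C then
7:             add v to C;
8:             update the internal and external degrees of C;
9:             for each node u in C do
10:                if the fitness of u is negative then
11:                    remove u from C;
12:                    update the internal and external degrees of C;
13:                end if
14:            end for
15:        else
16:            break;
17:        end if
18:    end while
19:    store C into the community set;
20:    remove the nodes of C from the network;
21: end for
22: return the community set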
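The first expansion with community cleaning can be sketched for a single seed as follows. The sketch makes several assumptions that should be read as ours rather than the paper's: the fitness has the form used by the fitness function method of [31], the node fitness is the difference between the community fitness with and without the node, degrees are recomputed by direct counting instead of being updated incrementally, and the removal of assigned nodes from the network between seeds is omitted.

```python
def fitness(k_in, k_out, alpha):
    """Assumed LFM-style community fitness."""
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def degrees(adj, members):
    """(k_in, k_out) of a node set, obtained by direct counting."""
    k_in = sum(len(adj[v] & members) for v in members)
    k_out = sum(len(adj[v] - members) for v in members)
    return k_in, k_out


def node_fitness(adj, members, node, alpha):
    """Fitness of the community with `node` minus its fitness without it."""
    with_node = fitness(*degrees(adj, members | {node}), alpha)
    without_node = fitness(*degrees(adj, members - {node}), alpha)
    return with_node - without_node


def expand_seed(adj, seed, alpha=1.0):
    """Greedy expansion of one seed, cleaning after every accepted node."""
    community = {seed}
    while True:
        frontier = set().union(*(adj[v] for v in community)) - community
        if not frontier:
            break
        best = max(sorted(frontier),
                   key=lambda v: node_fitness(adj, community, v, alpha))
        if node_fitness(adj, community, best, alpha) <= 0:
            break  # no neighbor improves the community any further
        community.add(best)
        # Cleaning step: drop members whose fitness has become negative.
        for v in sorted(community):
            if len(community) > 1 and node_fitness(adj, community, v, alpha) < 0:
                community.discard(v)
    return community


# Hypothetical graph: two triangles (1, 2, 3) and (4, 5, 6) joined by edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(expand_seed(adj, 1), expand_seed(adj, 6))
```

Every accepted addition and every removal strictly increases the community fitness in this sketch, so the loop cannot cycle. A faster variant would replace the direct counting in `degrees` with the incremental updates of the stored internal and external degrees.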
Finally, we expand nodes for the second time. If a node does not belong to any community, the node is merged into the community with the greatest gravitation, as shown in Algorithm 3.
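Algorithm 3 ExpandingSecond.
Require: node set V and community set;
Ensure: community set;
1: if some nodes in V do not belong to any community then
2:     for each such node v do
3:         for each community C in the community set do
4:             calculate the gravitation of C to v;
5:         end for
6:         add v to the community with the greatest gravitation;
7:     end for
8: end if
9: return the community set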
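A possible reading of the second expansion is sketched below. Since the pairwise gravitation between two nodes is not fixed here, it is supplied by the caller through the hypothetical parameter `pair_gravitation`; the pull of a community on a node is then assumed to be the share of the node's total gravitation that falls on its neighbors inside that community. The usage example uses a constant gravitation, so the pull reduces to the fraction of neighbors inside the community.

```python
def second_expansion(adj, communities, pair_gravitation):
    """Assign every uncovered node to the community that attracts it most.

    communities is a list of node sets and is updated in place.
    """
    if not communities:
        return communities
    covered = set().union(*communities)
    for node in sorted(set(adj) - covered):
        total = sum(pair_gravitation(node, u) for u in adj[node])

        def pull(community, node=node, total=total):
            inside = sum(pair_gravitation(node, u)
                         for u in adj[node] & community)
            return inside / total if total else 0.0

        # Ties (including a pull of zero everywhere) go to the first community.
        max(communities, key=pull).add(node)
    return communities


# Hypothetical graph: node 7 touches one member of the first community
# and two members of the second one.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 7},
       4: {3, 5, 6}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {3, 5, 6}}
communities = [{1, 2, 3}, {4, 5, 6}]
result = second_expansion(adj, communities, lambda a, b: 1.0)
print(result)
```

After this step every node of the graph belongs to at least one community, which is the coverage property that the second expansion is meant to guarantee.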
3.3. Merging Overlapping Communities
In a nonoverlapping community, a node belongs to only one community [
46], while a node may belong to multiple communities in an overlapping community structure. Therefore, two communities may be similar. When the similarity reaches a certain level, excessive overlap occurs, resulting in redundant communities [
47]. Hence, after discovering the communities, this paper defines a measure of community distance which is used to discover and merge the overlapping communities to simplify the community structure.
Definition 12. The distance between communities and is In this paper, is the threshold of the distance parameter. If , communities and are merged into one community since they overlap excessively. The Merge_Overlap algorithm is shown in Algorithm 4.
To avoid invalid calculations, we adopt the principle of inverted index to prune invalid detection of overlapping communities. Therefore, set is used to store the communities in which node belongs. An illustrative example is shown as follows:
Suppose we have 3 communities: , , and . We know that , , and . To obtain the overlapping community of , we calculate = since . Therefore, communities and are two overlapping communities. It is not necessary to calculate the distance between communities and . Therefore, the inverted index is an effective pruning strategy.
According to the above example, we should create set at first, as shown in lines 1–7 of Algorithm 4. Apparently, if the number of elements in is greater than 1, it indicates that node belongs to multiple communities and is an overlapping node. We determine whether the communities in overlap or not, as shown in lines 8–19 of Algorithm 4.
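The pruning effect of the inverted index can be illustrated with a short sketch. It only enumerates the pairs of communities that share at least one node, which are the only pairs whose distance needs to be computed; the distance measure and the threshold test are left out. The function name `candidate_pairs` and the three communities in the usage example are hypothetical.

```python
from itertools import combinations


def candidate_pairs(communities):
    """Index pairs of communities that share at least one node."""
    index = {}  # node -> set of indices of the communities containing it
    for cid, members in enumerate(communities):
        for node in members:
            index.setdefault(node, set()).add(cid)
    pairs = set()
    for cids in index.values():
        if len(cids) > 1:  # the node is an overlapping node
            pairs.update(combinations(sorted(cids), 2))
    return pairs


# Hypothetical communities: only the first two share a node (node 3).
communities = [{1, 2, 3}, {3, 4, 5}, {6, 7}]
print(candidate_pairs(communities))
```

In this example only one of the three possible pairs is returned, so the distance between the remaining pairs never has to be evaluated.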
To sum up, Algorithm 5 presents the overlapping community discovery algorithm based on two expansions of seeds.
Algorithm 4 MergeOverlap. |
Require: network , community set , and parameter ; |
Ensure: the new community set ; |
- 1:
fordo - 2:
for each do - 3:
if then - 4:
; - 5:
end if - 6:
end for - 7:
end for - 8:
for each do - 9:
for each do - 10:
if then - 11:
; - 12:
end if - 13:
end for - 14:
for do - 15:
if then - 16:
; - 17:
end if - 18:
end for - 19:
end for - 20:
return
|
Algorithm 5 TES.
Require: network G, the resolution parameter of the fitness function, and the distance threshold;
Ensure: community set, and the set of communities to which each node belongs;
1: call GetSeed; //Searching for the seed in the network
2: call GetNaturalcoms; //Expand each seed according to the fitness function
3: call ExpandingSecond; //Expand the nodes for the second time
4: call MergeOverlap; //Merge the overlapping communities in the network
5: return the community set
3.4. Theoretical Analysis
The space complexity and time complexity of TES are and , respectively, where k, n, and m are the number of seeds, nodes, and edges in G, respectively. The reason is shown as follows:
The space complexity of network G is . The space complexity of all neighbors of each node is since each edge should be calculated. Thus, the time complexity of of each node is also . Further, the space complexity and time complexity of , , and are also . Obviously, the time complexity of of each node is and the space complexity of is . Hence, the time complexity of lines 2–5 in Algorithm 1 is . Since each node will be checked once, the time complexity of lines 6–13 is . Therefore, both the space complexity and time complexity of Algorithm 1 are .
Suppose we find
k seeds, where
. When a node is added into or removed from a community, no more than
n edges are checked. Thus, the time complexity of Equations (
10)–(
13) are
. Hence, the time complexity of Equations (
7)–(
9) are also
. A node can be assigned into no more than
k communities. Therefore, the time complexity of Algorithm 2 is
.
Suppose there are t nodes which are expanded twice, where . Each node will be added into each community once. Thus, the time complexity of lines 3–5 is . Therefore, the time complexity of Algorithm 3 is .
Obviously, the time complexity of lines 1–7 of Algorithm 4 is since there are k communities and n nodes. Similarly, the time complexity of lines 8–19 of Algorithm 4 is also .
Apparently, each community has no more than n nodes. Thus, the space complexities of these communities are . Hence, the space complexities of Algorithms 2, 3, and 4 are .
Since , the time complexity of TES is and the space complexity of TES is .
5. Conclusions
In this paper, we propose an overlapping community discovery algorithm, named TES, which has three parts. In the first part, the local maximum node is taken as the seed based on the gravitational degree. The second part discovers the natural communities by two expansions. In the first expansion, each community is expanded based on the fitness function and is cleaned after a new node is added. The second expansion is based on the gravitational function. The third part examines and merges the overlapping communities. To verify the soundness of these parts, four comparative algorithms, TES_Seed, TES_Unclean, TES_Fitness, and TES_Unmerge, are proposed. Besides these four algorithms, three state-of-the-art algorithms, CONGA, COPRA, and LFM, are employed. Experimental results on five real networks show that TES outperforms all these competitive algorithms.