Article

Algorithm for Detecting Communities in Complex Networks Based on Hadoop

School of Information, Central University of Finance and Economics, Beijing 100081, China
*
Author to whom correspondence should be addressed.
Symmetry 2019, 11(11), 1382; https://doi.org/10.3390/sym11111382
Submission received: 9 October 2019 / Revised: 31 October 2019 / Accepted: 6 November 2019 / Published: 7 November 2019
(This article belongs to the Special Issue Recent Advances in Social Data and Artificial Intelligence 2019)

Abstract

With the explosive growth of the scale of complex networks, existing community detection algorithms are unable to meet the need for rapid analysis of the community structure of complex networks. A new algorithm for detecting communities in complex networks based on the Hadoop platform, called Community Detection on Hadoop (CDOH), is proposed in this paper. Based on the idea of the modularity increment, our algorithm implements parallel merging and accomplishes fast and accurate detection of the community structure of complex networks. Extensive experimental results on three real complex network datasets demonstrate that the CDOH algorithm significantly improves the efficiency of current memory-based community detection algorithms without affecting the accuracy of the community detection.

1. Introduction

In the era of Web 2.0, objects are connected to each other by various technologies such as the Internet and the Internet of Things, and they form a variety of complex networks such as interpersonal interaction, citation, transportation, and protein interaction networks. Complex networks are widely used in sociology, management, computer science, operations research, biology, and other disciplines, and their wide application prospects have attracted the interest of many researchers. For example, Watts and Strogatz [1] applied complex network theory in the field of biology and considered the nervous system to be a complex network of large numbers of nerve cells connected by nerve fibers. Faloutsos [2] applied complex network analysis to study computer networks and evaluated their stability by analyzing their robustness. Sen et al. [3] mapped the transportation network to a complex network and implemented optimal planning and configuration of the transportation network using dynamic analysis of the complex network. Xiao et al. [4] constructed a directed and weighted complex network based on the Beijing traffic network, analyzed the load-bearing pressure of the traffic network, and mined the corresponding regional centers, which provided theoretical support for optimizing urban public transport network systems. Based on a characteristic analysis of complex networks, Ruguo [5] proposed a method for social coordination governance and provided ideas for handling mass public events.
Many studies have analyzed the inherent characteristics of complex networks and discovered the relationships between node attributes and connections within networks. To discover the features of complex networks, several community detection algorithms have been proposed. A so-called "community" is a sub-network composed of a group of nodes that are closely connected to each other and sparsely connected to nodes of other communities. The community structure, made up of one or more communities, is a common feature of complex networks. The accurate identification of the community structure in complex networks plays an important theoretical role in public opinion monitoring, interest recommendation, identification of the internal structure of networks, and other related research. As a result, many researchers have studied community detection algorithms from the aspects of modularity and edge structure. For example, Newman and Girvan [6] proposed the concept of modularity and mined the community structure of complex networks, and Yang et al. [7] introduced a method for analyzing the edge structure and node properties to improve the accuracy of the detection of the community structure. The accurate identification of the community structure in complex networks has broad applications, such as influence maximization, discovery of influential nodes within a community, interest recommendation, edge intelligence empowered recommendation [8], and so on.
However, the existing studies on complex network community detection algorithms have focused on small-scale data sets and are limited to improving the accuracy of community detection while neglecting its efficiency. At the same time, with the advent of the big data era, the increasing number of network users, and the exponential growth of generated content, the number of nodes in complex networks shows an explosive growth trend. At present, many social networking platforms, such as WeChat, Weibo, Facebook, and Twitter, have more than 100 million online users and various interaction forms, including follow-ups, comments, and sharing. The large-scale complex network data sets generated by such platforms are characterized by node diversity, complex structure, and the fusion of multiple kinds of complexity, which challenges the accuracy of traditional complex network community detection algorithms. Furthermore, traditional community detection algorithms are based on matrix iterations, which makes them unable to meet real-time and flexibility requirements.
In this paper, we propose a new complex network community detection algorithm based on the Hadoop framework, called Community Detection on Hadoop (CDOH). Hadoop is a distributed system infrastructure developed by the Apache Foundation. Our contributions are as follows:
  • Based on the idea of the maximum modularity, and combining the distributed characteristics of the Hadoop platform, a new modularity matrix update method is proposed and a corresponding community merging strategy is constructed to implement a fast and accurate detection and discovery of complex network community structures;
  • We theoretically analyze the proposed CDOH algorithm and show that it can achieve an $O(n)$ computational cost when enough parallel nodes are used;
  • Experimental results on 3 real datasets demonstrate that CDOH significantly outperforms traditional complex network community detection algorithms in terms of efficiency while achieving comparable accuracy of community detection.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 describes the proposed CDOH algorithm and analyzes its computational complexity. Section 4 presents the experimental results together with their analysis. Section 5 concludes the paper and discusses future work.

2. Related Works

Since Newman [6,9] proposed modularity optimization, the modularity-based approach has been used in many network community mining algorithms, such as the classic fast Newman community division algorithm [9] and the CNM algorithm [10]. The fast Newman algorithm is an agglomerative hierarchical clustering algorithm that starts from a state in which each node is the sole member of one of n communities and repeatedly joins communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in modularity. Recently, Lei et al. [11] implemented an edge community mining algorithm based on the local information of the network under consideration. Xiong [12] proposed a community discovery algorithm that combines user closeness with clustering algorithms. Weiping [13] proposed the concept of user gravity for accurate community discovery; Leng [14] proposed a new network community detection algorithm based on a greedy optimization technique. Zhang et al. [15] further improved the fast Newman algorithm by introducing an improved closeness centrality index to classify overlapping nodes; the proposed method demonstrated high classification accuracy in detecting overlapping communities with a time complexity of $O(n^2)$.
Blondel et al. [16] improved the modularity increment solution by merging communities iteratively using a new calculation formula and achieved good results. Parsa et al. [17] used a probability vector model based on a univariate estimation of distribution algorithm (EDA) that combines an evolutionary algorithm with a community discovery method to enable community detection; Oliveira et al. [18] used an improved Kuramoto coupled oscillator synchronization model to analyze networks from their dynamic factors and implemented a method for community discovery in complex networks. Ling Xing et al. [19] proposed a method that combines the sliding time-window method with a hierarchical encounter model based on association rules to increase the fidelity of the extracted networks by alleviating the homophily effect. Yuhui Gong et al. [20] focused on customers' conformity behaviors in a symmetric market where customers are located in a social network; their simulation results showed that topology structure, network size, and initial market share have significant effects on the evolution of customers' conformity behaviors. Recently, Aceto et al. [21,22] and Ruoyu Wang et al. [23] applied deep learning and machine learning technologies to research on social networking.
Recently, researchers have proposed complex network community detection algorithms based on big data platforms. Clauset [24] proposed a parallel community detection method based on the CNM algorithm. The basic idea of the algorithm proposed in [24] is to calculate the maximum community modularity in parallel and to recognize the communities of large-scale networks by decreasing the communication overhead. The limitation of this algorithm is that it fails to run when the network scale increases and the amount of data rises to a certain level. Jinpeng [25] proposed a link community recognition algorithm based on the Hadoop platform. While this algorithm resolves the limitation of the link community method, which cannot store and process large matrices when analyzing big networks, it is still not efficient enough; its processing time exceeds 5000 seconds when the number of nodes reaches 15,000. Riedy et al. [26] used servers with multi-core processors to calculate the maximum community modularity in parallel to identify communities; however, this method has strong hardware dependencies.
Moon et al. [27] proposed a parallel GN algorithm [6] based on Hadoop that is divided into 4 stages, each consisting of a map and a reduce process. In the first stage, the tuples of all node pairs are generated; in the second and third stages, the edges with large edge betweenness values are identified and removed, respectively; in the fourth stage, the tuples are recalculated according to the new network. The experimental results demonstrated that the efficiency of the algorithm increases linearly with the number of reducers in charge of the reduce process. Weijiang et al. [28] proposed a parallel Louvain algorithm that addresses the main time-consuming steps of the Louvain algorithm [29], namely calculating the modularity and traversing the modularity increments. The proposed algorithm outputs the information about all neighbors of a node in the map phase and decides the new home community of the node in the reduce phase accordingly. When computing the new community of a node, it is necessary to ensure that the neighbors' communities are up-to-date, which is hard to guarantee in a distributed environment. Therefore, the problems of "community interchange" and "community ownership delay" arise easily, which can be solved by resolving the associated connected graph. To solve the problem of the high complexity of the fast Newman algorithm [10] in calculating the modularity of nodes, Bingzhou [30] proposed a parallel fast Newman algorithm based on Hadoop that, in the map stage, calculates in parallel the modularity increment of each node merged with its neighbors. In the reduce stage, the 2 nodes with the largest modularity increment are found and merged. The map and reduce processes are executed iteratively until all nodes are merged into 1 community. To deal with the problems of the fast-unfolding algorithm in processing large-scale networks, Bingzhou [30] also proposed a parallel fast-unfolding algorithm based on Hadoop and the divide-and-conquer principle. First, a large-scale network is partitioned and each partition is merged separately; then the network is reconstructed according to the merging results of each partition; finally, the network is merged and reconstructed iteratively until the community structure no longer changes. Conte et al. [31] proposed an algorithm that is able to find large k-plexes in very large graphs in just a few minutes and to scale up to tens of machines with tens of cores each. Vincenzo et al. [32] proposed a novel algorithm for community detection in social networks based on game theory and showed that it outperforms other algorithms in terms of computational complexity and effectiveness; however, it cannot scale to a huge number of nodes and edges.
Traditional community detection algorithms focus on small-scale data sets and are hard to scale to large data sets. While parallel community detection algorithms are more scalable, they cannot achieve a good trade-off between efficiency and accuracy. To overcome the shortcomings of both families of algorithms, we propose a new complex network community detection algorithm based on Hadoop, which effectively implements fast and accurate detection of complex network community structures. Compared with traditional community detection algorithms, it can scale to large data sets; compared with parallel community detection algorithms, it achieves a good trade-off between efficiency and accuracy.

3. Complex Network Community Detecting Algorithm Based on Hadoop

The proposed CDOH algorithm is based on the idea of the maximal modularity increment, which employs a new modularity matrix updating method and a community merging strategy.

3.1. Definitions

This section provides formal definitions of the basic concepts involved in the proposed complex network community detection algorithm. The symbols and their meanings are shown in Table 1.
Definition 1.
(Complex network) A complex network is a network consisting of a series of nodes and their interconnecting edges, denoted as $N = (V, E)$. Here, $V = \{v_i \mid i = 1, 2, \dots, n\}$ represents the set of nodes in the complex network, and $E = \{e_{ij} \mid v_i, v_j \in V\}$ represents the set of edges, where $e_{ij}$ denotes the connection between nodes $v_i$ and $v_j$. If they are connected, then $e_{ij} = 1$; otherwise, $e_{ij} = 0$.
Definition 2.
(Node degree) In a complex network $N = (V, E)$, the node degree $d_i$ of each node $v_i$ is defined as the number of edges connected to node $v_i$, as given by Equation (1):
$$d_i = \sum_{v_j \in V,\ i \neq j} e_{ij} \quad (1)$$
Figure 1 illustrates a simple network community structure. According to Definitions 1 and 2, there are 12 nodes in the network (from $v_1$ to $v_{12}$), where $e_{12} = 1$, $e_{19} = 0$, and $v_1$ has node degree $d_1 = 4$.
Definition 3.
(Modularity) The modularity M of a network is defined by Equation (2):
$$M = \sum_{c \in C} \left( \frac{l_c}{m} - a_c^2 \right) \quad (2)$$
Here, $C = \{c_i \mid i = 1, 2, \dots, k\}$ denotes the detected set of network communities, $l_c$ denotes the total number of edges between nodes within community c, m denotes the total number of edges in the network, and
$$a_c = \frac{D_c}{2m} \quad (3)$$
where $D_c$ denotes the sum of the node degrees of all nodes in community c; $D_c$ equals twice $l_c$ plus the number of edges connecting community c to other external communities.
According to Equation (2), the modularity of a complex network measures the degree of closeness within communities and the degree of sparseness between communities. The closer the internal connections of the communities are and the sparser the connections between the communities are, the greater the modularity M is, and vice versa. Thus, when the modularity M of a complex network is the largest, the community detection result is optimal. However, it is quite difficult to determine directly whether M has reached its maximum. Therefore, the concept of the modularity increment $\Delta M$ proposed by Newman is adopted, where $\Delta M$ is the increase or decrease in the modularity M caused by merging communities $c_i$ and $c_j$, as defined in Equation (4).
$$\Delta M = \frac{R_{ij}}{m} - 2 \times a_i \times a_j \quad (4)$$
Here, $R_{ij}$ denotes the number of edges connecting communities $c_i$ and $c_j$, with $i \neq j$. The modularity M keeps increasing as long as $\Delta M > 0$. On the contrary, when the maximum $\Delta M < 0$, the modularity M has reached its maximum and the community detection process ends.
When the numbers of nodes and edges in a complex network remain the same and different communities are merged to form a new community, the number of edges among the nodes within the new community is the sum of the numbers of edges within the 2 merged communities plus the number of edges between them. Accordingly, [14] points out that, when the numbers of nodes and edges remain the same, the modularity increment between a new community formed by merging multiple known communities and the other communities can be updated as in Equation (5).
$$\Delta M[c_z][c_k] = \begin{cases} \Delta M[c_z][c_k] + \Delta M[c_i][c_k], & \langle c_i, c_k \rangle \in E,\ c_i \in c_z \\ \Delta M[c_z][c_k] - 2 \times a_i \times a_k, & \langle c_i, c_k \rangle \notin E,\ c_i \in c_z \end{cases} \quad (5)$$
Here, $c_z$ denotes the new community after merging, $c_k$ denotes an old community that does not belong to $c_z$, $c_i$ denotes an old community merged into $c_z$, and $\langle c_i, c_k \rangle$ denotes an edge between communities $c_i$ and $c_k$.
Taking the network structure in Figure 1 as an example, each node initially represents a community. Equation (4) can be used to calculate the modularity increment $\Delta M$ between any 2 communities and to form the matrix shown in Table 2, where the first row and the first column contain the community numbers. Since only pairs of distinct communities need to be merged and changes within a single community need not be considered, the diagonal of the matrix is initialized to 0. From the values of the matrix, we can observe that the community pairs that can be merged in this example are $c_2$ and $c_4$, $c_2$ and $c_5$, $c_7$ and $c_{12}$, and $c_{11}$ and $c_{12}$, for which $\Delta M$ takes the maximal value 0.036. Taking the community $c_{13}$ formed by merging $c_2$ and $c_4$ as an example, the results after merging are listed in Table 3.
As can be noticed from Table 2 and Table 3, the modularity increment between the community $c_{13}$ and each other community is the sum of the modularity increments between the communities $c_2$ and $c_4$ and the corresponding community. For example, in Table 3, the modularity increment of the communities $c_1$ and $c_{13}$ is 0.021, which is the sum of the modularity increment 0.033 of $c_1$ and $c_2$ and the modularity increment −0.012 of $c_1$ and $c_4$, as shown in Table 2.
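As a quick numeric check of Equations (4) and (5) against Table 2 and Table 3 (the values $m = 22$ and $d_2 = d_4 = 3$ are inferred from Figure 1 and the table entries; only $d_1 = 4$ is stated explicitly in the text):
$$\Delta M_{2,4} = \frac{R_{24}}{m} - 2 a_2 a_4 = \frac{1}{22} - 2 \cdot \frac{3}{44} \cdot \frac{3}{44} \approx 0.045 - 0.009 = 0.036,$$
$$\Delta M_{13,1} = \Delta M_{2,1} - 2 a_4 a_1 = 0.033 - 2 \cdot \frac{3}{44} \cdot \frac{4}{44} \approx 0.033 - 0.012 = 0.021,$$
where the second line applies Equation (5) with $c_2$ connected to $c_1$ and $c_4$ not connected to $c_1$, reproducing the corresponding entries of Tables 2 and 3.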
Since the modularity matrix update of Equation (5) allows communities to be merged in parallel and therefore fits the parallel processing model of the Hadoop platform, we adopt this incremental update method to construct the proposed CDOH algorithm. According to the modularity increment of Equation (4), we initialize the entire network, treat each node as a community, and calculate the modularity increment of merging any 2 communities. Then, we iterate to find new communities. Based on the MapReduce parallel programming model, all community pairs with the maximum modularity increment are identified and merged in parallel, and Equation (5) is used to update, also in parallel, the modularity increment between any 2 communities. The community discovery process ends when the maximum modularity increment becomes negative. Finally, the CDOH algorithm stores the node set V as tuples (vId, cId), where vId denotes the node number and cId denotes the community number, and the edge set E as tuples (s, d, $\Delta M$), where s denotes the source node of an edge, d denotes its destination node, and $\Delta M$ is the modularity increment corresponding to this edge.
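For concreteness, the node and edge sets of the Figure 1 example might be laid out on HDFS as tab-separated text lines, one (vId, cId) record per node and one (s, d, $\Delta M$) record per edge (the exact on-disk encoding is not specified in the paper, so this layout is only an assumption; the $\Delta M$ values are taken from Table 2):

nodes:
1    1
2    2
edges:
1    2    0.033
2    4    0.036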

3.2. The CDOH Algorithm

Following the framework described in Section 3.1, CDOH consists of 4 steps: first, the parameters are initialized; second, the maximum modularity increment is found; third, the communities are merged and the modularity increments are updated; finally, the final community discovery results are generated. Steps 2 and 3 are repeated to find new communities until the maximum modularity increment becomes negative. The flow chart of the CDOH algorithm is shown in Figure 2. Step 1 (parameter initialization), step 2 (finding the maximum modularity increment), and step 3 (merging communities and updating the modularity increments) are implemented with the MapReduce parallel programming model of Hadoop.

3.2.1. Parameter Initialization

The initialization phase is responsible for calculating the necessary parameters of the algorithm, which include the total number of nodes n, the total number of edges m, the degree d of each node, the vector a, and the modularity increment $\Delta M$ between each pair of nodes. The process is listed in Algorithm 1; its main steps are the following:
  • First, we load the complex network data from the input file, then calculate the number of nodes n and edges m of the complex network, and broadcast the number of edges (m) to all nodes;
  • Second, we calculate the degree d of each node and the vector a according to Equation (3);
  • Finally, we use Equation (4) to calculate the modularity increment $\Delta M$ between each pair of nodes and construct a new network N carrying these modularity increments.
Algorithm 1 Initialization of CDOH Parameters
Input: 
D: Preprocessed network data;
Output:
$\Delta M$: Modularity increment;
N: Network;
1: N = networkLoad(D);
2: n = getVertices(N);
3: m = getEdges(N);
4: Broadcast the number of edges m to all nodes in the cluster;
5: for each Node i in N do
6:  $k_i$ = getDegree(i);
7:  $a_i = \frac{k_i}{2m}$;
8: for each Edge e in N do
9:  $\Delta M_{ij} = \frac{R_{ij}}{m} - 2 \times a_i \times a_j$;
Here, we first divide the $n \times n$ matrix into multiple sub-matrices and deploy multiple mappers, letting each mapper calculate the vector a of its nodes and the modularity increment between each pair of nodes of its sub-matrix. All mappers work in parallel.
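For illustration, the following is a minimal sketch of how the per-edge part of Algorithm 1 (lines 8–9) could be written as a Hadoop mapper, assuming the classic org.apache.hadoop.mapreduce API, tab-separated edge lines of the form "src dst", and a precomputed degree table and edge count m passed through the job configuration; the class, key, and configuration names are illustrative and not the paper's actual implementation.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: one (edge, deltaM) record per input edge, following Algorithm 1.
// Assumes tab-separated edge lines "src<TAB>dst" and a precomputed degree table
// passed through the job configuration as "degrees" = "nodeId:degree,nodeId:degree,...".
public class InitDeltaMMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Map<String, Long> degree = new HashMap<>();
    private long m; // total number of edges, broadcast via the configuration

    @Override
    protected void setup(Context context) {
        m = context.getConfiguration().getLong("total.edges", 1L);
        String table = context.getConfiguration().get("degrees", "");
        for (String entry : table.split(",")) {
            if (entry.isEmpty()) continue;
            String[] kv = entry.split(":");
            degree.put(kv[0], Long.parseLong(kv[1]));
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] edge = line.toString().split("\t");
        String s = edge[0];
        String d = edge[1];
        double as = degree.getOrDefault(s, 0L) / (2.0 * m);   // a_i = d_i / (2m), Equation (3)
        double ad = degree.getOrDefault(d, 0L) / (2.0 * m);
        double deltaM = 1.0 / m - 2.0 * as * ad;              // R_ij = 1 for a single edge, Equation (4)
        context.write(new Text(s + "\t" + d), new DoubleWritable(deltaM));
    }
}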

3.2.2. Finding the Maximum Modularity Increment

After completing the modularity increment calculation, we start the iterative community discovery: we find the multiple community pairs with the largest modularity increment and merge them into the corresponding new communities. Take the network shown in Figure 1 as an example. According to the $\Delta M$ matrix shown in Table 2, the pairs $c_2$ and $c_4$, $c_2$ and $c_5$, $c_7$ and $c_{12}$, and $c_{11}$ and $c_{12}$ can be merged. Clearly, the communities $c_2$, $c_4$, $c_5$ and $c_7$, $c_{11}$, $c_{12}$ should be merged into the communities $c_{13}$ and $c_{14}$, respectively.
Algorithm 2 describes the process of finding the maximum modularity increment, which consists of 4 steps:
  • First, we compare the $\Delta M$ value of each edge e in the network N, find the maximum modularity increment $max(\Delta M)$, and broadcast it to all nodes in the cluster;
  • Second, we compute the Cartesian product T of the edge set E and the node set V, T = (s, sc, d, dc, $\Delta M$), where s denotes the number of the source node, d denotes the number of the destination node, sc and dc denote the community numbers of the source node and the destination node, respectively, and $\Delta M$ denotes the modularity increment between the source node and the destination node;
  • Third, we find the subset MC of the set T in which $\Delta M$ equals $max(\Delta M)$;
  • Finally, to organize the merged communities, we obtain the community number i of the source node and the community number j of the destination node, which represent the current communities to be merged. If i or j already belongs to a new community in C, we merge i and j into that community; otherwise, we merge i and j into another new community whose number is n + 1. The final output is the set of communities C after merging.
Algorithm 2 Find the Maximum Modularity Increment and the Communities that Need to be Merged
Input:
$\Delta M$: Modularity increments;
N(E, V): Network;
Output:
C = {$c_1, c_2, \dots, c_l$}: Communities;
$max(\Delta M)$: Maximum modularity increment;
  1: $max(\Delta M)$ = searchMaxDeltaM(N);
  2: Broadcast $max(\Delta M)$ to all nodes in the cluster;
  3: T = E × V;
  4: for each quintuple t in T do
  5:  if getDeltaM(t) == $max(\Delta M)$ then
  6:   MC = insert(t);
  7: for each quintuple t in MC do
  8:  (i, j) = getCommuNum(t);
  9:  if i ∈ C or j ∈ C then
10:   k = get the number of the new community containing i or j from C;
11:   $c_k$ = insert(i, j);
12:  else
13:   n = n + 1;
14:   $c_n$ = insert(i, j);
Here, we find the maximum modularity increment $max(\Delta M)$ based on MapReduce. After dividing the $n \times n$ matrix into multiple sub-matrices, each mapper finds the maximum modularity increment of its sub-matrix in the map phase and outputs the result to the reducer; in the reduce phase, the reducer outputs the global maximum modularity increment $max(\Delta M)$. Afterwards, we find the community pairs with the largest modularity increment, again based on MapReduce: each mapper finds the community pairs with the largest modularity increment of its sub-matrix in parallel.
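As an illustrative sketch of this step (again assuming tab-separated (s, d, $\Delta M$) text records and the standard MapReduce API; class names are hypothetical), each mapper can keep only the local maximum of its split and emit it once in cleanup(), so that a single reducer computes the global maximum:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper scans its split of (s, d, deltaM) records and emits only the local
// maximum; a single reducer then outputs the global max(deltaM), which the driver
// broadcasts before the merging step.
public class MaxDeltaM {

    public static class LocalMaxMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private double localMax = Double.NEGATIVE_INFINITY;

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] f = line.toString().split("\t");   // s, d, deltaM
            localMax = Math.max(localMax, Double.parseDouble(f[2]));
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new Text("max"), new DoubleWritable(localMax));
        }
    }

    public static class GlobalMaxReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new DoubleWritable(max));
        }
    }
}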

3.2.3. Merging and Updating Communities

Merging and updating communities is the core of the proposed algorithm. After step 2, the community pairs with the maximum modularity increment have been identified; the mappers then update, in parallel, the numbers of the communities to be merged and the community numbers of the corresponding nodes to their new community numbers, and the $\Delta M$ values between any 2 communities are also updated by the mappers in parallel.
The steps of merging and updating communities, listed in Algorithm 3, are the following:
  • First, we obtain the Cartesian product T of the node set V and the edge set E. Then, for each tuple t = (s, sc, d, dc, $\Delta M$), we look for the new community numbers corresponding to t.sc and t.dc. Let X be the set of community numbers merged in this round that are contained in the new community of t.sc, and let Y be the set of community numbers merged in this round that are contained in the new community of t.dc;
  • Second, using Equation (5), we merge and update each community i in X and each community j in Y (a small code sketch of this update follows Algorithm 3). If there is an edge connecting communities i and j, then the modularity increment between the new communities X and Y includes the modularity increment between communities i and j; if there is no edge connecting communities i and j, the modularity increment between the new communities X and Y is reduced by twice the product of the vector value $a_i$ of community i and the vector value $a_j$ of community j.
Algorithm 3 Merging and Updating Communities
Input:
C = {$c_1, c_2, \dots, c_l$}: Communities;
N(E, V): Network;
Output:
N(E, V): Updated network;
  1: Update the numbers of the communities that need to be merged and the community numbers of the corresponding nodes to their corresponding new community numbers;
  2: T = V × E;
  3: for each quintuple t in T do
  4:  tsc = getNewCommuNum(t.sc);
  5:  tdc = getNewCommuNum(t.dc);
  6:  if (tsc ∈ C or tdc ∈ C) and tsc ≠ tdc then
  7:   X = the set of community numbers merged in this round contained in the new community corresponding to t.sc;
  8:   Y = the set of community numbers merged in this round contained in the new community corresponding to t.dc;
  9:   for each community i in X and each community j in Y do
10:    if there exists at least one edge connecting i and j then
11:     $\Delta M_{XY} = \Delta M_{XY} + \Delta M_{ij}$
12:    else
13:     $\Delta M_{XY} = \Delta M_{XY} - 2 \times a_i \times a_j$
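To make the update rule of lines 9–13 concrete, the following is a small in-memory sketch of Equation (5); the data structures and names are illustrative, and in CDOH this computation is distributed across mappers. Given the old communities merged into a new community $c_z$ and the old pairwise $\Delta M$ values, it accumulates the increment between $c_z$ and an untouched community $c_k$.

import java.util.List;
import java.util.Map;

// Applies Equation (5): the increment between a merged community c_z = {c_i, ...}
// and an untouched community c_k is the sum, over the members c_i of c_z, of
// deltaM[c_i][c_k] when <c_i, c_k> is an edge and of -2 * a_i * a_k otherwise.
public class DeltaMUpdate {

    public static double mergedDeltaM(List<Integer> membersOfCz,
                                      int ck,
                                      Map<Integer, Map<Integer, Double>> deltaM, // old pairwise deltaM, present only for connected pairs
                                      Map<Integer, Double> a) {                  // a_i for every old community
        double result = 0.0;
        for (int ci : membersOfCz) {
            Map<Integer, Double> row = deltaM.get(ci);
            if (row != null && row.containsKey(ck)) {
                result += row.get(ck);                  // connected case
            } else {
                result -= 2.0 * a.get(ci) * a.get(ck);  // unconnected case
            }
        }
        return result;
    }
}

For the example of Section 3.1, calling mergedDeltaM with $c_z = \{c_2, c_4\}$ and $c_k = c_1$ adds 0.033 for the connected member $c_2$ and subtracts $2 a_4 a_1 \approx 0.012$ for the unconnected member $c_4$, reproducing the value 0.021 in Table 3.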

3.2.4. Generating Community Discovery Results

After the community discovery finishes, the redundant data in the data set (primarily the matrix data) is cleared, while the initial node set and the community numbers of the nodes are kept. Here, the node storage structure in the network is V = (vId, cId), where vId denotes the node number and cId denotes the number of the community to which the node belongs. Algorithm 4 presents the process of generating the community partition results, which consists of 2 steps:
  • We first traverse all nodes and group the nodes with the same community number cId together. If cId is already in C, the community cId has already appeared; the node Ids already stored for community cId in C are taken out, merged with the current node Id, and stored back into C. Otherwise, the pair is stored in C directly;
  • Then we store each community and its node set on the Hadoop distributed file system (HDFS) one by one. Thus, CDOH stores the final results of community discovery as a set of tuples (cId, vIds) and finishes the detection and discovery of complex network communities on the Hadoop platform (a MapReduce sketch of this step follows Algorithm 4).
Algorithm 4 Generating Community Discovery Results
Input:
N(E, V): Network;
Output:
C = {$c_1, c_2, \dots, c_l$}: Communities;
1: for each v = (vId, cId) in N do
2:  if cId ∈ C then
3:   g = getNodeId(C, cId);
4:   c = insert(g, vId);
5:   C = insert(cId, c);
6:  else
7:   C = add(cId, vId);
8: for each community c in C do
9:  output c;
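A possible MapReduce realization of Algorithm 4 (illustrative class names; the paper does not give its exact implementation) maps each (vId, cId) record to the key cId and lets the reducer concatenate the member node ids of each community before writing them to HDFS:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Groups nodes by community id and writes one (cId, vIds) record per community.
public class CommunityOutput {

    public static class NodeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");    // vId, cId
            context.write(new Text(f[1]), new Text(f[0]));
        }
    }

    public static class CommunityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cId, Iterable<Text> vIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder members = new StringBuilder();
            for (Text vId : vIds) {
                if (members.length() > 0) members.append(",");
                members.append(vId.toString());
            }
            context.write(cId, new Text(members.toString()));
        }
    }
}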

3.3. Computational Complexity Analysis of the CDOH Algorithm

As presented before, in step 1, multiple mappers take charge of initializing the $n \times n$ matrix. Suppose the matrix is divided into p sub-matrices and each mapper processes one sub-matrix in parallel; the computational cost of initializing the matrix is then the cost of initializing one sub-matrix, that is, $O(\frac{n^2}{p})$. In step 2, the maximum modularity increment $max(\Delta M)$ and the community pairs with the largest modularity increment are found with MapReduce. Again, if the matrix is divided into p sub-matrices processed by p mappers in parallel, the computational cost of step 2 is also $O(\frac{n^2}{p})$. In step 3, the mappers update in parallel the numbers of the communities to be merged and the community numbers of the corresponding nodes, whose computational cost is $O(1)$; after merging, the $\Delta M$ values between any 2 communities are updated by the mappers in parallel, and with one mapper per sub-matrix the cost of updating $\Delta M$ is $O(\frac{n^2}{p})$. In step 4, all nodes are traversed and the nodes with the same community number are grouped together, whose computational cost is $O(n)$. Steps 2 and 3 are repeated until the maximum modularity increment $max(\Delta M)$ becomes negative, and after some iterations the $n \times n$ matrix shrinks to a constant computing cost. As a result, the cost of our algorithm is inversely proportional to the number of sub-matrices, which is determined by the number of nodes in the Hadoop cluster. Supposing we have n nodes to perform the parallel computation, we can achieve an $O(n)$ computing cost.
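Writing p for the number of sub-matrices (one per mapper), the cost of a single iteration of steps 2 and 3 can be summarized as
$$T_{iter} = \underbrace{O\!\left(\frac{n^2}{p}\right)}_{\text{step 2}} + \underbrace{O(1) + O\!\left(\frac{n^2}{p}\right)}_{\text{step 3}} = O\!\left(\frac{n^2}{p}\right),$$
so with $p = n$ mappers a single iteration costs $O(n)$; the initialization of step 1 adds another $O\!\left(\frac{n^2}{p}\right)$ and step 4 adds $O(n)$.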

4. Experimental Results

4.1. Datasets and Evaluation Algorithms

To evaluate the accuracy and running time of CDOH, 3 real complex network data sets obtained from the Stanford Network Analysis Project (SNAP) were selected. The data sets contain the nodes and connection status of real complex networks and mark the communities to which the nodes belong. Table 4 gives the characteristics of the data sets used in the experiments.
To evaluate our algorithm, we use 2 state-of-the-art algorithms in our experiments: the traditional complex network community detection algorithm Fast Community Detection (FCD) proposed by Newman [9] and the Non-Overlapping Community Detection Idea (OCDI) algorithm proposed by Zhang et al. [15].
All the algorithms were implemented in Java, and our algorithm was deployed on a Hadoop cluster made of 3 computers, with 1 serving as the master node and the other 2 serving as slave nodes. The following experimental results are reported as averages over 10 runs.

4.2. Analysis of Community Detection Accuracy

We used the community detection accuracy (CDA) metric to measure the accuracy of community detection. CDA is defined as the ratio of the number of nodes in the correctly identified communities to the total number of nodes in the network, as shown in Equation (6).
$$CDA = \frac{\sum_{i=1}^{k} \max_j \{|c_i \cap c'_j|\}}{n}, \quad j = 1, 2, \dots, l \quad (6)$$
Here, $C = \{c_1, c_2, \dots, c_k\}$ denotes the original, ground-truth community set, $C' = \{c'_1, c'_2, \dots, c'_l\}$ denotes the community set identified by the community detection algorithm, $\max_j\{|c_i \cap c'_j|\}$ denotes the maximum number of common nodes between the i-th ground-truth community $c_i$ and any detected community $c'_j$, and n denotes the number of nodes. The larger the value is, the higher the accuracy of the community detection algorithm is and the better the quality of the resulting communities is. Figure 3 shows the community detection accuracies of the considered algorithms on the 3 data sets.
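For reference, a small sketch of how the CDA of Equation (6) can be computed offline from two community sets, each community represented as a set of node ids (class and variable names are illustrative):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// CDA = (sum over ground-truth communities c_i of max_j |c_i ∩ c'_j|) / n, Equation (6)
public class Cda {

    public static double cda(List<Set<Integer>> groundTruth,
                             List<Set<Integer>> detected,
                             int n) {
        long matched = 0;
        for (Set<Integer> truth : groundTruth) {
            long best = 0;
            for (Set<Integer> found : detected) {
                Set<Integer> overlap = new HashSet<>(truth);
                overlap.retainAll(found);          // |c_i ∩ c'_j|
                best = Math.max(best, overlap.size());
            }
            matched += best;
        }
        return (double) matched / n;
    }
}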
It can be noticed from Figure 3 that the accuracy of the CDOH algorithm is slightly lower than that of the FCD algorithm (by 1.7% on average) and similar to that of OCDI. The reason is that CDOH and OCDI use similar community merging strategies and modularity update principles. While CDOH and OCDI merge multiple community pairs in the same iteration, FCD merges only a single pair of communities per iteration, which explains the accuracy gap between FCD and the other 2 algorithms.
We also used the normalized mutual information (NMI) to compare our algorithm with the other 2 algorithms. NMI [33] is a standard measure that is often used to quantify the difference between the detected partition and the true partition of the network. NMI is defined in Equation (7), in which $H(X)$ is the entropy of X and $H(X|Y) = H(X, Y) - H(Y)$.
$$NMI(X, Y) = \frac{H(X) - H(X|Y) + H(Y) - H(Y|X)}{2 \max(H(X), H(Y))} \quad (7)$$
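Since $H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X;Y)$, Equation (7) reduces to $I(X;Y)/\max(H(X), H(Y))$. The following is a compact sketch of this computation from two community label assignments over the same n nodes (class and variable names are illustrative):

import java.util.HashMap;
import java.util.Map;

// NMI(X, Y) = I(X; Y) / max(H(X), H(Y)), which is what Equation (7) reduces to,
// computed from two community label assignments x[] and y[] over the same n nodes.
public class Nmi {

    public static double nmi(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> cx = new HashMap<>();
        Map<Integer, Integer> cy = new HashMap<>();
        Map<Long, Integer> cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cx.merge(x[i], 1, Integer::sum);
            cy.merge(y[i], 1, Integer::sum);
            cxy.merge(((long) x[i] << 32) | (y[i] & 0xffffffffL), 1, Integer::sum);
        }
        double hx = entropy(cx.values(), n);
        double hy = entropy(cy.values(), n);
        double hxy = entropy(cxy.values(), n);
        double mutualInfo = hx + hy - hxy;            // I(X;Y) = H(X) + H(Y) - H(X,Y)
        return mutualInfo / Math.max(hx, hy);
    }

    private static double entropy(Iterable<Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts) {
            double p = (double) c / n;
            h -= p * Math.log(p);
        }
        return h;
    }
}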
We can see from Figure 4 that the NMI of the 3 algorithms reaches at least 75%. Our algorithm, CDOH, has an NMI score very similar to that of the FCD algorithm and a slightly higher score than OCDI. Again, we consider this to be due to the fact that our algorithm uses similar community merging strategies and modularity update principles.
However, the computing cost of our algorithm is much lower than that of FCD, as discussed in Section 4.3.

4.3. Analysis of Community Detection Efficiency

CDOH is a community detection algorithm for large-scale complex networks based on the Hadoop platform. When processing large-scale data, the run time of an algorithm is an important metric for evaluating its performance. Figure 5 compares the run times of the 3 considered algorithms.
It can be noticed from Figure 5 that CDOH is highly efficient. Compared with OCDI and FCD, CDOH is about 2.1 times and 3.2 times faster, respectively, and the speedup is mainly determined by the number of slave nodes on the Hadoop platform. Compared with the traditional community detection algorithms, CDOH requires significantly less time for community merging and modularity updating.

5. Conclusions and Future Works

5.1. Conclusions

In this paper, we proposed a community detection algorithm called CDOH based on the Hadoop platform to implement accurate and fast community identification in large-scale complex networks. The algorithm is based on the modularity increment calculation method and employs the theory of complex networks to find multiple communities satisfying certain merging conditions. The parallel merging of communities and the parallel updating of the modularity increments based on MapReduce reduce the number of iterations. CDOH was compared with traditional complex network community detection algorithms on real large-scale complex networks. The experimental results demonstrated the effectiveness and efficiency of CDOH in large-scale network community detection.

5.2. Future Works

Our proposed CDOH algorithm is independent of the underlying big data platform. To prove its effectiveness and efficiency, we implemented the CDOH algorithm and other complex network community detection algorithms on the Hadoop platform. However, on the Hadoop platform, the intermediate MapReduce results are first stored in disk files, and the large number of I/O operations affects the overall computation time, while on the Spark platform the intermediate results are stored in memory, which avoids the performance overhead caused by I/O. In the future, we will implement the CDOH algorithm on the Spark platform and evaluate its efficiency. Furthermore, our proposed CDOH algorithm focuses on static complex network community discovery; in the future, we plan to adapt the proposed algorithm to evolving community networks.

Author Contributions

Conceptualization, M.H.; methodology, H.L.; software, Z.M.; validation, M.H. and X.G.; formal analysis, M.H.; investigation, H.L.; resources, H.L.; data curation, H.L.; writing–original draft preparation, M.H.; writing–review and editing, M.H., H.L. and Z.M.; visualization, H.L.

Funding

This research is supported by the National Natural Science Foundation of China (61100112, 61309030), the Beijing Higher Education Young Elite Teacher Project (YETP0987), the Top Discipline Construction Project of Central University of Finance and Economics in 2019 (Key Technologies and Application of Independent Controllable Block Chain), the Fundamental Research Funds for the Central Universities, and the Education and Teaching Reform Fund of Central University of Finance and Economics in 2018 (2018GRZDJG06).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Watts, D.J.; Strogatz, S.H. Collective dynamics of 'small-world' networks. Nature 1998, 393, 440–442.
  2. Faloutsos, M.; Faloutsos, P.; Faloutsos, C. On power-law relationships of the Internet topology. ACM SIGCOMM Comput. Commun. Rev. 1999, 29, 251–262.
  3. Sen, P.; Manna, S.S. Clustering properties of a generalized critical Euclidean network. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 2003, 68, 026104.
  4. Zheng, X.; Chen, J.; Shao, J.; Bie, L. Topological properties analysis of Beijing public transport network based on complex network theory. J. Phys. 2012, 61, 95–105.
  5. Fan, R. Cooperative Innovation of Social Governance under the Paradigm of Complex Network Structure. Soc. Sci. China 2014, 4, 98–120.
  6. Newman, M.E.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 2003, 69, 17–32.
  7. Yang, J.; Leskovec, J. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 2015, 42, 181–213.
  8. Xin, S.; Giancarlo, S.; Vincenzo, M.; Antonio, P.; Christian, E.; Chang, C. An Edge Intelligence Empowered Recommender System Enabling Cultural Heritage Applications. IEEE Trans. Ind. Inf. 2019, 15, 4266–4275.
  9. Newman, M.E. Fast algorithm for detecting community structure in networks. Phys. Rev. E 2003, 69, 066133.
  10. Clauset, A.; Newman, M.E.; Moore, C. Finding community structure in very large networks. Phys. Rev. E 2004, 70, 066111.
  11. Pan, L.; Jin, J.; Wang, C.; Xie, J. Edge Community Mining Based on Local Information in Social Networks. J. Electron. 2012, 40, 2255–2263.
  12. Xiong, Z. Community Discovery Technology and Its Application in Online Social Networks; Central South University: Changsha, China, 2012.
  13. Huang, W. Research on Web Community Discovery Algorithms; Beijing University of Posts and Telecommunications: Beijing, China, 2013.
  14. Leng, Z. Research on network community discovery algorithm based on greedy optimization technology. J. Electron. 2014, 42, 723–729.
  15. Zhang, X.; You, H.; Zhu, W.; Quiao, S.; Li, J.; Gutierrez, L.A.; Zhang, Z.; Fan, X. Overlapping community identification approach in online social networks. Physica A Stat. Mech. Appl. 2015, 421, 233–248.
  16. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of community hierarchies in large networks. Comput. Res. Repos. 2008, abs/0803.0476.
  17. Parsa, M.G.; Mozayani, N.; Esmaeili, A. An EDA-based community detection in complex networks. In Proceedings of the International Symposium on Telecommunications, Tehran, Iran, 9–11 September 2014; pp. 476–480.
  18. Oliveira, J.E.M.D.; Quiles, M.G. Community Detection in Complex Networks Using Coupled Kuramoto Oscillators. In Proceedings of the International Conference on Computational Science and ITS Applications, Guimaraes, Portugal, 30 June–3 July 2014; pp. 85–90.
  19. Jing-Ya, X.; Tao, L.; Lin-Tao, Y.; Davison, M. Finding College Student Social Networks by Mining the Records of Student ID Transactions. Symmetry 2019, 11, 307.
  20. Yuhui, G.; Qian, Y. Evolution of Conformity Dynamics in Complex Social Networks. Symmetry 2019, 11, 299.
  21. Giuseppe, A.; Domenico, C.; Antonio, M.; Antonio, P. Mobile Encrypted Traffic classification Using Deep Learning. In Proceedings of the 2018 Network Traffic Measurement and Analysis Conference (TMA), Vienna, Austria, 26–29 June 2018.
  22. Giuseppe, A.; Domenico, C.; Antonio, M.; Pescapé, A. Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges. IEEE Trans. Netw. Serv. Manag. 2019, 16, 445–458.
  23. Ruoyu, W.; Zhen, L.; Yongming, C.; Deyu, T.; Jin, Y.; Zhao, Y. Benchmark Data for Mobile App Traffic Research. In Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, New York, NY, USA, 5–7 November 2018.
  24. Clauset, A. Finding local community structure in networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 2005, 72, 026132.
  25. Li, J. Research on Overlapping Community Discovery Algorithm Based on Hadoop Platform; Jilin University: Changchun, China, 2014.
  26. Riedy, J.; Bader, D.A.; Meyerhenke, H. Scalable Multi-threaded Community Detection in Social Networks. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Shanghai, China, 21–25 May 2012; pp. 1619–1628.
  27. Moon, S.; Lee, J.G.; Kang, M. Scalable community detection from networks by computing edge betweenness on MapReduce. In Proceedings of the 2014 International Conference on Big Data and Smart Computing (BIGCOMP), Bangkok, Thailand, 15–17 January 2014; pp. 145–148.
  28. Wu, W.; Li, M.; Li, G. A Parallelization of Louvain algorithm. Comput. Digit. Eng. 2016, 44, 1402–1406.
  29. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 10, P10008.
  30. Lai, B. Research on Parallelization of Community Discovery Algorithm Based on Hadoop; Jiangxi University of Science and Technology: Ganzhou, China, 2017.
  31. Alessio, C.; Tiziano, D.M.; Daniele, D.S.; Grossi, R.; Marion, A.; Versari, L. D2k: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes; KDD 2018; ACM: New York, NY, USA, 2018; pp. 1272–1281.
  32. Vincenzo, M.; Antonio, P.; Giancarlo, S. Community detection based on Game Theory. Eng. Appl. Artif. Intell. 2019, 85, 773–782.
  33. Mcdaid, A.F.; Greene, D.; Hurley, N. Normalized Mutual Information to evaluate overlapping community finding algorithms. CoRR 2011, abs/1110.2515.
Figure 1. A Simple Network Community Structure.
Figure 2. Flow chart of the Community Detection on Hadoop (CDOH) Algorithm.
Figure 3. Comparison of the Accuracy of the Community Detection Algorithms.
Figure 4. Comparison of the normalized mutual information (NMI) of the Community Detection Algorithms.
Figure 5. Comparison of the Runtime of Community Detection Algorithms.
Table 1. Symbols and Definitions.
Symbol: Meaning
N: a complex network
V: the set of nodes
$v_i$: node i
E: the set of edges
$e_{ij}$: the connection between nodes $v_i$ and $v_j$; $e_{ij} = 1$ if they are connected, and $e_{ij} = 0$ otherwise
$d_i$: the node degree of node $v_i$
M: the modularity of a network
C: the set of detected network communities
$c_i$: community i
$l_c$: the total number of edges between nodes within community c
m: the total number of edges in the network
$D_c$: the sum of the node degrees of all nodes in community c
$a_c$: the ratio of the sum of degrees of all nodes in community c to the sum of degrees of all nodes in N
$\Delta M$: the modularity increment
$R_{ij}$: the number of edges connecting communities $c_i$ and $c_j$
Table 2. $\Delta M$ Matrix before Network Merging.
      1      2      3      4      5      6      7      8      9      10     11     12
1   0.000  0.033  0.025 −0.012  0.033  0.029 −0.012 −0.017 −0.021 −0.017 −0.012 −0.012
2   0.033  0.000 −0.015  0.036  0.036 −0.012 −0.009 −0.012 −0.015 −0.012 −0.009 −0.009
3   0.025 −0.015  0.000  0.030 −0.015  0.025 −0.015  0.025 −0.026  0.025 −0.015 −0.015
4  −0.012  0.036  0.030  0.000 −0.009  0.033 −0.009 −0.012 −0.015 −0.012 −0.009 −0.009
5   0.033  0.036 −0.015 −0.009  0.000  0.033 −0.009 −0.012 −0.015 −0.012 −0.009 −0.009
6   0.029 −0.012  0.025  0.033  0.033  0.000 −0.012 −0.017 −0.021 −0.017 −0.012 −0.012
7  −0.012 −0.009 −0.015 −0.009 −0.009 −0.012  0.000  0.033  0.030 −0.012 −0.009  0.036
8  −0.017 −0.012  0.025 −0.012 −0.012 −0.017  0.033  0.000  0.025  0.029 −0.012 −0.012
9  −0.021 −0.015 −0.026 −0.015 −0.015 −0.021  0.030  0.025  0.000  0.025  0.030  0.030
10 −0.017 −0.012  0.025 −0.012 −0.012 −0.017 −0.012  0.029  0.025  0.000  0.033 −0.012
11 −0.012 −0.009 −0.015 −0.009 −0.009 −0.012 −0.009 −0.012  0.030  0.033  0.000  0.036
12 −0.012 −0.009 −0.015 −0.009 −0.009 −0.012  0.036 −0.012  0.030 −0.012  0.036  0.000
Table 3. $\Delta M$ Matrix after Merging $c_2$ and $c_4$.
      1      3      5      6      7      8      9      10     11     12     13
1   0      0.025  0.033  0.029 −0.012 −0.017 −0.021 −0.017 −0.012 −0.012  0.021
3   0.025  0     −0.015  0.025 −0.015  0.025 −0.026  0.025 −0.015 −0.015  0.015
5   0.033 −0.015  0      0.033 −0.009 −0.012 −0.015 −0.012 −0.009 −0.009  0.027
6   0.029  0.025  0.033  0     −0.012 −0.017 −0.021 −0.017 −0.012 −0.012  0.021
7  −0.012 −0.015 −0.009 −0.012  0      0.033  0.03  −0.012 −0.009  0.036 −0.019
8  −0.017  0.025 −0.012 −0.017  0.033  0      0.025  0.029 −0.012 −0.012 −0.025
9  −0.021 −0.026 −0.015 −0.021  0.03   0.025  0      0.025  0.03   0.03  −0.031
10 −0.017  0.025 −0.012 −0.017 −0.012  0.029  0.025  0      0.033 −0.012 −0.025
11 −0.012 −0.015 −0.009 −0.012 −0.009 −0.012  0.03   0.033  0      0.036 −0.019
12 −0.012 −0.015 −0.009 −0.012  0.036 −0.012  0.03  −0.012  0.036  0     −0.019
13  0.021  0.015  0.027  0.021 −0.019 −0.025 −0.031 −0.025 −0.019 −0.019  0
Table 4. Characteristics of Datasets.
Dataset          No. of Nodes   No. of Edges   Node Average Degree   Description
Soc-Epinions     75,879         508,837        13.4118               Epinions.com Data Set
Web-NotreDame    325,729        1,497,134      9.1925                Web Graph Data Set
Soc-Pokec        1,632,803      30,622,564     37.5092               Pokec Social Data Set
