Article

GTIP: A Gaming-Based Topic Influence Percolation Model for Semantic Overlapping Community Detection

1 School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150001, China
2 School of Automatic Control Engineering, Harbin Institute of Petroleum, Harbin 150028, China
3 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
* Authors to whom correspondence should be addressed.
Entropy 2022, 24(9), 1274; https://doi.org/10.3390/e24091274
Submission received: 3 August 2022 / Revised: 1 September 2022 / Accepted: 7 September 2022 / Published: 9 September 2022

Abstract:
Community detection in semantic social networks is a crucial issue in online social network analysis, and has received extensive attention from researchers in various fields. Different conventional methods discover semantic communities based merely on users’ preferences towards global topics, ignoring the influence of topics themselves and the impact of topic propagation in community detection. To better cope with such situations, we propose a Gaming-based Topic Influence Percolation model (GTIP) for semantic overlapping community detection. In our approach, community formation is modeled as a seed expansion process. The seeds are individuals holding high influence topics and the expansion is modeled as a modified percolation process. We use the concept of payoff in game theory to decide whether to allow neighbors to accept the passed topics, which is more in line with the real social environment. We compare GTIP with four traditional (GN, FN, LFM, COPRA) and seven representative (CUT, TURCM, LCTA, ACQ, DEEP, BTLSC, SCE) semantic community detection methods. The results show that our method is closer to ground truth in synthetic networks and has a higher semantic modularity in real networks.

1. Introduction

In recent years, with the rapid development of mobile internet technology and the continuous popularization of mobile terminal devices, social platforms such as Micro-blog, WeChat, QQ, SNS, RSS, etc., have profoundly changed social interaction. People can join or set up their own communities and update their status in the form of text, pictures, and videos to realize the sharing, dissemination, and acquisition of personal information. According to statistics from comScore, Inc. (Reston, VA, USA), as of 2018, an average of 395,833 people logged in to WeChat per minute and 19,444 people were engaged in video or voice chat; Sina Micro-blog sent or forwarded 64,814 microblogs per minute; Facebook users shared an average of four billion dynamic items of information per day; Twitter processed 340 million items of data per day; Tumblr authors published an average of 27,000 new posts per minute; and Instagram users shared an average of 3600 photos per day. Facing the data explosion caused by the growth of social media, the traditional topological space of social networks is shifting towards a rich semantic form, which poses great challenges to the detection of social network communities.
Community detection can effectively improve the performance of social application systems. For example, by analyzing the social behavior patterns of network users and detecting the audience groups of social services, the commercial value of advertising and product marketing can be significantly improved [1]. Han et al. [2] used community detection to realize information transfer between networks and solved the cold start problem of recommendation systems caused by network sparsity. In addition, community detection is widely used in network embedding [3], public health [4], and link prediction [5].
In conventional community detection methods, the network is represented as a topology graph and the nodes do not contain semantic information. Representative methods in this field include the GN (Girvan–Newman) algorithm [6], FN (Fast Newman) algorithm [7], CPM (Cluster Percolation Method) algorithm [8], and Louvain algorithm [9]. In recent research, Qiao et al. [10] proposed Picaso, a parallel community discovery model which uses the Mountain model to calculate the weight of each edge in the network and applies a gradient algorithm to discover the community structure. To solve the problem of community detection in large-scale complex networks, Lu et al. [11] proposed an improved label propagation algorithm using node importance ranking. Lyzinski et al. [12] embedded graphs in Euclidean space to obtain their lower-dimensional representation, then used non-parametric graph reasoning technology to identify the structural similarity between communities. This method performed well in detecting fine-grained community structures. Tagarelli et al. [13] integrated multi-layer network community modularity, which retains multi-layer topology information and optimizes the edge connectivity of multi-relational communities.
In semantic community detection tasks, the nodes are the basic components of the topology graph as well as the carriers of semantic information which leads to fundamental changes in the community’s form [14]. For example, after considering the document attributes of nodes, the common topics between nodes play a decisive role in the formation of the community. Two people who share a common topic may join the same community even if they do not have a strong connection in the topology graph [15]. Therefore, the use of semantic information to analyze the correlation between network nodes has become a critical issue in this field.
The Probabilistic Topic Model (PTM) is a common semantic representation method used for social network nodes [16]. For example, Xin et al. [17] defined the semantic feature of nodes according to the similarity between user documents and a set of global topics, then adopted multi-sampling to accelerate the convergence of the algorithm. He et al. [18] transformed LDA (Latent Dirichlet Allocation) and Markov Random Field (MRF) into a unified factor graph to form an end-to-end learning system for community detection, then derived an effective propagation algorithm to train their parameters. Jin et al. [19] stated that links in the network contain semantic information as well. They proposed a new probabilistic model for link community detection, and developed a dual nested Expectation Maximum (EM) algorithm to learn the model. Wang et al. [20] found that there are correlations between topics which significantly affect community structures. They proposed a Topic Correlations-based Community Detection (TCCD) model which can simultaneously output the community structure and the semantic interpretation of nodes. Node attributes can be used to address semantic data as well; for example, Fang et al. [21] grouped nodes that satisfied both structure cohesiveness and keyword cohesiveness into the same community.
Non-negative Matrix Factorization (NMF) has good performance in discovering implicit patterns in high-dimensional data. Therefore, scholars have integrated semantic information into the adjacency (or feature representation) matrix and used NMF to analyze the correlation between nodes. For example, Pei et al. [22] proposed a clustering framework based on Non-negative Matrix Tri-Factorization (NMTF) which can effectively identify both user similarity and message similarity. Qin et al. [23] introduced an adaptive parameter to control the contribution of the network topology and content information and used NMF to discover semantic communities. Wang et al. [24] set the member matrix and attribute matrix as two groups of parameters of NMF, which allows semantic interpretation for the communities to be added. Yang et al. [25] introduced an adaptive weighted group for sparse low-rank regularization in NMF in order to automatically obtain the number of semantic communities.
Deep learning has a natural advantage in attribute representation of high-dimensional data; thus, researchers have begun to introduce semantic attributes into the feature dimension of deep learning models [26]. For example, Jin et al. [27] proposed a uniformed graph representation of network topology and semantic information and developed a multi-component network embedding approach via a deep autoencoder. Cao et al. [28] designed a combination matrix consisting of a modularity matrix for linkage information and a Markov matrix for content information. After matrix factorization, the matrix is used as the input of the multi-layer deep auto-encoder framework for obtaining the deep representation of the graph. Jin et al. [29] observed that the words in user documents have a hierarchical structure, and proposed a new Bayesian probability model which can explain the multiplex semantic community more clearly. He et al. [30] developed a co-learning strategy to jointly train the structure and semantic parts of the model by combining a nested EM algorithm and belief propagation.
While the above methods have made a great many exploratory contributions to the field of semantic community detection, there are several remaining deficiencies:
(1)
When measuring the semantic relevance between nodes, each topic receives the same status without considering the difference of topic influence.
(2)
There has been little exploration of the impact of topic propagation and influence propagation in community detection.
(3)
Methods based on deep learning require a large number of samples, high computational performance, and long training times. When the network evolves rapidly, these methods cannot meet the online requirements of social systems.
To better cope with these situations, and inspired by the information dissemination in social networks, we propose a user topic influence propagation model based on percolation theory that uses the Nash equilibrium to generate communities in a game-based way. Experiments with real social networks show that the proposed method has a high semantic modularity [17] in social networks with rich semantic attributes. In addition, the algorithm can converge in a short time without additional training. In summary, the contributions of this paper include:
(1)
Integrating topic influence into the correlation analysis of nodes, which makes the community detection process conform to the law of information dissemination in social networks.
(2)
A proposed one-dimensional diffusion model in percolation mechanics that can quantify the propagation of topic influence, which in turn can describe the impact of nodes near the topic source in the semantic space more accurately and solve the situation in which high-influence nodes in the network present a low influence score.
(3)
Use of the Nash equilibrium from game theory to generate communities, thereby identifying overlapping and non-overlapping communities at the same time and identifying community structures with smaller granularity.

2. LDA Model of Semantic Social Networks

2.1. LDA Representation of Nodes

The semantic space representation of nodes is generated based on LDA, a three-tier Bayesian probability model used for document-topic generation, involving words, topics, and documents. LDA considers documents to be composed of topics, and each topic can be represented by a set of keywords. For example, a technology topic has a high probability of containing the keywords "chip" and "artificial intelligence". The probability distribution of a document over the topics shows the relevance of the document to each topic. The mathematical symbols involved in LDA are shown in Table 1.
The LDA vector is stored as a triplet $(w, d, z)$, where $w_i$, $d_i$, and $z_i$ are the keyword number, the node number, and the topic number of keyword $i$, respectively [31]. Figure 1 shows the data storage structure of the LDA vector, in which the shaded part represents identical elements in the vector. For example, $w_{i_1} = w_{i_2} = w_{i_4} = w_{i_5}$ indicates that $w_{i_1}, w_{i_2}, w_{i_4}, w_{i_5}$ are the same word; $d_{i_1} = d_{i_3} = d_{i_5} = d_{i_6}$ indicates that $w_{i_1}, w_{i_3}, w_{i_5}, w_{i_6}$ are keywords of the same node $d_{i_1}$, and the keyword $w_{i_1}$ appears twice in $d_{i_1}$. Additionally, $z_{i_1} = z_{i_2} = z_{i_6}$ indicates that $z_{i_1}, z_{i_2}, z_{i_6}$ belong to the same topic $z_{i_1}$; the keyword $w_{i_1}$ appears twice in $z_{i_1}$, and $z_{i_1}$ belongs to both $d_{i_1}$ and $d_{i_2}$. According to [31], the mathematical descriptions of $w$, $d$, $z$ are as follows:
(1)
$\theta \sim \mathrm{Dir}(\alpha)$; the topic distribution $\theta$ of nodes follows the Dirichlet distribution (noted as Dir in the formula) with parameter $\alpha$.
(2)
$z_i \mid \theta^{(d_i)} \sim \mathrm{Multinomial}(\theta^{(d_i)})$; the probability of topic $z_i$ in node $d_i$ under topic distribution $\theta$ follows the multinomial distribution (noted as Multinomial in the formula).
(3)
$\lambda \sim \mathrm{Dir}(\beta)$; the keyword distribution follows the Dirichlet distribution with parameter $\beta$.
(4)
$w_i \mid z_i, \lambda^{(z_i)} \sim \mathrm{Multinomial}(\lambda^{(z_i)})$; the probability of keyword $w_i$ in topic $z_i$ under keyword distribution $\lambda$ follows the multinomial distribution.
To generate the LDA model, the first step is to extract the keyword distribution satisfying $\lambda \sim \mathrm{Dir}(\beta)$. Next, a topic distribution satisfying $\theta \sim \mathrm{Dir}(\alpha)$ is extracted for each document in the corpus. Finally, for each keyword, a topic and a keyword are drawn to satisfy $z_i \mid \theta^{(d_i)} \sim \mathrm{Multinomial}(\theta^{(d_i)})$ and $w_i \mid z_i, \lambda^{(z_i)} \sim \mathrm{Multinomial}(\lambda^{(z_i)})$, respectively.
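As an illustration, the generative process above can be sketched in Python with NumPy. The function name `lda_generate`, the symmetric hyperparameters, and the toy corpus sizes are our own illustrative choices, not part of the original model:

```python
import numpy as np

def lda_generate(n_docs, n_topics, vocab_size, doc_len, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process described above.

    Returns the (w, d, z) triplet representation: parallel arrays of
    keyword ids, node (document) ids, and topic ids.
    """
    rng = np.random.default_rng(seed)
    # lambda ~ Dir(beta): one keyword distribution per topic
    lam = rng.dirichlet([beta] * vocab_size, size=n_topics)
    w, d, z = [], [], []
    for doc in range(n_docs):
        # theta ~ Dir(alpha): topic distribution of this node
        theta = rng.dirichlet([alpha] * n_topics)
        for _ in range(doc_len):
            topic = rng.choice(n_topics, p=theta)        # z_i ~ Multinomial(theta)
            word = rng.choice(vocab_size, p=lam[topic])  # w_i ~ Multinomial(lambda_z)
            w.append(word); d.append(doc); z.append(topic)
    return np.array(w), np.array(d), np.array(z)

w, d, z = lda_generate(n_docs=3, n_topics=2, vocab_size=10, doc_len=5)
print(len(w), len(d), len(z))  # three parallel arrays of equal length
```

The three returned arrays correspond directly to the $(w, d, z)$ storage structure of Figure 1.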
Figure 1. Data storage structure of LDA vector.
Table 1. Description of notation.
Notation | Description
G | Semantic social network
|G| | The number of nodes in G
N | The total number of keywords in G
N_i | The number of keywords of node G_i
w | Keyword vector
w_i | The i-th keyword in vector w
d | Node number vector corresponding to w
d_i | The node number to which w_i belongs
z | Topic number vector corresponding to w
z_i | The topic number to which w_i belongs
θ(g_i) | The topic distribution probability of node i
λ(j) | The distribution of keywords in topic j
λ_{w_i}(j) | The probability that w_i belongs to topic j
α | Prior parameter of the topic distribution for each node
β | Prior parameter of the keyword distribution within a topic
Table 2. Differences between fluid percolation and semantic percolation.
Attribute | Fluid Percolation | Semantic Percolation
Percolation area | Adjacent area | Adjacent nodes
Percolation process | Reversible | Irreversible
Percolation direction | Flows to the percolation area | From high-influence nodes to low-influence nodes
Percolation condition | Contains fluid | Determined by the game

2.2. Gibbs Iterative Process

In statistics, Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm used to approximately draw a sample sequence from a multivariate probability distribution when direct sampling is difficult. The key is to establish a posterior estimate for a sample and perform Gibbs sampling on the posterior estimate expression.
The expression of the Bayesian relation of z and w is
$$P(z_i = j \mid w_i) = \frac{P(w_i \mid z_i = j)\, P(z_i = j)}{P(w_i)}, \qquad P(w_i) = \sum_{j=1}^{|z|} P(w_i \mid z_i = j)\, P(z_i = j) \qquad (1)$$
After transformation, we have
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})\, P(w_i \mid \mathbf{z}_{-i}, \mathbf{w}_{-i}) = P(w_i \mid z_i = j, \mathbf{z}_{-i}, \mathbf{w}_{-i})\, P(z_i = j \mid \mathbf{z}_{-i}) \qquad (2)$$
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto P(w_i \mid z_i = j, \mathbf{z}_{-i}, \mathbf{w}_{-i})\, P(z_i = j \mid \mathbf{z}_{-i}) \qquad (3)$$
The process of Gibbs sampling is as follows:
(1) Each $z_i$ ($i = 1, 2, \ldots, N$) is initialized as a random integer between 1 and $K$, which forms the initial state of the Markov chain.
(2) According to the literature [32], the right side of Equation (3) can be expanded as
$$P(w_i \mid z_i = j, \mathbf{z}_{-i}, \mathbf{w}_{-i}) = \frac{f_{-i,j}(w_i) + \beta}{f_{-i,j}(\cdot) + |w|\beta} \qquad (4)$$
$$P(z_i = j \mid \mathbf{z}_{-i}) = \int P(z_i = j \mid \theta^{(d_i)})\, P(\theta^{(d_i)} \mid \mathbf{z}_{-i})\, d\theta^{(d_i)} = \frac{f_{-i,j}(d_i) + \alpha}{f_{-i,\cdot}(d_i) + |z|\alpha} \qquad (5)$$
Therefore, we have
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{f_{-i,j}(w_i) + \beta}{f_{-i,j}(\cdot) + |w|\beta} \cdot \frac{f_{-i,j}(d_i) + \alpha}{f_{-i,\cdot}(d_i) + |z|\alpha} \qquad (6)$$
In Equation (6), $|w|$ and $|z|$ denote the number of keywords and topics, respectively; $f_{-i,j}(w_i)$ is the number of words identical to $w_i$ assigned to topic $j$ (excluding the current word $i$); $f_{-i,j}(\cdot)$ is the number of words assigned to topic $j$; $f_{-i,j}(d_i)$ is the number of words in node $d_i$ assigned to topic $j$; and $f_{-i,\cdot}(d_i)$ is the number of words in node $d_i$ assigned to any topic. The variable $z_i$ is updated iteratively according to Equation (6).
(3) When step (2) has been iterated enough times (i.e., when $P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})$ converges), the process ends. We then normalize $P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})$ to obtain the keyword-topic probability matrix $B$, where $B_{i,j} = P(z_i = j \mid w_i)$ and $\sum_j B_{i,j} = 1$.
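A minimal collapsed Gibbs sampler implementing the update of Equation (6) might look as follows. This is a toy NumPy sketch: the helper name `gibbs_lda` and the symmetric priors are our own choices, and the denominator of Equation (5) is dropped inside the sampling step since it does not depend on $j$:

```python
import numpy as np

def gibbs_lda(w, d, K, V, n_docs, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA using the update of Equation (6).

    w, d: parallel arrays of keyword ids and node (document) ids.
    K: number of topics; V: vocabulary size.
    Returns the row-normalized keyword-topic matrix B.
    """
    rng = np.random.default_rng(seed)
    N = len(w)
    z = rng.integers(0, K, size=N)          # step (1): random initial topics
    n_wt = np.zeros((V, K))                 # word-topic counts  f_j(w)
    n_dt = np.zeros((n_docs, K))            # doc-topic counts   f_j(d)
    for i in range(N):
        n_wt[w[i], z[i]] += 1
        n_dt[d[i], z[i]] += 1
    for _ in range(iters):
        for i in range(N):
            # remove word i from the counts (the "-i" statistics)
            n_wt[w[i], z[i]] -= 1
            n_dt[d[i], z[i]] -= 1
            # Equation (6), up to a constant in j
            p = ((n_wt[w[i]] + beta) / (n_wt.sum(axis=0) + V * beta)
                 * (n_dt[d[i]] + alpha))
            z[i] = rng.choice(K, p=p / p.sum())
            n_wt[w[i], z[i]] += 1
            n_dt[d[i], z[i]] += 1
    # normalize so each keyword row sums to 1
    B = (n_wt + beta) / (n_wt + beta).sum(axis=1, keepdims=True)
    return B
```

Each sweep resamples every $z_i$ conditioned on all other assignments, which is exactly the Markov chain described in steps (1)–(3).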

2.3. Semantic Feature Representation of Nodes

In a semantic social network G = ( V , E , T ) , the node set V represents the users in the semantic social network, the edge set E represents the relationship between users, and T is the document collection, representing the text information published by users.
We used Gensim (a topic generation toolkit in Python) to extract K topics in T as the base of a K-dimensional semantic space. The coordinate m i of the node v i ( v i V ) in the semantic space can be expressed by the mean value of the keywords in the document t i ( t i T ) published by v i , which is shown in Equation (7).
$$m_i = \frac{\sum_{j=1}^{N_i} B_{N_{i,j},\,\cdot}}{N_i} \qquad (7)$$
In Equation (7), $N_i$ represents the number of keywords (the words with the highest cosine similarity to the topic to which $t_i$ belongs) in document $t_i$, $N_{i,j}$ represents the $j$-th keyword in document $t_i$, and $B_{N_{i,j},\,\cdot}$ represents the coordinate of the $j$-th keyword of document $t_i$ in the $K$-dimensional semantic space (expressed as the sequence of cosine similarities between the $j$-th keyword and the $K$ topics).
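Equation (7) amounts to averaging the rows of the keyword-topic matrix $B$ for the keywords appearing in a node's document. A tiny sketch, with made-up matrix values:

```python
import numpy as np

def node_coordinate(B, keyword_ids):
    """Equation (7): the semantic coordinate of a node is the mean of the
    K-dimensional coordinates of the keywords in its document.

    B: keyword-topic probability matrix (one row per vocabulary keyword);
    keyword_ids: indices of the keywords appearing in the node's document.
    """
    return B[keyword_ids].mean(axis=0)

# illustrative 3-keyword vocabulary in a K=2 semantic space
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
m_i = node_coordinate(B, [0, 2])
print(m_i)  # [0.7 0.3]
```

The resulting vector $m_i$ is the node's coordinate used throughout Section 3.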

3. Modeling Topic Influence Based on Percolation Mechanics

3.1. Motivation

The flow of a fluid through porous media (soil voids or other permeable media) is called percolation. Each percolation source point contains a certain amount of substance, which diffuses to the area in a finite space that has not yet been penetrated. In the example shown in Figure 2, the grid represents the percolation area. We assume that there are three percolation source points in the figure, labeled red, blue, and green. In a real percolation process, percolation occurs when the difference between the source point and the adjacent area reaches a threshold, which is measured by the point source function. In this example, we simply assume that the probability of percolation is 50%. After four infiltrations, the percolation state changes from Figure 2a to Figure 2b.
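The grid example can be reproduced with a short simulation. The 8×8 grid size, the source positions, and the random seed are illustrative choices of ours; the 50% spread probability and four rounds come from the example above:

```python
import numpy as np

def percolate(grid, sources, p=0.5, rounds=4, seed=0):
    """Toy simulation of the grid example in Figure 2: each labeled source
    spreads to unoccupied 4-neighbour cells with probability p per round.
    Cell value 0 means "not penetrated"; labels 1, 2, 3 mark the sources.
    """
    rng = np.random.default_rng(seed)
    g = grid.copy()
    for r, c, label in sources:
        g[r, c] = label
    for _ in range(rounds):
        snapshot = g.copy()            # spread from the state at round start
        for r in range(g.shape[0]):
            for c in range(g.shape[1]):
                if snapshot[r, c] == 0:
                    continue
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < g.shape[0] and 0 <= cc < g.shape[1]
                            and g[rr, cc] == 0 and rng.random() < p):
                        g[rr, cc] = snapshot[r, c]
    return g

g = percolate(np.zeros((8, 8), dtype=int), [(1, 1, 1), (4, 5, 2), (6, 2, 3)])
print((g > 0).sum(), "cells penetrated")
```

Once a cell is occupied it never changes label, mirroring the irreversibility of semantic percolation noted in Table 2.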
It can be seen that the substance gradually penetrates from the three source points into the adjacent areas. Inspired by this, we propose to construct the semantic social network topic percolation equation using percolation theory. Our motivation stems from the following four perspectives. First, both fluid percolation and semantic percolation need to be adjacent to the infiltration area. Second, similar to fluid percolation, in semantic social networks, whether users receive topics from neighbors (i.e., semantic percolation) is subject to a threshold, which in this paper is measured by the payoff concept from game theory. Next, in both fluid percolation and semantic percolation, multiple source points percolate simultaneously, and this property can be simulated for community detection using a seed expansion strategy. Finally, all source points have the same status, which avoids the problem of nodes with less local influence being unable to expand and promotes the formation of local communities. The differences between fluid percolation and semantic percolation are shown in Table 2.

3.2. Modeling Topic Influence

In this section, we construct the topic percolation differential equation; the symbols used are provided in Table 3. We propose the topic influence percolation strength $S$ to measure the capacity of topics to influence the percolation area. In our model, each node is a fixed-size solid sphere filled with unequal topic influence in the semantic space, and $S$ has a virtual dimension $[\lambda \gamma^{-1}]$. In the semantic space, the inner product $m_i \cdot m_j$ represents the semantic correlation between nodes $v_i$ and $v_j$: the more similar the semantic coordinates of $v_i$ and $v_j$ are, the larger $m_i \cdot m_j$ is. We define $Z_{ij} = 1/(m_i \cdot m_j)$ as the topic propagation space coordinate of node $v_j$ with node $v_i$ as the source point, which satisfies $Z_{ii} = 0$ and $Z_{ij} \to \infty$ when $m_i \cdot m_j \to 0$.
We design three rules to construct the percolation dynamics of topic influence, based on which the second-order partial differential equation of topic percolation with respect to $Z$ is given in Equation (8):
(1) The topic influence of a percolation source point is greatest at the initial state, and spreads outward with the percolation of topic influence.
(2) As the topic influence of the source point continuously penetrates into the surrounding area, the influence of the source point on other nodes becomes smaller.
(3) While the nodes under the influence of the source point absorb and weaken the topic influence of the source point, the influence of the topic contained in the source point is enhanced.
$$\frac{\partial^2 S}{\partial Z^2} = \frac{1}{\eta_z} \frac{\partial S}{\partial D} \qquad (8)$$
The initial condition of Equation (8) is as follows:
$$S(Z, 0) = \kappa_0\, \delta(Z) \qquad (9)$$
Here, $\delta(Z)$ is the Dirac function, which satisfies the requirement that the value of the function be equal to 0 everywhere except at the source point $a$ and that its integral over the entire domain be equal to 1. The expression of $\delta(Z)$ is
$$\delta(Z) = \begin{cases} 0, & Z \neq a, \\ +\infty, & Z = a, \end{cases} \qquad \int_{-\infty}^{+\infty} \delta(Z)\, dZ = 1 \qquad (10)$$
Here, S ( Z , 0 ) denotes the topic influence percolation strength when the distance between the source point and the affected node is 0. At this point, the influence is concentrated on the source point, S ( Z , 0 ) = κ 0 .
The boundary conditions of Equation (8) are as follows:
$$S(\infty, D) = 0, \qquad \frac{\partial S(\infty, D)}{\partial Z} = 0 \qquad (11)$$
Equation (11) indicates that $S$ and the partial derivative of $S$ with respect to $Z$ both become 0 as $Z \to \infty$.
Because the partial differential equation is derived from a physical phenomenon, we use Dimensional Analysis (DA) to solve Equation (8). The basic principle of DA is the Buckingham $\pi$ theorem, which states that if the formula of a physical process contains $n$ physical quantities, $k$ of which have independent dimensions, then the formula can be transformed into an equivalent function of $n - k$ dimensionless numbers $\pi_i$ composed of these physical quantities.
The topic influence percolation strength $S$ is a function of $\kappa$, $Z$, $D$, and $\eta_z$. Suppose that $F(S, \kappa, Z, D, \eta_z) = 0$; the dimensions of $S$ and $\kappa$ are $[\lambda \gamma^{-1}]$ and $[\lambda]$, respectively, so $S$ is proportional to $\kappa / \sqrt{\eta_z D}$. Using the Buckingham $\pi$ theorem and selecting $S$, $D$, $\eta_z$ as the basic variables, we have
$$F\!\left(\frac{S \sqrt{\eta_z D}}{\kappa},\; \frac{Z}{\sqrt{\eta_z D}}\right) = 0 \qquad (12)$$
$$\frac{\sqrt{4\pi \eta_z D}}{\kappa}\, S(Z, D) = f\!\left(\frac{Z}{\sqrt{4 \eta_z D}}\right) \qquad (13)$$
Next, we determine the undetermined function $f$. Let $\psi = Z / \sqrt{4 \eta_z D}$; then,
$$S(Z, D) = \frac{\kappa}{\sqrt{4\pi \eta_z D}}\, f(\psi) \qquad (14)$$
Combined with Equation (8), we have
$$\frac{d}{d\psi}\!\left(\frac{df}{d\psi} + 2\psi f\right) = 0 \qquad (15)$$
The boundary conditions of Equation (11) become
$$f(\infty) = 0, \qquad \left.\frac{df}{d\psi}\right|_{\psi = \infty} = 0 \qquad (16)$$
After simplification, we have
$$\frac{df}{d\psi} + 2\psi f = c \qquad (17)$$
Here, $c$ is a constant. Substituting Equation (16) into Equation (17) yields $c = 0$; therefore, the general solution of Equation (17) is $f = \omega_0 e^{-\psi^2}$. According to the hypothesis, the topic influence of the source point is conserved; therefore,
$$\int_{-\infty}^{+\infty} S\, dZ = \frac{\kappa\, \omega_0}{\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-u^2}\, du = \kappa \qquad (18)$$
As $\int_{-\infty}^{+\infty} e^{-u^2}\, du = \sqrt{\pi}$, it follows that $\omega_0 = 1$; therefore,
$$S(Z, D) = \frac{\kappa}{\sqrt{4\pi \eta_z D}} \exp\!\left(-\frac{Z^2}{4 \eta_z D}\right) \qquad (19)$$
After the transposition of terms, we have
$$\frac{S(Z, D)}{\kappa} = \frac{1}{\sqrt{2\pi}\, \sqrt{2 \eta_z D}} \exp\!\left(-\frac{Z^2}{2\left(\sqrt{2 \eta_z D}\right)^2}\right) \qquad (20)$$
Equation (20) has the form of a normal density function with the topic propagation space coordinate $Z$ as the horizontal axis and the topic influence percolation strength $S$ as the vertical axis. According to the mathematical properties of the normal distribution, the instantaneous influence of the source point follows a normal distribution along the $Z$ direction at any value of $D$ in the strength field of the one-dimensional unbounded semantic space. With increasing distance $D$, the peak influence strength decreases while the range of affected nodes widens, and the distribution curve flattens.
According to the $3\sigma$ principle, the probability that the topic influence of a node falls outside $(\mu - 3\sigma, \mu + 3\sigma)$ is less than 0.3%. Therefore, $\mu - 3\sigma < Z \le \mu + 3\sigma$ can be regarded as the actual range of the random variable $Z$, and the topic influence of nodes is only valid within the range $3\sigma = 3\sqrt{2 \eta_z D}$.
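Equation (19) and the $3\sigma$ cutoff can be evaluated numerically. In the sketch below, $\kappa$ and $\eta_z$ are set to 1 purely for illustration; the function names are our own:

```python
import numpy as np

def influence_strength(Z, D, kappa=1.0, eta_z=1.0):
    """Equation (19): topic influence percolation strength at coordinate Z
    after percolation distance D (kappa and eta_z are illustrative values)."""
    return kappa / np.sqrt(4 * np.pi * eta_z * D) * np.exp(-Z**2 / (4 * eta_z * D))

def effective_range(D, eta_z=1.0):
    """The 3-sigma cutoff: influence only counts within 3*sqrt(2*eta_z*D)."""
    return 3 * np.sqrt(2 * eta_z * D)

for D in (1, 2, 3):
    print(D, influence_strength(0.0, D), effective_range(D))
```

The printed values show the behavior described above: as $D$ grows, the peak strength at $Z = 0$ decreases while the effective range widens.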

4. The Game Process of Topic Influence Percolation

In social networks, each individual has free will and can decide whether to join a community after weighing the advantages and disadvantages, which is consistent with the behavior of players in game theory. In semantic social networks, users influence the people around them with their preferred topics and are influenced in turn by the topics held by others. When affected by different topics, people react differently: for high-influence topics that they prefer and that are hotly discussed by the public, they continue to track the progress of these topics and further spread them; otherwise, they pay no further attention. From the perspective of game theory, all social individuals are considered rational, selfish players who follow certain rules to join the semantic community with greater influence and topics closer to their preferences, in order to maximize their payoffs and achieve Nash equilibrium.

4.1. Basic Elements

The basic elements of our game model are as follows.
(1) Players: all nodes except the seed nodes (unequilibrium nodes) in semantic social networks.
(2) Strategy P i : each player chooses a single strategy; P i = 1 ( P i = 0 ) means that after being affected by the topic, node v i does (does not) spread the topic and joins (refuses to join) the community to which the topic belongs.
(3) Payoff U i : in the percolation dilemma game model, the payoff of node v i is defined as follows:
$$U_i(P_i, P_j) = S_{j \to i} - \xi \qquad (21)$$
Here, $U_i(P_i, P_j)$ represents the payoff of $v_i$ for spreading topics from $v_j$, $S_{j \to i}$ represents the percolation strength of the topic from $v_j$ to $v_i$, and $\xi$ represents the topic percolation loss. The correlation between $P_i$ and $U_i$ is as follows:
$$P_i = \begin{cases} 0, & \text{if } U_i(P_i, P_j) \le 0, \\ 1, & \text{if } U_i(P_i, P_j) > 0. \end{cases} \qquad (22)$$
In a semantic social network, if there is a node with greater topic influence than node $v_i$ in the percolation area, $v_i$ is percolated by topic influence, and the percolation with smaller strength is covered by the percolation with higher strength. Otherwise, the influence percolation strength $S_i$ of node $v_i$ in this area is considered infinite. $S_i$ is defined as follows:
$$S_i = \begin{cases} \max\{S_{j \to i}\}, & j \in G,\; \kappa_0^{(i)} < \kappa_0^{(j)}, \\ +\infty, & \kappa_0^{(i)} > \kappa_0^{(j)}. \end{cases} \qquad (23)$$
In this way, it is only necessary to calculate the payoffs of spreading the topics of nodes that can percolate to $v_i$, instead of calculating the payoffs of all global nodes. For faster calculation, the topic influence percolation strength $S$ is stored in a max-heap.
In Equation (23), nodes propagate only one topic and join only one community. However, communities in real semantic social networks generally overlap. Joining multiple communities incurs a loss of payoff, yet a player will still join multiple communities if doing so increases its total payoff. For semantic overlapping communities, the payoff is defined as follows:
$$U_G(i) = \sum_{j \in G} U_i(P_i, P_j) - \zeta\,(|R(i)| - 1), \qquad \zeta = \frac{1}{|R(i)|} \sum_{j \in G} U_i(P_i, P_j) \qquad (24)$$
Here, ζ is the loss factor, | R ( i ) | represents the number of different topics spread by node v i , and U i ( P i , P j ) represents the payoffs of v i spreading only one topic. Obviously, spreading more topics results in the loss of ζ .
Players pursue the maximization of payoff as well as the maximization of efficiency. In general, the payoff of joining many communities is higher than that of joining a few; however, in certain cases, joining a few high-payoff communities can yield payoffs equivalent to joining many low-payoff ones. To maximize payoff and efficiency at the same time, we define a payoff satisfaction function $\rho(i)$:
$$\rho(i) = \begin{cases} \dfrac{1}{N_i} \displaystyle\sum_{k=1}^{N_i} \sum_{j \in G,\, j \neq i} U_k(P_k, P_j), & \text{if } N_i > 1, \\[2mm] \dfrac{1}{2}\, U_i(P_i, P_j), & \text{if } N_i = 1. \end{cases} \qquad (25)$$
Here, $N_i$ represents the number of communities that node $v_i$ has joined. When $N_i = 1$, $\rho(i)$ is set to $U_i / 2$ to prevent the initial payoff satisfaction of node $v_i$ from being so large that it cannot join other communities. When $N_i > 1$, the payoff satisfaction is the average of the payoff function. If $U_G(i) < \rho(i)$, joining the new community decreases the payoff; in this case, $v_i$ chooses strategy $P_i = 0$.
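The payoff machinery of Equations (21), (24), and (25) reduces to a few lines. This is a sketch under our own simplification: `payoffs` holds one per-community payoff value per community the node spreads in, and the $N_i > 1$ branch of Equation (25) is read as a plain average over those values:

```python
def payoff(S_ji, xi):
    """Equation (21): payoff of node i for accepting a topic percolated
    from node j with strength S_ji, minus the percolation loss xi."""
    return S_ji - xi

def overlapping_payoff(payoffs):
    """Equation (24): total payoff when a node spreads |R(i)| different
    topics; each extra community costs the loss factor zeta."""
    R = len(payoffs)
    total = sum(payoffs)
    zeta = total / R
    return total - zeta * (R - 1)

def satisfaction(payoffs):
    """Equation (25): half the single payoff when the node is in one
    community, otherwise the average per-community payoff."""
    if len(payoffs) == 1:
        return payoffs[0] / 2
    return sum(payoffs) / len(payoffs)

# a node joins a further community only if the overlapping payoff
# beats its current satisfaction
print(overlapping_payoff([3.0, 1.0]), satisfaction([3.0, 1.0]))
```

Note how `overlapping_payoff([3.0])` reduces to the single-community payoff, since the $\zeta$ penalty vanishes when $|R(i)| = 1$.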

4.2. Selecting the Source Point

Random selection of the source point may result in percolation failure due to the low influence of the selected node and cause additional time cost. Based on the PageRank algorithm, a source point selection algorithm for topic influence maximization is proposed.
(1) Initialize $seedSet$, $hashMap$, and $outlink[v_i]$, where $seedSet$ stores the ranked topic influence, $hashMap$ stores the feature pairs $(nodeID, topicInfluence)$, and $outlink[v_i]$ is an array that stores the nodes pointed to by $v_i$.
(2) According to different transfer probabilities, the node percolates its influence to the pointing nodes. We construct the following transfer matrix
$$P_{i,j} = \begin{cases} \dfrac{M(i,j)}{\sum_{v_k \in outlink[v_i]} M(i,k)}, & outlink[v_i] \neq \emptyset,\; M(i,j) \neq 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (26)$$
to represent the probability of influence passing from $v_j$ to $v_i$, where $M(i,j)$ is a weighted adjacency matrix with the formula shown in Equation (27).
$$M(i,j) = \begin{cases} m_i \cdot m_j, & v_i \text{ points to } v_j, \\ 0, & \text{otherwise}. \end{cases} \qquad (27)$$
If node v i points to node v j , the edge weight of arc ( i , j ) is m i · m j ; otherwise, the edge weight is 0.
(3) The influence of each node depends on the influence of the nodes that point to it. In the iteration process, we use vector v e c to store the influence score of each node, which is updated based on Equation (28).
$$vec \leftarrow \alpha P^{T} vec + (1 - \alpha) \frac{\tau}{N}, \qquad \tau = (1, 1, \cdots, 1)^{T} \qquad (28)$$
Here, $\alpha$ is the damping factor, used to prevent the excessive influence of individual nodes, while $\tau / N$ is the self-restart vector, which establishes a transition probability for node pairs without a direct link. Equation (28) is repeated until the entire network converges.
(4) We define conversion coefficient ε and multiply the influence score of each node by ε to obtain the topic influence κ , then update h a s h M a p and s e e d S e t . The pseudo-code of the ranking procedure is provided in Algorithm 1.
Algorithm 1 Selecting SeedSet.
Input: Network G = (V, E, T)
Output: seedSet, hashMap
1: seedSet ← ∅, hashMap ← ∅;
2: Initialize outlink[v_i] for each v_i ∈ V;
3: Construct M(i, j) and P_{i,j} using Equations (27) and (26);
4: while not converged do
5:     for each v_i ∈ V do
6:         Update the influence score of v_i based on Equation (28);
7:     end for
8: end while
9: Rank vec into seedSet;
10: Store the feature pairs of vec in hashMap;
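Algorithm 1 can be sketched as a standard power iteration over the transfer matrix of Equation (26). The dense-matrix representation, the convergence tolerance, and the default damping factor 0.85 below are illustrative choices of ours, not values prescribed by the paper:

```python
import numpy as np

def select_seeds(M, alpha=0.85, eps=1e-8, epsilon=1.0):
    """Sketch of Algorithm 1: rank nodes by topic influence via a
    PageRank-style iteration over the weighted adjacency matrix M
    (M[i, j] = m_i . m_j if v_i points to v_j, else 0).
    epsilon is the conversion coefficient from step (4)."""
    n = M.shape[0]
    # Equation (26): row-normalize M into the transfer matrix P
    row_sums = M.sum(axis=1, keepdims=True)
    P = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
    vec = np.full(n, 1.0 / n)
    while True:                                  # Equation (28), to convergence
        new = alpha * P.T @ vec + (1 - alpha) / n
        if np.abs(new - vec).sum() < eps:
            break
        vec = new
    kappa = epsilon * vec                        # step (4): topic influence
    seed_set = np.argsort(-kappa)                # ranked seedSet
    hash_map = {int(v): float(kappa[v]) for v in seed_set}
    return seed_set, hash_map

# toy graph: nodes 0 and 1 both point to node 2
seed_set, hash_map = select_seeds(np.array([[0., 1., 1.],
                                            [0., 0., 1.],
                                            [0., 0., 0.]]))
print(seed_set)  # most influential node first
```

On this toy graph the node with the most (and strongest) in-links ends up at the head of the seed set, which is the node chosen first as a percolation source in Section 4.3.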

4.3. Game Rules for Overlapping Community Detection

Based on the topic influence percolation, we propose a game algorithm for overlapping community detection.
(1) A strategy combination is considered to be in Nash equilibrium if no player can increase their payoff by changing decisions unilaterally. In the initial stage, the nodes in the semantic social network are isolated, no payoff is generated, and all local communities are in a state of unequilibrium.
(2) The percolation is a local movement; therefore, choosing a reasonable propagation range (hops) can ensure the effectiveness of the influence and the fast convergence of the algorithm. According to the 3 σ principle of Equation (20), the topic propagation space coordinate Z satisfies
\mu - \frac{3}{2} \eta_z D < Z \leq \mu + \frac{3}{2} \eta_z D
Here, Z_{ij} = 1/(m_i · m_j), with m_i · m_j ∈ (0, 1). When m_i · m_j = 0.2, |Z|_{max} = \frac{3}{2}\eta_z D = 5 and D_{max} = 3 (after rounding). The experiments in Section 5.3.1 show that community quality decreases rapidly when D_{max} > 3. Therefore, to speed up the algorithm, we assume that there is no percolation between v_i and v_j when m_i · m_j < 0.2.
(3) Select nodes sequentially from the head of s e e d S e t ; if the node is marked as “divided” in h a s h M a p , select new nodes from s e e d S e t until the node is marked as “not divided”, making it the source point of the percolation.
(4) For v_i within three hops of source point v_j, if v_i has not joined any community, calculate the non-overlapping payoff function U_i(P_i, P_j). If U_i(P_i, P_j) > 0, then v_i joins v_j's community and is marked as “divided” in h a s h M a p , decreasing the number of h a s h M a p elements by 1. Otherwise, skip v_i and analyze the next node.
(5) If v i has joined a community and is not in the same community as v j , calculate the cosine similarity between v j and the source point of v i community; the expression is as follows:
sim(m_{seed(i)}, m_j) = \frac{m_{seed(i)} \cdot m_j}{|m_{seed(i)}| |m_j|} = \frac{\sum_{g=1}^{k} m_{seed(i),g} \, m_{j,g}}{\sqrt{\sum_{g=1}^{k} m_{seed(i),g}^2} \sqrt{\sum_{g=1}^{k} m_{j,g}^2}}
Here, we use ς(v_i) to represent the community collection of v_i. If sim(m_{seed(i)}, m_j) > 0.8, ς(v_i) and ς(v_j) are merged. If sim(m_{seed(i)}, m_j) ≤ 0.8 and the payoff is no less than the payoff satisfaction (U_G(i) ≥ ρ(i)), we add v_i to v_j's community; otherwise, we skip v_i and find the next node.
(6) When performing an optimal strategy can improve the payoff, the node acts to achieve local Nash equilibrium. Next, we select nodes from the s e e d S e t to play the game until the whole network reaches Nash equilibrium.
(7) When the s e e d S e t is empty and there are elements marked "not divided" in the h a s h M a p , we can accelerate the convergence of the algorithm by randomly assigning these elements to the nearest community.
(8) Nodes affected by the same source point and meeting the game conditions are assigned to the same community, and the semantic community ς = ς 1 , ς 2 , . . . , ς N is output. The pseudo-code is shown in Algorithm 2.
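The game rules above can be sketched as a single seed-expansion loop. This is a simplified illustration, not the full GTIP implementation: the payoff U_i(P_i, P_j) and payoff satisfaction ρ(i) of Equations (24) and (25) are abstracted as caller-supplied callables, and community merging is reduced to a set union:

```python
import numpy as np

def cosine_sim(a, b):
    # Equation (30): cosine similarity of two topic vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_from_seeds(seed_order, neighbors_within_3_hops, topic_vec,
                      payoff, payoff_satisfaction, merge_threshold=0.8):
    community_of = {}                        # node -> seed of its first community
    communities = {s: {s} for s in seed_order}
    divided = set()
    for seed in seed_order:
        if seed in divided:
            continue                         # rule (3): skip already-divided seeds
        divided.add(seed)
        for v in neighbors_within_3_hops(seed):
            if v not in community_of:        # rule (4): unassigned node
                if payoff(v, seed) > 0:
                    communities[seed].add(v)
                    community_of[v] = seed
                    divided.add(v)
            elif community_of[v] != seed:    # rule (5): candidate overlap or merge
                sim = cosine_sim(topic_vec[community_of[v]], topic_vec[seed])
                if sim > merge_threshold:
                    # topics close enough: merge the two communities
                    communities[seed] |= communities[community_of[v]]
                elif payoff(v, seed) >= payoff_satisfaction(v):
                    communities[seed].add(v)   # overlapping membership
    return communities
```

A node reached by two seeds whose topics are dissimilar, but whose payoff clears the satisfaction threshold, ends up in both communities; this is how overlapping structure arises in rule (5).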

4.4. A Practical Case

Figure 3a shows a directed weighted network G_a with six nodes v_1, v_2, ..., v_6, where the direction of an edge points to the source of percolation and the weight of an edge represents the difficulty of topic influence percolation.
According to Equations (26) and (27), the weighted adjacency matrix of G_a is
M(i,j) = \begin{pmatrix}
3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
1 & 4 & 0 & 0 & 1 & 2 \\
0 & 0 & 2 & 0 & 0 & 4 \\
0 & 0 & 0 & 1 & 0 & 3 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
and the transfer matrix of G a is
P(i,j) = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
1/8 & 1/2 & 0 & 0 & 1/8 & 1/4 \\
0 & 0 & 1/3 & 0 & 0 & 2/3 \\
0 & 0 & 0 & 1/4 & 0 & 3/4 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
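The transfer matrix can be checked numerically: each row of P is the corresponding row of M divided by its row sum (rows that sum to zero stay zero), which reproduces the values shown above. A short verification in Python:

```python
import numpy as np

# Weighted adjacency matrix M of the example network G_a (Equation (27) values)
M = np.array([
    [3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 4, 0, 0, 1, 2],
    [0, 0, 2, 0, 0, 4],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 0, 0, 0],
], dtype=float)

# Row-normalize to obtain the transfer matrix: P[i, j] = M[i, j] / sum_j M[i, j]
row_sums = M.sum(axis=1, keepdims=True)
P = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```

For instance, row 3 of M sums to 8, so the third row of P is (1/8, 1/2, 0, 0, 1/8, 1/4), matching the matrix above.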
Algorithm 2 GTIP Algorithm.
Input: Network G = < V , E , T > , s e e d S e t , h a s h M a p .
Output: Divided communities ς = ς 1 , ς 2 , . . . , ς N
1: while s e e d S e t ≠ ∅ do
2:     j = s e e d S e t . t o p ( ) ;
3:      s e e d S e t . p o p ( ) ;
4:     if  h a s h M a p [ j ] = = f a l s e  then
5:         repeat step 2 and step 3;
6:     for all nodes v i within 3-hops of seed node v j  do
7:         if  | ς ( v i ) | = 1  then
8:             if payoff U i ( P i , P j ) > 0  then
9:                  ς_k ← v_i , | ς ( v_i ) | + + ;
10:                  h a s h M a p [ i ] ← f a l s e ;
11:                  h a s h M a p . c o u n t − − ;
12:             else
13:                 continue;
14:             end if
15:         else if  ς ( v_i ) ≠ ∅ and ς ( v_i ) ∩ ς ( v_j ) = ∅  then
16:             if  s i m ( m s e e d ( i ) , m j ) > 0.8  then
17:                 merging community ς ( v i ) and ς ( v j ) ;
18:             else
19:                 if  U_G ( i ) ≥ ρ ( i )  then
20:                      ς_k ← v_i ;
21:                      h a s h M a p [ i ] ← f a l s e ;
22:                      h a s h M a p . c o u n t − − ;
23:                 else
24:                     continue;
25:                 end if
26:             end if
27:         end if
28:     end for
29: end while
30: while h a s h M a p . c o u n t > 0 do
31:      h a s h M a p [ k ] → ς_k ;
32: end while
33: return ς 1 , ς 2 , . . . , ς N
The topic propagation space coordinate is Z_{ij} = 1/(m_i · m_j); therefore, the coordinate matrix of G_a is
Z(i,j) = \begin{pmatrix}
0 & 3 & 5 & 2 & 2/5 & 1 \\
3 & 0 & 1 & 1/2 & 2/3 & 1 \\
5 & 1 & 0 & 0 & 1 & 1/2 \\
2 & 1/2 & 0 & 0 & 1 & 1/4 \\
2/5 & 2/3 & 1 & 1 & 0 & 1/3 \\
1 & 1 & 1/2 & 1/4 & 1/3 & 0
\end{pmatrix}
The topic influence score of the nodes in G a is shown in Table 4.
First, the most influential node v 6 in Table 4 is selected as the source point of percolation. Due to the small amount of data, we assume that the influence range of the topic is one hop, i.e., d = 1 .
The nodes affected by v_6 include v_3, v_4, and v_5. Node v_3 is affected by v_6, v_2, and v_5. Let the percolation coefficient η_z = 0.5 and the dimensionless number π = 3. According to Equation (19), the percolation strengths of v_6, v_2, and v_5 acting on v_3 are S_{6,3} = 11.60 × exp{−0.125} = 10.237, S_{2,3} = 10.38 × exp{−0.5} = 6.296, and S_{5,3} = 4.70 × exp{−0.5} = 2.849, respectively. Therefore, the node with the greatest influence on v_3 is v_6. Assuming that the cost of propagating a topic to v_3 is the topic influence of v_3 itself, we have U_6(P_6, P_3) > 0; thus, v_3 accepts and continues to spread the topic of v_6 and joins v_6's community. Similarly, v_4 and v_5 are divided into v_6's community.
The local area covered by the influence of v_6 reaches Nash equilibrium. Next, v_2 is selected as the source point of percolation. The influence of v_2 covers v_1 and v_3; v_3 is already marked as “divided”, so we compare the topic similarity between v_2 and the source point of v_3's community (i.e., v_6) according to Equation (30). Suppose that m_2 · m_6 = 1, |m_2| = 2, and |m_6| = 1; then sim(m_2, m_6) = m_2 · m_6 / (|m_2||m_6|) = 0.5. Since sim(m_2, m_6) < 0.8, the communities of v_2 and v_6 are not merged. According to Equations (24) and (25), the payoff and payoff satisfaction of v_3 are U_G(3) = 10.237 + 6.296 − 8.267 = 8.266 and ρ(3) = 5.119, respectively. Since U_G(3) > ρ(3), v_3 joins v_2's community, forming an overlapping structure. Similarly, we can calculate the topic influence of v_2 on v_1 to bring the local region to Nash equilibrium. The community detection result of G_a is shown in Figure 3b.
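The arithmetic of this worked example can be replayed directly; the negative exponents and the subtracted propagation cost of 8.267 are reconstructed from the reported results:

```python
import math

# Percolation strengths acting on v3 (values from the worked example)
S_63 = 11.60 * math.exp(-0.125)   # influence of v6 on v3, ≈ 10.237
S_23 = 10.38 * math.exp(-0.5)     # influence of v2 on v3, ≈ 6.296
S_53 = 4.70 * math.exp(-0.5)      # influence of v5 on v3, ≈ 2.85

# Payoff of v3 under v2's topic: gains minus the propagation cost of 8.267
U_G3 = S_63 + S_23 - 8.267        # ≈ 8.266
rho_3 = 5.119                     # payoff satisfaction rho(3), as reported
# U_G3 > rho_3, so v3 also joins v2's community, forming an overlap
```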

5. Experimental Results and Analysis

5.1. Experimental Settings

5.1.1. Experimental Environment

All experiments in this paper were performed on a computer with an Intel (R) Core (TM) i5-7500 CPU, 3.40 GHz, and Yuzhan 16GB DDR4 RAM. All the proposed and compared algorithms were programmed in Python.

5.1.2. Compared Algorithms

For complex networks, GTIP was compared to four traditional community detection algorithms: GN (Girvan Newman) [6], FN (Fast GN) [7], LFM (Lancichinetti Fortunato Method) [33], and COPRA (Community Overlap Propagation Algorithm) [34]. GN and FN are non-overlapping community detection algorithms, while LFM and COPRA are overlapping community detection algorithms.
For semantic networks, GTIP was compared to seven semantic community detection algorithms: CUT (Community User Topic) [35], TURCM (Topic User Recipient Community Models) [36], LCTA (Latent Community Topic Analysis) [37], ACQ (Attributed Community Query) [21], DEEP (Deep Learning Method) [28], BTLSC (Background and Two-Level Semantic Community) [29], and SCE (Single Chromosome Evolutionary) [14]. CUT, TURCM, and LCTA generate communities based on Topic Probability Model; ACQ is an attribute graph community detection method; DEEP and BTLSC are both Deep Learning-based semantic community detection methods; and SCE is a new semantic community detection method based on Single-Chromosome Evolutionary.

5.1.3. Evaluation Criteria

Shen et al. [38] introduced Extension Q-modularity ( E Q ) to evaluate the quality of algorithms for identifying highly clustered communities; it is defined as follows:
EQ = \frac{1}{M} \sum_{t} \sum_{i \in C_t, j \in C_t} \frac{1}{O_i O_j} \left[ A_{i,j} - \frac{K_i K_j}{M} \right]
where K_i is the degree of node v_i, M = \sum_{ij} A_{ij} is the total degree of the network, A_{i,j} is the adjacency matrix of the network, and O_i is the number of communities to which v_i belongs.
In a semantic social network, the community structure should satisfy both the link density and semantic cohesion between nodes. Xin et al. [17] introduced Semantic Q-modularity ( S Q ) to evaluate the semantic cohesion of the community structure, which is defined as follows:
SQ = \frac{1}{M} \sum_{t} \sum_{i \in C_t, j \in C_t} \frac{sim(m_i, m_j)}{O_i O_j} \left[ A_{i,j} - \frac{K_i K_j}{M} \right]
In Equation (35), m_i and m_j are the coordinates of nodes v_i and v_j in the semantic feature space and sim(m_i, m_j) is the cosine similarity between v_i and v_j (Equation (30)). The range of EQ and SQ is (0, 1); the closer the value is to 1, the higher the quality of the community.
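Both modularities can be computed with one function, since SQ only reweights each node pair by its cosine similarity. The sketch below assumes communities are given as (possibly overlapping) sets of node indices:

```python
import numpy as np

def extended_modularity(A, communities, sim=None):
    """EQ of Equation (34); with a similarity matrix `sim`, SQ of Equation (35).

    A: symmetric adjacency matrix; communities: list of node sets (may overlap);
    sim: optional matrix of pairwise cosine similarities sim(m_i, m_j).
    """
    K = A.sum(axis=1)                 # node degrees K_i
    M = A.sum()                       # total degree of the network
    O = np.zeros(len(A))              # O_i: number of communities holding node i
    for c in communities:
        for i in c:
            O[i] += 1
    q = 0.0
    for c in communities:
        for i in c:
            for j in c:
                w = 1.0 if sim is None else sim[i, j]
                q += w / (O[i] * O[j]) * (A[i, j] - K[i] * K[j] / M)
    return q / M
```

With `sim` omitted (all weights 1) the function gives EQ, and for non-overlapping communities it reduces to Newman's standard modularity.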
Lancichinetti et al. [33] introduced Normalized Mutual Information (NMI) to compare the similarity between the ground truth and the detected communities. The normalized mutual information between partition C X and C Y is defined as follows:
NMI = 1 - \frac{1}{2} \left[ \frac{H(C_X | C_Y)}{H(C_X)} + \frac{H(C_Y | C_X)}{H(C_Y)} \right]
where H(C_X) is the entropy of C_X and H(C_X | C_Y) is the conditional entropy of C_X given C_Y. In the experiments, NMI is used to compare the communities discovered by each algorithm with the ground-truth communities of the artificial networks.
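For plain (non-overlapping) partitions, Equation (36) can be computed from joint label counts, using H(C_X | C_Y) = H(C_X, C_Y) − H(C_Y). The overlapping generalization of [33] replaces these terms with entropies over community membership vectors; the sketch below covers only the partition case and assumes each partition has at least two communities (so the entropies are nonzero):

```python
import math
from collections import Counter

def nmi(labels_x, labels_y):
    """NMI = 1 - (H(X|Y)/H(X) + H(Y|X)/H(Y)) / 2 for two hard partitions."""
    n = len(labels_x)
    px = Counter(labels_x)                 # community sizes in partition X
    py = Counter(labels_y)                 # community sizes in partition Y
    pxy = Counter(zip(labels_x, labels_y)) # joint community co-occurrence counts
    H = lambda counts: -sum(c / n * math.log(c / n) for c in counts.values())
    Hx, Hy, Hxy = H(px), H(py), H(pxy)
    # conditional entropies via the chain rule: H(X|Y) = H(X,Y) - H(Y)
    return 1.0 - 0.5 * ((Hxy - Hy) / Hx + (Hxy - Hx) / Hy)
```

Identical partitions (up to relabeling) score 1, while independent partitions score 0.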

5.2. Datasets

5.2.1. Artificial Networks

For our experiments, we produced ten artificial networks with ground-truth communities using the LFR (Lancichinetti Fortunato Radicchi) benchmark [33]. The parameter settings of the LFR benchmark are provided below.
The number of nodes in the network was set to n = 10,000. The average node degree of the network was set to d̄ = 5. The minimum and maximum community sizes were set to C_min = 5 and C_max = 500, respectively. The overlap degree of each overlapping node was set to O_m = 2. The number of overlapping nodes in the network was set to O_n = 500. The mixing parameter μ was set to {0.1 : 0.1 : 0.8}; that is, μ varied from 0.1 to 0.8 in steps of 0.1. As μ increases, community boundaries become blurred and the communities in the network become less identifiable.
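The role of μ can be made concrete: for a labeled graph, μ is the average fraction of each node's links that leave its own community. A small helper (not part of the LFR generator itself) to measure it on any adjacency matrix:

```python
import numpy as np

def mixing_parameter(A, labels):
    """Average fraction of each node's links leaving its community.

    This is the mixing parameter mu of the LFR benchmark: mu -> 0 means sharp
    community boundaries, mu -> 1 means no community structure at all.
    """
    mus = []
    for i in range(len(A)):
        deg = A[i].sum()
        if deg == 0:
            continue  # isolated nodes contribute no links
        external = sum(A[i, j] for j in range(len(A)) if labels[j] != labels[i])
        mus.append(external / deg)
    return float(np.mean(mus))
```

For two triangles joined by a single bridge edge, only the two bridge endpoints have an external link (1 of 3 each), so μ = (1/3 + 1/3)/6 = 1/9.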

5.2.2. Complex Networks

Complex networks are used to validate the performance of GTIP and traditional community detection methods.
(1) The College Football Network. This network contains 115 nodes and 616 edges; the nodes represent football teams, and an edge between two nodes indicates that the two teams have played a game against each other.
(2) The Political Book Network. This network was generated from the sales records of political books on Amazon.com during the presidential election in the early 21st century, and consists of 105 nodes and 441 edges. The nodes represent books, and an edge indicates co-purchasing of two books by the same buyers. The network forms three natural communities: “liberal”, “neutral”, and “conservative”.
(3) The Dolphin Family Network. This network consists of two dolphin families with 62 nodes and 159 edges. The nodes represent dolphins, and an edge indicates frequent contact between two dolphins.

5.2.3. Real-World Networks

Real-world networks were used to validate the performance of GTIP and semantic community detection methods. The five semantic-rich real-world networks used in the experiment can be downloaded from https://www.aminer.cn (accessed on 1 August 2022) and https://snap.stanford.edu/data/index.html (accessed on 1 August 2022). (1) Academic Social Network (ASN): this dataset includes paper information, paper citations, author information, and author collaboration, and contains 1,712,433 authors (nodes) and 4,258,615 collaboration relationships (edges).
(2) Youtube social network: Youtube is a video-sharing website where users can establish friendships and create groups. This dataset contains 1,134,890 nodes and 2,987,624 edges.
(3) DBLP collaboration network: the DBLP computer science bibliography provides a comprehensive list of research papers in computer science. This dataset contains 317,080 nodes and 1,049,866 edges.
(4) Amazon product co-purchasing network: this network was collected by crawling the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge from i to j. This dataset contains 334,863 nodes and 925,872 edges.
(5) Enron email network: this dataset was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation of Enron; it contains 36,692 nodes and 183,831 edges.

5.3. Parameter Analysis

5.3.1. Analysis on the Influence Range of Percolation

We use the parameter j u m p to represent the influence range of percolation, which can affect the aggregation of nodes inside communities.
In the non-artificial network experiments (Table 5 and Table 6), as j u m p increases the number of detected communities decreases and the quality ( E Q and S Q ) of the communities declines, especially when j u m p > 3 . According to Equation (29), the source point has a strong influence on nodes within three hops; beyond this range, percolation becomes uncertain, which fragments communities and reduces the quality of the community structure. In comparison, S Q decays slightly faster than E Q . Comparing Equations (34) and (35), changes in the percolation range are more likely to affect the similarity between nodes within a community than that community's proportion of internal to external links.
In the artificial network experiments (Table 7), the performance of GTIP varies with the parameter μ . As μ increases, the communities in the network become less identifiable and the N M I score gradually decreases. The performance of GTIP decreases rapidly when j u m p > 3 . In contrast to the non-artificial network experiments, the difference in N M I score between j u m p = 1 , 2, and 3 is not significant. One possible reason is that the link distribution of the artificial networks is relatively uniform, which decreases the difference in node influence within three hops.
In summary, the performance of GTIP begins to weaken when j u m p = 3 , and the percolation is ineffective when j u m p > 3 . Without loss of generality, we set j u m p = 3 in the following experiments.

5.3.2. Analysis on the Number of Topics

The number of topics (#Topics) in a document collection T can affect the size of the base of the semantic space; therefore, we verified the change in community quality when the number of topics was T = 1 , 2 , . . . , 20 .
The experimental results are shown in Figure 4, Figure 5 and Figure 6. When #Topics ranges from 1 to 8, the quality ( E Q , S Q and N M I ) of the communities increases rapidly. When #Topics ranges from 8 to 12, the quality of the communities is stable, and when #Topics ranges from 12 to 20, it decreases rapidly. The reason is that as #Topics increases, the differences between the semantic space coordinates of the nodes become larger, which increases the possibility of community division. In this experiment, E Q , S Q , and N M I reach their optimal values when the number of topics is around 10. In addition, the S Q values of the community structures are higher in networks with obvious topic attributes. For example, the topics in the Enron email network mostly concern finance, stock prices, and energy transportation, which gives its communities strong topic consistency. To better demonstrate the performance of our algorithm, we set #Topics = 10 in the following experiments.

5.4. Experimental Results on Artificial Networks

We executed eleven community detection algorithms on the LFR artificial networks and recorded their NMI values. From Table 8, it can be seen that the complex network community detection methods (GN, FN, LFM, and COPRA) have lower NMI values, which decrease slowly as μ becomes large. Among them, COPRA performs better and remains effective in mining community structure when the community boundaries are blurred ( μ = 0.6 , 0.7, and 0.8). As the community boundaries become clearer, the performance of the semantic community detection algorithms improves. When μ = 0.4 and 0.5, ACQ and CUT have higher NMI values. GTIP and DEEP perform better when μ = 0.1 , 0.2, and 0.3. However, because DEEP requires a large number of ground-truth communities as samples, its NMI decays faster as μ grows larger. In comparison, GTIP has better overall performance. The reason is that when the community boundaries are clearer, node cohesiveness and central tendency are stronger, which is more consistent with the community generation principle of GTIP.

5.5. Experimental Results on Complex Networks

We chose the Football, Books, and Dolphins networks as the experimental datasets. The algorithms used for the comparison included GN [6], FN [7], LFM [33], and COPRA [34]. GN and FN are non-overlapping community detection algorithms, while LFM and COPRA are overlapping community detection algorithms. We compared the E Q and S Q of each algorithm on the three complex networks described in Section 5.2.2.
Table 9 shows the E Q and S Q scores of each algorithm. GN and FN discover communities by cutting edges, and the resulting communities do not overlap; therefore, their E Q values are lower. LFM and COPRA aim to increase the ratio of internal to external links of a community; therefore, their E Q values are higher than that of GTIP (5.229% higher on average). The goal of GTIP is semantic similarity among nodes within a community; therefore, the S Q value of GTIP is higher than those of the other four algorithms (27.153% higher on average). COPRA has the highest E Q value in the experiment; its S Q value, however, is lower than that of GTIP (8.184% lower on average). In general, traditional non-semantic community detection algorithms perform well in mining communities from topology alone and poorly in community detection with rich semantic information.
A horizontal comparison shows that the E Q values of the classical community detection algorithms are higher than their S Q values (10.169% higher on average). COPRA and GTIP show good performance on complex networks. Both discover communities based on information diffusion, which indicates that accurately simulating the interaction behavior of social individuals is an effective way to detect communities with tight structure and semantic cohesion.

5.6. Experimental Results on Real-World Networks

In this section, we compare GTIP with seven semantic community detection algorithms: CUT [35], TURCM [36], LCTA [37], ACQ [21], DEEP [28], BTLSC [29], and SCE [14]. We used the five real-world networks described in Section 5.2.3 as the experiment data; the results are shown in Table 10 and Table 11.
BTLSC and SCE have better performance on ASN and Youtube networks. For example, in the E Q comparison experiment, BTLSC and SCE outperform GTIP by 0.294% and 11.233%, respectively. In the S Q comparison experiment, BTLSC and SCE outperform GTIP by 2.369% and 12.384%, respectively. On the DBLP, Amazon, and Enron networks, GTIP has a definite performance advantage. In the E Q and S Q comparison experiment, GTIP outperforms the other algorithms by an average of 18.386% and 19.973%, respectively. The reason for this is that the nodes in these three networks generally have a high propensity for topics. Taking the Enron network as an example, Figure 7 depicts the word clouds of the Enron network. It can be seen that the network has a strong topic concentration containing six distinct topics, which enhances the accuracy of the GTIP algorithm in selecting the source point of percolation. Additionally, in networks with rich semantic information S Q is typically lower than E Q . The reason for this is that in a semantic social network, although two users may focus on the same topic, different sentiment tendencies concerning the topic can lead to a split in the community.

6. Conclusions

This paper proposes GTIP, a semantic community detection method based on topic influence percolation. First, we modeled topic propagation in semantic social networks as the flow of a fluid through porous media based on percolation mechanics, then constructed a partial differential equation to solve the percolation intensity of topic influence. Second, based on game theory, the rules of accepting and forwarding topics were formulated to maximize the benefits of users and achieve Nash equilibrium. Finally, a semantic community was generated based on the seed expansion process.
We conducted experiments on artificial networks, complex networks, and semantic social networks. Our results show that when community boundaries are obvious and the corpus is rich, the modularity and NMI scores of GTIP are significantly better than other comparison algorithms. This shows that GTIP can capture the structural density and semantic cohesion of the network and has a high performance advantage in networks with high topic concentration.
In fact, users have different emotional perceptions of different topics, and even if we gather users with similar topics into one community, the community has the potential to split. In future work, we intend to integrate the sentiment attributes into the base of the semantic space in order to improve the structural stability of the detected communities.

Author Contributions

Investigation, H.Y.; Methodology, J.Z. and X.D.; Software, C.C. and L.W.; Supervision, H.Y.; Writing—original draft, H.Y. and J.Z.; Writing—review and editing, L.W. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is sponsored by the National Natural Science Foundation of China (61402126, 62101163), Nature Science Foundation of Heilongjiang Province of China (LH2021F029), Heilongjiang Postdoctoral Fund (LBH-Z20020), China Postdoctoral Science Foundation (No. 2021M701020), University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (UNPYSCT-2017094), and Fundamental Research Foundation for Universities of Heilongjiang Province (2020-KYYWF-0341).

Data Availability Statement

The publicly available datasets analyzed for this study can be found at (https://www.aminer.cn accessed on 1 August 2022) and (https://snap.stanford.edu/data/index.html accessed on 1 August 2022). Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank all anonymous reviewers for their comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, S.; Wang, S. Trajectory Community Discovery and Recommendation by Multi-Source Diffusion Modeling. IEEE Trans. Knowl. Data Eng. 2017, 29, 898–911. [Google Scholar] [CrossRef]
  2. Zhan, Q.; Zhang, J.; Yu, P.S.; Xie, J. Community detection for emerging social networks. World Wide Web 2017, 20, 1409–1441. [Google Scholar] [CrossRef]
  3. Wang, X.; Cui, P.; Wang, J.; Pei, J.; Zhu, W.; Yang, S. Community Preserving Network Embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Singh, S.P., Markovitch, S., Eds.; AAAI Press: Menlo Park, CA, USA, 2017; pp. 203–209. [Google Scholar]
  4. Choobdar, S.; Ahsen, M.E.; Crawford, J.; Tomasoni, M.; Cowen, L.J. Assessment of network module identification across complex diseases. Nature Methods 2018, 16, 843–852. [Google Scholar] [CrossRef]
  5. Bacco, C.D.; Power, E.A.; Larremore, D.B.; Moore, C. Community detection, link prediction, and layer interdependence in multilayer networks. Phys. Rev. E 2017, 95, 042317. [Google Scholar] [CrossRef]
  6. Newman, M.E.J.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 2004, 69, 026113. [Google Scholar] [CrossRef]
  7. Newman, M.E.J. Fast algorithm for detecting community structure in networks. Phys. Rev. E 2004, 69, 066133. [Google Scholar] [CrossRef]
  8. Palla, G.; Derényi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435, 814–818. [Google Scholar] [CrossRef] [PubMed]
  9. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, 10008. [Google Scholar] [CrossRef]
  10. Qiao, S.; Han, N.; Gao, Y.; Li, R.; Huang, J.; Guo, J.; Gutierrez, L.A.; Wu, X. A Fast Parallel Community Discovery Model on Complex Networks Through Approximate Optimization. IEEE Trans. Knowl. Data Eng. 2018, 30, 1638–1651. [Google Scholar] [CrossRef]
  11. Lu, M.; Zhang, Z.; Qu, Z.; Kang, Y. LPANNI: Overlapping Community Detection Using Label Propagation in Large-Scale Complex Networks. IEEE Trans. Knowl. Data Eng. 2019, 31, 1736–1749. [Google Scholar] [CrossRef]
  12. Lyzinski, V.; Tang, M.; Athreya, A.; Park, Y.; Priebe, C.E. Community Detection and Classification in Hierarchical Stochastic Blockmodels. IEEE Trans. Netw. Sci. Eng. 2017, 4, 13–26. [Google Scholar] [CrossRef]
  13. Tagarelli, A.; Amelio, A.; Gullo, F. Ensemble-based community detection in multilayer networks. Data Min. Knowl. Discov. 2017, 31, 1506–1543. [Google Scholar] [CrossRef]
  14. Pourabbasi, E.; Majidnezhad, V.; Afshord, S.T.; Jafari, Y. A new single-chromosome evolutionary algorithm for community detection in complex networks by combining content and structural information. Expert Syst. Appl. 2021, 186, 115854. [Google Scholar] [CrossRef]
  15. Jiang, H.; Sun, L.; Ran, J.; Bai, J.; Yang, X. Community Detection Based on Individual Topics and Network Topology in Social Networks. IEEE Access 2020, 8, 124414–124423. [Google Scholar] [CrossRef]
  16. Jin, D.; Li, B.; Jiao, P.; He, D.; Shan, H.; Zhang, W. Modeling with Node Popularities for Autonomous Overlapping Community Detection. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–23. [Google Scholar] [CrossRef]
  17. Xin, Y.; Yang, J.; Xie, Z.; Zhang, J. An overlapping semantic community detection algorithm base on the ARTs multiple sampling models. Expert Syst. Appl. 2015, 42, 3420–3432. [Google Scholar] [CrossRef]
  18. He, D.; Song, W.; Jin, D.; Feng, Z.; Huang, Y. An End-to-End Community Detection Model: Integrating LDA into Markov Random Field via Factor Graph. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; Kraus, S., Ed.; ijcai.org: Pasadena, CA, USA, 2019; pp. 5730–5736. [Google Scholar]
  19. Jin, D.; Wang, X.; He, R.; He, D.; Dang, J.; Zhang, W. Robust Detection of Link Communities in Large Social Networks by Exploiting Link Semantics. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), New Orleans, LO, USA, 2–7 February 2018; McIlraith, S.A., Weinberger, K.Q., Eds.; AAAI Press: Menlo Park, CA, USA, 2018; pp. 314–321. [Google Scholar]
  20. Wang, Y.; Jin, D.; Musial, K.; Dang, J. Community Detection in Social Networks Considering Topic Correlations. In Proceedings of the The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Menlo Park, CA, USA, 2019; pp. 321–328. [Google Scholar]
  21. Fang, Y.; Cheng, R.; Luo, S.; Hu, J. Effective community search for large attributed graphs. Proc. VLDB Endow. 2016, 9, 1233–1244. [Google Scholar] [CrossRef]
  22. Pei, Y.; Chakraborty, N.; Sycara, K.P. Nonnegative Matrix Tri-Factorization with Graph Regularization for Community Detection in Social Networks. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, 25–31 July 2015; Yang, Q., Wooldridge, M.J., Eds.; AAAI Press: Menlo Park, CA, USA, 2015; pp. 2083–2089. [Google Scholar]
  23. Qin, M.; Jin, D.; Lei, K.; Gabrys, B.; Musial-Gabrys, K. Adaptive community detection incorporating topology and content in social networks. Knowl. Based Syst. 2018, 161, 342–356. [Google Scholar] [CrossRef]
  24. Wang, X.; Jin, D.; Cao, X.; Yang, L.; Zhang, W. Semantic Community Identification in Large Attribute Networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Schuurmans, D., Wellman, M.P., Eds.; AAAI Press: Menlo Park, CA, USA, 2016; pp. 265–271. [Google Scholar]
  25. Yang, L.; Wang, Y.; Gu, J.; Cao, X.; Wang, X.; Jin, D.; Ding, G.; Han, J.; Zhang, W. Autonomous Semantic Community Detection via Adaptively Weighted Low-rank Approximation. In ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM); ACM: New York, NY, USA, 2019. [Google Scholar]
  26. Liu, F.; Xue, S.; Wu, J.; Zhou, C.; Hu, W.; Paris, C.; Nepal, S.; Yang, J.; Yu, P.S. Deep Learning for Community Detection: Progress, Challenges and Opportunities. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Yokohama, Japan, 7–15 January 2021; Bessiere, C., Ed.; International Joint Conferences on Artificial Intelligence Organization. ijcai.org: Pasadena, CA, USA, 2020; pp. 4981–4987. [Google Scholar]
  27. Jin, D.; Ge, M.; Yang, L.; He, D.; Wang, L.; Zhang, W. Integrative Network Embedding via Deep Joint Reconstruction. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; Lang, J., Ed.; ijcai.org: Pasadena, CA, USA, 2018; pp. 3407–3413. [Google Scholar]
  28. Cao, J.; Jin, D.; Yang, L.; Dang, J. Incorporating network structure with node contents for community detection on large networks using deep learning. Neurocomputing 2018, 297, 71–81. [Google Scholar] [CrossRef]
  29. Jin, D.; Wang, K.; Zhang, G.; Jiao, P.; He, D.; Fogelman-Soulié, F.; Huang, X. Detecting Communities with Multiplex Semantics by Distinguishing Background, General, and Specialized Topics. IEEE Trans. Knowl. Data Eng. 2020, 32, 2144–2158. [Google Scholar] [CrossRef]
  30. He, D.; Feng, Z.; Jin, D.; Wang, X.; Zhang, W. Joint Identification of Network Communities and Semantics via Integrative Modeling of Network Topologies and Node Contents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Singh, S.P., Markovitch, S., Eds.; AAAI Press: Menlo Park, CA, USA, 2017; pp. 116–124. [Google Scholar]
  31. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  32. Schifanella, C.; Sapino, M.L.; Candan, K.S. On context-aware co-clustering with metadata support. J. Intell. Inf. Syst. 2012, 38, 209–239. [Google Scholar] [CrossRef]
  33. Lancichinetti, A.; Fortunato, S.; Kertész, J. Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 2009, 11, 33015. [Google Scholar] [CrossRef]
  34. Gregory, S. Finding overlapping communities in networks by label propagation. New J. Phys. 2010, 12, 103018. [Google Scholar] [CrossRef]
  35. Zhou, D.; Manavoglu, E.; Li, J.; Giles, C.L.; Zha, H. Probabilistic models for discovering e-communities. In Proceedings of the 15th International Conference on World Wide Web, Scotland, UK, 23–26 May 2006; ACM: New York, NY, USA, 2006; Volume 3, pp. 173–182. [Google Scholar] [CrossRef]
  36. Sachan, M.; Contractor, D.; Faruquie, T.A.; Subramaniam, L.V. Using content and interactions for discovering communities in social networks. In Proceedings of the 21st International Conference on World Wide Web, Lyon, France, 16–20 April 2012; ACM: New York, NY, USA, 2012; pp. 331–340. [Google Scholar]
  37. Yin, Z.; Cao, L.; Gu, Q.; Han, J. Latent Community Topic Analysis: Integration of Community Discovery with Topic Modeling. ACM Trans. Intell. Syst. Technol. 2012, 3, 63. [Google Scholar] [CrossRef]
  38. Shen, H.; Cheng, X.; Cai, K.; Hu, M.-B. Detect overlapping and hierarchical community structure in networks. Phys. A Stat. Mech. Appl. 2009, 388, 1706–1712. [Google Scholar]
Figure 2. The percolation process of the fluid. (a) Initial percolation state. (b) Percolation state after time t.
Figure 3. Community detection with GTIP algorithm.
Figure 4. The EQ value on non-artificial networks with #Topics ranging from 1 to 20.
Figure 5. The SQ value on non-artificial networks with #Topics ranging from 1 to 20.
Figure 6. The NMI score on non-artificial networks with #Topics ranging from 1 to 20.
Figure 7. Word clouds of six topics on Enron network: (a) California power, (b) Gas_trans, (c) Trading, (d) Deals, (e) Stock, (f) Finance.
Table 3. Description of notations.

Notation   Description
S          The topic influence percolation strength
λ          The dimension of topic influence
γ          The sphere volume
Z_ij       The topic propagation space coordinate
D          The hops between the source point and the affected nodes
η_z        The percolation coefficient of topic propagation
κ_0        The initial topic influence value of the source point
Table 4. The topic influence score of nodes in G_a.

Node ID   Topic Influence Score
1         11.51
2         25.44
3         12.99
4         10.13
5         11.51
6         28.41
Table 5. The EQ value on non-artificial networks with Jump ranging from 1 to 6.

Jump   Football   Book     Dolphin   ASN      Youtube   DBLP     Amazon   Enron
1      0.5125     0.4847   0.4656    0.4251   0.3621    0.7015   0.8982   0.8322
2      0.5531     0.5022   0.4738    0.4315   0.3625    0.7088   0.9011   0.8414
3      0.4203     0.4291   0.3928    0.4422   0.3638    0.7147   0.9132   0.8543
4      0.2103     0.2159   0.2112    0.2133   0.2191    0.2189   0.2176   0.2102
5      0.0054     0.0098   0.0085    0.0073   0.0025    0.0018   0.0032   0.0047
6      0.0000     0.0000   0.0000    0.0000   0.0000    0.0000   0.0000   0.0000
Table 6. The SQ value on non-artificial networks with Jump ranging from 1 to 6.

Jump   ASN      Youtube   DBLP     Amazon   Enron
1      0.4206   0.3539    0.7132   0.8149   0.8101
2      0.4142   0.3393    0.6930   0.8273   0.8266
3      0.3968   0.3207    0.6865   0.8076   0.8064
4      0.1074   0.1071    0.1089   0.1101   0.1106
5      0.0032   0.0028    0.0035   0.0048   0.0045
6      0.0000   0.0000    0.0000   0.0000   0.0000
Table 7. The NMI value on artificial networks with Jump ranging from 1 to 6.

Jump   μ = 0.1   μ = 0.2   μ = 0.3   μ = 0.4   μ = 0.5   μ = 0.6   μ = 0.7   μ = 0.8
1      0.6291    0.3847    0.2264    0.1537    0.1324    0.0759    0.0235    0.0128
2      0.6213    0.3796    0.2201    0.1471    0.1294    0.0712    0.0212    0.0117
3      0.6185    0.3713    0.2165    0.1406    0.1235    0.0673    0.0196    0.0087
4      0.0017    0.0015    0.0014    0.0012    0.0011    0.0008    0.0006    0.0004
5      0.0005    0.0004    0.0003    0.0002    0.0002    0.0001    0.0001    0.0001
6      0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000
Table 8. The NMI value on artificial networks.

Algorithms   μ = 0.1   μ = 0.2   μ = 0.3   μ = 0.4   μ = 0.5   μ = 0.6   μ = 0.7   μ = 0.8
GN           0.1823    0.0836    0.0551    0.0179    0.0064    0.0037    0.0006    0.0000
FN           0.1912    0.0861    0.0573    0.0167    0.0066    0.0029    0.0003    0.0000
LFM          0.2107    0.1684    0.1357    0.1122    0.0638    0.0311    0.0103    0.0043
COPRA        0.5158    0.3988    0.3342    0.2801    0.2466    0.2272    0.2168    0.1707
CUT          0.4758    0.3864    0.3561    0.2806    0.2653    0.2232    0.1837    0.1332
TURCM        0.5066    0.4213    0.3722    0.2313    0.1892    0.1534    0.1108    0.0606
LCTA         0.3851    0.3762    0.3097    0.2885    0.2204    0.1818    0.1390    0.0923
ACQ          0.4344    0.4099    0.3768    0.3159    0.2351    0.2111    0.1792    0.1017
DEEP         0.5932    0.4645    0.3536    0.2837    0.1963    0.0992    0.0103    0.0008
BTLSC        0.5022    0.3269    0.2851    0.2022    0.1733    0.1431    0.1005    0.0520
SCE          0.4224    0.3861    0.3236    0.2619    0.2052    0.1337    0.0838    0.0153
GTIP         0.6185    0.4713    0.4065    0.2406    0.1235    0.0673    0.0196    0.0087
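The NMI scores in Tables 7 and 8 measure agreement between detected communities and the planted ground truth. For disjoint labelings the metric has a simple closed form; the sketch below implements that standard (non-overlapping) variant, while the paper's overlapping benchmarks would call for the extended overlapping-NMI of Lancichinetti et al. [33], to which this version reduces when every node belongs to exactly one community. The function name and signature are illustrative, not from the paper.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two disjoint node labelings.

    labels_a, labels_b: equal-length sequences; labels_a[i] is the
    community label assigned to node i by the first partition.
    Uses geometric-mean normalization: NMI = I(A;B) / sqrt(H(A) * H(B)).
    """
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information I(A;B) in nats.
    mi = sum((n_ab / n) * math.log(n * n_ab / (ca[a] * cb[b]))
             for (a, b), n_ab in joint.items())
    # Marginal entropies H(A), H(B).
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:
        # Degenerate single-cluster partition: identical labelings agree fully.
        return 1.0 if list(labels_a) == list(labels_b) else 0.0
    return mi / math.sqrt(ha * hb)
```

Identical partitions score 1.0 regardless of label names, and independent partitions score 0.0, matching the behavior assumed by the μ-sweep in Table 8.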
Table 9. Performance comparison with traditional community detection algorithms.

Algorithm   Criteria   Football   Book     Dolphin
GN          EQ         0.2977     0.3084   0.3165
            SQ         0.2821     0.2927   0.3002
FN          EQ         0.2876     0.2988   0.3153
            SQ         0.2774     0.2831   0.3032
LFM         EQ         0.4207     0.4266   0.4137
            SQ         0.3831     0.3515   0.3604
COPRA       EQ         0.4858     0.4672   0.4003
            SQ         0.4115     0.3728   0.3948
GTIP        EQ         0.4203     0.4291   0.3928
            SQ         0.4326     0.4364   0.4066
Table 10. The EQ value on real-world networks.

Networks   CUT      TURCM    LCTA     ACQ      DEEP     BTLSC    SCE      GTIP
ASN        0.2466   0.3867   0.3580   0.3458   0.3623   0.4435   0.4295   0.4422
Youtube    0.3278   0.3445   0.4362   0.3287   0.4494   0.4224   0.5159   0.4638
DBLP       0.6048   0.6413   0.6082   0.4846   0.5846   0.6520   0.6953   0.7147
Amazon     0.7221   0.8128   0.7090   0.6940   0.8017   0.8043   0.8910   0.9132
Enron      0.6436   0.7405   0.6512   0.6332   0.8013   0.6712   0.8261   0.8543
Table 11. The SQ value on real-world networks.

Networks   CUT      TURCM    LCTA     ACQ      DEEP     BTLSC    SCE      GTIP
ASN        0.2012   0.3357   0.3106   0.2977   0.3212   0.4062   0.3862   0.3968
Youtube    0.2918   0.3014   0.3931   0.2856   0.4041   0.3762   0.4728   0.4207
DBLP       0.5766   0.6153   0.5841   0.4564   0.5639   0.6262   0.6723   0.6865
Amazon     0.6127   0.7072   0.6068   0.5833   0.6925   0.7011   0.7854   0.8076
Enron      0.5936   0.6957   0.6012   0.5831   0.7534   0.6233   0.7761   0.8064
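The EQ scores reported above are the overlapping extension of Newman modularity due to Shen et al. [38]: each modularity term A_ij − k_i k_j / 2m is discounted by the product O_i O_j of the endpoints' community-membership counts, so nodes shared between communities contribute fractionally to each. A minimal sketch, assuming an undirected graph stored as an adjacency-set dict (the paper does not provide an implementation, and the data-structure choice here is illustrative):

```python
def extended_modularity(adj, communities):
    """EQ of Shen et al. [38] for possibly overlapping communities.

    adj: dict mapping node -> set of neighbor nodes (undirected, no self-loops)
    communities: list of sets of nodes; sets may overlap
    """
    m2 = sum(len(nbrs) for nbrs in adj.values())  # 2m (each edge counted twice)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    # O_v: number of communities that node v belongs to.
    o = {v: sum(v in c for c in communities) for v in adj}
    eq = 0.0
    for c in communities:
        for i in c:
            for j in c:  # all ordered pairs, including i == j (A_ii = 0)
                a_ij = 1.0 if j in adj[i] else 0.0
                eq += (a_ij - deg[i] * deg[j] / m2) / (o[i] * o[j])
    return eq / m2
```

When every node belongs to exactly one community (all O_v = 1), EQ coincides with classic Newman modularity; e.g. two disjoint triangles, each its own community, give EQ = 0.5.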
Yang, H.; Zhang, J.; Ding, X.; Chen, C.; Wang, L. GTIP: A Gaming-Based Topic Influence Percolation Model for Semantic Overlapping Community Detection. Entropy 2022, 24, 1274. https://doi.org/10.3390/e24091274