1. Introduction
Community detection is a fundamental problem in various fields, such as biological study and social network analysis. The definition of a community can vary based on the specific problem and objective at hand, but the definitions provided in [1,2,3] are widely accepted. In broad terms, a community is commonly understood as a group of individuals with stronger connections among its members than with individuals outside the group.
In the process of conducting community detection, real-world problems are typically translated into graphs where nodes represent individuals and edges represent relations. Numerous community detection methods have been developed based on diverse principles and objective functions. Surveys of community detection methods can be found in [4,5,6,7,8,9,10].
Game theory has emerged as a technique applied in community detection [8,11,12,13,14,15]. Its applications extend to identifying disjoint, overlapping, and hierarchical communities. As a systematic framework, game theory models and studies the decisions and outcomes of players in a game [16,17]. Broadly, game theory can be categorized into two main types: non-cooperative game theory and cooperative game theory. Non-cooperative game theory focuses on the competition between individual players, emphasizing their strategies and payoffs. Cooperative game theory, on the other hand, focuses on the cooperation between players and addresses the allocation of payoffs to players based on the worth of the coalitions formed. Within cooperative game theory, there are two main types: non-transferable utility cooperative games, where the payoff for a player within a coalition cannot be transferred to another player in the same coalition, and transferable utility cooperative games, where payoffs are considered transferable among players in the same coalition. Solution concepts such as the core, the kernel, the nucleolus, the Shapley value, and egalitarian solutions play crucial roles in cooperative game theory [18,19].
Community detection methods based on cooperative game theory typically identify the coalition with the highest score under a measure evaluated on the coalitions. However, due to the use of approximations, non-unique results are common. Zhou et al. [20] presented a community detection method using cooperative game theory and the Shapley value. The study focused on a social network where nodes are linked to relationships in various finite topics. The Shapley value represents a node’s contribution to the connection closeness of a coalition. The algorithm forms hierarchical and overlapping coalitions by iteratively adding each node to one of the coalitions formed in the previous iteration, where the newly added node obtains the largest Shapley value. Despite running in polynomial time, the algorithm relies on approximation. Another related approach for overlapping and hierarchical community detection [21] also employs cooperative game theory; its hierarchical structure of coalitions is obtained through a greedy agglomerative method, potentially yielding non-unique results.
The Naming Game [22,23,24] presents another game theoretic approach applicable to community detection, where the community structure emerges from the dynamic interactions between pairs of nodes within the game. However, empirical evidence suggests that the solution is generally not unique [24]. The convergence and computational costs of the method are analyzed through extensive empirical experiments [22,24], but theoretical bounds remain unclear. Furthermore, the Naming Game relies on pairwise connections and does not capture higher-order statistics among nodes, which limits its scope of application in community detection.
We introduce a notion of strength derived from cooperative game theory to identify strong communities that are interpretable. Moreover, the strong communities are unique, computable in polynomial time with recursive procedures, and can be represented by a dendrogram. The scope of consideration encompasses a set of individuals with a supermodular function for evaluating the communities, which means our approach is applicable to community detection tasks beyond graphical models. Our framework focuses on elucidating the theoretical properties of the strong communities and can provide the foundation for future research on empirical algorithms for large-scale datasets.
This paper is structured as follows: In Section 2, we present the relevant concepts in cooperative games. Section 3 outlines the derivation of the objective function based on convex games. We also formulate the definitions for community strength and strong communities in this section. Moving on to Section 4, we delve into the discussion of the properties of strong communities, laying the foundation for their computation. Section 5 details the solution to the problem through submodular function minimization and, in certain cases, introduces the use of the max-flow min-cut algorithm as a more efficient method in practice. In Section 6, concrete examples are provided to demonstrate the computation of strong communities and the representation of the dendrogram of these communities. Finally, in Section 7, we conclude our work.
2. Cooperative Game
A cooperative game [17] is characterized by a pair $(V, g)$, where $V$ is a finite set of players and $g: 2^V \to \mathbb{R}$ is a set function called the characteristic function, where $g(C)$ is the worth of the coalition $C \subseteq V$, assuming the players in $C$ cooperate to form such a coalition.
Denote the payoff allocation for the players as a vector $x = (x_i)_{i \in V}$, with the $i$-th element $x_i$ being the payoff allocated to the $i$-th player. The total payoff in the coalition $C \subseteq V$ is denoted as
$$x(C) := \sum_{i \in C} x_i. \qquad (1)$$
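Furthermore, when $g$ is a supermodular function, the game is called convex [17]. In this case, for all $B, C \subseteq V$,
$$g(B) + g(C) \le g(B \cup C) + g(B \cap C). \qquad (2)$$
Or equivalently, for all $i \in V$ and $B \subseteq C \subseteq V \setminus \{i\}$,
$$g(B \cup \{i\}) - g(B) \le g(C \cup \{i\}) - g(C), \qquad (3)$$
where both sides are the increases in worth when a player $i$ is added to a coalition. (3) means that the increase in worth when a player joins a coalition is no larger than the increase when the same player joins a superset coalition, i.e., the marginal worth is non-diminishing for convex games. For simplicity, $g$ is assumed to be normalized, i.e., $g(\emptyset) = 0$.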
As for the payoff allocation, the transferable utility is considered here, i.e., the payoffs can be transferred between players in the same coalition. The core [18] is one of the relevant solution concepts in cooperative games; it describes the feasible allocations of payoffs to players.
The core of a game $(V, g)$ is defined as [17]:
$$\operatorname{core}(g) := \{ x \in \mathbb{R}^V : x(V) = g(V),\ x(C) \ge g(C) \text{ for all } C \subseteq V \}, \qquad (4)$$
with $x(C) = \sum_{i \in C} x_i$ as in (1).
In the definition of the core, $x(V) = g(V)$ means the payoff allocation exactly splits the total worth of the grand coalition $V$. The inequality $x(C) \ge g(C)$ says that no other coalition $C \subseteq V$ can have a worth larger than the payoff $C$ can receive by cooperating in $V$, and hence no coalition will deviate from the grand coalition $V$. The core can be viewed as the set of stable payoff allocations. For a convex game, the core is always nonempty [17,25].
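The definitions of this section can be checked by brute force on a toy instance. The sketch below is only illustrative: the 4-node graph, the choice of $g$ as the number of internal edges, and all function names are assumptions made for the example, not objects taken from the paper. It tests supermodularity directly from the pairwise inequality and then verifies that a marginal payoff vector (each player is paid its marginal worth when joining in a fixed order) lies in the core, which is the standard way to see that the core of a convex game is nonempty.

```python
from itertools import combinations

# Hypothetical 4-node undirected graph (not the graph of Figure 1a).
EDGES = [(1, 2), (2, 3), (1, 3), (3, 4)]
V = frozenset({1, 2, 3, 4})


def g(coalition):
    """Worth of a coalition: number of edges with both endpoints inside it."""
    return sum(1 for u, v in EDGES if u in coalition and v in coalition)


def subsets(ground):
    """All subsets of `ground` as frozensets, including the empty set."""
    items = sorted(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_supermodular(func, ground):
    """Check g(B) + g(C) <= g(B | C) + g(B & C) for every pair B, C."""
    all_sets = subsets(ground)
    return all(func(b) + func(c) <= func(b | c) + func(b & c)
               for b in all_sets for c in all_sets)


def marginal_vector(func, order):
    """Pay each player its marginal worth when joining in the given order."""
    payoff, joined = {}, frozenset()
    for i in order:
        payoff[i] = func(joined | {i}) - func(joined)
        joined = joined | {i}
    return payoff


def in_core(func, ground, payoff):
    """Core test: x(V) = g(V) and x(C) >= g(C) for every coalition C."""
    def total(c):
        return sum(payoff[i] for i in c)
    return (total(ground) == func(ground)
            and all(total(c) >= func(c) for c in subsets(ground)))


x = marginal_vector(g, [1, 2, 3, 4])
print(is_supermodular(g, V), x, in_core(g, V, x))
```

The enumeration is exponential in the number of players, so this is meant for small sanity checks only.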
3. Problem Formulation
By regarding the set V of nodes as players, we consider a convex game with the characteristic function g being a supermodular function on $2^V$.
In particular, consider a weighted digraph on the set $V$ of nodes. Such a graph can be characterized by the weights of its directed edges, described by a weight function $w$ on ordered pairs of nodes, where $w(i,j)$ is the weight of the edge from node $i$ to node $j$. Undirected graphs are covered as the special case where $w(i,j) = w(j,i)$ for all $i, j \in V$.
Consider the function g defined in the form of
where and . The function g in (5) is supermodular [26]. When , (5) reduces to the total weight of edges in B scaled by ; when , (5) reduces to the negative of the total weight of incoming edges from outside to B scaled by .
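The two quantities named above, the total weight of edges inside $B$ and the total weight of edges entering $B$ from outside, can be computed directly from an edge-weight table. The sketch below is a minimal illustration under stated assumptions: the digraph `W`, the set `B`, and the helper names are hypothetical, and the scaling constants of (5) are deliberately not reproduced.

```python
# Hypothetical weighted digraph: W[(i, j)] is the weight of the edge i -> j.
W = {(1, 2): 2.0, (2, 1): 2.0, (2, 3): 1.0, (4, 1): 0.5, (3, 4): 3.0}


def is_undirected(weights):
    """True when w(i, j) == w(j, i) for every pair, i.e. the symmetric case."""
    return all(weights.get((j, i), 0.0) == w for (i, j), w in weights.items())


def internal_weight(weights, block):
    """Total weight of edges whose tail and head both lie in `block`."""
    return sum(w for (i, j), w in weights.items() if i in block and j in block)


def incoming_weight(weights, block):
    """Total weight of edges entering `block` from nodes outside it."""
    return sum(w for (i, j), w in weights.items()
               if i not in block and j in block)


B = {1, 2}
print(is_undirected(W), internal_weight(W, B), incoming_weight(W, B))
```

Storing weights per ordered pair keeps the directed and undirected cases in one representation: an undirected graph is simply a table that passes `is_undirected`.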
We want to identify strong communities based on the convex game using the following measure of community strength.
Definition 1. For , define
which is referred to as the strength of community C.
The inner maximization in (6) is the stable payoff guaranteed to any player in B, which we term the community support to B. The outer minimization in (6) gives the strength of C, which is the minimum community support over .
The following example illustrates the interpretation of the strength in (6) more concretely.
Example 1. Consider the unweighted graph in Figure 1a with and characteristic function g in (5) with and , i.e., for , , which calculates the total number of internal edges inside B. We are going to show how to obtain the strength of , which requires us to calculate the minimum community support over according to (6). By definition, the community support to B from is the stable payoff that is guaranteed to each player in B. For a payoff allocation to be stable, it should be in , which is calculated to be Then, we consider the guaranteed stable payoff to each player in B. For instance, when , the guaranteed stable payoff to players in B is 1, which is achieved with the payoff allocation ; when , the stable guaranteed payoff to players in B is 1, which is achieved with the payoff allocation . Therefore, we know that the minimum community support to any non-empty proper subset of is 1, i.e., the strength of is 1.
Additionally, there is 1 unit of payoff that is transferable between players 1 and 2 based on the constraint for the core. Such a transferable payoff tends to improve the guaranteed payoff for players in non-empty subsets of C. As a result, intuitively forms a meaningful community.
Similarly, we can show that the strength of V is 0. For a payoff allocation to be stable, it has to be in which is calculated to be (see Appendix A.1) Then, we consider the community support to . For instance, when , the community support to B is 1, which is achieved with the payoff allocation . By enumeration over , we can obtain that when , the community support to B is 0, which is the minimum value of such community support. Hence, the strength of V is 0.
There is another equivalent definition for the strength in Definition 1, where we consider the average payoff allocated to a set instead of the inner minimization term in (6), as stated in the following result.
Our goal is to identify strong communities defined using as follows.
Definition 2. For any threshold , define the collection of strong communities in V as
Here, means the inclusion-wise maximal subsets in , i.e.,
Similarly, means the inclusion-wise minimal subsets in , i.e.,
Figure 1.
An illustrative example of an unweighted graph with for . (a) The unweighted graph; (b) Visualization of ; (c) The curve ; (d) The dendrogram.
Example 2. In Example 1, we already obtained . Similarly, we can get , and .
According to (10), the strong communities in V given by our approach are
4. Main Results
4.1. Characterization of Community Strength
The community strength defined in (9) takes a simpler form for the convex game as shown in Theorem 1.
Theorem 1. For any , Furthermore, the set of optimal solutions to (9) is given by where is the set of optimal solutions to the minimization in (14).
Equation (14) is the basic formula of community strength that we will utilize to derive the properties of the strong communities and investigate how to calculate the strong subsets.
The following example shows that the strength of V calculated by (9) agrees with that calculated by (14).
Example 3. Consider V as in Example 1 and follow (14) to calculate the strength of V, with . The value of calculated here according to (14) is consistent with that calculated in Example 1.
Define for and that
Denote the optimal solution set to (18) as , and the collection of inclusion-wise minimal sets among as , i.e.,
is the set that we use to analyze the relation between and the curve , and is the set we use for showing the computation of in the latter part.
The following example shows the curve of for the set V in Example 1.
Example 4. Consider V as in Example 1. According to (18), and the inclusion-wise minimal solution set to (18) is given by as illustrated in Figure 1c, where the result for and can be obtained directly after we draw every curve of for . For instance,
- when , is given by , hence .
- when , both and V are solutions to (18) with respect to , while is the inclusion-wise minimal solution, hence .
From (19), it can be seen that is a linear function of with slope . Therefore, in (18) is a piece-wise linear function since it is a minimization of linear functions. With , the curve must have at least one turning point since the slope of is different from .
Figure 1c is the curve of for V in Example 1.
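The piece-wise linear structure described above can be illustrated with a generic lower envelope of lines. The following sketch is an assumption-laden toy, not the curve of Figure 1c: the `(intercept, slope)` pairs in `LINES` are made up, and `turning_points` is a hypothetical helper that finds the kinks by testing every pairwise crossing, which is simple but cubic in the number of lines. Exact rational arithmetic avoids floating-point ties.

```python
from fractions import Fraction
from itertools import combinations

# Made-up lines given as (intercept, slope); not taken from the paper.
LINES = [(0, 0), (2, -1), (6, -2)]


def lower_envelope(lines, alpha):
    """Value of the minimum over all lines of intercept + slope * alpha."""
    return min(c + s * alpha for c, s in lines)


def turning_points(lines):
    """Sorted alpha-coordinates where the minimum switches between slopes."""
    points = set()
    for (c1, s1), (c2, s2) in combinations(lines, 2):
        if s1 == s2:
            continue  # parallel lines never cross
        alpha = Fraction(c2 - c1) / Fraction(s1 - s2)
        # Keep the crossing only if both lines attain the minimum there.
        if c1 + s1 * alpha == lower_envelope(lines, alpha):
            points.add(alpha)
    return sorted(points)


kinks = turning_points(LINES)
print(kinks, kinks[0])
```

The first entry of the returned list plays the role of the first turning point discussed in the text; mapping it to a community strength additionally requires the specific lines of (19), which are not reproduced here.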
The following result shows that can be obtained from the curve. It will be used for deriving the representation and computation of the strong communities defined in Definition 2.
Proposition 2. For the curve against :
- (1) is the -coordinate of the first turning point. More precisely,
- (2) The collection of optimal solutions to (18) satisfies
The following example can further illustrate the property of .
Example 5. In Example 4, the first turning point of the curve is , whose α-coordinate is exactly .
4.2. Representation of Strong Communities
The strong communities defined in Definition 2 form a hierarchy and can be represented by a dendrogram.
Theorem 2. For any where , we have
The following example shows that the strong communities in Figure 1a as in Example 1 with respect to two different ’s have a containment relationship.
Example 6. Let and . By the calculation results in Example 2, and . Then , which means the communities in are contained in those in . This shows the hierarchical structure of the strong communities with respect to the specific and .
Theorem 2 follows from the following lemma.
Lemma 1. For all ,
Example 7. As an example for showing Lemma 1, consider Figure 1a as in Example 1, let and , then . By the calculation results in Example 2, i.e., (28) holds for and .
Lemma 1 establishes that the strength of the union of any two overlapping non-empty sets is lower bounded by the smaller strength among the two sets, and this is the basis for Theorem 2.
The family is said to be laminar and can be shown to contain at most elements. More precisely, we will show that the family of communities, together with their levels of strength, can be represented by the following dendrogram with , meaning the cophenetic similarity.
Definition 3. The dendrogram for the set of communities is defined as follows:
- (1) Every is an internal node annotated with the value ;
- (2) Every singleton {i} for is a leaf node (annotated with the value );
- (3) The parent of each node is defined as the minimum
As illustrated in Figure 2, the dendrogram forms a tree because each node (except the root node V) has a unique parent node. As a result of Theorem 2, the following corollary states that the parent of each strong community except V exists and is unique.
Corollary 1. For every , the minimum element exists.
Using the following result, we can show that the set of children for each node is which is also illustrated in Figure 2.
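The tree structure of a laminar family can be made concrete with a few lines of code. The sketch below is illustrative only: `FAMILY` is a hypothetical laminar family on five nodes rather than the communities of Figure 1a, the function names are assumptions, and the strength annotations of Definition 3 are omitted. The parent of a set is taken to be its smallest strict superset in the family, which is unique because the supersets of a set in a laminar family are nested.

```python
# Hypothetical laminar family on {1,...,5}; not the communities of Figure 1a.
FAMILY = [{1, 2, 3, 4, 5}, {1, 2, 3}, {4, 5}, {1}, {2}, {3}, {4}, {5}]


def is_laminar(family):
    """True when every two sets are either disjoint or nested."""
    sets = [frozenset(s) for s in family]
    return all(not (a & b) or a <= b or b <= a for a in sets for b in sets)


def parents(family):
    """Map each set to its smallest strict superset in the family, if any."""
    sets = [frozenset(s) for s in family]
    result = {}
    for s in sets:
        supersets = [t for t in sets if s < t]
        result[s] = min(supersets, key=len) if supersets else None
    return result


def children(family):
    """Invert the parent map: each set is listed under its parent."""
    kids = {frozenset(s): [] for s in family}
    for child, parent in parents(family).items():
        if parent is not None:
            kids[parent].append(child)
    return kids


print(is_laminar(FAMILY), children(FAMILY)[frozenset({1, 2, 3, 4, 5})])
```

The quadratic scan over the family is adequate here because a laminar family on a finite ground set has a number of members that is only linear in the size of the ground set.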
Analogous to Corollary 1, a community B has a parent C in the dendrogram if and only if B is in , and the strength of B is larger than that of C, as stated in the following corollary.
Corollary 2. For any nodes of the dendrogram, which implies .
Example 8. For Figure 1a as in Example 1, by the calculation results in Example 2, the dendrogram that corresponds to is shown in Figure 1d.
We defined the community strength in (6) by modeling the problem based on the convex game in game theory, gave its alternative forms in (9) and (14), and showed that the community strength and the solutions to the minimization of (14) are related to the first turning point of the curve defined by (18) against the parameter . We also showed that the collection of strong communities defined in (10) forms a hierarchy and can be represented by a dendrogram. These motivate the methods for computing strong communities, as described in the following section.
5. Computation of Strong Communities
In this section, we will show how to calculate the strong communities in at a threshold , and how to calculate all the strong communities.
The following result shows that can be calculated with a recursive procedure.
Theorem 3. For , can be calculated with the following recurrence relation where is defined in (20), and (31) is the base case.
Theorem 3 shows that can be computed in a divisive way. In the first recursive step, V is the ground set. If , we directly calculate by the base case (31) and stop the recursion; otherwise we calculate , then is the set of newly found strong subsets, and we enter the next recursive step. The new recursive step is similar to the first recursive step, but we use U given in (33) as the ground set.
The following example shows how to run the recursive procedure in Theorem 3 for computing .
Example 9. Consider Figure 1a as in Example 1; we calculate at by following Theorem 3.
- (1) The first recursive step:
, which corresponds to the case in (32).
Then we need to compute . By the calculation in Example 4, we know .
By (32), the elements in are in , and computing will provide us the remaining strong subsets in , where U is given by (33). Here, .
- (2) The second recursive step:
Regard U as the ground set and compute according to (31) and (32).
Since , the base case (31) applies, which means , and the recursive procedure ends.
According to the recursive steps, .
Notice that in this example, there are two recursive steps in total. In other examples where the U obtained in the first recursive step has a cardinality larger than 1, i.e., the case (32) applies, the following recursive step will be similar to the first recursive step except that U instead of V is regarded as the ground set.
Additionally, we use the from the calculation in Example 4, which employs a brute force method enumerating all . We will discuss how to compute in polynomial time later.
The following example illustrates why there are strong communities not in and why the recursive procedure in Theorem 3 can identify those strong communities.
Example 10. Consider Figure 3a on with function g defined by (5) with and . Then is the total weight of internal edges in C. In the graph, and have relatively large total weights of internal edges compared with other subsets of V hence they are meaningful communities that are expected to be identified.
Let . in (20) contains the minimal non-empty subsets of V that lead to the minimum value of in (19). is identified by because achieves the minimum value among all the non-empty subsets of V. However, the other meaningful community is not in because will never be a subset of V that leads to the minimum value of , as always holds. In other words, dominates .
To identify , we remove the nodes that appeared in the communities in (32) from V, as described in (33), and then start a new recursive step to identify strong communities within the remaining nodes. Since the community that dominated in V was removed, can now be identified with . In this way, the recursive procedure in Theorem 3 works to identify all the strong communities in .
Figure 3.
A simple digraph and the dendrogram when g is defined by (46) with different . (a) The digraph; (b) The dendrogram when ; (c) The dendrogram when .
For the recurrence relation in Theorem 3 to be applicable, the recursive procedure must finish in finitely many recursive steps. The following results imply that U in (33) always has a smaller size than V.
Proposition 3. For and the set V with ,
As a result of Proposition 3, the number of recursive steps needed in Theorem 3 is bounded by .
Proposition 4. For a non-empty set V, it takes at most recursive steps to calculate by Theorem 3.
The following property is the basis of Theorem 3, which ensures that the recursive procedure in Theorem 3 does not leave out any strong communities in for a chosen .
Proposition 5. For any , , or the contrapositive
Equation (36) implies that any other strong subset with strength larger than is either disjoint from the elements in , or is a subset of an element in . This ensures that when we continue the computation with U in (33) as the ground set after we obtained the strong subsets with strength larger than in , the remaining strong subsets will be captured by .
To obtain , calculating is a basic step that requires optimization of (18), which can be done based on the method in [26] as described in the following.
5.1. Divide-and-Conquer
We rewrite the minimization of in (18) in a similar way as that in [26]:
which is a two-step optimization problem, and denote as the minimal optimal solution set to (38).
Since defined in (18) is a submodular function, can be solved with submodular function minimization (SFM) algorithms, and it has a unique element since the feasible domain is a lattice ([27] Proposition 10.1).
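The uniqueness of the minimal minimizer can be observed on a small instance by exhaustive search. The sketch below is a toy under explicit assumptions: the submodular function is the cut function of a hypothetical undirected graph rather than the function in (18), and the feasible domain is taken to be the subsets that contain a fixed node `t`, which is one example of a lattice and may differ from the domain used in (38). It collects all minimizers and checks that their intersection is again a minimizer.

```python
from itertools import combinations

# Hypothetical undirected graph, used only to build a submodular function.
EDGES = [(1, 2), (2, 3), (4, 5)]
NODES = [1, 2, 3, 4, 5]


def f(block):
    """Cut size of `block`: edges with exactly one endpoint inside it."""
    return sum(1 for u, v in EDGES if (u in block) != (v in block))


def minimizers_containing(func, nodes, t):
    """All minimizers of `func` among subsets of `nodes` that contain t."""
    rest = [v for v in nodes if v != t]
    domain = [frozenset(c) | {t} for r in range(len(rest) + 1)
              for c in combinations(rest, r)]
    best = min(func(s) for s in domain)
    return [s for s in domain if func(s) == best]


def minimal_minimizer(func, nodes, t):
    """Intersection of all minimizers, which is itself a minimizer."""
    sols = minimizers_containing(func, nodes, t)
    smallest = frozenset.intersection(*sols)
    assert smallest in sols  # lattice property of submodular minimizers
    return smallest


print(minimal_minimizer(f, NODES, 1), minimal_minimizer(f, NODES, 4))
```

Exhaustive search is exponential and only serves to illustrate the lattice property; a practical implementation would rely on an SFM algorithm as stated in the text.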
Let be the set of optimal solutions t to (37). We have the following result that indicates how can be calculated.
Proposition 6 ([26] Proposition 2). For , in (20) can be obtained from and by
According to Theorem 3, computing , i.e., all the strong subsets with a strength larger than , can be done with the following steps:
- (1) Calculate for by optimizing (38) with SFM algorithms;
- (2) Calculate by optimizing (38);
- (3) Calculate according to (39);
- (4)
- (5) is the set of newly found strong subsets that have a strength larger than in this recursive step. If , then stop; otherwise, regard U as V and go to (1) to start a new recursive step.
The union of the set of strong subsets calculated in all the recursive steps is .
To calculate all the strong subsets, i.e., , for each , define
then is a normalized submodular function.
We need to obtain the minimal optimal solution to (41) for all . Fortunately, with SFM algorithms such as Wolfe’s minimum norm point algorithm [28], for (41), we can obtain for some the sequence of and the corresponding sequence of sets that satisfy ([27] Proposition 8.6)
and for any
Equation (44) means that, with the sequences (42) and (43), we can obtain the minimal solution to (41) for all .
For any , if is the unique minimal solution to (41), then will be the unique minimal solution to (37), or in other words, since is a constant for all . This means the minimal solution set to (37) for all can be obtained from the solutions to (41) with sequences (42) and (43).
Therefore, with sequences (42) and (43) for all , can be obtained for all . Then calculating for all based on Theorem 3, Proposition 6 and is sufficient for us to obtain .
With denoting the complexity of the minimum norm base algorithm for SFM on the ground set of size n, we have the following result.
Proposition 7. can be computed in time.
5.2. Using Max-Flow Min-Cut Algorithm
For the step of optimizing (38) in computing , SFM algorithms are used. However, SFM algorithms are generally computationally expensive. There are works on improving the efficiency of SFM problems by max-flow min-cut algorithms [29,30]. We discuss a class of choices for g for which the max-flow min-cut algorithm can be utilized for computing .
Consider the function g defined in the form of ([26] Definition 2)
which is a special case of (5) with .
Following the method in [26], we can construct an augmented digraph and run a max-flow min-cut algorithm [31,32,33,34] to obtain the solution to (38). With in (19) and in (46), the -augmented digraph [26] is a digraph on where is an additional node, with the edge weight defined as
Proposition 8 ([26] Theorem 3). The contains a unique minimum set C such that is a minimum s-t cut of the -augmented digraph.
Proposition 8 implies that can be solved by a max-flow min-cut algorithm. Moreover, with the parametric max-flow min-cut algorithms [34], we can obtain for all . Hence, when g has the form of (46), computing for a certain or all follows the same procedure as in Section 5.1, except that we can use max-flow min-cut algorithms to calculate instead of SFM algorithms.
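For readers unfamiliar with the max-flow min-cut step, the following is a minimal, generic sketch and not the construction of [26]: the edge weights of the augmented digraph are not reproduced, the tiny network `CAP` and its node names are hypothetical, and Edmonds-Karp is chosen only for brevity. After the maximum flow is found, the nodes still reachable from the source in the residual graph form the inclusion-wise minimal source side among all minimum s-t cuts, which is the kind of unique minimal set that Proposition 8 refers to.

```python
from collections import deque

# Hypothetical capacities on arcs (u, v); not the augmented digraph of [26].
CAP = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}


def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max-flow; returns (flow value, minimal source side)."""
    # Residual capacities, with zero-capacity reverse arcs added.
    residual = {s: {}, t: {}}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)

    def bfs():
        """Parents of nodes reachable from s through positive residual arcs."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            # No augmenting path: reachable nodes form the minimal min cut.
            return flow, set(parent)
        # Find the bottleneck along the path, then push flow along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push


value, side = min_cut_source_side(CAP, "s", "t")
print(value, side)
```

A parametric max-flow algorithm, as mentioned in the text, would reuse work across different parameter values instead of solving each cut problem from scratch as this sketch does.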
6. Discussions
To illustrate the dendrogram of strong communities found by our approach, the digraph in Figure 3a is used as an example, with the function g given in (46) for different choices of .
The results for the cases and are shown in Figure 3b and Figure 3c, respectively. An example of the calculation procedure based on Theorem 3 for the case is in Appendix A.15.
We can obtain the collection of strong communities in (10) for from the dendrogram. For instance, the strong communities for are
The parameter in (46) is a balancing factor between the total weight of internal edges and the negative total weight of incoming edges, and when , it can be used for the problem of finding the minimal densest subgraphs.
In [35], another kind of augmented graph is constructed, and an algorithm is given for quickly increasing the value based on the current community that has already been found and then running max-flow min-cut algorithms. We want to point out that, although the method there is similar to solving in our work, the algorithm there for calculating the next critical , as its author also noted, needs more calculation steps if we want to obtain more solutions for intermediate . In other words, not all critical are found, while our approach calculates all the critical and the solutions directly. Additionally, our approach goes beyond just finding , and we consider digraphs, which directly cover undirected graphs as a special case.
The work in [26] extends the notion of web communities [36] to digraphs and calculates the communities in polynomial time, which is closely related to our approach. In fact, is the set of web communities, which is included in the strong communities defined in (10) in this work. For a set , subsets of can prevent C from being a web community in [26], even if C has a strength larger than . Nevertheless, such a phenomenon does not exist for strong communities detected by our approach. Whether C is a strong community in (10) or not is independent of subsets of according to (10). In Example 10, is a web community, and is not, since is dominated by . However, is a meaningful community that is expected to be identified. The web community method fails to identify , while our approach can identify it as a strong community, as we have calculated in the example.
Our approach also addresses some known issues associated with existing community detection methods. For instance, maximizing Modularity [37], a common community detection criterion, is NP-hard, and Modularity suffers from the resolution limit [38]. There are works such as [39,40] aiming at resolving the resolution limit issue, yet both rely on heuristics to obtain solutions. It is worth mentioning that Modularity is a measure applied over partitions, while our strength measure is on individual communities. In Figure 4, there are four complete graphs of two sizes, and . Despite the two complete graphs of size being smaller than the other two, they should be identified as communities since they are the maximal complete graphs. However, Modularity fails to recognize the two smaller complete graphs, and instead, it merges them into a single community [38]. In contrast, our approach successfully identifies the two smaller complete graphs of size as strong communities.
The strong communities are derived from game theory, where the strength can be interpreted as the support inside the community that can be shared with those individuals in need within the same community. Applications to real-world problems are promising, such as finding small groups of advertisers and keywords in sponsored auctions, where the community strength means the average money inside the groups [35,41].
7. Conclusions
We introduced a novel concept of strength for community detection using a convex game model in cooperative game theory. It can be applied to networks with a supermodular characteristic function. Theorem 1 establishes the dual objective function, based on which we conducted a comprehensive analysis of strong community properties. The laminar structure demonstrated in Theorem 2 reveals that strong communities form a hierarchy and can be represented by a dendrogram.
To compute strong subsets, Theorem 3 introduces a recurrence relation, enabling polynomial time calculations through submodular function minimization. For specific characteristic function choices in the convex game on the graphical model, an augmented digraph can be constructed to apply the max-flow min-cut algorithm to improve computation efficiency.
Unlike many existing community detection methods, which often rely on approximation, are non-deterministic, and lack guarantees on complexity, our approach for community detection is deterministic, computable in polynomial time, and supported by a rigorous theoretical analysis of its properties. Since our approach captures high-order statistics through the supermodular characteristic function, the primary limitation of our approach lies in its computational complexity. This complexity presents a challenge when applying the method to large-scale real-world datasets. Nevertheless, our work proposes an analytical framework for community detection that yields unique solutions and provides theoretical foundations for future research aimed at improving the complexity and empirical applications.