Article

Community Detection Using Deep Learning: Combining Variational Graph Autoencoders with Leiden and K-Truss Techniques

by Jyotika Hariom Patil 1, Petros Potikas 2, William B. Andreopoulos 1 and Katerina Potika 1,*

1 Computer Science, San Jose State University, San Jose, CA 95192, USA
2 Electrical and Computer Engineering, National Technical University of Athens, 15780 Athens, Greece
* Author to whom correspondence should be addressed.
Information 2024, 15(9), 568; https://doi.org/10.3390/info15090568
Submission received: 30 June 2024 / Revised: 27 August 2024 / Accepted: 12 September 2024 / Published: 16 September 2024
(This article belongs to the Special Issue Optimization Algorithms and Their Applications)

Abstract: Deep learning struggles with unsupervised tasks like community detection in networks. This work proposes Enhanced Community Detection with Structural Information VGAE (VGAE-ECF), a method that enhances variational graph autoencoders (VGAEs) for community detection in large networks. It incorporates community structure information and edge weights alongside traditional network data. This combined input leads to improved latent representations for community identification via K-means clustering. We perform experiments and show that our method works better than previous community-aware VGAE approaches.

1. Introduction

In today’s data-driven world, exploring graphs is a crucial field of study. From social networks to biological data to communication systems, graphs help us make sense of complex systems. Graphs represent systems in which entities are represented by nodes and relationships between these entities are indicated by edges. Understanding the structure and dynamics of graphs requires an analysis of the interactions and strengths of relationships between nodes. This makes it possible to identify communities or closely linked groups that have similar characteristics or purposes within the graph.
Community detection is a fundamental aspect of understanding the structure and dynamics of a graph. It is an important concept used in identifying closely connected clusters of nodes in a network. This grouping of nodes helps uncover some of the hidden yet significant details of the relations and structure present in the network graph. Understanding these intrinsic details can reveal crucial insights about the graph, such as the influence of certain nodes, the clusters present in the graph, and the flow of information through the network. For example, in social networks it serves to segment groups by common interests or demographic information; in biology, communities in a graph might represent different species that tend to coexist in the same ecosystems; in communications, they could be clusters of people who frequently contact each other. Whether you are a sociologist, a biologist, or a network engineer, graph exploration and community detection provide powerful tools to make sense of complex systems and uncover hidden patterns and structures.
With advancements in graph theory and machine learning, new methods [1] have emerged for detecting communities in large graphs more quickly. Traditional approaches that use only the structure of the graph and some optimization metric are effective for smaller networks; they include the Girvan–Newman, Louvain, spectral clustering, and Leiden algorithms. However, when dealing with larger, more complex systems that carry additional information or features on the nodes and edges, these traditional methods may not be as scalable or expressive as needed. Therefore, learning-based approaches, based either on probabilistic graphical models or on deep learning, are considered. These offer the advantage that they work fast on big data and can incorporate additional features.
Deep learning models working on graphs, like graph autoencoders [2], graph neural networks (GNNs) [3], and graph convolutional networks (GCNs) [4], have shown potential to solve problems on graphs by learning representations of nodes and graphs while considering both node features and graph structure. However, these deep learning models often do not perform that well, because they do not explicitly incorporate crucial prior information, such as the community structure of the nodes and appropriate edge information like weights. By developing new deep learning frameworks that can better encode the underlying community memberships and edge relationships, we can create more powerful and flexible community detection approaches that can handle a wide range of real-world graph-based applications.
The goal of this paper is to advance the state-of-the-art methods in community detection by designing deep learning models that can effectively capture the inherent community structure and incorporate edge weights within graphs. This involves novel neural network architectures, optimization techniques, or other innovations that allow the models to better leverage these crucial graph properties. The resulting approaches exhibit good results across diverse graph datasets and application domains compared to existing community detection methods.
More specifically, we explore the implementation and effectiveness of variational graph autoencoders (VGAEs) [5] for enhancing community detection in complex networks. VGAEs are a type of deep learning model that uses a probabilistic approach to learn compact, low-dimensional representations of graphs. We choose to experiment with VGAEs to embed nodes for the community detection problem because they have emerged as powerful tools due to their probabilistic framework, and they are scalable and flexible. Our proposed technique, called Enhanced Community Detection with Structural Information VGAE (VGAE-ECF), aims to improve VGAEs for community detection by incorporating preliminary insights about community structure, obtained by what we call community-preserving algorithms, together with edge weights, into the learning process. This integration is expected to make learning more efficient and powerful, as the VGAEs can leverage the enriched, low-dimensional latent space embeddings to detect communities in a more informed way.
The problem, therefore, centers on how to effectively integrate community detection into deep learning models without compromising the generalization and robustness of these models. The proposed solution seeks to bridge the gap between classical community detection methods and modern deep learning techniques, to obtain an accurate community detection method.
The structure of the paper is as follows. Section 2 introduces the problem of community detection, relevant terminology, and background to provide context for the subsequent sections. Section 3 reviews the existing literature related to the methodology employed in this study and previous works in the field of community detection using VGAEs or similar approaches. Section 4 describes the methodology adopted in this paper. Section 5 presents the experiments conducted and the results obtained. It discusses the datasets used for evaluation, the evaluation metrics employed, and provides a thorough analysis and discussion of the findings. Finally, Section 6 concludes the paper by summarizing the key findings and insights gained from the research. It also suggests potential avenues for future research in this area.

2. Terminology and Background

The following is the key terminology related to community detection and deep learning approaches used in the rest of the paper.

2.1. Terminology

A graph G = (V, E) consists of nodes (set V) and edges (set E). Nodes are the primary units or points in a graph. They typically represent entities in a dataset, such as individuals in social networks, neurons in neural networks, or stations in transportation networks. Edges are the connections between two nodes in a graph. They can represent relationships or interactions between entities, such as friendships on social media, synapses between neurons, or routes between stations. Edges can be directed or undirected, and they can carry weights signifying the strength or capacity of a connection.
One way to represent graphs is by using an adjacency matrix, a square matrix (of size |V| × |V|) representing the graph. The elements of the adjacency matrix show whether pairs of nodes are connected: generally, a value of 1 in the i-th row and j-th column indicates that nodes i and j are connected, and a value of 0 indicates that they are not.
A sparse matrix is one in which most of the elements are zero. A sparse adjacency matrix implies that there are relatively few edges compared to the number of nodes, which is common in real-world networks. Storing and processing sparse matrices efficiently is important in large-scale graph analytics.
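As a brief illustration, consider storing a small adjacency matrix sparsely with SciPy (a minimal sketch, not code from the paper):

import numpy as np
import scipy.sparse as sp

# Adjacency matrix of a 4-node path graph 0-1-2-3 (undirected, unweighted).
A_dense = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# The sparse representation stores only the nonzero entries.
A = sp.csr_matrix(A_dense)
print(A.nnz, "nonzero entries out of", A_dense.size)  # prints: 6 nonzero entries out of 16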
In graphs, a community is a tight-knit group of nodes that are more closely connected to each other than to nodes outside the group. Community detection is the process of finding groups within a network that share common properties. Identifying these communities is crucial to understanding how a network works and how information flows through it, whether it is a social network, biological system, or a communication system.
We can use the modularity score as a measure of how good our communities are compared to a null model: it is the fraction of the edges that fall within communities minus the expected fraction of such edges if edges were distributed at random (the null model).
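In the standard (Newman–Girvan) formulation, the modularity of a partition that assigns node i to community c_i can be written as

$$ Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), $$

where m is the number of edges, k_i is the degree of node i, and \delta(c_i, c_j) equals 1 when nodes i and j are in the same community and 0 otherwise.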

2.2. Background

This subsection discusses the architecture of variational graph autoencoders (VGAEs). Kipf and Welling [5] considered a probabilistic variant of graph autoencoders (GAEs) extending from variational autoencoders (VAEs), introduced by Kingma and Welling in their seminal work on autoencoding variational Bayes [6]. VAEs are designed to learn efficient data representations and the underlying probability distributions, making them exceptionally suitable for unsupervised learning tasks like dimensionality reduction, data generation, and community detection in network datasets.
One big breakthrough came in 2016 when Kipf and Welling [5] introduced variational graph autoencoders (VGAEs). The framework uses a graph convolutional network (GCN) to compress the input graph into a compact, low-dimensional vector and then a simple decoder to reconstruct the original graph. The loss for the model is the difference between the input of the encoder and the output of the decoder. By performing unsupervised learning on node features, VGAEs showed a performance boost on graph-structured data. However, community structures are inherently complex, so even these powerful models often struggle to fully capture the community information in the input graph. Figure 1 shows the basic architecture used in this paper. As input we have both the adjacency matrix A of the graph and the feature vector X of each node. We denote the stochastic latent variables by $z_i$. The output is the reconstructed adjacency matrix.
The encoder consists of a two-layer GCN [5]:

$$ q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A}) = \prod_{i=1}^{N} q(\mathbf{z}_i \mid \mathbf{X}, \mathbf{A}), $$

with $N = |V|$ and $q(\mathbf{z}_i \mid \mathbf{X}, \mathbf{A}) = \mathcal{N}(\mathbf{z}_i \mid \boldsymbol{\mu}_i, \mathrm{diag}(\boldsymbol{\sigma}_i^2))$. Here, $\boldsymbol{\mu} = \mathrm{GCN}_{\mu}(\mathbf{X}, \mathbf{A})$ is the matrix of mean vectors $\boldsymbol{\mu}_i$, $\log \boldsymbol{\sigma} = \mathrm{GCN}_{\sigma}(\mathbf{X}, \mathbf{A})$, and $\mathcal{N}$ denotes the normal distribution.
The two-layer GCN is defined in [5] as $\mathrm{GCN}(\mathbf{X}, \mathbf{A}) = \tilde{\mathbf{A}}\, \mathrm{ReLU}(\tilde{\mathbf{A}} \mathbf{X} \mathbf{W}_0)\, \mathbf{W}_1$, with weight matrices $\mathbf{W}_i$; $\mathrm{GCN}_{\mu}(\mathbf{X}, \mathbf{A})$ and $\mathrm{GCN}_{\sigma}(\mathbf{X}, \mathbf{A})$ share the first-layer parameters $\mathbf{W}_0$. Here, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ and $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$ (with degree matrix $\mathbf{D}$) is the symmetrically normalized adjacency matrix.
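As a concrete illustration, the following is a minimal PyTorch sketch of this two-layer encoder (our own rendering for exposition, not the authors' released code; the hidden and latent sizes are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNEncoder(nn.Module):
    """Two-layer GCN encoder producing the parameters of q(z_i | X, A)."""
    def __init__(self, in_dim, hidden_dim=32, latent_dim=16):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)              # shared W0
        self.w_mu = nn.Linear(hidden_dim, latent_dim, bias=False)        # W1 of GCN_mu
        self.w_logsigma = nn.Linear(hidden_dim, latent_dim, bias=False)  # W1 of GCN_sigma

    def forward(self, x, a_norm):
        # a_norm is the symmetrically normalized adjacency D^{-1/2} A D^{-1/2}.
        h = F.relu(a_norm @ self.w0(x))         # shared first layer
        mu = a_norm @ self.w_mu(h)              # matrix of mean vectors
        logsigma = a_norm @ self.w_logsigma(h)  # matrix of log standard deviations
        return mu, logsigma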
For the decoder [5],

$$ p(\mathbf{A} \mid \mathbf{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid \mathbf{z}_i, \mathbf{z}_j), $$

with $p(A_{ij} = 1 \mid \mathbf{z}_i, \mathbf{z}_j) = \sigma(\mathbf{z}_i^{\top} \mathbf{z}_j)$, where $A_{ij}$ are the elements of the adjacency matrix $\mathbf{A}$ and $\sigma$ here denotes the logistic sigmoid function.
The total loss function of a VGAE is the sum of the reconstruction loss and the KL divergence. Mathematically, this can be represented as follows:

$$ \mathcal{L} = -\mathbb{E}_{q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A})}[\log p(\mathbf{A} \mid \mathbf{Z})] + \beta \cdot \mathrm{KL}\big(q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A}) \,\|\, p(\mathbf{Z})\big) $$

Here, $-\mathbb{E}_{q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A})}[\log p(\mathbf{A} \mid \mathbf{Z})]$ is the reconstruction loss, i.e., the negative expected log-likelihood of the observed data, and $\mathrm{KL}(q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A}) \,\|\, p(\mathbf{Z}))$ is the KL divergence between the encoder's distribution $q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A})$ and the prior distribution $p(\mathbf{Z})$. The parameter $\beta$ allows for the weighting of the KL term, offering a way to balance the two aspects of the loss according to the specific goals of the model (this becomes particularly relevant in extensions of VGAEs, like the $\beta$-VGAE).
We perform full-batch gradient descent and make use of the re-parameterization trick [6] for training.
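To make the objective concrete, here is a matching sketch of the re-parameterization step and the loss (assuming the encoder sketch above, an inner-product decoder, and a dense adjacency target; the function names are ours, and β = 1 recovers the standard VGAE objective):

import torch
import torch.nn.functional as F

def reparameterize(mu, logsigma):
    # z = mu + sigma * epsilon, with epsilon ~ N(0, I); keeps sampling differentiable.
    return mu + torch.randn_like(mu) * torch.exp(logsigma)

def vgae_loss(mu, logsigma, z, adj_target, beta=1.0):
    # Reconstruction loss: inner-product decoder scored with binary cross-entropy.
    recon = F.binary_cross_entropy_with_logits(z @ z.t(), adj_target)
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(
        torch.sum(1 + 2 * logsigma - mu.pow(2) - torch.exp(2 * logsigma), dim=1)
    )
    return recon + beta * kl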
Figure 2 illustrates the paper's goal of preserving community structure in the model input by incorporating both the initial graph structure and community information. The input is the graph (represented as an adjacency matrix) together with the initial feature vector of each node. Additional community structural information is computed and fed to the VGAE. In the next step, the node representations produced are used in a K-means algorithm to find clusters. These clusters are then mapped to the corresponding communities, since each vector belonging to a cluster corresponds to a node in a community.

3. Related Work

Historically, algorithms like Girvan–Newman [7], modularity optimization [8,9], and spectral clustering [10] have been popular for community detection. These methods often rely on network topologies and heuristic approaches. While they can be effective for smaller or less complex networks, their performance often degrades with increasing network size and complexity. Moreover, they might not be well suited for dynamic networks where community structures evolve over time.
In 2018, J. J. Choong et al. [11] presented a unique deep generative model called Variational Graph Autoencoder for Community Detection (VGAECD). This model takes the standard VGAE framework and customizes it specifically for community detection. Unlike regular VAEs that use simple distributions, VGAECD uses multivariate distributions for generating latent representation to better capture the intricate relationships between nodes in the same community. It breaks down the joint probability of node attributes, latent variables, and community labels to generate those latent vectors. When reconstructing node attributes, it ties this directly to community-specific latent vectors, so the model can pick up on community characteristics more accurately.
In 2020, J. J. Choong et al. [12] took this idea a step further by introducing OPT-VGAECD, or Optimized VGAECD, with linearization of the encoder within the VGAE framework as proposed by Wu et al. [13]. Traditional VAEs use graph convolutional networks (GCNs) as the encoder, which are powerful but computationally expensive. By removing the non-linear activation and simplifying the GCN, a more streamlined model called simplified graph convolutional network (SGC) was proposed. This method focuses on the main objective of discovering communities that may be compromised when lowering the VGAE loss.
Another key aspect of community detection models is the loss function. The VGAECD framework has a unique dual loss that includes both reconstruction loss and community quality loss. Optimizing this dual loss requires a model that is powerful enough to minimize both the reconstruction error and ensure the detected communities are of high quality. Recent research suggests the introduction of a dual-optimization algorithm inspired by the Neural Expectation–Maximization (NEM) algorithm [14]. Unlike the classic Expectation–Maximization algorithm [15], NEM can be trained end-to-end with gradient descent, giving a solution to tackle the dual-optimization challenge. It enables the VGAECD model to learn robust node embeddings and recognize communities for nodes at the same time.
Previous GAE and VGAE models employed encoders that, while adept at capturing node features for reconstruction tasks, did not specifically preserve community structures during the encoding process. This limitation led to the development of modularity-aware GAE and VGAE models by Salha-Galvan et al. [2] in 2022. These models introduce a revised encoding mechanism that incorporates a modularity-aware message-passing operator. This novel approach respects the original graph structure and emphasizes modularity-based node communities, thereby enriching the node embedding spaces with community-centric information. The Louvain algorithm [16], which is efficient on massive networks, is integrated with the deep learning VGAE framework [2] to obtain modularity-aware node communities. Using the Louvain method to perform community detection before feeding the input to the VGAE encoder resulted in an overall performance boost. The input to the VGAE then becomes the communities detected by the Louvain method along with the original graph structure.
Moreover, earlier models often relied on optimization strategies that prioritized link prediction tasks, consequently sidelining the aspect of community detection. This bias stemmed from the utilization of standard loss functions like cross-entropy and the evidence lower bound (ELBO) [6]. The contemporary approach redefines this by integrating a modularity-inspired loss function, which acts as a regularization mechanism over the pairwise reconstruction losses. This alternative loss function serves to align the optimization process with the dual objectives of link prediction and community detection, fostering a balance between local pairwise interactions and the global community architecture. Additionally, the integration of the Louvain algorithm for maximizing modularity has been a crucial step forward. This technique not only automates the selection of the number of communities but also proves to be scalable and complementary to the encoding–decoding processes of GAEs and VAEs. The Louvain algorithm, known for its efficiency in large-scale networks, aids in initializing the community detection process and iteratively refines the detected communities, enhancing the overall performance of the model.
Another significant contribution to the domain is presented in [17]. This work introduced a novel approach to extract the core structure of networks using the K-truss algorithm, which, along with a variational autoencoder, enhances the detection of nodes and links within communities. A K-truss [18] is a subgraph in which every edge participates in at least (K − 2) triangles within the subgraph. This approach maintains the relation of community structures but with a reduced computational burden compared to K-core, making the discovery of community structures in complex networks more practical.
The above works combine state-of-the-art deep learning methods with traditional community detection approaches. The integration of the K-truss technique to preserve the strength of relationships between nodes is the main goal of this study.

4. Process and Methodology

Community detection in graphs offers insights into the structure of complex networks. The nodes may have various features associated with them, which can make the task of clustering them difficult. The advent of deep learning, especially VGAEs, introduces a powerful method for learning low-dimensional (latent) representations of the nodes of a graph that can capture community structures. This paper proposes an enhancement to VGAE models by integrating a community matrix alongside the adjacency and feature matrices in the input, aiming to refine the model's ability to detect and preserve community information. The motivation for integrating the community matrix into the VGAE input comes from previous studies, which showed that performing community detection without this information leads to poor performance [2].
Let us start by discussing the methodology used and provide a walk-through of the algorithm’s steps. The proposed algorithm, Enhanced Community Detection with Structural Information VGAE (VGAE-ECF) incorporates an additional matrix that encodes community information into the VGAE model.
Given a graph G having N nodes and E edges, the following are the inputs:
Adjacency matrix (A): An N × N matrix where A i j = 1 indicates the presence while A i j = 0 the absence of an edge between nodes i and j.
Feature matrix (X): An N × F matrix where each row corresponds to the F-dimensional features of node i.
Community matrix (C): An N × N matrix generated from a community preserving algorithm. There are two types of independent community matrices:
Type 1: Incorporating community structure via the Louvain and the Leiden [19] algorithms, where C i j = 1 if nodes i and j are in the same community, and 0 otherwise.
Type 2: Incorporating edge weights via the K-truss algorithm [18], where 0 < C i j < 1 denotes the edge weight between nodes i and j.
The pre-processing step involves applying a community-preserving algorithm to the graph to create the community matrix C holding the initial community structure and edge weights information.
The input to the VGAE is the modified adjacency matrix A′ = A + C, which integrates the structural information in A with the community structure encoded in C. The matrix A′ is an updated adjacency matrix that contains relationship information as well as information from the communities or the K-truss. In order to make the computations faster, instead of C we use a sparse version C_s (see Section 4.2).
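In code, this pre-processing step is a single sparse addition (a minimal sketch with SciPy; the function name is ours):

import scipy.sparse as sp

def build_model_input(adj, community_matrix):
    """Combine the original adjacency matrix with community information: A' = A + C_s."""
    # Both inputs are N x N scipy.sparse matrices; the sum keeps the union of their links.
    return sp.csr_matrix(adj + community_matrix)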

4.1. Community Preservation Techniques

To augment the VGAE model with community-oriented and relationship knowledge, we employed two acclaimed algorithms: Leiden and K-truss. For the baseline, we also perform experiments on a randomly generated C matrix and on a matrix generated by the Louvain algorithm, to reproduce the results obtained by the modularity-aware VGAE model [2]. Intuitively, the matrix we create with these community-preserving techniques creates "virtual" relationships between nodes or strengthens existing relationships between nodes.

4.1.1. Random Approach

To set a baseline comparison and assess the impact of structured community detection methods on the performance of VGAE, we perform an ablation study and randomly assign nodes to communities. By comparing the performance of the VGAE model with random versus structured community inputs, we can directly measure the value added by meaningful community structures.
The following pseudo-code of Algorithm 1 shows how nodes are assigned to random communities. The value of k is the number of communities; this is equal to the number of ground truth communities in our dataset. The random assignment was achieved using a seeded pseudo-random number generator to ensure the reproducibility of our results.
Algorithm 1 Create Random Community Matrix
Input: n number of nodes, k number of communities
Output: n × n community matrix C
1: communities ← random integers between 0 and k − 1, of size n
2: community_matrix ← zero matrix of size n × n
3: for i ← 0 to n − 1 do
4:   for j ← i + 1 to n − 1 do
5:     if communities[i] = communities[j] then
6:       community_matrix[i][j] ← 1
7:       community_matrix[j][i] ← 1
8:     end if
9:   end for
10: end for
11: return community_matrix
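For reference, a vectorized NumPy equivalent of Algorithm 1 (a sketch; the fixed seed provides the reproducibility mentioned above):

import numpy as np

def random_community_matrix(n, k, seed=42):
    """Assign each of n nodes to one of k communities at random; return the n x n matrix C."""
    rng = np.random.default_rng(seed)
    communities = rng.integers(0, k, size=n)
    # C[i, j] = 1 iff nodes i and j were assigned the same community; the diagonal stays 0.
    c = (communities[:, None] == communities[None, :]).astype(np.int8)
    np.fill_diagonal(c, 0)
    return c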
We anticipate that the VGAE model trained with randomly assigned communities will show lower performance in community detection compared to the same model trained with communities detected by the Louvain, Leiden, or K-truss algorithms. Such outcomes will substantiate the hypothesis that structured community knowledge significantly contributes to the model’s learning capability and overall performance.

4.1.2. Louvain Algorithm

The traditional heuristic-based community detection technique Louvain [16] uses the modularity score to assess how good a community structure is. Modularity measures the density of links within communities relative to the expected density of links between communities. It takes values between −1 and 1; a value closer to 1 indicates a better assignment of nodes to communities, while a value closer to −1 indicates a poor one. Note that the modularity score is only used for identifying the best community structure; it is not used afterwards in the construction of the type 1 community matrix C.
The Louvain algorithm consists of two main phases that are repeated iteratively.
Phase 1: Modularity optimization. At first, each node in the network is considered as a distinct community. Next, the algorithm takes into account the impact of shifting each node from its current community to that of its neighbor. The shift is made where there is the most significant gain in modularity.
Phase 2: Community aggregation. When the local moves can no longer increase the modularity, the method creates a new network in which the communities discovered in the first phase are represented by nodes. The weight of the link between two new nodes is the sum of the weights of the links between the nodes of the corresponding two communities.
The two phases are applied iteratively: the output of the second phase of one iteration becomes the input to the first phase of the next iteration, and so on. With each iteration, the network becomes smaller and the communities grow larger.
The iterations continue until there are no changes in the community structure, i.e., the modularity can no longer be increased by either phase. At the end, we know which nodes belong to the same community and we can create the type 1 matrix C.
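As an illustration of this step, here is a sketch using the Louvain implementation shipped with recent NetworkX releases (the type 1 matrix construction assumes nodes are labeled 0 to n − 1):

import networkx as nx
import numpy as np

def louvain_community_matrix(G):
    """Run Louvain and build the type 1 co-membership matrix C."""
    parts = nx.community.louvain_communities(G, seed=0)  # list of node sets
    labels = np.empty(G.number_of_nodes(), dtype=int)
    for label, part in enumerate(parts):
        for node in part:
            labels[node] = label
    c = (labels[:, None] == labels[None, :]).astype(np.int8)
    np.fill_diagonal(c, 0)  # no self-links
    return c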

4.1.3. Leiden Algorithm

The Leiden algorithm [19] is a community detection method for graphs that improves upon the Louvain algorithm. It was designed to address some of the shortcomings of the Louvain algorithm, particularly in terms of the quality of the communities detected, guarantee of convergence, resolution limit, and the reproducibility of results. The Leiden approach refines partitions more effectively, resulting in more precise community detection. The outputs from this algorithm were used as alternative community-preserving inputs for the VGAE.
The Leiden algorithm consists of three main phases that are repeated iteratively:
  • Phase 1: Local moving. 
Similar to Louvain, the Leiden algorithm starts with each node as its own community. It then uses a randomized breadth-first search to move nodes to nearby communities that result in the largest modularity gain. This process is iteratively repeated and can be seen as an improved version of the local moving algorithm of the Louvain.
  • Phase 2: Refinement. 
This phase is unique to the Leiden algorithm and is absent in the Louvain algorithm. The refinement phase improves the partitions found in the first phase by breaking up poorly connected communities and moving nodes to form better-connected communities. It refines the partition by merging smaller communities, and subsequently, breaking them down to form a new partition. This addresses the resolution limit issue present in the Louvain algorithm.
  • Phase 3: Network aggregation. 
In this phase, similar to the Louvain algorithm, a new network is built where nodes represent the communities found during the previous phases. Edges are created between these new nodes based on the edges between the nodes of the respective original communities.
The three phases are applied iteratively. After the network is aggregated into a new network of communities, the entire process is repeated until there are no further improvements in modularity. At the end, we know which nodes belong to the same community and we can create the type 1 matrix C.
 Improvements over Louvain. 
The Leiden algorithm presents several key improvements over the Louvain algorithm:
  • Guaranteed convergence: Leiden ensures that every iteration strictly improves modularity, while Louvain can sometimes result in partitions that do not improve modularity.
  • Refinement phase: The refinement phase in Leiden allows for finer control over the community sizes and addresses the resolution limit problem, whereas Louvain might merge smaller communities that should not be merged.
  • Quality of communities: The Leiden algorithm generally finds partitions that are more balanced and well-connected than those found by the Louvain algorithm.
  • Reproducibility: Leiden tends to be more stable and less dependent on the randomness in the initial node order than the Louvain algorithm, leading to more reproducible results.
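In practice, the Leiden partition can be computed with the leidenalg package on top of python-igraph (an assumption about tooling; the paper does not name its implementation), and the type 1 matrix C built from the membership vector:

import igraph as ig
import leidenalg as la
import numpy as np

def leiden_community_matrix(n, edges):
    """Run Leiden on an edge list over nodes 0..n-1; return the type 1 matrix C."""
    g = ig.Graph(n=n, edges=edges)
    partition = la.find_partition(g, la.ModularityVertexPartition, seed=0)
    labels = np.asarray(partition.membership)  # community label per node
    c = (labels[:, None] == labels[None, :]).astype(np.int8)
    np.fill_diagonal(c, 0)  # no self-links
    return c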

4.1.4. K-Truss

The K-truss algorithm [18] is a method for identifying cohesive subgraphs within a larger network. A K-truss in a graph is a subgraph such that every edge in it is supported by at least K − 2 other edges that form triangles with that particular edge. In other words, every edge in the truss must be part of K − 2 triangles made up of nodes that are part of the truss. This concept is used to find communities or clusters within graphs that are tightly interconnected.
The idea of a K-truss is an extension of the concept of K-cores [20], which are subgraphs where each vertex is connected to at least K other vertices within the subgraph. While K-cores focus on individual vertices and their immediate neighbors, K-trusses extend this to edges and the triangles they form, providing a stronger condition for the cohesion of the subgraph.
We incorporate the K-truss algorithm as shown below in Algorithm 2 based on the following steps:
Algorithm 2 Constructing a community matrix with the K-truss algorithm
Input: adjacency matrix A, truss size K
Output: similarity matrix C
1: G ← nx.from_scipy_sparse_matrix(adj)          ▹ convert the adjacency matrix to a graph object
2: G_k ← find_k_truss(G, K)                      ▹ apply the K-truss algorithm at truss level K
3: X ← calculate_similarity_matrix(adj, G_k)     ▹ similarity matrix based on the K-truss
4: X ← coo_matrix(X)                             ▹ convert the similarity matrix to sparse format
5: C ← preprocess_graph(X)                       ▹ pre-process to obtain the community matrix
6: return C
  • Convert the graph’s adjacency matrix into a graph object.
  • Define a specific level of truss, K, to identify the strength of connections needed for community membership.
  • Apply the K-truss algorithm to retain only those edges in the graph that are part of at least K 2 triangles, thus filtering out weaker connections.
  • For the remaining subgraph, calculate a similarity matrix by comparing the adjacency of nodes with the support levels determined by the K-truss algorithm.
  • The similarity matrix is then transformed into a sparse matrix format.
  • Finally, pre-process the sparse similarity matrix to obtain the type 2 community matrix C, which reveals the denser community structures within the graph.
Lower K-trusses are nested within higher K-trusses, forming a hierarchy of subgraphs with increasing cohesion. The value of K is considered as a hyperparameter and tuned to obtain better results.
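A sketch of these steps using NetworkX's k_truss routine is shown below; the similarity weighting is a simple stand-in for the paper's calculate_similarity_matrix helper (common-neighbor support normalized by degree), so the exact weights are an assumption:

import networkx as nx
import scipy.sparse as sp

def ktruss_community_matrix(adj, k):
    """Build a type 2 community matrix from the K-truss of the graph."""
    G = nx.from_scipy_sparse_array(adj)  # graph object from the sparse adjacency matrix
    Gk = nx.k_truss(G, k)                # keep edges supported by >= k-2 triangles
    n = adj.shape[0]
    c = sp.lil_matrix((n, n))
    for u, v in Gk.edges():
        support = len(set(G[u]) & set(G[v]))          # triangles through edge (u, v)
        weight = support / max(G.degree(u), G.degree(v))
        c[u, v] = c[v, u] = weight                    # edge weight in (0, 1]
    return c.tocsr()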

4.2. Sparsification

To manage the computational demands associated with network analysis, particularly in community detection, a sparsified version of the adjacency matrix of the communities is used, denoted C_s. By using C_s, enough information is preserved while the computational cost is smaller than with C; see Proposition 6 in [2]. In this sparsification process, each node within a community C_k is connected to s other nodes of the same community, where s is a number smaller than n_k, the total number of nodes in C_k. This random selection reduces the number of connections each node has, thereby reducing the overall computational load.
The value of K in the K-truss algorithm influences the sparsity of the resulting subgraph or of the matrix representation of that subgraph. It determines the level of connection strength required for edges and nodes to remain in the subgraph. Specifically, an edge is part of a K-truss if it is contained in at least K − 2 triangles within the subgraph. Selecting a larger K value will generally result in a sparser K-truss, because only the edges that belong to a greater number of triangles are retained. This means that many edges (and potentially nodes) will be removed if they do not meet the stricter criteria, leaving behind a subgraph with fewer edges relative to the number of nodes. A smaller K allows for a denser subgraph, as edges are required to be part of fewer triangles to be included; thus, more edges and nodes will meet this criterion, resulting in a less sparse subgraph. After the sparsification process, the model input becomes A′ = A + C_s.
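A sketch of the sparsification step (assuming community assignments are given as a list of NumPy index arrays; the function name is ours):

import numpy as np
import scipy.sparse as sp

def sparsify_communities(communities, n, s, seed=0):
    """For each node, keep links to at most s random co-members; return C_s."""
    rng = np.random.default_rng(seed)
    c_s = sp.lil_matrix((n, n))
    for members in communities:            # members: array of node ids in one community
        for i in members:
            others = members[members != i]
            if len(others) == 0:
                continue
            chosen = rng.choice(others, size=min(s, len(others)), replace=False)
            c_s[i, chosen] = 1
    return c_s.maximum(c_s.T).tocsr()      # symmetrize so C_s stays a valid adjacency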

4.3. VGAE Model Training

The VGAE model is trained using the modified adjacency matrix A′ and the feature matrix X, optimizing the latent space to capture the graph structure, community membership, and, optionally, the node features. The code for the encoder and decoder of the VGAE architecture is taken from Kipf and Welling [5] and Salha-Galvan et al. [2].

4.4. Community Detection in Latent Space

After training, the latent representations of nodes are clustered using K-means++ [21]. These clusters determine community membership, leveraging the enriched embeddings that now encapsulate both structural and community information. Algorithm 3 shows the steps of the approach and Figure 3 shows the proposed architecture.
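A sketch of this clustering step with scikit-learn, whose KMeans implementation uses k-means++ seeding by default (the number of communities is taken from the ground truth, as in our experiments):

from sklearn.cluster import KMeans

def detect_communities(embeddings, n_communities, seed=0):
    """Cluster the node embeddings; each cluster corresponds to one community."""
    km = KMeans(n_clusters=n_communities, init="k-means++", n_init=10, random_state=seed)
    return km.fit_predict(embeddings)  # community label per node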
In this paper, we selected a specific portion of the Citeseer dataset [22], recognized as a benchmark in community detection research. This segment was chosen strategically to cover a wide array of node attributes and linkages and to ensure the experiment's integrity. The pre-processing stage involved standardizing the feature matrices and constructing the adjacency matrices, which are critical components for the operation of the VGAE model.
We trained the variational graph autoencoder (VGAE) model on this curated dataset segment. The number of training cycles was limited to 20 to promptly evaluate the model's effectiveness and its proficiency in identifying meaningful community clusters. The emphasis was on monitoring the model's preliminary response to community-preserving tactics, not on final performance tuning.
Algorithm 3 Enhanced community detection using VGAE (VGAE-ECF)
Input: adjacency matrix A, feature matrix X
Output: detected communities
Step 1: Pre-processing
1: Apply a community-preserving algorithm, Leiden (type 1) or K-truss (type 2), to obtain C
2: Apply sparsification to obtain C_s when using the Leiden method
Step 2: Preparing the VGAE input
3: A′ ← A + C_s
Step 3: VGAE model training
4: Feed A′ and X into the VGAE model
5: Train the VGAE model to optimize the latent space representation
Step 4: Community detection in latent space
6: Extract the latent representation (embedding) of each node
7: Apply K-means++ clustering to the embeddings to detect communities
Step 5: Evaluation and adjustment
8: Evaluate the detected communities with the AMI and ARI metrics
9: Tune K for K-truss and s for the Leiden method based on performance
10: Refine C using the detected communities and repeat from Step 1

5. Experiments and Results

We experimented on the Citeseer and Cora [23] datasets by assessing the effect of pre-processing via the Leiden and K-truss algorithms, in addition to the Louvain approach that was previously used by Salha-Galvan et al. [2]. We also considered randomly assigned communities and their effect on community detection, to obtain a better idea of any improvement achieved by applying strategic algorithms. Our code is available on GitHub: https://github.com/jyotikahp/community_aware_gae, accessed on 11 September 2024.

5.1. Datasets

The following section describes the datasets used in this work: Citeseer and Cora. The versions of the datasets used in this project are undirected and were pre-processed and generated by [2]. Both datasets consist of scientific publications classified into a number of classes.
  • Citeseer:
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes: Agents, AI, DB, IR, ML, and HCI. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of a total of 3703 unique words.
  • Cora:
The Cora dataset is a citation network of 2708 machine learning papers, organized into seven distinct classes. These papers are interlinked by 5429 citations, forming a directed graph that maps out how papers cite each other. Each paper is represented by a binary word vector, derived from a dictionary of 1433 unique words, indicating the presence or absence of specific words in the paper.
We pass the entire dataset to the VGAE model without splitting it into train/test/validation sets. Since community detection is an unsupervised learning problem, the goal is to learn the latent representation of the graph without knowledge of explicit labels. Thus, the entire dataset is used to learn the embeddings of the nodes of the graph. This gives the VGAE all the information in the graph to study its internal structure, potentially leading to more meaningful and accurate embeddings. To assess the quality of the embeddings, additional evaluation metrics are used.

5.2. Models

In the VGAE framework, the encoder is typically a GCN. We incorporated two encoder models, each converting the input features into a 16-dimensional latent vector. The optimizer used for training was the Adam optimizer [24].
Single-layer linear encoder: As proposed in [25], this is a simple one-hop model without an activation function. The output dimension is a latent space of 16 features.
Two-layer GCN encoder: Initially proposed by Kipf and Welling [5], the two-layer GCN encoder consists of a 32-dimensional hidden layer and uses ReLU as the activation function. The latent space dimension is 16. This considers the effect of two-hop neighbors in calculating the embedding for the nodes.
Both models are trained with a learning rate of 0.01 and 0 dropout. As the nature of the task is community detection and the goal is to learn the generation of low-dimensional node embeddings, all of the input information is retained, enhanced, and passed to the model.

5.3. Evaluation Metrics

The following section describes the metrics used for determining the performance of the models used on the Citeseer and Cora datasets.
Mutual information (MI) [26] is a measure that quantifies the mutual dependence between two variables. For us, the variables are two different clusterings of the same data. Mutual information (MI) is defined as follows:
$$ \mathrm{MI}(U, V) = \sum_{i=1}^{c_U} \sum_{j=1}^{c_V} P(i, j) \log \frac{P(i, j)}{P(i)\, P(j)} $$
where P ( i ) and P ( j ) are the probabilities that a randomly selected sample belongs to clusters i in U and j in V, respectively, and P ( i , j ) is the joint probability that a randomly selected sample belongs to cluster i in U and cluster j in V.
Adjusted mutual information (AMI) is an adjustment of the mutual information score that accounts for chance.
AMI is defined as
$$ \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - \mathbb{E}\{\mathrm{MI}(U, V)\}}{\max\{H(U), H(V)\} - \mathbb{E}\{\mathrm{MI}(U, V)\}} $$
where E { MI ( U , V ) } is the expected MI of all possible clusterings with the same cluster marginal distributions as U and V, and H ( U ) and H ( V ) are the entropies of the clusterings.
The Rand index (RI) calculates the proportion of accurate decisions made by the clustering algorithm. It is computed by dividing the number of agreements (true positive and true negative decisions) by the total number of decisions that need to be made.
The Rand index (RI) is
$$ \mathrm{RI}(U, V) = \frac{a + d}{a + b + c + d} $$
The variable a represents the number of pairs of elements that are in the same cluster in both U and V; b represents the number of pairs in the same cluster in U but in different clusters in V; c represents the number of pairs in different clusters in U but in the same cluster in V; and d represents the number of pairs in different clusters in both U and V. The agreements are a and d.
The adjusted Rand index (ARI) is the Rand index modified to account for chance groupings of the elements. It takes into consideration the tendency of the RI to rise as the number of clusters rises. Because it normalizes with respect to the expected RI under random clustering, it offers a more reliable measure.
The ARI is given by
$$ \mathrm{ARI}(U, V) = \frac{\mathrm{RI}(U, V) - \mathbb{E}\{\mathrm{RI}(U, V)\}}{\max\{\mathrm{RI}(U, V)\} - \mathbb{E}\{\mathrm{RI}(U, V)\}} $$
where E { RI ( U , V ) } is the expected Rand Index of all possible clusterings, and RI ( U , V ) is the actual Rand Index between clusterings U and V.
In community detection, these metrics are crucial for evaluating the quality of the detected communities against ground truth labels. If the ground truth is not available, metrics like ARI can still be useful for comparing the results of different algorithms or parameter settings. High scores in AMI and ARI indicate a strong agreement between the clustering results and the ground truth, suggesting effective community detection.
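Both metrics are available in scikit-learn; a minimal sketch with toy labels:

from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]  # ground truth communities (toy example)
pred_labels = [1, 1, 0, 0, 2, 2]  # detected communities (label names may be permuted)

# Both scores are permutation-invariant and adjusted for chance; 1.0 means perfect agreement.
print("AMI:", adjusted_mutual_info_score(true_labels, pred_labels))  # 1.0
print("ARI:", adjusted_rand_score(true_labels, pred_labels))         # 1.0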

5.4. Results

In this subsection, we go over the outcomes of different implementations to show their qualities and effectiveness in identifying community structures with and without dataset features.
We conducted all our experiments on a machine equipped with an Intel (R) Core (TM) i5-8265U processor (Santa Clara, CA, USA), featuring four cores and eight threads. The processor operates at a base frequency of 1.6 GHz, with the ability to boost up to 3.9 GHz in turbo mode. The machine includes 8 GB of RAM, and the experiments benefited from a 512 GB SSD, which offered ample storage and significantly faster reads/writes. The experiments were run with a learning rate of 0.01, which optimized the performance of our models and ensured effective convergence. For Citeseer, the results are in Table 1 (with features) and Table 2 (without features), and for Cora the results are in Table 3 (with features) and Table 4 (without features). In these tables, the encoder model, the number of iterations, the community-preserving method, the means of the AMI and ARI metrics, and the mean time in seconds are provided. In terms of time, the K-truss method is the most expensive, while Louvain and Leiden have similar times. Regarding the model, the Linear_VAE is between 1.4 and 1.7 times slower than the GCN_VAE for Citeseer, and between 1.1 and 1.5 times slower for Cora.

5.4.1. Citeseer Dataset

Leiden algorithm: When features were taken into account and the Leiden algorithm was applied to the Citeseer dataset, we obtained a mean adjusted mutual information (AMI) score of 24.65% and a mean adjusted Rand index (ARI) of 23.57%. Table 1 provides a summary of the results for the different models. Without considering features, the mean AMI was 15.68% and the mean ARI was 5.72%, as indicated in Table 2. These findings highlight the significance of feature inclusion for successful community detection, since they show a decline in performance when features are removed.
K-truss algorithm: For K-truss, K is the hyperparameter of our model. The model was run for K values ranging from 3 to 11, with K = 6 segregating the communities best. When features were considered, the K-truss approach produced better performance metrics, with a mean AMI of 29.71% and a mean ARI of 28.19%, as shown in Table 1. On the other hand, the AMI and ARI scores dramatically decreased to 3.0377% and 0.38683%, respectively, when feature consideration was removed. This sharp decline, indicated in Table 2, emphasizes how important feature inclusion is to the K-truss algorithm's ability to recognize communities.
When considering features (see Table 1), the GCN_VAE model outperforms the Linear_VAE across all methods, highlighting the effectiveness of incorporating graph convolutional layers. The K-truss method provides the best performance in terms of AMI and ARI for both models, making it the most effective community detection method for the Citeseer dataset among those tested. The Leiden method, while an improvement over Louvain in theory, did not outperform Louvain in this specific context. Random assignment confirms the necessity of using structured methods for meaningful community detection results.
When no features are considered (see Table 2), the Linear_VAE generally performs better than the GCN_VAE in terms of AMI, while the GCN_VAE shows slightly better ARI with the Leiden method. The K-truss method performs poorly without node features, indicating its reliance on node attributes for effective community detection. These results highlight the importance of incorporating node features, as performance drops significantly without them. Overall, the GCN_VAE model combined with the K-truss method and features shows the most promise for community detection on the Citeseer dataset.

5.4.2. Cora Dataset

Applying the Leiden method to the Cora dataset with features included yielded a mean AMI of 50.75% and a mean ARI of 46.12%. Leiden with the GCN_VAE model performed the best out of all the algorithms and models considered, as indicated in Table 3.
Table 4 displays the robustness of the performance measures in the absence of features, indicating a mean AMI of 40.63% and a mean ARI of 32.80%. These outcomes show how well suited and efficient Leiden is at handling settings both with and without node features.
For K-truss, K is the hyperparameter of our model. We trained our model for K values ranging from 3 to 11, with K = 7 producing the best results. Using the Cora dataset with features, the K-truss algorithm produced a mean AMI of 47.76% and a mean ARI of 40.58%, as noted in Table 3. The findings were noticeably worse in the absence of features, with a mean AMI of 2.57% and a mean ARI of 1.54%, as captured in Table 4. This pattern further supports the significance of feature inclusion in enhancing the effectiveness of community detection algorithms.
Overall, from Table 3, the GCN_VAE model combined with the Leiden method shows the most promise for community detection on the Cora dataset with features, demonstrating the highest mean AMI and ARI values among all the evaluated methods.
From Table 4, the Linear_VAE combined with the Louvain method performs the best for community detection on the Cora dataset when node features are not considered. These findings indicate that the inclusion of node features is critical for the K-truss method to perform well.
The highlighted (bold) results in Table 1, Table 2, Table 3 and Table 4 show the best performance achieved by our pre-processing algorithms using a linear and a GCN variational autoencoder. The study surpasses the benchmarks, yielding better performance metrics. Notably, the combination of the GCN_VAE model with K-truss pre-processing results in better mean AMI and ARI scores for the Citeseer dataset, while pre-processing with Leiden exceeds the AMI and ARI scores of both models on the Cora dataset with features. Figure 4 and Figure 5 compare the performances graphically and display the patterns discussed for the Citeseer and Cora datasets. The superior performance of our proposed method underscores the value of integrating advanced pre-processing strategies and the significance of hyperparameter optimization in graph-based community detection. For featureless graphs, however, Louvain performs better in terms of AMI and Leiden performs better in terms of ARI for both datasets. The random method performed worst on both datasets, for both models, across all pre-processing algorithms, which further echoes the benefit of applying a targeted algorithm as part of pre-processing.
Figure 6 and Figure 7 illustrate the ground truth communities, as established by the labels in the dataset. Each color represents a different community, highlighting the initial partitioning used as a benchmark for assessing the quality of our community detection method.
Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 present a visual account of the community structures within the network as predicted by our Enhanced Community Detection with Structural Information VGAE model. They offer an understanding of the performance of our proposed community detection approach.

6. Conclusions

This paper aimed to explore community detection in graphs. Through experimentation and analysis, it demonstrated the potential of deep learning, particularly variational graph autoencoders (VGAEs). The proposed algorithm, called Enhanced Community Detection with Structural Information VGAE (VGAE-ECF), has proven to be effective, as shown by the improvement in performance metrics over traditional methods.
The comparative results indicate that our model, which integrates preliminary community information into the VGAE framework, surpasses the benchmarks set by previous studies, including those highlighted by Salha-Galvan et al. [2]. The enhanced VGAE model achieved higher adjusted mutual information (AMI) and adjusted Rand index (ARI) scores, confirming our hypothesis that the inclusion of community structure information can significantly improve the community detection process.
Furthermore, the fine-tuning of the hyperparameter K in our experiments underscored its critical role in optimizing the community detection algorithms. This tuning was essential in refining our model to better capture the subtleties of community boundaries within the graph.
In conclusion, our findings underscore the importance of incorporating community structure and edge strengths into a neural network for identifying communities. This work also sets the stage for future research to build upon. As graphs and neural networks continue to grow in size and complexity, the methodologies we have developed provide a foundation for more robust, scalable, and accurate community detection techniques.

Author Contributions

Conceptualization, J.H.P. and K.P.; methodology, J.H.P.; software, J.H.P.; validation, P.P., W.B.A. and K.P.; formal analysis, K.P.; investigation, J.H.P. and K.P.; resources, J.H.P.; data curation, J.H.P.; writing—original draft preparation, J.H.P. and K.P.; writing—review and editing, P.P. and W.B.A.; visualization, J.H.P.; supervision, K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository (https://doi.org/10.1609/aaai.v29i1.9277, Network repository: https://networkrepository.com/cora.php).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VGAE      Variational graph autoencoder
GNN       Graph neural network
GCN       Graph convolutional network
VAE       Variational autoencoder
VGAE-ECF  Enhanced Community Detection with Structural Information VGAE

References

  1. Jin, D.; Yu, Z.; Jiao, P.; Pan, S.; He, D.; Wu, J.; Yu, P.S.; Zhang, W. A Survey of Community Detection Approaches: From Statistical Modeling to Deep Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 1149–1170. [Google Scholar] [CrossRef]
  2. Salha-Galvan, G.; Lutzeyer, J.F.; Dasoulas, G.; Hennequin, R.; Vazirgiannis, M. Modularity-aware graph autoencoders for joint community detection and link prediction. Neural Netw. 2022, 153, 474–495. [Google Scholar] [CrossRef] [PubMed]
  3. Khoshraftar, S.; An, A. A Survey on Graph Representation Learning Methods. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–55. [Google Scholar] [CrossRef]
  4. Zhang, L.; Song, H.; Aletras, N.; Lu, H. Graph Node-Feature Convolution for Representation Learning. arXiv 2022, arXiv:cs.LG/1812.00086. [Google Scholar]
  5. Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:stat.ML/1611.07308. [Google Scholar]
  6. Kingma, D.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  7. Despalatović, L.; Vojković, T.; Vukicević, D. Community structure in networks: Girvan-Newman algorithm improvement. In Proceedings of the 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 26–30 May 2014; pp. 997–1002. [Google Scholar]
  8. Chen, M.; Kuzmin, K.; Szymański, B.K. Community Detection via Maximization of Modularity and Its Variants. IEEE Trans. Comput. Soc. Syst. 2014, 1, 46–65. [Google Scholar] [CrossRef]
  9. Zhang, X.S.; Wang, R.S.; Wang, Y.; Wang, J.; Qiu, Y.; Wang, L.; Chen, L. Modularity optimization in community detection of complex networks. Europhys. Lett. 2009, 87, 38002. [Google Scholar] [CrossRef]
  10. Bach, F.; Jordan, M. Learning Spectral Clustering. In Advances in Neural Information Processing Systems; Thrun, S., Saul, L., Schölkopf, B., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 16. [Google Scholar]
  11. Choong, J.J.; Liu, X.; Murata, T. Learning Community Structure with Variational Autoencoder. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 69–78. [Google Scholar] [CrossRef]
  12. Choong, J.J.; Liu, X.; Murata, T. Optimizing variational graph autoencoder for community detection with dual optimization. Entropy 2020, 22, 197. [Google Scholar] [CrossRef] [PubMed]
  13. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; ML Research Press: Philadelphia, PA, USA, 2019; Volume 97, pp. 6861–6871. [Google Scholar]
  14. Greff, K.; van Steenkiste, S.; Schmidhuber, J. Neural Expectation Maximization. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6694–6704. [Google Scholar]
  15. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38. [Google Scholar] [CrossRef]
  16. Blondel, V.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef]
  17. Fei, R.; Wan, Y.; Hu, B.; Li, A.; Li, Q. A Novel Network Core Structure Extraction Algorithm Utilized Variational Autoencoder for Community Detection. Expert Syst. Appl. 2023, 222, 119775. [Google Scholar] [CrossRef]
  18. Cohen, J. Trusses: Cohesive subgraphs for social network analysis. Natl. Secur. Agency Tech. Rep. 2008, 16, 1–29. [Google Scholar]
  19. Traag, V.A.; Waltman, L.; van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 1–12. [Google Scholar] [CrossRef] [PubMed]
  20. Panduranga, N.K.; Gao, J.; Yuan, X.; Stanley, H.E.; Havlin, S. Generalized model for k-core percolation and interdependent networks. Phys. Rev. E 2017, 96, 032317. [Google Scholar] [CrossRef] [PubMed]
  21. Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable K-Means++. arXiv 2012, arXiv:1203.6402. [Google Scholar] [CrossRef]
  22. Caragea, C.; Wu, J.; Ciobanu, A.; Williams, K.; Fernández-Ramírez, J.; Chen, H.H.; Wu, Z.; Giles, L. CiteSeerx: A Scholarly Big Dataset. In Advances in Information Retrieval; de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C., de Jong, F., Radinsky, K., Hofmann, K., Eds.; Springer: Cham, Switzerland, 2014; pp. 311–322. [Google Scholar]
  23. Cora Dataset. Available online: https://ieee-dataport.org/documents/cora (accessed on 11 September 2024).
  24. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:cs.LG/1412.6980. [Google Scholar]
  25. Salha, G.; Hennequin, R.; Vazirgiannis, M. Simple and Effective Graph Autoencoders with One-Hop Linear Models. arXiv 2020, arXiv:cs.LG/2001.07614. [Google Scholar]
  26. Kinney, J.B.; Atwal, G.S. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA 2014, 111, 3354–3359. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of a VGAE with encoder and decoder.
Figure 2. Process flow from an input graph to detecting communities. The VGAE produces embeddings that reflect community structure and edge weights, which are clustered with K-means.
Figure 3. Proposed architecture to generate communities.
Figure 4. Comparison of performance of different models for the Citeseer dataset. Note that the blue line is for the GCN_VAE method.
Figure 5. Comparison of performance of different models for the Cora dataset. The orange line is for the GCN_VAE method.
Figure 6. Ground truth communities in the Citeseer dataset.
Figure 7. Ground truth communities in the Cora dataset.
Figure 8. K-truss+GCN_VAE on Citeseer.
Figure 9. K-truss+Linear_VAE on Citeseer.
Figure 10. Leiden+GCN_VAE on Citeseer.
Figure 11. Leiden+Linear_VAE on Citeseer.
Figure 12. K-truss+GCN_VAE on Cora.
Figure 13. K-truss+Linear_VAE on Cora.
Figure 14. Leiden+GCN_VAE on Cora.
Figure 15. Leiden+Linear_VAE on Cora.
Table 1. Community detection results on Citeseer dataset considering features.

Model      | Iterations | Method      | Mean AMI (%) | Mean ARI (%) | Time (s)
Linear_VAE | 100        | Louvain [2] | 25.39        | 10.58        | 559.73
GCN_VAE    | 100        | Louvain [2] | 26.59        | 26.09        | 393.51
Linear_VAE | 100        | Leiden      | 21.30        | 12.80        | 585.73
GCN_VAE    | 100        | Leiden      | 24.65        | 23.57        | 330.17
Linear_VAE | 100        | K-truss     | 30.27        | 24.80        | 667.77
GCN_VAE    | 100        | K-truss     | 29.71        | 28.19        | 378.43
Linear_VAE | 100        | Random      | −0.012       | −0.013       | 335.64
GCN_VAE    | 100        | Random      | 0.37         | 0.29         | 339.03
Table 2. Experimental results on Citeseer dataset without considering features.

Model      | Iterations | Method      | Mean AMI (%) | Mean ARI (%) | Time (s)
Linear_VAE | 100        | Louvain [2] | 15.83        | 7.94         | 621.86
GCN_VAE    | 100        | Louvain [2] | 10.95        | 7.75         | 422.83
Linear_VAE | 100        | Leiden      | 15.68        | 5.72         | 618.01
GCN_VAE    | 100        | Leiden      | 10.53        | 7.98         | 412.26
Linear_VAE | 100        | K-truss     | 1.01         | 0.78         | 826.20
GCN_VAE    | 100        | K-truss     | 3.86         | 3.04         | 480.62
Table 3. Experimental results on Cora dataset considering features.

Model      | Iterations | Method      | Mean AMI (%) | Mean ARI (%) | Time (s)
Linear_VAE | 100        | Louvain [2] | 50.12        | 43.32        | 405.82
GCN_VAE    | 100        | Louvain [2] | 49.02        | 38.67        | 352.50
Linear_VAE | 100        | Leiden      | 50.75        | 46.12        | 392.30
GCN_VAE    | 100        | Leiden      | 52.30        | 49.81        | 341.77
Linear_VAE | 100        | K-truss     | 45.09        | 39.09        | 476.85
GCN_VAE    | 100        | K-truss     | 47.76        | 40.58        | 428.69
Linear_VAE | 100        | Random      | 0.29         | 0.15         | 243.93
GCN_VAE    | 100        | Random      | 0.69         | 0.59         | 330.69
Table 4. Experimental results on Cora dataset without considering features.

Model      | Iterations | Method      | Mean AMI (%) | Mean ARI (%) | Time (s)
Linear_VAE | 100        | Louvain [2] | 50.12        | 43.39        | 481.55
GCN_VAE    | 100        | Louvain [2] | 35.76        | 27.09        | 394.40
Linear_VAE | 100        | Leiden      | 40.63        | 32.80        | 465.28
GCN_VAE    | 100        | Leiden      | 38.98        | 32.46        | 365.97
Linear_VAE | 100        | K-truss     | 2.57         | 1.54         | 715.39
GCN_VAE    | 100        | K-truss     | 8.20         | 4.08         | 461.72