Article

NSLS: A Neighbor Similarity and Label Selection-Based Algorithm for Community Detection

1 School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650504, China
2 Fujian Provincial Key Laboratory of Data-Intensive Computing, Quanzhou Normal University, Quanzhou 362000, China
3 School of Mathematics and Computer Science, Quanzhou Normal University, Quanzhou 362000, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(8), 1300; https://doi.org/10.3390/math13081300
Submission received: 13 March 2025 / Revised: 8 April 2025 / Accepted: 11 April 2025 / Published: 16 April 2025
(This article belongs to the Special Issue Data Analysis for Social Networks and Information Systems)

Abstract

Community detection remains one of the most widely applied methods for discovering latent information in complex networks. In recent years, many similarity-based community detection algorithms have been applied to the analysis of complex networks. However, these approaches often rely solely on simple similarity measures, which makes it difficult to differentiate how tightly nodes are related. To address this issue, this paper proposes a community detection algorithm based on neighbor similarity and label selection (NSLS). Initially, the algorithm assigns labels to each node using a new local similarity measure, thereby quickly forming a preliminary community structure. Subsequently, a similarity parameter is introduced to calculate the similarity between nodes and communities, and nodes are reassigned to more appropriate communities. Finally, dense communities are obtained by a fast merging method. Experiments on real-world networks show that the proposed method is more accurate than recent and classical community detection algorithms.

1. Introduction

Many complex systems in the real world can be modeled as complex networks, such as social networks [1], biological networks [2], information networks [3], and electric power networks [4]. In these networks, the entities in the networks are abstracted as nodes, and the relationships between entities are abstracted as edges.
In recent years, many scholars have begun to focus on network information mining. Network information mining mainly includes important research directions such as link prediction [5], community detection [6], and clustering analysis [7]. Among them, the goal of community detection is to identify these tightly connected communities and reveal the internal structure of the network. Specifically, a community in a network can be regarded as a group of network nodes [8]. The community structure is the connection relationship formed by the community [9]. Community detection not only helps to understand group divisions and node functions but also provides support for applications such as network optimization, information propagation, and personalized recommendations [10].
Until now, many algorithms have been proposed to address the community detection problem, which can be broadly classified into modularity-based algorithms, label propagation algorithms, random walk-based algorithms, and local similarity-based algorithms [11]. These algorithms try to explore communities in networks from various perspectives.
Since the concept of modularity was proposed, many researchers have used it to evaluate the performance of community detection algorithms [12,13]. The CNM algorithm proposed by Clauset et al. [14] performs community detection by gradually merging nodes and optimizing modularity. Later, an improved and faster method called Louvain was suggested by Blondel et al. [15]. It stands out as an efficient approach to modularity optimization, achieving community detection by merging nodes so as to maximize modularity. The Leiden algorithm [16] improves on Louvain by introducing more refined local optimization and an improved community merging strategy. The MSM algorithm [17] solves the community detection problem by reformulating modularity maximization as a subset identification problem and maximizing a surrogate of it. Nevertheless, modularity-based algorithms generally face challenges such as limited resolution and sensitivity to network size [16], and their greedy nature, neglect of topological information and node similarity, and tendency to become trapped in local maxima can lead to inaccurate results.
The Label Propagation Algorithm (LPA) [18] detects communities using an information propagation mechanism. The LPA initially assigns a unique label to each node and iteratively updates the labels until every node carries the label shared by the majority of its neighbors. Although the LPA has near-linear time complexity, it shows lower accuracy and instability in community detection due to weaknesses in its random node selection and label update mechanisms. Scholars have improved the LPA by exploiting network topological information, refining the label propagation rules, and improving community initialization [19,20,21].
Some methods identify core nodes and expand local communities from the inside out [22,23]. The performance of these methods relies on accurate core node detection. For example, in the algorithm based on density peak clustering and label propagation proposed by Li et al. [23], the user needs to adjust the parameters manually several times to determine the number of community centers. The LBLD algorithm selects a limited number of nodes as cores to propagate labels to surrounding nodes [22]. Consequently, not all nodes can receive labels, and the network structure may lack sufficient stability to achieve balanced label propagation, which can lead to inaccuracies. The FSLD method [24] expands communities inward from boundary nodes, which avoids some pitfalls of core node identification, initial label assignment, and label propagation approaches. However, boundary nodes typically exhibit overlapping characteristics and carry little information, posing significant challenges for accurately assigning them to a well-defined community. FluidC is a propagation-based method that identifies communities by simulating the expansion and contraction of fluids in an environment [25]. LMFLS [6] is a fast algorithm based on local multi-factor node scoring and label selection. These algorithms improve the robustness of the LPA and the accuracy of community partitioning; nevertheless, further gains could come from more refined label selection strategies, improved similarity measures, and additional factors.
Random walk-based algorithms detect communities by leveraging the tendency of walkers to remain within a community over short time spans. Walktrap [26] uses random walks to calculate the transition probability between nodes and then measures the similarity between nodes and communities by calculating distances. Walktrap provides reasonable results over a broad range of community structure strengths, but its performance depends heavily on the degree distribution of the network. Infomap is another random walk-based algorithm that combines encoding with community detection, but it cannot effectively reveal loosely connected communities [27]. Synwalk [28] combines the advantages of Infomap and Walktrap, applying the concept of stochastic block modeling to random walks to identify communities. Li et al. [29] proposed a new variant of random walk whose core idea, HoSI, is to identify core community members related to the query node and optimize the generated community structure. Random walk-based algorithms must consider information from all nodes during community discovery, which makes them inefficient on large-scale networks due to their high time complexity. Moreover, these algorithms are sensitive to the selection of initial nodes, which can lead to differences in the results.
Local similarity-based algorithms focus on the local features and neighborhood information of the nodes. The main idea of local methods is to define local metrics and iteratively divide communities based on these metrics [30]. These methods rely heavily on the accuracy of the similarity measurement. Sahu et al. [31] proposed two new similarity metrics and implemented community detection and optimization through a two-phase process. Yang et al. [32] introduced a dynamic time series method to quantify the similarity between nodes in networks. Information derived from the time series of the diffusion model is used to identify communities. SimCMR [33] expands the concept of community detection by evaluating similarities within and between communities. Zhang et al. [34] proposed an evolutionary multi-objective attribute community detection based on a similarity fusion strategy with central nodes. The similarity fusion strategy integrates topological and attribute similarities to effectively identify central nodes.
Based on the above analysis, we propose a novel community detection algorithm. In the initial community detection stage, a new similarity measure is used to select labels for the nodes, thereby forming the initial community structure. In the second phase, we introduce a similarity parameter to adapt to different types of network structures. Nodes are reassigned on the basis of the similarity between nodes and communities. In the final phase, we merge communities by analyzing the relationships between important nodes in neighboring communities. The main contributions of the proposed algorithm are summarized as follows:
  • Two similarity measures are introduced that comprehensively capture the tightness of the relations between nodes.
  • The algorithm comprehensively considers the similarity relation between nodes, between nodes and communities, and between communities.
  • We present comparison experiments with twelve methods on real-world datasets spanning a wide range of network sizes, demonstrating the effectiveness of the proposed algorithm on networks of various scales.
The remainder of this paper is organized as follows. Section 2 provides a brief review of some concepts. Section 3 shows the details of the proposed community detection algorithm. Some necessary experimental materials are introduced in Section 4. Section 5 provides the experimental results and analysis. Finally, Section 6 concludes the paper and provides a possible direction for future research.

2. Preliminaries

In this section, we introduce some basic concepts involved in this paper, namely complex networks, community detection, and modularity. For more detailed descriptions, one can refer to the relevant references [8,20].

2.1. Complex Network

Mathematically, a complex network can be expressed as a 2-tuple $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ denotes the set of nodes and $E = \{e_{ij} \mid i, j = 1, 2, \ldots, n\}$ denotes the set of edges. If there exists an edge between nodes $v_i$ and $v_j$, then $e_{ij} = 1$; otherwise, $e_{ij} = 0$. Obviously, if the numbers of nodes and edges of $G = (V, E)$ are finite, say $n = |V|$ and $m = |E|$, then the structure of the complex network can be expressed as a matrix:
$$A = \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1n} \\ e_{21} & e_{22} & \cdots & e_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ e_{n1} & e_{n2} & \cdots & e_{nn} \end{pmatrix}.$$
Without loss of generality, in what follows, we adhere to the hypothesis that G = ( V , E ) is an undirected and unweighted network.
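The construction above can be sketched in a few lines of Python (an illustrative toy graph, not data from the paper):

```python
# Build the adjacency matrix A of an undirected, unweighted network
# from an edge list (toy data for illustration).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4  # number of nodes

A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1  # e_ij = 1 when an edge exists
    A[j][i] = 1  # symmetric, since G is undirected
```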

2.2. Community Detection

For a complex network $G = (V, E)$, a community can be regarded as a subnetwork $G_i = (V_i, E_i)$ with $V_i \subseteq V$ and $E_i \subseteq E$. The goal of community detection is to find subnetworks $G_1, G_2, \ldots, G_k$ that satisfy the following conditions:
(1)
$V_i \neq \emptyset$ for $i = 1, 2, \ldots, k$.
(2)
$\bigcup_{i=1}^{k} G_i = G$, i.e., $\bigcup_{i=1}^{k} V_i = V$ and $\bigcup_{i=1}^{k} E_i \subseteq E$.
(3)
$V_i \cap V_j = \emptyset$ and $E_i \cap E_j = \emptyset$ for any $i \neq j$.

2.3. Modularity

Given that $P = \{C_1, C_2, \ldots, C_k\}$ is a community detection result with respect to $G = (V, E)$, modularity can be defined as
$$Q = \frac{1}{2m} \sum_{i=1}^{k} \left( 2 L_{C_i} - \frac{D_{C_i}^2}{2m} \right),$$
where $C_i \in P$ represents the $i$-th community and $L_{C_i}$ is the number of edges inside the community $C_i$. $D_{C_i}$ is the sum of the degrees of the nodes in the community $C_i$, defined as
$$D_{C_i} = \sum_{v_i \in C_i} d(v_i),$$
where $d(v_i)$ is the degree of node $v_i$, calculated as
$$d(v_i) = \sum_{j=1}^{n} e_{ij}.$$
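As a concrete check of these definitions, the following Python sketch computes Q for a toy network of two triangles joined by a single edge (illustrative data; the adjacency is stored as a dict of neighbor sets):

```python
# Modularity Q of a partition, following the formula in Section 2.3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
partition = [{0, 1, 2}, {3, 4, 5}]

m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges
Q = 0.0
for C in partition:
    L_C = sum(1 for v in C for u in adj[v] if u in C) / 2  # edges inside C
    D_C = sum(len(adj[v]) for v in C)                      # degree sum of C
    Q += 2 * L_C - D_C ** 2 / (2 * m)
Q /= 2 * m   # Q = 5/14 ≈ 0.357 for this partition
```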

3. Our Proposed Approach for Community Detection

This section mainly introduces the motivation, the algorithm execution process, the pseudo-code of the algorithm, and the complexity analysis. The algorithm execution process includes three parts: node selection, node shifting, and community merging, which are discussed in Section 3.2, Section 3.3 and Section 3.4.

3.1. Motivation

Similarity-based community detection algorithms typically cluster nodes with high similarity in the same community. Traditional similarity measures such as Jaccard similarity, Cosine similarity, and SimRank have certain limitations in evaluating the closeness of relationships between nodes. For example, in sparse networks, where connections between nodes are limited, many similarity measures struggle to accurately capture the real community structure. In addition, when community boundaries are blurred, nodes are prone to being assigned to the wrong communities.
To address these issues, we adopt two similarity measure methods and a series of basic rules. We only consider the similarity between a node and its most similar node to quickly form more accurate community structures. In addition, a similarity parameter is introduced to regulate the tightness of the nodes within the community. Nodes are reassigned on the basis of the similarity between nodes and communities. The specific rules are discussed in detail in the algorithm section.

3.2. Label Selection

In a complex network, two nodes that share a large number of neighbors are typically more closely connected and tend to be assigned to the same community. We focus on the similarity between a node and its neighbors, as they are closer to each other within the network. The similarity between nodes is defined as follows:
Given a complex network $G = (V, E)$, for node $v_i \in V$ and its neighbor $v_j \in N(v_i)$, where $N(v_i) = \{v_j \mid e_{ij} \neq 0,\ j = 1, 2, \ldots, n\}$, the similarity between neighbor nodes is formally defined as
$$\mathrm{Sim}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)| + |N(v_i) \cap N(v_j)|},$$
where $|N(v_i) \cap N(v_j)|$ denotes the number of common neighbors of nodes $v_i$ and $v_j$, and $|N(v_i) \cup N(v_j)|$ denotes the total number of distinct neighbors of the two nodes.
The importance of a node is related not only to the number of its neighbors but also to the structure of its neighborhood. The importance of node $v_i$ can be expressed as
$$I(v_i) = \sum_{v_j \in N(v_i)} \mathrm{Sim}(v_i, v_j).$$
Initially, for any node v i V , a different label l i is assigned to each node, indicating the community to which v i belongs. The label of a node is updated according to its relationship with neighbor nodes. To more accurately capture the intrinsic connections between nodes in the network, only the neighbor with the highest similarity or the highest degree is considered. Nodes with the same label are placed in the same community. The specific label selection rules are as follows:
To avoid repeated updating of nodes, nodes with only one neighbor are processed last. For any node $v_i$ satisfying $d(v_i) > 1$, its neighbor $v_{sim}$ with the highest similarity is identified by
$$v_{sim} = \arg\max_{v_j \in N(v_i)} \mathrm{Sim}(v_i, v_j).$$
If the node importance satisfies $I(v_{sim}) > I(v_i)$, then node $v_{sim}$ is selected as the most similar node of $v_i$; otherwise, node $v_i$ is its own most similar node. Subsequently, node $v_i$ adopts the label of its most similar node. If $\mathrm{Sim}(v_i, v_{sim}) = 0$, indicating that $v_i$ shares no common neighbors with any of its neighbors, then the most similar node is identified by
$$v_{deg} = \arg\max_{v_j \in N(v_i)} d(v_j),$$
where $v_{deg}$ is the neighbor of node $v_i$ with the highest degree. In this case, node $v_i$ receives the label of the most similar node $v_{deg}$. Afterward, if the label of any node has changed, all labels are updated again. Nodes satisfying $d(v_i) = 1$ directly adopt the label of their sole neighbor. After the label selection stage, the labels of the important nodes have been propagated to the surrounding nodes.
In Figure 1a, node v 2 is the most similar node of nodes v 18 , v 22 , and v 20 ; node v 1 is the most similar node of node v 2 . The node selects the label of its most similar node, and they are shown in different colors. Node v 2 selects the label of node v 1 . Consequently, nodes v 18 , v 22 , and v 20 ultimately select the label of node v 1 , as shown in Figure 1b. In Figure 1c, node v 12 selects the label of its only neighbor v 1 .
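The label-selection rules above can be sketched in Python. This is an illustrative, simplified version on a toy graph (it defines the Jaccard-style neighbor similarity and importance locally, processes nodes sequentially, and is not the authors' implementation):

```python
# A simplified sketch of label selection (Section 3.2) on a toy graph.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5, 7}, 7: {6}}

def sim(vi, vj):
    common, union = adj[vi] & adj[vj], adj[vi] | adj[vj]
    return len(common) / (len(union) + len(common))

def importance(vi):
    return sum(sim(vi, vj) for vj in adj[vi])

label = {v: v for v in adj}                       # unique initial labels
changed = True
while changed:
    changed = False
    for v in adj:
        if len(adj[v]) <= 1:
            continue                              # degree-1 nodes last
        v_sim = max(adj[v], key=lambda u: sim(v, u))
        if sim(v, v_sim) == 0:                    # no common neighbors:
            v_sim = max(adj[v], key=lambda u: len(adj[u]))  # use degree
        elif importance(v_sim) <= importance(v):
            v_sim = v                             # v keeps its own label
        if label[v] != label[v_sim]:
            label[v] = label[v_sim]
            changed = True
for v in adj:                                     # degree-1 nodes adopt
    if len(adj[v]) == 1:                          # the label of their
        label[v] = label[next(iter(adj[v]))]      # sole neighbor
```

On this toy graph the dense group {4, 5, 6, 7} ends up sharing one label, while nodes 2 and 3 adopt the label of one of the important nodes 0 or 1.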

3.3. Node Shifting

We form preliminary communities based on label selection. On this basis, adjustments can be made to incorrectly assigned nodes according to the similarity measure between the nodes and communities. If most of the neighbors of a node belong to a specific community, the node is likely to belong to the same community. Sahu et al. [31] proposed a new method to measure the similarity between node and community, which classifies nodes according to their degree. Since the preliminary community structure has been obtained, we no longer consider detecting communities based on degree. The similarity is defined as follows:
For any node $v_i \in V$, let $C_j$ be the current community of node $v_i$, where $C_j = \{v_{j1}, v_{j2}, \ldots, v_{jl}\}$ and $l$ is the number of nodes in community $C_j$. The similarity measure between the node and its community can be defined as
$$\mathrm{Sim}(v_i, C_j) = \begin{cases} 1, & |N(v_i) \cap C_j| > \dfrac{d(v_i)}{\alpha} \\ 0, & \text{otherwise}, \end{cases}$$
where $\alpha$ is the similarity parameter used to adjust the strictness of the similarity criterion; it takes values in the range of 1.5 to 3 with an increment of 0.5.
When a node shares a community with the majority of its neighbors, it is likely to belong to its current community. S i m ( v i , C j ) = 1 indicates that node v i still belongs to the current community C j . If S i m ( v i , C j ) = 0 , then node v i will be reassigned to another community. The detailed assignment rules are as follows:
The set of communities $P_{most}$ that satisfies the following condition can be defined as
$$P_{most} = \left\{ C_p \;\middle|\; C_p = \arg\max_{C_i \in P_{neighbors}} |N(v_i) \cap C_i| \right\},$$
where $P_{neighbors}$ is the set of communities to which the neighbors of node $v_i$ belong, and $C_p$ is a community that has the maximum number of shared neighbors with node $v_i$.
If $|P_{most}| = 1$, say $P_{most} = \{C_p\}$, then node $v_i$ is assigned to community $C_p$. Otherwise, assume that $P_{most} = \{C_{p_1}, C_{p_2}, \ldots, C_{p_m}\}$. In this case, node $v_i$ is assigned to community $C_s$ according to
$$C_s = \arg\max_{C_{p_i}} \sum_{v_j \in N(v_i) \cap C_{p_i}} d(v_j),$$
where $C_{p_i} \in \{C_{p_1}, C_{p_2}, \ldots, C_{p_m}\}$.
After one round of node shifting, each node is assigned to a suitable community. Subsequently, we will compare the changes in modularity before and after node shifting. Based on the results of this comparison, we will decide whether to carry out the next round of node shifting.
Taking the Karate network as an example, the similarity parameter is set to α = 2.5 . In Figure 2a, eight communities are identified, and the modularity of the detected communities is 0.2948. In the node shifting stage, nodes are reassigned to the communities with the maximum number of neighbors. After the first round of shifting, the number of communities is reduced to 3. The modularity of the optimized communities is 0.3744. Since the modularity remains at 0.3744 after the second round of node shifting, and the modularity improvement is less than 0.01, the node shifting process is terminated.
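The shifting test and the reassignment rule can be sketched as follows (a minimal Python illustration on toy data; the threshold d(v)/α follows the definition above):

```python
# Node shifting (Section 3.3): keep a node in its community only if
# enough of its neighbors are already there; otherwise move it to the
# community holding most of its neighbors. Toy data, alpha = 2.5.
from collections import Counter

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1},
       3: {0, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
alpha = 2.5

def stays(v):                                     # Sim(v, C) = 1 case
    inside = sum(1 for u in adj[v] if community[u] == community[v])
    return inside > len(adj[v]) / alpha

def shift(v):                                     # reassignment rule
    counts = Counter(community[u] for u in adj[v])
    best = max(counts.values())
    candidates = [c for c, k in counts.items() if k == best]
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates,                        # tie: largest total
               key=lambda c: sum(len(adj[u]) for u in adj[v]  # degree of
                                 if community[u] == c))       # shared nbrs

community[3] = "A"          # suppose node 3 was mis-assigned
needs_move = not stays(3)   # only 1 of 3 neighbors is in "A"
target = shift(3)           # "B" holds most of node 3's neighbors
```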

3.4. Community Merging

In order to further optimize the community structure, we merge the incorrectly separated communities. Based on references [6,22], the merging method focuses on negotiations between key nodes within the communities to avoid checking each node individually. Specifically, the merging process includes the following two steps:
The first step is to select candidate communities based on the size of the community. Specifically, the largest community is identified, which is the community with the maximum number of nodes. After excluding the largest community, the average size of the remaining communities is calculated, and those smaller than the average size are selected to form the set of candidate communities P c a n d i d a t e .
The second step is to evaluate whether the candidate communities need to merge. During the merging process, we only consider the relationships between the key nodes of the two communities. Based on the two metrics of node importance and degree, the key node $v_{key}$ is selected using
$$v_{key} = \arg\max_{v_i \in C_j} \left( d(v_i) + I(v_i) \right),$$
where $C_j$ represents a candidate community and $C_j \in P_{candidate}$. Then, we find a neighbor node $v_{opt}$ that satisfies
$$v_{opt} = \arg\max_{v_j \in N(v_{key})} \left( d(v_j) + I(v_j) \right),$$
where $v_{opt}$ is the neighbor of node $v_{key}$ that maximizes the sum of its degree and importance.
If the selected neighbor node v o p t satisfies the following conditions:
(1)
Node v o p t has a different community label from v k e y ;
(2)
The degree of node $v_{opt}$ is greater than the degree of node $v_{key}$, i.e., $d(v_{opt}) > d(v_{key})$;
then the label of node v o p t is assigned to node v k e y , thereby merging the two communities. Otherwise, the merging process is not performed.
In the community merging process, the focus is on the key nodes within each community. The merging process is based on comparing the key nodes of two communities. To obtain an accurate community structure, the merging operation is performed twice. As shown in Figure 3a, the key node of the orange candidate community is node v 6 , and its neighbors are v 1 , v 11 , v 7 , and v 17 . The selected neighbor node is node v 1 in the red community. Since the merging conditions are met, the orange community is merged into the red community.
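The two merge conditions can be expressed compactly in Python. In this sketch the importance scores are assumed precomputed (illustrative values, not the paper's data), and for simplicity the whole candidate community is relabeled once its key node adopts the neighbor's label:

```python
# Community merging (Section 3.4): the candidate community's key node
# negotiates with its strongest neighbor; merge if that neighbor lies
# in another community and has a higher degree.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
I = {0: 0.9, 1: 0.5, 2: 0.5, 3: 0.0}        # assumed importance scores
community = {0: "A", 1: "A", 2: "A", 3: "B"}

def try_merge(candidate):
    v_key = max(candidate, key=lambda v: len(adj[v]) + I[v])
    v_opt = max(adj[v_key], key=lambda v: len(adj[v]) + I[v])
    if community[v_opt] != community[v_key] and \
       len(adj[v_opt]) > len(adj[v_key]):
        for v in candidate:                  # key node adopts v_opt's
            community[v] = community[v_opt]  # label, merging communities

try_merge({3})   # the small community "B" is absorbed into "A"
```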

3.5. The Pseudo-Code of the Proposed Algorithm

The proposed algorithm consists of three stages: label selection, node shifting, and community merging, which are outlined in Algorithms 1–3.
In Algorithm 1, $Label(v_i)$ refers to the label of node $v_i$, and $Label(v_i) \leftarrow Label(v_{deg})$ denotes the label assignment action, that is, node $v_i$ receives the label of node $v_{deg}$. In Algorithm 2, $Old\_modularity \leftarrow Modularity(community)$ means calculating the modularity of the preliminary community structure and storing the result in $Old\_modularity$, and $New\_modularity \leftarrow Modularity(community)$ means calculating the modularity after node shifting and storing the result in $New\_modularity$.
Algorithm 1: Label selection
Mathematics 13 01300 i001
Algorithm 2: Node shifting
Mathematics 13 01300 i002

3.6. Complexity Analysis

The NSLS algorithm is divided into three stages: label selection, node shifting, and community merging. Let $n$ be the number of nodes in the network and $k$ the average degree. In the first stage, for the $n_1$ nodes with a degree greater than 1, the time complexity of calculating node similarity is at most $O(n_1 k^2 / 2)$. In the second stage, the time complexity of calculating the similarity between nodes and communities is $O(nk)$. During the node shifting process, the time complexity is $O(r n \log n)$, where $r$ is the number of iterations. In the community merging stage, the time complexity of calculating the community sizes is at most $O(n)$. Therefore, the total time complexity of the algorithm is $O(r n \log n)$.
Algorithm 3: Community merging
Mathematics 13 01300 i003

4. Experimental Materials

In this section, we introduce some experimental materials such as experiment datasets, benchmark algorithms, and evaluation metrics. The specified experimental platform is summarized in Table 1.

4.1. Datasets

In this article, we perform our experiments on the following 11 real-world datasets, which can be downloaded from http://snap.stanford.edu/data (accessed on 10 November 2024) and http://konect.cc/networks/ (accessed on 10 November 2024). The detailed information of the related datasets is as follows.
Karate: The social network records friendships among 34 members of a karate club from a US university.
Dolphins: The social network depicts frequent interactions among 62 bottlenose dolphins.
Polbooks: The network is composed of books on US politics from 2004.
Football: It is a social network created according to the American College Football League.
Power Grid: The network describes the topology of the Western States Power Grid in the United States.
CA-GRQC: The collaboration network maps scientific collaborations among authors of papers in the category of general relativity and quantum cosmology.
PGP: The communication network records the interactions between users of the Pretty Good Privacy (PGP) algorithm.
Brightkite: The network is built on the Brightkite social platform, which records the bidirectional friendship relationships between users.
DBLP: This is the network of collaborations among the authors of papers indexed on the DBLP website.
Amazon: The network is generated by browsing Amazon pages and collecting suggested goods on each page.
YouTube: The network is built based on the YouTube social platform, where users are connected through friendship relationships.
The basic information of the above datasets is listed in Table 2. Each row contains the name of the dataset, the number of nodes (n), the number of edges (m), and the number of communities (c), listed from left to right.

4.2. Benchmark Algorithms

To systematically evaluate the NSLS algorithm, this study selects seven classical community detection methods (Walktrap, CNM, LPA, Infomap, Louvain, Leiden, and FluidC) and five recent algorithms related to NSLS (FSLD, DS-LPA, LBLD, LMFLS, and NBCD) as comparative benchmarks. The NSLS method integrates multiple techniques, including modularity optimization, local similarity measurement, and label selection mechanisms, aiming to provide more precise community partitions. The selected benchmarks cover various categories of community detection algorithms, such as label propagation, label diffusion, modularity optimization, core node expansion, local similarity-based methods, random walk-based methods, and hybrid approaches. The main purpose is to comprehensively compare the performance of these community detection methods. A brief description of these benchmark algorithms is provided below.
Walktrap [26]: Walktrap is a community detection algorithm based on random walks. It defines the process of random walks to calculate the similarity between each pair of nodes. Based on this similarity, it constructs a distance matrix and uses hierarchical clustering to merge the nodes step by step until the optimal community partition is achieved.
CNM [14]: CNM is a hierarchical agglomerative method designed for homogeneous networks. It focuses on optimizing modularity through a greedy strategy while maintaining linear time complexity. It aims to maximize the modularity by grouping the nodes in the network to find highly clustered subgraphs.
Infomap [27]: The Infomap algorithm is a community detection algorithm based on information theory. It treats community detection as an encoding problem and obtains the optimal community by minimizing the description length during the random walk process of the nodes.
LPA [18]: The LPA is a community detection algorithm based on label propagation. It is renowned for its simplicity and efficiency, which makes it particularly suitable for large-scale networks. It propagates node information based on the network topology, takes the label with the most occurrences among its neighboring nodes as its own label, and then uses an asynchronous update strategy to make the algorithm converge.
Louvain [15]: This algorithm optimizes a quality function and consists of two phases: the local movement of nodes, where each node is moved to the community that maximizes the gain in the quality function, and aggregation of the network, where a new network is created based on the local partition, with each community becoming a node in the aggregate network. These two phases are repeated until the quality function can no longer be improved.
Leiden [16]: The Leiden algorithm is similar to the Louvain algorithm, as it also discovers communities by iteratively optimizing modularity. However, it introduces additional steps to ensure the connectivity and quality of communities. The algorithm involves three phases: the local movement of nodes, refinement of the partition, and network aggregation based on the refined partition, using the non-refined partition to create an initial partition for the aggregate network.
FluidC [25]: FluidC identifies the community structure of a graph by simulating the expansion and propagation of fluid within the network. It iteratively expands and merges fluid in the graph based on the topological structure and similarity score until a stable state is reached. The algorithm’s similarity score considers both the structural and feature similarities between nodes, allowing it to effectively identify communities within the network.
LBLD [22]: The LBLD algorithm is a fast community detection algorithm. The algorithm assigns nodes with their most similar neighbors to the same community. A new method for constructing rough cores is used to effectively detect some initial seed nodes, and balanced label diffusion is adopted to expand the communities. Finally, the communities are merged.
NBCD [31]: The NBCD algorithm is a community detection algorithm based on neighbor similarity. The precision of community detection is improved by introducing new similarity measures and a node-shifting mechanism. The algorithm is divided into two stages: nodes are clustered based on the similarity of their neighboring nodes, and then the nodes within the detected communities are reshuffled.
FSLD [24]: The FSLD algorithm uses a label diffusion method from marginal nodes by considering local criteria and similarities to discover communities. The algorithm consists of four steps: label diffusion, label update, diffusion of labels to first-degree nodes, and the merging of initial communities.
DS-LPA [23]: The DS-LPA algorithm improves the local density calculation method in the density peak clustering algorithm to identify community centers in the network. Then, the node update order is determined based on local similarity measures, and a stable label propagation strategy is employed to update the labels.
LMFLS [6]: LMFLS is a fast algorithm based on local multi-factor node scoring and label selection. It scores the nodes using multiple factors to obtain node rankings. Then, it employs two label selection strategies to choose optimal labels. Finally, community merging is performed.
Some community detection algorithms may require parameter adjustments before implementation. The Leiden, Louvain, LBLD, LSMD, Infomap, and CNM algorithms contain free parameters. For instance, Leiden and Louvain use randomness parameters to control the seed size and balance exploration and exploitation during the optimization process. In the FluidC and LPA algorithms, the maximum number of iterations needs to be determined, with FluidC set to 100 iterations and the LPA set to 12 iterations. The LBLD algorithm requires adjusting the number of iterations in the label selection stage. The Walktrap algorithm uses the default random walk length of t = 5, as stated in the paper. The parameters for NBCD and DS-LPA are shown in Table 3. For unstable algorithms, the average value of ten runs is adopted as the experimental result.
A detailed overview of the seven algorithms, Walktrap, CNM, LPA, Louvain, Leiden, Infomap, FluidC, as well as their Python implementations, can be found in [35], and on the website https://cdlib.readthedocs.io/en/latest/reference/cd_algorithms/node_clustering.html (accessed on 8 December 2024). The implementations of LBLD and FSLD are available at https://github.com/phdutm2009/community-detetcion-Bouyer (accessed on 4 February 2025). The Python implementation of LMFLS can be obtained from https://github.com/hamid-roghani/LMFLS (accessed on 18 January 2025). The Java implementation of DS-LPA is accessible at https://github.com/Lichuanwei1996/DS-LPA (accessed on 21 December 2024).
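As a rough, hedged illustration (not part of the paper's experimental pipeline, which uses the CDLIB wrappers above), several of the classical baselines can also be run directly through networkx, which ships its own implementations of CNM, LPA, and Louvain. The function names below are networkx's; the karate club graph serves only as a toy input:

```python
import networkx as nx
from networkx.algorithms import community as nx_comm

G = nx.karate_club_graph()

# CNM: greedy modularity maximization (Clauset-Newman-Moore).
cnm = list(nx_comm.greedy_modularity_communities(G))

# LPA: asynchronous label propagation (results vary between runs unless seeded).
lpa = list(nx_comm.asyn_lpa_communities(G, seed=42))

# Louvain (available in networkx >= 2.8).
louvain = nx_comm.louvain_communities(G, seed=42)

for name, part in [("CNM", cnm), ("LPA", lpa), ("Louvain", louvain)]:
    print(name, len(part), round(nx_comm.modularity(G, part), 4))
```

Note that these built-in versions may differ in defaults and implementation details from the CDLIB wrappers used in the experiments, so their run times and partitions are not directly comparable to the tables below.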

4.3. Evaluation Measures

To evaluate the performance of the proposed algorithm, four evaluation indices are adopted in our experiments: Normalized Mutual Information (NMI) [36], Adjusted Rand Index (ARI) [23], Adjusted Mutual Information (AMI) [37], and Modularity (Q) [20]. The NMI, ARI, and AMI are used to evaluate the accuracy of community detection when the ground truth communities of the network are available.
NMI: The NMI is a measure for evaluating the similarity between the communities detected by the algorithm and the ground-truth communities. For the ground-truth communities C_R and detected communities C_D, the NMI is defined as follows:
I(C_R, C_D) = \sum_{i=1}^{|C_R|} \sum_{j=1}^{|C_D|} \frac{n_{ij}}{N} \log \frac{N n_{ij}}{n_i m_j},
H(C_R) = -\sum_{i=1}^{|C_R|} \frac{n_i}{N} \log \frac{n_i}{N},
H(C_D) = -\sum_{j=1}^{|C_D|} \frac{m_j}{N} \log \frac{m_j}{N},
\mathrm{NMI}(C_R, C_D) = \frac{I(C_R, C_D)}{\sqrt{H(C_R) \cdot H(C_D)}},
where I(C_R, C_D) represents the mutual information, and H(C_R) and H(C_D) denote the entropy of C_R and C_D, respectively. |C_R| and |C_D| are the numbers of communities in C_R and C_D, respectively, n_{ij} is the number of nodes that belong to both the i-th community of C_R and the j-th community of C_D, n_i is the number of nodes in the i-th community of C_R, m_j is the number of nodes in the j-th community of C_D, and N is the total number of nodes in the network.
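For concreteness, the NMI defined above can be sketched in a few lines of pure Python. The function name and the handling of degenerate single-community partitions are our own illustrative choices, not part of the paper:

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two flat partitions,
    following the I / sqrt(H_R * H_D) definition above (natural log)."""
    N = len(labels_true)
    n_i = Counter(labels_true)                     # community sizes in C_R
    m_j = Counter(labels_pred)                     # community sizes in C_D
    n_ij = Counter(zip(labels_true, labels_pred))  # contingency counts

    I = sum(c / N * math.log(N * c / (n_i[a] * m_j[b]))
            for (a, b), c in n_ij.items())
    H_r = -sum(c / N * math.log(c / N) for c in n_i.values())
    H_d = -sum(c / N * math.log(c / N) for c in m_j.values())
    if H_r == 0 or H_d == 0:                       # degenerate one-community case
        return 1.0 if H_r == H_d else 0.0
    return I / math.sqrt(H_r * H_d)

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))   # identical partitions (NMI ≈ 1)
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # independent partitions (NMI ≈ 0)
```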
AMI: If C_R and C_D are the ground-truth and detected communities, then the AMI is defined as
\mathrm{AMI}(C_R, C_D) = \frac{2\left( I(C_R, C_D) - E(I(C_R, C_D)) \right)}{H(C_R) + H(C_D) - 2E(I(C_R, C_D))},
where I(C_R, C_D) represents the mutual information between C_R and C_D, and H(C_R) and H(C_D) denote the entropy of C_R and C_D. E(I(C_R, C_D)) is the expected mutual information and can be calculated as
E(I(C_R, C_D)) = \sum_{i=1}^{|C_R|} \sum_{j=1}^{|C_D|} \sum_{n_{ij} = \max(n_i + m_j - N,\, 0)}^{\min(n_i, m_j)} \frac{n_{ij}}{N} \log \frac{N \cdot n_{ij}}{n_i m_j} \cdot \Delta_{ij},
where \Delta_{ij} can be represented as
\Delta_{ij} = \frac{n_i!\, m_j!\, (N - n_i)!\, (N - m_j)!}{N!\, n_{ij}!\, (n_i - n_{ij})!\, (m_j - n_{ij})!\, (N - n_i - m_j + n_{ij})!}.
The definitions of the symbols in the AMI formula are consistent with those in the NMI formula.
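A minimal sketch of the AMI computation under the formulas above; log-factorials are evaluated via math.lgamma to avoid overflow, and the function names (emi, ami) are illustrative, not from the paper:

```python
import math
from collections import Counter

def emi(n_i, m_j, N):
    """Expected mutual information E(I) under the hypergeometric model
    described above; log(x!) computed as math.lgamma(x + 1)."""
    lf = lambda x: math.lgamma(x + 1)
    total = 0.0
    for a in n_i.values():
        for b in m_j.values():
            for nij in range(max(a + b - N, 0), min(a, b) + 1):
                if nij == 0:
                    continue  # term vanishes because of the nij/N factor
                # Delta_ij as a product/quotient of factorials, in log space.
                delta = math.exp(lf(a) + lf(b) + lf(N - a) + lf(N - b)
                                 - lf(N) - lf(nij) - lf(a - nij)
                                 - lf(b - nij) - lf(N - a - b + nij))
                total += nij / N * math.log(N * nij / (a * b)) * delta
    return total

def ami(labels_true, labels_pred):
    N = len(labels_true)
    n_i, m_j = Counter(labels_true), Counter(labels_pred)
    n_ij = Counter(zip(labels_true, labels_pred))
    I = sum(c / N * math.log(N * c / (n_i[a] * m_j[b]))
            for (a, b), c in n_ij.items())
    H_r = -sum(c / N * math.log(c / N) for c in n_i.values())
    H_d = -sum(c / N * math.log(c / N) for c in m_j.values())
    E = emi(n_i, m_j, N)
    denom = H_r + H_d - 2 * E
    return 1.0 if denom == 0 else 2 * (I - E) / denom
```

The triple sum makes this exact but slow for large contingency tables; production code would typically use an optimized implementation of the same formula.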
ARI: The ARI is another measure, like the NMI, used to evaluate the similarity between partitions. If C_R and C_D represent the real and detected communities, respectively, it is calculated as
\mathrm{ARI}(C_R, C_D) = \frac{2(N_{00} N_{11} - N_{01} N_{10})}{(N_{00} + N_{01})(N_{01} + N_{11}) + (N_{00} + N_{10})(N_{10} + N_{11})},
where N_{11} indicates the number of pairs of nodes in the same community in both C_R and C_D, N_{00} indicates the number of pairs of nodes in different communities in both C_R and C_D, N_{01} is the number of pairs of nodes that are in the same community in C_R but not in the same community in C_D, and N_{10} is the number of pairs of nodes that are not in the same community in C_R but are in the same community in C_D.
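The pair-counting ARI above can likewise be sketched directly from the four pair counts. This is an O(N^2) toy implementation for illustration, not the optimized contingency-table form:

```python
from itertools import combinations

def ari(labels_true, labels_pred):
    """Pair-counting ARI following the formula above: classify every node
    pair by whether it is grouped together in C_R and/or in C_D."""
    n11 = n00 = n01 = n10 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_r = labels_true[i] == labels_true[j]
        same_d = labels_pred[i] == labels_pred[j]
        if same_r and same_d:
            n11 += 1
        elif not same_r and not same_d:
            n00 += 1
        elif same_r:
            n01 += 1          # together in C_R, split in C_D
        else:
            n10 += 1          # split in C_R, together in C_D
    num = 2 * (n00 * n11 - n01 * n10)
    den = (n00 + n01) * (n01 + n11) + (n00 + n10) * (n10 + n11)
    return num / den if den else 1.0   # convention for degenerate partitions
```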
Modularity: When ground-truth communities are absent, the modularity measure serves as an important metric to evaluate the effectiveness of community detection algorithms in identifying dense and well-structured communities. Modularity is calculated using Equation (2).
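As a brief sketch, the modularity of a given partition can be evaluated with networkx's built-in modularity function; the two-faction ground truth of the karate club network is used here purely as an example, not as part of the paper's setup:

```python
import networkx as nx
from networkx.algorithms import community as nx_comm

G = nx.karate_club_graph()

# Ground-truth split of the karate club into its two factions,
# recovered from the "club" node attribute shipped with networkx.
clubs = {club for _, club in G.nodes(data="club")}
truth = [{v for v, c in G.nodes(data="club") if c == club} for club in clubs]

Q = nx_comm.modularity(G, truth)
print(round(Q, 4))  # exceeds the 0.3 threshold discussed above
```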

5. Results and Analysis

The results of the different algorithms on real-world datasets are evaluated using four metrics: the NMI, AMI, ARI, and modularity. For the DBLP, Amazon, and YouTube datasets, only the top 5000 high-quality communities are used to calculate the NMI, AMI, and ARI. The Walktrap and CNM algorithms could not be executed on the YouTube dataset due to insufficient memory and excessive running time, respectively. DS-LPA does not run on the large-scale datasets due to a Java out-of-memory error. Since the FluidC algorithm requires prior knowledge of the number of communities, we used the real number of communities given in Table 2 as its parameter.

5.1. NMI Analysis

Table 4 shows the performance of different algorithms evaluated with the NMI metric. To better illustrate the community detection results of the NSLS algorithm, Figure 4 shows a visualization of the communities detected in the Karate, Dolphins, Polbooks, and Football datasets, achieving NMI scores of 1.0000, 1.0000, 0.5979, and 0.8974, respectively. Compared with other algorithms, the NSLS algorithm exhibits significant superiority across multiple datasets. The NSLS algorithm successfully identified all communities with 100% accuracy in the Karate and Dolphins datasets. In the Polbooks dataset, NSLS achieved the highest NMI value. In the Football dataset, NSLS ranked third, only 0.0128 lower than the best result achieved by the LBLD algorithm.
In the DBLP dataset, the NMI value of NSLS was only 0.0019 lower than that of the best-performing NBCD. In the Amazon dataset, NSLS achieved the highest NMI value among all algorithms, accurately identifying the community structure. In addition, the results of NSLS significantly outperform those of NBCD, LBLD, FSLD, and LMFLS in the YouTube dataset. The FluidC algorithm achieved the highest value on the YouTube dataset because the number of communities was predefined during its execution. These results demonstrate the accuracy of NSLS on both small and large datasets.
As can be seen from Table 4, the proposed NSLS algorithm performs better in community detection than the classical algorithms. FluidC performs well on most datasets and can partition nodes more accurately when the number of communities is known. However, this also reveals its dependency on a parameter (the number of communities), which may limit its generalizability in practical applications. As an improved version of Louvain, the Leiden algorithm performs slightly better than Louvain on multiple datasets, but the improvement is limited. The CNM, Louvain, and Leiden algorithms neglect node importance and the similarity relationships between nodes, while Infomap and Walktrap use random strategies in some cases, resulting in lower accuracy. In contrast to the instability observed in the results of the Louvain, Leiden, LPA, and Infomap algorithms across multiple independent runs, the NSLS algorithm proposed in this paper maintains stable community detection results across different executions.
Overall, the NSLS algorithm outperforms several recently published community detection algorithms, including LBLD, NBCD, FSLD, LMFLS, and DS-LPA. Due to limitations of the experimental environment, DS-LPA could only be run on the smaller networks. Compared with the LBLD algorithm, NSLS achieved the better result on six of the seven datasets of different sizes; the same applies to the comparison with the NBCD algorithm. The NSLS algorithm utilizes more topological information when stabilizing the initial communities, and a more refined strategy, combining similarity measures with modularity optimization, is employed when adjusting node labels. At the same time, the algorithm takes the similarity between communities into account. The synergy of these three phases significantly enhances the overall performance. In general, NSLS not only offers superior accuracy but also demonstrates higher reliability when adapting to networks of various scales and complexities.

5.2. AMI Analysis

To evaluate the precision of each method, the AMI score is computed for the identified communities. Table 5 shows the AMI results obtained on the seven ground-truth datasets.
It is obvious that NSLS achieves the highest results in five out of the seven datasets. In the Karate and Dolphins datasets, the communities detected by the FluidC, LBLD, FSLD, LMFLS, DS-LPA, and NSLS algorithms exhibit perfect alignment with the ground-truth communities. For these two small-scale datasets, the community detection performance of these algorithms is excellent, as they successfully capture the true structure of the network. In the Polbooks, Amazon, and YouTube datasets, NSLS achieved the highest AMI value. In the DBLP dataset, the AMI value of the NSLS algorithm is only 0.0197 lower than that of the better-performing FSLD algorithm, achieving comparable results. This indicates that, for large-scale datasets, NSLS is more successful in identifying partitions that are more similar to the ground-truth communities. By comparing Table 4 and Table 5, the NMI and AMI exhibit overall consistency in evaluating community detection performance. In general, the NSLS algorithm is a relatively successful method for obtaining high accuracy.

5.3. ARI Analysis

Table 6 shows the results of the different algorithms evaluated with the ARI metric. In the Karate and Dolphins benchmark datasets, the NSLS algorithm precisely reproduces the inherent community structures, achieving theoretically optimal matching accuracy. In the Polbooks dataset, NSLS achieved the third-best ARI score. In the Football dataset, NSLS achieved an ARI of 0.8077, ranking in the upper middle range. In general, the NSLS algorithm exhibited outstanding community detection performance on small-scale datasets.
In the Amazon dataset, the NSLS algorithm performed the best, achieving an ARI score of 0.6865. In addition, it also achieved the highest NMI and AMI scores, demonstrating its strong ability to accurately identify community structures. In the DBLP dataset, although the Leiden algorithm achieved the highest ARI score, the NSLS algorithm had significantly higher values of the NMI and AMI. The Leiden and Louvain algorithms exhibit a tendency to over-merge communities. In contrast, recent algorithms perform better in the DBLP dataset, discovering communities that are closer to the ground truth. In the YouTube dataset, the NSLS algorithm achieved the third highest ARI score. When evaluated using multiple metrics, such as the NMI, AMI, and ARI, the NSLS algorithm consistently demonstrated outstanding community detection performance across multiple datasets. From Table 4, Table 5 and Table 6, it can be concluded that the proposed algorithm NSLS performs better than other community detection algorithms.

5.4. Modularity Analysis

Table 7 shows the modularity value (Q) and the corresponding number of detected communities (C) for each dataset. If the modularity value of a network exceeds 0.3, it is typically considered to have a significant community structure. If high values of modularity correspond to a good community division in a network, then one should be able to identify such good divisions by searching through the possible candidates for ones with high modularity [14].
In the Karate dataset, Leiden detected four communities and achieved the highest modularity value. According to Table 2, the actual number of communities in the Karate dataset is two, and the corresponding modularity value is 0.3715. This result is consistent with the modularity values of NBCD, LBLD, and NSLS. In the Dolphins dataset, Infomap detected five communities with a modularity of 0.5285. Based on the ground truth of the Dolphins dataset, the actual number of communities is two, and the real modularity is 0.3787, which is consistent with the results of LBLD and NSLS. NSLS accurately reveals the actual communities with 100% precision in both the Karate and Dolphins datasets, as illustrated in Figure 4a,c. In the Football dataset, Leiden achieved the highest modularity value of Q = 0.6046 with 10 communities. The real modularity value for the Football dataset is Q = 0.554. NSLS detected 14 communities with a modularity of Q = 0.5536, which is much closer to the real modularity value than the results of the LBLD, LPA, and CNM algorithms.
In the Polbooks dataset, the Leiden algorithm achieved the highest modularity value (Q = 0.5269), but this value does not fully reflect the true community division results. The real modularity is 0.4149. FSLD achieved the modularity (Q = 0.4437) closest to the actual value, followed by DS-LPA, LMFLS, NSLS, and LBLD. All five algorithms detected two communities. In terms of the NMI, ARI, and AMI metrics, LMFLS, NSLS, DS-LPA, and LBLD performed the best, with the same number of correctly identified community members. Specifically, the communities partitioned by NSLS and LBLD were exactly the same, while the results of LMFLS were also very close, differing by just one member. DS-LPA followed closely. FSLD correctly identified slightly fewer community members, but its community division results were relatively close to those of the other four algorithms. This observation is consistent with the evaluation results of the NMI values. When there are ground-truth communities, the NMI is an important metric for evaluating the quality of community detection, effectively measuring the consistency between algorithmic results and true partitioning.
In the DBLP dataset, the real number of communities is 13,477. Leiden achieved the highest modularity value, but the algorithm detected only 256 communities. Algorithms such as Walktrap, LPA, CNM, Louvain, Leiden, and Infomap show a significant discrepancy between the number of detected communities and the real number of communities. In the Amazon dataset, the real number of communities is 75,149. Leiden detected 364 communities but achieved a high modularity value, while Infomap detected only 13 communities. On large-scale datasets, NSLS achieved results that align closely with the actual community structures.
Louvain and Leiden are modularity-based community detection algorithms that achieve relatively high modularity values. In large-scale networks, these methods may produce fragmented communities or overlook meaningful ones, often favoring hub nodes with many connections, thereby forming larger artificial communities and neglecting smaller communities. The NSLS algorithm also considers modularity during the node-shifting phase, but combines similarity measures and adaptive similarity parameters for diverse network structures to adjust community divisions. The core goal of community detection is to reveal the true community structure rather than simply following higher modularity values. The results of NSLS and LBLD are closer to the real communities.

5.5. Running Time Analysis

Table 8 presents the running times of the various algorithms on real-world datasets. For the first four small-scale datasets, the running time of all algorithms is less than one second. Due to the extremely short running times, the performance differences between the algorithms are insignificant.
In the YouTube dataset, the CNM algorithm was terminated due to excessive execution time, while the Walktrap algorithm failed to complete due to insufficient memory. The LBLD and NBCD algorithms demonstrated high accuracy in community detection, but the execution time of NBCD was relatively long. The Leiden and Infomap implementations provided through the CDLIB library [35] are backed by C++ code; because of these differences in the underlying programming languages, such comparisons inherently involve some bias. Algorithms such as LPA and Leiden have short execution times. The NSLS algorithm achieves high accuracy in community detection while exhibiting a moderate execution time.

5.6. Parameter Analysis

Figure 5 shows the performance of NSLS for different α values on the datasets considered. We selected four parameter values for the experiments: α = 1.5, 2, 2.5, and 3, corresponding to neighbor similarity thresholds of 66.6%, 50%, 40%, and 33.3%, respectively. Different α values correspond to varying community densities. Specifically, lower α values (e.g., α = 1.5) tend to identify tightly connected communities, whereas higher α values (e.g., α = 3) are more suitable for detecting sparsely connected communities. Considering multiple evaluation metrics (e.g., the NMI, ARI, and AMI), α = 2.0 is the recommended value. However, it is still advisable to try all candidate parameter values on a new network.
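The role of α can be illustrated with a deliberately simplified stand-in for the first NSLS phase (this is not the actual algorithm): adjacent nodes are grouped whenever the Jaccard similarity of their closed neighborhoods reaches the 1/α threshold, so larger α values admit sparser preliminary communities:

```python
import networkx as nx

def preliminary_communities(G, alpha):
    """Illustrative stand-in for a neighbor-similarity grouping phase
    (not the actual NSLS procedure): keep an edge (u, v) when the
    Jaccard similarity of the closed neighborhoods of u and v reaches
    1/alpha, then take connected components as communities."""
    threshold = 1.0 / alpha
    H = nx.Graph()
    H.add_nodes_from(G)
    for u, v in G.edges():
        nu = set(G[u]) | {u}
        nv = set(G[v]) | {v}
        if len(nu & nv) / len(nu | nv) >= threshold:
            H.add_edge(u, v)
    return list(nx.connected_components(H))

G = nx.karate_club_graph()
for alpha in (1.5, 2.0, 2.5, 3.0):
    print(alpha, len(preliminary_communities(G, alpha)))
```

Because lowering the threshold only adds edges, the number of components is non-increasing in α, mirroring the tight-versus-sparse trade-off described above.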

6. Conclusions and Future Research

This paper proposes a community detection algorithm based on neighbor similarity and label selection. To form a better initial community structure, the algorithm introduces a new similarity measure and assigns each node and its most similar node to the same community. Additionally, the algorithm introduces a similarity parameter α to regulate the tightness of the nodes within a community. To form dense communities, the algorithm also adopts a fast merging strategy. Experiments on eleven real-world datasets of different scales verify the effectiveness of the proposed algorithm. They show that NSLS performs better on more datasets than the other community detection algorithms and exhibits consistent superiority across the four performance metrics.
Although the NSLS algorithm provides accurate results, parameter selection and modularity maximization inevitably increase the computation time. Future research will focus on developing parameter-free detection methods and optimizing label selection strategies to reduce computation time. In addition, we plan to study community detection on other types of networks, such as directed and weighted networks.

Author Contributions

Conceptualization, S.L. (Shihu Liu); writing—original draft, S.L. (Shihu Liu), H.C., S.L. (Shuang Li) and X.Y.; writing—review and editing, S.L. (Shihu Liu) and H.C.; software, S.L. (Shihu Liu) and X.Y.; formal analysis, H.C. and S.L. (Shihu Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Xingdian Talent Support Program for Young Talents (No. XDYC-QNRC-2022-0518) and the Open Project of Fujian Provincial Key Laboratory of Data-Intensive Computing (No. SJXY202401).

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

We are hugely grateful to the anonymous reviewers for their constructive comments with respect to the original manuscript.

Conflicts of Interest

The authors declare that there are no conflicts of interest associated with this publication, and there has been no significant financial support for this work that could have influenced its outcome.

References

  1. An, Q.; Wang, P.; Zeng, Y.; Dai, Y. Cooperative Social Network Community Partition: A Data Envelopment Analysis Approach. Comput. Ind. Eng. 2022, 172, 108658. [Google Scholar] [CrossRef]
  2. Doluca, O.; Oğuz, K. APAL: Adjacency Propagation Algorithm for Overlapping Community Detection in Biological Networks. Inf. Sci. 2021, 579, 574–590. [Google Scholar] [CrossRef]
  3. Liu, M.; Liu, J.; Dong, Y.; Mao, R.; Cambria, E. Interest-Driven Community Detection on Attributed Heterogeneous Information Networks. Inf. Fusion 2024, 111, 102525. [Google Scholar] [CrossRef]
  4. Guerrero, M.; Montoya, F.G.; Baños, R.; Alcayde, A.; Gil, C. Community Detection in National-Scale High Voltage Transmission Networks Using Genetic Algorithms. Adv. Eng. Inform. 2018, 38, 232–241. [Google Scholar] [CrossRef]
  5. Liu, Y.; Liu, S.; Yu, F.; Yang, X. Link Prediction Algorithm Based on the Initial Information Contribution of Nodes. Inf. Sci. 2022, 608, 1591–1616. [Google Scholar] [CrossRef]
  6. Li, H.; Nasab, S.S.; Roghani, H.; Roghani, P.; Gheisari, M.; Fernández-Campusano, C.; Abbasi, A.A.; Wu, Z. LMFLS: A New Fast Local Multi-Factor Node Scoring and Label Selection-Based Algorithm for Community Detection. Chaos Solitons Fractals 2024, 185, 115126. [Google Scholar] [CrossRef]
  7. Li, D.X.; Zhou, P.; Zhao, B.W.; Su, X.R.; Li, G.D.; Zhang, J.; Hu, P.W.; Hu, L. Biocaiv: An Integrative Webserver for Motif-Based Clustering Analysis and Interactive Visualization of Biological Networks. BMC Bioinform. 2023, 24, 451. [Google Scholar] [CrossRef]
  8. Fortunato, S. Community Detection in Graphs. Phys. Rep. 2010, 486, 75–174. [Google Scholar] [CrossRef]
  9. Feng, Y.; Chen, H.; Li, T.; Luo, C. A Novel Community Detection Method Based on Whale Optimization Algorithm with Evolutionary Population. Appl. Intell. 2020, 50, 2503–2522. [Google Scholar] [CrossRef]
  10. Javed, M.A.; Younis, M.S.; Latif, S.; Qadir, J.; Baig, A. Community Detection in Networks: A Multidisciplinary Review. J. Netw. Comput. Appl. 2018, 108, 87–111. [Google Scholar] [CrossRef]
  11. Das, S.; Biswas, A. Deployment of Information Diffusion for Community Detection in Online Social Networks: A Comprehensive Review. IEEE Trans. Comput. Soc. Syst. 2021, 8, 1083–1107. [Google Scholar] [CrossRef]
  12. Javadpour Boroujeni, R.; Soleimani, S. The Role of Influential Nodes and Their Influence Domain in Community Detection: An Approximate Method for Maximizing Modularity. Expert Syst. Appl. 2022, 202, 117452. [Google Scholar] [CrossRef]
  13. Jin, D.; Zhang, B.; Song, Y.; He, D.; Feng, Z.; Chen, S.; Li, W.; Musial, K. ModMRF: A Modularity-Based Markov Random Field Method for Community Detection. Neurocomputing 2020, 405, 218–228. [Google Scholar] [CrossRef]
  14. Clauset, A.; Newman, M.E.J.; Moore, C. Finding Community Structure in Very Large Networks. Phys. Rev. E 2004, 70, 066111. [Google Scholar] [CrossRef]
  15. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef]
  16. Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef]
  17. Yuan, Q.; Liu, B. Community Detection via an Efficient Nonconvex Optimization Approach Based on Modularity. Comput. Stat. Data Anal. 2021, 157, 107163. [Google Scholar] [CrossRef]
  18. Raghavan, U.N.; Albert, R.; Kumara, S. Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks. Phys. Rev. E 2007, 76, 036106. [Google Scholar] [CrossRef] [PubMed]
  19. Laassem, B.; Idarrou, A.; Boujlaleb, L.; Iggane, M. Label Propagation Algorithm for Community Detection Based on Coulomb’s Law. Phys. A Stat. Mech. Its Appl. 2022, 593, 126881. [Google Scholar] [CrossRef]
  20. Zhang, W.; Shang, R.; Jiao, L. Large-Scale Community Detection Based on Core Node and Layer-by-Layer Label Propagation. Inf. Sci. 2023, 632, 1–18. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Liu, Y.; Li, Q.; Jin, R.; Wen, C. LILPA: A Label Importance Based Label Propagation Algorithm for Community Detection with Application to Core Drug Discovery. Neurocomputing 2020, 413, 107–133. [Google Scholar] [CrossRef]
  22. Roghani, H.; Bouyer, A. A Fast Local Balanced Label Diffusion Algorithm for Community Detection in Social Networks. IEEE Trans. Knowl. Data Eng. 2023, 35, 5472–5484. [Google Scholar] [CrossRef]
  23. Li, C.; Chen, H.; Li, T.; Yang, X. A Stable Community Detection Approach for Complex Network Based on Density Peak Clustering and Label Propagation. Appl. Intell. 2022, 52, 1188–1208. [Google Scholar] [CrossRef]
  24. Bouyer, A.; Azad, K.; Rouhi, A. A Fast Community Detection Algorithm Using a Local and Multi-Level Label Diffusion Method in Social Networks. Int. J. Gen. Syst. 2022, 51, 352–385. [Google Scholar] [CrossRef]
  25. Parés, F.; Gasulla, D.G.; Vilalta, A.; Moreno, J.; Ayguadé, E.; Labarta, J.; Cortés, U.; Suzumura, T. Fluid Communities: A Competitive, Scalable and Diverse Community Detection Algorithm. In Complex Networks & Their Applications VI; Springer: Cham, Switzerland, 2018; pp. 229–240. [Google Scholar]
  26. Pons, P.; Latapy, M. Computing Communities in Large Networks Using Random Walks. In Proceedings of the Computer and Information Sciences—ISCIS 2005, Istanbul, Turkey, 26–28 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 284–293. [Google Scholar]
  27. Rosvall, M.; Bergstrom, C.T. Maps of Random Walks on Complex Networks Reveal Community Structure. Proc. Natl. Acad. Sci. USA 2008, 105, 1118–1123. [Google Scholar] [CrossRef] [PubMed]
  28. Toth, C.; Helic, D.; Geiger, B.C. Synwalk: Community Detection via Random Walk Modelling. Data Min. Knowl. Discov. 2022, 36, 739–780. [Google Scholar] [CrossRef]
  29. Li, B.; Wang, M.; Hopcroft, J.E.; He, K. HoSIM: Higher-order Structural Importance Based Method for Multiple Local Community Detection. Knowl.-Based Syst. 2022, 256, 109853. [Google Scholar] [CrossRef]
  30. Saoud, B.; Moussaoui, A. Node Similarity and Modularity for Finding Communities in Networks. Phys. A Stat. Mech. Its Appl. 2018, 492, 1958–1966. [Google Scholar] [CrossRef]
  31. Sahu, S.; Rani, T.S. A Neighbour-Similarity Based Community Discovery Algorithm. Expert Syst. Appl. 2022, 206, 117822. [Google Scholar] [CrossRef]
  32. Yang, B.; Huang, T.; Li, X. A Time-Series Approach to Measuring Node Similarity in Networks and Its Application to Community Detection. Phys. Lett. A 2019, 383, 125870. [Google Scholar] [CrossRef]
  33. Tunali, V. Large-Scale Network Community Detection Using Similarity-Guided Merge and Refinement. IEEE Access 2021, 9, 78538–78552. [Google Scholar] [CrossRef]
  34. Zhang, W.; Zhao, K.; Shang, R. Evolutionary Multi-Objective Attribute Community Detection Based on Similarity Fusion Strategy with Central Nodes. Appl. Soft Comput. 2024, 150, 111101. [Google Scholar] [CrossRef]
  35. Rossetti, G.; Milli, L.; Cazabet, R. CDLIB: A Python Library to Extract, Compare and Evaluate Communities from Complex Networks. Appl. Netw. Sci. 2019, 4, 52. [Google Scholar] [CrossRef]
  36. Chattopadhyay, S.; Basu, T.; Das, A.K.; Ghosh, K.; Murthy, L.C.A. Towards Effective Discovery of Natural Communities in Complex Networks and Implications in E-Commerce. Electron. Commer. Res. 2021, 21, 917–954. [Google Scholar] [CrossRef]
  37. Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
Figure 1. The visualization results of the label selection steps. (a) Selecting the label of the most similar node. (b) Updating the labels of nodes with d(v_i) > 1. (c) Updating the labels of nodes with d(v_i) = 1.
Figure 2. The visualization results before and after the node shifting process. (a) The modularity of the communities after the label selection phase is 0.2948. (b) The modularity of the communities after the node shifting phase is 0.3744.
Figure 3. The visualization results of community merging steps: (a) selecting candidate communities, and (b) merging candidate communities.
Figure 4. Visualization of the detected communities by the NSLS algorithm in Karate, Polbooks, Dolphins, and Football networks.
Figure 5. Performance of the NSLS algorithm on selected datasets for different α values.
Table 1. Experimental platform.
Parameter | Value
RAM | 8 GB
Speed | 2.50 GHz
Programming | Python 3.9.19
CPU | Intel (R) Core (TM) i5-1155G7
GPU | Intel (R) Iris (R) Xe Graphics
System | Windows 11 system with 4 cores
Table 2. The basic information of real-world datasets.
Dataset | n | m | c
Karate | 34 | 78 | 2
Dolphins | 62 | 159 | 2
Polbooks | 105 | 441 | 13
Football | 115 | 613 | 12
Power Grid | 4941 | 6594 | —
CA-GRQC | 5242 | 14,496 | —
PGP | 10,680 | 24,316 | —
Brightkite | 58,228 | 214,078 | —
DBLP | 317,080 | 1,049,866 | 13,477
Amazon | 334,863 | 925,872 | 75,149
YouTube | 1,134,890 | 2,987,624 | 8385
Table 3. Parameters used for different algorithms.
Datasets | NBCD | DS-LPA | NSLS
Karate | 2.0 | 2.0 | 3.0
Dolphins | 2.5 | 2.5 | 3.0
Polbooks | 1.5 | 2.5 | 2.0
Football | 3.0 | 1.6 | 2.0
DBLP | 2.0 | — | 2.0
Amazon | 2.0 | — | 2.0
YouTube | 1.5 | — | 3.0
Table 4. NMI results of different algorithms on real-world datasets.
Datasets | Walktrap | CNM | Infomap | LPA | Louvain | Leiden | FluidC
Karate | 0.5042 | 0.6925 | 0.6995 | 0.2075 | 0.7071 | 0.6873 | 1.0000
Dolphins | 0.5376 | 0.5571 | 0.5565 | 0.5324 | 0.4838 | 0.5629 | 1.0000
Polbooks | 0.5427 | 0.5308 | 0.5537 | 0.4383 | 0.5369 | 0.5737 | 0.4956
Football | 0.8874 | 0.6977 | 0.8516 | 0.8547 | 0.8850 | 0.8903 | 0.8924
DBLP | 0.5841 | 0.4938 | 0.6159 | 0.7529 | 0.5373 | 0.5455 | 0.7453
Amazon | 0.6255 | 0.8865 | 0.4128 | 0.9538 | 0.8431 | 0.8635 | 0.8982
YouTube | — | — | 0.5103 | 0.4310 | 0.4825 | 0.5033 | 0.7355
Datasets | LBLD | NBCD | FSLD | LMFLS | DS-LPA | NSLS
Karate | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Dolphins | 1.0000 | 0.6536 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Polbooks | 0.5979 | 0.5607 | 0.5185 | 0.5979 | 0.5979 | 0.5979
Football | 0.9102 | 0.9095 | 0.8948 | 0.8846 | 0.8873 | 0.8974
DBLP | 0.7396 | 0.7569 | 0.7459 | 0.7482 | — | 0.7550
Amazon | 0.9676 | 0.9596 | 0.9604 | 0.9619 | — | 0.9680
YouTube | 0.5526 | 0.5708 | 0.4387 | 0.6222 | — | 0.6593
Table 5. AMI results of different algorithms on real-world datasets.
Datasets | Walktrap | CNM | Infomap | LPA | Louvain | Leiden | FluidC
Karate | 0.4727 | 0.6808 | 0.6874 | 0.1516 | 0.6912 | 0.6712 | 1.0000
Dolphins | 0.5145 | 0.5434 | 0.5414 | 0.5094 | 0.4664 | 0.5477 | 1.0000
Polbooks | 0.5284 | 0.5159 | 0.5401 | 0.4056 | 0.5186 | 0.5611 | 0.4854
Football | 0.8561 | 0.6501 | 0.8148 | 0.8191 | 0.8531 | 0.8600 | 0.8569
DBLP | 0.1177 | 0.3689 | 0.4682 | 0.4561 | 0.4206 | 0.4319 | 0.4719
Amazon | 0.0163 | 0.8041 | 0.3631 | 0.8807 | 0.7539 | 0.7769 | 0.6663
YouTube | — | — | 0.3724 | 0.2200 | 0.3724 | 0.3935 | 0.2985
Datasets | LBLD | NBCD | FSLD | LMFLS | DS-LPA | NSLS
Karate | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Dolphins | 1.0000 | 0.6442 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Polbooks | 0.5932 | 0.5460 | 0.5127 | 0.5932 | 0.5932 | 0.5932
Football | 0.8781 | 0.8820 | 0.8659 | 0.8434 | 0.8431 | 0.8582
DBLP | 0.4988 | 0.4919 | 0.5109 | 0.4686 | — | 0.4912
Amazon | 0.9219 | 0.8981 | 0.9005 | 0.9050 | — | 0.9230
YouTube | 0.3457 | 0.3779 | 0.2483 | 0.3246 | — | 0.4185
Table 6. ARI results of different algorithms on real-world datasets.
Datasets | Walktrap | CNM | Infomap | LPA | Louvain | Leiden | FluidC
Karate | 0.3331 | 0.6803 | 0.7022 | 0.0879 | 0.5998 | 0.5414 | 1.0000
Dolphins | 0.3135 | 0.4659 | 0.3792 | 0.2966 | 0.3464 | 0.3964 | 1.0000
Polbooks | 0.6534 | 0.6379 | 0.6649 | 0.3845 | 0.6463 | 0.6752 | 0.5226
Football | 0.8154 | 0.4741 | 0.6803 | 0.6205 | 0.8035 | 0.8069 | 0.8098
DBLP | 0.0059 | 0.0453 | 0.0884 | 0.0117 | 0.1038 | 0.1130 | 0.0110
Amazon | 0.0011 | 0.4085 | 0.0122 | 0.5803 | 0.3317 | 0.3794 | 0.2307
YouTube | — | — | 0.0447 | 0.0080 | 0.0422 | 0.0501 | 0.0538
Datasets | LBLD | NBCD | FSLD | LMFLS | DS-LPA | NSLS
Karate | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Dolphins | 1.0000 | 0.4689 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Polbooks | 0.6671 | 0.6797 | 0.5989 | 0.6671 | 0.6671 | 0.6671
Football | 0.8514 | 0.8465 | 0.7889 | 0.7657 | 0.8132 | 0.8077
DBLP | 0.0218 | 0.0141 | 0.0213 | 0.0136 | — | 0.0156
Amazon | 0.6843 | 0.6190 | 0.6283 | 0.6302 | — | 0.6865
YouTube | 0.0260 | 0.0427 | 0.0049 | 0.0291 | — | 0.0494
Table 7. Experimental results on real-world datasets based on modularity.
In each table below, C is the number of detected communities and Q the modularity of the resulting partition.

| Datasets | Walktrap C | Walktrap Q | CNM C | CNM Q | LPA C | LPA Q | Louvain C | Louvain Q |
|----------|------------|------------|-------|-------|-------|-------|-----------|-----------|
| Karate | 5 | 0.3532 | 3 | 0.3807 | 3 | 0.1121 | 4 | 0.4151 |
| Dolphins | 7 | 0.4991 | 4 | 0.4955 | 7 | 0.4974 | 5 | 0.5188 |
| Polbooks | 4 | 0.5070 | 4 | 0.5020 | 8 | 0.4818 | 5 | 0.5268 |
| Football | 10 | 0.6029 | 6 | 0.5497 | 9 | 0.5521 | 10 | 0.6043 |
| Power Grid | 364 | 0.8310 | 43 | 0.9346 | 1406 | 0.6060 | 41 | 0.9357 |
| CA-GRQC | 816 | 0.7825 | 417 | 0.8196 | 991 | 0.7554 | 392 | 0.8611 |
| PGP | 1574 | 0.7894 | 193 | 0.8494 | 2059 | 0.7381 | 96 | 0.8832 |
| Brightkite | 6892 | 0.5721 | 1418 | 0.6146 | 5644 | 0.5790 | 722 | 0.6893 |
| DBLP | 30,425 | 0.6719 | 3047 | 0.7323 | 43,190 | 0.6494 | 206 | 0.8225 |
| Amazon | 14,904 | 0.8494 | 1346 | 0.8705 | 37,426 | 0.7263 | 242 | 0.9263 |
| YouTube | N/A | N/A | N/A | N/A | 113,811 | 0.3287 | 5791 | 0.7231 |

| Datasets | Leiden C | Leiden Q | Infomap C | Infomap Q | NBCD C | NBCD Q | LBLD C | LBLD Q |
|----------|----------|----------|-----------|-----------|--------|--------|--------|--------|
| Karate | 4 | 0.4198 | 3 | 0.4020 | 2 | 0.3715 | 2 | 0.3715 |
| Dolphins | 5 | 0.5277 | 5 | 0.5285 | 4 | 0.5265 | 2 | 0.3787 |
| Polbooks | 4 | 0.5269 | 4 | 0.5262 | 4 | 0.4774 | 2 | 0.4569 |
| Football | 10 | 0.6046 | 9 | 0.5864 | 11 | 0.6032 | 13 | 0.5807 |
| Power Grid | 42 | 0.9380 | 5 | 0.7637 | 739 | 0.7296 | 341 | 0.8219 |
| CA-GRQC | 395 | 0.8657 | 377 | 0.8374 | 747 | 0.7915 | 561 | 0.7918 |
| PGP | 96 | 0.8851 | 62 | 0.8577 | 1052 | 0.8062 | 358 | 0.8159 |
| Brightkite | 690 | 0.6962 | 564 | 0.3826 | 2395 | 0.6423 | 1251 | 0.5785 |
| DBLP | 256 | 0.8304 | 543 | 0.8173 | 23,704 | 0.6853 | 18,394 | 0.6948 |
| Amazon | 364 | 0.9319 | 13 | 0.7860 | 24,443 | 0.7690 | 15,501 | 0.8040 |
| YouTube | 5625 | 0.7315 | 926 | 0.6944 | 53,278 | 0.6487 | 25,169 | 0.5206 |

| Datasets | FSLD C | FSLD Q | LMFLS C | LMFLS Q | DS-LPA C | DS-LPA Q | NSLS C | NSLS Q |
|----------|--------|--------|---------|---------|----------|----------|--------|--------|
| Karate | 2 | 0.3715 | 2 | 0.3715 | 2 | 0.3715 | 2 | 0.3715 |
| Dolphins | 2 | 0.3787 | 2 | 0.3787 | 2 | 0.3787 | 2 | 0.3787 |
| Polbooks | 2 | 0.4437 | 2 | 0.4569 | 2 | 0.4569 | 2 | 0.4569 |
| Football | 10 | 0.5908 | 13 | 0.5592 | 12 | 0.5653 | 14 | 0.5536 |
| Power Grid | 466 | 0.7822 | 342 | 0.8166 | N/A | N/A | 361 | 0.8178 |
| CA-GRQC | 419 | 0.7360 | 593 | 0.8006 | N/A | N/A | 557 | 0.8097 |
| PGP | 573 | 0.7373 | 559 | 0.8024 | N/A | N/A | 318 | 0.8407 |
| Brightkite | 1079 | 0.5620 | 1644 | 0.6224 | N/A | N/A | 2031 | 0.6312 |
| DBLP | 15,487 | 0.6690 | 23,018 | 0.6975 | N/A | N/A | 26,990 | 0.6405 |
| Amazon | 28,060 | 0.7264 | 19,908 | 0.7814 | N/A | N/A | 13,008 | 0.8124 |
| YouTube | 13,942 | 0.4348 | 17,906 | 0.4284 | N/A | N/A | 38,227 | 0.6120 |
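The Q values in Table 7 follow Newman's modularity, Q = Σ_c [ l_c/m − (d_c/2m)² ], where m is the total edge count, l_c the number of intra-community edges, and d_c the total degree of community c. A minimal sketch for a simple undirected graph (again illustrative, not the authors' code):

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of a simple undirected graph.

    edges: iterable of (u, v) pairs; communities: mapping node -> community id.
    """
    m = 0
    intra = {}       # l_c: edges with both endpoints inside community c
    degree_sum = {}  # d_c: summed degree of the nodes in community c
    for u, v in edges:
        m += 1
        cu, cv = communities[u], communities[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())
```

For example, two triangles joined by a single bridge edge, split into their two natural communities, give Q = 6/7 − 1/2 ≈ 0.357, while merging everything into one community gives Q = 0; this is the quantity the "fast-merge" phase of NSLS is implicitly trading off against community count C.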
Table 8. The running time (s) of different algorithms on real-world datasets.
| Datasets | Walktrap | CNM | Infomap | LPA | Louvain | Leiden |
|----------|----------|-----|---------|-----|---------|--------|
| Karate | 0.0018 | 0.0061 | 0.0020 | 0.0023 | 0.0020 | 0.0010 |
| Dolphins | 0.0108 | 0.0123 | 0.0043 | 0.0031 | 0.0040 | 0.0010 |
| Polbooks | 0.0120 | 0.0413 | 0.0066 | 0.0056 | 0.0070 | 0.0011 |
| Football | 0.0180 | 0.0511 | 0.0115 | 0.0042 | 0.0060 | 0.0010 |
| DBLP | 2676.1003 | 64,618.8974 | 30.9595 | 140.7686 | 152.9744 | 5.7168 |
| Amazon | 1857.8097 | 19,759.3763 | 42.9629 | 30.7070 | 100.3629 | 5.4184 |
| YouTube | N/A | N/A | 126.0924 | 174.9820 | 418.1358 | 20.1317 |

| Datasets | FluidC | LBLD | NBCD | FSLD | LMFLS | NSLS |
|----------|--------|------|------|------|-------|------|
| Karate | 0.0013 | 0.0050 | 0.0040 | 0.0013 | 0.0020 | 0.0209 |
| Dolphins | 0.0020 | 0.0021 | 0.0225 | 0.0030 | 0.0080 | 0.0249 |
| Polbooks | 0.0050 | 0.0060 | 0.0990 | 0.0060 | 0.0229 | 0.0891 |
| Football | 0.0116 | 0.0060 | 0.0277 | 0.0059 | 0.0189 | 0.0854 |
| DBLP | 66.2236 | 19.4091 | 1135.8148 | 2815.6651 | 122.8283 | 462.2714 |
| Amazon | 46.0640 | 16.1649 | 921.0104 | 3160.4955 | 58.3681 | 458.5498 |
| YouTube | 293.9875 | 351.2428 | 10,723.3501 | 20,820.3919 | 434.3692 | 966.3772 |
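Wall-clock figures like those in Table 8 are typically collected with a monotonic timer wrapped around each detection run. The source does not describe the benchmarking harness, so the following is only a generic sketch of such a measurement; `detect` stands for any community-detection callable.

```python
import time

def time_algorithm(detect, graph, repeats=3):
    """Best-of-`repeats` wall-clock time (seconds) of one detect(graph) call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution timer
        detect(graph)                 # run the community detection once
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of several repeats damps scheduler noise on the sub-millisecond entries (Karate, Dolphins); for the hour-scale CNM runs a single measurement is obviously sufficient.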

Share and Cite

MDPI and ACS Style

Liu, S.; Chen, H.; Li, S.; Yang, X. NSLS: A Neighbor Similarity and Label Selection-Based Algorithm for Community Detection. Mathematics 2025, 13, 1300. https://doi.org/10.3390/math13081300

