Article

An Edge-Based Approach to Partitioning and Overlapping Graph Clustering with User-Specified Density

by Rohi Tariq 1,*, Kittichai Lavangnananda 1,*, Pascal Bouvry 2 and Pornchai Mongkolnam 1
1 School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand
2 Department of Computer Science, Faculty of Science, Technology and Medicine, University of Luxembourg, L-4365 Esch-sur-Alzette, Luxembourg
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 380; https://doi.org/10.3390/app14010380
Submission received: 9 November 2023 / Revised: 13 December 2023 / Accepted: 28 December 2023 / Published: 31 December 2023

Abstract:
Graph clustering has received considerable attention recently, and its applications are numerous, ranging from the detection of social communities to the clustering of computer networks. It is classified as an NP-class problem, and several algorithms have been proposed with specific objectives. There also exist various quality metrics for evaluating them. Having clusters with the required density can be beneficial because it permits the effective deployment of resources. This study proposes an approach to partitioning and overlapping clustering of undirected unweighted graphs, allowing users to specify the required density of resultant clusters. This required density is achieved by means of ‘Relative Density’. The proposed algorithm adopts an edge-based approach, commencing with the determination of the edge degree for each edge. The main clustering process is then initiated by an edge with an average degree. A cluster is expanded by considering adjacent edges that can be included while monitoring the relative density of the cluster. Eight empirical networks with diverse characteristics are used to validate the proposed algorithm for both partitioning and overlapping clustering. Their results are assessed using an appropriate metric known as the mean relative density deviation coefficient (MRDDC). This is the first work that attempts to carry out partitioning and overlapping graph clustering that allows user-specified density.

1. Introduction

The rapid increase in the volume of data and the popularity of artificial intelligence (AI) over the last few decades have led to the emergence of the popular fields known as data science and knowledge discovery. These fields provide techniques aimed at uncovering hidden structures and useful patterns within the available information. In its simplest terms, clustering is a technique that separates information into distinct groups in which intra-similarities are maximized while inter-similarities are minimized [1]. When information can be represented as a graph or a network, then graph clustering, a sub-field of clustering, has a direct application by analyzing the connectivity and information within nodes and edges to discover intrinsic knowledge hidden within the underlying structure [2].
In almost all domains of science, graph clustering serves a variety of objectives, such as extracting similar interests of communities within a large social network [2,3]. In transportation networks, it is applied in optimizing routes and traffic flow [4]. Bioinformatics is another field where the application of graph clustering is prolific; identifying functional modules of genes or proteins is one such example [5]. Other application areas include image segmentation [6] and personalized recommendations in recommendation systems [7,8].
Graph clustering is an NP-class problem [9] that can be classified into three distinct categories: partitioning, overlapping, and hierarchical clustering [2]. Section 3 elaborates on this further. In addition, numerous graph clustering algorithms have been proposed according to their diverse objectives. A few well-known techniques include k-means [10], DBSCAN [11], Girvan–Newman [12], Louvain [13], and numerous others. These algorithms can be parameter-free (i.e., preset to default) or parameterizable (i.e., allowing users to specify some parametric values). These parameters span several facets: the number of clusters, the minimum number of neighboring points, the resolution parameter, the regularization parameter, and convergence criteria [14].
Complex algorithms, such as random-walk-based and spectral clustering, require some expertise in graph theory and linear algebra to tune their parameters to suitable values. Tuning often involves numerous iterations and experimentation, with each adjustment requiring a fresh assessment. This can render these algorithms computationally intensive, especially for extensive and intricate networks. Taking full advantage of parameter tuning therefore often comes at the cost of additional algorithmic complexity. Apart from computational complexity, inappropriately adjusted parameters may also result in unbalanced clusters. In the context of density and connectivity, unbalanced clustering can lead to inefficient resource utilization and a rise in maintenance costs [15].
As stated above, graph clustering has applications in almost every domain, and, like many other areas of science, graph clustering algorithms were introduced in response to the objective(s) of their applications. Nevertheless, several requirements in graph clustering have yet to be fulfilled. One of these is the capability to cluster a graph in such a way that the resultant clusters have a comparable or identical density. This feature is useful in the deployment and administration of resources in a variety of applications, for example, when a large community is represented by a complex graph. Subcommunities can be viewed as clusters within the whole community; if they are of equal or comparable density, resources can be distributed more efficiently. A network that represents communication among individuals through LINE messaging is another good example. It would be beneficial if the network could be partitioned into smaller units with similar density, allowing for an efficient deployment of resources to monitor these communities, as each subcommunity is likely to require similar resources.
During the recent COVID-19 outbreak, a graph might represent the interactions between infected subjects in a certain area. By splitting them into clusters with equal or similar densities, it would be possible to evaluate each cluster relative to the entire network. This can also identify clusters with fewer patients (i.e., nodes) but a higher number of interactions (i.e., edges) that may require special attention. In this case, overlapping clustering offers a unique characteristic: members in overlapping areas may be considered superspreaders. These members can also be viewed as sources of further transmission, as they interact with more than one cluster. The above scenarios can be directly transformed into clustering a well-connected graph so that the densities of the resultant clusters are identical or very similar.
To date, no algorithm has been developed that is capable of partitioning and overlapping graph clustering while permitting users to specify any kind of density. Therefore, this research represents the first attempt at partitioning and overlapping clustering of an undirected, unweighted graph whose required density can be specified by users. The density adopted in this work is the ‘relative density’, which is explained in Section 3. An assumption is made that clusters within the graph are not easily identifiable or evident; were they evident, it could be argued that applying a graph clustering algorithm would not be essential.
The organization of the manuscript is as follows. Section 2 reviews the existing graph clustering algorithms in terms of their parameters. This is followed by Section 3, where the fundamental concepts adopted in this work are described. The key section of this study is Section 4, which describes in full the edge-based partitioning and overlapping clustering approach used in this work. Section 5 reveals the results of eight real-world networks. The results of two particular networks are graphically illustrated in this section. Section 6 describes the evaluation of results using the existing metrics and points out their inadequacy for this work. The section also reveals the evaluation using the appropriate metric. Relevant and significant findings are discussed in Section 7. The work is concluded in Section 8, and Section 9 discusses possible aspects for future work.

2. Literature

This section provides an overview of widely used graph clustering algorithms, encompassing both partitioning and overlapping clustering approaches. In [16], an extensive exploration is conducted on the categorization, underlying principles, and parameterization techniques of partitioning clustering algorithms. This exploration provides valuable insights into the intricate characteristics of these algorithms. Overall, this section contributes to a fundamental understanding of the current state-of-the-art graph clustering algorithms.
In the realm of graph clustering, a multitude of state-of-the-art algorithms have emerged to address the identification of cohesive communities or subgraphs within complex networks or graphs. Among them, the Girvan–Newman (GN) algorithm [12] utilizes edge-betweenness to find clusters in complex networks. By creating a hierarchical dendrogram, the algorithm enables modularity analysis to identify the optimal level cut. The Louvain algorithm [13], known for its multilevel clustering approach, optimizes modularity through local node movements and network aggregation. However, the Louvain algorithm has limitations in identifying small communities. To address the issue of poorly connected clusters, the Leiden algorithm [17] enhances the Louvain method by employing fast local moves and random neighbor moves. This enhancement ensures the formation of internally connected clusters and leads to a stable partition. Fluid communities is another example of a clustering algorithm that partitions a graph into communities using virtual fluid and has found applications in various domains [18]. It also provides an adjustable parameter for cluster customization. The spectral method leverages the power of eigenvalues derived from special matrices to improve the performance of cluster determination. It outperforms conventional methods such as k-means and enables a more robust analysis of complex networks [19,20]. Users may specify the desired number of clusters (k), or the algorithm may arbitrarily determine this value. The expectation maximization (EM) algorithm [21] embraces a model-based approach by estimating essential parameters through probability distributions. Furthermore, it provides users with the ability to define the optimal number of clusters (k) and adjust additional parameters, including maximal iterations and convergence criteria.
The fast greedy algorithm is a notable example of a scalable and efficient agglomerative hierarchical clustering method [22,23]. It operates without the need for user-specified parameters and takes advantage of optimized data structures and strategic optimization shortcuts. The InfoMap algorithm adopts information theory principles [24,25] by employing random walks to analyze information flow and partition the network into modules. It also optimizes the transmission rate together with minimum description length to quantify clustering quality. Similar to InfoMap, the label propagation algorithm (LPA) is another approach that avoids parameter specification by iteratively assigning node labels based on neighbors to achieve clustering [26]. To address the shortcomings of randomness issues, extended versions of LPAs, such as MemLPA and LPA-MNI, have been proposed to enhance the performance [27,28]. The Walktrap algorithm adopts a hierarchical approach through random walks and merging neighboring clusters. It incorporates a parameter ‘t’ to facilitate practical applications in its attempt to optimize the performance [29].
The spinglass approach [30,31], inspired by statistical mechanics, utilizes the ‘Potts model’ to connect nodes sharing the same spin state and employs advanced simulated annealing techniques for resource-efficient cluster identification. The leading eigenvector capitalizes on modularity maximization by computing the leading eigenvector of the modularity matrix in partitioning the graph [32]. Although the algorithm itself is parameter-free, graph clustering via modularity maximization may necessitate the adjustment of parameters like resolution, stopping criteria, and the modularity matrix computation method. The graph-based k-means algorithm is another parameter-free algorithm whereby minimum spanning trees are utilized to estimate k and centroid placement. Its proven high performance lies in handling noisy data [33].
Another noteworthy aspect of graph clustering is Bayesian techniques; Bayesian nonparametric methods in particular have emerged as influential tools. These approaches provide a probabilistic framework for modeling complex relationships within graphs. Unlike traditional clustering methods that require a predetermined cluster count or some other parameters, Bayesian nonparametric techniques adaptively infer the number of clusters from the data. This flexibility is particularly advantageous in scenarios where the true number of clusters is unknown or varies. This makes Bayesian nonparametric methods well-suited for diverse real-world applications [34,35]. Furthermore, these techniques inherently capture uncertainty in the clustering process, providing a probabilistic assignment of nodes to clusters. Models such as the Dirichlet Process Mixture Model (DPMM) [36,37,38,39], Bayesian Community Detection (BCD) [40], Infinite Relational Model (IRM) [39,40], and the Chinese Restaurant Process (CRP) fall under the Bayesian nonparametric umbrella and have found applications in diverse fields, including social network analysis, bioinformatics, and community detection in complex networks [41]. Despite the unprecedented flexibility and applicability of these techniques in diverse real-world scenarios, the effective clustering of graphs remains a significant area of exploration and refinement from several perspectives. The inability to specify the required density of the resultant clusters is one avenue for research and improvement in this field.
In the context of real networks characterized by intricately overlapping community structures, traditional methods prove insufficient for accurately identifying overlapped communities [42]. The existing approaches for identifying overlapped clusters can be broadly categorized into two main types. The first type includes node-based clustering algorithms (node clustering) that directly partition the network’s nodes into distinct clusters by leveraging the available structural information. The second type comprises edge-based algorithms, in which edges typically have unique identities, and the edges attached to a single node may belong to multiple link communities. These algorithms are referred to as ‘edge’ or ‘link-based’ clustering methods. However, it is worth noting that the majority of well-established algorithms predominantly fall within the first category. One notable approach that stands out is the Clique Percolation Method (CPM), which draws inspiration from clique percolation theory [43,44]. It provides insights into the shared connections among nodes and enables the identification of multiple communities that overlap in terms of their node memberships. The Lancichinetti–Fortunato–Radicchi Measures (LFM) algorithm [45] detects overlapping and hierarchical community structures in complex networks by combining modularity and fitness optimization. This algorithm offers valuable insights into the complex nature of community structures. However, the LFM algorithm has limitations in handling large-scale networks due to its computational complexity. It also relies on predefined resolution parameters, which has an impact on its scalability and adaptability in real-world applications. The greedy clique expansion (GCE) algorithm, which was proposed in [46], is a robust method for finding highly overlapping community structures in social networks through the systematic expansion of maximal cliques.
GCE uncovers the intricate patterns of node overlaps and provides valuable insights into the complex relationships within communities. However, the algorithm’s scalability may be a limitation in large-scale network applications due to its reliance on a greedy expansion strategy. Nonetheless, GCE presents a valuable approach for capturing overlapping communities in social networks. The Overlapping Cluster Generator (OCG) method is a prominent approach for identifying overlap areas in protein–protein interaction (PPI) networks [47]. It employs parameters such as clique size, similarity threshold, and minimum cluster size to control the clustering process. By leveraging these parameters, OCG captures overlapping structures and identifies densely connected subgroups. However, OCG may be sensitive to parameter settings; hence, it requires tuning for optimal results. Another prominent algorithm, the improved Bacteria Foraging Optimization (IBFO) mechanism, incorporates intuitionistic fuzzy sets to identify overlapping modules in protein–protein interaction networks. This method effectively tackles the challenge of overlapping functional modules in PPI networks and automatically determines the number of clusters. However, the choice of parameter values, including the thresholds for membership degree and indeterminacy degree, as well as the maximum iterations of the IBFO algorithm, may significantly impact the resulting clustering outcomes [48]. The ETCC-SA (Edge-Triangle Clique Cover using Simulated Annealing) algorithm [49] is proposed for detecting overlapping communities in social networks. It explores different feature assignments for vertices and iteratively adjusts them based on a scoring function. A cluster is formed as the algorithm gradually converges toward an approximate solution.
As the algorithm uses heuristic approximations, it does not guarantee optimality and is sensitive to parameter choices like the number of rounds and the mixing parameter. The Clique Overlap Newman–Girvan algorithm (CONGA) [50] is an extended version of the Girvan–Newman (GN) algorithm that specifically focuses on detecting overlapping clusters in networks. It enhances the GN algorithm by incorporating a mechanism to identify and represent overlapping communities. By considering the betweenness of edges and tracking overlapping vertices during the merging process, CONGA effectively detects and characterizes overlapping clusters in networks. Nevertheless, CONGA suffers from the resolution limit problem, making it unsuitable to detect smaller clusters within larger ones in overlapped communities.
The Clique-Based Louvain Algorithm (CBLA) is an approach for detecting overlapping communities in graphs that combines concepts from the Clique Percolation Method (CPM) and the Louvain algorithm. It identifies cliques using CPM and then employs the Louvain algorithm to classify unclassified nodes. However, the Louvain algorithm has a resolution limit, posing difficulties in detecting communities at different scales and identifying smaller communities within larger ones [17,51]. Furthermore, the CBLA’s performance may be sensitive to parameter choices, including the clique size [52,53]. A normalized cut-based scalable version of spectral clustering is proposed [54] to solve the problem of finding overlapping communities in large, complex networks. By utilizing a hierarchical framework and incorporating node weights, it addresses the challenge of identifying nodes belonging to multiple communities. However, the selection of the tolerance parameter within the hierarchical framework may significantly impact the clustering results. The algorithm can control the level of overlap between communities by adjusting the value of the threshold. A higher value allows for a greater degree of overlap, while a lower value results in less overlap. The selection of the optimal value requires careful consideration to achieve accurate and meaningful clustering results. In [55], a novel approach called LOCD (Local Overlapping Community Detection) is introduced to identify overlapping nodes within community structures. LOCD identifies structural centers in a network based on their higher density and relative distance from nodes with higher densities. It expands communities around these centers using structural similarity. While LOCD effectively uncovers overlapping communities, the choice of cutoff distance and the use of structural similarity as a weighting measure can influence clustering results. 
Therefore, careful adjustment of these parameters is crucial to ensure accurate and reliable clustering outcomes. Moreover, k-Neighborhood Attribute Structural (kNAS) [56] is another technique for detecting overlapping clusters in a graph by considering structural and attribute similarities among vertices. It groups objects into clusters based on their k-nearest neighbors and similar attributes, enabling the identification of overlapping communities. The key parameter is the distance value (k). However, kNAS strictly enforces both structural and attribute similarity and may therefore fail to provide a balanced trade-off between the two. Link communities (LC) is a link-based method that detects communities as interrelated groups using hierarchical clustering and partition density optimization. It combines link-based analysis, hierarchical clustering, and partition density to identify overlapping communities and hierarchical organization in networks. The algorithm’s results can be influenced by parameters like link weight threshold, community size threshold, resolution parameter, link strength measure, initialization strategy, convergence criteria, and network structure. Therefore, experimenting with different settings is crucial for achieving desired outcomes [57]. The extended link clustering (ELC) method improves the existing link clustering (LC) method by using extended link similarity (ELS) to create a denser transform matrix and employing EQ (an extended measure of quality of modularity) for optimal partitioning. This approach generates more realistic communities by incorporating more link information and using EQ to determine the cut level in the hierarchical clustering dendrogram. ELC achieves higher EQ values, making it more realistic compared to the original LC method and the classical CPM method [58].
Network Decomposition for Overlapping Community Detection (NDOCD) is a novel approach for overlapping community detection using network decomposition, node clustering, and seed expansion. It determines overlapping communities by applying a seed expansion strategy based on joint strength and membership degree thresholds. The approach showed superior performance compared to traditional link clustering and node clustering algorithms. However, its higher sensitivity is attributed to the adjustment of parameters like joint strength and membership degree [59].
The study in [60] introduces a Scalable Integrated Model using Game Theory (SIMGT), a novel approach for detecting overlapping communities in complex networks. Inspired by social identity theory, it models community formation as a noncooperative game, utilizing second-order proximity and stochastic gradient ascent. SIMGT proves effective, surpassing BigClam in AvgF1 across datasets and demonstrating scalability. Key parameters, including thresholds, proximity metrics (e.g., the Jaccard coefficient), and weighting strategies, require careful optimization, which can be problematic for users seeking optimal results.
The GREESE algorithm [61] is introduced for overlapping community detection, employing a coupled-seeds expansion strategy, a fitness function, and a merging phase that regulates overlapping rates. Outperforming seven state-of-the-art algorithms in accuracy and execution time on diverse networks, GREESE offers simplicity, effectiveness, and promising results. The algorithm’s parameters include Common Neighbor Similarity (C), Fitness Function (F), Expansion and Merging Phase Thresholds, as well as Network Parameters in LFR Benchmarks and Real-World Network Parameters. Optimal parameter selection may impact algorithm performance, requiring tuning for different networks.
At present, several popular libraries such as Scikit-learn, NetworkX, igraph, and MATLAB include the implementation of many algorithms, such as fluid communities, Louvain, and the spectral method described above, and they can be easily used. To date, there has not been a graph clustering algorithm that allows users to specify some form of density in the resulting clusters. The only attempt to satisfy this requirement in partitioning clustering is in [16]. It merits special attention, and the comparison with the approach in this work is discussed in Section 4.
The work described in this manuscript is the first attempt to integrate overlapping clustering with user-specified density. The significance of this approach lies in advancing overlapping clustering. It also presents an understanding of complex relationships within networks where entities often belong to multiple clusters simultaneously, especially in the fields of social network analysis, biology, and information retrieval. The study not only presents another novel approach but also has various applications highlighted in the Introduction section. It also identifies a research gap in overlapping clustering algorithms and extends the understanding in this field.

3. Preliminary

This section provides a foundational overview of key concepts and notations crucial to understanding the proposed methodology in the subsequent sections. It encompasses fundamental topics such as graph structures, various clustering types, the metric “edge degree”, and the concept of relative density as a user-specified density metric. Furthermore, it introduces the equations associated with these concepts and provides a visual representation through a practical example. The section serves as a solid basis for comprehending the subsequent sections.

3.1. Graph and Graph Clustering Type

A graph is a collection of nodes interconnected by edges, denoted as G ( V , E ) , where V represents the set of nodes and E represents the set of edges. The type of a graph can be determined based on the attributes of its edges (i.e., direction and weight). Specifically, if edges are assigned weights and exhibit directional properties that indicate the flow of relationships, the graph is classified as a weighted and directed graph. As stated in the Introduction, graph clustering has a wide range of applications. The scope of this work lies in problems that can be represented by undirected unweighted graphs.
Graph clustering has received much attention recently, offering insights into various aspects of network structures. As stated in the Introduction, graph clustering encompasses three fundamental types: partitioning, overlapping, and hierarchical clustering. Partitioning clustering involves assigning nodes to a disjoint set of clusters, where the intracluster similarity of each cluster is maximized and the intercluster similarity is minimized. Overlapping clustering shares the same objectives of maximizing intracluster similarity and minimizing intercluster similarity, but nodes are allowed to belong to multiple clusters. In hierarchical clustering, clusters are formed hierarchically, where partitioning or overlapping clusters may be allowed within a particular layer; clusters then appear in a tree-like structure [62].
In simple terms, consider a set of clusters denoted as C = {c_1, c_2, c_3, …, c_k} for a given set of nodes (V). Here, the cardinality of C represents the number of clusters, indicated in this study by k. A simplified notation is used for clusters, represented as c_i, where i denotes a unique cluster number. The formal mathematical descriptions of the clustering types are provided below:
  • Partitioning Clustering: Each node v ∈ V is uniquely assigned to a cluster c_i. In other words, every node is exclusively a member of one cluster.
  • Overlapping Clustering: Each node v ∈ V is associated with at least one cluster c_i (i ∈ {1, 2, …, k}), indicating that each node may belong to one or more clusters.
  • Partitioning Hierarchical Clustering: At each level of the hierarchy, each node v is exclusively assigned to a specific cluster c_i.
  • Overlapping Hierarchical Clustering: At the same level of the hierarchy, a node v can be assigned to one or more clusters c_i, allowing it to be a member of multiple clusters simultaneously.
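The distinction between the first two definitions amounts to a simple membership count over the node set. The following is an illustrative sketch in Python (the node set and clusters are hypothetical examples, not taken from the paper); the hierarchical variants would apply the same check at each level of the hierarchy:

```python
# Illustrative sketch: classify a set of clusters over a node set V as a
# partitioning or an overlapping clustering. Example data is hypothetical.

def clustering_type(V, clusters):
    """Return 'partitioning', 'overlapping', or 'invalid'.

    Partitioning: every node belongs to exactly one cluster.
    Overlapping:  every node belongs to at least one cluster,
                  and some node belongs to more than one.
    """
    counts = {v: 0 for v in V}
    for c in clusters:
        for v in c:
            counts[v] += 1
    if any(n == 0 for n in counts.values()):
        return "invalid"          # some node is unassigned
    if all(n == 1 for n in counts.values()):
        return "partitioning"
    return "overlapping"

V = {0, 1, 2, 3, 4, 5}
print(clustering_type(V, [{0, 1, 2}, {3, 4, 5}]))      # partitioning
print(clustering_type(V, [{0, 1, 2, 3}, {3, 4, 5}]))   # node 3 is shared -> overlapping
```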
Figure 1 depicts a graphical representation of these three clustering types, highlighting the potential diversity in clustering outcomes within a simple arbitrary graph. It is commonly known that graph clustering is an NP-class problem.

3.2. Edge Degree Description

Apart from the most well-known concept of ‘node degree’, there exists another useful concept known as ‘edge degree’. The interconnection among nodes is measured by the edge degree, which provides valuable insights into the connectivity patterns. The edge degree can be determined by considering either one endpoint (a node) or both endpoints (nodes) of an edge, thereby capturing different aspects of the graph’s structure. Various descriptions of edge degrees have been suggested in the graph community [63].
For instance, the simplest form of edge degree assigns a uniform value of 1 to each edge regardless of the degrees of the connected endpoints. This straightforward definition provides a fundamental understanding of edge connectivity. The more complex form of edge degree is the product of the power functions of the degrees of both endpoints. This complex measure reveals deeper insights into the influence of highly connected nodes on edge strength and network robustness. Figure 2a–e illustrate five possible definitions of an edge degree, serving as visual aids to depict several distinct definitions of edge degree. These illustrations, supported by rigorous mathematical concepts, help grasp the diversity of edge degree interpretations and their implications for network analysis.
In this work, the edge degree, denoted as d(e_i), is taken to mean the summation of edges on both sides minus one (i.e., avoiding duplicated counting, as in Figure 2d). This definition is the most appropriate for this study, as the density of clusters relative to the whole graph is of interest. Therefore, the summation of adjacent edges from the nodes at each end of the edge is directly relevant to the determination of relative density. The edge degree d(e_i) is determined using Equation (1):
\[ d(e_i) = \left( d(e_1) + d(e_2) \right) - 1 \tag{1} \]
where d(e_1) and d(e_2) represent the degrees of the first and second endpoints of the edge, respectively. Note that the sum of both degrees is reduced by 1 to avoid duplicated counting of the edge itself. The average edge degree in a graph G is represented as Avg_deg(G, E) and is determined by Equation (2) below. The edge whose degree equals (or is closest to) Avg_deg(G, E) is called the average-degree edge within the edge set E and is denoted as Avg_deg(G, e_i).
$$ \mathrm{Avg}_{\deg}(G, E) = \frac{\sum_{i=1}^{|E|} d(e_i)}{|E|} \tag{2} $$

where |E| is the total number of edges in the graph.
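To make Equations (1) and (2) concrete, the following sketch computes the edge degrees and the average edge degree of a small example graph; the function names and the example graph are ours, for illustration only.

```python
from collections import Counter

def edge_degrees(edges):
    """Equation (1): d(e) = d(v1) + d(v2) - 1 for each edge e = (v1, v2)."""
    node_deg = Counter()
    for u, v in edges:
        node_deg[u] += 1
        node_deg[v] += 1
    return {(u, v): node_deg[u] + node_deg[v] - 1 for u, v in edges}

def average_edge_degree(edges):
    """Equation (2): the mean of d(e) over all |E| edges."""
    degs = edge_degrees(edges)
    return sum(degs.values()) / len(degs)

# A 4-cycle with one chord; the chord (0, 2) joins two degree-3 nodes,
# so d((0, 2)) = 3 + 3 - 1 = 5.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
```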

3.3. Relative Density of a Cluster

Cluster connectivity primarily depends on two fundamental components: internal connectivity and external connectivity. This study refers to these as the ‘internal degree’ and ‘external degree’ of a cluster, respectively. Several approaches construct more independent metrics by combining density and cut features through these two quantities. The internal degree of a cluster c_i, denoted int_deg(c_i), accounts for the edges lying within the cluster and is obtained by summing the internal degrees of its member vertices, as in Equation (3). Conversely, the external degree, denoted ext_deg(c_i), accounts for the edges linking vertices of the cluster to vertices outside it and is obtained by summing the external degrees of its member vertices, as in Equation (4):

$$ \operatorname{int}_{\deg}(c_i) = \sum_{v \in c_i} \operatorname{int}_{\deg}(v) \tag{3} $$

$$ \operatorname{ext}_{\deg}(c_i) = \sum_{v \in c_i} \operatorname{ext}_{\deg}(v) \tag{4} $$
Relative density [2,64], denoted δr(c_i) for a cluster c_i, is a metric that quantifies the ratio of the cluster’s internal edges to its total degree, where the total degree is the sum of the cluster’s internal and external degrees. The relative density of a cluster and the average relative density Avg(δr)(C) over a set C of k clusters are defined in Equations (5) and (6), respectively.

$$ \delta_r(c_i) = \frac{\sum_{v \in c_i} \operatorname{int}_{\deg}(v)}{\sum_{v \in c_i} \operatorname{int}_{\deg}(v) + \sum_{v \in c_i} \operatorname{ext}_{\deg}(v)} \tag{5} $$

$$ \operatorname{Avg}(\delta_r)(C) = \frac{1}{k} \sum_{i=1}^{k} \delta_r(c_i) \tag{6} $$
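Using the edge-count reading of these quantities illustrated in Section 3.4 (internal degree as the number of intra-cluster edges, external degree as the number of boundary edges), Equation (5) can be sketched as follows; the names and the example graph are illustrative.

```python
def relative_density(edges, cluster):
    """Equation (5): internal edges over internal plus external (boundary) edges."""
    internal = sum(1 for u, v in edges if u in cluster and v in cluster)
    external = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    return internal / (internal + external)

# A triangle {0, 1, 2} with a two-edge tail: the triangle has 3 internal
# edges and 1 boundary edge, so delta_r = 3 / (3 + 1) = 0.75.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
```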

3.4. Illustration of Edge Degree, the Edge with Average Edge Degree, and Relative Density of a Cluster

To visually illustrate the concepts of edge degree, the edge with an average degree, and relative density of a cluster, an arbitrary simple graph comprising 15 nodes and 25 edges is presented in Figure 3.
In Figure 3a, the edge degree of each edge is determined using Equation (1), i.e., the sum of the degrees of the two endpoints minus 1. Figure 3b highlights the edge with an average edge degree (i.e., edge (0, 7)), identified using Equation (2). In Figure 3c, internal and external edges are indicated by dotted lines and solid lines, respectively. Cluster 1 (in yellow) has an internal degree int_deg(c_1) of 8 and an external degree ext_deg(c_1) of 10, determined by Equations (3) and (4), respectively; this yields a relative density δr(c_1) of 0.44, calculated using Equation (5). The second cluster (in red) has an internal degree int_deg(c_2) of 7 and an external degree ext_deg(c_2) of 8; consequently, its relative density δr(c_2) is 0.46. Further details on the partitioning and overlapping clustering processes are covered in Section 4.

3.5. User-Specified Density

In this study, a relative density target for cluster density is introduced as the user-specified density, denoted U(δr). This user-defined density spans from 0 to 1, where a value of 1 corresponds to a cluster with no external edges (ultimately, the entire graph without any inherent substructure), whereas a value of 0 corresponds to a single isolated node. Relative density offers valuable insight into the level of connectivity within the network, providing a comprehensive measure of cluster density that takes into account both internal and external edges.

4. Proposed Partitioning and Overlapping Clustering Method

The clustering method introduced in this study is designed for partitioning and overlapping clustering of connected, undirected, and unweighted graphs. A distinguishing feature of this method is its flexibility in enabling users to define the desired density for the resulting clusters, ranging from 0 to 1. This section provides a detailed description of the proposed clustering methods, including the pseudocode implementation and an illustration of an example on a simple random graph.

4.1. Partitioning Clustering

Partitioning clustering, also known as hard or disjoint clustering, is a fundamental technique in which each node is exclusively assigned to a single cluster. The proposed method employs a biphasic strategy to determine partitioning clusters based on user-specified density. The first stage of the clustering process involves calculating edge degrees, while the subsequent step focuses on expanding and identifying clusters. This two-phase approach facilitates a systematic analysis of the graph’s connectivity, enabling the effective grouping of nodes into distinct clusters based on their interconnections and density.

4.1.1. Determination of Edge Degrees

The initial phase entails computing the degree of each edge d(e_i) within a given graph G using Equation (1). Figure 3a demonstrates this process on a randomly generated graph. Equation (2) then gives the average edge degree Avg_deg(G, E) based on the computed edge degrees within the graph G. Figure 3b provides an illustrative example, where the average edge degree is 6.76.
After the degrees of all edges have been determined, the edge with the average degree, Avg_deg(G, e_i), is identified; it initializes the cluster c_i and serves as the starting edge for further expansion. It is worth mentioning that, in cases where no edge with the exact degree Avg_deg(G, E) exists, the algorithm selects the edge with the closest value to initiate the clustering process. Conversely, when multiple edges possess the same average degree, the algorithm randomly selects one of them. The resulting edge Avg_deg(G, e_i), characterized by its (near-)average degree, serves as the starting edge for expansion and functions as the input for the second phase.
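The starting-edge rule just described (take the edge whose degree is closest to the average) might be sketched as follows; here ties are broken deterministically rather than randomly, and all names are illustrative.

```python
def starting_edge(edge_deg):
    """Pick the edge whose degree is nearest the average edge degree.

    edge_deg: {edge: d(e)}. Ties are broken by insertion order here,
    whereas the algorithm in the paper breaks them randomly.
    """
    avg = sum(edge_deg.values()) / len(edge_deg)
    return min(edge_deg, key=lambda e: abs(edge_deg[e] - avg))

# Illustrative input: average degree is 16/3 ~= 5.33, so the edge with
# degree 5 is the closest to the average.
example = {(0, 1): 4, (1, 2): 7, (2, 3): 5}
```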

4.1.2. Cluster Expansion

The selection of the initial edge is a pivotal aspect of the proposed clustering method’s cluster expansion phase. Various strategies for choosing this initial edge were investigated, including prioritizing edges with the lowest, highest, or average total degree, as well as random selection. Notably, when prioritizing edges based on their highest or lowest edge degrees, experimental results revealed significant variations in cluster sizes and their topological characteristics, even across different relative density ( δ r ) settings. This intriguing phenomenon is further discussed in the ’Future Work’ section.
This study introduces a clustering approach that initiates the clustering process from an edge with an average degree, Avg_deg(G, e_i), and expands the cluster from there. This method consistently generates clusters with comparable sizes and topologies across various experiments. As the cluster expands, its relative density δr(c_i) is continuously monitored using Equation (5) to ensure adherence to the density specified by the user, U(δr). At each expansion step, the algorithm chooses adjacent edges based on their average degree, expanding the cluster step by step by including the chosen adjacent edge. Consequently, both the cluster’s structure and its relative density are progressively refined.
During each iteration, the most recently added edge is accepted if it contributes to bringing the relative density, δ r ( c i ) , of the current cluster ( c i ) closer to the specified target density, U ( δ r ) . If a cluster expansion does not precisely reach the desired value of U ( δ r ) , the algorithm compares the existing and expanded density deviation and selects the cluster with the relative density closest to U ( δ r ) .
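The acceptance rule above can be sketched as a single predicate; this is an illustrative reading of the rule, with our own helper names, using the edge-count form of relative density.

```python
def deviation(delta, target):
    """Absolute deviation of a cluster's relative density from U(delta_r)."""
    return abs(delta - target)

def accept_edge(edges, cluster, edge, target):
    """Acceptance rule: keep the candidate edge only if it brings the
    cluster's relative density strictly closer to the specified target."""
    def delta_r(c):
        internal = sum(1 for u, v in edges if u in c and v in c)
        external = sum(1 for u, v in edges if (u in c) != (v in c))
        return internal / (internal + external)

    candidate = cluster | set(edge)
    return deviation(delta_r(candidate), target) < deviation(delta_r(cluster), target)

# A triangle {0, 1, 2} with a pendant edge (2, 3), used below to probe the rule.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
```

With a target of 0.75, adding edge (1, 2) to the seed {0, 1} completes the triangle (δr = 0.75) and is accepted, while adding the pendant edge (2, 3) afterwards would push δr to 1.0 and is rejected.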

4.2. Overlapping Clustering

Overlapping clustering, also referred to as ‘soft clustering’, allows a node to be assigned to one or more clusters. In this framework, cluster initiation and cluster expansion proceed as in partitioning clustering. The distinctive feature of overlapping clustering is that any node may belong to more than one cluster; the overlapping region must therefore exhibit shared edges with nodes belonging to more than one cluster, and these common edges enable a single node to be included in multiple clusters. It is important to emphasize that the initial edge of each (or a new) cluster is considered only once to avoid repetition of the same process. A detailed step-by-step example on an arbitrary random graph is presented in the next section, and Algorithm 1 describes the pseudocode for both partitioning and overlapping clustering based on edge degree.
In Algorithm 1, careful consideration of edges is pivotal to ensure a unique starting edge for each cluster. The algorithm meticulously marks the starting edge as considered, preventing its further use and averting potential indefinite states. Once an edge is considered, it is removed from the set of edges (E), maintaining the integrity of the clustering process. Furthermore, during the expansion phase, adherence to partitioning clustering principles prevents the repetition of edges and nodes within or outside a cluster. Any added edge is marked as considered, updating the edge set (E). In scenarios of overlapping clustering, the flexibility to reconsider edges for expansion is maintained. The clustering process concludes when the edges list becomes empty, offering a nuanced approach that balances the rigidity of partitioning clustering with the adaptability needed for overlapping clustering scenarios.
As both types of clustering in this study assume that the network is static, the computation time is assumed not to be a constraint. This assumption may not hold if the network is dynamic, for example, communication among moving vehicles in a mobile ad hoc network. The worst case for clustering is a fully connected network, where an edge exists between every pair of nodes (i.e., there are n(n − 1)/2 edges, where n is the number of nodes) and the density is 1 (i.e., the entire graph is one large cluster). Clustering of any kind in a fully connected network is meaningless, as the network is just one big cluster; however, it provides a quick computational analysis for this work. The computation time is dominated by the expansion of clusters. In this worst-case scenario, the network comprises |E| edges, and the determination of a single cluster has a time complexity of O(|E|). Therefore, considering every node in the network gives a time complexity of O(|E|²). While this magnitude seems quite high, it seldom arises in practice for the reasons stated above; applying the clustering approach in this work to most real-world networks demands much less computation time.
Algorithm 1: Pseudocode for Partitioning and Overlapping Clustering.
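As a complement to Algorithm 1, the following is a condensed, illustrative reconstruction of the partitioning mode from the description in Sections 4.1 and 4.2; it is a sketch under stated assumptions, not the authors’ exact pseudocode. Ties for the seed edge are broken deterministically, and relative density is computed from edge counts.

```python
def partition(edges, target):
    """Greedy partitioning sketch: repeatedly seed a cluster at the edge
    nearest the current average edge degree, expand it toward the target
    relative density, then discard all edges touching the finished cluster."""
    remaining = list(edges)
    clusters = []
    while remaining:
        # Phase 1: edge degrees (Equation (1)) and the average-degree seed edge.
        node_deg = {}
        for u, v in remaining:
            node_deg[u] = node_deg.get(u, 0) + 1
            node_deg[v] = node_deg.get(v, 0) + 1
        edge_deg = {(u, v): node_deg[u] + node_deg[v] - 1 for u, v in remaining}
        avg = sum(edge_deg.values()) / len(edge_deg)
        seed = min(edge_deg, key=lambda e: abs(edge_deg[e] - avg))
        cluster = set(seed)

        def delta_r(c):
            internal = sum(1 for u, v in remaining if u in c and v in c)
            external = sum(1 for u, v in remaining if (u in c) != (v in c))
            return internal / (internal + external)

        # Phase 2: accept an adjacent edge only if it reduces |delta_r - target|.
        improved = True
        while improved:
            improved = False
            for u, v in remaining:
                if (u in cluster) != (v in cluster):
                    candidate = cluster | {u, v}
                    if abs(delta_r(candidate) - target) < abs(delta_r(cluster) - target):
                        cluster = candidate
                        improved = True
        clusters.append(cluster)
        # Partitioning: edges touching the cluster are marked as considered.
        remaining = [(u, v) for u, v in remaining
                     if u not in cluster and v not in cluster]
    return clusters
```

On two triangles joined by a bridge, with a target of 0.75, this sketch recovers the two triangles as clusters and consumes the bridge edge in the process.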

4.3. Illustration of Partitioning Clustering Method on a Random Graph

Figure 4 illustrates the partitioning clustering method of Algorithm 1 on an arbitrary random graph. It is assumed that the edge degrees of all edges have been determined and that the parameter U(δr) is set to 0.4.
The clustering process initiates, as demonstrated in Figure 4a, with the identification and selection of an edge ( 0 , 7 ) , which is characterized by an average degree. This edge is then designated as the initial cluster. Subsequently, the cluster expands by adding the most suitable adjacent edges, identified based on their respective average degrees. This sequential expansion is marked as ’exp_i’. The process follows Algorithm 1. Initially, the first cluster denoted as ( c 1 ) is formed and composed of nodes [0, 2, 3, 6, 7, 11, 13] (highlighted in yellow lines). It possesses a ( δ r ) value of 0.42, representing the closest approximation to the desired U ( δ r ) value of 0.4.
Subsequently, the same procedure is repeated to form a new cluster in the remaining edges, commencing with the edge ( 5 , 10 ) . In a manner similar to the determination of ( c 1 ), the second cluster ( c 2 ) is identified, consisting of 7 nodes [4, 5, 8, 9, 10, 12, 14] (highlighted in red lines). This cluster exhibits a precise δ r density value of 0.4, meeting the specified criterion of 0.4.
The selection of this particular graph serves to illustrate that, in such clustering scenarios, certain edges or nodes may remain unclustered. In this specific example, node ‘1’ is not incorporated into either cluster, as including the edge (0, 10) or (0, 13) would cause δr(c_1) to deviate from the desired value of 0.4.
Figure 4. (a) First partitioning cluster δ r ( c 1 ) of 0.42. (b) Second partitioning cluster δ r ( c 2 ) of 0.4.

4.4. Illustration of Overlapping Clustering on a Random Graph

In the context of overlapping clustering, the clustering process adopts a strategy similar to that of partitioning clustering. The initial cluster in both clustering types is consistently the same (i.e., c_1, represented by yellow lines), as shown in Figure 4a and Figure 5a. In subsequent clusters, variations may arise, allowing for the inclusion of overlapping nodes. Following the methodology described in Algorithm 1, the second cluster (c_2, indicated by red lines) is identified with a δr of 0.39, the closest attainable to the desired value of U(δr). Notably, four nodes [2, 6, 7, 13] of c_2 (i.e., overlapping nodes) also belong to c_1. These overlapping nodes are visually represented by encircling them with double lines (orange and red circles), indicating their presence in two clusters (c_1 and c_2).
After the determination of c_2, the process proceeds to identify a new cluster. In Figure 5c, the third cluster c_3 (represented by green lines) emerges, initiated from the edge (4, 9), and expands as in the previous examples. Cluster c_3 exhibits a δr density value of exactly 0.4. During the formation of this third cluster, node ‘0’ comes to overlap two clusters, c_1 and c_3. It is worth emphasizing that, in overlapping clustering, the resulting clusters tend to be larger, and fewer nodes tend to remain unclustered than in partitioning clustering.
Figure 5. (a) Overlapping cluster ( c 1 ) with δ r ( c 1 ) = 0.42. (b) Overlapping cluster ( c 2 ) with δ r ( c 2 ) = 0.39 and overlapped nodes [2, 6, 7, 13]. (c) Overlapping cluster ( c 3 ) with δ r ( c 3 ) = 0.4 with overlapped node [0].

5. Clustering Results

This section presents the results of the proposed graph clustering algorithms across a range of real-world networks. The selection of the networks for experimentation is guided by the objective of evaluating the algorithm across diverse network characteristics. The aim is to encompass a range of scenarios to enhance the generalizability of the algorithm’s performance. The chosen networks represent a variety of characteristics, such as varying sizes (number of nodes and edges), structures, and complexities encountered in real-world scenarios. Table 1 provides an overview of these characteristics, including pertinent metrics such as the graph’s maximum edge degree ( max deg ( E ) ), minimum edge degree ( min deg ( E ) ), average edge degree ( Avg deg ( G , E ) ), and the standard deviation of the edge degree ( σ ( E ) ).

5.1. Experimental Configuration

The user-specified density, denoted U(δr), can be configured within the range of 0 to 1, as outlined in Section 3. Various densities have been applied to explore their impact on clustering outcomes. To validate the proposed methodology, two user-specified densities were selected for demonstration: 0.4 and 0.6, values marginally below and above the mid-value of 0.5 and thus representative of diverse density levels. The rationale for selecting these specific values lies in the relationship between the networks’ characteristics (as presented in Table 1) and U(δr). The clustering outcomes resulting from the partitioning clustering of eight real-world networks are summarized in Table 2, while the results of overlapping clustering are detailed in Table 3.
Furthermore, visual representations illustrate clusters from both types of clustering within a confined space on two distinct networks: Zachary’s Karate Club (ZKC) and the US-Grid Power (USGP) network. These networks were chosen for their diverse structural properties. Despite its small scale, ZKC exhibits dense interconnectivity, with an average edge degree Avg_deg(G, E) of 15, allowing for a distinct analysis of the determined clusters. In contrast, the USGP network, the largest in this study, is sparsely connected, with an average edge degree Avg_deg(G, E) of 7.

5.2. Partitioning Clustering Results

In this subsection, the outcomes of partitioning clustering are investigated using two user-specified density settings: 0.4 and 0.6. The analysis explores the inherent characteristics of the resultant clusters, covering several key facets: the number of clusters c_k formed at each density configuration, the corresponding average relative density Avg(δr)(C) of the set of clusters, and the unclustered nodes identified for each dataset. This information reveals the influence that the topology and characteristics of each network have on the formation of clusters.
Table 2. Partitioning clustering results for U ( δ r ) of 0.4 and 0.6.
| Network | c_k (0.4) | c_k (0.6) | Avg(δr)(C) (0.4) | Avg(δr)(C) (0.6) | Unclustered Nodes (0.4) | Unclustered Nodes (0.6) |
|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 2 | 2 | 0.40 | 0.60 | 8 | 5 |
| Aves-Weaver Social | 5 | 4 | 0.38 | 0.60 | 5 | 3 |
| Dolphins Interaction | 4 | 2 | 0.40 | 0.45 | 15 | 11 |
| Les Misérables | 4 | 3 | 0.40 | 0.60 | 8 | 14 |
| Political Books | 3 | 5 | 0.40 | 0.40 | 4 | 5 |
| American College Football | 4 | 3 | 0.30 | 0.40 | 7 | 4 |
| Facebook Pages Food | 25 | 23 | 0.37 | 0.50 | 138 | 116 |
| US-Grid Power | 626 | 320 | 0.40 | 0.58 | 1350 | 859 |
The results presented in Table 2 reveal a consistent density pattern for a density parameter of 0.4. Under this condition, the average relative density closely corresponds to the user-specified value. This alignment indicates a substantial portion of clusters adhering to the intended density criterion. It is worth noting that the number of clusters varies across different density settings and graph structures. Conversely, when a higher density of 0.6 is applied, certain networks exhibit slightly lower average relative densities compared to the desired density. This phenomenon is indicative of clusters with higher cohesion, resulting in fewer clusters and the formation of larger, densely populated clusters.
In Figure 6a,b, the results of partitioning clustering for the Zachary network are depicted at density settings of 0.4 and 0.6, respectively. These visualizations employ different colors to distinguish distinct clusters. Node labels are formatted as follows: the first number represents the node, and the second number indicates the cluster (e.g., (16-2) denotes that node 16 belongs to cluster 2). Additionally, unclustered nodes are represented in gray.
Figure 6a reveals the presence of two clusters with densities that precisely match the user-specified value. Despite significant variations in the number of nodes in these clusters, their similar densities reveal an interesting feature of the proposed approach, emphasizing the importance of density in cluster separation. Examining Figure 6b at a density setting of 0.6 reveals the emergence of two clusters characterized by both larger size and higher density, showing that a higher density setting facilitates the formation of more substantial and densely populated clusters. Importantly, this increase in density is accompanied by a notable reduction in the count of unclustered nodes.
Figure 6. Zachary’s Karate Club network partitioning clusters: (a) U ( δ r ) = 0.4. (b) U ( δ r ) = 0.6.
Figure 7 presents the partitioning clustering outcomes for the US-Power Grid network. Different colors represent distinct clusters; however, due to the large network size and the high number of clusters, assigning a unique color to each cluster is impractical, so colors are repeated across clusters in this scenario. Figure 7a shows the partitioning clustering results for a density of 0.4, revealing a pattern of more, relatively smaller clusters. Figure 7b illustrates the clustering outcomes at a density of 0.6 and reveals a reduction in the number of clusters by nearly half compared to Figure 7a. This reduction highlights the emergence of larger and more densely populated clusters, a feature particularly pronounced in Figure 7b, and suggests a transition towards a denser, more cohesive clustering structure characterized by fewer and more substantial clusters. Notably, the number of unclustered nodes decreases by about 36% (from 1350 to 859) compared with the clusters obtained at a density of 0.4.

5.3. Overlapping Clustering Results

Table 3 summarizes the results of overlapping clustering, providing a comprehensive overview of all the networks in this study. Complementing this tabular representation, Figure 8 and Figure 9 provide a visual depiction of the clustering outcomes of two selected networks for overlapping clustering.
A comparison of overlapping clustering with the partitioning results in Table 2 reveals that overlapping clustering produces a greater number of clusters for both values of U(δr), as evidenced by the values of c_k. The average relative density of the clusters, Avg(δr)(C), matches the respective value of U(δr) for each network. Overall, the number of unclustered nodes is also reduced, although not for every network. These findings reflect the characteristics of overlapping clustering: since a node is allowed to be a member of more than one cluster, more edges (and nodes) are available for inclusion during cluster expansion, which increases the likelihood of achieving a cluster whose density is close to U(δr). Overlapping also increases the number of possible cluster initiation points, leading to a higher number of clusters c_k. However, the topology of the network has a significant impact on the clustering results; in some networks, overlapping clustering can leave more nodes unclustered than partitioning clustering. As mentioned earlier, the proposed approach is intended for networks where clusters are not readily apparent, and the goal of clustering in this work is not to ensure that every node or edge belongs to a cluster.
Figure 7. US-Grid Power network partitioning clusters: (a) U ( δ r ) = 0.4. (b) U ( δ r ) = 0.6.
Table 3. Overlapping clustering results for U ( δ r ) of 0.4 and 0.6.
| Network | c_k (0.4) | c_k (0.6) | Avg(δr)(C) (0.4) | Avg(δr)(C) (0.6) | Unclustered Nodes (0.4) | Unclustered Nodes (0.6) | Overlap Nodes (0.4) | Overlap Nodes (0.6) |
|---|---|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 5 | 4 | 0.4 | 0.6 | 3 | 1 | 9 | 22 |
| Aves-Weaver Social | 9 | 8 | 0.4 | 0.6 | 3 | 3 | 20 | 21 |
| Dolphins Interaction | 11 | 6 | 0.4 | 0.6 | 11 | 14 | 34 | 37 |
| Les Misérables | 9 | 8 | 0.4 | 0.6 | 12 | 11 | 29 | 34 |
| Political Books | 15 | 11 | 0.4 | 0.6 | 7 | 5 | 61 | 79 |
| American College Football | 13 | 10 | 0.4 | 0.6 | 6 | 5 | 90 | 95 |
| Facebook Pages Food | 89 | 70 | 0.4 | 0.6 | 149 | 125 | 273 | 333 |
| US-Grid Power | 845 | 477 | 0.4 | 0.6 | 1181 | 734 | 1245 | 2509 |
Figure 8a,b depict the results of applying the proposed overlapping clustering algorithm to Zachary’s Karate Club network. Each cluster is depicted in a single color, and blue nodes represent the overlapped areas shared by different clusters. Two numbers are associated with each node: the node number and, after the hyphen, either the cluster number or, when enclosed in parentheses, the number of clusters to which the node belongs. For example, in Figure 8a, the node labeled 3-1 belongs to cluster number 1, whereas the label 1-(2) indicates that node ‘1’ overlaps two clusters (i.e., it is a member of both). An alternative way of presenting the results would be to enumerate the membership of each node; while this may seem useful for full comprehension, the detailed information for all clusters can be overwhelming and distract from the main insights.
Figure 8. Zachary’s Karate Club network overlapping clusters: (a) U ( δ r ) = 0.4. (b) U ( δ r ) = 0.6.
Figure 8a,b illustrate the impact of the parameter U ( δ r ) on the number and overlap of clusters. Figure 8a depicts five clusters, while Figure 8b exhibits four clusters, which is consistent with the previously discussed trend that a higher value of U ( δ r ) leads to fewer clusters. Additionally, Figure 8b exhibits a noticeable increase in overlapping regions, which is attributable to the higher value of U ( δ r ) in Figure 8b. This is because more edges are available for consideration during cluster expansion when U ( δ r ) is higher.
Figure 9a,b depict the results of applying the proposed overlapping clustering algorithm to the US-Power Grid network, a real-world sparse network. Notably, blue-colored nodes indicate overlaps, suggesting their presence in multiple clusters. However, due to the network’s size, it is challenging to highlight each cluster with a distinct color and label each node with its respective cluster. Visual analysis of the results is complex; however, the overall characteristics are consistent with those observed in Zachary’s Karate Club network. Specifically, a larger value of U ( δ r ) leads to fewer clusters, fewer unclustered nodes, and larger overlapping regions. In general, the proposed approach reveals that the number of clusters and the number of unclustered nodes in the result are inversely proportional to the value of U ( δ r ) , while the size of the overlapping regions is positively correlated. It is also important to note that the topology of the network can significantly influence the clustering process, both partitioning and overlapping. For example, in graphs that are densely connected in some areas, clusters may be evident, making graph clustering unnecessary. Additionally, a larger number of unclustered nodes may appear if a graph contains a large number of terminal nodes.
Figure 9. US-Grid Power network overlapping clusters: (a) U ( δ r ) = 0.4. (b) U ( δ r ) = 0.6.

6. Clustering Process Evaluation

To assess the cohesiveness and separation within and between clusters, the clustering results are evaluated using pre-existing quality metrics.
Internal connectivity metrics, such as coverage (Cov), average embeddedness (Avg-Embed), average fraction over median degree (Avg-FOMD), average triangle participation ratio (Avg-TPR), and average internal edge density (Avg-IED), are used to measure the strength of ties between nodes within clusters. Coverage measures the proportion of internal edges to the total number of graph edges [65], indicating how well-connected nodes are within clusters. Average embeddedness measures the ratio of internal node degrees to total node degrees, which indicates how well-connected nodes are within clusters relative to their overall connectivity in the graph [66]. FOMD measures the proportion of nodes whose internal degree exceeds the median degree of the graph [66]. TPR measures the ratio of triangles within clusters to all possible triangles, reflecting cluster density [67,68]. Average internal edge density measures the density of edges within clusters, determining the subgraph cohesion of clusters. Overall, the internal connectivity metrics provide a comprehensive assessment of how well-connected nodes are within clusters [69].
Similarly, external connectivity is another important aspect of clustering as it quantifies the interactions beyond the cluster boundaries. These metrics together provide a comprehensive view of clustering effectiveness, revealing intricate connectivity patterns from clusters to the surrounding graph. The cut ratio metric addresses the challenge of edge distribution between different clusters [70,71]. Expansion quantifies external edges per node in a cluster, revealing cluster compactness. Conductance is a widely used metric that captures both internal and external connectivity [65]. Normalized cut partitions the graph for cohesive clusters with weak intercluster links [72]. Flake-ODF (out degree fraction) assesses cluster external connectivity by measuring nodes within the cluster that have more edges directed outside the cluster than within it [72]. Collectively, these metrics provide a robust framework for assessing the effectiveness of clustering methods and encompass diverse aspects of connectivity.
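As an illustration, two of these metrics, coverage and one common form of conductance, can be computed for an edge-list graph as follows; the helper names and the example are ours, and other variants of conductance exist.

```python
def coverage(edges, clusters):
    """Coverage: fraction of all graph edges that are internal to some cluster."""
    internal = sum(1 for u, v in edges
                   if any(u in c and v in c for c in clusters))
    return internal / len(edges)

def conductance(edges, cluster):
    """One common form of conductance: boundary edges over cluster volume
    (2 * internal + boundary); lower values indicate better separation."""
    internal = sum(1 for u, v in edges if u in cluster and v in cluster)
    boundary = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    return boundary / (2 * internal + boundary)

# Two triangles joined by a bridge: 6 of the 7 edges are internal to the
# two clusters, and each triangle has 1 boundary edge against a volume of 7.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters = [{0, 1, 2}, {3, 4, 5}]
```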
It is worth noting that these metrics range from 0 to 1. For internal connectivity metrics, values approaching 1 indicate better clustering results; the opposite holds for external connectivity metrics. In the following subsections, Table 4 and Table 5 emphasize the optimal values using bold formatting. In Table 6, however, the coverage metric values are not highlighted; this is intentional, as values exceeding the metric’s range do not represent optimal conditions, a situation that arises from the nature of overlapping clustering. The highlights in Table 4, Table 5, Table 6 and Table 7 are with respect to rows (i.e., which metric yields the best value for each network under the two density conditions), whereas those in Table 8 and Table 9 are with respect to columns (i.e., which network yields the best value for the MRDDC). In the comparison between clique-based and edge-based approaches, Table 10 and Table 11 highlight optimal values with respect to rows.

6.1. Partitioning Clustering Outcomes Evaluation through Pre-Existing Quality Metrics

Referring to Table 4, among the existing metrics, the average embeddedness metric consistently yields higher values compared to other internal connectivity metrics across various networks and density variations. This suggests that nodes within clusters tend to have strong connections to their respective clusters, indicating a high level of interconnectivity. The higher values of average embeddedness can be attributed to the fact that this metric considers both the internal degree of nodes and their overall degree.
Table 4. Partitioning resultant clusters’ internal connectivity evaluation under U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | Cov (0.4) | Cov (0.6) | Avg-Embed (0.4) | Avg-Embed (0.6) | Avg-FOMD (0.4) | Avg-FOMD (0.6) | Avg-TPR (0.4) | Avg-TPR (0.6) | Avg-IED (0.4) | Avg-IED (0.6) |
|---|---|---|---|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 0.55 | 0.67 | 0.64 | 0.77 | 0.25 | 0.27 | 0.42 | 0.40 | 0.50 | 0.51 |
| Aves-Weaver Social | 0.54 | 0.74 | 0.55 | 0.71 | 0.10 | 0.20 | 0.58 | 0.80 | 0.85 | 0.75 |
| Dolphins Interaction | 0.51 | 0.62 | 0.50 | 0.57 | 0.21 | 0.29 | 0.33 | 0.35 | 0.67 | 0.60 |
| Les Misérables | 0.57 | 0.70 | 0.61 | 0.66 | 0.19 | 0.30 | 0.57 | 0.45 | 0.55 | 0.44 |
| Political Books | 0.58 | 0.66 | 0.55 | 0.52 | 0.30 | 0.33 | 0.67 | 0.65 | 0.41 | 0.45 |
| American College Football | 0.53 | 0.67 | 0.32 | 0.51 | 0.25 | 0.23 | 0.49 | 0.88 | 0.54 | 0.35 |
| Facebook Pages Food | 0.52 | 0.66 | 0.59 | 0.68 | 0.14 | 0.24 | 0.25 | 0.33 | 0.76 | 0.68 |
| US-Grid Power | 0.50 | 0.68 | 0.61 | 0.73 | 0.20 | 0.24 | 0.23 | 0.08 | 0.64 | 0.51 |
Referring to Table 5, the cut ratio metric yields the most favorable results among the existing external connectivity metrics. This is attributed to its consideration of external degrees within clusters and factors such as cluster size and total graph nodes.
Table 5. Partitioning resultant clusters’ external connectivity evaluation under U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | Conductance (0.4) | Conductance (0.6) | Cut Ratio (0.4) | Cut Ratio (0.6) | Normalized Cut (0.4) | Normalized Cut (0.6) | Flake-ODF (0.4) | Flake-ODF (0.6) |
|---|---|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 0.43 | 0.27 | 0.10 | 0.06 | 0.65 | 0.51 | 0.38 | 0.35 |
| Aves-Weaver Social | 0.37 | 0.26 | 0.08 | 0.04 | 0.54 | 0.36 | 0.32 | 0.02 |
| Dolphins Interaction | 0.48 | 0.45 | 0.04 | 0.03 | 0.53 | 0.56 | 0.42 | 0.31 |
| Les Misérables | 0.39 | 0.30 | 0.05 | 0.03 | 0.51 | 0.50 | 0.29 | 0.24 |
| Political Books | 0.42 | 0.36 | 0.04 | 0.02 | 0.50 | 0.49 | 0.32 | 0.38 |
| American College Football | 0.48 | 0.33 | 0.06 | 0.05 | 0.67 | 0.50 | 0.45 | 0.47 |
| Facebook Pages Food | 0.43 | 0.36 | 0.00 | 0.00 | 0.49 | 0.40 | 0.29 | 0.21 |
| US-Grid Power | 0.42 | 0.28 | 0.00 | 0.00 | 0.43 | 0.31 | 0.15 | 0.06 |
The trend of higher metric values with higher-density specifications for internal quality measures suggests a bias towards larger clusters and network size. Interestingly, this trend is reversed when considering external evaluation metrics. This bias highlights the sensitivity of the metric to network structure and edge count rather than solely capturing clustering quality.

6.2. Overlapping Clustering Evaluation through Pre-Existing Quality Metrics

Referring to Table 6, in the context of overlapping clustering analysis using existing metrics, a significant observation emerges: certain metric values may exceed the maximum value of 1. This phenomenon occurs due to the nature of overlapping clusters, where nodes connect with multiple clusters and edges may be counted more than once. Traditional metrics such as average embeddedness, FOMD, conductance, and coverage are designed for nonoverlapping scenarios and thus encounter limitations in these cases. The coverage values for Zachary’s Karate Club and Facebook Pages Food are good examples of the unsuitability of these metrics for overlapping clustering. This highlights the need for more suitable metrics for such clustering.
Table 6. Overlapping resultant clusters’ internal connectivity evaluation under U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | Cov (0.4) | Cov (0.6) | Avg-Embed (0.4) | Avg-Embed (0.6) | Avg-FOMD (0.4) | Avg-FOMD (0.6) | Avg-TPR (0.4) | Avg-TPR (0.6) | Avg-IED (0.4) | Avg-IED (0.6) |
|---|---|---|---|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 1.4 | 2.0 | 0.63 | 0.79 | 0.30 | 0.35 | 0.75 | 0.89 | 0.25 | 0.22 |
| Aves-Weaver Social | 2.1 | 3.6 | 0.61 | 0.76 | 0.21 | 0.32 | 0.94 | 0.98 | 0.44 | 0.41 |
| Dolphins Interaction | 2.0 | 2.4 | 0.62 | 0.75 | 0.34 | 0.40 | 0.83 | 0.89 | 0.32 | 0.17 |
| Les Misérables | 1.3 | 2.8 | 0.64 | 0.78 | 0.24 | 0.41 | 0.64 | 0.94 | 0.44 | 0.26 |
| Political Books | 2.1 | 4.3 | 0.58 | 0.74 | 0.41 | 0.45 | 0.94 | 0.97 | 0.24 | 0.17 |
| American College Football | 2.6 | 4.9 | 0.57 | 0.75 | 0.39 | 0.37 | 0.89 | 0.95 | 0.17 | 0.10 |
| Facebook Pages Food | 10.5 | 14.7 | 0.64 | 0.78 | 0.38 | 0.43 | 0.71 | 0.78 | 0.35 | 0.17 |
| US-Grid Power | 0.82 | 2.7 | 0.62 | 0.77 | 0.25 | 0.33 | 0.12 | 0.15 | 0.54 | 0.25 |
Referring to Table 7, in the external quality evaluation of overlapping outcomes, the cut ratio metric once again emerges as the most favorable. In overlapping clustering, clusters can expand up to the desired density, resulting in larger clusters in terms of nodes. The denominator in the cut ratio calculation therefore increases, leading to relatively lower values for this metric compared to non-overlapping scenarios. Note that each metric was invented with a particular objective in mind. All internal connectivity metrics have similar intentions, and the same can be said for all external connectivity metrics. These metrics were never intended to assess the performance of overlapping clustering; hence, their values can fall out of range (as in Table 6). Finding suitable external connectivity metrics is beyond the scope of this work.
Careful analysis of the use of these metrics reveals interesting facts. Internal connectivity metrics yield relatively poor values for both partitioning and overlapping clustering, while the opposite can be said for external connectivity metrics. It can be argued that, for this particular work, current metrics, which exclusively consider either one or both types of connections, are inappropriate for evaluating the clustering process. This hampers a meaningful evaluation of the proposed method. The mean relative density deviation coefficient [16] is probably the best metric for this work, as it was designed with a similar objective.
Table 7. Overlapping resultant clusters’ external connectivity evaluation under U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | Conductance (0.4) | Conductance (0.6) | Cut Ratio (0.4) | Cut Ratio (0.6) | Normalized Cut (0.4) | Normalized Cut (0.6) | Flake-ODF (0.4) | Flake-ODF (0.6) |
|---|---|---|---|---|---|---|---|---|
| Zachary’s Karate Club | 0.41 | 0.23 | 0.11 | 0.08 | 0.62 | 0.47 | 0.13 | 0.03 |
| Aves-Weaver Social | 0.40 | 0.22 | 0.13 | 0.09 | 0.59 | 0.43 | 0.18 | 0.07 |
| Dolphins Interaction | 0.39 | 0.24 | 0.05 | 0.05 | 0.53 | 0.49 | 0.24 | 0.03 |
| Les Misérables | 0.42 | 0.23 | 0.06 | 0.05 | 0.53 | 0.44 | 0.18 | 0.09 |
| Political Books | 0.39 | 0.24 | 0.04 | 0.04 | 0.54 | 0.42 | 0.21 | 0.07 |
| American College Football | 0.42 | 0.25 | 0.06 | 0.05 | 0.58 | 0.49 | 0.33 | 0.11 |
| Facebook Pages Food | 0.40 | 0.20 | 0.00 | 0.00 | 0.50 | 0.39 | 0.18 | 0.06 |
| US-Grid Power | 0.41 | 0.24 | 0.00 | 0.00 | 0.41 | 0.24 | 0.16 | 0.05 |

6.3. Mean Relative Density Deviation Coefficient (MRDDC) Metric

As explored earlier, it is arguable that the prevailing quality metrics, which focus primarily on singular or combined connectivity aspects, lack the adequacy to assess the density proximity of clusters comprehensively. The inability of the existing metrics to assess both partitioning and overlapping clustering methods, especially when aiming for specific cluster densities, necessitates an appropriate metric. The MRDDC is a suitable metric for assessing the proposed approach, as it was introduced to evaluate clustering where the resulting clusters should be of a required density. The MRDDC is determined by Equation (7) below.
\[ \mathrm{MRDDC} = 1 - \frac{1}{k} \sum_{i=1}^{k} \frac{\left| U(\delta_r) - \delta_r(c_i) \right|}{U(\delta_r)} \tag{7} \]
where δ r ( c i ) signifies the density of each individual cluster c i , and k denotes the total number of clusters.
When a cluster has a relative density ( δ r ) equal to the user-specified density U ( δ r ) , its deviation is zero (i.e., U ( δ r ) − δ r ( c i ) is zero). When the deviation term equals 1, the cluster is just a single node (which should not be considered a cluster). By dividing each deviation by U ( δ r ) and taking the absolute value, the term is normalized to a range between 0 and 1.
Therefore, the average deviation (1/k) Σᵢ₌₁ᵏ |U ( δ r ) − δ r ( c i )| / U ( δ r ) also falls within the range of zero to one, with a value of 0 for the best scenario and 1 for the worst. This appears unintuitive, as most researchers tend to associate the value of 1 with the best result and 0 with the worst. To make the metric intuitive, MRDDC is obtained by subtracting the average deviation from 1 (as expressed in Equation (7)). In summary, MRDDC quantifies the proximity of the resulting clusters to U ( δ r ) , thereby reflecting the effectiveness of the graph clustering process.
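Under the definitions above, Equation (7) can be computed directly once the relative density of each resultant cluster is known; the minimal sketch below assumes those densities are supplied as a list.

```python
# Minimal sketch of Equation (7): MRDDC = 1 minus the mean absolute relative
# deviation of each cluster's density from the user-specified density U(δr).

def mrddc(cluster_densities, user_density):
    k = len(cluster_densities)
    avg_dev = sum(abs(user_density - d) / user_density
                  for d in cluster_densities) / k
    return 1.0 - avg_dev

print(mrddc([0.4, 0.4, 0.4], 0.4))  # perfect match → 1.0
print(mrddc([0.3, 0.5], 0.4))       # 1 − (0.25 + 0.25)/2 = 0.75
print(mrddc([0.0], 0.4))            # single-node cluster (density 0) → 0.0
```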

6.4. Partitioning and Overlapping Outcomes Evaluation Using MRDDC Metric

Table 8 reveals the performance of the proposed partitioning clustering approach using the MRDDC metric. As stated in the previous subsection, it is a more appropriate metric for validating this work. Among the datasets evaluated, the American College Football network exhibits the lowest values for both values of U ( δ r ) . This may be attributed to the fact that it possesses the highest average edge degree of 20 (as shown in Table 1), a consequence of its high average node degree of 10.8. This leads to certain clusters characterized by heightened external connectivity and, therefore, lower values of MRDDC. This is further evidence that the topology of a network can influence partitioning clustering significantly.
Similarly, Table 9 reveals the performance of the proposed overlapping clustering approach using the MRDDC metric. The results are superior to those of partitioning clustering. As nodes and edges can belong to more than one cluster, a cluster can be quite dense (i.e., the number of internal edges is high). Since ‘relative density’ is used as the means of specifying the required density of a cluster in this work, the internal degree int deg ( c i ) can be quite high, enabling U ( δ r ) to be satisfied more easily than in partitioning clustering. Note also that, in overlapping clustering, the required density is less prone to the influence of the network topology than in partitioning clustering, as shown by the MRDDC of the American College Football network.
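For illustration, the sketch below computes a cluster's relative density under a common definition (internal edges divided by internal plus boundary edges); this formulation is an assumption made here, as the article's formal definition is given in an earlier section.

```python
# Sketch of relative density under an assumed common definition:
# δr(c) = internal edges / (internal edges + boundary edges).

def relative_density(adj, cluster):
    cluster = set(cluster)
    internal_twice = external = 0
    for u in cluster:
        for v in adj[u]:
            if v in cluster:
                internal_twice += 1   # internal edges are seen from both ends
            else:
                external += 1
    internal = internal_twice // 2
    return internal / (internal + external)

# Triangle {0, 1, 2} with one outgoing edge to node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(relative_density(adj, {0, 1, 2}))  # 3 / (3 + 1) = 0.75
```

In overlapping clustering, a node may contribute internal edges to several clusters at once, which is why the required density is easier to reach.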
Table 8. Partitioning clustering evaluation using MRDDC metric of U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | MRDDC, U ( δ r ) = 0.4 | MRDDC, U ( δ r ) = 0.6 |
|---|---|---|
| Zachary’s Karate | 1.00 | 0.97 |
| Aves-Weaver | 0.92 | 0.95 |
| Dolphins Interaction | 0.91 | 0.80 |
| Les Misérables | 1.00 | 0.99 |
| Political Books | 1.00 | 0.72 |
| American College Football | 0.88 | 0.65 |
| Facebook Pages Food | 0.89 | 0.78 |
| US-Grid Power | 0.95 | 0.93 |
Table 9. Overlapping clustering evaluation using MRDDC metric of U ( δ r ) 0.4 and 0.6. Best results are shown in bold.
| Network | MRDDC, U ( δ r ) = 0.4 | MRDDC, U ( δ r ) = 0.6 |
|---|---|---|
| Zachary’s Karate | 1.00 | 1.00 |
| Aves-Weaver | 1.00 | 1.00 |
| Dolphins Interaction | 1.00 | 1.00 |
| Les Misérables | 1.00 | 1.00 |
| Political Books | 1.00 | 1.00 |
| American College Football | 1.00 | 1.00 |
| Facebook Pages Food | 0.99 | 0.99 |
| US-Grid Power | 0.96 | 0.98 |

6.5. Partitioning Clustering Comparative Analysis

There has been a previous attempt at clustering where users can specify the required relative density of the resultant clusters [16]. In that algorithm, a clique of size three (i.e., a triangle) is adopted as the basis of clustering; that work will be referred to as the ‘clique-based’ method for brevity and ease of discussion. A clique with an average degree is selected as the initial cluster, which is then expanded by adding adjacent cliques while the relative density is monitored until its value is the same as, or closest to, the specified density. However, only partitioning clustering was considered. Nevertheless, this clique-based method merits special attention and comparison with this work. The partitioning clustering comparative analysis involves a thorough examination of the two approaches: the clique-based method and the proposed edge-based approach. Within this section, the analysis commences by evaluating the number of clusters generated by each approach.
Subsequently, the analysis extends to the presence of unclustered nodes, shedding light on the effectiveness of each method in handling nodes that do not adhere to a particular cluster. Following this, the outcomes of both approaches are systematically evaluated, primarily focusing on density deviation as measured by the proposed metric (MRDDC). This metric provides insights into the proximity of the resultant clusters to the user-specified density. By considering these facets, a comprehensive understanding of the performance and limitations of both approaches is gained, thereby enabling informed decisions in selecting the most suitable clustering strategy for various scenarios.
Table 10 compares the performance of both studies using the MRDDC metric. The proposed approach yields slightly better performance in almost all networks for both values of U ( δ r ) . The MRDDC of the American College Football network is slightly lower than that of [16]. Again, this is likely due to its high average degree, as discussed in the earlier subsection (results in Table 8).
Table 10. Comparison with respect to mean relative density deviation coefficient (MRDDC) metric. Best results are shown in bold.
| Network | Edge-Based (0.4) | Clique-Based (0.4) | Edge-Based (0.6) | Clique-Based (0.6) |
|---|---|---|---|---|
| Zachary’s Karate | 1.00 | 0.90 | 0.97 | 1.00 |
| Aves-Weaver | 0.92 | 0.86 | 0.95 | 0.84 |
| Dolphins Interaction | 0.91 | 1.00 | 0.80 | 0.91 |
| Les Misérables | 1.00 | 1.00 | 0.99 | 1.00 |
| Political Books | 1.00 | 0.89 | 0.72 | 0.76 |
| American College Football | 0.88 | 0.92 | 0.65 | 0.95 |
| Facebook Pages Food | 0.89 | 0.88 | 0.78 | 0.94 |
| US-Grid Power | 0.95 | 0.90 | 0.93 | 0.90 |
Ideally, good graph clustering, regardless of type, ought to cover all the nodes in the graph/network (i.e., every node belongs to at least one cluster) so that no node remains unclustered. In practice, this cannot always be guaranteed, especially when a resultant cluster is subjected to certain constraint(s); the topology may also dictate the results of clustering. Table 11 compares both approaches with respect to the number of unclustered nodes. The edge-based method in this work is superior to the clique-based approach in all cases except the American College Football network at U ( δ r ) of 0.6. For a large network such as US-Grid Power, the proposed edge-based method is preferable. It is also worth noting that the proposed edge-based method is computationally more efficient before clustering begins, as determining the ‘edge degree’ of all edges requires less computation time than finding all the cliques hidden in the network.
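The computational remark above can be illustrated: edge degrees require only a pass over the edge list, whereas even enumerating triangles (the smallest cliques) requires a neighbourhood intersection per edge. The edge-degree formula deg(u) + deg(v) − 2 is an assumption made for this sketch, not taken verbatim from the paper.

```python
# Sketch contrasting the preprocessing cost of the two approaches.
# Assumption: edge degree of (u, v) is deg(u) + deg(v) - 2.

def edge_degrees(edges):
    deg = {}
    for u, v in edges:                 # one pass to get node degrees
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return {(u, v): deg[u] + deg[v] - 2 for u, v in edges}

def triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    tris = set()
    for u, v in edges:                 # neighbourhood intersection per edge
        for w in adj[u] & adj[v]:
            tris.add(frozenset((u, v, w)))
    return tris

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(edge_degrees(edges))    # e.g. edge (0, 1) has degree 2 + 2 - 2 = 2
print(len(triangles(edges)))  # one triangle: {0, 1, 2}
```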
Table 11. Comparison with respect to number of unclustered nodes. Best results (i.e., lower number of unclustered nodes) are shown in bold.
| Network | Edge-Based (0.4) | Clique-Based (0.4) | Edge-Based (0.6) | Clique-Based (0.6) |
|---|---|---|---|---|
| Zachary’s Karate | 8 | 9 | 5 | 5 |
| Aves-Weaver | 3 | 8 | 3 | 5 |
| Dolphins Interaction | 11 | 19 | 11 | 17 |
| Les Misérables | 8 | 25 | 14 | 20 |
| Political Books | 7 | 14 | 5 | 15 |
| American College Football | 6 | 13 | 4 | 2 |
| Facebook Pages Food | 149 | 209 | 116 | 236 |
| US-Grid Power | 1181 | 2880 | 859 | 1478 |

7. Discussion

The clustering approach in this work is unique in the sense that it allows users to specify the required density. The choice of clustering type (partitioning, overlapping, or hierarchical) is highly application-oriented and depends on clustering objectives. For any network with a specified density, it has been shown that overlapping clustering results in fewer but larger clusters than partitioning clustering. The number of unclustered nodes is also smaller. Also, overlapping clustering has a higher tendency to result in clusters with a density closer to U ( δ r ) . These phenomena are due to more edges being available for inclusion during the expansion of a cluster. However, this ought not to imply that overlapping clustering is a more efficient method of clustering than partitioning clustering; they serve different purposes in practice.
The network’s topology greatly influences the outcomes of both types of clustering. Knowing whether the network is dense (i.e., Avg deg ( G , E ) is high) or sparse (i.e., Avg deg ( G , E ) is low) may not provide a sufficient indication. A network can be considered dense, but dense areas may only occur in a few places, while the rest is sparse. This is where the standard deviation of edge degree ( σ ( E ) ) can be useful additional information, especially when it is not feasible to visualize a large network. As mentioned earlier, in networks where dense areas are visible (i.e., σ ( E ) is very high), this indicates that clusters may already be visible and the clustering approach in this work is inappropriate. Nevertheless, if the dense area is extracted and treated as an independent network, the clustering approach in this work may still be applicable. Networks with a large number of terminal nodes are good examples of such cases.
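A minimal sketch of this indicator, again assuming the edge-degree formula deg(u) + deg(v) − 2, computes the mean and standard deviation σ(E) over all edges:

```python
# Sketch: mean and standard deviation of edge degrees as a cheap structural
# indicator. Assumption: edge degree of (u, v) is deg(u) + deg(v) - 2.
from statistics import mean, pstdev

def edge_degree_stats(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    ed = [deg[u] + deg[v] - 2 for u, v in edges]
    return mean(ed), pstdev(ed)

# Uniform ring: every edge degree is 2 + 2 - 2 = 2, so sigma(E) = 0.
ring = [(i, (i + 1) % 6) for i in range(6)]
# One dense pocket (a 4-clique) with a sparse tail: sigma(E) is clearly > 0.
pocket = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
print(edge_degree_stats(ring))    # (2, 0.0)
print(edge_degree_stats(pocket))  # mean ≈ 3.78, sigma ≈ 1.31
```

A high σ(E) over a comparable mean hints that density is concentrated in a few regions, the situation where clusters may already be visible.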
With respect to the scalability of the proposed algorithm, the affecting factors are the number of nodes, the number of edges, and their connectivity. As stated earlier, it is assumed that the graph/network is static, and time is not a constraint in this work; therefore, scalability ought not to be a critical issue. It is envisaged that, in a very large and complex network where response time is critical, it may be very difficult to devise an algorithm that produces satisfactory results within a given time. Extensive experiments were performed on datasets of varying sizes, ranging from small to large networks, including both synthetic graphs and real-world networks. Intuitively, clustering takes much longer on larger networks. ‘Zachary Karate Club’ and ‘US-Power Grid Network’ are specifically selected for illustration as they provide a good contrast from a scalability perspective: ‘Zachary Karate Club’ has relatively few nodes but a high average node degree, while ‘US-Power Grid Network’ has a large number of nodes but a low average node degree. Clustering ‘Zachary Karate Club’ takes little time (within 20 s on an Apple MacBook Pro (M1)), while clustering ‘US-Power Grid Network’ takes around 50 times longer. Note that real-world networks with realistic connectivity usually do not possess high average edge degrees. Networks with high density in some parts make clusters visually evident; such networks are not of interest here, as finding resultant clusters with some specified density in them would be meaningless.
While specifying U ( δ r ) ought to be a user’s decision, it must be kept in mind that, in some networks, selecting appropriate values can be crucial. In relatively sparse networks, setting a high value for U ( δ r ) is likely to result in small-size clusters, which may not be useful in some applications. Conversely, if the network is relatively dense, specifying a low value of U ( δ r ) can lead to similarly small clusters. In determining standard specifications for the user-specified density U ( δ r ) , this study reveals crucial considerations for guiding users effectively when applying U ( δ r ) to diverse network analyses.
  • Density Specification: A fundamental standard proposed is the range within which the U ( δ r ) parameter should fall. Values beyond this range are deemed irrelevant in the context of the study, ensuring a valid and coherent parameter threshold.
  • Topology-Dependent Specification: The specification of the U ( δ r ) parameter is intricately tied to the topology of the network under analysis. Contrary to conventional assumptions in density contexts, where higher density often corresponds to more clusters, this study reveals that this may not universally hold. Instead, the appropriate value for U ( δ r ) is highly dependent on the specific characteristics of the network.
In providing valuable guidance, diverse experiments to uncover optimal specifications for the user-specified density parameter under various network scenarios may be necessary. This iterative process should also deepen the understanding of real-world network analyses. The discussion above aims to provide guidelines that acknowledge the network’s nature and diverse structural characteristics. By embracing the interplay between user-specified density and network topology, users can make informed decisions that may enhance the effectiveness of the proposed approach in diverse applications.
This work also emphasizes the concept of ’the right tool for the right job’. In assessing the efficiency of the clustering process, the existing metrics have proven inappropriate. This does not imply that they are not useful; in fact, metrics such as ‘coverage’ and ‘conductance’ are very popular and universal in graph clustering. However, information such as the number of unclustered nodes, the number of unclustered edges, and the area covered by clustering is not of real interest when a specific density of resultant clusters is required. Therefore, the MRDDC, as suggested in this work, provides an efficient means of assessing the clustering.
The proposed clustering technique is tailored for static connected graphs, and its performance may degrade in scenarios deviating from this assumed model. In the case of disconnected graphs, each subcomponent may be treated independently. As the algorithm relies on density as an input, users must be mindful of network characteristics when specifying it. For instance, in applications like resource management, users should be aware of the relative density range in order to set effective density values strategically. These limitations underscore the importance of aligning the proposed approach with specific network conditions and of domain expertise in its application.
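The handling of disconnected graphs mentioned above can be sketched as follows: extract the connected components first, then run the clustering on each component independently (the clustering call itself is omitted here).

```python
# Sketch: split a disconnected graph into connected components so that each
# can be clustered independently, as suggested above.

def connected_components(adj):
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                   # iterative depth-first search
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two components: {0, 1} and {2, 3, 4}.
adj = {0: {1}, 1: {0}, 2: {3}, 3: {2, 4}, 4: {3}}
for comp in connected_components(adj):
    print(sorted(comp))   # each component would then be clustered on its own
```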

8. Conclusions

Many real-world problems are graph-related, in that their information can be represented as graphs. In such cases, graph clustering has applications, especially when the objective includes dividing members into groups where intrasimilarity is high and intersimilarity is low. This allows intrinsic information hidden in each group, as well as in the overall composite graph, to be extracted. This work is the first attempt at both partitioning and overlapping graph clustering of unweighted undirected graphs where users can specify the required density, which is achieved by means of ‘Relative Density’. The proposed approach is edge-based, utilizing the concept of edge degree. The intended application is to cluster graphs where clusters are not initially visible (i.e., σ ( E ) is not so high), as it is not meaningful to demand a certain density from resultant clusters if the clustering pattern is evident from the start. In such situations, splitting the graph at appropriate places is more meaningful, and such algorithms already exist.
Selecting a suitable user-specified density U ( δ r ) may also require careful consideration. As discussed in the previous section, the density or sparsity of a graph is likely to guide a suitable value for U ( δ r ) . Among the several metrics in graph clustering, the MRDDC is the most appropriate for assessing the clustering performance in this work, as the other metrics were invented with different objectives.
As demonstrated in Section 6, overlapping clustering consistently outperforms partitioning clustering. Importantly, this is not entirely attributed to the algorithm presented in this work but stems from higher availability of edges to consider during the clustering expansion process. The inherent nature of overlapping clustering permits a node to belong to more than one cluster, resulting in clusters with relative density ( δ r ) near to or the same value as the user-specified density ( U ( δ r ) ).
The applications of the proposed clustering method are multifaceted. The identification of clusters in a complex network with closely matched density holds strategic advantages as resource allocation can be completed effectively across all clusters. This not only streamlines the deployment of resources but also enhances the efficiency of cluster monitoring.
The overlapping areas within the context of overlapping clustering introduce a dimension of particular interest. Allocating resources to nodes within these areas enables coverage across multiple clusters; this provides both resource optimization and identification of regions that may require special attention. In practical terms, users can interpret and select density values based on the specific requirements of their application. For instance, in the context of monitoring the spread of infectious diseases, such as COVID-19, overlapping areas identified by the clustering may indicate potential superspreader locations, emphasizing the algorithm’s relevance in this domain. Awareness of the user-specified density range and network topologies is imperative for users, significantly influencing the precise density specification and the ensuing effectiveness of clustering outcomes in specific applications.
The proposed algorithm represents a significant advancement in the realm of graph clustering. In contrast to traditional approaches, it introduces a novel perspective by prioritizing clustering based on user-required density, which is also valuable for understanding network structures. Furthermore, the algorithm’s capability to identify clusters with user-specified density values contributes to its practical applicability in real-world settings. This feature not only sets the proposed algorithm apart from the existing literature but also extends its potential to address the inherent shortcomings of conventional graph clustering methods.

9. Future Directions

Future work can be carried out from several aspects. The following merit further study. They are, by no means, a comprehensive list of all possible directions.
  • Graph clustering type: This work proposed an approach for both partitioning and overlapping graph clustering; hierarchical graph clustering is a natural extension. In practice, its applications may be relatively fewer; however, recent drone communication networks are a prime example, as they naturally form hierarchies in some applications. Both agglomerative and divisive approaches to hierarchical clustering can be studied. The critical issues are the suitable number of levels in the hierarchy (if they do not initially exist) and whether partitioning, overlapping, or both types of clustering are allowed at each level. The objectives of the application are likely to dictate the direction of investigation.
  • Sensitivity analysis: The approach is initiated from an edge with the average degree. In some cases, there may be more than one such edge. This work considers them equally and selects any one of them to initiate clustering. In such cases, it may be worth investigating the impact of different starting edges on the resultant clusters and their MRDDC values.
  • Incorporation of edge centralities: Different kinds of edge centralities exist, and some are already utilized in graph clustering algorithms, especially for selecting the appropriate edge at which to split the graph. In situations with more than one edge with the average degree, different kinds of edge centralities may be applied to aid the selection. This may result in better clustering performance.
  • A more complex form of graph: Graph clustering in this work addresses the simplest kind of graph (i.e., undirected and unweighted). The most complex form is the directed weighted graph; transportation problems are relevant examples. However, a directed unweighted graph may be a good candidate for the next stage, and many bioinformatics datasets qualify for such studies. Alternatively, undirected weighted graphs can also be studied, where weights may represent, for example, the communication frequency between two sources. Such clustering may result in small and unequal clusters, as weights ought to influence the density of a cluster.
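For the sensitivity analysis suggested above, the set of candidate starting edges (all edges whose edge degree is closest to the average) could be enumerated as in the sketch below; the edge-degree formula deg(u) + deg(v) − 2 is again an assumption made for illustration.

```python
# Sketch: enumerate every edge whose edge degree is closest to the average,
# i.e. all tied candidates for initiating the clustering.
# Assumption: edge degree of (u, v) is deg(u) + deg(v) - 2.

def candidate_start_edges(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    ed = {(u, v): deg[u] + deg[v] - 2 for u, v in edges}
    avg = sum(ed.values()) / len(ed)
    best = min(abs(d - avg) for d in ed.values())
    return [e for e, d in ed.items() if abs(d - avg) == best]

# A square with one diagonal: four of the five edges tie for the average.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(candidate_start_edges(edges))  # all edges except the diagonal (0, 2)
```

A sensitivity study would then run the clustering once per candidate edge and compare the resulting MRDDC values.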

Author Contributions

Conceptualization, R.T. and K.L.; Methodology, R.T. and K.L.; Software, R.T.; Validation, R.T., K.L. and P.B.; Formal analysis, R.T. and K.L.; Investigation, R.T.; Resources, K.L. and P.B.; Data curation, R.T.; Writing—original draft preparation, R.T.; Writing—review and editing, K.L., P.B. and P.M.; Visualization, R.T.; Supervision, K.L.; Project administration, K.L.; Funding acquisition, P.B. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the KMUTT’s Petchra Pra Jom Klao Scholarship for supporting the PhD studies of author R. Tariq under Grant No. 49/2562, and in part by the Interdisciplinary Centre for Security, Reliability, and Trust (SnT), Faculty of Science, Technology, and Medicine, University of Luxembourg.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Berahmand, K.; Haghani, S.; Rostami, M.; Li, Y. A new attributed graph clustering by using label propagation in complex networks. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 1869–1883. [Google Scholar] [CrossRef]
  2. Schaeffer, E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
  3. Huang, X.; Cheng, H.; Yu, J.X. Dense community detection in multi-valued attributed networks. Inf. Sci. 2015, 314, 77–99. [Google Scholar] [CrossRef]
  4. Saeedmanesh, M.; Geroliminis, N. Dynamic clustering and propagation of congestion in heterogeneously congested urban traffic networks. Transp. Res. Procedia 2017, 23, 962–979. [Google Scholar] [CrossRef]
  5. Thomas, J.; Seo, D.; Sael, L. Review on graph clustering and subgraph similarity-based analysis of neurological disorders. Int. J. Mol. Sci. 2016, 17, 862. [Google Scholar] [CrossRef] [PubMed]
  6. Xia, K.; Gu, X.; Zhang, Y. Oriented grouping-constrained spectral clustering for medical imaging segmentation. Multimed. Syst. 2020, 26, 27–36. [Google Scholar] [CrossRef]
  7. Rostami, M.; Oussalah, M.; Farrahi, V. A novel time-aware food recommender system based on deep learning and graph clustering. IEEE Access 2022, 10, 52508–52524. [Google Scholar] [CrossRef]
  8. Shao, B.; Li, X.; Bian, G. A survey of research hotspots and frontier trends of recommendation systems from the perspective of knowledge graph. Exp. Syst. Appl. 2021, 165, 113764. [Google Scholar] [CrossRef]
  9. Hong, S.W.; Miasnikof, P.; Kwon, R.; Lawryshyn, Y. Market graph clustering via QUBO and digital annealing. J. Risk Financ. Manag. 2021, 14, 34. [Google Scholar] [CrossRef]
  10. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297. [Google Scholar]
  11. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Kdd, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
  12. Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Nat. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [CrossRef]
  13. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 10, 10008. [Google Scholar] [CrossRef]
  14. Kothari, R.; Pitts, D. On finding the number of clusters. Pattern Recognit. Lett. 1999, 20, 405–416. [Google Scholar] [CrossRef]
  15. Sankar, S.; Ramasubbareddy, S.; Luhach, A.K.; Nayyar, A.; Qureshi, B. CT-RPL: Cluster tree-based routing protocol to maximize the lifetime of Internet of Things. Sensors 2020, 20, 5858. [Google Scholar] [CrossRef] [PubMed]
  16. Tariq, R.; Lavangnananda, K.; Bouvry, P.; Mongkolnam, P. Partitioning Graph Clustering with Density. IEEE Access 2023, 11, 122273–122294. [Google Scholar] [CrossRef]
  17. Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef] [PubMed]
  18. Parés, F.; Gasulla, D.G.; Vilalta, A.; Moreno, J.; Ayguadé, E.; Labarta, J.; Cortés, U.; Suzumura, T. Fluid communities: A competitive, scalable and diverse community detection algorithm. In International Conference on Complex Networks and Their Applications; Springer: Berlin/Heidelberg, Germany, 2017; pp. 229–240. [Google Scholar]
  19. Ng, A.; Jordan, M.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 14, 1–8. [Google Scholar]
  20. Luxburg, V.U. A tutorial on spectral clustering. Statist. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  21. Dempster, A.P.; Laird, N.M.; Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B 1997, 39, 1–22. [Google Scholar] [CrossRef]
  22. Tandon, A.; Albeshri, A.; Thayananthan, V.; Alhalabi, W.; Fortunato, S. Fast consensus clustering in complex networks. Phys. Rev. E 2019, 99, 042301. [Google Scholar] [CrossRef]
  23. Kuwil, F.H.; Shaar, F.; Topcu, A.E.; Murtagh, F. A new data clustering algorithm based on critical distance methodology. Exp. Syst. Appl. 2019, 129, 296–310. [Google Scholar] [CrossRef]
  24. Rosvall, M.; Bergstrom, C.T. Maps of random walks on complex networks reveal community structure. Proc. Nat. Acad. Sci. USA 2008, 105, 1118–1123. [Google Scholar] [CrossRef] [PubMed]
  25. Rosvall, M.; Axelsson, D.; Bergstrom, C.T. The map equation. Eur. Phys. J. Spec. Top. 2009, 178, 13–23. [Google Scholar] [CrossRef]
  26. Raghavan, U.N.; Albert, R.; Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 2007, 76, 036106. [Google Scholar] [CrossRef] [PubMed]
  27. Fiscarelli, A.M.; Brust, M.R.; Danoy, G.; Bouvry, P. Local memory boosts label propagation for community detection. Appl. Netw. Sci. 2019, 4, 95. [Google Scholar] [CrossRef]
  28. Li, H.; Zhang, R.; Zhao, Z.; Liu, X. LPA-MNI: An improved label propagation algorithm based on modularity and node importance for community detection. Entropy 2021, 23, 497. [Google Scholar] [CrossRef] [PubMed]
  29. Pons, P.; Latapy, M. Computing communities in large networks using random walks. In International Symposium on Computer and Information Sciences; Springer: Berlin/Heidelberg, Germany, 2005; pp. 284–293. [Google Scholar]
  30. Xie, W.B.; Lee, Y.L.; Wang, C.; Chen, D.B.; Zhou, T. Hierarchical clustering supported by reciprocal nearest neighbors. Inf. Sci. 2020, 527, 279–292. [Google Scholar] [CrossRef]
  31. Rustamaji, H.C.; Suharini, Y.S.; Permana, A.A.; Kusuma, W.A.; Nurdiati, S.; Batubara, I.; Djatna, T. A network analysis to identify lung cancer comorbid diseases. Appl. Netw. Sci. 2022, 7, 30. [Google Scholar] [CrossRef]
  32. Newman, M.E. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 2006, 74, 036104. [Google Scholar] [CrossRef]
  33. Galluccio, L.; Michel, O.; Comon, P.; Hero, A.O. Graph-based k-means clustering. Signal Process. 2012, 92, 1970–1984. [Google Scholar] [CrossRef]
  34. Bourouis, S.; Alroobaea, R.; Rubaiee, S.; Andejany, M.; Bouguila, N. Nonparametric Bayesian Learning of Infinite Multivariate Generalized Normal Mixture Models and Its Applications. Appl. Sci. 2021, 11, 5798. [Google Scholar] [CrossRef]
  35. Orbanz, P.; Teh, Y.W. Bayesian Nonparametric Models. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2011; pp. 81–89. [Google Scholar]
  36. Karras, C.; Karras, A.; Giotopoulos, K.C.; Avlonitis, M.; Sioutas, S. Consensus Big Data Clustering for Bayesian Mixture Models. Algorithms 2023, 16, 245. [Google Scholar] [CrossRef]
  37. McAuliffe, J.D.; Blei, D.M.; Jordan, M.I. Nonparametric empirical Bayes for the Dirichlet process mixture model. Stat Comput. 2006, 16, 5–14. [Google Scholar] [CrossRef]
  38. Li, Y.; Schofield, E.; Gonen, M. A Tutorial on Dirichlet Process Mixture Modeling. J. Math. Psychol. 2019, 91, 128–144. [Google Scholar] [CrossRef] [PubMed]
  39. Andersen, K.W.; Madsen, K.H.; Siebner, H.R.; Schmidt, M.N.; Morup, M.; Hansen, L.K. Non-parametric Bayesian graph models reveal community structure in resting state fMRI. Neuroimage 2014, 100, 301–315. [Google Scholar] [CrossRef] [PubMed]
  40. Palla, K.; Knowles, D.A.; Ghahramani, Z. Relational learning and network modelling using infinite latent attribute models. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 462–474. [Google Scholar] [CrossRef] [PubMed]
  41. Blei, D.M.; Frazier, P.I. Distance-dependent Chinese restaurant processes. J. Mach. Learn. Res. 2011, 12, 2461–2488. [Google Scholar]
  42. Xie, J.; Kelley, S.; Szymanski, B.K. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Comput. Surv. 2013, 45, 1–35. [Google Scholar] [CrossRef]
  43. Palla, G.; Derényi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2015, 435, 814–818. [Google Scholar] [CrossRef]
  44. Shen, H.; Cheng, X.; Cai, K.; Hu, M.B. Detect overlapping and hierarchical community structure in networks. Phys. A. Stat. Mech. Appl. 2009, 388, 1706–1712. [Google Scholar] [CrossRef]
  45. Lancichinetti, A.; Fortunato, S.; Kertész, J. Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 2009, 11, 033015. [Google Scholar] [CrossRef]
  46. Lee, C.; Reid, F.; McDaid, A.; Hurley, N. Detecting highly overlapping community structure by greedy clique expansion. arXiv 2010, arXiv:1002.1827. [Google Scholar]
  47. Becker, E.; Robisson, B.; Chapple, C.E.; Guénoche, A.; Brun, C. Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinform. 2012, 28, 84–90. [Google Scholar] [CrossRef] [PubMed]
  48. Lei, X.; Wang, F.; Wu, F.X.; Zhang, A.; Pedrycz, W. Protein complex identification through Markov clustering with firefly algorithm on dynamic protein–protein interaction networks. Inf. Sci. 2016, 329, 303–316. [Google Scholar] [CrossRef]
  49. Li, P.; Dau, H.; Puleo, G.; Milenkovic, O. Motif clustering and overlapping clustering for social network analysis. In Proceedings of the IEEE INFOCOM 2017-IEEE Conference on Computer Communications, IEEE, Atlanta, GA, USA, 1–4 May 2017; pp. 1–9. [Google Scholar]
  50. Gregory, S. An algorithm to find overlapping community structure in networks. In Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, 17–21 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 91–102. [Google Scholar]
  51. Fortunato, S. Community detection in graphs. Phys. Rep. 2010, 486, 75–174. [Google Scholar] [CrossRef]
  52. Seda, M. The Maximum Clique Problem and Integer Programming Models, Their Modifications, Complexity, and Implementation. Symmetry 2023, 15, 1979. [Google Scholar] [CrossRef]
  53. Gupta, S.K.; Singh, D.P. CBLA: A Clique Based Louvain Algorithm for Detecting Overlapping Community. Procedia Comput. Sci. 2023, 218, 2201–2209. [Google Scholar] [CrossRef]
  54. Van Lierde, H.; Chow, T.W.; Chen, G. Scalable spectral clustering for overlapping community detection in large-scale networks. IEEE Trans. Knowl. Data Eng. 2019, 32, 754–767. [Google Scholar] [CrossRef]
  55. Wang, X.; Liu, G.; Li, J. Overlapping community detection based on structural centrality in complex networks. IEEE Access 2017, 5, 25258–25269. [Google Scholar] [CrossRef]
  56. Boobalan, M.P.; Lopez, D.; Gao, X.Z. Graph clustering using k-Neighbourhood Attribute Structural similarity. Appl. Soft Comput. 2016, 47, 216–223. [Google Scholar] [CrossRef]
  57. Ahn, Y.Y.; Bagrow, J.P.; Lehmann, S. Link communities reveal multiscale complexity in networks. Nature 2010, 466, 761–764. [Google Scholar] [CrossRef]
  58. Huang, L.; Wang, G.; Wang, Y.; Blanzieri, E.; Su, C. Link clustering with extended link similarity and EQ evaluation division. PLoS ONE 2013, 8, e66005. [Google Scholar] [CrossRef] [PubMed]
  59. Ding, Z.; Zhang, X.; Sun, D.; Luo, B. Overlapping community detection based on network decomposition. Sci. Rep. 2016, 6, 24115. [Google Scholar] [CrossRef] [PubMed]
  60. Wang, Y.; Bu, Z.; Yang, H.; Li, H.J.; Cao, J. An effective and scalable overlapping community detection approach: Integrating social identity model and game theory. Appl. Math. Comput. 2021, 390, 125601. [Google Scholar] [CrossRef]
  61. Asmi, K.; Lotfi, D.; Abarda, A. The greedy coupled-seeds expansion method for the overlapping community detection in social networks. Computing 2022, 104, 295–313. [Google Scholar] [CrossRef]
  62. Ran, X.; Xi, Y.; Lu, Y. Lu, Y.; Wang, X.; Lu, Z. Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artif. Intell. Rev. 2023, 56, 8219–8264. [Google Scholar] [CrossRef]
  63. Zheng, B.; Wu, H.; Kuang, L.; Qin, J.; Du, W.; Wang, J.; Li, D. A simple model clarifies the complicated relationships of complex networks. Sci. Rep. 2014, 4, 6197. [Google Scholar] [CrossRef] [PubMed]
  64. Lu, Z.; Wahlström, J.; Nehorai, A. Community detection in complex networks via clique conductance. Sci. Rep. 2018, 8, 5982. [Google Scholar] [CrossRef] [PubMed]
  65. Emmons, S.; Kobourov, S.; Gallant, M.; Börner, K. Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE 2016, 11, e0159161. [Google Scholar] [CrossRef]
  66. Hric, D.; Darst, R.K.; Fortunato, S. Community detection in networks: Structural communities versus ground truth. Phys. Rev. E 2014, 90, 062805. [Google Scholar] [CrossRef]
  67. Wagenseller, P.; Wang, F.; Wu, W. Size matters: A comparative analysis of community detection algorithms. IEEE Trans. Computat. Social Syst. 2018, 5, 951–960. [Google Scholar] [CrossRef]
  68. Adraoui, M.; Retbi, A.; Idrissi, M.K.; Bennani, S. Maximal cliques based method for detecting and evaluating learning communities in social networks. Future Gener. Comput. Syst. 2022, 126, 1–14. [Google Scholar] [CrossRef]
  69. Chakraborty, T.; Dalmia, A.; Mukherjee, A.; Ganguly, N. Metrics for community analysis: A survey. ACM Comput. Surv. 2018, 50, 1–37. [Google Scholar] [CrossRef]
  70. Hagen, L.; Kahng, A.B. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 1992, 11, 1074–1085. [Google Scholar] [CrossRef]
  71. Chan, P.K.; Schlag, M.D.F.; Zien, J.Y. Spectral K-way ratio-cut partitioning and clustering. IEEE TCAD 1994, 13, 1088–1096. [Google Scholar] [CrossRef]
  72. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar]
Figure 1. (a) Arbitrary random graph. (b) Partitioning clustering. (c) Overlapping clustering. (d) Hierarchical clustering.
Figure 2. Five possible definitions of an edge degree. (a) Uniformly assigning a value of 1 to each edge. (b) Using the degree at one endpoint of the edge. (c) Product of the endpoint degrees. (d) Sum of the endpoint degrees. (e) Power-function product of the endpoint degrees.
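The edge-degree variants in Figure 2 can be made concrete on a small graph. The snippet below is a minimal sketch, not the authors' implementation: it illustrates three of the five definitions (uniform, sum of endpoint degrees, and product of endpoint degrees) using networkx, with a function name chosen here purely for illustration.

```python
import networkx as nx

# A 4-node path 0-1-2-3; node degrees are 1, 2, 2, 1.
G = nx.path_graph(4)

def edge_degrees(G, mode="sum"):
    """Edge degree of each edge (u, v) under three of the Figure 2 variants:
    'uniform' assigns 1 to every edge, 'sum' adds the endpoint degrees,
    and 'product' multiplies them."""
    result = {}
    for u, v in G.edges():
        du, dv = G.degree(u), G.degree(v)
        if mode == "uniform":
            result[(u, v)] = 1
        elif mode == "sum":
            result[(u, v)] = du + dv
        elif mode == "product":
            result[(u, v)] = du * dv
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result

print(edge_degrees(G, "sum"))      # {(0, 1): 3, (1, 2): 4, (2, 3): 3}
print(edge_degrees(G, "product"))  # {(0, 1): 2, (1, 2): 4, (2, 3): 2}
```

Note how the sum and product definitions rank the central edge (1, 2) highest, which is why such definitions are useful for locating structurally dense regions.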
Figure 3. (a) Edge degree distribution. (b) Identifying the edge with the average degree. (c) Cluster characteristics: internal and external degrees, and relative density in a randomly partitioned graph.
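The quantities outlined in Figure 3 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the sum-of-endpoint-degrees definition of edge degree, and it assumes relative density is the cluster's internal edge count divided by its internal plus external edge counts; `seed_edge` and `relative_density` are names introduced here.

```python
import networkx as nx

def seed_edge(G):
    """Return the edge whose degree (sum of endpoint degrees) is
    closest to the average edge degree, as in Figure 3b."""
    degs = {(u, v): G.degree(u) + G.degree(v) for u, v in G.edges()}
    avg = sum(degs.values()) / len(degs)
    return min(degs, key=lambda e: abs(degs[e] - avg))

def relative_density(G, cluster):
    """Internal edges divided by internal-plus-external edges of a cluster."""
    cluster = set(cluster)
    internal = sum(1 for u, v in G.edges() if u in cluster and v in cluster)
    external = sum(1 for u, v in G.edges() if (u in cluster) != (v in cluster))
    return internal / (internal + external)

# Triangle {0, 1, 2} with a single edge out to node 3:
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print(relative_density(G, {0, 1, 2}))  # 3 internal, 1 external -> 0.75
```

During cluster expansion, adjacent edges would be admitted only while this relative-density value stays at or above the user-specified target.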
Table 1. Characteristics of selected networks.
| Network | V | E | max deg(E) | min deg(E) | avg deg(G, E) | σ(E) |
|---|---|---|---|---|---|---|
| Zachary's Karate Club (ZKC) | 34 | 78 | 28 | 5 | 15 | 5 |
| Aves-Weaver Social (AWS) | 42 | 152 | 43 | 3 | 20 | 9 |
| Dolphins Interaction (DI) | 62 | 159 | 22 | 3 | 13 | 3 |
| Les Misérables (LM) | 77 | 254 | 57 | 2 | 23 | 10 |
| Political Books (PB) | 105 | 441 | 48 | 4 | 22 | 8 |
| American College Football (ACF) | 115 | 613 | 23 | 15 | 20 | 1 |
| Facebook Pages Food (FBPF) | 620 | 2102 | 147 | 2 | 14 | 13 |
| US-Grid Power (USGP) | 4941 | 6594 | 28 | 2 | 7 | 3 |
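The structural columns of Table 1 (V and E) are easy to reproduce; for example, networkx ships Zachary's Karate Club as a built-in graph. The sketch below assumes networkx and the sum-of-endpoint-degrees edge degree; since Table 1's edge-degree columns depend on which Figure 2 definition the authors adopted, those values are printed here for inspection rather than asserted to match.

```python
import statistics
import networkx as nx

# Zachary's Karate Club, the first network in Table 1: 34 nodes, 78 edges.
G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())  # 34 78

# Edge degrees under the sum-of-endpoint-degrees definition.
edge_degs = [G.degree(u) + G.degree(v) for u, v in G.edges()]
print(max(edge_degs), min(edge_degs))
print(round(statistics.mean(edge_degs), 1), round(statistics.pstdev(edge_degs), 1))
```

The same loop, pointed at edge lists for the other seven networks, reproduces the remaining rows.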