Social Network Forensics Analysis Model Based on Network Representation Learning

Zhao, Kuo; Zhang, Huajian; Li, Jiaxin; Pan, Qifu; Lai, Li; Nie, Yike; Zhang, Zhongfei

doi:10.3390/e26070579


by Kuo Zhao ¹,²,³,*, Huajian Zhang ¹, Jiaxin Li ¹, Qifu Pan ¹, Li Lai ¹, Yike Nie ¹ and Zhongfei Zhang ²,³,⁴

¹ School of Intelligent Systems Science and Engineering, Jinan University, Zhuhai 519070, China
² Guangdong International Cooperation Base of Science and Technology for GBA Smart Logistics, Jinan University, Zhuhai 519070, China
³ Institute of Physical Internet, Jinan University, Zhuhai 519070, China
⁴ School of Management, Jinan University, Guangzhou 510632, China
* Author to whom correspondence should be addressed.

Entropy 2024, 26(7), 579; https://doi.org/10.3390/e26070579

Submission received: 1 May 2024 / Revised: 7 June 2024 / Accepted: 9 June 2024 / Published: 7 July 2024

(This article belongs to the Section Complexity)


Abstract:

The rapid evolution of computer technology and social networks has led to massive data generation through interpersonal communications, necessitating improved methods for information mining and relational analysis in areas such as criminal activity. This paper introduces a Social Network Forensic Analysis model that employs network representation learning to identify and analyze key figures within criminal networks, including leadership structures. The model incorporates traditional web forensics and community algorithms, utilizing concepts such as centrality and similarity measures and integrating the Deepwalk, Line, and Node2vec algorithms to map criminal networks into vector spaces. This maintains node features and structural information that are crucial for the relational analysis. The model refines node relationships through modified random walk sampling, using BFS and DFS, and employs a Continuous Bag-of-Words with Hierarchical Softmax for node vectorization, optimizing the value distribution via the Huffman tree. Hierarchical clustering and distance measures (cosine and Euclidean) were used to identify the key nodes and establish a hierarchy of influence. The findings demonstrate the effectiveness of the model in accurately vectorizing nodes, enhancing inter-node relationship precision, and optimizing clustering, thereby advancing the tools for combating complex criminal networks.

Keywords:

network representation learning; social network forensics; node vectorization; node2vec algorithm; gradient update; hierarchical clustering

1. Introduction

The proliferation of mobile internet technology has significantly enhanced communication across various platforms, including phone calls, emails, and social media networks such as WeChat and Weibo. By the end of 2023, China’s Ministry of Industry and Information Technology reported nearly 1.727 billion mobile phone users, and WeChat’s monthly active accounts surpassed 1.327 billion. The convenience, speed, and immediacy of mobile networks have rendered online social interactions indispensable in both personal and professional spheres. Originating from email, social networks have evolved to fulfill a variety of social requirements, resulting in the accumulation of valuable Mobile Communication Data (MCD) and Criminal Incident Reports (CIRs) [1,2,3]. These datasets not only reveal personal relationships but are also instrumental in social network forensics.

In social network forensics, the Social Network Forensic Analysis (SNFA) model is utilized to detect important actors within criminal networks via email interactions, potentially identifying leaders or suspects based on their centrality. This application of the SNFA model highlights its importance in aiding law enforcement agencies in their investigative and judicial endeavors by providing a means to analyze the structure of criminal networks and pinpoint key participants.

However, it is important to note that applying the SNFA model to modern social network apps or communication platforms such as WeChat might not yield effective results due to the presence of irrelevant content and tenuous connections. These platforms often contain vast amounts of non-criminal-related interactions, which can dilute the forensic relevance of the data. This limitation underscores the necessity of carefully selecting and preprocessing data to ensure their applicability in forensic investigations.

Challenges arise owing to the sheer volume and authenticity of interactive data, as well as the heterogeneity of users and information [4,5]. Analyzing MCD and CIRs is crucial for comprehending the structure of criminal networks and pinpointing key participants, thereby aiding law enforcement agencies in their investigative and judicial endeavors [6]. In social network forensics, mobile data reveals user connections and communication content. Forensic methods are evolving with technology, focusing on network data, such as logs and streams, often through real-time capture using sniffing technologies.

Research on web-based forensics predominantly encompasses the analysis of dynamic, real-time information across mobile networks, leveraging sniffing technologies to glean and secure network data flows. In the realm of evidentiary representation, Erbacher, R.F. and Christensen, K. introduced an innovative forensic visualization paradigm in 2006 [7], which enhanced the application of forensic technology in actual crime scenes. Network forensics techniques can be categorized into server-based, client-based, and data-flow-based methodologies. Although the advent of cloud computing has complicated access to server-side data, suggesting limitations to server-based forensics, client-based techniques continue to offer the thorough extraction and analysis of individual computer records. Meanwhile, data-flow forensics provide a holistic snapshot of user activity. However, forensic investigation of web-based social networks is not without challenges, given the security measures of platforms and user privacy controls.

In semantic forensic analysis, Natural Language Processing (NLP) [8] tools are pivotal in identifying incriminating evidence within textual content [9,10]. Foundational NLP models, such as Latent Semantic Analysis (LSA) [11,12], Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) [13,14] serve to reduce noise and unearth thematic structures, whereas algorithms such as TF-IDF [15], BM25 [16], and TextRank [17] play a crucial role in keyword extraction. Techniques such as Hidden Markov Models (HMMs) [18] and Conditional Random Fields (CRFs) [19] are integral to lexical scrutiny. These methodologies enhance texts with poor semantic frameworks, laying the groundwork for advanced NLP tasks, including machine translation and sentiment analysis.

Community detection algorithms trace their origins to early methods such as Girvan and Newman’s GN algorithm [20] and its refined version, the Fast Newman algorithm [21], which address longstanding issues in identifying intricate network relationships, often exacerbated by challenges such as community overlap [22,23] and dynamic network evolution [24,25]. Cutting-edge solutions, such as the Louvain Algorithm [19], Infomap [26], the Label Propagation Algorithm (LPA) [27], and the Girvan–Newman Algorithm [28], have substantially improved the process of community detection. These algorithms help identify densely connected groups of nodes within a network, shedding light on complex structures and their evolution. They utilize distinct approaches and metrics, such as modularity, to increase the accuracy of community delineation. Nevertheless, the analysis of multilayer networks remains intricate because of the complex interconnections across layers, which calls for more sophisticated analytical tools.

Advancements in the computational power and proliferation of social networks have transformed social network forensics into an essential field of research. The aforementioned methods, limited to the analyses of single-layer networks, often neglect the complexity of the node characteristics and indirect relationships [29]. Traditional forensic methods tend to overlook the multi-dimensional nature of social networks, focusing primarily on surface-level interactions and failing to capture the full extent of relationships within the network. To overcome these limitations, our study introduces a multidimensional network representation learning approach [30] that captures both direct and indirect node connections.

The proposed SNFA model, an advanced hierarchical system, integrates network attributes such as small-world phenomena and scale-free properties to enhance node evaluation. By combining Node2vec [31] with the Continuous Bag-of-Words (CBOW) approach [32] and hierarchical Softmax [33], the model fine-tunes the node sampling and gradient optimization processes that correspond to the inherent stratification of networks. Furthermore, hierarchical clustering techniques are utilized to identify key nodes, significantly improving the identification of pivotal elements within criminal networks. Innovations of the SNFA model include (1) employing network representation learning to represent criminal networks in vector space, streamlining the analysis of node relationships, inclusive of those among non-adjacent nodes; (2) enhancing the random walk sampling procedure and gradient update mechanism in the CBOW model’s output layer via Node2vec; and (3) leveraging hierarchical clustering to achieve comprehensive clustering structures and identify optimal central values, using diverse distance metrics to determine the significance of nodes at the core of the network.

Here are the major contributions of this paper:

Introduction of SNFA Model: This paper introduces the SNFA model, which employs network representation learning to identify and analyze key figures within criminal networks, including leadership structures.
Utilization of Network Representation Learning: The model incorporates traditional web forensics and community algorithms, utilizing concepts such as centrality and similarity measures, and integrating algorithms such as Deepwalk, Line, and Node2vec to map criminal networks into vector spaces while maintaining node features and structural information.
Innovative Node Relationship Refinement: The model refines node relationships through modified random walk sampling using BFS and DFS and employs a Continuous Bag-of-Words with Hierarchical Softmax for node vectorization, optimizing value distribution via the Huffman tree.
Enhanced Forensic Analysis Techniques: By leveraging hierarchical clustering and distance measures (cosine and Euclidean), the model identifies the key nodes and establishes a hierarchy of influence. This approach demonstrates the model’s effectiveness in accurately vectorizing nodes, enhancing inter-node relationship precision, and optimizing clustering.
Application and Evaluation: The application of the model to Enron emails serves as a case study to compare metrics with two other models, highlighting its practical utility in forensic investigations.

In the subsequent sections of this paper, we delve into various facets of the Social Network Forensic Analysis (SNFA) model to elaborate on its framework and functionality. Section 2, ‘Materials and Methods’, provides an in-depth examination of the model architecture, focusing on the selection and analysis of nodes within social networks and the computational strategies employed to enhance forensic analysis accuracy. Section 3, ‘Experiment and Results’, presents the outcomes of our empirical tests, demonstrating the efficacy of the SNFA model through a comparative analysis with existing forensic approaches. Finally, Section 4, ‘Conclusions and Outlook’, summarizes the contributions of this research and discusses potential avenues for further enhancements and applications of the SNFA model in the field of digital forensics. Through these discussions, we aim to illustrate the robustness of our approach in improving the precision and scalability of social network forensics.

2. Materials and Methods

In this section, we introduce the SNFA model, a social network forensic analysis framework based on representation learning methods. Due to the characteristics of social networks such as scale-freeness, small-world phenomena, community structures, and hierarchical nature, SNFA constructs a hierarchical network model to identify key nodes. The following sections analyze several aspects: the selection of computable nodes in the network, improvements in the sampling encoding model and random walk strategies, gradient computation, layered clustering division of the network, and acquisition and calculation of the network’s core nodes.

2.1. Enhanced Node Sampling Precision

The accuracy of node sampling within the realm of social network forensics is essential for precise identification and analysis of key nodes in criminal networks. The complexity and vastness of social networks require improved sampling accuracy to efficiently identify the nodes of interest, reduce computational demands, and enhance the effectiveness of forensic analyses. This section will present the algorithmic enhancements integrated into our Social Network Forensic Analysis (SNFA) model that aim to refine node sampling accuracy. Essential to uncovering the structure of criminal activities, these advancements facilitate the accurate identification of critical nodes in forensic analysis. We explain the improvements to our node sampling methodology, covering both theoretical foundations and practical applications, to offer a detailed understanding of our contribution to network forensics. The success of social network forensic analysis largely depends on the node sampling precision, which impacts the identification accuracy of key nodes in criminal networks. While traditional methods are broadly effective, they often lack the required specificity for forensic analysis, where pinpointing behaviorally significant nodes is paramount. We have thus significantly enhanced the Node2vec algorithm, a key component of our SNFA model, and developed new sampling strategies to improve precision.

Modifications to Node2vec:

Our modified Node2vec algorithm incorporates a weighted random walk mechanism that prioritizes nodes based on their forensic relevance, diverging from the traditional approach that treats all nodes with equal importance during the walk. By integrating forensic relevance as a criterion for node selection, our approach ensures that nodes with higher potential forensic values are sampled more frequently, thus increasing the focus of the model on areas of the network that are most likely to yield valuable insights.

Mathematically, we introduce a weighting function, $W(n)$, which assigns a forensic relevance score to each node, $n$, based on predefined criteria, such as node centrality, frequency of communication, and anomalous behavior patterns. The probability of transitioning from node $i$ to node $j$ during a random walk is then adjusted as follows:

$P(i \to j) = \frac{\alpha \cdot W(j)}{\sum_{k \in N(i)} W(k)},$

(1)

where $\alpha$ is a normalization factor ensuring that the probabilities sum to 1 and $N(i)$ represents the set of neighbors of node $i$.

Novel Sampling Strategies:

Beyond the modifications to Node2vec, we introduce a novel sampling strategy that further refines the selection of nodes for analysis. This strategy employs a two-tiered approach, initially segmenting the network into clusters based on structural and behavioral similarities, and subsequently applying our enhanced Node2vec algorithm within each cluster. This approach allows for more focused sampling within areas of the network that are homogenous in terms of forensic characteristics, thereby improving the overall precision of node sampling.

To operationalize these improvements, we present the following pseudocode, illustrating the modified Node2vec algorithm with an integrated weighting function for forensic relevance:

Algorithm 1 marks a pivotal enhancement in our model, integrating forensic relevance to refine the random walk process for social network forensic analysis. This modification of the Node2vec algorithm, as outlined in Algorithm 1, highlights the crucial role of domain-specific insights in algorithmic development. By assigning priority to nodes based on forensic significance, we enhanced the precision of identifying crucial network nodes and boosted investigative efficiency and depth. This advancement aligns with our Social Network Forensic Analysis (SNFA) model’s goal of offering a focused, accurate, and comprehensive tool for dissecting criminal activity in complex networks. This incorporation of forensic relevance is set to enrich forensic analysis and open new investigative pathways in the dynamic field of digital forensics.

Algorithm 1. Enhanced Node2vec.

Input: Graph $G(V, E)$, Weighting function $W(n)$, Walk length $L$, Start node $s$
Output: Sequence of nodes $S$ representing a weighted random walk

1. Initialize sequence $S$ with start node $s$
2. for $i = 1$ to $L$ do
    2.1 Let $c$ be the current node (last node in $S$)
    2.2 Calculate total weight $T = \sum_{n \in N(c)} W(n)$
    2.3 For each neighbor $n$ of $c$, calculate transition probability $P(c \to n) = W(n) / T$
    2.4 Select next node $n'$ based on transition probabilities $P(c \to n)$
    2.5 Append $n'$ to sequence $S$
3. return $S$
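Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the adjacency-dictionary layout, the function names, the toy scores, and the choice to stop early at a node without neighbors are all assumptions made for the example.

```python
import random


def weighted_random_walk(graph, weight, walk_length, start, rng=None):
    """Sample one walk as in Algorithm 1.

    graph:  dict mapping node -> list of neighbor nodes (assumed layout)
    weight: dict mapping node -> forensic relevance score W(n), assumed > 0
    """
    rng = rng if rng is not None else random.Random()
    walk = [start]
    for _ in range(walk_length):
        current = walk[-1]
        neighbors = graph.get(current, [])
        if not neighbors:
            break  # dead end: stop early (an assumption, not in Algorithm 1)
        scores = [weight[n] for n in neighbors]
        # random.choices normalizes the weights, so P(c -> n) = W(n) / T.
        walk.append(rng.choices(neighbors, weights=scores, k=1)[0])
    return walk


def transition_probabilities(graph, weight, current):
    """Return the explicit P(c -> n) = W(n) / T for every neighbor n."""
    neighbors = graph[current]
    total = sum(weight[n] for n in neighbors)
    return {n: weight[n] / total for n in neighbors}


# Toy triangle network with hypothetical relevance scores.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
weight = {"a": 1.0, "b": 3.0, "c": 1.0}

walk = weighted_random_walk(graph, weight, walk_length=5, start="a",
                            rng=random.Random(0))
probs = transition_probabilities(graph, weight, "a")
```

From node `a`, the neighbor `b` carries three times the relevance score of `c`, so it is chosen with probability 0.75 against 0.25 for `c`.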

2.2. Selection of Computable Nodes

In the complex domain of social network forensics, the scale-free nature of networks means that many nodes have low degrees and their connections do not carry significant information [34]. Including these nodes with minimal informational content not only fails to improve experimental outcomes but also increases computational time and algorithm complexity.

Within these networks, certain nodes may be suspicious, thus introducing the concepts of node synchronization and anomalies. Synchronization refers to the high similarity and parallel behaviors among nodes, where excessive synchronization can compromise the experimental accuracy. Anomaly, on the other hand, indicates a node’s deviation in behavior from the majority, reducing the accuracy if such nodes are dispersed within the network graph.

Moreover, incorporating numerous low-degree nodes can invalidate the sampling parameters in algorithms such as Node2vec, which uses parameters p and q to balance the Breadth-First Search (BFS) [35] and Depth-First Search (DFS) [36] in random walks. If a node’s degree is too low, adjusting these parameters does not alter the random walk pattern: in a social network, a node with a single connection always leads to the same random walk sequence, regardless of the values of p and q.

To circumvent these issues, rules for selecting computable nodes and eliminating low degree, synchronized, or anomalous nodes from the network are necessary for more efficient and accurate computations. The experimental criteria for computable nodes include the following:

Formulas for synchronization and normalization were utilized to filter out synchronized and anomalous nodes in the network [37]. Node $u$’s synchronization is defined by the similarity between every pair of nodes, as shown in Equation (2), and its normalization by the average similarity with the majority of the other nodes, as in Equation (3). Here, $c(v, v')$ represents the similarity between nodes $v$ and $v'$, $O(u)$ is the set of target nodes for node $u$, $d_o(u)$ is the count of outward nodes, i.e., the size of $O(u)$, and $N$ is the total number of nodes. These metrics are influenced by connected nodes, akin to the PageRank algorithm.

$sync(u) = \frac{\sum_{(v, v') \in O(u) \times O(u)} c(v, v')}{d_o(u) \times d_o(u)}$

(2)

$norm(u) = \frac{\sum_{(v, v') \in O(u) \times u} c(v, v')}{d_o(u) \times N}$

(3)

Selecting nodes in the network with a degree greater than 1.

The total communication count of a node divided by the number of nodes it communicates with should exceed the average total communication count per node across the network. The selection rule is as follows:

$\begin{cases} \frac{W(u)}{P(u)} \geq \frac{\sum_{i=1}^{N} W(i)}{\sum_{i=1}^{N} P(i)}, & i, u \in G,\ N = |G| \\ P(u) > 1 \end{cases}$

(4)
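The degree and communication-ratio rule of Equation (4) can be illustrated with a short Python sketch. It assumes an undirected edge list of `(u, v, count)` tuples, where `count` is the number of messages exchanged. The function name and toy data are hypothetical, and the synchronization and normalization filters of Equations (2) and (3) are deliberately left out.

```python
from collections import defaultdict


def select_computable_nodes(edges):
    """Keep nodes u with P(u) > 1 and W(u)/P(u) >= sum(W) / sum(P).

    edges: iterable of (u, v, count) for an undirected weighted graph
           (assumed input layout).
    """
    total = defaultdict(int)     # W(u): total communication count of u
    contacts = defaultdict(set)  # distinct partners; P(u) = len(contacts[u])
    for u, v, count in edges:
        total[u] += count
        total[v] += count
        contacts[u].add(v)
        contacts[v].add(u)

    # Right-hand side of Equation (4): network-wide average count per contact.
    threshold = sum(total.values()) / sum(len(c) for c in contacts.values())

    return {
        u for u in total
        if len(contacts[u]) > 1
        and total[u] / len(contacts[u]) >= threshold
    }


# Toy network: a and b exchange many messages, d has a single contact.
edges = [("a", "b", 10), ("a", "c", 8), ("b", "c", 1), ("c", "d", 1)]
selected = select_computable_nodes(edges)
```

In this toy network the threshold is 40 / 8 = 5. Node `c` is dropped because its ratio (10 / 3) falls below the threshold, and node `d` is dropped because it has only one contact.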

2.3. Node Sampling and Encoding

In social network analysis, node sampling and encoding are critical for converting nodes into computational representations. The Node2vec algorithm [31], based on the foundation of Word2vec [38,39] from natural language processing, improves the adaptation for analyzing network structures. It treats nodes in a manner akin to words in natural language and undergoes a sampling process to represent their interrelations and properties.

Random walks across the network generate “sentences” of nodes as part of the sampling phase, reflecting the contextual background of each node. The frequency of sampling and depth of the walks are integral to defining the degree of connection between nodes, with more frequently sampled nodes suggesting closer ties. Node2vec’s parameters, p and q, modulate the sampling between breadth-first and depth-first explorations, enhancing the detection of less direct relationships.

Encoding follows sampling and translates the nodes into computer-readable formats. Although one-hot encoding is straightforward, it becomes unfeasible for larger networks owing to scalability issues. Vector encoding methods [40], central to Deepwalk [41], Word2vec, and Node2vec, address this by mapping each node to a dense, low-dimensional vector, reducing the computational load and better capturing nodal features. This vector space model aids in machine learning, allowing quantitative analysis of node relationships and similarities.

Tomas Mikolov introduced two formative models in 2013, Skip-gram and Continuous Bag-of-Words (CBOW) [42], applying neural network approaches for effective word vector encodings. These models accelerate training using simplified architectures and underpin statistical language modeling techniques.

The CBOW model predicts the probability of a node from the context of adjacent nodes and consists of three layers: input, projection, and output. Given a target node, $w_i$, within its context, represented as $\{w_{i-c}, \dots, w_{i-1}, w_i, w_{i+1}, \dots, w_{i+c}\}$, the CBOW constructs a context set from the surrounding nodes. The model endeavors to maximize its likelihood function, $P = p(w \mid Context(w))$, across all nodes $N$.

CBOW’s three-tier architecture is depicted in Figure 1.

The input layer processes 2c neighboring nodes around $w_{i}$ as vectors, with the window size c set to 2.
The projection layer aggregates input vectors into a summary vector.
The output layer forms a Huffman tree, whereby terminal nodes align with network nodes and non-terminal nodes embody vector representations. In the diagram, non-terminal nodes are highlighted in yellow to indicate their role in the hierarchical structure of the tree, as they are responsible for the intermediate calculations in the encoding process, while the terminal nodes are not highlighted as they represent the final output in the Huffman tree structure.
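To make the output layer more concrete, the sketch below builds a Huffman tree and reads off a binary code for each network node. It follows the usual Word2vec convention of using token frequencies as tree weights. Taking those frequencies from node visit counts in sampled walks is an assumption here, as are the function name, the toy walks, and the use of string node labels.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(frequencies):
    """Return {node: binary code string}; frequent nodes get shorter codes.

    Node labels are assumed to be strings so that internal tree nodes
    (stored as 2-tuples) can be told apart from leaves.
    """
    tie = count()  # unique tie-breaker so heapq never compares payloads
    heap = [(freq, next(tie), node) for node, freq in frequencies.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}

    # Repeatedly merge the two least frequent subtrees.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    # Walk the tree: "0" for the left branch, "1" for the right branch.
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        tree, prefix = stack.pop()
        if isinstance(tree, tuple):
            stack.append((tree[0], prefix + "0"))
            stack.append((tree[1], prefix + "1"))
        else:
            codes[tree] = prefix
    return codes


# Hypothetical sampled walks; visit counts act as node frequencies.
walks = [["a", "b", "a", "c"], ["a", "d", "a", "b"]]
frequencies = Counter(node for walk in walks for node in walk)
codes = huffman_codes(frequencies)
```

Each code is the path from the root to a terminal node, and every non-terminal node along that path would hold one trainable vector in Hierarchical Softmax. Frequently sampled nodes sit closer to the root and therefore need fewer binary decisions per update.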

In contrast to CBOW, the Skip-gram model infers the surrounding nodes from the vector encoding of a single node. Both models employ Negative Sampling [43] and Hierarchical Softmax, with the CBOW model’s optimization aiming to condense the Huffman tree encodings for efficiency and encoding fidelity.

2.4. Constructing the SNFA Forensic Model

While constructing the SNFA forensic model, two crucial concepts are introduced to reflect the key roles of individuals, or nodes, within social networks: activity and significance. The activity level, denoted as $W(u)$, indicates the communication frequency of node $u$, with higher values representing more active nodes. Forensic significance, expressed as $imp(u) = P(u) W(u)$, combines the unique number of communicating nodes, $P(u)$, with the total communication count, $W(u)$, to gauge the importance of a node within the network.
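As a small worked example of these two quantities, the following sketch computes $W(u)$, $P(u)$, and $imp(u) = P(u) W(u)$ from an undirected edge list and ranks the nodes by significance. The `(u, v, count)` input layout, the function name, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def node_significance(edges):
    """Return {u: imp(u)} with imp(u) = P(u) * W(u).

    edges: iterable of (u, v, count) for an undirected weighted graph
           (assumed input layout).
    """
    activity = defaultdict(int)  # W(u): total communication count
    contacts = defaultdict(set)  # P(u) = number of distinct partners
    for u, v, count in edges:
        activity[u] += count
        activity[v] += count
        contacts[u].add(v)
        contacts[v].add(u)
    return {u: len(contacts[u]) * activity[u] for u in activity}


edges = [("a", "b", 10), ("a", "c", 8), ("b", "c", 1), ("c", "d", 1)]
significance = node_significance(edges)

# Nodes ordered from most to least significant.
ranking = sorted(significance, key=significance.get, reverse=True)
```

Node `c` communicates less often than `b` in total, yet it ranks higher because it reaches three distinct contacts rather than two, which is the effect of multiplying activity by the number of partners.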

Utilizing vectors for representation, the SNFA model employs the Node2vec algorithm, initially based on the CBOW model, to capture the importance of the nodes. Social networks are modeled as undirected weighted graphs, where individuals are nodes and communication frequencies form edge weights. Node2vec’s random walk mechanism, adjusted by parameters p and q, navigates the trade-off between exploring local and global network structures. However, this adjustment is network-wide and overlooks the nuances of the individual node relationships. The probability of transitioning between two nodes is weighted according to their relative communication frequency, with respect to the total communications initiated by the starting node, and is formalized as follows:

$k = \frac{w_{t,x}}{W(t)}, \quad (t, x) \in E,$

(5)

with $w_{t,x}$ being the weight of edge $(t, x)$ in the network.
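Equation (5) amounts to dividing each edge weight by the total communication count of the starting node. A minimal sketch, assuming a nested-dictionary adjacency structure in which `adjacency[t][x]` stores $w_{t,x}$ (the function name and the counts are hypothetical):

```python
def relative_frequency_weights(adjacency, t):
    """Return k = w_{t,x} / W(t) for every neighbor x of node t."""
    total = sum(adjacency[t].values())  # W(t): all communications of t
    return {x: w / total for x, w in adjacency[t].items()}


# Hypothetical counts: t exchanged 6, 3 and 1 messages with x, y and z.
adjacency = {
    "t": {"x": 6, "y": 3, "z": 1},
    "x": {"t": 6},
    "y": {"t": 3},
    "z": {"t": 1},
}
weights = relative_frequency_weights(adjacency, "t")
```

The resulting factors sum to 1 over the neighbors of `t`, so a contact that accounts for most of a node's traffic is favored during the walk no matter how active the node is overall.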

The transition probability from one source node to another within a communication network is defined as follows:

$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \frac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases},$

(6)

with $\pi_{vx}$ being the unnormalized transition probability and $Z$ a normalization constant.
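The sketch below shows one way Equation (6) can be evaluated. It uses the search bias of the original Node2vec formulation: a candidate's edge weight is scaled by $1/p$ when the walk returns to the previous node, by $1$ when the candidate is also a neighbor of the previous node, and by $1/q$ otherwise. How the SNFA model combines this bias with the factor in Equation (5) is not spelled out here, so the combination, the function name, and the toy graph are assumptions.

```python
def node2vec_transition(adjacency, prev, cur, p=1.0, q=1.0):
    """Return P(c_i = x | c_{i-1} = cur) for a walk that arrived from prev.

    adjacency: dict node -> {neighbor: edge weight} (assumed layout).
    """
    unnormalized = {}
    for x, w in adjacency[cur].items():
        if x == prev:
            alpha = 1.0 / p   # step back to the previous node
        elif x in adjacency[prev]:
            alpha = 1.0       # stay close to prev (BFS-like move)
        else:
            alpha = 1.0 / q   # move further away (DFS-like move)
        unnormalized[x] = alpha * w   # pi_vx

    z = sum(unnormalized.values())    # normalization constant Z
    return {x: pi / z for x, pi in unnormalized.items()}


# Toy undirected graph with unit weights.
adjacency = {
    "t": {"v": 1, "a": 1},
    "v": {"t": 1, "a": 1, "b": 1},
    "a": {"t": 1, "v": 1},
    "b": {"v": 1},
}
probs = node2vec_transition(adjacency, prev="t", cur="v", p=2.0, q=0.5)

# A degree-one node has a single possible move whatever p and q are.
leaf = node2vec_transition(adjacency, prev="v", cur="b", p=0.1, q=10.0)
```

With `p=2.0` and `q=0.5` the walk at `v` prefers the outward move to `b` (probability 4/7) over `a` (2/7) and the return to `t` (1/7). The `leaf` case illustrates why nodes with a single connection are insensitive to both parameters.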

The SNFA model applies Node2vec for node sampling and encoding, representing social network G(V, E) nodes as vectors [41], which encapsulates node attributes and network structure. Given the extensive data, acceleration algorithms, such as Negative Sampling and Hierarchical Softmax, were employed, with the latter chosen for this study. The Hierarchical Softmax algorithm facilitates efficient vector updates, evenly distributing the gradient updates across all nodes to prevent the over-amplification of values. The structural diagram of the algorithm is shown in Figure 2.

This approach introduces the concept of average contribution by dividing the update value from each gradient ascent equally among nodes, aligning it more closely with the network’s structure and ensuring a balanced update across the network.

2.5. Utilization of Hierarchical Clustering and Distance Formulas

In the intricate task of analyzing criminal networks on social media, our Social Network Forensic Analysis (SNFA) model utilizes hierarchical clustering and distance formulas to improve the identification and ranking of core nodes. This section explains the application of these methodologies within the SNFA model and highlights their importance in forensic analysis.

The SNFA model makes use of Agglomerative Hierarchical Clustering (AHC), which is well-suited for dissecting the layered structures inherent in criminal networks. Unlike partitioning clustering methods that require a predefined number of clusters, AHC merges nodes or existing clusters iteratively, starting from individual nodes, thereby naturally adapting to the network’s structure. This bottom-up approach aids in uncovering nested communities that mirror the operational hierarchies commonly observed in criminal networks.

The benefit of employing AHC in our model lies in its capability to unveil subtle groupings within the network, which is essential for comprehending the roles and relationships among individuals involved in criminal activities. By constructing a dendrogram, AHC provides a visual and analytical framework for evaluating the connectivity and proximity of nodes, enabling forensic analysts to pinpoint pivotal nodes and their spheres of influence within the network.

To merge clusters effectively and identify core nodes, the SNFA model utilizes a combination of Euclidean distance and cosine similarity distance measures. Euclidean distance is crucial for measuring the direct dissimilarity between nodes, facilitating the initial clustering of closely connected individuals. This straightforward calculation renders it a reliable choice for identifying distinct subgroups within a network.

Cosine similarity, represented by Equation (7), measures the angle between the vector representations of two nodes and provides insights into the similarity in their interaction patterns rather than their magnitude. This metric is especially valuable in forensic scenarios where communication patterns, including frequency, timing, and common contacts, are more critical than the mere volume of communication.

$CosineSimilarity(n_i, n_j) = \frac{\vec{n_i} \cdot \vec{n_j}}{\|\vec{n_i}\| \|\vec{n_j}\|}$

(7)

By integrating these distance measures, the SNFA model identifies core nodes by evaluating their direct interactions and behavioral similarities, thereby enhancing the accuracy of the model in pinpointing individuals central to criminal activities.
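The difference between the two measures is easy to see on small vectors. The following sketch uses only the Python standard library; the two-dimensional embeddings are made up for illustration, and both vectors are assumed to be non-zero so that Equation (7) is defined.

```python
import math


def cosine_similarity(a, b):
    """Equation (7): dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.dist(a, b)


# Hypothetical node embeddings.
u = [1.0, 2.0]
v = [2.0, 4.0]    # same direction as u, twice the magnitude
w = [2.0, -1.0]   # orthogonal to u

same_direction = cosine_similarity(u, v)
magnitude_gap = euclidean_distance(u, v)
orthogonal = cosine_similarity(u, w)
```

Vectors `u` and `v` point in the same direction, so their cosine similarity is at its maximum even though they are separated by a non-zero Euclidean distance. This is the sense in which cosine similarity captures interaction patterns while Euclidean distance also reflects magnitude.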

The implementation of Agglomerative Hierarchical Clustering and sophisticated distance formulas within the SNFA model significantly improves our ability to dissect and comprehend criminal networks on social media. By meticulously clustering nodes based on their interactions and behavioral patterns, we enhanced the forensic analysis process, enabling more precise and detailed identification of core nodes. This approach aligns with the overarching objective of providing a comprehensive tool for social network forensics and offers a scalable and adaptable solution to the challenges posed by the intricate and dynamic nature of criminal networks.

2.6. Acquiring Key Figures in Social Network Forensics

In social network forensics, the SNFA model is utilized to detect important actors within criminal networks via email interactions, potentially identifying leaders or suspects based on their centrality. Core node identification assesses two key metrics: the frequency of communication, denoted by $W(u)$, and the network-wide extent of communication, expressed as follows:

$imp(u) = P(u) W(u).$

(8)

Vector representation, adopted by the SNFA model through techniques such as Node2vec, is shaped by the community structure and hierarchical nature inherent to social networks. Clustering algorithms are then executed to determine the core nodes and rank their similarities to gauge their status in criminal networks.

Machine learning employs unsupervised learning for clustering when training samples are unlabeled. This approach categorizes samples into distinct clusters for pattern identification. Classic K-Means clustering iteratively groups samples into k clusters based on their centrality [44]. However, initial cluster center selection and outlier sensitivity can compromise K-Means accuracy, issues K-Means++ seeks to address through strategic initial center placements [45].

Considering the community and hierarchical structures of social networks, the SNFA model implements hierarchical clustering [46] to define an optimal cluster count, leveraging AGNES to merge the nearest clusters progressively using the average linkage criteria. Hierarchical clustering is notable for enabling a comprehensive one-time clustering execution, providing flexibility in cluster quantity [47].
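The bottom-up merging performed by AGNES with average linkage can be sketched in pure Python as follows. This is an illustrative, unoptimized version that recomputes all pairwise linkages at every step. The function names, the use of Euclidean distance, the explicit `n_clusters` stopping rule, and the toy points are assumptions; a production analysis would normally rely on an established clustering library.

```python
import math
from itertools import combinations


def average_linkage(c1, c2, points):
    """Mean pairwise Euclidean distance between two clusters of indices."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agnes(points, n_clusters=1):
    """Merge the two nearest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history, which plays the role of the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical 2-D embeddings: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agnes(points, n_clusters=3)
```

Calling `agnes(points)` with the default `n_clusters=1` runs the merging to completion, so `merges` then holds the full hierarchy from which any number of clusters can be read off afterwards.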

Hierarchical clustering identifies a set of core nodes, N, of varying significance. The SNFA model correlates the importance of these nodes with sentencing data from the Enron corpus to discern leaders within the network. Vectors represent nodes and their interrelations, with vector similarity commonly measured by Minkowski, Euclidean, or Manhattan metrics [48]. However, a more nuanced node relationship analysis emerges when cosine similarity is combined with Euclidean distance.

The aggregated average cosine and Euclidean distances of the core nodes to others in N facilitate a multi-faceted importance assessment. The clustering pseudocode of the SNFA model specifies the steps for computable node selection, Node2vec encoding, and hierarchical clustering application for core node detection. This integration of distance measures refines the significance rankings, highlighting the key influencers of the network.

The streamlined structure of the SNFA model, depicted in Figure 3, commences with social network initialization using either the MCD or CIR datasets. The selection of computable nodes adheres to established network principles, and Node2vec translates the criminal network into a vector space, optimizing both sampling and gradient computations. Hierarchical clustering then builds a complete dendrogram to identify the optimal cluster centroids. Through iterative refinement, the significance of core nodes is quantified and ordered, aiding in the identification of pivotal actors or suspects in criminal networks [49].

3. Experiment and Results

This section describes the experiments conducted with the SNFA model, covering the data background, data preprocessing, experimental parameters, and final results. The results are compared with those of the forensic models LogAnalysis and CrimeNet Explorer using standard accuracy measures, namely recall, precision, and the F-measure, to demonstrate the construction and analytical capabilities of the SNFA model.

3.1. Experimental Procedure

Experimental data primarily take two forms: Mobile Communication Data (MCD) and Criminal Incident Reports (CIRs). Despite the wealth of information available from communications on platforms such as QQ, WeChat, Weibo, email, and Twitter, these data frequently include irrelevant content and tenuous connections, lacking robust evidence of criminal networks, and thus are unsuitable for the SNFA model. In light of considerations such as data relevance, structure, and volume, the SNFA model leverages Enron Corporation’s email dataset, a resource originating from a company renowned for its significant place in criminal network history.

Enron Corporation once led the energy, natural gas, and telecommunications sectors and was praised by Fortune magazine for its innovation, yet it collapsed into bankruptcy within a matter of weeks. Investigations revealed widespread financial fraud, implicating numerous executives in elaborate criminal undertakings [50]. Because email was the primary form of communication within Enron, the email archives became a key asset for investigators, ultimately leading to the conviction of 28 executives, including a 120-year sentence for CEO Kenneth Lay. The public release of the Enron email dataset has prompted analyses that seek to dissect criminal associations and network structures and to identify suspects [51]. The SNFA model adopts the Enron email dataset because it is directly linked to criminal activity, has a structurally intricate network, and allows experimental findings to be compared with actual judicial outcomes. Table 1 lists the convicted Enron executives.

Comprising everyday communication among staff, the Enron email compilation was rife with anomalies and superfluity. Preprocessing efforts are vital for standardization, catering to the node vectorization of SNFA, which necessitates data in an encoded format. The encoding process transformed individuals within the dataset into a uniform node format, preparing the data for incorporation into network representation learning algorithms, culminating in three distinct preprocessed datasets: a communication edge table, a node table, and a node communication table, as shown in Table 2.

Table 2a outlines the communication edge table, where the first two columns list the encoded identifiers of the sender and receiver nodes, respectively, and the third column quantifies the interaction strength between them; together these columns define an elementary network graph [52]. Table 2b pairs each node identity with its encoded number, thus simplifying node recognition. Table 2c lists, for each node, the total number of communications and the number of distinct nodes it is connected to, reflecting network activity and guiding the selection of analyzable nodes.
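The relationship between the edge table and the per-node statistics can be sketched as follows. This is an illustrative assumption rather than the authors' preprocessing code: the function names, the decision to count a message for both sender and receiver, and the `min_contacts` threshold are hypothetical. The three sample edges are rows taken from Table 2a.

```python
from collections import defaultdict


def node_activity(edges):
    """edges: iterable of (sender, receiver, weight) rows as in Table 2a.

    Returns a dict node -> (total communication weight, number of distinct contacts).
    Assumption: each edge counts towards both of its endpoints.
    """
    messages = defaultdict(float)
    contacts = defaultdict(set)
    for sender, receiver, weight in edges:
        for a, b in ((sender, receiver), (receiver, sender)):
            messages[a] += weight
            contacts[a].add(b)
    return {node: (messages[node], len(contacts[node])) for node in messages}


def select_nodes(stats, min_contacts):
    """Keep nodes with at least min_contacts distinct contacts (hypothetical rule)."""
    return sorted(n for n, (_, k) in stats.items() if k >= min_contacts)


edges = [(2676, 409, 3.0), (2114, 409, 2.0), (481, 614, 4.0)]
stats = node_activity(edges)
print(stats[409])                           # weight and contact count for node 409
print(select_nodes(stats, min_contacts=2))  # nodes passing the example threshold
```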

Although the Enron email dataset provides a valuable and detailed source of communication data relevant to criminal network analysis, it is crucial to acknowledge its limitations in terms of generalizability. The dataset is specific to a particular corporate environment and time period, which may not fully represent the dynamics of modern communication platforms such as social media or contemporary email systems. These differences can affect the applicability of the findings to broader contexts. Consequently, although the results obtained using the Enron dataset are insightful and demonstrate the effectiveness of the SNFA model, caution should be exercised when extending these findings to other settings. Modern social networks and communication platforms such as WeChat, Twitter, and others differ significantly in terms of user behavior, content types, and network structures. Therefore, the insights gained from the Enron dataset may not fully translate into these contemporary platforms. This limitation should be considered when interpreting results and assessing the broader applicability of the SNFA model.

3.2. Parameters Settings

The SNFA model is evaluated with three common metrics: Precision, Recall, and F-value. Higher values of these metrics indicate a better fit and more accurate results, whereas lower values suggest a poorer fit. The metrics are defined in terms of the Confusion Matrix, in which rows represent actual categories, columns represent predicted categories, and each cell holds the number of samples falling into that combination. As shown in Table 3, $T$ (true) and $F$ (false) indicate whether a prediction is correct, while $P$ (positive) and $N$ (negative) indicate the predicted class. Thus, $TP$ denotes true positives, $TN$ true negatives, $FP$ false positives, and $FN$ false negatives.

Precision is the proportion of samples predicted as positive that are actually positive, as defined by:

$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{9}$$

Recall is the proportion of actual positive samples that are correctly identified, as defined by:

$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{10}$$

The F-value is the harmonic mean of Precision and Recall. Because it balances the two, it is frequently regarded as the most critical of the three metrics:

$$\mathrm{F\text{-}value} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{11}$$
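The three definitions translate directly into code. The sketch below is illustrative only: the function name and the sets of predicted and actual key figures are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the text.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-value) from confusion-matrix counts.

    A zero denominator yields 0.0 by convention (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical example: nodes flagged by a model versus nodes actually convicted.
predicted = {"a", "b", "c", "d"}
actual = {"a", "b", "e"}
tp = len(predicted & actual)  # flagged and convicted
fp = len(predicted - actual)  # flagged but not convicted
fn = len(actual - predicted)  # convicted but missed
precision, recall, f_value = precision_recall_f(tp, fp, fn)
print(precision, recall, f_value)
```

True negatives do not appear in any of the three formulas, which suits this setting: the large majority of nodes in a communication network are not key figures, and counting them would inflate the scores.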

Given the SNFA model’s foundation in network representation learning, representing nodes as vectors containing information and network structure, choosing the appropriate parameters is crucial for accurate and efficient representation. The parameters of the Node2vec algorithm significantly affect the process and results. Controlled variable experiments were conducted to understand the influence of the individual parameters and their interactions. The initial parameters were set to t = 30, d = 128, W = 10, γ = 40, p = 2, and q = 0.5, focusing on the learning dimensions and sampling frequencies.
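To make the role of these parameters concrete, the following sketch shows a second-order biased random walk in the style of Node2vec on an unweighted, undirected graph. It is a simplified illustration, not the SNFA implementation: the function names and the toy graph are hypothetical, edge weights are ignored, and mapping t to the walk length and γ to the number of walks started from each node is an assumption based on common usage.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """One biased walk. graph: dict node -> set of neighbours."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(graph[cur])
        if not neighbours:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay close to the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move further away (DFS-like)
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node, length, p, q, seed=0):
    """Start walks_per_node walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        walks.extend(node2vec_walk(graph, n, length, p, q, rng) for n in nodes)
    return walks


# Hypothetical toy graph; p and q follow the initial settings in the text.
toy_graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
walks = generate_walks(toy_graph, walks_per_node=3, length=6, p=2, q=0.5)
print(len(walks), walks[0])
```

With p = 2 and q = 0.5, returning to the previous node is discouraged (weight 0.5) and moving outward is encouraged (weight 2), so the walks lean towards depth-first exploration. The resulting node sequences would then be passed to a CBOW-style model in place of sentences.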

Figure 4 shows the impact of γ and d on the F-value, indicating that both the sampling frequency and learning dimensions significantly affect the results of the SNFA model. For sampling frequency γ, the results improved substantially with values from 1 to 10, moderately from 10 to 30, and plateaued beyond 30, suggesting an optimal range of 30 to 50 for balancing accuracy and computational efficiency. Learning dimensions of 64 or 128 yield better results, as too-low dimensions may lose information, while too-high dimensions can lead to redundancy, decreasing accuracy.

Figure 5 illustrates the effect of walk depth t and sampling frequency γ on the F-value, with similar patterns observed in sampling frequency, where increasing values improve the results up to a point. The choice of the walk depth t generally depends on the dataset size, with larger datasets requiring greater walking depths. A walk depth of 30 was chosen based on the data to optimize the experimental outcomes.

3.3. Results

The SNFA model employed network representation learning for social network forensics and was compared with conventional forensic methods, CrimeNet Explorer [53] and LogAnalysis [54], both of which utilize network topology approaches.

CrimeNet Explorer: This forensic method is specifically designed for criminal networks by segmenting a criminal network into multiple subnetworks. It measures using three centrality values and employs shortest path algorithms and Blockmodeling to determine the closeness between nodes in a criminal network, identifying core members and even top leaders.
LogAnalysis: This method automates the import of telephonic communication data by utilizing MCD data. It incorporates network topology, criminal investigation, and statistical analysis to establish a framework for revealing and analyzing the structure and behavior of criminal networks. Based on the Newman and GN algorithms, it measures using betweenness centrality and employs a greedy algorithm for hierarchical clustering, identifying closely connected clusters and core members through structural analysis.

Figure 6 and Table 4 compare the three evaluation metrics across the three forensic approaches on the Enron dataset. The parameters for the Node2vec random walk were set to p = 2, q = 0.5, window size = 15, sampling frequency γ = 50, walk depth t = 40, and learning dimension d = 64.

Figure 6 and Table 4 show that the SNFA model outperforms both LogAnalysis and CrimeNet Explorer on precision, recall, and F-value. Its F-value of 0.54 exceeds that of LogAnalysis (0.51) by 0.03 and that of CrimeNet Explorer (0.44) by 0.10, so the improvement is most pronounced relative to CrimeNet Explorer. These findings support the efficacy of the SNFA model in pinpointing key individuals and leadership in criminal networks.
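As a quick consistency check, the reported F-values can be recomputed from the recall and precision figures in Table 4 using Equation (11). The dictionary layout and variable names below are illustrative; the numbers are those given in Table 4.

```python
# (recall, precision, reported F-value) for each approach, copied from Table 4.
table4 = {
    "CrimeNet Explorer": (0.46, 0.43, 0.44),
    "LogAnalysis": (0.54, 0.48, 0.51),
    "SNFA": (0.57, 0.52, 0.54),
}

recomputed = {}
for name, (recall, precision, reported) in table4.items():
    # Equation (11): harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    recomputed[name] = round(f_value, 2)
    print(f"{name}: recomputed {f_value:.4f}, reported {reported:.2f}")
```

Rounded to two decimals, the recomputed values agree with the reported F-values for all three approaches.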

The SNFA model applies multidimensional data provided by the MCD for more effective data mining and analysis. The utilization of the Enron email dataset streamlines the construction and hierarchical examination of criminal networks. In contrast to traditional topological techniques, the SNFA model employs network representation learning to transform network nodes into vectors, thereby simplifying the assessment of the relationships between nodes. The integration of the Node2vec algorithm refines random walk strategies for the nuanced representation of non-adjacent node interactions, thus enhancing experimental precision. The SNFA model uses the CBOW and Hierarchical Softmax models for nodal vectorization by formulating rules to select computable nodes grounded in social network traits. This method not only boosts computational speed and analytic proficiency but also advances the gradient update process within the Huffman tree of the output layer. Hierarchical clustering was then used to determine the optimal cluster count (k-value), with subsequent iterative relocation to achieve the desired clustering. Ultimately, the efficacy of the SNFA model is both theoretically and empirically contrasted with traditional forensic approaches, namely CrimeNet Explorer and LogAnalysis. Outperforming these platforms in all three evaluation criteria, the SNFA enhancements are particularly noteworthy when juxtaposed with CrimeNet Explorer. Establishing its practicality, the SNFA model proved adept at uncovering central figures in criminal networks, possibly extending it to the top echelons of leadership.

4. Conclusions and Outlook

Owing to the limitations of traditional forensic algorithms based on web analysis and community structures, this study adopted network representation learning algorithms to implement a social network forensic model. It outlines the background, development, and current state of social network forensics research, introduces the basic concepts of forensics and fundamental network characteristics, and compares various classic network representation learning algorithms. The Node2vec algorithm was chosen to map the criminal networks into vector spaces for computational analysis. The model integrates the Continuous Bag-of-Words (CBOW) model with the Hierarchical Softmax acceleration algorithm to refine the node sampling process, making it more aligned with the hierarchical features of social networks. The SNFA model employs hierarchical clustering to achieve optimal clustering by identifying core nodes within the criminal network. The importance of these core nodes is determined through similarity calculations, potentially identifying high-level leaders or suspects and aiding law enforcement in dismantling criminal networks.

It should be noted that while the SNFA model demonstrated effectiveness when applied to the Enron email dataset, its application to modern social network apps or communication platforms, such as WeChat or Twitter, may encounter challenges. These platforms often contain a significant amount of irrelevant content and tenuous connections, which can hinder the model’s ability to accurately identify key figures within criminal networks. Future research should explore methods to filter and preprocess data from such platforms in order to enhance the applicability of the SNFA model in contemporary forensic investigations.

The main work of this paper includes:

The basic principles and technologies of social networks are analyzed to address the shortcomings of traditional forensic methods. This paper proposes the use of network representation learning to map criminal networks into a vector space, accurately representing node attributes and network structure and facilitating the calculation of inter-node relationships.
The CBOW and Hierarchical Softmax models were selected for network node sampling and encoding. Owing to the different levels of closeness between network nodes and the hierarchical nature of networks, the SNFA model refines the sampling process to make it more reasonable and improves the gradient calculation process in the Huffman tree of the output layer.
Using clustering to identify core nodes within the criminal network, we address the significant influence of the initial cluster center selection in traditional K-Means clustering. The SNFA model proposes an improved hierarchical clustering method coupled with iterative reallocation to achieve optimal clustering.
The Enron company crime dataset was chosen for its well-defined criminal network structure and available sentencing results for comparison. The SNFA model demonstrates improvements over the traditional forensic methods LogAnalysis and CrimeNet Explorer based on three evaluation metrics: Precision, Recall, and F-value.

Despite the SNFA model outperforming LogAnalysis and CrimeNet Explorer in the experiments, there is significant room for improvement. Future research should focus on the following aspects:

Exploring deep neural network-based representation learning algorithms or incorporating higher-order similarities to calculate relationships more accurately between non-adjacent nodes.
Improving the methodology for identifying core nodes and the formulas for calculating their similarity. Deep diving into the hierarchical clustering structure or considering other clustering methods might enhance the results.
Employing natural language processing to semantically analyze the Enron email dataset, potentially yielding more comprehensive and accurate results when combined with an analysis of the topological structure of the criminal network. Additionally, considering other crime datasets with more nodes could lead to more accurate experimental outcomes owing to more thorough random walks and sampling.

Author Contributions

Conceptualization, K.Z. and H.Z.; methodology, Y.N.; software, L.L. and Y.N.; validation, L.L.; formal analysis, L.L. and Q.P.; investigation, Q.P.; resources, J.L.; data curation, J.L., Q.P., Y.N. and L.L.; writing—original draft preparation, K.Z. and H.Z.; writing—review and editing, K.Z., H.Z. and L.L.; visualization, Y.N.; supervision, K.Z., J.L. and Z.Z.; project administration, K.Z., H.Z. and Z.Z.; funding acquisition, K.Z. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2021YFB3301702), the Guangdong Basic and Applied Basic Research Foundation (2023A1515011712), the 2019 Guangdong Special Support Talent Program–Innovation and Entrepreneurship Leading Team (China) (2019BT02S593), and the 2018 Guangzhou Leading Innovation Team Program (China) (201909010006).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available at https://www.cs.cmu.edu/~./enron/, accessed on 15 March 2024.

Acknowledgments

The authors would like to thank the editorial department and reviewers for their suggestions on this article, which have helped us greatly improve the quality of the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, W.-X.; Wang, B.-H.; Yin, C.-Y.; Xie, Y.-B.; Zhou, T. Traffic dynamics based on local routing protocol on a scale-free network. Phys. Rev. E 2006, 73, 026111.
Riascos, A.P.; Mateos, J.L. Random walks on weighted networks: A survey of local and non-local dynamics. J. Complex Netw. 2021, 9, cnab032.
Okmi, M.; Por, L.Y.; Ang, T.F.; Ku, C.S. Mobile Phone Data: A Survey of Techniques, Features, and Applications. Sensors 2023, 23, 908.
Karabiyik, U.; Canbaz, A.M.; Aksoy, A.; Tuna, T.; Akbas, E.; Gonen, B.; Aygun, R.S. A survey of social network forensics. J. Digit. Forensics Secur. Law 2016, 11, 8.
Pasquini, C.; Amerini, I.; Boato, G. Media forensics on social media platforms: A survey. EURASIP J. Inf. Secur. 2021, 2021, 1–19.
Kurt, Y.; Kurt, M. Social network analysis in international business research: An assessment of the current state of play and future research directions. Int. Bus. Rev. 2020, 29, 101633.
Teelink, S.; Erbacher, R.F. Improving the computer forensic analysis process through visualization. Commun. ACM 2006, 49, 71–75.
O’Connor, J.; McDermott, I. NLP; Thorsons: London, UK, 2001.
Amato, F.; Cozzolino, G.; Mazzeo, A.; Moscato, F. An application of semantic techniques for forensic analysis. In Proceedings of the 2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), Krakow, Poland, 16–18 May 2018; pp. 380–385.
Amato, F.; Cozzolino, G.; Moscato, V.; Moscato, F. Analyse digital forensic evidences through a semantic-based methodology and NLP techniques. Future Gener. Comput. Syst. 2019, 98, 297–307.
Landauer, T.K.; Foltz, P.W.; Laham, D. An introduction to latent semantic analysis. Discourse Process. 1998, 25, 259–284.
Huyut, M.-M.; Kocaoğlu, B.; Meram, Ü. Regulation Relatedness Map Creation Method with Latent Semantic Analysis. Comput. Mater. Contin. 2022, 72, 2093–2107.
Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
Zhou, Z.; Qin, J.; Xiang, X.; Tan, Y.; Liu, Q.; Xiong, N.-N. News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark. Comput. Mater. Contin. 2020, 62, 217–231.
Bafna, P.; Pramod, D.; Vaidya, A. Document clustering: TF-IDF approach. In Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 3–5 March 2016; pp. 61–66.
Svore, K.M.; Burges, C.J. A machine learning approach for improved BM25 retrieval. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, 2–6 November 2009; pp. 1811–1814.
Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
Eddy, S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365.
Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
Newman, M.E. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 2006, 103, 8577–8582.
Newman, M.E. Fast algorithm for detecting community structure in networks. Phys. Rev. E 2004, 69, 066133.
Ding, Z.; Zhang, X.; Sun, D.; Luo, B. Overlapping community detection based on network decomposition. Sci. Rep. 2016, 6, 24115.
Yuan, S.; Zeng, H.; Zuo, Z.; Wang, C. Overlapping community detection on complex networks with Graph Convolutional Networks. Comput. Commun. 2023, 199, 62–71.
Peixoto, T.P. Network reconstruction and community detection from dynamics. Phys. Rev. Lett. 2019, 123, 128301.
Berner, R.; Gross, T.; Kuehn, C.; Kurths, J.; Yanchuk, S. Adaptive dynamical networks. Phys. Rep. 2023, 1031, 1–59.
Devi, S.; Rajalakshmi, M.; Saranya, S.; Shana, J. Meta Heuristic-Based Community Detection of Social Network Using Cuckoo with InfoMap Algorithm. In Intelligent Manufacturing and Energy Sustainability: Proceedings of ICIMES 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 15–23.
Traag, V.A.; Šubelj, L. Large network community detection by fast label propagation. Sci. Rep. 2023, 13, 2701.
Devi, S.; Rajalakshmi, M. Community Detection by Node Betweenness Using Optimized Girvan-Newman Cuckoo Search Algorithm. Inf. Technol. Control 2023, 52, 53–67.
Delp, E.J.; Tubaro, S.; Barni, M.; Scheirer, W.J.; Kuo, C.; Memon, N.; Verdolvia, L.A.; Abd-Almageed, W. Media Forensics Integrity Analytics. 2022. Available online: https://apps.dtic.mil/sti/citations/trecms/AD1179160 (accessed on 15 March 2024).
Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Network representation learning: A survey. IEEE Trans. Big Data 2018, 6, 3–28.
Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
Fati, S.M.; Muneer, A.; Alwadain, A.; Balogun, A.O. Cyberbullying Detection on Twitter Using Deep Learning-Based Attention Mechanisms and Continuous Bag of Words Feature Extraction. Mathematics 2023, 11, 3567.
Mohammed, A.A.; Umaashankar, V. Effectiveness of hierarchical softmax in large scale classification tasks. In Proceedings of the 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 19–22 September 2018; pp. 1090–1094.
Barabási, A.-L.; Bonabeau, E. Scale-free networks. Sci. Am. 2003, 288, 60–69.
Bundy, A.; Wallen, L. Breadth-first search. In Catalogue of Artificial Intelligence Tools; Springer: Berlin/Heidelberg, Germany, 1984; p. 13.
Tarjan, R. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972, 1, 146–160.
Jemili, F.; Bouras, H. Intrusion detection based on big data fuzzy analytics. In Open Data; IntechOpen: London, UK, 2021.
Di Gennaro, G.; Buonanno, A.; Palmieri, F.A. Considerations about learning Word2Vec. J. Supercomput. 2021, 77, 12320–12335.
Paliwal, S.; Mishra, A.-K.; Mishra, R.-K.; Nawaz, N.; Senthilkumar, M. XGBRS Framework Integrated with Word2Vec Sentiment Analysis for Augmented Drug Recommendation. Comput. Mater. Contin. 2022, 72, 5345–5362.
Qiu, J.; Dong, Y.; Ma, H.; Li, J.; Wang, K.; Tang, J. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018; pp. 459–467.
Xia, F.; Sun, K.; Yu, S.; Aziz, A.; Wan, L.; Pan, S.; Liu, H. Graph learning: A survey. IEEE Trans. Artif. Intell. 2021, 2, 109–127.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. Available online: https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-abstract.html (accessed on 15 March 2024).
Yang, Z.; Ding, M.; Zhou, C.; Yang, H.; Zhou, J.; Tang, J. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 1666–1676.
Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210.
Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (SODA), Philadelphia, PA, USA, 7–9 January 2007; pp. 1027–1035.
Ran, X.; Xi, Y.; Lu, Y.; Wang, X.; Lu, Z. Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artif. Intell. Rev. 2023, 56, 8219–8264.
Alotaibi, A.; Barnawi, A. IDSoft: A federated and softwarized intrusion detection framework for massive internet of things in 6G network. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101575.
Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview, II. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2017, 7, e1219.
Dong, Y.; Hu, Z.; Wang, K.; Sun, Y.; Tang, J. Heterogeneous Network Representation Learning. In Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI, Yokohama, Japan, 7–15 January 2020; pp. 4861–4867.
Davis, J.; Hossain, L.; Murshed, S.H. Social Network Analysis and Organizational Disintegration: The Case of Enron Corporation. 2007. Available online: https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1162&context=icis2007 (accessed on 15 March 2024).
Cox, P.L.; Friedman, B.; Edwards, A. Enron: The smartest guys in the room - Using the Enron film to examine student attitudes towards business ethics. J. Behav. Appl. Manag. 2009, 10, 263–290.
Yang, S.; Keller, F.B.; Zheng, L. Social Network Analysis: Methods and Examples; Sage Publications: Washington, DC, USA, 2016.
Xu, J.J.; Chen, H. CrimeNet explorer: A framework for criminal network knowledge discovery. ACM Trans. Inf. Syst. 2005, 23, 201–226.
Catanese, S.A.; Fiumara, G. A visual tool for forensic analysis of mobile phone traffic. In Proceedings of the 2nd ACM Workshop on Multimedia in Forensics, Security and Intelligence, Firenze, Italy, 29 October 2010; pp. 71–76.

Figure 1. CBOW model structure.

Figure 2. CBOW and Hierarchical Softmax model structure.

Figure 3. SNFA model structure.

Figure 4. The F-measure under the changes of γ and d.

Figure 5. The F-measure under the changes of t and γ.

Figure 6. Comparison of the three forensic schemes.

Table 1. List of Enron’s criminal executives.

Name	Position
Andrew Fastow	CEO
Andrew Lewis	Director
Ben Glisan	Financial Director
Cliff Baxter	Vice President
Dan Bayly	COO
Davis Maxey	N/A
Hunter Shively	Vice President
Jeffrey Skilling	CEO
Joe Hirko	Sub-Company CEO & Portland General Electric CEO
John Forney	Manager
John Lavorato	CEO
Kenneth Lay	CEO
Ken Rice	Enron broadband CEO
Kevin Hannon	non-Enron Employee
Kevin Howard	Vice President
Linda Lay	Kenneth Lay’s wife
Louis Borget	Oil & Gas CEO
Louise Kitchen	President
Mark Koenig	Vice CEO
Mark Taylor	Employee
Mike Krautz	CFO
Paula Rieker	Manager
Ray Bowen	Executive and CFO
Richard Causey	Chief Accountant
Richard Sanders	Vice President
Rick Buy	Manager
Rod Hayslett	Vice President
Stephen Cooper	Interim CEO and CRO

Table 2. Experimental data preprocessing results: (a) the result of communication edge; (b) the result of node encoding; (c) the result of node communication.

(a) Sender	(a) Receiver	(a) Weight	(b) Node name	(b) Node code	(c) Node code	(c) Communications	(c) Distinct contacts
2604	2605	1.0	sara.davidn	7448	1	3	1
1514	4537	20.0	trisha.hubbard	8300	2	6786	482
2676	409	3.0	schulmeyer.gerhard	7592	3	1	1
2114	409	2.0	ken.williams	5399	4	359	72
481	614	4.0	langfeldt.andrea	4430	5	2	1
3672	3673	1.0	anna.harris	5174	6	31	14
1208	3046	1.0	falbaum.william	5318	7	456	52
2203	2718	1.0	john.aleazurix	3472	8	2	2
1213	230	1.0	hunter.larry.jn	915	9	81	4
134	1932	1.0	hlopak.ed	4496	10	47	9
1079	3474	1.0	steve.whitaker	7417	11	307	17
3448	36	2.0	matthias.lee	695	12	89	21
3307	3890	3.0	emmons.suzette	5689	13	166	43

Table 3. Confusion Matrix.

Actual Condition	Predicted Positive	Predicted Negative
Positive Case	$TP$	$FN$
Negative Case	$FP$	$TN$

Table 4. Results of the three forensic approaches.

	CrimeNet Explorer	LogAnalysis	SNFA
Recall	0.46	0.54	0.57
Precision	0.43	0.48	0.52
F-value	0.44	0.51	0.54

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
