4.1. Experimental Setup
4.1.1. Description of the Dataset
This study leverages two widely adopted standard graph datasets, namely, the Elliptic [16] and OGB-Arxiv [45] datasets, to serve as the experimental foundation for validating the effectiveness of the proposed methodology. The datasets are randomly partitioned into training, validation, and test sets at an 8:1:1 ratio, ensuring consistency during the partitioning process through the use of the same random seed. The Elliptic dataset has extensive applications in the domains of money laundering prevention and fraud detection. Detailed descriptions of both datasets are provided in Table 1.
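As a concrete illustration, the following is a minimal sketch of the 8:1:1 split described above; the seed value, function name, and use of PyTorch are our own illustrative assumptions rather than the paper's actual code.

```python
import torch

def split_nodes(num_nodes: int, seed: int = 42):
    """Randomly partition node indices into train/val/test at an 8:1:1 ratio."""
    gen = torch.Generator().manual_seed(seed)      # fixed seed -> reproducible split
    perm = torch.randperm(num_nodes, generator=gen)
    n_train, n_val = int(0.8 * num_nodes), int(0.1 * num_nodes)
    return (perm[:n_train],                        # training indices (80%)
            perm[n_train:n_train + n_val],         # validation indices (10%)
            perm[n_train + n_val:])                # test indices (10%)

train_idx, val_idx, test_idx = split_nodes(203_769)  # Elliptic node count
```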
The Elliptic graph dataset is a publicly available Bitcoin transaction dataset associated with illicit money laundering activities; it was jointly released by the blockchain analytics company Elliptic and the Massachusetts Institute of Technology (MIT). The dataset documents 203,769 Bitcoin transactions worth approximately $6 billion. Among the 203,769 nodes and 234,355 edges, 2% of the nodes are labeled illicit, 21% are labeled licit, and the remaining 77% are unlabeled. In this graph, each node represents a transaction, and the edges denote the flow of Bitcoin between pairs of transactions. Each node is characterized by 166 features, including timestamp information, with timestamp values ranging from 1 to 49. Figure 1 illustrates how the sample counts of the different categories vary over time in the Elliptic dataset, with an interval of approximately two weeks between adjacent timestamps.
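For reference, a minimal sketch of loading the Elliptic data is given below; the CSV filenames follow the publicly distributed release of the dataset and are an assumption about the local setup, not the paper's pipeline.

```python
import pandas as pd

# Filenames as in the public Elliptic release (an assumption, not the paper's code).
features = pd.read_csv("elliptic_txs_features.csv", header=None)  # txId + 166 features
classes = pd.read_csv("elliptic_txs_classes.csv")                 # txId, class label
edges = pd.read_csv("elliptic_txs_edgelist.csv")                  # txId1 -> txId2 flows

# In the public release, column 0 is the transaction id and column 1 is the
# timestamp feature, which takes values from 1 to 49.
timestamps = features[1]
assert timestamps.between(1, 49).all()
```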
To validate the effectiveness and generalizability of the proposed method, and given the scarcity and sensitivity of publicly available financial transaction graph datasets, we select the OGB-Arxiv graph dataset as an additional experimental benchmark. OGB-Arxiv is an open benchmark graph dataset released by Stanford University in 2019 that contains 169,343 nodes and 1,166,243 edges. In this dataset, nodes represent academic papers, and edges denote citation relationships between papers. Each node is associated with a 128-dimensional feature vector representing the keywords of the corresponding paper, as well as a temporal feature named “year” indicating the publication year, which ranges from 1979 to 2019. Because relatively few samples predate 2006, a data merging step was applied in the experiments to alleviate the excessive parameter fluctuations observed at early timestamps. The primary task on the OGB-Arxiv dataset is to predict the main categories of arXiv academic papers, encompassing a total of 40 different classes.
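A minimal sketch of this merging step is shown below, assuming the standard ogb loader; the exact bucketing used in the paper may differ, and the cutoff year (2006) is the one stated in the text.

```python
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset(name="ogbn-arxiv")
graph, labels = dataset[0]

# Publication years range from 1979 to 2019; collapsing everything before
# 2006 into a single timestamp reduces parameter fluctuations at early steps.
year = graph.ndata["year"].squeeze()
year_merged = year.clamp(min=2006)   # pre-2006 papers share one merged timestamp
```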
Compared to the Elliptic graph dataset, the OGB-Arxiv graph dataset exhibits similar dynamic node features but possesses more intricate edge relationships and a richer set of categories. Therefore, this dataset is better suited for validating the predictive performance of the method proposed in this study when addressing complex graph datasets.
4.1.2. Evaluation Metric Selection
In the experiments conducted in this paper, a series of evaluation metrics are employed to gauge the performance of the proposed model, including the micro-precision, micro-recall, micro-F1, macro-precision, macro-recall, and macro-F1 values. The computational formulas for these evaluation metrics are outlined in Equation (9):

$$
P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i},
$$

$$
\text{micro-}P = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}\left(TP_i + FP_i\right)}, \qquad \text{micro-}R = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}\left(TP_i + FN_i\right)}, \qquad \text{micro-}F1 = \frac{2\,\text{micro-}P \cdot \text{micro-}R}{\text{micro-}P + \text{micro-}R},
$$

$$
\text{macro-}P = \frac{1}{C}\sum_{i=1}^{C} P_i, \qquad \text{macro-}R = \frac{1}{C}\sum_{i=1}^{C} R_i, \qquad \text{macro-}F1 = \frac{2\,\text{macro-}P \cdot \text{macro-}R}{\text{macro-}P + \text{macro-}R} \tag{9}
$$

In these formulas, $C$ represents the total number of categories, $TP_i$ denotes the number of samples in category $i$ that are correctly predicted as positive, $FP_i$ indicates the number of samples that are erroneously predicted as positive for category $i$, and $FN_i$ represents the number of samples in category $i$ that are erroneously predicted as negative. $P_i$ signifies the probability that the samples predicted by the model as positive for category $i$ are indeed positive, and $R_i$ represents the probability that the model correctly predicts the samples that actually belong to the positive class for category $i$.
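In practice, these six values can be computed directly with scikit-learn, as in the following sketch (the helper name and the integer-label format are illustrative assumptions):

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Compute the six metrics of Equation (9) for integer class labels."""
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro", zero_division=0)
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"micro-P": micro_p, "micro-R": micro_r, "micro-F1": micro_f1,
            "macro-P": macro_p, "macro-R": macro_r, "macro-F1": macro_f1}
```

Note that scikit-learn's macro-F1 averages the per-class F1 scores, which can differ slightly from the harmonic-mean formulation in Equation (9); either convention gives minority classes equal weight.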
In this paper, the macro-F1 score is adopted as the primary evaluation metric; this choice is motivated by the highly imbalanced sample distribution in the Elliptic dataset. As depicted in Figure 4, the number of nodes associated with illicit money laundering in the Elliptic dataset is significantly lower than the number of legitimate nodes. In such a scenario, models tend to achieve better performance on categories with larger sample distributions, while their performance diminishes on categories with fewer samples. Given that our research emphasis lies in predicting illicit money laundering transactions, strong emphasis must be placed on the relatively scarce illicit transaction samples.
To validate the advantages of employing the macro-F1 score as an evaluation metric, Figure 5 illustrates the trends exhibited by the various evaluation metrics across different training epochs during model training. Additionally, Figure 6 shows the variations in each element of the confusion matrix as the number of epochs increases.
As shown in Figure 5, the MDGC-LSTM model attains its lowest validation loss after 25 training epochs, with the micro-F1 score approaching 0.9. The confusion matrix in Figure 6 indicates that at this point, the predictions produced by the model for legitimate transaction samples are nearly perfect. However, its ability to detect illicit money laundering transactions is quite limited, as reflected by the fact that the true negative (TN) values approach zero. As the number of training epochs approaches 75, the micro-F1 score stabilizes, while the macro-F1 score continues to increase gradually, which is consistent with the trend exhibited by the TN values in Figure 6. This finding suggests an ongoing improvement in the model's ability to predict illicit money laundering transactions. These observations underscore the rationale for using the macro-F1 score as an evaluation metric, particularly when addressing highly imbalanced sample distributions.
4.1.3. Model Parameter Selection
The experiments in this paper build upon prior research findings [30] and incorporate the latest advancements in dynamic graph convolution theory [32]. Consequently, the primary parameter configurations in these experiments align with those used in a preceding study [33]. Message passing and aggregation operations are implemented on the given money laundering transaction graph using the open-source DGL toolkit [46]. Detailed information regarding the experimental parameters is provided in Table 2.
In contrast with the previous study in [33], we do not employ an early stopping mechanism. This decision is informed by the observed trends in the model evaluation metrics across different training epochs, as depicted in Figure 5: the model reaches its minimum validation loss after approximately 25 epochs, but the macro-F1 score does not stabilize until approximately 150 epochs, so stopping early would sacrifice minority-class performance. Therefore, this study opts not to use an early stopping mechanism and instead maintains a consistent training schedule of 220 epochs.
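The following sketch shows what such a fixed 220-epoch schedule looks like for a full-batch node classifier; the model, optimizer, and data arguments are placeholders, not the paper's implementation.

```python
import copy
import torch
from sklearn.metrics import f1_score

NUM_EPOCHS = 220  # fixed schedule; no early stopping

def train(model, optimizer, loss_fn, graph, feats, labels, train_idx, val_idx):
    best_f1, best_state = 0.0, None
    for epoch in range(NUM_EPOCHS):
        model.train()
        optimizer.zero_grad()
        logits = model(graph, feats)
        loss = loss_fn(logits[train_idx], labels[train_idx])
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            pred = model(graph, feats)[val_idx].argmax(dim=1)
        macro_f1 = f1_score(labels[val_idx].cpu().numpy(),
                            pred.cpu().numpy(), average="macro")
        if macro_f1 > best_f1:  # track the best checkpoint instead of stopping early
            best_f1, best_state = macro_f1, copy.deepcopy(model.state_dict())
    return best_state, best_f1
```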
In Table 2, “RNN_layers” denotes the number of layers in the RNN components (LSTM and GRU networks), while “Graph_layers” represents the number of layers in the graph convolution model. The “Support” parameter signifies the depth of the node neighborhood used during graph convolution; a “Support” value of 2 indicates that second-order neighborhood nodes are adopted for graph convolution. “Lr” represents the initial learning rate of the model. We determine the optimal settings for these parameters by referencing prior research [33].
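One way to read these parameters is sketched below: a “Support” of 2 is realized here by stacking two DGL GraphConv layers, so each node aggregates information from its second-order neighborhood. The hidden dimension and class count are illustrative assumptions (166 matches the Elliptic feature size).

```python
import torch.nn as nn
from dgl.nn import GraphConv

class TwoHopGCN(nn.Module):
    """Illustrative GCN whose depth mirrors the 'Support' parameter."""
    def __init__(self, in_dim=166, hidden_dim=64, num_classes=2, support=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (support - 1) + [num_classes]
        self.layers = nn.ModuleList(
            GraphConv(dims[i], dims[i + 1]) for i in range(support))
        self.act = nn.ReLU()

    def forward(self, g, x):
        for i, layer in enumerate(self.layers):
            x = layer(g, x)                 # one hop of message passing per layer
            if i < len(self.layers) - 1:
                x = self.act(x)
        return x
```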
During the experimental process, we utilize the cross-entropy loss function to assess the disparity between the model predictions and the actual values. For a multiclass classification problem with $N$ classes, given the true class label $y$ for a particular sample and the model output probability $p$, the cross-entropy loss is calculated as follows:

$$
\mathcal{L} = -\sum_{c=1}^{N} y_c \log p_c, \qquad p = \mathrm{softmax}(W h + b),
$$

where $y_c$ indicates the true class of the sample (1 if the sample belongs to class $c$ and 0 otherwise), $p_c$ denotes the probability value predicted by the model for class $c$, i.e., the output of the softmax layer, $W$ represents the classification weights of the fully connected layer, $b$ represents the bias term, and $h$ is the representation fed into the classifier.
In the experiments, we employ adaptive moment estimation (Adam) as the optimizer; this optimization algorithm combines momentum and adaptive learning rates, helping the loss function converge effectively to its minimum. Additionally, we utilize weight decay (the “Weight_decay” parameter in Table 2), a regularization technique that helps control model complexity and reduce the risk of overfitting. Finally, we apply the rectified linear unit (ReLU) as the activation function; ReLU is widely used in deep learning to introduce nonlinearity and contributes to better capturing the complex relationships within data.
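Putting these pieces together, the optimization setup reads roughly as follows; the learning-rate and weight-decay values are placeholders (the actual values appear in Table 2), and the model is the illustrative sketch from above.

```python
import torch
import torch.nn as nn

model = TwoHopGCN()                 # illustrative model sketched above
loss_fn = nn.CrossEntropyLoss()     # softmax + cross-entropy, as in the equation above
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,            # "Lr" (placeholder; see Table 2)
    weight_decay=5e-4,  # "Weight_decay" regularization (placeholder; see Table 2)
)
```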
4.1.4. Baselines
To validate the effectiveness of the proposed MDGC-LSTM method in the field of money laundering prediction, we conduct a comprehensive comparison with six traditional machine learning models and six deep learning-based models.
Among the six traditional machine learning models, the first is logistic regression (LR) [16,17], a widely used linear model for classification problems that predicts the probabilities of the output categories by combining the input features with weights and applying a logistic function. The second is the SVM [18,19], a classification and regression model designed to find an optimal hyperplane that establishes the maximum margins between data points belonging to different classes. The next model is the random forest (RF) [20], an ensemble learning model that combines the predictions of multiple decision trees through voting or averaging to achieve enhanced performance and robustness. Backpropagation (BP) [21] is a training algorithm for shallow neural networks in which the network weights are updated by backpropagating errors to minimize the discrepancy between the predicted outputs and the actual targets. Additionally, extreme gradient boosting (XGBoost) [47] and the light gradient boosting machine (LightGBM) [48] are both gradient boosting tree models: XGBoost iteratively trains multiple trees to minimize prediction errors, incorporating gradient boosting and regularization techniques to attain improved performance, while LightGBM employs a histogram-based learning algorithm to increase training speed and reduce memory usage, making it suitable for large-scale datasets and high-dimensional features.
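The sketch below illustrates this baseline protocol under stated assumptions: synthetic arrays stand in for the node features, and default hyperparameters stand in for the configurations taken from the cited references.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score

# Placeholder data standing in for 166-dimensional Elliptic node features.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 166)), rng.integers(0, 2, 1000)
X_val, y_val = rng.normal(size=(200, 166)), rng.integers(0, 2, 200)

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "BP": MLPClassifier(max_iter=500),   # shallow backprop-trained network
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)            # a single training session
    macro_f1 = f1_score(y_val, clf.predict(X_val), average="macro")
    print(f"{name}: macro-F1 = {macro_f1:.4f}")
```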
Among the six deep learning-based models, the GRU [27] and LSTM [24] models are RNN variants used for handling sequential data; they address the vanishing and exploding gradient issues faced by traditional RNNs by introducing gate mechanisms that capture long-term dependencies. Furthermore, the GCN [28] is a deep learning model for processing graph data that captures the relationships between nodes by performing convolution operations on the input graph structure. GCN-GRU and MGC-LSTM [33] are fusion models that execute graph convolution operations on a static graph constructed from the entire training set, in combination with GRU and LSTM networks, respectively. DGCN-GRU constructs graph snapshots from the dynamic data at each timestamp, performs graph convolution operations on these snapshots, and utilizes a GRU to capture their sequential properties, enabling complex time series graph data to be comprehensively modeled.
4.2. Experimental Results and Analysis
This section aims to validate the effectiveness of the proposed MDGC-LSTM method in the field of money laundering prediction and examine its performance in terms of handling complex dynamic graph data.
Table 3 and Table 4 summarize the evaluation results obtained by the baseline deep learning models and the proposed method on the Elliptic and OGB-Arxiv datasets, respectively. Notably, the evaluation results of the different models in these tables represent the best performance achieved across all 49 timestamps. Additionally, Figure 7 illustrates the variation trends exhibited by the macro-F1 scores of the different deep learning models at each timestamp during the model training phase.
The experimental results demonstrate the significant advantage of the proposed MDGC-LSTM method in terms of the macro-F1 score. The MDGC-LSTM model integrates spatial correlation features and temporal feature extraction, leveraging the complementary properties of dynamic graph convolution and LSTM. This integration allows the model to accurately capture complex dynamic patterns within the dataset, giving it a substantial advantage over the comparative methods in money laundering prediction.
Furthermore, we observe that the macro-recall values of the GRU and LSTM are notably lower than those of the GCN and the other fusion methods, and their macro-F1 values are also lower than those of the GCN. This may be attributed to the limitations of these methods in terms of handling dynamic graph data; the GCN and the fusion methods effectively utilize graph convolution operations to better capture the spatial correlation features within each dataset.
A closer examination reveals that the performances of MDGC-LSTM and DGCN-GRU, which use dynamic graph convolution, surpass those of MGC-LSTM and GCN-GRU, which employ static graph convolution. Static graph convolution conducts message passing among nodes based on a graph constructed from the entire set of transaction behaviors, and each timestamp involves message passing among the node features within the graph. As messages are repeatedly passed over multiple timestamps, the features of transaction nodes from earlier timestamps are involved in multiple message passing operations, leading to the gradual smoothing of the node features. This smoothing effect causes reductions in the feature differences among the nodes, making it challenging for the model to distinguish between different nodes, resulting in suboptimal performance in the task of detecting illicit money laundering transactions.
In contrast, dynamic graph convolution better simulates real-world scenarios. In a money laundering prediction task, the transaction behavior at each timestamp is treated as an independent snapshot, and the temporal features that are inherent to each timestamp can be effectively mined. Moreover, as future transaction behaviors cannot be detected at the current timestamp, constructing a static graph based on the entire transaction behavior set is unreasonable in money laundering prediction scenarios and may result in information leakage issues. Therefore, dynamic graph convolution aligns better with the practical application requirements and effectively enhances the performance of the developed model.
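As an illustration of this point, the sketch below builds one snapshot per timestamp using only the transactions observed at that step, so no future edges leak into earlier snapshots; the variable names and per-step edge filtering are our own assumptions about the preprocessing.

```python
import dgl
import torch

def build_snapshots(src, dst, node_time, num_steps=49):
    """src/dst: edge endpoint tensors; node_time[i]: timestamp (1..49) of node i."""
    snapshots = []
    for t in range(1, num_steps + 1):
        # Keep only edges whose endpoints both belong to timestamp t; in the
        # Elliptic data each transaction node lives in a single time step.
        keep = (node_time[src] == t) & (node_time[dst] == t)
        g = dgl.graph((src[keep], dst[keep]), num_nodes=len(node_time))
        snapshots.append(g)
    return snapshots
```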
Figure 7 shows that the seven deep learning methods exhibit similar performance trends across all timestamps. At the first four timestamps, the GCN performs exceptionally well, possibly because, as a relatively simple model used only to extract the spatial correlation features of the transaction nodes, it has lower complexity and converges faster. However, at later timestamps, the proposed MDGC-LSTM approach significantly outperforms the other six methods. This can be attributed to MDGC-LSTM's use of dynamic graph convolution and LSTM for feature extraction in both the temporal and spatial dimensions. Compared to single-dimensional models, MDGC-LSTM is better equipped to capture the spatiotemporal correlations present in the dataset. Furthermore, in contrast to static graph convolution, MDGC-LSTM can more effectively extract the temporal features of transaction nodes, thereby achieving optimal detection accuracy.
Notably, at timestamp 43, a significant decline is observed in the predictive performance of all the models. This downward trend is attributed to the sudden closure of the world’s largest black market trading network. The shutdown of the black market trading network eliminates instances of illicit transactions from the dataset, rendering the algorithms incapable of capturing the latest features of illicit transactions.
In addition to comparing the proposed approach with different deep learning methods, this study also conducts a comprehensive comparison between MDGC-LSTM and traditional machine learning methods. Table 5 and Table 6 provide detailed summaries of the evaluation results obtained by the baseline traditional machine learning models and the proposed method on the Elliptic and OGB-Arxiv datasets, respectively.

Traditional machine learning models such as SVMs typically differ from neural networks in their inability to dynamically load previous model parameters; their parameters are saved in the form of model coefficients and intercepts, making dynamic prediction challenging. Accordingly, in the experiments reported in Table 5 and Table 6, the traditional machine learning models undergo a single training session on the entire training set and are subsequently evaluated on the validation set. This setup results in the macro-F1 values in Table 5 and Table 6 generally exceeding the dynamic temporal prediction results provided in Table 3 and Table 4.
During the experiments, we meticulously tuned the parameters of all the models. The parameters for the LR and RF algorithms are referenced from [33], the parameters for the SVM are referenced from [18], the parameters for the BP algorithm are referenced from [19], the parameters for the XGBoost model are derived from [47], and the parameters for the LightGBM model are acquired from [48]. The results presented in Table 5 and Table 6 clearly indicate that the proposed MDGC-LSTM method can effectively extract spatial and temporal features from each dataset, demonstrating a pronounced advantage over the traditional algorithms.
Notably, the performances achieved by the traditional machine learning algorithms on the OGB-Arxiv dataset are noticeably lower than their performances on the Elliptic dataset. This likely stems from the fact that the OGB-Arxiv dataset contains a far greater number of categories (40 in total), making its detection task a highly complex multiclass classification problem. Traditional machine learning methods exhibit certain limitations when handling such intricate classification problems and struggle to realize their full potential.