Article

Research on Fraud Detection Method Based on Heterogeneous Graph Representation Learning

1 Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100086, China
2 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(14), 3070; https://doi.org/10.3390/electronics12143070
Submission received: 15 June 2023 / Revised: 6 July 2023 / Accepted: 7 July 2023 / Published: 14 July 2023
(This article belongs to the Special Issue Machine Intelligent Information and Efficient System)

Abstract: Detecting fraudulent users in social networks can reduce online fraud and telecommunication fraud cases, which is essential for protecting the lives and property of internet users and maintaining social harmony and stability. We study how to detect fraudulent users through heterogeneous graph representation learning and propose a heterogeneous graph representation learning algorithm that learns user node embeddings with little human intervention, thereby improving detection accuracy. Experiments on real datasets show promising results.

1. Introduction

Communication and digital media technologies are advancing rapidly. Online social networks provide a large amount of information and a low threshold for use, which has made some websites targets for fraudsters and attackers. Detecting fraudulent users in social networks can therefore reduce internet fraud and phone fraud cases, which is of great significance for protecting the lives and property of internet users and maintaining social harmony and stability.
With the increasing application of deep learning to fraud detection, the field has become interdisciplinary: designing an anomaly or fraud detection system requires knowledge of data mining, network security, and machine learning. Mainstream methods now focus on discovering fraudulent users and activities in complex interaction networks. Objects and their interactions in the network are represented as nodes and edges of a graph, and data mining and machine learning techniques are then used to detect abnormal nodes and edges and mark them as potential fraud. This article focuses on heterogeneous graph representation learning methods. Our investigation shows that two problems arise when heterogeneous graph representation learning algorithms are used for fraud detection tasks:
  • First, social networks contain a huge amount of data. In addition, because user interactions are unevenly distributed, the relationships in the data exhibit sparsity: some nodes have few connections to other nodes and are highly isolated from the rest of the graph. Using only the explicit connection structure of the graph, it can be difficult to capture these nodes and to generate robust feature representations for them, which makes it difficult to further improve detection accuracy.
  • Second, on most online platforms the number of normal users far exceeds the number of fraudulent users, who account for only a small portion of the total user volume. For example, out of one million users, there may be only a few thousand or even a few hundred fraudsters. We treat fraud detection as a binary classification problem, so there is an extreme class imbalance. With such imbalanced data, both deep graph neural networks and traditional machine learning models struggle to learn effectively, which degrades model performance.
This article studies the development of effective and reliable algorithms for heterogeneous graph representation learning, creating node embeddings, and identifying fraudulent users on social network platforms. The structure of this article is as follows: in Section 2, we introduce the related work; in Section 3, we describe the model and methods used; and in Section 4, we present the experimental results.

2. Related Work

2.1. Heterogeneous Graphs

A heterogeneous graph can be defined as a directed graph G = (V, E), where each node v ∈ V has a type mapping τ(v): V → T_V and each edge e ∈ E has a type mapping φ(e): E → T_E, where T_V is the set of node types and T_E is the set of edge types, with |T_V| + |T_E| > 2. If two edges have the same type, their source and target vertices have the same node types. In particular, when |T_V| = 1 and |T_E| = 1, the network is a homogeneous information network, known as a homogeneous graph.
In the real world, most network structures contain multiple types of nodes, so the corresponding graphs are heterogeneous. Graph representation learning algorithms for heterogeneous graphs should preserve these various relationships well. Given the widespread existence of heterogeneous networks, much research has shifted from homogeneous to heterogeneous graph representation learning. To better capture the rich structural and semantic information of heterogeneous graphs, many works have adapted the random walk algorithm to the structure of heterogeneous graphs. Representative works include metapath2vec [1] and HIN2vec [2], which use meta-paths to constrain the node types selected at each step of the random walk and thereby preserve the relationships between nodes in heterogeneous graphs. The generated node sequences, which contain heterogeneous neighborhoods, are then fed into the skip-gram model to produce high-level node feature representations.
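To make the meta-path idea concrete, the following minimal sketch (our own illustration, not the metapath2vec or HIN2vec implementations; the graph layout and function name are hypothetical, and a symmetric meta-path is assumed) performs a meta-path-guided random walk whose node sequences could then be fed to a skip-gram model:

```python
import random

def metapath_walk(neighbors, start, metapath, walk_length):
    """Meta-path-guided random walk (illustrative sketch only).

    neighbors[(node, node_type)] -> list of neighbors of that type
    metapath: symmetric list of node types, e.g. ["user", "product", "user"]
    """
    walk = [start]
    for step in range(walk_length - 1):
        # The next node type is dictated by the meta-path, constraining the walk
        # to heterogeneous neighborhoods that follow the chosen semantic pattern.
        next_type = metapath[(step + 1) % (len(metapath) - 1)]
        candidates = neighbors.get((walk[-1], next_type), [])
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Tiny user-product example; walks like this would be fed to skip-gram.
neighbors = {
    ("u1", "product"): ["p1"], ("p1", "user"): ["u1", "u2"],
    ("u2", "product"): ["p1"],
}
print(metapath_walk(neighbors, "u1", ["user", "product", "user"], walk_length=5))
```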

2.2. Graph Based Fraud Detection

We present recent research progress in graph-based fraud detection related to the sparse node and class imbalance problems. To address the problem of imbalanced category labels in fraud detection datasets, Zhao et al. [3] proposed the graph-based fraud detection method GAL, which combines a graph neural network with an unsupervised model. A new loss function is used to train the graph neural network, and global grouping patterns found by graph mining algorithms are used to evaluate the similarity of nodes, which can automatically adjust the edges of minority-class nodes according to the data distribution.
To address the long-tailed data distribution and isolated nodes in heterogeneous graph fraud detection tasks, Wang et al. [4] proposed a community-based detection approach that uses an attention mechanism and node similarity. The model focuses on two inconsistency problems that arise in fraud detection [5]: structural inconsistency and content inconsistency. Structural inconsistency refers to a central node surrounded by neighbors whose labels differ from its own; it is caused by the extreme imbalance between negative and positive samples. Content inconsistency arises because items belong to different categories and the reviews posted by the same user for different products have inconsistent content. The authors used a community detection algorithm based on label propagation to filter structurally inconsistent neighbors [6] and a similarity-based [7] neighbor sampling strategy to filter content-inconsistent neighbors. The effectiveness of the model was demonstrated on two datasets.
Tang et al. [8] considered the disguised behavior of fraudsters and the inconsistency of heterogeneous graphs and proposed a heterogeneous graph-based fraud detection model FAHGT to detect fake users in e-commerce online review systems. The model injects domain knowledge through a relational scoring mechanism without hand-designed meta-paths, and a type-aware feature mapping mechanism is used to process the heterogeneous graph data to mitigate inconsistency and discover fakes. Finally, features of neighbors are aggregated together to construct information representations.
Hamilton et al. [9] treat the neighborhood of each vertex in the graph as a receptive field from which each node recursively aggregates information. They proposed the GraphSAGE model, which consists of two steps, sampling and aggregation: for each node, a fixed number of neighbors is sampled, the information of the sampled neighbors is aggregated to update the node's embedding, and node labels are predicted from the updated embedding vectors.
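The sample-and-aggregate idea can be illustrated with a rough sketch (a simplified version under our own assumptions, not the GraphSAGE reference implementation; it uses mean aggregation without learned weight matrices):

```python
import numpy as np

def sample_and_aggregate(features, adj_list, num_samples, seed=0):
    """One GraphSAGE-style layer sketch: sample a fixed number of neighbors per
    node and aggregate their features by mean (no learned parameters here)."""
    rng = np.random.default_rng(seed)
    updated = np.zeros_like(features)
    for v, neigh in adj_list.items():
        if neigh:
            sampled = rng.choice(neigh, size=num_samples, replace=True)
            agg = features[sampled].mean(axis=0)
        else:
            agg = np.zeros(features.shape[1])
        # In GraphSAGE the self and neighbor features are concatenated and passed
        # through a learned transform; here we simply average them for brevity.
        updated[v] = (features[v] + agg) / 2.0
    return updated

# Usage: 4 nodes with 3-dimensional features and an adjacency list.
feats = np.arange(12, dtype=float).reshape(4, 3)
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(sample_and_aggregate(feats, adj, num_samples=2))
```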
Wang et al. [10] proposed the fraud detection model SemiGNN based on semi-supervised graph attention networks, which uses unsupervised methods to separate labeled and unlabeled parts, and uses multiple attention mechanisms to make the model interpretable.
Liu et al. [11] fully considered the relationship between two nodes, and proposed an attention-based hierarchical graph neural network HAGNN model for fraud detection in e-commerce. It incorporates a weighted adjacency matrix in different relationships to identify carefully disguised fraudulent users. The model is designed with two attention modules: a relational attention module to reflect the strength of the connection between two nodes, and a neighborhood attention module to capture information about long-range neighbor nodes associated with the graph. Node embeddings are generated by aggregating information about the heterogeneous graph structure and information about the original node features.

3. Materials and Methods

In this section, we introduce the similarity-based multi-view GCN [12] fraud detection model [13] (SMGCN). The model consists of three modules: a similarity-based single-view graph convolution module, a multi-view fusion module, and a fraud detection module. The overall structure is shown in Figure 1.
In the heterogeneous graph representation learning part, we first compute the similarity between user nodes [14]. Second, for the constructed heterogeneous graphs [15], we use meta-paths [16] to transform them into several homogeneous graphs [17] containing only user nodes; each homogeneous graph is a view of the heterogeneous graph under a particular meta-path. For every single view, graph convolution is used to obtain high-level feature representations of the nodes: the node similarity [18] is incorporated into the adjacency matrix of the view through a designed fusion function, and the fused matrix is used as the input of the graph convolution [19]. Finally, the attention mechanism [20] fuses the representations of the single views to generate the final node representations.

3.1. Similarity-Based Single-View GCN

First, we calculate the similarity between nodes; since we only detect anomalous user nodes, we only calculate the similarity between user nodes, using a simple cosine similarity measure [21].
For the input initial node features U = {u_1, u_2, …, u_N}, a linear transformation maps them to a new feature space to obtain a new set of node features X = {x_1, x_2, …, x_N}, as shown in Equation (1):
X = W U
Then, the cosine similarity is used to calculate the similarity between two nodes, and for any two nodes, the similarity between them can be expressed as the formula in (2):
S(i, j) = (x_i · x_j) / (‖x_i‖ · ‖x_j‖)
where x_i is the feature of node i and x_j is the feature of node j. After the similarity calculation, if two nodes are similar, a connection between them can be established even if there is no edge between them.
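The following minimal sketch illustrates Equations (1) and (2) in the row-vector convention, with a randomly initialised projection W standing in for the learned weight matrix (sizes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((5, 16))          # 5 user nodes, 16 raw features
W = rng.random((16, 8))          # projection to an 8-dimensional space
X = U @ W                        # Equation (1), row-vector form of X = WU

# Equation (2): pairwise cosine similarity between all user nodes.
norms = np.linalg.norm(X, axis=1, keepdims=True)
S = (X @ X.T) / (norms @ norms.T + 1e-12)
print(S.shape)                   # (5, 5) similarity matrix
```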
For the constructed heterogeneous graphs, we transform them into multiple single views according to predefined meta-paths. Each meta-path represents a semantic relation, and a meta-path-based single view preserves the corresponding semantic information of the heterogeneous graph well. After the meta-path transformation, every single view is a homogeneous graph containing only user nodes, and an edge between two nodes indicates that the two users are connected through the meta-path. Different single views describe different interactions between users; for example, a single view obtained from the meta-path U-P-U indicates that the users in the graph interact with each other through products. Multiple single views form a multi-view network, and Figure 2 shows a multi-view network obtained by meta-path transformation.
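The meta-path transformation can be sketched with a small example (matrices are illustrative only, not from the paper): given a user-product incidence matrix, the U-P-U view links two users whenever they share a product.

```python
import numpy as np

A_up = np.array([[1, 0],
                 [1, 1],
                 [0, 1]])                    # 3 users x 2 products
A_upu = (A_up @ A_up.T > 0).astype(int)      # users linked through shared products
np.fill_diagonal(A_upu, 0)                   # drop self-loops in the single view
print(A_upu)
```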
The graph convolutional network (GCN) is a classical graph neural network that has been shown to work well for homogeneous graph representation learning. Moreover, the model is based on a semi-supervised classification approach, which requires only a few labeled nodes for training [22]. Due to the powerful representation capability of graph convolutional networks, we use a GCN to learn the node embedding of every single view. Each convolutional layer of the GCN can be written as a function of the following form, and the node feature representations are updated by multi-layer iteration, as shown in Equation (3):
H^(l+1) = f(H^(l), A), with H^(0) = X
In Equation (3), A denotes the adjacency matrix of the graph and X denotes the node feature matrix; H^(0) = X is the initial input of the model.
In contrast, the heterogeneous graphs [23] constructed from fraud detection datasets are sparse [24] and contain many isolated nodes that have few connections to other nodes. For example, some fraudulent users may remain silent at the initial stage of registration, making no connections and engaging in no activities with other users, yet their node features are similar to those of users already identified as fraudulent. If the adjacency matrix of the graph is used directly as the input of the graph convolution, the interactions of these nodes cannot be modeled from the connection structure of the graph alone.
We take full advantage of the similarity between nodes to address this limitation. To better apply graph convolution to the fraud detection task, we combine the graph structure and node similarity through a weighting function that considers both the connection structure of the graph and the feature similarity [25] of the nodes. After multiple layers of iterative updates, the model learns the interaction between two nodes even if they are not directly connected. The fusion formula is shown in Equation (4):
Â_ij = α · S_ij + (1 − α) · A_ij
where A ^ denotes the adjacency matrix converted by the formula, S denotes the node similarity matrix, A denotes the graph’s adjacency matrix, and α is the weight coefficient.
After obtaining the new adjacency matrix, the fused weighted adjacency matrix is used as the initial input to the GCN. In the model, a two-layer GCN model is used to obtain the representation of user nodes in every single view, as shown in Equation (5):
f(X, Â) = softmax(Â · ReLU(Â X W^(0)) W^(1))
where W^(0) is the weight matrix from the input layer to the hidden layer and W^(1) is the weight matrix from the hidden layer to the output layer.
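A compact sketch of the similarity-based single-view GCN described by Equations (4) and (5) follows; the row normalisation of the fused matrix and the default value of α are our own assumptions and are not specified above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def single_view_gcn(A, S, X, W0, W1, alpha=0.5):
    """Fuse the view adjacency A with the node similarity S (Equation (4)),
    then run a two-layer GCN forward pass (Equation (5)). Normalisation scheme
    and alpha value are illustrative assumptions."""
    A_hat = alpha * S + (1 - alpha) * A            # Equation (4)
    deg = A_hat.sum(axis=1, keepdims=True) + 1e-12
    A_hat = A_hat / deg                            # simple row normalisation
    H1 = np.maximum(A_hat @ X @ W0, 0)             # ReLU hidden layer
    return softmax(A_hat @ H1 @ W1)                # Equation (5)
```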

3.2. Multi-View Fusion

After obtaining the feature representation of each single view, each view reflects the nodes from only one aspect; to learn a more comprehensive node embedding, the representations of these single views should be aggregated into the final node embedding. The problem is how to aggregate the embeddings of the single views. Intuitively, after splitting the heterogeneous graph with meta-paths, every single view contributes differently to the final representation of the whole heterogeneous graph, because each meta-path represents a different semantic relation with a different importance. Fusing the embeddings of the single views by simple averaging is unreasonable, since features from the more important views may be lost. At the same time, manually assigning a weight to every single view and aggregating the view representations with these weights may reduce the quality of the final heterogeneous graph representation, because it is difficult to determine the importance of each view accurately by hand.
The attention mechanism mimics the attention function of the human brain and can assign different degrees of attention to different modules to highlight the impact of the important ones. Inspired by this, we employ an attention mechanism [26] to automatically learn the attention weights of the different views and aggregate the single views according to the attention coefficients. Taking the features H_k of every single view as input, the attention network yields the attention weight β_k of each view, as shown in Equation (6), where M denotes the number of single views:
β_1, β_2, …, β_M = attention(H_1, H_2, …, H_M)
Specifically, for node i in view k, we concatenate the feature representations of that node in all views and take the inner product with the attention vector to obtain the attention coefficient. The importance of every single view is defined as the average importance of all nodes in that view, as shown in Equation (7):
ω_k = (1 / |V|) Σ_{i ∈ V} Q^T · h_i^k
where Q denotes the attention vector, |V| is the number of nodes in the view, and h_i^k is the concatenation of the feature representations of node i in all views.
After obtaining the importance of every single view, we normalize it with the softmax function to obtain the weight coefficient β_k of each individual view, as shown in Equation (8):
β_k = exp(ω_k) / Σ_{k=1}^{M} exp(ω_k)
Based on the learned weight coefficients, the embeddings of the multiple single views are fused to obtain the final embedding of the user nodes in the heterogeneous graph, as shown in Equation (9):
Z = Σ_{k=1}^{M} β_k · H_k
In the above equation, H_k is the feature representation matrix of each single view, β_k is the weight coefficient of each view, and Z ∈ R^(N×d) is the final output feature matrix, where N is the number of users and d is the output feature dimension of each user.
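A minimal sketch of the multi-view attention fusion in Equations (7)-(9) is given below; for brevity it scores each view's own representation rather than the concatenation across views described above, which is a simplification of the paper's formulation.

```python
import numpy as np

def fuse_views(H_views, Q):
    """Score each single-view representation with the attention vector Q,
    normalise the scores with softmax, and return the weighted sum (sketch)."""
    # Importance of view k: average attention score over all nodes (Equation (7)).
    omega = np.array([np.mean(H_k @ Q) for H_k in H_views])
    beta = np.exp(omega - omega.max())
    beta = beta / beta.sum()                           # Equation (8)
    Z = sum(b * H_k for b, H_k in zip(beta, H_views))  # Equation (9)
    return Z, beta

# Usage: three views with 4 nodes and 8-dimensional embeddings.
rng = np.random.default_rng(0)
views = [rng.random((4, 8)) for _ in range(3)]
Z, beta = fuse_views(views, Q=rng.random(8))
print(beta, Z.shape)
```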

3.3. Fraudulent User Detection

We pass the user node features obtained from the graph representation learning [27] model into a single-layer network to classify the user nodes and distinguish normal users from fraudulent users. The softmax function is used to obtain the class prediction probability of each node, as shown in Equation (10):
ŷ_i = softmax(ReLU(Q z_i + b))
Finally, because the data in the fraud detection task are class-imbalanced, the fraud rate is low. To avoid overfitting caused by the mismatch in the number of samples per class, a class-balanced cross-entropy loss is used as the objective function for optimizing the fraudulent user detection task, with a weight parameter used to augment the weight of the training samples labeled as fraudulent users. We train the semi-supervised classification model by minimizing the loss function shown in Equation (11):
L = −λ Σ_{i ∈ N_0} y_i log(ŷ_i) − Σ_{i ∈ N_1} y_i log(ŷ_i)
where N_0 denotes the set of labeled normal user instances, N_1 denotes the set of labeled fraud instances, y_i denotes the label of user i, ŷ_i denotes the predicted value for user i, and λ is a weight parameter.
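A small sketch of the class-balanced loss in Equation (11) follows; λ multiplies the term over the labeled normal instances N_0 exactly as the equation is written above, and the inputs are assumed to be the predicted probabilities of each node's true class (an assumption, since the label vectors are not spelled out).

```python
import numpy as np

def class_balanced_loss(y_true, y_prob, lam):
    """Cross-entropy summed separately over labeled normal (N0) and fraud (N1)
    instances, with weight lambda on the N0 term as in Equation (11).
    y_true: 0 = normal, 1 = fraud; y_prob: predicted probability of the true class."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    eps = 1e-12
    n0 = y_true == 0
    n1 = y_true == 1
    return (-lam * np.sum(np.log(y_prob[n0] + eps))
            - np.sum(np.log(y_prob[n1] + eps)))

# Usage with toy values.
print(class_balanced_loss([0, 0, 1, 1], [0.9, 0.8, 0.7, 0.6], lam=0.5))
```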

4. Results

4.1. Dataset

To validate the effectiveness of the proposed similarity-based multi-view GCN model for the fraud detection task, we select two real datasets, the Amazon dataset [28] and the MicroblogPCU dataset [29], whose detailed feature information is given in Table 1 and Table 2. Experiments are conducted on both datasets to analyze the effectiveness of the model and the impact of its parameters.
The Amazon dataset is a publicly available dataset that includes product reviews under the musical instruments product category. The users in the dataset are labeled as either normal or fraudulent. We adopt some features as the original input features of users to calculate the similarity between nodes, learn high-level node features with the graph representation learning model, and finally input the learned features into the classifier to identify fraudulent users in the Amazon dataset.
This dataset also contains attribute information that can be used as the initial characteristics of the nodes, including user attributes, rating attributes, product attributes, etc. The user attributes include the user name length and the number of products rated by the user. The product attributes include the product theme, product description, etc. The rating attributes include the number of ratings, the rating ratio, the median, minimum, maximum, and average values of user ratings, as well as the total length of the review text and the sentiment of the review text.
The MicroblogPCU dataset comes from the Sina Weibo social platform; it is a publicly available UCI dataset crawled from logs of users posting tweets on Sina Weibo and is used in machine learning research on social networks. The dataset contains two types of users, normal users and fraudulent users (spammers). It contains 221,579 instances and 20 attributes, where abnormal users are marked as 1 and normal users as 0.
In addition, the dataset contains a number of user attributes, including user ID, gender, age, registered location, number of followers, number of users followed, etc., used to identify a unique user, and microblog attributes, including the text length, posting time, number of retweets, number of comments, whether or not the tweet contains a URL, etc., all of which can be encoded as vectors.
The similarity-based multi-view GCN model for heterogeneous graph representation learning introduced in Section 3 requires computing the similarity between nodes. Therefore, in order to calculate the similarity between user nodes, we select some features as the initial input features of the user nodes. As in other research works, the specific features selected from the Amazon dataset are described in Table 1, and the specific attributes of the MicroblogPCU dataset are described in Table 2.

4.2. Experimental Setting

In the experiments, we set the embedding dimension of all feature vectors of the graph representation learning algorithm to 64 by default, and the training set ratio to 60% for the Amazon dataset and 80% for the microblogPCU dataset.
Since the fraud detection datasets are class-imbalanced and the number of normal users is much larger than the number of fraudulent users, a balance parameter is added to the loss function, and experiments are set up to demonstrate its effectiveness. We choose two evaluation metrics, F1 and AUC, to measure the performance of the model:
(1) F1: the F1-score is the harmonic mean of precision and recall; the higher the F1 value, the better the performance of the model.
(2) AUC: the AUC is the area under the ROC curve, which measures how well the model distinguishes positive from negative samples; the higher the AUC value, the better the performance of the model.
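For reference, the two metrics can be computed with scikit-learn as follows (toy values only, not results from the paper):

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]               # 1 = fraudulent user, 0 = normal user
y_prob = [0.2, 0.4, 0.9, 0.7, 0.1, 0.6]   # predicted fraud probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("F1 :", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))
```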

4.3. Experimental Results and Analysis

To verify the effectiveness of the fraud detection method, experiments were conducted on the two datasets, Amazon and MicroblogPCU, with the experimental setup described above.
First, experiments were conducted on the similarity-based multi-view GCN model to analyze its effectiveness on the fraud detection task. For the baselines, the multi-view GCN model is replaced with the other graph representation learning models, and the performance of the resulting fraud detection models is evaluated. For homogeneous graph representation learning baselines such as GCN and GAT, we ignore the node types and relationship types in the graph, treat the heterogeneous graph as homogeneous, and learn the feature representations of the nodes. After obtaining the low-dimensional feature representations of the user nodes, they are fed into an MLP classifier to classify normal and fraudulent users.
The experimental results are shown in Table 3. From the table, we can see that our heterogeneous graph representation learning model SMGCN outperforms the other graph representation learning models in both F1 and AUC, and also outperforms the three fraud detection comparison models.
Table 3. Experimental results.

Model            | Amazon F1 | Amazon AUC | microblogPCU F1 | microblogPCU AUC
GCN [30]         | 74.60     | 90.21      | 82.73           | 91.00
GAT [31]         | 74.61     | 90.46      | 83.25           | 91.36
GraphSAGE        | 71.63     | 90.30      | 81.50           | 91.05
SemiGNN          | 74.62     | 90.81      | 83.76           | 91.59
GraphConsis [32] | 74.08     | 89.56      | 82.80           | 88.79
CARE-GNN [33]    | 60.00     | 82.73      | 78.39           | 90.60
SMGCN (ours)     | 76.84     | 92.39      | 86.07           | 93.17
The SMGCN heterogeneous graph representation learning model is able to achieve better results because it considers both graph heterogeneity and node similarity. Although the CARE-GNN fraud detection model also considers the similarity between nodes, unlike the model proposed in this paper, we use similarity to model the interaction of isolated nodes and design a fusion function to fuse the node similarity matrix and the graph adjacency matrix so that the model can automatically learn the initial features of nodes and the importance of the graph structure.
For a single view transformed from each meta-path, the SMGCN graph representation learning model uses a semi-supervised GCN network to learn the low-dimensional embedding of the nodes, followed by an attention mechanism to aggregate the representations of multiple views to generate the final representation. Finally, to address the problem of class imbalance in the fraud data, the model performance is further improved by using class-balanced cross-entropy loss as an objective function for the optimization of the fraud user detection task and adding a weight parameter to enhance the weight of labeled data in a few but important classes.
In addition, ablation experiments are performed. In the single-view convolutional networks, the similarity between nodes is not calculated and the original adjacency matrix of the graph is used as the input of the GCN, while the other parts are kept unchanged, forming the model SMGCN-noSim. We validate the model on the two datasets, and the experimental results are shown in Figure 3; the left side shows the results on the Amazon dataset and the right side the results on the microblogPCU dataset. Figure 3 shows that the model incorporating similarity performs better on both F1 and AUC than the model without node similarity, which illustrates the effectiveness of the node similarity and the fusion function for the fraud detection task.
We also test the effect of the training set proportion on model performance. The dataset is divided into a training set and a test set, and the training set proportion is set to 20%, 40%, 60%, and 80% to study the changes in the AUC and F1 values of the model. The experimental results are shown in Figure 4; the left panel shows the results on the Amazon dataset and the right panel the results on the microblogPCU dataset.
From the figure, we can see that F1 and AUC reach their highest values when the training set proportion of the Amazon dataset reaches 60%, after which the model performance does not change significantly as the proportion increases. For the microblogPCU dataset, the highest F1 and AUC values are obtained when the training set proportion reaches 80%. Therefore, we set the training set ratio to 60% for the Amazon dataset and 80% for the microblogPCU dataset.

4.4. Summary of This Section

To address the problems of isolated nodes and class imbalance in the heterogeneous graph structures found in the fraud detection field, this paper proposes a fraud detection model based on similarity and multi-view graph convolution (SMGCN). In the model, heterogeneous graphs are transformed into multiple views using meta-paths, the node similarity is fused into the graph convolutional neural network of every single view to learn the node representations of each view, and finally the multiple views are fused using an attention mechanism. The SMGCN model is able to automatically assign different weights to structural features and original node features, and integrates multiple relationships through the attention mechanism, thus obtaining richer higher-order structural information that facilitates the recognition of fraudster nodes in heterogeneous graphs. In addition, to address the problem of unbalanced data classes and low fraud rates in fraud detection tasks, the class-balanced cross-entropy loss is used as the objective function for optimizing the fraudulent user detection task, and a weight parameter is used to augment the weights of the samples marked as fraudulent. In real-life application scenarios, information in social networks is updated in real time, and some historical information is also useful for identifying fraudulent users. Therefore, in future work, we will further explore the application of dynamic graph representation learning in the field of fraud detection to make full use of users' historical information and further improve the detection efficiency of the model.

5. Conclusions

This article proposes a fraud detection model based on similarity and multi-view graph convolution (SMGCN) to address the problems of isolated nodes and imbalanced data categories in the heterogeneous graph structures found in the fraud detection field. In the model, the heterogeneous graph is transformed into multiple views with meta-paths, the node similarity is fused into the graph convolutional neural network of each single view to learn the node representation of each view, and finally the attention mechanism is used to fuse the multiple views. The SMGCN model can automatically assign different weights to structural features and original node features, and integrates multiple relationships through the attention mechanism to obtain richer higher-order structural information, which is beneficial for identifying fraudster nodes in heterogeneous graphs. In addition, to address the issue of imbalanced data categories and low fraud rates in fraud detection tasks, the class-balanced cross-entropy loss is used as the optimization objective of the fraudulent user detection task, and a weight parameter is used to enhance the weight of the labeled fraud samples. In real application scenarios, information in social networks is updated in real time, and some historical information is also useful for identifying fraudulent users. Therefore, in future work, we will further explore the application of dynamic graph representation learning to fraud detection, fully utilizing users' historical information and further improving the detection efficiency of the model.

Author Contributions

Conceptualization, X.Z. and J.Z.; Methodology, X.Z.; Validation, X.Z.; Writing—original draft, X.Z. and C.F.; Writing—review & editing, X.Z. and C.F.; Formal analysis, C.F. and H.S.; Investigation, C.F.; Supervision, Z.Y., J.Z. and H.S.; Project administration, Z.Y., J.Z. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the National Natural Science Foundation of China (U21B2046).

Data Availability Statement

No additional data are available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dong, Y.; Chawla, N.V.; Swami, A. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 135–144. [Google Scholar]
  2. Fu, T.-Y.; Lee, W.-C.; Lei, Z. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1797–1806. [Google Scholar]
  3. Zhao, T.; Deng, C.; Yu, K.; Jiang, T.; Wang, D.; Jiang, M. Error-bounded graph anomaly loss for gnns. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 1873–1882. [Google Scholar]
  4. Wang, L.; Li, P.; Xiong, K.; Zhao, J.; Lin, R. Modeling heterogeneous graph network on fraud detection: A community-based framework with attention mechanism. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 1959–1968. [Google Scholar]
  5. Mansour, A.Z.; Ahmi, A.; Popoola, O.M.J.; Znaimat, A. Discovering the global landscape of fraud detection studies: A bibliometric review. J. Financ. Crime 2022, 29, 701–720. [Google Scholar] [CrossRef]
  6. Waqas, M.; Tu, S.; Halim, Z.; Rehman, S.U.; Abbas, G.; Abbas, Z.H. The role of artificial intelligence and machine learning in wireless networks security: Principle, practice and challenges. Artif. Intell. Rev. 2022, 55, 5215–5261. [Google Scholar] [CrossRef]
  7. Azizi, I.; Echihabi, K.; Palpanas, T. Elpis: Graph-based similarity search for scalable data science. Proc. VLDB Endow. 2023, 16, 1548–1559. [Google Scholar] [CrossRef]
  8. Tang, S.; Jin, L.; Cheng, F. Fraud detection in online product review systems via heterogeneous graph transformer. IEEE Access 2021, 9, 167364–167373. [Google Scholar] [CrossRef]
  9. Hamilton, W. Inductive representation learning on large graphs. arXiv 2017, arXiv:1706.02216. [Google Scholar]
  10. Wang, D. A semi-supervised graph attentive network for financial fraud detection. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 598–607. [Google Scholar]
  11. Liu, Y.; Sun, Z.; Zhang, W. Improving fraud detection via hierarchical attention-based graph neural network. J. Inf. Secur. Appl. 2023, 72, 103399. [Google Scholar] [CrossRef]
  12. Li, P.; Xie, Y.; Xu, X.; Zhou, J.; Xuan, Q. Phishing fraud detection on ethereum using graph neural network. In Proceedings of the Blockchain and Trustworthy Systems: 4th International Conference, BlockSys 2022, Chengdu, China, 4–5 August 2022; Revised Selected Papers. Springer: Berlin/Heidelberg, Germany, 2022; pp. 362–375. [Google Scholar]
  13. Jiang, J.; Chen, J.; Gu, T.; Choo, K.-K.R.; Liu, C.; Yu, M.; Huang, W.; Mohapatra, P. Anomaly detection with graph convolutional networks for insider threat and fraud detection. In Proceedings of the MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), Norfolk, VA, USA, 12–14 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 109–114. [Google Scholar]
  14. Chen, H.; Huang, Z.; Xu, Y.; Deng, Z.; Huang, F.; He, P.; Li, Z. Neighbor enhanced graph convolutional networks for node classification and recommendation. Knowl.-Based Syst. 2022, 246, 108594. [Google Scholar] [CrossRef]
  15. Wang, X.; Bo, D.; Shi, C.; Fan, S.; Ye, Y.; Philip, S.Y. A survey on heterogeneous graph embedding: Methods, techniques, applications and sources. IEEE Trans. Big Data 2022, 9, 415–436. [Google Scholar] [CrossRef]
  16. Shan, J.; Ye, C.; Jiang, Y.; Jaroniec, M.; Zheng, Y.; Qiao, S.-Z. Metal-metal interactions in correlated single-atom catalysts. Sci. Adv. 2022, 8, eabo0762. [Google Scholar] [CrossRef]
  17. Aranda, A. Ib-homogeneous graphs. Discret. Math. 2022, 345, 113015. [Google Scholar] [CrossRef]
  18. Wang, X.; Yan, G.; Yang, Z. Relative entropy of k-order edge capacity for nodes similarity analysis. Int. J. Mod. Phys. C (IJMPC) 2023, 34, 1–19. [Google Scholar] [CrossRef]
  19. Bhatti, U.A.; Tang, H.; Wu, G.; Marjan, S.; Hussain, A. Deep learning with graph convolutional networks: An overview and latest applications in computational intelligence. Int. J. Intell. Syst. 2023, 2023, 8342104. [Google Scholar] [CrossRef]
  20. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  21. Gazda, M.; Drotar, P.; Romaguera, L.V.; Kadoury, S. End-to-end deformable attention graph neural network for single-view liver mesh reconstruction. arXiv 2023, arXiv:2303.07432. [Google Scholar]
  22. Pei, Y.; Huang, T.; van Ipenburg, W.; Pechenizkiy, M. Resgcn: Attention-based deep residual modeling for anomaly detection on attributed networks. In Proceedings of the 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), Porto, Portugal, 6–9 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–3. [Google Scholar]
  23. Du, G.; Zhou, L.; Li, Z.; Wang, L.; Lü, K. Neighbor-aware deep multi-view clustering via graph convolutional network. Inf. Fusion 2023, 93, 330–343. [Google Scholar] [CrossRef]
  24. Hurley, N.; Rickard, S. Comparing measures of sparsity. IEEE Trans. Inf. Theory 2009, 55, 4723–4741. [Google Scholar] [CrossRef] [Green Version]
  25. Qian, W.; Xiong, Y.; Yang, J.; Shu, W. Feature selection for label distribution learning via feature similarity and label correlation. Inf. Sci. 2022, 582, 38–59. [Google Scholar] [CrossRef]
  26. Yang, J.; Gao, L.; Tan, Q.; Yihua, H.; Xia, S.; Lai, Y.-K. Multiscale mesh deformation component analysis with attention-based autoencoders. IEEE Trans. Vis. Comput. Graph. 2021, 29, 1301–1317. [Google Scholar] [CrossRef]
  27. Wang, L.; Lu, J.; Sun, Y. Knowledge graph representation learning model based on meta-information and logical rule enhancements. J. King Saud. Univ.-Comput. Inf. Sci. 2023, 35, 112–125. [Google Scholar] [CrossRef]
  28. McAuley, J.J.; Leskovec, J. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 897–908. [Google Scholar]
  29. Hai-Ning, M.; Kai, F.; Lei, Z.; Bei-bei, Z.; Xin-yu, T.; Xin-hong, H. Short-text clustering algorithm based on laplacian graph. ACTA Electron. Sin. 2021, 49, 1716. [Google Scholar]
  30. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  31. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2018, arXiv:1710.10903. [Google Scholar]
  32. Liu, Z.; Dou, Y.; Yu, P.S.; Deng, Y.; Peng, H. Alleviating the inconsistency problem of applying graph neural network to fraud detection. arXiv 2020, arXiv:2005.00625. [Google Scholar]
  33. Dou, Y.; Liu, Z.; Sun, L.; Deng, Y.; Peng, H.; Yu, P.S. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. arXiv 2020, arXiv:2008.08692. [Google Scholar]
Figure 1. The architecture of SMGCN.
Figure 2. Multi-view network built from a heterogeneous graph.
Figure 3. Ablation experiment.
Figure 4. The effect of training set proportion.
Table 1. Detailed features of the Amazon dataset.

Feature Type         | Feature Name                                                                              | Feature Description
User attributes      | Username length; number of rated products                                                 | The total number of user-name characters; how many products the user has rated
Score attributes     | Rating quantity; rating ratio; median, minimum, maximum, and mean rating; rating entropy  | The number of each rating level given by the user; the proportion of each rating level given by the user; fraudsters may give extreme ratings to lower or lift a product; measures the skewness of the ratings
Time attributes      | Metrics of the same date; days of interval; time entropy                                  | Whether comments are on the same date; the number of days between the user's first and last rating; the skewness of the user's rating time
Comment attributes   | Comment text length; comment text sentiment                                               | Words included in the comment text; represents the sentiment of all user comments
Commodity attributes | Product title; product description                                                        | Represents the type and title of the product; detailed description of the product
Table 2. Detailed features of the MicroblogPCU dataset.

Feature Type         | Feature Name                 | Feature Description
User attributes      | Username length              | The total number of user-name characters for a user
                     | Number of posts on Microblog | How many tweets the user has posted
                     | Level                        | The user's current level
                     | Following                    | The number of other users that the user follows
                     | Follower                     | How many fans the user has
Microblog attributes | Text length                  | Total number of words in the text
                     | Release time                 | The release time of the microblog
                     | URL quantity                 | The number of URL links
                     | Number of forwardings        | The number of reposts of a micro-blog post
                     | Number of comments           | The number of comments under a micro-blog post
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Zheng, X.; Feng, C.; Yin, Z.; Zhang, J.; Shen, H. Research on Fraud Detection Method Based on Heterogeneous Graph Representation Learning. Electronics 2023, 12, 3070. https://doi.org/10.3390/electronics12143070
