Advancing Graph Neural Networks for Complex Relational Learning: A Multi-Scale Heterogeneity-Aware Framework with Adversarial Robustness and Interpretable Analysis

Yang, Hao; Zhou, Yunhong; Ji, Xianzhe; Liu, Zifan; Tian, Zhen; Tang, Qiang; Shi, Yanchao

doi:10.3390/math13182956


by Hao Yang ¹,†, Yunhong Zhou ²,†, Xianzhe Ji ³, Zifan Liu ⁴, Zhen Tian ⁵, Qiang Tang ⁶ and Yanchao Shi ⁷,*

¹ School of Mathematics and Computer Science, Panzhihua University, Jichang Rd, East District, Panzhihua 617000, China

² School of Mathematics and Computational Science, Wuyi University, Jiangmen 529020, China

³ International Business School Suzhou, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China

⁴ School of Mathematics, Jilin University, Changchun 130012, China

⁵ James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, UK

⁶ School of Artificial Intelligence, Anhui University of Science and Technology, Hefei 231131, China

⁷ School of Business, Linyi University, Shuangling Road, Lanshan District, Linyi 276000, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2025, 13(18), 2956; https://doi.org/10.3390/math13182956

Submission received: 1 August 2025 / Revised: 3 September 2025 / Accepted: 5 September 2025 / Published: 12 September 2025

(This article belongs to the Special Issue New Advances in Graph Neural Networks (GNNs) and Applications)


Abstract

Graph Neural Networks (GNNs) face fundamental algorithmic challenges in real-world applications due to a combination of data heterogeneity, adversarial heterophily, and severe class imbalance. A critical research gap exists for a unified framework that can simultaneously address these issues, limiting the deployment of GNNs in high-stakes domains like financial fraud detection and social network analysis. This paper introduces HAG-CFNet, a novel framework designed to bridge this gap by integrating three key innovations: (1) a heterogeneity-aware message-passing mechanism that uses relation-specific attention to capture rich semantic information; (2) a dual-channel heterophily detection module that explicitly identifies and neutralizes adversarial camouflage through separate aggregation pathways; and (3) a domain-aware counterfactual generator that produces plausible, actionable explanations by co-optimizing feature and structural perturbations. These are supported by a synergistic imbalance correction strategy combining graph-adapted oversampling with cost-sensitive learning. Extensive testing on large-scale financial datasets validates the framework’s impact: HAG-CFNet achieves a 4.2% AUC-PR improvement over state-of-the-art methods, demonstrates superior robustness by reducing performance degradation under structural noise by over 50%, and generates counterfactual explanations with 91.8% validity while requiring minimal perturbations. These advances provide a direct pathway to building more trustworthy and effective AI systems for critical applications ranging from financial risk management to supply chain analysis and social media content moderation.

Keywords: graph neural networks; heterogeneous graph learning; message-passing algorithms; adversarial robustness; heterophily detection; interpretable machine learning; multi-scale graph analysis

MSC: 68T01

1. Introduction

Graph Neural Networks (GNNs) have emerged as a transformative paradigm for learning from complex relational data, demonstrating exceptional capabilities in domains ranging from social network analysis and bioinformatics to recommendation systems [1,2]. Their strength lies in capturing intricate dependencies within graph structures. However, deploying GNNs in high-stakes, real-world applications—particularly adversarial environments like financial fraud detection—demands more advanced and robust methodologies [3].

Three core challenges currently constrain GNN effectiveness in such settings. First, real-world graphs are inherently heterogeneous, comprising diverse node and edge types that conventional GNNs struggle to model, thereby failing to capture rich semantic information [4,5]. In finance, this means distinguishing between transactions, accounts, and devices. Second, the core GNN assumption of homophily—where connected nodes are similar—often fails. Malicious actors intentionally connect with benign entities to camouflage their activities, creating heterophilic patterns that mislead standard GNNs [6]. This is the primary tactic in sophisticated fraud schemes. Third, the “black-box” nature of deep graph models hinders their adoption where transparency is critical for trust and regulatory compliance, such as the “right to explanation” required in financial audits [3,7].

While recent research has introduced specialized models to tackle these issues individually, a comprehensive framework that simultaneously addresses all three challenges within a single, unified architecture remains a significant research gap. Existing methods often excel in one area at the expense of others, failing to provide a solution that is at once heterogeneity-aware, robust to adversarial heterophily, and fully interpretable.

In response to these challenges, this paper introduces HAG-CFNet (heterogeneity-aware graph network with counterfactual explanations), a novel framework designed for robust and interpretable learning on complex, heterogeneous graphs. The principal contributions of this work are as follows:

We propose HAG-Net, a multi-scale GNN that leverages relation-specific message passing to handle diverse entity types and employs a dual-channel aggregation mechanism to explicitly detect and counteract adversarial camouflage patterns, enhancing robustness in complex graph environments.
We introduce CF-Gen, an interpretability module that generates actionable explanations by jointly optimizing perturbations across both node features and graph structures. The optimization is guided by domain-aware constraints, ensuring the resulting counterfactuals are plausible and relevant for high-stakes decision-making.
We implement a synergistic approach that combines a graph-adapted synthetic oversampling technique, evolved from GraphSMOTE [7], with a cost-sensitive learning strategy to effectively address the extreme class imbalance inherent in fraud detection and other real-world applications.
We conduct extensive experiments on multiple large-scale, real-world datasets, demonstrating that our framework delivers significant improvements in detection performance, robustness, and computational efficiency over state-of-the-art baselines.

The subsequent sections of this manuscript are structured as follows: Section 2 provides a comprehensive survey of relevant studies and theoretical foundations in Graph Neural Networks and related methodologies. Section 3 presents the detailed architecture and algorithmic innovations of the proposed HAG-CFNet framework. Section 4 outlines the experimental design and evaluation protocols across multiple application domains. Section 5 reports comprehensive performance analysis and comparative results, supported by both simulation studies and practical case examples, demonstrating the effectiveness of the proposed advances. Section 6 discusses practical implications, computational considerations, and deployment strategies. Section 7 summarizes our findings and presents a detailed discussion of the work’s limitations and future research directions.

2. Related Work

This section reviews foundational and advanced concepts in Graph Neural Networks, focusing on the specific challenges that motivate our work: heterogeneity, heterophily, explainability, and data imbalance. We critically analyze existing approaches to establish the context and novelty of HAG-CFNet. This situates our work within a broader trend of applying advanced machine learning to secure complex, networked systems, a challenge also prevalent in domains like Internet of Things (IoT) security [8].

2.1. Advances in Graph Neural Networks

The development of Graph Neural Networks has marked a paradigm shift in machine learning approaches for relational data [3]. Several seminal GNN architectures have established the theoretical and practical foundations for modern graph learning. Graph Convolutional Networks (GCNs) pioneered the application of localized convolution operations to graph structures, leveraging neighborhood connectivity patterns to learn discriminative node representations [7]. However, their performance is often hampered by over-smoothing in deep architectures, which makes node representations indistinguishable. Graph Attention Networks (GATs) addressed these limitations through the introduction of attention mechanisms that enable nodes to selectively weight the importance of their neighbors during information aggregation, leading to improved performance in capturing nuanced relational patterns [7,9]. GraphSAGE introduced inductive learning capabilities through sampling-based aggregation, enabling the generation of embeddings for previously unseen nodes, a crucial capability for dynamic environments where graph structures continuously evolve [10]. While powerful, these seminal models largely operate under a homophily assumption and were designed for homogeneous graphs. This limits their applicability in complex real-world systems, such as financial networks, which are inherently heterogeneous and often exhibit non-homophilic connections, and it motivates the development of more specialized architectures. The core challenge of learning from such complex relational data is not unique to finance; similar principles are applied in diverse network science problems, including optimizing routing in specialized networks like optical network-on-chip and opportunistic networks [11,12].

2.2. Heterogeneous and Heterophily-Aware GNNs

To address the complexity of real-world graphs, two significant research directions have emerged: handling graph heterogeneity and combating adversarial heterophily.

Heterogeneous Graph Neural Networks (HGNNs) are designed to process graphs with multiple node and edge types. By employing relation-specific transformations and attention mechanisms, models like LIFE and C-FATH can learn richer representations that capture the semantic diversity of entities and relationships [7,13,14]. However, many HGNNs require pre-defined meta-paths to guide message passing, and they may not be inherently robust to adversarial connections that violate structural assumptions. More recently, the field has embraced the pre-training and fine-tuning paradigm, with models like HG-Adapter using dual adapters to efficiently bridge the gap between large pre-trained models and specific downstream tasks [15].

Conversely, heterophily-aware GNNs directly tackle the failure of the homophily assumption, a critical issue in adversarial domains like fraud detection where malicious actors deliberately connect to benign nodes [16]. Models such as HUGE, DHMP, and SGNN-IB use specialized designs, like dual-channel message passing or spectral methods, to distinguish between and learn from homophilic and heterophilic connections [16,17,18]. Other recent approaches have focused on tackling the scalability challenges of heterophily, with models like LD2 proposing decoupled embeddings to manage the computational complexity of non-local aggregation in large graphs [19]. While effective, these models are often designed for homogeneous graphs and may not fully leverage the rich semantic information present in heterogeneous networks. As summarized in Table 1, a significant gap exists for a unified framework that is both heterogeneity-aware and robust to adversarial heterophily, which is the primary focus of HAG-CFNet.

2.3. Explainable AI and Counterfactuals for GNNs

The demand for explainable AI (XAI) in the financial sector is driven by the need for transparency in high-stakes decision-making, regulatory compliance, and building user trust [7,20]. The “black-box” nature of complex models like GNNs is a significant hurdle [7]. Counterfactual explanations (CFEs) are particularly appealing because they answer “what-if” questions [21]. For GNNs, methods like CF-GNNExplainer focus on structural perturbations (edge removal) [22], while others like COMBINEX offer a unified approach by jointly optimizing perturbations across both node features and graph structures [21]. However, a critical limitation of many existing CFE methods is the lack of domain awareness; they may generate explanations that are statistically valid but practically infeasible. Furthermore, the research frontier is expanding from static to dynamic graphs, with emerging methods like CoDy designed to generate counterfactual explanations for evolving graph structures [23]. Our framework, CF-Gen, addresses the domain-awareness gap by incorporating domain-specific plausibility constraints directly into the optimization process, ensuring that the generated explanations are both insightful and realistic for static graphs.

2.4. Imbalanced Learning on Graphs

Data imbalance is a pervasive issue in fraud detection, as fraudulent transactions are typically rare events [7]. This can cause models to be biased towards the majority class. To address this, two main strategies are employed. At the data level, oversampling techniques like GraphSMOTE generate synthetic minority nodes in an embedding space to create a more balanced data distribution [7]. At the algorithm level, cost-sensitive learning assigns a higher misclassification penalty to the minority class, forcing the model to pay more attention to it [24]. While these rebalancing-centric approaches are common, the field is evolving, with some recent studies challenging this traditional view by proposing novel methods that can effectively learn on imbalanced graphs without explicit class rebalancing [25]. Our work integrates an adapted version of GraphSMOTE that is both heterogeneity- and heterophily-aware by synergistically combining it with a cost-sensitive loss to provide a robust solution to the imbalance problem, acknowledging that this is an area of active research.

2.5. Causality and Privacy in GNNs

Beyond the core challenges addressed in this paper, two emerging research frontiers are critical for the development of truly trustworthy GNNs: causal inference and federated learning. Causal GNNs aim to move beyond correlational patterns to understand the underlying causal mechanisms within graph structures. This is vital for building models that are more robust to distribution shifts and for generating explanations that reflect true cause-and-effect relationships, a direct extension of the goals of our CF-Gen module [26]. Separately, federated learning (FL) on graphs addresses the critical need for privacy in sensitive domains like finance. It enables multiple institutions to collaboratively train a powerful model like HAG-CFNet on their combined data without sharing the raw, private information [2]. The need for such privacy-preserving and robust technologies is further amplified in emerging distributed ecosystems, such as the integration of blockchain with IoT, where data integrity and security are paramount [27]. While a full exploration of these areas is beyond the scope of this work, they represent crucial next steps in deploying advanced GNN frameworks in real-world, regulated environments and form a key part of our future research agenda.

3. Proposed Framework: HAG-CFNet

3.1. Overall Architecture and Design Principles

To address the fundamental challenges in graph neural network learning for complex relational systems, we propose the HAG-CFNet (heterogeneity-aware graph network with counterfactual explanations) framework. The framework’s design is predicated on four core hypotheses that guide its architecture and functionality:

Hypothesis 1 (H1): By explicitly modeling the diverse types of entities and relationships inherent in complex heterogeneous graphs, a heterogeneity-aware GNN architecture will capture more nuanced relational patterns and achieve superior learning performance compared to standard homogeneous GNN approaches and traditional machine learning models.
Hypothesis 2 (H2): Directly identifying and counteracting heterophilic connections—where dissimilar nodes are connected in adversarial or complex relational environments—is crucial for improving model robustness and accurately learning from graph structures that violate the traditional homophily assumption underlying most GNN architectures.
Hypothesis 3 (H3): For generating interpretable explanations in graph learning, a joint optimization approach that simultaneously perturbs both node features and graph structures, while being constrained by domain-specific plausibility rules, produces more realistic, actionable, and trustworthy insights than methods that consider these aspects in isolation.
Hypothesis 4 (H4): A synergistic strategy that combines advanced graph-based oversampling techniques at the data level with cost-sensitive loss functions at the algorithm level is more effective at mitigating severe class imbalances in graph learning tasks than applying either technique independently.

Based on these principles, HAG-CFNet is composed of several modular components, as illustrated in Figure 1. The process begins with graph construction, which transforms raw tabular data into a typed, heterogeneous graph. The core predictive component is the heterogeneity-aware Graph Neural Network (HAG-Net), which learns node representations through two key mechanisms: relation-specific message passing to handle heterogeneity and a dual-channel detection module to address adversarial heterophily. The final node embeddings are fed into a prediction head (a multi-layer perceptron) to generate fraud probabilities. To ensure transparency, the interpretable counterfactual explanation module (CF-Gen) analyzes the model’s predictions and generates human-understandable “what-if” scenarios. Finally, the entire training process is supported by an imbalance correction strategy that integrates data-level oversampling with algorithm-level cost-sensitive learning. For clarity, the primary mathematical notations used throughout this section are summarized in Table 2.

3.2. Heterogeneity-Aware Message Passing

To effectively model diverse entity types and relationships in accordance with H1, HAG-Net employs a relational attention mechanism, as detailed in Algorithm 1. For each relation type $r \in R$, a type-specific transformation matrix $W_{r}^{(l)}$ projects the features of source nodes into a common semantic space. This is crucial for handling features from different domains and dimensionalities. The attention mechanism then learns to weigh the importance of different neighbors for a target node $v$. For a layer $l$, the attention-weighted message from a neighbor $u$ under relation $r$ is calculated based on an attention score $\alpha_{uv}^{(l, r)}$, which is derived from the embeddings of both the source and target nodes. The aggregated message for relation $r$ is then the weighted sum of the transformed embeddings of its neighbors:

$$h_{v, r}^{(l)} = \sum_{u \in N_{r}(v)} \alpha_{uv}^{(l, r)} \cdot \left(W_{r}^{(l)} h_{u}^{(l - 1)}\right).$$

Algorithm 1 HAG-Net message passing for a single node

procedure HAGNetUpdate($v, l, \{h^{(l - 1)}\}, G, \{W^{(l)}\}, \{a^{(l)}\}$)
    Input: Target node $v$, layer index $l$, embeddings from previous layer $\{h_{u}^{(l - 1)}\}_{u \in V}$, graph $G = (V, E, R)$, learnable weights $\{W_{r}^{(l)}\}_{r \in R}$, attention vectors $\{a_{r}^{(l)}\}_{r \in R}$.
    Output: Updated embedding $h_{v}^{(l)}$ for node $v$.
    Let relation_messages be an empty list
    for all relation type $r \in R$ do
        Let neighbor_messages be an empty list
        if $v$ has neighbors $N_{r}(v)$ under relation $r$ then
            % Attention score calculation
            for all neighbor $u \in N_{r}(v)$ do
                $h_{u}^{'} \leftarrow W_{r}^{(l)} \cdot h_{u}^{(l - 1)}$ ▹ Transform neighbor embedding
                $e_{uv}^{(r)} \leftarrow \mathrm{LeakyReLU}\left((a_{r}^{(l)})^{T} \cdot \mathrm{concat}(W_{r}^{'} h_{v}^{(l - 1)}, W_{r}^{'} h_{u}^{'})\right)$ ▹ Calculate attention score
            end for
            % Attention normalization
            $\{\alpha_{uv}^{(r)}\}_{u \in N_{r}(v)} \leftarrow \mathrm{softmax}_{u}\left(\{e_{uv}^{(r)}\}_{u \in N_{r}(v)}\right)$ ▹ Normalize scores
            % Relation-specific message aggregation
            $h_{v, r}^{(l)} \leftarrow \sum_{u \in N_{r}(v)} \alpha_{uv}^{(r)} \cdot h_{u}^{'}$ ▹ Compute weighted sum of messages
            append $h_{v, r}^{(l)}$ to relation_messages
        end if
    end for
    % Heterophily-aware aggregation across relations
    $h_{v}^{(l)} \leftarrow \mathrm{HeterophilyAwareAggregation}(\text{relation\_messages})$ ▹ Combine messages from all relations
    return $h_{v}^{(l)}$
end procedure
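The per-relation part of Algorithm 1 (transform, score, normalize, aggregate) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `leaky_relu`, `softmax`, `relation_message`) are hypothetical, and the attention score is assumed to follow the standard GAT form with a single weight matrix per relation, because the auxiliary matrix written $W_{r}^{'}$ in the algorithm is not defined in the text.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return z if z > 0 else slope * z

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_message(h_v, neighbor_embs, W_r, a_r):
    """Attention-weighted message for one relation r.

    ASSUMPTION: the score is LeakyReLU(a_r . concat(W_r h_v, W_r h_u)),
    i.e. the usual GAT scoring with one weight matrix per relation.
    """
    hv_t = matvec(W_r, h_v)                               # transformed target
    transformed = [matvec(W_r, h_u) for h_u in neighbor_embs]
    scores = [
        leaky_relu(sum(a * z for a, z in zip(a_r, hv_t + hu_t)))
        for hu_t in transformed
    ]
    alphas = softmax(scores)                              # normalize over neighbors
    dim = len(hv_t)
    message = [
        sum(alpha * hu_t[k] for alpha, hu_t in zip(alphas, transformed))
        for k in range(dim)
    ]
    return message, alphas
```

Calling `relation_message` once per relation type and collecting the results would give the list that Algorithm 1 passes to its cross-relation aggregation step.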

3.3. Dual-Channel Heterophily Detection

To address the camouflage problem as stated in H2, where the standard GNN assumption of homophily fails, the framework introduces a module that explicitly modulates the aggregation process. This mechanism first calculates a local heterophily score for each node, $S_{het}(v) \in [0, 1]$, which quantifies the feature dissimilarity within its neighborhood. A high score suggests that the node is connected to dissimilar neighbors, a common pattern for fraudsters. The aggregation function is then modulated by this score. We employ a dual-channel approach where neighbors are dynamically routed to either a homophilic aggregator or a heterophilic aggregator based on their similarity to the target node. The final node update is a fusion of the outputs from both channels, for example, $h_{v}^{(l)} = \mathrm{FUSION}(h_{v}^{(l - 1)}, h_{homo}^{(l)}, h_{hetero}^{(l)})$. This design allows HAG-Net to learn distinct patterns from both legitimate-looking (homophilic) and suspicious (heterophilic) connections, making it robust against camouflage.
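The text does not give closed forms for $S_{het}(v)$, the routing rule, or FUSION, so the following is only a sketch of one way the dual-channel idea could be realized. Every concrete choice here is an assumption made for illustration: the score is taken as the mean of $(1 - \cos)/2$ over neighbors, routing uses a hypothetical cosine `threshold`, both aggregators are plain means, and fusion is a simple average of the three vectors.

```python
import math

def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

def heterophily_score(h_v, neighbors):
    """ASSUMED definition of S_het(v): mean of (1 - cos)/2, clamped to [0, 1]."""
    if not neighbors:
        return 0.0
    raw = sum((1.0 - cosine(h_v, h_u)) / 2.0 for h_u in neighbors) / len(neighbors)
    return min(1.0, max(0.0, raw))

def mean_vector(vectors, dim):
    """Element-wise mean; a zero vector when the channel is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def dual_channel_update(h_v, neighbors, threshold=0.5):
    """Route neighbors by similarity to the target, aggregate, then fuse.

    ASSUMPTIONS: cosine threshold routing, mean aggregators, and an
    unweighted average as the fusion function.
    """
    dim = len(h_v)
    homo = [h for h in neighbors if cosine(h_v, h) >= threshold]
    hetero = [h for h in neighbors if cosine(h_v, h) < threshold]
    h_homo = mean_vector(homo, dim)
    h_hetero = mean_vector(hetero, dim)
    return [(a + b + c) / 3.0 for a, b, c in zip(h_v, h_homo, h_hetero)]
```

In a trained model the two aggregators and the fusion step would normally carry learnable parameters; the fixed averages above only show how the routing separates the two neighbor groups.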

These modules operate in synergy to build a robust node representation. The heterogeneity-aware message-passing module first generates semantically rich embeddings that respect the diverse types of nodes and edges. This provides a meaningful foundation for the dual-channel heterophily module to accurately compute inter-node similarities. By reliably distinguishing between homophilic and heterophilic neighbors, the dual-channel aggregation can effectively learn from both supportive and contradictory signals, ultimately enhancing the discriminative power of the final node embeddings fed to the classifier. This integrated approach ensures that the model is sensitive not only to what type of information is passed but also to the context (homophilic or heterophilic) in which it is passed.

3.4. Interpretable Counterfactual Generation (CF-Gen)

To address the “black-box” problem and align with H3, the CF-Gen module generates interpretable counterfactual explanations. Given an instance $v_{i}$ and its prediction $f(X_{i}, A_{i})$, the goal is to find the smallest perturbation $(\Delta X_{i}, \Delta A_{i})$ to both its features and local structure such that the perturbed instance $(X_{i}^{'}, A_{i}^{'}) = (X_{i} + \Delta X_{i}, A_{i} + \Delta A_{i})$ results in a flipped prediction $f(X_{i}^{'}, A_{i}^{'}) = y_{target}$ while ensuring sparsity and plausibility.

A joint optimization framework is critical for generating meaningful explanations. In financial fraud, a fraudster’s behavior is manifested not only through transactional attributes (e.g., amount and location) but, more importantly, through their transactional network (e.g., connections to mule accounts or suspicious merchants). An explanation that only modifies features might be incomplete or unrealistic, as it ignores the structural context. Similarly, altering only the structure might not capture scenarios where feature manipulation is the primary fraudulent tactic. Therefore, our joint optimization framework, which perturbs both features and structures, is designed to generate more comprehensive and realistic explanations that align better with domain knowledge and the multifaceted nature of financial crime.

The core of CF-Gen is solving the following optimization problem:

$$\min_{\Delta X_{i}, \Delta A_{i}} L_{pred}(f(X_{i}^{'}, A_{i}^{'}), y_{target}) + \lambda_{feat} C_{feat}(\Delta X_{i}) + \lambda_{struct} C_{struct}(\Delta A_{i}) + \lambda_{plaus} C_{plaus}(X_{i}^{'}, A_{i}^{'})$$

where $\lambda_{feat}$, $\lambda_{struct}$, and $\lambda_{plaus}$ are non-negative hyperparameters that control the trade-off between flipping the prediction and the cost of the perturbations. Here, each component serves a specific purpose:

$L_{pred}$ is the prediction loss that pushes the perturbed instance’s prediction towards the target class.
$C_{feat}(\Delta X_{i})$ penalizes feature perturbations to ensure sparsity (L1/L2 norm for continuous features and L0 norm for discrete features).
$C_{struct}(\Delta A_{i})$ penalizes structural changes (L0 norm on edge additions/deletions) to ensure minimal graph modifications.
$C_{plaus}(X_{i}^{'}, A_{i}^{'})$ is a crucial, domain-aware cost function that penalizes unrealistic counterfactuals. This term is a composite of penalties for violating hard constraints (transaction amount must be positive and merchant codes must be valid) and soft constraints that enforce data manifold adherence. The latter can be estimated by the distance to the nearest neighbors in the training data or via a reconstruction-based anomaly score from a trained generative model.
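To make the four terms concrete, the sketch below evaluates the objective for one candidate perturbation. It is an illustration under stated assumptions rather than the CF-Gen implementation: `cf_objective`, `predict_proba`, `edge_flips`, and `amount_index` are hypothetical names, the feature cost is taken as an L1 norm, the structural cost as a count of toggled edges, and the plausibility term covers only the positive-amount hard constraint through an assumed hinge penalty.

```python
import math

def cf_objective(x, dx, edge_flips, predict_proba, y_target,
                 lam_feat=0.1, lam_struct=0.1, lam_plaus=1.0, amount_index=0):
    """Evaluate the counterfactual objective for one candidate perturbation.

    x             : original feature vector
    dx            : feature perturbation (same length as x)
    edge_flips    : collection of (u, v) edges that are added or deleted
    predict_proba : HYPOTHETICAL callable returning P(fraud) for the
                    perturbed features and edge flips
    """
    x_cf = [xi + di for xi, di in zip(x, dx)]
    p = predict_proba(x_cf, edge_flips)
    p = min(max(p, 1e-12), 1.0 - 1e-12)            # avoid log(0)
    # Prediction loss: binary cross-entropy towards the target class.
    l_pred = -math.log(p) if y_target == 1 else -math.log(1.0 - p)
    c_feat = sum(abs(d) for d in dx)               # L1 feature cost
    c_struct = len(edge_flips)                     # number of edge changes
    # ASSUMED hard-constraint penalty: hinge on a negative transaction amount.
    c_plaus = max(0.0, -x_cf[amount_index])
    return l_pred + lam_feat * c_feat + lam_struct * c_struct + lam_plaus * c_plaus
```

Raising `lam_plaus` relative to the other weights makes constraint violations dominate the objective, which is one way the trade-off described above can be steered.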

This optimization is challenging due to the non-differentiable nature of discrete perturbations and L0 norms. We employ a gradient-based approach and use techniques like Gumbel-Softmax reparameterization to create differentiable approximations for discrete choices.
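For binary edge choices, the two-class case of Gumbel-Softmax reduces to the binary Concrete (relaxed Bernoulli) distribution. The sketch below samples such a relaxed edge mask. The function name `relaxed_edge_mask` and the default temperature are assumptions for illustration, and in practice the same computation would be written in an autodiff framework so that gradients flow to the logits.

```python
import math
import random

def stable_sigmoid(z):
    """Sigmoid that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def relaxed_edge_mask(logits, temperature=0.5, rng=None):
    """Sample a soft 0..1 mask per candidate edge (binary Concrete relaxation).

    logits      : log-odds that each candidate edge is kept or added
    temperature : lower values push samples towards 0 or 1
    """
    rng = rng or random.Random()
    mask = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise = math.log(u) - math.log(1.0 - u)    # logistic noise
        mask.append(stable_sigmoid((logit + noise) / temperature))
    return mask
```

As the temperature approaches zero the samples approach hard 0/1 decisions, which is why such schemes usually anneal the temperature during optimization.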

3.5. Imbalance Correction Strategy

To combat the pervasive issue of data imbalance and validate H4, HAG-CFNet employs a two-pronged strategy.

First, at the data level, we use an adapted version of GraphSMOTE [7]. Standard oversampling methods ignore graph topology, while standard GraphSMOTE may not be suitable for heterogeneous, heterophilic graphs. Our adaptation makes the edge generator both heterogeneity-aware and heterophily-conscious. Specifically, when creating edges for a synthetic fraud node, the generator is trained not only to connect it to other fraud nodes (to strengthen the minority cluster) but also to connect it to specific types of benign nodes that are commonly observed in real camouflage patterns, thereby generating more realistic synthetic data.
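The node-generation half of this step rests on SMOTE-style interpolation in embedding space, which can be sketched as follows. This is a generic illustration with a hypothetical function name (`smote_embedding`). It does not reproduce the heterogeneity-aware and heterophily-conscious edge generator described above, whose details the text does not specify.

```python
import random

def smote_embedding(minority_embs, k=3, rng=None):
    """Create one synthetic minority embedding by SMOTE interpolation.

    Picks a random minority node, one of its k nearest minority neighbors
    (squared Euclidean distance), and a random point on the segment
    between the two embeddings.
    """
    if len(minority_embs) < 2:
        raise ValueError("need at least two minority embeddings")
    rng = rng or random.Random()
    i = rng.randrange(len(minority_embs))
    anchor = minority_embs[i]
    others = [e for j, e in enumerate(minority_embs) if j != i]
    others.sort(key=lambda e: sum((a - b) ** 2 for a, b in zip(anchor, e)))
    neighbor = rng.choice(others[:k])
    delta = rng.random()                           # interpolation weight in [0, 1)
    return [a + delta * (b - a) for a, b in zip(anchor, neighbor)]
```

The synthetic embedding still needs edges before it can take part in message passing, which is the role of the adapted edge generator.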

Second, at the algorithm level, we employ a cost-sensitive loss function during HAG-Net training. The final loss is a weighted version of the standard cross-entropy loss: $L_{total} = w_{benign} \cdot L_{CE}(y_{true} = 0, y_{pred}) + w_{fraud} \cdot L_{CE}(y_{true} = 1, y_{pred})$, where the weight for the fraud class $w_{fraud}$ is set significantly higher than $w_{benign}$. This forces the model to pay more attention to correctly classifying the rare fraudulent instances.

The synergy of these two approaches—creating more and better minority data while also telling the model to focus on it—provides a robust defense against the biases induced by imbalanced datasets.
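A minimal sketch of the cost-sensitive loss is given below. The function name `weighted_bce`, the default weights, and the choice to average over samples are assumptions for illustration; the paper tunes $w_{fraud}$ and $w_{benign}$ rather than fixing them.

```python
import math

def weighted_bce(y_true, y_prob, w_benign=1.0, w_fraud=10.0, eps=1e-12):
    """Class-weighted binary cross-entropy, averaged over samples.

    y_true : list of 0 (benign) or 1 (fraud) labels
    y_prob : predicted fraud probabilities
    The default weights are HYPOTHETICAL placeholders.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
        if y == 1:
            total += -w_fraud * math.log(p)        # fraud term
        else:
            total += -w_benign * math.log(1.0 - p) # benign term
    return total / len(y_true)
```

With these defaults, a missed fraud case costs ten times as much as an equally confident error on a benign case.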

4. Experimental Evaluation

To comprehensively validate our proposed framework, we adopt a multi-faceted evaluation strategy. Our empirical analysis is founded on extensive experiments across large-scale, real-world datasets, supplemented by controlled simulation studies to probe the model’s behavior under specific conditions. Furthermore, we present practical case studies to illustrate the model’s real-world applicability and interpretability. This section details the datasets, baseline models, evaluation metrics, and implementation specifics of our experimental setup.

4.1. Datasets

The validation of HAG-CFNet relies on datasets reflecting real-world financial fraud complexities, as shown in Table 3.

The IEEE-CIS Fraud Detection Dataset is a large-scale, real-world dataset from a Kaggle competition, containing anonymized transaction data and identity information [28]. It consists of transaction.csv (features like TransactionDT, TransactionAmt, ProductCD, card1–card6, and email domains) and identity.csv (features like DeviceType and DeviceInfo) [29]. The training set has approximately 590,540 transactions, with a fraud ratio of around 3.5% [28].

The IBM Credit Card Transactions Dataset, which is often found on Kaggle, provides another large benchmark and is designed to reflect realistic patterns [30]. One version includes features like user, card, amount, merchant name, MCC, and is fraud?, with approximately 24.4 million transactions and a fraud rate of around 0.12% [30]. Another version has over 1.2 million entries with a 0.579% fraud rate [30].

A synthetic dataset (CardSim-based) was generated, adapting principles from the CardSim simulator [4], to allow controlled experiments on varying levels of heterogeneity, heterophily, and fraud ratios. CardSim is calibrated using public economic data and uses a Bayesian approach to embed relationships between transaction features and fraud likelihood [4]. To rigorously validate the realism of the simulator’s output, we conducted a detailed statistical comparison against the real-world IEEE-CIS dataset. As shown in our appendix, this analysis confirmed a high degree of fidelity at both the feature level and the structural level. Key feature distributions were shown to be statistically consistent (Table A3), and fundamental graph-level properties such as average degree and clustering coefficient were demonstrated to be highly similar (Table A4).

Data preprocessing involves cleaning (handling missing values and correcting types), feature engineering (cyclical time features, categorical encoding using one-hot or embedding layers, and numerical normalization), and graph construction (defining node types like transaction, card, user, merchant, device, and edge types based on shared identifiers).
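The graph-construction step can be sketched as a bipartite linking of transaction nodes to the entities they share. This is a simplified illustration: the function name `build_hetero_edges`, the `tx_id` key, and the column names used in the test are hypothetical, and the sketch assumes each identifier column becomes its own node type.

```python
from collections import defaultdict

def build_hetero_edges(transactions, id_columns):
    """Build typed edge lists from tabular transaction rows.

    transactions : list of dicts, each with a 'tx_id' key (HYPOTHETICAL schema)
    id_columns   : identifier columns to turn into entity node types

    Returns a dict keyed by (source_type, relation, target_type) whose values
    are lists of (transaction_id, entity_id) pairs. Two transactions that share
    an identifier end up two hops apart through the shared entity node.
    """
    edges = defaultdict(list)
    for tx in transactions:
        for col in id_columns:
            value = tx.get(col)
            if value is None:                      # skip missing identifiers
                continue
            edges[("transaction", f"uses_{col}", col)].append((tx["tx_id"], value))
    return dict(edges)
```

The keyed-by-triple layout mirrors how heterogeneous graph libraries commonly index edge types, so the output could be converted to tensors with little extra work.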

4.2. Baselines for Comparison

To rigorously evaluate HAG-CFNet, a comprehensive set of baselines was used. Traditional machine learning methods included XGBoost and random forest. Standard GNNs for comparison were Graph Convolutional Networks (GCNs) [10], Graph Attention Networks (GATs) [10], and GraphSAGE [10]. Specialized heterophily-aware GNNs such as HUGE [16], DHMP [16], and SGNN-IB [18] were adapted where feasible, or their key principles were incorporated into simpler backbones. Additionally, ablation variants of HAG-CFNet were tested: HAG-CFNet without its heterogeneity module, without its heterophily module, and without imbalance correction, as well as versions of CF-Gen using only feature or only structural perturbations.

4.3. Evaluation Metrics

Fraud detection performance was assessed using AUC-ROC, AUC-PR (critical for imbalanced data) [2], F1-score (macro-averaged), recall (sensitivity for the fraud class), precision (for the fraud class), and G-mean.
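The threshold-based metrics listed above follow directly from confusion-matrix counts, as the sketch below shows. The function name `fraud_metrics` is hypothetical, G-mean is computed as the geometric mean of recall and specificity, and the ranking metrics (AUC-ROC and AUC-PR) are not covered here because they need scores rather than hard labels.

```python
import math

def fraud_metrics(y_true, y_pred):
    """Precision, recall, specificity, and G-mean for the fraud class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
    }
```

Because G-mean multiplies the two class-wise recalls, a model that ignores the fraud class scores zero regardless of its accuracy on benign transactions.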

Counterfactual explanation quality from CF-Gen was evaluated using several metrics [17]. Validity measures the percentage of CFs that successfully change the model’s prediction [22]. Sparsity/proximity quantifies the minimality of perturbations, including feature sparsity (L0 norm of feature changes), feature proximity (L1/L2 distance for continuous features), structural sparsity (number of edge changes) [17], and structural proximity (graph edit distance proxy) [31]. Plausibility/realism assesses if the CF is realistic using data manifold adherence (distance to k-NN of the target class in training data) [17] and the domain constraint violation rate (percentage of CFs violating financial rules).
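Validity, feature sparsity, and feature proximity can be computed as in the following sketch. It covers only the feature-level metrics, and the names `cf_quality` and `predict` are hypothetical. The structural and plausibility metrics would need the graph and the training data, which are omitted here.

```python
def cf_quality(originals, counterfactuals, predict, y_target, tol=1e-9):
    """Average validity, L0 sparsity, and L1 proximity over a set of CFs.

    predict : HYPOTHETICAL callable mapping a feature vector to a class label
    tol     : changes smaller than this are treated as unchanged features
    """
    valid = 0
    l0_total = 0
    l1_total = 0.0
    for x, x_cf in zip(originals, counterfactuals):
        if predict(x_cf) == y_target:
            valid += 1                             # prediction reached the target
        diffs = [abs(a - b) for a, b in zip(x, x_cf)]
        l0_total += sum(1 for d in diffs if d > tol)
        l1_total += sum(diffs)
    n = len(originals)
    return {
        "validity": valid / n,
        "feature_sparsity": l0_total / n,
        "feature_proximity_l1": l1_total / n,
    }
```

Lower sparsity and proximity values indicate smaller edits, while validity should be as close to 1 as possible.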

4.4. Implementation Details

Experiments were conducted using NVIDIA A100 GPUs. The primary deep learning framework was PyTorch (2.7.0), with GNN-specific operations handled by PyTorch Geometric (PyG). Key Python (3.11.13) libraries included NumPy, Pandas, and Scikit-learn. To ensure a robust evaluation of generalization performance, we adopted a standard random split of 80% for training, 10% for validation, and 10% for testing for each dataset. Hyperparameters such as the learning rate (range 0.0001–0.01), the batch size (32–512), the number of GNN layers (2–4), embedding dimensions (64–256), the optimizer (Adam), dropout rates (0.1–0.5), and specific parameters for attention mechanisms and loss term weights ($λ_{feat}$, $λ_{struct}$, $λ_{plaus}$ for CF-Gen; $w_{fraud}$, $w_{benign}$ for imbalance) were tuned via Bayesian optimization on a validation set. Training ran for up to 200 epochs with early stopping based on validation AUC-PR. For GraphSMOTE, parameters like k-neighbors (3–10) and the oversampling ratio (target: 0.5–1.0 minority proportion) were tuned.
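To make the role of the k-neighbors parameter concrete, the sketch below shows SMOTE-style interpolation between minority-class embeddings, which is the node-generation idea that GraphSMOTE builds on. It is a simplified illustration under stated assumptions: the function name, the default k = 5, and the toy embeddings are hypothetical, and the edge-generation step that GraphSMOTE applies to the synthetic nodes is omitted.

```python
import random

def smote_like_oversample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points by interpolating each sampled minority
    embedding toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        anchor = minority[i]
        # Rank the other minority points by squared Euclidean distance to the anchor.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(anchor, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(anchor, neighbour)])
    return synthetic

# Hypothetical 2-D minority embeddings (at least two points are required).
minority_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority_embeddings, k=3, n_new=8, seed=1)
```

Because every synthetic point lies on a segment between two existing minority points, a larger k spreads the synthetic samples over a wider region of the minority class.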

5. Results and Analysis

This section presents the empirical results of HAG-CFNet and compares its performance against the selected baselines across the chosen datasets. We begin with a quantitative comparison of overall detection performance, followed by an ablation study to validate our design choices. Subsequently, we present targeted simulation studies, in-depth qualitative case analyses, and further examinations of the model’s robustness, learned embeddings, and computational performance.

5.1. Overall Fraud Detection Performance

The primary results, presented with statistical validation in Table 4 and Figure 2, highlight the superior performance of HAG-CFNet. On the critical AUC-PR metric for the highly imbalanced IEEE-CIS dataset, HAG-CFNet achieves a score of 0.867 ± 0.012. This represents a statistically significant improvement over the next-best heterophily-aware GNN, DHMP, which scored 0.832 ± 0.014 (p < 0.01). The small standard deviation across multiple runs further underscores the stability and reliability of our model’s performance. This trend of statistically significant improvement holds across all datasets and against all other GNN baselines. The performance gap is particularly pronounced on the most heterophilic IBM dataset (a 4.0% absolute improvement in AUC-PR from 0.718 to 0.758, p = 0.012), demonstrating our framework’s enhanced capability to handle sophisticated camouflage strategies. While traditional methods like XGBoost occasionally achieve higher precision, their significantly lower recall indicates a tendency to miss a large portion of fraudulent cases, a critical flaw in real-world deployment. In contrast, HAG-CFNet provides a more balanced and effective solution by excelling in both recall and overall AUC-PR.
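Because both absolute and relative gains are quoted, a short arithmetic check may help the reader keep them apart. The sketch below uses only the AUC-PR values stated above; the helper name is an assumption. The 4.2% figure quoted in the abstract appears consistent with the relative gain over DHMP on IEEE-CIS, whereas the 4.0% reported for IBM is an absolute gain in percentage points.

```python
def improvement(baseline, model):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (model - baseline) * 100
    relative_percent = (model - baseline) / baseline * 100
    return round(absolute_points, 1), round(relative_percent, 1)

ieee_gain = improvement(0.832, 0.867)  # IEEE-CIS: DHMP vs. HAG-CFNet
ibm_gain = improvement(0.718, 0.758)   # IBM: next-best baseline vs. HAG-CFNet
print(ieee_gain)  # (3.5, 4.2)
print(ibm_gain)   # (4.0, 5.6)
```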

5.2. Ablation Study

Removing the heterogeneity module led to a 5.5% drop in AUC-PR, while removing the heterophily module resulted in a 4.1% decrease, as shown in Table 5. The absence of imbalance correction caused the most significant drop, with AUC-PR falling by 14.9% and recall (Fraud) by 24.4%, underscoring the critical role of each component. To further dissect the contribution of our dual-pronged imbalance correction strategy, we conducted a more detailed ablation experiment. Using only cost-sensitive learning (without GraphSMOTE) resulted in an AUC-PR of 0.825, while using only GraphSMOTE (without cost-sensitive loss) yielded an AUC-PR of 0.841. Both outperform the model with no imbalance correction but fall short of the full model’s 0.867. This demonstrates that while both techniques are beneficial, they address the imbalance problem from different angles. GraphSMOTE enriches the data landscape for the minority class, while cost-sensitive learning forces the model to prioritize their correct classification. Their combination yields a synergistic effect, leading to a more significant performance gain than either component could achieve alone, thus validating our synergistic approach (H4). The disproportionate impact of removing imbalance correction reveals a challenge in fraud detection: without proper minority class handling, the model gravitates toward conservative predictions, essentially learning to classify most instances as benign to minimize overall error. The moderate but consistent contributions of both the heterogeneity and heterophily modules suggest that these components address distinct but complementary aspects of financial graph complexity. 
Specifically, the heterogeneity module’s 5.5% contribution likely stems from its ability to leverage type-specific information (e.g., distinguishing between individual and corporate account behaviors), while the heterophily module’s 4.1% gain reflects its capacity to identify fraudsters who strategically connect to legitimate entities. The two ablation drops sum to 9.6%, which indicates that the modules address complementary graph challenges; this sum alone, however, does not establish superadditive benefits, which would require a comparison against a variant with both modules removed.
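The cost-sensitive half of the imbalance strategy can be illustrated with a class-weighted binary cross-entropy. This is a minimal sketch, not the paper's exact loss: the functional form is an assumption, `w_fraud = 5.0` is the IEEE-CIS value listed in Table A2, and `w_benign = 1.0` is an assumed reference weight.

```python
import math

def weighted_bce(probs, labels, w_fraud=5.0, w_benign=1.0, eps=1e-12):
    """Mean binary cross-entropy with separate weights for fraud (1) and benign (0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(w_fraud * y * math.log(p) + w_benign * (1 - y) * math.log(1 - p))
    return total / len(labels)

# A missed fraud (p = 0.1 for a fraud case) versus an equally confident
# false alarm (p = 0.9 for a benign case).
missed_fraud_loss = weighted_bce([0.1], [1])
false_alarm_loss = weighted_bce([0.9], [0])
```

With these weights a missed fraud is penalized five times as heavily as an equally confident false alarm, which pushes the classifier away from the "predict benign for everything" solution described above.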

5.3. Simulation Studies on Model Robustness

To further validate the design choices behind our model and assess its performance under controlled, challenging conditions, we conducted a series of simulation studies. By generating synthetic datasets with systematically varying levels of heterophily and class imbalance, we can directly observe how these factors impact model performance and thereby demonstrate the effectiveness of our specialized modules. This approach allows us to stress-test the model beyond the fixed characteristics of the real-world datasets.

The heterophily impact analysis in Figure 3 reveals a critical insight: while both models degrade as heterophily increases, HAG-CFNet’s decline is significantly more graceful (0.06 AUC-PR drop vs. 0.20 for GCN across the heterophily spectrum). This performance pattern suggests that HAG-CFNet’s heterophily-awareness module does not merely provide marginal improvements but rather represents a fundamental architectural advancement for handling adversarial graph scenarios. The steeper degradation of the GCN at higher heterophily levels (H-index > 0.6) likely reflects the breakdown of the homophily assumption that underpins traditional message passing—as fraudsters become more sophisticated in their camouflage strategies, standard GNNs increasingly propagate misleading signals. Notably, HAG-CFNet maintains an AUC-PR above 0.88 even at maximum heterophily (H-index = 0.8), suggesting practical utility even in highly adversarial environments. This resilience is particularly valuable for financial institutions, where fraud tactics continuously evolve and early detection systems must remain effective even as fraudsters adapt their strategies to exploit network-based detection methods.
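The heterophily level varied in this simulation can be measured in several ways, and the H-index is not defined in this section. As an illustrative assumption, the sketch below uses the edge heterophily ratio, i.e., the fraction of edges whose endpoints carry different labels; the function name and the toy graph are hypothetical.

```python
def edge_heterophily(edges, labels):
    """Fraction of edges that connect nodes with different class labels."""
    if not edges:
        return 0.0
    cross_class = sum(1 for u, v in edges if labels[u] != labels[v])
    return cross_class / len(edges)

# Toy graph: nodes 0 and 1 are benign (0), nodes 2 and 3 are fraudulent (1).
toy_labels = {0: 0, 1: 0, 2: 1, 3: 1}
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
h = edge_heterophily(toy_edges, toy_labels)  # 3 of 5 edges cross classes
```

Under this reading, a value of 0.8 would mean that four out of five edges link a node to a neighbor of the opposite class, which is the regime in which neighborhood averaging is most misleading.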

The imbalance sensitivity analysis in Figure 4 demonstrates the critical importance of specialized imbalance handling in fraud detection scenarios. The dramatic performance gap between HAG-CFNet with and without imbalance correction (0.75 vs. 0.30 recall at a 0.1% fraud ratio) illustrates how extreme class imbalance can render even sophisticated models ineffective without proper mitigation strategies. The widening performance gap at lower fraud ratios (more realistic scenarios) suggests that traditional approaches to minority class learning become increasingly inadequate as fraud becomes rarer—a concerning trend given that successful fraud prevention often reduces overall fraud rates. Interestingly, the convergence of both curves at higher fraud ratios (3%) indicates that imbalance correction becomes less critical when fraud is more prevalent, suggesting that the primary challenge lies in detecting sparse fraud signals rather than distinguishing fraud from benign when both classes are reasonably represented. This finding has practical implications for fraud prevention systems: as these systems become more effective and fraud rates decrease, the importance of sophisticated imbalance-handling mechanisms actually increases, creating a technological arms race where better fraud prevention necessitates even more advanced detection capabilities.

5.4. Qualitative Analysis with Practical Examples

Beyond quantitative metrics, it is crucial to understand how HAG-CFNet and its explanation module, CF-Gen, function in practice, as shown in Table 6. This section presents a series of case studies based on real-world scenarios to provide qualitative insights into the model’s decision-making process. These practical examples serve to support our claims of interpretability and demonstrate how the framework can generate actionable intelligence for fraud analysts.

CF-Gen achieved high validity (91.8%) with good sparsity and proximity. Its validity advantage over the adapted COMBINEX baseline (91.8% vs. 89.7%) can be attributed to its financial domain-aware optimization objective, which incorporates transaction-specific constraints during the counterfactual generation process. On average, only 3.4 features are changed per counterfactual, which is notable given the high-dimensional nature of financial data (450 aggregate features) and suggests that fraud predictions often hinge on a small set of critical features, a finding that aligns with fraud analysts’ intuition about key risk indicators. The structural sparsity of 1.3 edge changes per counterfactual suggests that relationship modifications are often more impactful than feature changes alone, highlighting the importance of the network context in fraud detection. Most importantly, the domain constraint violation rate (3.2%) was notably lower than that of the baselines, indicating that CF-Gen generates counterfactuals that respect financial business rules and are therefore actionable for real-world fraud prevention strategies rather than merely academic explanations.

The case studies in Figure 5, Figure 6 and Figure 7 demonstrate CF-Gen’s multi-layered analysis capabilities across different fraud detection scenarios.

Figure 5 demonstrates CF-Gen’s feature-level counterfactual analysis capability. The transaction shows classic fraud indicators: a high amount (USD 5250), a suspicious merchant category (5999-Miscellaneous), a foreign IP address, and association with a new account and a known fraudulent device. The counterfactual reveals that removing the device association, reducing the amount to USD 250, changing to grocery merchant (5411), and using a domestic IP would flip the prediction from fraud (0.92) to benign (0.15), highlighting critical risk factors in fraud detection.

Figure 6 shows CF-Gen’s behavioral pattern analysis capabilities. A legitimate user’s transaction was flagged (0.78 fraud probability) due to unusual activity patterns: a high daily transaction frequency (15 vs. normal < 5) and a new IP region. The counterfactual indicates that normalizing transaction frequency and IP location would classify it as benign (0.22), demonstrating the model’s sensitivity to behavioral anomalies that may cause false positives.

Figure 7 demonstrates CF-Gen’s network-level analysis capability, distinguishing coordinated fraud from isolated legitimate activities. The coordinated fraud network shows a central ring node managing multiple entities, shared infrastructure, and complex financial flows. In contrast, the isolated network displays independent users with personal devices and verified merchants. The counterfactual analysis reveals that transforming the coordinated structure into isolated patterns would reduce the network risk assessment from high (0.95) to low (0.15), highlighting the importance of network topology in fraud detection.

This three-tier approach—feature, behavioral, and network-level counterfactuals—provides comprehensive fraud detection insights that address different aspects of financial crime, from individual transaction characteristics to complex coordinated schemes.

5.5. Further Analysis: Embeddings, Robustness, and Performance

The t-SNE visualization in Figure 8 provides crucial insights into HAG-CFNet’s learned representations compared to the standard GCN. The distinct clustering achieved by HAG-CFNet, where fraudulent nodes form coherent, separated clusters, suggests that the model has successfully learned discriminative representations that capture the underlying fraud patterns despite the heterophilic nature of the graph. This is particularly remarkable given that fraudsters often deliberately connect to legitimate entities to camouflage their activities. The overlapping distributions in the GCN plot indicate that standard message passing fails to maintain class-specific information when fraudulent and benign nodes are neighbors, essentially causing the model to “average out” discriminative signals. HAG-CFNet’s ability to preserve and even enhance class separability in such challenging conditions demonstrates that its heterophily-awareness mechanism effectively counters the smoothing effects that plague GNNs in adversarial graph settings. The tighter clustering of fraudulent nodes in HAG-CFNet’s embedding space also suggests that the model has identified common latent patterns among fraudulent activities, potentially capturing shared behavioral signatures that could generalize to new, previously unseen fraud types.

5.5.1. Robustness Analysis

To evaluate the model’s resilience to noisy data, a common challenge in real-world applications, we conducted robustness tests by injecting two types of noise into the IEEE-CIS test set. Feature noise was introduced by adding Gaussian noise ($σ = 0.1$) to 20% of the continuous features for a random 10% of nodes. Structural noise was added by randomly adding or removing 5% of the edges. Table 7 reports the percentage drop in AUC-PR compared to the performance on the clean dataset.
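The two noise-injection procedures can be sketched as follows. This is one plausible reading of the protocol, with several assumptions: the same 20% feature subset is perturbed for every selected node, the number of edge changes equals 5% of the edge count with a random split between removals and additions, the graph is undirected and sparse, and all function names are hypothetical.

```python
import numpy as np

def add_feature_noise(X, node_frac=0.10, feat_frac=0.20, sigma=0.1, seed=0):
    """Add N(0, sigma^2) noise to a random feature subset for a random node subset."""
    rng = np.random.default_rng(seed)
    X_noisy = X.astype(float)  # astype returns a copy, so X is left untouched
    n_nodes, n_feats = X.shape
    nodes = rng.choice(n_nodes, size=max(1, int(round(node_frac * n_nodes))), replace=False)
    feats = rng.choice(n_feats, size=max(1, int(round(feat_frac * n_feats))), replace=False)
    X_noisy[np.ix_(nodes, feats)] += rng.normal(0.0, sigma, size=(len(nodes), len(feats)))
    return X_noisy

def perturb_edges(edges, n_nodes, frac=0.05, seed=0):
    """Remove some existing edges and add some new ones; total changes = frac * |E|."""
    rng = np.random.default_rng(seed)
    original = sorted({tuple(sorted(e)) for e in edges})
    original_set = set(original)
    n_changes = int(round(frac * len(original)))
    n_remove = int(rng.integers(0, n_changes + 1))  # random split: removals vs. additions
    dropped = set(rng.choice(len(original), size=n_remove, replace=False).tolist())
    kept = {e for i, e in enumerate(original) if i not in dropped}
    result = set(kept)
    # Add edges that were absent from the original graph (assumes a sparse graph).
    while len(result) < len(kept) + (n_changes - n_remove):
        u, v = (int(a) for a in rng.integers(0, n_nodes, size=2))
        e = (min(u, v), max(u, v))
        if u != v and e not in original_set:
            result.add(e)
    return result
```

Fixing the seed makes the corrupted test set reproducible, so that every model is evaluated on the same noisy features and edges.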

The results demonstrate HAG-CFNet’s superior robustness. Under feature noise, its performance degraded by only 5.2%, which is less than half the drop experienced by the GCN and GAT. This resilience can be attributed to its attention mechanism, which learns to weigh neighbor importance, thereby down-weighting corrupted feature signals. More critically, HAG-CFNet showed remarkable stability against structural noise, with a performance drop of just 6.5%. In contrast, traditional GNNs, which are highly dependent on local neighborhood structures, suffered significantly more. This highlights the benefit of our dual-channel heterophily module, which is designed to critically assess connections rather than naively aggregating information, making it inherently more robust to the kind of random, noisy edges introduced in this experiment.

5.5.2. Error Analysis and Confusion Matrix

To provide a deeper insight into the model’s predictive behavior beyond aggregate metrics, we present the confusion matrix for HAG-CFNet on the IEEE-CIS test set in Table 8. The matrix is normalized by the true class to better visualize performance on the imbalanced classes.

The matrix confirms the high recall (79.8%) for the fraud class, which is critical in this domain. A qualitative analysis of the misclassifications revealed distinct patterns:

False Positives (1.8%): A primary cause for misclassifying benign transactions as fraudulent was the presence of rare but legitimate behavioral patterns. For instance, a user making an unusually large, one-time international purchase might be flagged. While these are technically errors, they often represent anomalies that warrant further investigation, providing a valuable signal for human analysts.
False Negatives (20.2%): These are the more critical errors. Analysis showed that many false negatives were associated with sophisticated camouflage, where fraudulent transactions exhibited very few suspicious features and were connected to nodes with overwhelmingly legitimate histories. These cases represent the frontier of fraud detection and highlight the need for even more advanced relational and temporal pattern recognition, which we identify as a key area for future work.
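Normalizing by the true class means dividing each row of the confusion matrix by its row total, so the fraud row yields recall and its complement directly. The counts below are hypothetical and were chosen only so that the normalized rates match the percentages quoted above; they are not the actual test-set counts.

```python
def normalize_by_true_class(counts):
    """Row-normalize a confusion matrix whose rows are true classes."""
    return [[c / sum(row) for c in row] for row in counts]

# Rows: true benign, true fraud. Columns: predicted benign, predicted fraud.
hypothetical_counts = [
    [9820, 180],  # true negatives, false positives
    [101, 399],   # false negatives, true positives
]
normalized = normalize_by_true_class(hypothetical_counts)
# normalized[0][1] is the false positive rate; normalized[1][1] is fraud recall.
```

Note that a row-normalized matrix hides the class ratio: a 1.8% false positive rate on a large benign class can still correspond to many more false alarms than true detections in absolute terms.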

5.5.3. Computational Performance Analysis

To address the practical deployability of HAG-CFNet in resource-constrained environments, we benchmarked its computational performance against key baselines on the IEEE-CIS dataset. Table 9 reports the average training time per epoch, inference latency per batch, and peak GPU memory usage during training.

The results indicate that HAG-CFNet, with its more sophisticated architecture, incurs a moderate computational overhead compared to simpler baselines. Its training time and inference latency are the highest, which is an expected trade-off for its enhanced accuracy and robustness. However, the peak memory usage of 7.2 GB remains well within the capacity of modern enterprise-grade GPUs, and the inference latency of 68 ms per batch is acceptable for many near-real-time fraud detection systems. These findings support the claim of “resource-constrained deployability,” suggesting that HAG-CFNet provides a viable balance between advanced capabilities and practical computational efficiency.

5.5.4. Analysis of Hyperparameters and Internal Mechanics

The results, as shown in Table 10, Table 11, Table 12 and Table 13, indicate that the cosine similarity-based heterophily score provided a good balance of performance and simplicity, though other methods also performed well, significantly outperforming the baseline without explicit heterophily scoring. The modest performance differences between different heterophily scoring methods (0.873 vs. 0.865 vs. 0.870 AUC-PR) suggest that the specific mechanism for detecting heterophilic relationships matters less than explicitly accounting for their presence. This finding implies that the architectural design decision to include heterophily awareness is more critical than the particular implementation strategy, offering flexibility for practitioners to choose simpler methods without significant performance penalties. The substantial gap between any explicit heterophily handling and the baseline (0.838 AUC-PR) confirms that ignoring heterophily in financial fraud detection graphs leads to meaningful performance degradation. Importantly, the success of the relatively simple cosine similarity approach suggests that complex learned metrics may not be necessary for effective heterophily detection in financial domains, where feature-based dissimilarity often correlates well with semantic differences between fraudulent and legitimate entities.
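A cosine-similarity-based heterophily score can be sketched in a few lines. The exact formula used by HAG-CFNet is not restated in this section, so the mapping below, which rescales cosine similarity to a dissimilarity in [0, 1], is an assumption, as are the function names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def heterophily_score(h_u, h_v):
    """Map cosine similarity in [-1, 1] to a dissimilarity score in [0, 1]."""
    return (1.0 - cosine_similarity(h_u, h_v)) / 2.0

same = heterophily_score([1.0, 2.0], [1.0, 2.0])        # near 0: likely homophilic edge
orthogonal = heterophily_score([1.0, 0.0], [0.0, 1.0])  # 0.5
opposite = heterophily_score([1.0, 2.0], [-1.0, -2.0])  # near 1: likely heterophilic edge
```

The appeal of this choice is that it has no trainable parameters, which fits the observation above that simpler scoring methods perform comparably to learned ones.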

6. Discussion

This study’s empirical results demonstrate the effectiveness of HAG-CFNet, but a deeper analysis is required to understand the implications of these findings. This section moves beyond a recitation of results to critically analyze why the proposed framework succeeds, its practical value in real-world deployments, and its inherent limitations, thereby contextualizing its contributions and outlining a path for future research.

6.1. Principal Findings and Insights

Our empirical results consistently show that HAG-CFNet’s outperformance is not an incremental improvement but a direct consequence of its holistic design. The key insight from our findings is the powerful synergy between the framework’s specialized modules. We found that the heterogeneity-aware message-passing mechanism creates a semantically rich foundation, which is essential for the subsequent dual-channel heterophily module to accurately distinguish benign connections from deceptive ones. This synergy is particularly impactful in highly adversarial environments, as evidenced by the significant performance gains on the heterophilic IBM dataset (Table 4). This result strongly suggests that explicitly modeling and neutralizing camouflage is a critical factor for success in modern fraud detection.

Furthermore, our ablation study (Table 5) provides clear evidence for the necessity of each component. The dramatic performance collapse observed upon removing the imbalance correction strategy underscores a crucial takeaway: sophisticated GNN architectures are rendered ineffective on real-world skewed data without a robust, specialized strategy. Our validation of Hypothesis 4 (H4) confirms that a dual-pronged approach—addressing imbalance at both the data level (GraphSMOTE) and the algorithm level (cost-sensitive loss)—is superior to a single-faceted solution. Finally, the model’s enhanced resilience to noise (Table 7) is a direct benefit of its architecture; its attention mechanism learns to suppress corrupted feature signals, while its dual-channel aggregator learns to distrust anomalous graph structures, making the entire system inherently more robust.

6.2. Practical Implications and Deployability

The HAG-CFNet framework offers significant practical value for financial institutions. Its most direct contribution is transforming fraud detection from a “black-box” alerting system into a source of actionable intelligence. The counterfactuals generated by CF-Gen provide clear, human-understandable reasons for high-risk predictions (as shown in Figure 5, Figure 6 and Figure 7). For a fraud analyst, this means converting a cryptic alert like “Transaction 123 is 92% likely to be fraud” into an actionable insight: “This transaction is suspicious because it involves a high-risk merchant category and is linked to a device previously used in fraudulent activities.” This capability can dramatically reduce investigation time, improve the accuracy of human reviews, and lower the customer friction caused by false positives.

From a deployment perspective, we acknowledge the trade-off between performance and computational cost. Our analysis in Table 9 shows that HAG-CFNet has a higher latency and memory footprint than simpler GNNs. However, its resource requirements are well within the range of modern hardware, making it deployable in near-real-time systems. For large-scale, latency-critical applications, further optimization is possible. Practitioners could employ model quantization to reduce the model’s size and numerical precision, use knowledge distillation to train a smaller, faster model that mimics HAG-CFNet’s behavior, or adopt advanced graph sampling strategies during inference to limit the computational scope without a significant drop in accuracy.

7. Conclusions

This paper introduced HAG-CFNet, a novel graph neural network framework designed to confront the core challenges of heterogeneity, adversarial heterophily, and interpretability in complex relational data. By integrating a heterogeneity-aware message passing scheme, a dual-channel mechanism to neutralize camouflage, and a domain-aware counterfactual explanation generator, our framework demonstrated statistically significant improvements in fraud detection accuracy and robustness over state-of-the-art baselines. The results validate our central hypothesis that a unified architecture, which simultaneously addresses these multifaceted issues, is critical for deploying effective and trustworthy GNNs in high-stakes, real-world applications.

Despite these promising results, we acknowledge several limitations that also chart the course for future research. Firstly, our evaluation relies on static graph snapshots. Real-world financial networks are highly dynamic, with fraud patterns constantly evolving. This static assumption limits the model’s ability to adapt to concept drift. A critical next step, therefore, is to explore the integration of temporal graph networks (TGNs) to explicitly model the temporal dynamics and evolution of fraudulent behaviors, enabling more adaptive and forward-looking detection capabilities.

Secondly, the optimization of CF-Gen, while empirically effective, provides a locally optimal solution to an NP-hard problem. We lack formal guarantees of global optimality for the generated counterfactuals. Future work should focus on the development of more advanced or certifiable optimization methods for graph-based counterfactual generation, potentially exploring techniques from combinatorial optimization or developing specialized solvers to provide stronger guarantees on the quality and minimality of the explanations.

Thirdly, while our model demonstrates strong performance, a deeper investigation into fairness and bias is warranted. Our preliminary analysis noted minor performance disparities across different user groups, highlighting a potential risk. A comprehensive future analysis is needed to rigorously audit the model using established fairness metrics, such as demographic parity and equalized odds. Subsequently, any identified biases could be mitigated by incorporating fairness constraints directly into the model’s training objective to ensure equitable performance before any real-world deployment.

Collectively, these limitations guide a clear and ambitious research agenda. Beyond addressing the aforementioned points, we plan to investigate the integration of causal inference principles to generate explanations that reflect true cause-and-effect relationships. Furthermore, exploring synergies with large language models (LLMs) for feature enrichment and natural language explanation generation and developing federated learning frameworks for privacy-preserving collaborative training represent crucial next steps toward building truly robust, transparent, and socially responsible AI systems for critical financial applications.

Author Contributions

Conceptualization, H.Y., Y.Z., X.J., Z.L., Z.T., Q.T., and Y.S.; methodology, H.Y., Y.Z., and X.J.; software, H.Y., Y.Z., and Z.L.; validation, H.Y., Y.Z., X.J., and Z.L.; formal analysis, H.Y., Y.Z., and X.J.; investigation, H.Y., Y.Z., and X.J.; resources, H.Y., Z.T., Q.T., and Y.S.; data curation, Y.Z., Z.L., and X.J.; writing—original draft preparation, H.Y. and Y.Z.; writing—review and editing, H.Y., Y.S., Z.T., Q.T., and X.J.; visualization, Y.Z., Z.L., and X.J.; supervision, H.Y., Q.T., and Y.S.; project administration, H.Y. and Y.S.; funding acquisition, H.Y., Q.T., and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in Kaggle at https://www.kaggle.com/c/ieee-fraud-detection, accessed on 6 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Optimization for CF-Gen

Algorithm A1 CF-Gen counterfactual optimization

procedure GenerateCounterfactual($X_{i}, A_{i}, y_{target}, f, T, λ_{feat}, λ_{struct}, λ_{plaus}$)
    Input: Original instance $(X_{i}, A_{i})$, target prediction $y_{target}$, model $f$, iterations $T$, hyperparameters $λ$.
    Output: Optimal counterfactual perturbation $(Δ X_{i}^{*}, Δ A_{i}^{*})$.
    $Δ X_{i} \leftarrow 0$; $Δ A_{i} \leftarrow 0$ ▹ Initialize perturbations
    $\mathit{min\_cost} \leftarrow \infty$
    $(Δ X_{i}^{*}, Δ A_{i}^{*}) \leftarrow \mathrm{None}$
    for $t = 1$ to $T$ do
        $X_{i}' \leftarrow X_{i} + Δ X_{i}$; $A_{i}' \leftarrow A_{i} + Δ A_{i}$ ▹ Apply perturbations (using relaxations for discrete parts)
        $L_{pred} \leftarrow \mathrm{PredictionLoss}(f(X_{i}', A_{i}'), y_{target})$
        $C_{feat} \leftarrow ∥ Δ X_{i, cont} ∥_{1} + ∥ Δ X_{i, cat} ∥_{0}$ ▹ Feature cost
        $C_{struct} \leftarrow ∥ Δ A_{i} ∥_{0}$ ▹ Structural cost
        $C_{plaus} \leftarrow \mathrm{PlausibilityPenalty}(X_{i}', A_{i}')$ ▹ Plausibility cost
        $L_{total} \leftarrow L_{pred} + λ_{feat} C_{feat} + λ_{struct} C_{struct} + λ_{plaus} C_{plaus}$
        Compute gradients $\nabla_{Δ X_{i}, Δ A_{i}} L_{total}$
        Update $Δ X_{i}, Δ A_{i}$ using one step of the Adam optimizer.
        Project discrete parts of $Δ X_{i}, Δ A_{i}$ back to their space.
        if $f(X_{i}', A_{i}') = y_{target}$ then
            $\mathit{current\_cost} \leftarrow λ_{feat} C_{feat} + λ_{struct} C_{struct} + λ_{plaus} C_{plaus}$
            if $\mathit{current\_cost} < \mathit{min\_cost}$ then
                $\mathit{min\_cost} \leftarrow \mathit{current\_cost}$
                $(Δ X_{i}^{*}, Δ A_{i}^{*}) \leftarrow (Δ X_{i}, Δ A_{i})$
            end if
        end if
    end for
    return $(Δ X_{i}^{*}, Δ A_{i}^{*})$
end procedure
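The search loop of Algorithm A1 can be illustrated on a toy problem. The sketch below is a heavily simplified stand-in rather than CF-Gen itself: the model is a hypothetical logistic regression instead of a GNN, only continuous feature perturbations are optimized (no structural or plausibility terms), the L1 norm replaces the L0 cost, and plain subgradient descent replaces Adam. It also re-evaluates the model on the updated perturbation before accepting it as valid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate_counterfactual(x, w, b, y_target=0, steps=200, lr=0.05, lam_feat=0.1):
    """Search for a small feature perturbation that flips a toy logistic model.

    Keeps the lowest-cost perturbation among all iterates that reach y_target.
    """
    delta = np.zeros_like(x, dtype=float)
    best_delta, best_cost = None, np.inf
    for _ in range(steps):
        p = sigmoid(w @ (x + delta) + b)  # predicted fraud probability
        # Gradient of the cross-entropy toward y_target is (p - y_target) * w;
        # the L1 cost contributes the subgradient lam_feat * sign(delta).
        grad = (p - y_target) * w + lam_feat * np.sign(delta)
        delta = delta - lr * grad
        # Check validity on the updated perturbation.
        p_new = sigmoid(w @ (x + delta) + b)
        if int(p_new >= 0.5) == y_target:
            cost = lam_feat * np.abs(delta).sum()
            if cost < best_cost:
                best_delta, best_cost = delta.copy(), cost
    return best_delta, best_cost

# Hypothetical instance initially classified as fraud (probability > 0.5).
x = np.array([2.0, 1.0, 0.5])
w = np.array([1.5, -0.5, 0.2])
b = -1.0
delta, cost = generate_counterfactual(x, w, b)
```

Tracking the best valid iterate, rather than returning the final one, mirrors the `min_cost` bookkeeping in the pseudocode: later iterates may overshoot the decision boundary and cost more than necessary.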

Appendix B. Dataset Preprocessing Details

For the IEEE-CIS dataset, TransactionDT was converted into cyclical features (hour, day_of_week, and day_of_month) and also used to derive time differences between successive transactions for the same card. High-cardinality categorical features like DeviceInfo and email domains were processed using frequency encoding followed by embedding layers. For the IBM dataset, merchant name and MCC were treated as categorical features; amount was log-transformed due to its skewed distribution. Missing values for critical features were imputed using median/mode, while features with >70% missing values were initially dropped and later assessed for potential inclusion via imputation if model performance suffered.
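Three of the transformations described above can be sketched with the standard library. The details are assumptions made for illustration: `TransactionDT` is treated as an elapsed-time offset in seconds, the cyclical encoding is taken to be the common sine/cosine pair, the log transform is `log(1 + amount)`, and all function names are hypothetical.

```python
import math
from collections import Counter

def cyclical_hour(transaction_dt_seconds):
    """Encode hour of day as a (sin, cos) pair so that 23:00 and 00:00 are close."""
    hour = (transaction_dt_seconds // 3600) % 24
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def log_amount(amount):
    """Compress a skewed, non-negative amount with log(1 + x)."""
    return math.log1p(amount)

hour_sin, hour_cos = cyclical_hour(6 * 3600)  # 06:00 after the reference time
device_freq = frequency_encode(["a", "b", "a", "a"])
```

The same sine/cosine pattern applies to day_of_week (period 7) and day_of_month, and the frequency-encoded values would then index or feed the embedding layers mentioned above.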

Appendix C. Sensitivity to GNN Layers

Table A1. HAG-CFNet performance by number of GNN layers (IEEE-CIS dataset).

GNN Layers	AUC-PR	F1-Macro	Recall (Fraud)
1	0.828 ± 0.018	0.795 ± 0.021	0.743 ± 0.025
2	0.854 ± 0.015	0.832 ± 0.018	0.787 ± 0.022
3 (Default)	0.867 ± 0.012	0.844 ± 0.015	0.798 ± 0.021
4	0.861 ± 0.014	0.838 ± 0.017	0.791 ± 0.023
5	0.848 ± 0.016	0.822 ± 0.019	0.772 ± 0.024

This table shows that three GNN layers provided the optimal balance for HAG-CFNet on the IEEE-CIS dataset. Fewer layers likely caused underfitting by not capturing sufficient neighborhood information, while more layers might lead to over-smoothing or overfitting despite the heterophily-awareness mechanisms.

Appendix D. Hyperparameter Settings

Table A2. Hyperparameter settings for HAG-CFNet.

Hyperparameter	Search Range	IEEE-CIS Value	IBM Transactions Value	Synthetic Value
Learning Rate	[0.0001, 0.01]	0.001	0.0005	0.001
Batch Size	{32, 64, 128, 256, 512}	256	512	128
GNN Layers	{2, 3, 4}	3	3	2
Embedding Dimension	{64, 128, 256}	128	256	64
Dropout Rate	[0.1, 0.5]	0.3	0.4	0.2
$λ_{feat}$ (CF-Gen)	[0.05, 0.5]	0.1	0.15	0.05
$λ_{struct}$ (CF-Gen)	[0.1, 1.0]	0.5	0.6	0.3
$λ_{plaus}$ (CF-Gen)	[0.5, 2.0]	1.0	1.2	0.8
$w_{fraud}$ (Imbalance)	[2.0, 20.0]	5.0	10.0	3.0

Appendix E. Theoretical Analysis of the CF-Gen Optimization Problem

This section provides a theoretical discussion on the optimization problem presented in CF-Gen, addressing its computational complexity, the rationale for our chosen methodology, and the interpretation of its convergence and correctness, in response to reviewer feedback requesting deeper mechanistic explanations.

Appendix E.1. Problem Complexity and NP-Hardness

The core optimization problem of CF-Gen aims to find minimal and plausible perturbations that alter a model’s prediction. Formally, this problem can be classified as a mixed-integer non-linear program (MINLP). The complexity arises from several sources:

Discrete Variables: The perturbation space includes discrete choices for both categorical features (e.g., changing a merchant category) and the graph structure ( $Δ A_{i}$ , which involves adding or removing edges).
Non-Differentiable L0-Norm: The use of the L0-norm to enforce sparsity in feature and structural perturbations ( $C_{feat}$ and $C_{struct}$ ) results in a non-differentiable and non-convex objective function.
Non-Convexity: The prediction loss $L_{pred}$ is evaluated through a deep neural network $f (\cdot)$ , which is inherently a non-convex function of its inputs.

Due to this combination of discrete variables and non-convexity, the problem is NP-hard. Consequently, finding a guaranteed globally optimal solution in polynomial time is computationally intractable. This necessitates the use of heuristic or approximate optimization methods, which is a standard approach for such problems in the machine learning literature.

Appendix E.2. Rationale for Approximate Optimization and Convergence

Given the problem’s complexity, we employ a gradient-based optimization strategy on a continuous relaxation of the original problem. This is a pragmatic and widely adopted approach for making such problems tractable.

Continuous Relaxation: To handle discrete variables, we utilize the Gumbel-Softmax reparameterization technique. This method provides a differentiable approximation to sampling from a categorical distribution, allowing gradients to be backpropagated through discrete choices during training. Similarly, non-differentiable sparsity-inducing norms (L0) can be relaxed with differentiable proxies (e.g., L1-norm or other smoothed approximations) during the gradient-based search phase.
Convergence to a Local Optimum: While global convergence cannot be guaranteed, our optimization procedure is designed to converge. The relaxed objective function, although non-convex, is differentiable. By applying a standard gradient-based optimizer such as Adam, the algorithm iteratively updates the perturbations to decrease the loss. Under standard smoothness and step-size assumptions, gradient-based optimizers of this kind converge to a critical point of the loss function, which in practice is typically a local minimum. Thus, our method can be expected to reach a locally optimal solution of the relaxed problem, which serves as a high-quality candidate solution for the original discrete problem.
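The continuous relaxation of a categorical choice can be illustrated with a small NumPy version of the Gumbel-Softmax sample. This is a generic sketch of the technique rather than the CF-Gen implementation; the function name, the logits, and the temperatures are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float)
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))  # samples from Gumbel(0, 1)
    z = (logits + gumbel_noise) / tau
    z = z - z.max()                     # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

# Hypothetical logits over three merchant categories; same noise, two temperatures.
category_logits = [2.0, 0.5, 0.1]
soft_sample = gumbel_softmax(category_logits, tau=5.0, rng=np.random.default_rng(0))
sharp_sample = gumbel_softmax(category_logits, tau=0.1, rng=np.random.default_rng(0))
```

At a high temperature the sample is a smooth mixture over categories, which gives useful gradients; as the temperature is lowered it approaches a one-hot vector, which is why the projection step back to the discrete space loses little at the end of the search.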

Appendix E.3. On the Correctness and Validity of Solutions

In the context of counterfactual explanations, “correctness” is best understood as the practical validity and plausibility of the generated solutions, rather than their global mathematical optimality. Our framework ensures this practical correctness through both the optimization objective and empirical validation.

The objective function explicitly minimizes the prediction loss $L_{pred}$ for the target class, directly encouraging the search to find perturbations that successfully flip the model’s prediction. The high empirical validity rate of 91.8% reported in Table 6 serves as strong evidence that our method effectively finds “correct” solutions in this regard.
The inclusion of the domain-aware plausibility cost, $C_{plaus}$ , is critical for ensuring that the generated counterfactuals are realistic and meaningful. This term guides the optimization away from mathematically valid but practically nonsensical solutions. As demonstrated in Table 12, the low domain constraint violation rate (3.2%) provides robust empirical validation for the practical correctness and actionability of the explanations produced by CF-Gen.
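
As a concrete reading of the validity criterion used above, the following sketch computes the share of explained instances whose prediction is flipped to the target class. The function name and the exact counting rule (instances already at the target label are skipped) are assumptions for illustration; the evaluation protocol behind Table 6 may differ in detail.

```python
def validity_rate(original_preds, counterfactual_preds, target_label):
    """Percentage of explained instances whose counterfactual reaches target_label."""
    flipped = 0
    total = 0
    for before, after in zip(original_preds, counterfactual_preds):
        if before == target_label:
            continue  # already at the target class, nothing to explain
        total += 1
        if after == target_label:
            flipped += 1
    return 100.0 * flipped / total if total else 0.0


# Four fraud predictions (1); three counterfactuals reach 'benign' (0).
example = validity_rate([1, 1, 1, 1], [0, 0, 1, 0], target_label=0)
```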

In conclusion, while our method finds a locally optimal solution to a relaxed version of an NP-hard problem, our extensive quantitative evaluation demonstrates that these solutions are of high quality, yielding counterfactuals that are valid, sparse, and plausible. Acknowledging the theoretical limitations, we propose that the development of specialized solvers with stronger optimality guarantees for counterfactual generation on graphs is a valuable direction for future research.

Appendix F. Validation of Synthetic Dataset Realism

To validate the fidelity of the synthetic dataset, we conducted a comparative analysis of key feature distributions and graph structural properties against the real-world IEEE-CIS dataset, as detailed in Table A3 and Table A4.

Table A3. Comparative analysis of key feature distributions.

Feature Type	Statistic/Metric	IEEE-CIS (Real)	Synthetic (CardSim-Based)
Continuous (TransactionAmt, normalized)	Mean	0.135	0.131
	Standard Deviation	0.241	0.235
	Median	0.069	0.072
	Skewness	3.58	3.49
	Kurtosis	15.2	14.8
Categorical (ProductCD)	Number of Categories	5	5
Categorical (ProductCD)	Distribution Entropy	1.37	1.41
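
The statistics in Table A3 can be computed with standard moment estimators, sketched below. The paper does not state which conventions were used, so the choices here are assumptions: population (biased) moments, non-excess kurtosis, and a configurable logarithm base for the entropy.

```python
import math
import statistics
from collections import Counter


def distribution_summary(values):
    """Mean, standard deviation, median, skewness, and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "median": statistics.median(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


def category_entropy(labels, base=math.e):
    """Shannon entropy of the empirical category distribution."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


summary = distribution_summary([1.0, 2.0, 3.0, 4.0, 5.0])
balanced = category_entropy(["a", "b", "a", "b"], base=2)
```

Applying the same two functions to the real and the synthetic feature columns yields directly comparable rows, which is the comparison the table reports.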

Table A4. Comparative analysis of graph structural properties.

Graph Metric	IEEE-CIS (Real)	Synthetic (CardSim-Based)
Average Node Degree	5.60	6.01
Graph Density	$4.48 \times 10^{- 6}$	$6.00 \times 10^{- 6}$
Average Clustering Coefficient	0.085	0.091
Number of Connected Components	12,451	14,870
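
The average degree and density in Table A4 follow from the node and edge totals in Table 3 if the graph is treated as an undirected simple graph, which is an assumption of this sketch. With the approximate IEEE-CIS totals it gives an average degree of 5.6 and a density of about $4.48 \times 10^{-6}$, matching the reported values.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge contributes to two nodes."""
    return 2.0 * num_edges / num_nodes


def graph_density(num_nodes, num_edges):
    """Fraction of possible undirected edges that are present."""
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# Approximate totals from Table 3.
ieee_degree = average_degree(1_250_000, 3_500_000)
ieee_density = graph_density(1_250_000, 3_500_000)
synthetic_density = graph_density(1_000_000, 3_000_000)
```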

References

1. Bisht, D.; Singh, R.; Gehlot, A.; Akram, S.V.; Singh, A.; Montero, E.C.; Priyadarshi, N.; Twala, B. Imperative role of integrating digitalization in the firms finance: A technological perspective. Electronics 2022, 11, 3252.
2. Cui, Y.; Han, X.; Chen, J.; Zhang, X.; Yang, J.; Zhang, X. FraudGNN-RL: A Graph Neural Network With Reinforcement Learning for Adaptive Financial Fraud Detection. IEEE Open J. Comput. Soc. 2025, 6, 426–437.
3. Liu, S.; Rees, B.; Patangia, P. Supercharging Fraud Detection in Financial Services with Graph Neural Networks. Nvidia 2024, 6, 0602.
4. Allen, J. CardSim: A Bayesian Simulator for Payment Card Fraud Detection Research. 2025. Available online: https://ssrn.com/abstract=5179591 (accessed on 6 September 2025).
5. Lyu, C.; Ji, T.; Sun, Q.; Zhou, L. Dcu-lorcan at fincausal 2022: Span-based causality extraction from financial documents using pre-trained language models. In Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022, Marseille, France, 24 June 2022; pp. 116–120.
6. Vasani, V.; Bairwa, A.K.; Joshi, S.; Pljonkin, A.; Kaur, M.; Amoon, M. Comprehensive analysis of advanced techniques and vital tools for detecting malware intrusion. Electronics 2023, 12, 4299.
7. Cheng, D.; Zou, Y.; Xiang, S.; Jiang, C. Graph neural networks for financial fraud detection: A review. Front. Comput. Sci. 2025, 19, 199609.
8. Ghaffari, A.; Jelodari, N.; Pouralish, S.; Derakhshanfard, N.; Arasteh, B. Securing Internet of Things Using Machine and Deep Learning Methods: A Survey. Clust. Comput. 2024, 27, 9065–9089.
9. Vrahatis, A.G.; Lazaros, K.; Kotsiantis, S. Graph attention networks: A comprehensive review of methods and applications. Future Internet 2024, 16, 318.
10. Cheng, D.; Wang, X.; Zhang, Y.; Zhang, L. Graph neural network for fraud detection via spatial-temporal attention. IEEE Trans. Knowl. Data Eng. 2020, 34, 3800–3813.
11. Seyednezhad, R.; Derakhshanfard, N.; Heikalabad, S.R. Routing Design in Optical Networks-on-chip Based on Gray Code for Optical Loss Reduction. Optik 2021, 228, 166198.
12. Derakhshanfard, N. Erlang Based Buffer Management and Routing in Opportunistic Networks. Wirel. Pers. Commun. 2020, 110, 2165–2177.
13. Wang, L.; Li, P.; Xiong, K.; Zhao, J.; Lin, R. Modeling heterogeneous graph network on fraud detection: A community-based framework with attention mechanism. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021; pp. 1959–1968.
14. Ghosh, S.; Anand, R.; Bhowmik, T.; Chandrashekhar, S. GoSage: Heterogeneous graph neural network using hierarchical attention for collusion fraud detection. In Proceedings of the Fourth ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023; pp. 185–192.
15. Mo, Y.; Yu, R.; Zhu, X.; Wang, X. HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters. Comput. Res. Repos. 2025, 11, 01155.
16. Pan, J.; Liu, Y.; Zheng, X.; Zheng, Y.; Liew, A.W.C.; Li, F.; Pan, S. A Label-Free Heterophily-Guided Approach for Unsupervised Graph Fraud Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 12443–12451.
17. Zhang, W.; Zhong, J.; Yao, G.; Han, R.; Lin, X.; Zhang, Z.; Luo, C. Dual-channel Heterophilic Message Passing for Graph Fraud Detection. arXiv 2025, arXiv:2504.14205.
18. Zhang, W.; Xu, D.; Xuan, X.; Jiang, L.; Yao, G.; Han, R.; Lang, X.; Luo, C. Addressing Noise and Stochasticity in Fraud Detection for Service Networks. arXiv 2025, arXiv:2505.00946.
19. Liao, N.; Luo, S.; Li, X.; Shi, J. LD2: Scalable Heterophilous Graph Neural Network with Decoupled Embeddings. Neural Inf. Process. Syst. 2023, 36, 10197–10209.
20. Lyu, C.; Ji, T.; Zhou, L. DCU-ML at the FinNLP-2022 ERAI Task: Investigating the Transferability of Sentiment Analysis Data for Evaluating Rationales of Investors. In Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 116–121.
21. Giorgi, F.; Silvestri, F.; Tolomei, G. COMBINEX: A Unified Counterfactual Explainer for Graph Neural Networks via Node Feature and Structural Perturbations. arXiv 2025, arXiv:2502.10111.
22. Lucic, A.; Ter Hoeve, M.A.; Tolomei, G.; De Rijke, M.; Silvestri, F. Cf-gnnexplainer: Counterfactual explanations for graph neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; pp. 4499–4511.
23. Qu, Z.; Gomm, D.; Färber, M. CoDy: Counterfactual Explainers for Dynamic Graphs. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
24. Hu, X.; Chen, H.; Zhang, J.; Chen, H.; Liu, S.; Li, X.; Wang, Y.; Xue, X. GAT-COBO: Cost-sensitive graph neural network for telecom fraud detection. IEEE Trans. Big Data 2024, 10, 528–542.
25. Liu, Z.; Qiu, R.; Zeng, Z.; Yoo, H.; Zhou, D.; Xu, Z.; Zhu, Y.; Weldemariam, K.; He, J.; Tong, H. Class-Imbalanced Graph Learning Without Class Rebalancing. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
26. Vivek, Y.; Ravi, V.; Mane, A.; Naidu, L.R. Explainable artificial intelligence and causal inference based atm fraud detection. In Proceedings of the 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CIFEr), Hoboken, NJ, USA, 22–23 October 2024; pp. 1–7.
27. Gholami, M.; Ghaffari, A.; Derakhshanfard, N.; İbrahimoğlu, N.; Kazem, A.A.P. Blockchain Integration in IoT: Applications, Opportunities, and Challenges. Comput. Mater. Contin. 2025, 83, 1561.
28. IEEE Computational Intelligence Society; Vesta Corporation. IEEE-CIS Fraud Detection. Kaggle Competition Dataset. 2019. Available online: https://www.kaggle.com/competitions/ieee-fraud-detection (accessed on 6 September 2025).
29. Sha, Q.; Tang, T.; Du, X.; Liu, J.; Wang, Y.; Sheng, Y. Detecting Credit Card Fraud via Heterogeneous Graph Neural Networks with Graph Attention. arXiv 2025, arXiv:2504.08183.
30. Altman, E.; Blanuša, J.; Von Niederhäusern, L.; Egressy, B.; Anghel, A.; Atasu, K. Realistic synthetic financial transactions for anti-money laundering models. Adv. Neural Inf. Process. Syst. 2023, 36, 29851–29874.
31. Kamalaruban, P.; Pi, Y.; Burrell, S.; Drage, E.; Skalski, P.; Wong, J.; Sutton, D. Evaluating Fairness in Transaction Fraud Models: Fairness Metrics, Bias Audits, and Challenges. In Proceedings of the 5th ACM International Conference on AI in Finance, Brooklyn, NY, USA, 14–17 November 2024; pp. 555–563.
32. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
33. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858.
34. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
35. Liu, J.; Ong, G.P.; Chen, X. GraphSAGE-based traffic speed forecasting for segment network with sparse data. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1755–1766.
36. Tang, X.; Chen, L.; Shi, H.; Lyu, D. Dhyper: A recurrent dual hypergraph neural network for event prediction in temporal knowledge graphs. ACM Trans. Inf. Syst. 2024, 42, 1–23.

Figure 1. The overall architecture of HAG-CFNet. The framework follows a modular design pipeline. (1) Raw data is transformed into a heterogeneous graph. (2) The core HAG-Net model, featuring heterogeneity-aware message passing and dual-channel heterophily detection, learns robust node embeddings. This process is supported by a synergistic imbalance correction strategy. (3) A prediction head outputs fraud probabilities. (4) The CF-Gen module analyzes predictions to generate interpretable counterfactual explanations. This end-to-end design ensures that the model is not only accurate but also robust to adversarial camouflage and transparent in its decision-making.

Figure 2. Precision–recall curves on the IEEE-CIS dataset. This figure visualizes the trade-off between precision and recall. The curve for HAG-CFNet (blue) consistently remains above all baselines, and its larger area under the curve (AUC-PR) quantitatively confirms its dominant performance. The significant gap between HAG-CFNet and other methods, especially at high-recall regions, underscores its effectiveness in identifying the rare fraud class, which is critical for practical applications.

Figure 3. Impact of the heterophily level on model performance. This simulation study on synthetic data demonstrates HAG-CFNet’s superior robustness to increasing heterophily (a lower rate of performance decay) compared to a standard GCN. This validates the effectiveness of our dual-channel heterophily detection module in adversarial environments.

Figure 4. Impact of the imbalance ratio on recall for the minority (fraud) class. This experiment on synthetic data highlights the critical role of our imbalance correction strategy. As the fraud ratio becomes more realistic (lower), the performance gap widens, showing that the model without correction becomes ineffective at identifying rare fraud instances.

Figure 5. Case study 1: feature-level counterfactual analysis. The model correctly identifies a transaction as high-risk (fraud probability 0.92). CF-Gen explains that the prediction can be flipped to benign (0.15) by altering a few key risk factors: changing to a common merchant category, reducing the amount, using a domestic IP, and removing the link to a known fraudulent device.

Figure 6. Case study 2: behavioral pattern analysis of a false positive. A legitimate transaction is initially flagged as high-risk (0.78) due to anomalous but non-fraudulent behavior (a high transaction frequency and a new IP). CF-Gen explains that normalizing these specific behavioral patterns would reverse the prediction to benign (0.22).

Figure 7. Case study 3: network-level counterfactual analysis. This example illustrates how CF-Gen analyzes the graph structure. A coordinated fraud ring (left), characterized by shared infrastructure and a central managing node, is correctly classified as high-risk (0.95). The counterfactual explanation demonstrates that altering the network topology to resemble isolated, independent entities (right) would drastically reduce the risk score to 0.15.

Figure 8. t-SNE visualization of node embeddings. This figure compares the learned representations of HAG-Net (left) and a standard GCN (right). HAG-Net produces clearly separated clusters for fraudulent (red) and benign (blue) nodes, indicating that it has learned highly discriminative features.

Table 1. Comparative analysis of key GNN architectures.

Model Family	Handles Heterogeneity?	Heterophily Robustness	Scalability	Limitations
GCN/GAT	Limited (assumes homogeneous graph)	Low (performance degrades significantly)	Moderate (full-batch training is a bottleneck)	Suffers from over-smoothing; relies on homophily.
GraphSAGE	Limited (can be adapted but not natively designed for it)	Low to moderate (depends on aggregator function)	High (inductive and sampling-based)	Basic aggregators may not capture complex relational patterns.
HGNNs	Yes (designed for multiple node/edge types)	Moderate (improves over GCN but not its primary focus)	Moderate to Low (meta-path complexity can be high)	Often requires predefined meta-paths; may not address adversarial heterophily.
Heterophily-aware GNNs	Limited (focuses on heterophily in homogeneous settings)	High (explicitly designed for it)	Moderate	May not effectively leverage rich semantic information in fully heterogeneous graphs.

Table 2. Summary of key mathematical notations.

Symbol	Definition
$G = (V, E, R)$	A heterogeneous graph with node set $V$, edge set $E$, and relation type set $R$.
$v, u$	Nodes in the graph, where $v \in V$ and $u \in V$.
$l$	The layer index in the Graph Neural Network.
$h_{v}^{(l)}$	The embedding (feature representation) of node $v$ at layer $l$.
$N_{r} (v)$	The set of neighboring nodes of $v$ under relation type $r$.
$W_{r}^{(l)}$	A learnable weight matrix for relation type $r$ at layer $l$.
$a_{r}^{(l)}$	A learnable attention vector for relation type $r$ at layer $l$.
$α_{u v}^{(l, r)}$	The normalized attention score of neighbor $u$ on target node $v$ under relation $r$ at layer $l$.
$S_{het} (v)$	The local heterophily score for node $v$, quantifying feature dissimilarity in its neighborhood.
$X_{i}, A_{i}$	The feature matrix and adjacency matrix corresponding to a target instance $i$.
$f (\cdot)$	The trained HAG-Net model, which acts as a prediction function.
$y_{target}$	The desired target prediction for a counterfactual explanation (e.g., changing from ‘fraud’ to ‘benign’).
$Δ X_{i}, Δ A_{i}$	The feature and structural perturbations applied to instance $i$ to generate a counterfactual explanation.
$X_{i}^{'}, A_{i}^{'}$	The perturbed feature and adjacency matrices, i.e., $X_{i}^{'} = X_{i} + Δ X_{i}, A_{i}^{'} = A_{i} + Δ A_{i}$.
$L_{pred}$	The prediction loss term in counterfactual optimization, driving the prediction towards $y_{target}$.
$C_{feat}, C_{struct},$ and $C_{plaus}$	Cost functions penalizing feature perturbations, structural changes, and implausible explanations, respectively.
$λ_{feat}, λ_{struct},$ and $λ_{plaus}$	Hyperparameters that balance the prediction loss against the perturbation and plausibility costs.

Table 3. Dataset statistics. This table summarizes key characteristics of the datasets after processing.

Feature	IEEE-CIS	IBM Transactions	Synthetic (CardSim-Based)
Total Nodes (Approx.)	1,250,000	15,500,000	1,000,000
Node Types and Counts	Transaction (590 k), Card (300 k), DeviceInfo (150 k), EmailDomain (80 k), ProductCD (5 k), and Addr (125 k)	User (2 k), Card (3 k), Merchant (7 k), Transaction (24.4 M)	User (5 k), Merchant (10 k), Card (7 k), Transaction (1 M)
Total Edges (Approx.)	3,500,000	48,900,000	3,000,000
Edge Types and Counts	Trans-Card (590 k), Trans-DeviceInfo (140 k), Trans-Email (340 k), Trans-ProductCD (590 k), Trans-Addr (550 k), Card-User	User-Trans (24.4 M), Trans-Merchant (24.4 M), User-Card (3 k)	User-Trans (1 M), Trans-Merchant (1 M), and User-Card (7 k)
Avg. Node Degree	5.6	6.3	6.0
Feature Dimensionality	450	60	80
Fraud Ratio (Overall)	3.52%	0.122%	1.0%
Heterophily Level	0.65	0.78	0.5

Table 4. Comparative fraud detection performance with statistical validation. This table details core detection metrics across three datasets. Best results for each metric are in bold. The p-value is calculated via a paired t-test against the HAG-CFNet performance on the same dataset; a value < 0.05 indicates a statistically significant difference.

Model	Dataset	AUC-PR	p-Value	F1-Macro	Recall (Fraud)	G-Mean	Precision (Fraud)
XGBoost [32]	IEEE-CIS	0.776 ± 0.013	<0.01	0.751 ± 0.018	0.668 ± 0.024	0.803 ± 0.015	0.861 ± 0.011
	IBM Trans.	0.643 ± 0.019	<0.01	0.624 ± 0.021	0.542 ± 0.027	0.714 ± 0.016	0.759 ± 0.014
	Synthetic	0.818 ± 0.012	<0.01	0.794 ± 0.016	0.748 ± 0.022	0.858 ± 0.014	0.885 ± 0.009
GCN [33]	IEEE-CIS	0.799 ± 0.017	0.023	0.772 ± 0.019	0.698 ± 0.026	0.824 ± 0.018	0.835 ± 0.015
	IBM Trans.	0.673 ± 0.022	0.018	0.649 ± 0.024	0.577 ± 0.031	0.738 ± 0.019	0.763 ± 0.017
	Synthetic	0.839 ± 0.015	0.021	0.815 ± 0.018	0.778 ± 0.023	0.874 ± 0.016	0.889 ± 0.012
GAT [34]	IEEE-CIS	0.814 ± 0.016	0.015	0.787 ± 0.020	0.723 ± 0.025	0.841 ± 0.017	0.852 ± 0.013
	IBM Trans.	0.696 ± 0.021	0.024	0.672 ± 0.023	0.608 ± 0.029	0.759 ± 0.018	0.781 ± 0.016
	Synthetic	0.853 ± 0.014	0.029	0.831 ± 0.017	0.798 ± 0.021	0.890 ± 0.015	0.904 ± 0.011
GraphSAGE [35]	IEEE-CIS	0.808 ± 0.018	0.011	0.782 ± 0.021	0.714 ± 0.027	0.834 ± 0.019	0.841 ± 0.014
	IBM Trans.	0.688 ± 0.023	0.019	0.664 ± 0.025	0.598 ± 0.032	0.751 ± 0.020	0.774 ± 0.018
	Synthetic	0.846 ± 0.016	0.023	0.823 ± 0.019	0.789 ± 0.024	0.883 ± 0.017	0.897 ± 0.013
DHMP (adapted) [36]	IEEE-CIS	0.832 ± 0.014	<0.01	0.809 ± 0.017	0.752 ± 0.023	0.872 ± 0.015	0.851 ± 0.012
	IBM Trans.	0.718 ± 0.020	0.012	0.694 ± 0.022	0.639 ± 0.028	0.781 ± 0.017	0.829 ± 0.015
	Synthetic	0.875 ± 0.013	0.017	0.853 ± 0.015	0.824 ± 0.019	0.918 ± 0.012	0.939 ± 0.008
HAG-CFNet (Full)	IEEE-CIS	0.867 ± 0.012	-	0.844 ± 0.015	0.798 ± 0.021	0.868 ± 0.014	0.844 ± 0.013
	IBM Trans.	0.758 ± 0.018	-	0.733 ± 0.020	0.682 ± 0.026	0.813 ± 0.016	0.797 ± 0.017
	Synthetic	0.908 ± 0.010	-	0.891 ± 0.012	0.867 ± 0.017	0.915 ± 0.011	0.924 ± 0.009
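
The paired t-test mentioned in the caption compares per-run scores of two models obtained on the same splits. The sketch below computes only the test statistic and degrees of freedom with the standard library; the per-run scores are not reported in the paper, so any inputs are hypothetical, and the function name is our own.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(var_diff / n), n - 1


# Hypothetical matched scores, chosen only to make the arithmetic easy to follow.
t_stat, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Converting the statistic to a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` returns both the statistic and the two-sided p-value in one call.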

Table 5. Ablation study of HAG-CFNet components on the IEEE-CIS dataset. The results quantify the performance drop when each key module is removed, demonstrating that the heterogeneity-aware module, heterophily detection, and the synergistic imbalance correction strategy all provide significant, complementary contributions to the final performance.

Model Variant	AUC-PR	F1-Macro	Recall (Fraud)
HAG-CFNet (Full)	0.867 ± 0.012	0.844 ± 0.015	0.798 ± 0.021
HAG-CFNet w/o the Heterogeneity Module	0.819 ± 0.016	0.791 ± 0.019	0.732 ± 0.024
HAG-CFNet w/o the Heterophily Module	0.831 ± 0.014	0.808 ± 0.017	0.757 ± 0.022
HAG-CFNet w/o Imbalance Correction (Full)	0.738 ± 0.020	0.698 ± 0.023	0.603 ± 0.029
HAG-CFNet w/o GraphSMOTE (Cost-Sensitive Only)	0.825 ± 0.015	0.801 ± 0.018	0.749 ± 0.023
HAG-CFNet w/o Cost-Sensitive Loss (GraphSMOTE Only)	0.841 ± 0.014	0.817 ± 0.016	0.768 ± 0.022

Table 6. Quantitative evaluation of counterfactual explanations (on the IEEE-CIS dataset).

CFE Method	Avg. Validity (%)	Avg. Feat. Sparsity (L0)	Avg. Struct. Sparsity (L0)	Avg. Feat. (L1 Norm, Norm.)	Avg. Struct. (GED Proxy)	Plausibility (k-NN Dist., Norm.)	Plausibility (Constraint Violation Rate, %)
CF-Gen (from HAG-CFNet)	91.8 ± 2.3	3.4 ± 0.8	1.3 ± 0.4	0.21 ± 0.06	2.7 ± 0.9	0.28 ± 0.07	3.2 ± 0.9
COMBINEX (adapted)	89.7 ± 2.8	3.8 ± 0.9	1.6 ± 0.5	0.23 ± 0.07	3.2 ± 1.1	0.31 ± 0.08	4.6 ± 1.2
Feature-Only CFE	87.4 ± 3.1	4.5 ± 1.1	N/A	0.28 ± 0.09	N/A	0.35 ± 0.10	3.9 ± 1.0

Table 7. Robustness to feature and structural noise (AUC-PR % drop on IEEE-CIS).

Model	Feature Noise (10% Nodes)	Structural Noise (5% Edges)
GCN	−12.4%	−15.8%
GAT	−10.8%	−13.2%
HAG-CFNet (Full)	−5.2%	−6.5%
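
The claim that performance degradation under structural noise is reduced by over 50% can be checked directly from the values in Table 7. The helper below is a small worked calculation; the function name is ours.

```python
def relative_reduction(baseline_drop, model_drop):
    """Percentage by which model_drop is smaller than baseline_drop."""
    return 100.0 * (baseline_drop - model_drop) / baseline_drop


# Structural-noise AUC-PR drops from Table 7 (magnitudes, in percent).
vs_gcn = relative_reduction(15.8, 6.5)
vs_gat = relative_reduction(13.2, 6.5)
```

Relative to GCN the reduction is about 58.9%, and relative to GAT about 50.8%, so the statement holds against both baselines.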

Table 8. Normalized confusion matrix for HAG-CFNet on IEEE-CIS.

Predicted	True: Benign	True: Fraud
Benign	98.2% (True Negative)	20.2% (False Negative)
Fraud	1.8% (False Positive)	79.8% (True Positive / Recall)

Table 9. Computational performance on IEEE-CIS (NVIDIA A100 GPU).

Model	Training Time/Epoch (s)	Inference Latency/Batch (ms)	Peak GPU Memory (GB)
GCN	18.5	35	4.8
GAT	25.2	52	6.5
GraphSAGE	22.8	45	5.9
HAG-CFNet (Full)	31.5	68	7.2

Table 10. Hyperparameter settings for HAG-CFNet.

Hyperparameter	IEEE-CIS Value	IBM Transactions Value	Synthetic Value
Learning Rate	0.001	0.0005	0.001
Batch Size	256	512	128
GNN Layers	3	3	2
Embedding Dimension	128	256	64
Dropout Rate	0.3	0.4	0.2
$λ_{feat}$ (CF-Gen)	0.1	0.15	0.05
$λ_{struct}$ (CF-Gen)	0.5	0.6	0.3
$λ_{plaus}$ (CF-Gen)	1.0	1.2	0.8
$w_{fraud}$ (Imbalance)	5.0	10.0	3.0
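
To illustrate how a class weight such as $w_{fraud}$ typically enters a cost-sensitive loss, the sketch below scales the cross-entropy of fraud samples by that weight. This is one common formulation and an assumption on our part; the exact loss used by HAG-CFNet is the one defined in Section 3.5.

```python
import math


def weighted_bce(probs, labels, w_fraud=5.0, eps=1e-12):
    """Mean binary cross-entropy with fraud samples (label 1) up-weighted."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        if y == 1:
            total += -w_fraud * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(probs)


# The same uncertain prediction costs five times more on a fraud sample.
loss_fraud = weighted_bce([0.5], [1], w_fraud=5.0)
loss_benign = weighted_bce([0.5], [0], w_fraud=5.0)
```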

Table 11. Detailed feature perturbation statistics for CF-Gen on IEEE-CIS.

Feature Metric	Average Value	Std. Deviation	Min Perturbation	Max Perturbation
L0 Norm (Num. Changed Feats)	3.4	1.6	1	8
L1 Norm (Continuous and Normalized)	0.21	0.13	0.03	0.62
L2 Norm (Continuous and Normalized)	0.14	0.09	0.02	0.41
Categorical Features Changed	1.6	1.1	0	5
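
The three norms in Table 11 measure different aspects of the same perturbation vector. A minimal sketch, assuming features are already normalized and using a small tolerance to decide whether a feature counts as changed (both assumptions for illustration):

```python
import math


def perturbation_norms(x, x_cf, tol=1e-9):
    """L0 (changed features), L1, and L2 norms of the perturbation x_cf - x."""
    delta = [b - a for a, b in zip(x, x_cf)]
    l0 = sum(1 for d in delta if abs(d) > tol)
    l1 = sum(abs(d) for d in delta)
    l2 = math.sqrt(sum(d * d for d in delta))
    return l0, l1, l2


# Two of four features change, by 0.3 and -0.4.
l0, l1, l2 = perturbation_norms([0.0, 0.0, 0.0, 0.0], [0.3, 0.0, -0.4, 0.0])
```

For this example the norms are 2, 0.7, and 0.5. The L2 norm never exceeds the L1 norm, which is consistent with the averages reported in the table.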

Table 12. Plausibility score breakdown for CF-Gen on IEEE-CIS.

Plausibility Aspect	Metric	Score/Rate
Data Manifold Adherence	Avg. k-NN Distance (Normalized)	0.28
Domain Constraint Violations	Transaction Amount < 0	0.1%
	Invalid MCC Code	0.7%
	Impossible Transaction Sequence (Sampled)	1.6%
	Unrealistic Location Change (Heuristic)	1.8%
	Overall Violation Rate (Avg.)	3.2%
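
Domain constraint checks of the kind listed in Table 12 can be expressed as simple predicates over each generated counterfactual. The sketch below is illustrative only: the record fields, the set of valid MCC codes, and the aggregation of the overall rate (share of counterfactuals violating at least one check) are all assumptions, since the paper does not spell out these details here.

```python
VALID_MCC = {"5411", "5812"}  # hypothetical whitelist of merchant category codes

CHECKS = {
    "negative_amount": lambda cf: cf["amount"] < 0,
    "invalid_mcc": lambda cf: cf["mcc"] not in VALID_MCC,
}


def violation_rates(counterfactuals, checks):
    """Per-check violation rates and the share violating at least one check (in %)."""
    n = len(counterfactuals)
    per_check = {name: 0 for name in checks}
    any_violation = 0
    for cf in counterfactuals:
        violated = False
        for name, is_violated in checks.items():
            if is_violated(cf):
                per_check[name] += 1
                violated = True
        any_violation += int(violated)
    rates = {name: 100.0 * count / n for name, count in per_check.items()}
    return rates, 100.0 * any_violation / n


samples = [
    {"amount": 10.0, "mcc": "5411"},
    {"amount": -1.0, "mcc": "5411"},
    {"amount": 5.0, "mcc": "0000"},
    {"amount": -2.0, "mcc": "9999"},
]
rates, overall = violation_rates(samples, CHECKS)
```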

Table 13. Impact of heterophily score calculation on HAG-CFNet (IEEE-CIS; AUC-PR and F1-Macro).

Heterophily Score Method	AUC-PR	F1-Macro
Cosine Similarity-based (Default)	0.867 ± 0.012	0.844 ± 0.015
Learned Predictor (Auxiliary MLP)	0.858 ± 0.016	0.835 ± 0.019
HALO-inspired Metric (Adapted)	0.863 ± 0.014	0.841 ± 0.017
No Explicit Heterophily Score (Base HAG-Net)	0.831 ± 0.018	0.808 ± 0.021
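
For intuition about the default cosine-similarity-based score, one plausible form of a local heterophily score is one minus the mean cosine similarity between a node's features and those of its neighbors. This is an assumed form for illustration; the definition of $S_{het} (v)$ actually used is the one given in Section 3.3, and the function names below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def heterophily_score(node_feat, neighbor_feats):
    """One minus the mean cosine similarity to the neighbors (0.0 if isolated)."""
    if not neighbor_feats:
        return 0.0
    sims = [cosine(node_feat, nb) for nb in neighbor_feats]
    return 1.0 - sum(sims) / len(sims)


similar = heterophily_score([1.0, 0.0], [[1.0, 0.0], [2.0, 0.0]])
dissimilar = heterophily_score([1.0, 0.0], [[0.0, 1.0]])
```

Under this form a node surrounded by feature-aligned neighbors scores near 0, while a node whose neighbors point in unrelated directions scores near 1, which is the signal a dual-channel design can use to route aggregation.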

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

