Article

GATransformer: A Network Threat Detection Method Based on Graph-Sequence Enhanced Transformer

by Qigang Zhu 1, Xiong Zhan 2, Wei Chen 3, Yuanzhi Li 3, Hengwei Ouyang 3, Tian Jiang 1 and Yu Shen 1,*
1 NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing NARI Information & Communication Technology Co., Ltd., Nanjing 211000, China
2 State Grid Corporation of China, Beijing 100031, China
3 State Grid Anhui Electric Power Co., Ltd., Hefei 230022, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3807; https://doi.org/10.3390/electronics14193807
Submission received: 26 July 2025 / Revised: 10 September 2025 / Accepted: 15 September 2025 / Published: 25 September 2025
(This article belongs to the Special Issue Advances in Information Processing and Network Security)

Abstract

Emerging complex multi-step attacks such as Advanced Persistent Threats (APTs) pose significant risks to national economic development, security, and social stability. Effectively detecting these sophisticated threats is a critical challenge. While deep learning methods show promise in identifying unknown malicious behaviors, they often struggle with fragmented modal information, limited feature representation, and poor generalization. To address these limitations, we propose GATransformer, a dual-modal detection method that integrates topological structure analysis with temporal sequence modeling. Its core is a cross-attention semantic fusion mechanism that deeply integrates heterogeneous features and mitigates the constraints of unimodal representations. GATransformer reconstructs network behavior representation via a parallel processing framework in which graph attention captures intricate spatial dependencies and self-attention models long-range temporal correlations. Experimental results on the CIDDS-001 and CIDDS-002 datasets demonstrate the superior performance of our method over baseline methods, with detection accuracies of 99.74% (nodes) and 88.28% (edges) on CIDDS-001, and 99.99% and 99.98%, respectively, on CIDDS-002.

1. Introduction

The ongoing digital transformation and expansion of network infrastructure have led to a dramatic rise in the frequency, scale, and sophistication of Complex Multi-step Attacks (CMAs) [1,2]. Consequently, network threat detection has become a paramount challenge in the information age [3]. These attacks result in immense global economic damage, with annual losses projected to reach USD 10.5 trillion by 2025 [4]. Sophisticated CMAs like Advanced Persistent Threats (APTs)—often fueled by zero-day exploits—threaten not only economic development but also national security and social stability [5,6]. Their targeted, stealthy, and highly professional nature makes them exceptionally difficult to detect.
Traditional network threat detection primarily relies on rule-based intrusion detection and signature matching techniques [7,8]. These conventional methods typically depend on known attack signatures or patterns, rendering them ineffective against emerging threats such as APTs, which are dominated by zero-day attacks and characterized by precise targeting, strong concealment, and high professionalism. When confronted with CMAs, traditional approaches, on the one hand, struggle to detect unknown threats through rule-based intrusion detection; on the other hand, they fail to correlate temporally and spatially sparse key attack steps, making it impossible to construct a complete attack chain view. Meanwhile, in recent years, deep learning models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) have demonstrated strong capabilities in network traffic analysis, malware detection, and abnormal behavior identification [9,10,11]. A large number of methods based on deep learning, data mining, and artificial intelligence have emerged in the field of network security detection [12,13], providing crucial technical support for building intelligent security defense systems. Recent advances in data-driven attack detection have shown promising results across critical infrastructure domains. Notably, Wang et al. proposed a data-driven framework for detecting false data injection attacks in DC microgrids using subspace identification methods, which constructs detection systems directly from process data without explicit system modeling [14]. While such approaches are effective in power system security, they exhibit fundamental limitations when applied to general network threat detection, including reliance on well-defined physical dynamics, the assumption of stable system behaviors, and a focus on single-modal sensor data rather than multi-modal information integration.
However, existing deep learning methods still have significant limitations in network threat detection. The primary issue is fragmented modal information: mainstream approaches either simplify network behaviors into time-series modeling—completely ignoring the rich entity relationship information in network topology—or focus solely on graph structure analysis, lacking an in-depth understanding of the temporal evolution of attacks. This unimodal modeling strategy leads to a substantial loss of critical semantic information and serves as a core bottleneck restricting detection accuracy [15]. Most existing data-driven approaches employ purely sequential modeling that ignores network topology, utilize graph-based methods that overlook temporal dynamics, or apply AI-based detection techniques without theoretical foundations for multi-modal information fusion. Crucially, these approaches fail to address the fundamental challenge of semantic fragmentation across different information modalities, which is a critical requirement for detecting sophisticated multi-stage attacks. Unlike domain-specific data-driven methods that operate within controlled physical environments, network threat detection requires principled mechanisms for integrating heterogeneous information sources without assuming underlying physical constraints. The second issue is insufficient feature representation capability: feature extraction and selection are crucial, yet effectively identifying features useful for anomaly detection remains challenging, and inappropriate feature extraction and selection can degrade model performance. Network attacks often exhibit complex multi-stage and multi-entity collaborative patterns, which simple feature engineering and traditional machine learning methods struggle to capture due to their high-dimensional and nonlinear nature [16]. Existing multimodal fusion methods fall into three categories: early fusion integrates information from different modalities via feature-level concatenation but tends to cause the curse of dimensionality and semantic mismatch; late fusion adopts decision-level integration, preserving modality specificity while ignoring inter-modal interactions; and intermediate fusion uses attention mechanisms for feature weighting, yet existing methods mostly rely on simple attention weight allocation and lack an in-depth understanding of the semantic gap between heterogeneous information sources [17,18]. The third issue is limited generalization capability: existing models are often optimized for specific attack types or datasets, and their performance tends to degrade when confronted with new attack variants or cross-domain deployments [3].
In response to the fundamental limitations of existing methodologies and the specific gaps identified in current data-driven approaches, this paper introduces GATransformer, which pioneers multi-modal semantic integration for network threat detection. Unlike existing data-driven methods that operate on single information modalities or require domain-specific physical constraints, our approach establishes a mathematically rigorous dual-pathway information processing framework that addresses semantic integration across heterogeneous information types without assuming underlying system dynamics. First, to address the semantic fragmentation inherent in traditional unimodal approaches, we establish a dual-pathway information processing framework that employs complementary computational channels for topological dependency extraction and temporal evolution modeling. This architecture transcends conventional processing limitations by implementing synergistic information synthesis, ensuring complete semantic preservation across multiple representational dimensions. Secondly, to overcome the representational inadequacies in capturing complex multi-stage attack semantics, we formulate an advanced attention-driven integration mechanism. Through cross-modal semantic bridging, this framework achieves deep-level feature synthesis that fully exploits structural relationship semantics while preserving temporal evolution integrity. Finally, to enhance cross-domain generalizability and adaptive resilience, we introduce a unified multi-objective optimization paradigm. Through joint optimization of hierarchical classification tasks and predictive modeling objectives, the architecture significantly improves adaptability and robustness via knowledge transfer dynamics and complementary learning mechanisms. This multi-dimensional theoretical integration fundamentally overcomes traditional methodological constraints, establishing a comprehensive solution framework for network threat detection.
The main technical contributions and innovations include the following:
(1)
Graph-Sequence Enhanced Transformer Architecture (GATransformer): On the CIDDS-001 dataset, the model achieved 99.74% accuracy in node attack detection and 88.28% accuracy in edge attack detection. On the CIDDS-002 dataset, the model achieved 99.99% accuracy in node attack detection and 99.98% accuracy in edge attack detection. Experiments show that compared to using GAT alone (edge classification accuracy of 86.77%) or Transformer alone (edge classification accuracy of 87.45%), our hybrid architecture improved edge classification accuracy to 88.28% on the CIDDS-001 dataset, verifying the complementarity of the two information carriers and the full utilization of original semantic information.
(2)
Cross-Attention Feature Fusion Mechanism: An innovative cross-attention fusion strategy was designed to deeply integrate graph features and sequence features, fundamentally resolving the problem of fragmented modal information in traditional methods.
(3)
Multi-Level Data Processing and Multi-Task Learning Framework: A preprocessing strategy combining adaptive traffic normalization and multi-dimensional feature fusion was constructed. A dynamic time window graph construction mechanism was adopted to support real-time threat detection, and a unified multi-task learning framework was designed to achieve anomaly detection at multiple levels.

2. Related Work

In the digital era, network traffic has become a core indicator for assessing the health of network operations, and the importance of anomaly detection is increasingly prominent. In the face of increasingly intelligent and complex cyberattack techniques, traditional detection methods based on rule matching and shallow machine learning can no longer meet the needs of cybersecurity defense, owing to their limited feature extraction depth and generalization capability. Deep learning, with its powerful nonlinear modeling ability and automated feature learning, breaks through these bottlenecks and has become a research hotspot in malicious traffic detection. Current research mainly focuses on innovative applications of deep learning models, including the use of recurrent neural networks (RNNs) for traffic feature extraction and anomaly pattern recognition, as well as the integration of newer techniques such as graph attention mechanisms. To reduce the dependency of deep learning models on large amounts of labeled data, strategies such as adversarial sample generation and unsupervised learning are used to enhance generalization. In addition, to improve the accuracy and efficiency of anomaly detection models, methods such as ensemble learning and hybrid model construction are being continuously explored.
Deep learning-based traffic anomaly analysis and detection mainly rely on deep models to automatically learn complex features from traffic data and identify anomalous patterns. Convolutional Neural Networks (CNNs) can effectively capture potential attack patterns in serialized network traffic [12,13], and extract key features such as packet length and inter-packet intervals from traffic data. By constructing multi-layer convolution and pooling structures, CNNs obtain high-level feature representations. Yin Zinuo et al. proposed an anomaly traffic detection framework combining CNN and LSTM networks [19], which deeply mines the spatiotemporal features of traffic data by leveraging CNN’s feature extraction and LSTM’s time series analysis. The framework achieved accuracy as high as 98.8%, with recognition rates for most attack types exceeding 99.5%. Hooshmand et al. proposed a model based on a one-dimensional CNN architecture [20], which divides network traffic data into three categories: TCP, UDP, and other protocols, and processes each category independently. This model performed well in terms of recall and F1-score when handling separate categories but performed poorly in precision. RNNs can capture the temporal dependencies in sequence data. Yuan et al. designed a recursive Deep Neural Network (DNN) capable of effectively learning DDoS attack patterns from network traffic sequences [21], thus enabling tracking of cyberattack activities. The DNN showed significantly reduced error rates on larger-scale datasets, verifying its superiority and effectiveness in learning network traffic patterns and tracing attack behaviors. Li et al. proposed a Bidirectional Independently Recurrent Neural Network (BiIndRNN) with parallel computation and adjustable gradients [22]. This model extracts bidirectional structural features of network traffic through forward and backward inputs and captures spatial correlations in data flows. In addition, graph attention mechanisms have also been applied to network traffic analysis. By constructing graph structures to characterize relationships between nodes, detection accuracy can be improved [23].
Existing deep learning models typically rely on large amounts of labeled data for training, which reduces their flexibility in real-world applications. To address this issue, some researchers have proposed integrating adversarial sample generation techniques to enhance dataset diversity and thereby improve the generalization capability of models. This strategy not only increases detection accuracy but also enhances the model’s adaptability to unknown attacks. In traffic anomaly detection, autoencoders can learn the feature representation of normal traffic and identify anomalies by comparing the reconstruction error between anomalous and normal traffic. Qu et al. proposed a network data contrastive representation method called CRND, which is based on contrastive learning and can train high-performance models without label information [24]. Regarding the lack of interpretability in deep learning models for anomaly traffic detection, the IDEAL framework integrates an interpretability supervision mechanism to make the model decision process more transparent and trustworthy [25]. This method uses security rules from the Snort firewall to automatically generate annotation information, reducing the workload of manual labeling. Semi-supervised learning methods can efficiently identify malicious traffic even with only a small amount of labeled data by jointly modeling the multimodal characteristics of traffic [26].
To further improve the accuracy and efficiency of traffic anomaly detection, researchers have considered combining different deep learning models to form hybrid approaches. Yao et al. proposed an AMI intrusion detection model based on CNN and LSTM networks [27], which integrates the global feature extraction capability of CNN and the periodic feature memory function of LSTM. In addition, decision trees and their ensemble learning methods, by incorporating information entropy theory and the Proximal Policy Optimization (PPO) algorithm, can effectively enhance detection accuracy [28]. Furthermore, random forest and Extreme Gradient Boosting (XGBoost) have also been used to construct efficient malicious traffic detection frameworks, showing excellent performance, especially when dealing with concept drift problems [29].
In certain specific scenarios, such as the Internet of Things (IoT) environments or industrial control systems, lightweight design has become a hotspot due to resource constraints of devices. The LightGuard model significantly reduces computational overhead while maintaining detection performance by introducing lightweight residual block modules [30]. It simplifies the classifier structure using global average pooling and pointwise convolution [31]. For emerging threats like zero-day attacks, existing models often perform poorly. In response, a method named CDDA-MD, which integrates concept drift detection and adaptive mechanisms, has been proposed. It can update model parameters in real time to cope with the impact of concept drift [32].
Despite the significant progress achieved, there is still considerable room for improvement in the current accuracy of traffic monitoring. On the one hand, network traffic data is high-dimensional, dynamic, and noisy. Some complex attack features are easily overwhelmed by massive normal traffic, resulting in high false positive and false negative rates. On the other hand, new types of attack techniques—such as low-rate attacks—are constantly emerging. Existing models struggle to quickly learn their feature patterns, and their cross-scenario generalization capabilities are weak, making it difficult to achieve high-precision detection in real-world complex network environments. These problems urgently need to be addressed through more efficient feature engineering and multimodal data fusion technologies.

3. Proposed Methodology

Network threat detection has long faced the fundamental challenge of fragmented modal information. Existing methods simplify network behavior into time series modeling, completely ignoring the entity relationships embedded in network topology. Others analyze entity relationships only from the perspective of graph structure, lacking a deep understanding of the temporal evolution of attacks. This unimodal modeling leads to significant loss of important semantic information and is the core bottleneck preventing further improvement in threat detection accuracy.
This paper proposes a neural network architecture named GATransformer, as shown in Figure 1. The architecture adopts a dual-stream parallel processing strategy that fully leverages the complementary advantages of graph neural networks and sequence modeling. The system first performs user behavior encoding and temporal feature encoding on the CIDDS dataset through a multi-layer data preprocessing module and then inputs the results into the GAT branch and the Transformer branch for feature extraction, respectively. The GAT branch, through components such as high-dimensional projection, graph attention mechanism, layer normalization, and residual connections, effectively captures spatial dependencies in the network topology. The Transformer branch utilizes modules including positional encoding, masked multi-head attention, and feedforward networks to deeply mine long-term dependency patterns in the time series. The features extracted by the two branches are deeply fused through a cross-attention mechanism and are finally fed into a multi-task learning head to simultaneously perform multiple anomaly detection tasks such as node classification and edge classification. The core advantage of this architecture lies in its ability to handle both graph-structured and sequential data simultaneously, enabling the model to understand topological relationships between nodes while also capturing temporal dependencies in the sequence.

3.1. Multi-Level Data Preprocessing and Feature Engineering Strategy

3.1.1. Adaptive Traffic Normalization and Encoding Mechanism

Network security traffic data is characterized by diverse formats and noise interference. To address this issue, this study designed a hierarchical traffic standardization framework. First, through in-depth analysis of network behavior traffic characteristics, a semantic encoding strategy centered on user behavior patterns was constructed. This method extracts traffic features from multiple dimensions, including IP addresses, transport protocols, and port numbers, converting heterogeneous traffic data into structured vector representations. Original labels are numerically encoded, converting string labels such as "normal", "attacker", and "victim" into integer representations. Protocol types are also mapped to corresponding numerical values to ensure that subsequent machine learning models can effectively process classification information. To address the common issue of irregular numerical formats in network traffic data, an adaptive numerical parsing algorithm was designed. This algorithm identifies and converts numerical strings with units, such as "10K" and "1.5M", using regular expression matching and a unit mapping dictionary to achieve accurate numerical restoration.
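As an illustration of this parsing step, the following minimal Python sketch converts unit-suffixed strings into numeric values; the regular expression and the unit map are illustrative assumptions, since the paper does not list its full dictionary.

import re

# Assumed unit mapping; the paper's dictionary may contain further suffixes.
UNIT = {"K": 1e3, "M": 1e6, "G": 1e9}

def parse_magnitude(s: str) -> float:
    # Match strings such as "10K" or "1.5M": a decimal value plus an optional unit.
    m = re.fullmatch(r"\s*([\d.]+)\s*([KMGkmg]?)\s*", str(s))
    if not m:
        raise ValueError(f"unparseable value: {s!r}")
    value, unit = m.groups()
    return float(value) * UNIT.get(unit.upper(), 1.0)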
In terms of time feature encoding, statistical analysis of historical traffic data reveals that scanning-type attacks conducted by attackers usually have short durations and high operational intensity, whereas normal business activities have longer time spans and relatively even distribution. Based on this pattern, a multi-granularity time feature encoding scheme is designed. The time feature encoding function is defined as:
$$T_{\mathrm{encode}}(t) = \begin{cases} 0, & \text{if } t \le \tau_1, \\ 1, & \text{if } \tau_1 < t \le \tau_2, \\ 2, & \text{if } \tau_2 < t \le \tau_3, \\ 3, & \text{if } t > \tau_3. \end{cases}$$
where $\tau_1$, $\tau_2$, and $\tau_3$ respectively represent the thresholds for short, medium, and long time intervals. The continuous time intervals are discretized into four levels, effectively preserving the relative relationships of temporal information and facilitating the subsequent sequence modeling process.
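A minimal sketch of this discretization is shown below; the concrete threshold values are illustrative assumptions, as the paper does not report them.

def t_encode(duration: float,
             tau1: float = 1.0, tau2: float = 10.0, tau3: float = 60.0) -> int:
    # Map a continuous flow duration (seconds) to one of four ordinal levels,
    # mirroring the piecewise definition of T_encode above.
    if duration <= tau1:
        return 0
    if duration <= tau2:
        return 1
    if duration <= tau3:
        return 2
    return 3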

3.1.2. Multi-Dimensional Feature Fusion and Anomaly Pattern Recognition

Operation status codes, as important indicators reflecting system execution results, contain rich security-related semantic information. Statistical analysis of the distribution of status codes across many security events shows that the success rate of normal user behavior is generally high, whereas attack behaviors—being blocked by security protection mechanisms—often produce a higher frequency of failure status codes. Based on this pattern, a status feature vector $S = [\,s_{\mathrm{success}}, s_{\mathrm{failure}}, s_{\mathrm{error}}\,]^{T}$ is constructed, where each component represents the normalized frequency of successful execution states, operation failure states, and critical system error states, respectively. Differentiating by status codes enables effective capture of abnormal patterns in the operation execution process. Meanwhile, for the textual description fields of attack behaviors, a pattern-matching-based attack tool identification strategy is designed. By constructing an attack tool keyword dictionary, when the system detects that a specific tool identifier is present in the attack description, it further extracts the tool's detailed configuration information, providing important auxiliary input for threat identification.
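A sketch of how the status feature vector $S$ could be computed is given below; the mapping of raw status codes to the three buckets is a hypothetical assumption for illustration, not the paper's rule.

from collections import Counter

def status_features(status_codes: list) -> list:
    # Count success / failure / error outcomes; the prefix-based bucketing is
    # an assumed convention (e.g., HTTP-style 2xx/4xx/5xx).
    buckets = Counter()
    for code in status_codes:
        code = str(code)
        if code.startswith("2"):
            buckets["success"] += 1
        elif code.startswith("4"):
            buckets["failure"] += 1
        else:
            buckets["error"] += 1
    total = max(sum(buckets.values()), 1)
    # Normalized frequencies: S = [s_success, s_failure, s_error]^T
    return [buckets["success"] / total,
            buckets["failure"] / total,
            buckets["error"] / total]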

3.2. Construction of Network Traffic Graph Structure and Topology Modeling

3.2.1. IP Entity Node Feature Extraction and Representation Learning

In this paper, IP address entities in the network environment are treated as nodes in a graph, with each IP node carrying the comprehensive network behavior features of the corresponding entity. The node feature extraction process is divided into two stages: IP address feature encoding and traffic statistical feature construction. The IP address feature encoding parses an IPv4 address into four octets and extracts features such as network type, subnet characteristics, and address pattern semantics—including identification of private networks, loopback address detection, multicast address recognition, and other semantic information. The IP address feature vector $f_{\mathrm{IP}}$ is defined as:
$$f_{\mathrm{IP}} = [\,o, n, s, p\,]$$
where $o = [o_1, o_2, o_3, o_4]$ represents the four octets, $n$ denotes the network type feature vector, $s$ denotes the subnet feature vector, and $p$ denotes the address pattern feature vector. Traffic statistical feature construction is based on the aggregation of all network traffic involving the IP, including multi-dimensional metrics such as total packet count, total byte count, number of flows, average duration, in-degree and out-degree statistics, and protocol distribution ratios.
For each IP node, the system separately counts its traffic characteristics as both source and destination addresses, including the number of outbound and inbound packets, byte statistics, and protocol distribution. To capture abnormal behavior patterns, the number of unique destination ports and source ports accessed by the IP is further extracted. These metrics help in identifying attack behaviors such as port scanning and service enumeration. The comprehensive node feature vector $h_v^{(0)}$ is constructed as follows:
$$h_v^{(0)} = \mathrm{Concat}\,[\,f_{\mathrm{IP}}(v),\ f_{\mathrm{traffic}}(v),\ f_{\mathrm{behavior}}(v)\,]$$
where $f_{\mathrm{traffic}}(v)$ represents the traffic statistical features of node $v$, and $f_{\mathrm{behavior}}(v)$ denotes behavior anomaly indicators.
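The octet parsing and semantic flags can be sketched with Python's standard ipaddress module as follows; the exact flag set and normalization choices are assumptions, since the paper does not enumerate them.

import ipaddress

def ip_features(ip_str: str) -> list:
    # Sketch of f_IP = [o, n, s, p] for an IPv4 address.
    ip = ipaddress.IPv4Address(ip_str)
    o = [b / 255.0 for b in ip.packed]            # four octets, normalized
    n = [float(ip.is_private),                     # network type semantics
         float(ip.is_loopback),
         float(ip.is_multicast)]
    s = [int(ip) % 256 / 255.0]                    # crude subnet cue (assumed)
    p = [float(ip.is_global)]                      # address pattern cue (assumed)
    return o + n + s + p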

3.2.2. Edge Relationship Modeling and Spatiotemporal Feature Construction

Edges in the graph represent network traffic connections, where each network flow corresponds to a directed edge connecting the source IP node and the destination IP node. Edge feature extraction covers four dimensions: basic traffic features, protocol features, port features, and temporal features. Basic traffic features include core indicators such as flow duration, number of packets, number of bytes, number of flows, and service type. Protocol features are represented using one-hot encoding for different protocol types such as TCP, UDP, and ICMP. Port features are encoded separately for source and destination ports, including port type classification (well-known ports, registered ports, dynamic ports), identification of common service ports, and normalization of port values. Temporal features record the timestamp information of traffic occurrences, providing support for subsequent time series analysis. Through this multi-dimensional edge feature representation, the system can comprehensively characterize the patterns of network traffic and provide rich information for anomaly detection.
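For the port dimension, a minimal encoding sketch following the classification described above might look like this; the set of common service ports is an illustrative assumption.

def port_features(port: int) -> list:
    # Port type classification per IANA ranges: well-known / registered / dynamic.
    well_known = float(port < 1024)
    registered = float(1024 <= port < 49152)
    dynamic = float(port >= 49152)
    # Assumed example set of common service ports (SSH, DNS, HTTP, HTTPS, RDP).
    common_services = {22, 53, 80, 443, 3389}
    return [well_known, registered, dynamic,
            float(port in common_services),
            port / 65535.0]                        # normalized port value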

3.2.3. Dynamic Time Window Graph Construction Mechanism

To support real-time threat detection and large-scale data processing, an efficient dynamic graph construction mechanism is designed. This mechanism adopts a sliding window strategy to determine whether each time window contains an attack, while maintaining the timeliness of the graph through incremental updates. During the dynamic time graph construction process, start and end timestamps are set for each segment of network traffic. A fixed-size sliding window mechanism is used to construct a small graph from the flows within each window segment. This facilitates the dynamic modeling of attack behaviors as they evolve over time. The sliding window function is defined as:
$$W_t = \{\, f_i \mid t - \Delta T \le \mathrm{timestamp}(f_i) < t \,\}$$
where $W_t$ represents the window at time $t$, $\Delta T$ is the window size, $f_i$ is the $i$-th network flow, and $\mathrm{timestamp}(f_i)$ is the timestamp of the flow. Based on the previously extracted multi-dimensional features, further processing is conducted to construct a complete graph. Specifically, all IP addresses are used to build node features, each flow is traversed to construct edge features, and feature standardization is performed accordingly.

3.3. GATransformer Architecture Design

3.3.1. Parallel Dual-Channel Fusion Architecture Design

The GATransformer architecture adopts a parallel dual-channel design concept, consisting of three core components: the graph structure encoding channel, the sequence encoding channel, and the feature fusion module. The graph structure encoding channel focuses on extracting structural features from the network topology; the sequence encoding channel is dedicated to handling the temporal dependencies of activity sequences; and the feature fusion module integrates the outputs of the two channels to generate a comprehensive threat feature representation. In the sequence encoding channel, nodes are first sorted by degree to construct a node sequence, and a positional encoding mechanism is applied to preserve the sequential information. The design of positional encoding is particularly important in network security applications, as attack behaviors often exhibit distinct stages and order. With positional encoding, the model can learn the characteristic patterns of different stages in attack sequences—such as reconnaissance, infiltration, and persistence—thereby significantly improving threat detection accuracy. The sequence encoding channel adopts a multi-layer Transformer structure, modeling dependencies between any positions in the sequence through the self-attention mechanism. The positional encoding uses sine and cosine functions, defined as:
$$PE_{(pos,\,2i)} = \sin\!\left( \frac{pos}{10000^{2i/d_{\mathrm{model}}}} \right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left( \frac{pos}{10000^{2i/d_{\mathrm{model}}}} \right)$$
where $pos$ represents the position, $i$ denotes the feature dimension, and $d_{\mathrm{model}}$ is the model dimension. The multi-head attention mechanism enables the model to learn sequence dependencies from different representational subspaces.
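The sinusoidal encoding of Formulas (5) and (6) can be computed directly in PyTorch, as in the following sketch (assuming an even d_model):

import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # pos: (max_len, 1); i runs over even feature dimensions.
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    div = torch.pow(10000.0, i / d_model)          # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)             # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / div)             # PE(pos, 2i+1)
    return pe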

3.3.2. Graph Attention Network Enhancement Mechanism

The graph structure encoding channel adopts the Graph Attention Network (GAT) mechanism, which can adaptively learn the importance weights between nodes in the graph. GAT calculates node importance through attention coefficients:
$$h_i^{(l+1)} = \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)} \right)$$
where $h_i^{(l)}$ represents the feature of node $i$ at layer $l$, $\alpha_{ij}^{(l)}$ is the attention weight, $\mathcal{N}(i)$ denotes the neighborhood of node $i$, and $W^{(l)}$ is the learnable weight matrix. The attention coefficients are computed based on the correlation between node features:
$$\alpha_{ij}^{(l)} = \frac{\exp\!\left( \mathrm{LeakyReLU}\!\left( a^{T} \left[ W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_j^{(l)} \right] \right) \right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\!\left( \mathrm{LeakyReLU}\!\left( a^{T} \left[ W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_k^{(l)} \right] \right) \right)}$$
where $a$ is a learnable attention parameter vector, and $\|$ denotes the feature concatenation operation. Compared with traditional Graph Convolutional Networks (GCNs), the attention mechanism enables the model to dynamically focus on the most relevant neighboring nodes, enhancing its representation capacity and generalization performance.
The original 32-dimensional node features are projected into a unified 256-dimensional hidden space through feature projection, and the original 20-dimensional edge features are likewise projected into the same 256-dimensional hidden space. A multi-layer GAT is then applied. Each additional layer aggregates information along paths between a node and progressively more distant neighbors, allowing the model to perceive increasingly remote nodes and thus capture more complex structural dependencies within the graph. This enables the detection of more sophisticated propagation patterns, attack paths, and behavioral contexts. Each layer's output is stabilized with LayerNorm, and residual connections are used to alleviate gradient vanishing and degradation problems in deep models.
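A minimal PyTorch Geometric sketch of this graph channel is given below; the head count and dropout rate are assumptions, and the handling of the projected 20-dimensional edge features (e.g., via GATConv's edge_dim argument) is omitted for brevity.

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GATBranch(nn.Module):
    def __init__(self, in_dim: int = 32, hidden: int = 256, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)          # 32-d -> 256-d projection
        self.gat = GATConv(hidden, hidden // heads, heads=heads, dropout=0.1)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                               # (N, 256)
        # Residual connection around the GAT layer, stabilized by LayerNorm.
        return self.norm(h + self.gat(h, edge_index))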

3.3.3. Cross-Attention Feature Fusion Strategy

The effective fusion of sequence features and graph features is a key technical challenge in the graph-sequence enhanced architecture. This paper designs a cross-attention feature fusion strategy to deeply integrate sequence and graph features. The cross-attention mechanism is defined as:
$$\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V$$
where Q comes from the graph structure features output by GAT, and K and V come from the sequence features encoded by the Transformer. Through this mechanism, GAT and Transformer process the same set of node features in parallel, extracting structural information and temporal sequence information, respectively. To address the problem that GAT may overlook structural information from distant nodes, and to better attend to node information in the Transformer, the cross-attention mechanism integrates features through the interaction of the query and key-value matrices. The Transformer features obtained through cross-attention are concatenated with the structural features extracted by GAT to form a fused high-dimensional feature vector. This fused vector simultaneously captures neighborhood structural information and global semantic sequence behavior of the node. To avoid the high computational cost associated with high-dimensional feature vectors, a two-layer MLP is used for feature transformation and compression. A nonlinear activation function (ReLU) is applied to achieve nonlinear fusion, learning complex nonlinear combinations among features, and finally mapping them into the predictive representation space.
The proposed cross-attention mechanism facilitates heterogeneous feature integration through a multi-head attention architecture, wherein graph-structured representations serve as queries and sequential embeddings function as both keys and values. The node sequence construction employs a topology-aware ranking strategy based on degree centrality metrics, ensuring that nodes with higher connectivity degrees maintain precedence in the sequential ordering. The feature integration pipeline executes concatenation-based fusion of graph embeddings with cross-modal attention outputs, subsequently processed through a dual-stage multilayer perceptron transformation:
$$H_{\mathrm{fused}} = \mathrm{MLP}\big( \big[\, H_{\mathrm{graph}} \,;\ \mathrm{CrossAttn}(H_{\mathrm{graph}}, H_{\mathrm{seq}}, H_{\mathrm{seq}}) \,\big] \big)$$
where the MLP architecture implements a bottleneck design with intermediate dimensionality reduction: $\mathrm{Linear}(2d \to d) \to \mathrm{ReLU} \to \mathrm{Dropout}(0.1) \to \mathrm{Linear}(d \to d)$, with $d$ denoting the latent feature dimensionality. This architectural design enables the extraction of complementary structural dependencies from the GAT component while preserving temporal sequential patterns from the Transformer encoder.
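A compact PyTorch sketch of this fusion step, built on nn.MultiheadAttention with graph features as queries and sequence features as keys and values, follows; batching and masking details are assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        # Bottleneck MLP: Linear(2d -> d) -> ReLU -> Dropout(0.1) -> Linear(d -> d)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(), nn.Dropout(0.1), nn.Linear(d, d))

    def forward(self, h_graph: torch.Tensor, h_seq: torch.Tensor) -> torch.Tensor:
        # h_graph, h_seq: (batch, N, d); graph features act as queries.
        h_cross, _ = self.attn(query=h_graph, key=h_seq, value=h_seq)
        # Concatenate GAT output with cross-attended sequence features, then fuse.
        return self.mlp(torch.cat([h_graph, h_cross], dim=-1))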
Algorithm 1 implements our new cross-attention fusion mechanism that systematically integrates heterogeneous graph structural information with temporal sequence patterns. Unlike traditional concatenation-based fusion approaches, our cross-attention mechanism allows the model to selectively focus on the most relevant structural features when processing temporal dependencies, thereby preserving both local topological relationships and global behavioral patterns.
Algorithm 1 Cross-Attention Feature Fusion
Require: Graph features H_G ∈ R^{N×d}, sequence features H_T ∈ R^{N×d}
Require: Multi-head attention parameters W_Q, W_K, W_V
Require: Fusion projection layers MLP_fusion
Ensure: Fused node representations H_fused ∈ R^{N×d}
1:  procedure CROSS_ATTENTION_FUSION(H_G, H_T)
2:      // Prepare queries, keys, and values for cross-attention
3:      Q_graph ← H_G W_Q   {graph features as queries}
4:      K_seq ← H_T W_K   {sequence features as keys}
5:      V_seq ← H_T W_V   {sequence features as values}
6:      // Multi-head attention computation
7:      for h = 1 to num_heads do
8:          Q_h ← Q_graph[:, (h−1)·d_h : h·d_h]
9:          K_h ← K_seq[:, (h−1)·d_h : h·d_h]
10:         V_h ← V_seq[:, (h−1)·d_h : h·d_h]
11:         // Attention score computation
12:         A_h ← softmax(Q_h K_h^T / √d_h)
13:         // Weighted value aggregation
14:         O_h ← A_h V_h
15:     end for
16:     // Concatenate multi-head outputs
17:     H_cross ← concat(O_1, O_2, …, O_H) W_O
18:     // Feature fusion through MLP
19:     H_concat ← concat(H_G, H_cross)   {N × 2d}
20:     H_fused ← MLP_fusion(H_concat)   {N × d}
21:     return H_fused
22: end procedure
Algorithm 2 presents our dynamic time window graph construction strategy, which enables real-time threat detection while maintaining computational efficiency. The sliding window approach ensures temporal locality preservation, allowing the system to capture evolving attack patterns across different time scales. The multi-level feature extraction process (lines 23–27 and 30–36) creates comprehensive node and edge representations that incorporate both network topology characteristics and traffic statistical properties, providing rich semantic information for subsequent anomaly detection tasks.
Algorithm 2 Dynamic Time Window Graph Construction
Require: Network flows F = {f_1, f_2, …, f_M}, time window size W
Require: Feature extractors φ_node, φ_edge
Ensure: Temporal graph sequence G = {G_1, G_2, …, G_T}
1:  procedure BUILD_TEMPORAL_GRAPHS(F, W)
2:      G ← ∅   {initialize graph sequence}
3:      t_min ← min{timestamp(f_i) | f_i ∈ F}
4:      t_max ← max{timestamp(f_i) | f_i ∈ F}
5:      t_current ← t_min
6:      window_id ← 1
7:      while t_current < t_max do
8:          t_end ← t_current + W
9:          F_window ← {f_i ∈ F | t_current ≤ timestamp(f_i) < t_end}
10:         if F_window ≠ ∅ then
11:             V ← ∪{src_ip(f), dst_ip(f) | f ∈ F_window}
12:             ip_to_id ← CREATE_MAPPING(V)
13:             X ← []
14:             for each ip ∈ V do
15:                 flows_ip ← {f ∈ F_window | src_ip(f) = ip ∨ dst_ip(f) = ip}
16:                 x_ip ← φ_node(ip, flows_ip)   {extract IP and traffic features}
17:                 X ← X ∪ {x_ip}
18:             end for
19:             E ← [], A ← []
20:             for each f ∈ F_window do
21:                 src_id ← ip_to_id[src_ip(f)]
22:                 dst_id ← ip_to_id[dst_ip(f)]
23:                 E ← E ∪ {(src_id, dst_id)}
24:                 a_f ← φ_edge(f)   {extract flow features}
25:                 A ← A ∪ {a_f}
26:             end for
27:             Y_nodes ← COMPUTE_NODE_LABELS(V, F_window)
28:             Y_edges ← COMPUTE_EDGE_LABELS(F_window)
29:             y_graph ← any(Y_edges)   {graph-level anomaly}
30:             G_t ← (X, E, A, Y_nodes, Y_edges, y_graph, window_id)
31:             G ← G ∪ {G_t}
32:         end if
33:         t_current ← t_end
34:         window_id ← window_id + 1
35:     end while
36:     return G
37: end procedure
The algorithmic complexity analysis reveals that our approach scales efficiently: Algorithm 1 has $O(N^2 \cdot d)$ complexity for the cross-attention computation, while Algorithm 2 operates in $O(M \cdot W)$ time, where $M$ is the number of flows and $W$ is the window size. This computational efficiency, combined with the incremental graph construction mechanism, makes our approach suitable for real-time network monitoring scenarios.

3.4. Multi-Task Learning Framework and Loss Function Optimization

The fused features in the GATransformer architecture simultaneously support three prediction tasks: node classification, edge classification, and sequence prediction, achieving unified multi-level anomaly detection. Node classification identifies abnormal hosts and devices in the network. Edge classification detects abnormal network traffic. Sequence prediction forecasts the network behavior pattern at the next step. Each task is equipped with a dedicated prediction head, and all tasks share the underlying feature representations to enable knowledge transfer and complementary learning. The objective function for multi-task learning is defined as:
$$L_{\mathrm{total}} = \sum_{i=1}^{3} \lambda_i L_i$$
where $L_i$ represents the loss function of the $i$-th task, and $\lambda_i$ is the corresponding weight coefficient. For classification tasks, cross-entropy loss is used:
$$L_{\mathrm{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(p_{ij})$$
where $N$ is the number of samples, $C$ is the number of classes, $y_{ij}$ is the true label, and $p_{ij}$ is the predicted probability. For sequence prediction tasks, the mean squared error loss is used:
$$L_{\mathrm{seq}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert y_i - \hat{y}_i \right\rVert^2$$
Additionally, the weight allocation strategy implements hierarchical task prioritization, with primary classification objectives receiving unit weights (1.0) and sequence prediction assigned a reduced weight (0.01) for regularization. The sequence loss incorporates stabilization mechanisms, as shown in Formula (14), resulting in the composite objective in Formula (15).
The multi-task learning paradigm employs a hierarchical task prioritization framework predicated on domain-specific importance weighting and gradient stabilization constraints. Primary classification objectives (node-level and edge-level anomaly detection) receive unit weighting coefficients (1.0), reflecting their direct correspondence to threat detection efficacy. Graph-level classification maintains equivalent weighting to enforce global topological constraints. The auxiliary sequence prediction task operates under substantially reduced weighting (0.01) to provide regularization while mitigating gradient instability phenomena. The sequence prediction loss incorporates dual-stage numerical stabilization through hyperbolic tangent activation bounds and clipping operations:
$$L_{\mathrm{seq}} = \mathrm{MSE}\big( \tanh(\hat{y}_{\mathrm{seq}}),\ \mathrm{clamp}(y_{\mathrm{seq}}, -1, 1) \big) \times 0.1$$
The composite objective function integrates all task-specific losses through weighted summation:
$$L_{\mathrm{total}} = 1.0\, L_{\mathrm{node}} + 1.0\, L_{\mathrm{edge}} + 0.01\, L_{\mathrm{seq}}$$
This formulation ensures optimal convergence toward primary detection objectives while maintaining numerical stability throughout the training process via auxiliary task regularization.
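The composite objective of Formulas (14) and (15) maps directly to a few lines of PyTorch, as sketched below; tensor shapes follow the usual cross-entropy conventions.

import torch
import torch.nn.functional as F

def total_loss(node_logits, node_y, edge_logits, edge_y, seq_pred, seq_y):
    # Primary classification objectives with unit weights.
    l_node = F.cross_entropy(node_logits, node_y)
    l_edge = F.cross_entropy(edge_logits, edge_y)
    # Stabilized sequence loss (Formula (14)): tanh-bounded predictions vs.
    # clamped targets, scaled by 0.1.
    l_seq = F.mse_loss(torch.tanh(seq_pred),
                       torch.clamp(seq_y, -1.0, 1.0)) * 0.1
    # Composite objective (Formula (15)).
    return 1.0 * l_node + 1.0 * l_edge + 0.01 * l_seq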

3.5. Vulnerable Asset Correlation and Threat Attribution Mechanism

Building upon abnormal traffic detection, this paper further establishes a correlation mechanism between active/passive traffic and vulnerable assets. By analyzing the source and destination addresses of abnormal traffic, port scanning patterns, and attack payload characteristics, combined with network topology information and an asset fingerprint database, the system can accurately locate network assets with security vulnerabilities. Active traffic analysis focuses on abnormal outbound behavior originating from the internal network, which often indicates that a compromised host is engaged in data exfiltration or command-and-control communication. Passive traffic analysis targets scanning and penetration attempts from external attackers. By examining the target ports, service versions, and vulnerability exploitation characteristics in the attack payload, the system can identify vulnerable assets in the network that match the exploited weaknesses. This correlation mechanism not only establishes a complete analytical chain from network traffic anomaly detection to precise vulnerable asset localization but also provides accurate guidance for subsequent vulnerability remediation and security reinforcement, significantly enhancing the relevance and effectiveness of cybersecurity defense.

4. Experiment and Evaluation

4.1. Experiment Setup

We implement the GATransformer model using PyTorch 2.6+, PyTorch Geometric 2.6.1, and Python 3.9.23.
Experiments are carried out on two versions of the CIDDS benchmark dataset, CIDDS-001 and CIDDS-002, which synthetically generate realistic NetFlow traffic, including advanced threats such as APTs, within an emulated enterprise environment [33]. The dataset contains 14 core features covering temporal information, network identifiers, traffic statistics, and protocol specifications. Moreover, our experiments utilized the UNSW-NB15 dataset. This dataset, created by the University of New South Wales, is widely used for network intrusion detection research and contains both normal network activities and various types of simulated attack behaviors. The dataset comprises 2,540,044 records, integrating real normal traffic with synthetic attack behaviors to cover complex network scenarios and emerging threat patterns. The dataset is divided into normal traffic (87%) and attack traffic (13%). Normal traffic simulates routine activities such as web browsing and file transfers, while attack traffic encompasses nine attack families: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. These correspond to threat behaviors such as fuzz testing, port scanning, backdoor access, resource exhaustion, and vulnerability exploitation [34,35,36,37,38].
We compare GATransformer against baseline models including the Graph Attention Network, the Transformer, and traditional machine learning methods. Performance is evaluated using accuracy and precision metrics for the classification tasks, with data split into 70% training, 20% validation, and 10% testing sets.

4.2. Experiment Results

Table 1 presents the feature descriptions of the CIDDS dataset. The dataset encompasses 14 core features, detailed in terms of feature names, data types, descriptions, and illustrative example values. These features collectively characterize network traffic data: temporal information (Date first seen), traffic duration (Duration), protocol specification (Proto), source and destination network identifiers (Src IP Addr, Dst IP Addr) and port numbers (Src Pt, Dst Pt), packet-related metrics (Packets, Bytes), flow-level statistics (Flows), TCP-specific flag information (Flags), service type classification (Tos), and labels for traffic nature (label) and attack categorization (attackType), providing a comprehensive basis for network traffic analysis and security research.
Hyperparameter selection is based on extensive grid search and empirical validation. As shown in Table 2, we detail several key hyperparameters.
To validate the effectiveness of different models, we separately evaluated the performance of the Graph Attention Network (GAT), Transformer, and GATransformer models on the network traffic anomaly detection task. As shown in Table 3, on the CIDDS-002 dataset, the Transformer model achieved 99.7% accuracy on the node classification task and 99.1% accuracy on the edge classification task, demonstrating strong sequential modeling capabilities. The GAT model achieved 99.8% accuracy on node classification but a relatively lower 99.0% on edge classification, indicating limitations in handling complex network traffic relationships through pure graph structure modeling. The GATransformer model achieved 99.9% accuracy on both node and edge classification tasks, verifying the effectiveness of fusing graph structural information with sequential information. On the CIDDS-001 dataset, which features more diverse and complex attack types, the GAT model achieved an accuracy of 99.4% on the node classification task but only 86.7% on the edge classification task, again indicating its limitations in handling complex network topological relationships. The Transformer model achieved a node accuracy of 99.5% and an edge accuracy of 87.4% on this dataset, demonstrating its strong sequence modeling capabilities. The GATransformer hybrid model achieved a node accuracy of 99.7% and an edge accuracy of 88.2% on CIDDS-001, showing significant improvements over the single models and further validating the effectiveness of integrating graph structural information with sequence information. On the UNSW-NB15 dataset, CNN+LSTM achieved a node accuracy of 99.36%, the Extra Trees ensemble classifier reached 98.19%, and the DT, LR, NB, ANN, and EM clustering methods reached 85.56% [39]. Additionally, the GAT model achieved a node accuracy of 99.735% and an edge accuracy of 96.012%, the Transformer model a node accuracy of 99.735% and an edge accuracy of 96.009%, and the GATransformer (ours) a node accuracy of 99.743% and an edge accuracy of 97.414%, indicating that integrating graph and sequence information enhances performance even on this dataset with diverse attack types. These experiments confirm the rationality of the hybrid model design: the Transformer component plays a key role in capturing temporal dependencies, while the GAT component helps model the topological structure of the network.
As shown in Figure 2, the confusion matrix results of GATransformer on the CIDDS-001 dataset demonstrate that the model achieves 99.74% accuracy in node classification tasks and 88.28% accuracy in edge classification tasks, showcasing outstanding multi-level anomaly detection capabilities. In contrast, Figure 3 highlights the performance limitations of unimodal approaches: the pure GAT model achieves only 86.77% accuracy in edge classification, while the pure Transformer model attains 87.45% accuracy for edge classification.

4.3. Discussion

To optimize the performance of the GATransformer model, we conducted a sensitivity analysis of key hyperparameters on the CIDDS-002 dataset, including the number of GAT layers, the number of Transformer layers, and the learning rate. As shown in Table 4, we systematically evaluated how these parameters influence the model's performance. We employed a systematic hyperparameter optimization methodology incorporating grid search, random search, and Bayesian optimization. The hyperparameter search space was designed based on domain-specific characteristics of network threat detection tasks and empirical findings from related research. Key parameters included GAT layers {1, 3, 4}, Transformer layers {1, 4, 6}, and learning rates {$5 \times 10^{-5}$, $1 \times 10^{-4}$, $5 \times 10^{-4}$}. The optimization process consisted of three sequential stages: (1) coarse-grained grid search for initial parameter space exploration using 20-epoch rapid evaluation with validation accuracy as the primary metric; (2) fine-grained random sampling with 200 iterations within promising regions, employing early stopping over 50 epochs with patience = 10; and (3) Bayesian optimization using the Tree-structured Parzen Estimator (TPE) algorithm for 50 iterations with complete 100-epoch training cycles. The experimental results show the following. First, increasing the number of GAT layers exhibits diminishing returns: the single-layer GAT configuration achieved the highest detection accuracies of 99.9% and 99.5% on the node and edge classification tasks, respectively, while performance dropped with a deeper 4-layer GAT to 98.9% and 99.5%, respectively. This phenomenon is attributed to the over-smoothing problem in deep graph neural networks. Second, increasing the number of Transformer layers significantly improved the model's sequential modeling ability; a 6-layer Transformer configuration achieved the best performance, validating the effectiveness of deeper attention mechanisms in capturing complex temporal dependencies. Lastly, performance was stable for learning rates in the range of $1 \times 10^{-4}$ to $5 \times 10^{-4}$, while an overly small learning rate led to under-convergence and noticeable performance degradation. Based on the above analysis, we identify the optimal hyperparameter configuration as one GAT layer, six Transformer layers, and a learning rate of $1 \times 10^{-4}$. This setup ensures the best performance on network anomaly detection tasks while keeping model complexity under control.
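The TPE stage of this search could be reproduced with a library such as Optuna, as in the hedged sketch below; the paper does not name its optimization tooling, and train_and_validate is a placeholder for the actual 100-epoch training loop.

import optuna

def train_and_validate(params: dict) -> float:
    # Placeholder: substitute the real GATransformer training run here and
    # return the validation accuracy.
    return 0.0

def objective(trial: optuna.Trial) -> float:
    params = {
        "gat_layers": trial.suggest_categorical("gat_layers", [1, 3, 4]),
        "transformer_layers": trial.suggest_categorical("transformer_layers", [1, 4, 6]),
        "lr": trial.suggest_categorical("lr", [5e-5, 1e-4, 5e-4]),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)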
Figure 4 shows the performance of the GATransformer model on node and edge classification tasks for network traffic data. On CIDDS-002, the confusion matrix results indicate that the proposed GATransformer hybrid model performs outstandingly on network anomaly detection tasks. In the node classification task, the model achieved an overall accuracy of 99.99%: the precision for detecting attack nodes reached 100% and the recall was 99.13%, with only one attack node misclassified as normal, indicating a strong capability for node-level anomaly identification. In the edge classification task, the model also achieved 99.99% accuracy, with a precision of 99.985% for detecting attack traffic and a recall of 100%, completely avoiding false negatives in attack traffic detection. Only 82 normal traffic samples were falsely identified as attacks, demonstrating excellent traffic-level anomaly detection performance. These results validate the effectiveness of the hybrid architecture in integrating graph structure information and sequential features: the GAT module successfully captured anomalous patterns in the network topology, while the Transformer component effectively modeled the temporal dependencies of traffic. Their synergy enabled precise identification of complex network attack behaviors.
Figure 5 illustrates the total loss trend of the GATransformer model over 50 training epochs. The training curve shows good convergence characteristics: the training loss rapidly decreases from an initial value of 1.8, while the validation loss simultaneously decreases from 0.9. Both curves fall steeply within the first 10 epochs, then gradually stabilize and converge to values close to zero. Moreover, the training and validation loss curves almost completely overlap and exhibit consistent trends, indicating that the model does not overfit during training and possesses strong generalization capability. After the 10th epoch, the loss values remain stable below 0.05, suggesting that the model has sufficiently learned the underlying patterns in the data and achieved effective multi-task joint optimization. This fast and stable convergence confirms the soundness of the proposed hybrid architecture design and the effectiveness of the training strategy.

4.4. Model Architecture Evaluation

Considering the performance constraints and real-time requirements faced by network threat detection systems in practical deployment environments, this study conducted an in-depth theoretical analysis and experimental validation of the computational complexity of GATransformer. The GATransformer model contains 5,815,334 trainable parameters and has a model size of 22.18 MB. We used an NVIDIA GeForce RTX 4060 Laptop GPU for the model performance evaluation experiments.
From an algorithmic complexity perspective, the overall time complexity of GATransformer is primarily determined by three core components: spatial dependency modeling in the graph attention network, sequence dependency capture in the Transformer, and the feature fusion mechanism in cross-attention. For an input graph with $N$ nodes and $E$ edges, and a hidden layer dimension of $d$, the time complexity of the GAT component is $O(E \cdot d + N \cdot d^2)$. The $E \cdot d$ term stems from the linear transformation of edge features, while the $N \cdot d^2$ term arises from computing attention weights between nodes. The Transformer sequence encoding component employs an $L$-layer encoder structure with a time complexity of $O(L \cdot N^2 \cdot d)$, where the primary computational overhead stems from the fully connected computations within the self-attention mechanism. The complexity of the feature fusion and multi-task prediction components is $O(N \cdot d^2)$, encompassing the computational costs of cross-attention fusion and classification prediction. Synthesizing the above analysis, the overall time complexity of GATransformer can be expressed as $T(N, E) = O(E \cdot d + L \cdot N^2 \cdot d + N \cdot d^2)$.
From the perspective of spatial complexity, the model's memory usage primarily consists of parameter storage and intermediate activations. The spatial complexity of parameter storage is $O(d^2)$, mainly stemming from the parameter scale of the linear transformation matrices in each layer. The memory complexity of intermediate activations is $O(N \cdot d + N^2 \cdot h)$, where $h$ denotes the number of attention heads. The $N^2$ term corresponds to the storage requirements of the attention score matrix, which is also a key factor limiting the model's scalability.
As shown in Table 5, several key characteristics of GATransformer's inference performance emerge. First, the batch size 4 configuration achieves the best latency-throughput tradeoff, raising throughput to 150.2 graphs/s while keeping inference latency reasonable, a 1.89× improvement over single-sample inference. This gain is primarily attributable to more efficient utilization of GPU parallel computing units and the favorable effect of batching on memory access patterns. Second, throughput does not keep improving as the batch size grows: at a batch size of 8 it falls back to 107.8 graphs/s, indicating a saturation point in hardware resource utilization.
Real-time performance is a critical quality metric for network threat detection systems in production environments, directly affecting threat response capability and user experience. To evaluate the stability and reliability of GATransformer under continuous operation, this study conducted a 30-second high-frequency continuous inference test. During the test, the system executed 2829 inference operations with a 100% success rate, demonstrating the model's robustness under high load. As shown in Table 6, the latency distribution confirms strong real-time characteristics: the average latency was 7.74 ms, the P90 latency was 11.28 ms, and even the P99 latency reached only 36.06 ms, well below typical response-time thresholds for real-time threat detection systems. This tight latency distribution indicates minimal performance jitter, enabling stable and predictable service quality.
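A soak test of this kind can be sketched as follows; `model` and `make_batch` are hypothetical placeholders for the deployed detector and its input pipeline, and the timing harness is an assumption rather than the authors' actual benchmark code.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def soak_test(model, make_batch, duration_s=30.0):
    """Run continuous inference for `duration_s` seconds and report
    mean/P90/P99 latency, mirroring the 30-second test behind Table 6."""
    latencies_ms = []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        batch = make_batch()
        t0 = time.perf_counter()
        model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for GPU kernels before stopping the clock
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    lat = np.asarray(latencies_ms)
    return {"inferences": lat.size,
            "mean_ms": float(lat.mean()),
            "p90_ms": float(np.percentile(lat, 90)),
            "p99_ms": float(np.percentile(lat, 99))}
```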
Scalability analysis reveals a nonlinear relationship between detection time and network size: detection time for a small network (7 nodes, 5 edges) is 58.48 ms, yet for medium (14 nodes, 51 edges) and large (25 nodes, 106 edges) networks it drops and stabilizes around 9 ms (8.99 ms and 8.89 ms, respectively). The higher figure for the smallest test is most plausibly dominated by fixed per-inference overheads such as warm-up and kernel launch rather than by graph size. This behavior indicates that GATransformer fully exploits GPU parallel computing and keeps computation time well controlled on larger networks: detection time does not grow linearly with network size, and higher computational efficiency is achieved on medium-to-large networks. In real-time detection scenarios, the model sustains a throughput of 94.3 inferences per second. This scale-friendly detection time makes GATransformer particularly well suited to practical large-scale network anomaly detection.

4.5. Case Study

This case study constructs a simulated environment with real-world enterprise network characteristics. The environment adopts a four-layer network topology architecture, including a DMZ area (192.168.100.0/24), an office network (192.168.1.0/24), a server segment (192.168.10.0/24), and a database network (172.16.1.0/24). The dataset consists of 1000 network traffic records, with attack samples accounting for 20%, covering five typical APT attack patterns: port_scan, web_attack, lateral_movement, data_exfiltration, and malware_communication. Through the feature engineering module, the system converts the raw network traffic into a semantic representation of 32-dimensional node features and 20-dimensional edge features. A Graph Attention Network (GAT) encoder is used to capture spatial dependencies in the network topology, while a 4-layer Transformer encoder models the long-range temporal features of attack behaviors, achieving deep fusion of spatio-temporal features.
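A compressed sketch of this dual-branch layout is shown below. The 32-dimensional node features, 20-dimensional edge features, GAT encoder, and 4-layer Transformer encoder match the case-study description; the remaining details (hidden size 256 and 8 heads from Table 2, treating the nodes of one window as a single sequence, a binary node head) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class DualBranchDetector(nn.Module):
    """Illustrative sketch of the GAT + Transformer dual-branch design."""
    def __init__(self, node_dim=32, edge_dim=20, d=256, heads=8):
        super().__init__()
        # Spatial branch: graph attention over the network topology.
        self.gat = GATConv(node_dim, d, heads=heads, concat=False,
                           edge_dim=edge_dim)
        # Temporal branch: 4-layer Transformer encoder over node sequences.
        self.seq_in = nn.Linear(node_dim, d)
        enc = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=4)
        # Cross-attention fusion: graph features query temporal features.
        self.fusion = nn.MultiheadAttention(d, heads, batch_first=True)
        self.node_head = nn.Linear(d, 2)  # normal vs. compromised

    def forward(self, x, edge_index, edge_attr):
        g = self.gat(x, edge_index, edge_attr)                        # (N, d)
        t = self.transformer(self.seq_in(x).unsqueeze(0)).squeeze(0)  # (N, d)
        fused, _ = self.fusion(g.unsqueeze(0), t.unsqueeze(0),
                               t.unsqueeze(0))
        return self.node_head(fused.squeeze(0))                       # per-node logits
```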
In this experiment, the left subfigure of Figure 6 shows the propagation stage of attack activities based on real traffic. It can be clearly seen that malicious traffic (represented by red arrows) originates from the external attacker (203.0.113.100) and penetrates the internal server segment (192.168.10.0/24) and office segment (192.168.1.0/24) via the DMZ area, with several key nodes already compromised (marked in red). The right subfigure presents the detection results based on GATransformer. Through real-time monitoring by the model, the system successfully identifies and blocks most attack paths, effectively curbing the further spread of attacks, with only one attack (marked with ×) not correctly identified. The comparison results show that the proposed detection system can accurately identify abnormal behavior patterns in the network, promptly detect compromised host nodes, and achieve effective protection against complex network attacks through multi-level anomaly detection mechanisms, significantly enhancing network security defense capabilities.
Experimental results demonstrate that GATransformer performs well across multiple key metrics. The system's overall detection accuracy reaches 91%, with a precision of 89%, a recall of 93%, and an F1 score of 91% in the network analysis module. Comparing the detection results in the right subfigure with the actual attack propagation in the left subfigure shows that GATransformer achieves a low false positive rate while maintaining a high recall rate. This reflects two major advantages: first, high detection accuracy, as 91% overall accuracy meets practical needs in production environments; second, good balance, as 89% precision and 93% recall provide a strong trade-off between correctness and completeness of detection.
The core contribution of this study lies in integrating the structure-aware capability of graph neural networks with the sequence-modeling power of Transformers, combined through a multi-task learning paradigm and enhanced visual analysis, to advance spatio-temporal anomaly detection in cybersecurity.

5. Conclusions

This paper addresses key challenges in cybersecurity threat detection by proposing GATransformer, a new detection method that combines graph neural networks with the Transformer architecture. Our main theoretical contributions include a cross-attention fusion mechanism that provides a mathematical framework for integrating graph structural features with temporal sequence features, preserving both spatial topology relationships and temporal dependencies through learnable attention weights and thereby addressing the modal information fragmentation problem. Furthermore, we establish a multi-task optimization scheme with a unified loss function that jointly optimizes node classification, edge classification, and sequence prediction, where the hierarchical weighting strategy ensures stable convergence while preventing gradient conflicts between objectives. Additionally, our parallel dual-channel architecture is designed so that the GAT and Transformer branches extract complementary rather than redundant information: the GAT branch captures spatial dependencies in the network topology, while the Transformer branch models long-term temporal patterns.
Experimental validation demonstrates the effectiveness of our theoretical framework, where GATransformer achieves 99.74% node accuracy and 88.28% edge accuracy on CIDDS-001, and 99.99% and 99.98% respectively on CIDDS-002, thus verifying our mathematical predictions about optimal feature fusion. Moreover, the cross-attention mechanism shows measurable improvements compared to single-modal approaches: whereas GAT alone achieves 86.77% and Transformer alone achieves 87.45%, our hybrid approach reaches 88.28% edge classification accuracy, consequently confirming the theoretical benefits of eliminating modal fragmentation. Therefore, our work provides both practical detection capabilities and theoretical foundations for multi-modal network analysis, ultimately establishing mathematical principles for integrating structural and temporal information in cybersecurity applications.

6. Limitations and Future Work

6.1. Current Limitations

Firstly, the proposed GATransformer model relies primarily on the synthetic datasets CIDDS-001 and CIDDS-002 for evaluation. While these datasets enable high detection accuracy under controlled conditions, they introduce critical limitations: synthetic data constructed from known attack patterns cannot fully capture the complexity, diversity, and uncertainty of real-world networks; real-world traffic noise, anomalies, and atypical patterns are often simplified or omitted; and zero-day attacks and novel threats, whose patterns differ from the training data, may degrade detection performance. Secondly, the notable performance discrepancies across datasets highlight a strong sensitivity to data distribution, stemming from differences in network topology, traffic patterns, and attack type distributions; this necessitates extensive adaptation when migrating to target networks with distinct characteristics. Furthermore, the datasets used in the case study are relatively small. Such limited sample sizes may fail to reflect the complexity of real-world network scenarios, potentially undermining the generalizability of the conclusions.

6.2. Future Research

Future research should systematically compare GATransformer against current state-of-the-art hybrid architectures, including GraphFormer, which embeds graph structural information directly into Transformer attention bias matrices; the Heterogeneous Graph Transformer (HGT), designed for multi-type nodes and edges; Temporal Graph Networks (TGN), which use memory modules for continuous-time dynamic modeling; and Graph-BERT, which employs large-scale unsupervised pre-training. Evaluation should cover detection accuracy, computational efficiency, memory consumption, training overhead, and generalization performance.
Moreover, the existing GATransformer implementation processes temporal information in discrete time windows, which limits modeling precision. To address this, future work should incorporate continuous-time point processes, such as Hawkes processes or neural point processes, to model irregularly spaced network events and thereby capture the temporal dependencies of attack sequences more accurately. Specialized attention mechanisms could explicitly model attack kill-chain stages, and causal inference methods, including structural causal models, could help distinguish correlation from causation in network events.
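As a pointer to what such continuous-time modeling involves, a univariate Hawkes process with an exponential kernel has conditional intensity λ(t) = μ + Σ_{t_i < t} α·exp(−β(t − t_i)), so each past event temporarily raises the likelihood of further events. A minimal evaluation sketch follows, with all parameter values chosen arbitrarily for illustration:

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Exponential-kernel Hawkes intensity: each past event at t_i
    excites the process by alpha * exp(-beta * (t - t_i))."""
    past = np.asarray(event_times)
    past = past[past < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

# Bursty events (e.g., timestamps of suspicious flows) raise the
# instantaneous rate far above the baseline mu:
print(hawkes_intensity(5.0, [1.0, 4.5, 4.8, 4.9]))
```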
Furthermore, practical deployment optimization requires systematic strategies to facilitate GATransformer implementation in production environments. This encompasses developing incremental and online learning mechanisms for continuous adaptation to emerging attack patterns without complete retraining, exploring federated learning frameworks for collaborative training across organizations while preserving data privacy, and employing knowledge distillation techniques to create lightweight model versions suitable for resource-constrained edge devices. Additionally, adaptive inference mechanisms should dynamically adjust model complexity based on real-time network traffic characteristics.
Concurrently, current case studies utilize relatively limited datasets that may inadequately reflect real-world network complexity, potentially affecting research conclusion generalizability. Future investigations should construct large-scale, diverse cybersecurity datasets encompassing various network topologies, traffic patterns, and attack types, while establishing standardized preprocessing pipelines and feature engineering methodologies to ensure data quality and consistency.
Finally, existing research makes insufficient use of Advanced Persistent Threat (APT) attack datasets, limiting the evaluation of effectiveness against sophisticated threats characterized by high stealth, persistence, and evolution. Future work should develop specialized APT detection algorithms incorporating long-term behavior modeling, anomaly pattern recognition, and attack intent inference, and should establish professional evaluation frameworks with performance metrics and assessment standards tailored to APT detection scenarios.

7. Broader Implications

7.1. Positive Technological Impacts

The deployment of GATransformer significantly enhances organizational defense capabilities against advanced persistent threats by combining graph structure analysis with temporal pattern recognition to identify complex multi-stage attacks that traditional methods cannot detect. Experimental results demonstrate high detection accuracy, suggesting substantial potential for reducing false negative rates and minimizing risks of data breaches and system intrusions, particularly for critical infrastructure and national security systems. Furthermore, given that cybercrime-related economic losses are projected to reach USD 10.5 trillion by 2025, widespread deployment of GATransformer could substantially mitigate these damages through early detection and rapid response capabilities. This technology not only helps organizations avoid costly data breach expenses, business disruption losses, and reputational damage but also provides broad societal benefits by protecting personal privacy data, financial assets, and intellectual property.

7.2. Privacy and Surveillance Risks

Network traffic analysis inherently requires examination and monitoring of communication patterns, inevitably involving user activity surveillance. While GATransformer’s deep learning capabilities can identify subtle traffic patterns that aid threat detection, they may also infer user behavioral habits, preferences, and social relationships, potentially compromising personal privacy. Consequently, organizations must strictly comply with data protection regulations such as GDPR and CCPA, implementing privacy-preserving technologies like differential privacy and homomorphic encryption to minimize privacy impact while maintaining detection effectiveness. Moreover, this technology poses risks of abuse by authoritarian governments or malicious actors for mass surveillance purposes, potentially enabling identification of dissidents, tracking of specific groups, or monitoring of citizens’ online activities. Such misuse not only violates fundamental human rights but may also suppress freedom of speech and democratic participation, necessitating strict legal frameworks and technical safeguards to limit the technology’s scope and purpose.

7.3. Security and Adversarial Risks

High accuracy rates may lead defenders to over-rely on GATransformer while overlooking its vulnerability to adversarial sample attacks, where attackers can deceive the model through carefully crafted input perturbations to misclassify malicious traffic as benign. This vulnerability is particularly concerning in security-critical applications where attackers may exploit such weaknesses to bypass detection systems. Additionally, during the training phase, adversaries may inject malicious training data to implant backdoors, causing model failure or incorrect predictions under specific trigger conditions. Since GATransformer requires periodic updates to adapt to emerging threats, this continuous learning process creates opportunities for poisoning attacks, highlighting the need for robust training procedures and data validation mechanisms to prevent such vulnerabilities.

Author Contributions

Conceptualization, Q.Z.; methodology, Q.Z.; software, X.Z.; validation, X.Z.; formal analysis, Q.Z.; investigation, Y.S.; resources, W.C.; data curation, Y.L.; writing—original draft, Q.Z.; writing—review & editing, T.J. and Y.S.; visualization, H.O.; supervision, T.J. and Y.S.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the science and technology project of State Grid Corporation of China: Research on key technologies for cybersecurity vulnerability grading and vulnerability case verification and evaluation of special network in power monitoring system (grant No. 5108-202320447A-3-2-ZN).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Authors Qigang Zhu, Tian Jiang and Yu Shen were employed by the company NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing NARI Information & Communication Technology Co., Ltd. Author Xiong Zhan was employed by the company State Grid Corporation of China. Authors Wei Chen, Yuanzhi Li and Hengwei Ouyang were employed by the company NARI Group Corporation (State Grid Electric Power Research Institute). The authors declare that this study received funding from State Grid Corporation of China. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Navarro, J.; Deruyver, A.; Parrend, P. A systematic survey on multi-step attack detection. Comput. Secur. 2018, 76, 214–249. [Google Scholar] [CrossRef]
  2. Chen, Q.; Bridges, R.A. Automated behavioral analysis of malware: A case study of wannacry ransomware. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; IEEE: New York, NY, USA, 2017; pp. 454–460. [Google Scholar]
  3. Sharma, A.; Rani, S.; Driss, M. Hybrid evolutionary machine learning model for advanced intrusion detection architecture for cyber threat identification. PLoS ONE 2024, 19, e0308206. [Google Scholar] [CrossRef] [PubMed]
  4. Ventures, C. Cybercrime to cost the world $10.5 trillion annually by 2025. Cybercrime Mag. 2020, 13. [Google Scholar]
  5. Martínez, J.; Durán, J.M. Software supply chain attacks, a threat to global cybersecurity: SolarWinds’ case study. Int. J. Saf. Secur. Eng. 2021, 11, 537–545. [Google Scholar] [CrossRef]
  6. Krombholz, K.; Hobel, H.; Huber, M.; Weippl, E. Advanced social engineering attacks. J. Inf. Secur. Appl. 2015, 22, 113–122. [Google Scholar] [CrossRef]
  7. Scarfone, K.; Mell, P. Guide to intrusion detection and prevention systems (idps). NIST Spec. Publ. 2007, 800, 94. [Google Scholar]
  8. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
  9. Mohammed, S.H.; Singh, M.S.J.; Al-Jumaily, A.; Islam, M.T.; Islam, S.; Alenezi, A.M.; Soliman, M.S.; Nejad, M.G. Dual-hybrid intrusion detection system to detect False Data Injection in smart grids. PLoS ONE 2025, 20, e0316536. [Google Scholar] [CrossRef]
  10. Xu, C.; Zhan, Y.; Chen, G.; Wang, Z.; Liu, S.; Hu, W.; Aung, Z. Elevated few-shot network intrusion detection via self-attention mechanisms and iterative refinement. PLoS ONE 2025, 20, e0317713. [Google Scholar]
  11. Ran, L.; Cui, Y.; Zhao, J.; Yang, H.; Ibrahim, A.O. TITAN: Combining a bidirectional forwarding graph and GCN to detect saturation attack targeted at SDN. PLoS ONE 2024, 19, e0299846. [Google Scholar] [CrossRef]
  12. Hassn, B.M.; Alomari, E.S.; Alrubaye, J.S.; Hassen, O.A. Adversarially Robust 1D-CNN for Malicious Traffic Detection in Network Security Applications. J. Cybersecur. Inf. Manag. 2025, 16, 162–175. [Google Scholar]
  13. Liu, H.; Han, F.; Zhang, Y. Malicious traffic detection for cloud-edge-end networks: A deep learning approach. Comput. Commun. 2024, 215, 150–156. [Google Scholar] [CrossRef]
  14. Wang, X.; Zhu, H.; Luo, X.; Guan, X. Data-Driven-based Detection and Localization Framework against False Data Injection Attacks in DC Microgrids. IEEE Internet Things J. 2025, 12, 36079–36093. [Google Scholar] [CrossRef]
  15. Awad, A.A.; Ali, A.F.; Gaber, T. An improved long short term memory network for intrusion detection. PLoS ONE 2023, 18, e0284795. [Google Scholar] [CrossRef]
  16. Hasan, M.Z.; Hanapi, Z.M.; Zukarnain, Z.A.; Huyop, F.H.; Abdullah, M.D.H.; Ali, G. An efficient detection of Sinkhole attacks using machine learning: Impact on energy and security. PLoS ONE 2025, 20, e0309532. [Google Scholar] [CrossRef]
  17. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep learning approach for intelligent intrusion detection system. IEEE Access 2019, 7, 41525–41550. [Google Scholar] [CrossRef]
  18. Kumar, V.; Sinha, D.; Das, A.K.; Pandey, S.C.; Goswami, R.T. An integrated rule based intrusion detection system: Analysis on UNSW-NB15 data set and the real time online dataset. Clust. Comput. 2020, 23, 1397–1418. [Google Scholar] [CrossRef]
  19. Yin, Z.; Ma, H.; Hu, T. A Traffic Anomaly Detection Method Based on the Joint Model of Attention Mechanism and One-Dimensional Convolutional Neural Network-Bidirectional Long Short Term Memory. J. Electron. Inf. 2023, 45, 3719–3728. [Google Scholar]
  20. Hooshmand, M.K.; Hosahalli, D. Network Anomaly Detection Using Deep Learning Techniques. CAAI Trans. Intell. Technol. 2022, 7, 228–243. [Google Scholar] [CrossRef]
  21. Yuan, X.; Li, C.; Li, X. DeepDefense: Identifying DDoS Attack via Deep Learning. In Proceedings of the 2017 IEEE International Conference on Smart Computing (SMARTCOMP), Hong Kong, China, 29–31 May 2017; IEEE: New York, NY, USA, 2017; pp. 1–8. [Google Scholar]
  22. Li, H.; Ge, H.; Yang, H.; Yan, J.; Sang, Y. An Abnormal Traffic Detection Model Combined BiIndRNN with Global Attention. IEEE Access 2022, 10, 30899–30912. [Google Scholar] [CrossRef]
  23. Cai, S.; Tang, H.; Chen, J.; Lv, T.; Zhao, W.; Huang, C. GSA-DT: A Malicious Traffic Detection Model Based on Graph Self-Attention Network and Decision Tree. IEEE Trans. Netw. Serv. Manag. 2025, 22, 2059–2073. [Google Scholar] [CrossRef]
  24. Qu, Y.; Ma, H.; Jiang, Y. CRND: An Unsupervised Learning Method to Detect Network Anomaly. Secur. Commun. Netw. 2022, 2022, 9509417. [Google Scholar] [CrossRef]
  25. Jia, H.; Lang, B.; Li, X.; Yan, Y. IDEAL: A malicious traffic detection framework with explanation-guided learning. Knowl.-Based Syst. 2025, 317, 113419. [Google Scholar] [CrossRef]
  26. Liu, M.; Yang, Q.; Wang, W.; Liu, S. Semi-Supervised Encrypted Malicious Traffic Detection Based on Multimodal Traffic Characteristics. Sensors 2024, 24, 6507. [Google Scholar] [CrossRef]
  27. Yao, R.; Wang, N.; Liu, Z.; Chen, P.; Sheng, X. Intrusion Detection System in the Advanced Metering Infrastructure: A Cross-Layer Feature-Fusion CNN-LSTM-Based Approach. Sensors 2021, 21, 626. [Google Scholar] [CrossRef]
  28. Zhao, Y.; Ma, D.; Liu, W. Efficient Detection of Malicious Traffic Using a Decision Tree-Based Proximal Policy Optimisation Algorithm: A Deep Reinforcement Learning Malicious Traffic Detection Model Incorporating Entropy. Entropy 2024, 26, 648. [Google Scholar] [CrossRef]
  29. Zhu, S.; Xu, X.; Zhao, J.; Xiao, F. LKD-STNN: A Lightweight Malicious Traffic Detection Method for Internet of Things Based on Knowledge Distillation. IEEE Internet Things J. 2024, 11, 6438–6453. [Google Scholar] [CrossRef]
  30. Huo, Y.; Liang, W.; Chen, J.; Zhuang, S.; Sun, J. LightGuard: A Lightweight Malicious Traffic Detection Method for Internet of Things. IEEE Internet Things J. 2024, 11, 28566–28577. [Google Scholar] [CrossRef]
  31. Huo, Y.; Chen, J.; Guo, Y.; Liang, W.; Sun, J. LG-BiTCN: A Lightweight Malicious Traffic Detection Model Based on Federated Learning for Internet of Things. Electronics 2025, 14, 1560. [Google Scholar] [CrossRef]
  32. Cai, S.; Tang, H.; Chen, J.; Hu, Y.; Guo, W. CDDA-MD: An efficient malicious traffic detection method based on concept drift detection and adaptation technique. Comput. Secur. 2025, 148, 104121. [Google Scholar] [CrossRef]
  33. Ring, M.; Wunderlich, S.; Grüdl, D.; Landes, D.; Hotho, A. Flow-based benchmark data sets for intrusion detection. In Proceedings of the 16th European Conference on Cyber Warfare and Security, Dublin, Ireland, 29–30 June 2017; pp. 361–369. [Google Scholar]
  34. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar]
  35. Moustafa, N.; Slay, J. The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 dataset and the comparison with the KDD99 dataset. Inf. Secur. J. Glob. Perspect. 2016, 25, 18–31. [Google Scholar] [CrossRef]
  36. Moustafa, N.; Slay, J.; Creech, G. Novel geometric area analysis technique for anomaly detection using trapezoidal area estimation on large-scale networks. IEEE Trans. Big Data 2017, 5, 481–494. [Google Scholar] [CrossRef]
  37. Moustafa, N.; Creech, G.; Slay, J. Big data analytics for intrusion detection system: Statistical decision-making using finite dirichlet mixture models. In Data Analytics and Decision Support for Cybersecurity; Springer: Cham, Switzerland, 2017; pp. 127–156. [Google Scholar]
  38. Sarhan, M.; Layeghy, S.; Moustafa, N.; Portmann, M. NetFlow Datasets for Machine Learning-Based Network Intrusion Detection Systems. In Big Data Technologies and Applications: 10th EAI International Conference, BDTA 2020, and 13th EAI International Conference on Wireless Internet, WiCON 2020, Virtual Event, December 11, 2020, Proceedings; Springer Nature: Cham, Switzerland, 2021; p. 117. [Google Scholar]
  39. Abed, R.A.; Hamza, E.K.; Humaidi, A.J. A modified CNN-IDS model for enhancing the efficacy of intrusion detection system. Meas. Sens. 2024, 35, 101299. [Google Scholar] [CrossRef]
Figure 1. Architecture of anomalous traffic detection: GATransformer.
Figure 2. Confusion matrices of GATransformer on CIDDS-001 for node and edge classification tasks.
Figure 3. Confusion matrices of Transformer (upper) and GAT (lower) on CIDDS-001 for node and edge classification tasks.
Figure 4. Confusion matrices of GATransformer on CIDDS-002 for node and edge classification tasks.
Figure 5. Total loss curve of the GATransformer model during training.
Figure 6. Comparison of the effectiveness of network attack detection.
Table 1. Description of network dataset features.

| Feature Name    | Data Type | Description              | Example Value             |
|-----------------|-----------|--------------------------|---------------------------|
| Date first seen | String    | First observation time   | 00:00.4                   |
| Duration        | Float     | Flow duration            | 1.245                     |
| Proto           | String    | Protocol type            | TCP                       |
| Src IP Addr     | String    | Source IP address        | 192.168.1.100             |
| Src Pt          | Integer   | Source port number       | 443                       |
| Dst IP Addr     | String    | Destination IP address   | 172.16.0.5                |
| Dst Pt          | Integer   | Destination port number  | 22                        |
| Packets         | Integer   | Number of packets        | 10                        |
| Bytes           | Integer   | Total number of bytes    | 1500                      |
| Flows           | Integer   | Number of flows          | 2                         |
| Flags           | String    | TCP flag bits            | .AP…                      |
| Tos             | Integer   | Type of service          | 7                         |
| label           | String    | Traffic label            | normal, attacker, victim  |
| attackType      | String    | Attack type              | scan                      |
Table 2. Hyperparameter configuration.

| Parameter              | Description                  | Value    |
|------------------------|------------------------------|----------|
| time_window            | Time window size (s)         | 30       |
| hidden_dim             | Hidden layer dimension       | 256      |
| num_gat_layers         | Number of GAT layers         | 1        |
| num_transformer_layers | Number of Transformer layers | 6        |
| num_heads              | Number of attention heads    | 8        |
| learning_rate          | Learning rate                | 1 × 10⁻⁴ |
| batch_size             | Batch size                   | 4        |
| epochs                 | Number of training epochs    | 100      |
| node_weight            | Node classification weight   | 1.0      |
| edge_weight            | Edge classification weight   | 1.0      |
| sequence_weight        | Sequence prediction weight   | 0.01     |
| optimizer              | Optimizer type               | Adam     |
Table 3. Performance comparison of different datasets and models.

| Dataset   | Model                           | Node Accuracy | Edge Accuracy |
|-----------|---------------------------------|---------------|---------------|
| CIDDS-001 | GAT                             | 0.99463       | 0.86773       |
| CIDDS-001 | Transformer                     | 0.99556       | 0.87454       |
| CIDDS-001 | GATransformer (ours)            | 0.99745       | 0.88283       |
| CIDDS-002 | GAT                             | 0.99864       | 0.99092       |
| CIDDS-002 | Transformer                     | 0.99711       | 0.99159       |
| CIDDS-002 | GATransformer (ours)            | 0.99990       | 0.99989       |
| UNSW-NB15 | GAT                             | 0.99735       | 0.96012       |
| UNSW-NB15 | Transformer                     | 0.99735       | 0.96009       |
| UNSW-NB15 | GATransformer (ours)            | 0.99743       | 0.97414       |
| UNSW-NB15 | CNN + LSTM, LuNet, RNN          | 99.36         |               |
| UNSW-NB15 | DT, LR, NB, ANN, EM Clustering  | 85.56         |               |
| UNSW-NB15 | Extra Trees ensemble classifier | 98.19         |               |

The last three rows are literature baselines whose accuracies are reported as percentages.
Table 4. Hyperparameter sensitivity analysis for the GATransformer hybrid model.

| Configuration            | Node Accuracy | Edge Accuracy |
|--------------------------|---------------|---------------|
| GAT_layers = 1           | 0.99991       | 0.99559       |
| GAT_layers = 3           | 0.99133       | 0.99249       |
| GAT_layers = 4           | 0.98420       | 0.99177       |
| Transformer_layers = 1   | 0.98573       | 0.98979       |
| Transformer_layers = 4   | 0.98817       | 0.98632       |
| Transformer_layers = 6   | 0.99711       | 0.99106       |
| learning_rate = 5 × 10⁻⁴ | 0.99043       | 0.98659       |
| learning_rate = 1 × 10⁻⁴ | 0.99837       | 0.99403       |
| learning_rate = 5 × 10⁻⁵ | 0.98961       | 0.98137       |
Table 5. Inference performance at different batch sizes.

| Batch Size | Average Inference Time (ms) | Throughput (graphs/s) |
|------------|------------------------------|----------------------|
| 1          | 12.61                        | 79.3                 |
| 4          | 26.63                        | 150.2                |
| 8          | 74.23                        | 107.8                |
Table 6. Real-time performance under continuous operation.

| Metric     | Average Latency | P90 Latency | P99 Latency |
|------------|-----------------|-------------|-------------|
| Value (ms) | 7.74            | 11.28       | 36.06       |