3.1. ADFCNN-BiLSTM Network
Figure 1 shows the architecture of the ADFCNN-BiLSTM network. The ADFCNN-BiLSTM is composed of three modules: the spatial feature extraction module based on deformable convolution (DFCNN) and attention, the temporal feature extraction module based on multi-head attention (MHA), and the categories module. Before the network traffic data enters the ADFCNN-BiLSTM network, it needs to undergo two steps: data preprocessing and class balancing.
Data preprocessing (step 1 in Figure 1) transforms the original network traffic into a form that can be processed by the deep learning network. For example, it can remove redundant data, correct errors, use one-hot encoding to convert discrete non-numeric features into numeric features, and standardize the data to the [0, 1] interval through max–min normalization.
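For illustration, a minimal preprocessing sketch using scikit-learn is given below; the column split and values are hypothetical, since the paper does not specify its implementation:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical split of traffic features: 'proto' is discrete/non-numeric,
# the remaining columns are numeric statistics.
X_num = np.array([[120.0, 3.0], [4500.0, 17.0], [80.0, 1.0]])
X_cat = np.array([["tcp"], ["udp"], ["tcp"]])

# One-hot encode discrete features; scale numeric features to [0, 1]
# with max-min normalization: (x - min) / (max - min).
onehot = OneHotEncoder(sparse_output=False).fit_transform(X_cat)
scaled = MinMaxScaler().fit_transform(X_num)
X = np.hstack([scaled, onehot])
print(X.shape)  # (3, 4): 2 scaled numeric + 2 one-hot columns
```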
Class balancing (step 2 in Figure 1) is performed on the traffic data after data preprocessing. The minority samples are expanded using the synthetic minority oversampling technique (SMOTE), and the majority samples are undersampled by the edited nearest neighbor (ENN) technique in order to optimize the data distribution and provide a more balanced dataset for model training.
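A minimal sketch of this resampling step using imbalanced-learn, whose SMOTEENN class combines SMOTE oversampling with ENN cleaning (the synthetic data and hyperparameters are illustrative only):

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Synthetic stand-in for preprocessed traffic data with a strong imbalance.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE expands the minority class; ENN then removes majority samples
# whose nearest-neighbor labels disagree with their own.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```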
The resampled data are sent to the spatial feature extraction module (step 3 in Figure 1) to extract spatial features. In this module, the DFCNN adaptively adjusts the receptive field according to the data flow. The module also uses efficient channel attention (ECA) to selectively focus on important channel features. Additionally, based on ECA, the spatial attention mechanism (SAM) is used to locate the important areas in the input space. The combination of the DFCNN, ECA, and SAM enables the module to focus on key feature areas and reduce redundant information, thereby enhancing the network's robustness and expressive ability in complex scenarios.
The features extracted by the spatial feature extraction module are concatenated along the channel dimension and then input into the temporal feature extraction module (step 4 in Figure 1) to extract temporal features. The temporal feature extraction module is based on the BiLSTM network framework, with a multi-head attention (MHA) layer added after it. MHA helps the network focus on important time points in the time series, which is more beneficial for identifying intrusion behavior in the network.
In the categories module (step 5 in Figure 1), a dropout layer is used to suppress feature redundancy and reduce the risk of overfitting. The extracted spatiotemporal features are fused by a fully connected layer to generate high-order feature representations. Finally, the high-dimensional features are mapped onto the specific classes through the fully connected layer.
3.2. Spatial Feature Extraction Module Based on DFCNN and Attention
Figure 2 shows the detailed structure of the spatial feature extraction module. The module consists of two layers. The first layer (layer 1) includes a one-dimensional DFCNN (128 channels, with a convolution kernel size of 3), a ReLU activation function, a one-dimensional max-pooling layer (with a kernel size of 4), batch normalization, and ECA. The structure of the second layer (layer 2) is similar to the first, except that the number of DFCNN channels increases from 128 to 256. After two-layer feature extraction, the traffic data are input into the SAM layer and finally output as the spatial information feature map of the traffic data (output 1).
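The two-layer structure can be summarized with the following sketch, assuming PyTorch and the DeformableConv1d, ECA1d, and SpatialAttention1d modules sketched later in this section; the channel sizes follow the text:

```python
import torch.nn as nn

# One layer of the spatial module: 1D DFCNN -> ReLU -> max pool ->
# batch norm -> channel attention (ECA).
def spatial_layer(in_ch, out_ch):
    return nn.Sequential(
        DeformableConv1d(in_ch, out_ch, kernel_size=3),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=4),
        nn.BatchNorm1d(out_ch),
        ECA1d(out_ch),
    )

spatial_module = nn.Sequential(
    spatial_layer(1, 128),     # layer 1: 128 channels
    spatial_layer(128, 256),   # layer 2: 256 channels
    SpatialAttention1d(),      # SAM produces output 1
)
```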
Network traffic has obvious one-dimensional characteristics. A traditional one-dimensional convolution operates as follows: firstly, the input feature map is divided into multiple regions of the same size according to the size of the convolution kernel. In each region, the kernel's weights are multiplied by the elements at the corresponding positions, and the results are summed to generate the corresponding output feature. To obtain the complete output feature map, the convolution slides a window over the entire input feature map and performs the same operation at each position. The convolution operation at any point $p_0$ on the input feature map can be expressed by Formula (1).
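Formula (1) can be written as the standard one-dimensional convolution (a reconstruction consistent with the symbol definitions that follow):

$$
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)
\tag{1}
$$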
Here, $p_n$ represents the offset of each point in the convolution kernel relative to the center point $p_0$; for the current convolution kernel of size 3, it can be expressed as $p_n \in \mathcal{R} = \{-1, 0, 1\}$. $w(p_n)$ represents the weight at the corresponding position of the convolution kernel. $x(p_0 + p_n)$ represents the element value at position $p_0 + p_n$ on the input feature map. $y(p_0)$ represents the element value at position $p_0$ on the output feature map.
In network intrusion detection tasks, intrusion behaviors often manifest as anomalies in local traffic. CNNs struggle to capture the full features of these local anomalies due to their fixed sampling points. In contrast, a DFCNN adds a learnable offset $\Delta p_n$ that allows the convolution kernel to deform during operation, allowing for a more precise capture of the subtle features of intrusion behaviors. As shown in Figure 2, the offset $\Delta p_n$ is calculated by another convolution. In deformable convolution, the offset for each point is added on top of Formula (1). As shown in Formula (2), the relative center offset is changed from the original $p_n$ to $p_n + \Delta p_n$.
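Formula (2) then takes the following form (a reconstruction consistent with Formula (1)):

$$
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)
\tag{2}
$$

Since $p_n + \Delta p_n$ is generally fractional, the sampled value is obtained by interpolation. The sketch below illustrates this idea for the one-dimensional case; it is a minimal illustrative implementation assuming PyTorch, not the authors' code, and names such as DeformableConv1d are illustrative:

```python
import torch
import torch.nn as nn

class DeformableConv1d(nn.Module):
    """Minimal 1D deformable convolution sketch: a plain Conv1d predicts
    one offset per kernel tap and output position; values at fractional
    positions p0 + pn + dpn are obtained by linear interpolation."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        pad = kernel_size // 2
        self.offset_conv = nn.Conv1d(in_ch, kernel_size, kernel_size, padding=pad)
        self.weight_conv = nn.Conv1d(in_ch * kernel_size, out_ch, 1)

    def forward(self, x):                                   # x: (B, C, L)
        B, C, L = x.shape
        offsets = self.offset_conv(x)                       # (B, k, L): one dpn per tap
        base = torch.arange(L, device=x.device).float()     # p0
        taps = torch.arange(self.k, device=x.device).float() - self.k // 2  # pn
        # sampling positions p0 + pn + dpn, shape (B, k, L)
        pos = (base.view(1, 1, L) + taps.view(1, self.k, 1) + offsets).clamp(0, L - 1)
        lo, hi = pos.floor().long(), pos.ceil().long()
        frac = pos - lo.float()
        # linear interpolation between the two neighboring samples
        x_e = x.unsqueeze(2).expand(B, C, self.k, L)
        lo_e = lo.unsqueeze(1).expand(B, C, self.k, L)
        hi_e = hi.unsqueeze(1).expand(B, C, self.k, L)
        sampled = (1 - frac).unsqueeze(1) * x_e.gather(3, lo_e) \
                  + frac.unsqueeze(1) * x_e.gather(3, hi_e)  # (B, C, k, L)
        # fold taps into channels; a 1x1 conv applies the weights w(pn)
        return self.weight_conv(sampled.reshape(B, C * self.k, L))
```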
In the deformable convolution layer, we extract multiple channel features, with each channel regarded as a feature detector. However, this mechanism may overlook the key learning of important objects, potentially leading to overfitting. The ECA [33] at the end of each layer adaptively assigns different weights to each channel, focusing on the "importance" of features in the traffic, thus effectively extracting useful information.
Let the output of batch normalization be $F \in \mathbb{R}^{f \times c}$, where $f$ and $c$ represent the flow features and the number of channels, respectively. As shown in the yellow box in Figure 2, ECA first extracts aggregated features from $F$ through global average pooling (GAP). Subsequently, ECA generates channel weights through a fast one-dimensional convolution (kernel size $k$), where the kernel size $k$ is adaptively determined according to the channel dimension $c$. The channel weights are adjusted using the sigmoid activation function $\sigma$ to scale the values within [0, 1]. A weight value of 0 indicates that the channel can be discarded, while a weight value of 1 signifies that the channel is fully preserved. Finally, the weighted feature $F'$ is obtained by multiplying $F$ by the obtained weight matrix. This design captures important channel information, effectively alleviates overfitting, and improves the generalization ability of the network.
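A minimal sketch of ECA for 1D features follows; the adaptive kernel-size rule uses the defaults of the original ECA paper (gamma = 2, b = 1), which this paper may or may not share:

```python
import math
import torch
import torch.nn as nn

class ECA1d(nn.Module):
    """Efficient channel attention: GAP -> fast 1D conv across channels
    (adaptive kernel size k) -> sigmoid -> channel rescaling."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1          # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):                   # x: (B, C, L)
        w = x.mean(dim=2, keepdim=True)     # GAP over flow features -> (B, C, 1)
        w = self.conv(w.transpose(1, 2))    # conv across channels -> (B, 1, C)
        w = torch.sigmoid(w).transpose(1, 2)  # weights in [0, 1] -> (B, C, 1)
        return x * w                        # rescale each channel
```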
The channel attention mechanism emphasizes specific feature types, while the spatial attention mechanism locates the key positions in the feature map. To achieve comprehensive attention on "Which features are important?" (channel) and "Where is important?" (space), we use the spatial attention module (SAM) at the end of the spatial feature extraction module [34]. The SAM is located in the blue area of Figure 2. Firstly, the SAM performs average pooling and maximum pooling along the channel axis of the feature map to extract both global and local information. The pooled results are then concatenated to form a feature descriptor, effectively capturing key features of the feature map across different spatial locations. Next, the key spatial location information is captured through a convolution operation, and the result is passed through a sigmoid activation function to obtain weights in the range of [0, 1]. These weight values represent the importance of each spatial location; the closer the value is to 1, the more significant the features are at that location. Finally, the weights are assigned to the corresponding features. This design highlights important areas in the feature map, improving the network's perception of key information.
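A minimal SAM sketch for the 1D case; the kernel size of 7 is the common default from the CBAM literature and is an assumption here:

```python
import torch
import torch.nn as nn

class SpatialAttention1d(nn.Module):
    """Spatial attention: channel-wise avg/max pooling -> concat ->
    conv -> sigmoid -> per-position rescaling."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                          # x: (B, C, L)
        avg = x.mean(dim=1, keepdim=True)          # global info per position
        mx, _ = x.max(dim=1, keepdim=True)         # salient info per position
        desc = torch.cat([avg, mx], dim=1)         # (B, 2, L) feature descriptor
        w = torch.sigmoid(self.conv(desc))         # weights in [0, 1]
        return x * w
```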
3.3. Temporal Feature Extraction Module Based on Multi-Head Attention Mechanism
The temporal feature extraction module consists of two layers: BiLSTM and MHA. The principle of BiLSTM is shown in Figure 3. BiLSTM is essentially a deep learning network that combines a bidirectional neural network with the LSTM structure. Building upon traditional LSTM, it captures richer context information by processing data in both the forward and backward directions simultaneously. The memory cell and various gates in LSTM are its core components. The memory cell solves the long-term dependency problem of an RNN, and the gating mechanism dynamically controls which information is retained and which is discarded.
As shown in the green module in Figure 3, the input $x_t$ at the current time step sequentially passes through the forget gate, input gate, and output gate, ultimately producing the current time step's output $h_t$. The output $f_t$ of the forget gate is multiplied by $C_{t-1}$ to control the retention of the previous time step's memory, as shown in Formula (3). The input gate determines how much of the new input information $x_t$ at the current time step is written into the memory cell. The calculation process is as follows: firstly, the weight $i_t$ of the input gate is computed using the $\sigma$ activation function. Then, the candidate value $\tilde{C}_t$ is generated using the tanh function. Finally, $i_t$ is multiplied by $\tilde{C}_t$, as shown in Formulas (4) and (5). By combining the output $f_t \odot C_{t-1}$ of the forget gate and the output $i_t \odot \tilde{C}_t$ of the input gate, the memory cell $C_t$ at the current time step is updated, as shown in Formula (6). The output gate first calculates the output gate weight $o_t$ using the $\sigma$ activation function. Then, $C_t$ is mapped onto an appropriate output range through the tanh function. Finally, the two values are multiplied to obtain the final output $h_t$ at the current time step, as shown in Formulas (7) and (8). This design enables LSTM to effectively preserve long-term dependencies while dynamically adjusting short-term memory, thus alleviating the vanishing gradient problem.
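In standard notation, Formulas (3)–(8) correspond to the usual LSTM gate equations (a reconstruction consistent with the description above):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && (3)\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && (4)\\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && (5)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && (6)\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && (7)\\
h_t &= o_t \odot \tanh(C_t) && (8)
\end{aligned}
$$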
BiLSTM processes sequential data in a step-by-step manner. Although the bidirectional mechanism enhances its ability to capture information from both directions, it still has a limited capacity for modeling long-range dependencies in long sequences. Intrusion behaviors in network traffic often exhibit such dependencies, and the multi-head attention (MHA) mechanism addresses this by concurrently attending to multiple, diverse dependencies. This allows it to capture a more comprehensive representation of attack behaviors, overcoming BiLSTM's limitations in long-sequence modeling. Therefore, we incorporate MHA after the BiLSTM layer to capture inter-time step relationships and extract global context information. The location of the MHA mechanism is shown in Figure 1. By calculating attention weights, the network can assign different levels of attention to the features of different time steps according to their importance, thereby focusing on the critical time segments.
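A minimal sketch of this BiLSTM-plus-MHA arrangement in PyTorch; the hidden size of 128 and 4 attention heads are illustrative values, not the paper's configuration:

```python
import torch.nn as nn

class TemporalModule(nn.Module):
    """BiLSTM over the sequence, followed by multi-head self-attention
    across time steps to capture long-range dependencies."""
    def __init__(self, in_dim, hidden=128, heads=4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.mha = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):          # x: (B, T, in_dim)
        h, _ = self.bilstm(x)      # (B, T, 2*hidden)
        # self-attention: queries, keys, and values are all h
        out, _ = self.mha(h, h, h)
        return out
```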
3.4. Equalization Loss v2
In this paper, the EQLv2 loss function [35] is used to balance the influence of each class at the algorithm level by assigning different weights to each class. The core calculation formula of the loss function can be expressed as a weighted binary cross-entropy (BCE), where the weighting factor depends on the class gradient, as shown in Formula (9):
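$$
\mathcal{L}_{\mathrm{EQLv2}} = \frac{1}{N} \sum_{j=1}^{N} w_j\, \mathcal{L}_j
\tag{9}
$$

(This is a reconstruction consistent with the definitions given next.)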
By averaging the weighted cross-entropy loss of each class, the final EQLv2 loss value can be obtained, where $N$ represents the number of classes, and $\mathcal{L}_j$ represents the cross-entropy loss of class $j$. The weighted cross-entropy loss is calculated by combining the weight $w_j$ of the $j$-th class. Formula (10) provides the calculation formula for the cross-entropy loss:
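$$
\mathcal{L}_j = -\Big( y_j \log \sigma(z_j) + (1 - y_j) \log\big(1 - \sigma(z_j)\big) \Big)
\tag{10}
$$

where $y_j \in \{0, 1\}$ is the label of the $j$-th class and $z_j$ is its prediction score (a reconstruction of the standard BCE form described below).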
The formula calculates the loss for a positive sample ($y_j = 1$) and a negative sample ($y_j = 0$), respectively. If $y_j = 1$, only the first term is valid, and in this case, the closer the model's predicted probability is to 1, the lower the loss value is. If $y_j = 0$, the second term is valid. At this time, the closer the probability predicted by the model is to 0, the lower the loss value is. Here, $\sigma$ represents the sigmoid function, which maps the prediction score $z_j$ of the $j$-th class onto a probability value between 0 and 1.
The working principle of EQLv2 is to calculate the corresponding weight by accumulating the ratio of positive to negative gradients of the classifier output during each backpropagation process. By dynamically adjusting the weight of positive and negative samples, it can inhibit the dominant role of common classes in model training. Formulas (11) and (12) define the calculation methods for the weights of positive and negative samples, respectively.
In Formula (11), $\alpha$ is a hyperparameter used to control the balance between the weights of positive and negative samples. The formula for calculating the weight of negative samples is given by Formula (12). In this formula, $\gamma$ and $\mu$ are hyperparameters that control the influence of gradient information on the weight of negative samples, while the parameter $g_j$ represents the accumulated ratio between positive and negative gradients. This ratio is obtained by accumulating the gradient ratio over the training task, because a single-step gradient ratio is uninformative on its own and would introduce noise that disturbs the normal learning of the network. Finally, the model parameters $\theta$ are updated through Formula (13).
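In the notation of the original EQLv2 paper, Formulas (11) and (12) take the following form (a reconstruction consistent with the description above; the symbols may differ slightly from this paper's rendering):

$$
w_j^{\mathrm{pos}} = 1 + \alpha \big( 1 - f(g_j) \big)
\tag{11}
$$

$$
w_j^{\mathrm{neg}} = f(g_j), \qquad f(x) = \frac{1}{1 + e^{-\gamma (x - \mu)}}
\tag{12}
$$

where $f(\cdot)$ is a sigmoid-shaped mapping of the accumulated gradient ratio $g_j$, so that rare classes (small $g_j$) receive larger positive weights and smaller negative weights.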