4.3. Data Preprocessing
Network traffic data contain non-numeric features that are not suitable for direct input into anomaly detection models. To address this issue, one-hot encoding was employed to convert these non-numeric features into numeric ones, making them compatible with the model. Additionally, different features may differ significantly in scale, and such disparities can degrade classification performance, necessitating the normalization of feature values. Normalization ensures that all features lie on a comparable scale, enabling more equitable comparisons within the model.
For instance, in the publicly available KDDCUP99 dataset, the features ‘service’, ‘flag’, and ‘protocol_type’ are non-numeric and thus require one-hot encoding to be converted into binary numeric values. When the ‘protocol_type’ feature, which takes three values (TCP, UDP, and ICMP), is one-hot encoded, the results are (1, 0, 0), (0, 1, 0), and (0, 0, 1), respectively. The ‘service’ feature comprises 70 values and is transformed by one-hot encoding into a 70-dimensional binary vector, while the ‘flag’ feature contains 11 values and yields an 11-dimensional binary vector.
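As a concrete illustration, the following Python sketch shows this encoding step on toy rows; the use of pandas is an assumption, since the paper does not specify its tooling.

```python
# A minimal sketch of the one-hot encoding step, using pandas
# (library choice is an assumption; the paper does not name its tooling).
import pandas as pd

# Toy rows mimicking the three symbolic KDDCUP99 features.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp"],
    "service": ["http", "ftp", "smtp"],   # 70 distinct values in the full dataset
    "flag": ["SF", "S0", "REJ"],          # 11 distinct values in the full dataset
})

# One-hot encode the symbolic columns; e.g. protocol_type becomes three binary
# columns, so tcp = (1,0,0), udp = (0,1,0), icmp = (0,0,1).
encoded = pd.get_dummies(df, columns=["protocol_type", "service", "flag"], dtype=int)
print(encoded.head())
```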
Moreover, significant differences in the ranges of feature values can slow the convergence of the anomaly detection model, increasing experimental time and ultimately affecting detection performance. Therefore, feature normalization is necessary. In the CIC-IDS-2017 dataset, for example, the feature ‘Total Length of Fwd Packets’ ranges over [0, 181,012], while the feature ‘Flow IAT Max’ ranges over [−1, 120,000,000], a substantial disparity. To mitigate the issues arising from these differences, normalization is performed to bring all features onto a common scale. This stabilizes the model and accelerates convergence, thereby saving experimental time and enhancing detection performance. In this study, mean–variance normalization (z-score standardization), as defined by Equation (10), was utilized. By computing the mean and standard deviation of each feature, the values were transformed to follow a standardized distribution with zero mean and unit variance. Consequently, regardless of their original ranges, the feature values were mapped to a common numerical scale, providing a more reliable foundation for model training and prediction.
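The following sketch illustrates the per-feature mean–variance transform of Equation (10); the small epsilon guard against constant (zero-variance) features is an added assumption, not part of the paper's formulation.

```python
# A sketch of mean-variance (z-score) normalization as in Equation (10):
# x' = (x - mean) / std, computed independently for each feature column.
import numpy as np

def standardize(X, eps=1e-8):
    """Standardize each column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)  # eps guards against constant features

# Two features with very different ranges, as in CIC-IDS-2017.
X = np.array([[0.0, -1.0],
              [90_506.0, 60_000_000.0],
              [181_012.0, 120_000_000.0]])
print(standardize(X))
```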
4.4. Model Parameter Settings
To address the potential issues of gradient explosion or vanishing gradients in deep convolutional neural networks (CNNs), experiments were conducted on CNNs with different numbers of layers (from one to five) to compare their accuracies. All other parameters were kept constant except for the number of convolutional layers. The experimental results are shown in Figure 9.
From the results in the figure, it can be concluded that the performance was poorest when the number of convolutional layers was one, and it was best when the number was three, with an accuracy of 99.5%. Therefore, this model uses a network structure with three convolutional layers.
Although dilated convolution can enlarge the receptive field and capture more contextual information without increasing the number of parameters, a common problem known as the ‘gridding effect’ may occur when the dilation rate is large. This effect occurs when the convolutional kernel covers only part of the input feature map, leading to spatial discontinuities in the output feature map. Such discontinuity may cause information loss or fragmentation, affecting the model’s performance. To address this issue, we used Hybrid Dilated Convolution (HDC) [19] and determined the dilation rates for the three convolutional layers accordingly. HDC possesses three characteristics. Firstly, the dilation rates follow a zigzag pattern. Secondly, the dilation rates of sequentially arranged layers must not share a common factor greater than 1. Lastly, HDC satisfies the formula:
$$M_i = \max\left[\, M_{i+1} - 2r_i,\; 2r_i - M_{i+1},\; r_i \,\right]$$
where $r_i$ represents the dilation rate of the $i$-th layer and $M_i$ represents the maximum dilation rate at the $i$-th layer, with $M_n = r_n$ for the top layer; a dilation scheme is valid when $M_2 \le K$, where $K$ is the convolution kernel size.
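A short checker, assuming the standard HDC formulation reconstructed above, verifies this condition for the two schemes used in the experiments that follow:

```python
# Checker for the HDC condition, assuming the standard formulation from the
# HDC paper [19]: M_n = r_n, the recurrence above, and validity for kernel
# size K when M_2 <= K.
def hdc_max_distance(rates):
    """Compute M_1..M_n for a list of per-layer dilation rates."""
    M = [0] * len(rates)
    M[-1] = rates[-1]                      # top layer: M_n = r_n
    for i in range(len(rates) - 2, -1, -1):
        M[i] = max(M[i + 1] - 2 * rates[i],
                   2 * rates[i] - M[i + 1],
                   rates[i])
    return M

for rates in ([1, 2, 3], [1, 2, 5]):
    M = hdc_max_distance(rates)
    print(rates, "-> M =", M, "valid for K=3:", M[1] <= 3)
```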
Dilation rate schemes of (1, 2, 3) and (1, 2, 5) both satisfy this condition for a kernel size of 3. Experiments were therefore conducted with each scheme while keeping all other conditions constant. The experimental results are shown in Figure 10: with dilation rates of 1, 2, and 3, the accuracy reached 99.5%, significantly better than with dilation rates of 1, 2, and 5.
In deep learning models, the number of layers in a neural network can significantly impact the final classification performance, a factor that is particularly pronounced when using GRU networks. Therefore, to determine the optimal number of layers, all other parameters were kept constant and only the number of GRU layers was adjusted. GRU configurations with 1, 2, 3, and 4 layers were combined with the DC-1D (one-dimensional dilated convolution) structure. This combination allowed exploration of the GRU’s ability to handle complex features and maintain sequence information while observing its impact on overall model performance. Each layer configuration underwent the same training and validation process to ensure comparability of the results. The experimental results are shown in Figure 11.
Considering that computational cost and efficiency are crucial factors in model design and selection, the slight difference in accuracy between the two-layer and three-layer GRU became a critical decision point. Although the three-layer GRU provided slightly higher accuracy, it also increased computational cost and complexity. After balancing accuracy against computational cost, a two-layer GRU was chosen for the final model, a choice that maintained high accuracy while ensuring the model’s efficiency and practicality.
Since the number of GRU neurons also affects model performance, experiments were conducted to determine the optimal setting. The two GRU layers were configured with 16, 32, 64, and 128 neurons, and the experimental results are shown in Figure 12. The results indicate that the highest accuracy was achieved with 64 neurons in each layer (64-64).
Through the above parameter design experiments, the structure of the DC-GRU was determined. The overall structure consists of three one-dimensional dilated convolutions with dilation rates of 1, 2, and 3. Generally, to achieve better accuracy, researchers tend to design deeper network models to obtain higher-level features. However, deeper networks involve more parameters; in theory, more parameters improve the model’s fitting capacity, but they also increase the risk of overfitting. Based on multiple experiments, the convolution kernel size and the number of convolution kernels in this study were set to 3 and 64, respectively.
In the GRU section, it was found during the experiments that appropriately increasing the network depth can enhance the model’s predictive capability. In this study, two GRU layers with an output dimension of 64 were used.
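A minimal PyTorch sketch of this DC-GRU backbone is given below; the framework, ‘same’-style padding, input shape, and classification head are assumptions for illustration, not the paper’s exact implementation.

```python
# Minimal sketch of the DC-GRU backbone described above: three 1-D dilated
# convolutions (kernel size 3, 64 filters, dilation rates 1, 2, 3) followed
# by a two-layer GRU with 64 hidden units. Padding, input dimensions, and
# the final classifier are illustrative assumptions.
import torch
import torch.nn as nn

class DCGRU(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # padding = dilation preserves sequence length for kernel size 3.
        self.convs = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=3, padding=3),
            nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=64,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):            # x: (batch, channels, seq_len)
        h = self.convs(x)            # (batch, 64, seq_len)
        h = h.transpose(1, 2)        # (batch, seq_len, 64) for the GRU
        out, _ = self.gru(h)
        return self.fc(out[:, -1])   # classify from the last time step

model = DCGRU(in_channels=1, num_classes=10)
print(model(torch.randn(8, 1, 78)).shape)  # torch.Size([8, 10])
```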
4.5. Model Effect Verification
During the experiment, the model converged when the epoch was set to 10, and at this point, the accuracy of the test set reached 99.5%.
Table 3 displays the precision, recall, and F1 score values for various attack types in the CIC-IDS-2017 dataset.
As can be observed in Table 3, the precision of DCGCANet for all 10 types of network behavior was above 90%; even for the Bot class, which has relatively few samples, precision exceeded 92%. The recall and F1 scores also remained at high levels, demonstrating the effectiveness of DCGCANet in traffic classification.
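Per-class precision, recall, and F1 values such as those in Table 3 can be produced with a standard classification report; the sketch below uses scikit-learn on placeholder labels, since the paper does not state its evaluation tooling.

```python
# Computing per-class precision, recall, and F1 with scikit-learn
# (placeholder labels; the paper's actual predictions are not reproduced).
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 2, 2, 2]   # placeholder ground-truth class labels
y_pred = [0, 0, 1, 2, 2, 2, 2]   # placeholder model predictions
print(classification_report(y_true, y_pred, digits=4))
```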
To demonstrate the model’s performance more intuitively, a heatmap comparing the precision, recall, and F1 score for the various attack types was generated, as shown in Figure 13.
To further verify the effectiveness of the DCGCANet model and its components, ablation experiments were carried out under the same experimental environment. The experimental results are shown in Figure 14, Figure 15, Figure 16, and Figure 17. In this experiment, different modules or models were disabled separately to verify the impact and effectiveness of the proposed model through comparison.
Figure 14 shows the confusion matrix results for the CNN-1D model, Figure 15 presents the results for the DC-GRU model, Figure 16 displays the results for the combined DC and CA modules, and Figure 17 illustrates the experimental results for the DCGCANet model.
From the results of the ablation experiments, it can be clearly seen that both the DC-GRU model and the CA (channel attention) module significantly improved recognition of abnormal traffic. Specifically, the DC-GRU model, with its unique dilated convolution structure, was found to capture key features in data more effectively, thereby providing a more accurate feature representation for the detection of abnormal traffic. On the other hand, the CA module, by weighting the features of different channels, enhanced the focus of the model on important features, further improving the accuracy of abnormal traffic detection. More importantly, when these two modules were combined, their complementarity resulted in a greater improvement in the overall accuracy of the model.
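The paper does not detail the CA module’s internal design; a squeeze-and-excitation-style block is one plausible instance of the per-channel weighting described above, sketched here for illustration.

```python
# One plausible instance of channel attention (CA): global pooling
# summarizes each channel, and a small bottleneck produces per-channel
# weights. The paper's exact CA design may differ from this SE-style sketch.
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),            # per-channel weights in (0, 1)
        )

    def forward(self, x):            # x: (batch, channels, seq_len)
        w = self.fc(x.mean(dim=2))   # squeeze: global average over time
        return x * w.unsqueeze(2)    # excite: reweight each channel

x = torch.randn(8, 64, 78)
print(ChannelAttention1D(64)(x).shape)  # torch.Size([8, 64, 78])
```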
To better illustrate the DCGCANet model’s ability to detect abnormal traffic, comparisons were made between DCGCANet and several algorithms, namely KNN, ID3, Multi-Layer Perceptron (MLP), Random Forests (RFs), CNNs, AFM-ICNN-1D, and CNN-GRU [20], using the CIC-IDS-2017 dataset under the same experimental conditions. The results of these comparisons are presented in Table 4.
As shown in the comparison table evaluating multiple models on accuracy, the model presented in this paper demonstrated a significant advantage despite requiring longer detection times. Specifically, our model achieved an accuracy of 99.6%, showing substantial improvements over traditional and state-of-the-art methods like KNN, ID3, MLP, RF, CNNs, and the recently developed AFM-ICNN-1D model.
As shown in Table 4, traditional machine learning algorithms such as KNN, RFs, and ID3 also achieve satisfactory detection results. However, these methods require explicit feature selection to reduce dimensionality and improve accuracy, whereas the method proposed in this study learns discriminative feature representations directly from the data, without a separate feature selection step.
Methods such as CNN, AFM-ICNN-1D, and CNN-GRU suffer from a certain amount of information loss due to pooling operations. In contrast, our method replaces pooling with dilated convolution and incorporates channel attention, allowing the model to assign different weights to features in different channels according to their importance. This enhances the model’s ability to capture critical information and improves its overall representation capability. However, the model’s higher complexity incurs an increased time cost in exchange for the improved detection performance.
An improvement of 3.3% in accuracy over KNN indicates a strong capability in handling complex datasets, while an increase of 1.3% compared to ID3 demonstrates high efficiency in classification decisions. Compared to MLP, an accuracy improvement of 23% was observed. Additionally, our model showed improvements of 2.9% and 4.3% over RFs and CNN, respectively, proving its excellent performance in detecting network traffic anomalies. Furthermore, significant improvements are observed in precision, recall, and F1 score, verifying the comprehensiveness and reliability of the proposed model.
To verify the generalization ability of the DCGCANet, the four subsets P1 to P4 of CIC-IDS-2017 were used as test sets, and the DCGCANet was compared with an LSTM model on these datasets. The results are presented in Table 5.
From the comparison results presented in Table 5, it can be observed that the precision, recall, and F1 values of the proposed DCGCANet on the four subsets of CIC-IDS-2017 were all above 99%, significantly better than those of the LSTM model. It can therefore be concluded that the DCGCANet possesses superior generalization ability. This is attributed to the model’s emphasis on important channel features through channel weighting, which enhances its representational capability and leads to improved generalization.
To further verify the generalization ability of the model, the KDD Cup99 dataset was used for validation; the results are shown in Table 6. The KDD Cup99 dataset contains normal traffic (Normal) and four types of anomalous behavior: Denial of Service (DoS), Probe, Remote-to-Local (R2L), and User-to-Root (U2R) attacks.
As shown in Table 6, the proposed model achieved accuracy and detection rates above 90% across all five traffic categories in the KDD Cup99 dataset. The model demonstrated excellent detection performance on both the CIC-IDS-2017 and KDD Cup99 datasets, indicating strong generalization capability.