5.3. Data Preprocessing and Feature Selection
As described in Section 4.2.1, the raw datasets are first prepared in two key steps: data preprocessing and feature selection.
(1) Data preprocessing
Data cleaning: For the UNSW-NB15 dataset, while there are no infinite values, there are a significant number of missing values, especially in the ‘service’ column, which contains 141,321 missing entries. Over half of the samples have missing values, and removing these would lead to a substantial loss of information. Therefore, missing values were imputed instead: each missing ‘service’ entry was filled with the mode of the ‘service’ column for its category. For the CIC-IDS2017 dataset, given its large number of samples and the relatively few occurrences of missing and infinite values, the affected samples were removed directly.
Feature encoding: For the UNSW-NB15 dataset, there are three categorical feature columns: “proto”, “service”, and “state”. The one-hot encoding method was used to convert these categorical features. After this process, the feature dimension of the samples increased to 195 dimensions. Additionally, the sample labels in the UNSW-NB15 dataset encompass 10 categories. Label Encoding was applied to transform these categories into integer values ranging from 0 to 9. For the CIC-IDS2017 dataset, since it does not contain any categorical feature columns, only the sample labels needed to be processed. This dataset includes nine categories of sample labels, and Label Encoding was used to map each label to a unique integer value.
Normalization: The feature columns of both datasets were normalized using the min-max normalization method.
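The three preprocessing operations above can be sketched on a toy table as follows. This is a minimal, library-free illustration rather than the authors' pipeline: the column names, the use of `None` as the missing-value marker, and the reading of "category" as the class label are all assumptions.

```python
from collections import Counter, defaultdict


def fill_missing_with_class_mode(rows, column, label_key="label", missing=None):
    """Fill missing values in `column` with the most frequent value seen in
    rows of the same class; fall back to the overall mode if a class has none."""
    per_class = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        if row[column] is not missing:
            per_class[row[label_key]][row[column]] += 1
            overall[row[column]] += 1
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row[column] is missing:
            source = per_class[row[label_key]] or overall
            row[column] = source.most_common(1)[0][0]
        filled.append(row)
    return filled


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1.0 if row[column] == cat else 0.0
        encoded.append(new_row)
    return encoded


def label_encode(labels):
    """Map each distinct label to an integer 0..n-1 (alphabetical order)."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy records (hypothetical values, not taken from either dataset).
rows = [
    {"dur": 2.0, "service": "http", "label": "Normal"},
    {"dur": 4.0, "service": None, "label": "Normal"},
    {"dur": 6.0, "service": "dns", "label": "Worms"},
    {"dur": 10.0, "service": None, "label": "Worms"},
]
rows = fill_missing_with_class_mode(rows, "service")
rows = one_hot(rows, "service")
targets, label_map = label_encode([r["label"] for r in rows])
durations = min_max_normalize([r["dur"] for r in rows])
```

In practice the same steps would normally be carried out with a data-frame library; the plain-Python form is used here only to make each operation explicit.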
(2) Feature selection
After preprocessing, the UNSW-NB15 dataset had a high dimensionality of 195 features. Given the potential negative impact of high-dimensional sparsity on model performance, the Random Forest algorithm was employed for feature selection. Features with an importance score above 0.003 were initially selected. The optimal number of features was subsequently determined by evaluating the classification accuracy of various feature subsets using an MLP classifier.
Figure 5 displays the features with importance scores exceeding 0.003, while Figure 6 illustrates the classification accuracy of the MLP classifier for different numbers of features. As a result, the dataset was reduced to 28 features.
The CIC-IDS2017 dataset did not undergo one-hot encoding and therefore had lower data sparsity, so no feature selection was performed.
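The two-stage selection described for UNSW-NB15 (an importance threshold followed by a search over subset sizes) can be sketched as below. The importance scores and accuracies are hypothetical, the `evaluate` callable stands in for training and scoring the MLP classifier, and the assumption that candidate subsets are the top-k features in importance order is ours.

```python
def select_by_importance(importances, threshold=0.003):
    """Return the features whose importance exceeds `threshold`,
    ordered from most to least important."""
    kept = [(name, score) for name, score in importances.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


def best_subset_size(ranked_features, evaluate):
    """Score the top-k features for every k and return (k, accuracy) of the
    best subset. `evaluate` maps a feature list to a validation accuracy."""
    best_k, best_acc = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        acc = evaluate(ranked_features[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc


# Hypothetical Random Forest importances for four features.
importances = {"feat_a": 0.120, "feat_b": 0.090, "feat_c": 0.004, "feat_d": 0.0005}
ranked = select_by_importance(importances)

# Stand-in for "train an MLP on these features and report accuracy".
toy_accuracy = {1: 0.80, 2: 0.85, 3: 0.84}
best_k, best_acc = best_subset_size(ranked, lambda feats: toy_accuracy[len(feats)])
```

Here `feat_d` is dropped by the threshold, and the subset search keeps the two most important of the remaining three features.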
5.4. Experimental Procedure and Results Analysis
5.4.1. The Training Experiment of the VAE-WACGAN Model
As described in Section 4.2.2, the dataset was divided into a training set and a test set after preprocessing. A subset containing only minority class samples was selected from the training set for training the VAE-WACGAN model. For the UNSW-NB15 dataset, these minority classes include Analysis, Backdoor, Shellcode, and Worms. For the CIC-IDS2017 dataset, the minority classes include DoS GoldenEye, FTP-Patator, SSH-Patator, DoS slowloris, and DoS Slowhttptest.
The hyperparameters for training the VAE-WACGAN model were determined through extensive experimentation. The model is set to train for 2000 epochs; each batch consists of 256 samples; the learning rates for the encoder, decoder, and discriminator are set to 0.001, 0.001, and 0.0001, respectively. Before each training iteration of the encoder and decoder, the discriminator is trained five times; the weight for the reconstruction loss is set to 1.
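The hyperparameters and the alternating update order can be written down compactly, as in the sketch below. Only the numeric values come from the text; the field names (for example `n_critic`, a name borrowed from common WGAN usage) and the schedule helper are illustrative, and the actual gradient steps are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported for VAE-WACGAN (field names are hypothetical)."""
    epochs: int = 2000
    batch_size: int = 256
    lr_encoder: float = 1e-3
    lr_decoder: float = 1e-3
    lr_discriminator: float = 1e-4
    n_critic: int = 5          # discriminator updates per encoder/decoder update
    recon_weight: float = 1.0  # weight of the reconstruction loss


def update_schedule(num_generator_steps, n_critic):
    """Yield the order of updates: `n_critic` discriminator steps,
    then one joint encoder/decoder step, repeated."""
    for _ in range(num_generator_steps):
        for _ in range(n_critic):
            yield "discriminator"
        yield "encoder_decoder"


cfg = TrainConfig()
steps = list(update_schedule(num_generator_steps=2, n_critic=cfg.n_critic))
```

With these settings, every encoder/decoder update is preceded by five discriminator updates, so two encoder/decoder updates correspond to twelve optimizer steps in total.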
Taking the UNSW-NB15 dataset as an example, the VAE-WACGAN model was trained, and the loss curves and discriminator score variation during the training process were plotted. As shown in Figure 7, the losses of the discriminator, decoder, and encoder all decrease rapidly at the beginning of training and converge after approximately 10,000 batch iterations. This indicates effective learning in both the generative and discriminative parts of the model during adversarial training. According to Figure 8, after 1000 training epochs, the discriminator scores for both real and fake samples converge to approximately 0.5, indicating that the VAE-WACGAN model has reached an equilibrium state and achieved convergence. At this point, the fake samples generated by the decoder are highly similar to real samples, making it difficult for the discriminator to distinguish between them.
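The equilibrium criterion read off Figure 8 (both real and fake scores settling near 0.5) can be expressed as a simple check on logged scores. The window length and tolerance below are hypothetical choices, and the score traces are synthetic; this is one possible way to automate the visual judgement, not a procedure stated in the paper.

```python
from statistics import fmean


def near_equilibrium(real_scores, fake_scores, window=100, target=0.5, tol=0.05):
    """Return True if the mean of the last `window` real scores and the mean
    of the last `window` fake scores are both within `tol` of `target`."""
    if len(real_scores) < window or len(fake_scores) < window:
        return False
    real_mean = fmean(real_scores[-window:])
    fake_mean = fmean(fake_scores[-window:])
    return abs(real_mean - target) <= tol and abs(fake_mean - target) <= tol


# Synthetic traces: scores start far apart, then settle near 0.5.
real_trace = [0.9] * 50 + [0.51] * 100
fake_trace = [0.1] * 50 + [0.49] * 100

converged_late = near_equilibrium(real_trace, fake_trace, window=100)
converged_early = near_equilibrium(real_trace[:60], fake_trace[:60], window=50)
```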
5.4.2. The Training Experiment of the VAEGAN Model
The VAE-WACGAN model is an improved version of the VAEGAN model. To validate the effectiveness of the VAE-WACGAN model, a direct comparison of the two models is necessary.
A training experiment for the VAEGAN model was conducted. Its network structure is shown in Table 7, with the symbols having the same meanings as described in Section 4.1.1. The training hyperparameters for VAEGAN were determined through multiple experiments: the number of training epochs was set to 1000, each batch contained 256 samples, and the learning rates for the encoder, decoder, and discriminator were set to 0.000521, 0.000521, and 0.0001, respectively. The weight for the reconstruction loss was set to 1. The configuration of the loss functions and the detailed training process follow the original VAEGAN paper [5].
Since the VAEGAN model cannot simultaneously learn the data distribution of multiple minority class samples, a separate model was trained for each minority class. Taking the Analysis class data from the UNSW-NB15 dataset as an example, the VAEGAN model was trained, and the loss curves and discriminator score variations during the training process were plotted. According to Figure 9, although the discriminator and decoder losses of the VAEGAN model show a general downward trend during training, they fluctuate significantly, reflecting instability and difficulty in converging. In contrast, the discriminator and decoder losses of the VAE-WACGAN model decrease steadily and stabilize quickly, indicating that VAE-WACGAN trains considerably more stably than VAEGAN. Comparing Figure 8 and Figure 10, it can be seen that the discriminator scores for both real and fake samples tend towards 0.5 at the end of training for both the VAE-WACGAN and VAEGAN models. However, the VAE-WACGAN model exhibits smaller fluctuations in scores, indicating that it achieves better final convergence.
5.4.3. Comparison of the VAE-WACGAN with Various Class-Balancing Methods
To objectively evaluate the data augmentation effectiveness of the VAE-WACGAN algorithm, it is compared with three traditional class-balancing algorithms: ROS, SMOTE, and ADASYN, as well as the VAEGAN algorithm.
The performance of different class-balancing algorithms was validated using an MLP [38] classification model. This classification model was obtained from the scikit-learn (sklearn) library and configured with the Adam optimizer, a regularization parameter of 1 × 10⁻⁴, three hidden layers of sizes 256, 128, and 64, and a maximum of 200 iterations.
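A configuration matching this description might look as follows in scikit-learn. It is a sketch under stated assumptions: the "regularization parameter" is taken to be the `alpha` (L2 penalty) argument of `MLPClassifier`, the `random_state` is added only for reproducibility, and the data are synthetic stand-ins generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Evaluation classifier as described in the text; alpha is assumed to be
# the regularization parameter, and random_state is our addition.
clf = MLPClassifier(
    hidden_layer_sizes=(256, 128, 64),
    solver="adam",
    alpha=1e-4,
    max_iter=200,
    random_state=0,
)

# Synthetic stand-in data: 200 samples, 28 features, 3 classes.
X, y = make_classification(
    n_samples=200,
    n_features=28,
    n_informative=10,
    n_classes=3,
    random_state=0,
)
clf.fit(X, y)
predictions = clf.predict(X)
```

Apart from the hidden layer sizes, these values coincide with the library defaults for `solver`, `alpha`, and `max_iter`.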
The classes and the number of samples generated by each class-balancing method were consistent, as shown in Table 8 and Table 9. The performance of different class-balancing methods on the UNSW-NB15 dataset is illustrated in Figure 11 and Table 10, while the performance on the CIC-IDS2017 dataset is shown in Figure 12 and Table 11.
On the UNSW-NB15 dataset, traditional class-balancing methods such as SMOTE, ROS, and ADASYN negatively impact the classifier’s performance.
Specifically, SMOTE reduces the false positive rate by 0.51% and improves precision by 0.41%. However, it results in decreased recall by 2.36%, F1 score by 0.41%, G-means by 0.54%, and accuracy by 2.36%. Similarly, ROS lowers the false positive rate by 0.33%, but it results in a decrease in precision by 0.38%, recall by 2.73%, F1 score by 1.01%, G-means by 0.92%, and accuracy by 2.73%. Meanwhile, ADASYN leads to declines in all metrics. These negative impacts suggest that traditional class-balancing methods are insufficient for capturing the complex data distribution, thereby introducing noise points that do not conform to the true distribution characteristics, ultimately reducing the classifier’s performance.
In contrast, the VAEGAN method enhances performance, with precision increasing by 0.69%, recall by 0.18%, F1 score by 0.77%, G-means by 0.55%, and accuracy by 0.18%, alongside a decrease in the false positive rate by 0.59%. The VAE-WACGAN method provides the largest improvement, increasing precision by 1.40%, recall by 1.12%, F1 score by 2.12%, G-means by 1.5%, and accuracy by 1.12%, while also decreasing the false positive rate by 1.06%. This indicates that both the VAEGAN and VAE-WACGAN methods effectively capture the distribution characteristics of the UNSW-NB15 dataset, resulting in the generation of high-quality samples and enhanced classifier performance. Nonetheless, based on the final classifier metrics, the VAE-WACGAN method exhibits superior data augmentation effects compared to the VAEGAN method.
On the CIC-IDS2017 dataset, SMOTE and VAEGAN algorithms negatively impact the classifier’s performance, while ROS, ADASYN, and VAE-WACGAN algorithms effectively enhance the classifier’s performance.
Although the SMOTE and VAEGAN algorithms improve the classifier’s G-means and false positive rate (SMOTE increases the G-means by 0.13% and reduces the false positive rate by 0.36%; VAEGAN increases the G-means by 0.02% and reduces the false positive rate by 0.018%), the remaining four metrics all decline. This indicates that these two algorithms do not effectively capture the distribution characteristics of the CIC-IDS2017 dataset, thereby introducing noise data.
For the ROS, ADASYN, and VAE-WACGAN algorithms, all performance metrics of the classifier are effectively improved.
Specifically, the ROS algorithm increases the classifier’s precision by 0.03%, recall by 0.02%, F1 score by 0.02%, reduces the false positive rate by 0.36%, increases G-means by 0.2%, and improves accuracy by 0.02%.
The ADASYN and VAE-WACGAN algorithms both improve the classifier’s precision by approximately 0.1% (ADASYN by 0.1%, VAE-WACGAN by 0.09%), recall by 0.09%, F1 score by 0.09%, reduce the false positive rate by 0.29%, increase G-means by 0.19%, and improve accuracy by 0.09%. This indicates that these three algorithms effectively capture the distribution characteristics of the CIC-IDS2017 dataset, thereby generating high-quality samples. Regarding various performance metrics, the enhancement effect of the VAE-WACGAN algorithm is comparable to ADASYN and superior to the other class-balancing methods.
The above experimental results indicate that on both the UNSW-NB15 and CIC-IDS2017 datasets, the VAE-WACGAN method improves all performance metrics of the classifier. On UNSW-NB15 it outperforms the other four class-balancing methods. On CIC-IDS2017 it outperforms ROS, SMOTE, and VAEGAN and is comparable to ADASYN, trailing it only marginally in precision (0.09% versus 0.1%). This is likely due to CIC-IDS2017’s simpler characteristics, which make it easier to classify. Consequently, simpler methods like ADASYN are already effective in addressing class imbalance, and the more sophisticated VAE-WACGAN method offers no additional benefit in this less complex context.
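In the results above, recall and accuracy always change by the same amount, which is what one would expect if the multi-class metrics are averaged with class-support weights, since support-weighted recall is identical to accuracy. The sketch below computes the six metrics from a confusion matrix under that assumption. The weighting scheme and the G-mean definition used here (the square root of recall times specificity) are assumptions, as the definitions are given outside this section, and the confusion matrix is a toy example.

```python
from math import sqrt


def weighted_metrics(cm):
    """Support-weighted one-vs-rest metrics from a square confusion matrix
    `cm`, where cm[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    out = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "fpr": 0.0, "g_mean": 0.0}
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        g_mean = sqrt(recall * (1.0 - fpr))  # assumed definition
        weight = (tp + fn) / total           # class support share
        out["precision"] += weight * precision
        out["recall"] += weight * recall
        out["f1"] += weight * f1
        out["fpr"] += weight * fpr
        out["g_mean"] += weight * g_mean
    out["accuracy"] = sum(cm[k][k] for k in range(len(cm))) / total
    return out


# Toy three-class confusion matrix with one dominant class.
toy_cm = [
    [90, 5, 5],
    [4, 14, 2],
    [1, 1, 8],
]
metrics = weighted_metrics(toy_cm)
```

For any confusion matrix, `metrics["recall"]` and `metrics["accuracy"]` coincide under this weighting, because each class's recall is multiplied by its share of the samples.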
5.4.4. Analysis of Model Complexity
Analyzing model complexity is essential for understanding the computational requirements and efficiency of a model. In this experiment, a comparative complexity analysis of VAE-WACGAN and VAEGAN was conducted on two datasets to rigorously evaluate the performance of VAE-WACGAN and validate its improvements.
Model complexity is typically measured using two key metrics: Floating Point Operations (FLOPs) and the number of parameters. FLOPs represent the total number of floating point operations required for a single forward pass, reflecting the model’s time complexity. The number of parameters refers to the total count of trainable parameters, indicating the model’s space complexity. The complexity of the VAE-WACGAN and VAEGAN models across two datasets is presented in Table 12.
Table 12 reveals that the VAE-WACGAN model exhibits markedly higher complexity than VAEGAN on both datasets. Specifically, on the UNSW-NB15 dataset, VAE-WACGAN has 2.743 GFLOPs and 1.509 million parameters, compared with VAEGAN’s 0.331 GFLOPs and 0.738 million parameters. Similarly, on the CIC-IDS2017 dataset, VAE-WACGAN requires 3.732 GFLOPs and 4.379 million parameters, compared with VAEGAN’s 0.976 GFLOPs and 2.072 million parameters. These results indicate that while VAE-WACGAN demands greater computational resources and incurs higher storage costs than VAEGAN, it delivers superior data augmentation performance on both datasets, as demonstrated in Table 10 and Table 11. The increased complexity of VAE-WACGAN likely enhances its representational capability, allowing it to capture more intricate data features and thereby improve its effectiveness in data augmentation tasks.
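To make the two complexity measures concrete, the sketch below counts parameters and forward-pass FLOPs for a stack of fully connected layers. The layer sizes are illustrative and are not the VAE-WACGAN or VAEGAN architectures, and the FLOP convention (one multiply and one add per weight, biases and activations ignored) is an assumption; profiling tools differ on this point, so the figures in Table 12 should not be expected to follow from this formula.

```python
def dense_stack_complexity(layer_sizes):
    """Return (parameters, FLOPs per sample) for consecutive fully connected
    layers with the given widths, e.g. [in, hidden1, hidden2, out]."""
    params = 0
    flops = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params += n_in * n_out + n_out  # weights plus biases
        flops += 2 * n_in * n_out       # one multiply and one add per weight
    return params, flops


# Illustrative stack: 28 inputs followed by layers of 256, 128 and 64 units.
params, flops = dense_stack_complexity([28, 256, 128, 64])
```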
5.4.5. Visualization Comparison of the Original and Balanced Datasets
In the previous analysis, the effectiveness of the VAE-WACGAN method was validated from three perspectives: the training process curves, the performance comparison of class-balancing methods, and model complexity. To assess the data augmentation effect of the VAE-WACGAN method more intuitively, this experiment visualizes the original dataset alongside the balanced dataset produced by the VAE-WACGAN method. The steps were as follows: the t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm [39] was first used to perform dimensionality reduction on the original and balanced datasets, and then both were visualized. Given the large sample size and numerous categories of the datasets used, reducing the data to two dimensions may not clearly show the distribution characteristics. Therefore, the data was reduced to three dimensions for visualization. The visualization results are shown in Figure 13.
Figure 13a,b reveals that the original UNSW-NB15 dataset exhibits severe class imbalance, with samples from minority classes such as Analysis, Backdoor, Shellcode, and Worms being so scarce that they are nearly indistinguishable in (a). However, after applying the VAE-WACGAN method, these minority class samples become visible in (b), with a notable increase in their proportions. As shown in Figure 13c,d, the original CIC-IDS2017 dataset also has a class imbalance issue. However, after processing with the VAE-WACGAN method, the minority classes are significantly enhanced in the balanced dataset.
These experimental results indicate that, from the intuitive visualization, the VAE-WACGAN method can increase the proportion of minority class samples, effectively addressing the class imbalance issue in the datasets.
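The three-dimensional t-SNE reduction used for Figure 13 can be sketched with scikit-learn as shown below. The data are synthetic Gaussian clusters of unequal size standing in for an imbalanced dataset, and the `perplexity`, `init`, and `random_state` values are our own choices, not settings reported in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: three 28-dimensional clusters with unequal sizes.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(40, 28)),
    rng.normal(loc=3.0, scale=0.5, size=(15, 28)),
    rng.normal(loc=-3.0, scale=0.5, size=(5, 28)),
])
labels = np.repeat([0, 1, 2], [40, 15, 5])

# Reduce to three dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=3, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
```

Each row of `embedding` can then be drawn as a point in a 3D scatter plot coloured by `labels`, once for the original data and once for the balanced data.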
5.4.6. Multi-Classifier Validation of the VAE-WACGAN Data Augmentation
To comprehensively evaluate the VAE-WACGAN method, this experiment employs four different classifiers to observe the performance changes after applying the VAE-WACGAN method. The four classifiers include two shallow classifiers: Random Forest (RF) [40] and Support Vector Machine (SVM) [41]; and two deep classifiers: Multi-Layer Perceptron (MLP) [38] and One-Dimensional Convolutional Neural Network (1DCNN) [42]. The performance of each classifier on the UNSW-NB15 dataset is shown in Figure 14 and Table 13, and on the CIC-IDS2017 dataset in Figure 15 and Table 14.
The 1DCNN classifier consists of a four-layer network structure. The first two layers are convolutional layers, followed by two fully connected layers. Each of the first three layers applies batch normalization and the ReLU activation function. Both convolutional layers have a kernel size of 3, with stride and padding set to 1, and the number of kernels set to 32 and 64, respectively. The first fully connected layer contains 256 neurons, while the second fully connected layer outputs classification predictions, with the number of neurons corresponding to the number of sample classes. The following training hyperparameters were used for the 1DCNN: 200 training epochs, a batch size of 512 samples, a learning rate of 0.001, and the Adam optimizer for weight updates. These hyperparameters were determined through multiple experiments.
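The layer description above fixes the tensor shapes once the input length is known, which the following sketch traces without any deep learning framework. It assumes a single input channel, no pooling between layers, and a flatten step before the first fully connected layer; none of these is stated explicitly in the text. Batch normalization parameters are not counted.

```python
KERNEL, STRIDE, PADDING = 3, 1, 1


def conv1d_out_len(length, kernel=KERNEL, stride=STRIDE, padding=PADDING):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def cnn1d_summary(n_features, n_classes, channels=(32, 64), fc_hidden=256):
    """Return a list of (layer name, output shape, parameter count)."""
    layers = []
    length, in_ch = n_features, 1  # assumption: one input channel
    for out_ch in channels:
        length = conv1d_out_len(length)
        params = in_ch * out_ch * KERNEL + out_ch
        layers.append((f"conv1d {in_ch}->{out_ch}", (out_ch, length), params))
        in_ch = out_ch
    flat = in_ch * length  # assumption: flatten, no pooling
    layers.append((f"fc {flat}->{fc_hidden}", (fc_hidden,), flat * fc_hidden + fc_hidden))
    layers.append((f"fc {fc_hidden}->{n_classes}", (n_classes,), fc_hidden * n_classes + n_classes))
    return layers


# UNSW-NB15 after feature selection: 28 features, 10 classes.
summary = cnn1d_summary(n_features=28, n_classes=10)
```

With kernel size 3, stride 1, and padding 1, each convolution preserves the sequence length, so under these assumptions the first fully connected layer receives 64 × 28 inputs.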
On the UNSW-NB15 dataset, the VAE-WACGAN method significantly enhances the performance of most classifiers, especially for the MLP and 1DCNN classifiers.
For the MLP classifier, VAE-WACGAN-MLP improves Precision from 0.8523 to 0.8663, Recall from 0.8467 to 0.8579, F1 score from 0.8340 to 0.8552, G-means from 0.8845 to 0.8995, Accuracy from 0.8467 to 0.8579, and reduces FPR from 0.0355 to 0.0249. For the 1DCNN classifier, VAE-WACGAN-1DCNN improves Precision from 0.848 to 0.8563, Recall from 0.8408 to 0.8533, F1 score from 0.8345 to 0.8492, G-means from 0.8845 to 0.8983, Accuracy from 0.8408 to 0.8533, and reduces FPR from 0.0364 to 0.0256.
The VAE-WACGAN method slightly improves the performance of the RF classifier. Notably, VAE-WACGAN-RF improves Precision from 0.8914 to 0.8960, Recall from 0.8771 to 0.8781, F1 score from 0.8673 to 0.8674, G-means from 0.9116 to 0.9182, Accuracy from 0.8771 to 0.8781, and reduces FPR from 0.0173 to 0.0167.
For the SVM classifier, VAE-WACGAN-SVM yields marginal gains in Precision, FPR, and G-means, with Precision increasing from 0.8088 to 0.8099, FPR decreasing from 0.0615 to 0.0606, and G-means increasing from 0.8379 to 0.8381. However, there is a slight decrease in Recall, F1 score, and Accuracy, leading to a minor overall decline in performance.
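The MLP scores for UNSW-NB15 quoted above reproduce the improvements reported in Section 5.4.3, which shows that those percentages are absolute differences in percentage points rather than relative changes. The short check below uses only values stated in the text; the dictionary keys and function name are our own.

```python
# MLP on UNSW-NB15, before and after VAE-WACGAN augmentation (values from the text).
baseline = {
    "precision": 0.8523, "recall": 0.8467, "f1": 0.8340,
    "g_mean": 0.8845, "accuracy": 0.8467, "fpr": 0.0355,
}
augmented = {
    "precision": 0.8663, "recall": 0.8579, "f1": 0.8552,
    "g_mean": 0.8995, "accuracy": 0.8579, "fpr": 0.0249,
}


def percentage_point_change(before, after):
    """Absolute change of each metric, expressed in percentage points."""
    return {name: round((after[name] - before[name]) * 100, 2) for name in before}


deltas = percentage_point_change(baseline, augmented)
```

The resulting differences are 1.40 for precision, 1.12 for recall and accuracy, 2.12 for F1 score, 1.50 for G-means, and a drop of 1.06 for FPR, matching the figures given for VAE-WACGAN in Section 5.4.3.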
On the CIC-IDS2017 dataset, the VAE-WACGAN method also enhances classifier performance.
For the MLP and 1DCNN classifiers, the VAE-WACGAN method improves all evaluation metrics. Compared to MLP, VAE-WACGAN-MLP increases Precision from 0.9879 to 0.9888, Recall from 0.9876 to 0.9885, F1 score from 0.9877 to 0.9886, G-means from 0.9848 to 0.9867, Accuracy from 0.9876 to 0.9885, and reduces FPR from 0.0179 to 0.015. Compared to 1DCNN, VAE-WACGAN-1DCNN increases Precision from 0.9666 to 0.9776, Recall from 0.9505 to 0.9746, F1 score from 0.9547 to 0.9754, G-means from 0.9542 to 0.9699, Accuracy from 0.9505 to 0.9746, and reduces FPR from 0.0415 to 0.0345.
For the RF classifier, VAE-WACGAN-RF improves Precision from 0.9986 to 0.9987, Recall from 0.9986 to 0.9987, F1 score from 0.9986 to 0.9987, G-means from 0.9978 to 0.9979, Accuracy from 0.9986 to 0.9987, and reduces FPR from 0.003 to 0.0029. These results indicate that the VAE-WACGAN method marginally improves the performance of the RF classifier, positively impacting key metrics.
For the SVM classifier, although VAE-WACGAN-SVM improves Precision and FPR, with Precision increasing from 0.9521 to 0.9589 and FPR decreasing from 0.0944 to 0.0695, Recall, F1 score, G-means, and Accuracy all decline, resulting in an overall performance drop.
In summary, the VAE-WACGAN method improves the MLP, 1DCNN, and RF classifiers on both datasets, with the largest gains for MLP and 1DCNN, while the SVM classifier shows a slight overall decline. These results further support the data augmentation capability of the VAE-WACGAN method.
5.4.7. Comparison with Recent Advanced Methods
The multi-classifier validation experiment in Section 5.4.6 indicated that the intrusion detection model achieved optimal performance with the RF classifier. Therefore, this experiment compares the VAE-WACGAN-RF model with recent advanced approaches to assess the feasibility of the proposed intrusion detection method, as shown in Table 15 and Table 16.
Based on Table 15 and Table 16, our method consistently outperforms the other three advanced intrusion detection approaches across all evaluation metrics on both the UNSW-NB15 and CIC-IDS2017 datasets. Notably, in terms of the F1 score, a critical metric that balances precision and recall, our method demonstrates substantial improvements. On the UNSW-NB15 dataset, our method exceeds FCWGAN-BiLSTM by 0.9%, CNN-BiLSTM by 6.79%, and MCNN-DFS by 5.74%. Similarly, on the CIC-IDS2017 dataset, our approach outperforms KD-TCNN by 0.41%, KNN-TACGAN by 4.06%, and GAN-RF by 4.83%. These results clearly demonstrate the effectiveness of our method in performing intrusion detection.