### *4.3. Implementation*

We employed TensorFlow 2.6 as the deep learning framework to implement our approach. The feature extractor module was a discriminative autoencoder, and the classifier was a standard multi-layer perceptron (MLP) neural network. Table 3 presents the architectural details of both modules.


**Table 3.** Architectural details of our models.
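Since Table 3 is not reproduced here, the layer sizes below are illustrative assumptions; the sketch only shows how the two modules described above could be wired up in TensorFlow/Keras, with a discriminative autoencoder whose encoder feeds a small MLP classifier:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# All layer sizes here are illustrative assumptions; the actual values are
# those listed in Table 3 of the paper.
INPUT_DIM = 41    # e.g. the number of NSL-KDD features
LATENT_DIM = 16

def build_feature_extractor(input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
    """Discriminative autoencoder used as the feature extractor.
    Returns the full autoencoder (trained in the meta-training stage)
    and the encoder alone (reused in the few-shot detection stage)."""
    inputs = layers.Input(shape=(input_dim,))
    h = layers.Dense(32, activation="relu")(inputs)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(h)
    h = layers.Dense(32, activation="relu")(latent)
    recon = layers.Dense(input_dim, name="reconstruction")(h)
    return Model(inputs, recon), Model(inputs, latent)

def build_classifier(latent_dim=LATENT_DIM):
    """Standard MLP classifier trained on top of the (frozen) encoder."""
    return tf.keras.Sequential([
        layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
        layers.Dense(1, activation="sigmoid"),
    ])
```

In the few-shot stage, the encoder's output would be passed to the classifier, so only the small MLP needs to be fitted on the handful of labeled samples.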

### *4.4. Experiments and Results*

We conducted the following experiments to evaluate our few-shot detection approach.

### 4.4.1. Experiment 1: Detecting Mutants of Existing Attacks

The objective of this experiment was to evaluate our approach in detecting mutants of existing attacks. Malicious users develop mutants of existing attacks to evade detection, since conventional supervised learning models detect such mutants poorly, especially when a mutant varies significantly from the known attack.

To accomplish this, we extracted categories of attacks that we believed to comprise variants of one another in both the NSL-KDD and CIC-IDS2017 datasets. Tables 1 and 2 show the composition of these attack mutants.

To evaluate our approach in detecting a particular mutant, we used all the other mutants of the attack as the data source for the meta-training stage. For instance, if the mutant we wanted to detect was DoS hulk, the meta-training data source would comprise all the other DoS subclasses (golden eye, slow loris, slow http test), and DoS hulk would serve as the data source for the meta-testing stage. We then trained our feature extractor on the meta-training set. Thereafter, in the few-shot detection stage, we selected 10 samples from the meta-testing set and trained our classifier on top of the feature extractor. We applied the same procedure to all attacks in both the NSL-KDD and CIC-IDS2017 datasets, but only for classes with a sufficient number of samples: on NSL-KDD this allowed us to run the experiment on the DoS and probe classes, while on CIC-IDS2017 only the DoS class had enough samples.
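The leave-one-mutant-out procedure described above can be sketched in plain Python; the subclass names and feature vectors below are toy stand-ins for the real dataset rows:

```python
import random

# Hypothetical toy data: DoS subclass name -> list of feature vectors.
dos_mutants = {
    "hulk":           [[0.1, 0.2]] * 50,
    "golden_eye":     [[0.3, 0.1]] * 50,
    "slow_loris":     [[0.2, 0.4]] * 50,
    "slow_http_test": [[0.5, 0.3]] * 50,
}

def leave_one_mutant_out(mutants, target, k_shot=10, seed=0):
    """Meta-training data = all sibling mutants of the target; meta-testing
    data = the target mutant itself, from which k_shot samples are drawn to
    train the few-shot classifier."""
    meta_train = [x for name, xs in mutants.items() if name != target for x in xs]
    meta_test = list(mutants[target])
    few_shot = random.Random(seed).sample(meta_test, k_shot)
    return meta_train, meta_test, few_shot

meta_train, meta_test, few_shot = leave_one_mutant_out(dos_mutants, "hulk")
print(len(meta_train), len(meta_test), len(few_shot))  # 150 50 10
```

The feature extractor would be trained on `meta_train`, the classifier on the 10 `few_shot` samples, and evaluation performed on the remainder of `meta_test`.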

Figures 1–3 display the results of our experiment on both the CIC-IDS2017 and NSL-KDD datasets. As can be seen from the results, our model achieves its highest detection rate (DR) of 99.9% on the DoS attack mutant Neptune (Figure 1), followed by 90.0% and 80.0% on the tear drop and smurf attacks, respectively. These are good results, considering that we used only a few-shot sample to train our classifier. However, our model achieves a DR of 0.0% on the DoS attack mutant back; in fact, it scores 0.0% on all the other performance metrics (accuracy and precision) as well. This indicates that the back mutant differs significantly from the other DoS subclasses.
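For reference, the metrics reported here (DR, accuracy, precision) follow the standard definitions, where the detection rate is the recall on the attack class; a minimal stdlib computation looks like this:

```python
def detection_metrics(y_true, y_pred):
    """Detection rate (recall on the attack class), accuracy, and precision,
    for binary labels where 1 = attack and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    dr = tp / (tp + fn) if tp + fn else 0.0          # detection rate = recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return dr, accuracy, precision
```

A DR of 0.0% on the back mutant therefore means no back samples at all were flagged as attacks, which also forces precision to 0.0% when back is the only attack class under test.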

**Figure 1.** Results of few-shot detection of DoS attack mutant for NSL-KDD dataset.

**Figure 2.** Result of few-shot detection of probe attack mutants for NSL-KDD dataset.

**Figure 3.** Result of few-shot detection on DoS attack mutant for CIC-IDS2017 dataset.

For the probe attack mutants (Figure 2), our model also performs well, achieving its highest DR score of 99.8% on the port sweep attack, and DR scores of 99.6%, 80.0%, and 70.7% on IP sweep, Nmap, and Satan, respectively. All the attack mutants achieved a good DR, which indicates that they share many attributes that our feature extractor was able to discover.

Similarly, on the CIC-IDS2017 dataset our model performed well. It achieved its highest DR score of 99.1% on the DoS http test attack (Figure 3), and DR scores of 93.5%, 86.7%, and 80.0% on the DoS golden eye, DoS slow loris, and DoS hulk attacks, respectively. The results follow a similar trend to those on the NSL-KDD dataset.

We also compared our model's DR with baseline models, employing a combination of classical and deep learning algorithms as baselines. For deep learning we used a multi-layer perceptron (MLP), while for the classical ML models we used k-nearest neighbors (KNN), decision tree, support vector machine (SVM), and random forest. All baseline models were trained on the full dataset.
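To make the comparison concrete, one of the classical baselines can be illustrated with a minimal k-nearest-neighbor classifier in plain Python (the actual baselines would typically come from a library such as scikit-learn; this is only a sketch of the technique):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Minimal k-nearest-neighbor baseline: majority vote over the k
    training points closest to x by Euclidean distance."""
    dists = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

Unlike our few-shot model, such a baseline sees the full training set for every attack class, which is the key asymmetry in the comparisons below.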

Figures 4–6 depict the results of these comparisons. As the results show, there was not much difference in performance between the DL models, the ML models, and our model; classical ML models are well known to perform well on small and medium-sized datasets. The highest DR score was 100% on the Neptune DoS mutant (Figure 4), achieved by all baselines, while our model scored 99.6% despite having been trained on only 10 examples, which is on par with the baseline results. The lowest DR score was on the back attack mutant, where our model's DR was 0.0%; among the baselines, the lowest DR was 40.4%, scored by SVM on the tear drop attack.

**Figure 4.** Detection rate (DR) comparison of various classification methods on DoS subclasses for NSL-KDD datasets.

**Figure 5.** Detection rate (DR) comparison of various classification methods on probe attack subclasses for NSL-KDD datasets.

**Figure 6.** Detection rate (DR) comparison of various classification methods on DoS subclasses for CIC-IDS2017 datasets.

In addition, for the probe attack class the results followed a similar trend: the highest DR score of 100% was achieved by the baseline models on the port sweep attack mutant (Figure 5), while our model was slightly worse, with a DR score of 99.8% on the same attack class.

Our model also recorded the lowest DR score, 70.7%, on the Satan attack mutant, while the lowest DR score among the baselines was 79.5%, scored by SVM on the Nmap attack mutant. Overall, there was no significant difference in performance: our model was either slightly worse than or on par with the baselines.

Similarly, Figure 6 presents the results of the comparison with the baseline models for the CIC-IDS2017 dataset. The highest-performing model was random forest, which achieved a DR of 100% on the DoS golden eye attack mutant. The lowest DR score was 48.5%, achieved by the decision tree model on the DoS http test attack mutant. As with the NSL-KDD dataset, our model performed competitively with the baselines, achieving DRs of 99.1%, 93.5%, 86.7%, and 80.0% on the DoS http test, DoS golden eye, DoS slow loris, and DoS hulk attacks, respectively.

### 4.4.2. Experiment 2: General Anomaly Detection

The objective here was to evaluate our approach in detecting broader classes of attacks that were not seen before. For example, the NSL-KDD dataset can be broadly categorized into the following classes: DoS, probe, R2L, and U2R. Similarly, the CIC-IDS2017 dataset, when classified broadly, is made up of the following classes: DoS, port scan, heart bleed, Botnet, SSH-Patator, FTP-Patator, and web-based attacks.

To perform this experiment, we applied the same logic as in experiment 1. For instance, to detect the probe class in the NSL-KDD dataset, the meta-training set consisted of all the other attack classes (DoS, R2L, and U2R), while the meta-testing set comprised the class we wanted to detect, in this case probe. We applied the same procedure to the CIC-IDS2017 dataset.
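The class-level split used here mirrors the mutant-level split of experiment 1; a minimal sketch with hypothetical toy data (class names from the NSL-KDD taxonomy, sample counts invented for illustration):

```python
# Hypothetical toy data: broad attack class -> list of feature vectors.
nsl_classes = {
    "DoS":   [[0.1, 0.2]] * 40,
    "Probe": [[0.3, 0.1]] * 30,
    "R2L":   [[0.2, 0.4]] * 20,
    "U2R":   [[0.5, 0.3]] * 10,
}

def split_for_class(classes, target):
    """Meta-training set = every broad class except the target; meta-testing
    set = the target class itself, i.e. the 'unseen' attack."""
    meta_train = [x for name, xs in classes.items() if name != target for x in xs]
    meta_test = list(classes[target])
    return meta_train, meta_test

meta_train, meta_test = split_for_class(nsl_classes, "Probe")
```

As in experiment 1, the few-shot classifier would then be trained on a small sample (here, 10 examples) drawn from `meta_test`.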

Figures 7 and 8 present the results of our experiments when the number of samples used to train our classifier was limited to 10, for NSL-KDD and CIC-IDS2017, respectively.

**Figure 7.** Result of experiment 2: general anomaly detection for NSL-KDD dataset.

**Figure 8.** Result of experiment 2: general anomaly detection for CIC-IDS2017 dataset.

For the NSL-KDD dataset (Figure 7), our approach performed reasonably well, especially on the DoS and probe classes, achieving its highest DR score of 90% on the DoS class and 70% on the probe class. Overall, however, our model's performance was lower than in experiment 1, and some attacks were poorly detected: its DR score was 66% for the R2L class, and it completely failed to detect the U2R attack class, scoring a DR of 0.0%.

The results on CIC-IDS2017 (Figure 8) follow a similar trend, with a drop in performance relative to experiment 1. The highest DR score was 94.1%, achieved on the DDoS attack class, while our model detected the two Patator attack classes poorly, scoring 34.8% and 47.6%.
