### **2. Related Works**

Traditionally, networks are defended against intrusion using signature-based techniques, whereby incoming network traffic is compared against commonly known attack patterns. These approaches perform well against previously known attacks, but fail to detect novel attacks.

Classical machine learning (ML) methods provide an improvement over traditional signature-based techniques. These methods exploit various features of network traffic, enabling them to detect attacks without explicit rule specifications [18]. Accordingly, popular classical ML approaches, such as K-nearest neighbors (KNN) [19], support vector machines (SVM) [20], decision trees (DT) [21], and random forests (RF) [22], have all been employed as network-based IDSs.

For example, Kutrannont et al. [23] proposed a KNN-based IDS. KNN operates on the assumption that a sample belongs to the class in which most of its top-K neighbors reside; the parameter K therefore affects the performance of the model. To account for uneven data distributions, Kutrannont et al. proposed a simplified neighborhood rule that selects a fixed percentage (50%) of neighboring samples as neighbors instead of using group rankings, and its efficiency is enhanced through parallel processing on a graphics processing unit (GPU). The algorithm performs well on sparse data, achieving an accuracy of 99.30%.
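As a rough illustration, a fixed-percentage neighborhood rule of this kind can be sketched in a few lines of numpy. This is a simplified toy version, not the parallel GPU implementation of [23]; the function name and data are hypothetical:

```python
import numpy as np

def percentage_knn_predict(X_train, y_train, x, pct=0.5):
    """Classify x by majority vote among the nearest pct fraction of the
    training samples (a simplification of the fixed-percentage rule)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    k = max(1, int(len(X_train) * pct))       # take a fixed percentage as neighbors
    nearest = y_train[np.argsort(dists)[:k]]  # labels of the k nearest samples
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]          # majority vote

# Toy example: two well-separated clusters (0 = benign, 1 = attack)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
pred = percentage_knn_predict(X, y, np.array([3.0, 3.0]))
```

With `pct=0.5`, half of the training set always acts as the neighborhood, regardless of how densely the data are distributed around the query.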

Goeschel et al. [24] employed a combination of SVM, decision tree (DT), and naïve Bayes classifiers. The SVM is first trained to perform binary classification, separating data instances into benign and malicious classes. The malicious instances are then categorized into specific attack classes using a DT classifier. However, since the DT can only separate known attack classes, a naïve Bayes classifier is further employed to identify unknown attack types. This hybrid method achieved an accuracy of 99.62% and a false alarm rate of 1.57%.

Malik et al. [25] proposed an IDS using random forest (RF) and particle swarm optimization (PSO). They trained the IDS in two stages: feature selection and classification. The PSO serves as the feature selection algorithm, selecting appropriate features for classifying attacks, while the RF serves as the classifier. They evaluated their approach using the KDD Cup'99 dataset and achieved detection rates of 99.92%, 99.49%, and 88.46% on the DoS, Probe, and U2R attack classes, respectively.

Recently, DL techniques have been widely adopted for network-based IDSs across a sizeable number of datasets. These techniques can operate directly on raw data, learn features, and perform classifications. Hence, they achieve better performance when compared to classical machine learning methods [26]. Deep learning models, such as the multi-layer perceptron (MLP), convolutional neural network (CNN), autoencoder (AE), and recurrent neural network (RNN), as well as deep generative networks, such as the deep belief network (DBN) and generative adversarial networks (GANs), have all been applied in the context of network-intrusion detection [27,28].

Min et al. [29] proposed an IDS named TR-IDS, which leverages both statistical features and payload features. They employed a CNN to extract important features from the payload. To accomplish this, they first encoded each byte in the payload into a word vector using skip-gram word embedding, and then applied the CNN to extract the features. The extracted features were then combined with the statistical features generated from each network flow, which included fields from the packet header and statistical attributes of the entire flow. The features were then used to train a random forest classifier, which achieved an accuracy of 99.13%.

In the work by Yin et al. [30], a recurrent neural network (RNN) was directly applied to intrusion detection tasks. The RNN model achieved better performance on the NSL-KDD dataset when compared with classical ML techniques, namely support vector machines and random forests.

Wang et al. employed a combination of a CNN and a long short-term memory (LSTM) network. Intuitively, the CNN learns the low-level spatial features of network traffic, while the LSTM learns the high-level temporal features of the data. The learned features enable the model to improve the false alarm rate of the IDS [31].

In another work, Al Qatf et al. employed a sparse autoencoder (AE) for dimensionality reduction; the reduced features are then used to train an SVM classifier. This enables the model to outperform classical machine learning methods [32].

Similarly, in a recent work by Narayana et al., a hybrid methodology involving a sparse autoencoder, an MLP, and an LSTM was employed. In the first stage, the autoencoder is trained in an unsupervised fashion with smoothed l1 regularization to enforce sparsity. This enables the autoencoder to learn sparse representations, which are then used to train the MLP and LSTM classifiers in the second stage. The model performs better than conventional deep learning classifiers in terms of detection rates and low false positive rates [33].

Another hybrid intrusion detection method employing both classical machine learning and deep learning techniques was proposed by Le et al. They first built a feature selection model, termed SFSD, that combines a sequential forward selection (SFS) algorithm with a decision tree. The SFSD algorithm selects the best subset of features, which are then used in the second stage to train various forms of RNN (the traditional RNN, LSTM, and gated recurrent unit (GRU)). The model achieves significant improvements in detection rates when compared with classical methods [34].

However, these techniques require huge amounts of labeled data during training in order to generalize well. The dynamic nature of the modern cyber-threat landscape makes it unfeasible or prohibitively expensive to acquire sufficient malicious samples to train deep learning classifiers. Therefore, a trend is developing towards techniques that require only a few malicious examples to achieve detection.

For example, Hindy et al. proposed an intrusion detection model using one-shot learning. The main idea of one-shot learning is to learn patterns and similarities from previously seen classes that enable classifying unseen classes using only one instance. Thus, one-shot learning is an instance of few-shot learning in which the number of examples is restricted to a single example [35]. To model an IDS using one-shot learning, Hindy et al. employed a Siamese neural network, a form of neural network consisting of twin networks. The Siamese network is trained on pairs of instances to learn patterns and similarities instead of fitting the model to fixed classes. During the training stage, the Siamese network therefore learns to discriminate between benign traffic and different classes of known cyber-attacks. At the evaluation stage, a new traffic instance is compared against all known classes (used during training) without any additional training. Although the approach provides a simple framework for one-shot learning, it generally achieves lower detection rates relative to other works [36].
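The evaluation step of such a Siamese setup can be sketched as follows: both inputs pass through the same embedding, and a query is assigned to the known class whose reference instance is closest in embedding space. The linear map standing in for the trained twin network, the class names, and the reference instances are all illustrative assumptions, not details from [36]:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))        # stand-in for the trained twin network's weights

def embed(x):
    """Shared embedding applied to both inputs (here a toy linear map + tanh)."""
    return np.tanh(W @ x)

def siamese_distance(a, b):
    """Distance between the twin embeddings; small means similar."""
    return np.linalg.norm(embed(a) - embed(b))

# One reference instance per known class, compared against a new query
references = {"benign": np.ones(8), "dos": -np.ones(8)}
query = np.ones(8) + rng.normal(0, 0.01, 8)   # near-duplicate of the benign reference
pred = min(references, key=lambda c: siamese_distance(query, references[c]))
```

No retraining is needed at evaluation time: new reference classes can simply be added to the dictionary.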

In another work, Xu et al. proposed an intrusion detection method using few-shot learning. They employed a deep neural network (DNN) architecture named FC-Net, which is composed of two parts: a feature extraction network and a comparison network. FC-Net is trained using a meta-learning approach consisting of two disjoint stages: meta-training and meta-testing. In the meta-training phase, the feature extraction network of FC-Net is trained using several meta-tasks, where a meta-task comprises a binary classification between an attack category and benign traffic. This enables FC-Net to learn a pair of feature maps, which are then used by the comparison network in the meta-testing stage to determine whether a new traffic instance belongs to the different classes of attacks learned during training [37]. However, one drawback of their approach is that it requires a complex DNN architecture and computationally intensive optimization procedures.

### **3. Our Proposed Few-Shot Intrusion Detection Method**

Supervised learning approaches for network intrusion detection require all categories of attacks to be known in advance, with a sufficient number of training examples available for each category. The basic task is to use a classifier, *f*, to infer labels for network traffic samples, *N*. The number of samples, *N*, is often very large and is simply split into two groups: a training set and a test set. Contrary to this, in real-world settings, new attacks frequently emerge, and only a subset of categories is known beforehand, with few examples per category. Therefore, in such scenarios, where the number of samples, *N*, is small, the problem is considered a few-shot classification. Applying a supervised learning method under these conditions leads to overfitting.

Few-shot learning is popularly addressed through the meta-learning paradigm, which is composed of meta-training and meta-testing. Each meta stage consists of a number of classification tasks, where each task comprises a pair: a training (support) set and a testing (query) set. The meta-training set is described as $T = \{(D_i^{train}, D_i^{test})\}_{i=1}^{I}$ and the meta-testing set as $S = \{(D_q^{train}, D_q^{test})\}_{q=1}^{Q}$, with each dataset containing pairs of data points and their ground-truth labels, i.e., $D^{train} = \{(x_t, y_t)\}_{t=1}^{T}$ and $D^{test} = \{(x_q, y_q)\}_{q=1}^{Q}$, sampled from the same distribution. The objective is to leverage the meta-training stage to learn good representations that enable quick adaptation to unseen tasks in the meta-testing stage, using powerful optimization techniques.

In a network intrusion detection context, a task, *T*, can simply be defined as a binary classification between normal network traffic samples and a category of malicious samples. Suppose that there are five different network traffic samples, *O*, *A*, *B*, *C* and *E*, such that sample *O* is benign network traffic, samples *A*, *B*, *C* represent known categories of attacks with sufficient examples, and the remaining sample, *E*, refers to a newly found category of attack with few examples. The goal is to identify the new attack sample, *E*, with as few examples as possible. Three different tasks can then be constructed, *T*1, *T*2 and *T*3, where each defines a binary classification task between the normal sample *O* and attack categories *A*, *B* and *C*, respectively. *T*1, *T*2 and *T*3 constitute the meta-training set, while the meta-test set consists of the normal sample, *O*, and the novel class, *E*, which has few examples. The idea is to leverage the meta-training stage to learn transferable knowledge from *T*1, *T*2 and *T*3 that will enable a classifier to accomplish task *T*4 (a binary classification between the normal sample *O* and attack category *E*) with as few examples as possible during the meta-testing phase. Thus, in our case, a discriminative autoencoder was employed to acquire such transferable knowledge.
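The task construction above can be sketched as plain Python data structures; the per-class sample lists are placeholders, not real traffic features:

```python
# Hypothetical per-class sample lists (placeholders for real traffic features)
O = ["o1", "o2", "o3"]                                # benign traffic
A, B, C = ["a1", "a2"], ["b1", "b2"], ["c1", "c2"]    # known attack categories
E = ["e1"]                                            # novel attack, few examples

# Meta-training: one binary task per known attack class vs. benign traffic
meta_train = [
    {"name": f"T{i + 1}", "normal": O, "attack": attack}
    for i, attack in enumerate([A, B, C])
]

# Meta-testing: the novel task T4, benign vs. the few-example class E
meta_test = {"name": "T4", "normal": O, "attack": E}
```

Each entry pairs the benign class with exactly one attack class, so every task is a binary problem of the same form as the target task T4.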

### *Feature Extraction with Discriminative Autoencoder*

Autoencoders have proved to be powerful models for learning representations in an unsupervised fashion. Discriminative autoencoders are a form of autoencoder that, in addition to the reconstruction error, considers the class information of the input in its objective function. This ensures that more powerful and discriminative representations are learned than those learned by conventional autoencoders.

We adopted the discriminative autoencoder proposed in [38], which, in its setup, uses data from two distributions, termed positive (*X*<sup>+</sup>) and negative (*X*−), with their labeled information. The discriminative autoencoder then learns a manifold that is good at reconstructing the data from the positive distribution, while ensuring that those of the negative distributions are pushed away from the manifold. This enables it to learn robust patterns and similarities that separate the two distributions.

In our case, the two distributions, *X*<sup>+</sup> and *X*<sup>−</sup>, can be generated from the benign and malicious network traffic classes, respectively.

Let *l*(*x*) denote the label of an example, *x*, with *l*(*x*) ∈ {−1, 1}, and let *d*(*x*) be the distance of that example to the manifold, with $d(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|$, where $\hat{\mathbf{x}}$ is the reconstruction of $\mathbf{x}$. Then, the loss function is described as:

$$L(X^{+} \cup X^{-}) = \sum\_{\mathbf{x} \in X^{+} \cup X^{-}} \max(0, l(\mathbf{x}) \cdot (d(\mathbf{x}) - 1))\tag{1}$$
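Equation (1) can be implemented directly in numpy; the function name and the toy samples below are illustrative:

```python
import numpy as np

def discriminative_ae_loss(x, x_hat, labels):
    """Eq. (1): hinge-style loss over positive (l = +1) and negative (l = -1)
    samples, where d(x) is the reconstruction distance to the manifold."""
    d = np.linalg.norm(x - x_hat, axis=1)          # d(x) = ||x - x_hat||
    return np.sum(np.maximum(0.0, labels * (d - 1.0)))

# A positive sample reconstructed well (d < 1) and a negative one pushed
# off the manifold (d > 1) both contribute zero loss.
x = np.array([[0.0, 0.0], [2.0, 0.0]])
x_hat = np.array([[0.1, 0.0], [0.0, 0.0]])         # d = 0.1 and d = 2.0
labels = np.array([1.0, -1.0])
loss = discriminative_ae_loss(x, x_hat, labels)
```

Swapping the labels makes both hinge terms active, which is exactly the situation training is meant to eliminate.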

Thus, to train the discriminative autoencoder, we merged all the meta-training sets $D_t^{train}$ from *T* into a single training set, $D^{new}$, of seen classes:

$$D^{new} = \bigcup \left\{ D\_1^{train}, \dots, D\_t^{train}, \dots, D\_T^{train} \right\} \tag{2}$$

We trained the discriminative autoencoder during the meta-training stage (Algorithm 1). After training, the decoder part of the model was discarded, while the encoder module, which then served as our feature extractor, was retained. The encoder was then employed in a fixed state (no fine-tuning) in the meta-testing stage, which consists of identifying a novel class of attack with few examples.

For a given task $(D_q^{train}, D_q^{test})$ sampled from the meta-testing set, *S*, we trained a classifier, *f*, on top of the extracted features to recognize the unseen classes using the training dataset, $D_q^{train}$ (Algorithm 2).

### **Algorithm 1** Discriminative Autoencoder Training

**Input:** meta-training dataset *D* containing *n* normal samples and *m* malicious samples, with labels *l*(*x*) ∈ {−1, 1}
**Output:** encoder *fe* and decoder *fd*

```
θ_e ← initialize encoder parameters
θ_d ← initialize decoder parameters
repeat
    Draw a batch of k samples x^(1), ..., x^(k) from the dataset D
    for i = 1 to k do
        z_i   = f_e(x_i)
        x̂_i  = f_d(z_i)
    end for
    L_DAE = (1/k) Σ_{i=1}^{k} max(0, l(x_i) · (‖x_i − x̂_i‖ − 1))
    // update parameters with gradients
    θ_e ← θ_e − ∇_{θ_e} L_DAE
    θ_d ← θ_d − ∇_{θ_d} L_DAE
until convergence of parameters θ_e, θ_d
```
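A minimal runnable sketch of this training loop in numpy, assuming a linear encoder and decoder with hand-derived gradients of the hinge loss of Eq. (1), and synthetic clusters chosen so the hinge is initially active; this is an illustration under those assumptions, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, lr = 4, 2, 0.05

# Toy data: positive samples (l = +1) start far from the manifold, negative
# samples (l = -1) start close to it, so the hinge in Eq. (1) is active.
X = np.vstack([rng.normal(2.0, 0.3, (50, d_in)),    # positive distribution X+
               rng.normal(0.0, 0.3, (50, d_in))])   # negative distribution X-
l = np.array([1.0] * 50 + [-1.0] * 50)

W_e = rng.normal(0, 0.1, (d_hid, d_in))   # encoder f_e (linear, for illustration)
W_d = rng.normal(0, 0.1, (d_in, d_hid))   # decoder f_d

def dae_loss(W_e, W_d):
    X_hat = (W_d @ (W_e @ X.T)).T
    dist = np.linalg.norm(X - X_hat, axis=1)
    return np.mean(np.maximum(0.0, l * (dist - 1.0)))

loss_start = dae_loss(W_e, W_d)
for _ in range(300):                       # "repeat until convergence" (fixed budget)
    Z = W_e @ X.T                          # z_i = f_e(x_i)
    X_hat = (W_d @ Z).T                    # x_hat_i = f_d(z_i)
    R = X - X_hat
    dist = np.linalg.norm(R, axis=1) + 1e-12
    active = (l * (dist - 1.0)) > 0        # samples where the hinge is active
    # dL/dx_hat = -l * (x - x_hat) / ||x - x_hat|| on the active set
    G = np.where(active, -l, 0.0)[:, None] * R / dist[:, None] / len(X)
    grad_Wd = G.T @ Z.T                    # backprop through the linear decoder
    grad_We = (W_d.T @ G.T) @ X            # ... and through the linear encoder
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We
loss_end = dae_loss(W_e, W_d)
```

Gradient descent pulls reconstructions of positive samples onto the learned manifold while pushing negative samples away, so the hinge loss decreases from its starting value.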

### **Algorithm 2** Few-Shot Detection

**Input:** meta-testing dataset *D* containing *n* normal samples and *m* few malicious samples, with *n* ≫ *m*, labels *l*(*x*) ∈ {1, 0}, and the trained encoder *fe*
**Output:** classifier *cl* and prediction *lpred*

```
θ_c ← initialize classifier parameters
repeat
    Draw a batch of k samples x^(1), ..., x^(k) from the dataset D
    for i = 1 to k do
        f_extract^i = f_e(x_i)
        l_pred      = c_l(f_extract^i)
    end for
    L_C = binary_cross_entropy(l, l_pred)
    // update parameters with gradients
    θ_c ← θ_c − ∇_{θ_c} L_C
until convergence of parameters θ_c
```
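Algorithm 2 can be sketched with a frozen encoder and a logistic-regression classifier trained with binary cross-entropy; the random linear encoder and the toy imbalanced data below are stand-ins for the trained *fe* and real traffic features:

```python
import numpy as np

rng = np.random.default_rng(2)

# Frozen encoder f_e kept from meta-training (a fixed random map stands in here)
W_enc = rng.normal(size=(3, 6))
encode = lambda X: np.tanh(X @ W_enc.T)

# Meta-testing task: n normal samples (label 0) and m << n malicious ones (label 1)
X = np.vstack([rng.normal(0, 0.5, (40, 6)), rng.normal(3, 0.5, (5, 6))])
y = np.array([0] * 40 + [1] * 5)

F = encode(X)                              # fixed features, no fine-tuning of f_e

def bce(w, b):
    p = np.clip(1 / (1 + np.exp(-(F @ w + b))), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b, lr = np.zeros(3), 0.0, 1.0
loss_start = bce(w, b)
for _ in range(500):                       # train classifier c_l on frozen features
    p = 1 / (1 + np.exp(-(F @ w + b)))     # sigmoid predictions l_pred
    g = p - y                              # gradient of binary cross-entropy
    w -= lr * (F.T @ g) / len(y)
    b -= lr * g.mean()
loss_end = bce(w, b)
acc = ((1 / (1 + np.exp(-(F @ w + b))) > 0.5).astype(int) == y).mean()
```

Only the small classifier head is optimized; the encoder's parameters never change during meta-testing, mirroring the fixed-state deployment described above.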
