*3.3. Classification*

The classification mechanism is the next important part of the system. As illustrated in Figures 3–5, our packet-based traffic classification framework consists of three modules: the packet pre-processing module, the word embedding module, and the LSTM module. However, before training the LSTM module, we first need to assign a benign/malicious label to each packet in the datasets, e.g., ISCX2012 [17], USTC-TFC2016 [3], and Mirai-RGU [16]. Based on the labels of the flows (well labeled in the ISCX2012 and USTC-TFC2016 datasets) and of the packets (well labeled in the Mirai-RGU dataset), we mark each packet as malicious or benign according to the label of the flow it belongs to.
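The flow-to-packet label propagation described above can be sketched as follows. The 5-tuple flow key, the field names, and the benign default for unmatched packets are illustrative assumptions, not the paper's actual data structures:

```python
# Sketch: propagate flow-level labels to individual packets.
# Flow key (5-tuple) and field names are illustrative assumptions.
from typing import Dict, List, Tuple

FlowKey = Tuple[str, str, int, int, str]  # (src_ip, dst_ip, src_port, dst_port, proto)

def label_packets(packets: List[dict], flow_labels: Dict[FlowKey, str]) -> List[dict]:
    """Assign each packet the benign/malicious label of the flow it belongs to."""
    labeled = []
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        # Assumption: packets whose flow is not in the label table default to benign.
        labeled.append({**pkt, "label": flow_labels.get(key, "benign")})
    return labeled
```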

For the LSTM module, we build standard LSTM cells, each of which consists of an input gate, an output gate, and a forget gate. The sigmoid function is used as the activation function (*α*). Our proposed framework, as shown in Figure 4, consists of three LSTM layers with dropout. The total number of parameters of the first layer (dimension = 128, without dropout) exceeds 4 million (4,194,304).
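The gate structure of a standard LSTM cell can be illustrated with a minimal scalar sketch. A real cell operates on vectors and weight matrices; the per-gate weight layout here is purely illustrative:

```python
import math

def sigmoid(x: float) -> float:
    # Sigmoid activation, used for the input, forget, and output gates.
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One scalar LSTM step; w maps each gate to (w_x, w_h, bias).
    Illustrative only: real cells use weight matrices over vectors."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate state
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c
```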


**Figure 3.** The illustration of the packet-word-transfer and classification modules in the proposal. The packet-word-transfer module is at the input data layer.

**Figure 4.** The illustration of the full network model.

**Figure 5.** The illustration of the workflow of the training and testing/validation stages. Data pre-processing and packet labelling for training are done at the training stage. Training/testing is performed on the pre-processed data with a 9:1 ratio. Validation is performed on random samples (60 s per pcap).

The pseudo-code of our algorithm and the processing flow is illustrated in Algorithm 1. The pre-processing phase adjusts features to ensure the data representation is suitable for the algorithms used, i.e., it parses each packet and converts it into translated words. The new dataset of translated words is in the format of integer numbers. The translated dataset is then split into two parts: Training Data and Testing Data. The training stage runs the three-layer LSTM model on the Training Data. Note that the input data must be reshaped to the required dimension, e.g., 64, before training and testing. The dropout rate is flexibly set and can be tuned up to 0.5. The loss function is based on binary cross-entropy, and the RMSProp optimizer is selected as the learning-rate adjustment method. A dense output layer with softmax is added to the model. The model is then compiled with binary cross-entropy as the loss function and the Adam optimiser over a total of 200 iterations. Finally, the model is tested on the Testing Data to determine its effectiveness in terms of accuracy, precision, recall, F1-score, and loss (defined in the next section).
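The reshape-to-fixed-dimension and 9:1 split steps can be sketched as follows; the helper names and the zero padding value are illustrative assumptions:

```python
def to_fixed_length(tokens, length=64, pad=0):
    """Pad or truncate a token sequence to a fixed input dimension (e.g., 64).
    The pad value 0 is an assumption, not specified in the paper."""
    return (tokens + [pad] * length)[:length]

def split_9_1(samples):
    """Split the translated dataset into Training Data and Testing Data (9:1)."""
    cut = len(samples) * 9 // 10
    return samples[:cut], samples[cut:]
```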


Training and validation often cost significant time in the system evaluation. In fact, the dataset after conversion (transfer to integer format) can be smaller in storage than the original text/string form of the packets, since the integer format usually requires less space than the text data type. However, the dimensionality is unlikely to be reduced. As a result, the selected word size may dramatically affect the training time. A smaller word size reduces the size of the dictionary, but also results in more training data and longer running time. In this research, we set the maximum word size to 2 bytes based on the length of most fields in the packet, while limiting the number of words in the dictionary to no more than 65,536.
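The 2-byte word translation can be sketched as follows; the helper name and the big-endian byte order are assumptions for illustration:

```python
def packet_to_words(payload: bytes, word_size: int = 2):
    """Translate raw packet bytes into integer 'words' of word_size bytes.
    With word_size = 2, the dictionary holds at most 2**16 = 65,536 words.
    Big-endian order is an assumption for this sketch."""
    return [int.from_bytes(payload[i:i + word_size], "big")
            for i in range(0, len(payload), word_size)]
```

Note the trade-off described above: halving the word size to 1 byte shrinks the dictionary to 256 entries but doubles the number of words per packet, lengthening each training sequence.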

## **4. Evaluation Results**

Similar to most existing deep learning research, our proposed classification model has been implemented using TensorFlow/Keras (version 2.2.4, Google, Mountain View, CA, USA). All of our evaluations have been performed with GPU-enabled TensorFlow running on a 64-bit Ubuntu 18.04.2 LTS server with an Intel(R) Xeon(R) E5-2650 2.2 GHz processor, 32 GB RAM, and an NVIDIA Tesla K80 GPU.

To perform our evaluations, we have used the ISCX2012, USTC-TFC2016, Mirai-RGU, and Mirai-CCU datasets. During the training and testing stages, we include hundreds of thousands to millions of packets while balancing the benign and malicious traffic. At the validation stage, we run the trained model on packets extracted randomly in 60 s windows from the selected datasets, i.e., packets in a consecutive 60 s span of the original dataset are extracted while maintaining their temporal order. At this stage, the original (real) traffic is presented to the trained model without any manipulation or balancing of the two types of traffic. Since our approach aims at a significant reduction in processing time, our main interest is in classifying whether an incoming packet is malicious or not, rather than identifying the attack type in detail. In practice, if a malicious packet is detected by the proposed system, it can raise an alarm to the network administrator and forward the packet to an offline, computationally intensive traffic classification system. Therefore, in the following section, we define several measurements to evaluate the performance of the proposed solution as a binary classifier. All of these metrics are derived from the four values found in the confusion matrix in Table 4. The traffic proportions of the training/testing/validation configurations of the four datasets are listed in Tables 5–8.
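The 60 s window extraction used at the validation stage can be sketched as follows, assuming (as an illustration) that packets are stored as (timestamp, data) tuples already in temporal order:

```python
def extract_window(packets, start_ts, duration=60.0):
    """Select packets whose timestamps fall in [start_ts, start_ts + duration),
    preserving their original temporal order.
    The (timestamp, data) tuple layout is an assumption for this sketch."""
    return [p for p in packets if start_ts <= p[0] < start_ts + duration]
```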



where:

• True Positive (TP)—Attack packet that is correctly classified as an attack.

• True Negative (TN)—Benign packet that is correctly classified as benign.

• False Positive (FP)—Benign packet that is incorrectly classified as an attack.

• False Negative (FN)—Attack packet that is incorrectly classified as benign.


The accuracy (Equation (1)) measures the proportion of the total number of correct classifications:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}.\tag{1}$$

The precision (Equation (2)) measures the proportion of packets classified as attacks that are truly attacks, i.e., the number of correct classifications penalized by the number of false positives:

$$Precision = \frac{TP}{TP + FP}.\tag{2}$$

*Appl. Sci.* **2019**, *9*, 3414

The recall (Equation (3)) measures the proportion of actual attack packets that are correctly classified, i.e., the number of correct classifications penalized by the number of missed entries (false negatives):

$$Recall = \frac{TP}{TP + FN}.\tag{3}$$

The false alarm rate (FAR) (Equation (4)) measures the proportion of benign packets incorrectly classified as malicious:

$$False\ alarm\ rate\ (FAR) = \frac{FP}{FP + TN}.\tag{4}$$

The *F*1-score (Equation (5)) measures the harmonic mean of precision and recall, which serves as a derived effectiveness measurement:

$$F_1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}.\tag{5}$$
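Equations (1)–(5) can all be computed directly from the four confusion-matrix counts; a minimal sketch (the function name and returned dictionary are illustrative):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the binary-classification measurements of Equations (1)-(5)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Equation (1)
    precision = tp / (tp + fp)                           # Equation (2)
    recall = tp / (tp + fn)                              # Equation (3)
    far = fp / (fp + tn)                                 # Equation (4), false alarm rate
    f1 = 2 * precision * recall / (precision + recall)   # Equation (5)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "far": far, "f1": f1}
```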

In our experiments, based on the flow statistics in the datasets and the number of labeled packets from data pre-processing, we can calculate the number of packets (including benign packets) used for training and the number of real attack packets used for testing, as summarized in Tables 5–8 below. Due to the large amount of input data, we set the mini-batch size to 100 and train for fewer than 200 epochs.

The performance results of our proposal for each kind of malicious traffic with 10-fold testing and 10-fold validation are shown in Tables 9 and 10, respectively. From these two tables, we can observe that our method can perform the classification with nearly 100% accuracy and precision in most cases of the security attacks, outperforming prior works, e.g., [6,19,20]. However, since the characteristics of traffic types and attacks are unique, the performance measurements also differ in each scenario and rely heavily on the training dataset. From our experimental experience, we notice that, besides tuning the learning model and its parameters (e.g., learning rate), the word embedding and the attack representation samples in the datasets play a critical role in improving the performance.

**Table 5.** Train/Test/Validate traffic proportion in the ISCX-IDS-2012 dataset.


**Table 6.** Train/Test/Validate traffic proportion in the Mirai-RGU dataset.



**Table 7.** Train/Test/Validate traffic proportion in the USTC-TFC-2016 dataset.

**Table 8.** Train/Test/Validate traffic proportion in our dataset.


**Table 9.** Performance evaluation results of the proposed solution on the four datasets with 10-fold testing.


**Table 10.** Performance evaluation results of the proposed solution on the four datasets with 10-fold validation.


One might question whether the aforementioned results of our proposed DL-based framework simply reflect that it has learned the features of the malicious packets present in the training dataset. To address this concern, we perform a validation test in which the Mirai-RGU dataset is used for training and packets from the Mirai-CCU dataset are used for validation. Our experimental results show that, even though the training and validation stages use different datasets, the proposed framework is still able to detect malicious packets with very high accuracy. The performance results are shown in Table 11.

**Table 11.** Performance evaluation results of the Mirai-CCU dataset validation on the Mirai-RGU training model.

