#### *3.1. Dataset*

In deep-learning-based traffic classification, the shortage of diverse, shareable trace datasets is one of the most obvious obstacles to realistic progress. Many researchers create synthetic datasets on their own testbeds. The drawback of this approach is that the generated traffic may not represent real Internet traffic well, so the performance evaluation of the proposed solutions can suffer from credibility problems. For decades, researchers have been advised to test on well-known public traffic datasets such as KDD CUP1999 and NSL-KDD. Each of these datasets provides useful statistics on labeled features as well as the number of benign and malicious flows. However, they do not provide information at the raw traffic level, which our approach requires. In addition, while the credible dataset from the Microsoft Malware Classification Challenge [15] provides a metadata manifest and a hexadecimal representation of each malware's binary content, it unfortunately lacks packet information, which makes it hard to use in this research.

Among the datasets that meet our requirements, USTC-TFC2016 [3] is one of the most prominent. Table 2 summarizes the statistics of the benign and malware traffic in the dataset. According to its authors, it contains ten types of malware traffic from public websites, collected from a real network environment from 2011 to 2015. Alongside this malicious traffic, the benign part contains ten types of normal traffic collected using IXIA BPS, a professional network traffic simulation device. The USTC-TFC2016 dataset is 3.71 GB in size and is provided in pcap format.


**Table 2.** Summary of benign and malware traffic in USTC-TFC2016 dataset. (Notations: SMB: Server Message Block; IM: Instant Message; P2P: Peer-to-Peer).

For the Mirai-based DDoS traffic, we use the dataset from Robert Gordon University [16], denoted by Mirai-RGU. This dataset contains Mirai botnet traffic, such as Scan, Infect, Control, and Attack traffic, as well as normal IoT IP camera traffic. It contains ten types of malicious traffic: HTTP flood, UDP flood, DNS flood, Mirai traffic, VSE flood, GREIP flood, GREETH flood, TCP ACK flood, TCP SYN flood, and UDPPLAIN flood. The dataset includes features such as Time, Source, Destination, Protocol, Length, and overall payload information. In addition, we collected a dataset from a Mirai botnet that we built on the campus of National Chung Cheng University (CCU), denoted by Mirai-CCU and shown in Figure 1, to generate four types of attack traffic with a much larger attack magnitude: TCP SYN (41 GB), TCP ACK (2.4 GB), HTTP POST (103 GB), and UDP (127.06 GB), for a total of 667 GB of attack data.

**Figure 1.** The testbed for our Mirai-based DDoS dataset on the campus of National Chung Cheng University (CCU).

To enrich the data for our experiments, we also select ISCX2012 [17], which contains both malicious and benign traffic. ISCX2012 consists of packets collected over seven days. Packets collected on the first and sixth days are normal traffic. On the second and third days, both normal and attack packets were collected. On the fourth, fifth, and seventh days, in addition to normal traffic, HTTP DoS, DDoS and IRC Botnet, and Brute Force SSH packets were collected, respectively.

#### *3.2. Word Embedding and Data Preprocessing*

Our goal is to classify incoming packets as benign or malicious without first assembling the packets into flows. To this end, instead of treating the whole flow as a document, we treat each packet as a paragraph and construct a key sentence from each packet, in which each word is a field of the packet header. We then apply word embedding [18] to extract semantic and syntactic features from this sentence. We consider the meaning of the sentence rather than that of the whole paragraph because the meaning of a paragraph can usually be captured by its key sentence. Here, the order of the fields in each packet (fixed for each packet type) acts like a set of grammar rules, which are decisive in building sentence patterns for malicious traffic (signature-based detection) or benign traffic (anomaly detection). Notably, this packet-to-sentence model can significantly accelerate traffic classification, since the behavior and characteristics of the first packet, or the first few packets, can be enough to reveal whether their flow is malicious.

In general, a word could be *a byte of the packet header*, *a field of the packet header*, or *a block of the packet payload*. As an initial trial, we treat each field in the packet header as a word and trim each packet to a fixed length of *n* = 54 bytes. Because field lengths differ, the word size varies. The strict order of the fields in the packet structure provides a potential grammar rule for the constructed sentence. Note that field extraction can be performed during packet reading/decoding (i.e., data pre-processing), so it adds little resource overhead. In addition, if a packet is shorter than *n* bytes, it is padded with zeros (as shown in Figure 2). The rationale for examining 54 bytes is that most TCP packets have a 14-byte MAC header, a 20-byte IP header, and a 20-byte TCP header. Table 3 lists the header fields of TCP and UDP packets and their lengths.

**Figure 2.** The illustration of the packet-word-transfer mechanism.
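The trimming, zero-padding, and field-to-word steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the names `TCP_LAYOUT`, `trim_or_pad`, and `packet_to_sentence` are hypothetical, the layout assumes an Ethernet/IPv4/TCP packet without header options, sub-byte fields are merged into whole bytes, and representing each word as a hexadecimal string is an assumption made only for illustration.

```python
N_BYTES = 54  # 14-byte MAC header + 20-byte IP header + 20-byte TCP header

# (field name, length in bytes). Hypothetical layout for illustration only:
# Ethernet + IPv4 + TCP without options, sub-byte fields merged into bytes.
TCP_LAYOUT = [
    ("eth_dst", 6), ("eth_src", 6), ("eth_type", 2),
    ("ip_ver_ihl", 1), ("ip_tos", 1), ("ip_len", 2), ("ip_id", 2),
    ("ip_flags_frag", 2), ("ip_ttl", 1), ("ip_proto", 1), ("ip_csum", 2),
    ("ip_src", 4), ("ip_dst", 4),
    ("tcp_sport", 2), ("tcp_dport", 2), ("tcp_seq", 4), ("tcp_ack", 4),
    ("tcp_off_flags", 2), ("tcp_win", 2), ("tcp_csum", 2), ("tcp_urg", 2),
]


def trim_or_pad(packet, n=N_BYTES):
    """Keep the first n bytes; pad shorter packets with zero bytes."""
    return packet[:n].ljust(n, b"\x00")


def packet_to_sentence(packet, layout=TCP_LAYOUT):
    """Split the fixed-length packet prefix into one word per header field.

    Words keep the on-wire field order, so the sentence "grammar" is
    the packet structure itself. Each word is the field's bytes in hex.
    """
    data = trim_or_pad(packet)
    words, offset = [], 0
    for _name, length in layout:
        words.append(data[offset:offset + length].hex())
        offset += length
    return words
```

Because every packet is first brought to exactly *n* bytes, each sentence produced under a given layout has the same number of words, which keeps the sequence length fixed for the downstream model.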



After the word-embedding technique is applied to the packet header, each header field is embedded (i.e., converted to an integer based on its index in a dictionary of all words), reshaped (i.e., given the dimensions the model expects), and fed to the LSTM-based training model for the classification task. To capture each attack type and remain consistent with the order of the fields in the packet header, it is important to preserve the sequence order of the fields and to use a consistent word dictionary. Finally, the word size (a hyperparameter) and the word-embedding strategy are adjustable and can be chosen according to the deployment environment and the IoT applications under the system's protection.
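The dictionary indexing and reshaping steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names are hypothetical, reserving index 0 for words absent from the dictionary is an assumption, and the `(batch, timesteps, 1)` layout is only one common input shape for LSTM layers, since the exact dimensions are not specified here.

```python
UNK = 0  # assumption: index 0 is reserved for words not in the dictionary


def build_dictionary(sentences):
    """Give each distinct word a stable integer index in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # indices start at 1; 0 is UNK
    return vocab


def encode(sentence, vocab):
    """Map words to their indices while preserving the field order."""
    return [vocab.get(word, UNK) for word in sentence]


def to_lstm_input(encoded_batch):
    """Reshape (batch, timesteps) into (batch, timesteps, 1) nested lists."""
    return [[[index] for index in sentence] for sentence in encoded_batch]


# Toy sentences of three words each (hex strings standing in for fields).
train = [["0800", "06", "0050"], ["0800", "11", "0035"]]
vocab = build_dictionary(train)
batch = to_lstm_input([encode(s, vocab) for s in train])
```

The same `vocab` must be reused at inference time; rebuilding it on different data would change the word indices and break the consistency between training and deployment.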
