#### **4. Flows**

Flows, i.e., raw network traffic grouped according to shared properties, form the second category of network traffic that can be processed while preparing CNN entries. Some papers work on datasets that already consist of flows, whereas others sort the raw traffic to collect all packets belonging to each flow.

#### *4.1. One-Dimensional CNN Input*

A large group of research articles fed CNN-based tools with vectors built from flows. The transformations differ mainly in two respects: the sizes of the input vectors and the data manipulations [49,64,66,80–82,104,105] (Table 4).

An extended version of [78] was widely elaborated by Casas et al. for flow vectors in network security, i.e., malware detection [104]. The authors decided to use only two packets and the first 100 bytes of each. This approach was based on statistical calculations. The 3-Layer CDM was tested on the MAWILab dataset.

The same research group, Marín et al., continued to investigate malware detection [105]. The authors checked the same, previously proposed CNN tool with vectors crafted from flows of the CTU-Malware dataset. Then, the tests were extended with USTC-TFC2016 in the next two papers [80,81]. In each, the 3-Layer CNN obtained the flow vector as an input. The articles also tested other machine learning models, including ones that are not based on CNNs.

The objective of traffic classification enhancement was set out by Song et al. in [64]. The authors utilized an 8-Layer CDM, which contains an embedding layer (EMB). Pcaps from ISCX VPN-nonVPN were transformed to flow vectors to test the CNN models. Traffic preprocessing was done according to the idea of Wang Wei et al. [34], which is described in the next subsection. After the normalization process, the one-hot encoding method was used for each byte in the vector, which expands each byte into a vector covering up to 255 different byte values. The matrix consists of concatenated vectors. To increase the speed and effectiveness of the process, all one-hot-encoded vectors of the matrix are converted into low-dimensional dense vectors.

Hwang et al. widely tested different sizes of CNN input vectors [82]. An example of the vector is 2 by 50 bytes, which means two flows and 50 bytes of each. The researchers used pcaps from the USTC-TFC2016, the Mirai-RGU and their own datasets. They introduced an 11-Layer CDM to deal with malware detection, especially anomaly detection problems.

The proposed concept of Chen et al. can determine whether traffic belongs to any of the known classes [66]. The idea requires unchanged flows that form vectors, which later become the input of the 16-Layer CDM. This function enhances the capability of the detection of yet-unknown traffic. The tool was tested on two datasets: USTC-TFC2016 and ISCX VPN-nonVPN. This article is an example of both approaches: traffic classification as well as malware detection.

Flows were also used by Chen et al. to form a 1D-CNN input [49]. Vectors of 784 bytes were created according to the idea of [63]. Three different datasets (ISCX VPN-nonVPN, ISCX-IDS-2012 and ISCX-IDS-2017) were used to check the effectiveness of the proposed 5-Layer CDM. The CDM classifies traffic.

#### *4.2. Two-Dimensional CNN Input*

Contrary to the previously described approaches, the two-dimensional CDM input requires increasing the dimensionality of the traffic data. Flow wrapping seems to be one of the leading trends of traffic manipulations within discussed CNN entries [34,50,55–58,60,61,68,106] (Table 4). The practical concept, which started in 2017, wraps the network traffic data into the matrix [34].

The very first article that fulfilled the concept of [2] is the paper written by Wang Wei et al. [34]. This paper seems to be the first practical approach to utilize CNNs to process network traffic. Ref. [34] uses 6-Layer CNN to detect malware in the USTC-TFC2016 dataset. The authors decided to give the CDM a matrix of 28 bytes per 28 bytes. The process of creating the image (matrix) is the following: raw traffic packets are aggregated into flows or sessions, and then data are anonymized. The next step is to trim the flows to 784 bytes and 'wrap' the vector so then one has a matrix of 28 × 28 bytes, visualized as a gray level image. The authors decided to share the tool used to create the matrix. While aggregating the packets into flow, one can choose one of the four versions of the process:


The variant that uses only the L4+ payload could have been placed in Section 6 of our survey. Nevertheless, the remaining two aggregating methods are also widely discussed in the paper, which makes the paper a good fit for this section.

Moskalenko and Moskalenko proposed a typical flow wrapping to check a 2-Layer CNN for malware detection [55]. To create the matrix, raw packets are aggregated into flows. Then, 784 bytes of sequential flows are taken to create a wrapped vector, i.e., a matrix of 28 × 28 pixels. The last step is to normalize the matrix values (pixels' brightnesses) to the range [0,1]. The CDM is tested on pcaps from the CTU-Mixed and CTU-13 datasets.
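The flow wrapping step described above can be illustrated with a minimal sketch. It is not the tool shared by the authors of [34] or [55]: the function name `wrap_flow`, the zero-padding of flows shorter than 784 bytes and the division by 255 as the normalization are assumptions made for illustration.

```python
import numpy as np

def wrap_flow(flow_bytes: bytes, side: int = 28) -> np.ndarray:
    """Trim or zero-pad a flow to side*side bytes and wrap it into a square image."""
    n = side * side
    # Assumption: long flows are trimmed, short flows are padded with zero bytes.
    buf = flow_bytes[:n].ljust(n, b"\x00")
    img = np.frombuffer(buf, dtype=np.uint8).reshape(side, side)
    # Assumption: pixel brightness is scaled to [0, 1] by dividing by 255.
    return img.astype(np.float32) / 255.0

# Hypothetical 1024-byte flow; only the first 784 bytes end up in the image.
image = wrap_flow(bytes(range(256)) * 4)
print(image.shape)
```

Each row of the resulting matrix holds 28 consecutive bytes of the flow, so the image is simply the flow vector read line by line.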

Flow wrapping was also utilized by Taheri et al. to detect malware, more specifically botnets [56]. The flows from the CTU-13 dataset are transformed into grayscale images of 28 bytes × 28 bytes and delivered to the entry of the DenseNet CNN [23]. It is important to underline that all layers of flows are utilized.

Zhou et al. delivered the following sizes of session images to the entry of the 5-Layer CDM: 16 × 16, 20 × 20, 28 × 28, 32 × 32 [106]. The CDM detecting botnets was tested on the ISCX-Bot-2014 dataset. The raw traffic packets from the dataset were aggregated into flows.

The transformation concept of [34] was utilized in the malware detection article of Wang et al., which introduced the 5-Layer CDM [60]. The tool tests were conducted on captured packets from the UNSW-NB15 dataset. From the raw traffic, sessions were cropped. Finally, the CNN input was a 28 by 28 bytes matrix.

The same transformation to 32 by 32 bytes images was used in the research in which malware detection was a theme [57]. Huang et al., in their article, tested the 7-Layer CDM's quality against sessions with all layers from the CTU-13 and ISCX VPN-nonVPN datasets.

The novel transformation of flows via one-hot encoding was proposed by Wang et al. to test a 5-Layer CDM (named HAST-I) [50]. The CNN tool was introduced to detect malware. The network traffic came from the DARPA 1998 and ISCX-IDS-2012 datasets. In the beginning, raw packets were aggregated into flows. Then, during thorough tests, flows were trimmed to either 600 or 800 bytes. Later, one-hot encoding transformed each byte in the vector into a vector. All the vectors were transposed and then concatenated so that a matrix was formed. The smaller, 256 × 600 bytes image achieved the best classification results for the ISCX-IDS-2012 dataset, while the 256 × 800 bytes image was ideal for DARPA 1998.
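A minimal sketch of such a one-hot transformation is given below. It only illustrates how a trimmed flow becomes a 256 × 600 matrix; the function name `one_hot_flow` and the zero-padding of flows shorter than the chosen length are assumptions, not details taken from [50].

```python
import numpy as np

def one_hot_flow(flow_bytes: bytes, length: int = 600) -> np.ndarray:
    """Return a 256 x length matrix in which column j is the one-hot code of byte j."""
    # Assumption: flows shorter than `length` are padded with zero bytes.
    buf = flow_bytes[:length].ljust(length, b"\x00")
    values = np.frombuffer(buf, dtype=np.uint8)
    matrix = np.zeros((256, length), dtype=np.uint8)
    # Row index = byte value, column index = position of the byte in the flow.
    matrix[values, np.arange(length)] = 1
    return matrix

m = one_hot_flow(b"\x01\x02")
print(m.shape)
```

Every column contains exactly one non-zero entry, which is why the image is much larger than a wrapped flow of the same length.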

A few different traffic transformations for malware detection purposes were based on three CDMs [61]. Out of the proposed tools, only one was exclusively CNN. The different CDMs of Millar et al. were given three different types of entries: 50 byte flow vector, 24 traffic features and flow wrapping. The first two methods tested non-CNN based models, whereas the last type of entry was a 2D flow image, which was given to the CNN model. In these flows' images, each pixel represents a byte of data in the network. A row of the image stands for the next packet in the flow. In each field, the value means the packet filling. The CNN model was tested on UNSW-NB15.

Moskalenko et al. investigated flow wrapping inherited from [34] to detect malware [58]. The authors used LeNet [19]. The test input was taken from two datasets: CTU-Mixed and CTU-13.

A simple flow wrapping method to 28 × 28 bytes matrices was used during the image generating process [68]. Li et al. proposed traffic classification for 9-Layer CDM, which was tested on data from the ISCX VPN-nonVPN and USTC-TFC2016 datasets.

#### *4.3. Various Dimensionalities*

A few papers verified the various dimensions of CNN inputs built from flows [12,63,65,67,69,83,107,108] (see Table 4).

Flow vectors are the input of the 6-Layer CDM proposed by Wang et al. [63]. Data are transformed as in [34] until the 784-byte vectors are formed. In this approach, the CDM deals with the ISCX VPN-nonVPN dataset. Additionally, the researchers mentioned that the proposed method is compared with 2D transformation. The interesting outcome achieved by the authors was that the 2D approach achieved worse results than the 1D approach.

A thought-provoking transformation of network traffic into the third dimension to classify traffic was proposed by Ran et al. The researchers utilized the 8-Layer CDM [83], and then tested it on pcaps from USTC-TFC2016. The 3D model was built in four steps. The first one identified flows within packets. The next step extracted a chosen number of bytes from each flow. The third step concerned trimming all packets to one fixed size. Longer packets were trimmed, whereas shorter ones were padded with zeros. Then, each packet was transformed into a 2D matrix with the usage of one-hot encoding. To create a 3D image, all 2D images of the same packets had to be put together.
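The last steps of this procedure (fixing the packet size, one-hot encoding and stacking) can be sketched as follows. The sketch assumes that the packets of one flow are already available as byte strings; the function name `flow_to_3d` and the small default sizes are hypothetical and are not the values used in [83].

```python
import numpy as np

def flow_to_3d(packets: list[bytes], n_packets: int = 4, packet_len: int = 16) -> np.ndarray:
    """Return an array of shape (n_packets, packet_len, 256) for one flow."""
    fixed = np.zeros((n_packets, packet_len), dtype=np.uint8)
    for i, pkt in enumerate(packets[:n_packets]):
        cut = pkt[:packet_len]            # longer packets are trimmed
        fixed[i, :len(cut)] = list(cut)   # shorter packets stay zero-padded
    # Indexing the identity matrix turns every byte into a 256-element one-hot row,
    # so each packet becomes a 2D matrix and the flow becomes a stack of them.
    return np.eye(256, dtype=np.uint8)[fixed]

cube = flow_to_3d([b"\x05\x06", b"\xff" * 40])
print(cube.shape)
```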

Flow vectors were also used as a CNN entry in the research articles of [107,108]. Aceto et al. utilized three types of CNN entries. Two methods, based on forming 784-byte vectors, were taken from [63]. The third method proposed a matrix of traffic features as a CNN entry. We describe this 2D concept in detail in Section 8. The articles utilized 6-Layer CDMs from [34,63]. Flows for this traffic classification approach were taken from the authors' own dataset.

While discussing in detail the wrapping flows, one should mention the approach of Cui et al., which improved it slightly, with the advantage of the sessions' weights [67]. On top of that, 6-Layer CDM of [63] was checked by traffic that originates from ISCX VPN-nonVPN. Additionally, during flow transformation, unrelated SNMP, DNS and ARP sessions were diluted, whereas valid sessions' weights were increased. The paper's aim was to classify traffic. It is important to underline that the paper introduced a 5-Layer CDM, CapsNet. The core part of the paper, which is a 2D transformation model, achieved a better outcome than the 1D classification.

The next paper of He and Li distinguished two types of traffic from the ISCX VPN-nonVPN dataset and touched upon flow wrapping [65]. The raw traffic packets were aggregated into sessions. Then, for non-VPN traffic, the first 90 non-zero payloads of flows were taken. In the second group, the VPN traffic one, the first 20 non-zero payloads were chosen for further processing. In both groups, the tool, introduced in [34], was used. Additionally, all DNS and NetBIOS names packets were erased. The authors also decided to remove the three-way handshake packets. Then, traffic images of 28 × 28 byte size were provided to the 5-Layer CDM for traffic classification. The paper also proposed a 1D model, which works on 784-byte vectors, and compared its results with the CNN of [70].

Yet another work on the topic of traffic classification slightly modified flow wrapping [12]. The experiment transformed the traffic of the first 20 packets of each flow into not only 28 by 28 byte images, but also other square images. The sizes tested by Pacheco et al. were not explicitly specified. In this paper, CNNs from [34,63] were tested on traffic captured during research as well as on the VPN-NonVPN dataset.

Recently, the concept of wrapping flows to create CNN input images was also utilized by Chen et al. for malware detection purposes [69]. The tests were conducted on pcaps originating from the following datasets: VPN-nonVPN and USTC-TFC2016.


**Table 4.** All research works of the flow transformation method

#### **5. Payload Extracted from Raw Traffic**

The next possible transformation is based on the extraction of chosen payloads from the raw traffic. For example, according to Figure 2, one can remove the headers of Layers 2, 3 and 4. The most popular idea is to remove the L4 header and form the CDM entry from only the L4+ payload.

#### *5.1. One-Dimensional CNN Input*

This section outlines those papers that provided the CNN model a vector, formed with the advantage of payloads and header manipulations [70–73] (Table 5). In this section, extractions are made from raw traffic packets.

An interesting concept of Lotfollahi et al. is based on 1D vectors with a length of 1500 bytes [70]. In the first step, the L2 headers are removed. On top of that, there are some changes in the L4 layer: UDP headers, which are normally shorter than TCP headers, are padded with zeros to reach the TCP header length. The next step is to remove not only the entire three-way handshake communication, but also the DNS queries. Finally, the vectors are brought to the length of 1500 bytes, and each byte value is normalized. The paper used the ISCX VPN-nonVPN dataset to test the 6-Layer CDM. This is a typical traffic classification work.
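The per-packet part of this pipeline can be sketched as below. The sketch is not the code of [70]: it assumes complete, untagged Ethernet II frames carrying IPv4, a 20-byte TCP header without options as the target length, and division by 255 as the normalization. The removal of handshake packets and DNS queries is not shown, and the names `packet_to_vector` and `demo_frame` are hypothetical.

```python
import numpy as np

ETH_HEADER_LEN = 14   # assumption: untagged Ethernet II frames
UDP_HEADER_LEN = 8
TCP_HEADER_LEN = 20   # assumption: TCP header without options
VECTOR_LEN = 1500

def packet_to_vector(frame: bytes) -> np.ndarray:
    """Turn one raw frame into a normalized 1500-element vector."""
    ip = frame[ETH_HEADER_LEN:]                 # drop the L2 header
    ihl = (ip[0] & 0x0F) * 4                    # IPv4 header length in bytes
    if ip[9] == 17:                             # IPv4 protocol number 17 = UDP
        split = ihl + UDP_HEADER_LEN
        # Insert zeros after the UDP header so that it is as long as a TCP header.
        ip = ip[:split] + bytes(TCP_HEADER_LEN - UDP_HEADER_LEN) + ip[split:]
    buf = ip[:VECTOR_LEN].ljust(VECTOR_LEN, b"\x00")   # truncate or zero-pad
    return np.frombuffer(buf, dtype=np.uint8).astype(np.float32) / 255.0

# Synthetic UDP frame: zeroed L2 header, minimal IPv4 header, UDP header, 4 payload bytes.
ip_header = bytes([0x45]) + bytes(8) + bytes([17]) + bytes(10)
demo_frame = bytes(ETH_HEADER_LEN) + ip_header + b"\x01" * UDP_HEADER_LEN + b"\xff" * 4
vec = packet_to_vector(demo_frame)
print(vec.shape)
```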

The same transformation idea as in the previous work was adopted to classify traffic [71]. The authors, Akbari and Tahoun, utilized a federated learning model, called the model-averaging technique, and created 3-Layer and 5-Layer CDMs. The publication's models were tested on the USTC-TFC2016 dataset.

#### *5.2. Two-Dimensional CNN Input*

The concept of creating traffic images from the extracted payload of raw traffic packets is another form of 2D-CNN input. This transformation was carried out in the following group of scientific investigations [73,109–112] (see Table 5).

In the research of He and Shi, images were generated by wrapping the payload of raw traffic [109]. It seems that the authors chose the L4+ payload, so they removed the L2, L3 and L4 headers. The researchers aimed to identify traffic, especially SSH applications. The 5-Layer CDM was tested on traffic from the article's own dataset. The authors reported that the CNN input is a 28 by 28 byte image.

Li et al. removed the L2 headers and modified the L4+ headers to form the CNN input [110]. The modification in L4 means unifying the length of TCP and UDP headers. On top of that, all duplicated and empty packets (with no payload) were erased. In this research paper, the transformation of packets to 30 × 30 byte matrices was utilized in order to classify traffic for virtualization purposes with the 5-Layer CNN. Tests were conducted on traffic captured by the authors.

The payload of L4+ was extracted from the raw traffic packets to create a 2D image [111]. The creation of the image required taking 10,000 packets from each application traffic captured in the UPC Broadband Traffic research group. Then, payloads of each packet were divided by four to constitute one pixel of a future image. The sizes of all application's traffic were readjusted to the following: 36, 64, 256 and 1024 pixels. In the case of a smaller number of payload samples, the images were padded with zeros. The paper of Lim et al. used 4-Layer CDM and also ResNet to classify the captured traffic.

A similar concept of choosing only L4+ payload while creating images was applied by Xue et al. [112]. The transformation's last step was to wrap vectors in order to create 2D images. The paper utilized six different CNN networks: ResNet [24], VGG16, VGG19 [25], Inception V3 [22], Xception [26] and MobileNet [27]. Their task was to work on traffic classification issues. Models were tested on traffic captured within the authors' research.

#### *5.3. Various Dimensionalities*

Papers in this section compare a few transformation methods (see Table 5).

Only the transport layer's payload (L4+) was taken from the raw traffic to form the CNN input [72]. The authors Xu et al. tested four sizes of input data: 400, 625, 784 and 900 bytes, which were later left as a vector or transformed to an image. Vectors are the input to the first 8-Layer CDM. Moreover, the entry of the second 12-Layer CDM is a square matrix. Four variants were investigated: 20 by 20, 25 by 25, 28 by 28 and 30 by 30 bytes. Traffic classification tests were extensively conducted with the advantage of pcap files of the ISCX VPN-nonVPN dataset. Due to the dataset choice, this was a typical traffic classification approach. The CNNs that work on vectors outperformed those models that deal with matrices.

The paper of Zhang et al. applied three versions of transformations, i.e., to the vector (1D), to the matrix (2D) and to the cubic form (3D) for traffic classification purposes [73]. The one-dimensional CNN input is a 1456 byte vector that consists of the raw traffic payload of L4+. The second dimension was implemented by wrapping a different initial vector (1521 bytes) into a 39 by 39 byte matrix. The third dimension was also created by wrapping. An initial vector (1452 bytes) was changed into 22 by 22 by 3 bytes RGB colored cubic forms. The network traffic was taken from the ISCX VPN-nonVPN dataset as well as their own dataset. For this transformation, the paper used 5-Layer CDM. The results of the experiments indicate the matrix as the input for achieving the highest classification results.


**Table 5.** Works that extract payload of raw traffic.

#### **6. Payload Extracted from Flows**

This section is entirely devoted to the transformations of raw traffic, which extract payloads from grouped packets, i.e., flows. This section discusses 13 research papers.

#### *6.1. One-Dimensional CNN Input*

Another group of articles proposed giving the CNN model an extracted payload of flows [51,52,113,114] (see Table 6). All papers worked on L4+ payloads, which means that headers from L2, L3 and L4 were decapsulated.

Zeng et al. created 900 byte vectors from flows and used them to form a 5-Layer CDM entry [51,52]. While creating the vectors, TCP and UDP headers were removed. In [51], the malware detection model's performance was checked with data from ISCX VPN-nonVPN and ISCX-IDS-2012. The latter paper detected malware in the vehicular ad hoc network (VANET) by testing the CNN model on the network traffic of the ISCX-IDS-2012 dataset and their own simulated dataset, NS-3 VANET. The datasets contained pcap files from which flows were aggregated. In both papers, the main concept was a hybrid deep learning model, which obtains 30 byte by 30 byte images. These flow images were built according to the concept of [63]. As the hybrid models do not fulfill the requirements of this CNN-based survey, the two papers [51,52] are not described in Section 4.

The next paper of Wang et al. also introduced a CNN input vector, that is, the L4+ payload [113]. The deep learning model's entry reached the size of 200 bytes. The work introduced a few models for traffic classification. The sole CNN was App-CNN, which is a 5-Layer CDM. Flows were taken from the researchers' own dataset.

Similarly, in the approach of Wang et al., the CNN's entry is a fixed-length vector consisting only of the flow payloads from L4+ [114]. Only the first several hundred bytes of the flow are placed in the vector. Three different deep learning models, one of which is solely a CNN, were investigated. The 5-Layer CDM classifies traffic. Additionally, it was tested on the authors' own dataset.

#### *6.2. Two-Dimensional CNN Input*

Some papers decided to process the payload of grouped raw traffic, i.e., flows [20,74,84,115–119] (Table 6). After the extraction step, which is common to all papers, the concepts start to differ. The biggest differences are mainly the layer choice as well as the selection of headers for removal.

A matrix of 32 bytes × 32 bytes was proposed to examine the 6-Layer CDM of Ma and Qin in the work [115]. The input was formed from 1024 bytes of the L4+ payload. The flows were caught by the authors. According to the paper, the first 1024 bytes contain crucial information.

In the next paper, Zhao and Chen used the L4+ payload while transforming network traffic [116]. On top of that, much larger unidirectional flow images of 87 bytes by 87 bytes were used to classify the traffic of smartphone applications. The researchers tested the 5-Layer CDM model on their own dataset. During the preprocessing phase, all tiny flows with fewer than two packets were removed. After that, five flow vectors, 1500 bytes each, were converted into a 2D image of the mentioned size.

Wrapping the L2 payload of flows to create an image was the dimension transformation used by Zhang et al. [84]. The paper dealt with the malware detection problem with the advantage of different CNN models: LeNet [19], AlexNet [21] and VGGNet [25]. Tests were accomplished on the USTC-TFC2016 dataset. Each input image had 28 by 28 bytes.

Removal of the L4 header, and so choosing the L4+ payload of flow, was applied by Zhou and Cui [74]. Additionally, the authors examined the usefulness of AlexNet [21] to classify traffic from the ISCX VPN-nonVPN dataset. The usage of this dataset means that the authors were dealing with encrypted payloads.

The next article, written by Feng et al., inherited the idea of [34] of wrapping only the L4+ payload of flow and widely utilized it for traffic classification purposes [117]. The paper used 6-Layer CDM. The tests of the model were based on flows coming from the DARPA 1998 dataset.

A different idea of choosing the L3 payload was tested by Zhao et al. While transforming flows from the Malware Capture Facility Project (malware samples) and their own dataset (benign samples), the researchers decided to focus on the first 32 packets of each flow, and the first 512 bytes of each packet [119]. Then, the chosen data were saved as a matrix of 32 by 512 bytes. Later, after the normalization process, the final matrix was reshaped to a 128 by 128 byte size. Flows or packets smaller than the desired size were padded with zeros. The matrices were given as an input to the proposed 7-Layer CDM network. The paper also utilized an interesting metric regularization term, which forced the model to learn more discriminative features. This feature impacted the classification so that the results were more precise.
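Since 32 × 512 = 16,384 = 128 × 128, the reshaping step keeps every byte. A minimal sketch under stated assumptions is shown below: the packets of a flow are given as L3 payload byte strings, normalization is a division by 255, and the reshape is done in row-major order. None of these details, nor the name `flow_to_image`, is confirmed by [119].

```python
import numpy as np

def flow_to_image(packets: list[bytes], n_packets: int = 32, n_bytes: int = 512) -> np.ndarray:
    """Build the 32 x 512 matrix of a flow and reshape it to 128 x 128."""
    mat = np.zeros((n_packets, n_bytes), dtype=np.float32)
    for i, pkt in enumerate(packets[:n_packets]):   # keep the first 32 packets
        cut = pkt[:n_bytes]                         # keep the first 512 bytes
        mat[i, :len(cut)] = list(cut)               # the rest stays zero-padded
    mat /= 255.0                                    # assumption: scale to [0, 1]
    return mat.reshape(128, 128)                    # 32 * 512 == 128 * 128

img = flow_to_image([b"\xff" * 600, b"\x00\xff"])
print(img.shape)
```

With a row-major reshape, each packet occupies four consecutive image rows.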

A novel approach to transforming the L2 payload of flows was utilized by Saleh and Ji. The authors constituted images by one of five possible mappings of the flow vector (1D) into a 2D matrix. Prior to that, pcaps from the authors' own dataset were aggregated into flows [20] for the purpose of network traffic classification. Then, all invalid connections were removed. Matrices of 17 by 17 bytes or 25 by 25 bytes were the entry to the VGG-16 CNN [25], a 16-Layer CNN model. The authors proposed the following mappings: linear, diagonal, waterfall, center spiral and edge spiral. The first method is simply flow wrapping. The diagonal mapping starts by placing bytes from the top left corner and then arranges them diagonally. The waterfall method is said to imitate nature: a stream of water falling from a cliff. Here, the first byte is also in the top left corner. The second byte moves along the diagonal, the third one to the left side, the fourth up and again along the diagonal and so on. The center spiral starts from the central position and locates the next bytes around the previous ones. The last mapping is the center spiral in the reverse order. Despite the attempts with various mappings, the classic linear one, i.e., flow wrapping, achieved the best results.
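Three of the five mappings can be sketched as follows. The sketch is an illustration and not the implementation of [20]: the clockwise direction of the spiral, its start in the top left corner for the edge variant and the zero-padding of short flows are assumptions, and the diagonal and waterfall mappings are omitted because their exact cell order is not specified here. The names `edge_spiral_order` and `map_flow` are hypothetical.

```python
import numpy as np

def edge_spiral_order(side: int) -> list[tuple[int, int]]:
    """Cell coordinates visited clockwise from the top left corner inward."""
    top, left, bottom, right = 0, 0, side - 1, side - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, left, bottom, right = top + 1, left + 1, bottom - 1, right - 1
    return order

def map_flow(flow_bytes: bytes, side: int, mapping: str = "linear") -> np.ndarray:
    """Place the first side*side bytes of a flow into a square matrix."""
    n = side * side
    data = np.frombuffer(flow_bytes[:n].ljust(n, b"\x00"), dtype=np.uint8)
    if mapping == "linear":                     # plain flow wrapping
        return data.reshape(side, side).copy()
    order = edge_spiral_order(side)
    if mapping == "center_spiral":              # same path, walked from the center outward
        order = order[::-1]
    elif mapping != "edge_spiral":
        raise ValueError(f"unsupported mapping: {mapping}")
    img = np.zeros((side, side), dtype=np.uint8)
    rows, cols = zip(*order)
    img[list(rows), list(cols)] = data          # byte i goes to the i-th cell of the path
    return img

print(map_flow(bytes(range(1, 10)), 3, "center_spiral"))
```

For the matrices used in the paper, `side` would be 17 or 25.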

#### *6.3. Various Dimensionalities*

Research works described in this section deal with two different dimensionalities of CNN input (Table 6).

Android traffic was transformed into images [118] according to the method of removing 24 bytes, i.e., the header of L4. The authors, Yunjie et al., decided to enlarge images, as they used 1024 bytes. Thus, the images reached a size of 32 by 32 bytes. The interesting part of the algorithm was the step where third party traffic was removed. The paper adds to a growing corpus of malware detection research. The 7-Layer CDM was used to find unwanted traffic within the CIC-AAGM2017 dataset. On top of that, the authors dealt with two different dimensions of CNN input. The 2D method outperformed the 1D concept.

In the following approach, the one dimension is changed into three dimensions to better detect unwanted traffic [31]. Consequently, the CNN input is three dimensional. The paper of Millar et al. proposed a segmented CDM of 1D- and 2D-CNNs. Additionally, the 1D-CNN and separable 2D-CNN models were introduced. Their quality was tested on 3D flow images generated from the UNSW-NB15 dataset. The 1D CDM was given a 2D flow image. At the beginning of image creation, 97 bytes of flow were chosen. A total of 47 bytes were taken from the flow's header, whereas the remaining 50 were from the payload. Then, an additional nine flows were added, so the 2nd dimension was achieved by the flow wrapping concept. The third dimension was built with the advantage of one-hot encoding. While comparing the separate models of the 1D- and 2D-CNNs, one can see that the application of the one-dimensional transformation resulted in higher effectiveness.


**Table 6.** Papers that belong to the group extracted payload—flows.

#### **7. Traffic Features**

The papers collected in this chapter proposed a transformation of the features of the network traffic to the CNN entry. The difference between this chapter's concept and the next one is that here, the research groups utilized only those datasets that consist of traffic features (e.g., KDD Cup 1999). In contrast, in the next chapter, the papers not only created interesting CNN entries, but also proposed feature extraction techniques.

#### *7.1. One-Dimensional CNN Input*

The transformation of chosen network traffic features into a vector that later becomes the CNN deep learning model input is a core part of Refs. [36–38,120–122]. These papers used network traffic datasets with explicit traffic features, or extracted them from flow- or pcap-based datasets. On top of that, four works combined traffic features with the traffic payload [53,75,123,124].

A simple vector of features was given as an entry to different CDMs, which were used to detect unwanted traffic, e.g., intrusions [36]. The pure CNNs used were a 3-Layer CNN, a 4-Layer CNN and a 5-Layer CNN. The authors, Vinayakumar et al., chose the KDD Cup 1999 dataset to test the proposed models.

The same transformation was used by Vinayakumar et al. in a work that focused on SSH traffic identification [120]. The paper proposed ten different deep learning models. The most interesting are two CNN models, i.e., 3-Layer and 6-Layer. The vector consisted of flow features, for instance, protocol, duration of flow, maximum packet, etc. The article made use of publicly available datasets: NLANR AMP, NLANR MAWI and NIMS.

The CNN model is given a vector, which consists of Can 2017 dataset features, which were collected from in-vehicle on-board diagnostics [121]. The article of Lokman et al. considered malware and intrusion detections with the advantage of 4-Layer CDM.

The 6-Layer CDM, to detect unwanted traffic, was also tested with a vector of network traffic features [37]. The traffic samples in the research of Manimaran et al. were taken from the KDD Cup 1999 dataset.

Another paper, written by Liu and Zhang, also proposed a 1D input of traffic features to improve malware detection [38]. Here, the 5-Layer CDM was tested on data from the KDD Cup 1999 and NSL-KDD datasets.

The same transformation was performed by Susilo and Sari on the BoT-IoT dataset [122]. It appears that the 5-Layer CDM input was the vector of features. The paper showed the malware detection approach.

The discussed transformation approach was extended by combining ten network features with additional traffic payloads [53]. The researchers, Cui et al., decided to test GoogLeNet [22] on the ISCX-IDS-2012 dataset. This work widely investigated malware as well as intrusion detection.

A combination of network traffic features with flow payloads was classified by a few AI models [123]. The paper of Zhao et al. used 6-Layer CDM ([63]) and other classical methods, e.g., random forest. The CNN was given a vector with 29 attributes, where 12 were statistical features, 16 byte values, and the last one was a port number. The statistical features were the payload size (5 features) and the packet length (7 features). The byte values were 16 bytes of the payload. The model was tested on the researchers' own dataset, which consisted of flows.

The trend of combining traffic payload with its statistical features continued in the article of Dong et al. [75]. Firstly, all unneeded packets, such as DHCP and NetBios, were removed from the pcap files. The second step was to aggregate raw traffic packets with respect to the sessions. After removing all retransmission flows and those related to a particular application, each packet was trimmed to the set size. Then everything was joined into one vector. The last step was the normalization of the vector's data. In this paper, two 6-Layer CNNs from different articles [63,70], were utilized. Both CDMs aimed to classify encrypted traffic. The input of CNNs was crafted from the ISCX VPN-nonVPN dataset.

The idea of Yang et al. was to create the CNN input in four steps: payload extraction, inter-arrival time calculation, truncating/padding process and normalization process [124]. The 8-Layer CDM was tested with this kind of payload and time feature input. Flows originated from the WRCCDC dataset. This article is an example of a traffic classification approach.

#### *7.2. Two-Dimensional CNN Input*

The next method of CNN input transformation is vector of features wrapping [39–48,76,85,125]. This idea changes the form of input data representation from a vector to a matrix, similar to that done with flows. The combination of both traffic features and payload was also proposed in [126].

Vector of features wrapping was first introduced in the work of Liu et al., which was focused on malware detection and intrusion detection purposes [39]. The paper proposed 32 by 32 byte matrices to be given as an entry of LeNet [19]. CNN was tested on KDD Cup 1999, which consisted of feature vectors. To create feature wrapping images, the authors chose 1024 bytes from feature vectors, which were later transformed into images.

A novel transformation of network traffic was proposed by Liu et al. in their work focused on malware detection and the intrusion detection challenge [40]. For this task, the paper used two CNNs: ResNet 50 [24] and GoogLeNet [22]. Network traffic was taken from the NSL-KDD dataset. The paper introduced an innovative method to create input images for the CNN. Firstly, all symbolic features from the dataset, i.e., protocol type, flag and service, were converted into binary vectors (one-hot encoded). All continuous features were normalized to the range [0, 1]. After that, the authors discretized the scaled continuous values into ten intervals. The next step was to use one-hot encoding again. This time, the method encoded the intervals as binary vectors. The vector with 484 features was then changed into a greyscale image. Eight bytes were changed into one pixel. Finally, the data became an image of 8 bytes by 8 bytes in size. If necessary, the images were padded with zeros. It is important to draw attention to the fact that the dataset consisted of vectors of 41 network traffic features. To sum up, vectors of 41 traffic features were transformed into 2D images.
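The discretization and packing steps can be illustrated with the following sketch. It reads each entry of the binary vector as one bit and packs every eight entries into one grey pixel; this reading, the equal-width intervals and the names `continuous_to_bits` and `bits_to_image` are assumptions made for illustration and are not taken from [40].

```python
import numpy as np

def continuous_to_bits(values: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """One-hot encode values already scaled to [0, 1] by their interval index."""
    # Assumption: ten equal-width intervals; the value 1.0 falls into the last one.
    bins = np.minimum((values * n_bins).astype(int), n_bins - 1)
    return np.eye(n_bins, dtype=np.uint8)[bins].ravel()

def bits_to_image(bits: np.ndarray, side: int = 8) -> np.ndarray:
    """Pack every 8 binary entries into one pixel and zero-pad to side x side."""
    pixels = np.packbits(bits)            # the last group is padded with zero bits
    img = np.zeros(side * side, dtype=np.uint8)
    img[:len(pixels)] = pixels
    return img.reshape(side, side)

bits = continuous_to_bits(np.array([0.0, 0.55, 1.0]))
image = bits_to_image(bits)
print(image.shape)
```

Under this reading, a 484-entry binary vector yields 61 pixels, which are padded with zeros to the 64 pixels of an 8 by 8 image.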

Replicating vectors of features as an 11-Layer CDM entry was proposed by Naseer and Saleem in their work, which dwelt on traffic classification and malware detection, mainly intrusion detection [41]. The tool was tested on transformed feature vectors from the KDD Cup 1999 dataset. The vector contained 41 features. Three symbolic features, 'protocol\_type', 'service' and 'flag', were converted into quantitative data. Then, whole vectors were replicated three times, and five chosen features were concatenated. These actions created 128-feature vectors. Later, the vectors were again replicated (probably eight times) to create an image of 32 bytes by 32 bytes, i.e., a 2D matrix. These matrices then became greyscale images, which were the CNN tool entry.
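The arithmetic of this replication (3 × 41 + 5 = 128 and 8 × 128 = 1024 = 32 × 32) can be checked with a short sketch. The five extra features are not named in the text above, so the indices in `extra_idx` are purely hypothetical, as is the function name `replicate_to_image`.

```python
import numpy as np

def replicate_to_image(features: np.ndarray, extra_idx=(0, 1, 2, 3, 4)) -> np.ndarray:
    """Replicate a 41-feature vector into a 32 x 32 matrix."""
    if features.shape != (41,):
        raise ValueError("expected a vector of 41 numeric features")
    # Three copies of the vector plus five chosen features: 123 + 5 = 128 values.
    row = np.concatenate([np.tile(features, 3), features[list(extra_idx)]])
    # Eight copies of the 128-value row give 1024 values, i.e., a 32 x 32 matrix.
    return np.tile(row, 8).reshape(32, 32)

img = replicate_to_image(np.arange(41, dtype=np.float32))
print(img.shape)
```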

Malware and intrusion detection, more precisely anomaly detection, were closely investigated in [42]. Kim et al. utilized the GoogLeNet CNN model and tested its usefulness for the topic with the advantage of three datasets: KDD Cup 1999, UNSW-NB15, and ISCX-IDS-2017. This means that they dealt with vectors of network traffic features, flows and raw packets. While processing the dataset, the authors normalized numerical data with the min–max normalization algorithm. Then they transformed categorical features into numerical ones with the advantage of one-hot encoding. Later, all data were encoded to a greyscale vector and reorganized into a greyscale image. Finally, the created images were of the following sizes:


A novel transformation of the feature vector into an image was widely examined by Mohammadpour et al. [44]. The first step was to convert nominal attributes into discrete attributes with one-hot encoding. This step brought the number of attributes to 122. Then, the authors removed one of the 122 features. The remaining features were normalized to the range [0, 1] by max–min normalization. Finally, the 121-feature vector was wrapped into a 2D matrix. The paper used a 7-Layer CDM on the NSL-KDD dataset traffic. The authors' aim was to advance intrusion as well as malware detection.
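The normalization and wrapping steps can be sketched as follows. The function names are hypothetical, and the 11 by 11 shape is an assumption that follows from 121 = 11 × 11. It is a sketch under these assumptions, not the code of [44].

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X (samples x features) to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Constant columns would divide by zero, so their span is set to 1.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def wrap_to_matrix(vector, side=11):
    """Wrap a 1D feature vector row by row into a side x side matrix."""
    vector = np.asarray(vector)
    if vector.size != side * side:
        raise ValueError("vector length must equal side * side")
    return vector.reshape(side, side)
```

Row-by-row wrapping is only one possible ordering. It places features that are adjacent in the vector next to each other in the image, which a convolution kernel then treats as spatial neighbours.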

The same transformation of a feature vector into a 2D matrix was introduced in the paper of Wang et al., which was fully devoted to the detection of unwanted traffic in the network [43]. The authors checked the usefulness of the proposed 9-Layer CDM and LeNet [19], on vectors of network features from the KDD Cup 1999 and the NSL-KDD datasets.

Unchanged transformation from [44] was used to test the 4-Layer CDM of Hu et al. The introduced tool had to detect malware as well as intrusions in wireless networks. In the CDM, there is a split convolution module (SPC), which is a special layer to minimize the problem of an unbalanced dataset [46]. The paper made use of the NSL-KDD dataset.

The researchers Li et al. decided to utilize randomly repeating features to enhance traffic images [47]. The paper focused on 9 by 9 bytes, 9 by 10 bytes, 10 by 10 bytes and 11 by 11 bytes matrices. The authors decided to find malware, especially intrusions in the network, with 7-Layer CDM. The idea was tested with the advantage of the NSL-KDD dataset.

The network traffic transformation proposed by Mohammadpour et al. [44] was continued in [85]. This time, the authors detected malware and intrusions with a 4-Layer CDM on the traffic from the ISCX-IDS-2017 dataset. The model contains a layer known as a mean convolutional layer (MC). This layer enhances classification so that all anomaly samples are separated during computing. Moreover, this helps in learning the prediction error filters, which can generate low-level abnormal features.

The same idea of 2D transformation was utilized in [125], where Zhang proposed a 6-Layer CDM to deal with malware detection. The feature-vector images were 32 × 32 bytes in size and were formed from the KDD Cup 1999 dataset.

To detect malware as well as intrusions, Pham et al. utilized two methods of network traffic transformation [76]. The first one, based on histogram creation, was inherited from [77]. The second one, for the purpose of image creation, multiplied the packet's length by the normalized delivery time. This was done in order to differentiate two packets of the same length collected at different times. Thanks to the multiplication, packets of the same length were placed in different parts of the image without disturbing the sequence pattern. The next step was to reduce the multiplication outcome with the modulo operation so that the data fit within the image size. The researchers created 30 by 30 pixel images for the CSE-CIC-IDS2018 dataset traffic and 300 by 300 pixel matrices for ISCX VPN-nonVPN. The CDM used was a 9-Layer one.
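The second transformation can be illustrated with the sketch below. The text states only that the product of length and normalized time is reduced by a modulo operation. How the result is mapped to a pixel and what the pixel stores (here, a packet counter) are assumptions, as is the function name.

```python
import numpy as np

def packets_to_image(lengths, times, side=30):
    """Place packets into a side x side image using (length * normalized time) mod size."""
    lengths = np.asarray(lengths, dtype=float)
    times = np.asarray(times, dtype=float)
    span = times.max() - times.min()
    # Normalize arrival times to [0, 1]; a single-instant capture maps to 0.
    t_norm = (times - times.min()) / span if span > 0 else np.zeros_like(times)
    # Equal-length packets observed at different times get different products.
    product = np.floor(lengths * t_norm).astype(np.int64)
    cells = product % (side * side)   # modulo keeps the index inside the image
    image = np.zeros(side * side, dtype=np.int32)
    np.add.at(image, cells, 1)        # count packets per cell (assumed pixel value)
    return image.reshape(side, side)
```

In this sketch, two 100-byte packets arriving at normalized times 0 and 0.5 land in cells 0 and 50, so they no longer collide.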

A novel sliding window based approach was introduced in [126] for traffic classification. Li et al. used a 7-Layer CDM, which was later evaluated on flows from their own dataset. The CNN input was an image created in a few steps. At first, the flow traffic was divided into segments that corresponded to particular application activities. Then, each segmented traffic stream was represented by a matrix and a vector. The matrix consisted of the number of packets received in the chosen time unit. The vector held frequency-domain features of the traffic.

The same network data transformation, as in [40], was applied by Su et al. The authors utilized the neuro evolution of augmenting topologies algorithm to find the optimal CNN architecture [48]. As there was not one chosen CNN for malware detection purposes, this paper will not be covered in the summary table at the end of this section. Tests of different CNNs were conducted on the NSL-KDD dataset.

#### *7.3. Three-Dimensional CNN Input*

A few articles proposed CDMs that require a 3D entry [127–130] (see Table 7).

Chen et al. converted probability distributions of the network flow sequence into images [128]. To fulfill the task, reproducing kernel Hilbert space (RKHS) embeddings were used. This method is said to create a neat image representation of a (conditional) distribution. Network flows originated from the researchers' own dataset. The article aimed to develop traffic classification methods with the advantage of a 7-Layer CDM.

One approach to traffic classification in terms of QoS and security uses an RGBA image as the CNN input [127]. The article used four predefined CNNs: LeNet [19], AlexNet [21], ConvNet and GoogleNet [22]. The network traffic in the form of pcaps was taken from the researchers' own dataset. Raw traffic packets were firstly aggregated into flows. Then, four features were taken: size (s), interarrival\_time (t), protocol (p) and direction (d). Merged together, they formed the following vector of the packet's features: [s, t, p, d]. Later, vectors with the packets' features formed a flow matrix, so that each matrix element was a vector. Salman et al. highlighted that a feature vector of four elements can become an RGBA pixel [127]. They followed this idea and created RGBA images. The size of each was firmly connected to the mode of the model: offline vs. online. The online mode worked on smaller images with 16 packets of the flow, whereas the offline mode was capable of processing 28 packets of the flow.
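The mapping of a packet's [s, t, p, d] vector to one RGBA pixel can be sketched as follows. The scaling constants `max_size` and `max_iat`, the image layout given by `rows` and `cols`, and the function name are hypothetical. The text states only that four features per packet form one pixel.

```python
import numpy as np

def flow_to_rgba(packets, rows, cols, max_size=1500.0, max_iat=1.0):
    """Build a rows x cols RGBA image from (size, interarrival_time, protocol, direction) tuples.

    Each packet becomes one pixel; unused pixels stay zero.
    """
    image = np.zeros((rows * cols, 4), dtype=np.uint8)
    for i, (s, t, p, d) in enumerate(list(packets)[: rows * cols]):
        image[i] = (
            int(255 * min(s / max_size, 1.0)),  # R: packet size, clipped and scaled
            int(255 * min(t / max_iat, 1.0)),   # G: inter-arrival time, clipped and scaled
            int(p) % 256,                       # B: protocol number (e.g., 6 for TCP)
            255 if d else 0,                    # A: direction flag
        )
    return image.reshape(rows, cols, 4)
```

With this layout, 16 packets could fill a 4 by 4 image. This is one possible arrangement and not necessarily the one used in [127].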

Volumetric colored images that represent the amount of data captured within a chosen time were also utilized as a CNN entry [129]. The concept assumed a colored input of 656 by 874 pixels. This input was built from the dataset of De Schepper et al. and was used to test an 8-Layer CDM in terms of traffic classification accuracy.

The concept of building a 3D entry from a features vector was used by Arivudainambi et al. for malware detection [130]. With the advantage of PCA compressions, seven attributes were minimized to only two crucial ones. Then, the CDM model was given an entry from the traffic captured by the authors. Details of the CNN architecture were not revealed.

In contrast to the previous articles, one work tested various dimensions within the discussed transformation approach [45]. In the article of Wu et al., 11 by 11 byte images were given as an entry of a 5-Layer CDM. The classification tool was that of [44], which was tested on network traffic features from the NSL-KDD dataset. CNN input images were created as in [44]. This is a malware detection approach. The paper compared the 2D transformation and classification results with those of 1D. Having analyzed these results, one can see that the feature wrapping concept is a better method for classification.


**Table 7.** The summary of all feature-based articles.

#### **8. Extracted Features**

When compared to the previous section, this category of traffic transformations focused not only on classification but also on feature extraction. Here, the used datasets were mainly flow or packet-based. Therefore, after the extraction process, the CNN models dealt with traffic features.

#### *8.1. One-Dimensional CNN Input*

The following batch of papers widely elaborated feature extraction approaches to form input vectors [54,59,62,131,132] (see Table 8).

A different CNN model works on the basis of the CTU-Malware, UNSW-NB15 and SCU-RNE datasets [62]. Shao et al. also proposed a novel method to extract features with a 4-Layer CDM. The idea is to learn the representation of time series input data at each layer of the network model with the advantage of a hierarchical transformation of a CNN. The researchers compared the extraction tool with other feedforward networks. Further, the method avoids explicit feature extraction. The research paper proposed a 5-Layer CNN classifier to detect malware within computer network traffic.

MontazeriShatoori et al. created CNN input vectors in a novel way from pcaps and flows originating from the CIRA-CIC-DoHBrw-2020 dataset [131]. While preparing an input vector for the CNN, all statistical features of flows were derived from raw packets, i.e., the number of flow bytes sent, the rate of flow bytes sent, the number of flow bytes received, the rate of flow bytes received, packet length (e.g., mean, variance), packet time (e.g., mean, variance), request/response time difference (e.g., mean, variance). It is important to highlight that the authors decided to share their statistical feature extractor tool 'DoHMeter' publicly. They used CNN and other hybrid deep learning models to detect malware within DNS over HTTPS tunnels. The details of the CNN were not described.
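A few of the listed statistics can be computed from a flow's packets as sketched below. This is not DoHMeter code and does not reflect its interface. The function name, the packet tuple layout and the dictionary keys are hypothetical. The packet time statistics are computed here over inter-packet gaps, which is our interpretation, and the request/response time difference is left out.

```python
from statistics import mean, pvariance

def flow_statistics(packets):
    """Compute simple flow statistics.

    packets: time-ordered list of (timestamp_s, length_bytes, direction),
             where direction is "out" (sent) or "in" (received).
    """
    if not packets:
        raise ValueError("flow has no packets")
    times = [t for t, _, _ in packets]
    lengths = [n for _, n, _ in packets]
    duration = times[-1] - times[0]
    sent = sum(n for _, n, d in packets if d == "out")
    received = sum(n for _, n, d in packets if d == "in")
    gaps = [b - a for a, b in zip(times, times[1:])]

    def rate(n_bytes):
        # Bytes per second; a zero-duration flow gets rate 0.
        return n_bytes / duration if duration > 0 else 0.0

    return {
        "bytes_sent": sent,
        "sent_rate": rate(sent),
        "bytes_received": received,
        "received_rate": rate(received),
        "length_mean": mean(lengths),
        "length_variance": pvariance(lengths),
        "gap_mean": mean(gaps) if gaps else 0.0,
        "gap_variance": pvariance(gaps) if gaps else 0.0,
    }
```

The values of the returned dictionary, taken in a fixed key order, form one possible 1D input vector for a CNN.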

Yet another publication, written by Kolcun et al., utilized a vector of features as the CNN entry to deal with the traffic classification challenge in IoT [132]. A few models were proposed, but only one, the 4-Layer CDM, meets the requirements of this review. The model's input is a vector of features that originates from the authors' own dataset. There are 19 chosen features, among others: source and destination ports, the number of received bytes, the mean size of packets, the variance of the packets' sizes and the duration of the stream.

Fourteen features from the CTU-13 dataset were extracted to form a 5-Layer CNN input [59]. These features included the traffic flow start time, protocol, the total number of packets or average packet rate, etc. This is a concept of detecting malware in the network traffic, especially botnets.

Differentiation of the features of flow packets due to time arrivals was proposed by Doriguzzi-Corin et al. to create network traffic vectors [54]. Once all flows in a particular time window were chosen, 11 special features were extracted. Longer flows were truncated. Then, these feature vectors were normalized and, if needed, zero-padded. The last step was devoted to labeling. The authors gave the vectors as an entry to 3-Layer CDM. The research was based on the traffic taken from the following datasets: ISCX-IDS-2012, ISCX-IDS-2017 and CSE-CIC-IDS2018.
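The truncation, normalization and zero-padding steps can be sketched as follows. The text says only that the vectors were normalized, so dividing by a per-feature maximum (`feature_max`) is an assumption, as are the function and parameter names. The sketch keeps one row per packet and leaves open whether the rows are flattened before they reach the CNN.

```python
import numpy as np

def prepare_flow_sample(packet_features, max_packets, feature_max):
    """Truncate, normalize and zero-pad the per-packet features of one flow.

    packet_features: array of shape (n_packets, n_features)
    feature_max:     per-feature maximum used for scaling (assumed scheme)
    """
    x = np.asarray(packet_features, dtype=float)[:max_packets]        # truncate longer flows
    x = np.clip(x / np.asarray(feature_max, dtype=float), 0.0, 1.0)   # scale to [0, 1]
    padded = np.zeros((max_packets, x.shape[1]))
    padded[: x.shape[0]] = x                                          # zero-pad shorter flows
    return padded
```

A fixed output shape is what allows flows with different packet counts to be batched together for training.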

#### *8.2. Two-Dimensional CNN Input*

The following eight articles created matrices of wrapped traffic features. These features were first extracted from the chosen datasets [77,86,107,108,133–136] (Table 8).

The first 2D concept of extracting traffic features from captured flows was proposed by Lopez-Martin et al. [133]. The authors took advantage of a 6-Layer CDM to classify the traffic. The deep learning model was given matrices containing six flow features in each row, i.e., source port, destination port, the number of bytes in payload, TCP window size, interarrival time, and packet's direction. The six features were taken from 20 packets. Thus, the 2D input size was 20 by 6. Flows originated from the authors' own dataset, from Spanish research centers.

Two papers [107,108] utilized a few concepts of traffic transformations. The flow wrapping approaches are widely discussed in Section 4. On top of that, Aceto et al. followed the idea of [133] and also provided matrices of extracted features to classification models. The CDM used in this category of traffic manipulations was a 6-Layer CDM. This is a typical traffic classification study. The traffic was taken from the authors' own capture.

Images based on the arrival time of packets are the input of the deep learning model of Yang et al. in another traffic classification research article [134]. The work uses the AlexNet CNN [21]. The tool was tested on the authors' own dataset, where flows were captured. The CNN input consisted of 10 by 10 byte matrices. These 2D images are generated from the inter-arrival times of the first 50 packets of a session or from their lengths. For packet lengths, the 1500 byte maximum transmission unit (MTU), and for inter-arrival times, 1200 milliseconds, constitute the states which later form the matrices.

The same concept of CNN entry was further tested by Hussein et al. [135] on a few models: LeNet [19], AlexNet [21], ConvNet and GoogleNet [22]. Vectors were crafted from traffic features, which originated from the authors' own dataset. During processing, input data were transformed into images with a size of 16 × 16 bytes. The goal of the paper was to detect malware or find intrusions.

CNNs' input was created by concatenating matrices [136]. The article tested malware detection on a few different models. Among others, two were exclusively CNN based: 5-Layer CDM and ResNet [24]. The test was based on flows from VAST 2013 challenge collections. When it comes to the deep learning models' inputs, the authors created interesting correlation matrices on the numeric features of flows. While doing this, they omitted categorical data. This means that each numerical feature of the flow had a correlation matrix. Then all matrices for all traffic features were concatenated. It is important to highlight that each matrix was surrounded by a chosen value of top features. The resulting images were called SC matrices. This is an outstanding concept of Liu et al. of the discussed topic when compared to other proposed ideas.

LeNet [19] was used to enhance the network traffic classification field [77]. The tests of the model were carried out on pcaps from the ISCX VPN-nonVPN and ISCX Tor-nonTor datasets. Raw traffic packets were processed beforehand. The first step of the process was to aggregate packets into flows. Then flows were divided into 60-s blocks. The next step was time normalization: the opening time was zero, and the final time was 1500. That means that 60 s is mapped to 1500. Later, all pairs of IP datagram sizes and arrival times of the flow were registered in the 2D histogram. Each cell in the histogram contains the number of packets received at a particular time and of a particular size. Histograms are 1500 by 1500 bytes in size and are named Flowpic. Shapira and Shavitt provide Flowpic as an input of CNN [77]. This is an interesting concept, dealing with the topic of transformations from a different perspective of input data.
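The histogram construction described for Flowpic can be sketched as follows. The sketch assumes that arrival times are given relative to the start of one 60-s block. The axis orientation, the treatment of out-of-range packets and the function name are also assumptions and not details taken from [77].

```python
import numpy as np

def flowpic(times, sizes, block_seconds=60.0, side=1500):
    """Build a side x side 2D histogram of (packet size, normalized arrival time).

    times: arrival times in seconds, relative to the start of the block
    sizes: IP datagram sizes in bytes
    """
    times = np.asarray(times, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    # Keep packets that fall inside the block and inside the size range.
    keep = (times >= 0) & (times < block_seconds) & (sizes >= 1) & (sizes <= side)
    # Time normalization: 0..block_seconds is mapped to columns 0..side-1.
    t_bin = np.floor(times[keep] / block_seconds * side).astype(int)
    s_bin = sizes[keep].astype(int) - 1      # size 1..side -> row 0..side-1
    hist = np.zeros((side, side), dtype=np.int32)
    np.add.at(hist, (s_bin, t_bin), 1)       # count packets per (size, time) cell
    return hist
```

Each cell then holds the number of packets of a given size that arrived in the corresponding time slot, which matches the cell definition given above.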

A thought-provoking article of Zhang et al. dwells on the traffic classification problem and amends the concept of [50]. The changes included extracting features from the raw traffic [86]. The paper assumed that each flow consisted of five packets, which, according to the authors' suggestion, were the most important ones. This assumption reduced redundant features from the top network layer and made the flows more compact. The authors summarized these changes with the statement that more flows can be processed and that fewer zero elements are introduced. The image was 16 by 16 bytes in size. The proposed CNN model was a segmented CDM. The top branch of the model was responsible for image segmentation tasks that handle pixel-level classifications. The bottom branch's main task was to deal with abnormal traffic that was imbalanced. The model was tested on ISCX-IDS-2017.



#### **9. Summary and Conclusions**

There have been many scientific publications on CNN-based deep learning models (CDMs) for traffic classification and malware detection since 2015, as indicated in Figure 1. The aim of this survey was to study different dataset transformations described in the selected papers, using different criteria. The following aspects were considered:


The type of network traffic data used as an input for the CDM is one of the crucial elements. Network traffic data used in the studied papers were acquired from different sources: test-beds, real traffic, or datasets prepared for and shared with the scientific community. Acquisition and preprocessing of network traffic are an essential part of data analysis. The two most popular datasets within the elaborated topic are ISCX VPN-nonVPN (19 articles) and USTC-TFC2016 (11 articles). On top of that, many scientists did not share their datasets (25 articles).

As shown in Table 9, the number of papers in each category highlights the paths followed by researchers. The most popular categories are manipulations of flows and traffic features. Using raw traffic, where data do not need preprocessing, is the least popular category. Accordingly, feature vectors as well as flows were widely taken from the utilized datasets.

**Table 9.** The summary of the most common transformation methods of the CNNs' inputs.


Among the analyzed CNN layers and models, LeNet was the most common CDM. Moreover, some papers amended its architecture with one or more additional layers. This was caused by the usefulness and practicality of the model in other scientific areas, such as data science and image recognition. Then, there is only a need to adjust the input data (network traffic) so that it fits the requirements of LeNet trained on, for instance, the MNIST dataset. This aspect is called transfer learning.

This survey is CNN based, so the majority of papers decided to form a 2D input to the deep learning model. Vectors as CNN entries were not so frequently used. Methods that proposed a 3D input to the 3D-CNN, marked in red, were in the minority (see Figure 8).

**Figure 8.** The popularity of discussed transformation methods, with the CNN architecture.

Regarding the comparison of dimensions, as the input data for CDM, we observed the following trends. Among various dimensions, 2D was the most common approach. The majority of articles added the second dimension with the advantage of wrapping. In the group of 2D methods, the entry size of 28 by 28 bytes was the leading trend. This concept may have been taken from the CNN structure used for the MNIST dataset.

As presented in Table 10, a noticeable batch of research papers utilized two or more dimensions of the CNN entries. We found out that seven papers gave proof of high results obtained by lower dimensions. This conclusion was not only unexpected, but also relevant for further studies (see Table 10). On top of that, manipulations on flows were the most common ones within the papers that compared various dimensions of CNN entry.


**Table 10.** The comparison of papers using more than one transformation model for CNN entry purposes.

In some of the studied papers, researchers also used models other than CDMs to analyze network traffic. For example, the following methods were applied to network traffic analysis: classical methods (tree-based, K-nearest neighbor, naive Bayes, logistic regression, support vector machine and semi-supervised) and neural methods (recurrent, multilayer perceptron, autoencoder and hybrid models).

Considering the constant increase in the number of papers on CNN-based models for computer network traffic analysis, one may conclude that this approach is becoming one of the classic approaches to traffic classification. One may also predict that the growth in the number of applications will continuously improve both the efficiency and detection/classification speed.

**Funding:** The research was funded by POB Cybersecurity and Data Analysis of Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) program.

**Conflicts of Interest:** The authors declare no conflict of interest.
