**1. Introduction**

#### *1.1. Deep-Learning Approach to Network Traffic Analysis*

The rapid growth of computer networks over the last decades [1] has been accompanied by a growing number of cyber-attacks. To minimize the resulting losses, many security methods are in heavy use, and network traffic analysis is among the leading ones. This day-to-day operation consists of monitoring typical patterns, such as traffic flow, bandwidth usage, or resource access. Together, these patterns define the normal network behavior, also known as a baseline. With this baseline in mind, it is possible to identify abnormal activities, which may indicate an attack.
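The baseline idea can be illustrated with a minimal sketch. It assumes that the baseline is summarized by the mean and standard deviation of a single metric (here, hypothetical bandwidth samples) and that a fixed z-score threshold separates normal from abnormal values. The function names, the threshold of 3.0, and the sample values are illustrative assumptions rather than a method taken from the surveyed literature.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def find_anomalies(observed: Sequence[float],
                   baseline: Tuple[float, float],
                   threshold: float = 3.0) -> List[int]:
    """Return indices of observations that deviate strongly from the baseline."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any value different from the mean is abnormal.
        return [i for i, x in enumerate(observed) if x != mu]
    return [i for i, x in enumerate(observed)
            if abs(x - mu) / sigma > threshold]


# Hypothetical bandwidth usage (Mbit/s) recorded during normal operation.
normal_usage = [100, 102, 98, 101, 99]
baseline = build_baseline(normal_usage)

# The second observation is far outside the baseline and is flagged.
print(find_anomalies([100, 150, 97], baseline))  # [1]
```

Real traffic analysis combines many such patterns; a single-metric threshold is shown only to make the notion of a baseline concrete.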

Deep learning methods have also gained popularity in recent years. This is mainly due to the development of computing capabilities based on parallel processors that originated in graphics cards. It has led to a rapid increase in efficient implementations of computationally demanding, complex neural networks and, ultimately, to a remarkable growth in the capability to solve advanced problems. Among the most successful architectures of this kind are convolutional neural networks (CNNs, conv-nets). They are well suited to multidimensional data; they originate from image processing but can be successfully applied to other computing domains.

Internet traffic analysis and machine learning are two worlds that must be connected with one another, especially when the latter is applied to the data provided by the former. The originator of the junction of traffic analysis with CNNs is Wang, who, during the innovative presentation at the "Black Hat" conference in 2015 [2], pointed out the similarities between images and TCP flow payloads. Although an autoencoder was used there to identify network traffic, in a later work, Wang signaled the usefulness of CNNs for the same task. To the best of our knowledge, this is the first mention of network traffic identification or malware detection with the use of CNNs.

**Citation:** Krupski, J.; Graniszewski, W.; Iwanowski, M. Data Transformation Schemes for CNN-Based Network Traffic Analysis: A Survey. *Electronics* **2021**, *10*, 2042. https://doi.org/10.3390/electronics10162042

Academic Editor: Amir Mosavi

Received: 2 July 2021 Accepted: 16 August 2021 Published: 23 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The number of research papers devoted to the analysis of network traffic with CNN models has grown strikingly (see Figure 1). To refine the analysis, we distinguish three possible subjects of the articles: malware detection, traffic classification, and the combination of both. These categories are tied to the datasets studied in each paper. A dataset is typically characterized by the theme of its data; e.g., captured botnets are the foundation of the CTU-13 dataset, so each article utilizing it belongs to the malware detection group.

**Figure 1.** The growth in the number of CNN-based models that process transformed network traffic, 2015–2020.

Of particular interest is an overview from the perspective of the methods used to transform traffic before it is fed to the CNNs. Deep learning models require particular data formats that rarely resemble the original network traffic. In most of the reviewed articles, the traffic data are transformed into the form needed by the analytical part of the workflow. These transformations usually require one or more typical actions, e.g., selecting specific network layers, trimming the data stream, or computing some traffic features. Owing to the architecture of some learning algorithms, these transformations often require increasing the dimensionality of the traffic data. Network traffic data are a time series, whereas CNNs require multidimensional input consisting of equal-length samples. For this reason, the original traffic data must always be transformed into a format accepted by the deep-learning models.
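As a minimal sketch of the trimming and dimensionality-increasing steps described above, the following Python snippet truncates or zero-pads a byte string to a fixed length and arranges it as a square grid of normalized values. The function name `bytes_to_grid`, the 28 × 28 default size, the zero padding, and the scaling to [0, 1] are illustrative assumptions, not choices prescribed by any particular reviewed article.

```python
from typing import List


def bytes_to_grid(data: bytes, side: int = 28) -> List[List[float]]:
    """Turn a variable-length byte string into a side x side grid.

    The input is trimmed to side*side bytes, or right-padded with zero
    bytes if it is shorter, so every sample has the same length. Each
    byte is then scaled from 0..255 to 0.0..1.0.
    """
    size = side * side
    fixed = data[:size].ljust(size, b"\x00")  # trim, then zero-pad
    return [
        [byte / 255.0 for byte in fixed[row * side:(row + 1) * side]]
        for row in range(side)
    ]


# Hypothetical payload bytes; a real pipeline would read them from captured traffic.
payload = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
grid = bytes_to_grid(payload)
print(len(grid), len(grid[0]))  # 28 28
```

Trimming discards everything beyond the first `side * side` bytes, so the choice of sample length is a trade-off between input size and retained information.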

#### *1.2. Our Contributions to the Topic*

This survey deals with the transformations of network traffic that serve as the input of deep learning models, with particular attention paid to CNN models. We studied many articles and finally chose 136 recent papers in this field. It is essential to highlight that other surveys present these studies from the perspective of cyber-attacks, particular system traffic, or mixed deep-learning models.

1. We explicitly focus on the transformations applied to network traffic before it is fed to the CNN models.

2. We group the reviewed articles into three categories:

	- (a) Traffic classification.
	- (b) Malware detection.
	- (c) The combination of traffic classification and malware detection.

	The first contains all articles that focus on encrypted traffic identification. The malware detection category covers the detection of unwanted traffic. The last includes research that addresses both of these categories. The categories are firmly connected with the datasets utilized by each group of authors. Moreover, it is possible to distinguish different themes of the datasets, such as dealing with VPN traffic or exploring features of botnets.

3. This work highlights and describes the utilized datasets and the architecture of each CNN model.

The proposed taxonomy is the first on this topic. While preparing this work, we adopted and extended the concept from the CNN chapter of [3].

As the number of papers in the described field is constantly growing, we decided to review the proposed works and highlight all scientific observations. The detailed comparison in this area can establish current trends as well as enhance network traffic analysis.

#### *1.3. Paper Structure*

This paper consists of nine sections. Section 2 touches upon fundamental issues in the discussed scientific field. The following sections are devoted to the main categories of methods, which reflect the ways in which network traffic is transformed before it is passed to CNN-based neural models. Section 3 presents approaches based on raw traffic, i.e., network traffic without any filtering. Section 4 covers all transformations that operate on flows, i.e., portions of the entire traffic. Section 5 highlights data manipulations on payloads extracted from the raw traffic. Section 6 focuses on all concepts based on payloads extracted from flows. Approaches based on traffic features are the main subject of Section 7. Section 8, in contrast to Section 7, gives a concrete overview of those articles that additionally focus on the feature extraction process. Section 9 concludes the paper.
