1. Introduction
With the rise in Internet usage, proper management of network traffic in modern ISPs has become a challenge of paramount importance. Categorizing each network packet on the fly is key to proper decision-making. For many applications, such as quality-of-service (QoS) policing or traffic shaping, routing can be more efficient if the type of data consumed is known. For example, in Internet telephony, latency is critical and dropped packets are acceptable, whereas in a batch download latency is acceptable, but there must be guarantees that all information is received at least once. This packet classification cannot be performed offline, i.e., after the network flow has already completed, but must be performed online, i.e., while the first packets of the flow are passing in real time through the network equipment. However, as the Internet has become a fundamental part of our society, privacy concerns have arisen, leading to the widespread encryption of network traffic. Encryption poses yet another challenge to traffic classification, as classical methods such as those mentioned below no longer apply. Several methods have been tested for real-time network packet classification over the years: initially, the rudimentary use of port-based classification [
1] (with a basis on TCP and UDP well-known port numbers); later, we saw the use of Deep Packet Inspection (DPI) using pattern matching and detecting protocol anomalies [
2]; and, most recently, algorithms based on statistical features, machine learning-based techniques such as Support Vector Machine (SVM) [
3], and deep neural networks such as convolutional neural networks (CNNs) [
4,
5]. In particular, CNNs are at this moment a well-known solution for traffic classification, because their ability to find spatial correlations between earlier and later parts of the data is useful in network traces, where contiguous bytes can be highly correlated. In fact, CNNs are becoming increasingly relevant due to their reportedly high accuracy values, but they have a major drawback: they generally behave as black boxes, because the statistical features detected by the neural network are hard to explain outside the context of the predictor itself. Thus, we decided to study the capabilities of CNNs in depth using a public network traffic dataset, focusing on whether it is feasible to extract usable data from the model. By analyzing state-of-the-art models and their methodologies for training and testing, we were able to identify common practices, keys to success, and possible malpractices, such as data leakage.
Compared to previous works, our methodology includes the following main contributions. Firstly, we made sure that the separation between packets in the training and validation sets prevented data leakage; that is, we ensured that packets from the same flow never appeared in both sets, in order to prevent overfitting and avoid bias in the model. While this is an obvious practice in AI, it is not as simple as it sounds when processing network packets, because packets in the same connection can cause data leakage if they are incorrectly split. Secondly, we wanted to gain a deeper understanding of what CNNs are actually looking at. The lack of explainability or interpretability is a major concern for their feasibility and deployment in real-world scenarios, and we should hold these models to the same standards of privacy, unbiasedness, and reliability as other systems in the industry. Once we obtained the right information, we built a reference CNN model for network traffic classification. From the aforementioned dataset, we filtered out the headers of the packets in order to manipulate just the encrypted payload data and avoid using IP addresses and ports. After training, we analyzed both the performance and the explainability. Regarding explainability, we studied how to extract data directly from a deep neural network. For this, we applied techniques used in other research fields, such as image and video classification, to create a useful representation of the reasoning behind the classification of network packets. To make our methodology applicable in all contexts, we used an algorithm that is universally applicable to all CNNs, and developed several modifications of the model in order to see how its behavior changes depending on the added or removed features.
As a result, we focused on understanding important factors for model performance. First, we analyzed characteristics of the data that might be relevant. A good example of this is that data leakage leads to overfitting when packets in the same flow are in both the training and test set. We also analyzed the different parameters of the model that impact performance. Finally, the decision was analyzed using eXplainable Artificial Intelligence (XAI) techniques, so that we could try to understand how the features were built and what they focused on.
The rest of this paper is structured as follows. Next,
Section 2 shows the related work, in order to contextualize our research. Then,
Section 3 explains how the available datasets were processed, trying to avoid the bias that is present in many prior network classification works. In
Section 4, we present the different components of a CNN architecture to classify network packets. After that,
Section 5 presents gradient analysis as a way to explain the convolutional models.
Section 6 applies this technique to real data, so we can understand what the neural network generalizes from the encrypted traffic it trains with. Finally,
Section 7 concludes the paper, providing a summary of the main results and contributions.
3. Initial Data Processing: Avoiding Model Bias
The very first task we have to solve when working with AI is data preprocessing. Our main goal is to extract features adapted to network packet classification and build models with useful and standardized information. It is also essential to manage the bias in our data, so that the predictions are not only realistic but also accurate and fair. This section details how data preprocessing was handled and the processes, decisions, and developments that were made to ensure, to the best of our ability, that the model is capable of classifying traffic in a real network, avoiding overfitting or model bias.
The dataset used in this paper is the experimentation dataset ISCXVPN2016 [
33], composed of 25 GBytes of captured data traffic divided into 152 capture files. It is grouped in 42 labels, depending on the application. For our study, we group these labels into general categories representing traffic classes [
35], such as Video (Vimeo, Facebook Video, and YouTube), Chat (Facebook Chat, Skype or ICQ), Web E-mail (GMail), File Transfer (FTP), or Audio (Skype). We make this decision because there is no need to look at the Over-The-Top (OTT) service provider to apply fair QoS policies.
As a multiclass classification problem, always choosing the same class in a balanced-class situation would yield an average accuracy of $1/C$, where $C$ is the number of classes. According to the current literature, some systems reach values close to 90% in class averages [
25,
27,
36]. However, one of the main problems detected is that these papers do not make their experimental protocol clear, e.g., how they split the training and test packets. In the case of network traffic, it is especially important to split them correctly to avoid undesired biases. For example, performing a temporal split on a traffic capture of 100 packets would mean taking the first 70 for training and the last 30 for testing. In that case, we might be training the model to recognize the connection establishment phase but not the closing part, and the model would exhibit strong biases in a real environment. Will this problem be solved if the training and test data are taken randomly? Not necessarily. By sampling random packets, we are probably placing packets from the same flows or sessions in both the training and the test sets, causing data leakage. If part of the test information leaks into the training set, the model does not learn to generalize but to recognize those related packets by identifiers such as IP addresses, port numbers, or domain names. This is not a realistic scenario, as it encourages the neural network to learn particular literals of our dataset instead of general conditions. Even if IP addresses and ports are discarded, there might be information in the data that identifies the flow, such as session identifiers or encryption algorithms and keys.
In order to give full context on how to avoid these issues, we now give an overview of the whole process, from the time the traffic trace is obtained until it is classified by the neural network. The first step is to filter out of each traffic capture those packets that do not contain application payload, for example ACKs, DNS, and other protocols. This is performed to prevent the neural network from training with packets that, in a real application, could be classified by well-known ports or with DPI techniques. Then, we extract the individual packets from each traffic capture and convert them to binary files. In this step, we also discard the Ethernet, IP, and TCP headers, since they do not contain application payload, and we want to avoid training with data that depend on the network conditions of the capture. We take advantage of this process to group the packets by flows using the quintuple (source IP, destination IP, source port, destination port, transport protocol). This does not mean that statistics are extracted, simply that the packets in the same flow are labeled together. This method was chosen to ensure the independence of traffic flows and to separate the data into training and test sets while avoiding data leakage. This approach leverages the similarity among flows of the same application (see, for instance,
Figure 1). Thus, the objective is to train and test with packets in different flows in order to avoid overfitting and data leakage. Next, these individual packets are converted into images so that they can be used to train the neural network. Each byte of payload information is converted to a pixel, and then the pixels are arranged according to the desired dimension.
The final step is to generate a convolutional neural network that will process these images, classify them, and extract the most probable application value. With this label, we can apply different QoS policies to improve the experience in the network. With this process, an application label is extracted from each packet individually, without the need to keep state about the rest of the flow data. It is important to work with individual packets and not with network flows to gain speed and reduce latency because, in a high-speed system, grouping data extracted from a traffic flow can be time-consuming, and it can affect the performance of latency-critical network applications such as VoIP, video calls, or live gaming.
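The flow-grouping step described above can be sketched in Python. This is a minimal illustration, not the authors' actual preprocessing code: the packet dictionary fields and function names are hypothetical.

```python
from collections import defaultdict

def flow_key(pkt):
    # The quintuple used for grouping: (src IP, dst IP, src port, dst port, protocol).
    return (pkt["src_ip"], pkt["dst_ip"], pkt["sport"], pkt["dport"], pkt["proto"])

def group_by_flow(packets):
    """Group payload-carrying packets by flow, so that a later train/test
    split can be performed per flow instead of per packet."""
    flows = defaultdict(list)
    for pkt in packets:
        if not pkt["payload"]:  # skip ACKs and other segments without payload
            continue
        flows[flow_key(pkt)].append(pkt["payload"])
    return dict(flows)
```

Note that grouping only labels packets with a flow identifier; no per-flow statistics are computed, so each packet can still be classified individually at inference time.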
3.1. From Packets to Images
As explained above, we start with the processing of the protocol headers. As shown in
Figure 2, a typical network packet includes ETH, IP, and TCP or UDP headers containing information specific to the network conditions at the time of packet capture. In our model, we propose discarding these headers to isolate the payload, thereby retaining only the “Application Data” bytes. This task is performed in a data preprocessing step using Tshark, available in the GitHub repository linked in this paper (
https://github.com/elbisbe/XAI_DL_NetworkPacketClassification accessed on 14 March 2023). The process, summarized here, is essentially a simple filter. These headers should be discarded for several reasons. The first is that, due to the number of servers in a cloud platform, they do not necessarily provide reliable information about the user’s application (e.g., different Google services share the same IP addresses) [
37]. Therefore, traffic would not be correctly classified by application. The second reason is that they can easily generate false positives in the neural network model prediction, since the model could learn to differentiate an application by a specific IP address, a specific port, or even unrelated TCP parameters such as the window size. Additionally, this is not necessary in a neural network model, as a well-known-port system would leverage port information more efficiently and avoid overfitting. Likewise, other TCP parameters such as the SYN or ACK sequence numbers should not be used to classify traffic. Once the packet headers are filtered, the payload binary data have to be transformed into an image or another data structure that a convolutional neural network accepts as input. To transform the data into images, we use each byte of information in the packet as a pixel and represent that value as a grayscale value, i.e., 0 is a black pixel and 255 is a white pixel. This process helps to normalize the dataset and can help in detecting patterns in the image representation of the packets.
However, images tend to have fixed sizes, whereas network packets can vary in size: from minimal sizes of 40 bytes to the maximum of the MTU (usually 1500 bytes), even for each class (E-mail, Chat, Video, etc.). Thus, we need all sequences and images to have a fixed size because CNN libraries usually need a fixed input size. When a packet does not reach the size to be transformed as an image, missing data should be filled in. Two solutions are proposed: the first is to fill in the rest of the packet with zeros (zero padding); the second is to fill it in with random values (random padding) [
12]. We used the first alternative, since padding should only bring packets up to the model input size without perturbing their content. Given the different packet sizes, filling with random values would introduce undesirable noise into the model; in fact, we confirmed that random padding produced worse performance than zero padding. Packets without payload data (e.g., TCP SYN or ACK segments) were removed from the dataset, as they would be pitch-black images. This packet-to-image process can be performed for grayscale one-dimensional (1D) or two-dimensional (2D) structures [
38], although in other studies, so-called three-dimensional (3D) architectures have also been considered [
28]. By 3D images, they mean images with RGB color gamut, not grayscale. This choice of the number of dimensions may affect the performance, both in terms of the accuracy of the model and in terms of the throughput of the system. It is expected that a neural network can take advantage of image acceleration hardware architectures to process the 2D and 3D images in parallel. Moreover, in these cases it must be considered that an artificial dimension is created in the packets, where the information is originally sequential, and a row of pixels does not have to be related to the one immediately above or below it, as will be discussed in
Section 5. In the work by Zhang et al. [
28], they presented a comparative analysis of 1D, 2D, and 3D convolutional neural networks (CNNs) for traffic packet classification. The results showed that the 1D-CNN achieves the highest precision and F1-score, which is consistent with previous findings. Interestingly, the 3D-CNN closely matches the performance of the 1D-CNN and even surpasses that of the 2D-CNN, despite the potential increase in complexity and spatial information introduced by the 3D images. Furthermore, that paper provides insight into the resource usage, speed, and bandwidth of the three cases. It is observed that, as expected, the addition of dimensions to the convolutional input layer leads to improved speed performance of the classifier.
Regarding the size of the packet, it could be considered that a larger image size will result in better accuracy, due to the increase in information about each packet that is provided to the neural network. In our preliminary results, we observed that increasing the image size significantly improves neural network performance in packet recognition tasks, with accuracy rising from 56% to 66%. However, further increases in image size beyond that point did not lead to performance improvements: larger images produced the same 66% accuracy. In this work, a size of 1024 pixels was used, i.e., square images of 32 × 32 pixels, so that we did not exceed the typical MTU of 1500 bytes. We empirically chose this size because smaller images yield worse performance, while larger ones do not exhibit any noticeable improvement.
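The byte-to-pixel mapping with zero padding described in this subsection can be sketched as follows, assuming NumPy; the function name and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def packet_to_image(payload: bytes, size: int = 1024, width: int = 32):
    """Turn an encrypted payload into a grayscale image: one byte per pixel
    (0 = black, 255 = white), zero-padded (or truncated) to `size` bytes,
    then reshaped into a `width`-pixel-wide 2D image (32 x 32 for size=1024)."""
    data = np.frombuffer(payload[:size], dtype=np.uint8)
    padded = np.zeros(size, dtype=np.uint8)      # zero padding for short packets
    padded[:len(data)] = data
    return padded.reshape(width, size // width)
```

For the 1D variant, the `reshape` step is simply omitted and the padded vector is fed to the model directly.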
3.2. Organizing Traffic Flows
To avoid any data leakage from packets in the same flow in training and test datasets, we propose a folder structure that allows having the packet images of the same flow in a common path. This will be useful for testing with independent inputs in the model. As shown in
Figure 3, a folder structure with three levels was utilized. At the first level, the folders with the different traffic captures from the dataset are organized, totaling 42 groups of applications and services. Within these folders, at the second level, all flows that comprise each traffic capture are stored; each capture can contain a varying number of flows, and each flow can have a different number of packets. At the third and final level, each flow is divided into 1D and 2D images. We did not consider higher-dimensional images, as state-of-the-art works did not show substantial improvements with them. These images are stored in their respective folders and are used as input files for model training or testing. A problem with this system of grouping images by flows is that in some cases, such as File Transfer-type applications, there will be very few flows but with thousands of images each, while in other cases we may have many small flows with few images. Our proposed solution limits how many packets are used per traffic flow, so that there are no large differences in the number of packets per flow. Similarly, the work in [
29] developed traffic classifiers with high accuracy percentages that used only the first 20 packets in each flow. However, this may introduce bias in the training, since connection closure will not be observed in all flows.
Figure 4a illustrates the standard dataset division process as described in the reviewed literature. In this scenario, each packet from a traffic capture (represented as colored geometric shapes) is divided between the train and test sets, either using a round-robin algorithm or randomly. This approach can lead to data leakage during model testing, as packets from flows used in training share information with packets from the same flows in the test set.
Figure 4b depicts our proposed methodology, which includes grouping packets into flows to ensure no information is shared between the train and test splits. This prevents overfitting and results in a model that should perform more consistently with what is expected in a real network environment. Hereafter, we will refer to this phenomenon as ‘flow dependence’.
With these considerations, we can perform the desired data-leakage-free train–test split. Suppose we want to divide the dataset between 70% training files and 30% test files. In a first approach, the first 70% of the flows in the trace are assigned to the training dataset, and the last 30% are assigned to the test dataset. The advantage of this method is that it is easy to implement and debug. The main disadvantage is that the training flows will be the first of the capture and, in case there is a relationship among the flows, some overfitting would be introduced. An alternative is to use a round-robin system between the flows: the first seven flows go to training, the next three to testing, then seven again to training, and so on. This method corrects the possible overfitting introduced by the order of the flows. The last proposal consists of randomly assigning 70% of the flows to the training set and 30% to the test set. This method is harder to debug and check for errors, but it avoids any problems with the order of the flows, their relations, and the resulting overfitting.
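The round-robin variant described above can be sketched as follows; this is a minimal illustration under the assumption of a 7:3 train/test pattern over flow identifiers, not the authors' code.

```python
def round_robin_split(flow_ids, train_quota=7, test_quota=3):
    """Assign whole flows to train/test in a 7:3 round-robin pattern,
    so that packets of one flow never appear on both sides of the split."""
    train, test = [], []
    cycle = train_quota + test_quota
    for i, fid in enumerate(flow_ids):
        (train if i % cycle < train_quota else test).append(fid)
    return train, test
```

Because the split operates on flow identifiers rather than individual packets, any packet-level dataset built from `train` and `test` is leakage-free by construction.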
In summary, this section has detailed how to avoid bias in the classification of encrypted traffic packets using convolutional networks. Our assumption was that the current literature was not devoting enough effort to the handling and preprocessing of the payload data. With the given preliminary results, we have shown that overfitting and model bias might be common in the literature. Our aim in the next sections is to study how the neural network processes the input data, in order to bring transparency to the decision-making and audit these black-box-like classification systems.
5. Gradient Analysis
As we have seen, proper separation into flows is needed in order to avoid overfitting or data leakage and to reach a better overall performance. We reached this conclusion by analyzing the network's behavior when exposed to data from the same flow. In order to prevent this type of problem in the future, we looked for a way to obtain an explanation of the behavior of the models. For this purpose, we applied GradCAM (Gradient-weighted Class Activation Mapping). This XAI algorithm was initially presented in [
39] and used for image-based classification, captioning, and a visual question-answering model, such as ResNet [
40]. However, this algorithm is applicable to a wider variety of model families, such as any convolutional neural network (CNN) model with fully connected layers, with structured output, or with multi-modal inputs. GradCAM is a generalization of CAM for CNN architectures [
41].
The main concept behind GradCAM is that deeper representations in a CNN capture higher-level visual constructs; convolutional layers naturally retain spatial information, which is lost in fully connected layers. As such, the last convolutional layers should hold a spatial representation of the features extracted from the input, so the algorithm uses the gradient information flowing into the last convolutional layer of the CNN to identify the regions the model considered important for its decision. We will now briefly explain the algorithm, visually shown in
Figure 6.
To obtain the class activation map or heatmap, we first need to select the CNN layer in our model from which we want to extract the gradient; as shown in
Figure 5, this is the last convolutional layer. As a result, we obtain (i) the inputs of the model $X$; (ii) the output $A$ of that CNN layer, which consists of $k$ tensors of the same size; and (iii) the output $y$ of the softmax activations of the model. As an observation, GradCAM focuses on the feature extraction part of the model, not on the feature analysis part. In that sense, GradCAM is agnostic to the type of convolutional model, as long as the input is fed into some type of CNN that retains spatial information, so that the extracted heatmap is spatially correlated to the input $X$.
We fed the input $X$, the bytes from the packet payload, into the model, while retaining the gradients of $y$ with respect to $A$, computed with backpropagation, i.e., $\partial y / \partial A$. These gradients were used together with $A$ to compute a heatmap for each class. Finally, the heatmaps were normalized and scaled up to the original size of the input $X$. Usually, this last step is performed when a representation overlaying both the original image and the activation map is desired. However, in cases where we want to extract the data as-is, we can skip the normalization step.
In addition to the standard use of GradCAM, that is, detecting the regions of the input that are relevant to the model's prediction, one can also obtain explanations that highlight the regions that would make the model change its prediction. This is performed by using the gradient of the negated score of each class with respect to $A$. These are called counterfactual explanations, and they further reinforce the insight the standard method gives. Counterfactual explanations are very useful in non-binary classification problems, as they show which features the model uses to determine that a packet does not belong to a category. Merging both results, we obtain the most important positive and negative features the model considers for the prediction, as shown in
Figure 7.
More technically speaking, to obtain the heatmap $L^c$, we have to study the relation between each output tensor $A^k$ of the last convolutional layer and the output of a class $y^c$, which is just the $c$-th component of the vector $y$. For each element of $A^k$, we compute a measure of how that element contributes to $y^c$. In particular, two options can be used: (i) relying on the gradients, i.e., for the $i$-th element of the tensor,

$$g^{kc}_i = \frac{\partial y^c}{\partial A^k_i}, \qquad (1)$$

(ii) or using guided gradients, i.e., for the $i$-th element of the tensor,

$$\hat{g}^{kc}_i = \mathbb{1}\!\left[A^k_i > 0\right] \cdot \mathbb{1}\!\left[\frac{\partial y^c}{\partial A^k_i} > 0\right] \cdot \frac{\partial y^c}{\partial A^k_i}. \qquad (2)$$

By using the first, we consider the gradients in every part of $A^k$, while in the second approach we only take into consideration elements whose value is positive, i.e., they would trigger the ReLU activation, and whose gradient is also positive, i.e., they contribute towards class $c$ in the classification decision.

Using either of the previous gradients, we compute the average score of $A^k$,

$$\alpha^c_k = \frac{1}{Z} \sum_i g^{kc}_i, \qquad (3)$$

which represents how much $A^k$ contributes to class $c$, where $Z$ is the number of elements of $A^k$. Moreover, it is worth noting that this works for 1D, 2D, or $n$-dimensional CNNs, since we compute the gradient at each element of $A^k$ and average them all together, ignoring their original position or the 2D/3D structure of $A^k$.

After obtaining the averaged gradient score, a partial linearization is made to keep only the parts that produce a positive response, that is,

$$L^c = \mathrm{ReLU}\!\left(\sum_k \alpha^c_k A^k\right). \qquad (4)$$

Similarly, (3) can be modified to obtain counterfactual explanations, i.e., zones that discard a class, contributing negatively to $y^c$. This is performed by computing the gradient of $-y^c$. Similarly to (1),

$$\bar{g}^{kc}_i = \frac{\partial (-y^c)}{\partial A^k_i} = -g^{kc}_i, \qquad (5)$$

so, essentially, $\bar{\alpha}^c_k = -\alpha^c_k$ when using gradients. However, for guided gradients, there is no such direct relation between the two quantities, since the sign conditions in (2) change when the score is negated. In both cases, the weights of each $A^k$ can be computed using the previous gradients,

$$\bar{\alpha}^c_k = \frac{1}{Z} \sum_i \bar{g}^{kc}_i, \qquad (6)$$

yielding the final expression for the counterfactual relevance map:

$$\bar{L}^c = \mathrm{ReLU}\!\left(\sum_k \bar{\alpha}^c_k A^k\right). \qquad (7)$$

Together, $L^c$ and $\bar{L}^c$ explain the decision of the model. On the one hand, $L^c$ describes the regions that contribute to choosing class $c$. On the other hand, $\bar{L}^c$ describes which regions hint not to choose class $c$.
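The per-channel gradient averaging and the weighted, ReLU-clipped combination described above can be sketched in NumPy, assuming the activations and gradients of the last convolutional layer have already been extracted (e.g., via framework hooks); this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def grad_cam(activations, gradients, counterfactual=False):
    """GradCAM relevance map from the last convolutional layer.

    activations: array of shape (k, H, W) -- the k output tensors A^k
    gradients:   array of the same shape  -- dy^c/dA^k for a class c
    """
    if counterfactual:
        gradients = -gradients                # gradient of the negated score
    # Average each gradient tensor into one weight per channel (alpha_k^c).
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum of the activation maps, then ReLU (partial linearization).
    heatmap = np.tensordot(alphas, activations, axes=1)
    return np.maximum(heatmap, 0.0)
```

The same code works for 1D activations of shape (k, N) by averaging over the last axis only, since the weighting ignores the spatial structure of each tensor.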
Another consideration is that, in multiclass applications like this one, the technique only presents data relevant to the predicted class. For example, if a packet has been identified as “Chat”, the information GradCAM can provide will relate to why that packet was identified as “Chat”, not to the other classes.
GradCAM has been widely tested and validated in computer vision problems (such as weakly supervised localization, segmentation, class discrimination, the identification of bias in datasets, image captioning, or visual question answering) [
41,
42,
43], but the same principles can be applied to a network management problem involving CNNs, like the one at hand in this article.
The specific implementation of this algorithm for the model used in network flow classification focuses on the gradients obtained in the feature extraction part. Therefore, despite the fact that RNNs and MLPs can be used to find other correlations in the feature analysis, GradCAM will not consider them. Nevertheless, it still gives a valid metric to understand the general behavior of the convolutional model, as the later layers can only work with the extracted features.
In the next section, we will discuss how GradCAM was applied for both 1D-CNNs and 2D-CNNs and what the explanations are for the decisions of the models.
6. Results
In this section, we present the results of our experiments grouped into four categories. First, we analyze the performance of our model compared to the state of the art. Second, we focus on the data, and how some characteristics of the input data may significantly alter the performance. Third, the same process is repeated for the model and its parameters. Last, we apply XAI to understand the decisions of the model.
6.1. Performance of the Model
In
Section 4, we presented the CNN architecture and two models, 1D-CNN and 2D-CNN. Both of these networks can be trained for any purpose. We are particularly interested in classifying packets in traffic classes—e.g., Video or Chat—for QoS policing purposes.
After dividing the dataset into 70% training and 30% test sets randomly, an accuracy of 90% was easily obtained with both models, comparable to state-of-the-art alternatives [
25,
27,
36]. However, as we explained previously, we argue that packets from the same flow must belong entirely to either the training set or the test set, never both.
Consequently, in
Table 3 we present the performance metrics of the 2D-CNN model using a train–test split with the flow division. All of them (precision, recall, and F1-score) are around 85%, which is comparable with existing architectures and other alternatives. The small drop of about 5%, as covered in the next subsection, is due to flow dependence. Furthermore,
Figure 8 displays the confusion matrix for the 1D model after some data preprocessing. It shows that the error is mostly concentrated in the Audio class, which is confused with the E-mail and Video classes.
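As an aside, the per-class metrics reported in Table 3 can be derived from a confusion matrix such as the one in Figure 8; a minimal NumPy sketch (illustrative, not the authors' evaluation code):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a confusion matrix where
    cm[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # column sums = all predicted as class j
    recall = tp / cm.sum(axis=1)      # row sums = all true class i samples
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```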
6.2. Understanding Your Data
There are some aspects of the data that are worth mentioning: the flow dependence and its effects, the effect of direction of the flow, and the padding of small packets.
In order to study the impact of flow dependence, we trained the same model with two different datasets. First, we trained the model with just a random split over the packets, as shown in
Figure 9. Using a validation set of random packets, we observed that the training and validation curves overlap, showing an accuracy above 90%. At first sight, this would suggest that the model is not overfitting and that the performance is optimal.
On the other hand, when training and validating the system with independent flows, the performance changed drastically. While the accuracy in training was about 90%, independent flows in the validation set showed that the real number was closer to 65%. That is, the model is not learning to generalize the specific per-class characteristics of the packets, but is learning fixed patterns that are related to the particular flows or servers involved. This overfitting due to data leakage renders the model incapable of handling real-life situations where all data might be coming from systems not present in the training data.
Second, we analyzed the difference depending on the direction of the flow.
Figure 10 shows the GradCAM relevance for the FTP class depending on the direction. Both lines show the relevance of each byte of the packet, denoted by its position, when processed by the CNN, revealing differences between uplink (client to server) and downlink (server to client) FTP traffic. Due to the asymmetry of the client and server roles in this case, uplink relevance is focused on the first bytes of the packets, where the requests should be the most relevant part, while downlink relevance concentrates on later bytes. This makes it clear that bidirectional flows should not be mixed, since the two directions can exhibit different and confusing behaviors if considered together.
Third, we studied the impact of padding, that is, filling in extra data when the packet does not reach the desired size of 1024 bytes. This can be performed in two ways: (1) random padding, i.e., adding random bytes at the end of the packet; or (2) zero padding, i.e., adding 0 bytes to fill the rest of the packet. Although both options might be reasonable, they treat the packet-size attribute differently. Random padding hides the size of the packet, adding data that mimic the entropy of the encrypted payload. In contrast, zero padding exposes the end of the packet, since the model should be able to notice the final run of zeros.
In this case, we believe that packet size is an important attribute that can intuitively help identify traffic classes, since we expect larger packets to belong to data transfers and smaller packets to delay-sensitive services. Therefore, we only considered zero padding, as shown in
Figure 1; the bottom of the 2D images is filled with black pixels.
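Since 1024 = 32 × 32, the 2D representation can be built by zero-padding each payload and reshaping it into a square grayscale image. The helper below is a minimal sketch; the 32 × 32 geometry is our assumption for illustration:

```python
import numpy as np

PACKET_LEN = 1024  # fixed input size used by the models
IMG_SIDE = 32      # assumed width/height of the square 2D representation

def to_model_input(payload: bytes) -> np.ndarray:
    """Truncate or zero-pad a packet payload to PACKET_LEN bytes, then
    reshape it into a 2D 'image' whose bottom is black (zero) padding."""
    buf = np.zeros(PACKET_LEN, dtype=np.uint8)          # zero padding by default
    data = np.frombuffer(payload[:PACKET_LEN], dtype=np.uint8)
    buf[:len(data)] = data
    return buf.reshape(IMG_SIDE, IMG_SIDE)
```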
6.3. Understanding the Model
With all the previous observations about the data, we are now better able to understand what the model might be doing. At this point, we studied what the differences between both models are and how some parameters may affect the performance.
As we mentioned in previous sections, both models, 1D-CNN and 2D-CNN, perform similarly. As other studies highlighted [
28], the difference between models is noticeable but minor. The 1D approach assumes that the sequence of bytes is spatially correlated, i.e., that nearby bytes may be related. The 2D approach additionally relates bytes separated by a fixed offset, namely the image width. Although such longer-range correlations might exist, we did not observe them when plotting the data and the 2D relevance.
Figure 11 depicts the relevance of each position of the 2D representation to showcase the absence of strong 2D relations. Two examples are shown: for the class E-mail (
Figure 11a), there is no clear relation; and for the class FTP (
Figure 11b), there is a spurious 2D relation that happens to be periodic, with a small period of five to eight bytes. Additionally, this 2D metric is sensitive to the width of the image, which harms generality if, for some reason, this offset changes depending on the environment.
Apart from some hyperparameters that can be optimized and depend on the amount of data, such as the number of convolutional stages in the feature extraction, we noticed that the amount of convolution padding (different from the packet padding) might affect the results. Convolution padding is an important parameter of convolutional layers, because the borders of the signal are underrepresented if no padding is added for the convolution. For 1D, this means adding extra data at the beginning and the end of the sequence, while for 2D, this happens every time we reach the borders of the image.
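The border effect can be illustrated with NumPy's one-dimensional convolution modes: without padding (mode "valid") the output shrinks and border bytes contribute to fewer outputs, whereas a padded convolution (mode "same") preserves the input length. This toy example uses zero padding at the borders; deep learning frameworks also offer replication-style border modes:

```python
import numpy as np

signal = np.arange(8, dtype=float)  # toy 1D "packet" signal of length 8
kernel = np.ones(3) / 3.0           # length-3 moving-average filter

valid = np.convolve(signal, kernel, mode="valid")  # no padding: length 8 - 3 + 1 = 6
same = np.convolve(signal, kernel, mode="same")    # zero-padded borders: length 8
```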
Figure 12 represents the effect of convolution padding length for the relevance of the 1D model. Previous works [
44] already used convolution padding successfully. Different types of border padding were tested, and “same” padding (replicating the last value at the border) yielded the best results, as the model did not lose spatial data, unlike padding with zeros or with the maximum value. Another test padded the test-set packets with an arbitrary number of extra pixels, permanently altering the length of all packets; the classification capabilities of the model degraded so significantly that it could no longer work as intended. Other works also show that packet size is a relevant metric when analyzing the performance of CNN-based classifiers [
12]. Upon testing different widths, we observed a clear softening of the borders of the relevance when using padding larger than 1 pixel, which may be the result of giving the gradient further space to develop, as per
Figure 12. We settled on using 2-pixel padding from here onwards, as it provided patterns generic enough to analyze while keeping a clear outline of each packet’s relevant positions.
6.4. Understanding the Decision
After studying the data and the model, we can explore the classification results using XAI techniques. In this section, we use GradCAM relevance for the different classes of the dataset to analyze the feature extraction procedure of CNNs in encrypted packets. After understanding the features, we provide a simplified ML model that achieves a similar performance with much less complexity.
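For reference, for a 1D-CNN the GradCAM relevance used throughout this section reduces to a ReLU of the gradient-weighted sum of the last convolutional feature maps. The sketch below assumes the activations and gradients have already been extracted from the network:

```python
import numpy as np

def grad_cam_1d(activations, gradients):
    """GradCAM for a 1D-CNN.

    activations: (K, L) feature maps of the last convolutional layer.
    gradients:   (K, L) gradients of the class score w.r.t. those maps.
    """
    weights = gradients.mean(axis=1)  # alpha_k: global-average-pooled gradients
    cam = np.maximum((weights[:, None] * activations).sum(axis=0), 0.0)  # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize relevance to [0, 1]
    return cam
```

Counterfactual relevance is obtained analogously by negating the gradients before pooling, highlighting the positions that argue against the class.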
First, the relevance per sample is computed and displayed on the left-hand side of
Figure 13. All of them but
Figure 13g display similar behavior: they are smooth functions with some random noise. To model this, we use the median as the centroid of each group, which, by construction, is less noisy. The right-hand side of the figure shows this median behavior as representative of the relevance of each byte, together with the histogram of packet lengths.
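The per-class centroid is simply the per-position median across the sample relevance curves, which suppresses per-sample noise; a minimal sketch:

```python
import numpy as np

def class_median_relevance(relevances):
    """Per-position median over the per-sample relevance curves of one class.

    relevances: array-like of shape (n_samples, n_positions); the median curve
    acts as a noise-robust centroid of the group."""
    return np.median(np.asarray(relevances, dtype=float), axis=0)
```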
Figure 13a shows the relevance and the counterfactual relevance according to GradCAM for class Chat. In general, the relevance is always above the counterfactual relevance, except in a neighborhood centered at 300 bytes, where the counterfactual relevance increases and surpasses it. As shown in
Figure 1, packets in this class have different sizes, and their images are filled with zeroes depending on such sizes. Thus, when looking at
Figure 13b, we observe that this point is not random, but the mode of the packet length. This means that the model is mostly focusing on the last bytes of content in the packet. Given the convolutional nature of CNNs, they are powerful at detecting borders (abrupt changes from high to low values), so it is likely that the convolutional model is simply extracting the feature that the packet ends at around 300 bytes. Bear in mind that we used zero padding to fill the packet images, so the CNN was able to detect this abrupt change to a zero-valued region at the end of each packet.
Similarly,
Figure 13c exposes the relevance and counterfactual relevance of class Audio, where the same behavior occurred at around 100 bytes. This is confirmed by
Figure 13d, with a mode of the packet length at around 100 bytes.
For class E-mail, the result is not exactly the same. In this case, the relevance in
Figure 13e is always above the counterfactual relevance except for a small interval centered at 100 bytes. By looking at the histogram in
Figure 13f, this highlights that packet sizes around 100 bytes are unlikely for class E-mail.
Figure 13g shows the hardest-to-categorize behavior of all classes, class FTP. Due to the spread of the packet sizes, the relevance values are spread across all positions, with an important maximum only at 1024 bytes. In particular, this maximum is confirmed by the mode of the packet size in
Figure 13h. This confirms the poor performance of this class observed in the confusion matrix (
Figure 8), due to the overlap of the packet-length histograms among the classes.
Class Video in
Figure 13i exhibits multi-modal relevance behavior with modes around 50, 125, and 1024 bytes. For the rest of the values, the relevance and the counterfactual relevance are equal. The histogram in
Figure 13j confirms yet again the hypothesis that packet size is the most relevant attribute in encrypted traffic classification.
These results allow us to conclude that CNNs are overcomplicated models for classifying high-entropy payloads, whose only exploitable information is the payload size. With this in mind, our next step is to provide a simpler model that only uses the packet size, to check whether a similar performance can be achieved with a higher packet classification rate in packets/s.
For this reason, we built a decision tree that only uses the packet size for encrypted packet classification. The performance obtained across all metrics is around 80%, comparable to the 85% obtained before. This evidence supports our hypothesis that packet size is the most important attribute and that encryption poses a challenge that CNNs are not yet able to overcome effectively.
Figure 14 depicts the trained decision tree, showing the simplicity and minimal number of parameters of the provided model. Moreover, this solution runs much faster than a CNN, which is necessary in this scope to classify network packets in real time.
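To convey how simple a size-only classifier can be, the snippet below hand-rolls a depth-1 decision stump on the single packet-size feature for a two-class toy problem. This is only an illustration: the tree in Figure 14 is a multi-class model trained on the real dataset.

```python
import numpy as np

def fit_size_stump(sizes, labels):
    """Exhaustively pick the packet-size threshold that best splits the data:
    a depth-1 decision tree on the single packet-size feature."""
    best = (None, -1.0)
    n = len(labels)
    for t in np.unique(sizes):
        left = labels[sizes <= t]
        right = labels[sizes > t]
        # majority-vote accuracy of the split at threshold t
        acc = 0.0
        if len(left):
            acc += np.bincount(left).max() / n
        if len(right):
            acc += np.bincount(right).max() / n
        if acc > best[1]:
            best = (t, acc)
    return best  # (threshold, training accuracy)
```

A deeper tree is built by applying the same threshold search recursively on each side of the split, which is exactly what standard decision-tree tooling does.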
7. Conclusions
In summary, we have analyzed the key aspects of CNN-based packet classification methodologies, focusing on their wise choices, in order to replicate them, and on their mistakes, in order to avoid them. With this information, we built a model that achieved state-of-the-art performance on the most used dataset, ISCX VPN vs. nonVPN [
33], and analyzed the model’s predicted traffic class using GradCAM, an XAI technique for convolutional feature extraction. The conclusion of the analysis is that, after removing all kinds of biases that identify the flow or the servers, the model only looks at the packet size to make a decision. Consequently, we were able to match its performance by just using a decision tree on the packet size. Although we analyzed CNNs on the most popular dataset in encrypted network packet classification, this methodology can be applied to other datasets with different categories. Next, we detail the most important contributions of this work.
Poor experimental protocols result in an unrealistic performance evaluation: As has been shown, the bias introduced in CNN training and validation datasets has not been sufficiently taken into consideration before. This can be demonstrated by separating the traffic flows during dataset preparation: the neural network then learns from data closer to what can be expected in a real network, but the real accuracy is limited compared to the theoretical results of other authors. We propose a better experimental protocol that ensures biases are eliminated from the dataset and avoids data leakage from the training set to the test set.
XAI techniques are of great help in the assessment of the model: Using the XAI techniques discussed above, it has been shown that caution must be exercised when preprocessing data, as the model might not be able to properly identify the classes if two flows that behave differently (client and server flows) are aggregated into one category. Moreover, the use of a heatmap shows that the correlations found in the trace by the model are purely statistical and not based on concrete evidence or certain metadata values, so a stricter encoding that further diffuses the statistical traits of the payload would significantly affect the accuracy of the model. The use of heatmaps helps to assess the viability of the model in practical scenarios, where the model’s points of interest can be extracted and discussed in order to determine whether the training exercise was successful.
CNNs do not detect significant features in encrypted payloads: Based on the presented results, CNNs do not seem to be the right tool for classifying encrypted network packets, as they are merely focusing on the packet length. Simply adding random lengths of padding to encrypted packets would be enough to make such classification unfeasible, at the cost of more payload per packet. This could be a good solution for users’ privacy, but it is an inadequate one for network operators and systems, which would need to find other ways to classify the network traffic and handle the extra computing load and bandwidth. On the other hand, until packets include random padding, this conclusion provides a valid means to heuristically classify traffic at high speed, just by looking at the packet lengths.
Regarding limitations, in this work we have only addressed CNNs, using GradCAM, a technique specific to CNNs. Using a generic CNN allowed us to focus on the convolutional features of encrypted network packets. While GradCAM allowed us to extract deeper information about CNNs, it also limited our study to convolutional features. Nowadays, other architectures, such as Transformers, are emerging, and their rapid development again poses the challenge of understanding what they are learning. This opens the topic of extracting and understanding more complex features to improve the differentiation between traffic patterns.
For the reproducibility of our results, we have published the code and methods in GitHub (
https://github.com/elbisbe/XAI_DL_NetworkPacketClassification accessed on 14 March 2023). Furthermore, the same repository includes tools to preprocess and split datasets to build an appropriate experimental protocol with no biases.