A Convolutional Neural Network with Hyperparameter Tuning for Packet Payload-Based Network Intrusion Detection

Boulaiche, Ammar; Haddad, Sofiane; Lemouari, Ali

doi:10.3390/sym16091151


by Ammar Boulaiche ¹, Sofiane Haddad ²,* and Ali Lemouari ¹,³

¹ LaRIA Laboratory, Faculty of Exact Sciences and Computer Science, University of Jijel, Jijel 18000, Algeria

² RE Laboratory, Faculty of Science and Technology, University of Jijel, Jijel 18000, Algeria

³ Faculty of Science and Technology, University of Tamanrasset, Tamanrasset 11000, Algeria

* Author to whom correspondence should be addressed.

Symmetry 2024, 16(9), 1151; https://doi.org/10.3390/sym16091151

Submission received: 28 July 2024 / Revised: 26 August 2024 / Accepted: 2 September 2024 / Published: 4 September 2024

(This article belongs to the Section Computer)


Abstract:

In the last few years, the use of convolutional neural networks (CNNs) in intrusion detection has attracted more and more attention. However, their results in this domain have not lived up to expectations compared to the results obtained in other domains, such as image classification and video analysis. This is mainly due to the datasets used, which contain preprocessed features that are not well suited to convolutional neural networks, as they do not allow full exploitation of all the information embedded in the original network traffic. With the aim of overcoming these issues, we propose in this paper a new efficient convolutional neural network model for network intrusion detection based on raw traffic data (pcap files) rather than preprocessed data stored in CSV files. The novelty of this paper lies in the proposal of a new method for adapting the raw network traffic data to the most suitable format for CNN models, which allows us to fully exploit the strengths of CNNs in terms of pattern recognition and spatial analysis, leading to more accurate and effective results. Additionally, to further improve its detection performance, the structure and hyperparameters of our proposed CNN-based model are automatically adjusted using the self-adaptive differential evolution (SADE) metaheuristic, in which symmetry plays an essential role in balancing the different phases of the algorithm, so that each phase can contribute in an equal and efficient way to finding optimal solutions. This helps to make the overall performance more robust and efficient when solving optimization problems. The experimental results on three datasets, KDD-99, UNSW-NB15, and CIC-IDS2017, show a strong symmetry between the frequency values encoded in the images built for each network traffic flow and the different attack classes. This was confirmed by a good predictive accuracy that goes well beyond that of similar competing models in the literature.

Keywords:

network intrusion detection; multiclass classification; convolutional neural network; hyperparameter tuning; self-adaptive differential evolution; metaheuristic optimization

1. Introduction

For several years now, IT security has been considered a major issue, particularly with the evolution of IT tools, which has been accompanied by a meteoric rise in attack techniques that have become increasingly complex and sophisticated [1]. With all these developments, the quantity and complexity of network traffic analyzed by intrusion detection systems have grown exponentially. As a result, conventional intrusion detection techniques are no longer effective in the face of these new challenges. Responding to these challenges, intrusion-tolerant systems (ITS) have recently gained an increasingly important place in the literature. Motivated by the development of virtualization and cloud techniques in recent years, several techniques for implementing intrusion tolerance have been developed. In a relatively recent work, Kwon et al. [2] introduced an optimal cluster expansion-based intrusion-tolerant system (ITS) that ensures quality of service (QoS) during large denial of service (DoS) attacks. By considering queuing theory to determine the optimal number of virtual machines (VMs), the proposed scheme enhances system performance by reducing resource waste and maintaining good QoS. Even more recently, Cuan et al. [3] proposed a novel adaptive tracking control strategy to protect a class of full-state uncertain nonlinear CPSs from deception attacks. Compared to existing attack tolerance approaches, the proposed strategy is designed to simultaneously handle external disturbances and deception attacks with high robustness. Although intrusion tolerance systems are effective in maintaining service continuity, their reliance on redundancy and recovery can increase operational complexity, making them difficult to manage and maintain. Furthermore, these systems are reactive, i.e., they mitigate the consequences of a breach rather than prevent it. 
Such a reactive approach may not be enough as cyberattacks become increasingly sophisticated, and proactive defense strategies such as intrusion detection systems are becoming more and more necessary. Fortunately, this need has arisen at a crucial time when machine learning is undergoing a significant evolution, enabling more sophisticated and accurate classification methods that can keep pace with the growing complexity of security threats. Consequently, research in this field is increasingly focused on applying artificial intelligence and machine learning techniques, which have demonstrated their effectiveness across various domains. This trend has been further accelerated by the rapid advancements in hardware resources and deep learning methods that have emerged in recent years [4,5].

In this context, numerous research studies employing advanced deep learning techniques have been proposed in recent years [6]. However, their effectiveness remains limited by poor data preprocessing and by reliance on ready-made features that do not align well with the model’s requirements. These features are often defined manually, which may discard or modify important information and thereby lose details that could be useful for learning a specific model. The contrast is easy to see in computer vision and image classification, where the performance of convolutional neural network (CNN) models is impressive compared to conventional machine learning models [7,8]. Traditional machine learning models are trained on preprocessed datasets whose features are extracted using costly handcrafted feature algorithms. In contrast, CNN-based models are distinguished by their ability to automatically identify and extract complex features from raw images that may not be easily visible to human experts. As a result, CNN-based models provide greater accuracy than traditional machine learning-based models.

Inspired by the success of convolutional architectures in computer vision and image classification domains, as well as their impressive results when applying them directly to raw images, we propose in this paper a new model for detecting malicious network traffic based on convolutional neural networks, in which the learning and detection phases are applied directly to raw dataset traffic. However, the challenge with the direct application of CNNs to raw traffic data is the need to find a way to adapt the raw traffic to such models. The main contribution of this paper is therefore to propose an effective method for adapting raw traffic data to be suitable for CNNs, thereby enhancing the model’s ability to process and learn from the raw traffic data directly. The proposed method is based on converting the payloads of network traffic packets into a square grayscale image using a cross-frequency calculation of the byte occurrences within these payloads. Using this method, we generate images similar to 2D topographic maps, where each traffic class is represented by a distinct image. This makes learning the CNN model simple, easy, and efficient, and it allows the learning process to take advantage of as much information as possible, including that not represented in the preprocessed datasets, thus leading to a more accurate intrusion detection model. Moreover, converting network traffic into images will make it possible to benefit from the enormous advantages of convolutional neural networks, which are mainly designed to work with grid-structured inputs. Thanks to this grid structure, the convolutional layers can easily extract all the spatial dependencies in the different local regions of the grid and then provide them to the input layer of the neural network as network traffic features. That justifies the power of the features generated by convolutional neural networks. 
Moreover, convolutional neural networks are well known for their excellent multiclass classification abilities, so using them in our research will allow us to generate a powerful multiclass classification model that is able to detect the presence, as well as the category, of the attack.

In short, this paper presents the following contributions:

We define a new multiclass classification model for intrusion detection based on convolutional neural networks (CNNs).
We use raw network traffic datasets (raw pcap files) to train and test the proposed model instead of preprocessed datasets (feature-ready CSV files). So, input features are automatically extracted.
We propose an innovative methodology for converting raw network traffic to a 2D representation, which is better suited to convolutional neural networks.
We use a hyperparameter tuning method based on the self-adaptive differential evolution (SADE) algorithm to adaptively optimize the structure of the proposed CNN-based intrusion detection model.
We evaluate the performance of the proposed model using three different datasets, namely, KDD-99, UNSW-NB15, and CIC-IDS2017.

The rest of this paper is organized as follows: First, we describe the related work in Section 2. Next, we present the methodology used in our proposed CNN-based model in Section 3. Then, we describe the experimental study in Section 4. After that, we discuss the results and compare them with other competitive approaches in Section 5. Finally, we conclude the paper in Section 6.

2. Related Work

Since their emergence in the second half of the 2010s, deep learning methods have demonstrated success in many real-world problems and are now gradually replacing traditional machine learning techniques in many domains. Over the last few years, many studies have applied deep learning methods for detecting intrusions. For example, Pingale and Sutar [9] used advanced deep learning techniques for detecting network intrusions, such as convolutional neural networks (CNNs), which are mainly used for feature extraction, deep maxout networks (DMNs), and deep autoencoders (DAEs), which are combined in a hybrid deep model for performing intrusion detection. To generate their model, they also used many other techniques, such as z-score and holoentropy methods for data normalization and data transformation, respectively, as well as the Remora Optimization Algorithm (ROA) and Whale Optimization Algorithm (WOA), which were used as hybrid optimization algorithms in the training procedure of the proposed model. Similarly, the authors in Asgharzadeh et al. [10] combined convolutional neural networks (CNNs), binary multiobjective enhanced capuchin search algorithms (BME-CSAs), and random forests (RFs) to detect and classify anomalies in the IoT. The convolutional neural network was mainly used to extract local and global features from raw network features. The extracted features were then passed through the BME-CSA algorithm to select effective features. Finally, random forest classifiers were used to classify normal or attack samples. Altaf et al. [11] exploited the potential of graph neural networks to develop a novel network intrusion detection framework. The proposed framework was based on graph convolutional networks and was composed of two hidden layers. At each layer, node embeddings were calculated by applying the model’s trainable parameters on the aggregated message and updated by passing through the ReLU function. 
The final output of the proposed framework was given by passing the sum of initial and final embeddings through a transformation function. Moreover, in order to keep the overall lightweight of the proposed framework, the best features from the raw dataset were selected using the Recursive Feature Elimination (RFE) algorithm. In Daoud et al. [12], principal component analysis (PCA) and convolutional neural networks (CNNs) were combined to implement a classification model for network flows. The PCA was mainly used for feature dimension reduction, whereas the CNN was used as a classification model. The work in Hnamte and Hussain [13] explored the use of the Deep Convolution Neural Network (DCNN) as a novel framework for dependable intrusion detection through empirical studies and performance evaluations. The performances of the implemented approach were compared with those of the traditional deep neural network (DNN) over many datasets, including the ISCX 2012, DDoS (Kaggle), CICIDS 2017, and CICIDS 2018 datasets.

It is obvious that the solutions based on deep learning techniques presented in the above articles have delivered very high accuracy rates; however, these results were obtained for binary classification, which only considers the presence of an attack, regardless of its nature. It is a fact that discovering the presence of an attack is important, but the most important thing is to be able to determine the nature of the attack so that specific and effective measures can be taken to neutralize it. However, the design of such a solution is very challenging due to their low performance when dealing with multiple types of attacks compared to binary classifiers, which only alert to the presence of an attack. In this connection, many other deep learning-based research projects that focus on multiclass intrusion detection with the ability to detect the presence, as well as the type, of attack have been proposed over the past few years. For example, the authors of Vinayakumar et al. [14] implemented and compared the classification performance of various deep neural networks and machine learning algorithms for both binary and multiclass categories on various publicly available datasets. On the basis of the obtained experimental results, the authors proposed a unique DNN architecture for NIDS and HIDS composed of five hidden layers. In Li et al. [15], a new deep learning-based approach was developed to detect intrusions using a multiconvolutional neural network (multi-CNN) fusion method. To generate their CNN model, they first converted one-dimensional feature data into the form of two-dimensional feature data, and then the generated model was evaluated on the NSL-KDD dataset in both binary and multiclass classification. Andresini et al. [16] presented a deep learning-based model for the multiclass classification of network traffic data. 
In that work, the authors applied a convolutional neural network (CNN) with an attention mechanism that enables humans to understand the generated model decisions by producing an attention map on the flow characteristics of specific attack categories. The flow characteristics of network traffic data were converted to two-dimensional features and used as the input of the CNN to carry out traffic classification. In Udas et al. [17], a novel hybrid model called SPIDER based on a series of convolutional neural networks (CNNs) and enhanced recurrent neural networks (RNNs) was presented to detect and monitor intrusions within network traffic. The proposed model combines principal component analysis (PCA) and convolutional neural networks (CNNs) with four updated versions of conventional RNNs, namely, Bidirectional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM), the Bidirectional Gated Recurrent Unit (Bi-GRU), and the Gated Recurrent Unit (GRU). The PCA and CNNs are mainly used to reduce feature dimensions and extract efficient spatial characteristics, respectively, whereas the four updated RNNs serve as the default backbone of the proposed hybrid model. In a similar work, Brandon et al. [18] proposed a hybrid deep learning model that combines a convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM) for the detection of network attacks. In their model, the CNN was primarily used to recognize patterns in the features of the input network traffic before sending them to two Bi-LSTM layers, which identified malicious traffic by exploiting forward and backward data propagation. Similarly, Wang et al. [19] combined a convolutional neural network (ResNet) with transformers and bidirectional long short-term memory (Bi-LSTM) networks to implement a deep learning-based model for network intrusion detection in IoT systems. 
To improve the performance of their model, the authors used the Synthetic Minority Oversampling Technique (SMOTE) for minority class sample expansion and the Edited Nearest Neighbor (ENN) method for majority class sample reduction in order to reduce the degree of data imbalance in their datasets. He et al. [20] proposed an intrusion detection system based on a convolutional neural network (CNN). The proposed system used the Variational Gaussian Model (VGM) to decompose continuous features into multiple single Gaussian values, one-hot encoding to convert discrete features into one-hot vectors, pyramid convolution neural networks (PyCNNs) with multiscale convolution kernels to extract features and to classify the processed network traffic into different attack types, and Depthwise Separable Convolution (DSC) to reduce the model generation complexity and improve the whole intrusion detection process.

All the above-mentioned multiclass deep learning-based frameworks are close to the research presented in this paper, as they all use convolutional neural networks (CNNs) for the multiclass classification of network traffic data. Furthermore, like our research, most of them use 2D representations of network flows to generate their CNN-based models. However, despite using CNN models, which have recently achieved great success in the image classification domain, their performance remains lower than expected. This is mainly because none of the above research used raw pcap file data to train its CNN models, unlike CNNs in the image classification domain, which are trained directly on raw images. The problem with using feature-ready CSV file data instead of raw pcap file data, as the above research does, is that the generated model does not learn from all the information contained in the original network traffic, because the features extracted into CSV files are not guaranteed to contain all the information in the raw pcap files.

To deal with this problem and to further exploit the information hidden in the raw network traffic, the model proposed in this research was trained on the raw pcap file dataset and not on the feature-ready CSV file dataset. This is one of the contributions that helps our proposed model be more accurate than its competitors, since very little research has worked directly on raw pcap files to generate intrusion detection models. Among such research, Li et al. [21] proposed a novel deep neural network for network intrusion detection based on a hierarchical and dynamic feature extraction framework. The neural network designed in that research can automatically extract abstract features directly from the packets of network traffic. Another work by Liu et al. [22] proposed a payload-based anomaly detection framework based on long short-term memory (LSTM), convolutional neural networks (CNNs), and multihead self-attention mechanisms. The training process of the proposed framework was applied directly to the data collected from the packet payload traffic. The problem with these two studies is that, unlike our multiclass classification model, they are mainly designed for binary classification, which, as stated above, only alerts to the presence of an attack without regard to its type.

In another work, Qiu et al. [23] proposed a novel hybrid intrusion detection system by combining two different models: a random forest-based model and a convolutional neural network-based model. The first one was trained on feature-ready CSV files, whereas the second was trained on raw pcap files. For the output result, a fusion module was defined to combine the two model results. The problem with this research is that they only used the first N packets of split flows to train their packet-based model, and as such, their model cannot detect distributed attacks that are transmitted by multiple packets, including the latest ones. Another problem with this research is that the two models were generated from a mixed dataset with only five kinds of common attacks from three different datasets. It is thus very difficult to compare their detection performance against similar competitor models. Similarly, Lin et al. [24] proposed a deep learning-based model for intrusion detection with a multilevel feature fusion method that combines data timing, byte, and statistical features to extract valid information from raw network traffic. The same problem can be found in this research, as the authors eliminated all uncommon attack categories from the datasets used in their experiments, making comparisons inappropriate.

The closest work to the research presented in this paper is that of Yu et al. [25]. In that work, the authors proposed a hierarchical packet byte-based CNN model, called PBCNN, in which the features supplied to the CNN model were extracted from the raw pcap data at two different levels. In the first level, abstract features were automatically extracted from bytes in a packet of pcap data, and then the final representation was further constructed in the second level from packets in a flow or session. Compared to our model, PBCNN is a hierarchical-based model that extracts network traffic features hierarchically and considers temporal information in network connections, whereas our model is a 2D CNN-based model in which input features are extracted once in 2D form (a grayscale image), and no temporal information in the network connections is considered. It is a well-known fact that 2D representation has a benefit over hierarchical representation when dealing with convolutional neural network-based models because CNNs are primarily designed for grayscale images, which are represented in 2D. This is confirmed by our experimental results, which outperformed those of the hierarchical-based model. Table 1 below summarizes all the related work cited in this paper.

3. Proposed Intrusion Detection Framework

In this section, we present the design of our convolutional neural network-based intrusion detection model. As shown in Figure 1, the proposed framework consists of three main components: data preparation, data reformulation, and model training. The first two components form the preprocessing part of the proposed framework, which aims to transform the payloads of each network flow into a single-channel square image. In the first component, the attack packet flows are extracted and split by attack category. Then, the packet payloads of each network flow (session) are extracted and converted into a 2D image grid form through the data reformulation component. The last component forms the latter part of the proposed framework, which aims to train and generate our CNN-based model to detect anomalies for network traffic payloads. For a better understanding of the proposed framework, a detailed description of how the three components work is given in the following subsections.

3.1. Data Preparation

Unlike feature-ready-based datasets, where data are well organized as a table in which rows represent labeled samples, and columns represent sample features, raw traffic datasets are very poorly organized; a pcap file in this dataset may contain a mixture of overlapped malicious and normal network sessions. So, before converting traffic payloads into a suitable form for the proposed model, we first need to separate normal traffic from malicious traffic and then split up different attack categories into separate pcap files. To do this, we need to use the description file, which comes with pcap files and contains information about where the attack instances are located in the raw pcap files.

Like the feature-ready CSV files, the description file is organized as a table in which the information for each attack instance, such as the timestamp and quintuple information of the attack packets, is saved in one row. So, to separate and split different attack categories into separate pcap files, we need to read both the pcap and description files and then save the packets of each attack type in an individual pcap file by matching the timestamp and quintuple information of the description file to the corresponding raw packet traffic.

It should be noted here that the matching of quintuple information, which refers to the source IP, destination IP, source port, destination port, and protocol, is carried out in both directions (i.e., the source IP address and source port are exchangeable with the destination IP address and destination port, respectively).
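The bidirectional quintuple match described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Quintuple`, `flow_key`, and `same_flow` are hypothetical, the addresses are made up, and the timestamp check that the text also requires is left out for brevity.

```python
from typing import NamedTuple


class Quintuple(NamedTuple):
    """Hypothetical container for the five fields named in the text."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def flow_key(q: Quintuple) -> tuple:
    """Return a direction-independent key for a quintuple.

    The two (IP, port) endpoints are sorted, so a packet and its reply
    map to the same key regardless of which side is the source.
    """
    endpoint_a = (q.src_ip, q.src_port)
    endpoint_b = (q.dst_ip, q.dst_port)
    low, high = sorted((endpoint_a, endpoint_b))
    return (low, high, q.protocol)


def same_flow(packet: Quintuple, labeled: Quintuple) -> bool:
    """True if the packet belongs to the labeled flow in either direction."""
    return flow_key(packet) == flow_key(labeled)


# Illustrative values only.
attack = Quintuple("10.0.0.5", 44321, "192.168.1.10", 80, "tcp")
reply = Quintuple("192.168.1.10", 80, "10.0.0.5", 44321, "tcp")
other = Quintuple("10.0.0.5", 44321, "192.168.1.10", 443, "tcp")

print(same_flow(reply, attack))  # reverse direction of the same flow
print(same_flow(other, attack))  # different destination port
```

Sorting the endpoints is one simple way to make the comparison symmetric; comparing the quintuple against both orderings explicitly would work equally well.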

3.2. Data Reformulation

After splitting the raw pcap dataset by attack category, data reformulation is done to convert the payloads of each network flow into a 2D representation format. To make this conversion, we first need to split the network packets carrying the same quintuple, in both directions, into individual network sessions, then extract and fuse the session packet payloads to construct a compact payload for each network session. The problem with the extracted compact payloads is that their size changes from one network session to another. They are therefore incompatible for use as inputs to convolutional neural networks. In order to solve this problem and convert the extracted compact payloads into 2D grids of fixed size, we propose a new method based on statistical calculations in which the conversion process goes through three stages:

First, we generate four different frequency distribution vectors for each network session’s compact payload. The first one is generated immediately from the original compact payload by counting the number of times each byte value occurs in the compact payload data. As is well known, a byte has 8 bits, so it can take values ranging from 0 to 255. As a result, the four frequency vectors generated each have 256 elements: the first element contains the frequency of occurrence of the value 0 in the various bytes of the compact payload, the second element contains the frequency of the value 1, and so on. The three remaining vectors are generated in the same way, shifting the payload by two bits each time.
Second, we merge the four frequency distribution vectors into a single frequency vector of 1024 elements. The overall algorithm for these first two stages is shown in Algorithm 1.
Algorithm 1: Generation of the frequency vector
Third, we reshape the obtained 1D frequency distribution vector into a 2D array. As our frequency vector is 1024 elements long, it can be reshaped into a square 2D array of size $32 \times 32$ without adding any padding values. The generated 2D array is treated as a $32 \times 32$ grayscale image.
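The three stages above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reproduction of Algorithm 1: the text does not specify how the two-bit shift is carried out, so the sketch assumes the payload is treated as a bit stream, the first 2, 4, or 6 bits are dropped, and the remaining bits are regrouped into whole bytes. The function names are hypothetical, and raw counts are kept without any intensity scaling.

```python
def byte_histogram(data: bytes) -> list:
    """Count how often each byte value (0..255) occurs in the data."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    return counts


def shift_bits(data: bytes, shift: int) -> bytes:
    """Drop the first `shift` bits and regroup the rest into whole bytes.

    Assumption: this is one plausible reading of "shifting the payload
    by two bits"; trailing bits that do not fill a byte are discarded.
    """
    if shift == 0:
        return data
    return bytes(
        ((data[i] << shift) & 0xFF) | (data[i + 1] >> (8 - shift))
        for i in range(len(data) - 1)
    )


def payload_to_image(payload: bytes, side: int = 32) -> list:
    """Build the 1024-element frequency vector and reshape it to side x side."""
    vector = []
    for shift in (0, 2, 4, 6):          # stage 1: four frequency vectors
        vector.extend(byte_histogram(shift_bits(payload, shift)))  # stage 2: merge
    # stage 3: 4 * 256 = 1024 = 32 * 32, so no padding is needed
    return [vector[r * side:(r + 1) * side] for r in range(side)]


image = payload_to_image(b"\x00\xff")
print(len(image), len(image[0]))
```

Because the output size depends only on the number of possible byte values, payloads of any length map to an image of the same shape, which is the property the method relies on.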

3.3. Model Training

Finally, once the 2D traffic images have been produced, the CNN classification model is trained.

3.3.1. Overall Structure of Our CNN-Based Model

CNNs are a robust and popular kind of deep neural network mainly designed to process 2D and 3D array data, such as images, videos, and audio spectrograms. As shown in Figure 2, our proposed CNN-based model is composed of several kinds of layers connected in series, including the input layer, convolutional layer, pooling layer, flatten layer, fully connected layer, and output layer. The convolutional layer is the core part of the CNN architecture; it extracts features from the input data by applying a linear weighting operation to the input data, in which the weights are given by a small square matrix called the kernel or filter matrix. This convolutional operation can be expressed mathematically as follows [26]:

$$(X * K)(i, j) = \sum_{v = 1}^{N} \sum_{u = 1}^{N} X(i + u, j + v) \, K(u, v) \qquad (1)$$

where $X$ is a 2D input image if this is the first layer (otherwise, it is the output feature map of the previous layer); $K$ is a 2D convolutional kernel of size $N \times N$; and $*$ is the convolution operation. This operation is repeated several times by sliding the 2D kernel over the 2D input image, i.e., incrementing the position $(i, j)$ and placing the result at this position each time. By generalizing this formula for any layer $l$ and any convolutional filter $K$ with an activation function $f$ and a bias $B$, the formula becomes the following [26]:

$$Y^{l}(i, j) = f\left(\sum_{v = 1}^{N} \sum_{u = 1}^{N} X^{l - 1}(i + u, j + v) \, K^{l}(u, v) + B^{l}\right) \qquad (2)$$

where $Y^{l}(i, j)$ is the output feature map of the $l$th convolution layer at position $(i, j)$, $X^{l - 1}$ is the output feature map of the previous convolution layer or the input data, and $f$ is an activation function such as the commonly adopted rectified linear unit (ReLU), which is given by the following formula [27]:

$$\mathrm{ReLU}(x_{ij}) = \max(0, x_{ij}) = \begin{cases} x_{ij}, & \text{if } x_{ij} > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
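As a small worked illustration of Equations (1) to (3), the sketch below slides a kernel over a 2D input, adds a bias, and applies ReLU. It is written in plain Python for readability and makes two assumptions not stated in the text: indices are zero-based, and no padding is used, so an $N \times N$ kernel on an $H \times W$ input yields an $(H - N + 1) \times (W - N + 1)$ output. The function names are hypothetical.

```python
def relu(x: float) -> float:
    """Rectified linear unit, as in Equation (3)."""
    return max(0.0, x)


def conv2d(image, kernel, bias=0.0, activation=relu):
    """Apply Equation (2) with a single N x N kernel and no padding."""
    n = len(kernel)
    out_h = len(image) - n + 1
    out_w = len(image[0]) - n + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # Weighted sum over the N x N window anchored at (i, j);
            # as in Equation (1), the kernel is not flipped.
            for u in range(n):
                for v in range(n):
                    total += image[i + u][j + v] * kernel[u][v]
            row.append(activation(total))
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[0, 1],
          [1, 0]]

feature_map = conv2d(image, kernel)             # [[6.0, 8.0], [12.0, 14.0]]
clipped_map = conv2d(image, kernel, bias=-10.0)  # [[0.0, 0.0], [2.0, 4.0]]
print(feature_map)
print(clipped_map)
```

With the negative bias, the two smallest sums fall below zero and are clipped by ReLU, which is the nonlinearity the activation introduces.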

The pooling layer comes next after the convolution layers in order to compress and reduce the spatial size of the generated feature maps and therefore reduce the computational cost of the layers that follow. This is mainly achieved by dividing the input feature map into smaller regions and then replacing each region with a single representative value. The most commonly used representative values are the average value and the maximum value. After extracting the feature maps from the sequence of convolutional and pooling layers, the flatten layer is used to convert them into a single one-dimensional feature vector. Then, a series of fully connected layers are used to map the feature vector extracted in the previous layers to the sample space (classification). Each neuron of a fully connected layer is connected to every neuron of both the previous and next layer but not to the neurons of the same layer. The computation of each fully connected layer is mathematically defined as follows [28]:

y_{j}^{l} = f (\sum_{i = 1}^{N} x_{i}^{l - 1} \cdot w_{i j}^{l} + b_{j}^{l})

(4)

where

y_{j}^{l}

is the output of the jth neuron in the fully connected layer l,

x_{i}^{l - 1}

is the output of the ith neuron in the previous layer

l - 1

,

w_{i j}^{l}

is the weight between the ith neuron in the previous layer

l - 1

and the jth neuron of the fully connected layer l,

b_{j}^{l}

is the bias of the jth neuron in the fully connected layer l, and f is the activation function, which is commonly either a sigmoid function or a Relu function. Finally, an output layer is used to classify every testing sample to its category. The most commonly used activation function in this layer is the Softmax function, which outputs the probabilities of different categories for each given sample. The Softmax function is simply a generalization of logistic regression to achieve the problem of multiclass classification. Its mathematical formula is defined as follows [28]:

Softmax (y_{i}) = \frac{e^{y_{i}}}{\sum_{k = 1}^{N} e^{y_{k}}}

(5)

where $y_{i}$ is the output value of the ith neuron in the output layer, and N is the number of output layer neurons, i.e., the number of categories or classes.
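To make Equations (3)–(5) concrete, the following minimal NumPy sketch evaluates one fully connected layer followed by a Softmax output. The weights, biases, and input vector are arbitrary illustrative values, not parameters of the proposed model, and the subtraction of the maximum inside the Softmax is a standard numerical-stability step that is not part of Equation (5).

```python
import numpy as np

def relu(x):
    # Eq. (3): element-wise max(0, x)
    return np.maximum(0.0, x)

def dense(x_prev, W, b, f=relu):
    # Eq. (4): y_j = f(sum_i x_i * w_ij + b_j), with W of shape (N, units)
    return f(x_prev @ W + b)

def softmax(y):
    # Eq. (5); subtracting max(y) leaves the result unchanged but avoids overflow
    e = np.exp(y - np.max(y))
    return e / e.sum()

# Illustrative values only (assumed for this example)
x = np.array([1.0, 2.0, 0.5])
W = np.array([[0.2, -0.1],
              [0.4, 0.3],
              [-0.5, 0.8]])
b = np.array([0.1, -0.2])

hidden = dense(x, W, b)          # hidden-layer activations
probs = softmax(hidden)          # class probabilities summing to 1
print(hidden, probs)
```

The class with the largest probability in `probs` would be taken as the predicted category.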

In addition to the layers described above, convolutional neural networks also use some other layers and strategies to increase efficiency, such as the dropout layer, which randomly discards a portion of nodes to prevent overfitting problems, and the batch normalization (BN) layer, which is mainly used to accelerate the convergence of the generated model and improve training stability. The formula for batch normalization is as follows [27]:

BN (x) = α (\frac{x - μ}{\sqrt{σ^{2} + ε}}) + β

(6)

where x is the input to be batch normalized, $μ$ is the mini-batch mean, $σ^{2}$ is the mini-batch variance, $α$ is a scale parameter, $β$ is a shift parameter, and $ε$ is a small constant added for numerical stability to prevent division by zero.
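As an illustration of Equation (6), the sketch below normalizes a small mini-batch feature by feature. It assumes that the statistics are computed over the batch axis and uses an arbitrary value for ε; the scale and shift parameters, which are learned during training in practice, are fixed here to 1 and 0.

```python
import numpy as np

def batch_norm(x, alpha, beta, eps=1e-5):
    # x has shape (batch_size, features); statistics are taken per feature
    mu = x.mean(axis=0)            # mini-batch mean
    var = x.var(axis=0)            # mini-batch variance (sigma squared)
    return alpha * (x - mu) / np.sqrt(var + eps) + beta   # Eq. (6)

# Illustrative mini-batch of three samples with two features (assumed values)
batch = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
out = batch_norm(batch, alpha=1.0, beta=0.0)
print(out)
```

With α = 1 and β = 0, each output column has approximately zero mean and unit variance regardless of the scale of the input column.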

3.3.2. CNN Structure and Hyperparameter Tuning

Determining the optimal architecture for a CNN model with the best hyperparameter settings is a difficult task. There are still no clear rules for determining the most suitable CNN structure for a particular problem, and in most cases, hyperparameters are adjusted manually by the designer according to their expertise and intuition, which is a tedious and time-consuming task. In this paper, the hyperparameters of the proposed CNN model were adjusted using the self-adaptive differential evolution (SADE) algorithm [29], which is an improved version of the differential evolution (DE) algorithm [30] in which the learning strategy and the two control parameters F and CR are automatically adapted during algorithm execution.

Differential evolution is a powerful evolutionary-based metaheuristic that has quickly drawn a great deal of interest in the optimization and artificial intelligence communities due to its simplicity, high efficiency, ease of implementation, and quick convergence. Like all evolutionary algorithms, the DE algorithm starts with an initialization phase, followed by a loop of evolutionary operations: mutation, crossover, and selection.

In the initialization phase, an initial population of $NP$ candidate solutions (or individuals) $P^{0} = \{X_{1}^{0}, X_{2}^{0}, \dots, X_{NP}^{0}\}$ with $X_{i}^{0} = \{x_{i,1}^{0}, x_{i,2}^{0}, \dots, x_{i,d}^{0}\}$, where d is the dimension of the solution vector, is randomly generated according to a uniform distribution within the search space constrained by the prescribed lower bound $b^{L} = (b_{1}^{L}, \dots, b_{d}^{L})$ and upper bound $b^{U} = (b_{1}^{U}, \dots, b_{d}^{U})$. Hence, if the search space of the jth solution vector component is a continuous interval, its initial values are randomly generated as follows [31]:

x_{i, j}^{0} = b_{j}^{L} + rand (0, 1) \cdot (b_{j}^{U} - b_{j}^{L})

(7)

where $i = 1, 2, \dots, NP$ is the index of the candidate solution vector; $j = 1, 2, \dots, d$ is the index of the component in the candidate solution vector; and rand(0, 1) is a uniform random number in the range $[0, 1]$. Otherwise, when the search space of the jth solution vector component is a set of discrete values, its initial values are randomly generated as follows:

x_{i, j}^{0} = Round (b_{j}^{L} + rand (0, 1) \times (b_{j}^{U} - b_{j}^{L}), vs)

(8)

where $Round(\dots, vs)$ is a function that rounds the final result to the nearest value in the value set (vs).
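The initialization described by Equations (7) and (8) can be sketched as follows. The two search dimensions used here (a continuous rate and a discrete filter count) and their ranges are hypothetical placeholders chosen for illustration; they are not the ranges of Table 3, and the helper names are our own.

```python
import numpy as np

def round_to_set(value, vs):
    # Round(., vs) of Eq. (8): nearest member of the value set vs
    vs = np.asarray(vs, dtype=float)
    return float(vs[np.argmin(np.abs(vs - value))])

def init_population(specs, NP, rng):
    # specs: one (lower, upper, value_set_or_None) tuple per dimension
    pop = np.empty((NP, len(specs)))
    for j, (lo, hi, vs) in enumerate(specs):
        raw = lo + rng.random(NP) * (hi - lo)             # Eq. (7)
        if vs is not None:
            raw = [round_to_set(v, vs) for v in raw]      # Eq. (8)
        pop[:, j] = raw
    return pop

# Hypothetical search space: a continuous rate and a discrete filter count
specs = [(0.0, 0.5, None), (16, 128, [16, 32, 64, 128])]
rng = np.random.default_rng(42)
pop = init_population(specs, NP=6, rng=rng)
print(pop)
```

Each row of `pop` is one candidate solution; continuous components stay inside their interval, while discrete components always take a value from their value set.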

After initialization, DE enters an iterative process to improve the candidate solutions. In each iteration q, new solutions are created through three evolutionary operators: mutation, crossover, and selection. First, in the mutation step, a mutated vector $V_{i}^{q} = \{v_{i,1}^{q}, v_{i,2}^{q}, \dots, v_{i,d}^{q}\}$ is generated for each candidate solution vector $X_{i}^{q} = \{x_{i,1}^{q}, x_{i,2}^{q}, \dots, x_{i,d}^{q}\}$ that belongs to the current population $P^{q} = \{X_{1}^{q}, X_{2}^{q}, \dots, X_{NP}^{q}\}$, according to one of the following strategies: Rand/1 (R1), Best/1 (B1), Rand/2 (R2), Current to Rand/1 (CR1), and Current to Best/1 (CB1), which are given by the following formulas [31]:

R1: V_{i}^{q} = X_{r1}^{q} + F \cdot (X_{r2}^{q} - X_{r3}^{q})

(9)

B1: V_{i}^{q} = X_{best}^{q} + F \cdot (X_{r1}^{q} - X_{r2}^{q})

(10)

R2: V_{i}^{q} = X_{r1}^{q} + F \cdot (X_{r2}^{q} - X_{r3}^{q}) + F \cdot (X_{r4}^{q} - X_{r5}^{q})

(11)

CR1: V_{i}^{q} = X_{i}^{q} + F \cdot (X_{r1}^{q} - X_{i}^{q}) + F \cdot (X_{r2}^{q} - X_{r3}^{q})

(12)

CB1: V_{i}^{q} = X_{i}^{q} + F \cdot (X_{best}^{q} - X_{i}^{q}) + F \cdot (X_{r1}^{q} - X_{r2}^{q})

(13)

where $V_{i}^{q}$ is the mutated vector of the current candidate solution vector $X_{i}^{q}$; $X_{best}^{q}$ is the best candidate solution vector in the current population; and $X_{r1}^{q}$, $X_{r2}^{q}$, $X_{r3}^{q}$, $X_{r4}^{q}$, and $X_{r5}^{q}$ are candidate solutions randomly chosen from the current population; they are different from each other and also distinct from the candidate solution vector $X_{i}^{q}$. F is called the mutation factor or the scale factor and usually lies in the interval $[0, 1]$.
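The five mutation strategies of Equations (9)–(13) can be written compactly as in the following sketch. The function name, the string labels, and the toy population are assumptions made for this example; the only requirement imposed by the formulas is that the population contains at least six individuals, so that five mutually distinct indices different from i can be drawn.

```python
import numpy as np

def mutate(pop, i, best, F, strategy, rng):
    # Draw five mutually distinct indices, all different from i
    candidates = [k for k in range(len(pop)) if k != i]
    r1, r2, r3, r4, r5 = rng.choice(candidates, size=5, replace=False)
    X = pop
    if strategy == "R1":     # Eq. (9)
        return X[r1] + F * (X[r2] - X[r3])
    if strategy == "B1":     # Eq. (10)
        return X[best] + F * (X[r1] - X[r2])
    if strategy == "R2":     # Eq. (11)
        return X[r1] + F * (X[r2] - X[r3]) + F * (X[r4] - X[r5])
    if strategy == "CR1":    # Eq. (12)
        return X[i] + F * (X[r1] - X[i]) + F * (X[r2] - X[r3])
    if strategy == "CB1":    # Eq. (13)
        return X[i] + F * (X[best] - X[i]) + F * (X[r1] - X[r2])
    raise ValueError(f"unknown strategy: {strategy}")

# Toy population of 8 individuals in 3 dimensions (illustrative values)
rng = np.random.default_rng(0)
pop = rng.random((8, 3))
v = mutate(pop, i=0, best=2, F=0.5, strategy="CB1", rng=rng)
print(v)
```

Note that a mutated vector may leave the search bounds; how such components are repaired (for example, by clipping) is not specified in the formulas above and is left out of this sketch.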

After the mutation step, the current candidate solution vector $X_{i}^{q}$ and its mutated vector $V_{i}^{q}$ are crossed over to obtain a new trial vector $U_{i}^{q} = \{u_{i,1}^{q}, u_{i,2}^{q}, \dots, u_{i,d}^{q}\}$, whose components $u_{i,j}^{q}$ are calculated as follows [31]:

u_{i,j}^{q} = \begin{cases} v_{i,j}^{q}, & \text{if } j = K \text{ or } rand_{i,j}[0, 1] \leq Cr \\ x_{i,j}^{q}, & \text{otherwise} \end{cases}

(14)

where K is a positive integer index randomly chosen in the range $[1, d]$, $rand_{i,j}[0, 1]$ is a uniform random number in the interval $[0, 1]$ independently generated for the ith candidate solution vector at the jth component, and $Cr$ is a predefined parameter called the crossover rate, which usually lies in the interval $[0, 1]$.

Finally, in the selection step, DE determines whether the current candidate solution vector $X_{i}^{q}$ or the generated trial vector $U_{i}^{q}$ will pass into the next iteration $q + 1$ according to their fitness values. The selection operator is formulated as follows [31]:

X_{i}^{q+1} = \begin{cases} U_{i}^{q}, & \text{if } f(U_{i}^{q}) \leq f(X_{i}^{q}) \\ X_{i}^{q}, & \text{otherwise} \end{cases}

(15)

where $f(U_{i}^{q})$ and $f(X_{i}^{q})$ are the fitness values of the generated trial vector and the current candidate solution vector, respectively. The above three evolutionary operators are repeated iteration after iteration until the predefined stopping conditions are satisfied.
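A minimal sketch of the crossover and selection operators of Equations (14) and (15) is given below. Equation (15) is written for a fitness that is minimized; when the fitness is a quantity to be maximized, one common choice (assumed here, not taken from the formulas) is to minimize its complement instead. The sphere function used as fitness is purely illustrative.

```python
import numpy as np

def crossover(x, v, Cr, rng):
    # Eq. (14): take v_j when j == K or rand_j <= Cr, otherwise keep x_j
    d = len(x)
    K = rng.integers(d)              # 0-based index forced to come from v
    mask = rng.random(d) <= Cr
    mask[K] = True
    return np.where(mask, v, x)

def select(x, u, fitness):
    # Eq. (15): the trial vector survives if it is at least as good (minimization)
    return u if fitness(u) <= fitness(x) else x

# Illustrative fitness to minimize (assumed for this example)
def sphere(z):
    return float(np.sum(z ** 2))

rng = np.random.default_rng(1)
x = np.array([1.0, 1.0, 1.0, 1.0])
v = np.array([0.0, 0.0, 0.0, 0.0])
u = crossover(x, v, Cr=0.5, rng=rng)
survivor = select(x, u, sphere)
print(u, survivor)
```

Forcing index K to come from the mutated vector guarantees that the trial vector differs from its parent in at least one component, even when Cr is very small.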

Differential evolution (DE) performs well on a wide range of optimization problems; however, this performance is highly dependent on the learning strategies used in the mutation step and on the corresponding critical control parameters $Cr$, F, and $NP$. To deal with this problem, we used in this paper an improved version of DE called self-adaptive differential evolution (SADE), in which the learning strategies and their associated control parameters $Cr$ and F are automatically adapted during evolution. For learning strategies, SADE uses four different strategies, rand/1, current to best/2, rand/2, and current to rand/1, instead of the single strategy of the original DE algorithm. For each candidate solution vector in the current iteration q, one of the mutation strategies is selected by applying roulette wheel selection to their selection probabilities. In the beginning, all learning strategies have the same probability of being selected, but as the iterations progress, their probabilities are updated after each specified number of iterations, called the learning period, according to their past success in generating surviving trial vectors. This update is summarized by the following formula [29]:

p_{i} = \frac{n s_{i}}{n s_{i} + n f_{i}}, i = 1, 2, 3, 4, 5

(16)

where $ns_{i}$ records the number of trial vectors generated by strategy i that successfully entered the next generation, and $nf_{i}$ records the number of trial vectors generated by strategy i that were discarded.
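The probability update of Equation (16) and the subsequent roulette wheel selection can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the success ratios are normalized so that they sum to one before being used as selection probabilities, and a uniform distribution is used as a fallback when no strategy has produced a surviving trial vector. The counters are illustrative and follow the index range of Equation (16).

```python
import numpy as np

def strategy_probabilities(ns, nf):
    ns = np.asarray(ns, dtype=float)
    nf = np.asarray(nf, dtype=float)
    total = ns + nf
    # Eq. (16): success ratio of each strategy (0 if the strategy was never used)
    p = np.divide(ns, total, out=np.zeros_like(ns), where=total > 0)
    if p.sum() == 0:
        return np.full(len(ns), 1.0 / len(ns))   # assumed fallback: uniform
    return p / p.sum()                           # assumed normalization

def roulette(p, rng):
    # Roulette wheel selection: index i is drawn with probability p[i]
    return int(rng.choice(len(p), p=p))

# Illustrative success/failure counters over one learning period
ns = [8, 2, 0, 5, 5]
nf = [2, 8, 10, 5, 5]
p = strategy_probabilities(ns, nf)
rng = np.random.default_rng(3)
chosen = roulette(p, rng)
print(p, chosen)
```

A strategy whose trial vectors never survive during a learning period receives probability zero under this sketch; an implementation that wants to keep every strategy selectable would need to add a small floor to the probabilities.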

Similarly, the control parameters F and $Cr$ are gradually self-adapted by learning from their previous experiences of producing surviving trial vectors. For the scale factor F, a set of $NP$ values is randomly generated from a normal distribution with mean $μF$ and standard deviation $σF$ and applied to each individual in the current population. The same goes for the crossover rates $Cr$, for which an initial set of $NP$ random values is drawn from a normal distribution with mean $μCR$ and standard deviation $σCR$ and applied to each individual in the current population. Unlike F, for which the mean of the normal distribution remains fixed during all iterations, $μCR$ is initially set to a specific value and then, after a specified number of iterations called the $μCR$ update period, is changed according to all the recorded $Cr$ values corresponding to successful trial vectors during this period. The above procedure is then repeated with this new mean and the same standard deviation $σCR$ for the same number of iterations, and so on until the last iteration is reached.
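The adaptation of F and Cr described above can be sketched as follows. The numerical values of the means and standard deviations are illustrative placeholders, not the settings of Table 5. Two further assumptions are made: sampled crossover rates are clipped to [0, 1], and the new mean is taken as the arithmetic mean of the successful Cr values, which is one natural reading of "changed according to all the recorded values".

```python
import numpy as np

def sample_F(mu_F, sigma_F, NP, rng):
    # One scale factor per individual; the mean stays fixed over all iterations
    return rng.normal(mu_F, sigma_F, size=NP)

def sample_Cr(mu_CR, sigma_CR, NP, rng):
    # One crossover rate per individual; clipping to [0, 1] is an assumption
    return np.clip(rng.normal(mu_CR, sigma_CR, size=NP), 0.0, 1.0)

def update_mu_CR(mu_CR, successful_Cr):
    # Assumed rule: mean of the Cr values that produced surviving trial vectors
    if len(successful_Cr) == 0:
        return mu_CR                 # nothing recorded: keep the current mean
    return float(np.mean(successful_Cr))

rng = np.random.default_rng(7)
F_values = sample_F(mu_F=0.5, sigma_F=0.3, NP=10, rng=rng)      # placeholder settings
Cr_values = sample_Cr(mu_CR=0.5, sigma_CR=0.1, NP=10, rng=rng)  # placeholder settings
new_mu_CR = update_mu_CR(0.5, successful_Cr=[0.2, 0.4])
print(F_values, Cr_values, new_mu_CR)
```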

4. Implementation and Experiments

To assess the performance of the proposed model, the preprocessing part and the training part of our framework were implemented independently of each other. The preprocessing part (data preparation and data reformulation) was implemented in C#, while the training part was implemented in Python using the TensorFlow and Keras libraries. Both parts were run on the same host, with an Intel Core i5-2520M CPU @ 2.50 GHz and 8 GB of memory. The host runs two different operating systems: Windows 10, on which we implemented and executed the preprocessing part, and Fedora 35, on which we implemented and executed the training part. The process and the experiments conducted to assess the performance of our model are explained in detail in the rest of this section. We first describe the datasets used to train the model, and then we outline the metrics calculated to measure its performance. Finally, we present more details about the configuration parameters of each dataset model.

4.1. Dataset Description

The experiments conducted in this work were carried out on three different datasets: KDD’99 [32], UNSW-NB15 [33], and CIC-IDS2017 [34]. Although the KDD’99 dataset is relatively outdated, our objective is to verify whether the proposed model is able to obtain better results on different datasets with various types of attacks, whether these attacks are recent or old:

KDD’99 is the most commonly used dataset for the evaluation of intrusion detection systems. It was created by the MIT Lincoln Laboratory and the Air Force Research Laboratory for participation in an international competition conducted in 1999. The dataset was generated over five weeks (weeks 1–5). The first and third weeks are free of attacks, whereas the second, fourth, and fifth weeks include the network traffic of 58 different attack types divided into 4 categories: denial-of-service attacks (DoS), user-to-root attacks (U2R), remote-to-local attacks (R2L), and probing attacks.
UNSW-NB15 is a network intrusion detection dataset that was created in 2015 by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). The original raw traffic, amounting to approximately 100 GB, was collected during two simulation periods, each lasting about 15 h, on 22 January 2015 and 17 February 2015, respectively. The dataset comprises nine different attack categories, namely, fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms.
CIC-IDS2017 is a recent dataset consisting of network data collected by the Canadian Institute for Cybersecurity in 2017. The dataset contains both benign and malicious raw network traffic collected over five days from Monday, 3 July 2017, to Friday, 7 July 2017. The first day contains only benign traffic, while the other days contain various types of attacks, namely DoS attacks (Hulk, GoldenEye, Slowloris, and Slowhttptest), web attacks (Brute Force, XSS, and SQL Injection), patator attacks (FTP and SSH), heartbleed attacks, infiltration attacks, botnet attacks, port scan attacks, and DDoS attacks.

The three datasets are all provided in raw form stored in pcap files, as well as in feature-ready form stored in CSV files. In this work, we used the raw pcap traffic instead of the feature-ready data. However, due to the large scale and class imbalance of the original raw pcap data, and in order to reduce the amount of training data and thus the training time, the three datasets were downsampled by reducing the number of samples of the most frequent classes. This was intended to reduce the dataset size while simultaneously reducing any bias towards the most frequent behaviors. The reduced raw datasets were then randomly divided into training, validation, and testing parts at a ratio of 70%, 10%, and 20%, respectively, while ensuring that every traffic category was split at the same ratio. The sample distribution of the three datasets is depicted in Table 2.
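A per-class 70%/10%/20% split of this kind can be sketched as follows. This is an illustrative NumPy version written for this description, not the code used in our framework; the function name, the rounding of per-class counts, and the toy labels are assumptions.

```python
import numpy as np

def stratified_split(labels, rng, ratios=(0.7, 0.1, 0.2)):
    # Split the sample indices of every class separately at the same ratio
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(ratios[0] * len(idx)))
        n_val = int(round(ratios[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])      # remainder goes to the test part
    return np.array(train), np.array(val), np.array(test)

# Toy labels: 100 samples of class 0 and 10 samples of class 1
labels = [0] * 100 + [1] * 10
rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = stratified_split(labels, rng)
print(len(train_idx), len(val_idx), len(test_idx))
```

For very small classes, rounding can leave the validation part empty, so such classes may need special handling.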

4.2. Evaluation Metrics

The proposed model was evaluated using four common evaluation metrics, namely, accuracy, precision, recall, and F-score. However, since the proposed model addresses a multiclass classification problem, the calculation of these metrics differs slightly from the binary case. This is because a multiclass model produces one set of results per class, all of which must be taken into account when calculating these metrics. To address this, weighted and macro averages are generally adopted to measure the overall values of precision, recall, and F-score. Below is a short definition of each of these four metrics, along with the corresponding equations for the multiclass case. Note that all of these equations are derived from the numbers of false negatives (FN), false positives (FP), true negatives (TN), and true positives (TP) in the confusion matrix, which were computed during the testing step of the experiment. In multiclass classification, TP_i (true positives of class i) is the number of class i samples predicted as class i, FP_i (false positives of class i) is the number of not-class i samples predicted as class i, TN_i (true negatives of class i) is the number of not-class i samples predicted as not-class i, and FN_i (false negatives of class i) is the number of class i samples predicted as not-class i:

Accuracy (ACC) is a metric that refers to the rate of samples correctly classified for a particular class type i, and it is calculated as follows [35]:

${ACC}_{i} = \frac{{TP}_{i} + {TN}_{i}}{{TP}_{i} + {TN}_{i} + {FP}_{i} + {FN}_{i}}$

(17)
Precision (PR) is a metric that measures the rate of samples correctly classified for a particular class type i given all predictions of that class. Its formula is given by [35]:

${PR}_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}}$

(18)
Recall (RC) is a metric that measures the rate of samples correctly classified for a particular class type i given all occurrences of that class type. It is calculated by the following formula [35]:

${RC}_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FN}_{i}}$

(19)
F-Score (F1) is a metric that measures the harmonic mean of precision and recall per class type i. Its formula is given as follows [35]:

${F 1}_{i} = 2 \times \frac{{PR}_{i} \times {RC}_{i}}{{PR}_{i} + {RC}_{i}}$

(20)

The overall values of these four metrics are given by the overall accuracy together with either the macro-average precision, macro-average recall, and macro-average F-score or the weighted precision, weighted recall, and weighted F-score, which are calculated by the following formulas [35]:

Overall ACC = \frac{\sum_{i = 1}^{K} {TP}_{i}}{N}

(21)

Macro-average PR = \frac{1}{K} \sum_{i = 1}^{K} \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}}

(22)

Macro-average Recall = \frac{1}{K} \sum_{i = 1}^{K} \frac{{TP}_{i}}{{TP}_{i} + {FN}_{i}}

(23)

Macro-average F 1 = \frac{1}{K} \sum_{i = 1}^{K} \frac{2 \times {PR}_{i} \times {RC}_{i}}{{PR}_{i} + {RC}_{i}}

(24)

Weighted PR = \frac{1}{N} \sum_{i = 1}^{K} C_{i} \times {PR}_{i}

(25)

Weighted RC = \frac{1}{N} \sum_{i = 1}^{K} C_{i} \times {RC}_{i}

(26)

Weighted F 1 = \frac{1}{N} \sum_{i = 1}^{K} C_{i} \times \frac{2 \times {PR}_{i} \times {RC}_{i}}{{PR}_{i} + {RC}_{i}}

(27)

where N is the total number of samples, K is the number of classes, and $C_{i}$ represents the number of instances in class i.
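Equations (17)–(27) can all be evaluated directly from a confusion matrix, as in the following sketch. It assumes that rows correspond to true classes and columns to predicted classes, and it sets a ratio to zero when its denominator is zero; the 3-class matrix is an invented example, not one of the matrices reported in this paper.

```python
import numpy as np

def safe_div(a, b):
    # Element-wise a / b, returning 0 where the denominator is 0 (assumed convention)
    return np.divide(a, b, out=np.zeros_like(a, dtype=float), where=b != 0)

def multiclass_metrics(cm):
    cm = np.asarray(cm, dtype=float)   # rows: true class, columns: predicted class
    N = cm.sum()
    TP = np.diag(cm)
    FP = cm.sum(axis=0) - TP
    FN = cm.sum(axis=1) - TP
    TN = N - TP - FP - FN
    C = cm.sum(axis=1)                 # number of instances per class
    acc = (TP + TN) / N                # Eq. (17)
    pr = safe_div(TP, TP + FP)         # Eq. (18)
    rc = safe_div(TP, TP + FN)         # Eq. (19)
    f1 = safe_div(2 * pr * rc, pr + rc)  # Eq. (20)
    return {
        "acc": acc, "pr": pr, "rc": rc, "f1": f1,
        "overall_acc": TP.sum() / N,                 # Eq. (21)
        "macro_pr": pr.mean(),                       # Eq. (22)
        "macro_rc": rc.mean(),                       # Eq. (23)
        "macro_f1": f1.mean(),                       # Eq. (24)
        "weighted_pr": (C * pr).sum() / N,           # Eq. (25)
        "weighted_rc": (C * rc).sum() / N,           # Eq. (26)
        "weighted_f1": (C * f1).sum() / N,           # Eq. (27)
    }

# Invented 3-class confusion matrix for illustration
cm = [[50, 2, 0],
      [5, 40, 5],
      [0, 1, 9]]
m = multiclass_metrics(cm)
print(m["overall_acc"], m["macro_rc"], m["weighted_rc"])
```

Since $C_{i} = TP_{i} + FN_{i}$, the weighted recall of Equation (26) always coincides with the overall accuracy of Equation (21), whereas the macro-average recall gives every class the same weight and is therefore more sensitive to poorly detected minority classes.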

4.3. Hyperparameters Setting

The experiments were conducted by first determining the optimal structure of the CNN model. To do this, we optimized the hyperparameters of the proposed CNN model for each dataset using the self-adaptive differential evolution (SADE) algorithm described above. However, since the number of hyperparameters in CNN models is large, optimizing them all would be computationally expensive. For this reason, we chose only the seven hyperparameters (see Table 3) that have the greatest impact on the performance of the model: the number of filters for the first convolution layer (NF1), the number of filters for the second convolution layer (NF2), the number of neurons in the hidden layer (NNH), the dropout rate (DR), the learning rate (LR), the batch size (BS), and the batch normalization (BN). The remaining hyperparameters were set according to conventional values widely used in related research studies. Details of the selected values for these hyperparameters are provided in Table 4.

The optimization algorithm started, as shown in Algorithm 2, by generating an initial population of NP candidate solutions. The initial candidate solutions were generated by randomly setting the hyperparameters’ values from their predefined ranges. The hyperparameter ranges for our optimization problem are shown in Table 3. It should be noted that the selected ranges in this table are limited because of our limited computational resources. For the same reason, the number of epochs for training our model was set to only 10, as shown in Table 4.

After generating the initial population, the optimization algorithm proceeded to create new solutions using the mutation and recombination operators, and the fittest ones were then selected according to their fitness. This process continued for many generations until a stopping criterion was satisfied, at which point the CNN structure with the highest fitness was selected as the optimal one.

In our optimization problem, the fitness function represented the accuracy of the model corresponding to the CNN structure represented by the candidate solution. For the rest of the parameters of the optimization algorithm, their values are summarized in Table 5.

Algorithm 2: Optimization of our CNN hyperparameters using the SaDE metaheuristic.

Input:
/∗ Hyperparameter ranges from Table 3 ∗/
Bounds ← Hyperparameter ranges list;
/∗ Initial strategy selection probabilities ∗/
St_Prob ← {0.2, 0.2, 0.2, 0.2, 0.2};
/∗ Initial SaDE parameters from Table 5 ∗/
$μF$; $σF$; $μCR$; $σCR$; NP; G; LP; $μ$Cr_P;
Output: Xbest (the best hyperparameters obtained);

5. Experiment Results and Discussion

Through a parallel execution on the three datasets that lasted more than a week, our optimization algorithm returned, as a result, the best configurations for our three CNN models corresponding to the three datasets. The best hyperparameters returned for these three models are shown in Table 6.

Training and testing these three models on the three datasets allowed us to calculate their various performance measures. For clarity, the evaluation results of this study are presented as follows: First, we show the learning curves to assess the performance of the model during training. Then, we present the confusion matrix and associated metrics such as accuracy, precision, recall, and F-score for a detailed analysis of the performance of the proposed model. After that, we evaluate the performance of the proposed model in terms of computational complexity. Finally, we compare the performance of the proposed model with that of other models in the literature.

5.1. Training Performance

As can be seen from the training results in Figure 3, the accuracy for both the training and validation data increased gradually with the number of epochs, and beyond a certain number of epochs, it stabilized at about 98% for KDD’99, 90% for UNSW-NB15, and 99% for CIC-IDS2017. This indicates that the model was learning well and that training was progressing steadily in capturing the patterns present in the training data. Moreover, the training and validation curves of the three models remain close to each other throughout the learning process, which indicates that the models generalize effectively to unseen validation data and do not suffer from overfitting or underfitting. However, we note that the validation curve sometimes crossed or exceeded the training curve for the three datasets, particularly at the start of the learning process. According to Aurélien Géron [36], this sometimes happens by chance when the validation set is fairly small, which could result in a validation set that is easier than the training set. Another reason, again according to Géron [36], is that regularization, such as dropout, is only applied during training and not during validation, which can penalize the training performance relative to the validation performance.

The same conclusions can be drawn from the results in Figure 4, where the loss for both the training and validation data decreased gradually with the number of epochs until a certain point, after which it stabilized at very low values. This indicates that the model was effectively minimizing the training error while generalizing to unseen validation data without overfitting or underfitting.

5.2. Performance Measurement on the Test Dataset

The confusion matrix, one of the most basic and widely used tools for describing the performance of a classification model on a test dataset, was also used in this study to examine the performance of our generated models. Its value lies in its ability to show which test instances were classified correctly or incorrectly by reporting the numbers of true positive, true negative, false positive, and false negative predictions. The confusion matrices of our three multiclass classification models on the KDD’99, UNSW-NB15, and CIC-IDS2017 datasets are shown in Figure 5, Figure 6 and Figure 7, respectively. We used a heat map in which differences in color and brightness reflect differences in the results.

Before going into the details of each matrix, it is clear that the highest values and the darkest colors of the three matrices are found on the diagonal, which means that the rate of samples from the test dataset that are correctly predicted is high. This is a good sign, since it indicates that the generated models are performing well and that the other remaining measures, such as accuracy, precision, recall, and F-score, will also be high.

The confusion matrix in Figure 5 shows that all normal test samples in the KDD’99 dataset were correctly classified as normal samples. The DoS and Probe attack samples were also correctly classified, with a success rate of 98.74% (8566 out of 8675 samples) and 93.99% (3051 out of 3246 samples), respectively. However, for the R2L and U2R attacks, only 80.13% (125 out of 156 samples) and 56.00% (14 out of 25 samples), respectively, were correctly classified. It is therefore evident that the proposed model provides a high detection rate for the majority classes, but it is less effective for the minority classes. This is mainly due to the reduced number of samples used for the learning of the latter, which can lead to poorer performance for these classes and, consequently, the generated model will tend to be biased towards the majority classes.

The same observation applies to the results of the confusion matrix in Figure 6, where the majority of classes have higher values along the diagonal, indicating that the model performed well for these classes. For example, 99.57% of normal class samples (4651 out of 4671 samples) were correctly classified as normal samples. The same is true for the exploits, fuzzers, and reconnaissance classes, which the model classified with a success rate of 93.68% (4207 out of 4491 samples), 98.20% (3921 out of 3993 samples), and 96.01% (1782 out of 1856 samples), respectively. However, for minority classes, except for analysis and shellcode, which performed well (79 out of 84 samples and 255 out of 225 samples, respectively, were correctly classified), we observe many misclassified samples (many high values in off-diagonal cells), especially for the backdoors and DoS classes, in which more than 46% (36 out of 78 samples) and 66% (463 out of 701 samples), respectively, were wrongly classified. We clearly see that the UNSW-NB15 model had difficulties differentiating between DoS traffic and exploit traffic. This may be because the features extracted from the network traffic data do not sufficiently differentiate between DoS and exploit attacks, a situation commonly referred to as the problem of overlapping between classes. When a dataset is unbalanced and contains overlapping regions, the majority class generally dominates these regions [37].

Compared to the first two models, the confusion matrix in Figure 7 shows that the model corresponding to the CICIDS2017 dataset is more efficient, since all the values in the confusion matrix are concentrated on the diagonal for both majority and minority classes. For example, out of 13 classes, 9 were correctly classified with a 100% success rate. Even those that were not 100% correct had a very low error rate. For example, only 3 samples out of 7186 in the normal class were misclassified, 16 out of 186 for the GoldenEye class, 39 out of 11,710 for the PortScan class, and 1 out of 2 for the SQL Injection class. This may be justified by the quality and quantity of data contained in the CICIDS2017 dataset compared to the KDD’99 and UNSW-NB15 datasets.

The performance of the proposed model was further evaluated by examining other measures, such as accuracy, precision, recall and F-score. However, since the proposed model is a multiclass classification model in which the performance results vary from one class to another, the measures used must consider the aggregation of the different results across classes. To address this issue, both macro- and weighted averaging techniques have been considered in this study, in addition to individual measurements for each class.

The experiment’s metric values for the proposed model are shown in Figure 8, Figure 9 and Figure 10. To facilitate the comparison, the results are expressed through a table heat map in which the accuracy, precision, recall, and F-score of each class are listed at the top, and then the overall accuracy and macro- and weighted averages of precision, recall, and F-score are listed at the bottom. As indicated in the right vertical bar, the darker blue cells represent the higher values, whereas the lighter blue cells represent the lower values.

As can be seen from Figure 8, the individual accuracies for the five classes of the KDD’99 dataset are all greater than 0.9830, giving a total accuracy equal to 0.9810, which demonstrates the high performance of the proposed model in terms of accuracy. The same can be said for the precision measures, where all precision values are above 0.9670, except for the U2R class, where the precision value reached 0.9333. This allowed for a total macro-precision and weighted precision that were highly acceptable, reaching 0.9711 and 0.9809, respectively. Contrary to accuracy and precision, recall was high for the majority classes Normal, DoS, and Probe and lower for the two minority classes R2L and U2R, in which its value reached 0.8013 and 0.5600, respectively. This is because the sample size of U2R and R2L in the KDD’99 dataset is very small and therefore cannot be trained efficiently, and consequently, it is not possible to obtain satisfactory results for these two classes. This clearly emerges from the relatively low macro-recall of 0.8577 compared to the significantly higher weighted recall of 0.9810.

The same observation can be made for the performance metrics of the models trained on the UNSW-NB15 and CIC-IDS2017 datasets, as shown in Figure 9 and Figure 10. The accuracy and precision of the models reached the best values for most classes, while the recall metric reached some small values, especially for the Backdoors and DoS classes in the UNSW-NB15 dataset and the SQL injection class in the CIC-IDS2017 dataset. As mentioned earlier, for the Backdoors and DoS classes, this may be due to the problem of overlapping between classes, because the model confused these two classes with the Exploits class, as shown in Figure 6. For the SQL injection class, however, it may be because the instances of this class are very rare in the dataset compared to the instances of the other classes, which drags down the recall of this class.

In summary, the proposed model shows strong performance, with very high individual and global precision and accuracy measures. This indicates that the model effectively classified instances for most classes, maintaining consistent levels of accuracy, especially for the two models trained on the KDD’99 and CIC-IDS2017 datasets. Macro-recall was somewhat lower, indicating a difficulty in correctly identifying instances of some classes, whereas weighted recall was high, indicating that the model captured instances accurately for classes that are sufficiently represented in the dataset. On the whole, the model demonstrated a strong ability to correctly classify instances of different classes, with stable performance.

5.3. Computational Efficiency Measurement

The computational complexity of deep learning models, such as convolutional neural networks (CNNs), is typically determined by considering all the operations required to perform a forward pass through the network. These include preprocessing, convolution, pooling, and fully connected (dense) operations. However, in this paper, we only considered the convolutional layers and the dense layers, as the time complexity of preprocessing and of the pooling layers is generally negligible compared with that of these two layer types [38].

The computational complexity of data preprocessing is typically determined by the operations performed to prepare the data for use in the model. This includes steps such as data extraction, tokenizing, normalization, and so on. For example, in the proposed model, preprocessing simply consists of converting the payload of captured packets into a frequency vector, as shown in Algorithm 1.

In this algorithm, an initialization loop executes 1024 times, giving a complexity of $O(1024)$, or simply $O(1)$ since it is a constant. Then an outer loop executes four times, with an inner loop that executes N times for each iteration of the outer loop. The operation inside the loop takes constant time $O(1)$, giving a total time complexity, including that of the initialization loop, of $O(4 \times N) + O(1)$. Simplified, this complexity becomes $O(N)$, where N represents the size of the compact packet payload.
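The linear growth can be checked with a small step counter. This is a minimal sketch that mirrors only the loop bounds stated in the text (a 1024-step initialization followed by four passes over a payload of N bytes). The function name `count_preprocessing_steps` is hypothetical, and the body of Algorithm 1 itself is not reproduced here.

```python
def count_preprocessing_steps(n: int) -> int:
    """Count constant-time steps for a payload of n bytes.

    Only the loop bounds from the text are modeled; each loop body is
    treated as a single O(1) step (an assumption for illustration).
    """
    steps = 0
    for _ in range(1024):      # initialization loop: constant cost
        steps += 1
    for _ in range(4):         # outer loop: four passes
        for _ in range(n):     # inner loop: one step per payload byte
            steps += 1
    return steps


for n in (100, 200, 400):
    print(n, count_preprocessing_steps(n))
```

Each doubling of N doubles the 4N term while the 1024-step term stays fixed, which is the behavior summarized by $O(N)$.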

The computational complexity of the convolutional layers is mainly computed using the following formula [39]:

$$\text{Conv layer time} \sim O\left(\sum_{l=1}^{D} M_{l}^{2} \times K_{l}^{2} \times C_{l-1} \times C_{l}\right) \qquad (28)$$

where D is the number of convolutional layers, $M_{l}$ is the length of the output feature map of the lth convolutional layer, $K_{l}$ is the length of the kernel in the lth layer, $C_{l-1}$ is the number of input channels in the lth layer, and $C_{l}$ is the number of output channels in the lth layer.
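As a worked check of the convolutional formula, the sketch below evaluates the sum for one set of layer shapes. The shapes are assumptions rather than values stated in this section: a 32 x 32 output feature map, a single input channel, the 3 x 3 kernel from Table 4, and the 32 filters per layer reported for KDD’99 in Table 6. The helper name `conv_ops` is hypothetical.

```python
def conv_ops(layers):
    """Sum M^2 * K^2 * C_in * C_out over (M, K, C_in, C_out) tuples."""
    return sum(m * m * k * k * c_in * c_out for m, k, c_in, c_out in layers)


# Assumed shapes for the KDD'99 model (not stated explicitly in the text):
# 32 x 32 output maps, 3 x 3 kernels, 1 input channel, 32 filters per layer.
kdd_conv1 = conv_ops([(32, 3, 1, 32)])
kdd_conv2 = conv_ops([(32, 3, 32, 32)])
print(kdd_conv1, kdd_conv2)
```

Under these assumed shapes the two terms evaluate to 294,912 and 9,437,184, which agree with the Conv1 and Conv2 entries listed for KDD’99 in Table 7.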

For the fully connected layers, the complexity is calculated using the following formula [38]:

$$\text{Dense layer time} \sim O\left(\sum_{l=1}^{D} N_{l-1} \times N_{l}\right) \qquad (29)$$

where D is the number of dense layers, $N_{l-1}$ is the number of inputs in the lth layer, and $N_{l}$ is the number of outputs in the lth layer.
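The dense-layer formula can be evaluated the same way. The sketch below is illustrative only: the flattened input size of 32,768 is an assumption chosen because it reproduces the Dense column of Table 7, the class counts (5, 10, and 13) are taken from Table 2, and the 128-unit hidden layer is the UNSW-NB15 setting from Table 6. The helper name `dense_ops` is hypothetical.

```python
def dense_ops(sizes):
    """Sum N_{l-1} * N_l over consecutive layer sizes."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))


FLATTENED = 32 * 32 * 32  # assumed flattened input size (32,768 values)

print(dense_ops([FLATTENED, 5]))         # KDD'99: no hidden layer, 5 classes
print(dense_ops([FLATTENED, 128, 10]))   # UNSW-NB15: 128 hidden units, 10 classes
print(dense_ops([FLATTENED, 13]))        # CIC-IDS2017: no hidden layer, 13 classes
```

With these inputs the three results are 163,840, 4,195,584, and 425,984, matching the Dense entries in Table 7.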

Based on these two formulas, we calculated the complexity of the three models corresponding to the three datasets, KDD’99, UNSW-NB15, and CICIDS2017. The results are shown in Table 7.

As shown in Table 7, the computational complexity of the proposed model ranged from about 0.7 to 9.9 million operations, which is reasonable for a deep learning-based classification model, especially when compared with similar models. Unfortunately, a direct comparison with the works cited in this paper is not possible, as none of them report the complexity of their models. However, we can select one of these models and calculate its complexity ourselves. Here, we chose the best model in terms of macro-F-score, that of Wang et al. [19] (see Table 8). The complexity of their model was found to be about 98 million operations, which shows the clear advantage of the proposed model in terms of complexity. This is expected, since our model consists of at most 2 convolutional layers, while Wang et al.’s [19] model consists of over 16 convolutional layers plus a 32-unit transformer, a 128-unit BiLSTM, and over five dense layers.
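To make the comparison concrete, the following sketch sums the per-layer counts from Table 7 and relates each total to the approximate 98 million operations quoted for Wang et al. [19]. The variable names are hypothetical, the baseline is only the rounded figure given in the text, and the missing Conv2 entry for CIC-IDS2017 is treated as zero.

```python
# Per-layer operation counts (Conv1, Conv2, Dense) copied from Table 7.
TABLE7 = {
    "KDD'99": [294_912, 9_437_184, 163_840],
    "UNSW-NB15": [147_456, 2_621_184, 4_195_584],
    "CIC-IDS2017": [294_912, 0, 425_984],  # no second conv layer
}
BASELINE_OPS = 98_000_000  # approximate figure quoted for Wang et al. [19]

totals = {name: sum(parts) for name, parts in TABLE7.items()}
for name, total in totals.items():
    ratio = BASELINE_OPS / total
    print(f"{name}: {total:,} operations, baseline/ours = {ratio:.1f}")
```

The totals reproduce the last column of Table 7, and the printed ratios indicate roughly how many times fewer operations each model needs than the baseline.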

5.4. Comparison with State-of-the-Art Works

To provide a more objective evaluation, the performance of our proposed intrusion detection model was compared with recent and relevant intrusion detection approaches from the literature. These works were selected because (i) their classification techniques are mainly based on advanced deep learning algorithms such as convolutional neural networks (CNNs), deep neural networks (DNNs), and recurrent neural networks (RNNs); (ii) they were trained and tested on the KDD’99, UNSW-NB15, and CICIDS2017 datasets; and (iii) they report multiclass classification results in terms of the overall accuracy and the macro- and/or weighted averages of the precision, recall, and F-score.

The comparison results for the three datasets (KDD’99, UNSW-NB15, and CICIDS2017) are summarized in Table 8. The “Model” column shows the deep learning approach (CNN, DNN, or RNN) used in the related work. The “Balan.?” column indicates whether the related work dealt with the class imbalance problem (yes) or not (no). The remaining columns display the values of the four evaluation metrics: the overall accuracy and the macro- and weighted averages of the precision, recall, and F-score.
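The difference between the macro- and weighted averages reported in Table 8 can be illustrated with recall on a small confusion matrix. This is a minimal sketch using standard definitions; the function name `recall_averages` and the two-class toy matrix are hypothetical and are not taken from the experiments.

```python
def recall_averages(confusion):
    """Return (macro, weighted) recall from a square confusion matrix.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    supports = [sum(row) for row in confusion]
    recalls = [
        row[i] / support if support else 0.0
        for i, (row, support) in enumerate(zip(confusion, supports))
    ]
    macro = sum(recalls) / len(recalls)  # every class counts equally
    weighted = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)
    return macro, weighted


# Toy data: a majority class with 90 samples (all correct) and a
# minority class with 10 samples (only 5 correct).
macro, weighted = recall_averages([[90, 0], [5, 5]])
print(macro, weighted)
```

In this toy case the per-class recalls are 1.0 and 0.5, so the macro-average is 0.75 while the weighted average is 0.95: a poorly detected minority class lowers the macro figure far more than the weighted one.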

The results in this comparison table show that the proposed model outperformed the related works on almost all evaluation metrics. The improvement is considerable compared to similar models trained on unbalanced datasets, where the accuracy gain exceeded 15 percentage points for the KDD’99 dataset, 12 for the UNSW-NB15 dataset, and 2 for the CICIDS2017 dataset. Compared to studies that applied data balancing techniques to improve their results, our proposed model still performed better overall, but with a smaller margin, and it was sometimes slightly below some works, particularly in terms of recall. This is reasonable, given that training on unbalanced datasets inevitably makes minority classes harder to detect, regardless of the approach used, which lowers the detection rate of these classes and, consequently, the recall. The related works avoid this by under-sampling majority classes and over-sampling minority ones to balance their datasets, thereby increasing the detection rate and, consequently, the overall performance of their models.

In summary, from these comparison results, we can conclude that our proposed model is very effective, since it has excellent performance compared to other works in the literature. This superiority underscores the effectiveness and robustness of our proposed method in addressing the challenge of converting raw network traffic into a 2D image while retaining all its distinguishing characteristics. These results confirm the importance and potential impact of this new approach in advancing the field of intrusion detection using modern artificial intelligence techniques. Overall, the work presented in this paper provides valuable insights and establishes a new approach for future research in this domain.

6. Conclusions

Intrusion detection remains a significant challenge in the IT world today. Despite extensive research employing advanced deep learning techniques, many studies fail to leverage the full potential of network traffic information. This is primarily due to the reliance on feature-ready CSV datasets rather than the original raw pcap datasets, which may lead to sub-optimal results. In response, this paper proposes a novel framework for more effective network intrusion detection using convolutional neural networks (CNNs). The approach uniquely converts raw network traffic into 2D images, which have been demonstrated to be highly suitable for CNN-based classification models.

The CNN model’s structure and hyperparameters were automatically selected during learning through an optimization algorithm based on the self-adaptive differential evolution (SADE) metaheuristic. This not only maximized the model’s performance but also reduced the manual effort required for constructing and tuning the CNN. Performance evaluations on the KDD’99, UNSW-NB15, and CICIDS2017 datasets confirmed the proposed model’s effectiveness in accurately identifying and classifying various types of attacks, underscoring its generality. The proposed approach achieved very high performance, with an accuracy of over 98% on the KDD’99 and CICIDS2017 datasets, and outperformed the various existing methods by a remarkable margin, i.e., more than 5% over the nearest competitor in the literature for the datasets KDD’99 and UNSW-NB15. Moreover, when compared to related works, our model not only achieved better and more robust performance but also featured a simple CNN architecture, with a maximum of two convolutional layers and up to 32 filters per layer. For the models based on KDD’99 and CICIDS2017, the fully connected part was even more streamlined, with no hidden layer between its input and output layers (see Table 6). This simplicity contributed to faster learning and classification speeds, which are critical in practical applications.

However, the proposed model has some limitations, particularly in detecting minority classes within imbalanced datasets. The imbalanced nature of datasets often leads to difficulties in identifying these minority classes, which reduces the overall recall rate. This issue becomes evident when comparing our results with studies that applied data balancing techniques such as the Synthetic Minority Oversampling Technique (SMOTE) or the Adaptive Synthetic Sampling Method (ADASYN). While our model still outperformed others, the improvement rate was relatively modest, and in some cases, the recall rate was slightly lower.
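For readers unfamiliar with the oversampling idea behind SMOTE, the sketch below generates synthetic minority samples by interpolating between a sample and its nearest minority neighbor. It is a deliberately simplified illustration (one nearest neighbor instead of the k neighbors used by SMOTE proper), it is not part of the proposed model, which was trained on unbalanced data, and the function name `smote_like` and the toy feature vectors are hypothetical.

```python
import random


def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and its nearest minority neighbor (simplified: k = 1).

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Nearest neighbor by squared Euclidean distance.
        neighbor = min(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hypothetical feature vectors
new_points = smote_like(minority, n_new=4)
print(len(new_points), new_points[0])
```

Because every synthetic point lies on a segment between two real minority samples, the minority class gains samples without simply duplicating existing ones.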

To address the issue of overlapping classes, several strategies can be employed. Using balanced datasets can ensure the equal representation of each class, reducing the risk of misclassification due to overlap. Minimizing overlapping regions through transformation techniques can further enhance class distinction. Additionally, solutions to similar problems in areas where advanced machine learning algorithms have achieved remarkable results, such as computer vision, natural language processing, and speech recognition, may offer further improvements. For example, we can draw inspiration from the solutions proposed for adversarial examples in speech recognition. In particular, the method proposed by Kwon and Nam [46], which identifies audio adversarial examples by the difference in classification scores, may be useful for our problem: just as adversarial samples are slightly modified versions of normal samples, samples from overlapping classes are slightly modified packet payloads, which causes a small difference in their classification scores.

Future research will focus on adapting these techniques for application to raw network traffic datasets, which presents a greater challenge than working with feature-ready datasets.

Author Contributions

Conceptualization, A.B.; methodology, A.B.; software, A.B.; resources, A.B.; writing—original draft preparation, A.B.; writing—review and editing, A.B., S.H. and A.L.; validation, A.B., S.H. and A.L.; supervision, S.H.; project administration, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Algerian Ministry of Higher Education and Scientific Research within the PRFU project under grant No. C00L07UN180120230001.

Data Availability Statement

The data are available in a publicly accessible repository at https://www.ll.mit.edu/r-d/datasets/1999-darpa-intrusion-detection-evaluation-dataset (accessed on 8 July 2024), https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 8 July 2024), and https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 8 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Admass, W.S.; Munaye, Y.Y.; Diro, A.A. Cyber security: State of the art, challenges and future directions. Cyber Secur. Appl. 2024, 2, 100031.
[2] Kwon, H.; Kim, Y.; Yoon, H.; Choi, D. Optimal cluster expansion-based intrusion tolerant system to prevent denial of service attacks. Appl. Sci. 2017, 7, 1186.
[3] Cuan, Z.; Ren, Y.; Ding, D.W. Adaptive intrusion tolerant control for a class of uncertain nonlinear cyber-physical systems with full-state constraints. Automatica 2024, 166, 111728.
[4] Agrawal, S.; Sarkar, S.; Aouedi, O.; Yenduri, G.; Piamrat, K.; Alazab, M.; Bhattacharya, S.; Maddikunta, P.K.R.; Gadekallu, T.R. Federated Learning for intrusion detection system: Concepts, challenges and future directions. Comput. Commun. 2022, 195, 346–361.
[5] Sowmya, T.; Mary Anita, E. A comprehensive review of AI based intrusion detection system. Meas. Sens. 2023, 28, 100827.
[6] Lee, S.W.; Mohammed Sidqi, H.; Mohammadi, M.; Rashidi, S.; Rahmani, A.M.; Masdari, M.; Hosseinzadeh, M. Towards secure intrusion detection systems using deep learning techniques: Comprehensive analysis and review. J. Netw. Comput. Appl. 2021, 187, 103111.
[7] Sajed, S.; Sanati, A.; Garcia, J.E.; Rostami, H.; Keshavarz, A.; Teixeira, A. The effectiveness of deep learning vs. traditional methods for lung disease diagnosis using chest X-ray images: A systematic review. Appl. Soft Comput. 2023, 147, 110817.
[8] Abade, A.; Ferreira, P.A.; de Barros Vidal, F. Plant diseases recognition on images using convolutional neural networks: A systematic review. Comput. Electron. Agric. 2021, 185, 106125.
[9] Pingale, S.V.; Sutar, S.R. Remora whale optimization-based hybrid deep learning for network intrusion detection using CNN features. Expert Syst. Appl. 2022, 210, 118476.
[10] Asgharzadeh, H.; Ghaffari, A.; Masdari, M.; Soleimanian Gharehchopogh, F. Anomaly-based intrusion detection system in the Internet of Things using a convolutional neural network and multi-objective enhanced Capuchin Search Algorithm. J. Parallel Distrib. Comput. 2023, 175, 1–21.
[11] Altaf, T.; Wang, X.; Ni, W.; Liu, R.P.; Braun, R. NE-GConv: A lightweight node edge graph convolutional network for intrusion detection. Comput. Secur. 2023, 130, 103285.
[12] Daoud, M.; Dahmani, Y.; Bendaoud, M.; Ouared, A.; Ahmed, H. Convolutional neural network-based high-precision and speed detection system on CIDDS-001. Data Knowl. Eng. 2023, 144, 102130.
[13] Hnamte, V.; Hussain, J. Dependable intrusion detection system using deep convolutional neural network: A Novel framework and performance evaluation approach. Telemat. Informa. Rep. 2023, 11, 100077.
[14] Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550.
[15] Li, Y.; Xu, Y.; Liu, Z.; Hou, H.; Zheng, Y.; Xin, Y.; Zhao, Y.; Cui, L. Robust detection for network intrusion of industrial IoT based on multi-CNN fusion. Measurement 2020, 154, 107450.
[16] Andresini, G.; Appice, A.; Caforio, F.P.; Malerba, D.; Vessio, G. ROULETTE: A neural attention multi-output model for explainable Network Intrusion Detection. Expert Syst. Appl. 2022, 201, 117144.
[17] Udas, P.B.; Karim, M.E.; Roy, K.S. SPIDER: A shallow PCA based network intrusion detection system with enhanced recurrent neural networks. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 10246–10272.
[18] Brandon, B.; Anitha, C.; Ana, G.; Daisy, L. BLoCNet: A hybrid, dataset-independent intrusion detection system using deep learning. Int. J. Inf. Secur. 2023, 22, 893–917.
[19] Wang, S.; Xu, W.; Liu, Y. Res-TranBiLSTM: An intelligent approach for intrusion detection in the Internet of Things. Comput. Netw. 2023, 235, 109982.
[20] He, J.; Wang, X.; Song, Y.; Xiang, Q. A multiscale intrusion detection system based on pyramid depthwise separable convolution neural network. Neurocomputing 2023, 530, 48–59.
[21] Li, Y.; Qin, T.; Huang, Y.; Lan, J.; Liang, Z.; Geng, T. HDFEF: A hierarchical and dynamic feature extraction framework for intrusion detection systems. Comput. Secur. 2022, 121, 102842.
[22] Liu, J.; Song, X.; Zhou, Y.; Peng, X.; Zhang, Y.; Liu, P.; Wu, D.; Zhu, C. Deep anomaly detection in packet payload. Neurocomputing 2022, 485, 205–218.
[23] Qiu, W.; Ma, Y.; Chen, X.; Yu, H.; Chen, L. Hybrid intrusion detection system based on Dempster-Shafer evidence theory. Comput. Secur. 2022, 117, 102709.
[24] Lin, K.; Xu, X.; Xiao, F. MFFusion: A Multi-level Features Fusion Model for Malicious Traffic Detection based on Deep Learning. Comput. Netw. 2022, 202, 108658.
[25] Yu, L.; Dong, J.; Chen, L.; Li, M.; Xu, B.; Li, Z.; Qiao, L.; Liu, L.; Zhao, B.; Zhang, C. PBCNN: Packet Bytes-based Convolutional Neural Network for Network Intrusion Detection. Comput. Netw. 2021, 194, 108117.
[26] Crowley, J.L. Convolutional Neural Networks. In Human-Centered Artificial Intelligence: Advanced Lectures; Springer International Publishing: Cham, Switzerland, 2023; pp. 67–80.
[27] Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377.
[28] Aggarwal, C.C. Neural Networks and Deep Learning: A Textbook; Springer: Berlin/Heidelberg, Germany, 2023.
[29] Huang, V.; Qin, A.; Suganthan, P. Self-adaptive Differential Evolution Algorithm for Constrained Real-Parameter Optimization. In Proceedings of the 2006 IEEE International Conference on Evolutionary Computation, Vancouver, BC, Canada, 16–21 July 2006; pp. 17–24.
[30] Storn, R.; Price, K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359.
[31] Cui, L.; Li, G.; Zhu, Z.; Wen, Z.; Lu, N.; Lu, J. A novel differential evolution algorithm with a self-adaptation parameter control method by differential evolution. Soft Comput. 2018, 22, 6171–6190.
[32] DARPA. DARPA Intrusion Detection Data Sets; DARPA: Arlington, VA, USA, 1999; Available online: https://www.ll.mit.edu/r-d/datasets/1999-darpa-intrusion-detection-evaluation-dataset (accessed on 8 July 2024).
[33] Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6.
[34] Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP), INSTICC, SciTePress, Funchal, Portugal, 22–24 January 2018; pp. 108–116.
[35] Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
[36] Geron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019.
[37] Ding, H.; Chen, L.; Dong, L.; Fu, Z.; Cui, X. Imbalanced data classification: A KNN and generative adversarial networks-based hybrid approach for intrusion detection. Future Gener. Comput. Syst. 2022, 131, 240–254.
[38] Shah, B.; Bhavsar, H. Time Complexity in Deep Learning Models. Procedia Comput. Sci. 2022, 215, 202–210.
[39] Zhang, Y.; Qiao, S.; Zeng, Y.; Gao, D.; Han, N.; Zhou, J. CAE-CNN: Predicting transcription factor binding site with convolutional autoencoder and convolutional neural network. Expert Syst. Appl. 2021, 183, 115404.
[40] Lopez-Martin, M.; Carro, B.; Sanchez-Esguevillas, A.; Lloret, J. Shallow neural network with kernel approximation for prediction problems in highly demanding data networks. Expert Syst. Appl. 2019, 124, 196–208.
[41] Shams, E.A.; Rizaner, A.; Ulusoy, A.H. A novel context-aware feature extraction method for convolutional neural network-based intrusion detection systems. Neural Comput. Appl. 2021, 33, 13647–13665.
[42] Andresini, G.; Appice, A.; De Rose, L.; Malerba, D. GAN augmentation to deal with imbalance in imaging-based intrusion detection. Future Gener. Comput. Syst. 2021, 123, 108–127.
[43] Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in network-based intrusion detection systems. Comput. Secur. 2022, 112, 102499.
[44] Jiaxing, H.; Xiaodan, W.; Qian, S.Y.X.; Chen, C. Network intrusion detection based on conditional wasserstein variational autoencoder with generative adversarial network and one-dimensional convolutional neural networks. Appl. Intell. 2023, 53, 12416–12436.
[45] Kasongo, S.M.; Sun, Y. A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Comput. Secur. 2020, 92, 101752.
[46] Kwon, H.; Nam, S.H. Audio adversarial detection through classification score on speech recognition systems. Comput. Secur. 2023, 126, 103061.

Figure 1. The block diagram of our proposed intrusion detection framework.

Figure 2. Overall structure of our CNN model before hyperparameter tuning.

Figure 3. Accuracy curves during the training of our three multiclass classification models.

Figure 4. Loss curves during the training of our three multiclass classification models.

Figure 5. Confusion matrix of our CNN model on KDD’99.

Figure 6. Confusion matrix of our CNN model on UNSW-NB15.

Figure 7. Confusion matrix of our CNN model on CICIDS2017.

Figure 8. Performance of the KDD’99-based CNN model.

Figure 9. Performance of the UNSW-NB15-based CNN model.

Figure 10. Performance of the CICIDS2017-based CNN model.

Table 1. Summary of related works. “B” denotes Binary, while “M” denotes Multiclass.

Refs.	Year	Datasets	Format	Output	Description
[9]	2022	NSL-KDD, UNSW-NB15, and CICIDS2017	CSV	B	Combines DMNs and DAEs in a hybrid deep model for intrusion detection while using CNNs for feature extraction.
[10]	2023	NSL-KDD, and TON-IoT	CSV	B	Develops an IoT intrusion detection model using CNNs for feature extraction, BME-CSA for feature selection, and RF for classification.
[11]	2023	UNSW-NB15	CSV	B	Proposes the NE-GConv framework, which uses RFE for feature selection and GCN for classification.
[12]	2023	CIDDS-001	CSV	B	Introduces a deep learning-based intrusion detection model using PCA for feature reduction and CNNs for classification.
[13]	2023	ISCX-IDS12, DDoS (Kaggle), CICIDS2017, and CICIDS2018	CSV	B	Develops a network intrusion detection model based on DCNNs.
[14]	2019	NSL-KDD, UNSW-NB15, Kyoto, WSN-DS, and CICIDS2017	CSV	B+M	Presents a performance analysis of various machine learning algorithms on different datasets before selecting a single DNN architecture composed of five hidden layers.
[15]	2020	NSL-KDD	CSV	B+M	Develops a network intrusion detection model using a multi-CNN fusion method.
[16]	2022	NSL-KDD and UNSW-NB15	CSV	M	Applies a CNN with an attention mechanism that produces an attention map on the flow characteristics of specific attack categories.
[17]	2022	NSL-KDD and UNSW-NB15	CSV	B+M	Combines PCA and CNNs with four updated versions of conventional RNNs, namely, Bi-LSTM, LSTM, Bi-GRU, and GRU.
[18]	2023	UNSW-NB15, CICIDS2017, IoT-23, and Bot-IoT	CSV	M	Proposes a hybrid network intrusion detection model by combining CNNs and Bi-LSTM.
[19]	2023	NSL-KDD and CICIDS2017	CSV	M	Combines CNNs with Bi-LSTMs and transformers to implement a deep learning-based model for IoT intrusion detection.
[20]	2023	NSL-KDD, UNSW-NB15 and CICIDS2017	CSV	M	Proposes a CNN-based intrusion detection model in which features are extracted using VGM, PyCNN, and DSC methods.
[21]	2022	UNSW-NB15, CICIDS2017 and CSE-CICIDS2018	pcap	B	Proposes an LSTM-based intrusion detection model in which a hierarchical and dynamic feature extraction framework is defined to extract features from packet traffic.
[22]	2022	CSIC2010, ISCX2012, and CICIDS2017	pcap	B	Combines CNNs with LSTMs and multihead self-attention mechanisms to construct an efficiency payload-based anomaly detection framework.
[23]	2022	ISCX-bot-2014, ISCX-SlowDoS-2016, and CICIDS2017	pcap	M	Develops a hybrid intrusion detection system that combines a pcap-based CNN model with a CSV-based RF model using the Dempster–Shafer Theory (DST).
[24]	2022	ISCXIDS2012, CICIDS2017, and IoT23	pcap	M	Proposes a deep learning-based model for intrusion detection with a multilevel feature (data timing, byte, and statistical features) fusion method.
[25]	2021	CICIDS2017 and CSE-CICIDS2018	pcap	M	Proposes a CNN-based model for intrusion detection in which features are extracted from the packet bytes at two levels (abstract level and final representation level).

Table 2. Distribution of samples in the three datasets.

Dataset	Class	Training	Validation	Test	Total
KDD’99	Normal	21,768	2466	6154	30,388
	DoS	31,171	3472	8675	43,318
	Probe	11,675	1297	3246	16,218
	R2L	539	64	156	759
	U2R	79	10	25	114
	Total	65,232	7309	18,256	90,797
UNSW-NB15	Normal	17,870	800	4671	23,341
	Analysis	289	14	84	387
	Backdoors	327	54	78	459
	DoS	2935	520	701	4156
	Exploits	18,856	3213	4491	26,560
	Fuzzers	14,752	661	3993	19,406
	Generic	2725	517	632	3874
	Reconnaissance	8276	1733	1856	11,865
	Shellcode	1049	223	225	1497
	Worms	116	23	27	166
	Total	67,195	7758	16,758	91,711
CIC-IDS2017	Normal	25,552	2878	7186	35,616
	Bot	531	59	148	738
	DDoS	32,434	3604	9010	45,048
	DoS GoldenEye	613	70	186	869
	DoS Hulk	4314	480	1203	5997
	DoS Slowhttptest	1010	114	284	1408
	DoS slowloris	1609	179	447	2235
	FTP-Patator	2444	271	679	3394
	PortScan	42,154	4684	11,710	58,548
	SSH-Patator	1715	191	476	2382
	Brute Force	108	12	31	151
	Sql Injection	9	1	2	12
	XSS	15	2	6	23
	Total	112,508	12,545	31,368	156,421

Table 3. Hyperparameter ranges for our optimization problem.

Hyperparameter	Range
# of filters for the 1st conv. layer	$\{16, 32, 64\}$
# of filters for the 2nd conv. layer	$\{0, 16, 32, 64\}$
# of neurons in the hidden layer	$\{32, 64, 128, 256\}$
Dropout rate	$[0.0, 0.3]$
Learning rate	$[0.001, 0.009]$
Batch size	$\{32, 64, 128, 256, 512\}$
Batch normalization	$\{0 (without), 1 (with)\}$

Table 4. Hyperparameter settings for our CNN model.

Hyperparameter	Setting
Convolution kernel size	$3 \times 3$
Convolution stride size	1
Pooling kernel size	$2 \times 2$
Pooling stride size	2
Pooling type	Max
Activation function	ReLU
Weight initialization	Xavier
Optimization function	Adam
Training epochs	10

Table 5. Initial parameters of our optimization algorithm.

Parameter	Value
$\mu_{F}$	$0.7$
$\sigma_{F}$	$0.3$
$\mu_{CR}$	$0.8$
$\sigma_{CR}$	$0.1$
Population size (NP)	15
Number of generations (G)	100
Learning period (LP)	18
$\mu_{CR}$ update period ($\mu_{CR}$_p)	9

Table 6. Best hyperparameter values after optimization.

Hyperparameter	KDD’99	UNSW-NB15	CIC-IDS2017
# of filters for the 1st conv. layer	32	16	32
# of filters for the 2nd conv. layer	32	16	0
# of neurons in the hidden layer	0	128	0
Dropout rate	0.2	0.1	0.1
Learning rate	0.001	0.001	0.001
Batch size	64	32	256
Batch normalization	0	0	0

Table 7. Computational complexity of the proposed model.

Dataset	Conv1	Conv2	Dense	Total
KDD’99	294,912	9,437,184	163,840	9,895,936
UNSW-NB15	147,456	2,621,184	4,195,584	6,964,224
CIC-IDS2017	294,912	-	425,984	720,896

Table 8. Comparison results of the proposed model with other related works using the KDD’99, UNSW-NB15, and CICIDS2017 datasets. The results of the related works were collected from the reference papers. “-” denotes that no value is reported in the reference paper. The best results are in bold. Abbreviation: C + R = CNN + RNN.

Dataset	Ref.	Year	Model	Balan.?	Accuracy	Precision (Macro)	Precision (Weight.)	Recall (Macro)	Recall (Weight.)	F-Score (Macro)	F-Score (Weight.)
KDD’99	[14]	2019	DNN	No	0.7850	0.8100	-	0.7850	-	0.7650	-
	[15]	2020	CNN	No	0.8133	0.6947	-	0.6384	-	0.6418	-
	[17]	2022	RNN	No	0.8291	0.7127	0.8678	0.5789	0.8291	0.5917	0.8041
	[16]	2022	CNN	No	0.8150	-	-	-	-	0.6130	0.7900
	[20]	2023	CNN	No	0.8163	0.8355	-	0.8163	-	0.8063	-
	[40]	2019	DNN	No	0.8070	-	0.8180	-	0.8070	-	0.7930
	[41]	2021	CNN	Yes	0.8334	-	0.8535	-	0.8344	-	0.8260
	[37]	2022	DNN	Yes	0.9297	0.8556	-	0.6340	-	0.6702	-
	[42]	2021	CNN	Yes	0.9329	-	-	-	-	-	0.9566
	[43]	2022	DNN	Yes	0.9200	0.7480	-	0.7560	-	0.7480	-
	[44]	2023	CNN	Yes	0.9011	-	0.9073	-	0.9011	-	0.8990
	[19]	2023	RNN	Yes	0.9099	0.9139	-	0.9094	-	0.9089	-
	Ours		CNN	No	0.9810	0.9711	0.9809	0.8577	0.9810	0.9032	0.9808
UNSW-NB15	[14]	2019	DNN	No	0.6600	0.6230	-	0.6600	-	0.5960	-
	[17]	2022	RNN	No	0.7286	0.6322	0.8198	0.5413	0.7137	0.5254	0.7372
	[16]	2022	CNN	No	0.7640	-	-	-	-	0.4240	0.7670
	[20]	2023	CNN	No	0.8047	0.8074	-	0.8047	-	0.7889	-
	[40]	2019	DNN	No	0.7780	-	0.7720	-	0.7780	-	0.7730
	[45]	2020	DNN	No	0.8092	-	-	-	-	-	-
	[37]	2022	DNN	Yes	-	0.7508	-	0.8479	-	0.7964	-
	[42]	2021	CNN	Yes	0.8973	-	-	-	-	-	0.9197
	[44]	2023	CNN	Yes	0.8886	-	0.9046	-	0.8896	-	0.8771
	[18]	2023	C+R	Yes	0.7632	0.4500	0.8100	0.4500	0.7600	0.4100	0.7700
	Ours		CNN	No	0.9340	0.9463	0.9321	0.8275	0.9340	0.8696	0.9281
CIC-IDS2017	[14]	2019	DNN	No	0.9620	0.9720	-	0.9620	-	0.9650	-
	[20]	2023	CNN	No	0.9760	0.9073	-	0.9781	-	0.9413	-
	[41]	2021	CNN	Yes	0.9929	-	0.9928	-	0.9929	-	0.9927
	[42]	2021	CNN	Yes	0.9849	-	0.9520	-	0.9740	-	0.9628
	[43]	2022	DNN	Yes	0.9200	0.6743	-	0.8171	-	-	0.7000
	[25]	2021	CNN	No	-	0.7460	0.9820	0.7480	0.9830	0.7467	0.9830
	[18]	2023	C+R	Yes	0.9800	0.8800	0.9900	0.8400	0.9800	0.8100	0.9800
	[19]	2023	RNN	Yes	0.9915	0.9915	-	0.9914	-	0.9914	-
	Ours		CNN	No	0.9981	0.9920	0.9982	0.9546	0.9981	0.9667	0.9981

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

