The experiment of this paper was conducted twice, each time using a different set of datasets. The first set consisted of the CIC-IDS2017 dataset [15] and the CSE-CIC-IDS2018 dataset. As described in Section 2.1, both datasets were created by the Canadian Institute for Cybersecurity (CIC). Having been created by the same organisation, the two datasets share a common set of features, and the CSE-CIC-IDS2018 dataset was created one year after the CIC-IDS2017 dataset. Hence, these datasets meet the needs of this paper. We used the CIC-IDS2017 dataset as the training dataset and the CSE-CIC-IDS2018 dataset as the testing dataset.
4.1. Data Pre-Processing
The first step of the experiment was to pre-process the datasets. First, we cleaned each dataset by eliminating unwanted entries: entries containing missing or infinite values were dropped, as they constituted only a relatively small portion of the dataset. We also removed duplicates to expose the models to as many unique samples as possible.
In network intrusion detection, the dataset should be symmetrical, meaning the model must be trained and tested on comparable numbers of normal and abnormal traffic instances. Maintaining this symmetry is crucial because it ensures the model is not biased towards one type of traffic: if the training data contained far more benign than attack traffic, the model might classify all traffic as benign, even when it is abnormal. Therefore, the next pre-processing step addressed the severe class imbalance in the data, where some classes had significantly more samples than others; such imbalance biases the model towards the majority class, which in turn makes the accuracy of the model meaningless. To restore symmetry, we downsampled the majority class by randomly selecting a subset of its samples. Additionally, multiple minority classes were combined to form a larger class.
We also ensured that both datasets had the same set of columns with the same sequence, as we proposed using different datasets for training and testing. Both datasets were also relabelled, if necessary, to ensure that both datasets had the same classes.
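The cleaning and balancing steps above can be sketched as follows. This is a minimal illustration using pandas; the column name "Label" matches the CIC datasets, but the simple strategy of downsampling every class to the size of the smallest one is an assumption, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    # Drop rows containing missing or infinite values.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Remove duplicate rows so the models see only unique samples.
    df = df.drop_duplicates()
    # Randomly downsample every class to the size of the smallest one
    # so normal and abnormal traffic are represented equally.
    n_min = df[label_col].value_counts().min()
    parts = [g.sample(n=n_min, random_state=0) for _, g in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)
```

Relabelling and column reordering, needed so that both datasets share the same schema, can then be applied with ordinary pandas renaming and column selection.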
4.2. Feature Selection
Feature selection is an important step to improve the performance of an IDS, in terms of both time efficiency and accuracy [40]. Since modern datasets such as the CIC-IDS2017 dataset contain around 80 features, training a model without any feature selection is time-consuming. Moreover, some features may introduce noise and reduce the accuracy of the model. As pointed out by Aksu et al. [41], the accuracy of the models starts to drop when more than 40 features of the CIC-IDS2017 dataset are used to train the model.
We performed feature selection with the random forest algorithm, using the RandomForestClassifier provided by scikit-learn. A random forest was trained on the training data, and the top n features with the highest importance scores were selected. The random forest algorithm has been widely adopted for feature selection; for example, Sharafaldin, Lashkari, and Ghorbani [15] and Kostas [42] also used it for feature selection on the CIC-IDS2017 dataset.
After reducing the number of features with the random forest, we further reduced it by brute force. Preliminary ML models were built with different numbers of features: a for loop added one new feature in each iteration until all n features were included, and the accuracy of each model was recorded against the number of features. Based on these accuracy rates, a more concise set of features was selected.
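The two-stage selection, ranking by random forest importance and then growing the feature set one feature at a time, can be sketched as below. The synthetic data, the choice of n = 10, and the forest sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: rank features with a random forest and keep the top n.
n = 10
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:n]

# Stage 2: add one ranked feature per iteration, recording the accuracy
# of a preliminary model at each step.
accuracies = []
for k in range(1, n + 1):
    cols = top[:k]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    accuracies.append(clf.score(X_te[:, cols], y_te))

# The accuracy-vs-feature-count curve guides the final, concise choice.
best_k = int(np.argmax(accuracies)) + 1
```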
Moreover, some features were removed by human inspection. Certain features should be removed despite having a high correlation with the output variable; the source IP address is one example. When a dataset includes a large amount of malicious traffic from a single IP address, the "source IP address" may be ranked as an important feature by the random forest. However, the feature should be removed to prevent overfitting, as classifying traffic by IP address may not remain relevant in the future.
4.3. ML Models Training
We trained the models using the training dataset after selecting the features. During training, the hyperparameters of the models were optimised using grid search via the GridSearchCV function provided by scikit-learn. The grid search method works by searching through a predefined hyperparameter space: different combinations of hyperparameters are used to train the model, and the combination that gives the best accuracy is chosen to train the final model. The hyperparameters of a model are the parameters that govern the training process. Taking the DNN as an example, the number of hidden layers is a hyperparameter, while the weights and biases of the neural network are the parameters of the model. Since optimising the hyperparameters requires training the models multiple times, only a fraction of the training dataset was used to speed up the entire process.
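A minimal grid search with GridSearchCV looks as follows; the estimator, the parameter values, and the synthetic data are illustrative, while the per-model grids actually used are described in the paragraphs below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Every combination in param_grid is evaluated with 3-fold
# cross-validation; the best-scoring combination is retained.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_  # e.g. the depth/criterion pair with top accuracy
```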
In this work, we implemented decision tree (DT), random forest (RF), support vector machine (SVM), naïve bayes (NB), artificial neural network (ANN) and deep neural network (DNN). The implementation of each model and the hyperparameters that were optimised for each model are described as follows:
The decision tree of this work was implemented using the DecisionTreeClassifier provided by the scikit-learn library. To optimise the tree, pruning was performed to prevent overfitting. The method used to prune the decision tree is called cost complexity pruning, which is controlled by the ccp_alpha parameter. The higher the alpha value, the more nodes will be pruned, and the total impurity of the leaves will increase. The optimisation was performed by brute-forcing different values of alpha and choosing the alpha value that gives the highest accuracy.
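The alpha search described above can be sketched with scikit-learn's cost-complexity pruning path, which yields the candidate alpha values directly; the synthetic data and held-out evaluation are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of an unpruned tree.
path = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr) \
    .cost_complexity_pruning_path(X_tr, y_tr)

# Brute-force each alpha and keep the one with the highest accuracy.
scores = {}
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_tr, y_tr)
    scores[alpha] = tree.score(X_te, y_te)
best_alpha = max(scores, key=scores.get)
```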
The random forest of this work was implemented using the RandomForestClassifier, which is also provided by the scikit-learn library. Just as with the decision tree, pruning is an important method to optimise the model; however, for the random forest it was carried out through the grid search. Pruning was performed by specifying the maximum depth of each tree, the minimum number of samples required on a leaf node, and the minimum number of samples required on a node for a split to be considered. Additionally, two criteria for measuring the quality of a split were tested: Gini and entropy. Using a different criterion to evaluate the quality of a split leads to different features being chosen, and hence a different tree. Moreover, the number of trees in the random forest was also optimised.
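A grid covering the hyperparameters named above might look as follows; the specific value ranges are assumptions for illustration, not the paper's exact grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Pruning-related settings plus forest size and split criterion.
param_grid = {
    "n_estimators": [20, 50],        # number of trees
    "max_depth": [5, None],          # maximum depth of each tree
    "min_samples_leaf": [1, 5],      # minimum samples on a leaf node
    "min_samples_split": [2, 10],    # minimum samples to consider a split
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3).fit(X, y)
```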
The SVM was implemented using SVC (support vector classifier) provided by the scikit-learn library. For SVM, several important hyperparameters need to be optimised. The most important hyperparameter of SVM is the kernel function that is used; the choice of kernel affects both the accuracy and the time complexity of the model. The kernels used in this work were the linear, Gaussian radial basis function (RBF) and sigmoid kernels:

K(x_i, x_j) = x_i · x_j (linear)

K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) (RBF)

K(x_i, x_j) = tanh(γ x_i · x_j + r) (sigmoid)

The kernel function takes two points, x_i and x_j, as input and returns their similarity in a higher-dimensional space as output. Besides the kernel function, the gamma (γ) value used by the RBF and sigmoid kernels was optimised. Moreover, C is also an important hyperparameter of SVM. The hyperparameter C helps the SVM prevent overfitting by controlling how strongly classification errors are penalised: the smaller the value of C, the more errors are allowed, and hence the less sensitive the model is to outliers.
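The kernel, gamma, and C search can be sketched as below; the value ranges and the feature standardisation step are illustrative assumptions (SVMs are sensitive to feature scale, so scaling is common practice).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X = StandardScaler().fit_transform(X)  # put features on a common scale

param_grid = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],               # larger C penalises errors more heavily
    "gamma": ["scale", 0.01, 0.1],   # used by the rbf and sigmoid kernels
}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
```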
The naïve Bayes model was implemented using GaussianNB, which is available in the scikit-learn library. As naïve Bayes is a very simple model, it has only one hyperparameter to be optimised: the var_smoothing parameter. Since Gaussian naïve Bayes assumes that each feature follows a Gaussian (normal) distribution, increasing the var_smoothing value allows the model to account for samples that are further away from the distribution mean; in other words, the hyperparameter smoothes the distribution curve.
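The single-parameter search can be done with the same grid search machinery; the log-spaced range below is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)

# var_smoothing adds a fraction of the largest feature variance to all
# variances, widening each class-conditional Gaussian.
search = GridSearchCV(
    GaussianNB(),
    {"var_smoothing": np.logspace(-9, -1, 9)},
    cv=3,
).fit(X, y)
best_vs = search.best_params_["var_smoothing"]
```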
The ANN was implemented using the MLPClassifier provided by the scikit-learn library. The most important hyperparameters are the number of hidden layers and the number of neurons in each layer. The ANN in this work was implemented with only one hidden layer and 10 to 50 neurons in that layer. The activation function was also optimised. The activation functions used in this work were logistic, tanh and ReLU:

logistic(x) = 1 / (1 + e^(−x))

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

ReLU(x) = max(0, x)
The third hyperparameter to be optimised was the optimisation function, which optimises the weights and biases of the neural network. For the MLPClassifier, the optimisation function is controlled by the solver parameter. Two optimisation functions were tested in this work: stochastic gradient descent (SGD) and Adam. The last hyperparameter to be optimised for the ANN was the alpha value, a penalty term that can be used to prevent overfitting or underfitting: a larger alpha value encourages smaller weights and thus prevents overfitting, while a smaller alpha value permits larger weights and counteracts underfitting.
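The four ANN hyperparameters described above map directly onto MLPClassifier arguments; the specific grid values and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(10,), (50,)],        # one hidden layer, 10-50 neurons
    "activation": ["logistic", "tanh", "relu"],  # activation function
    "solver": ["sgd", "adam"],                   # weight-optimisation function
    "alpha": [1e-4, 1e-2],                       # L2 penalty term
}
search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0), param_grid, cv=3
).fit(X, y)
```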
The implementation of the DNN is very similar to that of the ANN; the only difference is the number of hidden layers of the neural network. In this work, the DNN had three to four hidden layers with an equal number of neurons in each layer. The hyperparameters that were optimised were the same as for the ANN.
4.5. Model Evaluation
After verifying the accuracy of the models, the last and most crucial step of this paper was performed: training the final models and testing them with the testing dataset. First, we used 70% of the training dataset to train the final models. Next, we used the remaining 30% of the training dataset to measure the accuracy of the models on the training dataset. Finally, we tested the models using the testing dataset.
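This final train/evaluate/test flow can be sketched as below. The synthetic arrays stand in for the real data: in the paper, the training dataset is CIC-IDS2017 and the testing dataset is CSE-CIC-IDS2018; the random forest is just one of the six models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-ins for the training dataset (CIC-IDS2017) and
# the testing dataset (CSE-CIC-IDS2018).
X_train_full, y_train_full = make_classification(n_samples=600, random_state=0)
X_test, y_test = make_classification(n_samples=200, random_state=1)

# 70% of the training dataset trains the final model; the remaining 30%
# estimates its accuracy on data from the same distribution.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train_full, y_train_full, train_size=0.7, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
val_acc = model.score(X_val, y_val)    # accuracy on the training dataset
test_acc = model.score(X_test, y_test) # accuracy on the testing dataset
```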
The interest of this paper was to compare the accuracy of each model when a different dataset was used for testing. At the same time, a comparison between the models was conducted in terms of accuracy and efficiency. The performance metrics used to measure the performance of the models were accuracy, precision, recall, and F1-score, as shown in Equations (9)–(12):

Accuracy = (TP + TN) / (TP + TN + FP + FN) (9)

Precision = TP / (TP + FP) (10)

Recall = TP / (TP + FN) (11)

F1-score = (2 × Precision × Recall) / (Precision + Recall) (12)

In the above equations, true positive (TP) and true negative (TN) denote the number of samples correctly classified as positive and negative, respectively, while false positive (FP) and false negative (FN) denote the number of samples incorrectly classified as positive and negative, respectively. We also visualised the classification results using a confusion matrix.
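These metrics and the confusion-matrix counts are available directly in scikit-learn; the tiny label vectors below are illustrative.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / (TP + TN + FP + FN)
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```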
Moreover, we also measured the time complexity of each model, focusing on the time each model took to train and to make predictions. We did not compare time efficiency with other literature, as the time consumption depends on the implementation of the model, the number of samples, and the hardware used to execute the experiment.
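Wall-clock training and prediction times can be measured with a simple timer; the model and data below are illustrative.

```python
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Time consumption for training.
t0 = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - t0

# Time consumption for prediction.
t0 = time.perf_counter()
model.predict(X)
predict_time = time.perf_counter() - t0
```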