#### *2.2. Feature Selection*

The continuous increase in the volume of data makes real-time intrusion detection a difficult task. Feature selection (FS) is a technique used to reduce computational complexity by selecting the relevant features and eliminating redundant and needless features from the data, producing an effective learning model [25,26].

When the dimensionality of the feature space is high, selecting meaningful features becomes a difficult challenge. Bioinspired metaheuristic algorithms are efficient at addressing such challenges: they provide high-quality solutions in reasonable time and with affordable resources. Moreover, they decrease the computational complexity of classification and improve its accuracy [27].

Intelligent IDSs work by collecting and analyzing large amounts of data from several areas. These data contain numerous redundant and irrelevant features, which increases overfitting and processing time and reduces the detection rate. Therefore, feature selection should be applied as a preprocessing step to improve system performance and accuracy while reducing the dataset size [28].
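As a minimal illustration of this kind of filter-style preprocessing, the sketch below ranks features by their absolute Pearson correlation with the class label and keeps the top *k*. The toy data is hypothetical; a real IDS pipeline would apply the same idea to a dataset such as NSL-KDD.

```python
# Filter-style feature selection sketch: score each feature by its absolute
# Pearson correlation with the class label, then keep the k highest-scoring.
# The dataset below is a made-up toy example, not real IDS traffic.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_top_k(X, y, k):
    """Return the indices of the k features most correlated with the label."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 tracks the label, features 1 and 2 are noise.
X = [[1.0, 0.3, 5.0], [0.9, 0.8, 5.1], [0.1, 0.2, 5.0], [0.0, 0.9, 5.1]]
y = [1, 1, 0, 0]
print(select_top_k(X, y, 2))  # feature 0 should rank first
```

Real systems replace the correlation score with measures such as information gain or chi squared, but the keep-the-best-scoring-features structure is the same.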

Different approaches have been developed by researchers to reduce the number of features while implementing IDS. In reference [29], a genetic algorithm was applied to select the optimal set of features from the KDD CUP 99 and UNSW-NB15 datasets. The authors applied a new fitness function composed of the TPR, the FPR, and the number of selected features, with a specific weight assigned to each factor. The selected features were then passed to the classification stage, in which an SVM was applied. The number of optimized features ranged between 7 and 10 for the different attack types. The maximum accuracy for classifying normal traffic was 99.05%, while the classification accuracy for attacks ranged from 98.25% for R2L to 100% for U2R.
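A weighted fitness function of the kind described in [29] can be sketched as follows. The weight values here are illustrative placeholders, not the ones used by the authors.

```python
# Sketch of a GA fitness function combining TPR, FPR, and subset size,
# in the spirit of [29]. The weights below are hypothetical examples.

W_TPR, W_FPR, W_SIZE = 0.6, 0.3, 0.1  # illustrative weights, not from [29]

def fitness(tpr, fpr, n_selected, n_total):
    """Higher is better: reward detection, penalize false alarms and size."""
    return W_TPR * tpr - W_FPR * fpr - W_SIZE * (n_selected / n_total)

# A subset with fewer features and the same detection quality scores higher.
print(fitness(0.99, 0.02, 8, 41))
print(fitness(0.99, 0.02, 41, 41))
```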

Parimala and Kayalvizhi [30] developed a two-stage feature selection as a prestage for an IDS that provided secure communication in a wireless environment. The first stage applied a conditional random field (CRF) for the initial selection of features. In the second stage, spider monkey optimization (SMO) was applied to the selected features to identify the most useful features from the dataset. This model was applied to the NSL-KDD dataset, which has 41 features, and extracted 16 features. The authors did not report exact accuracy figures, but according to their experiments, the results outperformed four existing works.

Using the same dataset, NSL-KDD, Reference [31] examined the performance of four feature selection methods: information gain, gain ratio, chi squared, and relief. Their performance was analyzed by applying them with four machine learning algorithms: J48, random forest, naïve Bayes, and k-nearest neighbor (KNN). The results showed that feature selection greatly improved the performance of the IDS but had a slightly negative effect on accuracy.

In addition to information gain and chi squared, Reference [32] applied correlation-based feature selection. These selection methods were applied to the NSL-KDD dataset. The decision tree classifier was the ML algorithm adopted in their IDS model. Initially, the dataset consisted of 41 features. During the preprocessing stage, the non-numeric data were converted to numeric data, producing a total of 122 features. Their proposed methods selected 20 optimal features out of these 122. The maximum accuracy obtained was 81%.

Reference [33] developed a cloud IDS (CIDS) that started with feature selection. They used an efficient correlation-based feature selection approach that extended correlation-based feature selection and mutual information feature selection. The feature selection process consisted of two stages. The first stage recognized important features, thereby reducing the dimensionality of the dataset; this was achieved by reducing the pairwise calculations of the correlation between features. To distinguish all the different classes, this method was combined with the LIBSVM classifier. In the second stage, features were selected such that they were highly relevant to a given class *c* but not relevant to the already selected features. This approach was applied to both the KDD CUP 99 and NSL-KDD datasets. The proposed method reduced the time required to obtain the optimum set of features and reduced the number of selected features to 10.

Reference [26] proposed a hybrid IDS that consisted of feature selection, density-based spatial clustering of applications with noise (DBSCAN), K-means++ clustering, and SMO classification. The feature selection adopted for the KDD CUP 99 dataset was a genetic algorithm, which selected 13 features for further processing. The average accuracy achieved by this model was 96.92%.

The feature selection model implemented by [28] consisted of two parts. The first part was the attribute evaluator, which evaluated each attribute separately. The second part was the search method, in which different combinations were tried to obtain the selected list of features. This model was applied to the NSL-KDD dataset. Feature selection was performed for three class groupings: all attack types (23 types), main attack types (5 types), and two attack types (normal and abnormal). The approach was then examined using three classification models: random forest, J48, and naïve Bayes. The results obtained for the two-attack-type case ranged from 98.69% to 99.41%.

#### *2.3. Hybrid Intrusion Detection System Based on Machine Learning*

Researchers have implemented IDSs using several ML techniques, among others. To keep the discussion focused, this section is limited to two ML techniques: the genetic algorithm (GA) and the support vector machine (SVM). Before presenting related work conducted using GA and SVM, it is worth shedding some light on these techniques.

#### 2.3.1. Genetic Algorithm Overview

Genetic algorithms (GAs) constitute a family of mathematical models that operate on the principles of selection and natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms work by encoding selected problems as chromosome-like data structures and evolving the chromosomes using selection, recombination, and mutation.

The genetic algorithm process usually begins with a random set of chromosomes, which are representations of the problem to be solved. Depending on the attributes of the problem, the positions of each chromosome are encoded as numbers, letters, or bits. These positions are referred to as genes and are randomly changed within a range during evolution. The set of chromosomes at each stage of the evolution is called a population, and an evaluation (fitness) function is used to calculate the goodness of each chromosome. Two primary operators drive the evolution: crossover, which simulates the natural reproduction and recombination of the species, and selection, which biases the choice of chromosomes for survival and mating towards the fittest. Genetic algorithms can model virtually any type of constraint, either through penalty terms in the fitness function or by using chromosome coding schemes designed for a specific problem. Figure 3 shows the structure of a simple genetic algorithm.
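The loop described above can be sketched in a few lines of code. The "OneMax" objective used here (the count of 1-bits in the chromosome) is a stand-in for a real fitness such as IDS classification accuracy, and the population size, generation count, and mutation rate are arbitrary illustrative choices.

```python
import random

# Minimal generational GA sketch: bitstring chromosomes, tournament
# selection, single-point crossover, and bit-flip mutation.
random.seed(0)
LENGTH, POP, GENS, MUT = 20, 30, 40, 0.05  # illustrative settings

def fitness(chrom):
    return sum(chrom)  # toy "OneMax" objective: maximize the number of 1s

def tournament(pop):
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    cut = random.randrange(1, LENGTH)  # single-point crossover
    return p1[:cut] + p2[cut:]

def mutate(chrom):
    return [1 - g if random.random() < MUT else g for g in chrom]

# Random initial population, then GENS rounds of select/recombine/mutate.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    population = [mutate(crossover(tournament(population), tournament(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best))  # should approach LENGTH
```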

**Figure 3.** Structure of a genetic algorithm.

Since GAs excel at optimization, they can be used to generate the best rules for classifying a network connection's behavior as intrusive or normal. In addition, GAs can be used to select the optimal set of features from a dataset for use in intrusion detection. More generally, they can be used in computer security applications to find optimal solutions to specific problems.

#### 2.3.2. SVM-Based IDS and Related Work

The support vector machine (SVM) is a machine learning technique regarded as one of the best learning algorithms for classification and is among the most popular supervised ML algorithms [4,34]. The SVM performs pattern classification and regression using a variety of kernel functions and has been applied in several pattern recognition applications. SVMs have been used mainly for binary classification. The idea is to find a line or hyperplane that separates the two classes such that it is as far as possible from the closest instances of each class. Instances of the two classes are separated by an area called the margin, and the instances closest to the hyperplane are called support vectors. The margin should be as large as possible, since this enhances the efficiency of classifying newly added instances. The boundary function of the largest margin can be computed as follows [35]:

$$\text{Minimize} \; W(\alpha) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j \alpha_i \alpha_j k\left(y_i, y_j\right) - \sum_{i=1}^{N} \alpha_i, \quad \text{subject to} \; 0 \leq \alpha_i \leq C$$

where *α* is a vector of *N* variables, *C* is the soft margin parameter with *C* > 0, and *k*(*yi*, *yj*) represents the kernel function of the SVM. Kernel functions are used in SVMs to classify instances of the dataset into different categories. The four main types of kernel functions are listed here [35]:

- Linear: $k(y_i, y_j) = y_i^T y_j$
- Polynomial: $k(y_i, y_j) = \left(\gamma \, y_i^T y_j + r\right)^d$, with $\gamma > 0$
- Radial basis function (RBF): $k(y_i, y_j) = \exp\left(-\gamma \left\| y_i - y_j \right\|^2\right)$, with $\gamma > 0$
- Sigmoid: $k(y_i, y_j) = \tanh\left(\gamma \, y_i^T y_j + r\right)$

where *γ*, *r*, and *d* are kernel parameters. The concepts of the margin, separation hyperplane, and support vector are illustrated in Figure 4.
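The four kernel types commonly used with SVMs (linear, polynomial, RBF, and sigmoid) can be written out directly, assuming the standard LIBSVM-style definitions; the parameter values below are illustrative defaults, not prescribed ones.

```python
import math

# The four standard SVM kernel functions for two input vectors y_i and y_j.
# gamma, r, and d are the kernel parameters; the defaults are illustrative.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(yi, yj):
    return dot(yi, yj)

def polynomial(yi, yj, gamma=1.0, r=1.0, d=3):
    return (gamma * dot(yi, yj) + r) ** d

def rbf(yi, yj, gamma=0.5):
    # Gaussian kernel: 1.0 for identical points, decaying with distance.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(yi, yj)))

def sigmoid(yi, yj, gamma=0.1, r=0.0):
    return math.tanh(gamma * dot(yi, yj) + r)

yi, yj = [1.0, 2.0], [2.0, 1.0]
print(linear(yi, yj))  # 4.0
print(rbf(yi, yi))     # 1.0 (identical points)
```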

**Figure 4.** Classical example of SVM linear classifier [4].

This algorithm has been applied in information security to detect intrusions. The main advantage of using a support vector machine for an IDS is its speed, which makes real-time intrusion detection feasible. SVMs have become a popular technique for detecting anomalous intrusions because of their good generalization ability and their capacity to overcome the curse of dimensionality. They are also able to find the global minimum of the actual risk through structural risk minimization, because they generalize well with kernel tricks even in high-dimensional spaces and under small-training-sample conditions.

The support vector machine can determine appropriate parameter settings because it does not rely on empirical risk minimization, as neural networks do, and it can learn a wide range of patterns and scale accordingly. Support vector machines can also dynamically update training patterns whenever a new pattern appears during classification. Several researchers have applied SVMs as part of intrusion detection, as discussed below.

#### 2.3.3. Related Work on Existing Hybrid IDS Using GA and SVM

A number of research works have adopted GA as a base for IDS, while others have adopted SVM. GA has been applied to IDS by several researchers, as follows. Neha Rai and Khushbu Rai [36] proposed an intrusion detection system using genetic algorithm techniques; to create a new population, they used rank selection, single-point crossover, and bitwise mutation. Reference [34] proposed an intrusion alarm based on a genetic algorithm and a support vector machine; in the GA part, an accuracy fitness function was applied, and the new population was created with tournament selection, two-point crossover, and simple mutation. Ojugo et al. [37] proposed a genetic algorithm rule-based IDS that applied a confidence fitness function and created new populations using tournament selection and two-point crossover. Bhattacharjee et al. [38] proposed an IDS using a vectorized fitness function in a genetic algorithm with uniform crossover. Prakash et al. [39] built an effective IDS using GA-based feature selection; they applied an accuracy fitness function and created new populations using tournament selection and uniform crossover. Reference [40] proposed a GA-based approach to NIDS in which a confidence fitness function was applied and new populations were created with two-point crossover. Pawar and Bichkar [41] proposed a NIDS using a GA with VLC; to create a new population, they used single-point crossover.

Moukhafi et al. [4] proposed a new method of IDS that used an SVM optimized by a GA to improve the efficiency of detecting unknown and known attacks. They used the particle swarm optimization algorithm to select influential features. The detection rate (DR) achieved was 96.38% when applied to 10% of the KDD CUP 99 dataset. The small size of the dataset facilitated the process of the GA–SVM classifier.

Another hybrid IDS was developed in [5]. It used kernel principal component analysis (KPCA), an SVM, and a GA. The kernel function of the SVM was an RBF kernel modified to include two additional parameters, the mean value and the mean square difference of the feature attributes, to obtain better feature values. The chromosome of the GA had three parts. The KDD CUP 99 dataset was used, and the DR produced was 95.26%.

Al-Yaseen et al. [6] developed a hybrid IDS that used K-means, an SVM, and an ELM. The focus of this work was on using K-means to build a high-quality training set that required little time for classification. The KDD CUP 99 dataset was used, and the DR produced was 95.17%.

Feng et al. [7] developed a hybrid IDS that used an SVM and an ant colony network. The hybrid system benefited from the high classification rate and runtime efficiency produced by the combination of these two algorithms. They used the KDD CUP 99 dataset and achieved a DR of 94.86%.

Tao et al. [34] proposed an improved intrusion detection algorithm based on the GA and SVM methods, in the form of an alarm intrusion detection algorithm. Feature selection and the weight and parameter optimization of the support vector machine were based on the genetic algorithm. The simulations and experimental results showed that the proposed GA-and-SVM-based intrusion detection technology increased the intrusion DR, the accuracy rate, and the TPR while reducing the FPR; the SVM training time was also reduced.

Agarwal and Mittal [42] proposed a hybrid approach for detecting anomalous network traffic using data mining techniques. In this hybrid model, they applied network entropy and SVM techniques. The hybrid method worked well and gave high accuracy in detecting attack traffic with few false alarms. However, the method was not dynamic and could not decide on its own whether an attack was taking place.

Serpen and Aghaei [43] proposed a host-based misuse IDS using PCA feature extraction and a KNN classification algorithm. The system used principal component analysis to extract eigentraces from operating system trace data and a k-nearest neighbor algorithm for classification. The design exhibited very high performance and could detect an intrusion or attack as well as its type, although the system worked on the Linux OS only.

Praveen et al. [44] proposed a hybrid IDS algorithm for private cloud services. They combined anomaly-based and misuse-based intrusion detection, implementing the algorithm with the .NET framework as a front end and an SQL server as a back end. They conducted an overall study of building a hybrid IDS that would help to detect all types of intrusion in the cloud environment. According to the authors, the major characteristics that a hybrid IDS must have are a dynamic nature, self-adaptiveness, and scalability with respect to the OS on the network and hosts.

#### **3. The Proposed Hybrid IDS**

The proposed hybrid IDS is a combination of two machine learning techniques, the GA and the SVM, in which the GA is used for feature selection, which is treated as an optimization problem. Since the dataset used consists of 79 features, some of which have little to no effect on identifying intrusions, a GA is applied to select the optimal set of features, maintaining the high performance of the IDS while reducing the overhead of the classification process. An SVM is then applied to perform the actual classification of network data into normal and abnormal behavior. However, before the algorithms are applied to the dataset, it must be preprocessed so that they work smoothly on clean and consistent data. The preprocessing is explained in the following subsection.
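At a high level, this GA-plus-SVM arrangement can be sketched as follows: a GA evolves a binary mask over the feature set, and each mask is scored by training a classifier on the masked features, with a small penalty on subset size. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM, and the toy data below is hypothetical rather than the 79-feature dataset used in this work.

```python
import random

# GA-based feature selection sketch. A binary mask selects feature columns;
# a nearest-centroid classifier (stand-in for the SVM) scores each mask.
random.seed(1)

def masked(row, mask):
    return [v for v, m in zip(row, mask) if m]

def centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on masked features."""
    if not any(mask):
        return 0.0
    cents = {}
    for label in set(y):
        rows = [masked(r, mask) for r, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(row):
        r = masked(row, mask)
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(r, cents[c])))
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)

def evolve(X, y, n_features, pop=20, gens=30):
    popu = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(pop)]
    fit = lambda m: centroid_accuracy(X, y, m) - 0.01 * sum(m)  # size penalty
    for _ in range(gens):
        popu.sort(key=fit, reverse=True)
        parents = popu[:pop // 2]        # keep the fitter half
        children = []
        for _ in range(pop - len(parents)):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_features)      # single-point crossover
            child = p1[:cut] + p2[cut:]
            children.append([1 - g if random.random() < 0.05 else g for g in child])
        popu = parents + children
    return max(popu, key=fit)

# Toy data: feature 0 separates the classes; features 1 and 2 are noise.
X = [[0.1, 5, 3], [0.2, 4, 9], [0.9, 5, 2], [1.0, 4, 8]]
y = [0, 0, 1, 1]
best_mask = evolve(X, y, 3)
print(best_mask)
```

In the actual system, the fitness evaluation would train and test an SVM on the masked features, and the penalty weight balances accuracy against the number of retained features.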
