#### *3.1. Data Preparation*

As mentioned in Section 1, one of the objectives of this work was to evaluate the proposed IDS on a recent dataset rather than on the commonly used older dataset, KDD CUP 99. The dataset used in this work was the Intrusion Detection Evaluation Dataset [45], known as CICIDS2017. Only the data covering the five days from 3 July 2017 to 7 July 2017 were used in this study. These data contained benign traffic and a limited number of attacks of several types, such as brute-force FTP, brute-force SSH, SQL injection, and cross-site scripting. The data consisted of 170 K records and 79 features.

The first stage of preparation was to clean the data. First, the data types were unified to integers, since some numerical values had been entered as "infinity". Second, missing values were replaced with zeros. Third, two columns, Fwd\_Packets\_s and Bwd\_Packets\_s, were eliminated because they were so heavily corrupted that they caused the algorithm to fail. Fourth, the data were scaled, as some columns had single-digit values while others had values of up to six digits. Scaling was performed using the sklearn class StandardScaler, which standardizes each feature according to the formula:

> $$z = \frac{x - \text{mean}}{\text{standard deviation}}$$
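The cleaning and scaling steps described above can be illustrated with a minimal, standard-library sketch. It is not the authors' code. It assumes that "infinity" entries are treated like missing values and set to zero, that the label column has already been separated from the features, and that the standard deviation is the population standard deviation (which is what StandardScaler uses). The helper names `clean_value`, `standardize`, and `prepare`, and the toy feature names `Flow_Duration` and `Flow_Bytes_s`, are hypothetical and chosen only for illustration.

```python
import math
import statistics


def clean_value(raw):
    """Convert one raw cell to a float; missing, non-numeric, or infinite cells become 0.0."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    # Assumption: "infinity" (and NaN) entries are handled like missing values.
    return value if math.isfinite(value) else 0.0


def standardize(column):
    """Return z = (x - mean) / std for one column, using the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        # Constant column: every value equals the mean, so map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def prepare(rows, dropped=("Fwd_Packets_s", "Bwd_Packets_s")):
    """Drop the corrupted columns, clean every cell, then scale each column."""
    names = [name for name in rows[0] if name not in dropped]
    columns = {name: [clean_value(row.get(name)) for row in rows] for name in names}
    return {name: standardize(values) for name, values in columns.items()}


# Toy rows with hypothetical feature names, for illustration only.
rows = [
    {"Flow_Duration": "10", "Flow_Bytes_s": "Infinity", "Fwd_Packets_s": "x"},
    {"Flow_Duration": "20", "Flow_Bytes_s": "4", "Fwd_Packets_s": "x"},
    {"Flow_Duration": "30", "Flow_Bytes_s": None, "Fwd_Packets_s": "x"},
]
scaled = prepare(rows)
```

After scaling, each column has zero mean and unit variance, so features with six-digit magnitudes no longer dominate single-digit ones.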

The remaining features are listed in Table 2.

**Table 2.** Features of the CICIDS2017 dataset [45].




The second dataset used in this work was KDD CUP 99, which was used to benchmark the performance of the proposed system. The dataset consisted of only 42 features, which are listed in Table 3.

**Table 3.** Features of the KDD CUP 99 dataset [46].


#### *3.2. Genetic Algorithm (GA)*

As mentioned earlier, GAs constitute a family of mathematical models that operate on the principles of natural selection and evolution. GAs have multiple parameters and operators, each of which can be implemented in several ways. GAs are well suited to optimization problems and feature selection. The following is a description of the GA operators: initial population creation, crossover, mutation, and the proposed fitness function, which was used in the feature selection stage. These operators are summarized in Algorithm 1.
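As a generic illustration of how these operators fit together for feature selection, the following is a minimal sketch and not a reproduction of Algorithm 1. It assumes bit-mask chromosomes (bit *i* set to 1 keeps feature *i*), single-point crossover, bit-flip mutation, and truncation selection with elitism; the source does not state which variants were used. The function `placeholder_fitness` is a hypothetical stand-in and is not the proposed fitness function.

```python
import random


def initial_population(size, n_features, rng):
    """Create random bit-mask chromosomes; bit i == 1 keeps feature i."""
    return [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(size)]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover producing two children."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def placeholder_fitness(chromosome, useful):
    """Hypothetical fitness: reward 'useful' features, lightly penalize subset size."""
    hits = sum(1 for i, bit in enumerate(chromosome) if bit and i in useful)
    return hits - 0.1 * sum(chromosome)


def run_ga(n_features, fitness, generations=30, size=20, rate=0.05, seed=0):
    """Evolve feature masks and return the fittest one found."""
    rng = random.Random(seed)
    population = initial_population(size, n_features, rng)
    for _ in range(generations):
        # Selection: keep the fitter half unchanged as parents (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            for child in crossover(a, b, rng):
                children.append(mutate(child, rate, rng))
        population = parents + children[: size - len(parents)]
    return max(population, key=fitness)


# Toy problem: pretend features 0, 3, and 7 are the informative ones.
useful = {0, 3, 7}
best = run_ga(10, lambda c: placeholder_fitness(c, useful))
```

In a real IDS pipeline the fitness would instead be derived from a classifier's performance on the selected feature subset; the toy fitness here only makes the sketch self-contained.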

#### **Algorithm 1**: GA process

