*4.2. Data Preprocessing*

Since we were interested in evaluating CTI for threat classification, we deleted the normal traffic from the datasets. Additionally, in the BoT-IoT dataset, we omitted the pkSeqID feature since it represented an identifier for the traffic records. The datasets contains some categorical features that could not be processed by the neural network. Thus, we converted the nominal values into numeric using sklearn LabelEncoder. LabelEncoder converts categorical values into numerical values [22]. We implemented sklearn StandardScaler to scale the data. For training and evaluation, several papers have split the dataset into training and testing, with a ratio of 20% for testing s in [19] and 30% for testing in [21]. However, due to

the size of the BoT-IoT dataset and the resource constraints of our device, we divided the data into training and testing sets, with a ratio of 35% for testing, while having the same ratio of classes in both parts by using the stratify parameter.
