1. Introduction
According to Cisco, half of all networked devices will be Internet of Things (IoT) devices by 2023, reaching 14.7 billion devices [1]. As the number of devices and connections continues to grow, cybercriminals are also looking for new ways of conducting sophisticated attacks. Gartner states that more than 25% of all attacks in 2025 will target IoT devices [2].
IoT devices are considered low-power devices; therefore, they cannot run sophisticated programs to detect attacks such as Denial of Service (DoS), Distributed Denial of Service (DDoS), Theft, or Reconnaissance attacks. Researchers have shown that pre-trained machine learning models can be deployed on IoT devices, avoiding the extensive resource usage required for model training [3,4]. These pre-trained models can detect attacks very quickly.
Pre-trained models are important for detecting attacks. However, a framework is necessary for training and developing models that are robust and highly accurate. Most researchers use anomaly detection and intrusion signature detection [5,6], either separately or in combination, to detect attacks; the combination of both strategies yields better results [7].
As we rely on pre-trained models to detect anomalies accurately, we need to consider several factors that can affect training, such as (i) the number of data samples used in training, (ii) the number of features selected, (iii) the quality of the data, and (iv) the distribution of the classes in the dataset.
First, the number of samples is important in training any machine or deep learning model. If we do not have enough data samples, our model can be under-fitted; conversely, a larger dataset allows us to obtain accurate and robust models. For this reason, the purpose of our framework is to train several machine learning models with a large dataset, the BoT-IoT dataset, which is composed of 132 million normal and malicious data samples. Processing that amount of information on a laptop takes a long time. We verified this fact by comparing Pandas and Pyspark when loading all the files of the BoT-IoT dataset: Pandas took around 30 min to load the files, while our Hadoop–Spark cluster took 10 s. The laptop used for this experiment has an Intel Core i5 with 8 GB RAM and two cores, while our Hadoop–Spark cluster has three nodes, each with two cores and 8 GB RAM. In addition, Spark is up to 100 times faster than MapReduce according to its founders [8]. Therefore, many researchers have created machine learning models for intrusion detection with the small version of the BoT-IoT dataset [3,9,10,11,12]; thus, those models cannot generalize well to unseen data. In addition, training machine learning models on big datasets can incur large resource expenses, since computers with GPUs and many cores are necessary.
Second, the number of features selected can impact the accuracy of a machine learning model, since too few features can cause under-fitting while irrelevant features introduce noise. In addition, using many features wastes processing resources. In related research papers, the authors used feature engineering to select or create features, obtaining robust models after training. We adopt an approach similar to that of reference [13], selecting enough features to avoid under-fitting, but not all of them, to avoid noise.
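The trade-off above can be illustrated with a simple filter method: rank features by absolute Pearson correlation with the label and keep only the top k. This is a generic sketch on synthetic data (the value of k and the data are illustrative), not the exact selection procedure of reference [13].

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Rank features by |Pearson correlation| with the label; keep the top k."""
    # Center columns, then correlate each feature with the label.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf            # zero-variance feature -> correlation 0
    corr = np.abs(Xc.T @ yc / denom)
    return np.argsort(corr)[::-1][:k]     # indices of the k strongest features

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200).astype(float)
X = rng.normal(size=(200, 5))
X[:, 2] += 3 * y                          # make feature 2 strongly label-correlated
selected = top_k_by_correlation(X, y, k=2)
print(selected)
```

Keeping k small discards noisy columns, while keeping it large enough retains the label-correlated signal, which is exactly the balance discussed above.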
Another important factor is the quality of the data. Since we did not create a dataset ourselves, we rely on the BoT-IoT dataset [9] to create our models. In our previous paper [13], we compared the BoT-IoT dataset with other publicly available IoT datasets and concluded that it contains traffic from realistic emulated IoT sensors deployed in Node-RED and connected to the public AWS IoT hub. For this reason, we focused our analysis on this dataset.
Finally, class imbalance can severely impact the detection accuracy of the minority classes. Therefore, balancing the data is crucial when designing an intrusion detection system using the BoT-IoT dataset, since its classes are extremely imbalanced. To illustrate, the Normal-to-DoS data sample ratio is approximately 1:10,000. In the present research, we use a Conditional Tabular Generative Adversarial Network (CTGAN) [14] to generate unseen data for the minority classes. As a result, the decision boundaries of the machine learning algorithms changed and the detection accuracy improved significantly.
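The degree of skew can be computed directly from a label column before deciding on a balancing strategy. The counts below are synthetic stand-ins chosen to mimic the 1:10,000 Normal-to-DoS ratio, not the real BoT-IoT figures:

```python
from collections import Counter

# Synthetic label column mimicking an extreme Normal-to-DoS skew
# (illustrative counts, not the real BoT-IoT figures).
labels = ["DoS"] * 10000 + ["Normal"] + ["Reconnaissance"] * 500

counts = Counter(labels)
minority = min(counts, key=counts.get)    # least frequent class
majority = max(counts, key=counts.get)    # most frequent class
print(f"{minority}-to-{majority} ratio = 1:{counts[majority] // counts[minority]}")
```

A classifier trained on such a column can score high overall accuracy while almost never predicting the minority class, which is why the per-class F1-score is used later in the paper.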
The present paper is an extension of reference [13], in which a framework was proposed to train an anomaly detection model (one-class SVM) and an intrusion detection system based on the Random Forest algorithm using the Hadoop–Spark framework. In that work, we proposed a new approach to train a one-class SVM in Hadoop–Spark and analyzed the impact of feature selection on the Random Forest classifier. However, we did not evaluate the accuracy of other algorithms such as Decision Trees, Logistic Regression, Gradient Boosted Trees, Support Vector Machines, and Naive Bayes. Furthermore, the intrusion detection system with Random Forest obtained poor results when detecting minority classes such as the Normal and Theft classes. The present paper addresses these two concerns. The contributions of this paper are the following:
- 1. Multi-class classification algorithms in Pyspark are limited to Random Forest, Decision Trees, Naive Bayes, and Logistic Regression. For this reason, we propose the One vs. Rest (OVR) strategy to evaluate the accuracy and performance of other algorithms available in Pyspark, such as Gradient Boosted Trees and linear SVM. We evaluate all the algorithms with the entire BoT-IoT dataset and identify the best algorithm in terms of accuracy and performance.
- 2. The BoT-IoT dataset is extremely imbalanced; therefore, we propose the usage of a new tabular data generator, CTGAN, to increase the number of data samples of the minority classes, obtaining outstanding results in terms of F1-score.
- 3. We compare the CTGAN oversampling method with traditional methods such as Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN), demonstrating its superior accuracy in generating data samples.
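The OVR strategy of contribution 1 can be sketched independently of Spark: train one binary scorer per class and predict the class whose scorer responds most strongly. The toy per-class centroid scorer below is an illustrative stand-in; in the paper the wrapped learners are Gradient Boosted Trees and linear SVM inside Pyspark's OneVsRest.

```python
import numpy as np

class OneVsRest:
    """One-vs-Rest: one binary score per class; predict the argmax score.
    The binary model here is a toy centroid scorer standing in for GBT/SVM."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One "binary model" per class: the centroid of that class's samples.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Score of class c = negative distance to its centroid (higher = closer),
        # so the argmax score is the argmin distance.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0.0, 4.0, 8.0)])
y = np.repeat(np.array(["DoS", "Normal", "Theft"]), 50)
acc = (OneVsRest().fit(X, y).predict(X) == y).mean()
print(round(acc, 3))
```

Because OVR trains one binary model per class, its training cost grows with the number of classes, which is consistent with the longer OVR training times reported later in the paper.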
The rest of the paper is organized as follows.
Section 2 summarizes previously reported work.
Section 3 explains the two proposed methodologies for multi-class classification available in Spark and data oversampling using CTGAN.
Section 4 presents the experiments and the obtained results.
Section 5 describes the conclusions and future work.
2. Related Work
This section is divided into three parts. The first part reviews research papers that use various multi-class classification algorithms to detect attacks using the short version of the BoT-IoT dataset. The second part covers research papers that use sampling methods to reduce the class imbalance problem. The third part presents papers that use big data frameworks to create intrusion detection systems using the BoT-IoT dataset or similar datasets.
The following research papers consider several supervised machine learning algorithms to detect attacks in IoT networks. First, we summarize related work that uses the short version of the BoT-IoT dataset to train intrusion detection systems with machine learning algorithms.
Kumar et al. [5] created a multi-class classification methodology to identify DoS, DDoS, Reconnaissance, and Theft attacks as well as Normal network traffic. The methodology combined feature selection and multi-class classification algorithms. The authors used a hybrid approach for feature selection in which Pearson's correlation coefficient, the Random Forest mean, and the gain-ratio approach each selected features, and the results were then joined with an AND operation. They applied correntropy to measure the accuracy of distinguishing normal and abnormal data samples. Finally, they trained and tested three classifiers: Random Forest, XGBoost, and K-Nearest Neighbors (KNN). This work used the short version of the dataset. The authors highlighted that the approach recognized Theft attacks with 93% accuracy even though this class had the fewest samples. XGBoost showed the best results, detecting Reconnaissance, DoS, and DDoS attacks with 100% accuracy.
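The AND operation over the three rankers amounts to a set intersection of each ranker's top features. The scores below are illustrative stand-ins, not the real Pearson, Random Forest, or gain-ratio values from the paper:

```python
# Toy per-feature scores from three rankers (illustrative values only).
pearson    = {"rate": 0.9, "drate": 0.7, "bytes": 0.6, "dur": 0.1}
rf_mean    = {"rate": 0.8, "drate": 0.2, "bytes": 0.5, "dur": 0.4}
gain_ratio = {"rate": 0.6, "drate": 0.5, "bytes": 0.7, "dur": 0.2}

def top(scores, k):
    """Names of the k highest-scoring features under one ranker."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# AND operation: keep only features selected by all three rankers.
selected = top(pearson, 3) & top(rf_mean, 3) & top(gain_ratio, 3)
print(sorted(selected))
```

The intersection is stricter than any single ranker, so only features that every criterion agrees on survive.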
Shafiq et al. [6] trained and tested five algorithms to detect anomalous behavior using the BoT-IoT dataset. This paper differs from others because the authors evaluated how accuracy, precision, TP rate, recall, and training time together determine the best algorithm, using a bijective soft set approach to weigh these five factors. They concluded that the Naïve Bayes algorithm reached 98% accuracy, precision, TP rate, and recall, with a training time of around 4 s. The authors used Weka to train and test the classifiers. This paper used the short version of the BoT-IoT dataset.
Soe et al. [15], in 2020, trained and tested a lightweight model to detect anomalous behavior in IoT devices. The model was designed to run on a Raspberry Pi, and the authors used the short version of the BoT-IoT dataset. To train and test the models, the authors created three sub-datasets, each containing only one kind of attack plus normal data samples; they considered only DDoS, Theft, and Reconnaissance attacks. They then extracted the most important features for each subset using correlated-set threshold on gain-ratio (CST-GR), so each subset had a different number of features. The DDoS features were drate and total number of bytes per destination IP, while the Theft features were state number, total number of packets per protocol, and average rate per protocol per dport. The authors then trained four classifiers: tree-based J48, Hoeffding Tree, logistic model tree, and Random Forest. They reduced the number of DDoS and Reconnaissance data samples since the full data did not fit in the Raspberry Pi memory. The authors concluded that the model could detect all kinds of attacks with an accuracy of over 99.3% in all cases, with Random Forest the best algorithm overall. This paper has some drawbacks. First, each new data sample must pass through three feature extraction and evaluation stages, which increases processing time and wastes resources. In addition, because the authors down-sampled the training data, some important statistical information could be missed.
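The gain-ratio criterion underlying CST-GR can be sketched for a single categorical feature: the information gain of the feature with respect to the label, normalized by the feature's own split entropy. This is the generic textbook formulation, not the authors' exact CST-GR implementation:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """Information gain of `feature` w.r.t. `labels`, divided by split info."""
    n = len(labels)
    values, counts = np.unique(feature, return_counts=True)
    # Expected label entropy after splitting on the feature's values.
    cond = sum((c / n) * entropy(labels[feature == v])
               for v, c in zip(values, counts))
    info_gain = entropy(labels) - cond
    p = counts / n
    split_info = -(p * np.log2(p)).sum()   # penalizes many-valued features
    return info_gain / split_info if split_info > 0 else 0.0

y = np.array(["attack", "attack", "normal", "normal"])
informative = np.array(["a", "a", "b", "b"])   # perfectly predicts the label
noisy = np.array(["a", "b", "a", "b"])         # independent of the label
print(gain_ratio(informative, y), gain_ratio(noisy, y))
```

Dividing by the split information is what distinguishes gain ratio from plain information gain: it keeps high-cardinality features from being selected merely because they split the data finely.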
Bagui and Li [16] developed a framework using Artificial Neural Networks (ANN) and different resampling methods to obtain models for detecting anomalies in IoT networks. Since IoT network datasets are imbalanced, it is difficult to obtain a model that recognizes the minority class with high accuracy. The authors used a Spark cluster and a standalone computer to run their experiments on a compact version of the BoT-IoT dataset. Their models tackled only a classification problem distinguishing different kinds of attacks from normal traffic. When the experiment was run in a Spark cluster provisioned in AWS, they obtained their best macro F1-score, around 58%, using random oversampling.
Fatani et al. [17], in 2021, used an innovative feature selection approach, swarm intelligence with the Aquila optimizer, to detect IoT attacks using the BoT-IoT dataset. The methodology is composed of several stages. The first stage is feature extraction, in which a Convolutional Neural Network extracts meaningful features from the original raw data; the features are taken from the last fully connected layer, which had 64 neurons. These features are then ranked, and the most important ones are selected using Aquila optimization. Finally, the selected features are used to train a machine learning algorithm. The authors used the short version of the BoT-IoT dataset, reaching 99% accuracy on the training and testing sets. However, the confusion matrix shows that the accuracy was only 60.7% for the Normal class and 85.7% for Theft. The overall accuracy was high because the dataset is highly imbalanced in nature; the model could not generalize well to the minority classes.
We can conclude that the authors of references [5,6,15,16,17] can accurately detect the attacks in the BoT-IoT dataset. Nonetheless, all of them use the short version of the dataset; thus, we cannot expect these models to perform well on unseen data.
Next, we summarize research papers that use sampling methods to reduce the class imbalance of a dataset, including papers that use GANs to generate new data samples.
Zixu et al. [18], in 2020, developed a novel approach to recognize anomalous behavior locally on each IoT device. They used a GAN to learn the best representation of the distribution of normal network traffic on each device. The GAN consisted of a generator and a discriminator. The input to the generator was random data with a normal distribution, defined as 100 input features. The generator's output dimension corresponds to the number of data features, which is also the input size of the discriminator; since the authors used 9 features (flag, state, mean, stddev, max, min, rate, srate, drate), the generator's output size was 9. The generator network was composed of two hidden layers with 1024 and 256 neurons, and the discriminator network was symmetrical to the generator. After training the discriminator and generator locally, the weights of the generator were sent to a central authority, which aggregated the weights of the local networks and generated new samples from random inputs. These samples were passed through an autoencoder whose main goal was to learn a representation of the data distribution using backpropagation. The autoencoder had two parts: the encoder, which reduced the size of the original input, and the decoder, which expanded the reduced representation back to a vector of the original size. The reconstruction error was calculated between the decoder output and the original input, and the authors defined a threshold on this error to separate benign and malicious traffic. The resulting autoencoder model was then distributed to all the nodes, where it discriminated between benign and malicious signals. The authors compared the results with other anomaly detection techniques such as one-class SVM, Isolation Forest, K-means clustering, and Local Outlier Factor, and showed that the proposed model achieved improved performance.
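The detection step described above, thresholding the autoencoder's reconstruction error, can be sketched with a linear stand-in: project benign data onto its top principal components, reconstruct, and flag inputs whose reconstruction error exceeds a threshold fitted on benign traffic. This is a simplified linear analogue of the paper's autoencoder on synthetic data, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Benign traffic lies near a 2-D subspace of a 9-feature space
# (9 features, echoing the feature count above; data is synthetic).
basis = rng.normal(size=(2, 9))
benign = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 9))

# "Encoder/decoder": project onto the top principal components and back.
mean = benign.mean(axis=0)
_, _, Vt = np.linalg.svd(benign - mean, full_matrices=False)
components = Vt[:2]                      # encoder weights (top-2 directions)

def reconstruction_error(X):
    Z = (X - mean) @ components.T        # encode (compress)
    Xhat = Z @ components + mean         # decode (reconstruct)
    return np.linalg.norm(X - Xhat, axis=1)

# Threshold = generous quantile of the benign error distribution.
threshold = np.quantile(reconstruction_error(benign), 0.99)

malicious = rng.normal(size=(100, 9)) * 3.0     # off-subspace traffic
flags = reconstruction_error(malicious) > threshold
print(flags.mean())
```

Traffic that does not fit the benign distribution reconstructs poorly and exceeds the threshold, which is the same decision rule the autoencoder applies with a nonlinear encoder and decoder.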
Ferrag et al. [19] created a methodology to reduce the impact of imbalanced datasets on anomaly detection in IoT networks. They proposed a system consisting of three models. First, two models ran in parallel: the first distinguished only between normal and malicious behavior, while the second labeled each row of the training dataset as benign or as one of the attack categories. The classification outputs of these two models were appended to the dataset as features. A third model was then trained on the original features together with the outputs of the two prior models. The classification algorithms used were REP Tree, JRip, and Forest PA. The authors trained and tested the models using the short version of the BoT-IoT dataset, reaching a low false alarm rate and a high detection rate.
Prabakaran et al. [20] proposed a methodology that used a GAN to discriminate between normal and malicious IoT traffic, working with the short version of the BoT-IoT dataset. They first labeled all rows as benign or attack to create one dataset, then created another dataset labeling each row as benign or as one attack category, and finally normalized and joined both datasets. The final dataset was used to train a GAN, with a modified discriminator loss function to reach good model performance. They showed that the discriminator's accuracy, around 92%, was greater than that of other models such as Convolutional Neural Networks (CNN), autoencoders, KNN, MLP, ANN, and Decision Trees (DT).
Ullah and Qusay [21] developed one of the most complete methodologies using GANs for anomaly detection in IoT devices. The authors generated additional minority-class samples using a one-class conditional GAN, and generated normal and anomalous samples by training a conditional binary GAN, down-sampling the abnormal class to obtain a balanced training set. Finally, they used a multi-class classification GAN consisting of multiple binary GANs. After generating new data samples with each of the three GAN configurations, they trained a deep feed-forward neural network.
Although the papers in references [18,19,20,21] propose effective methodologies to address the class imbalance problem, the authors did not evaluate their models on the entire dataset.
Finally, we describe research papers that have used big data frameworks for intrusion detection.
Belouch et al. [22] used Apache Spark to train and test four classifiers for intrusion detection modeling on the UNSW-NB15 dataset. They concluded that Random Forest was the most accurate algorithm, with 97% accuracy and a 5.69 s training time, while Naïve Bayes was the worst, with 74.19% accuracy and a 0.18 s training time. The authors used the short version of the dataset, with 257,340 records.
Haggag et al. [23] proposed using the Spark platform to train and test deep learning models in a distributed way for intrusion detection. The authors used the NSL-KDD dataset to train MLP, RNN, and LSTM models, and added a class imbalance handling stage based on SMOTE. Since Spark has no native deep learning capability, the authors used Elephas to train and test the deep learning models; because Elephas requires three-dimensional input, the data was fed to it in RDD form. The authors reported an average F1-score of 81.37%.
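The SMOTE step used here can be sketched in a few lines: for each synthetic sample, pick a minority-class point, pick one of its k nearest minority neighbours, and interpolate between them at a random fraction. This is a minimal version of the algorithm on synthetic data, not the authors' (or imbalanced-learn's) implementation.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating each chosen
    point toward one of its k nearest minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)                   # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]     # k nearest per point
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))              # random minority point
        b = neighbours[a, rng.integers(k)]        # one of its neighbours
        lam = rng.random()                        # interpolation fraction in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

X_min = np.random.default_rng(3).normal(size=(20, 4))
synthetic = smote(X_min, n_new=100)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority samples, SMOTE stays on line segments between existing points; CTGAN, by contrast, models the joint distribution and can generate samples off those segments.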
Morfino et al. [24] proposed an approach to train and test machine learning models in Spark to detect SYN/DoS attacks in IoT networks. They used MLlib to train binary classifiers on around 2 million instances. The authors showed that Random Forest provided the best accuracy, close to 100%, with a training time of 215 s. Our dataset is different, since it contains more than 50 million records.
The following paper is the most relevant work we found, in which the researchers used Hadoop–Spark to train and test on the entire BoT-IoT dataset.
Abushwereb et al. [25] used MLlib from Spark to train models on both the short and full versions of the BoT-IoT dataset. The authors proposed a methodology in which they removed duplicated values and rows with missing and unknown values, normalized the data with min-max normalization, and applied feature selection using chi-square. They then trained RF, DT, and NB classifiers on 70% of the data and evaluated their accuracy on the remaining 30%. The framework was deployed on the Google Cloud Platform, with a Hadoop–Spark cluster consisting of eight VMs with 16 GB of RAM overall. For the multi-class classification problem, the authors reported an overall F1-score of 77% for DT and 73% for RF. The F1-scores decreased because the Normal and Theft classes had far fewer data samples than the other classes: their model detected Theft attacks with an F1-score of only 23% and Normal data samples with 71.8%. This reference presents an approach similar to the present paper; however, its Theft and Normal class accuracies are quite low.
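The min-max normalization step in this kind of pipeline maps each feature to [0, 1] using per-feature minima and maxima fitted on the training split and reused on the test split, so no test statistics leak into training. A short sketch with illustrative feature values:

```python
import numpy as np

def fit_min_max(X_train):
    """Fit per-feature minima and ranges on the training split only."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero on constant columns
    return lo, span

def min_max_transform(X, lo, span):
    """Map each feature into [0, 1] using the fitted statistics."""
    return (X - lo) / span

X_train = np.array([[1.0, 200.0], [3.0, 600.0], [5.0, 1000.0]])
lo, span = fit_min_max(X_train)
X_scaled = min_max_transform(X_train, lo, span)
print(X_scaled)
```

Fitting the statistics on the 70% training split and only applying them to the 30% test split mirrors the evaluation protocol described above.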
5. Conclusions and Future Work
In this paper, we propose a combination of systems and strategies to reach the best accuracy for detecting the different attacks in the BoT-IoT dataset, with an average F1-score of 96.77%. Our work differs from other related papers in that we train and test models with the entire BoT-IoT dataset, using a Hadoop–Spark cluster to reach this goal. Training a model with more data yields robustness in identifying new unseen data.
We show that Random Forest is the best algorithm for training on extremely imbalanced datasets. In addition, we demonstrate that the OVR wrapper is useful for training multi-class problems with binary classification algorithms. However, the training time with OVR (binary classifiers) was five times longer than that of native multi-class classification. For this reason, we select the Random Forest algorithm from Spark's multi-class algorithms to train the model. Although the overall accuracy is high, a per-class analysis shows that the minority classes are affected. For this reason, we propose the usage of CTGAN to generate new data samples. CTGAN differs from other oversampling methods in that it creates data samples from the original distribution. We compare CTGAN with other oversampling methods such as SMOTE and ADASYN, concluding that CTGAN generates data samples more accurately. With CTGAN, Random Forest provides an F1-score of 96.77%. Finally, we compare the results of our approach with other recent related work, concluding that our approach is more accurate in detecting the minority classes, e.g., the Theft and Normal classes. It is necessary to highlight that our approach uses the entire dataset; thus, our methodologies can be more robust to unseen data.
Since our research focused on attack detection using machine learning, in the future we want to solve this problem using deep learning models, which are not available in the Spark MLlib library. For this reason, our plan is to develop a blockchain-based federated learning approach to train our models with high accuracy, security, and less processing time.