1. Introduction
The growing use of IoT in many areas of life, such as Smart Cities, has created a massive and complex network of linked things (apps, devices, and people) [1]. It is estimated that 50% of the world’s needs will depend heavily on IoT-based services in various sectors in the near future [2,3]. Along with such dependence, the threat of cyberattacks on IoT networks is increasing and becoming more damaging [4]. Among the most serious security risks in IoT networks are man-in-the-middle attacks, eavesdropping, and malware such as botnets. Malware, or malicious software, is a class of cyberattack that has continued to develop since the era of traditional Internet networks and has become a serious threat in today’s IoT environment. Botnets are one of several types of malware commonly used by criminals to carry out cyberattacks on IoT networks [5]. A botnet is a group of computers running malware controlled by hackers (usually called “botmasters”). Botnets turn computers into cyberattack forces, usually for spam, fake websites, DoS (Denial of Service) attacks, viruses, and information gathering through phishing and fraud [6]. Botnets themselves have evolved significantly, becoming more diverse and more difficult to detect. Traditional intrusion detection systems (IDS) now lag behind and are less effective at detecting these new botnet attacks [7].
Intrusion detection systems generally use signature-based and anomaly-based detection to identify cyber threats. Signature-based detection systems are built from predefined attack signatures. As a result, this technique is incapable of detecting novel attack types or variants of known attacks [8], and signature-based detection is becoming increasingly rare. Anomaly detection systems, on the other hand, have a high false-positive rate, despite being better at detecting novel types of cyber threats [7,8]. Furthermore, creating an anomaly detection system is complicated by the presence of so many distinct types of IoT devices. Consequently, several studies have concluded that IDS is only effective against existing botnets and fails to detect and block new and distinct botnet versions [9]. Machine learning (ML) was proposed as a way to close the gap [10,11] since it improves the effectiveness and intelligence of IDS.
There are two main problems identified from previous research. First, kNN usually shows a lack of performance in dealing with large datasets in the IoT cyberattack domain. Second, there have been very few studies involving real IDS implementation in Raspberry Pi for IoT networks.
Therefore, we propose a low-cost and lightweight intelligent intrusion detection system based on the Raspberry Pi. For machine learning, the kNN algorithm is chosen for its simplicity, effectiveness, and robustness, which make it appropriate for resource-constrained devices such as the Raspberry Pi.
To tackle the low-performance issue, this study applies and evaluates several feature selection techniques in combination with the kNN algorithm, using Rapidminer software, to find the best-performing one. Finally, the best combination is used to establish a novel Suricata-based IDS with an enhanced kNN algorithm on the Raspberry Pi, a real implementation called SUKRY.
The main contributions of this research are as follows:
SUKRY, a novel Suricata IDS with an enhanced kNN algorithm on the Raspberry Pi machine.
The use of five evaluation factors (accuracy, precision, recall, F1 score and execution time) to identify the best of three feature selection techniques for enhancing the performance of the kNN algorithm.
The application of kNN-FS (a combination of the kNN algorithm and the Forward Selection technique) as the core machine learning model in developing SUKRY.
The rest of the paper is organized as follows. Section 2 discusses the relevant literature concerning machine learning applications in dealing with IoT security issues. Section 3 describes the methodology used in conducting the study. Section 4 presents the data analysis, while the discussion is provided in Section 5. Finally, the conclusion is given in the last section.
2. Relevant Studies
Research by Liao et al. [12] in 2002 was one of the first studies to consider, theoretically, the potential of machine learning algorithms to enhance intrusion detection. In this study, they justified the effectiveness of the kNN algorithm by its simplicity, robustness, and ease of use. Then, Binkley and Singh [13] also introduced several machine learning algorithms for botnet attack identification. Their study questioned the usefulness of traditional signature-based IDS in recognizing new botnet variations and suggested that machine learning be used to address these concerns. Later, many machine learning algorithms were adopted and exemplified in several studies. Examples include a Support Vector Machine for anomaly traffic detection, which improved the botnet identification rate [14]; automated DDoS attack identification with several machine learning algorithms [15]; DDoS attack mitigation using machine learning approaches [16]; the use of machine learning techniques to classify flooding attacks on networks [17]; and many others.
Eslahi et al. [18] conducted a relevant survey to address the emergent impact of botnets by examining their characteristics, the state of the art of machine learning-based detection, and future challenges in deploying secure IoT environments. A further review was carried out by Simkhada et al. [19] to compare the advantages and disadvantages of various machine learning methods in botnet classification. Rashid et al. [20] discussed various cyberattacks on IoT-based Smart Cities by exploring attack and anomaly detection techniques based on machine learning algorithms (LR, SVM, DT, RF, ANN, and kNN). Moreover, Dwibedi et al. [21] focused their comparative study on different datasets by employing ML algorithms such as Random Forest (RF), Support Vector Machines (SVMs), Keras Deep Learning models, and XGBoost. They also suggested making the best model suitable for real-time environments. Then, advanced machine learning models were surveyed and evaluated by Pacheco and Sun [22] against two IDS datasets, UNSW-NB15 and Bot-IoT. They demonstrated how several attacks were able to effectively degrade the overall performance of SVM, DT, and RF. RF was found to be the most resilient classifier, while SVM was the least robust on both datasets.
Aswal et al. [23] conducted another interesting investigation on botnet identification on the Internet of Vehicles (IoV). To deal with the IoV attack classification problem, many machine learning algorithms (NB, kNN, LR, LR, CART, and SVM) were chosen. In terms of TPR and FPR, the CART method outperformed the other machine learning algorithms. Similarly, Hasan et al. [24] carried out a comparative investigation in which several machine learning algorithms (LR, SVM, DT, RF, and ANN) were examined for their ability to predict attacks and anomalies on IoT devices. RF was shown to outperform the other algorithms. This conclusion was comparable to that of Bedi et al. [25], who investigated anomalies on an IoT network by comparing ANN, LR, RF, SVM, and DT. They also found that RF performed better than the others.
A big data framework was discussed in [26] to detect peer-to-peer botnet attacks by employing a Random Forest algorithm. Furthermore, several approaches applied different algorithms to identify and tackle various botnet attacks, such as a parallel random forest by Chen et al. [27], a performance comparison of three algorithms (NB, kNN and DT) for mobile botnets by Yusof et al. [28], a parallel multiclassification algorithm for big IoT data analysis by Duan et al. [29], and research by Vengatesan et al. [30] on predicting malware attacks in IoT using big data analytics. Furthermore, Marjani et al. [31] present a comprehensive survey of this field, highlighting its architecture, various approaches and future research challenges.
Gadelrab et al. [32] used a machine learning strategy based on statistical features to detect botnets. Later, Hoang and Nguyen [33] created a machine learning-based botnet detection model that uses Domain Name Service data and validated its performance with popular machine learning techniques. The results of the experiments demonstrated that machine learning algorithms can be used to detect botnets successfully, with the Random Forest method producing the best detection accuracy.
The hybrid technique is helpful in this field. A hybrid combines different machine learning techniques to improve botnet detection accuracy. By merging the C5 and SVM algorithms, Khraisat et al. [34] developed a novel technique to detect botnet attacks, which resulted in greater accuracy for the hybrid approach. Almashhadi [35] showed that utilizing hybrid machine learning to analyze DNS traffic irregularities during the botnet lifetime improves accuracy. Similarly, Khan et al. [36] used a hybrid technique to improve botnet attack similarity identification and demonstrated the effectiveness of their strategy experimentally. However, the hybrid approach requires high computational resources.
Rambabu and Venkatram [37] proposed an ensemble classification using traffic flow metrics for DDoS attacks in IoT networks. Using cross-validation, they showed the importance of the ensemble approach for accurate DDoS defense with fewer false alarms. Another ensemble approach was studied by Khraisat et al. [38] to protect IoT devices from cyberattacks. They combined the C5 classifier and a one-class support vector machine classifier to develop a novel ensemble Hybrid Intrusion Detection System (HIDS). Likewise, an AdaBoost ensemble learning method based on DT, NB, and ANN was introduced by Moustafa et al. [39] to evaluate and detect malicious events using the UNSW-NB15 and NIMS botnet datasets.
In addition, the use of machine learning frameworks is exemplified in several studies. A Weka-based comparative study of various machine learning algorithms was explored by Farhat et al. [40]. They used the NSL-KDD dataset for comparison purposes. Likewise, the use of the Weka tool was confirmed as efficient by Celil and Dener [41] in classifying and detecting anomalous activities within IoT networks. A similar study by Soe et al. [42] implemented machine learning-based IDS using a new feature selection algorithm. The whole machine learning analysis with several algorithms was conducted within the Weka environment. Then, they suggested evaluating other machine learning algorithms with lower computational requirements and relatively simple implementations.
Based on the aforementioned studies, the research gap is identified as follows. Among the many machine learning algorithms applied to cyberattacks on IoT networks, kNN often shows lower performance than the others. This is because most datasets relating to botnet attacks in IoT networks are large, causing kNN to incur a higher computational cost and a longer processing time.
In this study, we are interested in kNN, a supervised machine learning algorithm that is simple to implement, robust, and has low computational requirements, making it applicable for Raspberry Pi implementation. Although it was already suggested in 2002 in [12], very few studies of IoT-related botnet attacks have reported the best accuracy for kNN. Churcher and Ullah [43] found kNN to be a suitable model for intrusion detection, thanks to its simplicity and robustness. Similarly, Mrabet and Belguith [44] argued that kNN’s important advantage is its simplicity and ease of application; however, kNN is incapable of handling large datasets. According to Wazirali [45], the main cause of kNN’s poor performance on large datasets is the high similarity between the closest and farthest neighbors on the basis of distance functions and weights.
Therefore, we aim to enhance kNN’s performance by applying feature selection techniques. Instead of Weka [40,41,42], Rapidminer is selected as the environment for the analysis since it offers several built-in feature selection techniques and other visual functions [46,47,48]. Moreover, Lee et al. [49] reported the effective use of built-in feature selection techniques in the Rapidminer environment to handle high-dimensional datasets. The next step in our research is to turn the enhanced kNN model into an actual IDS application that uses a Raspberry Pi engine for intelligent botnet detection.
4. Analysis and Discussion
This section presents an in-depth analysis of how three different feature selection techniques, Information Gain, Forward Selection and Backward Elimination, would affect the performance of kNN. The performance analysis is based on five factors: accuracy, precision, recall, F1 score and processing time. It is divided into three sections which are called kNN-IG, kNN-FS and kNN-BE as follows.
4.1. kNN Algorithm with Information Gain (kNN-IG)
Information Gain (IG) is the first feature selection technique applied to the dataset in order to identify the most influential features of the Botnet-IoT dataset. IG feature selection is classified as the Filter method in Rapidminer, and Figure 4 represents the process of IG implementation on the dataset in the Rapidminer environment.
Table 3 shows the number of features selected by performing the IG feature selection in Rapidminer.
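To illustrate what a filter method such as Information Gain measures, the following is a minimal Python sketch that scores discrete features by the reduction in class entropy they provide. It is not Rapidminer's implementation: the toy data, the feature names (`protocol`, `flag`) and the function names are hypothetical, and continuous features would first need to be discretized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two discrete features and one class label per flow.
labels = ["DDoS", "DDoS", "Normal", "Normal"]
protocol = ["udp", "udp", "tcp", "tcp"]  # separates the two classes perfectly
flag = ["a", "b", "a", "b"]              # carries no class information

scores = {
    "protocol": information_gain(protocol, labels),
    "flag": information_gain(flag, labels),
}
# Features are ranked by score; a filter method keeps the top-ranked ones.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the score of each feature is computed independently of the classifier, a filter method of this kind is cheap to run, but it does not account for how features interact inside kNN.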
Once selected features are obtained, the kNN algorithm proceeds to the machine learning analysis, which is called kNN-IG.
Figure 5 shows the results of kNN-IG in the confusion matrix.
The confusion matrix shows how many pieces of data are correctly and incorrectly predicted in each class by kNN-IG. In this case, the DDoS class has 6643 pieces of data predicted correctly and only one predicted incorrectly. The Reconnaissance class has 5039 pieces of data correctly predicted, while 10 others are incorrect. Similarly, the DoS class has 6808 pieces of data correctly predicted and five assigned an incorrect class. The Normal class has 448 pieces of data predicted correctly and 29 predicted incorrectly. Finally, 64 pieces of data are correctly predicted for the Theft class, while nine are assigned an incorrect class, which is consistent with the 73 Theft samples in the other confusion matrices and with the 87.67% recall reported for this class.
Based on these, the performance of kNN-IG could be determined by calculating scores of accuracy, recall, precision and F1 score. The calculation processes are shown below.
Then, we calculate the individual recall score of each class. For example, the following is the recall score for the DDoS class:
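Assuming the standard per-class definition of recall (true positives over true positives plus false negatives; the symbols TP and FN are notation introduced here, not taken from the source), the DDoS counts reported for the kNN-IG confusion matrix give:

```latex
\text{Recall}_{\text{DDoS}} = \frac{TP}{TP + FN} = \frac{6643}{6643 + 1} \approx 99.98\%
```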
Using the same calculation, the recall scores of all classes are: DDoS (99.98%), Reconnaissance (99.80%), DoS (99.93%), Normal (93.92%) and Theft (87.67%).
Then the average recall score is obtained as follows:
Next, we calculate the individual precision score of each class. For example, the following is the precision score for the DDoS class:
Then, using the same calculation, the precision scores of all classes are: DDoS (99.92%), Reconnaissance (99.25%), DoS (99.93%), Normal (98.68%) and Theft (100%). Then the average precision score is obtained as follows:
Next, based on recall and precision scores, the F1 score is calculated as follows:
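As a worked check, assuming the F1 score is the harmonic mean of the averaged precision and the averaged recall (an assumption that reproduces the kNN-IG value reported in Section 4.4), with an averaged precision of 99.56% and an averaged recall of 96.26%:

```latex
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2 \times 99.56 \times 96.26}{99.56 + 96.26} \approx 97.88\%
```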
Finally, Rapidminer records the processing time of kNN-IG.
Figure 6 shows an execution time of kNN-IG of 14 s.
4.2. kNN Algorithm with Forward Selection (kNN-FS)
Forward Selection (FS) is the second feature selection technique applied to the dataset in order to measure the most influential features among existing features of the Botnet-IoT dataset. FS feature selection is classified as the Wrapper method in Rapidminer.
Figure 7 represents the process of FS implementation on the dataset in the Rapidminer environment.
After the selection process is carried out, the number of features selected by the Forward Selection technique is reduced, as can be seen in Table 4.
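The wrapper idea behind Forward Selection can be sketched in a few lines of Python: starting from an empty set, the feature that most improves the classifier's accuracy is added greedily until no candidate helps. This is an illustrative sketch under stated assumptions, not Rapidminer's operator: the toy dataset, the choice of k = 3, the leave-one-out evaluation, and all function names are hypothetical.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of kNN restricted to the given feature indices."""
    projected = [[row[f] for f in features] for row in X]
    hits = 0
    for i, (x, label) in enumerate(zip(projected, y)):
        rest_X = projected[:i] + projected[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, x, k) == label
    return hits / len(y)

def forward_selection(X, y, k=3):
    """Greedily add the feature that most improves accuracy; stop when nothing helps."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f], k), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.1, 9.0, 3.0], [0.2, 1.0, 7.0], [0.3, 8.0, 2.0], [0.4, 2.0, 8.0],
    [5.1, 9.1, 7.1], [5.2, 1.1, 3.1], [5.3, 8.1, 8.1], [5.4, 2.1, 2.1],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, best = forward_selection(X, y)
print(selected, best)
```

Fewer retained features also mean fewer terms in every distance computation, which is one plausible reason a reduced feature set shortens kNN's execution time.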
Then the kNN algorithm is applied to this new dataset; this process is named kNN-FS.
Figure 8 shows the results of kNN-FS implementation on the dataset in the confusion matrix.
As can be seen in the confusion matrix of kNN-FS, the DDoS class has 6643 pieces of data that are predicted correctly, with only one piece of data incorrectly predicted. The Reconnaissance class has 5047 pieces of data correctly predicted, while two others were assigned a different class. Similarly, the DoS class has 6808 pieces of data correctly predicted and seven pieces of data with an incorrect class. Then, there are 473 pieces of data predicted correctly in the Normal class, while four other pieces of data were incorrectly predicted. Finally, the Theft class has 66 correct pieces of data and seven pieces of data with an incorrect class. Based on these, the performance of kNN-FS can be determined by calculating the accuracy, recall, precision and F1 scores. The calculation processes are shown below.
Furthermore, we calculate individual recall scores for each class. For example, the following is the recall score for the DDoS class:
Using the same process, we obtain recall scores for all classes as follows, DDoS (99.98%), Reconnaissance (99.96%), DoS (99.90%), Normal (99.16%) and Theft (90.14%). Then the average recall is obtained as follows:
Next, we calculate an individual precision score for each class. For example, the following is the precision score for the DDoS class:
Using the same formula, scores are obtained as follows, DDoS (100%), Reconnaissance (99.72%), DoS (99.96%), Normal (99.16%) and Theft (100%). Then the average precision score is obtained as follows:
Next, based on recall and precision scores, the F1 score is obtained as follows:
Finally, Rapidminer logs the processing time of kNN-FS.
Figure 9 shows an execution time by kNN-FS of 7 s.
4.3. kNN Algorithm with Backward Elimination (kNN-BE)
Backward Elimination (BE) is the last feature selection technique applied to the dataset in order to choose the most significant features among the existing ones in the Botnet-IoT dataset. The BE feature selection is also classified as the Wrapper method in Rapidminer.
Figure 10 represents the process of BE implementation on the dataset in the Rapidminer environment.
This step reduced the number of features in the dataset according to the Backward Elimination technique, as can be seen in Table 5.
We then apply the kNN algorithm to the reduced dataset; this process is named kNN-BE.
Figure 11 shows the results of kNN-BE in the confusion matrix.
According to the kNN-BE confusion matrix, the DDoS class has 6642 pieces of data that are correctly predicted, while two pieces of data were incorrectly predicted. Then, the Reconnaissance class has all 5049 pieces of data correctly predicted. For the DoS class, there are 6803 correctly predicted pieces of data, whereas 10 pieces of data were incorrectly classified. Then, 472 pieces of data were predicted correctly for the Normal class, while five pieces of data were incorrectly classified. Finally, the Theft class has 64 pieces of data that are predicted correctly and nine pieces of data that were incorrect.
In order to measure the performance of kNN-BE, several scores should be obtained, namely, accuracy, recall, precision and F1 score, as well as execution time. The following is the calculation to obtain the accuracy for kNN-BE.
Next, we calculate the recall for DDoS class as follows:
Using the same calculation we obtain the recall scores for all classes as follows, DDoS (99.97%), Reconnaissance (100%), DoS (99.85%), Normal (98.95%) and Theft (87.67%). Then, the average recall score is obtained as follows:
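The accuracy and recall figures for kNN-BE can be reproduced from the confusion-matrix counts given above. The following is a minimal Python sketch, assuming that the "incorrect" count of each class refers to samples of that true class that were misclassified (false negatives) and that the average recall is the unweighted mean over the five classes; the variable names are illustrative.

```python
# Per-class (correct, incorrect) counts as reported for the kNN-BE confusion matrix.
counts = {
    "DDoS": (6642, 2),
    "Reconnaissance": (5049, 0),
    "DoS": (6803, 10),
    "Normal": (472, 5),
    "Theft": (64, 9),
}

total = sum(ok + bad for ok, bad in counts.values())
accuracy = sum(ok for ok, _ in counts.values()) / total

# Recall of a class = correctly predicted samples / all samples of that class.
recall = {name: ok / (ok + bad) for name, (ok, bad) in counts.items()}
# Unweighted (macro) average over the five classes.
macro_recall = sum(recall.values()) / len(recall)

for name, value in recall.items():
    print(f"{name}: {value:.2%}")
print(f"accuracy: {accuracy:.2%}")
print(f"average recall: {macro_recall:.2%}")
```

Precision cannot be recomputed in the same way from these counts alone, since it also requires the number of samples from other classes that were predicted as each class.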
Next, we calculate an individual precision score for each class. For example, the following is the precision score for the DDoS class:
Using a similar formula we obtain precision scores for all classes as follows, DDoS (100%), Reconnaissance (99.88%), DoS (99.84%), Normal (98.54%) and Theft (96.97%). Then, the average precision is obtained as follows:
Next, based on recall and precision scores, the F1 score is calculated as follows:
Finally, Rapidminer automatically records the processing time of kNN-BE.
Figure 12 shows an execution time by kNN-BE of 18 s.
4.4. Performance Comparison
This section compares the overall performance of kNN-IG, kNN-FS and kNN-BE obtained previously using Rapidminer, as presented in Table 6 and visualized in Figure 13 and Figure 14. Figure 13 depicts the accuracy, recall, precision, and F1 score, while Figure 14 shows the execution times.
kNN-FS, the Forward Selection-based kNN algorithm, clearly outperforms the other two models (kNN-IG and kNN-BE). It achieves the best results in all aspects: accuracy (99.89%), recall (97.82%), precision (99.77%), and F1 score (98.78%) (see Figure 13), as well as the fastest execution time of 7 s (see Figure 14).
The next position goes to kNN-BE, which outperforms kNN-IG in three factors, namely accuracy (99.86%), recall (97.29%), and F1 score (98.16%), while recording the lowest precision (99.05%) and the longest execution time of 18 s. Finally, kNN-IG shows better results than kNN-BE in only two factors, precision (99.56%) and execution time (14 s), while recording the lowest accuracy (99.72%), recall (96.26%) and F1 score (97.88%) of the three models. It is found that the implementation of Forward Selection on the given dataset significantly improves the performance of the kNN algorithm in classifying botnet attacks in IoT networks.
4.5. SUKRY Implementation
Based on our results, the kNN-FS model is selected for SUKRY implementation. This last section describes the implementation of SUKRY in the Raspberry Pi machine. The following steps describe how SUKRY is deployed:
- Install Raspberry Pi OS on the Raspberry Pi machine;
- Install Suricata on Raspberry Pi OS;
- Install OPNIDS over Suricata to enable machine learning models to be deployed on top of Suricata;
- Run the kNN-FS model using OPNIDS.
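To give a sense of the data a Suricata-based classifier works with, the following is a minimal Python sketch that turns a flow record from Suricata's EVE JSON log into a numeric feature vector that a kNN model could consume. It is not the OPNIDS mechanism used by SUKRY, and the four flow counters chosen here are an assumption for illustration only; they are not the features selected by kNN-FS. The sample record and the function name are hypothetical.

```python
import json

def flow_features(eve_line):
    """Return a numeric feature vector for an EVE 'flow' event, or None for other events."""
    event = json.loads(eve_line)
    if event.get("event_type") != "flow":
        return None
    flow = event.get("flow", {})
    return [
        flow.get("pkts_toserver", 0),
        flow.get("pkts_toclient", 0),
        flow.get("bytes_toserver", 0),
        flow.get("bytes_toclient", 0),
    ]

# Hypothetical sample record in the shape of a Suricata EVE flow event.
sample = json.dumps({
    "event_type": "flow",
    "src_ip": "192.168.1.10",
    "dest_ip": "192.168.1.1",
    "proto": "UDP",
    "flow": {"pkts_toserver": 120, "pkts_toclient": 0,
             "bytes_toserver": 7200, "bytes_toclient": 0},
})
print(flow_features(sample))
```

In a live deployment the same function could be applied line by line to the EVE log file, with each resulting vector passed to the trained classifier.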
The deployment steps above are intended to help future studies reproduce the experiment [58]. In the near future, the SUKRY hardware will be packaged for commercialization. The current deployment of SUKRY is presented in Figure 15.
5. Discussion
Compared to previous studies, our study showed a significant improvement in kNN performance. First, recent research by Bahsi et al. [59] implemented a hybrid machine learning approach to detect botnets in IoT networks and obtained the following results: DT (98.9%) and kNN (94.9%). Second, EDIMA, a novel botnet detector by Kumar and Lim [60], was applied in IoT networks using three machine learning algorithms, with accuracies of RF (88.8%), kNN (94.44%), and GNB (77.78%). These comparisons are presented in Table 7.
While previous related studies commonly focus on accuracy in determining the best machine learning algorithm, we argue through this study that execution time should be considered an important factor as well. This is particularly true for time-critical cases such as cyberattacks in IoT. Unfortunately, none of these previous studies [42,43] provided information about it. As a result, a performance comparison from an execution time perspective could not be presented. SUKRY’s 7 s execution time, on the other hand, is a notable result for the application of the kNN algorithm.
In our view, accuracy and execution time must be considered equally in detecting any IoT-related cyberattack. Otherwise, the whole IoT network and all vital services that depend on it, such as Smart Cities, Smart Grid, and many others, will be seriously affected, since cyberattacks usually occur in a very short time.
SUKRY is a novel approach to establishing Suricata-based IDS using hybrid kNN and Forward Selection over Raspberry Pi. SUKRY’s advantage lies in its high level of accuracy in detecting potential cyber threats in a relatively short time. We affirm that both the high accuracy and short execution time offered by SUKRY are essential factors in securing the IoT environment from botnet attacks more effectively and efficiently.
6. Conclusions
This study introduces SUKRY, a novel Suricata IDS using an enhanced kNN algorithm on the Raspberry Pi machine. The first issue to solve is the low performance of the k-Nearest Neighbors algorithm, particularly in dealing with large datasets such as the Botnet-IoT dataset. Using the Rapidminer environment, three feature selection techniques (Information Gain, Forward Selection, and Backward Elimination) are applied to reduce the number of features in the dataset before it is processed with the kNN algorithm. The three combinations, called kNN-IG, kNN-FS, and kNN-BE, are then evaluated and compared in terms of accuracy, precision, recall, F1 score, and processing time.
It is found that kNN-FS, the kNN algorithm with the Forward Selection technique, achieved the highest scores in terms of accuracy (99.89%), precision (99.77%), recall (97.82%), and F1 score (98.78%), as well as the shortest execution time (7 s). These results outperform previous kNN-related botnet IoT studies. As a result, the kNN-FS model is selected for the implementation of SUKRY, an intelligent IDS on the Raspberry Pi machine. In the near future, the physical design of SUKRY will be improved for commercialization purposes.