Advanced Feature-Selection-Based Hybrid Ensemble Learning Algorithms for Network Intrusion Detection Systems
Abstract
1. Introduction
- Reduce the dimensionality of the CICIDS2017 dataset through the proposed coupling of Correlation Feature Selection with Forest Penalized Attributes (CFS-FPA).
- Identify the ensemble learning approach that best combines the four modified classifiers (Support Vector Machine, Random Forest, Naïve Bayes, and K-Nearest Neighbor), so that the hybrid ensemble method achieves the best result.
- Conduct a comparative study between CFS-FPA and other feature selection techniques in terms of accuracy, Detection Rate (DR), and False Alarm Rate (FAR). The outcome is used to establish the general efficiency of the proposed feature selection technique.
- Compare the four classifiers before and after modification, when they operate within the AdaBoost method, and compare the proposed method with other existing approaches.
2. Related Work
3. Materials and Methods
3.1. Description of CICIDS2017 Datasets
3.2. CICIDS2017 Dataset Preprocessing
Algorithm 1: Preprocessing and Min-Max Scaling
Input: d1, the CICIDS2017 dataset
Output: d1normalize, the normalized dataset
Begin
  For each dataset Di Do
    Step 1: Data filtering
      Remove meaningless and redundant instances.
      Arrange the distribution by category.
    Step 2: Data transformation
      If the input contains non-numeric features Then
        Transform categorical features into numbers using Label Encoder () and One-Hot Encoding
        /* this step converts categorical features, such as the protocol types UDP and TCP, into numerical data */
      End If
    Step 3: Normalization (min-max scaling)
      Max = the maximum value of the feature
      Min = the minimum value of the feature
      For each XiValue in the dataset Do
        Rescale XiValue to [0, 1] as (XiValue − Min)/(Max − Min)
        Remove missing and duplicated data
        Encoding process with the second normalization
      End For
  End For
End
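The three steps of Algorithm 1 can be illustrated with a minimal, standard-library sketch. The helper names (`drop_missing_and_duplicates`, `label_encode`, `min_max_scale`) and the toy rows are hypothetical and are not taken from the authors' implementation or from CICIDS2017. The sketch also assumes that missing and duplicated rows are removed before encoding and scaling, which is one possible ordering of the steps.

```python
def drop_missing_and_duplicates(rows):
    """Step 1: drop rows containing None, then exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned


def label_encode(values):
    """Step 2: map each distinct category to an integer (sorted for determinism)."""
    mapping = {category: index for index, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


def min_max_scale(values):
    """Step 3: rescale to [0, 1]; a constant column is mapped to 0.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical toy flows: (protocol, flow duration).
raw = [("TCP", 10.0), ("UDP", 30.0), ("TCP", 10.0), ("UDP", None), ("TCP", 20.0)]

rows = drop_missing_and_duplicates(raw)
protocol_codes, protocol_map = label_encode([row[0] for row in rows])
durations = min_max_scale([row[1] for row in rows])

print(protocol_map)
print(protocol_codes)
print(durations)
```

On a real dataset the minimum and maximum would normally be computed on the training split only and then reused for the test split, so that no information leaks from test data into the scaling.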
3.3. Correlation Feature Selection-Forest Penalized Attribute (CFS-FPA)
3.4. Classifiers
3.4.1. Random Forest (RF)
3.4.2. Naïve Bayes Classifier
3.4.3. Support Vector Machine (SVM)
3.4.4. K-Nearest Neighbor (KNN)
3.5. Hybrid Classifier Algorithms
Algorithm 2: Hybrid HABBAs for Intrusion Detection
Input: D1 = CICIDS2017 training dataset; Mi = modified classifiers; k = the number of rounds (one modified algorithm per round)
Output: A composite model
Begin
  Step 1: AdaBoost algorithm
    Initialize the weight of each classifier Wi = 0; // one weight per modified algorithm
    k = 4; // four modified algorithms
    For i = 1 to k Do // for each modified algorithm
      Compute ErrorRate (Mi)
      If ErrorRate (Mi) > 0.5 Then
        Compute Wi = log ((1 − ErrorRate (Mi))/ErrorRate (Mi))
        Compute the prediction of each modified classifier Mi: Ci = Mi(x)
        Add Wi to the weight for classifier Ci
      End If
    End For
    Return Ci with the highest weight and error rate
  Step 2: Bagging algorithm
    For each Ci Do
      Ensemble these Ci into bootstrap models.
      Aggregate each Ci using average voting, as parallel operations.
    End For
    For the testing set Do
      Compute accuracy for predicted XiAfter voting.
      Compute accuracy for predicted XiBefore voting.
      If XiBefore < XiAfter Then
        Return to the average voting step and replace the weighting probability with the highest probabilities.
      Else
        Compute the general measurements: Accuracy, DR, FAR, Precision, False-Positive Rate, False-Negative Rate, True-Positive Rate, True-Negative Rate
      End If
    End For
  Return the composite model and the performance measurements
End
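The weighting and voting idea in Algorithm 2 can be sketched in a few lines. This is not the authors' implementation: the function names and the error rates below are hypothetical, and the sketch follows the textbook AdaBoost convention in which a classifier only earns a positive weight log((1 − e)/e) when its error rate e is below 0.5, which is an assumption relative to the listing above.

```python
import math
from collections import defaultdict


def classifier_weight(error_rate, eps=1e-10):
    """AdaBoost-style weight log((1 - e) / e); 0.0 when e is 0.5 or worse."""
    e = min(max(error_rate, eps), 1.0 - eps)  # avoid log(0) and division by zero
    if e >= 0.5:
        return 0.0
    return math.log((1.0 - e) / e)


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight for one sample."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical validation error rates for the four base classifiers.
error_rates = {"RF": 0.05, "KNN": 0.10, "SVM": 0.20, "NB": 0.40}
weights = [classifier_weight(e) for e in error_rates.values()]

# One sample: RF and KNN predict "attack", SVM and NB predict "normal".
sample_predictions = ["attack", "attack", "normal", "normal"]
decision = weighted_vote(sample_predictions, weights)
print(decision)
```

A plain majority vote would tie at two votes each for this sample; the weights break the tie in favour of the classifiers with the lower error rates.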
4. Implementation
5. Experimental Results and Discussion
5.1. Binary and Multi-Class Confusion Matrix
5.2. Time Complexity
5.3. Analysis of Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sun, X.; Dai, J.; Liu, P.; Singhal, A.; Yen, J. Using Bayesian Networks for Probabilistic Identification of Zero-Day Attack Paths. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2506–2521.
- Alazab, M. Profiling and classifying the behavior of malicious codes. J. Syst. Softw. 2015, 100, 91–102.
- Sumaiya Thaseen, I.; Aswani Kumar, C. Intrusion detection model using fusion of chi-square feature selection and multi class SVM. J. King Saud Univ. - Comput. Inf. Sci. 2017, 29, 462–472.
- Rajagopal, S.; Kundapur, P.P.; Hareesha, K.S. A Stacking Ensemble for Network Intrusion Detection Using Heterogeneous Datasets. Secur. Commun. Netw. 2020, 2020, 4586875.
- Aljawarneh, S.; Aldwairi, M.; Yassein, M.B. Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model. J. Comput. Sci. 2018, 25, 152–160.
- Sharma, S.; Challa, R.K.; Kumar, R. An ensemble-based supervised machine learning framework for android ransomware detection. Int. Arab J. Inf. Technol. 2021, 18, 422–429.
- Devarajan, R.; Rao, P. An efficient intrusion detection system by using behaviour profiling and statistical approach model. Int. Arab J. Inf. Technol. 2021, 18, 114–124.
- Hnaif, A.; Jaber, K.; Alia, M.; Daghbosheh, M. Parallel scalable approximate matching algorithm for network intrusion detection systems. Int. Arab J. Inf. Technol. 2021, 18, 77–84.
- Aljanabi, M.; Ismail, M. Improved intrusion detection algorithm based on TLBO and GA algorithms. Int. Arab J. Inf. Technol. 2021, 18, 170–179.
- Tabash, M.; Allah, M.A.; Tawfik, B. Intrusion detection model using naive bayes and deep learning technique. Int. Arab J. Inf. Technol. 2020, 17, 215–224.
- Wang, K.; Wang, Y.; Zhao, Q.; Meng, D.; Liao, X.; Xu, Z. SPLBoost: An Improved Robust Boosting Algorithm Based on Self-Paced Learning. IEEE Trans. Cybern. 2021, 51, 1556–1570.
- Wang, C.; Du, J.; Fan, X. High-dimensional correlation matrix estimation for general continuous data with Bagging technique. Mach. Learn. 2022.
- Guo, H.W.; Hu, Z.; Liu, Z.B.; Tian, J.G. Stacking of 2D Materials. Adv. Funct. Mater. 2021, 31, 2007810.
- Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
- Hota, H.S.; Shrivas, A.K. Decision tree techniques applied on NSL-KDD data and its comparison with various feature selection techniques. In Advanced Computing, Networking and Informatics; Springer: Cham, Switzerland, 2014; Volume 1, pp. 205–212.
- Khammassi, C.; Krichen, S. A GA-LR wrapper approach for feature selection in network intrusion detection. Comput. Secur. 2017, 70, 255–277.
- Moon, S.H.; Kim, Y.H. An improved forecast of precipitation type using correlation-based feature selection and multinomial logistic regression. Atmos. Res. 2020, 240, 104928.
- Mohamad, M.; Selamat, A.; Krejcar, O.; Crespo, R.G.; Herrera-Viedma, E.; Fujita, H. Enhancing big data feature selection using a hybrid correlation-based feature selection. Electronics 2021, 10, 2984.
- Zhou, Y.; Cheng, G.; Jiang, S.; Dai, M. Building an efficient intrusion detection system based on feature selection and ensemble classifier. Comput. Netw. 2020, 174, 107274.
- Jaw, E.; Wang, X. Feature Selection and Ensemble-Based Intrusion Detection System: An Efficient and Comprehensive Approach. Symmetry 2021, 13, 1764.
- Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in network-based intrusion detection systems. Comput. Secur. 2022, 112, 102499.
- Tama, B.A.; Comuzzi, M.; Rhee, K.H. TSE-IDS: A Two-Stage Classifier Ensemble for Intelligent Anomaly-Based Intrusion Detection System. IEEE Access 2019, 7, 94497–94507.
- Aldallal, A.; Alisa, F. Effective intrusion detection system to secure data in cloud using machine learning. Symmetry 2021, 13, 2306.
- Pelletier, Z.; Abualkibash, M. Evaluating the CIC IDS-2017 Dataset Using Machine Learning Methods and Creating Multiple Predictive Models in the Statistical Computing Language R. Science 2020, 5, 187–191.
- Abbas, A.; Khan, M.A.; Latif, S.; Ajaz, M.; Shah, A.A.; Ahmad, J. A New Ensemble-Based Intrusion Detection System for Internet of Things. Arab. J. Sci. Eng. 2022, 47, 1805–1819.
- Pangsuban, P.; Nilsook, P.; Wannapiroon, P. A Real-time Risk Assessment for Information System with CICIDS2017 Dataset Using Machine Learning. Int. J. Mach. Learn. Comput. 2020, 10, 465–470.
- Gopalan, S.S.; Ravikumar, D.; Linekar, D.; Raza, A.; Hasib, M. Balancing Approaches towards ML for IDS: A Survey for the CSE-CIC IDS Dataset. In Proceedings of the ICCSPA 2020 - 4th International Conference on Communications, Signal Processing, and Their Applications, Sharjah, United Arab Emirates, 16–18 March 2021; Volume 2021-Janua.
- Mhawi, D.N. Proposed Hybrid Correlation Feature Selection Forest Panalized Attribute Approach to advance IDSs. Karbala Int. J. Mod. Sci. 2021, 7, 15.
- Sekulić, A.; Kilibarda, M.; Heuvelink, G.B.M.; Nikolić, M.; Bajat, B. Random forest spatial interpolation. Remote Sens. 2020, 12, 1687.
- Feng, Q.; Liu, J.; Gong, J. UAV Remote sensing for urban vegetation mapping using random forest and texture analysis. Remote Sens. 2015, 7, 1074–1094.
- Alkasassbeh, M. An empirical evaluation for the intrusion detection features based on machine learning and feature selection methods. J. Theor. Appl. Inf. Technol. 2017, 95, 5962–5976.
- Chen, S.; Webb, G.I.; Liu, L.; Ma, X. A novel selective naïve Bayes algorithm. Knowl.-Based Syst. 2020, 192, 105361.
- Huang, M.W.; Chen, C.W.; Lin, W.C.; Ke, S.W.; Tsai, C.F. SVM and SVM ensembles in breast cancer prediction. PLoS ONE 2017, 12, 161501.
- Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the support of a high-dimensional distribution. Neural Comput. 2001, 13, 1443–1471.
- Gou, J.; Qiu, W.; Yi, Z.; Shen, X.; Zhan, Y.; Ou, W. Locality constrained representation-based K-nearest neighbor classification. Knowl.-Based Syst. 2019, 167, 38–52.
- Thaseen, I.S.; Kumar, C.A.; Ahmad, A. Integrated Intrusion Detection Model Using Chi-Square Feature Selection and Ensemble of Classifiers. Arab. J. Sci. Eng. 2019, 44, 3357–3368.
- Ikram, S.T.; Cherukuri, A.K.; Poorva, B.; Ushasree, P.S.; Zhang, Y.; Liu, X.; Li, G. Anomaly Detection Using XGBoost Ensemble of Deep Neural Network Models. Cybern. Inf. Technol. 2021, 21, 175–188.
Class | Instances (CICIDS2017, Wednesday) |
---|---|
DoS Slowloris | 5499 |
DoS Slow-HTTP-test | 5796 |
DoS Hulk | 10,293 |
DoS GoldenEye | 230,124 |
Heartbleed | 11 |
Normal | 439,683 |
Total | 691,406 |
Attack | 251,723 |
CICIDS2017 |
---|
Port-Destination |
Flow-Duration |
FlowIATStd |
FlowIATMax |
Flow_IAT_Min |
Fwd_IAT_Total |
Fwd_IAT_Std |
Fwd_IAT_Max |
Fwd_IAT_Min |
BwdIATStd |
BwdIATMax |
BwdIATMin |
FwdPSHFlags |
MaxPacketLength |
PacketLengthMean |
PacketLengthStd |
PacketLengthVariance |
FINFlagCount |
SYNFlagCount |
PSHFlagCount |
ACKFlagCount |
IdleMean |
IdleMax |
IdleMin |
Destination_Port |
Flow_Duration |
PSHFlagCountID |
Bwd_Packet_Length_Max |
Bwd_Packet_Length_Max |
BwdIATStd |
Number of selected features | Accuracy (%) |
---|---|
13 | 74 |
20 | 78 |
25 | 83 |
30 | 99 |
35 | 98.9 |
40 | 98.5 |
45 | 98 |
50 | 96.9 |
55 | 96.3 |
60 | 96 |
65 | 94 |
70 | 93 |
78 | 90 |
Actual class | Predicted Positive | Predicted Negative |
---|---|---|
Positive | 443,615 | 10,650 |
Negative | 62,736 | 48,561 |
Actual class | Predicted Positive | Predicted Negative |
---|---|---|
Positive | 453,916 | 349 |
Negative | 369 | 110,928 |
Actual class | Predicted Positive | Predicted Negative |
---|---|---|
Positive | 437,550 | 16,715 |
Negative | 24,741 | 86,556 |
Features | Accuracy | FNR |
---|---|---|
13 | 0.87 | 0.123 |
30 | 0.99 | 0.0008 |
78 | 0.92 | 0.053 |
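The measures reported for the binary case follow directly from the four cells of a confusion matrix. The sketch below uses the cell values of the binary matrix that contains 453,916, 349, 369, and 110,928. Reading rows as the actual class and columns as the predicted class is an assumption about the table layout, and `binary_metrics` is a hypothetical helper name rather than part of the authors' code.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard rates derived from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn),       # true-positive rate (recall)
        "false_alarm_rate": fp / (fp + tn),     # false-positive rate
        "false_negative_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Cell values copied from the binary confusion matrix quoted above
# (rows assumed to be actual classes, columns predicted classes).
metrics = binary_metrics(tp=453_916, fn=349, fp=369, tn=110_928)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If the positive and negative roles are swapped, or rows and columns are read the other way round, the detection rate and false alarm rate change accordingly, so the orientation should be fixed before comparing against published figures.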
Actual class \ Predicted class | Normal | Bot | Brute Force | DDoS | DoS GoldenEye | DoS Hulk | DoS Slow-HTTP-test | DoS slow loris | FTP-Patator | Port-Scan | SSH-Patator | XSS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Normal | 453,761 | 50 | 0 | 0 | 2 | 274 | 3 | 0 | 1 | 174 | 0 | 0 |
Bot | 0 | 391 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Brute Force | 0 | 0 | 299 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DDoS | 10 | 0 | 0 | 25,595 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DoS GoldenEye | 8 | 0 | 0 | 0 | 2042 | 6 | 2 | 0 | 0 | 0 | 0 | 0 |
DoS Hulk | 21 | 0 | 1 | 2 | 1 | 45,999 | 0 | 0 | 0 | 1 | 0 | 0 |
DoS Slow-HTTP-test | 7 | 0 | 0 | 0 | 0 | 0 | 1091 | 2 | 0 | 0 | 0 | 0 |
DoS slow loris | 3 | 0 | 1 | 0 | 0 | 0 | 6 | 1149 | 0 | 0 | 0 | 0 |
FTP-Patator | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1584 | 0 | 0 | 0 |
Port-Scan | 1 | 0 | 3 | 0 | 0 | 4 | 0 | 0 | 0 | 31,752 | 0 | 1 |
SSH-Patator | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1174 | 0 |
XSS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 130 |
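Per-class precision, detection rate (recall), and F-score can be derived from a multi-class confusion matrix such as the one above. The sketch below copies the cell values from that table and assumes that rows are actual classes, columns are predicted classes, and the columns follow the same class order as the rows. `per_class_metrics` is a hypothetical helper, not the authors' evaluation code.

```python
LABELS = [
    "Normal", "Bot", "Brute Force", "DDoS", "DoS GoldenEye", "DoS Hulk",
    "DoS Slow-HTTP-test", "DoS slow loris", "FTP-Patator", "Port-Scan",
    "SSH-Patator", "XSS",
]

# Rows: actual class; columns: predicted class (same order as LABELS).
MATRIX = [
    [453761, 50, 0, 0, 2, 274, 3, 0, 1, 174, 0, 0],
    [0, 391, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 299, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [10, 0, 0, 25595, 0, 0, 0, 0, 0, 0, 0, 0],
    [8, 0, 0, 0, 2042, 6, 2, 0, 0, 0, 0, 0],
    [21, 0, 1, 2, 1, 45999, 0, 0, 0, 1, 0, 0],
    [7, 0, 0, 0, 0, 0, 1091, 2, 0, 0, 0, 0],
    [3, 0, 1, 0, 0, 0, 6, 1149, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0, 0, 0, 1584, 0, 0, 0],
    [1, 0, 3, 0, 0, 4, 0, 0, 0, 31752, 0, 1],
    [6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1174, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 130],
]


def per_class_metrics(labels, matrix):
    """Return {label: (precision, recall, f_score)} for a square matrix."""
    size = len(labels)
    results = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                             # row total
        predicted = sum(matrix[r][i] for r in range(size))  # column total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f_score)
    return results


scores = per_class_metrics(LABELS, MATRIX)
for label, (precision, recall, f_score) in scores.items():
    print(f"{label}: P={precision:.4f} DR={recall:.4f} F={f_score:.4f}")
```

Precision is computed over a column (everything predicted as the class) and detection rate over a row (everything that actually belongs to the class), so a class can have a perfect detection rate while its precision is lowered by misclassified samples from other classes.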
Class | Precision (%) | Detection Rate (%) | F-Score (%) |
---|---|---|---|
Normal | 99 | 99 | 100 |
Bot | 100 | 100 | 100 |
Brute Force | 100 | 100 | 100 |
Port-Scan | 99 | 99 | 99 |
DoS Slowloris | 99 | 99 | 99 |
DoS Slow-HTTP-test | 99 | 99 | 99 |
DoS Hulk | 99 | 99 | 99 |
DoS GoldenEye | 99 | 99 | 99 |
Heartbleed | 99 | 99 | 99 |
FTP-Patator | 99 | 99 | 99 |
SSH-Scan | 100 | 99 | 99 |
Reference | Feature Selection Method | Classification Method | No. of Features | Accuracy (%) | DR (%) | FAR |
---|---|---|---|---|---|---|
Zhou Y., et al. [19] | CFS_BA | Voting (C4.5, RF, ForestPA) | 10 | 98.4 | 99.1 | 0.15 |
Zhou Y., et al. [19] | CFS_BA | Voting (C4.5, RF, ForestPA) | 13 | 97.3 | 98 | 0.12 |
Jaw E. and Wang X. [20] | HFS-KODE | Voting (K-means, SVM, DBSCAN, and Expectation-Maximization; KODE) | 11 | 96.4 | 99 | 1.15 |
Jaw E. and Wang X. [20] | HFS-KODE | Voting (K-means, SVM, DBSCAN, and Expectation-Maximization; KODE) | 8 | 98.3 | 99 | 0.14 |
Jaw E. and Wang X. [20] | HFS-KODE | Voting (K-means, SVM, DBSCAN, and Expectation-Maximization; KODE) | 13 | 98 | 98 | 1.12 |
Gupta N., et al. [21] | Deep neural network | eXtreme Gradient Boosting | 41 | 99 | - | - |
Gupta N., et al. [21] | Deep neural network | eXtreme Gradient Boosting | 38 | 96 | - | - |
Gupta N., et al. [21] | Deep neural network | eXtreme Gradient Boosting | 78 | 92 | - | - |
Tama B., et al. [22] | Hybrid | Two-stage ensemble | 37 | 96.42 | - | - |
Pelletier Z. and Abualkibash M. [24] | NN | RF | 30 | 97.30 | 98 | - |
Thaseen I. S. and Ahmad A. [36] | Chi-square | Voting (SVM, MNB, Boosting) | - | 98.5 | 95 | 2.15 |
Ikram S. T., et al. [37] | DNN | XGBoost | - | 97.5 | 97 | - |
Proposed system | CFS_FPA | Voting (RF, NB, KNN, SVM) | 30 | 99.7 | 99.99 | 0.004 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mhawi, D.N.; Aldallal, A.; Hassan, S. Advanced Feature-Selection-Based Hybrid Ensemble Learning Algorithms for Network Intrusion Detection Systems. Symmetry 2022, 14, 1461. https://doi.org/10.3390/sym14071461