Proceeding Paper

An Effective Network Intrusion Detection System Using Recursive Feature Elimination Technique †

by Narendra Singh Yadav, Vijay Prakash Sharma *, D. Sikha Datta Reddy and Saswati Mishra
Department of Information Technology, Manipal University Jaipur, Jaipur 303007, Rajasthan, India
* Author to whom correspondence should be addressed.
Presented at the International Conference on Recent Advances on Science and Engineering, Dubai, United Arab Emirates, 4–5 October 2023.
Eng. Proc. 2023, 59(1), 99; https://doi.org/10.3390/engproc2023059099
Published: 21 December 2023
(This article belongs to the Proceedings of Eng. Proc., 2023, RAiSE-2023)

Abstract
Machine learning is an emerging research area, and researchers now apply it across domains to find optimal solutions. In cyber security, machine learning supports the development of intrusion detection systems (IDSs), which identify and classify cyber-attacks on a network. However, an exhaustive assessment and performance evaluation of the various machine learning algorithms remains unavailable. In this study, we introduce a framework for building a versatile and efficient IDS capable of identifying and categorizing unexpected and evolving cyber threats. This is achieved through Recursive Feature Elimination (RFE), in which the algorithm runs recursively until a selected number of features remains, enhancing efficiency and reducing computational cost. Rapid detection of these attacks facilitates the identification of potential intruders and limits the damage. We attained remarkable accuracies, with an average rate between 98% and 99% across all classifiers and against all four types of attacks. The random forest and decision tree models stood out, each achieving peak accuracies of 99% on both the KDD-99 and NSL-KDD datasets.

1. Introduction

An IDS is a software application designed to identify network intrusions using artificial intelligence techniques. It monitors an organization or system for malicious activity, safeguarding computer networks from unauthorized access, including potential insider threats. The task of intrusion detection learning is to develop a classifier (an analytical model) able to distinguish between a "bad" (intrusive) connection and a good (normal) connection [1,2].
Rapid advances in web and communication technologies have brought an enormous expansion in network size and the corresponding data. Consequently, many new attacks are being created, necessitating advancements in network security to accurately identify intrusions. Moreover, organizations cannot ignore the potential presence of intruders intending to launch various assaults from within. An intrusion detection system is one such tool that protects the network from potential intrusions by inspecting its traffic to guarantee confidentiality, integrity, and availability [1,2,3]. Despite tremendous efforts by researchers, IDSs still face difficulties in improving detection accuracy while lowering false-alarm rates and recognizing novel intrusions [1,2]. Recently, AI-, machine-learning-, and deep-learning-based IDS frameworks have been deployed as potential solutions for detecting intrusions across a network efficiently.

2. Literature Review

Numerous prior studies have posited that leveraging diverse feature selection techniques [4,5,6] and comparing a range of machine learning algorithms to ascertain the optimal fit can enhance the effectiveness of intrusion detection systems. Saranya T. et al. [4] argued that intrusion detection should be handled efficiently using big data techniques. Their work focused on comparing different techniques and classifying various attacks using machine-learning (ML) algorithms; ML is a learning process in which the system improves over time based on past experience. In addition to an artificial neural network, a number of classification algorithms, including principal component analysis (PCA), logistic regression, modified K-means, support vector machine (SVM), decision tables, and decision trees, were compared for effectiveness. The researchers also applied linear discriminant analysis (LDA), random forest (RF), and classification and regression trees (CART) for classifying intrusions [4]. They used performance metrics such as precision, accuracy, F-score, and recall to compare the algorithms.
Sharma N. et al. [5] used Recursive Feature Elimination to improve the efficiency of an IDS. They analyzed the functionality of machine-learning-based IDSs and then examined classification methods such as support vector machine, random tree, and decision tree. In their work, RFE played a crucial role in selecting the most valuable features and reducing the size of the dataset; hence, we apply RFE followed by different classifiers for analysis and comparison. Nkiama et al. [6] discussed how, in a large volume of traffic data, irrelevant features can degrade detection performance, so only the most relevant ones should be retained to produce the most accurate results. They followed the same process, using RFE to eliminate irrelevant features and a decision tree to classify the results. This yielded maximum accuracy, which led us to consider the decision tree one of the important classifiers in our work.
In this work, we utilize an anomaly-based intrusion detection system. Leveraging artificial intelligence and anomaly-based detection, we trained a detection framework to identify uniform patterns that represent the normal behavior of the system; all network activity is then compared against this baseline. An anomaly-based IDS typically operates by establishing a standard derived from the normal traffic and activities occurring in the network [7,8]. It is crucial to test the efficiency of the designed IDS model on various datasets; therefore, we employed two datasets, KDD-99 and NSL-KDD, which differ significantly in age and data composition [9]. After training our model, we test it on these datasets to simulate different attacks, i.e., intrusions or malicious activities initiated within our server or network that can cause substantial damage to data and software.
In this anomaly-based intrusion detection system, we aim to identify threats with the highest possible accuracy to enhance the system’s performance. We test the reliability and durability of our proposed model against all types of attacks [8], carry out a detailed analysis of machine learning classifiers, utilizing diverse publicly accessible benchmark datasets for malware, and suggest a robust and hybrid infrastructure that can monitor network traffic and host-level activities in real-time, offering early warnings about possible cyber-attacks. Enterprises can use these data to enhance their security measures or establish more proficient controls. Within substantial corporations, an intrusion detection system serves to pinpoint glitches or issues in network device setups [8,9,10].

3. Material and Method

We use the standard KDD-99 and NSL-KDD datasets in our model, eliminating outliers, inaccuracies, and irrelevant features using Recursive Feature Elimination (RFE), and then classify using random forest, decision tree, KNN, and naïve Bayes. Recursive Feature Elimination proves efficient with the datasets used by our models.

3.1. KDD Dataset

The KDD-99 dataset is a prevalent choice in research on intrusion detection systems (IDSs) and machine learning. It encompasses roughly 4.9 million individual connection records, each described by a set of 41 features.
It has separate files for training and testing: the training file, "KDDTrain+_20Percent", includes 25,192 instances, while the test file, "KDDTest+", encompasses 22,544 instances [11,12].

3.2. NSL-KDD Dataset

The NSL-KDD dataset is a refined and enhanced version of the KDD-99 dataset, serving as an effective standard mechanism for intrusion detection systems [5]. Although the KDD-99 datasets had certain shortcomings, the NSL-KDD dataset addressed these issues. The training dataset is similar to that of KDD-99, containing approximately 1,074,992 vectors. Each is characterized by 41 features. Every vector is designated as either “Normal” or “attack”, and each attack is further subdivided into DoS, U2R, R2L, and probing attacks [13,14].

3.3. Pre-Processing Data

The first step involves collecting data on normal behavior, which is then baselined over a period. Subsequently, this information is organized into a statistical profile based on various algorithms found in the knowledge base. In the KDD-99 dataset, scaling is used to preprocess the data: we extracted numerical features and scaled them to zero mean and unit variance, then converted the results back into a data frame. To separate the training and testing sets, we divided the columns, extracted and encoded the categorical attributes, and isolated the target column from the encoded data. In NSL-KDD, One-Hot-Encoding (one-of-K) is employed to transform categorical features into binary features [15].
A few features are categorical and have been transformed using a label encoder to convert them into numbers. Some features have very large values, which might excessively influence the result, so these features need to be scaled; the values of all features are brought into the range [0,n_values] [16,17].
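The encoding and scaling steps above can be sketched with scikit-learn. This is a minimal illustration on toy values, not the authors' exact pipeline; the feature names (`protocol`, `src_bytes`) are our own stand-ins for typical KDD attributes:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-ins for KDD features: one categorical, one numerical.
protocol = ["tcp", "udp", "tcp", "icmp"]
src_bytes = np.array([[181.0], [239.0], [5450.0], [0.0]])

# Categorical feature -> integer codes (classes are ordered alphabetically).
le = LabelEncoder()
protocol_codes = le.fit_transform(protocol)

# Numerical feature -> zero mean and unit variance, as described above.
scaler = StandardScaler()
src_scaled = scaler.fit_transform(src_bytes)

print(protocol_codes)                        # [1 2 1 0]
print(round(float(src_scaled.mean()), 6))    # 0.0
```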

3.4. Feature Selection

The main function of feature selection is to choose features that positively influence the results while removing those that generate inaccuracies. Features with a strong relationship to the target attribute are selected to improve both the model's performance and computational efficiency. We utilized Recursive Feature Elimination (RFE) under the wrapper method for feature selection [13]. In this method, the input data are divided into multiple subsets, various models are created on these subsets, and the best features are then selected based on performance metrics. Our features were ranked according to their importance using Recursive Feature Elimination, as illustrated in Figure 1. Recursive Feature Elimination is a process wherein the algorithm runs recursively until the specified number of features is selected [14]. For the KDD dataset, RFE is employed to obtain the reduced dataset; we extracted the top 10 features using the "n_features_to_select" parameter. In NSL-KDD, we eliminated redundant data by selecting a subset of relevant features.
To determine the strength of each individual feature's relationship with the target attribute, Patgiri R. et al. [16] applied univariate feature selection using the ANOVA F-test. They also used Recursive Feature Elimination (RFE) to choose the top 13 features from the NSL-KDD dataset via a different method that involved selecting the percentile of highest scores [13,14,16].
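The RFE procedure described above can be sketched as follows with scikit-learn on synthetic data (an assumed setup for illustration, not the authors' exact pipeline; the 41 synthetic features mirror the KDD feature count):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 41-feature KDD data.
X, y = make_classification(n_samples=300, n_features=41,
                           n_informative=8, random_state=0)

# Recursively drop the weakest feature (by the estimator's importances)
# until 10 remain, mirroring n_features_to_select=10 used for KDD.
selector = RFE(DecisionTreeClassifier(random_state=0),
               n_features_to_select=10)
selector.fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)  # (300, 10)
```

The `ranking_` attribute of the fitted selector gives each feature's elimination rank (1 = selected), which corresponds to the importance ranking plotted in Figure 1.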

3.5. Classifiers

3.5.1. Random Forest

As the name suggests, this classifier is composed of a group of decision trees that operate on various subsets of the given dataset and work collectively, aggregating their results to enhance overall accuracy, as illustrated in Figure 2 [3]. The random forest classifier is generally preferred because it requires less training time, yields accurate results, and maintains high accuracy even as the dataset grows.
Random forest operates based on two fundamental assumptions: although individual trees may sometimes produce incorrect output, their aggregated responses will be correct more often than not; and the presence of actual signal variables in the feature dataset will enable the classifier to generate accurate results [15].

3.5.2. Decision Tree

This classifier operates in a tree-structured manner, as in Figure 3, and serves both classification and regression tasks. We pose a question and apply the classification and regression tree (CART) algorithm, which yields answers in "yes/no" format. The decision tree mimics the human brain's approach to finding solutions, presenting them in an easy-to-understand tree-like structure; this intuitive representation explains its popularity as a classifier [13]. In building a decision tree, attribute selection is vital, guided by an attribute selection measure (ASM) with two main approaches: information gain and the Gini index. Information gain represents the change in entropy (a measure of impurity or randomness in the dataset) after segmenting the dataset based on an attribute, while the Gini index quantifies the purity or impurity present during the tree's generation.
Entropy measures the impurity (randomness) of a sample set:

Entropy(s) = −p(Y) · log₂ p(Y) − p(N) · log₂ p(N)

where s is the set of samples, p(Y) is the probability of "yes", and p(N) is the probability of "no".
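The two-class entropy can be computed directly; the `entropy` helper below is our own illustration, not code from the paper:

```python
import math

def entropy(p_yes: float, p_no: float) -> float:
    """Two-class entropy: -p(Y)*log2 p(Y) - p(N)*log2 p(N).

    Terms with zero probability contribute nothing (lim p*log p = 0).
    """
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(0.5, 0.5))  # 1.0 -- maximum impurity, 50/50 split
print(entropy(1.0, 0.0))  # 0.0 -- a pure node
```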

3.5.3. KNN

A non-parametric classifier, KNN (k-nearest neighbors) leverages proximity to carry out classifications. It assumes similarity between new and existing data points to categorize a new entry into the most similar available category, as illustrated in Figure 4. Being a non-parametric, lazy-learner algorithm, it retains all available data to readily compare and classify new cases based on their similarity to existing ones [18].
Like the decision tree and random forest classifiers, KNN operates through a similar five-step algorithm. The foremost advantage over other classifiers is its robustness to noisy data and effectiveness in handling large datasets. However, it can be somewhat costly to implement, and determining the optimal value of K can sometimes be complicated [18].
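A minimal scikit-learn sketch of KNN's lazy, proximity-based behavior on synthetic data (k=5 is an assumed value, not the paper's tuned parameter):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Lazy learner: fit() simply stores the training data. The work happens
# at predict time, when each query point is assigned the majority label
# of its k nearest stored neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

knn_acc = knn.score(X_te, y_te)
print(knn_acc)
```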

3.5.4. Naïve Bayes

This classifier operates based on Bayes' theorem to generate its results. Naïve Bayes is particularly suited to high-dimensional training datasets, where it is commonly used to classify texts [12].
Naïve Bayes functions by converting data into a frequency table and then creating a likelihood table using the probability of the features. Ultimately, the Bayes Theorem is used to determine the probability [16].
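A minimal sketch of the naïve Bayes step with scikit-learn (the Gaussian variant is an assumption here, chosen because the scaled KDD features are continuous):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Applies Bayes' theorem with a conditional-independence assumption
# between features; per-class feature likelihoods are modeled as
# Gaussians and combined into posterior class probabilities.
nb = GaussianNB()
nb.fit(X_tr, y_tr)

nb_acc = nb.score(X_te, y_te)
print(nb_acc)
```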
The flow chart of the proposed model is shown in Figure 5.

3.6. Evaluation Metrics

In a classification problem, the accuracy of the classifier is evaluated via the confusion matrix. In this paper, we calculate the confusion matrix and the F1 score; the F1 score is computed from the recall and precision values.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

TP (true positives): the model predicted these are attacks, and they are attacks.
TN (true negatives): the model predicted these are not attacks, and they are not attacks.
FP (false positives): the model predicted these are attacks, but they are not attacks.
FN (false negatives): the model predicted these are not attacks, but they are attacks.
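These metrics can be computed with scikit-learn; the tiny label vectors below are hypothetical, chosen only to make the counts easy to verify by hand:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = attack, 0 = normal (ground truth)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier output

# confusion_matrix for binary labels ravels in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(accuracy_score(y_true, y_pred))   # (3 + 3) / 8  = 0.75
```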

4. Results and Analysis

We tested the KDD-99 and NSL-KDD datasets against four types of attacks (i.e., DoS, U2R, R2L, and Probe) to evaluate accuracy, precision, recall, and F-measure. We utilized four classifiers (decision tree, random forest, KNN, and naïve Bayes) to assess the reliability of the model in defending against these attacks, and plotted graphs to visually present the outcomes. Results on the KDD dataset with all four classifiers are shown in Table 1. Table 2, Table 3, Table 4 and Table 5 show the accuracy against DoS, Probe, U2R, and R2L attacks on the NSL-KDD dataset.
In the KDD dataset, we achieved an average accuracy ranging from 98% to 99% across all the classifiers in countering all four types of attacks. In the case of the NSL-KDD dataset, the average accuracy attained fell between 98.1% and 99.8% for all the classifiers when pitted against all four attack types. Figure 6 shows a comparison between the accuracy of different classifier models on the KDD Dataset.

5. Conclusions

Finally, after running our datasets through four classifiers and utilizing RFE, we were able to achieve high accuracy. Both the decision tree and random forest classifiers, in conjunction with RFE, yield highly accurate results. However, to further enhance accuracy and facilitate a more effective comparison of results, we plan to design a hybrid model.

Author Contributions

N.S.Y. conceived the presented idea; D.S.D.R. and S.M. developed the theory, performed the computations, and carried out the experiment; V.P.S. suggested the methodology and verified the analytical methods. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article. No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agrawal, S.; Walke, P.; Pandit, S.; Nevse, S.; Deokule, S. Intrusion Detection System. Int. J. Sci. Res. Sci. Eng. Technol. 2020, 7, 13–16. [Google Scholar] [CrossRef]
  2. Kishore, R.; Chauhan, A. Intrusion Detection System a Need. In Proceedings of the 2020 IEEE International Conference for Innovation in Technology (INOCON), Bengaluru, India, 6–8 November 2020; pp. 1–7. [Google Scholar] [CrossRef]
  3. Maseer, Z.K.; Yusof, R.; Bahaman, N.; Mostafa, S.A.; Foozy, C.F. Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 2021, 9, 22351–22370. [Google Scholar] [CrossRef]
  4. Saranya, T.; Sridevi, S.; Deisy, C.; Chung, T.D.; Khan, M.A. Performance analysis of machine learning algorithms in intrusion detection system: A review. Procedia Comput. Sci. 2020, 171, 1251–1260. [Google Scholar] [CrossRef]
  5. Sharma, N.V.; Yadav, N.S. An optimal intrusion detection system using recursive feature elimination and ensemble of classifiers. Microprocess. Microsyst. 2021, 85, 104293. [Google Scholar] [CrossRef]
  6. Nkiama, H.; Said, S.Z.; Saidu, M. A subset feature elimination mechanism for intrusion detection system. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 148–152. [Google Scholar] [CrossRef]
  7. Rehman, E.; Haseeb-ud-Din, M.; Malik, A.J.; Khan, T.K.; Abbasi, A.A.; Kadry, S.; Khan, M.A.; Rho, S. Intrusion detection based on machine learning in the internet of things, attacks and counter measures. J. Supercomput. 2022, 78, 8890–8924. [Google Scholar] [CrossRef]
  8. Latah, M.; Toker, L. Towards an efficient anomaly-based intrusion detection for software-defined networks. IET Netw. 2018, 7, 453–459. [Google Scholar] [CrossRef]
  9. Kazim, M. Doreswamy, Machine Learning Based Network Anomaly Detection. Int. J. Recent Technol. Eng. 2019, 8, 542–548. [Google Scholar]
  10. Shah, R.A.; Qian, Y.; Kumar, D.; Ali, M.; Alvi, M.B. Network intrusion detection through discriminative feature selection by using sparse logistic regression. Future Internet 2017, 9, 81. [Google Scholar] [CrossRef]
  11. Mora-Gimeno, F.J.; Mora-Mora, H.; Volckaert, B.; Atrey, A. Intrusion detection system based on integrated system calls graph and neural networks. IEEE Access 2021, 9, 9822–9833. [Google Scholar] [CrossRef]
  12. Ding, Y.; Zhai, Y. Intrusion detection system for NSL-KDD dataset using convolutional neural networks. In Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, Las Vegas, NV, USA, 12–14 December 2018; pp. 81–85. [Google Scholar]
  13. Lian, W.; Nie, G.; Jia, B.; Shi, D.; Fan, Q.; Liang, Y. An intrusion detection method based on decision tree-recursive feature elimination in ensemble learning. Math. Probl. Eng. 2020, 2020, 1–5. [Google Scholar] [CrossRef]
  14. Ustebay, S.; Turgut, Z.; Aydin, M.A. Intrusion detection system with recursive feature elimination by using random forest and deep learning classifier. In Proceedings of the 2018 international Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT), Ankara, Turkey, 3–4 December 2018; pp. 71–76. [Google Scholar]
  15. Sharma, V.P.; Yadav, N.S.; Adavi, S.S.; Reddy, D.S.D.; Gupta, B.B. A two stage hybrid intrusion detection using genetic algorithm in IoT networks. J. Discret. Math. Sci. Cryptogr. 2023, 26, 667–676. [Google Scholar] [CrossRef]
  16. Patgiri, R.; Varshney, U.; Akutota, T.; Kunde, R. An investigation on intrusion detection system using machine learning. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 1684–1691. [Google Scholar]
  17. Lama, A.; Savant, P. A survey on network-based intrusion detection systems using machine learning algorithms. Int. J. Eng. Appl. Sci. Technol. 2022, 6, 225–230. [Google Scholar] [CrossRef]
  18. Liao, Y.; Vemuri, V.R. Use of k-nearest neighbor classifier for intrusion detection. Comput. Secur. 2002, 21, 439–448. [Google Scholar] [CrossRef]
Figure 1. Features and their importance in KDD Dataset.
Figure 2. Flow chart for random forest classification [9].
Figure 3. Flow chart for decision tree classification [10].
Figure 4. Distribution of points on a graph in KNN [2].
Figure 5. Block diagram of the proposed model.
Figure 6. Accuracies of models on KDD dataset.
Table 1. Result of models on KDD dataset.

Classifier       Accuracy   Precision   Recall   F1-Score
Decision Tree    99%        99%         99%      99%
Random Forest    99%        99%         99%      99%
KNN              99%        99%         99%      99%
Naïve Bayes      88%        93%         88%      89%
Table 2. Result of DoS attack in NSL-KDD dataset.

Classifier       Accuracy   Precision   Recall   F1-Score
Decision Tree    99.6%      99.5%       99.6%    99.5%
Random Forest    99.8%      99.8%       99.7%    99.7%
KNN              99.7%      99.6%       99.6%    99.6%
Naïve Bayes      86.7%      98.8%       70%      82.1%
Table 3. Result of probe attack in NSL-KDD dataset.

Classifier       Accuracy   Precision   Recall   F1-Score
Decision Tree    99.5%      99.3%       99.2%    99.3%
Random Forest    99.6%      99.5%       99.3%    99.4%
KNN              99%        98.6%       98.5%    98.5%
Naïve Bayes      97.8%      97.3%       96%      96.6%
Table 4. Result of U2R attack in NSL-KDD dataset.

Classifier       Accuracy   Precision   Recall   F1-Score
Decision Tree    99.6%      86.2%       90%      88.2%
Random Forest    99.7%      96.1%       88.7%    91.7%
KNN              99.7%      93.1%       85%      87%
Naïve Bayes      97.2%      60%         97.9%    66%
Table 5. Result of R2L attack in NSL-KDD dataset.

Classifier       Accuracy   Precision   Recall   F1-Score
Decision Tree    97.9%      97.1%       96.9%    97%
Random Forest    98.1%      97.5%       97.2%    97.4%
KNN              96.7%      95.2%       95.4%    95.3%
Naïve Bayes      93.5%      89%         95.5%    91.6%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
