Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments

Alabdulwahab, Saleh; Kim, Young-Tak; Seo, Aria; Son, Yunsik

doi:10.3390/app131910951

¹ Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Republic of Korea

² Department of Biomedical Sciences, Korea University College of Medicine, Seoul 02841, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(19), 10951; https://doi.org/10.3390/app131910951

Submission received: 25 August 2023 / Revised: 28 September 2023 / Accepted: 2 October 2023 / Published: 4 October 2023

(This article belongs to the Special Issue Technologies and Services of AI, Big Data, and Network for Smart City)

Abstract

Networks within the Internet of Things (IoT) contain some of the most targeted devices due to their lightweight design and the sensitive data exchanged through smart city networks. One way to protect a system from attack is to use machine learning (ML)-based intrusion detection systems (IDSs), which can significantly improve classification performance. Training ML algorithms requires a large network traffic dataset; however, capturing attacks requires large storage and months of recording, which is costly for IoT environments. This study proposes an ML pipeline that uses the conditional tabular generative adversarial network (CTGAN) model to generate a synthetic dataset. The synthetic dataset was then evaluated using several statistical and ML metrics. Using a decision tree, the accuracy on the generated dataset reached 0.99, and its lower complexity was reflected in a training time of 0.05 s and a testing time of 0.004 s. The results show that the synthetic data accurately reflect the real data and are less complex, making them suitable for IoT environments and smart city applications. Thus, the generated synthetic dataset can be used to train further models that secure IoT networks and applications.

Keywords: intrusion detection system; machine learning; information security; IoT; CTGAN; advanced persistent threat

1. Introduction

Researchers have enhanced intrusion detection systems (IDSs) using new detection approaches, such as ML-based IDSs, which use machine learning (ML) techniques to differentiate between normal and attack packets. ML techniques can learn from large datasets and capture the behaviors and patterns of data anomalies [1]. An ML-based IDS uses classification models to learn patterns from labeled datasets. However, large datasets and complex algorithms require powerful devices. Hence, researchers must use data preprocessing and feature selection methods to reduce the complexity of ML algorithms, although reducing complexity may compromise the accuracy of the classification process. Therefore, combining a deep learning data generator model with ML classification algorithms could provide privacy and wide attack coverage while achieving high performance with low model training and testing times, which suits the lightweight devices adopted by Internet of Things (IoT) networks.

IoT devices are low-energy and lightweight, and they dedicate most of their available energy and computation to executing core application functionality [2]. Hence, it is challenging for the IoT environment to collect and store network data for training ML algorithms. Message queuing telemetry transport (MQTT) is a popular IoT protocol that connects devices and allows communication between them. However, security issues with MQTT brokers must be addressed to avoid network exposure and vulnerabilities to attacks from cyber criminals who aim to obtain sensitive data. The MQTT protocol is lightweight, which makes it easy to implement in embedded systems; however, the MQTT broker’s limited memory complicates its ability to handle encryption methods. Moreover, MQTT has weak authentication by default: usernames and passwords are sent in plain text. An attacker on the network can intercept these credentials and then subscribe to any topic on the MQTT server. Another challenge faced by IoT devices is malware, which exploits configuration weaknesses of the MQTT protocol to subscribe to all messages and send them to a third party. IoT devices hold essential resources that are exchanged among networks, so researchers must improve protection methods and develop processes to protect these systems [3].

The large number of connected devices and the volume of data transmitted in the IoT environment can be exploited by advanced persistent threat (APT) groups to hide inside an IoT network and avoid detection [4]. In an insider attack scenario, the APT gains control of an authorized device via social engineering, scanning, and brute forcing (see Figure 1). The multistage nature of an APT attack makes it difficult to capture datasets that cover every stage of an intrusion [5]. These attacks aim to maintain long-term access inside the network, control the system, and gather information over time [6]. Consequently, security and privacy solutions for IoT, particularly ML algorithms, need further investigation to overcome the multistage nature of APT attacks. The selection of proper ML algorithm pipelines and fine-tuning parameters is essential for achieving maximum data exchange protection in IoT networks.

Although researchers know that some ML algorithm pipelines are superior in intrusion detection accuracy, these algorithms require more training and testing time on large datasets. Moreover, researchers have documented that model-building time is a crucial aspect of predicting real-life intrusions [7]. A delay in the IDS can compromise IoT networks over a lengthy period before raising an alert. Therefore, timely classification is vital to guarantee the rapid detection of attacks and ensure that the IDS has a fast data stream monitoring capability. Time plays a crucial role in detecting attacks on high-speed IoT networks with large volumes of data. Thus, a practical real-time IDS is desirable for detecting intrusion. The scientific literature has presented several effective ML algorithm pipelines for protecting networks, including search methods, feature reduction, selection methods, and classifiers [8,9].

The MQTT protocol is generally adopted for smart IoT networks. The MQTT dataset can be used to train ML algorithms to protect IoT networks [10]. Combining IoT datasets can identify an appropriate ML algorithm pipeline that is fully designed to protect smart IoT networks in simulation and real-life scenarios. However, capturing traffic from a real-time IoT environment can cause privacy issues and expose the system’s credentials. In addition, this is a time-consuming process. Moreover, the detection performance is limited to a specific dataset depending on the feature [11]. Therefore, there is a need to conduct further studies on the performance of ML algorithms and dataset generation to identify the most efficient pipelines for search methods, feature selection approaches, and classifiers. Such studies can reduce the model-building time without compromising the intrusion detection accuracy in an IDS to protect smart IoT networks.

In this study, we suggest training the conditional tabular generative adversarial network (CTGAN) model on datasets that contain numerous attacks in order to generate a synthetic dataset for the IoT environment. Such synthetic datasets avoid exposing companies’ private data and system credentials and save time when collecting testing and training data. Moreover, they can address the challenges of collecting network data, which requires large storage and long recording periods and is expensive for the IoT environment [12]. According to previous research, CTGAN works well with datasets that mix continuous and discrete values, which is the case for network datasets [13]. In the context of IoT, where various devices and scenarios lead to varied data distributions, conditional generation provides a valuable advantage [14]. The method proposed in this study could be a solution for obtaining an IoT-focused dataset and solving issues when collecting training data.

1.1. Research Contribution and Scope

The contributions of the proposed methodology are as follows: First, the method forms a comprehensive attack dataset for the IoT. Second, it applies CTGAN to this comprehensive IoT attack dataset. Third, using a synthetic dataset reduces the time and cost of capturing and recording a dataset that requires large storage, which helps address the challenges of collecting network data in the IoT environment. Fourth, the features are universalized among several datasets; thus, the classifier’s performance is not limited to one specific dataset and its particular feature set [11]. Fifth, the synthetic dataset is more balanced, thereby increasing the classification performance. Sixth, feature selection and preprocessing reduce the time complexity of the classification process. Seventh, the synthetic dataset replaces the real dataset to improve privacy.

1.2. Research Structure

The remainder of this paper is organized as follows: Section 2 reviews studies related to ML-based IDSs, ML pipeline results, and detection approaches. Section 3 details the methodology’s steps and the tools used to obtain the results. Section 4 discusses the results and the performance metrics used to verify the generation tool and the performance of ML methods on the generated dataset. Finally, Section 5 summarizes the paper and the results and then discusses future work and further research.

2. Literature Review

In information security businesses, ML algorithm pipelines are typically used to improve the prediction model’s performance in terms of improving intrusion detection accuracy, reducing model-building time, and increasing the cost-effectiveness of the IDS in IoT networks. Researchers have identified several ML algorithm pipelines using various datasets, including NSL-KDD, UNSW-NB15, NIMS, IoT-23, TON-IoT, AWID, and CIC-IDS2017. These datasets neither focus on IoT attacks nor cover all IoT attacks, such as MQTT attacks. Thus, the identified datasets may not be suitable for safeguarding IoT communication and information in networks.

2.1. ML-Based IDS Studies

Soe et al. [15] provided a lightweight ML-based IDS with a new feature selection algorithm implemented on Raspberry Pi. This feature selection algorithm, called correlated set thresholding on the gain ratio (CST-GR), was designed to select the necessary features and create fast classifiers. The CST-GR algorithm was run on the Bot-IoT dataset collected from an IoT environment and evaluated using the Weka tool. The authors concluded that the IDS is lightweight enough for implementation in IoT environment devices, such as Raspberry Pi, without compromising detection performance, achieving a 0.99 true positive rate (TPR), an 8.61 s training time, and a 0.81 s testing time. This study experimented in an environment that represents the IoT network; however, it did not consider or compare other options for feature selection evaluators. Moreover, the dataset statistics were unbalanced, with a significant gap between the attack classes.

Zhou et al. [16] proposed an IDS framework based on feature selection and ensemble classifiers. This study used CFS-BA to reduce dimensionality. It combines C4.5, RF, and forest by penalizing attribute (ForestPA) algorithms for detection accuracy when run on the NSL-KDD, AWID, and CIC-IDS2017 datasets. The research revealed that the proposed CFS-BA-ensemble feature selection technique with the ForestPA-ensemble classifier showed a stronger performance compared to related approaches. The authors’ approach achieved an accuracy of 0.99 and 36.28 s training time on NSL-KDD, 0.99 accuracy and 92.62 s training time on AWID, and 0.99 accuracy and 98.42 s training time on CIC-IDS2017 datasets.

Rahman et al. [17] suggested a structural design for an effective IDS for an IoT network of resource-constrained devices. They proposed two methods, semi-distributed and distributed, combined with preprocessing and feature selection methods to solve centralized IDS limitations for resource-constrained devices. The authors reported that applying the semi-distributed method to the AWID dataset achieved a detection accuracy of 0.99 and a long CPU time (186.26 s). In comparison, the distributed method achieved the lowest CPU time (73.52 s) with respect to building the model, with a detection accuracy of 0.97. The authors stated that further investigation using other representative dataset options and feature selection methods is required for a scalable ML-based IDS in IoT network environments. However, none of these studies focused on reducing model-building time using ML methods, such as feature selection or preprocessing, which is important for lightweight IoT devices. Moreover, the used datasets did not focus on IoT environments.

Furthermore, Binbusayyis and Vaiyapuri [8] attempted to identify the most appropriate feature evaluators for building an effective IDS. They studied four filter-based feature evaluation measures: consistency, correlation, information, and distance. They applied feature evaluators to different ML algorithms and ran them on NSL-KDD/UNSW-NB15 datasets. The authors reported that the random forest (RF) classifier yielded the best results with all feature evaluators. However, this study focused on a general network rather than an IoT network environment. Moreover, the detection performance is limited to a specific dataset depending on the features, and one cannot apply the ML model to other datasets [11]. Additionally, Somwang and Lilakiatsakun [18] proposed an anomaly-based IDS using a hybrid algorithm of supervised and unsupervised learning schemes on the KDDcup99 dataset. The feature selection method integrated principal component analysis (PCA) with a support vector machine (SVM), selecting 10/41 features. The detection rate of the proposed method using PCA with fuzzy adaptive resonance theory (FART) was 0.97. However, KDDcup99 is already known as a general network attack dataset and does not focus on the IoT network environment.

Sindhu et al. [19] suggested a lightweight IDS for multiclass categorization. Feature selection was performed with a wrapper-based genetic algorithm, and classification used a hybrid of a neural network and a decision tree. In this study, KDD datasets were used, with the min–max method for normalization. The highest detection rate, evaluated using WEKA, was 0.98. In addition, Setiawan et al. [20] proposed an IDS model using a pipeline of feature selection, normalization, and SVM. They used an information gain (IG) filter for feature selection and selected 17 of the 41 features in the NSL-KDD dataset. The overall accuracy evaluated by WEKA was 0.99, with a training time of 56.603 s and a testing time of 2.094 s. These studies showed high performance in terms of classification accuracy. However, the datasets used are outdated. Therefore, new dataset options must be identified to represent new attacks.

In addition, Rashid et al. [21] evaluated the performance of several ML algorithms (linear regression, support vector machine, decision tree, random forest, artificial neural networks, K-nearest neighbor, bagging, boosting, and stacking) on the CICIDS2017 and UNSW-NB15 datasets using 10-fold cross-validation, with the information gain ratio used to select the top 25 most relevant features. The study reported that the proposed method can efficiently identify IoT cyber-attack threats in a smart city. The stacking ensemble model outperformed comparable models in detecting cyber-attacks in smart city systems, achieving 0.96 accuracy, a 25.6 s training time, and a 5.70 s testing time.

Joloudari et al. [22] reported that APT attacks can be detected using ML. Furthermore, the models can be trained using a record of the network’s traffic. The authors used the NSL-KDD dataset to train the ML models. They used 10-fold cross-validation to test the models. The ML models they used included the C5.0 decision tree, Bayesian network, and deep learning to classify APT attacks. Deep learning yielded the best results among the classification models (0.98 accuracy). However, using the NSL-KDD dataset for an APT attack is inefficient due to the lack of attacks and stages that represent the nature of APTs.

Shang et al. [23] designed a network based on network flow to detect command and control channels. These channels are used to control the host remotely in order to steal data and spy on the network. The authors reported that APT attacks are similar in their intrusion tools and services, so hidden shared attributes could help find unknown malware. To identify these hidden attributes, the authors used deep learning methods for mining. Subsequently, they trained and tested several classifiers on the Contagio blog malware database. They found that detecting unknown malicious network flows using gradient-boosted decision trees achieved the best result, with an F1 score of 0.968.

Chizoba and Kyari [24] analyzed APT attacks using network traffic logs. They reported that using an ensemble of classifiers and the majority voting method can yield better results for stealthy APT attacks. The authors trained and tested an ensemble of support vector machine, RF, and decision tree classifiers and found that the proposed ensemble of classifiers showed a 0.90 detection accuracy. Nevertheless, they did not focus on the privacy perspective using synthetic datasets isolated from private systems. Moreover, collecting APT attacks requires several months of recording and storage [25]. CTGAN can solve privacy and storage limitations. In addition, Myneni et al. [26] proposed a dataset focused on an APT attack called DAPT2020, which collected a dataset using a real-world network and attackers to capture multistage and different attack vectors. They then used semi-supervised learning to benchmark anomaly detection performance. However, it is important to include IoT network attacks and focus on classification time and accuracy. Moreover, using deep learning generators can improve the balancing of instances and is less expensive when collecting datasets. Safa et al. [27] proposed a Health Care Big Data Analytics (HCBDA) model to predict cardiac disease using IoT Devices. The IoT devices are attached to the patient’s body to collect vital signals, which are transferred through several intermediate nodes to the monitoring system in order to build big data to generate intelligence. In the proposed model, route selection is executed based on the values of the trusted forwarding weight (TFW) and trusted carrier weight (TCW), which are fed to a decisive support system. The decisive support system cluster generates big data using feature depth similarity (FDS) clustering algorithms; then, a classification process is carried out by measuring the feature disease class similarity (FDCS). The proposed method’s accuracy in predicting cardiac disease reaches 96%. 
However, the dataset and model used are suited to disease prediction, not to the detection of network intrusion attacks.

2.2. ML-Based IDS Using GAN Methods

Although the identified ML algorithm pipelines obtained a very high level of accuracy, the IDS still could not be properly trained using insufficient and imbalanced datasets. The size of the abnormal/attack data is too small compared to the normal data in most published datasets, creating a data imbalance problem during training. Furthermore, the distribution of the data space cannot be apparent to the IDS due to some missing data, even if there is a sufficient amount of data. Generative adversarial networks (GANs) are used to generate new synthetic similar samples in order to solve these problems. Shahriar et al. [28] implemented a GAN-IDS framework model for the NSL-KDD’99 dataset. They reported that the GAN-IDS framework predicted better accuracy than an independent IDS. However, further investigation is required to elaborate on issues and limitations such as centralization, computational expenses, and the time-consuming nature of the GAN-IDS framework. Liu et al. [29] considered the imbalance and high dimensionality of datasets in intrusion detection and suggested an oversampling intrusion detection technique based on GAN and feature selection. They oversampled the rare classes of attack samples using GAN to generate a rebalanced low-dimensional dataset for ML training and improved the efficiency of intrusion detection. They implemented this approach on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets and reported effective improvements in the detection performance of ML models. Additionally, Lin et al. [30] suggested a GAN framework called IDSGAN for adversarial malicious attack generation against an IDS. An IDSGAN generates malicious feature records to attack the network by deceiving and evading detection. It can attack real-time IDS models powered by several ML algorithms and maintain malicious traffic functionalities. 
They implemented IDSGAN on the NSL-KDD dataset and reported that it effectively generated adversarial malicious traffic records of different attack classes, lowering the detection rates of various IDS models to approximately 0%.

Kumar and Sinha [31] proposed a synthetic IDS dataset generation model using a Wasserstein conditional generative adversarial network with an XGBoost classifier, and they used the feature reduction method based on a deep autoencoder. The applied datasets were NSL-KDD, UNSW-NB15, and BoT-IoT; the F1 score for each dataset was 0.96, 0.81, and 0.99, respectively. However, the quality of the WCGAN model on the ML classifiers was assessed using a mix of the actual and generated samples instead of using only the generated samples for training. Moreover, the features were not unified among all datasets, and the high detection performance was limited to the NSL-KDD dataset. Additionally, although the IoT environment was one of the main focuses of the study, the reduction in training and testing times has not been considered.

Strickland et al. [32] presented deep reinforcement learning using the NSL-KDD dataset generated by conditional and unconditional CTGAN and CopulaGAN. They found that CTGAN had the highest accuracy (0.85) among the studied GANs. However, feature selection and time reduction were not considered in the methodology. In addition, the generated dataset was not assessed by using it for training while testing on the real dataset, an evaluation that may provide a deeper understanding of whether the generated dataset can replace the real dataset.

The abovementioned studies identified several ML algorithm pipelines using various datasets. However, these datasets neither focus on IoT attacks nor cover all IoT attacks, such as MQTT attacks. Thus, they may not be appropriate for safeguarding IoT networks. To address this problem, Vaccari et al. [33] created a new dataset, MQTTset, that focuses on the MQTT protocol, which is widely adopted in IoT networks. They demonstrated that MQTTset can be used to train ML algorithms to implement IDSs in order to protect IoT networks. However, further investigation of ML algorithm pipelines and GANs is still required to improve IDS performance in protecting IoT networks. MQTT-IoT-IDS2020 has been reported to be an imbalanced dataset [10]: the percentage of attack instances varies widely, which creates an imbalance problem during training. Thus, studies are needed that use GANs to generate new, similar synthetic samples of MQTT-IoT-IDS2020 for ML training and to improve IDS performance in safeguarding IoT networks.

2.3. Literature Review Summary

Previous studies have shown that the success of ML algorithms depends on the quality of data. Current real-life network activities are characterized by a massive volume of high-dimensional data with irrelevant and redundant information, which negatively impacts the detection accuracy and model-building time of IDSs. Thus, it is crucial to identify or propose an ML algorithm pipeline that is effective in terms of both intrusion detection accuracy and model-building time by reducing data dimensionality and improving the data balance. Moreover, it is essential to have a well-structured benchmark dataset suitable for IoT networks to train ML algorithm models in order to identify malicious attacks on networks [33].

Additionally, the dataset should represent a large number of attacks when we wish to discover them. Currently (and in the simulation lab), most ML algorithm models are trained on general datasets, such as KDDCUP99, NSL-KDD, and UNSW-NB15. However, they are rarely suitable for IoT networks due to the limited support for dedicated protocols used in IoT networks. Even the datasets used in the simulation lab and those related to IoT (i.e., N-BaIoT, IoT-23, TON-IoT, and CICIDS2017) are not sufficiently comprehensive to cover the need for IoT networks in real-life attack situations. Moreover, HCBDA is used for disease prediction and cannot be used for IoT network intrusion attacks. Furthermore, in the field of IDS and GANs, researchers have not considered some of the critical aspects, such as the training and testing time performance of the generated dataset; generating IoT-focused datasets; unifying features across multiple datasets; applying feature selection methods; the adoption of train on synthetic, test on real data (TSTR) as a performance measure; and including statistical performance metrics, leaving significant contribution gaps for further research and investigation.
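
The TSTR measure mentioned above can be illustrated with a minimal sketch. It assumes a toy nearest-centroid classifier and made-up two-feature records; the classifier, the feature values, and all function names are hypothetical and are not taken from the study. The only point shown is the protocol itself: the model is fitted on synthetic records alone and scored on real records alone.

```python
from collections import defaultdict


def fit_centroids(rows, labels):
    """Return one mean feature vector (centroid) per class label."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, rows):
    """Assign each row to the class whose centroid is closest (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda c: dist(centroids[c], row)) for row in rows]


def tstr_accuracy(synth_rows, synth_labels, real_rows, real_labels):
    """Train on synthetic records only, then score on real records only."""
    centroids = fit_centroids(synth_rows, synth_labels)
    predicted = predict(centroids, real_rows)
    correct = sum(p == t for p, t in zip(predicted, real_labels))
    return correct / len(real_labels)


# Hypothetical toy data standing in for two numeric traffic features.
synthetic_X = [[1.0, 0.9], [1.2, 1.1], [5.0, 5.2], [4.8, 5.1]]
synthetic_y = ["normal", "normal", "attack", "attack"]
real_X = [[0.8, 1.0], [5.1, 4.9], [1.1, 1.3], [4.7, 5.3]]
real_y = ["normal", "attack", "normal", "attack"]

score = tstr_accuracy(synthetic_X, synthetic_y, real_X, real_y)
```

A TSTR score close to the score obtained when training on real data would suggest that the synthetic records carry the same class structure as the real ones.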

Thus, as shown in Figure 2, this study was designed to select diverse datasets to cover a broader number of attacks from the TON-IoT, MQTT-IoT-IDS2020, and Bot-IoT datasets with 12 attacks and a normal traffic dataset. Then, these datasets are applied as input for generating a synthetic dataset using GAN. This study also aims to identify the appropriate pipeline of ML algorithms for detecting attacks with the highest detection accuracy and shortest time for the development of an efficient IDS in protecting the IoT network. Table 1 summarizes the reviewed ML algorithm studies that focused only on training and testing times. It contains the proposed ML algorithm approaches and the selected datasets. It compares their efficiencies in terms of accuracy, training time, and testing time.

3. Generating a Lightweight IoT Dataset

This study applies a pipeline that includes feature extraction, preprocessing, CTGAN, filter-based feature selection, and ML classifiers. CTGAN is a type of generative adversarial network used to generate synthetic tabular data. It comprises two parts: a generator and a discriminator. The generator learns to create realistic data, while the discriminator learns to differentiate real data from fake data. Both are trained adversarially, competing against each other until they reach an equilibrium. Moreover, for differential privacy, the generator has no access to the dataset [34].

The methodology applies CTGAN to generate a new dataset that represents the nature of multistage attacks, which can help solve the capture and storage issues that come with slow and stealthy attacks such as APTs. In addition, some attacks appear more often than others, leading to an imbalanced dataset that affects the classification performance. The objective used for the CTGAN generator and discriminator is as follows:

\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z \mid y) \mid y))] - \lambda H(G(z \mid y) \mid y)

(1)

where x is the real data, y is the condition, z is the noise vector, D is the discriminator, G is the generator, p_data is the real data distribution, p_z is the noise distribution, λ is a hyperparameter, and H is the conditional entropy.
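
To make the two expectation terms of Equation (1) concrete, the following sketch estimates them from batches of discriminator outputs. It is an illustration under stated assumptions: the function name and the probability values are hypothetical, and the entropy term λH is left out because the text does not specify how H is estimated.

```python
import math


def conditional_gan_value(d_real, d_fake):
    """Estimate the two expectation terms of Equation (1) from batch outputs.

    d_real: discriminator outputs D(x|y) on real rows, each in (0, 1)
    d_fake: discriminator outputs D(G(z|y)|y) on generated rows, each in (0, 1)
    The entropy regularizer (lambda * H) is intentionally omitted.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated rows outputs 0.5 everywhere.
confused = conditional_gan_value([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates them well scores real rows high and fake rows low.
confident = conditional_gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator tries to raise this value while the generator tries to lower it, so the value for the confused discriminator (2 log 0.5) is lower than the value for the confident one.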

The generated dataset is used to train and test ML classifiers to differentiate between normal and attack traffic, which can help build an efficient ML-based IDS. All classifiers and feature selection methods were implemented using scikit-learn, which is an open-source ML library [35]. Classification performance requires high accuracy and low training and testing times. The experimental setup focused on the training and testing times because the datasets were collected and used for IoT environments. IoT devices have low CPU performance; therefore, an ML pipeline with lower training and testing times and no loss of accuracy is a suitable pipeline for the IoT environment.

3.1. Dataset Generation

This study used the CTGAN model to generate synthetic data that resemble real data. CTGAN is a deep learning model that estimates a generative model using an adversarial approach. Such adversarial models have shown promising results in the generation of text, images, and tables. CTGAN uses two neural networks, a generator and a discriminator, which operate together to produce outputs of improved quality. The generator is trained using real-world data to produce synthetic data that resemble the real data, and the discriminator checks the generated data to determine whether they are real.

CTGAN is implemented in the SDV open-source library [36]. Proposed in [34], CTGAN is a deep learning model that uses a conditional generator to synthesize tabular datasets that are similar to real datasets. To validate the dataset generated by CTGAN, classifier efficiency metrics, which SDV [37] provides as single-table metrics, are used to reveal whether the generated dataset can replace the real dataset.

CTGAN was employed on a comprehensive IoT attack dataset to generate a new intrusion dataset. The comprehensive IoT attack dataset is a collection of the TON-IoT, MQTT-IoT-IDS2020, and Bot-IoT datasets and covers more attack stages.

The TON-IoT is a dataset collected from a realistic IoT environment. This dataset was presented by UNSW Canberra Labs to solve IoT dataset challenges for an efficient IDS [38]. The MQTT-IoT-IDS2020 dataset focuses on the MQTT protocol, which is widely adopted by IoT environments. The dataset contains MQTT-based attacks, which comprise aggressive scan (Scan A), user datagram protocol (UDP) scan (Scan sU), Sparta SSH brute force (Sparta), and MQTT brute-force attack (MQTT BF) [10]. The last dataset used is Bot-IoT, which is collected from legitimate and simulated IoT network traffic. It includes information gathering, denial of service, and information theft attacks [39].

The features were extracted from the raw PCAP files of these datasets using “tshark,” a tool for capturing and reading network packets [40]. This tool allows each feature to be defined individually in order to identify commonalities. The selected features were universalized among the datasets; therefore, the detection performance was not limited to one specific dataset [11]. The CTGAN model was then trained with 50 to 100 epochs, and the epoch count with the highest score on the statistical verification metric was selected, in order to generate a dataset that contains several attacks and stages and represents the real dataset. The generator and discriminator’s learning rate was set to 20,000 and their dimensions were (256, 256) (Table 2).
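
One plausible reading of "universalizing" features is to keep only the fields that every source dataset provides. The sketch below shows that reading; it is an assumption rather than the study's documented procedure, and the field lists and the sample record are hypothetical (they merely follow the naming style of tshark fields).

```python
def common_features(*feature_lists):
    """Return the features present in every dataset, in the order of the first list."""
    shared = set(feature_lists[0])
    for features in feature_lists[1:]:
        shared &= set(features)
    return [name for name in feature_lists[0] if name in shared]


def project(record, features):
    """Keep only the shared features of one extracted packet record."""
    return {name: record[name] for name in features}


# Hypothetical per-dataset field lists (illustrative names only).
ton_iot = ["frame.len", "ip.proto", "tcp.flags", "dns.qry.name"]
mqtt_ids = ["frame.len", "ip.proto", "tcp.flags", "mqtt.msgtype"]
bot_iot = ["ip.proto", "frame.len", "tcp.flags", "udp.length"]

shared = common_features(ton_iot, mqtt_ids, bot_iot)

# A hypothetical record from one dataset, reduced to the shared schema.
row = project(
    {"frame.len": 60, "ip.proto": 6, "tcp.flags": "0x0002", "mqtt.msgtype": 1},
    shared,
)
```

Records reduced to one shared schema can be concatenated into a single table, which is the form a tabular generator such as CTGAN expects as input.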

The CTGAN performance was tested using the Kolmogorov–Smirnov (KS) test, which was used in [13] to compare different types of tabular data generators. The KS test is a statistical verification metric that assesses the similarity between the distributions of the real and generated datasets [41]. The KS test was performed using the SDV package [37]. A related study used only 100 epochs on CTGAN. However, it is recommended that several epoch counts be investigated to find an adequate number of epochs [42]. Accordingly, the KS test was applied to datasets generated with epochs ranging from 50 to 100 in order to select the best-performing number of epochs.
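
The idea behind the KS comparison can be sketched for a single numeric column. This is a plain-Python illustration, not the SDV implementation: the function names are hypothetical, and reporting the complement of the statistic (so that higher means more similar) is an assumption about the score convention.

```python
def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        value = min(real_sorted[i], synth_sorted[j])
        # Advance past every observation equal to the current value in both samples.
        while i < n and real_sorted[i] == value:
            i += 1
        while j < m and synth_sorted[j] == value:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    return largest_gap


def ks_similarity(real, synthetic):
    """Complement of the KS statistic, so 1.0 means identical empirical distributions."""
    return 1.0 - ks_statistic(real, synthetic)


same = ks_similarity([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_similarity([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_similarity([1, 2, 3], [10, 11, 12])
```

Under this convention, a per-column score can be averaged over all numeric columns and compared across epoch settings to pick the setting whose generated data best match the real distributions.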

3.2. Machine Learning Classification

The generated dataset was validated based on accuracy and training/testing times using several widely used ML classification methods: decision tree, Naïve Bayes, RF, MLP, gradient boost, XGBoost, and LightGBM. Regarding the hyperparameters of the classifiers, the default configuration was applied for RF, decision tree, XGBoost, LightGBM, and Gaussian Naïve Bayes. Gradient boost was configured with a maximum of 20 estimators. The MLP classifier had a maximum of 130 iterations and a batch size of 1000; “relu” was used as the activation function, and “adam” was used as the solver. The system configuration was a 2.20 GHz Intel(R) Xeon(R) CPU and 13 GB of RAM. The training and testing times were calculated as the mean values of five classification runs, as presented in Algorithm 1.
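The timing protocol (mean of five runs) can be sketched as follows. This is an illustration under stated assumptions: `mean_fit_predict_time` is a hypothetical helper, and the stand-in "classifier" simply predicts the most frequent training label so that the example needs no ML library.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple


def mean_fit_predict_time(fit: Callable[[], object],
                          predict: Callable[[object], object],
                          runs: int = 5) -> Tuple[float, float]:
    """Return the mean training and testing times (seconds) over `runs` runs."""
    train_times: List[float] = []
    test_times: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        model = fit()                      # training time: fitting the model
        train_times.append(time.perf_counter() - start)
        start = time.perf_counter()
        predict(model)                     # testing time: predicting labels
        test_times.append(time.perf_counter() - start)
    return mean(train_times), mean(test_times)


# Hypothetical stand-in data and classifier (assumptions for illustration).
train_labels = ["normal", "attack", "normal", "normal"]
test_rows = [0, 1, 2]


def fit() -> str:
    return max(set(train_labels), key=train_labels.count)


def predict(model: str) -> List[str]:
    return [model for _ in test_rows]


train_s, test_s = mean_fit_predict_time(fit, predict)
print(f"train: {train_s:.6f} s, test: {test_s:.6f} s")
```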

Each classifier was trained individually, and the performance metrics used were accuracy, which shows the overall proportion of correct predictions by the classifier; the F1 score, which combines precision and recall; and the training and testing times in seconds, which indicate the complexity of the classification process.

The description of the performance metrics is as follows: true positive (TP), which comprises the attack instances that are detected correctly as an attack; true negative (TN), which comprises the normal instances that are detected correctly as normal; false positive (FP), which comprises the normal instances incorrectly recognized as an attack; and false negative (FN), which comprises the attack instances that are incorrectly identified as normal. The performance metrics can be determined using the following formulas:

\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}

(2)

\mathrm{F1\ score} = \frac{2TP}{2TP + FP + FN}

(3)
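As a worked example of Equations (2) and (3), the following sketch evaluates both metrics for a set of confusion counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (2): share of all instances that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical confusion counts (assumption for illustration).
tp, tn, fp, fn = 90, 80, 20, 10
print(accuracy(tp, tn, fp, fn))        # 0.85
print(round(f1_score(tp, fp, fn), 3))  # 0.857
```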

Regarding time performance, the training time is the time taken to fit the classifier on the training data, and the testing time is the time taken by the classifier to predict the labels of the test data.

Additionally, to fit lightweight IoT devices, the training and testing times were further improved by reducing the complexity of training and testing through a dataset containing only the important features. Thus, a filter-based feature selection method, which can be used without compromising classification accuracy, was applied to the dataset [43].

The filter-based feature selection used was an information gain method, and SelectKBest was used as the search method [44]. Figure 3 shows the importance scores of the best 15 features according to the filter-based feature selection. In this study, the 5, 10, and 15 best features were evaluated to study the effect of the number of features.
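The ranking idea behind information gain can be sketched in pure Python for discrete features. This is an illustrative sketch, not the scikit-learn SelectKBest implementation used in the study; the helper names and the toy columns are assumptions.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (bits) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature: Sequence, labels: Sequence[str]) -> float:
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def select_k_best(columns: Dict[str, Sequence], labels: Sequence[str],
                  k: int) -> List[str]:
    """Return the names of the k features with the highest information gain."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data (assumption): four records with three discrete features.
labels = ["attack", "attack", "normal", "normal"]
columns = {
    "tcp.flags.syn": [1, 1, 0, 0],  # separates the two classes perfectly
    "ip.flags.df": [1, 0, 1, 0],    # carries no class information here
    "ip.proto": [6, 6, 6, 6],       # constant
}
print(select_k_best(columns, labels, k=1))
```

Continuous features such as the time offsets in Table 3 would need discretization or an estimator designed for continuous data before a score of this kind can be computed.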

Figure 4 summarizes and visualizes the steps of the methodology. Diverse datasets (TON_IoT, MQTT-IoT-IDS2020, and Bot-IoT) were selected to cover a broader range of attacks, giving 12 attacks and normal traffic. Then, feature extraction was performed on the raw PCAP files using the “tshark” tool. The features were defined and labeled in the dataset according to the IP addresses of the attackers. Next, a filter-based feature selection method was used to universalize the features among the selected datasets. The three datasets were combined into one dataset by concatenation, and the total number of instances was 335,170. The features of the combined dataset were preprocessed into categorical features to be compatible with the ML classifiers, and the data were randomly split into training and testing sets of 70% and 30%, respectively. After combining the datasets, the deep-learning CTGAN model was applied to generate a new dataset. The generated dataset includes several attacks and the full set of features shown in Table 3. In addition, it is isolated from private systems. Finally, the generated dataset was evaluated and validated based on statistical metrics, accuracy, F1 score, and training/testing times using ML classification to determine whether the generated dataset represented the real dataset.
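The concatenation and the random 70/30 split can be sketched as follows. This is a minimal sketch with toy records; the helper `concat_and_split`, the fixed seed, and the record contents are assumptions and do not reproduce the authors' preprocessing.

```python
import random
from typing import List, Tuple


def concat_and_split(datasets: List[List[dict]], train_percent: int = 70,
                     seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Concatenate the datasets, shuffle them, and split into train and test."""
    combined = [row for dataset in datasets for row in dataset]
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(combined)
    cut = len(combined) * train_percent // 100
    return combined[:cut], combined[cut:]


# Toy records (assumption) standing in for the three extracted datasets.
ton_iot = [{"ip.len": 60, "Category": "DDOS"} for _ in range(4)]
mqtt = [{"ip.len": 52, "Category": "mqtt_bruteforce"} for _ in range(3)]
bot_iot = [{"ip.len": 40, "Category": "theft"} for _ in range(3)]

train, test = concat_and_split([ton_iot, mqtt, bot_iot])
print(len(train), len(test))  # 7 3
```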

4. Results and Discussion

The results show that detecting a more comprehensive range of attacks is possible using the CTGAN model and ML classifiers. The proposed system generates a private and balanced dataset with multistage attacks and universal features. GANs have also been reported to generate a more balanced dataset that resembles denial of service, probe, user to root, and remote to local attacks [13,45]. This indicates that the CTGAN is suitable for generating similar attacks.

4.1. Data Generation Performance Testing

Table 5 shows that the class distribution produced by CTGAN is more balanced than that of the real dataset in Table 4. The “keylogging” and “theft” attacks have the lowest instance counts in the real dataset. After applying CTGAN, they increased significantly, from 135 to 12,026 instances of keylogging and from 423 to 8087 instances of theft. The instances were distributed by CTGAN, which was applied to all attack and normal data to generate them.
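The change in class share for the two rarest attacks can be checked directly from the counts in Table 4 and Table 5. The short sketch below only redoes that arithmetic; the helper `share` is hypothetical.

```python
TOTAL = 335_170  # total number of instances in both Table 4 and Table 5

real_counts = {"keylogging": 135, "theft": 423}             # Table 4
generated_counts = {"keylogging": 12_026, "theft": 8_087}   # Table 5


def share(count: int, total: int = TOTAL) -> float:
    """Percentage of the dataset taken up by one category."""
    return round(100 * count / total, 2)


for category in real_counts:
    print(category, share(real_counts[category]), "->",
          share(generated_counts[category]))
```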

Table 6 shows that the KS test score increased from 0.80 (50 epochs) to 0.83 (70 epochs) and then decreased to 0.82, where it remained up to 100 epochs. This indicates that 70 epochs gave the highest performance according to the KS test, so the dataset generated with 70 epochs was used for the further measurements with ML efficiency metrics, as adopted by [13,46]. To further support the KS test’s results, the root mean square error (RMSE) and mean absolute error (MAE) [47] in Table 6 show low error values of less than 1 [48,49], as presented in Algorithm 1.
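The epoch selection and the two error metrics can be sketched as follows. The KS scores are taken from Table 6; the `rmse` and `mae` helpers use the standard definitions, and the vectors they are applied to are hypothetical, since the exact quantities compared in the study are not restated here.

```python
from math import sqrt
from typing import Sequence

# KS test scores from Table 6, keyed by the number of epochs.
ks_scores = {50: 0.80, 60: 0.82, 70: 0.83, 80: 0.82, 90: 0.82, 100: 0.82}
best_epochs = max(ks_scores, key=ks_scores.get)
print(best_epochs)  # 70


def rmse(real: Sequence[float], generated: Sequence[float]) -> float:
    """Root mean square error between paired values."""
    return sqrt(sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real))


def mae(real: Sequence[float], generated: Sequence[float]) -> float:
    """Mean absolute error between paired values."""
    return sum(abs(r - g) for r, g in zip(real, generated)) / len(real)


# Hypothetical normalized values (assumption for illustration).
real_values = [0.0, 0.5, 1.0]
generated_values = [0.1, 0.5, 0.8]
print(round(rmse(real_values, generated_values), 3),
      round(mae(real_values, generated_values), 3))
```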

Another method for assessing the quality of the generated dataset, proposed by [50], is known as train on synthetic, test on real (TSTR). In this method, the dataset generated by CTGAN is used as the training dataset for the machine learning models, and the real dataset is then used to test the models. Researchers have used this method to determine whether a dataset generated using GAN methods is suitable for real applications of ML-based IDSs [47,51,52].
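The TSTR procedure can be sketched with a deliberately simple classifier. This is an illustration only: the nearest-class-mean model, the single numeric feature, and all data values are assumptions, whereas the study uses the classifiers listed in Section 3.2.

```python
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[float, str]  # (feature value, class label)


def fit_nearest_mean(rows: List[Row]) -> Dict[str, float]:
    """Train: store the mean feature value of each class."""
    by_label: Dict[str, List[float]] = {}
    for value, label in rows:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(model: Dict[str, float], value: float) -> str:
    """Predict the class whose mean is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))


def tstr_accuracy(synthetic: List[Row], real: List[Row]) -> float:
    model = fit_nearest_mean(synthetic)  # train on synthetic data
    hits = sum(predict(model, value) == label for value, label in real)
    return hits / len(real)              # test on real data


# Hypothetical data (assumption for illustration).
synthetic = [(1.0, "normal"), (1.2, "normal"), (5.0, "attack"), (5.5, "attack")]
real = [(1.1, "normal"), (0.8, "normal"), (5.2, "attack"), (4.9, "attack")]
print(tstr_accuracy(synthetic, real))  # 1.0
```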

In this paper, the TSTR method is used to further examine the results of the statistical metrics. The aim is to select the number of epochs to use with the machine learning efficiency metrics and to compare the results with those of related studies. Section 3.2 introduced the configuration of the machine learning models that were used.

Table 7 presents the results of using TSTR on the ML models. Although some models showed low accuracy, RF consistently achieved high accuracy (up to 0.98), indicating that it is suitable for this ML-based IDS context and can capture complex relationships. Moreover, the average accuracy provides an overall sense of how well the synthetic data perform as a training set across different models. Using 70 epochs gave the highest average accuracy on TSTR, 0.85, followed by a slight decline, which supports the findings of the statistical metrics for selecting the optimal number of epochs for the ML efficiency metrics.
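The average accuracy per epoch count can be recomputed from Table 7. The sketch below uses only the Decision Tree, Naïve Bayes, RF, MLP, and Gradient Boost accuracy values from that table; restricting the average to these five rows is an assumption of this example.

```python
from statistics import mean

EPOCHS = [50, 60, 70, 80, 90, 100]

# TSTR accuracy values from Table 7, one list entry per epoch count.
tstr_accuracy = {
    "Decision Tree":  [0.39, 0.54, 0.88, 0.74, 0.60, 0.43],
    "Naïve Bayes":    [0.75, 0.75, 0.72, 0.77, 0.77, 0.75],
    "RF":             [0.93, 0.97, 0.98, 0.98, 0.98, 0.96],
    "MLP":            [0.80, 0.76, 0.81, 0.81, 0.80, 0.86],
    "Gradient Boost": [0.41, 0.54, 0.87, 0.74, 0.60, 0.43],
}

# Average over classifiers for each epoch count.
averages = {
    epochs: mean(row[i] for row in tstr_accuracy.values())
    for i, epochs in enumerate(EPOCHS)
}
best = max(averages, key=averages.get)
print(best, round(averages[best], 2))  # 70 0.85
```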

Most classifiers show an improvement in accuracy from 50 to 70 epochs, with some decline or stabilization later. For comparison with related works, the authors of [51] applied TSTR using the XGBoost classifier on the TON_IoT dataset. The model was evaluated with and without the time difference as a feature. The accuracy reached 0.61 and 0.98, respectively; the latter matches the accuracy of the RF classifier in this study. The authors of [48] also trained the model on synthetic samples from the UNSW-NB15, KDDTest-21, and KDDTest+ datasets. The highest accuracy for each dataset was 0.67 with the decision tree, 0.57 with LSTM, and 0.77 with LSTM, respectively. These results were lower than the findings in this paper, which achieved better accuracy at 70 epochs. The TSTR method provides an understanding of the quality and possible usability of the synthetic data that CTGAN has generated. The change in performance among classifiers and epochs shows the importance of model selection and the number of epochs in order to make the most of synthetic data in ML pipelines.

Algorithm 1. The proposed dataset generation and validation methodology. DR is the real data, and DG is the generated data.

1.: Input: DR = Concatenating(TON_IoT, MQTT-IoT-IDS2020, Bot-IoT) //TON_IoT, MQTT-IoT-IDS2020, and Bot-IoT are the datasets whose features were extracted from the raw PCAP files
2.: Categorization_preprocessing(DR)
3.: for e in range (1, 101)
4.: if e ≥ 50
5.: if e%10 == 0
6.: CTGAN_Result = CTGAN(DR, epochs=e)
7.: Output: DG = CTGAN_Result
8.: KS_test(DR, DG), RMSE(DR, DG), MAE(DR, DG), TSTR(DR, DG)
9.: Max_KS_test_score = Find_max_KS_test_score(DR, DG)//KS_test_Bestscore is epoch 70
10.: DgwithFeat = getFeature_selection(DG(Max_KS_test_score))
11.: train(DR, DG(Max_KS_test_score), DgwithFeat)
12.: RF_result = test_RF(DR, DG(Max_KS_test_score), DgwithFeat)
13.: DecisionTree_result = test_DecisionTree(DR, DG(Max_KS_test_score), DgwithFeat)
14.: NaiveBayes_result = test_NaiveBayes(DR, DG(Max_KS_test_score), DgwithFeat)
15.: MLP_result = test_MLP(DR, DG(Max_KS_test_score), DgwithFeat)
16.: Gradient_Boost_result = test_Gradient_Boost(DR, DG(Max_KS_test_score), DgwithFeat)
17.: accuracy_result = Find_Minimum_Differences_ML_Accuracy(RF_result, DecisionTree_result, NaiveBayes_result, MLP_result, Gradient_Boost_result)//The lower the difference, the more representative
18.: avg_Time = avg(RF_time, Decision_time, naïve_Time, MLP_Time, Gradient_Time)//Average of train and test time
19.: compare(ML[DG(Max_KS_test_score), DgwithFeat], avg_Time)//Comparison of average training and testing times between DG and DgwithFeat
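Lines 3 to 5 of Algorithm 1 select the epoch counts at which CTGAN is trained. The following short sketch writes that loop in Python and shows an equivalent compact form; the variable names are illustrative.

```python
# Literal translation of lines 3-5 of Algorithm 1.
epochs_from_pseudocode = [e for e in range(1, 101) if e >= 50 and e % 10 == 0]

# Equivalent compact form: start at 50, stop at 100 inclusive, step 10.
epochs_compact = list(range(50, 101, 10))

print(epochs_compact)  # [50, 60, 70, 80, 90, 100]
```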

Figure 5 provides a comparative visualization of the cumulative sums of both real and generated datasets, as presented by [47]. It offers insight into which features the CTGAN had difficulty replicating accurately. The figure was produced from the data generated using 70 epochs, which performed the best according to the presented performance metrics. Overall, the curves were fairly well aligned, and the generated data followed the same shape and structure as the real data, but some features, such as ip.id and tcp.hdr_len, were not reproduced as closely by CTGAN.

4.2. Machine Learning Efficiency Metrics

The ML classification efficiency metric is applied to both datasets to ensure that the generated dataset represents the real dataset. This method, used by [13,42], evaluates the similarity between the two datasets to examine whether the real dataset can be replaced with the generated one. It is implemented in line 17 of Algorithm 1, which checks the minimum difference between the results on real and generated data. The ML classification results were closely similar for the generated and real datasets, indicating that replacing the real dataset with the generated dataset is possible. The promising results in Table 8 show that the accuracy of the generated data before using feature selection reached 1 in XGBoost and 0.99 in RF, decision tree, gradient boost, and LightGBM, which indicates the quality of the dataset for ML tasks.

A filter-based feature selection method is applied to the dataset to reduce the time complexity of the classification process. As shown in Table 8, feature selection improved the accuracy of some classifiers and did not degrade the performance of the others. The accuracy of the RF classifier increased from 0.99 to 1.0, and the accuracy of Naïve Bayes increased significantly from 0.73 to 0.97 using the 10 and 15 best features. The results show that the complexity can be reduced without affecting the accuracy, or even while improving it. These results imply that the IoT dataset generated by CTGAN can be used as learning data for several ML tasks and that CTGAN can efficiently capture and replicate the patterns between the features of the network data and their classes. This agrees with previously published studies that employed GANs on the CIDDS and NSL-KDD datasets [13,42]. However, the ML classification results of the current study were better than those of these studies [13,42] in terms of accuracy. Moreover, the synthetic datasets generated to resemble NSL-KDD and CIDDS may not be suitable for detecting attacks on lightweight IoT networks, whereas the proposed pipeline reduces time complexity and contains more types of attacks related to IoT environments.

Accordingly, Table 9 shows that, using a decision tree and the five best features, the training time improved from 1.48 s to 0.05 s, and the test time reached 0.004 s. Additionally, the training times of XGBoost and MLP were significantly reduced from 24.54 s and 94.14 s to 5.35 s and 24.70 s, respectively. However, the RF time did not improve, which suggests that RF may not be suitable for lightweight contexts.
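The size of these reductions can be expressed as speed-up factors. The sketch below only divides the training times quoted above; the helper `speedup` is hypothetical.

```python
# Training times in seconds (before, after feature selection), as quoted above.
reported_training_times = {
    "Decision tree": (1.48, 0.05),
    "XGBoost": (24.54, 5.35),
    "MLP": (94.14, 24.70),
}


def speedup(before: float, after: float) -> float:
    """How many times faster training became after feature selection."""
    return before / after


for name, (before, after) in reported_training_times.items():
    print(f"{name}: {speedup(before, after):.1f}x faster training")
```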

Figure 6 compares the proposed approach with related studies that evaluated training and testing times. The results show that the filter-based feature selection method used here can significantly reduce the time and complexity compared with related studies without compromising detection accuracy, which is beneficial for lightweight IoT devices with low storage. Such improvements in training/testing time are crucial for predicting real-life intrusions, accelerating the IDS function to guarantee the rapid detection of attacks, and ensuring that the IDS has fast data stream monitoring capabilities. It has been reported that a practical real-time IDS is vital and desirable for detecting intrusions in high-speed IoT networks with large volumes of data [7]. This further supports the suitability of the generated synthetic IoT dataset for ML classification training and the possibility of replacing real-life datasets.

Moreover, the results of this study show that the proposed methodology can be used to generate a new synthetic dataset containing a broader range of attacks without requiring months of recording and large storage for collecting attack datasets. Additionally, the generated dataset after feature selection can be applied to an ML-based IDS in a lightweight IoT environment without affecting the accuracy of real-time updates and communication between IoT devices. The final results showed similarly high accuracy and lower training and testing times compared with the related studies presented in Table 1, even after filter-based feature selection was applied. Moreover, compared with other published GAN results, such as [31,32], the proposed method can generate a dataset that is suitable for training ML models with higher accuracy across all three datasets. This is due to the universal features shared among the datasets, which make the method adaptable and not limited to only one type of dataset.

5. Conclusions and Further Research

The proposed methodology can generate a synthetic dataset that researchers and security professionals can use to detect and prevent attacks such as APTs. The method can also help address the challenge of collecting data from a system and keeping them on large storage devices, which is difficult in low-resource IoT environments. The experimental results showed that the real dataset could be replaced with the generated dataset. This replacement would not affect the system’s accuracy and might even improve it, as shown by the Naïve Bayes classifier. This could have important implications for future research, as it would allow researchers to work with more confidential data while maintaining high accuracy levels and to use this method to improve the efficiency of IDSs with large amounts of data. The findings of the TSTR and the statistical metrics (KS test, MAE, and RMSE) suggest that CTGAN is a promising tool for generating a synthetic dataset that is similar to the real dataset. Additionally, it generates a more balanced dataset by increasing the number of instances of the attacks that have few instances.

Applying classifier efficiency metrics reveals that CTGAN synthetic datasets can be used for classification tasks. As the results indicated, RF and decision trees provided high accuracies of 1.0 and 0.99, respectively. Moreover, as described in the second section, many existing studies have focused primarily on the performance of classification algorithms while neglecting other important factors, such as classification time. The results indicated that the filter-based feature selection method reduced the training and test times to fit lightweight IoT devices without affecting accuracy. This study mainly involves generating a dataset for IoT devices applied in smart cities, where collecting a training dataset for machine learning algorithms is difficult. In future studies, the generated dataset could be tested in a real-world IoT network and compared with real-world datasets. Moreover, further investigation could be performed using multiclass classification to evaluate the performance for each attack. A limitation of this study is that CTGAN has not been compared with advanced large language models such as GPT-4. In addition, the generated dataset does not include all attack scenarios that occur in IoT environments, and it has not been assessed for zero-day attacks.

Author Contributions

Conceptualization, S.A. and Y.S.; methodology, S.A.; software, S.A.; validation, S.A.; formal analysis, S.A.; investigation, S.A.; resources, S.A.; data curation, S.A.; writing—original draft preparation, S.A.; writing—review and editing, Y.-T.K., A.S. and Y.S.; visualization, S.A., Y.-T.K. and A.S.; supervision, Y.S.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1A2C1013296, No. 2018R1A5A7023490), in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-2020-0-01789) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Jeong, J.; Lim, J.Y.; Son, Y. A data type inference method based on long short-term memory by improved feature for weakness analysis in binary code. Future Gener. Comput. Syst. 2019, 100, 1044–1052.
Son, Y.; Jeong, J.; Lee, Y. An Adaptive Offloading Method for an IoT-Cloud Converged Virtual Machine System Using a Hybrid Deep Neural Network. Sustainability 2018, 10, 3955.
Jeong, J.; Joo, J.W.J.; Lee, Y.; Son, Y. Secure Cloud Storage Service Using Bloom Filters for the Internet of Things. IEEE Access 2019, 7, 60897–60907.
Chen, W.; Helu, X.; Jin, C.; Zhang, M.; Lu, H.; Sun, Y.; Tian, Z. Advanced persistent threat organization identification based on software gene of malware. Eur. Trans. Telecommun. 2020, 31, e3884.
Cheng, X.; Luo, Q.; Pan, Y.; Li, Z.; Zhang, J.; Chen, B. Predicting the APT for Cyber Situation Comprehension in 5G-Enabled IoT Scenarios Based on Differentially Private Federated Learning. Secur. Commun. Netw. 2021, 2021, 8814068.
Tankard, C. Advanced Persistent threats and how to monitor and deter them. Netw. Secur. 2011, 2011, 16–19.
Malhotra, H.; Sharma, P. Intrusion Detection using Machine Learning and Feature Selection. Int. J. Comput. Netw. Inf. Secur. 2019, 11, 43–52.
Binbusayyis, A.; Vaiyapuri, T. Comprehensive analysis and recommendation of feature evaluation measures for intrusion detection. Heliyon 2020, 6, e04262.
Onik, A.; Haq, N.; Alam, L.; Mamun, T. An Analytical Comparison on Filter Feature Extraction Method in Data Mining using J48 Classifier. Int. J. Comput. Appl. 2015, 124, 1–8.
Hindy, H.; Bayne, E.; Bures, M.; Atkinson, R.; Tachtatzis, C.; Bellekens, X. Machine Learning Based IoT Intrusion Detection System: An MQTT Case Study (MQTT-IoT-IDS2020 Dataset). In Selected Papers from the 12th International Networking Conference; Springer International Publishing: Cham, Switzerland, 2021; pp. 73–84.
Hussain, F.; Abbas, S.G.; Fayyaz, U.U.; Shah, G.A.; Toqeer, A.; Ali, A. Towards a Universal Features Set for IoT Botnet Attacks Detection. In Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020; pp. 1–6.
Chen, Z.; Liu, J.; Shen, Y.; Simsek, M.; Kantarci, B.; Mouftah, H.T.; Djukic, P. Machine Learning-Enabled IoT Security: Open Issues and Challenges Under Advanced Persistent Threats. ACM Comput. Surv. 2022, 55, 37.
Bourou, S.; El Saer, A.; Velivassaki, T.; Voulkidis, A.; Zahariadis, T. A Review of Tabular Data Synthesis Using GANs on an IDS Dataset. Inf. (Basel) 2021, 12, 375.
Appenzeller, A.; Leitner, M.; Philipp, P.; Krempel, E.; Beyerer, J. Privacy and Utility of Private Synthetic Data for Medical Data Analyses. Appl. Sci. 2022, 12, 12320.
Soe, Y.N.; Feng, Y.; Santosa, P.I.; Hartanto, R.; Sakurai, K. Towards a Lightweight Detection System for Cyber Attacks in the IoT Environment Using Corresponding Features. Electronics 2020, 9, 144.
Zhou, Y.; Cheng, G.; Jiang, S.; Dai, M. Building an efficient intrusion detection system based on feature selection and ensemble classifier. Comput. Netw. 2020, 174, 107247.
Rahman, M.A.; Asyhari, A.T.; Leong, L.S.; Satrya, G.B.; Hai Tao, M.; Zolkipli, M.F. Scalable machine learning-based intrusion detection system for IoT-enabled smart cities. Sustain. Cities Soc. 2020, 61, 102324.
Somwang, P.; Lilakiatsakun, W. Intrusion detection technique by using fuzzy ART on computer network security. In Proceedings of the 2012 7th IEEE Conference on Industrial Electronics and Applications (ICIEA), Singapore, 18–20 July 2012; pp. 697–702.
Sivatha Sindhu, S.S.; Geetha, S.; Kannan, A. Decision tree based light weight intrusion detection using a wrapper approach. Expert Syst. Appl. 2012, 39, 129–141.
Setiawan, B.; Djanali, S.; Ahmad, T. Increasing accuracy and completeness of intrusion detection model using fusion of normalization, feature selection method and support vector machine. Int. J. Intell. Eng. Syst. 2019, 12, 378–389.
Rashid, M.M.; Kamruzzaman, J.; Hassan, M.M.; Imam, T.; Gordon, S. Cyberattacks Detection in IoT-Based Smart City Applications Using Machine Learning Techniques. Int. J. Environ. Res. Public Health 2020, 17, 9347.
Hassannataj Joloudari, J.; Haderbadi, M.; Mashmool, A.; Ghasemigol, M.; Band, S.S.; Mosavi, A. Early Detection of the Advanced Persistent Threat Attack Using Performance Analysis of Deep Learning. IEEE Access 2020, 8, 186125–186137.
Shang, L.; Guo, D.; Ji, Y.; Li, Q. Discovering unknown advanced persistent threat using shared features mined by neural networks. Comput. Netw. 2021, 189, 107937.
Chizoba, O.J.; Kyari, B.A. Ensemble classifiers for detection of advanced persistent threats. Glob. J. Eng. Technol. Adv. 2020, 2, 1.
Stojanović, B.; Hofer-Schmitz, K.; Kleb, U. APT datasets and attack modeling for automated detection methods: A review. Comput. Secur. 2020, 92, 101734.
Myneni, S.; Chowdhary, A.; Sabur, A.; Sengupta, S.; Agrawal, G.; Huang, D.; Kang, M. DAPT 2020-Constructing a Benchmark Dataset for Advanced Persistent Threats. In Deployable Machine Learning for Security Defense: First International Workshop, MLHat 2020, San Diego, CA, USA, August 24, 2020, Proceedings 1; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 138–163.
Safa, M.; Pandian, A.; Gururaj, H.L.; Ravi, V.; Krichen, M. Real time health care big data analytics model for improved QoS in cardiac disease prediction with IoT devices. Health Technol. 2023, 13, 473–483.
Shahriar, M.H.; Haque, N.I.; Rahman, M.A.; Alonso, M. G-IDS: Generative Adversarial Networks Assisted Intrusion Detection System. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 376–385.
Liu, X.; Li, T.; Zhang, R.; Wu, D.; Liu, Y.; Yang, Z. A GAN and Feature Selection-Based Oversampling Technique for Intrusion Detection. Secur. Commun. Netw. 2021, 2021, 9947059.
Lin, Z.; Shi, Y.; Xue, Z. IDSGAN: Generative Adversarial Networks for Attack Generation Against Intrusion Detection. In Advances in Knowledge Discovery and Data Mining; Springer International Publishing: Cham, Switzerland, 2022; pp. 79–91.
Kumar, V.; Sinha, D. Synthetic attack data generation model applying generative adversarial network for intrusion detection. Comput. Secur. 2023, 125, 103054.
Strickland, C.; Saha, C.; Zakar, M.; Nejad, S.; Tasnim, N.; Lizotte, D.; Haque, A. DRL-GAN: A Hybrid Approach for Binary and Multiclass Network Intrusion Detection. arXiv 2023, arXiv:2301.03368.
Vaccari, I.; Chiola, G.; Aiello, M.; Mongelli, M.; Cambiaso, E. MQTTset, a New Dataset for Machine Learning Techniques on MQTT. Sensors 2020, 20, 6578.
Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular data using Conditional GAN. Adv. Neural Inf. Process. Syst. 2019, 32, 1–15.
Scikit-Learn: Machine Learning in Python - Scikit-Learn 1.1.1 Documentation. Available online: https://scikit-learn.org/stable/ (accessed on 6 March 2022).
CTGAN Model - SDV 0.13.1 Documentation. Available online: https://sdv.dev/SDV/user_guides/single_table/ctgan.html (accessed on 6 March 2022).
Single Table Metrics - SDV 0.13.1 Documentation. Available online: https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html (accessed on 6 March 2022).
Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems. IEEE Access 2020, 8, 165130–165150.
Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
Tshark(1) Manual Page. Available online: https://www.wireshark.org/docs/man-pages/tshark.html (accessed on 6 March 2022).
Hong, D.; Baik, C. Generating and Validating Synthetic Training Data for Predicting Bankruptcy of Individual Businesses. J. Inf. Commun. Converg. Eng. 2021, 19, 228–233.
Zingo, P.; Novocin, A. Can GAN-Generated Network Traffic be used to Train Traffic Anomaly Classifiers? In Proceedings of the 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 4–7 November 2020; p. 540.
Alabdulwahab, S.; Moon, B. Feature Selection Methods Simultaneously Improve the Detection Accuracy and Model Building Time of Machine Learning Classifiers. Symmetry 2020, 12, 1424.
Hall, M.A.; Holmes, G. Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 2003, 15, 1437–1447.
Alabrah, A. A Novel Study: GAN-Based Minority Class Balancing and Machine-Learning-Based Network Intruder Detection Using Chi-Square Feature Selection. Appl. Sci. 2022, 12, 11662.
Arvanitis, T.N.; White, S.; Harrison, S.; Chaplin, R.; Despotou, G. A method for machine learning generation of realistic synthetic datasets for validating healthcare applications. Health Inform. J. 2022, 28, 14604582221077000.
Brenninkmeijer, B.; de Vries, A.; Marchiori, E.; Hille, Y. On the Generation and Evaluation of Tabular Data Using GANs; Radboud University: Nijmegen, The Netherlands, 2019.
Neves, D.T.; Alves, J.; Naik, M.G.; Proença, A.J.; Prasser, F. From Missing Data Imputation to Data Generation. J. Comput. Sci. 2022, 61, 101640.
Ashraf, H.; Jeong, Y.; Lee, C.H. Underwater Ambient-Noise Removing GAN Based on Magnitude and Phase Spectra. IEEE Access 2021, 9, 24513–24530.
Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv 2017, arXiv:1706.02633.
Sasirekha, G.V.K.; Bangari, A.; Rao, M.; Bapat, J.; Das, D. Synthesis of IoT Sensor Telemetry Data for Smart Home Edge-IDS Evaluation. In Proceedings of the 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), Jakarta, Indonesia, 16 February 2023; pp. 562–567.
Dina, A.S.; Siddique, A.B.; Manivannan, D. Effect of Balancing Data Using Synthetic Data on the Performance of Machine Learning Classifiers for Intrusion Detection in Computer Networks. IEEE Access 2022, 10, 96731–96747.

Figure 1. MQTT network architecture of the APT attack scenario.

Figure 2. A flow diagram of the data generation and performance assessment.

Figure 3. Importance score of the best 15 features according to feature selection.

Figure 4. The proposed pipeline design.

Figure 5. The plots show the cumulative sums per feature for both real and generated datasets. Each plot displays the data points of the respective feature.

Figure 6. Comparison with related studies that consider training and testing times, with the CTGAN-generated dataset using a decision tree classifier and the 5 best features [15,20,21].

Table 1. Summary of ML-based IDS studies that considered the model’s time performance, with ‘n/a’ indicating studies that did not evaluate testing time.

Reference	Preprocessing	Feature Selection	Classification	Datasets	Accuracy	Training/Testing Time (Seconds)
Rahman et al. [17]	Normalization, balancing, and numerical transforming	IG	Multi-layer perceptron (MLP)	AWID	0.97	73.52/n/a
Zhou et al. [16]	Normalization, balancing, and filtration	CFS-BA-ensemble	Forest PA-ensemble	NSL-KDD	0.99	36.28/n/a
				AWID	0.99	92.62/n/a
				CIC-IDS2017	0.99	98.42/n/a
Soe et al. [15]	n/a	CST-GR	J48	Bot-IoT	0.99 (TPR)	8.61/0.81
Setiawan et al. [20]	Nominal to numerical, log normalization	Modified rank-based IG	SVM	NSL-KDD	0.99	56.603/2.094
Rashid et al. [21]	Cleaning, visualization, feature engineering, and vectorization	IG	Stacking ensemble	UNSW-NB15	0.96	25.6/5.70
Rashid et al. [21]		IG	Stacking ensemble	CIC-IDS2017	0.99	27.09/4.19

Table 2. CTGAN parameter configuration.

CTGAN Parameters	Values
Epochs	50 to 100
Generator learning rate	0.0002
Discriminator learning rate	0.0002
Generator dimension	(256, 256)
Discriminator dimension	(256, 256)

Table 3. Description of the features of network records.

#	Feature	Data Type
1	Protocol	Text
2	ip.id	Unsigned integer
3	ip.flags	Unsigned integer
4	ip.flags.df	Binary
5	ttl	Unsigned integer
6	ip.proto	Unsigned integer
7	ip.checksum	Unsigned integer
8	ip.len	Unsigned integer
9	tcp.srcport	Unsigned integer
10	tcp.dstport	Unsigned integer
11	tcp.seq	Unsigned integer
12	tcp.ack	Unsigned integer
13	tcp.stream	Unsigned integer
14	tcp.len	Unsigned integer
15	tcp.hdr_len	Unsigned integer
16	tcp.analysis.ack_rtt	Time offset
17	tcp.flags.fin	Boolean
18	tcp.flags.syn	Boolean
19	tcp.flags.push	Boolean
20	tcp.flags.ack	Boolean
21	tcp.window_size	Unsigned integer
22	tcp.checksum	Unsigned integer
23	frame.time_relative	Time offset
24	frame.time_delta	Time offset
25	tcp.time_relative	Time offset
26	tcp.time_delta	Time offset
27	label	Text
28	Category	Text

Table 4. Record count and percentage of each category in the collected real dataset.

Category	Count	Percentage (%)
DDOS	17,923	5.35
DOS	17,204	5.13
injection	17,708	5.28
keylogging	135	0.04
password	17,587	5.25
mqtt_bruteforce	16,820	5.02
scan_A	16,009	4.78
ransomware	15,734	4.69
backdoor	16,464	4.91
XSS	15,766	4.70
Sparta	17,148	5.12
theft	423	0.13
normal	166,249	49.60
Total	335,170	100.00

Table 5. Count and percentage of each attack from the generated dataset using CTGAN.

Category	Count	Percentage (%)
DDOS	13,282	3.96
DOS	12,047	3.59
injection	13,933	4.16
keylogging	12,026	3.59
password	12,852	3.83
mqtt_bruteforce	21,686	6.47
scan_A	14,203	4.24
ransomware	21,358	6.37
backdoor	14,849	4.43
XSS	11,986	3.58
Sparta	12,612	3.76
theft	8,087	2.41
normal	166,249	49.60
Total	335,170	100.00
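
The percentage columns in Tables 4 and 5 follow directly from the counts. As a quick worked check, the sketch below recomputes the share of the two rarest attack categories and the normal class before and after generation; the counts are copied from the tables, while the helper name `share` is hypothetical.

```python
# Counts copied from Table 4 (real) and Table 5 (generated).
REAL = {"keylogging": 135, "theft": 423, "normal": 166_249}
GENERATED = {"keylogging": 12_026, "theft": 8_087, "normal": 166_249}
TOTAL = 335_170  # both datasets contain the same number of records


def share(count, total=TOTAL):
    """Percentage of the dataset taken up by `count` records, to 2 decimals."""
    return round(100 * count / total, 2)


for category in REAL:
    print(category, share(REAL[category]), share(GENERATED[category]))
```

The output reproduces the tabulated values: keylogging rises from 0.04% to 3.59% and theft from 0.13% to 2.41%, while the normal class stays at 49.60%.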

Table 6. Comparison of the KS test, RMSE, and MAE metrics across epochs.

Epochs	50	60	70	80	90	100
KS test	0.80	0.82	0.83	0.82	0.82	0.82
RMSE	0.13	0.12	0.11	0.11	0.12	0.12
MAE	0.09	0.08	0.08	0.08	0.08	0.08
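
For readers unfamiliar with these metrics, the following is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov (KS) statistic, MAE, and RMSE. It is not the authors' evaluation code. In Table 6 the KS score rises while the errors fall, which would be consistent with reporting the complement 1 − D of the KS statistic D, but that reading is an assumption. How values were paired for MAE and RMSE is also not stated in this section, so the functions simply take two equal-length sequences.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def ks_complement(real, synthetic):
    """Assumed higher-is-better form of the score: 1 - D."""
    return 1.0 - ks_statistic(real, synthetic)


def mae(expected, observed):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(e - o) for e, o in zip(expected, observed)) / len(expected)


def rmse(expected, observed):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected))


print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))
```

In practice these would be applied per feature (for example, to `ttl` or `tcp.window_size`) and then averaged over features; that aggregation step is likewise an assumption.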

Table 7. The accuracy and F1 score of training the classifiers using the generated dataset and testing using the real dataset.

	50 Epochs		60 Epochs		70 Epochs		80 Epochs		90 Epochs		100 Epochs
Classifier	Accuracy	F1	Accuracy	F1	Accuracy	F1	Accuracy	F1	Accuracy	F1	Accuracy	F1
Decision Tree	0.39	0.39	0.54	0.54	0.88	0.88	0.74	0.74	0.60	0.60	0.43	0.43
Naïve Bayes	0.75	0.75	0.75	0.75	0.72	0.72	0.77	0.77	0.77	0.77	0.75	0.75
RF	0.93	0.93	0.97	0.97	0.98	0.98	0.98	0.98	0.98	0.98	0.96	0.96
MLP	0.80	0.80	0.76	0.76	0.81	0.81	0.81	0.81	0.80	0.80	0.86	0.86
Gradient Boost	0.41	0.41	0.54	0.54	0.87	0.87	0.74	0.74	0.60	0.60	0.43	0.43
XGBoost	0.79	0.79	0.73	0.73	0.88	0.88	0.92	0.92	0.82	0.82	0.69	0.69
LightGBM	0.42	0.42	0.55	0.55	0.88	0.88	0.74	0.74	0.60	0.60	0.44	0.44
Average	0.64	0.64	0.69	0.69	0.85	0.85	0.81	0.81	0.74	0.74	0.65	0.65
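
Accuracy and F1 are identical in every cell of Table 7. One explanation consistent with this, offered here as an assumption rather than a statement of the authors' method, is that F1 was micro-averaged: for single-label multiclass predictions, micro-averaged F1 always equals accuracy. The short sketch below demonstrates the identity on a toy set of labels; the label values are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then score."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this class, but the truth was another
            elif t == label:
                fn += 1  # missed a record of this class
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (hypothetical): one DOS record is misclassified as DDOS.
y_true = ["normal", "DOS", "DDOS", "normal"]
y_pred = ["normal", "DDOS", "DDOS", "normal"]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

Each misclassification counts once as a false positive and once as a false negative, so the pooled F1 reduces to correct predictions divided by all predictions.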

Table 8. Classification accuracy and F1 results with the 5, 10, and 15 best feature selections.

	Collected Original Dataset		Generated Dataset Before Feature Selection		Generated Dataset with 5 Best Feature Selection		Generated Dataset with 10 Best Feature Selection		Generated Dataset with 15 Best Feature Selection
Classifier	Accuracy	F1	Accuracy	F1	Accuracy	F1	Accuracy	F1	Accuracy	F1
RF	0.99	0.99	0.99	0.99	0.99	0.99	1	1	1	1
Naïve Bayes	0.73	0.72	0.76	0.75	0.96	0.96	0.97	0.97	0.97	0.97
Decision Tree	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99
MLP	0.99	0.99	0.98	0.98	0.99	0.99	0.81	0.84	0.81	0.84
Gradient Boost	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99	0.99
XGBoost	0.99	0.99	1	1	0.99	0.99	1	1	1	1
LightGBM	0.99	0.99	0.99	0.99	0.99	0.99	1	1	1	1

Table 9. Training and test times in seconds for the collected original dataset and for the generated dataset before feature selection and with the 5, 10, and 15 best features.

	Collected Original Dataset		Generated Dataset Before Feature Selection		Generated Dataset with 5 Best Feature Selection		Generated Dataset with 10 Best Feature Selection		Generated Dataset with 15 Best Feature Selection
Classifier	Train Time	Test Time	Train Time	Test Time	Train Time	Test Time	Train Time	Test Time	Train Time	Test Time
RF	4.31	0.1	4.39	0.09	4.15	0.47	11.09	0.57	14.94	0.61
Naïve Bayes	0.31	0.06	0.13	0.04	0.04	0.006	0.08	0.01	0.1	0.02
Decision Tree	1.48	0.02	2.14	0.02	0.05	0.004	0.51	0.007	0.91	0.008
MLP	94.14	0.09	51.91	0.09	24.70	0.07	31.95	0.08	26.0	0.09
Gradient Boost	12.78	0.04	8.56	0.04	1.44	0.04	5.19	0.04	8.41	0.04
XGBoost	24.54	0.07	23.27	0.06	5.35	0.06	9.52	0.08	12.14	0.08
LightGBM	4.61	0.38	3.85	0.26	1.72	0.34	2.13	0.28	2.55	0.27
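
Train and test times such as those in Table 9 are usually obtained by wrapping the fit and predict calls in a wall-clock timer. The sketch below shows one way to do this with the standard library; it is not the authors' measurement code, and `MajorityClassifier` is a hypothetical stand-in for the real classifiers so that the example runs without external packages.

```python
import time


def timed(fn, *args, **kwargs):
    """Call `fn` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


# Tiny made-up data, only to exercise the timer.
X_train = [[0.1, 1], [0.2, 0], [0.9, 1]]
y_train = ["normal", "normal", "DOS"]

model, train_seconds = timed(MajorityClassifier().fit, X_train, y_train)
predictions, test_seconds = timed(model.predict, [[0.5, 1]])
print(predictions, train_seconds, test_seconds)
```

Read this way, the decision tree's training time in Table 9 falls from 1.48 s on the collected original dataset to 0.05 s on the generated dataset with the 5 best features, roughly a 30-fold reduction.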

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

