Create a Realistic IoT Dataset Using Conditional Generative Adversarial Network
Abstract
1. Introduction
- Development of a novel Synthetic Data Generator Tool capable of generating realistic and balanced synthetic data which can be used in cybersecurity applications to detect network attack types.
- Building a multi-class IoT network attack type dataset that includes six attack types and a normal class.
- Development of a user-controlled Synthetic Data Generator Tool capable of generating realistic and balanced datasets which can be used in cybersecurity applications to detect network attack types.
- Constructing a comprehensive pre-processing pipeline that addresses challenges in synthetic data generation using CGANs.
- Evaluating the generated datasets’ balance.
2. Literature Review
2.1. Comprehensive Review of Public IoT Datasets
2.2. Recent Advances in Synthetic Dataset Generation for IoT Cybersecurity Applications
2.2.1. GAN-Based Approaches for IDS Dataset Generation
2.2.2. Non-GAN-Based Toolkit for Dataset Generation
- Limited Dataset Representation: Studies such as Alabdulwahab et al. [3] and Strickland et al. [19] focus on generating synthetic data based on pre-existing datasets. This approach can restrict the variety of attack scenarios simulated, as the generated data may only replicate patterns already present in the original datasets. Consequently, these methods may fail to capture novel or evolving IoT threats.
- Lack of Focus on Real-Time Applicability: Many existing studies, such as those by Strickland et al. [19] and Alabdulwahab et al. [3], do not account for the computational demands of real-time data generation and analysis. IoT environments require immediate response times, and the lack of optimization for speed and efficiency in data generation limits the practical use of these methods in real-world applications.
- Feature Selection and Optimization: GAN-based methods like GAN-FS [6] integrate feature selection but sometimes disrupt optimal feature combinations. While feature selection can improve performance in certain contexts, it does not always yield better results, particularly for high-performing models. Additionally, methods without feature selection (e.g., Strickland et al. [19]) risk inefficiencies, leading to longer processing times.
- Generating Fully Synthetic Datasets: Our approach eliminates the reliance on real data by generating entirely synthetic datasets. This ensures that the model is trained on balanced, unbiased data, improving its ability to generalize to new and previously unseen IoT traffic patterns.
- Attack Scenario Generation: We go beyond augmenting pre-existing datasets by generating new IoT attack scenarios that are underrepresented in current datasets. This allows our model to train on a more diverse set of threats, which enhances its applicability in real-world IoT environments.
- Real-Time Applicability: Our method is designed with real-time IoT environments in mind, optimizing both the generation and processing of synthetic data. This ensures that the model can respond to network threats swiftly, making it suitable for real-time IDS deployment.
- Integrated Feature Selection: We incorporate advanced feature selection techniques that streamline the model’s performance, reducing the computational overhead while maintaining or improving detection accuracy. This makes our approach more efficient and scalable, particularly for large IoT networks.
3. Methodology
3.1. Network Configuration
3.2. Attack Scenarios
- Reconnaissance Attack: Reconnaissance is the first step of any cyberattack. By gathering information about available services and open ports, attackers can identify vulnerabilities. In our experiment, we used Nmap to perform scanning and information gathering, simulating how an attacker could probe for weaknesses in IoT devices. To better understand potential reconnaissance attacks on these devices, we recorded Nmap traffic, as seen in Figure 2.
- Man-in-the-Middle (MITM) Attack: MITM attacks are highly effective in IoT environments where communication between devices and their controlling apps is frequent and often unencrypted. Attackers can intercept or manipulate this communication without detection. This attack was prioritized because many IoT devices lack robust encryption, making them susceptible to this type of intrusion. In our testbed, data were intercepted between IoT devices and the cloud, allowing us to analyze the extent to which sensitive information could be accessed, as shown in Figure 3.
- Deauthentication Attack: Deauthentication attacks were selected because they disrupt IoT devices by disconnecting them from their wireless networks. Since most IoT devices rely on Wi-Fi, such an attack causes significant service disruption (see Figure 4): the user loses access to the internet and all network services until the device reconnects. As previously mentioned, the IoT device is wirelessly connected to the access point, which assigns IP addresses in the range 192.168.2.2 to 192.168.2.254; for this attack, the attacker is not connected to the access point. The attacker sends a forged deauthentication request to the access point, causing the user’s device to disconnect.
- UDP Flood Attack: This type of denial-of-service attack targets IoT devices by overwhelming them with UDP packets. The attack was simulated on our smart lamp, which communicates via UDP, to test how it handles excessive network traffic. As illustrated in Figure 2, the attacker sends a large number of User Datagram Protocol (UDP) packets to a specific server or device. Because we found that the Xiaomi smart lamp is controlled via UDP, we sent random UDP messages with a spoofed source address that matched the control server.
- SYN Flood Attack: The SYN Flood attack is another type of denial-of-service attack that disrupts the three-way handshake process in TCP communication. This attack was particularly effective against the web server running on the smart plug, rendering it unresponsive; see Figure 2. During a previous Nmap scan, we found that the smart camera was running API services and the Shelly smart plug had a web server, both of which are vulnerable to a SYN Flood attack.
- Password Cracking Attack: Password cracking attacks were significant in our experiment because many IoT devices in smart home networks use weak or default passwords. Our smart plug and camera both required password authentication, and we were able to use brute force techniques to successfully gain unauthorized access. Given that many IoT devices have weak password mechanisms, these attacks remain a serious threat. We discovered that our smart plug has a security feature that prompts web users to enter a password once activated. Additionally, the smart camera has an RTSP stream feature that requires a username and password. As a result, we utilized brute force and dictionary attacks to guess the passwords of IoT devices.
3.3. Building the Dataset
- The “Protocol” feature alone can identify a deauth attack, because this attack always uses the same protocol.
- The combination of “Protocol” and “Flow volume” features can identify deauth attacks and MITM attacks, as MITM attacks consist of four ICMP packets.
- The “IoT_Respond_401” feature flags failed web and RTSP login attempts, which helps identify password-cracking attacks.
- The “IsServer” feature helps identify a trusted destination.
- The combination of “Protocol”, “No_of_received_packets_per_minutes”, and “No_of_sent_packets_per_minutes” helps identify DOS-SYN attacks.
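As a rough illustration of how rate-based features such as “No_of_sent_packets_per_minutes” and “No_of_received_packets_per_minutes” could be derived from captured traffic, the sketch below counts packets per one-minute window for a single IoT device. The packet tuple layout, the function name, and the example IP addresses are hypothetical; the text does not describe how the features were actually extracted.

```python
from collections import Counter

def per_minute_counts(packets, iot_ip):
    """Count packets sent and received by iot_ip in each one-minute window.

    packets: iterable of (timestamp_seconds, src_ip, dst_ip) tuples
             (an assumed layout, not the authors' capture format).
    Returns {minute_index: (sent, received)}.
    """
    sent, received = Counter(), Counter()
    for ts, src, dst in packets:
        minute = int(ts // 60)
        if src == iot_ip:
            sent[minute] += 1
        elif dst == iot_ip:
            received[minute] += 1
    minutes = sorted(set(sent) | set(received))
    return {m: (sent[m], received[m]) for m in minutes}

# Hypothetical capture: the device sends one packet and receives two in
# minute 0, then sends one packet in minute 1.
packets = [
    (0.5, "192.168.2.10", "192.168.2.1"),
    (10.0, "192.168.2.1", "192.168.2.10"),
    (20.0, "192.168.2.1", "192.168.2.10"),
    (75.0, "192.168.2.10", "192.168.2.1"),
]
rates = per_minute_counts(packets, "192.168.2.10")
print(rates)
```

A sudden jump in the received count for a TCP flow, with no matching rise in sent packets, is the kind of pattern the last bullet associates with SYN floods.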
3.4. Benchmark Dataset
- DDoS (Distributed Denial of Service): This class simulates attacks where multiple systems overwhelm a target or its surrounding infrastructure with a flood of Internet traffic.
- DoS (Denial of Service): Similar to DDoS but typically involves a single attacking system, focusing on overwhelming or crashing the target by flooding it with excessive requests.
- Reconnaissance: These activities are exploratory in nature, aiming to gather information about a network to identify potential vulnerabilities for future attacks.
- Theft: This class includes scenarios involving unauthorized access and extraction of sensitive data, representing data breaches.
- Normal: Represents normal network activities to provide a baseline for detecting anomalous behavior and differentiating between benign and malicious traffic.
3.5. Designing the Synthetic Data Generator Tool
3.5.1. Synthetic Data Generator Tool Algorithm
- Input: Dataset D with categorical columns C, numerical columns N, class column class, and latent dimension latent_dim.
- Output: Synthetic dataset D′.
- Initialization:
- Construct label encoders for C and a standard scaler for N.
- Define generator G and discriminator D architectures.
- Preprocessing:
- Impute C with the mode and label-encode it; impute N with the median and scale it.
- Model Construction:
- Generator G: Combine noise vector z and embedded categorical labels to generate synthetic N.
- Discriminator D: Classify combined real or synthetic N and embedded labels.
- Training:
- Alternate training of D and G using batches of real and generated data.
- Synthesis:
- Generate and decode synthetic samples for each class, ensuring feature fidelity and balance.
- Output: Return D′, matching the distribution and characteristics of the input dataset D.
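The preprocessing step of the algorithm can be sketched in plain Python as follows. This is a minimal illustration, assuming mode imputation for categorical columns and median imputation for numerical ones; the helper names are hypothetical, and a real pipeline would more likely rely on a library's label encoder and standard scaler.

```python
from statistics import mean, median, mode, pstdev

def impute(values, strategy):
    """Replace None entries with the median (numerical) or mode (categorical)."""
    present = [v for v in values if v is not None]
    fill = median(present) if strategy == "median" else mode(present)
    return [fill if v is None else v for v in values]

def label_encode(values):
    """Map each category to an integer index; return the codes and the classes."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def standard_scale(values):
    """Scale to zero mean and unit variance; return scaled values and (mean, std)."""
    mu, sigma = mean(values), pstdev(values)
    sigma = sigma or 1.0  # constant column: avoid dividing by zero
    return [(v - mu) / sigma for v in values], (mu, sigma)

# Toy columns with missing entries (illustrative values only).
proto = impute(["udp", None, "tcp", "udp"], "mode")
codes, classes = label_encode(proto)

ttl = impute([64.0, None, 128.0, 64.0], "median")
scaled, (mu, sigma) = standard_scale(ttl)

print(codes, classes)   # integer codes and the category order used
print(mu, sigma)        # parameters needed later to decode synthetic values
```

Keeping `classes`, `mu`, and `sigma` is what makes the later synthesis step reversible: generated values can be mapped back to category names and to the original numerical scale.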
3.5.2. GUI of the Proposed Synthetic Data Generator Tool
1. Pre-processing: The user uploads a CSV dataset through the GUI (Figure 8). The tool pays particular attention to mixed data types, since handling them is one of the main challenges in using CGANs for tabular data generation and determines whether these types are represented accurately in the generated data. The tool, which implements the CGAN model, lets the user first select the class column, the categorical and numerical features, and the numerical features that should be generated as non-decimal (integer) values (Figure 9). The selected features are then label-encoded (categorical) or scaled (numerical), which helps with non-Gaussian distributions and makes the data suitable for the CGAN pipeline. Without these pre-processing steps, the model may fail to capture the relationships between features, leading to poor-quality synthetic data.
2. Feature Selection: To enhance the model’s performance, a feature selection step is incorporated to identify the most important numerical features for synthetic data generation. This step is also controlled by the user from the tool’s interface. The objective is to reduce the dimensionality of the data and overcome issues related to computational efficiency.
3. Model Architecture: The model consists of generator and discriminator networks, which are trained alternately so that the two models compete, mitigating issues related to mode collapse. The generator has three dense layers with 128 and 256 neurons, implementing ReLU activations. This structure provides sufficient capacity for the generator to capture the complex, high-dimensional relationships between noise, categorical, and numerical features. Batch normalization and dropout layers are also added to ensure training stability and prevent overfitting. The generator combines two input layers: one for random noise and the other for categorical labels. The categorical labels are embedded into a dense vector using an embedding layer, which maps each category to a high-dimensional space. The noise vector and the embeddings are then concatenated into a combined input, which is fed to several dense layers with ReLU activation. As seen in Figure 11, the architecture facilitates the generation of realistic and balanced synthetic data that imitate the statistical distribution of the original multi-class dataset.
4. Training: The generator and discriminator are trained within an adversarial framework, in which each model improves by competing against the other.
5. Data Generation: Data generation must ensure that the generated data adhere to the distributions and characteristics of the original multi-class dataset, reproducing values for both numerical and categorical features (Figure 12). It is performed after training, using the generator’s learned parameters to produce new data points. The user selects the number of samples to be generated, then downloads the output as CSV files for both datasets (with and without feature selection) (Figure 13).
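The conditioning mechanism described in the Model Architecture step (a label embedding concatenated with a noise vector, then passed through dense ReLU layers) can be sketched as a NumPy forward pass. This is an illustration of the data flow only: the weights are random and untrained, batch normalization and dropout are omitted, and the latent size, embedding width, number of output features, and the reading of the layer sizes as 128, 256, and an output layer are all assumptions rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 32   # assumed noise vector size
N_CLASSES = 7     # six attack types plus the normal class
EMBED_DIM = 16    # assumed embedding width
N_NUMERIC = 5     # assumed number of numerical features to generate

# Randomly initialised (untrained) parameters, for shape illustration only.
embedding = rng.normal(size=(N_CLASSES, EMBED_DIM))
sizes = [LATENT_DIM + EMBED_DIM, 128, 256, N_NUMERIC]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def generate(labels):
    """Noise + label embedding -> dense ReLU layers -> scaled numerical features."""
    labels = np.asarray(labels)
    z = rng.normal(size=(len(labels), LATENT_DIM))
    h = np.concatenate([z, embedding[labels]], axis=1)  # conditional input
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)                  # ReLU
    return h @ weights[-1] + biases[-1]                 # linear output layer

# Balanced request: the same number of samples for every class.
per_class = 3
labels = np.repeat(np.arange(N_CLASSES), per_class)
synthetic = generate(labels)
print(synthetic.shape)  # (21, 5)
```

Because the class label is an explicit input, requesting an equal number of samples per label yields a balanced output regardless of how skewed the training data were. Features marked as non-decimal would additionally be rounded after the outputs are mapped back to the original scale.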
4. Results and Discussion
4.1. Implementation
4.2. Class Imbalance Evaluation
4.3. Evaluation of the Ranges of Numerical Features in Both Datasets
4.4. Evaluation of Categorical Features
4.5. Evaluation of Numerical Features
5. Conclusions
6. Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Results of Applying the Attacks on the IoT Devices
No. | Attack Type | Used Library | Attack Successful | Results | Inside/Outside Attack |
---|---|---|---|---|---|
1 | Nmap | Nmap | yes | No open ports found | Inside |
2 | MITM | Ettercap | yes | Data were intercepted. However, only some of the data were readable, as some devices’ data were encrypted. | Inside |
3 | Deauth | Aireplay-ng | yes | The device is no longer authenticated to the network. | Outside |
4 | DOS (tcp) | slowhttptest, slowloris, slowite | no | Device does not have TCP servers running. | Inside |
5 | DOS (UDP Flood) | hping3 | yes | Device was halted sometimes and its response was delayed. The device changed status randomly, occasionally when the UDP had a meaning by coincidence. | Inside |
6 | Password crack | Brute force | no | The device has no login/password prompts on any services. | Inside |
No. | Attack Type | Used Library | Attack Successful | Results | Inside/Outside Attack |
---|---|---|---|---|---|
1 | Nmap | Nmap | yes | 443/tcp open https; 554/tcp open rtsp; 2020/tcp open xinupageserver; 8800/tcp open sunwebadmin; 20002/tcp open commtact-http | Inside |
2 | MITM | Ettercap | yes | Data were intercepted. However, only some of the data were readable, as some devices’ data were encrypted. | Inside |
3 | Deauth | Aireplay-ng | yes | The device is no longer authenticated to the network. | Outside |
4 | DOS (tcp) | slowhttptest, slowloris, slowite | yes | Service on 443 went down. However, the device did not go offline, nor was functionality affected. | Inside |
5 | DOS (UDP Flood) | hping3 | no | Device does not use UDP or have UDP ports open. | Inside |
6 | Password crack | Brute force | yes | The device has a user-password prompt on RTSP streaming, which yields 401 when wrong credentials are provided. | Inside |
No. | Attack Type | Used Library | Attack Successful | Results | Inside/Outside Attack |
---|---|---|---|---|---|
1 | Nmap | Nmap | yes | 80/tcp open https; 554/tcp open rtsp | Inside |
2 | MITM | Ettercap | yes | Data were intercepted. However, only some of the data were readable, as some devices’ data were encrypted. | Inside |
3 | Deauth | Aireplay-ng | yes | The device is no longer authenticated to the network. | Outside |
4 | DOS (tcp) | slowhttptest, slowloris, slowite | yes | Service on 443 went down. However, the device did not go offline, nor was functionality affected. | Inside |
5 | DOS (UDP Flood) | hping3 | no | Device does not use UDP or have UDP ports open. | Inside |
6 | Password crack | Brute force | yes | The device has a user-password prompt on RTSP streaming, which yields 401 when wrong credentials are provided. | Inside |
References
- Kumar, V.; Sinha, D. Synthetic attack data generation model applying generative adversarial network for intrusion detection. Comput. Secur. 2023, 125, 103054.
- Jeong, J.; Lim, J.Y.; Son, Y. A data type inference method based on long short-term memory by improved feature for weakness analysis in binary code. Future Gener. Comput. Syst. 2019, 100, 1044–1052.
- Alabdulwahab, S.; Kim, Y.T.; Seo, A.; Son, Y. Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments. Appl. Sci. 2023, 13, 10951.
- Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT telemetry dataset: A new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access 2020, 8, 165130–165150.
- Samarakoon, S.; Siriwardhana, Y.; Porambage, P.; Liyanage, M.; Chang, S.Y.; Kim, J.; Kim, J.; Ylianttila, M. 5g-nidd: A comprehensive network intrusion detection dataset generated over 5g wireless network. arXiv 2022, arXiv:2212.01298.
- Liu, X.; Li, T.; Zhang, R.; Wu, D.; Liu, Y.; Yang, Z. A GAN and feature selection-based oversampling technique for intrusion detection. Secur. Commun. Netw. 2021, 2021, 9947059.
- Riera, T.S.; Higuera, J.R.B.; Higuera, J.B.; Herraiz, J.J.M.; Montalvo, J.A.S. A new multi-label dataset for Web attacks CAPEC classification using machine learning techniques. Comput. Secur. 2022, 120, 102788.
- Parmisano, A.; Garcia, S.; Erquiaga, M.J. A Labeled Dataset with Malicious and Benign Iot Network Traffic; Stratosphere Laboratory: Praha, Czech Republic, 2020.
- Hindy, H.; Bayne, E.; Bures, M.; Atkinson, R.; Tachtatzis, C.; Bellekens, X. Machine learning based IoT intrusion detection system: An MQTT case study (MQTT-IoT-IDS2020 dataset). In Proceedings of the International Networking Conference; Springer: Cham, Switzerland, 2020; pp. 73–84.
- Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
- Hamza, A.; Gharakheili, H.H.; Benson, T.A.; Sivaraman, V. Detecting volumetric attacks on IoT devices via sdn-based monitoring of mud activity. In Proceedings of the 2019 ACM Symposium on SDN Research, San Jose, CA, USA, 3–4 April 2019; pp. 36–48.
- Sivanathan, A.; Gharakheili, H.H.; Loi, F.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Classifying IoT devices in smart environments using network traffic characteristics. IEEE Trans. Mob. Comput. 2018, 18, 1745–1759.
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 2018, 1, 108–116.
- Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-baiot—Network-based detection of iot botnet attacks using deep autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22.
- Sivanathan, A.; Sherratt, D.; Gharakheili, H.H.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Characterizing and classifying IoT traffic in smart cities and campuses. In Proceedings of the 2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Atlanta, GA, USA, 1–4 May 2017; pp. 559–564.
- Sureda Riera, T.; Bermejo Higuera, J.R.; Bermejo Higuera, J.; Sicilia Montalvo, J.A.; Martínez Herráiz, J.J. SR-BH 2020 Multi-Label Dataset 2022. Available online: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OGOIXX (accessed on 15 September 2024).
- Mashrur Arifin, M.; Shoaib Ahmed, M.; Ghosh, T.K.; Zhuang, J.; Yeh, J.h. A Survey on the Application of Generative Adversarial Networks in Cybersecurity: Prospective, Direction and Open Research Scopes. arXiv 2024, arXiv:2407.08839.
- Ranka, P.; Shah, A.; Vora, N.; Kulkarni, A.; Patil, N. Computer Vision-Based Cybersecurity Threat Detection System with GAN-Enhanced Data Augmentation. In International Conference on Soft Computing and Its Engineering Applications; Springer: Cham, Switzerland, 2023; pp. 54–67.
- Strickland, C.; Zakar, M.; Saha, C.; Soltani Nejad, S.; Tasnim, N.; Lizotte, D.J.; Haque, A. Drl-gan: A hybrid approach for binary and multiclass network intrusion detection. Sensors 2024, 24, 2746.
- Dina, A.S.; Siddique, A.; Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access 2022, 10, 96731–96747.
- Vasilomanolakis, E.; Cordero, C.G.; Milanov, N.; Mühlhäuser, M. Towards the creation of synthetic, yet realistic, intrusion detection datasets. In Proceedings of the NOMS 2016—2016 IEEE/IFIP Network Operations and Management Symposium, Istanbul, Turkey, 25–29 April 2016; pp. 1209–1214.
- Subahi, A.; Almasre, M. IoT Traffic Analyzer Tool with Automated and Holistic Feature Extraction Capability. Sensors 2023, 23, 5011.
- Ashraf, J.; Keshk, M.; Moustafa, N.; Abdel-Basset, M.; Khurshid, H.; Bakhshi, A.D.; Mostafa, R.R. IoTBoT-IDS: A novel statistical learning-enabled botnet detection framework for protecting networks of smart cities. Sustain. Cities Soc. 2021, 72, 103041.
- UNSW, S. The Bot-IoT Dataset. 2021. Available online: https://research.unsw.edu.au/projects/bot-iot-dataset (accessed on 27 August 2024).
- Figueira, A.; Vaz, B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics 2022, 10, 2733.
- Saxena, D.; Cao, J. Generative adversarial networks (GANs) challenges, solutions, and future directions. ACM Comput. Surv. (CSUR) 2021, 54, 1–42.
- Couplet, E.; Lee, J.A.; Verleysen, M. Tabular Data Synthesis Using Generative Adversarial Networks: An Application to Table Augmentation. Master’s Thesis, UCLouvain, Ottignies-Louvain-la-Neuve, Belgium, 2021.
- Nayak, A.A.; Venugopala, P.; Ashwini, B. A Systematic Review on Generative Adversarial Network (GAN): Challenges and Future Directions. Arch. Comput. Methods Eng. 2024, 1–34.
- Ahmad, Z.; Chen, M.; Bao, S. Understanding GANs: Fundamentals, variants, training challenges, applications, and open problems. Multimed. Tools Appl. 2024, 1–77.
Dataset | Year | Description | Size | Features | Labels | Strengths | Limitations |
---|---|---|---|---|---|---|---|
CAPEC Web Attacks | 2022 | Multi-label dataset classifying web-based attacks (SQLI, XSS, CSRF) using CAPEC classification; focuses on web server vulnerabilities and web attacks prevalent in IoT environments | ∼5 GB | CAPEC attack patterns, including web-based attacks (SQLI, XSS, CSRF) | Normal traffic and web attack labels | Valuable for IDS development against web threats in IoT, underrepresented in IoT datasets | Limited to web-based attacks |
5G-NIDD | 2022 | Data from a functional 5G test network, capturing normal and malicious activities | ∼5 GB | 112 features, including flow-based, packet, and statistical attributes | Normal traffic and nine attack types | Realistic 5G traffic, detailed feature set; generated dataset for testing missing attacks | Large size, requires significant resources |
TON_IoT | 2020 | IoT network traffic, OS logs, and telemetry data | ∼80 GB | Multiple features across network traffic, log data, and telemetry data | Normal and malicious activities | Realistic environment, diverse data sources | Large size, complexity of analysis |
IoT-23 | 2020 | Network traffic from 23 IoT devices with both benign and malicious scenarios | ∼50 GB | Network traffic features including packet details, flow statistics, protocol-specific attributes | Normal traffic and different types of attacks | Device diversity, comprehensive attack coverage | Large size, labeling complexity |
MQTT-IoT IDS | 2020 | Network traffic data from MQTT environments | ∼5 GB | Packet details, flow characteristics, MQTT-specific attributes | Normal traffic and various attack types (DoS, scan, brute force) | Realistic traffic, MQTT focus | Smaller dataset, protocol specificity |
Bot-IoT | 2019 | IoT network traffic dataset capturing botnet activities | ∼69 GB | Over 50 features | Different attack types and normal traffic | Comprehensive attack coverage, generated dataset used for modeling certain botnet behaviors | Synthetic scenarios, large size |
UNSW-IoT | 2019 | IoT network traffic capturing normal and attack scenarios | ∼50 GB | Flow-based, statistical, and protocol-specific features | Normal traffic and various attack types (DDoS, reconnaissance, etc.) | Comprehensive feature set, realistic traffic | Large size, complexity of analysis |
UNSW-IoT Trace | 2018 | Subset of UNSW-IoT dataset with packet-level details | ∼10 GB | Packet-level features including timestamps, IP addresses, and port numbers | Normal traffic and attack types | Detailed packet-level data | Limited scope |
CICIDS2017 | 2018 | Network traffic dataset for intrusion detection systems | ∼70 GB | 80 network traffic features | Normal traffic and various attack types | Realistic traffic, comprehensive data | High dimensionality, class imbalance |
IoT Dataset by MedBIoT | 2018 | Network traffic from IoT devices under attack | ∼30 GB | Packet-level details, flow features, and statistical measures | Normal traffic and different attack types | Real-world scenarios, detailed traffic analysis | Large size, diverse attack types |
IoTID20 | 2017 | Network traffic for IoT device identification and anomaly detection | ∼100 GB | Over 80 features | Different device types and anomaly types | Covers a wide range of IoT devices, labeled anomalies | Large size, high dimensionality |
No. | Attack Type | Used Library | Attack Successful | Results | Inside/Outside Attack |
---|---|---|---|---|---|
1 | Nmap | Nmap | yes | 80/tcp open HTTP | Inside |
2 | MITM | Ettercap | yes | Data were intercepted. However, only some of the data were readable, as some devices’ data were encrypted. | Inside |
3 | Deauth | Aireplay-ng | yes | The device is no longer authenticated to the network | Outside |
4 | DOS (tcp) | slowhttptest, slowloris, slowite | yes | Device became unresponsive and offline and required reboot. | Inside |
5 | DOS (UDP Flood) | hping3 | no | Device does not use udp or have udp ports open | Inside |
6 | Password crack | Brute force | yes | The device has a login feature that can be activated in the browser, which prompts the user to enter a username and a password | Inside |
Feature | Count | Unique Values | Top Value | Top Value Count |
---|---|---|---|---|
pkSeqID | 733,705 | 733,705 | 2 | 1 |
proto | 733,705 | 5 | udp | 399,618 |
saddr | 733,705 | 16 | 192.168.100.147 | 189,606 |
sport | 733,705 | 65,538 | 0x0303 | 1794 |
daddr | 733,705 | 45 | 192.168.100.3 | 475,171 |
dport | 733,705 | 4111 | 80 | 714,781 |
seq | 733,705 | 249,513 | 380 | 12 |
N_IN_Conn_P_SrcIP | 733,705 | 100 | 100 | 369,260 |
state_number | 733,705 | 11 | 4 | 399,567 |
N_IN_Conn_P_DstIP | 733,705 | 100 | 100 | 573,744 |
category | 733,705 | 5 | DDoS | 385,309 |
Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max | Number of Samples |
---|---|---|---|---|---|---|---|---|---|
stddev | 733,705 | 0.887894 | 0.804013 | 0 | 0.030132 | 0.795481 | 1.745595 | 2.496758 | 733,705 |
min | 733,705 | 1.018868 | 1.484235 | 0 | 0 | 0 | 2.163444 | 4.98047 | 733,705 |
mean | 733,705 | 2.233429 | 1.517572 | 0 | 0.182193 | 2.691715 | 3.566569 | 4.981785 | 733,705 |
drate | 733,705 | 0.506298 | 74.33018 | 0 | 0 | 0 | 0 | 58823.53 | 733,705 |
srate | 733,705 | 2.262398 | 403.4081 | 0 | 0.156231 | 0.283784 | 0.488849 | 333333.3 | 733,705 |
max | 733,705 | 3.023 | 1.860725 | 0 | 0.281688 | 4.011386 | 4.296505 | 4.999999 | 733,705 |
Joint Dataset (by attack type) | Normal | MITM | Deauth | Password-Cracking | Reconnaissance | SYN Flood | UDP-Flood |
---|---|---|---|---|---|---|---|
Original Number of Samples | 148 | 4 | 8 | 8 | 5718 | 5299 | 1590 |
Original Deviation | 0.131272 | 0.142544 | 0.142231 | 0.142231 | 0.304736 | 0.271937 | 0.018395 |
Synthetic Number of Samples | 1428 | 1428 | 1428 | 1428 | 1428 | 1428 | 1428 |
Synthetic Deviation | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Imbalance Reduction Proportion | 0.131272 | 0.142544 | 0.142231 | 0.142231 | 0.304736 | 0.271937 | 0.018395 |
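The deviation values in the joint-dataset table appear consistent with the absolute difference between each class's share of the samples and the ideal balanced share, that is |n_i / N − 1/k| for k classes. This formula is inferred from the reported numbers and is not stated explicitly in the text; the function name below is hypothetical. A short check:

```python
def class_deviation(counts):
    """Absolute gap between each class's share and the balanced share 1/k."""
    total = sum(counts.values())
    ideal = 1 / len(counts)
    return {label: abs(n / total - ideal) for label, n in counts.items()}

# Sample counts copied from the joint-dataset table.
original_counts = {
    "Normal": 148,
    "MITM": 4,
    "Deauth": 8,
    "Password-Cracking": 8,
    "Reconnaissance": 5718,
    "SYN Flood": 5299,
    "UDP-Flood": 1590,
}
synthetic_counts = {label: 1428 for label in original_counts}

original_dev = class_deviation(original_counts)
synthetic_dev = class_deviation(synthetic_counts)

for label in original_counts:
    print(f"{label:18s} {original_dev[label]:.6f} {synthetic_dev[label]:.6f}")
```

Under this reading, the "Imbalance Reduction Proportion" row is simply the original deviation minus the synthetic one, which equals the original deviation whenever the synthetic classes are perfectly balanced.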
Bot Dataset (by attack type) | DDoS | DoS | Normal | Reconnaissance | Theft |
---|---|---|---|---|---|
Number of Samples | 385,309 | 330,112 | 18,163 | 107 | 14 |
Original Deviation | 0.325155 | 0.249925 | 0.199854 | 0.175245 | 0.199980 |
Synthetic Number of Samples | 20,000 | 20,000 | 20,000 | 20,000 | 20,000 |
Synthetic Deviation | 0 | 0 | 0 | 0 | 0 |
Imbalance Reduction Proportion | 0.325155 | 0.249925 | 0.199854 | 0.175245 | 0.199981 |
Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max | Number of Samples |
---|---|---|---|---|---|---|---|---|---|
Dest_port_no | 9996 | 41,659.04 | 12,909.2 | 53 | 36,065.5 | 43,553 | 51,254 | 60,988 | 9996 |
IoT_port_no | 9996 | 10,112.15 | 15,022.46 | 1 | 1152 | 4001 | 9101 | 65,389 | 9996 |
Dest_TCP_Flags | 9996 | 1161.495 | 2844.339 | 0 | 2 | 11 | 18 | 8180 | 9996 |
IoT_TCP_Flags | 9996 | 22.59104 | 32.59825 | 0 | 2 | 12 | 18 | 100 | 9996 |
IOT_Respond_401 | 9996 | 0.493597 | 0.499984 | 0 | 0 | 0 | 1 | 1 | 9996 |
Feature | Cumulative Difference | MAE | RMSE | Correlation |
---|---|---|---|---|
Send_receive_ratio | 3.082799 | 3.137454 | 3.183326 | 0.010839 |
No_of_received_packets_per_minutes | 672.1486 | 672.1486 | 674.19 | 0.016176 |
No_of_sent_packets_per_minutes | 1809.712 | 1810.018 | 1815.296 | −0.00774 |
Avg_TTL | 48.10356 | 48.38865 | 49.17743 | 0.00193 |
Flow_volume | 879907.2 | 879982.6 | 882675.5 | 0.013071 |
Flow_duration | 2.580025 | 2.589503 | 2.603747 | −0.03076 |
Dest_ip_avg_packet_length | 214.148 | 214.301 | 216.9572 | −0.00157 |
Src_ip_avg_packet_length | 716.8348 | 716.9825 | 718.8302 | 0.000316 |
Flow_rate | 1746863 | 1746863 | 1751104 | −0.01433 |
Max_dest_SSL_payload | 728.1321 | 728.3052 | 733.4881 | 0.012872 |
Min_dest_SSL_payload | 155.2861 | 155.441 | 156.3907 | −0.00866 |
Avg_dest_SSL_payload | 234.1595 | 234.2863 | 236.5934 | −0.00195 |
Std_dest_SSL_payload | 129.1881 | 131.8779 | 133.9625 | −0.00594 |
Max_IoT_SSL_payload | 746.4051 | 747.3188 | 749.6432 | −0.00565 |
Min_IoT_SSL_payload | 122.7403 | 122.7745 | 123.4037 | −0.00796 |
Avg_IoT_SSL_payload | 660.2485 | 660.3813 | 662.3838 | 0.008442 |
Std_IoT_SSL_payload | 313.2357 | 313.48 | 314.5114 | −0.00133 |
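For reference, the MAE, RMSE, and correlation columns above correspond to standard definitions that can be computed per feature as sketched below. The sketch assumes the real and synthetic columns are compared as equal-length paired samples; how rows were actually paired is not described in the text, and the "Cumulative Difference" column is left out because its definition is not given. The toy values are illustrative only.

```python
from math import sqrt

def mae(real, synth):
    """Mean absolute error between paired values."""
    return sum(abs(r - s) for r, s in zip(real, synth)) / len(real)

def rmse(real, synth):
    """Root mean squared error between paired values."""
    return sqrt(sum((r - s) ** 2 for r, s in zip(real, synth)) / len(real))

def pearson(real, synth):
    """Pearson correlation coefficient between paired values."""
    n = len(real)
    mean_r, mean_s = sum(real) / n, sum(synth) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, synth))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in synth)
    return cov / sqrt(var_r * var_s)

# Toy paired columns, not taken from the datasets.
real = [1.0, 2.0, 3.0, 4.0]
synth = [2.0, 2.0, 4.0, 4.0]

print(mae(real, synth))      # 0.5
print(rmse(real, synth))
print(pearson(real, synth))
```

Note that MAE and RMSE are expressed in each feature's own units, so values for large-magnitude features such as Flow_volume are not directly comparable with those for ratio features.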
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Almasre, M.; Subahi, A. Create a Realistic IoT Dataset Using Conditional Generative Adversarial Network. J. Sens. Actuator Netw. 2024, 13, 62. https://doi.org/10.3390/jsan13050062