Dataset Generation for Development of Multi-Node Cyber Threat Detection Systems
Abstract
1. Introduction
- Applying new models of cyber threats and setting up platforms to simulate them in near-production environments.
- Elaboration of detection algorithms that are technically feasible in modern networks and systems.
- Sharing the collected information on cyber threats and applying this information in technical solutions.
- Accelerating the decision-making process within cybersecurity teams and departments together with the company’s decision-makers.
- Delivery and C2 phases as defined by the Cyber Kill Chain methodology.
- Defense Evasion, Exfiltration, and C2 tactics as classified by the MITRE ATT&CK methodology.
- Section 2 briefly presents the available datasets and the systematic approach to evaluating self-generated datasets. It builds the context for the need for the research in this paper:
  - Generating datasets for cyber threat detection research in the domain of information hiding techniques applied by modern malware and malicious cyber operations such as APTs.
  - Establishing the possibility of generating these datasets in different, randomized networking environments with a varying set of sources and destinations for the simulated cyber attacks.
- Section 3 presents the methodology used to establish the framework for end-to-end dataset generation for cyber threat detection research. It follows the context of the research established in Section 2. This section covers the concept of the system for capturing network traffic in a multi-node setup, simulation of benign and malicious network flows and scenarios for generating the final datasets, and a simple methodology for generating datasets.
- Section 5 concludes the paper with a summary of the results and further research directions that could be based on this paper.
- An approach to collect datasets for cyber threat detection research in a multi-node setup using the developed agent system. This contribution goes far beyond the state-of-the-art presented in Section 2.3. The majority of the available datasets are focused on providing indicators for simulated cyber attacks from single endpoints like central collectors, whereas this research tackles multi-node cyber data collection to follow the cyber attack path of execution.
- Application of information hiding techniques in communication networks [10] to research cyber threats as an emerging problem in cyberspace. The paper shows how to generate network data streams using information hiding techniques. This is a key result, as most of the state-of-the-art datasets presented in Section 2.3 include only classic types of cyber attacks, with no covert communication samples. The introduction of this paper and Section 2.4 show the increase in malware applying information hiding techniques for Command and Control channels, to exfiltrate data, or to persistently maintain presence in compromised environments. Research into cyber threat detection methods for steganography used in malicious operations has therefore never been more important.
- Development of an automated and randomized tool for setting up network configurations (nodes and links) when performing simulations of network communication scenarios. The data collection environments behind the datasets presented in Section 2.3 were mostly configured once, with fixed sources and destinations of cyber attacks. The contribution of this paper offers a way to mitigate the biases in datasets related to the shape and topology of the environment in which they were collected.
- The execution of reference cyber threat detection experiments on the collected datasets. Most of the state-of-the-art research papers related to the datasets included in Section 2.3 present only the datasets and the collection process. This paper additionally evaluates the collected datasets to show that they are usable in data-driven cyber threat detection workflows.
2. Related Work
2.1. Multi-Node Cyber Defense Solutions
2.2. Generation of Datasets for Cyber Threat Detection Research
- Collecting data from actual production networks and cyber intrusions,
- Building models of production networks and simulating network communications (malicious and benign),
- The use of mathematical, statistical, machine learning, and other algorithms to generate the data.
- Collecting data from sensors distributed on network nodes,
- Allowing for continuous communication and coordination between sensors,
- The use of a central processing unit to improve detection decisions,
- The automation of the network scenarios in which the data was collected,
- The use of data science methods to oversample the least representative samples of malicious data.
2.3. Availability of Datasets for Cyber Threat Detection Research
- The CALIBRATION set was collected from March to June 2016 (four months) and contains real background traffic data.
- The TEST set was collected from July to August 2016 and contains real background traffic together with synthetically generated traffic of various known attack types.
2.4. Malware with Information Hiding Techniques Applied
- Modified properties of the protocol;
- Modified properties of the protocol may refer to mechanisms related to inadequacies of the communication channel, the nature of the messages exchanged, or their form;
- Communication parties trying to prevent the observer from detecting the transmission of data using information hiding techniques.
- Stealth data tunneling in ICMP protocol traffic (ping).
- Insertion of steering commands in cookies in the HTTP protocol header.
- Insertion of steering commands into specially prepared TCP protocol segments or UDP datagrams.
- Use of multimedia steganography to hide the data.
- Use of standard protocols of the TCP/IP stack, especially application network traffic, to smuggle multimedia files between victims and attackers either directly or via C2 servers.
- Stegobot [36]—one of the pioneering systems using OSNs as an overlay network for the technical operations.
- Instegogram [37]—a technique that uses the image feed of a given Instagram account to decode C2 messages from images. The main achievement here was using a popular internet service to smuggle malware communications.
- StegHash with SocialStegDisc [38]—The StegHash technique was used to distribute multimedia files with hidden data portions across many Internet services and accounts. The mechanism of hashtags creates an invisible chain through which the original message can be recovered. SocialStegDisc implemented the StegHash technique to address the scheme in a novel steganographic file system.
3. Generating Datasets for Cyber Threat Detection Research
3.1. Application of Multi-Node Cyber Threat Detection System
- Agents collect network traffic logs in a specific format and send this data to the Central Unit for analysis.
- Agents are equipped with motion logic that follows the developed algorithm for computing anomaly metrics and cooperates in selecting additional areas of the observed network. The goal is to discover the sources of the attack.
- The Central Unit node, which manages the actor system of the whole platform and coordinates the life cycle of the distributed agents and of itself. More details are presented in Table 1.
- A router with the actor system instance in which a single node-managing agent is instantiated and connected to the whole platform managed by the Central Unit. This agent can also spawn other node agents to perform different functions. Figure 1 shows such an agent, called the Interface Sniffer Agent.
- Sniffing network traffic on all interfaces of a router;
- Storing PCAP files at the nodes;
- Collecting PCAP files across nodes in the Central Unit.
- The performance related to the type of the networks and its protocols.
- The data flow rates and processing performance.
- The physical bandwidth of interfaces within network nodes (routers and the other network appliances).
- The computational resources within a single network node where the cyber threat detection agent would operate.
- The multi-node cyber threat detection system management links to the performance.
3.2. Network Traffic Streams Simulations
3.2.1. Malicious Network Data Streams
- Method based on intentionally lost packets that carry hidden payloads. It can be implemented in various network protocols such as SIP or RTP, and relies on one important rule: an excessively delayed packet is treated as lost, and even if it eventually reaches the destination it is simply discarded without any verification. This rule can be applied directly to network steganography in the following way:
  - Some packets are intentionally delayed so that they are detected as lost.
  - The payload of such packets can be overwritten to carry steganograms.
  - When such a packet finally arrives at its destination, it is simply discarded. If a steganographic receiver is installed, it can intercept these packets to extract the hidden payload.
- Method based on modulating the transmission time between packets to encode bits ‘0’ and ‘1’. Delay-based network steganography is a type of time-based steganography. It uses modulation of the transmission times of successive packets in network traffic to encode ‘1’ and ‘0’ bits of hidden data. Probably any network protocol can be used for such a method. The secret between sender and receiver is to encode and decode the hidden data in the temporal relationships between the packets. The sender side must be parameterized with the type of distribution used to generate the network stream. The receiver side must also be parameterized with this distribution and with decision thresholds in the decoding module.
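
The two techniques above can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation: the gap lengths, the uniform jitter, the decision threshold, the lateness deadline, and all function names are assumptions chosen only for illustration, and no real packets are sent.

```python
import random

# --- Time modulation: hidden bits are encoded in inter-packet gaps ---------
SHORT_GAP = 0.05   # assumed gap (s) encoding bit 0
LONG_GAP = 0.15    # assumed gap (s) encoding bit 1
THRESHOLD = 0.10   # assumed decision threshold (s) on the receiver side


def encode_gaps(bits, jitter=0.01, rng=None):
    """Return one inter-packet gap per hidden bit, with bounded uniform jitter."""
    rng = rng or random.Random(0)
    return [(LONG_GAP if bit else SHORT_GAP) + rng.uniform(-jitter, jitter)
            for bit in bits]


def decode_gaps(gaps):
    """Recover the bits by comparing each observed gap with the threshold."""
    return [1 if gap > THRESHOLD else 0 for gap in gaps]


# --- Lost packets: deliberately late packets carry the steganogram ---------
DEADLINE = 0.2     # assumed delay (s) after which a packet counts as lost


def send_stream(cover_payloads, secret_chunks, every=4, extra_delay=0.5):
    """Yield (seq, delay, payload); every `every`-th packet is delayed and overwritten."""
    secret = iter(secret_chunks)
    for seq, payload in enumerate(cover_payloads):
        chunk = next(secret, None) if seq % every == every - 1 else None
        if chunk is None:
            yield seq, 0.02, payload                 # ordinary packet
        else:
            yield seq, 0.02 + extra_delay, chunk     # late packet with hidden data


def receive_stream(packets):
    """Split packets into those used normally and those discarded as late."""
    played, late = [], []
    for seq, delay, payload in packets:
        (late if delay > DEADLINE else played).append((seq, payload))
    return played, late


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0]
    print(decode_gaps(encode_gaps(bits)) == bits)
    played, late = receive_stream(send_stream([b"rtp"] * 8, [b"se", b"cr"]))
    # An ordinary receiver drops `late`; a steganographic receiver reads it.
    print(b"".join(payload for _, payload in late))
```

In this sketch the jitter is kept smaller than the distance between each gap and the threshold, so decoding is error-free; with a wider delay distribution the receiver thresholds would have to be tuned accordingly.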
3.2.2. Benign Network Data Streams
- Surfing the Internet and using the HTTP protocol.
- VoIP communication using SIP, RTP, UDP, HTTP, and TCP protocols.
- Video streaming using RTP and HTTP protocols.
- Data transfer using HTTP, FTP, SSH/SFTP, TCP, UDP, or email protocols.
- Using network-related protocols such as ICMP.
3.2.3. Engine of Generation of Network Topologies for Experimentation
- Enabling rapid prototyping of new use cases.
- Enabling automatic generation of new network topologies, i.e., setting up a new dataset.
- The number of routers and hosts,
- IP address ranges and routing,
- Whether to provide access to the Internet,
- Whether a firewall should be included,
- Placement of the Central Unit of the entire Cyber Threat Detection Agent system.
- The backend network engine and simulation tool—GNS3 [40];
- The text file to hold the network configuration—nodes and their types;
- The main script in Python, which:
  - Interprets and validates the entered network configuration,
  - Randomly generates connections between nodes,
  - Automatically sets up the network using the GNS3 API;
- The set of scripts in Python to configure the senders and receivers of the network traffic depending on the purpose—to run benign, malicious, or mixed scenarios using the agent system presented in Section 3.1.
- Generation of links between the configured set of nodes.
- Selection of senders and receivers for each network data stream profile (benign or malicious).
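
The randomized steps listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool described in the paper: the one-node-per-line configuration format, the node types `router` and `host`, and every function name are hypothetical, and the call to the GNS3 API that would create the nodes and links is deliberately left out.

```python
import random


def parse_config(text):
    """Parse lines of the assumed form '<name> <type>' and validate node types."""
    nodes = {}
    for line in text.strip().splitlines():
        name, kind = line.split()
        if kind not in {"router", "host"}:
            raise ValueError(f"unknown node type: {kind}")
        if name in nodes:
            raise ValueError(f"duplicate node: {name}")
        nodes[name] = kind
    return nodes


def random_links(nodes, extra_links=1, seed=None):
    """Return a random set of links that keeps every node reachable."""
    rng = random.Random(seed)
    routers = [n for n, kind in nodes.items() if kind == "router"]
    hosts = [n for n, kind in nodes.items() if kind == "host"]
    if not routers:
        raise ValueError("at least one router is required")
    rng.shuffle(routers)
    links = set()
    # Random spanning tree over routers: each router joins an earlier one.
    for i in range(1, len(routers)):
        links.add(frozenset((routers[i], rng.choice(routers[:i]))))
    # Optional redundant router-to-router links.
    candidates = [frozenset((a, b)) for i, a in enumerate(routers)
                  for b in routers[i + 1:] if frozenset((a, b)) not in links]
    rng.shuffle(candidates)
    links.update(candidates[:extra_links])
    # Each host attaches to exactly one randomly chosen router.
    for host in hosts:
        links.add(frozenset((host, rng.choice(routers))))
    return links


def is_connected(nodes, links):
    """Check with a depth-first search that all nodes are mutually reachable."""
    adjacency = {name: set() for name in nodes}
    for link in links:
        a, b = tuple(link)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == set(nodes)


def pick_pairs(nodes, profiles, seed=None):
    """Choose a (sender, receiver) pair of two different hosts per traffic profile."""
    rng = random.Random(seed)
    hosts = [n for n, kind in nodes.items() if kind == "host"]
    return {profile: tuple(rng.sample(hosts, 2)) for profile in profiles}
```

Passing a fixed `seed` makes a generated topology reproducible, which helps when a dataset has to be regenerated for the same network layout.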
3.2.4. Generation of Example Datasets
- Select nodes (computer hosts) that represent the pairs of a sender and a receiver of a hidden communication for both techniques in this study.
- Run benign network communication patterns along with hidden network communication.
- Collect PCAPs for each network node used.
- Transfer all PCAPs to the Central Unit of the platform.
- Finish generating and collecting data.
4. Cyber Threat Detection Research Enabled
4.1. Cyber Threat Detection as Standard ML Classification Problem
- Anomaly detection, which flags observations that deviate from a specified baseline model. Each detected anomaly is then examined in more detail to determine whether it is a cyber threat.
- Detection of cyber threats by finding patterns of the known nature of a cyber attack. Tagged records are required.
4.1.1. Network Flows Generation
- flow timeout—120 s
- activity timeout—30 s
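
To make the role of the two timeouts concrete, the following simplified sketch groups packets into flows. It is not CICFlowMeter's code: it assumes that the flow timeout bounds the duration of a single flow and that the activity timeout separates active from idle periods inside a flow, and it ignores TCP flags and bidirectional flow keys. The function name and the record layout are hypothetical.

```python
FLOW_TIMEOUT = 120.0      # seconds, value listed above
ACTIVITY_TIMEOUT = 30.0   # seconds, value listed above


def build_flows(packets):
    """Group (timestamp, flow_key) packets into flows split by FLOW_TIMEOUT."""
    open_flows = {}
    finished = []
    for ts, key in sorted(packets, key=lambda packet: packet[0]):
        flow = open_flows.get(key)
        # A flow that has lasted longer than the flow timeout is closed.
        if flow is not None and ts - flow["start"] > FLOW_TIMEOUT:
            finished.append(open_flows.pop(key))
            flow = None
        if flow is None:
            flow = open_flows[key] = {"key": key, "start": ts, "last": ts,
                                      "packets": 0, "idle_periods": 0}
        elif ts - flow["last"] > ACTIVITY_TIMEOUT:
            flow["idle_periods"] += 1   # gap long enough to count as idle
        flow["last"] = ts
        flow["packets"] += 1
    return finished + list(open_flows.values())


if __name__ == "__main__":
    packets = [(0.0, "A"), (10.0, "A"), (50.0, "A"), (130.0, "A"), (140.0, "A")]
    for flow in build_flows(packets):
        print(flow)
```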
4.1.2. Training Classifiers for Cyber Threat Detection
4.2. Conclusions and Future Directions
- Generating more data using the prepared application for simulations, especially for more examples of hidden communication techniques.
- Selection of a different predictive model or design of a more complex architecture combining some classifiers.
- Finding a data representation that is better suited to detecting information hiding techniques. The output data representation of CICFlowMeter contains several metrics related to temporal aspects of network communication, which most likely explains why the detection of temporally modulated hidden communication performed better.
- Experimenting with other feature selection or data augmentation methods.
5. Summary
- At different abstraction layers of the ISO-OSI 7-layer model, with the application layer in particular considered a significant threat.
- At each step of a cyber attack modeled by the Cyber Kill Chain [4].
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- [1] Barrett, M. NIST Cybersecurity Framework (CSF): Framework for Improving Critical Infrastructure Cybersecurity. Version 1.1. 2018. Available online: https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.04162018.pdf (accessed on 22 October 2021).
- [2] Fragkos, G.; Minwalla, C.; Plusquellic, J.; Tsiropoulou, E.E. Artificially Intelligent Electronic Money. IEEE Consum. Electron. Mag. 2021, 10, 81–89.
- [3] Cichonski, P.; Millar, T.; Grance, T.; Scarfone, K. NIST SP 800-61: Computer Security Incident Handling Guide. 2012. Available online: https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-61r2.pdf (accessed on 22 October 2021).
- [4] Hutchins, E.; Cloppert, M.J.; Amin, R.M. Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains. Lead. Issues Inf. Warf. Secur. Res. 2011, 1, 80. Available online: https://www.lockheedmartin.com/content/dam/lockheed-martin/rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf (accessed on 22 October 2021).
- [5] MITRE ATT&CK. Available online: https://attack.mitre.org/ (accessed on 5 September 2021).
- [6] Chou, D.; Jiang, M. Data-Driven Network Intrusion Detection: A Taxonomy of Challenges and Methods. arXiv 2020, arXiv:2009.07352.
- [7] Ptacek, T.; Newsham, T. Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection; Secure Networks, Inc.: Mandaluyong, Philippines, 1998; Available online: http://www.icir.org/vern/Ptacek-Newsham-Evasion-98.ps (accessed on 22 October 2021).
- [8] Nehinbe, J.O. A Simple Method for Improving Intrusion Detections in Corporate Networks. In Information Security and Digital Forensics; Weerasinghe, D., Ed.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 111–122.
- [9] McAfee Labs Threats Report—June 2017. Available online: https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-jun-2017.pdf (accessed on 5 September 2021).
- [10] Mazurczyk, W.; Wendzel, S.; Zander, S.; Houmansadr, A.; Szczypiorski, K. Background Concepts, Definitions, and Classification. In Information Hiding in Communication Networks: Fundamentals, Mechanisms, Applications, and Countermeasures; IEEE-Wiley Press: New York, NY, USA, 2016; Chapter 2; pp. 39–58.
- [11] Balasubramaniyan, J.; Garcia-Fernandez, J.; Isacoff, D.; Spafford, E.; Zamboni, D. An architecture for intrusion detection using autonomous agents. In Proceedings of the 14th Annual Computer Security Applications Conference (Cat. No. 98EX217), Phoenix, AZ, USA, 7–11 December 1998; pp. 13–24.
- [12] Herrero, A.; Corchado, E. Multiagent Systems for Network Intrusion Detection: A Review. Comput. Intell. Secur. Inf. Syst. 2009, 63, 143–154.
- [13] Docking, M.; Uzunov, A.V.; Fiddyment, C.; Brain, R.; Hewett, S.; Blucher, L. UNISON: Towards a Middleware Architecture for Autonomous Cyber Defence. In Proceedings of the 2015 24th Australasian Software Engineering Conference, Adelaide, SA, Australia, 28 September–1 October 2015; pp. 203–212.
- [14] Saeed, I.A.; Selamat, A.; Rohani, M.F.; Krejcar, O.; Chaudhry, J.A. A Systematic State-of-the-Art Analysis of Multi-Agent Intrusion Detection. IEEE Access 2020, 8, 180184–180209.
- [15] Kott, A. Intelligent Autonomous Agents are Key to Cyber Defense of the Future Army Networks. Cyber Def. Rev. 2018, 3, 57–70.
- [16] Pascale, F.; Adinolfi, E.A.; Coppola, S.; Santonicola, E. Cybersecurity in Automotive: An Intrusion Detection System in Connected Vehicles. Electronics 2021, 10, 1765.
- [17] Lombardi, M.; Pascale, F.; Santaniello, D. EIDS: Embedded Intrusion Detection System using Machine Learning to Detect Attack Over the CAN-BUS. In Proceedings of the 30th European Safety and Reliability Conference and 15th Probabilistic Safety Assessment and Management Conference, Venice, Italy, 1–5 November 2020; pp. 2028–2035. Available online: https://www.rpsonline.com.sg/proceedings/esrel2020/pdf/5090.pdf (accessed on 22 October 2021).
- [18] Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. ICISSP 2018, 1, 108–116.
- [19] Ring, M.; Wunderlich, S.; Grüdl, D.; Landes, D.; Hotho, A. Creation of Flow-Based Data Sets for Intrusion Detection. J. Inf. Warf. 2017, 16, 41–54.
- [20] Shahriar, M.H.; Haque, N.I.; Rahman, M.A.; Alonso, M. G-IDS: Generative Adversarial Networks Assisted Intrusion Detection System. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 376–385.
- [21] Canadian Institute for Cybersecurity. Intrusion Detection Evaluation Dataset (ISCXIDS2012). Available online: https://www.unb.ca/cic/datasets/ids.html (accessed on 5 September 2021).
- [22] Canadian Institute for Cybersecurity. NSL-KDD Dataset. Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 5 September 2021).
- [23] Canadian Institute for Cybersecurity. Intrusion Detection Evaluation Dataset (CICIDS2017). Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 5 September 2021).
- [24] Canadian Institute for Cybersecurity. A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018). Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 5 September 2021).
- [25] Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
- [26] Maciá-Fernández, G.; Camacho, J.; Magán-Carrión, R.; García-Teodoro, P.; Therón, R. UGR’16: A new dataset for the evaluation of cyclostationarity-based network IDSs. Comput. Secur. 2018, 73, 411–424.
- [27] Center for Applied Internet Data Analysis. CAIDA Datasets. Available online: https://www.caida.org/catalog/datasets/overview/ (accessed on 5 September 2021).
- [28] Information Marketplace For Policy and Analysis of Cyber Risk & Trust. Available online: https://www.impactcybertrust.org (accessed on 5 September 2021).
- [29] Szczypiorski, K. Steganography in TCP/IP Networks—State of the Art and a Proposal of a New System—HICCUPS; Institute of Telecommunications’ Seminar, Warsaw University of Technology: Warsaw, Poland, 2003.
- [30] Mullaney, C. Morto Worm Sets a (DNS) Record. 2011. Available online: http://www.symantec.com/connect/blogs/morto-worm-sets-dns-record (accessed on 22 October 2021).
- [31] Attackers Hide Communication within Linux Backdoor. Available online: https://www.securityweek.com/attackers-hide-communication-linux-backdoor (accessed on 5 September 2021).
- [32] Regin: Top-Tier Espionage Tool Enables Stealthy Surveillance. 2015. Available online: https://docs.broadcom.com/doc/regin-top-tier-espionage-tool-15-en (accessed on 5 September 2021).
- [33] Bencsáth, B.; Pék, G.; Buttyán, L.; Félegyházi, M. Duqu: A Stuxnet-like malware found in the wild. CrySyS Lab Tech. Rep. 2011, 14, 60–141. Available online: https://www.crysys.hu/publications/files/bencsathPBF11duqu.pdf (accessed on 22 October 2021).
- [34] Dell Secureworks. Malware Analysis of the Lurk Downloader. Available online: https://www.secureworks.com/research/malware-analysis-of-the-lurk-downloader (accessed on 5 September 2021).
- [35] FireEye Threat Intelligence. HAMMERTOSS: Stealthy Tactics Define a Russian Cyber Threat Group. Available online: https://www.fireeye.com/blog/threat-research/2015/07/hammertoss_stealthy.html (accessed on 5 September 2021).
- [36] Nagaraja, S.; Houmansadr, A.; Piyawongwisal, P.; Singh, V.; Agarwal, P.; Borisov, N. Stegobot: A Covert Social Network Botnet. In Information Hiding; Filler, T., Pevný, T., Craver, S., Ker, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 299–313.
- [37] Deutsch, J.; Garrie, D. Instegogram: A New Threat and Its Limits for Liability. J. Law Cyber Warf. 2017, 6, 1–7.
- [38] Bieniasz, J.; Szczypiorski, K. Methods for Information Hiding in Open Social Networks. JUCS-J. Univers. Comput. Sci. 2019, 25, 74–97.
- [39] Hewitt, C.; Bishop, P.; Steiger, R. A Universal Modular Actor Formalism for Artificial Intelligence. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence (IJCAI’73), Stanford, CA, USA, 20–23 August 1973; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1973; pp. 235–245.
- [40] GNS3 Network Simulation Tool. 2021. Available online: https://www.gns3.com (accessed on 5 September 2021).
- [41] Canadian Institute for Cybersecurity. CICFlowmeter—Network Traffic Bi-Flow Generator and Analyzer for Anomaly Detection. Available online: https://github.com/ahlashkari/CICFlowMeter (accessed on 5 September 2021).
- [42] Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Int. Res. 2002, 16, 321–357.
- [43] Beckmann, M.; Ebecken, N.F.; de Lima, B.S.P. A KNN undersampling approach for data balancing. J. Intell. Learn. Syst. Appl. 2015, 7, 104.
- [44] Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- [45] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- [46] Random Forest Classifier from Scikit-Learn Framework. 2018. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed on 20 September 2021).
- [47] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: tensorflow.org (accessed on 5 September 2021).
Module | Objective | Communication Patterns |
---|---|---|
Central Unit | | |
A router | Computing and networking platform that hosts the remote portion of the entire agent system | Receiving requests to create a new control agent |
Router Controlling Agent | | |
Internal Sniffing Agent | | |
Scenario | Hidden Communication over Lost Packets | Hidden Communication over Time Modulation |
---|---|---|
Scenario 1 | Sender: Host H1; Receiver: Host H2 | Sender: Host H6; Receiver: Host H7 |
Scenario 2 | Sender: Host H2; Receiver: Host H3 | Sender: Host H1; Receiver: Host H6 |
Scenario 3 | Sender: Host H3; Receiver: Host H4 | Sender: Host H1; Receiver: Host H2 |
Scenario 4 | Sender: Host H4; Receiver: Host H5 | Sender: Host H2; Receiver: Host H3 |
Scenario 5 | Sender: Host H1; Receiver: Host H5 | Sender: Host H3; Receiver: Host H7 |
Scenario 6 | Sender: Host H6; Receiver: Host H7 | Sender: Host H4; Receiver: Host H5 |
Scenario 7 | Sender: Host H4; Receiver: Host H7 | Sender: Host H5; Receiver: Host H6 |
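
The scenario table above can drive the labeling of network flows. The sketch below is only an illustration: it assumes that a flow is identified by its scenario number and its source and destination host names, and that the sender and receiver pair alone marks a covert flow, which the source does not state. The label coding (0 benign, 1 lost packets, 2 time modulation) follows the flow-count tables of this paper; the dictionary and function names are hypothetical.

```python
# Covert (sender, receiver) pairs per scenario, copied from the scenario table.
LOST_PACKETS = {1: ("H1", "H2"), 2: ("H2", "H3"), 3: ("H3", "H4"),
                4: ("H4", "H5"), 5: ("H1", "H5"), 6: ("H6", "H7"),
                7: ("H4", "H7")}
TIME_MODULATION = {1: ("H6", "H7"), 2: ("H1", "H6"), 3: ("H1", "H2"),
                   4: ("H2", "H3"), 5: ("H3", "H7"), 6: ("H4", "H5"),
                   7: ("H5", "H6")}


def label_flow(scenario, src, dst):
    """Return 1 (lost packets), 2 (time modulation), or 0 (benign) for a flow."""
    if (src, dst) == LOST_PACKETS[scenario]:
        return 1
    if (src, dst) == TIME_MODULATION[scenario]:
        return 2
    return 0


if __name__ == "__main__":
    print(label_flow(1, "H1", "H2"))
    print(label_flow(1, "H6", "H7"))
    print(label_flow(1, "H3", "H4"))
```

In a real capture the same host pair may also exchange benign traffic, so a practical labeling rule would probably need additional fields such as ports or protocol.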
Workflow Step | Objective |
---|---|
Step 1 | Benign and malicious network communication simulation scenarios |
Step 2 | Collect raw source data |
Step 3 | Generate network data representation from raw source data |
Step 4 | Data labeling |
Step 5 | Feature selection |
Step 6 | Prepare train and test datasets |
Step 7 | Train data augmentation to balance samples per each label |
Step 8 | Train, test, and evaluate a machine learning classifier |
Types of Flows | Flows in Training Dataset | Flows in Testing Dataset |
---|---|---|
Benign traffic (Label: 0) | 92,737 | 23,163 |
Hidden communication over lost packets (Label: 1) | 742 | 195 |
Hidden communication over time modulation (Label: 2) | 457 | 127 |
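
The counts in the table above are consistent with a random 80/20 split of all flows: the dataset holds 117,421 flows in total and 23,485 of them are in the testing part, which equals 20% rounded up. The sketch below reproduces this arithmetic with a plain shuffled split. The 0.2 test fraction is inferred from the counts rather than stated in the source, and `random_split` is a hypothetical helper, not the library routine used by the authors.

```python
import math
import random

# Flow counts per label, copied from the table above.
train_counts = {0: 92_737, 1: 742, 2: 457}
test_counts = {0: 23_163, 1: 195, 2: 127}


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    total = sum(train_counts.values()) + sum(test_counts.values())
    train_idx, test_idx = random_split(total)
    print(total, len(train_idx), len(test_idx))
    # Share of hidden-communication flows in the training part.
    print((train_counts[1] + train_counts[2]) / sum(train_counts.values()))
```

The last line shows how small the share of hidden-communication flows is in the training data, which motivates the balancing step applied before training.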
Types of Flows | Flows in Training Dataset before SMOTE-ENN | Flows in Training Dataset after SMOTE-ENN |
---|---|---|
Benign traffic (Label: 0) | 92,737 | 92,737 |
Hidden communication over lost packets (Label: 1) | 742 | 92,706 |
Hidden communication over time modulation (Label: 2) | 457 | 92,643 |
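
The oversampling half of SMOTE-ENN can be sketched in a few lines. This is a simplified illustration of the SMOTE idea from [42] only: it omits the ENN cleaning step entirely, uses plain Python tuples as feature vectors, and the function name and parameters are assumptions. It is not the implementation used to produce the counts above.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: sum((a - b) ** 2 for a, b in zip(point, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority, n_new=3, k=2):
        print(point)
```

Each synthetic point lies on a line segment between two existing minority samples, so the method adds density to the minority class without copying samples verbatim.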
Parameter | The Custom MLP Classifier |
---|---|
Layer 1 (Input) | Input, size: 28 × 20, activation: relu |
Layer 2 | Dense, size: 20 × 20, activation: relu |
Layer 3 | BatchNormalization, size: 20 × 20, activation: relu |
Layer 4 | Dense, size: 20 × 150, activation: relu |
Layer 5 | Dense, size: 150 × 20, activation: relu |
Layer 6 (Predictions) | Dense, size: 20 × 20, activation: softmax |
Optimizer | Adam with the learning rate: 0.001 |
Loss | Categorical Cross Entropy |
Evaluation metrics | accuracy |
Training setup | batch size: 128, epochs: 20, validation split: 0.15 |
Types of Flows | Precision | Recall | F1 Score |
---|---|---|---|
Benign traffic (Label: 0) | 1.00 | 0.99 | 1.00 |
Lost packets attack (Label: 1) | 0.49 | 0.92 | 0.64 |
Packet timing attack (Label: 2) | 0.93 | 1.00 | 0.97 |
Overall accuracy | 0.99 |
Types of Flows | Precision | Recall | F1 Score |
---|---|---|---|
Benign traffic (Label: 0) | 1.00 | 0.99 | 1.00 |
Lost packets attack (Label: 1) | 0.42 | 0.99 | 0.59 |
Packet timing attack (Label: 2) | 0.84 | 1.00 | 0.91 |
Overall accuracy | 0.99 |
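
The F1 scores in the two result tables can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below recomputes it from the rounded table values; small deviations in the last digit are possible because the inputs are themselves rounded (for example, 0.93 and 1.00 give about 0.96, while the table reports 0.97).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs for the lost packets rows of both tables.
    for precision, recall in [(0.49, 0.92), (0.42, 0.99)]:
        print(round(f1_score(precision, recall), 2))
```

The check also makes the main weakness visible: for the lost packets class the recall is high but the precision is below 0.5, so the overall accuracy of 0.99 mostly reflects the dominant benign class.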
Workflow Step | Objective | Realization in This Paper |
---|---|---|
Step 1 | Benign and malicious network communication simulations | Implementation of the dedicated tool to set up networks and simulations; proof-of-concept implementation of Multi-node Cyber Threat Detection System to use in monitor and collect mode |
Step 2 | Collecting raw source data | Network traffic traces collected as PCAP files |
Step 3 | Generating network data representation from raw source data | Generation of network flows including 80 metrics from [41] |
Step 4 | Data labeling | Labeling based on Table 2 by adding the column Label with the respected coding to the corresponding network flows |
Step 5 | Feature selection | Selecting features using correlation coefficient between Label and the other features |
Step 6 | Preparing train and test datasets | Train/test split method from [45] |
Step 7 | Train data augmentation to balance samples per each label | Augmentation of the training dataset with SMOTE-ENN method |
Step 8 | Train, test, and evaluate a machine learning classifier | Preparing Random Forest and the custom MLP classifiers. |
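
Step 5 of the workflow above selects features by their correlation with the Label column. A minimal sketch of that idea follows, assuming the Pearson coefficient and an absolute-value cutoff; the source names neither the coefficient type nor a threshold, and the feature names in the example are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_features(columns, labels, threshold=0.1):
    """Keep feature names whose |correlation| with the label reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]


if __name__ == "__main__":
    columns = {
        "iat_mean": [0.1, 0.1, 0.9, 0.8],   # hypothetical timing feature
        "constant": [1, 1, 1, 1],           # carries no information
        "noise": [1, -1, -1, 1],            # uncorrelated with the label
    }
    labels = [0, 0, 1, 1]
    print(select_features(columns, labels))
```

A linear correlation with a multi-class integer label is a coarse criterion, so this kind of filter is best read as a first pass rather than a complete feature selection method.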
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bieniasz, J.; Szczypiorski, K. Dataset Generation for Development of Multi-Node Cyber Threat Detection Systems. Electronics 2021, 10, 2711. https://doi.org/10.3390/electronics10212711