**1. Introduction**

In recent years, threats in cyberspace have evolved into well-organized, long-term, and resource-intensive intrusion campaigns known as Advanced Persistent Threats (APTs). As a result, there is a need for increased research into, and implementation of, new cyber defense solutions, methods, operations, and procedures. Cybersecurity research activity is very broad, but it can be summarized as offering new developments and solutions for new use cases within each function of the NIST Cybersecurity Framework (CSF) [1]. An example of a tailored solution developed for a new cybersecurity use case is physical unclonable functions [2]. The need for a new secure identification and authentication method was driven by the restrictive requirements of cyber-physical systems. The resulting concept offers low computational cost and resource requirements, so that the Identify and Protect functions are easily provided for such systems. The same research strategy for cyber threat detection is followed in this paper. One of the most important scientific and technological areas increasingly being used for cyber defense is data science and data-driven methods. The following list summarizes the five areas of work around data science in cybersecurity:


**Citation:** Bieniasz, J.; Szczypiorski, K. Dataset Generation for Development of Multi-Node Cyber Threat Detection Systems. *Electronics* **2021**, *10*, 2711. https://doi.org/10.3390/ electronics10212711

Academic Editor: Qusay H. Mahmoud

Received: 28 September 2021 Accepted: 3 November 2021 Published: 7 November 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


For a more detailed overview of the challenges of data-driven cyber threats and intrusion detection, see [6]. This paper presents efforts to create an end-to-end process that combines aspects 1, 2, and 3. This is possible by extending the established approach for cyber threat detection systems [7], mainly realized by Network and Host Intrusion Detection Systems (NIDS, HIDS), to Multi-Node Cyber Threat Detection (MNCTD) systems. The cyber defense action matrix [4] shows that classical NIDS or HIDS can be used individually in four out of seven phases of the Cyber Kill Chain—weaponization, exploitation, installation, and C2 (Command and Control). MNCTD goes further and proposes the combination of the detection capabilities of all steps into a cyber threat detection system that focuses on network communications.

Any research project on such models, algorithms, and systems suffers from the limited availability of the right data. The recognized problem of the availability of specific datasets for a particular research hypothesis, and of the preparation of appropriate datasets for cyber threat detection, is a critical challenge [8]. In its first part, this paper summarizes the current state of datasets for cybersecurity research available in academia and industry. It then proposes an approach to create specific datasets for hybrid cyber threat detection systems research, as such datasets are scarcely available in the public domain. Most of the available datasets focus on network attacks, such as Distributed Denial of Service (DDoS), SSH Brute Force, or botnet communication over plain-text protocols such as HTTP or IRC. Modern cyber threat modeling shifts thinking toward identifying threat phases (Cyber Kill Chain) or tactics realized through various techniques (MITRE ATT&CK) in order to block a threat as early as possible. Developing new solutions for cyber defense is about defining the aspects of the threat using the chosen modeling method and creating observable indicators that can be analyzed by detection algorithms. Such an approach could provide the desired ability to block and counter cyber threat campaigns as soon as indicators of a threat are detected. Another novelty of this paper is the emphasis on the increasing importance of detecting information concealment techniques used in cyber attacks, especially in APT campaigns. One of the most important reports on the rise of stegomalware was the June 2017 McAfee report [9], in which steganography was identified as an emerging element of new malware campaigns. Information hiding techniques can be used at any stage of a cyber threat campaign, but the focus here is on methods that work with communication activities over networks:


This paper presents the possibility of preparing datasets with information hiding techniques to develop the concept of a Multi-Node Cyber Threat Detection platform. The created multi-agent system for collecting network packet traces was applied in an automatically generated environment of network nodes, with a random setup of malicious host pairs (sender-receiver) per experimental run. Then, the collected sample datasets were used in the data science workflow for cyber threat detection. The classical pipeline of a data science experiment includes data cleaning, feature selection, under- or over-sampling of rare class examples, and development of a baseline solution for classification problems.

The structure of the paper is as follows:

• Section 2 briefly presents the available datasets and the systematic approach to evaluating self-generated datasets. It builds the context for the need for the research in this paper:


#### **Contributions of the Paper**

The main contributions of the paper are:


#### **2. Related Work**

#### *2.1. Multi-Node Cyber Defense Solutions*

The systems that could be built upon the results of this paper combine the idea of network intrusion detection systems with the concept of multi-agent systems into a multi-node cyber threat detection system. Over the last 30 years, this idea has been investigated in different aspects related to architectures, computational aspects (for example, involving AI), effective collaboration within multi-agent platforms, and applications. One of the milestones is the paper [11], in which the idea of intrusion detection using autonomous agents was proposed. Taken together, publications such as [12–14] draw a comprehensive review of the state of the art in multi-agent cyber defense solutions.

Nowadays, cyber defense based on multi-agent systems is recognized as a modern and very efficient approach that is continuously emerging. Interest in such systems has been extensively revisited recently within academia, industry, law enforcement agencies, and even the military. In [15], the author developed the idea that intelligent autonomous agents will be the standard on the battlefield of the future. This means that intelligent autonomous cyber defense agents are going to become the main element of any entity involved with the battlefield, where cyberspace will become the crucial area of conflict. The paper introduced several novel ideas and summarized existing ones into a reference architecture for any multi-agent system for cyber defense.

A current industrial application of such systems could be any Internet of Things network or, in general, cyber-physical systems and networks. The justification is that these systems are distributed and multi-node by default. Furthermore, the requirement for lightweight computation on the nodes implies that only multi-agent cyber threat detection solutions would fit such environments. For example, the state of the art in this field from two papers [16,17] introduces intrusion detection systems for connected vehicles (Vehicle-to-Vehicle, V2V). The system presented in [16] consists of a part that analyzes a node of the environment, a vehicle, with the option of centralized data analytics in the cloud. The main contribution of the authors was to consider every single element of the vehicle as a valuable source of data for detecting cyber threats. Next, it was proposed to combine the real-time data from these different and distributed elements for a classification algorithm based on Bayesian networks. The paper [17] investigates such multi-agent cyber threat detection within a single vehicle more deeply, in terms of how to combine data from different sensors to detect intrusions. Such an approach complies with the general idea of multi-agent intrusion detection systems and is an important example of how to apply it to solve modern problems of security in cyberspace. As the connected-vehicle use case is rapidly adopted, cyber defense solutions involving multi-agent concepts crucially need to be developed.

#### *2.2. Generation of Datasets for Cyber Threat Detection Research*

The general prerequisite for any discovery problem to be addressed by data science methods is to have the right data. There are three main approaches to obtaining data for cyber threat detection:


The first approach is highly desirable, as working on actual data should guarantee low-error detection algorithms that are ready for actual cyber attacks. The main problem with this approach is that few organizations can use such data for cyber threat detection research. Cyber attacks are very rare relative to the total observation time, which means that it would take a very long time to collect enough examples to train a detection algorithm on their indicators. Another challenge with such data is privacy: it is usually impossible to share this information, so the cybersecurity community cannot benefit from it for cyber threat detection.

The simulation approach is usually used in modern research into intrusion detection systems in industry and academia. One of the most well-known research institutes in this field is the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. The institute has published nearly 30 datasets over the past decade, while its researchers have developed reference methods for generating such data. A 2018 paper [18] summarizes the current approach to generating simulated networks and the data for cyber threat detection research over them. The main outcome was the development of a parametric configuration of the network communication patterns to be simulated, called profiles. This improved the quality of the resulting datasets. Another systematic approach was presented in [19]. This paper adds the new idea of simulated datasets for cyber threat detection systems based on a novel architecture:


The details are presented in Section 3. The latter approach exploits the mathematical foundations of modeling and data analysis, in particular, applying machine learning methods for data generation. It could help to increase the similarity of generated data to actual production data or to address shortcomings of simulations (the approaches are complementary). An example of applying machine learning to improve detection rates and compensate for the small number of malicious samples is presented in [20]. It uses adversarial machine learning methods for cyber threat detection research. Generative Adversarial Networks (GANs) are implemented to generate synthetic samples. Then, the IDS module was trained on them along with the original samples. This also fixes the problems of unbalanced or missing data on input. This approach greatly improves the performance of the IDS detection algorithm. The major challenge in applying machine learning for cyber threat detection is the explainability and transparency of such algorithms.

#### *2.3. Availability of Datasets for Cyber Threat Detection Research*

Historically, the first milestones in the public availability of datasets for cyber threat detection research came in 1998–1999, when the DARPA'98 and KDD'99 datasets were released. Since then, many other datasets have been created, but there are still not enough publicly available datasets for cybersecurity research. This section presents some examples of publicly available datasets that are generally recognized as comprehensive, well-prepared, and appropriate for cybersecurity research on cyber threat detection systems.

Canadian Institute for Cybersecurity datasets: The ISCX 2012 Dataset [21] was the first contribution of the Canadian Institute for Cybersecurity to provide a systematic approach for creating datasets for cyber threat detection systems research. The authors introduced the concept of profiles, which contain detailed descriptions of intrusions and abstract distribution models for lower-level applications, protocols, or network entities. They first analyzed real-world traces to create these profiles. The created dataset included benign and malicious network traffic traces of HTTP, SMTP, SSH, IMAP, POP3, and FTP. The NSL-KDD ISCX Dataset [22] was created as a solution to the inherent problems of the original KDD'99 dataset. It still suffers from some of these problems and may not perfectly represent real-world networks. Nevertheless, it can be used as a useful benchmark dataset to help researchers compare different cyber threat detection methods. The CIC 2017 dataset [23] contains benign traffic and the most recent widespread attacks, stored as network traffic traces from actual real-world executions. The implemented attacks include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, web attacks, infiltration, botnet, and DDoS. Due to the nature of the prepared profiles, they can be directly applied to a variety of network protocols with different topologies to create a dataset for specific requirements. The CSE-CIC 2018 dataset [24] follows this pattern in a scaled infrastructure of 500 devices. The dataset provides the network traffic traces and system logs from each of these devices.

UNSW-NB15 Dataset [25]: The raw network packets of the UNSW-NB15 dataset were created in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). It contains a mixture of actual normal activities and synthetic current attack behaviors from fuzzers, backdoors, DoS, exploits, generic cyber attacks, reconnaissance, shellcode, and worms. Argus, Zeek (formerly *BroIDS*), and the authors' own tools were used for data collection. Class tagging was also provided. The training set contains 175,341 records and the testing set 82,332 records from different types of network traffic (benign and malicious).

UGR'16 Dataset [26]: The dataset was created with real traffic and actual attacks. The network traffic was recorded by NetFlow v9 collectors strategically placed in the network of a Spanish Internet Service Provider. It consists of two datasets split into weeks:


The main advantage of this dataset is its usefulness for evaluating cyber threat detection algorithms from a long-term perspective. The models can also take into account differentiation by day/night or working days/days off.

CAIDA Datasets: The Center for Applied Internet Data Analysis (CAIDA) collects various types of data from geographically and topologically diverse locations and makes these data available to the research community. The information is collected from active and passive measurement infrastructures that provide insights into global Internet behavior. CAIDA collects, curates, archives, and shares the datasets resulting from these measurements. It also processes and shares several derived datasets. Datasets through April 2016 are available at [27]. One of the most well-known CAIDA datasets is the DDoS 2007 dataset, which contains network traffic traces from large-scale distributed denial-of-service attacks. More recent datasets are made available on the Impact Cyber Trust Project [28] system. The Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT) project was created by the U.S. Department of Homeland Security to support the global cyber-risk research community through the coordination and development of real-world data and information sharing capabilities. The IMPACT project enables the sharing of empirical data and information among the global cybersecurity research and development (R&D) community in academia, industry, and government to accelerate solutions to cyber risk and infrastructure security. Datasets are available exclusively to researchers from the U.S. and collaborating countries.

#### *2.4. Malware with Information Hiding Techniques Applied*

Network steganography, as a branch of information hiding techniques, is rapidly evolving and has attracted tremendous interest from cybersecurity researchers since the paper [29]. Any network steganography technique must meet three conditions [10]:


The Morto worm [30], a malware with network steganography capabilities, used records stored on Domain Name System (DNS) servers to communicate with its C2 servers. This was the first actual implementation of network steganography discovered in malware. Over the years, DNS has proven to be one of the most popular network protocols abused for information concealment techniques. Any IT system with access to the Internet must use it, so port 53 is wide open and allowed by firewalls and cyber threat detection systems. The DNS protocol is characterized by plain-text messages that provide many opportunities to hide data in them using text steganography methods. Another protocol that has been used for network steganography in malware in recent years is the Secure Shell (SSH) protocol; this was discovered in 2013 in the Fokitor Trojan [31].
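As an illustration of why DNS lends itself to data hiding, the following sketch encodes an arbitrary payload as hexadecimal labels inside query names. This is a generic tunneling scheme invented here for illustration only; it does not reproduce the Morto worm's actual mechanism, and the `example.com` parent domain is a placeholder:

```python
import binascii

def encode_dns_queries(payload: bytes, parent="example.com", label_len=32):
    """Split a payload into hex chunks and wrap each one in a DNS query
    name. DNS labels may be at most 63 characters, so chunks stay short."""
    hex_data = binascii.hexlify(payload).decode()
    chunks = [hex_data[i:i + label_len] for i in range(0, len(hex_data), label_len)]
    # A sequence number lets the receiver reorder out-of-order queries.
    return [f"{seq}-{chunk}.{parent}" for seq, chunk in enumerate(chunks)]

def decode_dns_queries(queries):
    """Receiver side: parse the first label, reorder by sequence number,
    and reassemble the hidden payload."""
    parts = []
    for name in queries:
        seq, chunk = name.split(".")[0].split("-", 1)
        parts.append((int(seq), chunk))
    return binascii.unhexlify("".join(chunk for _, chunk in sorted(parts)))
```

Because each chunk travels as an ordinary-looking query, a detector cannot rely on malformed packets; it has to reason about query volume, label entropy, or name patterns instead.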

The motivation to use SSH for such operations is the same as for DNS: it is widely used in IT systems, with port 22 open and allowed. In this method, SSH protocol connections merely carried the hidden information as a payload. The Regin malware [32], discovered in 2014, was equipped with three mechanisms for hiding its network communication:


This is the ongoing trend of implementing different steganographic C2 channels and using them depending on the deployment conditions. Steganography, as a cyber deception method, provides the ability to bypass the detection mechanisms and countermeasures of standard network security applications, such as blocking by firewalls or the triggering of alerts by cyber threat detection systems.

Another trend is the combination of different methods to hide information, e.g., combining multimedia steganography with hidden communication via TCP/IP protocols. The typical approach for combining multimedia steganography and network communication to form hybrid steganography is as follows:


The first practical application of such an approach was a 2011 malware campaign. Duqu [33] used multimedia steganography to hide data in JPEG images and then sent them to the C2 server. This communication looks like an ordinary image file transfer, but in reality it is used to establish a covert C2 channel. A similar technique was used in the 2014 Zeus Trojan morph, Lurk [34], where images were the carriers of the hidden control commands. In the following years, the C2 channels used in modern cyber threat campaigns were designed around information hiding techniques. More recently, the techniques have evolved, spreading multimedia steganography over open social networks (OSNs) and adding methods of text steganography. This introduced a new level of complexity to any forensic analysis, making it a problem similar to finding a needle in a haystack. An example of a practical application is the Hammertoss APT, applied by the group APT29 [35]. They used Twitter to exchange URLs to image files that contained hidden data. Each Twitter message also contained a specially prepared hashtag needed to decode the hidden part of the image. The problem attracted interest from cybersecurity researchers looking for models to define detection techniques, as the classical signature approach was insufficient. Interesting proofs-of-concept of steganography systems include:
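The image-based hiding used by campaigns such as Duqu can be illustrated with generic least-significant-bit (LSB) embedding over raw pixel bytes. This is a textbook technique shown for illustration only; it does not reproduce any specific malware's encoding:

```python
def embed_lsb(cover: bytes, secret: bytes) -> bytes:
    """Hide `secret` in the least-significant bits of `cover` (e.g., raw
    pixel bytes). Requires len(cover) >= 8 * len(secret)."""
    bits = [(byte >> i) & 1 for byte in secret for i in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for the secret")
    out = bytearray(cover)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit      # overwrite only the lowest bit
    return bytes(out)

def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Read back n_bytes hidden by embed_lsb."""
    bits = [b & 1 for b in stego[:8 * n_bytes]]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
        for k in range(0, len(bits), 8)
    )
```

Since each cover byte changes by at most one unit, the stego image is visually indistinguishable from the original, which is exactly what makes signature-based detection of such channels so difficult.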


Therefore, the use of hybrid and network steganography to breach the security of computer systems is an important area of research for identifying vulnerabilities and methods to combat them. This is the critical goal of this work: to improve the security of cyberspace.

#### **3. Generating Datasets for Cyber Threat Detection Research**

*3.1. Application of Multi-Node Cyber Threat Detection System*

A multi-node cyber threat detection system operates in an environment of distributed network devices running open operating systems (e.g., Linux), mainly programmable routers. Each router contains the execution environment of mobile agents, which are interconnected to form a platform controlled by the Central Unit. For the purpose of this study, the monitoring mode of such a system is considered.

On the execution platform, it is possible to run agents with different purpose settings:


To build multi-agent peer-to-peer communication, the concept of the actor system [39] has been used. The main purpose of the actor system is to provide a high-level, non-blocking parallel execution model for computation. The atomic execution units, called actors, execute their assigned tasks and then share the results via the message-box communication abstraction. The actor system is responsible for creating and managing the life cycles of the actors (agents) in various distributed environments. The scheme of the prepared platform is shown in Figure 1. It shows the main nodes of the architecture:


**Figure 1.** Scheme of monitoring platform based on actor system approach.
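The mailbox-based actor pattern described above can be sketched with `asyncio` queues. This is a minimal illustration of the pattern only; the platform's actual actor framework and agent names are not reproduced here:

```python
import asyncio

class Actor:
    """Minimal actor: a private mailbox drained by a dedicated task."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler            # coroutine invoked per message
        self.mailbox = asyncio.Queue()
        self.task = None

    def start(self):
        self.task = asyncio.ensure_future(self._run())

    async def _run(self):
        while True:
            msg = await self.mailbox.get()
            if msg is None:               # poison pill stops the actor
                return
            await self.handler(self, msg)

    async def tell(self, msg):
        await self.mailbox.put(msg)

async def demo():
    collected = []

    async def central_handler(actor, msg):
        collected.append(msg)             # "Central Unit" gathers reports

    async def sniffer_handler(actor, msg):
        # Pretend to capture a trace on this router and report upstream.
        await central.tell(f"pcap-from-{actor.name}")

    central = Actor("central-unit", central_handler)
    sniffers = [Actor(f"router-{i}", sniffer_handler) for i in range(3)]
    for a in [central] + sniffers:
        a.start()
    for a in sniffers:
        await a.tell("capture")
    for a in sniffers:                    # stop sniffers after their work
        await a.tell(None)
    await asyncio.gather(*(a.task for a in sniffers))
    await central.tell(None)              # then stop the Central Unit
    await central.task
    return collected

results = asyncio.run(demo())
```

Because mailboxes are FIFO, stopping the sniffers before poisoning the central actor guarantees every report is collected before shutdown, without locks or shared state.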

As the prototype of the platform is utilized in the monitor mode (sniffing and collecting network data), the main functionalities to be included within the main nodes of the system are:


• Collecting PCAP files across nodes in the Central Unit.

The Interface Sniffer Agent provides the first and second functionalities. The third is implemented by the communication scheme between the Platform Manager Agent, Router Manager Agents, and Interface Sniffer Agents. The whole platform (node agents and Central Unit) provides the other functions, such as life cycle management, PCAP file management, and controlling the operation mode of the system. Table 1 summarizes the operational aspects of each main component of the platform: the Central Unit, a router, the Router Controlling Agent, and the Internal Sniffing Agent. It includes the functional role (*Objectives* column) realized by each of them and the communication patterns with the other components used to fulfill that role (*Communication patterns* column).

**Table 1.** Summary of operational aspects of multi-node Network Traffic Monitoring Platform.


The architectural concept realized by the presented proof-of-concept is easily expandable by embedding processing and detection algorithms together with any distributed computing strategies imposed within the system. Any complexity in terms of logical distribution of the processing and decision-making could be considered. However, the constraints and limitations of such an expansion, for any logical workflow of cyber threat detection and mitigation, are driven by:


Network packets need to be processed in the time imposed by the bandwidth of the interfaces within a network node. If the objective is to detect and react to cyber threats inline, then the detection and computation architecture must be able to make decisions within the time frame of network packet processing. For 10 Gb/s networks, one packet of 300 bytes (the average size on the Internet) needs to be processed in 240 nanoseconds. Otherwise, the system would process a copy of the data, so the main constraints would be limited to the copy operation, transferring data to the other agents, and the size of the data generated over time (directly based on the network flow data rates).
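The 240 ns figure follows directly from the line rate and packet size, as a quick check shows:

```python
def per_packet_budget_ns(link_bps: float, packet_bytes: int) -> float:
    """Time available to process one packet on a fully loaded link:
    packet size in bits divided by the line rate, in nanoseconds."""
    return packet_bytes * 8 * 1e9 / link_bps

# 300-byte packets on a 10 Gb/s link leave a 240 ns budget per packet.
print(per_packet_budget_ns(10e9, 300))   # → 240.0
```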

The proof-of-concept was implemented in Python, as the most efficient choice for fast prototyping. It was used in monitor mode only, to collect PCAPs as datasets. However, a real production multi-node cyber threat detection system should be implemented on a more suitable hardware and software technology stack. Suitable programming languages for data processing within environments with constrained computational resources are C, C++, or Rust. If software processing cannot fulfill the processing requirements, then hardware solutions that accelerate the computations need to be considered, such as ASICs or FPGAs. The Central Unit node, or any other node considered in general as the "computational center", could be built upon Big Data technology stacks characterized by high scalability, efficiency, and the possibility of parallelizing computations. The main limitation would be related to the available hardware resources and whether it is possible to implement several computational servers as components of a production multi-node cyber threat detection system.

#### *3.2. Network Traffic Streams Simulations*

#### 3.2.1. Malicious Network Data Streams

The network data streams within the scope of this paper must contain steganographic techniques of various types. For this work, the implementations could be simple, as long as the required data can be generated for further research. The chosen methods are:

	- **–** Some packets must be intentionally delayed to be detected as lost.
	- **–** The payload of such packets could be overwritten to carry steganograms.
	- **–** When such a packet finally arrives at its destination, it is simply discarded. If a steganographic receiver is installed, it could intercept these packets to extract the hidden payload.

The dedicated applications were prepared as elements of the whole end-to-end framework for cyber threat detection research. The signaling of lost packets in the multimedia packet stream utilizes the RTP protocol. The hidden communication over packets with modulated sending times uses the ICMP protocol. The prepared applications can also be executed in benign mode to generate the expected network flows of the selected protocols (RTP or ICMP).
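The time-modulation method can be illustrated by a simple on-off timing channel, where each hidden bit selects a short or long inter-packet gap. This is an illustrative sketch with assumed gap values, not the exact ICMP implementation used to generate the datasets:

```python
SHORT_GAP, LONG_GAP = 0.01, 0.05   # assumed delays in seconds

def bits_to_delays(bits):
    """Sender side: bit 1 becomes a long inter-packet gap, bit 0 a short
    one. The gaps would then pace the sending of otherwise normal packets."""
    return [LONG_GAP if b else SHORT_GAP for b in bits]

def delays_to_bits(delays, threshold=(SHORT_GAP + LONG_GAP) / 2):
    """Receiver side: gaps above the decision threshold decode as 1.
    A real channel must also tolerate network jitter around each gap."""
    return [1 if d > threshold else 0 for d in delays]
```

From a detector's point of view, the packets themselves are unremarkable; only the inter-arrival-time statistics betray the channel, which is why IAT-based flow metrics matter later in this paper.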

#### 3.2.2. Benign Network Data Streams

The approach to generating benign network traffic was developed by analyzing the typical patterns of network communication in consumer and enterprise LAN/WAN networks. Several specific applications and protocols were identified:


For this study, some publicly available applications and scripts were used to simulate such traffic. The complementary method uses publicly available network traces to replay them within a network. The applications mentioned in Section 2.4 are also used in benign mode to generate legitimate traffic without using information hiding techniques.

#### 3.2.3. Engine of Generation of Network Topologies for Experimentation

A network emulation engine should be used for functions such as:


When generating a new topology, the basic features must be specified, such as:


Based on these parameters, a random graph should be generated and then fed into a network emulation engine via an API or configuration file. Preparing such automation promises to minimize any bias of the network topologies on the measured effectiveness of newly developed cyber threat detection algorithms. This means that well-generalized cyber threat detection models should be created that can work equally efficiently in any network topology.
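The random-graph step could be sketched as follows: a random spanning tree guarantees connectivity, and extra links are added with a configurable probability. This is a simplified stand-in for the actual generator; driving an emulation engine such as GNS3 through its API is omitted:

```python
import random

def random_topology(n_nodes: int, extra_edge_prob: float = 0.2, seed=None):
    """Generate a random connected topology: a random spanning tree first
    (guaranteeing connectivity), then optional extra links between pairs."""
    rng = random.Random(seed)
    edges = set()
    for node in range(1, n_nodes):
        edges.add((rng.randrange(node), node))   # attach to an earlier node
    for a in range(n_nodes):
        for b in range(a + 1, n_nodes):
            if (a, b) not in edges and rng.random() < extra_edge_prob:
                edges.add((a, b))
    return sorted(edges)

def is_connected(n_nodes, edges):
    """BFS/DFS connectivity check over the undirected edge list."""
    adj = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n_nodes
```

Seeding the generator makes every experimental run reproducible while still varying topologies across runs, which supports the bias-minimization goal stated above.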

For this research, we developed such a tool for the automatic generation of test application scenarios, which consist of the following elements:

	- **–** Interprets and validates the entered network configuration,
	- **–** Randomly generates connections between nodes,
	- **–** Automatically sets up the network using the GNS3 API;

It should be noted that the crucial aspect of the prepared solution is the automated and randomized mechanism for:


Such an approach allows the generation of datasets from many network scenarios and data generation configurations. It could also help mitigate any factors associated with configuration bias across the broad spectrum of research in data-driven cyber threat detection. Datasets were collected in specific network scenarios with a small degree of variation in the sender-receiver network data configuration (benign or malicious).

#### 3.2.4. Generation of Example Datasets

As presented in Section 3.2.3, the application to automatically generate the network configurations and network communication scenarios was implemented for this article. Figure 2 shows an example output topology of the fully working network of nodes (routers, PC hosts, firewall, Internet connection) prepared by this tool for the purposes of this article.

Within the setup presented in Figure 2, seven different network scenarios of hidden communication between transmitters and receivers were run. The configuration for each scenario is shown in Table 2. A given scenario (Table 2, first column) consists of the setup of the sender and receiver for hidden communication over lost packets (second column of Table 2) and the setup of the sender and receiver for hidden communication over time-modulated packets (third column of Table 2). The order of operations to collect the datasets is as follows:



**Table 2.** Setup of pairs of sender-receiver for generation of hidden communication simulations.

**Figure 2.** Network setup for simulations of hidden communication techniques.

#### **4. Cyber Threat Detection Research Enabled**

*4.1. Cyber Threat Detection as Standard ML Classification Problem*

In terms of machine learning, cyber threat detection is defined by the principles of the classification problem adapted to the chosen goal of detection. The two classical objectives for cyber threat detection algorithms are:


Since the datasets generated for this work contain detailed labels of the cyber threats, the set of experiments presented follows the second option. Table 3 shows the general workflow used in this work for cyber threat detection experiments on the hidden communication techniques. It presents the objectives to be achieved at each step of the workflow.


**Table 3.** Generic procedure to research cyber threat detection methods.

#### 4.1.1. Network Flows Generation

The records were collected by the system presented in Section 3.1 as network data traces in PCAP format. These PCAPs were then processed into network datasets using CICFlowMeter [41].

Each network flow was bidirectional and described by 84 metrics. Network flows were identified by the classic 5-tuple key (source IP, destination IP, source port, destination port, Layer 4 protocol code). When a flow exceeded the configured flow timeout (in seconds) or was inactive for longer than the activity timeout, its status was saved and exported. In the exported flow table, a timestamp is added to the 5-tuple key to distinguish the flows. In this experiment, the parameters were set to:
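The timeout-driven export logic can be sketched as follows. This is a simplified stand-in for what CICFlowMeter does, with placeholder timeout values rather than the experiment's configured ones:

```python
FLOW_TIMEOUT = 120.0        # illustrative values, not the experiment's
ACTIVITY_TIMEOUT = 5.0

def aggregate_flows(packets, flow_timeout=FLOW_TIMEOUT,
                    activity_timeout=ACTIVITY_TIMEOUT):
    """Group packets into flows keyed by the classic 5-tuple. A flow is
    exported when its duration exceeds flow_timeout or the gap since its
    last packet exceeds activity_timeout. Each packet is a tuple
    (ts, src_ip, dst_ip, sport, dport, proto, size), sorted by ts."""
    active, exported = {}, []
    for ts, *key, size in packets:
        key = tuple(key)
        flow = active.get(key)
        if flow and (ts - flow["start"] > flow_timeout
                     or ts - flow["last"] > activity_timeout):
            exported.append((key, flow))    # close the expired flow
            flow = None
        if flow is None:
            flow = active[key] = {"start": ts, "last": ts,
                                  "packets": 0, "bytes": 0}
        flow["last"] = ts
        flow["packets"] += 1
        flow["bytes"] += size
    exported.extend(active.items())         # flush remaining flows at EOF
    return exported
```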


The output of the application is a CSV file. The next step was labeling, performed by a script based on the coding scheme shown in Table 2. The last step in data preparation was to combine all CSV files into a final dataset containing all collected observations from the simulated scenarios. This dataset was then preprocessed in the data experimentation phase according to the state of the art in data science.

#### 4.1.2. Training Classifiers for Cyber Threat Detection

This part shows how the prepared datasets were used to find a classifier to be used as a cyber threat detector. The first step was to analyze the metrics in the dataset. The aim was to check which metrics were more important for predicting the target class (feature selection). The process involved pairwise correlation between the metrics and the explained variable (class, or label). Figure 3 shows the filtered matrix of the metrics for which the absolute value of the correlation coefficient with the label was greater than 0.05. The most positively or negatively correlated metrics were related to time (inter-arrival time (IAT), time of activity) and the volume of the network data (number of packets, number of bytes). The practical aspect of this step was to reduce the dimensionality of the problem. The number of output metrics was 28, since 52 metrics were filtered out and three metrics were skipped as not relevant to the problem (flow ID, timestamp, IP addresses).

**Figure 3.** Pairwise correlation matrix of prepared dataset. Filtered according to level of correlation between label column and other parameters for feature selection purposes.
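The correlation-based feature selection can be sketched as below. The metric names are illustrative; the selection rule is the one described above, keeping metrics whose absolute Pearson correlation with the label exceeds 0.05:

```python
import pandas as pd

# Sketch of correlation-based feature selection with an illustrative dataset:
# one metric strongly tied to the label, one carrying no information.
labels = [0, 1, 2] * 20
df = pd.DataFrame({
    "flow_iat_mean": [10 * l + 1 for l in labels],  # strongly label-dependent
    "constant_metric": [7] * 60,                    # no variance, no information
    "label": labels,
})

# Pairwise Pearson correlation of every metric with the label column.
corr_with_label = df.corr()["label"].drop("label")

# Keep metrics whose absolute correlation with the label exceeds 0.05.
selected = corr_with_label[corr_with_label.abs() > 0.05].index.tolist()
```

A constant metric yields an undefined (NaN) correlation and is therefore dropped by the threshold filter, as is any weakly correlated metric.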

Subsequently, the filtered data were divided into a training dataset (80% of the data) and a testing dataset (20% of the data). The number of collected data streams distributed among the training and testing datasets after the split is shown in Table 4.

**Table 4.** Distribution of flows among training and test datasets split from collected dataset as described in Section 3.2.4.
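The 80/20 split can be sketched as follows. Stratification is an assumption here (the paper does not state it), but it keeps the class proportions similar in both parts, which matters when attack classes are rare:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: 100 flows with 2 features, 3 classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 25 + [2] * 15)

# 80/20 split; stratify=y preserves per-class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```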


The first problem that arises is the imbalance of the class examples in the training dataset. Imbalanced classification poses a challenge to any predictive algorithm. When the training examples are imbalanced, the trained models might have poor predictive performance, especially for the minority classes. On the other hand, the minority classes are important because the context of the experiment is cyber threat detection research, where cyber attacks are sporadic compared to benign traffic. The Synthetic Minority Oversampling Technique (SMOTE) [42] and Edited Nearest Neighbor (ENN) [43] were applied as data augmentation techniques to overcome the problem. SMOTE increases the number of samples in the minority class by linear interpolation, and ENN removes noise from the majority samples. Table 5 contains the number of data streams before and after applying SMOTE-ENN techniques to the training dataset.


**Table 5.** Data augmentation results—number of flows in training dataset before and after SMOTE-ENN.
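The core interpolation step of SMOTE can be sketched in a few lines. This is a minimal illustration of the mechanism only; the experiment used the full SMOTE-ENN pipeline (for example, as implemented in the imbalanced-learn library), where ENN additionally removes majority samples misclassified by their nearest neighbors:

```python
import numpy as np

# Minimal sketch of the SMOTE idea: a synthetic minority sample is a
# linear interpolation between a real minority sample and one of its
# nearest minority-class neighbors.
def smote_sample(x, neighbor, rng):
    gap = rng.random()                 # interpolation factor in [0, 1)
    return x + gap * (neighbor - x)    # point on the segment x -> neighbor

rng = np.random.default_rng(0)
x = np.array([1.0, 10.0])              # illustrative minority sample
neighbor = np.array([3.0, 14.0])       # illustrative nearest neighbor
synthetic = smote_sample(x, neighbor, rng)
```

Each synthetic point lies on the line segment between the original sample and its neighbor, so oversampling stays inside the region already occupied by the minority class.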

The experimental phase was conducted to measure the possibility of predicting cyber threats by applying selected ML classifiers. Two classifiers were chosen for the example solution prepared for this paper:



**Table 6.** Custom MLP classifier architecture setup for each of the six layers, with selected optimizer, loss function, evaluation metric, and training procedure.
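A hypothetical stand-in for the custom MLP of Table 6 can be sketched with scikit-learn. The hidden-layer layout, solver, and iteration count below are assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustrative training data: 90 flows, 28 selected metrics, 3 classes.
X = np.random.default_rng(1).random((90, 28))
y = np.arange(90) % 3

# Assumed layer layout and optimizer; the real setup is given in Table 6.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16),
                    solver="adam", max_iter=50, random_state=1)
mlp.fit(X, y)
pred = mlp.predict(X)
```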

Both classifiers were trained with the SMOTE-ENN training datasets and evaluated with non-SMOTE-ENN test datasets. The metrics of accuracy, precision, recall, and F1 score were used to evaluate the performance. The result metrics for the RF and MLP classifier are shown in Tables 7 and 8, respectively. The last rows of both tables show the overall accuracy of the respective classifier. Figures 4 and 5 show the confusion matrices of the two selected classifier models. The classification was conducted within three classes (0, 1, or 2), so the size of each confusion matrix was 3 × 3. Each cell presents the number of instances of a given true class (rows) classified into a given predicted class (columns). The diagonal of each matrix contains the true positives. The other cells can be classically interpreted as false positives, false negatives, and true negatives in relation to a selected class.


**Table 7.** Performance of RF classifier in terms of precision, recall, F1 score, and overall accuracy metrics.

**Table 8.** Performance of custom MLP classifier in terms of precision, recall, F1 score, and overall accuracy metrics.


**Figure 4.** Confusion matrix of RF classifier.

**Figure 5.** Confusion matrix of custom MLP classifier.
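The evaluation step described above can be sketched as follows. The prediction vectors are illustrative only, not the paper's results; the point is the per-class metrics and the 3 × 3 confusion matrix whose diagonal holds the correctly classified flows:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Illustrative true and predicted labels for the three classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0])

# 3x3 confusion matrix: rows are true classes, columns predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Per-class precision, recall, and F1 score, plus overall accuracy.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_true, y_pred)
```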

#### *4.2. Conclusions and Future Directions*

The prepared models show almost perfect results for Label 2 (Time Modulation Information Hiding Attack). The results for Label 1 (Flows with Lost Packets Information Hiding Technique) were noticeably worse. The presented results should be considered as an example for the research work with the generated datasets. Table 9 supplements Table 3 with a summary of the actions performed for each step in this work.

Each step of the workflow shown in Table 9 could be pursued as a separate research direction, such as:



**Table 9.** Summary of actions performed for each step of the generic procedure for researching cyber threat detection methods.
