3.1. Overview
We propose a novel honeynet architecture, HoneyFactory, to address two shortcomings of existing honeynet frameworks: they do not consider the specificity of application scenarios, and they can maintain only a single session with an attacker at a time, which leaves them unable to deal with complicated network attacks. The architecture of HoneyFactory is shown in Figure 1.
The HoneyFactory architecture creates multiple honeynets on the server, each containing multiple honeypots. The server itself does not provide external services. The architecture assigns external IPs to some honeypots through gateways, making them accessible from the Internet. The other honeypots collaborate with the Internet-accessible honeypots to simulate an intranet structure and perform deep cyber deception.
The HoneyFactory architecture consists of five parts.
The environment learning module learns the network, node, and application service information of the protected business network, and analyzes the learning results to output an optimized honeynet structure. Attack and defense methods differ completely across network scenarios. Therefore, if the services in the honeynet differ significantly from those of the business network, the attack information captured by the honeynet is meaningless to the business network.
The honeynet generation module obtains the environment learning results from a database, uses a container engine to generate honeypots, and creates a container honeynet. This module also receives the results of the honeynet deception model and adjusts the honeynet accordingly, for example by adding or deleting honeypots.
The honeynet deception model evaluates the attack status based on the attack information collected in the honeynet, adjusts the honeynet, and provides more lateral movement space for the attacker. This module aims at capturing more types of attacks and alleviating potential internal network security threats in the business network.
The honeynet data collection module obtains the logs inside each honeypot through the container volume mount mechanism and monitors the server network card to collect all traffic sent to the honeynet, thereby collecting the behavior and traffic information of the attackers.
The design motivation of the honeynet information utilization module comes from some related research on intelligent honeynet architecture. Existing research proposes to use the information collected in honeynet for applications such as vulnerability mining and cyber threat intelligence, but there is no specific implementation and design yet.
The goal of the honeynet information utilization module is to collect and utilize attack information, such as attack tools and attack methods. The utilization of information collected by the honeynet is therefore also an important component of the honeynet architecture. Some studies mention using honeynet information to generate threat intelligence [18], improve intrusion detection [19], generate attacker profiles, and perform fuzz tests [20], but most of them do not describe an implementation. Because this module involves many technical fields, its implementation is outside the scope of this paper. In the following, we focus on the design and implementation of the other modules and on the deception performance of HoneyFactory.
The following sections introduce the modules of the HoneyFactory architecture separately, including their design, algorithm innovation, and model innovation. It should be noted that the design of each module not only refines its functions and algorithms but also names some specific technologies and tools, which are used to prove the feasibility of the HoneyFactory architecture. The design of each module has a certain degree of universality, and subsequent research can replace specific components, algorithms, and functions according to the application scenario to form a new honeynet architecture.
3.3. Honeynet Generation Module
Based on the result of environment learning, we can generate honeynets that are highly correlated with the protected network, and attack information collected by honeynet can better reflect the potential vulnerabilities and threats faced by the protected network.
The HoneyFactory architecture suggests using Docker container technology to orchestrate and construct honeynets. Recently, with the rapid development of container and virtualization technology, there have been some studies on the orchestration and construction of honeynets based on related technologies, such as containers and microservices [22]. However, most container honeynets focus only on how to use microservices and containers to collectively manage and launch honeypots, and they use SDN, proxies, and other technologies to redirect attackers to just one specific honeypot and maintain the connection between the attacker and that honeypot.
Some cyber security studies have focused on building experimentation platforms based on containers [23]. However, the existing research on container honeynets focuses only on proxying multiple honeypots and does not consider the interactions among honeypots. The network deception in these studies is single-step and does not consider the multi-step, long-term nature of penetrating a network from the attacker's perspective. Therefore, we propose the HoneyFactory honeynet generation module, which simulates the entire business network and constructs several container honeynets. The HoneyFactory honeynet generation module is shown in Figure 3.
HoneyFactory detects and learns the business network through tools such as Nmap, portscan, and SNMP. Based on the learning result, the honeynet generation module generates the container information of the honeypots in the generated honeynet and stores it in a database. At the same time, the honeynet generation module receives the deception actions of the honeynet deception model and converts each action into modifications of the container information of honeypots.
The honeynet generation module continuously monitors the database and updates the honeynet using the Docker engine and the Calico component, including generating honeypots and adjusting the network relationships between honeynets. The honeynet generation module also sends network information to the gateway for the network address translation (NAT) configuration of the honeynets.
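The monitor-and-update behavior described above can be pictured as a reconciliation step that compares the honeypot records stored in the database with the containers that are currently running. The following is a minimal sketch under stated assumptions: the record fields (`image`, `subnet`) and the function name `plan_updates` are hypothetical, and the calls to the Docker engine and Calico are deliberately left out so that only the planning logic is shown.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: honeypot ID -> {"image": ..., "subnet": ...}
Record = Dict[str, str]


def plan_updates(desired: Dict[str, Record],
                 running: Dict[str, Record]) -> List[Tuple[str, str]]:
    """Return (action, honeypot_id) pairs that bring `running` in line with `desired`."""
    plan: List[Tuple[str, str]] = []
    for hp_id in sorted(desired):
        if hp_id not in running:
            plan.append(("create", hp_id))      # recorded in the database, not yet running
        elif desired[hp_id] != running[hp_id]:
            plan.append(("recreate", hp_id))    # image or subnet changed
    for hp_id in sorted(running):
        if hp_id not in desired:
            plan.append(("remove", hp_id))      # deleted from the database
    return plan
```

A real implementation would translate each planned action into container and network operations; here the plan is only a list of labels.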
Some honeynets can be accessed by attackers directly, while other honeynets can only be accessed from other honeynets. The honeynet generation module combines container honeypot technology, container network technology, and NAT technology to create a multi-level, comprehensive honeynet. The honeynet thus contains an intranet security structure, which can deeply deceive attackers.
To achieve this, the honeynet generation module needs to record honeypot information, network information, and network connection information in detail, and generate or adjust honeynets based on this information.
As part of this information, the honeypot information can be divided into multiple categories based on the type of honeypot. The first type is the service simulation honeypot, which imitates a business network host. This type of honeypot is lightweight and is used to simulate services within the protected network, enhancing the authenticity of the honeynet. However, it has no real service functions and only simulates services with minimal resource consumption. The second type is the traditional honeypot: container technology is used to deploy traditional web honeypots, SSH honeypots, and database honeypots, which are placed in the network to interact with attackers. This type of honeypot does not provide a foothold for further internal attacks, regardless of whether attackers exploit vulnerabilities in it. The third type is the vulnerability honeypot, which contains applications with real vulnerabilities. This type of honeypot is disguised as an experimental testing server for various applications. It interacts with attacks during the vulnerability exploitation and lateral movement stages and provides a foothold for further internal attacks, so that the honeynet can capture potentially related attacks. The honeypot itself is still a container, and the network environment that attackers can detect from it is still within the container network, so it poses no threat to the host.
This honeypot classification can be implemented through the file configuration and container configuration of container technology. For example, for the first type, the service simulation honeypot, the honeynet generation module prepares service files or crawler files through file configuration so that the honeypot simulates the business network. The module can configure the apache.conf or nginx.conf files, together with the commands for service start-up and log collection that are executed when the container is created. In this way, the network composed of these honeypots resembles the protected network. For the second and third types of honeypots, the module can configure container images that contain the related deceptive interactive applications or vulnerable applications.
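The three honeypot types and their file, command, and image configuration could be captured in a small record type. The following is a minimal sketch, not the storage format used by HoneyFactory: the class names, field names, image name, and file content are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class HoneypotType(Enum):
    SERVICE_SIMULATION = "service_simulation"  # imitates a business network host
    TRADITIONAL = "traditional"                # web, SSH, or database honeypot
    VULNERABILITY = "vulnerability"            # runs a really vulnerable application


@dataclass
class HoneypotSpec:
    honeypot_id: str
    kind: HoneypotType
    image: str                                              # container image name
    files: Dict[str, str] = field(default_factory=dict)     # container path -> file content
    commands: List[str] = field(default_factory=list)       # run when the container is created

    def offers_foothold(self) -> bool:
        """Only vulnerability honeypots give attackers a foothold for internal attacks."""
        return self.kind is HoneypotType.VULNERABILITY


# Hypothetical service simulation honeypot configured through files and commands.
web_sim = HoneypotSpec(
    honeypot_id="hp-web-01",
    kind=HoneypotType.SERVICE_SIMULATION,
    image="example/base-web:latest",
    files={"/etc/nginx/nginx.conf": "# simulated server configuration"},
    commands=["nginx -g 'daemon off;'"],
)
```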
The honeypot information storage architecture is shown in Figure 4, which can also serve as the directory structure of the etcd database. Based on this architecture, the content of honeypots can be easily modified to cope with different types of attacks. In this paper, to demonstrate the feasibility of the system, we selected some existing honeypots and transformed them into container images, and we constructed some honeypot images for specific services with vulnerabilities. In future use and further research, these honeypots can be modified according to the scenario. The honeynet system can also integrate other cyber deception techniques through file configuration and command configuration, such as honeytokens [24] and camouflage network flow [25], making the system flexible and scalable.
Network information contains the honeypot ID and network segment IP. The connection information records the connection relationship between network segments.
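To illustrate how honeypot, network, and connection information might be laid out as a key-value directory, the sketch below flattens a nested record into slash-separated keys. The layout in Figure 4 is not reproduced here, so the key prefix `/honeyfactory`, the field names, and the sample values are hypothetical; no etcd client is used.

```python
from typing import Any, Dict


def flatten(prefix: str, node: Any) -> Dict[str, str]:
    """Flatten nested dicts into slash-separated keys, as in a key-value directory."""
    if not isinstance(node, dict):
        return {prefix: str(node)}
    out: Dict[str, str] = {}
    for name, child in node.items():
        out.update(flatten(f"{prefix}/{name}", child))
    return out


# Hypothetical honeynet description with the three kinds of information.
honeynet = {
    "honeypots": {"hp1": {"type": "traditional", "image": "example/ssh-honeypot"}},
    "networks": {"net1": {"segment": "10.10.1.0/24", "members": "hp1"}},
    "connections": {"net1-net2": "allow"},
}
keys = flatten("/honeyfactory", honeynet)
```

With such a layout, changing a single honeypot only touches the keys under its own prefix.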
At this point, HoneyFactory has completed the learning of the protected business network and the construction of the honeynet. In the early stage of system operation, after environment learning, several container networks are created. Based on the services of the hosts in the business network, simulation honeypots are created from basic images and allocated to each subnet. In the initial interaction, for example when the attacker uses various reconnaissance tools, the system creates containers from traditional honeypot images, configures them to run, and allocates them to container subnets to interact with the attacker. As the attacker gradually delves into the honeynet intranet environment, the system creates containers from vulnerability honeypot images, allocates these honeypots to internal container subnets, and interacts with the attacker to capture more types of attacks.
Some other honeypot research also mentions a honeypot content architecture that makes honeypots easy to design and use, such as HoneyDoc [26], but without considering the relationship between honeypots and honeynets. These studies also do not consider a unified definition of honeypot content based on container orchestration techniques [27]. Some other honeynet research considers using cloud-native orchestration techniques [28], but it is still at an early stage.
Most existing honeynet research maintains the connection between attackers and just one honeypot at a time and therefore does not consider deployment efficiency. HoneyFactory needs to deploy honeypots in multiple network segments and maintain the network connections among them. The system should be able to quickly adjust and modify these honeypots and honeynets when an attacker’s behavior changes. Therefore, we use the above honeynet representation method to update honeypot and honeynet content quickly and in a lightweight manner.
3.4. Honeynet Deception Model
In HoneyFactory, the honeynet deception model is responsible for adjusting cyber deception strategies based on the behavior of the attackers in honeynet. In this way, on the one hand, computing resources can be saved when the attacker does not perform an exploit; on the other hand, cyber deception actions can be targeted to the attack stage and the effectiveness and success rate of cyber deception will be improved.
Technologies related to the perception and detection of attacker behavior are relatively complex, including intrusion detection, network situation analysis, etc. These technologies are beyond the scope of the cyber deception defense discussed in this paper. However, network attacks and network traffic in honeynets are quite distinctive: normal service networks contain a large amount of normal traffic, whereas the traffic in a honeynet is generally malicious.
To better generate specific cyber deception actions, we propose a honeynet deception model in HoneyFactory. This model will be used to evaluate the current deception stage and adjust honeypots and honeynets.
HoneyFactory is aimed at capturing multi-stage network attacks, collecting as much information as possible from attackers during various stages of penetration and intrusion. Therefore, the honeynet deception model first needs to define attack stages.
Following the definitions of the attack and defense stages in the cyber kill chain and the ATT&CK framework, this model selects the stages in which both the attacker and the defender are involved and in which traffic has obvious features. The attack stages and the corresponding deception stages are divided into four. The first is the Reconnaissance stage, where attackers use scanning tools to find network, host, and service information about the honeynet. The second is the Exploitation and Persistent stage, where attackers exploit service vulnerabilities and maintain long-term access to hosts. The third is the Exfiltration and Control stage, where attackers establish transmission and control channels based on covert communication, such as DNS tunneling. The fourth is the Discovery and Lateral Movement stage, where attackers begin to explore and attack the deep internal network.
The stages summarized above have removed the weaponization stage in the cyber kill chain that the defender did not participate in, and have integrated some stages with similar characteristics. During the Reconnaissance stage, attackers mainly interact with some open IPs in the network and service resources. The interaction is frequent, lasting for a long time with a small size of payload. The Exploitation and Persistent stage is the integration of execution phase, persistence phase, and privilege escalation phase in the ATT&CK framework. The number of packets sent during this stage is small, the duration is short, and the size of payload varies greatly depending on the type of attack. The attack at this stage is based on the vulnerabilities in specific applications or system services, and the location of the attacked host is still at the edge of honeynet. During the Exfiltration and Control stage, attackers mainly obtain sensitive files and create backdoors on the attacked machine. At this stage, there are a large number of traffic packets generated, with a large payload size, few service types, and a strong degree of traffic encryption. During the Discovery and Lateral Movement stage, attackers further penetrate the internal network based on controlled machines. The source and destination address of traffic generated during this stage are all in honeynet.
After the honeynet deception model obtains the attack stage based on traffic features in honeynet, the honeynet deception model only needs to select deception honeypots in the corresponding stage and allocate these honeypots to honeynet to generate effective and targeted cyber deception against attackers.
For example, during the Reconnaissance stage, the honeynet deception model can deploy traditional honeypots for specific services to capture the attacker’s basic information and generate threat intelligence. These traditional honeypots do not provide real services to attackers and are only used to collect attack information. Samples of this type of honeypot are publicly available, such as the web service honeypot Glastopf [29] and the SSH service honeypot Kippo [30]. This stage is also the main deception stage of other existing cyber deception research, which conducts only preliminary deception and can collect the attackers’ IPs, ports, and some types of attacks (botnets, spam, etc.), but cannot interact with, or collect information on, deep-level attacks that exploit vulnerabilities. Even if these honeypots are not recognized by the attacker, the attacker will give up further attacks after spending a certain amount of effort on these fake services.
When the interaction traffic of attackers in the Reconnaissance stage gradually decreases, the honeynet deception model can assume that the attacker is about to lose interest in the Reconnaissance stage and may enter the Exploitation and Persistent stage, conducting vulnerability detection and exploitation attacks on specific services. At this moment, the honeynet model can deploy some vulnerability honeypots that contain real vulnerability applications, such as SSH vulnerability container honeypots, HTTP web vulnerability container honeypots, and MySQL vulnerability container honeypots. These honeypots are based on container technology to run real vulnerable service applications, built on various CVE vulnerabilities, and can be used to interact with real attacks and obtain attack samples, such as spring cve_2020_5410, Elasticsearch cve_2015_3337, log4j2-cve_2021_4104, etc. The services in these real vulnerability honeypots are mainly various application services, such as web services, database services, and log services. The appearance of these services in the network boundary is reasonable.
When the interaction traffic of attackers in the Exploitation and Persistent stage suddenly increases, the honeynet deception model can assume that the attacker is in the Exfiltration and Control stage. During this stage, the honeynet deception model saves the data in the attacked container as a new container image, capturing the backdoor tools and attack scripts uploaded by the attacker.
When attack traffic begins to appear inside the honeynet, the honeynet deception model can assume that the attacker has obtained control of the boundary container honeypot and has launched attacks inside the network. At this time, the attacker is in the Discovery and Lateral Movement stage, and the honeynet deception model can again deploy vulnerability honeypots that contain really vulnerable applications. In contrast to the honeypots deployed in the Exploitation and Persistent stage, the honeypots at this stage include both application service vulnerabilities and system service vulnerabilities, such as Linux CVEs. Such operating system service vulnerabilities can plausibly occur in an internal network, and they allow more types of attack information to be captured.
We summarize the deception stages and actions mentioned above and obtain the honeynet deception model, as shown in Figure 5.
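The correspondence between attack stages and deception actions summarized above can be written as a simple lookup. This is only an illustrative sketch: the enum and function names are hypothetical, and the action labels paraphrase the preceding paragraphs rather than reproduce the contents of Figure 5.

```python
from enum import Enum


class Stage(Enum):
    RECONNAISSANCE = 1
    EXPLOITATION_PERSISTENT = 2
    EXFILTRATION_CONTROL = 3
    DISCOVERY_LATERAL_MOVEMENT = 4


# Illustrative action labels paraphrasing the text above.
DECEPTION_ACTIONS = {
    Stage.RECONNAISSANCE:
        "deploy traditional honeypots for specific services",
    Stage.EXPLOITATION_PERSISTENT:
        "deploy application-vulnerability honeypots at the network boundary",
    Stage.EXFILTRATION_CONTROL:
        "snapshot the attacked container as a new image",
    Stage.DISCOVERY_LATERAL_MOVEMENT:
        "deploy application- and system-vulnerability honeypots in internal subnets",
}


def select_action(stage: Stage) -> str:
    """Return the deception action associated with the estimated attack stage."""
    return DECEPTION_ACTIONS[stage]
```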
However, it is relatively difficult for the honeynet deception model to obtain the attack stage from traffic features within the honeynet. Randomness is common in network security. Interaction traffic does have certain characteristics in its overall distribution when attackers are in the various stages of a network attack, but individual attackers may display completely different characteristics from each other. Therefore, the honeynet deception model needs to measure the uncertainty between the attack stage and the traffic features, comprehensively consider the features at multiple timestamps, and make the optimal inference of the current attack stage.
The honeynet deception model includes four attack stages mentioned earlier. We use Equation (1) to represent the collection of stages.
At different timestamps, attackers may be in different stages, and their observation features are various features of traffic. The honeynet deception model can evaluate attack stage of attacker based on traffic features and thus execute corresponding honeynet deception actions to complete deep-level network deception. These traffic features include the number of packets, average size of packets, destination ports, flow duration, protocol types, etc. We use Equation (2) to represent traffic features during timestamp t.
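As a concrete illustration of the traffic features listed above, the sketch below computes one feature vector from the packets observed in a time window. The `Packet` fields and feature names are hypothetical, and encoding destination ports and protocol types as counts of distinct values is an assumption made here for simplicity, not necessarily the encoding used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    timestamp: float  # seconds
    size: int         # bytes
    dst_port: int
    protocol: str     # e.g. "tcp" or "udp"


def window_features(packets: List[Packet]) -> Dict[str, float]:
    """Summarize the packets of one time window as a feature vector."""
    if not packets:
        return {"n_packets": 0.0, "avg_size": 0.0, "n_dst_ports": 0.0,
                "duration": 0.0, "n_protocols": 0.0}
    times = [p.timestamp for p in packets]
    return {
        "n_packets": float(len(packets)),
        "avg_size": sum(p.size for p in packets) / len(packets),
        "n_dst_ports": float(len({p.dst_port for p in packets})),
        "duration": max(times) - min(times),
        "n_protocols": float(len({p.protocol for p in packets})),
    }
```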
Because the observed features include continuous variables, the honeynet deception model has a discrete state space but continuous observations. Different states naturally show different observation features. We assume that the distribution of these continuous features can be fitted with a Gaussian distribution. Equation (3) gives the probability of the observation at timestamp t given the state at that timestamp.
In Equation (3), μ denotes the expectation (mean) of the observation distribution for the given state, and C denotes the covariance matrix of the observations for that state.
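For reference, the standard density of a multivariate Gaussian, which is the form such an observation model takes, can be written as follows. The symbols are assumptions introduced here for illustration: o_t is the d-dimensional feature vector at timestamp t, q_t is the state at timestamp t, s_i is the i-th attack stage, and μ_i and C_i are the mean vector and covariance matrix for stage s_i.

```latex
P(o_t \mid q_t = s_i) = \mathcal{N}(o_t;\, \mu_i, C_i)
= \frac{1}{(2\pi)^{d/2}\,|C_i|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(o_t-\mu_i)^{\top} C_i^{-1}\,(o_t-\mu_i)\right)
```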
The expectations and covariance matrices of the multidimensional Gaussian distributions are parameters of the honeynet deception model, which can be learned from actual traffic data and the corresponding attack stages. Some datasets can be used to train the model, such as the CICIDS2017 and DAPT2020 datasets, which contain attack traffic data and the corresponding network attack stages.
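When every feature vector carries a stage label, the per-stage mean and covariance can be estimated directly from the labeled samples. The sketch below shows this supervised shortcut in plain Python; it is an illustration only and is not the iterative EM procedure the paper uses later. The function name and the integer stage labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def fit_gaussians(samples: Sequence[Tuple[int, Vector]]) -> Dict[int, Tuple[Vector, Matrix]]:
    """Per-stage sample mean and maximum-likelihood covariance from labeled features."""
    by_stage: Dict[int, List[Vector]] = defaultdict(list)
    for stage, x in samples:
        by_stage[stage].append(x)
    params: Dict[int, Tuple[Vector, Matrix]] = {}
    for stage, xs in by_stage.items():
        n, d = len(xs), len(xs[0])
        mean = [sum(x[k] for x in xs) / n for k in range(d)]
        # Maximum-likelihood estimate: divide by n rather than n - 1.
        cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in xs) / n
                for b in range(d)] for a in range(d)]
        params[stage] = (mean, cov)
    return params
```

With few samples per stage the estimated covariance can be singular, so a practical implementation would typically add regularization before inverting it.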
With these definitions, the honeynet deception model can be regarded as a multidimensional Gaussian Hidden Markov Model (Gaussian HMM). The state may change over time, and the transition probabilities among states are expressed by Equations (6) and (7).
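In the usual HMM notation, and writing K for the state transition matrix as in the parameter list that follows, the transition probabilities and their normalization over the four stages take the form below. The symbols q_t (state at timestamp t) and s_i (the i-th stage) are assumptions introduced here for illustration.

```latex
K_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \qquad \sum_{j=1}^{4} K_{ij} = 1 \quad \text{for each } i
```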
At this point, the honeynet deception model has the following parameters: the state transition matrix K, the initial state probability π, the multidimensional Gaussian expectation μ, and the covariance C.
The first step is the construction of training data, and the dataset has already indicated the specific stages corresponding to the traffic. We extract traffic features from attack traffic at different stages to form a sequence of attack stages. These sequences are training data. The training data representation is shown in Equation (9).
The set of states and observation features at different times within one sequence is shown in Equations (10) and (11).
After the training data are constructed, the model parameters can be calculated iteratively using the Expectation-Maximization (EM) algorithm [31].
Take the model parameter π as an example to illustrate the calculation process.
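As a sketch of what such an update looks like, the standard EM (Baum-Welch) re-estimation of π over N training sequences is given below. The notation is assumed here for illustration: O^{(n)} is the n-th observation sequence, q_1^{(n)} is its first state, s_i is the i-th stage, and λ^{old} denotes the current parameter values.

```latex
\pi_i^{\text{new}} = \frac{1}{N}\sum_{n=1}^{N} \gamma_1^{(n)}(i),
\qquad
\gamma_1^{(n)}(i) = P\!\left(q_1^{(n)} = s_i \,\middle|\, O^{(n)}, \lambda^{\text{old}}\right)
```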
After the parameters are solved, the honeynet deception model can cast the problem of estimating the attacker’s current state from long-term observations as a filtering problem of the state model, which can be solved using the forward algorithm. The problem can be transformed based on Bayes' formula.
We can simplify Equation (16) based on the observation independence assumption of the state model.
Based on the homogeneous Markov assumption of the state model, we can further simplify and normalize Equation (17) to obtain its recursive calculation formula.
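The recursive filtering computation can be sketched in a few lines of plain Python. This is a generic normalized forward recursion for an HMM under stated assumptions, not the paper's exact formula: the function name is hypothetical, and the emission densities are passed in as precomputed values `likelihoods[t][i]` (the density of the observation at step t under state i) rather than evaluated from μ and C.

```python
from typing import List, Sequence


def forward_filter(pi: Sequence[float],
                   K: Sequence[Sequence[float]],
                   likelihoods: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return P(state at t | observations up to t) for every step t."""
    n = len(pi)
    posteriors: List[List[float]] = []
    prior = list(pi)  # predicted state distribution before seeing observation t
    for lik in likelihoods:
        # Update step: weight the prediction by the emission density, then normalize.
        unnorm = [prior[i] * lik[i] for i in range(n)]
        total = sum(unnorm)
        if total == 0.0:
            raise ValueError("observation has zero likelihood under every state")
        alpha = [u / total for u in unnorm]
        posteriors.append(alpha)
        # Predict step: propagate the filtered distribution through K.
        prior = [sum(alpha[i] * K[i][j] for i in range(n)) for j in range(n)]
    return posteriors
```

Normalizing at every step keeps the values in a numerically safe range for long sequences, and the last entry of the result is the estimate of the current attack stage distribution.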
Using the above algorithms, the honeynet deception model describes the uncertainty of a network attack. It can evaluate the probability of the current attack state based on an indefinite-length feature sequence. Based on the result, the honeynet deception model can take corresponding honeynet deception actions.