1. Introduction
According to the kill chain proposed by Bryant et al. [1], the inside multi-step attack has become the main part of the intrusion process, which includes the pre-hack (reconnaissance and delivery), hack (installation and privilege escalation), compromise (lateral movement, actions on objective) and theft (exfiltration) stages. Currently, many sophisticated intrusion attacks [2,3,4] start from inside the victim network, through a spear-phishing email or a vulnerable host. Once the victim host executes the malicious code, the attacker begins to explore the whole network to discover the resources needed for further attacks. The attacker will hide their traces and avoid detection during the intrusion [5,6]. In short, it is critical to enhance the existing abilities to detect multi-step attacks inside the network.
However, multi-step attack detection approaches are mostly based on intrusion detection system (IDS) sensors deployed on hosts or at the entrance of the network, and they face severe challenges: (1) high false positives, which insert irrelevant intrusion actions into the data stream; (2) high redundancy, which is caused by the IDS detection mechanisms or by intentional intrusion strategies; (3) incomplete data, which is common in real environments due to network delay or system errors; (4) disordered data, which is caused by intentional intrusion strategies [7] or by multiple parallel attack paths. These issues can cause methods based on sequence learning to fail.
Recently, researchers have been making their systems more lightweight, more sensitive to threats, and capable of online analysis and processing of large-scale streaming data containing noise and errors, rather than learning outdated patterns from limited historical data. The hope is that the whole system can automatically adapt to variations in the data and capture known or unknown threat patterns in an unsupervised way [8]. This is one of the most important research directions in the intrusion detection field.
Modeling a multi-step attack is the process of gathering evidence extracted from network logs or IDS alerts to find out how the attack might transpire over time; it is a broader concept than traditional intrusion detection [9]. Across the whole process of attack scenario construction, some studies focus on data preprocessing, such as alert normalization, alert aggregation, noise reduction, and hyper-alert extraction, while others focus on correlation analysis, intrusion pattern discovery, attack scenario construction, and attack prediction [10]. Unfortunately, few of them are dedicated to addressing the challenges described above.
Liu et al. [11] proposed a framework for reconstructing attack scenarios based on reasoning methods. It can deal with incomplete evidence by using a database of known vulnerabilities and other expert knowledge, but it has difficulty working out the missing parts of an unknown attack scenario.
Angelini et al. [12] proposed a graph-based online multi-step attack detector that can detect ongoing attacks early enough for managers to take proper countermeasures, together with a visualization interface that presents the overall network situation. However, the preprocessing of the sensor data was not described, the detector relied on known vulnerabilities, and there was no evaluation on unknown intrusion patterns.
Shen et al. [13] noticed the problems caused by incomplete and disordered intrusion action sequences. They proposed a framework named Tiresias, based on recurrent neural networks (RNNs), which can effectively deal with a disordered alert stream. Although Tiresias can calculate the probabilities of multiple actions that may happen in the future, it is trained on pre-labeled events, and the need for labeled data implies that the framework relies on the analysis of known attacks.
Haas et al. [14] proposed a framework for multi-step attack detection through alert correlation. Alert clustering is used to reduce the number of alerts and highlight the intrusion actions. Communication patterns are identified from the clusters, and the graph-based alert correlation (GAC) algorithm is then applied to correlate the alerts. In addition, the clusters are labeled using a vulnerability database and then correlated by IP address. Because the correlation is based on IP addresses, irrelevant intrusion actions can be mixed into the discovered attack scenarios.
In this paper, the backward influence factor (BIF) algorithm is proposed to overcome, in real time, the problems caused by disordered, incomplete, and noisy IDS logs. It is a sequence pattern mining algorithm suitable for analyzing streaming data generated online by IDSs or other devices. The BIF algorithm is evaluated in the context of a multi-step attack scenario discovery task. The whole system is based on IDS alert analysis and contains five phases: normalizing, intrusion action extraction, intrusion session pruning, correlation discovery, and dynamic correlation graph construction. The first three phases are inherited from our previous work [15] so that the data structures and concepts created there can be reused. Each phase is summarized as follows:
In the normalizing phase, the raw alerts from different types of sensors are unified and converted into common data structures that the system can process. After normalizing, the raw alerts become alert objects (alerts for short).
In the intrusion action extraction phase, alerts are grouped by two fields: the source IP address and the destination IP address. Intrusion actions are then extracted based on the type field and the destination port field of alerts from the same group. In this phase, most redundant and repeated alerts are merged.
In the intrusion session pruning phase, a long action sequence is divided into several short sequences (intrusion sessions) by calculating the average time interval between actions. A pruning process then removes the repeated sub-patterns from the original sequence. The pruning algorithm can significantly reduce the length of the intrusion sessions without destroying the original associations between actions.
In the correlation discovery phase, all pruned sessions are fed to the correlation discovery module in order of session start time. The BIF algorithm is applied to calculate the attraction level between any two actions of a session. The influence factor (IF) values, which express the attraction level, increase or decrease as data arrive over time, and the real association relations are built from them.
In the dynamic correlation graph construction phase, the dynamic correlation graph (DCG) is built from the discovered correlations with higher IF values. The DCG links and nodes are dynamically created or destroyed according to the influence factor matrix (IFM), which maintains all IF values and updates them in real time.
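The action extraction and session splitting phases summarized above can be illustrated with a minimal sketch. The `Alert` fields, the `type:port` action label, and the rule of cutting a sequence wherever a gap exceeds the average gap are all assumptions made for illustration; the text only states that alerts are grouped by source and destination IP, that actions are derived from the type and destination port fields, and that sessions are split using the average time interval.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical normalized alert; the field names are assumptions."""
    time: float
    src_ip: str
    dst_ip: str
    alert_type: str
    dst_port: int


def extract_actions(alerts):
    """Group alerts by (src_ip, dst_ip) and merge consecutive repeated alerts.

    Returns {(src_ip, dst_ip): [(time, action_label), ...]} ordered by time.
    """
    groups = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.time):
        label = f"{alert.alert_type}:{alert.dst_port}"
        actions = groups[(alert.src_ip, alert.dst_ip)]
        # A repeated alert is merged into the previous action of the same kind.
        if not actions or actions[-1][1] != label:
            actions.append((alert.time, label))
    return dict(groups)


def split_sessions(actions):
    """Cut an action sequence wherever a gap exceeds the average gap (assumed rule)."""
    if len(actions) < 2:
        return [list(actions)]
    gaps = [b[0] - a[0] for a, b in zip(actions, actions[1:])]
    mean_gap = sum(gaps) / len(gaps)
    sessions, current = [], [actions[0]]
    for gap, action in zip(gaps, actions[1:]):
        if gap > mean_gap:
            sessions.append(current)
            current = []
        current.append(action)
    sessions.append(current)
    return sessions
```

The pruning of repeated sub-patterns is not sketched here, because its rules are defined in the previous work rather than in this section.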
This paper introduces the proposed BIF algorithm in detail. It can be used either as an optimized method for the intrusion scenario discovery task of our former work [15] or as a separate intrusion pattern mining system.
The proposed algorithm is based on one assumption: the distance between two actions in a session can be used to measure the strength of their association.
2. Materials and Methods
The network environment is complex and unpredictable, and the problems it causes can badly affect network security systems. In addition, the IDS itself can cause problems such as redundant alerts and repeated patterns.
As shown in Figure 1, S1 is the correct intrusion session extracted under the experimental environment, while S2, S3, and S4 are three different states in a running environment. The action sequence is altered due to the different configurations of the network and the security systems. Therefore, it is necessary to pay attention to these problems and try to minimize their impact.
For some IDS-introduced problems, a few methods were proposed in our previous work [16,17]. Redundant alerts, for example, can be reduced in the action extraction phase, and some repeated action patterns can be removed in the session pruning phase by the pruning algorithm. In this paper, we mainly focus on learning from incomplete and disordered sessions (sequences) and on attack pattern discovery in real time.
Definitions
The problem domain can be formalized as follows. An intrusion action consists of a group of alerts; A denotes the set of all unique actions, and |A| denotes the size of A. An intrusion session is a sequence of actions ordered by their time field, where x and y denote the two hosts from which the session is extracted.
It is assumed that an intrusion action has an attraction effect on the subsequent actions, and a particular action happens due to the attractions of one or more other actions which happened in different sessions. The degree of influence can be calculated and used to measure the association strength between actions.
For a given session $s = \langle a_1, a_2, \ldots, a_{|s|} \rangle$, $a_i$ has a direct influence on $a_{i+1}$ and an indirect influence on the actions after it; the degree of influence decreases as the distance between $a_i$ and $a_j$ grows. The influence range of $a_i$ covers the $w$ actions that follow it, where $w$ is the number of actions influenced. For a given $w$, if an intrusion action B falls in the influence range of A, then A influences B (A attracts B), which can be denoted as $A \rightarrow B$. The algorithm calculates the influences on the actions that fall in the influence range of a specified action in the session. The influence range is shown in Figure 2.
The influence degree of $a_i$ on $a_j$ can be calculated using Equation (1):
where $a_i$ and $a_j$ denote the $i$th and $j$th actions of the session, respectively, $w$ is the influence range, and $|s|$ denotes the length of the session. The more actions lie between two actions, the lower their influence and the lower their association strength. The influence of each pair of actions updates the old value recorded in the influence matrix shown in Figure 3. In the matrix, the values in a row express the attraction strengths of the particular action on other actions, and the values in a column express the attraction strengths of other actions on the particular action.
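The windowed, distance-based influence described above can be sketched as follows. Equation (1) is not reproduced here, so the linear decay `(w - d + 1) / w` is only a placeholder assumption that satisfies the stated property (more intervening actions, lower influence); the function names and the nested-dictionary matrix layout are likewise hypothetical.

```python
from collections import defaultdict


def pairwise_influence(session, w, decay=None):
    """Yield (a_i, a_j, influence) for every a_j within w steps after a_i.

    `decay` maps a distance d (1 <= d <= w) to an influence value. The
    default linear decay is an assumption, not the paper's Equation (1).
    """
    if decay is None:
        decay = lambda d: (w - d + 1) / w
    for i, a_i in enumerate(session):
        last = min(i + w, len(session) - 1)
        for j in range(i + 1, last + 1):
            yield a_i, session[j], decay(j - i)


def influence_matrix(session, w):
    """Build {row_action: {col_action: influence}} for one session.

    A row holds the influence of an action on others; a column holds the
    influence of others on that action.
    """
    matrix = defaultdict(dict)
    for a, b, value in pairwise_influence(session, w):
        # If a pair occurs more than once in a session, keep the strongest
        # value (an illustrative choice, not taken from the source).
        matrix[a][b] = max(value, matrix[a].get(b, 0.0))
    return {a: dict(row) for a, row in matrix.items()}
```

With the session `["A", "B", "C", "D"]` and `w = 2`, action A influences B (distance 1) and C (distance 2) but not D, which lies outside its influence range.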
With the continuous arrival of intrusion sessions, two actions A and B may have different influence values in different sessions; the old influence in the matrix should be updated with the new value using Equation (2):
where $\alpha$ indicates the refreshing rate, $I_{old}$ denotes the original value of influence recorded in the matrix, and $I_{new}$ denotes the newly calculated influence value. Note that the updated influence value may increase or decrease.
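One common way to realize such a rate-controlled refresh is a convex combination of the stored and the newly calculated value. The sketch below assumes that form; it is not a reproduction of Equation (2), and the function name, the default rate of 0.3, and the handling of pairs seen for the first time are illustrative assumptions.

```python
def refresh(matrix, observations, alpha=0.3):
    """Blend newly calculated influences into the stored matrix in place.

    matrix: {a: {b: stored influence}}
    observations: iterable of (a, b, new_influence)
    alpha: refreshing rate in [0, 1]; 0.3 is an arbitrary illustrative value
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    for a, b, fresh in observations:
        row = matrix.setdefault(a, {})
        old = row.get(b)
        # Assumption: a pair seen for the first time takes the new value as is.
        row[b] = fresh if old is None else (1 - alpha) * old + alpha * fresh
    return matrix
```

Under this assumed form, a stored value moves toward each new observation, so it rises when the two actions appear closer together than before and falls when they appear farther apart.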
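The number of sessions in which A influences B ($A \rightarrow B$) can be denoted as $N_{A \rightarrow B}$, and the total number of sessions in an analyzing period can be denoted as $N$. The probability of $A \rightarrow B$ can be calculated using Equation (3):

$$P(A \rightarrow B) = \frac{N_{A \rightarrow B}}{N} \qquad (3)$$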
The comprehensive influence of intrusion action A on B can be calculated using Equation (4):
With Equation (4), the influence strength of A on B is calculated based on historical observations.
For a given intrusion action A, what predictions will the algorithm make? First, the actions influenced by A are collected as candidate predictions; second, the comprehensive influences are calculated, and the most strongly influenced actions are selected as predictions.
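The two prediction steps can be sketched as a ranking over one row of the influence matrix. Because Equation (4) is not reproduced here, the sketch assumes that the comprehensive influence is the product of the session-frequency probability and the stored influence; the function name, the `pair_counts` structure, and the `top_k` cut-off are hypothetical.

```python
def predict(action, matrix, pair_counts, total_sessions, top_k=3):
    """Rank the actions influenced by `action`, strongest first.

    matrix: {a: {b: stored influence}}
    pair_counts: {(a, b): number of sessions in which a influenced b}
    total_sessions: number of sessions in the analysis period
    """
    if total_sessions <= 0:
        return []
    scored = []
    # Step 1: the candidates are the actions in the row of `action`.
    for candidate, influence in matrix.get(action, {}).items():
        probability = pair_counts.get((action, candidate), 0) / total_sessions
        # Step 2 (assumed form): probability times stored influence.
        scored.append((candidate, probability * influence))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

A candidate with a high stored influence but few supporting sessions is ranked below one that is observed often, which is one way to let historical observation temper a single strong co-occurrence.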
4. Discussion
In this paper, a novel sequence learning and mining algorithm is proposed to meet the challenges posed by defects in network logs (or IDS logs) that arise from various environmental problems. The algorithm is simple, lightweight, and effective in discovering the patterns hidden in the network logs.
The proposed algorithm is based on the backward attraction calculation, which means that the association between two intrusion actions can be measured by the distance between their indices in the intrusion session (action sequence). The nearer their positions in the session, the stronger the attraction between them. The attraction strength is updated and accumulated over the different distances between the actions in different sessions. The actions with higher attractions are selected to construct the correlation graph, which is used for attack prediction or attack scenario recognition.
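Selecting the higher-attraction pairs to form the correlation graph can be sketched as a simple threshold over the influence matrix. The threshold value of 0.5 and the function name are assumptions for illustration; the source does not state how "higher" is decided.

```python
def build_graph(matrix, threshold=0.5):
    """Return (nodes, edges) for pairs whose influence reaches the threshold.

    matrix: {a: {b: influence of a on b}}
    edges: {(a, b): influence}; nodes: set of actions touched by an edge
    """
    edges = {
        (a, b): value
        for a, row in matrix.items()
        for b, value in row.items()
        if value >= threshold
    }
    nodes = {node for edge in edges for node in edge}
    return nodes, edges
```

Rebuilding the graph after each matrix update makes links and nodes appear or disappear as influence values cross the threshold, which mirrors the dynamic behavior attributed to the correlation graph.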
Three types of automatically generated datasets, each with a different type of data defect, are used to test the algorithm. The results show that the proposed algorithm is effective in learning and mining sequence patterns from disordered, noisy, and incomplete session data. Prediction tests are used to measure the learning ability of the algorithm, and the average prediction accuracy stays above 90%. Finally, a heterogeneous dataset is generated by combining all data defects to simulate a real data environment, and the pattern discovery ability of the algorithm is measured under different conditions. The experimental results show that the algorithm can discover an unknown intrusion pattern in a dataset containing 50 other intrusion patterns that share 40% of their actions with the target pattern session, and the discovery accuracy stays above 91%.
Although the algorithm is effective in mining intrusion patterns in a complex dataset, it is still affected by highly disordered sessions: if 50% of the actions in each session are randomly moved to different positions, the accuracy of the algorithm drops to 85%. Solving this problem will be the aim of subsequent studies.
The proposed algorithm is designed for online unsupervised learning and requires no complex parameter tuning or retraining. In addition, network log normalization, aggregation, and other clustering processes are recommended to prevent explosive growth in the number of action types.