1. Introduction
As the diversity and number of interconnected embedded devices steadily grow, energized by trends such as IoT, network monitoring, including traffic-based intrusion detection, will play a crucial role in future systems. Although a fundamental part of computer network security, intrusion detection systems (IDSs) still face major challenges [1,2,3], e.g., the inability to handle massive throughput and process data in near real-time due to inherent resource limitations. The permeating application of machine learning (ML) has favored the detection of novel, sophisticated network-based attacks that change their behavior dynamically and autonomously. In particular, unsupervised outlier detection (OD) algorithms can help uncover policy violations or noisy instances as indicators of attacks by observing deviations in high-dimensional and high-volume data without requiring a priori knowledge. However, the ubiquity of massive, continuously generated data streams across multiple domains poses an enormous challenge to state-of-the-art offline, unsupervised OD algorithms that process data as batches. Real-world data streams are evolving, enormous in number and potentially infinite; data flow in continuously from many sources in near real-time. Therefore, efficient schemes, optimized in terms of processing time and memory usage, are required in intelligently designed IDSs. These should be able to process time-varying, rapid data streams in a single pass, while only a limited number of data records can be accessed. Furthermore, legitimate changes in the data can occur over time, called concept drift, which require model updates to counteract increasingly inaccurate predictions as time passes.
Recently, many OD solutions have been developed that can compute anomaly scores while dealing with data streams. Data streams can be subdivided into streaming data (SD) and streaming features. In this work, we do not focus on the latter case, in which the set of features changes over time. Neither do we focus on SD in the context of time-series data, as it is the subject of many other research papers, such as [4,5]. In this paper, we first highlight the most popular OD solutions for SD, with an in-depth view of promising online variants of one of the most widely accepted (offline) OD algorithms, isolation forest (iForest) [6]. Moreover, substantial requirements are derived to compare the existing state-of-the-art solutions. As the main contribution of this article, we propose the Performance Counter-Based iForest, denoted as PCB-iForest, a generic and flexible framework able to incorporate almost any ensemble-based OD algorithm. In this work, we focus on two iForest-based variants: an improvement of classical iForest applied in a streaming setting, and a variant of classical iForest that scores features according to their contributions to a data record's anomalousness in a completely unsupervised way. PCB-iForest deals with concept drift through a dedicated drift detection method, taking advantage of the recently proposed NDKSWIN algorithm [7]. Our solution is hyperparameter-sparse, meaning that one does not have to deal with complex hyperparameter settings, which often demand a priori knowledge. In extensive experiments involving various multi-disciplinary and security-related datasets with different characteristics, we show that the proposed PCB-iForest variants outperform off-the-shelf, state-of-the-art competitors in the majority of cases. This work offers the following contributions and mainly differs from the work presented in Section 2 in these ways:
Carefully engineered and specified requirements for an agile, future-oriented, online OD algorithm design are derived and compared with state-of-the-art solutions, thereby pointing out the need for the more flexible solution presented in this article.
Contrary to other adaptations of iForest for SD, a flexible framework called PCB-iForest is proposed that "wraps around" any iForest-based learner (it might even be generalized to other ensemble-based approaches) and regularly updates the model in cases of concept drift by discarding only outdated ensemble components.
Two iForest-centric base learners are integrated into PCB-iForest, providing (i) the first application of the improved iForest solution—extended isolation forest—on SD, denoted as PCB-iForestEIF, and (ii) online feature importance scoring by utilizing a recent iForest-based feature selection method for static data—isolation-based feature selection (IBFS)—in our online proposal, denoted as PCB-iForestIBFS.
Extensive evaluations were conducted for PCB-iForest and off-the-shelf online OD algorithms on numerous multi-disciplinary datasets with diversity in the numbers of dimensions and instances, which no other iForest-based competitor for SD has dealt with yet.
The remainder of this work is organized as follows. Section 2 first provides the relevant background on unsupervised OD for SD and surveys related work on the most popular state-of-the-art solutions, especially existing iForest adaptations for SD. Substantial requirements for online OD algorithms are derived in Section 3 and compared with the related work. In Section 4, details on the conceptualization and operating principle of PCB-iForest can be found; it satisfies all requirements stated in Section 3. In Section 5, the test environment is presented, and details on the extensive evaluation are given together with the discussion of results (Section 6), which reveals the superiority of PCB-iForest over the state-of-the-art in most of the measurements. Conclusions are drawn in Section 7, alongside glances at future work.
2. Related Work
Outlier detection, also referred to as anomaly or novelty detection, is an important issue for many real-world application domains, especially detecting indicators of malicious activity in computer networks. Outlier detection identifies atypical patterns or observations that significantly deviate from the norm based on some measure, by assuming that (i) the majority of data are normal and there is only a small portion of outliers (imbalance); (ii) outliers are statistically different from the normal data (distinction); and (iii) they do not occur frequently (rarity). Numerous techniques have been introduced for OD, such as statistical, distance-, clustering- or density-based ones [8]. An OD algorithm assigns a class label $c \in \{0, 1\}$ or a score value $y$, describing the strength of anomalousness, to each data object in $X$. This divides $X$ into a set of outliers $X_O$ and inliers $X_I$ ($X = X_O \cup X_I$). In the streaming setting, $X = \{x_1, \dots, x_t, \dots\}$ is a continuous transmission of data records that arrive sequentially at each time step $t$. The number of features is denoted as $d$ (dimension), and $x_t$ is the $t$-th, $d$-dimensional, most recent incoming data instance at time $t$.
Concentrating on supervised or semi-supervised learning, widely accepted online anomaly detection algorithms such as Hoeffding trees [9] and online random forests [10] achieve good accuracy and robustness on data streams [11]. However, the main focus of this work lies on unsupervised approaches, since the amount of unlabeled data with missing ground truth generated across many scientific disciplines, especially in intrusion detection, has steadily increased. In recent years, many methods have been proposed for unsupervised online OD, such as [12,13,14], but only a few of them apart from iForest, namely, HS-Trees [15], RS-Hash [16] and Loda [17], have been shown to outperform numerous standard detectors and hence are considered the state-of-the-art [18,19]. Since xStream, proposed in [18], is as competitive as those detectors, particularly effective in high dimensions, and revolutionized online OD algorithms by being able to deal with streaming features, we add it to the list of the state-of-the-art.
RS-Hash samples axis-parallel subspace grid regions of varying size and dimensionality to score data points, utilizing the concept of randomized hashing. Its adaptation, RS-Stream, is able to operate on SD by computing the log-likelihood density model using time-decayed scores. Thus, compared to other work that applies, for example, sliding windows, it uses continuous counting in which points are down-weighted according to their age. Compared to RS-Hash, the streaming variant requires greater sophistication in the hash-table design and maintenance, although the overall approach is quite similar [20].
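The time-decayed counting idea can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of exponential down-weighting by age; the half-life parameterization and class interface are our own assumptions, not RS-Stream's actual API:

```python
import math

class DecayedCounter:
    """Time-decayed count: older observations are down-weighted
    exponentially, so the counter reflects recency (a sketch of the
    down-weighting idea, assuming a half-life parameter)."""

    def __init__(self, half_life: float):
        self.lam = math.log(2) / half_life  # decay rate derived from half-life
        self.value = 0.0
        self.last_t = None

    def update(self, t: float, weight: float = 1.0) -> float:
        if self.last_t is not None:
            # decay the accumulated count for the elapsed time
            self.value *= math.exp(-self.lam * (t - self.last_t))
        self.value += weight
        self.last_t = t
        return self.value
```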
Loda, a lightweight online detector of anomalies, is an ensemble approach consisting of a collection of $h$ one-dimensional histograms; each histogram approximates the probability density of the input data projected onto a single projection vector. Projection vectors diversify the individual histograms, which is a necessary condition to improve the performance of individual classifiers on high-dimensional data. The features used must only be of approximately the same order of magnitude, which is an improvement over other methods, such as HS-Trees. Loda's output on a sample $x$ is the averaged logarithm of the probabilities estimated on the individual projection vectors. It is especially useful in domains where large numbers of samples have to be processed, because its design facilitates a very good balance between accuracy and complexity. The algorithm exists in different variants for batch and online learning. For online processing, a subdivision can be made. Similarly to HS-Trees, two alternating histogram sets can be used in Loda, denoted as LodaTwo Hist., where the older set of histograms is used for classification and the newer one is built in the current window. Once the new set is built, it replaces the currently used histogram set. A floating window approach, LodaCont., denotes an implementation of continuously updated histograms based on [21].
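To make Loda's working principle concrete, the following minimal sketch (our own simplification based on the description above; the parameter names and sparse projection scheme are assumptions) builds sparse random projections with one-dimensional histograms and scores a sample by the negative mean log-probability:

```python
import numpy as np

def build_loda(X, n_projections=100, n_bins=32, seed=None):
    """Offline Loda sketch: sparse random projection vectors,
    each paired with a one-dimensional histogram density estimate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # sparse projections: keep roughly sqrt(d) non-zero Gaussian entries each
    W = rng.normal(size=(n_projections, d))
    W[rng.random((n_projections, d)) > 1.0 / max(np.sqrt(d), 1)] = 0.0
    hists = []
    for w in W:
        counts, edges = np.histogram(X @ w, bins=n_bins)
        probs = (counts + 1) / (counts.sum() + n_bins)  # Laplace smoothing
        hists.append((edges, probs))
    return W, hists

def loda_score(x, W, hists):
    """Anomaly score: negative mean log-probability over all projections."""
    logs = []
    for w, (edges, probs) in zip(W, hists):
        idx = np.clip(np.searchsorted(edges, x @ w) - 1, 0, len(probs) - 1)
        logs.append(np.log(probs[idx]))
    return -np.mean(logs)
```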
xStream is able to deal with data streams that are characterized by SD in terms of instances (rows), and evolving, newly-emerging features can be processed while xStream remains constant in space and time. Due to a fixed number of bins for the one-dimensional histograms, growing feature space cannot be handled by Loda. The authors of xStream overcame this limitation by so-called half-space chains where the data, independently of streaming features, are projected via sparse random projection into recursively constructed partitions with splits into small, flexible bins. This density-based ensemble handles non-stationarity, similarly to HS-Trees and LodaTwo Hist., by a pair of alternating windows.
The random forest is one of the most successful models used in classification and is known to outperform the majority of classifiers in a variety of problem domains [20,22]. Owing to its intuitive similarity to random forest, the unsupervised iForest has become established as one of the most important methods in the field of OD. Much work has been done to improve iForest, e.g., [23,24], and to adapt it to other application scenarios, such as feature selection [25]. Even though it was initially not designed to work as an online algorithm, over the last few years, manifold online variants have been proposed that either build on iForest's concept or adapt it to operate in a streaming fashion.
HS-Trees, a collection of random half-space-trees, is based on a tree ensemble concept similar to that of iForest. HS-Trees has a different node splitting criterion and calculates anomaly scores based on the sample counts and densities of the nodes. Furthermore, the trees have a fixed depth (height), whereas iForest uses adaptive depths with smaller subspaces. For SD with concept drifts, HS-Trees utilizes two windows (batches) of equal size: while new HS-Trees are learned in the current window, the HS-Trees trained in the previous window are used for scoring and are replaced once the new ones are ready.
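The alternating-window pattern shared by HS-Trees (and LodaTwo Hist., discussed above) can be summarized in a few lines of Python; build_model and the score interface are hypothetical placeholders for the respective learner:

```python
class AlternatingWindowDetector:
    """Alternating-window sketch: the model trained on the previous
    window scores incoming points while a fresh model is collected
    from the current window; at the window boundary, the new model
    replaces the old one."""

    def __init__(self, window_size, build_model):
        self.window_size = window_size
        self.build_model = build_model
        self.reference = None   # model from the previous window (scores)
        self.buffer = []        # current window content (learns)

    def process(self, x):
        score = self.reference.score(x) if self.reference else None
        self.buffer.append(x)
        if len(self.buffer) == self.window_size:
            # window full: the newly built model replaces the old one
            self.reference = self.build_model(self.buffer)
            self.buffer = []
        return score
```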
One of the first approaches adapting iForest for SD was iForestASD, proposed in [26]. It utilizes a sliding window with a fixed length to sample the data with which the ensemble of trees is built. Based on a predefined threshold value, changes within the window can be detected. An occurring concept drift leads to a re-training of the whole ensemble based on the current sliding window content. The authors themselves proposed significant improvements for future work, for instance, replacing the predefined threshold, which relies on a priori knowledge, and partially re-training only some trees rather than discarding the complete model. A detailed description of the differences between HS-Trees and iForestASD can be found in [7].
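A minimal sketch of iForestASD's drift handling as described above; the helper names and the 0.5 scoring threshold are assumptions for illustration:

```python
def iforest_asd_step(window, forest, u, build_forest):
    """If the anomaly rate in the sliding window exceeds the predefined
    threshold u, the *whole* ensemble is re-trained on the window content."""
    anomaly_rate = sum(forest.score(x) > 0.5 for x in window) / len(window)
    if anomaly_rate > u:               # concept drift assumed
        forest = build_forest(window)  # discard and completely rebuild
    return forest
```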
Recently, iForestASD has been implemented in scikit-multiflow [27], an open source ML framework for data streams, and improved in [7] to better handle concept drifts by extending it with various drift detection methods. To this end, the authors extended the ADWIN [28] and KSWIN [29] drift detectors, denoting them SADWIN/PADWIN and NDKSWIN. Their OD solutions are denoted as IFA(S/P)ADWIN and IFANDKSWIN in this article. However, some major disadvantages of the original proposal, such as discarding the complete forest rather than partially updating the model, are still present in this subsequent work.
More recently, the work of [30] improved LSHiForest, an iForest-based classifier for high-dimensional data that detects special anomalies, e.g., axis-parallel ones, so that it can handle SD and produce time-efficient results when processing large, high-dimensional datasets. Their improvement, denoted as LSHiForestStream in this article, combines streaming pre-processing based on dimensionality reduction with principal component analysis (PCA) and a weighted Page–Hinckley test to find suspicious data points. Furthermore, locality sensitive hashing is applied, which hashes similar input items into the same branches with high probability, and a dynamic iForest with efficient updating strategies is used. Thus, rather than exchanging the whole model as iForestASD does, this approach repeatedly checks whether new suspicious data points exist and updates them in the tree structure.
Another hybrid method called iMondrian forest, denoted as iMForest, was proposed in [31]. Mondrian forest is a method based on Mondrian processes for classification and regression on SD. The authors embedded the concept of isolation from iForest, using the depth of a node within a tree, into the data structure of Mondrian forest to perform OD on SD.
The concept of growing random trees, or GR-Trees, was proposed in [11]; it is also capable of partially updating its ensemble of random binary trees. GR-Trees resembles iForest with respect to both the training process and the assignment of anomaly scores to data instances. In an initial stage, the ensemble is built from the first sliding window content without explicit training data. Incremental learning is achieved through an update trigger that decides when to update the ensemble. Online tree growth and mass weighting ensure that the model can be adapted in time and is able to handle concept drifts.
Referring to Section 4, the solutions most related to our PCB-iForest approach, in that they only partially update the ensemble-based model rather than completely discarding it (as iForestASD and the IFA variants do), are LSHiForestStream, iMForest and GR-Trees. Neither LSHiForestStream nor iMForest fulfills the requirement of algorithm agility, because both are tailored to dedicated data structures for online learning and classification, rather complex ones in the case of LSHiForest. Since LSHiForestStream is designed to deal with multi-dimensional multi-stream data, it is presumably more computationally intensive than multi-dimensional, single-stream solutions, and considering the inclusion of k-means clustering for anomaly detection in online mode, the same applies to iMForest. GR-Trees is similar to iForest in its training and classification process, and in its framework for SD. During online detection with an initially created ensemble, instances classified as normal are stored in a buffer and are used for the streaming updates, leading to tree growth and updating of the trees in the ensemble. Trees are discarded based on the mass weights of the results evaluated by each tree over the sliding window. The online tree growth and mass weighting mechanisms ensure that the detection model can be adjusted in time as the data distribution changes, to avoid misjudgments caused by concept drift. Apart from the sliding window, our approach does not need an additional buffer, which preserves memory. Additional hyperparameters beyond the ensemble size and subsample size, such as the update rate and discard rate, could require adjustment to actual needs to obtain better results. Furthermore, apart from the replacement of discarded trees with trees obtained from a building window, existing trees are updated based on an update window. Both mechanisms use the buffered normal instances, which could pose a slight performance issue, as stated in [6]; refer to the section "Training using normal instances only".
3. Requirement Specification and Validation
In this section, we specify the requirements OD algorithms for SD have to satisfy to be applied in real-world, future-oriented scenarios. Additionally, some requirements are added that help with a holistic incident handling process—for example, providing functionality to assist with identifying the root causes of incidents rooted in outliers. We structure the requirements into operation, data, performance and functionality-related ones.
As already pointed out in the introduction, the missing ground truth in evolving (theoretically infinite) data, which requires real-time or near real-time processing given the evolution and speed of data feeding, demands unsupervised methods; we denote this requirement as (R-OD01). In addition, these methods must be capable of dealing with SD (R-OD02). Both requirements belong to the operation-related category.
We foresee the following additional requirements playing a major role in future systems. Another operation-related requirement is that the settings of hyperparameters should be of low complexity; in particular, no ground-truth information should be mandatory for the settings, or better yet, no hyperparameters should need to be set at all (R-OD03). This requirement is important since, nowadays, domain experts are expected to have a high level of multi-disciplinary expertise from data science, e.g., extensive knowledge of statistics, in order to properly set up a machine learning pipeline. Even though recent developments in the field of automated machine learning [32] have tried to aid domain experts, they require extensive offline supervised training and testing for each hyperparameter value, and thus do not work in unsupervised online cases. Hence, aiding domain experts by not burdening them with setting parameters seems important for manageable systems. Furthermore, the time-varying nature of SD, especially in highly dynamic networks, is subject to the phenomenon called concept drift: the data stream distribution changes over time. Given the different types of concept drift (sudden, gradual, incremental, recurring and blips) [33], algorithms must be able to efficiently adapt to these changes and continuously re-train or update their models (R-OD04) to prevent becoming outdated, which would lead to performance degradation. Dedicated drift detection methods can come to the rescue of OD algorithms to deal with the data's varying nature.
From a data-related perspective, in the field of network-based OD, data flow from a single source (single-view) and do not necessarily have to be normalized (R-OD05), e.g., data from a raw network interface, statistics from network switching elements, or log files from devices. Features can be defined in advance by an expert, since incorporating domain knowledge to select relevant features can improve learning performance greatly. Thus, the cardinality of the feature set (the number of dimensions) is fixed (R-OD06), so algorithms do not need to deal with streaming features. For network-based features, one may distinguish between basic features (derived from raw packet headers (meta data) without inspecting the payload, e.g., ports, and MAC or IP addresses), content-based features (derived from payload assessment requiring domain knowledge, e.g., protocol specifications), time-based features (temporal features obtained from, e.g., message transmission frequency and sliding window approaches) and connection-based features (obtained from a historical window incorporating the last $n$ packets) [34]. However, with technological advancements and the increasing number of potential features, algorithms must operate efficiently on high-dimensional, high-volume data (R-OD07) and must cope with missing variables that might occur due to unreliable data sources (R-OD08).
Performance-related requirements can be subdivided into computational and classification performance requirements. The former demands lightweight algorithms (R-OD09), in terms of time and space complexity, for both model-updating and classification, so that they are implementable in embedded software. SD is potentially infinite, and algorithms must temporarily store as little data as possible and process the data as fast as possible due to the time constraint of observing incoming data in a limited amount of time. The classification performance needs to be sufficiently good (R-OD10), i.e., producing a decent area under the curve ($AUC$), in which the curve is the receiver operating characteristic ($ROC$), or a decent $F_1$ score metric, in order to detect malicious activity in a reliable way. It should be noted that stream methods typically perform worse in terms of classification than their batch competitors. However, under the assumption of applying a subsequent root cause analysis, we strongly support the justification in [27] that for critical SD applications, an efficient method, even with less accuracy, is preferred.
Functionality-related requirements can be subdivided as follows. One still unresolved issue for IDSs is the difficulty of finding the actual root cause of incidents. Instead of yielding simple binary values (normal or abnormal), requirement (R-OD11) demands algorithms that provide outlier score values. These carry more information and could support a subsequent root cause analysis, for instance, by dealing with false negatives. In addition, the importance of features can play an important role, demanding functionality to score or rank features according to their outlier score contributions (R-OD12), thereby providing information about which feature (mainly) caused the outlier. Reducing the data's dimensionality can help deal with the curse of dimensionality, referring to the phenomenon that data become sparser in high-dimensional space and can still be represented accurately with fewer dimensions. This sparsity adversely affects both the storage requirements and computational cost of the algorithms. Reduction methods such as PCA map higher-order matrices into ones with lower dimensions with a certain probability. However, the physical meaning of the features is no longer retained by this projection, which impedes root cause analysis (feature interpretability). Feature selection methods reduce the dimensions by only selecting the most relevant features and hence preserve their physical meanings. Applying feature selection on SD would possibly lead to changing top-performing features as time passes, and thus demands OD algorithms that are capable of changing feature sets during runtime (R-OD13). Considering the multitude of recent work tailored to attacking machine learning, e.g., [35,36], we see, similarly to cryptographic agility, the flexibility to exchange the actual algorithm as a forward-looking requirement (R-OD14) for the case where the currently used algorithm (poisoning) or its model (evasion) gets compromised. The former is more likely, since evasion of the model seems less relevant due to the continuous updating.
Over the past few years, much attention has been paid to establishing OD algorithms for SD in the field of network security, a field increasingly facing massive amounts of data generated at high velocity and afflicted with the phenomenon of concept drift. Many existing works have tried to improve algorithm settings in terms of performance-related requirements by competing on the same (often outdated) benchmark dataset. For real-world applications, in which most of the algorithms might perform insufficiently, we expect that designing algorithms around a tradeoff between the stated requirements is more crucial. Thus, assuming the application of a subsequent root cause analysis enabled by, e.g., (R-OD12), certain numbers of false positives and false negatives are acceptable with respect to (R-OD10). In particular, in critical application domains, it might be preferable to quickly and efficiently detect outliers, even with less accuracy.
Table 1 validates the state-of-the-art algorithms presented in Section 2 against the specified requirements. It can clearly be seen that none of the existing methods performs well across the majority of requirements.
4. Generic PCB-iForest Framework
In this section, we focus on the design of an intelligent OD solution that satisfies all of the aforementioned requirements. To this end, we carefully reviewed related work and combined the merits of the most promising approaches while alleviating their shortcomings. Our focus lies on iForest-based approaches, since iForest (i) is a state-of-the-art algorithm for OD, (ii) is widely used by the community, (iii) is efficient in terms of computational and classification performance and (iv) can easily be adapted for SD applications [7]. Its wide acceptance in research is reflected in numerous improvements and extensions, for instance, extended isolation forest (EIF) [23], functional isolation forest [24], entropy iForest [37], LSHiForest [38] and SA-iForest [39], for different application domains or special problems, such as dealing with categorical and missing data [40]. However, those adaptations are mainly tailored to static datasets rather than application on SD. Thus, our aim was to provide a framework that is able to exchange the iForest-based classifier in cases of compromise or if the application domain, with its specific task, demands another base learner. Moreover, the framework can be generalized to incorporate basically any ensemble-based algorithm consisting of a set of components such as trees. The workflow of our Performance Counter-Based iForest framework, denoted as PCB-iForest, is shown in Figure 1.
A data instance (data point) $x_t$ with dimension $d$ of the data stream is captured as the latest instance at each time step $t$ in the count-based sliding window $W$ and, in parallel, is evaluated in the Scoring module, which provides an outlier score $y$ for each $x_t$. The sliding window is composed of the latest $w$ instances such that $W = \{x_{t-w+1}, \dots, x_t\}$. A dedicated drift detection method is applied that triggers the partial fitting process to discard and add components, denoted as $C$, of the ensemble $E$. The core of PCB-iForest is the Performance Counter-Based Scoring module, which is able to identify well and badly performing components of an ensemble. Partial fitting then discards only the poorly performing components and replaces them with newly created ones built from the most recent instances contained in the sliding window. In the following, we provide more details on the main parts of our framework.
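The workflow can be condensed into the following sketch; interfaces such as ensemble.score and drift_detector.add are hypothetical placeholders, and the actual components are detailed in Sections 4.1 and 4.2:

```python
from collections import deque

def pcb_iforest_stream(stream, window_size, ensemble, drift_detector):
    """High-level PCB-iForest loop per Figure 1 (assumed interfaces):
    score each instance immediately, keep it in a count-based sliding
    window, and partially re-fit the ensemble when drift is detected."""
    W = deque(maxlen=window_size)           # sliding window of the latest w instances
    for x_t in stream:
        y_t = ensemble.score(x_t)           # Scoring module, runs in parallel to windowing
        ensemble.update_counters(x_t, y_t)  # Performance Counter-Based Scoring (Section 4.2)
        W.append(x_t)
        if drift_detector.add(x_t):         # data-centric drift detection (Section 4.1)
            ensemble.partial_fit(list(W))   # rebuild only poorly performing components
            ensemble.reset_counters()
        yield y_t
```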
4.1. Drift Detection Method
Detecting changes in multi-dimensional SD is a challenging problem, especially regarding scaling with the number of dimensions. A sometimes applied solution is to reduce the number of dimensions by either performing PCA (cf. LSHiForestStream) or random projections. Furthermore, one might even reduce the data to one (or more) uni-dimensional statistics and apply a well-known drift detection method, such as DDM [41] or ADWIN [28,42]. For IFA(S/P)ADWIN, drift detection is performed on the one-dimensional statistic of either the binary prediction value (PADWIN) or the actual score value (SADWIN). Reduction is achieved by the learning model; thus, this is referred to as a model-centric approach. However, generic approaches, such as [43,44], exist that deal with multi-dimensional changes detected directly on the SD, referred to as data-centric.
We observe that model-centric approaches (indicated by the dotted line in Figure 1) might be prone to a phenomenon called positive feedback. This means that drift detection causing partial fitting is negatively influenced by the actual classification results, in such a way that the ensemble results tend to stay the same because "badly" performing components are discarded from the model's point of view. Positive feedback is also present within iForestASD, since its drift detection depends on the anomaly rate computed from the model's scoring results. Furthermore, iForestASD's anomaly rate depends on a priori knowledge, which is hardly feasible in real-world applications. Therefore, we recommend the usage of data-centric solutions, which are unbiased by the applied model and rely only on the SD characteristics. Since NDKSWIN [7] has, as of now, proven to be a reliable drift detection method, we apply it in PCB-iForest, but our approach is open to any data-centric or model-centric solution. NDKSWIN adapts a relatively new one-dimensional method called KSWIN [29], based on the Kolmogorov–Smirnov (KS) statistical test, which does not require any assumption about the underlying data distribution, to make it capable of detecting concept drifts in multi-dimensional data.
In KSWIN, the sliding window $W$ is divided into two parts. The first sub-window, called $R$, contains the latest data instances, where concept drift might have taken place. The length of $R$ is predefined by the parameter $r$. The second sub-window, called $L$, contains uniformly selected data instances that are a sampled representation of the old data. Concept drift is detected by comparing the distances of the two empirical cumulative distributions of $R$ and $L$ according to $dist(R, L) > \sqrt{-\ln(\alpha)/r}$, in which $\alpha$ is the probability for the statistical KS test. NDKSWIN extends this test by declaring concept drift if drift is detected in at least one of the $d$ dimensions. However, contrary to IFANDKSWIN, the application of NDKSWIN in PCB-iForest differs: we do not apply drift detection inline before scoring. Our parallel setting, in which newly arriving data instances are immediately forwarded to the scoring function, allows us to detect anomalies in near real-time without losing time on upstream drift detection. Although possible concept drift might already afflict the new instance, thereby legitimizing the approach of first updating the model before scoring, we again state that for network-based anomaly detection, an accelerated but less precise model is favored. PCB-iForest's design is thus clearly more performant, especially if high throughput is demanded. Our approach further improves the computational benefit of NDKSWIN since, contrary to iForestASD or IFANDKSWIN, we do not discard the whole model in cases of detected drift, but are able to update it only partially. Consequently, even if NDKSWIN detects slightly more drifts, our approach is a good tradeoff between a resource-saving model up-to-dateness and a continuously updating model, e.g., HS-Trees or LodaCont., which fit their models with each arriving instance even if there is no need.
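A minimal sketch of the KSWIN test and its NDKSWIN extension as described above, using scipy's two-sample KS statistic; the window handling is simplified and assumes at least $2r$ stored values per dimension:

```python
import math
import random
from scipy.stats import ks_2samp

def kswin_drift(window, r, alpha=0.01):
    """One-dimensional KSWIN check: compare the r newest values (R) with
    r uniformly sampled older values (L); drift if the KS statistic
    exceeds sqrt(-ln(alpha) / r)."""
    R = window[-r:]
    L = random.sample(window[:-r], r)       # sampled representation of old data
    dist = ks_2samp(R, L).statistic         # distance of the empirical CDFs
    return dist > math.sqrt(-math.log(alpha) / r)

def ndkswin_drift(windows_per_dim, r, alpha=0.01):
    """NDKSWIN: declare concept drift if at least one of the d dimensions drifts."""
    return any(kswin_drift(w, r, alpha) for w in windows_per_dim)
```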
4.2. Performance Counter-Based Scoring
Performance Counter-Based Scoring monitors the performance of each component $C_i$ (herein an iTree) in the ensemble $E$ (herein the iForest) by assigning it a performance counter (PC). In general, the approach favors or penalizes individual ensemble components at runtime according to their contributions to the ensemble's overall scoring result. Thus, the PC-value is changed for each data instance, i.e., increased or decreased depending on the component's scoring quality within the ensemble's anomaly score. The PC-values increase or decrease by 1 for well and poorly performing components, depending on whether each individual score is above or below the anomaly threshold $s$ (herein 0.5 for iForest-based learners, as discussed in [6]). For example, suppose the ensemble scores a sample with $y > s$, which indicates an anomalous sample. Each individual component's score contribution is verified such that if the score $y_i$ of the $i$-th component $C_i$ is greater than $s$, $y_i > s$, the PC-value of $C_i$, $PC_i$, increases. In turn, if $y_i \le s$, $C_i$ is penalized by decreasing $PC_i$. However, one might even increase or decrease the PC-values in a more granular fashion, depending on the confidence level of the ensemble score and each individual component's score contribution. For instance, if $y \gg s$, the confidence level of the ensemble that the sample is anomalous is high. Thus, if any $y_i \le s$, component $C_i$ might be penalized to a larger extent by decreasing $PC_i$ with a higher value. For the sake of simplicity, in this article we apply the simpler binary approach in which each individual PC is increased/decreased by 1 if its score value is greater/less than the ensemble's score value. The counting continues until a drift is detected. Once this happens, the weaker performing components, as indicated by their negative PC-values, are replaced with new ones built on the data instances present in the current window $W$. The PC-values of all trees are set to zero after the partial update is finished; resetting the values even for previously well performing trees clears the old bias (the effect of previous scoring). Referring to Figure 1, Algorithm 1 shows the working principle, including the core of the Performance Counter-Based Scoring. For brevity, we neglect the initialization phase in which, once the sliding window is first filled, the components are initially built and the PC-values are set to zero.
Algorithm 1: The working principle of PCB-iForest.
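Since the original listing is not reproduced here, the following Python sketch reconstructs the working principle from the description in Section 4.2; all interfaces are assumptions:

```python
def pcb_score(ensemble, pcs, x, s=0.5):
    """Score one instance and update the performance counters: reward
    components whose verdict relative to the threshold s matches the
    ensemble's verdict (per the worked example above), penalize the rest."""
    scores = [c.score(x) for c in ensemble]   # per-component anomaly scores y_i
    y = sum(scores) / len(scores)             # ensemble anomaly score y
    for i, y_i in enumerate(scores):
        pcs[i] += 1 if (y_i > s) == (y > s) else -1
    return y

def pcb_partial_fit(ensemble, pcs, window, build_component):
    """On detected drift: replace components with negative PC-values by
    new ones built on the sliding window content, then reset all PCs."""
    for i in range(len(ensemble)):
        if pcs[i] < 0:
            ensemble[i] = build_component(window)
        pcs[i] = 0                            # clear the old bias for all components
```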
4.3. Base Learner
The PCB-iForest framework is designed to allow exchanging the base learner (see the interface sketch after this paragraph). Although initially intended for iForest-based approaches, the concept can easily be generalized to any ensemble method with components such as trees, histograms and chains. With partial updating, unlike iForestASD and the IFA approaches, higher throughput is possible since the complete model does not need to be updated; rather, only a certain number of penalized trees are replaced, which avoids completely and abruptly forgetting previously learned information by flushing the whole model, similarly to the catastrophic interference known from the field of neural networks. Thus, with respect to non-iForest-based approaches, we see the potential of our framework to replace, e.g., the alternating windows of HS-Trees or LodaTwo Hist., in which new ensembles are built and continuously replace those currently used, even if there is no necessity. Our approach is more resource-preserving, keeping a set of ensemble components as long as there is no need to replace them, e.g., due to a concept drift. However, in this article we focus on iForest-based approaches for the reasons stated at the beginning of this section. In particular, we present two application scenarios underlining the fulfillment of crucial requirements from Section 3.
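As a sketch of what such exchangeability could look like, a base learner would only need to expose a minimal per-component interface; this is our own assumption for illustration, not the authors' API:

```python
from abc import ABC, abstractmethod

class EnsembleComponent(ABC):
    """Assumed minimal interface a base learner must expose so that the
    PCB-iForest framework can wrap it: per-component scoring plus
    construction from a window of recent instances."""

    @abstractmethod
    def score(self, x) -> float:
        """Anomaly score contribution of this component for instance x."""

    @classmethod
    @abstractmethod
    def build(cls, window) -> "EnsembleComponent":
        """Construct a fresh component (iTree, histogram, chain, ...)
        from the current sliding window content."""
```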
4.3.1. Algorithm Agility
SD is afflicted with a theoretically infinite flow of data. Thus, in some cases, it might be necessary to exchange the base learner as time passes. A possible application scenario is that the currently used base learner has been compromised, meaning that it is vulnerable to, e.g., poisoning of the algorithm, and an adversary might bypass the detection of its malicious activity. Another non-malicious use case is a major change of the data, beyond a concept drift, over a long running time, which requires a different type of base learner. Some iForest improvements might then need to be tailored for specific application scenarios. In this article, apart from classic iForest, we demonstrate algorithm agility by incorporating EIF. It addresses the drawbacks of iForest's branching with random horizontal and vertical cuts by substituting them with non-axis-parallel hyperplanes with random slopes. Thus, to the best of our knowledge, PCB-iForest is the first work that applies this improved version of iForest to SD. Since the PCB-iForest framework only "wraps around" EIF, no specific adaptations are necessary except adding the Performance Counter-Based Scoring. We denote this variant as PCB-iForestEIF. However, it should be remarked that feature interpretability is irretrievably lost with the improved branching in EIF. Therefore, we additionally take on the topic of feature importance measurement for OD on SD with a second variant explained in the next section.
4.3.2. Feature Scoring
Apart from popular dimensionality reduction algorithms such as PCA, feature selection for OD aims to select only the features relevant for the classification task, discarding irrelevant and redundant ones and thereby reducing dimensionality. This leads to more computationally efficient OD while preserving feature interpretability. Especially in a consecutively applied root cause analysis, feature interpretability plays a crucial role for future forensics use cases. While some feature selection approaches only provide a subset of relevant features, others are able to score and rank features according to their contribution to a sample's anomalousness, so that one can select the best performing features as indicated by their score values. In particular, since iForest is inferior to projection-based methods on high-dimensional noisy datasets [18], feature selection would significantly aid in reducing dimensions and thus amplify iForest's classification performance in lower dimensions. This coincides very well with the suggestion from Togbe et al. in [7] that feature selection could mitigate the effect of choosing the most important dimensions for drift detection.
Much work has been done in the field of feature selection, but to the best of our knowledge, existing approaches either focus on feature selection for SD (but not with a focus on imbalanced data classification), e.g., [45,46,47], or on feature selection for OD, e.g., [25,48,49] (but mainly in a supervised and offline fashion). Thus, we see it as crucial to contribute feature scoring solutions for SD that focus on OD and might be exploited for feature selection. Pevný, in [17], proposed a one-tailed, two-sample test for Loda to achieve feature scoring without increasing its overall complexity. This approach is the one most related to our intention of scoring relevant features (for the task of OD) in a streaming fashion, which one might use to rank and select the top-performing features.
In order to achieve feature scoring, we take advantage of the unsupervised isolation-based feature selection (IBFS) method recently proposed in [25] and tailored for OD. The method exploits the training phase of classical iForest, in particular the random selection of feature values, and computes score values for each feature by calculating imbalance scores using an entropy measure. Although this method is designed for offline iForest, it can easily be adapted to our PCB-iForest in a streaming fashion, denoted as PCB-iForestIBFS. Since it is designed for the training phase, we only obtain feature scores after each partial update (training). In order to obtain representative feature scores as time passes, we continuously update the score values with each partial update, as shown in Algorithm 2. Once a partial update is triggered, we let IBFS compute a feature score $fs_i$ for the $i$-th feature based on the data instances in $W$, resulting in a one-dimensional array $FS$ of $d$ feature scores. With each partial update, we continuously update the feature scores by incremental averaging. For the sake of simplicity, we apply the incremental average $\overline{FS} \leftarrow \overline{FS} + (FS - \overline{FS})/k$ with a continuous counter value $k$ incremented at each partial update, in order to obtain the averaged array of feature scores $\overline{FS}$. It must be noted that other methods exist, e.g., those discussed in [5], that might be superior when concept drifts occur within the feature scores. Since we only preserve $d$ values for the current average scores and one value for the continuous counter $k$, and perform $d$ updates of the scores, both the space and time complexity of each feature score averaging step are $\mathcal{O}(d)$ when applying the well-known Welford's algorithm [50]. This does not significantly increase the overall complexity of PCB-iForestIBFS since $d$ is fixed. A summary of the feature scoring functionality is shown in Algorithm 2. As time passes and feature scores are continuously computed, one might rank the feature scores and identify the top-performing ones. Thus, it might become necessary to change the feature set or reduce the number of features from the original set. PCB-iForestIBFS is able to change the feature set during runtime: once a partial update is triggered, instead of discarding only poorly performing components, the whole model can be discarded and a new ensemble built using the newly proposed feature set.
Algorithm 2: Feature scoring in PCB-iForest utilizing IBFS.
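The incremental feature-score averaging at the core of Algorithm 2 can be sketched as follows; this is a minimal illustration of the running mean described above, with class and method names of our own choosing:

```python
import numpy as np

class IncrementalFeatureScores:
    """After the k-th partial update, IBFS yields an array FS of d
    feature scores, which is folded into a running mean in O(d) time
    and space (Welford-style incremental average)."""

    def __init__(self, d: int):
        self.mean = np.zeros(d)  # averaged feature scores FS_bar
        self.k = 0               # number of partial updates so far

    def update(self, fs: np.ndarray) -> np.ndarray:
        self.k += 1
        # incremental mean: FS_bar += (FS - FS_bar) / k
        self.mean += (fs - self.mean) / self.k
        return self.mean
```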
7. Conclusions and Future Work
Over the past few years, the continuous increase in high-volume, high-speed and high-dimensional unlabeled streaming data has pushed the development of anomaly detection schemes across multiple domains. Those require efficient unsupervised and online processing capable of dealing with challenges such as concept drift. The most popular state-of-the-art outlier detection methods for streaming data were discussed in this paper, with a focus on algorithms based on the widely known isolation forest (iForest), and they were compared against thoroughly engineered requirements pointing out the lack of a flexible, robust and future-oriented solution.
Thus, this article introduces and discusses a novel framework called PCB-iForest that “wraps around”, generically, any ensemble-based online OD method. However, due to its popularity, the main focus lies on the incorporation of iForest-based methods for which we present two variants. PCB-iForestEIF is (to the best of our knowledge) the first application of the iForest improvement called extended isolation forest on streaming data. PCB-iForestIBFS applies a recently proposed feature importance scoring functionality designed for static data, which is adapted to function in a streaming fashion. We provide details of PCB-iForest’s core functionality based on performance counters assigned to each ensemble component in order to favor or penalize well or poorly performing components. The latter will be replaced if concept drifts are detected by newly built ones based on samples within a sliding window. Since drift detection is crucial, in order to regularly update our model if required, we rely on a recently proposed method denoted as NDKSWIN, but are open to any multi-dimensional data-centric method.
Our extensive evaluation first assessed the drift detection functionality, showing that NDKSWIN is able to detect concept drifts and, even if afflicted by some additional detections, regularly updates PCB-iForest. Comprehensively comparing both PCB-iForest methods with state-of-the-art competitors pointed out the superiority of our method in most cases. In terms of the area under the receiver operating characteristic ($ROC$) curve ($AUC$), we achieved the best results on four datasets used by online iForest-based competitors. On the multi-disciplinary ODDS datasets, PCB-iForest clearly outperformed nine competitors on approximately 50% of the datasets and achieved comparable results on 80% with respect to the $AUC$ metric and its tradeoff with the average runtime. Utilizing the four most efficient competitors and our PCB-iForest variants on four security-related UNSW-NB15 datasets again proved the superiority of our approach. It achieved the highest $AUC$ (we do not include one dataset on which all classifiers performed poorly) while still being comparable in speed to the extremely fast Loda algorithm.
PCB-iForest’s current implementation faces two limitations which will be addressed as part of future work by taking advantage of the framework’s flexible design to incorporate other ensemble-based OD algorithms and replacing the drift detection method. Thus, firstly, we will focus on the integration of LodaTwo Hist. into PCB-iForest, given (i) its rapid processing and (ii) the fact that its classification results could be further improved by replacing the pair of alternating windows with our approach. Therefore, the set of histograms will only partially be replaced if concept drift is detected and will not completely be discarded after each window.
Secondly, drift detection is a crucial component of our approach, since it must reliably detect concept drifts and should not be prone to a high number of false detections, which would degrade both the classification and computational performance. Therefore, our future work will (i) focus on the improvement of NDKSWIN, since it is possibly prone to the multiple comparisons problem, and (ii) evaluate the use of other multi-dimensional data-centric algorithms within PCB-iForest.
More research will also be devoted to PCB-iForestIBFS's feature scoring functionality: even though it is able to score and rank the best-performing features during runtime, determining the "optimum" number of features to use as a feature subset remains an open issue. Although subsets with 25–75% of the whole feature set could achieve promising results, this is highly dependent on the data source and algorithm type. Thus, a method will be developed that determines the best feature subset based on an analysis of the feature scores.