**4. Proposed Method**

In this section, we describe the proposed algorithm in detail. Section 4.1 introduces two coding schemes, corresponding to the local and global versions of the WOA applied to hierarchical clustering. Section 4.2 describes the fitness function used to generate alarm clusters with the WOA. Section 4.3 combines the WOA with the crossover and mutation operators of the genetic algorithm, and Section 4.4 presents the local and global versions of the WOA alarm hierarchical clustering algorithm.

#### *4.1. Encoding and Decoding*

We first introduce the encoding and decoding scheme of the local version of the WOA applied to hierarchical clustering. In this scheme, a search agent in the WOA corresponds to a cluster center of the hierarchical clustering. As mentioned above, a cluster center consists of a basic alarm or an abstract alarm, so the data structure encoded by the search agent follows from the structure of the alarm. Each attribute in the alarm corresponds to a binary string of 0s and 1s in the encoded data structure, as shown in Figure 4.

**Figure 4.** A cluster center (A3,B6,C1,D5) and its corresponding coding scheme: (**a**) the coding scheme of the cluster center; (**b**) the cluster center (A3,B6,C1,D5).

Figure 4a shows the coding scheme of the cluster center in binary form, and Figure 4b shows the attribute values of the cluster center corresponding to this coding. The example is an alarm with four attributes (A, B, C, and D) whose decimal values are 3, 6, 1, and 5.

The four attribute fields of the alarm are located in their respective hierarchical trees, as shown in Figure 5.

**Figure 5.** Hierarchical tree structure for attributes A, B, C and D. (**a**) three-layer tree structure of attribute A; (**b**) four-layer tree structure of attribute B; (**c**) three-layer tree structure of attribute C; (**d**) four-layer tree structure of attribute D.

Given a binary-encoded attribute, we can find the location of the corresponding node in the hierarchical tree. Conversely, we can easily obtain the binary fragment corresponding to an alarm from the hierarchical tree. Under this coding scheme, a search agent corresponds to a cluster center. The goal of the WOA is to find, over multiple iterations, the search agent that is best with respect to the fitness function, output its corresponding cluster center, and assign to that cluster the alarms that belong to it.

In this encoding and decoding scheme, a WOA search agent corresponds to a cluster center. Assuming that the cluster center is composed of *N* attributes and the binary length of each attribute is *K*, the encoding length of a search agent is *N* ∗ *K*, corresponding to *N* hierarchical trees.
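The local scheme can be illustrated with a minimal Python sketch. It assumes that each attribute value is the integer index of a node in its hierarchical tree and that this index is stored as a fixed-width *K*-bit unsigned binary field; the function names `encode_center` and `decode_center` and the choice *K* = 4 are hypothetical and are not taken from Figure 4.

```python
from typing import List


def encode_center(values: List[int], k: int) -> str:
    """Concatenate one k-bit field per attribute into an N*k-bit string."""
    for v in values:
        if not 0 <= v < 2 ** k:
            raise ValueError(f"value {v} does not fit in {k} bits")
    return "".join(format(v, f"0{k}b") for v in values)


def decode_center(bits: str, k: int) -> List[int]:
    """Split an N*k-bit string back into N integer attribute values."""
    if len(bits) % k != 0:
        raise ValueError("bit string length must be a multiple of k")
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]


# Cluster center (A3, B6, C1, D5) with an assumed field width of K = 4.
center = [3, 6, 1, 5]
bits = encode_center(center, k=4)
decoded = decode_center(bits, k=4)
print(bits, decoded)
```

With *N* = 4 attributes and *K* = 4 bits, the encoded search agent has length *N* ∗ *K* = 16, and decoding recovers the original attribute values.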

Having given the encoding and decoding scheme of the local version of the WOA–hierarchical tree, we now introduce that of the global version. In the global coding scheme, a WOA search agent is composed of a group of cluster centers. Assuming that the WOA eventually obtains *C* cluster centers, each composed of *N* attributes of length *K*, the coding length of the global version of the WOA–hierarchical tree is *C* ∗ *N* ∗ *K*. We use the example shown in Figure 6 for illustration.

**Figure 6.** Hierarchical tree structure for attributes A, B and C.

In Figure 6, there are three hierarchical trees, one for each of the three attributes of an alarm. If we want to obtain three cluster centers for this alarm set, namely (A4, B2, C3), (A1, B1, C2) and (A3, B2, C4), then one of our search agents can be encoded under the second (global) coding scheme as shown in Figure 7a. Figure 7b shows the three cluster centers corresponding to this coding.

**Figure 7.** Three cluster centers and their corresponding coding scheme: (**a**) the coding scheme of the three cluster centers; (**b**) three cluster centers (A4,B2,C3), (A1,B1,C2) and (A3,B2,C4).
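The global scheme can be sketched in the same way. The example below assumes, as an illustration only, that each attribute value is stored as a fixed-width *K*-bit unsigned integer with *K* = 3; the helper names `encode_agent` and `decode_agent` are hypothetical, and the bit layout is not claimed to match Figure 7a.

```python
from typing import List


def encode_agent(centers: List[List[int]], k: int) -> str:
    """Concatenate C cluster centers, each with N k-bit attribute fields."""
    for center in centers:
        for v in center:
            if not 0 <= v < 2 ** k:
                raise ValueError(f"value {v} does not fit in {k} bits")
    return "".join(format(v, f"0{k}b") for center in centers for v in center)


def decode_agent(bits: str, n: int, k: int) -> List[List[int]]:
    """Split a C*n*k-bit string into C centers of n attribute values each."""
    width = n * k
    if len(bits) % width != 0:
        raise ValueError("bit string length must be a multiple of n * k")
    return [
        [int(bits[start + j * k:start + (j + 1) * k], 2) for j in range(n)]
        for start in range(0, len(bits), width)
    ]


# The three cluster centers (A4,B2,C3), (A1,B1,C2) and (A3,B2,C4).
centers = [[4, 2, 3], [1, 1, 2], [3, 2, 4]]
agent_bits = encode_agent(centers, k=3)
restored = decode_agent(agent_bits, n=3, k=3)
print(len(agent_bits), restored)
```

Here *C* = 3, *N* = 3 and *K* = 3, so a single search agent has length *C* ∗ *N* ∗ *K* = 27 and carries the whole clustering result.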

#### *4.2. Fitness Function*

The core of the WOA is to find the best solution in a finite solution space within a finite number of iterations, and the fitness function is the criterion for judging the quality of a solution. Choosing an appropriate fitness function is therefore the key to solving the hierarchical clustering problem with the WOA. The fitness function in this paper considers three main factors: the number of alarms contained in a cluster, the distance between alarms belonging to the same cluster, and the coincidence degree between clusters. Given a fixed alarm distance threshold, the more alarms a cluster contains, the greater its fitness value. In addition, the higher the similarity (that is, the smaller the distance) between alarms belonging to the same cluster, the higher the fitness value. Finally, the higher the coincidence degree between two cluster centers, the closer the meanings of the two clusters, and the smaller the overall fitness value.

For a given alarm cluster center *S* = (*N*1, *N*2, *N*3, *N*4, ..., *Nm*), the value of *m* corresponds to the number of hierarchical trees used in the alarm cluster. The more alarms whose distance to the cluster center falls within the given threshold, the higher we consider the fitness of the cluster center to be. In Refs. [45–47], the fitness function checks whether the number of alarms belonging to a given cluster center exceeds a given threshold: if it does, the fitness is set to 1; otherwise, it is set to 0. This approach is simple and separates well the cases that meet the clustering requirements from those that do not. However, it cannot reflect the relative quality of cluster centers that exceed the threshold. For example, suppose the threshold is 500 and two clusters *C*1 and *C*2 contain 2000 and 5000 alarms, respectively. Intuitively, *C*2 is better than *C*1, but both receive the same fitness value, so the two are not distinguished. In this paper, we adopt a new method of calculating the alarm-number fitness, as shown in Equation (10).

$$E_{Tk}(t, x) = \begin{cases} 0, & \text{if } x < t \\ \ln\left(\dfrac{x}{t}\right), & \text{if } x \ge t \end{cases} \tag{10}$$

where *t* represents the threshold on the number of alarms that the cluster should contain, and *x* represents the number of alarms belonging to the cluster center.

When *x* < *t*, the cluster contains too few alarms and the cluster center should not be selected; when *x* ≥ *t*, the fitness value of the cluster center increases with *x*. Taking *t* = 500 as an example, the graph of the fitness function is shown in Figure 8.

**Figure 8.** Fitness function *ETk*(*t*, *x*) for *t* = 500.
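As a quick numerical check of Equation (10), the following Python sketch evaluates the alarm-number fitness for the two clusters in the example above. It assumes that the denominator inside the logarithm is the threshold *t*; the function name `alarm_count_fitness` is hypothetical.

```python
import math


def alarm_count_fitness(t: int, x: int) -> float:
    """Equation (10): 0 below the threshold t, ln(x / t) at or above it."""
    if t <= 0:
        raise ValueError("threshold t must be positive")
    if x < t:
        return 0.0
    return math.log(x / t)


# Threshold t = 500; clusters C1 and C2 contain 2000 and 5000 alarms.
f_small = alarm_count_fitness(500, 100)   # below the threshold
f_c1 = alarm_count_fitness(500, 2000)     # ln(4)
f_c2 = alarm_count_fitness(500, 5000)     # ln(10)
print(f_small, round(f_c1, 3), round(f_c2, 3))
```

Unlike a 0/1 threshold test, the logarithmic form ranks *C*2 above *C*1 while growing slowly, so very large clusters do not dominate the fitness value.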

For a cluster center, the number of alarms it contains is not sufficient on its own to indicate its quality. We consider the selection of a cluster center reasonable only when it contains enough alarms and the differences between those alarms are small enough. Therefore, we define a fitness function for the internal differences within an alarm cluster, as shown in Equation (11). The normalizing term in Equation (11) is the average depth of the attribute hierarchy trees; for the four hierarchical trees shown in Figure 1, for example, it is (3 + 2 + 3 + 2)/4 = 2.5.

$$E_S(k) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \frac{D(C_i)}{M_d} \right) \tag{11}$$

where *i* = 1, 2, 3, ..., *n* indexes the *n* alarms belonging to cluster center *k*, *D*(*Ci*) is the sum over all attributes of the distance, measured in the corresponding attribute tree, between the *i*-th alarm and the cluster center, and *Md* is the average depth of all attribute hierarchy trees.
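Equation (11) can be evaluated with a few lines of Python once the per-alarm distances *D*(*Ci*) are known. The sketch below takes those distances as given; the distance values and the function name `internal_similarity_fitness` are hypothetical, while the tree depths (3, 2, 3, 2) are the ones quoted in the text.

```python
from typing import Sequence


def internal_similarity_fitness(distances: Sequence[float],
                                mean_depth: float) -> float:
    """Equation (11): mean of 1 - D(C_i) / M_d over the alarms of a cluster."""
    if not distances:
        raise ValueError("a cluster must contain at least one alarm")
    if mean_depth <= 0:
        raise ValueError("mean tree depth must be positive")
    return sum(1 - d / mean_depth for d in distances) / len(distances)


# Average depth of the four trees: (3 + 2 + 3 + 2) / 4 = 2.5.
tree_depths = [3, 2, 3, 2]
m_d = sum(tree_depths) / len(tree_depths)

# Hypothetical distances D(C_i) for four alarms in one cluster.
distances = [0, 1, 1, 2]
e_s = internal_similarity_fitness(distances, m_d)
print(m_d, round(e_s, 3))
```

An alarm identical to the cluster center contributes 1 to the average, and the contribution decreases as the alarm moves further from the center in the attribute trees.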

#### *4.3. Crossover and Mutation Operator*

One of the difficulties of applying the WOA to hierarchical clustering is how to use Equations (2), (6) and (9) to update search agent positions for different types of attributes. If an attribute is a continuous variable, these equations can be used directly, but if an attribute is a discrete variable, they are difficult to apply. Because the attributes are organized as hierarchical trees and encoded in binary, an encoded attribute can easily be mapped to a node in its hierarchical tree. We therefore use the crossover and mutation operators of the genetic algorithm to update positions. A further advantage of these two operators is that combining the WOA with them improves the algorithm's ability to search for local and global optimal solutions, as reported in Ref. [48].

We now present the application of the crossover operator under the WOA coding scheme. Taking alarms with four attributes as an example, the binary representation of the attribute fields of two alarms is shown in Figure 9.

As can be seen from Figure 9, the two coded alarms are *Alarm*1 (*A*3, *B*6, *C*1, *D*5) and *Alarm*2 (*A*3, *B*10, *C*2, *D*3). Starting from the sixth bit, the bit strings of the two alarms are exchanged, yielding two new alarms, as shown in Figure 10.

From Figure 10, we can see that two new alarms, *Alarm*1 (*A*3, *B*6, *C*2, *D*3) and *Alarm*2 (*A*3, *B*10, *C*1, *D*5), are generated by the crossover. Comparing the attribute fields of the two alarms before and after the operation, we find that, apart from the attribute at the crossover position, whose value may change, the other attributes are merely exchanged between the two alarms.

**Figure 9.** Property fields for the two alarm records waiting for a crossover operation.



**Figure 10.** Two new alarm records after a crossover operation.
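The crossover step can be sketched as a single-point crossover on two bit strings. The example assumes a fixed width of 4 bits per attribute and a crossover point on the boundary between attributes B and C (bit offset 8); both choices are illustrative assumptions and do not reproduce the exact bit layout of Figure 9. The function name `single_point_crossover` is hypothetical.

```python
from typing import List, Tuple


def single_point_crossover(a: str, b: str, point: int) -> Tuple[str, str]:
    """Swap the tails of two equal-length bit strings starting at `point`."""
    if len(a) != len(b):
        raise ValueError("both parents must have the same length")
    if not 0 <= point <= len(a):
        raise ValueError("crossover point is out of range")
    return a[:point] + b[point:], b[:point] + a[point:]


def decode(bits: str, k: int) -> List[int]:
    """Read consecutive k-bit fields as unsigned integers."""
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]


K = 4  # assumed field width per attribute
alarm1 = "".join(format(v, f"0{K}b") for v in (3, 6, 1, 5))   # A3 B6 C1 D5
alarm2 = "".join(format(v, f"0{K}b") for v in (3, 10, 2, 3))  # A3 B10 C2 D3

child1, child2 = single_point_crossover(alarm1, alarm2, point=8)
print(decode(child1, K), decode(child2, K))
```

Under these assumptions the children decode to (A3, B6, C2, D3) and (A3, B10, C1, D5), the same pair of new alarms described in the text. If the crossover point fell inside an attribute field instead of on a boundary, that attribute could take a value present in neither parent.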

Having introduced the crossover operator in alarm clustering, we now introduce the mutation operator. Taking the alarm shown on the left of Figure 11 as an example, the alarm obtained after mutating one bit of the alarm attributes is shown on the right of Figure 11. Changing the sixth bit of the binary string from 1 to 0 changes the alarm from *Alarm*(*A*1, *B*5, *C*6) to *Alarm*(*A*1, *B*4, *C*6). The mutation operation thus changes only one attribute of the alarm, while the other attributes remain unchanged.

**Figure 11.** An alarm record before/after mutation operation.
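The mutation step amounts to flipping one bit of the encoded alarm. The sketch below assumes 3 bits per attribute, which is an illustrative assumption, not a reading of Figure 11; under it, flipping the sixth bit (index 5 when counting from zero) reproduces the change from (A1, B5, C6) to (A1, B4, C6). The function name `flip_bit` is hypothetical.

```python
from typing import List


def flip_bit(bits: str, index: int) -> str:
    """Return a copy of `bits` with the bit at zero-based `index` inverted."""
    if not 0 <= index < len(bits):
        raise ValueError("bit index is out of range")
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


def decode(bits: str, k: int) -> List[int]:
    """Read consecutive k-bit fields as unsigned integers."""
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]


K = 3  # assumed field width per attribute
alarm = "".join(format(v, f"0{K}b") for v in (1, 5, 6))  # A1 B5 C6

mutated = flip_bit(alarm, index=5)  # sixth bit, counting from one
print(alarm, mutated, decode(mutated, K))
```

Only the field containing the flipped bit changes, so a single mutation moves the search agent to a neighboring node in one attribute tree.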

#### *4.4. WOA-Based Alarm Hierarchical Clustering Process*

Having introduced the coding and decoding scheme, the fitness function, and the crossover and mutation operators of the WOA applied to hierarchical clustering, we now present the local and global versions of the WOA-based alarm hierarchical clustering process. For clarity, the local and global versions are called WOAHC-L and WOAHC-G, respectively. Compared with the traditional alarm hierarchical clustering algorithm, which can only generate one random generalized alarm at a time, the WOA uses multiple search agents to search the solution space of generalized alarms simultaneously, which can improve efficiency and explore more candidate solutions. The random agent selection stage of the WOA increases the chance of escaping a local optimum and finding a better solution. The crossover and mutation operators help the WOA handle various types of data and strengthen both local and global search.

We first describe the flow of WOAHC-L. The algorithm first initializes several search agents, each representing a cluster center of the alarms. Then, the fitness value of each search agent is calculated according to the fitness function. The cluster center represented by the search agent with the best fitness value is the current optimal solution. After that, following the WOA's search mechanism, the remaining search agents explore the solution space around the optimal solution through the exploitation phase and the exploration phase, and the optimal solution is updated whenever a better solution is found. After several iterations, the cluster center represented by the search agent with the optimal fitness value is output, and the alarms belonging to that cluster are added to the cluster and removed from the original alarm set. Each execution of the algorithm therefore outputs one cluster center and deletes the alarms belonging to that cluster from the original alarm set. When, after several executions, the remaining alarms no longer meet the clustering rules, the algorithm terminates. The alarm clustering process based on WOAHC-L is shown in Figure 12.

The algorithm pseudocode of WOAHC-L is shown in Algorithm 2.


**Figure 12.** Alarm reduction framework flow chart based on WOAHC-L.
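The outer loop of WOAHC-L described above (find one cluster center, remove its alarms, repeat) can be sketched as follows. This is a structural sketch only: the WOA search itself is replaced by a hypothetical stand-in, `most_common_source`, which simply generalizes on the most frequent first attribute, and the stopping rule is assumed to be that fewer than `min_cluster_size` alarms remain or are covered. Alarms are modeled as two-field tuples, and `None` in a cluster center acts as a wildcard; none of these names come from the paper.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Alarm = Tuple[str, int]
Center = Tuple[Optional[str], Optional[int]]


def covers(center: Center, alarm: Alarm) -> bool:
    """A None field is a wildcard; every other field must match exactly."""
    return all(c is None or c == a for c, a in zip(center, alarm))


def most_common_source(alarms: Sequence[Alarm]) -> Center:
    """Stand-in for the WOA search: pick the most frequent first attribute."""
    value, _ = Counter(a[0] for a in alarms).most_common(1)[0]
    return (value, None)


def woahc_l_outer_loop(
    alarms: Sequence[Alarm],
    find_best_center: Callable[[Sequence[Alarm]], Center],
    min_cluster_size: int,
) -> Tuple[List[Tuple[Center, List[Alarm]]], List[Alarm]]:
    """Repeatedly extract one cluster and remove its alarms from the set."""
    if min_cluster_size < 1:
        raise ValueError("min_cluster_size must be at least 1")
    remaining = list(alarms)
    clusters: List[Tuple[Center, List[Alarm]]] = []
    while len(remaining) >= min_cluster_size:
        center = find_best_center(remaining)
        members = [a for a in remaining if covers(center, a)]
        if len(members) < min_cluster_size:
            break  # the best center found no longer meets the clustering rule
        clusters.append((center, members))
        remaining = [a for a in remaining if not covers(center, a)]
    return clusters, remaining


alarms = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("c", 1)]
clusters, leftover = woahc_l_outer_loop(alarms, most_common_source, 2)
print([c for c, _ in clusters], leftover)
```

Because each pass removes at least `min_cluster_size` alarms or stops, the loop always terminates. In an actual WOAHC-L run, `find_best_center` would be the WOA iteration with the fitness function of Section 4.2.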

Thanks to its local and global search capabilities and its group search mechanism, the WOAHC-L algorithm can find good alarm cluster centers. However, the cluster centers obtained by executing WOAHC-L several times may overlap to a high degree, so that alarms that originally belong to the same cluster may be divided into multiple clusters. In other words, WOAHC-L focuses only on generating single cluster centers and does not consider the overlap between the cluster centers finally obtained. To solve this problem, we propose a global version of WOA hierarchical clustering, namely WOAHC-G, which uses the second encoding scheme of the search agent described in Section 4.1.

The process of WOAHC-G can be described as follows. The algorithm initializes several search agents, each of which is composed of *N* cluster centers and represents a complete clustering result. A fitness value is calculated for each search agent according to the fitness function. The search agent with the best fitness value is the optimal search agent, and the cluster centers represented by that agent form the final cluster center set. Based on the WOA exploitation phase and exploration phase, the cluster center set represented by the search agent with the optimal fitness value is obtained after several iterations. WOAHC-G differs from WOAHC-L in that WOAHC-G needs to be executed only once to obtain all cluster centers, while WOAHC-L needs to be executed several times, until the number of remaining alarms is insufficient for another execution. In addition, WOAHC-G considers the coincidence degree between different cluster centers and takes it as an important indicator of the fitness value. Therefore, the fitness function of the algorithm needs to be extended with the coincidence degree of cluster centers, as shown in Equation (12).

$$O\left(C_i, C_j\right) = \begin{cases} 1, & C_i \cap C_j = \varnothing \\ 0, & C_i \cap C_j \neq \varnothing \end{cases} \tag{12}$$

where *Ci*,*Cj* represent two cluster centers. If each attribute of the two cluster centers has no intersection, the coincidence degree is considered to be 0; otherwise, it is 1.

After the calculation function of clustering coincidence degree is given, the evaluation equation of clustering coincidence degree is given as Equation (13).

$$ES_o = \frac{2}{k(k-1)} \sum_{1 \le i < j \le k} O\left(C_i, C_j\right) \tag{13}$$

where *k* represents the number of cluster centers. With *k* cluster centers, *k*(*k*−1)/2 pairwise comparisons are needed to calculate the coincidence degree between them. Therefore, the coefficient in the formula for *ESo* is set to 2/(*k*(*k*−1)) to ensure that the value of *ESo* lies within the interval [0, 1].
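Equations (12) and (13) can be illustrated with a short Python sketch. It follows the values printed in Equation (12), scoring 1 for a pair of cluster centers with an empty intersection and 0 otherwise. As a simplifying assumption, each cluster center is represented by the set of alarm identifiers it covers, so that the intersection is an ordinary set intersection; the function names `pair_score` and `overlap_evaluation` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Sequence


def pair_score(ci: FrozenSet[int], cj: FrozenSet[int]) -> int:
    """Equation (12) as printed: 1 if the two centers share nothing, else 0."""
    return 1 if ci.isdisjoint(cj) else 0


def overlap_evaluation(centers: Sequence[FrozenSet[int]]) -> float:
    """Equation (13): pair scores averaged over all k(k-1)/2 pairs."""
    k = len(centers)
    if k < 2:
        raise ValueError("at least two cluster centers are required")
    total = sum(pair_score(a, b) for a, b in combinations(centers, 2))
    return 2 * total / (k * (k - 1))


# Three hypothetical cluster centers, given by the alarm ids they cover.
centers = [frozenset({1, 2, 3}), frozenset({4, 5}), frozenset({5, 6})]
es_o = overlap_evaluation(centers)
print(round(es_o, 3))
```

Of the three pairs, two are disjoint and one shares alarm 5, so the sum is 2 and the normalized value is 2/3, which lies in [0, 1] as required.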

The alarm clustering process based on WOAHC-G is shown in Figure 13.

**Figure 13.** Alarm reduction framework flow chart based on WOAHC-G.

The algorithm pseudocode of WOAHC-G is shown in Algorithm 3.


#### **5. Experiments and Results**

#### *5.1. Experiment Data Set*

In this section, we describe the data set used in the experiments. We use UNSW-NB15 as the experimental data set [8]. The UNSW-NB15 data set was generated with the IXIA PerfectStorm tool, which produces a mix of real, contemporary normal traffic and attack behavior. The Tcpdump tool was used to capture 100 GB of raw traffic as PCAP files, covering nine types of attacks: DoS, Shellcode, Worms, Fuzzers, Backdoors, Exploits, Analysis, Generic, and Reconnaissance. In addition, 12 algorithms were used to generate 49 features together with the class label. Table 2 shows a set of features in UNSW-NB15, along with the corresponding groups and data types.


**Table 2.** UNSW-NB15 Features with their data type and category.

The UNSW-NB15 data set has a total of 2,540,044 records, stored in four files. To facilitate experiments, the data set also provides a training set and a test set from which missing values have been removed, with 175,341 and 82,332 records, respectively. As can be seen from Table 2, the 49 fields in the data set fall into different categories, such as Flow, Basic, Content, and Time, and each attribute is either discrete or continuous.
