**4. Results**

In our experiments, each hash function was evaluated in 16-bit and 32-bit versions, which organize the flow cache into 2<sup>16</sup> and 2<sup>32</sup> buckets, respectively. Every probe was supplied with three types of traffic from the CICIDS 2017 dataset, labeled *Normal (Monday)*, *Normal + attacks (Wednesday)*, and *Normal + attacks (Friday)* (see Section 3.2). For each traffic type, the number of flows it contains is denoted *N*.

The results for hardware-accelerated probes using 16-bit hash functions are presented in Table 6. For every traffic type supplied to the probes, all probe metrics proposed in Section 3.3 were recorded. All hash functions except Mod32 yielded similar statistics over flow cache buckets, whereas Mod32 achieved noticeably worse Max, Mean, and SD values. The statistics for all functions were, however, not much affected by anomalous traffic (DoS attacks, botnet communication, port scan attacks); see the results for the Wednesday and Friday traffic. The *Normal + attacks (Friday)* traffic generated larger values of the recorded parameters for all functions than the other two traffic types, but this can be explained by the fact that it contains many more flows. The distribution of flow records over the flow cache buckets for the simple modulo hash function (Mod32), modified Vermont, and the cryptographic SHA-3 function is shown in Figure 12. The distribution produced by the simple modulo hash function is far from uniform, while the modified Vermont hash function and the one based on SHA-3 offer much better distributions.
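As an illustration of how such per-bucket statistics can be obtained, the following minimal Python sketch counts flow records per bucket and summarizes the counts. The function name `bucket_stats` and the choice to compute the statistics over nonempty buckets only are assumptions made here (the latter is consistent with the mean of 1 reported for 2<sup>32</sup> buckets); the exact metric definitions are those of Section 3.3.

```python
import statistics
from collections import Counter


def bucket_stats(bucket_indices):
    """Summarize bucket occupation for a non-empty list of bucket indices.

    Assumption: statistics are taken over nonempty buckets only.
    """
    counts = Counter(bucket_indices)    # bucket index -> number of flow records
    occupancy = list(counts.values())   # only nonempty buckets appear here
    return {
        "min": min(occupancy),
        "max": max(occupancy),
        "mean": statistics.fmean(occupancy),
        "sd": statistics.pstdev(occupancy),  # population standard deviation
        "buckets": len(occupancy),           # number of nonempty buckets
    }


# Toy input: six flow records that landed in three buckets (3, 2, and 1 records).
stats = bucket_stats([7, 7, 7, 42, 42, 1000])
```

For the toy input, the occupation counts are 3, 2, and 1, so the sketch reports a maximum of 3, a mean of 2, and three nonempty buckets.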


**Table 6.** Statistics of bucket occupation for various hash functions—2<sup>16</sup> buckets used.

A more precise overview is given in Table 7, which presents the results for the 32-bit versions of the hash functions. Such a hash size greatly increases the flow cache capacity (up to 2<sup>32</sup> buckets). In this case, in addition to the metrics used in Table 6, the number of nonempty buckets is also given (*Buckets* column). Again, all hash functions except Mod32 showed a similar distribution over flow cache buckets, which was not affected by typical anomalous traffic. The Mod32 results deviate significantly from those obtained for the other hash functions. It is worth noting that for Vermont, modified Vermont, and the two SHA hash functions, the mean number of flow records in a bucket was 1, and the number of nonempty buckets was almost equal to the number of flows present in the traffic. This indicates that these functions put almost every flow record in a separate bucket, offering an almost uniform distribution of flow records over flow cache buckets for both normal traffic and typical anomalous traffic.
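As a rough plausibility check of this near equality, consider an idealized hash that places each of *N* flows independently and uniformly into *m* buckets (an idealization introduced here, not a property measured in the experiments). The expected number of nonempty buckets *B* is then

$$
\mathbb{E}[B] = m\left(1 - \left(1 - \frac{1}{m}\right)^{N}\right) \approx N - \frac{N(N-1)}{2m} \qquad (N \ll m).
$$

Even if the 2,830,540 flows of the complete CICIDS 2017 dataset (an upper bound for any single traffic type) were inserted at once into m = 2<sup>32</sup> buckets, the expected shortfall would be small:

$$
N - \mathbb{E}[B] \approx \frac{(2.83 \times 10^{6})^{2}}{2 \cdot 2^{32}} \approx 9.3 \times 10^{2},
$$

that is, about 0.03% of *N*. Under this idealization, a nonempty bucket count almost equal to the number of flows is what a well-behaved 32-bit hash function would be expected to produce.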

**Figure 12.** Visualization of bucket occupation for 2<sup>16</sup> buckets. (**a**) Mod32; (**b**) modified Vermont; (**c**) SHA-3.



**Table 7.** Statistics of bucket occupation for various hash functions—2<sup>32</sup> buckets used.

#### **5. Discussion and Conclusions**


A proper view of the statistics and the dynamics of a network is of great importance, since it enables us to detect network attacks. Network monitors using the network flow concept are therefore an important part of modern cybersecurity defense, and as such, these devices may themselves be the targets of cyberattacks. One of the possible weak points of NetFlow probes is the network flow cache, which is usually implemented as a hash table. Due to the limited size of a hash table, it is inevitable that, sooner or later, two different flows will be mapped to the same hash bucket. It is essential that the hash function used for calculating the hash keys offers a uniform distribution of NetFlow records over the available buckets, so that the lengths of all bucket lists are almost equal. Minimal list lengths keep the flow lookup fast and make it possible to use a reasonably sized hash data structure. The experiments conducted during this research show that even a relatively simple hash function may offer such characteristics.
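The relationship between bucket list length and lookup cost can be sketched as follows. This is a minimal software model of a chained hash table, not the hardware flow cache of the probe; the names `FlowRecord` and `FlowCache`, the 5-tuple key layout, and the constant hash function used to force a collision are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    key: tuple          # e.g. (src_ip, dst_ip, src_port, dst_port, proto)
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Toy flow cache: a hash table whose buckets are lists of flow records."""

    def __init__(self, hash_bits, hash_fn):
        self.mask = (1 << hash_bits) - 1
        self.hash_fn = hash_fn
        self.buckets = {}   # sparse storage: bucket index -> list of records

    def update(self, key, length):
        index = self.hash_fn(key) & self.mask
        chain = self.buckets.setdefault(index, [])
        for record in chain:            # linear scan: cost grows with list length
            if record.key == key:
                break
        else:                           # flow not seen before: append a new record
            record = FlowRecord(key)
            chain.append(record)
        record.packets += 1
        record.octets += length
        return len(chain)               # current length of this bucket list


# A constant hash function (hypothetical worst case) maps every flow to bucket 7.
cache = FlowCache(hash_bits=16, hash_fn=lambda key: 7)
flow_a = (0x0A000001, 0x0A000002, 1234, 80, 6)
flow_b = (0x0A000003, 0x0A000002, 5678, 80, 6)
cache.update(flow_a, 60)
cache.update(flow_a, 1500)
chain_length = cache.update(flow_b, 60)
```

In this model, every packet of a flow pays for a scan of its bucket list, which is why roughly equal and short lists matter for lookup speed.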


However, nowadays, when components of cybersecurity systems may themselves be targets of a cyberattack, a no less important feature of such systems is their resistance to attacks. In the case of a NetFlow probe, it should be impossible for an attacker to create directed collisions in the hash function. If an attacker is able to fabricate network traffic in such a way as to lead to a large number of collisions in the hash function, some buckets of the hash table may overflow, causing malfunction of the probe.
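To make the notion of directed collisions concrete, the sketch below attacks a deliberately weak hash. The function `toy_mod_hash` is a hypothetical additive modulo hash invented for this illustration; it is not the Mod32 function evaluated in Section 4, and the addresses and ports are arbitrary. Because the hash is linear in its inputs, an attacker who spoofs the source address can always solve for a source port that steers the flow into a chosen bucket.

```python
NUM_BUCKETS = 1 << 16


def toy_mod_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Hypothetical additive modulo hash; NOT the Mod32 function from the paper."""
    return (src_ip + dst_ip + src_port + dst_port + proto) % NUM_BUCKETS


def craft_colliding_flows(target_bucket, dst_ip, dst_port, proto, count):
    """Build `count` distinct flow keys that all hash to `target_bucket`."""
    flows = []
    for i in range(count):
        src_ip = 0xC0A80000 + i                       # spoofed source address
        partial = toy_mod_hash(src_ip, dst_ip, 0, dst_port, proto)
        src_port = (target_bucket - partial) % NUM_BUCKETS  # cancels the rest
        flows.append((src_ip, dst_ip, src_port, dst_port, proto))
    return flows


flows = craft_colliding_flows(
    target_bucket=4242, dst_ip=0x0A000001, dst_port=443, proto=6, count=1000
)
buckets_hit = {toy_mod_hash(*flow) for flow in flows}
```

All 1000 distinct flows end up in bucket 4242, so a single bucket list grows without bound while the rest of the cache stays empty.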

The results from Section 4 show that only very simple hash functions (such as Mod32) are susceptible to common malicious traffic, such as DDoS or port scan attacks. More complex methods, such as Vermont, which is based on CRC32, offer a relatively uniform distribution of flow records over flow cache buckets for both normal traffic and typical anomalous traffic. However, as demonstrated in [19], it is possible to prepare a targeted attack exploiting a vulnerability of the implemented hash function.

Thus, it is crucial to select a hash function that maps only a small number of flow keys onto the same flow cache location. A hash function should therefore compute uniformly distributed hash keys, and it should be impossible for an attacker to create directed collisions. At the same time, the hash function must be fast so that it does not become a bottleneck of the NetFlow probe.

The obvious countermeasure against hash collision-based attacks is the application of cryptographic hash functions, for which collisions cannot be created easily. The results presented in Section 4 show that the cryptographic functions SHA-1 and SHA-3 offer a distribution of flows in the flow cache comparable to that of the dedicated methods (Vermont, modified Vermont) used as reference. The advantage of implementing a hash function based on cryptographic functions in a NetFlow probe is that it is very difficult (or even impossible) to prepare a targeted attack on such a probe, that is, to overflow flow cache buckets by systematically creating packets that lead to hash collisions.
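A software sketch of deriving a bucket index from a cryptographic hash is given below. It is only an illustration: the SHA-3 variant (`sha3_256`), the packing of the 5-tuple, and the truncation to the leading bits of the digest are assumptions made here and may differ from the hardware implementation described in Section 2.3.

```python
import hashlib
import struct


def sha3_bucket_index(src_ip, dst_ip, src_port, dst_port, proto, hash_bits=16):
    """Map an IPv4 5-tuple to a bucket index using a truncated SHA-3 digest.

    Assumptions: SHA3-256, network byte order packing, leading-bit truncation.
    """
    if not 1 <= hash_bits <= 32:
        raise ValueError("hash_bits must be between 1 and 32")
    key = struct.pack("!IIHHB", src_ip, dst_ip, src_port, dst_port, proto)
    digest = hashlib.sha3_256(key).digest()
    # Keep the top `hash_bits` bits of the first 32 bits of the digest.
    return int.from_bytes(digest[:4], "big") >> (32 - hash_bits)


# Flows that differ only in the source port, similar to a scan-like pattern.
flows = [(0xC0A80001, 0x0A000001, port, 443, 6) for port in range(1000)]
indices = [sha3_bucket_index(*flow) for flow in flows]
distinct_buckets = len(set(indices))
```

Unlike a linear hash, the digest gives an attacker no simple relation between header fields and bucket index, so highly regular inputs such as sequential ports are expected to spread over many buckets.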

Cryptographic functions, however, have not usually been candidates for hash functions in NetFlow probes, since they are considered computationally too expensive for efficient use in flow monitoring. Our concept presented in Section 2.3 shows that it is possible to implement a hardware-accelerated network flow probe that employs a cryptographic hash function and offers sufficient performance to work in real time with multigigabit traffic, even when flooded with the smallest IP packets. The relatively low hardware resource utilization makes it possible to reach a 100 Gbit/s bandwidth limit by applying hardware-specific design optimization and parallelization.
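To put the smallest-packet case into numbers, assume Ethernet links carrying minimum-size 64-byte frames (an assumption made for this estimate). On the wire, each such frame occupies 64 bytes plus 8 bytes of preamble and 12 bytes of interframe gap, that is, 84 bytes or 672 bits. A link with bit rate *R* therefore delivers at most

$$
P_{\max} = \frac{R}{672\ \text{bit}}, \qquad
P_{\max}(10\ \text{Gbit/s}) \approx 14.88 \times 10^{6}\ \text{packets/s}, \qquad
P_{\max}(100\ \text{Gbit/s}) \approx 148.8 \times 10^{6}\ \text{packets/s}.
$$

Under these assumptions, a probe working at 100 Gbit/s has to produce one hash value about every 6.72 ns on average, which illustrates why pipelining and parallelization of the hash computation are needed.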

It has to be emphasized that most available traffic datasets contain traffic with a relatively small number of flows. The CICIDS 2017 dataset used in our experiments contains 2,830,540 flows in total. Since the flow cache of a probe that uses a 32-bit hash function contains 2<sup>32</sup> buckets, the flow records fill only a small fraction of the flow cache. The use of datasets with significantly larger numbers of flows with normal and anomalous traffic might give a better view of possible differences in the distribution of flow records over flow cache buckets for the evaluated hash functions. Such an approach, and the application of customized traffic containing flows intentionally constructed to produce hash collisions (which may not be a trivial task for some hash functions), could be the subject of future work.
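The size of this fraction follows directly from the figures above. With N = 2,830,540 flows and m = 2<sup>32</sup> buckets, and assuming for simplicity that all flows of the dataset are present in the cache at once,

$$
\frac{N}{m} = \frac{2{,}830{,}540}{4{,}294{,}967{,}296} \approx 6.6 \times 10^{-4},
$$

so at most about 0.066% of the buckets can be occupied. For comparison, the same number of flows spread over 2<sup>16</sup> = 65,536 buckets would give about 43 flow records per bucket on average.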

To conclude, we can state that the resistance of cryptographic hash functions to collisions and the multigigabit efficiency of a hardware-accelerated implementation of hash computation allow the creation of an effective monitoring solution for modern cybersecurity systems while delivering a high level of resilience to targeted attacks.

**Author Contributions:** Conceptualization, M.K., M.R. and A.J.; methodology, M.K. and M.R.; software, M.K. and P.S.; validation, M.K. and P.S.; formal analysis, M.K. and M.R.; investigation, M.K., P.S., M.R. and A.J.; resources, M.K., P.S. and M.R.; data curation, P.S.; writing—original draft preparation, M.K., P.S., M.R. and A.J.; writing—review and editing, M.K., M.R. and A.J.; visualization, M.K.; supervision, M.R. and A.J.; project administration, A.J.; funding acquisition, A.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** The study has been supported by the SIMARGL Project (Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware), with the support of the European Commission and the Horizon 2020 Program, under grant agreement number 833042. The publication was funded by the statutory activity subsidy from the Polish Ministry of Education and Science.

**Conflicts of Interest:** The authors declare no conflict of interest.

