**1. Introduction**

At present, society is witnessing an unparalleled pace of technological development and global expansion of the Internet. An increasing number of ventures rely on network connectivity, both in the public sector and in business. Entities connected to the Internet range from those used for leisure purposes to elements of critical infrastructure, such as industrial process control or transportation management systems. In the background, a new technology paradigm known as the Internet of Things (IoT) is evolving, which consists of objects that collect, process, and exchange data via diverse networks, often operating without direct human supervision [1]. This automation is one of the reasons why people are already surrounded by massive numbers of IoT devices; it is estimated that about 75 billion IoT devices will be connected to the network by 2025 [2].

In parallel, computer networks enable criminal activities known as cybercrimes [3], and new types of cybercrime are constantly being developed [4]. Some methods that were previously associated only with the mafia are now a threat in the virtual world. These include extortion using distributed denial of service (DDoS) attacks or ransomware, i.e., software that encrypts user data for ransom. According to the NETSCOUT Threat Intelligence Report [5], 9.7 million DDoS attacks were encountered in 2021. As Cybersecurity Ventures estimates [6], global cybercrime costs will grow yearly by 15%, reaching 10.5 trillion US dollars annually by 2025. Even though general awareness of various cybersecurity threats is increasing, as is the overall level of safety, a constant effort to improve countermeasures is required. The growing number of targets, new attack vectors, and the fact that malware constantly evolves do not make this an easy task. It is estimated that over 450,000 new malicious programs and potentially unwanted applications (PUA) are registered every day [7].

**Citation:** Korona, M.; Szumełda, P.; Rawski, M.; Janicki, A. Comparison of Hash Functions for Network Traffic Acquisition Using a Hardware-Accelerated Probe. *Electronics* **2022**, *11*, 1688. https://doi.org/10.3390/electronics11111688

Academic Editor: Taeshik Shon

Received: 29 April 2022 Accepted: 23 May 2022 Published: 25 May 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In response to numerous network threats, various cybersecurity methods have been proposed. The first safeguards of a network are firewalls and intrusion detection/prevention systems (ID/PS), whose task is to analyze incoming traffic and intercept packets when a malicious signature is detected. Collecting IP traffic information for network monitoring is a common practice of network operators and researchers. To build a coarse-grained understanding of network traffic, the concept of network flows is used, in which traffic statistics are recorded in the form of flow records. Each record contains important information about a flow, such as its source and destination Internet Protocol (IP) addresses, start and end timestamps, types of service, and application ports, along with the volume of packets or bytes. IP packets are assigned to flows based on their characteristics, such as source or destination address, protocol type carried, and protocol port numbers (for TCP and UDP), which are referred to as *flow keys*. As a result of the analysis procedure, which often incorporates the most cutting-edge approaches, including machine learning [8–10], disallowed flows can be eliminated.

Flow-based network monitoring is today the most widespread technology, and NetFlow [11–13] is a widely used tool in network measurement and analysis. It is now gradually evolving into one of the most important means of ensuring network cybersecurity.

The performance of NetFlow monitoring tools has been identified as a crucial factor in network security, since it allows countermeasures to be applied immediately. It has been widely addressed, including the possibility of hardware acceleration [14–18]. However, it is important to note that the monitoring device itself can also be a target of a specialized cyberattack [19], especially when the assailant has appropriate knowledge and is willing to spend their resources and time on initial reconnaissance. *Crossfire* [20] is an example of such a sophisticated attack (in comparison to the brute-force DDoS attack), tailored to a targeted enterprise, that can isolate a target area by flooding carefully selected network links.

NetFlow-like tools face great challenges when both the speed and complexity of the network traffic increase. To keep up with the multigigabit speed of network traffic, especially on high-bandwidth backbone links, *NetFlow probes* incorporate advanced techniques to efficiently store and manipulate flow records [21]. A fast local memory inside the probe, known as *flow cache*, is used to store the active flows. The flow cache is organized in a data structure called a *flow table*, which consists of a list of flow records, one for each active flow.

Efficiently processing incoming packets and accessing the gathered flow data based on the flow key of the current packet often requires sophisticated data structures that vastly reduce computational complexity. Hash-based data structures are commonly proposed for this purpose as a solution allowing high-speed packet processing. Such data structures are usually coupled with a hashing function that maps a flow key to a flow cache location. Unfortunately, applying a perfect hashing function that maps each flow key to a distinct flow cache location is not possible in practice. Thus, it is crucial to select a hashing function that maps only a small number of flow keys onto the same flow cache location, a so-called *hash bucket*. If the number of collisions is sufficiently small, then hash tables work quite well and give *O*(1) search times. To ensure optimal utilization of the hash table and reduce the vulnerability of a NetFlow probe to cyberattacks, the hash function needs to be carefully chosen. If it is not, malicious traffic may be able to create collisions that degenerate the hash table into linked lists with worst-case lookup times of *O*(*n*) and greatly reduce the performance of the flow cache modules.
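The effect of crafted collisions on lookup cost can be illustrated with a minimal Python sketch. It is not taken from any probe implementation: the identity hash, the bucket count, and the key sets are all hypothetical and chosen only because they make collisions trivial to construct.

```python
from collections import defaultdict

NUM_BUCKETS = 1024  # hypothetical table size


def toy_hash(key: int) -> int:
    """Deliberately weak hash (identity), used only for illustration."""
    return key


def build_table(keys, hash_fn, num_buckets):
    """Place every key in the bucket selected by hash_fn (separate chaining)."""
    table = defaultdict(list)
    for key in keys:
        table[hash_fn(key) % num_buckets].append(key)
    return table


def longest_chain(table):
    """Worst-case number of key comparisons needed for a single lookup."""
    return max(len(chain) for chain in table.values())


# Benign keys spread evenly; crafted keys are all multiples of the table size,
# so every one of them lands in bucket 0.
benign_keys = list(range(4096))
crafted_keys = [i * NUM_BUCKETS for i in range(4096)]

benign_worst = longest_chain(build_table(benign_keys, toy_hash, NUM_BUCKETS))
crafted_worst = longest_chain(build_table(crafted_keys, toy_hash, NUM_BUCKETS))
print(benign_worst, crafted_worst)
```

With the benign keys each bucket holds only a handful of entries, whereas the crafted keys turn the table into a single chain whose length equals the number of inserted flows.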

In [19], the authors evaluated the resilience of hash functions used in the software-based NetFlow probes nProbe and Vermont. Theoretical analysis and real attacks proposed by the authors show how easily flow monitors can be overloaded if the hash algorithm has not been carefully chosen. The paper also presents a hash function that seems to offer protection against hash collision attacks and computes fast enough to be deployed in high-speed flow meters.

The obvious countermeasure against hash collision-based attacks (hash flooding or HashDoS) is a hash function for which collisions cannot easily be created. Cryptographic hash functions would provide such a feature; however, they are computationally expensive, which makes them difficult to use efficiently in NetFlow probes. The implementation of such network monitoring elements with rigorous throughput requirements may be challenging. Hardware acceleration of their crucial functions can be an aid here. Still, to the best of our knowledge, there is a lack of publications discussing hardware-accelerated network probes for network traffic analysis with dedicated hash functions that would be resilient to targeted attacks.

Our article aims to fill this gap. In this work, we propose a hardware-accelerated network probe that accelerates the extraction of network packet characteristics and the calculation of the hash identifier. In addition, we describe the application of the cryptographic hash functions SHA-1 and SHA-3 to map a flow key to a flow cache location. The efficiency of our approach is compared with the solutions discussed in [19].

Our article is organized as follows: First, in Section 2 we present the concept of a hardware-accelerated network probe and review different hashing algorithms. Next, in Section 3 we describe the experiments conducted. Their results are presented in Section 4, followed by discussion and conclusions in Section 5.

#### **2. Materials and Methods**

In this section, we outline the concept of a hardware-accelerated network probe (Section 2.1). Different hash algorithms that can produce hash table keys are discussed in Section 2.2. Details of hardware implementations and functional verification of the design are described in Section 2.3.

#### *2.1. Hardware-Accelerated Network Probe*

A network probe is a tool which acquires parameters from network traffic for traffic-analysis purposes. In this work, we used a hardware-accelerated version of the software network probe proposed in [22], which is also briefly presented here. The block diagram of the probe is presented in Figure 1. The network probe processes the traffic data in the following steps:


**Figure 1.** Block diagram of hardware-accelerated network probe.

The traffic captured from a network interface is analyzed, and then a network flow record is created or an existing one is updated in the flow cache. Packet headers are analyzed at the second, third, and fourth layers of the ISO/OSI Reference Model. Assignment of new packets to flows is based on a hash function of the header parameters, which is calculated using the IP source address, the IP destination address, the source port number, the destination port number, and information on the transport layer protocol.

Considering the transport layer protocols, the conditions for classifying the stream as ended are RST or FIN flags in the case of TCP, and reaching a predefined inactivity time in the case of UDP. The flows considered as ended are statistically analyzed and their parameters extracted, as described in the next section. Expired flows are dumped to a file.

Captured packets are processed starting with the second ISO/OSI layer. From the data link layer, information about the timestamp and the packet length is fetched. The *Ether\_type* field contains information about the higher-layer protocol used, which is, in the network probe's case, IPv4. After receiving the IP header, it is possible to decode the source and destination IP addresses, along with the transport layer protocol. Knowing the values of the headers of transport layer protocols, it is possible to decode the recipient's port, and the TCP flags, if applicable.
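The layer-by-layer decoding described above can be sketched in Python as follows. This is an illustrative software model only, not the hardware design: the function name, the returned dictionary layout, and the hand-built test frame are hypothetical, while the field offsets follow the standard Ethernet II, IPv4, TCP, and UDP header layouts.

```python
import socket
import struct

ETHERTYPE_IPV4 = 0x0800
PROTO_TCP, PROTO_UDP = 6, 17
ETH_HEADER_LEN = 14


def parse_headers(frame: bytes):
    """Extract flow-key fields from an Ethernet/IPv4/TCP-or-UDP frame.

    Returns None for frames this sketch does not handle.
    """
    if len(frame) < ETH_HEADER_LEN + 20:
        return None
    (ether_type,) = struct.unpack_from("!H", frame, 12)
    if ether_type != ETHERTYPE_IPV4:
        return None

    ip_off = ETH_HEADER_LEN
    ihl = (frame[ip_off] & 0x0F) * 4           # IPv4 header length in bytes
    protocol = frame[ip_off + 9]
    src_ip, dst_ip = struct.unpack_from("!4s4s", frame, ip_off + 12)
    if protocol not in (PROTO_TCP, PROTO_UDP):
        return None

    l4_off = ip_off + ihl
    needed = 14 if protocol == PROTO_TCP else 4  # TCP flags live at offset 13
    if len(frame) < l4_off + needed:
        return None
    src_port, dst_port = struct.unpack_from("!HH", frame, l4_off)
    tcp_flags = frame[l4_off + 13] if protocol == PROTO_TCP else None

    return {
        "length": len(frame),
        "src_ip": socket.inet_ntoa(src_ip),
        "dst_ip": socket.inet_ntoa(dst_ip),
        "protocol": protocol,
        "src_port": src_port,
        "dst_port": dst_port,
        "tcp_flags": tcp_flags,
    }


# Hand-built UDP frame (zeroed MAC addresses and checksums) for demonstration.
eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, PROTO_UDP, 0,
                 socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
udp = struct.pack("!HHHH", 5353, 53, 8, 0)
fields = parse_headers(eth + ip + udp)
print(fields)
```

The timestamp is omitted here because in a real capture it comes from the capture interface rather than from the frame bytes.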

Current flows are stored in flow caches organized in *buckets*. For every incoming packet, a hash of the flow key is calculated and then checked against the existing flow keys in the appropriate bucket. If the hash does not exist, a new flow record is created in the given bucket, with parameters such as: source and destination IP addresses, source and destination port numbers, first packet timestamp, and transport layer protocol. If the hash already exists, the existing flow is updated. The packet count value is incremented, TCP flags are updated (if applicable), and a new timestamp and the packet size are added to the list.
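A minimal software model of this create-or-update step might look like the sketch below. The class and field names are hypothetical, Python's built-in `hash` merely stands in for the hash functions discussed later, and the bucket count is arbitrary; the sketch compares full flow keys inside the selected bucket so that colliding flows are kept apart.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRecord:
    key: tuple                  # (src_ip, dst_ip, src_port, dst_port, protocol)
    first_ts: float             # timestamp of the first packet
    packet_count: int = 0
    tcp_flags: int = 0          # OR of all TCP flags seen so far
    timestamps: list = field(default_factory=list)
    sizes: list = field(default_factory=list)


class FlowCache:
    def __init__(self, num_buckets, hash_fn):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def update(self, key, ts, size, tcp_flags=0):
        """Create the flow record if it is new, then account for the packet."""
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for record in bucket:
            if record.key == key:
                break
        else:
            record = FlowRecord(key=key, first_ts=ts)
            bucket.append(record)
        record.packet_count += 1
        record.tcp_flags |= tcp_flags
        record.timestamps.append(ts)
        record.sizes.append(size)
        return record


cache = FlowCache(num_buckets=256, hash_fn=hash)
flow_key = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)
cache.update(flow_key, ts=0.0, size=60, tcp_flags=0x02)        # SYN
rec = cache.update(flow_key, ts=0.1, size=1500, tcp_flags=0x10)  # ACK
print(rec.packet_count, hex(rec.tcp_flags))
```

Both packets end up in the same record, so the cache holds one flow with two timestamps and two packet sizes.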

In the case of the TCP protocol, the appearance of a FIN or RST flag means the end of the flow. Then, some of the flow's parameters are updated. Furthermore, the flow is moved from the active flows map to the expired flows list. Post-processing of the parameters consists of converting the source and destination IP addresses to ASCII format, marking the last timestamp, and calculating the flow's duration, total byte count, and statistical parameters.

UDP flows are periodically checked by the application thread, which iterates through the active flows cache. The arrival time of the last packet in a flow is compared to the arrival time of the last packet on the network adapter, and if the difference exceeds a predefined value (set in our case to 10 s), the flow is moved from the current flows cache to the expired flows list.
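The two termination rules can be summarized in a short Python sketch. The 10 s timeout and the FIN/RST condition come from the text; the dictionary-based flow representation and all function and variable names are hypothetical.

```python
TCP_FIN, TCP_RST = 0x01, 0x04
PROTO_TCP, PROTO_UDP = 6, 17
UDP_TIMEOUT_S = 10.0  # inactivity timeout stated in the text


def tcp_flow_ended(tcp_flags: int) -> bool:
    """A TCP flow ends as soon as a FIN or RST flag has been seen."""
    return bool(tcp_flags & (TCP_FIN | TCP_RST))


def expire_idle_udp(active: dict, expired: list, adapter_last_ts: float) -> None:
    """Move UDP flows idle for longer than the timeout to the expired list.

    adapter_last_ts is the arrival time of the most recent packet seen on
    the network adapter, used as the reference clock.
    """
    idle_keys = [
        key for key, flow in active.items()
        if flow["protocol"] == PROTO_UDP
        and adapter_last_ts - flow["last_ts"] > UDP_TIMEOUT_S
    ]
    for key in idle_keys:
        expired.append(active.pop(key))


active_flows = {
    "udp-old": {"protocol": PROTO_UDP, "last_ts": 5.0},
    "udp-new": {"protocol": PROTO_UDP, "last_ts": 15.0},
    "tcp": {"protocol": PROTO_TCP, "last_ts": 0.0},
}
expired_flows = []
expire_idle_udp(active_flows, expired_flows, adapter_last_ts=20.0)
print(sorted(active_flows), len(expired_flows))
```

Only the UDP flow that has been silent for 15 s is moved; the idle TCP flow stays active because this sketch, like the text, ends TCP flows on flags only.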

#### *2.2. Hash Functions*

Hashing is an extremely useful technique, widely used to construct fast lookup methods that quickly assign received packets to their corresponding flows. The hash functions used for mapping flow keys to hash values need to be chosen carefully to ensure optimal utilization of the hash table. Intuitively, a hash function is a function that maps every item to a hash value in a fashion that is somehow random. The most obvious model for a hash function is that it is fully random. Unfortunately, it is almost always impractical to construct fully random hash functions, as the space required to store such a function is essentially the same as that required to encode an arbitrary function as a lookup table [23]. Thus, the hashing applied is usually a compromise between the randomness properties that are desired in a hash function and the computational resources needed to store and evaluate such a function.

Hash functions utilized in network monitoring devices should have the following features:


Report [19] discusses hash algorithms used in two popular monitoring tools, nProbe [24] and Vermont [25]. The authors of [19] identified some flaws in both algorithms and proposed a modified version of Vermont. They also suggest that cryptographic hash functions might be best for such an application, if their implementations meet performance demands.

The network probe implements all three algorithms from [19] in hardware. In addition, two cryptographic hash functions were implemented: the cryptographically broken but still widely used SHA-1 and the state-of-the-art SHA-3. All of the algorithms are described in the following subsections.

For the proposed network probe, a hash width of 32 bits was considered. If the result of a given algorithm was wider, it was reduced to 32 bits. The network probe considers the source IP address, destination IP address, protocol, and protocol (TCP/UDP) source/destination port numbers as flow keys.
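As a software illustration of reducing a wider digest to 32 bits, the sketch below uses Python's `hashlib`. Several details are assumptions rather than facts from the text: the byte layout of the packed flow key, the choice of SHA3-256 as the SHA-3 variant, and truncation to the first four digest bytes as the reduction method.

```python
import hashlib
import socket
import struct


def pack_flow_key(src_ip, dst_ip, protocol, src_port, dst_port) -> bytes:
    """Serialize the five flow-key fields (layout is an assumption)."""
    return struct.pack("!4s4sBHH",
                       socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                       protocol, src_port, dst_port)


def hash32(algorithm: str, key: bytes) -> int:
    """Hash the key and keep the first 32 bits of the digest (assumed reduction)."""
    digest = hashlib.new(algorithm, key).digest()
    return int.from_bytes(digest[:4], "big")


key = pack_flow_key("10.0.0.1", "10.0.0.2", 6, 1234, 80)
h_sha1 = hash32("sha1", key)
h_sha3 = hash32("sha3_256", key)
print(f"{h_sha1:08x} {h_sha3:08x}")
```

Other reductions, such as XOR-folding the digest words, would satisfy the same 32-bit constraint; truncation is shown only because it is the simplest.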

#### 2.2.1. Sum Modulo 32—nProbe

The nProbe [24] monitoring tool utilizes simple sum modulo as its hash algorithm. For the proposed network probe, the calculation is presented as Equation (1):

$$h = (srcIP + dstIP + protocol + srcPort + dstPort) \bmod 2^{32} \tag{1}$$

This algorithm is very simple; however, as the authors of [19] point out, after testing it with a captured network packet trace, it does not have a perfectly uniform distribution: a number of buckets contain considerably more entries than others. Another drawback is the relative ease of generating collisions, because an attacker can freely manipulate the values of the flow keys provided that their sum is constant.
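The collision weakness is easy to reproduce in a few lines of Python. The sketch assumes the sum is reduced to the 32-bit hash width, i.e., modulo 2^32; the collision property itself does not depend on the modulus. The function name and the example addresses are hypothetical.

```python
import ipaddress

MASK32 = 0xFFFFFFFF  # reduction to 32 bits (assumed modulus of 2**32)


def sum_hash(src_ip, dst_ip, protocol, src_port, dst_port) -> int:
    """Sum of all flow-key fields, truncated to 32 bits."""
    total = (int(ipaddress.IPv4Address(src_ip))
             + int(ipaddress.IPv4Address(dst_ip))
             + protocol + src_port + dst_port)
    return total & MASK32


# Two different flows whose fields add up to the same value:
# the addresses are swapped and one port is shifted up while the other
# is shifted down by the same amount.
h_a = sum_hash("10.0.0.1", "10.0.0.2", 6, 1234, 80)
h_b = sum_hash("10.0.0.2", "10.0.0.1", 6, 1233, 81)
print(h_a == h_b)
```

Any redistribution of value between fields that keeps the total unchanged yields another colliding flow, so an attacker can generate as many as needed without any search.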

#### 2.2.2. Nested CRC-32—Vermont

Cyclic redundancy checks or cyclic redundancy codes (CRC) have been utilized for error detection in computing for a long time. A digest is calculated from transmitted data and is appended to the frame. The same algorithm is applied to data upon frame reception, and when the result is the same as the code calculated by the transmitter, it means that the received packet is correct.

The actual algorithm can be described mathematically as polynomial division of the binary data, interpreted as a polynomial over GF(2) (every bit is a polynomial coefficient, zero or one), by a generator polynomial G(x). The remainder of that division is treated as a check sequence, which is appended to the transmitted frame [26].

The CRC-32 implementation used in the proposed network probe is based on the IEEE 802.3 [27] polynomial. Implementation parameters, according to [26], are presented in Table 1.


**Table 1.** The network probe hardware accelerator CRC-32's implementation parameters, following [26].
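The polynomial division can be written as a bit-serial loop, as in the Python sketch below. It uses the commonly published CRC-32 parameter set for the IEEE 802.3 polynomial (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); whether this matches the parameters of the hardware accelerator exactly is an assumption. The result is cross-checked against `zlib.crc32`.

```python
import zlib

POLY_REFLECTED = 0xEDB88320  # bit-reversed form of 0x04C11DB7


def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Bit-serial CRC-32; crc is the result of a previous call (0 to start)."""
    crc ^= 0xFFFFFFFF                      # undo final XOR / apply initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; subtract (XOR) the generator when it was 1.
            crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


check = crc32_bitwise(b"123456789")
chained = crc32_bitwise(b"6789", crc32_bitwise(b"12345"))
print(f"{check:08x}", check == zlib.crc32(b"123456789"), chained == check)
```

Passing the previous result back in as `crc` continues the same division, which is the property the nested constructions in the following subsections rely on.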

Vermont [25] is built on nested CRC-32 invocations. The algorithm starts with a given initial seed, and Figure 2 presents how CRC-32 is invoked five times to include the flow keys in the hash calculation. The result of the preceding CRC-32 function is utilized as the seed for the next one.

**Figure 2.** Illustration of Vermont hashing function.
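The nesting can be sketched in Python with `zlib.crc32`, whose optional second argument is the running CRC value. The field order, the field widths, and the use of zlib's CRC-32 variant (with its initial and final XOR) are assumptions made for illustration; the actual Vermont code and the hardware design may differ in these details.

```python
import socket
import struct
import zlib


def flow_key_fields(src_ip, dst_ip, protocol, src_port, dst_port):
    """Byte encodings of the five flow keys (order and widths are assumed)."""
    return [
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        bytes([protocol]),
        struct.pack("!H", src_port),
        struct.pack("!H", dst_port),
    ]


def vermont_hash(src_ip, dst_ip, protocol, src_port, dst_port, seed=0) -> int:
    """Five chained CRC-32 calls; each result seeds the next call."""
    h = seed
    for field_bytes in flow_key_fields(src_ip, dst_ip, protocol,
                                       src_port, dst_port):
        h = zlib.crc32(field_bytes, h)
    return h


flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
h_default = vermont_hash(*flow)
h_seeded = vermont_hash(*flow, seed=1)
print(f"{h_default:08x} {h_seeded:08x}")
```

With this CRC variant, the chained calls produce the same value as a single CRC-32 over the concatenated fields, which hints at why the nesting by itself adds no secrecy: the whole construction remains a known linear function of the input.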

The authors of [19] found that Vermont is computationally efficient and offers roughly uniform distribution; however, they also proved that an attacker is still able to create hash collisions on purpose.

#### 2.2.3. Nested CRC-32 with *w* Constants—Modified Vermont

Report [19] proved that the CRC-based Vermont algorithm does not protect network monitoring devices from targeted collision attacks. The goal of the authors of [19] was to design a function that does not have this flaw, but that offers the same statistical qualities. The result of their research is a modified Vermont algorithm, presented in Figure 3.

**Figure 3.** Illustration of modified Vermont hashing function, based on [19].

To ensure that an attacker cannot create collisions in a simple way, a unique secret random value (*w(i)*, initialized during network monitor activation) is added to every flow key before CRC-32 calculation. This significantly increases the cost of a targeted attack, but does not prevent it, since the CRC-32 scheme is still used.
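A possible software model of this scheme is sketched below. It assumes that "added" means addition modulo 2^32, that every flow key is widened to 32 bits before mixing, and that the fields are processed in the order shown; none of these details is confirmed by the text, and the class name is hypothetical.

```python
import secrets
import socket
import struct
import zlib

MASK32 = 0xFFFFFFFF


class ModifiedVermontHash:
    """Nested CRC-32 with a secret per-field constant w(i) (illustrative)."""

    def __init__(self, seed=0, w=None):
        self.seed = seed
        # One secret 32-bit value per flow key, drawn once at start-up.
        self.w = list(w) if w is not None else [secrets.randbits(32)
                                                for _ in range(5)]

    def __call__(self, src_ip, dst_ip, protocol, src_port, dst_port) -> int:
        fields = [
            int.from_bytes(socket.inet_aton(src_ip), "big"),
            int.from_bytes(socket.inet_aton(dst_ip), "big"),
            protocol,
            src_port,
            dst_port,
        ]
        h = self.seed
        for value, w_i in zip(fields, self.w):
            mixed = (value + w_i) & MASK32       # assumed: addition mod 2**32
            h = zlib.crc32(struct.pack("!I", mixed), h)
        return h


flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
hash_a = ModifiedVermontHash(w=[0, 0, 0, 0, 0])
hash_b = ModifiedVermontHash(w=[1, 0, 0, 0, 0])
print(f"{hash_a(*flow):08x} {hash_b(*flow):08x}")
```

Because the constants are unknown to an attacker, collisions precomputed for one probe instance do not carry over to another, although the underlying CRC structure is unchanged.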
