#### 2.3.1. Implementation

The proposed network probe hardware accelerator was implemented in the Verilog hardware description language [38]. The accelerator's top module is depicted in Figure 7. It has a 128-bit data path with two AXI4-Stream interfaces [39], Slave and Master, used for data flow. Packets are processed sequentially, and their order is not changed. Block *netprobe\_top* consists of two submodules that implement the two main functions of the accelerator: the packet parser (*netprobe\_parser\_top*) and the hash module (*netprobe\_hash\_top*).


**Figure 7.** Block scheme of the network probe hardware accelerator's top module—*netprobe\_top*.

Figure 8 presents the structure of the packet parser module. The first block in the data path is a protocol filter, responsible for dropping IP packets that contain a protocol other than TCP or UDP. Packets that pass this protocol check are distributed in a round-robin manner between two parallel parser engines which extract flow keys and other information from the packet header.

**Figure 8.** Block scheme of the network probe hardware accelerator packet parser module— *netprobe\_parser\_top*.
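The behavior of the protocol filter and the round-robin distribution can be illustrated with a minimal software sketch. It is not the Verilog implementation; the dictionary-based packet representation and the names `dispatch`, `engine_a`, and `engine_b` are hypothetical and introduced only for illustration.

```python
from itertools import cycle

TCP, UDP = 6, 17  # IANA protocol numbers for TCP and UDP


def dispatch(packets, n_engines=2):
    """Drop non-TCP/UDP packets and deal the rest out round-robin."""
    queues = [[] for _ in range(n_engines)]
    next_engine = cycle(range(n_engines))
    for pkt in packets:
        if pkt["proto"] not in (TCP, UDP):
            continue  # protocol filter: the packet is dropped
        # Only packets that pass the filter advance the round-robin pointer.
        queues[next(next_engine)].append(pkt)
    return queues


# Hypothetical traffic: protocol 1 (ICMP) and 47 (GRE) should be dropped.
packets = [{"id": i, "proto": p} for i, p in enumerate([6, 1, 17, 6, 47, 17])]
engine_a, engine_b = dispatch(packets)
```

Because accepted packets strictly alternate between the two engines, reading the engine outputs in the same alternating order restores the original packet order.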

The parser engines were parallelized to avoid empty cycles on the Master interface caused by an unfavorable header structure of the processed IP packets. For example, the IP header length (IHL) determines the TCP header offset, which can place the TCP port and TCP flag fields in different beats of the 128-bit data path. Each parser engine processes the IP packet header, extracts the flow keys and the remaining features, and forwards the data in an internal format (two beats in a 128-bit data path). A placeholder for the hash is included, although the hash itself is calculated later in the pipeline.
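The effect of the IHL on field placement can be checked with a few lines of arithmetic. The sketch below assumes that the stream starts at the first byte of the IP header (`ip_start=0`); this is an assumption, and a preceding link-layer header would shift all offsets. The function name `tcp_field_beats` is hypothetical.

```python
BEAT_BYTES = 128 // 8  # one beat of the 128-bit data path carries 16 bytes


def tcp_field_beats(ihl, ip_start=0):
    """Return the beat indices holding the TCP source port and the TCP flags.

    ihl      -- IP header length in 32-bit words (5..15)
    ip_start -- byte offset of the IP header in the stream (assumed 0)
    """
    if not 5 <= ihl <= 15:
        raise ValueError("IHL must be in the range 5..15")
    tcp_start = ip_start + 4 * ihl
    port_byte = tcp_start         # source port occupies TCP bytes 0-1
    flags_byte = tcp_start + 13   # flags occupy TCP byte 13
    return port_byte // BEAT_BYTES, flags_byte // BEAT_BYTES


for ihl in (5, 8):
    port_beat, flags_beat = tcp_field_beats(ihl)
    print(f"IHL={ihl}: ports in beat {port_beat}, flags in beat {flags_beat}")
```

Under this assumption, a header without options (IHL = 5) places the ports and the flags in different beats, whereas IHL = 8 places both in the same beat.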

Module *netprobe\_hash\_top* is a block that wraps hash engines. It is parameterized with a HASH\_ALGORITHM variable, which selects an appropriate algorithm submodule to be instantiated (Table 3).


**Table 3.** Values of HASH\_ALGORITHM parameter for *netprobe\_hash\_top* module configuration.

The module *netprobe\_hash\_top* also has a set of strap ports used for modified Vermont and SHA-3 algorithm configuration as in Table 4. In the network probe hardware accelerator, *w* constant straps were tied off to random integers and a 512-bit hash was selected for the SHA-3 algorithm.


**Table 4.** Module *netprobe\_hash\_top* strap ports for algorithm configuration.

The nProbe hash algorithm (for HASH\_ALGORITHM = 1) was implemented as a simple 32-bit adder, whose inputs are flow keys extracted from the internal packet format and left-padded with zeros to 32-bit width if necessary.
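A software reference for such an adder-based hash might look as follows. The sketch assumes that the flow keys are the usual 5-tuple (two 32-bit addresses, two 16-bit ports, and an 8-bit protocol number, 104 bits in total); the function name `nprobe_like_hash` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def nprobe_like_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Sum the zero-extended flow keys in a 32-bit adder (wraps modulo 2**32)."""
    total = 0
    for key in (src_ip, dst_ip, src_port, dst_port, proto):
        total = (total + (key & MASK32)) & MASK32
    return total


forward = nprobe_like_hash(0xC0A80001, 0x0A000001, 51000, 443, 6)
reverse = nprobe_like_hash(0x0A000001, 0xC0A80001, 443, 51000, 6)
```

Because addition is commutative, both directions of a connection map to the same value in this sketch (`forward == reverse`), and the carry out of bit 31 is simply discarded.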

The Vermont hash algorithm (for HASH\_ALGORITHM = 2) was implemented as a 5-stage pipeline, similarly to the diagram in Figure 2. Internal packet data are registered in parallel to CRC-32 logic, and at every stage an appropriate flow key is selected to be included in the hash.

The modified Vermont hash algorithm (for HASH\_ALGORITHM = 3) was realized in a similar manner to regular Vermont. Flow keys are obfuscated with *w* constants before being used in CRC-32 calculations, as in Figure 3.
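The staged CRC-32 computation can be approximated in software by chaining a CRC over the flow keys, one key per stage. The sketch below is only an illustration under several assumptions: it uses the CRC-32 variant provided by Python's `zlib`, the key order is arbitrary, and the obfuscation with the *w* constants is modeled as XOR purely as a stand-in, since the actual operation is defined in Figure 3. The name `crc_chain_hash` is hypothetical.

```python
import zlib


def crc_chain_hash(keys, w=None):
    """Chain CRC-32 over (value, byte_width) flow keys, one key per stage.

    If `w` is given, each key is first combined with its constant.
    XOR is an assumed placeholder for the obfuscation step.
    """
    crc = 0
    for stage, (value, width) in enumerate(keys):
        if w is not None:
            value ^= w[stage] & ((1 << (8 * width)) - 1)
        # The previous stage's CRC seeds the next stage.
        crc = zlib.crc32(value.to_bytes(width, "big"), crc)
    return crc


# Assumed 5-tuple: source IP, destination IP, source port, destination port, protocol.
keys = [(0xC0A80001, 4), (0x0A000001, 4), (51000, 2), (443, 2), (6, 1)]
plain = crc_chain_hash(keys)
modified = crc_chain_hash(keys, w=[0x5A5A5A5A, 0x3C3C3C3C, 0x1234, 0xBEEF, 0x7F])
```

With five keys, the loop body corresponds to the five pipeline stages: each iteration consumes one key and passes the running CRC to the next.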

In the case of SHA-1 (for HASH\_ALGORITHM = 4), the concatenation of all flow keys forms a 104-bit word, which is taken as the input to the hash function. The length of the input word is less than 512 bits, which means that the SHA-1 transformation (80 rounds) must be applied only to a single block. This makes a pipelined implementation of the algorithm possible, as backpressure towards subsequent packets is not necessary.
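The single-block property follows from the SHA-1 padding rule, which appends one '1' bit, zero bits, and a 64-bit length field. A short check is given below; the function name `sha1_block_count` and the 5-tuple field widths are assumptions made for illustration.

```python
def sha1_block_count(msg_bits):
    """Number of 512-bit blocks after SHA-1 padding.

    Padding adds a single '1' bit, then zeros, then a 64-bit length field,
    so the padded length is the next multiple of 512 at or above msg_bits + 65.
    """
    return -(-(msg_bits + 1 + 64) // 512)  # ceiling division


flow_key_bits = 32 + 32 + 16 + 16 + 8  # assumed 5-tuple, 104 bits in total
blocks = sha1_block_count(flow_key_bits)
```

Any message of up to 447 bits fits in one block, so the 104-bit flow-key word leaves a wide margin.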

Figure 9 presents an example of such a pipeline. Data with extracted flow keys are constantly fed to the input, and multiple packets are processed simultaneously. Since the internal packet format requires two cycles to be transmitted in a 128-bit data path, where only the first cycle carries valid flow keys, a valid hash is obtained at the final stage of the pipeline only for the first beat of this packet.
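The timing of such a pipeline can be modeled with a simple shift register. The sketch below captures only the movement of beats through the stages and omits the round logic; the class `Pipeline`, the function `run`, and the `(packet_id, carries_keys)` beat format are hypothetical.

```python
from collections import deque


class Pipeline:
    """Timing-only model: one register per stage, one beat accepted per clock."""

    def __init__(self, n_stages):
        self.regs = deque([None] * n_stages, maxlen=n_stages)

    def clock(self, beat):
        """Advance one cycle: shift `beat` in, return what leaves the last stage."""
        out = self.regs[-1]
        self.regs.appendleft(beat)  # maxlen drops the oldest entry
        return out


def run(n_stages, n_packets):
    pipe = Pipeline(n_stages)
    # Each packet is two beats; only the first beat carries the flow keys.
    beats = [(pkt, beat == 0) for pkt in range(n_packets) for beat in range(2)]
    outputs = []
    for beat in beats + [None] * n_stages:  # extra cycles flush the pipeline
        outputs.append(pipe.clock(beat))
    return outputs


outputs = run(n_stages=40, n_packets=3)
```

The first beat appears at the output after `n_stages` cycles, and from then on one beat leaves the pipeline per cycle, with the hash being meaningful only for beats whose `carries_keys` flag is set.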

In a regular SHA-1 implementation, the hash pipeline would have 80 stages, one per SHA-1 round. It is possible to reduce the number of stages by unfolding the algorithm loop and implementing two rounds between stage registers. This approach, however, lengthens the critical path of the circuit and, as a result, decreases the maximum clock frequency. A solution to this problem was proposed in [40], where the authors described a method of unfolding the SHA-1 algorithm loop using additional variables. This technique makes it possible to perform two algorithm rounds within one clock cycle and halves the required number of stages. It was incorporated in the SHA-1 implementation of the network probe hardware accelerator; therefore, its pipeline has 40 stages.

For SHA-3 (for HASH\_ALGORITHM = 5), as previously, concatenation of all flow keys creates a 104-bit input word. Again, this is less than the SHA-3 block length, so the approach illustrated in Figure 9 can be applied once more. The SHA-3 pipeline in the proposed network probe hardware accelerator has 24 stages, one per SHA-3 round.
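Both figures quoted above follow from the SHA-3 parameters: the block length (rate) is the 1600-bit state minus a capacity of twice the digest size, and the number of rounds depends on the lane width. A minimal sketch, with hypothetical function names:

```python
def sha3_rate_bits(digest_bits, state_bits=1600):
    """Block length (rate): state size minus capacity, where capacity = 2 * digest."""
    return state_bits - 2 * digest_bits


def keccak_rounds(state_bits=1600):
    """Round count of the permutation: 12 + 2 * log2(lane width)."""
    lane_bits = state_bits // 25
    return 12 + 2 * (lane_bits.bit_length() - 1)


def fits_single_block(msg_bits, digest_bits):
    # SHA-3 appends the 2-bit domain suffix and at least 2 padding bits.
    return msg_bits + 4 <= sha3_rate_bits(digest_bits)


rate = sha3_rate_bits(512)
rounds = keccak_rounds()
single = fits_single_block(104, 512)
```

For the 512-bit variant the rate is 576 bits, so a 104-bit input is absorbed in a single block, and the permutation has 24 rounds.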

In all cases, a 32-bit flow hash is inserted into the initial placeholder of the output accelerator packet.

**Figure 9.** Pipelined hash algorithm implementation in the network probe hardware accelerator.

#### 2.3.2. Functional Verification

Functional verification of the proposed network probe hardware accelerator was conducted using cocotb, an open-source, Python-based testbench environment for VHDL/Verilog RTL [41]. It adopts the same concepts of constrained random verification as the industry-standard UVM [42]; however, it is implemented in Python rather than SystemVerilog. This enables swift and productive construction of the verification environment, as Python scripting is simple and a huge library of existing code is available (e.g., packet generation libraries and cryptographic algorithm implementations).

Figure 10 presents the structure of the *cocotb*-based verification environment. The DUT (Design Under Test, here *netprobe\_top*) was instantiated as the top level in the simulator and was surrounded by verification environment components such as drivers, monitors, and a scoreboard, which were extended from the infrastructure provided by cocotb. Ports of the tested module were stimulated directly from the Python function acting as a test case.

**Figure 10.** *cocotb*-based verification environment of *netprobe\_top* module.

At the beginning, a number of transaction objects that mimic IP packets were created and randomized. The goal was to cover a broad space of possible network traffic, so multiple packet parameters were changed: packet length, addresses, encapsulated protocol, etc. These objects were passed to an AXI4-Stream driver, which transmitted them onto the Slave interface of the *netprobe\_top* module. Both Slave and Master interfaces were watched by AXI4-Stream monitors, which were able to transform waveforms into transaction objects. Initial packets and those processed by DUT were fed to the scoreboard component. The DUT behavior model was applied to the stimulus packets there, and the result was compared with transactions processed by the *netprobe\_top* module itself. They must be the same, and when this condition is not fulfilled, an error is reported.
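The scoreboard principle described above can be sketched in plain Python, independently of cocotb. All names (`Scoreboard`, `random_packet`, `model`) and the dictionary-based transactions are hypothetical, and the behavior model is reduced to the protocol filter for brevity.

```python
import random
from collections import deque


class Scoreboard:
    """Compare DUT output transactions with the behavior model's predictions."""

    def __init__(self, model):
        self.model = model
        self.expected = deque()
        self.errors = 0

    def on_input(self, pkt):
        result = self.model(pkt)
        if result is not None:  # None: the model predicts the packet is dropped
            self.expected.append(result)

    def on_output(self, pkt):
        if not self.expected or self.expected.popleft() != pkt:
            self.errors += 1  # mismatch or unexpected packet


def random_packet(rng):
    """Randomized stimulus transaction that mimics an IP packet."""
    return {"proto": rng.choice([1, 6, 17]), "length": rng.randint(20, 1500),
            "src": rng.getrandbits(32), "dst": rng.getrandbits(32)}


def model(pkt):
    """Simplified behavior model: drop non-TCP/UDP, keep selected fields."""
    if pkt["proto"] not in (6, 17):
        return None
    return {"src": pkt["src"], "dst": pkt["dst"], "proto": pkt["proto"]}


rng = random.Random(1)
sb = Scoreboard(model)
for _ in range(100):
    pkt = random_packet(rng)
    sb.on_input(pkt)      # transaction seen by the Slave-side monitor
    out = model(pkt)      # stand-in for the transaction monitored at the DUT output
    if out is not None:
        sb.on_output(out)
```

In a real environment, `on_input` and `on_output` would be called by the Slave and Master monitors, respectively, rather than from a single loop.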

Figure 11 is a screen capture from a simulation of the *netprobe\_top* module configured with the SHA-3 algorithm. The selected SHA-3 hash length was 512 bits (*strap\_hash\_length* equals 2'd3). The goal of the executed test case was to check the performance of the design. Signal *axis\_m\_tready* of the accelerator's Master interface was tied high, which indicates no backpressure. The DUT was flooded with a number of short IP packets, and signal *axis\_s\_tvalid* went high at Cursor 1. After 32 clock cycles (the latency of the SHA-3 configuration), the first result packets were presented on the Master interface (Cursor 2, where *axis\_m\_tvalid* goes high). Checks implemented in the testbench verified whether the *axis\_s\_tready* signal went low. Module *netprobe\_top* does not introduce backpressure on its own, and even in these harsh conditions, the DUT behaved as expected.
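A check of this kind can be expressed over per-cycle samples of the interface signals. The sketch below operates on plain lists of 0/1 values rather than on a live simulation; the function `check_run` and the synthetic sample vectors are hypothetical and merely reproduce the 32-cycle latency quoted above.

```python
def check_run(s_tvalid, s_tready, m_tvalid):
    """Return (latency in cycles, whether the DUT stalled its input).

    Latency is measured between the first cycle with s_tvalid high and the
    first cycle with m_tvalid high. A stall is any cycle in which the Slave
    interface offers data (tvalid high) while tready is low.
    """
    latency = m_tvalid.index(1) - s_tvalid.index(1)
    stalled = any(v and not r for v, r in zip(s_tvalid, s_tready))
    return latency, stalled


# Synthetic samples: input starts at cycle 5, output at cycle 37.
cycles = 50
s_tvalid = [0] * 5 + [1] * (cycles - 5)
s_tready = [1] * cycles
m_tvalid = [0] * 37 + [1] * (cycles - 37)

latency, stalled = check_run(s_tvalid, s_tready, m_tvalid)
```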

**Figure 11.** Simulation of *netprobe\_top* module with the SHA-3 hash algorithm using the *cocotb*-based verification environment.

#### 2.3.3. Synthesis Results

Synthesis of the network probe hardware accelerator was performed for Intel Stratix V GX FPGA (5SGXEA7N2F45C2), an element of the Terasic DE5-Net development kit [43], using Intel Quartus Prime 18.1 software.

Table 5 summarizes the synthesis results of the *netprobe\_top* module for a range of hash algorithms. Since nProbe, Vermont, and modified Vermont are based on simple hashing schemes that use basic types of calculations (32-bit modular addition or CRC), hardware implementation of these algorithms requires few hardware resources (less than 1% of the available resources of the FPGA used in the experiment). Although the SHA-1 and SHA-3 cryptographic functions are far more computationally expensive, the proposed implementation requires few enough resources to be used efficiently as part of the hardware NetFlow probe. Even though SHA-1 and SHA-3 were optimized for performance rather than area, the probe with the most complex SHA-3 algorithm utilized only 16.44% of resources, leaving enough to implement other functionalities of the NetFlow probe [17]. It is no surprise that implementations of straightforward hash algorithms (such as nProbe, Vermont, or modified Vermont) can sustain multigigabit throughput, but the realizations of the cryptographic functions (SHA-1, SHA-3) match this as well. All investigated hash algorithms offer throughput over 20 Gbit/s.


**Table 5.** Module *netprobe\_top* implementation results in Stratix V FPGA.

It has been assumed that cryptographic hash functions such as SHA are computationally too expensive for efficient use in a flow monitor. The high bandwidth and low latency of the hardware accelerator based on the SHA-1 and SHA-3 functions enable construction of a network probe that works in real time, even when it is flooded with the smallest IP packets.

It is worth mentioning that the low percentage of logic utilization allows for further design optimization and parallelization [44]. Utilizing such techniques, it should even be possible to reach the 100 Gbit/s bandwidth limit.
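The relation between data path width, clock frequency, and throughput can be made explicit with simple arithmetic. The sketch below assumes that one full beat is transferred on every clock cycle; the clock frequencies are derived from the throughput targets and are not taken from Table 5. The function names are hypothetical.

```python
def peak_throughput_gbps(width_bits, clock_mhz):
    """Peak throughput of a data path moving one full beat per clock cycle."""
    return width_bits * clock_mhz / 1000.0


def required_clock_mhz(width_bits, target_gbps):
    """Clock frequency needed to reach a target throughput at a given width."""
    return target_gbps * 1000.0 / width_bits


f_20g = required_clock_mhz(128, 20)          # 128-bit path, 20 Gbit/s
f_100g = required_clock_mhz(128, 100)        # same path, 100 Gbit/s
f_100g_wide = required_clock_mhz(512, 100)   # e.g., four parallel 128-bit paths
```

Under this assumption, a single 128-bit path needs 156.25 MHz for 20 Gbit/s and 781.25 MHz for 100 Gbit/s, whereas four parallel paths (512 bits in total) would need about 195 MHz, which illustrates why parallelization is the more plausible route to 100 Gbit/s.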
