Article

Programmable Deterministic Zero-Copy DMA Mechanism for FPGA Accelerator

Computer College, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 9581; https://doi.org/10.3390/app12199581
Submission received: 26 August 2022 / Revised: 17 September 2022 / Accepted: 20 September 2022 / Published: 23 September 2022

Abstract

With the expansion of network scales, the B/S architecture of monolithic applications is gradually being replaced by microservices. The unbundling of services has led to exponential growth in the size of APIs. When handling massive microservice requests, commercial NICs show limitations in three aspects: determinism, programmability, and data copying. To ensure that each microservice node handles requests efficiently, flexibly, and precisely, this paper proposes a programmable deterministic multi-queue FPGA Accelerator. The Accelerator relies on 1000 instantiated hardware queues and a queue management unit to extend the rule-based RSS algorithm, providing serverless-friendly programmability of packet distribution. A PTP hardware clock is added to collaborate with the queue management unit to control deterministic delivery. To improve the sending and receiving efficiency of network node data, a driver adapted to the FPGA accelerator is designed to realize zero-copy. Experiments conducted on a 100 Gbps FPGA show that the Accelerator can support multi-queue transmission with various packet sizes, allow users to define the forwarding behavior, and almost approach the line rate on an 8-core host. In addition, it can forward packets with low latency, close to that of the current state-of-the-art ovs-DPDK. This Accelerator overcomes, to some extent, the limitations of commercial NICs when oriented to microservice architectures.

1. Introduction

Currently, the throughput of data-center networks has reached the terabit level, traffic is becoming more diverse, and scalable, easy-to-develop microservices are gradually replacing tightly coupled, monolithic application code bases. As one of the most common ways to implement microservices, serverless networks pose new challenges to traditional network equipment in terms of asynchronous concurrency, burst unpredictability, independent deployment and scaling of components, and rapid development iteration. Commercial NICs have struggled to meet the high-throughput, customizable, deterministic-latency data transmission requirements of serverless workloads [1] due to hardware design and software driver limitations, which can be summarized as follows.
(i)
Programmability. Typically, serverless network nodes need to handle a diverse set of requests. ASIC-based, fully customized network devices are complicated to program, and the configurations of commercially available FPGA-based devices often cannot be modified directly. This closed hardware design severely limits researchers' flexibility in developing new request-handling features.
(ii)
Determinism. With the rise of the Industrial Internet of Things (IIoT), many services require precise control of end-to-end latency and jitter, which commercial NIC-based “best-effort” Ethernet cannot meet. Therefore, we need a new generation of network devices that can provide “on-time, accurate” transmission quality of service with deterministic latency, deterministic jitter, and accurate transmission control.
(iii)
Data Copying. The frequent data copying between the NIC and the application limits the high-performance data processing of network nodes. When a packet arrives, the NIC copies it to a ring buffer in host memory over the PCIe bus via DMA (direct memory access). It then generates a hardware interrupt to notify the CPU that a packet has been received (one interrupt per packet). The CPU allocates a socket buffer (skb) and passes the packet to the kernel for layer-by-layer parsing. The application thread eventually enters kernel space via the socket’s recv() interface and copies the data from the kernel to user space for processing. The host performs the sending process in reverse order. Each direction therefore requires one DMA copy and one CPU copy. Research shows that CPU copying of a packet consumes 500–2000 clock cycles (depending on packet size) [2], and frequent CPU copying consumes excessive computational and memory resources. A minimal sketch of this conventional receive path is shown below.
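To make the two copies concrete, the following minimal sketch uses standard POSIX sockets (not the Accelerator's API): the kernel has already DMA-copied the frame into its ring buffer and skb, and recvfrom() performs the second, CPU-driven copy into the user buffer. The port number is an arbitrary illustrative choice.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal UDP receiver illustrating the conventional path:
 * copy #1: the NIC DMAs the frame into a kernel ring buffer / skb,
 * copy #2: recvfrom() copies the payload from kernel space into buf. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);            /* illustrative port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* This call blocks in the kernel and performs the CPU copy
         * that zero-copy designs try to eliminate. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0)
            break;
        printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```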
To construct a microservice network node with high processing efficiency and overcome these hardware design and software processing limitations, this paper proposes a serverless-oriented programmable deterministic zero-copy DMA FPGA accelerator. At the hardware level, 1000 hardware queues are instantiated, and a hardware-clock-based deterministic sending mechanism is designed to achieve precise timing control for multiple queues. At the device driver level, a multi-queue-based zero-copy [3] mechanism is introduced to eliminate the CPU cost caused by frequent copying. In addition, to enhance the scalability and flexibility of services in the serverless scenario, we add a rule-based distribution module to the traditional RSS [4] functionality. This module extends flow-based distribution to the emerging transport protocol QUIC [5,6,7]. Users can program the forwarding behavior by modifying the flow table through a remote server. In the experimental evaluation section, we simulate a serverless environment; furthermore, a multi-node packet forwarding topology is designed to test the prototype’s throughput and forwarding latency under different queue counts and packet lengths.
This paper presents the design, implementation, and evaluation of a programmable multi-queue FPGA accelerator that overcomes the limitations of existing commercial NICs. Its main innovations can be summarized as follows:
(i)
Achieving precise time synchronization based on a hardware clock to enable deterministic sending control of the accelerator’s multiple queues;
(ii)
Implementing a rule-based flow classification that can match flow tables according to traffic characteristics and user requirements and realize diverse multi-queue-based scheduling for packets;
(iii)
A distributed data-flow distribution control Agent: management nodes in the network can define the data-flow distribution and encapsulation behavior by issuing flow tables, which facilitates centralized management of distributed serverless network nodes.

2. Zero-Copy DMA Technology Overview

When the amount of data to be processed is large, CPU copies and hardware interrupts can take up many CPU cycles and lead to interrupt livelock [8]. Data copying and context switching consume a large amount of memory and many clock cycles. To cope with the performance bottlenecks caused by hard interrupts and CPU copies, industry has proposed the concept of zero-copy to improve the efficiency of sending and receiving data [9]. The most representative technologies and frameworks are PF_RING ZC, Netmap, AF_XDP, and DPDK [10,11,12,13], among which DPDK offers the best performance [14] and is the leading research target. Table 1 compares these zero-copy techniques.

2.1. PF_RING Zero-Copy

PF_RING ZC [10] implements the Direct NIC Access (DNA) technique, which maps the NIC’s memory and registers into userspace. Its zero-copy flow is as follows: (1) the DMA controller copies the packet from the NIC to host memory (the RX buffer); (2) the library function mmap() maps the RX buffer into userspace so the application can access packets directly. PF_RING ZC’s API emphasizes convenient multi-core support. It does not use standard system calls, provides custom (polling) functions, and implements its own driver on top of the Linux driver. When PF_RING ZC is active, applications cannot send or receive data using their default system interfaces. Since PF_RING ZC’s packet processing goes through fewer stages, it can achieve almost the same data processing efficiency as DPDK. The sketch below illustrates the underlying mapping mechanism.
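The core mechanism PF_RING ZC relies on is mapping device-owned packet memory into the process address space. The sketch below does not use PF_RING's own API; it only illustrates the generic pattern, assuming a hypothetical character device /dev/zc0 that exposes its DMA-written RX buffer via mmap(), and an assumed buffer size.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RX_BUF_SIZE (4 * 1024 * 1024)   /* assumed size of the device RX area */

int main(void)
{
    /* /dev/zc0 is hypothetical: it stands in for a driver that exposes
     * the DMA-written RX buffer to user space. */
    int fd = open("/dev/zc0", O_RDWR);
    if (fd < 0)
        return 1;

    /* One mmap() call replaces the per-packet kernel-to-user copy:
     * the application now reads packets where the DMA engine wrote them. */
    uint8_t *rx = mmap(NULL, RX_BUF_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (rx == MAP_FAILED) {
        close(fd);
        return 1;
    }

    printf("first byte of the shared RX area: 0x%02x\n", rx[0]);

    munmap(rx, RX_BUF_SIZE);
    close(fd);
    return 0;
}
```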
Ntop [15] provides a large number of applications running on top of PF_RING ZC, such as n2disk [16] (a packet capture tool) and nProbe [17] (a traffic monitoring tool). However, when applying PF_RING ZC to multi-queue NICs, applications must negotiate in advance which queues they will use, because multiple applications can manipulate the queues directly and simultaneously; otherwise, queue-usage conflicts can easily occur.

2.2. Netmap

Netmap [11] achieves zero-copy mainly by mapping the packet buffers of the NIC into a shared memory region, where each Netmap ring in memory corresponds to a NIC ring, and the application accesses the buffers of the Netmap ring by calling the Netmap API. Netmap’s network driver is mainly a patched version of the Linux kernel’s NIC driver, which switches to Netmap’s custom logic when executing critical operations (such as OPEN/RX/TX). Maintaining compatibility in the driver allows Netmap to be integrated easily into standard operating systems. A minimal receive loop is sketched below.
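As a sketch of what using the shared Netmap rings looks like, the following receive loop uses netmap's classic helper API from net/netmap_user.h (nm_open, nm_nextpkt); the interface name "netmap:eth0" and the fixed iteration count are assumptions for illustration. nm_nextpkt() returns a pointer straight into the shared memory region, so no copy is made.

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

/* Minimal netmap receive loop: the rings behind nm_open() live in the
 * memory region shared between the kernel NIC rings and user space, so
 * nm_nextpkt() hands back a pointer into that region with no extra copy. */
int main(void)
{
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL); /* "eth0" is an assumption */
    if (d == NULL)
        return 1;

    struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
    for (int i = 0; i < 100; i++) {
        poll(&pfd, 1, 1000);                 /* wait for packets (1 s timeout) */
        struct nm_pkthdr h;
        unsigned char *buf;
        while ((buf = nm_nextpkt(d, &h)) != NULL)
            printf("got %u-byte packet at %p\n", h.len, (void *)buf);
    }
    nm_close(d);
    return 0;
}
```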
Netmap is integrated into the FreeBSD kernel [18], and several applications have shown performance improvements by employing Netmap, such as Click (a software router) [19], VALE (a virtual switch) [20], and ipfw (the FreeBSD firewall) [21]. However, Netmap ignores the NIC’s hardware offload features (e.g., VLAN offload, checksum offload, TSO, etc.). It also does not define additional libraries for packet processing, which makes it harder to build extensions such as GRO, GSO, IP fragmentation/reassembly, quality of service, and encryption/decryption.

2.3. AF_XDP Redirection

AF_XDP is a socket address family built on XDP (eXpress Data Path) [22,23], which realizes zero-copy through the XSK (AF_XDP socket) interface. An XSK uses a pair of rings, RX and TX, in user space, which reference a memory area called the umem (user-space memory). Each umem additionally has two rings, the Fill ring and the Completion ring. For reception, the application places empty descriptors into the Fill ring. When a packet arrives, the kernel pops a descriptor from the Fill ring, the DMA controller copies the packet into the memory it points to, and the finished descriptor is queued onto the RX ring, from which the application retrieves the data. Transmission works analogously through the TX ring, with the Completion ring returning used buffers to the application.
AF_XDP relies on the Linux kernel to achieve zero-copy while maintaining compatibility with existing operating systems. Its zero-copy implementation is similar to Netmap’s in that it initializes a shared memory area between the kernel and user space. However, there are two additional limitations to using AF_XDP. First, the development of XDP programs relies on restricted helper libraries (i.e., libbpf). Second, the NIC still generates frequent hardware interrupts, so the interrupt livelock issue is not fundamentally solved.
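The ring interplay described above can be sketched with the XSK helper functions (historically in libbpf's bpf/xsk.h, now in libxdp's xdp/xsk.h). The sketch assumes the umem, socket, and rings have already been created with xsk_umem__create()/xsk_socket__create() elsewhere; it only shows one receive pass and the recycling of buffers through the Fill ring.

```c
#include <bpf/xsk.h>          /* older libbpf location; newer code uses <xdp/xsk.h> */
#include <linux/if_xdp.h>
#include <stdio.h>

/* One pass over the AF_XDP receive path, assuming setup was done elsewhere:
 * buffers previously posted on the Fill ring come back as descriptors on the
 * RX ring once the kernel has placed packet data into the umem. */
static void rx_poll_once(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill,
                         void *umem_area)
{
    __u32 idx_rx = 0, idx_fill = 0;
    unsigned int rcvd = xsk_ring_cons__peek(rx, 64, &idx_rx);
    if (rcvd == 0)
        return;

    /* Reserve the same number of Fill-ring slots so the buffers can be
     * recycled (a full implementation would check the return value). */
    xsk_ring_prod__reserve(fill, rcvd, &idx_fill);

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
        unsigned char *pkt = xsk_umem__get_data(umem_area, desc->addr);
        printf("packet of %u bytes at umem offset %llu, first byte 0x%02x\n",
               desc->len, (unsigned long long)desc->addr, pkt[0]);
        /* Give the buffer back to the kernel via the Fill ring. */
        *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
    }

    xsk_ring_cons__release(rx, rcvd);
    xsk_ring_prod__submit(fill, rcvd);
}
```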

2.4. DPDK

DPDK [13] fully supports multi-queue and multi-core scaling and is the best-known open-source zero-copy I/O technology in terms of performance [14]. Its high performance comes from: (1) the PMD (Poll Mode Driver): I/O threads periodically poll packet-arrival marker bits, avoiding the context-switching overhead introduced by interrupts. (2) The UIO [24] (Userspace I/O) driver, which intercepts NIC interrupts, resets the callback function corresponding to the interrupt, and maps operations on the hardware device to operations on a file in userspace, circumventing user/kernel mode switching and CPU copying. (3) Hugepage memory: a larger page size means fewer page-table levels and fewer entries needed in the TLB (Translation Lookaside Buffer), reducing the number of memory accesses during CPU address translation and increasing the probability of TLB hits. (4) Core affinity: binding threads to dedicated CPU cores realizes the RTC (run-to-complete) mode, which avoids frequent switching of processes or threads between cores and reduces memory-access overhead. A minimal poll-mode receive loop is sketched below.
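A minimal DPDK poll-mode receive loop looks as follows. Port and queue setup with rte_eth_dev_configure()/rte_eth_rx_queue_setup()/rte_eth_dev_start() is assumed to have been done already; port 0, queue 0, and the burst size are illustrative choices.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* PMD receive loop (sketch): busy-poll the RX descriptor ring instead of
 * waiting for an interrupt. The returned mbufs point straight into hugepage
 * memory that the NIC DMAed into, so no CPU copy has happened yet. */
static void pmd_rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        uint16_t nb = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* process rte_pktmbuf_mtod(bufs[i], void *) here ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;
    /* Assumes port 0 was configured and started before entering the loop. */
    pmd_rx_loop(0);
    return 0;
}
```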
Typical applications of DPDK include DPDK vSwitch (accelerated OVS) [25] and xDPd (a high-performance software switch) [26], among others. The feature comparison in Table 1 shows that DPDK offers advantages that other zero-copy DMA technologies cannot match from the deployment, development, and performance perspectives.

3. Programmable Deterministic Multi-Queue Accelerator Model

Traditional NICs have flaws in programmability and determinism and cannot meet the bursty, diverse, and scalable service requests of the serverless network model. Given this, this section proposes a deterministic multi-queue accelerator model, as shown in Figure 1. Compared with a traditional NIC, this accelerator model makes targeted improvements in the programmability of multi-queue assignment and the determinism of packet delivery. Combined with the fact that zero-copy DMA technology improves packet sending and receiving efficiency, a user-space zero-copy driver adapted to this accelerator model is designed to provide a reference for improving the throughput of serverless network nodes.

3.1. Model Design

We divided the model structurally into three modules: Serverless-friendly Programmable Multi-queue Scheduling, Deterministic DMA for Serverless, and UserSpace Zero-copy Driver. Combined with the characteristics of a multicore serverless node, a programmable rule-based flow classifier is used to forward the input flow precisely to hardware queues bound to different cores to enhance the concurrency. A hardware clock component is added to interact with the transmission scheduler to control the transport engine of each hardware queue for deterministic delivery. In addition, the zero-copy driver in the userspace, which eliminates hardware interrupts and data copy overhead, can further enhance the packet sending and receiving efficiency.
The three modules provide their features in the accelerator and interact through a hardware descriptor ring that the accelerator and the host jointly manage. This ring is located in the memory pool allocated by the driver, and both the driver and the accelerator’s DMA controller can operate on it. Each descriptor stores information such as the physical address of the packet, the timestamp, and the descriptor-completion flag. The starting physical address of the descriptor ring is stored in the accelerator’s base register, the number of descriptors is stored in the size register, and the tail and head registers represent the production and consumption of descriptors; the descriptors between the head and tail pointers are available (avail). The accelerator and host interact by reading and writing these registers. The descriptor ring follows a producer–consumer usage pattern, and its state changes over time as shown in Figure 2. The accelerator delivers packets to or collects packets from the host through the DMA controller in cooperation with the PCI bus. The driver implements deterministic DMA functionality by writing a descriptor with a custom timestamp field into the hardware descriptor ring, combined with the precise timing of the PTP hardware clock. One possible descriptor and register layout is sketched below.
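To make the register and descriptor interaction concrete, the sketch below spells out one possible layout. All field names, widths, and the flag/timestamp fields are illustrative assumptions, not the Accelerator's actual register map.

```c
#include <stdint.h>

/* Hypothetical layout of one hardware descriptor in the shared ring.
 * Field names and widths are illustrative assumptions only. */
struct hw_desc {
    uint64_t pkt_phys_addr;   /* physical address of the packet buffer */
    uint64_t tx_timestamp;    /* PTP time at which the queue may send (0 = now) */
    uint16_t pkt_len;         /* packet length in bytes */
    uint16_t flags;           /* bit 0: DD (descriptor done) */
    uint32_t rsvd;
};

/* Hypothetical register block: base/size locate the ring, the driver
 * (producer) advances tail, and the accelerator (consumer) advances head. */
struct queue_regs {
    uint64_t base;            /* starting physical address of the ring */
    uint32_t size;            /* number of descriptors */
    uint32_t head;            /* next descriptor the device will consume */
    uint32_t tail;            /* next descriptor the driver will produce */
};

/* Descriptors between head and tail (modulo size) are available to the device. */
static inline uint32_t ring_avail(const struct queue_regs *r)
{
    return (r->tail + r->size - r->head) % r->size;
}
```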

3.2. Serverless-Friendly Programmable Multi-Queue Scheduling

This module consists of a flow classifier and multiple hardware queues, each queue consisting of RX and TX structures. The RSS unit is the core component realizing the rule-based flow classification and forwarding function. It is associated with the accelerator’s network interface, and an Agent is used for remote management. The cloud server programs the forwarding action by issuing a flow table to the Agent. The Forwarder is responsible for forwarding incoming packets according to the flow-table match signals.
Compared with traditional NICs, the RSS module provides multi-queue-oriented packet distribution and extends multi-layer serverless packet forwarding. Usually, RSS uses the standard Toeplitz hashing algorithm [27] to hash the packet’s 5-tuple (or 4-tuple), redirects packets to the specified hardware queues according to the result, and achieves flow classification at layer three or layer four. When transmitting data of emerging transport protocols such as QUIC, it is also possible to identify data carrying the same stream identifier in the QUIC payload and forward them to the same hardware queue, achieving “layer-4.5” forwarding based on the QUIC rule that a single transport object uses the same stream number.
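As a software reference for the queue selection step, the following is a straightforward implementation of the standard Toeplitz hash used by RSS. The key length and the input (the 5-tuple bytes, or the QUIC stream ID in the layer-4.5 case) are supplied by the caller; mapping the hash to a queue with a modulo is an illustrative choice, not necessarily the Accelerator's indirection scheme.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard Toeplitz hash as used by RSS: for every set bit of the input,
 * XOR in the 32-bit key window aligned with that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *input, size_t input_len)
{
    uint32_t hash = 0;
    uint64_t window = 0;

    /* Preload the first 64 key bits; the upper 32 bits form the current window. */
    for (size_t i = 0; i < 8 && i < key_len; i++)
        window |= (uint64_t)key[i] << (56 - 8 * i);

    for (size_t byte = 0; byte < input_len; byte++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (input[byte] & (1u << bit))
                hash ^= (uint32_t)(window >> 32);
            window <<= 1;
        }
        /* Shift the next key byte into the low end of the window. */
        size_t next = byte + 8;
        if (next < key_len)
            window |= (uint64_t)key[next];
    }
    return hash;
}

/* Illustrative queue selection: hash the flow identifier (5-tuple or QUIC
 * stream ID bytes) and map it onto the number of enabled queues. */
static uint32_t select_queue(const uint8_t *rss_key, size_t key_len,
                             const uint8_t *flow_id, size_t flow_len,
                             uint32_t num_queues)
{
    return toeplitz_hash(rss_key, key_len, flow_id, flow_len) % num_queues;
}
```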
The Forwarder is the main functional component of the programmable multi-queue module, and Figure 3 shows its workflow. When a packet arrives, the interface hands the received data to the RSS flow classifier bound to that interface, which processes it with the packet-receiving function defined in the RSS. First, the classifier obtains the packet header information, identifies the packet type, and sends it to the flow table for matching. After matching, control of forwarding the packet is handed to the Repeater along with a customizable hash key and the flags matched in the flow table. Once the corresponding flow-table entry is found, a signal is generated and passed to the Repeater, which, according to the signal value, obtains the relevant fields of the packet for hash calculation and forwards the packet to the corresponding queue according to the hash value.
The Agent is the controller of the RSS classification component. It is connected to the remote server through a network interface and is responsible for actively reporting node changes or requesting service data from the cloud server. The cloud server can also update the flow-table contents through the Agent to achieve programmability of forwarding behavior.

3.3. Deterministic DMA for Serverless

The deterministic DMA module consists mainly of the DMA component, the Tx engine (transmission engine), the Tx scheduler (transmission scheduler) [28], and the PTP hardware clock (HC) [29]. Each hardware queue has its own transmission engine, which processes transmission instructions from the transmission scheduler. It is responsible for coordinating the operations required to transmit packets while controlling the enabling/disabling of the queue by writing registers. The PTP hardware clock provides the accelerator with high-precision timing, with sub-microsecond error relative to the host, which, combined with the Tx scheduler’s control of the transmission engines, enables on-time, accurate transmission.
When the application makes a deterministic sending request, the driver writes the timestamp along with the descriptor information into the descriptor ring. The DMA controller detects the change in the queue’s tail pointer and finds the next available descriptor in the transmission ring. The controller first reads the timestamp and, while copying the data in the buffer pointed to by the descriptor into the specified transmit queue of the accelerator via the PCI bus, passes the timestamp to the Tx scheduler. The Tx scheduler obtains the current time from the PTP hardware clock and issues a command to the transmission engine of that queue at the time specified by the timestamp. The transmit queue then forwards the data out through the MAC chip. After sending, the accelerator updates the head pointer of the ring, and the DMA controller initiates a hardware interrupt to notify the CPU to release the packets in the buffer.
Handing control of packet sending to the queue’s Tx engine through timestamps set by the serverless node, combined with the precise timing of the PTP hardware clock, simplifies the node’s operation and provides deterministic control of transmission. A driver-side sketch of this hand-off follows.
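A minimal sketch of how the driver side of this hand-off could look, reusing the hypothetical descriptor layout from Section 3.1; the ring state, register names, and doorbell write are all assumptions for illustration, not the Accelerator's actual interface.

```c
#include <stdint.h>

/* Hypothetical descriptor (same assumptions as in Section 3.1). */
struct hw_desc {
    uint64_t pkt_phys_addr;
    uint64_t tx_timestamp;    /* absolute PTP time at which the Tx engine may send */
    uint16_t pkt_len;
    uint16_t flags;           /* bit 0: DD (descriptor done) */
    uint32_t rsvd;
};

struct tx_ring {
    struct hw_desc *descs;        /* DMA-coherent descriptor array */
    uint32_t size;
    uint32_t tail;                /* driver-side producer index */
    volatile uint32_t *tail_reg;  /* MMIO register the accelerator watches */
};

/* Post one deterministic-send request: the DMA engine copies the packet into
 * the queue, and the Tx scheduler holds it until the PTP clock reaches
 * tx_time_ns. Writing the tail register is what the DMA controller detects. */
static void post_deterministic_tx(struct tx_ring *r, uint64_t pkt_phys,
                                  uint16_t len, uint64_t tx_time_ns)
{
    struct hw_desc *d = &r->descs[r->tail];
    d->pkt_phys_addr = pkt_phys;
    d->pkt_len = len;
    d->tx_timestamp = tx_time_ns;
    d->flags = 0;                         /* DD cleared: owned by the device */

    r->tail = (r->tail + 1) % r->size;
    *r->tail_reg = r->tail;               /* doorbell: hand the descriptor over */
}
```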

3.4. UserSpace Zero-Copy Driver

The zero-copy module consists of a kernel zero-copy driver, a shared memory pool, and a userspace PMD. The zero-copy implementation owes much to a DMA-coherent shared memory pool allocated by the zero-copy driver. This pool consists of a hardware descriptor ring (ring buffer), a software descriptor ring (sw_ring), and a contiguous memory region for sending and receiving. Packet movement between the accelerator and the application is implemented via DMA plus mmap, so packets are transferred by the DMA engine without involving the CPU.
Figure 4 shows the zero-copy process, taking packet reception as an example: (1) the ZC (zero-copy) driver initializes by allocating a memory pool for rx_ring, sw_ring, and mbufs; (2) it requests an mbuf; (3) it puts the physical address of the new mbuf into the hardware descriptor, setting the status bit DD (descriptor done) to 0 (indicating available); and (4) it assigns the virtual address of the mbuf to the software descriptor (sw_ring) for direct access by the application. (5) The driver starts from the first descriptor and polls the DD flag of the hardware descriptor: if it is 1, the mbuf corresponding to this descriptor is considered unavailable; if it is 0, it can be acquired by the accelerator via DMA. (6) When the packet arrives, the DMA engine resolves the physical address of the mbuf from the fetched descriptor and then (7) writes the packet from the queue to the specified hardware address (an mbuf element corresponding to a sw_ring descriptor), setting the DD flags of the rx_ring and sw_ring descriptors and other information. (8) The application checks for packet arrivals by PMD polling.
When the PMD reads packets, its interpretation of the DD bit is the opposite of the accelerator’s. It first checks whether the DD bit is 1. If it is not 1, there is no packet to read; if it is 1, the accelerator has DMAed the data into memory and it can be read. After reading, the application sets the DD bit back to 0, and the mbuf is released for reuse. This cooperation between the driver, shared memory, and DMA eliminates the resource consumption caused by CPU copies in traditional data processing and realizes zero-copy in userspace. A sketch of this polling loop follows.
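The receive-side polling described above can be sketched as follows. The descriptor layout and the DD convention (1 = filled by the device, 0 = owned by the driver) follow this section's description, but the structure and function names are otherwise illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define DD_BIT 0x1u

/* Illustrative software descriptor: the virtual-address view of an rx_ring slot. */
struct sw_desc {
    void     *mbuf_virt;   /* where the DMA engine wrote the packet */
    uint16_t  pkt_len;
    uint16_t  flags;       /* DD set by hardware once the packet is in memory */
};

/* PMD-style receive: walk the software ring, harvest every descriptor whose
 * DD bit is set, hand the mbuf pointer to the application without copying,
 * then clear DD so the slot can be refilled and reused by the accelerator. */
static unsigned int zc_rx_poll(struct sw_desc *ring, uint32_t size,
                               uint32_t *next, void **pkts, uint16_t *lens,
                               unsigned int max_pkts)
{
    unsigned int n = 0;

    while (n < max_pkts) {
        struct sw_desc *d = &ring[*next];
        if (!(d->flags & DD_BIT))      /* DD == 0: nothing new to read */
            break;

        pkts[n] = d->mbuf_virt;        /* zero-copy: pass the buffer itself */
        lens[n] = d->pkt_len;
        d->flags &= ~DD_BIT;           /* return ownership to the driver/device */

        *next = (*next + 1) % size;
        n++;
    }
    return n;
}
```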

3.5. Application Affinity for Serverless

In a multi-core endpoint that runs multiple cloud functions simultaneously, the cache of each core stores the information used by the function service running on it. If the operating system schedules the service to other cores while it is running, the CPU cache hit rate decreases and the program runs less efficiently. The most intuitive way to solve this problem is to bind a serverless node to a CPU core: with CPU binding, it always runs in RTC (run-to-completion) mode on the specified core, reducing context switching.
The Accelerator supports the affinity feature of cloud-function services through the RSS multi-queue, the ZC driver, and the deterministic DMA module acting together. The driver is responsible for initializing the accelerator, registering the kernel interface, and allocating DMA-accessible buffers for descriptors. The driver reads the parameters exposed by the accelerator through registers, such as the number of queues and the timestamp, and combines them with the number of CPU cores to derive the number of queues to activate: Num = min(number of queues, number of CPU cores). The driver then allocates a memory pool for each queue and requests Num interrupt numbers, using irqbalance [30] to assign a unique IRQ number to each queue.
Application affinity applies in two scenarios. The first is load balancing: when the RSS module receives a packet, it sends packets to different queues in a balanced manner according to the flow rules, after which the queue issues an interrupt request to its bound core according to the IRQ number. The DMA controller copies the packets to the memory pool corresponding to that queue, and the service can then obtain packets by polling the memory pool corresponding to that IRQ number. The second scenario is isolating essential business processes based on application priority. Real-time processes with high scheduling priority can be bound to a designated core, ensuring real-time scheduling and preventing processes on other CPUs from interfering with that real-time process. Binding the control-plane thread and each data-plane thread to different CPUs eliminates the performance cost of repeated back-and-forth scheduling, and the threads finish their work without interfering with each other. A generic affinity sketch is shown below.
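The run-to-completion binding described above can be expressed with standard Linux affinity calls. This is a generic sketch using pthread_setaffinity_np(), not the Accelerator driver's own code; the queue count of 1000 follows the design in this paper, and the core assignment policy is an illustrative choice.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Num = min(number of queues, number of CPU cores), as used by the driver
 * to decide how many queues to activate. */
static long active_queue_count(long hw_queues)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    return hw_queues < cores ? hw_queues : cores;
}

/* Pin the calling worker thread to one core so it runs to completion there,
 * keeping its queue's packets warm in that core's cache. */
static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    long num = active_queue_count(1000);   /* 1000 instantiated hardware queues */
    printf("activating %ld queues, one per core\n", num);
    /* Each per-queue worker would call pin_to_core(queue_id % num). */
    return pin_to_core(0) == 0 ? 0 : 1;
}
```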

4. Performance Evaluation

The programmable deterministic multi-queue accelerator can theoretically support thousands of transmission queues for data transfer across multiple endpoints. The main differences between our accelerator and traditional DPDK-enabled NICs are:
(i)
Our accelerator can forward packets directly without host involvement and adds multi-layer forwarding such as VxLAN [30] encapsulation.
(ii)
Under the control of the hardware clock and the per-queue Tx scheduler, our accelerator can implement deterministic sending, with the DMA controller giving higher priority to data flows with real-time requirements.
This section evaluates our accelerator prototype to show that it meets the design requirements above. In Section 4.1, we introduce the prototype platform and each module’s specific implementation. Section 4.2 presents the experimental simulation topology used to verify the two design requirements above. Section 4.3 describes the research methods. Finally, Section 4.4 analyzes the experimental results, shows the hardware resource usage of the accelerator prototype, and compares it with related experiments of the same type. We implement the different functional engines and test our accelerator end-to-end; the results show that the accelerator works with the zero-copy DMA driver and sends and receives packets according to user requirements.

4.1. Prototype System

We implemented the prototype system on two types of FPGA boards, a Virtex-7 X690T and an X2000. The host configuration is an Intel(R) Core(TM) i7-10510U CPU, 1.8 GHz, 8 GB RAM, 8 cores, and the host OS is KylinOS with Linux kernel 4.18.0-25-generic. The zero-copy driver is based on dpdk-19.11.1. We instantiate a multi-queue FPGA accelerator that configures one interface with a number of hardware queues and writes the accelerator’s configuration parameters to specific registers. The FPGA resources required for deploying various numbers of queues are shown in Figure 5.
The zero-copy driver still follows the logic of DPDK. We made the main modifications in the following aspects:
(i)
Added the device information corresponding to the FPGA board to DPDK’s device driver list and provided driver support for this device through PCI device registration;
(ii)
Implemented device probe and release functions and mapped hardware addresses in the probe function;
(iii)
Created the DPDK Ethernet device object, read the parameters from the device registers, and instantiated the object, mainly covering the configuration of device functions, initialization of the transmit/receive queues, port enable, and link UP/DOWN;
(iv)
Implemented the device’s bulk receive and send functions to support packet transmission and reception on each enabled network interface and hardware queue.
Furthermore, the programmable multi-queue module in the accelerator is the central component for achieving parallelism, programmability, and determinism. In this module, the Agent programs the flow table with OpenFlow, and the Repeater’s working logic is described in Algorithm 1.

4.2. Experimental Topology

For our end-to-end experiment, we evaluated the accelerator on a small test bench consisting of two end nodes and an intermediate forwarding node. The end nodes are responsible for sending and receiving data packets, and the intermediate node forwards them. All nodes are equipped with our accelerator prototype. The end nodes and the two ports of the intermediate node are connected in pairs using 100 G optical fiber. We program the accelerator model, implemented in Verilog, onto the FPGA boards. The multi-node network topology based on the FPGA boards is shown in Figure 6. After DPDK initialization completes, sending and receiving operations can be performed as required. Take “pktgen”, a sample program in the examples directory of the zero-copy driver: after it runs, it starts a send/receive thread on each corresponding core according to the number of enabled queues and then distributes received packets directly to the designated output port after light processing according to the forwarding rules.
Algorithm 1 Rule-based flow classification algorithm
Input: incoming packet Pkt, flow_table Table, hash_key K
Output: an action to deliver the packet
1:  function Flow_Classifier(Pkt)
2:      Extract Pkt.tuple, Pkt.type from Pkt.header
3:      if Pkt.type != QUIC then flag ← 2
4:      else flag ← 1
5:      end if
6:      return flag
7:  end function
8:  /* Get the packet type and flag from the packet header */
9:  function Match(Table, flag, Pkt.tuple)
10:     if Pkt.tuple match in Table is drop then ret ← -1
11:     else if Pkt.tuple match in Table is forward then ret ← 0
12:     else Pkt.tuple match in Table is process; ret ← 1
13:     end if
14:     signal ← flag * ret
15:     return signal
16: end function
17: /* Get signal using the flag matching flow table */
18: function Repeater(Pkt, signal, K)
19:     if signal = 1 then
20:         Extract streamid from quic payload
21:         num ← hash(K, streamid)
22:         delivery to num-th queue for processing
23:     else if signal = 0 then
24:         num ← hash(K, streamid)
25:         delivery to num-th queue for transmission
26:     else if signal < 0 then drop it
27:     else num ← hash(K, Pkt.tuple)
28:         delivery to num-th queue for processing
29:     end if
30: end function
31: /* The forwarding behavior of different types of packets */

4.3. Research Methods

Evaluation goals: Three main features are verified. The first is the implementation of the zero-copy driver: the sample program (e.g., pktgen) is run on every node, and the zero-copy driver’s adaptation to the programmable multi-queue accelerator is demonstrated by observing the driver in operation. The second is the programmability of the accelerator: at the intermediate node (X2000), the forwarding behavior is programmed by remotely configuring the flow table so that traffic from Host0 is forwarded directly to Host1 without processing. The third is the scalability of the accelerator’s multi-layer forwarding: the VxLAN tunneling feature in the RSS module is enabled to establish hardware-implemented VxLAN network tunnels between the three nodes, enhancing the scalability of the experimental model for deployment in cloud data-center networks.
Evaluation variables and parameters: To achieve the above experimental objectives, two variables, queue count and packet length, are set during the experiments to test the total throughput of all queues for different queue counts when transmitting packets of various sizes. The accelerator’s VxLAN virtualization feature is also toggled to further verify the accelerator’s multilayer forwarding function. In addition, we tested the forwarding latency of various packet sizes under a single queue. Throughput is obtained directly from a script that redirects the corresponding network port’s traffic counters to a background text file (.txt) every 1 s. Obtaining latency results is less straightforward. Recall that there are three nodes in the simulation topology: two for sending and receiving traffic and one for forwarding. The receiving node captures the data packets with timestamps on the network interface and saves them in .pcap format, so the latency is obtained from the per-packet timestamps. Because each test runs for a long time and captures many packets, we uniformly sample a small subset of packets for the calculation.
Experiment process and calculation: We test the throughput of different queue configurations (1, 2, 4, 8 queues) with varying packet sizes (64 B, 128 B, 256 B, 512 B, 1024 B, 1536 B) in scenarios where VxLAN is enabled and disabled, respectively. Each experiment lasts 1 min. To ensure accurate results and reduce error, we sample data evenly and calculate averages. When calculating the average total throughput, 50 of the acquired samples are used to compute the mean. Since the number of captured packets each time is large (far more than 1000), 50 packets are selected by uniform sampling when calculating the average latency. The average throughput and latency are calculated as shown in Equations (1)–(3): Equation (1) gives the average total throughput (TPS) under various packet sizes (PS) for each queue count, while Equations (2) and (3) give the average latency (LAT) under various packet sizes for each queue count, where the per-packet latency is the difference between the receive and send timestamps.
$TPS^{PS}_{AVG} = \frac{1}{50}\sum_{K=1}^{50} TPS^{PS}_{K}$  (TPS for Throughput, PS for packet_size)  (1)
$LAT^{PS}_{AVG} = \frac{1}{50}\sum_{K=1}^{50} LAT^{PS}_{K}$  (LAT for Latency)  (2)
$LAT_{K} = TSP.recv - TSP.sent$  (TSP for TimeStamp)  (3)

4.4. Results Analysis and Comparison

Figure 7 shows the experimentally measured relationship between the average total throughput and packet size. We also evaluate the average forwarding latency of a single queue transmitting packets of various sizes (64 B, 128 B, 256 B, 512 B, 1024 B, 1536 B); Figure 8 shows the experimental results.
The ability to run the sample program (pktgen) and send and receive packets properly indicates that the accelerator correctly adapts to the zero-copy driver. Figure 7 shows that, for the same packet size, the more queues are enabled, the higher the throughput. As the packet size increases, the growth rate of the total throughput slows. At a packet size of 1536 B, the throughput with eight queues approaches the line rate (100 G), which means that an eight-core host may meet the line-rate forwarding requirement by using our accelerator. With VxLAN enabled on all three nodes, the same packets can still be sent and received by programming the flow table. Compared with the equivalent case with VxLAN disabled, throughput decreases, but the impact is small, verifying the multilayer forwarding capability of the accelerator. Considering the required FPGA resources together with the throughput gain, every time the number of instantiated queues doubles, the consumed FPGA hardware resources more than double, while the throughput increases by less than half.
Figure 8 shows the relationship between the average latency, computed as Timestamp.recv minus Timestamp.sent, and the packet size, with a single queue enabled on both the sender and receiver ends. As the packet size increases, the sending and receiving latency also gradually increases. Since the intermediate node forwards each data packet, the one-way send/receive latency should be about half of the calculated average latency. For example, the average latency for 64 B packets is around 6.37, compared with 7.05 for 1536 B packets. After verifying the forwarding performance, we referred to the experimental parameters and design of the ovs-DPDK [31] forwarding latency test and compared our results with this state-of-the-art forwarding software. At its lowest, the forwarding latency approaches the performance of ovs-DPDK.
The above two experiments verified and analyzed the accelerator’s adaptation to the zero-copy driver, the high throughput of packet sending and receiving, and the programmability and low latency of packet forwarding, generally achieving the experimental goals above. Our future work will continue to focus on the accelerator, mainly extending and implementing deterministic forwarding for multiple queues and further exploring higher-performance programming between the Agent and the flow tables.

5. Related Work

Several features introduced by our Accelerator are verified above. This zero-copy-driven programmable architecture for multi-queue flow forwarding can support 100 Gbps and is largely orthogonal to the related efforts below. In this section, we review the most closely related prior work.
Multi-queue management based on the hardware clock. The Accelerator is similar to the Corundum NIC [28] in that it has hardware queues and supports time synchronization based on a hardware clock; Corundum implements a microsecond-precision time-division multiple access (TDMA) hardware scheduler. However, Corundum uses a hardware scheduler to manage its queues, which lacks considerable flexibility compared with the Accelerator’s software-based management.
Multistream-oriented scheduling. The Accelerator is also similar to PANIC [32], which improves fairness between competing flows and provides features such as offload chaining without involving the CPU. Adopting PANIC’s scheduler and non-blocking crossbar can address fundamental problems of many-core NICs. In terms of flexibility, however, the flow-rule-based multistream scheduling adopted by the Accelerator is programmable, which is more advantageous than PANIC’s hardware-based scheduling.
New transport protocol acceleration support. Some NICs have protocol acceleration support with hard-wired transport protocols [33,34]. However, they only implement two protocols, either RoCE [33] or a vendor-selected TCP variant, and can only be modified by their vendor. The Accelerator enables programmers to implement various transport protocols (such as QUIC) with modest effort. Without a publicly available detailed description of these NIC architectures, we could not compare our design decisions with theirs.
Accelerating network functionality. Several academic and industrial projects offload the virtual switching and network functions of the end host to FPGAs and process the generated packet streams [35,36,37,38]. In contrast, the Accelerator can directly forward packets according to a flow_table programmed in advance, without involving the host CPU, and the forwarding behavior can be changed when necessary by remotely programming the flow_table. This approach is more adaptable than the above-mentioned technologies.

6. Conclusions

The microservice-oriented programmable deterministic multi-queue Accelerator is a feasible attempt to meet the bursty traffic and diverse demands of serverless nodes and to break through the limitations of commercial NICs. This paper presents the design and implementation of the programmable deterministic multi-queue FPGA accelerator and its supporting zero-copy driver. The advantages and disadvantages of various zero-copy technologies are compared in terms of deployment and performance, and the driver for this multi-queue hardware is designed based on the driver logic of the best-performing framework, DPDK. A prototype 100 G Accelerator is built using two types of FPGA boards and evaluated on a multi-node simulation topology. Experiments show that the zero-copy driver can be adapted to the Accelerator to achieve zero-copy, multi-queue packet sending and receiving, that it supports programmable multi-queue packet forwarding, and that it extends multi-level forwarding to meet the data transmission requirements of microservice network architectures.

Author Contributions

Conceptualization, G.L.; methodology, X.Y.; validation, J.W. and Z.L.; formal analysis, G.L. and X.Y.; investigation, J.W.; writing—original draft preparation, J.W., G.L. and Z.L.; writing—review and editing, J.W.; visualization, J.W.; supervision, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2020YFB1805603).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. García-Dorado, J.L.; Mata, F.; Ramos, J.; del Río Santiago, P.M.; Moreno, V.; Aracil, J. High-performance network traffic processing systems using commodity hardware. In Data Traffic Monitoring and Analysis; Springer: Berlin/Heidelberg, Germany, 2013; pp. 3–27. [Google Scholar]
  2. Liao, G.; Zhu, X.; Bhuyan, L. A new server I/O architecture for high speed networks. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, San Antonio, TX, USA, 12–16 February 2011; pp. 255–265. [Google Scholar]
  3. Kim, J.; Soh, Y.J.; Izraelevitz, J.; Zhao, J.; Swanson, S. SubZero: Zero-copy IO for persistent main memory file systems. APSys 2020, 1–8. Available online: https://juno-kim.github.io/papers/apsys2020-kim.pdf (accessed on 24 August 2020).
  4. MacMichael, D. Introduction to Receive Side Scaling; Microsoft Hardware Dev Center: Redmond, WA, USA, 2017. [Google Scholar]
  5. Iyengar, J.; Thomson, M. QUIC: A UDP-Based Multiplexed and Secure Transport. RFC 9000. May 2021. Available online: https://www.tzi.de/~cabo/rfc9000.pdf (accessed on 27 May 2021).
  6. Thomson, M.; Turner, S. Using TLS to Secure QUIC. RFC9001. May 2021. Available online: https://www.rfc-editor.org/info/rfc9001 (accessed on 27 May 2021).
  7. Iyengar, J.; Swett, I. Internet Engineering Task Force. QUIC Loss Detection and Congestion Control. RFC9002. May 2021. Available online: https://www.hjp.at/doc/rfc/rfc9002.html (accessed on 27 May 2021).
  8. Oliveira, M.V.M.; Sampaio, A.; Cavalcanti, A. Local livelock analysis of component-based models. In International Conference on Formal Engineering Methods; Springer: Cham, Switzerland, 2016; pp. 279–295. [Google Scholar]
  9. Song, J.; Alves-Foss, J. Performance review of zero copy techniques. Int. J. Comput. Sci. Secur. IJCSS 2012, 6, 256. [Google Scholar]
  10. Kim, J.H.; Na, J.C. A study on one-way communication using PF_RING ZC. In Proceedings of the 2017 19th International Conference on Advanced Communication Technology (ICACT), PyeongChang, Korea, 19–22 February 2017; pp. 301–304. [Google Scholar]
  11. Rizzo, L. netmap: A novel framework for fast packet I/O. In Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), Boston, MA, USA, 2012; pp. 101–112. Available online: https://www.usenix.org/system/files/conference/atc12/atc12-final186.pdf (accessed on 13 June 2012).
  12. Tu, W.; Wei, Y.H.; Antichi, G.; Pfaff, B. Revisiting the open vswitch dataplane ten years later. In Proceedings of the 2021 ACM SIGCOMM Conference, Online, 23–27 August 2021; pp. 245–257. [Google Scholar]
  13. Data Plane Development Kit (DPDK): A Software Optimization Guide to the User Space-Based Network Applications; CRC Press: Boca Raton, FL, USA, 2020.
  14. Gallenmüller, S.; Emmerich, P.; Wohlfart, F.; Raumer, D.; Carle, G. Comparison of frameworks for high-performance packet IO. In Proceedings of the 2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Oakland, CA, USA, 7–8 May 2015; pp. 29–38. [Google Scholar]
  15. Ntop Website Introducing PF_RING ZC (Zero Copy). 2014. Available online: http://www.ntop.org/pf_ring/introducing-pf_ring-zc-zero-copy/ (accessed on 14 April 2014).
  16. Deri, L.; Cardigliano, A.; Fusco, F. 10 Gbit line rate packet-to-disk using n2disk. In Proceedings of the 2013 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Turin, Italy, 14–19 April 2013; pp. 441–446. [Google Scholar]
  17. Deri, L. How to Monitor Latency Using nProbe. 2011. Available online: https://www.ntop.org/nprobe/how-to-monitor-latency-using-nprobenagios-world-conference-europe/ (accessed on 9 April 2011).
  18. Rizzo, L.; Carbone, M.; Catalli, G. Transparent acceleration of software packet forwarding using netmap. In Proceedings of the 2012 Proceedings IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012; pp. 2471–2479. [Google Scholar]
  19. Kim, J.; Huh, S.; Jang, K.; Park, K.; Moon, S. The power of batching in the click modular router. In Proceedings of the Asia-Pacific Workshop on Systems, Seoul, Korea, 23–24 July 2012; pp. 1–6. [Google Scholar]
  20. Casoni, M.; Grazia, C.A.; Patriciello, N. On the performance of linux container with netmap/vale for networks virtualization. In Proceedings of the 2013 19th IEEE International Conference on Networks (ICON), Singapore, 11–13 December 2013; pp. 1–6. [Google Scholar]
  21. Rizzo, L.; Lettieri, G.; Maffione, V. Very high speed link emulation with TLEM. In Proceedings of the 2016 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN), Rome, Italy, 13–15 June 2016; pp. 1–6. [Google Scholar]
  22. Brouer, J.D.; Høiland-Jørgensen, T. XDP: Challenges and future work. In Proceedings of the Linux Plumbers Conference, Heraklion, Greece, 4–7 December 2018. [Google Scholar]
  23. Brouer, J.D.; Gospodarek, A. A practical introduction to XDP. In Proceedings of the Linux Plumbers Conference. Available online: https://people.netfilter.org/hawk/presentations/LinuxPlumbers2018/presentation-lpc2018-xdp-tutorial.pdf (accessed on 3 November 2018).
  24. Emmerich, P.; Ellmann, S.; Bonk, F.; Egger, A.; Sánchez-Torija, E.G.; Günzel, T.; Di Luzio, S.; Obada, A.; Stadlmeier, M.; Voit, S.; et al. User space network drivers. In Proceedings of the 2019 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Cambridge, UK, 24–25 September 2019; pp. 1–12. [Google Scholar]
  25. Pfaff, B.; Pettit, J.; Koponen, T.; Jackson, E.; Zhou, A.; Rajahalme, J.; Gross, J.; Wang, A.; Stringer, J.; Shelar, P.; et al. The Design and Implementation of Open vSwitch. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), Oakland, VA, USA, 4–6 May 2015; pp. 117–130. [Google Scholar]
  26. Suné, M.; Köpsel, A.; Alvarez, V.; Jungel, T. xDPd: eXtensible DataPath Daemon; EWSDN: Berlin, Germany, 2013. [Google Scholar]
  27. Pal, S.P.; Ian, K.K.S.; Ray, K.C. FPGA implementation of stream cipher using Toeplitz Hash function. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Delhi, India, 24–27 September 2014; pp. 1834–1838. [Google Scholar]
  28. Forencich, A.; Snoeren, A.C.; Porter, G.; Papen, G. Corundum: An open-source 100-gbps nic. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; pp. 38–46. [Google Scholar]
  29. Chavan, A.; Nagurvalli, S.; Jain, M.; Chaudhari, S. Implementation of fpga-based network synchronization using IEEE 1588 precision time protocol (ptp). In Recent Findings in Intelligent Computing Techniques; Springer: Singapore, 2018; pp. 137–143. [Google Scholar]
  30. Yan, Y.; Wang, H. Open vSwitch Vxlan performance acceleration in cloud computing data center. In Proceedings of the 5th International Conference on Computer Science and Network Technology (ICCSNT), Changchun, China, 10–11 December 2016; pp. 567–571. [Google Scholar]
  31. Shanmugalingam, S.; Ksentini, A.; Bertin, P. DPDK Open vSwitch performance validation with mirroring feature. In Proceedings of the 2016 23rd International Conference on Telecommunications (ICT), Thessaloniki, Greece, 16–18 May 2016; pp. 1–6. [Google Scholar]
  32. Lin, J.; Patel, K.; Stephens, B.E.; Sivaraman, A.; Akella, A. PANIC: A High-Performance Programmable NIC for Multi-tenant Networks. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Banff, Alberta, Canada, 4–6 November 2020; pp. 243–259. [Google Scholar]
  33. RDMA and RoCE for Ethernet Network Efficiency Performance. Available online: http://www.mellanox.com/page/products_dyn?product_family=79&mtag=roce (accessed on 19 August 2019).
  34. TCP Offload Engine (TOE). Available online: https://www.chelsio.com/nic/tcp-offload-engine/ (accessed on 1 August 2019).
  35. Arashloo, M.T.; Ghobadi, M.; Rexford, J.; Walker, D. Hotcocoa: Hardware congestion control abstractions. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks, Palo Alto, CA, USA, 30 November–1 December 2017; pp. 108–114. [Google Scholar]
  36. Firestone, D.; Putnam, A.; Mundkur, S.; Chiou, D.; Dabagh, A.; Andrewartha, M.; Angepat, H.; Bhanu, V.; Caulfield, A.; Chung, E.; et al. Azure Accelerated Networking: SmartNICs in the Public Cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, USA, 9–11 April 2018; pp. 51–66. [Google Scholar]
  37. Lavasani, M.; Dennison, L.; Chiou, D. Compiling high throughput network processors. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2012; pp. 87–96. [Google Scholar]
  38. Li, B.; Tan, K.; Luo, L.; Peng, Y.; Luo, R.; Xu, N.; Xiong, Y.; Cheng, P.; Chen, E. Clicknp: Highly flexible and high performance network processing with reconfigurable hardware. In Proceedings of the 2016 ACM SIGCOMM Conference, Florianopolis, Brazil, 22–26 August 2016; pp. 1–14. [Google Scholar]
Figure 1. Deterministic multi-queue accelerator model.
Figure 2. Hardware descriptor usage.
Figure 3. Multi-queue receiving process.
Figure 4. Zero-copy workflow.
Figure 5. FPGA resource requirements for various queue counts (on the X2000).
Figure 6. Experiment topology.
Figure 7. Throughput with various queue counts.
Figure 8. Forwarding latency with various packet sizes.
Table 1. Comparison of Four Zero-Copy DMA Technologies.

| Framework  | Multi-Queue | Multi-Core | Kernel Bypass | Driver       | I/O Sys_Call | License     | API     | Throughput/Core |
|------------|-------------|------------|---------------|--------------|--------------|-------------|---------|-----------------|
| PF_RING ZC | Yes         | Yes        | Yes           | ZC driver    | Custom       | Proprietary | libpcap | ≈ DPDK          |
| Netmap     | Yes         | No         | Yes           | Linux patch  | Standard     | BSD         | libc    | < DPDK          |
| AF_XDP     | Yes         | Yes        | No            | Linux driver | Standard     | MIT         | libbpf  | < DPDK          |
| DPDK       | Yes         | Yes        | Yes           | UIO          | Custom       | BSD         | PMD     | /               |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Wang, J.; Lv, G.; Liu, Z.; Yang, X. Programmable Deterministic Zero-Copy DMA Mechanism for FPGA Accelerator. Appl. Sci. 2022, 12, 9581. https://doi.org/10.3390/app12199581
