Article

ORNIC: A High-Performance RDMA NIC with Out-of-Order Packet Direct Write Method for Multipath Transmission

1 National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, No. 21, North Fourth Ring Road, Haidian District, Beijing 100190, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, No. 19(A), Yuquan Road, Shijingshan District, Beijing 100049, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 88; https://doi.org/10.3390/electronics14010088
Submission received: 4 November 2024 / Revised: 25 December 2024 / Accepted: 27 December 2024 / Published: 28 December 2024
(This article belongs to the Topic Advanced Integrated Circuit Design and Application)

Abstract

Remote Direct Memory Access (RDMA) technology provides a low-latency, high-bandwidth, and CPU-bypassed method for data transmission between servers. Recent works have proved that multipath transmission, especially packet spraying, can avoid network congestion, achieve load balancing, and improve overall performance in data center networks (DCNs). Multipath transmission can result in out-of-order (OOO) packet delivery. However, existing RDMA transport protocols, such as RDMA over Converged Ethernet version 2 (RoCEv2), are designed for handling sequential packets, limiting their ability to support multipath transmission. To address this issue, in this study, we propose ORNIC, a high-performance RDMA Network Interface Card (NIC) with out-of-order packet direct write method for multipath transmission. ORNIC supports both in-order and out-of-order packet reception. The payload of OOO packets is written directly to user memory without reordering. The write address is embedded in the packets only when necessary. A bitmap is used to check data integrity and detect packet loss. We redesign the bitmap structure into an array of bitmap blocks that support dynamic allocation. Once a bitmap block is full, it is marked and can be freed in advance. We implement ORNIC on a Xilinx U200 FPGA (Field-Programmable Gate Array), which consumes less than 15% of hardware resources. ORNIC can achieve 95 Gbps RDMA throughput, which is nearly 2.5 times that of MP-RDMA. When handling OOO packets, ORNIC’s performance is virtually unaffected, while the performance of Xilinx ERNIC and Mellanox CX-5 drops below 1 Gbps. Moreover, compared with MELO and LEFT, our bitmap has higher performance and lower bitmap block usage.

1. Introduction

In recent years, with the development of online applications, such as Artificial Intelligence (AI), social networking, and Big Data (BD), data centers have become critical infrastructure. To support High-Performance Computing (HPC), a data center network (DCN) is required to provide high-bandwidth, low-latency, and stable data transmission capabilities. Specifically, the network interface bandwidth has increased from 40 Gbps to 100 Gbps, 400 Gbps [1], and even more. Due to the limited CPU frequency, software network stacks and traditional protocols (such as TCP) have failed to meet these demands. Remote Direct Memory Access (RDMA) offloads the entire data transmission from the CPU to a hardware Network Interface Card (NIC), which provides outstanding network performance with little CPU overhead.
Currently, RDMA over Converged Ethernet version 2 (RoCEv2) [2] has become the most widely adopted RDMA protocol in DCNs [3]. However, RoCEv2 is designed to handle sequential packets under single-path transmission. The Packet Sequence Numbers (PSNs) of received packets are expected to be in order. Discontinuous PSNs are directly identified as packet loss, triggering retransmission. However, over-reliance on single-path transmission may lead to reliability concerns, scalability issues, and performance bottlenecks in DCNs. Although the Equal-Cost Multi-Path (ECMP) routing algorithm is able to distribute flows to different paths using hashing, two heavy flows can be routed to the same path because of hash collision, thus causing a significant performance dip [4].
The typical network structure of a DCN is a multi-root tree topology, such as the spine-leaf [5], fat-tree [6], and Clos [7], which provide rich parallel paths. Compared with ECMP, per-packet multipath transmission provides higher flexibility, efficiency, and performance [8]. It can quickly adapt to changes in the network conditions, such as congestion, by which each packet is routed along the most suitable path [9]. Finer-grained load balancing across multiple paths can split the potential elephant flows and improve the utilization of network resources [10,11]. It also contributes to achieving higher overall performance and supporting larger cluster scales [12]. Industrial vendors such as Cisco [10] and Nvidia [9] have also adopted packet-level multipath transmission in DCNs.
However, multipath transmission leads to out-of-order (OOO) packets at the receiver. The RDMA NIC (RNIC) is the key hardware component in data transfer offloading, and handling OOO packets presents significant challenges for its design. First, the original RoCEv2 protocol does not support receiving OOO packets, and packet reordering in the RNIC suffers from low throughput and large storage requirements. Second, the RNIC needs to distinguish correctly between packet loss and out-of-order packets. Third, a bitmap is required to record received OOO packets for data integrity checks, and its on-chip memory footprint must be considered as the bitmap width and the number of Queue Pairs (QPs) increase.
In the existing studies, IRN [13] is a selective retransmission method for RDMA under single-path transmission. It uses a bitmap to record the incoming packets after packet loss. MP-RDMA [14] is a multipath transport method for RDMA. Every data packet of a WRITE or READ operation is embedded with a payload write address, and the payload of OOO packets is written directly to user memory. However, the embedded address increases the length of the packet header. The bitmap is allocated in advance for each QP, and its width is too narrow to record the following packets after a packet loss. The ACK-clocking mechanism creates as many response packets as request packets, which increases the workload for the network and the RNIC. MP-RDMA reaches 40 Gbps throughput. MELO [15] uses a linked list to improve bitmap memory utilization, and LEFT [16] improves MELO's performance using caches. However, these solutions have unstable random access performance and consume a great number of bitmap blocks in the case of packet loss.
To enhance the RNIC’s ability to handle out-of-order packets under multipath transmission, we present ORNIC, a high-performance RDMA NIC with an out-of-order packet direct write method. ORNIC supports 2048 QPs. When receiving an out-of-order packet, ORNIC calculates the memory address of the payload and writes the payload directly to user memory. We use a bitmap to track OOO packets, check data integrity, and detect packet loss. A register-scheduler circuit is proposed to transmit acknowledgment (ACK) messages. The contributions of this study are summarized as follows.
  • The study presents a high-performance RDMA NIC with out-of-order packet direct write method. ORNIC supports both sequential and OOO packet reception. The payload of OOO packets is written directly to user memory without reordering. The write address is embedded in the packets only when necessary. ORNIC is implemented on a U200 FPGA and can achieve 95 Gbps RDMA throughput, which is nearly 2.5 times that of MP-RDMA. When handling OOO packets, ORNIC’s performance is virtually unaffected, while the RDMA throughput of Xilinx ERNIC or Mellanox CX-5 drops below 1 Gbps.
  • To support data integrity checks when receiving OOO packets, we redesign the bitmap structure into an array of bitmap blocks that support dynamic allocation. Once a bitmap block is full, it is marked and can be freed in advance. Compared with MELO and LEFT, our bitmap has higher performance and lower bitmap block usage.
  • ORNIC is a compact design that consumes less than 15% of hardware resources on a U200 FPGA. While ORNIC requires more resources than ERNIC, it offers additional support for handling out-of-order packets.
The rest of this article is organized as follows. Section 2 reviews existing RDMA solutions that handle out-of-order packets or employ multipath transmission. Section 3 introduces the details and considerations of the design of ORNIC. In Section 4, we deploy ORNIC on a Xilinx U200 FPGA and evaluate the RDMA throughput, resource utilization, and bitmap of our design. Section 5 concludes this paper. Section 6 discusses future work.

2. Related Work

Packet reordering is a straightforward solution to the problem of out-of-order packets. However, the reordering depth and throughput are severely limited by the logic and memory resources [17] on hardware platforms. Hoang [18] proposes a high-performance sorting circuit, which consumes over 20% of the Adaptive Logic Modules (ALMs) of a Cyclone V FPGA under a 16-bit packet sequence number and a 64-depth reorder buffer for one session.
Instead of reordering, IRN [13] first raised the ideas of out-of-order packet delivery and a Bandwidth-Delay Product (BDP) send window in RDMA. To avoid Go-Back-N retransmission, IRN employs selective retransmission under the single-path scenario. The remote memory address is carried by every packet in order to write the payload directly to user memory. A BDP-window-width bitmap is used to track the arrival of packets. However, as the number of QPs and the transmission rate increase, the bitmap consumes too much memory space. SR-DCQCN [19] combines selective retransmission with the DCQCN [3] congestion control scheme. To improve bitmap utilization, MELO [15] employs a linked list of bitmap blocks and a shared resource pool. Each QP can access the head block or tail block of the bitmap, and bitmap blocks can be connected to each other by pointers. However, the bitmap is designed for single-path transmission and does not support random access. LEFT [16] optimizes MELO's design with a recently used cache for OOO packets. Whether to perform cache replacement is decided by calculating the time required to traverse the bitmap. However, caching is only a local optimization for performance and introduces additional design complexity. The random access performance of a linked list is still unstable. When packet loss occurs, a QP can consume a BDP-window-width bitmap.
To handle OOO packets, SRNIC [20] employs a similar in-place reordering method instead of on-chip reordering. SRNIC improves the scalability of the RNIC by repositioning data structures from the NIC into packets (such as the payload address) or host memory (such as the bitmap). All WRITE packets carry their target remote address, and the payload of OOO packets is placed in the user data buffer directly. However, when handling OOO packets, the authors overlook the issue of data consistency between WQEs, and the host-memory-located bitmap can severely degrade SRNIC's performance.
MP-RDMA [14] presents a packet-level multipath transport for RDMA. For WRITE and READ operations, the payload address is embedded in every data packet. The width of the bitmap equals the maximum OOO degree, and a retransmission is triggered if any packets fall outside the bitmap. An ACK-clocking mechanism is used to move the send window forward for each path. A synchronize flag in the WQE (Work Queue Element) ensures in-order memory updates if needed. However, it is unnecessary to add a payload address field to every RDMA data packet. The bitmap is pre-allocated for each QP, and its width is too narrow to record the incoming packets after packet loss is detected, causing additional packet drops. There are as many response packets as RDMA data packets, which increases the network and RNIC workload. MP-RDMA is implemented on an FPGA and achieves a 40 Gbps throughput.
Intermediate network devices can also be used to address the OOO packet issues in multipath transmission. AFR-MPRDMA [21] adopts a fast retransmission method that monitors the queue status on the switch. However, it requires the sender RNIC to record all the in-flight packets of each path, resulting in high memory consumption. ConWeave [22] employs an in-network reordering solution based on P4 [23] switches during packet rerouting. It further requires programmable capabilities from network devices, increasing the cost and difficulty of deployment.
Except for IRN, MP-RDMA, and SRNIC, the other methods above were only simulated and have not been realized on actual RNIC hardware. To implement an RNIC on an FPGA (Field-Programmable Gate Array), Xilinx proposed ERNIC (Embedded RDMA-Enabled NIC) [24], a soft IP core providing RoCEv2-enabled NIC functionality. The latest version of ERNIC is version 4.0, released in 2022. ERNIC is able to achieve a 100 Gbps line rate, and its module division and interface definition are instructive for RNIC design. ERNIC adheres to the RoCEv2 specification and does not support out-of-order packet reception. RecoNIC [25] is an open-source SmartNIC (Smart Network Interface Card) [26] architecture that integrates the ERNIC IP as an RDMA offload engine. Besides ERNIC, StRoM [27] is a programmable RNIC that supports the offloading of application-level kernels, and JingZhao [28] is an RNIC framework utilizing Go-Back-N and selective repeat modules. StRoM, JingZhao, and ERNIC all trigger retransmissions upon receiving OOO packets.
RDMA support for packet-level multipath transmission and out-of-order packet handling has become widespread in industry. Cisco proposed a fully scheduled network [10] to accelerate AI workloads. For the best overall performance, packets are sprayed across all available links from the ingress leaf switches and reordered at the egress leaf switches, and potential elephant flows are split to avoid network congestion. Nvidia presented the Spectrum-X network [9]. With an adaptive routing strategy, switches are able to select the least congested port to forward packets. This improves the effective bandwidth from 60% on standard Ethernet to 95% with Spectrum-X (1.6X). With Direct Data Placement (DDP) technology, OOO packets’ data are placed in the correct order directly in the host memory or Graphics Processing Unit (GPU). However, the above-mentioned companies have not disclosed further technical details.

3. System Design

In this section, we describe the details and considerations of the design of ORNIC.

3.1. Architecture Overview

User applications abstract an RDMA task into a Work Queue Element (WQE) and send the WQE to an RNIC for execution. The WQE commonly includes an opcode (such as WRITE, READ, SEND, RECV), local address, remote address, work request ID, remote key, etc. When the WQE is completed, the corresponding CQE (Completion Queue Element) is sent back to the user application. The WQE is placed in the Send Queue (SQ) and the CQE is placed in the Completion Queue (CQ). A Queue Pair (QP) represents an RDMA connection.
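For concreteness, the following C sketch shows one possible layout of the descriptor fields listed above; the struct layout and field names are illustrative assumptions for this paper's discussion, not ORNIC's actual WQE format.

```c
#include <stdint.h>

/* Illustrative WQE/CQE layouts; field names and widths are assumptions,
 * not ORNIC's actual descriptor format. */
enum wqe_opcode { OP_WRITE, OP_READ, OP_SEND, OP_RECV };

struct wqe {
    enum wqe_opcode opcode;   /* RDMA operation type            */
    uint64_t local_addr;      /* local buffer address           */
    uint64_t remote_addr;     /* remote buffer address          */
    uint32_t length;          /* transfer length in bytes       */
    uint32_t wr_id;           /* work request ID echoed in CQE  */
    uint32_t rkey;            /* remote memory key              */
};

struct cqe {
    uint32_t wr_id;           /* identifies the completed WQE   */
    int      status;          /* 0 on success                   */
};
```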
We adopt the block diagram of ERNIC [24] as a guideline and develop the ORNIC IP. The architecture of ORNIC is shown in Figure 1. It mainly consists of the RX (Receive) Plane, Control Plane, and TX (Transmit) Plane. The Register Manager module maintains all of the configurable registers, which indicate the state or information of ORNIC, and it shares their values with user applications (via the AXI4-Lite interface) and other inner modules. Note that the Ethernet MAC (Media Access Controller) and DMA (Direct Memory Access) subsystems are omitted in the diagram.

3.1.1. RX Plane

The RX Plane is responsible for processing all of the incoming RoCEv2 packets. After parsing the packets, we classify them into request packets (sent by the requester) and response packets (sent by the responder). We store the request packets and their headers (including MAC (Medium Access Control), IP (Internet Protocol), and BTH+ (Base Transport Header and other extended transport headers), excluding the payload) in the Req (Request) Pkt (Packet) Buffer and Req Hdr (Header) Buffer, respectively. The Req Processor module continually fetches a request packet’s header, performs logical processing, and forwards the result to the related modules. The DMA Engine module can write the packet payload to the user memory or fetch the packet payload from the user memory, achieving the functionality of RDMA. The workflow of the response packets is the same. In particular, if an incoming packet is out of order, either the Req Processor or the Rsp (Response) Processor would use the bitmap in the Bitmap Manager module to help trigger acknowledgment (ACK) or retransmission (NAK).

3.1.2. Control Plane

The Control Plane handles the behavior of ORNIC corresponding to the outstanding WQEs. An outstanding (or in-flight) WQE is a WQE that has been processed by ORNIC but not completely finished (acknowledged). If an incoming request packet triggers ORNIC to produce ACK or NAK packets or to generate READ response packets, the Req Processor registers the related information in the Req Manager module. If an incoming response packet triggers ORNIC to produce retransmission packets, the Rsp Processor registers the related information in the Rsp Manager module. Additionally, if an incoming ACK packet may trigger a CQE, its PSN is sent to the CQE Manager module.

3.1.3. TX Plane

The TX Plane generates and sends RoCEv2 packets. There are two events that trigger ORNIC to create packets. One is when user applications place WQEs in the SQ and ring the hardware doorbell. The other is when the Control Plane receives signals from the RX Plane and forwards them to the TX Plane. Additionally, every time the QP Manager module retrieves a WQE from the SQ and sends it to the Hdr Generator module, the Hdr Generator caches the WQE in BRAM and registers it with the CQE Manager.

3.2. Out-of-Order Packet Direct Write Method

The data transmission process of a WQE has a low-entropy problem [29]. The user application does not require in-order delivery of packets, but it must be notified when a transfer (WQE) is completed. However, the RoCEv2 specifications require the order of packets on the wire to be strictly maintained. For the sake of simplicity in protocol design, RoCEv2 imposes an additional in-order transmission requirement on the network. Within a WQE, this is not only unnecessary but also limits the flexibility and performance of networking.
Therefore, ORNIC supports both standard WQEs (following the RoCEv2 specifications, assuming single-path transmission) and OOO WQEs (assuming multipath transmission, enabling the out-of-order packet direct write method presented in this paper). Standard WQEs and OOO WQEs are distinguished by different opcodes of the WQE issued by the user application. ORNIC handles standard WQEs and their RoCEv2 packets like other typical RNICs, so this is omitted in this paper. For OOO WQEs, ORNIC enables all of the packets within an OOO WQE (henceforth referred to simply as a WQE) to be transmitted and delivered out of order. To avoid consistency issues, the next WQE is retrieved by the QP Manager only after the previous one has been completed. In the future, there are also alternative ways for the RoCEv2 protocol to solve the consistency issue, such as specifying a synchronize flag [14], using a reordering buffer [10,20], or developing an OOO Verbs API/QP [30].
We propose an out-of-order packet direct write method, in which the payload of OOO packets is written to user memory directly instead of being reordered. When an OOO packet reaches the receiver, the memory write address Write_Addr of the packet payload can be calculated from the start address of the RDMA message Start_Addr, the start PSN of the RDMA message Start_PSN, and the packet PSN Pkt_PSN. This is shown in Figure 2 and Equation (1).
Write_Addr = Start_Addr + (Pkt_PSN − Start_PSN) × PMTU    (1)
Here, PMTU is the Path Maximum Transfer Unit, which is the payload length of every RDMA data packet except the last one within a WQE.
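As a minimal software illustration of Equation (1) (not ORNIC's RTL), the C sketch below computes the payload write address for an arriving OOO packet; the 24-bit PSN masking is our own assumption about how wrap-around would be handled, which the paper does not state explicitly.

```c
#include <stdint.h>

/* Compute the user-memory write address of an out-of-order packet's
 * payload from Equation (1). Assumes every packet except the last one
 * in the WQE carries exactly one PMTU of payload, as in the paper. */
static uint64_t ooo_write_addr(uint64_t start_addr,  /* message start address  */
                               uint32_t start_psn,   /* first PSN of the WQE   */
                               uint32_t pkt_psn,     /* PSN of arriving packet */
                               uint32_t pmtu)        /* path MTU in bytes      */
{
    /* RoCEv2 PSNs are 24-bit; the mask keeps the offset correct across
     * a wrap (an assumption of this sketch). */
    uint32_t offset = (pkt_psn - start_psn) & 0xFFFFFF;
    return start_addr + (uint64_t)offset * pmtu;
}
```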
Existing methods [13,14,20] embed Write_Addr in every data packet for WRITE and READ operations. However, ORNIC does not embed Write_Addr in RDMA data packets unless it is necessary.
The packet exchange of a READ WQE and an OOO READ WQE is shown in Figure 3. For the READ operation, Start_PSN and Start_Addr of in-flight WQEs are cached at the requester ORNIC. When the requester ORNIC receives OOO READ response data packets, every packet’s payload can be written directly to the correct memory location according to Equation (1). So, Write_Addr does not need to be embedded in READ response data packets. The response packet types of a READ WQE and an OOO READ WQE are shown in Figure 4. The response data packet types for an OOO READ WQE are basically the same as those for a READ WQE, except for the BTH opcode.
The packet exchange of a WRITE WQE and an OOO WRITE WQE is shown in Figure 5. For the WRITE operation, the responder does not have the information of Start_Addr. So, at the beginning, all WRITE data packets carry Write_Addr. Once the responder has calculated and cached Start_Addr using Equation (2) (and informed the requester by setting a flag in the returned ACK packets), the requester ORNIC stops carrying Write_Addr in the following WRITE data packets within this WQE.
Start_Addr = Write_Addr − (Pkt_PSN − Start_PSN) × PMTU    (2)
The request data packet types for a WRITE WQE and an OOO WRITE WQE are shown in Figure 6. OOO WRITE FSpot (First Spot) packets are the WRITE data packets that carry Write_Addr. OOO WRITE Spot packets are the following WRITE data packets without the Write_Addr field. The OOO WRITE Last packet is the last WRITE data packet within a WQE, and it is used to indicate the end of a WQE and to update the Message Sequence Number (MSN). According to the RoCEv2 specifications, the MSN field in the response packet should be incremented by 1 for each complete request received.
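The sketch below models, under our own assumptions about the responder's per-WQE state, how Start_Addr can be recovered from an FSpot packet with Equation (2) and then reused to place later Spot packets via Equation (1); it is a software model, not ORNIC's hardware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-QP responder state for an in-flight OOO WRITE WQE.
 * Field names are illustrative. */
struct ooo_write_state {
    bool     start_addr_known;
    uint64_t start_addr;   /* recovered via Equation (2)  */
    uint32_t start_psn;    /* first PSN of the WQE        */
    uint32_t pmtu;         /* PMTU in bytes               */
};

/* Called for each arriving OOO WRITE data packet; returns the payload
 * write address. FSpot packets carry Write_Addr; once Start_Addr is
 * cached, later Spot packets (no address field) use Equation (1). */
static uint64_t place_write_payload(struct ooo_write_state *st,
                                    uint32_t pkt_psn,
                                    bool has_addr, uint64_t pkt_write_addr)
{
    uint32_t off = (pkt_psn - st->start_psn) & 0xFFFFFF;  /* 24-bit PSN */

    if (has_addr) {
        if (!st->start_addr_known) {
            /* Equation (2). The ACK returned for this packet would carry
             * the flag telling the requester to stop embedding addresses
             * in later packets of this WQE. */
            st->start_addr = pkt_write_addr - (uint64_t)off * st->pmtu;
            st->start_addr_known = true;
        }
        return pkt_write_addr;                            /* FSpot packet */
    }
    return st->start_addr + (uint64_t)off * st->pmtu;     /* Spot packet  */
}
```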

3.3. Multipath Transmission and Packet Loss Detection

Multipath transmission results in OOO packets at the RDMA data receiver. In the previous section, we dealt with the OOO data placement problem. However, OOO packets also raise the issue of loss detection. The discontinuity of PSNs, the method the RoCEv2 specifications use to detect packet loss under single-path transmission, is not applicable in our scenario. For ORNIC, we use the OOO Tolerance Distance (OTD) to detect packet loss. The OTD is the maximum OOO degree [21], and it can be obtained from the maximum multipath delay difference MDD and the packet sending bandwidth B in Equation (3).
OTD = MDD × B / Pkt_Length    (3)
where MDD is the maximum delay difference among multiple paths. In this paper, network delay refers to the time it takes for a packet to travel from source to destination, which is the one-way time, or half of the Round-Trip Time (RTT). MDD is defined in Equation (4).
MDD = (RTT_max − RTT_min) / 2    (4)
where RTT_max is the RTT of the slowest path and RTT_min is the RTT of the fastest path.
Once the PSN distance of the received RDMA data packets exceeds the OTD, packet loss is considered to have occurred, and retransmission is triggered.
In our design, the OTD is set to 64 by default for a network with a bandwidth of 100 Gbps, a PMTU of 4096 B, and a multipath delay difference of 20 us [31]. External modules can spray packets uniformly across all available paths. The OTD of each QP can be configured by registers. Different network environments and multipath strategies (such as random spray [12], equal spray [10], adaptive spray [9], and ACK-clocking spray [14]) result in different OTDs. We look forward to diverse implementations and believe that a precise, small OTD is conducive to triggering retransmissions properly and reducing bitmap resources.
To deal with the packet loss that cannot be detected with the OTD method above, such as when the size of a WQE is small or there are no more packets following the packet loss, the request-side ORNIC has a timeout mechanism. For a QP, if there is no ACK or retransmission request for a specific time, such as (RTT + OTD/B), the requester ORNIC retransmits packets from the unacknowledged PSN.
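To illustrate Equations (3) and (4) and the resulting loss check, here is a small C sketch; the unit choices (bits per second, microseconds, bytes) and the 24-bit PSN wrap handling are our own assumptions for the example, not part of the paper's specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* OOO Tolerance Distance from Equations (3) and (4). */
static uint32_t compute_otd(double bw_bps, double rtt_max_us,
                            double rtt_min_us, uint32_t pkt_len_bytes)
{
    double mdd_s = (rtt_max_us - rtt_min_us) / 2.0 * 1e-6;     /* Eq. (4) */
    return (uint32_t)(mdd_s * (bw_bps / 8.0) / pkt_len_bytes); /* Eq. (3) */
}

/* Receiver-side loss check: once the PSN distance between an arriving
 * packet and the expected PSN exceeds the OTD, the gap is treated as
 * loss and retransmission is requested. */
static bool psn_gap_indicates_loss(uint32_t pkt_psn, uint32_t expected_psn,
                                   uint32_t otd)
{
    uint32_t gap = (pkt_psn - expected_psn) & 0xFFFFFF;  /* 24-bit PSN space */
    return gap > otd;
}

/* With the paper's defaults (100 Gbps, PMTU 4096 B, 20 us maximum delay
 * difference), compute_otd(100e9, 40, 0, 4096) gives about 61, which is
 * rounded up to the configured default of 64. */
```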

3.4. Bitmap

The bitmap is a key structure for data integrity checks and loss recovery. In principle, we locate the bitmap at the RDMA data receiver to achieve a low OTD. For the READ operation, the bitmap records the READ response packets at the requester. For the WRITE operation, the bitmap records the WRITE data packets at the responder. Therefore, ORNIC only needs to maintain a QP-depth bitmap in the Bitmap Manager module, which can be accessed by the Req Processor and the Rsp Processor concurrently.
Recent multipath RDMA works [14,21] set the width of the bitmap to the OTD. However, this is not wide enough. When packets fall outside the bitmap and trigger retransmission, subsequent incoming packets are dropped due to insufficient width of the bitmap until the arrival of retransmission packets. In order to avoid packet dropping during retransmission, we increase the maximum bitmap width for each QP. When packet loss is detected by the OTD mechanism, in order to record the incoming packets in the bitmap until the arrival of the retransmission packets, the width of the bitmap should be the sum of the OTD and the BDP window size, which can be calculated using Equations (5) and (6).
BDP_Window = B × RTT / Pkt_Length    (5)
Bitmap_Width = OTD + BDP_Window    (6)
When the network bandwidth is 100 Gbps, the PMTU is 4096 B, and the RTT is 90 us (the 99th-percentile RTT for a commodity RDMA data center, as reported in Ref. [32]), BDP_Window is about 256. The OTD is set to 64, as determined before. Therefore, Bitmap_Width is 256 + 64 = 320.
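As a worked example of Equations (5) and (6), the following C sketch reproduces the sizing above; the unit choices are ours, and the code is only an illustration of the arithmetic.

```c
#include <stdint.h>

/* Bitmap sizing from Equations (5) and (6). Bandwidth in bits per
 * second, RTT in microseconds, packet length in bytes. */
static uint32_t bdp_window(double bw_bps, double rtt_us, uint32_t pkt_len_bytes)
{
    return (uint32_t)((bw_bps / 8.0) * rtt_us * 1e-6 / pkt_len_bytes); /* Eq. (5) */
}

static uint32_t bitmap_width(uint32_t otd, uint32_t bdp_win)
{
    return otd + bdp_win;                                              /* Eq. (6) */
}

/* With 100 Gbps, RTT = 90 us, and PMTU = 4096 B, bdp_window() returns
 * about 274 packets, in the same range as the roughly 256 used in the
 * paper, giving a per-QP bitmap width of 256 + 64 = 320 as in the text. */
```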
To optimize bitmap resource utilization, we were inspired by the idea of a dynamic bitmap block pool in MELO [15] and adapted it for OOO packet reception. Compared with MELO, ORNIC organizes the bitmap of each QP into an array (instead of a linked list) of bitmap blocks for better random access performance. Bitmap blocks can be dynamically allocated or released from the bitmap block pool by all QPs on demand. Moreover, a bitmap block can be freed in advance as soon as it is full, and a flag is set to indicate this. For example, Figure 7 shows the bitmap of QPx (which refers to a specific QP of ORNIC at runtime), which is an array of cyclic bitmap block nodes. Each bitmap block node consists of bitmap_block_id, valid, and all_1_flag. The bitmap_block_id is the index number (or address) of the bitmap block content, as indicated by the red arrow in Figure 7. Assume the received RDMA data packets’ PSNs start at 1 and the second packet (PSN = 2) is lost. The header pointer (header_ptr, which identifies the position of the expected PSN) of the bitmap is therefore stuck at the first bitmap block node, as indicated by the black arrow in Figure 7. Because the subsequent bitmap blocks are full, bitmap_block_node_1 and bitmap_block_node_2 occupy no bitmap blocks, and their all_1_flag and valid values are set to 1, as shown by the green part in Figure 7. The last node, bitmap_block_node_3, occupies one bitmap block to record the incoming OOO packets. Since the data packets within their blocks have not been completely received, bitmap_block_node_0 and bitmap_block_node_3 each consume one bitmap block, as shown by the orange part in Figure 7. The unused bitmap block nodes are represented in white in Figure 7. Therefore, our design greatly reduces the usage of bitmap blocks during packet loss.
The bitmap blocks also simplify ORNIC’s retransmission design. In Figure 7, assume the OTD is equal to twice the bitmap block width. Thus, if any packets fall in bitmap_block_node_3, ORNIC generates a NAK embedded with the bitmap block of bitmap_block_node_1 and sends it to the initiator of the retransmission.
In Refs. [15,16], the authors optimistically assumed the efficiency of retransmission and set the size of the bitmap block pool to one BDP. However, we increased its resource margin to enhance reliability. Specifically, in this study we implement ORNIC with support for 2048 QPs, as in ERNIC [24]. An active OOO QP consumes at least two bitmap blocks. So, the depth of the bitmap block pool is set to 2048 × 2 = 4096. The width of a bitmap block is set to 16 (M is 16 in Figure 7) to balance the waste of resources and the overhead of management. So, the bitmap blocks consume 16 bits × 4096 = 65,536 bits, and the width of a bitmap block id is log2(4096) = 12 bits. Because Bitmap_Width is 320, as calculated before, there are 320/16 = 20 bitmap block nodes for each QP (N is 20 in Figure 7). The bitmap block node array for each QP occupies 20 × (12 + 1 + 1) bits = 280 bits. For 2048 QPs, the bitmap block node arrays occupy 280 bits × 2048 = 573,440 bits. The size of ORNIC’s data structure is parameterizable in the hardware code.
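To make the node layout of Figure 7 concrete, here is a small C model of the per-QP bitmap block node array and the early-free behavior. The sizes follow the parameters above, while the field names, the pool representation, and the omission of allocation and free-list handling are our own simplifications; this is not ORNIC's RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_WIDTH   16      /* bits per bitmap block (M in Figure 7)      */
#define NODES_PER_QP  20      /* 320 / 16 block nodes per QP (N in Figure 7) */
#define POOL_DEPTH    4096    /* shared pool: 2048 QPs x 2 blocks            */

struct bitmap_block_node {
    uint16_t block_id;        /* index into the shared block pool (12 bits)  */
    bool     valid;           /* node is in use                              */
    bool     all_1_flag;      /* block was full and has been freed early     */
};

struct qp_bitmap {
    struct bitmap_block_node node[NODES_PER_QP]; /* cyclic node array        */
    uint32_t header_ptr;                         /* node holding expected PSN */
};

static uint16_t block_pool[POOL_DEPTH];          /* 16-bit block contents    */

/* Mark a packet as received. psn_offset is the packet's distance from the
 * PSN at the start of the header node; array indexing reaches the right
 * node in O(1), unlike a linked-list traversal. Block allocation from the
 * pool and free-list management are omitted in this sketch. */
static void mark_received(struct qp_bitmap *bm, uint32_t psn_offset)
{
    uint32_t idx = (bm->header_ptr + psn_offset / BLOCK_WIDTH) % NODES_PER_QP;
    struct bitmap_block_node *n = &bm->node[idx];
    if (n->all_1_flag)                            /* already full and freed  */
        return;
    block_pool[n->block_id] |= (uint16_t)(1u << (psn_offset % BLOCK_WIDTH));
    if (block_pool[n->block_id] == 0xFFFF) {      /* block is completely set */
        n->all_1_flag = true;                     /* free the block early    */
    }
}
```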

3.5. Acknowledgment Aggregation

For ORNIC, multiple acknowledgment (ACK) frames can be combined into a single frame for transmission, similar to TCP’s cumulative acknowledgment. In the Control Plane, we implement a “register-scheduler” circuit to transmit the ACKed PSN, as shown in Figure 8. In the figure, the responder ORNIC receives sequential WRITE data packets and informs the Req Manager module in the Control Plane to reply with ACKs. Since the packet (PSN = 258, DST QP ID = 2) has been received, the Req Processor registers PSN = 258 into QP2’s location of the ACKed PSN memory in the Req Manager, and QP2 is added to the scheduling queue. The PSN in the ACKed PSN memory can be overwritten. The scheduler forwards the active QP’s ACKed PSN to the Hdr Generator to produce the ACK packet (PSN = 27, SRC QP ID = 5). Similar circuit structures also exist in the Rsp Manager and CQE Manager modules. In this way, the performance of the RX Plane is limited only by itself and the PCIe (Peripheral Component Interconnect Express) bandwidth and is not affected by the Control Plane.
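As an illustration of the register-scheduler idea, the following C sketch models a per-QP ACKed-PSN register that the RX Plane overwrites and a scheduler that later drains one ACK per active QP, so many received packets collapse into a single ACK. The queue handling and names are illustrative assumptions, not the actual circuit.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_QPS 2048

static uint32_t acked_psn[NUM_QPS];   /* latest cumulative ACKed PSN per QP */
static bool     qp_pending[NUM_QPS];  /* QP already queued for an ACK?      */
static uint16_t sched_queue[NUM_QPS]; /* FIFO of QPs that need an ACK       */
static uint32_t q_head, q_tail;

/* RX Plane side: register the newest ACK-able PSN for a QP; earlier
 * values for the same QP are simply overwritten. */
static void register_acked_psn(uint16_t qp, uint32_t psn)
{
    acked_psn[qp] = psn;
    if (!qp_pending[qp]) {
        qp_pending[qp] = true;
        sched_queue[q_tail % NUM_QPS] = qp;
        q_tail++;
    }
}

/* Scheduler side: pop the next active QP and hand its latest PSN to the
 * header generator (modeled as out-parameters). Returns false when no
 * ACKs are pending. */
static bool schedule_ack(uint16_t *qp_out, uint32_t *psn_out)
{
    if (q_head == q_tail)
        return false;
    uint16_t qp = sched_queue[q_head % NUM_QPS];
    q_head++;
    qp_pending[qp] = false;
    *qp_out  = qp;
    *psn_out = acked_psn[qp];
    return true;
}
```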
Moreover, compared with MP-RDMA [14], for which the number of ACK packets is equal to the number of RDMA data packets, ORNIC is able to adaptively adjust the number of ACK packets sent, which can be as low as one ACK for each WQE. This consolidation greatly reduces the overhead associated with individual ACK frames for the RNIC and networks.

4. Implementation and Evaluation

4.1. RDMA Performance

ORNIC was integrated into the RecoNIC [25] RDMA architecture. Specifically, within RecoNIC, we were able to replace the Xilinx ERNIC [24] IP with our ORNIC IP and remove modules that were not relevant to the RDMA test. We implemented RecoNIC + ORNIC on a Xilinx Alveo U200 [33] FPGA board with a PCIe Gen3 × 16 interface and two 100 Gbps Ethernet ports.
The test environment is shown in Figure 9. The testbed consisted of two servers and a network emulator. The network emulator was implemented on a Xilinx Alveo U200 FPGA, which can disorder packets within a specified OOO distance. Each server was a Dell PowerEdge R740 with two Intel Xeon Silver 4216 2.1 GHz CPUs and 192 GB RAM. Every server was equipped with a U200 FPGA board and a Mellanox ConnectX-5 (CX-5) [34] NIC. The FPGA board could be loaded with a RecoNIC+ORNIC or RecoNIC+ERNIC program. The PMTU was 4 KB and the IP layer was IPv4. Within the user application, we measured the time interval T_completion from issuing a WQE to receiving a CQE. Therefore, the throughput of RDMA transmission, RDMA_Throughput, for different WQE sizes WQE_Size could be obtained using Equation (7).
RDMA_Throughput = WQE_Size / T_completion    (7)
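For reference, a minimal C sketch of the measurement behind Equation (7); post_wqe() and wait_for_cqe() in the usage comment are hypothetical placeholders for the driver calls used in the test, not a documented API.

```c
#include <stdint.h>
#include <time.h>

/* Throughput per Equation (7): WQE size divided by the time between
 * posting the WQE and receiving its CQE, converted to Gbps. */
static double rdma_throughput_gbps(uint64_t wqe_size_bytes,
                                   struct timespec start, struct timespec end)
{
    double t = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    return (wqe_size_bytes * 8.0) / t / 1e9;
}

/* Usage (pseudocode for the measurement loop):
 *   clock_gettime(CLOCK_MONOTONIC, &start);
 *   post_wqe(...);        // hypothetical driver call
 *   wait_for_cqe(...);    // hypothetical driver call
 *   clock_gettime(CLOCK_MONOTONIC, &end);
 *   throughput = rdma_throughput_gbps(size, start, end);
 */
```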
First, we evaluated the RDMA throughput of RecoNIC+ORNIC in an out-of-order scenario with an OOO WRITE/READ WQE size of 1 MB, 2 MB, 4 MB, …, 256 MB. The packets were disordered by the network emulator, whose OOO distance was less than 64. The result is shown in Figure 10. As illustrated, when the WQE size exceeded 64 MB, ORNIC’s RDMA throughput approached 95 Gbps.
Unlike other multipath RDMA solutions capable of handling OOO packets, ORNIC features a hardware implementation and achieves outstanding performance. Compared to the FPGA implementation of MP-RDMA [14], which achieved a maximum RDMA throughput of nearly 40 Gbps as reported, ORNIC’s performance is almost 2.5 times higher. Furthermore, AFR-MPRDMA [21] and LEFT [16] have not been implemented on hardware RNICs. Their evaluation remains limited to simulations.
For the WRITE operation, the responder was able to reply with ACKs as soon as it received WRITE data packets. For the READ operation, a CQE could only be triggered after the payload of the last READ data packet was written into the host memory. This is why the RDMA performance of READ is slightly lower than that of WRITE.
Then, we tested the performance of ORNIC and ordinary RNICs in the in-order and out-of-order scenarios. An ordinary RNIC represents the type of RDMA NIC designed for single-path transmission. Specifically, we compared the performance of RecoNIC+ORNIC, RecoNIC+ERNIC, and Mellanox CX-5 in the sequential and out-of-order scenarios. In the sequential scenario, ORNIC processed WRITE WQEs according to the RoCEv2 standard, and packets in the network were not disordered. Conversely, in the out-of-order scenario, the packets were disordered by the network emulator, whose OOO distance was less than 64, and ORNIC employed the method described in this paper to handle OOO WRITE WQEs. The result is presented in Figure 11. As shown, the performance of ERNIC and Mellanox CX-5 dropped below 1 Gbps when dealing with out-of-order packets, while ORNIC’s performance remained largely unaffected. This is because OOO packets result in numerous retransmissions for ordinary RNICs.

4.2. Resource Utilization

We compared the main hardware resource utilization of the soft RDMA IP cores of ORNIC and ERNIC. After synthesizing IPs supporting 2048 QPs on a U200 FPGA with Vivado 2021.2, the resource usage and utilization ratios of ORNIC and ERNIC were obtained and are shown in Table 1 and Figure 12, covering the Look-Up Table (LUT), Look-Up Table RAM (LUTRAM), Flip Flop (FF), Block RAM (BRAM), and UltraRAM (URAM) resources.
Overall, ORNIC required less than 15% of the hardware resources on a U200 FPGA. Although ORNIC consumed more resources than ERNIC, the cost was acceptable: ORNIC extended ERNIC by offering additional support for handling OOO packets in the presence of multipath transmission. Equipped with a bitmap for recording OOO packets, ORNIC consumed noticeably more BRAMs than ERNIC. However, the bitmap resources of ORNIC were shared by all QPs, and the space is adjustable by users. Generally, ORNIC achieved slightly higher resource utilization than ERNIC, which demonstrates the feasibility of deploying ORNIC on a hardware platform such as an FPGA or ASIC.

4.3. Bitmap Performance

We tested the performance of ORNIC’s bitmap through simulation. MP-RDMA [14] used a static bitmap that consumed so many resources that it could not be included in the comparison. ORNIC, MELO [15], and LEFT [16] all use a bitmap block pool, which supports the dynamic allocation of bitmap blocks for each QP. A comparison of the bitmap structures of MELO, LEFT, and ORNIC is shown in Figure 13. For MELO, the bitmap was a linked list of bitmap blocks, and a header pointer and a tail pointer were used by a QP to access the bitmap. LEFT added a cache pointer, which identified the last-hit bitmap block, to improve access performance. The cache pointer was updated if the number of bitmap blocks to traverse exceeded a threshold. ORNIC redesigned the bitmap structure into an array of bitmap block nodes. In particular, an all_1_flag field was used to indicate whether a bitmap block was full. A full bitmap block could be released in advance, and its all_1_flag was set to 1.
Since MELO only supported access to the header and tail bitmap blocks in the original study, we upgraded it to support access to the middle of the bitmap using the pointers in the linked list, and we call this version MELO+. For LEFT, if there was a cache miss, the algorithm had to traverse the bitmap from the header pointer to reach the corresponding bitmap block. We updated LEFT to LEFT+, which accesses the corresponding bitmap block from the cache pointer instead of the header pointer for better performance.
To unify the testing conditions, the width of a bitmap block was 16 bits, and the total width of a QP’s bitmap was 320 bits. The bitmap was considered located at the RDMA data receiver and used to record the arrival of packets. The incoming packets were shuffled, and the OOO distance did not exceed 64. It was assumed that the access of a bitmap block through a pointer consumed one clock cycle. The tests were divided into two scenarios: no packet loss and with packet loss. Under the packet loss scenario, a random packet in the first bitmap block was dropped.
First, we evaluated the average number of clock cycles required for each OOO packet to access the corresponding bitmap block. The result is shown in Table 2.
Due to the use of an array structure, ORNIC allowed any OOO packet to access the corresponding bitmap block within one clock cycle. For MELO+, when the incoming packet fell into the middle of the bitmap, some clock cycles were consumed by traversing the bitmap through pointers. Especially when packet loss occurred, more clock cycles were required due to the longer linked list. Compared with MELO+, LEFT offered limited performance gains using a last-hit bitmap block cache but introduced considerable design complexity.
Then, we evaluated the maximum number of bitmap blocks occupied by QP during the test. The result is shown in Table 3.
When there was no packet loss, ORNIC, MELO, and LEFT all occupied at most four bitmap blocks during the test, because the OOO distance of packets did not exceed 64, which was four times the bitmap block width. However, in the case of packet loss, ORNIC occupied significantly fewer bitmap blocks. By setting all_1_flag to 1 for a full bitmap block and freeing it, ORNIC consumed resources from the bitmap block pool only at the head block, the tail block, and the blocks where packet loss occurred. However, for MELO+, LEFT, and LEFT+, all of the bitmap blocks in the linked list, whether they were full or not, consumed resources from the pool. Until the arrival of retransmission packets, recording the incoming packets could occupy a BDP-window-width bitmap, which equaled the size of the bitmap block pool in their studies. In summary, ORNIC makes each QP occupy as few bitmap blocks as possible, avoiding resource exhaustion of the bitmap block pool in the case of packet loss and slow retransmission. Our design is more robust to fluctuations in network conditions and sender-side performance.

5. Conclusions

In this paper, we propose ORNIC, a high-performance RDMA NIC with an out-of-order packet direct write method for multipath transmission. ORNIC supports both sequential and OOO packet reception. The payload of OOO packets is written directly to user memory without reordering. The write address is embedded in the packets only when necessary. A bitmap is used to check data integrity and detect packet loss. We redesign the bitmap structure into an array of bitmap blocks, which support dynamic allocation. A full bitmap block is marked and can be freed in advance. We implement ORNIC in the RecoNIC architecture on a Xilinx U200 FPGA. ORNIC can achieve 95 Gbps RDMA throughput, which is nearly 2.5 times that of MP-RDMA. When handling OOO packets, ORNIC’s performance is virtually unaffected, while the performance of Xilinx ERNIC and Mellanox CX-5 drops below 1 Gbps. Moreover, compared with MELO and LEFT, our bitmap has higher performance and lower bitmap block usage. ORNIC’s ability to handle out-of-order packets efficiently positions it as a key component for deploying multipath transmission in RDMA.

6. Future Work

6.1. Higher Throughput

Driven by the increasing demand for higher bandwidth for data transmission, data center networks are transitioning from 100 Gbps to higher rates such as 200 Gbps, 400 Gbps, or more. Our ongoing research involves implementing a 200 G/400 G ORNIC on an Intel Agilex 7 [35] FPGA board, and the out-of-order packet direct write method proposed in this paper can be used in these follow-up projects. However, achieving higher RDMA throughput faces multiple challenges: first, integrating high-throughput PCIe and Ethernet MAC IPs; second, optimizing the RTL (Register Transfer Level) code, including reducing processing cycles and solving timing issues at higher system frequencies; and third, optionally, adopting pipeline design or bulk data access.

6.2. Outstanding WQEs

For ORNIC, the number of outstanding WQEs is limited to one per QP. When the current WQE is not completed, the next WQE of the same QP will not be processed. This is because, according to the RoCEv2 protocol, WQEs within a single QP must be executed sequentially. The limitation on the number of outstanding WQEs per QP restricts the performance of ORNIC, especially in the case of small WQEs. There are three possible ways to address this problem. First, sort packets before processing them [16]. This method requires a high-performance on-chip packet reordering circuit but maintains good compatibility with the RoCEv2 protocol. Second, define a type of OOO QP in which WQEs can be executed out of order [30]. Third, use a flag to ensure that a certain WQE is executed only after all the former WQEs have been finished [14]. The latter two solutions, while requiring collaboration in software development and modifications to the RoCEv2 protocol, could offer potentially higher performance by eliminating the need for packet reordering.

Author Contributions

Conceptualization, Z.G.; methodology, J.M.; implementation, J.M., Y.P. and M.Z.; validation, Z.Z. and Z.S.; software, Y.C.; writing—original draft preparation, J.M.; writing—review and editing, Z.G. and J.M.; supervision, Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research on Key Technologies and Equipment for Low-Latency Interconnected Networks in Intelligent Computing Centers, Oriented Project Independently Deployed by Institute of Acoustics, Chinese Academy of Sciences (Grant No. MBDK202401).

Data Availability Statement

All the necessary data are included in the article.

Acknowledgments

The authors would like to thank Jiahao Zhang for his insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qian, K.; Xi, Y.; Cao, J.; Gao, J.; Xu, Y.; Guan, Y.; Fu, B.; Shi, X.; Zhu, F.; Miao, R.; et al. Alibaba HPN: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference, Sydney, NSW, Australia, 4–8 August 2024; pp. 691–706. [Google Scholar]
  2. InfiniBand. Annex A17: RoCEv2. Available online: https://cw.infinibandta.org/document/dl/7781 (accessed on 26 October 2024).
  3. Zhu, Y.; Eran, H.; Firestone, D.; Guo, C.; Lipshteyn, M.; Liron, Y.; Padhye, J.; Raindel, S.; Yahia, M.H.; Zhang, M. Congestion control for large-scale RDMA deployments. ACM SIGCOMM Comput. Commun. Rev. 2015, 45, 523–536. [Google Scholar] [CrossRef]
  4. Mon, M.T. Flow Collision Avoiding in Software Defined Networking. In Proceedings of the 2020 IEEE Conference on Computer Applications (ICCA), Yangon, Myanmar, 27–28 February 2020; pp. 1–5. [Google Scholar]
  5. Alizadeh, M.; Edsall, T. On the data path performance of leaf-spine datacenter fabrics. In Proceedings of the 2013 IEEE 21st Annual Symposium on High-Performance Interconnects, San Jose, CA, USA, 21–23 August 2013; pp. 71–74. [Google Scholar]
  6. Greenberg, A.; Hamilton, J.R.; Jain, N.; Kandula, S.; Kim, C.; Lahiri, P.; Maltz, D.A.; Patel, P.; Sengupta, S. VL2: A scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, Barcelona, Spain, 17–21 August 2009; pp. 51–62. [Google Scholar]
  7. Al-Fares, M.; Loukissas, A.; Vahdat, A. A scalable, commodity data center network architecture. ACM SIGCOMM Comput. Commun. Rev. 2008, 38, 63–74. [Google Scholar] [CrossRef]
  8. Xu, Y.; Ni, H.; Zhu, X. Survey of Multipath Transmission Technologies in Information—Centric Networking. J. Netw. New Media 2023, 12, 1–9, 20. [Google Scholar]
  9. Nvidia. Nvidia Spectrum-X Network Platform Architecture. Available online: https://nvdam.widen.net/s/h6klwtqv5z/nvidia-spectrum-x-whitepaper-2959968 (accessed on 27 October 2024).
  10. Cisco. Evolve your AI/ML Network with Cisco Silicon One. Available online: https://www.cisco.com/c/en/us/solutions/collateral/silicon-one/evolve-ai-ml-network-silicon-one.pdf (accessed on 27 October 2024).
  11. Cao, J.; Xia, R.; Yang, P.; Guo, C.; Lu, G.; Yuan, L.; Zheng, Y.; Wu, H.; Xiong, Y.; Maltz, D. Per-packet load-balanced, low-latency routing for clos-based data center networks. In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies (CoNEXT ’13), Santa Barbara, CA, USA, 9–12 December 2013; pp. 49–60. [Google Scholar] [CrossRef]
  12. Dixit, A.; Prakash, P.; Hu, Y.C.; Kompella, R.R. On the impact of packet spraying in data center networks. In Proceedings of the 2013 Proceedings IEEE INFOCOM, Turin, Italy, 14–19 April 2013; pp. 2130–2138. [Google Scholar]
  13. Mittal, R.; Shpiner, A.; Panda, A.; Zahavi, E.; Krishnamurthy, A.; Ratnasamy, S.; Shenker, S. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, Budapest, Hungary, 20–25 August 2018; pp. 313–326. [Google Scholar]
  14. Lu, Y.; Chen, G.; Li, B.; Tan, K.; Xiong, Y.; Cheng, P.; Zhang, J.; Chen, E.; Moscibroda, T. Multi-path transport for RDMA in datacenters. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, USA, 9–11 April 2018; pp. 357–371. [Google Scholar]
  15. Lu, Y.; Chen, G.; Ruan, Z.; Xiao, W.; Li, B.; Zhang, J.; Xiong, Y.; Cheng, P.; Chen, E. Memory efficient loss recovery for hardware-based transport in datacenter. In Proceedings of the First Asia-Pacific Workshop on Networking, Hong Kong, China, 3–4 August 2017; pp. 22–28. [Google Scholar]
  16. Huang, P.; Zhang, X.; Chen, Z.; Liu, C.; Chen, G. LEFT: LightwEight and FasT packet Reordering for RDMA. In Proceedings of the 8th Asia-Pacific Workshop on Networking, Sydney, NSW, Australia, 3–4 August 2024; pp. 67–73. [Google Scholar]
  17. Ukon, Y.; Yamazaki, K.; Nitta, K. Video service function chaining with a real-time packet reordering circuit. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. [Google Scholar]
  18. Hoang, V.Q.; Chen, Y. Cost-effective network reordering using FPGA. Sensors 2023, 23, 819. [Google Scholar] [CrossRef] [PubMed]
  19. Zhou, S.; Gong, Y.; Fan, Z.; Chen, Y.; Zhang, W.; Tian, W.; Liu, Y. SR-DCQCN: Combining SACK and ECN for RDMA Congestion Control. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Chengdu, China, 9–12 December 2022; pp. 788–794. [Google Scholar]
  20. Wang, Z.; Luo, L.; Ning, Q.; Zeng, C.; Li, W.; Wan, X.; Xie, P.; Feng, T.; Cheng, K.; Geng, X.; et al. SRNIC: A scalable architecture for RDMA NICs. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 1–14. [Google Scholar]
  21. Li, Z.; Huang, J.; Wang, S.; Wang, J. Achieving Low Latency for Multipath Transmission in RDMA Based Data Center Network. IEEE Trans. Cloud Comput. 2024, 12, 337–346. [Google Scholar] [CrossRef]
  22. Song, C.H.; Khooi, X.Z.; Joshi, R.; Choi, I.; Li, J.; Chan, M.C. Network load balancing with in-network reordering support for rdma. In Proceedings of the ACM SIGCOMM 2023 Conference, New York, NY, USA, 10–14 September 2023; pp. 816–831. [Google Scholar]
  23. Bosshart, P.; Daly, D.; Gibb, G.; Izzard, M.; McKeown, N.; Rexford, J.; Schlesinger, C.; Talayco, D.; Vahdat, A.; Varghese, G.; et al. P4: Programming protocol-independent packet processors. ACM SIGCOMM Comput. Commun. Rev. 2014, 44, 87–95. [Google Scholar] [CrossRef]
  24. Xilinx. Xilinx Embedded RDMA Enabled NIC v4.0 LogiCORE IP Product Guide. Available online: https://docs.xilinx.com/r/en-US/pg332-ernic (accessed on 27 October 2024).
  25. Zhong, G.; Kolekar, A.; Amornpaisannon, B.; Choi, I.; Javaid, H.; Baldi, M. A Primer on RecoNIC: RDMA-enabled Compute Offloading on SmartNIC. arXiv 2023, arXiv:2312.06207. [Google Scholar]
  26. Firestone, D.; Putnam, A.; Mundkur, S.; Chiou, D.; Dabagh, A.; Andrewartha, M.; Angepat, H.; Bhanu, V.; Caulfield, A.; Chung, E.; et al. Azure accelerated networking: SmartNICs in the public cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, USA, 9–11 April 2018; pp. 51–66. [Google Scholar]
  27. Sidler, D.; Wang, Z.; Chiosa, M.; Kulkarni, A.; Alonso, G. StRoM: Smart remote memory. In Proceedings of the Fifteenth European Conference on Computer Systems, Heraklion, Greece, 27–30 April 2020; pp. 1–16. [Google Scholar]
  28. Yang, F.; Wang, Z.; Kang, N.; Ma, Z.; Li, J.; Yuan, G.; Tan, G. JingZhao: A Framework for Rapid NIC Prototyping in the Domain-Specific-Network Era. arXiv 2024, arXiv:2410.08476. [Google Scholar]
  29. Dmitry, S. To Spray or Not to Spray. Available online: https://community.juniper.net/blogs/dmitry-shokarev1/2023/11/21/to-spray-or-not-to-spray (accessed on 27 October 2024).
  30. Ultra Ethernet Consortium. Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification. Available online: https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf (accessed on 27 October 2024).
  31. Guo, C. RDMA in Data Centers: Looking Back and Looking Forward. ACM SIGCOMM APNet 2017. Available online: https://conferences.sigcomm.org/events/apnet2017/slides/cx.pdf (accessed on 27 October 2024).
  32. Guo, C.; Wu, H.; Deng, Z.; Soni, G.; Ye, J.; Padhye, J.; Lipshteyn, M. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, Florianópolis, Brazil, 22–26 August 2016; pp. 202–215. [Google Scholar]
  33. Xilinx. Alveo u200 and u250 Data Center Accelerator Cards Data Sheet (ds962). Available online: https://docs.amd.com/r/en-US/ds962-u200-u250/Summary (accessed on 27 October 2024).
  34. Mellanox Technologies. Product Brief of ConnectX-5 EN Card. Available online: https://network.nvidia.com/files/doc-2020/pb-connectx-5-en-card.pdf (accessed on 27 October 2024).
  35. Intel. Agilex 7 FPGAs and SoCs Product Brief. Available online: https://cdrdv2-public.intel.com/762901/agilex-7-fpga-product-brief.pdf (accessed on 27 October 2024).
Figure 1. Architecture of ORNIC.
Figure 2. Out-of-order packet direct write method.
Figure 3. Packet exchange of RDMA READ. (a) READ WQE. (b) OOO READ WQE.
Figure 4. Response packet types of RDMA READ. (a) READ WQE. (b) OOO READ WQE.
Figure 5. Packet exchange of RDMA WRITE. (a) WRITE WQE. (b) OOO WRITE WQE.
Figure 6. Request packet types of RDMA WRITE. (a) WRITE WQE. (b) OOO WRITE WQE.
Figure 7. Bitmap structure.
Figure 8. Register-scheduler circuit used to transmit ACKed PSNs.
Figure 9. Testbed topology.
Figure 10. RDMA throughput of RecoNIC+ORNIC in the OOO scenario.
Figure 11. RDMA WRITE throughput of RecoNIC+ORNIC, RecoNIC+ERNIC and Mellanox CX-5. (a) Sequential scenario. (b) Out-of-order scenario.
Figure 12. Resource utilization ratio of the ORNIC and ERNIC IPs on a U200 FPGA.
Figure 13. Comparison of the bitmap structure. (a) MELO. (b) LEFT. (c) ORNIC.
Table 1. Resource utilization of the ORNIC and ERNIC IPs on a U200 FPGA 1.
IPs | LUT | LUTRAM | FF | BRAM | URAM
Available | (1,182,240) | (591,840) | (2,364,480) | (2160) | (960)
ORNIC | 111,086 | 25,686 | 72,150 | 279.5 | 106
ERNIC v4.0 2 | 87,222 | 17,060 | 58,423 | 215.5 | 93
1 Xilinx Alveo U200 data center accelerator card. 2 Embedded RDMA NIC version 4.0, which is a soft IP core developed by Xilinx and released on 2 December 2022.
Table 2. Average number of clock cycles required for each OOO packet to locate the corresponding bitmap block.
Solutions | Avg. Number of Clock Cycles (the Fewer, the Better)
 | No Packet Loss | With Packet Loss
MELO+ 1 | 1.75 | 7.75
LEFT 2 | 1.25 | 5.25
LEFT+ 3 | 1.25 | 1.65
ORNIC | 1 | 1
1 The bitmap is a linked list of bitmap blocks; unlike MELO, it supports access to the middle of the bitmap. 2 A cache is used to improve the access performance of the linked list. 3 Compared with LEFT, the bitmap can be traversed from the cache pointer.
Table 3. Maximum number of bitmap blocks occupied by QP.
Solutions | Maximum Number of Bitmap Blocks (the Fewer, the Better)
 | No Packet Loss | With Packet Loss
MELO+ | 4 | 20
LEFT | 4 | 20
LEFT+ | 4 | 20
ORNIC | 4 | 5