Article

An FPGA Design with High Memory Efficiency and Decoding Performance for 5G LDPC Decoder

by Bich Ngoc Tran-Thi 1,2, Thien Truong Nguyen-Ly 1 and Trang Hoang 1,*
1 Department of Electronics, Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City 700000, Vietnam
2 Faculty of Electrical and Electronics Engineering, Vietnam Aviation Academy, Ho Chi Minh City 700000, Vietnam
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3667; https://doi.org/10.3390/electronics12173667
Submission received: 31 July 2023 / Revised: 26 August 2023 / Accepted: 28 August 2023 / Published: 30 August 2023

Abstract

A hardware-efficient implementation of a Low-Density Parity-Check (LDPC) decoder is presented in this paper. The proposed decoder design is based on the Hybrid Offset Min-Sum (HOMS) algorithm. In the check node processing of this decoder, only the first minimum is computed, instead of the first two minimum values among all the variable-to-check message inputs as in the conventional approach. Additionally, taking advantage of the unique structure of 5G LDPC codes, layered scheduling and partially parallel structures are employed to minimize hardware costs. Implementation results on the Xilinx Kintex UltraScale+ FPGA platform show that the proposed decoder achieves a throughput of 2.82 Gbps for 10 decoding iterations with a 5G LDPC code length of 8832 bits and a code rate of 1/2. Moreover, it reduces the check node memory by 10% with respect to the baseline and provides a hardware usage efficiency of 4.96 hardware resources/layer/Mbps, while achieving a decoding gain of up to 0.15 dB over some of the existing decoders.

1. Introduction

Gallager first introduced Low-Density Parity-Check (LDPC) codes, a family of linear Forward Error Correction (FEC) codes, in his doctoral thesis in 1962 [1]. In practice, LDPC codes are known to perform close to the Shannon limit [2]. Since their rediscovery in the mid-1990s [3], they have received much attention in both academia and industry. LDPC codes offer many advantages, such as excellent coding gain, a very low error floor, inherently parallelizable decoding schemes, and a variety of efficient decoding architectures. Several industry standards, including WLAN (802.11ay) [4], WiMAX (802.16e) [5], WiFi 6 [6], DVB-S2 [7], and 10GBASE-T (802.3an) [8], have adopted them. Moreover, they have been selected for enhanced Mobile Broadband (eMBB) in 5th generation (5G) mobile communications by 3GPP (3rd Generation Partnership Project) [9,10]. LDPC decoding can also be performed using low-complexity computations, which keeps hardware design and implementation costs comparatively low.
LDPC codes can be effectively decoded with iterative Message-Passing (MP) algorithms. The standard MP decoder is the Belief-Propagation (BP) algorithm [11], also called the Sum-Product (SP) algorithm [12]. Although the BP algorithm provides excellent decoding performance, its implementation complexity is very high. Therefore, the Min-Sum (MS) algorithm [13] was proposed to simplify the processing of exchanged messages through computational approximations. The MS algorithm requires only addition and comparison operations, which enables low-cost hardware implementation, but at the cost of significant degradation in error correction performance. To recover this performance loss, numerous approaches have been presented in the literature, including well-known algorithms such as the Offset Min-Sum (OMS) and the Normalized Min-Sum (NMS) [14,15]. Several further enhancements have been proposed, such as the Improved Offset Min-Sum (IOMS) algorithm [16], the Second Minimum Approximation Min-Sum algorithm (SMA-MSA) [17], the Simplified 2-Dimensional Scaled (S2DS) algorithm [18], the Hybrid Offset Min-Sum (HOMS), and the Variable Offset Min-Sum (VOMS) algorithms [19].
Although the design of individual processing units is relatively simple, the design of a complete LDPC decoder is governed by a complicated interaction among several system properties: processing delay, decoding performance, processing throughput, bandwidth efficiency, power consumption, and flexibility. For an area-efficient decoder, one of the main difficulties is that higher decoding algorithm complexity consumes more hardware and memory resources. Many designs have been proposed in recent years to reduce hardware resources. For instance, the authors in [20] proposed a layered decoder architecture built from several single-port memory banks. When a pipeline conflict occurs, the A-Posteriori (AP) update is postponed, and the conflicted check node messages contribute to the AP only when a new non-conflicted update happens. If the base graph matrix is dense, meaning that it has a high percentage of non-negative entries, this postponement of the AP updates might dramatically lower the performance of layered scheduling. The authors in [21] proposed an LDPC decoder architecture without any stall cycles caused by memory access pipeline conflicts or check node degree changes. The design uses a hybrid decoding schedule: it generally operates with layered scheduling and switches to flooded scheduling only when a pipeline conflict occurs. The resulting LDPC decoder achieves better hardware efficiency, but at the cost of increased register usage. Another hardware-friendly structure was presented in [22]. Based on the orthogonality property of the 5G LDPC base matrix [9], layer-merging and split-storage methods were used to improve efficiency and to reduce the memory cost of check-to-variable messages, respectively. Message reordering and a selective-shift structure were applied to optimize the interconnection blocks. However, the memory that holds AP messages is slightly enlarged by the selective-shift structure. Recently, a hardware architecture for a QC-LDPC decoder with additional architectural transformations for parallel processing and lower routing complexity was presented in [23]. The decoder kernel was based on the Simplified Offset Min-Sum (SOMS) algorithm. That work proposed a new message grouping technique that reduced the computational complexity of the QC-LDPC decoder. However, the error correction performance was reported to be lower than that of the conventional OMS decoder.
Inspired by the pros and cons of the decoders above, this paper presents an FPGA hardware implementation for the 5G LDPC decoder architecture using the HOMS algorithm [19] as the decoding kernel. This decoder leads to a reduction in hardware cost and memory size, as well as a better decoding performance. In order to achieve this, it is necessary to exploit several techniques, such as pipelining, layered scheduling, appropriate quantization bits, partially parallel structure, and parallel processing datapath units. In particular, unlike the conventional MS decoder, the proposed decoder does not need to determine the first two minimum values of all variable-to-check message inputs during check node processing; instead, it only needs to compute the first minimum value. As a consequence, it is possible to minimize hardware resources and memory at the same time.
The rest of this paper is structured as follows. Section 2 provides some definitions and fundamental information for 5G QC-LDPC codes. The proposed hardware design is described in Section 3 and the results and associated discussions are presented in Section 4. Finally, Section 5 concludes the paper.

2. Related Definitions and Preliminaries

2.1. Definitions

For simplicity, the list of notations used throughout the paper is given in Table 1.
The information bits form a vector of $K$ bits. This vector is encoded to produce the codeword $c = (c_1, c_2, \ldots, c_N)$, which is a vector of $N$ ($N > K$) bits, by adding $M = N - K$ parity check bits to the message. These bits are used to detect and even correct transmission errors. In this work, a codeword $c$ of the LDPC code is modulated by Binary Phase Shift Keying (BPSK) modulation ($x = (x_1, x_2, \ldots, x_N)$) and transmitted over the AWGN (Additive White Gaussian Noise) channel. The information at the receiver can be described by $y = x + z$, where $z$ is an independent Gaussian random variable with zero mean and noise variance $\sigma^2$. The LDPC codes can be represented by a sparse Parity-Check Matrix (PCM) $H$ with $M$ rows and $N$ columns, where each row corresponds to a Check Node (CN) and each column corresponds to a Variable Node (VN). Moreover, the LDPC codes can be represented using a bipartite graph, also called a Tanner graph [24]. In the Tanner graph, each VN and each CN represent a code symbol and a parity check equation, respectively. An edge connects VN $i$ and CN $j$ if the entry $H(j, i)$ of the PCM $H$ equals 1.
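As a minimal sketch of the channel model described above, the following Python snippet maps bits to BPSK symbols, adds Gaussian noise, and computes the log-likelihood ratio of each received sample. The bit mapping (0 to +1, 1 to -1), the function names, and the toy codeword are assumptions made for illustration; they are not taken from the paper.

```python
import random

def bpsk_modulate(bits):
    """Map bits to BPSK symbols (assumed convention: 0 -> +1, 1 -> -1)."""
    return [1.0 - 2.0 * b for b in bits]

def awgn(symbols, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in symbols]

def llr(y, sigma):
    """LLR log(Pr(x=0|y) / Pr(x=1|y)) for equiprobable BPSK over AWGN: 2*y / sigma^2."""
    return 2.0 * y / (sigma ** 2)

rng = random.Random(0)                 # fixed seed so the run is repeatable
codeword = [0, 1, 1, 0]                # toy bits, not a valid 5G codeword
received = awgn(bpsk_modulate(codeword), sigma=0.5, rng=rng)
llrs = [llr(y, 0.5) for y in received]
```

A positive LLR favours bit 0 and a negative LLR favours bit 1 under the assumed mapping.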

2.2. 5G Quasi-Cyclic LDPC Codes

Quasi-Cyclic (QC) LDPC codes are one of the typical structured LDPC codes extensively used in practice [25]. These QC-LDPC codes offer a number of advantages, including similar decoding performance compared with random-LDPC codes with a basic and structured design, suitability for hardware implementations with parallelizable architecture, a simple interconnection network, and support for flexible block lengths.
In general, QC-LDPC codes are defined by a base matrix $B$ of size $L \times C$ with entries $b_{i,j} \geq -1$, where $i \in \{1, 2, \ldots, L\}$ and $j \in \{1, 2, \ldots, C\}$. The PCM $H$ of a QC-LDPC code is obtained by expanding its base matrix with an expansion factor $Z$. Each element in the base matrix is replaced with a sparse matrix of size $Z \times Z$ in the following manner: elements $b_{i,j} = -1$ are replaced by all-zero matrices, and elements $b_{i,j} > -1$ are replaced by a circulant permutation matrix created by right shifting the identity matrix by $b_{i,j}$ positions. As a result, the PCM $H$ has $M = Z \times L$ rows and $N = Z \times C$ columns. The 5G LDPC code design allows it to support many code sizes and code rates. Moreover, the base matrix of the 5G LDPC code is extensible by an expansion factor of up to 384, allowing it to support very high parallelism levels. The expansion factor values can be obtained as $Z = a \times 2^k$, where $a \in \{2, 3, 5, 7, 9, 11, 13, 15\}$ and $k \in \{0, 1, 2, 3, 4, 5, 6, 7\}$. Consequently, there are 51 expansion factor values in total, which are listed in Table 2 [9].
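The construction above can be sketched in a few lines of Python. The first function enumerates the expansion factors of the form a × 2^k that do not exceed 384; the second expands a base matrix into a binary PCM. The function names and the toy base matrix are hypothetical, and the shift convention (row r of a shifted identity has its 1 in column (r + shift) mod Z) is an assumption about what "right shifting" means here.

```python
def lifting_sizes(z_max=384):
    """Enumerate Z = a * 2**k for the a and k ranges given in the text, keeping Z <= z_max."""
    sizes = set()
    for a in (2, 3, 5, 7, 9, 11, 13, 15):
        for k in range(8):
            z = a * 2 ** k
            if z <= z_max:
                sizes.add(z)
    return sorted(sizes)

def expand_base_matrix(base, z):
    """Expand a base matrix (entries -1 or a shift >= 0) into a binary parity-check matrix."""
    rows, cols = len(base), len(base[0])
    h = [[0] * (cols * z) for _ in range(rows * z)]
    for i in range(rows):
        for j in range(cols):
            shift = base[i][j]
            if shift < 0:
                continue  # -1 expands to an all-zero Z x Z block
            for r in range(z):
                # Identity shifted right: row r has its 1 in column (r + shift) mod Z.
                h[i * z + r][j * z + (r + shift) % z] = 1
    return h

toy_base = [[0, 1, -1],
            [2, -1, 0]]          # hypothetical toy base matrix, not BG1
H = expand_base_matrix(toy_base, z=3)
sizes = lifting_sizes()
```

With the toy matrix and Z = 3, the result is a 6 × 9 PCM in which every row has weight 2, one 1 per non-negative base entry in that block row.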
The BG1 and BG2 matrices are considered for 5G New Radio channel coding by 3GPP [9]. While BG1 is constructed for higher rates ($1/3 \leq R \leq 8/9$) and larger information block sizes ($500 \leq K \leq 8448$), BG2 is designed for lower rates ($1/5 \leq R \leq 2/3$) and smaller information block sizes ($40 \leq K \leq 2560$). In this paper, the BG1 matrix is considered. It must be emphasized that the 5G LDPC codes are highly irregular for both check nodes and variable nodes. For example, with BG1, the check node degree varies from 3 to 19 and the variable node degree varies from 1 to 30. BG1 supports a maximum of 8448 information bits.
QC-LDPC codes can apply layered scheduling due to their construction based on circulant matrices. A decoding layer in layered scheduling typically consists of Z consecutive rows of the parity check matrix H , corresponding to one row of the base matrix B . In general, the decoding layer can consist of a number of consecutive rows of B such that these rows do not overlap. As mentioned above, since each non-negative element of the base matrix B is replaced by a circulant matrix (permutation), each column of matrix H has exactly one non-negative element in each decoding layer. In a decoding layer, only CNs belonging to one (or some) row(s) of the base matrix B update their messages and the VNs connecting to that CN update their AP information. A decoding iteration is done when all of the CNs of H update their messages. The algorithm of the suggested decoder with the layered scheduling principle using the HOMS algorithm [19] is summarized in Algorithm 1.
The message transmitted from VN $n$ to neighbor CN $m$ is denoted by $\alpha_{m,n}$ (VTC messages), while the message sent from CN $m$ to VN $n$ is denoted by $\beta_{m,n}$ (CTV messages). The a priori information is indicated by $\gamma_n$. The PCM $H$ of size $M \times N$ is divided into $L$ horizontal layers, commonly known as decoding layers. Each decoding layer consists of $M/L$ consecutive rows of $H$ such that each variable node is connected only once to each layer. The set of consecutive rows of $H$ corresponding to layer $l \in \{1, \ldots, L\}$ is denoted as $M_l$. For the hardware implementation, we suppose that the $\gamma_n$, $\alpha_{m,n}$, and $\tilde{\gamma}_n$ messages are quantized on $\tilde{q}$ bits, while $q$ bits (where $q < \tilde{q}$) are used for the quantization of the $\beta_{m,n}$ messages. The alphabet of $\tilde{q}$-bit messages is $\tilde{\mathcal{M}} = \{-\tilde{Q}, \ldots, -1, 0, +1, \ldots, +\tilde{Q}\}$, where $\tilde{Q} = 2^{\tilde{q}-1} - 1$. The alphabet of $q$-bit messages is expressed similarly by $\mathcal{M} = \{-Q, \ldots, -1, 0, +1, \ldots, +Q\}$, where $Q = 2^{q-1} - 1$ [26]. More specifically, the $\alpha_{m,n}$ messages should be represented in $q$ bits before being passed to the CNs. For this purpose, the saturation function $s_{\mathcal{M}}(z)$ is applied, which is defined as:
$$s_{\mathcal{M}}(z) = \begin{cases} -Q & \text{if } z < -Q \\ z & \text{if } z \in \mathcal{M} \\ +Q & \text{if } z > +Q \end{cases}$$
A similar definition applies to the $\tilde{q}$-bit saturation function $s_{\tilde{\mathcal{M}}}$.
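The saturation function can be illustrated with a short Python sketch. The function names are hypothetical; the bit widths used in the example (4 and 6) are the quantization values (q, q̃) that the paper reports for its implementation.

```python
def max_magnitude(bits):
    """Largest magnitude Q = 2**(bits - 1) - 1 of a symmetric signed alphabet."""
    return 2 ** (bits - 1) - 1

def saturate(z, bits):
    """Clip an integer message z to the range [-Q, +Q] of a `bits`-wide alphabet."""
    q_max = max_magnitude(bits)
    return max(-q_max, min(q_max, z))

# q = 4 gives Q = 7; q~ = 6 gives Q~ = 31.
examples = [saturate(40, 6), saturate(-9, 4), saturate(5, 4)]
```

Values already inside the alphabet pass through unchanged, while out-of-range values are clipped to the nearest bound.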
Algorithm 1 Layered HOMS decoding algorithm [19].
Input: $y = (y_1, y_2, \ldots, y_N) \in \mathcal{Y}^N$ ($\mathcal{Y}$ is the channel output alphabet) ▹ received word
Output: $\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N) \in \{0, 1\}^N$ ▹ estimated codeword
 1: Initialization
 2: for all $n = 1, \ldots, N$ do $L_n = \log \dfrac{\Pr(x_n = 0 \mid y_n)}{\Pr(x_n = 1 \mid y_n)}$; $\tilde{\gamma}_n = \gamma_n = \varphi(L_n)$
 3: end for
 4: for all $m = 1, \ldots, M$ and $n \in H(m)$ do $\beta_{m,n} = 0$
 5: end for
 6: Iteration Loop
 7: for all $i = 1, \ldots, n_{iter}$ do ▹ Loop over iterations
 8:     for all $l = 1, \ldots, L$ do ▹ Loop over horizontal layers
 9:         for all $m = 1, \ldots, M_l$ and $m \in H(n)$ do ▹ VN-processing
10:             $\alpha_{m,n} = s_{\tilde{\mathcal{M}}}\big(\tilde{\gamma}_n - \mathrm{sgn}(\beta_{m,n}) \cdot \max(|\beta_{m,n}| - \delta, 0)\big)$
11:             $\alpha_{m,n}^{sat} = s_{\mathcal{M}}(\alpha_{m,n})$
12:         end for
13:         for all $m = 1, \ldots, M_l$ and $n \in H(m)$ do ▹ CN-processing
14:             $\beta_{m,n} = \begin{cases} A \cdot min_1 & \text{if position of } \alpha_{m,n}^{sat} = index\_min_1 \\ A \cdot \max(min_1 - \beta, 0) & \text{otherwise} \end{cases}$
15:             $min_1 = \min_{n \in H(m)} |\alpha_{m,n}^{sat}|$
16:             $A = \prod_{n' \in H(m) \setminus n} \mathrm{sgn}(\alpha_{m,n'}^{sat})$
17:         end for
18:         for all $m = 1, \ldots, M_l$ and $m \in H(n)$ do ▹ AP-update
19:             $\tilde{\gamma}_n = s_{\tilde{\mathcal{M}}}(\alpha_{m,n} + \beta_{m,n})$
20:         end for
21:     end for
22: end for
23: End Iteration Loop
$n_{iter}$ is the maximum number of iterations.
$index\_min_1$ is the position of $min_1$. If more than one input attains the $min_1$ value, $index\_min_1$ is the position with the smallest index.
$\delta, \beta > 0$ are the offset factors.
$\alpha_{m,n}^{sat}$ is the value of $\alpha_{m,n}$ saturated to $q$ bits.
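To make the three steps of Algorithm 1 concrete, here is a minimal floating-point sketch of the update for a single check node (one row of one layer). It is an illustration under simplifying assumptions, not the authors' implementation: saturation is omitted, messages are plain floats, sgn(0) is treated as +1, and the function and variable names are hypothetical. The default offsets are the values δ = 0.375 and β = 0.5 that the paper uses for HOMS.

```python
def sign(x):
    """Sign function returning +1.0 or -1.0 (zero is treated as positive here)."""
    return -1.0 if x < 0 else 1.0

def homs_update_row(gamma, beta_old, delta=0.375, beta_off=0.5):
    """One HOMS check-node update over the VNs connected to a single check node.

    gamma    : current AP values of the connected VNs
    beta_old : CTV messages stored from the previous iteration
    Returns (gamma_new, beta_new).
    """
    # VN processing: subtract the offset-reduced old CTV message.
    alpha = [g - sign(b) * max(abs(b) - delta, 0.0) for g, b in zip(gamma, beta_old)]

    # CN processing: only the first minimum and its position are needed.
    mags = [abs(a) for a in alpha]
    min1 = min(mags)
    index_min1 = mags.index(min1)        # first occurrence, i.e. the smallest index on ties
    total_sign = 1.0
    for a in alpha:
        total_sign *= sign(a)

    beta_new = []
    for n, a in enumerate(alpha):
        a_sign = total_sign * sign(a)    # product of the signs of all *other* inputs
        mag = min1 if n == index_min1 else max(min1 - beta_off, 0.0)
        beta_new.append(a_sign * mag)

    # AP update.
    gamma_new = [a + b for a, b in zip(alpha, beta_new)]
    return gamma_new, beta_new

gamma_new, beta_new = homs_update_row([2.0, -3.0, 1.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```

In this example the minimum magnitude 1.0 sits at position 2, so that edge receives magnitude 1.0 while the other edges receive 1.0 − 0.5 = 0.5, each with the sign product of the remaining inputs.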

3. The Proposed Decoder Architecture

The layered HOMS algorithm presented in Algorithm 1 is used to implement the proposed QC-LDPC decoder architecture for 5G New Radio (NR) communication systems. In this work, we consider the BG1 matrix of size $m_b \times n_b = 24 \times 46$ and a code rate of 1/2. The expansion factor $Z = 192$ corresponds to a PCM $H$ of size $M \times N$, with $M = Z \cdot m_b = 192 \cdot 24 = 4608$ and $N = Z \cdot n_b = 192 \cdot 46 = 8832$. As mentioned before, the 5G LDPC code is irregular [9] and the check node degree in BG1 varies widely, from 3 to 19. Let $d_{cmax}$ denote the maximum check node degree. Since each column of $H$ contains precisely one non-negative element in each layer, we define one decoding layer as one row of $B$. It should be noted that BG1 is constructed such that $B$ can be separated into $L = m_b$ horizontal decoding layers. The decoder’s parallelism level is equal to the expansion factor $Z$.
The architecture of the proposed LDPC decoder is depicted in Figure 1. It consists of a network of multiplexers (MUXes), a large number of Variable Node Units/A Posteriori Blocks (VNUs/APBs), Check Node Units (CNUs), and two memory blocks for AP information ($\tilde{\gamma}_n$) and CTV messages ($\beta_{m,n}$). The MUX block selects between the channel intrinsic messages ($\gamma_n$) and the AP messages ($\tilde{\gamma}_n$) updated from the preceding layer. Note that the intrinsic messages are only used during the initialization stage. The intrinsic signals, quantized on $\tilde{q}$ bits, are buffered before being stored in the AP Memory Block (AP-MB). The details of the AP-MB are discussed in Section 3.3. The AP-MB has $n_b$ simultaneous outputs, each representing $Z$ values of $\tilde{q}$ bits. The decoded bits are read from the AP-MB when the maximum number of iterations is reached. At each decoding layer, AP messages are read from the AP-MB and delivered to the Read Permutation Units (Read-PUs). There are $n_b$ parallel Read-PUs that reorganize the data according to the processed layer, so that each value reaches the appropriate processing unit. The Read-PUs select $d_{cmax}$ AP combinations among the $n_b$ inputs, depending on the positions of the non-negative entries of the current layer of the BG1 matrix. For example, the first four layers of the BG1 matrix have $d_{cmax} = 19$ non-negative elements, and the fifth layer has only four non-negative elements; thus, $(d_{cmax} - 4)$ elements are set to zero. To increase the operating frequency, we apply pipelining at the Read-PU outputs. The Write Permutation Units (Write-PUs) perform the reverse operation of the Read-PUs. One layer is executed in two consecutive clock cycles, and in each clock cycle the two pipeline stages process two layers of the base matrix. After pipelining, the Read-PU output data are transferred into the Read-Shifter Blocks (R-SBs). The R-SBs cyclically shift the data according to the relevant shift factors. In the proposed architecture in Figure 1, we use $d_{cmax}$ R-SBs in parallel, enabling high processing throughput. Each of them has $Z$ inputs and $Z$ outputs. Figure 2 illustrates in detail how an R-SB right-shifts 60 elements of $\tilde{q}$ bits by two positions. The Write-Shifter Blocks (W-SBs) perform the reverse function of the R-SBs.
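The cyclic shift performed by a read shifter and its inverse in the write shifter can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the list stands in for the Z parallel values, and the direction convention (the element at position i moves to position (i + shift) mod Z) is assumed.

```python
def read_shift(values, shift):
    """Cyclic right shift: the element at position i moves to (i + shift) mod Z."""
    z = len(values)
    s = shift % z
    return values[-s:] + values[:-s] if s else list(values)

def write_shift(values, shift):
    """Inverse of read_shift: undo a cyclic right shift by `shift` positions."""
    return read_shift(values, -shift)

data = list(range(60))            # 60 elements, as in the Figure 2 example
shifted = read_shift(data, 2)     # right shift by two positions
restored = write_shift(shifted, 2)
```

Applying the write shift with the same shift factor restores the original ordering, which is what allows updated AP values to be written back to their original addresses.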
VTC messages $\alpha_{m,n}$ are computed, or AP values $\tilde{\gamma}_n$ are updated, by the Variable Node Units/A Posteriori Blocks (VNUs/APBs). The detailed operation of these blocks is described in Section 3.2. The output messages of the VNUs/APBs are transmitted in parallel as $\tilde{q} \times Z = 1152$ bits, updated for the next layer, and saturated to $q$ bits. Here, the saturated values lie within the range $\{-Q, \ldots, +Q\}$. These saturated data are then sent to the Check Node Units (CNUs). For the HOMS algorithm, the output message of each CNU comprises the first minimum, its index, and the signs of the incoming VTC messages. This message is kept in a compressed format to save memory space, referred to as a compressed β-message, as seen in Figure 3.
These compressed β-messages are fed in two directions: they are stored back into the Check Node Memory Blocks (CN-MBs) for the next iteration and directed to the VNUs/APBs for the next decoding layer processing. It is worth noting that the compressed β-messages from the CN-MBs are sent back to the VNUs/APBs through Decompress units (DECOMs). Two DECOMs are placed in the decoder to convert compressed CTV messages to uncompressed ones for further processing. Similarly, the updated AP messages are rearranged using the Write Permutation Unit (Write-PU) to ensure that they are placed at the appropriate addresses of the AP-MB. The Split block creates the serial signal stream of $Z \times \tilde{q}$ bits that is sent to the AP-MB. One decoding layer is complete once these values have been written back to the AP-MB. To guarantee that the decoder operates properly, a controller block is required. This block controls the synchronous operation of the other blocks and also creates the control signals that manage memory access.
Initially, storing all AP messages in the AP-MB takes $n_b = 46$ clock cycles. The proposed decoder is designed to process every layer in two consecutive clock cycles. In the first clock cycle, the AP-reading signal is enabled to provide the AP messages that correspond to the current layer of the BG1 matrix, and the VN and CN operations are then performed in sequence. In the second clock cycle, the decoder executes the AP processing. The computed AP values are written back to the AP-MB by driving the AP-writing signal high. This processing continues until the last layer (i.e., the 24th layer) of the BG1 matrix is finished, at which point one decoding iteration is complete. These procedures are repeated for 10 iterations during the decoding process. After the 10th iteration, the LDPC decoder outputs the information bits, which correspond to the first 22 memory units of the AP-MB (i.e., $22 \times 192 = 4224$ bits).
The following subsections provide a detailed explanation of some main blocks in the proposed decoder architecture.

3.1. Check Node Unit

The main task of the CNU is to calculate the CTV messages ($\beta_{m,n}$). In hardware, conventional algorithms need to determine the two minimum values of the VTC input signals and the index of the first minimum. In this work, the HOMS method requires only the first minimum of the incoming VTC messages and its index. This is the competitive advantage of the proposed architecture, and it reduces the hardware resources. In addition, to reduce the CN-MB storage space, CTV messages are compressed as shown in Figure 3. For a given check node, the CTV messages, named compressed β-messages, are represented in a format of $(q - 1) + \lceil \log_2(d_{cmax}) \rceil + d_{cmax} = 27$ bits. By comparison, each conventional β-message requires $2 \cdot (q - 1) + \lceil \log_2(d_{cmax}) \rceil + d_{cmax} = 30$ bits.
Each CNU has $d_{cmax}$ inputs, and each input carries $Z \times q = 768$ bits. As a result, each CNU block contains $Z$ computing units that compute the $Z$ check nodes within a single row of the BG1 matrix in parallel. For each layer, the CNUs produce $Z \times ((q - 1) + \lceil \log_2 d_{cmax} \rceil + d_{cmax}) = 5184$ bits of extrinsic CTV messages and deliver them to the CN-MB.
As illustrated in Figure 4, the proposed CNU architecture consists of $d_{cmax}$ multiplexer units, a Min & Index Finder, a sign unit, and some control signals. In this design, we suppose that all check nodes have the same degree, $d_{cmax}$. To support irregular 5G LDPC codes, some additional control logic (i.e., a set signal) is used. More precisely, when the check node degree is $d_c < d_{cmax}$, “inactivate” (set = ‘0’) refers to setting the last $(d_{cmax} - d_c)$ inputs to the maximum value ($2^{q-1} - 1 = +7$, including the sign bit). Multiplexers at the inputs choose between the “real” VTC message ($\alpha_{m,n}$) and the maximum value. To find the first minimum ($min_1$) and its index ($index\_min_1$), the CNU applies the efficient Tree-Structure (TS) technique suggested in [27]. The Most Significant Bits (MSBs) of the $d_{cmax}$ VTC messages are XORed to generate a sign bit, and the remaining $(q - 1)$ bits are processed by the Min & Index Finder to find $min_1$ and its index. The $d_{cmax}$ inputs of the Min & Index Finder can be split into groups of 16 and 3 inputs. As a result, the $d_{cmax}$-mVU (minimum Value Unit) is obtained by combining the corresponding blocks, similar to the TS approach used in [27]. The 2-mVU includes one multiplexer and one comparator, as illustrated in Figure 5a. The 3-mVU consists of two 2-mVUs for finding the minimum value and its index, as depicted in Figure 5b. The 16-mVU (or $2^4$-mVU) can be constructed from 2-mVUs. The Index Generator (IG) block computes $index\_min_1$ and uses the same methodology as in [27].
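The idea of finding the first minimum and its index with a tree of two-input compare units can be sketched in Python as below. This is a generic pairwise reduction written for illustration; it does not reproduce the exact 16 + 3 partition or the index generator of [27], and the function names are hypothetical. Ties are resolved toward the smaller index, in line with the rule stated for Algorithm 1.

```python
def mvu2(a, b):
    """Two-input minimum unit on (value, index) pairs; `a` wins ties, keeping the lower index."""
    return a if a[0] <= b[0] else b

def min_index_tree(magnitudes):
    """Reduce the inputs pairwise, level by level, until one (min1, index_min1) pair remains."""
    nodes = [(m, i) for i, m in enumerate(magnitudes)]
    while len(nodes) > 1:
        nxt = [mvu2(nodes[k], nodes[k + 1]) for k in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            nxt.append(nodes[-1])    # an unpaired node passes through to the next level
        nodes = nxt
    return nodes[0]

# A check node of degree 4 in a 19-input unit: the unused inputs are forced to the
# maximum magnitude (+7 for q = 4) so that they cannot win the comparison.
inputs = [5, 3, 7, 3] + [7] * 15
min1, index_min1 = min_index_tree(inputs)
```

Because the nodes stay ordered by index at every level, letting the left operand win ties is enough to return the smallest index among equal minima.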

3.2. Variable Node Units and A Posteriori Blocks

As shown in Algorithm 1, the VNUs of the proposed decoder calculate $\alpha_{m,n}$ in the VN processing step, while the APBs compute the $\tilde{\gamma}_n$ messages in the AP update step. The main difference between these two blocks is that the VNU uses a subtractor while the APB uses an adder. Hence, we merge their logical functionalities into a single VNU/APB, controlled by the “sel” signal. The controller generates this control signal, which selects VNU mode during the first clock cycle and APB mode during the second. The detailed architecture of the VNUs/APBs is depicted in Figure 6. In the current iteration, during the first clock cycle, the control signal is sel = 0, and the two input multiplexers select the VNU-mode operands ($\tilde{\gamma}_n^{old}$, $\beta_{m,n}^{old}$). The Subtractor/Adder block performs subtraction, so the output is the VTC messages. Each VTC message is calculated as $\alpha_{m,n}^{new} = \tilde{\gamma}_n^{old} - \beta_{m,n}^{old}$, where $\tilde{\gamma}_n^{old}$ and $\beta_{m,n}^{old}$ are the AP values and the corresponding CN message values read from the AP-MB and CN-MB, respectively. Also during the first clock cycle, these $\alpha_{m,n}^{new}$ messages are provided to the CNUs to create the CN messages $\beta_{m,n}^{new}$. During the second clock cycle, the control signal is sel = 1, APB mode is executed, and the AP values are updated by $\tilde{\gamma}_n^{new} = \alpha_{m,n}^{new} + \beta_{m,n}^{new}$.
As mentioned above, the check node degree varies dramatically in the 5G LDPC codes. Therefore, the control signal “set” is needed to inactivate inputs (i.e., set the input values to zero) when the check node degree is $d_c < d_{cmax}$. In this case, the last $(d_{cmax} - d_c)$ CTV inputs are inactivated (set = ‘0’).
It is important to emphasize that for the HOMS algorithm, VN processing is modified by applying an offset factor ($\delta > 0$) to the VTC message. Before being sent into the Subtractor/Adder block, the magnitude of $\beta_{m,n}^{old}$ is reduced by the offset parameter $\delta > 0$, or set to zero if the resulting value is less than zero. The SMU transforms a sign-magnitude binary number in $\tilde{q}$ bits (i.e., a 1-bit sign and a $(\tilde{q} - 1)$-bit magnitude) to a $\tilde{q}$-bit number in two’s complement format.
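The sign-magnitude to two's complement conversion mentioned above can be sketched in Python as follows. The function names are hypothetical and the snippet models only the number-format conversion, not the surrounding datapath; the 6-bit width matches the q̃ value reported for the implementation.

```python
def sign_magnitude_to_twos_complement(word, bits):
    """Convert a `bits`-wide sign-magnitude word to a `bits`-wide two's complement word."""
    sign = (word >> (bits - 1)) & 1                  # MSB is the sign bit
    magnitude = word & ((1 << (bits - 1)) - 1)       # remaining bits are the magnitude
    value = -magnitude if sign else magnitude
    return value & ((1 << bits) - 1)                 # wrap negatives into `bits` bits

def twos_complement_to_int(word, bits):
    """Interpret a `bits`-wide two's complement word as a signed Python integer."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word

converted = sign_magnitude_to_twos_complement(0b100011, 6)   # sign = 1, magnitude = 3
value = twos_complement_to_int(converted, 6)
```

A sign-magnitude "negative zero" (sign bit set, magnitude zero) maps to the ordinary zero word under this conversion.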

3.3. Memory Blocks

The LDPC decoder architecture requires two memories: one for the $\tilde{\gamma}_n$ values (AP-MB) and another for the $\beta_{m,n}$ messages (CN-MB). The number of memory units in the AP-MB equals the number of columns of the base matrix. Each unit contains $Z$ memory locations with a width of $\tilde{q}$ bits. Accordingly, the AP-MB contains $n_b$ memory units that hold $n_b \times Z = 8832$ AP values of $\tilde{q}$ bits, for a total of $n_b \times Z \times \tilde{q} = 52{,}992$ bits. The $\tilde{q} \times Z = 1152$-bit AP messages are written sequentially into the AP-MB, so $n_b$ clock cycles are needed to store all of the AP messages. These $n_b$ AP groups are stored in $n_b$ memory units and can be read simultaneously from the AP-MB in a single clock cycle.
In the proposed decoder, the second minimum does not need to be calculated or stored in the CN-MB, which is the main difference from conventional decoders. The CN-MB stores the compressed β-messages. Since β-messages from all layers must be kept, the depth of the CN-MB is equal to the number of layers. Consequently, the CN-MB requires a total of $m_b \times Z \times ((q - 1) + \lceil \log_2 d_{cmax} \rceil + d_{cmax}) = 124{,}416$ bits.
Table 3 shows the size of the proposed and conventional CN-MB. Without storing the second minimum, the size of the proposed CN-MB is reduced by 10%. The CN-MB outputs are sent into the Decompress block. Each compressed β-message consists of three parts: the 19 MSBs (bit positions 26–8) represent the signs of the 19 CNU input messages, the next three bits (bit positions 7–5) hold the magnitude of $min_1$, and the last 5 Least-Significant Bits (LSBs, bit positions 4–0) hold the index of $min_1$. The overall memory of the LDPC decoder is mostly determined by the size of the AP-MB and CN-MB. According to the calculations above, the AP-MB requires 52,992 bits and the CN-MB stores 124,416 bits. Therefore, the total memory is 173.25 kb.
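The memory figures in this section follow directly from the code and quantization parameters, as the short Python calculation below shows. The variable names are hypothetical. The conventional CN-MB size is derived here from the 30-bit conventional message width given in Section 3.1 rather than quoted from Table 3, and "kb" is taken to mean 1024 bits, which is the interpretation that reproduces the 173.25 figure.

```python
from math import ceil, log2

Z, m_b, n_b = 192, 24, 46        # expansion factor and base-matrix dimensions
q, q_tilde, d_cmax = 4, 6, 19    # quantization widths and maximum check node degree

index_bits = ceil(log2(d_cmax))                           # bits needed to index d_cmax inputs
compressed_bits = (q - 1) + index_bits + d_cmax           # min1 magnitude + index + signs
conventional_bits = 2 * (q - 1) + index_bits + d_cmax     # additionally stores min2

ap_mb_bits = n_b * Z * q_tilde
cn_mb_bits = m_b * Z * compressed_bits
cn_mb_conventional_bits = m_b * Z * conventional_bits     # derived, not quoted from Table 3

reduction = 1 - cn_mb_bits / cn_mb_conventional_bits
total_kb = (ap_mb_bits + cn_mb_bits) / 1024               # assuming 1 kb = 1024 bits
```

Dropping the second minimum removes q − 1 = 3 of the 30 bits per message, which is where the 10% saving comes from.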

3.4. Decompress Block

It should be noted that the CTV message is stored in the CN-MBs as a compressed β-message. The Decompress block (DECOM) converts the CTV messages from the compressed to the uncompressed format (Figure 3). Moreover, this block is responsible for reducing the input value $min_1$ by the offset $\beta$. If the difference ($min_1 - \beta$) is less than 0, the output of this block is set to 0. The detailed architecture of the DECOM block is depicted in Figure 7. The $d_{cmax}$ outputs of the DECOM consist of $d_{cmax} - 1 = 18$ copies of the magnitude $\max(min_1 - \beta, 0)$ and a single $min_1$ magnitude, located at position $index\_min_1$. The FCU (Format Conversion Unit) converts a sign-magnitude binary number in $q$ bits (i.e., a 1-bit sign and a $(q - 1)$-bit magnitude) to $\tilde{q}$ bits in two’s complement format.
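The following Python sketch packs a 27-bit compressed β-message using the bit layout described in Section 3.3 (signs in bits 26–8, magnitude in bits 7–5, index in bits 4–0) and then expands it into 19 signed values. Several details are assumptions made for illustration: the stored sign bits are taken to be the per-edge output signs, the first sign is placed in bit 26, the offset is expressed as an integer in quantized units (the value 1 used below is hypothetical), and the outputs are plain signed integers rather than two's complement words.

```python
D_CMAX, MAG_BITS, IDX_BITS = 19, 3, 5

def pack_beta(signs, min1, index_min1):
    """Pack 19 sign bits, a 3-bit magnitude and a 5-bit index into one 27-bit word."""
    sign_field = 0
    for s in signs:                          # signs[0] ends up in bit 26 (assumed ordering)
        sign_field = (sign_field << 1) | (s & 1)
    return (sign_field << (MAG_BITS + IDX_BITS)) | (min1 << IDX_BITS) | index_min1

def decompress(word, beta_off):
    """Expand a compressed word into D_CMAX signed CTV values."""
    index_min1 = word & ((1 << IDX_BITS) - 1)
    min1 = (word >> IDX_BITS) & ((1 << MAG_BITS) - 1)
    sign_field = word >> (MAG_BITS + IDX_BITS)
    out = []
    for n in range(D_CMAX):
        sign_bit = (sign_field >> (D_CMAX - 1 - n)) & 1
        # Only the edge at index_min1 keeps min1; all others get the offset-reduced value.
        mag = min1 if n == index_min1 else max(min1 - beta_off, 0)
        out.append(-mag if sign_bit else mag)
    return out

signs = [1, 0] + [0] * 17                    # first edge negative, the rest positive
word = pack_beta(signs, min1=5, index_min1=2)
ctv = decompress(word, beta_off=1)           # hypothetical integer offset
```

Only one of the 19 outputs carries the unreduced minimum, which is why a single stored magnitude is sufficient.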

4. Results and Discussion

In this work, the LDPC decoder was developed in Verilog Hardware Description Language (HDL) rather than with intellectual property (IP) modules. The FPGA synthesis and post-place-and-route implementation results were obtained using Xilinx Vivado 2021.2. This section additionally reports on the decoding performance of several 5G LDPC codes.

4.1. Performance Results

We consider the BG1 matrix with code rates of 1/2 and 2/3, codeword lengths $N$ of 8832 and 6720, and an expansion factor $Z$ of 192. The HOMS is set with $\beta = 0.5$ and $\delta = 0.375$ [19]. For comparison purposes, the performances of the NMS ($1/\alpha = 0.75$) [14], OMS ($\beta = 0.5$) [14], IOMS ($\gamma = 0.875$ and $\eta = 0.5$) [16], SMA-MSA ($\alpha_2 = 0.25$ and $\gamma = 0.75$) [17], S2DS [18], and VOMS ($\beta = 0.5$ and $\tau = 0.875$) [19] algorithms are also evaluated.
First, to demonstrate the modification of the CTV messages, we have compared the CTV messages among the HOMS, BP, and MS algorithms along with various reported MS-based decoding algorithms, as illustrated in Figure 8.
In this demonstration, we calculate four CTV messages with the same input data for the BP, MS, OMS, NMS, IOMS, SMA-MSA, VOMS, and HOMS algorithms. These four CTV messages correspond to Figure 8a–d, respectively. There are 11 tested samples. Looking closely, we can see that the gap between the HOMS and the BP algorithm is large. There are two reasons for this overestimation: first, the HOMS algorithm is based on the MS algorithm, which has inherent approximation errors; second, the updated CTV messages in the HOMS algorithm are calculated from the $min_1$ value only, while $min_2$ is approximated. To overcome this disadvantage, another offset factor is applied to the VTC messages of the HOMS in variable node processing. The VNU is modified rather than the CNU because the CNU is one of the most complex blocks, so adding another offset factor there may increase the hardware complexity, whereas the VNU/APB block is very simple and uses only adders/subtractors. Next, to verify the error correction capability of these algorithms, we run Monte Carlo simulations for several 5G LDPC codes. Some simulation results are shown in Figure 9 and Figure 10.
It can be seen that at the target BER of $10^{-8}$, the HOMS algorithm improves the decoding gain by up to 0.15 dB compared to the NMS, IOMS, SMA-MSA, S2DS, and SOMS algorithms.

4.2. Implementation Results

In this section, we evaluate the hardware resources of the proposed LDPC decoder on the FPGA platform. The design is synthesized and implemented on the Xilinx Kintex UltraScale+ FPGA using the Vivado software. We investigate a 5G LDPC code with a codeword length of 8832 bits and a code rate of 1/2. The decoding iteration number is set to 10, and the quantization bits $(q, \tilde{q})$ are (4, 6). The throughput is computed by the following formula [23]:
$$T = \frac{N \times F_{\max}}{L \times n_{iter} \times n_{cyc}} \ \text{Mbps}$$
where $L$ is the number of decoding layers, $n_{cyc}$ is the number of clock cycles necessary to complete one layer, $n_{iter}$ is the maximum number of iterations, $N$ is the codeword length, and $F_{\max}$ [MHz] is the maximum operating frequency.
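As a quick check of the formula, the following sketch evaluates it with the values reported for the proposed decoder: N = 8832, L = 24, 10 iterations, 2 clock cycles per layer, and the 153.5 MHz clock frequency listed in Table 4. The function name is an illustrative choice, not part of the paper.

```python
def throughput_mbps(n_bits, f_max_mhz, n_layers, n_iter, n_cyc):
    """Throughput in Mbps: T = N * F_max / (L * n_iter * n_cyc), F_max in MHz."""
    return n_bits * f_max_mhz / (n_layers * n_iter * n_cyc)

# Values reported for the proposed decoder.
t = throughput_mbps(n_bits=8832, f_max_mhz=153.5, n_layers=24, n_iter=10, n_cyc=2)
print(f"{t:.1f} Mbps = {t / 1000:.2f} Gbps")
```

With these inputs, the result is about 2824.4 Mbps, which matches the reported 2.82 Gbps.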
The proposed 5G LDPC decoder achieves a throughput of 2.82 Gbps when decoding a code with code length N = 8832 bits and L = 24 layers at a code rate of 1/2. The decoding iteration number is set to 10, and the number of clock cycles per layer is 2. Since LDPC decoder designs differ in implementation characteristics such as code length, quantization bits, algorithm, and iteration number, performance normalization is required for a fair comparison. Therefore, the Hardware Usage Efficiency (HUE) metric is used. It is defined as the hardware resources needed for each layer of the base matrix to produce a throughput of 1 Mbps and is measured in hardware resources/layer/Mbps. A lower HUE value implies a more hardware-efficient design. The HUE metric is calculated as follows [23]:
$$\mathrm{HUE} = \frac{H_u}{L \times T}$$
where $L$ is the number of decoding layers, $T$ [Mbps] is the throughput, and $H_u$ reflects the FPGA hardware utilization, which is the total number of F7 MUXes, F8 MUXes, flip-flops, LUTs, and BRAM bits used by the LDPC decoder. Note that BRAM usage is counted by its size in bits, which means that each bit of BRAM is given the same weight as one flip-flop or one MUX.
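The HUE definition can be checked against the figures reported for the proposed decoder. The sketch below uses the hardware utilization of 314,968 from Table 4, L = 24 layers, and the throughput obtained from the throughput formula; the function name is an illustrative choice.

```python
def hue(h_u, n_layers, throughput_mbps):
    """Hardware Usage Efficiency: HUE = H_u / (L * T), with T in Mbps."""
    return h_u / (n_layers * throughput_mbps)

# Throughput of the proposed decoder in Mbps: N * F_max / (L * n_iter * n_cyc).
t_mbps = 8832 * 153.5 / (24 * 10 * 2)

# Hardware utilization of the proposed decoder as listed in Table 4.
hue_value = hue(h_u=314_968, n_layers=24, throughput_mbps=t_mbps)
print(f"HUE = {hue_value:.2f} hardware resources/layer/Mbps")
```

This gives a value of about 4.65, in agreement with the HUE entry for this work in Table 4.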
In Table 4, the implementation results are compared with other related 5G designs [21,23,28,29].
It can be seen that the proposed decoder's throughput is 2.82 Gbps, which is almost equivalent to that of the decoder in [29] and around 2.5 times higher than that of the decoder in [28]. Concerning hardware usage efficiency, the proposed design provides a HUE value of 4.65 hardware resources/layer/Mbps, which is approximately 4.5–5 times lower (i.e., better) than those of the decoders in [29] and [21]. Although the proposed decoder has 1.7 times lower throughput and about 2.4 times higher HUE than the decoder in [23], it provides better decoding performance and memory efficiency. More precisely, the memory capacity of the proposed decoder is 4.4 times smaller than that of the decoder in [23], and at a BER of $10^{-8}$, it shows a gain of 0.15 dB.
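The ratios quoted in this comparison can be reproduced from the Table 4 entries with a few divisions. The sketch below is only an arithmetic check; the dictionary layout is hypothetical, the throughput of [28] is taken as the upper end of its reported range, and the memory size of [23] is assumed to be 765.8 kb as read from the table.

```python
# Values as read from Table 4 (assumptions noted in the text above).
proposed = {"throughput_gbps": 2.82, "hue": 4.65, "memory_kb": 173.25}
others = {
    "[23]": {"hue": 1.96, "memory_kb": 765.8},
    "[29]": {"hue": 20.85},
    "[28]": {"throughput_gbps": 1.1},  # upper end of the 0.391-1.1 Gbps range
    "[21]": {"hue": 23.12},
}

ratios = {
    "throughput vs [28]": proposed["throughput_gbps"] / others["[28]"]["throughput_gbps"],
    "HUE of [29] vs proposed": others["[29]"]["hue"] / proposed["hue"],
    "HUE of [21] vs proposed": others["[21]"]["hue"] / proposed["hue"],
    "HUE of proposed vs [23]": proposed["hue"] / others["[23]"]["hue"],
    "memory of [23] vs proposed": others["[23]"]["memory_kb"] / proposed["memory_kb"],
}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

The results (roughly 2.56, 4.48, 4.97, 2.37, and 4.42) are consistent with the rounded factors of 2.5, 4.5–5, 2.4, and 4.4 stated in the text.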

5. Conclusions

This paper presented an area-efficient architecture with enhanced decoding performance for a 5G LDPC decoder. The proposed decoder was based on the HOMS algorithm. To realize this architecture, various techniques such as layered scheduling, a partially parallel structure, and a low number of quantization bits were used. In particular, the proposed check node processing only requires computing the first minimum of all input messages. As a result, the storage size for check node messages in the proposed decoder was decreased by around 10% compared to conventional decoders. The implemented decoder demonstrated a gain of 0.15 dB over the SOMS decoder and provided a throughput of 2.82 Gbps, which meets the 5G throughput requirements.

Author Contributions

B.N.T.-T. proposed the idea, performed simulations, collected and analyzed data and prepared the original manuscript. T.T.N.-L. reviewed, supervised and revised the manuscript. T.H. supervised and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AP    A Posteriori
APB    A Posteriori Block
AP-MB    A Posteriori Memory Block
AWGN    Additive White Gaussian Noise
BP    Belief-Propagation
BPSK    Binary Phase Shift Keying
CN    Check Node
CN-MB    Check Node Memory Block
CNU    Check Node Unit
CTV    Check-to-Variable
DECOM    Decompress unit
FEC    Forward Error Correction
FPGA    Field Programmable Gate Array
HOMS    Hybrid Offset Min-Sum
HUE    Hardware Usage Efficiency
IG    Index Generator
IOMS    Improved Offset Min-Sum
LDPC    Low-Density Parity-Check
LSB    Least Significant Bits
MP    Message-Passing
MS    Min-Sum
MSB    Most Significant Bit
MUX    Multiplexer
mVU    minimum Value Unit
NMS    Normalized Min-Sum
NR    New Radio
OMS    Offset Min-Sum
PCM    Parity-Check Matrix
QC-LDPC    Quasi-Cyclic LDPC
Read-PU    Read Permutation Unit
R-SB    Read-Shifter Blocks
S2DS    Simplified 2-Dimensional Scaled
SMA-MSA    Second Minimum Approximation Min-Sum algorithm
SOMS    Simplified Offset Min-Sum
SP    Sum-Product
TS    Tree-Structure
VN    Variable Node
VNU    Variable Node Unit
VOMS    Variable Offset Min-Sum
VTC    Variable-to-Check
Write-PU    Write Permutation Unit

References

  1. Gallager, R. Low-density parity-check codes. IRE Trans. Inf. Theory 1962, 8, 21–28.
  2. Chung, S.Y.; Forney, G.D.; Richardson, T.J.; Urbanke, R. On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit. IEEE Commun. Lett. 2001, 5, 58–60.
  3. MacKay, D.J.; Neal, R.M. Near Shannon limit performance of low-density parity check codes. Electron. Lett. 1997, 33, 457–458.
  4. Lin, C.H.; Su, H.H.; Chen, T.S.; Lu, C.K. Reconfigurable Low-Density Parity-Check (LDPC) Decoder for Multi-Standard 60 GHz Wireless Local Area Networks. Electronics 2022, 11, 733.
  5. Liu, C.H.; Yen, S.W.; Chen, C.L.; Chang, H.C.; Lee, C.Y.; Hsu, Y.S.; Jou, S.J. An LDPC decoder chip based on self-routing network for IEEE 802.16e applications. IEEE J. Solid-State Circuits 2008, 43, 684–694.
  6. Wu, Y.; Wu, B.; Zhou, X. High-Performance QC-LDPC Code Co-Processing Approach and VLSI Architecture for Wi-Fi 6. Electronics 2023, 12, 1210.
  7. Kim, S.M.; Park, C.S.; Hwang, S.Y. A novel partially parallel architecture for high-throughput LDPC decoder for DVB-S2. IEEE Trans. Consum. Electron. 2010, 56, 820–825.
  8. Cohen, A.E.; Parhi, K.K. A low-complexity hybrid LDPC code encoder for IEEE802.3an (10GBase-T) ethernet. IEEE Trans. Signal Process. 2009, 57, 4085–4094.
  9. Richardson, T.; Kudekar, S. Design of low-density parity-check codes for 5G new radio. IEEE Commun. Mag. 2018, 56, 28–34.
  10. Thi Bao Nguyen, T.; Nguyen Tan, T.; Lee, H. Low-complexity high-throughput QC-LDPC decoder for 5G new radio wireless communication. Electronics 2021, 10, 516.
  11. Richardson, T.J.; Urbanke, R.L. The capacity of low-density parity-check codes under message-passing decoding. IEEE Trans. Inf. Theory 2001, 47, 599–618.
  12. Kschischang, F.R.; Frey, B.J.; Loeliger, H.A. Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 2001, 47, 498–519.
  13. Fossorier, M.P.; Mihaljevic, M.; Imai, H. Reduced complexity iterative decoding of low-density parity-check codes based on belief propagation. IEEE Trans. Commun. 1999, 47, 673–680.
  14. Chen, J.; Dholakia, A.; Eleftheriou, E.; Fossorier, M.P.; Hu, X.Y. Reduced-complexity decoding of LDPC codes. IEEE Trans. Commun. 2005, 53, 1288–1299.
  15. Chen, J.; Fossorier, M.P. Near optimum universal belief propagation based decoding of low-density parity check codes. IEEE Trans. Commun. 2002, 50, 406–414.
  16. Tran-Thi, B.N.; Nguyen-Ly, T.T.; Hong, H.N.; Hoang, T. An improved offset min-sum LDPC decoding algorithm for 5G new radio. In Proceedings of the 2021 International Symposium on Electrical and Electronics Engineering (ISEE), Ho Chi Minh, Vietnam, 15–16 April 2021.
  17. Català-Pérez, J.M.; Lacruz, J.O.; Garcia-Herrero, F.; Valls, J.; Declercq, D. Second minimum approximation for min-sum decoders suitable for high-rate LDPC codes. Circuits Syst. Signal Process. 2019, 38, 5068–5080.
  18. Cho, K.; Lee, W.H.; Chung, K.S. Simplified 2-dimensional scaled min-sum algorithm for LDPC decoder. J. Electr. Eng. Technol. 2017, 12, 1262–1270.
  19. Tran-Thi, B.N.; Nguyen-Ly, T.T.; Hoang, T. High-performance and low complexity decoding algorithms for 5G Low-Density Parity-Check codes. J. Commun. 2022, 17, 358–364.
  20. Boncalo, O.; Kolumban-Antal, G.; Amaricai, A.; Savin, V.; Declercq, D. Layered LDPC decoders with efficient memory access scheduling and mapping and built-in support for pipeline hazards mitigation. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 66, 1643–1656.
  21. Petrović, V.L.; Marković, M.M.; El Mezeni, D.M.; Saranovac, L.V.; Radošević, A. Flexible high throughput QC-LDPC decoder with perfect pipeline conflicts resolution and efficient hardware utilization. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 5454–5467.
  22. Cui, H.; Ghaffari, F.; Le, K.; Declercq, D.; Lin, J.; Wang, Z. Design of high performance and area-efficient decoder for 5G LDPC codes. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 68, 879–891.
  23. Verma, A.; Shrestha, R. Low computational-complexity SOMS-algorithm and high-throughput decoder architecture for QC-LDPC codes. IEEE Trans. Veh. Technol. 2023, 72, 66–80.
  24. Tanner, R. A recursive approach to low complexity codes. IEEE Trans. Inf. Theory 1981, 27, 533–547.
  25. Fossorier, M.P. Quasicyclic low-density parity-check codes from circulant permutation matrices. IEEE Trans. Inf. Theory 2004, 50, 1788–1793.
  26. Nguyen-Ly, T.T.; Savin, V.; Le, K.; Declercq, D.; Ghaffari, F.; Boncalo, O. Analysis and design of cost-effective, high-throughput LDPC decoders. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2017, 26, 508–521.
  27. Wey, C.L.; Shieh, M.D.; Lin, S.Y. Algorithms of finding the first two minimum values and their hardware implementation. IEEE Trans. Circuits Syst. I Regul. Pap. 2008, 55, 3430–3437.
  28. Nadal, J.; Baghdadi, A. Parallel and flexible 5G LDPC decoder architecture targeting FPGA. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2021, 29, 1141–1151.
  29. Li, Y.; Li, Y.; Ye, N.; Chen, T.; Wang, Z.; Zhang, J. High Throughput Priority-Based Layered QC-LDPC Decoder with Double Update Queues for Mitigating Pipeline Conflicts. Sensors 2022, 22, 3508.
Figure 1. Block diagram of the proposed LDPC decoder architecture.
Figure 2. An example of right-shifting by the two positions rule of the R-SB block with 60 elements of $\tilde{q}$ bits.
Figure 3. Presentation of the check-to-variable node message in compressed and uncompressed formats.
Figure 4. Check Node Unit (CNU) architecture with $d_{cmax} = 19$ (Min & Index finder architecture is shown in the gray dashed box).
Figure 5. The architecture of (a) 2-mVU and (b) 3-mVU.
Figure 6. The proposed VNU/APB architecture.
Figure 7. Decompress block architecture.
Figure 8. Comparison of the Check Node unit outputs among various LDPC decoding algorithms.
Figure 9. BER performance of various LDPC decoders for the (8832, 4608), code rate 1/2.
Figure 10. BER performance of various LDPC decoders for the (6720, 2496), code rate 2/3.
Table 1. The list of mathematical notations and their meaning.
Notation | Meaning
B | Base matrix
H | Parity-Check Matrix (PCM)
Z | The expansion factor
N | The length of the codeword
K | Information bits
R | The code rate
c | A codeword $c = (c_1, c_2, \ldots, c_N)$
x | The transmitted codeword $x = (x_1, x_2, \ldots, x_N)$
y | The received information $y = (y_1, y_2, \ldots, y_N)$
$\sigma^2$ | Noise variance
n | The coded bit, $1 \le n \le N$
m | The parity check bit, $1 \le m \le M$
$H(n)$ | The set of Check Nodes (CNs) connected to the Variable Node (VN) n
$H(n) \setminus m$ | The set $H(n)$ with check node m excluded
$H(m)$ | The set of Variable Nodes (VNs) connected to the Check Node (CN) m
$H(m) \setminus n$ | The set $H(m)$ with variable node n excluded
$d_v$ | The number of neighbors of a VN, called its degree, $d_v = |H(n)|$
$d_c$ | The number of neighbors of a CN, called its degree, $d_c = |H(m)|$
$\tilde{\gamma}_n$ | A Posteriori (AP) information update
$\gamma_n$ | A priori information for each variable node n
$\beta_{m,n}$ | Check-to-Variable (CTV) messages sent from CN m to VN n
$\alpha_{m,n}$ | Variable-to-Check (VTC) messages sent from VN n to CN m
$\tilde{q}$ | The number of bits that represent the $\gamma_n$, $\alpha_{m,n}$, and $\tilde{\gamma}_n$ messages (in this paper, $\tilde{q}$ = 6 bits)
q | The number of bits that represent the $\beta_{m,n}$ messages (in this paper, q = 4 bits)
$\varphi(x)$ | The $\tilde{q}$-bit quantization map of x, which rounds each element of x to the nearest integer less than or equal to it
Table 2. The set of expansion factors Z.
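k \ a | 2 | 3 | 5 | 7 | 9 | 11 | 13 | 15
0 | 2 | 3 | 5 | 7 | 9 | 11 | 13 | 15
1 | 4 | 6 | 10 | 14 | 18 | 22 | 26 | 30
2 | 8 | 12 | 20 | 28 | 36 | 44 | 52 | 60
3 | 16 | 24 | 40 | 56 | 72 | 88 | 104 | 120
4 | 32 | 48 | 80 | 112 | 144 | 176 | 208 | 240
5 | 64 | 96 | 160 | 224 | 288 | 352 | |
6 | 128 | 192 | 320 | | | | |
7 | 256 | 384 | | | | | |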
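The entries of Table 2 follow the 5G NR lifting-size rule Z = a × 2^k with a maximum of Z = 384. As a minimal sketch, the set can be regenerated as follows; the function and variable names are illustrative choices.

```python
def expansion_factors(max_z=384):
    """Return {k: [Z values]} with Z = a * 2**k and Z <= max_z."""
    a_values = (2, 3, 5, 7, 9, 11, 13, 15)
    table = {}
    for k in range(8):
        table[k] = [a * 2**k for a in a_values if a * 2**k <= max_z]
    return table

z_table = expansion_factors()
for k, row in z_table.items():
    print(k, row)

# Total number of distinct expansion factors in the set.
total = sum(len(row) for row in z_table.values())
print("number of expansion factors:", total)
```

The expansion factor Z = 192 used in this work corresponds to a = 3 and k = 6, and the largest value, Z = 384, corresponds to a = 3 and k = 7.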
Table 3. Comparison between the proposed and conventional CN-MB.
CN-MB | Width | Depth | Total
Conventional | 30Z | 24 | 720Z
Proposed | 27Z | 24 | 648Z
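The memory saving implied by Table 3 can be worked out directly from the listed width and depth values. The sketch below evaluates both totals for the expansion factor Z = 192 used in this work; the function name is hypothetical, and the totals are expressed in the same unit as the table entries.

```python
def cn_mb_total(width_per_z, depth, z):
    """Total CN-MB size for a word width of width_per_z * Z and the given depth."""
    return width_per_z * z * depth

Z = 192  # expansion factor used in this work
conventional = cn_mb_total(width_per_z=30, depth=24, z=Z)  # 720Z
proposed = cn_mb_total(width_per_z=27, depth=24, z=Z)      # 648Z
saving = 1 - proposed / conventional

print(conventional, proposed, f"{saving:.0%}")
```

For Z = 192, the totals are 138,240 and 124,416, which corresponds to the reduction of about 10% reported for the check node memory.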
Table 4. Implementation results and comparison with related works.
Specifications | This Work | 2023 [23] | 2022 [29] | 2021 [28] | 2020 [21]
FPGA Board | Xilinx Kintex UltraScale+ | Xilinx Zynq UltraScale+ | Xilinx Virtex-7 | Xilinx Kintex-7 | Xilinx Kintex UltraScale+
Quantization Bits | (4,6)-bit | 7-bit | 8-bit | 8-bit | 8-bit
Throughput (Gbps) | 2.82 | 3.6–13.31 | 2.85 | 0.391–1.1 | 3.2–4.9
Code Length | 8832 | 10,368–26,112 | 10,368 | 3456 | 10,368–26,112
Expansion Factor | 192 | 384 | 384 | 384 | 384
Clock Frequency (MHz) | 153.5 | 128.36 | 255 | 160 | 404.8
Code Rate | 1/2 | 1/3–8/9 | 22/27 | 1/2–5/6 | 22/68–22/27
Number of Layers | 24 | 5–46 | 7 | 5–46 | 7–46
Number of Iterations | 10 | 10 | 11 | 8 | 10
Pipeline Stages | 2 | 3 | 13 | - | 13
Memory Size (kb) | 173.25 | 765.8 | 3888 | 7146 | 4914
Decoding Algorithm | HOMS | SOMS | OMS | OMS | OMS
Hardware Utilization | 314,968 | 1,200,394 | 4,204,122 | 7,438,394 | 5,233,977
Memory Type | BRAM | Registers | BRAM | BRAM | BRAM
Permutation Network | Barrel Shifter | Bit Rotational Mapping | Barrel Shifter | Barrel Shifter | Barrel Shifter
HUE * | 4.65 | 1.96 | 20.85 | 147.54 | 23.12
* Hardware utilization efficiency (HUE) in hardware resources/layer/Mbps.
