Article

41.6 Gb/s High-Depth Pre-Interleaver for DFE Error Propagation in 65 nm CMOS Technology

1
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Co., Ltd., Jinan 250101, China
2
Institute of RF- & OE-ICs, Southeast University, Nanjing 210096, China
3
Purple Mountain Laboratories, Nanjing 211111, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3912; https://doi.org/10.3390/electronics12183912
Submission received: 15 August 2023 / Revised: 4 September 2023 / Accepted: 8 September 2023 / Published: 16 September 2023
(This article belongs to the Section Circuit and Signal Processing)

Abstract

A high-speed, high-depth pre-interleaver for the proposed symbol pre-interleaving Bit MUX (PBM) scheme was implemented to mitigate decision feedback equalizer (DFE) error propagation in a 400 G Ethernet Serializer–Deserializer (SerDes) interface. Based on a SerDes link architecture with a 5-tap DFE, the performance of the PBM under DFE error propagation was simulated, showing an interleaving gain of 0.35 dB. To significantly increase the transmission rate while keeping a large interleaving depth, the pre-interleaver adopts characteristic polynomial parallelization with the logic expansion method and register-based memory with interleaving technology. The pre-interleaver was fabricated in 65 nm CMOS technology, with a total area of 0.615 mm², including the I/O pads. The measurement results show that the horizontal opening degree of the output signal reaches 0.925 UI at a data rate of 41.6 Gb/s. The total power consumption is 38.52 mW at a supply voltage of 1.2 V and a frequency of 1.3 GHz.

1. Introduction

In communication systems such as 400 Gb/s Ethernet [1], Infiniband, and PCI-e 6.0 [2], the symbols “0” and “1” are normally transmitted as voltages and currents over printed circuit board (PCB) wires. However, as the data rate increases sharply, the channel loss of PCB wires grows, as shown in Figure 1a. As a result, the interference among adjacent symbols, i.e., intersymbol interference (ISI), becomes more serious, so the ideal voltage and current waveforms can be greatly distorted, as shown in Figure 1b. This can worsen transmission performance and even cause the system to fail. Therefore, the signal integrity problem has become more and more prominent [3], making it a critical concern for communication engineers and circuit designers.
Aiming to address the signal integrity problem, equalization technology [4] is usually adopted in a high-speed communication link, i.e., Serializer-Deserializer (SerDes). From the perspective of the compensation principle, equalization technology can mainly be divided into frequency-domain equalization, e.g., continuous-time linear equalizer (CTLE) [5], and time-domain equalization, e.g., pre-emphasis, feed-forward equalizer (FFE) [6] and decision feedback equalizer (DFE). CTLE’s frequency response in Figure 2 is approximately opposite to the channel frequency response, making its combined frequency response flat. Pre-emphasis and FFE can offset the channel loss by increasing the amplitude of high-frequency signals. However, they can also greatly amplify transmission noise and crosstalk. Unlike these, DFE can cancel post-ISI without noise amplification [7]. Nonetheless, the error propagation phenomenon incurred by DFE has an influence on the bit-error-ratio (BER) performance. As a result, a model that accurately estimates this influence is very important for modern SerDes design.
Several models have been developed to estimate this influence under the various noise sources in the SerDes link, each with its own limits. For example, the Markov chain model [8] evaluates DFE burst errors by computing conditional probabilities. However, its computational complexity increases exponentially with the number of DFE taps and the pulse amplitude modulation (PAM) order. As a result, it is impractical to find a proper stationary distribution for a two- or multi-tap DFE in a four-level pulse amplitude modulation (PAM4) link. To overcome this issue, a disaggregated Markov chain was adopted [9]. Unfortunately, this yielded inaccurate results due to the additional residual ISIs. Instead of directly computing the conditional probability, an aggregated Markov chain with a set of interdependent transition probabilities was created to capture the residual ISIs [10]. However, the large number of recursive computations significantly increases the computational complexity. Therefore, we developed an accurate treatment of the residual ISI and noise to further enhance the DFE’s modeling accuracy.
Based on our estimations, mitigating DFE error propagation is increasingly important, as it offers the best performance in channel compensation and BER from the system perspective. Tang [11] adopted an opposite threshold technology to construct a Dual DFE (DDFE) structure. Although it can restrain error propagation by preventing initial bit errors from occurring, this structure significantly increases the circuit complexity compared with the traditional structure. To address the challenge of circuit complexity, we focus on modifying the whole link architecture to effectively reduce the length of burst errors.
Given the relationship between neighboring error symbols, Lu [12] added pre-coding technology to the link and explored its effect on link performance through a simple Monte-Carlo model. Moreover, Zhang [13] conducted a detailed performance analysis of the link architecture with various DFE tap configurations, such as one-tap, two-tap and multi-tap. The same conclusion can be drawn: pre-coding technology improves the BER performance margin only conditionally and can even worsen forward error correction (FEC) performance. The drawback of pre-coding is also obvious: when there is only one symbol error, pre-coding always converts it into at least two symbol errors, regardless of the length of the burst error [12]. To avoid this, our analysis aims to develop an error discretization method that breaks a larger uncorrectable error symbol into several smaller correctable error symbols.
Meanwhile, some alternative discretization methods, e.g., multi-codeword interleaving, could improve FEC performance in a 400 GbE system [14]. For example, FEC orthogonal multiplexing (FOM) can be implemented in the physical media attachment (PMA) sublayer. In this method, however, the DFE is only a one-tap structure, which may be too simple to accurately estimate the impact of error propagation. Moreover, some measured performance figures, such as data rate, power and area, are not available, so it is very hard to provide detailed guidance for its design and application. Therefore, there is an urgent demand for an unconditional system model with a multi-tap DFE SerDes link and an advanced interleaving technology to reflect the effect of mitigating DFE error propagation.
To achieve our objectives, we propose and implement a high-speed, high-depth pre-interleaver for symbol pre-interleaving bit muxing (PBM) to mitigate DFE error propagation. At the theoretical level, we develop a cumulative probability distribution model to analyze the error probability and interleaving gain of PBM [15]. Because of the compensation constraint, the model is limited to SerDes configurations with DFE. At the circuit level, we optimize the parallel pseudo-random binary sequence (PRBS) generator and the memory of the pre-interleaver in 65 nm CMOS technology, increasing the transmission data rate and error discretization. This includes characteristic polynomial parallelization, the logic expansion method, and register-based memory with ping-pong interleaving technology. The circuit level also incorporates basic optimization methods for the interconnections, such as reducing fanout and transmission latency, to ensure applicability to the 400 Gb/s Ethernet physical layer. The research framework of the pre-interleaver is illustrated in Figure 3.

2. DFE Error Propagation

Figure 4 provides the structure of the DFE [16] and also illustrates the process of DFE error propagation [17]. The DFE consists of a summer, a slicer, several multipliers and several delay elements. x(n) and y(n) represent the input and output signals of the DFE at the n-th period, respectively. d(n − i) represents the decision data held in the delay element i periods after leaving the slicer. c_i represents the coefficient of the i-th tap, and the sum runs over the k taps. Therefore, the propagation process can be expressed as [17]:
$$y(n) = x(n) - \sum_{i=1}^{k} c_i \times d(n-i), \tag{1}$$
From Equation (1), it can be seen that when a single or multiple errors occur, they can be passed into the delay element and have a certain probability of impacting the current signal, which shows that the error propagation is related to tap coefficients and decision data. For example, when d(n − 1) is wrong and c1 has a larger magnitude, the offset for y(n) can result in an error in d(n). If a larger error offset for y(n) always occurs, the slicer is continuously wrong, i.e., continuous propagation. If there is a smaller error offset for y(n) and a larger error offset for y(n + 1), the error propagation is temporarily interrupted, i.e., discontinuous propagation. In brief, DFE error propagation can be divided into two propagation cases: continuous propagation and discontinuous propagation.
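The two behaviours described above can be reproduced with a small numerical sketch of Equation (1). It is illustrative only: the single tap value 0.6, the ±1 (NRZ) symbol alphabet, the post-cursor channel model, and the size of the injected noise spike are all assumptions chosen for the example, not values from this work.

```python
from typing import List, Sequence


def channel(symbols: Sequence[int], taps: Sequence[float]) -> List[float]:
    """Add post-cursor ISI: x(n) = a(n) + sum_i c_i * a(n - i)."""
    out = []
    for n, a in enumerate(symbols):
        isi = sum(c * symbols[n - i]
                  for i, c in enumerate(taps, start=1) if n - i >= 0)
        out.append(a + isi)
    return out


def dfe_decide(x: Sequence[float], taps: Sequence[float]) -> List[int]:
    """Apply y(n) = x(n) - sum_i c_i * d(n - i), then slice y(n) to +/-1."""
    decisions: List[int] = []
    for n, sample in enumerate(x):
        feedback = sum(c * decisions[n - i]
                       for i, c in enumerate(taps, start=1) if n - i >= 0)
        y = sample - feedback
        decisions.append(1 if y >= 0 else -1)
    return decisions


def error_positions(sent: Sequence[int], decided: Sequence[int]) -> List[int]:
    return [n for n, (a, d) in enumerate(zip(sent, decided)) if a != d]


TAPS = [0.6]  # hypothetical single tap with a large c1

# Alternating data: one noise-induced error keeps feeding the next decision.
alternating = [1, -1] * 5
rx = channel(alternating, TAPS)
rx[3] += 1.5  # hypothetical noise spike that flips decision 3
continuous = error_positions(alternating, dfe_decide(rx, TAPS))

# Constant data: the same wrong feedback no longer crosses the threshold.
constant = [1] * 10
rx = channel(constant, TAPS)
rx[3] -= 2.5  # hypothetical noise spike that flips decision 3
isolated = error_positions(constant, dfe_decide(rx, TAPS))

print(continuous)  # error positions for the alternating pattern
print(isolated)    # error positions for the constant pattern
```

With the alternating pattern, the wrong decision fed back through c1 pushes every following sample across the slicer threshold, while with the constant pattern the same feedback error is absorbed. This is the data dependence of error propagation noted above.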
Further, Figure 5 illustrates these two propagation cases in detail [12]. Let brl be the burst error run length from the first burst error to the last one. 0, E and e for slicer output are the correct decision, the wrong decision caused by random noise, and one caused by error propagation, respectively. The arrow represents the propagation direction of decision data. Additionally, solid arrows indicate that previous errors can lead to errors in the current decision, and dotted ones indicate that they can only be passed into the delay elements.
One case is continuous propagation, shown in Figure 5a. Although there are still errors in the delay elements in the 7th and 8th periods because the decision in the 6th period is wrong, there are no errors in the slicer after the 7th period. The error propagation caused by the initial random error E is indicated to be able to stop in the 7th period, in which brl is recorded as 6. Another case is the discontinuous one, shown in Figure 5b. It can be seen that, although DFE can make a correct decision in the 4th period, the errors in decision data in the 5th and 6th periods also occur because of some errors in the delay elements. In this case, brl is still 6.
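As a sketch of how brl could be counted from a recorded error pattern, the helper below measures the distance from the first to the last error of each burst. It assumes that a burst is considered finished once as many consecutive correct decisions as there are DFE taps have passed, so that the delay line no longer holds a wrong decision. The function name and the tap count of 3 are hypothetical.

```python
from typing import List, Optional


def burst_run_lengths(errors: List[bool], n_taps: int) -> List[int]:
    """Return brl for each burst; a burst closes after n_taps clean decisions."""
    lengths: List[int] = []
    first: Optional[int] = None
    last: Optional[int] = None
    for n, is_error in enumerate(errors):
        if is_error:
            if first is None:
                first = n          # first error of a new burst
            last = n               # most recent error of the current burst
        elif first is not None and n - last >= n_taps:
            lengths.append(last - first + 1)
            first = last = None    # delay line is clean again
    if first is not None:
        lengths.append(last - first + 1)
    return lengths


E, O = True, False
continuous_case = [E, E, E, E, E, E, O, O, O, O]      # errors in periods 1 to 6
discontinuous_case = [E, E, E, O, E, E, O, O, O, O]   # correct decision in period 4

print(burst_run_lengths(continuous_case, n_taps=3))
print(burst_run_lengths(discontinuous_case, n_taps=3))
```

Both patterns give a single burst with brl = 6, since the isolated correct decision in the second pattern does not clear the delay elements.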
From the analysis above, it is clear that DFE error propagation is unavoidable in the high-speed SerDes link configured with DFE. Many burst errors with different brls caused by DFE can seriously worsen SerDes system performance, especially BER performance. Therefore, various technologies, e.g., pre-interleaving, can be adopted to mitigate this and improve system performance.

3. Pre-interleaving Technology

In SerDes applications, such as in 400 GbE, pre-interleaving technology can generally be applied to the physical coding sublayer (PCS). Next, subject to some 400 GbE scenarios, e.g., 16 × 25 G and 8 × 50 G, we will conduct detailed research on pre-interleaving technology from both the theoretical level and circuit level.

3.1. Architecture and Analysis

Figure 6 presents the pre-interleaving architecture in the Ethernet physical layer, including four FEC coders and one memory in the PCS, as well as eight multiplexers (MUX) [18] in the physical medium attachment (PMA). The memory receives FEC-coded data from four parallel FEC lanes and interleaves them. Then, the interleaved data are aggregated through the multiplexers to further disperse errors and double the transmission rate.
We have provided and theoretically explored the use of FEC Orthogonal bit Multiplexing (FOM), symbol Pre-interleaving Bit MUX (PBM) and symbol Pre-interleaving Symbol MUX (PSM) in detail [15]. The error-mitigation processes of non-interleaving and PBM are shown in Figure 7a,b, respectively. In this process, the upper number represents the number of symbols, while the lower number represents the number of bits within the symbol.
For the non-interleaving scheme, every 16 FEC symbols from a selected FEC lane are distributed into two sub-lanes in a round-robin manner and are then transmitted directly through the PMA without a MUX. The disadvantage of this scheme is that the bits of every sub-lane come from the same FEC lane, which reduces error discretization. This problem can be addressed by another scheme, e.g., PBM. After the FEC lane is selected in a round-robin manner, every 16 FEC symbols from the selected FEC lane are distributed into 16 sub-lanes, and then 2 bits from 2 sub-lanes are multiplexed in the MUX. To compare the two schemes in terms of error mitigation, assume that a seven-bit burst error occurs around the boundary of two symbols (see Figure 7a,b); these errors affect 3 symbols in an FEC lane for the non-interleaving scheme, but only 2 symbols for the PBM scheme. This shows that PBM disperses errors better than the non-interleaving scheme.
Before accurately estimating the effect of PBM on DFE error propagation, a SerDes link with some impairments, such as Gaussian noise, package and crosstalk, is constructed in Figure 8. In the SerDes link, some equalizers, e.g., pre-emphasis, are modeled to receive data from the MUX of PBM in Figure 7, and then a five-tap DFE combined with a dedicated FFE is adopted.
Next, the whole system with a pre-interleaving scheme and SerDes link is simulated through a cumulative probability distribution model of symbol burst errors, and its BER performance is evaluated using Algorithm A1 in Appendix A. The main work of the algorithm is to count burst errors, which requires a large amount of iterative calculation and judgment. Moreover, the signal transmission process involves mathematical operations such as convolution and the Fourier transform. The simulation shows that, compared with the non-interleaving scheme in Reference [15], PBM achieves an interleaving gain of 0.35 dB at a BER of 10⁻⁷, as shown in Figure 9a. This shows that PBM can reduce more error symbols caused by the same burst errors than FOM, but fewer than PSM. By comparison, in Reference [19], the interleaving gain of DFE with interleaving at a BER of 10⁻⁶ is less than 0.26 dB (=1.02 dB − 0.76 dB), which outperforms DFE with pre-coding. In conclusion, pre-interleaving, e.g., PBM, is a preferred unconditional technology and also outperforms Reference [19].
Further, focusing on the memory capacity for buffering codewords, the MUX and the interleaving latency, a performance comparison of these three FEC interleaving schemes is given in Table 1 [15]. To have all 16 symbols ready before they are sent to the MUX, PBM needs a 1500-bit memory, which consists of 150 bits, 300 bits, 450 bits and 600 bits to buffer 16 symbols from the first to the fourth FEC lane, respectively. Therefore, buffering data in the memory takes 600T (=(16 − 1) × 10 × 4T), in which 4T is the clock period of the 16 sub-lanes and T is the clock period of the FEC lane. For the 10-bit 2:1 MUX in PSM, as opposed to a 1-bit 2:1 MUX, a serial-to-parallel conversion must be completed before multiplexing every two parallel 10-bit symbols from the sub-lanes, which adds 400T of latency. Similarly, FOM requires at least three symbols to be distributed in parallel, i.e., a 30-bit memory for each FEC lane and a total of 120 bits for four FEC lanes, which results in about 120T of latency (=(4 − 1) × 10 × 4T). Compared with the other two interleaving methods, PBM establishes a better trade-off among hardware complexity, latency and interleaving gain, making it a more suitable method for mitigating error propagation.
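The memory and latency figures quoted above can be re-derived with a few lines of arithmetic. The sketch below assumes that the per-lane PBM buffer grows as lane index × (16 − 1) × 10 bits, a pattern that reproduces the quoted 150/300/450/600 bits but is not stated explicitly in the text. The helper name is hypothetical, and the PSM figure is left out because no formula is given for it.

```python
SYMBOL_BITS = 10       # bits per FEC symbol
FEC_LANES = 4          # parallel FEC lanes
SUBLANE_PERIOD_T = 4   # sub-lane clock period, in units of the FEC-lane period T


def buffer_latency_in_T(count: int) -> int:
    """Latency in units of T, following (count - 1) * 10 * 4T."""
    return (count - 1) * SYMBOL_BITS * SUBLANE_PERIOD_T


# PBM: assumed per-lane buffer of lane * (16 - 1) * 10 bits.
pbm_bits_per_lane = [lane * (16 - 1) * SYMBOL_BITS
                     for lane in range(1, FEC_LANES + 1)]
pbm_total_bits = sum(pbm_bits_per_lane)
pbm_latency = buffer_latency_in_T(16)

# FOM: three 10-bit symbols per lane, four lanes.
fom_total_bits = 3 * SYMBOL_BITS * FEC_LANES
fom_latency = buffer_latency_in_T(4)

print(pbm_bits_per_lane, pbm_total_bits, pbm_latency)
print(fom_total_bits, fom_latency)
```

Under these assumptions, PBM uses 12.5 times the memory of FOM and has 5 times its buffering latency, which is the cost side of the trade-off discussed above.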
Additionally, an NRZ signal is adopted in the PBM, while DFE with interleaving in Reference [19] used a PAM4 signal. Therefore, pre-interleaving technology can be widely and unconditionally applied in NRZ/PAM4 SerDes interconnection scenarios, such as 400 GbE, scientific computing, and high-performance computing (HPC) [20]. While meeting the high-speed requirements of these scenarios, this technology can push system performance towards a lower BER, e.g., 10⁻¹⁵ or below. However, the technology has drawbacks: a stronger FEC and larger memory capacity can increase latency [21] and may not meet the low-complexity, low-latency requirements of some scenarios, such as autonomous driving and trading. This requires optimization at the circuit level.

3.2. Design of Pre-Interleaver

Figure 10 presents the block diagram of a pre-interleaver in PBM, including a PRBS generator, a register-based memory and a writing/reading counter (wr_cnt/rd_cnt). The PRBS generator, instead of FEC, is used to launch the m-bit high-speed signals, and then the memory can disperse and rearrange them, with the corresponding address synchronously recorded by a counter.
Generally, a serial PRBS generator, as the signal source, employs a linear-feedback shift register (LFSR), as shown in Figure 11 [22], in which u1~u31 represent the serial shift registers and ⊕ is the XOR operation. Because register u31 outputs the signal Do one bit at a time through the shift operation towards u31, this structure faces challenges in interactive efficiency, complexity and consumption. On the one hand, a single serial generator requires 40 periods to generate a set of 40-bit signals, which greatly reduces the interactive efficiency. On the other hand, to generate m-bit parallel signals in one period, an m-lane serial PRBS31 generator could be adopted, e.g., m = 40. However, this needs 1240 (=40 × 31) registers and 40 XORs, which significantly increases complexity and consumption. As a compromise, the next focus is the parallel optimization of the PRBS generator.
Another key module of the proposed pre-interleaver is register-based memory, which plays a role in discretizing data errors for easy correction. The degree of discretization completely depends on its interleaving depth and interleaving width. However, although their increase is very helpful for weakening the correlation between adjacent burst errors, it is a challenge in terms of complexity, consumption and transmission latency [23]. Another challenge is its architecture, e.g., a chip-selecting structure in which there are two identical memories and one works alternately with the other [24]. However, its main drawbacks are complexity and consumption. To overcome these problems, a handshake-based structure is proposed and achieves the alternating operation of single memory using a handshake signal. Further, considering the transmission latency and interconnect relationship between the generator and memory, a row–column interleaver with the handshake mechanism is employed to realize error discretization by means of the ping-pong method, as shown in Figure 12. Its depth (n) and width (m) are 32 and 40, respectively.
Based on these solutions, the entire circuit works as follows. In Figure 10, all signals are first set to 0 through the rstn signal for easy verification. Then, when the write-enable signals wr_en_1~wr_en_n are high, an m-bit signal prbs_out from the PRBS generator is written into memory, and the writing counter wr_cnt starts. When wr_cnt reaches n − 1, the memory is full. Finally, after a one-period handshake, the write-enable signals wr_en_1~wr_en_n become low, and the read-enable signals rd_en_1~rd_en_m become high. Subsequently, the reading counter rd_cnt starts and the stored data prbsout_inte are output. When rd_cnt reaches m − 1, all data in the memory have been read out, which completes the writing–reading process.
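The address permutation behind this write-then-read sequence can be sketched in software. The model below assumes a plain row–column block interleaver (n = 32 words of m = 40 bits written row by row, then read column by column); it leaves out the handshake, the enable signals and the ping-pong timing, and the function names are illustrative.

```python
from typing import List

DEPTH_N = 32   # number of m-bit words written per block
WIDTH_M = 40   # bits per written word


def interleave(bits: List[int], n: int = DEPTH_N, m: int = WIDTH_M) -> List[int]:
    """Write n rows of m bits, then read the array out column by column."""
    assert len(bits) == n * m
    rows = [bits[r * m:(r + 1) * m] for r in range(n)]
    return [rows[r][c] for c in range(m) for r in range(n)]


def deinterleave(bits: List[int], n: int = DEPTH_N, m: int = WIDTH_M) -> List[int]:
    """Inverse mapping: rebuild the row order from the column-ordered stream."""
    assert len(bits) == n * m
    return [bits[c * n + r] for r in range(n) for c in range(m)]


# Use position labels instead of data bits so the permutation is visible.
block = list(range(DEPTH_N * WIDTH_M))
sent = interleave(block)

# A 7-bit burst on the line hits these positions of the original stream.
burst = sent[100:107]
gaps = [b - a for a, b in zip(burst, burst[1:])]

print(burst)
print(gaps)
```

In this model, neighbouring bits of a burst no longer than the depth n end up at least m positions apart in the original stream, which is the error discretization that the memory is meant to provide.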

3.3. Logical Combination

To achieve these objectives, a parallel generator [25] is obtained using characteristic polynomial-based matrix conversion and logic expansion technology. In the generator, the pseudo-random sequence polynomial for a length of 2³¹ − 1 is defined as 1 + x²⁸ + x³¹.
Assume that the state vector of the current data U(t) is [u1(t), u2(t),…, ui(t),…, u31(t)]ᵀ, in which ui(t) represents the state of the i-th register at the t-th period, and that the state vector at the next period is U(t + 1). The conversion matrix T characterizes the transfer relationship between U(t) and U(t + 1); it is obtained from the pseudo-random sequence polynomial and represented as follows:
$$U(t+1) = T \cdot U(t), \tag{2}$$
$$T = \begin{bmatrix}
0 & 0 & \cdots & 1 & 0 & 0 & 1 \\
1 & 0 & \cdots & 0 & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 1 & 0
\end{bmatrix}_{31 \times 31}, \tag{3}$$
where the first row, with ones in columns 28 and 31, is x²⁸ + x³¹, representing u1(t + 1) = u28(t)⊕u31(t), while the other rows represent u_i(t + 1) = u_{i−1}(t), 2 ≤ i ≤ 31.
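As a quick consistency check, the serial shift operation and the matrix form of Equation (2) can be compared in software. The sketch below builds T from the row description above and steps a 31-bit register both ways; the seed value and the helper names are arbitrary choices for illustration.

```python
from typing import List

ORDER = 31
FEEDBACK_TAPS = (28, 31)   # from the polynomial 1 + x^28 + x^31


def lfsr_step(state: List[int]) -> List[int]:
    """One serial shift: u1 <- u28 xor u31, and u_i <- u_(i-1) otherwise."""
    feedback = state[FEEDBACK_TAPS[0] - 1] ^ state[FEEDBACK_TAPS[1] - 1]
    return [feedback] + state[:-1]


def build_T(size: int = ORDER) -> List[List[int]]:
    """Conversion matrix: the first row holds the taps, the rest is a shift."""
    T = [[0] * size for _ in range(size)]
    for tap in FEEDBACK_TAPS:
        T[0][tap - 1] = 1
    for i in range(1, size):
        T[i][i - 1] = 1
    return T


def mat_vec_gf2(M: List[List[int]], v: List[int]) -> List[int]:
    """Matrix-vector product with AND as multiply and XOR as add."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in M]


T = build_T()
state = [1] + [0] * (ORDER - 1)   # arbitrary non-zero seed
outputs: List[int] = []           # serial output Do taken from u31
matches = True
for _ in range(200):
    outputs.append(state[-1])
    next_state = lfsr_step(state)
    matches = matches and (mat_vec_gf2(T, state) == next_state)
    state = next_state

print(matches)
```

The serial output also obeys the recurrence implied by the polynomial: each bit equals the XOR of the bits 28 and 31 positions earlier.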
Through two periods, U(t + 2) can be obtained from Equation (2) as follows:
$$U(t+2) = T \cdot U(t+1) = T \cdot T \cdot U(t) = T^{2} \cdot U(t), \tag{4}$$
Analogously, after k periods, the state vector U(t + k) can be obtained by k iterations of Equation (2) and is expressed as:
$$U(t+k) = T^{k} \cdot U(t), \tag{5}$$
where T^k is the k × k matrix. Since the k-bit parallel PRBS generator is achieved by reconstructing output data from the serial one, the state vector U(t + j × k), including k data from the serial PRBS generator at the (t + j × k)-th period, can be reconstructed as the k-bit output vector S_k(t + j) at the (t + j)-th period.
Therefore,
$$S_k(t+1) = T^{k} \cdot S_k(t), \tag{6}$$
where k denotes the degree of parallelism and is also equal to the memory width. When k exceeds the polynomial order, i.e., k = 40 > 31, the state vector U(t)_{31×1} and conversion matrix T_{31×31} can be expanded into U(t)_{k×1} and T_{k×k} by padding with zeros and ones according to Equations (2)–(5), as shown in Equation (7). Then, when k is 40, T^k in Equation (6) can be computed as Equation (A1).
$$T = \begin{bmatrix}
0 & 0 & \cdots & 1 & 0 & 0 & 1 & \cdots & 0 & 0 \\
1 & 0 & \cdots & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix}_{40 \times 40}, \tag{7}$$
From Equation (A1), we can see that the number of ones in T⁴⁰ increases, which means more XOR gates and more registers. For example, each of the tenth to twelfth rows, marked by solid lines, contains 3 ones, so each needs 2 XORs and 3 registers. Every other row needs 1 XOR and 2 registers. The extra XORs reduce the operating frequency because they add gate delays, and the additional registers increase power consumption. Additionally, the maximum number of ones is concentrated between the 23rd column and the 25th column. This indicates high fanout for the corresponding registers, which significantly increases transmission latency and circuit complexity. Therefore, to minimize power consumption and maximize operating frequency, two optimization steps are used.
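The row and column counts can be explored with a short script that raises the expanded matrix to the 40th power over GF(2). This is a sketch under the assumption that the expanded T keeps the tap row (columns 28 and 31) and simply extends the shift chain to 40 registers, as in Equation (7). The helper names are illustrative, and the script only counts ones; it does not reproduce the layout of Equation (A1).

```python
from typing import List

Matrix = List[List[int]]

K = 40                     # degree of parallelism (memory width)
FEEDBACK_TAPS = (28, 31)   # from the polynomial 1 + x^28 + x^31


def build_expanded_T(size: int = K) -> Matrix:
    """First row taps columns 28 and 31; every other row i copies register i-1."""
    T = [[0] * size for _ in range(size)]
    for tap in FEEDBACK_TAPS:
        T[0][tap - 1] = 1
    for i in range(1, size):
        T[i][i - 1] = 1
    return T


def mat_mul_gf2(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product with AND as multiply and XOR as add."""
    cols = list(zip(*B))
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in cols]
            for row in A]


def mat_pow_gf2(M: Matrix, power: int) -> Matrix:
    """Repeated multiplication starting from the identity matrix."""
    size = len(M)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul_gf2(result, M)
    return result


T40 = mat_pow_gf2(build_expanded_T(), K)
ones_per_row = [sum(row) for row in T40]
rows_with_three_ones = [i + 1 for i, n in enumerate(ones_per_row) if n == 3]
rows_driven_by_col_23 = [i + 1 for i, row in enumerate(T40) if row[22]]

print(rows_with_three_ones)
print(rows_driven_by_col_23)
```

Under this assumption, the rows with three ones are rows 10 to 12 and all other rows have two, and the 23rd column feeds rows 1, 7, 32 and 35, i.e., a fanout of 4, in line with the counts discussed above.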
In the first step, logic combination is applied to the above rows to reduce the numbers of XORs and registers.
Taking the tenth row as an example, assume that the state vector of current data S(t) is [s1(t), s2(t),…, si(t),…, s31(t)]T; s10(t + 1) in the next period can be expressed as:
s10(t + 1) = s1(t)⊕s26(t)⊕s27(t),
s′38(t) = s26(t)⊕s27(t),
s10(t + 1) = s1(t)⊕s′38(t),
where s′38(t) is the input signal of the 38th register, since the XOR is combinational logic. By analogy, s11(t + 1) and s12(t + 1) for the 11th and 12th rows change to s2(t) ⊕ s′39(t) and s3(t) ⊕ s′40(t), respectively, in which s′39(t) and s′40(t) are the input signals of the 39th and 40th registers, respectively. As a result, the number of ones in every row is reduced to two, which reduces the numbers of XORs and registers.
The second step is the reduction in fanout through logic expansion technology. For the 23rd column, marked by the dashed line in Equations (A1) and (9), several registers in the feedback paths, namely s1(t), s7(t), s32(t) and s35(t), are related to s23, indicating that the fanout is 4. Take s1(t) as an example. Before optimization, s1(t) is the XOR of s17(t − 1) and s23(t − 1). XORing each of them with s20(t − 1) yields a new combinational circuit built from s29(t − 1) and s32(t − 1), as shown in Equation (10). So, s1(t) depends only on the current inputs of s29 and s32, rather than on s23(t − 1), which reduces the fanout of s23, and so on. The fanout can thus be reduced to 2.
$$\begin{cases}
s_{1}(t) = s_{17}(t-1) \oplus s_{23}(t-1) \\
s_{7}(t) = s_{23}(t-1) \oplus s_{29}(t-1) \\
s_{32}(t) = s_{20}(t-1) \oplus s_{23}(t-1) \\
s_{35}(t) = s_{23}(t-1) \oplus s_{26}(t-1)
\end{cases}, \tag{9}$$
$$s_{1}(t) = s_{17}(t-1) \oplus s_{23}(t-1) = \left(s_{17}(t-1) \oplus s_{20}(t-1)\right) \oplus \left(s_{20}(t-1) \oplus s_{23}(t-1)\right) = s_{29}(t-1) \oplus s_{32}(t-1) \tag{10}$$
By minimizing the numbers of XORs and registers and the maximum fanout, a 40-lane parallel PRBS generator is obtained, as shown in Figure 13. It incorporates 40 XORs and 40 registers and has a maximum fanout of two, which outperforms the serial PRBS generator.
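The rewrite used in the fanout-reduction step relies on the fact that XORing a term twice cancels it. A minimal exhaustive check of the identity in Equation (10), together with a count of how many equations in Equation (9) read s23, could look as follows. The function and variable names are illustrative only.

```python
from itertools import product


def s1_direct(s17: int, s20: int, s23: int) -> int:
    """Original form: s1 reads s17 and s23 directly."""
    return s17 ^ s23


def s1_shared(s17: int, s20: int, s23: int) -> int:
    """Expanded form: s1 reuses two XOR results that already exist."""
    first_pair = s17 ^ s20    # term associated with register 29 in Equation (10)
    second_pair = s20 ^ s23   # input of register 32, see Equation (9)
    return first_pair ^ second_pair


# Exhaustive check over all 8 input combinations.
equivalent = all(s1_direct(*bits) == s1_shared(*bits)
                 for bits in product((0, 1), repeat=3))

# Sources of each register listed in Equation (9), before optimization.
equation_9 = {1: (17, 23), 7: (23, 29), 32: (20, 23), 35: (23, 26)}
fanout_of_s23 = sum(23 in sources for sources in equation_9.values())

print(equivalent, fanout_of_s23)
```

Because the shared XOR results are already computed for other registers, routing s1 through them removes one load from s23 without adding a gate to the critical path of s1.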

4. Measurement Result

The semi-custom pre-interleaver was realized and fabricated in 65 nm CMOS LP technology. The chip layout and micrograph are shown in Figure 14, with a total area of 0.615 mm² including the I/O pads. Figure 15 provides the on-chip measurement scheme and instruments.
In Figure 14 and Figure 15, the single-ended clock signals CLKwr and CLKrd, provided by the clock signal generator Agilent J-BERT N4903B, are used for the writing and reading operations. rstn, provided by the DC power supply, is used to reset all signals. Load [1:0] is used to load four embedded initial signals. After an initial signal is loaded successfully, the oscilloscope Tektronix MSO 71254C can correctly receive the output signal prbs_inte, and can be triggered synchronously by a single pulse signal loadp from load [1:0]. Additionally, considering both design cost and measurement requirements, only the first eight output signals prbs_inte [7:0] are brought out.

4.1. Results

Figure 16 presents the post-simulated and measured results of prbs_inte [1:0] when load [1:0] is “00” and “01”, respectively. It can be seen from Figure 16a that the maximum simulated frequency under SS process corner is 1 GHz. After loadp is triggered, the writing operation first works for some time (32/fclk = 32 ns), and the last of the written data are retained. After handshaking, it takes 40 periods (=40 ns) to read these interleaved data, e.g., prbs_inte [1:0]. Then, the chip alternately realizes a “write-read” process, namely data discretization. Moreover, we can see from Figure 16a that the measured results at 1.3 GHz are consistent with the post-simulated ones, indicating that it can operate normally at 1.3 GHz. When load [1:0] becomes “01”, another embedded different initial signal is loaded to generate different output signals prbs_inte [7:0], as shown in Figure 16b. Figure 16b presents the eye diagram of prbs_inte [0]. It can be seen that there is no noise in the eye diagram and the horizontal opening degree reaches 0.925 UI.

4.2. Discussion

Table 2 lists the parameters used during simulation and measurement. The power supply and frequency are the two controlled parameters, and the power supply depends on the adopted technology, e.g., a 1.2 V power supply for 65 nm LP technology. To leave margin in performance, e.g., frequency, the SS corner is generally considered in the simulation [26]. However, the masks used in fabrication, provided by the foundry, are a main influence on the measurement results and an uncontrollable factor. In terms of frequency, the measurement result is slightly higher than the simulation result due to the location of the chip on the wafer.
Since the circuit is generally applied to the interface of 400 GbE PHY, some requirements for 8 × 50 G 400 GbE scenarios, such as interface lane and transmission speed, can be considered in the measurement. Through analysis, the limitations of circuit performance contain internal factors such as voltage drop, parasitic resistance and capacitance, as well as external testing factors such as clock quality, and power noise. Therefore, the chip is completely suitable for 16 × 25 G 400 GbE scenarios rather than 8 × 50 G 400 GbE scenarios.
Table 3 summarizes the performance of the proposed pre-interleaver and compares it with interleaving circuits published in recent years. This work and Reference [27] adopted a D-flip-flop-based register to realize the memory array, while References [28,29,30] employed SRAM with fewer MOS transistors. Since the influence of peripheral auxiliary circuits, such as sense amplifiers and decoders, on the operation frequency has been eliminated, this work can achieve the highest frequency. Moreover, the memory size in this work is larger than that in Reference [27], and less than that in References [29,30], with a higher interleaving degree and less power consumption. In Reference [28], due to the adopted process and voltage supply, the leakage current in SRAM is larger, resulting in higher power consumption than in References [29,30].

5. Conclusions and Future Work

The theoretical-level research shows that pre-interleaving technology, rather than other technologies such as pre-coding, is an effective method to unconditionally mitigate DFE error propagation. At the circuit level, a high-speed, high-depth pre-interleaver has been designed and fabricated in 65 nm LP CMOS technology. The measurement results show that the data rate reaches over 40 Gb/s and the horizontal opening degree reaches 0.925 UI. Further performance benefits are obtained from the characteristic polynomial parallelization and logic expansion method, which significantly reduce the numbers of XORs and registers and the fanout. From the perspective of transmission speed, the design could be suitable for 16 × 25 G 400 GbE scenarios, rather than 8 × 50 G 400 GbE scenarios.
However, much still requires further work. One of our future tasks is to correlate the estimated BER from the simulation with a more advanced FEC [32], such as a turbo product code, a staircase code, or a low-density parity-check (LDPC) code, to further improve the current work for applications with low latency requirements, e.g., autonomous driving, trading, and AI Chiplet [33]. Additionally, the impact of impairments from other sources, e.g., the quantization noise of the analog-to-digital converter (ADC) [34], on a SerDes link with DFE error propagation needs to be studied in more depth.

Author Contributions

Conceptualization, Y.Z. and T.L.; methodology, X.Z.; formal analysis, L.Z.; investigation, Y.Z.; resources, L.L.; data curation, Q.H.; writing—original draft preparation, Y.Z.; writing—review and editing, X.Z. and L.Z.; visualization, Q.H.; supervision, T.L.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shandong Provincial Natural Science Foundation, grant number ZR2022QF146.

Data Availability Statement

The data used to support the findings of the study are available within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Algorithm A1: Evaluating BER performance with DFE error propagation
1. Input:
  • m: the number of signals used for simulation.
  • X = {x_1,…,x_m}: the launched signal.
  • BR: bit transmission frequency.
  • brlmax: the maximum burst error run length.
  • n: the number of DFE taps.
  • error_pos: storing the position of error.
  Before evaluation, the signal launched at BR is first processed through pre-interleaving, pre-emphasis, multi-tap FFE, Gaussian noise, channel, package and crosstalk by means of S-parameters, generating the output signal Y = {y_1,…,y_m}.
2. Output:
  • BER: bit error ratio.
3. for i = 1 to m do
4.   y_i is equalized by the n-tap DFE model, generating z_i
5.   if z_i − x_i ≠ 0 then
6.     error_pos_i ← i-th error position
7.     if brl > brlmax then
         brl = 0; z_i = x_i
8.     else
         brl = brl + 1
9.     end if
10.  else
11.    if brl > brlmax then
         brl = 0
12.    else
         error_pos_i ← 0
13.      for k = i − 1 to i − n do
           sum = ∑ weight(error_pos_k)
14.      end for
15.      if sum ≠ 0 then
           brl = brl + 1
16.      else
           brl = 0
17.      end if
18.    end if
     end if
19. end for
20. BER can be calculated by the expression:
        weight(error_pos)/m
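
The control flow of Algorithm A1 can be sketched in a few lines of Python. This is a minimal illustration, not the simulation environment used in this work: the S-parameter channel, FFE, package and crosstalk models are replaced by a toy channel with a unit main cursor and n post-cursor taps, the symbols are assumed to be NRZ (±1), the DFE is assumed to know the taps exactly, and the name `simulate_ber` and its parameters are hypothetical.

```python
import random


def simulate_ber(m, post_cursors, noise_sigma, brl_max, seed=0):
    """Toy BER estimate with an n-tap DFE, following the steps of Algorithm A1."""
    rng = random.Random(seed)
    n = len(post_cursors)                          # number of DFE taps
    x = [rng.choice((-1, 1)) for _ in range(m)]    # launched symbols (NRZ assumed)
    z = [0] * m                                    # DFE decisions
    error_pos = [0] * m                            # 1 marks an erroneous decision
    brl = 0                                        # current burst error run length

    for i in range(m):
        # Toy channel: main cursor + post-cursor ISI + Gaussian noise (assumption).
        isi = sum(h * x[i - k] for k, h in enumerate(post_cursors, 1) if i - k >= 0)
        y = x[i] + isi + rng.gauss(0.0, noise_sigma)

        # n-tap DFE: cancel ISI using previous decisions, then slice.
        fb = sum(h * z[i - k] for k, h in enumerate(post_cursors, 1) if i - k >= 0)
        z[i] = 1 if y - fb >= 0 else -1

        if z[i] != x[i]:
            error_pos[i] = 1
            if brl > brl_max:
                brl = 0
                z[i] = x[i]        # force a correct decision to end the burst
            else:
                brl += 1
        else:
            if brl > brl_max:
                brl = 0
            else:
                # The burst continues while an error is still inside the DFE memory.
                recent = sum(error_pos[max(0, i - n):i])
                brl = brl + 1 if recent else 0

    return sum(error_pos) / m, error_pos


if __name__ == "__main__":
    ber, _ = simulate_ber(m=20000, post_cursors=[0.4, 0.2, 0.1, 0.05, 0.02],
                          noise_sigma=0.35, brl_max=16)
    print("estimated BER:", ber)
```

The returned `error_pos` list can be inspected for runs of consecutive ones to see how a single noise-induced error grows into a burst through the feedback path.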

Appendix B

T 40 = 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 40 × 40 ,
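
A matrix of the kind written as T^40 can be obtained by raising a one-step LFSR transition matrix to the 40th power over GF(2), so that one clock of the parallel generator advances the sequence by 40 bits. The sketch below assumes the standard PRBS31 polynomial x^31 + x^28 + 1 and a 31 × 31 state-transition form with a hypothetical stage ordering; it is therefore not expected to reproduce the 40 × 40 matrix of this appendix entry by entry, and all function names are illustrative.

```python
N = 31            # LFSR length for PRBS31
TAPS = (31, 28)   # stages tapped by x^31 + x^28 + 1 (1-indexed, assumed ordering)
LANES = 40        # bits advanced per clock in the parallel generator


def serial_step(state):
    """Advance the LFSR by one bit; state[0] holds stage 1."""
    feedback = state[TAPS[0] - 1] ^ state[TAPS[1] - 1]
    return [feedback] + state[:-1]


def transition_matrix():
    """Build T over GF(2) such that next_state = T * state."""
    T = [[0] * N for _ in range(N)]
    for tap in TAPS:
        T[0][tap - 1] = 1          # new stage 1 is the XOR of the tapped stages
    for i in range(1, N):
        T[i][i - 1] = 1            # every other stage shifts by one
    return T


def gf2_matmul(A, B):
    """Matrix product with addition and multiplication taken modulo 2."""
    cols_b = list(zip(*B))
    return [[sum(a & b for a, b in zip(row, col)) & 1 for col in cols_b]
            for row in A]


def gf2_matpow(T, k):
    """Compute T^k over GF(2) by square-and-multiply."""
    size = len(T)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    base = T
    while k:
        if k & 1:
            result = gf2_matmul(result, base)
        base = gf2_matmul(base, base)
        k >>= 1
    return result


def parallel_step(T_k, state):
    """Apply a precomputed power of T to the state in a single step."""
    return [sum(t & s for t, s in zip(row, state)) & 1 for row in T_k]


if __name__ == "__main__":
    T40 = gf2_matpow(transition_matrix(), LANES)
    state = [1] * N
    serial = state
    for _ in range(LANES):
        serial = serial_step(serial)
    assert parallel_step(T40, state) == serial   # one parallel step = 40 serial steps
    print("XOR inputs per row of T^40:", [sum(row) for row in T40])
```

The number of ones in each row of the resulting matrix is the XOR fan-in needed for that state bit, which is the quantity a logic-expansion step would try to reduce.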

References

  1. IEEE 802.3 400 GbE Study Group. IEEE Standard for Ethernet Amendment 10: Media Access Control Parameters, Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation. 2017. Available online: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8207825 (accessed on 15 August 2023).
  2. Das Sharma, D. A Low-Latency, High-Bandwidth, High-Reliability, and Cost-Effective Interconnect with 64.0 GT/s PAM-4 Signaling. IEEE Micro 2021, 41, 23–29.
  3. Wang, Z.R.; Gao, J.C.; Flowers, G.T.; Wu, Y.L.; Xie, G.; Lv, Y.H. Modeling and Analysis of Signal Integrity of High-Speed Interconnected Channel with Degraded Contact Surface. IEEE Trans. Compon. Packag. Manuf. Technol. 2019, 9, 2227–2236.
  4. Bulzacchelli, J.F. Equalization for Electrical Links: Current Design Techniques and Future Directions. IEEE Solid-State Circuits Mag. 2015, 7, 23–31.
  5. Lu, Y.C.; Zhao, P.C.; Wang, W.Y.; Huang, Z.L.; Wong, H.; Tonietto, D. A Comparative Study of Equalization Schemes for 112G PAM4 Links. In Proceedings of the DesignCon 2019, Santa Clara, CA, USA, 29–31 January 2019.
  6. Chen, J.K.; Gu, Y.Z.; Xu, M.M. A 4.75-64 Gb/s PAM-4 Wireline Transmitter with 3-tap FFE in 28-nm CMOS. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21 May 2023; pp. 1–5.
  7. Tang, L.X.; Gai, W.X.; Shi, L.Q.; Xiang, X.; Sheng, K.; He, A. A 32Gb/s 133mW PAM-4 Transceiver with DFE Based on Adaptive Clock Phase and Threshold Voltage in 65nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 11–15 February 2018; pp. 114–116.
  8. Yang, M.; Shahramian, S.; Shakiba, H.; Wong, H.; Krotnev, P.; Carusone, A.C. Statistical BER Analysis of Wireline links with Non-Binary Linear Block Codes Subject to DFE Error Propagation. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 284–297.
  9. Yang, M.; Shahramian, S.; Shakiba, H. A Statistical Modeling Approach for C-encoded High Speed Wireline Links. In Proceedings of the DesignCon 2020, Santa Clara, CA, USA, 28–30 January 2020.
  10. Kim, K.; Kwon, P.; Alon, E. Accurate Statistical BER Analysis of DFE Error Propagation in the Presence of Residual ISI. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 619–623.
  11. Tang, T.; Li, Y.B.; Wang, J. Improved Methods of Decision Feedback Equalization for Error Propagation Prevention. In Proceedings of the IEEE 9th Conference on Industrial Electronics and Applications, Hangzhou, China, 20 October 2014; pp. 1072–1076.
  12. Lu, Y.C.; Wong, H.; Tonietto, D. DFE Error Propagation Characteristics in real 56Gbps PAM4 High-Speed Links with Pre-coding and Impact on the FEC Performance. In Proceedings of the DesignCon 2017, Santa Clara, CA, USA, 31 January–2 February 2017.
  13. Zhang, G. Preliminary Studies on DFE Error Propagation, Precoding, and their Impact on KP4 FEC Performance for PAM4 Signaling Systems. IEEE 802.3 Interim Meeting, 2018. Available online: http://www.ieee802.org/3/ck/public/18_09/zhang_3ck_01a_0918.pdf (accessed on 15 August 2023).
  14. Wang, T.T.; Wang, Z.F.; Wang, X.Y.; Sun, J.Q.; Ghiasi, A. Analysis and Comparison of FEC Schemes for 200GbE and 400GbE. IEEE Commun. Stand. Mag. 2017, 1, 24–30.
  15. Zhan, Y.Z.; Hu, Q.S. Effect of DFE Error Propagation and its Mitigation using MUX-based FEC Interleaving for 400 GbE Electrical Link. High Technol. Lett. 2018, 24, 387–395.
  16. Yuan, F.; AL-Taee, A.R.; Ye, A.; Sadr, S. Design Techniques for Decision Feedback Equalization of Multi-giga-bit-per-second Serial Data Links: A state of the art review. IET Circuits Devices Syst. 2014, 8, 118–130.
  17. Wu, X.; Hu, Q.S. Design of a 6.25Gb/s Adaptive Decision Feedback Equalizer in 0.18μm CMOS Technology. In Proceedings of the IEEE Workshop on Advanced Research and Technology in Industry Applications, Ottawa, ON, Canada, 29–30 September 2014; pp. 1209–1212.
  18. Pan, M.; Feng, J. Design of a Low-Power 20Gb/s 1:4 Demultiplexer in 0.18μm CMOS. Chin. J. Electron. 2015, 24, 71–75.
  19. Wu, K.Q.; Liga, G.; Lee, J.; Paulissen, L.; Riani, J.; Alvarado, A. DFE State-Tracking Demapper for Soft-Input FEC in 800G Data Center Interconnects. In Proceedings of the European Conference on Optical Communication (ECOC), Basel, Switzerland, 18–22 September 2022; pp. 1–4.
  20. Ali, T.; Chen, E.; Park, H.; Yousry, R.; Ying, Y.-M.; Abdullatif, M.; Gandara, M.; Liu, C.-C.; Weng, P.-S. A 460mW 112Gb/s DSP-Based Transceiver with 38dB Loss Compensation for Next-Generation Data Centers in 7nm FinFET Technology. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 1–3.
  21. Matsuda, T.; Nishiyama, K.; Seki, T. Data-Centric Transmission with Adaptive FEC for Ultra-Low Latency Resource Sharing in Wide Area. In Proceedings of the European Conference on Optical Communication (ECOC), Basel, Switzerland, 18–22 September 2022; pp. 1–4.
  22. Sasaki, M.; Ikeda, M.; Asada, K. 4-Gb/s Low-Power PRBS Generator with Wave-Pipeline Technique in 0.18-μm CMOS. In Proceedings of the 13th IEEE International Conference on Electronics, Circuits and Systems, Nice, France, 10–13 December 2006; pp. 1007–1010.
  23. Bohorquez, R.G.; Nour, C.A.; Douillard, C. Channel Interleavers for Terrestrial Broadcast: Analysis and Design. IEEE Trans. Broadcast. 2014, 60, 679–692.
  24. He, C.; Li, B.H.; Chen, X.M. Unconventional Use of SRAM in a 32-bit SOPC System. Astron. Res. Technol. 2013, 10, 42–48.
  25. Cai, C.; Zheng, X.Q.; Chen, Y.; Wu, D.Y. A 1.4-Vppd 64-Gb/s PAM-4 Transmitter with 4-Tap Hybrid FFE Employing Fractionally-Spaced Pre-Emphasis and Baud-Spaced De-Emphasis in 28-nm CMOS. In Proceedings of the IEEE 47th European Solid State Circuits Conference (ESSCIRC), Grenoble, France, 6–9 September 2021; pp. 527–530.
  26. Zhao, Z.; Gu, Y.X.; Fu, Y.H. A High Frequency Accuracy, High Stability and Tunable RC Oscillator. In Proceedings of the IEEE 16th International Conference on Solid-State & Integrated Circuit Technology (ICSICT), Nanjing, China, 25–28 October 2022; pp. 1–3.
  27. Shrestha, R.; Paily, R. Design and Implementation of a Linear Feedback Shift Register Interleaver for Turbo Decoding. In Proceedings of the International Conference on Progress in VLSI Design & Test, Shibpur, India, 1–4 July 2012; pp. 30–39.
  28. Zhao, H.; Fan, S.Q.; Chen, L.C.; Song, Y.; Geng, L. A 0.2V-1.8V 8T SRAM with Bit-interleaving Capability. IEICE Electron. Express 2014, 11, 1–8.
  29. Wen, L.; Zhang, Y.J.; Zeng, X.Y. Column-Selection-Enabled 10T SRAM Utilizing Shared Diff-VDD Write and Dropped-VDD Read for Power Reduction. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1470–1474.
  30. Wen, L.; Liu, Y.; Mo, W.; Zhang, J.; Qi, S.Q.; Lv, J.P.; Zhang, Y.J. A 96kb, 0.36V, Energy-Efficient 8T-SRAM with Column-Selection and Shared Buffer-Foot Techniques for EEG Processor. In Proceedings of the IEEE 13th International Conference on ASIC (ASICON), Chongqing, China, 29 October–1 November 2019; pp. 1–4.
  31. Wen, L.; Cheng, X.; Zhou, K.J.; Tian, S.D.; Zeng, X.Y. Bit-Interleaving-Enabled 8T SRAM with Shared Data-Aware-Write and Reference-Based Sense Amplifier. IEEE Trans. Circuits Syst. II Express Briefs 2016, 64, 643–647.
  32. Liu, Y.C. 100+ Gb/s Ethernet Forward Error Correction (FEC) Analysis. In Proceedings of the DesignCon 2019, Santa Clara, CA, USA, 29–31 January 2019.
  33. Wu, Y.J.; Li, T.Z.; Shao, Z. An Efficient Design Framework for 2×2 CNN Accelerator Chiplet Cluster with SerDes Interconnects. In Proceedings of the IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hangzhou, China, 11–13 June 2023; pp. 1–5.
  34. Khairi, A.; Krupnik, Y.; Laufer, A.; Segal, Y.; Cusmai, M.; Levin, I.; Gordon, A.; Sabag, Y.; Rahinski, V.; Lotan, I.; et al. A 1.41-pJ/b 224-Gb/s PAM4 6-bit ADC-Based SerDes Receiver with Hybrid AFE Capable of Supporting Long Reach Channels. IEEE J. Solid-State Circuits 2023, 58, 8–18.
Figure 1. Signal integrity problem: (a) frequency response of the transmission channel; (b) ideal signal vs. infect signal.
Figure 2. Frequency response and corresponding CTLE [5].
Figure 3. Framework diagram of pre-interleaver.
Figure 4. DFE error propagation process.
Figure 5. DFE error propagation: (a) continuous, (b) discontinuous.
Figure 6. Pre-interleaving architecture in the Ethernet physical layer.
Figure 7. Error mitigation process of two schemes: (a) non-interleaving, (b) proposed PBM.
Figure 8. SerDes link.
Figure 9. BER performance analysis: (a) proposed PBM; (b) DFE with interleaving and pre-coding in Reference [19].
Figure 10. The block diagram of pre-interleaver in PBM.
Figure 11. Serial PRBS31 generator.
Figure 12. Row-column interleaver.
Figure 13. 40-lane parallel PRBS generator.
Figure 14. Chip layout (a) and micrograph (b).
Figure 15. On-chip measurement scheme and instruments.
Figure 16. Post-simulated and measured results of prbs_inte [1:0]: (a) load [1:0] is 00 (b) load [1:0] is 01.
Table 1. Comparison of pre-interleaving schemes.
| Performance | | FOM | PBM | PSM |
|---|---|---|---|---|
| Complexity | Memory (bit) | 120 | 1500 | 1500 |
| | MUX | 8 (1-bit) | 8 (1-bit) | 8 (10-bit) |
| Latency | | 120T | 600T | 1000T |
| Interleaving Gain (dB) | | 0.27 | 0.35 | 0.48 |
Table 2. Parameters during simulation and measurement.
| Parameter | Simulation | Measurement |
|---|---|---|
| Technology | 65 nm | 65 nm |
| Power supply | 1.2 V | 1.2 V |
| Frequency | 1 GHz | 1.3 GHz |
| load [1:0] | 1.2 V/0 V | 1.2 V/0 V |
| Corner | SS | - |
| Mask | - | Y |
| EDA tools | Virtuoso | Agilent J-BERT N4903B, Tektronix MSO 71254C, DC power supply |
Table 3. Performance summary and comparison.
| | Technology | Memory Size | Memory Type | Frequency | Supply (V) | Power (mW) | Area (mm²) |
|---|---|---|---|---|---|---|---|
| This work | 65 nm | 32-row × 40-column | Register | 1.3 GHz | 1.2 | 38.52 (a) | 0.613 |
| [27] | 130 nm | 1-row × 14-column | Register | 561.498 MHz | 1.2 | 0.495 | 2.72 × 10⁻³ |
| [28] | 180 nm | 64-row × 16-column | 8T SRAM (c) | 208 MHz | 1.8 | 5.6 | 0.1367 |
| [29] | 65 nm | 512-row × 11-column | 10T SRAM (c) | 6.25 MHz | 0.5 (b) | 0.062 | 0.05 |
| [30] | 65 nm | 6 × 256-row × 64-column | 8T SRAM (c) | 125 kHz | 0.36 (b) | 5.1 × 10⁻³ | 0.5781 [31] |
(a) PRBS generator included. (b) Low voltage adopted. (c) T stands for transistor.
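
A few derived figures follow directly from the numbers reported for this work (32 × 40 register array, 1.3 GHz clock, 41.6 Gb/s data rate, 38.52 mW). The short calculation below is only a back-of-the-envelope sketch: the function name `derived_figures` is illustrative, and the energy-per-bit value inherits the caveat that the reported power includes the PRBS generator.

```python
def derived_figures(rows, cols, clock_hz, data_rate_bps, power_w):
    """Derive memory size, bits moved per clock and energy per bit."""
    memory_bits = rows * cols
    bits_per_cycle = data_rate_bps / clock_hz
    energy_pj_per_bit = power_w / data_rate_bps * 1e12
    return memory_bits, bits_per_cycle, energy_pj_per_bit


if __name__ == "__main__":
    mem, bpc, epb = derived_figures(rows=32, cols=40, clock_hz=1.3e9,
                                    data_rate_bps=41.6e9, power_w=38.52e-3)
    print(f"memory size      : {mem} bits")
    print(f"bits per clock   : {bpc:.1f}")
    print(f"energy per bit   : {epb:.3f} pJ/bit")
```

With these inputs the array holds 1280 bits, the data rate corresponds to 32 bits per clock cycle, and the energy cost is roughly 0.93 pJ/bit.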
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
