A Multichannel, High-Bandwidth Wirelane Receiver for D2D Interconnects

Zhang, Geng; Lai, Mingche; Lyu, Fangxu

doi:10.3390/electronics11182864


by Geng Zhang^1,2,†, Mingche Lai^2,† and Fangxu Lyu^2,*

¹ School of Air and Missile Defense College, Air Force Engineering University, Xi’an 710000, China

² School of Computer, National University of Defense Technology, Changsha 410003, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Electronics 2022, 11(18), 2864; https://doi.org/10.3390/electronics11182864

Submission received: 14 July 2022 / Revised: 18 August 2022 / Accepted: 8 September 2022 / Published: 10 September 2022

(This article belongs to the Section Circuit and Signal Processing)


Abstract: This paper proposes a multichannel, high-bandwidth (BW) receiver for standard-packaging die-to-die (D2D) interconnects. The receiver adopts the forward clock (FCK) architecture of the high-density transmission standard and consists of 16 high-speed data paths and a pair of low-speed differential clocks, for a BW of 512 Gbps. To reduce chip area and power consumption, a shared minimal phase-locked loop (MINI-PLL) and a clock and data alignment (CDA) circuit replace the clock data recovery (CDR) circuit of the traditional receiver. A delay-matching circuit is adopted to combat PVT variation and lane skew. In addition, a high-linearity phase interpolator (PI) is used in the MINI-PLL to adjust the clock phase and improve the clock jitter performance. In 28 nm CMOS technology, the overall link power consumption is 1.56 pJ/b. The bit error rate (BER) is less than $10^{-15}$ under real S-parameters with a channel loss of 10 dB at 16 GHz.

Keywords: die-to-die; phase interpolator; forward clock; jitter performance; bit error rate

1. Introduction

As data exchange rates rise, the volume of transmitted data has grown exponentially, which matters greatly for hyperscale data centers, high-performance central processing units (CPUs), graphics processing units (GPUs), and artificial intelligence (AI). The development of systems-on-chip (SoCs) therefore faces unprecedented challenges. With the slowdown of Moore’s Law, further improvements in chip performance and power consumption are increasingly uneconomical. Heterogeneous integration (chiplet) technology provides a new design solution to this problem [1,2]. Multiple smaller chips in a multichip module (MCM) are linked at the edge of each chip by D2D interconnects, which must offer extremely low power consumption and high BW. In high-performance computing (HPC) and AI applications, a large SoC is divided into two or more homogeneous chips. In networking applications, the I/O and network cores are divided into separate chips, as shown in Figure 1. D2D interconnects in such SoCs must not affect overall system performance [3], so D2D interconnect design focuses on higher bandwidth density (bandwidth per area/layer), lower power consumption (power per unit data rate), and lower BER [4].

Traditional D2D interconnects often adopt single-ended, low-speed parallel transmission [5,6] to increase bandwidth density, but with the rapid growth in the amount of exchanged information, this method can no longer meet the demand. To address the low bandwidth, pulse amplitude modulation (PAM) has been used in D2D interconnects, but it brings larger area and higher power consumption [7]. In addition, PAM4 has a lower signal-to-noise ratio (SNR) than NRZ, so its BER is higher than that of systems using NRZ. With the application of SerDes between chips, D2D interconnects have gradually turned to single-ended, high-speed transmission to further increase bandwidth density. However, as the speed continues to increase, signal margins shrink and the signals become more susceptible to external interference. Single-ended signaling (SES) therefore depends on advanced packaging technology [8] and is not suitable for standard packages. To suit standard packaging and higher rates, differential signaling (DS) is often used. However, owing to the circuit characteristics of DS, its theoretical power consumption for D2D interconnection is twice that of SES, and its bandwidth density is half. It is therefore necessary to improve the bandwidth density and reduce the power consumption of DS transmission so that it suits standard packaging at high speed.

This paper proposes a solution to the problems of DS transmission in D2D interconnects. A MINI-PLL and a CDA circuit replace the CDR of the traditional receiver, which reduces power consumption and area, and a delay-matching circuit is adopted to combat PVT variation and lane skew.

2. Receiver Design

To improve clock quality and reduce BER, a forward clock (FCK) is often used in D2D interconnects. Figure 2 shows an FCK architecture [9,10]. In FCK, the clock and the data are well matched in frequency, so little jitter is introduced during data transmission. The sampling clock can be adjusted by a phase interpolator (PI), so a length mismatch between the data line and the clock line is tolerated in FCK transmission. The advantage of this mode is that the sending and receiving ends are relatively independent, which eases modular design and reduces the complexity of the receiver, so it is well suited to D2D interconnects [6,7,11]. The additional clock path occupies some interconnect channels and adds silicon area and power consumption. However, this overhead is shared among the data paths, so its per-lane cost falls as the number of channels increases.

In D2D interconnects, each channel is often configured with its own clock data recovery (CDR) circuit, because the clock skew differs from channel to channel. In this design, CDA and PLL structures are used instead of the CDR, and a delay-matching element $δ$ is used to align the clock on each channel. The advantage of this method is that no per-channel CDR is needed, so chip area and power consumption are greatly reduced in multichannel transmission, thereby increasing the bandwidth density.

Figure 3 shows the overall structure of the receiver. The PLL, which comprises a PFD, a filter, a charge pump (CP), a VCO, a divide-by-4 circuit (DIV4), and a PI [12], generates eight equally spaced clock phases. The MINI-PLL adjusts the phase of the PI so that the clock generated by the VCO samples the data at the best position. The CDA processes the early/late phase information and sends it to the PI [4,13].

2.1. CDA

As shown in the data path and CDA digital design of Figure 4, the sampler samples the 32 Gbps data and generates four streams of data information and four streams of data edge information, each at 8 Gbps. After the DMUX module, 16 aligned 2 Gbps streams of data and edge information (D[0:15] + E[0:15]) are selected from L0. The phase detector (PD) module then generates early/late/hold signals, and after voting in the voter module, the final early/late/hold signals are sent to the digital filter. The eight-bit filter output is transformed into 35 control bits by the weight coding. Among them, P<0:2> is a Gray code that selects the quadrant, and Bit[0:31] is a thermometer code that controls the PI rotation. The MINI-PLL adjusts the phase of the PI so that the clock generated by the VCO samples the data at the best position. After the L0 channel is locked, the delay calibration applies a different delay to the MINI-PLL clock for each channel so that every channel is matched.

2.2. Delay Matching $δ$

Differences in channel length and PVT variation across the chip cause the signals to arrive at the receiving end at different times, so each channel needs its own matching delay. As shown in Figure 5, the MINI-PLL locks on the L0 channel and samples the data at the best position, and L1 to L15 then delay this clock using the following scheme. Taking L1 as an example, while the transmitter sends PRBS31 for training, the PI in $δ$ is swept through the control table (from 000000 to 111111), and the received data yield a bathtub curve. The control code at the best position (Time = 0 ps), $X_{1}X_{2}X_{3}X_{4}X_{5}X_{6}$, is looked up in the control table and applied to the PI. L1 to L15 use the same method, which gives 15 control codes, one for the $δ$ of each of L1 to L15.
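The training sweep described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `best_pi_code`, the per-code error counts, and the choice of the centre of the widest error-free window as the "best position" are all hypothetical, and the sweep is treated as circular because a PI phase wraps around.

```python
def best_pi_code(errors):
    """Return the PI code at the centre of the widest error-free window.

    errors[i] is the number of bit errors counted at PI code i during
    PRBS training (hypothetical input). The sweep is treated as circular.
    """
    n = len(errors)
    if all(e == 0 for e in errors):
        return 0  # no eye edge visible in the sweep; any code works
    best_start, best_len = 0, 0
    for start in range(n):
        # A window starts right after a code that produced errors.
        if errors[start] == 0 and errors[start - 1] != 0:
            length = 0
            while errors[(start + length) % n] == 0:
                length += 1
            if length > best_len:
                best_start, best_len = start, length
    if best_len == 0:
        raise ValueError("no error-free PI code found")
    return (best_start + (best_len - 1) // 2) % n


# Example: a 6-bit sweep (64 codes) with an open eye at codes 20..40.
sweep = [5] * 64
for code in range(20, 41):
    sweep[code] = 0
print(format(best_pi_code(sweep), "06b"))
```

The returned code plays the role of $X_{1}X_{2}X_{3}X_{4}X_{5}X_{6}$ for one lane; repeating the sweep for L1 to L15 would give the 15 per-lane codes.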

3. Receiver Circuits

3.1. DMUX1:2

The structure of the DMUX is shown in Figure 6a. The DMUX supports input data rates below 20 Gbps and clock frequencies below 10 GHz [14]. After passing through the 1/4 sampler, the data run at 10 Gbps, and the clock frequency (5 GHz) is half of the data rate. In the D1 path, Latch1 and Latch2 form a rising-edge-triggered flip-flop, and Latch2 and Latch3 form a falling-edge-triggered flip-flop, so the data are sampled on the rising edge of the clock and output on the falling edge. In the Dout2 path, Latch4 and Latch5 form a falling-edge-triggered flip-flop, so the data are both sampled and output on the falling edge of the clock. The timing diagram of the output is shown in Figure 6b.

3.2. CDA Design

3.2.1. Digital Model

The CDA adopts an all-digital structure; Figure 7 shows the entire digital structure. The 2 Gbps data D[0:15] and edge information E[0:15] are divided into four groups, each containing four bang-bang phase detectors (BBPDs). After the phase-detection results of each group pass through the voter, a signed digital signal (−1 to +1) is formed, where “+” represents a lead, “−” represents a lag, and “0” is a hold. The four group signals are summed by the adder to form a new signed digital signal (−4 to +4). The bandwidth controller (BW) then selects the appropriate bandwidth, amplifying or maintaining the signal, and generates unsigned eight-bit lead-lag information. The filter output is converted by the weight coding into a thermometer code that controls the PI rotation.
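The detect, vote, and add steps of this digital model can be illustrated with a short Python sketch. It assumes the conventional bang-bang (Alexander) decision rule, in which an edge sample equal to the previous data bit means the clock leads; the function names and the extra `next_bit` argument (the first bit of the following word, needed by the 16th detector) are hypothetical and are not taken from Figure 7.

```python
def bbpd(d_prev, edge, d_next):
    """Bang-bang decision for one bit boundary: +1 lead, -1 lag, 0 hold."""
    if d_prev == d_next:
        return 0  # no data transition, so no timing information
    # Assumed convention: edge sample still equal to the old bit -> clock leads.
    return 1 if edge == d_prev else -1


def vote(decisions):
    """Majority vote of one group, reduced to -1, 0, or +1."""
    total = sum(decisions)
    return (total > 0) - (total < 0)


def cda_phase_error(data, edges, next_bit):
    """Combine 16 BBPD decisions into a signed value in the range -4..+4."""
    bits = list(data) + [next_bit]
    decisions = [bbpd(bits[i], edges[i], bits[i + 1]) for i in range(16)]
    groups = [decisions[i:i + 4] for i in range(0, 16, 4)]
    return sum(vote(group) for group in groups)


# Example: alternating data with every edge sample equal to the old bit.
data = [i % 2 for i in range(16)]
print(cda_phase_error(data, edges=data, next_bit=0))
```

The signed result would then be scaled by the bandwidth controller and accumulated by the digital filter, which this sketch does not model.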

3.2.2. Weight Coding

The eight-bit early/late information output by the digital filter is sent to the weight coding, which forms a phase control code (generated from the upper three bits) and a thermometer control code (generated from the lower five bits).

When the control signal is 00000000 in the initial state of the circuit, the last five bits of the control signal are 00000 (indicating that the clock lags), and the phase of the corresponding output clock is $2π$. As the value of the control signal increases, the clock lags further, so the output clock phase must be reduced for the clock to track the data. The output phase therefore rotates clockwise from the fourth quadrant, as shown in Figure 8. When the phase crosses into a new region, the five-bit signal changes abruptly. The last five bits must therefore be pre-processed before they are decoded; the trend after processing is shown in Figure 9.

In the coding process, the phase control code and the thermometer code are designed separately. The upper three bits are converted into a Gray code, which is free of race hazards; the overall logic is shown in Table 1. The new five-bit signal A4A3A2A1A0 is then decoded, which avoids the race and hazard phenomena of binary codes. The conversion relationship is shown in Table 2.
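As a rough illustration of the split described in this subsection, the Python sketch below separates an eight-bit value into a three-bit Gray code and a 32-bit thermometer code. It is an assumption-laden sketch: the exact mappings of Tables 1 and 2 and the pre-processing of the lower five bits are not reproduced, the standard reflected Gray code is used in their place, and the function name `weight_code` is hypothetical.

```python
def weight_code(ctrl):
    """Split an 8-bit filter output into (gray, thermometer).

    gray        : 3-bit reflected Gray code of the upper three bits
    thermometer : list of 32 bits, Bit[0] first, with `lower` leading ones
    """
    if not 0 <= ctrl <= 0xFF:
        raise ValueError("ctrl must be an 8-bit value")
    upper = ctrl >> 5          # phase (region) selection
    lower = ctrl & 0x1F        # position inside the region
    gray = upper ^ (upper >> 1)
    thermometer = [1 if i < lower else 0 for i in range(32)]
    return gray, thermometer


# Adjacent regions differ in exactly one Gray bit, including the 7 -> 0 wrap.
for region in range(8):
    g_now, _ = weight_code(region << 5)
    g_next, _ = weight_code(((region + 1) % 8) << 5)
    print(format(g_now, "03b"), "->", format(g_next, "03b"))
```

The single-bit change between neighbouring Gray codes is what removes the race hazard when the phase control code switches regions.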

3.3. MINI-PLL

3.3.1. Design

LC oscillators use passive components, such as inductors and capacitors, which occupy a large area, complicate the process, and are difficult to redesign. They also raise packaging and EMI issues that require further consideration. Ring oscillators achieve a high level of integration and good compatibility without additional process steps or packaging. In addition, ring oscillators can provide multiphase clocks.

The RX-PLL in Figure 10 adopts a divide-by-four structure. The ring oscillator (number of stages N = 4) outputs eight phases to improve the rotation accuracy of the PI. The clock and data alignment (CDA) circuit detects the edge information and produces an eight-bit result, from which the weight coding generates three phase-code bits and 32 thermometer-code bits. When the clock generated by the phase-locked loop samples the data at the best position, the loop is locked and the output clock jitter is 2.4 ps.

3.3.2. PFD

The structure of the phase-frequency detector (PFD) in this paper is shown in Figure 11a. It is mainly composed of two D flip-flops with reset, a delay module, and an AND gate. The reference clock $f_{ref}$ and the feedback signal $f_{div}$ drive the CLK inputs of the D flip-flops after passing through inverter buffers. When there is a frequency or phase difference between the two input signals, QA_P or QB_P outputs a high level and the charge pump operates. When the two input signals are synchronized, both QA_P and QB_P are low. QA_N and QB_N are the inverted outputs of QA_P and QB_P, respectively.

Figure 11b is the timing diagram of the PFD. When the reference clock $f_{ref}$ leads, QA_P goes high before QB_P. Once both QA_P and QB_P are high, their logical AND drives the Reset signal high; Reset reaches the two D flip-flops after a certain delay and returns QA_P and QB_P to the low level. When the feedback signal $f_{div}$ leads, QB_P goes high before QA_P and finally returns to the low level under the action of the Reset signal.
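The behaviour in the timing diagram can be mimicked with a small event-driven Python model. This is a behavioural sketch only, assuming ideal edges and a reset delay shorter than the spacing between input edges; the function name `pfd_pulses` and its arguments are hypothetical.

```python
def pfd_pulses(ref_edges, div_edges, reset_delay=0.0):
    """Return (qa_high_time, qb_high_time) for lists of rising-edge times.

    QA_P is set by a reference edge, QB_P by a feedback edge. When both
    are high, the AND gate resets both flip-flops after `reset_delay`.
    """
    events = sorted([(t, "ref") for t in ref_edges] +
                    [(t, "div") for t in div_edges])
    qa = qb = False
    qa_rise = qb_rise = 0.0
    qa_time = qb_time = 0.0
    for t, source in events:
        if source == "ref" and not qa:
            qa, qa_rise = True, t
        elif source == "div" and not qb:
            qb, qb_rise = True, t
        if qa and qb:
            t_reset = t + reset_delay
            qa_time += t_reset - qa_rise
            qb_time += t_reset - qb_rise
            qa = qb = False
    return qa_time, qb_time


# Reference leads the feedback by 1 time unit in each of three cycles.
up, down = pfd_pulses([0.0, 10.0, 20.0], [1.0, 11.0, 21.0], reset_delay=0.5)
print(up - down)  # net QA_P time minus QB_P time
```

The difference between the two pulse widths is proportional to the phase error, which is the quantity the charge pump converts into a control-voltage change.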

3.3.3. Charge Pump

The charge pump adopts differential input and differential output, which suppresses common-mode noise well and prevents clock feedthrough. Its circuit structure is shown in Figure 12a. QA_P, QA_N, QB_P, and QB_N are the two pairs of phase-difference signals output by the PFD. M15, M16, M17, M18, M4, M5, and M6 form a current-mirror structure. M7, M8, M9, and M10 are four switch transistors, all NMOS, which eliminates the mismatch caused by using PMOS and NMOS transistors together as switches and thereby reduces current mismatch. M11, M12, M13, and M14 are connected between the switch transistors and the output terminals to isolate clock feedthrough. M1, M2, and M3 are kept on under the control of the common-mode voltage $V_{CM}$ and the common-mode points $V_{CP}$ and $V_{CN}$ fed back by the filter. Figure 12b shows the variation of the differential signal ($V_{DS} = V_{CP} - V_{CN}$) with time; at about 0.42 μs, the PLL is locked and samples the data at the best position.

3.3.4. VCO

The VCO block includes the control voltage regulation, digital ring oscillator, and level conversion circuit, where I1 and I2 are digitally controlled current sources as shown in Figure 13. The CML to CMOS module converts the signal output by the ring oscillator into a clock signal whose swing and duty cycles meet the requirements through negative feedback.

Figure 14 shows the delay element of the ring oscillator used in this design. The fewer the ring oscillator stages (minimum two), the smaller the power consumption, delay, and area. The delay of each stage is the sum of the inverter and latch delays, and $α$ is the ratio of the latch and inverter parameters; $α$ is set to 0.7 to ensure that the oscillation can start [15]. M1 and M2 form the inverter, and M5 and M6 form the latch. Adding M3 and M4 effectively increases the oscillation frequency of the ring oscillator.

The load changes the center frequency of the ring oscillator (6 GHz to 11 GHz) and the VCO gain (4 GHz/V to 8 GHz/V). Under different I1 and I2 settings, the desired VCO output frequency ranges are obtained, as shown in Figure 15: Figure 15a covers the 9.5 GHz to 11 GHz clock, Figure 15b the 7.5 GHz to 8.5 GHz clock, and Figure 15c the 6 GHz to 7 GHz clock.

3.3.5. PI

(1) Method

PI is the key module in this design, and the linearity of its rotation will directly affect the overall performance of the receiver. In the design, the monotonicity and linearity of PI are the main concerns. Its linearity can be represented by a proportional function.

$$φ_{out} = k_{PI} \, n, \quad 0 \leq n \leq N, \quad 0 \leq φ_{out} \leq 2π$$

(1)

$$V_{out} = A_{1} \sin(ω t) + A_{2} \sin(ω t + φ_{d})$$

(2)

$$φ_{out} = \arctan\left(\frac{A_{2} \sin φ_{d}}{A_{1} + A_{2} \cos φ_{d}}\right)$$

(3)

$k_{PI}$ is the gain of the PI, and n is the control code. Equation (1) shows that when n increases from 0 to N, the output phase increases from 0 to $2π$. If $k_{PI}$ remains constant, the relationship between $φ_{out}$ and n is monotonic and linear. However, the output of an actual PI is the nonlinear sum of sinusoids in Equation (2), whose phase and amplitude are determined by $A_{1}$, $A_{2}$, and $φ_{d}$, as in Equation (3).

In the traditional PI, $φ_{d}$ is $90^{\circ}$, so the range 0 to $2π$ is divided into four quadrants. Each quadrant is interpolated separately, which degrades the linearity of the phase interpolator. In this work, since the designed MINI-PLL generates eight equally spaced clock phases, the PI interpolates between $0^{\circ}$ and $45^{\circ}$, which greatly improves the linearity. When $φ_{d}$ is $45^{\circ}$, Equations (4) and (5) are obtained.

$$φ_{out} = \arctan\left(\frac{\frac{\sqrt{2}}{2} A_{2}}{A_{1} + \frac{\sqrt{2}}{2} A_{2}}\right)$$

(4)

$$V_{out} = \sqrt{A_{1}^{2} + A_{2}^{2} + \sqrt{2} A_{1} A_{2}} \, \sin(ω t + φ_{out})$$

(5)
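For readers who want the intermediate step, Equations (4) and (5) follow from Equation (2) by expanding the second sine and collecting the $\sin(ω t)$ and $\cos(ω t)$ terms. The derivation below is a standard trigonometric identity written out for general $φ_{d}$; it is added here as a worked step and is not quoted from the paper.

```latex
\begin{aligned}
V_{out} &= A_{1}\sin(\omega t) + A_{2}\sin(\omega t + \varphi_{d}) \\
        &= (A_{1} + A_{2}\cos\varphi_{d})\sin(\omega t)
           + A_{2}\sin\varphi_{d}\,\cos(\omega t) \\
        &= \sqrt{A_{1}^{2} + A_{2}^{2} + 2A_{1}A_{2}\cos\varphi_{d}}\;
           \sin(\omega t + \varphi_{out}), \\
\tan\varphi_{out} &= \frac{A_{2}\sin\varphi_{d}}{A_{1} + A_{2}\cos\varphi_{d}} .
\end{aligned}
```

Setting $φ_{d} = 45^{\circ}$, so that $\sin φ_{d} = \cos φ_{d} = \sqrt{2}/2$, gives the $\sqrt{2} A_{1} A_{2}$ term of Equation (5) and the argument of the arctangent in Equation (4).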

Owing to the symmetry of the quadrants, the model is simulated for linearity in the first quadrant only and compared with a traditional PI. Figure 16 shows the linearity analysis and comparison of the PI. Figure 16a plots the output phase against the thermometer code for $φ_{d} = 45^{\circ}$. This model has an inflection point at $22.5^{\circ}$, and the curve closely follows the ideal straight line. The traditional phase interpolator ($φ_{d} = 90^{\circ}$) has an inflection point at $45^{\circ}$, as shown in Figure 16b. The fitted curves in Figure 16c show that the phase interpolator with $φ_{d} = 45^{\circ}$ is closer to the ideal value. The eight-phase method is therefore considerably more linear than the traditional PI.
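The comparison can be reproduced numerically from Equation (3). The Python sketch below assumes linear amplitude weighting with $A_{1} + A_{2} = 1$ (that is, $A_{2} = n/N$), which is a simplification of the real current-steering circuit; the function names are hypothetical.

```python
import math


def pi_phase_deg(a, phi_d_deg):
    """Output phase from Equation (3) with A1 = 1 - a and A2 = a."""
    phi_d = math.radians(phi_d_deg)
    a1, a2 = 1.0 - a, a
    return math.degrees(
        math.atan2(a2 * math.sin(phi_d), a1 + a2 * math.cos(phi_d)))


def max_phase_error_deg(phi_d_deg, steps=32):
    """Largest deviation from the ideal straight line over one span."""
    worst = 0.0
    for n in range(steps + 1):
        a = n / steps
        ideal = a * phi_d_deg
        worst = max(worst, abs(pi_phase_deg(a, phi_d_deg) - ideal))
    return worst


print("45-degree span:", max_phase_error_deg(45.0))
print("90-degree span:", max_phase_error_deg(90.0))
```

Under this idealized weighting, the worst-case deviation over a $45^{\circ}$ span is several times smaller than over a $90^{\circ}$ span, which is consistent with the trend in Figure 16.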

(2) Circuit

Figure 17a shows the structure of a traditional PI with equal current sources. The input transistors M1, M2, M3, and M4 all have the same size, and the loads R1 and R2 are both equal to R. The inputs are two pairs of quadrature differential signals, $v_{IP}$ and $v_{QP}$, and $v_{IN}$ and $v_{QN}$ ($0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$). The PI interpolates between the phases of these two pairs of clocks to obtain a recovered clock whose phase lies between them. The phase of the recovered clock is adjusted by changing the tail currents of the two differential pairs.

When $φ_{d} = 45^{\circ}$, the input clocks are eight equally spaced clock phases ($v_{0} = 0^{\circ}$, $v_{1} = 45^{\circ}$, $v_{2} = 90^{\circ}$, $v_{3} = 135^{\circ}$, $v_{4} = 180^{\circ}$, $v_{5} = 225^{\circ}$, $v_{6} = 270^{\circ}$, $v_{7} = 315^{\circ}$), as shown in Figure 17b.

Figure 18 shows the overall circuit design of the PI. The tail of each differential pair is composed of 32 controllable current sources, and R1 and R2 are used as load resistors. The phase control terminals P<0:2> are driven by the weight coding in the CDA module described above. BIT0 to BIT31 are the thermometer-code bits that control the PI within each $45^{\circ}$ range. The phase control code and thermometer code together divide 0–$360^{\circ}$ into 256 points, so the resolution of the PI is about $1.4^{\circ}$. The control bits SWITCH_0, SWITCH_45, SWITCH_90, SWITCH_135, SWITCH_180, SWITCH_225, SWITCH_270, and SWITCH_315, which select the current path of the current sources, are generated as shown in Figure 19.

By selecting different branches, phases in different regions can be output. For example, when the SWITCH_0 and SWITCH_45 branches are active, the phase interpolator works from $0^{\circ}$ to $45^{\circ}$. During operation, only 32 switches are turned on at any one time, so the total current of the PI is the same in every state.
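The constant-current property can be checked with a small Python model of the branch selection. The mapping below is an assumption made for illustration: it indexes the eight branches directly with the upper three bits of an eight-bit code, whereas the actual design uses the Gray-coded P<0:2> and the SWITCH logic of Figure 19, which are not reproduced here. The function name `pi_currents` is hypothetical.

```python
PHASES = [0, 45, 90, 135, 180, 225, 270, 315]


def pi_currents(code):
    """Map an 8-bit PI code to the number of unit currents per input phase."""
    if not 0 <= code <= 255:
        raise ValueError("code must be an 8-bit value")
    region, frac = divmod(code, 32)
    first = PHASES[region]                # branch that is being faded out
    second = PHASES[(region + 1) % 8]     # branch that is being faded in
    currents = {phase: 0 for phase in PHASES}
    currents[first] = 32 - frac
    currents[second] = frac
    return currents


# Code 40 lies in the 45..90 degree region, 8 steps from its start.
active = {p: i for p, i in pi_currents(40).items() if i}
print(active)
```

Whatever the code, the two active branches always share 32 unit currents, so the total tail current, and with it the output swing and common-mode level, stays fixed while the phase rotates.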

Figure 20 shows the phase of the circuit output under different control codes and corresponds to Figure 16. When $φ_{d} = 90^{\circ}$ (Figure 20a), the output phase steps are clearly unequal. The improved PI ($φ_{d} = 45^{\circ}$), by contrast, produces evenly spaced steps, as shown in Figure 20b.

(3) DNL and INL

The output linearity determines the extra jitter introduced by the phase interpolator and is an important technical indicator of the PI. It is mainly measured by the differential nonlinearity (DNL) and the integral nonlinearity (INL). Equation (6) gives the minimum resolution, Equation (7) the DNL, and Equation (8) the INL [16].

$$P_{LSB} = \frac{360^{\circ}}{256} = 1.40625^{\circ}$$

(6)

$$DNL = \frac{Ph_{N + 1} - Ph_{N}}{P_{LSB}} - 1$$

(7)

$$INL = \frac{Ph_{N} - Ph_{0}}{P_{LSB}} - N$$

(8)

Under the standard process corner and a temperature of 27 °C, the DNL and INL curves for N from 0 to 256 ($0^{\circ}$ to $360^{\circ}$) are shown in Figure 21. The maximum DNL is 0.54 LSB and the maximum INL is 0.68 LSB, whereas the theoretical maximum INL of the traditional PI is 1.69 LSB. Compared with the traditional structure, the PI designed in this work therefore has better linearity.
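Equations (6) to (8) translate directly into code. The Python sketch below computes DNL and INL in LSB from a list of measured output phases; the function name `dnl_inl` and the sample phase values are hypothetical and are not data from Figure 21.

```python
def dnl_inl(phases_deg, lsb_deg=360.0 / 256):
    """Return (dnl, inl) in LSB for output phases listed by control code.

    dnl[n] follows Equation (7) and inl[n] follows Equation (8).
    """
    dnl = [(phases_deg[n + 1] - phases_deg[n]) / lsb_deg - 1
           for n in range(len(phases_deg) - 1)]
    inl = [(phases_deg[n] - phases_deg[0]) / lsb_deg - n
           for n in range(len(phases_deg))]
    return dnl, inl


# Illustrative input: the second step is 1.5 LSB wide instead of 1 LSB.
lsb = 360.0 / 256
dnl, inl = dnl_inl([0.0, lsb, 2.5 * lsb, 3.5 * lsb])
print(dnl, inl)
```

An ideal interpolator with steps of exactly 1.40625° gives zero DNL and INL at every code, so any nonzero value measures the departure from the straight line of Equation (1).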

Table 3 compares three types of PI. Compared with [17,18], the INL is significantly improved; although the power consumption increases, both the accuracy and the linearity are greatly improved. The improved PI is therefore used in the MINI-PLL to obtain an excellent clock signal, while in the $δ$ of each channel the traditional PI is still used to keep the power consumption low, because the amount of phase rotation is determined by $δ$. In this way, high performance and low power consumption are better balanced. The power consumption of the PI in this work can be allocated to each channel. When the receiver clock samples the data at the best position, Figure 22 shows the stabilized output clocks of the two PIs in the MINI-PLL. The traditional PI is shown in Figure 22a, where the jitter is 6 ps, while the jitter of the improved PI in Figure 22b is only 2.4 ps. The clock performance therefore improves with the accuracy and linearity of the PI.

4. Results

The transceiver supports D2D interconnects with a channel length of 50 mm. Figure 23a shows the circuit layout of the transceiver system, which works at 512 Gbps. The central part, the clock path, includes the MINI-PLL, all-digital CDA, and PI layouts designed in this research. There are 16 pairs of differential data paths on the upper and lower sides of the overall layout. The transceiver system uses a nine-metal-layer (9ML) 28 nm CMOS technology, and the receiver occupies a 0.9 mm × 2 mm silicon area.

The power consumption of the receiver is shown in Figure 23b. The analog part, comprising the CTLE, 1/4 sampler circuit, PI, and clock, consumes 137 mW (61%). The digital part, mainly the 4:16 DMUX and 16:64 MUX, consumes 45 mW (21%). The phase-locked loop of the receiver uses a forward clock structure, and its power consumption is only 36 mW (8%). The transmitter and receiver share a bias circuit that consumes 14 mW, so half of this is attributed to the receiver (3%). The remaining 16 mW (7%) is used for the clock.

At a temperature of 125 °C and the tt corner, the power consumption of the receiver is measured to be 0.44 pJ/b (the overall system power consumption is 1.56 pJ/b). Figure 24 shows that the VCO outputs eight equally spaced clock phases, each separated by 15.6 ps and with 2.4 ps of jitter.
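These headline numbers can be cross-checked with simple arithmetic. The Python sketch below uses the 16 lanes, 32 Gbps per lane, and 0.44 pJ/b stated in the text; the 8 GHz clock frequency is an assumption inferred from the quarter-rate sampling (32 Gbps data split into four 8 Gbps streams) and is not stated explicitly in this section.

```python
LANES = 16
LANE_RATE_GBPS = 32            # per-lane data rate from Section 2.1
RX_ENERGY_PJ_PER_BIT = 0.44    # receiver energy efficiency
CLOCK_GHZ = 8                  # assumption: quarter-rate clock for 32 Gbps
N_PHASES = 8                   # clock phases produced by the ring oscillator


def link_bandwidth_gbps():
    """Aggregate bandwidth of all data lanes."""
    return LANES * LANE_RATE_GBPS


def receiver_power_mw():
    """Gb/s multiplied by pJ/b gives mW."""
    return link_bandwidth_gbps() * RX_ENERGY_PJ_PER_BIT


def phase_spacing_ps():
    """Time between adjacent clock phases: one period divided by eight."""
    return 1000.0 / CLOCK_GHZ / N_PHASES


print(link_bandwidth_gbps(), receiver_power_mw(), phase_spacing_ps())
```

Under these assumptions the aggregate bandwidth is 512 Gbps, the receiver power is roughly 225 mW, and the phase spacing is 15.625 ps, which agrees with the 15.6 ps read from Figure 24.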

Figure 25 shows the bathtub curves of L0 to L15. The eye opening is 15.02 ps for L0 to L7 (Figure 25a) and 14.98 ps for L8 to L15 (Figure 25b). With a channel loss of 10 dB and a channel length of 50 mm, the worst eye opening is 14.98 ps at a BER of $10^{-15}$.

Table 4 compares the performance of this work with previous work. Among designs that use the same DS signaling [19,20], this work reduces power consumption. Compared with SES transmission [8,21], this work stands out for its bidirectional bandwidth density of up to 284 Gb/s/mm. Since the BER is less than $10^{-15}$ at a channel loss of 10 dB, the receiver can support 512 Gbps D2D interconnects in standard packaging.

5. Conclusions

A 16-lane receiver with 512 Gbps bandwidth is designed in a 28 nm CMOS process, which provides a solution to the problems of bandwidth density and power consumption in standard-packaging D2D interconnects. The overall link power consumption of this solution is 1.56 pJ/b. The BER is less than $10^{-15}$ under real S-parameters with a channel loss of 10 dB at 16 GHz and a channel length of 50 mm.

Author Contributions

G.Z.: Conceptualization, Methodology, Software. M.L.: Methodology, Software. F.L.: circuit design and simulation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by China Postdoctoral Science Special Foundation (2022T150781) and supported by China Postdoctoral Science Foundation (2020M673697).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hutner, M.; Sethuram, R.; Vinnakota, B.; Armstrong, D.; Copperhall, A. Special Session: Test Challenges in a Chiplet Marketplace. In Proceedings of the 2020 IEEE 38th VLSI Test Symposium (VTS), San Diego, CA, USA, 5–8 April 2020; pp. 1–12.
2. Farjadrad, R.; Vinnakota, B. A Bunch of Wires (BoW) Interface for Inter-Chiplet Communication. In Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, USA, 14–16 August 2019; pp. 27–273.
3. Shivnaraine, R.; van Ierssel, M.; Farzan, K.; Diclemente, D.; Ng, G.; Wang, N.; Musayev, J.; Dutta, G.; Shibata, M.; Moradi, A.; et al. 11.2 A 26.5625-to-106.25Gb/s XSR SerDes with 1.55pJ/b Efficiency in 7nm CMOS. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 181–183.
4. Loh, M.; Emami-Neyestanak, A. A 3x9 Gb/s Shared, All-Digital CDR for High-Speed, High-Density I/O. IEEE J. Solid-State Circuits 2012, 47, 641–651.
5. Liu, C.; Botimer, J.; Zhang, Z. A 256Gb/s/mm-shoreline AIB-Compatible 16nm FinFET CMOS Chiplet for 2.5D Integration with Stratix 10 FPGA on EMIB and Tiling on Silicon Interposer. In Proceedings of the CICC, Austin, TX, USA, 25–30 April 2021; pp. 1–2.
6. Lin, M.S.; Huang, T.C.; Tsai, C.C.; Tam, K.H.; Hsieh, K.C.; Chen, C.F.; Huang, W.H.; Hu, C.W.; Chen, Y.C.; Goel, S.K.; et al. A 7-nm 4-GHz Arm-Core-Based CoWoS Chiplet Design for High-Performance Computing. IEEE J. Solid-State Circuits 2020, 55, 956–966.
7. Zhou, G.; Zhou, L.; Guo, Y.; Chen, S.; Lu, L.; Liu, L.; Chen, J. 32-Gb/s OOK and 64-Gb/s PAM-4 Modulation Using a Single-Drive Silicon Mach–Zehnder Modulator with 2 V Drive Voltage. IEEE Photonics J. 2019, 11, 6603610.
8. Poulton, J.W.; Dally, W.J.; Chen, X.; Eyles, J.G.; Greer, T.H.; Tell, S.G.; Wilson, J.M.; Gray, C.T. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications. IEEE J. Solid-State Circuits 2013, 48, 3206–3218.
9. Chung, S.; Kim, L. 1.22mW/Gb/s 9.6Gb/s data jitter mixing forwarded-clock receiver robust against power noise with 1.92ns latency mismatch between data and clock in 65nm CMOS. In Proceedings of the 2012 Symposium on VLSI Circuits (VLSIC), Honolulu, HI, USA, 13–15 June 2012; pp. 144–145.
10. Chen, S.; Li, H.; Chiang, P.Y. A Robust Energy/Area-Efficient Forwarded-Clock Receiver With All-Digital Clock and Data Recovery in 28-nm CMOS for High-Density Interconnects. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2016, 24, 578–586.
11. Universal Chiplet Interconnect Express (UCIe) Specification Revision 1.0. 2022. Available online: https://www.uciexpress.org/specification (accessed on 30 February 2022).
12. Chung, C.-C.; Lee, C.-Y. A new DLL-based approach for all-digital multiphase clock generation. IEEE J. Solid-State Circuits 2004, 39, 469–475.
13. Tajalli, A.; Bastani, M.; Carnelli, D.; Cao, C.; Fox, J.; Gharibdoust, K.; Gorret, D.; Gupta, A.; Hall, C.; Hassanin, A.; et al. A 1.02pJ/b 417Gb/s/mm USR Link in 16nm FinFET. In Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, 9–14 June 2019; pp. C92–C93.
14. Kanda, K. 40Gb/s 4:1 MUX/1:4 DEMUX in 90nm standard CMOS. In Proceedings of the ISSCC 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, San Francisco, CA, USA, 10 February 2005; Volume 1, pp. 152–590.
15. Bae, W.; Ju, H.; Park, K.; Cho, S.; Jeong, D. A 7.6 mW, 214-fs RMS jitter 10-GHz phase-locked loop for 40-Gb/s serial link transmitter based on two-stage ring oscillator in 65-nm CMOS. In Proceedings of the 2015 IEEE Asian Solid-State Circuits Conference (A-SSCC), Xia’men, China, 9–11 November 2015; pp. 1–4.
16. IEEE Std 1241-2000; IEEE Standard for Terminology and Test Methods for Analog-to-Digital Converters. IEEE: Piscataway, NJ, USA, 2001; pp. 1–98.
17. Chen, M.-S.; Hafez, A.A.; Yang, C.-K.K. A 0.1–1.5 GHz 8-bit inverter-based digital-to-phase converter using harmonic rejection. In Proceedings of the 2012 IEEE Asian Solid State Circuits Conference (A-SSCC), Kobe, Japan, 12–14 November 2012; pp. 145–148.
18. Ravi, A.; Madoglio, P.; Xu, H.; Chandrashekar, K.; Verhelst, M.; Pellerano, S.; Cuellar, L.; Aguirre-Hernandez, M.; Sajadieh, M.; Zarate-Roldan, J.E.; et al. A 2.4-GHz 20–40-MHz Channel WLAN Digital Outphasing Transmitter Utilizing a Delay-Based Wideband Phase Modulator in 32-nm CMOS. IEEE J. Solid-State Circuits 2012, 47, 3184–3196.
19. Erett, M.; Carey, D.; Hudner, J.; Casey, R.; Geary, K.; Neto, P.; Raj, M.; McLeod, S.; Zhang, H.; Roldan, A.; et al. A 126mW 56Gb/s NRZ wireline transceiver for synchronous short-reach applications in 16nm FinFET. In Proceedings of the 2018 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 11–15 February 2018; pp. 274–276.
20. Shibasaki, T.; Danjo, T.; Ogata, Y.; Sakai, Y.; Miyaoka, H.; Terasawa, F.; Kudo, M.; Kano, H.; Matsuda, A.; Kawai, S.; et al. 3.5 A 56Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS. In Proceedings of the 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 31 January–4 February 2016; pp. 64–65.
21. Wilson, J.M.; Turner, W.J.; Poulton, J.W.; Zimmer, B.; Chen, X.; Kudva, S.S.; Song, S.; Tell, S.G.; Nedovic, N.; Zhao, W.; et al. A 1.17pJ/b 25Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16nm CMOS using a process- and temperature-adaptive voltage regulator. In Proceedings of the 2018 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 11–15 February 2018; pp. 276–278.

Figure 1. Chips interconnection of different processes.

Figure 2. Forward clock architecture.

Figure 3. Integral structure of receiver.

Figure 4. Data path and clock path.

Figure 5. $δ$ on L0 to L15.

Figure 6. (a) DMUX design circuit and (b) timing.

Figure 7. CDA architecture.

Figure 8. Eight-bit encodings with competition.

Figure 9. Eight-bit encodings without competition.

Figure 10. Architecture and design of Mini-PLL.

Figure 11. (a) PFD design circuit and (b) timing.

Figure 12. (a) Charge pump design, (b) The locking process of RX-PLL.

Figure 13. VCO control circuit.

Figure 14. Design of VCO.

Figure 15. Frequency and KVCO from VCO. (a) is for 10 GHz clock, (b) is for 8 GHz clock, (c) is for 6 GHz clock.

Figure 16. Linearity analysis and comparison of PI. (a) Output phase as the thermometer code changes when $φ_{d} = 45^{\circ}$, (b) output phase as the thermometer code changes when $φ_{d} = 90^{\circ}$, (c) fitted curves of (a,b) and an ideal curve.

Figure 17. Two PI designs. (a) Traditional PI design, (b) improved PI design.

Figure 18. PI circuit design.

Figure 19. Logic design of SWITCH.

Figure 20. PI output under control code BIT. (a) PI output when $φ_{d} = 90^{\circ}$, (b) PI output when $φ_{d} = 45^{\circ}$.

Figure 21. DNL and INL of PI.

Figure 22. (a) PLL output clock when using traditional PI, (b) PLL output clock when using improved PI.

Figure 23. (a) Layout design, (b) power consumption ratio of each module in receiver.

Figure 24. Eight equal phase clocks.

Figure 25. Measured bathtub diagrams showing the eye opening at $10^{-15}$. (a) Bathtub curves of L0 to L7, (b) bathtub curves of L8 to L15.

Table 1. Higher three bits coding to P<2:0>.

P<2:0>	000	001	011	010	110	111	101	100
Phase [$^{\circ}$]	0–45	45–90	90–135	135–180	180–225	225–270	270–315	315–360(0)
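
The code sequence in Table 1 changes exactly one bit between adjacent 45° octants, and it coincides with the standard reflected binary (Gray) code. The following Python sketch reproduces the listed codes from an octant index and makes that property explicit. The function name `octant_code` and the way the phase is mapped to an octant index are assumptions made for illustration; they are not taken from the paper's circuit.

```python
def octant_code(phase_deg: float) -> str:
    """Return the 3-bit code P<2:0> for the 45-degree octant containing phase_deg."""
    octant = int((phase_deg % 360.0) // 45.0)  # octant index 0..7 (assumed mapping)
    gray = octant ^ (octant >> 1)              # reflected binary (Gray) encoding
    return format(gray, "03b")


def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Codes as listed in Table 1, in order of increasing phase.
TABLE_1 = ["000", "001", "011", "010", "110", "111", "101", "100"]

# One sample phase inside each octant: 1, 46, 91, ... degrees.
codes = [octant_code(45.0 * k + 1.0) for k in range(8)]

# Bit changes between neighbouring octants, including the wrap from 360 to 0.
steps = [hamming(codes[k], codes[(k + 1) % 8]) for k in range(8)]
```

Because every entry of `steps` equals one, a transition between neighbouring octants never passes through an intermediate code, which is the usual motivation for choosing such a sequence.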

Table 2. Lower five bits coding to thermometer codes.

BIT	5-bit	BIT	5-bit	BIT	5-bit
31	$A_{5}$ $A_{4}$ $A_{3}$ $A_{2}$ $A_{1}$	20	$A_{5}$ ( $A_{4}$ + $A_{3}$ )	9	$A_{5}$ + $A_{4}$ ( $A_{3}$ + $A_{2}$ + $A_{1}$ )
30	$A_{5}$ $A_{4}$ $A_{3}$ $A_{2}$	19	$A_{5}$ ( $A_{4}$ + $A_{3}$ + $A_{2}$ $A_{1}$ )	8	$A_{5}$ + $A_{4}$
29	$A_{5}$ $A_{4}$ $A_{3}$ ( $A_{2}$ + $A_{1}$ )	18	$A_{5}$ ( $A_{4}$ + $A_{3}$ + $A_{2}$ )	7	$A_{5}$ + $A_{4}$ + $A_{3}$ $A_{2}$ $A_{1}$
28	$A_{5}$ $A_{4}$ $A_{3}$	17	$A_{5}$ ( $A_{4}$ + $A_{3}$ + $A_{2}$ + $A_{1}$ )	6	$A_{5}$ + $A_{4}$ + $A_{3}$ $A_{2}$
27	$A_{5}$ $A_{4}$ ( $A_{3}$ + $A_{2}$ $A_{1}$ )	16	$A_{5}$	5	$A_{5}$ + $A_{4}$ + $A_{3}$ ( $A_{2}$ + $A_{1}$ )
26	$A_{5}$ $A_{4}$ ( $A_{3}$ + $A_{2}$ )	15	$A_{5}$ + $A_{4}$ $A_{3}$ $A_{2}$ $A_{1}$	4	$A_{5}$ + $A_{4}$ + $A_{3}$
25	$A_{5}$ $A_{4}$ ( $A_{3}$ + $A_{2}$ + $A_{1}$ )	14	$A_{5}$ + $A_{4}$ $A_{3}$ $A_{2}$	3	$A_{5}$ + $A_{4}$ + $A_{3}$ + $A_{2}$ $A_{1}$
24	$A_{5}$ $A_{4}$	13	$A_{5}$ + $A_{4}$ $A_{3}$ ( $A_{2}$ + $A_{1}$ )	2	$A_{5}$ + $A_{4}$ + $A_{3}$ + $A_{2}$
23	$A_{5}$ ( $A_{4}$ + $A_{3}$ $A_{2}$ $A_{1}$ )	12	$A_{5}$ + $A_{4}$ $A_{3}$	1	$A_{5}$ + $A_{4}$ + $A_{3}$ + $A_{2}$ + $A_{1}$
22	$A_{5}$ ( $A_{4}$ + $A_{3}$ $A_{2}$ )	11	$A_{5}$ + $A_{4}$ ( $A_{3}$ + $A_{2}$ $A_{1}$ )	0	$A_{1}$
21	$A_{5}$ ( $A_{4}$ + $A_{3}$ ( $A_{2}$ + $A_{1}$ ))	10	$A_{5}$ + $A_{4}$ ( $A_{3}$ + $A_{2}$ )
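
The Boolean expressions in Table 2 for bits 1 to 31 are consistent with a simple rule: thermometer bit k is 1 when the 5-bit value $A_{5}A_{4}A_{3}A_{2}A_{1}$ is at least k. The Python sketch below checks a handful of the tabulated expressions against that rule for all 32 input values. The function names and the choice of sample rows are assumptions made for illustration, and bit 0 is not modeled here.

```python
def split_bits(value: int):
    """Return (A5, A4, A3, A2, A1), with A5 the most significant bit."""
    return tuple((value >> shift) & 1 for shift in (4, 3, 2, 1, 0))


def thermometer_bit(k: int, value: int) -> int:
    """Assumed rule: bit k is set when the binary value is at least k."""
    return 1 if value >= k else 0


# A few rows of Table 2 transcribed as Boolean functions of A5..A1.
TABLE_2_SAMPLES = {
    31: lambda a5, a4, a3, a2, a1: a5 & a4 & a3 & a2 & a1,
    21: lambda a5, a4, a3, a2, a1: a5 & (a4 | (a3 & (a2 | a1))),
    19: lambda a5, a4, a3, a2, a1: a5 & (a4 | a3 | (a2 & a1)),
    16: lambda a5, a4, a3, a2, a1: a5,
    9: lambda a5, a4, a3, a2, a1: a5 | (a4 & (a3 | a2 | a1)),
    1: lambda a5, a4, a3, a2, a1: a5 | a4 | a3 | a2 | a1,
}


def samples_match_rule() -> bool:
    """Compare each sampled expression with the rule for every 5-bit input."""
    for value in range(32):
        bits = split_bits(value)
        for k, expr in TABLE_2_SAMPLES.items():
            if expr(*bits) != thermometer_bit(k, value):
                return False
    return True


def thermometer_word(value: int):
    """Bits 1..31 of the thermometer code for a 5-bit value."""
    return [thermometer_bit(k, value) for k in range(1, 32)]
```

Under this rule the number of set bits among bits 1 to 31 equals the binary value, so a one-step change in the input toggles exactly one thermometer bit.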

Table 3. Comparison for PI.

Type	DNL	INL	Power Consumption	Precision	Technology
[17]	0.89 LSB	2.18 LSB	3.4 mW	8 bit	65 nm
[18]	—	1.64 LSB	—	8 bit	32 nm
This work	0.54 LSB	0.68 LSB	10 mW	8 bit	28 nm
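
Table 3 reports DNL and INL in units of LSB. As a rough sketch of how such figures are commonly obtained from a phase-versus-code transfer curve, the Python example below uses an endpoint fit: the LSB is the average step between the first and last code, DNL is the deviation of each step from one LSB, and INL is the deviation of each point from the straight line through the endpoints. The function name `dnl_inl` and the sample data are hypothetical; the paper does not state which fitting method it uses, and the numbers below are synthetic, not measurements.

```python
def dnl_inl(phases):
    """Endpoint-fit DNL and INL, both in LSB, for a list of output phases."""
    n = len(phases)
    lsb = (phases[-1] - phases[0]) / (n - 1)  # average step size
    # DNL: how far each code-to-code step deviates from one LSB.
    dnl = [(phases[i + 1] - phases[i]) / lsb - 1.0 for i in range(n - 1)]
    # INL: how far each point deviates from the endpoint straight line.
    inl = [(phases[i] - (phases[0] + i * lsb)) / lsb for i in range(n)]
    return dnl, inl


# Synthetic five-code example in which the third phase is 0.5 LSB too large.
example_phases = [0.0, 1.0, 2.5, 3.0, 4.0]
example_dnl, example_inl = dnl_inl(example_phases)

worst_dnl = max(abs(d) for d in example_dnl)
worst_inl = max(abs(v) for v in example_inl)
```

For this synthetic input both worst-case values come out at 0.5 LSB, since a single displaced point lengthens one step and shortens the next by the same amount.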

Table 4. Performance comparison.

Reference	[19]	[20]	[21]	[8]	This work
Signaling	DS	DS	SES	SES	DS
Data Rate (Gb/s)	56	56.2	25	20	32
Insertion loss (dB)	11	18.4	8.5	1	10
Technology (nm)	16	28	16	28	28
Energy Efficiency (pJ/b)	2.25	4.4	1.17	0.54	1.56
BER	10⁻¹⁵	10⁻¹⁵	10⁻¹⁵	10⁻¹²	10⁻¹²
Core area (RX+TX) (mm²)	2.64	1.4	0.75	1.2	1.8
Throughput (RX+TX) (Gb/s)	448	112.4	200	320	512
Density of BW (Gb/s/mm²)	169	80.2	266.7	266.7	284
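
The bandwidth density row of Table 4 can be reproduced by dividing each throughput by the corresponding core area, which gives a figure in Gb/s per mm². The Python sketch below does this for the values listed in the table and also checks that 16 lanes at 32 Gb/s give the stated 512 Gb/s. The dictionary layout and function name are assumptions made for illustration; the printed table values agree with the computed ratios to within about one unit.

```python
# (throughput in Gb/s, core area in mm^2), copied from Table 4.
TABLE_4 = {
    "[19]": (448.0, 2.64),
    "[20]": (112.4, 1.4),
    "[21]": (200.0, 0.75),
    "[8]": (320.0, 1.2),
    "This work": (512.0, 1.8),
}

# Density values as printed in Table 4, for comparison.
PRINTED_DENSITY = {
    "[19]": 169.0,
    "[20]": 80.2,
    "[21]": 266.7,
    "[8]": 266.7,
    "This work": 284.0,
}


def bandwidth_density(throughput_gbps: float, area_mm2: float) -> float:
    """Throughput per unit core area, in Gb/s per mm^2."""
    return throughput_gbps / area_mm2


densities = {
    name: round(bandwidth_density(tp, area), 1)
    for name, (tp, area) in TABLE_4.items()
}

# Aggregate throughput of this work: 16 data lanes at 32 Gb/s each.
LANES = 16
LANE_RATE_GBPS = 32
total_throughput = LANES * LANE_RATE_GBPS
```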

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, G.; Lai, M.; Lyu, F. A Multichannel, High-Bandwidth Wirelane Receiver for D2D Interconnects. Electronics 2022, 11, 2864. https://doi.org/10.3390/electronics11182864

