A 56 Gbps 4-tap PAM-4 direct decision feedback equaliser with negative capacitance employing dynamic CML comparators in 65-nm CMOS

Here, a 4-level pulse amplitude modulation (PAM-4) direct decision feedback equaliser (DFE) with a novel dynamic current-mode-logic comparator (DCMLC) is presented. The DCMLC breaks the trade-off between settling time and regeneration time in traditional CML comparator design by utilizing dynamic logic, and it separately optimizes the tracking stage and the regeneration stage for correct latch operation at ultrahigh speed. Compared with the traditional CML comparator, the DCMLC reduces delay by 36% and has better input sensitivity at high baud rates, at the cost of a 7% reduction in output swing. Negative capacitance is adopted to extend the 0.5 dB bandwidth by a factor of up to 1.89. The reduced delay and wider bandwidth of the proposed comparator allow the implementation of a 4-tap direct DFE at 56 Gbps with 2.8 pJ/bit energy efficiency and an active area of 0.007 mm² in 65-nm CMOS technology.


Introduction:
The increasing bandwidth requirement of communication systems has prompted the development of wireline transceivers capable of operating up to 56 Gbps. At these rates, 4-level pulse amplitude modulation (PAM-4) signalling is chosen to limit channel loss and preserve link budget. However, it is challenging to design such a high-speed PAM-4 decision feedback equaliser (DFE). First, a quarter-rate receiver needs 12 comparators for data sampling and 12 comparators for clock and data recovery error sampling, which add considerable power consumption and loading. Second, the unit interval (UI) time constraints set by the feedback nature of the DFE (especially for the first tap) become increasingly difficult to meet as data rates increase. In [1], the speculative first tap originally proposed for NRZ signalling is adopted to relax the one-UI timing constraint. In PAM-4 signalling, however, the unrolled tap needs three times more hardware than in NRZ, which introduces large capacitive loading and degrades energy efficiency. It also places an additional timing burden on the later non-unrolled taps. Three techniques, namely a merged latch and summer, reduced latch gain, and dynamic latch design, are proposed in [2] to achieve a 66 Gbps 3-tap NRZ closed-loop DFE, which requires careful co-design of the gain of the front-end gain stage and the dynamic latch. However, this solution adds noise from the DFE circuits because of the small gain of the dynamic latch, and it is not suitable for PAM-4 because of the reduced voltage margin. The StrongArm comparator (SAC) [3] and the modified double-tail dynamic comparator (DTDC) [4] have the advantages of no dc power, high gain, and CMOS-level outputs, but their large output swing and multi-stage structure increase the delay. A single-stage current-mode-logic comparator (CMLC) [5] has higher bandwidth and reduced delay, but the regeneration pair induces substantial self-loading, and there is a trade-off between bandwidth in the tracking phase and regeneration speed.
Since this letter targets an aggressive DFE operating at a 56 Gbps data rate in 65-nm CMOS technology, a dynamic CML comparator (DCMLC) with negative capacitance [6] is proposed to address the drawbacks of the CML comparator and reduce the delay even further.
DFE architecture and circuit: As shown in Figure 1, the DFE adopts a quarter-rate architecture to lower the clock frequency and thereby reduce clock power. It consists of four identical slicers (0, 90, 180, and 270, each clocked by the corresponding clock phase). Each slicer has three samplers, and each sampler contains a data comparator and an edge comparator. The data comparator adopts the proposed DCMLC to accomplish DFE summation and data decision. The edge comparator is designed with a pass-transistor and a SAC. The outputs of the DCMLC are directly fed back to the first tap. Tak[...] fed back to the first tap in slicer 0. Then, utilizing the inverter's high gain in the transition region, a two-stage inverter buffer amplifies the data to CMOS level to drive the second tap in slicer 90. In addition, the SR latch transforms the signal to NRZ signalling and drives the third tap in slicer 180. Finally, the data is buffered to the fourth tap in slicer 270 and output to the demux. This architecture separates the taps from one another, decreases the loading of the first tap, and maintains the node bandwidth for all four taps. Note that the architecture places an additional timing burden on the later taps. For example, for the second tap, an extra buffer delay of about 20 ps is added to its feedback loop, which is less than 1 UI (35.7 ps at 56 Gbps).
In a traditional CMLC, the settling time constant during the tracking phase is $R_L C_L$ ($R_L$ is the load resistance, and $C_L$ is the load capacitance), while the negative conductance during the regeneration phase is $G_m = 1/R_L - g_m$, where $g_m$ is the transconductance of the regeneration pair. In other words, reducing $R_L$ to decrease the settling time will increase the regeneration time constant.
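The trade-off described above can be made concrete with a first-order sketch. It assumes the standard single-pole latch model, in which the regeneration time constant is $C_L / (g_m - 1/R_L)$, i.e. $C_L$ divided by the magnitude of the net conductance $G_m$. The function names and all component values are hypothetical and chosen only for illustration; they are not taken from the design.

```python
# First-order sketch of the track/regenerate trade-off in a conventional
# CML comparator. All component values are hypothetical.

def tracking_tau(r_load, c_load):
    """Settling time constant R_L * C_L during the tracking phase (seconds)."""
    return r_load * c_load

def regeneration_tau(r_load, c_load, gm_regen):
    """Regeneration time constant C_L / (g_m - 1/R_L) (seconds).

    The net conductance G_m = 1/R_L - g_m must be negative for the latch
    to regenerate, i.e. g_m must exceed 1/R_L.
    """
    net_conductance = 1.0 / r_load - gm_regen
    if net_conductance >= 0:
        raise ValueError("no regeneration: g_m must exceed 1/R_L")
    return c_load / -net_conductance

C_LOAD = 100e-15   # load capacitance in farads (assumed value)
GM_REGEN = 10e-3   # regeneration-pair transconductance in siemens (assumed value)

# Halving R_L speeds up tracking but slows down regeneration.
for r_load in (400.0, 200.0):
    t_track = tracking_tau(r_load, C_LOAD)
    t_regen = regeneration_tau(r_load, C_LOAD, GM_REGEN)
    print(f"R_L = {r_load:5.0f} ohm: tau_track = {t_track * 1e12:5.1f} ps, "
          f"tau_regen = {t_regen * 1e12:5.1f} ps")
```

With a single fixed load resistance the two time constants move in opposite directions, which is the constraint the DCMLC removes by using different load resistances in the two phases.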
Figure 2 illustrates both the proposed DCMLC and its use in the DFE. When clock C0P is "1" (C0N is "0"), $M_{11\text{-}12}$ and $M_{16}$ turn on, $M_{13}$ turns off, and the DCMLC works in the tracking phase. $M_{18\text{-}19}$ are on and in parallel with $M_{20\text{-}21}$ to realize a small load resistance. The input data, the offset or comparator threshold, and the DFE taps are summed in the current domain. The bandwidth of the summer node is essential for settling, so a tighter 0.5 dB bandwidth criterion is adopted in this design. $M_{7\text{-}8}$, $M_{14\text{-}15}$, and $C_C$ in the green dashed block constitute a negative capacitance [6] that broadens the 0.5 dB bandwidth by up to 1.89 times, which further decreases the settling time of the summer. $V_C$ is used to adjust the value of the negative capacitance to compensate for its variation over process, voltage, and temperature (PVT). On the other hand, when clock C0P is "0" (C0N is "1"), $M_{11\text{-}12}$ and $M_{16}$ turn off, $M_{13}$ turns on, and the DCMLC works in the regeneration phase. $M_{18\text{-}19}$ are off, and only $M_{20\text{-}21}$ are used as the load resistor to achieve a large resistance, which reduces the regeneration time constant and shortens the delay of the regeneration process. It should be noted that the larger the load resistance in regeneration is, the smaller the output swing is. In order to reduce the delay of the DCMLC while ensuring the robustness of the subsequent circuits, the size ratio of $M_{18\text{-}19}$ to $M_{20\text{-}21}$ is set to 3/2. Another advantage is that the tracking path is separated from the regeneration path, which allows the two paths to be optimized separately. Figure 3 illustrates the simulation results of the "reset-regenerate" comparators (SAC and DTDC) and the "track-regenerate" comparators (CMLC and DCMLC). The input signal to the comparators is a worst-case data pattern, as shown in Figure 3a. Figure 3b gives the worst clock-to-Q delay (with Q point defined as ±600 mV and V is 60 mV) of each comparator. The delays of the SAC, CMLC, DTDC, and the proposed DCMLC are 38.2, 31.5, 27.7, and 20.2 ps, respectively.
Compared with the first three structures, the delay of the DCMLC is reduced by 47%, 36%, and 27%, respectively. Moreover, the swing loss of the DCMLC is negligible compared with that of the CMLC. Figure 3c plots the input sensitivity at different baud rates, where sensitivity is defined as the minimum differential input swing required for the output swing to exceed 600 mV. The sensitivity of the SAC and DTDC degrades at high baud rates because of the reset phase. The proposed DCMLC has sensitivity comparable to that of the CMLC and is better at baud rates above 36 Gbps owing to the bandwidth extension provided by the negative capacitance.
For a fair comparison, all the comparators are optimized with a 100 fF capacitive load.
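The quoted delay reductions follow directly from the reported clock-to-Q delays. The short check below recomputes them. The dictionary and function names are illustrative only, and the delay values are the ones stated in the text.

```python
# Clock-to-Q delays reported in the text, in picoseconds.
delays_ps = {"SAC": 38.2, "CMLC": 31.5, "DTDC": 27.7, "DCMLC": 20.2}

def delay_reduction_pct(reference_ps, proposed_ps):
    """Relative delay reduction of the proposed design versus a reference, in percent."""
    return 100.0 * (reference_ps - proposed_ps) / reference_ps

# Reduction of the DCMLC delay relative to each of the other comparators.
reductions = {
    name: delay_reduction_pct(delay, delays_ps["DCMLC"])
    for name, delay in delays_ps.items()
    if name != "DCMLC"
}

for name, pct in reductions.items():
    print(f"DCMLC vs {name}: {pct:.1f}% lower delay")
```

Rounded to the nearest percent, the three values match the 47%, 36%, and 27% figures given for the SAC, CMLC, and DTDC.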
As shown in Figure 2, the delay of the first tap critical path (the red dotted line) of the DFE includes the clock-to-Q delay of the DCMLC, the settling time of the DCMLC in the tracking phase, and the setup time of the DCMLC, and their sum should be less than 1 UI. That is, $T_{ckq} + T_{settle} + T_{setup} < 1\,\mathrm{UI}$. As the output of the DCMLC is directly connected to the feedback stage ($M_{22\text{-}23}$ in Figure 2), two conclusions follow. First, the definition of the Q point directly affects $T_{ckq}$, so the Q point should be defined as the voltage that the feedback stage has to interpret as a 'digital' level. Second, the larger the feedback stage, the smaller the clipping voltage ('digital' level) will be, but the larger the capacitive loading on the summer node. Considering that the negative capacitance is designed to broaden the bandwidth, a relatively large feedback stage (1.2 µm/0.06 µm) is used to reduce the requirement on the DCMLC output swing and shorten $T_{ckq}$. Figure 4a gives the simulation results of the first tap current (Itap) and its effective feedback current (Idiff) for various differential input swings of the feedback stage over PVT. It can be seen that when the input swing is above 500 mV, the current utilization rate over PVT is more than 95%. Moreover, in order to achieve nearly noise-free feedback, the Q point is defined as 600 mV. Meanwhile, as reported in Figure 4b, the relationship between $T_{ckq}$ and the differential output swing of the DCMLC is simulated with the input signal shown in Figure 3a. With the Q point defined as 600 mV, the worst $T_{ckq}$ is 26.6 ps over PVT. $T_{settle}$ and $T_{setup}$ cannot be obtained directly in this design. For further analysis, the simulation results of the first tap loop are given in Figure 5a, where $I_P$ and $I_N$ are the total feedback currents of the first tap of the DCMLC with the reference level $V_0$ in slicer 90.
Similar to soft decisions [7], the outputs of the DCMLCs in slicer 0 have already been fed back to the first tap of slicer 90 for DFE summation after slicer 0 enters the tracking phase. When the clock CK0 arrives, the DCMLCs in slicer 0 make the data decision, and the feedback current gradually increases until the feedback stage (e.g. $M_{22\text{-}23}$ in Figure 2) interprets the feedback signal as a digital level (∼600 mV). Because of the tracking phase, the first tap feedback current has already been initiated before the decision clock (0.5 UI early), which effectively shortens the time for settling and summation. Moreover, the settling and summation are simultaneous with the regeneration of the DCMLC, and the simulation results confirm that the extra time for $T_{settle}$ and $T_{setup}$ is about 5 ps. That is, the total time of the critical path does not exceed 31.6 ps (at 56 Gbps, 1 UI is 35.7 ps).
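The first tap timing budget above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the text (56 Gbps PAM-4, worst-case clock-to-Q of 26.6 ps, and about 5 ps of extra settling and setup time). The variable and function names are illustrative.

```python
BITS_PER_SYMBOL = 2  # PAM-4 carries 2 bits per symbol

def unit_interval_ps(data_rate_gbps, bits_per_symbol=BITS_PER_SYMBOL):
    """Unit interval in picoseconds for a given data rate in Gbps."""
    baud_rate_gbd = data_rate_gbps / bits_per_symbol
    return 1e3 / baud_rate_gbd

ui_ps = unit_interval_ps(56.0)

t_ckq_ps = 26.6           # worst-case clock-to-Q delay over PVT, from the text
t_settle_setup_ps = 5.0   # extra time for settling and setup, from the text

critical_path_ps = t_ckq_ps + t_settle_setup_ps
margin_ps = ui_ps - critical_path_ps

print(f"1 UI          = {ui_ps:.1f} ps")
print(f"critical path = {critical_path_ps:.1f} ps")
print(f"timing margin = {margin_ps:.1f} ps")
```

The critical path fits within one UI with a few picoseconds to spare, which is consistent with the constraint $T_{ckq} + T_{settle} + T_{setup} < 1\,\mathrm{UI}$.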
Simulation results: The DFE prototype chip is fabricated in 65-nm CMOS technology with a core area of 0.007 mm², as shown in Figure 5b. Figure 5c shows the channel insertion loss, which is 9 dB at 14 GHz. Figures 5d and 5e show the post-layout simulated eye diagrams at the DFE input and at the DCMLCs (reference level $V_0$ in each slicer). In the quarter-rate architecture, the DCMLC spends two UI in the regeneration phase to make data decisions and two UI in the tracking phase, which ensures enough time for the DCMLC to switch from the regeneration phase to the tracking phase. Moreover, the ISI caused by the channel is eliminated during the sampling UI, achieving a time margin of 0.54 UI. Table 1 summarizes the performance of the proposed DFE and compares it with prior works. The proposed DFE achieves a power efficiency of 2.8 pJ/bit, which is superior to the unroll-based design of [1]. Compared to [3], the presented work realizes the same data rate in the 65-nm process. The DCMLC and negative capacitance allow for a 4-tap DFE relative to [5].
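The link-level figures quoted in this section are related by simple arithmetic, sketched below. The implied power is an assumption-based estimate: it takes the 2.8 pJ/bit figure to cover the whole DFE at the full 56 Gbps rate. It is not a value reported in the text, and all names are illustrative.

```python
DATA_RATE_GBPS = 56.0     # aggregate data rate, from the text
BITS_PER_SYMBOL = 2       # PAM-4 carries 2 bits per symbol
ENERGY_PJ_PER_BIT = 2.8   # energy efficiency, from the text
EYE_MARGIN_UI = 0.54      # time margin in UI, from the text

baud_rate_gbd = DATA_RATE_GBPS / BITS_PER_SYMBOL   # symbol rate in GBaud
nyquist_ghz = baud_rate_gbd / 2.0                  # Nyquist frequency in GHz
ui_ps = 1e3 / baud_rate_gbd                        # unit interval in ps

# pJ/bit multiplied by Gb/s gives mW (1e-12 J/bit * 1e9 bit/s = 1e-3 W).
implied_power_mw = ENERGY_PJ_PER_BIT * DATA_RATE_GBPS

eye_margin_ps = EYE_MARGIN_UI * ui_ps

print(f"symbol rate   = {baud_rate_gbd:.0f} GBaud")
print(f"Nyquist freq. = {nyquist_ghz:.0f} GHz")
print(f"implied power = {implied_power_mw:.1f} mW")
print(f"time margin   = {eye_margin_ps:.1f} ps")
```

The Nyquist frequency of the 28 GBaud symbol stream coincides with the 14 GHz point at which the 9 dB channel loss is quoted.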
Conclusion: A 56 Gbps 4-tap quarter-rate direct DFE with edge slicers is implemented in 65-nm CMOS technology. The proposed DCMLC reduces the delay by 36% compared with the CMLC and effectively relaxes the timing constraint of the first tap in the DFE. The proposed DFE achieves 2.8 pJ/bit energy efficiency with an area of 0.007 mm².