1. Introduction
Programmable time delays with resolutions in the ps range are required in many applications, including skew compensation [1], precise timing of apertures, such as camera shutters [2], and device/detector testing and characterization [3,4,5]. These are only a few of the numerous applications in which a precise digital-to-time converter (DTC) is necessary [6].
A DTC generates an output signal whose delay is directly proportional to a digital input code. These devices must carefully trade off competing figures of merit: on the one hand, ps-level resolution (i.e., least significant bit, LSB) and low jitter, which allow very short and precise delays to be generated; on the other hand, a wide full-scale range (FSR) with low differential and integral nonlinearity errors (DNL/INL), which allows the largest possible span of delays to be produced. These crucial characteristics frequently conflict with one another [7,8].
In the scientific literature and on the market, numerous DTC architectures exist, most of which are designed as application-specific integrated circuits (ASICs) [9]. Although these offer excellent performance, they are characterized by long time-to-market and high non-recurring engineering (NRE) costs, which make them difficult to apply in fast-prototyping and research contexts where only a few units are required. To tackle this, we propose a programmable logic solution compatible with the Xilinx 28 nm 7-Series field-programmable gate arrays (FPGAs) and systems-on-chip (SoCs). The proposed solution remains flexible and occupies a small area while delivering a resolution (LSB) of 52 ps, a full-scale range (FSR) of up to 56 ms, and high voltage and temperature (VT) stability, ensuring portability across different programmable logic devices.
In the scientific literature, there are multiple FPGA-based DTC architectures.
A simple counter (also known as a timer) is the simplest and most compact DTC that can be built; it is characterized by a resolution equal to the clock period, a wide FSR determined by the number of bits, and low jitter (i.e., a few ps). The main flaw of pure synchronous logic is that its resolution is bounded by the clock period, which in turn is limited by oscillator performance; for FPGA devices, this usually tops out at about 1 GHz. Regarding the 28 nm Xilinx 7-Series devices, the maximum clock frequency is 630 MHz (i.e., a resolution just below 1.6 ns) for the low-end Artix-7 FPGAs and the Zynq-7000 SoCs up to the 7020 model, and 800 MHz (i.e., a resolution of 1.25 ns) for the Kintex-7, Virtex-7, and Zynq-7000 from the 7030 model onwards [10]. One way to exceed this limitation is to use N clocks with equally spaced phases, all operating at the same frequency. This technique, known in the scientific literature as N-clock synchronous logic, improves the system resolution by a factor of N [11]. However, clock buffers and routing resources impose restrictions on FPGA and SoC designs, limiting the maximum number of clocks. Regarding the 28 nm Xilinx 7-Series devices, a single FPGA/SoC provides at most 32 clock buffers, with a limit of up to 10 clock lines per region [10], making resolutions on the order of 100 ps possible. Clock networks also contribute substantially to the total dynamic power consumption, which reduces the power efficiency of this approach.
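As a rough illustration of this limit, the following sketch computes the resolution achievable with N-clock synchronous logic; the 800 MHz clock and the 10 phases are taken from the device limits quoted above, not from a specific design.

```python
# Illustrative resolution estimate for N-clock synchronous logic.
# Values are the 7-Series device limits quoted in the text, not a specific design.
f_clk = 800e6           # fastest usable clock on Kintex-7/Virtex-7 class devices [Hz]
n_phases = 10           # clock lines available per region on 7-Series devices
t_clk = 1.0 / f_clk     # clock period [s]
lsb = t_clk / n_phases  # effective resolution with N equally spaced phases [s]
print(f"T_clk = {t_clk*1e9:.2f} ns, LSB = {lsb*1e12:.0f} ps")  # ~1.25 ns, ~125 ps
```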
Most high-resolution DTC solutions are based on the concept of a (digital) programmable delay line (PDL). These DTCs are characterized by low jitter (i.e., a few ps), high resolution (i.e., a few ps), and good linearity, but a limited FSR. Typically, delay lines (DLs) are built as a series of buffers, sometimes referred to as “taps” or “bins”. A simple PDL can be made by connecting each buffer’s output to an input of a multiplexer (see Figure 1 for an example), which makes it possible to select the circuit’s delay at run time. Thus, the propagation delay of a single tap defines the LSB, and the total delay of the chain defines the FSR.
FPGA-based PDLs are implemented by connecting several look-up tables (LUTs) in series [12], or by using the carry propagation chains (i.e., CARRY) available within the FPGA fabric [13] or within digital signal processing (DSP) modules [7]. Regardless of the type of buffer chosen, since FPGAs are not optimized to provide logic elements with identical propagation delays, these PDLs require a calibration mechanism to estimate the delay introduced by each tap. This is crucial to achieve high resolution while keeping the DNL and INL errors low. Moreover, as FPGAs do not feature automatic stabilization mechanisms for propagation delay in response to temperature and voltage fluctuations, the jitter and resolution provided by the PDL vary significantly with the operating temperature and voltage [7]. The choice of buffer type is usually made by balancing performance, resource availability, and practicality. In fact, DSP-based PDLs are much more effective in terms of jitter and resolution than those based on carry chains and LUTs, but DSPs are far less abundant.
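To illustrate why uncalibrated taps degrade linearity, the following behavioral sketch models a multiplexer-tapped PDL with randomly mismatched buffer delays and computes its LSB, FSR, and DNL/INL; the 50 ps nominal tap and the ±30% spread are hypothetical values chosen for illustration, not measured data.

```python
# Behavioral sketch of a multiplexer-tapped PDL (as in Figure 1) with mismatched taps.
# Tap delays are hypothetical and only illustrate why uncalibrated FPGA buffers
# produce DNL/INL errors.
import random

random.seed(1)
NOMINAL_TAP = 50e-12                                   # intended LSB: 50 ps
taps = [NOMINAL_TAP * random.uniform(0.7, 1.3) for _ in range(32)]  # process spread

def pdl_delay(code):
    """Delay selected by the multiplexer: sum of the first `code` taps."""
    return sum(taps[:code])

lsb = pdl_delay(len(taps)) / len(taps)                 # average tap = effective LSB
fsr = pdl_delay(len(taps))                             # full-scale range
dnl = [(taps[i] - lsb) / lsb for i in range(len(taps))]
inl = [(pdl_delay(i) - i * lsb) / lsb for i in range(len(taps) + 1)]
print(f"LSB = {lsb*1e12:.1f} ps, FSR = {fsr*1e9:.2f} ns")
print(f"DNL = {max(map(abs, dnl)):.2f} LSB, INL = {max(map(abs, inl)):.2f} LSB")
```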
Regardless of the buffer’s nature, this architecture exhibits a direct dependency between area utilization, resolution, jitter, and FSR. To increase the dynamic range while maintaining the same resolution, it is necessary to increase the number of buffers, which results in a larger multiplexing mechanism and a more complex and prolonged calibration process [13,14]. On the other hand, if one aims to increase the FSR while keeping the area constant, the tap must be slowed down (i.e., the LSB must be increased), thereby degrading the system’s resolution. Additionally, the jitter between the input and output signals (i.e., σ_out) increases as the number of taps traversed grows, because the signal passing through the PDL accumulates the jitter introduced by each tap. In this regard, two main trends have been observed. Specifically, in a PDL with N taps, where σ_tap denotes the jitter of a single buffer, a total jitter of σ_out = √N · σ_tap is observed in [15] and σ_out = √k · σ_tap in [12], where k is the number of buffers the signal traverses, determined by the multiplexer’s selection input.
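A minimal numerical sketch of these two trends, assuming uncorrelated per-tap jitter so that the accumulated jitter grows with the square root of the number of taps; σ_tap and the tap counts below are hypothetical values, not figures from the cited works.

```python
# Jitter accumulation along a PDL, assuming uncorrelated per-tap jitter.
# sigma_tap, N and k are hypothetical values used only for illustration.
sigma_tap = 2e-12          # r.m.s. jitter of a single buffer [s]
N = 32                     # total number of taps in the line
k = 20                     # taps actually traversed for a given code

sigma_out_N = (N ** 0.5) * sigma_tap   # trend depending on the total tap count
sigma_out_k = (k ** 0.5) * sigma_tap   # trend depending on the traversed taps
print(f"sqrt(N)*sigma_tap = {sigma_out_N*1e12:.1f} ps r.m.s.")
print(f"sqrt(k)*sigma_tap = {sigma_out_k*1e12:.1f} ps r.m.s.")
```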
To mitigate area utilization while maintaining high resolution, techniques based on ring oscillators have been proposed in the literature, specifically the Vernier Delay-Locked Loop (VDLL) [16]. In this architecture, the time delay is generated as the differential delay between the edges of two ring oscillators operating at different frequencies. This provides a very compact and high-resolution solution; however, due to the nature of ring oscillators, it is more susceptible than a PDL to voltage and temperature (VT) variations.
To mitigate VT variations and increase resolution, at the expense of system simplicity (area utilization and calibration), systems based on Programmable Vernier Delay Lines (PVDLs) have been proposed [17]. In these solutions, the generated delay is obtained as the time difference between two PDLs, each characterized by a different propagation delay. The differential nature of the approach helps limit the dispersion caused by VT fluctuations, and the LSB equals the difference between the propagation delays of the two taps.
To eliminate the trade-offs related to FSR, techniques rooted in Nutt interpolation are used [18]. This approach pairs a fine DTC (e.g., PDL, VDLL, PVDL), which is highly precise but has a limited FSR, with a coarse counter. Thus, the total delay (i.e., t_delay) is the sum of a fine delay (i.e., t_fine) and a coarse delay (i.e., t_coarse); i.e., t_delay = t_coarse + t_fine [14,19].
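A minimal sketch of the Nutt decomposition is given below, using example values for the coarse clock period and the fine LSB (not the figures of the proposed design).

```python
# Sketch of the Nutt decomposition of a requested delay into a coarse count
# and a fine interpolation code. T_CLK and LSB_FINE are example values only.
T_CLK = 10e-9        # coarse counter clock period [s] (100 MHz example)
LSB_FINE = 50e-12    # fine DTC resolution [s]

def nutt_split(t_delay):
    """Return (TH_coarse, TH_fine) such that
    t_delay ~= TH_coarse * T_CLK + TH_fine * LSB_FINE."""
    th_coarse = int(t_delay // T_CLK)
    th_fine = round((t_delay - th_coarse * T_CLK) / LSB_FINE)
    return th_coarse, th_fine

th_c, th_f = nutt_split(1.2345e-6)   # request a 1.2345 us delay
generated = th_c * T_CLK + th_f * LSB_FINE
print(f"coarse = {th_c}, fine = {th_f}, generated = {generated*1e9:.3f} ns")
```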
As Figure 2 shows, the system starts an n-bit digital counter clocked at f_clk (i.e., with period T_clk) and loads a digital comparator with the “coarse” part of the desired delay (i.e., TH_coarse). Concurrently, the “fine” part (i.e., TH_fine) is used to configure the fine DTC (e.g., PDL). When the counter reaches the designated TH_coarse, a signal, characterized by a delay t_coarse = TH_coarse · T_clk referred to the start of the count (i.e., the count reset), is generated and forwarded to the fine DTC (e.g., PDL), obtaining an output characterized by a total delay of t_delay = TH_coarse · T_clk + t_fine. The output has the same FSR as the coarse counter (i.e., 2^n · T_clk) with the resolution of the fine DTC (e.g., PDL).
The goal of this paper is to present a DTC compatible with all programmable logic solutions (i.e., FPGA and SoC) at 28 nm from the Xilinx 7-Series, offering high performance. The proposed solution does not require a calibration mechanism or specific component placement, resulting in a streamlined, simple, and compact structure (i.e., 348 LUTs and 550 flip-flops). The design is characterized by good linearity, immunity to PVT variations, and the elimination of key trade-offs such as jitter vs. FSR and area vs. FSR.
The paper is organized as follows: the proposed architecture is presented in Section 2, while Section 3 focuses on the experimental validation using a low-end Artix-7 XC7A100TFTG256-2, achieving a jitter lower than 50 ps r.m.s., a DNL of 1.19 LSB, an INL of 1.56 LSB, and an average dynamic power dissipation of 285 mW. Finally, a comparison with other academic works and commercial solutions is presented in Section 4.
2. Hardware Implementation
The proposed architecture adopts Nutt interpolation [18], combining a dual-clock synchronous coarse logic with an asynchronous, PDL-based fine logic.
Every I/O block in Xilinx 28 nm 7-Series FPGAs and SoCs has an adjustable PDL primitive called IDELAYE2 [20]. This primitive can be used on signals coming from the FPGA logic as well as on combinational and registered input signals.
These primitives are implemented as 32-tap wrap-around PDLs whose tap delay is compensated, via the IDELAYCTRL primitive [21], for fluctuations in process, voltage, and temperature (PVT). IDELAYCTRL needs a reference clock as input in order to guarantee precise calibration. The primitive’s basic mechanism splits the calibration clock period (i.e., T_cal) into 64 steps (i.e., a tap delay of T_cal/64), and a signal can be delayed by up to 32 steps (i.e., taps from 0 to 31), achieving a maximum delay of about half of the period (i.e., T_cal/2). As such, the frequency of the clock applied to the primitive directly affects the delay values: the tap resolution varies with the calibration clock frequency. Table 1 lists the available calibration frequencies (i.e., f_cal) as well as the corresponding tap delays.
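As a worked example of this mechanism, the snippet below computes the tap delay and the maximum delay for a few nominal IDELAYCTRL reference frequencies; the 300 MHz case corresponds to the 52 ps LSB used in this work, while the exact frequencies supported are those listed in Table 1.

```python
# Tap delay of the IDELAYE2 primitive as a function of the IDELAYCTRL
# reference-clock frequency. 300 MHz matches the 52 ps LSB of this work;
# the other entries are shown for comparison (see Table 1 for exact values).
def idelay_tap(f_cal_hz):
    """One tap = 1/64 of the calibration clock period."""
    return 1.0 / (64.0 * f_cal_hz)

for f_cal in (200e6, 300e6, 400e6):
    tap = idelay_tap(f_cal)
    max_delay = 31 * tap            # taps 0..31 -> roughly half the period
    print(f"f_cal = {f_cal/1e6:.0f} MHz: tap = {tap*1e12:.1f} ps, "
          f"max = {max_delay*1e9:.2f} ns")
```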
The proposed architecture can thus be divided into a dual-clock synchronous (i.e., coarse) logic and an asynchronous (i.e., fine) part based on IDELAYE2 and IDELAYCTRL.
The main difficulty in dual-clock logic lies in deriving the 180°-phase-shifted clocks while ensuring low jitter between them, considering that a double-data-rate (DDR) approach is not always feasible due to the dispersion and duty-cycle fluctuations that affect most commercial clock sources. Phase-locked loops (PLLs) are the standard circuits used to create such clocks. Unfortunately, the PLLs hosted inside the Xilinx 7-Series FPGAs and SoCs introduce several hundred picoseconds of jitter, which is more than the IDELAYE2 resolution (Table 1). Our suggested fix, which avoids resorting to high-performance PLLs external to the FPGA device, is to apply the “clock gating” method using the Xilinx primitive known as BUFGCE [10].
Clock division is made possible via the BUFGCE primitive, which functions as a buffer with a clock enable (CE) input.
Figure 3 shows the proposed solution to convert an input clock signal (i.e., clk_in) at frequency f_in into two output clock signals (i.e., clk_0 and clk_180) at frequency f_in/2, shifted by 180° with respect to each other and characterized by a duty cycle of 25%. clk_in clocks a 2-bit circular buffer in which the two type-D flip-flops (DFFs) store “1” and “0” [22,23]; the two DFF outputs toggle the CE inputs of the two BUFGCE buffers (i.e., the ones generating clk_0 and clk_180), both fed by clk_in.
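The following plain-Python timing sketch (a behavioral model, not the HDL implementation) reproduces the idea of Figure 3: a 2-bit circular buffer alternately enables two gated copies of clk_in, yielding two clocks at f_in/2, 180° apart, each with a 25% duty cycle.

```python
# Behavioral model of the BUFGCE-based clock divider of Figure 3.
# Each input clock period is rendered as two characters ("10" = high, then low).
def gated_clocks(n_cycles):
    ce = [1, 0]                      # contents of the 2-bit circular buffer
    wave_in, wave_0, wave_180 = "", "", ""
    for _ in range(n_cycles):
        wave_in += "10"                        # clk_in toggles every period
        wave_0 += "10" if ce[0] else "00"      # BUFGCE passes the cycle only if CE = 1
        wave_180 += "10" if ce[1] else "00"
        ce = [ce[1], ce[0]]                    # rotate the circular buffer
    return wave_in, wave_0, wave_180

clk_in, clk_0, clk_180 = gated_clocks(6)
print("clk_in :", clk_in)    # 101010101010
print("clk_0  :", clk_0)     # 100010001000  -> f_in/2, 25% duty cycle
print("clk_180:", clk_180)   # 001000100010  -> shifted by 180 degrees
```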
Referring to the dual-clock synchronous coarse logic shown in Figure 4, we generate a high-speed n-bit counter by using clk_0 to clock the n−1 most significant bits while clk_180 clocks the least significant bit. The comparison with the threshold value happens independently in the two clock domains (i.e., the n−1 most significant bits of the threshold are compared in the clk_0 domain and its least significant bit in the clk_180 domain); so, by combining the comparison results of the two domains, a coarse DTC output signal (i.e., th_reached) with a resolution of T_in (i.e., the period of clk_in) and an FSR of 2^n · T_in is generated.
The final architecture, depicted in Figure 5, is obtained by forwarding the T_in-resolution DTC signal (i.e., th_reached), provided by the dual-clock synchronous coarse logic, to the IDELAYE2 primitive, whose IDELAYCTRL is driven by the calibration clock (i.e., f_cal). In this manner, the IDELAYE2 acts as a fine interpolator, allowing the resolution to be increased up to the tap delay T_cal/64; that is, 52 ps with f_cal = 300 MHz.
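The sketch below summarizes how a requested delay maps onto a coarse threshold and a fine tap code, under the assumption that one coarse step spans exactly the 32 IDELAYE2 taps (i.e., 32 × 52 ps with a 300 MHz calibration clock); the constants and function names are illustrative, not taken from the actual firmware.

```python
# Mapping of a requested delay onto the proposed architecture: a coarse
# threshold for the dual-clock counter plus a fine tap code for IDELAYE2.
# Assumes one coarse step equals the 32-tap span of IDELAYE2; adapt the
# constants to the actual clocking used.
TAP = 52.08e-12              # IDELAYE2 tap with f_cal = 300 MHz (T_cal / 64)
COARSE_STEP = 32 * TAP       # one step of the dual-clock coarse counter

def dtc_codes(t_delay):
    """Return (coarse threshold, IDELAYE2 CNTVALUEIN) for a requested delay."""
    th_coarse = int(t_delay // COARSE_STEP)
    fine_taps = round((t_delay - th_coarse * COARSE_STEP) / TAP)
    return th_coarse, min(fine_taps, 31)

th, taps = dtc_codes(100e-9)                     # request a 100 ns delay
print(f"coarse threshold = {th}, fine taps = {taps}, "
      f"generated = {(th * COARSE_STEP + taps * TAP) * 1e9:.3f} ns")
```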
4. State of the Art and Discussion
In the field of DTC research, both academic and commercial solutions are available today.
Regarding the academic works, a clear tendency quickly becomes apparent: solutions frequently sacrifice FSR in order to achieve the best resolution with the least amount of jitter. For example, an LSB of 20 ps was obtained in [13], which is in close agreement with the 14.2 ps presented in [7]. However, both works struggle with dynamic ranges limited to hundreds of ps or, in the best case, a few ns. On the market, families of one-shot devices, such as those from Dallas Semiconductor [26], are common. The device with the highest resolution among them creates pulses between 5 ns and 15 ns, and the one with the widest FSR produces pulses between 100 ns and 500 ns. Furthermore, each device in this series offers only five preset pulse-width options, which is a major constraint.
Table 4 displays the main academic solutions based on programmable logic devices (i.e., FPGAs and SoCs) compared to the proposed solution.
The use of IDELAYE2 as a PDL eliminates the need for external calibration systems, as required in [7,13,19], or for specific primitive placement [12,14]. This enables a simple and compact system (i.e., 550 DFFs, 348 LUTs, and 1 IDELAYE2), free from any dependency between resolution and FSR [13,14], while maintaining stability under PVT variations and ensuring good linearity (i.e., DNL/INL of 1.19/1.56 with an LSB of 52 ps). The linearity is fully comparable to that of calibrated PDL-based solutions: [7] (i.e., DNL/INL of 3.19/7.11 with an LSB of 9.1 ps), [13] (i.e., DNL/INL of 22.08/19.63 with an LSB of 14.2 ps), and [19] (i.e., DNL/INL of 3.95/6.2 with an LSB of 20 ps), although slightly inferior to the VDLL [16] (i.e., DNL/INL of 0.24/0.02 with an LSB of 38.6 ps) and PVDL [17] (i.e., DNL/INL of 0.17/0.62 with an LSB of 1.02 ps) solutions, which, being based on “Vernier” techniques, are more complex. To eliminate the dependency between FSR and jitter, characteristic of PDLs in general [12] (i.e., 7 ps r.m.s. for delays of a few ps and 165 ps r.m.s. for delays of a few ns) and of the IDELAY specifically (Section 3.3), Nutt interpolation was adopted. This approach allowed the jitter to be kept below the LSB (i.e., in the range between 25 and 50 ps r.m.s.) while ensuring an FSR of up to 58 ms. Furthermore, the use of Nutt interpolation, as in [19] (i.e., jitter of 20 ps r.m.s. over an FSR of 33 μs) and [14] (i.e., jitter of 35 ps r.m.s. over an FSR of 57.3 ns), enabled the trade-off between area occupation and FSR to be overcome, contributing to the compactness of the system.
With reference to the DSP-based PDLs presented in [7], only the implementation with an FSR of 10.9 ns provides an operating range suitable for practical applications. Both DSP-based PDLs deliver excellent performance in terms of jitter and resolution, superior to that of our solution. However, from the perspective of area usage, the use of IDELAYE2 primitives, which are far more abundant than DSP resources, as proposed in our solution, proves to be the better choice. Specifically, considering the Artix-7 XC7A100TFTG256-2 FPGA as a target (126,800 DFFs, 63,400 LUTs, 300 IDELAYs, 240 DSPs), it would only be possible to implement 15 DSP-based PDLs, utilizing 100% of the DSPs (a scarce and valuable resource). In contrast, our solution enables the implementation of 182 channels, with LUTs (a less critical resource) being the limiting factor.
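The channel count quoted above follows directly from the per-channel utilization and the device resources; a quick check, assuming one IDELAYE2 per channel:

```python
# Resource-limited channel count on the Artix-7 XC7A100T, using the per-channel
# utilization quoted above (550 DFFs, 348 LUTs, 1 IDELAYE2).
available = {"DFF": 126_800, "LUT": 63_400, "IDELAY": 300}
per_channel = {"DFF": 550, "LUT": 348, "IDELAY": 1}

channels = min(available[r] // per_channel[r] for r in available)
limiting = min(available, key=lambda r: available[r] // per_channel[r])
print(f"max channels = {channels}, limited by {limiting}s")   # 182, LUTs
```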
Regarding the CARRY-based PDL presented in [13], while it offers excellent jitter and LSB performance, it exhibits significant nonlinearity, caused by the propagation-delay inconsistencies within the CARRY blocks. The same issue is found in the PDL discussed in [14].
Moving to the “Vernier” architectures, the VDLL solution described in [16] achieves similar performance in terms of jitter and LSB but performs worse in terms of area utilization compared to the solution presented in this work. Specifically, the substantial imbalance between the LUT and DFF requirements for implementing the VDLL limits the maximum number of channels to 94 on the Artix-7 XC7A100TFTG256-2, consuming 100% of the LUTs but only 11% of the DFFs. The PVDL proposed in [17] offers an excellent balance between resolution, jitter, nonlinearity, and FSR, but it requires highly complex manual placement to meet the timing constraints necessary for its functionality. Conversely, our solution does not impose any specific place-and-route constraints, delegating these tasks to the compiler. This approach simplifies the firmware and facilitates multichannel scalability.
The architecture proposed in [19] also achieves a good balance between resolution, jitter, nonlinearity, and FSR. However, its DCM-based PDL limits the maximum operational rate to only 2 MHz, a restriction not present in our proposed structure, which allows the delays to be generated sequentially without similar constraints.
For a fairer comparison of the dynamic power consumption, it was considered appropriate to normalize the dynamic power (i.e., P_dyn) to the corresponding clock frequencies (i.e., 300 MHz for the proposed work, 25 MHz for [7], and 200 MHz for [17]), thereby estimating the average energy dissipated in each clock cycle (i.e., E = P_dyn/f_clk). Naturally, this figure depends on the number of resources used in the circuit (i.e., the area), on the parasitic capacitance C (which depends on the technology node), and on the core supply voltage V_core of the FPGA (i.e., 1 V for the proposed work, 1.2 V for [7], and 1 V for [17]). In this regard, an energy consumption of 0.95 mW/MHz (i.e., 0.95 nJ) is obtained for the 28 nm system we present, 3.4 mW/MHz (i.e., 3.4 nJ) for [7] in a 45 nm system, and 0.825 mW/MHz (i.e., 0.825 nJ) for [17] in a 40 nm system.
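For reference, the normalization reduces to E = P_dyn/f_clk; the short check below reproduces the quoted figures (the dynamic powers of [7] and [17] are back-computed here from their nJ values and clock frequencies, since they are not stated explicitly above).

```python
# Energy-per-cycle normalization used for the power comparison: E = P_dyn / f_clk.
designs = {
    "this work (28 nm)": (285e-3, 300e6),   # P_dyn [W], f_clk [Hz]
    "[7]  (45 nm)":      (85e-3,  25e6),    # P_dyn back-computed from 3.4 nJ
    "[17] (40 nm)":      (165e-3, 200e6),   # P_dyn back-computed from 0.825 nJ
}
for name, (p_dyn, f_clk) in designs.items():
    print(f"{name}: {p_dyn / f_clk * 1e9:.3f} nJ per clock cycle")
```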
The architecture proposed here is compatible with all FPGAs equipped with the IDELAYE2 primitive, specifically all 28 nm Xilinx 7-Series FPGA and SoC devices, and can easily be migrated to any device featuring an equivalent primitive: the 40/45 nm Xilinx 6-Series (i.e., IODELAYE1, which has slightly lower performance than IDELAYE2), the 20/16 nm UltraScale/UltraScale+ (i.e., IDELAYE3 [27], which performs approximately 10 times better than IDELAYE2), and the 7 nm Versal (i.e., IDELAYE5 [28], which also performs approximately 10 times better than IDELAYE2). Migration to devices from Intel/Altera [29], Lattice Semiconductor [30], and Microsemi [31] is more complex. In fact, Intel/Altera does not have a direct delay block equivalent to Xilinx’s IDELAY, but their FPGA architecture includes flexible modules for I/O line delay through PLLs and clock adjustment using integrated resources such as IOE (Input–Output Element) modules, which can be configured to support adjustable delays. Lattice offers more compact solutions with its ECP5 and iCE40 FPGA series, which support configurable delay modules; however, these are typically aimed at low-power designs and do not offer the same level of granularity as Xilinx’s IDELAY. Microsemi (now part of Microchip) provides advanced clock management modules in its PolarFire and IGLOO2 FPGAs. While there is no direct IDELAY equivalent, these FPGAs support configurable I/O delays through the use of CCG (Configurable Clock Generator) and SERDES modules, which can be programmed to achieve precise signal synchronization.