**Analysis and Comparison of Rad-Hard Ring and LC-Tank Controlled Oscillators in 65 nm for SpaceFibre Applications** †

#### **Danilo Monda, Gabriele Ciarpi and Sergio Saponara \***

Department of Information Engineering (DII), University of Pisa, 56126 Pisa, Italy;

danilo.monda@phd.unipi.it (D.M.); gabriele.ciarpi@ing.unipi.it (G.C.)


Received: 28 June 2020; Accepted: 14 August 2020; Published: 17 August 2020

**Abstract:** This work presents a comparison between two Voltage Controlled Oscillators (VCOs) designed in 65 nm CMOS technology. The first architecture, based on a Ring Oscillator (RO), is designed using three Current Mode Logic (CML) stages connected in a loop, while the second one is based on an LC-tank resonator. This analysis aims to choose a VCO architecture that can be integrated into a rad-hard Phase Locked Loop meeting the requirements of the SpaceFibre protocol for space applications, which requires frequencies up to 6.25 GHz. The full-custom schematic and layout designs are shown, and Single Event Effect simulation results, performed with a double-exponential current pulse generator, are presented in detail for both VCOs. Although the RO-VCO performance in terms of technology scaling and high integration density is attractive, the process-variation simulations demonstrated its inability to generate the target frequency in harsh operating conditions. In contrast, Process-Voltage-Temperature simulations showed that the oscillation frequency of the LC-VCO is less sensitive to these variations. Both architectures are biased with a supply voltage of 1.2 V. The results achieved for the second architecture make it attractive to address the requirements of the new SpaceFibre aerospace standard.

**Keywords:** ring oscillator; LC-tank oscillator; SpaceFibre; rad-hard circuits; radiation effects; high-speed data transfer

#### **1. Introduction**

Several thousand launches have been performed during the last half-century, and with the rapid development of technology, satellites play an important role in human society. These systems are widely used for navigation, communication, and Earth observation. One of the first laser communication experiments was conducted between two Low Earth Orbit (LEO) satellites and the geostationary satellite ARTEMIS, with a data rate of up to 50 Mbps. Other experiments followed with increased data rates to achieve inter-satellite communication links. Today, current trends in satellites show a rapid increase in data traffic and digital processing. The throughput of next-generation satellites for digital telecom applications, as well as scientific missions, surveillance, and remote sensing, will exceed terabits per second of data that must be processed on board. For instance, high-resolution cameras and synthetic aperture radars need high-speed communication between the instruments and the on-board data storage system [1]. Optical technology, thanks to its high bandwidth-length product, lightweight cabling, and electromagnetic hardness, can potentially be the solution for the data-rate increase in satellites. In this direction, the European Space Agency (ESA) has recently released the new SpaceFibre standard for on-board satellite communication up to 6.25 Gbps [2,3]. The standard describes a very high-speed serial link and network technology designed specifically for use on-board spacecraft and satellites. This protocol provides a coherent quality-of-service mechanism able to support reserved-bandwidth, scheduled, and priority-based qualities of service. SpaceFibre provides robust, long-distance communication for launcher applications and supports avionics applications with deterministic delivery constraints using virtual channels. Communication performance is strongly related to the ability to synchronize the receiver with the transmitter. This task is typically handled by a Clock and Data Recovery (CDR) circuit, whose key synchronization block is the Phase Locked Loop (PLL). The Voltage Controlled Oscillator (VCO) is the core block of the PLL, generating the 6.25 GHz frequency required to be compliant with the SpaceFibre protocol. Although the required Total Ionizing Dose (TID) level is lower than 1 Mrad for space applications [4], the main problems are due to Single Event Effects (SEEs) that temporarily disturb the typical operation of the circuit. This work targets, as implementation technology, a commercial 65 nm CMOS from TSMC (Taiwan Semiconductor Manufacturing Company). This technology, thanks to its thin gate oxide, can be considered radiation hard up to a few hundred Mrad TID, as proved in [5] and by us in previous designs of other high-speed circuits in [6–8]. To the best of the authors' knowledge, in the literature and on the market, there are no examples of rad-hard VCOs able to work at 6.25 GHz.
The paper [9] showed the design of a PLL in the range from 0.2 GHz to 1.2 GHz, designed in 65 nm STMicroelectronics space technology. This system was irradiated up to a 300 krad TID level, and its behavior was verified under irradiation with different protons. In [10], a comparison between a Ring Oscillator (RO) and an LC-tank VCO for a PLL was made for Large Hadron Collider (LHC) applications. Both were designed for a working frequency from 2.2 GHz to 3.2 GHz, and the SEE tests performed with heavy ions showed that the LC-VCO had a larger cross-section than the RO-VCO. Varactors were identified as the most sensitive part of LC-tank architectures, and the Triple Modular Redundancy (TMR) technique was adopted to counter SEEs in the design of the phase frequency divider. The goal of this work was to compare the performance of the widely used RO and LC controlled oscillators in radiation environments and to contribute new approaches for exploiting the characteristics that have made these systems the most widely implemented.

This work is an extension of the preliminary work presented by us at the conference [11]. With respect to the conference presentation, this work presents the complete full-custom design of schematic and layout for both the RO and the LC-tank controlled oscillator (reported in Sections 2 and 3, respectively). Moreover, Section 4 provides transient and SEE simulation results, missing in [11]. Section 5 compares this work with the state of the art. Conclusions are drawn in Section 6.

#### **2. Ring Oscillator Based on a Cascade of Three Current Mode Logic (CML) Buffers**

#### *2.1. Ring Oscillator Schematic Design*

The RO-VCO presented in this work is composed of a cascade of inverting amplifiers in a closed loop, as shown in Figure 1. Here, *g*<sub>m</sub> is the transconductance of the single amplifier stage, while R and C are the equivalent output resistance and the equivalent input capacitance of the previous and following stages, respectively. According to Figure 1, the open-loop gain of the system composed of N generic stages is expressed as

$$H(j\omega) = \left( -\frac{g_m R}{1 + j\omega RC} \right)^N \tag{1}$$

For the Barkhausen oscillation criterion [12], the magnitude of the loop transfer function has to be higher than one for the start-up condition and then equal to one to sustain the oscillation, while the transfer function phase has to be an integer multiple of 2π.

**Figure 1.** Ring oscillator modeled using inverting stage amplifiers.

Applying this criterion to the model in Figure 1, we obtain the oscillation condition in terms of design parameters, expressed as

$$g_m R \geq \frac{1}{\cos \theta} \tag{2}$$

where θ is the phase shift introduced by each RC load, which, for the Barkhausen oscillation criterion, must be an integer multiple of π/*N*. In a ring oscillator, the frequency is *f*<sub>0</sub> = 1/(2*N*τ<sub>D</sub>), where τ<sub>D</sub> is the delay of a single stage and N is the number of stages in the loop. In order to limit power consumption and to reduce the silicon area, thus decreasing the probability of strikes by ionizing particles, *N* = 3 was chosen for the RO-VCO design. Although a two-stage ring oscillator provides a quadrature clock, as demonstrated in [13], a three-stage oscillator is conventionally used for differential architectures [14]. Moreover, a smaller value of N provides better phase noise [15] and a higher working frequency *f*<sub>0</sub>. With this choice, in accordance with Equation (2), the following condition is extracted as the main design guideline

$$
g_m R \geq 2 \tag{3}
$$
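The condition in Equation (3) follows from Equation (2) with N = 3, since each load then contributes a phase shift of π/3 and cos(π/3) = 1/2. A minimal numeric sketch (Python, not part of the paper) checks this bound for a few stage counts:

```python
import math

def min_gain_per_stage(n_stages: int) -> float:
    """Minimum g_m*R product each stage needs so that an N-stage ring
    satisfies the Barkhausen criterion: g_m*R >= 1/cos(pi/N)."""
    return 1.0 / math.cos(math.pi / n_stages)

# For the three-stage RO of this design: cos(pi/3) = 0.5 -> g_m*R >= 2
assert abs(min_gain_per_stage(3) - 2.0) < 1e-9

for n in (3, 4, 5):
    print(f"N = {n}: g_m*R >= {min_gain_per_stage(n):.3f}")
```

Note how the required per-stage gain relaxes as N grows, at the cost of a lower oscillation frequency.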

Although CMOS architectures are largely used for their low static-power dissipation and high integration density, the designed RO-VCO is composed of three CML stages. The current mode logic architecture, based only on n-MOSFETs and resistors, is more suitable for high-frequency applications, thanks to its lower voltage swing and lower output impedance than a standard CMOS approach [16,17]. Moreover, the use of a differential structure provides higher common-mode disturbance immunity than a single-ended structure, as used in classic CMOS circuits [18]. Guard rings and deep n-wells are also used in the design of the MOSFET devices to prevent Single Event Latch-up (SEL) and to mitigate SEEs [19,20]. The single CML stage, shown in Figure 2, is made of a source-coupled pair with a resistive load, a simple current mirror, and accumulation-mode MOSFET varactors. The active components M1 and M2 are designed with the minimum channel length allowed by the technology, and the transistor width is chosen to ensure, in the worst case, a *g*<sub>m</sub>R value of 4, which is two times higher than the critical value expressed in Equation (3). The supply voltage for this technology is 1.2 V, and the value chosen for the resistors shifts the output common-mode voltage level to 0.9 V. The RO-VCO bias current is controlled by the external generator I0 through the simple current mirror M3 and M4 with a unity current gain. These MOSFETs are designed with the maximum length allowed by the RF-device model to increase the output resistance. A current of 4 mA feeds the controlled oscillator, and the post-layout simulated power consumption is 18 mW. To control the oscillation frequency, a pair of varactors is added at the output of each stage [21,22].

Frequency tuning is achieved with accumulation-mode MOSFET devices. A single varactor is designed with 40 fingers divided into 2 groups, each finger with the minimum finger length of 200 nm and a finger width of 550 nm. In the typical case, the capacitance ranges from 69.53 fF down to 34.93 fF for the minimum and maximum values of the control voltage, respectively. As shown in Figure 3a, the variation of the capacitance value across the corner cases is lower than 5%.

**Figure 2.** Circuit schematic of the single-stage, based on a Current Mode Logic (CML) buffer, of the ring oscillator and a couple of varactors connected at the two outputs.

**Figure 3.** Varactor capacitance vs. control voltage for RO-VCO (**a**) and LC-VCO (**b**), different corner cases. RO, Ring Oscillator; VCO, Voltage Controlled Oscillator.

The oscillation frequency of the RO-VCO based on a CML architecture is closely related to the value of the gate capacitance [23], and it is expressed by the relation *f*<sub>0</sub> = 1/(2π*RC*<sub>T</sub>), where R is the parallel combination of the pull-up CML resistive load and the output MOSFET resistance, while *C*<sub>T</sub> is the cumulative capacitance due to the varactors and the gate capacitance of the following stage.
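This first-order model can be sketched numerically. The load resistance of 500 Ω below is an assumption for illustration only (the paper does not quote R); the capacitances are the varactor extremes quoted above, ignoring the additional gate capacitance:

```python
import math

def ro_freq(r_ohm: float, c_farad: float) -> float:
    """CML ring-oscillator frequency model f0 = 1/(2*pi*R*C_T)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)

# Hypothetical load resistance of 500 ohm; varactor range from the text.
for c_t in (34.93e-15, 69.53e-15):
    print(f"C_T = {c_t*1e15:.2f} fF -> f0 = {ro_freq(500.0, c_t)/1e9:.2f} GHz")
```

The 1/C dependence means the frequency halves when the total capacitance doubles, which is why this architecture is so sensitive to capacitance variations, as discussed in Section 4.2.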

#### *2.2. Ring Oscillator Layout Design*

The complete layout of the RO-VCO designed in 65 nm CMOS bulk-silicon technology is shown in Figure 4. The simple current mirror, on the bottom side, and the three source-coupled pairs are laid out adopting the common-centroid technique to increase matching. All the gate terminals are oriented in the same way so that the current flows in the same direction, and the space between instances is the minimum allowed by the technology rules. A trade-off between metal width and length is made to prevent electromigration phenomena due to high current density. Moreover, alternate layers perpendicular to each other are drawn to minimize the parasitic capacitances that lead to a frequency reduction. The total layout area of the proposed RO-VCO is 249 × 86 μm².

**Figure 4.** Full custom layout of the RO-VCO based on the CML buffer designed with Cadence Virtuoso [24].

#### **3. LC-Tank Oscillator**

#### *3.1. LC-Tank Schematic Design*

The second architecture designed is based on an LC-tank resonator. This architecture bases its oscillation frequency on the filtering effect of the LC-tank, leaving the active components only the role of setting the feedback gain [25] and compensating for the losses of the inductor.

The design guideline to satisfy the Barkhausen oscillation criterion is

$$
g_m > \frac{1}{R_P} \tag{4}
$$

where *g*<sub>m</sub> is the transconductance of the n-MOSFET devices inside the cross-coupled cell, and *R*<sub>P</sub> is the parasitic resistance of the inductor [26]. Figure 5 shows the schematic of the LC-VCO designed to generate the target 6.25 GHz frequency. A polysilicon resistor is used to shift the output common-mode level to VDD/2, preventing damage to, or lifetime reduction of, the low-voltage MOSFETs used for the cross-coupled pair.

**Figure 5.** LC-tank VCO circuit schematic and a couple of varactors connected at the outputs.

This resistor is connected to the center tap of the symmetrical inductor, chosen for its lower layout area than that of two separate inductors. In order to achieve the best frequency performance of this technology, the cross-coupled pair is sized using minimum-length MOSFETs and a MOSFET width of 3.6 μm to guarantee a cell gain of at least 6 dB for the start-up condition. The VCO bias current is controlled by the external current I0 through the simple current mirror M3 and M4 with a current gain of 5, and the power consumption is less than 3 mW. The oscillation frequency of the LC-VCO is set by *f*<sub>0</sub> = 1/(2π√(*L*(*C* + *C*<sub>var</sub>))) [26], where C is the equivalent capacitance due to the cross-coupled cell and the first stage of the output buffer, and *C*<sub>var</sub> is the capacitance of the accumulation-mode MOSFET varactors connected at the controlled-oscillator outputs. The Tuning Range (TR) is obtained by sweeping the control voltage Vctrl from 0 V to VDD, over which the varactors assume values from 629.6 fF down to 197.6 fF, as shown in Figure 3b. A single varactor is composed of 120 fingers divided into 6 groups, and each finger is designed with a 300 nm finger length and a 1.2 μm finger width.
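The LC-tank relation can be sketched numerically. The inductance of 0.68 nH and fixed capacitance of 0.67 pF below are assumptions chosen for illustration (the paper does not quote L or C); combined with the varactor range quoted above, they reproduce a tuning range close to the 5.35-6.55 GHz reported in the Conclusions:

```python
import math

def lc_freq(l_h: float, c_fixed: float, c_var: float) -> float:
    """LC-tank oscillation frequency f0 = 1/(2*pi*sqrt(L*(C + C_var)))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_h * (c_fixed + c_var)))

# Assumed tank values (not from the paper): L = 0.68 nH, fixed C = 0.67 pF.
# Varactor extremes from the text: 629.6 fF (Vctrl = 0) and 197.6 fF (Vctrl = VDD).
for c_var in (629.6e-15, 197.6e-15):
    f0 = lc_freq(0.68e-9, 0.67e-12, c_var)
    print(f"C_var = {c_var*1e15:.1f} fF -> f0 = {f0/1e9:.2f} GHz")
```

Because of the square root, a threefold varactor swing moves the frequency by only about 20%, which is what allows the corner-induced shift to be recovered by the control voltage.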

Figure 6 shows the simulated frequency response of the VCO for the two extreme values of the control voltage, with a minimum cell gain of about 10 dB at the minimum value of the control voltage, allowing a robust start-up condition for the oscillator.

**Figure 6.** Frequency response simulated for minimum (**red line**) and maximum (**blue line**) values of the control voltage. The vertical marker indicates the target frequency of 6.25 GHz.

#### *3.2. LC-Tank Layout Design*

The complete layout of the LC-VCO is shown in Figure 7; from bottom to top, it is composed of the simple current tail mirror, the varactors, the cross-coupled cell, the inductor, and the polysilicon resistor.

The current mirror is designed as a single strip, and the common-centroid technique is adopted for the cross-coupled cell. Moreover, the minimum spacing allowed by the technology rules is used, helping to increase matching. About 85% of the total area is occupied by the differential inductor (177 × 198 μm²), which has a quality factor of 20. An odd number of turns was chosen because the two output terminals are then on the same side of the cell, shortening the routing to the MOSFET devices. Moreover, the single resistor connected to the center tap helps to reduce the metal connection length between the inductor and the cross-coupled cell.

**Figure 7.** Full custom layout of the LC-tank VCO designed with Cadence Virtuoso [21].

The oscillator is designed to work properly in the temperature range from −55 °C to +125 °C with 10% variations of the bias current and supply voltage. The total layout area of the proposed LC-VCO is 308 × 198 μm².

#### **4. Simulations Results**

#### *4.1. Design Simulations*

The minimum-length n-MOSFETs allowed high-frequency performance to be achieved but, on the other hand, this choice increased the deviation of the device parameters from the typical condition. Although frequency tuning was provided by the accumulation-mode varactors, the frequency shift across the technology corners was so high that it could not be compensated using the control voltage. Figure 8 shows a post-layout simulation of the free-running oscillation frequency of the RO-VCO for the three main process corners. The frequency values are plotted versus the control voltage, increasing from its minimum to its maximum value. In the slow-slow corner case, the oscillation frequency did not reach the 6.25 GHz value required by the SpaceFibre standard, even using the maximum value of the control voltage. In the fast-fast corner case, the frequency was higher than the target frequency, even with the minimum value of the control voltage. Although the RO-VCO proved strongly dependent on the device parameters, in space applications the best components should be selected.

Although the n-MOSFET devices in the cross-coupled cell were designed with the minimum MOSFET length, the frequency shift of the LC-VCO across the technology corners could be recovered with the use of the varactors and the control voltage. This can be seen from the curves in Figure 9, showing the LC-tank VCO post-layout simulation of the free-running frequency versus control voltage in the fast-fast (red line), typical (green line), and slow-slow (blue line) technology corner cases.

**Figure 8.** RO-VCO post-layout simulation of the free-running frequency versus control voltage in fast-fast (**red line**), typical (**green line**), and slow-slow (**blue line**) technology corner cases. The horizontal marker indicates the target frequency.

**Figure 9.** LC-tank VCO post-layout simulation of the free-running frequency versus control voltage in fast-fast (**red line**), typical (**green line**), and slow-slow (**blue line**) technology corner cases. The horizontal marker indicates the target frequency.

In addition to the technology-corner simulations, PVT (Process-Voltage-Temperature) simulations were performed, also changing temperature and supply voltage for both architectures to increase the simulation realism. The SpaceFibre standard requires the system to work properly under harsh conditions. In Table 1, the process and fabrication results are listed, respectively, in the third and fourth columns. The frequency variations were calculated as deviations from the nominal condition for temperature, supply voltage, and bias current in each technology corner. The variations were obtained for temperature variations in the range from −55 °C to 125 °C and for ±10% supply voltage and bias current deviations. Fabrication results were expressed as the frequency standard deviation σ, and the data were obtained from Monte Carlo simulations, performed with 200 runs in the nominal condition for each corner case, considering process and mismatch variations.


**Table 1.** Frequency variations and standard frequency deviation, respectively, for PVT (Process-Voltage-Temperature) and Monte Carlo simulations.

Figure 10 shows the simulated phase noise of the two architectures, obtained with the harmonic balance simulation in post-layout. Both VCOs were simulated at the same frequency, and the LC-VCO exhibited a phase noise about 30 dB better than the other architecture (at 1 MHz offset, in Figure 10, the phase noise was −110 dBc/Hz for the LC-VCO vs. −82 dBc/Hz for the RO-VCO). Device noise was considered for both oscillator architectures and in all simulations performed in this work.

**Figure 10.** Phase noise simulated for RO-VCO (**red line**) and LC-VCO (**blue line**) at post-layout in typical conditions at the same working frequency of 6.25 GHz. These simulations were performed using the Cadence environment.

The integrated RMS jitter was calculated from Figure 10 in the bandwidth from 100 kHz to 10 MHz. The RMS jitter obtained was 9.51 ps and 0.44 ps for the RO-VCO and LC-VCO, respectively. Moreover, the RO was more sensitive to temperature variations than the LC-VCO. The time-domain VCO stability was assessed with the use of the Allan variance [27,28], or two-sample variance, defined as

$$
\sigma_x^2(\tau) = \frac{1}{2} \mathbb{E}\left[ \left( \overline{x_2} - \overline{x_1} \right)^2 \right] \tag{5}
$$

Figure 11 shows the Allan deviation, or σ-τ plot, calculated as the square root of Equation (5) with the VCO in steady-state oscillation. In particular, Figure 11a,b show the comparison between the Allan deviations in frequency and in amplitude, respectively, for both architectures. The LC-VCO exhibited lower variations in frequency and amplitude than the RO-VCO.

**Figure 11.** Allan deviation of (**a**) frequency and (**b**) amplitude for RO-VCO (**red line**) and LC-VCO (**blue line**) calculated in typical condition at the same oscillation frequency.
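The two-sample statistic of Equation (5) can be estimated from a series of averaged oscillator measurements. The sketch below (Python, illustrative only, not the paper's simulation flow) implements the non-overlapping estimator on toy frequency readings:

```python
import math

def allan_deviation(samples):
    """Non-overlapping two-sample (Allan) deviation of a series of
    averaged measurements x_k, per Equation (5):
    sigma^2 = 0.5 * E[(x_{k+1} - x_k)^2]."""
    diffs = [(samples[i + 1] - samples[i]) ** 2
             for i in range(len(samples) - 1)]
    return math.sqrt(0.5 * sum(diffs) / len(diffs))

# Toy example: a perfectly stable oscillator has zero Allan deviation,
# while readings fluctuating by 10 MHz give a finite one.
print(allan_deviation([6.25e9] * 8))          # stable -> 0.0
print(allan_deviation([6.25e9, 6.26e9] * 4))  # fluctuating -> nonzero
```

Unlike the classical variance, this statistic stays bounded for drifting oscillators, which is why it is the standard time-domain stability measure.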

#### *4.2. Single Event E*ff*ect Simulations*

The model used for SEE simulations, widely accepted by the scientific community [29–31], is shown in Equation (6), where *t*<sub>inj</sub> is the injection instant, *t*<sub>a</sub> is the collection time constant of the junction, *t*<sub>b</sub> is the ion-track establishment time constant, and *Q* is the critical charge.

$$I_{SET} = \frac{Q}{t_a - t_b} \left[ e^{-\frac{t - t_{inj}}{t_a}} - e^{-\frac{t - t_{inj}}{t_b}} \right] \tag{6}$$

SEEs were modeled as double-exponential current pulses at the sensitive nodes, and two different sets of values, with Linear Energy Transfers (LETs) of 5 and 60 MeV·cm²/mg, were used [32]. The two sets specify different time constants and critical charges Q according to the LET. The strike of an ionizing particle can be modeled by inserting a current pulse at each P-N junction, with the direction of the injected current depending on the device type [33], as shown in Figure 12. Moreover, the effects generated by the injected currents are strongly sensitive to the circuit conditions, requiring the analysis of the system in different states.
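The double-exponential pulse of Equation (6) can be sketched directly. The parameter values below (Q = 100 fC, t_a = 200 ps, t_b = 50 ps) are illustrative assumptions only; the actual per-LET values come from [32]:

```python
import math

def i_set(t, q, t_a, t_b, t_inj):
    """Double-exponential SEE current pulse of Equation (6):
    I(t) = Q/(t_a - t_b) * (exp(-(t - t_inj)/t_a) - exp(-(t - t_inj)/t_b))."""
    if t < t_inj:
        return 0.0
    dt = t - t_inj
    return q / (t_a - t_b) * (math.exp(-dt / t_a) - math.exp(-dt / t_b))

# Illustrative parameters (not from the paper): strike at t_inj = 1 ns.
q, t_a, t_b, t_inj = 100e-15, 200e-12, 50e-12, 1e-9
peak = max(i_set(t_inj + k * 1e-12, q, t_a, t_b, t_inj) for k in range(500))
print(f"peak injected current ~ {peak*1e6:.0f} uA")
```

The pulse rises with the fast time constant t_b and decays with the slow one t_a, and its integral over time equals the deposited charge Q, so a larger LET (larger Q) directly scales the disturbance.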

Both VCOs base their frequency tuning on accumulation-mode MOSFET varactors, and the output nodes proved to be the most sensitive nodes of both architectures. Indeed, the strike of an ionizing particle generates a voltage variation at the node, which is then translated into a frequency deviation by the varactors. In this subsection, the SEE simulation results are discussed for the RO and LC controlled oscillators, respectively.

**Figure 12.** The model used for the correct stimulation of the P-N junction with the double exponential current pulses generators.

Figures 13 and 14 show the effects generated by a particle strike for the two values of LET provided for the model in Equation (6). Particles with 5 and 60 MeV·cm²/mg are represented in the following figures with the labels hit1 and hit2, respectively. The two exponential current generators excited the sensitive nodes of the RO-VCO at 25 ns and 30 ns, and those of the LC-VCO at 10 ns and 15 ns.

**Figure 13.** Two sets of values with Linear Energy Transfers (LETs) of 5 and 60 MeV·cm²/mg were used for Single Event Effect (SEE) simulations, and the free-running frequency is plotted for control voltage values equal to 0 V (**red line**) and VDD (**blue line**). (**a**) shows the free-running frequency of the RO-VCO, while (**b**) shows the free-running frequency of the LC-VCO.

**Figure 14.** Differential amplitude variations due to the hit of the two LET values provided for the model. (**a**) shows the differential amplitude in the RO-VCO, while (**b**) shows the differential amplitude in the LC-VCO.

In Figure 13, the free-running frequency versus time is plotted for minimum (red line) and maximum (blue line) values of control voltage, and in Figure 14, the differential output amplitudes for the maximum value of the control voltage, respectively, for RO-VCO and LC-VCO are shown.

In Table 2, the data extracted from Figures 13 and 14 are listed; the column called clock cycles shows the number of clock cycles in which the frequency assumes values different from the nominal one due to the strike of the particle. In the last two columns, the maximum variations of frequency and amplitude are reported.


**Table 2.** Variations due to a SEE for RO-VCO and LC-VCO.

As shown in the last two columns of Table 2, when the VCOs were hit by an ionizing particle with a LET of 5 MeV·cm²/mg (called hit1 in Table 2), both architectures showed nearly the same amplitude and frequency variations. Instead, when a particle with higher LET (called hit2 in Table 2) hit the two VCOs, the amplitude variations of the LC-tank were greater than those of the RO, while the frequency variations were lower in the LC-tank-based architecture. This occurred despite the LC architecture using varactor capacitances one order of magnitude greater than the RO one. The greater frequency deviation in the RO-VCO was due to the frequency relationship with capacitance, which is 1/C for the RO-VCO and 1/√C for the LC-VCO. This attested that a small capacitance variation generated a large frequency variation in the RO-VCO, as highlighted in Table 2.
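The different sensitivities follow directly from the two frequency laws: for f ∝ 1/C a relative capacitance change δ shifts the frequency by about −δ, while for f ∝ 1/√C the shift is only about −δ/2. A minimal sketch (Python, illustrative numbers only) makes the comparison:

```python
def rel_freq_shift_ro(dc_over_c: float) -> float:
    """For f0 ~ 1/C (RO-VCO), a capacitance change dC/C gives a
    relative frequency shift of 1/(1 + dC/C) - 1 ~ -dC/C."""
    return 1.0 / (1.0 + dc_over_c) - 1.0

def rel_freq_shift_lc(dc_over_c: float) -> float:
    """For f0 ~ 1/sqrt(C) (LC-VCO), the same dC/C gives
    (1 + dC/C)^(-1/2) - 1 ~ -dC/(2C)."""
    return (1.0 + dc_over_c) ** -0.5 - 1.0

# A 10% charge-induced capacitance increase perturbs the RO frequency
# roughly twice as much as the LC frequency.
print(f"RO: {rel_freq_shift_ro(0.10)*100:.1f} %")   # ~ -9.1 %
print(f"LC: {rel_freq_shift_lc(0.10)*100:.1f} %")   # ~ -4.7 %
```

This factor-of-two (to first order) advantage of the square-root law is what lets the LC-VCO tolerate much larger varactors while still showing smaller frequency deviations under a strike.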

The n-MOSFET devices in both architectures were designed with the minimum channel length, targeting high-frequency applications, but a maximum number of fingers and an oversized MOSFET width were used to increase the parasitic capacitance of the devices. Although this SEE mitigation technique increased the silicon area and reduced the tuning range, it improved the SEE tolerance of both VCOs. Indeed, following the simple rule V = Q/C, if the capacitance value is increased for a fixed value of charge, a lower voltage variation occurs. Moreover, guard rings and deep n-wells were adopted to isolate the devices from the charge generated in the substrate during a particle strike. Indeed, if an ionizing particle passes through the device, electron-hole pairs can be generated, which, thanks to the guard rings and deep n-wells, are rapidly collected, avoiding the activation of parasitic latch-up.

#### **5. Comparison with the State of the Art**

A state-of-the-art comparison of voltage controlled oscillators designed in 65 nm technology is made in Table 3. In the works [9,10], the PLLs were based on an RO and an LC-VCO. They were irradiated up to a 300 krad TID level, compliant with the SpaceFibre protocol, and tested with different protons. Their working frequency did not meet that required by the SpaceFibre standard, and the aim of this work was to investigate the behavior of these two well-known architectures at a higher frequency. In [34], a VCO based on an LC tank was optimized against SEEs, and in [35], a three-stage ring oscillator was designed targeting Bluetooth front-end applications, but no process simulations were performed. Another solution, presented in [36], was based on a Colpitts architecture for mm-wave applications.


**Table 3.** State-of-the-art comparison.

#### **6. Conclusions**

In this work, a comparison between two VCO architectures designed in a commercial 65 nm technology was made. Targeting high-frequency space applications, such as the SpaceFibre protocol, a CML approach was adopted for the design of the RO-VCO. The CML architecture was preferred, targeting high frequency, thanks to its lower voltage swing than a CMOS one. The RO-VCO was an attractive VCO configuration in terms of technology scaling, high integration density, and area occupancy, which was about 35% of the total silicon area required for the LC-VCO. Although the RO-VCO proved strongly dependent on the device parameters, in space applications the best components should be selected. To overcome the effects of the device parameter deviations on the oscillation frequency, an LC-tank VCO architecture was designed. This architecture, despite its large area, mainly occupied by the inductor, presented promising performance in terms of frequency range, covering the 5.35 GHz to 6.55 GHz range in the typical case with a control voltage swing of VDD. The SEE simulation results highlighted the output nodes as the most sensitive nodes for both VCOs, owing to the effects of the varactors. Although the LC-tank VCO used varactor values one order of magnitude greater than the RO, and the ionizing particle hits generated higher amplitude variations on its output signals, the frequency variations of this VCO were lower than those shown by the RO architecture, thanks to the different relationship between frequency and capacitance. In the literature, VCOs based on the Colpitts architecture are not available for space applications because of their large silicon area. The LC system, whose layout is shown in Figure 7, will be integrated into a 1 mm² chip containing a SERDES (Serializer-Deserializer) to test system-level performance. The whole chip will be electrically tested in standard conditions, then exposed to X-rays to reach the 300 krad TID and to heavy ions for SEE characterization.

**Author Contributions:** D.M.: investigation, methodology, conceptualization, software, writing-review and editing; G.C.: investigation, methodology, conceptualization, software, writing-review and editing; S.S.: investigation, methodology, conceptualization, writing-review and editing. All authors contributed to the present paper with the same effort in finding available literature resources, conducting simulations, and writing the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by PHOS4BRAIN and ISHTAR Projects by INFN and the University of Pisa, as well as by the Dipartimento di Eccellenza Crosslab project by MIUR.

**Acknowledgments:** The authors would like to thank L. Berti, G. Mangraviti, and all the other guys from IMEC (Leuven) for useful discussions and support on the preliminary VCO design. Discussions with F. Palla and G. Magazzù are also acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Analysis and Design of Integrated Blocks for a 6.25 GHz Spacefibre PLL**

#### **Marco Mestice \*, Bruno Neri, Gabriele Ciarpi and Sergio Saponara**

Department of Information Engineering (DII), University of Pisa; via G. Caruso 16, 56122 Pisa, Italy; bruno.neri@iet.unipi.it (B.N.); gabriele.ciarpi@ing.unipi.it (G.C.); sergio.saponara@iet.unipi.it (S.S.)

**\*** Correspondence: m.mestice@studenti.unipi.it

Received: 24 June 2020; Accepted: 17 July 2020; Published: 19 July 2020

**Abstract:** The design of a Phase-Locked Loop (PLL) to generate the clock reference for the new Spacefibre standard is presented in this paper. Spacefibre has been recently released by the European Space Agency (ESA) and supports up to 6.25 Gbps for on-board satellite communications. Taking as a starting point a rad-hard 6.25 GHz Voltage Controlled Oscillator in 65 nm technology, this work presents the design of the key blocks for an integrated PLL: a Triple Modular Redundancy Phase/Frequency Detector, a Charge Pump, and a passive Loop Filter. The modeling activities, carried out in Advanced Design System, have proven that the proposed PLL can be completely integrated on-chip, with a Loop Filter area consumption of only 6000 μm² (considering the 65 nm technology). The design of the active circuits has been carried out at the transistor level in a Cadence Virtuoso environment, implementing both system-level and layout rad-hard techniques, and different solutions are discussed in this paper. As a result, a compact (0.09 mm²), low power (10.24 mW), dead-zone-free, and rad-hard PLL is obtained with a Phase Noise below −80 dBc/Hz @ 1 MHz. A preliminary block view and floor plan of the test chip is also proposed.

**Keywords:** rad-hard; PLL (phase-locked loop); SEE (single event effects); Spacefibre; TID (total ionization dose); charge pump; phase/frequency detector; frequency divider

#### **1. Introduction**

Recently, the new Spacefibre standard [1] has been released by the European Space Agency (ESA). To cope with the data transfer of high-bandwidth sensors, required in scientific, surveillance, and telecom satellite applications, Spacefibre supports up to 6.25 Gbps. The clock reference generator is a key block for the Spacefibre implementation. It must be hardened against SEE (Single Event Effects) [2] and TID (Total Ionization Dose) [3], it should work in a −55 to 125 °C temperature range, and it should sustain up to 6.25 GHz. Moreover, the output frequency divided by 2 and by 4 should also be supported. In aerospace applications, the radiation issues are mainly related to SEE because the TID levels are from several dozen to a few hundred krad, even lower if aluminum shields are used. As discussed in [4], with a 5 mm aluminum shielding, the estimated total TID received by three satellites for earth observation and environmental data collection, RazakSAT-1, SCD-2, and ALOS, is 2.30 rad, 170 rad, and 24,200 rad, respectively. Therefore, rad-hard techniques should be employed to mitigate SEEs, whose effects in PLLs (Phase-Locked Loops) have been widely analyzed and demonstrated [5,6]. Although several solutions have been proposed in the literature [7–9], the state-of-the-art rad-hard PLLs are limited to frequencies below 6 GHz in nominal conditions (e.g., less than 3 GHz is achieved in [10–12]); however, in order to reach 6.25 GHz in all PVT (Process–Voltage–Temperature) corners and in radiation-pervaded environments, it is necessary to work at even higher frequencies in nominal conditions.

An issue that needs to be dealt with is that radiation hardness comes at the cost of power and area consumption, since redundancy techniques are usually implemented. Currently, the Triple Modular Redundancy (TMR) technique is one of the most effective techniques for digital blocks, such as the PFD (Phase Frequency Detector) [13]. It consists of triplicating a cell and adding a voter that chooses among the outputs following a majority approach. For the Charge Pump (CP), instead, voltage-switching Charge Pumps (V-CP) have been demonstrated to be a good choice in terms of SEE mitigation, but they lead to a significant degradation in noise performance compared to conventional CPs [14].

Among the state-of-the-art designs, the 4.8–6 GHz PLL proposed in [15] represents a good tradeoff in terms of power consumption, area, and noise performance, while very low noise performance is achieved in [13] with a 2.2–3.2 GHz PLL.

In this paper, we present a low-power, low-area, reliable 6.25 GHz PLL for aerospace environments. A complete design flow in 65 nm TSMC technology is developed starting from an already designed Voltage Controlled Oscillator (VCO) [16], an already designed Frequency Divider (FD), and the preliminary system-level analysis we performed in the conference work [17], of which this work is an extension. Whereas [17] presented the modeling activities and the preliminary system-level analysis and design, this work additionally illustrates the complete circuit- and layout-level development of the blocks that compose the PLL in 65 nm TSMC technology, together with the implemented rad-hard techniques and comparisons among different solutions.

The Loop Filter's components and the Charge Pump's current have been chosen through a system-level analysis carried out in an ADS (Advanced Design System) environment, while the design of the Charge Pump and the Phase/Frequency Detector has been carried out in a *Cadence Virtuoso* environment. A preliminary block view and floor plan of the test chip is also proposed.

Hereafter, Section 2 presents the PLL system-level analysis and the design of the passive Loop Filter circuit. Section 3 deals with the transistor-level design of the main circuit blocks. Section 4 shows the results of the SEE simulation and post-layout simulations of the whole PLL. In Section 5, a starting point for a test chip is proposed, while the conclusions are drawn in Section 6.

#### **2. System-Level Design of PLL and Passive Filter Sizing**

The target architecture of this work is a CP-PLL [18]. As shown in Figure 1, the blocks that compose the proposed PLL are a PFD, a CP circuit, a passive Loop Filter, a VCO, whose output corresponds to one of the PLL outputs, and an FD with an integer divide ratio, which generates the other three frequencies required by the application. The input signals of the PFD are the reference signal (REF in Figure 1) and the FD output, while the outputs related to the input phase/frequency difference are named UP and DOWN in the figure. These signals are converted by the CP into a charging or discharging current needed to control the VCO through the Loop Filter. The PLL proposed in this work is designed starting from an LC-tank VCO macrocell, designed by the University of Pisa in 65 nm CMOS technology [16], which has the following characteristics: a frequency tuning range between 5.9 and 7.5 GHz, obtained through integrated varactors; a gain of 2 GHz/V with respect to the control voltage at the center of the frequency range; a Phase Noise below −100 dBc/Hz at 1 MHz from the carrier.

Starting from the system-level analysis and design, the Loop Filter's components, the Charge Pump's current (Icp), and the Frequency Divider ratio (N) have been chosen. Thanks to a divider ratio of 40 and with 156.25 MHz as the reference frequency, the output at 6.25 GHz is obtained. According to the Spacefibre standard, the PLL also has to generate the frequencies 3.125 GHz and 1.5625 GHz, which are obtained by tapping the integer FD at division ratios of 2, 4, and 40.
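The frequency plan above is simple arithmetic; a minimal sketch (variable names are illustrative, not from the paper) verifying that the chosen reference frequency and divider ratios produce the three Spacefibre clocks:

```python
# Frequency plan of the PLL described above (values from the text).
F_REF = 156.25e6   # reference frequency in Hz
N = 40             # feedback divider ratio

f_vco = F_REF * N                             # PLL output: 6.25 GHz
outputs = {d: f_vco / d for d in (1, 2, 4)}   # Spacefibre output clocks
```

Tapping the divider chain at ratios 2 and 4 yields the 3.125 GHz and 1.5625 GHz outputs, while the full ratio of 40 closes the loop back to the reference.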

Moving on to the Loop Filter, as shown in Figure 2, it is composed of a capacitor C1, a resistor R1, which is needed to stabilize the loop, and a capacitor C2, whose aim is to reduce the spurious tones at multiples of the reference frequency and whose value, according to [19], should be no more than C1/5. Thanks to the modeling and simulation activity in the ADS environment in Section 2.1, a bandwidth of 6 MHz has been targeted as a tradeoff between the noise behavior and Loop Filter integrability. A completely integrated filter avoids all the problems deriving from the parasitic effects of external components. A higher loop bandwidth, obtained with small passive devices, provides a better reduction of the VCO and Loop Filter contributions to Phase Noise. Instead, a lower bandwidth leads to lower CP and reference noise contributions, but it can be achieved only with large capacitors. A lower CP noise contribution can also be obtained thanks to a higher Icp, at the cost of larger capacitors for a given bandwidth. For all these reasons, an Icp of 40 μA has been chosen, and consequently, the values of C1, C2, and R1 have been derived: 8 pF, 1 pF, and 12 kΩ, respectively.

**Figure 1.** Block schematic of the Charge Pump Phase-Locked Loop (CP-PLL).

**Figure 2.** Loop Filter's schematic view.

#### *2.1. PLL Modeling in "Advanced Design System" Environment*

A phase domain model has been firstly built to analyze the PLL bandwidth and stability. There, all the blocks are linearized: the PFD/CP and the FD have constant gains of Icp/2π and 1/*N*, respectively, while the VCO is modeled as an integrator with gain K*vco*. The transfer functions for the open, Equation (1), and closed, Equation (2), loop models are the following:

$$H_{ol}(s) = \frac{I_{cp}}{2\pi}\, Z(s)\, \frac{K_{vco}}{s}\, \frac{1}{N}, \tag{1}$$

$$H_{cl}(s) = \frac{\frac{I_{cp}}{2\pi}\, Z(s)\, \frac{K_{vco}}{s}}{1 + \frac{I_{cp}}{2\pi}\, Z(s)\, \frac{K_{vco}}{s}\, \frac{1}{N}}, \tag{2}$$

with an Icp of 40 μA, *N* of 40, and:

$$K_{vco} = 12.57 \times 10^{9}\ \mathrm{rad/(V \cdot s)}, \tag{3}$$

$$Z(s) = \frac{1}{s(C1 + C2)} \frac{1 + sR1C1}{1 + sR1\frac{C1C2}{C1 + C2}}.\tag{4}$$

Figure 3 shows the results obtained using an AC simulation. As expected, R1, by introducing a zero, increases the stability of the loop. In contrast, C2, by introducing a pole, reduces the Phase Margin; therefore, its value has been selected to maximize the latter. The main results of the simulations are a unity gain frequency of 3.31 MHz with 50.9° of Phase Margin, and a 5.37 MHz closed loop bandwidth.
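The small-signal analysis above can be reproduced numerically. The sketch below evaluates Equations (1) and (4) with the values reported in the text (Icp = 40 μA, N = 40, Kvco = 12.57 × 10⁹ rad/(V·s), C1 = 8 pF, C2 = 1 pF, R1 = 12 kΩ) and finds the unity-gain frequency and Phase Margin by bisection; this is an illustrative re-derivation, so small numerical differences from the ADS results quoted above are expected.

```python
import cmath
import math

# Loop parameters reported in the text
Icp, N, Kvco = 40e-6, 40, 12.57e9    # A, -, rad/(V*s)
C1, C2, R1 = 8e-12, 1e-12, 12e3      # F, F, ohm

def H_ol(f):
    """Open-loop transfer function of Equation (1) with Z(s) of Equation (4)."""
    s = 2j * math.pi * f
    Z = (1 + s * R1 * C1) / (s * (C1 + C2) * (1 + s * R1 * C1 * C2 / (C1 + C2)))
    return (Icp / (2 * math.pi)) * Z * (Kvco / s) * (1 / N)

def unity_gain_frequency(lo=1e5, hi=1e9):
    # |H_ol| decreases monotonically with frequency here, so bisect on |H_ol| = 1
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        lo, hi = (mid, hi) if abs(H_ol(mid)) > 1 else (lo, mid)
    return lo

fu = unity_gain_frequency()
phase_margin = 180 + math.degrees(cmath.phase(H_ol(fu)))
```

With these component values the crossover lands in the low-MHz range with roughly 50° of margin, consistent with the trend of the ADS results.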

**Figure 3.** Results of the AC analysis performed on the phase domain models: magnitude on the left and phase on the right of (**a**) the open loop model and (**b**) the closed loop model.

Secondly, to evaluate the noise and lock performances, a behavioral model of the closed loop PLL, shown in Figure 4 together with the model of the VCO alone, has been built to perform simulations in the time and frequency domains. Since all the blocks of the model are noise-free, the block labeled NoiseVCO is added to account for the VCO contribution to phase noise. It is a voltage noise source that approximates the simulated phase noise of the VCO designed in [16] with a piecewise-linear curve and generates the equivalent input noise from it. Regarding the Loop Filter, the noise model provided by ADS has been used for the analysis. Figure 5a shows the PLL lock behavior, which presents a lock time of 555.6 ns for a locking error below 0.01%. The peaks in Figure 5a, due to the resistor of the Loop Filter, are not seen when the PLL is locked because the CP model is ideal. However, since the real CP will be affected by non-idealities, the second capacitor is needed to attenuate these peaks. Instead, Figure 5b shows that the phase noise is well below −80 dBc/Hz. However, the reference and CP contributions have not yet been considered in this simulation, since this represents a preliminary analysis (the PFD and CP contributions to Phase Noise are analyzed in Section 3.1.3). As expected from the theory [19], the loop has a band-pass response with respect to the Loop Filter noise and a high-pass response with respect to the VCO noise.

**Figure 4.** Closed-loop PLL (TOP) and Voltage Controlled Oscillator (VCO, BOTTOM) models in the time and frequency domains.

**Figure 5.** Results of the envelope analysis performed on the model of Figure 4: (**a**) locking process of the PLL; (**b**) comparison between the phase noise of the VCO and the phase noise of the closed loop PLL, considering the noise contribution of the VCO and the Loop Filter.
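The high-pass shaping of the VCO noise mentioned above follows directly from the loop equations: the VCO phase noise reaches the output through the error transfer function 1/(1 + H_ol(s)), which is small inside the loop bandwidth and tends to unity outside it. A quick numerical check under the same component values quoted in the text (an illustrative re-derivation, not the ADS model):

```python
import math

# Loop parameters from the text
Icp, N, Kvco = 40e-6, 40, 12.57e9
C1, C2, R1 = 8e-12, 1e-12, 12e3

def H_ol(f):
    """Open-loop gain of Equation (1) with the filter impedance of Equation (4)."""
    s = 2j * math.pi * f
    Z = (1 + s * R1 * C1) / (s * (C1 + C2) * (1 + s * R1 * C1 * C2 / (C1 + C2)))
    return (Icp / (2 * math.pi)) * Z * (Kvco / s) / N

def vco_noise_gain(f):
    """|phi_out / phi_vco| = |1 / (1 + H_ol)|: the loop's high-pass action on VCO noise."""
    return abs(1 / (1 + H_ol(f)))

low = vco_noise_gain(10e3)    # well inside the few-MHz loop bandwidth: strongly suppressed
high = vco_noise_gain(100e6)  # well outside the loop bandwidth: passes essentially unchanged
```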

#### *2.2. Second-Order Loop Filter's Layout*

Choosing to realize C1 and C2 as Metal–Insulator–Metal (MIM) capacitors and R1 as an N-well under OD resistor, in order to minimize the area consumption, has led to the Loop Filter's layout shown in Figure 6, in which C1 can be recognized on the left, C2 on the right, and R1 in the lower right corner, with an area consumption of only 6000 μm².

**Figure 6.** Layout in 65 nm technology of the Loop Filter in Figure 2.

#### **3. Transistor Level Design of the PLL**

#### *3.1. Charge Pump and Phase/Frequency Detector*

#### 3.1.1. Charge Pump

Ideally, a Charge Pump is a current sink or source, depending on its inputs. To obtain this behavior, there are three main topologies of CP in the literature:

- Drain Switching;
- Source Switching;
- Gate Switching.
Therefore, firstly, three basic Charge Pumps [20] have been designed and compared, one for each topology. The conventional CP's Drain Switching architecture is shown in Figure 7a. It consists of two mirrors, one type n (M0 master) with two slaves (M1 and M2) and one type p (M4 master, M5 slave). The switch M3 is realized with an NMOS transistor, while the switch M6 is realized with a PMOS transistor. M0, M1, M2, M4, and M5 use a longer-than-minimum length to obtain a higher output impedance and are sized in width so that the drain currents of M3 (sink current) and M6 (source current) are 40 μA and the output voltage is 520 mV when the output is left open. 520 mV is the nominal control voltage needed to have 6.25 GHz as the output frequency of the VCO in [16]. Therefore, the CP has been sized to have near-zero current mismatch and an Icp of 40 μA when the control voltage of the VCO (Vctrl) is 520 mV. Instead, the switch transistors are wide and short to minimize their VDS (drain-source voltage drop).

Instead, the Source Switching-based architecture is shown in Figure 7b. Its operation is very similar to the previous CP, and therefore, it is sized with the same criteria: enhance the output impedance and reduce the current mismatch near 520 mV of Vctrl.

Finally, the Gate Switching architecture is shown in Figure 7c. In contrast with the other two architectures, in this one, it is necessary to add another type p (M6, M7) and type n (M3, M4) current mirror, since putting the main type n mirror (M0, M1, M2) in the OFF state to switch off the sink current would also shut down the source current. Given that, the circuit in Figure 7c has been sized with the same criteria as the other two architectures.

**Figure 7.** Charge Pump's architectures: (**a**) Drain Switching architecture, (**b**) Source Switching architecture, (**c**) Gate Switching architecture.

To analyze the output impedance of the three CP circuits in Figure 7, imagine placing an ideal voltage source on the output node of each CP circuit. Then, we measure the output current and its derivative when UP is low and DOWN is high for the Source and Drain Switching, and when UP is high and DOWN is low for the Gate Switching. The three CP circuits, because of their similarity, show almost the same DC characteristics: an output range between 0.15 and 0.9 V, but quite a low output impedance of about 30 kΩ. Instead, regarding the transient behavior shown in Figure 8, the three CP architectures show different characteristics. In this case, the simulations were performed forcing the output node at 520 mV with an ideal voltage source, and the source and sink currents were measured. Since the ON time of the PFD's outputs during the reset period depends on several factors, such as the PVT conditions and layout, which were unknown at this design level, the simulations were performed considering two different values of ON time: 3.2 ns, to let the two currents exhaust the transient and therefore analyze all their transient behavior; and 500 ps, to approximate a more realistic reset time of the PFD and then analyze the currents' behavior in an approximation of the locked state of the PLL. The Drain Switching CP in Figure 7a is the fastest one, while the Source and Gate Switching CPs are too slow to turn ON in less than 500 ps. The Source Switching CP in Figure 7b is the slowest one, as can be seen in Figure 8b. Both the Source and Drain Switching CP architectures show peaks during the switching period, and this leads to a degradation of the transient matching. Instead, the Gate Switching CP architecture does not show any peak current during the switching period. Therefore, although the Drain Switching is the fastest architecture and all of them could be enhanced at the cost of increased complexity, the Gate Switching has been chosen as the target topology for the whole PLL design. However, it is important to notice that this architecture is intrinsically more complex because there is the need to separate the bias circuits for the UP (M8, M9, M10) and DOWN (M3, M4, M5) branches.

**Figure 8.** Transient behavior of the three CPs of Figure 7. The source current is represented in red, while the sink current is represented in black for two different values of ON time of the input signals (UP and DOWN): 3.2 ns on the left, which corresponds to a duty cycle of 50%, and 500 ps on the right, which corresponds to a duty cycle of 7.8125%. The Drain Switching CP results are represented in (**a**); the Source Switching CP results are represented in (**b**); the Gate Switching CP results are represented in (**c**).

Once the CP topology has been chosen, the target architecture has been modified to obtain better performance, as shown in Figure 9 versus Figure 7c. Firstly, the simple output current mirrors have been replaced with high-swing current mirrors to enhance the output impedance without degrading the output range too much. This change also allowed the reduction of the sizes of the output transistors, leading to a faster CP. Moreover, both the UP and DOWN branches have their own bias circuitry in order to reduce the effect of the UP signal on the sink current and of the DOWN signal on the source current. This has led to an output impedance of 250 kΩ, with an output range of 0.3–0.9 V. In terms of DC current mismatch, the worst case over process–temperature corners is 1.454 μA. Regarding the transient characteristics shown in Figure 10, the designed architecture is peak-free, and it is able to start sinking or sourcing the current in less than 500 ps.

**Figure 9.** Enhanced CP's Gate Switching architecture with CMOS standard inputs.

**Figure 10.** Transient behavior of the CP of Figure 9. The source current is in red, and the sink current is in black for two different values of ON time of the input signals (UP and DW\_N): (**a**) 3.2 ns, (**b**) 500 ps.

A differential input Gate Switching CP has also been developed. Indeed, a differential PFD/CP guarantees a better current matching during the switching period, thanks to the symmetric input load of the CP and to the fact that no inverter needs to be added between one of the PFD outputs and the corresponding CP input. Hence, the switch transistors have been replaced with differential pairs (M6–M7 and M8–M9), and the mirror M22–M23 has been added to enhance the switching speed of the source current during the ON-to-OFF switch [21], as shown in Figure 11. The DC characteristics are similar to those of the previous architecture: an output range of about 0.3–0.9 V, an output impedance of about 250 kΩ, and a worst-case DC current mismatch across temperature–process variations of 1.66 μA. Regarding the transient behavior, it is peak-free and shows a better transient current matching, as can be seen from Figure 12.

**Figure 11.** Enhanced CP's Gate Switching architecture with differential inputs.

**Figure 12.** Transient behavior of the CP of Figure 11. The source current is in red and the sink current is in black for two different values of ON time of the input signals (UP\_P-UP\_N and DW\_P-DW\_N): (**a**) 3.2 ns, (**b**) 500 ps.

#### 3.1.2. Phase/Frequency Detector

Since the PFD is essentially a digital block, the easiest and most effective way of hardening it against radiation is the TMR [13]. The designed architecture is shown in Figure 13a. The blocks labeled PFD in Figure 13a consist of the conventional PFD architecture shown in Figure 13b, and every PFD has its own reset voter that decides among all the output resets. In this way, the only possible loss of lock for the PLL happens when an SEE occurs on the UP and/or DOWN voters. However, this error does not lead the PFD into a wrong state causing cycle slipping, i.e., the loss (or gain) of one cycle of one input with respect to the other. This architecture has been developed in both CMOS logic and CML (Current Mode Logic), in order to be compatible with the designed CPs.

**Figure 13.** Triple modular redundant PFD architecture in (**a**) and simple PFD architecture in (**b**).
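The majority voting at the heart of the TMR approach can be illustrated behaviorally. The sketch below (a software illustration, not the transistor-level voter of the paper) shows how a 2-out-of-3 vote masks a single upset on any one of the three replicated outputs:

```python
def majority(a, b, c):
    """2-out-of-3 majority voter: (a AND b) OR (b AND c) OR (a AND c)."""
    return (a and b) or (b and c) or (a and c)

def tmr_output(correct_value, upset_index=None):
    """Triplicate a logic value, optionally flip one replica (an SEU), and vote."""
    replicas = [correct_value] * 3
    if upset_index is not None:
        replicas[upset_index] = not replicas[upset_index]
    return majority(*replicas)
```

Whatever single replica is flipped, the voted output still equals the correct value; this is why, as noted above, only an upset in the voter itself can propagate to the loop.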

3.1.3. Comparison between the Two PFD/CP Architectures

In order to compare the two different solutions (CMOS and CML versions of the PFD/CP), two PLL testbenches have been built in Cadence. In these testbenches, the VCO model with parasitic effects is considered, while the CP's and PFD's schematic views have been used to reduce the simulation time. Finally, a Verilog-A frequency divider, which can be found in Cadence's *rf-library*, has been used. For the Phase Noise comparison, because of the non-linear nature of the PLL, a direct time domain noise analysis was necessary. This has been done exploiting the Transient Noise Analysis option embedded in Spectre RF's Transient Analysis. Since this type of simulation is time consuming, a resolution bandwidth of 500 kHz has been set to reduce the simulation time. In Figure 14, the Phase Noise results are shown: the two solutions have almost the same behavior in terms of noise, but the CMOS solution shows higher peaks at multiples of the reference frequency because of the worse current matching during the switch of the CP currents from ON to OFF and vice versa. In Table 1, the comparison in terms of DC current matching of the CP and power consumption between the CML and CMOS solutions is summarized. The DC current mismatch has been measured performing a DC simulation on the CP circuits and measuring the output current when the output node is forced at 520 mV with an ideal voltage source, and the inputs are such that both the source and sink currents are ON. Considering that (1) the noise performance difference is negligible, (2) the CMOS solution has a better current matching in the worst case, and (3) the CML solution shows 25 times higher power consumption, the CMOS solution has been chosen and developed at the layout level for the whole PLL design.

**Figure 14.** Comparison between the two CP/PFD (Phase Frequency Detector) architectures in terms of phase noise: Current Mode Logic (CML) architecture's results in red, CMOS architecture's results in black.

**Table 1.** Comparison between the two CP/PFD architectures in terms of CP's current matching and CP+PFD's power consumption.


#### 3.1.4. PFD/CP Layout

The CP's layout is shown in Figure 15. The output mirrors, since their slaves are composed of two MOS in parallel, are interleaved with the master transistor to enhance the technology process matching. Moreover, every mirror and the switch transistors are surrounded by guard rings connected to the voltage supply. These rings have two main functions:

- avoiding latch-up;
- reducing the drift current deriving from an SEE.
Regarding the PFD, as for the CP, guard rings have been placed around the transistors to avoid latch-up and to reduce the drift current deriving from an SEE. Moreover, the triplicated cells have been placed at a distance of 10 μm from each other to avoid MBUs (Multi Bit Upsets); the TMR technique would become useless if an SEE caused an SEU (Single Event Upset) in more than one PFD at a time. In Figure 16, the complete layout of the TMR PFD is shown: the three PFDs with their reset majority logic can be recognized on the left, while the output majority logic can be recognized on the right.

**Figure 15.** Charge Pump's layout.

**Figure 16.** Phase/Frequency Detector's layout.

#### **4. Simulation's Results**

#### *4.1. PFD/CP Characterization*

In Figure 17, the average CP current is represented as a function of the phase difference between the inputs for different technology corners (Figure 17a) and different temperatures (Figure 17b). As can be seen from the figure, the PFD/CP is dead-zone-free thanks to the presence of the RESET state in the PFD.

#### *4.2. Single Event Effect Simulations on the CP*

The current induced on a sensitive node by an SEE can be modeled as the double exponential function of Equation (5) [23]. In this work, the parameter model of Equation (6), taken from a previous high-frequency design we carried out in the same technology [24], has been used:

$$I(t) = Q \frac{e^{-\frac{t}{\tau\_1}} - e^{-\frac{t}{\tau\_2}}}{\tau\_1 - \tau\_2} \,. \tag{5}$$

$$
\tau\_1 = 200 \div 400 \text{ ps} \,, \ \tau\_2 = 50 \div 100 \text{ ps} \,, \ Q = 67 \div 800 \text{ fC} \,\tag{6}
$$

where τ1 and τ2 are the time constants of the double exponential shape and *Q* is the charge released in the silicon substrate during a particle strike, which corresponds to a particle LET (Linear Energy Transfer) between 5 and 60 MeV·cm²/mg.
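One useful sanity check on Equation (5) is that integrating the pulse over time must return exactly the injected charge *Q*. The sketch below does this numerically with example values chosen from the ranges of Equation (6) (illustrative, not a specific simulation corner):

```python
import math

# Example values within the ranges of Equation (6)
TAU1, TAU2, Q = 300e-12, 75e-12, 400e-15   # s, s, C

def see_current(t):
    """Double exponential SET current pulse of Equation (5)."""
    return Q * (math.exp(-t / TAU1) - math.exp(-t / TAU2)) / (TAU1 - TAU2)

# Trapezoidal integration of I(t) over 0..10 ns (>> TAU1, so the tail is negligible):
# the collected charge must equal Q
dt = 1e-12
ts = [i * dt for i in range(10001)]
charge = sum((see_current(a) + see_current(b)) * dt / 2 for a, b in zip(ts, ts[1:]))
```

Analytically, the integral of Equation (5) from 0 to infinity is Q(τ1 − τ2)/(τ1 − τ2) = Q, which is what the numerical sum recovers to well under 1%.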

**Figure 17.** PFD/CP characteristic for different technology corners at 27 °C (**a**) and different temperatures in the typical case (**b**).

The SEE analysis is focused on the most critical block, the CP, since the PFD adopts a TMR mitigation approach, the passive Loop Filter avoids the use of feedback loops and active circuits, and the VCO was inherited from a previous SEE-tested block in [16]. Hence, double exponential current generators from Cadence's analogLib library have been placed on every sensitive node of the CP (all the drains and sources of the MOSFETs that were not connected to the supply voltage), taking care of the direction of the current. These generators have then been delayed with respect to each other to analyze their effects separately, and the output current has been measured in order to see how an SEE on each node of the CP affects it. Finally, these results have been transferred to the ADS behavioral model of the PLL to see how the PLL reacts to Single Event Transients (SETs) on the CP (see Figure 18).

**Figure 18.** Output frequency of the ADS PLL model as a function of time, for Single Event Transients (SETs) spaced 1 μs apart, hitting every sensitive node of the CP and with a Linear Energy Transfer (LET) of 60 MeV·cm²/mg.

As shown in Figure 18, the PLL, which loses lock after an SET, is able to recover the locked state in less than 600 ns. The largest peaks, indicated in Figure 18 with the numbers 1 and 2, are the ones caused by an SET on the output nodes, highlighted in Figure 19 with the corresponding numbers, because the charge injected into these nodes goes directly through the output node, charging or discharging the Loop Filter and thus modifying the control voltage of the VCO.

**Figure 19.** Highlight of the output nodes of the CP.

#### *4.3. PLL Testbench*

Once all the blocks that compose the PLL have been designed, a testbench has been built. Here, the netlist generated by the layout parasitic extraction has been used for all the blocks (CP, PFD, VCO, Loop Filter, and FD), while the connections between the blocks are still ideal. On this testbench, a Transient simulation has been performed to analyze the locking behavior and the Phase Noise. As for the simulations presented in Section 3.1.3, the Transient Noise Analysis Option has been used for the Phase Noise evaluation.

#### 4.3.1. Locking Process

The locking process in the typical technology corner at 27 °C is shown in Figure 20. The lock time is 1 μs, which is twice that expected from the ADS behavioral model. This is mainly due to two factors: (1) the non-linearity of the VCO's characteristic, and (2) cycle slipping. In the ADS model, the VCO was approximated to have a linear characteristic with a 2 GHz/V gain, but this is true only in a small range of frequencies. For the other frequencies, the gain is smaller; consequently, the loop bandwidth and damping factor are smaller, which leads to a higher lock time. Moreover, at the beginning of the locking process, two cycle slips are seen. This is due to the RESET state of the PFD, which is necessary to cancel the dead zone. When the phase difference between the two PFD inputs is high enough that the new edge of the leading input arrives while the PFD is in its RESET state, this edge is not seen, and the PFD goes into a wrong state with respect to the ideal one. From that, it recovers the correct state when the cycle slip happens, but this leads to an enlargement of the lock time.

#### 4.3.2. Noise Simulations

In Figure 21, the resulting post-layout Phase Noise for the whole PLL is shown. It is below −85 dBc/Hz, which is in line with that of the VCO. The largest contribution is due to the CP/PFD at low and middle frequencies, while the VCO's noise is predominant at higher frequencies, as expected from the theory [19].

Since in digital design a temporal characterization of noise is usually preferred, the absolute Jitter has been measured, resulting in a peak-to-peak value of 8.8 ps and an RMS (root mean square) value of 2.03 ps (typical corner, considering 12,500 cycles, which in terms of Phase Noise means a bandwidth between 500 kHz and 1 GHz). Absolute Jitter is particularly important for communication systems, since it measures how far an edge of the clock is from its ideal position. For other applications, Period Jitter is of most interest, since it measures the difference between a clock period and the average one. Therefore, Period Jitter has been measured as well, in the typical corner, resulting in a peak-to-peak value of 96 fs and an RMS value of 14.744 fs.
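The two jitter metrics can be made concrete. Given a list of measured clock edge timestamps, absolute jitter is the deviation of each edge from an ideal time grid, while period jitter is the deviation of each period from the average period. The sketch below computes both for a hypothetical set of edge times (the numbers are made up for illustration, not measurement data from the paper):

```python
import math

T_NOM = 160e-12  # nominal period of the 6.25 GHz output clock

# Hypothetical measured rising-edge timestamps: ideal grid plus small errors
errors = [0.0, 2e-12, -1e-12, 1.5e-12, -0.5e-12]
edges = [i * T_NOM + e for i, e in enumerate(errors)]

# Absolute jitter: deviation of each edge from the ideal time grid
abs_jitter = [t - i * T_NOM for i, t in enumerate(edges)]

# Period jitter: deviation of each clock period from the average period
periods = [b - a for a, b in zip(edges, edges[1:])]
avg_period = sum(periods) / len(periods)
per_jitter = [p - avg_period for p in periods]

def rms(xs):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def peak_to_peak(xs):
    return max(xs) - min(xs)
```

The peak-to-peak and RMS figures quoted in the text are exactly these statistics, taken over the 12,500 simulated cycles.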

**Figure 21.** Post-layout phase noise.

#### **5. Test Chip**

To prototype the designed PLL, a block view of a possible test chip has been developed and is shown in Figure 22. As can be seen, it consists of the whole PLL, which can be recognized at the bottom, and a duplicate of the PFD and CP. The main idea is to test every block separately: the inputs can be given from an external source, and all the outputs are outputs of the test chip. Since the PFD/CP's area consumption is low, duplicating these blocks gives the possibility of testing them separately, without adding multiplexers and buffers inside the loop that would disturb the PLL's operation. The FD is not duplicated because it consists of three stages that need bias currents. This means that at least 4 pads are necessary for it, and therefore, duplicating it would lead to a too high requirement in terms of pads. Moreover, in Figure 23, a draft of the floor plan is shown: the chip's area to be fabricated using a Europractice Multi Project Wafer is 1 mm². As can be seen from the figure, the whole PLL occupies an area of about 0.09 mm² (the remaining area and pads are used for other designs not discussed in this paper). It should be noted that the design of the pads is inherited from a previous design, where they were also tested with respect to resistance to radiation effects [24].

**Figure 22.** Block schematic of the test chip.

**Figure 23.** Draft floor plan of the test chip.

Although the figures and tables proposed in this paper refer to post-layout simulation results, the experience of our research group from previous designs in the same 65 nm TSMC technology, using the same CAD (Computer-Aided Design) environment and targeting similar operating frequencies, has shown a coherent alignment between post-layout simulations and experimental measurements. In the next step, when measurements of the prototyped chip are performed, a complete assessment of the accuracy of the results will be carried out.

#### **6. Conclusions and Future Work**

The design of a PLL for the new ESA Spacefibre standard has been presented in this paper. The work was carried out in ADS for the modeling activity and in Cadence Virtuoso for the design activity. In particular, the design of a TMR PFD, a CP, and a passive Loop Filter has been presented, starting from an already designed 6.25 GHz rad-hard VCO in 65 nm technology. The modeling activity has shown that the PLL can be completely integrated on-chip, with a Loop Filter area consumption of only 6000 μm². The PLL is able to generate three output signals at 6.25, 3.125, and 1.5625 GHz with a gain margin of 86 dB and a phase margin of 50°. The Phase Noise of the PLL, considering the higher frequency output, is below −85 dBc/Hz @ 1 MHz, and hence, it is in line with that of the VCO. The PLL power consumption in the locked state is about 10.24 mW. The PFD/CP is dead-zone free and shows a current matching below 2 μA (5% of the nominal CP current value) in the worst case of process–temperature corners. Moreover, the PLL is highly immune to SEEs on the PFD, while it is able to relock in less than 600 ns after an SEE on the CP. Finally, an area estimation has been done considering also the VCO, resulting in a total area of 0.09 mm².

In Table 2, a comparison with the state-of-the-art rad-hard PLLs is performed. In [15], a CP-PLL in 65 nm technology for high-energy physics is presented. It generates a tunable output frequency in the range of 4.8–6 GHz, which is slightly lower than the SpaceFibre requirement. As reported in [15], that CP-PLL was not designed to be SEE tolerant; indeed, a simple PFD was used in the PLL loop. Instead, in this work, a TMR PFD was implemented to increase the hardness against SEEs. In [13], all the digital logic of the CP-PLL is implemented using a TMR approach, but its low frequency range is not suitable for the SpaceFibre application. In [25], the design of a rad-hard CP-PLL able to generate an output frequency in the range of 1.17–3.16 GHz is reported. It bases its radiation hardness on the use of a Silicon On Sapphire (SOS) substrate. Indeed, SOS, having a resistive sapphire substrate, reduces the creation of electron–hole pairs and their migration in the device bulk in the case of energetic ion strikes. An example of radiation hardness improvement achieved using a different technology is also presented in [26], where a high-frequency CP-PLL is designed in 250 nm SiGe technology. Instead, this work presents the design of a PLL suitable for space applications in standard silicon CMOS technology, whose radiation hardness is achieved by TMR and layout techniques.

**Table 2.** Comparison with the state-of-the-art Rad-Hard PLLs. SOS: Silicon On Sapphire.


\* measured.

The next step is test chip fabrication to prototype the proposed PLL solution. The design has been submitted to tape out through MPW (Multi Project Wafer), and hence, the experimental verification of the fabricated prototype is part of the future roadmap.

**Author Contributions:** M.M.: conceptualization, investigation, methodology, software, writing—original draft, writing—review & editing; B.N.: conceptualization, methodology, supervision, writing-review & editing; G.C.: conceptualization, investigation, methodology, software, writing-review & editing; S.S.: conceptualization, investigation, methodology, supervision, writing-review & editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** Work supported by ISHTAR and Dipartimento di Eccellenza projects by Pisa University and MIUR.

**Acknowledgments:** Discussions with D. Monda, Pisa University, and G. Magazzù, INFN, are acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Machine Learning on Mainstream Microcontrollers** †

#### **Fouad Sakr, Francesco Bellotti \*, Riccardo Berta and Alessandro De Gloria**

Department of Electrical, Electronic and Telecommunication Engineering (DITEN)-University of Genoa, Via Opera Pia 11a, 16145 Genova, Italy; Fouad.Sakr@elios.unige.it (F.S.); riccardo.berta@unige.it (R.B.); alessandro.degloria@unige.it (A.D.G.)


Received: 27 March 2020; Accepted: 29 April 2020; Published: 5 May 2020

**Abstract:** This paper presents the Edge Learning Machine (ELM), a machine learning framework for edge devices, which manages the training phase on a desktop computer and performs inferences on microcontrollers. The framework implements, in a platform-independent C language, three supervised machine learning algorithms (Support Vector Machine (SVM) with a linear kernel, k-Nearest Neighbors (k-NN), and Decision Tree (DT)), and exploits STM X-Cube-AI to implement Artificial Neural Networks (ANNs) on STM32 Nucleo boards. We investigated the performance of these algorithms on six embedded boards and six datasets (four classification and two regression). Our analysis—which aims to plug a gap in the literature—shows that the target platforms allow us to achieve the same performance score as a desktop machine, with a similar time latency. ANN performs better than the other algorithms in most cases, with no difference among the target devices. We observed that increasing the depth of an NN improves performance, up to a saturation level. k-NN performs similarly to ANN and, in one case, even better, but requires the whole training set to be kept in the inference phase, posing a significant memory demand, which can be afforded only by high-end edge devices. DT performance has a larger variance across datasets. In general, several factors impact performance in different ways across datasets. This highlights the importance of a framework like ELM, which is able to train and compare different algorithms. To support the developer community, ELM is released on an open-source basis.

**Keywords:** machine learning; edge computing; embedded devices; edge analytics; ANN; k-NN; SVM; decision trees; ARM; X-Cube-AI; STM32 Nucleo

#### **1. Introduction**

The trend of moving computation towards the edge is becoming ever more relevant, leading to performance improvements and the development of new field data processing applications [1]. This computation shift from the cloud (e.g., [2]) to the edge has advantages in terms of response latency, bandwidth occupancy, energy consumption, security and expected privacy (e.g., [3]). The huge amount, relevance and overall sensitivity of the data now collected also raise clear concerns about their use, as is being increasingly acknowledged (e.g., [4]), meaning that this is a key issue to be addressed at the societal level.

The trend towards edge computing also concerns machine learning (ML) techniques, particularly for the inference task, which is much less computationally intensive than the previous training phase. ML systems "learn" to perform tasks by considering examples, in the training phase, generally without being programmed with task-specific rules. When running ML-trained models, Internet of Things (IoT) devices can locally process their collected data, providing a prompter response and filtering the amount of bits exchanged with the cloud.

ML on the edge has attracted the interest of industry giants. Google has recently released the TensorFlow Lite platform, which provides a set of tools that enable the user to convert TensorFlow Neural Network (NN) models into a simplified and reduced version, then run this version on edge devices [5,6]. EdgeML is a Microsoft suite of ML algorithms designed to work off the grid in severely resource-constrained scenarios [7]. ARM has published an open-source library, namely Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN), for Cortex-M processors, which maximizes NN performance [8]. Likewise, a new package, namely X-Cube-AI, has been released for implementing deep learning models on STM 32-bit microcontrollers [9].

While the literature is increasingly reporting on novel or adapted embedded machine learning algorithms, architectures and applications, there is a lack of quantitative analyses about the performance of common ML algorithms on state-of-the-art mainstream edge devices, such as ARM microcontrollers [10]. We argue that this has limited the development of new applications and the upgrading of existing ones through an edge computing extension.

In this context, we have developed the Edge Learning Machine (ELM), a framework that performs ML inference on edge devices using models created, trained, and optimized in a desktop environment. The framework provides a platform-independent C language implementation of well-established ML algorithms, such as linear Support Vector Machine (SVM), k-Nearest Neighbors (k-NN) and Decision Tree. It also supports artificial neural networks by exploiting the X-Cube-AI package for STM 32 devices [9]. We validated the framework on a set of STM microcontrollers (families F0, F3, F4, F7, H7, and L4) using six different datasets, to answer a set of ten research questions exploring the performance of microcontrollers in typical ML Internet of Things (IoT) applications. The research questions concern a variety of aspects, ranging from inference performance comparisons (also with respect to a desktop implementation) to training time, and from pre-processing to hyperparameter tuning. The framework is released on an open-source basis (https://github.com/Edge-Learning-Machine), with the goal of supporting researchers in designing and deploying ML solutions on edge devices.

The remainder of this paper is organized as follows: Section 2 provides background information about the ML techniques that are discussed in the manuscript. Section 3 describes the related work in this field. Section 4 shows the implemented framework and the supported algorithms. Section 5 presents the extensive experimental analysis we conducted by exploiting the framework. Finally, Section 6 draws conclusions and briefly illustrates possible future research directions.

#### **2. Background**

The Edge Learning Machine framework aims to provide an extensible set of algorithms to perform inference on the edge. The current implementation features four well-established supervised learning algorithms, which we briefly introduce in the following subsections. They all support both classification and regression problems.

#### *2.1. Artificial Neural Network (ANN)*

An Artificial Neural Network is a model that mimics the structure of the brain's neural network. It consists of a number of computing neurons connected to each other in a layered system: one input layer, one or more hidden layers, and one output layer. Artificial Neural Networks (ANNs) can model complex, non-linear or hidden relationships between inputs and outputs [11]. This is one of the most powerful and well-known ML algorithms, used in a variety of applications, such as image recognition, natural language processing, and forecasting.

#### *2.2. Linear Kernel Support Vector Machine (SVM)*

The SVM algorithm is a linear classifier that computes the hyperplane maximizing its distance to the nearest samples of the two target classes. It is a memory-efficient inference algorithm and is able to capture complex relationships between data points. The downside is that the training time increases with huge and noisy datasets [12]. While the algorithm deals well with non-linear problems, thanks to kernels that map the original data into higher-dimensional spaces, we implemented only the original, linear kernel [13] for simplicity of implementation on the edge device.

#### *2.3. K-Nearest Neighbor (k-NN)*

k-NN is a very simple algorithm based on feature similarity that assigns, to a sample point, the class of the nearest set of previously labeled points. k-NN's efficiency and performance depend on the number of neighbors K, the voting criterion (for K > 1), and the training data size. The training phase produces a very simple model (the K parameter), but the inference phase requires exploring the whole training set. Its performance is typically sensitive to noise and irrelevant features [12,14].

#### *2.4. Decision Tree (DT)*

This is a simple and useful algorithm, which has the advantage of clearly exposing the criteria behind its decisions. In building the decision tree, at each step, the algorithm splits the data so as to maximize the information gain, thus creating homogeneous subsets. The typical information gain criteria are Entropy and Gini. DT is able to deal with linearly inseparable data and can handle redundancy, missing values, and numerical and categorical types of data. It is negatively affected by high dimensionality and high numbers of classes, because of error propagation [12,15]. Typical hyperparameters tuned in the model selection phase concern regularization and include the maximum depth, the minimum number of samples for a leaf, the minimum number of samples for a split, the maximum number of leaf nodes, and the splitter strategy (the best split, which is the default, or a random one, which is typically used for random forests). Several DTs can be randomly built for a problem, in order to create complex but high-performing random forests.
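As a concrete illustration of this tuning step, the sketch below searches over the depth, leaf-size, split-size, and criterion hyperparameters mentioned above using scikit-learn on an illustrative synthetic dataset (this is not the paper's code; dataset and grid values are made up):

```python
# Hypothetical sketch: tuning the DT hyperparameters named above with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic binary-classification dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Regularization-oriented hyperparameter grid (depth, leaf size, split size, criterion).
grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
best = search.best_estimator_  # model with the best cross-validated score
```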

#### **3. Related Work**

A growing number of articles are being published on the implementation of ML on embedded systems, especially with a focus on the methodology of moving computation towards the edge. Zhang et al. [16] presented an object detector, namely MobileNet-Single Shot Detector (SSD), which was trained using a deep convolutional neural network with the popular Caffe framework. The pre-trained model was then deployed on NanoPi2, an ARM board developed by FriendlyARM, which uses a Samsung Cortex-A9 Quad-Core S5P4418@1.4GHz SoC and 1 GB of 32-bit DDR3 RAM. MobileNet-SSD can run at 1.13 FPS on this board.

Yazici et al. [17] tested the ability of a Raspberry Pi to run ML algorithms. Three algorithms were tested, Support Vector Machine (SVM), Multi-Layer Perceptron, and Random Forests, with an accuracy above 80% and a low energy consumption. The Fraunhofer Institute for Microelectronic Circuits and Systems has developed Artificial Intelligence for Embedded Systems (AIfES), a library that can run on 8-bit microcontrollers and recognize handwriting and gestures without requiring a connection to the cloud or servers [18]. Cerutti et al. [19] implemented a convolutional neural network on STM Nucleo-L476RG for people detection using CMSIS-NN, which is an optimized library that allows for the deployment of NNs on Cortex-M microcontrollers. In order to reduce the model size, weights are quantized to an 8-bit fixed point format, which slightly affects the performance. The network fits in 20 KB of flash and 6 KB of RAM with 77% accuracy.

Google has recently released the Coral Dev Board, which includes a small low-power Application-Specific Integrated Circuit (ASIC) called the Edge Tensor Processing Unit (TPU), and provides high-performance ML inferencing without running the ML model on any kind of server. The Edge TPU can run TensorFlow Lite with low processing power and high performance [20]. The Edge TPU module offers a few application programming interfaces (APIs) that perform inference for image classification (ClassificationEngine) and object detection (DetectionEngine), and others that perform on-device transfer learning [21].

Microsoft is developing EdgeML, a library of machine learning algorithms that are trained on the cloud/desktop and can run on severely resource-constrained edge and endpoint IoT devices (even with 2 KB of RAM), ranging from the Arduino to the Raspberry Pi [7]. They are currently releasing tree- and k-NN-based algorithms, called Bonsai and ProtoNN, respectively, for classification, regression, ranking and other common IoT tasks. Their work also concerns recurrent neural networks [22]. A major achievement concerns the translation of floating-point ML models into fixed-point code [23], which is, however, not the case for state-of-the-art mainstream microcontrollers.

The Amazon Web Services (AWS) IoT Greengrass [24] supports machine learning inference locally on edge devices. Users can deploy their own pre-trained models or use models that are created, trained, and optimized in Amazon SageMaker (cloud), where massive computing resources are available. AWS IoT Greengrass features a lambda runtime, a message manager, resource access, etc. The minimum hardware requirements are 1 GHz of computing speed and 128 MB of RAM.

Ghosh et al. [25] used autoencoders at the edge layer that are capable of dimensionality reduction to reduce the required processing time and storage space. The paper illustrates three scenarios. In the first one, data from sensors are sent to edge nodes, where data reduction is performed, and machine learning is then carried out in the cloud. In the second scenario, encoded data at the edge are decoded in the cloud to obtain the original amount of data and then perform machine learning tasks. Finally, pure cloud computing is performed, where data are sent from the sensors to the cloud. Results show that an autoencoder at the edge reduces the number of features and thus lowers the amount of data sent to the cloud.

Amiko's Respiro is a smart inhaler sensor featuring an ultra-low-power ARM Cortex-M processor [26]. This sensor uses machine learning to interpret vibration data from an inhaler. The processor allows for the running of ML algorithms where the sensor is trained to recognize breathing patterns and calculate important parameters. The collected data are processed in an application and feedback is provided.

Magno et al. [27] presented an open-source toolkit, namely FANNCortexM. It is built upon the Fast Artificial Neural Network (FANN) library and can run neural networks on the ARM Cortex-M series. This toolkit takes a neural network trained with FANN and generates code suitable for low-power microcontrollers. Another paper by Magno et al. [28] introduces a wearable multi-sensor bracelet for emotion detection that is able to run multilayer neural networks. In order to create, train, and test the neural network, the FANN library is used. To deploy the NN on the Cortex-M4F microcontroller, the above-mentioned library needs to be optimized using CMSIS and TI-Driverlib libraries.

FidoProject is a C++ machine learning library for embedded devices and robotics [29]. It implements a neural network for classification and other algorithms such as Reinforcement Learning. Alameh et al. [30] created a smart tactile sensing system by implementing a convolutional neural network on various hardware platforms like Raspberry Pi 4, NVidia Jetson TX2, and Movidius NCS2 for tactile data decoding.

Recent works have used knowledge transfer (KT) techniques to transfer information from a large neural network to a small one in order to improve the performance of the latter. In this vein, Sharma et al. [31] investigated the application of KT to edge devices, achieving good results by transferring knowledge from both the intermediate layers and the last layer of the teacher (the original model) to a shallower student (the target).

While most of the listed works use powerful edge devices (e.g., Cortex-A9, Raspberry PI) to test algorithms, especially NNs, there is a lack of performance analysis of common ML algorithms on mainstream microcontrollers. We intend to plug this gap by providing an open-source framework that we used for an extensive analysis.

#### **4. Framework and Algorithm Understanding**

The proposed Edge Learning Machine (ELM) framework consists of two modules, one working on the desktop (namely DeskLM, for training and testing), and one on the edge (MicroLM, for inference and testing), as sketched in Figure 1.

**Figure 1.** Block diagram of the Edge Learning Machine system architecture.


The tool has been designed to support a four-step workflow, as shown in Figure 2.






As anticipated, the current version of the ELM framework features four well-established supervised learning algorithms, whose implementations on both the desktop and the edge side we briefly describe in the following subsections.

#### *4.1. Artificial Neural Network (ANN)*

In Desk-LM, ANNs are implemented through the TensorFlow package [5] and its Keras wrapper [33]. As an optimizer, we use adaptive moment estimation ('adam') [35]. At each execution run, the DeskLM module performs hyperparameter tuning by analyzing the ranges of parameters (Table 1 and first column of Table 2) specified by the user. The ANN model hyperparameters include the layer shape (number and size of input, hidden, and output layers), the activation function for the hidden layers (Rectified Linear Unit (ReLU) or hyperbolic tangent (Tanh)), the number of epochs, the batch size, the number of repeats (in order to reduce result variance), and the dropout rate. The best selected model is then saved in the high-efficiency Hierarchical Data Format 5 (HDF5) compressed format [36].
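A minimal sketch of this step, assuming TensorFlow/Keras with illustrative data, layer sizes, and file name (this is not the Desk-LM code itself): build a small model with one of the layer shapes above, compile it with the 'adam' optimizer, train briefly, and save it in HDF5 format.

```python
# Hypothetical sketch of the Desk-LM ANN step; data and hyperparameter values
# are made up for illustration.
import numpy as np
from tensorflow import keras

X = np.random.rand(64, 10).astype("float32")
y = np.random.randint(0, 2, size=64)

model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(100, activation="relu"),   # hidden layers: ReLU or Tanh
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dropout(0.2),                    # dropout rate is also tuned
    keras.layers.Dense(1, activation="sigmoid"),  # binary-classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
model.save("ann_model.h5")                        # HDF5 model, e.g. for X-Cube-AI
```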

For the edge implementation, DeskLM relies on the STM X-Cube-AI expansion package, which is supported by STM32CubeIDE and allows for the integration of a trained Neural Network model in the application. The package offers the possibility of compressing models up to eight times, with an accuracy loss that is estimated by the package itself. The tool also provides an estimation of the complexity, through the Multiply and Accumulate Operation (MACC) figure, and of the Flash and RAM memory footprint [37].

#### *4.2. Linear Support Vector Machine (SVM)*

As anticipated, for simplicity of implementation on the edge device, we implemented only the original, linear kernel SVM [13]. The linear model executes the function y = w\*x + b, where w is the weight vector and b is the bias. Model selection concerns the C regularization parameter [38] (Table 2). As an output model, Desk-LM generates a C source file containing the w and b values.
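The edge-side inference then reduces to evaluating y = w·x + b and taking the sign of the result. The sketch below shows this in Python with made-up w and b values (the actual framework exports these values as a C source file):

```python
# Minimal sketch of linear SVM inference; w and b are hypothetical values
# standing in for those exported by Desk-LM.
import numpy as np

w = np.array([0.8, -1.2, 0.5])   # hypothetical weight vector
b = -0.1                          # hypothetical bias

def svm_predict(x):
    """Return class 1 if w·x + b >= 0, else class 0."""
    return 1 if float(np.dot(w, x)) + b >= 0 else 0

print(svm_predict(np.array([1.0, 0.2, 0.3])))  # → 1 (w·x + b = 0.61)
```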

#### *4.3. K-Nearest Neighbor (KNN)*

For simplicity of implementation, we used a Euclidean distance criterion and majority voting (for K > 1). The training phase produces a very simple model (the K parameter), but deployment also requires the availability of the whole training set (Table 2).
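A minimal Python sketch of this inference step, with an illustrative toy training set (the actual MicroLM implementation is in C): compute the Euclidean distance to every training point and take a majority vote among the K nearest.

```python
# Minimal k-NN inference sketch: Euclidean distance plus majority voting.
# The tiny training set below is made up for illustration.
from collections import Counter
import math

train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = [0, 0, 1, 1]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to the query point.
    dists = sorted((math.dist(x, p), label) for p, label in zip(train_X, train_y))
    # Majority vote among the K nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.05, 0.1)))  # → 0 (two of the three nearest points are class 0)
```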

#### *4.4. Decision Tree (DT)*

In order to cope with the limited resources of edge devices, our framework allows us to analyze different tree configurations in terms of depth, leaf size, and number of splits. Concerning the splitting criterion, for simplicity of implementation on the target microcontrollers, we implemented only the "Gini" method.
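For reference, the Gini impurity used as the splitting criterion is 1 − Σ pᵢ², where pᵢ is the proportion of class i in a node; a pure node scores 0. A quick sketch:

```python
# Gini impurity: 1 - sum of squared class proportions in a node.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 1, 1]))  # → 0.5 (maximally mixed two-class node)
print(gini([1, 1, 1, 1]))  # → 0.0 (pure node)
```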

#### **5. Experimental Analysis and Results**

We conducted the experimental analysis using six ARM Cortex-M microcontrollers produced by STM, namely F091RC, F303RE, F401RE, F746ZG, H743ZI2, and L452RE. The F series represents a wide range of microcontroller families in terms of execution time, memory size, data processing and transfer capabilities [39], while the H series provides higher performance, security, and multimedia capabilities [40]. L microcontrollers are ultra-low-power devices used in energy-efficient embedded systems and applications [41]. All the listed MCUs have been used in our experiments with their STM32CubeIDE default clock values, which could be increased for a faster response. Table 3 synthesizes the main features of these devices. In the analysis, we compare the performance of the embedded devices with that of a desktop PC hosting a 2.70 GHz Core i7 processor, with 16 GB RAM and 8 MB cache.

In order to characterize the performance of the selected edge devices, we have chosen six benchmark datasets to be representative of IoT applications (Table 4). These datasets represent different application scenarios: binary classification, multiclass classification, and regression. University of California Irvine (UCI) heart disease is a popular medical dataset [42]. Virus is a dataset developed by the University of Genova to deal with data traffic analysis [43–45]. Sonar represents the readings of a sonar system that analyses materials, distinguishing between rocks and metallic material [46,47]. Peugeot 207 contains various parameters collected from cars, which are used to predict either the road surface or the traffic (two labels were considered in our studies: label\_14: road surface and label\_15: traffic) [48]. The EnviroCar dataset records various vehicular signals through the onboard diagnostic (OBDII) interface to the Controller Area Network (CAN) bus [49–51]. The air quality index (AQI) dataset measures air quality in Australia during a period of one year [52]. Before processing, all data were converted to float32, according to the target execution platform.


**Table 3.** Microcontroller specifications.



\* For Peugeot 207, we considered two different labels.

Our analysis was driven by a set of questions, synthesized in Table 5, aimed at investigating the performance of different microcontrollers in typical ML IoT contexts. We are also interested in comparing the inference performance of microcontrollers vs. desktops. The remainder of this section is devoted to the analysis of each research question. In a few cases, when the comparison is important, results are reported for every tested target platform. In most of the other cases, unless stated otherwise, we chose the F401RE device as the reference embedded target.



#### *5.1. Performance*

The first research question concerns the performance achieved both on the desktop and on the edge. For SVM, k-NN and DT on the desktop, we report the performance of both our C implementation and the python scikit-learn implementation, while for ANN we have only the TensorFlow Keras implementation. The following set of tables shows, for each algorithm, the obtained score, which is expressed in terms of accuracy (in percent, for classification problems) or coefficient of determination, R-Squared (R2, for regression problems). R2 is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). The best possible score for R2 is 1.0. In scikit-learn, R2 can assume negative values, because the model can be arbitrarily worse. The second performance metric we consider is the inference time.
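As a quick illustration of the R2 score on made-up values, using the usual formula R2 = 1 − SS_res/SS_tot:

```python
# R2: fraction of the target variance explained by the model, on illustrative values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))  # → 0.949
```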

In the following (Tables 6–13), we report two tables for each algorithm. The first one provides the best performance (in terms of score) obtained in each dataset. The second shows the hyperparameter values of the best model.


\*: F0 not supported by the STM X-Cube-AI package.

**Table 7.** ANN corresponding configurations.


Activation Function (AF), Layer Configuration (LC), Principal Component Analysis (PCA).

#### 1. ANN:

Remarkably, all the embedded platforms were able to achieve the same score (accuracy or R2) as the desktop python implementation. None of the chosen datasets required the compression of the models by the STM X-Cube-AI package. ANN performed well in general, except for the Heart and Virus datasets, where the accuracy is under 90%. The inference time is relatively low in both desktop and MCUs (with similar values, in the order of ms and sometimes less). However, there is an exception in some cases—especially for Peugeot\_Target\_15 and Sonar—when using the F3 microcontroller.


**Table 8.** Linear Support Vector Machine (SVM) performance.

**Table 9.** Linear SVM corresponding configuration.


SVM regularization parameter (C), Principal Component Analysis (PCA).



#### 2. Linear SVM:

As with ANN, for the linear SVM, we obtained the same score across all the target platforms, and relatively short inference times (again, with almost no difference between desktop and microcontroller implementations). However, we obtained significantly worse results than ANN for more than half of the investigated datasets. Table 9 stresses the importance of tuning the C regularization parameter, which implies the need for longer training times, particularly in the absence of normalization. We explore this in more depth when analyzing research question 9.


**Table 11.** k-NN corresponding configurations.

Number of neighbors (K), Principal Component Analysis (PCA).

**Table 12.** Decision Tree (DT) performance.




Principal Component Analysis (PCA). \* This configuration fits all targets.

#### 3. k-NN:

Notably, in some cases, the training set cap needed to be set to 100, because the Flash size was a limiting factor for some MCUs. Hence, for different training set sizes, we also had a different number of neighbors (K). Accuracy is also affected by the decrease in training set size, since fewer labeled examples are available as references. This effect is apparent for Sonar with an F0 device. This dataset has sixty features, many more than the others (typically 10–20 features). The inference time varies a lot among datasets and microcontrollers, and in comparison with the desktop implementations. This is because the k-NN inference algorithm always requires the exploration of the whole training set, and thus its size plays an important role in performance, especially for less powerful devices. In the multiclass problems, k-NN exploits the larger memory availability of H7 well, outperforming SVM and reaching a performance level close to that of ANN. It is important to highlight that the Sonar labels were predicted reasonably well by k-NN compared to ANN and SVM (92% vs. 87% and 78%). In general, k-NN achieves performance levels similar to ANN, but requires a much larger memory footprint, which is possible only on the highest-end targets.

#### 4. DT:

When processing the EnviroCar dataset, the DT algorithm saturated the memory in most of the targets. We had to reduce the leaf size for all MCU families, apart from F7 and H7. However, this reduction did not significantly reduce the R2 value. In addition, DT performs worse than the others in two binary classification datasets, Heart and Sonar, and in the AQI regression dataset as well, but performs at the same level as the ANNs for the multiclass datasets and in the EnviroCar regression problem. Notably, DT achieves the fastest inference time among all algorithms, with F0 and F3 performing worse than the others, particularly in the regression problems.

As a rough summary of the first research question, we can conclude that ANN and, surprisingly, k-NN had the highest accuracy in most cases, and Decision Tree had the shortest response time, but accuracy results were quite dependent on the dataset. The main difference between the ANN and k-NN results is that high performance in ANN is achieved by all the targets (except F0, which is not supported by the STM X-Cube-AI package), while k-NN poses much higher memory requirements. Concerning the timing performance, microcontrollers perform similarly to desktop implementations on the studied datasets. The only exception is found in k-NN, for which each inference requires the exploration of the whole dataset, and the corresponding computational demand penalizes the performance, especially on low-end devices. When comparing the edge devices, the best time performance is achieved by F7 and H7 (and we used default clock speeds, which can be significantly increased). Unsurprisingly, given the available hardware, F0 performs worse than all the others. Considering the score, we managed to train all the edge devices to achieve the same level of performance as the desktop in each algorithm, with the exception of k-NN in the multiclass tests (Peugeot), where only H7 is able to perform like a desktop, but with a significant time performance penalty. On the other hand, F0 performs significantly worse than the other edge devices in the k-NN Sonar binary classification.

#### *5.2. Scaling*

Feature preprocessing is applied to the original features before the training phase, with the goal of increasing prediction accuracy and speeding up response times [34]. Since the range of values typically differs from one feature to another, the proper computation of the objective function requires normalized inputs. For instance, without normalization, the computation of the Euclidean distance between points is dominated by the features with the broadest value ranges. Moreover, gradient descent converges much faster on normalized values [53].

We considered three cases that we applied to ANN, SVM, and k-NN: no scaling, MinMax Scaler, and Standard Scaler (Std) [54]. The set of tables below (Tables 14–18) shows the accuracy or R<sup>2</sup> for all datasets under the various scaling conditions. Most common DT algorithms are invariant to monotonic transformations [55], so we did not consider DT in this analysis.
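For reference, the two scalers can be sketched with scikit-learn on illustrative data: MinMax maps each feature into [0, 1], while the Standard scaler centers each feature to zero mean and unit variance.

```python
# Sketch of the two scaling techniques compared in this subsection;
# the data matrix is made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # per-feature range [0, 1]
X_std = StandardScaler().fit_transform(X)    # per-feature mean 0, variance 1
```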

#### 1. ANN:


**Table 14.** Performance and configuration of ANN with no scaling.

Activation Function (AF), Layer Configuration (LC), Principal Component Analysis (PCA).

| Dataset | MinMax | AF | LC | PCA |
|---|---|---|---|---|
| Heart | 80% | Tanh | [300, 200, 100, 50] | 30% |
| Virus | 99% | ReLU | [100, 100, 100] | None |
| Sonar | 87% | ReLU | [300, 200, 100, 50] | 30% |
| Peugeot\_Target 14 | 99% | ReLU | [100, 100, 100] | mle |
| Peugeot\_Target 15 | 98% | Tanh | [300, 200, 100, 50] | mle |
| EnviroCar | 0.99 | ReLU | [50] | mle |
| AQI | 0.86 | ReLU | [300, 200, 100, 50] | None |

**Table 15.** Performance and configuration of ANN with MinMax scaling.

Activation Function (AF), Layer Configuration (LC), Principal Component Analysis (PCA).

**Table 16.** Performance and configuration of ANN with StandardScaler normalization.


Activation Function (AF), Layer Configuration (LC), Principal Component Analysis (PCA).

#### 2. SVM:

**Table 17.** Performance and configuration of SVM for different scaling techniques.


SVM regularization parameter (C), Principal Component Analysis (PCA).

#### 3. k-NN:


**Table 18.** Performance and configuration of k-NN for different scaling techniques.

Number of neighbors (K), Principal Component Analysis (PCA).

These results clearly show the importance, across all the datasets and algorithms, of scaling the inputs. For instance, MinMax scaling allowed ANNs to reach 99% accuracy in Virus (from a 74% baseline) and in Peugeot 14 (from 95%), and an R2 of 0.86 (from 0.70) in AQI. The application of MinMax allowed SVM to achieve 94% accuracy in Virus (from 71%) and 91% accuracy in Peugeot 14 (from 50%). Standard input scaling improved the k-NN accuracy on Heart from 63% to 83%. For large regression datasets, especially with SVM (see also research question 9), input scaling avoids large training times.

#### *5.3. Principal Component Analysis (PCA)*

Dimensionality reduction allows us to reduce the effects of noise, as well as storage and processing requirements. One well-known method is Principal Component Analysis (PCA), which performs an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components [24]. We tried different values of PCA dimensionality reduction: none, 30% (i.e., the algorithm selects the number of components such that the amount of explained variance is greater than 30%), and automatic maximum likelihood estimation (mle) [56]. The results are shown in Tables 19–26.
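To make the fractional setting concrete, the hypothetical helper below (an illustration, not Desk-LM code) shows how a variance threshold such as 30% maps to a number of retained components, given the eigenvalues of the data covariance matrix:

```python
def components_for_variance(eigenvalues, threshold):
    """Smallest k such that the top-k components explain > threshold of the variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative > threshold:
            return k
    return len(eigenvalues)

# With eigenvalues [5, 3, 1, 1], one component already explains 50% of the
# variance, so a 30% threshold keeps a single component.
print(components_for_variance([5, 3, 1, 1], 0.30))   # → 1
```

This is why 30% is "frequently too low" in the results below: on most of these datasets, one or two components already clear that bar, discarding most of the feature information.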

#### 1. SVM:

**Table 19.** SVM performance and configuration for various PCA values.


SVM regularization parameter (C).

#### 2. k-NN:


**Table 20.** k-NN performance and configuration for various PCA techniques.

Number of neighbors (K).

#### 3. ANN:

**Table 21.** ANN performance and configuration for PCA = None.


Activation Function (AF), Layer Configuration (LC).

**Table 22.** ANN performance and configuration for PCA = 30%.


Activation Function (AF), Layer Configuration (LC).

**Table 23.** ANN performance and configuration for PCA = mle.


Activation Function (AF), Layer Configuration (LC).

#### 4. DT:


**Table 24.** DT performance and configuration for PCA = None.

**Table 25.** DT performance and configuration for PCA = 30%.


The results reported in the above tables are quite varied. The 30% PCA value is frequently too low, except for the Sonar dataset, which has 60 features, many more than the others, and thus looks less sensitive to such a coarse reduction. For SVM, PCA at the 30% setting never performs better than or equal to mle, while the opposite holds for k-NN. Moreover, for ANNs, mle tends to provide the best results, except on AQI. For DT, the outcomes vary: mle performs better on Heart (78% vs. 67% accuracy) and Sonar (76% vs. 65%), while performance decreases on Peugeot 14 (93% vs. 99%) and AQI (0.49 vs. 0.65). For AQI, PCA never improves performance; the opposite is true for Heart and (except with SVM) Sonar.

**Table 26.** DT performance and configuration for PCA = mle.


#### *5.4. ANN Layer Configuration*

To answer this question, we investigated performance across four ANN hidden-layer configurations, as follows:

- LC = [50]
- LC = [500]
- LC = [100, 100, 100]
- LC = [300, 200, 100, 50]

Tables 27–29 report the highest performance for each layer shape.

**Table 27.** Results for layer configuration LC = [50] and LC = [500]. We have omitted columns with all zero Dropout values.


Activation Function (AF), Principal Component Analysis (PCA).


**Table 28.** Layer configuration LC = [100,100,100] results.

Activation Function (AF), Principal Component Analysis (PCA).

**Table 29.** Layer Configuration (LC) = [300,200,100,50] results.


Activation Function (AF), Principal Component Analysis (PCA).

By observing the results, we can see that deepening the network tends to improve the results, but only up to a certain threshold. For the Heart dataset, which has the lowest overall accuracy, we tried additional, deeper shapes beyond those reported in Tables 27–29, but with no better results. On the other hand, widening the first layer provides only slightly better results (and in one case worsens them).

#### *5.5. ANN Activation Function*

Another relevant design choice concerns the activation function in the hidden layers. Activation functions are attached to each neuron in the network and define its output. They introduce a non-linear factor in the processing of a neural network. Two activation functions are typically used in hidden layers: the Rectified Linear Unit (ReLU) and the Hyperbolic Tangent (Tanh). For the output layer, on the other hand, we used a sigmoid activation for binary classification models and a softmax for multiclass classification tasks. For regression problems, we created an output layer without any activation function (i.e., we use the default "linear" activation), as we are interested in predicting numerical values directly, without transformation. Tables 30 and 31 show the highest accuracy achieved for each hidden-layer function, alongside its corresponding configuration.
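In plain Python, the functions mentioned above look as follows (Keras applies the same definitions elementwise; this is an illustrative sketch, not the framework's code):

```python
import math

def relu(x):
    """Rectified Linear Unit: passes positives, zeroes negatives."""
    return x if x > 0 else 0.0

def tanh(x):
    """Hyperbolic tangent: squashes inputs into (-1, 1)."""
    return math.tanh(x)

def sigmoid(x):
    """Output activation for binary classification: squashes into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Output activation for multiclass tasks: a probability distribution."""
    exps = [math.exp(x - max(xs)) for x in xs]  # shift by max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]
```

ReLU is also the cheaper of the two hidden-layer options at inference time (a comparison rather than a transcendental function), which matters on microcontrollers.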



**Table 30.** Best ANN configuration for the ReLU activation function.

Layer Configuration (LC), Principal Component Analysis (PCA).



**Table 31.** Best ANN configuration for the Tanh activation function.

Layer Configuration (LC), Principal Component Analysis (PCA).

The results are similar, with a slight prevalence of ReLU, and a notable difference for Sonar (+7% accuracy) and AQI (+5% R<sup>2</sup>).

#### *5.6. ANN Batch Size*

The batch size is the number of training examples processed in one iteration before the model being trained is updated. To test the effect of this parameter, we considered three values (1, 10, and 20), keeping the number of epochs fixed at 20. Table 32 shows the accuracy on each dataset for the various batch sizes.


**Table 32.** Performance for different batch sizes.

The results show that a batch size of 10 provides optimal results in terms of accuracy; in practice, the difference becomes relevant only for AQI. A batch size of one poses an excessive time overhead (approximately 30% slower than a batch size of 10), while a batch size of 20 achieves a speedup of about 40%.

#### *5.7. ANN Accuracy vs. Epochs*

ANN training goes through several epochs, where an epoch is a learning cycle in which the learner model sees the whole training data set. Figures 3 and 4 show that the training of ANN on all datasets converges quickly within 10 epochs.

#### *5.8. ANN Dropout*

Dropout is a simple method to prevent overfitting in ANNs. It consists of randomly ignoring a certain number of neuron outputs in a layer during the training phase.
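As a library-free sketch (Keras's `Dropout` layer behaves analogously during training), inverted dropout zeroes each activation with probability p and rescales the survivors so the expected activation is unchanged:

```python
import random

def dropout(activations, p, rng=None):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    rng = rng or random.Random(0)  # fixed seed here only for reproducibility
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

At inference time the layer is simply the identity, which is why dropout adds no cost on the microcontroller side.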

The results in Table 33 show that this regularization step provides no improvement in the considered cases, but has a slight negative effect in a couple of datasets (Sonar and AQI).


**Table 33.** Dropout effect on ANN.

#### *5.9. SVM Regularization Training Time*

In SVM, *C* is a key regularization parameter that controls the tradeoff between errors of the SVM on the training data and margin maximization [13,57]. The classification rate is highly dependent on this coefficient, as confirmed by Tables 8 and 9. Desk-LM uses the grid search method to explore the C values provided by the user, which can require long waiting times in some cases. To quantify this, we measured the training latency for a set of typical values (*C* = 0.01, 0.1, 1, 10, and 100), with the results provided in Table 34.

Different values of the *C* parameter have an impact on the training time: the table shows that higher *C* values require longer training times. We must stress that the above results represent the training time for the best models. In particular, when no normalization procedure was applied, the training time with large values of C became huge (up to one hour), especially for the regression datasets.

**Figure 3.** Accuracy vs. Epochs for (**a**) Heart, (**b**) Virus, (**c**) Sonar, (**d**) Peugeot target 14, and (**e**) Peugeot target 15.

**Table 34.** Training time for different values of the C parameter.


**Figure 4.** Mean Squared Error vs. Epochs for (**a**) EnviroCar, and (**b**) air quality index (AQI).

#### *5.10. DT Parameters*

Tuning a decision tree requires testing the effect of various hyperparameters, such as *max\_depth* and *min\_samples\_split*. Figure 5 shows the distribution of the tested parameter values for the best models in the different datasets (see also Table 13 for the best results).

**Figure 5.** Number of occurrences of each DT parameter.

In most cases, the whole tree depth is needed, and this does not exceed the memory available in the microcontrollers. However, *Max\_Leaf\_Nodes* values usually need a low threshold (80). EnviroCar required a high value of 5000, which had to be reduced to 1000 for F3, F4, and L4, and to 200 for F0, because of the limited RAM availability.

#### **6. Conclusions and Future Work**

This paper presented the Edge Learning Machine (ELM), a machine learning platform for edge devices. ELM performs training on desktop computers, exploiting TensorFlow, Keras, and scikit-learn, and makes inferences on microcontrollers. It implements, in platform-independent C language, three supervised machine learning algorithms (Linear SVM, k-NN, and DT), and exploits the STM X-Cube-AI package for implementing ANNs on STM32 Nucleo boards. The training phase on Desk-LM searches for the best configuration across a variety of user-defined parameter values. In order to investigate the performance of these algorithms on the targeted devices, we posed ten research questions (RQ 1–10 in the following) and analyzed a set of six datasets (four classification and two regression). To the best of our knowledge, this is the first paper presenting such an extensive performance analysis of edge machine learning in terms of datasets, algorithms, configurations, and types of devices.

Our analysis shows that, on a set of available IoT data, we managed to train all the targeted devices to achieve, with at least one algorithm, the best score (classification accuracy or regression R2) obtained through a desktop machine (RQ1). ANN performs better than the other algorithms in most of the cases, without differences among the target devices (apart from F0, which is not supported by STM X-Cube-AI). k-NN performs similarly to ANN, and in one case even better, but it requires the whole training set to be kept available in the inference phase, posing a significant memory demand, which penalizes time performance, particularly on low-end devices. The performance of Decision Tree varied widely across datasets. When comparing edge devices, the best time performance is achieved by F7 and H7. Unsurprisingly, given the available hardware, F0 performs worse than all the others.

The preprocessing phase is extremely important. Results across all the datasets and algorithms show the importance of scaling the inputs, which led to improvements of up to 82% in accuracy (SVM on Virus) and 23% in R2 (k-NN on Heart) (RQ2). The application of PCA has varying effects across algorithms and datasets (RQ3).

In terms of the ANN hyperparameters, we observed that increasing the depth of an NN typically improves its performance, up to a saturation level (RQ4). When comparing the neuron activation functions, we observed a slight prevalence of ReLU over Tanh (RQ5). The batch size has little influence on score, but it does have an influence on training time. We established that 10 was the optimal value for all the examined datasets (RQ6). In all datasets, the ANN training quickly converges within 10 epochs (RQ7). The dropout regularization parameter only led to some slight worsening in a couple of datasets (RQ8).

In SVM, the C hyperparameter value selection has an impact on training times, but only when inputs are not scaled (RQ9). In most datasets, the whole tree depth is needed for DT models, and this does not exceed the memory available in the microcontrollers. However, the values of *Max\_Leaf\_Nodes* usually require a low threshold value (80) (RQ10).

As synthesized above, in general, several factors impact performance in different ways across datasets. This highlights the importance of a framework like ELM, which is able to test different algorithms, each one with different configurations. To support the developer community, ELM is released on an open-source basis.

As a possible direction for future work, we consider that the analysis should be extended to include different types of NNs (Convolutional Neural Networks, Recurrent Neural Networks) with more complex datasets (e.g., also including images and audio streams). An extensive analysis should also be performed on unsupervised algorithms that look particularly suited for immediate field deployment, especially in low-accessibility areas. As the complexity of IoT applications is likely to increase, we also expect that distributed ML at the edge will probably be a significant challenge in the coming years.

**Author Contributions:** F.S.: conceptualization, data curation, investigation, software, validation; F.B.: conceptualization, data curation, investigation, methodology, software, validation; R.B.: conceptualization, methodology, software, validation; A.D.G.: conceptualization, methodology, supervision, validation. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Cryptographically Secure Pseudo-Random Number Generator IP-Core Based on SHA2 Algorithm †**

#### **Luca Baldanzi, Luca Crocetti, Francesco Falaschi \*, Matteo Bertolucci, Jacopo Belli, Luca Fanucci and Sergio Saponara**

Department of Information Engineering, University of Pisa, Via G. Caruso n. 16, 56122 Pisa, Italy; luca.baldanzi@ing.unipi.it (L.B.); luca.crocetti@phd.unipi.it (L.C.); francesco.falaschi@phd.unipi.it (F.F.); matteo.bertolucci@phd.unipi.it (M.B.); jacopo.belli23@gmail.com (J.B.); luca.fanucci@unipi.it (L.F.); sergio.saponara@unipi.it (S.S.)


Received: 7 February 2020; Accepted: 17 March 2020; Published: 27 March 2020

**Abstract:** In the context of the growing adoption of advanced sensors and systems for active vehicle safety and driver assistance, the security of the information exchanged between the different sub-systems of the vehicle is an increasingly important issue. Random number generation is crucial in modern encryption and security applications, as it is a critical task from the point of view of the robustness of the security chain. Random numbers are in fact used to generate the encryption keys to be used by ciphers. Consequently, any weakness in the key generation process can potentially leak information that can be used to breach even the strongest cipher. This paper presents the architecture of a high-performance Random Number Generator (RNG) IP-core, in particular a Cryptographically Secure Pseudo-Random Number Generator (CSPRNG) IP-core, a digital hardware accelerator for random number generation which can be employed in cryptographically secure applications. The specifications used to develop the proposed design were derived from the dedicated literature and standards. Subsequently, specific architectural optimizations were studied to achieve better timing performance and very high throughput. The IP-core was validated with the official NIST Statistical Test Suite, in order to evaluate the degree of randomness of the numbers generated in output. Finally, the CSPRNG IP-core was characterized on relevant Field Programmable Gate Array (FPGA) and ASIC standard-cell technologies.

**Keywords:** intelligent sensors; autonomous driving; cyber security; HW accelerator; on-chip random number generator (RNG); SHA2; FPGA; ASIC standard-cell

#### **1. Introduction**

The rapid technology development of intelligent sensors in the automotive field in recent years, driven by institutions and supported by manufacturers to integrate advanced systems for active safety and hazard prevention, has generated additional collateral technological needs. All the equipment distributed on board the vehicle is, in fact, interconnected through real communication networks and exchanges critical data and sensitive information that must be protected from potential attacks and violations. Cybersecurity thus becomes a central issue, and all mechanisms ensuring authentication, confidentiality, and integrity of messages become an enabling technology for the further development of Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) systems.

In particular, projects like the one carried out in the framework of the European Processor Initiative (EPI) programme [1] anticipate requirements from future scenarios, with very high throughput hardware accelerators for cybersecurity applications. Highly automated vehicles will be equipped with advanced sensors (e.g., camera arrays, radar, lidar) capable of generating significant data flows. Moreover, over-the-air (OTA) updates of the software of control units will be implemented, for which it will be very important to keep the update time as short as possible. In this context, it is essential to increase information security, and it is therefore necessary to implement encrypted data transfers (i.e., encryption and decryption) and verified content (i.e., digital signatures) with throughput compatible with buses such as automotive Ethernet (i.e., 1–10 Gbps). For this reason, hardware accelerators with throughput requirements on the order of tens of Gbps are necessary, also taking into consideration the expected lifespan of vehicles.

Random numbers are widely used in encryption and security applications, usually to generate encryption keys or secret data to be shared between communicating entities. Therefore, a Random Number Generator (RNG) is a very important primitive for cryptographically secure applications [2]. In particular, it is used as a fundamental building block of several globally distributed applications related to the field of cybersecurity, such as digital payments, online authentication, and instant messaging [3]. Cryptographic keys are based on random numbers and must be characterized by a high degree of unpredictability to be considered secure: this is necessary to prevent an attacker from violating the security chain based on the cryptographic key. Random numbers are also the starting point for generating nonces within authentication protocols as a countermeasure against replay attacks, so the higher the degree of randomness, the more robust the countermeasure will be. Moreover, strong random number generation helps digital signature procedures prevent private keys from being disclosed, which would violate the signature itself [4]. Several development techniques for RNG engines are reported in the literature, and many of them exploit physical sources (e.g., analog noise) as random processes to obtain the randomness characteristics of the generated bit sequences [3–9].

These circuits are identified as True Random Number Generators (TRNGs), and their output sequences are considered high-quality random numbers. However, TRNGs have non-negligible disadvantages that must be considered: the use of physical sources leads to high energy consumption and insufficient throughput for fast and advanced integrated systems. TRNGs are also sensitive to changing operating conditions, which means that post-processing must be implemented to ensure reliable random output data, further reducing the throughput under non-ideal conditions.

To overcome these limitations, a powerful Deterministic Random Bit Generator (DRBG) circuit can be used in addition to a very low-area, low-power, and low-throughput TRNG implementation. This means that the RNG engine is mainly based on a deterministic algorithm that generates pseudo-random output sequences. In this case, the required degree of randomness is obtained through additional mechanisms that increase the entropy of the generated sequences, which would otherwise be deterministic. This operation is called *reseed*, and it consists of providing a trigger to restart the deterministic algorithm from a new high-entropy starting point (i.e., the new seed). DRBG-based solutions use periodic *reseeds* to allow the RNG to generate pseudo-random binary output sequences that are equivalent to, and indistinguishable from, true random ones. The limitations of TRNGs for high-speed devices are thus overcome by restricting their use to the periodic seed generation operation only, which is a very light task. For this reason, the new seed is usually provided by a very low complexity, target-specific TRNG module [10], which may also be powered down when not needed.

For the proposed architecture, the DRBG mechanism was chosen from those approved by the National Institute of Standards and Technology (NIST) [6]. The standard provides general information for PRNGs based on cryptographic primitives, some of which are well established and proven (e.g., Hash DRBG and HMAC DRBG). For the proposed Cryptographically Secure Pseudo-Random Number Generator (CSPRNG) IP-core, the algorithm was selected as a compromise between performance, area, and security strength. The Hash DRBG with SHA-256 as cryptographic core (i.e., based on the SHA2 algorithm) proved to be the best trade-off between logic complexity and expected throughput during random bit generation, offering 256 bits of security strength.
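As a hedged, software-only sketch of the generation phase (the actual IP-core is hardware; the instantiate/reseed bookkeeping and the additional constant C of NIST SP 800-90A are omitted here), the Hash_DRBG inner loop (hashgen) repeatedly hashes and increments the internal value V:

```python
import hashlib

SEEDLEN = 440  # seedlen in bits for the SHA-256 Hash_DRBG (NIST SP 800-90A)

def hashgen(v: int, n_bits: int) -> bytes:
    """Produce n_bits of pseudo-random output by hashing and incrementing V."""
    data, out = v, b""
    while 8 * len(out) < n_bits:
        # each SHA-256 digest contributes one 256-bit output block
        out += hashlib.sha256(data.to_bytes(SEEDLEN // 8, "big")).digest()
        data = (data + 1) % (1 << SEEDLEN)
    return out[: n_bits // 8]
```

After each generate request, the real mechanism also updates V (and a reseed counter), so that the same state is never reused; the sketch only illustrates why throughput is bounded by the hash-core rate.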

The remainder of this paper is organized as follows: Section 2 presents the trade-off analysis among the different suitable DRBG algorithms; Section 3 details the implementation of the SHA2 algorithm chosen to develop the DRBG core; Section 4 describes the Hash DRBG design architecture as a CSPRNG IP-core; Section 5 collects the characterization results. Finally, conclusions are discussed in Section 6.

#### **2. DRBG Algorithms Trade-Off Analysis**

As mentioned in Section 1, the different deterministic algorithms suitable for implementing DRBG circuits have been evaluated by NIST, and those approved and recommended are collected in the NIST SP 800-90A Rev.1 publication [6]. Such mechanisms present common features and functionalities:

	- **–** *instantiate*, to acquire a random seed (i.e., the concatenation of input entropy content, a possibly input or internal random nonce, and a personalization string) and to initialize the internal state to a random value derived from the seed;
	- **–** *reseed*, to acquire a random seed (i.e., the concatenation of the internal state, input entropy content, and a personalization string) and update the internal state to a random value derived from the seed;
	- **–** *generate*, to generate an output bit sequence based on the current state and then update the state to a random value derived from the previous state;
	- **–** *uninstantiate*, to delete the internal state.

Most DRBG engine implementations rely on hash functions or on the counter mode (CTR) of symmetric-key encryption processes. The hash function family includes the SHA1 and SHA2 algorithms, but the former is being deprecated because of its low security strength and high vulnerability [11]; therefore, only SHA2 cryptographic primitives are examined for Hash DRBG mechanisms. The main parameters related to DRBG cores based on the SHA2 primitive are reported in Table 1.


**Table 1.** Hash DRBG mechanisms parameters (SHA2 only).

CTR (*Counter*) DRBG mechanisms are based on block cipher cores used in *counter mode*. Different block cipher cores are suitable for developing a DRBG circuit, and the main parameters related to the different implementations are collected in Table 2.


**Table 2.** CTR DRBG mechanisms parameters. $B = (2^{ctr\_len} - 4) \cdot blocklen$.

Table 3 summarizes the area and latency values obtained for our version of the SHA2 IP-core, where complexity values refer to characterization on 45 nm ASIC standard-cell technology. With a view to implementing a Hash DRBG circuit, the SHA-224 and SHA-384 solutions are discarded in favor of SHA-256 and SHA-512, because their area and latency values are the same but the former pair offers a shorter output block. Concerning the comparison between SHA-256 and SHA-512, the following considerations can be made in order to select the best candidate for the Hash DRBG:



**Table 3.** SHA2 IP-core specifications.

The throughput values of the SHA-256 core and SHA-512 core, when operating in generation phase for a Hash DRBG implementation, can be obtained through Equations (1) and (2), respectively.

$$T\_{SHA-256} = 256/67 \cdot f\_{clk} \cdot n\_{parallel\\_core} = 3.82 \cdot f\_{clk} \cdot n\_{parallel\\_core} \qquad \text{bit/s} \tag{1}$$

$$T\_{SHA-512} = 512/83 \cdot f\_{clk} \cdot n\_{parallel\\_core} = 6.17 \cdot f\_{clk} \cdot n\_{parallel\\_core} \qquad \text{bit/s} \tag{2}$$
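Equations (1) and (2) (and Equation (3) below) share the same form: bits produced per hash/cipher operation, divided by the cycles per operation, times the clock frequency and the number of parallel cores. A small calculator, with the cycle counts taken as given from the tables above:

```python
def drbg_throughput_bps(out_bits, cycles, f_clk_hz, n_parallel_cores=1):
    """Generation-phase throughput in bit/s: (out_bits / cycles) * f_clk * n_cores."""
    return out_bits / cycles * f_clk_hz * n_parallel_cores

# Bits produced per clock cycle for a single core (f_clk = 1 Hz):
print(round(drbg_throughput_bps(256, 67, 1), 2))   # SHA-256 → 3.82
print(round(drbg_throughput_bps(512, 83, 1), 2))   # SHA-512 → 6.17
```

The per-cycle figures explain the comparison in the text: two SHA-256 cores yield 2 · 3.82 = 7.64 bits per cycle, more than one SHA-512 core at 6.17.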

Concerning the CTR DRBG circuit, the AES IP-core proves to be best in class for both area and throughput. Table 4 collects the area and latency values for our versions of the AES-128 and AES-256 IP-cores, characterized on 45 nm ASIC standard-cell technology.


**Table 4.** AES IP-core specifications.

Given that the target is to identify the most suitable core for the implementation of a DRBG circuit with the highest possible security strength, the AES-256 algorithm is the only block cipher core considered for the trade-off. As shown in Table 4, its area is lower than that of SHA-256, while its throughput is higher than that reported for SHA-512. In particular, the throughput of AES-256 to be considered for a CTR DRBG implementation is calculated as follows:

$$T\_{AES-256} = 128/15 \cdot f\_{clk} \cdot n\_{parallel\\_core} = 8.53 \cdot f\_{clk} \cdot n\_{parallel\\_core} \qquad \text{bit/s} \tag{3}$$

Figure 1 shows a comparison of DRBG mechanisms implemented as in [5] and based on Hash algorithms (i.e., SHA-256 or SHA-512) and block ciphers (i.e., AES-256) used in CTR mode. The logic throughput and complexity values are obtained with synthesis on 45 nm ASIC standard-cell technology.

**Figure 1.** Comparison between NIST approved DRBG mechanisms based on logic complexity in kGE and throughput.

The algorithm chosen for the development of the DRBG circuit inside the CSPRNG IP-core was SHA-256 (i.e., the SHA2 primitive); we therefore decided to use a Hash DRBG despite the better area and latency values collected for the AES cores (i.e., for a CTR DRBG circuit). This is because M. Schmid [12] explained why block cipher-based DRBGs should not be used, as they are not able to reach maximum security strength. The author shows that the pseudo-random permutation inside each AES round, coupled with the counter mode of operation, generates a binary sequence that is distinguishable from what a true random source would give, and is thus unable to satisfy the security requirements. This is not the case for Hash-based DRBGs, so the use of SHA-256 cores offers better robustness to the entire security chain in which the CSPRNG is based on this algorithm. The SHA-256 core ensures a compact implementation of the mechanism and the possibility to extend the design to multiple cores to increase the throughput. In a context with multiple cryptographic cores, two SHA-256 cores perform better than a single SHA-512 core, having a higher throughput and requiring a smaller internal state.

#### **3. SHA-256 Core Implementation**

As explained in Section 2, the fundamental element of the proposed Hash DRBG circuit is the SHA-256 core, based on the SHA2 cryptographic primitive. In order to achieve high throughput for the whole CSPRNG IP-core, it is essential to maximize the performance of the SHA-256 implementation. To do so, the canonical logic implementation derived from the standard [13] has been improved through the use of *Carry-Save Adder* (CSA) units for consecutive additions and through retiming-pipelining to perform delay balancing. To better understand the implemented optimizations, a brief description of the standard is given below.

The SHA-256 standard may be defined by two separate, ideally consecutive, procedures: the *message schedule* and the *compression function*.


The *message schedule* is in charge of expanding the 512-bit *input message* into the schedule of words to be provided to the *compression function*. The operation is performed through the *σ*<sub>0</sub>, *σ*<sub>1</sub>, and modulo 2<sup>32</sup> adder operations defined in Equations (4)–(6), respectively.

$$\sigma\_0(\mathbf{x}) = \operatorname{RotateRight}\_7(\mathbf{x}) \oplus \operatorname{RotateRight}\_{18}(\mathbf{x}) \oplus \operatorname{ShiftRight}\_3(\mathbf{x}) \tag{4}$$

$$
\sigma\_1(\mathbf{x}) = \operatorname{RotateRight}\_{17}(\mathbf{x}) \oplus \operatorname{RotateRight}\_{19}(\mathbf{x}) \oplus \operatorname{ShiftRight}\_{10}(\mathbf{x})\tag{5}
$$

$$\mathbf{x} \boxplus \mathbf{y} = \mathbf{x} + \mathbf{y} \pmod{2^{32}} \tag{6}$$

Usually the message schedule operation is also called *expansion*, due to the fact that the 512-bit input message is expanded to 32 · 64 = 2048 bits. The canonical serial architecture of the *message schedule* block, derived from Equation (7), is depicted in Figure 2.

$$\mathcal{W}\_t = \begin{cases} M\_t & 0 \le t \le 15\\ \sigma\_1(\mathcal{W}\_{t-2}) \boxplus \mathcal{W}\_{t-7} \boxplus \sigma\_0(\mathcal{W}\_{t-15}) \boxplus \mathcal{W}\_{t-16} & 16 \le t \le 63 \end{cases} \tag{7}$$
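A pure-Python reference model of the message schedule (a behavioral sketch for clarity, not the hardware design):

```python
MASK32 = 0xFFFFFFFF

def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def sigma0(x):  # Equation (4)
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def sigma1(x):  # Equation (5)
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def message_schedule(m):
    """Expand sixteen 32-bit message words M_t into the 64-word schedule W_t."""
    w = list(m)
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w
```

The serial hardware in Figure 2 computes exactly one loop iteration per cycle, which is why the adder chain on the right-hand side of Equation (7) sits on the block's critical path.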

**Figure 2.** SHA2 standard *message schedule* architecture.

The optimization for the message schedule is performed on the adder chain through the use of CSA, which are essentially *Full-Adder Arrays* (FAA), producing partial sums (*ps*) and shift-carries (*sc*).

$$\begin{aligned} ps\_i &= a\_i \oplus b\_i \oplus c\_i\\ sc\_i &= (a\_i \land b\_i) \lor (a\_i \land c\_i) \lor (b\_i \land c\_i) \end{aligned} \tag{8}$$

CSAs have advantages in both area and critical path. Implementations on 45 nm and 7 nm ASIC standard-cell technologies demonstrated that, when compared to *Carry-Lookahead Adder* (CLA) units, the delay relationships are $T_{CLA-32} = 1.78 \cdot T_{CSA-32}$ and $T_{CLA-32} = 1.87 \cdot T_{CSA-32}$, respectively, for the two technologies. The optimized serial implementation is shown in Figure 3, where the high-level timing block analysis shows that the critical path is reduced from $T_{\sigma_0} + 3 \cdot T_{CLA}$ to $T_{\sigma_0} + 2 \cdot T_{CSA} + T_{CLA}$. Further optimizations are possible through the use of retiming, but they are not considered, since the critical path is mainly located in the *compression function* architecture.
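The carry-save identity in Equation (8) can be checked with a few lines of Python (a numerical model only; in hardware each bit position is an independent full adder):

```python
def csa(a, b, c):
    """Carry-save addition: reduce three addends to a partial-sum/shift-carry pair."""
    ps = a ^ b ^ c                      # partial sums: bitwise XOR of the three inputs
    sc = (a & b) | (a & c) | (b & c)    # shift-carries: bitwise majority
    return ps, sc

# The pair represents the exact sum: a + b + c == ps + 2*sc. A chain of CSAs thus
# defers carry propagation, and only the final stage needs a carry-propagate adder.
ps, sc = csa(0x12345678, 0x9ABCDEF0, 0x0F0F0F0F)
assert ps + 2 * sc == 0x12345678 + 0x9ABCDEF0 + 0x0F0F0F0F
```

Because XOR and majority are single-gate-depth per bit, a CSA stage is much faster than a full 32-bit carry-propagate addition, which is the source of the delay ratios quoted above.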

The SHA-256 *compression function* is composed of three consecutive steps: initialization, one-way compression, and termination. In the first step, the variables A–H are initialized with the intermediate hash value $H^{(t-1)}$ (the first 512-bit message block, at t = 1, uses a constant $H^{(0)}$ provided by the standard). The one-way compression then performs 64 loops according to:

*Sensors* **2020**, *20*, 1869

$$\begin{aligned}
T\_1 &\leftarrow H \boxplus \Sigma\_1(E) \boxplus Ch(E, F, G) \boxplus K\_t \boxplus W\_t \\
T\_2 &\leftarrow \Sigma\_0(A) \boxplus Maj(A, B, C) \\
H &\leftarrow G \\
G &\leftarrow F \\
F &\leftarrow E \\
E &\leftarrow D \boxplus T\_1 \\
D &\leftarrow C \\
C &\leftarrow B \\
B &\leftarrow A \\
A &\leftarrow T\_1 \boxplus T\_2
\end{aligned}$$

**Figure 3.** SHA2 optimized *message schedule* architecture.

Finally, the intermediate hash value at time *t* is calculated by a modulo 2<sup>32</sup> addition between the variables A–H at initialization time and the variables A–H after the one-way compression. The functions *Maj*, *Ch*, Σ<sub>0</sub> and Σ<sub>1</sub> are defined as:

$$\begin{aligned} \text{Maj}(\mathbf{x}, y, \mathbf{z}) &= (\mathbf{x} \wedge y) \oplus (\mathbf{x} \wedge \mathbf{z}) \oplus (y \wedge \mathbf{z}) \\ \text{Ch}(\mathbf{x}, y, \mathbf{z}) &= (\mathbf{x} \wedge y) \oplus (\neg \mathbf{x} \wedge \mathbf{z}) \\ \Sigma\_0(\mathbf{x}) &= \text{RotateRight}\_2(\mathbf{x}) \oplus \text{RotateRight}\_{13}(\mathbf{x}) \oplus \text{RotateRight}\_{22}(\mathbf{x}) \\ \Sigma\_1(\mathbf{x}) &= \text{RotateRight}\_6(\mathbf{x}) \oplus \text{RotateRight}\_{11}(\mathbf{x}) \oplus \text{RotateRight}\_{25}(\mathbf{x}) \end{aligned} \tag{9}$$
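Putting the round equations and the functions above together, a behavioral Python model of the whole single-block hash can be checked against Python's `hashlib` (again a software reference, not the RTL; padding is simplified to messages shorter than 56 bytes):

```python
import hashlib
import struct

MASK32 = 0xFFFFFFFF

# Round constants K_t and initial hash value H(0) from the SHA-256 standard.
K = [
    0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5, 0x3956C25B, 0x59F111F1, 0x923F82A4, 0xAB1C5ED5,
    0xD807AA98, 0x12835B01, 0x243185BE, 0x550C7DC3, 0x72BE5D74, 0x80DEB1FE, 0x9BDC06A7, 0xC19BF174,
    0xE49B69C1, 0xEFBE4786, 0x0FC19DC6, 0x240CA1CC, 0x2DE92C6F, 0x4A7484AA, 0x5CB0A9DC, 0x76F988DA,
    0x983E5152, 0xA831C66D, 0xB00327C8, 0xBF597FC7, 0xC6E00BF3, 0xD5A79147, 0x06CA6351, 0x14292967,
    0x27B70A85, 0x2E1B2138, 0x4D2C6DFC, 0x53380D13, 0x650A7354, 0x766A0ABB, 0x81C2C92E, 0x92722C85,
    0xA2BFE8A1, 0xA81A664B, 0xC24B8B70, 0xC76C51A3, 0xD192E819, 0xD6990624, 0xF40E3585, 0x106AA070,
    0x19A4C116, 0x1E376C08, 0x2748774C, 0x34B0BCB5, 0x391C0CB3, 0x4ED8AA4A, 0x5B9CCA4F, 0x682E6FF3,
    0x748F82EE, 0x78A5636F, 0x84C87814, 0x8CC70208, 0x90BEFFFA, 0xA4506CEB, 0xBEF9A3F7, 0xC67178F2,
]
H0 = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A, 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32

def big_sigma0(x): return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)
def big_sigma1(x): return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)
def small_sigma0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def small_sigma1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
def ch(x, y, z): return (x & y) ^ (~x & z)
def maj(x, y, z): return (x & y) ^ (x & z) ^ (y & z)

def sha256_single_block(msg: bytes) -> bytes:
    """SHA-256 of a short message (< 56 bytes, i.e., one padded 512-bit block)."""
    assert len(msg) < 56, "simplified padding: single-block messages only"
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    w = list(struct.unpack(">16L", block))
    for t in range(16, 64):                       # message schedule (expansion)
        w.append((small_sigma1(w[t-2]) + w[t-7] + small_sigma0(w[t-15]) + w[t-16]) & MASK32)
    a, b, c, d, e, f, g, h = H0                   # initialization with H(0)
    for t in range(64):                           # one-way compression rounds
        t1 = (h + big_sigma1(e) + ch(e, f, g) + K[t] + w[t]) & MASK32
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK32
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK32, c, b, a, (t1 + t2) & MASK32
    # termination: modulo-2^32 addition with the initial variables
    return b"".join(struct.pack(">L", (v0 + v) & MASK32) for v0, v in zip(H0, [a, b, c, d, e, f, g, h]))

assert sha256_single_block(b"abc") == hashlib.sha256(b"abc").digest()
```

The two long adder chains in the round body (the one computing T1 in particular) are exactly the paths the CSA/retiming optimizations described next are designed to shorten.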

The canonical scheme corresponding to the described procedure is shown in Figure 4; the output stage performing the termination phase is omitted.

**Figure 4.** SHA2 standard *compression function* architecture.

The high-level block timing analysis showed that the critical path of the non-optimized architecture is located between register H and register E, involving 5 operations. Optimization of the *compression function* was achieved through the use of CSAs, retiming and delay balancing. In particular, all the adder chains were converted to CSAs, with the exception of the register B and E inputs. Moreover, the *Kt*-*Wt* path was duplicated to allow the value of register D to be added immediately to *H* + *Kt* + *Wt*. Finally, a pipeline stage *L*1, *L*2 was added (with the associated C-D multiplexer to ensure functionality) and the A register was split to move the CLA position.
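The carry-save idea behind this optimization can be sketched as follows (a generic illustration of the CSA technique, not the RTL of this design): a CSA compresses three addends into a sum word and a carry word with no carry propagation, and only the final carry-lookahead adder (CLA) performs a full-width addition.

```python
# Generic 32-bit carry-save adder (CSA) step: three addends are compressed
# into a sum vector and a carry vector without any carry propagation.
# Illustrative sketch only -- not the paper's netlist.
MASK32 = 0xFFFFFFFF

def csa(x, y, z):
    s = x ^ y ^ z                     # bitwise sum (carries deferred)
    c = (x & y) | (x & z) | (y & z)   # carry bits...
    return s, (c << 1) & MASK32       # ...shifted into the next bit position

def csa_add3(x, y, z):
    """The deferred carries resolve in a single final adder (the CLA role)."""
    s, c = csa(x, y, z)
    return (s + c) & MASK32
```

Chaining CSAs keeps every intermediate stage at a constant (two-gate) delay regardless of word width, which is why converting the adder chains to CSAs shortens the critical path.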

Finally, both SHA-2 implementations were synthesized on 45 nm and 7 nm ASIC technologies; the results for the canonical and optimized architectures are reported in Table 5.

**Table 5.** Canonical and Optimized SHA-256 implementation results.


A detailed analysis of the critical path on the implemented design shows that the real critical path on the 45 nm ASIC technology is located between register L1 and register E (Figure 5), while for the 7 nm ASIC technology it is located in the message schedule, between the second right register and the left one (Figure 3). This behavior can be attributed to the synthesizer, which is able to use complex ASIC cells and merges the 4·CSAs better than the 2·CSA+1·CLA. A separate synthesis, emulating the main paths of the canonical and optimized architectures on the 45 nm technology, shows the critical path of these extracted sub-elements. As visible from Table 6, the 2·CSA+1·CLA sub-block is the slowest path with respect to the optimized implementation, with an equivalent frequency *f*<sub>2·*CSA*+1·*CLA*</sub> = 1200 MHz.

**Figure 5.** SHA2 optimized *compression function* architecture.

**Table 6.** SHA-256 sub-block implementation results.


Comparing the canonical ASIC implementation with the optimized one, the latter turns out to be about 9% smaller, while providing about a 56% increase in maximum frequency.

#### **4. CSPRNG Design Architecture**

The architecture of the proposed CSPRNG IP-core, namely the Hash DRBG with the optimized SHA-256 core, is shown in Figure 6. The proposed design is based on the following building blocks:


```
// SHA-256 input block: 440-bit value || '1' bit || 7 zero bits || 64-bit length field (440)
sha_256_in = adder_out || 1'd1 || 7'd0 || 64'd440
// serial adder operands: the 440-bit value and the constant 1
adder_x = adder_out
adder_y = 440'd1
```
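The concatenation in the listing above (where `||` denotes bit-string concatenation) can be sanity-checked in Python, assuming, as in the Hash DRBG, a 440-bit internal value: 440 + 1 + 7 + 64 bits give exactly one 512-bit SHA-256 block.

```python
# Sketch: assemble the 512-bit SHA-256 input block of the listing above.
# v is the 440-bit value (e.g., the Hash DRBG internal state V).
def build_sha256_input(v):
    assert v < (1 << 440), "v must fit in 440 bits"
    block = v
    block = (block << 1) | 1     # append the single '1' padding bit
    block = (block << 7)         # 7 zero bits
    block = (block << 64) | 440  # 64-bit message-length field (440 bits)
    return block                 # 440 + 1 + 7 + 64 = 512 bits total
```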

**Figure 6.** CSPRNG (Hash DRBG) design architecture developed.

The *instance* procedure acquires 512 entropy bits from the entropy content input and then hashes eight blocks to create the internal state. With *τ* equal to the number of clock cycles necessary to acquire 8 bits from the entropy content input, the total execution time is approximately:

$$t\_{instance} = (64 \cdot \tau + 8 \cdot 67) \cdot T\_{clk} = (64 \cdot \tau + 536) \cdot T\_{clk} \tag{10}$$

The *reseed* procedure acquires 384 entropy bits from the entropy content input and then hashes 8 blocks to update the internal state. The execution time is approximately:

$$t\_{reseed} = (48 \cdot \tau + 8 \cdot 67) \cdot T\_{clk} = (48 \cdot \tau + 536) \cdot T\_{clk} \tag{11}$$

In the *generation* phase, if a personalization string is inserted by the user, a new value of *V* is immediately calculated before generating the output bits. This operation requires a sum and two hash cycles. Since the serial adder latency is 14 clock cycles:

$$t\_{\text{gen\\_pers\\_string}} = (14 + 2 \cdot 67) \cdot T\_{clk} \tag{12}$$

After generation a new state is derived within the same time:

$$t\_{\text{gen\\_new\\_state}} = t\_{\text{gen\\_pers\\_string}} = (14 + 2 \cdot 67) \cdot T\_{clk} \tag{13}$$

For a clock frequency of 100 MHz and *τ* = 1, these values are:

$$\begin{aligned} t\_{\text{instance}} &= 6.000 \text{ \upmu s} \\ t\_{\text{reseed}} &= 5.840 \text{ \upmu s} \\ t\_{\text{gen\\_pers\\_string}} &= 1.480 \text{ \upmu s} \\ t\_{\text{gen\\_new\\_state}} &= 1.480 \text{ \upmu s} \end{aligned}$$

For the same clock frequency and *τ* = 8:

$$\begin{aligned} t\_{\text{instance}} &= 10.480 \text{ \upmu s} \\ t\_{\text{reseed}} &= 9.200 \text{ \upmu s} \\ t\_{\text{gen\\_pers\\_string}} &= 1.480 \text{ \upmu s} \\ t\_{\text{gen\\_new\\_state}} &= 1.480 \text{ \upmu s} \end{aligned}$$
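The timing figures above follow directly from Equations (10)-(13); a small sketch evaluating them at *f<sub>clk</sub>* = 100 MHz reproduces the reported values:

```python
# Equations (10)-(13) evaluated for f_clk = 100 MHz (T_clk = 10 ns = 0.01 us).
T_CLK_US = 0.01  # clock period in microseconds

def t_instance(tau):
    """Equation (10): acquire 512 entropy bits, then hash 8 blocks (67 cycles each)."""
    return (64 * tau + 8 * 67) * T_CLK_US

def t_reseed(tau):
    """Equation (11): acquire 384 entropy bits, then hash 8 blocks."""
    return (48 * tau + 8 * 67) * T_CLK_US

def t_generation():
    """Equations (12)-(13): one serial addition (14 cycles) plus two hash cycles."""
    return (14 + 2 * 67) * T_CLK_US
```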

#### **5. Results**

In order to validate the CSPRNG IP-core, the randomness degree of the generated sequences was evaluated by using the NIST Statistical Test Suite [14]. The test suite ran on a 128 MB sequence acquired from the CSPRNG with the following strategy:


In this way, the total number of acquired bits is *nbit* = 1024 · 1000 · 1000 = 1,024,000,000. This sequence was then converted to a binary file, which was subsequently given as input to the test suite. The NIST Statistical Test Suite parameters and the corresponding results (using *α* = 0.01) on the 128 MB data block are collected in Table 7.
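The bit count above is consistent with the stated 128 MB sequence size (decimal megabytes), as a quick check shows:

```python
# Sanity check on the size of the acquired test sequence.
n_bit = 1024 * 1000 * 1000       # total acquired bits
n_mb = n_bit // 8 // 1_000_000   # bits -> bytes -> decimal megabytes
```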

Three technologies were identified as potential targets for the characterization of the CSPRNG hardware accelerator IP-core, one FPGA and two ASIC standard-cell: Intel Stratix IV FPGA, 45 nm Silvaco [15], and 7 nm Artisan TSMC [16]. In all of these cases, different implementation effort corners were tested, in order to evaluate the trade-off between throughput and area. The synthesis performed on the Intel Stratix IV (EP4SGX230KF40C2) FPGA technology with high performance constraints, configuring a single instance of the SHA-256 core, provides a maximum operating frequency of 180 MHz, a throughput of 690 Mbps, and an overall resource utilization of 4713 ALMs. The 45 nm Silvaco ASIC standard-cell implementation increases the throughput to 3.82 Gbps, since the maximum frequency is 1 GHz (the critical path being in the SHA-2 sub-block), with a logical complexity of 49.19 kGE. Finally, the proposed design, with a single SHA-256 core, brought onto the 7 nm Artisan ASIC standard-cell, reaches a throughput of 19.67 Gbps, given a maximum clock frequency of 5.15 GHz, and requires an overall complexity of 46.56 kGE.
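The reported throughputs are consistent with a single core producing 256 output bits per 67-cycle hash; the back-of-the-envelope relation below is our reading of the numbers, not a formula stated explicitly in the text:

```python
# Back-of-the-envelope check (our assumption, not stated explicitly by the
# authors): one SHA-256 core yields 256 output bits every 67 clock cycles,
# so throughput ~= f_clk * 256 / 67.
BITS_PER_CYCLE = 256 / 67  # about 3.82 bits per clock cycle

def throughput_gbps(f_clk_ghz):
    return f_clk_ghz * BITS_PER_CYCLE
```

With this relation, 1 GHz gives about 3.82 Gbps and 5.15 GHz about 19.67 Gbps, matching the ASIC figures; 180 MHz gives roughly 688 Mbps, close to the 690 Mbps FPGA figure.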


**Table 7.** NIST Statistical Test Suite parameters and results.

The diagrams in Figures 7 and 8 show the occupation percentage of the different parts of the proposed architecture for the 45 nm and 7 nm ASIC target technologies, respectively.

**Figure 7.** CSPRNG (Hash DRBG) IP-core occupation diagram on 45 nm ASIC technology (kGE based).

**Figure 8.** CSPRNG (Hash DRBG) IP-core occupation diagram on 7 nm ASIC technology (kGE based).

#### **6. Conclusions**

This paper presented the architecture and implementation of a high-performance digital Cryptographically Secure Pseudo-Random Number Generator (CSPRNG). Specifically, a Hash-based Deterministic Random Bit Generator (DRBG) circuit was presented, following the recommendations given by NIST in [6] and using the SHA-256 cryptographic primitive. A CSPRNG is a key component to implement efficient cybersecurity applications for authentication, confidentiality and message integrity. In addition, the security of critical information exchanged between the different subsystems in modern vehicles is proving to be a key issue in the automotive sector: more advanced and complex devices and sensors provide the platform for active assistance and security on which passengers rely, so it is crucial to ensure that these systems are adequately protected from cyber attacks. The Hash algorithm selection was done according to a trade-off analysis on throughput, area and security strength: among the solutions able to satisfy the security requirements, the SHA-256 core proved to be the most efficient solution in terms of throughput-complexity ratio. A detailed description of the optimized SHA-256 core architecture developed for the DRBG circuit implementation is also given. The proposed CSPRNG IP-core was tested by means of the NIST Statistical Test Suite, showing that the generated bit sequences cannot be distinguished from a true random sequence of numbers, and therefore validating its use for cryptographic applications. It has also been implemented on FPGA and ASIC standard-cell technologies for characterization: on the Intel Stratix IV FPGA, a throughput of 690 Mbps at 180 MHz is reported with a maximum occupation of 4713 ALMs; on the 45 nm ASIC standard-cell [15], the throughput is equal to 3.82 Gbps at 1 GHz with a logic complexity of 49.19 kGE; and finally, on the 7 nm ASIC standard-cell [16], the throughput reaches 19.67 Gbps at 5.15 GHz with a logic complexity of 46.56 kGE.

**Author Contributions:** Conceptualization, L.F. and S.S.; methodology, L.B., L.C., F.F., L.F. and S.S.; software, L.B. and L.C.; validation, M.B. and J.B.; formal analysis, L.B., L.C., F.F., M.B. and J.B.; investigation, M.B. and J.B.; resources, L.F. and S.S.; data curation, L.B., L.C. and F.F.; writing—original draft preparation, L.B., L.C., F.F. and M.B.; writing—review and editing, F.F., L.F. and S.S.; visualization, F.F., L.F. and S.S.; supervision, L.F. and S.S.; project administration, L.F. and S.S.; funding acquisition, L.F. and S.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by the European Union's Horizon 2020 research and innovation programme "European Processor Initiative" under grant agreement No. 826647.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Data Processing and Information Classification— An In-Memory Approach**

#### **Milena Andrighetti 1, Giovanna Turvani 1,\*, Giulia Santoro 1, Marco Vacca 1, Andrea Marchesin 1, Fabrizio Ottati 1, Massimo Ruo Roch 1, Mariagrazia Graziano <sup>2</sup> and Maurizio Zamboni <sup>1</sup>**


Received: 31 January 2020; Accepted: 13 March 2020; Published: 18 March 2020

**Abstract:** To live in the information society means to be surrounded by billions of electronic devices full of sensors that constantly acquire data. This enormous amount of data must be processed and classified. A solution commonly adopted is to send these data to server farms to be remotely elaborated. The drawback is a huge battery drain due to the high amount of information that must be exchanged. To compensate for this problem, data must be processed locally, near the sensor itself. However, this solution requires huge computational capabilities. While microprocessors, even mobile ones, nowadays have enough computational power, their performance is severely limited by the Memory Wall problem. Memories are too slow, so microprocessors cannot fetch enough data from them, greatly limiting their performance. A solution is the Processing-In-Memory (PIM) approach. New memories are designed that can elaborate data inside them, eliminating the Memory Wall problem. In this work we present an example of such a system, using the Bitmap Indexing algorithm as a case study. This algorithm is used to classify data coming from many sources in parallel. We propose a hardware accelerator designed around the Processing-In-Memory approach that is capable of implementing this algorithm and that can also be reconfigured to perform other tasks or to work as a standard memory. The architecture has been synthesized using CMOS technology. The results we obtained highlight that not only is it possible to process and classify huge amounts of data locally, but also that this result can be obtained with very low power consumption.

**Keywords:** bitmap indexing; processing in memory; memory wall; big data; internet of things

#### **1. Introduction**

Nowadays, many applications used every day, defined as data-intensive, require a lot of data to process. Examples are database manipulation and image processing. This requirement is the effect of the fast improvement of CMOS technology, which has led to the creation of very powerful and flexible portable devices. These devices are full of sensors that continuously acquire data. Data can be elaborated remotely by powerful servers, but sending a lot of information through electromagnetic waves requires a huge amount of energy, severely impacting the battery life of mobile devices. The only solution is to elaborate data locally, on the mobile device itself.

Thanks to the scaling of transistor size, mobile microprocessors are now theoretically capable of such computation. Unfortunately, memory scaling has followed a different path, still resulting in slow accesses compared to processor computing speed. This discrepancy in performance harms the computing abilities of the CPU, since the memory cannot provide data as quickly as required by the CPU. This problem is called the *Von Neumann bottleneck* or *Memory Wall*. The idea that took form to solve this problem is to null the distance between processor and memory, removing the cost of data transfer, and to create a unit capable both of storing information and of performing operations on it. This idea takes the name of Processing-in-Memory.

Many works in the literature have approached the "in-memory" idea. Some narrow the physical distance between memory and computation unit by creating and stacking different layers together; but even if the two units are moved very close to each other, they are still distinct components. Others exploit the intrinsic functionality of the memory array, or slightly modify the peripheral circuitry, to perform computation.

Among the many examples provided by the literature, one of the best-fitting representatives of the PIM concept is presented in Reference [1]. In this work the proposed architecture is a memory array in which the cell itself is capable of performing logical operations aimed at solving Convolutional Neural Networks (CNN). In this paper, our main goal is to introduce a proper example of Processing-in-Memory, choosing Bitmap Indexing as the application around which the architecture is shaped. In the design, no specific memory technology was used, because the idea is to provide a worst-case estimation and to leave space for future exploration implementing the cell with a custom memory-cell model. The Bitmap Indexing algorithm was chosen because it is used for data classification, one of the most important tasks that must be performed by such mobile devices. Being able to classify data allows one to understand which data must be sent to remote servers and which need not, greatly reducing the overall power consumption.

The presented architecture is a memory array in which each cell is capable both of storing information and of performing simple logical operations on it. A characteristic of our architecture is its modularity: the architecture is divided into independent memory banks. A memory bank can work on its own or interact with other banks, and it is possible to build the array with as many banks as needed. This feature leads to great flexibility and a high degree of parallelism. The structure was eventually synthesized for analysis purposes, in an 8.5 KB square array, using 45 nm and 28 nm CMOS technologies. The storage segment of the proposed PIM cell was synthesized as a latch. The evaluation showed great results, achieving a maximum throughput of 2.45 Gop/s and 9.2 Gop/s, respectively, for the two technologies used. This paper is the extended version of our prior work [2]. In the conference paper the general idea was introduced; here we greatly expand the architecture, moving from the idea to the real implementation. The novelty of this work, in comparison with other works presented in the literature, consists in an enhanced architecture characterized by a high level of granularity and flexibility.

#### **2. Background**

The Processing-in-Memory paradigm was born to solve the *Von Neumann bottleneck*, which is characterized by the gap in performance between memory and processor. Processing-in-Memory thus tries to reduce the disparity by merging together storage and processing units. Processing-in-Memory (PIM) can be approached in different ways, depending on the architecture or the technologies used. Many examples can be found in the literature, some of which are described in the following, grouped in categories.

#### *2.1. Magnet-Based*

*Magnetic Random Access Memory* (MRAM) is a non-volatile memory that uses Magnetic Tunnel Junctions (MTJs) as its basic storage element. Thanks to their dual storage-logic properties, MTJs are suitable to implement hybrid logic circuits with CMOS technology, suited to implement the PIM principle. Reference [3] presents an MTJ-CMOS Full Adder which, compared to a standard CMOS-only solution, showed better results. In Reference [4] the authors proposed an MTJ-based TCAM, in which the logic part and the storage element are merged together, and an MTJ-based Non-Volatile FPGA exploiting MTJs and combinatorial blocks. Both structures resulted in a more compact solution with respect to conventional ones.

Reference [5] proposes a different way to implement Nano Magnetic Logic (NML) exploiting the MRAM structure. Since the basic concept of the NML technology is the transmission of information through magnetodynamic interaction between neighbouring magnets, the MRAM structure has been modified so that MTJs can interact with each other. Another example is represented by PISOTM [6], a reconfigurable architecture based on SOT-RAM, whose main advantage is that the storage and logic elements are identical, so that technology conflicts are avoided.

#### *2.2. 3D-Stacking*

According to the 3D-Stacking approach, multiple layers of DRAM memory are stacked together with a logic layer that can be application-specific [7,8] or general-purpose [9]. In Reference [7], the XNOR-POP architecture was designed to accelerate CNNs for mobile devices. It is composed of Wide-IO2 DRAM memory with the logic layer modified according to the XNOR-Net requirements. Reference [8] proposes an architecture for data-intensive applications, where a PIM layer made of memory and application-specific logic is sandwiched between DRAM dies connected together using TSVs. An example of general-purpose 3D-stacking is 3D-MAPS, in Reference [9]: a multi-core structure is used, and every core is composed of a memory layer and a computing layer.

#### *2.3. ReRAM-Based*

*Resistive RAM* (ReRAM) is a non-volatile memory that uses a metal-insulator-metal element as its storage component. The information is represented by the resistance of the device, which can be either high (HRS) or low (LRS); to switch between states, the appropriate voltage has to be applied to the cell. The common structure of a ReRAM array is a crossbar, a structure used in matrix-vector multiplication, commonly found in neural network applications. PRIME [10], an architecture aimed at accelerating Artificial Neural Networks, is an example of this kind of implementation. PRIME is compliant with the in-memory principle, since the computation is performed directly in the memory array with few modifications to the peripheral circuitry. Memory banks are divided into three sub-arrays, each with a specific role in the architecture. Reference [11] proposes a 3D-ReCAM-based architecture to accelerate the BLAST algorithm for DNA sequence alignment. The architecture, named RADAR, aims to move the operations into memory, so that there is no need to transfer the DNA database. Reference [12] presents a non-volatile intelligent processor built on a 150 nm CMOS process with HfO RRAM. The structure is capable of both general computing and the acceleration of neural networks; in fact, it is provided with an FCNN Turbo Unit, enhanced with low-power MVM engines to perform FCNN tasks.

Another application limited by the Memory Wall problem is Graph Processing. Reference [13] proposes a ReRAM-based in-memory architecture as a possible solution. The structure is composed of multiple ReRAM banks, divided into 2 types: graph banks, used to map the graph and to store its adjacency list, and a master bank, which stores the metadata of the graph banks. This allows the graphs stored inside the memory to be processed. Reference [14] presents PLiM, a programmable system composed of a PIM controller and a multi-bank ReRAM which can work both as a standard memory and as a computational unit, according to the controller signals. PLiM implements only serial operations to keep the controller as simple as possible. In Reference [15] the authors presented ReVAMP, an architecture composed of two ReRAM crossbars, supporting parallel computations and VLIW-like instructions. To perform logic operations, ReVAMP exploits the native properties of ReRAM cells, which implement a majority-voting logic function.

#### *2.4. PIM*

In Reference [16] the authors presented TOP-PIM, a system composed of a host processor surrounded by several units characterized by 3D-stacked memories with an in-memory processor embedded on the logic die. Reference [17] proposes DIVA, a system in which multiple PIM chips serve as smart-memory co-processors to a standard microprocessor, aimed at improving bandwidth performance for data-intensive applications by executing computation directly in memory and enabling a dedicated communication line between the PIM chips. Reference [18] presents Terasys, a massively parallel PIM array. The goal of Terasys was to embed a SIMD PIM array very close to a host processor so that it could be seen both as a processor array and as conventional memory. As a solution for the performance bottleneck of large-scale graph processing, in Reference [19] the authors proposed Tesseract, a PIM architecture used as an accelerator for a host processor. Each element of Tesseract has a single-issue in-order core to execute operations; moreover, the host processor has access to the entire Tesseract memory, whilst each core of Tesseract can interact only with its own. Tesseract does not depend on a particular memory organization, but it was analyzed using the Hybrid Memory Cube (HMC) as a baseline. Such a structure proved to perform better than traditional approaches, thanks to the fact that Tesseract was able to use more of the available bandwidth. Reference [20] presents Prometheus, a PIM-based framework which proposes distributing data across different vaults in HMC-based systems, with the purpose of reducing energy consumption, improving performance and exploiting the high intra-vault memory bandwidth.

Reference [21] proposes a solution to accelerate Bulk Bitwise Operations: PINATUBO, an architecture based on resistive cell memories, such as ReRAMs. The structure is composed of multiple banks, which are subdivided into mats. PINATUBO is able to eliminate the movement of data, since computation is performed directly inside memory, executing operations between banks, mats and subarrays; this way, PINATUBO interacts with the CPU only for row addresses and control commands. Another example of a PIM architecture to accelerate bulk bitwise operations was conceived by the authors of Reference [22], who presented Ambit, an in-memory accelerator which exploits DRAM technology to achieve total usage of the available bandwidth. The DRAM array is slightly modified to perform AND, OR and NOT operations. Moreover, the CPU can access Ambit directly, so that it is not necessary to transfer data between the CPU memory and the accelerator. Reference [23] proposes APIM, an Approximate Processing-in-Memory architecture which aims to achieve better performance at the cost of a decrease in accuracy. It is based on emerging non-volatile memories, such as ReRAM, and it is composed of a crossbar structure grouped in blocks. All the blocks are structurally identical but divided into data and processing blocks, linked together through configurable interconnections. Furthermore, APIM is able to configure the computation precision dynamically, so that it is possible to tune the accuracy at runtime.

Reference [24] presents ApproxPIM, an HMC-based system in which each vault is independent from the others and communication with the host processor is based on a parcel transmission protocol. This results in energy and speedup improvements with respect to the used baselines. In Reference [25] the authors presented MISK, a proposal to reduce the gap between memory and processor. Since data movement implies a great energy cost, MISK is intended to reduce it by implementing a monolithic structure, avoiding physical separation between memory and CPU. In fact, MISK is meant to be integrated into the cache; it is not conceived to work on its own, but embedded in the CPU. This way it is possible to achieve great results in terms of energy-per-cycle and execution time. Reference [26] introduces Gilgamesh, a system based on distributed and shared memory. It is characterized by a multitude of chips, called MIND chips, which are connected together through a global interconnection network. Each chip is a general-purpose unit equipped with multiple DRAM banks and processing logic. In Reference [27] the Smart Memory Cube is presented, a PIM processor built near the memory, in particular an HMC, which is connected to a host processor. The HMC vault controllers are modified to perform atomic operations. The PIM processor interacts with the host processor so that smaller tasks are executed directly alongside the memory.

In References [28,29], the authors presented in-memory architectures onto which the Advanced Encryption Standard (AES) algorithm was mapped, showing great results in speed and energy saving compared to other solutions. In Reference [1], the authors presented an architecture based on the in-memory paradigm aimed at Convolutional Neural Networks (CNN). The structure is a memory array in which each cell is provided with both storage and computation properties, with the support of an additional weight memory designed to support the CNN data flow and computation inside the array. This structure showed great results compared with a conventional CNN accelerator in terms of memory accesses and clock cycles.

#### **3. The Algorithm**

The Processing-in-Memory principle requires that the storage and logic components be merged together. In order to implement an architecture compliant with such a requirement, it was first necessary to shape it according to a suitable application. For this purpose, Bitmap indexing was selected. Bitmap indexes are often used in database management systems.

Taking as an example the simple database in Figure 1A, each column of the database represents a particular characteristic of the profile of the entry described in one row. Suppose a search on the database is to be performed to create a statistic on how many men possess a sport car or a motorbike. Such a query would imply looking for all the men and then excluding the ones that do not own the specified vehicles. If the database is big, this operation would require a long response time. Bitmap indexing was introduced to solve this issue: it transforms each column of a table into as many indexes as the number of distinct key-values that particular column can have.

A bitmap index is a bit array in which the *i*-th bit is set to 1 if the value in the *i*-th row of the column is equal to the value represented by the index, and to 0 otherwise (Figure 1A). Thus, bitmap indexing allows search queries to be fragmented into simple logic bitwise operations (Figure 1B). This way it is not necessary to analyze the whole database discarding unwanted data, but only to operate on selected indexes. Bitmap indexing can provide great results in response time and in storage requirements, since it can be compressed. Bitmap indexing is suited for entries with a number of possible values smaller than the depth of the whole table. This technique is mostly functional for queries regarding the identification of the position of specific features; for this reason, to answer a "how many" query it is necessary to insert a component that counts the hits obtained. Summing up, a query can be decomposed into simple logic operations performed between indexes, processing bits belonging to the same position in the array (Figure 1C).
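The indexing and query scheme just described can be sketched in a few lines of Python; the table values and the query below are illustrative examples in the spirit of the "men with a sport car or motorbike" search, not the actual contents of Figure 1:

```python
# Minimal bitmap-indexing sketch: one bit array per distinct key-value,
# queries answered with bitwise logic plus a 'counting ones' step.
# Table contents are illustrative, not taken from Figure 1.
table = [
    {"gender": "M", "vehicle": "sport car"},
    {"gender": "F", "vehicle": "motorbike"},
    {"gender": "M", "vehicle": "motorbike"},
    {"gender": "M", "vehicle": "sedan"},
]

def bitmap(column, value):
    """Bit i of the index is 1 iff row i holds `value` in `column`."""
    bits = 0
    for i, row in enumerate(table):
        if row[column] == value:
            bits |= 1 << i
    return bits

# "How many men own a sport car or a motorbike?" as bitwise operations:
hits = bitmap("gender", "M") & (bitmap("vehicle", "sport car") |
                                bitmap("vehicle", "motorbike"))
count = bin(hits).count("1")  # the hit-counting component
```

Note that the query never scans the table itself: it only combines the precomputed bit arrays, which is exactly the property that makes the algorithm a good fit for in-memory execution.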

Clearly, Bitmap indexing is compatible with the Processing-in-Memory paradigm, since it is characterized by simple logic bitwise operations and its data format makes it easy to embed in memory. However, bitmap indexing involves operations between the columns of a table. If we consider memory organization and imagine maintaining the column-row distribution of the table in memory, this would imply accessing multiple rows and then discarding all the data that do not belong to the desired indexes. This approach would be too costly. For this reason, a column-oriented organization was preferred for our implementation, which means that the entire table is stored transposed, so that, applying bitmap indexing, the indexes lie on rows (Figure 2).

Thanks to this method, to access an index it is only necessary to access a row, and consequently operations between indexes result in operations between memory rows. In this implementation we thus consider the indexes distributed on the rows of a memory array. We also take into account two types of query, *simple* and *composed*. A simple query is composed of only one operation (e.g., "Who is female and married?"), whilst a composed one is characterized by intertwined operations (e.g., Figure 1B). Considering the composed query depicted in Figure 1B, the operations to perform would be:


To answer a simple query, instead, only steps 1–4 are needed. The goal is then to implement the algorithm just introduced directly inside a memory array.

**Figure 1.** (**A**) Given a table, bitmap indexing transforms each column in as many bitmap as the number of possible key-values for that column (**B**) In order to answer a query logic bitwise operations are to be performed (**C**) Practical scheme of the execution of the query.


**Figure 2.** Column-oriented memory organization.

#### **4. The Architecture**

The architecture proposed in this paper presents a possible solution to the Von Neumann bottleneck by implementing a proper *in-memory* architecture, where logic functions are implemented directly inside each memory cell, in contrast with the *near-memory* approach seen in some state-of-the-art implementations, where logic operations are performed by logic circuits located at the border of the memory array. Moreover, this architecture was intended to overcome the limits imposed by specific technologies by keeping its development technology-independent, in order to implement a configurable architecture with the highest achievable degree of parallelism.

A memory array is composed of many storage units, each of which is made of multiple memory cells. Cells are the basic element of the memory itself. Therefore, in order to implement an entire memory array aimed at executing the Bitmap indexing algorithm, firstly it is necessary to define the structure of the memory cell.

According to the specifications required by Bitmap indexing, the cell has to be able to perform simple logic operations interacting with other cells in the array. This means that our cell should have both storage and logic properties. Indeed, the basic cell of the PIM array is provided with an element that stores information and a configurable logic element which performs AND, OR and XOR operations, with all the combinations of inputs (e.g., *A*, *A*), between the stored information and the one coming from another cell (Figure 3). The system indeed has the granularity of a single bit, meaning that every memory cell executes a logic operation.
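A behavioral sketch of such a cell might look as follows; this is our illustrative model of the storage-plus-configurable-logic pairing described above (class and method names are our own), not the synthesized netlist:

```python
# Behavioral sketch of the PIM cell: a one-bit storage element plus a
# configurable logic element combining it with a bit from another cell.
# Illustrative model only; names are ours, not the paper's.
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

class PIMCell:
    def __init__(self, bit=0):
        self.bit = bit  # the storage element (one bit)

    def write(self, bit):
        self.bit = bit

    def read(self):
        return self.bit

    def compute(self, op, incoming, invert_a=False, invert_b=False):
        """Apply `op` to the stored bit and the incoming bit, with optional
        input inversions (covering the A / inverted-A input combinations)."""
        a = self.bit ^ (1 if invert_a else 0)
        b = incoming ^ (1 if invert_b else 0)
        return OPS[op](a, b)
```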

**Figure 3.** (**A**) Overview of the complete architecture. (**B**) Structure of the duo Bank-Breaker. (**C**) Insight of the Processing-In-Memory (PIM) cell.

Beyond standard memory features, the PIM cell can interact with other cells according to its control input. As every cell in the array can perform computation, it is necessary to choose which cell executes the operation and which is read. To implement this, the designated passive cell is read and its stored data travel to the operative cell. To avoid interference, the output lines of unused cells are interrupted. To implement the bitwise feature, each cell of a row shares its input and output lines with every cell belonging to the same column in the other rows.

In Figure 3, the whole structure is depicted. Notably, besides the array, the architecture comprises a control unit and some additional components, such as the counter (for counting ones) and register files. Focusing on the array, like any standard memory, it is divided into multiple banks. Each bank is associated with a *breaker* that manages data flow from and to the bank. A bank represents the smallest degree of parallelism of the architecture: within a bank, only one operation can be executed at a time. The system also has a second level of granularity because, thanks to the breakers, every bank can work independently. This solution provides both a high level of granularity and flexibility. Banks can execute operations between their own rows, or can cooperate so that rows belonging to different banks interact, while the remaining banks work on different operations in parallel. Consequently, supposing each bank in the array works on a different operation by itself, the maximum achievable degree of parallelism equals the number of banks in the array. The *Bidirectional Breaker* manages the relations between its bank and the rest of the array. According to its control input, the breaker can be passive, letting data pass through without disturbing its bank, so that the bank can work on its own or remain silent; or it can be active, diverting data to or from its bank.

A bank is composed of multiple PIM rows and one *Ghost row*, which has only memory properties and is used to store temporary operation results. The Ghost row has its input line connected to the logic result output line of the PIM rows, whilst its output line is common with the PIM rows. This way, it is possible to read the Ghost row or use its content for further computation. As in standard memories, each row is fragmented into multiple words, which means that operations are actually performed between words belonging to different rows. The result is then temporarily saved in the Ghost word corresponding to the word address of the word that executed the operation; this avoids the need to manage a third address. To handle all the configuration signals needed for correct execution, two decoders are required inside each bank: one sets the configuration for the logic operation to execute, sending it to the right row; the second controls addresses and data flow inside the bank and distinguishes between standard memory mode and PIM operation mode. Since a simple AND operation can be performed in one bank in a single clock cycle, having multiple banks definitely increases the number of operations that can be executed in parallel in one clock cycle. The same reasoning applies to a composed operation, which takes two clock cycles. The throughput is directly proportional to the number of banks in the memory block: the larger the number of banks, the larger the memory block, and the larger the throughput.
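A minimal behavioral model of a bank with its Ghost row can be sketched as follows, under the assumptions above (class and signal names are ours, not the paper's):

```python
# Behavioral sketch of a bank: PIM rows of words plus one Ghost row that
# latches word-wise logic results at the same word address as the operands,
# so no third address is needed.
class Bank:
    def __init__(self, n_rows, words_per_row, width=16):
        self.rows = [[0] * words_per_row for _ in range(n_rows)]
        self.ghost = [0] * words_per_row       # memory-only Ghost row
        self.mask = (1 << width) - 1

    def write(self, row, word, value):
        self.rows[row][word] = value & self.mask

    def pim_op(self, row_a, row_b, word, op):
        """One simple operation per bank per clock cycle: the passive row
        (row_b) is read and its word travels to the operative row (row_a);
        the result lands in the Ghost word with the same word address."""
        a, b = self.rows[row_a][word], self.rows[row_b][word]
        self.ghost[word] = {"and": a & b, "or": a | b, "xor": a ^ b}[op]
        return self.ghost[word]

bank = Bank(n_rows=4, words_per_row=2)
bank.write(0, 0, 0b1100)
bank.write(1, 0, 0b1010)
print(bin(bank.pim_op(0, 1, 0, "and")))  # 0b1000
```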

In Figure 3, it is also highlighted that, besides the array, some additional components are used to guarantee the correct functioning of the entire structure.

The *Instruction Memory* is used to collect the queries to execute. It consists of a register file with as many registers as the number of banks, whose input parallelism equals the length of a complete query (i.e., two complete addresses and a logic operation configuration string). A composed query is treated as the combination of two distinct queries, which means that it occupies two consecutive registers of the Instruction Memory. Although the architecture was configured to exploit its maximum potential by implementing the bitmap indexing algorithm, it can be configured to perform additional algorithms. For reconfigurability purposes, the instruction memory had to be implemented as wide as possible, but most likely it will not be fully updated each time; to avoid conflicts, the *Operation Dispatcher* is in charge of blocking any old query. Since a query can take place between any couple of addresses in the array, it is necessary to send the addresses to their respective banks. The Operation Dispatcher thus reorders the addresses and sends them to their own banks. After the correct reordering, to ensure synchronization, the addresses are sampled by the *Address Register File*, which loads them and sends them to the array.

As illustrated previously, the results of the bitwise logic operations answer the "where" clause of the queries. To count the number of ones ("1") for the "how many" clause, a ones counter was inserted, connected to the output of a delay register. The register was added to meet the timing constraints of the counter. A simple counter that processes the input data bit-by-bit and increments by one for each "1" found was too slow; therefore, a tree-structured counter was implemented. Firstly, the data array is fragmented into D segments, each of N/D bits. All segments are then analyzed at the same time and the ones contained in each segment are counted. Finally, all the partial counts are added together to obtain the final sum. All the adders that form the tree structure have the same size, computed to avoid overflow.
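The segmented counting scheme can be sketched as follows (assuming, for simplicity, that D is a power of two so the adder tree reduces pairwise; function names are illustrative):

```python
# Sketch of the tree-structured ones counter: the N-bit word is split into
# D segments of N/D bits, each segment's ones are counted (in parallel in
# hardware), and the partial counts are summed with a pairwise adder tree.
def tree_count_ones(bits, D):
    N = len(bits)
    assert N % D == 0, "N must be divisible by D"
    seg = N // D
    # stage 1: count ones in each N/D-bit segment
    partial = [sum(bits[i * seg:(i + 1) * seg]) for i in range(D)]
    # following stages: pairwise adder tree (D assumed a power of two)
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(tree_count_ones(data, D=8))  # same result as sum(data), i.e. 9
```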

The architecture was conceived to incorporate as many features as possible while keeping the control circuits as simple as possible. The implemented structure is versatile and can work in 8 different operation modes, distinguished among traditional memory operations and PIM operations based on the position of the two operands and the desired parallelism: (1) Write; (2) Read; (3) Save result; (4) PIM simple single bank; (5) PIM simple different banks; (6) PIM multiple banks; (7) PIM composed; (8) PIM multiple composed. Each operation mode is the starting point of a query, which is composed as shown in Figure 4A. The FSM charts of all operation modes are reported in Figure 4B.

**Figure 4.** (**A**) Composition of a complete query. (**B**) Preliminary stages.

The developed architecture is a modular, configurable, parallel architecture that implements the concept of Processing-in-Memory to perform bitwise logic operations directly inside the memory, making it suitable for applications other than Bitmap indexing, as long as they are based on bitwise operations.

#### **5. Results and Conclusions**

The architecture was fully developed in VHDL (VHSIC Hardware Description Language). To evaluate its performance, an 8.704 KB square memory array was analyzed. The array distribution consisted of 16 banks with a 16-bit data size. All the internal structures have been kept parametric, so the architecture can be composed of as many banks, rows, and words as needed according to the target database. A MATLAB script extracted both the bitmap (or the bitmap was taken from an external source) and the queries to execute. The files were then set as input for the VHDL testbench, and finally a simulation was run feeding the queries to the PIM architecture. When started, the script enters a loop that terminates only when the user decides not to create any more queries, and an output file is generated. The completion of the query is assisted by two pop-up windows: one shows the internal composition of the memory, and the other shows the available logic operations and their corresponding codes.

All eight operation modes were tested with ModelSim to ensure correct functioning. Figure 5 reports two examples of operation modes, showing the expected and simulated logic behavior of the proposed architecture.

**Figure 5.** (**A**) Expected waveform of a LIM single same-bank AND operation. (**B**) Simulated waveform of a LIM single same-bank AND operation. (**C**) Expected waveform of a PIM multiple-bank operation. (**D**) Simulated waveform of a PIM multiple-bank operation.

The architecture was later synthesized with Synopsys Design Compiler using 45 nm BULK and 28 nm FDSOI CMOS technologies (Table 1). With Synopsys Design Compiler, latches and logic gates are used to implement the memory cell, so the results are not as optimized as they would be if a custom transistor-level layout were created for the memory cell.

As the fundamental element of the whole structure, the Cell was analyzed and optimized. The obtained results are reported in Tables 1 and 2.

From Table 1, it can be evinced that the area overhead is 55%. The overhead in terms of power dissipation is similar.


**Table 1.** Synthesis of the fundamental element.



An interesting point is the relation between the number of segments and the resulting delay. An analysis was carried out with 8-bit and 16-bit input data sizes (Figure 6). As the figure shows, the delay decreases considerably with a larger number of segments. Indeed, the architecture under consideration was synthesized with a value of D equal to 8 to achieve the best speed.

**Figure 6.** Relation between number of segments in the counter and resulting delay.

One of the main goals this paper aimed to fulfill is a high level of concurrency. This was accomplished thanks to the internal structure of the array, distributed over banks that are capable of working both independently and with each other, providing flexibility in the position of the operands involved in a query. To execute a simple query, only one cycle is required. Thanks to the modular structure of the array, the maximum throughput achievable working in parallel in PIM multiple banks mode is:

$$
throughput\_{\mathrm{max,simple}} = f\_{\mathrm{CLK}} \cdot N\_{\mathrm{ops}}.
$$

As for composed queries, two cycles are required to complete the operations. The resulting maximum throughput operating in PIM multiple composed mode is:

$$
throughput\_{\mathrm{max,composed}} = \frac{f\_{\mathrm{CLK}}}{2} \cdot N\_{\mathrm{ops}}.
$$

So, assuming a different query is executed in each of the 16 available banks, a maximum throughput of 2.45 Gop/s and 9.2 Gop/s is reached for 45 nm and 28 nm, respectively. The performance of the proposed PIM architecture was compared with the results of other in-memory proposals reported in Reference [29] (Table 3).
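The two throughput formulas can be checked numerically. Note that the clock frequencies below are back-computed from the reported 2.45 and 9.2 Gop/s figures with 16 banks; they are our assumption, not values taken from the synthesis tables:

```python
# Sketch of the maximum-throughput formulas: a simple query takes one clock
# cycle per bank, a composed query takes two, and banks work in parallel.
def max_throughput(f_clk_hz, n_banks, composed=False):
    cycles = 2 if composed else 1
    return f_clk_hz / cycles * n_banks     # operations per second

N_BANKS = 16
# clock frequencies back-computed from the reported throughputs (assumption)
for tech, f_clk in [("45 nm", 2.45e9 / N_BANKS), ("28 nm", 9.2e9 / N_BANKS)]:
    print(tech, max_throughput(f_clk, N_BANKS) / 1e9, "Gop/s (simple)")
```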

**Table 3.** Clock cycles comparison for a single query execution.


Noticeably, operations in the proposed PIM array take fewer clock cycles than in other solutions. Moreover, it should be taken into consideration that executing multiple parallel operations does not change the number of clock cycles required; this is how the throughput mentioned above is obtained. Thus, the maximum achievable degree of parallelism corresponds to the number of available banks. Moreover, it is possible to scale the architecture to larger dimensions, as it was conceived as modular, meaning it can be composed of as many banks as desired. Another possibility is to develop a 3D structure to enhance performance. It would also be easy to modify the architecture to support other types of operations. These results, coupled with the flexibility of the architecture, highlight its potential.

**Author Contributions:** Conceptualization, G.T., G.S., M.G., M.V., M.Z., M.R.R.; methodology, G.T., M.V.; software, M.A.; validation, M.A., G.T.; investigation, M.A., G.T., M.G., M.V., M.Z., M.R.R.; resources, X.X.; data curation, X.X.; writing–original draft preparation, M.A., G.T.; writing–review and editing, M.A., G.T., M.V., A.M., F.O.; supervision, M.G., M.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Digital Circuit for Seamless Resampling ADC Output Streams** †

#### **Mauro D'Arco \*, Ettore Napoli and Efstratios Zacharelos**

Department of Electrical Engineering and Information Technologies (DIETI), University of Naples Federico II, via Claudio 21, 80125 Naples, Italy; ettore.napoli@unina.it (E.N.); efstratios.zacharelos@unina.it (E.Z.)


Received: 6 February 2020; Accepted: 11 March 2020; Published: 14 March 2020

**Abstract:** Fine resolution selection of the sample rate is not available in digital storage oscilloscopes (DSOs), so the user has to rely on offline processing to meet this need. The paper first discusses digital signal processing based methods that allow changing the sampling rate by means of digital resampling approaches. Then, it proposes a digital circuit that, if included in the acquisition channel of a digital storage oscilloscope, between the internal analog-to-digital converter (ADC) and the acquisition memory, allows the user to select any sampling rate lower than the maximum one with fine resolution. The circuit relies both on the use of a short digital filter with dynamically generated coefficients and on a suitable memory management strategy. The output samples produced by the digital circuit are characterized by a sampling rate that can be incoherent with the clock frequency regulating the memory access. Both a field programmable gate array (FPGA) implementation and an application specific integrated circuit (ASIC) design of the proposed circuit are evaluated.

**Keywords:** resampling; interpolating polynomial; polyphase filter; digital circuit design; FPGA; ASIC

#### **1. Introduction**

In the majority of digital storage scopes (DSOs) the analog-to-digital converter (ADC) always works at its maximum sampling rate, imposed by an internal fixed frequency clock [1,2]. The user can also select lower sampling rates, which are achieved by seamlessly resampling the ADC output stream. Resampling is performed by means of a digital circuit that interfaces ADC and acquisition memory, and merely consists in decimating the input stream, which involves grouping the samples at the maximum sampling rate into consecutive sets, and acquiring, that is, storing in the acquisition memory, only the first sample of each set. All sets have the same size, which is equal to the required decimation factor—for instance, grouping samples into sets with size equal to 2 means acquiring one every other sample, thus halving the input sampling rate [3,4].

Resampling based on decimation is characterized by the following drawbacks: (i) the selection of the sampling rate is limited to the values that can be obtained dividing the maximum sampling rate by integer values; (ii) if the selected sampling rate is less than the Nyquist rate of the analog input, the acquired signal is corrupted by aliasing [5,6].

In general, fine selection of the sample rate improves the performance of the DSO, allowing more efficient usage of memory resources. In fact, a limited set of sample rates implies a limited set of time windows for signal observation. Due to these limitations, the analysis may be performed observing the signal of interest in a time window where up to almost 50% of the window contains useless samples. Many DSOs are also complemented with math capabilities, like Fast Fourier Transform (FFT) options, that allow frequency-domain analyses. In these applications, the choice of the sample rate determines, in conjunction with the memory size, the frequency span and resolution settings; the limitations characterizing the sample rate selection lead to sub-optimal settings. Some DSOs allow the user to apply an external clock signal to control the sampling rate. This option is not very common because of the following drawbacks: (i) the external path has a limited bandwidth, much inferior to that of the internal path, so that the operative range of the DSO is substantially reduced; (ii) some functionalities of the instrument, which cannot work with the external clock, are disabled; (iii) the precision specifications of the DSO, which refer to operation with the internal clock, cannot be used to evaluate the accuracy of the measurement results.

In theory, fine control of the sampling rate in real-time DSOs can be obtained by resampling the ADC output stream by means of more effective methods, alternative to hard decimation [7–9]. These methods can be inherited from digital signal processing theory, and rely either on the use of interpolation algorithms or on polyphase filters [10,11]. The first method allows varying the sampling rate dynamically and puts no restrictions on the selection of the output sampling rate. The second method is instead limited to resampling factors equal to *L*/*M*, where *L* and *M* are integers. Both methods counteract aliasing effects by means of low-pass filtering operations, which are implicit in the interpolation algorithm and explicit in the processing scheme of polyphase filters [12–15]. In fact, the use of an interpolation function is equivalent to filtering the signal with a filter whose number of taps equals the number of points used in interpolation. Unfortunately, the hardware implementation of both methods is difficult due to the strict requirements of seamless operation and fine resolution in sampling rate selection [16–18].

A method that provides a viable solution to finely control the sampling rate in DSOs has been presented in Reference [19], and a digital circuit that implements this method using field programmable gate array (FPGA) technology has been illustrated at the ApplePies 2019 Conference [20]. In detail, the digital circuit exploits a resampling method based on linear interpolation, which trades off accuracy against circuit complexity. It is designed to work between the ADC and the acquisition memory, and allows selecting sampling rates from the highest frequency, *fck*, down to half its value, *fck*/2. A sample rate lower than *fck*/2 is easily obtained by cascading the proposed circuit with a standard one that performs decimation by an integer value. The acquisition chain is made up of the ADC, the proposed digital circuit, and the acquisition memory, all operating synchronously at the system clock rate *fck*. It provides samples that represent a version of the input signal characterized by a sample rate *fs* = *C fck*, where *C* is a fractional value in the interval (1/2, 1). The resolution with which *C* is defined is only limited by the number of bits adopted in its binary representation; the reciprocal of *C* can be regarded as a non-integer decimation factor.

This work is an extended version of the article published in the Conference Proceedings [20]. It takes into consideration several different methods for DSO sampling rate control and, by evaluating their performance, highlights how the proposed digital circuit represents a good compromise between achievable accuracy and circuit complexity. Starting from the primary version of the circuit, an improved version characterized by different pipeline levels is developed, and an application specific integrated circuit (ASIC) design of the proposed solution is also analyzed [21–23].

The paper discusses more about the resampling methods based on interpolation and polyphase filters in Section 2. The performance of different interpolators, which satisfy the requirements of effective hardware implementation and high resolution in sampling rate selection, is analyzed in Section 3 through simulations. Section 4 illustrates the design of the proposed digital circuit, and, finally, Section 5 gives concluding remarks.

#### **2. Methods**

In general, resampling a mono-dimensional signal, defined upon a sampling grid, aims at producing another representation of the same signal, referred to a different sampling grid. It basically requires gaining the samples referred to the output grid by processing the available ones. Resamplers manage a redundant representation of the signal, which includes both the input samples and the resampled ones; the latter are the only ones returned by the circuit.

In the most common resampling applications both the input and output sampling grid are uniform and the circuit has to deal with samples that are streamed at regular time instants, such that a sampling rate is defined. Also, resampling has to be performed real-time seamlessly on the input stream, which is very challenging, especially in the presence of high-rate data streams [24–27].

Hereinafter, the attention is mainly paid to real-time seamless resampling of signals that are naturally defined in the time domain, for which the input sampling rate needs to be changed into a different sampling rate, lower than the input one. The methods that are illustrated can be adapted to other signals defined in different domains by exploiting the unique correspondence between the points of the sampling grids and the related time-stamps in their streamed form produced at the processing stage [28,29].

#### *2.1. Resampling Based on the Use of Approximating Polynomials*

The most straightforward resampling approach, capable of granting real-time seamless performance, exploits the zero-order interpolation process, which assumes the signal constant until the next sample is available. In other terms, the resampled value, *x*(*n* + *t*), where *t* is a fraction of the sampling period, *Ts*, (reciprocal of the sampling rate, *fs*) of the input stream, is assumed equal to that of the most recent sample *x*(*n*) [30,31].

Alternatively, the first-order or linear interpolator can be used to improve the accuracy of the resampling process. Linear interpolators wait for the subsequent sample of the input stream, *x*(*n* + 1), to compute the value of any sample at a time instant in between. Specifically, they compute it by adding to *x*(*n*) a term equal to *t* times the time derivative, which is estimated as the first forward difference [32,33].
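The zero-order and first-order rules above can be stated compactly as (a minimal sketch; function names are ours):

```python
# Zero-order and first-order (linear) resampling of a value x(n + t),
# with t in (0, 1), from the two surrounding input samples.
def zero_order(x_n, x_n1, t):
    return x_n                      # hold the most recent sample

def linear(x_n, x_n1, t):
    return x_n + t * (x_n1 - x_n)   # first forward difference as derivative

print(zero_order(1.0, 3.0, 0.25))  # 1.0
print(linear(1.0, 3.0, 0.25))      # 1.5
```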

More generally, resampling can rely on interpolators that use a larger set of samples adjacent to the resampling instant to determine the resampled value. The samples of the set are processed to identify a polynomial of the *t* variable, *P*(*t*), that locally approximates the signal behavior. The value of the polynomial at the resampling instant provides the resampled value. The polynomial is identified imposing constraints that can involve the values of the signal and/or of its time derivatives. The most common solutions are:


In all the aforementioned approaches the resampled value obtained using an approximating polynomial can be represented with a matrix formulation. For instance, for a 3-degree approximating polynomial, one has:

$$\begin{array}{rclcrcl}\mathbf{x}(n+t) &=& c\_1(n) + c\_2(n)t + c\_3(n)t^2 + c\_4(n)t^3 = \\\\ &=& \begin{bmatrix} 1 & t & t^2 & t^3 \end{bmatrix} \begin{bmatrix} a\_{11} & a\_{12} & a\_{13} & a\_{14} \\ a\_{21} & a\_{22} & a\_{23} & a\_{24} \\ a\_{31} & a\_{32} & a\_{33} & a\_{34} \\ a\_{41} & a\_{42} & a\_{43} & a\_{44} \end{bmatrix} \begin{bmatrix} \mathbf{x}\_{n-1} \\ \mathbf{x}\_{n} \\ \mathbf{x}\_{n+1} \\ \mathbf{x}\_{n+2} \end{bmatrix} \end{array} \tag{1}$$

where *x*(*n* + *t*) is the resampled value at time instant *n* + *t*, *t* is within the interval (0, 1), and each coefficient, *ci*(*n*), *i* = 1, . . . , 4, is a linear combination of the values of the 4 consecutive samples {*xn*−1, *xn*, *xn*+1, *xn*+2} with constant coefficients *aij*, namely:

$$c\_l(n) = \sum\_{j=1}^{4} a\_{lj} x\_{n-2+j}.\tag{2}$$

The constant coefficients *aij* can be determined imposing the constraints used to define the approximating polynomial. Hence, for a Lagrange polynomial, one can consider the system of equations obtained imposing that the polynomial connects the values {*xn*−1, *xn*, *xn*+1, *xn*+2} characterized, respectively, by *t* abscissas {−1, 0, 1, 2}:

$$
\begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \end{bmatrix} \begin{bmatrix} c\_1 \\ c\_2 \\ c\_3 \\ c\_4 \end{bmatrix} = \begin{bmatrix} x\_{n-1} \\ x\_n \\ x\_{n+1} \\ x\_{n+2} \end{bmatrix} \tag{3}
$$

from which the *aij* values are determined by inverting the coefficient matrix in (3) as:

$$\begin{Bmatrix} a\_{ij} \end{Bmatrix} = inv \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 0 & 6 & 0 & 0 \\ -2 & -3 & 6 & -1 \\ 3 & -6 & 3 & 0 \\ -1 & 3 & -3 & 1 \end{bmatrix} \tag{4}$$
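As a quick sanity check, the inverse in Equation (4) can be verified numerically (a sketch; the matrix names are ours — `SIX_INV` holds 6 times the inverse so all entries stay integer):

```python
# Check of Equation (4): the claimed inverse times the Vandermonde matrix
# of the Lagrange constraints must give the identity (times 6).
V = [[1, -1, 1, -1],
     [1,  0, 0,  0],
     [1,  1, 1,  1],
     [1,  2, 4,  8]]
SIX_INV = [[ 0,  6,  0,  0],
           [-2, -3,  6, -1],
           [ 3, -6,  3,  0],
           [-1,  3, -3,  1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

print(matmul(SIX_INV, V))  # 6 times the 4x4 identity matrix
```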

If an approximating polynomial of second degree is selected, then only three coefficients, *ci*(*n*), *i* = 1, 2, 3, that are still linear combination of the samples {*xn*−1, *xn*, *xn*+1, *xn*+2} with constant coefficients are needed. These coefficients can be determined imposing the same constraints adopted to identify the Lagrange polynomial, i.e:

$$
\begin{bmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix} \begin{bmatrix} c\_1 \\ c\_2 \\ c\_3 \end{bmatrix} = \begin{bmatrix} x\_{n-1} \\ x\_n \\ x\_{n+1} \\ x\_{n+2} \end{bmatrix} \tag{5}
$$

but, since a second degree polynomial cannot in general grant the connection of more than 3 points, one has to accept an approximate solution that best fits the data according to a given cost function. The solution that grants the least mean square error, as well known, is obtained solving the system in (5) using the pseudo-inverse matrix method; in this case the *aij* values, *i* = 1, ..., 3, *j* = 1, . . . , 4, are:

$$\left\{ a\_{ij} \right\} = \mathrm{inv}\left\{ \begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 0 & 1 & 2 \\ 1 & 0 & 1 & 4 \end{bmatrix} \begin{bmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix} \right\} \begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 0 & 1 & 2 \\ 1 & 0 & 1 & 4 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} 3 & 11 & 9 & -3 \\ -11 & 3 & 7 & 1 \\ 5 & -5 & -5 & 5 \end{bmatrix} \tag{6}$$
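By construction, the pseudo-inverse *P* = inv(*A*ᵀ*A*)·*A*ᵀ satisfies *P*·*A* = *I*, which offers a quick numerical check (a sketch; `TWENTY_P` holds 20 times the pseudo-inverse so all entries stay integer):

```python
# Check of the least-squares pseudo-inverse in Equation (6):
# P * A must be the 3x3 identity for the 4x3 constraint matrix A.
A = [[1, -1, 1],
     [1,  0, 0],
     [1,  1, 1],
     [1,  2, 4]]
TWENTY_P = [[  3, 11,  9, -3],     # 20 * pseudo-inverse
            [-11,  3,  7,  1],
            [  5, -5, -5,  5]]

PA = [[sum(TWENTY_P[i][k] * A[k][j] for k in range(4)) for j in range(3)]
      for i in range(3)]
print(PA)  # 20 times the 3x3 identity matrix
```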

The coefficients of a 3-degree Hermite polynomial are identified by also using the time derivative of the approximating polynomial *P*(*t*), namely:

$$\frac{dP}{dt} = c\_2(n) + 2c\_3(n)t + 3c\_4(n)t^2 \tag{7}$$

to form a system of equations that imposes that the polynomial connects the central samples, referred to the *t* abscissas equal to 0 and 1, and has the same derivative of the signal in those points. In matrix form, these constraints can be expressed by:

$$
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} c\_1 \\ c\_2 \\ c\_3 \\ c\_4 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix} \begin{bmatrix} x\_{n-1} \\ x\_n \\ x\_{n+1} \\ x\_{n+2} \end{bmatrix} \tag{8}
$$

where the time derivative of the signal is estimated in terms of finite central difference. The *aij* values, *i*, *j* = 1, . . . , 4, are obtained solving system (8) as:
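As a numerical sketch, the *aij* values can be recovered by solving system (8) with exact rational arithmetic (the solver below is ours, a plain Gauss-Jordan elimination):

```python
# Solve system (8) for the Hermite case: {a_ij} = inv(C) * D / 2,
# with C the constraint matrix and D the finite-difference matrix of (8).
from fractions import Fraction

C = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 2, 3]]
D = [[0, 2, 0, 0], [0, 0, 2, 0], [-1, 0, 1, 0], [0, -1, 0, 1]]

def solve(C, rhs):
    """Gauss-Jordan elimination on the augmented matrix [C | rhs]."""
    n = len(C)
    M = [[Fraction(C[i][j]) for j in range(n)]
         + [Fraction(rhs[i][j]) for j in range(n)] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

A = [[v / 2 for v in row] for row in solve(C, D)]  # a_ij = inv(C) * D / 2
for row in A:
    print([str(v) for v in row])
```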


The equations that define the interpolation methods discussed above can be summarized as in Table 1.

**Table 1.** Lagrange, Hermite and best fitting polynomial (in the sense of least square error) adopted in resampling.


#### *2.2. Resampling Based on Polyphase Filters*

Resampling with polyphase filters is commonly performed in a variety of systems, like multipurpose receivers, where several different sampling rates are supported to process signals characterized by different bandwidths, as well as in digital audio and video systems, and so forth [34,35]. In these systems the signal is initially sampled at a high sampling rate, then processed to modify the sampling rate by a factor *L*/*M*. Processing involves interpolation by *L*, low-pass filtering, and decimation by *M*. Low-pass filtering removes the image frequencies due to sampling rate changes; it is implemented using polyphase decomposition of both the input signal and filter coefficients [36,37].

For the sake of clarity, an example of a 3/4-resampler that uses a short low-pass filter with 9 coefficients, *h*(*n*) = {*h*(0), *h*(1), . . . , *h*(8)}, is shown in Figure 1. The input signal *y*(*n*) is de-multiplexed in order to retrieve 4 consecutive samples and route them to 4 individual channels with a single operation. The output of the resampler, *z*(*m*), is obtained by multiplexing the outputs produced by 3 filters, each defined in terms of 3 coefficients of *h*(*n*) according to polyphase decomposition rules.

*Sensors* **2020**, *20*, 1619
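A direct (non-polyphase) reference implementation of the same *L*/*M* = 3/4 rate change produces the outputs that the polyphase scheme of Figure 1 computes more efficiently by evaluating only the needed filter phases. The filter taps below are placeholder values, not the paper's:

```python
# Direct L/M rate change: zero-stuff by L, FIR low-pass filter with h,
# keep every M-th output sample. A polyphase decomposition gives the same
# outputs while skipping the computations on discarded samples.
def resample_LM(y, h, L, M):
    # upsample: insert L-1 zeros between input samples
    up = []
    for v in y:
        up.append(v)
        up.extend([0] * (L - 1))
    # FIR low-pass filtering (linear convolution, causal)
    filt = [sum(h[k] * up[n - k] for k in range(len(h)) if 0 <= n - k < len(up))
            for n in range(len(up))]
    # downsample: keep one sample every M
    return filt[::M]

h = [0.05, 0.1, 0.15, 0.2, 0.2, 0.15, 0.1, 0.05, 0.0]  # placeholder 9 taps
y = list(range(8))
print(resample_LM(y, h, L=3, M=4))
```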

Polyphase filters are characterized by low requirements in terms of clock frequency and can be set to both up-sample and down-sample the input stream, but they are not suitable for programmable resampling factors, because any polyphase structure is defined for a fixed ratio between the input and output sampling rates; consequently, any change of the resampling ratio implies modifying their structure [38].

**Figure 1.** Schematic of a digital resampler implementation based on polyphase decomposition. Resampling factor equal to 3/4; low-pass filter with 9 taps.

#### *2.3. Pro and Cons of Approximating Polynomials and Polyphase Filters*

Resampling with polyphase filters straightforwardly changes the input sampling rate, *fck*, to the output one, *fs* = (*L*/*M*) *fck*. In fact, thanks to the use of a demultiplexer at the front-end, the polyphase filter processes every *M Tck* seconds (*Tck* = 1/*fck*) a set of *M* input samples and returns a set of *L* output samples, which are written in the acquisition memory with a single memory access, thus lowering the input sampling rate by a factor *L*/*M*. Unfortunately, any change of the sampling rate requires re-programming the digital circuit. Although, in theory, re-programming can be done, in the case of sampling rates that involve very large *M* and *L* values one should reserve sufficient hardware resources for huge polyphase structures, seldom required and largely unused; moreover, the responsiveness of the system would definitely slow down.

Resamplers based on interpolators are instead less demanding in terms of hardware resources and allow controlling the sampling rate easily. They also require a suitable strategy for arranging the lower sampling rate output stream. Specifically, the digital resampler can take as input both a set of consecutive samples and the *t* variable, as specified by the interpolation equations summarized in Table 1. It can run at a clock rate equal to the input sampling rate, quantifying the *t* variable as the delay of the resampling instant with respect to the discrete time *n*. To this end, it can exploit an accumulator that increments by *Ts*/*Tck* − 1 (*Ts* = 1/*fs*) every *Tck* seconds. The accumulator represents the *t* variable except when it overflows a unitary value. The overflow repeatedly occurs with a cadence related to the selected sampling rate. Overflow means that the resampling instant does not fall between the discrete times *n* − 1 and *n*, but in the midst of *n* and *n* + 1, such that it should be considered at the next processing step. At any occurrence of an overflow, the digital circuit skips the calculation of the resampled value and performs a unitary decrement of the accumulator at the subsequent clock cycle, thus restoring *t* between the expected discrete time instants.
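The accumulator scheme can be sketched behaviorally as follows, with linear interpolation standing in for the short dynamic-coefficient filter (a sketch, not the paper's circuit; the hardware uses a fixed-point accumulator rather than floats):

```python
# Accumulator-based fractional resampler: each clock cycle t grows by
# Ts/Tck - 1; on overflow past 1 the output for that cycle is skipped and
# t is decremented by one, restoring it between the expected instants.
def fractional_resample(x, C):
    """Resample stream x to rate fs = C * fck, with C in (1/2, 1)."""
    step = 1.0 / C - 1.0          # Ts/Tck - 1
    t, out = 0.0, []
    for n in range(len(x) - 1):
        if t >= 1.0:              # overflow: instant lies past sample n
            t -= 1.0              # unitary decrement, skip one output
            continue
        out.append(x[n] + t * (x[n + 1] - x[n]))  # linear interpolation
        t += step
    return out

x = list(range(10))               # a ramp is preserved by interpolation
print(fractional_resample(x, C=0.8))  # outputs spaced by 1/C = 1.25
```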

The use of approximating polynomials is preferred in the development of a digital circuit aimed at granting fine control of the DSO sampling rate, because this approach has several interesting features. These include the capability of resampling even when the ratio of the sampling rates is not rational, as well as of seamlessly managing real-time streams even in the presence of time-varying sampling rates.

#### **3. Simulation Analyses**

As is well known, changing the sampling rate produces aliasing, which is usually counteracted by filtering the digital signal with a low-pass filter before resampling. The performance of the resampling methods is affected by the presence of residual aliases, thus the frequency response of the adopted anti-aliasing filter must be taken into account. The anti-aliasing filter is implicit in the resampling approach based on the use of an approximating polynomial, and its impulse response is given, in general, by a set of coefficients that depend on *t*; for instance, from Equation (1), one can obtain the coefficients of the 4-tap filter, *dj*, *j* = 1, ..., 4, as:

$$d_j = \sum_{i=1}^{4} a_{ij}\, t^{\,i-1}. \tag{10}$$
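Equation (10) can be evaluated directly. The sketch below writes the linear interpolator in the generic 4-tap form; the coefficient matrix values are an illustrative assumption consistent with *y* = (1 − *t*)*x*(*n* − 1) + *t x*(*n*), not values copied from Table 1:

```python
def filter_taps(A, t):
    """Eq. (10): d_j = sum_i a_ij * t**(i-1).

    Rows of A are indexed from 0 in code, so row i carries the t**i term.
    Returns the taps of the implicit anti-aliasing filter for delay t."""
    n_taps = len(A[0])
    return [sum(A[i][j] * t**i for i in range(len(A))) for j in range(n_taps)]

# Linear interpolator in 4-tap form: only the two central taps are non-zero.
A_lin = [
    [0.0, 1.0, 0.0, 0.0],    # constant terms
    [0.0, -1.0, 1.0, 0.0],   # terms in t
]
print(filter_taps(A_lin, 0.5))   # -> [0.0, 0.5, 0.5, 0.0]
```

At *t* = 1/2 the linear interpolator reduces to a two-tap moving average, whose low-pass behavior appears in Figure 2.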

An estimation of the residual alias can be approached by taking into account that the spectrum of a digital signal is periodic with unitary period, and that lowering the sampling rate down to *fs* = *C fck* has the effect of replicating the spectrum at a pace equal to *C*. Moreover, since the *t* variable changes during the resampling process, ranging in the interval (0, 1), the features of the anti-aliasing filter change as well. Nevertheless, taking into account that *t* lies within (0, 1) and, on average, *t* = 1/2, one can determine the average behavior of the filter. Using the frequency response *H*(*ν*) of the filter that describes this average behavior, obtained by taking the Fourier transform of the filter coefficients estimated with *t* = 1/2, the spectrum of the resampled version can be represented as:

$$Z(\nu) = \sum_{p,q=-\infty}^{\infty} H\!\left(\frac{\nu-p}{C}\right) X\!\left(\frac{\nu-p}{C} - q\right),\tag{11}$$

where *ν* is normalized to the sampling rate *fck*. From Equation (11), the alias-free version of the resampled signal is obtained for *p* = *q* = 0, whereas all the combinations satisfying |*p* − *Cq*| < *C*/2 identify the residual aliases that fall in the spectrum of the resampled signal. Figure 2 shows the frequency response of the anti-alias filters that are implicit in the seven approaches detailed in Table 1. The different responses are characterized by specific markers and colors: the 'plus' marker in blue denotes the linear interpolator, the 'circle' marker in red the first-degree polynomial fitting 3 sample points, the 'x' marker in green the second-degree Lagrange polynomial, the 'star' marker in yellow the second-degree polynomial fitting 4 sample points, the 'square' marker in magenta the third-degree Lagrange polynomial, and, finally, the 'diamond' marker in cyan the third-degree Hermite polynomial (a legend highlights these correspondences). For the sake of completeness, an additional plot related to the zero-order resampling approach is shown with the 'dot' marker in black, to highlight the all-pass nature of this approach, which is detrimental because it provides no mitigation of aliasing effects.
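The alias condition of Equation (11) can be enumerated directly; the sketch below lists, for a given resampling factor *C* and over a finite index range (an assumption of this illustration), the (*p*, *q*) pairs whose replicas land inside the resampled band:

```python
def residual_aliases(C, pmax=5, qmax=5):
    """List the (p, q) index pairs of Eq. (11) whose spectral replicas
    fall inside the resampled band, i.e. |p - C*q| < C/2, excluding the
    alias-free term p = q = 0."""
    hits = []
    for p in range(-pmax, pmax + 1):
        for q in range(-qmax, qmax + 1):
            if (p, q) != (0, 0) and abs(p - C * q) < C / 2:
                hits.append((p, q, p - C * q))
    return hits

# For C = 0.743 (the factor used in the simulations), the nearest
# residual aliases and their normalized center frequencies:
for p, q, nu in residual_aliases(0.743, 3, 3):
    print(p, q, round(nu, 3))
```

For *C* = 0.743 the nearest offenders are (±1, ±1) and (±2, ±3), i.e., residual aliases centered at about ±0.257 and ±0.229 in normalized frequency, which is where the stop-band gain of the filters in Figure 2 matters most.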

The frequency responses given in Figure 2 are obtained by Fourier transforming the impulse response estimated upon 50 points, and consist of 25 bins, equally spaced at a pace of 0.02; they show the behavior of the filters up to the normalized frequency 0.5, corresponding to *fck*/2 hertz. One can observe that approximating polynomials of higher degree offer a flatter gain and better selectivity. Moreover, the mean behaviors of the anti-aliasing filters related to the second-degree polynomial fitting 4 sample points, the third-degree Lagrange polynomial, and the third-degree Hermite polynomial are identical. The ideal frequency response should exhibit unitary gain in the interval (0, *C*/2), to avoid undesired attenuation of the signal spectral content, and zero gain in (*C*/2, 1/2), to cancel any alias contribution.

**Figure 2.** Frequency response of the anti-alias filters implicit in the resampling approaches based on the use of approximating polynomials.

Although the anti-aliasing filter plays an important role in the resampling process, the analysis of its mean behavior provides only partial insight, since its time-varying nature can play a role that cannot be analyzed using Equation (11). Deeper insight into the performance of the resampling methods can instead be gained by using the standard test methods for ADCs detailed in Reference [39], such as the effective number of bits (ENOB) and the spurious-free dynamic range (SFDR). The former is a measure of the signal-to-noise and distortion ratio, used to compare the actual ADC performance to an ideal one; the latter considers, in the presence of a pure sine-wave input, the ratio of the amplitude of the output spectral component at the input frequency, *f*0, to the amplitude of the largest harmonic or spurious spectral component. Figure 3 shows the ENOB offered by the considered methods in the presence of test sine-waves.

**Figure 3.** Effective number of bits (ENOB) offered by the resampling approaches based on the use of approximating polynomials.

The simulations have considered samples quantized by an 8 bit ADC. Quantization has been applied to a signal corrupted by white Gaussian noise, with rms value equal to 15% of the LSB of the ADC. The input stream, sampled at *fck* = 1 GSa/s, is resampled at *fs* = 743 MHz, thus *C* = 0.743. The sine-waves adopted in the tests have frequencies in the set {1, 2, 5, 10, 20, 50, 100, 200} MHz. The results show that the ENOB obtained after resampling can even improve beyond the nominal 8 bits at the lower input frequencies of the considered set. This is due to the anti-aliasing low-pass filter, which reduces the acquired bandwidth and thus also the distortion due to quantization. As the input frequency approaches the upper limit of the Nyquist bandwidth, the performance of all methods rapidly decreases, and one can observe that the methods using approximating polynomials of higher degree can keep the ENOB close to the nominal number of bits over wider ranges. As expected, the effectiveness of the interpolation algorithms diminishes as soon as the input sinusoidal signal is sampled with only a few points per period, namely 7–8 points. This happens because the algorithms consider the local behavior of the signal, whereas the uniform sampling theorem calls for interpolation with *sinc* functions, which consider the behavior on the whole time axis; unfortunately, *sinc* interpolation is unfeasible, and its straightforward approximations, like those based on truncated *sinc* functions, carry a huge computational burden that is not compatible with real-time execution. As a rule of thumb, some oversampling is advisable when using the acquisition mode with fine selection of the sampling rate, to avoid poor results.
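The ENOB figure used in these tests can be estimated as sketched below. This is a minimal, self-contained illustration (it measures an ideal 8-bit quantizer against a known reference sine, not the full resampling test bench of Figure 3; function and variable names are assumptions):

```python
import math

def enob(signal, reference):
    # ENOB derived from the signal-to-noise-and-distortion ratio (SINAD):
    # everything that differs from the ideal reference counts as
    # noise plus distortion.
    p_sig = sum(r * r for r in reference) / len(reference)
    p_err = sum((s - r) ** 2 for s, r in zip(signal, reference)) / len(signal)
    sinad_db = 10.0 * math.log10(p_sig / p_err)
    return (sinad_db - 1.76) / 6.02

# Coherently sampled full-scale sine (53 cycles over 4096 samples),
# quantized by an ideal 8-bit converter: ENOB close to the nominal 8 bits.
N, bits = 4096, 8
ref = [math.sin(2 * math.pi * 53 * n / N) for n in range(N)]
lsb = 2.0 / 2 ** bits
quant = [round(r / lsb) * lsb for r in ref]
print(round(enob(quant, ref), 2))
```

The same estimator, applied to the resampled stream against an ideally resampled reference, yields curves of the kind shown in Figure 3.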

The simulations have also estimated, for the same test set-up, the SFDR, in order to highlight whether the time-varying behavior of the anti-alias filters introduces relevant spurious components; the obtained results are shown in Figure 4.

**Figure 4.** Spurious-free dynamic range (SFDR) offered by the resampling approaches based on the use of approximating polynomials.

The performance parameters highlight the convenience of using Lagrange or Hermite polynomials (linear interpolation coincides with the adoption of a first-degree Lagrange polynomial) for interpolation, rather than methods based on zero-order or fitting polynomials.

Further simulations have been addressed to the analysis of any dependence of the performance on the output sampling rate. As an example, Figure 5 shows the ENOB obtained in the presence of a sine-wave input at 20 MHz when the 1 GHz input sampling rate is lowered down to the frequencies of the set {587, 641, 743, 797, 859, 907, 971} MSa/s.

**Figure 5.** ENOB offered by the resampling approaches in the presence of an input sine-wave at 20 MHz when the 1 GHz input sampling rate is lowered down to frequencies of the set {587, 641, 743, 797, 859, 907, 971} MSa/s.

The simulations highlight that the performance is unaffected by sampling rate changes; all the methods offer a constant ENOB above 8 bits, except for the zero-order method, whose performance, although independent of the output sampling rate, is largely below the lower axis limit used in Figure 5.

#### **4. Proposed Digital Circuit**

#### *4.1. Operation Details*

The proposed digital circuit implements the linear interpolation method, which represents a good compromise between accuracy and circuit complexity. It processes in real time the signal *x*(*n*) streaming out of the ADC and returns the output, *y*(*n*); both are characterized by the clock rate, *fck*, but *y*(*n*) contains a resampled version of *x*(*n*) characterized by a sampling rate *fs* = *C fck*.

More specifically, the value *y*(*n*) is determined by combining the samples *x*(*n* − 1) and *x*(*n*) returned by the ADC according to:

$$y(n) = a(n)\,x(n-1) + \bigl(1 - a(n)\bigr)\,x(n) = a(n)\,x(n-1) + b(n)\,x(n), \tag{12}$$

where *a*(*n*) is a time-varying coefficient, updated at every clock cycle by subtracting from its current value a user-chosen quantity related to the resampling factor as (1 − *C*)/*C*. Notice that the aforementioned variable *t* corresponds to the variable *b*(*n*) of Equation (12), and consequently *a*(*n*) = 1 − *t*. The subtraction is skipped if the current value of the coefficient *a*(*n*) is negative; in its place, an addition of one is performed. Hence, the output of the digital circuit, *y*(*n*), contains, with some redundancy, the resampled version of *x*(*n*).

The circuit also produces a signal, *PtrX*, that indicates the memory location where *y*(*n*) is stored. The generated sequence *y*(*n*) is stored in memory at the system frequency, *fck*, but, in order to cope with the lower sampling rate, *PtrX* is not incremented when the *a*(*n*) coefficient is incremented by one. In this way, two consecutive outputs share the same value of *PtrX*, which means that the second one overwrites the first.

An example will better clarify the meaning of *a*(*n*). Figure 6 shows a sinusoidal signal at 54 MHz, sampled with the 1 GHz (*Tck* = 1.0 ns) system clock (samples shown with circles). The result obtained by resampling at 761 MHz (*Ts* = 1.314 ns) is shown with red bullets. The resampling factor is *C* = 0.761, and the coefficient *a*(*n*) is updated by subtracting (1 − *C*)/*C* = 0.3141 from its current value. The variable *b*(*n*) represents the point inside the sampling period where resampling must be performed. The bottom axis shows time, while the top axis shows the increments of the memory pointer. When *a*(*n*) is incremented (times 6, 10, 14, 18 in Figure 6), the memory pointer is not updated.

**Figure 6.** Example sequences for *a*(*n*) and *PtrX*.
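The sequence of operations exemplified in Figure 6 can be emulated in software as follows. This is a behavioral sketch of Equation (12) and of the pointer management, not the HDL design itself; the initial value of *a*(*n*) and the dictionary model of the acquisition memory are assumptions:

```python
def resample(x, C, a0=1.0):
    """Behavioral model of the resampler core.

    a(n) weighs the previous sample and b(n) = 1 - a(n) the current one
    (Equation (12)); on exception cycles (a negative) the output is
    redundant and PtrX is frozen, so the redundant value is overwritten.
    """
    step = (1.0 - C) / C       # quantity subtracted from a(n) each cycle
    a, ptr_x = a0, 0
    mem = {}                   # acquisition memory, indexed by PtrX
    for n in range(1, len(x)):
        if a < 0:              # exception: restore a(n), freeze PtrX
            a += 1.0
        else:
            mem[ptr_x] = a * x[n - 1] + (1.0 - a) * x[n]
            ptr_x += 1
            a -= step
    return [mem[k] for k in range(ptr_x)]
```

For *C* = 0.761 the step is 0.3141 and about 76% of the input samples survive in memory, consistent with the frozen pointer increments shown in Figure 6.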

#### *4.2. Design Details*

A digital circuit for the implementation of the proposed resampling algorithm has been designed. The schematics before and after pipelining are shown in Figures 7 and 8, respectively.

Circuit input data are the signal to be resampled, *x*, and the factor *d* = (*C* − 1)/*C*. The output data are the resampled stream, *y*, and the memory pointer, *PtrX*. The number of bits for *x*, *d*, and *y* is 8, while the memory pointer *PtrX* is represented with 32 bits.

The two complementary coefficients, *a* and *b*, are multiplied by the previous value (*z* signal in Figure 7) and the current value of the input signal (*x* signal in Figure 7), respectively. Afterwards, the two products are summed, in order to produce the output signal, *y*, as indicated in Equation (12).

The updating of the coefficient *a* relies on adding either the quantity *d* or, in the case of an exception, a unitary value to the current value of *a*. In the exception case, *a* is negative and the most significant bit (MSB) of the coefficient is high, *a*[9] = 1; otherwise, *a*[9] = 0 and *d* is added to the current value of *a*. This distinction is realized with a multiplexer, controlled by the MSB of signal *a*. After the correct choice between "1" and "*d*", an accumulator updates *a*.

A second accumulator is implemented for the memory management. When *a* is positive, *a*[9] = 0, *g* = 1, and *PtrX* is incremented by a unitary value. In the case of exception, *a* is negative, *a*[9] = 1, *g* = 0, and *PtrX* remains unchanged. The memory management strategy described above allows storing only the resampled values. The fact that occasionally the memory pointer is not incremented reflects the fact that, after resampling, the number of samples is smaller than that of the input signal.

Figure 8 shows the pipelined resampler. The four vertical dashed lines mark the four pipeline levels introduced into the circuit in order to isolate the combinational logic, thereby achieving a lower clock period and a higher throughput. On the other hand, latency and chip area increase. The number of flip-flops (registers) used for the pipeline implementation is: (8 + 8) + (10 + 10 + 8) + (12 + 1 + 12) + (8 + 32) = 109.

**Figure 7.** Circuital implementation of the proposed algorithm.

**Figure 8.** Circuital implementation of the proposed algorithm with pipeline registers (pipeline levels are highlighted with dashed lines).

#### *4.3. Implementation and Performance*

As mentioned earlier, 8 bits are used for the representation of *d*, where *d* is within (−1, 0). However, a 16-bit representation was also tried out in order to test the performance of the circuit. For an *n*-bit signal, the resolution obtained for *d* is constant and equal to 2^−*n*. The resolution obtained for *C* can be derived from:

$$d(C) = \frac{d(C)}{d(d)}\, d(d) = \frac{d}{d(d)}\!\left(\frac{1}{1-d}\right) 2^{-n} = \frac{2^{-n}}{(1-d)^2}. \tag{13}$$

Given that the relationship between *C* and *d* is not linear, the resulting resolution of *C* (the actual resampling factor) differs for different *d* values. Table 2 presents some information related to the resolution of the resampling factor. Assuming a 1 GHz clock frequency, using 8 bits for the *d* signal allows a frequency resolution that ranges from 390 kHz to 97.5 kHz, while using 16 bits the frequency resolution can be as low as 4 kHz.


**Table 2.** Resampling factor resolution.
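Equation (13) can be cross-checked numerically: the resolution of *C* is the derivative of *C* = 1/(1 − *d*) times the LSB of *d*. The sketch below is an illustration of this relationship (the finite-difference comparison is our addition, not part of the paper):

```python
def resampling_factor(d):
    # C as a function of the programmed coefficient d = (C - 1)/C
    return 1.0 / (1.0 - d)

def resolution_C(d, n_bits):
    # Eq. (13): |dC/d(d)| * 2**-n = 2**-n / (1 - d)**2
    return 2.0 ** -n_bits / (1.0 - d) ** 2

# Finite-difference cross-check at d = -0.5 with an 8-bit LSB:
# stepping d by one LSB changes C by approximately resolution_C(d, n).
d, n = -0.5, 8
lsb = 2.0 ** -n
fd = resampling_factor(d + lsb) - resampling_factor(d)
print(resolution_C(d, n), fd)
```

The two printed values agree to first order, confirming that the resolution of *C* degrades (grows) as *d* approaches 0, i.e., as *C* approaches 1.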

The circuit is described in a hardware description language (HDL), and a first assessment of the performance has been conducted targeting a high-end FPGA, in order to demonstrate the achievable performance in a reconfigurable environment. Table 3 presents some basic features and resource figures for the implementation on a Stratix IV GX FPGA device by Altera. The implemented design is the one depicted in Figure 8, with the *d* signal represented by 8 bits.

**Table 3.** Basic Features of the Resampler and FPGA resources (StratixIV-EP4SGX230KF40 implementation).


Tests similar to those considered in the simulation analyses have been repeated on sinusoidal signals, demonstrating that, when the sampling frequency is at least ten times higher than the signal bandwidth, the results are satisfactory. For instance, in the presence of an input signal corrupted by white Gaussian noise (rms value equal to 15% of the LSB of the ADC) and quantized by an 8 bit ADC, resampling at 743 MHz a 47.1 MHz signal converted with a 1 GSa/s ADC has lowered the ENOB from 7.8 to 7.5 and left the SFDR unaltered, which is quite a limited degradation. The results do not exhibit recognizable changes if 50 kHz random deviations of the input frequency are considered.

An ASIC implementation has also been carried out. The circuit is synthesized by targeting a commercial standard-cell library in 14 nm fin field effect transistor (FinFET) technology from GlobalFoundries. Physical synthesis is performed using Cadence Genus; no special cells are designed for the implementation, and the circuit is automatically synthesized according to timing constraints. The considered technology corner is the typical one, with a 0.8 V supply voltage and regular threshold voltage. The simulations, with delay and switching activity annotation, have been conducted with a suite of tools for the design and verification of ASICs and FPGAs, commonly referred to as NCSIM after the core simulation engine. Power dissipation is computed by simulating the final netlist with 10,000 input vectors from an asynchronously sampled sinusoid, in order to obtain the switching activity of each node.

Aiming for the highest possible frequency, several syntheses were carried out. Firstly, the circuit in Figure 8 was synthesized. Then the same circuit was synthesized with a retiming algorithm, which moves the structural location of registers in order to improve performance while preserving the functional behavior at the outputs. Afterwards, two and three extra pipeline levels were added to the design of Figure 8, and the synthesis was carried out with the retiming algorithm. The same syntheses were done for both an 8-bit and a 16-bit *d* signal, and the results are reported in Tables 4 and 5, respectively. As expected, the maximum working frequency is largely increased with respect to the FPGA implementation. Moreover, there is a trade-off between maximum frequency and both chip area and power consumption.

**Table 4.** ASIC implementation results for the resampler in 14 nm FinFET technology from GlobalFoundries, using 8 bits for the *d* signal.



**Table 5.** ASIC implementation results for the resampler in 14 nm FinFET technology from GlobalFoundries, using 16 bits for the *d* signal.

A comparison between the FPGA implementation (Table 3) and the ASIC design (Tables 4 and 5) in this particular application suggests the following considerations. The FPGA design is composed of quite large blocks and uses the digital signal processing (DSP) blocks to efficiently perform the binary multiplications. This is very useful for the FPGA, which can reach a remarkable speed for a reconfigurable target, but it leaves very little space for arithmetic optimization and for the introduction of pipeline levels (e.g., a pipeline is not possible inside the DSP blocks). On the other hand, the ASIC design exploits a standard-cell library with very small granularity and can choose among various design techniques for the arithmetic blocks; in addition, pipeline levels can be moved freely inside the arithmetic blocks if needed. As a consequence, retiming and pipelining allow a large leap in circuit clock frequency (from 3.03 GHz to 5.26 GHz in the *d* = 8 bit case and from 2.70 GHz to 5.00 GHz in the *d* = 16 bit case).

Implementation results show that the circuit is able to reach the 5 GHz target in both cases. The effect of retiming and of the additional pipeline levels is an increase in both area (mainly due to the additional flip-flops) and power dissipation.

#### **5. Conclusions**

The paper has reviewed the main digital signal processing based methods for controlling the sampling rate in DSOs by means of digital resampling approaches. A digital circuit that offers a promising solution to grant more control of the sampling rate, with respect to the existing approaches, has then been discussed. The circuit can be deployed in the acquisition channel of any DSO to interface the internal ADC and the acquisition memory. It has been implemented on FPGA and evaluated. Also, the performance of an ASIC design of the same circuit has been investigated. The proposed solution can be exploited to effectively improve the sampling rate selection capability of DSOs, especially when the instrument does not permit the use of an external clock to drive the internal ADC.

**Author Contributions:** Methodology, formal analysis, simulations and original draft preparation, M.D.; methodology, circuit design, supervision, E.N.; circuit design, validation, original draft preparation and review, E.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the Project "Vision for Robotic Surgery (ViRoS)", funded by the Departmental of Electrical and Information Technology Engineering, University of Naples Federico II, Italy.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


FinFET fin Field Effect Transistor
DSP Digital Signal Processing

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
