In this section, we present the circuit and system setup we used for evaluation and discuss our findings.
3.1. Setup
NbAs-based interconnect: A simple transmitter–receiver test circuit is used to simulate and compare the behavior of Cu and NbAs nanoribbons. A pulse of amplitude 1 V, initial delay 1 ns, rise and fall times of 1 ps each, period 2 ns, and duty cycle of 50% is applied to a CMOS transmitter inverter. The output of this inverter is then passed through a Cu/NbAs nanoribbon. Buffers are placed at regular intervals along the ribbon, each sized four times the transmitter inverter. At the end of the ribbon, the output pulse drives the receiver inverter; the receiver size is varied in one of our experiments. We used 65 nm, 45 nm, and 22 nm predictive technology models (PTM) [13] for the simulations. The resistivity values for Cu and NbAs were taken from the plot in Figure 6 [14].
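To make the delay estimation concrete, the following Python sketch models the nanoribbon as a repeater-driven distributed RC line and computes an Elmore-style delay. It is only an illustration: the resistivity, capacitance, and buffer values are placeholders, not the PTM or Figure 6 values used in the actual simulations.

```python
# Minimal sketch: Elmore-style delay of a repeater-inserted nanoribbon.
# All material and device numbers are illustrative placeholders, not the
# values extracted from Figure 6 or the PTM models.

def wire_resistance(rho_ohm_m, length_m, width_m, thickness_m):
    """Resistance of a rectangular nanoribbon: R = rho * L / (W * t)."""
    return rho_ohm_m * length_m / (width_m * thickness_m)

def segment_delay(r_drv, c_drv, r_wire, c_wire):
    """Delay of one buffer-driven segment: lumped driver + distributed wire."""
    return 0.69 * r_drv * (c_drv + c_wire) + 0.38 * r_wire * c_wire

def line_delay(rho, length, width, thickness, n_segments,
               r_buf=1e3, c_buf=1e-15, c_per_m=2e-10):
    """Total delay of a wire split into n_segments repeater-driven pieces."""
    seg_len = length / n_segments
    r_seg = wire_resistance(rho, seg_len, width, thickness)
    c_seg = c_per_m * seg_len
    return n_segments * segment_delay(r_buf, c_buf, r_seg, c_seg)

if __name__ == "__main__":
    # Placeholder resistivities (ohm-m); the study uses the plot in [14].
    rho_cu, rho_nbas = 2.2e-8, 1.0e-8
    for name, rho in (("Cu", rho_cu), ("NbAs", rho_nbas)):
        d = line_delay(rho, length=10e-3, width=0.1e-6,
                       thickness=200e-9, n_segments=10)
        print(f"{name}: {d * 1e9:.2f} ns")
```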
gem5: The gem5 simulator provides a diverse set of CPU models, instruction set architectures (ISAs), memory systems, etc., which support computer architecture research. gem5 offers two main system modes and four different CPU models [15]. For our experiments, we used system-call emulation (SE) mode and the out-of-order CPU model. gem5 has modular support for multiple ISAs (ARM, RISC-V, SPARC, etc.); we ran our simulations on the x86 architecture. All of the experiments were performed using the detailed gem5 processor model, DerivO3CPU, clocked at 2 GHz.
Table 1 shows the full gem5 system configuration used for simulations.
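For reference, the fragment below sketches how processor and cache parameters of the kind listed in Table 1 map onto a gem5 SE-mode configuration script using the classic cache model. It is only an illustrative excerpt: the numerical values are placeholders rather than the exact Table 1 settings, and a complete script would also instantiate the memory bus, interrupt controller, workload process, and Root object.

```python
# Fragment of a gem5 SE-mode configuration (classic memory system).
# Parameter values are placeholders, not the exact Table 1 settings.
from m5.objects import (System, SrcClockDomain, VoltageDomain,
                        DerivO3CPU, Cache)

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"                 # required by the O3 CPU model
system.cpu = DerivO3CPU()                  # detailed out-of-order CPU

# Classic-cache latencies are expressed in cycles; these are the knobs we
# later scale down to reflect the NbAs interconnect delay improvement.
system.cpu.icache = Cache(size="32kB", assoc=8,
                          tag_latency=2, data_latency=2, response_latency=2,
                          mshrs=4, tgts_per_mshr=20)
system.cpu.dcache = Cache(size="32kB", assoc=8,
                          tag_latency=2, data_latency=2, response_latency=2,
                          mshrs=4, tgts_per_mshr=20)
system.l2cache = Cache(size="1MB", assoc=16,
                       tag_latency=20, data_latency=20, response_latency=20,
                       mshrs=20, tgts_per_mshr=12)
```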
gem5 supports a number of benchmark suites such as SPEC CPU 2017, SPLASH-2, NPB (NAS Parallel Benchmarks), etc. We used the PARSEC 3.0 benchmark suite [16] for our evaluation. It includes a wide spectrum of emerging applications in recognition, mining, and synthesis (RMS) as well as systems applications that mimic large-scale multithreaded commercial programs. Out of the 13 available main benchmarks, we used 5 for our experiments, namely Blackscholes, which analytically calculates the prices for a portfolio of European options with the Black–Scholes partial differential equation; Canneal, which uses cache-aware simulated annealing to optimize the routing cost of a chip design; Fluidanimate, which simulates fluid dynamics for animation using smoothed particle hydrodynamics; Raytrace, which performs real-time ray tracing; and Streamcluster, which solves the online clustering of an input stream.
Additionally, we used another CPU benchmark released by [17] to evaluate system performance on deep learning workloads. The benchmark tests a simple MLP with three hidden layers (1024 neurons in each of the first two layers and 26 neurons, the number of classes, in the final layer) after it was trained on a synthetic dataset. All simulations were performed on gem5 version 20.1.0 on an Ubuntu system with an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz and 16 GB of RAM.
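For reference, the snippet below sketches the shape of such an MLP as a plain NumPy forward pass. It is only an illustration of the layer structure, not the benchmark code from [17]; the 784-feature input width and the interpretation of the layer sizes as 1024–1024–26 are assumptions.

```python
# Illustrative NumPy forward pass matching the described MLP shape.
# The input width (784) is assumed; the benchmark in [17] defines its own.
import numpy as np

def mlp_forward(x, layer_sizes=(784, 1024, 1024, 26), seed=0):
    """Run x (batch, in_features) through ReLU hidden layers + linear output."""
    rng = np.random.default_rng(seed)
    h = x
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        w = rng.standard_normal((n_in, n_out)) * 0.01
        b = np.zeros(n_out)
        h = h @ w + b
        if i < len(layer_sizes) - 2:      # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h                              # (batch, 26) class scores

scores = mlp_forward(np.ones((8, 784)))
print(scores.shape)                       # (8, 26)
```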
3.2. Results
Hybrid poly-WSM interconnect: For the evaluation of the hybrid poly-WSM interconnect, we considered two test cases, namely:
(a) DRAM/eDRAM: The wordline of DRAM/eDRAM designs is typically routed in poly-Si to achieve the best memory density. However, this severely degrades wordline (WL) performance due to the high poly-Si resistivity. To recover the performance, the WLs are also routed in metal-3 and the poly-Si WLs are occasionally shorted to metal-3 (called strapping). Since metal-3 is much faster than poly-Si, the worst-case WL delay is alleviated with this approach. However, the contacts to poly-Si degrade the array density. By tuning the number of metal-3/poly-Si connections (straps), one can trade off performance against memory density. To evaluate DRAM/eDRAM performance with the hybrid poly-WSM interconnect, we assume three baseline cases of 256-bit DRAM designs. In Figure 7a, pure poly is used to drive the WL; this has the best packing density but turns out to be the slowest. In Figure 7b, the WL is driven by poly, and M3 is used to strap the WL every 16 bits; this provides the worst density but is the fastest. Figure 7c is similar to the second case, except that M3 straps the WL every 32 bits. We also evaluated the same designs with 512-bit DRAMs, again replacing the poly routing with the poly-WSM hybrid interconnect.
(b) D-FF: D-FFs are used in chip designs in large quantities (tens of thousands). Therefore, a compact and high-performance D-FF design is key to achieving area and energy efficiency. D-FFs also suffer from congestion due to vias and metals, making them a perfect candidate for hybrid poly-WSM exploration.
Area vs. performance trade-off for hybrid poly-WSM interconnect: We use hybrid poly-WSM to route the WLs and compare the performance for all three DRAM/eDRAM cases. We note that the highest improvement in performance is obtained when we compare the delay of a regular poly interconnect with that of a hybrid poly-WSM interconnect without strapping (Table 2): the delay becomes one-third of that for the usual poly interconnect. When M3 is used to strap the wordline every 16 bits, the improvement is approximately 12%. The performance gain is modest (approximately 7–8%) when the strapping is performed every 32 bits.
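To illustrate why the strap spacing matters, the following sketch estimates the worst-case distributed RC delay of the WL segment between two metal-3 straps. The per-cell resistance and capacitance values are placeholders chosen only to show the roughly quadratic dependence of delay on the number of cells between straps; they are not extracted layout parasitics.

```python
# Illustrative worst-case WL delay between metal-3 straps.
# R/C values per cell are placeholders, not extracted layout parasitics.

def wl_segment_delay(r_per_cell, c_per_cell, cells_between_straps):
    """Distributed RC delay ~ 0.38 * R_total * C_total for the segment."""
    r_total = r_per_cell * cells_between_straps
    c_total = c_per_cell * cells_between_straps
    return 0.38 * r_total * c_total

R_POLY, R_HYBRID = 200.0, 70.0      # ohms per cell (placeholders)
C_CELL = 0.5e-15                    # farads per cell (placeholder)

for label, cells in (("no strap", 256), ("per 16 bits", 16), ("per 32 bits", 32)):
    d_poly = wl_segment_delay(R_POLY, C_CELL, cells)
    d_hyb = wl_segment_delay(R_HYBRID, C_CELL, cells)
    print(f"{label:12s}: poly {d_poly*1e12:7.1f} ps, hybrid {d_hyb*1e12:7.1f} ps")
```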
For a D-FF, we note from Figure 8a that M1 vias limit the footprint. The hybrid interconnects (Figure 8b) are used to route some of the signals, which in turn eliminates the M1 routing and vias and reduces the area by 15.6%. However, this has some impact on performance: in this case, the clk-Q and D-Q delays are degraded. The delay is, however, still better for a hybrid poly design than for a pure poly-based design and can be further optimized (Table 3). The poly-WSM hybrid interconnects can be used in off-critical paths to avoid performance issues while still reaping the benefits of area reduction.
Pathfinding for a higher-level interconnect: To evaluate the performance of the NbAs-based interconnect, we first varied the length of the ribbon from 1 mm to 10 mm (width: 0.1 μm) and measured the variation in propagation delay and resistance. As expected, the resistance (Figure 9a) and propagation delay (Figure 10b) increase with the nanoribbon length. Furthermore, NbAs is a better conductor than Cu, as is evident from the plots, providing lower delay and resistance. Figure 10a shows that the maximum possible delay improvement for the NbAs ribbon at the 22 nm process technology is 20.9%.
We also varied the width of the ribbon from 0.025 μm to 0.5 μm (length: 10 mm) and observed the delay and resistance. Increasing the width increases the cross-sectional area of the ribbon (at a fixed thickness of 200 nm), so the resistance drops (Figure 9b), subsequently reducing the propagation delay (Figure 10a). We observed that the maximum delay improvement provided by NbAs is 35.88% for the 45 nm process technology when the width of the ribbon is 0.025 μm. Figure 14a shows the maximum delay improvement for the various process technologies.
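The sketch below illustrates, with placeholder parasitics, why the relative NbAs advantage is largest for the narrowest ribbon: when the ribbon is narrow, the wire resistance dominates the delay and the full resistivity gap shows up, whereas for wide ribbons the driver resistance masks part of it. The resistivity, capacitance, and driver values are illustrative only, not simulation data.

```python
# Illustrative width sweep: why the NbAs delay advantage is largest for
# the narrowest ribbon. All numbers are placeholders, not simulation data.

def delay(rho, width, length=10e-3, thickness=200e-9,
          r_drv=1e3, c_per_m=2e-10):
    """Lumped driver + distributed-wire delay (0.69/0.38 RC model)."""
    r_wire = rho * length / (width * thickness)
    c_wire = c_per_m * length
    return 0.69 * r_drv * c_wire + 0.38 * r_wire * c_wire

RHO_CU, RHO_NBAS = 2.2e-8, 1.0e-8     # ohm-m, placeholder resistivities

for width in (0.025e-6, 0.1e-6, 0.5e-6):
    d_cu, d_nbas = delay(RHO_CU, width), delay(RHO_NBAS, width)
    gain = 100.0 * (d_cu - d_nbas) / d_cu
    print(f"width {width*1e6:5.3f} um: delay improvement {gain:4.1f}%")
```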
A general guideline for the slew time is to keep it below 100 ps to mitigate potential signal integrity issues. For this study, the ratio of the widths of the transmitter and receiver inverters is varied from 1:4 to 1:36. From Figure 11, we note that NbAs outperforms Cu in terms of slew time. As the size of the receiver inverter increases, the propagation delay also increases. Figure 12 shows that the percentage delay improvement for NbAs with respect to Cu increases as the transmitter-to-receiver width ratio increases. When the width of the ribbon is varied with a fixed receiver inverter size (Figure 13), the slew time falls, since the resistance falls with increasing width.
System-level performance evaluation: Figure 10a shows that for the 22 nm process technology (with the length and width of the wire fixed at 10 mm and 25 nm, respectively), the propagation delay for NbAs is 35.28% lower than for Cu. Table 4 lists the interconnect properties corresponding to this propagation delay improvement. As discussed earlier, although the total cache latency depends on both the interconnect and cache access latency, the interconnect latency is the major bottleneck because access latencies are very low, typically 1 cycle [18]. Hence, with this delay improvement in mind, we reduced both the L1 and L2 cache latencies, keeping all other parameters constant, to analyze the performance improvement provided by NbAs interconnects. From Figure 14b, we can see that, as expected, with a decrease in cache latency (in the case of NbAs), the IPC increases. This is because a lower cache latency means the cache requires fewer clock cycles to perform the required operations, reducing the total clock cycles used by the overall system and thus increasing the number of instructions executed per cycle, i.e., the IPC. The IPC improvement ranges from 12.7% for canneal to as high as 23.8% for streamcluster. The average improvement in IPC provided by NbAs across all benchmarks is 18.56%. The actual improvement would vary depending on the workload; intuitively, workloads that require frequent memory accesses would benefit more.
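The sketch below shows one simple way to translate a propagation-delay improvement into the reduced cache latencies supplied to gem5. The baseline cycle counts are placeholders rather than the exact Table 1 values, and the rounding to whole cycles is our own simplification for illustration.

```python
# Illustrative translation of the interconnect delay improvement into the
# cache-latency parameters passed to gem5. Baseline cycle counts are
# placeholders, not the exact Table 1 values.
import math

def scaled_latency(baseline_cycles, delay_improvement_pct):
    """Scale a cache latency by the interconnect delay improvement."""
    scaled = baseline_cycles * (1.0 - delay_improvement_pct / 100.0)
    return max(1, math.ceil(scaled))   # latencies must stay >= 1 cycle

IMPROVEMENT = 35.28                    # % (22 nm node, 25 nm wide ribbon)
for name, base in (("L1", 2), ("L2", 20)):
    print(f"{name}: {base} cycles -> {scaled_latency(base, IMPROVEMENT)} cycles")
```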
We further considered a more realistic global interconnect scenario where the width and length of the interconnect wire are 100 nm and 10 mm, respectively. The corresponding propagation delay improvement for the 22 nm node is found to be 20.9% (
Figure 10b).
We reran our gem5 simulations considering this improvement and found the IPC to improve by up to 15.7% in the case of the blackscholes benchmark. The average IPC improvement across all the benchmarks was 13.67%.
We also analyzed the total execution time (ET) improvement provided by NbAs interconnects. We report this in addition to the IPC improvement because the total execution time is a widely recognized performance metric, and the expression for it below is known as the “Iron Law of Performance” [19].
The total execution time for a single-thread program with N instructions can be calculated as follows:

ET = (N × CPI) / f,

where CPI refers to the clock cycles per instruction and f refers to the CPU clock frequency.
The Iron Law of Performance is useful because its terms correspond to the sources of performance. The instruction count N is determined by the instruction-set architecture (ISA) and the compiler; CPI is determined by the microarchitecture and the circuit-level implementation; and f is determined by the circuit-level implementation and the technology. Improving any one of these three performance sources improves overall performance [20].
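As a worked illustration of the Iron Law, the sketch below holds N and f fixed and maps an IPC gain (with CPI = 1/IPC) onto an execution-time reduction. The instruction count, frequency, and IPC values are placeholders, not measured benchmark figures.

```python
# Worked Iron Law example: ET = N * CPI / f, with CPI = 1 / IPC.
# Instruction count, frequency, and IPC values are placeholders.

def execution_time(n_instructions, ipc, freq_hz):
    """Iron Law of Performance with CPI expressed as 1/IPC."""
    cpi = 1.0 / ipc
    return n_instructions * cpi / freq_hz

N, F = 1e9, 2e9                        # 1 billion instructions at 2 GHz
et_cu = execution_time(N, ipc=1.00, freq_hz=F)
et_nbas = execution_time(N, ipc=1.19, freq_hz=F)   # ~19% higher IPC
print(f"ET improvement: {100 * (et_cu - et_nbas) / et_cu:.1f}%")
```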
Figure 15a shows the execution time improvement (in %) provided by NbAs over Cu. As expected, the trend mirrors the IPC improvement: the gain varies from 11% for the canneal benchmark to 19% for the streamcluster benchmark. The average improvement in execution time across all benchmarks is 15.95%.
Table 5 displays the host memory usage and L2 cache miss rate for the various benchmarks when run on the gem5 simulator to provide a better understanding of the workload of these benchmarks.
We also tested the ML benchmark with different workloads (model sizes) to see the impact on IPC and the total memory used. Specifically, we varied the layer sizes of the neural network model to create workloads of different sizes. Table 6 shows the memory used by the different benchmark workloads (models), with hidden layer sizes ranging from 512 to 8192 neurons, on the gem5 simulator. As expected, the host memory used by the benchmark increases as the model size increases, since the number of parameters grows.
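To see why the host memory grows with model size, the sketch below counts the trainable parameters of the MLP for a range of hidden-layer sizes, assuming the same 784-feature input and 26-class output as before. The float32 footprints are a rough lower bound on model memory, not the gem5 host-memory figures reported in Table 6.

```python
# Rough parameter-count and float32 memory estimate for the MLP workloads.
# Input width (784) and the two-hidden-layer interpretation are assumptions.

def mlp_params(hidden, n_in=784, n_out=26):
    """Weights + biases for an in -> hidden -> hidden -> out MLP."""
    sizes = (n_in, hidden, hidden, n_out)
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

for hidden in (512, 1024, 2048, 4096, 8192):
    p = mlp_params(hidden)
    print(f"hidden={hidden:5d}: {p/1e6:6.2f} M params, "
          f"~{p * 4 / 2**20:7.1f} MiB as float32")
```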
Figure 15b shows the IPC for Cu and NbAs on the ML benchmark with the different workloads. From the figure, we see that there is only a minimal change in IPC as the neural network model size increases. However, the IPC improvement increases by 2% as we increase the hidden layer size from 1024 to 8192 neurons.