## 5.5.2. Circuit-Level Results

Using the clustering method, the original ALU is partitioned into either three ALUs (3-ALU) or four ALUs (4-ALU). Table 2 reports the area overhead and performance improvement of the 3-ALU and 4-ALU designs relative to the original ALU at 14 nm. The performance improvement calculation also accounts for the delay of the extra multiplexer/demultiplexer circuit in both designs.
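The way the multiplexer/demultiplexer delay enters the performance figure can be illustrated with a minimal sketch. It assumes that performance improvement is measured as the relative reduction in critical-path delay and that the slowest partition plus the steering logic sets the new critical path. The function name, its parameters, and the delay values are hypothetical and are not taken from Table 2.

```python
from typing import Sequence


def performance_improvement(t_orig: float,
                            t_parts: Sequence[float],
                            t_mux: float) -> float:
    """Relative critical-path delay reduction of a partitioned ALU.

    Assumption: the slowest partition plus the multiplexer/demultiplexer
    delay determines the critical path of the partitioned design.
    """
    t_new = max(t_parts) + t_mux
    return (t_orig - t_new) / t_orig


# Hypothetical delays in arbitrary units (not measured values).
gain = performance_improvement(t_orig=100.0,
                               t_parts=[80.0, 70.0, 85.0],
                               t_mux=5.0)
print(f"performance improvement: {gain:.1%}")
```

In this sketch, a larger steering delay directly reduces the reported gain, which is why it has to be included in the comparison.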

As expected, both the 3-ALU and 4-ALU designs occupy more area than the original ALU. The reported area overhead is the percentage increase in the overall ALU area, considering all additional circuits. Note that in modern processors the die area is dominated by memory elements [123]; hence, the increased ALU area does not contribute a significant change to the overall die area. On the other hand, these ALUs are faster than the original ALU because the logic implemented in each partitioned ALU is simpler and more coherent. As the supply voltage is reduced, the performance gain diminishes slowly, because process variation affects the shorter critical paths of the smaller ALUs more than the long critical path of the original ALU.


**Table 2.** Energy improvement results for 3-ALU and 4-ALU over Original ALU for 14 nm PTM [117].

\* In this technology *Vth* is close to 0.35 V.

Additionally, the overall energy improvement achieved by the proposed ALU partitioning method is reported in Table 2. It is obtained by calculating, for each workload, the percentage of time that each smaller ALU can be power-gated under a given power-gating threshold (*PGTH*). As presented in Table 2, there is a significant energy improvement even in the super-threshold region (*Vdd* = 0.80 V). The reason is that one of the partitioned ALUs is mostly in sleep mode because its instructions are rarely used (see Figures 9 and 14). The percentage of time an ALU is power-gated for a specific *PGTH* is the same for all supply voltages; however, the energy improvement from the proposed ALU partitioning technique is slightly larger at lower supply voltages. The reason is that the contribution of leakage power to the overall power consumption grows significantly as the supply voltage is reduced. Therefore, eliminating the same amount of leakage yields a larger percentage of energy saving at lower supply voltages.
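The two steps of this argument can be sketched in a few lines of Python. The sketch assumes that *PGTH* is the number of consecutive idle cycles after which a unit is power-gated, that wake-up is instantaneous, and that power-gating transitions cost no energy. The function names, the activity trace, and the leakage shares are hypothetical illustrations, not the authors' simulation flow or measured data.

```python
from typing import Sequence


def gated_fraction(busy: Sequence[bool], pgth: int) -> float:
    """Fraction of cycles a unit spends power-gated.

    Assumption: the unit is gated once it has been idle for more than
    `pgth` consecutive cycles and wakes up on its next use.
    """
    if pgth < 0:
        raise ValueError("pgth must be non-negative")
    if len(busy) == 0:
        return 0.0
    gated = 0
    idle_run = 0
    for is_busy in busy:
        if is_busy:
            idle_run = 0          # any use resets the idle counter
        else:
            idle_run += 1
            if idle_run > pgth:   # threshold exceeded: this cycle is gated
                gated += 1
    return gated / len(busy)


def energy_saving(leakage_share: float,
                  unit_leakage_share: float,
                  gated_frac: float) -> float:
    """Fraction of total energy saved by gating one unit.

    leakage_share: leakage as a fraction of total energy at a given Vdd.
    unit_leakage_share: the gated unit's share of the total leakage.
    """
    return leakage_share * unit_leakage_share * gated_frac


# Hypothetical trace: True means the unit executes an instruction.
trace = [True] + [False] * 30 + [True] + [False] * 8
frac = gated_fraction(trace, pgth=4)

# The same gated fraction saves more when leakage dominates (lower Vdd).
for label, leak in (("higher Vdd", 0.2), ("lower Vdd", 0.5)):
    print(label, f"{energy_saving(leak, 0.25, frac):.1%}")
```

Because `gated_fraction` depends only on the instruction trace and *PGTH*, it is identical at every supply voltage; only `leakage_share` changes with voltage, which is what makes the relative saving grow at lower supply voltages in this simplified model.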

Figure 15 compares the energy improvement results for the 4-ALU design in the near-threshold region (0.35 V) at two technology nodes: 10 nm and 14 nm. As shown, the energy improvement results for these two nodes closely resemble each other. A similar trend is observed for the rest of the results.

**Figure 15.** Energy improvement of functional unit partitioning, on an ALU partitioned into 4 smaller units (4-ALU), for 10 and 14 nm technology nodes.

## 5.5.3. Performance and Reliability Trade-Off

The performance improvement of the functional unit partitioning, shown in Table 2, could be traded for reliability at low supply voltages, similar to what was explained for instruction multi-cycling in Section 5.4.

According to the results, it is possible either to enjoy the maximum performance improvement (for example, 11% at 0.80 V) or to invest the gained speed in reliability to get up to 10<sup>9</sup> times lower failure probability. At the near-threshold supply voltage (0.35 V), the maximum achievable reliability improvement is 11.5× (obtained by trading the 7% performance improvement from Table 2), because the delay distribution of the union ALU becomes wider than that of the original ALU due to its shorter critical path.
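The shape of this trade-off can be illustrated with a minimal sketch. It assumes that the critical-path delay is normally distributed and that a timing failure occurs whenever the delay exceeds the clock period. Both the Gaussian model and all numbers below are assumptions made for illustration; they are not the failure model or the data behind the reported figures.

```python
import math


def timing_failure_probability(clock_period: float,
                               mean_delay: float,
                               sigma: float) -> float:
    """P(delay > clock_period) for a normally distributed path delay."""
    z = (clock_period - mean_delay) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical, normalized values: the clock period is kept at the
# original setting, so any speed-up turns into extra timing slack.
clock = 1.10
p_original = timing_failure_probability(clock, mean_delay=1.00, sigma=0.05)

# Faster partitioned unit with the same delay spread.
p_narrow = timing_failure_probability(clock, mean_delay=0.90, sigma=0.05)

# Same speed-up, but a wider delay distribution (stronger variation).
p_wide = timing_failure_probability(clock, mean_delay=0.90, sigma=0.08)

print(f"reliability gain, same spread:  {p_original / p_narrow:.1f}x")
print(f"reliability gain, wider spread: {p_original / p_wide:.1f}x")
```

Under these assumptions the same speed-up buys far less reliability once the delay distribution widens, which is consistent with the smaller improvement reported at the near-threshold voltage.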
