## 4.3.1. Instruction Pattern Analysis

A careful analysis of the instruction patterns in a set of representative workloads informs the ALU partitioning. For this purpose, we first simulate the execution of the workloads with an architectural simulation tool, as explained in Section 5, and then extract the utilization frequency and the temporal distance of different instructions from the resulting instruction streams.

**Instruction utilization frequency:** The instruction utilization frequencies presented in Figure 9 differ significantly among the instructions. The utilization frequency of an instruction is inversely related to the power-gating feasibility of the ALU that implements it. For example, the ALUs that contain the instructions ADDQ and BIS are less likely to be power-gated because these instructions appear in the instruction buffer on average every 10 cycles. Conversely, ALUs that implement rarely used instructions, such as S8ADDL on the right side of Figure 9, are more likely to be power-gated.

The utilization frequency of an instruction can be simply defined as the number of cycles in which the instruction is executed divided by total cycles. For a given instruction stream **S**, which is a sequence containing *N* = |**S**| instructions, we can define the frequency of instruction A as:

$$Freq_{\mathsf{A}} = \frac{\#\{\,\mathsf{S}_i \in \mathbf{S} \text{ such that } \mathsf{S}_i = \mathsf{A}\,\}}{N} \tag{5}$$

where $\mathsf{S}_i$ is the *i*-th element (instruction) of **S**. If an instruction is used rarely, it can easily be grouped into any existing ALU. However, we need to be more cautious when grouping frequently used instructions, because the grouping strategy might improve or deteriorate the power-gating capability of the ALUs. Based on this analysis, we define the frequency distance metric between two instructions A and B as the geometric mean of their frequencies:

$$dist_{\mathsf{A-B}}^{freq} = \sqrt{Freq_{\mathsf{A}} \cdot Freq_{\mathsf{B}}}. \tag{6}$$

Such a definition yields a small distance between two rarely used instructions, which facilitates partitioning them into a single ALU.
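As an illustration, the frequency in Equation (5) and the frequency distance in Equation (6) could be computed along the following lines. This is a minimal sketch, not the tooling used in this work: the function names and the ten-instruction stream are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter
from math import sqrt


def instruction_frequencies(stream):
    """Eq. (5): occurrences of each instruction divided by the stream length N."""
    n = len(stream)
    counts = Counter(stream)
    return {op: c / n for op, c in counts.items()}


def frequency_distance(freq, a, b):
    """Eq. (6): geometric mean of the utilization frequencies of a and b."""
    return sqrt(freq.get(a, 0.0) * freq.get(b, 0.0))


# Hypothetical instruction stream (illustrative data, not a measured workload).
stream = ["ADDQ", "BIS", "ADDQ", "LDA", "ADDQ",
          "BIS", "ADDL", "ADDQ", "BIS", "S8ADDL"]

freq = instruction_frequencies(stream)
d_frequent = frequency_distance(freq, "ADDQ", "BIS")   # two frequent instructions
d_rare = frequency_distance(freq, "ADDL", "S8ADDL")    # two rare instructions
print(freq["ADDQ"], d_frequent, d_rare)
```

In this toy stream the pair of rare instructions ends up closer together than the pair of frequent ones, which is the behavior the metric is meant to produce.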

**Instruction temporal distance:** Some instructions are more likely to appear next to each other in an application. For example, in the "bzip2" workload, we observed that ADDL follows LDA at an average distance of 2.97 instructions (see Figure 10). In these cases, it could be beneficial to group these neighboring instructions inside one ALU in order to improve the power-gating interval for the other ALUs.

The temporal distance between two instructions A and B can be extracted based on the number of cycles between occurrences of A and B. For example, according to the results presented in Figure 10, there are 19146 LDA instructions that are directly followed by an ADDL instruction, and there are 11860 LDA instructions that are followed by an ADDL instruction after three cycles. Based on the workload analysis, a distribution is extracted for every instruction pair A-B, which gives the percentage of the A-B pairs that are separated by *k* cycles. The Survival Function (SF) (defined as 1 − *CDF* of a distribution) of these *temporal distance* distributions is useful in the clustering problem definition in Section 4.3.2.

We define a *temporal distance set*, which contains pairs (*i*, *k*) consisting of the index *i* of an occurrence of A and the distance *k* to the next consecutive occurrence of B:

$$\text{Temporal}_{\mathsf{A-B}} = \left\{ (i, k) \mid 0 < i,\ 0 < k,\ \mathsf{S}_{i} = \mathsf{A},\ \mathsf{S}_{i+k} = \mathsf{B},\ \not\exists j,\ 0 < j < k,\ \mathsf{S}_{i+j} \in \{\mathsf{A}, \mathsf{B}\} \right\}. \tag{7}$$
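The set in Equation (7) can be collected in a single pass over the instruction stream. The sketch below is one possible way to do so; the function name `temporal_set` and the short stream are hypothetical, and indices are 1-based to match the equation.

```python
def temporal_set(stream, a, b):
    """Collect (i, k) pairs as in Eq. (7).

    i is the 1-based index of an occurrence of `a`, and k is the distance to
    the next occurrence of `b`, with no other `a` or `b` in between.
    """
    pairs = []
    last_a = None  # index of the most recent `a` that is still unmatched
    for idx, op in enumerate(stream, start=1):
        if op == b and last_a is not None:
            pairs.append((last_a, idx - last_a))
            last_a = None  # a later `b` would have this `b` in between
        if op == a:
            last_a = idx   # a newer `a` supersedes any earlier unmatched `a`
    return pairs


# Hypothetical stream (illustrative data only).
stream = ["LDA", "ADDL", "LDA", "BIS", "BIS",
          "ADDL", "LDA", "LDA", "ADDL", "ADDL"]
pairs = temporal_set(stream, "LDA", "ADDL")
print(pairs)
```

Here the LDA at index 7 is skipped because another LDA follows it before the next ADDL, and the final ADDL is skipped because no unmatched LDA precedes it.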

Accordingly, the Probability Mass Function (*PMF*) based on distance *k* is calculated as:

$$PMF_{\mathsf{A-B}}(k) = \frac{\left|\{\,(i, k') \in \text{Temporal}_{\mathsf{A-B}} \mid k' = k\,\}\right|}{\left|\text{Temporal}_{\mathsf{A-B}}\right|}. \tag{8}$$

The extracted *PMF* is then used to find the *SF* as follows:

$$CDF_{\mathsf{A-B}}(k) = \sum_{i=1}^{k} PMF_{\mathsf{A-B}}(i), \tag{9}$$

$$SF_{\mathsf{A-B}}(k) = 1 - CDF_{\mathsf{A-B}}(k). \tag{10}$$

**Figure 10.** The temporal distance between LDA and ADDL instructions in the "bzip2" workload (simulation for 2 million cycles). There are 19,146 cases in which the ADDL instruction appeared right after LDA. The average distance is 2.97. The results are obtained using the gem5 simulator as explained in Section 5.3.

In a hypothetical scenario where only instructions A and B exist and they are assigned to two separate ALUs, the *SF* describes the power-gating possibility. In this case, if the minimum number of idle cycles required to perform power-gating (the power-gating threshold) is *PGTH*, then $SF_{\mathsf{A-B}}(PGTH)$ gives the power-gating probability. Therefore,

$$dist_{\mathsf{A-B}}^{temporal} = SF_{\mathsf{A-B}}(PGTH) \tag{11}$$

can be used as the temporal distance metric between two instructions A and B.
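The chain from distances to the temporal distance metric can be sketched as follows, taking the *PMF* to be the fraction of A-B pairs that are exactly *k* cycles apart, as described above. The function names and the list of distances are hypothetical; the threshold of 10 cycles is the estimate quoted in Section 4.3.3.

```python
from collections import Counter


def pmf_from_distances(distances):
    """Fraction of A-B pairs observed at each distance k."""
    total = len(distances)
    counts = Counter(distances)
    return {k: c / total for k, c in counts.items()}


def survival(pmf, k):
    """Eqs. (9)-(10): SF(k) = 1 - CDF(k), with CDF summed over distances <= k."""
    cdf = sum(p for d, p in pmf.items() if d <= k)
    return 1.0 - cdf


def temporal_distance_metric(distances, pgth):
    """Eq. (11): survival function evaluated at the power-gating threshold."""
    return survival(pmf_from_distances(distances), pgth)


PGTH = 10  # power-gating threshold in cycles (estimate from Section 4.3.3)

# Hypothetical A-B distances in cycles (illustrative data only).
distances = [1, 1, 3, 12, 15, 2, 1, 20, 3, 11]
d_temporal = temporal_distance_metric(distances, PGTH)
print(d_temporal)
```

Four of the ten hypothetical distances exceed the threshold, so the metric evaluates to roughly 0.4 for this input.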

**Instruction similarity:** Many instructions share gates in the ALU because of their similarity. For example, an ALU could have different addition and subtraction instructions, which are inherently similar. As a result, these instructions share a large portion of gates in the synthesized netlist. Implementing these instructions in separate ALUs would impose redundant structures, leading to undesirable leakage and area overhead. It is therefore preferable to group such instructions into one ALU to reduce the associated overheads.

We introduce a dissimilarity metric defined as the structural dissimilarity between instructions ($dist_{\mathsf{A-B}}^{dissimilarity}$). For example, $dist_{\mathsf{ADDL-ADDQ}}^{dissimilarity} = 0.0$ as both addition instructions implement similar functionality. However, $dist_{\mathsf{ADDL-ORNOT}}^{dissimilarity} = 1.0$ as the corresponding instructions implement two completely different logic structures. The dissimilarity values are assigned based on the knowledge we have about the logic implementation of different instructions.

Some of the above parameters may lead to contradictory grouping of instructions into ALUs. For example, instructions S8ADDL and ADDL should be grouped into one ALU because of inherent similarity; however, according to their utilization frequencies they should be placed into different ALUs to allow power-gating. In the next section, we define a formal clustering problem considering the aforementioned parameters and solve it to find the best instruction grouping strategy.

## 4.3.2. Instruction Clustering Problem Definition

The problem of partitioning a large ALU into smaller ALUs can be defined as a clustering problem, in which the distance between the instructions is determined by temporal proximity, utilization frequency, and similarity of instructions. The goal of such a clustering algorithm is to maximize the distance between ALUs while minimizing the distance between the instructions within each ALU. This allows us to increase the overall power-gating likelihood of the ALUs, which leads to lower leakage and better energy efficiency.

For this purpose, we apply the Agglomerative Hierarchical Clustering (AHC) algorithm [108] to cluster the instructions into several groups, each group implemented in one ALU. AHC is suitable for our problem because we can provide a pairwise distance for every pair of instructions.

We create the pairwise distance matrix needed for the AHC algorithm based on the frequency distance metric ($dist_{\mathsf{A-B}}^{freq}$), the temporal distance metric ($dist_{\mathsf{A-B}}^{temporal}$), and the structural dissimilarity metric ($dist_{\mathsf{A-B}}^{dissimilarity}$) introduced in the previous section. Finally, the elements of the pairwise distance matrix (*pdist*) are obtained as the Euclidean distance in a 3D space:

$$pdist_{\mathsf{A-B}}^2 = (\beta \cdot dist_{\mathsf{A-B}}^{freq})^2 + (\gamma \cdot dist_{\mathsf{A-B}}^{temporal})^2 + (\lambda \cdot dist_{\mathsf{A-B}}^{dissimilarity})^2. \tag{12}$$

Here, *β*, *γ*, and *λ* are coefficients that bring the three metrics to a common scale.
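The combination in Equation (12) amounts to a weighted Euclidean norm of the three metrics. A minimal sketch is given below; the function name, the default coefficients of 1.0, and the input values are assumptions made for illustration, since the actual coefficient values are not stated here.

```python
from math import sqrt


def pairwise_distance(d_freq, d_temporal, d_dissim,
                      beta=1.0, gamma=1.0, lam=1.0):
    """Eq. (12): weighted Euclidean combination of the three metrics.

    beta, gamma, and lam default to 1.0 purely as placeholders.
    """
    return sqrt((beta * d_freq) ** 2
                + (gamma * d_temporal) ** 2
                + (lam * d_dissim) ** 2)


# Hypothetical metric values for one instruction pair (illustrative only).
pd_equal = pairwise_distance(0.3, 0.4, 0.0)
# Doubling the weight of the dissimilarity term for a structurally different pair.
pd_weighted = pairwise_distance(0.3, 0.4, 1.0, lam=2.0)
print(pd_equal, pd_weighted)
```

Evaluating this function for every instruction pair fills the symmetric *pdist* matrix that the clustering step consumes.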

We use the single-linkage clustering method with AHC. In the single-linkage method, the linkage function *D*(*X*, *Y*), which is the distance between two clusters *X* and *Y*, is defined as the minimum distance between any two members, one taken from each cluster:

$$D(X,Y) = \min_{\mathsf{A}\in X,\ \mathsf{B}\in Y} pdist_{\mathsf{A-B}}. \tag{13}$$

Therefore, maximizing the distance between clusters *X* and *Y* maximizes the power-gating possibility of the corresponding ALU implementations and improves the energy efficiency.
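To make the procedure concrete, the following is a naive sketch of agglomerative clustering with the single-linkage function of Equation (13). It is not the implementation used in this work: the stopping criterion (a target number of clusters), the function name, and all distance values are assumptions chosen for illustration.

```python
from itertools import combinations


def single_linkage_clusters(items, pdist, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain.

    `pdist` maps frozenset({a, b}) to the pairwise distance of a and b.
    """
    clusters = [{x} for x in items]

    def linkage(x, y):
        # Eq. (13): minimum distance over all cross-cluster pairs.
        return min(pdist[frozenset((a, b))] for a in x for b in y)

    while len(clusters) > num_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = linkage(clusters[i], clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical pairwise distances (illustrative values only).
items = ["ADDL", "ADDQ", "S8ADDL", "ORNOT", "BIS"]
pdist = {frozenset(p): 0.9 for p in combinations(items, 2)}
pdist.update({
    frozenset(("ADDL", "ADDQ")): 0.1,
    frozenset(("ADDL", "S8ADDL")): 0.3,
    frozenset(("ADDQ", "S8ADDL")): 0.35,
    frozenset(("ORNOT", "BIS")): 0.2,
})

alus = single_linkage_clusters(items, pdist, num_clusters=2)
print(alus)
```

With these made-up distances the addition-like instructions end up in one cluster and the logic instructions in the other, each cluster standing for one ALU partition.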

## 4.3.3. Fine-Grained Power-Gating Prediction

Once the inactive phase of a component is detected at the architecture level, the component can be power-gated by asserting a sleep signal on the header/footer sleep transistors. Although power-gating can effectively reduce the wasted leakage energy, it should only be applied when the functional unit is idle for a minimum number of cycles (*PGTH*), so that the saved energy at least breaks even with the associated overheads. This value is estimated to be around 10 cycles for a typical technology [105,109].

The inactive phase of a functional unit can be predicted at runtime using several techniques. In order to determine the inactive interval of each functional unit and decide whether to power-gate it, it is possible to monitor the instruction buffer for a number of upcoming instructions while considering the branch prediction buffer. However, the power-gating signal for a given functional unit can be mispredicted due to branch misprediction. As a result of a misprediction, the entire pipeline may need to be flushed to reload the correct instructions. This provides some time to properly power up the required functional units without imposing much overhead due to pipeline stalls. It is worth mentioning that sub-threshold and near-threshold processors typically have a deeper pipeline to benefit in terms of performance and energy efficiency [110]. In such processors, the deeper pipeline leaves enough time to power up a functional unit after a misprediction.
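The lookahead idea described above can be sketched as a simple check on the instruction buffer. The sketch assumes in-order issue of one instruction per cycle and ignores branch prediction entirely; the function name, the instruction-to-ALU mapping, and the buffer contents are all hypothetical.

```python
def units_to_gate(upcoming, unit_of, pgth=10):
    """Return the ALU partitions that no instruction in the next `pgth`
    buffered instructions will use (assuming one instruction per cycle)."""
    window = upcoming[:pgth]
    if len(window) < pgth:
        # Too little lookahead to guarantee an idle interval of pgth cycles.
        return set()
    busy = {unit_of[op] for op in window if op in unit_of}
    return set(unit_of.values()) - busy


# Hypothetical mapping of instructions to ALU partitions.
unit_of = {"ADDQ": "ALU0", "BIS": "ALU0",
           "ADDL": "ALU1", "S8ADDL": "ALU1",
           "ORNOT": "ALU2"}

# Hypothetical contents of the instruction buffer (illustrative only).
upcoming = ["ADDQ", "BIS", "ADDQ", "ADDL", "BIS", "ADDQ",
            "ADDQ", "BIS", "ADDL", "ADDQ", "ORNOT"]

gated = units_to_gate(upcoming, unit_of)
print(gated)
```

In this example only the partition holding ORNOT is reported as a power-gating candidate, because its next use lies beyond the 10-cycle window. A real predictor would additionally have to account for branch outcomes and wake-up latency.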

In Out-of-Order (OoO) processors, the execution order of the instructions can change slightly at runtime in order to avoid pipeline stalls. As a result, predicting the idle time of the functional unit partitions is not straightforward. The proposed functional unit partitioning method can be used whenever such prediction is possible; however, without good prediction methods, it may not bring significant improvement to the design. The experimental results presented in Section 5.5 assume in-order execution of the instructions. Further development and application of the proposed functional unit partitioning method to OoO designs is not presented in this paper.
