In this section, we analyze the reconfiguration strategy and design the reconfigurable architecture at three levels: the structural level, the module level, and the array level. At the structural and module levels, our main methodology was to multiplex resources as much as possible, including exploiting the symmetry of the algorithms to design the architecture and splitting the architecture to exploit greater parallelism. At the array level, we separated the architecture again to decouple all operators and then analyzed the array design method to select an optimal scheme according to common metrics.
3.1. Reconfiguration Analysis
It is easy to observe in
Figure 1 that the overall structures of the compression functions are similar. Each has a critical path based on addition and several paths based on simple logic operations, and the assignment logic of each exhibits the characteristics of shift assignment. In existing designs, the main difficulty of reconfiguration lies in SM3, which requires the most computational resources; the architecture must provide all the resources SM3 needs. If an appropriate reconfiguration strategy is not adopted, 50% of the adder resources and 5/8 of the register resources are wasted when performing MD5 and SHA1. After careful observation, we found that both the left and right sides of SM3 contain a complex path, and the structure of each side can accommodate MD5 and SHA1; SHA2 exhibits the same characteristic. We therefore concluded that designing the architecture according to SM3 and SHA2, and supporting MD5 and SHA1 on both sides of it, would fully utilize the computational resources at the structural level. This is the reconfiguration strategy adopted in this paper.
The above analysis solves the problem of resource waste at the structural level, but serious resource waste still exists at the algorithm level. The fundamental reason is that word widths vary among algorithms: SHA384 and SHA512 use 64-bit words, while the others use 32-bit words. In existing designs, all computational resources are 64 bits wide, leaving at least 50% of the resources idle when executing 32-bit algorithms [20,21,22,23,24]. However, a 32-bit design cannot meet the operational requirements of 64-bit word widths. To address this problem, we used processing units with 32-bit word width to build a unit with 64-bit word width, selecting the execution mode through control signals. After this optimization, the architecture can execute either one 64-bit operation or two 32-bit operations simultaneously, which means it can support two instances of SM3 or SHA224/256, etc., or one instance of SHA384/512 at the algorithm level. The resource waste introduced by differing word widths is thus eliminated.
Data expansion rules are simple across the algorithms, and the most crucial common feature is a 16-level shift register combined with simple arithmetic operations. Different computing logic is selected according to the algorithm, and the output value of the shift register is passed directly to the iterative compression module to participate in the computation.
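As a concrete, software-level illustration of a 16-level shift register with simple arithmetic, the sketch below models one step of SHA-256's message expansion (a minimal model for intuition, not the paper's hardware; function names are ours):

```c
#include <stdint.h>

/* Right-rotate a 32-bit word by n (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 small sigma functions used in message expansion. */
static uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* One expansion step on a 16-word shift register W[0..15], W[0] oldest:
 * compute the incoming word from fixed taps, shift the register, and
 * return the outgoing word (the value fed to iterative compression). */
uint32_t expand_step(uint32_t W[16])
{
    uint32_t out = W[0];
    uint32_t next = sigma1(W[14]) + W[9] + sigma0(W[1]) + W[0];
    for (int i = 0; i < 15; i++) W[i] = W[i + 1];
    W[15] = next;
    return out;
}
```

Each step consumes the oldest word, computes a new word from taps at fixed positions, and shifts the register; this is exactly the access pattern a 16-stage hardware shift register provides.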
3.3. Module Design
The architecture above only provides the design at the structural level; two issues remain to be addressed.
To be compatible with SHA512 and SHA384, existing works set the word width of all computational units and registers to 64 bits; when executing 32-bit algorithms, only the high 32 bits are used and the low 32 bits are set to 0, which wastes at least 50% of the resources. To solve this problem, we built 64-bit processing units from units with 32-bit data width. The specific idea was to divide each 64-bit processing unit into a high 32-bit part and a low 32-bit part: each part can handle a 32-bit operation independently, while the two can work together to handle a 64-bit operation. In other words, we divided the hardware structure shown in
Figure 2 into two layers, with the upper layer processing the high 32 bits of data and the lower layer processing the low 32 bits; a 64-bit algorithm, such as SHA384/512, is calculated by the two layers together. The data extension module also adopts this strategy.
The logic unit in
Figure 2 undertakes the task of calculating the logic functions shown in
Table 4, and it must be reconfigurable to meet all the computational requirements. We adopted a three-layer CLB interconnection structure, with each CLB being 32 bits wide; the detailed structure is shown in
Figure 4. Since the logic functions are pure bit operations, the high and low bits do not interfere with each other, so the logic units of both layers are identical to the one in
Figure 4.
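Because the round logic functions are pure bit operations, a 64-bit evaluation splits exactly into two independent 32-bit evaluations of the halves, which is why the two layers can share one logic-unit design. A minimal C sketch of this property, using the "choose" function as an example (function names are ours, not the paper's):

```c
#include <stdint.h>

/* 32- and 64-bit versions of the choose function
 * Ch(x,y,z) = (x AND y) XOR (NOT x AND z), one of the round
 * logic functions used by SHA-1/SHA-2/SM3. */
static uint32_t ch32(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (~x & z); }
static uint64_t ch64(uint64_t x, uint64_t y, uint64_t z) { return (x & y) ^ (~x & z); }

/* A pure bit operation on 64 bits decomposes exactly into independent
 * 32-bit operations on the high and low halves. */
uint64_t ch64_split(uint64_t x, uint64_t y, uint64_t z)
{
    uint32_t hi = ch32((uint32_t)(x >> 32), (uint32_t)(y >> 32), (uint32_t)(z >> 32));
    uint32_t lo = ch32((uint32_t)x, (uint32_t)y, (uint32_t)z);
    return ((uint64_t)hi << 32) | lo;
}
```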
The addition operation was designed differently for the high and low layers because of the carry flag; the structure is shown in
Figure 5. Our goal was to construct a 64-bit adder from two 32-bit adders that can work separately or cooperatively. When working separately, they are simply two independent 32-bit adders with no carry between them. When working cooperatively, they form a 64-bit ripple-carry adder composed of two 32-bit adders, in which the low-bits adder outputs a carry flag to the high-bits adder. The two modes are selected through the dword signal.
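The dual-mode adder behavior can be modeled in a few lines of C (a functional sketch of the dword selection, not the Figure 5 netlist):

```c
#include <stdint.h>

/* Dual-mode adder built from two 32-bit adders.
 * dword = 0: two independent 32-bit additions, no carry between lanes.
 * dword = 1: one 64-bit ripple-carry addition; the low adder's carry-out
 *            feeds the high adder's carry-in. */
void dual_add(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo,
              int dword, uint32_t *s_hi, uint32_t *s_lo)
{
    uint64_t lo = (uint64_t)a_lo + b_lo;               /* carry appears in bit 32 */
    uint32_t carry = dword ? (uint32_t)(lo >> 32) : 0; /* gated by the mode signal */
    *s_lo = (uint32_t)lo;
    *s_hi = a_hi + b_hi + carry;
}
```

In 64-bit mode the result matches a native 64-bit addition of (a_hi:a_lo) and (b_hi:b_lo); in 32-bit mode the lanes never interact.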
When analyzing the algorithms, we found many rotate-and-XOR computational patterns, such as the $\Sigma$ and $\sigma$ functions of SHA2. So, we designed a dedicated shift unit comprising three shift operations and two XOR operations. The unit performs two functions: first, a single shift of one input; second, multiple shifts of one input XORed together. Since a 64-bit shift cannot be computed from 32-bit shifts, the 64-bit shift operations cannot reuse the 32-bit shift logic. The structure of the designed shift unit is shown in
Figure 6: the lower layer uses the low-bits shift unit, and the upper layer uses the high-bits shift unit. The $\sigma_0$
and $\sigma_1$
operations in the data expansion module can also be implemented by the same method, but with fixed shift amounts.
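A software sketch of the shift unit's second function, three rotations of one input combined by two XORs (the amounts shown match SHA-256's $\Sigma_0$ and are illustrative; other patterns are obtained by reconfiguring the three shift amounts):

```c
#include <stdint.h>

/* Right-rotate a 32-bit word by n (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* Shift unit: rotate one input by three amounts and combine with two XORs.
 * With (2, 13, 22) this computes SHA-256's Sigma0(x) =
 * ROTR^2(x) ^ ROTR^13(x) ^ ROTR^22(x). */
uint32_t shift_unit(uint32_t x, unsigned r1, unsigned r2, unsigned r3)
{
    return rotr32(x, r1) ^ rotr32(x, r2) ^ rotr32(x, r3);
}
```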
After the modules were designed as above, the architecture in
Figure 2 was divided into two layers. Each layer can perform two sets of MD5 or SHA1, or one set of SHA256, SHA224, or SM3. When executing the SHA384 and SHA512 algorithms, the upper and lower layers work together, expanding the word width of the computing logic and register units to 64 bits. In this way, the resource waste introduced by differing word widths is eliminated, and the computational capability of all resources can be fully utilized.
3.4. Array Design
In the above analysis, we optimized the reconfigurable architecture at the structural and module levels. However, because hash algorithms involve massive iteration and computation, a single hardware structure provides only limited computational capability; thus, an array design is necessary. The array design affects the implementation model of the algorithms, and the main question is whether or not to adopt a pipeline. After analysis, we found that a pipelined implementation is not suitable for hash algorithms, mainly for the following reasons. First, if the data expansion module is pipelined, each data expansion module introduces an extra 512-bit interconnection overhead for pipelining the data words, as shown by the red line of Mode a in
Figure 7. Second, if the data expansion module is not pipelined, a set of data is processed in only one data expansion module, as shown in Mode b of
Figure 7; each data expansion module then needs to connect with all the compression modules, which introduces enormous interconnection and control overheads and may ultimately result in frequent switching of configuration information or failures in data synchronization. Therefore, we did not use a pipelined implementation and instead processed the whole hash computation in one set of data expansion and compression modules, as shown in Mode c of
Figure 7. From another perspective, a pipelined implementation is better suited to programs containing multiple sub-processes, whereas our design completes the whole iteration process at once without dividing it into sub-processes.
In array design, function units should be decoupled as much as possible to reduce control difficulty and to increase flexibility. In
Section 3.2.1, we designed the architecture according to the principle of left-right symmetry, so, when designing the array, we could separate the left and right sides from the middle to obtain two sub-modules, denoting the left side as A and the right side as B. We connected modules A and B with the data expansion module, deriving the A and B operators shown in
Figure 8. The A and B operators can work together by transmitting three sets of signals. So far, we have obtained four operators according to the high/low and left/right separation principles: A_high, A_low, B_high, and B_low.
Table 5 provides the interconnection relations between the operators, and
Table 6 lists the requirements of the different algorithms for the operators.
The computing capability of the array is positively related to the amount of computational resources, so we first needed to determine the scale of the array. Assume that the array size is M × N, where N is the width of the array and M is the depth. When the array is not pipelined, its computing power is proportional to M, so N is the crucial factor affecting array efficiency. The row design should consider the parallelism of both a single algorithm and multiple algorithms: the parallelism of a single algorithm is the number of messages of that one algorithm executing simultaneously, while the parallelism of multiple algorithms is the number of messages of different algorithms executing simultaneously. To reach the maximum parallelism of multiple algorithms, the row must satisfy the computational-resource requirements of all algorithms, as shown in Equation (1):

$$N_j \ge \sum_{i} R_{i,j}, \qquad (1)$$

where $R_{i,j}$ is the amount of operator $j$ occupied by algorithm $i$, as shown in
Table 6, and $N_j$ is the number of operator $j$ in one row. According to the barrel principle, the parallelism of a single algorithm depends on the operator resource that is scarcest relative to the algorithm's demand. Equation (2) shows the parallelism $P_i$ of a single algorithm:

$$P_i = \min_{j} \left\lfloor N_j / R_{i,j} \right\rfloor. \qquad (2)$$
After a simple analysis, the minimum design solution meeting the parallelism requirements was $N_{A\_high} = N_{A\_low} = 2$, $N_{B\_high} = N_{B\_low} = 3$, or $N_{A\_high} = N_{A\_low} = 3$, $N_{B\_high} = N_{B\_low} = 2$. Both solutions provide the same computing capability, so the one occupying less area is the better choice. The parallelism of the individual algorithms in this case is shown in
Table 7.
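The barrel principle behind the single-algorithm parallelism is easy to model: an algorithm's parallelism in a row is the minimum, over the operators it needs, of available count divided by required count. The per-algorithm operator demands below are illustrative placeholders, not the paper's Table 6 values:

```c
/* Barrel principle: an algorithm's parallelism in one row is limited by
 * its scarcest required operator.  Index order: A_high, A_low, B_high, B_low. */
#define NUM_OPS 4

int parallelism(const int need[NUM_OPS], const int avail[NUM_OPS])
{
    int p = -1;
    for (int j = 0; j < NUM_OPS; j++) {
        if (need[j] == 0) continue;   /* operator not required by this algorithm */
        int q = avail[j] / need[j];   /* instances this operator can support */
        if (p < 0 || q < p) p = q;
    }
    return p < 0 ? 0 : p;             /* 0 if the algorithm needs nothing */
}
```

For a row with operator counts {2, 2, 3, 3}, an algorithm needing one of every operator would have parallelism min(2, 2, 3, 3) = 2.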
In this paper, we took $N_{A\_high} = N_{A\_low} = 2$, $N_{B\_high} = N_{B\_low} = 3$ as an example. The design can achieve maximum parallelism only when the interconnection relationships in
Table 5 are satisfied, so the whole problem becomes a permutation problem. After excluding the solutions that do not satisfy the interconnection relationships, as well as equivalent solutions, four solutions remained; they are shown in detail in
Figure 9.
Although the four schemes provide the same maximum parallelism for each algorithm, they do not provide the same mapping flexibility. In this paper, we defined the number of mapping solutions as mapping freedom, and
Table 8 gives the mapping freedom of the four schemes.
Therefore, we chose solution 3 as the row design in this paper; the array structure is shown in
Figure 10. No interconnection is required between rows, and the interconnecting lines within rows and layers allow the operators to work together. Denoting the row computing capability as $Cap\_row$, the array computing capability is $Cap\_array = M \times Cap\_row$. The maximum throughput of the array, when the computational resources are fully utilized, is

$$T_{array} = \sum_{i=1}^{k} n_i \, T_i,$$

where $k$ is the number of algorithm types, $n_i$ is the number of messages of alg[i], $T_{array}$ is the array throughput, $T_i$ is the throughput when running alg[i], and the $n_i$ satisfy the resource constraint of the array, $\sum_{i} n_i R_{i,j} \le M \cdot N_j$ for every operator $j$.
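Assuming, as above, that the array throughput is the message-count-weighted sum of the per-algorithm throughputs, it can be computed as:

```c
/* Array throughput as the message-count-weighted sum of per-algorithm
 * throughputs: T_array = sum over i of n[i] * t[i], for k algorithm types.
 * t[i] is the throughput of alg[i]; n[i] is its number of messages. */
double array_throughput(const double t[], const int n[], int k)
{
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += n[i] * t[i];
    return sum;
}
```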
The mapping requirements for multiple algorithms can be satisfied according to the following principles: first, do not map MD5 and SHA1 on the same layer; second, search for operators suitable for each algorithm from the upper layer to the lower and from left to right.