3.2. Quotient Pipeline
From the public parameter settings and the leveled breakdown of the supersingular isogeny protocol (Figure 3), it can be concluded that almost all fundamental arithmetic operations required to evaluate a supersingular isogeny work in $\mathbb{F}_{p^2}$. In Section 4, we will quantitatively analyze the operations in each phase of the protocol and demonstrate that modular multiplication is the most costly and most frequently used building block. Thus, a dedicated and efficient quadratic field arithmetic unit will improve the hardware implementation.
Previous SIDH hardware designs [24,26,27] have mostly targeted FPGAs. The typical $\mathbb{F}_p$ modular multiplier proposed in these works uses the native digital signal processors (DSPs) of reconfigurable hardware, whose main component is a high-performance multiplier. By wiring through the programmable interconnect, the DSP arrays or matrices form interleaved systolic Montgomery multipliers capable of processing large operands. This architecture can be regarded as a modular multiplier whose radix equals the DSP bit-width, typically 16 bits. An appropriately higher radix reduces the number of iterations needed in each modular multiplication, resulting in performance gains. FPGAs cannot natively support a higher radix (e.g., 32-bit or 64-bit) because of their slice structure. Customized hardware, however, can implement such multipliers to enhance the computation speed.
When performing modular multiplication using the Montgomery method, it is essential to understand that the radix significantly influences the architecture of the multiplier, down to the underlying arithmetic units. The original Montgomery multiplication shows that a higher radix decreases the number of iterations but increases the latency of every single iteration, so there is no generic way to minimize the latency of a complete modular multiplication. Specifically, the calculation of the residue uses a carry-save adder (CSA) to compress the partial products, which may mitigate the accrued latency caused by a higher radix, since the critical path of the CSA grows slowly with respect to the bit-width of the operands. However, to obtain the quotient, we must use a carry-propagation adder (CPA), for which the computation time is highly dependent on the radix. To solve this problem, Orup [28] proposed a method to decouple the calculation of the residue and the quotient, allowing them to be executed concurrently within the modular multiplication.
In addition to the influence of the selected radix, another possible improvement involves building a specific $\mathbb{F}_{p^2}$ modular multiplier instead of an $\mathbb{F}_p$ arithmetic unit. The traditional way to compute an extension field multiplication is to apply the Karatsuba–Ofman algorithm on top of the prime field multiplier. An $\mathbb{F}_{p^2}$ multiplication requires at least three $\mathbb{F}_p$ multiplications, with some pre- and post-additions. If operating using the schoolbook algorithm, the $\mathbb{F}_{p^2}$ multiplication will be resolved as Equation (8),

$AB = (a_0 b_0 - a_1 b_1) + (a_0 b_1 + a_1 b_0)\,i, \qquad (8)$

where $A = a_0 + a_1 i$, $B = b_0 + b_1 i$, $a_0, a_1, b_0, b_1 \in \mathbb{F}_p$, and $i^2 = -1$.
By defining an operator $\odot(A, B, C, D) = AB + CD \bmod p$, the $\mathbb{F}_{p^2}$ multiplication can be rewritten [29] as

$c_0 = \odot(a_0, b_0, a_1, p - b_1), \qquad c_1 = \odot(a_0, b_1, a_1, b_0).$

A dedicated $\odot$ processing unit will help to build an $\mathbb{F}_{p^2}$ multiplier with a latency similar to that of the underlying $\mathbb{F}_p$ computation at a reasonable area cost. The design of the modular multiplier will be discussed after its theoretical foundations are presented.
In the Montgomery multiplication, the following quotient determination is the most costly step of each iteration,

$q_i = (S_i \bmod r)\,M' \bmod r,$

where $M$ is the modulus, $r$ is the radix, and $M'$ is the precomputed parameter satisfying $M M' \equiv -1 \pmod{r}$. To simplify the $q_i$ calculation, we first combine $M'$ with $M$ and rewrite the modulus as $\tilde{M} = (M' \bmod r)\,M$, which satisfies $\tilde{M} \equiv -1 \pmod{r}$. Then $q_i$ can be defined as

$q_i = S_i \bmod r.$

The residue calculation in each iteration can be expressed as

$S_{i+1} = \lfloor S_i / r \rfloor + q_i\,\frac{\tilde{M}+1}{r} + b_i A.$

Furthermore, by pre-scaling $A$ by $r$ and adding one iteration to compensate for this extra factor, the complete high-radix Montgomery multiplication can be performed in this trivial-quotient form.
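The trivial-quotient iteration described above can be sketched in a few lines of Python (a model of the arithmetic only, not of the hardware; we write `M_tilde` for the scaled modulus $(M' \bmod r)\,M$, and the variable names are our own):

```python
def mont_mul_trivial_quotient(A, B, M, k, n):
    """Radix r = 2^k Montgomery multiplication where the quotient digit is
    simply the low k bits of the running sum (no multiplication by M')."""
    r = 1 << k
    M_prime = (-pow(M, -1, r)) % r      # M' with M*M' = -1 (mod r); M must be odd
    M_tilde = M_prime * M               # scaled modulus, = -1 (mod r)
    M_hat = (M_tilde + 1) >> k          # (M_tilde + 1)/r, an exact division
    S = 0
    for i in range(n):
        b_i = (B >> (k * i)) & (r - 1)  # i-th radix-r digit of B
        q_i = S & (r - 1)               # trivial quotient: low k bits of S
        # equals (S + q_i*M_tilde)/r + b_i*A, rewritten with an exact shift
        S = (S >> k) + q_i * M_hat + b_i * A
    return S
```

In this model the result is congruent to $A B\,r^{-(n-1)} \bmod M$, i.e., it carries one leftover factor of $r$; this is exactly what the pre-scaling and extra iteration mentioned in the text compensate for.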
Although the quotient determination now only needs to truncate $S_i$ to its lowest radix-$r$ digit, it is still not trivial, because $S_i$ exists in carry-save form. The ordinary (non-redundant) form of $S_i$ has to be determined before it can be used to calculate $q_i$. This data dependency between $S_i$ and $q_i$ limits the performance of the high-radix Montgomery multiplier. The quotient pipelining technique [28] delays the use of the quotient digit $q_i$ by $d$ iterations, giving the carry ample time to propagate and ensuring there is sufficient time to determine a quotient, at the cost of $d$ extra iterations. The disadvantages of this method are the extra cycles required to merge the delayed quotient digits into the result and the larger operand bit-width; the overhead becomes more significant as the delay increases. In our situation, a delay of one cycle ($d = 1$) is sufficient to remove the interference between the residue and the quotient calculation. By involving several precomputed parameters, the cost of the 1-stage quotient pipeline can be further mitigated.
The key point in decoupling the quotient and residue calculations is to remove the $q_i$ term from the $S_{i+1}$ expression, allowing $q_i$ and $S_{i+1}$ to be determined in parallel. By defining $M'_1 = -M^{-1} \bmod r^2$ and $\tilde{M}_1 = (M'_1 \bmod r^2)\,M$, the quotient can again be represented as $q_i = S_i \bmod r$, and the core iteration of high-radix Montgomery multiplication becomes

$S_{i+1} = \lfloor S_i / r \rfloor + q_{i-1}\,\hat{M}_1 + b_i A, \qquad (14)$

where $M'_1$, $\tilde{M}_1$, and $\hat{M}_1 = (\tilde{M}_1 + 1)/r^2$ are pre-determined. The index 1 indicates that these pre-computed parameters work for a 1-stage quotient pipeline. The reason that specific factors have to be proposed for the delayed variant is that the correctness of the algorithm is built on $r^2$ dividing $\tilde{M}_1 + 1$. From Equation (14), we can tell that $S_{i+1}$ is now independent of $q_i$, which decouples the two calculations in each iteration.
Furthermore, the $\odot$ (fused multiply-add) operator can be straightforwardly implemented with a minor modification to Equation (14): with a sufficient digit range, the Montgomery iteration can merge the post-addition with the multiplication, and the large integer addition can be processed along with the compression of the partial products. Additionally, the $\odot$ operator produces a similar effect to Karatsuba–Ofman multiplication. Karatsuba proposed a method for computing a complex multiplication, such as Equation (15), while saving one real-number multiplication. Correspondingly, the $\odot$ operator reduces the number of partial products in the $\mathbb{F}_{p^2}$ computation by a quarter, which results in a lower area than two individual $\mathbb{F}_p$ multipliers. This design is highly suitable for supersingular isogeny calculation, considering that these curves are defined over $\mathbb{F}_{p^2}$.
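For comparison, the Karatsuba-style complex multiplication referenced as Equation (15) trades the fourth real multiplication for extra additions (Python, illustrative only):

```python
def complex_mul_karatsuba(a0, a1, b0, b1):
    """(a0 + a1*i)(b0 + b1*i) using three real multiplications
    instead of the four required by the schoolbook formula."""
    t0 = a0 * b0
    t1 = a1 * b1
    t2 = (a0 + a1) * (b0 + b1)     # = t0 + t1 + (a0*b1 + a1*b0)
    return t0 - t1, t2 - t0 - t1   # real part, imaginary part
```

The saved multiplication comes at the cost of wider intermediate operands, which is one reason a fused partial-product design can be the better fit in hardware.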
We obtain the final form of the high-radix Montgomery multiplication in Algorithm 1.
Algorithm 1 1-Stage High-Radix Montgomery Multiplication with Quotient Pipeline
Require: A prime modulus $M$, a positive radix $r = 2^k$, a positive integer $n$ such that $B, D < r^n$. Integer $M'_1$, where $M'_1 = -M^{-1} \bmod r^2$, and integers $\tilde{M}_1 = (M'_1 \bmod r^2)\,M$, $\hat{M}_1 = (\tilde{M}_1 + 1)/r^2$, $B = \sum_{i=0}^{n-1} b_i r^i$, $D = \sum_{i=0}^{n-1} d_i r^i$. Operands $A$, $B$, $C$, $D$ within the admissible input range.
Ensure: Integer $S \equiv (AB + CD)\,r^{-n} \pmod{M}$
1: $S_0 \leftarrow 0$, $q_{-1} \leftarrow 0$
2: for $i \leftarrow 0$ to $n + 1$ do
3:     $q_i \leftarrow S_i \bmod r$
4:     $S_{i+1} \leftarrow \lfloor S_i / r \rfloor + q_{i-1}\,\hat{M}_1 + b_i A + d_i C$
5: end for
6: return $S \leftarrow S_{n+2} \cdot r + q_{n+1}$
Proof of Algorithm 1. To verify this variant of Montgomery multiplication, we can simply accumulate each iteration: every quotient digit discarded by the right shift is compensated one iteration later by the $q_{i-1}\hat{M}_1$ term, so the accumulated sum differs from $AB + CD$ only by multiples of $M$ and by the scaling factor $r^n$. Considering that $q_i < r$ and $b_i, d_i < r$, the last terms of $S_{i+1}$ have a fixed upper bound in every iteration. Additionally, $A$, $B$, $C$, and $D$ should individually be bounded by the admissible input range, and therefore the result satisfies the same bound. On the condition that $r^n$ is sufficiently large with respect to $\tilde{M}_1$, the output and input share the same range, which means the algorithm can be applied recursively. □
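A software model of the 1-stage quotient pipeline can serve as a sanity check of this accumulation argument (Python; the precomputed constants follow Orup's construction [28], while the loop bound, digit handling, and final merge of the delayed quotient digit are our assumptions, not the paper's exact parameters):

```python
def mont_madd_pipelined(A, B, C, D, M, k, n):
    """(A*B + C*D) * r^{-n} mod M with a 1-stage quotient pipeline:
    the quotient digit q_i is consumed one iteration late, so the residue
    CSA and the quotient CPA could run concurrently in hardware."""
    r = 1 << k
    M1_prime = (-pow(M, -1, r * r)) % (r * r)  # M'_1 = -M^{-1} mod r^2
    M1_tilde = M1_prime * M                    # = -1 (mod r^2)
    M1_hat = (M1_tilde + 1) >> (2 * k)         # (M~_1 + 1)/r^2, exact
    S, q_prev = 0, 0
    for i in range(n + 2):                     # extra iterations flush the pipeline
        b_i = (B >> (k * i)) & (r - 1)
        d_i = (D >> (k * i)) & (r - 1)
        q_i = S & (r - 1)                      # independent of this iteration's S update
        S = (S >> k) + q_prev * M1_hat + b_i * A + d_i * C
        q_prev = q_i
    return S * r + q_prev                      # merge the last delayed quotient digit
```

Setting $C = D = 0$ degenerates the model into a plain pipelined Montgomery multiplication, matching the remark later in the text that the quadratic multiplier trivially subsumes the ordinary one.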
3.3. Quadratic Finite Field Multiplier
From the perspective of circuit design, we can use a customized multiplier built with Booth encoding and a Wallace tree to implement the quadratic field modular multiplication unit (QMM). Without the limitation of the FPGA slice architecture, a larger elementary integer multiplier can be built to achieve a more efficient finite field arithmetic unit. The modified Booth-2 encoder reduces the number of partial products by half at little cost, which in turn allows us to process more bits with the same hardware resources. The Wallace tree is employed to accumulate partial products in carry-save form with a short critical path. These two techniques are usually combined to implement a dedicated large-size multiplier because of their regularity and efficiency.
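The recoding itself is easy to model. The snippet below (Python, illustrative only) recodes an unsigned multiplier into radix-4 Booth digits in $\{-2, -1, 0, 1, 2\}$, halving the number of partial products relative to radix-2:

```python
def booth2_recode(y, n_bits):
    """Modified Booth-2: scan overlapping 3-bit windows of y (with an
    implicit 0 appended on the right) and emit one signed radix-4 digit
    per window. Requires y < 2^(n_bits - 1) so the top digit is a sign slot."""
    assert n_bits % 2 == 0 and 0 <= y < (1 << (n_bits - 1))
    digit_of = [0, 1, 1, 2, -2, -1, -1, 0]   # indexed by bits (y_{2i+1}, y_{2i}, y_{2i-1})
    y2 = y << 1                              # append the implicit 0
    return [digit_of[(y2 >> i) & 0b111] for i in range(0, n_bits, 2)]

def booth2_partial_product_sum(x, y, n_bits):
    """Rebuild x*y from the Booth digits: each digit selects 0, +/-x, or +/-2x."""
    return sum(d * x << (2 * i) for i, d in enumerate(booth2_recode(y, n_bits)))
```

Each digit corresponds to one partial product that is either zero, the multiplicand segment, or its double, possibly negated, which is cheap to generate with shifts and complements in hardware.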
The architecture of the proposed QMM, shown in Figure 5, is equipped with a modified Booth-2 encoder, a Wallace tree, and a carry-propagate adder (CPA). The multiplication dataflow starts by pushing the multiplicand into the shift register while padding its head and tail with sign bits and zeros. This register allows the Booth encoder to manipulate a different segment of the multiplicand on each iteration and is, therefore, suitable for diverse operand widths. Next, the Booth encoder takes a multiplicand segment from the shift register and one radix-sized digit of the multiplier to produce partial products that are $0$, $\pm 1$, or $\pm 2$ times the segment. This bundle of partial products is dropped into the Wallace tree and passes through a 6-layer 4:2 CSA, which yields the result of the current iteration in carry-save form. While the CPA combines the carry and sum words, the Wallace tree steps into the next iteration to process the next segment, which is the fundamental idea behind the quotient pipeline.
To work with operands of different bit-widths, it is useful to develop a flexible multiplier architecture that accommodates a range of input sizes. The radix of the multiplier determines the number of partial products that must be compressed in one cycle, and thus the critical delay of the multiplier. Additionally, the radix affects the size of the final addition which sums up the outputs of the Wallace tree. Theoretically, the latency of the 4:2 CSA tree grows only logarithmically with the number of partial products, whereas the latency of the CPA depends strongly on the operand width. To balance the performance of the different parts and ensure a regular digit width, the multiplier uses a 64-bit radix. The multiplicands, after being expanded to meet the requirement of the quotient pipeline, are so large that processing them in one piece would introduce unnecessary connectivity delay; thus, the multiplicands are segmented and processed iteratively. According to the typical primes we are studying, the extended widths of the multiplicands are 570, 639, 746, and 887 bits, as listed in Table 2. Considering the QMM utilization rate in each field and the circuit size of the processing unit, we propose a multiplicand batch size of 128 bits for each iteration.
The workflow of the QMM is summarized in Algorithm 2, which explains how the embedded iterations work. The total loop count depends on the size of the operands presented in Table 2: $v$ and $w$ indicate the number of batches of the multiplicand and the multiplier, respectively, and the total number of loops can be calculated as $v \times w$. The superscript in the algorithm indicates that the intermediate sum is stored in carry-save form, which does not affect the correctness of the algorithm; right before the quotient determination, the CPA is applied to the accumulated sum to derive its ordinary form. For each iteration, the Wallace tree compresses 98 integers into a single carry-save result, which can be efficiently implemented by cascading six layers of 4:2 CSAs. A schematic of the 1-bit CSA is given in Figure 5. Each pair of lines constitutes a carry-save encoding of the input, the output, or the 'double' carry. The 4:2 structure has a more regular layout than 3:2 adders, which allows the use of binary tree structures; it can be seen as an adder that takes two carry-save encoded numbers and produces the result in the same representation. To support the QMM and the subsequently introduced modular adder, the accelerator is equipped with 128-bit-wide memories for storing intermediate variables and caching larger operands. The specific design of the memory and access unit will be discussed in Section 4.
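To make the carry-save arithmetic concrete, the bit-parallel model below (Python; our own formulation, not the paper's netlist) builds a 4:2 compressor from two cascaded 3:2 CSAs and checks that it preserves the arithmetic sum:

```python
def csa_3to2(a, b, c):
    """3:2 carry-save adder over whole words: a + b + c == sum + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def csa_4to2(a, b, c, d):
    """4:2 compressor as two cascaded 3:2 CSAs: a + b + c + d == s + t."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)
```

A 4:2 layer takes four words and returns two whose sum is unchanged; six such layers reduce the QMM's partial products, and only the final CPA ever propagates a long carry.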
The dedicated design of the QMM offers several salient performance improvements over an individual prime field multiplier. The advantages depend to a certain extent on the specific application scenario and, therefore, have some limitations.
Algorithm 2 Quadratic field modular multiplier workflow
Require: A prime modulus $M$, a positive radix $r$, a positive integer $n$, and the precomputed parameters of the 1-stage quotient pipeline. Operands $A$, $B$, $C$, $D$ within the admissible input range.
Ensure: Integer $S \equiv (AB + CD)\,r^{-n} \pmod{M}$
1: ⋯
2: for $j \leftarrow 1$ to $w$ do
3:     ⋯ ▹ Quotient determination
4:     for $l \leftarrow 1$ to $v$ do
5:         ⋯
6:         if ⋯ then
7:             ⋯
8:         else
9:             ⋯
10:        end if
11:    end for
12:    ⋯
13: end for
14: for $l \leftarrow 1$ to $v$ do
15:    ⋯ ▹ Last iteration
16:    if ⋯ then
17:        ⋯
18:    else
19:        ⋯
20:    end if
21: end for
22: return $S$
Performance and area. The $\odot$ (fused multiply-add) operator merges two $\mathbb{F}_p$ multiplications together, which saves a post-addition and reduces the number of partial products by a quarter compared with two separate multiplications. Although the QMM has to compress more partial products in each iteration, the critical path increases only slightly because of the Wallace tree architecture. It takes two $\odot$ operations, in series or in parallel, to complete an $\mathbb{F}_{p^2}$ multiplication, where one QMM has an area cost of 205k equivalent gates and a latency cost of 2.316 ns.
Flexibility. Our design is flexible enough to support different operand sizes, including $\mathbb{F}_{p434}$, $\mathbb{F}_{p503}$, $\mathbb{F}_{p610}$, and $\mathbb{F}_{p751}$, with a high hardware utilization ratio. The architecture is intended for use with the parameters predefined in the SIDH protocol. Furthermore, the primes proposed in the round 3 submission of SIKE [22] are also supported by this circuit, which mitigates the usual downside that an ASIC implementation is less scalable than an FPGA design. Additionally, setting $C$ and $D$ to zero trivially transforms this quadratic modular multiplier into a normal modular multiplier.
Regularity. Compared with the Karatsuba method, an additional advantage of deploying the QMM is its regularity, which benefits both scheduling and circuit layout. To parallelize the hundreds of field arithmetic calculations in the protocol, the sequencer needs to handle the data dependencies and the processing unit workload carefully. In the classical approach, a fully parallelized $\mathbb{F}_{p^2}$ multiplication occupies three basic modular multipliers simultaneously; if the number of modular multipliers is not a multiple of 3, scheduling the operations becomes less convenient. From this point of view, the QMM uses two identical $\odot$ operations, which makes it easier to sequence different subroutines.
3.5. Finite Field Inverse Unit
Several algorithms can be used to compute modular inverses. Fermat's Little Theorem is an effective method in cases where modular exponentiation is highly optimized. However, it is unsuitable for the smooth-characteristic fields used in large-degree isogenies: it requires roughly as many modular squarings as the bit-width of the modulus, plus a certain number of modular multiplications, and even if an addition chain can accelerate the computation, the efficiency is still relatively low. The extended Euclidean algorithm and the Kaliski inversion can significantly reduce the complexity of modular inversion. The drawback is their non-constant execution time, which can lead to side-channel leakage.
The Kaliski algorithm is designed specifically for Montgomery modular arithmetic [30]. It takes an ordinary field element as input and calculates its inverse in the Montgomery domain. Savaş extended this algorithm to high-radix mode and proposed a new Montgomery modular inverse algorithm [31], which replaces the bit-level iterations in Phase II with up to three Montgomery multiplications. This algorithm is more suitable for our design. For example, computing the modular inverse in $\mathbb{F}_{p751}$ using Fermat's Little Theorem requires approximately 745 modular squarings and 150 modular multiplications. By comparison, the new Montgomery inversion requires roughly 1066 iterations, with each iteration consisting of several additions and subtractions, plus two to three additional Montgomery multiplications.
Unlike the optimizations employed for quadratic modular multiplication and addition, which require multiple base field operations per extension field operation, a modular inversion in $\mathbb{F}_{p^2}$ can be reduced to just one inversion in $\mathbb{F}_p$, since $(a_0 + a_1 i)^{-1} = (a_0 - a_1 i)(a_0^2 + a_1^2)^{-1}$. Therefore, we design an inversion datapath that operates solely in $\mathbb{F}_p$, as described in Algorithm 3. It utilizes the previously developed QMA and QMM components, obviating the need for a dedicated inversion unit. In particular, two QMAs perform two $\mathbb{F}_p$ subtractions and one $\mathbb{F}_p$ addition in parallel, thereby completing Phase I of the Montgomery inversion. The final multiplications required in Phase II can then be executed using the QMM.
Algorithm 3 New Montgomery modular inversion datapath
Require: ⋯, ⋯, ⋯, and $M$
Ensure: ⋯
1: ⋯, ⋯, and ⋯ ▹ Phase I: AlmMonInv
2: while ⋯ do
3:     if ⋯ then ⋯, ⋯
4:     else if ⋯ then ⋯, ⋯
5:     else
6:         ⋯, ⋯, ⋯
7:         if ⋯ then ⋯, ⋯, ⋯
8:         else ⋯, ⋯, ⋯
9:         end if
10:    end if
11:    ⋯
12: end while
13: ⋯, ⋯
14: if ⋯ then ⋯
15: end if
16: ⋯, ⋯ ▹ Phase II: correction
17: if ⋯ then
18:     ⋯
19:     ⋯
20: end if
21: ⋯
22: return ⋯
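The reduction from an $\mathbb{F}_{p^2}$ inversion to a single $\mathbb{F}_p$ inversion can also be modeled directly (Python; the `fp_inv` parameter stands in for the Algorithm 3 datapath and defaults to a plain modular inverse):

```python
def fp2_inverse(a, p, fp_inv=lambda x, q: pow(x, -1, q)):
    """(a0 + a1*i)^{-1} = (a0 - a1*i) / (a0^2 + a1^2) for i^2 = -1:
    two squarings, one addition, one F_p inversion, two multiplications."""
    a0, a1 = a
    t = fp_inv((a0 * a0 + a1 * a1) % p, p)    # the only F_p inversion
    return ((a0 * t) % p, (-a1 * t) % p)
```

Everything except the single call to `fp_inv` consists of multiplications and additions, which is why the hardware can reuse the QMM and QMA instead of instantiating a dedicated extension field inversion unit.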