1. Introduction
Modular reduction, expressed as c = a mod b, is considered both one of the most computationally expensive basic mathematical instructions and one of the most important operations in modern cryptography. In the simplest terms, modular reduction is the remainder of elementary integer division. From this definition, one could naively conclude that calculating a modular congruence c requires division. This is easy to perform in software, as shown in Equation (1); the floor function can be replaced by integer division, depending on the software language used.

c = a − b · ⌊a/b⌋ (1)

Unfortunately, since Equation (1) requires both a division and a subtraction operation, this implementation is inefficient in both hardware and software [1]. Another naive approach, which can be quickly implemented in software, is shown in Algorithm 1. This algorithm implements a loop that continues to reduce the input a until it is less than the modular base b, returning the modular congruence c.
Algorithm 1 A naive software implementation of modular reduction.
1: c ← a
2: while c ≥ b do
3:   c ← c − b
4: return c
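Algorithm 1 translates directly into software; the following Python sketch mirrors it (the function name is ours, for illustration):

```python
def naive_mod(a, b):
    # Algorithm 1: repeatedly subtract the base b until the value is a
    # valid residue. The iteration count depends directly on the size of
    # the input, which is the timing concern discussed below.
    c = a
    while c >= b:
        c -= b
    return c
```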
This implementation presents two issues. First, the number of iterations through the loop is directly related to the magnitude of the input a. For cryptographic purposes, knowing the relative size of the input can assist an attacker in decrypting encrypted data. Second, there is no defined upper limit to the algorithm's runtime: the required iterations grow without bound as the input grows, resulting in unpredictable latency.
Within this work, we explore state-of-the-art methods for modular reduction, with an emphasis on the hardware requirements for the implementation of each. We argue that modern methods could be both faster and more power efficient, and present our own algorithm as a solution.
Our Hardware Optimized Modular Reduction (HOM-R) algorithm will have impacts in a variety of fields, as a wide breadth of topics rely on modular reduction. This includes fields that use a system of modular numbers, referred to as a Residue Number System (RNS). Specifically, however, we targeted our design for cryptography, as algorithms such as Homomorphic Encryption (HE) and Zero-Knowledge Proofs (ZKP) heavily utilize modular reduction, depending on the specific form of the algorithm used [2,3,4,5].
The HOM-R algorithm is also viable for other applications outside of cryptography. For example, Kalmykov et al. implemented a novel Error-Correcting Code using an RNS, which relies heavily on modular arithmetic [6]. Valueva et al. applied the RNS to Convolutional Neural Networks (CNN) as a means to reduce the hardware cost of the machine learning model [7]. This application of the RNS in AI is also seen in other works, including a multiplication-free neural network by Salamat et al. [8] and an ASIC implementation of an RNS-based CNN by Sakellariou et al. [9]. Other examples include an accelerator for probability models [10], intrusion detection systems [11], and new data storage methods for increased reliability [12].
As we will see in Section 2, modern approaches to modular reduction either rely on high-power resources, such as digital signal processors (DSPs), or have a high latency. Our motivation for this work is to achieve both low power and low latency. As we demonstrate in Section 5, our algorithm relies solely on bit shifts and a predictable number of additions, allowing us to avoid the use of DSPs whilst requiring only a single clock cycle of latency, even at high clock frequencies.
The rest of the paper is organized as follows: in Section 2, we cover other algorithms that are often used for modular reduction. In Section 3, we describe our algorithm and how to implement it based on use-case. In Section 4, we demonstrate the feasibility of our algorithm by implementing it on a Field Programmable Gate Array (FPGA). In Section 5, we discuss the implementation results of the HOM-R system. Finally, in Section 6, we conclude the paper.
2. Background
When describing modular reduction on hardware, there are two methods that must be considered. Before we consider those, however, there are some special cases where one would not opt for either method. First, let us consider a common case of modular reduction: when the modular base b is a power of two (1, 2, 4, 8, etc.). For a power-of-two base, one can simply truncate the bit vector, as shown in Equation (2), where ‘&’ represents the bit-wise AND function and ‘−’ represents conventional integer subtraction.

c = a & (b − 1) (2)

This method, when implemented in hardware, requires simple combinatorial logic that induces minimal latency and resource usage.
The second special case is when the modular base is not a power of two, but the input size is small. One can use a direct look-up table in hardware, or an array in software, to quickly find the modular congruence of the input. With an input width of w bits, this method requires n = 2^w entries of ⌈log2(b)⌉ bits each, leading to a total memory requirement of 2^w · ⌈log2(b)⌉ bits.
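Both special cases are simple in software. The sketch below is illustrative only; the function names and widths are ours:

```python
def mod_pow2(a, k):
    # Base b = 2^k: the residue is just the low k bits, i.e. a bit-wise
    # AND with b - 1, mirroring the truncation described above.
    return a & ((1 << k) - 1)

def make_mod_lut(b, input_width):
    # Small-input case: one precomputed entry per possible input value,
    # so the reduction becomes a single table read.
    return [a % b for a in range(1 << input_width)]
```

A 6-bit input reduced modulo 5, for example, then needs only a single read of `make_mod_lut(5, 6)`.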
For all other cases, prior to the introduction of the HOM-R system, one’s options for modular reduction include one of the three following algorithms: Montgomery reduction, Barrett reduction, or a more recent addition, a fast modular reduction method by Will et al. [13]. We include Will’s method as it serves as the basis for our work.
While other options exist, we do not cover those implementations in depth, as they either have large memory/resource requirements depending on the input, or are sub-optimal for hardware due to a reliance on recursion. This includes work by Parhami et al., in which the authors explore segmentation and truncation within look-up tables for modular reduction [14]. Parhami et al. offer another work exploring the inverse of this operation: combining residues into a unified, binary representation [15]. Another work by Opasatian et al. also explores the use of look-up tables similar to Parhami, but their method requires an additional ‘adjustment’ step that relies on subtraction. The principle behind this step is similar to that in Montgomery reduction, in that the computed value may be twice the value of the modulus [16]. Lim et al. offer a method for modular reduction that also relies on look-up tables; however, they implement their design in software [17]. Another method is the one by Cao et al., which offers a unique approach that evaluates a binary vector based on the longest run of 1s and 0s [18].
2.1. Montgomery Reduction
Montgomery reduction [19] is one of the most commonly used modular reduction algorithms in modern hardware. The premise of the formula is to represent a residue class using a different modular base. Ideally, this new base is a power of two, which can then be reduced via the modular-base-two method listed above. The delay for Montgomery reduction comes into play when changing the modular base. Because of this, a number’s base is often changed only once, where the new base representation, referred to as its Montgomery form, is used for all mathematical calculations. This way, any further modular reductions can be performed via the modular-base-two truncation method. Once a number has finished processing and does not require any more operations, its modular base can be converted back to the original, and the final congruence of the original number can be retrieved.
For an input a < Rb, let R > b, where R is co-prime to b. R should be chosen so that division by R is easy to compute; ideally, R should be a power of two. Further, let b′ = −b⁻¹ mod R. The Montgomery reduction of input a can then be defined as Equation (3).

c = (a + ((a mod R) · b′ mod R) · b) / R (3)
Note that one subtraction of b may be required, since the result of Equation (3) satisfies 0 ≤ c < 2b. It is also important to note that Montgomery reduction requires only a fixed number of multiplications. The number of multiplications required is not correlated to the size of the target number, which is not the case in the following algorithm, the Barrett reduction.
The Montgomery form allows for quick modulo operations where multiple numbers are multiplied or added together, as they can all be converted to the same residue class. Addition in Montgomery space is performed as with regular numbers; however, there is a small caveat to multiplying numbers in Montgomery space. Since two separate numbers in Montgomery space each include a factor of R, multiplication results in an R² term. This is an easy obstacle to overcome: since the Montgomery form relies on R being an integer that is a power of two, the result can be divided by a factor of R through a bit-wise right shift of log₂(R) bits.
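As a concrete sketch of the reduction step, the following is a textbook REDC in Python (not the hardware design evaluated in this paper; the helper names are ours):

```python
def montgomery_setup(b, R):
    # Requires gcd(R, b) = 1 with R = 2^k > b; returns b' such that
    # b * b' ≡ -1 (mod R). pow(b, -1, R) needs Python 3.8+.
    return (-pow(b, -1, R)) % R

def redc(a, b, R, b_prime):
    # Montgomery reduction: returns a * R^(-1) mod b for 0 <= a < R*b.
    m = ((a % R) * b_prime) % R    # with R a power of two, "% R" is a mask
    t = (a + m * b) // R           # "// R" is a right shift of log2(R) bits
    return t - b if t >= b else t  # the single conditional subtraction
```

Because a + m·b is divisible by R by construction, the division is an exact shift, which is what makes a power-of-two R attractive.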
2.2. Barrett Reduction
Barrett’s method [20] is similar to Montgomery’s approach in that it requires the selection of a “magic number” that is relative to the modular base b. When implementing Barrett reduction, one must consider the word size L of the system, which depends heavily on whether a general-purpose computer or a specialized circuit, such as an FPGA, is used. Depending on the word size, a base k must be chosen such that b < 2^k. From k, we can calculate the required parameter μ in Equation (4). There is the further limitation on the input a that a < 4^k.

μ = ⌊4^k / b⌋ (4)

If we assign q = ⌊aμ / 4^k⌋, then q ≤ ⌊a/b⌋ ≤ q + 2. From Equation (5), we can derive Equation (6).

c = a − b · ⌊a/b⌋ (5)

c ≈ a − b · ((aμ) ≫ 2k) (6)

If we precompute μ, it is easy to see from Equation (6) that the main complexity in the Barrett reduction is caused by the multiplication aμ. The residue c can be computed from this value with at most two subtractions and three bit shifts.
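In software, Barrett reduction with the precomputed parameter can be sketched as follows (our illustrative Python, assuming μ = ⌊4^k/b⌋ and inputs a < 4^k):

```python
def barrett_setup(b, k):
    # mu = floor(4^k / b); valid for b < 2^k
    return (1 << (2 * k)) // b

def barrett_reduce(a, b, k, mu):
    # Requires a < 4^k. The quotient estimate q = floor(a*mu / 4^k)
    # satisfies q <= floor(a/b) <= q + 2, so at most two corrective
    # subtractions are needed.
    q = (a * mu) >> (2 * k)
    c = a - q * b
    while c >= b:
        c -= b
    return c
```

Note that the single wide multiplication a·μ is the dominant cost, which on an FPGA maps to DSP blocks.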
2.3. Will’s Algorithm
The look-up table reduction method proposed by Will et al. is a unique and attractive option for modular reduction when compared to Barrett’s and Montgomery’s methods, as it does not require a multiplication primitive [13]. Through a series of bit shifts and additions with precomputed numbers, Will’s method is able to reduce a number to a residue. The basic structure of the algorithm is defined in Algorithm 2, where the input vector of bit width w is split into N vectors of width l, with N defined as ⌈w/l⌉. In this algorithm, and the others that follow, n ≪ m indicates an integer n shifted left m times.
Algorithm 2 Will’s algorithm for modular reduction [13]. |
1: | |
2: | whiledo |
3: | |
4: | for do |
5: | |
6: | while do |
7: | |
8: | |
9: | while do |
10: | |
11: | |
12: | while do |
13: | |
14: | return |
The mod values in Algorithm 2 are precomputed and stored. Will also emphasizes that there can be a larger number of precomputed values, reducing the number of iterations in the for loop. These are stored in the look-up table, and the modified algorithm is shown as Algorithm 3. In this algorithm, is the width of the look-up table, with each value in the look-up table being calculated by Algorithm 4.
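To make the shift-and-add structure concrete, the sketch below gives our multiplication-free reading of this approach in Python (it is not the authors' code, and the `max_width` parameter is ours): each bit above the width m of the base is replaced by its precomputed residue 2^(m+j) mod b, and the process repeats until the value fits within m bits.

```python
def shift_add_reduce(a, b, max_width=64):
    # Precompute 2^(m+j) mod b for every bit position above the base
    # width m; valid for inputs up to max_width + m bits.
    m = b.bit_length()
    table = [pow(2, m + j, b) for j in range(max_width)]
    while a >> m:                       # high bits remain
        acc = a & ((1 << m) - 1)        # keep the low m bits as-is
        high, j = a >> m, 0
        while high:                     # fold each high bit 2^(m+j)
            if high & 1:                # down to its residue table[j]
                acc += table[j]
            high >>= 1
            j += 1
        a = acc                         # strictly smaller, same residue
    return a - b if a >= b else a      # at most one final subtraction
```

Each pass strictly shrinks the value, since every removed bit 2^(m+j) is replaced by a residue smaller than b; however, the number of passes is data-dependent, which is exactly the unpredictable while-loop latency addressed in Section 3.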
Algorithm 3 Will’s algorithm for modular reduction modified for a user defined modular look-up width , and is defined in Algorithm 4 [13]. |
1: | |
2: | |
3: | for do |
4: | |
5: | |
6: | while do |
7: | |
8: | for do |
9: | |
10: | while do |
11: | |
12: | |
13: | |
14: | while do |
15: | |
16: | |
17: | |
18: | while do |
19: | |
20: | return |
Algorithm 4 Modular value look-up tables generation with index width [13]. |
- 1:
- 2:
for do - 3:
- 4:
- 5:
return
|
From these algorithms, there are a few important things to consider. The first concern is the memory complexity of the look-up table, so its size must be determined with device and power constraints in mind. However, this must be a balanced decision, as a smaller look-up table width will require more bit shifts in Algorithm 3, increasing the latency of the system. Another point of note is that the while loops do not have a defined latency. This prevents a combinatorial implementation in hardware, as each iteration of the loop must be evaluated at the end of a clock cycle; this is a pain point for latency. Further, it must also be considered that the summation of the look-up table values can lead to a value that is larger than the defined index range of the table, resulting in undefined behavior. We evaluate these shortcomings and propose solutions in Section 3.
3. Methods
The design of the HOM-R system has a primary focus on low power usage. In terms of FPGAs or ASICs, this means reducing digital signal processor (DSP) usage and keeping the FPGA Look-Up Table (LUT) usage to a minimum. We specifically use the abbreviation “LUT” to refer to the logic resource utilized by the FPGA, and “look-up table” to refer to the modular-value table components required by the HOM-R system. By prioritizing resource utilization and power reduction, we are able to formulate an algorithm that uses zero DSPs and minimal LUTs regardless of the base used. We determine Will’s algorithm to be a good starting point for this objective, as it relies purely on additions and bitwise operations.
The primary issue with Will’s algorithm is the number of while loops, which are non-ideal for hardware implementations due to their unpredictable number of iterations. An ideal hardware implementation should instead use for loops, which have a constrained number of iterations. This allows for predictable latency, thereby allowing the use of combinatorial logic that can fit an entire loop within a single clock cycle. A while loop, by contrast, requires its condition to be evaluated from the output of a circuit. This is only possible by computing an iteration and re-evaluating the loop condition at the end of a clock cycle, leading to an entire clock cycle being spent on a single iteration of the loop.
For our implementation of the HOM-R module to solve c = a mod b, the following terminology will be used:
Stage: Each input vector x will be split into stages of a fixed stage width in order to solve for the residue. Each stage is represented as a stage split. The input vector is therefore transformed into an array of N stages. The calculation for the number of stages N is shown in Equation (7).
Base Stage: This is the least significant stage, stage zero. The final residue c will have a maximum width of the width of the modular base b, and will have exactly one stage remaining in , the base stage.
Modular-Value Look-up Table (MLUT): The look-up table composed of modular values, calculated via Algorithms 5 and 6, and represented by with a depth and with a depth , respectively. There are N different MLUTs, with each MLUT representing the modular values for different . By utilizing an MLUT for each different stage, we can reduce each stage to the base stage in parallel.
Overflow Bits: When summing values from , it is possible that the width of the sum exceeds . These bits are referred to as overflow bits.
A full description of the required parameters for HOM-R is outlined in Table 1.
Algorithm 5 HOM-R 2D modular value look-up table generation with index width and stage width . |
- 1:
- 2:
for do - 3:
for do - 4:
- 5:
- 6:
return
|
Algorithm 6 HOM-R 1D modular value look-up table generation for reducing . Index width and stage width . |
- 1:
- 2:
for do - 3:
- 4:
- 5:
return
|
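In software, the table generation that Algorithms 5 and 6 describe can be sketched as follows (our reading; the names are ours). Each stage s gets a table mapping every possible stage value v to its residue with the stage's positional weight 2^(s·w) folded in:

```python
def build_mluts(b, stage_width, num_stages):
    # MLUT[s][v] = (v << (s * stage_width)) mod b: one table per stage,
    # with 2^stage_width entries each.
    return [[(v << (s * stage_width)) % b for v in range(1 << stage_width)]
            for s in range(num_stages)]
```

With these tables in place, reducing a stage is a single indexed read, which is what lets every stage be handled in parallel.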
The full HOM-R algorithm is defined in Algorithm 7. Line 1 handles the simplest case, when the modular base is an exact power of two. In this case, we can simply truncate the input vector a to the same length as the modular vector b to retrieve the residue c, saving precious hardware memory.
Algorithm 7 Hardware Optimized Modular Reduction Algorithm for . |
1: | if then return |
2: | |
3: | |
4: | |
5: | for do |
6: | |
7: | |
8: | for do |
9: | |
10: | |
11: | for do |
12: | |
13: | |
14: | |
15: | for do |
16: | |
17: | |
18: | if then |
19: | |
20: | return c |
The first loop in the system, Lines 3–10, breaks the input vector a into a series of stage splits that compose the stage array. For each stage split, we compute its residue according to the modular base b using Equation (8). While this value is not the final residue c, it is congruent to the original stage split relative to b and much smaller than the stage's value. We use this value to calculate the final residue c in the rest of the algorithm.
To help demonstrate how Equation (8) may be implemented, consider a four-bit stage split with stage index s, reduced two bits at a time, as presented in Equation (9). As shown in Equation (9), this function merely shifts the vector left by two bits, isolating two bits at a time, and calculates the residue for each. Since there are four bits in the stage split, the vector must be shifted twice. Each shift is equivalent to multiplying by four. We move these factors of four inside or outside the look-up depending on the location of the bits in the vector.
We prove that we can use the bits of a stage split as an index into its look-up table in Equation (8) via Lemmas 1–3. Lemma 1 allows us to prove Lemmas 2 and 3. Lemma 2 proves that the bit vectors can be operated on as a sum of the bits, rather than as the entire vector. Lemma 3 demonstrates that the vector can be shifted (multiplied by a power of two), regardless of whether the factor of two is used to compute the modular residue or to operate on the modular residue.
Lemma 1. For such that .
Lemma 2. For
Lemma 3. For
For the second inner loop in Algorithm 7, we account for overflow bits in . It is important to note that there can be at most overflow bits. Therefore, if the number of overflow bits exceeds , then the corresponding lookup value will be undefined. This is a failure condition that must be avoided. There are potential conditions that may fail this part of the loop, resulting in undefined behavior.
All possible valid values were pre-calculated via an iterative search for combinations and . We noticed that for this range, valid values of relative to a user chosen must satisfy . We extend this to all possible values of and in Lemma 4. These combinations guarantee that any modular base with width , used with the corresponding , will not face the overflow condition described above. By utilizing a for loop with a guaranteed number of iterations, we can utilize a predetermined amount of combinatorial logic within the HOM-R module, allowing for single clock-cycle computations.
Line 13 in Algorithm 7 sums up all values of , storing the sum in c. This sum is then used to compute the maximum value of with lines 15–17. It is important to note that the width of sum() must not exceed , as this will result in undefined behavior. There can be a maximum of addends in line 10, as more than this will overflow . Therefore, must be greater than or equal to . Of course, more bits than necessary will result in a larger circuit, increasing the power draw and space unnecessarily. So, while a larger than is possible, it is recommended to use .
Lemma 4. For if
Lemma 5. For if and .
The first part of the algorithm, Lines 1–17, reduces the input to the single base stage. According to the constraints from the algorithm, the base stage will have the same width as the modular base b. From Lemmas 5 and 6, we can conclude that the reduced value is at most a single subtraction of b away from the residue c. This reduces the while loop of Algorithm 1 to the single if statement in Lines 18–19. The final result is the residue c.
Lemma 6. For if and .
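Putting these pieces together, the reduction of Algorithm 7 can be modeled in software as below. This is a behavioral sketch only: the hardware evaluates all stages in parallel through their MLUTs, whereas this model computes each table entry on the fly with a modular multiplication, and the sequential loop stands in for the fixed number of combinatorial reduction steps.

```python
def homr_reduce(a, b, stage_width=4):
    # Line 1 of Algorithm 7: a power-of-two base reduces to truncation.
    if b & (b - 1) == 0:
        return a & (b - 1)
    mask = (1 << stage_width) - 1
    # Fold stages until the value fits within the width of the base
    # (Lines 5-17); each term models the MLUT read of (v << s*w) mod b.
    while a.bit_length() > b.bit_length():
        n = (a.bit_length() + stage_width - 1) // stage_width
        a = sum((((a >> (s * stage_width)) & mask) * pow(2, s * stage_width, b)) % b
                for s in range(n))
    # Lines 18-19: the result is within one subtraction of the residue.
    return a - b if a >= b else a
```

Each folding pass strictly decreases the value while preserving its congruence, so the loop terminates; in the hardware, the number of such steps is fixed in advance by the chosen widths.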
4. Implementation
We implemented our modular reduction system on an FPGA, and tested various configurations to compare area usage, power consumption, and resource requirements. The full implementation of the HOM-R system relies on internal sub-modules to reduce the individual stages of the input to the base stage. Each stage takes a stage-width input and produces a base-width output. The stage-reduction module encompasses Lines 6–13 in Algorithm 7, and is replicated once per stage within the HOM-R system. For the example illustrated within this section, we demonstrate a four-stage HOM-R implementation. A close-up of the first half of the stage-reduce module is shown in Figure 1, showcasing the hardware implementation of Lines 6–10 in Algorithm 7.
It is important to note that the same look-up table is replicated in the sub-module, as illustrated within the dotted-line box in Figure 1. While this requires an increase in resources, it allows the system to operate on the stage splits in parallel. The alternative is to instead iterate each stage through a single shared table, at one stage per clock cycle, which we determined to be too expensive.
Since each stage split is reduced to the base stage, we must sum the splits together, as in Lines 11–13 of Algorithm 7. This was proven possible in Lemma 2. Similar to the first half of the stage-reduce module, this step is completed with combinatorial logic, reducing the latency of the system. An iterative method was used to find the maximum number of additions required to reduce the width of the sum to the base width. The method yielded that a maximum of two additional addition steps following the original summation of the look-up outputs will always be enough. We illustrate the second half of the stage-reduce module in Figure 2.
These two series of combinatorial logic result in the full stage-reduce module, shown in Figure 3. We place an optional stage register between the output of the stage-reduce module and the output of the full HOM-R system to help with timing for either high-frequency or large-input systems. However, for all of the values tested within this paper, we did not need to rely on the optional register and therefore removed it from our system.
The stage-reduce module must be replicated N times within the modular reduction module. This replication implements Line 4 of Algorithm 7, and allows us to operate on each stage split in parallel. This results in a significant reduction in latency, fitting multiple loops of the algorithm within a single clock cycle. The replication of stage splits is shown in the full hardware implementation of HOM-R in Figure 4. We also show the optional summation register, which, if enabled in the design, registers the output of the first summation only. This provides a significant improvement in timing for high-operating-frequency systems.
The output of each stage-reduce module must be summed together and reduced, similar to the logic within the stage-reduce module. The primary difference is that the size of the look-up table for these reductions is determined by the number of splits rather than the user-defined width. In our example, we use four splits, which results in a look-up table depth of two, as shown in Figure 4.
Once the base stages of the input have been summed and the output reduced to the width of the base, Lemmas 5 and 6 ensure that the value is at most one subtraction of b away from the residue. We can therefore subtract the modular base b from the sum. If the subtraction asserts an overflow (borrow) bit, then the sum is less than b, and we accept the sum as the residue. If there is no overflow bit, then the sum is greater than or equal to b, and we accept the difference as the residue c.
5. Results
Typically, for conventional place-and-route tool flows utilized by standard applications such as Vivado by AMD or Quartus by Intel, a look-up table for each bit of an output vector is generated from a user-defined function. In this case, that function would be modular reduction. We compare the HOM-R system against this conventional method as utilized by Vivado.
For testing, we used a 64-bit vector input with a 150 MHz clock rate. This clock rate was chosen as it was the fastest clock that the conventional method could use before violating timing constraints. Further, since the resource requirements for modular arithmetic depend heavily on the modular base, we tested three different bases. For the HOM-R implementation tested, we disabled all of the registers in the system. This allowed the conventional method and the HOM-R system to have the same latency. These results are shown in Table 2.
It is clear from the table that the HOM-R implementation offers significant improvements over the conventional method for modular reduction. For all test cases, the HOM-R method used an average of 30% less power, 27% fewer configurable logic blocks (CLBs), and 42% fewer LUTs. Further, despite these reductions in power and resources, the HOM-R method showed significantly better timing margins, allowing for a higher maximum operating frequency.
Next, we compare multiple combinations of stage and look-up widths for a fixed base in Table 3. These tests were run at a clock rate of 150 MHz, with the largest configuration being the main limiting test case. For each stage width tested, one look-up width proves the most resource- and power-friendly; however, this is not always an option, as some combinations result in a look-up table with too many values to be synthesized by Vivado. The best-performing configuration in Table 3 is likely due to the arrangement of the CLBs in the Kintex UltraScale+ device, and may vary from device family to family.
We also compared the usage of the various registers in the system. For this test, we used a clock rate of 150 MHz. One would utilize these registers for high-frequency systems. Refer to Figure 3 and Figure 4 for the locations of the summation and stage registers. For each register enabled in the system, an extra clock cycle is added to the overall latency.
One can see from Table 4 that the summation register has the greatest effect on the system, offering the most relaxed timing margins between the two optional registers. Further, this also offers lower power usage, even when compared to the configuration that does not rely on either register. This is likely due to optimizations performed by the place-and-route engine, as extra resources are added in the case where no registers are enabled so as to meet the strict timing requirements. To further relax timing constraints, one can utilize both the stage output register and the summation register. This adds two clock cycles of latency, but can be essential for large modular reductions that must operate at high frequencies.
The final test that we conducted on the HOM-R system evaluated the input width against the width of the modular base, as shown in Table 5. For each test, we used a clock rate of 250 MHz. While not every iteration passed timing, we included failing iterations for comparison with those test iterations that did pass timing.
It can be determined from the table that an increasing input width leads to a linear increase in both resource usage and power consumption. This is expected, because a larger input requires more logic to reduce it to the predetermined base width. Further, it can also be concluded that an increase in the base width directly results in an increase in power consumption, resource usage, and harsher timing requirements. The failing cases in the table could be remedied by enabling registers, as in Table 4.
6. Conclusions
In this paper, we presented a new method for modular reduction that is optimized for hardware. Using a calculated number of reduction cycles, we can use combinatorial logic to reduce large input vectors down to a modular base. Our results show an average of 30% less power usage, 27% fewer CLBs, and 42% fewer LUTs than the conventional method for modular reduction used by Vivado. Further, the HOM-R implementation allows for more relaxed timing requirements, and is able to reduce a 256-bit number down to a four-bit base in a single 250 MHz clock cycle.
Our design also allows for the implementation of registers at multiple stages in the reduction algorithm, enabling the design to be used in systems with high clock frequencies. We showed that even with the implementation of these registers, there are only negligible increases in power consumption and resource utilization, while significant gains are offered in terms of timing requirements.
Moreover, we offer a significant contribution over other popular modular reduction methods, such as Barrett and Montgomery reduction, in that we do not require any multipliers or dividers. Our algorithm is implemented using only simple addition and LUTs with variable sizes, determined by the user’s timing and resource requirements. This improvement is significant, as multiplication and division require DSPs, which are both limited and power hungry on devices such as FPGAs. By removing these requirements, we free up resources for designs that require them, and offer a low-power, low-latency alternative.