Article

Hardware Optimized Modular Reduction

Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204, USA
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 550; https://doi.org/10.3390/electronics14030550
Submission received: 9 December 2024 / Revised: 23 January 2025 / Accepted: 27 January 2025 / Published: 29 January 2025
(This article belongs to the Special Issue Emerging Applications of FPGAs and Reconfigurable Computing System)

Abstract

We introduce a modular reduction method that is optimized for hardware and outperforms conventional approaches. By leveraging calculated reduction cycles and combinatorial logic, we achieve a remarkable 30% reduction in power usage, 27% reduction in Configurable Logic Blocks (CLBs), and 42% fewer look-up tables (LUTs) than the conventional implementation. Our Hardware-Optimized Modular Reduction (HOM-R) system can condense a 256-bit input to a four-bit base within a single 250 MHz clock cycle. Further, our method stands out from prevalent techniques, such as Barrett and Montgomery reduction, by eliminating the need for multipliers or dividers, and relying solely on addition and customizable LUTs. This innovative method frees up FPGA resources typically consumed by power-intensive DSPs, offering a compelling low-power, low-latency alternative for diverse design needs.

1. Introduction

Modular reduction, expressed as (a) mod (b) ≡ c, is considered to be both one of the most computationally expensive basic mathematical instructions and one of the most important instructions in modern cryptography. In the simplest terms, modular reduction is the remainder of elementary integer division. From this definition, one could naively discern that calculating a modular congruence c requires division. This is easy to perform in software, as shown in Equation (1). The floor function can be replaced by integer division, depending on the software language used.
(a) mod (b) ≡ c ≡ a − ⌊a/b⌋·b
Unfortunately, since Equation (1) requires both a division and subtraction operation, this implementation is inefficient for both hardware and software [1]. Another naive approach, which can be quickly implemented in software, is shown in Algorithm 1. This algorithm implements a loop that will continue to reduce the input a until it is less than the modular base, b, returning the modular congruence c.
Algorithm 1 A naive software implementation of modular reduction.
1: c ← a
2: while c ≥ b do
3:     c ← c − b
4: return c
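To make the behavior of this loop concrete, the naive reduction can be sketched in a few lines of Python. This is an illustrative model only, not an implementation from the paper:

```python
def naive_mod(a: int, b: int) -> int:
    """Naive modular reduction by repeated subtraction (per Algorithm 1).

    The loop runs floor(a / b) times, so both the latency and the power
    trace leak information about the magnitude of the input a.
    """
    c = a
    while c >= b:
        c -= b
    return c
```

Each call performs ⌊a/b⌋ subtractions, which is exactly the unbounded-latency problem described below.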
This implementation presents two issues. First, the number of loop iterations is directly related to the size of the input a. For cryptographic purposes, knowing the relative size of the input can assist an attacker in decrypting encrypted data. Second, there is no defined upper limit on the algorithm's runtime. This results in unpredictable latency, where the required iterations grow without bound as the input grows.
Within this work, we explore state-of-the-art methods for modular reduction, with an emphasis on the hardware requirements for the implementation of each. We argue that modern methods could be both faster and more power efficient, and present our own algorithm as a solution.
Our Hardware Optimized Modular Reduction (HOM-R) algorithm will have impacts in a variety of fields, as a wide breadth of topics rely on modular reduction. This includes fields that use a system of modular numbers, referred to as a Residue Number System (RNS). Specifically, however, we targeted our design for cryptography, as algorithms such as Homomorphic Encryption (HE) and Zero-Knowledge Proofs (ZKP) heavily utilize modular reduction, depending on the specific form of the algorithm used [2,3,4,5].
The HOM-R algorithm is also viable for other applications outside of cryptography, however. For example, Kalmykov et al. implemented a novel Error-Correcting Code using a Residue Number System (RNS), which relies heavily on modular arithmetic [6]. Valueva et al. applied the RNS to Convolutional Neural Networks (CNN) as a means to reduce the hardware cost of the machine learning model [7]. This application in AI for RNS is also seen in other works, including a multiplication-free neural network by Salamat et al. [8] and an ASIC implementation of an RNS-based CNN by Sakellariou et al. [9]. Other examples include an accelerator for probability models [10], intrusion detection systems [11], and new data storage methods for increased reliability [12].
As we will see in Section 2, modern approaches to modular reduction either rely on high-power resources, such as digital signal processors (DSPs), or have a high latency. Our motivation for this work is to achieve both low power and low latency. As we demonstrate in Section 5, our algorithm relies solely on bit shifts and a predictable number of additions, allowing us to avoid the use of DSPs whilst only requiring a single clock cycle of latency, even at high clock frequencies.
The rest of the paper is organized as follows: in Section 2, we cover other algorithms that are often used for modular reduction. In Section 3, we describe our algorithm and how to implement it based on use-case. In Section 4, we demonstrate the feasibility of our algorithm by implementing it on a Field Programmable Gate Array (FPGA). In Section 5, we discuss the implementation results of the HOM-R system. Finally, in Section 6, we conclude the paper.

2. Background

When describing modular reduction on hardware, there are two methods that must be considered. However, before we can consider those, there are some special cases where one would not opt for any of these methods. First, let us consider a common case of modular reduction: when the modular base b is a power of two (1, 2, 4, 8, etc.). Of course, for base two modular reduction, one can simply truncate the bit vector, as shown in Equation (2), where ‘&’ represents the bit-wise AND function, and ‘−’ represents the conventional integer subtraction. This method, when implemented in hardware, requires simple combinatorial logic that induces minimal latency and resource usage.
(a) mod (b) ≡ c ≡ a & (b − 1)
The second method that can be used for special cases is when the modular base is not a power of two, and the input size is small. One can use a direct look-up table in hardware, or an array in software, to quickly find the modular congruence of the input. With (a) mod (b) ≡ c and a < n, this method requires n entries of ⌈log2(n)⌉ bits, leading to a total memory requirement of n·⌈log2(n)⌉ bits.
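Both special cases can be sketched in a few lines of Python; the base 11 and the 64-entry input range below are arbitrary illustrative choices, not values from the paper:

```python
def mod_pow2(a: int, b: int) -> int:
    """Power-of-two base: the residue is just the low log2(b) bits (Equation (2))."""
    assert b & (b - 1) == 0, "b must be a power of two"
    return a & (b - 1)

# Small-input case: a direct look-up table with n entries of ceil(log2(n)) bits each.
N_INPUTS, BASE = 64, 11
direct_lut = [x % BASE for x in range(N_INPUTS)]

assert mod_pow2(1234, 16) == 1234 % 16
assert direct_lut[37] == 37 % 11
```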
For all other cases, prior to the introduction of the HOM-R system, one’s options for modular reduction include one of the three following algorithms: Montgomery reduction, Barrett reduction, or a more recent addition, a fast modular reduction method by Will et al. [13]. We will include Will’s method as it serves as the basis for our work.
While other options exist, we do not cover those implementations in depth, as they either have large memory/resource requirements depending on the input, or are sub-optimal for hardware due to a reliance on recursion. This includes work by Parhami et al., in which the authors explore segmentation and truncation within look-up tables for modular reduction [14]. Parhami et al. offer another work exploring the inverse of this operation: combining residues into a unified, binary representation [15]. A work by Opasatian et al. also explores the use of look-up tables similar to Parhami, but their method requires an additional 'adjustment' step that relies on subtraction. The principle behind this step is similar to that in Montgomery reduction, in that the computed value may be twice the value of the modulus [16]. Lim et al. offer a method for modular reduction that also relies on look-up tables; however, they implement their design in software [17]. Another method is the one by Cao et al., which offers a unique approach that evaluates a binary vector based on its longest runs of 1s and 0s [18].

2.1. Montgomery Reduction

Montgomery reduction [19] is one of the most commonly used modular reduction algorithms in modern hardware. The premise of the formula is to represent a residue class using a different modular base. Ideally, this new base is a power of two, which can then be reduced via the modular-base-two method listed above. The delay in Montgomery reduction comes into play when changing the modular base. Because of this, a number's base is often changed only once, where the new base representation, referred to as its Montgomery form, is used for all mathematical calculations. This way, any further modular reductions can be performed via the modular-base-two truncation method. Once a number has finished processing and does not require any more operations, its modular base can be converted back to the original, and the final congruence of the original number can be retrieved.
For (a) mod (b) ≡ c, let R ∈ Z with R > b, where R is co-prime to b. R should be chosen so that (a) mod (R) is easy to compute; ideally, R should be a power of two. Further, let b′ = −b⁻¹ mod R. The Montgomery reduction of input a can then be defined as Equation (3).
REDC(a) = (a·R⁻¹) mod (b)
Note that one subtraction of b may be required, since 0 ≤ REDC(a) < 2b. It is also important to note that Montgomery reduction only requires k(k + 1) multiplications. The number of multiplications required is not correlated with the size of the target number, which is not the case for the following algorithm, Barrett reduction.
The Montgomery form allows for quick modulo operations where multiple numbers are multiplied or added together, as they can all be converted to the same residue class. Addition in Montgomery space is performed as with regular numbers; however, there is a small caveat to multiplying numbers in Montgomery space. Since two separate numbers in Montgomery space each include a factor of R, multiplication results in an R² term. This is an easy obstacle to overcome: since the Montgomery form relies on R being an integer that is a power of two, the result can be divided by a factor of R through a bit-wise right shift of log2(R) bits.
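The REDC step and a multiplication in Montgomery form can be modeled in Python as below. This is a textbook sketch for illustration: the modulus 239 and R = 2^16 are arbitrary choices (any odd b < R works), not parameters from the paper.

```python
def montgomery_setup(b: int, k: int):
    """Precompute R = 2^k and b' = -b^(-1) mod R for an odd modulus b < R."""
    R = 1 << k
    b_neg_inv = pow(-b, -1, R)          # requires gcd(b, R) = 1, i.e., b odd
    return R, b_neg_inv

def redc(T: int, b: int, k: int, R: int, b_neg_inv: int) -> int:
    """REDC(T) = T * R^(-1) mod b, valid for 0 <= T < R*b."""
    m = ((T & (R - 1)) * b_neg_inv) & (R - 1)   # T mod R is a mask, not a division
    t = (T + m * b) >> k                        # exact shift: T + m*b is divisible by R
    return t - b if t >= b else t               # single conditional subtraction

# Multiply two numbers entirely in Montgomery form.
b, k = 239, 16
R, b_inv = montgomery_setup(b, k)
x_bar, y_bar = (100 * R) % b, (200 * R) % b     # convert inputs into Montgomery form
prod = redc(redc(x_bar * y_bar, b, k, R, b_inv), b, k, R, b_inv)
assert prod == (100 * 200) % b
```

Note that the product of two Montgomery-form numbers carries an R² factor, so one REDC returns the product in Montgomery form and a second REDC converts it back to a plain residue.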

2.2. Barrett Reduction

Barrett's method [20] is similar to Montgomery's approach in that it requires the selection of a "magic number" that is relative to the modular base b. When implementing Barrett reduction, one must consider the word size L of the system, which depends heavily on whether a general-purpose computer or a specialized circuit, such as an FPGA, is used. Depending on the word size, a base β must be chosen such that β = 2^L. From β, we can calculate the required parameter μ in Equation (4), where k = ⌊log_β(b)⌋ + 1. There is the further limitation on a from (a) mod (b) ≡ c that 0 ≤ a < β^(2k).
μ = ⌊β^(2k) / b⌋
If we assign q = ⌊a/b⌋, then (a) mod (b) ≡ c ≡ a − q·b. From Equation (5), we can derive Equation (6).
a/b = (a / β^(k−1)) · (β^(2k) / b) · (1 / β^(k+1))
0 ≤ q̂ = ⌊⌊a / β^(k−1)⌋ · μ / β^(k+1)⌋ ≤ ⌊a/b⌋ = q
If we precompute μ, it is easy to see from Equation (6) that the main complexity in the Barrett reduction is caused by the multiplication by μ. The residue c can be computed from this value as c = a − q̂·b, followed by at most two conditional subtractions of b (since q − q̂ ≤ 2); in total, this involves at most two subtractions and three bit shifts.
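The bit-level case (β = 2, so the divisions by β^(k−1) and β^(k+1) are plain shifts) can be sketched in Python as follows; b = 239 is an arbitrary example modulus, not a value from the paper:

```python
def barrett_setup(b: int):
    """Precompute k and mu = floor(2^(2k) / b) for the bit-level (beta = 2) case."""
    k = b.bit_length()
    return k, (1 << (2 * k)) // b

def barrett_reduce(a: int, b: int, k: int, mu: int) -> int:
    """Barrett reduction of a < 2^(2k): one multiply by mu, shifts, <= 2 subtractions."""
    q_hat = ((a >> (k - 1)) * mu) >> (k + 1)    # Equation (6): q_hat <= q = a // b
    c = a - q_hat * b
    while c >= b:                               # q - q_hat <= 2: at most two passes
        c -= b
    return c

b = 239
k, mu = barrett_setup(b)
assert all(barrett_reduce(a, b, k, mu) == a % b for a in range(0, b * b, 251))
```

The single multiplication by the precomputed μ is the step that typically maps to a DSP block on an FPGA.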

2.3. Will’s Algorithm

The look-up table reduction method proposed by Will et al. is a unique and attractive option for modular reduction when compared to Barrett's and Montgomery's methods, as it does not require a multiplication primitive [13]. Through a series of bit shifts and additions with precomputed numbers, Will's method is able to reduce a number to a residue. The basic structure of the algorithm is defined in Algorithm 2, where the bit vector width is β = ⌈log2(a + 1)⌉, the input vector is split into l-width vectors, defined as ŝ = [Σ_{i=0}^{l−1} a_i·2^i, …, Σ_{i=0}^{l−1} a_{β−l+i}·2^i], and N is defined as N = length(ŝ) − 1. In this algorithm, and the others that follow, n << m indicates an integer n shifted left m times.
Algorithm 2 Will's algorithm for modular reduction [13].
1: n ← N
2: while n > 0 do
3:     T ← ŝ[n]
4:     for i = β − 1 downto 0 do
5:         T ← T << 1
6:         while (T & 2^β) ≠ 0 do
7:             T ← T & (2^β − 1) + (2^β) mod b
8:     ŝ[n − 1] ← ŝ[n − 1] + T
9:     while ŝ[n − 1] & 2^β ≠ 0 do
10:        ŝ[n − 1] ← ŝ[n − 1] & (2^β − 1) + (2^β) mod b
11:    n ← n − 1
12: while ŝ[0] > b do
13:    ŝ[0] ← ŝ[0] − b
14: return ŝ[0]
The mod values in Algorithm 2 are precomputed and stored. Will also emphasizes that there can be a larger number of precomputed values, reducing the number of iterations in the for loop. These are stored in the look-up table, and the modified algorithm is shown as Algorithm 3. In this algorithm, δ is the width of the look-up table, with each value in the look-up table being calculated by Algorithm 4.
Algorithm 3 Will's algorithm for modular reduction, modified for a user-defined modular look-up width δ; m̂ is defined in Algorithm 4 [13].
1: n ← N
2: mask ← 0
3: for i = δ − 1 downto 0 do
4:     mask ← mask + 2^i
5: mask ← mask << β
6: while n > 0 do
7:     T ← ŝ[n]
8:     for i = β − 1 downto 0 do
9:         T ← T << δ
10:        while (T & mask) ≠ 0 do
11:            î ← (T & mask) >> β
12:            T ← T & (2^β − 1) + m̂[î]
13:    ŝ[n − 1] ← ŝ[n − 1] + T
14:    while ŝ[n − 1] & mask ≠ 0 do
15:        î ← (ŝ[n − 1] & mask) >> β
16:        ŝ[n − 1] ← ŝ[n − 1] & (2^β − 1) + m̂[î]
17:    n ← n − 1
18: while ŝ[0] > b do
19:    ŝ[0] ← ŝ[0] − b
20: return ŝ[0]
Algorithm 4 Modular value look-up table generation with index width δ [13].
1: m̂ ← [0] × 2^δ
2: for i = 0 upto 2^δ do
3:     d ← i << β
4:     m̂[i] ← d mod b
5: return m̂
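In Python, Algorithm 4 and the core "fold the overflow bits through the table" step (the inner while loops of Algorithm 3) might look as follows. The parameters b = 11, β = 4, δ = 4 are arbitrary illustrative choices:

```python
def gen_mlut(b: int, beta: int, delta: int):
    """Algorithm 4: m_hat[i] = (i << beta) mod b for every delta-bit overflow pattern."""
    return [(i << beta) % b for i in range(1 << delta)]

def fold_reduce(v: int, b: int, beta: int, delta: int, mlut) -> int:
    """Replace the bits above position beta with their precomputed residue until none remain.

    Assumes a non-power-of-two b, beta = ceil(log2(b + 1)), and v < 2^(beta + delta),
    so every overflow pattern is a valid table index.
    """
    mask = (1 << beta) - 1
    while v >> beta:                       # overflow bits still present
        v = (v & mask) + mlut[v >> beta]   # fold them back in as a residue
    return v - b if v >= b else v          # final conditional subtraction

b, beta, delta = 11, 4, 4
mlut = gen_mlut(b, beta, delta)
assert all(fold_reduce(v, b, beta, delta, mlut) == v % b
           for v in range(1 << (beta + delta)))
```

Each fold strictly decreases v, since the overflow bits are worth at least 2^β per unit while their residue is less than b; this is the iteration whose unbounded count the while loops expose.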
From these algorithms, there are a few important things to consider. The first is the memory complexity of Θ(2^δ): the look-up table size must be determined with device and power constraints in mind, with the smallest table obtained at δ = 1. However, this must be a balanced decision, as a smaller look-up table width requires more bit shifts in Algorithm 3, increasing the latency of the system. Another point of note is that the while loops do not have a defined latency. This prevents a combinatorial implementation in hardware, as each iteration of the loop must be evaluated at the end of a clock cycle; this is a pain point for latency. Further, it must also be considered that the summation of look-up table values can lead to a value that is larger than the defined index range of m̂, resulting in undefined behavior. We evaluate these shortcomings and propose solutions in Section 3.

3. Methods

The design of the HOM-R system has a primary focus on low power usage. In terms of FPGAs or ASICs, this means reducing digital signal processor (DSP) usage and keeping the FPGA Look-Up Table (LUT) usage to a minimum. We specifically use the abbreviation "LUT" to refer to the logic resource utilized by the FPGA, and "look-up table" to refer to the m̂ and m̂′ components required by the HOM-R system. By prioritizing resource utilization and power reduction, we are able to formulate an algorithm that requires zero DSPs and minimal LUTs regardless of the base used. We determine Will's algorithm to be a good starting point for this objective, as it relies purely on additions and bitwise operations.
The primary issue with Will's algorithm is the number of while loops, which are non-ideal for hardware implementations due to their unpredictable number of iterations. An ideal hardware implementation should focus on utilizing for loops, which have a constrained number of iterations. This allows for predictable latency, thereby allowing the use of combinatorial logic that can fit an entire loop within a single clock cycle. This is opposed to the while loop, which requires the condition to be evaluated based on the output of a circuit. This is only possible by calculating an iteration and re-evaluating the loop condition at the end of a clock cycle, leading to an entire clock cycle being used for a single iteration of the loop.
For our implementation of the HOM-R module to solve the equation ( a ) mod ( b ) c , the following terminology will be used:
  • Stage: Each input vector x of width Ω = ⌈log2(a + 1)⌉ is split into stages of width ω to solve for a residue of width β = ⌈log2(b + 1)⌉. Each stage is represented as an ŝ. The input vector is therefore transformed into an array of N stages, represented as Ψ, where N = ⌈(Ω − β)/ω⌉. The calculation of Ψ is shown in Equation (7).
    Ψ = [ŝ_0, …, ŝ_{N−1}] = [Σ_{i=0}^{β−1} a_i·2^i, …, Σ_{i=0}^{β−1} a_{Ω−β+i}·2^i]
  • Base Stage: This is the least significant stage, stage zero. The final residue c will have a maximum width of β, the width of the modular base b, and exactly one stage, the base stage, will remain in Ψ.
  • Modular-Value Look-up Table (MLUT): The look-up table composed of modular values, calculated via Algorithms 5 and 6, and represented by m̂ with a depth of 2^δ and m̂′ with a depth of 2^δ, respectively. There are N different MLUTs, with each MLUT representing the modular values for a different ŝ. By utilizing an MLUT for each stage, we can reduce each stage to the base stage in parallel.
  • Overflow Bits: When summing values from m ^ , it is possible that the width of the sum exceeds β . These bits are referred to as overflow bits.
A full description of the required parameters for HOM-R are outlined in Table 1.
Algorithm 5 HOM-R 2D modular value look-up table generation with index width δ and stage width β.
1: m̂ ← [[0] × 2^δ] × (N − 1)
2: for i = 0 upto N − 1 do
3:     for j = 0 upto 2^δ do
4:         d ← j << ((i + 1)·β)
5:         m̂[i][j] ← d mod b
6: return m̂
Algorithm 6 HOM-R 1D modular value look-up table generation for reducing c = sum(σ). Index width δ and stage width β.
1: m̂′ ← [0] × 2^δ
2: for i = 0 upto 2^δ do
3:     d ← i << β
4:     m̂′[i] ← d mod b
5: return m̂′
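A Python sketch of both generators follows; the parameters (base 14, four 4-bit stages, δ = 2, taken from the Section 4 example) are used purely for illustration, and m̂′ is written mlut_1d:

```python
def gen_mlut_2d(b: int, beta: int, delta: int, n_stages: int):
    """Algorithm 5 sketch: m_hat[i][j] = (j << ((i + 1) * beta)) mod b, one table per stage."""
    return [[(j << ((i + 1) * beta)) % b for j in range(1 << delta)]
            for i in range(n_stages - 1)]

def gen_mlut_1d(b: int, beta: int, delta: int):
    """Algorithm 6 sketch: m_hat'[i] = (i << beta) mod b, for reducing c = sum(sigma)."""
    return [(i << beta) % b for i in range(1 << delta)]

# Illustrative parameters: base 14 with four 4-bit stages and delta = 2.
mlut_2d = gen_mlut_2d(14, 4, 2, 4)
mlut_1d = gen_mlut_1d(14, 4, 2)
assert mlut_2d[0][1] == (1 << 4) % 14    # stage 1 positional weight: 2^beta
assert mlut_2d[2][3] == (3 << 12) % 14   # stage 3 positional weight: 2^(3*beta)
assert mlut_1d[1] == (1 << 4) % 14
```

Because each stage has its own table with the stage's positional weight folded into the stored residues, all stages can be reduced to the base stage in parallel.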
The full HOM-R algorithm is defined in Algorithm 7. Line 1 handles the simplest case, when the modular base is a power of two. In this case, we can simply truncate the input vector a to the length of the modular vector b to retrieve the residue c, saving precious hardware memory.
Algorithm 7 Hardware Optimized Modular Reduction Algorithm for (a) mod (b) ≡ c.
1: if ⌈log2(b)⌉ == ⌊log2(b)⌋ then return a & (b − 1)
2: N ← ⌈(Ω − β)/ω⌉
3: σ ← [0] × N
4: σ[0] ← Ψ[0]
5: for i = 1 upto N do
6:     idx ← Ψ[i] & (2^δ − 1)
7:     σ[i] ← m̂[i][idx]
8:     for j = 1 upto β/δ do
9:         idx ← (Ψ[i] >> (j·δ)) & (2^δ − 1)
10:        σ[i] ← σ[i] + 2^(jδ)·m̂[i][idx]
11:    for j = 0 upto NUM_SUMS do
12:        idx ← σ[i] >> β
13:        σ[i] ← σ[i] & (2^β − 1) + m̂[0][idx]
14: c ← sum(σ)
15: for j = 0 upto NUM_SUMS do
16:    idx ← c >> β
17:    c ← c & (2^β − 1) + m̂′[idx]
18: if c ≥ b then
19:    c ← c − b
20: return c
The first loop in the system, Lines 3–10, breaks the input vector a into a series of ŝ that compose the array Ψ. For each ŝ, we compute σ according to the modular base b using Equation (8). While σ is not the final residue c, it is congruent to the original ŝ relative to b and much smaller than ŝ. We use this value to calculate the final residue c in the rest of the algorithm.
f(x) = (ŝ >> (x·δ)) & (2^δ − 1)
σ = f(0) mod (c) + Σ_{i=1}^{ω/δ} 2^(iδ)·(f(i) mod (c))
To help demonstrate how Equation (8) may be implemented for a four-bit ŝ with a stage index s in Ψ, composed of bits b3 b2 b1 b0 with δ = 2, we present Equation (9). As shown in Equation (9), this function merely shifts the vector ŝ by two bits at a time, isolating two bits per step, and calculates the residue for each. Since there are four bits in ŝ, the vector must be shifted twice. Each shift is equivalent to multiplying by four. We move these factors of four inside or outside the m̂ look-up depending on the location of the bits in the vector ŝ.
σ = (2^(sβ)·(b1 b0)) mod (c) + 4·((2^(sβ)·(b3 b2)) mod (c)) = m̂[s − 1][b1 b0] + 4·m̂[s − 1][b3 b2]
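This split-and-look-up identity can be checked exhaustively in Python for a 4-bit stage. Here b = 11 and stage index s = 2 are arbitrary illustrative values, and mlut plays the role of m̂[s − 1]:

```python
b = 11                     # illustrative modular base
beta, delta, s = 4, 2, 2   # stage width, LUT index width, stage index in Psi
mlut = [(j << (s * beta)) % b for j in range(1 << delta)]   # m_hat[s - 1]

for s_hat in range(1 << beta):                 # every possible 4-bit stage value
    lo, hi = s_hat & 0b11, s_hat >> 2          # isolate two bits at a time
    sigma = mlut[lo] + (1 << delta) * mlut[hi] # factor of four moved outside the look-up
    # sigma is congruent to the stage's positional value, yet far smaller than it
    assert sigma % b == (s_hat << (s * beta)) % b
```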
We prove that we can use the bits in s ^ as an index to m ^ in Equation (8) via Lemmas 1–3. Lemma 1 allows us to prove Lemmas 2 and 3. Lemma 2 proves that the bit vectors can be operated on as a sum of the bits, rather than as the entire vector. Lemma 3 demonstrates that the vector can be shifted (multiplied by a power of two), regardless of whether the factor of two is used to compute the modular residue or operate on the modular residue.
Lemma 1.
For a, b, c ∈ Z with (a) mod (b) ≡ c, there exists k ∈ Z such that c = a + k·b.
Proof. From the definition of (a) mod (b) ≡ c, b | (a − c). From the definition of divisibility, there exists k ∈ Z such that k·b = a − c, and therefore c = a + k·b (absorbing the sign into k). □
Lemma 2.
For a, b, c, d, m, k, j ∈ Z, if a ≡ b mod m and c ≡ d mod m, then a + c ≡ (b + d) mod m.
Proof. From Lemma 1, a = b + k·m and c = d + j·m. Summing the two, a + c = (b + d) + (k + j)·m, which is congruent to b + d modulo m by Lemma 1. □
Lemma 3.
For a, b, c, m ∈ Z, a·[(b·c) mod m] ≡ (a·b)·(c mod m) mod m.
Proof. From Lemma 1, the l.h.s. equals a·(b·c + k·m) = a·b·c + a·k·m, and the r.h.s. equals a·b·(c + j·m) = a·b·c + a·b·j·m. Assigning x = a·b·c, the two sides become x + a·k·m and x + a·b·j·m, which are congruent modulo m by the definition in Lemma 1. □
For the second inner loop in Algorithm 7, we account for the overflow bits in σ. It is important to note that there can be at most δ overflow bits: if the number of overflow bits exceeds δ, the corresponding look-up value is undefined. This is a failure condition that must be avoided.
All possible valid values were pre-calculated via an iterative search over the combinations δ ∈ [1, 18] and ω ∈ [1, 18]. We noticed that, for this range, valid values of δ relative to a user-chosen ω must satisfy ω/2 ≤ δ ≤ ω. We extend this to all possible values of ω and δ in Lemma 4. These combinations guarantee that any modular base of width ω, used with the corresponding δ, will not face the overflow condition described above. By utilizing a for loop with a guaranteed number of iterations, we can utilize a predetermined amount of combinatorial logic within the HOM-R module, allowing for single clock-cycle computations.
Line 14 in Algorithm 7 sums all values of σ, storing the sum in c. This sum is then reduced in Lines 15–17. It is important to note that the width of sum(σ) must not exceed the index range of m̂′, as this will result in undefined behavior. There can be a maximum of 2^δ addends in the sum, as more than this will overflow m̂′. Therefore, δ must be greater than or equal to ⌈log2(N + 1)⌉. Of course, more bits than necessary will result in a larger circuit, increasing the power draw and area unnecessarily. So, while a δ larger than ⌈log2(N + 1)⌉ is possible, it is recommended to use δ = ⌈log2(N + 1)⌉.
Lemma 4.
⌈log2(σ + 1)⌉ ≤ ω + δ if δ, ω, c ∈ Z≥0, ω/2 ≤ δ ≤ ω, max(c) = 2^δ − 1, and σ = f(0) mod (c) + Σ_{i=1}^{ω/δ} 2^(iδ)·(f(i) mod (c)).
Proof. Evaluate f(i) mod (c) at its largest value: max(mod (c)) = max(c − 1, 0), so max(f(i) mod (c)) = max(2^δ − 2, 0), and max(σ) = max(2^δ − 2, 0) + Σ_{i=1}^{ω/δ} 2^(iδ)·max(2^δ − 2, 0).
(A) Evaluate at the maximum boundary δ = ω:
σ = 2^δ − 2 + Σ_{i=1}^{1} 2^(iδ)·(2^δ − 2) = 2^δ − 2 + 2^δ·(2^δ − 2) = 2^(2δ) − 2,
so ⌈log2(2^(2δ) − 1)⌉ ≤ 2δ, i.e., 2^(2δ) − 1 ≤ 2^(2δ). ✓
(B) Evaluate at the minimum boundary δ = ω/2, at ω = 1 (so ω + δ = 3/2 and max(2^δ − 2, 0) = 0):
l.h.s. = ⌈log2(1 + max(2^δ − 2, 0) + Σ_{i=1}^{2} 2^(iδ)·max(2^δ − 2, 0))⌉ = ⌈log2(1 + 0 + 2^(ω/2)·(0) + 2^ω·(0))⌉ = ⌈log2(1)⌉, giving 0 ≤ 3/2. ✓
From (A) and (B), ⌈log2(σ + 1)⌉ ≤ ω + δ. □
Lemma 5.
⌈log2(x + y + 1)⌉ ≥ ⌈log2(y + 1)⌉ if x, y ∈ Z≥0 and ⌈log2(x + 1)⌉ = ⌈log2(y + 1)⌉.
Proof. log2(x + y + 1) = log2(x·(1 + y/x + 1/x)) = log2(x) + log2(1 + (y + 1)/x). Evaluate log2(1 + (y + 1)/x) at its minimum: since (y + 1)/x ∈ R>0, log2(1 + (y + 1)/x) > 0. Therefore ⌈log2(x) + log2(1 + (y + 1)/x)⌉ ≥ ⌈log2(x + 1)⌉ = ⌈log2(y + 1)⌉, and hence ⌈log2(x + y + 1)⌉ ≥ ⌈log2(y + 1)⌉. □
The first part of the algorithm, Lines 1–17, reduces Ψ to the single ŝ base stage. According to the constraints of the algorithm, the base stage will have the same width as the modular base b; therefore, 0 ≤ Ψ[0] < 2b. From Lemmas 5 and 6, we can conclude that either ŝ[0] ≥ b or ŝ[0] − b < b must be true. This reduces the while loop of Algorithm 1 to the single if statement in Lines 18–19 of Algorithm 7. We can therefore determine that an input vector of width Ω, once reduced to width β, is at most a single subtraction from being the residue (a) mod (b) ≡ c. The final result is the residue c ≡ (a) mod (b).
Lemma 6.
⌈log2(x − y + 1)⌉ ≤ ⌈log2(y + 1)⌉ if x, y ∈ Z≥0, x > y, and ⌈log2(x + 1)⌉ = ⌈log2(y + 1)⌉.
Proof. 
log2(x − y + 1) = log2(x·(1 − y/x + 1/x)) = log2(x) + log2(1 + (1 − y)/x). Evaluate log2(1 + (1 − y)/x): since x > y ≥ 0, we have (1 − y)/x ≤ 1/x, so log2(1 + (1 − y)/x) ≤ 0 for y ≥ 1, giving log2(x − y + 1) ≤ log2(x) ≤ log2(x + 1). Hence ⌈log2(x − y + 1)⌉ ≤ ⌈log2(x + 1)⌉ = ⌈log2(y + 1)⌉. □
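Putting the pieces together, a behavioral model of the full HOM-R flow can be sketched in Python. This is a simplified sketch, not the authors' RTL: it assumes stage width ω = β and β = ⌈log2(b + 1)⌉, and every `% b` below stands in for one precomputed MLUT entry (the hardware performs no division). The parameters in the assertions match the Section 4 example (16-bit input, base 14, β = 4, δ = 2).

```python
def homr_model(a: int, b: int, width: int, beta: int, delta: int) -> int:
    """Behavioral model of Algorithm 7 for (a) mod (b) = c, assuming omega == beta."""
    if b & (b - 1) == 0:                  # Line 1: power-of-two base, just truncate
        return a & (b - 1)
    mask = (1 << beta) - 1
    n_stages = width // beta
    stages = [(a >> (i * beta)) & mask for i in range(n_stages)]
    sigma = stages[0]                     # the base stage passes through unchanged
    for i in range(1, n_stages):          # each stage is reduced in parallel in hardware
        for shift in range(0, beta, delta):
            idx = (stages[i] >> shift) & ((1 << delta) - 1)
            sigma += (idx << (shift + i * beta)) % b   # m_hat[i - 1] look-up
    while sigma >> beta:                  # fold overflow bits back in via m_hat'
        sigma = (sigma & mask) + ((sigma >> beta) << beta) % b
    return sigma - b if sigma >= b else sigma   # final conditional subtraction

assert all(homr_model(a, 14, 16, 4, 2) == a % 14 for a in range(0, 1 << 16, 97))
assert homr_model(0xBEEF, 16, 16, 4, 2) == 0xBEEF % 16
```

Note that the model uses a while loop for the final fold purely for brevity; as shown above, a bounded number of fold steps suffices, which is what allows the hardware to use a fixed amount of combinatorial logic.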

4. Implementation

We implemented our modular reduction system on an FPGA, and tested various configurations to compare area usage, power consumption, and resource requirements. The full implementation of the HOM-R system relies on internal sub-modules to reduce the individual stages of the input Ψ to the base stage. Each stage has an input of β bits and an output of β bits. The stage-reduction module encompasses Lines 6–13 of Algorithm 7, and is replicated N − 1 times within the HOM-R system. For the example illustrated within this section, we demonstrate a four-stage HOM-R implementation for x[15:0] mod (14) ≡ y with β = 4 and δ = 2. A close-up of Ψ[1] is shown in Figure 1, showcasing the hardware implementation of Lines 6–10 of Algorithm 7.
It is important to note that the same m ^ is replicated in the sub-module. This is illustrated within the dotted-line box in Figure 1. While this requires an increase in resources, it allows the system to operate on the stage splits in parallel. The alternate option is to instead iterate each stage through m ^ , at one stage per clock cycle, which we determined to be too expensive.
Since each stage split is reduced to the base stage, we must sum them together, as in Lines 11–13 of Algorithm 7. This was proven possible in Lemma 2. Similar to the first stage-reduce module, this step is completed with combinatorial logic, reducing the latency of the system. An iterative method was used to find the maximum number of additions required to reduce the width of the sum to β. The method yielded that a maximum of two additional addition steps following the original summation of the m̂ outputs is always enough to reduce the width to β. We illustrate the second half of the stage-reduce module in Figure 2 for x[15:0] mod (14) ≡ y with β = 4 and δ = 2.
These two series of combinatorial logic result in the full stage-reduce module, shown in Figure 3. We place an optional stage register between the output of the stage-reduce module and the output of the full HOM-R system to help with timing for either high-frequency or large-input systems. However, for all of the values tested within this paper, we did not need to rely on the optional register and therefore removed it from our system.
The stage-reduce module must be replicated Ω β ω times within the modular reduction module. This replication implements Line 4 of Algorithm 7, and allows us to operate on each stage split in parallel. This results in a significant reduction in latency, fitting multiple loops of the algorithm within a single clock cycle. The replication of stage splits is shown in the full hardware implementation of HOM-R in Figure 4. We also show the optional summation register, which, if enabled in the design, will register the output for the first summation only. This provides a significant improvement in timing for high-operating-frequency systems.
The output of each stage-reduce module must be summed together and reduced, similar to the logic within the stage-reduce module. The primary difference is that the size of the look-up table for these reductions, m̂′, is ⌈log2(N + 1)⌉ rather than the user-defined δ. In our example, we use four splits, which results in an m̂′ depth of two, as shown in Figure 4.
Once the base stages of the input have been summed and the output reduced to the width β, Lemmas 5 and 6 ensure that the value is at most 2b. We can therefore subtract the modular base b from the sum. If the subtraction's overflow bit is asserted, then the sum is less than b, and we accept the sum as the residue. If the overflow bit is not asserted, then the sum is greater than or equal to b, and we accept the difference as the residue c.

5. Results

Typically, for the conventional place-and-route tool flows utilized by standard applications such as Vivado by AMD or Quartus by Intel, a look-up table is generated for each bit of an output vector depending on a user-defined function. In this case, that function would be modular reduction. We compare the HOM-R system against this conventional method as utilized by Vivado.
For testing, we used a 64-bit vector input with a 150 MHz clock rate. This clock rate was chosen as it was the fastest clock that the conventional method could use before violating timing constraints. Further, since the resource requirements for modular arithmetic depend heavily on the modular base, we tested three different bases for β = 32 and β = 16 . For the HOM-R implementation tested, we used δ = ω = 8 , and disabled all of the registers in the system. This allowed for the conventional method and the HOM-R system to have the same latency. These results are shown in Table 2.
It is clear from the table that the HOM-R implementation offers significant improvements over the conventional method for modular reduction. For all test cases, the HOM-R method used an average of 30% less power, 27% fewer configurable logic blocks (CLBs), and 42% fewer LUTs. Further, despite these reductions in power and resources, the HOM-R method showed significantly better timing margins, allowing for a higher maximum operating frequency.
Next, we compare multiple combinations of ω and δ for Ω = 32, β = 16 in Table 3. These tests were run at a clock rate of 150 MHz, with the main limiting test case being ω = 2, δ = 1. For each value of ω, it is apparent that δ = ω is the most resource- and power-friendly choice. However, this is not always an option, as ω = δ = 16 results in a LUT with 2^16 values, which is too large to be synthesized by Vivado. Based on the results in Table 3, it appears best to use ω = δ = 8; however, this is likely due to the configuration of the CLBs in the Kintex UltraScale+ device, and may vary from device family to device family.
We also compared the usage of the various registers in the system. For this test, we used Ω = 64 , ω = 8 , δ = 8 , and a clock rate of 150 MHz. One would utilize these registers for high-frequency systems. Refer to Figure 3 and Figure 4 for the location of the summation and stage registers. For each register enabled in the system, an extra clock cycle is added to the overall latency.
One can see from Table 4 that the summation register has the greatest effect on the system, offering the most relaxed timing margin of the two optional registers. It also offers lower power usage, even compared to the configuration that uses neither register. This is likely due to optimizations performed by the place-and-route engine: when no registers are enabled, extra resources are added to meet the strict timing requirements. To relax timing constraints further, one can utilize both the stage output register and the summation register. This adds two clock cycles of latency, but can be essential for large modular reductions that must operate at high frequencies.
The final test that we conducted on the HOM-R system evaluated Ω against β , as shown in Table 5. For each test, we used a value of β = ω = δ at a clock rate of 250 MHz. While not every iteration passed timing, we included failing iterations for comparison with those test iterations that did pass timing.
It can be determined from the table that increasing Ω leads to a roughly linear increase in both resource usage and power consumption. This is expected, because a larger input requires more logic to reduce it to the predetermined β. Further, an increase in β directly results in higher power consumption, greater resource usage, and tighter timing requirements. The failing cases in the table could be remedied by enabling registers, as in Table 4.

6. Conclusions

In this paper, we presented a new method for modular reduction that is optimized for hardware. Using a calculated number of reduction cycles, we can use combinatorial logic to reduce large input vectors down to a modular base. Our results show an average of 30% less power usage, 27% fewer CLBs, and 42% fewer LUTs than the conventional method for modular reduction used by Vivado. Further, the HOM-R implementation allows for more relaxed timing requirements, and is able to reduce a 256-bit number down to a four-bit base in a single 250 MHz clock cycle.
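The reduction described above can be modeled behaviorally in software. The following Python sketch captures the LUT-plus-addition idea under our reading of the algorithm; the slice schedule and the stall guard are illustrative assumptions, not the authors' RTL:

```python
def build_lut(b, delta, shift):
    """m-hat table: residue of every delta-bit value placed at bit
    offset `shift`, i.e., (v << shift) mod b, precomputed."""
    return [(v << shift) % b for v in range(1 << delta)]

def reduce_once(x, b, delta, width):
    """One reduction cycle: split x into delta-bit slices and sum the
    precomputed residue of each slice. The sum is congruent to
    x mod b, but far narrower than x."""
    total = 0
    for shift in range(0, width, delta):
        m_hat = build_lut(b, delta, shift)  # a fixed ROM in hardware
        total += m_hat[(x >> shift) & ((1 << delta) - 1)]
    return total

def hom_r_model(x, b, delta, beta):
    """Apply reduction cycles until the value fits in beta bits, then
    finish with conditional subtractions."""
    width = max(x.bit_length(), 1)
    while x >= (1 << beta):
        nxt = reduce_once(x, b, delta, width)
        if nxt >= x:          # guard: residue sum stopped shrinking
            break
        x, width = nxt, max(nxt.bit_length(), 1)
    while x >= b:             # final correction; x is small here
        x -= b
    return x
```

Because each cycle only replaces slices with congruent residues and adds them, `hom_r_model(x, b, delta, beta)` agrees with `x % b` for any valid parameters; the parameter choice affects only the number of cycles, which is what the hardware computes ahead of time.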
Our design also allows for the insertion of registers at multiple stages of the reduction algorithm, enabling the design to be used in systems with high clock frequencies. We showed that even with these registers enabled, there are only negligible increases in power consumption and resource utilization, while significant gains are made in timing margins.
Moreover, we offer a significant contribution over other popular modular reduction methods, such as Barrett and Montgomery reduction, in that we do not require any multipliers or dividers. Our algorithm is implemented using only simple addition and LUTs with variable sizes, determined by the user’s timing and resource requirements. This improvement is significant, as multiplication and division require DSPs, which are both limited and power hungry on devices such as FPGAs. By removing these requirements, we free up resources for designs that require them, and offer a low-power, low-latency alternative.
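For contrast, the textbook Barrett reduction below (a standard algorithm sketch, not taken from this paper) makes the dependence on multiplication explicit: two full-width products per reduction, which is precisely what maps onto DSP slices in an FPGA.

```python
def barrett_reduce(x, b):
    """Textbook Barrett reduction: one precomputed constant, two
    multiplications, no runtime division. Valid for 0 <= x < b**2."""
    k = b.bit_length()
    mu = (1 << (2 * k)) // b    # precomputed once per modulus b
    q = (x * mu) >> (2 * k)     # estimate of x // b, low by at most 1
    r = x - q * b               # r lies in [0, 2b)
    while r >= b:               # at most one corrective subtraction
        r -= b
    return r
```

In HOM-R, both the x·mu product and the q·b product disappear; precomputed residue LUTs and adders take their place.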

Author Contributions

Conceptualization, A.M.; methodology, A.M.; investigation, A.M.; writing—original draft preparation, A.M.; writing—review and editing, A.M. and Y.C.; visualization, A.M.; supervision, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available on GitHub at https://github.com/alexmagyari/HOM-R (accessed on 8 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The β/δ splits required to reduce a stage to the base stage. This is configured for the two-bit modulus-value LUT with a modular base of 14 for Ψ[1].
Figure 2. The final summation step reduces the base stage to the correct bit width of the modular base. The number of additions required depends on the base, the modulus-value LUT, and the stage width.
Figure 3. A complete four-bit stage configured for a two-bit modulus-value LUT with a modular base of 14.
Figure 4. The full modular reduction system for a static base. This system is configured for four splits, with each split configured for a two-bit modulus-value LUT with a modular base of 14.
Table 1. Parameter definitions for HOM-R.

Parameter | Description
x | The input vector
b | Modular base
ŝ | An individual reduction stage
Ψ | The collection of stages; when concatenated together, it is equivalent to x
N | Number of stages
m̂ | Modular-value look-up tables
σ | Stage output after summing each respective m̂ for the stage
Ω | Width of x
ω | Width of the input to a stage
δ | Width of the input to m̂
β | Maximum width of the residue output
Table 2. The conventional modular reduction method compared against the HOM-R system, tested at a clock rate of 150 MHz. For the HOM-R system, δ = ω = 8.

Method | Base | CLBs | LUTs | WNS | Power
HOM-R | 0xC53A49B2 | 131 | 732 | 0.377 ns | 40 mW
Conventional | 0xC53A49B2 | 172 | 1126 | 0.064 ns | 51 mW
HOM-R | 0xF12E33D6 | 133 | 713 | 0.349 ns | 40 mW
Conventional | 0xF12E33D6 | 190 | 1250 | 0.017 ns | 53 mW
HOM-R | 0xAF0C59E5 | 139 | 743 | 0.184 ns | 43 mW
Conventional | 0xAF0C59E5 | 177 | 1190 | 0.049 ns | 54 mW
HOM-R | 0xA43C | 106 | 525 | 1.234 ns | 15 mW
Conventional | 0xA43C | 165 | 969 | 0.474 ns | 25 mW
HOM-R | 0xE127 | 106 | 535 | 1.194 ns | 15 mW
Conventional | 0xE127 | 138 | 951 | 0.321 ns | 27 mW
HOM-R | 0x894C | 101 | 520 | 0.970 ns | 15 mW
Conventional | 0x894C | 135 | 987 | 0.190 ns | 25 mW
Table 3. Results from various configurations of ω and δ. For this test, Ω = 32, β = 16, and the modular base is 0xF7E3.

ω | δ | CLBs | LUTs | Registers | WNS | Power
2 | 1 | 48 | 183 | 18 | 0.040 ns | 7 mW
2 | 2 | 36 | 139 | 16 | 1.237 ns | 5 mW
4 | 2 | 45 | 187 | 16 | 0.369 ns | 6 mW
4 | 4 | 33 | 136 | 16 | 1.562 ns | 5 mW
8 | 4 | 33 | 143 | 16 | 1.274 ns | 6 mW
8 | 8 | 39 | 172 | 24 | 2.207 ns | 5 mW
16 | 8 | 39 | 273 | 24 | 1.078 ns | 7 mW
Table 4. Results from various register configurations. For this test, Ω = 64, ω = δ = 8, at a clock rate of 150 MHz.

Register En. | CLBs | LUTs | Reg. | WNS | Power
None | 131 | 732 | 56 | 0.377 ns | 40 mW
Sum | 114 | 673 | 91 | 2.516 ns | 39 mW
Stage | 135 | 732 | 202 | 1.053 ns | 41 mW
Stage + Sum | 119 | 673 | 247 | 2.822 ns | 41 mW
Table 5. Results from various configurations of Ω and β, tested at a clock rate of 250 MHz. For each test, β = ω = δ.

Ω | β | CLBs | LUTs | Registers | WNS | Power
16 | 4 | 6 | 20 | 0 | 2.379 ns | <1 mW
16 | 8 | 13 | 58 | 5 | 1.600 ns | 4 mW
16 | 12 | 7 | 15 | 0 | 2.994 ns | <1 mW
32 | 4 | 3 | 27 | 0 | 1.961 ns | <1 mW
32 | 8 | 27 | 131 | 8 | 0.806 ns | 8 mW
32 | 12 | 128 | 770 | 42 | 0.552 ns | 17 mW
64 | 4 | 11 | 85 | 0 | 1.497 ns | 7 mW
64 | 8 | 54 | 341 | 40 | 0.243 ns | 19 mW
64 | 12 | 480 | 2806 | 481 | −0.466 ns | 56 mW
128 | 4 | 29 | 167 | 0 | 0.996 ns | 9 mW
128 | 8 | 94 | 619 | 104 | 0.012 ns | 26 mW
128 | 12 | 1131 | 6536 | 902 | −1.916 ns | 108 mW
256 | 4 | 73 | 505 | 0 | 0.384 ns | 17 mW
256 | 8 | 186 | 1264 | 102 | −0.063 ns | 49 mW
256 | 12 | 2594 | 14061 | 119 | −2.965 ns | 210 mW

Share and Cite

Magyari, A.; Chen, Y. Hardware Optimized Modular Reduction. Electronics 2025, 14, 550. https://doi.org/10.3390/electronics14030550
