Efficient Large-Width Montgomery Modular Multiplier Design Based on Toom–Cook-5

Liu, Kuanhao; Wang, Xiaohua; Hao, Yue; Zhang, Jingqi; Wang, Weijiang

doi:10.3390/electronics14071402


by Kuanhao Liu ¹, Xiaohua Wang ¹, Yue Hao ²,³, Jingqi Zhang ¹ and Weijiang Wang ¹,⁴,*

¹ School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China
² School of Integrated Circuits, Peking University, Beijing 100871, China
³ Beijing Microelectronics Technology Institute, Beijing 100076, China
⁴ Chongqing Institute of Microelectronics and Microsystems, Beijing Institute of Technology, Chongqing 401332, China
* Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1402; https://doi.org/10.3390/electronics14071402

Submission received: 1 March 2025 / Revised: 22 March 2025 / Accepted: 26 March 2025 / Published: 31 March 2025


Abstract:

Toom–Cook-n multiplication is an efficient large-width multiplication algorithm based on a divide-and-conquer strategy, widely used in modular multiplication operations for cryptographic algorithms. Theoretically, as the degree n increases, Toom–Cook-n can split the multiplicands into more sub-terms to further enhance the performance of the multiplier. However, because the interpolation matrix grows with the degree and its computational burden grows with it, current research predominantly focuses on Toom–Cook-3 and Toom–Cook-4. This paper proposes a Montgomery modular multiplication design based on Toom–Cook-5, which alleviates the computational difficulty of the interpolation step by introducing an interpolation matrix pre-simplification strategy. Additionally, the design incorporates and optimizes the carry–save adder and Karatsuba multiplication, enabling Toom–Cook-5 multiplication to be applied in a practical and efficient hardware implementation. This paper presents the ASIC implementation results of the hardware architecture under a 90 nm process, demonstrating superior performance compared to previous works.

Keywords:

Toom–Cook-n multiplication; Montgomery modular multiplication; ASIC implementation; Karatsuba multiplication; carry–save adder

1. Introduction

1.1. Background

In the field of cryptography, increasing security demands have led to larger key lengths in systems such as RSA and ECC [1,2]. For both RSA and ECC over prime fields, up to 80% of the calculation latency arises from modular multiplication [3,4]. As a result, the design and optimization of large-width modular multipliers, particularly the underlying multiplier component, are critical for efficient hardware implementations of ECC and RSA in prime field systems [5]. The conventional Karatsuba multiplication [6] exhibits a time complexity of $O (W^{\log_{2} 3})$, where W denotes the bit-width of operands. Karatsuba multiplication achieves enhanced performance with large-width multiplication, albeit at the cost of increased resource utilization [7]. In the GNU Multiple Precision Arithmetic Library [8], Toom–Cook-n multiplication [9] is used as the successor of Karatsuba multiplication and is in turn followed by the more complex FFT-based multiplication [10]. Toom–Cook-n multiplication, which inherits Karatsuba’s divide-and-conquer strategy, offers a lower time complexity of $O (W^{\log_{n} (2 n - 1)})$, where n represents the degree of Toom–Cook-n.
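As a quick numerical illustration of the complexity formula above, the following Python sketch evaluates the exponent $\log_{n}(2n-1)$ for degrees 2 to 5 (degree 2 corresponds to Karatsuba). The function name `toom_cook_exponent` is a hypothetical helper introduced only for this illustration.

```python
import math


def toom_cook_exponent(n: int) -> float:
    """Exponent e in O(W^e) for Toom-Cook-n, i.e. log_n(2n - 1)."""
    return math.log(2 * n - 1, n)


# Degree 2 reproduces the Karatsuba exponent log_2(3).
exponents = {n: toom_cook_exponent(n) for n in range(2, 6)}
for n, e in exponents.items():
    print(f"Toom-Cook-{n}: O(W^{e:.3f})")
```

The exponent shrinks as n grows, which is the asymptotic gain that has to be weighed against the growing interpolation cost.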

1.2. Related Work

Toom–Cook-n multiplication can flexibly decompose the multiplicands to enhance calculation efficiency [11,12]. However, it faces two critical challenges in hardware implementation. First, as the degree increases, the time complexity of Toom–Cook-n multiplication decreases further, but this simultaneously makes the interpolation operations more challenging. Striking a balance between these two factors to achieve improved performance in multiplication operations becomes a key concern [13]. In prior research [14,15,16], greater attention has been devoted to the hardware optimization of Toom–Cook-3 and Toom–Cook-4 algorithms. While this work effectively addressed multiplications below 512-bit, it still demonstrates limitations when dealing with multiplications of 1024-bit and above. Recent work [17,18] proposed several optimization strategies for interpolation operations in the Toom–Cook-4 multiplication, which alleviates the computational burden on hardware while also bringing about more cycles. Second, the result of Toom–Cook-n multiplication retains an odd factor of redundancy. Directly eliminating this redundancy through division is computationally prohibitive [19]. In article [20], a Toom–Cook-n multiplier specifically designed for the NIST prime field was proposed, which employs a non-least-positive (NLP) approach to replace precise division with right-shift operations. Building upon this, Gu [21] and Hao [22], respectively, introduced division-free Toom–Cook-n multipliers that combine Montgomery modular multiplication and Barrett modular multiplication, resulting in a modular multiplier suitable for arbitrary prime fields.

1.3. Motivation and Contribution

In the latest RSA standards, it is acknowledged that 1024-bit RSA is no longer considered secure, leading to a more widespread adoption of 2048-bit RSA [23,24]. Similarly, ECC requires a 521-bit key for encryption [25]. As the key lengths of cryptographic algorithms increase, the corresponding modular multipliers need to process operands with wider bit-widths. In RSA, modular exponentiation is achieved through repeated modular multiplications, while in ECC, point addition and scalar multiplication operations similarly rely on modular multiplication calculations. Therefore, the modular multiplication operation directly determines the performance of cryptographic algorithms and is the primary bottleneck in improving computational efficiency. Existing multipliers [26], such as Toom–Cook-4, Toom–Cook-3, or Karatsuba, generally employ a divide-and-conquer approach to transform large-width multiplication into several smaller-width multiplications to simplify the calculation. However, even Toom–Cook-4, which splits the multiplicands into four parts, still faces 256-bit multiplications in the context of 1024-bit multiplication. While recursive splitting can be employed to further reduce the bit-width, it inevitably leads to an increase in calculation latency.

In this paper, the main contributions are as follows:

This paper analyzes the complete workflow of the Toom–Cook-n multiplication and elaborates on the impact of increasing the degree n at each step.
This paper proposes a pre-simplification approach to address the complexity of the interpolation matrix in Toom–Cook-5. It significantly reduces the computational burden of the interpolation step without incurring additional clock cycles.
This paper optimizes the traditional carry–save adder (CSA) architecture by decoupling the compressors and full adders, effectively eliminating unnecessary resource overhead caused by redundant full adders and improving area efficiency.
This paper designs a two-stage pipelined 3-level Karatsuba multiplication architecture to optimize the timing of the post-splitting multiplications.
This paper improves the addition processing in the Montgomery modular multiplication by nearly halving the bit-width of the addition operations.

2. Notations and Preliminaries

2.1. Toom–Cook-n Multiplication

In an integer multiplication operation with bit-width W, each multiplicand can be split into n sub-terms, where the bit-width of each sub-term is $w = ⌈ \frac{W}{n} ⌉$. Here, n is the degree in the Toom–Cook-n algorithm.

The polynomial representations of the multiplicands A and B are given by

A (N) = \sum_{i = 0}^{n - 1} a_{i} N^{i} = a_{n - 1} N^{(n - 1)} + a_{n - 2} N^{(n - 2)} + \dots + a_{0} N^{0}

(1)

B (N) = \sum_{i = 0}^{n - 1} b_{i} N^{i} = b_{n - 1} N^{(n - 1)} + b_{n - 2} N^{(n - 2)} + \dots + b_{0} N^{0}

(2)

where $N = 2^{w}$ is the base word and $a_{i}$ is the coefficient.

Clearly, the product C resulting from multiplying A and B can be represented as a polynomial $C (N)$ with coefficients $c_{i}$:

C (N) = A (N) \cdot B (N) = (\sum_{i = 0}^{n - 1} a_{i} N^{i}) \cdot (\sum_{i = 0}^{n - 1} b_{i} N^{i}) = \sum_{i = 0}^{2 n - 2} c_{i} N^{i} .

(3)

It is evident that changing the base word N does not affect the coefficients. The essence of the Toom–Cook-n algorithm lies in obtaining $2 n - 1$ linearly independent equations by varying the base word and determining the coefficients of $C (N)$ through these equations.

The Toom–Cook-n multiplication can typically be divided into five calculation steps: Splitting, Evaluation, Multiplication, Interpolation, and Recomposition.

Splitting: Split each multiplicand into n coefficients, each of bit-width w.

Evaluation: Substitute $2 n - 1$ sets of base words to construct a system of linearly independent equations. The selection of these base words will influence the computational complexity in subsequent steps; therefore, it is advantageous to choose base words with minimal absolute values to facilitate the calculations. To achieve this objective, the original base word N is replaced with a two-dimensional base word set $(μ, ν)$, which can yield a greater number of base words with minimal absolute values. The polynomial representation in this new base word is given by Equation (4).

A (μ, ν) = \sum_{i = 0}^{n - 1} a_{i} μ^{i} ν^{n - i - 1}

(4)

Multiplication: Multiply $A (μ, ν)$ and $B (μ, ν)$ after substituting the $2 n - 1$ sets of base words to obtain the values of $C (μ, ν)$, as in Equation (5). Since the base words are minimal in absolute value, the bit-width of the multiplicand in this step is only slightly greater than $⌈ \frac{W}{n} ⌉$. In the case of Toom–Cook-5 with base word $(1, 1)$, the bit-width of $A (1, 1)$ is $⌈ \frac{W}{5} ⌉ + 3$, where $A (1, 1) = a_{0} + a_{1} + a_{2} + a_{3} + a_{4}$. This step constitutes the most computationally intensive part of Toom–Cook-n, as the multiplication operations in this step demand significantly more area and incur higher latency than the addition operations in other steps. Furthermore, as the degree of the Toom–Cook-n algorithm increases, the bit-width of the multiplicand in this step decreases. However, this reduction in bit-width comes with more multiplications, which require more iterations to complete.

[\begin{matrix} C (μ_{0}, ν_{0}) \\ C (μ_{1}, ν_{1}) \\ ⋮ \\ C (μ_{2 n - 2}, ν_{2 n - 2}) \end{matrix}] = [\begin{matrix} A (μ_{0}, ν_{0}) B (μ_{0}, ν_{0}) \\ A (μ_{1}, ν_{1}) B (μ_{1}, ν_{1}) \\ ⋮ \\ A (μ_{2 n - 2}, ν_{2 n - 2}) B (μ_{2 n - 2}, ν_{2 n - 2}) \end{matrix}]

(5)

Interpolation: Determine the coefficients $c_{i}$ from the $2 n - 1$ sets of $C (μ, ν)$ and the interpolation matrix $M_{b}$. In addition to the form shown in Equation (5), $C (μ, ν)$ can also be expressed in a polynomial form based on Equation (3), as shown in Equation (6).

[\begin{matrix} C (μ_{0}, ν_{0}) \\ C (μ_{1}, ν_{1}) \\ ⋮ \\ C (μ_{2 n - 2}, ν_{2 n - 2}) \end{matrix}] = M_{b} [\begin{matrix} c_{0} \\ c_{1} \\ ⋮ \\ c_{2 n - 2} \end{matrix}]

(6)

where the interpolation matrix $M_{b}$ is given as

M_{b} = [\begin{matrix} ν_{0}^{2 n - 2} & μ_{0} ν_{0}^{2 n - 3} & \dots & μ_{0}^{2 n - 2} \\ ν_{1}^{2 n - 2} & μ_{1} ν_{1}^{2 n - 3} & \dots & μ_{1}^{2 n - 2} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ ν_{2 n - 2}^{2 n - 2} & μ_{2 n - 2} ν_{2 n - 2}^{2 n - 3} & \dots & μ_{2 n - 2}^{2 n - 2} \end{matrix}]

(7)

Thus, the coefficient vector $\vec{C} = {[c_{0}, c_{1}, \dots, c_{2 n - 2}]}^{T}$ can be obtained by left-multiplying both sides of Equation (6) by the inverse of $M_{b}$. Since the $(μ, ν)$ are selected in advance, the inverse matrix $M_{b}^{- 1}$ can be precomputed and stored in hardware for computational use.

Recomposition: Recompose the multiplication result C of A and B, leveraging left-shift and addition operations after obtaining the coefficient vector $\vec{C}$.

2.2. Karatsuba Multiplication

The Karatsuba multiplication is an efficient method for multiplying high-width integers. Similar to the Toom–Cook-n, it accelerates multiplication by decomposing high-width operations into smaller ones combined with additions. Given two integers X and Y of bit-width W, they can be represented as

X = x_{1} \cdot 2^{⌈\frac{W}{2}⌉} + x_{0}

(8)

Y = y_{1} \cdot 2^{⌈\frac{W}{2}⌉} + y_{0}

(9)

where $x_{1}$, $y_{1}$ and $x_{0}$, $y_{0}$ represent the higher and lower halves of X and Y, respectively.

By expressing X and Y in this manner and following the derivation shown in Equation (10), the W-bit multiplication can first be transformed into four $⌈\frac{W}{2}⌉$-bit multiplications. Then, it can be further reduced to three $⌈\frac{W}{2}⌉$-bit multiplications, along with six extra addition operations [27].

\begin{matrix} X \cdot Y & = x_{1} y_{1} \cdot 2^{2 ⌈\frac{W}{2}⌉} + (x_{0} y_{1} + x_{1} y_{0}) \cdot 2^{⌈\frac{W}{2}⌉} + x_{0} y_{0} \\ & = x_{1} y_{1} \cdot 2^{2 ⌈\frac{W}{2}⌉} + ((x_{1} + x_{0}) \cdot (y_{1} + y_{0}) - x_{1} y_{1} - x_{0} y_{0}) \cdot 2^{⌈\frac{W}{2}⌉} + x_{0} y_{0} \end{matrix}

(10)

When the bit-width of $⌈ \frac{W}{2} ⌉$ remains high, recursive decomposition of the operands can be performed [28]. Specifically, the $⌈ \frac{W}{2} ⌉$-bit multiplication can leverage the Karatsuba multiplication once more, further reducing the bit-width required for multiplication.
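The decomposition in Equation (10) and its recursive use can be sketched in Python as follows. This is an illustrative software model, not the hardware architecture of this paper: the function name `karatsuba` and the 64-bit base-case `threshold` are assumptions made for the example, and the operands are assumed to be non-negative integers.

```python
def karatsuba(x: int, y: int, threshold: int = 64) -> int:
    """Multiply non-negative integers with the three-multiplication form of Equation (10).

    threshold (an assumed parameter, at least 3 so that the recursion shrinks)
    is the bit-width below which a direct multiplication is used.
    """
    width = max(x.bit_length(), y.bit_length())
    if width <= threshold:
        return x * y  # base case: direct multiplication

    half = (width + 1) // 2  # ceil(W / 2)
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # higher and lower halves of X
    y1, y0 = y >> half, y & mask  # higher and lower halves of Y

    high = karatsuba(x1, y1, threshold)  # x1 * y1
    low = karatsuba(x0, y0, threshold)   # x0 * y0
    # (x1 + x0)(y1 + y0) - x1*y1 - x0*y0 equals x0*y1 + x1*y0
    cross = karatsuba(x1 + x0, y1 + y0, threshold) - high - low

    return (high << (2 * half)) + (cross << half) + low
```

Each level replaces one W-bit multiplication with three multiplications of roughly half the width, at the price of the extra additions and subtractions visible in the code.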

2.3. Montgomery Modular Multiplication

The fundamental principle of Montgomery modular multiplication [29] is the introduction of the “Montgomery domain”, where modular operations leverage right-shift in place of division, significantly enhancing the efficiency of modular multiplication. The Montgomery modular multiplication is presented as Algorithm 1.

Algorithm 1 Montgomery Modular Multiplication

Parameter: $N, R$, with $R = 2^{r} \geq 4 N$ and $\gcd (R, N) = 1$
Precompute: $N^{'} \equiv - N^{- 1} \bmod R$
Input: $0 \leq X, Y < 2 N$
Output: $Z \equiv X Y R^{- 1} \bmod N$
1: $T = X Y$
2: $s = (T \bmod R) \cdot N^{'}$
3: $t = (s \bmod R) \cdot N$
4: $Z = (T + t) ≫ r$
Return: Z.

In what follows, the Montgomery modular multiplication will be denoted as MMM, and a complete derivation of the modular multiplication operation will be provided.

In the context of computing $X Y \bmod N$, X and Y are first transformed into the Montgomery domain through Montgomery modular multiplication, as described in Equations (11) and (12).

X^{'} = MMM (X, R^{2}, N, R) \equiv X R mod N

(11)

Y^{'} = MMM (Y, R^{2}, N, R) \equiv Y R mod N

(12)

Montgomery modular multiplication is then performed on $X^{'}$ and $Y^{'}$, yielding the result presented in Equation (13).

MMM (X^{'}, Y^{'}, N, R) \equiv X^{'} Y^{'} R^{- 1} mod N \equiv X Y R mod N

(13)

Finally, to obtain the correct modular multiplication result, the inverse transformation is applied to exit the Montgomery domain, as shown in Equation (14):

MMM (X Y R mod N, 1, N, R) \equiv X Y mod N .

(14)
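The sequence of Equations (11) to (14) can be traced with a small Python model of Algorithm 1. This is a sketch under stated assumptions rather than the proposed hardware: the helper names `make_mmm` and `modmul` are hypothetical, N is assumed odd, and `pow(n, -1, big_r)` requires Python 3.8 or later.

```python
def make_mmm(n: int, r: int):
    """Return a function computing MMM(x, y) = x*y*R^-1 mod N (up to +N), R = 2**r."""
    big_r = 1 << r
    assert n % 2 == 1 and big_r >= 4 * n  # gcd(R, N) = 1 and R >= 4N
    mask = big_r - 1
    n_prime = (-pow(n, -1, big_r)) % big_r  # N' = -N^-1 mod R

    def mmm(x: int, y: int) -> int:
        t_prod = x * y                  # step 1
        s = (t_prod & mask) * n_prime   # step 2: (T mod R) * N'
        t = (s & mask) * n              # step 3: (s mod R) * N
        return (t_prod + t) >> r        # step 4: exact division by R
    return mmm


def modmul(x: int, y: int, n: int, r: int) -> int:
    """Compute x*y mod n via four Montgomery multiplications."""
    mmm = make_mmm(n, r)
    r2 = pow(1 << r, 2, n)   # R^2 mod N, assumed precomputed
    x_m = mmm(x, r2)         # Equation (11): X' = XR mod N
    y_m = mmm(y, r2)         # Equation (12): Y' = YR mod N
    z_m = mmm(x_m, y_m)      # Equation (13): XYR mod N
    # Equation (14); the output lies in [0, 2N), so one final reduction is applied.
    return mmm(z_m, 1) % n
```

For example, `modmul(x, y, n, r)` with `n = 2**127 - 1` and `r = 130` agrees with `(x * y) % n` for any `x, y < n`.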

2.4. Carry–Save Adder

In the hardware implementation of multi-operand addition, the operation is typically performed through a series of two-operand additions, where the result of adding the first two operands is subsequently added to the third operand, and so on. This approach leads to a long carry chain, and it is challenging to complete the entire calculation within one clock cycle when the bit-width of the operands or the number of operands increases. The carry–save adder [30] provides an effective solution to the multi-operand addition problem; its fundamental units are the 3:2 compressor and the full adder.

The 3:2 compressor processes three operands using a parallel bitwise operation to derive two outputs, namely, CARRY and SAVE, as shown in Equations (15) and (16).

SAVE = In0 \oplus In1 \oplus In2

(15)

CARRY = In0 \cdot In1 + In0 \cdot In2 + In1 \cdot In2

(16)

where ⊕, ·, and + denote the XOR, AND, and OR operations, respectively.

By stacking multiple 3:2 compressors, a multi-operand compressor with low latency can be constructed [31], and the 5:2 compressor is shown in Figure 1. By feeding the SAVE and CARRY outputs of the compressor into the full adder, the final result can be obtained. In this multi-operand adder design, the critical path consists of nine two-input AND gates, nine two-input OR gates, and one two-input full adder. In large bit-width operations, the calculation latency can approach that of one two-input full adder.
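A bit-parallel software model of Equations (15) and (16) is sketched below. It is only an illustration: the function names are hypothetical, operands are assumed to be non-negative integers, the CARRY word is shifted left by one position to reflect its doubled weight, and the 5:2 compressor is assumed here to be three stacked 3:2 compressors, which may differ in detail from Figure 1.

```python
def compress_3_2(in0: int, in1: int, in2: int):
    """3:2 compressor applied to every bit position in parallel."""
    save = in0 ^ in1 ^ in2                            # Equation (15)
    carry = (in0 & in1) | (in0 & in2) | (in1 & in2)   # Equation (16)
    return save, carry << 1  # the carry has twice the weight of the save bit


def compress_5_2(a: int, b: int, c: int, d: int, e: int):
    """5:2 compressor built by stacking three 3:2 compressors (assumed structure)."""
    save, carry = compress_3_2(a, b, c)
    save, carry = compress_3_2(save, carry, d)
    return compress_3_2(save, carry, e)


def csa_add5(a: int, b: int, c: int, d: int, e: int) -> int:
    """Five-operand addition: compression followed by one carry-propagate addition."""
    save, carry = compress_5_2(a, b, c, d, e)
    return save + carry  # plays the role of the final full adder
```

Only the last line involves a carry chain; the compression stages are purely bitwise.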

3. Proposed Algorithms and Methods

Compared to lower-degree Toom–Cook-n algorithms, the Toom–Cook-5 multiplication significantly reduces the operand width during the “Multiplication” step, thereby enabling the design to operate at higher clock frequencies. However, this reduction in operand width leads to an increased complexity in both the dimensionality and data handling of the interpolation matrix, resulting in greater demands on logic resources. In response to these challenges posed by the complex interpolation matrix in Toom–Cook-5, this paper optimizes at the algorithmic level by proposing a series of strategies such as interpolation matrix pre-simplification and partial result precomputation. Additionally, we apply the Toom–Cook-5 multiplication method to Montgomery modular multiplication to eliminate the odd redundancy. Furthermore, we derive and simplify the addition operations in Montgomery modular multiplication, nearly halving the operand bit-width.

In the following, we first present the mathematical derivation of the proposed algorithm by taking 1024-bit Toom–Cook-5 multiplication as an example. Then, we provide a detailed explanation of the Montgomery modular multiplication based on Toom–Cook-5 multiplication.

3.1. Proposed Toom–Cook-5 Multiplication

First, we split the 1024-bit multiplicands into vectors $\vec{A}$ and $\vec{B}$, each composed of five 205-bit coefficients. The coefficients in $\vec{A}$ and $\vec{B}$ are all unsigned numbers.

\vec{A} = {[a_{0}, a_{1}, a_{2}, a_{3}, a_{4}]}^{T}

(17)

\vec{B} = {[b_{0}, b_{1}, b_{2}, b_{3}, b_{4}]}^{T}

(18)

Second, adhering to the principle of minimizing absolute values, the following nine sets of base words $(μ, ν)$ are selected: (0, 1), (1, 1), (−1, 1), (2, 1), (−2, 1), (1, 2), (1, −2), (4, 1), (1, 0). Substituting these nine sets of base words into Equation (4), we can calculate nine sets of $A (μ, ν)$ and $B (μ, ν)$ through left-shift and addition operations in parallel; here, the addition operation is implemented through two five-input carry–save adder operations. The representations of $A (μ, ν)$ and $B (μ, ν)$ are identical except for the coefficients. Here, the representation of $A (μ, ν)$ is given as Equation (19).

\begin{matrix} A (0, 1) & = a_{0}, \\ A (1, 1) & = a_{0} + a_{1} + a_{2} + a_{3} + a_{4}, \\ A (- 1, 1) & = a_{0} - a_{1} + a_{2} - a_{3} + a_{4}, \\ A (2, 1) & = a_{0} + 2 a_{1} + 4 a_{2} + 8 a_{3} + 16 a_{4}, \\ A (- 2, 1) & = a_{0} - 2 a_{1} + 4 a_{2} - 8 a_{3} + 16 a_{4}, \\ A (1, 2) & = 16 a_{0} + 8 a_{1} + 4 a_{2} + 2 a_{3} + a_{4}, \\ A (1, - 2) & = 16 a_{0} - 8 a_{1} + 4 a_{2} - 2 a_{3} + a_{4}, \\ A (4, 1) & = a_{0} + 4 a_{1} + 16 a_{2} + 64 a_{3} + 256 a_{4}, \\ A (1, 0) & = a_{4} . \end{matrix}

(19)

Next, the corresponding values of $A (μ, ν)$ and $B (μ, ν)$ obtained from the previous step are multiplied to produce $C (μ, ν)$. This multiplication is performed utilizing the Karatsuba multiplication to reduce latency.

Based on the selected $(μ, ν)$, the interpolation matrix $M_{b}$ can be determined, as shown in Equation (20).

M_{b} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & - 1 & 1 & - 1 & 1 & - 1 & 1 & - 1 & 1 \\ 1 & 2 & 4 & 8 & 16 & 32 & 64 & 128 & 256 \\ 1 & - 2 & 4 & - 8 & 16 & - 32 & 64 & - 128 & 256 \\ 256 & 128 & 64 & 32 & 16 & 8 & 4 & 2 & 1 \\ 256 & - 128 & 64 & - 32 & 16 & - 8 & 4 & - 2 & 1 \\ 1 & 4 & 16 & 64 & 256 & 1024 & 4096 & 16384 & 65536 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}]

(20)

Substituting the interpolation matrix $M_{b}$ into Equation (6), it can be observed that $c_{0} = C (μ_{0}, ν_{0})$ and $c_{8} = C (μ_{8}, ν_{8})$. In other words, there is no need to solve for these two coefficients by creating linearly independent equations, nor do they need to be included in the subsequent interpolation calculations. This allows the 9 × 9 interpolation matrix to be simplified to a 7 × 7 matrix, as shown in Equation (21).

[\begin{matrix} C_{1} \\ C_{2} \\ C_{3} \\ C_{4} \\ C_{5} \\ C_{6} \\ C_{7} \end{matrix}] = [\begin{matrix} C (μ_{1}, ν_{1}) - c_{0} - c_{8} \\ C (μ_{2}, ν_{2}) - c_{0} - c_{8} \\ C (μ_{3}, ν_{3}) - c_{0} - 256 c_{8} \\ C (μ_{4}, ν_{4}) - c_{0} - 256 c_{8} \\ C (μ_{5}, ν_{5}) - 256 c_{0} - c_{8} \\ C (μ_{6}, ν_{6}) - 256 c_{0} - c_{8} \\ C (μ_{7}, ν_{7}) - c_{0} - 65536 c_{8} \end{matrix}] = M_{b (7 \times 7)} [\begin{matrix} c_{1} \\ c_{2} \\ c_{3} \\ c_{4} \\ c_{5} \\ c_{6} \\ c_{7} \end{matrix}]

(21)

This step is primarily designed to simplify the subsequent interpolation calculations. The calculation of ${[C_{1}, C_{2}, C_{3}, \dots, C_{7}]}^{T}$ reuses the five-input adder employed in the previous evaluation step, eliminating any additional logic resource overheads. Moreover, introducing this pre-simplification step incurs only one extra computational cycle. This is because, in the original algorithm, the subsequent operations cannot proceed until all $C (μ, ν)$ calculations are completed. By comparison, the pre-simplification begins after the completion of the first $C (μ, ν)$ calculation and continues until the cycle following the last $C (μ, ν)$ calculation. Following this, the succeeding steps can be undertaken by leveraging a simplified interpolation matrix. This pre-simplification process provides significant advantages for the implementation of interpolation calculations, both in terms of area and latency. Detailed explanations will be provided in the following discussion.

The coefficient vector $\vec{C} = {[c_{1}, c_{2}, c_{3}, \dots, c_{7}]}^{T}$ is obtained by left-multiplying the vector ${[C_{1}, C_{2}, C_{3}, \dots, C_{7}]}^{T}$ by the inverse of the simplified interpolation matrix, $M_{b (7 \times 7)}^{- 1}$, as shown in Equation (22). Some elements in the original inverse matrix $M_{b (7 \times 7)}^{- 1}$ have denominators containing odd factors, and division by such denominators is extremely challenging to implement in hardware. To address this, we first scale the entire inverse matrix by the odd common multiple of all the denominators, which is 2835 for Toom–Cook-5. Specifically, we employ $2835 M_{b (7 \times 7)}^{- 1}$ to replace the original inverse matrix for interpolation, which causes the computed coefficients $c_{i}$ to be scaled by a factor of 2835. Later, when applying Toom–Cook-5 to the modular multiplication algorithm, this redundant odd common factor can be eliminated. As a consequence, the denominators of all elements in the matrix $2835 M_{b (7 \times 7)}^{- 1}$ are transformed into the form $2^{x}$ and can be implemented through simple right-shift operations.

\vec{C} = [\begin{matrix} c_{1} \\ c_{2} \\ c_{3} \\ c_{4} \\ c_{5} \\ c_{6} \\ c_{7} \end{matrix}] = M_{b (7 \times 7)}^{- 1} [\begin{matrix} C_{1} \\ C_{2} \\ C_{3} \\ C_{4} \\ C_{5} \\ C_{6} \\ C_{7} \end{matrix}]

(22)

The interpolation process essentially involves calculating several dot products. Since the interpolation matrix is fixed, the calculation can be reduced to constant multiplication followed by summation. In our design, the dot product is accomplished by employing a shift-and-add method. Specifically, the term $840 C_{1}$ is converted into $512 C_{1} + 256 C_{1} + 64 C_{1} + 8 C_{1}$, and similar transformations are applied to the other terms, which are then summed together. This approach, while eliminating the need for multipliers, requires the construction of an adder with a large number of inputs.

The inverse of the interpolation matrix $2835 M_{b}^{- 1}$ is shown in Figure 2. The outer frame represents the original 9 × 9 inverse matrix, while the inner frame shows the simplified 7 × 7 inverse matrix. It can be observed that the first and ninth columns of the matrix contain many large numbers. By simplifying the interpolation matrix, these large numbers are excluded from subsequent calculations, thus reducing their impact on the interpolation process. The term “large numbers” here does not refer to their numerical size but rather to the fact that their binary representations contain a significant quantity of “1s”. For example, the number 59,535 corresponds to the binary value “1110100010001111”, which contains nine “1s”. This implies that if a shift-and-add strategy is employed to handle constant multiplication involving this number, a nine-input adder would be required. Since the design employs a set of adders to serially calculate the seven dot products, this set of adders must be designed for the worst-case scenario, which involves the maximum number of addends. By reducing the number of addends required for the dot products through matrix pre-simplification, the consumption of logic resources can be directly reduced.
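The link between the number of “1s” in a constant and the number of addends in a shift-and-add multiplication can be checked with a short Python sketch. The helper name `shift_add_terms` is hypothetical and introduced only for this illustration.

```python
def shift_add_terms(constant: int) -> list:
    """Shift amounts i such that constant equals the sum of 2**i over the list."""
    return [i for i in range(constant.bit_length()) if (constant >> i) & 1]


# 59,535 needs nine shifted copies of the operand, i.e. a nine-input adder.
terms_59535 = shift_add_terms(59535)
print(format(59535, "b"), len(terms_59535))

# 840 needs only four: 8 + 64 + 256 + 512.
terms_840 = shift_add_terms(840)
print(terms_840)
```

Multiplying an operand `c` by the constant then amounts to `sum(c << i for i in shift_add_terms(constant))`.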

From the perspective of the columns, the simplified inverse matrix contains several partial results that can be reused. For example, if $P_{1} = 840 C_{1}$ is precomputed, then $630 C_{1}$ can be calculated as $((12 P_{1}) ≫ 4)$, $3780 C_{1}$ can be computed as $((9 P_{1}) ≫ 1)$, and $5355 C_{1}$ can be derived as $((51 P_{1}) ≫ 3)$. This method effectively reduces redundant computations, thereby improving the efficiency of the interpolation operation. All the methods for calculating the partial results are shown in Equation (23).

\begin{matrix} P_{1} & = 8 C_{1} + 64 C_{1} + 256 C_{1} + 512 C_{1} \\ P_{2} & = 512 C_{2} - 8 C_{2} \\ P_{3} & = 64 C_{3} - C_{3} \\ P_{4} & = C_{4} + 4 C_{4} + 16 C_{4} \\ P_{5} & = 2 C_{5} + 16 C_{5} \\ P_{6} & = 2 C_{6} + 4 C_{6} + 8 C_{6} \\ P_{7} & = C_{7} + 4 C_{7} + 16 C_{7} \end{matrix}

(23)
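The reuse of the partial result $P_{1}$ can be verified numerically. The sketch below is an arithmetic check only, with a hypothetical helper name and an arbitrary test value for $C_{1}$; the multipliers in `PARTIAL_MULTIPLIERS` are simply the constants implied by Equation (23).

```python
def partial_result_p1(c1: int) -> int:
    """P1 = 8*C1 + 64*C1 + 256*C1 + 512*C1, built from shifts as in Equation (23)."""
    return (c1 << 3) + (c1 << 6) + (c1 << 8) + (c1 << 9)


# Constant multiplier of each P_i implied by Equation (23).
PARTIAL_MULTIPLIERS = {1: 8 + 64 + 256 + 512, 2: 512 - 8, 3: 64 - 1,
                       4: 1 + 4 + 16, 5: 2 + 16, 6: 2 + 4 + 8, 7: 1 + 4 + 16}

c1 = 0x1F2E3D4C5B6A79  # arbitrary example value
p1 = partial_result_p1(c1)

# Constant multiples of C1 derived from P1 by a small multiplication and a right shift.
derived = {
    630: (12 * p1) >> 4,
    3780: (9 * p1) >> 1,
    5355: (51 * p1) >> 3,
}
```

Each right shift is exact because 12 × 840, 9 × 840, and 51 × 840 are divisible by 16, 2, and 8, respectively.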

Following this, a set of adders is utilized to obtain the final coefficient vector $\vec{C}$. This set of adders is composed of a total of four carry–save adders. Among them, two five-input adders, called $Adder_{0}$ and $Adder_{1}$, are reused from the evaluation step, and one four-input adder, called $Adder_{2}$, is reused from the partial result calculation. An additional four-input adder is newly constructed. This four-input $Adder_{2}$ does not include a full adder, so it produces two outputs: carry and save.

The combination of the first three adders is utilized to process up to 14 addends, reducing them to 4 addends, which are then passed as inputs to a four-input $Adder_{3}$ to produce the final result. This design maximizes the reuse of previously utilized resources, minimizing logic resource waste while allowing partial results from the previous step to be directly used as inputs for the next step of the calculation. The calculation formulae for each coefficient $c_{i}$ are provided in Equation (24), and the hardware implementation scheme for the adders will be detailed in the subsequent sections.

\begin{matrix} 16 \times 2835 c_{1} & = - 16 P_{1} + 16 P_{2} + 8 P_{3} - 8 P_{4} + 16 P_{5} - 16 P_{6} - 4 C_{7} \\ 16 \times 2835 c_{2} & = - 4 P_{1} - 8 P_{1} - 4 P_{2} - 16 P_{2} + 2 P_{3} + 2 P_{4} + 4 P_{4} + 4 P_{5} + 8 P_{5} + 16 P_{5} \\ & \quad + 4 P_{6} + 32 P_{6} \\ 16 \times 2835 c_{3} & = 8 P_{1} + 64 P_{1} - 64 P_{2} - P_{3} - 8 P_{3} - 32 P_{3} + P_{4} + 2 P_{4} + 4 P_{4} + 32 P_{4} \\ & \quad - 4 P_{5} - 8 P_{5} - 16 P_{5} + 4 P_{6} + 8 P_{6} + P_{7} \\ 16 \times 2835 c_{4} & = P_{1} + 2 P_{1} + 16 P_{1} + 32 P_{1} + P_{2} + 4 P_{2} + 16 P_{2} + 64 P_{2} - 2 P_{3} - 8 P_{3} \\ & \quad - 2 P_{4} - 4 P_{4} - 8 P_{4} - 16 P_{4} - P_{5} - 2 P_{5} - 32 P_{5} - 8 P_{6} - 32 P_{6} \\ 16 \times 2835 c_{5} & = - P_{1} - 32 P_{1} - P_{2} + P_{3} + 4 P_{3} + 32 P_{3} - P_{4} - 2 P_{4} - 8 P_{4} - 16 P_{4} + 2 P_{5} \\ & \quad + 4 P_{5} + 8 P_{5} + 2 P_{6} + 4 P_{6} - P_{7} \\ 16 \times 2835 c_{6} & = - 4 P_{1} - 8 P_{1} - 4 P_{2} - 16 P_{2} + 8 P_{3} + 8 P_{4} + 16 P_{4} + P_{5} + 2 P_{5} + 4 P_{5} \\ & \quad + P_{6} + 8 P_{6} \\ 16 \times 2835 c_{7} & = 4 P_{1} + 4 P_{2} - 4 P_{3} - 4 P_{4} - 2 P_{5} - 2 P_{6} + 4 C_{7} \end{matrix}

(24)

Ultimately, the obtained partial results are combined through shift-and-add operations to calculate $16 \times 2835 c_{i}$. The factor of 16 is used to eliminate the $2^{x}$-form denominators in the inverse matrix. After summing, the result is right-shifted by 4 bits to obtain $2835 c_{i}$. The coefficients $2835 c_{i}$, derived through Toom–Cook-5 interpolation, are wider than 205 bits, so the product cannot be obtained by simple concatenation. Instead, this set of coefficients is shifted and arranged in sequence, and the overlapping parts are summed. The maximum possible bit-width for this set of coefficients is 424 bits, and the derivation of this bit-width will be provided in the hardware architecture section.

Let $c_{i}^{'}$ represent $2835 c_{i}$. Each $c_{i}^{'}$ is divided into three parts: the most significant 14-bit part, the middle 205-bit part, and the least significant 205-bit part. These parts are referred to as $high (c_{i}^{'})$, $mid (c_{i}^{'})$, and $low (c_{i}^{'})$, respectively. The summation operation proceeds from the low to the high parts, as illustrated in Figure 3. All overlapping parts are summed to obtain the corresponding $r_{i}$, and the concatenation of the 10 results $r_{i}$ is $2835 C$. A special case occurs with $c_{8}^{'}$, which is divided into only two parts; as a result, $r_{9}$ is 218 bits wide, while all other $r_{i}$ are 205 bits wide.

The entire algorithmic process of the Toom–Cook-5 multiplication and the associated optimizations have been fully derived through the discussion above. The complete Toom–Cook-5 multiplication with the pre-simplified interpolation matrix is presented in Algorithm 2.

Algorithm 2 Interpolation Matrix Pre-simplified Toom–Cook-5 Multiplication

Parameter: $(μ_{0}, ν_{0})$, $(μ_{1}, ν_{1})$, …, $(μ_{8}, ν_{8})$, $2835 M_{b (7 \times 7)}^{- 1}$
Input: $0 \leq A, B < 2^{W}$
Output: $R = 2835 A B$
1: $\vec{A} = {[a_{0}, a_{1}, a_{2}, a_{3}, a_{4}]}^{T}$, $\vec{B} = {[b_{0}, b_{1}, b_{2}, b_{3}, b_{4}]}^{T}$
2: for $i = 0$ to 8 do
3:     $A (μ_{i}, ν_{i}) = \sum_{j = 0}^{4} a_{j} μ_{i}^{j} ν_{i}^{4 - j}$, $B (μ_{i}, ν_{i}) = \sum_{j = 0}^{4} b_{j} μ_{i}^{j} ν_{i}^{4 - j}$
4:     $C (μ_{i}, ν_{i}) = A (μ_{i}, ν_{i}) B (μ_{i}, ν_{i})$
5: end for
6: for $i = 1$ to 7 do
7:     $C_{i} = C (μ_{i}, ν_{i}) - μ_{i}^{8} C (μ_{8}, ν_{8}) - ν_{i}^{8} C (μ_{0}, ν_{0})$
8: end for
9: $2835 \vec{C} = 2835 M_{b (7 \times 7)}^{- 1} \cdot {[C_{1}, C_{2}, \dots, C_{7}]}^{T}$
10: $R = 2835 c_{0} + (2835 c_{8} ≪ 1640) + \sum_{j = 0}^{6} (2835 \vec{C} (j) ≪ 205 (j + 1))$, where $c_{0} = C (μ_{0}, ν_{0})$ and $c_{8} = C (μ_{8}, ν_{8})$
Return: R.
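Algorithm 2 can be modeled in software as follows. This Python sketch is a functional illustration under stated assumptions, not the proposed hardware: it computes the exact inverse of the 7 × 7 matrix with rational arithmetic instead of the shift-and-add network of Equation (24), and the names `invert`, `evaluate`, `toom_cook_5`, `M7`, and `SCALED_INV` are hypothetical.

```python
from fractions import Fraction

# Base words (mu, nu) selected in Section 3.1.
POINTS = [(0, 1), (1, 1), (-1, 1), (2, 1), (-2, 1), (1, 2), (1, -2), (4, 1), (1, 0)]
GAMMA = 2835  # odd scaling factor of Toom-Cook-5


def invert(matrix):
    """Exact matrix inverse by Gauss-Jordan elimination over Fractions."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Simplified 7x7 matrix: rows are base words 1..7, columns are c_1..c_7.
M7 = [[mu ** k * nu ** (8 - k) for k in range(1, 8)] for mu, nu in POINTS[1:8]]
SCALED_INV = [[GAMMA * v for v in row] for row in invert(M7)]


def evaluate(coeffs, mu, nu):
    """Equation (4) for n = 5."""
    return sum(c * mu ** j * nu ** (4 - j) for j, c in enumerate(coeffs))


def toom_cook_5(a: int, b: int, w: int = 205) -> int:
    """Return 2835 * a * b for non-negative a, b below 2**(5 * w)."""
    mask = (1 << w) - 1
    a_vec = [(a >> (w * i)) & mask for i in range(5)]  # splitting
    b_vec = [(b >> (w * i)) & mask for i in range(5)]
    prods = [evaluate(a_vec, mu, nu) * evaluate(b_vec, mu, nu) for mu, nu in POINTS]
    c0, c8 = prods[0], prods[8]
    # Pre-simplification (line 7): remove the c_0 and c_8 contributions.
    pre = [prods[i] - mu ** 8 * c8 - nu ** 8 * c0
           for i, (mu, nu) in enumerate(POINTS) if 1 <= i <= 7]
    # Interpolation (line 9): each entry equals 2835 * c_k and must be an integer.
    scaled = [sum(e * p for e, p in zip(row, pre)) for row in SCALED_INV]
    assert all(v.denominator == 1 for v in scaled)
    # Recomposition (line 10), with c_0 and c_8 scaled by 2835 as well.
    result = GAMMA * c0 + ((GAMMA * c8) << (8 * w))
    for k, v in enumerate(scaled, start=1):
        result += int(v) << (w * k)
    return result
```

In this model every entry of `SCALED_INV` has a power-of-two denominator, which is what allows the hardware to replace division with right shifts.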

3.2. Montgomery Modular Multiplication Based on Toom–Cook-5

Following the Toom–Cook-5 multiplication introduced in the previous discussion, the result is magnified by a factor of 2835. Earlier works have proposed the application of Toom–Cook-n multiplication in Montgomery modular multiplication or Barrett modular multiplication [32], with specialized techniques aimed at eliminating this factor without the need for division operations. The algorithm for Montgomery modular multiplication based on Toom–Cook-n, proposed in [22], is described in Algorithm 3. In this context, $γ$ denotes the redundant coefficient under Toom–Cook-n, and $TCn$ represents the Toom–Cook-n multiplication.

The output of Montgomery modular multiplication carries a redundant factor of $R^{- 1}$. In a complete modular multiplication process, multiple Montgomery modular multiplications are required to eliminate this redundancy. When the multiplication operations in Algorithm 1 are executed with a Toom–Cook-n multiplication, as in Algorithm 3, the final output carries an additional factor of $γ^{3}$. Next, the product of $R^{- 1}$ and $γ^{3}$ is treated as the redundant coefficient. By performing four Montgomery modular multiplications, as described in Section 2.3, both $R^{- 1}$ and $γ^{3}$ can be simultaneously eliminated, yielding the final correct modular multiplication result.

Algorithm 3 Montgomery Modular Multiplication based on Toom–Cook-n

Parameter: $N, R, γ$; $R = 2^{r}$, $\gcd (R, N) = 1$
Precompute: $N^{'} \equiv - N^{- 1} \bmod R$
Input: $0 \leq X, Y < 2 γ N$
Output: $Z \equiv γ^{3} X Y R^{- 1} \bmod N$
1: $T = TCn (X, Y)$
2: $s = TCn ((T \bmod R), N^{'})$
3: $t = TCn ((s \bmod R), N)$
4: $Z = (γ^{2} T + t) ≫ r$
Return: Z.
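The congruence claimed for Algorithm 3 can be checked numerically with a short model. In this sketch, `tcn` is a hypothetical stand-in that simply returns $γ \cdot x \cdot y$, which is all the algorithm assumes about the Toom–Cook-n multiplier; the function names and the small test modulus are illustrative choices, not taken from the paper. The modular inverse uses `pow(n, -1, m)`, which requires Python 3.8 or later.

```python
GAMMA = 2835  # redundant coefficient of Toom-Cook-5

def tcn(x, y, gamma=GAMMA):
    """Stand-in for a Toom-Cook-n multiplier: returns gamma * x * y."""
    return gamma * x * y

def mont_mul_tcn(x, y, n, r_bits, gamma=GAMMA):
    """Model of Algorithm 3 with R = 2**r_bits and odd modulus n."""
    r_mask = (1 << r_bits) - 1
    n_prime = (-pow(n, -1, 1 << r_bits)) & r_mask  # N' = -N^{-1} mod R
    t_big = tcn(x, y, gamma)                       # step 1
    s = tcn(t_big & r_mask, n_prime, gamma)        # step 2
    t = tcn(s & r_mask, n, gamma)                  # step 3
    total = gamma**2 * t_big + t
    assert total & r_mask == 0                     # low r bits cancel exactly
    return total >> r_bits                         # step 4
```

Because $t \equiv γ^{2} T N^{'} N \equiv - γ^{2} T \pmod R$, the sum in step 4 is always a multiple of $R$, which is what the internal assertion checks.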

To enhance the efficiency of Montgomery modular multiplication, we propose an optimized addition algorithm to handle step 4 of Algorithm 3. The result of $(γ^{2} T + t)$ is right-shifted by r bits, so its lower r bits do not appear in the output. According to the derivation of the Montgomery modular multiplication, the lower r bits of $(γ^{2} T + t)$ are guaranteed to be all zeros. Since the lower r-bit parts of the two addends are each smaller than $2^{r}$ and their sum is a multiple of $2^{r}$, that sum is either 0 or exactly $2^{r}$. Based on this precondition, we conclude that a carry into the $(r + 1)$-th bit occurs in all cases except when the lower r bits of $γ^{2} T$ and t are both all zeros.

Therefore, we decompose step 4 in Algorithm 3 into two separate operations. First, we check whether the lower r bits of $γ^{2} T$ and t are all zeros. If they are, it suffices to calculate the sum of their components above the r-th bit. Otherwise, we calculate the sum of their components above the r-th bit and add 1. In contrast to a wide addition, an all-zero detector can be implemented in hardware using only a series of OR gates. The complete Montgomery modular multiplication based on Toom–Cook-5 is shown in Algorithm 4.

Algorithm 4 Montgomery Modular Multiplication based on Toom–Cook-5

Parameter: $N, R$; $R = 2^{r}$, $\gcd (R, N) = 1$
Precompute: $N^{'} \equiv - N^{- 1} \bmod R$
Input: $0 \leq X, Y < 5670 N$
Output: $Z \equiv 2835^{3} X Y R^{- 1} \bmod N$
1: $T = TC5 (X, Y)$
2: $s = TC5 ((T \bmod R), N^{'})$
3: $t = TC5 ((s \bmod R), N)$
4: if $(2835^{2} T) [r - 1 : 0] == 0$ and $t [r - 1 : 0] == 0$ then
5:     $Z = (2835^{2} T) [: r] + t [: r]$
6: else
7:     $Z = (2835^{2} T) [: r] + t [: r] + 1$
8: end if
Return: Z.
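Steps 4 to 8 of Algorithm 4 can be modeled in a few lines. The sketch below assumes the precondition stated in the text, namely that the sum of the two operands is a multiple of $2^{r}$; the function name is hypothetical, and `u` stands for $2835^{2} T$.

```python
def shifted_sum_with_detector(u, t, r_bits):
    """Return (u + t) >> r_bits, assuming (u + t) is a multiple of 2**r_bits.

    Only the upper parts are added; the lower parts feed an all-zero check
    that decides whether a carry of 1 enters the upper sum.
    """
    mask = (1 << r_bits) - 1
    low_all_zero = (u & mask) == 0 and (t & mask) == 0
    high = (u >> r_bits) + (t >> r_bits)
    return high if low_all_zero else high + 1
```

For example, with `r_bits = 8`, `u = 0x1234` and `t = 0x05CC`, the low bytes sum to `0x100`, so the detector adds 1 to the upper sum `0x12 + 0x05`.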

4. Hardware Architecture of Toom–Cook-5

4.1. Overall Architecture

The ASIC architecture of the Montgomery modular multiplication based on Toom–Cook-5 is shown in Figure 4. The overall architecture consists of a control unit, parameter RAM, and an arithmetic logic unit (ALU). The parameter RAM stores the precomputed parameters required for the modular multiplication operation, while the ALU executes the entire modular multiplication algorithm. The control unit invokes the ALU sub-modules and manages data interactions according to the algorithm’s steps. It also retrieves the parameters necessary for calculation from RAM and supplies them to the ALU.

4.2. Decoupled Carry–Save Adder Architecture

In the hardware implementation of Toom–Cook-5, multi-operand adders are widely used in the evaluation, interpolation, and recomposition steps. In our design, all multi-operand adders are built from carry–save adders composed of compressors and full adders. By strategically arranging the calculation steps, it is possible to avoid equipping every compressor with a dedicated full adder. Instead, a single full adder can be shared among several compressors. In practice, we decoupled the carry–save adder into an independent compressor group and independent full adders. In this way, the architecture can flexibly adapt its configuration to the calculation needs. Figure 5 presents two example configurations: a 14-input adder, and parallel 5-input and 9-input adders.

In the evaluation step, two five-input adders equipped with full adders operate in parallel to calculate two sets of addition operations. During the pre-simplification step, the additions initiated in the evaluation step continue, while an additional four-input adder is introduced to perform the associated calculation simultaneously. In the interpolation step, the five-input adder with full adders can be paired with a 5:2 compressor and a 4:2 compressor, serving as the first stage of the addition process. This configuration can handle up to 14 inputs and produce 5 outputs. These five outputs are then fed into a 5:2 compressor with full adders. Thus, the compressor group combined with the full adders constitutes a 14-input adder.
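The principle behind decoupling compressors from the final adder can be illustrated with a bit-level model. The sketch below is a simplification under stated assumptions: it uses only 3:2 compressors rather than the 5:2 and 4:2 compressors of the design, it treats operands as non-negative Python integers, and the function names are hypothetical. It shows that many compression steps can share one final carry-propagate addition.

```python
def compress_3_2(a, b, c):
    """Bitwise 3:2 compressor: a + b + c == s + carry, with no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def carry_save_sum(operands):
    """Reduce any number of operands to two, then perform one full addition."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(compress_3_2(a, b, c))  # three operands become two
    return sum(ops)  # the single shared carry-propagate add
```

Each pass through the loop removes one operand without any carry chain, so a 14-operand sum needs twelve compressions and only one wide addition.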

Considering the large bit-width of the addends and the requirement to produce addition results within a single cycle, the carry select adder is employed to optimize the full adder. The original addends are evenly divided into four parts based on their bit-width. The lowest part is directly added, while the higher three parts are computed in parallel to generate two possible results: one assuming a carry-in and the other without. After the parallel addition of all four parts, the carry-out from the lowest part determines the selection of the appropriate outputs for the higher parts, which are then concatenated to form the final result. The schematic diagram of the entire architecture is shown in Figure 6. This addition design significantly shortens the carry chain, ensuring low-latency calculation for the entire decoupled carry–save adder architecture.
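The four-part carry-select scheme described above can be modeled as follows. This is a behavioral sketch with hypothetical names, assuming the width is divisible by the number of parts; the loop is sequential in software, whereas in hardware the speculative sums of the upper parts are all computed in parallel and only the selection depends on the incoming carry.

```python
def carry_select_add(a, b, width, parts=4):
    """Add two width-bit unsigned integers using carry-select segments."""
    seg = width // parts  # assumes width is divisible by parts
    mask = (1 << seg) - 1
    result, carry = 0, 0
    for k in range(parts):
        x = (a >> (k * seg)) & mask
        y = (b >> (k * seg)) & mask
        if k == 0:
            total = x + y          # lowest part: added directly
        else:
            sum0 = x + y           # speculative result, carry-in = 0
            sum1 = x + y + 1       # speculative result, carry-in = 1
            total = sum1 if carry else sum0  # selection by the incoming carry
        result |= (total & mask) << (k * seg)
        carry = total >> seg
    return result | (carry << width)
```

The longest carry chain is one segment, that is, a quarter of the operand width with four parts.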

4.3. 3-Level Karatsuba Multiplication Architecture

In accordance with the formula derivation of the Karatsuba multiplication described in Section 2.2, recursively partitioning the multiplicands X and Y progressively reduces the bit-width of the lowest-level multiplication operation. In our design, a 3-level Karatsuba multiplication architecture is implemented to perform large-width multiplication. Carry–save adders perform the resulting five-input addition after each level of partitioning. Additionally, registers are inserted after the output of the lowest-level multipliers to implement a two-stage pipeline.

The complete three-level Karatsuba multiplication architecture is illustrated in Figure 7, where $x_{i}$ and $y_{i}$ represent the partitioned components of the operands X and Y. Specifically, $x_{1}$ denotes the upper half obtained by splitting X into two segments; $x_{11}$ represents the upper half of $x_{1}$ after further splitting it into two segments; and $x_{110}$ refers to the lower half obtained by further partitioning $x_{11}$. Based on latency analysis, the entire multiplication architecture is divided into a two-stage pipelined structure: the multiplication and addition operations at the lowest level constitute the first stage, while the addition operations at the second and third levels form the second stage. To balance the latency of the two pipeline stages, the $3 : 2$ compressor is employed to optimize the timing of the five-input addition at the lowest level, while other addition and multiplication operations are directly implemented using the corresponding DW IPs.
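The recursive partitioning can be sketched in software as below. This is a minimal illustration assuming the standard additive Karatsuba identity, which may differ in detail from the formulation in the paper; the function name, the `width` parameter, and the default of three levels are illustrative. Pipelining and compressor-based addition are not modeled.

```python
def karatsuba(x, y, width, levels=3):
    """Multiply non-negative integers with `levels` levels of Karatsuba splitting."""
    if levels == 0:
        return x * y  # lowest-level multiplier
    half = width // 2
    mask = (1 << half) - 1
    x0, x1 = x & mask, x >> half  # lower and upper halves of x
    y0, y1 = y & mask, y >> half  # lower and upper halves of y
    low = karatsuba(x0, y0, half, levels - 1)
    high = karatsuba(x1, y1, half, levels - 1)
    # One extra bit of width for the sums x0 + x1 and y0 + y1.
    cross = karatsuba(x0 + x1, y0 + y1, half + 1, levels - 1)
    # Five terms are combined: low, high << 2*half, and (cross - low - high) << half.
    return low + ((cross - low - high) << half) + (high << (2 * half))
```

Each level replaces one multiplication by three half-width multiplications, so three levels use 27 lowest-level multipliers in this model.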

4.4. Bit Width Derivation and Division Optimization for Interpolation

In this section, we derive the required adder bit-width for interpolation and analyze the error-free replacement of division by $2^{n}$ with right-shift operations in $M_{b}^{- 1}$.

By expanding the initial expression of $C (μ, ν)$ based on the polynomial multiplication of $A (μ, ν)$ and $B (μ, ν)$, we obtain Equation (25). Since the choice of the base word does not affect the derivation, the base word is represented as N for simplicity. Multiplying the coefficients associated with the base word by $2835 \times 16$ yields the theoretical coefficients in the Toom–Cook-5 interpolation step. Among these, the potential maximum value arises from the coefficient of $N^{4}$, denoted as $c_{4} = a_{0} b_{4} + a_{1} b_{3} + a_{2} b_{2} + a_{3} b_{1} + a_{4} b_{0}$. We assume that the bit-width of multiplicands A and B is W, where $a_{i}$ and $b_{i}$ are unsigned integers with a bit-width of $⌊\frac{W}{5}⌋$. Consequently, the maximum bit-width of $2835 c_{i}$ is determined to be $2 ⌊\frac{W}{5}⌋ + 18$ bits; this also determines the bit-width of the adders used in the interpolation step. Here, $2 ⌊\frac{W}{5}⌋$ is the bit-width of $a_{i} b_{i}$, and 18 is the additional bit-width due to summing the five $a_{i} b_{i}$ terms of the coefficient $c_{4}$ and multiplying by $2835 \times 16$.

\begin{aligned} \sum_{i = 0}^{8} c_{i} N^{i} & = (\sum_{i = 0}^{4} a_{i} N^{i}) \cdot (\sum_{j = 0}^{4} b_{j} N^{j}) \\ & = a_{0} b_{0} N^{0} + (a_{0} b_{1} + a_{1} b_{0}) N^{1} + (a_{0} b_{2} + a_{1} b_{1} + a_{2} b_{0}) N^{2} \\ & \quad + (a_{0} b_{3} + a_{1} b_{2} + a_{2} b_{1} + a_{3} b_{0}) N^{3} + \dots + a_{4} b_{4} N^{8} \end{aligned}

(25)
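The 18 extra bits can be checked with a short calculation. The sketch below assumes the worst case described in the text: five products of maximal $⌊\frac{W}{5}⌋$-bit limbs, scaled by $2835 \times 16$. The function name and its parameters are hypothetical.

```python
def interpolation_adder_width(w_bits, terms=5, scale=2835 * 16):
    """Return (exact worst-case bit length, the 2*floor(W/5) + 18 estimate)."""
    limb = w_bits // 5
    max_c4 = terms * ((1 << limb) - 1) ** 2  # largest possible c_4
    return (scale * max_c4).bit_length(), 2 * limb + 18
```

Since $5 \times 2835 \times 16 = 226800$ lies between $2^{17}$ and $2^{18}$, the scaling and the five-term sum together contribute 18 bits; for W = 1024 both values returned are 426.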

In earlier hardware implementations of Toom–Cook-n interpolation, the numerator of each element of the inverse matrix was first multiplied by the corresponding component of the vector ${[C_{1}, C_{2}, C_{3}, \dots, C_{7}]}^{T}$. Subsequently, the division by $2^{x}$ in the denominator of the inverse matrix was replaced by a simple right-shift operation. However, replacing division by $2^{x}$ directly with a right shift requires the x least significant bits to be all zero to avoid computational errors. Clearly, the computation of $2835 M_{b}^{- 1} \cdot {[C_{1}, C_{2}, C_{3}, \dots, C_{7}]}^{T}$ does not satisfy this condition.

We derived and proposed an error-free alternative division strategy. First, we select the largest $2^{x}$ factor among the denominators in each row of the inverse matrix; then, we multiply the corresponding row by this factor. After completing the interpolation step, the result is right-shifted accordingly to restore the original values. Next, we provide a formal proof demonstrating that replacing division with right-shift in this manner does not introduce any computational errors.

It is sufficient to prove that the coefficients $c_{i}$ obtained after the division operation are always integers, thereby demonstrating that replacing division with right-shift does not introduce errors. According to Equation (26), for any $μ, ν \in \mathbb{Z}$, $C (μ, ν)$ satisfies the polynomial form:

C (μ, ν) = \sum_{i = 0}^{n - 1} c_{i} μ^{i} ν^{n - i - 1} .

(26)

Assuming that there exists a coefficient $c_{i}$ that is not an integer, it would then be possible to identify a pair $(μ, ν)$ such that $C (μ, ν)$, calculated from Equation (26), results in a non-integer. However, according to Equation (3), $C (μ, ν)$ is expressed as the product $A (μ, ν) \cdot B (μ, ν)$, where $A (μ, ν), B (μ, ν) \in \mathbb{Z}$. This creates a contradiction, as the product of two integers must necessarily be an integer. Hence, the initial assumption is invalid, proving that $c_{i} \in \mathbb{Z}$. Accordingly, the proposed right-shift approach as a substitute for division is error-free.
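The difference between shifting each term separately and shifting once after row scaling can be seen on a toy example. The sketch below is illustrative only: the row entries, given as numerators `nums` over denominators `2**shifts[k]`, and the input vector are made-up values, not entries of $2835 M_{b}^{- 1}$, and both function names are hypothetical.

```python
def dot_per_term_shift(nums, shifts, vec):
    """Shift every product separately; truncates when a product has nonzero low bits."""
    return sum((n * v) >> s for n, s, v in zip(nums, shifts, vec))

def dot_row_scaled(nums, shifts, vec):
    """Scale the row by its largest power of two, accumulate, then shift once."""
    x = max(shifts)
    acc = sum((n << (x - s)) * v for n, s, v in zip(nums, shifts, vec))
    assert acc & ((1 << x) - 1) == 0  # exact: the true result is an integer
    return acc >> x
```

For the row $[3/4, -5/8, 1/2]$ and the vector $[6, 4, 2]$, the exact result is $4.5 - 2.5 + 1 = 3$. The row-scaled version accumulates $24$ and shifts by 3 to give 3, whereas shifting term by term yields $4 - 3 + 1 = 2$.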

5. Implementation Results and Comparison

In this section, we present a thorough analysis and discussion of the hardware implementation results for the Toom–Cook-5 multiplier. Initially, we establish evaluation metrics for the multiplier to provide an objective benchmark for assessing its performance characteristics. Following this, we conduct an extensive comparison between our empirical results from the hardware implementation and those reported in related works in recent years, thereby substantiating the advantages and advancements of our proposed design.

In the ASIC implementation of modular multipliers, the primary concerns are the area and latency metrics. Among these, the calculation of latency is expressed by the following equation:

T = \mathrm{Cycle} \times T_{clk} .

(27)

Furthermore, the area–time product (ATP) is commonly utilized as a performance metric to evaluate the trade-off between hardware resource consumption and latency.

ATP = \mathrm{Area} \times T .

(28)
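Equations (27) and (28) translate directly into a small helper. In this sketch the function names are hypothetical, and the ATP unit of $10^{3}\ μ m^{2} \cdot μ s$ is an assumption inferred from the magnitudes listed in Table 1.

```python
def latency_ns(cycles, freq_mhz):
    """Equation (27): T = Cycle * T_clk, with T_clk = 1000 / f ns for f in MHz."""
    return cycles * 1000.0 / freq_mhz

def atp_k_um2_us(area_um2, time_ns):
    """Equation (28): area-time product in units of 10^3 um^2 * us."""
    return area_um2 * time_ns / 1e6
```

With the area of 515,148 μm² and the latency of 274.8 ns reported for the proposed design, `atp_k_um2_us` gives about 141.6, which rounds to the 142 listed in Table 1.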

We implemented the 1024-bit Montgomery modular multiplication architecture based on Toom–Cook-5 using the Design Compiler tool from Synopsys (Sunnyvale, CA, USA), targeting a 90 nm process node. Table 1 shows the performance results of our design, together with some recent designs. All listed results are for modular multiplication implementations. Overall, the proposed design demonstrates significant advantages in the ATP metric among existing 1024-bit modular multipliers, ranking behind only the $127\ \mathrm{k} μ m^{2} \times μ s$ achieved by [22].

The first four designs in Table 1 focus primarily on optimizing the Montgomery modular multiplication algorithm without improvements to the multiplication operations. Specifically, refs. [33,34,36] adopt an iterative approach to execute operations bit by bit. While this approach effectively reduces the area, it significantly increases the number of cycles required, resulting in intolerable latency. The design [35] improves the redundant binary representation and implements customized solutions tailored to different bit-widths and application scenarios, achieving a balance between latency and area. However, as the design does not specifically optimize the multiplier but relies on an existing multi-cycle multiplier for calculation, it ultimately leads to an unfavorable ATP result.

The focus of [21,22] is similar to that of the design presented in this paper, namely, optimizing the performance of modular multiplication by implementing Toom–Cook-n multipliers. Design [21] employs highly parallelized multiplication operations, saving a substantial number of computation cycles to achieve optimal computational latency. However, this performance is achieved at the cost of reduced system frequency and increased area. In contrast, the design presented in this paper occupies only about one-third of the area used by the design in [21], while achieving a significantly higher system frequency. Ref. [22] demonstrates superior ATP performance utilizing Barrett modular multiplication, which inherently requires fewer multiplications than Montgomery modular multiplication. However, it shows relatively weaker performance in multi-operand modular multiplication, which is disadvantageous for the implementation of encryption algorithms [37]. In terms of power consumption, due to the adoption of a lightweight design in this work, the overall power consumption is lower than that of larger-area designs [21,22]. Therefore, the design proposed in this paper is more suitable for area-sensitive applications, such as embedded edge devices. It strikes an excellent trade-off between area and latency, with a strong emphasis on optimizing area performance.

Toom–Cook-5 serves as an intermediate solution between Toom–Cook-4 and FFT-based multiplication, exhibiting excellent performance for bit-widths ranging from 1024 bits to 4096 bits. Comprehensive results demonstrate that the proposed design achieves operating frequencies of 339 MHz and 268 MHz for 2048-bit and 4096-bit workloads, respectively, with further performance improvements achievable through additional decomposition of Karatsuba multiplication. In this context, Toom–Cook-4 performs poorly as its simpler computational logic and partitioning scheme are more suited to multiplication operations involving smaller bit-widths, such as around 256 bits. On the other hand, FFT-based multiplication is constrained by the substantial overhead associated with complex preprocessing and point value transformation operations, which demand excessive logical resources. It is only within the realm of handling integer multiplication involving numbers exceeding one million bits that its complexity advantage becomes evident [38].

Regarding the resistance to side-channel attacks, considerations are typically made at the cryptographic algorithm level. However, as the most complex and critical computational unit at the hardware level, the design of the modular multiplier has a significant impact on higher-level attack resistance schemes. The proposed modular multiplier features a constant computational delay that is not affected by input data, effectively resisting timing analysis attacks. Additionally, in practical applications of cryptographic algorithms such as ECC or RSA, the operands involved in modular multiplication during encryption are transformed into the Toom–Cook domain. This transformation obscures the influence of input data on power consumption, making it difficult for power analysis attacks to obtain sensitive information through power monitoring. Therefore, our multiplier design not only achieves efficient computation but also enhances resistance to side-channel attacks, providing a more secure hardware foundation for higher-level cryptographic systems.

6. Conclusions

This paper presents a Montgomery modular multiplication algorithm based on Toom–Cook-5, along with its hardware optimization strategies. At the algorithmic level, a pre-simplification operation is introduced for the interpolation matrix of Toom–Cook-5, and the addition operations in Montgomery modular multiplication are optimized. At the hardware implementation level, the design focuses on customized enhancements of the underlying multi-operand adders and multipliers, achieving significant improvements in both area and timing performance. Compared to existing works, the proposed design demonstrates a notable enhancement in ATP and achieves substantial advantages in terms of area efficiency.

Author Contributions

Conceptualization, K.L.; Methodology, K.L. and Y.H.; Writing—original draft, K.L.; Funding acquisition, W.W.; Writing—review & editing, W.W. and X.W.; Data curation, Y.H. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chongqing Natural Science Foundation, grant number cstc2021jcyj-msxmX1090. The APC was funded by Weijiang Wang.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126.
2. Miller, V.S. Use of elliptic curves in cryptography. In Proceedings of the Conference on the Theory and Application of Cryptographic Techniques, Linz, Austria, 9–11 April 1985; Springer: Berlin/Heidelberg, Germany, 1985; pp. 417–426.
3. Hankerson, D.; Menezes, A.; Springer, S.V. Guide to Elliptic Curve Cryptography; Springer: Berlin/Heidelberg, Germany, 2004.
4. Eberle, H.; Shantz, S.; Gupta, V.; Gura, N.; Rarick, L.; Spracklen, L. Accelerating next-generation public-key cryptosystems on general-purpose CPUs. IEEE Micro 2005, 25, 52–59.
5. Choe, J.Y.; Shin, K.W. A High Performance Modular Multiplier for ECC. J. IKEEE 2020, 24, 961–968.
6. Karatsuba, A. Multiplication of multidigit numbers on automata. Sov. Phys. Dokl. 1963, 7, 595–596.
7. Heidarpur, M.; Mirhassani, M. An Efficient and High-Speed Overlap-Free Karatsuba-Based Finite-Field Multiplier for FGPA Implementation. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 667–676.
8. Granlund, T.; The GMP Development Team. The GNU Multiple Precision Arithmetic Library Manual. 2014. Available online: https://gmplib.org/ (accessed on 25 March 2025).
9. Toom, A.L. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Math 1963, 3, 498.
10. Yap, C.; Li, C. QuickMul: Practical FFT-Based Integer Multiplication; Department of Computer Science Courant Institute: New York, NY, USA, 2001.
11. Cook, S.A.; Aanderaa, S.O. On the minimum computation time of functions. Trans. Am. Math. Soc. 1969, 142, 291–314.
12. Elia, M. Loss of Precision in Implementations of the Toom-Cook Algorithm; The University of Vermont and State Agricultural College: Burlington, VT, USA, 2021.
13. Wang, J.; Yang, C.; Zhang, F.; Meng, Y.; Su, Y. TCPM: A Reconfigurable and Efficient Toom-Cook-Based Polynomial Multiplier Over Rings Using a Novel Compressed Postprocessing Algorithm. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 31, 1153–1166.
14. Das, M.; Jajodia, B. Area and Delay Trade-Offs in Three-Way Toom-Cook Large Integer Multipliers Implemented on FPGAs. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 72, 600–609.
15. Bodrato, M. Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0. In Proceedings of the Arithmetic of Finite Fields: First International Workshop, WAIFI 2007, Madrid, Spain, 21–22 June 2007; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 2007; pp. 116–133.
16. Dutta, S.; Bhattacharjee, D.; Chattopadhyay, A. Quantum circuits for Toom-Cook multiplication. Phys. Rev. A 2018, 98, 012311.
17. Umer, U.; Rashid, M.; Alharbi, A.R.; Alhomoud, A.; Kumar, H.; Jafri, A.R. An Efficient Crypto Processor Architecture for Side-Channel Resistant Binary Huff Curves on FPGA. Electronics 2022, 11, 1131.
18. Wang, J.; Yang, C.; Zhang, F.; Meng, Y.; Xiang, S.; Su, Y. A High-Throughput Toom-Cook-4 Polynomial Multiplier for Lattice-Based Cryptography Using a Novel Winograd-Schoolbook Algorithm. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 71, 359–372.
19. Putranto, D.S.C.; Wardhani, R.W.; Larasati, H.T.; Kim, H. Space and Time-Efficient Quantum Multiplier in Post Quantum Cryptography Era. IEEE Access 2023, 11, 21848–21862.
20. Ding, J.; Li, S.; Gu, Z. High-Speed ECC Processor Over NIST Prime Fields Applied With Toom–Cook Multiplication. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 66, 1003–1016.
21. Gu, Z.; Li, S. A Division-Free Toom–Cook Multiplication-Based Montgomery Modular Multiplication. IEEE Trans. Circuits Syst. Part II Express Briefs 2019, 66, 1401–1405.
22. Hao, Y.; Wang, W.; Dang, H.; Wang, G. Efficient Barrett Modular Multiplication Based on Toom–Cook Multiplication. IEEE Trans. Circuits Syst. II Express Briefs 2023, 71, 862–866.
23. Barker, E.B.; Barker, W.C.; Burr, W.E.; Polk, W.T.; Smid, M.E. Recommendation for Key Management Part 1: General (Revision 3); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2012.
24. Barker, E.B.; Dang, Q. Recommendation for Key Management Part 3: Application-Specific Key Management Guidance; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2009.
25. Lochter, M.; Merkle, J. Elliptic Curve Cryptography (ECC) Brainpool Standard Curves and Curve Generation. In N Koblitz an Elliptic Curve Implementation of the Finite Field Digital Signature Algorithm Proceedings of Crypto ’98 Lncs; Springer: Berlin, Germany, 2010.
26. Awano, H.; Ikeda, M. Fourℚ on ASIC: Breaking Speed Records for Elliptic Curve Scalar Multiplication. In Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019.
27. St. Denis, T.J.; Hamilton, N.F. Karatsuba Based Multiplier and Method. WO2007012179A3, 21 July 2006.
28. Maeder, R.E. Storage allocation for the Karatsuba integer multiplication algorithm. In Proceedings of the Design & Implementation of Symbolic Computation Systems, International Symposium, Disco 93, Gmunden, Austria, 15–17 September 1993.
29. Montgomery, P.L. Modular multiplication without trial division. Math. Comp 1985, 44, 519–521.
30. Bedrij, O.J. Carry-Select Adder. IRE Trans. Electron. Comput. 1962, EC-11, 340–346.
31. Ramkumar, B.; Kittur, H.M.; Kannan, P.M. ASIC implementation of modified faster carry save adder. Eur. J. Entific Res. 2010, 42, 53–58.
32. Kong, Y. Optimizing the Improved Barrett Modular Multipliers for Public-Key Cryptography. In Proceedings of the 2010 International Conference on Computational Intelligence and Software Engineering, Wuhan, China, 10–12 December 2010.
33. Kuang, S.R.; Wang, J.P.; Chang, K.C.; Hsu, H.W. Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2012, 21, 1999–2009.
34. Kuang, S.R.; Wu, K.Y.; Lu, R.Y. Low-cost high-performance VLSI architecture for Montgomery modular multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 24, 434–443.
35. Zhang, Z.; Zhang, P. A scalable montgomery modular multiplication architecture with low area-time product based on redundant binary representation. Electronics 2022, 11, 3712.
36. Miyamoto, A.; Homma, N.; Aoki, T.; Satoh, A. Systematic design of RSA processors based on high-radix Montgomery multipliers. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2010, 19, 1136–1146.
37. Knezevic, M.; Vercauteren, F.; Verbauwhede, I. Faster Interleaved Modular Multiplication Based on Barrett and Montgomery Reduction Methods. IEEE Trans. Comput. 2010, 59, 1715–1721.
38. Coronado, L.C.; García. Can Schönhage Multiplication Speed Up the RSA Decryption or Encryption? (Extended Abstract). 2005. Available online: http://www.cdc.informatik.tu-darmstadt.de/mitarbeiter/coronado.html (accessed on 25 March 2025).

Figure 1. The 5:2 compressor structure.

Figure 2. The inverse of the interpolation matrix.

Figure 3. Recomposition for the result.

Figure 4. Overall architecture.

Figure 5. Decoupled carry–save adder architecture.

Figure 6. Carry select adder architecture.

Figure 7. Three-level Karatsuba multiplication architecture.

Table 1. Comparison of implementation under 90 nm process.

Design	Process	Area ($μ m^{2}$)	Frequency (MHz)	Time (ns)	Power (mW)	ATP ($10^{3}\ μ m^{2} \cdot μ s$)
[33]	90 nm	749,076	179	4570.3	70.6	3423
[34]	90 nm	498,379	250	3520	-	1754
[35]	90 nm	992,500	392	291	-	288
[36]	90 nm	54,587	472	4680	-	257
[21]	90 nm	1,799,451	257	85.58	402.93	154
[22]	90 nm	1,012,576	277	126	392.47	127
ours	90 nm	515,148	413	274.8	379.35	142

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
