Article

An Area-Efficient and Configurable Number Theoretic Transform Accelerator for Homomorphic Encryption

School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3382; https://doi.org/10.3390/electronics13173382
Submission received: 27 July 2024 / Revised: 19 August 2024 / Accepted: 22 August 2024 / Published: 26 August 2024
(This article belongs to the Special Issue System-on-Chip (SoC) and Field-Programmable Gate Array (FPGA) Design)

Abstract:
Homomorphic Encryption (HE) allows for arbitrary computation of encrypted data, offering a method for preserving privacy in cloud computations. However, efficiency remains a significant obstacle, particularly with the polynomial multiplication of large parameter sets, which incurs substantial computing and memory overhead. Prior studies proposed the use of the Number Theoretic Transform (NTT) to accelerate polynomial multiplication, which proved efficient, owing to its low computational complexity. However, these efforts primarily focused on NTT designs for small parameter sets, and configurability and memory efficiency were not considered carefully. This paper focuses on designing a unified NTT/Inverse NTT (INTT) architecture with high area efficiency and configurability, which is more suitable for HE schemes. We adopt the Constant-Geometry (CG) NTT algorithm and propose a conflict-free access pattern, demonstrating a 16.7% reduction in coefficient storage capacity compared to the state-of-the-art CG NTT design. Additionally, we propose a twiddle factor generation strategy to minimize storage for Twiddle Factors (TFs). The proposed architecture offers both compile-time and runtime configurability, allowing for the deployment of varying parallelism and parameter sets during compilation while accommodating a wide range of polynomial degrees and moduli after compilation. Experimental results on the Xilinx FPGA show that our design can achieve higher area efficiency and configurability compared with previous works. Furthermore, we explore the performance difference between precomputed TFs and online-generated TFs for the NTT architecture, aiming to show the importance of online generation-based NTT architecture in HE applications.

1. Introduction

Homomorphic Encryption (HE), enabling the computation of encrypted data, is an optimal privacy-preserving solution for various scenarios, including cloud computing [1] and machine learning [2]. Unlike cryptographic methods [3,4] based on chaos theory, HE relies primarily on algebraic structures such as lattice-based mathematical theories. Over the past few decades, multiple HE schemes based on the Ring Learning With Error (RLWE) problem [5] have been proposed, among which the Brakerski–Gentry–Vaikuntanathan (BGV) scheme [6], Brakerski–Fan–Vercauteren (BFV) scheme [7], and Cheon–Kim–Kim–Song (CKKS) scheme [8] are popular. Despite these theoretical advances, intensive computation and high storage requirements remain serious obstacles: HE computation is thousands of times slower than computation on unencrypted data [9] and can occupy up to 512 MB of on-chip memory [10], limiting its widespread adoption.
Polynomial multiplication, as a fundamental operation in HE schemes, has become a performance bottleneck for HE applications. Using the NTT reduces the complexity of polynomial multiplication from the traditional O(N^2) to O(N log2 N), where N is the degree of the polynomial. Consequently, the effective implementation of the NTT significantly impacts the performance of HE applications. Unlike the low-degree polynomials (ranging from 2^8 to 2^12) with small moduli (e.g., 24 bit) needed in some traditional post-quantum cryptography schemes [11,12], HE schemes [13,14,15,16] require higher polynomial degrees (ranging from 2^13 to 2^17) and larger moduli (e.g., 1240 bit) to support sufficient multiplicative depth. Introducing a Residue Number System (RNS) to HE enables the decomposition of a large modulus into several smaller moduli with varying bit widths (e.g., 54 bit). Therefore, an NTT architecture suitable for RNS-based HE schemes must support a set of high-degree polynomials with diverse moduli. When accelerated in hardware, such an architecture consumes a significant amount of on-chip resources. Consequently, a hardware-efficient NTT architecture is in great demand for practical HE applications.
Area efficiency and configurability are two key metrics used to evaluate the performance of NTT designs. Area efficiency shows a negative correlation with resource and time costs. Configurability encompasses both Compile-Time Configurability (CTC) [17] and Run-Time Configurability (RTC) [18]. CTC enables the adjustment of parameter sets at compilation time to satisfy different schemes with a wide range of parameters. It also means that the computational parallelism of the NTT architecture can be chosen at compile time to realize a trade-off between area and speed. RTC corresponds to the ability to support various parameter sets at runtime.
Existing NTT designs [19,20] for HE parameters tend to employ a large number of Processing Elements (PEs) to achieve high throughput, albeit at the cost of notable resource overhead and low area efficiency. Öztürk et al. [19] introduced an NTT architecture with up to 256 PEs to accelerate the López-Alt–Tromer–Vaikuntanathan (LTV) scheme, ensuring low latency. However, the proposed architecture stores all TFs for multiple moduli in on-chip memory, leading to high memory overhead, which is challenging for polynomial degrees greater than 2^15 on the target FPGA device. Similarly, Su et al. [20] deployed 168 PEs across 41 RNS channels and stored all TFs for 41 32-bit moduli in internal memory, occupying a significant amount of hardware resources. Consequently, supporting moduli with a larger bit width on the target FPGA device became impossible. Kurniawan et al. [21] reported a memory-based NTT design, presenting remarkable advantages of its low-complexity modular multiplier and optimized memory access pattern. However, it is still unsuitable for RNS-based HE schemes with multiple moduli due to the considerable storage overhead for TFs. To solve this issue, Kim et al. [22] and Duong-Ngoc et al. [23] used the twiddle factor generation strategy to avoid the storage of TFs. Furthermore, they reduced the memory bandwidth requirement during high-computational parallelism by employing the mixed-radix NTT algorithm, enhancing area efficiency. However, flexibility is limited because both designs only aim at specific RNS-based HE parameter sets. Moreover, the number of PEs was fixed at design time, while the lack of unified data flows between NTT and INTT constrained the architectures’ adaptability.
On the other hand, some recent NTT designs [17,24,25,26,27] with configurability have restrictions on performance and area efficiency. Mert et al. [17] proposed the first NTT hardware generator that takes the desired parameters and the number of PEs as inputs. A multiplier using the word-level Montgomery algorithm could adapt to moduli of diverse bit widths without recompilation. Hu et al. [24] developed an NTT-based high-efficiency polynomial multiplier that permits the reuse of twiddle factors across different polynomial degrees. However, the two designs based on the in-place algorithm suffer from increased access complexity and lower hardware efficiency as PEs increase. To solve this problem, Banerjee et al. [28] proposed the Constant-Geometry (CG) algorithm, which exhibits superior scalability for multiple PEs, benefiting from its consistent memory access structure in each stage. However, the CG algorithm operates out of place, resulting in twice the memory overhead for coefficients compared to the in-place algorithm. Su et al. [25] and Liu et al. [26] optimized the coefficient access pattern, reducing the coefficient storage requirement from 2N to 1.5N, and exploited algorithm-level optimization techniques. However, the memory bandwidth requirement remains high, which hinders its implementation on FPGA. Geng et al. [27] introduced a novel high–low iterative memory access approach to reduce the storage bandwidth required per PE, mitigating storage restrictions. However, it is worth noting that in the mentioned works with configurability, all precomputed constants are stored in the internal memory. Therefore, the memory footprint remains a major obstacle when these NTT designs are embedded into HE schemes.
Consequently, developing a unified NTT/INTT architecture with flexibility while ensuring high area efficiency for HE-based parameters remains challenging. This paper proposes an area-efficient and highly configurable architecture to accelerate CG-based NTT/INTT operations on FPGA. Reducing the storage of coefficients and twiddle factors is the core focus of this study. More specifically, our contributions are summarized as follows:
(1)
We propose a novel memory access pattern based on the CG algorithm, reducing the total coefficient capacity requirement from 1.5N to 1.25N. Moreover, we reduce the required memory bandwidth per PE to match that of the in-place radix-2 algorithm. The reduction in overall capacity and bandwidth significantly mitigates memory demands compared to prior works.
(2)
We develop a flexible Twiddle Factor Generator (TFG), with a Step Generator (SG) supplying multipliers for the TFG. With this approach, only a few TF bases need to be stored, while the remaining TFs are generated on the fly. Implementation results show a 99.17% reduction in TF storage in our proposed NTT architecture with 32 PEs when the maximum polynomial degree is set to 2^16.
(3)
We present a hardware-efficient and configurable unified architecture to accelerate NTT/INTT without additional modifications. The proposed design not only allows for a balanced design in terms of area and throughput based on the compiled parameters, including the number of PEs, the maximum supported polynomial degree, and the maximum size of polynomial coefficients, but it also supports polynomials with varying degrees and coefficients of different sizes after compilation. In addition, although our architecture is designed for HE parameter sets, it also accommodates small parameter sets without compromising computational complexity.
(4)
We also implement a memory-based NTT/INTT architecture adopting our proposed coefficient access pattern, aiming to fairly compare it with previous NTT designs with all TFs stored in memory. The experimental results demonstrate the improved performance of the proposed memory access pattern. In the end, we conduct another comparison to investigate the performance difference between precomputed TFs and online-generated TFs for the NTT architecture. The results show the importance of the online generation-based NTT architecture in RNS-based HE applications.
The organization of this paper is outlined as follows. Section 2 provides basic knowledge about polynomial multiplication and the CG NTT algorithm. The hardware architecture is detailed in Section 3. In Section 4, the implementation results are analyzed and compared with prior works. Finally, Section 5 concludes this article.

2. Background

2.1. Polynomial Multiplication

Polynomial multiplication defined over a finite domain is a fundamental operation in cryptographic algorithms. It is a computational bottleneck when the polynomial degree is high, especially in HE applications. Typically, mathematical operations on polynomials are performed over a quotient ring R_q = Z_q[x]/(x^N + 1), where N is a power of 2 and q is an NTT-friendly prime satisfying q ≡ 1 mod 2N. A polynomial over R_q has degree N, with coefficients constrained to the interval [0, q). Consider two polynomials a(x) = Σ_i a_i x^i and b(x) = Σ_i b_i x^i, with c(x) = a(x) · b(x) denoting the multiplication result over R_q. The traditional computational method requires expensive matrix–vector multiplication, resulting in a computational complexity of O(N^2). The NTT algorithm accelerates the process, with a complexity of O(N log2 N), while still requiring a modulo operation by (x^N + 1) in the end. Employing the Negative Wrapped Convolution (NWC) [29] method circumvents the need for zero-padding and dimension extension by introducing additional preprocessing and post-processing. Accordingly, for given q and N, we define w and ψ as primitive N-th and 2N-th roots of unity in Z_q, respectively, satisfying ψ^2 ≡ w mod q. The symbol ⊙ signifies element-wise multiplication. Preprocessing operations for a(x) and b(x) are outlined in Equation (1) [26] and Equation (2) [26], respectively.
ã(x) = a(x) ⊙ {ψ^0, ψ^1, …, ψ^(N−1)}  (1)
b̃(x) = b(x) ⊙ {ψ^0, ψ^1, …, ψ^(N−1)}  (2)
Then, polynomial multiplication is performed as follows [26]:
c̃(x) = INTT(NTT(ã(x)) ⊙ NTT(b̃(x)))  (3)
The final multiplication results over R_q are obtained by executing the post-processing operation for c̃(x) as Equation (4) [26].
c(x) = c̃(x) ⊙ {ψ^0, ψ^(−1), …, ψ^(−(N−1))}  (4)
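As a concrete reference for the NWC flow above, the following Python sketch multiplies two polynomials over R_q. The parameters N = 8, q = 17, ψ = 3 are toy values chosen purely for illustration (ψ is a primitive 2N-th root of unity mod q); the transform here is the definition-level O(N^2) evaluation with merged pre/post-processing, not the fast algorithm.

```python
# Toy NWC polynomial multiplication over R_q = Z_q[x]/(x^N + 1).
# Hypothetical illustration parameters: N = 8, q = 17, psi = 3
# (a primitive 2N-th root of unity mod q, so psi^2 = w is an N-th root).
N, q, psi = 8, 17, 3

def ntt(a):   # merged preprocessing: evaluates a(x) at odd powers psi^(2k+1)
    return [sum(a[j] * pow(psi, j * (2 * k + 1), q) for j in range(N)) % q
            for k in range(N)]

def intt(A):  # inverse transform with merged post-processing
    n_inv = pow(N, -1, q)
    return [n_inv * sum(A[k] * pow(psi, -j * (2 * k + 1), q) for k in range(N)) % q
            for j in range(N)]

def negacyclic_schoolbook(a, b):   # direct O(N^2) product, using x^N = -1
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % q
    return c

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
c = intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
assert c == negacyclic_schoolbook(a, b)
```

In an RNS-based HE scheme, this same flow runs independently for each small modulus q in the RNS base.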

2.2. Constant-Geometry NTT/INTT Algorithm

The CG algorithm is an out-of-place NTT algorithm whose access pattern at every stage remains consistent, rendering it suitable for extension to multi-PE array architectures. The CG NTT module utilizes the Cooley–Tukey (CT) algorithm [30], whereas the CG INTT module relies on the Gentleman–Sande (GS) algorithm [31]. The CT algorithm and GS algorithm are two divide-and-conquer strategies for NTT/INTT computation. Additional optimization efforts involve merging the preprocessing and post-processing steps [32], along with the incorporation of the N^(−1) operation into each stage of the INTT [25]. Based on the simplified CG algorithm proposed by Su et al. [25], we make some adjustments to adapt it to scenarios with L PEs, as shown in Algorithms 1 and 2. The BitReverse function reorders elements by reversing the binary representation of their original indices, effectively rearranging the input vector.
Algorithm 1 Constant-Geometry NTT [25]
Require: a(x), q, ψ^i, where i = 0, 1, …, N − 1.
Ensure: A(x) = NTT(a(x))
 1: a ← BitReverse(a)
 2: for s = 1 to log2 N do
 3:   for j = 0 to N/2L − 1 do
 4:     for l = 0 to L − 1 do
 5:       e = jL + l
 6:       k = N/2^s
 7:       A[e] = a[2e] + a[2e + 1] · ψ^(2⌊e/k⌋·k + k) mod q
 8:       A[e + N/2] = a[2e] − a[2e + 1] · ψ^(2⌊e/k⌋·k + k) mod q
 9:     end for
10:   end for
11:   if s ≠ log2 N then
12:     a = A
13:   end if
14: end for
15: return A(x)
Algorithm 2 Constant-Geometry INTT [25]
Require: A(x), q, ψ^i, where i = 0, 1, …, N − 1.
Ensure: a(x) = INTT(A(x))
 1: for s = 1 to log2 N do
 2:   for j = 0 to N/2L − 1 do
 3:     for l = 0 to L − 1 do
 4:       e = jL + l
 5:       k = 2^(s−1)
 6:       a[2e] = (A[e] + A[e + N/2]) · 2^(−1) mod q
 7:       a[2e + 1] = (A[e] − A[e + N/2]) · ψ^(−(2⌊e/k⌋·k + k)) · 2^(−1) mod q
 8:     end for
 9:   end for
10:   if s ≠ log2 N then
11:     A = a
12:   end if
13: end for
14: a ← BitReverse(a)
15: return a(x)
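The two algorithms can be checked against each other with a short executable model. The sketch below mirrors the loop structure of Algorithms 1 and 2, with the L PEs modeled sequentially; N = 8, q = 17, ψ = 3, L = 2 are toy parameters chosen for illustration only.

```python
# Executable sketch of the Constant-Geometry NTT/INTT (Algorithms 1 and 2).
# Hypothetical toy parameters: N = 8, q = 17, psi = 3 (primitive 2N-th root
# of unity mod q), L = 2 "PEs" executed one after another.
N, q, psi, L = 8, 17, 3, 2

def bit_reverse(a):
    n = N.bit_length() - 1
    return [a[int(format(i, f'0{n}b')[::-1], 2)] for i in range(N)]

def cg_ntt(a):
    a = bit_reverse(a)
    for s in range(1, N.bit_length()):     # stages 1 .. log2 N
        A = [0] * N
        k = N >> s
        for j in range(N // (2 * L)):
            for l in range(L):             # the L butterflies of one cycle
                e = j * L + l
                tf = pow(psi, 2 * (e // k) * k + k, q)
                A[e] = (a[2 * e] + a[2 * e + 1] * tf) % q
                A[e + N // 2] = (a[2 * e] - a[2 * e + 1] * tf) % q
        a = A
    return a

def cg_intt(A):
    A = A[:]
    inv2 = pow(2, -1, q)                   # the per-stage N^(-1) factor
    for s in range(1, N.bit_length()):
        a = [0] * N
        k = 1 << (s - 1)
        for j in range(N // (2 * L)):
            for l in range(L):
                e = j * L + l
                tf = pow(psi, -(2 * (e // k) * k + k), q)
                a[2 * e] = (A[e] + A[e + N // 2]) * inv2 % q
                a[2 * e + 1] = (A[e] - A[e + N // 2]) * tf * inv2 % q
        A = a
    return bit_reverse(A)

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert cg_intt(cg_ntt(a)) == a             # INTT inverts NTT
```

The round trip holds for any valid (N, q, ψ, L), since each INTT stage exactly inverts the NTT stage with the same k.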

3. The Proposed Accelerator

3.1. Overall Architecture

The overall architecture of our design is illustrated in Figure 1, consisting of the following three types of components: computing, storage, and control components. The computing components comprise the PE array, TFG array, and SG. Meanwhile, the storage components include the coefficient memory, TF base memory, and step memory. The control unit receives a two-bit working mode signal as input to transition the circuit among the following three states: IDLE, NTT, and INTT. Further elaboration on the specific functionalities of each submodule follows.
(1)
PE array: The PE array is the computational component in this architecture, primarily handling butterfly operations. It consists of L PEs, where the parameter (L) is typically set to a power of two. Each PE executes either NTT or INTT operations on input polynomials, depending on the working state of the architecture.
(2)
Coefficient memory: The coefficient memory stores all polynomial coefficients during calculation. In this design, a total storage capacity of 1.25N coefficients is necessary, which is 16.7% less than the prior CG storage requirement of 1.5N. The total storage space is divided into 2L memory blocks to accommodate the computational bandwidth of the PE array.
(3)
TF base memory: The TF base memory holds TF bases, supplying inputs to the TFG array for the dynamic generation of the remaining TFs. The overall storage capacity is contingent upon the computational parallelism and the polynomial degree. When L ≥ 2, ((log2 N + 1) × ⌈L/2⌉) × 2 storage space is required; otherwise, log2 N × 2 space suffices. The total storage space is divided into ⌈L/2⌉ blocks to meet the computational bandwidth of the TFG array, where ⌈·⌉ represents rounding up.
(4)
Step memory: The step memory stores the inputs and outputs of the SG. The storage capacity is adjusted to match the pipeline of the SG, preventing pipeline stalls caused by mismatched data processing speeds.
(5)
SG: The SG is responsible for cyclically generating steps to supply the TFG array, thereby avoiding the need for expensive step storage.
(6)
TFG array: The TFG array consists of ⌈L/2⌉ twiddle factor generators (TFGs), which take steps and TF bases as inputs and output TFs to the PE array for computations.
(7)
Control unit: The controller provides the correct read address and write address for storage components and coordinates the work of all components.
In this architecture, all memory modules are configured in simple dual-port mode, allowing for external initialization after compilation. Modular Multipliers (MMs) utilize the word-level Montgomery algorithm. With specified iterations and a parameterized word size, they can accommodate moduli within a range of bit widths after compilation. The compile parameters N and log2 q_max are the maximum supported polynomial degree and the maximum supported modulus bit width, respectively. The compiled circuit can support different polynomial degrees n, with 2L ≤ n ≤ N. Moreover, the architecture realizes the unification of NTT and INTT in the design of all components.

3.2. Unified PE

The proposed PE architecture is shown in Figure 2, consisting of a Modular Adder (MA), Modular Subtractor (MS), Modular Multiplier (MM), Modular Divider (MD), Multiplexers (Muxs), and registers. The inputs are a, b, and c, and the outputs are Res0 and Res1. The signal mode determines whether the PE performs NTT or INTT. The PE executes the CG NTT operations when mode = 0. The results are shown as follows [26]:
Res0 = (a + b × c) mod q  (5)
Res1 = (a − b × c) mod q  (6)
When mode is set to 1, the PE executes the CG INTT operations. The results are shown as follows [26]:
Res0 = (a + b) × 2^(−1) mod q  (7)
Res1 = (a − b) × c × 2^(−1) mod q  (8)
In addition, a total of (2m + 4) registers are inserted to balance the pipeline latency, where m is the pipeline level of the MM.
The MM applies the word-level Montgomery algorithm proposed in [33], as shown in Algorithm 3. It utilizes the equivalent expression of an NTT-friendly prime (q = q_H × 2^W + 1) to decompose the modular reduction operation into smaller steps. The compiled parameter W is usually set to log2 2N, denoting the word size. The iteration number of the “for” loop is represented by the parameter T, where T = ⌈log2 q_max / W⌉. This means splitting a large-bit-width modular reduction into T iterations of smaller-bit-width modular operations; therefore, fewer Digital Signal Processing (DSP) resources are needed. After compilation, the supported modulus bit width K is configurable at runtime within the range (W, log2 q_max], while the traditional Barrett modular reduction algorithm [34] only supports a fixed modulus bit width of K = log2 q_max after compilation and lacks runtime configurability. For n < N, setting W to log2 2N restricts the range of q for dimension n. Hence, W should be small enough to accommodate a wide range of n, yet this entails more iterations for a specific log2 q_max. Therefore, W and T should be carefully selected to obtain the optimal trade-off between generality and complexity. In our work, we set T to 4, and W can be compiled as {15, 16, 17} to match the common HE-based parameter set of (N = 2^16, K = 60) [23]. It is important to note that the constraint on q varies with n < N. Specifically, as n decreases, the range of values for q shrinks compared to the initially supported range, while it remains unchanged as n increases. For example, when N = 2^16 and W = 17, for n = 2^15 or n = 2^14, q theoretically only needs to satisfy the expression of NTT-friendly primes (q = q_H × 2^16 + 1 or q = q_H × 2^15 + 1). However, the compiled parameter W determines that q must adhere to the condition q = q_H × 2^17 + 1, thereby imposing a constraint on q. Even so, compared to previous works [22,23], which utilized specialized forms of modulus to reduce complexity, the proposed MM is more general for HE applications.
Algorithm 3 Word-level Montgomery Modular Multiplication [33]
Require: A, B, q = q_H · 2^W + 1 (three K-bit positive integers, W < K ≤ log2 q_max)
Ensure: A · B/R mod q, where R = 2^(4×W) mod q
 1: X = A · B
 2: for i = 0 to 3 do
 3:   X_H = X >> W
 4:   X_L = X mod 2^W
 5:   X_LN = −X_L mod 2^W  // 2’s complement of X_L
 6:   C_IN = X_LN[W − 1] | X_L[W − 1]
 7:   X = X_H + C_IN + q_H · X_LN
 8: end for
 9: Y = X − q
10: if (Y < 0) then
11:   Z = X
12: else
13:   Z = Y
14: end if
15: return Z
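A behavioral model helps check the arithmetic of Algorithm 3. The sketch below uses deliberately small, hypothetical parameters (W = 4, T = 4, q_H = 7, so q = 113); each iteration replaces X with (X + q · X_LN)/2^W, so the final result carries the Montgomery factor R^(−1) = 2^(−T·W) mod q.

```python
# Behavioral model of Algorithm 3 (word-level Montgomery multiplication).
# Hypothetical toy parameters: W = 4, T = 4, q = q_H * 2^W + 1 = 113,
# an NTT-friendly prime (113 = 7 * 16 + 1).
W, T = 4, 4
q_H = 7
q = q_H * (1 << W) + 1

def wl_montgomery_mul(A, B):
    X = A * B
    for _ in range(T):
        X_H = X >> W                          # upper part
        X_L = X & ((1 << W) - 1)              # lower W bits
        X_LN = (-X_L) & ((1 << W) - 1)        # 2's complement of X_L
        C_IN = ((X_LN >> (W - 1)) | (X_L >> (W - 1))) & 1
        X = X_H + C_IN + q_H * X_LN           # X <- X * 2^(-W) mod q
    Y = X - q
    return X if Y < 0 else Y                  # single conditional subtraction

# The result equals A * B * R^(-1) mod q with R = 2^(T*W) mod q:
R = pow(2, T * W, q)
assert wl_montgomery_mul(5, 7) == 5 * 7 * pow(R, -1, q) % q
```

Only shifts, masks, additions, and one q_H multiply appear per iteration, which is what keeps the DSP cost low in hardware.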
Modular division can be viewed as multiplication by an inverse element. For x/2 mod q, it can be implemented using a shifter, an adder, and multiplexers, without multipliers. Similar to prior works [25,26], the MD is designed based on Equation (9).
x/2 mod q = (x >> 1), if x is even; (x >> 1) + (q + 1)/2, if x is odd.  (9)
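The halving rule of Equation (9) can be captured in a two-line sketch (q = 17 is a toy odd prime chosen only for the check): an even x is simply shifted, while an odd x uses (x >> 1) + (q + 1)/2, since x/2 ≡ (x + q)/2 (mod q) and x + q is even.

```python
# Multiplier-free halving modulo an odd prime q, per Equation (9).
def mod_div2(x, q):
    return (x >> 1) if x % 2 == 0 else (x >> 1) + (q + 1) // 2

q = 17  # toy odd prime (hypothetical parameter)
assert all(mod_div2(x, q) * 2 % q == x for x in range(q))
```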

3.3. Coefficient Memory Access Pattern

The storage overhead for coefficients on FPGA mainly depends on the following two factors: the total storage capacity and the number of required memory blocks per PE. This paper proposes a novel coefficient memory access pattern based on the CG algorithm, effectively reducing the storage demand from 1.5N to 1.25N (where N is the polynomial degree). At the same time, each PE requires just two memory blocks. In prior CG-based NTT studies [25,26,27], a storage capacity of 2N or 1.5N was typically allocated for coefficient storage. Su et al.’s work [25] and Liu et al.’s work [26] required 12 and 4 memory blocks per PE, respectively. Although Geng et al.’s work [27] required just two memory blocks per PE, their total storage requirement remained at 2N. In contrast, our design minimizes both the total storage capacity and the bandwidth, leading to a notable decrease in Block Random Access Memory (BRAM) resource usage.
The proposed coefficient storage arrangement comprises 2L memory blocks to match the computational needs of L PEs. If the N original coefficients were cyclically distributed across the 2L memory blocks, the coefficient pairs (a[e], a[e + N/2]) would occupy the same block, hindering the simultaneous retrieval of both source operands during INTT. Consequently, we reorganize coefficients a[N/2] ∼ a[N − 1] in reverse order within the 2L memory blocks, while coefficients a[0] ∼ a[N/2 − 1] are still stored sequentially. This strategy effectively segregates a[e] and a[e + N/2] into distinct blocks. Moreover, it is important to acknowledge that this arrangement necessitates at least 0.5N of additional space to ensure sufficient time for data retrieval before updates. To minimize the extra storage requirement, we further refine the layout, grouping coefficients in sets of N/4.
The improved coefficient arrangement is depicted in Figure 3, in which each block has a depth of 1.25N/2L. Within each block, N/2L of space is designated for storage of the original coefficients, while the remaining N/8L of space is allocated for newly generated data. To facilitate explanation, we divide the 2L blocks into five regions from top to bottom, labeled Regions 1∼5. The arrangement of the original coefficients proceeds as follows:
  • a[0] ∼ a[N/4 − 1] are placed in Region 1 sequentially;
  • a[N/4] ∼ a[N/2 − 1] are placed in Region 3 sequentially;
  • a[N/2] ∼ a[3N/4 − 1] are stored in reverse order at the block level, filling up Region 2;
  • a[3N/4] ∼ a[N − 1] are stored in reverse order at the block level, filling up Region 4.
More details are provided below for the loading and storing patterns of data during NTT. As shown in Algorithm 1, during the j-th iteration, coefficient pairs (a[2(jL + l)], a[2(jL + l) + 1]) are loaded from the blocks and transmitted to PE_l for computation. Then, the updated results (A[jL + l], A[jL + l + N/2]) are stored in the blocks. The parameter l ranges from 0 to L − 1, indicating that a total of 2L data are simultaneously read and stored in each computation cycle of the NTT. In the following, we describe the two initial executions of the PEs in stage 1 in detail.
  • At the first step in stage 1, i.e., j = 0, PE_l retrieves a[2l] from Block 2l and a[2l + 1] from Block 2l + 1 to execute the butterfly computation. The result A[l] is stored in the first available slot in Block l, and the other result A[N/2 + l] is stored in the first row of Block 2L − 1 − l. When the L PEs operate simultaneously, the coefficients in the first row of Region 1 are loaded simultaneously. The 2L results are stored in the first vacant positions of Blocks 0 ∼ L − 1 in Region 5 and the first row of Blocks L ∼ 2L − 1 in Region 1 (the original coefficients in the first row of Blocks L ∼ 2L − 1 have already been utilized).
  • At the second step in stage 1, i.e., j = 1, PE_l fetches a[2(L + l)] from Block 2l and a[2(L + l) + 1] from Block 2l + 1. The output A[L + l] is stored in the first available position in Block L + l, and A[L + l + N/2] is stored in the first row of Block L − 1 − l. When the L PEs work simultaneously, the coefficients in the second row of Region 1 are loaded, and the computed results are stored separately in the first vacant positions of Blocks L ∼ 2L − 1 in Region 5 and the first row of Blocks 0 ∼ L − 1 in Region 1 (the original coefficients in the first row of Blocks 0 ∼ L − 1 have already been utilized).
In the subsequent steps of stage 1, the coefficients are loaded and stored according to the above pattern. After N/4L clock cycles, a[0] ∼ a[N/2 − 1] from Regions 1 and 3 have been read. The newly generated N/2 data fill Regions 5 and 1. The PEs then handle the data from Regions 2 and 4, with the results filling Regions 2 and 3. Upon completing stage 1, Region 4 becomes available for storage in stage 2. The described operations are replicated in successive stages. After each stage’s execution, the remaining space is utilized for storage in the next stage. Subsequently, the coefficient storage arrangement reverts to its initial state at stage 6.
For a more detailed depiction of the coefficient access pattern, we derive the process for a 16-point NTT employing two PEs. The load and store data flows are illustrated in Figure 4. It is assumed that one clock cycle is required from data retrieval to the completion of computation. The left–right direction represents the progression of clock cycles within a stage. Blue denotes the read data to be transmitted to the PEs, and green denotes the newly generated data to be written into memory in that cycle.
The coefficient access logic for INTT is the reverse of that used in NTT. For example, in the first step of stage 1, the L PEs read coefficients A[0] ∼ A[L − 1] from Region 1 and A[N/2] ∼ A[N/2 + L − 1] from Region 2. The newly generated data a[0] ∼ a[2L − 1] are written into the first row of Region 5. Subsequent steps follow a similar pattern, which is not repeated here.
It is worth noting that the preceding description is rooted in the compilation parameter N, but the memory access structure also supports polynomials of degree n much smaller than N. In this scenario, only the initial 5n/8L of space in each block is accessed, leaving the remaining (5N/8L − 5n/8L) of space unused. Moreover, our design allows external access to arbitrary addresses in all memory blocks, enabling users to dynamically update the coefficients of n-dimensional polynomials in the memory blocks after compilation. This feature provides the proposed memory access pattern with runtime configurability.

3.4. Read-after-Write Conflict Analysis

Due to the delay in accessing RAM and PE computation, data from a particular address may be needed by the next stage before it is updated by the computation of the current stage, leading to access conflict. The critical point of conflict lies in the simultaneous reading and updating of data in two consecutive stages, known as a read-after-write conflict. The following conflict analysis process refers to Geng et al.’s work [27].
We define d as the total number of clock cycles of delay, comprising the latency of RAM access and the pipeline levels of the PE. Figure 5 illustrates the occurrence of a conflict when N = 16, L = 2, and d = 2. The notation “cycle#x” indicates the x-th clock cycle starting from the NTT computation. Cycle#0 and cycle#2 are the first read and first write cycles of stage 1, respectively. If the coefficient access operations between stages are contiguous, then at cycle#5, the coefficients indexed 6 and 7 are to be read in stage 2 while both are also being updated with results from stage 1, resulting in a conflict. We can observe that conflicts always arise during the final write-back operation of every stage.
In general, each stage spans N/2L cycles for the retrieval of N coefficients. Coefficients with indices ranging from N/2L to N/2 − 1 are written in the final cycle of the current stage’s operation and read in the #(N/4L − 1)-th cycle of the next stage. If N/2L − 1 + d = N/2L + N/4L − 1, then the coefficients with indices N/2L to N/2 − 1 are written and read at the same time. Therefore, the condition N/2L − 1 + d < N/2L + N/4L − 1 is necessary to prevent such conflicts. The simplified condition L < N/4d indicates that the maximum level of parallelism L is decided by the polynomial degree N and the delay d, thereby constraining the architectural flexibility. Furthermore, inserting idle periods between stages delays the read operations of the next stage so that there are sufficient cycles to update the coefficients within the current stage, effectively avoiding conflicts.
Assuming that the number of inserted idle cycles is g, the next stage waits for the additional g cycles to start reading after the current stage completes reading. Therefore, we have
N/2L − 1 + d < N/2L + N/4L − 1 + g  (10)
The minimum number of inserted idle cycles, g_min, is expressed as follows:
g_min = 0, if L < N/4d; d + 1 − N/4L, otherwise.  (11)
The NTT operation takes an additional g_min × (log2 N − 1) clock cycles, in total, to avoid conflicts. In the end, through the strategic insertion of idle cycles between stages, we achieve a conflict-free memory access architecture.
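The bound above is easy to sanity-check in software. The sketch below evaluates the minimum idle-cycle count and the total overhead; it is a toy model that assumes 4L divides N, as in the power-of-two configurations considered here.

```python
# Idle-cycle bound for conflict-free CG memory access (Section 3.4):
# no idle cycles are needed when L < N/(4d); otherwise each stage boundary
# needs d + 1 - N/4L idle cycles. Assumes 4L divides N.
from math import log2

def min_idle_cycles(N, L, d):
    return 0 if L < N / (4 * d) else d + 1 - N // (4 * L)

def extra_cycles(N, L, d):
    # total overhead across the log2(N) - 1 stage boundaries
    return min_idle_cycles(N, L, d) * (int(log2(N)) - 1)

# Figure 5 scenario: N = 16, L = 2, d = 2 needs one idle cycle per boundary.
assert min_idle_cycles(16, 2, 2) == 1
assert min_idle_cycles(1024, 8, 4) == 0   # L < N/(4d) = 64: conflict-free
```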

3.5. TF Generation Strategy

The TF online generation strategy can be classified into two categories, i.e., data-dependent and data-independent, based on whether the generated TFs are reused in the next generation step. We take Figure 6 as an example to illustrate the difference between these methods. In Figure 6a,b, the TFs required for each stage are listed according to Algorithm 1. Columns 2 and 3 list the factors allocated to the two PEs, and column 4 indicates how many times each row's factors are repeatedly read by the PEs. For example, in stage 1, ψ^8 is used by the PEs and lasts for four clock cycles.
In the data-dependent case, the TFs labeled in black are obtained through modular multiplication of the TFs generated in the preceding operation with the corresponding step labeled in red. Therefore, the pre-stored constants comprise the TF bases marked in blue and the steps marked in red. Since an MM usually takes several cycles per calculation, more TF bases must be cached for each stage to avoid stalling the MM; the total number of stored TF bases is directly proportional to the number of pipeline levels in the MM. Kim et al. [22] and Duong-Ngoc et al. [23] employed this method to generate TFs on the fly. In the data-independent case, the update of TFs relies only on the TF bases (labeled in blue) and the steps, as depicted in Figure 6b. The total number of pre-stored TF bases is determined by the compilation parameters (N and L), regardless of the pipeline levels in the MM. However, the varying steps within a stage pose a new storage challenge. In this paper, we focus on the data-independent TF generation strategy and propose a step generation method that avoids storing the steps.

3.5.1. Step Generation

As shown in Algorithm 1, the L PEs must execute N/(2L) iterations to complete the coefficient transformation of one stage. For a new stage, assume that the j-th parallel execution of the L PEs occurs in cycle#j. Then, according to line 7 in Algorithm 1, the TFs required by the PEs are {ψ^(2⌊l/k⌋k+k) | l ∈ [0, L−1]} at cycle#0 and {ψ^(2⌊(jL+l)/k⌋k+k) | l ∈ [0, L−1]} at cycle#j, where l denotes the index of the PE. So, for PE_l, the required step is computed as follows:
ψ^(2⌊(jL+l)/k⌋k+k) / ψ^(2⌊l/k⌋k+k) = ψ^(2k(⌊(jL+l)/k⌋ − ⌊l/k⌋))
We can represent jL + l and l in binary form as Equations (13) and (14), respectively.
(jL + l)₂ = {j_(log₂(N/2L)−1), …, j_0, l_(log₂L−1), …, l_0}
(l)₂ = {l_(log₂L−1), …, l_0}
Since k is a power of 2, when log₂k < log₂L, ⌊l/k⌋ is equivalent to right-shifting l by log₂k bits. Similarly, the derivations for ⌊jL/k⌋ and ⌊(jL+l)/k⌋ follow the same pattern:
(⌊l/k⌋)₂ = {l_(log₂L−1), …, l_(log₂k)}
(⌊jL/k⌋)₂ = {j_(log₂(N/2L)−1), …, j_0, 0, …, 0}, with (log₂L − log₂k) trailing zero bits
(⌊(jL+l)/k⌋)₂ = {j_(log₂(N/2L)−1), …, j_0, l_(log₂L−1), …, l_(log₂k)}
Hence, the following formula is valid.
⌊(jL+l)/k⌋ = ⌊jL/k⌋ + ⌊l/k⌋
When log₂k ≥ log₂L, the proof of Equation (18) follows a similar derivation. At cycle#j, the required step sequence is
ψ^(2k(⌊(jL+l)/k⌋ − ⌊l/k⌋)) = ψ^(2k⌊jL/k⌋)
It can be observed that ψ^(2k⌊jL/k⌋) is independent of l, indicating that all L TFGs use the same step to generate the TFs of cycle#j. When k = 1, i.e., in the final stage, the required step sequence follows the pattern {ψ^(2jL) | j ∈ [0, N/2L)}. For the other stages, the step updates to ψ^(2jL) only at cycle#(tk/L) (where t is an integer) and otherwise remains the same as in the previous cycle. In general, based on the derivation presented above, we can draw the following three conclusions:
  • Since the L TFGs share one step when generating the TFs of the next cycle, only one step generator needs to be designed to provide steps for the L TFGs.
  • The total number of unique steps is N/(2L) because the steps required for the final stage can be reused by the remaining stages.
  • For a specified stage (s), the required unique steps ({ψ^(2jL) | j = tk/L, t ∈ ℤ}) are determined solely by j, implying that the step generation logic can be uniform across all stages.
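These conclusions can be verified by brute force on small instances. The following sketch is our own check, not part of the paper; it enumerates every stage (k halving from N/2 to 1) and every cycle j, and confirms that the step exponent 2k(⌊(jL+l)/k⌋ − ⌊l/k⌋) is identical for all L PE indices l and collapses to 2k⌊jL/k⌋:

```python
def verify_shared_step(N: int, L: int) -> bool:
    """Check that, in every stage and every cycle j, all L PEs require the
    same step exponent 2k*(floor((jL+l)/k) - floor(l/k)), and that it
    equals 2k*floor(jL/k)."""
    k = N // 2
    while k >= 1:
        for j in range(N // (2 * L)):
            exps = {2 * k * ((j * L + l) // k - l // k) for l in range(L)}
            if exps != {2 * k * (j * L // k)}:
                return False
        k //= 2
    return True

print(verify_shared_step(16, 2))   # True
print(verify_shared_step(64, 4))   # True
```

The identity holds for both regimes: when k ≤ L, jL is a multiple of k, so the floor splits additively; when k > L, the offset l < L ≤ k never carries across a multiple of k.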
The proposed step generator is an MM with two inputs, aiming to provide steps for the L TFGs. To generate the unique step sequence {ψ^(2jL) | j ∈ [0, N/2L)} in a pipelined manner, M steps must be pre-stored in the step memory, where M is the total number of cycles comprising the delay of the MM plus one extra cycle. The steps in the step memory are continuously refreshed as the clock cycle increments, as shown in Figure 7. At cycle#0 of stage 1, the SG fetches the first step from address 0 and the constant ψ^(2ML) to generate the result (ψ^(2ML)) at cycle#(M − 1). The result is written back to the same address in that cycle. Similarly, the second result of the SG is written to the second address of the step memory at cycle#M. Once the SG has handled the M pre-stored steps sequentially, it circles back to fetch the step at address 0 as its input at cycle#M. Continuing along the same lines, in the last cycle, we obtain the ultimate unique step (ψ^(N−2L)), which also means that the complete set of steps ({ψ^0, ψ^(2L), …, ψ^(N−2L)}) has been generated within the previous N/(2L) cycles. It is worth noting that the constant should be adjusted to ψ^(N+2ML) at cycle#(N/(2L) − M), aiming to replicate the generation of the same step sequence in the next stage. For the L TFGs, at the #(tk/L)-th cycle of each stage, the required step is retrieved from the step memory to contribute to the generation of new TFs.
For better comprehension, we provide the timeline of the SG when N = 16, L = 2, and M = 2, as shown in Figure 8. The data with a blue background represent the steps pre-stored in the step memory. At cycle#0 of stage 1, the SG reads the step (ψ^0) located at address 0 and the constant (ψ^8) for modular multiplication. The result (ψ^8) is written back to the same address at cycle#1. In the same cycle, the SG reads the step (ψ^4) located at address 1 for a new calculation, and the result (ψ^12) is stored at that address at cycle#2. Also at cycle#2, the SG fetches the updated step (ψ^8) from address 0 and the second constant (ψ^24) to generate ψ^0 for the next stage. Similarly, at cycle#3, the SG fetches the updated step (ψ^12) from address 1 to produce the next step. The steps marked in red represent the steps that must be transmitted to the TFGs, which occurs at cycle#(tk/L), where t is an integer.
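The SG timeline can be modeled by tracking exponents of ψ modulo 2N (ψ is a primitive 2N-th root of unity, so exponents wrap at 2N). The sketch below is our own simulation of the described schedule, not the authors' RTL, and it assumes N/(2L) is a multiple of M so that the memory contents return exactly to their initial state for the next stage; it reproduces the Figure 8 example:

```python
def simulate_sg(N: int, L: int, M: int):
    """Simulate one stage of the step generator. The step memory holds M
    exponents; each cycle the SG multiplies one entry by a constant
    (psi^(2ML), switching to psi^(N+2ML) for the last M cycles) and writes
    the product back. Returns the unique step exponents of this stage and
    the final memory contents (equal to the initial ones when N/(2L) is a
    multiple of M)."""
    total = N // (2 * L)                        # unique steps per stage
    mem = [2 * L * i for i in range(M)]         # pre-stored: psi^0, psi^(2L), ...
    produced = list(mem)                        # the first M steps come pre-stored
    for t in range(total):
        addr = t % M
        const = 2 * M * L if t < total - M else N + 2 * M * L
        mem[addr] = (mem[addr] + const) % (2 * N)   # MM result written back
        if t < total - M:                       # last M products restore the memory
            produced.append(mem[addr])
    return produced, mem

steps, mem = simulate_sg(16, 2, 2)
print(steps)   # [0, 4, 8, 12] -> psi^0, psi^4, psi^8, psi^12
print(mem)     # [0, 4]: step memory restored for the next stage
```

In the example, the constant switches from ψ^8 to ψ^24 at cycle#2 = #(N/(2L) − M), exactly as in the Figure 8 timeline.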
In conclusion, the proposed step generator reduces the total step storage from N/(2L) down to M and supports the INTT mode through the dynamic modification of the two constants and the step memory.

3.5.2. TF Generation

The TFGs take the step from the step memory and the TF bases from the TF base memory as inputs to generate the remaining TFs for the PEs. Matching the computational capacity of the PE array would require an equal number of TFGs and PEs, nearly doubling the computational resource overhead. Therefore, we optimize the design to reduce the number of TFGs to L/2 at the cost of several additional clock cycles. This optimization is not applied when only one PE and one TFG exist. When L ≥ 2, we can observe that, except in the final stage, at least two TFGs generate the same TF in the same cycle. Consequently, halving the total number of TFGs only slows down the final stage, requiring an extra N/(2L) cycles to generate its TFs. This proves cost-effective, reducing the total computational resources by approximately 25%.
The proposed architecture for online TF generation, as illustrated in Figure 9, comprises one SG, L/2 TFGs, the TF base memory, and the step memory. The total capacity of the TF base memory is, at most, (log₂N + 1) × L/2 × 2 for one modulus, where the factor of 2 stems from the different TF bases used in NTT and INTT. The total capacity of the step memory is M. Despite the expense of N/(2L) additional clock cycles, this strategy significantly diminishes the storage demands for TFs and steps, achieving enhanced area efficiency.

4. Evaluation

4.1. Experimental Environment and Methodology

We implement the proposed NTT architecture with the Xilinx Vivado 2021.1 tools. To ensure a fair comparison, the design is placed and routed on different FPGA platforms, in accordance with previous works. We present and comparatively analyze the performance of the proposed accelerator in the following sections.
In Section 4.2, we present some details about our architecture, including resource, latency, and evaluation metrics. Subsequently, our comparison and analysis are divided into three subsections. In Section 4.3, we conduct a detailed comparison with existing works designed for HE parameters. In Section 4.4, utilizing the coefficient memory access pattern proposed in this paper, we develop a memory-based NTT/INTT architecture that stores all TFs on chip, enabling a fair comparison with works optimized for small parameter sets. Finally, in Section 4.5, we explore the area efficiency of memory-based and online generation-based NTT architectures under RNS-based HE parameter sets.

4.2. Experimental Results

First, Table 1 shows the hardware resource breakdown of our NTT architecture under the compilation parameter set (N = 2^16, L = 32, and log₂q_max = 60). We primarily focus on the BRAM and DSP overhead. In our design, each MM employs 18 DSP slices, so the PE array consumes 576 (= 18 × 32) DSP slices for 32 MMs. Moreover, the TFG array consumes 288 (= 18 × 16) DSP slices for 16 MMs. We include an additional MM to construct the SG, resulting in a total of 882 (= 576 + 288 + 18) DSP slices. Coefficient storage consumes 2L coefficient memory blocks with a depth of 5N/8L, while TF base storage consumes L/2 memory blocks with a depth of, at most, (log₂N + 1) × 2 for one modulus. The step memory is implemented using Look-Up Table (LUT) resources, thereby avoiding the consumption of any BRAM. All BRAMs are configured in simple dual-port mode, with a total of 208 BRAMs used (i.e., 192 BRAMs for coefficients and 16 BRAMs for TF bases). Furthermore, to illustrate the reduction in storage capacity, Table 2 compares the coefficient storage overhead with the state-of-the-art CG NTT design [26] for N = 2^16. It also compares the total number of pre-stored TF bases for 32 RNS moduli with the design proposed in [23], which also optimizes the TF generator. The results indicate that our architecture reduces memory overhead for both coefficients and TFs in CG-based NTT designs.
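The arithmetic behind this breakdown can be reproduced directly. The short sketch below is our own bookkeeping, using only the figures stated above (18 DSP slices per MM and the stated memory depths); it recomputes the DSP total and the per-block depths:

```python
N, L, DSP_PER_MM = 2**16, 32, 18
log2N = N.bit_length() - 1                 # 16 for N = 2**16

total_dsp = DSP_PER_MM * (L + L // 2 + 1)  # L PE MMs + L/2 TFG MMs + 1 SG MM
coeff_depth = 5 * N // (8 * L)             # depth of each of the 2L coefficient blocks
tf_depth = (log2N + 1) * 2                 # max TF-base depth for one modulus

print(total_dsp)              # 882
print(2 * L, coeff_depth)     # 64 coefficient blocks of depth 1280
print(L // 2, tf_depth)       # 16 TF-base blocks of depth 34
```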
Secondly, regarding timing, the coefficient conversion across log₂N stages requires log₂N × N/(2L) Clock Cycles (CCs). Moreover, when L ≥ 2, the optimized TF generation strategy introduces an extra N/(2L) cycles. To avoid access conflicts, a minimum of g_min idle cycles are inserted between stages, allowing adequate time for the previous stage's results to be written back. Finally, considering the delay of the SG, denoted by M, and that of the PE, denoted by d, the total CCs and latency of NTT/INTT operations are as depicted in Equations (20) and (21) [23], where f_max represents the maximal clock frequency achievable on the target FPGA platform.
CCs = (log₂N + 1) × N/(2L) + M + d + g_min × (log₂N − 1), if L > 1; CCs = log₂N × N/2 + M + d + g_min × (log₂N − 1), otherwise.
Latency (μs) = CCs / f_max (MHz)
The data throughput can be measured as the maximum number of data bits processed by the NTT/INTT module per second, as depicted in Equation (22) [23].
Throughput (Mbps) = Number of bits / Latency (μs)
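Equations (20)–(22) combine into a simple single-modulus cost model. The sketch below is our own calculator; the delay values M and d and the clock frequency in the example call are hypothetical placeholders, not measured figures from Table 3:

```python
import math

def ntt_cost(N, L, M, d, g_min, log2q, fmax_mhz):
    """Cycle count (Eq. 20), latency (Eq. 21), and throughput (Eq. 22)
    for one NTT/INTT over a single modulus."""
    log2N = int(math.log2(N))
    if L > 1:
        ccs = (log2N + 1) * N // (2 * L) + M + d + g_min * (log2N - 1)
    else:
        ccs = log2N * N // 2 + M + d + g_min * (log2N - 1)
    latency_us = ccs / fmax_mhz                 # Eq. (21)
    throughput_mbps = N * log2q / latency_us    # Eq. (22): bits per NTT / latency
    return ccs, latency_us, throughput_mbps

# Hypothetical example: N = 2**16, L = 32, M = 2, d = 4, conflict-free (g_min = 0).
ccs, lat, tp = ntt_cost(N=2**16, L=32, M=2, d=4, g_min=0, log2q=60, fmax_mhz=200)
print(ccs)   # 17414 cycles
```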
Moreover, when the proposed architecture is extended to support p moduli, the data movement overhead of multiple polynomials with diverse moduli is not considered, similar to prior works [20,23]. Therefore, the total number of CCs and the maximal number of data bits supported by the architecture increase by a factor of p.
Finally, a larger number of PEs results in shorter latency and higher throughput but requires more resources. Therefore, comparing only the resource usage or throughput is one-sided. For a configurable architecture, area efficiency is a more comprehensive and fair performance evaluation standard. Commonly used metrics for quantifying area efficiency include “Area × Time” Products (ATP) [35] and “Throughput Per Slice” (TPS) [36], which differ in how they equate area. Lower ATP and higher TPS indicate better area efficiency. In our comparison, we calculate both metrics, considering the diversity of platforms and resources.

4.3. Comparison to Works Considering HE Parameters

In this part, our NTT performance results are compared with those reported in related works [19,20,21,22,23], as shown in Table 3. Our design offers the highest level of configurability while achieving optimal area efficiency, with lower ATP and higher TPS compared to other works, except for [21].
Öztürk et al. [19] reported a block-level NTT architecture for partial HE schemes. Large amounts of on-chip memory and multipliers are utilized for TF storage and computation, respectively, leading to inefficient hardware utilization. Compared with the design proposed in [19], our design achieves higher throughput, with 5.17× higher TPS values, and decreases the ATP by 85.46%. The architecture proposed in [22] enables a full pipeline by deploying many PEs in series. However, this comes at the cost of high memory capacity and bandwidth due to the five intermediate buffers required for coefficient reordering, which is impractical for memory-bound homomorphic evaluations. In contrast, our proposed coefficient access strategy significantly reduces memory overhead. Specifically, when performing a 2^17-point NTT operation with a modulus bit width of 62, their architecture requires at least 10,080 KB of memory resources for coefficients, while ours requires only 1440 KB. Overall, our design pursues optimal hardware efficiency through configurable parallelism, and Table 3 shows that the proposed NTT architecture outperforms that proposed in [22] in terms of both performance metrics. Su et al. [20] proposed a multi-channel and multi-PE architecture based on the CG algorithm. Their design requires 1.49× more LUTs than ours while achieving 2× higher throughput. However, due to our optimized memory access pattern and TF generation strategy, our design decreases the BRAM resources by 7.85×. The results in terms of ATP and TPS indicate that our design has higher area efficiency. In comparison with the design proposed in [23], our architecture decreases the ATP by 31.38% and obtains 1.25× higher TPS. Owing to the carefully designed pipelined architecture of the PE, our NTT can run at a higher clock frequency with the same parallelism and thus obtains 1.6× higher throughput.
Although the compared design uses less BRAM thanks to its optimized coefficient memory bandwidth based on the mixed-radix algorithm, it is worth noting that our design requires less memory for TF bases, in terms of both capacity and bandwidth. Furthermore, when supporting 32 RNS moduli, the compared design requires 24,000 pre-stored TF bases, whereas only 17,408 are required in ours. In addition, because its NTT and INTT datapaths are not unified, TF bases for NTT and INTT are stored in different memory blocks, while our unified architecture does not require extra TF base memory blocks for INTT. Kurniawan et al. [21] proposed a memory-based NTT architecture that supports RTC, achieving superior area efficiency. This improvement is primarily attributed to the low DSP consumption resulting from the use of a specific modulus form, which translates to a minimal equivalent area. However, in terms of throughput, our design achieves a 1.12× improvement with the same level of computational parallelism while also offering significantly higher configurability and supporting more RNS moduli.
In addition, among the aforementioned works, only [22,23] implemented an online generation strategy for TFs, which is closer to practical HE applications. Compared with these designs, when supporting multiple moduli, our architecture incurs only a slight increase in BRAM usage for storing the TF bases of the different moduli. The compared designs, however, not only require additional BRAM but also necessitate an increase in LUT resources. This is because their MM designs are optimized for specific moduli with low Hamming weights, utilizing shifters for lightweight integer multiplication; consequently, each additional supported modulus adds shifters and multiplexers. In contrast, the bit-width range of the moduli supported by our MM at runtime is (W, log₂q_max]. When log₂q_max = 62 and W = 17, our architecture accommodates moduli ranging from 18 to 62 bits, which covers all RNS moduli proposed in prior works, without incurring any additional computation overhead. Furthermore, the designs proposed in [22,23] only support a fixed number of PEs, lacking configurable parallelism, whereas our design supports both CTC and RTC. We also provide a unified architecture for NTT and INTT, avoiding redundant design efforts.
In all, our NTT/INTT module achieves improvements in terms of area efficiency and configurability. These advancements establish it as a practical NTT accelerator for RNS-based HE applications.

4.4. Comparison with Works Considering Small Parameters

For small parameter sets, NTT designs typically use on-chip storage for TFs due to their less stringent storage constraints. To ensure a fair comparison with works in the literature, we use the TF storage strategy proposed in [26], combined with our optimized coefficient memory access method, to create a memory-based NTT design. The comparison results between our memory-based NTT and prior works considering small parameters are shown in Table 4.
The designs proposed in [17,24,26,27,38] support the configuration of the parameters N, log₂q_max, and L at compile time, and those proposed in [24,26] also support diverse polynomial degrees at runtime. In comparison, our design offers greater scalability, with three compilable parameters, namely N, log₂q_max, and L, while further enabling runtime adjustment of N and different sizes of q. Furthermore, compared to the design proposed in [38], our architecture demonstrates a poorer LUT-based ATP when L = 1 but higher LUT efficiency for larger values of L; moreover, our overall ATP is superior due to the lower latency. In comparison with the design proposed in [27], the BRAM consumption is the same because, in both designs, each PE requires two memory blocks for coefficient storage and one block for twiddle factor storage. It is worth noting, however, that the actual storage consumption of our design only slightly exceeds 2.25N, whereas that of the compared design is approximately 3N. Mert et al. [39] employed a fully parallelized architecture based on the four-step algorithm to speed up the NTT, but at the cost of significant resources; in comparison, our design is more area-efficient. Liu et al. [26] proposed a configurable NTT/INTT accelerator that supports both CTC and RTC while decreasing the actual memory overhead to 2.5N. Their design consumes considerably fewer LUT resources because of its simplified memory access pattern; however, our design is more area-efficient thanks to its lower BRAM usage. Our architecture also outperforms the designs proposed in [24] and [17] in terms of ATP.
In general, benefiting from the proposed coefficient memory pattern, our design can fully exploit storage resources and obtain higher LUT and BRAM efficiencies than many previously proposed architectures.

4.5. Comparison between Memory-Based NTT and Online Generation-Based NTT

In the preceding discussion, we learned that there are two methods to provide TFs for PE arrays, namely (i) memory-based, and (ii) online generation-based methods. This part presents a theoretical analysis of how these methods impact area efficiency.
To simplify the analysis, let us assume that all operations occur on the chip to remove the impact of data movement. This implies that storing the TFs for (i) and the TF bases for (ii) across multiple moduli in the internal memory is necessary. Moreover, we assume that only one RNS channel with one NTT core is deployed, consisting of L PE units. Consequently, the total constant storage capacity increases linearly with multiple moduli. The NTT/INTT domain transformations of polynomials across p moduli must be executed serially.
For method (i), according to [26], L memory blocks with a depth of (N/L + log₂L − 1) × p are required for the TFs. Method (ii) requires L/2 memory blocks for the TF bases, each with a maximum depth of (log₂N + 1) × 2 × p. Table 5 presents the BRAM consumption of the proposed memory-based and online generation-based NTT architectures on a Xilinx ZCU102 FPGA when the parameter set (N, log₂q_max, p) is set to (2^16, 60, 32). Notably, the BRAM overhead of method (i) is approximately 10.91× to 25.84× higher than that of method (ii).
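A rough comparison of the two strategies follows directly from these depth formulas. The sketch below is our own calculation and counts pre-stored constant entries only; actual BRAM counts, as in Table 5, also depend on block geometry and word width, so the entry ratio exceeds the reported 10.91× to 25.84× BRAM ratio:

```python
def tf_constant_entries(N, L, p):
    """Pre-stored TF constants for (i) memory-based storage (depths per [26])
    and (ii) online generation (the TF bases of this design)."""
    log2N = N.bit_length() - 1
    log2L = L.bit_length() - 1
    mem_based = L * (N // L + log2L - 1) * p    # L blocks, depth (N/L + log2(L) - 1) * p
    online = (L // 2) * (log2N + 1) * 2 * p     # L/2 blocks, depth (log2(N) + 1) * 2 * p
    return mem_based, online

m, o = tf_constant_entries(2**16, 32, 32)
print(m, o)          # 2101248 vs. 17408 entries
print(round(m / o))  # ~121x fewer pre-stored constants with online generation
```

The online-generation figure of 17,408 entries matches the TF base count cited for 32 RNS moduli in Section 4.3.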
In addition to the variance in BRAM occupancy, the two approaches differ in latency and in the overhead of LUT and DSP resources. Therefore, we provide the ATP values of our proposed NTT architectures under the parameter set (N = 2^16, log₂q_max = 60, p = 32), as shown in Figure 10. For the latency computation, the operating frequency of the architecture is assumed to be consistent with the maximum operating frequency under a single modulus. It is observed that, under the same computational parallelism, the online generation-based NTT architecture significantly outperforms the memory-based architecture in terms of area efficiency. Furthermore, the ATP of the memory-based architecture gradually decreases as L increases, indicating that expanding the computation units can mitigate the performance limitations resulting from the storage requirements. Meanwhile, the ATP of the online generation-based architecture reaches its minimum at L = 32. This suggests that the primary factor influencing performance is the availability of computational resources when L < 32; however, when L ≥ 32, the performance improvement becomes constrained by the increasing storage bandwidth.
In general, our comparison results highlight the necessity of generating TFs on the fly in RNS-based HE applications. Moreover, for high-degree polynomials, NTT designs with high parallelism are advantageous for enhancing area efficiency.

5. Conclusions and Future Works

This paper proposes an area-efficient and flexible NTT architecture suitable for RNS-based HE evaluations. The proposed conflict-free and low-complexity memory access pattern reduces the total coefficient storage requirements of CG-based NTT designs, and the carefully designed twiddle factor generator saves a significant amount of memory for large sets of prime moduli. Furthermore, the proposed unified architecture offers both CTC and RTC: it can be configured with various numbers of computational units at compile time and supports multiple polynomial degrees and various sizes of moduli after compilation. We evaluated the proposed design under a wide range of parameters to demonstrate its advantages in terms of performance and configurability.
Future work will utilize the proposed NTT architecture to develop more complex HE operations, such as bootstrapping and key switching. Additionally, when the proposed accelerator is integrated into a system, the data transfer overhead for various moduli should also be considered.

Author Contributions

Conceptualization, J.H. and C.K.; methodology, J.H. and S.L.; software, J.H.; validation, J.H.; investigation, J.H. and C.K.; resources, T.S.; data curation, J.H. and C.K.; writing—original draft preparation, J.H.; writing—review and editing, S.L. and T.S.; project administration, J.H. and T.S.; funding acquisition, T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Program of Guangdong Province under Grant 2022B0701180001.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lin, J.; Qian, J. A Multi-Party Secure SaaS Cloud Accounting Platform Based on Lattice-Based Homomorphic Encryption System. In Proceedings of the 2021 International Conference on Public Management and Intelligent Society (PMIS), Shanghai, China, 26–28 February 2021; pp. 1–4. [Google Scholar] [CrossRef]
  2. Brutzkus, A.; Elisha, O.; Gilad-Bachrach, R. Low Latency Privacy Preserving Inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  3. Lin, H.; Deng, X.; Yu, F.; Sun, Y. Grid Multi-Butterfly Memristive Neural Network with Three Memristive Systems: Modeling, Dynamic Analysis, and Application in Police IoT. IEEE Internet Things J. 2024, 1–11. [Google Scholar] [CrossRef]
  4. Lin, H.; Deng, X.; Yu, F.; Sun, Y. Diversified Butterfly Attractors of Memristive HNN with Two Memristive Systems and Application in IoMT for Privacy Protection. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 1–12. [Google Scholar] [CrossRef]
  5. Balbás, D. The Hardness of LWE and Ring-LWE: A Survey. IACR Cryptol. ePrint Arch. 2021, 2021, 1358. [Google Scholar]
  6. Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory 2014, 6, 1–36. [Google Scholar] [CrossRef]
  7. Brakerski, Z. Fully Homomorphic Encryption without Modulus Switching from Classical GapSVP. In Advances in Cryptology—CRYPTO 2012, Proceedings of the 32nd Annual Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2012; Safavi-Naini, R., Canetti, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 868–886. [Google Scholar]
  8. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology—ASIACRYPT 2017, Proceedings of the 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Takagi, T., Peyrin, T., Eds.; Springer: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar]
  9. Feldmann, A.; Samardzic, N.; Krastev, A.; Devadas, S.; Dreslinski, R.; Eldefrawy, K.; Genise, N.; Peikert, C.; Sanchez, D. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption (Extended Version). arXiv 2021, arXiv:2109.05371. [Google Scholar]
  10. Kim, J.; Lee, G.; Kim, S.; Sohn, G.; Kim, J.; Rhu, M.; Ahn, J.H. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse. In Proceedings of the 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Chicago, IL, USA, 1–5 October 2022; pp. 1237–1254. [Google Scholar] [CrossRef]
  11. Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Dilithium: A Lattice-Based Digital Signature Scheme. Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2018, 238–268. [Google Scholar] [CrossRef]
  12. Alkim, E.; Barreto, P.S.L.M.; Bindel, N.; Krämer, J.; Longa, P.; Ricardini, J.E. The Lattice-Based Digital Signature Scheme qTESLA. In Proceedings of the Applied Cryptography and Network Security, Rome, Italy, 19–22 October 2020; Conti, M., Zhou, J., Casalicchio, E., Spognardi, A., Eds.; Springer: Cham, Switzerland, 2020; pp. 441–460. [Google Scholar]
  13. Cheon, J.H.; Han, K.; Kim, A.; Kim, M.; Song, Y. A Full RNS Variant of Approximate Homomorphic Encryption. In Selected Areas in Cryptography—SAC 2018, Proceedings of the 25th International Conference, Calgary, AB, Canada, 15–17 August 2018; Springer: Cham, Switzerland, 2019; pp. 347–368. [Google Scholar]
  14. Bajard, J.C.; Eynard, J.; Hasan, M.A.; Zucca, V. A Full RNS Variant of FV like Somewhat Homomorphic Encryption Schemes. In Selected Areas in Cryptography—SAC 2016, Proceedings of the 23rd International Conference, St. John’s, NL, Canada, 10–12 August 2016; Avanzi, R., Heys, H., Eds.; Springer: Cham, Switzerland, 2017; pp. 423–442. [Google Scholar]
  15. Chen, H.; Chillotti, I.; Song, Y. Improved Bootstrapping for Approximate Homomorphic Encryption. In Proceedings of the 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, 19–23 May 2019; Springer: Cham, Switzerland, 2019; pp. 34–54. [Google Scholar]
  16. Han, K.; Ki, D. Better Bootstrapping for Approximate Homomorphic Encryption. In Topics in Cryptology—CT-RSA 2020, Proceedings of the Cryptographers’ Track at the RSA Conference 2020, San Francisco, CA, USA, 24–28 February 2020; Jarecki, S., Ed.; Springer: Cham, Switzerland, 2020; pp. 364–390. [Google Scholar]
  17. Mert, A.C.; Karabulut, E.; Ozturk, E.; Savas, E.; Becchi, M.; Aysu, A. A Flexible and Scalable NTT Hardware: Applications from Homomorphically Encrypted Deep Learning to Post-Quantum Cryptography. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 346–351. [Google Scholar] [CrossRef]
  18. Mert, A.C.; Öztürk, E.; Savaş, E. FPGA Implementation of a Run-Time Configurable NTT-based Polynomial Multiplication Hardware. Microprocess. Microsyst. 2020, 78, 103219. [Google Scholar] [CrossRef]
  19. Ozturk, E.; Doroz, Y.; Savas, E.; Sunar, B. A Custom Accelerator for Homomorphic Encryption Applications. IEEE Trans. Comput. 2017, 66, 3–16. [Google Scholar] [CrossRef]
  20. Su, Y.; Yang, B.L.; Yang, C.; Zhao, S.Y. ReMCA: A Reconfigurable Multi-Core Architecture for Full RNS Variant of BFV Homomorphic Evaluation. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2857–2870. [Google Scholar] [CrossRef]
  21. Kurniawan, S.; Duong-Ngoc, P.; Lee, H. Configurable Memory-Based NTT Architecture for Homomorphic Encryption. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 3942–3946. [Google Scholar] [CrossRef]
  22. Kim, S.; Lee, K.; Cho, W.; Nam, Y.; Cheon, J.H.; Rutenbar, R.A. Hardware Architecture of a Number Theoretic Transform for a Bootstrappable RNS-based Homomorphic Encryption Scheme. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; pp. 56–64. [Google Scholar] [CrossRef]
  23. Duong-Ngoc, P.; Kwon, S.; Yoo, D.; Lee, H. Area-Efficient Number Theoretic Transform Architecture for Homomorphic Encryption. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 1270–1283. [Google Scholar] [CrossRef]
  24. Hu, X.; Tian, J.; Li, M.; Wang, Z. AC-PM: An Area-Efficient and Configurable Polynomial Multiplier for Lattice Based Cryptography. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 719–732. [Google Scholar] [CrossRef]
  25. Su, Y.; Yang, B.L.; Yang, C.; Yang, Z.P.; Liu, Y.W. A Highly Unified Reconfigurable Multicore Architecture to Speed Up NTT/INTT for Homomorphic Polynomial Multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 993–1006. [Google Scholar] [CrossRef]
  26. Liu, S.H.; Kuo, C.Y.; Mo, Y.N.; Su, T. An Area-Efficient, Conflict-Free, and Configurable Architecture for Accelerating NTT/INTT. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 519–529. [Google Scholar] [CrossRef]
  27. Geng, Y.; Hu, X.; Li, M.; Wang, Z. Rethinking Parallel Memory Access Pattern in Number Theoretic Transform Design. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 1689–1693. [Google Scholar] [CrossRef]
  28. Banerjee, U.; Ukyab, T.S.; Chandrakasan, A.P. Sapphire: A Configurable Crypto-Processor for Post-Quantum Lattice-Based Protocols (Extended Version). TCHES 2019, 2019, 17–61. [Google Scholar] [CrossRef]
  29. Pöppelmann, T.; Güneysu, T. Towards Efficient Arithmetic for Lattice-Based Cryptography on Reconfigurable Hardware. In Progress in Cryptology—LATINCRYPT 2012; Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J.M., Mattern, F., Mitchell, J.C., Naor, M., Nierstrasz, O., Pandu Rangan, C., Steffen, B., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7533, pp. 139–158. [Google Scholar] [CrossRef]
  30. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
31. Gentleman, W.M.; Sande, G. Fast Fourier Transforms: For fun and profit. In Proceedings of the AFIPS ’66 (Fall), San Francisco, CA, USA, 7–10 November 1966. [Google Scholar]
  32. Pöppelmann, T.; Oder, T.; Güneysu, T. High-Performance Ideal Lattice-Based Cryptography on 8-Bit ATxmega Microcontrollers. In Progress in Cryptology—LATINCRYPT 2015, Proceedings of the 4th International Conference on Cryptology and Information Security in Latin America, Guadalajara, Mexico, 23–26 August 2015; Lauter, K., Rodríguez-Henríquez, F., Eds.; Springer: Cham, Switzerland, 2015; pp. 346–365. [Google Scholar]
  33. Mert, A.C.; Öztürk, E.; Savaş, E. Design and Implementation of a Fast and Scalable NTT-Based Polynomial Multiplier Architecture. In Proceedings of the 2019 22nd Euromicro Conference on Digital System Design (DSD), Kallithea, Greece, 28–30 August 2019; pp. 253–260. [Google Scholar] [CrossRef]
  34. Barrett, P. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Advances in Cryptology—CRYPTO’ 86; Odlyzko, A.M., Ed.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 263, pp. 311–323. [Google Scholar] [CrossRef]
  35. Ye, Z.; Cheung, R.C.C.; Huang, K. PipeNTT: A Pipelined Number Theoretic Transform Architecture. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 4068–4072. [Google Scholar] [CrossRef]
  36. Kundi, D.e.S.; Zhang, Y.; Wang, C.; Khalid, A.; O’Neill, M.; Liu, W. Ultra High-Speed Polynomial Multiplications for Lattice-Based Cryptography on FPGAs. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1993–2005. [Google Scholar] [CrossRef]
  37. Liu, W.; Fan, S.; Khalid, A.; Rafferty, C.; O’Neill, M. Optimized Schoolbook Polynomial Multiplication for Compact Lattice-Based Cryptography on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 2459–2463. [Google Scholar] [CrossRef]
  38. Mu, J.; Ren, Y.; Wang, W.; Hu, Y.; Chen, S.; Chang, C.H.; Fan, J.; Ye, J.; Cao, Y.; Li, H.; et al. Scalable and Conflict-Free NTT Hardware Accelerator Design: Methodology, Proof, and Implementation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 1504–1517. [Google Scholar] [CrossRef]
  39. Mert, A.C.; Öztürk, E.; Savaş, E. Design and Implementation of Encryption/Decryption Architectures for BFV Homomorphic Encryption Scheme. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 353–362. [Google Scholar] [CrossRef]
Figure 1. Overall architecture with L PEs and L/2 TFGs.
Figure 2. Architectures of PE and MM.
Figure 3. The arrangement of original coefficients across 2L memory blocks.
Figure 4. A 16-point NTT employing two PEs. One clock cycle is assumed from reading the data to finishing the computation. Blue signifies the read data sent to the PEs; green denotes the newly generated data to be stored in memory.
Figure 5. Conflict in a 16-point NTT when L = 2 and d = 2 .
Figure 6. Twiddle factor generation methods in a 16-point NTT with 2 PEs. (a) Data-dependent strategy; (b) data-independent strategy.
Figure 7. The dynamic evolution of the step memory in an N-point NTT with L PEs. The steps labeled in blue are the M pre-stored steps; #j denotes the j-th read cycle of each stage.
Figure 8. The timeline of the SG when executing a 16-point NTT with two PEs. The in1 and in2 signals refer to the inputs of the SG. The result signal denotes the output of the SG, and read_addr represents the read address of the step memory.
Figure 9. Architecture for online generation of TFs.
Figure 10. Comparisons of ATP for different values of L between memory-based NTT and online generation-based NTT.
Table 1. Resource breakdown of the architecture with the compilation parameter set (N = 2^16, L = 32, and log2(qmax) = 60) on a Zynq UltraScale+ ZCU102 board.
Module | LUT | FF | DSP | BRAM
--- | --- | --- | --- | ---
Controller | 6210 | 113 | 0 | 0
Coef Mem | 4032 | 128 | 0 | 192
TF base Mem | 0 | 0 | 0 | 16
Step Mem | 166 | 0 | 0 | 0
PE Array | 51,168 | 75,136 | 576 | 0
⌊ PE | 1599 | 2348 | 18 | 0
   ⌊ MM | 690 | 1388 | 18 | 0
TFG Array | 9008 | 22,784 | 288 | 0
⌊ TFG | 563 | 1424 | 18 | 0
SG | 626 | 1346 | 18 | 0
Total (utilization %) | 71,210 (26.0) | 99,507 (18.15) | 882 (35) | 208 (22.8)
Table 2. The total storage capacity required for the NTT/INTT design.
Item | Coefficients ([26] / Ours / Reduction) | TF Bases ([23] / Ours / Reduction)
--- | --- | ---
Capacity | 98,304 / 81,920 / 16.7% | 24,000 / 17,408 / 27.5%
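The reduction percentages in Table 2 follow directly from the listed capacities; a minimal Python check, with all values copied from the table:

```python
# Storage reductions reported in Table 2, recomputed from the raw capacities.
def reduction_pct(baseline: int, ours: int) -> float:
    """Percentage of storage saved relative to the baseline design."""
    return (baseline - ours) / baseline * 100

coeff_saving = reduction_pct(98_304, 81_920)   # coefficient storage vs. [26]
tf_saving = reduction_pct(24_000, 17_408)      # TF-base storage vs. [23]
print(f"coefficients: {coeff_saving:.1f}%, TF bases: {tf_saving:.1f}%")
# prints: coefficients: 16.7%, TF bases: 27.5%
```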
Table 3. Our implementation results and comparison with prior works considering HE parameter sets. The bold text indicates better results.
Work | Platform d | N b | q b | p | Freq (MHz) | Latency (us) | LUT | FF | DSP | BRAM | ATP a,r /1000 | TPS t,r | CTC | RTC
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
[19] g,† | V690 | 15 | 32 | 41 | 250 | 2086.9 | 219.2k | 90.8k | 768 | 193 | 978.5k | 0.08 | – | –
[22] s | VU190 | 17 | 62 | 42 | 200 | 3760 | 365k | 335k | 1.3k | 2.3k | 4420.3k | 0.19 | – | –
[20] g,† | V1140 | 15 | 32 | 41 | 250 | 245.8 | 194.1k | 153.1k | 1.7k | 1.8k | 218.7k | 0.26 | – | –
[23] s | ZU102 | 16 | 60 | 32 | 200 | 2684.2 | 148.5k | 90.9k | 564 | 137 | 660.4k | 0.59 | – | –
[21] s | VU37 | 16 | 60 | 1 | 250 | 66.5 | 74.5k | 61.4k | 288 | 315 | 13.2k | 0.92 | – | –
Ours g,† | V690 1 | 15 | 32 | 41 | 240 | 1405.3 | 35.6k | 51.2k | 392 | 88 | 142.3k (14.54%) | 0.41 (5.17×) | ✓ | ✓
 | VU190 1 | 17 | 62 | 42 | 145 | 10,687.7 | 75.4k | 105.8k | 1.4k | 368 | 3452.4k (78.1%) | 0.25 (1.28×) | |
 | V1140 2 | 15 | 32 | 41 | 164 | 520.5 | 130.1k | 201.6k | 1.5k | 224 | 183.3k (83.83%) | 0.31 (1.20×) | |
 | ZU102 1 | 16 | 60 | 32 | 285 | 1958.4 | 71.2k | 99.5k | 882 | 240 | 453.2k (68.62%) | 0.73 (1.25×) | |
 | VU37 1 | 16 | 60 | 1 | 295 | 59.1 | 72.7k | 99.6k | 882 | 208 | 13.2k (100%) | 0.79 (0.86×) | |
d : V690: Virtex-7 XC7VX690T; VU190: Virtex UltraScale XCVU190; V1140: Virtex-7 XC7VX1140T; ZU102: Zynq UltraScale+ ZCU102; VU37: Virtex UltraScale+ XCVU37P. b : the bit width of the N or q parameter. a : ATP = Equ. LUT × latency, where Equ. LUT = LUT + 100 × DSP + 300 × BRAM [35]. t : TPS = Throughput (Mbps)/Equ. slice, where Equ. slice ≈ (LUT/4 + FF/8) × 0.97 + 102.4 × DSP + 232.4 × BRAM for the 7 series, and Equ. slice ≈ (LUT/8 + FF/16) × 0.97 + 51.2 × DSP + 116.2 × BRAM for the UltraScale series [36,37]. r : the ratio obtained by dividing our ATP or TPS by the ATP or TPS of the other work. c : ✓ indicates support for the corresponding configurability; ✗ indicates no support. g : general modulus. s : specific form of modulus. 1 : 32 PEs. 2 : 128 PEs. † : the unified architecture supporting both NTT and INTT.
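The ATP figures in Table 3 can be reproduced from the footnote formula. A short sketch, using our ZCU102 row (71.2k LUT, 882 DSP, 240 BRAM, 1958.4 us latency) as the worked example; the /1000 scaling matches the table's ATP column:

```python
# ATP per the Table 3 footnote: ATP = Equ. LUT x latency, where
# Equ. LUT = LUT + 100*DSP + 300*BRAM [35]; the table reports ATP/1000.
def equ_lut(lut: int, dsp: int, bram: int) -> int:
    # DSP and BRAM counts weighted into LUT equivalents
    return lut + 100 * dsp + 300 * bram

def atp_per_1000(lut: int, dsp: int, bram: int, latency_us: float) -> float:
    return equ_lut(lut, dsp, bram) * latency_us / 1000

# Our ZCU102 row from Table 3.
atp = atp_per_1000(71_200, 882, 240, 1958.4)
print(f"{atp / 1000:.1f}k")  # about 453.2k, matching the table
```

The same helper reproduces the other rows to within rounding; e.g., [23]'s 148.5k LUT, 564 DSP, 137 BRAM, and 2684.2 us give roughly the listed 660.4k.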
Table 4. Our implementation results and comparison with prior works considering small parameter sets. The bold text indicates better results.
Work | N b | q b | PE | Freq (MHz) | Latency (us) | LUT | FF | DSP | BRAM | ATP: LUT (k) | ATP: FF (k) | ATP: DSP | ATP: BRAM | ATP a,k | CTC | RTC
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
[38] †,i | 12 | 24 | 1 | 185 | 132.88 | 802 | 525 | 4 | 7 | 106.6 | 69.8 | 532 | 930 | 438.8 | – | –
 | | | 4 × 2 | 157 | 19.61 | 5.7k | 3.2k | 33 | 8.5 | 111.1 | 62.5 | 647 | 167 | 225.8 | |
 | | | 8 × 2 | 121 | 12.75 | 14.6k | 6.5k | 80 | 12 | 186.1 | 82.5 | 1020 | 153 | 334 | |
our †,o | 12 | 24 | 1 | 243 | 101 | 1.1k | 1.1k | 6 | 7 | 107.2 | 113.7 | 607 | 708 | 380.4 | ✓ | ✓
 | | | 8 | 219 | 14.11 | 5.8k | 5.3k | 48 | 16 | 82.4 | 75.1 | 677 | 226 | 217.9 | |
 | | | 16 | 216 | 7.19 | 11.3k | 10.2k | 96 | 24 | 81.1 | 73.3 | 691 | 173 | 202 | |
[27] †,o | 10 | 32 | 8 | 244 | 2.67 | 9.5k | 4.7k | 64 | 12 | 25.3 | 12.6 | 171 | 32 | 52.1 | – | –
 | 12 | 32 | 4 | 244 | 25.2 | 5k | 2.8k | 32 | 14 | 126 | 70.6 | 806 | 353 | 312.5 | |
[39] †,i | 12 | 32 | – | 200 | 2.3 | 70k | – | 599 | 129 | 161 | – | 1377.7 | 296.7 | 387.8 | – | –
our †,o | 10 | 32 | 8 | 233 | 2.82 | 7.4k | 7.1k | 64 | 12 | 21 | 20 | 181 | 34 | 49.3 | ✓ | ✓
 | 12 | 32 | 4 | 245 | 25.15 | 4.2k | 3.9k | 32 | 14 | 105.8 | 97.5 | 805 | 352 | 291.9 | |
[26] †,o | 12 | 60 | 1 | 154 | 159.6 | 1.9k | 1.8k | 42 | 17 | 303.2 | 287.3 | 6703 | 2713 | 1787.5 | – | –
 | | | 8 | 150 | 20.54 | 14.1k | 12.5k | 336 | 41 | 289.7 | 256.8 | 6901 | 842 | 1232.4 | |
 | | | 32 | 133 | 5.8 | 52k | 47k | 1300 | 160 | 301.6 | 272.6 | 7540 | 928 | 1334 | |
[24] †,i | 12 | 60 | 1 | 144 | 170.8 | 2.6k | 2.5k | 26 | 14 | 444 | 426.9 | 4440 | 2390.7 | 1605.2 | – | –
 | | | 8 | 141 | 21.89 | 22.1k | 19.5k | 208 | 32 | 483.7 | 426.8 | 4552.4 | 700.4 | 1149 | |
 | | | 32 | 130 | 6.02 | 89.9k | 77.2k | 832 | 96 | 540.8 | 464.4 | 5004.8 | 577.5 | 1214.5 | |
[17] i | 12 | 60 | 64 | 125 | 7.8 | 338k | 138k | 1984 | 768 | 2636.4 | 1076.4 | 15475 | 5990 | 5981 | – | –
our †,o | 12 | 60 | 1 | 198 | 124.2 | 2.2k | 2.7k | 28 | 17 | 269.4 | 323.5 | 3478 | 2111 | 1250.6 | ✓ | ✓
 | | | 8 | 180 | 17.16 | 15k | 13.4k | 224 | 32 | 258.1 | 230.2 | 3843 | 549 | 807.1 | |
 | | | 32 | 137 | 5.74 | 60k | 54.4k | 896 | 96 | 344 | 312.4 | 5141 | 551 | 1023.2 | |
 | | | 64 | 120 | 3.35 | 118.4k | 108.1k | 1792 | 192 | 396.8 | 362.3 | 6003 | 643 | 1190.1 | |
d : based on Xilinx’s Virtex-7 series of FPGAs. b : the bit width of the N or q parameter. a : ATP = Equ. LUT × latency, where Equ. LUT = LUT + 100 × DSP + 300 × BRAM [35]. k : the ATP value divided by 1000. c : ✓ indicates support for the corresponding configurability; ✗ indicates no support. † : the unified architecture supporting both NTT and INTT. i : in place; o : out of place.
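Table 4's per-resource ATP columns are each a resource-time product (resource count × latency), with the LUT and FF products reported in thousands, and the aggregate ATP again built from Equ. LUT. A sketch using the [17] row as the worked example:

```python
# Per-resource ATP columns of Table 4, recomputed for the [17] row:
# 338k LUT, 138k FF, 1984 DSP, 768 BRAM, 7.8 us latency.
latency_us = 7.8
lut, ff, dsp, bram = 338_000, 138_000, 1984, 768

atp_lut_k = lut * latency_us / 1000   # LUT-time product, in thousands
atp_ff_k = ff * latency_us / 1000     # FF-time product, in thousands
atp_dsp = dsp * latency_us            # DSP-time product
atp_bram = bram * latency_us          # BRAM-time product
atp_k = (lut + 100 * dsp + 300 * bram) * latency_us / 1000  # Equ. LUT ATP [35]

print(round(atp_lut_k, 1), round(atp_ff_k, 1),
      round(atp_dsp), round(atp_bram), round(atp_k))
# roughly 2636.4, 1076.4, 15475, 5990, and 5981, matching the table row
```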
Table 5. BRAM consumption of the proposed memory-based NTT and online generation-based NTT with the parameter set (N = 2^16, log2(qmax) = 60, p = 32) on a Xilinx ZCU102.
L | 2 | 4 | 8 | 16 | 32 | 64
--- | --- | --- | --- | --- | --- | ---
Method (i) | 3592 | 3604 | 3616 | 3648 | 3712 | 3840
Method (ii) | 139 | 146 | 156 | 184 | 240 | 352
Ratio | 25.84× | 24.68× | 23.18× | 19.83× | 15.47× | 10.91×
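The ratio row of Table 5 is simply Method (i) divided by Method (ii). A quick check in Python, with the values copied from the table and assuming, per the caption, that method (i) is the memory-based (precomputed-TF) design and method (ii) the online-generation design:

```python
# BRAM ratio between precomputed-TF (method i) and online-generated-TF
# (method ii) designs, for L = 2, 4, 8, 16, 32, 64 (values from Table 5).
method_i = [3592, 3604, 3616, 3648, 3712, 3840]
method_ii = [139, 146, 156, 184, 240, 352]
ratios = [round(a / b, 2) for a, b in zip(method_i, method_ii)]
print(ratios)  # [25.84, 24.68, 23.18, 19.83, 15.47, 10.91]
```

The savings shrink as L grows because the online-generation side must replicate TFGs with the parallelism while the precomputed table stays nearly constant.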
Share and Cite

Huang, J.; Kuo, C.; Liu, S.; Su, T. An Area-Efficient and Configurable Number Theoretic Transform Accelerator for Homomorphic Encryption. Electronics 2024, 13, 3382. https://doi.org/10.3390/electronics13173382