1. Introduction
Homomorphic Encryption (HE), which enables computation on encrypted data, is an optimal privacy-preserving solution for various scenarios, including cloud computing [1] and machine learning [2]. Unlike cryptographic methods [3,4] based on chaos theory, HE relies primarily on algebraic structures such as lattice-based mathematical theories. Over the past few decades, multiple HE schemes based on the Ring Learning With Errors (RLWE) problem [5] have been proposed, among which the Brakerski–Gentry–Vaikuntanathan (BGV) scheme [6], the Brakerski–Fan–Vercauteren (BFV) scheme [7], and the Cheon–Kim–Kim–Song (CKKS) scheme [8] are the most popular. Despite improvements in theoretical cryptography, the intensive computation and high storage requirements remain serious obstacles: HE computations run thousands of times slower than their unencrypted counterparts [9] and can occupy up to 512 MB of on-chip memory [10], limiting widespread adoption.
Polynomial multiplication, a fundamental operation in HE schemes, has become a performance bottleneck for HE applications. Using the NTT reduces the complexity of polynomial multiplication from the traditional O(N^2) to O(N log N), where N is the degree of the polynomial. Consequently, an efficient implementation of the NTT significantly impacts the performance of HE applications. Unlike the low-degree polynomials with small moduli (e.g., 24 bits) needed in some traditional post-quantum cryptography schemes [11,12], HE schemes [13,14,15,16] require higher polynomial degrees and larger moduli (e.g., 1240 bits) to support sufficient multiplicative depth. Introducing a Residue Number System (RNS) into HE enables the decomposition of a large modulus into several smaller moduli of varying bit widths (e.g., 54 bits). Therefore, an NTT architecture suitable for RNS-based HE schemes must support a set of high-degree polynomials with diverse moduli, and such an architecture consumes a significant amount of on-chip resources when accelerated in hardware. Consequently, building a hardware-efficient NTT architecture for practical HE applications is in great demand.
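The complexity reduction above can be made concrete with a short software reference model (ours, not this paper's hardware data flow), using the well-known NTT-friendly prime q = 998244353 with primitive root 3:

```python
# Reference model of NTT-based polynomial multiplication (cyclic convolution
# via zero-padding). Q = 998244353 = 119 * 2^23 + 1 is NTT-friendly; G = 3 is
# a primitive root. This is an O(N log N) software sketch, not hardware flow.
Q, G = 998244353, 3

def ntt(a, invert=False):
    n = len(a)                          # n must be a power of two
    a = list(a)
    j = 0
    for i in range(1, n):               # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                  # log2(n) stages of butterflies
        wlen = pow(G, (Q - 1) // length, Q)
        if invert:
            wlen = pow(wlen, Q - 2, Q)
        for i in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = a[i + k]
                v = a[i + k + length // 2] * w % Q
                a[i + k] = (u + v) % Q
                a[i + k + length // 2] = (u - v) % Q
                w = w * wlen % Q
        length <<= 1
    if invert:                          # scale by n^{-1} for the inverse
        n_inv = pow(n, Q - 2, Q)
        a = [x * n_inv % Q for x in a]
    return a

def polymul(f, g):
    """Exact product of two integer polynomials via NTT / pointwise / INTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    ga = ntt(g + [0] * (n - len(g)))
    prod = ntt([x * y % Q for x, y in zip(fa, ga)], invert=True)
    return prod[:len(f) + len(g) - 1]
```

Padding to twice the length clears the cyclic wrap-around, making the result an exact polynomial product; HE schemes instead use a negacyclic NTT modulo X^N + 1, which this toy model omits.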
Area efficiency and configurability are two key metrics used to evaluate the performance of NTT designs. Area efficiency is inversely related to resource and time costs. Configurability encompasses both Compile-Time Configurability (CTC) [17] and Run-Time Configurability (RTC) [18]. CTC enables the adjustment of parameter sets at compile time to satisfy different schemes over a wide range of parameters; it also means that the computational parallelism of the NTT architecture can be chosen at compilation to trade off area against speed. RTC refers to the ability to support various parameter sets at runtime.
Existing NTT designs [19,20] for HE parameters tend to employ a large number of Processing Elements (PEs) to achieve high throughput, albeit at the cost of notable resource overhead and low area efficiency. Öztürk et al. [19] introduced an NTT architecture with up to 256 PEs to accelerate the López-Alt–Tromer–Vaikuntanathan (LTV) scheme, ensuring low latency. However, their architecture stores all TFs for multiple moduli in on-chip memory, leading to high memory overhead, which becomes infeasible for large polynomial degrees on the target FPGA device. Similarly, Su et al. [20] deployed 168 PEs across 41 RNS channels and stored all TFs for forty-one 32-bit moduli in internal memory, occupying a significant amount of hardware resources; consequently, supporting moduli with a larger bit width on the target FPGA device became impossible. Kurniawan et al. [21] reported a memory-based NTT design with the remarkable advantages of a low-complexity modular multiplier and an optimized memory access pattern. However, it is still unsuitable for RNS-based HE schemes with multiple moduli due to the considerable storage overhead for TFs. To solve this issue, Kim et al. [22] and Duong-Ngoc et al. [23] used a twiddle factor generation strategy to avoid storing the TFs. Furthermore, they reduced the memory bandwidth requirement at high computational parallelism by employing the mixed-radix NTT algorithm, enhancing area efficiency. However, flexibility is limited because both designs target only specific RNS-based HE parameter sets. Moreover, the number of PEs was fixed at design time, and the lack of unified data flows between NTT and INTT constrained the architectures' adaptability.
On the other hand, some recent NTT designs [17,24,25,26,27] with configurability have restrictions on performance and area efficiency. Mert et al. [17] proposed the first NTT hardware generator, which takes the desired parameters and the number of PEs as inputs; a multiplier using the word-level Montgomery algorithm can adapt to moduli of diverse bit widths without recompilation. Hu et al. [24] developed an NTT-based high-efficiency polynomial multiplier that permits the reuse of twiddle factors across different polynomial degrees. However, these two designs, based on the in-place algorithm, suffer from increased access complexity and lower hardware efficiency as the number of PEs increases. To solve this problem, Banerjee et al. [28] proposed the Constant-Geometry (CG) algorithm, which exhibits superior scalability for multiple PEs thanks to its consistent memory access structure in each stage. However, the CG algorithm operates out of place, resulting in twice the memory overhead for coefficients compared to the in-place algorithm. Su et al. [25] and Liu et al. [26] optimized the coefficient access pattern, reducing the coefficient storage requirement, and exploited algorithm-level optimization techniques. However, the memory bandwidth requirement remains high, which hinders implementation on FPGA. Geng et al. [27] introduced a novel high–low iterative memory access approach to reduce the storage bandwidth required per PE, mitigating storage restrictions. It is worth noting, however, that in all the aforementioned configurable works, the precomputed constants are stored in internal memory. Therefore, the memory footprint remains a major obstacle when these NTT designs are embedded into HE schemes.
Consequently, developing a unified NTT/INTT architecture with flexibility while ensuring high area efficiency for HE-based parameters remains challenging. This paper proposes an area-efficient and highly configurable architecture to accelerate CG-based NTT/INTT operations on FPGA. Reducing the storage of coefficients and twiddle factors is the core focus of this study. More specifically, our contributions are summarized as follows:
- (1)
We propose a novel memory access pattern based on the CG algorithm, reducing the total coefficient capacity requirement relative to prior CG designs. Moreover, we reduce the required memory bandwidth per PE to that of the in-place radix-2 algorithm. The reduction in overall capacity and bandwidth significantly mitigates memory demands compared to prior works.
- (2)
We develop a flexible Twiddle Factor Generator (TFG), with a Step Generator (SG) supplying the step operands for the TFG's multipliers. With this approach, only a few TF bases need to be stored, while the remaining TFs are generated on the fly. Implementation results show a 99.17% reduction in TF storage for our proposed NTT architecture with 32 PEs at the maximum supported polynomial degree.
- (3)
We present a hardware-efficient and configurable unified architecture that accelerates both NTT and INTT without additional modifications. The proposed design not only allows area and throughput to be balanced through the compile-time parameters, including the number of PEs, the maximum supported polynomial degree, and the maximum size of the polynomial coefficients, but also supports polynomials of varying degrees and coefficients of different sizes after compilation. In addition, although our architecture is designed for HE parameter sets, it also accommodates small parameter sets without compromising computational complexity.
- (4)
We also implement a memory-based NTT/INTT architecture adopting our proposed coefficient access pattern, aiming at a fair comparison with previous NTT designs that store all TFs in memory. The experimental results demonstrate the improved performance of the proposed memory access pattern. Finally, we conduct a further comparison to investigate the performance difference between precomputed and online-generated TFs for the NTT architecture; the results show the importance of online generation-based NTT architectures in RNS-based HE applications.
The organization of this paper is outlined as follows. Section 2 provides background on polynomial multiplication and the CG NTT algorithm. The hardware architecture is detailed in Section 3. In Section 4, the implementation results are analyzed and compared with prior works. Finally, Section 5 concludes this article.
3. The Proposed Accelerator
3.1. Overall Architecture
The overall architecture of our design is illustrated in Figure 1; it consists of three types of components: computing, storage, and control. The computing components comprise the PE array, the TFG array, and the SG, while the storage components include the coefficient memory, the TF base memory, and the step memory. The control unit receives a two-bit working-mode signal as input to transition the circuit among three states: IDLE, NTT, and INTT. The specific functionality of each submodule is elaborated below.
- (1)
PE array: The PE array is the computational component in this architecture, primarily handling butterfly operations. It consists of L PEs, where the parameter (L) is typically set to a power of two. Each PE executes either NTT or INTT operations on input polynomials, depending on the working state of the architecture.
- (2)
Coefficient memory: The coefficient memory stores all polynomial coefficients during calculation. In this design, the total storage capacity required for coefficients is 16.7% less than the prior CG storage requirement. The total storage space is divided into memory blocks to match the computational bandwidth of the PE array.
- (3)
TF base memory: The TF base memory holds the TF bases, supplying inputs to the TFG array for the dynamic generation of the remaining TFs. The overall storage capacity depends on the computational parallelism and the polynomial degree. The total storage space is divided into blocks (rounding up where necessary) to meet the computational bandwidth of the TFG array.
- (4)
Step memory: The step memory stores the inputs and outputs of the SG. The storage capacity is adjusted to match the pipeline of the SG, preventing pipeline stalls caused by mismatched data processing speeds.
- (5)
SG: The SG is responsible for cyclically generating steps to supply the TFG array, thereby avoiding the need for expensive step storage.
- (6)
TFG array: The TFG array consists of twiddle factor generators (TFGs), which take steps and TF bases as inputs and output TFs to the PE array for computations.
- (7)
Control unit: The control unit provides the correct read and write addresses for the storage components and coordinates the operation of all components.
In this architecture, all memory modules are configured in simple dual-port mode, allowing external initialization after compilation. The Modular Multipliers (MMs) utilize the word-level Montgomery algorithm; with a specified number of iterations and a parameterized word size, they can accommodate moduli within a range of bit widths after compilation. The compile-time parameters are the maximum supported polynomial degree (N) and the maximum supported modulus bit width. The compiled circuit can support any polynomial degree n up to N. Moreover, the architecture unifies NTT and INTT in the design of all components.
3.2. Unified PE
The proposed PE architecture is shown in Figure 2; it consists of a Modular Adder (MA), a Modular Subtractor (MS), a Modular Multiplier (MM), a Modular Divider (MD), multiplexers (MUXes), and registers. The inputs are a, b, and c, and the outputs are the two transformed coefficients. A mode signal determines whether the PE performs NTT or INTT: when it is set to 0, the PE executes the CG NTT butterfly operations [26]; when it is set to 1, the PE executes the CG INTT butterfly operations [26].
In addition, extra registers are inserted to balance the pipeline latency, where m denotes the pipeline depth of the MM.
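The butterfly equations themselves follow [26] and are not reproduced above; the following Python sketch shows one plausible unified butterfly, assuming a Cooley–Tukey NTT mode and a Gentleman–Sande INTT mode with the divide-by-two merged into the MD (the names and exact equations here are illustrative, not necessarily the paper's):

```python
def unified_pe(a, b, c, q, mode):
    """Hedged sketch of a unified NTT/INTT butterfly.
    mode = 0: Cooley-Tukey NTT butterfly   -> (a + b*c, a - b*c) mod q
    mode = 1: Gentleman-Sande INTT butterfly with a merged divide-by-2,
              which is the role we assume for the MD."""
    half = (q + 1) >> 1                  # 2^{-1} mod q, valid for odd q
    if mode == 0:
        t = b * c % q
        return (a + t) % q, (a - t) % q
    u = (a + b) * half % q               # (a + b) / 2 mod q
    v = (a - b) * c % q * half % q       # (a - b) * c / 2 mod q
    return u, v
```

Applying the NTT butterfly and then the INTT butterfly with the inverse twiddle recovers the original pair, which is the sanity check a unified PE must pass.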
The MM applies the word-level Montgomery algorithm proposed in [33], as shown in Algorithm 3. It utilizes the equivalent expression of an NTT-friendly prime to decompose the modular reduction operation into smaller steps. The compiled parameter W denotes the word size, and the iteration count of the for loop is represented by the parameter T. A large-bit-width modular reduction is thus split into T iterations of smaller-bit-width modular operations, so fewer Digital Signal Processing (DSP) resources are needed. After compilation, the supported modulus bit width K is configurable at runtime within a range, whereas the traditional Barrett modular reduction algorithm [34] only supports a fixed modulus bit width after compilation and lacks runtime configurability. Setting W too large restricts the range of q for dimension n; hence, W should be small enough to accommodate a wide range of n, yet a smaller W entails more iterations for a given K. Therefore, W and T should be carefully selected to obtain the optimal trade-off between generality and complexity. In our work, we set T to 4, and W can be compiled to match common HE-based parameter sets [23]. It is important to note that the constraint on q varies with n: as n decreases, the range of admissible values for q shrinks compared to the initially supported range, while it remains unchanged as n increases. Specifically, for small n, q theoretically only needs to satisfy the expression of an NTT-friendly prime, but the compiled parameter W imposes an additional constraint on q. Even so, compared to previous works [22,23], which utilized specialized forms of the modulus to reduce complexity, the proposed MM is more general for HE applications.
Algorithm 3 Word-level Montgomery Modular Multiplication [33]
Require: three K-bit positive integers (the two operands and the modulus q)
Ensure: Z, the Montgomery product of the operands modulo q
1: initialize the partial product
2: for i = 0 to 3 do
3–7: perform one word-level reduction step (line 5 takes the 2's complement of the low word)
8: end for
9: accumulate the final partial result
10–14: if the result is no less than q, subtract q; otherwise, keep it unchanged
15: return Z
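To make the word-level idea concrete, here is a hedged, generic software model of word-level Montgomery multiplication with W = 16 and T = 4 (so R = 2^64); the hardware version in [33] further exploits the NTT-friendly prime structure to simplify each iteration, which this sketch does not:

```python
def wl_montgomery(a, b, q, W=16, T=4):
    """Generic word-level Montgomery multiplication (behavioral model only).
    Returns a * b * R^{-1} mod q with R = 2^(W*T), for odd q < R and a, b < q.
    Each loop iteration cancels the low W bits of the running product."""
    mask = (1 << W) - 1
    q_inv = pow(-q, -1, 1 << W)   # -q^{-1} mod 2^W, precomputed per modulus
    Z = a * b
    for _ in range(T):
        m = (Z * q_inv) & mask    # word-sized quotient digit
        Z = (Z + m * q) >> W      # Z + m*q is divisible by 2^W by construction
    if Z >= q:                    # after T steps Z < 2q, so one subtraction
        Z -= q
    return Z
```

Because each iteration only multiplies by a W-bit digit, a hardware mapping needs narrow multipliers (and hence fewer DSP slices) at the cost of T sequential steps, which is exactly the W/T trade-off discussed above.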
Modular division can be viewed as multiplication by an inverse element. For division by two modulo an odd q, it can be implemented with a shifter, an adder, and multiplexers, without any multiplier. Similar to prior works [25,26], the MD is designed based on Equation (9).
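Assuming the MD implements division by two modulo an odd q (consistent with the shifter-and-adder construction described above; Equation (9) itself is not reproduced here), its behavior can be modeled as:

```python
def mod_div2(x, q):
    """(x * 2^{-1}) mod q for odd q and 0 <= x < q, with no multiplier.
    Even x: a plain right shift. Odd x: x + q is even, so the result is
    (x + q) / 2, computed as (x >> 1) + (q + 1)/2 to keep the adder narrow."""
    return (x >> 1) + ((x & 1) * ((q + 1) >> 1))
```

The conditional add is selected by the least significant bit of x, which maps directly onto the multiplexer in the hardware description.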
3.3. Coefficient Memory Access Pattern
The storage overhead for coefficients on an FPGA mainly depends on two factors: the total storage capacity and the number of memory blocks required per PE. This paper proposes a novel coefficient memory access pattern based on the CG algorithm that effectively reduces the total storage demand relative to prior CG designs (where N is the polynomial degree) while requiring just two memory blocks per PE. In prior CG-based NTT studies [25,26,27], a larger storage capacity was typically allocated for coefficients: Su et al.'s work [25] and Liu et al.'s work [26] required 12 and 4 memory blocks per PE, respectively, and although Geng et al.'s work [27] required just two memory blocks per PE, its total storage requirement remained higher. In contrast, our design minimizes both the total storage capacity and the bandwidth, leading to a notable decrease in Block Random Access Memory (BRAM) resource usage.
The proposed coefficient storage arrangement comprises multiple memory blocks to match the computational needs of the L PEs. If the N original coefficients were simply distributed cyclically across the memory blocks, certain coefficient pairs would occupy the same block, hindering the simultaneous retrieval of both source operands during INTT. Consequently, we reorganize part of the coefficients in reverse order within the memory blocks, while the remaining coefficients are still stored sequentially. This strategy effectively segregates the two operands of each pair into distinct blocks. Moreover, this arrangement necessitates some additional space to ensure sufficient time for data retrieval before updates; to minimize the extra storage requirement, we further refine the layout by grouping the coefficients into sets.
The improved coefficient arrangement is depicted in Figure 3. Within each block, part of the space is designated for storage of the original coefficients, while the remaining space is allocated for newly generated data. To facilitate explanation, we divide the blocks into five regions from top to bottom, labeled Regions 1–5. The original coefficients are arranged as follows: the first group is placed sequentially in Region 1; the next group is placed sequentially in Region 3; the remaining two groups are stored in reverse order at the block level, filling Regions 2 and 4, respectively.
More details on the loading and storing patterns during NTT are provided below. As shown in Algorithm 1, during the j-th iteration, coefficient pairs are loaded from the blocks and transmitted to the PEs for computation, and the updated results are then written back to the blocks. With the PE index l ranging over the L PEs, 2L coefficients are read and 2L results are stored in each computation cycle of the NTT. In the following, we describe the first two parallel executions of the PEs in stage 1 in detail.
At the first step of stage 1, each PE retrieves one coefficient from each of two blocks to execute the butterfly computation; one result is stored in the first available slot of its block, and the other result is stored in the first row of a partner block. When the L PEs operate simultaneously, the coefficients in the first row of Region 1 are loaded together, and the results are stored in the first vacant positions of Region 5 and the first row of Region 1 (whose original coefficients have already been consumed).
At the second step of stage 1, the PEs fetch the coefficients in the second row of Region 1 in the same fashion, and the computed results are stored separately in the first vacant positions of Region 5 and the first row of Region 1 (again reusing positions whose original coefficients have been consumed).
In the subsequent steps of stage 1, the coefficients are loaded and stored according to the same pattern. Once the coefficients in Regions 1 and 3 have been read, the newly generated data fill Regions 5 and 1. The PEs then handle the data from Regions 2 and 4, with the results filling Regions 2 and 3. Upon completing stage 1, Region 4 becomes available for storage in stage 2. These operations are replicated in successive stages; after each stage's execution, the freed space is utilized for storage in the next stage, and the coefficient storage arrangement reverts to its initial state at stage 6.
For a more detailed depiction of the coefficient access pattern, consider a 16-point NTT employing two PEs. The loaded and stored data flows are illustrated in Figure 4, assuming that one clock cycle elapses from data retrieval to the completion of computation. The left-to-right direction represents the progression of clock cycles within a stage: blue denotes the data read and transmitted to the PEs, and green denotes the newly generated data written into memory in that cycle.
The coefficient access logic for INTT is the reverse of that used in NTT. For example, in the first step of stage 1, the L PEs read coefficients from Region 1 and from Region 2, and the newly generated data are written into the first row of Region 5. Subsequent steps follow a similar pattern, which is not repeated here.
It is worth noting that the preceding description is rooted in the compilation parameter N, but the memory access structure equally supports polynomials of degree n much smaller than N. In this scenario, only the initial portion of each block is accessed, leaving the remaining space unused. Moreover, our design allows external access to arbitrary addresses in all memory blocks, enabling users to dynamically update the coefficients of n-dimensional polynomials after compilation. This feature endows the proposed memory access pattern with runtime configurability.
3.4. Read-after-Write Conflict Analysis
Due to the delays of RAM access and PE computation, data at a particular address may be needed by the next stage before they are updated by the computation of the current stage, leading to an access conflict. The critical case is the simultaneous reading and updating of data in two consecutive stages, known as a read-after-write conflict. The following conflict analysis follows Geng et al.'s work [27].
We define d as the total number of delay clock cycles, comprising the latency of RAM access and the pipeline depth of the PE. Figure 5 illustrates the occurrence of a conflict for a small example configuration. The notation “cycle#x” indicates the x-th clock cycle counted from the start of the NTT computation; cycle#0 and cycle#2 are the first read and first write cycles of stage 1, respectively. If the coefficient access operations of consecutive stages are contiguous, then at cycle#5, the coefficients indexed 6 and 7 are to be read by stage 2 while they are still being updated with results from stage 1, resulting in a conflict. We observe that conflicts always arise during the final write-back operations of every stage.
In general, each stage spans N/(2L) cycles for the retrieval of the N coefficients. The coefficients with the highest indices are written back in the final cycles of the current stage and read in the earliest cycles of the next stage; if the write-back delay d is too large relative to the stage length, these coefficients are written and read at the same time. A condition bounding L in terms of the polynomial degree N and the delay d is therefore necessary to prevent such conflicts, which constrains the architectural flexibility. Alternatively, inserting idle periods between stages delays the read operations of the next stage, providing sufficient cycles to update the coefficients of the current stage and effectively avoiding conflicts.
Assuming that g idle cycles are inserted, the next stage waits an additional g cycles before it starts reading after the current stage completes reading. The minimum number of inserted idle cycles follows from the delay d and the per-stage cycle count. The NTT operation thus takes a few additional clock cycles in total to avoid conflicts. In the end, through the strategic insertion of idle cycles between stages, we achieve a conflict-free memory access architecture.
3.5. TF Generation Strategy
The TF online generation strategy can be classified into two categories, data-dependent and data-independent, according to whether the generated TFs are utilized in the next generation. We take Figure 6 as an example to illustrate the difference between the two methods. In Figure 6a,b, the required TFs for each stage are listed according to Algorithm 1: columns 2 and 3 list the factors allocated to the two PEs, and column 4 indicates how many times each row's factors are repeatedly read by the PEs. For example, in stage 1, the first factor is used by both PEs and lasts for four clock cycles.
In the data-dependent case, the TFs labeled in black are obtained through the modular multiplication of the TFs generated in the preceding operation and the corresponding steps labeled in red; the pre-stored constants therefore comprise the TF bases marked in blue and the steps marked in red. Considering that MMs usually take several cycles to perform a calculation, more TF bases must be cached for each stage to avoid stalling the MM, and the total number of stored TF bases is directly proportional to the pipeline depth of the MM. Kim et al. [22] and Duong-Ngoc et al. [23] employed this method to generate TFs on the fly. In the data-independent scenario, the update of TFs relies only on the TF bases (labeled in blue) and the steps, as depicted in Figure 6b; the total number of pre-stored TF bases is determined by the compilation parameters N and L, regardless of the pipeline depth of the MM. However, the varying steps within a stage pose a new storage challenge. In this paper, we focus on the data-independent TF generation strategy and propose a step generation method that avoids storing the steps.
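The distinction can be illustrated with a toy model (q = 17 and the base/step values are ours, purely for illustration):

```python
q = 17  # illustrative toy modulus

def gen_dependent(base, step, count):
    """Data-dependent: each new TF is the product of the PREVIOUS TF and a
    step, so the MM pipeline depth dictates how many TF bases must be cached
    to keep the multiplier busy."""
    tfs, tf = [], base
    for _ in range(count):
        tfs.append(tf)
        tf = tf * step % q          # feedback: the next TF needs this result
    return tfs

def gen_independent(base, steps):
    """Data-independent: every TF is base * step, with the step supplied by
    the step generator; there is no feedback from previously generated TFs."""
    return [base * s % q for s in steps]
```

Both variants produce the same TF sequence when the supplied steps are the successive powers of a single step, but only the data-independent form can be pipelined without caching extra bases, which is why this paper adopts it.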
3.5.1. Step Generation
As shown in Algorithm 1, the L PEs must execute N/(2L) iterations to complete the coefficient transformation of one stage. For a new stage, assume that the j-th parallel execution of the L PEs occurs in cycle#j. Then, according to line 7 in Algorithm 1, the TFs required by the PEs in cycle#j are determined by the cycle index j and the PE index l, and the step required by each PE is the ratio between the TFs of consecutive cycles. Representing the indices in binary form and noting that the relevant strides are powers of two, the index arithmetic reduces to right-shifting l by a fixed number of bits; analogous derivations hold for the remaining cases. Hence, the required step at cycle#j can be shown to be independent of l, indicating that the L TFGs use the same step to generate the TFs for cycle#j. In the final stage, the step updates every cycle, whereas in the other stages, the step only updates periodically and otherwise remains consistent with the previous cycle. In general, based on the derivation presented above, we can draw the following three conclusions:
Because the step is shared across the L TFGs when generating the TFs of the next cycle, only one step generator needs to be designed to supply steps for all L TFGs.
The total number of unique steps is limited, because the steps required for the final stage can be reused by the remaining stages.
For a specified stage s, the required unique steps are solely determined by j, implying that the step generation logic can be uniform across all stages.
The proposed step generator is an MM with two inputs, aiming to provide steps for the L TFGs. To generate the unique step sequence in a pipelined manner, M steps must be pre-stored in the step memory, where M is the number of cycles covering the delay of the MM plus one extra cycle. The steps in the step memory are continuously refreshed as the clock cycle increments, as shown in Figure 7. At cycle#0 of stage 1, the SG fetches the first step from address 0 together with a constant; the product becomes available several cycles later and is written back to the same address. Similarly, the second result of the SG is written to the second address of the step memory at cycle#M. Once the SG has handled the M pre-stored steps sequentially, it circles back to fetch the step at address 0 as its next input at cycle#M. Continuing along the same lines, the ultimate unique step is obtained in the last cycle, which also means that the complete set of steps has been generated during the preceding cycles. It is worth noting that the constant must be adjusted near the end of the sequence so that the same step sequence is replicated in the next stage. For the L TFGs, at the appropriate cycles of each stage, the required step is retrieved from the step memory to contribute to the generation of new TFs.
For better comprehension, Figure 8 provides the timeline of the SG for a small example configuration. The data with a blue background represent the steps pre-stored in the step memory. At cycle#0 of stage 1, the SG reads the step located at address 0 together with the first constant for modular multiplication; the result is written back to the same address at cycle#1. In that cycle, the SG reads the step located at address 1 for a new calculation, whose result is stored at the same address at cycle#2. In the same cycle, the SG fetches the updated step from address 0 together with the second constant to generate a step for the next stage. Similarly, at cycle#3, the SG fetches the updated step from address 1 to produce the next step. The steps marked in red represent the steps that must be transmitted to the TFGs at periodic cycles.
In conclusion, the proposed step generator reduces the total step storage down to M entries and supports the INTT mode through the dynamic modification of the two constants and the step memory.
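Ignoring the MM pipeline delay, the cyclic refresh of the M-deep step memory described above can be modeled as follows (the initial contents and the constant c here are illustrative):

```python
def step_generator(initial_steps, c, total, q):
    """Toy model of the SG: read the step memory round-robin, multiply each
    step by the constant c, and write the product back to the same address,
    so the M-entry memory (M = len(initial_steps)) refreshes itself.
    The real SG writes back M cycles later due to the MM pipeline; here the
    write-back is modeled as immediate for clarity."""
    mem = list(initial_steps)
    out = []
    for t in range(total):
        addr = t % len(mem)
        out.append(mem[addr])            # step delivered toward the TFGs
        mem[addr] = mem[addr] * c % q    # refreshed value for a later cycle
    return out
```

Only M words of storage are ever needed, no matter how long the step sequence runs, which is the point of the SG.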
3.5.2. TF Generation
The TFGs take the step from the step memory and the TF bases from the TF base memory as inputs to generate the remaining TFs for the PEs. Matching one TFG to every PE would maximize the computational capacity but nearly double the computational resource overhead. Therefore, we optimize the design to reduce the number of TFGs at the cost of several additional clock cycles. The optimization is not applied when only one PE and one TFG exist. With more PEs, we observe that, except in the final stage, at least two TFGs generate the same TF in the same cycle. Consequently, halving the total number of TFGs only slows down the final stage, which requires extra cycles to generate its TFs; this proves cost-effective, reducing the total computational resources by approximately 25%.
The proposed architecture for online TF generation, as illustrated in Figure 9, comprises one SG, the TFGs, the TF base memory, and the step memory. The total capacity of the TF base memory is doubled for one modulus because the TF bases differ between NTT and INTT, and the total capacity of the step memory is M. Despite the expense of several additional clock cycles, this design significantly diminishes the storage demands for TFs and steps, achieving enhanced area efficiency.
4. Evaluation
4.1. Experimental Environment and Methodology
We implement the proposed NTT architecture with the Xilinx Vivado 2021.1 tools. To ensure fair comparisons, the design is placed and routed on different FPGA platforms, in accordance with previous works. We present and analyze the performance of the proposed accelerator in the following sections.
In Section 4.2, we present details of our architecture, including resource usage, latency, and evaluation metrics. Our comparison and analysis are then divided into three subsections. In Section 4.3, we conduct a detailed comparison with existing works designed for HE parameters. In Section 4.4, utilizing the coefficient memory access pattern proposed in this paper, we develop a memory-based NTT/INTT architecture that stores all TFs on chip, enabling a fair comparison with works optimized for small parameter sets. Finally, in Section 4.5, we explore the area efficiency of memory-based and online generation-based NTT architectures under RNS-based HE parameter sets.
4.2. Experimental Results
First,
Table 1 shows the hardware resource breakdown of our NTT architecture under the compilation parameter set (
,
, and
= 60). We primarily focus on the BRAM and DSP overhead. In our design, each MM employs 18 DSP slices, and the PE array consumes 576 (
) DSP slices for 32 MMs. Moreover, the TFG array consumes 288 (
) DSP slices for 16 MMs. We including an additional MM to construct the SG, resulting in a total of 882 (
) DSP slices. Coefficient storage consumes
coefficient memory blocks with a depth of
, while TF base storage consumes
memory blocks with a depth of, at most,
for one modulus. The step memory is implemented using Look-Up Table (LUT) resources, thereby avoiding the consumption of any BRAM. All BRAMs are configured in simple dual-port mode, with a total of 208 BRAMs used (i.e., 192 BRAMs for coefficients and 16 BRAMs for TF bases). Furthermore, to illustrate the reduction in storage capacity,
Table 2 compares the coefficient storage overhead with the state-of-the-art CG NTT design [
26] for
. It also compares the total number of pre-stored TF bases for 32 RNS moduli with the design proposed in [
23], which also optimizes the TF generator. The results indicate that our architecture reduces memory overhead for both coefficients and TFs in CG-based NTT designs.
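As a sanity check, the DSP totals above follow directly from the per-MM cost; a trivial Python recomputation (constant names are ours):

```python
DSP_PER_MM = 18   # DSP slices per modular multiplier (MM)
PE_MMS = 32       # MMs in the PE array
TFG_MMS = 16      # MMs in the TFG array
SG_MMS = 1        # one additional MM constructs the SG

pe_dsp = PE_MMS * DSP_PER_MM                        # 576 DSP slices
tfg_dsp = TFG_MMS * DSP_PER_MM                      # 288 DSP slices
total_dsp = pe_dsp + tfg_dsp + SG_MMS * DSP_PER_MM  # 882 DSP slices
print(total_dsp)  # 882
```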
Secondly, for the timing information, the coefficient conversion across
stages requires
Clock Cycles (CCs). Moreover, when
, optimization of the TF generation strategy introduces an extra
cycle. To avoid access conflict, a minimum of
idle cycles are inserted between stages, allowing adequate time for the previous stage’s results to be written back. Finally, considering the delay of the SG denoted by
M and that of the PE denoted by
d, the total CCs and latency of NTT/INTT operations are as depicted in Equations (
20) and (
21) [
23], where
represents the maximal clock frequency achievable on the target FPGA platform.
The data throughput can be measured as the maximum number of data bits processed by the NTT/INTT module per second, as depicted in Equation (
22) [
23].
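Since Equations (20)–(22) are not reproduced here, the sketch below captures only the generic shape of the latency and throughput metrics (all names are ours; the exact expressions from [23] additionally account for idle cycles and pipeline delays):

```python
def latency_seconds(total_ccs, f_max_hz):
    """Latency of one NTT/INTT: total clock cycles over max frequency."""
    return total_ccs / f_max_hz

def throughput_bps(n, modulus_bits, total_ccs, f_max_hz):
    """Data throughput: bits processed per second by one NTT/INTT module.

    One transform over an n-point polynomial with a modulus_bits-wide
    modulus processes n * modulus_bits data bits per invocation.
    """
    return n * modulus_bits / latency_seconds(total_ccs, f_max_hz)
```

Under this definition, higher clock frequency or fewer total cycles directly translate into higher throughput, which is why the inserted idle cycles between stages are kept to the minimum needed to avoid access conflicts.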
Moreover, when the proposed architecture is extended to support
p moduli, the data movement overhead of multiple polynomials with diverse moduli is not considered, similar to prior works [
20,
23]. Therefore, the total number of CCs and the maximal number of data bits supported by the architecture increase by a factor of
p.
Finally, a larger number of PEs results in shorter latency and higher throughput but requires more resources. Therefore, comparing only the resource usage or throughput is one-sided. For a configurable architecture, area efficiency is a more comprehensive and fair performance evaluation standard. Commonly used metrics for quantifying area efficiency include “Area × Time” Products (ATP) [
35] and “Throughput Per Slice” (TPS) [
36], which differ in how they equate area. Lower ATP and higher TPS indicate better area efficiency. In our comparison, we calculate both metrics, considering the diversity of platforms and resources.
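The two metrics differ only in how area enters the expression; in sketch form (parameter names are ours):

```python
def atp(area, time_s):
    """'Area x Time' Product: lower means better area efficiency."""
    return area * time_s

def tps(throughput_bps, area_slices):
    """'Throughput Per Slice': higher means better area efficiency."""
    return throughput_bps / area_slices
```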
4.3. Comparison to Works Considering HE Parameters
In this part, our NTT performance results are compared with those reported in related works [
19,
20,
21,
22,
23], as shown in
Table 3. Our design offers the highest level of configurability while achieving optimal area efficiency, with lower ATP and higher TPS compared to other works, except for [
21].
Öztürk et al. [
19] reported a block-level NTT architecture for partial HE schemes. Large amounts of on-chip memory and multipliers are utilized for TF storage and computation, respectively, leading to inefficient hardware utilization. Compared with the design proposed in [
19], our design achieves higher throughput, with 5.17× higher TPS values, and the ATP decreases by 85.46%. The architecture proposed in [
22] enables a full pipeline by deploying many PEs in series. However, this comes at the cost of high memory capacity and bandwidth expenses due to the presence of five intermediate buffers for coefficient reordering, which is impractical for memory-bound homomorphic evaluations. In contrast, our proposed coefficient access strategy significantly reduces memory overhead. Specifically, when performing a
-point NTT operation with a modulus bit width of 62, their architecture requires at least 10,080 KB of memory for coefficients, while ours requires only 1440 KB. Overall, our design achieves high hardware efficiency through configurable parallelism, and
Table 3 shows that the proposed NTT architecture outperforms that proposed in [
22] in terms of both performance metrics. Su et al. [
20] proposed a multi-channel, multi-PE architecture based on the CG algorithm. Their design requires 1.49× more LUTs than ours while achieving 2× higher throughput. However, thanks to our optimized memory access pattern and TF generation strategy, our design reduces BRAM usage by 7.85×. The ATP and TPS results indicate that our design has higher area efficiency. In comparison with the design proposed in [
23], our architecture decreases ATP by 31.38% and obtains 1.25× higher TPS. Thanks to the carefully pipelined PE architecture, our NTT runs at a higher clock frequency with the same parallelism and thus achieves 1.6× higher throughput. Although the compared design uses less BRAM owing to its optimized coefficient memory bandwidth based on the mixed-radix algorithm, our design requires less TF base memory in terms of both capacity and bandwidth. Furthermore, when supporting 32 RNS moduli, the compared design requires 24,000 pre-stored TF bases, whereas ours requires only 17,408. In addition, because its NTT and INTT datapaths are not unified, TF bases for NTT and INTT are stored in separate memory blocks, while our unified architecture requires no extra TF base memory blocks for INTT. Kurniawan et al. [
21] proposed a memory-based NTT architecture that supports RTC, achieving superior area efficiency. This improvement in area efficiency is primarily attributed to the low DSP consumption resulting from the use of a specific modulus form, which allows for minimal equivalent area. However, in terms of throughput, our design achieves a 1.12× improvement with the same level of computational parallelism while also offering significantly higher configurability and allowing for more RNS moduli.
In addition, among the aforementioned works, only [
22,
23] implemented an online generation strategy for TFs, which is closer to practical HE applications. Compared with these designs, when supporting multiple moduli, our architecture incurs only a slight increase in BRAM usage to store the TF bases of the different moduli. The compared designs, however, require not only additional BRAM but also an increase in LUT resources. This is because their MM designs are optimized for specific modulus sets with low Hamming weights, using shifters for lightweight integer multiplication; consequently, each additional supported modulus adds shifters and multiplexers. In contrast, the bit-width range of the moduli supported by our MM at runtime is (
W,
]. When
and
, our architecture accommodates moduli ranging from 18 to 62 bits, which covers all RNS moduli proposed in prior works, without incurring any additional computation overhead. Furthermore, the designs proposed in [
22,
23] only support a fixed number of PEs, lacking configurability for architecture parallelism. In contrast, our design supports both CTC and RTC. We provide a unified architecture for NTT and INTT, avoiding redundant design efforts.
In all, our NTT/INTT module achieves improvements in terms of area efficiency and configurability. These advancements establish it as a practical NTT accelerator for RNS-based HE applications.
4.4. Comparison with Works Considering Small Parameters
For small parameter sets, NTT designs typically store TFs on chip due to their less stringent storage constraints. To ensure a fair comparison with prior works, we use the TF storage strategy proposed in [
26], combined with our optimized coefficient memory access method, to create a memory-based NTT design. The comparison results between our memory-based NTT and prior works considering small parameters are shown in
Table 4.
The designs proposed in [
17,
24,
26,
27,
38] support the configuration of parameters
N,
, and
L at compile time, and those proposed in [
24,
26] also support diverse polynomial degrees at runtime. In comparison, our design offers greater scalability, with three compile-time parameters, namely
N,
, and
L, further enabling runtime adjustments of
N and different sizes of
q. Furthermore, compared to the design proposed in [
38], our architecture demonstrates poorer ATP in LUT when
but higher LUT efficiency for larger values of
L. Moreover, our overall ATP is superior owing to our design's lower latency. In comparison with the design proposed in [
27], BRAM consumption is the same because, in both designs, each PE requires two memory blocks for coefficient storage and one block for twiddle factor storage. However, it is worth noting that the actual storage consumption in our design only slightly exceeds
, whereas that of the compared design is approximately
. Mert et al. [
39] employed a fully parallelized architecture based on the four-step algorithm to speed up NTT but at the cost of significant resources. In comparison, our design is more area-efficient. Liu et al. [
26] proposed a configurable NTT/INTT accelerator that supports both CTC and RTC while decreasing the actual memory overhead to
. Their design consumes considerably fewer LUT resources because of its simplified memory access pattern. However, our design is more area-efficient thanks to its lower BRAM usage. Our architecture also outperforms the designs proposed in [
24] and [
17] in terms of ATP.
In general, benefiting from the proposed coefficient memory pattern, our design can fully exploit storage resources and obtain higher LUT and BRAM efficiencies than many previously proposed architectures.
4.5. Comparison between Memory-Based NTT and Online Generation-Based NTT
As discussed above, there are two methods of providing TFs to PE arrays, namely (i) memory-based and (ii) online generation-based methods. This part presents a theoretical analysis of how these methods impact area efficiency.
To simplify the analysis, let us assume that all operations occur on chip, removing the impact of data movement. This implies that the TFs for method (i) and the TF bases for method (ii) must be stored in internal memory for all moduli. Moreover, we assume that only one RNS channel with one NTT core is deployed, consisting of L PE units. Consequently, the total constant storage capacity increases linearly with the number of moduli, and the NTT/INTT domain transformations of polynomials across p moduli must be executed serially.
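The linear growth of the constant storage with the number of moduli p can be written out trivially (names are ours; the per-modulus word counts for methods (i) and (ii) are placeholders):

```python
def total_constant_bits(per_modulus_words, word_bits, p):
    """Total on-chip constant (TF or TF-base) storage across p moduli.

    With all constants resident on chip and no sharing across moduli,
    capacity scales linearly in p for both methods; the two methods
    differ only in the per-modulus word count.
    """
    return per_modulus_words * word_bits * p
```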
For method (i), according to [
26],
L memory blocks with a depth of
are required for TFs. Method (ii) requires
memory blocks for TF bases, each with a maximum depth of
.
Table 5 presents the BRAM consumption of the proposed memory-based and online generation-based NTT architectures on a Xilinx ZCU102 FPGA when the parameter set of (
N,
,
p) is set as (
, 60, 32). Notably, the BRAM overhead of method (i) is approximately 10.91× to 25.84× higher than that required for method (ii).
In addition to the variances in BRAM occupancy, distinctions exist in latency and the overhead of LUT and DSP resources in both approaches. Therefore, we provide the ATP values of our proposed NTT architectures under the parameter set of (
,
,
), as shown in
Figure 10. For latency computation, the operating frequency of the architecture is taken to be the maximum operating frequency achievable under a single modulus. Under the same computational parallelism, the online generation-based NTT architecture significantly outperforms the memory-based architecture in terms of area efficiency. Furthermore, the ATP of the memory-based architecture gradually decreases as
L increases, indicating that expanding computation units can mitigate performance limitations resulting from storage requirements. Meanwhile, the ATP of the online generation-based architecture reaches its minimum when
L = 32. This suggests that the primary factor influencing performance enhancement is the availability of computational resources when
. However, when
, performance improvement becomes constrained by the increasing storage bandwidth requirements.
In general, our comparison results highlight the necessity of generating TFs on the fly in RNS-based HE applications. Moreover, in high-dimensional polynomials, NTT designs with high parallelism are advantageous for enhancing area efficiency.