Article

Parallel Accelerating Number Theoretic Transform for Bootstrapping on a Graphics Processing Unit

1 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
2 College of Data Science and Application, Inner Mongolia University of Technology, Huhhot 010080, China
3 School of Software, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(3), 458; https://doi.org/10.3390/math12030458
Submission received: 30 December 2023 / Revised: 26 January 2024 / Accepted: 28 January 2024 / Published: 31 January 2024
(This article belongs to the Section Probability and Statistics)

Abstract

The bootstrapping procedure has become the main performance bottleneck of all known fully homomorphic encryption (FHE) schemes. The state-of-the-art scheme for efficient bootstrapping, fully homomorphic encryption over the torus (TFHE), accelerates polynomial multiplication by leveraging the number theoretic transform (NTT) and implementing the NTT in parallel on a GPU. Unfortunately, few recent NTT implementations take full advantage of a GPU, so they remain slower than necessary. With this in mind, we propose a novel, faster GPU-based number theoretic transform in which matrix multiplication is used to implement a decomposed small-point NTT. When implementing the matrix multiplication, we introduce a merging preprocessing method that merges multiple inputs of the small-point NTT, effectively minimizing the number of modulo operations. Subsequently, when the merged result is multiplied by rotation factors, we use logical left shifts rather than arithmetic multiplications to improve computational efficiency. Our scheme readily realizes a 1024-point NTT, and the experimental results show that the speedup ratio of our method over the butterfly algorithm is about 2.49.

1. Introduction

1.1. Motivation

Fully homomorphic encryption (FHE) provides the remarkable ability to compute on ciphertext data without decryption, which makes it possible to process data without exposing sensitive source data. Provided security holds, the ciphertext can be used for all kinds of data processing (addition, subtraction, multiplication, division, squaring, polynomial evaluation, etc.) as if it were plaintext. FHE thus achieves a secure separation of data ownership and data access rights. This "computable but invisible" property removes the biggest obstacle to the development of big data, cloud computing, intelligent manufacturing, the industrial internet, and similar fields, namely privacy protection, and will directly promote their rapid development.
Since Gentry [1] proposed the first fully homomorphic encryption scheme by using bootstrapping technology in 2009, many FHE schemes have been proposed, such as BGV [2], BFV [3], and CKKS [4]. Despite the significant application value of FHE, its high computational complexity has long been an obstacle to practical deployment. Among all the time-consuming components, the ciphertext bootstrapping procedure proposed by Gentry is the most expensive, as it involves a large amount of computation. Consequently, enhancing the efficiency of bootstrapping has become the central focus of FHE research.
Currently, the gate bootstrapping of the TFHE scheme [5], proposed by Chillotti et al. in 2016, is the fastest bootstrapping algorithm. During gate bootstrapping in TFHE, a large number of polynomial multiplications occur, and these multiplications over polynomial rings are computationally expensive. Bianchi et al. [6] first utilized the fast Fourier transform (FFT) to improve the speed of polynomial multiplications in homomorphic encryption. However, their method suffers from rounding errors because the FFT operates on floating-point numbers. To avoid this error, the number theoretic transform (NTT) [7] was introduced, which uses a primitive root of unity in a finite field instead of the complex unit root to solve polynomial multiplication problems. In fact, the NTT is a specialized form of the FFT over a finite field of integers. The NTT is well suited to accelerating large polynomial multiplication over a ring, because its inherently parallel structure makes it a very attractive candidate for implementation on parallel architectures. Recently, many efficient implementations of the NTT and NTT-based polynomial multiplication have been proposed in the literature, which has made NTT-based polynomial multiplication the core arithmetic operation in the majority of homomorphic encryption implementations. In addition, some recent studies improve the efficiency of bootstrapping by optimizing the blind rotation algorithm [8,9].

1.2. Prior Work

Theoretically, the NTT accelerates polynomial multiplication to a large extent; however, its efficiency is still not high enough for practical applications. Therefore, various optimized implementations of NTT-based polynomial multiplication aimed at accelerating FHE have been studied. At present, three kinds of platforms are used for NTT-based polynomial multiplication: dedicated hardware, CPUs, and GPUs.
Hardware implementations of the NTT mainly target FPGAs (field programmable gate arrays). A representative design is an NTT hardware multiplier [10], which uses a parallel architecture to implement NTT operations and relies chiefly on addition and shift operations to sustain a large amount of parallel processing; the reported efficiency on an FPGA is 60 times that on a CPU. In addition, to provide efficient accelerators for lattice-based cryptosystems, Mert et al. [11] designed and implemented an NTT-based polynomial multiplication architecture on an FPGA, which can execute both NTT and INTT operations and achieves a nearly 7-fold speedup for the BFV scheme.
On a CPU, one can reduce the parameter sizes or speed up the NTT by improving the algorithm itself. For example, the Kyber–NTT algorithm [12] and the PTNTT algorithm [13] weaken the constraint of the modulus q on ring polynomials and then accelerate the NTT process by combining Karatsuba multiplication [14] with NTT multiplication.
In addition, due to its inherently parallel structure, the NTT can be implemented on GPU platforms, which has made GPU-based NTT acceleration a research hotspot. Wang et al. [15] first explored the feasibility of implementing FHE on a GPU. In their scheme, polynomial multiplication is implemented in parallel by the Schönhage–Strassen algorithm [16], and a new pre-computation technique is introduced to eliminate the expensive back-and-forth conversion in the Schönhage–Strassen algorithm, which enables an efficient acceleration of polynomial multiplication on a GPU. However, as the dimensions and the number of operands grow, the size of the pre-computation makes GPU memory a bottleneck, which reduces the speedup. Later, Dai et al. [17] proposed the cuHE library and summarized various optimizations and alternative ways to take advantage of the memory organization and bandwidth of CUDA GPUs, such as algebraic optimization for efficient computation, memory minimization, and memory and thread scheduling optimization. Dai et al. resolved the GPU memory problem of Wang et al.'s scheme; however, in their scheme, shared memory can run into serious bank conflicts when a large amount of data is processed at the same time. Goey et al. [18] introduced an efficient GPU accelerator for the NTT in FHE, which was demonstrated to be faster than the implementation presented by Dai et al. To take full advantage of the memory hierarchy of a GPU, they used faster memory types to reduce communication between registers and global memory, and they used warp shuffle to let threads access each other's registers, which enables faster data access compared with global and shared memory. Based on these techniques, Lee et al. [19] designed a new scheme to improve the performance of the NTT algorithm: the Nussbaumer algorithm [20] was implemented on a GPU with the recursive part eliminated, and the NTT was then accelerated by optimizing the non-coalesced memory access pattern. Later, Kim et al. [21] introduced on-the-fly twiddle factor computation into the NTT and then analyzed and improved the performance of the NTT operation in the FHE scheme. Recently, Özerk et al. [22] provided a high-performance and efficient GPU implementation of the NTT, the INTT, and NTT-based polynomial multiplication. They significantly modified the NTT and INTT algorithms for the threading model used on a GPU and proposed single-core and multi-core methods that perform better on small-degree and large-degree polynomials, respectively.

1.3. Our Contributions

Currently, GPU-based bootstrapping schemes primarily employ the Cooley–Tukey algorithm [23], also known as the butterfly algorithm, to execute the NTT operation; its optimized implementation is the cuFHE library [24]. When using the butterfly algorithm, the NTT operation must temporarily cache intermediate data and reuse it across stages, which prevents fully independent computation across threads. Consequently, realizing the NTT with the butterfly algorithm cannot give full play to the GPU's parallel architecture. In addition, because the calculation is carried out in a finite field, the butterfly algorithm requires many modulo operations, which also consumes considerable time.
Motivated by these concerns, we propose a new NTT acceleration method on a GPU, which uses matrix multiplication to compute small-point NTTs. By employing a merging preprocessing technique, we can execute the same operation on multiple operands simultaneously, thereby reducing the number of modulo operations. Additionally, logical left shifts are used instead of multiplication operations to improve efficiency when calculating small-point NTTs. This new method is shown to be well suited to the polynomial multiplications in the gate bootstrapping of the TFHE scheme. The main contributor to the computational cost in TFHE is bootstrapping; our acceleration scheme significantly reduces its running time, thereby improving the efficiency of TFHE, which is particularly important for handling large-scale data and complex computations.
Roadmap. Section 2 introduces the notation and the proposed accelerated NTT implementation on a GPU. Section 3 presents the comparison results, and Section 4 presents the discussion. Appendix A defines the abbreviations that appear in this paper.

2. Materials and Methods

2.1. Preliminary

2.1.1. Notations

First, we present the definitions of the symbols used in this paper, following the conventions set by TFHE [5]. Let $\mathbb{B} = \{0, 1\}$ denote the set of binary digits, $\mathbb{R}$ the set of real numbers, $\mathbb{N}$ the set of natural numbers, and $\mathbb{Z}$ the set of integers. The real torus $\mathbb{T} = \mathbb{R}/\mathbb{Z} = \mathbb{R} \bmod 1$ is the set of real numbers modulo 1. The ring $\mathbb{Z}_q$ consists of the set of integers $\{0, 1, 2, \ldots, q-1\}$, in which $q$ is a prime number. The polynomial ring $P_q = \mathbb{Z}_q[x]/Q(x)$ represents all polynomials reduced modulo the polynomial $Q(x)$, with all coefficients in $\mathbb{Z}_q$. $Q(x)$ is defined as $x^N + 1$, so the degree of a polynomial in the ring is at most $N - 1$. $\mathbb{Z}_N[x]$ denotes the ring of polynomials $\mathbb{Z}[x]/(x^N + 1)$, $\mathbb{T}_N[x]$ denotes $\mathbb{T}[x]/(x^N + 1)$, and $\mathbb{B}_N[x]$ denotes the polynomials in $\mathbb{Z}_N[x]$ with binary coefficients.

2.1.2. Number Theoretic Transform

The NTT is used for calculating the product of two polynomials $f(x) = a_0 + a_1 x + \cdots + a_{N-1} x^{N-1}$ and $g(x) = b_0 + b_1 x + \cdots + b_{N-1} x^{N-1}$. With the traditional schoolbook method, the multiplication of $f(x)$ and $g(x)$ is carried out using Equation (1); obviously, the time complexity of this method is $O(n^2)$:
$$h(x) = f(x) \cdot g(x) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_i b_j x^{i+j}. \quad (1)$$
In fact, a polynomial of degree $N-1$ can be described by $N$ point evaluations. In this way, $f(x)$ and $g(x)$ can be described as follows:
$$f(x) = \{(x_0, f(x_0)), (x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_{N-1}, f(x_{N-1}))\},$$
$$g(x) = \{(x_0, g(x_0)), (x_1, g(x_1)), (x_2, g(x_2)), \ldots, (x_{N-1}, g(x_{N-1}))\}.$$
Then, the product of $f(x)$ and $g(x)$ can be represented as $h(x) = \{(x_0, f(x_0)g(x_0)), (x_1, f(x_1)g(x_1)), \ldots, (x_{N-1}, f(x_{N-1})g(x_{N-1}))\}$.
We usually use the discrete Fourier transform (DFT) to pass between these two forms, and the NTT is defined as the DFT over the ring $\mathbb{Z}_q$. An N-point NTT transforms a vector $x$ with $N$ elements into another vector $X$ with $N$ elements, as described by Equation (2):
$$X_k = \sum_{i=0}^{N-1} x_i W_N^{ik} \bmod q, \quad k = 0, 1, \ldots, N-1. \quad (2)$$
The constant $W_N \in \mathbb{Z}_q$ is called the twiddle factor, which satisfies $W_N^N \equiv 1 \pmod{q}$, $W_N^{N/2} \equiv -1 \pmod{q}$, and $W_N^{2k+N} \equiv W_N^{2k} \equiv W_{N/2}^{k} \pmod{q}$. Similarly, the INTT (inverse number theoretic transform) uses almost the same formula as the NTT, as shown in Equation (3):
$$x_k = \frac{1}{N} \sum_{i=0}^{N-1} X_i W_N^{-ik} \bmod q, \quad k = 0, 1, \ldots, N-1. \quad (3)$$
Here, $W_N^{-1} \in \mathbb{Z}_q$ is the modular inverse of $W_N$ in $\mathbb{Z}_q$.
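To make Equations (2) and (3) concrete, the following is a minimal host-side C++ sketch of the naive N-point NTT and INTT over $\mathbb{Z}_q$ (an illustration, not the paper's GPU implementation; the helper names `mulmod` and `powmod` are our own, and the inverses are obtained by Fermat's little theorem since q is prime):

```cpp
#include <cstdint>
#include <vector>

// Modular multiplication via a 128-bit intermediate (GCC/Clang extension).
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t q) {
    return (uint64_t)((unsigned __int128)a * b % q);
}

// Modular exponentiation by repeated squaring.
static uint64_t powmod(uint64_t base, uint64_t e, uint64_t q) {
    uint64_t r = 1;
    for (; e; e >>= 1, base = mulmod(base, base, q))
        if (e & 1) r = mulmod(r, base, q);
    return r;
}

// Naive O(N^2) NTT of Equation (2): X[k] = sum_i x[i] * W^(ik) mod q,
// where W is a primitive N-th root of unity modulo the prime q (assumes
// q < 2^63 so the accumulating addition cannot overflow).
std::vector<uint64_t> ntt_naive(const std::vector<uint64_t>& x, uint64_t W, uint64_t q) {
    size_t N = x.size();
    std::vector<uint64_t> X(N, 0);
    for (size_t k = 0; k < N; ++k)
        for (size_t i = 0; i < N; ++i)
            X[k] = (X[k] + mulmod(x[i], powmod(W, (uint64_t)(i * k % N), q), q)) % q;
    return X;
}

// Naive INTT of Equation (3): x[k] = N^{-1} * sum_i X[i] * W^(-ik) mod q.
// W^{-1} and N^{-1} come from Fermat's little theorem.
std::vector<uint64_t> intt_naive(const std::vector<uint64_t>& X, uint64_t W, uint64_t q) {
    uint64_t Winv = powmod(W, q - 2, q);
    uint64_t Ninv = powmod((uint64_t)X.size(), q - 2, q);
    std::vector<uint64_t> x = ntt_naive(X, Winv, q);
    for (uint64_t& v : x) v = mulmod(v, Ninv, q);
    return x;
}
```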
The multiplication of $f(x)$ and $g(x)$ can be simplified by using the NTT and INTT. However, for two polynomials of degree $N - 1$, we cannot use the N-point NTT directly because the product has up to $2N - 1$ terms. In this case, we should use a 2N-point NTT, as described by Equation (4):
$$h(x) = \mathrm{INTT}_{2N}\left(\mathrm{NTT}_{2N}(f(x)) \odot \mathrm{NTT}_{2N}(g(x))\right) \bmod Q(x). \quad (4)$$
In fact, $Q(x)$ is usually defined as $x^N + 1$. In this case, we can use Formulas (5) and (6) to represent the NTT and INTT. With this method, an N-point NTT suffices to compute the product of two polynomials of degree $N - 1$, rather than a 2N-point NTT, and the reduction modulo $Q(x)$ is eliminated:
$$X_k = \sum_{i=0}^{N-1} (x_i \omega^i) W_N^{ik} \bmod q, \quad k = 0, 1, \ldots, N-1, \quad (5)$$
$$x_k = \frac{\omega^{-k}}{N} \sum_{i=0}^{N-1} X_i W_N^{-ik} \bmod q, \quad k = 0, 1, \ldots, N-1. \quad (6)$$
It should be noted that $W_N = \omega^2$ in Formulas (5) and (6), where $\omega$ is a primitive 2N-th root of unity. If we take $x_i \omega^i$ as the input, then Formula (5) coincides with an N-point NTT and Formula (6) coincides with an N-point INTT in value.
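As a worked example of Formulas (5) and (6), the sketch below multiplies two degree-(N−1) polynomials modulo $x^N + 1$ using only N-point transforms, reusing `mulmod`, `powmod`, `ntt_naive`, and `intt_naive` from the previous sketch (again an illustration, under the assumption that `omega` is a primitive 2N-th root of unity mod q):

```cpp
#include <cstdint>
#include <vector>

// mulmod, powmod, ntt_naive, intt_naive as in the sketch after Equation (3).

// Negacyclic product h = f * g mod (x^N + 1, q) via Formulas (5) and (6):
// weight the inputs by powers of omega (omega^2 = W_N), take N-point NTTs,
// multiply pointwise, then unweight by omega^{-k} after the INTT.
std::vector<uint64_t> negacyclic_mul(const std::vector<uint64_t>& f,
                                     const std::vector<uint64_t>& g,
                                     uint64_t omega, uint64_t q) {
    size_t N = f.size();
    uint64_t W = mulmod(omega, omega, q);
    std::vector<uint64_t> fw(N), gw(N);
    for (size_t i = 0; i < N; ++i) {
        uint64_t wi = powmod(omega, i, q);       // omega^i weight of Formula (5)
        fw[i] = mulmod(f[i], wi, q);
        gw[i] = mulmod(g[i], wi, q);
    }
    std::vector<uint64_t> F = ntt_naive(fw, W, q);
    std::vector<uint64_t> G = ntt_naive(gw, W, q);
    for (size_t k = 0; k < N; ++k) F[k] = mulmod(F[k], G[k], q);
    std::vector<uint64_t> h = intt_naive(F, W, q);
    for (size_t k = 0; k < N; ++k)               // omega^{-k} = omega^{2N-k}
        h[k] = mulmod(h[k], powmod(omega, (2 * N - k) % (2 * N), q), q);
    return h;
}
```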
Obviously, this method reduces the time for multiplying the two polynomials by using the NTT. However, the time complexity of the NTT itself is still $O(n^2)$. To further reduce the complexity, the Cooley–Tukey NTT, also named the butterfly algorithm, was proposed, which reduces the time complexity of the NTT to $O(n \log_2 n)$. The pseudo-code of the Cooley–Tukey NTT algorithm [23] is given in Algorithm 1.
Algorithm 1. Cooley–Tukey NTT
(Algorithm 1 appears as a figure in the original article: the input vector is first permuted into bit-reversed order, and then $\log_2 N$ stages of butterfly operations are applied in place; a C++ sketch follows Algorithm 2 below.)
In the above algorithm, the bit-reverse operation rearranges the bits of an integer in binary form in reverse order and returns the corresponding decimal integer. The butterfly operation is given in Algorithm 2.
Algorithm 2. Butterfly operation
  • Input: A, B, W, modulus q
  • Output: A, B
  • t = B · W (mod q)
  • B = A − t (mod q)
  • A = A + t (mod q)
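For reference, here is a compact C++ rendering of Algorithms 1 and 2 (a sketch assuming N is a power of two, reusing `mulmod` and `powmod` from the earlier sketch; the butterfly uses an explicit temporary, since the in-place update order matters):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// mulmod and powmod as in the sketch after Equation (3).

// Butterfly of Algorithm 2: (A, B) <- (A + W*B, A - W*B) (mod q).
static void butterfly(uint64_t& A, uint64_t& B, uint64_t W, uint64_t q) {
    uint64_t t = mulmod(B, W, q);
    uint64_t a = A;
    A = (a + t) % q;            // assumes q < 2^63 so the sum cannot overflow
    B = (a + q - t) % q;
}

// Iterative Cooley-Tukey NTT (Algorithm 1), assuming W is a primitive
// N-th root of unity mod q. The input is first permuted into bit-reversed
// order so that the output comes out in natural order.
void ntt_cooley_tukey(std::vector<uint64_t>& x, uint64_t W, uint64_t q) {
    size_t N = x.size();
    for (size_t i = 1, j = 0; i < N; ++i) {      // bit-reverse permutation
        size_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
    for (size_t len = 2; len <= N; len <<= 1) {  // butterfly span per stage
        uint64_t Wlen = powmod(W, N / len, q);   // primitive len-th root
        for (size_t start = 0; start < N; start += len) {
            uint64_t w = 1;
            for (size_t m = 0; m < len / 2; ++m) {
                butterfly(x[start + m], x[start + m + len / 2], w, q);
                w = mulmod(w, Wlen, q);
            }
        }
    }
}
```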

2.1.3. Ciphertexts in TFHE

Gate bootstrapping was proposed by Chillotti et al. in 2016 [5]. In this section, we introduce the ciphertexts used in TFHE, which include TLWE, TRLWE, TGSW, and TRGSW.
Let $k \ge 1$ be an integer, $N$ a power of 2, and $\alpha \in \mathbb{R}^+$ a standard deviation, where $\mathbb{R}^+$ denotes the set of positive real numbers. Let $\mu \in \mathbb{T}_N[x]$ be the plaintext and $s \in \mathbb{B}_N[x]^k$ the TLWE secret key. The error is $e \leftarrow D_{\mathbb{T}_N[x], \alpha}$, where $D_{\mathbb{T}_N[x], \alpha}$ is the modular Gaussian distribution of standard deviation $\alpha$ over $\mathbb{T}_N[x]$. The TRLWE ciphertext is defined as $c = (a, s \cdot a + e + \mu)$, where $a$ is uniformly selected from $\mathbb{T}_N[x]^k$ and $\mu$ is the plaintext of $c$. When $N = 1$, the elements of $c$ are numbers rather than polynomials; in the rest of the paper we call this TLWE, the scalar form of TRLWE.
We first introduce the gadget before defining TRGSW. A gadget $H$ is a block diagonal matrix with $(K+1)L$ rows. Let $K$, $L$, and $B_g$ be three positive integers, and let $\mathbb{M} \subset \mathbb{T}_N[x]^{K+1}$ denote the set of TRLWE ciphertexts of the plaintext 0. The gadget $H$ is described by Equation (7):
$$H = \begin{pmatrix} 1/B_g & & \\ \vdots & & \\ 1/B_g^L & & \\ & \ddots & \\ & & 1/B_g \\ & & \vdots \\ & & 1/B_g^L \end{pmatrix} \in \mathbb{T}_N[X]^{(K+1)L \times (K+1)}. \quad (7)$$
$C$ is a TRGSW sample if $C = Z + \mu \cdot H$ holds, where each row of $Z \in \mathbb{M}^{(K+1)L}$ is a TRLWE ciphertext encrypting the plaintext 0. When $N = 1$, the scalar form of TRGSW is called TGSW.
Now we introduce the product of the two kinds of ciphertexts. First, we need the gadget decomposition of TRLWE samples. For a TRLWE sample $c \in \mathbb{T}_N[x]^{K+1}$ and a gadget $H$, the gadget decomposition of $c$ with respect to $H$ is $v = \mathrm{dec}_H(c) \in \mathbb{Z}_N[X]^{(K+1)L}$, a vector satisfying $c = v \cdot H$. Then, the external product of a TRGSW ciphertext $A = \mathrm{TRGSW}(\mu_A)$ and a TRLWE ciphertext $b = \mathrm{TRLWE}(\mu_b)$ is defined through Equation (8). The result of $A \boxdot b$ is a TRLWE ciphertext, which corresponds to the plaintext $\mu_A \cdot \mu_b$:
$$A \boxdot b = \mathrm{dec}_H(b) \cdot A. \quad (8)$$
Next, we introduce an important building block of gate bootstrapping in TFHE, the CMux gate. A CMux gate has one control input and two data inputs. The control input is a TRGSW sample $C = \mathrm{TRGSW}(\mu_C)$, and the data inputs are two TRLWE samples $d_0 = \mathrm{TRLWE}(\mu_0)$ and $d_1 = \mathrm{TRLWE}(\mu_1)$. The output is the TRLWE ciphertext $\mathrm{CMux}(C, d_1, d_0) = C \boxdot (d_1 - d_0) + d_0$. When $\mu_C = 0$, the plaintext corresponding to the output ciphertext is $\mu = \mu_0$, and when $\mu_C = 1$, it is $\mu = \mu_1$, as shown in Figure 1.

2.1.4. Graphics Processing Unit (GPU)

A GPU (graphics processing unit) is a microprocessor originally designed for graphics rendering and now widely used for highly parallel data processing [25]. In this section, we introduce the GPU from three aspects: the hardware level, the thread organization level, and the memory structure.
On the hardware level, a GPU consists of multiple streaming multiprocessors and each streaming multiprocessor contains multiple streaming processors.
  • Streaming Multiprocessor (SM): It is the core component of the GPU hardware and serves as the basic control instruction unit with an independent instruction scheduling circuit. Multiple streaming processors exist within a single streaming multiprocessor, and all streaming processors share the same set of control instructions;
  • Streaming Processor (SP): It is the fundamental processing unit where specific instructions and tasks are processed. A GPU performs parallel computation, which means numerous streaming processors are processing simultaneously.
On the thread organization level, we will introduce it from the perspective of CUDA [26]. CUDA is a programming interface provided by NVIDIA for writing programs that run on its GPUs, utilizing multiple threads for parallel execution.
  • Kernel: In CUDA, the computations to be run on a GPU are written in a form similar to C/C++ functions, which are called kernels;
  • Grid, Block, Thread: This is a hierarchical structure for thread allocation. When programming with CUDA, a grid is divided into multiple blocks, while a block is divided into multiple threads. The division is based on the task characteristics and the hardware features of the GPU (a minimal kernel sketch illustrating this hierarchy follows the memory description below).
On the memory storage level, the main types of memory include registers, constant memory, shared memory, and global memory.
  • Registers: They are the smallest storage units in a GPU, but they offer the fastest execution speed. Each thread is allocated private registers during execution, and other threads cannot read or write to these registers;
  • Constant Memory: It is used to cache constant data utilized by the code executed on the streaming multiprocessor. Objects must be explicitly declared as constants in the code so that the GPU can cache and store them in the constant cache;
  • Shared Memory: Each streaming multiprocessor also has a block of shared memory. It is a small, fast and low-latency on-chip programmable static random access memory (SRAM) used for sharing data among the thread blocks running on the SM;
  • Global Memory: A GPU also has an off-chip global memory, which is a large-capacity and high-bandwidth dynamic random access memory (DRAM).
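To illustrate the hierarchy above, here is a toy CUDA C++ kernel (an illustration only; the kernel, its name, and its launch configuration are our own and are not taken from the paper):

```cpp
#include <cuda_runtime.h>

// Each thread scales one element modulo a small toy prime q.
__global__ void scale_mod(const unsigned long long* in, unsigned long long* out,
                          unsigned long long factor, unsigned long long q, int n) {
    // The global index combines the block index within the grid and the
    // thread index within the block; i lives in a per-thread register,
    // while in/out reside in off-chip global memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (in[i] * factor) % q;
}

int main() {
    const int n = 1024;
    unsigned long long *d_in, *d_out;                  // global-memory buffers
    cudaMalloc(&d_in, n * sizeof(unsigned long long));
    cudaMalloc(&d_out, n * sizeof(unsigned long long));
    cudaMemset(d_in, 0, n * sizeof(unsigned long long));
    // Launch configuration: a grid of 4 blocks, each holding 256 threads.
    scale_mod<<<4, 256>>>(d_in, d_out, 3ULL, 97ULL, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```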

2.2. Parallel Accelerating NTT for Bootstrapping on a GPU

In this section, we introduce the proposed parallel accelerated NTT on a GPU. For the sake of completeness, we describe the accelerated NTT within a whole gate bootstrapping process. The entire gate bootstrapping process consists of three parts: blind rotation, ciphertext extraction, and key switching. The most time-consuming part is blind rotation, which is the focus of our optimization. The essence of blind rotation is the multiplication of two polynomials with large coefficients and degrees, and the NTT can be used to accelerate this process. Our scheme is primarily designed to further speed up the NTT itself. To fully leverage the parallelism of a GPU, we implement the multiplication of polynomial coefficients by rotation factors as a matrix multiplication. Additionally, because the polynomial degree is large, we decompose this large matrix into multiple smaller ones, and for each small matrix we use logical left shifts instead of multiplications to improve the running speed. The bootstrapping process with the parallel accelerated NTT is shown in Figure 2 below.
Next, we will describe the gate bootstrapping process based on the proposed parallel accelerating NTT in detail. Given a TLWE sample c = TLWE s μ , the gate bootstrapping process constructs an encryption of μ under the same key s with a fixed amount of noise.

2.2.1. Initialization

Randomly choose two fixed messages $\mu_0, \mu_1 \in \mathbb{T}$ such that $\mu_0 \ne \mu_1$, and set the message space to $\mathcal{M} = \{\mu_0, \mu_1\}$. The input of gate bootstrapping is a TLWE ciphertext $c = \mathrm{TLWE}_s(\mu) = (a, b)$, where $a = (a_1, a_2, \ldots, a_p) \in \mathbb{T}^p$, $b = \sum_{i=1}^{p} s_i a_i + e + \mu$, and $\mu \in \mathbb{T}$. The secret key is a vector of $p$ bits, i.e., $s = (s_1, s_2, \ldots, s_p) \in \mathbb{B}^p$. The bootstrapping key is $\mathrm{BK} = (\mathrm{BK}_1, \mathrm{BK}_2, \ldots, \mathrm{BK}_p)$, where $\mathrm{BK}_i = \mathrm{TRGSW}_{s'}(s_i)$ and $s'$ is the TRLWE key. The key-switching key is $\mathrm{KSK}_{s' \to s} = \{\mathrm{KSK}_{i,j} = \mathrm{TLWE}_s(s'_i \cdot 2^{-j})\}$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, t$, where $t$ represents the precision of the key-switching algorithm. For all $i$ ($i = 1, 2, \ldots, p$), let $\bar{a_i}$ be the nearest integer to $2N a_i$, and let $\bar{b}$ be the nearest integer to $2N b$. Let $\mu' = \frac{1}{2}\mu_1 \in \mathbb{T}$, and compute $v = (1 + x + \cdots + x^{N-1}) \cdot x^{N/2} \cdot \mu'$. Finally, construct the vector $\mathrm{ACC}_0 = (\mathbf{0}, v)$, where $\mathbf{0}$ is a $p$-dimensional zero vector.

2.2.2. Gadget Decomposition

In the first CMux gate, the inputs are $\mathrm{ACC}_0$ and $\mathrm{BK}_1$, and the output is $\mathrm{ACC}_1$. Set $\mathrm{ACC}_{tmp} = (x^{\bar{a_1}} - 1) \cdot \mathrm{ACC}_0$, then decompose $\mathrm{ACC}_{tmp}$ with the basis $H \in \mathbb{T}_N[x]^{(K+1)L \times (K+1)}$. For each $A_j = c_{0,j} + c_{1,j} x + \cdots + c_{N-1,j} x^{N-1}$ in $\mathrm{ACC}_{tmp} = (A_1, A_2, \ldots, A_K, A_{K+1}) \in \mathbb{T}_N[x]^{K+1}$, decompose the coefficients as $c_{i,j} = \sum_{q=1}^{L} c_{i,j,q} \cdot \frac{1}{B_g^q}$. Set $A_{j,q} = \sum_{i=0}^{N-1} c_{i,j,q} x^i$; then the result of the gadget decomposition is $\mathrm{dec}_H(\mathrm{ACC}_{tmp}) = (A_{1,1}, \ldots, A_{1,L}, \ldots, A_{K+1,1}, \ldots, A_{K+1,L})$. At this point, we calculate the product of $\mathrm{dec}_H(\mathrm{ACC}_{tmp})$ and $\mathrm{BK}_1$, both of which are matrices constructed from polynomials. The polynomial multiplications are calculated by the NTT, and we then obtain $\mathrm{ACC}_1$, the result of the multiplication.
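The digit split behind $\mathrm{dec}_H$ can be sketched as follows (a simplified, unsigned illustration with a hypothetical helper name; TFHE's actual decomposition is signed and balanced and operates on torus values, which this sketch omits):

```cpp
#include <cstdint>
#include <vector>

// Split a coefficient c, viewed here as an integer already scaled by Bg^L,
// into L digits so that c = sum_{p=1..L} d[p-1] * Bg^(L-p).
std::vector<uint32_t> gadget_decompose(uint64_t c, uint32_t Bg, int L) {
    std::vector<uint32_t> d(L);
    for (int p = L - 1; p >= 0; --p) {   // least-significant digit last
        d[p] = (uint32_t)(c % Bg);
        c /= Bg;
    }
    return d;
}
```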

2.2.3. Decomposition of an N-Point NTT

When calculating an N-point NTT, we first choose the prime $q = 2^{64} - 2^{32} + 1$ [27,28], which satisfies $2^{192} \bmod q = 1$ and $2^{96} \bmod q = -1$. Additionally, the twiddle factors of the small-point NTTs are powers of 2, for example, $W_4 = 2^{48}$, $W_8 = 2^{24}$, $W_{16} = 2^{12}$, and $W_{32} = 2^6$, so we can use logical left shifts in place of the multiplications by twiddle factors to simplify the calculations. A small sanity-check sketch of these properties follows.
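The following check (an illustration using the GCC/Clang `unsigned __int128` extension to avoid overflow) verifies these properties of q:

```cpp
#include <cassert>

int main() {
    typedef unsigned __int128 u128;
    const u128 q = ((u128)1 << 64) - ((u128)1 << 32) + 1;  // q = 2^64 - 2^32 + 1
    u128 p = 1;
    for (int i = 0; i < 96; ++i) p = (p << 1) % q;  // p = 2^96 mod q
    assert(p == q - 1);                             // 2^96 == -1 (mod q)
    for (int i = 0; i < 96; ++i) p = (p << 1) % q;  // now p = 2^192 mod q
    assert(p == 1);                                 // 2^192 == 1 (mod q)
    // Hence 2^48, 2^24, 2^12, 2^6 have multiplicative order 4, 8, 16, 32,
    // so W_4 = 2^48, W_8 = 2^24, W_16 = 2^12, W_32 = 2^6, and multiplying
    // by any power of W_n reduces to a logical left shift modulo q.
    return 0;
}
```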
The calculation of an N-point NTT can usually be decomposed into many small-point NTTs. Let $N = nt$, $k = nk_1 + k_2$, and $i = ti_2 + i_1$, where $k_1, i_1 = 0, 1, \ldots, t-1$ and $k_2, i_2 = 0, 1, \ldots, n-1$ respectively index the groups and offsets of the $t$ n-point NTTs and the $n$ t-point NTTs. The twiddle factor can then be factored as $W_N^{ik} = W_N^{(ti_2 + i_1)(nk_1 + k_2)} = W_t^{i_1 k_1} W_n^{i_2 k_2} W_N^{i_1 k_2}$. Obviously, $W_t$ is the twiddle factor of the t-point NTT and $W_n$ is the twiddle factor of the n-point NTT. The N-point NTT can therefore be written as Equation (9):
$$X_{nk_1 + k_2} = \sum_{i_1=0}^{t-1} W_t^{i_1 k_1} \left[ \left( \sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1} \right) W_N^{i_1 k_2} \right] \bmod q. \quad (9)$$
From Equation (9), the calculation of an N-point NTT decomposes into $t$ n-point NTTs followed by $n$ t-point NTTs, so it can be divided into two steps. First, compute the $t$ n-point NTTs, i.e., $\sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1}$ with $x_{ti_2 + i_1}$, $i_2 = 0, 1, \ldots, n-1$, as inputs; then multiply the result by the rotation factor $W_N^{i_1 k_2}$ of the N-point NTT to obtain $x'_{tk_2 + i_1} = \left( \sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1} \right) W_N^{i_1 k_2}$, which serves as the input of the t-point NTTs. Second, compute the $n$ t-point NTTs $\sum_{i_1=0}^{t-1} W_t^{i_1 k_1} x'_{tk_2 + i_1}$ with $x'_{tk_2 + i_1}$, $i_1 = 0, 1, \ldots, t-1$, as inputs. The results of these $n$ t-point NTTs form the final result $X_{nk_1 + k_2}$ of the N-point NTT. A sketch of this two-stage decomposition follows.
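The two-stage structure of Equation (9) can be sketched in sequential C++ as follows (`ntt_decomposed` is a hypothetical name, reusing `mulmod`, `powmod`, and `ntt_naive` from Section 2.1.2's sketch; the GPU version additionally parallelizes the stages across thread groups):

```cpp
#include <cstdint>
#include <vector>

// mulmod, powmod, ntt_naive as in the sketch after Equation (3).

// Two-stage N-point NTT of Equation (9), with N = n * t.
std::vector<uint64_t> ntt_decomposed(const std::vector<uint64_t>& x,
                                     size_t n, size_t t,
                                     uint64_t WN, uint64_t q) {
    size_t N = n * t;
    uint64_t Wn = powmod(WN, t, q);   // twiddle of the n-point stage
    uint64_t Wt = powmod(WN, n, q);   // twiddle of the t-point stage
    // Stage 1: t n-point NTTs over the strided inputs x[t*i2 + i1].
    std::vector<uint64_t> mid(N);
    for (size_t i1 = 0; i1 < t; ++i1) {
        std::vector<uint64_t> grp(n);
        for (size_t i2 = 0; i2 < n; ++i2) grp[i2] = x[t * i2 + i1];
        std::vector<uint64_t> G = ntt_naive(grp, Wn, q);
        // Multiply by the rotation factor W_N^(i1*k2), stored as x'[t*k2 + i1].
        for (size_t k2 = 0; k2 < n; ++k2)
            mid[t * k2 + i1] = mulmod(G[k2], powmod(WN, i1 * k2, q), q);
    }
    // Stage 2: n t-point NTTs produce X[n*k1 + k2].
    std::vector<uint64_t> X(N);
    for (size_t k2 = 0; k2 < n; ++k2) {
        std::vector<uint64_t> grp(t);
        for (size_t i1 = 0; i1 < t; ++i1) grp[i1] = mid[t * k2 + i1];
        std::vector<uint64_t> G = ntt_naive(grp, Wt, q);
        for (size_t k1 = 0; k1 < t; ++k1) X[n * k1 + k2] = G[k1];
    }
    return X;
}
```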

2.2.4. Thread Assignment

For an N-point NTT, we need $N$ threads in one block. When calculating the N-point NTT on a GPU, we divide the $N$ threads into $t$ groups with $n$ threads in each group, where $i_1$ is the identification number of the group. The input of each group is $x_{ti_2 + i_1}$ ($i_2 = 0, 1, \ldots, n-1$), and the output is $X_k = \sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1}$. The n-point NTT within a group is thus calculated using Equation (10):
$$X_k = \sum_{i=0}^{n-1} x_i W_n^{ik} \bmod q, \quad k = 0, 1, \ldots, n-1, \quad (10)$$
which we transform into matrix form as shown in Equation (11):
$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ \vdots \\ X_{n-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & W_n^1 & W_n^2 & \cdots & W_n^{n-1} \\ 1 & W_n^2 & W_n^4 & \cdots & W_n^{2(n-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & W_n^{n-1} & W_n^{2(n-1)} & \cdots & W_n^{(n-1)(n-1)} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{pmatrix}. \quad (11)$$

2.2.5. n-Point NTT

1. Merging
After the thread assignment in Section 2.2.4, there are $n$ inputs in each thread group. We further categorize the $n$ inputs into two groups based on the parity of $k$. In this step, we take one operand from each group and perform a merging preprocessing on the two operands, exploiting the property of the prime $q$ that $2^{96} \bmod q = -1$. When $k$ is odd, $x_m$ ($m = 0, 1, \ldots, n/2 - 1$) and $x_{m+n/2}$ ($m = 0, 1, \ldots, n/2 - 1$) can be merged before they are logically left-shifted: $x_m$ should be shifted left by $l = (jmk) \bmod 192$ bits, where $j = \log_2 W_n = 192/n$, and $x_{m+n/2}$ should be shifted left by $l + 96$ bits. Because $2^{96} \bmod q = -1$, we have $x_{m+n/2} \cdot 2^{96} \bmod q = -x_{m+n/2}$. Therefore, we first perform a modulo subtraction on $x_m$ and $x_{m+n/2}$, and then shift the result of the modulo subtraction left by $l$ bits. For the same reason, when $k$ is even, we first perform a modulo subtraction on $x_m$ ($m = 0, 1, \ldots, n/4 - 1, n/2, \ldots, 3n/4 - 1$) and $x_{m+n/4}$, and then shift the result left by $l$ bits.
That is to say, when $k$ is an odd number, we calculate $x'_m$ using Equation (12):
$$x'_m = \begin{cases} \mathrm{modsub}(x_m, x_{m+n/2}), & mk \bmod n < n/2, \\ \mathrm{modsub}(x_{m+n/2}, x_m), & (m + n/2)k \bmod n < n/2. \end{cases} \quad (12)$$
When $k$ is an even number, we calculate $x'_m$ using Equation (13):
$$x'_m = \begin{cases} \mathrm{modsub}(x_m, x_{m+n/4}), & mk \bmod n < n/2, \\ \mathrm{modsub}(x_{m+n/4}, x_m), & (m + n/4)k \bmod n < n/2. \end{cases} \quad (13)$$
2. Expansion and logical left shift
First, expand the operand $x'_m$ to 192 bits, described as three 64-bit unsigned numbers $a_m$, $b_m$, and $c_m$: set $a_m = 0$, $b_m = 0$, and $c_m = x'_m$. Then, we use a logical left shift to replace the multiplication of the operand by the twiddle factor. When $mk \bmod n < n/2$, the 192-bit operand is logically left-shifted by $l = (jmk) \bmod 192$ bits; when $(m + n/2)k \bmod n < n/2$, it is logically left-shifted by $l = (j(m + n/2)k) \bmod 192$ bits, where $0 \le l \le 96$. When $0 < l < 64$, the logical left shift is described by Equation (14):
$$a_m = 0, \quad b_m = c_m \gg (64 - l), \quad c_m = c_m \ll l. \quad (14)$$
When $64 \le l < 128$, the logical left shift is described by Equation (15):
$$a_m = c_m \gg (128 - l), \quad b_m = c_m \ll (l - 64), \quad c_m = 0. \quad (15)$$
We thus obtain a 192-bit operand composed of the three 64-bit unsigned numbers $a_m$, $b_m$, and $c_m$. Because $0 \le l \le 96$, the logical left shift never exceeds the 192-bit limit, so no cyclic shift is needed. Each thread obtains $n/2$ 192-bit operands, which we then sum up. The addition process is described by Equation (16):
$$w = \mathrm{trunc}\left(\sum_{m=0}^{h} a_m\right), \quad y = \mathrm{trunc}\left(\sum_{m=0}^{h} b_m\right), \quad z = \mathrm{trunc}\left(\sum_{m=0}^{h} c_m\right). \quad (16)$$
The function $\mathrm{trunc}(\cdot)$ returns the lower 64 bits of its 192-bit input. The value of $h$ in Equation (16) depends on the parity of $k$: when $k$ is odd, $h = n/2 - 1$ and $m = 0, 1, \ldots, n/2 - 1$; when $k$ is even, $h = 3n/4 - 1$ and $m = 0, 1, \ldots, n/4 - 1, n/2, \ldots, 3n/4 - 1$. The maximum value of $l$ is 96, so each $a_m$ does not exceed 32 bits and the sum of the $a_m$ does not exceed 64 bits; therefore, no carry is generated when computing $w$. When calculating $y$, we use a variable $u$ to record the carry, and when calculating $z$, we use a variable $v$ to record the carry. Then, the 192-bit version $s$ of the result $X_k$ of the $k$-th thread can be described by Equation (17):
$$s = (w + u) \cdot 2^{128} + (y + v) \cdot 2^{64} + z. \quad (17)$$
3. Turn back to the 64-bit version
Let $y' = \mathrm{trunc}(y + v)$ and $w' = \mathrm{trunc}(w + u)$. We obtain $s$ as described in Equation (18):
$$s = \left(w' \cdot 2^{128} + y' \cdot 2^{64} + z\right) \bmod q. \quad (18)$$
Because of the properties of the prime modulus ($2^{128} \equiv -2^{32}$ and $2^{64} \equiv 2^{32} - 1 \pmod{q}$), we can simplify $s$ as described in Equation (19):
$$s = 2^{32}(y' - w') + (z - y') \bmod q. \quad (19)$$
Let $\mathrm{modadd}(a, b) = (a + b) \bmod q$ express the modulo addition of two 64-bit unsigned numbers $a$ and $b$; then the result $X_k$ of the $k$-th thread can be described by Equation (20):
$$X_k = \mathrm{modadd}\left(\mathrm{left32}(\mathrm{modsub}(y', w')), \mathrm{modsub}(z, y')\right). \quad (20)$$
Here, $\mathrm{left32}(a)$ shifts the 64-bit unsigned number $a$ left by 32 bits modulo $q$ and returns the result. This can be computed with shifts only, as described by Equation (21):
$$\mathrm{left32}(a) = \mathrm{mod}_q\left(((a \gg 32) \ll 32) - (a \gg 32) + (a \ll 32)\right). \quad (21)$$
The result of the n-point NTT is composed of the results of all $n$ threads in one group. A sketch of the 192-bit-to-64-bit reduction follows.
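The reduction of the 192-bit accumulator $(w', y', z)$ back to a 64-bit residue, Equations (19)-(21), can be sketched in host-side C++ as follows (an illustration assuming the three inputs are already reduced below q; the helper names mirror the operators used above):

```cpp
#include <cstdint>

static const uint64_t Q = 0xFFFFFFFF00000001ULL;   // q = 2^64 - 2^32 + 1

static uint64_t modadd(uint64_t a, uint64_t b) {
    uint64_t s = a + b;
    if (s < a) s += 0xFFFFFFFFULL;   // wrapped past 2^64: 2^64 mod q = 2^32 - 1
    if (s >= Q) s -= Q;
    return s;
}

static uint64_t modsub(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + (Q - b);   // inputs assumed < q
}

// left32(a) = a * 2^32 mod q, with shifts only (Equation (21)): writing
// a = hi*2^32 + lo gives a*2^32 = hi*2^64 + lo*2^32 = hi*(2^32 - 1) + lo*2^32 (mod q).
static uint64_t left32(uint64_t a) {
    uint64_t hi = a >> 32, lo = a & 0xFFFFFFFFULL;
    return modadd(modsub(hi << 32, hi), lo << 32);
}

// Equations (19) and (20): s = w*2^128 + y*2^64 + z with 2^128 = -2^32 and
// 2^64 = 2^32 - 1 (mod q), hence s = 2^32*(y - w) + (z - y) mod q.
static uint64_t reduce192(uint64_t w, uint64_t y, uint64_t z) {
    return modadd(left32(modsub(y, w)), modsub(z, y));
}
```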

2.2.6. N-Point NTT

We obtain the result of each n-point NTT as described in Section 2.2.5. Then, each thread's result $\sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1}$ is multiplied by $W_N^{i_1 k_2}$ and stored as $x'_{tk_2 + i_1} = \left( \sum_{i_2=0}^{n-1} W_n^{i_2 k_2} x_{ti_2 + i_1} \right) W_N^{i_1 k_2}$. Next, we calculate the result of the N-point NTT using Equation (22), which is in fact the calculation of a t-point NTT:
$$X_{nk_1 + k_2} = \sum_{i_1=0}^{t-1} W_t^{i_1 k_1} x'_{tk_2 + i_1} \bmod q. \quad (22)$$
If $t$ is small enough, such as 4, 8, 16, or 32, we obtain the result of the N-point NTT through the calculation of $n$ t-point NTTs; the calculation of a t-point NTT is the same as that of the n-point NTT.
If $t$ is larger than 32, the calculation process of the t-point NTT is similar to that of the N-point NTT: the t-point NTT is decomposed into $t/n$ n-point NTTs and $n$ $(t/n)$-point NTTs. If $t/n$ is still large, the $(t/n)$-point NTT can be further decomposed by the same method.

2.2.7. Multiplication of Polynomials

First, obtain $\mathrm{NTT}(A_{j,q})$ for each $A_{j,q} = \sum_{i=0}^{N-1} c_{i,j,q} x^i$ by using the N-point NTT method above, and likewise obtain $\mathrm{NTT}(\mathrm{BK}_{1,j,q})$ for each $\mathrm{BK}_{1,j,q}$. Next, calculate the pointwise product of $\mathrm{NTT}(A_{j,q})$ and $\mathrm{NTT}(\mathrm{BK}_{1,j,q})$. Finally, apply the INTT to all the products using Equation (23):
$$x_i = \mathrm{INTT}(X)_i = \frac{1}{N} W_{2N}^{-i} \sum_{k=0}^{N-1} X_k W_N^{-ik} \bmod q. \quad (23)$$
According to Equation (24), we add all the results of the INTT for $j = 1, 2, \ldots, K+1$ and $q = 1, 2, \ldots, L$, and then add $\mathrm{ACC}_{tmp}$ to obtain $\mathrm{ACC}_1$:
$$\mathrm{ACC}_1 = \sum_{j=1}^{K+1} \sum_{q=1}^{L} \mathrm{INTT}\left(\mathrm{NTT}(A_{j,q}) \odot \mathrm{NTT}(\mathrm{BK}_{1,j,q})\right) + \mathrm{ACC}_{tmp}. \quad (24)$$

2.2.8. The Result of Bootstrapping

First, store the value $\mathrm{ACC}_1$ obtained above in shared memory. Then, take $\mathrm{ACC}_d$ and $\mathrm{BK}_{d+1}$ ($d = 1, 2, \ldots, p-1$) as the blind-rotation inputs and repeat Section 2.2.2 to Section 2.2.7 to obtain the blind-rotation outputs $\mathrm{ACC}_{d+1}$ ($d = 1, 2, \ldots, p-1$). Next, for all polynomials $A_j$ ($j = 1, 2, \ldots, K+1$) in $\mathrm{ACC}_p = (A_1, A_2, \ldots, A_{K+1})$, we execute the sample extraction of bootstrapping to extract the coefficients into a new TLWE ciphertext $c' = (c_0, c_1, \ldots, c_{KN})$, which encrypts $\mu$. Finally, we perform the key switching of bootstrapping. Let $\bar{c_j}$ be the closest multiple of $\frac{1}{2^t}$ to $c_j$ and write $\bar{c_j} = \sum_{r=1}^{t} c_{j,r} 2^{-r}$, where $t$ is a precision parameter and $c_{j,r} \in \{0, 1\}$. The result of bootstrapping is then given by Equation (25):
$$c'' = (\mathbf{0}, c_K + \mu) - \sum_{j=1}^{K} \sum_{r=1}^{t} c_{j,r} \mathrm{KSK}_{j,r}. \quad (25)$$

3. Results

As we know, the efficiency of bootstrapping can be improved from different directions, such as optimizing blind rotation and accelerating the NTT. Our solution primarily aims to accelerate the NTT on a GPU. In this direction, the current best implementation is the butterfly-algorithm-based NTT in cuFHE [24], so we chose it for comparison. In our scheme, we adopt the original matrix structure, which is better suited to a GPU, to calculate the number theoretic transform, and we add a merging preprocessing operation to parallelize the multiplication process. It should be noted that we implemented two versions of the proposed scheme, denoted our scheme and our scheme (no merging), in order to demonstrate the effect of the merging preprocessing operation: the former performs the matrix multiplication with the merging operation, the latter without it. All the schemes are implemented in C++17, and the host code runs on an Intel(R) Core(TM) i9-9900K at 3.60 GHz. The NTT algorithms were run on an NVIDIA GeForce RTX 2080 Ti (TU102) graphics card. Because a single run is very short, each NTT algorithm was run 10,000 times on the GPU and the average time was taken as the result. The performance comparison is shown in Table 1; the time unit is milliseconds (ms).
In order to have a clearer view of the time consumption trend of each scheme, we provide the corresponding curves of Table 1 in Figure 3. From Figure 3, it can be observed that the running time of both of our schemes increases approximately linearly as the scale of the NTT expands. In comparison to our solutions, the butterfly algorithm exhibits a longer running time, and as the number of points N increases, its running time grows more quickly. The running time of our scheme (no merging) is longer than that of our scheme, which shows the importance of the merging operation.
Furthermore, we designed experiments to compare the performance of a 1024-point NTT realized by different decomposition methods using our scheme; the corresponding results are shown in Table 2. To intuitively demonstrate the running time of the different decomposition methods, we converted Table 2 into Figure 4. As before, each of our solutions was run 10,000 times on a GPU and the average time was taken as the result. From Figure 4, it can be seen that more levels of decomposition do not necessarily mean a shorter running time: between any two small-point NTT stages, a multiplication by rotation factors must be computed, and this multiplication cannot be replaced by the logical left-shift operation. That is to say, the more decomposition levels there are, the longer the multiplication time becomes. At the same time, we can see that the decomposition 8 × 8 × 16 is the most suitable method. With it, the running time of a 1024-point NTT based on our scheme is only 6.03 ms, while that based on the butterfly algorithm is 15.01 ms. The speedup ratio of our method over the butterfly algorithm is about 2.49.

4. Discussion

In this paper, we presented an accelerated NTT algorithm based on a GPU, which is suitable for the polynomial multiplications in the gate bootstrapping of the TFHE scheme. In our scheme, we adopt the original matrix structure, which is better suited to a GPU, to calculate the number theoretic transform, and we add a merging preprocessing operation to parallelize the multiplication process, aiming to improve the running speed of the transform. Furthermore, we improved computational efficiency by using logical left shifts instead of multiplications when computing small-point NTTs. Compared with the mainstream Cooley–Tukey NTT algorithm, our scheme greatly reduces the use of the GPU's shared memory and avoids the inherent sequential bottleneck between successive layers of the butterfly algorithm. In summary, when our scheme is used instead of the butterfly algorithm to realize a 1024-point NTT, the speedup ratio is about 2.49. Of course, our scheme is not perfect: for example, fully exploring the allocation of multiple threads on a GPU to suit our scheme is left for future research.

Author Contributions

Conceptualization, H.L.; methodology, H.L. and J.L.; software, D.P.; formal analysis, H.W.; investigation, H.W.; resources, H.L.; data curation, D.P. and H.W.; writing—original draft, D.P.; writing—review and editing, H.L., D.P., J.L. and H.W.; visualization, J.L.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Basic Research Plan in Shaanxi Province of China, grant number 2023-JC-YB-546.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

In order to avoid confusion for readers, we define here the abbreviations that appear in this article. The definitions marked with an asterisk (*) in the table follow the conventions set by TFHE [5].
Abbreviation   Extension
TFHE           Fully homomorphic encryption over the torus
BGV            Brakerski–Gentry–Vaikuntanathan homomorphic encryption scheme [2]
BFV            Brakerski–Fan–Vercauteren homomorphic encryption scheme [3]
CKKS           Cheon–Kim–Kim–Song homomorphic encryption scheme [4]
NTT            Number theoretic transform
INTT           Inverse number theoretic transform
DFT            Discrete Fourier transform
FFT            Fast Fourier transform
GPU            Graphics processing unit
CPU            Central processing unit
CUDA           Compute unified device architecture
FPGA           Field programmable gate array
TRLWE *        A ciphertext for the bootstrapping input
TLWE *         The scalar form of TRLWE
TRGSW *        A ciphertext for the bootstrapping key
TGSW *         The scalar form of TRGSW

References

  1. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing; Association for Computing Machinery: New York, NY, USA, 2009; pp. 169–178.
  2. Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory 2014, 6, 1–36.
  3. Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptology ePrint Archive. 2012, pp. 1–19. Available online: https://eprint.iacr.org/2012/144 (accessed on 11 October 2023).
  4. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Proceedings of the 23rd International Conference on the Theory and Application of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Springer: Cham, Switzerland, 2017; pp. 409–437.
  5. Chillotti, I.; Gama, N.; Georgieva, M.; Izabachène, M. Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds. In Proceedings of the Advances in Cryptology–ASIACRYPT 2016: 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, 4–8 December 2016; pp. 3–33.
  6. Bianchi, T.; Piva, A.; Barni, M. On the implementation of the discrete Fourier transform in the encrypted domain. IEEE Trans. Inf. Forensics Secur. 2009, 4, 86–97.
  7. Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Accelerating fully homomorphic encryption using GPU. In Proceedings of the 2012 IEEE Conference on High Performance Extreme Computing, Waltham, MA, USA, 10–12 September 2012; pp. 1–5.
  8. Lee, Y.; Micciancio, D.; Kim, A.; Choi, R.; Deryabin, M.; Eom, J.; Yoo, D. Efficient FHEW bootstrapping with small evaluation keys, and applications to threshold homomorphic encryption. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Lyon, France, 23–27 April 2023; pp. 227–256.
  9. Xiang, B.; Zhang, J.; Deng, Y.; Dai, Y.; Feng, D. Fast blind rotation for bootstrapping FHEs. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 20–24 August 2023; pp. 3–36.
  10. Xie, X.; Sun, L.; Huang, X.M.; Han, S.F. Design and implementation of finite field NTT algorithm based on FPGA. Mod. Electron. Tech. 2019, 41, 1855–1860.
  11. Mert, A.C.; Öztürk, E.; Savaş, E. FPGA implementation of a run-time configurable NTT-based polynomial multiplication hardware. Microprocess. Microsyst. 2020, 78, 103219.
  12. Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber: A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 353–367.
  13. Zhou, S.; Xue, H.; Zhang, D.; Wang, K.; Lu, X.; Li, B.; He, J. Preprocess-then-NTT technique and its applications to Kyber and NewHope. In Proceedings of the Information Security and Cryptology: 14th International Conference, Inscrypt 2018, Fuzhou, China, 14–17 December 2018; Springer: Cham, Switzerland, 2019; Volume 11449, pp. 117–137.
  14. Zhu, Y.; Liu, Z.; Pan, Y. When NTT Meets Karatsuba: Preprocess-then-NTT Technique Revisited. In Proceedings of the International Conference on Information and Communications Security, Chongqing, China, 17–19 September 2021; pp. 249–264.
  15. Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Exploring the Feasibility of Fully Homomorphic Encryption. IEEE Trans. Comput. 2015, 64, 698–706.
  16. Schönhage, A.; Strassen, V. Schnelle Multiplikation großer Zahlen. Computing 1971, 7, 281–292.
  17. Dai, W.; Sunar, B. cuHE: A homomorphic encryption accelerator library. In Proceedings of the Cryptography and Information Security in the Balkans: Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, 3–4 September 2015; pp. 169–186.
  18. Goey, J.Z.; Lee, W.K.; Goi, B.M.; Yap, W.S. Accelerating number theoretic transform in GPU platform for fully homomorphic encryption. J. Supercomput. 2021, 77, 1455–1474.
  19. Lee, W.-K.; Akleylek, S.; Wong, D.C.-K.; Yap, W.-S.; Goi, B.-M.; Hwang, S.-O. Parallel implementation of Nussbaumer algorithm and number theoretic transform on a GPU platform: Application to qTESLA. J. Supercomput. 2021, 77, 3289–3314.
  20. Nussbaumer, H. Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 205–215.
  21. Kim, S.; Jung, W.; Park, J.; Ahn, J.H. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs. In Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC), Beijing, China, 27–30 October 2020; pp. 264–275.
  22. Özerk, Ö.; Elgezen, C.; Mert, A.C.; Öztürk, E.; Savaş, E. Efficient number theoretic transform implementation on GPU for homomorphic encryption. J. Supercomput. 2022, 78, 2840–2872.
  23. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
  24. CUDA-Accelerated Fully Homomorphic Encryption Library. Available online: https://github.com/vernamlab/cuFHE (accessed on 21 June 2023).
  25. Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C. GPU computing. Proc. IEEE 2008, 96, 879–899.
  26. Sanders, J.; Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming; Addison-Wesley Professional: Boston, MA, USA, 2010.
  27. Solinas, J. Generalized Mersenne Numbers; Tech. Rep. 06/MI/006; Blekinge College Technology: Karlskrona, Sweden, 1999.
  28. Wang, W.; Huang, X.; Emmart, N.; Weems, C. VLSI design of a large-number multiplier for fully homomorphic encryption. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2014, 22, 1879–1887.
Figure 1. CMux gate.
Figure 2. Overview of the accelerating NTT on a GPU in a whole gate bootstrapping process.
Figure 3. The running time of small-point NTTs by three different methods.
Figure 4. The running time of a 1024-point NTT achieved using different decomposition methods.
Table 1. The running time (ms) of small-point NTTs by three different methods.

Method                      4-Point NTT   8-Point NTT   16-Point NTT   32-Point NTT
Our scheme                  0.72          1.37          2.93           5.28
Our scheme (no merging)     1.18          2.33          4.78           9.42
Butterfly algorithm [24]    0.98          2.49          7.11           19.52
Table 2. The running time of a 1024-point NTT achieved using different decomposition methods.

Decomposition method   4 × 4 × 4 × 4 × 4   8 × 8 × 16   16 × 16 × 4   32 × 32
Running time (ms)      7.60                6.03         7.24          11.45

