1. Introduction
The emergence of new highly parallel architectures has increased interest in fast, carry-free, and energy-efficient computer arithmetic techniques. One such technique is the residue number system (RNS), which has received considerable attention in recent years [1,2,3]. The RNS is of interest to scientists dealing with computationally intensive applications, as it provides efficient, highly parallelizable arithmetic operations. This number coding system is defined in terms of pairwise coprime integers called moduli; a large weighted number is converted into several smaller numbers called residues, which are obtained as the remainders when the given number is divided by the moduli. A useful feature is that the residues are mutually independent: for addition, subtraction, and multiplication, instead of big word length (multiple-precision) operations, we can perform several small word length operations on the residues without carry propagation between them [1].
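The carry-free property described above can be sketched in a few lines of Python; the 4-moduli set here is a hypothetical example chosen for illustration only, not one of the moduli sets used later in the paper:

```python
# Sketch of RNS encoding and carry-free arithmetic.
# The moduli set is a hypothetical illustrative example.
from math import prod

moduli = [13, 17, 19, 23]          # pairwise coprime moduli
M = prod(moduli)                   # dynamic range: [0, M)

def to_rns(x):
    """Convert a weighted number to its residues."""
    return [x % m for m in moduli]

def rns_add(xs, ys):
    """Addition acts on each residue independently: no carry propagation."""
    return [(x + y) % m for x, y, m in zip(xs, ys, moduli)]

def rns_mul(xs, ys):
    """Multiplication is likewise performed channel by channel."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]

def from_rns(rs):
    """Chinese remainder reconstruction, used here only for checking."""
    total = 0
    for r, m in zip(rs, moduli):
        Mi = M // m
        total += r * pow(Mi, -1, m) * Mi   # modular inverse of M/m mod m
    return total % M

a, b = 123456, 7890
assert from_rns(rns_add(to_rns(a), to_rns(b))) == (a + b) % M
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M
```

Each residue channel fits in a machine word regardless of how large M grows, which is what makes the system attractive on parallel hardware.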
The RNS has been used in several applications, namely homomorphic encryption [4,5], cloud computing [6], stochastic computing [7], motion estimation [8], energy-efficient digital signal processing [9], high-precision linear algebra [10], blockchain [11], pseudo-random number generation [12], and deep neural networks [13]. Interest in RNS is currently growing due to the widespread adoption of massively parallel computing platforms such as graphics processing units (GPUs).
However, some operations are still difficult to implement in RNS, such as magnitude comparison, sign estimation, scaling, and division. Various methods have been proposed in the literature to perform these operations efficiently. Many of the existing methods are designed for special moduli sets [14,15,16], while others support arbitrary moduli sets [17,18,19,20].
In a recent paper [21], we presented a method for implementing difficult RNS operations by computing a finite precision floating-point interval that localizes the fractional value of an RNS representation. Such an interval is called a floating-point interval evaluation, or simply an interval evaluation. The method deserves attention for three reasons. First, it is intended for arbitrary moduli sets with large dynamic ranges significantly exceeding the usual word length of computers; dynamic ranges of hundreds and even thousands of bits are in demand in many modern applications, primarily in cryptography and high-precision arithmetic. Second, the method leads to efficient software implementations on general-purpose computing platforms, since it only requires very fast standard (finite precision) floating-point operations, and most computations can be performed in parallel. Third, it is a fairly versatile method suitable for computing a wide range of fundamental operations that are problematic in RNS, including magnitude comparison, sign estimation, scaling, and division.
A key component of this method is an accurate algorithm that computes the floating-point interval evaluation for a number in RNS representation; see Algorithm 1 in [21]. To obtain the result with the desired accuracy using only finite precision operations, this algorithm performs an iterative refinement procedure, which in some cases is the most expensive part of the algorithm.
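To make the object being computed concrete: by the Chinese remainder theorem, the fractional value X/M is the fractional part of a per-modulus sum, and the interval evaluation encloses this value. The sketch below computes the sum exactly with rationals over a small hypothetical moduli set; the actual algorithm instead produces finite precision floating-point bounds with directed rounding and iterative refinement:

```python
# Exact computation of the fractional value X/M from the residues alone,
# via X/M = frac( sum_i |x_i * Mi^{-1}|_{m_i} / m_i ), where Mi = M/m_i.
# The moduli set is a hypothetical illustrative example.
from math import prod
from fractions import Fraction

moduli = [13, 17, 19, 23]
M = prod(moduli)

def frac_value(residues):
    """Exact fractional value X/M computed term by term from the residues."""
    s = Fraction(0)
    for x, m in zip(residues, moduli):
        Mi = M // m
        s += Fraction((x * pow(Mi, -1, m)) % m, m)
    return s % 1          # fractional part of the sum

X = 54321
assert frac_value([X % m for m in moduli]) == Fraction(X, M)
# Replacing the exact rationals by floating-point sums rounded downward
# and upward yields a lower and an upper bound on X/M, i.e., an interval
# evaluation of the kind used in the paper.
```

Each term depends on a single residue, so the summands can be produced independently, which is the source of the parallelism exploited later.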
In this paper, we continue our research on the application of finite precision floating-point intervals to implement high-performance RNS algorithms. The contribution of this paper is four-fold:
- In Section 3, we provide proofs of some important properties of the interval evaluation algorithm that were not included in the previous article [21].
- In Section 4, we present an improved version of the interval evaluation algorithm that reduces the number of iterations required to achieve the desired accuracy.
- In Section 5, we demonstrate that our method can be successfully applied to implement efficient data-parallel primitives in RNS arithmetic; namely, we use it to find the maximum element of an array of RNS numbers on GPUs.
- In Section 6, we present new numerical results showing the performance of the method on various moduli sets, from a moderately small 4-moduli set with a 64-bit dynamic range to a huge 256-moduli set with a 4096-bit dynamic range.
3. Properties of the Interval Evaluation Algorithm
The following theorem gives the maximum number of iterations in the refinement stage of the described algorithm.
Theorem 1. The refinement stage of the described algorithm for computing is performed in no more than iterations.
Proof. The iterative procedure ends at the jth iteration if . On the other hand, when applying upwardly directed rounding, cannot be less than , provided that . Thus, the number of iterations depends on the magnitude of X: the smaller the magnitude, the more iterations are needed. For , the loop termination condition is satisfied when . The result follows by expressing j in terms of M, , and k and requiring it to be an integer. □
One way to reduce the number of iterations is to increase the refinement factor k. However, the following theorem shows that increasing it too far can lead to an incorrect result.
Theorem 2. To guarantee the correct result of calculating , the refinement factor k must not be greater than .
Proof. The proof is based on two remarks:
As shown in [21], if X is too close to M, namely if , then may not be computed correctly. Consequently, X should be less than if we want to calculate the correct value of .
Calculation of (3) at the th refining iteration can be considered the same as calculating (2) with rounding upwards for some number . Accordingly, calculating (3) at the ith iteration can be considered the same as calculating (2) with rounding upwards for the number .
We denote the result of the th iteration by , and the result of the ith iteration by . The ith iteration is performed when , whence it follows that
First, assume . In this case, is less than , and since then is less than . Thus, if , then will be computed correctly (see the first remark above).
Now consider the case when . For this setting, we have , so we can only guarantee that is less than M, but not that it is less than . Thus, may not be computed correctly. □
4. Proposed Improvement
4.1. Description
We propose an improvement to the algorithm of [21] described above by modifying each iteration of the refinement loop as follows:
Thus, we make the refinement factor r dependent on the value of , but not less than , and before starting the iterations, we assign the value of computed at the first stage of the algorithm to . Once the iterations are finished, is computed according to (4), and the desired endpoints of are obtained according to (5). The proposed modification improves the performance of the algorithm by reducing the number of iterations required to achieve the desired accuracy.
Summarizing, we present an improved accurate algorithm for computing the floating-point interval evaluation of an RNS number in Algorithm 1. The algorithm takes as input an integer represented by the residues relative to the moduli set and produces as output an interval such that and , where is a given accuracy parameter and .
Algorithm 1 Computing the floating-point interval evaluation for an RNS number

1: for all …
2: …
3: …
4: if … then
5:    return … and …
6: end if
7: …
8: …
9: if … then
10:    Compute the mixed-radix representation of X and test the most significant mixed-radix digit, …: if …, then set …; otherwise set …
11: end if
12: if … then
13:    return … and …
14: end if
15: …
16: …
17: while … do
18:    …
19:    for all …
20:    …
21:    …
22: end while
23: …
24: …
25: …
26: return … and …
The correctness of the original algorithm has been proved in [21], so we need only prove that the proposed modification does not violate the correctness of the algorithm. The following theorem establishes this fact.
Theorem 3. The proposed modification does not violate the correctness of the algorithm for computing the floating-point interval evaluation.
Proof. Denote again the result of the th iteration by , and the result of the ith iteration by . Assuming that is calculated properly, it is only necessary to prove that is also calculated properly. The proof is as follows. Calculation of (6) at the ith iteration can be considered the same as calculating (2) with upwardly directed rounding for the input . Please note that cannot exceed . According to the first remark from the proof of Theorem 2 above, should be less than . Assuming (we are not interested in the case ), cannot exceed and thus cannot exceed . Simplifying, we have . Finally, since , then , so if , then it is guaranteed that is computed properly. □
To implement the proposed improved algorithm, all the ’s should be precomputed and stored in a lookup table of size no more than n by . This memory overhead is not significant. For example, let each RNS modulus consist of 32 bits and . Then M has a bit size of about 16 thousand bits (a huge dynamic range), and the total size of the lookup table is 32 MB. Moreover, a table of powers of two is in demand in various RNS applications.
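Such a table can be filled once by iterative doubling, one column per modulus. The following sketch uses a hypothetical moduli set and exponent limit; in the setting above, the rows would run up to the bit size of M:

```python
# Precomputing a lookup table of powers of two modulo each RNS modulus.
# Row j holds |2^j|_{m_i} for every modulus m_i.
# The moduli set and exponent limit are hypothetical examples.
moduli = [13, 17, 19, 23]
J = 64                                  # largest exponent to tabulate

pow2 = [[1] * len(moduli)]              # row 0: 2^0 mod m_i = 1
for j in range(1, J + 1):
    # Each new row is obtained from the previous one by doubling mod m_i.
    pow2.append([(2 * p) % m for p, m in zip(pow2[-1], moduli)])

assert pow2[10] == [pow(2, 10, m) for m in moduli]
```

Since each entry fits in one machine word, the table occupies roughly (number of rows) × n words, consistent with the 32 MB estimate in the text.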
4.2. Demonstration
To demonstrate the benefits of the proposed modification, we have counted the number of iterations required to compute the interval evaluation in residue number systems with four different moduli sets. The first set consists of 8 moduli and provides a 128-bit dynamic range. The second set consists of 32 moduli and provides a 512-bit dynamic range. The third set consists of 64 moduli and provides a 1024-bit dynamic range. The fourth set consists of 256 moduli and provides a 4096-bit dynamic range.
The results are reported in Figure 2 and Figure 3, where the algorithm from [21] is labeled as “Original algorithm” and the modified algorithm (Algorithm 1) as “Proposed modification”. The inputs are represented by powers of two from to , and the x-axis on the plots denotes the binary logarithm of the RNS number for which the interval evaluation is computed. The y-axis denotes the number of refining iterations required to compute the interval evaluation with accuracy .
In this demonstration, all calculations were done in standard floating-point arithmetic (double precision). The plots show that for small inputs, the proposed modification reduces the number of iterations by almost 3 times, and increasing the size of the moduli set increases the advantage.
5. Application: Finding the Maximum Element of an Array of RNS Numbers
Reduction is a widely used data-parallel primitive in high-performance computing and is part of many important algorithms such as least squares and MapReduce [26]. For an array of N elements , applying the reduction operator ⊕ gives a single value . The reduction operator is a binary associative (and often commutative) operator such as +, ×, MIN, MAX, logical AND, or logical OR. In this section, we apply the considered interval evaluation method to implement one operation of the parallel reduction primitive over an array of RNS numbers, namely MAX, which finds the maximum element in the array. Our implementation is intended for GPUs supporting the Compute Unified Device Architecture (CUDA) [27].
5.1. Approach
Let be an array of N integers in RNS representation, and suppose we want to find the largest element of this array, . To simplify the presentation, we consider unsigned numbers, i.e., each varies in the range from 0 to . Our approach to finding is illustrated in Figure 4. The computation is decomposed into two stages:
For a given array , the array of interval evaluations is calculated. Each interval evaluation is coupled with the corresponding index of the RNS representation in the input array ( in Figure 4), so knowing the interval evaluation, we can always fetch the original RNS representation.
The reduction tree is built over the array of interval evaluations to obtain the maximum RNS representation. The basic building block of the reduction is comparing the magnitudes of two numbers in the RNS. For two given numbers and , the magnitude comparison is performed as follows [21]:
if then ;
if then ;
if for all , then ;
if neither case is true, then the mixed-radix representations of X and Y are calculated and compared component-wise to produce the final result.
Thus, after the array of interval evaluations is computed, each RNS comparison is performed very quickly, namely in time, and the input RNS array is accessed only in corner cases, when the numbers being compared are equal or nearly equal to each other and the accuracy of their interval evaluations is insufficient. On the other hand, the presented approach also reduces the memory footprint of intermediate computations, since the reduction layers store only interval evaluations rather than full RNS numbers, and each interval evaluation has a fixed size, regardless of the size of the moduli set and dynamic range.
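The comparison rules above can be sketched as follows; the function names and the flat (low, upp) pair used to represent an interval evaluation are illustrative assumptions, not the paper's actual data layout:

```python
# Sketch of RNS magnitude comparison driven by interval evaluations.
# x_ev and y_ev are (low, upp) bounds on X/M and Y/M; mrc_compare is a
# caller-supplied mixed-radix fallback used only in ambiguous cases.

def rns_compare(x_res, x_ev, y_res, y_ev, mrc_compare):
    x_low, x_upp = x_ev
    y_low, y_upp = y_ev
    if x_low > y_upp:                  # intervals disjoint: X > Y
        return 1
    if x_upp < y_low:                  # intervals disjoint: X < Y
        return -1
    if x_res == y_res:                 # identical residues: X = Y
        return 0
    # Overlapping evaluations of unequal numbers: exact fallback.
    return mrc_compare(x_res, y_res)

# Toy usage: disjoint intervals resolve the comparison without touching
# the residues; equal residues short-circuit to equality.
assert rns_compare([1, 2], (0.30, 0.31), [3, 4], (0.10, 0.11), None) == 1
assert rns_compare([5, 6], (0.20, 0.30), [5, 6], (0.25, 0.35), None) == 0
```

Only the final branch needs the expensive mixed-radix machinery, which is why the common case costs a constant number of floating-point comparisons.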
5.2. CUDA Implementation
In CUDA, programs that run on the GPU are called kernels. Each kernel launches a grid of parallel threads, and within the grid, GPU threads are grouped into thread blocks. Threads belonging to the same block can communicate through shared memory or shuffle instructions, while global communication between thread blocks is done by sequential kernel launches or using device memory atomic operations.
In what follows, we denote by Arr the pointer to the global memory for accessing the input RNS array and by Eval the pointer to the global memory for accessing the array of floating-point interval evaluations. Each element of the Eval array is an instance of a structure that consists of three fields:
low is the lower bound of the interval evaluation;
upp is the upper bound of the interval evaluation;
idx is the index of the corresponding RNS representation in the Arr array.
We also use the following notations:
N is the size of the input array;
gSize is the number of thread blocks per grid;
bSize is the number of threads per block;
bx is the block index within the grid;
tx is the thread index within the block.
For an input array of RNS numbers, the calculation of the array of floating-point interval evaluations is implemented as a separate kernel. One thread computes one interval evaluation, and many threads run concurrently, storing the results in a pre-allocated global memory buffer on the GPU. The pseudocode of the kernel (the code executed by each thread) is shown in Algorithm 2, where the device function ComputeEval is a sequential computation of an interval evaluation as described in Algorithm 1.
Algorithm 2 Computing an array of floating-point interval evaluations

1: …
2: while … do
3:    …
4:    …
5:    …
6: end while
Remark 1. We have also implemented a parallel algorithm in which n threads simultaneously compute an interval evaluation for a number represented in an n-moduli RNS, and the ith thread is assigned for modulo computation. However, for our purpose (calculating an array of interval evaluations), the sequential algorithm is preferable, since it does not require communication between parallel threads. The parallel version can be used in applications that have insufficient internal parallelism to saturate the GPU.
Once the array of interval evaluations is computed, the reduction kernel is launched, which generates and stores partial reduction results using multiple thread blocks. To avoid repeated global memory accesses, the fast shared memory is used as a cache at the internal reduction levels. The same reduction kernel then runs again to reduce the partial results into a single result using a single thread block. The result is an interval evaluation that represents the maximum element in the input array, and the desired RNS number is retrieved from the array using the index associated with that interval evaluation.
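The two-pass structure can be mimicked on the host as follows. This toy sketch uses only the lower interval bound as the comparison key and ignores the ambiguous overlapping cases that the real implementation resolves exactly:

```python
# Host-side sketch of the two-pass MAX reduction over interval evaluations.
# Each "block" reduces its slice to one partial result (first kernel
# launch), then a single block reduces the partials (second launch).
# Elements are (low, upp, idx) triples; comparing by the lower bound is a
# simplification of the interval comparison used in the paper.

def block_max(evals):
    """MAX reduction performed by one thread block."""
    return max(evals, key=lambda e: e[0])

def reduce_max(evals, g_size):
    chunk = (len(evals) + g_size - 1) // g_size
    partials = [block_max(evals[i:i + chunk])       # kernel launch 1
                for i in range(0, len(evals), chunk)]
    return block_max(partials)                      # kernel launch 2

evals = [(0.1, 0.11, 0), (0.7, 0.71, 1), (0.4, 0.41, 2), (0.2, 0.21, 3)]
assert reduce_max(evals, g_size=2)[2] == 1   # idx of the maximum element
```

The returned idx field is what allows the full RNS representation of the maximum to be fetched from the input array afterwards.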
The pseudocode of the reduction kernel is shown in Algorithm 3. In this algorithm, S is the size of the reduction operation, Sh is an array in the shared memory of each thread block, and is the next highest power of 2 of . Just like the Eval array, each element of the Sh array is a structure consisting of three fields: low, upp, and idx.
Algorithm 3 Reduction of an array of floating-point interval evaluations

1: …
2: …
3: while … do
4:    if … then
5:       …
6:    end if
7:    …
8: end while
9: — Intra-block synchronization —
10: …
11: while … do
12:    if … and … and … then
13:       …
14:    end if
15:    …
16:    — Intra-block synchronization —
17: end while
18: if … then
19:    …
20: end if
Note that in Algorithm 3, for the first kernel invocation and for the second one. The final result of the two kernel launches is stored in the first element of the Eval array, i.e., assuming zero-based indexing. We give the pseudocode of the RnsCmp function in Algorithm 4.
Algorithm 4 RnsCmp

1: if … or … then
2:    return 1
3: else if … or … then
4:    return …
5: else if … and … are equal component-wise then
6:    return 0
7: else
8:    Compare … and … using mixed-radix conversion
9: end if
Although we have only considered the MAX operation in this paper, other reduction operations can be implemented in a quite similar manner. We also note that our approach can be straightforwardly extended to find the maximum element in an array of signed RNS numbers.
6. Results and Discussion
We present several performance results of different approaches to finding the maximum element in an array of RNS numbers on the GPU:
Proposed approach is an implementation of the MAX operation as described in Section 5, using floating-point interval evaluations to compare the magnitude of RNS numbers.
Naive approach is a straightforward parallel reduction using floating-point interval evaluations that consists of two kernel invocations. In contrast to the proposed variant, the naive one does not have a kernel that computes an array of interval evaluations. Instead, the computation of two interval evaluations is performed each time two RNS numbers are compared. This reduces the memory footprint but leads to more computation load.
Mixed-radix approach is an implementation of the MAX operation as described in Section 5, but using the MRC procedure instead of floating-point interval evaluations to compare the magnitude of RNS numbers. We used the Szabo and Tanaka MRC algorithm [3] for this implementation.
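For reference, the Szabo and Tanaka conversion produces mixed-radix digits a1, …, an with X = a1 + a2·m1 + a3·m1·m2 + …, after which two RNS numbers can be compared digit by digit. A minimal sketch, with a hypothetical small moduli set in place of the sets from Table 2:

```python
# Sketch of Szabo-Tanaka mixed-radix conversion (MRC): residues are turned
# into mixed-radix digits using a quadratic number of modular operations.
# The moduli set is a hypothetical illustrative example.
moduli = [13, 17, 19, 23]

def mrc_digits(residues):
    """Mixed-radix digits a_1..a_n of the number given by its residues."""
    r = list(residues)
    digits = []
    for i, mi in enumerate(moduli):
        a = r[i]
        digits.append(a)
        # Subtract the digit and divide by m_i in every remaining channel.
        for j in range(i + 1, len(moduli)):
            mj = moduli[j]
            r[j] = ((r[j] - a) * pow(mi, -1, mj)) % mj
    return digits

# Check by reconstructing X = a1 + a2*m1 + a3*m1*m2 + a4*m1*m2*m3.
X = 54321
d = mrc_digits([X % m for m in moduli])
w, acc = 1, 0
for a, m in zip(d, moduli):
    acc += a * w
    w *= m
assert acc == X
```

The nested loop makes the quadratic cost visible: roughly n(n-1)/2 modular multiplications per conversion, versus the linear work of the interval evaluation approach.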
All tests were carried out on a system running Ubuntu 20.04.1 LTS, equipped with an NVIDIA GeForce RTX 2080 video card (see Table 1), an Intel Core i5 7500 CPU, and 16 GB of DDR4 memory. We used CUDA Toolkit version 11.1.105 and NVIDIA Driver version 455.32.00. The source code was compiled with the -O3 option.
We carried out the performance tests using 7 different sets of RNS moduli that provide dynamic ranges from 64 to 4096 bits. An overview of the moduli sets is given in Table 2, and the tool used to generate these sets is available in the GitHub repository: https://github.com/kisupov/rns-moduli-generator.
The measurements were performed with arrays of fixed length 5,000,000, and random residues were used for the data generation, i.e., in the case of an n-moduli RNS, the input dataset (an array of N RNS numbers) was obtained from a random integer array of size . The GNU MP library was used to validate the results.
We report the performance of the tested CUDA implementations in Table 3. The proposed approach significantly outperforms the naive and mixed-radix variants in all test cases. The MRC method has quadratic behavior, while the proposed and naive approaches based on floating-point interval evaluations have linear behavior. In fact, the Szabo and Tanaka algorithm requires operations modulo to compute the mixed-radix representation in an n-moduli RNS, while sequential (single-threaded) computation of the interval evaluation using Algorithm 1 requires only n operations modulo and standard floating-point operations, assuming no ambiguity resolution or refinement iterations are required. The mixed-radix implementation was not evaluated for the 256-moduli set due to excessive memory consumption.
Although in some cases Algorithm 1 calls the MRC procedure, these cases are very rare when the numbers are randomly distributed. Moreover, if it is known in advance that the RNS number is small enough, then a quick-and-dirty computation of the interval evaluation is possible, which is obtained from Algorithm 1 by eliminating steps 9 to 11.
In Figure 5, we show the performance gains of the proposed approach over the other approaches tested. The superiority of the proposed approach over the naive one increases with the size of the moduli set. In turn, for the 128-moduli set, the gain of the proposed approach over the mixed-radix method is smaller than for the 64-moduli set. One possible reason is a decrease in effective GPU memory bandwidth due to strided accesses to the input array of RNS numbers; the reader can refer to [28] for further details. An interleaved addressing scheme, in which the residues of the RNS numbers are interleaved, would provide more efficient access to the global GPU memory. We plan to explore this in the future.
Table 4 shows the memory consumption (in MB) of the tested implementations. Memory consumption is calculated as the size of the auxiliary buffer that needs to be allocated in the global GPU memory for a particular implementation to work properly.
The memory requirements of the mixed-radix implementation increase in proportion to the number of moduli, since the size of each mixed-radix representation is equal to the size of the corresponding RNS representation. In contrast, the memory requirements of the naive and proposed implementations are constant, since the size of each floating-point interval evaluation is fixed (40 bytes in our setting, including padding) and does not depend on the size of the moduli set.
The naive implementation requires less memory, since it stores only the partial results generated by the first reduction kernel, while the proposed implementation stores the computed interval evaluations for all inputs. However, the memory consumption of the proposed implementation does not seem critical, since the NVIDIA RTX 2080 graphics card has 8 GB of GDDR6 RAM.