Article

BooLSPLG: A Library with Parallel Algorithms for Boolean Functions and S-Boxes for GPU

by Dushan Bikov 1, Iliya Bouyukliev 2,* and Mariya Dzhumalieva-Stoeva 3
1 Faculty of Computer Science, Goce Delchev University, 2000 Stip, North Macedonia
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria
3 Faculty of Mathematics and Informatics, University of Veliko Turnovo, 5003 Veliko Tarnovo, Bulgaria
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(8), 1864; https://doi.org/10.3390/math11081864
Submission received: 15 March 2023 / Revised: 10 April 2023 / Accepted: 12 April 2023 / Published: 14 April 2023
(This article belongs to the Special Issue Theory and Application of Algebraic Combinatorics)

Abstract

In this paper, we present a library with sequential and parallel functions for computing some of the most important cryptographic characteristics of Boolean and vectorial Boolean functions. The library implements algorithms to calculate the nonlinearity, algebraic degree, autocorrelation, differential uniformity and related tables of vectorial Boolean functions. For the sake of completeness, we provide the mathematical basis of these algorithms. Furthermore, we compare the performance of the parallel functions from the developed software with the corresponding sequential functions and with analogous functions from the well-known SageMath and SET packages. Functions from BooLSPLG can be used to develop efficient algorithms for constructing Boolean and vectorial Boolean functions with good cryptographic properties. The parallel part of the library is implemented using a CUDA parallel programming model for recent NVIDIA GPU architectures. BooLSPLG is an open-source software library written in CUDA C/C++ with explicit documentation, test examples, and detailed input and output descriptions of all functions, both sequential and parallel, and it is available online.

1. Introduction

The main subjects which we consider in this paper are Boolean and vectorial Boolean functions (S-boxes) with good cryptographic properties. There is a substantial body of research on S-boxes with eight or fewer variables that are embedded in the most popular ciphers [1], but not much is known about larger S-boxes, despite the interest in them [2,3,4]. One of the reasons for this is the computationally difficult evaluation of their cryptographic properties. The construction of such types of objects is very important, but in most cases it is also a computationally expensive issue.
Some of the construction methods are based on searching a huge number of candidates for those with suitable parameters [5]. The cryptographic parameters which we investigate in this paper are nonlinearity, algebraic degree, autocorrelation, and differential uniformity. The computation of these parameters is related to Fourier-type transforms such as the Walsh–Hadamard and Möbius (Reed–Muller) transforms [6,7]. The algorithms that implement these fast discrete transforms, known as butterfly algorithms, are very efficient [8]. Moreover, these algorithms are suitable for parallelization on SIMD (single instruction, multiple data) computer architectures.
For this type of parallelization, using modern graphics processing units (GPUs) together with CUDA (compute unified device architecture) [9] is natural and very effective. GPUs are usually specialized for manipulating high-resolution computer graphics, but their structure makes them suitable for processing large amounts of data. This feature of GPUs is a great advantage for deep learning [10], neural systems [11], molecular modeling [12], etc.
For this paper, we developed a library, BoolSPLG, with C/C++ functions. It can be used to study important cryptographic properties of Boolean functions with n variables and bijective $n \times n$ S-boxes for $n \le 20$. BoolSPLG computes the following cryptographic parameters of the Boolean and vectorial Boolean functions: the Walsh spectrum of a Boolean function, the linearity of a Boolean function, the Walsh spectrum of an S-box, the linear approximation table of an S-box, the linearity of an S-box, the autocorrelation spectrum of a Boolean function, the autocorrelation of a Boolean function, the autocorrelation spectrum of an S-box, the autocorrelation of an S-box, the algebraic normal form of a Boolean function, the algebraic normal form of an S-box, the algebraic degree of a Boolean function, the algebraic degree of an S-box, the difference distribution table of an S-box, and the differential uniformity of an S-box and a component function of an S-box. All of the basic functions have two versions, sequential and parallel. All these features can be used for project development by anyone who knows the C/C++ programming language but is not so familiar with CUDA C.
The effectiveness of the developed library is based on the optimal use of the capabilities of the GPU architecture and on the properties of the CUDA platform as well. Building an optimized and portable parallel library requires various strategies so that it can be used not only for computing the properties of Boolean and vectorial Boolean functions, but also as a component of other software. We will focus on some of the basic features for building the library.
First, developing a parallel library requires an efficient sequential algorithm which is suitable for parallel implementation. Our research is focused on the cryptographic properties of vectorial Boolean functions. These are mainly calculated by butterfly algorithms, and such methods are appropriate for parallelization.
Second, GPUs and the CUDA platform have a great advantage—they possess huge computing power for large amounts of data. However, implementing optimized software is a complex task. One of the main problems is related to the transfer of data from the main memory to the GPU memory and vice versa. This is a time-consuming process, sometimes commensurate with the calculation time. Therefore, each of the functions is tailored to this fact. If the algorithm consists of several parallel steps, data transfer is conducted only at the beginning and the end of the function. The other technique that shortens the transfer time is compressing the data to bitwise representation.
Another peculiarity in the implementation of the library functions is using the fast memories of the GPU. The CUDA platform provides access to the global memory of the graphic card and also to the much faster shared and local memories. In order to perform calculations on variables located on the shared or local memory, additional steps are necessary to move the data from variables located on global memory. However, due to the throughput-oriented organization of the CUDA API, this turns out to be much more efficient than directly using global memory variables for calculations.
The functions of the library are adapted to the size of the processed data. In the case of S-boxes of fewer variables, simultaneous parallel computations are made on all component functions defined by the considered S-box. For larger parameters, component functions are processed separately.
Many of the library’s features also use the cutting-edge techniques of the ever-updating CUDA interface. A good example of this are the shuffle operations, which allow for the direct passing of the values of local variables from one thread to another. The use of bitwise calculations, where possible, provides another degree of parallelism and further optimization of the library functions.
In addition, BoolSPLG can be used for training and research, to view the characteristics of the video card, and to compare parallel and sequential performance. For example, we used it to study bijective S-boxes with $n \le 18$ variables and good cryptographic properties which were derived from linear codes with quasi-cyclic structures in ref. [13].
There are many libraries and mathematical software packages for computing the cryptographic properties of Boolean and vectorial Boolean functions developed for sequential CPU computation. As examples, we point out SageMath [14], MATLAB [15], VBF Library [16], PEIGEN [17], and SET (S-box Evaluation Tool) [18]. A detailed review of the software related to the cryptographic characteristics of the S-boxes is made in [16]. This software is good for training purposes and basic calculations but is not as fast as a parallel implementation, especially for large ($n \ge 8$) S-boxes.
In terms of linear algebra GPU tools, we can mention cuBLAS [19] and cuSPARSE [20]. There are GPU libraries for butterfly algorithms, such as BPLG [21], NVIDIA’s cuFFT [22], but most of them are for signal processing (fast Fourier transform, Hartley transform, etc.) and not for vector Boolean functions. Examples of parallel software related to cryptography include Eval16BitSbox and the algorithms in Refs. [23,24]. Ref. [24] discusses only the linear approximation tables (LAT) of S-boxes and offers a different approach to their calculation. In addition, the authors evaluate and compare their work with other known software in this area. The implementation of Ref. [24] is in CUDA, which allows us to compare it with our software library BoolSPLG. We would like to point out that the computation time for LAT of 16 × 16 S-boxes in Ref. [24] is close (comparable) to the computation time with a function of our library and is much less than the other presented software.
The paper is organized as follows. The main definitions connected to Boolean and vectorial Boolean functions are given in Section 2. In Section 3, we present some advantages of the CUDA programming platform. Basic facts and information about the data organization and the algorithms used are provided in Section 4. Section 5 presents some experimental results as well as a comparison of the calculation times of the considered library to the SageMath and SET packages. A short conclusion and directions for future improvements to the presented library are given in Section 6.

2. Main Definitions and Preliminaries

In this section, we present the terminology and definitions we follow (see Refs. [8,25,26]).
Let $\alpha_0, \alpha_1, \ldots, \alpha_{2^n-1}$ be the vectors of the n-dimensional vector space $\mathbb{F}_2^n$ over the field $\mathbb{F}_2 = \{0, 1\}$ in lexicographic order. There is a one-to-one correspondence between the vectors of $\mathbb{F}_2^n$ and the integers in $[0, 2^n-1]$, which allows us to switch from a vector to an integer and vice versa. The Hamming weight $w_H(v)$ of a vector $v$ is the number of its nonzero coordinates.
A Boolean function f of n variables is a mapping from $\mathbb{F}_2^n$ into $\mathbb{F}_2$. The Hamming distance $d_H(f, g)$ between two Boolean functions f and g is the number of function values in which they differ. Two natural representations of a Boolean function are its truth table $TT(f)$ and its algebraic normal form $ANF(f)$. Any Boolean function f of n variables is uniquely determined by its truth table $TT(f) = (f(\alpha_0), f(\alpha_1), \ldots, f(\alpha_{2^n-1}))$. Another way of uniquely representing a Boolean function f is by means of a polynomial with n variables, called its algebraic normal form (ANF), whose monomials have the form $x_{i_1} x_{i_2} \cdots x_{i_k}$, $1 \le i_1 < i_2 < \cdots < i_k \le n$, $0 \le k \le n$ [8].
Denote by $x^u$ the monomial $x_1^{u_1} x_2^{u_2} \cdots x_n^{u_n}$, where $u \in \mathbb{Z}$, $0 \le u \le 2^n-1$, $u = (u_1, u_2, \ldots, u_n) \in \mathbb{F}_2^n$. Then the algebraic normal form of f is the polynomial
$$f(x) = f(x_1, x_2, \ldots, x_n) = \bigoplus_{u=0}^{2^n-1} a_u x^u. \tag{1}$$
The degree of $ANF(f)$ is called the algebraic degree $\deg(f)$ of the Boolean function f, and it is equal to the maximum number of variables of the terms $x^u$, or
$$\deg(f) = \max\{ w_H(u) \mid a_u = 1 \}, \quad \text{where } f(x) = \bigoplus_{u=0}^{2^n-1} a_u x^u.$$
The Boolean functions $a_0 \oplus a_1 x_1 \oplus a_2 x_2 \oplus \cdots \oplus a_n x_n = a_0 \oplus l_a(x)$ of algebraic degree at most 1 play a special role in our investigations, and they are called affine, while the functions $l_a(x)$ are called linear.
Obviously, $ANF(f)$ can be associated with the binary ($2^n$-dimensional) vector $f_{ANF} \in \mathbb{F}_2^{2^n}$ whose coordinates are the coefficients in (1) following the lexicographical order [27].
Associated with the Boolean function f is the function $\hat{f} = (-1)^f = 1 - 2f$ whose function values belong to the set $\{-1, 1\}$. The corresponding vector that contains the function values of $\hat{f}$ is called the polarity truth table ($PTT$) of the function f.
Definition 1.
The Walsh (Hadamard, Walsh–Hadamard, Walsh–Fourier) transform $f^W$ of the Boolean function f is the integer-valued function $f^W : \mathbb{F}_2^n \to \mathbb{Z}$, defined by
$$f^W(a) = \sum_{x \in \mathbb{F}_2^n} (-1)^{f(x) \oplus l_a(x)} = \sum_{x \in \mathbb{F}_2^n} \hat{f}(x)\, \hat{l}_a(x) = 2^n - 2\, d_H(f, l_a),$$
where $a = (a_1, \ldots, a_n) \in \mathbb{F}_2^n$.
The function $\hat{f}(x)$ can be recovered by the inverse Walsh transform:
$$\hat{f}(x) = 2^{-n} \sum_{a \in \mathbb{F}_2^n} f^W(a)\, (-1)^{a \cdot x}.$$
The values of $f^W$ are called Walsh coefficients of the Boolean function f. For any Boolean function f and any vector $a \in \mathbb{F}_2^n$, we have $-2^n \le f^W(a) \le 2^n$. The functions $l_a(x)$ and $\bar{l}_a(x) = l_a(x) \oplus 1$ have the maximal and minimal Walsh coefficients, namely $l_a^W(a) = 2^n$ and $\bar{l}_a^W(a) = -2^n$.
The vector $W_f = (f^W(\alpha_0), f^W(\alpha_1), \ldots, f^W(\alpha_{2^n-1}))$ is called the Walsh spectrum of the Boolean function f. The Walsh spectrum measures the distance to the linear and affine functions.
The linearity of a Boolean function f is the maximum absolute value of a Walsh coefficient of f: $Lin(f) = \max\{ |f^W(a)| : a \in \mathbb{F}_2^n \}$. Parseval's equality $\sum_{a \in \mathbb{F}_2^n} (f^W(a))^2 = 2^{2n}$ gives $Lin(f) \ge 2^{n/2}$ [8]. Functions attaining this lower bound are called bent functions.
Another important parameter which is closely connected with linearity is nonlinearity.
The nonlinearity $nl(f)$ of the Boolean function f is the minimum Hamming distance from f to the nearest affine function:
$$nl(f) = \min\{ d_H(f, g) \mid g \text{ is an affine function} \}.$$
The relation between the linearity and nonlinearity of the Boolean function f is given by the equality $Lin(f) = 2^n - 2\, nl(f)$ [8]. Obviously, minimum linearity corresponds to maximum nonlinearity.
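The definitions above can be checked directly on small examples. The following C++ sketch evaluates the Walsh coefficients straight from the definition, at a cost on the order of $4^n$ operations, and derives the linearity and nonlinearity from them. The function names and the storage of the truth table as a vector of 0/1 values indexed by the integer form of $x$ are assumptions of this illustration and are not taken from BoolSPLG.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Inner product a·x over F_2: parity of the bitwise AND of the two integers.
static int dot_f2(std::size_t a, std::size_t x) {
    std::size_t v = a & x;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;  // clear the lowest set bit
    }
    return parity;
}

// Walsh spectrum from the definition: w[a] = sum over x of (-1)^(f(x) xor a·x).
// tt holds 0/1 values and must have length 2^n (hypothetical layout).
std::vector<int> walsh_by_definition(const std::vector<int>& tt) {
    const std::size_t size = tt.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> w(size, 0);
    for (std::size_t a = 0; a < size; ++a)
        for (std::size_t x = 0; x < size; ++x)
            w[a] += (tt[x] == dot_f2(a, x)) ? 1 : -1;
    return w;
}

// Lin(f): maximum absolute Walsh coefficient.
int linearity(const std::vector<int>& tt) {
    int lin = 0;
    for (int c : walsh_by_definition(tt)) lin = std::max(lin, std::abs(c));
    return lin;
}

// nl(f) obtained from Lin(f) = 2^n - 2 nl(f).
int nonlinearity(const std::vector<int>& tt) {
    return (static_cast<int>(tt.size()) - linearity(tt)) / 2;
}
```

For the truth table (0, 0, 0, 1) of $x_1 x_2$, this yields the spectrum (2, 2, 2, −2), so the linearity is $2 = 2^{n/2}$ and the nonlinearity is 1.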
Definition 2.
The autocorrelation function of the Boolean function f (the autocorrelation of f with a shift w) is the function $r_f : \mathbb{F}_2^n \to \mathbb{Z}$ defined by
$$r_f(w) = \sum_{x \in \mathbb{F}_2^n} (-1)^{f(x) \oplus f(x \oplus w)},$$
where $w \in \mathbb{F}_2^n$.
The autocorrelation values $r_f(w)$ for all $w \in \mathbb{F}_2^n$ can be expressed in terms of the Walsh coefficients [6] as
$$r_f(w) = 2^{-n} \sum_{u \in \mathbb{F}_2^n} (f^W(u))^2\, (-1)^{u \cdot w}.$$
For any Boolean function f and any vector $w \in \mathbb{F}_2^n$ we have $-2^n \le r_f(w) \le 2^n$ and $r_f(0) = 2^n$. The vector of the autocorrelation values $r_f(w)$ is referred to as the autocorrelation spectrum of the function f.
The absolute indicator of a Boolean function f of n variables, denoted by $AC(f)$, is the maximum absolute autocorrelation value over the nonzero shifts: $AC(f) = \max\{ |r_f(w)| : w \in \mathbb{F}_2^n \setminus \{0\} \}$. The shift $w = 0$ is excluded because $r_f(0) = 2^n$ for every f.
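As a small worked illustration of Definition 2, the sketch below computes the autocorrelation spectrum directly from the defining sum and takes the absolute indicator over the nonzero shifts only, since $r_f(0) = 2^n$ always holds. The names and the 0/1 vector layout of the truth table are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// r[w] = sum over x of (-1)^(f(x) xor f(x xor w)); tt has length 2^n.
std::vector<int> autocorrelation_by_definition(const std::vector<int>& tt) {
    const std::size_t size = tt.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> r(size, 0);
    for (std::size_t w = 0; w < size; ++w)
        for (std::size_t x = 0; x < size; ++x)
            r[w] += (tt[x] == tt[x ^ w]) ? 1 : -1;
    return r;
}

// Absolute indicator: largest |r[w]| over the nonzero shifts w.
int absolute_indicator(const std::vector<int>& tt) {
    const std::vector<int> r = autocorrelation_by_definition(tt);
    int ac = 0;
    for (std::size_t w = 1; w < r.size(); ++w)
        ac = std::max(ac, std::abs(r[w]));
    return ac;
}
```

For $x_1 x_2$ with truth table (0, 0, 0, 1) the spectrum is (4, 0, 0, 0), while for the linear function $x_1 \oplus x_2$ it is (4, −4, −4, 4).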
The Sylvester–Hadamard matrix (or Walsh–Hadamard matrix) of order $2^n$, denoted by $H_n$, is generated by the recursive relation:
$$H_0 = 1, \quad H_1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad H_n = \begin{pmatrix} H_{n-1} & H_{n-1} \\ H_{n-1} & -H_{n-1} \end{pmatrix} = H_1 \otimes H_{n-1} \quad \text{for } n > 1,$$
where ⊗ denotes the Kronecker product. The i-th row (column) of $H_n$ is the PTT of the linear function $l_i$. So $W_f^t = H_n \cdot PTT(f)^t$ and $PTT(f)^t = 2^{-n} H_n W_f^t$.
Fast Walsh transform (FWT) is usually used to calculate the Walsh spectrum. It is based on matrix vector multiplication and can be given by a butterfly diagram. The theoretical base of the FWT is given by Good [28] and it follows from a suitable factorization of H n .
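The butterfly structure can be sketched in a few lines of sequential C++. The version below is a minimal illustration under the assumption that the truth table is a vector of 0/1 values of length $2^n$; it converts the table to the polarity form and then applies n in-place stages of sums and differences. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place butterfly form of the fast Walsh transform, O(n 2^n) operations.
std::vector<int> fast_walsh_transform(const std::vector<int>& tt) {
    const std::size_t size = tt.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> w(size);
    for (std::size_t x = 0; x < size; ++x) w[x] = 1 - 2 * tt[x];  // PTT(f)
    for (std::size_t half = 1; half < size; half <<= 1) {
        for (std::size_t block = 0; block < size; block += 2 * half) {
            for (std::size_t j = block; j < block + half; ++j) {
                const int u = w[j];
                const int v = w[j + half];
                w[j] = u + v;         // sum goes to the lower cell
                w[j + half] = u - v;  // difference goes to the upper cell
            }
        }
    }
    return w;
}
```

Each stage corresponds to one factor in the factorization of $H_n$, and the result overwrites the input array, so no memory beyond the $2^n$ cells is needed.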
A similar approach can be used to calculate the fast Möbius transform (FMT). This transform gives the coefficients of ANF(f) from the truth table of the Boolean function f and vice versa. It is based on the following matrices:
$$A_0 = 1, \quad A_1 = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \quad A_n = \begin{pmatrix} A_{n-1} & 0 \\ A_{n-1} & A_{n-1} \end{pmatrix} = A_1 \otimes A_{n-1} \quad \text{for } n > 1.$$
Actually, the i-th column of $A_n$ is the truth table of the monomial $m_i = x^{\alpha_i}$. Using these matrices, we have $(f_{ANF})^t = A_n \cdot (TT(f))^t$ and $TT(f)^t = A_n \cdot (f_{ANF})^t$, with all operations over $\mathbb{F}_2$. The complexity of the algorithms for both the fast Walsh and Möbius transforms is $O(n 2^n)$, and they require $O(2^n)$ memory units.
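A sequential sketch of the fast Möbius transform follows the same butterfly pattern, with XOR in place of the sum and difference. The code below is an illustration only: the names are hypothetical, the truth table is assumed to be a 0/1 vector of length $2^n$, and the degree of a constant function (including the zero function) is reported as 0 by convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Butterfly form of the Moebius transform over F_2: in every stage the upper
// cell of each pair receives the XOR of both cells (multiplication by A_1).
std::vector<int> fast_moebius_transform(const std::vector<int>& tt) {
    const std::size_t size = tt.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> anf(tt);
    for (std::size_t half = 1; half < size; half <<= 1)
        for (std::size_t block = 0; block < size; block += 2 * half)
            for (std::size_t j = block; j < block + half; ++j)
                anf[j + half] ^= anf[j];
    return anf;
}

// Hamming weight of an index, i.e. the degree of the monomial x^u.
static int weight(std::size_t u) {
    int w = 0;
    while (u != 0) { ++w; u &= u - 1; }
    return w;
}

// Algebraic degree: largest weight of an index u with ANF coefficient a_u = 1.
int algebraic_degree(const std::vector<int>& tt) {
    const std::vector<int> anf = fast_moebius_transform(tt);
    int deg = 0;
    for (std::size_t u = 0; u < anf.size(); ++u)
        if (anf[u] != 0) deg = std::max(deg, weight(u));
    return deg;
}
```

Since $A_n$ is its own inverse over $\mathbb{F}_2$, applying `fast_moebius_transform` twice returns the original truth table.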
A vectorial Boolean function $S : \mathbb{F}_2^n \to \mathbb{F}_2^m$ (also called an $(n, m)$ S-box or, for short, an S-box) can be represented by the vector $(f_1, f_2, \ldots, f_m)$, where $f_i$ are Boolean functions of n variables, $i = 1, 2, \ldots, m$. The functions $f_i$ are called the coordinate functions of the S-box. Then the $m \times 2^n$ matrix
$$G_S = \begin{pmatrix} TT(f_1) \\ \vdots \\ TT(f_m) \end{pmatrix}$$
represents the considered S-box, where $TT(f_i)$ is the truth table of the Boolean function $f_i$, $i = 1, \ldots, m$. An S-box is bijective if $n = m$ and S is an invertible function.
In order to study the cryptographic properties of a vectorial Boolean function S related to linearity, algebraic degree, and autocorrelation, we need to consider all non-zero linear combinations of the coordinate functions of the S-box, denoted by
$$S_b = b \cdot G_S = b_1 f_1 \oplus \cdots \oplus b_m f_m,$$
where $b = (b_1, \ldots, b_m) \in \mathbb{F}_2^m \setminus \{0\}$. These are the component functions of the S-box S.
The Walsh spectrum of S is defined as the collection of all of the Walsh spectra of the component functions of S. The linearity and nonlinearity of the vectorial Boolean function are defined as
$$Lin(S) = \max_{b \in \mathbb{F}_2^m \setminus \{0\}} Lin(S_b), \quad nl(S) = \min_{b \in \mathbb{F}_2^m \setminus \{0\}} nl(S_b).$$
In order to obtain the important parameters of an S-box, we use four tables, namely a linear approximation table (LAT), a difference distribution table (DDT), an autocorrelation table (ACT), and a table with the algebraic degrees (ADT) of the monomials in the component Boolean functions of the considered S-box. We define these tables below.
The $2^n \times 2^m$ table whose entries are defined by
$$L_{a,b} = |\{ x \in \mathbb{F}_2^n : S_b(x) = a \cdot x \}| - 2^{n-1}, \quad a \in \mathbb{F}_2^n, \; b \in \mathbb{F}_2^m,$$
is called the linear approximation table of S and is denoted by $LAT(S)$. The elements of $LAT(S)$ show the relationship between the inputs and outputs of the S-box. Since $S_b^W(a) = 2 L_{a,b}$, the Walsh spectrum and the linear approximation table of an S-box are closely related, and by computing one of these parameters, we obtain the other. Therefore, we actually compute the Walsh spectrum instead of $LAT(S)$ in order to find the linearity and nonlinearity of S.
Another important parameter related to an S-box S is its algebraic degree. We define this as the maximum among all degrees of the component functions, or $\deg(S) = \max_{b \in \mathbb{F}_2^m \setminus \{0\}} \deg(S_b)$. The minimum degree is also important regarding algebraic attacks. Therefore, we define the maximal and the minimal algebraic degree of the vectorial Boolean function S as
$$\max\deg(S) = \max_{b \in \mathbb{F}_2^m \setminus \{0\}} \deg(S_b), \quad \min\deg(S) = \min_{b \in \mathbb{F}_2^m \setminus \{0\}} \deg(S_b).$$
The autocorrelation spectrum $ACT$ of the vectorial Boolean function S is defined as the collection of all autocorrelation spectra of its component functions. In fact, we consider $ACT(S)$ as a $2^n \times 2^m$ autocorrelation matrix, whose columns represent the autocorrelation functions of all component Boolean functions of S. The autocorrelation (or the maximal absolute autocorrelation value) $AC(S)$ is defined as:
$$AC(S) = \max_{b \in \mathbb{F}_2^m \setminus \{0\}} AC(S_b).$$
The difference distribution table (DDT) is a $2^n \times 2^m$ table whose entries are defined as
$$DDT(S)_{\alpha,\beta} = |\{ x \in \mathbb{F}_2^n \mid S(x) \oplus S(x \oplus \alpha) = \beta \}|, \quad \alpha \in \mathbb{F}_2^n, \; \beta \in \mathbb{F}_2^m.$$
The differential uniformity, denoted by $\delta(S)$, is defined as the largest value in the difference distribution table outside the row $\alpha = 0$, whose first entry is always $2^n$, or
$$\delta(S) = \max_{\alpha \ne 0,\, \beta} DDT(S)_{\alpha,\beta}.$$
We are looking for S-boxes that have a differential uniformity as low as possible. It is well known that $\delta(S)$ takes only even values in the interval $[2^{n-m}, 2^n]$. The smallest possible value of $\delta$ in the case of bijective S-boxes ($n = m$) is 2.
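To make the definition of the DDT concrete, the sketch below builds the full table for a small bijective S-box given as a vector of integers and reads off the differential uniformity. The names, the integer-vector representation, and the restriction to $n = m$ are assumptions of this example; the full table needs $2^{2n}$ counters, so it is only practical for small n.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Full difference distribution table of an n x n S-box (hypothetical layout:
// sbox[x] is the integer form of S(x), and the vector has length 2^n).
std::vector<std::vector<int>> difference_distribution_table(
        const std::vector<std::uint32_t>& sbox) {
    const std::size_t size = sbox.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    for (std::uint32_t s : sbox) assert(s < size);
    std::vector<std::vector<int>> ddt(size, std::vector<int>(size, 0));
    for (std::size_t alpha = 0; alpha < size; ++alpha)
        for (std::size_t x = 0; x < size; ++x)
            ++ddt[alpha][sbox[x] ^ sbox[x ^ alpha]];
    return ddt;
}

// Differential uniformity: largest entry over all rows with alpha != 0.
int differential_uniformity(const std::vector<std::uint32_t>& sbox) {
    const std::vector<std::vector<int>> ddt = difference_distribution_table(sbox);
    int delta = 0;
    for (std::size_t alpha = 1; alpha < ddt.size(); ++alpha)
        delta = std::max(delta,
                         *std::max_element(ddt[alpha].begin(), ddt[alpha].end()));
    return delta;
}
```

For the 3-bit S-box (0, 1, 5, 6, 7, 2, 3, 4), which is field inversion in $GF(2^3)$ with modulus $x^3 + x + 1$, every nonzero row contains only the values 0 and 2, so $\delta(S) = 2$. For the identity map every row has a single entry equal to $2^n$.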

3. GPU and CUDA

One way to understand the difference between a CPU and a GPU is to compare the ways they process tasks. A CPU usually consists of a few cores optimized for sequential processing. These cores have powerful ALUs, large caches, and sophisticated control. Modern NVIDIA GPUs have their own memory and a massively parallel architecture consisting of thousands of smaller cores designed for handling multiple tasks simultaneously. These cores have a throughput-oriented design with small caches, simple control, and energy-efficient ALUs, and they require a massive number of threads to tolerate latency. A GPU is very convenient when manipulating large data or using a high number of threads in the single-instruction multiple-data (SIMD) programming model [29].
The CUDA programming platform allows programmers to interact directly with GPUs and run parallel parts of programs using the advantages of GPU architecture [9]. CUDA C is a programming language close to C by syntax, but conceptually and semantically it is quite different from C. The source code for CUDA applications consists of a mixture of conventional C/C++ host code and GPU device functions.
The processing of the data flow has several steps. At the highest level, we have a master process which runs on the CPU and performs the following steps: prepares data for manipulation, allocates memory on GPU, copies data from the host (CPU) to the GPU global memory, launches multiple instances of the execution “kernel” on GPU, copies data from the GPU memory to the host, deallocates all memory, and terminates. In a program, a parallel GPU part can be activated many times with different data and manipulations.
Functions for parallel execution on the GPU are written in units called kernels. Syntactically, a kernel is very similar to a C/C++ function. Semantically, however, it differs in several respects. Its launch specifies a grid of threads that performs the parallel execution of the calculations. The launch of a kernel, which contains the configuration of the grid, has the following form:
kernel_name<<<grid blocks, threads per block>>>( ),
where "kernel_name" is a usual name (identifier) and 'grid blocks' and 'threads per block' are positive integers. The body of a kernel is program code that describes the work of a single thread of the grid. Every thread has its own number in the grid of threads, and this number determines which part of the data is processed by that particular thread. Kernels are executed over the stream of data by many threads on a device in parallel. A thread is a process that performs a series of programming instructions, and it is a single instance of the kernel. Threads are organized into blocks, which are sets of threads that can communicate and synchronize their execution. At most 1024 threads per block can be launched. Multiple blocks can be executed simultaneously. To launch a kernel, its configuration (the number of blocks and the number of threads per block) has to be chosen first. Blocks and threads per block form a grid. All threads run the same code.
The threads are executed in groups of 32 threads called "warps". Usually, all of the threads in a warp execute the same instruction at the same time. The only difference is the input data, which depends on the unique number of the thread in its block and of the block in the grid.
The memory model has the following features. Every thread has access to the global memory, which is the slowest one, but threads from different blocks can communicate through it. Each block has its own shared memory, which serves the communication between all of the threads in one block and is much faster than the global one. Each thread uses a small amount of fast local memory. The variables in the global memory, unlike the others, are preserved after the execution of a kernel terminates.
We would like to mention some features of the CUDA model that are especially important for the efficiency of GPU calculations. Creating and destroying threads takes a negligible amount of time, since it only declares which resources will be needed, so it does not affect performance. The time required to transfer data from the main memory to the global GPU memory and vice versa in many cases turns out to significantly lengthen the duration of calculations. Therefore, the master process has to manage the overall performance by running different kernels sequentially (if possible) without intermediate data transfer and only returning the final result.

4. Strategies in Algorithms and Data Organization

In order to discuss the implemented strategies and how the data is organized in the memory, it is necessary to show the model of the library BoolSPLG. Its structure is presented in Figure 1.

4.1. Data Organization

One way to represent the $n \times n$ vectorial Boolean function S that we use is through the truth tables of its coordinate functions. Therefore, an $n \times 2^n$ matrix is needed. For convenience, we use the integers corresponding to the binary representation of the columns of this matrix, so S is defined by the integer vector $(s_0, s_1, \ldots, s_{2^n-1})$. This representation has several advantages: data from the main memory is transferred to the GPU memory much faster, and the value of the function S for the input vector $v \in \mathbb{F}_2^n$ corresponds to the v-th coordinate of the array.
Any vector $a \in \mathbb{F}_2^n$ defines the component function $S_a$ as a linear combination of rows of the matrix corresponding to S. The truth table of the component function $S_a$ can be calculated in the following way:
$$TT(S_a) = (a \cdot s_0, a \cdot s_1, \ldots, a \cdot s_{2^n-1}).$$
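With the integer representation, each inner product $a \cdot s_x$ reduces to the parity of a bitwise AND. The short sketch below illustrates this; the function names are hypothetical, and which bit of the integer corresponds to which coordinate function is a convention assumed here rather than one taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Parity of the set bits of v, i.e. the XOR of all its bits.
static int parity(std::uint32_t v) {
    int p = 0;
    while (v != 0) { p ^= 1; v &= v - 1; }
    return p;
}

// Truth table of the component function S_a for an S-box stored as the
// integer vector (s_0, ..., s_{2^n - 1}): entry x is the parity of a AND s_x.
std::vector<int> component_truth_table(const std::vector<std::uint32_t>& sbox,
                                       std::uint32_t a) {
    const std::size_t size = sbox.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> tt(size);
    for (std::size_t x = 0; x < size; ++x) tt[x] = parity(a & sbox[x]);
    return tt;
}
```

For the S-box (0, 1, 5, 6, 7, 2, 3, 4) and a = 5, the result (0, 1, 0, 1, 0, 0, 1, 1) is the XOR of the truth tables obtained for a = 1 and a = 4.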

4.2. Strategies in Algorithms

Effective sequential algorithms serve as a basis for parallel implementation. An in-depth examination of such algorithms, in some cases with different approaches, has been conducted in Ref. [30]. For the relationship between the different linear and differential characteristics, see Ref. [31]. The time complexity of the fast Walsh and Möbius transforms of a Boolean function is $O(n 2^n)$, and the required memory is $O(2^n)$.
The fast Walsh transform (FWT) is the main tool of the functions in the library related to the linear characteristics of the vectorial Boolean functions. In order to apply the fast Walsh transform to a Boolean function with n variables, we consider its truth table as an integer array of length $2^n$. The algorithm requires n steps. In the i-th step, the sum and the difference of the integers from the j-th and r-th cells (depending on i) of the current tuple must be written in the j-th and r-th cells of the new tuple. Therefore, the same array and variables can be used for the results of all of the steps. Each thread can calculate the values of two elements without communicating with other threads. When using shared memory, it is convenient to calculate several steps of the FWT in one thread on a given part of the array. A detailed description is presented in Ref. [32].
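The statement that each thread handles two cells can be illustrated by emulating the threads with an ordinary loop. In the sequential C++ sketch below, `butterfly_thread` plays the role of one thread with number t in a given step; the index formula is one possible mapping and is an assumption of this example, not the mapping used by the library kernels (see Ref. [32] for those).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work of one emulated thread t in the given step: it owns the cells j and
// r = j + 2^step and replaces them by their sum and difference.
void butterfly_thread(std::vector<int>& data, unsigned step, std::size_t t) {
    const std::size_t half = std::size_t{1} << step;
    // Insert a 0 bit at position `step` of t to obtain the lower index j.
    const std::size_t j = ((t >> step) << (step + 1)) | (t & (half - 1));
    const std::size_t r = j + half;
    const int u = data[j];
    const int v = data[r];
    data[j] = u + v;
    data[r] = u - v;
}

// Walsh spectrum computed with 2^(n-1) emulated threads per step.
std::vector<int> walsh_thread_style(const std::vector<int>& tt) {
    const std::size_t size = tt.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> w(size);
    for (std::size_t x = 0; x < size; ++x) w[x] = 1 - 2 * tt[x];
    for (unsigned step = 0; (std::size_t{1} << step) < size; ++step) {
        // On a GPU these iterations would run as parallel threads, with a
        // synchronization barrier between consecutive steps.
        for (std::size_t t = 0; t < size / 2; ++t) butterfly_thread(w, step, t);
    }
    return w;
}
```

Within one step the pairs (j, r) of different threads are disjoint, which is why no communication between threads is needed until the step is finished.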
When calculating the linearity, nonlinearity, or LAT of a vectorial Boolean function S, we need all of its component functions (not only the coordinate functions). For small n, we list all of the component functions in an array of size $2^{2n}$. After the first n steps of the fast Walsh transform are applied to this array, we obtain the Walsh spectrum of S. This is enough to find the linearity (nonlinearity) of the vectorial Boolean function. The result is obtained by finding the maximum element in the set of absolute values of the coordinates of the obtained vector that belong to the nonzero component functions (by reduction). For larger values of n (when the hardware resource of the GPU is insufficient), in order to determine the linearity of the vectorial Boolean function S, the Walsh transform of each component Boolean function has to be calculated separately (see Algorithm 1).
Algorithm 1: Linearity of an S-box.
Input: An S-box $S\_box$
Output: The linearity $Lin(S\_box)$
sequential: copy $S\_box$ (as it is represented) to GPU memory
sequential: $Lin(S\_box) = 0$
sequential: for all component Boolean functions of $S\_box$ do
{
    parallel: get TT of current component function f
    parallel: calculate $W_f$ (the Walsh spectrum of f)
    parallel: compute the linearity of f
    parallel: update the value of $Lin(S\_box)$ // using one thread
}
sequential: copy the linearity of the S-box to RAM
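The control flow of Algorithm 1 can be mirrored by a purely sequential C++ sketch in which the steps marked as parallel become ordinary loops. The function names and the integer-vector S-box layout are assumptions of this illustration; it is not the library implementation, and no GPU transfer takes place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

static int parity(std::uint32_t v) {
    int p = 0;
    while (v != 0) { p ^= 1; v &= v - 1; }
    return p;
}

// Linearity of an n x n S-box, one component function at a time.
int sbox_linearity(const std::vector<std::uint32_t>& sbox) {
    const std::size_t size = sbox.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    std::vector<int> w(size);
    int lin = 0;
    for (std::size_t b = 1; b < size; ++b) {  // nonzero components only
        // Polarity truth table of the current component function S_b.
        for (std::size_t x = 0; x < size; ++x)
            w[x] = 1 - 2 * parity(static_cast<std::uint32_t>(b) & sbox[x]);
        // Walsh spectrum of S_b by the butterfly algorithm.
        for (std::size_t half = 1; half < size; half <<= 1)
            for (std::size_t block = 0; block < size; block += 2 * half)
                for (std::size_t j = block; j < block + half; ++j) {
                    const int u = w[j];
                    const int v = w[j + half];
                    w[j] = u + v;
                    w[j + half] = u - v;
                }
        // Linearity of S_b and update of the running maximum.
        for (int c : w) lin = std::max(lin, std::abs(c));
    }
    return lin;
}
```

Only one array of $2^n$ integers is kept at any time, which is the reason for processing the components separately when n is large.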
For the autocorrelation properties of S-boxes with $n > 10$, we use the following Algorithm 2.
Algorithm 2: The autocorrelation of an S-box.
Input: An S-box $S\_box$
Output: $AC(S\_box)$
sequential: copy $S\_box$ to GPU memory
sequential: current autocorrelation value $AC(S\_box) = 0$
sequential: for all component Boolean functions of $S\_box$ do
{
    parallel: get TT of current component function f
    parallel: calculate $W_f$ and $g = (W_f)^2$
    parallel: compute $2^{-n} W_g$
    parallel: update the value of $AC(S\_box)$
}
sequential: copy $AC(S\_box)$ to RAM
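Algorithm 2 relies on the identity $r_f(w) = 2^{-n} \sum_u (f^W(u))^2 (-1)^{u \cdot w}$, so the autocorrelation spectrum is obtained by applying the Walsh butterflies twice with a squaring in between. The sequential sketch below follows that scheme; the names are hypothetical, 64-bit integers are used because the intermediate sums can reach $2^{3n}$, and the zero shift is skipped when the maximum is taken.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

static int parity(std::uint32_t v) {
    int p = 0;
    while (v != 0) { p ^= 1; v &= v - 1; }
    return p;
}

// In-place Walsh-Hadamard butterflies on an integer array of length 2^n.
static void walsh_butterflies(std::vector<long long>& v) {
    const std::size_t size = v.size();
    for (std::size_t half = 1; half < size; half <<= 1)
        for (std::size_t block = 0; block < size; block += 2 * half)
            for (std::size_t j = block; j < block + half; ++j) {
                const long long a = v[j];
                const long long b = v[j + half];
                v[j] = a + b;
                v[j + half] = a - b;
            }
}

// AC(S): maximum |r(w)| over all nonzero components and nonzero shifts w.
long long sbox_autocorrelation(const std::vector<std::uint32_t>& sbox) {
    const std::size_t size = sbox.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    const long long points = static_cast<long long>(size);  // 2^n
    std::vector<long long> v(size);
    long long ac = 0;
    for (std::size_t b = 1; b < size; ++b) {
        for (std::size_t x = 0; x < size; ++x)
            v[x] = 1 - 2 * parity(static_cast<std::uint32_t>(b) & sbox[x]);
        walsh_butterflies(v);            // W_f
        for (long long& c : v) c *= c;   // g = (W_f)^2
        walsh_butterflies(v);            // W_g = 2^n * r_f
        for (std::size_t w = 1; w < size; ++w)
            ac = std::max<long long>(ac, std::llabs(v[w] / points));
    }
    return ac;
}
```

For the 3-bit S-box (0, 1, 5, 6, 7, 2, 3, 4) the result is 8, because every quadratic function of an odd number of variables has a shift for which the derivative is constant.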
We would like to note the following points in the calculation of the algebraic degree of a Boolean function. Since $ANF(f)$ can be computed via a fast Möbius transform from the truth table of a Boolean function, this can be achieved using only bitwise operations. Compared to the Walsh transform, this is a very significant advantage and allows us to use a bitwise representation of the truth tables. Note that the i-th coordinate of the vector of $ANF(f)$ corresponds to the monomial $x^i$, and the degree of this monomial is equal to the Hamming weight of i. This can easily be calculated by a single thread. A detailed description of the corresponding parallel butterfly algorithms is given in Ref. [33].
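The bitwise variant can be sketched as follows. The truth table is packed into 64-bit words (bit x mod 64 of word x / 64 holds f(x)), the first six butterfly steps are performed inside every word with masks and shifts, and the remaining steps XOR whole words. The packing convention, the names, and the requirement that unused high bits are zero when $n < 6$ are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Moebius transform on a bit-packed truth table of a function of n variables.
std::vector<std::uint64_t> packed_moebius(std::vector<std::uint64_t> words,
                                          unsigned n) {
    static const std::uint64_t mask[6] = {
        0x5555555555555555ULL, 0x3333333333333333ULL, 0x0F0F0F0F0F0F0F0FULL,
        0x00FF00FF00FF00FFULL, 0x0000FFFF0000FFFFULL, 0x00000000FFFFFFFFULL};
    const std::size_t expected =
        (n <= 6) ? 1 : (std::size_t{1} << (n - 6));
    assert(words.size() == expected);
    const unsigned inner = (n < 6) ? n : 6;
    // Steps inside a word: bits whose index has bit i equal to 0 are XORed
    // into the partner position 2^i places higher, 64 cells at a time.
    for (std::uint64_t& w : words)
        for (unsigned i = 0; i < inner; ++i)
            w ^= (w & mask[i]) << (1u << i);
    // Steps across words: the same butterfly applied to whole words.
    for (std::size_t half = 1; half < words.size(); half <<= 1)
        for (std::size_t block = 0; block < words.size(); block += 2 * half)
            for (std::size_t j = block; j < block + half; ++j)
                words[j + half] ^= words[j];
    return words;
}

// Algebraic degree from a packed ANF: largest Hamming weight of a set index.
int packed_degree(const std::vector<std::uint64_t>& anf) {
    int deg = 0;
    for (std::size_t k = 0; k < anf.size(); ++k)
        for (unsigned bit = 0; bit < 64; ++bit)
            if ((anf[k] >> bit) & 1ULL) {
                std::size_t u = k * 64 + bit;
                int weight = 0;
                while (u != 0) { ++weight; u &= u - 1; }
                deg = std::max(deg, weight);
            }
    return deg;
}
```

Each word-level operation processes 64 cells of the butterfly at once, which is the additional degree of parallelism that the bitwise representation provides.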
From the definition of the DDT for a vectorial Boolean function, we can directly derive a basic algorithm. We enumerate all of the values of the input difference $\Delta$. For each possible difference, we initialize to zero an array $DDT_\Delta(S)$ of $2^n$ cells, one for each possible output difference. Then, for each pair $(x, x \oplus \Delta)$ with the prescribed input difference, its output difference $S(x) \oplus S(x \oplus \Delta)$ is computed and the corresponding counter in $DDT_\Delta(S)$ is incremented. The runtime for the sequential implementation is $O(2^{2n})$.
In our parallel implementation, each thread calculates only one output difference $S(x) \oplus S(x \oplus \Delta)$. The second part, increasing the value of the corresponding DDT cell, is more difficult. More than one thread can yield the same value for the output difference, so several threads may have to update the same cell at the same time. To avoid "race conditions", we use so-called atomic operators.
For S-boxes of fewer than 15 variables, all of the tables that we used are generated and accessible through the functions of the library.
In this case, the DDT for all of the input differences is generated in parallel. For S-boxes of more than 14 variables, the rows of each of the tables are generated sequentially (see Algorithm 3), but the elements of one row are generated in parallel.
Algorithm 3: The differential uniformity of an S-box
Input: An S-box $S\_box$
Output: $\delta(S\_box)$
sequential: copy $S\_box$ to GPU memory
sequential: $\delta(S\_box) = 2^n$
sequential: for all $\Delta \in \mathbb{F}_2^n$ do
{
    parallel: compute $DDT_\Delta(S\_box)$
    parallel: find current $\delta(S\_box)$
}
sequential: copy $\delta(S\_box)$ to RAM
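The row-by-row strategy can be sketched sequentially with a single reusable row of $2^n$ counters, so that the memory stays at $O(2^n)$ instead of $O(2^{2n})$. The function name and the integer-vector S-box layout are hypothetical. As a choice made in this sketch, the running maximum starts at 0 and the row $\Delta = 0$ is skipped, because that row always contains the value $2^n$.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Differential uniformity of an n x n S-box, one DDT row at a time.
int differential_uniformity_rowwise(const std::vector<std::uint32_t>& sbox) {
    const std::size_t size = sbox.size();
    assert(size != 0 && (size & (size - 1)) == 0);
    for (std::uint32_t s : sbox) assert(s < size);
    std::vector<int> row(size);
    int delta = 0;
    for (std::size_t d = 1; d < size; ++d) {   // input difference, d != 0
        std::fill(row.begin(), row.end(), 0);  // reset the single row
        // On a GPU each x would be handled by its own thread and the
        // increment would have to be an atomic operation.
        for (std::size_t x = 0; x < size; ++x) ++row[sbox[x] ^ sbox[x ^ d]];
        delta = std::max(delta, *std::max_element(row.begin(), row.end()));
    }
    return delta;
}
```

The loop over the input differences corresponds to the sequential part of Algorithm 3, while the two inner operations correspond to its parallel steps.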

5. Experimental Results

A server with two different GPU devices was used to evaluate the efficiency of the implemented library. Their parameters are listed in Table 1. One of the GPUs, presented as Device 0, is much more powerful than the other (Device 1).
The average times for calculating the considered cryptographic parameters per 100 randomly generated invertible S-boxes with n variables for any $n = 8, 9, \ldots, 20$ were obtained. The results for the different parameters were systematized in Table 2, Table 3, Table 4 and Table 5. The first column in each table shows the size of the considered S-box. The second column contains the average time required to find the parameter with a sequential program using a single CPU core. The next column shows the time required to find the corresponding parameter using Device 0. Then, the speedup found between the sequential and parallel implementation methods is given by the formula
$$S_p = \frac{T_0(n)}{T_p(n)},$$
where n is the number of variables in the S-box, $T_0(n)$ is the execution time of the fastest known sequential algorithm, and $T_p(n)$ is the execution time of the parallel algorithm. The speedup of the parallel algorithm is given in the columns CPU vs. Dev0 and CPU vs. Dev1, respectively. For example, the linearity of an S-box of size $2^8$ is calculated in 0.863 ms by the CPU, while Device 0 performs the calculation in about 0.223 ms. This means it gives a 3.86-times better execution time, written in the table as $\times 3.86$.
It should be noted that the time required for the parallel implementation includes the time used for data transfer from RAM to device memory and vice versa. The next two columns provide the execution time of Device 1 and the corresponding acceleration. The test results show the following: the parallel implementation is much more efficient for S-boxes with larger parameters. In the parallel implementation, the algebraic degree is calculated the fastest and the DDT is calculated the slowest. Device 0 gives much better acceleration in most cases.
As can be seen, the speedup drops for S-boxes of particular sizes. This occurs due to the CUDA memory hierarchy model. The functions in the library generate the following tables related to S-boxes: a linear approximation table LAT(S), an autocorrelation table ACT(S), a table with the algebraic degrees of the monomials in the component Boolean functions ADT(S), and a difference distribution table DDT(S). In the case of S-boxes with fewer than 11 variables, all component functions are calculated simultaneously (LAT, ACT, and ADT). The necessary transformations of the functions to one vector saved in the global memory are also performed simultaneously. For S-boxes with 11 or more variables, the component functions are generated one after the other. Further, in the case of S-boxes with more than 14 variables, the rows of the DDT are generated sequentially, and this takes more time as well.
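To make the "one component function after the other" mode concrete, the following sequential sketch computes the linearity by applying a fast Walsh transform to the polarity truth table of each nonzero component function in turn. It is an illustrative reference under stated assumptions, not the library's implementation: the names `parity`, `fast_walsh_transform` and `linearity` are hypothetical, the S-box is assumed to have as many output bits as input bits, and the linearity is taken as the largest absolute Walsh coefficient.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Parity of the set bits of v (1 if their number is odd).
static int parity(std::uint32_t v) {
    int p = 0;
    while (v) { p ^= 1; v &= v - 1; }
    return p;
}

// In-place fast Walsh transform (butterfly scheme) of a vector of length 2^n.
static void fast_walsh_transform(std::vector<std::int32_t>& w) {
    for (std::size_t half = 1; half < w.size(); half <<= 1)
        for (std::size_t base = 0; base < w.size(); base += 2 * half)
            for (std::size_t j = base; j < base + half; ++j) {
                const std::int32_t a = w[j], b = w[j + half];
                w[j] = a + b;
                w[j + half] = a - b;
            }
}

// Largest absolute Walsh coefficient over all nonzero component functions.
// Hypothetical helper; assumes n output bits for an n-bit input.
std::int32_t linearity(const std::vector<std::uint32_t>& sbox) {
    const std::size_t size = sbox.size();
    assert(size >= 2 && (size & (size - 1)) == 0);  // table size must be 2^n

    std::int32_t lin = 0;
    std::vector<std::int32_t> ptt(size);
    for (std::uint32_t b = 1; b < size; ++b) {      // one component at a time
        for (std::size_t x = 0; x < size; ++x)
            ptt[x] = parity(b & sbox[x]) ? -1 : 1;  // polarity truth table of b.S
        fast_walsh_transform(ptt);                  // ptt now holds one LAT row
        for (std::int32_t c : ptt) lin = std::max(lin, std::abs(c));
    }
    return lin;
}
```

Each transformed vector corresponds to the LAT entries of one component function, so processing all components simultaneously amounts to running these independent transforms side by side. For the 4-bit PRESENT S-box the routine returns 8.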
One million randomly generated invertible 16-bit S-boxes have been studied. Table 6 provides information on the best S-boxes in terms of the considered cryptographic parameters. For comparison, the table also presents the parameters of one S-box generated in a different way from a quasi-cyclic code.
In addition, we compared the parallel version of the presented library with the packages SageMath v9.8 and SET. Table 7 gives the calculation times of the following cryptographic parameters of the S-boxes: linearity, differential uniformity, algebraic degree, and autocorrelation. The computing environment for BoolSPLG and SET is presented in Table 1, while for SageMath we used SageMathCell. Computing the linearity and the differential uniformity of S-boxes of sizes bigger than $2^{12}$ is not possible in SageMath. The calculation of autocorrelation is also not included in this package.

6. Conclusions and Future Work

In this article, we presented a C++ library with sequential and parallel functions, the latter implemented in CUDA C, designed to analyze large vectorial Boolean functions from a cryptographic perspective. The parallel functions for many of the parameters are up to 60 times faster, which makes them convenient to use in ambitious research projects. The library has several opportunities for development. One is in the direction of universality. We plan to expand it so that it can be used for any type of vectorial Boolean function, not just bijective ones.
Another direction is to present more detailed information about the studied cryptographic parameters. For example, S-boxes that have the same differential uniformity but different differential spectra can perform differently in terms of resistance against differential attacks. Thus, some design criteria impose restrictions on the differential spectra of the S-box. Another example is related to the algebraic degree. Note that, unlike the algebraic degree itself, the minimum among the degrees of the coordinate functions does not in general equal the minimum among the degrees of the component functions. Moreover, the number of component and coordinate functions with minimum (or maximum) degree is also important in some cases. In the current version, the library only calculates the value of the smallest (or largest) degree of the component functions.
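The difference between the two minima can be seen on a tiny example. The sketch below computes the degree of each component function with an in-place fast Möbius transform and records the minimum over the coordinate functions (single-bit masks) and over all nonzero component functions. All names (`weight`, `component_degree`, `MinDegrees`, `min_degrees`) are hypothetical and chosen for this illustration; the degree of the zero function is taken to be 0 here, which is a convention of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Number of set bits of v.
static int weight(std::uint32_t v) {
    int c = 0;
    while (v) { ++c; v &= v - 1; }
    return c;
}

// Degree of the Boolean function x -> parity(mask & S(x)), obtained from
// its algebraic normal form via an in-place fast Moebius transform.
static int component_degree(const std::vector<std::uint32_t>& sbox, std::uint32_t mask) {
    const std::size_t size = sbox.size();
    std::vector<std::uint8_t> anf(size);
    for (std::size_t x = 0; x < size; ++x)
        anf[x] = static_cast<std::uint8_t>(weight(mask & sbox[x]) & 1);
    for (std::size_t step = 1; step < size; step <<= 1)
        for (std::size_t x = 0; x < size; ++x)
            if (x & step) anf[x] ^= anf[x ^ step];
    int deg = 0;  // convention of this sketch: the zero function gets degree 0
    for (std::size_t x = 0; x < size; ++x)
        if (anf[x]) deg = std::max(deg, weight(static_cast<std::uint32_t>(x)));
    return deg;
}

struct MinDegrees {
    int coordinate;  // minimum over single output bits
    int component;   // minimum over all nonzero linear combinations
};

MinDegrees min_degrees(const std::vector<std::uint32_t>& sbox, unsigned out_bits) {
    assert(out_bits >= 1 && out_bits < 32);
    assert(sbox.size() >= 2 && (sbox.size() & (sbox.size() - 1)) == 0);
    MinDegrees r{std::numeric_limits<int>::max(), std::numeric_limits<int>::max()};
    for (std::uint32_t mask = 1; mask < (1u << out_bits); ++mask) {
        const int d = component_degree(sbox, mask);
        r.component = std::min(r.component, d);
        if ((mask & (mask - 1)) == 0)  // exactly one bit set: a coordinate function
            r.coordinate = std::min(r.coordinate, d);
    }
    return r;
}
```

For the (non-bijective) two-bit map with lookup table {0, 2, 0, 1}, both coordinate functions have degree 2, while their sum is the linear function $x_0$; `min_degrees` therefore reports 2 for the coordinates and 1 for the components.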
In the parallel functions of the current implementation, the calculations are conducted in two ways. One way (for smaller parameters) is to perform calculations for all of the component functions simultaneously. The other is to perform calculations for each of the component functions separately. In cases where the component functions are not of sufficient size, the second method of calculation does not provide good acceleration (as can be seen from the experimental results). In this case, it is more efficient to make calculations on appropriate groups of component functions.

Author Contributions

Conceptualization, D.B. and I.B.; methodology, D.B. and I.B.; software, D.B.; validation, D.B.; formal analysis, D.B. and I.B.; investigation, D.B., I.B. and M.D.-S.; resources, D.B., I.B. and M.D.-S.; data curation, D.B.; writing—original draft preparation, D.B., I.B. and M.D.-S.; writing—review and editing, M.D.-S.; visualization, M.D.-S.; supervision, D.B., I.B. and M.D.-S.; project administration, I.B.; funding acquisition, M.D.-S. All authors have read and agreed to the published version of the manuscript.

Funding

The research of Dushan Bikov and Iliya Bouyukliev is partially supported by the Bulgarian National Science Fund under contract no. KP-06-N62/2/13.12.2022. The research of Mariya Dzhumalieva-Stoeva was supported, in part, by a Bulgarian NSF contract KP-06-N32/2-2019.

Data Availability Statement

The library BoolSPLG, as well as the user manual and documentation, is available at https://github.com/BoolSPLG/BoolSPLG-v0.3 and https://doi.org/10.5281/zenodo.7825493, accessed on 15 March 2023. There are also detailed descriptions of each function of the library and test examples. A CMake file is provided for easy compilation.

Acknowledgments

We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BoolSPLG: Boolean functions and S-box Parallel Library for GPU
SIMD: Single Instruction, Multiple Data
CUDA: Compute Unified Device Architecture
GPU: Graphic Processing Unit
CPU: Central Processing Unit
ALU: Arithmetic-Logic Unit
LAT: Linear Approximation Table
TT: Truth Table
PTT: Polarity Truth Table
ANF: Algebraic Normal Form
FWT: Fast Walsh Transform
FMT: Fast Möbius Transform
DDT: Difference Distribution Table
ACT: Autocorrelation Table
ADT: Algebraic Degree Table

References

  1. Shetty, V.S.; Anusha, R.; Dileep Kumar, M.J.; Hegde, P. A survey on performance analysis of block cipher algorithms. In Proceedings of the 2020 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–28 February 2020; pp. 167–174. [Google Scholar]
  2. Kelly, M.; Kaminsky, A.; Kurdziel, M.; Łukowiak, M.; Radziszowski, S. Customizable sponge-based authenticated encryption using 16-bit s-boxes. In Proceedings of the MILCOM 2015-2015 IEEE Military Communications Conference, Tampa, FL, USA, 26–28 October 2015; pp. 43–48. [Google Scholar]
  3. Canteaut, A.; Duval, S.; Leurent, G.; Naya-Plasencia, M.; Perrin, L.; Pornin, T.; Schrottenloher, A. Saturnin: A suite of lightweight symmetric algorithms for post-quantum security. IACR Trans. Symmetric Cryptol. 2020, 2020, 160–207. [Google Scholar] [CrossRef]
  4. Matsui, M. New block encryption algorithm MISTY. In Proceedings of the International Workshop on Fast Software Encryption; Springer: Berlin/Heidelberg, Germany, 1997; pp. 54–68. [Google Scholar]
  5. Georgi, I.; Nikolay, N.; Svetla, N. Reversed Genetic Algorithms for Generation of Bijective S-boxes with Good Cryptographic Properties. IACR Cryptol. ePrint Arch. 2014, 2014, 801. [Google Scholar]
  6. Beauchamp, K. Applications of Walsh and Related Functions. With an Introduction to Sequence Theory; Microelectronics and Signal Processing Series; Academic Press, Inc.: London, UK; Orlando, FL, USA, 1985; p. xvi+308. ISBN 0-12-084180-0. [Google Scholar]
  7. Bakoev, V. A method for fast computing the algebraic degree of boolean functions. In Proceedings of the 21st International Conference on Computer Systems and Technologies, Ruse, Bulgaria, 19–20 June 2020; pp. 141–147. [Google Scholar]
  8. Carlet, C.; Crama, Y.; Hammer, P.L. Chapter Eight—Boolean Functions for Cryptography and Error-Correcting Codes. In Boolean Models ad Methods Mathemaics, Computer Science, and Engineering; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  9. Sanders, J.; Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming; Addison-Wesley Professional: Boston, MA, USA, 2010. [Google Scholar]
  10. Jeon, W.; Ko, G.; Lee, J.; Lee, H.; Ha, D.; Ro, W.W. Chapter Six—Deep learning with GPUs. In Hardware Accelerator Systems for Artificial Intelligence and Machine Learning; Advances in Computers; Kim, S., Deka, G.C., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; Volume 122, pp. 167–215. [Google Scholar] [CrossRef]
  11. Xie, Z.; Kwak, A.S.; George, E.; Dozal, L.W.; Van, H.; Jah, M.; Furfaro, R.; Jansen, P. Extracting Space Situational Awareness Events from News Text. arXiv 2022, arXiv:2201.05721. [Google Scholar]
  12. Stone, J.E.; Phillips, J.C.; Freddolino, P.L.; Hardy, D.J.; Trabuco, L.G.; Schulten, K. Accelerating molecular modeling applications with graphics processors. J. Comput. Chem. 2007, 28, 2618–2640. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Bikov, D.; Bouyukliev, I.; Bouyuklieva, S. Bijective S-boxes of different sizes obtained from quasi-cyclic codes. J. Algebra Comb. Discret. Struct. Appl. 2019, 6, 123–134. [Google Scholar] [CrossRef]
  14. Zimmermann, P.; Casamayou, A.; Cohen, N.; Connan, G.; Dumont, T.; Fousse, L.; Maltey, F.; Meulien, M.; Mezzarobba, M.; Pernet, C.; et al. Computational mathematics with SageMath; SIAM: Philadelphia, PA, USA, 2018. [Google Scholar]
  15. Higham, D.J.; Higham, N.J. MATLAB Guide; SIAM: Philadelphia, PA, USA, 2016. [Google Scholar]
  16. Álvarez-Cubero, J.A.; Zufiria, P.J. Algorithm 959: VBF: A library of C++ classes for vector Boolean functions in cryptography. ACM Trans. Math. Softw. (TOMS) 2016, 42, 1–22. [Google Scholar] [CrossRef] [Green Version]
  17. Sasaki, Y.; Ling, S.; Guo, J.; Bao, Z.; Bao, Z.; Guo, J.; Ling, S.; Sasaki, Y.; Commons License, C. PEIGEN—A platform for evaluation, implementation, and generation of S-boxes. IACR Trans. Symmetric Cryptol. 2019, 2019, 330–394. [Google Scholar]
  18. Picek, S.; Batina, L.; Jakobović, D.; Ege, B.; Golub, M. S-box, SET, match: A toolbox for S-box analysis. In Proceedings of the IFIP International Workshop on Information Security Theory and Practice; Springer: Berlin/Heidelberg, Germany, 2014; pp. 140–149. [Google Scholar]
  19. Barrachina, S.; Castillo, M.; Igual, F.D.; Mayo, R.; Quintana-Orti, E.S. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, Miami, FL, USA, 14–18 April 2008; pp. 1–8. [Google Scholar]
  20. Naumov, M.; Chien, L.; Vandermersch, P.; Kapasi, U. Cusparse library. In Proceedings of the GPU Technology Conference, San Jose, CA, USA, 23 September 2010. [Google Scholar]
  21. Lobeiras, J.; Amor, M.; Doallo, R. BPLG: A tuned butterfly processing library for GPU architectures. Int. J. Parallel Program. 2015, 43, 1078–1102. [Google Scholar] [CrossRef]
  22. Vasilache, N.; Johnson, J.; Mathieu, M.; Chintala, S.; Piantino, S.; LeCun, Y. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv 2014, arXiv:1412.7580. [Google Scholar]
  23. Khadem, B.; Ghasemi, R. Improved algorithms in parallel evaluation of large cryptographic S-boxes. Int. J. Parallel Emergent Distrib. Syst. 2020, 35, 461–472. [Google Scholar] [CrossRef]
  24. Kim, G.; Jeon, Y.; Kim, J. Speeding up LAT: Generating a Linear Approximation Table Using a Bitsliced Implementation. IEEE Access 2022, 10, 4919–4923. [Google Scholar] [CrossRef]
  25. Preneel, B.; BRAEKEN, A. Cryptographic Properties of Boolean Functions and S-Boxes; Departement elektrotechniek (ESAT): Leuven, Belgium, 2006. [Google Scholar]
  26. Chabaud, F.; Vaudenay, S. Links between differential and linear cryptanalysis. In Proceedings of the Workshop on the Theory and Application of of Cryptographic Techniques; Springer: Berlin/Heidelberg, Germany, 1994; pp. 356–365. [Google Scholar]
  27. Bakoev, V. Fast computing the algebraic degree of Boolean functions. In Proceedings of the Algebraic Informatics: 8th International Conference, CAI 2019, Niš, Serbia, 30 June–4 July 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 50–63. [Google Scholar]
  28. Good, I.J. The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. Ser. B Methodol. 1958, 20, 361–372. [Google Scholar] [CrossRef]
  29. Hughes, C.J. Single-instruction multiple-data execution. Synth. Lect. Comput. Archit. 2015, 10, 1–121. [Google Scholar]
  30. Joux, A. Algorithmic Cryptanalysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009. [Google Scholar]
  31. Zhang, X.M.; Zheng, Y.; Imai, H. Relating differential distribution tables to other properties of of substitution boxes. Des. Codes Cryptogr. 2000, 19, 45–63. [Google Scholar] [CrossRef]
  32. Bikov, D.; Bouyukliev, I. Parallel fast Walsh transform algorithm and its implementation with CUDA on GPUs. Cybern. Inf. Technol. 2018, 18, 21–43. [Google Scholar] [CrossRef] [Green Version]
  33. Bikov, D.; Bouyukliev, I. Parallel fast Möbius (Reed-Muller) transform and its implementation with CUDA on GPUs. In Proceedings of the International Workshop on Parallel Symbolic Computation, Kaiserslautern, Germany, 23–24 July 2017; pp. 1–6. [Google Scholar]
Figure 1. Classification and module dependencies of the building blocks involved in the library.
Table 1. Description of the used CPU and GPU devices.

| Environment | Platform |
|---|---|
| CPU | Intel Xeon E5-2640, 2.50 GHz |
| Memory | 48 GB DDR3 1333 MHz |
| OS | Windows 7, 64-bit |
| IDE/Compiler | MSVC 2019 |
| CUDA SDK | 10.2 |
| GPU Driver | V 471.96 |

| GPU | Nvidia TITAN X (Pascal) | GeForce GTX TITAN |
|---|---|---|
| | Device 0 | Device 1 |
| Architecture | Pascal | Kepler |
| CUDA Cores | 3584 | 2688 |
| Boost Clock | 1531 MHz | 876 MHz |
| Memory Speed | 10 Gbps | 6 Gbps |
| Global Memory | 12 GB GDDR5X | 6 GB GDDR5 |
| Memory Bandwidth | 480 GB/s | 288.38 GB/s |
Table 2. Computing linearity.

| Size | CPU (ms) | Device 0 (ms) | CPU vs. Dev 0 | Device 1 (ms) | CPU vs. Dev 1 |
|---|---|---|---|---|---|
| $2^{8}$ (256) | 0.863 | 0.223 | ×3.86 | 0.14336 | ×6 |
| $2^{9}$ (512) | 7.654 | 0.340 | ×22.51 | 0.3576 | ×21.4 |
| $2^{10}$ (1024) | 15.855 | 0.318 | ×49.85 | 1.1442 | ×13.8 |
| $2^{11}$ (2048) | 68.343 | 62.774 | ×1 | 72 | ×1 |
| $2^{12}$ (4096) | 314.927 | 149.562 | ×2.1 | 186.106 | ×1.68 |
| $2^{13}$ (8192) | 1257.97 | 287.688 | ×4.83 | 347.72 | ×3.6 |
| $2^{14}$ (16,384) | 5686.84 | 622.29 | ×9.13 | 759.53 | ×7.5 |
| $2^{15}$ (32,768) | 24,083.8 | 1239.54 | ×19.42 | 2141.55 | ×11.2 |
| $2^{16}$ (65,536) | 99,800.3 | 2446.03 | ×40.8 | 6289.8 | ×15.86 |
| $2^{17}$ (131,072) | 442,678 | 6165.59 | ×71.8 | 22,772 | ×19.43 |
| $2^{18}$ (262,144) | 1,677,921 | 22,029.27 | ×76.1 | 90,089 | ×18.62 |
| $2^{19}$ (524,288) | 7,562,247 | 90,786.66 | ×83.29 | 375,519 | ×20.13 |
| $2^{20}$ (1,048,576) | 29,638,868 | 418,942 | ×70.7 | 1,618,402 | ×18.3 |
Table 3. Computing autocorrelation.

| Size | CPU (ms) | Device 0 (ms) | CPU vs. Dev 0 | Device 1 (ms) | CPU vs. Dev 1 |
|---|---|---|---|---|---|
| $2^{8}$ (256) | 2.020 | 0.206 | ×9.8 | 0.319 | ×6.3 |
| $2^{9}$ (512) | 14.212 | 0.435 | ×32.67 | 0.393 | ×36.13 |
| $2^{10}$ (1024) | 37.940 | 0.463 | ×81.94 | 1.462 | ×25.95 |
| $2^{11}$ (2048) | 185.673 | 132.284 | ×1.4 | 144.06 | ×1.28 |
| $2^{12}$ (4096) | 771.527 | 209.130 | ×3.68 | 253.18 | ×3 |
| $2^{13}$ (8192) | 3142.98 | 503.77 | ×6.23 | 418.54 | ×7.5 |
| $2^{14}$ (16,384) | 13,555.2 | 866.02 | ×15.65 | 1094.63 | ×12.3 |
| $2^{15}$ (32,768) | 57,324.5 | 1931.21 | ×29.68 | 3189.85 | ×18 |
| $2^{16}$ (65,536) | 238,216.4 | 3813.45 | ×62.46 | 9621.8 | ×24.76 |
| $2^{17}$ (131,072) | 1,060,294 | 9396.38 | ×112.84 | 34,810.46 | ×30.45 |
| $2^{18}$ (262,144) | 3,832,308 | 33,814.94 | ×113.33 | 133,629.15 | ×28.67 |
| $2^{19}$ (524,288) | 16,860,299 | 138,108.15 | ×122.1 | 566,914 | ×29.7 |
| $2^{20}$ (1,048,576) | 68,227,870 | 629,515 | ×108.38 | 2,411,828 | ×28.2 |
Table 4. Computing differential uniformity.

| Size | CPU (ms) | Device 0 (ms) | CPU vs. Dev 0 | Device 1 (ms) | CPU vs. Dev 1 |
|---|---|---|---|---|---|
| $2^{8}$ (256) | 0.282 | 0.208 | ×1 | 0.136 | ×2 |
| $2^{9}$ (512) | 1.908 | 0.432 | ×4.416 | 0.351 | ×5.4 |
| $2^{10}$ (1024) | 3.591 | 0.366 | ×9.811 | 1.053 | ×3.4 |
| $2^{11}$ (2048) | 14.269 | 0.705 | ×20.23 | 1.313 | ×10.8 |
| $2^{12}$ (4096) | 69.932 | 1.710 | ×40.89 | 2.914 | ×24 |
| $2^{13}$ (8192) | 245.719 | 5.773 | ×42.56 | 11.27 | ×22 |
| $2^{14}$ (16,384) | 998.174 | 21.022 | ×47.48 | 37.64 | ×27 |
| $2^{15}$ (32,768) | 4059.71 | 990.88 | ×4.1 | 1497.6 | ×2.7 |
| $2^{16}$ (65,536) | 18,307.9 | 1924.91 | ×9.5 | 4345 | ×4.2 |
| $2^{17}$ (131,072) | 86,983.9 | 4206.78 | ×20.68 | 14,988.47 | ×5.8 |
| $2^{18}$ (262,144) | 453,136 | 14,319.73 | ×31.64 | 55,269.61 | ×8.2 |
| $2^{19}$ (524,288) | 1,925,444 | 61,572 | ×31.27 | 283,692 | ×6.7 |
| $2^{20}$ (1,048,576) | 9,312,785 | 730,660 | ×12.7 | 1,531,534 | ×6 |
Table 5. Computing degree: $\deg(S)$ (base) and deg_bitwise(S) (bitwise).

| Size | CPU (ms) | Device 0 Base (ms) | Device 0 Bitwise (ms) | CPU vs. Dev 0 Base | CPU vs. Dev 0 Bitwise | Device 1 Base (ms) | Device 1 Bitwise (ms) | CPU vs. Dev 1 Base | CPU vs. Dev 1 Bitwise |
|---|---|---|---|---|---|---|---|---|---|
| $2^{8}$ | 1.127 | 0.122 | 0.115 | ×9.23 | ×9.8 | 0.3193 | 0.2692 | ×3.5 | ×14.1 |
| $2^{9}$ | 10.099 | 0.201 | 0.202 | ×49.9 | ×49 | 0.7321 | 0.3087 | ×13.6 | ×32.46 |
| $2^{10}$ | 20.396 | 0.318 | 0.142 | ×64.1 | ×143.6 | 1.1814 | 0.7277 | ×17.2 | ×28 |
| $2^{11}$ | 102.390 | 76.042 | 12.447 | ×1.3 | ×8.2 | 78.66 | 23.71 | ×1.3 | ×4.3 |
| $2^{12}$ | 369.589 | 162.987 | 28.963 | ×2.2 | ×12.7 | 220.58 | 42.9 | ×1.6 | ×8.6 |
| $2^{13}$ | 1754.26 | 307.870 | 54.385 | ×5.69 | ×32.2 | 366 | 65.68 | ×4.8 | ×26.9 |
| $2^{14}$ | 6822.06 | 720.525 | 279.5 | ×9.4 | ×24.45 | 822.66 | 767.35 | ×8.3 | ×8.9 |
| $2^{15}$ | 28,395.3 | 1410.75 | 489.42 | ×20.1 | ×58 | 2221 | 2658.13 | ×12.78 | ×10.6 |
| $2^{16}$ | 117,070.8 | 2961.87 | 1593.99 | ×39.5 | ×73.5 | 6400 | 9964 | ×18.29 | ×11.74 |
| $2^{17}$ | 492,150.2 | 6421.09 | 6470 | ×76.6 | ×76 | 22,974 | 26,110 | ×21.42 | ×18.84 |
| $2^{18}$ | 1,699,950 | 22,824.7 | 12,011.3 | ×74.48 | ×141.53 | 84,706 | 56,085 | ×20 | ×30.1 |
| $2^{19}$ | 7,327,083 | 90,603.8 | 32,384.5 | ×80.87 | ×226.2 | 356,180 | 122,500 | ×20.5 | ×59.8 |
| $2^{20}$ | 29,481,099 | 416,376 | 83,943 | ×70.8 | ×351.2 | 1,530,321 | 478,612 | ×19.2 | ×61.59 |
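Table 6. Test evaluation between bijective 16-bit QCS-boxes and randomly generated 16-bit S-boxes.

| S-Boxes | Lin | nl | $\delta$ | $\deg(S)$ | (max) $AC(S)$ | Number |
|---|---|---|---|---|---|---|
| QCS-boxes, $n = 16$: C2, $M_1$, $m$ = 13,107, $r$ = 5 | 512 | 32,512 | 4 | 15 | 512 | 5 |
| S-boxes, random, $n = 16$ | 1532 | 32,002 | 22 | 15 | 2344, 2248 | 2 |
| S-boxes, random, $n = 16$ | 1532 | 32,002 | 20 | 15 | 2568–2208 | 34 |
| S-boxes, random, $n = 16$ | 1532 | 32,002 | 18 | 15 | 2432–2224 | 39 |
| S-boxes, random, $n = 16$ | 1528 | 32,004 | 20 | 15 | 2504–2232 | 23 |
| S-boxes, random, $n = 16$ | 1528 | 32,004 | 18 | 15 | 2416–2264 | 17 |
| S-boxes, random, $n = 16$ | 1524 | 32,006 | 20 | 15 | 2512–2240 | 8 |
| S-boxes, random, $n = 16$ | 1524 | 32,006 | 18 | 15 | 2392–2216 | 7 |
| S-boxes, random, $n = 16$ | 1520 | 32,008 | 20 | 15 | 2288, 2280, 2184 | 3 |
| S-boxes, random, $n = 16$ | 1520 | 32,008 | 18 | 15 | 2352–2264 | 5 |
| S-boxes, random, $n = 16$ | 1516 | 32,010 | 20 | 15 | 2280 | 1 |
| S-boxes, random, $n = 16$ | 1516 | 32,010 | 18 | 15 | 2312 | 1 |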
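Table 7. Calculation times of BoolSPLG, Sage, and SET.

| Size | Lin: Sage | Lin: SET | Lin: BoolSPLG | $\delta$: Sage | $\delta$: SET | $\delta$: BoolSPLG | deg: Sage | deg: SET | deg: BoolSPLG | AC: SET | AC: BoolSPLG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $2^{8}$ | 0.66 s | 2 ms | 0.22 ms | 0.096 s | 0.4 ms | 0.2 ms | 0.12 s | 3 ms | 0.1 ms | 2 ms | 0.2 ms |
| $2^{9}$ | 1.94 s | 5 ms | 0.34 ms | 0.35 s | 2 ms | 0.4 ms | 0.14 s | 18 ms | 0.2 ms | 7 ms | 0.4 ms |
| $2^{10}$ | 7.2 s | 39 ms | 0.32 ms | 1.44 s | 8 ms | 0.3 ms | 0.18 s | 52 ms | 0.1 ms | 48 ms | 0.4 ms |
| $2^{11}$ | 26.3 s | 246 ms | 62 ms | 6.2 s | 83 ms | 0.7 ms | 0.22 s | 374 ms | 12 ms | 0.4 s | 132 ms |
| $2^{12}$ | 99 s | 1.5 s | 149 ms | 29 s | 0.5 s | 1.7 ms | 0.39 s | 2.3 s | 28 ms | 2.7 s | 209 ms |
| $2^{13}$ | N/A | 7.9 s | 284 ms | N/A | 2.1 s | 5.7 ms | 1 s | 12.4 s | 54 ms | 14 s | 503 ms |
| $2^{14}$ | N/A | 66 s | 662 ms | N/A | 9.9 s | 21 ms | 1.6 s | 91 s | 279 ms | 102 s | 866 ms |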
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
