1. Introduction
Conventional cryptographic applications assume that algorithms run within a trusted environment. However, as digital information becomes more pervasive, relying solely on secure environments for key protection no longer satisfies the requirements of diverse applications. To address this challenge, Chow et al. [1] introduced the concept of the white-box attack environment, where the attacker possesses unrestricted access to a cryptographic implementation and the software execution process. Concurrently, they proposed a white-box scheme employing encoding technology to hinder the attacker from extracting the key in the AES software implementation.
Researchers have introduced two main classes of attack against white-box designs. One is algebraic analysis, exemplified by the BGE attack proposed by Billet et al. in 2004 [2]. The other is differential computation analysis (DCA), introduced by Bos et al. in 2016 [3]. The former requires challenging reverse-engineering steps in practice, while DCA assumes only that software execution traces containing information about memory addresses are available. Therefore, DCAs substantially reduce the difficulty and workload for attackers. Numerous white-box schemes, including many public submissions to WhibOx 2016 (the white-box cryptography competition), have been practically broken using DCAs. Given the considerable threat that DCAs pose to white-box schemes, resisting them is a crucial objective of white-box design.
Common countermeasures against DCAs are masking and shuffling. In 2018, Lee et al. [4] improved Chow et al.’s white-box scheme for AES by employing linear Boolean masking to prevent first-order DCAs. However, to conserve memory, only the first round of encryption is protected, leading to the scheme being quickly cracked [5]. In the same year, Biryukov et al. [6] employed nonlinear masking in conjunction with linear masking to hinder first-order DCAs. However, this implementation requires considerable memory resources. In 2021, Seker et al. [7] designed an AES white-box scheme that thwarts first-order DCAs by combining linear and nonlinear masking. Nevertheless, the performance of the scheme is reduced because the mask values must be updated in a timely manner. In 2019, Bogdanov et al. [8] adopted a shuffling strategy in AES white-box schemes to counter first-order DCAs, but even when the order of operations is randomized, an attacker can still realign the traces based on the accessed memory addresses.
Unlike AES, which is based on the SPN structure, the SM4 algorithm [9], recommended by the Chinese government, employs a Feistel network. Since it was standardized internationally by ISO/IEC in 2021, SM4 has been accepted by many cryptographic communities and integrated by prominent cryptographic libraries such as OpenSSL and Crypto++. Nowadays, the SM4 algorithm is applicable in scenarios including secure communications, data protection, and financial transactions, and plays a crucial role in various security protocols and systems worldwide. Thus, the security of SM4 implementations has also garnered attention.
In 2009, Xiao and Lai proposed the first white-box scheme for SM4 based on the lookup table technique, referred to as the Xiao–Lai scheme [10]. They also proved that the scheme can resist BGE attacks [2]. In 2013, Lin et al. [11] combined the BGE attack with differential cryptanalysis, equation solving, and other methods to form what is known as Lin–Lai analysis. They demonstrated that the worst-case time complexity to obtain the key of the Xiao–Lai scheme was $2^{47}$. In 2016, Bai et al. [12] employed more complicated encodings and larger tables to prevent the cancellation of higher-order linear encodings through combinatorial lookup tables, and proposed the Bai–Wu scheme. The scheme runs nine times faster than the Xiao–Lai scheme but requires 32.5 MB of memory for storing lookup tables. In 2020, Yao et al. [13] proposed a white-box scheme for SM4 that uses internal state expansion combined with random numbers to obfuscate keys, markedly enhancing the complexity of key extraction through algebraic attacks. In addition, there are other non-open-source white-box designs for SM4, such as the scheme introduced by Shi et al. in 2015 [14], which protects lookup tables through dual cryptography and random confusion. Because the corresponding code is not available, no security analysis results for this scheme have been reported.
While these SM4 white-box schemes effectively withstand algebraic attacks, they remain vulnerable to DCAs. Refs. [8, 15], respectively, indicate that the Xiao–Lai scheme and the Bai–Wu scheme are vulnerable to first-order DCA. Although the Yao–Si scheme aimed to thwart first-order DCA, our tests show that it is compromised as the number of traces increases.
In 2022, Zhang et al. [8] enhanced the Xiao–Lai scheme by integrating an 8th-order nonlinear encoding so that it resists first-order DCAs. Similarly, Yuan et al. [15] adopted this method to enhance the Bai–Wu scheme. Although these improvements have increased the security of the schemes, the 8th-order nonlinear encodings prevent direct XOR operations, so every XOR must be executed through lookup tables. This significantly increases the memory overhead of the schemes.
We apply Boolean masking to the Xiao–Lai scheme. Based on prior findings, collision analysis against the T-table has a higher success rate. Thus, we transform the T-table into a secured S-box table with the corresponding affine transformations. Next, we employ a randomly generated permutation to reuse the mask values, supported by the common share theory. We introduce two schemes: a simplified version and an algebraic enhancement version. The simplified scheme applies masking only to the first and the last four rounds because the risk of key leakage is elevated during those rounds, enabling implementation with a low resource overhead of approximately 1.62 MB. Experimental results show that there is no obvious peak during DCA trials with the public tool Deadpool, suggesting that the scheme can resist first-order DCAs. The algebraic enhancement scheme applies masking to every encryption round, resulting in a resource cost of approximately 7.8 MB. The scheme not only thwarts first-order DCAs but also proves secure against widely used algebraic attacks according to theoretical analysis.
The remainder of the paper is structured as follows: We present relevant knowledge in Section 2. Section 3 describes the design and implementation of the schemes in detail. Section 4 evaluates the performance of the schemes and compares them with other SM4 white-box schemes. Section 5 examines the security of the schemes, including their resistance to algebraic attacks and DCA. Finally, we conclude the paper in Section 6.
2. Preliminaries
The white-box scheme designed in this paper draws upon methods from the Xiao–Lai SM4 white-box scheme [10] and the AES white-box scheme by Lee et al. [4]. Therefore, in this section, we first introduce the encryption process of SM4, followed by a brief description of the two mentioned schemes.
2.1. SM4 Algorithm
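SM4 is a block cipher algorithm [9] with a 128-bit block length and a 128-bit key length. Both the encryption process and the round key schedule employ a 32-round Feistel structure.
Before the first round, the plaintext is divided into four 32-bit words, denoted as $(X_0, X_1, X_2, X_3)$. In each round, the XOR sum of the last three state words and a 32-bit round key is computed. The sum is input into the transformation $T$, and the output is XORed with the first state word; the whole computation is denoted as the $F$-function. The result of the $F$-function becomes the new last state word, and the other three state words each move forward by one position before the next round. The detailed steps of each round of computation are illustrated in Figure 1.
The function $F$ is defined as follows:
$$X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, rk_i) = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i), \quad i = 0, 1, \ldots, 31.$$
Here, $rk_i$ is the round key and $i$ is the index of the current iteration. $T$ is the composition of the linear transformation $L$ and the nonlinear transformation $\tau$, that is, $T(\cdot) = L(\tau(\cdot))$. $\tau$ can be decomposed into four parallel S-box lookup operations. Given that the 32-bit input $A$ is divided into four bytes $(a_0, a_1, a_2, a_3)$, the 32-bit output $B$ is denoted as:
$$B = (b_0, b_1, b_2, b_3) = \tau(A) = (\mathrm{Sbox}(a_0), \mathrm{Sbox}(a_1), \mathrm{Sbox}(a_2), \mathrm{Sbox}(a_3)).$$
The linear transformation $L$ consists of left rotations and XOR operations. The calculation formula is as follows:
$$L(B) = B \oplus (B \lll 2) \oplus (B \lll 10) \oplus (B \lll 18) \oplus (B \lll 24).$$
After the final round, the four state words are reversed, and the output is the ciphertext $(Y_0, Y_1, Y_2, Y_3) = (X_{35}, X_{34}, X_{33}, X_{32})$.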
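The round structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the round keys are taken as given (the key schedule is omitted), and `PLACEHOLDER_SBOX` is an arbitrary byte permutation standing in for the standardized SM4 S-box, so the outputs are not genuine SM4 ciphertexts.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: an arbitrary bijection on bytes, NOT the real SM4 S-box.
PLACEHOLDER_SBOX = [(7 * v + 3) % 256 for v in range(256)]

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def L(b):
    """Linear transformation L: XOR of B with four left rotations of B."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def tau(a, sbox):
    """Nonlinear transformation: apply the S-box to each byte of the word."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(a >> shift) & 0xFF] << shift
    return out

def T(a, sbox):
    """T is L applied to the output of tau."""
    return L(tau(a, sbox))

def F(x0, x1, x2, x3, rk, sbox):
    """Round function: X_{i+4} = X_i xor T(X_{i+1} xor X_{i+2} xor X_{i+3} xor rk_i)."""
    return x0 ^ T(x1 ^ x2 ^ x3 ^ rk, sbox)

def encrypt_block(words, round_keys, sbox):
    """Run one round per round key, then reverse the four state words."""
    x = list(words)
    for rk in round_keys:
        x = [x[1], x[2], x[3], F(x[0], x[1], x[2], x[3], rk, sbox)]
    return x[::-1]
```

Because of the Feistel structure and the final word reversal, calling `encrypt_block` on a ciphertext with the round keys in reverse order recovers the plaintext, whatever S-box is plugged in.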
2.2. The Xiao–Lai SM4 White-Box Scheme
The encoding used in the Xiao–Lai scheme is affine transformations. We use matrix multiplication to represent linear maps and vector addition to represent translations. An affine transformation acting on a vector can be represented as , where is an invertible matrix and is a constant vector.
First, each 32-bit word
of the plaintext is externally encoded by a 32-dimensional affine transformation
. Then, the encoded input word is transformed to
, as follows:
Similarly,
indicates the external decoding to obtain the ciphertext, as follows:
Each round of computation in the Xiao–Lai SM4 white-box scheme is divided into three sequential parts. Part 1 calculates
. Since
,
, and
are secured by different affine transformations, the XOR operation cannot be performed directly. Hence,
is needed before the XOR operations.
comprises two affine transformations: first eliminating the affine transformations
and then applying a unified encoding
. For each
, the corresponding
is processed as follows:
Here,
is an invertible 32-dimensional affine transformation over GF(2).
= diag
, where each
represents an invertible 8-dimensional affine transformation over GF(2).
and
are inverses of each other. After unifying the encodings of the three words, the result is obtained through two XOR operations:
Part 2 calculates
. Before encryption, a table is generated for each byte of the output
from Part 1. The input of the table is an 8-bit
and the output is a 32-bit
. Since
involves the affine transformation
,
is first applied to remove it, and then the round key
is added. After byte substitution with an S-box and the corresponding linear operation
, the output
is encoded by another 32-dimensional invertible affine transformation
. The lookup table is constructed as follows:
Therefore, four lookup tables are searched to obtain the results
during the encryption process. Finally,
can be obtained by XORing the four
:
Part 3 completes the calculation of
. It is necessary to unify the encodings that protect the output
from Part 2 and the 0th word
of the
ith round input.
is used to convert the encoding for
, which is composed of
and
.
is used to eliminate the carried encoding of
, and
represents the new unified encoding. Likewise,
is utilized to convert encoding for
, and it is composed of
and
.
is used to eliminate the carried encoding of
, and
is the new unified encoding. The calculation is as follows:
The invertible matrix of
and
is the same, denoted by
. However, the constant vector of
is
, and that of
is
, which satisfies
. As a result, the output
of this round is obtained from the XOR sum of two temporary results
and
:
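To illustrate why the encodings must be unified before the XOR operations, the following sketch models affine encodings over GF(2) as a bit-matrix plus a constant vector. It is a toy model under stated assumptions: the helper names are hypothetical, matrices are stored as lists of row bitmasks, and the random matrices are not those of the Xiao–Lai scheme. It shows that when several words share the same linear part, XORing the encoded words yields an affine encoding of the XOR of the plain words, with the constants XORed together.

```python
import random

def parity(v):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(v).count("1") & 1

def mat_vec(rows, x):
    """Multiply a GF(2) matrix (list of row bitmasks) by the bit-vector x."""
    y = 0
    for i, row in enumerate(rows):
        y |= parity(row & x) << i
    return y

def is_invertible(rows, n):
    """Gaussian elimination over GF(2); True if the n x n matrix has full rank."""
    rows = list(rows)
    rank = 0
    for col in range(n):
        pivot = next((r for r in range(rank, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return False
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(n):
            if r != rank and (rows[r] >> col) & 1:
                rows[r] ^= rows[rank]
        rank += 1
    return True

def random_invertible(n, rng):
    """Sample random n x n bit-matrices until one is invertible."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if is_invertible(rows, n):
            return rows

def affine(rows, const, x):
    """Apply the affine map x -> rows * x xor const."""
    return mat_vec(rows, x) ^ const

def xor_encoded(rows, consts, words):
    """XOR several words that are all encoded with the same linear part."""
    acc = 0
    for c, w in zip(consts, words):
        acc ^= affine(rows, c, w)
    return acc
```

With a shared matrix `A`, `xor_encoded(A, [c1, c2, c3], [x1, x2, x3])` equals `affine(A, c1 ^ c2 ^ c3, x1 ^ x2 ^ x3)`. With different matrices per word no such identity holds in general, which is why each word is first re-encoded under a common transformation.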
2.3. The Masked AES White-Box Scheme
In 2018, Lee et al. [4] proposed a masked AES white-box scheme based on Chow et al.’s white-box design [1] to thwart first-order DCAs. The AES white-box scheme introduced by Chow et al. [1] contains four classes of lookup tables. Lee et al. added masks to the Type II table as displayed in Figure 2.
Linear and nonlinear encodings protect the 8-bit input, so decoding is performed first. Then, AddRoundKey, SubBytes, and partial MixColumns of the round function of AES are sequentially performed to obtain the genuine 32-bit state word T. Second, for each of the 256 input byte values, randomly generated mask bytes are XORed onto T. Next, the masked data and the 256 mask values are each encoded with the same linear encoding. Finally, distinct nonlinear encodings are applied for further protection. As a result, Lee et al. expanded an 8-to-32-bit table to an 8-to-64-bit table, which doubled the required memory.
3. Masked SM4 White-Box Scheme
We enhance the Xiao–Lai SM4 white-box scheme [10] with the Boolean masking technique to prevent DCAs. In the Xiao–Lai scheme [10], the S-box and the linear operation L are combined into a single lookup table, with one byte as input and a 32-bit word as output. Following Lee et al.’s masking method, an 8-to-64-bit table would be required to store both the masked data and the mask values, totaling 2 KB ($2^8 \times 64$ bits) of memory. However, according to the algebraic analysis of the DIBO function, the encodings combined with the linear operation can be divided into four independent functions [16]. Furthermore, each function may leak information about the S-box output through collision analysis [17]. Therefore, we separate the S-box and the linear operation L, create a lookup table for the nonlinear S-box operation only, and mask its output. This reduces the lookup table size to 256 B ($2^8 \times 8$ bits). Because distinct encodings protect the masked data and the mask values separately, the true values of the intermediate state bytes cannot be disclosed directly by XORing the two.
Secondly, to reduce the total number of real random numbers, we exploit the common share technique [
18], reusing half of the random numbers. Specifically, we randomly generate 256 values
as mask values for the 256 output values of the first S-box operation. Next, we generate a random permutation
and ensure that the function compositions
and
do not result in an identity permutation. Subsequently, the permutation
is applied to deduce the input values
to secure the outputs of the following S-box operations by
. We do not record the corresponding mask value along with the output of each S-box operation as done in the Type II table of Lee et al.’s scheme. Instead, we only record the permutation table and the randomly generated 256 mask values
. Since the
L operation of SM4 is linear, the following relationship holds:
As illustrated in the above equation, can be calculated when and the permutation are input. Consequently, only two tables are needed to remove the related mask values during the subsequent unmasking process. Therefore, the required memory space is further reduced.
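The mask-reuse idea can be illustrated with a small sketch. It is only a model under stated assumptions: the encodings are omitted, the names `make_mask_tables`, `masks_for_word`, and `unmask_after_L` are hypothetical, and the way the permutation is iterated (the mask of the j-th byte is read at the index obtained by applying the permutation j times) is our reading of the description rather than a verified reproduction of the scheme. The sketch also checks the property that makes late unmasking possible: since L is linear, L(s ⊕ m) ⊕ L(m) = L(s).

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def L(b):
    """SM4 linear transformation L."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def make_mask_tables(rng):
    """Return 256 random mask bytes and a permutation of 0..255 whose
    second and third powers are both different from the identity."""
    masks = [rng.randrange(256) for _ in range(256)]
    identity = list(range(256))
    while True:
        perm = identity[:]
        rng.shuffle(perm)
        p2 = [perm[perm[v]] for v in range(256)]
        p3 = [perm[p2[v]] for v in range(256)]
        if p2 != identity and p3 != identity:
            return masks, perm

def masks_for_word(index, masks, perm):
    """ASSUMED indexing: byte j uses masks[perm^j(index)] for j = 0..3."""
    out = []
    v = index
    for _ in range(4):
        out.append(masks[v])
        v = perm[v]
    return out

def pack(bytes4):
    """Pack four bytes (most significant first) into a 32-bit word."""
    b0, b1, b2, b3 = bytes4
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3

def unmask_after_L(masked_output, mask_word):
    """Remove a Boolean mask after the linear layer."""
    return masked_output ^ L(mask_word)
```

Only the 256-byte mask table and the 256-byte permutation table need to be stored; the four mask bytes of a word are derived on demand instead of being recorded next to every S-box output.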
Finally, a critical consideration is the appropriate location for unmasking. In 2020, Lee et al. [5] pointed out that unmasking after MixColumns may expose the output value of the first-round computation to an attacker. This vulnerability allows an attacker to target the round output for key extraction using DCAs. To address this problem, we recommend that all intermediate values remain masked throughout the encryption process. Additionally, distinct random numbers and permutations should be utilized for different encryption rounds.
Just as the intermediate states during the first and last rounds of AES computations are vulnerable to secret leakage via DCA attacks, the intermediate states during Rounds 0–3 and 28–31 of the SM4 algorithm are also high risk. To further reduce the required memory, we can simplify the scheme by adding masks to only these eight rounds. Consequently, in Rounds 4–7, the unmasking calculations need to be incorporated into Part 1 and Part 3. The remaining rounds (8–27) adhere to the Xiao–Lai SM4 white-box design. In what follows, we describe our white-box scheme in detail using the simplified version. First of all, we define the symbols utilized in subsequent sections in Table 1.
3.1. Table Generation
Five types of lookup tables are defined in this scheme, as shown in Figure 3. Among them, (a) Table_1 is the lookup table used in Rounds 0 and 28; (b) Table_2 corresponds to Rounds 1–3 and 29–31; (c) Table_3 is used in Rounds 4–27; (d) M is the table for recording random mask values in Rounds 0–3 and 28–31; (e) the permutation table is used in Rounds 0–3 and 28–31. In this section, we will elaborate on how to generate these five types of tables.
Masking is applied to the output of the S-box during the generation of Table_1, as depicted in
Figure 3a. The input is a byte of
, and the output is
. The generation process is as follows:
Here, and are two 8-dimensional invertible affine transformations, and represents the jth byte of the ith round key. means is the index of table M to obtain the corresponding encoding mask value. represents the 8-dimensional invertible affine transformation to eliminate the protection of mask values. Table_1 needs 256 B ($2^8 \times 8$ bits) of memory. There are four such tables in each round and two rounds in total, so all Table_1 instances consume 2 KB ($4 \times 2 \times 256$ B) of memory.
Since Table_2 is used in Rounds 1–3 and 29–31, the mask values should be updated during the generation of Table_2. As depicted in
Figure 3b, the input is two bytes
and
, and the output is an 8-bit value
. The relationship between inputs and output is defined as follows:
Here, represents the jth byte of the result from Part 2 in the th round computation, and is the invertible 8-dimensional linear transformation. The way to obtain remains the same as before. Table_2 requires 64 KB ($2^{16} \times 8$ bits) of memory. With four tables per round across six rounds, the total memory consumption is 1.5 MB ($4 \times 6 \times 64$ KB). In the last four rounds, mask values are unprotected, so there is no necessity to use to remove the encoding when generating the related tables.
Table_3 is utilized in Rounds 4–27 in the simplified version of our masked SM4 white-box scheme. As shown in Figure 3c, its generation is identical to that of the table in the Xiao–Lai scheme described in Section 2. Table_3 requires 1 KB ($2^8 \times 32$ bits) of memory. Each round has four tables, so the 24 rounds consume a total of 96 KB ($4 \times 24 \times 1$ KB) of memory.
Simultaneously, the mask value
is generated randomly and protected by the encoding
. The
M-table stores mask values used in Rounds 0–3 and 28–31, as depicted in
Figure 3d. It is an 8-to-8-bit table. The index of the table is
and each cell involves a random byte protected by the affine transformation
. The generation of the mask table
M is as follows:
The size of an M-table is 256 B ($2^8 \times 8$ bits), with only one M-table generated per round. Thus, eight rounds consume 2 KB ($8 \times 256$ B) of memory.
The permutation table stores the random permutation used in Rounds 0–3 and 28–31, as depicted in Figure 3e. Its generation is as follows:
A permutation table also needs 256 B ($2^8 \times 8$ bits) of memory, and each round needs only one permutation table. Hence, eight rounds require 2 KB ($8 \times 256$ B) of memory.
3.2. Round Operation
In this section, we will introduce how to use our lookup tables and affine transformations to obtain the output of a round. Given the differences in round computations, we will detail these processes separately.
3.2.1. Rounds 0–3 and 28–31
Figure 4 takes the execution of the 0-th round of encryption as an example. Part 1 returns the XOR sum of the lower three words
. Similarly, we should unify encodings before performing the XOR operations, consistent with the Xiao–Lai scheme.
Part 2 involves round key addition, S-box operations, and the linear operation
L. Here, this part is further divided into three portions. First, it involves calculating round key addition and an S-box operation using a lookup table. The input of part 2 is
, which is protected by
. We use the
jth byte
of
as the index of the
jth Table_1 to obtain the corresponding output
. Notably, when
, Table_2 is used to execute round key addition, an S-box operation, and mask value updates. Thus, there are two inputs
and
to the lookup table, as illustrated in
Figure 5. Specifically, Table_1 uses a single index
, whereas Table_2 uses two indices,
and
.
Secondly, the linear transformation
L is performed. We concatenate the four outputs from the four lookup tables as the input of a composite affine transformation
, resulting in
. The affine transformation is calculated as follows:
Here, is a 32-dimensional invertible affine transformation. The linear matrix of is identical to that of affine transformation but the constant vector is different, denoted by .
Thirdly, to ensure the correctness of encryption, we should calculate the protected mask values simultaneously. First,
is the index of the
-table. We repeat the lookup operation
j times and the related output
serves as the index of the
M-table. Next, we search the
M-table to obtain the mask value which secures
. Finally, we concatenate the four mask values and apply a composition affine transformation
to obtain
. This portion is calculated as follows:
Here, N = diag , where is an 8-dimensional invertible linear transformation, and . For the last four rounds of computation, mask values are unprotected, which means the composite affine transformation is defined by .
Part 3 is used to calculate the XOR sum between the output of the round function and the state word
. Due to
and
secured by different encodings,
should unify encoding with
and then perform an XOR operation to obtain the output
of the current round, as depicted in Part 3 of
Figure 4:
The linear matrix of is the same as that of , that is . However, the constant vectors are different and satisfy .
It should be noted that after 32 rounds of iterations, it is necessary to remove the encodings of the four 32-bit words and then XOR them with the corresponding mask values to obtain the genuine ciphertext.
3.2.2. Rounds 4–27
In the algebraic enhancement version of our scheme, the calculation during each middle round of encryption is identical to that in Round 1. However, in the simplified version of our scheme, the implementation of each round follows the design of the Xiao–Lai scheme except for Rounds 4–7. The lookup table used in Part 2 is Table_3 in Rounds 4–27. However, we should remove mask values during the process of Part 1 and Part 3 in Rounds 4–7, because the input words are masked. Therefore, the steps are slightly different. Below, we will only introduce Part 1 and Part 3 of Rounds 4–7.
Part 1 primarily calculates
. First of all, we should unify encodings of
and mask values:
Here,
is the invertible matrix of affine transformation
. As a result, the output
from the following XOR operation remains protected by an affine transformation rather than a linear transformation.
Part 3 completes the XOR sum between
and the output of the
to obtain the result
for this round. Additionally, unmasking is involved in Rounds 4–7 to ensure the correctness of the scheme. Before the above two XOR operations, it includes unifying the encodings of
,
, and mask values:
4. Performance
The memory consumption of the simplified version of our SM4 white-box scheme is listed in Table 2.
As shown in
Table 2, there are
32-dimensional affine transformations in Part 1 during the 32 rounds of encryption. Meanwhile, Part 2 requires
and Part 3 requires
affine transformations. Consequently, the memory requirement for affine transformations is 22.69 KB ($176 \times 1056$ bits).
Table_1 is only used in Rounds 0 and 28, with four tables constructed in each round, requiring 2 KB ($4 \times 2 \times 2^8 \times 8$ bits). Table_2 is used in six rounds, Rounds 1–3 and 29–31, with four tables constructed in each round, requiring 1.5 MB ($4 \times 6 \times 64$ KB). Table_3 is needed in the middle 24 rounds, with four tables generated in each round, requiring 96 KB ($4 \times 24 \times 1$ KB). An M-table and a permutation table are involved in each of Rounds 0–3 and 28–31, so the required memory for them is 4 KB ($2 \times 8 \times 2^8 \times 8$ bits). In summary, the total memory required to generate a white-box encryption program is approximately 1660.69 KB or about 1.62 MB.
In addition to DCA, attackers may attempt reverse engineering and code-lifting attacks. To mitigate such threats, external encodings are essential. Additionally, masking all rounds can enhance security against not only DCAs but also algebraic attacks. The full version of the scheme, referred to as the algebraic enhancement scheme, includes only Table_1 (Round 0), Table_2 (Rounds 1–31), the M-table, the permutation table, and affine transformations. The memory required for the lookup tables is calculated as 4 × 2 Kb + 4 × 31 × 64 KB + 2 × 32 × 2 Kb = 17 KB + 7.75 MB. There are 216 affine transformations, requiring 216 × 1056 bits = 27.84 KB of memory. In summary, the algebraic enhancement scheme requires about 7.8 MB (= 7.75 MB + 17 KB + 27.84 KB) of memory.
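The memory figures above can be re-derived with a short calculation. The sketch below is only a consistency check: the function names are hypothetical, each 32-dimensional affine transformation is counted as a 32 × 32 bit matrix plus a 32-bit constant (1056 bits), and the count of 176 affine transformations for the simplified version is inferred from the stated 22.69 KB rather than taken from a per-part breakdown.

```python
BITS_PER_KB = 8 * 1024
AFFINE_BITS = 32 * 32 + 32  # matrix plus constant vector = 1056 bits

def table_kb(input_bits, output_bits):
    """Size in KB of a lookup table with 2^input_bits entries."""
    return (2 ** input_bits) * output_bits / BITS_PER_KB

def simplified_kb(num_affine=176):
    """Simplified version: masking only in Rounds 0-3 and 28-31."""
    table_1 = 4 * 2 * table_kb(8, 8)      # Rounds 0 and 28
    table_2 = 4 * 6 * table_kb(16, 8)     # Rounds 1-3 and 29-31
    table_3 = 4 * 24 * table_kb(8, 32)    # Rounds 4-27
    mask_perm = 2 * 8 * table_kb(8, 8)    # M-table and permutation table
    affine = num_affine * AFFINE_BITS / BITS_PER_KB
    return table_1 + table_2 + table_3 + mask_perm + affine

def enhanced_kb(num_affine=216):
    """Algebraic enhancement version: masking in all 32 rounds."""
    table_1 = 4 * 1 * table_kb(8, 8)      # Round 0
    table_2 = 4 * 31 * table_kb(16, 8)    # Rounds 1-31
    mask_perm = 2 * 32 * table_kb(8, 8)   # M-table and permutation table
    affine = num_affine * AFFINE_BITS / BITS_PER_KB
    return table_1 + table_2 + mask_perm + affine

if __name__ == "__main__":
    print(f"simplified: {simplified_kb():.2f} KB = {simplified_kb() / 1024:.2f} MB")
    print(f"enhanced:   {enhanced_kb():.2f} KB = {enhanced_kb() / 1024:.2f} MB")
```

Under these assumptions the totals come to about 1660.69 KB (1.62 MB) for the simplified version and about 7.8 MB for the algebraic enhancement version, matching the figures quoted in the text.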
On a personal computer configured with the 12th Gen Intel(R) Core(TM) i5-12400F 2.50 GHz CPU and 16 GB RAM, the average time to generate a white-box program for the simplified version is about 0.385 s, and the average runtime to encrypt a block of plaintext is approximately 0.07 ms.
The efficiency of the white-box SM4 scheme designed in this paper is compared with that of other white-box schemes in Table 3. The Xiao–Lai scheme and Yao et al.’s scheme require minimal memory and take the shortest time to generate a white-box program. However, these two schemes are vulnerable to DCA [15, 19]. The Bai–Wu scheme relies entirely on lookup tables and XOR operations for encryption, resulting in minimal runtime for encrypting a 128-bit plaintext but requiring substantial memory. Nevertheless, this scheme cannot thwart DCA either [20].
The white-box schemes proposed by Zhang et al. and Yuan et al. are designed to resist DCAs but require a large amount of memory. Specifically, Zhang et al.’s scheme requires approximately 15 times more memory than ours, and Yuan et al.’s scheme requires about 21 times more memory. Since the source code of their schemes is not available, we could not measure the average time to generate a white-box program or the runtime of encrypting a 128-bit plaintext. However, it is worth noting that Yuan et al. applied 8-dimensional nonlinear encodings to the Bai–Wu scheme to construct their scheme. Therefore, the time required to generate a white-box program in their scheme is theoretically longer than that of the Bai–Wu scheme and our scheme.