An Efficient Masked White-Box Implementation of SM4

Zhao, Dongyan; Wang, Yubo; Li, Yan; Hu, Xiaobo; Yu, Yanyan; Chen, Shi; Zheng, Shihui

doi:10.3390/electronics13122326


by Dongyan Zhao ¹, Yubo Wang ¹, Yan Li ¹, Xiaobo Hu ¹, Yanyan Yu ¹, Shi Chen ²,* and Shihui Zheng ²,*

¹ Beijing Smart-Chip Microelectronics Technology Co., Ltd., Beijing 102299, China

² Department of Cyberspace Security, School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Authors to whom correspondence should be addressed.

Electronics 2024, 13(12), 2326; https://doi.org/10.3390/electronics13122326

Submission received: 14 May 2024 / Revised: 9 June 2024 / Accepted: 11 June 2024 / Published: 14 June 2024

(This article belongs to the Special Issue Data Security and Privacy Preserving in Data Society: Scenarios and Techniques)


Abstract

Differential computation analysis (DCA) is a powerful method for extracting secret information from carefully designed white-box schemes without reverse engineering. Consequently, white-box solutions typically require substantial storage and computing resources to withstand DCAs, as demonstrated by the schemes proposed by Zhang et al. and Yuan et al. for the ISO/IEC standard algorithm SM4. Our approach employs Boolean masking to obscure the correlation between the key and intermediate states. Additionally, we introduce nonlinear permutations to reuse random mask values, thereby reducing space consumption. Experimental results indicate that DCAs against both the simplified version and the algebraic enhancement version of our scheme fail to retrieve the correct keys. Moreover, the former version can be implemented with approximately 1.62 MB of memory and the latter with 7.8 MB, which is much less than 24.3 MB (Zhang et al.) and 34.5 MB (Yuan et al.). Consequently, our design can thwart first-order DCA with lower overhead.

Keywords:

SM4; white-box; Boolean mask; differential computation attack (DCA)

1. Introduction

Conventional cryptographic applications assume that algorithms run within a trusted environment. However, as digital information becomes more pervasive, relying solely on secure environments for key protection no longer satisfies the requirements of diverse applications. To address this challenge, Chow et al. [1] introduced the concept of the white-box attack environment, where the attacker possesses unrestricted access to a cryptographic implementation and the software execution process. Concurrently, they proposed a white-box scheme employing encoding technology to hinder the attacker from extracting the key in the AES software implementation.

Scholars have introduced several cracking methods against white-box designs. One method is algebraic analysis, exemplified by the BGE attack proposed by Billet et al. in 2004 [2]. The other is differential computation analysis (DCA), introduced by Bos et al. in 2016 [3]. The former requires challenging reverse-engineering steps in practice, while DCA assumes only that software execution traces containing information about memory addresses are available. Therefore, DCAs substantially reduce the difficulty and workload for attackers. Numerous white-box schemes, including many public submissions to WhibOx 2016 (the white-box cryptography competition), have been practically broken using DCAs. Given the considerable threat that DCAs pose to white-box schemes, mitigating these vulnerabilities is a crucial objective of white-box design.

Common countermeasures against DCAs are masking and shuffling. In 2018, Lee et al. [4] improved Chow et al.’s white-box scheme for AES by employing linear Boolean masking to prevent first-order DCAs. However, to conserve memory, only the first round of encryption is protected, leading to the scheme being quickly cracked [5]. In the same year, Biryukov et al. [6] employed nonlinear masking in conjunction with linear masking to hinder first-order DCAs. However, this implementation requires considerable memory resources. In 2021, Seker et al. [7] designed an AES white-box scheme that thwarts first-order DCAs by combining linear and nonlinear masking. Nevertheless, the performance of the scheme was reduced due to the need for timely updating mask values. In 2019, Bogdanov et al. [8] adopted a shuffling strategy in AES white-box schemes to counter first-order DCAs, but even with randomizing the order of operations, an attacker can still rearrange the trajectory based on the accessed memory addresses.

Unlike AES, which is based on the SPN structure, the SM4 algorithm [9], recommended by the Chinese government, employs a Feistel network. Since it was standardized internationally by ISO/IEC in 2021, SM4 has been accepted by many cryptographic communities and integrated by prominent cryptographic libraries such as OpenSSL and Crypto++. Nowadays, the SM4 algorithm is applicable in scenarios including secure communications, data protection, and financial transactions, and plays a crucial role in various security protocols and systems worldwide. Thus, the security of SM4 implementations has also garnered attention.

In 2009, Xiao and Lai proposed the first white-box scheme for SM4 based on the lookup table technique, referred to as the Xiao–Lai scheme [10]. Meanwhile, they proved that the scheme can resist BGE attacks [2]. In 2013, Lin et al. [11] combined the BGE attack with differential cryptanalysis, solving equations and other methods to form what is known as Lin–Lai analysis. They demonstrated that the worst-case time complexity to obtain the key of the Xiao–Lai scheme was $O(2^{47})$. In 2016, Bai et al. [12] employed more complicated encodings and larger tables to prevent the cancellation of higher-order linear encodings through combinatorial lookup tables, and proposed the Bai–Wu scheme. The scheme runs nine times faster than the Xiao–Lai scheme but requires 32.5 MB of memory for storing lookup tables. In 2020, Yao et al. [13] proposed a white-box scheme for SM4 that uses internal state expansion combined with random numbers to obfuscate keys, markedly enhancing the complexity of key extraction through algebraic attacks. In addition, there are other non-open-source white-box designs for SM4, such as the scheme introduced by Shi et al. in 2015 [14], which protects lookup tables through dual cryptography and random confusion. Due to the absence of the corresponding code, no security analysis results for this scheme are available.

While these SM4 white-box solutions effectively withstand algebraic attacks, they remain vulnerable to DCAs. Refs. [8,15] indicate that the Xiao–Lai scheme and the Bai–Wu scheme, respectively, are vulnerable to first-order DCA. Although the Yao–Si scheme aimed to thwart first-order DCA, our tests show that it is compromised as the number of traces increases.

In 2022, Zhang et al. [8] enhanced the Xiao–Lai scheme by integrating an 8th-order nonlinear encoding so that it resists first-order DCAs. Similarly, Yuan et al. [15] adopted this method to enhance the Bai–Wu scheme. Although these improvements have increased the security of the schemes, the 8th-order nonlinear encodings prevent direct XOR operations, necessitating the utilization of lookup tables to execute. This significantly increases the memory overhead of the schemes.

We apply Boolean masking to the Xiao–Lai scheme. Based on prior findings, collision analysis against the T-table has a higher success rate. Thus, we transform the T-table into a secured S-box table with the corresponding affine transformations. Next, we employ a randomly generated permutation to reuse the mask values, supported by the common share theory. We introduce two schemes: a simplified version and an algebraic enhancement version. The simplified scheme applies masking only to the first and the last four rounds because the risk of key leakage is elevated during those rounds, enabling implementation with a low resource overhead of approximately 1.62 MB. Experimental results show that there is no obvious peak during DCA trials with the public tool Deadpool, suggesting that the scheme can resist first-order DCAs. The algebraic enhancement scheme applies masking to every encryption round, resulting in a resource cost of approximately 7.8 MB. The scheme not only thwarts first-order DCAs but also proves secure against widely used algebraic attacks according to theoretical analysis.

The remainder of the paper is structured as follows: We present relevant knowledge in Section 2. Section 3 describes the design and implementation of the schemes in detail. Section 4 evaluates the performance of the schemes and compares them with other SM4 white-box schemes. Section 5 examines the security of the schemes, including their resistance to algebraic attacks and DCA. Finally, we conclude the paper in Section 6.

2. Preliminaries

The white-box scheme designed in this paper draws upon methods from the Xiao–Lai SM4 white-box scheme [10] and the AES white-box scheme by Lee et al. [4]. Therefore, in this section, we first introduce the encryption process of SM4, followed by a brief description of the two mentioned schemes.

2.1. SM4 Algorithm

SM4 is a block cipher algorithm [9] with a 128-bit block length and a 128-bit key length. The encryption process and round key schedule employ a 32-round Feistel structure.

Before the first round, the plaintext is divided into four 32-bit words, denoted as $(X_{0}, X_{1}, X_{2}, X_{3})$. In each round, the last three state words are XORed with a 32-bit round key. This sum is input to the function T, and the output is XORed with the first state word; the complete operation is denoted as the F-function. The result of the F-function becomes the new last state word, and the other three words each shift by one position before the next round. The detailed steps of each round of computation are illustrated in Figure 1.

The function F is defined as follows:

$$X_{i+4} = F(X_{i}, X_{i+1}, X_{i+2}, X_{i+3}, rk_{i}) = X_{i} \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_{i}), \quad i \in \{0, 1, \dots, 31\}.$$

Here, $rk_{i}$ is the round key and $i$ is the index of the current iteration. T is the composition of the linear transformation L and the nonlinear transformation $\tau$. The transformation $\tau$ can be decomposed into four parallel S-box lookup operations. Given that the 32-bit input $\alpha$ is divided into four bytes $(\alpha[0], \alpha[1], \alpha[2], \alpha[3])$, the 32-bit output $\beta$ is denoted as:

$$\beta = \tau(\alpha) = (S(\alpha[0]) \,\|\, S(\alpha[1]) \,\|\, S(\alpha[2]) \,\|\, S(\alpha[3])).$$

The linear transformation L consists of left rotations and XOR operations. The calculation formula is as follows:

$$Z = L(\beta) = \beta \oplus (\beta \lll 2) \oplus (\beta \lll 10) \oplus (\beta \lll 18) \oplus (\beta \lll 24).$$

After the final round, the four state words are reversed, and the output is the ciphertext $(X_{35}, X_{34}, X_{33}, X_{32})$.
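The round structure described above can be sketched in Python. This is a minimal illustration under stated assumptions: `STAND_IN_SBOX` is a random byte permutation and `ROUND_KEYS` are random words, both hypothetical placeholders rather than the standardized SM4 S-box and key schedule. Only the rotation amounts of L and the word ordering are taken from the text.

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def L(b):
    """Linear transformation L: XOR of the word with four of its rotations."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def tau(a, sbox):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(a >> shift) & 0xFF] << shift
    return out

def crypt(block, round_keys, sbox):
    """Run the 32-round structure on four 32-bit words, then reverse the words."""
    x = list(block)
    for rk in round_keys:
        t = L(tau(x[1] ^ x[2] ^ x[3] ^ rk, sbox))   # T(X_{i+1} ^ X_{i+2} ^ X_{i+3} ^ rk_i)
        x = [x[1], x[2], x[3], x[0] ^ t]            # append X_{i+4}, drop X_i
    return tuple(reversed(x))

# Stand-in values (assumptions): NOT the real SM4 S-box or key schedule.
rng = random.Random(1)
STAND_IN_SBOX = list(range(256))
rng.shuffle(STAND_IN_SBOX)
ROUND_KEYS = [rng.getrandbits(32) for _ in range(32)]

plaintext = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
ciphertext = crypt(plaintext, ROUND_KEYS, STAND_IN_SBOX)
# With this structure, running the same routine with the round keys reversed decrypts.
recovered = crypt(ciphertext, ROUND_KEYS[::-1], STAND_IN_SBOX)
```

Because the structure is an unbalanced Feistel network, decryption works for any choice of T, which is why the sketch can use placeholder tables without affecting the round-trip check.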

2.2. The Xiao–Lai SM4 White-Box Scheme

The encodings used in the Xiao–Lai scheme are affine transformations. We use matrix multiplication to represent linear maps and vector addition to represent translations. An affine transformation $P_{i+k}$ acting on a vector $X_{i+k}$ can be represented as $P_{i+k}(X_{i+k}) = LP_{i+k} \cdot X_{i+k} \oplus CP_{i+k}$, where $LP_{i+k}$ is an invertible matrix and $CP_{i+k}$ is a constant vector.
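A small Python sketch may help make the affine encoding concrete. It stores a GF(2) matrix as one integer bitmask per row and inverts it by Gauss-Jordan elimination. The helper names (`mat_vec`, `invert`, `random_affine`, `encode`, `decode`) are illustrative choices, not part of the scheme.

```python
import random

def parity(v):
    """Return the XOR of all bits of v."""
    return bin(v).count("1") & 1

def mat_vec(rows, x):
    """Multiply a GF(2) matrix (one row bitmask per output bit) by the bit-vector x."""
    y = 0
    for r, row in enumerate(rows):
        y |= parity(row & x) << r
    return y

def invert(rows):
    """Gauss-Jordan inversion over GF(2); returns None for a singular matrix."""
    n = len(rows)
    a = list(rows)
    inv = [1 << r for r in range(n)]            # start from the identity matrix
    for col in range(n):
        pivot = next((r for r in range(col, n) if (a[r] >> col) & 1), None)
        if pivot is None:
            return None
        a[col], a[pivot] = a[pivot], a[col]
        inv[col], inv[pivot] = inv[pivot], inv[col]
        for r in range(n):
            if r != col and (a[r] >> col) & 1:
                a[r] ^= a[col]
                inv[r] ^= inv[col]
    return inv

def random_affine(n, rng):
    """Sample (LP, LP^{-1}, CP) with LP an invertible n x n matrix over GF(2)."""
    while True:
        lp = [rng.getrandbits(n) for _ in range(n)]
        lp_inv = invert(lp)
        if lp_inv is not None:
            return lp, lp_inv, rng.getrandbits(n)

def encode(x, lp, cp):
    """P(x) = LP . x XOR CP"""
    return mat_vec(lp, x) ^ cp

def decode(y, lp_inv, cp):
    """P^{-1}(y) = LP^{-1} . (y XOR CP)"""
    return mat_vec(lp_inv, y ^ cp)

rng = random.Random(2024)
LP, LP_INV, CP = random_affine(32, rng)
word = 0xDEADBEEF
round_trip = decode(encode(word, LP, CP), LP_INV, CP)
```

Here `round_trip` equals `word`, and the XOR of an odd number of encoded words is again an encoded word, which is the property the round computation relies on.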

First, each 32-bit word $X_{k}$ $(k = 0, 1, 2, 3)$ of the plaintext is externally encoded by a 32-dimensional affine transformation $P_{k}$, which yields the encoded input word $\hat{X}_{k}$:

$$(\hat{X}_{0}, \hat{X}_{1}, \hat{X}_{2}, \hat{X}_{3}) = (P_{0}(X_{0}), P_{1}(X_{1}), P_{2}(X_{2}), P_{3}(X_{3})).$$

Similarly, $P_{32+k}^{-1}$ denotes the external decoding that recovers the ciphertext:

$$(X_{35}, X_{34}, X_{33}, X_{32}) = (P_{35}^{-1}(\hat{X}_{35}), P_{34}^{-1}(\hat{X}_{34}), P_{33}^{-1}(\hat{X}_{33}), P_{32}^{-1}(\hat{X}_{32})).$$

Each round of computation in the Xiao–Lai SM4 white-box scheme is divided into three sequential parts. Part 1 calculates $Y_{i} = X_{i+1} \oplus X_{i+2} \oplus X_{i+3}$. Since $\hat{X}_{i+1}$, $\hat{X}_{i+2}$, and $\hat{X}_{i+3}$ are secured by different affine transformations, the XOR operation cannot be performed directly. Hence, $J_{i+k}$ $(k = 1, 2, 3)$ is applied before the XOR operations. $J_{i+k}$ comprises two affine transformations: the first eliminates the affine transformation $P_{i+k}$, and the second applies a unified encoding $E_{i}^{-1}$. For each $\hat{X}_{i+k}$, the corresponding $J_{i+k}$ is processed as follows:

$$\hat{X}_{i+k}' = J_{i+k}(\hat{X}_{i+k}) = E_{i}^{-1} \circ P_{i+k}^{-1}(\hat{X}_{i+k}).$$

Here, $P_{i+k}^{-1}$ is an invertible 32-dimensional affine transformation over GF(2), and $E_{i} = \mathrm{diag}(E_{i,0}, E_{i,1}, E_{i,2}, E_{i,3})$, where each $E_{i,j}$ $(j \in \{0, 1, 2, 3\})$ is an invertible 8-dimensional affine transformation over GF(2). $E_{i}$ and $E_{i}^{-1}$ are inverses of each other. After unifying the encodings of the three words, the result is obtained through two XOR operations:

$$\hat{Y}_{i} = \hat{X}_{i+1}' \oplus \hat{X}_{i+2}' \oplus \hat{X}_{i+3}' = E_{i}^{-1}(X_{i+1} \oplus X_{i+2} \oplus X_{i+3}) = E_{i}^{-1}(Y_{i}).$$
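The middle equality can be expanded as a short worked step. As an assumption for illustration, write the unified encoding as $E_{i}^{-1}(x) = LE_{i}^{-1} \cdot x \oplus c_{i}$, where $LE_{i}^{-1}$ is its matrix and $c_{i}$ is a name introduced here for its constant vector:

$$\begin{aligned}
\hat{X}_{i+1}' \oplus \hat{X}_{i+2}' \oplus \hat{X}_{i+3}'
&= \bigoplus_{k=1}^{3} \left( LE_{i}^{-1} \cdot X_{i+k} \oplus c_{i} \right) \\
&= LE_{i}^{-1} \cdot (X_{i+1} \oplus X_{i+2} \oplus X_{i+3}) \oplus (c_{i} \oplus c_{i} \oplus c_{i}) \\
&= LE_{i}^{-1} \cdot Y_{i} \oplus c_{i} = E_{i}^{-1}(Y_{i}).
\end{aligned}$$

Because an odd number of words is combined, exactly one copy of the constant survives, so the sum is still a valid $E_{i}^{-1}$ encoding.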

Part 2 calculates $Z_{i} = T(Y_{i} \oplus rk_{i})$. Before encryption, a table is generated for each byte of the output $\hat{Y}_{i}$ from Part 1. The input of the table is an 8-bit $\hat{y}_{j}$ and the output is a 32-bit $\hat{z}_{j}$. Since $\hat{y}_{j}$ carries the affine transformation $E_{i,j}^{-1}$, $E_{i,j}$ is first applied to remove it, and then the round key byte $rk_{i,j}$ is added. After byte substitution with an S-box and the corresponding linear operation $L_{j}$, the output $z_{j}$ is encoded by another 32-dimensional invertible affine transformation $Q_{i}$. The lookup table is constructed as follows:

$$\hat{z}_{j} = Q_{i}(z_{j}) = Q_{i} \circ L_{j} \circ S(rk_{i,j} \oplus E_{i,j}(\hat{y}_{j})).$$

Therefore, four lookup tables are searched to obtain the results $\hat{z}_{j}$ $(j = 0, 1, 2, 3)$ during the encryption process. Finally, $\hat{Z}_{i}$ can be obtained by XORing the four $\hat{z}_{j}$:

$$\hat{Z}_{i} = \hat{z}_{0} \oplus \hat{z}_{1} \oplus \hat{z}_{2} \oplus \hat{z}_{3} = Q_{i} \circ L \circ S(Y_{i} \oplus rk_{i}) = Q_{i}(Z_{i}).$$
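The per-byte operation $L_{j}$ works because L is linear, so it splits into four 8-to-32-bit tables whose outputs are XORed. The sketch below checks this decomposition on random words. The encodings $Q_{i}$ and $E_{i,j}$, the S-box, and the round key are deliberately left out, and the byte ordering (byte 0 as the most significant byte) is an assumption.

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def L(b):
    """Linear transformation L of SM4."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def L_j(byte, j):
    """Contribution of byte j to L (assumption: j = 0 is the most significant byte)."""
    return L(byte << (8 * (3 - j)))

# Four 8-to-32-bit tables, one per byte position.
L_TABLES = [[L_j(b, j) for b in range(256)] for j in range(4)]

def L_via_tables(word):
    """Evaluate L as the XOR of four table lookups."""
    out = 0
    for j in range(4):
        out ^= L_TABLES[j][(word >> (8 * (3 - j))) & 0xFF]
    return out

rng = random.Random(7)
samples = [rng.getrandbits(32) for _ in range(1000)]
all_match = all(L(w) == L_via_tables(w) for w in samples)
```

Each table occupies $2^{8} \times 32$ bits, which matches the 1 KB size quoted later for the unmasked tables.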

Part 3 completes the calculation of $X_{i+4} = X_{i} \oplus Z_{i}$. It is necessary to unify the encodings that protect the output $\hat{Z}_{i}$ from Part 2 and the 0th word $\hat{X}_{i}$ of the ith round input. $G_{i}$ is used to convert the encoding of $\hat{X}_{i}$ and is composed of $P_{i}^{-1}$ and $P_{i+4}''$. $P_{i}^{-1}$ eliminates the encoding carried by $\hat{X}_{i}$, and $P_{i+4}''$ represents the new unified encoding. Likewise, $H_{i}$ is used to convert the encoding of $\hat{Z}_{i}$ and is composed of $Q_{i}^{-1}$ and $P_{i+4}'$. $Q_{i}^{-1}$ eliminates the encoding carried by $\hat{Z}_{i}$, and $P_{i+4}'$ is the new unified encoding. The calculation is as follows:

$$\hat{X}_{i}' = G_{i}(\hat{X}_{i}) = P_{i+4}'' \circ P_{i}^{-1}(\hat{X}_{i}),$$

$$\hat{Z}_{i}' = H_{i}(\hat{Z}_{i}) = P_{i+4}' \circ Q_{i}^{-1}(\hat{Z}_{i}).$$

The invertible matrix of $P_{i+4}''$ and $P_{i+4}'$ is the same, denoted by $LP_{i+4}$. However, the constant vector of $P_{i+4}''$ is $CP_{i+4}''$, and that of $P_{i+4}'$ is $CP_{i+4}'$, which satisfy $CP_{i+4}' \oplus CP_{i+4}'' = CP_{i+4}$. As a result, the output $\hat{X}_{i+4}$ of this round is obtained from the XOR sum of the two temporary results $\hat{X}_{i}'$ and $\hat{Z}_{i}'$:

$$\hat{X}_{i+4} = \hat{X}_{i}' \oplus \hat{Z}_{i}' = P_{i+4}(X_{i} \oplus Z_{i}) = P_{i+4}(X_{i+4}).$$
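Spelled out with the matrix and constant parts, and using $G_{i}(\hat{X}_{i}) = P_{i+4}''(X_{i})$ and $H_{i}(\hat{Z}_{i}) = P_{i+4}'(Z_{i})$, the step reads:

$$\begin{aligned}
\hat{X}_{i}' \oplus \hat{Z}_{i}'
&= \left( LP_{i+4} \cdot X_{i} \oplus CP_{i+4}'' \right) \oplus \left( LP_{i+4} \cdot Z_{i} \oplus CP_{i+4}' \right) \\
&= LP_{i+4} \cdot (X_{i} \oplus Z_{i}) \oplus \left( CP_{i+4}' \oplus CP_{i+4}'' \right) \\
&= LP_{i+4} \cdot X_{i+4} \oplus CP_{i+4} = P_{i+4}(X_{i+4}).
\end{aligned}$$

Splitting the constant $CP_{i+4}$ into two shares is what lets two separately encoded words combine into a single correctly encoded output.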

2.3. The Masked AES White-Box Scheme

In 2018, Lee et al. [4] proposed a masked AES white-box scheme based on Chow et al.’s white-box design [1] to thwart first-order DCAs. The AES white-box scheme introduced by Chow et al. [1] contains four classes of lookup tables. Lee et al. added masks to the Type II table as displayed in Figure 2.

Linear and nonlinear encodings protect the 8-bit input, so decoding is performed first. Then, AddRoundKey, SubBytes, and partial MixColumns of the AES round function are performed sequentially to obtain the genuine 32-bit state word T. Secondly, for each value of byte $t_{j}$ $(j \in [0, 3])$, a randomly generated byte value $m_{j,t_{j}}$ is added, that is, $\hat{t}_{j} = t_{j} \oplus m_{j,t_{j}}$. Next, the masked data $\hat{t}_{j}$ and the 256 mask values $m_{j} = \{m_{j,t_{j}} \mid t_{j} = 0, 1, \dots, 255\}$ are each encoded with the same linear encoding. Finally, distinct nonlinear encodings are applied for further protection. As a result, Lee et al. expanded an 8-to-32-bit table to an 8-to-64-bit table, which doubled the required memory.

3. Masked SM4 White-Box Scheme

We attempted to enhance the Xiao–Lai SM4 white-box scheme [10] by exploiting the Boolean masking technique to prevent DCAs. In the Xiao–Lai scheme [10], the S-box and the linear operation L are combined into a single lookup table, with one byte as input and a 32-bit word as output. Following Lee et al.’s masking method, an 8-to-64-bit table is required to store both the masked data and the mask values, totaling 2 KB ($2^{8} \times 64$ bits) of memory. However, according to the algebraic analysis of the DIBO function, the encodings combined with the linear operation can be divided into four independent functions [16]. Furthermore, each function may leak information about the S-box output through collision analysis [17]. Therefore, we separate the S-box and the linear operation L, create a lookup table for the nonlinear S-box operation, and mask its output. This reduces the lookup table size to 256 B ($2^{8} \times 8$ bits). Because distinct encodings protect the masked data and the mask values separately, the true values of the intermediate state bytes cannot be directly disclosed through XOR.

Secondly, to reduce the total number of real random numbers, we exploit the common share technique [18], reusing half of the random numbers. Specifically, we randomly generate 256 values $\{m_{0,t_{0}} \mid t_{0} = 0, 1, \dots, 255\}$ as mask values for the 256 output values of the first S-box operation. Next, we generate a random permutation $\varphi$ and ensure that the function compositions $\varphi^{2}$ and $\varphi^{3}$ do not result in an identity permutation. Subsequently, the permutation $\varphi$ is applied to derive the indices $\{\varphi^{j}[t_{j}] \mid t_{j} = 0, 1, \dots, 255;\ j \in \{1, 2, 3\}\}$, which select the mask values that secure the outputs of the following S-box operations as $S(t_{j}) \oplus m_{0,\varphi^{j}[t_{j}]}$. We do not record the corresponding mask value along with the output of each S-box operation as done in the Type II table of Lee et al.’s scheme. Instead, we only record the permutation table and the 256 randomly generated mask values $\{m_{0,t_{0}} \mid t_{0} = 0, 1, \dots, 255\}$. Since the L operation of SM4 is linear, the following relationship holds:

$$\begin{aligned}
& L(S(t_{0}) \oplus m_{0,t_{0}}) \oplus L(S(t_{1}) \oplus m_{0,\varphi[t_{1}]}) \oplus L(S(t_{2}) \oplus m_{0,\varphi^{2}[t_{2}]}) \oplus L(S(t_{3}) \oplus m_{0,\varphi^{3}[t_{3}]}) \\
={} & L(S(t_{0}) \oplus S(t_{1}) \oplus S(t_{2}) \oplus S(t_{3})) \oplus L(m_{0,t_{0}} \oplus m_{0,\varphi[t_{1}]} \oplus m_{0,\varphi^{2}[t_{2}]} \oplus m_{0,\varphi^{3}[t_{3}]}) \\
={} & Z_{0} \oplus Z_{1}.
\end{aligned}$$

As illustrated in the above equation, $Z_{1} = L(m_{0,t_{0}} \oplus m_{0,\varphi[t_{1}]} \oplus m_{0,\varphi^{2}[t_{2}]} \oplus m_{0,\varphi^{3}[t_{3}]})$ can be calculated when $m_{0,t_{j}}$ and the permutation $\varphi$ are input. Consequently, only two tables are needed to remove the related mask values during the subsequent unmasking process. Therefore, the required memory space is further reduced.
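The mask-reuse relation can be checked numerically with a short sketch. It assumes that byte j of each operand sits at byte position j of a 32-bit word, which is one reading of the equation above. The S-box and $\varphi$ are random stand-ins, and no encodings are applied.

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def L(b):
    """Linear transformation L of SM4."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def pack(bytes4):
    """Place byte j at byte position j of a 32-bit word (j = 0 is most significant)."""
    b0, b1, b2, b3 = bytes4
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3

def iterate(phi, value, times):
    """Apply the permutation phi to value the given number of times."""
    for _ in range(times):
        value = phi[value]
    return value

rng = random.Random(42)
sbox = list(range(256))
rng.shuffle(sbox)                      # stand-in S-box (assumption)
phi = list(range(256))
rng.shuffle(phi)                       # stand-in permutation (assumption)
masks = [rng.randrange(256) for _ in range(256)]   # the 256 stored mask bytes

t = [0x12, 0x34, 0x56, 0x78]           # example S-box inputs t_0..t_3
mask_bytes = [masks[iterate(phi, t[j], j)] for j in range(4)]
masked = [sbox[t[j]] ^ mask_bytes[j] for j in range(4)]

z_masked = L(pack(masked))             # result of the masked data path
z0 = L(pack([sbox[v] for v in t]))     # genuine value Z_0
z1 = L(pack(mask_bytes))               # mask path Z_1
```

Only `masks` and `phi` need to be stored to recompute `z1`, which is the memory saving the text describes.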

Finally, a critical consideration is the appropriate location for unmasking. In 2020, Lee et al. [5] pointed out that unmasking after MixColumns may expose the output value of the first-round computation to an attacker. This vulnerability allows an attacker to target the round output for key extraction using DCAs. To address this problem, we recommend that all intermediate values remain masked throughout the encryption process. Additionally, distinct random numbers and permutations should be utilized for different encryption rounds.

Just as the intermediate states of the first and last rounds of AES are vulnerable to secret leakage via DCA, the intermediate states of rounds 0–3 and 28–31 of SM4 are also high risk. To further reduce the required memory, we can simplify the scheme by adding masks to only these eight rounds. Consequently, in rounds 4–7, the unmasking calculations need to be incorporated into Part 1 and Part 3. The remaining rounds (8–27) adhere to the Xiao–Lai SM4 white-box design. In what follows, we describe our white-box scheme in detail using the simplified version. The symbols used in subsequent sections are defined in Table 1.

3.1. Table Generation

Five types of lookup tables are defined in this scheme, as shown in Figure 3. Among them, (a) Table_1 is the lookup table used in Rounds 0 and 28; (b) Table_2 corresponds to Rounds 1–3 and 29–31; (c) Table_3 is used in Rounds 4–27; (d) M is the table for recording random mask values in Rounds 0–3 and 28–31; (e) $\varphi$ is the permutation table in Rounds 0–3 and 28–31. In this section, we will elaborate on how to generate these five types of tables.

Masking is applied to the output of the S-box during the generation of Table_1, as depicted in Figure 3a. The input is a byte of $\hat{Y}_{i} = \hat{y}_{0} \,\|\, \hat{y}_{1} \,\|\, \hat{y}_{2} \,\|\, \hat{y}_{3}$, and the output is $\hat{z}_{j}$. The generation process is as follows:

$$\hat{z}_{j} = E_{i,j}^{\prime\,-1}\left(S(E_{i,j}(\hat{y}_{j}) \oplus rk_{i,j}) \oplus m_{i,\varphi^{j}[\hat{y}_{j}]}\right), \quad j \in [0, 3],\ i = 0 \text{ or } 28.$$

Here, $E_{i,j}$ and $E_{i,j}^{\prime\,-1}$ are two 8-dimensional invertible affine transformations, and $rk_{i,j}$ represents the jth byte of the ith round key. $m_{i,\varphi^{j}[\hat{y}_{j}]} = R_{i}(M[\varphi^{j}[\hat{y}_{j}]])$ means that $\varphi^{j}[\hat{y}_{j}]$ is the index into table M used to obtain the corresponding encoded mask value. $R_{i}$ represents the 8-dimensional invertible affine transformation that eliminates the protection of the mask values. Table_1 needs 256 B ($2^{8} \times 8$ bits) of memory. There are four such tables in each round and two rounds in total, so all Table_1 instances consume 2 KB ($4 \times 2 \times 256$ B) of memory.
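The Table_1 construction can be sketched as follows. This is an illustration under assumptions, not the authors' code: the S-box and the permutation are random stand-ins, `rk_byte` is a hypothetical round-key byte, and each 8-bit affine encoding is held as a 256-entry lookup array produced by the illustrative helper `random_affine8`.

```python
import random

def random_affine8(rng):
    """Sample an invertible 8-bit affine map as (table, inverse_table)."""
    while True:
        rows = [rng.randrange(256) for _ in range(8)]
        const = rng.randrange(256)
        table = []
        for x in range(256):
            y = 0
            for k, row in enumerate(rows):
                y |= (bin(row & x).count("1") & 1) << k
            table.append(y ^ const)
        if len(set(table)) == 256:      # bijective, hence the matrix is invertible
            inverse = [0] * 256
            for x, y in enumerate(table):
                inverse[y] = x
            return table, inverse

def iterate(phi, value, times):
    """Apply the permutation phi to value the given number of times."""
    for _ in range(times):
        value = phi[value]
    return value

rng = random.Random(0)
sbox = list(range(256))
rng.shuffle(sbox)                       # stand-in S-box (assumption)
phi = list(range(256))
rng.shuffle(phi)                        # stand-in permutation table
j = 2                                   # byte position served by this table
rk_byte = 0xA5                          # hypothetical round-key byte rk_{i,j}

E, _ = random_affine8(rng)              # E_{i,j}
E_out, E_out_inv = random_affine8(rng)  # E'_{i,j} and its inverse
R, R_inv = random_affine8(rng)          # R_i and its inverse

# M-table: random mask bytes stored under the encoding R_i^{-1}.
M = [R_inv[rng.randrange(256)] for _ in range(256)]

def mask_for(y_hat):
    """m_{i, phi^j[y_hat]} = R_i(M[phi^j[y_hat]])"""
    return R[M[iterate(phi, y_hat, j)]]

# Table_1: 256 one-byte entries indexed by the encoded input byte.
table_1 = [E_out_inv[sbox[E[y] ^ rk_byte] ^ mask_for(y)] for y in range(256)]
```

Decoding an entry with `E_out` and XORing the recovered mask gives back the S-box output, which is the correctness condition the later unmasking relies on.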

Since Table_2 is used in Rounds 1–3 and 29–31, the mask values should be updated during the generation of Table_2. As depicted in Figure 3b, the input is two bytes, $\hat{y}_{j}$ and $\hat{Z}_{(i-1)1,j}$, and the output is an 8-bit value $\hat{z}_{j}$. The relationship between inputs and output is defined as follows:

$$\hat{z}_{j} = E_{i,j}^{\prime\,-1}\left(S(E_{i,j}(\hat{y}_{j}) \oplus N_{j}(\hat{Z}_{(i-1)1,j}) \oplus rk_{i,j}) \oplus m_{i,\varphi^{j}[\hat{y}_{j}]}\right), \quad j \in [0, 3].$$

Here, $\hat{Z}_{(i-1)1,j}$ represents the jth byte of the result from Part 2 in the $(i-1)$th round computation, and $N_{j}$ is an invertible 8-dimensional linear transformation. The way to obtain $m_{i,\varphi^{j}[\hat{y}_{j}]}$ remains the same as before. Table_2 requires 64 KB ($2^{8} \times 2^{8} \times 8$ bits) of memory. With four tables per round across six rounds, the total memory consumption is 1.5 MB ($4 \times 6 \times 64$ KB). In the last four rounds, mask values are unprotected, so there is no need to use $N_{j}$ to remove the encoding when generating the related tables.

Table_3 is utilized in Rounds 4–27 in the simplified version of our masked SM4 white-box scheme. As shown in Figure 3c, its generation is identical to that of the table in the Xiao–Lai scheme described in Section 2. Table_3 requires 1 KB ($2^{8} \times 32$ bits) of memory. Each round has four tables, so the 24 rounds consume a total of 96 KB ($4 \times 24 \times 1$ KB) of memory.

Simultaneously, the mask value $m_{i,j}$ is generated randomly and protected by the encoding $R_{i}^{-1}$. The M-table stores mask values used in Rounds 0–3 and 28–31, as depicted in Figure 3d. It is an 8-to-8-bit table. The index of the table is $\varphi^{j}[\hat{y}_{j}]$ and each cell holds a random byte protected by the affine transformation $R_{i}^{-1}$. The generation of the mask table M is as follows:

$$\begin{array}{l}
\textbf{for } \hat{y}_{0} = 0 \textbf{ to } 255 \textbf{ do} \\
\quad m = \mathrm{rand}() \bmod 256; \\
\quad M[\hat{y}_{0}] \leftarrow R_{i}^{-1}(m).
\end{array}$$

The size of an M-table is 256 B ($2^{8} \times 8$ bits), with only one M-table generated per round. Thus, eight rounds consume 2 KB ($8 \times 256$ B) of memory.

The $\varphi$-table stores the random permutation used in Rounds 0–3 and 28–31, as depicted in Figure 3e. It is generated as follows:

$$\begin{array}{l}
\text{Initialize } \varphi[s] = s,\ s = 0, \dots, 255; \\
\textbf{for } s = 0 \textbf{ to } 255 \textbf{ do} \\
\quad t = \mathrm{random}() \bmod 256; \\
\quad \text{Swap } \varphi[s] \text{ and } \varphi[t].
\end{array}$$

A $\varphi$-table also needs 256 B ($2^{8} \times 8$ bits) of memory, and each round needs only one $\varphi$-table. Hence, eight rounds require 2 KB ($8 \times 256$ B) of memory.
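A possible way to generate the permutation while enforcing the earlier requirement that $\varphi^{2}$ and $\varphi^{3}$ are not the identity is sketched below. Note the substitution: this sketch uses Python's `random.shuffle`, a Fisher-Yates shuffle, instead of the swap loop above, because swapping each position with an index drawn from the whole range does not yield a uniform distribution over permutations. A real implementation would also want a cryptographically secure random source; the function names are illustrative.

```python
import random

def compose_power(phi, power):
    """Return phi composed with itself `power` times, as a lookup table."""
    result = list(range(len(phi)))
    for _ in range(power):
        result = [phi[v] for v in result]
    return result

def generate_phi(rng, size=256):
    """Sample a permutation whose square and cube are both not the identity."""
    identity = list(range(size))
    while True:
        phi = identity[:]
        rng.shuffle(phi)                # Fisher-Yates shuffle (swapped-in technique)
        if compose_power(phi, 2) != identity and compose_power(phi, 3) != identity:
            return phi

phi = generate_phi(random.Random(2024))
```

The rejection loop almost never repeats for 256 elements, since permutations whose square or cube is the identity are a vanishing fraction of all permutations.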

3.2. Round Operation

In this section, we will introduce how to use our lookup tables and affine transformations to obtain the output of a round. Given the differences in round computations, we will detail these processes separately.

3.2.1. Rounds 0–3 and 28–31

Figure 4 takes the execution of the 0th round of encryption as an example. Part 1 returns the XOR sum of the lower three words, $X_{1} \oplus X_{2} \oplus X_{3}$. As in the Xiao–Lai scheme, the encodings must be unified before the XOR operations are performed.

Part 2 involves round key addition, S-box operations, and the linear operation L. Here, this part is further divided into three portions. First, round key addition and an S-box operation are computed using a lookup table. The input of Part 2 is $\hat{Y}_{i}$, which is protected by $E_{i}^{-1}$. We use the jth byte $\hat{y}_{j}$ of $\hat{Y}_{i}$ as the index of the jth Table_1 to obtain the corresponding output $\hat{z}_{j}$. Notably, when $i \in \{1, 2, 3, 29, 30, 31\}$, Table_2 is used to execute round key addition, an S-box operation, and mask value updates. Thus, there are two inputs to the lookup table, $\hat{Z}_{(i-1)1}$ and $\hat{Y}_{i}$, as illustrated in Figure 5. Specifically, Table_1 uses a single index $\hat{y}_{j}$, whereas Table_2 uses two indices, $\hat{y}_{j}$ and $\hat{Z}_{(i-1)1,j}$.

Secondly, the linear transformation L is performed. We concatenate the four outputs from the four lookup tables as the input of a composite affine transformation $P_{i+4}' \circ L \circ E_{i}'$, resulting in $\hat{Z}_{i0}$. The affine transformation is calculated as follows:

$$\hat{Z}_{i0} = P_{i+4}' \circ L \circ E_{i}'(\hat{z}_{0} \,\|\, \hat{z}_{1} \,\|\, \hat{z}_{2} \,\|\, \hat{z}_{3}).$$

Here, $P_{i+4}'$ is a 32-dimensional invertible affine transformation. The linear matrix of $P_{i+4}'$ is identical to that of the affine transformation $P_{i+4}$, but the constant vector is different, denoted by $CP_{i+4}'$.

Thirdly, to ensure the correctness of encryption, we should calculate the protected mask values simultaneously. First, $\hat{y}_{j}$ is used as the index of the $\varphi$-table. We repeat the lookup operation j times, and the resulting output $\varphi^{j}[\hat{y}_{j}]$ serves as the index of the M-table. Next, we search the M-table to obtain the mask value that secures $z_{j}$. Finally, we concatenate the four mask values and apply a composite affine transformation $N^{-1} \circ L \circ \mathrm{diag}(R_{i}, R_{i}, R_{i}, R_{i})$ to obtain $\hat{Z}_{i1}$. This portion is calculated as follows:

$$\hat{Z}_{i1} = N^{-1} \circ L \circ \mathrm{diag}(R_{i}, R_{i}, R_{i}, R_{i})\left(M[\varphi^{0}[\hat{y}_{0}]] \,\|\, M[\varphi^{1}[\hat{y}_{1}]] \,\|\, M[\varphi^{2}[\hat{y}_{2}]] \,\|\, M[\varphi^{3}[\hat{y}_{3}]]\right).$$

Here, $N = \mathrm{diag}(N_{0}, N_{1}, N_{2}, N_{3})$, where $N_{j}$ is an 8-dimensional invertible linear transformation, and $N \circ N^{-1}$ is the identity. For the last four rounds of computation, the mask values are unprotected, which means the composite affine transformation is defined by $L \circ \mathrm{diag}(R_{i}, R_{i}, R_{i}, R_{i})$.

Part 3 calculates the XOR sum of the output of the round function and the state word $X_{i}$. Because $\hat{X}_{i}$ and $\hat{Z}_{i0}$ are secured by different encodings, the encoding of $\hat{X}_{i}$ must first be unified with that of $\hat{Z}_{i0}$; an XOR operation then yields the output $\hat{X}_{i+4}$ of the current round, as depicted in Part 3 of Figure 4:

$$\hat{X}_{i}' = P_{i+4}'' \circ P_{i}^{-1}(\hat{X}_{i}),$$

$$\hat{X}_{i+4} = \hat{X}_{i}' \oplus \hat{Z}_{i0}.$$

The linear matrix of $P_{i+4}''$ is the same as that of $P_{i+4}'$, that is, $LP_{i+4}$. However, the constant vectors are different and satisfy $CP_{i+4}' \oplus CP_{i+4}'' = CP_{i+4}$.

It should be noted that after 32 rounds of iterations, it is necessary to remove the encodings of the four 32-bit words and then XOR them with the corresponding mask values to obtain the genuine ciphertext.

3.2.2. Rounds 4–27

In the algebraic enhancement version of our scheme, the calculation during each middle round of encryption is identical to that in Round 1. However, in the simplified version of our scheme, the implementation of each round follows the design of the Xiao–Lai scheme except for Rounds 4–7. The lookup table used in Part 2 is Table_3 in Rounds 4–27. However, we should remove mask values during the process of Part 1 and Part 3 in Rounds 4–7, because the input words $\hat{X}_{4}, \hat{X}_{5}, \hat{X}_{6}, \hat{X}_{7}$ are masked. Therefore, the steps are slightly different. Below, we will only introduce Part 1 and Part 3 of Rounds 4–7.

Part 1 primarily calculates $X_{i+1} \oplus X_{i+2} \oplus X_{i+3}$. First of all, we should unify the encodings of $\hat{X}_{i+k}$ $(k = 1, 2, 3)$ and the mask values:

$$\hat{X}_{i+k}' = E_{i}^{-1} \circ P_{i+k}^{-1}(\hat{X}_{i+k}), \quad i = 4, \dots, 7,\ k = 1, 2, 3;$$

$$\widehat{RM}_{i} = LE_{i}^{-1} \circ N\left(\bigoplus_{l=i-3}^{3} \hat{Z}_{l1}\right), \quad i = 4, 5, 6.$$

Here, $LE_{i}^{-1}$ is the invertible matrix of the affine transformation $E_{i}^{-1}$. As a result, the output $\hat{Y}_{i}$ from the following XOR operation remains protected by an affine transformation rather than a linear transformation.

$$\hat{Y}_{i} = \hat{X}_{i+1}' \oplus \hat{X}_{i+2}' \oplus \hat{X}_{i+3}' \oplus \widehat{RM}_{i}.$$

Part 3 completes the XOR sum between $X_{i}$ and the output $Z_{i} = T(X_{i+1}, X_{i+2}, X_{i+3}, rk_{i})$ to obtain the result $X_{i+4}$ for this round. Additionally, unmasking is involved in Rounds 4–7 to ensure the correctness of the scheme. Before these two XOR operations, the encodings of $\hat{X}_{i}$, $\hat{Z}_{i}$, and the mask values are unified:

$$\begin{aligned}
\hat{X}_{i}' &= P_{i+4} \circ P_{i}^{-1}(\hat{X}_{i}) \oplus CP_{i+4}' \oplus CP_{i+4}'', \\
\hat{Z}_{(i-4)1}' &= LP_{i+4} \circ N(\hat{Z}_{(i-4)1}) \oplus CP_{i+4}'', \\
\hat{Z}_{i}' &= LP_{i+4}(Q_{i}^{-1}(\hat{Z}_{i})) \oplus CP_{i+4}', \\
\hat{X}_{i+4} &= \hat{X}_{i}' \oplus \hat{Z}_{(i-4)1}' \oplus \hat{Z}_{i}'.
\end{aligned}$$

4. Performance

The memory consumption of the simplified version of our SM4 white-box scheme is listed in Table 2.

As shown in Table 2, Part 1 uses $100\ (= 3 \times 32 + 4)$ 32-dimensional affine transformations over the 32 rounds of encryption, Part 2 requires $12\ (= 2 \times 4 + 4)$, and Part 3 requires $64\ (= 32 + 24 + 4 + 4)$. Consequently, the memory requirement for affine transformations is 22.69 KB (1056 bits $\times\ (100 + 12 + 64)$). Table_1 is used only in Rounds 0 and 28, with four tables constructed in each round, requiring 2 KB ($2^{11}$ bits $\times\ 8$). Table_2 is used in six rounds (Rounds 1–3 and 29–31), with four tables constructed in each round, requiring 1.5 MB ($2^{6}$ KB $\times\ 24$). Table_3 is needed in the middle 24 rounds, with four tables generated in each round, requiring 96 KB (1 KB $\times\ 96$). An M-table and a $φ$-table are involved in Rounds 0–3 and 28–31, so they require 4 KB ($2^{11}$ bits $\times\ 8 \times 2$). In summary, the total memory required to generate a white-box encryption program is approximately 1660.69 KB, or about 1.62 MB.
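The memory total for the simplified version can be reproduced with a few lines of arithmetic. The sketch below only restates the counts and per-item sizes given in the text and in Table 2; the dictionary keys and variable names are illustrative.

```python
BITS_PER_KB = 8 * 1024

# One 32-dimensional affine transformation: 32x32 matrix plus 32-bit constant.
affine_bits = 32 * 32 + 32                                   # 1056 bits
n_affine = (3 * 32 + 4) + (2 * 4 + 4) + (32 + 24 + 4 + 4)    # 100 + 12 + 64

sizes_kb = {
    "affine":  affine_bits * n_affine / BITS_PER_KB,         # 176 transformations
    "Table_1": (2**8 * 8) * 8 / BITS_PER_KB,                 # 4 tables x 2 rounds
    "Table_2": (2**8 * 2**8 * 8) * 24 / BITS_PER_KB,         # 4 tables x 6 rounds
    "Table_3": (2**8 * 32) * 96 / BITS_PER_KB,               # 4 tables x 24 rounds
    "M":       (2**8 * 8) * 8 / BITS_PER_KB,                 # 1 table x 8 rounds
    "phi":     (2**8 * 8) * 8 / BITS_PER_KB,                 # 1 table x 8 rounds
}

total_kb = sum(sizes_kb.values())
total_mb = total_kb / 1024

for name, kb in sizes_kb.items():
    print(f"{name:8s} {kb:10.4f} KB")
print(f"total    {total_kb:10.4f} KB = {total_mb:.2f} MB")
```

Table_2 accounts for 1536 KB of the total, which is why restricting the masked rounds to Rounds 0–3 and 28–31 keeps the simplified version small.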

In addition to DCA, attackers may attempt reverse engineering and code-lifting attacks. To mitigate such threats, external encodings are essential. Additionally, masking all rounds can enhance security against not only DCAs but also algebraic attacks. The full version of the scheme, referred to as the algebraic enhancement scheme, includes only Table_1 (Round 0), Table_2 (Rounds 1–31), the M-table, the $φ$-table, and affine transformations. The memory required for the lookup tables is $4 \times 2$ Kb $+\ 4 \times 31 \times 2^{6}$ KB $+\ 2 \times 32 \times 2$ Kb $= 17$ KB $+\ 7.75$ MB. There are 216 affine transformations of 1056 bits (about 1 Kb) each, requiring $216 \times 1056$ bits $\approx 27.84$ KB of memory. In summary, the algebraic enhancement scheme requires about 7.8 MB (= 7.75 MB + 17 KB + 27.84 KB) of memory.
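The same bookkeeping applies to the algebraic enhancement scheme. This short sketch again uses only the counts stated in the text; the variable names are illustrative.

```python
BITS_PER_KB = 8 * 1024

table_1_bits = 4 * (2**8 * 8)                 # Round 0 only
table_2_bits = 4 * 31 * (2**8 * 2**8 * 8)     # Rounds 1-31
m_phi_bits = 2 * 32 * (2**8 * 8)              # M-table and phi-table, 32 rounds
affine_bits = 216 * (32 * 32 + 32)            # 216 affine transformations

small_tables_kb = (table_1_bits + m_phi_bits) / BITS_PER_KB   # 17 KB
table_2_mb = table_2_bits / BITS_PER_KB / 1024                # 7.75 MB
affine_kb = affine_bits / BITS_PER_KB                         # about 27.84 KB

total_mb = table_2_mb + (small_tables_kb + affine_kb) / 1024
print(f"{small_tables_kb} KB + {table_2_mb} MB + {affine_kb:.2f} KB = {total_mb:.2f} MB")
```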

On a personal computer configured with the 12th Gen Intel(R) Core(TM) i5-12400F 2.50 GHz CPU and 16 GB RAM, the average time to generate a white-box program for the simplified version is about 0.385 s, and the average runtime to encrypt a block of plaintext is approximately 0.07 ms.

The efficiency of the white-box SM4 scheme designed in this paper is compared with that of other white-box schemes in Table 3. The Xiao–Lai scheme and Yao et al.’s scheme require minimal memory and take the shortest time to generate a white-box program. However, these two schemes are vulnerable to DCA [15,19]. The Bai–Wu scheme relies entirely on lookup tables and XOR operations for encryption, which yields the shortest runtime for encrypting a 128-bit plaintext but requires substantial memory. Nevertheless, this scheme cannot thwart DCA either [20].

The white-box schemes proposed by Zhang et al. and Yuan et al. are designed to resist DCAs but require a large amount of memory. Specifically, Zhang et al.’s scheme (24.3 MB) requires approximately 15 times the memory of our simplified version (1.62 MB), and Yuan et al.’s scheme (34.5 MB) requires about 21 times as much. Since the source code of their schemes is not available, we could not measure the average time to generate a white-box program or the runtime of encrypting a 128-bit plaintext. However, it is worth noting that Yuan et al. applied 8-dimensional nonlinear encodings to the Bai–Wu scheme to construct their scheme. Therefore, the time required to generate a white-box program in their scheme is theoretically longer than that of the Bai–Wu scheme and our scheme.

5. Security Analysis

5.1. White-Box Diversity and White-Box Ambiguity

White-box diversity and white-box ambiguity are important metrics to measure the security of a white-box scheme. White-box diversity refers to the number of different white-box implementations that maintain functional equivalence [1]. The higher the white-box diversity of a scheme, the less secret information an attacker can access. White-box ambiguity denotes the number of possible constructions for a given white-box implementation [1]. Similarly, the larger the white-box ambiguity is, the more difficult it is for an attacker to derive the hidden key from it.

Suppose that a white-box implementation has a total of $n$ steps and that step $k$ has $m_{k}$ choices of encoding; then the white-box diversity of the implementation is $\prod_{k=1}^{n} m_{k}$. The number of $d$-dimensional invertible matrices over GF(2) is

$$(2^{d} - 1) \times \prod_{j=1}^{d-1} \left(2^{d} - 1 - \sum_{k=1}^{j} \binom{j}{k}\right).$$

Hence, the number of 8-dimensional invertible matrices is about $2^{62}$, and that of 32-dimensional invertible matrices is about $2^{254}$. Similarly, the number of 8-dimensional invertible affine transformations is $2^{62} \times 2^{8}$, and that of 32-dimensional invertible affine transformations is $2^{254} \times 2^{32}$. For each table, the number of possible keys is $2^{8}$ and the number of possible mask values is $2^{8 \times 256}$. The white-box diversity of the components for each round in the proposal is as follows:

$$\begin{aligned} \text{Part 1 (affine trans.)}: \quad & (2^{254} \times 2^{32})^{3} \times (2^{62} \times 2^{8})^{4} = 2^{1138}; \\ \text{Part 2 (Table\_1)}: \quad & (2^{62} \times 2^{8})^{4} \times 2^{8 \times 4} \times 2^{8 \times 4} \times (2^{62} \times 2^{8})^{4} = 2^{624}; \\ \text{Part 2 (Table\_2)}: \quad & (2^{62} \times 2^{8})^{4} \times (2^{62})^{4} \times 2^{8 \times 4} \times 2^{8 \times 256 \times 4} \times (2^{62} \times 2^{8})^{4} = 2^{9032}; \\ \text{Part 2 (Table\_3)}: \quad & (2^{62} \times 2^{8})^{4} \times 2^{8 \times 4} \times (2^{254} \times 2^{32}) = 2^{598}; \\ \text{Part 2 (affine trans.)}: \quad & (2^{62} \times 2^{8})^{4} \times (2^{254} \times 2^{32}) \times (2^{62} \times 2^{8}) \times (2^{62})^{4} = 2^{884} \ \text{(the first four rounds)}, \\ & (2^{62} \times 2^{8})^{4} \times (2^{254} \times 2^{32}) = 2^{566} \ \text{(the last four rounds)}; \\ \text{Part 2 (M)}: \quad & 2^{8 \times 256} \times (2^{62} \times 2^{8}) = 2^{2118}; \\ \text{Part 2 ($φ$)}: \quad & 256!; \\ \text{Part 3}: \quad & (2^{254} \times 2^{32})^{3} \times 2^{32} = 2^{890}. \end{aligned}$$
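The counting formula for invertible matrices over GF(2) can be evaluated directly. The sketch below is an illustration only: the function names are hypothetical, and it checks that the binomial-sum form used in the text coincides with the more common product $\prod_{j=0}^{d-1}(2^{d} - 2^{j})$, since $\sum_{k=1}^{j}\binom{j}{k} = 2^{j} - 1$. For $d = 8$ it gives roughly $2^{62.2}$, in line with the 8-dimensional figure quoted above.

```python
import math

def gl2_order(d: int) -> int:
    """Count invertible d x d matrices over GF(2) as prod_{j=0}^{d-1} (2^d - 2^j)."""
    order = 1
    for j in range(d):
        order *= (1 << d) - (1 << j)
    return order

def gl2_order_binomial(d: int) -> int:
    """Same count, written with the inner binomial sum as in the text."""
    order = (1 << d) - 1
    for j in range(1, d):
        inner = sum(math.comb(j, k) for k in range(1, j + 1))  # equals 2^j - 1
        order *= (1 << d) - 1 - inner
    return order

# Both forms agree, and the base-2 logarithm gives the exponent for d = 8.
assert gl2_order(8) == gl2_order_binomial(8)
log2_gl8 = math.log2(gl2_order(8))
print(f"log2 |GL(8, 2)| = {log2_gl8:.2f}")
```

Multiplying by $2^{d}$ for the constant vector gives the number of invertible affine transformations of the same dimension.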

The white-box ambiguity of a table type can be estimated by dividing its white-box diversity by the (usually much smaller) number of distinct tables of that type. Finding a rigorous and tractable way to compute white-box ambiguity appears difficult [1], and the computation for our scheme is relatively complex, so we omit it here. Based on the white-box diversity results, our scheme is secure against brute-force attacks; that is, it is difficult for attackers to extract encodings and key-related information through exhaustive search.

5.2. Algebraic Attacks

We evaluate the security of our scheme with widely used algebraic methods, including BGE analysis, Lin–Lai analysis, and Pan et al.’s analysis.

5.2.1. BGE Analysis

As mentioned above, a BGE attack exploits the property $(out)^{r} = (In^{r+1})^{-1}$ between the output encoding of the previous round and the input decoding of the current round: by combining the lookup tables of two adjacent rounds, the output encoding can be eliminated for every round except the first. However, Xiao and Lai demonstrated that their scheme can resist BGE attacks [10], because the combination of lookup tables and related operations leaves a constant that cannot be removed and is unknown to attackers.

The middle rounds of the simplified scheme are similar to the round computation of the Xiao–Lai scheme. Hence, we focus on the algebraic enhancement scheme or the masked rounds of the simplified one. Let us take the first two rounds as an example, namely, the combination of Part 2 in Round 0 and Part 1 in Round 1. Specifically, we consider Table_1 and protected linear operation L in Part 2 of the 0-th round, along with the unified encoding transformation in Part 1 of the 1st round, as shown in Figure 6.

The output word of the combination is denoted by $V = (v_{0}, v_{1}, v_{2}, v_{3})$. The relationship between the input and the output can be described as follows:

$$\begin{aligned} V = {} & E_{1}^{-1} \circ P_{4}^{-1} \circ P'_{4} \circ L \circ E'_{0} \circ E_{0}^{'-1} \big( S(E_{0,0}(\hat{y}_{0}) \oplus rk_{0,0}) \oplus m_{0, φ[\hat{y}_{0}]} \,\|\, S(E_{0,1}(\hat{y}_{1}) \oplus rk_{0,1}) \oplus m_{0, φ^{1}[\hat{y}_{1}]} \,\|\, \\ & S(E_{0,2}(\hat{y}_{2}) \oplus rk_{0,2}) \oplus m_{0, φ^{2}[\hat{y}_{2}]} \,\|\, S(E_{0,3}(\hat{y}_{3}) \oplus rk_{0,3}) \oplus m_{0, φ^{3}[\hat{y}_{3}]} \big). \end{aligned}$$

We observe that $E_{0}^{'-1}$ and $E'_{0}$ cancel out. However, the composition of $P_{4}^{-1}$ and $P'_{4}$ adds a constant $LP_{4}^{-1} \cdot CP''_{4}$ to the related input. Denoting the output of $L$ as $Z_{0} = (u_{0}, u_{1}, u_{2}, u_{3})$, we obtain the relationship between each byte of $V$ and the output of $L$ as follows:

$$\begin{aligned} V &= E_{1}^{-1}(Z_{0} \oplus LP_{4}^{-1} \cdot CP''_{4}) \\ &= \mathrm{diag}\big(LE_{1,0}^{-1}(u_{0}) \oplus g_{1,0},\ LE_{1,1}^{-1}(u_{1}) \oplus g_{1,1},\ LE_{1,2}^{-1}(u_{2}) \oplus g_{1,2},\ LE_{1,3}^{-1}(u_{3}) \oplus g_{1,3}\big). \end{aligned}$$

Here, $LE_{1,j}^{-1}$ $(j = 0, 1, 2, 3)$ is the matrix part of $E_{1,j}^{-1}$ and $CE_{1,j}^{-1}$ is its constant vector. Moreover, $(g_{1,0} \,\|\, g_{1,1} \,\|\, g_{1,2} \,\|\, g_{1,3})$ represents $CE_{1}^{-1} \oplus LE_{1}^{-1}(LP_{4}^{-1} \cdot CP''_{4})$.

The composition of the affine transformations $P'_{4}$ and $P_{4}^{-1}$ leaves a constant vector $LP_{4}^{-1} \cdot CP''_{4}$ that cannot be eliminated and is unknown to the attacker. Moreover, the mask is a random variable that is also unknown to the attacker. Thus, the BGE attack is thwarted, as the constant cannot be deduced from the available information.
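How such a residual constant arises can be illustrated with a small GF(2) experiment. The sketch below is hedged in two ways. First, it assumes that the modified encoding differs from $P_{4}$ only by an extra constant, that is, $P'_{4}(x) = P_{4}(x) \oplus CP''_{4}$; this form is an assumption made for illustration and is not taken from the definitions in this section. Second, it uses a toy 8-bit width, and all names (`LP`, `CP`, `CP2`, `P_prime`) are hypothetical.

```python
import random

def parity(v: int) -> int:
    return bin(v).count("1") & 1

def mat_vec(rows, x):
    """Multiply a GF(2) matrix (list of row bitmasks) by the bit-vector x."""
    return sum(parity(row & x) << i for i, row in enumerate(rows))

def gf2_inverse(rows, n):
    """Invert an n x n GF(2) matrix by Gauss-Jordan elimination; None if singular."""
    aug = [rows[i] | (1 << (n + i)) for i in range(n)]   # augmented rows [A | I]
    for col in range(n):
        pivot = next((r for r in range(col, n) if (aug[r] >> col) & 1), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and (aug[r] >> col) & 1:
                aug[r] ^= aug[col]
    return [a >> n for a in aug]                          # right half holds A^{-1}

rng = random.Random(2024)
n = 8
while True:                                  # draw matrices until one is invertible
    LP = [rng.getrandbits(n) for _ in range(n)]
    LP_inv = gf2_inverse(LP, n)
    if LP_inv is not None:
        break
CP = rng.getrandbits(n)                      # constant vector of P
CP2 = rng.getrandbits(n)                     # extra constant, stand-in for CP''

def P(x):        return mat_vec(LP, x) ^ CP
def P_inv(y):    return mat_vec(LP_inv, y ^ CP)
def P_prime(x):  return P(x) ^ CP2           # ASSUMED form of the modified encoding

residue = mat_vec(LP_inv, CP2)               # plays the role of LP^{-1} * CP''
assert all(P_inv(P_prime(x)) == x ^ residue for x in range(1 << n))
```

Under this assumption, decoding with $P^{-1}$ does not return $x$ but $x \oplus LP^{-1} \cdot CP''$, so an attacker who composes adjacent tables is left with an offset that depends on secret values.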

5.2.2. Lin–Lai Analysis

Lin–Lai analysis combines the BGE attack with differential analysis and targets the Xiao–Lai SM4 white-box scheme; its time complexity for recovering the key is $O(2^{47})$. Since the middle rounds of our simplified scheme follow the design of the Xiao–Lai white-box scheme, Lin–Lai analysis can be applied to recover the round keys in these rounds.

If the attacker can conduct reverse engineering, the security requirements rise significantly. In such scenarios, the algebraic enhancement scheme, with mask values applied in all rounds, is strongly recommended. Here, we discuss the security of the enhanced scheme. Taking the first two rounds as an example, the related parts of the scheme are combined for Lin–Lai analysis, as illustrated in Figure 6. The relationship between the input and the output was established in Section 5.2.1. The linear transformation $L$ in SM4 can be represented as a matrix multiplication. Let $L_{jt}$ $(j, t \in [0, 3])$ denote the element in the $j$th row and $t$th column of the $L$-matrix (each element of the matrix is a byte), so that each $v_{j}$ (the $j$th byte of $V$) can be represented as a function of the input $(\hat{y}_{0}, \hat{y}_{1}, \hat{y}_{2}, \hat{y}_{3})$ as follows:

$$v_{j}(\hat{y}_{0}, \hat{y}_{1}, \hat{y}_{2}, \hat{y}_{3}) = LE_{1,j}^{-1}\left(\bigoplus_{t=0}^{3} L_{jt} \cdot \left(S(E_{0,t}(\hat{y}_{t}) \oplus rk_{0,t}) \oplus m_{0, φ^{t}[\hat{y}_{t}]}\right)\right) \oplus g_{1,j}, \quad j = 0, 1, 2, 3.$$

In the Xiao–Lai scheme, $g_{1,j}$ equals the $j$th byte of $CE_{1}^{-1} \oplus LE_{1}^{-1}(LP_{4}^{-1} \cdot CP''_{4})$, which is a constant. Lin et al. [11] proved that $LE_{1,j}^{-1}$ can be solved because any two $(v_{j}, v_{r})$ pairs are linearly related. In our scheme, however, the mask values are generated randomly, and each value of $\hat{y}_{j}$ corresponds to a different mask value. As a result, it becomes impossible to establish an equation between $\hat{y}_{j}$ and $m_{0, φ^{j}[\hat{y}_{j}]}$ from which to form the difference distribution table for the subsequent analysis.

Moreover, apart from the operations shown in Figure 6, the other operations do not leak mask values either. First, the mask values are protected by a randomly generated affine transformation $R_{i}^{-1}$, so the values are unknown. Second, considering only the lookup table M, we know that

$$\hat{z}_{j,1} = M[φ^{j}[\hat{y}_{j}]].$$

This operation only permutes the M table, preventing attackers from recovering the mask values. Subsequently, the lookup tables and affine transformations are evaluated collectively:

$$\hat{Z}_{01} = N^{-1} \circ L\big(R_{0}(M[φ^{0}[\hat{y}_{0}]]) \,\|\, R_{0}(M[φ^{1}[\hat{y}_{1}]]) \,\|\, R_{0}(M[φ^{2}[\hat{y}_{2}]]) \,\|\, R_{0}(M[φ^{3}[\hat{y}_{3}]])\big).$$

Similarly, since the output is protected by another randomly generated linear transformation $N$, establishing the relationship between the input $\hat{y}_{j}$ (or $M[φ^{j}[\hat{y}_{j}]]$) and the corresponding mask value becomes infeasible. Thus, no relevant information can be obtained to recover the mask value, and the enhanced scheme can prevent Lin–Lai analysis since the mask values remain confidential.
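The mask lookup $\hat{z}_{j,1} = M[φ^{j}[\hat{y}_{j}]]$ discussed above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' code: $φ^{j}$ is read as "apply $φ$ $j$ times", following the symbol table, the table contents are random placeholders, and the names `phi`, `M`, `phi_power`, and `mask_lookup` are hypothetical.

```python
import random

rng = random.Random(7)

phi = list(range(256))
rng.shuffle(phi)                                # random byte permutation (role of φ)
M = [rng.getrandbits(8) for _ in range(256)]    # 256 encoded mask bytes (role of M)

def phi_power(y: int, j: int) -> int:
    """Apply the permutation phi to the byte y, j times in succession."""
    for _ in range(j):
        y = phi[y]
    return y

def mask_lookup(y_bytes):
    """Return the encoded mask byte for each of the four input bytes."""
    return [M[phi_power(y, j)] for j, y in enumerate(y_bytes)]

# The same byte value at different positions is routed through a different
# number of permutation steps before indexing the single shared table.
y_hat = [0x3A, 0x3A, 0x3A, 0x3A]
indices = [phi_power(y, j) for j, y in enumerate(y_hat)]
masks = mask_lookup(y_hat)
print(indices, masks)
```

A single 256-entry table therefore serves all four byte positions of a round, which is how the permutation lets the scheme reuse mask values instead of storing a separate mask table per byte.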

5.2.3. Pan et al.’s Analysis

Pan et al.’s analysis reduces the complexity of Lin–Lai analysis by adjusting the order in which the unknowns are recovered. For the Xiao–Lai scheme, Lin et al. first recover $LE_{1,j}^{-1}$ and then recover the constant vector of the affine transformation $Q_{i}$ through the difference distribution table. Next, they obtain the constant portion $g_{i,j}$. Finally, they establish the key-related equations to recover the key. Pan et al., in contrast, first recover the matrix of each affine transformation and then use the known information to deduce its constant vector. As mentioned in Section 5.2.2, secret information cannot be recovered by solving equations that involve unknown mask values. Thus, this analysis method is ineffective against our algebraic enhancement scheme.

5.3. Side-Channel Attack

The algorithm obscures the key with masking, thereby diminishing the correlation between the key and the intermediate data. We use the publicly available tool Deadpool [21] for the DCA test. To allow comparison with other research results, we collected 200 traces for analysis, following the experiments in [20]. As depicted in Figure 7, there is no obvious peak in the differential traces of any byte when all possible values of a byte of the first round key are guessed. Moreover, the DCA program fails to return the correct result, indicating that attackers are unable to deduce the correct key.

In addition, we conducted two experiments under different conditions as follows:

(1): Fixing encodings in the scheme, we employ 10 different random keys to generate 10 white-box encryption programs. Under each encryption program, we collected 200 traces to perform DCA tests.
(2): Fixing the key in the scheme, we randomly generate 10 groups of encodings and the corresponding white-box encryption programs. Similarly, 200 traces are collected under each encryption program to conduct DCA tests.

The experimental results show that, under both conditions, there is no obvious peak in the differential traces, and the DCAs failed to yield the correct key. We also increased the number of traces to 1000 and repeated the DCAs, but the results remained unchanged. This outcome indicates that the attacker cannot recover the correct key using DCAs. Therefore, our scheme effectively thwarts first-order DCA.

Table 4 provides a security comparison of our SM4 white-box schemes with other white-box SM4 schemes. The simplified scheme is secure against both BGE attacks and DCAs. However, because its middle rounds follow the design of the Xiao–Lai scheme, it inherits the weaknesses of that scheme: it is susceptible to Lin–Lai analysis and Pan et al.’s analysis. The algebraic enhancement scheme, in contrast, is secure against the BGE attack, Lin–Lai analysis, Pan et al.’s analysis, and DCAs, comparable to the schemes independently developed by Yuan et al. and Zhang et al. It is worth highlighting that the Xiao–Lai scheme and the Bai–Wu scheme fail to withstand DCAs, as demonstrated in [19,20]. Additionally, their algebraic security is inferior to that of our algebraic enhancement scheme.

6. Conclusions

In this paper, we use Boolean masking to obscure the correlation between the intermediate data and the key, thereby mitigating DCAs. Furthermore, we reduce the storage overhead associated with masking, resulting in a white-box SM4 scheme with lower overhead and high security.

According to the experimental and theoretical analysis, we have demonstrated that introducing random mask values significantly enhances the security of the scheme not only against DCAs but also against algebraic attacks. For further research, we will exploit the idea of higher-order side-channel analysis and evaluate the security of the existing schemes.

Author Contributions

Conceptualization, S.Z. and S.C.; data curation, X.H.; investigation, Y.W. and Y.L.; methodology, S.Z. and Y.Y.; project administration, D.Z.; software, S.C.; validation, Y.Y., Y.L. and X.H.; formal analysis, D.Z.; resources, Y.W.; writing—original draft preparation, S.C.; writing—review and editing, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Laboratory Specialized Scientific Research Projects of Beijing Smart-chip Microelectronics Technology Co., Ltd., grant number SGSC0000AQQT2400701.

Data Availability Statement

The data presented in this study are available on request to the corresponding authors.

Conflicts of Interest

Authors Dongyan Zhao, Yubo Wang, Yan Li, Xiaobo Hu, and Yanyan Yu were employed by the company Beijing Smart-chip Microelectronics Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Chow, S.; Eisen, P.; Johnson, H.; van Oorschot, P.C. White-Box Cryptography and an AES Implementation. In Proceedings of the ACM Symposium on Applied Computing, St. John’s, NF, Canada, 15–16 August 2002.
2. Billet, O.; Gilbert, H.; Ech-Chatbi, C. Cryptanalysis of a White Box AES Implementation. In Proceedings of the ACM Symposium on Applied Computing, Waterloo, ON, Canada, 9–10 August 2004.
3. Bos, J.W.; Hubain, C.; Michiels, W.; Teuwen, P. Differential Computation Analysis: Hiding Your White-Box Designs is Not Enough. In Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2019.
4. Lee, S.; Kim, T.; Kang, Y. A Masked White-Box Cryptographic Implementation for Protecting Against Differential Computation Analysis. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2602–2615.
5. Lee, S.; Kim, M. Improvement on a Masked White-Box Cryptographic Implementation. IEEE Access 2020, 8, 90992–91004.
6. Biryukov, A.; Udovenko, A. Attacks and Countermeasures for White-box Designs. In Proceedings of the Advances in Cryptology—ASIACRYPT 2018, Brisbane, QLD, Australia, 2–6 December 2018; Peyrin, T., Galbraith, S., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2018; Volume 11273.
7. Seker, O.; Eisenbarth, T.; Liskiewicz, M. A White-Box Masking Scheme Resisting Computational and Algebraic Attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 61–105.
8. Bogdanov, A.; Rivain, M.; Vejre, P.S.; Wang, J. Higher-Order DCA against Standard Side-Channel Countermeasures. In Proceedings of the Constructive Side-Channel Analysis and Secure Design, COSADE 2019, Darmstadt, Germany, 3–5 April 2019; Polian, I., Stöttinger, M., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2019; Volume 11421.
9. GB/T 32907-2016; Information Security Technology—SM4 Block Cipher Algorithm. General Administration of Quality Supervision, Inspection and Quarantine of the People’s Republic of China, Standardization Administration of the People’s Republic of China. Standards Press of China: Beijing, China, 2016.
10. Xiao, Y.; Lai, X. White-Box cryptography and implementations of SMS4. In Proceedings of the 2009 CACR Annual Meeting, Guangzhou, China, 14 November 2009; pp. 24–34.
11. Lin, T.; Lai, X. Efficient attack to white-box SMS4 implementation. J. Softw. 2013, 24, 2238–2249.
12. Bai, K.; Wu, C. A secure white-box SM4 implementation. Secur. Commun. Netw. 2016, 9, 996–1006.
13. Alpirez Bock, E.; Bos, J.W.; Brzuska, C.; Hubain, C.; Michiels, W.; Mune, C.; Gonzalez, E.S.; Treff, P.T.A. White-Box Cryptography: Don’t Forget About Grey-Box Attacks. J. Cryptol. 2019, 32, 1095–1143.
14. Shi, Y.; Wei, W.; He, Z. A lightweight white-box symmetric encryption algorithm against node capture for WSNs. Sensors 2015, 15, 11928–11952.
15. Yao, S.; Chen, J. A new method for white-box implementation of SM4 algorithm. J. Cryptologic Res. 2020, 7, 358–374.
16. Tang, Y.; Gong, Z.; Li, B.; Zhao, L. Revisiting the computation analysis against internal encodings in white-box implementations. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 2023, 493–522.
17. Carlet, C.; Guilley, S.; Mesnager, S. Structural attack (and repair) of diffused-input-blocked-output white-box cryptography. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 57–87.
18. Coron, J.S.; Rondepierre, F.; Zeitoun, R. High Order Masking of Look-up Tables with Common Shares. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2018, 40–72.
19. Zhang, Y.; Xu, D.; Chen, J. Analysis and Improvement of White-box SM4 Implementation. J. Electron. Inf. Technol. 2022, 44, 2903–2913.
20. Yuan, Z.Q.; Chen, J. A white-box SM4 scheme against differential computation analysis. J. Cryptologic Res. 2023, 10, 386–396.
21. SideChannelMarvels/Deadpool. Available online: https://github.com/SideChannelMarvels (accessed on 9 June 2024).
22. Pan, W.L.; Qin, T.H.; Jia, Y.; Zhang, L.T. Cryptanalysis of two white-box SM4 implementations. J. Cryptologic Res. 2018, 5, 651–670.

Figure 1. Round function of SM4.

Figure 2. The masked Type II table introduced by Lee et al.

Figure 3. Five types of lookup tables in the simplified scheme.

Figure 4. Operations in Round 0.

Figure 5. Operations in Part 2 of Round 1.

Figure 6. Combination of lookup tables and operations.

Figure 7. The differential traces related to the first round key.

Table 1. Symbols.

i	The index of the iteration round, $i = 0, \dots, 31$ ;
j	The index of a byte within a 32-bit state word, $j = 0, \dots, 3$ ;
k	The index of a word within a 128-bit state, $k = 0, \dots, 3$ ;
$X_{i + k}$	The kth input word (without protection) of the ith round computation;
$r k_{i, j}$	The jth byte of the ith round key;
$X_{i + 4}$	The output word (without protection) of the ith round computation;
$P_{i + k}$	A 32-dimensional invertible affine transformation to protect word $X_{i + k}$ ;
$P_{i + k}^{- 1}$	The inverse transformation of $P_{i + k}$ ;
$L P_{i + k}$	The invertible matrix of $P_{i + k}$ ;
$C P_{i + k}$	The constant vector of $P_{i + k}$ ;
$E_{i, j}$	An 8-dimensional invertible affine transformation;
$E_{i, j}^{- 1}$	The inverse transformation of $E_{i, j}$ ;
$C E_{i, j}$	The constant vector of $E_{i, j}$ ;
$E_{i}$	A 32-dimensional invertible affine transformation, $E_{i}$ = diag $(E_{i, 0}, E_{i, 1}, E_{i, 2}, E_{i, 3})$ ;
$E_{i}^{- 1}$	The inverse transformation of $E_{i}$ ;
$Q_{i}$	A 32-dimensional invertible affine transformation, $i = 4, \dots, 27$ ;
$E_{i, j}^{'}$	An 8-dimensional invertible affine transformation;
$E_{i}^{'}$	A 32-dimensional invertible affine transformation, $E_{i}^{'}$ = diag $(E_{i, 0}^{'}, E_{i, 1}^{'}, E_{i, 2}^{'}, E_{i, 3}^{'})$ ;
$E_{i}^{' - 1}$	The inverse transformation of $E_{i}^{'}$ ;
$N_{j}$	An 8-dimensional invertible linear transformation;
$N_{j}^{- 1}$	The inverse transformation of $N_{j}$ ;
N	A 32-dimensional invertible linear transformation, N = diag $(N_{0}, N_{1}, N_{2}, N_{3})$ ;
$R_{i}$	An 8-dimensional invertible affine transformation to protect mask values;
$R_{i}^{- 1}$	The inverse of $R_{i}$ ;
${\hat{X}}_{i + k}$	Protected $X_{i + k}$ with affine transformation $P_{i + k}$ ;
${\hat{X}}_{i + k}^{'}$	${\hat{X}}_{i + k}$ after unified encoding;
$Y_{i}$	The output word of Part 1 of the ith round;
${\hat{Y}}_{i}$	Protected $Y_{i}$ with affine transformation $E_{i}^{- 1}$ ;
${\hat{y}}_{j}$	The jth byte of ${\hat{Y}}_{i}$ ;
${\hat{z}}_{j}$	The output byte of the table lookup operation;
$Z_{i}$	The output word of Part 2 of the ith round;
${\hat{Z}}_{i}$	Protected $Z_{i}$ with affine transformation $Q_{i}$ ;
${\hat{Z}}_{i}^{'}$	${\hat{Z}}_{i}$ after unified encoding;
${\hat{Z}}_{i 0}$	An output word (involving the masked state word) of Part 2 of the ith round;
${\hat{Z}}_{i 1}$	The other output word (involving the mask values) of Part 2 of the ith round;
${\hat{Z}}_{i 1}^{'}$	${\hat{Z}}_{i 1}$ after unified encoding;
$φ$	A randomly generated permutation;
M	A table to store encoded mask values;
$φ^{j} [{\hat{y}}_{j}]$	The result of applying the permutation $φ$ to element ${\hat{y}}_{j}$ $j$ times in succession;
$m_{i, φ^{j} [{\hat{y}}_{j}]}$	Mask value corresponding to the jth byte of the state word of the ith round;
${\hat{R M}}_{i}$	The encoded mask values during the ith round computation.

Table 2. Memory requirements for the affine transformation and lookup tables.

Operation/Table	Size (each)	Number	Total
Affine transformation	1 Kb ( $32 \times 32 + 32$ bits)	100 + 12 + 64	22.69 KB
Table_1	2 Kb ( $2^{8} \times 8$ bits)	8	2 KB
Table_2	$2^{6}$ KB ( $2^{8} \times 2^{8} \times 8$ bits)	24	1.5 MB
Table_3	1 KB ( $2^{8} \times 32$ bits)	96	96 KB
M	2 Kb ( $2^{8} \times 8$ bits)	8	2 KB
$φ$	2 Kb ( $2^{8} \times 8$ bits)	8	2 KB

Table 3. Efficiency comparison of various SM4 white-box schemes.

	Generate a White-Box Program		Encrypt a Plaintext
	Memory	Time	Number of Table	Number of XOR	Number of Matrix Multiplication	Time
Xiao–Lai scheme [10]	148.625 KB	0.021 s	128	192 (32 bit)	160	0.06 ms
Bai–Wu scheme [12]	32.5 MB	3.97 s	640	640 (32 bit)	0	0.001 ms
Yao et al.’s scheme [15]	276.625 KB	0.092 s	128	96 (32 bit) + 96 (64 bit)	160	0.06 ms
Zhang et al.’s Scheme [19]	24.3 MB	-	640	192 (32 bit)	128	-
Yuan et al.’s Scheme [20]	34.5 MB	-	672	536 (32 bit)	0	-
Simplified scheme	1.62 MB	0.385 s	144	179 (32 bit)	176	0.07 ms
Enhancement scheme	7.8 MB	2.66 s	192	208 (32 bit)	216	0.08 ms

Table 4. Security comparison of various SM4 white-box schemes.

	BGE	Lin–Lai Analysis	Pan et al. Analysis	DCA
Xiao–Lai scheme	Yes [10]	$2^{47}$ [11]	61,200 $\times 2^{32}$ [22]	No [19]
Bai–Wu scheme	Yes [12]	-	61,200 $\times 2^{128}$ [22]	No [20]
Yao et al.’s scheme	Yes [15]	$2^{51}$ [15]	61,200 $\times 2^{32}$ [15]	-
Zhang et al.’s Scheme [19]	Yes	Yes	Yes	Yes
Yuan et al.’s Scheme [20]	Yes	Yes	Yes	Yes
Simplified scheme	Yes	$2^{47}$ [11]	61,200 $\times 2^{32}$ [22]	Yes
Enhanced scheme	Yes	Yes	Yes	Yes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhao, D.; Wang, Y.; Li, Y.; Hu, X.; Yu, Y.; Chen, S.; Zheng, S. An Efficient Masked White-Box Implementation of SM4. Electronics 2024, 13, 2326. https://doi.org/10.3390/electronics13122326
