Article

An Efficient SM9 Aggregate Signature Scheme for IoV Based on FPGA

1
School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2
China CITIC Bank Co., Ltd., Zhengzhou Branch, Zhengzhou 450008, China
*
Author to whom correspondence should be addressed.
Sensors 2024, 24(18), 6011; https://doi.org/10.3390/s24186011
Submission received: 18 August 2024 / Revised: 15 September 2024 / Accepted: 16 September 2024 / Published: 17 September 2024
(This article belongs to the Section Vehicular Sensing)

Abstract

With the rapid development of the Internet of Vehicles (IoV), the demand for secure and efficient signature verification is becoming increasingly urgent. To meet this need, we propose an efficient SM9 aggregate signature scheme implemented on a Field-Programmable Gate Array (FPGA). The scheme includes both fault-tolerant and non-fault-tolerant aggregate signature modes, which are designed to address challenges in various network environments. We provide security proofs for these two modes based on the hardness of the K-ary Computational Additive Diffie–Hellman (K-CAA) problem. To handle the numerous parallelizable elliptic curve point multiplication operations required during verification, we use the FPGA's parallel processing capabilities to design an efficient parallel point multiplication architecture. Using the Montgomery point multiplication algorithm and the Barrett modular reduction algorithm, we optimize the single-point multiplication computation unit, achieving 70,776 point multiplications per second. Finally, the overall scheme was simulated and analyzed on an FPGA platform. The experimental results indicate that under error-free conditions, the proposed non-fault-tolerant aggregate mode reduces the verification time by up to 97.1% compared to other schemes. In fault-tolerant conditions, the proposed fault-tolerant aggregate mode reduces the verification time by up to 77.2% compared to other schemes. Compared to other fault-tolerant aggregate schemes, its verification time is only 28.9% of theirs, and even in the non-fault-tolerant aggregate mode, the verification time is reduced by at least 39.1%. Therefore, the proposed scheme demonstrates significant advantages in both error-free and fault-tolerant scenarios.

1. Introduction

Currently, with the rapid development of the Internet of Things (IoT), an increasing number of vehicles are becoming intelligent, giving rise to the Internet of Vehicles (IoV). Within the same regional network, different IoV components exchange information through their self-organizing networks. However, IoV faces numerous challenges, such as communication delays, resource limitations, and complex communication environments. Efficiently verifying a large number of signatures using edge computing resources, while ensuring the accuracy and security of the information, is therefore a critical challenge [1,2]. To address this problem, current methods for accelerating signature verification primarily fall into two categories: (1) Compression Techniques: These methods aim to reduce the length of information to accelerate computation, such as through aggregate signature technology. (2) Algorithm Optimization: These methods focus on hardware and software optimization of the algorithms involved in signature verification, such as optimizing elliptic curve algorithms and leveraging Field-Programmable Gate Array (FPGA) hardware acceleration. However, aggregate signatures still face several issues, such as relatively long verification times and the potential for batch signature invalidation. Techniques such as parallel point multiplication acceleration and fault-tolerant aggregate signatures can effectively address these issues.
In order to address the challenge of efficiently verifying a large number of signatures, various scholars have proposed their own schemes based on different computational hardness problems and from distinct perspectives for improvement. Xie [3] improved the CLAS aggregate signature scheme based on the Elliptic Curve Discrete Logarithm Problem (ECDLP). While this reduced the time complexity of various stages of the aggregate signature process, it did not adequately address the issues of numerous point multiplication operations and fault tolerance. An [4] modified the SM9 algorithm to support aggregate signatures, improving its performance; however, the number of point multiplication operations remains high, and fault tolerance issues were not resolved. Mei [5] designed an aggregate signature scheme aimed at addressing location privacy concerns in vehicular networks, effectively ensuring the integrity and authenticity of information, but further improvements in fault tolerance and elliptic curve acceleration remain possible. Ran [6] further enhanced the elliptic curve-based aggregate signature scheme, making it more suitable for IoT applications, yet fault tolerance was not considered. Yang [7] applied aggregate signature modifications to the Schnorr signature to adapt it to wireless communication networks in the medical IoT environment, but these modifications introduced a significant number of elliptic curve operations, leaving room for optimization in terms of efficiency. Fu [8], to solve the problem of numerous point multiplication operations in aggregate signatures, accelerated these operations using FPGA, but the fault tolerance issue was still unresolved. To tackle the fault tolerance problem in aggregate signatures, Hartung [9] proposed the concept of fault-tolerant aggregate signatures and provided theoretical derivations, but this concept has not been applied to practical schemes. 
Therefore, for the efficient optimization of aggregate signatures, accelerating point multiplication operations and introducing fault tolerance are crucial.
In current aggregate signature schemes, point multiplication lies on the critical path and is relatively time consuming. There are various methods for point multiplication, such as the binary expansion method, the sliding window method [10], and the window non-adjacent form (w-NAF) method [11]. However, there is still room for optimization in both software and hardware implementations, and many scholars have studied the problem from different perspectives. In software, Hai et al. [12] proposed replacing odd numbers with prime numbers during the pre-computation phase and constructing micro multi-base chains to compensate for the differences between primes and odd numbers, reducing the computational complexity of scalar multiplication by up to 77.38%. Zhao et al. [13] further reduced the computational complexity by replacing the odd numbers in the base chains of the w-NAF chain with numbers of the form $2^n$. In hardware, Bellemou et al. [14] used the Montgomery power ladder binary method and an improved Radix-$2^{32}$ Montgomery modular multiplication to optimize elliptic curve point multiplication, achieving a computation time of 14.32 ms and balancing security and efficiency. Yue Hao et al. [15], based on the Zynq platform, used the Montgomery ladder algorithm with adaptively modified point addition and point doubling, reducing the computation time to 1.80 ms. Islam et al. [16], also using the Montgomery ladder algorithm on the Virtex-7 platform, proposed a Radix-2 modular multiplication architecture on the Edwards curve, reducing the computation time to 1.63 ms.
The SM9 [17] identity-based encryption algorithm was proposed in 2016 and officially became an international standard in 2021. Numerous scholars have researched and applied this algorithm. For instance, as mentioned earlier, the SM9 aggregate signature proposed by An Tao [4] added aggregation functionality to the SM9 signature. Additionally, Liu et al. [18] combined threshold signatures, ring signatures, and SM9 to achieve higher security but did not optimize for time consumption. Liu et al. [19] proposed a two-party cooperative signature based on SM9 and applied it to smart homes, although there remains significant room for optimization in verification efficiency. Jing et al. [20] introduced a modular multiplication algorithm improved with a four-stage pipeline, together with enhanced point addition and point doubling algorithms. Tested on the Zynq FPGA platform with SM9, the point multiplication algorithm achieved a performance of 1179 operations per second, though further improvements could be made in utilizing FPGA board resources. Wang et al. [21] designed highly parallel computational unit structures and optimized point addition, point doubling, and line functions over the $F_p$ and $F_{p^2}$ fields. Based on the SM9 algorithm, they designed optimal Ate pairing computation units on the Virtex-7 FPGA platform, reducing computation time to 3.43 ms, which is 20% of the time consumed by similar designs. Cheng [22] reduced the hardware cost of the SM9 algorithm by simplifying the Frobenius map and implemented its parallel structure. However, FPGA optimization for bilinear pairings is quite resource intensive, and optimizing point multiplication operations is crucial for improving the efficiency of aggregate signature verification. Therefore, how to fully utilize edge resources to accelerate point multiplication and apply it to specific schemes is a critical issue.
In our work, (1) we design a comprehensive scheme for SM9 aggregate signatures based on FPGA. (2) We employ a parallel hardware structure to accelerate aggregate signature verification. (3) We use an improved Montgomery ladder algorithm to compute the $F_p$ point multiplications in SM9 verification and optimize the Barrett modular reduction algorithm to better suit the characteristics of SM9. (4) We implement a fault-tolerant mechanism within the aggregate signature scheme and analyze and prove the scheme's security based on the hardness of the K-ary Computational Additive Diffie–Hellman (K-CAA) problem.
The rest of the paper is organized as follows: In Section 2, we introduce the preliminary knowledge and related work pertinent to our work, mainly including the SM9 signature algorithm, aggregate signatures, the K-CAA problem, and fault tolerance mechanisms. In Section 3, we describe the overall architecture of our proposed scheme. In Section 4, we present the hardware design, the optimization strategies employed, and the fault-tolerant mechanism design. In Section 5, we analyze the experimental results. In Section 6, we provide a conclusion and outlook for future work.

2. Related Work

Our work primarily involves the SM9 signature and verification algorithm, aggregate signatures, the K-CAA hardness problem, and fault tolerance mechanisms. The background and related work are introduced in four parts below.

2.1. SM9 Signature Algorithm

As shown in Figure 1, the SM9 algorithm can be divided into four levels in terms of functionality and complexity. The first level is the SM9 protocol layer, which includes the signature algorithm, key exchange protocol, and others. The second level consists of bilinear pairing and point multiplication, which are more complex functional components based on elliptic curve operations; the first level achieves its functionality by invoking these components. The operation rules of the bilinear pairing $e(a, b)$ are described in [23]. The third level includes point addition and point doubling, which are the basic elliptic curve operations. The fourth level comprises modular addition, modular inversion, modular multiplication, and modular subtraction over finite fields [24].

2.2. Aggregate Signature

The current focus of research on encryption mechanisms is on computational efficiency and memory optimization [25]. To address these issues, many aggregate signature schemes have been developed [2,3,4,5,6,7]. Most aggregate signature algorithms involve the following steps: sign, single signature verification, aggregate signature generation, and aggregate signature verification. Below are the descriptions and explanations of these algorithms:
(a) Sign: The user, holding the signature private key $ds_i$ issued by the Key Generation Center (KGC), generates a signature $\sigma_i$ for the message $m_i$ they intend to send.
(b) Single Signature Verification: The verifier, holding the signature public key $Q_i$, receives the message $m_i$ and uses the public key $Q_i$, the message $m_i$, and the signature $\sigma_i$ to perform the verification.
(c) Aggregate Signature: The aggregator, holding the signature public keys $Q_i$ for multiple messages, the signatures $\sigma_i$ of these messages, and the other accompanying information $om_i$ that was sent, aggregates the signatures to obtain an aggregate signature $\sigma$ along with the packaged messages $M$ and accompanying information $OM$.
(d) Aggregate Signature Verification: The verifier, upon receiving the aggregate signature $\sigma$, the packaged messages $M$, and the accompanying information $OM$ sent by the aggregator, uses the corresponding public keys $Q_i$ to perform the verification.
These steps are the most time-consuming parts of the aggregate signature mechanism. This is particularly true when an error occurs in the aggregate signature, as it requires individual verification of each signature [26], leading to a significant amount of time spent on single signature verifications.

2.3. K-CAA Problem

At present, the main computational hardness assumption problems for elliptic curves include the Computational Diffie–Hellman (CDH) problem [25] and ECDLP [3]. In elliptic curve aggregate signatures, the commonly used hardness assumption is the K-CAA problem. Below is the definition of the K-CAA problem.
Let $(G_1, +)$ be a cyclic group of order $q$, where $q$ is a large prime number. For an integer $k$ and unknown $x \in Z_q^*$, given $e_1, e_2, \ldots, e_k \in Z_q^*$, $g, xg \in G_1$ and $\frac{1}{x+e_1}g, \frac{1}{x+e_2}g, \ldots, \frac{1}{x+e_k}g$, compute $\left(e, \frac{1}{x+e}g\right)$, where $e \in Z_q^*$ and $e \notin \{e_1, e_2, \ldots, e_k\}$.

2.4. Fault Tolerance Mechanism

To address the significant time consumption caused by verifying individual signatures when an aggregate contains erroneous signatures, Hartung et al. [9] proposed the concept of fault-tolerant aggregate signatures and conducted theoretical derivations. References [27,28] further extended these schemes based on non-overlapping sets using compression and nesting methods, respectively. However, these schemes are very challenging to implement in practice. Wang et al. [26] proposed a new fault-tolerant scheme based on the combinatorial theory of the uniform $(k, n)$-set and validated it theoretically. We adopt the uniform $(k, n)$-set approach. Below is an introduction to the theoretical framework. Given a set $A = \{a_1, a_2, \ldots, a_{m-1}, a_m\}$ and its subsets $A_1, A_2, \ldots, A_{n-1}, A_n \subseteq A$, these subsets form a uniform $(k, n)$-set, where $m \geq C_n^{k-1}$, if they meet the following conditions: (1) The sizes of these subsets are all equal. (2) For any $k$ subsets $A_{i_1}, A_{i_2}, \ldots, A_{i_k} \in \{A_1, \ldots, A_{n-1}, A_n\}$, the union of $A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ is $A$. (3) For any $k-1$ distinct subsets $A_{i_j}$, $j \in [1, k-1]$, there is at least one element $a_i \in A$ that is not included in their union.
Cao et al. [29] proposed a method for constructing a uniform $(k, n)$-set. First, the subsets $A_1, A_2, \ldots, A_n$ are arranged into $m = C_n^{k-1}$ groups, each including $k-1$ subsets $A_i$. The format of the groups is shown in Equation (1).
$$W_1 = A_{11} \cup A_{12} \cup \cdots \cup A_{1(k-1)}, \quad \ldots, \quad W_m = A_{m1} \cup A_{m2} \cup \cdots \cup A_{m(k-1)} \tag{1}$$
where $A_{ij} \subseteq A$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k-1$; $\forall i \neq j$, $W_i \neq W_j$; and $\exists a_b$ such that $a_b \notin W_b$ and $a_b \in W_c$ for $1 \leq b, c \leq m$, $b \neq c$.
Below is an example. Suppose we have a set $A = \{a_1, \ldots, a_{10}\}$; then, $m = 10$. It is straightforward to set $n = 5$, $k = 4$, which satisfies the condition $C_n^{k-1} = 10$. Then, the 5 subsets $A_1, \ldots, A_5$ can be arranged into 10 groups by taking the 10 different combinations of three subsets, that is, $\{A_1, A_2, A_3\}, \{A_1, A_2, A_4\}, \{A_1, A_2, A_5\}, \ldots, \{A_3, A_4, A_5\}$.
Based on the requirements for a uniform $(k, n)$-set mentioned above, we can derive the specific subsets $A_1, \ldots, A_5$, as shown in Equation (2):
$$\begin{aligned} A_1 &= \{a_1, a_2, a_4, a_7\} \\ A_2 &= \{a_2, a_3, a_6, a_9\} \\ A_3 &= \{a_1, a_3, a_5, a_8\} \\ A_4 &= \{a_4, a_5, a_6, a_{10}\} \\ A_5 &= \{a_7, a_8, a_9, a_{10}\} \end{aligned} \tag{2}$$
The specific construction method is described in [26,29].
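The three conditions can be checked mechanically for the example above. The following is a minimal Python sketch, not the construction procedure of [26,29]; the function name `is_uniform_set` is hypothetical, and the elements $a_1, \ldots, a_{10}$ are represented by the integers 1 to 10.

```python
from itertools import combinations

def is_uniform_set(universe, subsets, k):
    """Check the three uniform (k, n)-set conditions for the given subsets."""
    # (1) all subsets have the same size
    same_size = len({len(s) for s in subsets}) == 1
    # (2) the union of any k subsets is the whole set A
    any_k_cover = all(set().union(*group) == universe
                      for group in combinations(subsets, k))
    # (3) the union of any k-1 subsets misses at least one element of A
    any_k_minus_1_miss = all(set().union(*group) != universe
                             for group in combinations(subsets, k - 1))
    return same_size and any_k_cover and any_k_minus_1_miss

# Subsets A_1..A_5 from Equation (2), with a_i written as the integer i.
A = set(range(1, 11))
SUBSETS = [{1, 2, 4, 7}, {2, 3, 6, 9}, {1, 3, 5, 8}, {4, 5, 6, 10}, {7, 8, 9, 10}]

print(is_uniform_set(A, SUBSETS, k=4))
```

In this example every element appears in exactly two of the five subsets, which is why any four subsets cover $A$ while any three leave exactly one element out.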

3. Entire Structure of the Scheme

In our work, there are two available modes: the classic aggregate signature mode and the fault-tolerant aggregate signature mode. The classic aggregate signature mode is suitable for scenarios with minimal network errors, while the fault-tolerant aggregate signature mode can handle situations with a certain error rate. The choice between these two methods only requires the Road-Side Unit (RSU) to decide based on its own situation with little impact on the perception of the authentication center and vehicles in the network.
Figure 2 illustrates the entire scheme. While vehicles are driving on the road, they collect information, which is then aggregated by a randomly selected vehicle and subsequently sent to the RSU. The RSU initially processes the information using its CPU before passing it to the attached FPGA device for further processing. After the FPGA processes the information, it is sent back to the CPU for final processing. Once the RSU has processed the information, the result is fed back to the vehicle cluster and the server. This scheme is an improvement based on the systems described in references [4,30]. The processes of system establishment, master key generation, vehicle registration, and key distribution are the same as those in the aforementioned references. The following sections describe the main parts and the improved processes, including signing, single signature verification, aggregate signature generation, and aggregate signature verification.
First, the main parameters required for the efficient SM9 aggregate signature scheme need to be explained, as shown in Table 1. The public key, private key, and pseudonym of the vehicle are assigned by the authentication center.
The following is the overall process of the efficient SM9 aggregate signature scheme. This section is divided into four parts: sign, single signature verification, fault-tolerant aggregate signature, and fault-tolerant aggregate verification.
(1) Sign: $sign(m_i, VID_i)$, where $m_i$ is the message to be sent.
(a) Compute $g = e(P_1, k_{pb})$;
(b) Randomly select $r_i \in Z_{n-1}^*$ and compute $w_i = g^{r_i}$;
(c) Compute $h_{2i} = H_2(m_i \| w_i, N)$ and $L_i = (r_i - VID_i) \bmod N$. If $L_i = 0$, reselect the random number;
(d) Compute $S_i = L_i \cdot s_i$, $U_i = L_i \cdot s_i \cdot VID_i$;
(e) Package and send $(m_i, S_i, U_i, w_i)$.
(2) Single signature verification: To verify the received message packet $(m_i, S_i, U_i, w_i)$, the verification process is as follows:
(a) Compute $g = e(P_1, k_{pb})$, $h_{2i} = H_2(m_i \| w_i, N)$, $t_i = g^{h_{2i}}$;
(b) Compute $v_i = e(S_i, Q_i) \cdot t_i$ and verify whether $v_i$ is equal to $w_i$. If they are not equal, the verification fails.
(3) Fault-tolerant aggregate signature: The RSU can negotiate with the OBU of the selected vehicle, based on its own situation, to adopt one of the following non-fault-tolerant or fault-tolerant aggregate signature modes.
(a) Non-fault-tolerant aggregate signature mode.
The OBU of the randomly selected vehicle acts as the signature aggregator. The set of pseudonyms of the received vehicles is $\{VID_1, \ldots, VID_n\}$, the corresponding set of messages is $\{m_1, \ldots, m_n\}$, and the received digital signatures are $(S_1, U_1, w_1), \ldots, (S_n, U_n, w_n)$. The signature aggregator computes $N_U = \sum_{i=1}^{8} U_i$, $N_S = \sum_{i=1}^{8} S_i$, $N_W = \prod_{i=1}^{8} w_i$. The aggregate signature is $\delta = (N_S, N_U, N_W, w_1, \ldots, w_8)$.
(b) Fault-tolerant aggregate signature mode.
The OBU of the randomly selected vehicle acts as the signature aggregator, performing fault-tolerant aggregation processing on the received signature set. The set of pseudonyms of the received vehicles is $\{VID_1, \ldots, VID_n\}$, the corresponding set of messages is $\{m_1, \ldots, m_n\}$, and the received digital signatures are $(S_1, U_1, w_1), \ldots, (S_n, U_n, w_n)$. Due to the limited resources of the RSU, the received digital signatures and messages are divided according to a uniform $(8, 9)$-set. According to the lemma mentioned in [26], there are a total of nine subsets, each containing eight elements. These are defined as $T_1, \ldots, T_9$. Each fault-tolerant aggregate signature is represented as $T_i = (T_{S_i}, T_{U_i}, T_{w_i}, w_1, \ldots, w_8)$, where $w_1, \ldots, w_8$ does not specifically denote the first eight received digital signatures but is a generic representation. Suppose the digital signatures $(S_1, U_1, w_1), \ldots, (S_8, U_8, w_8)$ form one combination of the uniform $(8, 9)$-set; then, the aggregation method of $T_i$ is shown in Equations (3)–(5).
$$T_{S_i} = \sum_{j=1}^{8} S_j \tag{3}$$
$$T_{U_i} = \sum_{j=1}^{8} U_j \tag{4}$$
$$T_{w_i} = \prod_{j=1}^{8} w_j \tag{5}$$
(4) Fault-tolerant aggregate verification: The RSU selects the verification mode based on the previously negotiated mode.
(a) Non-fault-tolerant verification mode.
The RSU receives the message set $\{m_1, \ldots, m_n\}$ and the aggregate signature $\delta = (N_S, N_U, N_W, w_1, \ldots, w_8)$ sent by the vehicle OBU, computes $h_{2i} = H_2(m_i \| w_i, N)$, and checks Equation (6).
$$e(N_S, k_{pb}) \, e(N_U, P_2) \, e\left(\sum_{i=1}^{8} h_{2i} P_1, k_{pb}\right) = N_W \tag{6}$$
(b) Fault-tolerant verification mode.
After receiving the message set $\{m_1, \ldots, m_n\}$ and the fault-tolerant aggregate signatures $T_1, \ldots, T_n$ sent by the signature aggregator, the RSU computes $h_{2i} = H_2(m_i \| w_i, N)$ and verifies each fault-tolerant aggregate signature using Equation (7), where $c$ is the number of signatures that constitute a single fault-tolerant aggregate signature set.
$$e(T_{S_i}, k_{pb}) \, e(T_{U_i}, P_2) \, e\left(\sum_{j=1}^{c} h_{2j} P_1, k_{pb}\right) = T_{w_i} \tag{7}$$
It can be noted that there are a large number of point multiplication operations in Equations (6) and (7). Regardless of whether it is in the non-fault-tolerant aggregation or fault-tolerant aggregation process, the number of point multiplication operations will increase with the number of signatures, and point multiplication is a quite time-consuming operation [31]. Therefore, the time consumption will surge. Consequently, the point multiplication operations in the SM9 efficient aggregate signature scheme can be offloaded to the FPGA board for computation.
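To illustrate why the fault-tolerant mode avoids re-verifying every signature individually, the following Python sketch builds one possible uniform $(8, 9)$-set over 36 signatures and shows how the set of failed aggregates singles out one faulty signature. The pair-based construction, the names `PAIRS`, `SUBSETS`, and `locate_suspects`, and the replacement of the pairing check in Equation (7) by a given set of failed aggregate indices are all assumptions made for illustration; this is not necessarily the construction used in [26,29].

```python
from itertools import combinations

N_SUBSETS = 9
# Assumed construction: signature s is identified with a pair of aggregate
# indices and is placed in exactly those two aggregates (36 pairs in total).
PAIRS = list(combinations(range(N_SUBSETS), 2))
SUBSETS = [[s for s, pair in enumerate(PAIRS) if t in pair]
           for t in range(N_SUBSETS)]

def locate_suspects(failed):
    """Return signatures that appear in no passing aggregate.

    `failed` is the set of aggregate indices whose verification failed.
    """
    cleared = set()
    for t, members in enumerate(SUBSETS):
        if t not in failed:
            cleared.update(members)  # every member of a passing aggregate is accepted
    return sorted(set(range(len(PAIRS))) - cleared)

# Example: signature 5 is faulty, so exactly the two aggregates containing it fail.
print(locate_suspects(set(PAIRS[5])))
```

Under this construction each aggregate contains eight signatures, and a single faulty signature causes exactly two of the nine aggregate checks to fail; every other signature still appears in at least one passing aggregate.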

4. Hardware Improved Acceleration Structure

The overall structure of the hardware acceleration for the efficient SM9 aggregate signature scheme is shown in Figure 3. The architecture consists of a CPU connected to multiple FPGA boards with information exchanged via the PCIe bus. The CPU is responsible for bilinear pairing operations involved in the SM9 algorithm, the configuration of parallel point multiplication parameters in SM9, and the final verification of the results. The FPGA boards are responsible for parallel acceleration of the point multiplication operations.
As illustrated in Figure 3, a master state machine is set up to control the data flow within each point multiplication unit and handle the subsequent processing of the computation results. The state transition diagram of the master state machine is shown in Figure 4.
In the Idle state, parameters are read, and the state machine is initialized. When the start signal is valid, the state machine updates multiple point multiplication parameter registers and transitions to the PONIT_MUL_DSB state; otherwise, it remains in Idle. Upon transitioning to the PONIT_MUL_DSB state, the state machine reads the multiple point multiplication tasks and assigns each individual point multiplication task to the corresponding point multiplication unit. Once the task distribution is completed, the state transitions to the PONIT_MUL_WAIT state. In the PONIT_MUL_WAIT state, after all point multiplication tasks are completed, the state machine reads all the computed results and transitions to the COORDINATE_CONVERT state. In the COORDINATE_CONVERT state, the state machine transfers all point multiplication results to the coordinate conversion unit for merging and coordinate transformation. Then, it transitions to the CONVERT_WAIT state. While in the CONVERT_WAIT state, the state machine waits for the coordinate conversion unit to complete its computation. After the coordinate conversion is completed, the state machine reads the results and transitions to the Output state to output the computed results. Finally, the state transitions back to the Idle state.
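The state transitions described above can be modelled in a few lines of Python. This is only a behavioural sketch of the transition rules as stated in the text, not the hardware description; the flag names `start`, `mul_done`, and `convert_done` are hypothetical, and the state names follow the spelling used in the text.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    PONIT_MUL_DSB = auto()
    PONIT_MUL_WAIT = auto()
    COORDINATE_CONVERT = auto()
    CONVERT_WAIT = auto()
    OUTPUT = auto()

def next_state(state, start=False, mul_done=False, convert_done=False):
    """Return the next state of the master state machine."""
    if state is State.IDLE:
        return State.PONIT_MUL_DSB if start else State.IDLE
    if state is State.PONIT_MUL_DSB:
        return State.PONIT_MUL_WAIT          # tasks have been distributed
    if state is State.PONIT_MUL_WAIT:
        return State.COORDINATE_CONVERT if mul_done else State.PONIT_MUL_WAIT
    if state is State.COORDINATE_CONVERT:
        return State.CONVERT_WAIT            # results handed to the conversion unit
    if state is State.CONVERT_WAIT:
        return State.OUTPUT if convert_done else State.CONVERT_WAIT
    return State.IDLE                        # OUTPUT returns to IDLE

# Walk one full cycle with all handshake flags asserted.
s = State.IDLE
trace = [s]
for _ in range(6):
    s = next_state(s, start=True, mul_done=True, convert_done=True)
    trace.append(s)
print([t.name for t in trace])
```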
For each point multiplication unit, the state machine is responsible for controlling the data flow between the point addition and point doubling units, achieving unit reuse. Additionally, by configuring multiple point multiplication modules, the on-board resources can be fully utilized to accelerate large-scale point multiplication operations.

4.1. Montgomery Point Multiplication

Montgomery point multiplication is currently the most efficient point multiplication algorithm, with a time complexity superior to existing methods such as binary expansion, non-adjacent form (NAF), and NAF-related improved algorithms. Montgomery point multiplication requires a base point $G$ and first calculates $P_1$ and $P_2$, where $P_1 = G$ and $P_2 = 2G$. Then, based on $P_1$, $P_2$, and the scalar $k$, the point multiplication result is computed. The computation process is shown in Algorithm 1. Additionally, it can resist simple power analysis attacks.
Algorithm 1: Montgomery Point Multiplication
Input: $G = (x_G, y_G)$, $k = (k_{n-1}, \ldots, k_0)$
Output: $Q = kG$
1.  $P_1 = G$; $P_2 = 2G$; $i = n - 2$
2.  while ($i \geq 0$) do
3.    if ($k_i == 0$) then $P_2 = P_1 + P_2$; $P_1 = 2P_1$;
4.    else if ($k_i == 1$) then $P_1 = P_1 + P_2$; $P_2 = 2P_2$;
5.    end if
6.    $i = i - 1$;
7.  end while
8.  $Q = P_1$
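The ladder in Algorithm 1 can be exercised on a small example. The sketch below is an illustrative Python model, not the FPGA implementation: it assumes a toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ with base point $(5, 1)$ (not the SM9 curve), uses plain affine arithmetic, and assumes the top bit $k_{n-1}$ of the scalar is 1. The helper names `ec_add`, `montgomery_ladder`, and `naive_mul` are hypothetical.

```python
P_MOD = 17   # toy field prime (assumption, not the SM9 prime)
A_COEF = 2   # curve y^2 = x^3 + 2x + 2
G = (5, 1)   # base point on the toy curve

def ec_add(p, q):
    """Affine point addition; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p == q:
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def montgomery_ladder(k, g):
    """Compute kG for k >= 1; one addition and one doubling per scalar bit."""
    bits = bin(k)[2:]            # most significant bit first, always '1'
    p1, p2 = g, ec_add(g, g)     # P1 = G, P2 = 2G
    for bit in bits[1:]:
        if bit == '0':
            p2 = ec_add(p1, p2)
            p1 = ec_add(p1, p1)
        else:
            p1 = ec_add(p1, p2)
            p2 = ec_add(p2, p2)
    return p1

def naive_mul(k, g):
    """Reference result obtained by adding G to itself k times."""
    acc = None
    for _ in range(k):
        acc = ec_add(acc, g)
    return acc

print(all(montgomery_ladder(k, G) == naive_mul(k, G) for k in range(1, 19)))
```

Because both branches perform one addition and one doubling, the sequence of operations does not depend on the key bits, which is the property behind the resistance to simple power analysis mentioned above.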

4.2. Optimization of Finite Field Operational Units

Finite field arithmetic units are the fundamental components of elliptic curve operations, and they include modular addition, modular subtraction, modular multiplication, and modular inversion modules. The performance of these finite field arithmetic units is crucial for the computation speed of point multiplication algorithms on the FPGA.

4.2.1. Design of Modular Addition and Subtraction Units

To optimize the use of on-board resources and reduce the area, modular addition and subtraction are implemented together in a single module using pure combinational logic. This module performs calculations based on the input modulus $P$ and the mode selection signal $mode$. Since modular addition can overflow and modular subtraction can produce a negative result, these situations need to be handled accordingly. The optimized computation method is shown in Algorithm 2.
Algorithm 2: Modular Addition and Subtraction Algorithm
Input: $A$, $B$, $P$, $mode$
Output: $C$
1.  if ($mode == 0$) begin
2.    $B_1 = B$, $B_{am} = P$
3.  end else begin
4.    $B_1 = P$, $B_{am} = B$
5.  end
6.  $\{C_{A1}, C_{A2}\} = A + B_1$
7.  $\{C_{AS1}, C_{AS2}\} = \{C_{A1}, C_{A2}\} - \{1'b0, B_{am}\}$
8.  $\{C_{S1}, C_{S2}\} = A - B$
9.  if ($mode == 0$) begin
10.   if ($C_{AS1} == 1$) begin
11.     $C = C_{A2}$
12.   end else begin
13.     $C = C_{AS2}$
14.   end
15. end else begin
16.   if ($C_{S1} == 1$) begin
17.     $C = C_{AS2}$
18.   end else begin
19.     $C = C_{S2}$
20.   end
21. end
As shown in Figure 5, the hardware structure of the modular addition and subtraction is composed of pure combinational logic, forming a parallel circuit structure. The computation is completed within a single clock cycle.
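The selection logic of Algorithm 2 can be checked against ordinary modular arithmetic with a small software model. The Python sketch below is an assumption-laden illustration, not the RTL: it models the carry and borrow bits by testing the sign of Python integers, assumes 256-bit operands with $0 \leq A, B < P$, and uses $2^{255} - 19$ as an arbitrary test modulus that is not the SM9 prime. The name `mod_add_sub` is hypothetical.

```python
WIDTH = 256
MASK = (1 << WIDTH) - 1

def mod_add_sub(a, b, p, mode):
    """mode 0: (a + b) mod p, mode 1: (a - b) mod p, for 0 <= a, b < p."""
    b1, bam = (b, p) if mode == 0 else (p, b)
    ca = a + b1      # {C_A1, C_A2}: a + b (add mode) or a + p (sub mode)
    cas = ca - bam   # {C_AS1, C_AS2}: a + b - p or a + p - b
    cs = a - b       # {C_S1, C_S2}: negative value models the borrow bit
    if mode == 0:
        # borrow means a + b < p, so the plain sum is already reduced
        return ca & MASK if cas < 0 else cas & MASK
    # borrow means a < b, so the result a + p - b is used instead
    return cas & MASK if cs < 0 else cs & MASK

P_TEST = 2**255 - 19  # arbitrary test modulus (assumption)
a, b = P_TEST - 5, 12345
print(mod_add_sub(a, b, P_TEST, 0) == (a + b) % P_TEST,
      mod_add_sub(b, a, P_TEST, 1) == (b - a) % P_TEST)
```

All three intermediate values are computed unconditionally and only the final selection depends on the mode and the borrow flags, which matches the single-cycle combinational structure described for the hardware.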

4.2.2. Design of the Modular Inversion Unit

Currently, the approaches for computing modular inverses include Fermat's Little Theorem, the Montgomery algorithm, the Extended Euclidean algorithm, and the binary modular inversion algorithm. Fermat's Little Theorem requires a full modular exponentiation, while the Montgomery algorithm needs two domain transformations to obtain the final modular inverse. The original Extended Euclidean algorithm requires a division at every iteration, which is highly costly in hardware [32]. Therefore, the more hardware-friendly binary modular inversion algorithm, which uses only shifts, additions, and subtractions, is adopted. The computation process of the binary modular inversion algorithm is shown in Algorithm 3.
Algorithm 3: Modular Inverse Algorithm
Input: $a$, $b$, $p$
Output: $c = b a^{-1} \bmod p$
1.  $u = a$, $v = p$, $x_1 = b$, $x_2 = 0$
2.  while ($u \neq 1$ && $v \neq 1$) do
3.    if ($u[0] == 0$ && $x_1[0] == 0$) then
4.      $u = u >>> 1$; $x_1 = x_1 >>> 1$;
5.    end if
6.    if ($u[0] == 0$ && $x_1[0] == 1$) then
7.      $u = u >>> 1$; $x_1 = (x_1 + p) >>> 1$;
8.    end if
9.    if ($u[0] == 1$ && $v[0] == 0$ && $x_2[0] == 0$) then
10.     $v = v >>> 1$; $x_2 = x_2 >>> 1$;
11.   end if
12.   if ($u[0] == 1$ && $v[0] == 0$ && $x_2[0] == 1$) then
13.     $v = v >>> 1$; $x_2 = (x_2 + p) >>> 1$;
14.   end if
15.   if ($u[0] == 1$ && $v[0] == 1$) then
16.     if ($u \geq v$) then $u = u - v$; $x_1 = x_1 - x_2$; else
17.       $v = v - u$; $x_2 = x_2 - x_1$;
18.     end if
19.   end if
20. end while
21. if ($u \neq 1$) then $x_1 = x_2$ end if
22. while ($x_1 \geq p$) do
23.   $x_1 = x_1 - p$
24. end while
25. $c = x_1$;
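For reference, the same shift-and-subtract idea can be written as a short Python function and checked against Python's built-in modular inverse. This is a sketch of the textbook binary inversion algorithm rather than a line-by-line transcription of Algorithm 3: it reduces the differences modulo $p$ immediately instead of correcting at the end, and it assumes $p$ is an odd prime and $a \not\equiv 0 \pmod p$. The name `binary_mod_div` and the test modulus $2^{255} - 19$ are assumptions.

```python
def binary_mod_div(a, b, p):
    """Return b * a^(-1) mod p using only shifts, additions and subtractions."""
    u, v = a % p, p
    x1, x2 = b % p, 0
    # Invariants: a * x1 == b * u (mod p) and a * x2 == b * v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:
            u >>= 1
            x1 = x1 >> 1 if x1 % 2 == 0 else (x1 + p) >> 1
        while v % 2 == 0:
            v >>= 1
            x2 = x2 >> 1 if x2 % 2 == 0 else (x2 + p) >> 1
        if u >= v:
            u -= v
            x1 = (x1 - x2) % p
        else:
            v -= u
            x2 = (x2 - x1) % p
    return x1 if u == 1 else x2

P_TEST = 2**255 - 19  # arbitrary odd prime for the check, not the SM9 prime
a, b = 0x1234567890ABCDEF, 987654321
print(binary_mod_div(a, b, P_TEST) == b * pow(a, -1, P_TEST) % P_TEST)
```

Halving an odd $x_1$ or $x_2$ is made possible by first adding the odd modulus $p$, which is why no division or multiplication is needed.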

4.2.3. Design of the Fast Modular Multiplication Unit

In elliptic curve computations over finite fields, modular multiplication is the operation that most affects performance. Common modular multiplication algorithms include the Montgomery multiplication algorithm, the interleaved modular multiplication algorithm, and the Barrett modular multiplication algorithm. Although Montgomery multiplication offers good generality and flexibility, it requires a domain transformation of the data. Interleaved modular multiplication involves many iterations, leading to a longer computation time. The Barrett modular multiplication algorithm, on the other hand, can achieve low-cost computation for any modulus $p$ with precomputed parameters. Therefore, this section uses the KOA fast multiplication algorithm and the Barrett modular reduction algorithm to implement fast modular multiplication.
(1)
KOA Fast Multiplication
The main idea of KOA (Karatsuba-Ofman algorithm) multiplication is to recursively decompose one large multiplication into several smaller multiplications, which lowers the overall cost. Some parameters need to be defined first. An $n$-bit number $A$ can be represented in binary as $A = (a_{n-1}, \ldots, a_1, a_0)$, $a_i \in \{0, 1\}$, $0 \leq i \leq n-1$, where $A_H = (a_{n-1}, \ldots, a_{n/2})$ is the higher-order part and $A_L = (a_{n/2-1}, \ldots, a_0)$ is the lower-order part. Using these definitions, the two numbers $A$ and $B$ to be multiplied can be represented as
$$A = A_H \times 2^{n/2} + A_L$$
$$B = B_H \times 2^{n/2} + B_L$$
For $C = A \times B$, the result of the computation is
$$C = A_H \times B_H \times 2^{n} + (A_H \times B_L + A_L \times B_H) \times 2^{n/2} + A_L \times B_L$$
while $A_H \times B_L + A_L \times B_H$ can be represented as
$$A_H \times B_L + A_L \times B_H = (A_H + A_L) \times (B_H + B_L) - A_H \times B_H - A_L \times B_L$$
From the above, it is clear that KOA multiplication only requires the computation of three products, $A_H \times B_H$, $A_L \times B_L$, and $(A_H + A_L) \times (B_H + B_L)$, instead of four, which reduces the computational complexity.
In SM9, all operations are based on 256-bit numbers. To better utilize the FPGA's on-board resources, a double recursion can be used to divide a 256-bit multiplication into multiple 64-bit multiplications. Additionally, the multiplications by $2^{n}$ and $2^{n/2}$ in $A_H \times B_H \times 2^{n}$ and $(A_H \times B_L + A_L \times B_H) \times 2^{n/2}$ can be accomplished through bit shifting. The KOA multiplication process is shown in Algorithm 4.
Algorithm 4: KOA Multiplication
Input: $A$, $B$, $n$
Output: $C$
1.  if ($n == 64$) then return $C = A \times B$
2.  end if
3.  $A = A_H \times 2^{n/2} + A_L$;
4.  $B = B_H \times 2^{n/2} + B_L$;
5.  $X = KOA(A_H, B_H, n/2)$;
6.  $Y = KOA(A_H + A_L, B_H + B_L, n/2)$;
7.  $Z = KOA(A_L, B_L, n/2)$;
8.  $D = Y - X - Z$;
9.  $C = (X << n) + (D << n/2) + Z$;
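The recursion in Algorithm 4 can be sketched in software as follows. This is a minimal illustration and not the authors' hardware design: the function name `koa_multiply` and the parameter `base_bits` are hypothetical, and Python's arbitrary-precision integers stand in for the 64-bit DSP multipliers. Because the sums $A_H + A_L$ and $B_H + B_L$ can be one bit wider than $n/2$, a hardware version would need correspondingly wider operands, which this sketch does not model.

```python
def koa_multiply(a: int, b: int, n: int = 256, base_bits: int = 64) -> int:
    """Recursive Karatsuba-Ofman multiplication following the steps of Algorithm 4.

    n is the nominal operand width in bits. Recursion stops at base_bits,
    where a direct multiplication stands in for the DSP multiplier.
    """
    if n <= base_bits:
        return a * b  # base case: plain multiplication

    half = n // 2
    mask = (1 << half) - 1

    # Split each operand: A = A_H * 2^(n/2) + A_L (the high part may carry extra bits).
    a_h, a_l = a >> half, a & mask
    b_h, b_l = b >> half, b & mask

    x = koa_multiply(a_h, b_h, half, base_bits)              # A_H * B_H
    y = koa_multiply(a_h + a_l, b_h + b_l, half, base_bits)  # (A_H + A_L) * (B_H + B_L)
    z = koa_multiply(a_l, b_l, half, base_bits)              # A_L * B_L

    d = y - x - z  # equals A_H * B_L + A_L * B_H

    # Multiplying by 2^n and 2^(n/2) is a plain left shift.
    return (x << (2 * half)) + (d << half) + z
```

With `n = 256` and `base_bits = 64`, the recursion has two levels and issues nine base-case multiplications, compared with sixteen for a schoolbook split into 64-bit limbs.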
To further leverage the performance of the KOA algorithm on the FPGA, the latency of the 64-bit DSP multipliers used in the component design is set to 0 cycles. Wire-type variables connect the components within the module, so that outputs are available as soon as inputs arrive. A register is then placed at the end to delimit the clock cycle. Figure 6 shows the schematic diagram of the unrolled 256-bit KOA algorithm, where H and L represent the higher-order and lower-order bits of the further split data. Finally, a 512-bit multiplication result is output through an adder.
(2)
Barrett Fast Multiplication
For any two positive integers $A$ and $B$, there exist two numbers $q$ and $r$ such that the following equation holds:
$$A = q \times B + r, \quad r \in [0, B-1]$$
Thus, $r = A \bmod B$, where $B$ is the modulus for the modular operation. However, finding $q$ and $r$ directly requires costly division operations. When the modulus is a fixed value, the Barrett algorithm can instead perform modular reduction using only multiplication and shift operations [33]. Let $a, b \in [0, P-1]$; the goal is to obtain $a \times b \bmod P$. Setting $d = a \times b$, we have $d = e \times P + c$, where $c$ is the result of $a \times b \bmod P$.
First, we calculate an approximate value $e_1$ of $e$. Let $\mu = \lfloor 2^{2n} / P \rfloor$, where $n$ is the bit length of the modulus $P$ in its binary representation. Then, $e = \lfloor d / P \rfloor \approx e_1 = \lfloor \lfloor d / 2^{n} \rfloor \times \mu / 2^{n} \rfloor = ((d >> n) \times \mu) >> n$. The computation of $e_1$ thus reduces to a combination of shifts and multiplications. The value $\mu$ is precomputed in software and then loaded into the FPGA. In practical applications, once the modulus $P$ is selected, it is typically treated as a system constant; therefore, $\mu$ can also be treated as a system constant. The Barrett modular reduction process is shown in Algorithm 5.
Algorithm 5: Barrett Reduction
Input: $a$, $b$, $n$, $P$, $\mu$
Output: $c = a \times b \bmod P$
1.  $d = a \times b$;
2.  $e_{11} = \mu[n-1:0] \times \lfloor d / 2^{n} \rfloor$;
3.  if ($\mu[n] == 1'b1$) then
4.  $e_{12} = \{d >> n,\ 256'b0\}$;
5.  else $e_{12} = 512'b0$;
6.  end if
7.  $e_1 = (e_{11} + e_{12}) >> n$;
8.  $d_1 = P \times e_1$;
9.  $c = d - d_1$;
10. if ($c \ge P$) then $c = c - P$;
11. end if
12. return $c$;
All multiplications involved use KOA multiplication. Since $\mu$ can be up to 257 bits wide, steps 3 to 6 handle its most significant bit separately. The computed $c$ ranges from $0$ to $2P - 1$, so it needs to be checked and adjusted if necessary; steps 9 to 11 handle this process.
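The data flow of Algorithm 5 can be mirrored in a short software sketch. The function names `barrett_precompute` and `barrett_mulmod` are hypothetical, and the code only illustrates the arithmetic under the definitions above ($\mu = \lfloor 2^{2n}/P \rfloor$, with the top bit of $\mu$ handled separately); it is not the FPGA implementation. As a defensive choice, the sketch uses a loop for the final correction, whereas Algorithm 5 uses a single conditional subtraction.

```python
def barrett_precompute(P: int):
    """Return (n, mu) with n the bit length of P and mu = floor(2^(2n) / P)."""
    n = P.bit_length()
    mu = (1 << (2 * n)) // P
    return n, mu


def barrett_mulmod(a: int, b: int, P: int, n: int, mu: int) -> int:
    """Compute a * b mod P using shifts and multiplications only."""
    d = a * b                       # step 1: full-width product
    q = d >> n                      # floor(d / 2^n)

    # Steps 2-6: mu may need n + 1 bits, so multiply by its low n bits
    # and add the contribution of bit n as a shifted copy of q.
    e11 = (mu & ((1 << n) - 1)) * q
    e12 = (q << n) if (mu >> n) & 1 else 0

    e1 = (e11 + e12) >> n           # step 7: approximate quotient, never above the true one
    c = d - P * e1                  # steps 8-9: candidate remainder, c >= 0

    while c >= P:                   # steps 10-11: final correction
        c -= P
    return c
```

A caller would compute `n, mu = barrett_precompute(P)` once per modulus and then reuse both values for every multiplication, which matches the treatment of $\mu$ as a system constant.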

4.3. Optimization of Point Addition and Point Double Modules

In elliptic curve point multiplication, the use of different coordinate systems results in varying computational complexities. For instance, operations in the affine coordinate system involve modular inversion, which is computationally expensive. In our work, we employ the standard projective coordinate system for the computations. In the standard projective coordinate system, modular inversion is required only when converting back to the affine coordinate system. Additionally, in the Montgomery point multiplication process within the standard projective coordinate system, computations are simplified as they only need to be performed on the coordinates X and Z , thus reducing the computational burden.
Consider points $P_1 = (X_1, Y_1, Z_1)$, $P_2 = (X_2, Y_2, Z_2)$ and $P_3 = (X_3, Y_3, Z_3)$ in the standard projective coordinate system. The point addition formulas are given by Equations (13) and (14).
$$X_3 = (X_1 X_2 - a Z_1 Z_2)^2 - 4 b Z_1 Z_2 (X_1 Z_2 + X_2 Z_1) \tag{13}$$
$$Z_3 = x_G (X_1 Z_2 - X_2 Z_1)^2 \tag{14}$$
The point double operation formulas are given by Equations (15) and (16).
$$X_3 = (X_1^2 - a Z_1^2)^2 - 8 b X_1 Z_1^3 \tag{15}$$
$$Z_3 = 4 Z_1 (X_1^3 + a X_1 Z_1^2 + b Z_1^3) \tag{16}$$
After completing the computations, the results are converted back to the affine coordinate system. The conversion methods are given by Equations (17) and (18).
$$x_i = \frac{X_i}{Z_i}, \quad i \in \{1, 3\} \tag{17}$$
$$y_3 = \frac{2b + (a + x_3 x_G)(x_3 + x_G) - x_1 (x_3 - x_G)^2}{2 y_G} \tag{18}$$
The resulting $(x_3, y_3)$ are the converted results in the affine coordinate system.
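To make the role of Equations (13) to (18) concrete, the following sketch runs an x-only Montgomery ladder on a curve $y^2 = x^3 + ax + b$ over a prime field and then recovers the $y$ coordinate. It is an illustration under stated assumptions rather than the authors' design: the function names are hypothetical, the ladder keeps the pair $(kG, (k+1)G)$ so that the two operands of every addition differ by $G$, and in Equation (18) $x_3$ is read as the $x$ coordinate of $kG$ and $x_1$ as that of $(k+1)G$. The sketch also assumes that neither $kG$ nor $(k+1)G$ is the point at infinity.

```python
def x_add(r1, r2, a, b, p, x_g):
    """Differential addition, Equations (13)-(14); the inputs must differ by G."""
    (X1, Z1), (X2, Z2) = r1, r2
    X3 = ((X1 * X2 - a * Z1 * Z2) ** 2 - 4 * b * Z1 * Z2 * (X1 * Z2 + X2 * Z1)) % p
    Z3 = (x_g * (X1 * Z2 - X2 * Z1) ** 2) % p
    return X3, Z3


def x_double(r, a, b, p):
    """Point doubling on (X, Z) only, Equations (15)-(16)."""
    X1, Z1 = r
    X3 = ((X1 * X1 - a * Z1 * Z1) ** 2 - 8 * b * X1 * Z1 ** 3) % p
    Z3 = (4 * Z1 * (X1 ** 3 + a * X1 * Z1 * Z1 + b * Z1 ** 3)) % p
    return X3, Z3


def ladder(k, x_g, y_g, a, b, p):
    """Return the affine (x, y) of k*G for k >= 1 using only X and Z in the loop."""
    r0 = (x_g % p, 1)               # holds m*G
    r1 = x_double(r0, a, b, p)      # holds (m+1)*G
    for bit in bin(k)[3:]:          # remaining bits after the leading 1
        if bit == "1":
            r0, r1 = x_add(r0, r1, a, b, p, x_g), x_double(r1, a, b, p)
        else:
            r0, r1 = x_double(r0, a, b, p), x_add(r0, r1, a, b, p, x_g)

    # Equation (17): a modular inversion is needed only for the final conversion.
    x_k = r0[0] * pow(r0[1], -1, p) % p
    x_next = r1[0] * pow(r1[1], -1, p) % p

    # Equation (18): recover y from x(kG), x((k+1)G) and the base point.
    num = 2 * b + (a + x_k * x_g) * (x_k + x_g) - x_next * (x_k - x_g) ** 2
    y_k = num * pow(2 * y_g, -1, p) % p
    return x_k, y_k
```

On the toy curve $y^2 = x^3 + 2x + 3$ over $F_{97}$ with $G = (3, 6)$, which is chosen here only for illustration, `ladder(2, 3, 6, 2, 3, 97)` returns `(80, 10)`. The two branches of the loop perform one addition and one doubling each, which is why the two modules can run in parallel in hardware.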

Data Flow Optimization

To maximize the efficient use of on-board resources and complete point addition and point double in the shortest possible time, we optimize the data flow for state machine operations and Montgomery point multiplication. The computation processes and data flows for point addition and point double are shown in Table 2 and Table 3, respectively.
In the design presented in our work, the point addition and point double modules operate in parallel during the point multiplication process. Therefore, the total computation cycle count is determined by the module with the longest cycle duration. Modules with shorter cycles idle until the slowest one finishes, ensuring synchronization and smooth operation of the point multiplication state machine. Additionally, the multiplication operations use a fast modular multiplication unit with a latency of seven cycles, while the modular addition and subtraction units each complete in one cycle. Owing to the state machine and synchronization stages, both the point addition and point double modules have an operation cycle of 102 cycles.

4.4. Optimization of Coordinate Transformation

After the point multiplication operation, the results need to be aggregated and coordinate conversion must be performed. This module uses a modular inversion module and a point addition module, with a state machine controlling the data flow and the reuse of these modules. The coordinate conversion method is described in Algorithm 6, where $(X_i, Z_i)$, $i \in [0, 7]$, are the results output by the eight parallel point multiplication modules. The function Point_add is a wrapper for the point addition module, and $inv(a, b, P)$ is a wrapper for the modular inversion module whose output is $a / b \bmod P$.
Algorithm 6: Coordinate Transformation
Input: $(X_i, Z_i)$, $i \in [0, 7]$, $a$, $b$, $n$, $P$, $x_G$
Output: $(x, y)$
1.  $i = 1$;
2.  $(X, Z) = (X_1, Z_1)$
3.  while ($i \le 6$) do
4.  $(X, Z) = Point\_add(X, Z, X_{i+1}, Z_{i+1}, a, b, P, x_G)$
5.  $i = i + 1$;
6.  end while
7.  $x_6 = inv(X_6, Z_6, P)$
8.  $x = inv(X, Z, P)$
9.  $T_1 = x_G \times x$
10. $T_2 = x_G + x$
11. $T_3 = x_G - x$;
12. $T_3 = T_3^2$;
13. $T_3 = x_6 \times T_3$
14. $T_1 = T_1 + a$
15. $T_1 = T_1 \times T_2$
16. $T_1 = T_1 + b$
17. $T_1 = T_1 + b$
18. $T_1 = T_1 - T_3$
19. T 1 = T 1 T 2
20. $y = y_G + y_G$
21. $y = inv(y, T_1, P)$
22. return $(x, y)$

5. Experimental Results and Scheme Analysis

The host machine in this study is a computer with a Pentium Dual CPU E2200 2.20 GHz running Ubuntu 16.04, which can support simultaneous connections to nine FPGA boards. The hardware platform consists of a system of nine FPGA boards, where hardware modules are designed using the Verilog language based on FPGA boards with the chip model xcku-060-ffva1156-2-i. The software used is Vivado 2019.2.

5.1. Safety Analysis

5.1.1. Security Proof

For a signature scheme, security means existential unforgeability under adaptive chosen message attacks (EUF-CMA). To prove that the proposed scheme has this property, the following definitions and proof are provided. We consider an attacker who cannot learn the master private key but can replace the vehicle’s public key. Under this threat model, Game 1 is defined as follows.
Definition 1: 
Facing a Type 1 attacker, if no Type 1 attacker can win Game 1 with a non-negligible advantage, then the proposed scheme is EUF-CMA secure.
Game 1: 
Let $C$ be the challenger, $A_1$ be the Type 1 attacker, and $ID_i$ be the target being challenged. The game is constructed as follows:
(1) System Initialization: The system initialization is executed by $C$, generating the system parameter set $Parms$ and the master private key $k_s$. The system parameter set $Parms$ is then sent to the attacker $A_1$, while the master private key $k_s$ is kept secret.
(2) First Phase Queries: In this phase, $A_1$ can make the following queries to $C$: hash queries, private key queries, and signature queries.
(a) Private Key Query: When $A_1$ requests the private key for $ID_i$, $C$ sends the private key to $A_1$.
(b) Signature Query: When $A_1$ requests this query, assuming the requested parameters are $(ID_i, m_i)$, $C$ sends the signature to $A_1$. However, $A_1$ cannot request a signature query where $ID_i$ is the identity of the forger; otherwise, the game terminates.
(3) Forgery: The attacker $A_1$ forges a digital signature $(S_i, U_i, w_i)$ for the identity $ID_i$. $A_1$ is considered to have won Game 1 if the following conditions are met:
(a) The signature $(S_i, U_i, w_i)$ is valid.
(b) The master private key has not been queried.
(c) In the signature query, the requested parameters do not include the identity $ID_i$.
Theorem 1: 
In the random oracle model, if the K-CAA problem is hard to solve, then the proposed scheme is EUF-CMA secure.
Lemma 1: 
In the random oracle model, if there exists a Type 1 attacker $A_1$ who can win Game 1 with non-negligible advantage $\varepsilon$ (after making at most $q_{H_i}$ $(i = 1, 2)$ $H_i$ hash queries, $q_E$ private key queries, and $q_S$ signature queries), then there exists a challenger $C$ who can solve the K-CAA problem with non-negligible advantage $succ_{A_1}^{K\text{-}CAA} \ge \left(\varepsilon - \frac{1}{2^k}\right)\left(\frac{q_E}{q_{H_1}}\right)^{q_E + q_S}\left(\frac{q_{H_1} - q_E}{q_{H_1}}\right)$.
Proof: 
Game 1 is constructed between the challenger $C$ and the attacker $A_1$. If $A_1$ can forge a valid signature, then $C$ can leverage Game 1 to solve the K-CAA problem. Therefore, given random inputs $R = kP_2 \in G_1$, $h_1 = H_1(ID_i \| hid, N)$, $h_{1,1}, \ldots, h_{1,q_E} \in Z_q^*$ and $\frac{h_{1,1}P_1}{k + h_{1,1}}, \frac{h_{1,2}P_1}{k + h_{1,2}}, \ldots, \frac{h_{1,q_E}P_1}{k + h_{1,q_E}}$, the goal is to output a solution $\left(h_{1,I}, \frac{h_{1,I}P_1}{k + h_{1,I}}\right)$ to the K-CAA problem, where $h_{1,I} \notin \{h_{1,1}, h_{1,2}, \ldots, h_{1,q_E}\}$.
(1) System Initialization: The system initialization is executed by $C$. Let $k_{pb} = R = kP_2$, where $k_s = k$ serves as the master private key, which is unknown to $C$. The challenger $C$ selects an identity $ID_i$ as the challenge identity and sends the parameters $\{P_1, P_2, k_{pb}, H_1, H_2\}$ to the attacker $A_1$.
(2) First Phase Queries: $A_1$ can initiate queries to $C$, and $C$ must respond according to the game rules. The possible queries include $H_1$ hash queries, $H_2$ hash queries, random oracle queries, public key queries, private key queries, public key replacement queries, and signature queries. Additionally, the lists $H_1^{List}$, $H_2^{List}$, $sk^{list}$ and $pk^{list}$ are used to store the records of $H_1$ hash queries, $H_2$ hash queries, random oracle queries, public key queries, and private key queries, respectively. The following details each type of query in the first phase:
(a) $H_1$ Hash Query: When $A_1$ requests an $H_1$ hash query for input $(ID_i, hid, N)$, $C$ first searches the list $H_1^{List}$ to see if there is a corresponding tuple. The structure of a tuple in $H_1^{List}$ is $(ID_i, hid, N, h_{1,i})$. If $H_1^{List}$ already contains the tuple $(ID_i, hid, N, h_{1,i})$, $C$ directly returns $h_{1,i}$ to $A_1$. Otherwise, $C$ randomly selects a value $h_{1,i} \in Z_q^*$, sets $h_{1,i} = H_1(ID_i \| hid, N)$, forms a new tuple $(ID_i, hid, N, h_{1,i})$, inserts this tuple into the list $H_1^{List}$, and then returns $h_{1,i}$ to $A_1$.
(b) $H_2$ Hash Query: When $A_1$ requests an $H_2$ hash query for input $(ID_i, m_i, r_i, N)$, $C$ first searches the list $H_2^{List}$ to see if there is a corresponding tuple. The structure of a tuple in $H_2^{List}$ is $(ID_i, m_i, r_i, w_i, N, h_{2,i})$. If $H_2^{List}$ already contains the tuple $(ID_i, m_i, r_i, N, h_{2,i})$, $C$ directly returns $h_{2,i}$ to $A_1$. Otherwise, $C$ randomly selects two values $r_i, h_{2,i} \in Z_q^*$, sets $w_i = e(P_1, k_{pb})^{r_i}$, $h_{2,i} = H_2(m_i \| w_i, N)$, forms a new tuple $(ID_i, m_i, r_i, w_i, N, h_{2,i})$, inserts this tuple into the list $H_2^{List}$, and then returns $h_{2,i}$ to $A_1$. This $H_2$ hash query event is denoted by $E_1$.
(c) Private Key Query: When $A_1$ requests the private key for identity $ID_i$ $(1 \le i \le q_E)$, $C$ retrieves $(ID_i, h_{1,i})$ from the list $H_1^{List}$. $C$ then checks if $h_{1,i}$ belongs to the challenge identity set $\{h_{1,1}, h_{1,2}, \ldots, h_{1,q_E}\}$. If it does not, the game terminates (this event is denoted by $E_2$); otherwise, $C$ sets $s_i = P_1 - \frac{h_{1,i}P_1}{k + h_{1,i}}$, $Q_i = \frac{k_{pb}P_1}{s_i}$ and saves the tuple $(ID_i, hid, N, Q_i, s_i)$ in $sk^{list}$. $C$ then returns $s_i$ and $Q_i$ to $A_1$.
(3) Forgery Phase: $A_1$ outputs a digital signature $(S_I, U_I, w_I)$ for the identity $ID_I$, and the signature passes the verification $e(S_I, Q_I) \cdot e(P_1, k_{pb})^{h_{2,I}} = e(P_1, k_{pb})^{r_I}$.
$C$ retrieves $(ID_I, h_{1,I})$ from $H_1^{List}$. Then, it retrieves the tuple $(ID_i, m_i, r_i, w_i, N, h_{2,i})$ from the list $H_2^{List}$. Let $Q_I = h_{1,I}P_2 + k_{pb}$. If $h_{1,I} \in \{h_{1,1}, h_{1,2}, \ldots, h_{1,q_E}\}$, then the simulation terminates and outputs “Failure” (denoted as $E_3$). Otherwise, the equation $e(S_I, Q_I) \cdot e(P_1, k_{pb})^{h_{2,I}} = e(P_1, k_{pb})^{r_I}$ holds. Let $x = \frac{h_{1,I}P_1}{k + h_{1,I}}$; then
$$Q_I = \frac{k_{pb}P_1}{P_1 - x} = k_{pb}P_1\left(P_1 - \frac{h_{1,I}P_1}{k + h_{1,I}}\right)^{-1} = k_{pb}P_1 \cdot \left(\frac{kP_1}{k + h_{1,I}}\right)^{-1} = h_{1,I}P_2 + k_{pb},$$
so
$$e(S_I, Q_I) = e\left(S_I, \frac{k_{pb}P_1}{P_1 - x}\right) = e(P_1, k_{pb})^{r_I - h_{2,I}}, \quad e\left(\frac{S_I P_1}{P_1 - x}, k_{pb}\right) = e\left((r_I - h_{2,I})P_1, k_{pb}\right), \quad \frac{S_I P_1}{P_1 - x} = (r_I - h_{2,I})P_1.$$
From this, we obtain $x = P_1 - \frac{S_I}{r_I - h_{2,I}} = \frac{h_{1,I}P_1}{k + h_{1,I}}$, where $\left(h_{1,I}, \frac{h_{1,I}P_1}{k + h_{1,I}}\right)$ is the solution to the K-CAA problem. Thus, $C$ outputs the solution $\left(h_{1,I}, \frac{h_{1,I}P_1}{k + h_{1,I}}\right)$ to the K-CAA problem. If the events $E_1$, $E_2$ and $E_3$ do not occur, then $C$ can solve an instance of the K-CAA problem. The probability that $A_1$ forges a valid signature without querying $H_2$ does not exceed $\frac{1}{2^k}$, so $C$’s probability of successfully solving the K-CAA problem is $succ_{A_1}^{K\text{-}CAA} \ge \left(\varepsilon - \frac{1}{2^k}\right)\left(\frac{q_E}{q_{H_1}}\right)^{q_E + q_S}\left(\frac{q_{H_1} - q_E}{q_{H_1}}\right)$.
Because the K-CAA problem is assumed to be hard, the probability of $C$ successfully solving it must be negligible; by the bound above, $A_1$’s advantage $\varepsilon$ in winning Game 1 must therefore also be negligible. Furthermore, the fault-tolerant aggregate signature scheme constructed in our work operates mainly during the aggregation process and does not affect the aggregation method of the aggregate signature. Therefore, according to Definition 1, both the non-fault-tolerant and fault-tolerant aggregate signature schemes in our work are secure.

5.1.2. Unforgeability

Based on the security of the signature scheme under the K-CAA assumption, the scheme proposed in this section can resist existential forgery under adaptive chosen message attacks and identity attacks. Therefore, the scheme presented in our work possesses unforgeability.

5.1.3. Privacy

When a vehicle joins the IoV, it first sends its identity $ID_i$ to the authentication center to generate its pseudonym for use within the network. Consequently, during information exchange, no third party other than the authentication center can know the vehicle’s true identity. In subsequent communications, vehicles use pseudonyms to participate in the fault-tolerant aggregate signature process. Therefore, this scheme ensures the privacy of the vehicles participating in the communication.

5.1.4. Traceability

When the RSU detects an error or verification failure in the aggregated signature, it can send the pseudonym of the vehicle that caused the verification error to the authentication center. The authentication center can then use the pseudonym to trace the relevant information of the vehicle. Therefore, this scheme has traceability.

5.2. Performance of Each Module

In the SM9 signature algorithm, the main time-consuming modules are bilinear pairing, modular exponentiation, and point multiplication. Among these, the most time-consuming step in the aggregate signature process is point multiplication. Bilinear pairing is implemented in software. The software computation times are as follows: bilinear pairing takes 4.129 ms, point multiplication takes 1.814 ms, and point addition takes 0.009 ms. These times are the average values obtained from running each operation 1000 times in software.
An eight-way parallel point multiplication algorithm is implemented on the FPGA. The specific details regarding the resource utilization, control method, operation mode, and operating cycles of each module are shown in Table 4. The test data are sourced from the SM9 Chinese National Standard [34].
Based on Table 4, from the perspective of structural functionality, the modular multiplication and modular inversion units in the parallel point multiplication algorithm are both computed serially. The modular addition and subtraction functions are combined using combinational logic, which completes the operations within a single clock cycle, exhibiting parallel characteristics. The point addition and point doubling units achieve their functionality by invoking the modular addition/subtraction and modular multiplication units. As seen from the table, point addition and point doubling are executed in parallel with their parallel nature supported by the Montgomery point multiplication algorithm and controlled by the slave state machine. Additionally, the table shows that the slave state machine and point multiplication modules also operate in parallel with their parallelism supported by the master control state machine, which simultaneously distributes tasks and manages parallel control. However, during coordinate conversion, all data from the point multiplication modules need to be sent to the master state machine, which then forwards it to the coordinate conversion module. The coordinate conversion module operates serially, and its serial nature is determined by the coordinate conversion algorithm.
According to Table 4, in terms of resource and performance analysis, a single-point multiplication module efficiently utilizes the FPGA’s board resources. When the eight-way parallel point multiplication algorithm is implemented, the overall resource utilization of each FPGA is LUT: 57.48%, FF: 23.38%, DSP: 93.91%. The operating frequency of the above modules is shown in Table 4 as well. The eight-way parallel point multiplication algorithm on the FPGA experimental platform achieves a calculation throughput of approximately $\frac{2.7 \times 10^{7}}{26{,}014 + 1453} \times 8 \times 9 \approx 70{,}776$ operations per second. Meanwhile, the software implementation achieves only 551 operations per second, indicating a significant improvement in computational efficiency.
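The throughput figure can be reproduced with a few lines of arithmetic. The constant names below are hypothetical, and reading 26,014 and 1453 as the cycle counts of one point multiplication and one coordinate conversion is an assumption of this sketch; the factors 8 and 9 correspond to the eight-way parallel design and the nine FPGA boards, and 1.814 ms is the software point multiplication time reported above.

```python
CLOCK_HZ = 2.7e7               # numerator of the throughput expression
POINT_MUL_CYCLES = 26_014      # assumed: cycles for one point multiplication
COORD_CONV_CYCLES = 1_453      # assumed: cycles for one coordinate conversion
PARALLEL_WAYS = 8              # eight-way parallel point multiplication per FPGA
BOARDS = 9                     # nine FPGA boards in the platform
SOFTWARE_POINT_MUL_MS = 1.814  # measured software point multiplication time

# Operations per second across all parallel units and boards.
fpga_ops_per_s = CLOCK_HZ / (POINT_MUL_CYCLES + COORD_CONV_CYCLES) * PARALLEL_WAYS * BOARDS

# Software baseline: one operation every 1.814 ms.
software_ops_per_s = 1000 / SOFTWARE_POINT_MUL_MS

speedup = fpga_ops_per_s / software_ops_per_s

print(f"FPGA: {fpga_ops_per_s:.0f} ops/s, software: {software_ops_per_s:.0f} ops/s, "
      f"speedup: {speedup:.0f}x")
```

Under these assumptions the ratio between the two throughputs is roughly two orders of magnitude.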

5.3. Efficiency Analysis of Aggregated Signature

Below, we compare the aggregate verification efficiency of the proposed aggregate signature scheme with currently well-performing software and hardware implementation schemes from different perspectives. $P_B$ represents the running time of a bilinear pairing operation in software, $m$ represents the running time of a modular exponentiation in software, $M$ represents the running time of a point multiplication in software, $A$ represents the running time of a point addition in software, $M_P$ represents the running time of a point multiplication on the single-path FPGA in our work, $CoT_P$ represents the running time of a coordinate conversion operation on the single-path FPGA in our work, and $n$ represents the number of signatures participating in the aggregation.

5.3.1. Analysis of Non-Fault-Tolerant Aggregate Signature Verification Efficiency

First, let us assume that all signatures are valid and that the total number of signatures to be verified exceeds the 36 signatures required to construct a fault-tolerant set. Under this assumption, we can proceed with an analysis of the computational load associated with different aggregate signature schemes.
One such scheme, presented in [35], is a derivative of the fault-tolerant aggregate signature approach. To facilitate a meaningful comparison, the design of the fault-tolerant set in [35] has been aligned with the scheme proposed in our work. This unification allows for a direct comparison of the computational efficiency between the two methods, providing insights into their relative performance under similar conditions.
As shown in Table 5, the time consumption formulas for individual signature verification and batch verification are provided. The efficiency of aggregate signatures is primarily determined by the efficiency of batch verification. Figure 7 displays a categorized bar chart of batch verification time consumption, where “no tol” represents the verification time in the non-fault-tolerant mode, and “tol” represents the consumption time in the fault-tolerant mode. From the figure, it can be seen that a significant difference in verification time emerges even with a small number of signatures. Therefore, we analyze the case with 36 signatures and no faults. The aggregate signature verification times from the literature are as follows: 139.532 ms in [36], 432.172 ms in [37], 199.013 ms in [38], 78.321 ms in [4], 523.692 ms in [35], 198.689 ms in [3], 205.142 ms in [39], 264.317 ms in [40], and 490.885 ms in [6]. The verification time in the fault-tolerant mode of our work is 113.459 ms, and it is 14.899 ms in the non-fault-tolerant mode. Compared to the original SM9 aggregate signature scheme, the verification time in the non-fault-tolerant mode of our work is reduced by 80.9%, and compared to other schemes, it is reduced by up to 97.1%. Although the time consumption in the fault-tolerant aggregate signature mode has increased compared to the original SM9 scheme, it has decreased by 78.3% compared to the fault-tolerant aggregate signature scheme in [35] and at least by 18.6% compared to other schemes. Additionally, it provides fault-tolerant aggregation functionality that other schemes do not possess.
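The percentages quoted above follow directly from the listed verification times. The sketch below recomputes them; the dictionary and function names are hypothetical, and treating [4] as the original SM9 aggregate signature scheme is an inference from the reference list rather than something stated explicitly in this paragraph.

```python
# Batch verification times in ms for 36 signatures and no faults, keyed by citation.
other_schemes_ms = {
    "[36]": 139.532, "[37]": 432.172, "[38]": 199.013, "[4]": 78.321,
    "[35]": 523.692, "[3]": 198.689, "[39]": 205.142, "[40]": 264.317,
    "[6]": 490.885,
}
OURS_NO_TOL_MS = 14.899   # proposed scheme, non-fault-tolerant mode
OURS_TOL_MS = 113.459     # proposed scheme, fault-tolerant mode


def reduction_percent(ours_ms: float, other_ms: float) -> float:
    """Relative reduction in verification time, in percent."""
    return (1 - ours_ms / other_ms) * 100


# Non-fault-tolerant mode against the SM9 baseline (assumed to be [4]).
vs_sm9 = reduction_percent(OURS_NO_TOL_MS, other_schemes_ms["[4]"])

# Best case for the non-fault-tolerant mode across all listed schemes.
best_no_tol = max(reduction_percent(OURS_NO_TOL_MS, t) for t in other_schemes_ms.values())

# Fault-tolerant mode against the fault-tolerant scheme in [35].
tol_vs_35 = reduction_percent(OURS_TOL_MS, other_schemes_ms["[35]"])

print(f"{vs_sm9:.1f}% {best_no_tol:.1f}% {tol_vs_35:.1f}%")
```

The largest reduction is obtained against the slowest listed scheme, [35], which is also the only other fault-tolerant scheme in the comparison.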

5.3.2. Analysis of Fault-Tolerant Aggregate Signature Verification Efficiency

Below, we analyze the fault-tolerant efficiency of the fault-tolerant aggregate signature, using an example where there are 36 signatures to be verified, including two erroneous signatures. First, we introduce the following theorem. Assume that the set $\{A_1, A_2, \ldots, A_n\}$ is a $uniform(k, n)$ construction of set $A$; then, for any $r$-subset $\{A_{i_1}, A_{i_2}, \ldots, A_{i_r}\} \subseteq \{A_1, \ldots, A_n\}$ $(1 \le r \le k - 1)$, we have Equations (19) and (20), where $m = \binom{n}{k-1}$.
$$\left|\bigcup_{j=1}^{r} A_{i_j}\right| = m - \binom{n-r}{k-r-1} \tag{19}$$
$$\left|\bigcap_{j=1}^{r} A_{i_j}\right| = m - \binom{n}{k-1} + \binom{n-r}{k-1} \tag{20}$$
Similarly, the size of each subset $A_i$ is $\binom{n-1}{k-1}$, $1 \le i \le n$. Specifically, when $r$ takes the value $n - k + 1$, the value of Equation (20) is 1.
In our work, we construct a $uniform(8, 9)$ fault-tolerant aggregate signature set $\{T_1, \ldots, T_9\}$. Each $T_i$ $(1 \le i \le 9)$ has a size of $|T_i| = \binom{9-1}{8-1} = 8$. Once the fault-tolerant set has been constructed, for convenience of representation, we use $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_9$ to represent the corresponding $T_1, \ldots, T_9$. Additionally, each individual signature appears in exactly two different subsets $T_i$, $T_j$, since $\sum_{i=1}^{9} |T_i| / 36 = 72 / 36 = 2$. From Equation (20), $\left|\bigcap_{j=1}^{2} A_{i_j}\right| = \binom{9-2}{8-1} = 1$, so the intersection of any two subsets contains exactly one signature. Therefore, the two erroneous signatures will appear in three or four different $T_i$ subsets. In our analysis, we consider the worst-case scenario where the two erroneous signatures appear in four different $T_i$ subsets. Thus, only five subsets $\varepsilon_{i_1}, \varepsilon_{i_2}, \ldots, \varepsilon_{i_5}$ pass verification. Consequently, the number of correct signatures passing verification can be calculated using Equation (19), yielding $36 - \binom{9-5}{8-5-1} = 30$. Therefore, a total of 30 signatures pass the verification.
When an aggregate signature fails verification, the usual error-handling method is to verify the signatures individually. Below, we briefly explain the number of individual verifications required for non-fault-tolerant aggregate signatures. When two erroneous signatures are mixed into 36 signatures, assume that the position of the first erroneous signature is uniformly distributed over $[0, 35]$ and denote it as $x$. The probability of the second erroneous signature being at a given position in $[x + 1, 36]$ is $1 / (36 - x)$. Therefore, the expected number of verifications is approximately 21. In contrast, the maximum number of verifications required for fault-tolerant aggregate signatures is six. Hence, the fault-tolerant aggregate signature scheme demonstrates better performance in the presence of erroneous signatures.
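The counting argument can be checked by brute force on a small model. The construction used below is an assumption made for illustration, since the paper's own construction is given elsewhere: each of the $m = \binom{n}{k-1}$ signatures is labeled by a $(k-1)$-subset of the indices $0, \ldots, n-1$, and subset $T_i$ collects the signatures whose label does not contain $i$. This model reproduces the sizes stated above (36 signatures, nine subsets of size 8, each signature in two subsets); the function names are hypothetical.

```python
from itertools import combinations


def uniform_family(k: int, n: int):
    """Assumed model of a uniform(k, n) family: labels are (k-1)-subsets of range(n)."""
    elements = list(combinations(range(n), k - 1))
    subsets = [{e for e in elements if i not in e} for i in range(n)]
    return elements, subsets


def cleared_signatures(subsets, bad):
    """Return (number of passing subsets, number of signatures they cover)."""
    passing = [s for s in subsets if not (s & bad)]  # a subset fails if it holds a bad signature
    covered = set().union(*passing)
    return len(passing), len(covered)


elements, subsets = uniform_family(8, 9)

# Every placement of two erroneous signatures among the 36.
outcomes = [cleared_signatures(subsets, {e1, e2}) for e1, e2 in combinations(elements, 2)]

# Worst case: the fewest signatures confirmed valid by passing subsets.
worst_passing, worst_cleared = min(outcomes, key=lambda o: o[1])

print(len(elements), len(subsets[0]), worst_passing, worst_cleared)
```

In this model the worst case leaves five passing subsets that together cover 30 signatures, in agreement with the value obtained from Equation (19).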
Figure 8 presents the efficiency analysis of fault-tolerant aggregate signatures, where “no tol” represents the verification time in the non-fault-tolerant mode, and “tol” represents the consumption time in the fault-tolerant mode. Below, we provide more specific data: in this scenario, the time to verify and identify all erroneous signatures is 351.611 ms for [36], 720.25 ms for [37], 313.862 ms for [38], 289.833 ms for [4], 567.39 ms for [35], 351.632 ms for [3], 455.315 ms for [39], 455.354 ms for [40], and 653.782 ms for [6]. The non-fault-tolerant aggregate mode takes 191.206 ms, while the fault-tolerant aggregate mode takes 163.97 ms. In the current scenario, analyzing the time taken by each scheme, our proposed scheme, whether in fault-tolerant or non-fault-tolerant aggregate mode, significantly outperforms other non-fault-tolerant schemes. Even compared to the fault-tolerant scheme in [35], our scheme’s time consumption is only 28.9% of theirs. In summary, our designed scheme shows a substantial advantage in both fault-tolerant and fault-free scenarios.

6. Conclusions

Considering the complex network environment and high-performance requirements of vehicular networks, we design an efficient aggregate signature algorithm with two modes based on the SM9 aggregate signature algorithm, FPGA acceleration technology, and the uniform(k,n) theory. The security of the proposed algorithm is also analyzed. For the numerous elliptic curve point multiplication operations involved, a parallel point multiplication architecture based on FPGA is designed. Single-point multiplication uses the Montgomery point multiplication algorithm, and the data flow of the point addition and point double modules is optimized. Key modules for modular addition/subtraction, multiplication, and inversion are designed using combined modular addition/subtraction, the KOA and Barrett algorithms, and the binary modular inversion algorithm, respectively. The parallel point multiplication architecture is applied to the SM9 efficient aggregate signature scheme, and its effectiveness is verified through simulation and on-board experiments. A comprehensive analysis of the proposed scheme’s performance is also conducted. The highly parallel elliptic curve point multiplication acceleration module designed on the FPGA platform is applied to both non-fault-tolerant and fault-tolerant modes, demonstrating good performance in the low-latency and complex network environment scenarios of vehicular networks. Compared to similar schemes, the proposed scheme not only achieves higher operational efficiency but also incorporates fault-tolerant features.
The next step could explore the hardware security design of the SM9 parallel point multiplication architecture to prevent side-channel attacks. This involves ensuring high performance while preventing the leakage of computational information. Additionally, developing new fault-tolerance theories to further optimize the construction speed of fault-tolerant sets can be investigated.

Author Contributions

Conceptualization, B.Z., B.L. and J.Z.; methodology, B.Z., B.L. and J.Z.; engineering implementation, B.Z., B.L., Y.W. and Y.Y.; validation, B.Z., H.H. and B.L.; formal analysis, B.Z., B.L. and J.Z.; investigation, B.Z., B.L. and Q.Z.; writing—original draft preparation, B.Z. and B.L.; writing—review and editing, B.Z., B.L. and J.Z.; visualization, B.Z., B.L. and J.Z.; supervision, B.L. and Q.Z.; project administration, B.L. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, H.; Wang, L.; Zhang, K.; Li, J.; Luo, Y. A Conditional Privacy-Preserving Certificateless Aggregate Signature Scheme in the Standard Model for VANETs. IEEE Access 2022, 10, 15605–15618. [Google Scholar] [CrossRef]
  2. Raya, M.; Hubaux, J.P. Securing vehicular ad hoc networks. J. Comput. Secur. 2007, 15, 39–68. [Google Scholar] [CrossRef]
  3. Xie, Y.; Li, X.; Zhang, S.; Li, Y. iCLAS: An Improved Certificateless Aggregate Signature Scheme for Healthcare Wireless Sensor Networks. IEEE Access 2019, 7, 15170–15182. [Google Scholar] [CrossRef]
  4. An, T.; Ma, W.; Liu, X. Aggregated Signature Scheme Based on SM9 Cryptographic Algorithm in VANET. Comput. Appl. Softw. 2020, 12, 280–284+321. [Google Scholar] [CrossRef]
  5. Mei, Q.; Xiong, H.; Chen, J.; Yang, M.; Kumari, S.; Khan, M.K. Efficient Certificateless Aggregate Signature with Conditional Privacy Preservation in IoV. IEEE Syst. J. 2021, 15, 245–256. [Google Scholar] [CrossRef]
  6. Xu, R.; Zhou, Y.; Yang, Q.; Yang, K.; Han, Y.; Yang, B.; Xia, Z. An Efficient and Secure Certificateless Aggregate Signature Scheme. J. Syst. Archit. 2024, 147, 103030. [Google Scholar] [CrossRef]
  7. Yang, W.; Fan, J.; Zhang, F. An Efficient Aggregate Signature Scheme with Designated Verifier Based on the Schnorr Signature in Healthcare Wireless Sensor Networks. IEEE Internet Things J. 2024, 1. Early Access. [Google Scholar] [CrossRef]
  8. Fu, J.; Liu, J.; Huang, Y.; Si, X.; Wang, Y.; Li, B. Aggregate Signature Consensus Scheme Based on FPGA. In Proceedings of the First International Conference, BlockSys 2019, Guangzhou, China, 7–8 December 2019; pp. 92–100.
  9. Hartung, G.; Kaidel, B.; Koch, A.; Koch, J.; Rupp, A. Fault-Tolerant Aggregate Signatures. In Proceedings of the 19th IACR International Conference on Practice and Theory in Public-Key Cryptography, Taipei, China, 6–9 March 2016; pp. 331–356.
  10. Verri, L.A.; Mariano, S.G.; Leithardt, V.; Beko, M.; Albenes, Z.C.; Parreira, W. A Review of Techniques for Implementing Elliptic Curve Point Multiplication on Hardware. JSAN 2020, 10, 3.
  11. Khleborodov, D. Fast Elliptic Curve Point Multiplication Based on Window Non-Adjacent Form Method. Appl. Math. Comput. 2018, 334, 41–59.
  12. Hai, H.; Ning, N.; Lin, X.; Zhiwei, L.; Bin, Y.; Shilei, Z. An Improved wNAF Scalar-Multiplication Algorithm with Low Computational Complexity by Using Prime Precomputation. IEEE Access 2021, 9, 31546–31552.
  13. Zhao, S.; Yang, X.; Liu, Z.; Yu, B.; Huang, H. An Improved wNAF Scalar-Multiplication Algorithm with Low Computational Complexity. Acta Electronica Sin. 2022, 4, 977.
  14. Bellemou, A.; Benblidia, N.; Anane, M.; Issad, M. MicroBlaze-Based Multiprocessor Embedded Cryptosystem on FPGA for Elliptic Curve Scalar Multiplication over Fp. J. Circuit. Syst. Comp. 2019, 28, 1950037.
  15. Hao, Y.; Zhong, S.; Ma, M.; Jiang, R.; Huang, S.; Zhang, J.; Wang, W. Lightweight Architecture for Elliptic Curve Scalar Multiplication over Prime Field. Electronics 2022, 11, 2234.
  16. Islam, M.M.; Hossain, M.S.; Hasan, M.K.; Shahjalal, M.; Jang, Y.M. FPGA Implementation of High-Speed Area-Efficient Processor for Elliptic Curve Point Multiplication over Prime Field. IEEE Access 2019, 7, 178811–178826.
  17. State Cryptography Administration of China. Information Security Technology—Identity-Based Cryptographic Algorithms SM9—Part 1: General; State Cryptography Administration of China: Beijing, China, 2016.
  18. Liu, S.; Chen, K.; Liu, Z.; Wang, T. Secure Threshold Ring Signature Based on SM9. IEEE Access 2021, 9, 95507–95516.
  19. Liu, S.G.; Liu, R.; Rao, S.Y. Secure and Efficient Two-Party Collaborative SM9 Signature Scheme Suitable for Smart Home. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 4022–4030.
  20. Jing, S.; Yang, X.; Feng, Y.; Liu, X.; Hao, F.; Yang, Z. Hardware Implementation of SM9 Fast Algorithm Based on FPGA. In Proceedings of the 2nd International Conference on Internet, Education and Information Technology (IEIT 2022), Zhangjiajie, China, 15–17 April 2022; pp. 797–803.
  21. Wang, A.T.; Guo, B.W.; Wei, C.J. Highly-Parallel Hardware Implementation of Optimal Ate Pairing over Barreto-Naehrig Curves. Integration 2019, 64, 13–21.
  22. Cheng, X.; Zhang, Y.; Wang, Y. Simplification and Hardware Parallel Design of Frobenius Mapping Algorithm Based on SM9. In Proceedings of the 2019 IEEE 3rd International Conference on Circuits, Systems and Devices (ICCSD), Chengdu, China, 23–25 August 2019; pp. 78–82.
  23. Ali, I.; Chen, Y.; Ullah, N.; Afzal, M.; He, W. Bilinear Pairing-Based Hybrid Signcryption for Secure Heterogeneous Vehicular Communications. IEEE Trans. Veh. Technol. 2021, 70, 5974–5989.
  24. Yang, G.Q.; Kong, F.Y.; Xu, Q.L. A High Performance FPGA Based Implementation Method of SM9. J. Shandong Univ. (Nat. Sci.) 2020, 9, 54–61.
  25. Mundhe, P.; Verma, S.; Venkatesan, S. A Comprehensive Survey on Authentication and Privacy-Preserving Schemes in VANETs. Comput. Sci. Rev. 2021, 41, 100411.
  26. Wang, G.; Cao, Z.; Dong, X. Improved Fault-Tolerant Aggregate Signatures. Comput. J. 2019, 62, 481–489.
  27. Bardini Idalino, T.; Moura, L. Efficient Unbounded Fault-Tolerant Aggregate Signatures Using Nested Cover-Free Families. In Proceedings of the Combinatorial Algorithms: 29th International Workshop, IWOCA 2018, Singapore, 16–19 July 2018; p. 52.
  28. Bardini Idalino, T.; Moura, L. Nested Cover-Free Families for Unbounded Fault-Tolerant Aggregate Signatures. Theor. Comput. Sci. 2021, 854, 116–130.
  29. Zhenfu, C. Finite Set Theory and Its Application to Cryptology. J. Stat. Plan. Inference 1996, 51, 129–136.
  30. State Cryptography Administration of China. Information Security Technology—Identity-Based Cryptographic Algorithms SM9—Part 2: Algorithms; State Cryptography Administration of China: Beijing, China, 2020.
  31. Zhao, X.; Li, B.; Zhang, L.; Wang, Y.; Zhang, Y.; Chen, R. FPGA Implementation of High-Efficiency ECC Point Multiplication Circuit. Electronics 2021, 10, 1252.
  32. Hu, J.; Li, Y. An Improved Modular Inverse Algorithm and Hardware Implementation. J. Hunan Univ. (Nat. Sci. Ed.) 2022, 2, 101–105.
  33. Liu, W.; Ni, J.; Liu, Z.; Liu, C.; O’Neill, M. Optimized Modular Multiplication for Supersingular Isogeny Diffie-Hellman. IEEE Trans. Comput. 2019, 68, 1249–1255.
  34. State Cryptography Administration of China. Identity-Based Cryptographic Algorithms SM9—Part 5: Parameter Definition; State Cryptography Administration of China: Beijing, China, 2016.
  35. Zhao, Y.; Dan, G.; Ruan, A.; Huang, J.; Xiong, H. A Certificateless and Privacy-Preserving Authentication with Fault-Tolerance for Vehicular Sensor Networks. In Proceedings of the 2021 IEEE Conference on Dependable and Secure Computing (DSC), Fukushima, Japan, 30 January 2021; pp. 1–7.
  36. Shu, H.; Qi, P.; Huang, Y.; Chen, F.; Xie, D.; Sun, L. An Efficient Certificateless Aggregate Signature Scheme for Blockchain-Based Medical Cyber Physical Systems. Sensors 2020, 20, 1521.
  37. Chen, Y.; Chen, J. CPP-CLAS: Efficient and Conditional Privacy-Preserving Certificateless Aggregate Signature Scheme for VANETs. IEEE Internet Things J. 2022, 9, 10354–10365.
  38. Du, H.; Wen, Q.; Zhang, S. An Efficient Certificateless Aggregate Signature Scheme without Pairings for Healthcare Wireless Sensor Network. IEEE Access 2019, 7, 42683–42693.
  39. Deng, L.; Ning, B.; Jiang, Y. A Lightweight Certificateless Aggregation Signature Scheme with Provably Security in the Standard Model. IEEE Syst. J. 2020, 14, 4242–4251.
  40. Gayathri, N.B.; Thumbur, G.; Rajesh Kumar, P.; Rahman, M.Z.U.; Reddy, P.V.; Lay-Ekuakille, A. Efficient and Secure Pairing-Free Certificateless Aggregate Signature Scheme for Healthcare Wireless Medical Sensor Networks. IEEE Internet Things J. 2019, 6, 9064–9075.
Figure 1. Overall architecture of the SM9 algorithm.
Figure 2. Overall structure of the efficient SM9 aggregate signature scheme.
Figure 3. Overall structure of the hardware acceleration.
Figure 4. State transition diagram of the master state machine.
Figure 5. Modular addition and subtraction units.
Figure 6. Overall architecture of the 256-bit KOA algorithm.
Figure 7. Efficiency analysis of error-free aggregate signature verification.
Figure 8. Efficiency analysis of the fault-tolerant aggregate signature.
Table 1. Notations.
| Notation | Description |
| --- | --- |
| $P_1, P_2$ | Two different generators in the system |
| $k_s$ | System master private key |
| $k_{pb}$ | System master public key, $k_{pb} = k \cdot P_2$, $k \in Z_{n-1}^{*}$ |
| $N$ | The order of the elliptic curve group in SM9 |
| $H_1(Z, n)$, $H_2(Z, n)$ | One-way hash functions, which take a bit string $Z$ and an integer $n$ as input and output $h \in [1, n-1]$ |
| $ID_i$ | The real identity of vehicle $i$ |
| $hid$ | Private key generation identifier |
| $VID_i$ | Pseudonym assigned to vehicle $i$, denoted as $VID_i = H_1(ID_i \parallel hid, N)$ |
| $s_i$ | Private key of vehicle $i$, $s_i = k \cdot (VID_i + k)^{-1} \cdot P_1$ |
| $Q_i$ | Public key of vehicle $i$, $Q_i = VID_i \cdot P_2 + k_{pb}$ |
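The key relations in Table 1 can be checked at the scalar level. The following is a minimal sketch, not the SM9 reference procedure: `N` is a toy prime standing in for the group order, `toy_h1` is a hypothetical SHA-256 stand-in for $H_1$, and the identity, `hid`, and master key values are made up. It only tracks the scalar multiples of $P_1$ and $P_2$, reading $s_i = t \cdot P_1$ with $t = k (VID_i + k)^{-1} \bmod N$ and $Q_i = (VID_i + k) \cdot P_2$.

```python
import hashlib

N = 1000003  # toy prime standing in for the SM9 group order (assumption)


def toy_h1(data: bytes, n: int) -> int:
    """Hypothetical stand-in for H1: maps bytes into [1, n-1]. Not the SM9 H1."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest % (n - 1) + 1


def key_scalars(identity: bytes, hid: bytes, k: int, n: int = N):
    """Return (VID_i, t, q) where s_i = t * P1 and Q_i = q * P2."""
    vid = toy_h1(identity + hid, n)      # VID_i = H1(ID_i || hid, N)
    q = (vid + k) % n                    # Q_i = VID_i * P2 + k * P2
    if q == 0:
        raise ValueError("VID_i + k is 0 mod N; a new master key is needed")
    t = k * pow(q, -1, n) % n            # s_i = k * (VID_i + k)^(-1) * P1
    return vid, t, q


master_k = 123457                        # made-up master private key
vid, t, q = key_scalars(b"vehicle-42", b"\x01", master_k)
# t * q = k (mod N), so pairing s_i with Q_i gives e(P1, P2)^k = e(P1, k_pb).
print(t * q % N == master_k)
```

The final check reflects why a pairing of $s_i$ with $Q_i$ collapses to a value that depends only on the master public key: the identity-dependent factor $(VID_i + k)$ cancels.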
Table 2. Point addition data flow.
| Number | Operation | Computation |
| --- | --- | --- |
| 1 | $T_1 = X_1 \cdot X_2$ | $X_1 X_2$ |
| 2 | $z_1z_2 = Z_1 \cdot Z_2$ | $Z_1 Z_2$ |
| 3 | $T_2 = a \cdot z_1z_2$ | $a Z_1 Z_2$ |
| 4 | $T_1 = T_1 - T_2$ | $X_1 X_2 - a Z_1 Z_2$ |
| 5 | $T_1 = T_1^2$ | $(X_1 X_2 - a Z_1 Z_2)^2$ |
| 6 | $x_1z_2 = X_1 \cdot Z_2$ | $X_1 Z_2$ |
| 7 | $x_2z_1 = X_2 \cdot Z_1$ | $X_2 Z_1$ |
| 8 | $T_2 = x_1z_2 + x_2z_1$ | $X_1 Z_2 + X_2 Z_1$ |
| 9 | $T_2 = b \cdot T_2$ | $b (X_1 Z_2 + X_2 Z_1)$ |
| 10 | $T_2 = T_2 \cdot z_1z_2$ | $b (X_1 Z_2 + X_2 Z_1) Z_1 Z_2$ |
| 11 | $T_2 = T_2 + T_2$ | $2b (X_1 Z_2 + X_2 Z_1) Z_1 Z_2$ |
| 12 | $T_2 = T_2 + T_2$ | $4b (X_1 Z_2 + X_2 Z_1) Z_1 Z_2$ |
| 13 | $X_3 = T_1 - T_2$ | $(X_1 X_2 - a Z_1 Z_2)^2 - 4b (X_1 Z_2 + X_2 Z_1) Z_1 Z_2$ |
| 14 | $T_1 = x_1z_2 - x_2z_1$ | $X_1 Z_2 - X_2 Z_1$ |
| 15 | $T_1 = T_1^2$ | $(X_1 Z_2 - X_2 Z_1)^2$ |
| 16 | $Z_3 = x_G \cdot T_1$ | $x_G (X_1 Z_2 - X_2 Z_1)^2$ |
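The point addition data flow of Table 2 can be traced in software as a sanity check. The sketch below is an illustration under stated assumptions, not the authors' hardware design: it reads the sixteen steps as x-only differential addition on a short Weierstrass curve $y^2 = x^3 + ax + b$ over $F_p$, where $x_G$ is the affine x-coordinate of the difference of the two input points. The function name `diff_add` and the toy curve ($p = 97$, $a = 2$, $b = 3$, $G = (3, 6)$, $2G = (80, 10)$) are chosen here for illustration only.

```python
def diff_add(X1, Z1, X2, Z2, x_g, a, b, p):
    """x-only addition of (X1:Z1) and (X2:Z2), one line per step of Table 2.

    x_g is the affine x-coordinate of the difference of the two points.
    Returns the projective pair (X3, Z3).
    """
    t1 = X1 * X2 % p            # 1:  X1*X2
    z1z2 = Z1 * Z2 % p          # 2:  Z1*Z2
    t2 = a * z1z2 % p           # 3:  a*Z1*Z2
    t1 = (t1 - t2) % p          # 4:  X1*X2 - a*Z1*Z2
    t1 = t1 * t1 % p            # 5:  squared
    x1z2 = X1 * Z2 % p          # 6:  X1*Z2
    x2z1 = X2 * Z1 % p          # 7:  X2*Z1
    t2 = (x1z2 + x2z1) % p      # 8:  X1*Z2 + X2*Z1
    t2 = b * t2 % p             # 9:  times b
    t2 = t2 * z1z2 % p          # 10: times Z1*Z2
    t2 = (t2 + t2) % p          # 11: 2b(...)
    t2 = (t2 + t2) % p          # 12: 4b(...)
    X3 = (t1 - t2) % p          # 13: X3
    t1 = (x1z2 - x2z1) % p      # 14: X1*Z2 - X2*Z1
    t1 = t1 * t1 % p            # 15: squared
    Z3 = x_g * t1 % p           # 16: Z3
    return X3, Z3


# Toy check: G = (3, 6) lies on y^2 = x^3 + 2x + 3 over F_97 and 2G = (80, 10).
p, a, b = 97, 2, 3
X3, Z3 = diff_add(80, 1, 3, 1, 3, a, b, p)   # 2G + G, difference G
x3 = X3 * pow(Z3, -1, p) % p                 # back to the affine x-coordinate
print(X3, Z3, x3)                            # 67 36 80
```

On this toy curve $G$ has order 5, so $3G = -2G$ and the recovered x-coordinate equals that of $2G$, which matches the affine chord formula.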
Table 3. Point double data flow.
| Number | Operation | Computation |
| --- | --- | --- |
| 1 | $x_1\_s = X_1^2$ | $X_1^2$ |
| 2 | $z_1\_s = Z_1^2$ | $Z_1^2$ |
| 3 | $az_1\_s = a \cdot z_1\_s$ | $a Z_1^2$ |
| 4 | $T_1 = x_1\_s - az_1\_s$ | $X_1^2 - a Z_1^2$ |
| 5 | $T_1 = T_1^2$ | $(X_1^2 - a Z_1^2)^2$ |
| 6 | $x_1z_1 = X_1 \cdot Z_1$ | $X_1 Z_1$ |
| 7 | $bz_1\_s = b \cdot z_1\_s$ | $b Z_1^2$ |
| 8 | $T_2 = x_1z_1 \cdot bz_1\_s$ | $b X_1 Z_1^3$ |
| 9 | $T_2 = T_2 + T_2$ | $2b X_1 Z_1^3$ |
| 10 | $T_2 = T_2 + T_2$ | $4b X_1 Z_1^3$ |
| 11 | $T_2 = T_2 + T_2$ | $8b X_1 Z_1^3$ |
| 12 | $X_3 = T_1 - T_2$ | $(X_1^2 - a Z_1^2)^2 - 8b X_1 Z_1^3$ |
| 13 | $T_1 = x_1\_s + az_1\_s$ | $X_1^2 + a Z_1^2$ |
| 14 | $T_1 = T_1 \cdot x_1z_1$ | $(X_1^2 + a Z_1^2) X_1 Z_1$ |
| 15 | $T_2 = bz_1\_s \cdot z_1\_s$ | $b Z_1^4$ |
| 16 | $T_1 = T_1 + T_2$ | $(X_1^2 + a Z_1^2) X_1 Z_1 + b Z_1^4$ |
| 17 | $T_2 = T_1 + T_1$ | $2 Z_1 (X_1^3 + a X_1 Z_1^2 + b Z_1^3)$ |
| 18 | $Z_2 = T_2 + T_2$ | $4 Z_1 (X_1^3 + a X_1 Z_1^2 + b Z_1^3)$ |
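The point doubling data flow of Table 3 can be traced the same way. This sketch is an illustration, not the authors' circuit. It assumes that step 6 computes the product $X_1 Z_1$ and step 7 computes $b Z_1^2$, since later steps consume exactly those two intermediates, and it assumes a short Weierstrass curve $y^2 = x^3 + ax + b$ over $F_p$. The function name `x_double` and the toy curve ($p = 97$, $a = 2$, $b = 3$, $G = (3, 6)$) are hypothetical choices for the example.

```python
def x_double(X1, Z1, a, b, p):
    """x-only doubling of (X1:Z1), one line per step of Table 3.

    Returns the projective pair (X_out, Z_out) for the doubled point.
    """
    x1_s = X1 * X1 % p           # 1:  X1^2
    z1_s = Z1 * Z1 % p           # 2:  Z1^2
    az1_s = a * z1_s % p         # 3:  a*Z1^2
    t1 = (x1_s - az1_s) % p      # 4:  X1^2 - a*Z1^2
    t1 = t1 * t1 % p             # 5:  squared
    x1z1 = X1 * Z1 % p           # 6:  X1*Z1 (assumed reading of step 6)
    bz1_s = b * z1_s % p         # 7:  b*Z1^2 (assumed reading of step 7)
    t2 = x1z1 * bz1_s % p        # 8:  b*X1*Z1^3
    t2 = (t2 + t2) % p           # 9:  2b*X1*Z1^3
    t2 = (t2 + t2) % p           # 10: 4b*X1*Z1^3
    t2 = (t2 + t2) % p           # 11: 8b*X1*Z1^3
    x_out = (t1 - t2) % p        # 12: X coordinate of the result
    t1 = (x1_s + az1_s) % p      # 13: X1^2 + a*Z1^2
    t1 = t1 * x1z1 % p           # 14: times X1*Z1
    t2 = bz1_s * z1_s % p        # 15: b*Z1^4
    t1 = (t1 + t2) % p           # 16: Z1*(X1^3 + a*X1*Z1^2 + b*Z1^3)
    t2 = (t1 + t1) % p           # 17: doubled
    z_out = (t2 + t2) % p        # 18: Z coordinate of the result
    return x_out, z_out


# Toy check: G = (3, 6) on y^2 = x^3 + 2x + 3 over F_97; affine doubling gives x(2G) = 80.
p, a, b = 97, 2, 3
X_out, Z_out = x_double(3, 1, a, b, p)
x2g = X_out * pow(Z_out, -1, p) % p
print(X_out, Z_out, x2g)         # 74 47 80
```

Only one modular inversion is needed, at the very end, which is the usual motivation for keeping intermediate points in projective form and converting coordinates once per point multiplication.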
Table 4. Resource utilization and performance of FPGA modules.
| Module | Control Mode | Operation Mode | LUT | FF | DSP | Frequency (MHz) | Clock Cycles |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Modular inversion | State machine | Serial | 3951 | 1568 | 0 | 27 | 544 |
| Modular addition and subtraction units | Pure combinational logic | Parallel | 2918 | 0 | 0 | 27 | 1 |
| Modular multiplication | State machine | Serial | 4581 | 2054 | 144 | 27 | 6 |
| Point addition | State machine | Parallel | 8137 | 5168 | 144 | 27 | 102 |
| Point double | State machine | Parallel | 8003 | 5664 | 144 | 27 | 102 |
| Point multiplication | State machine | Parallel | 20,321 | 15,505 | 288 | 27 | 26,014 |
| Coordinate conversion | State machine | Serial | 18,091 | 7618 | 288 | 27 | 1453 |
| Slave state machine | State machine | Parallel | 4181 | 4673 | 0 | 27 | - |
| Master control state machine | State machine | Parallel | 10,012 | 23,471 | 0 | 50 | - |
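The cycle counts in Table 4 can be related to the point multiplication rate of 70776 per second stated in the abstract. The arithmetic below is one reading that reproduces that figure, not a derivation given by the authors: it assumes that each point multiplication is followed by one coordinate conversion at 27 MHz, and that 72 point multiplication units run in parallel. The value 72 is an assumption taken from the $n/72$ terms in Table 5.

```python
CLOCK_HZ = 27_000_000        # 27 MHz, from Table 4
POINT_MUL_CYCLES = 26_014    # point multiplication, from Table 4
COORD_CONV_CYCLES = 1_453    # coordinate conversion, from Table 4
PARALLEL_UNITS = 72          # assumption: inferred from the n/72 terms in Table 5


def point_muls_per_second(units: int = PARALLEL_UNITS) -> float:
    """Aggregate rate if every unit does one multiplication plus one conversion."""
    cycles_per_result = POINT_MUL_CYCLES + COORD_CONV_CYCLES   # 27,467 cycles
    return units * CLOCK_HZ / cycles_per_result


single_unit = point_muls_per_second(units=1)   # roughly 983 per second
aggregate = point_muls_per_second()            # roughly 70,775.8 per second
print(round(single_unit), round(aggregate))    # 983 70776
```

Under these assumptions a single unit completes about 983 point multiplications per second, and 72 units together round to the 70776 figure.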
Table 5. Comparison of individual and batch verification costs in different aggregate signature schemes.
| Scheme | Individual Verification | Batch Verification |
| --- | --- | --- |
| Shu, H. [36] | $2PB + M + 3A$ | $2PB + 2nM + (2n+1)A$ |
| Chen, Y. [37] | $2PB + 3M + 2A$ | $2nPB + (2n+1)M + 2nA$ |
| Du, H. [38] | $3M + 3A$ | $(3n+1)M + (4n-1)A$ |
| An, T. [4] | $2PB + M$ | $3PB + nM + (2n-1)A$ |
| Zhao, Y. [35] | $4M + 3A$ | $8nM + (4n-4)A$ |
| Xie, Y. [3] | $4M + 3A$ | $(3n+1)M + (3n-1)A$ |
| Deng, L. [39] | $2PB + 2M + 3A$ | $2PB + 3nM + 3nA$ |
| Gayathri, N.B. [40] | $5M + 3A$ | $(4n+1)M + (4n-1)A$ |
| Xu, R. [6] | $PB + 2M$ | $(2n-1)PB + (3n+1)M$ |
| Non-fault-tolerant mode | $2PB + M$ | $3PB + \frac{n}{72} MP + (3n-1)A + \frac{n}{72} CoTP$ |
| Fault-tolerant mode | $2PB + M$ | $\frac{3n}{4} PB + \frac{n}{36} MP + (3n-1)A + \frac{n}{36} CoTP$ |
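To make the batch verification column of Table 5 more concrete, the sketch below evaluates only the pairing ($PB$) terms for a given batch size $n$. It is an illustration under assumptions: the coefficients are this sketch's reading of the table (for example $\frac{3n}{4} PB$ for the fault-tolerant mode and $(2n-1) PB$ for Xu, R. [6]), schemes without a pairing term are omitted, and the dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Pairing (PB) terms of the batch verification column of Table 5,
# as read in this sketch; pairing-free schemes are left out.
BATCH_PAIRINGS = {
    "Shu, H. [36]": lambda n: 2,
    "Chen, Y. [37]": lambda n: 2 * n,
    "An, T. [4]": lambda n: 3,
    "Deng, L. [39]": lambda n: 2,
    "Xu, R. [6]": lambda n: 2 * n - 1,
    "Non-fault-tolerant mode": lambda n: 3,
    "Fault-tolerant mode": lambda n: Fraction(3 * n, 4),
}


def batch_pairings(n: int) -> dict:
    """Number of pairing operations each scheme needs to verify n signatures."""
    return {scheme: count(n) for scheme, count in BATCH_PAIRINGS.items()}


counts = batch_pairings(100)
for scheme, pairings in counts.items():
    print(f"{scheme}: {pairings}")
```

For $n = 100$ this reading gives 3 pairings for the non-fault-tolerant mode, 75 for the fault-tolerant mode, and 199 for Xu, R. [6]. Pairing counts alone do not determine verification time, since the point multiplication and addition terms also contribute.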
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Zhang, B.; Li, B.; Zhang, J.; Wei, Y.; Yan, Y.; Han, H.; Zhou, Q. An Efficient SM9 Aggregate Signature Scheme for IoV Based on FPGA. Sensors 2024, 24, 6011. https://doi.org/10.3390/s24186011
