6.1. Experimental Environment Configuration
The experimental hardware comprised an Intel Core i7-8700K processor (Intel, Santa Clara, CA, USA), a GTX 1080 Ti GPU, and 16.0 GB of RAM, running Ubuntu 20.04 LTS. We used FISCO BCOS, an open-source consortium blockchain platform, as the framework; its interface allows transactions and smart contracts to be examined. The underlying blockchain and consensus process were implemented in Python (v3.6.3), with smart contracts written in C++. The vehicular federated learning environment was built with Python (v3.6.3) and TensorFlow (v2.4.2) for local training and global aggregation. Data collection and blockchain updates were simulated using OMNeT++ (v5.5.1) integrated with Simulation of Urban Mobility (SUMO, v1.3.0). To simulate the communication among system nodes, we adopted the Krauss model [43]. During simulation, nodes moved at speeds from 20 to 100 km/h on bidirectional roads within a simulated area of
(1) Experimental Model: We adopted the VGG16 image classification model as the global model for federated learning. The VGG16 model consists of 16 weight layers in total, with the first 13 being convolutional layers and the last 3 being fully connected layers. In the vehicular federated learning environment, the parameters of the VGG16 model were set as follows: learning rate of dropout rate of 0.2, and each vehicle performed local training with a batch size of 64, while the RSU performed verification with a batch size of 32.
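As a quick check on the layer count, the standard VGG16 layout can be tallied in a few lines. This is an illustrative sketch only: the list `VGG16_CONV_CFG` follows the widely used VGG16 configuration (integers are convolution output channels, "M" marks max pooling), and the names `count_weight_layers` and `TRAINING_CONFIG` are ours rather than part of the experimental code.

```python
# Standard VGG16 layout: integers are conv output channels, "M" is max pooling.
VGG16_CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
NUM_FC_LAYERS = 3  # fully connected layers in the classifier head

# Hyperparameters stated in the text (the learning rate is not given here).
TRAINING_CONFIG = {
    "dropout_rate": 0.2,
    "vehicle_batch_size": 64,  # local training on each vehicle
    "rsu_batch_size": 32,      # verification on the RSU
}


def count_weight_layers(conv_cfg, num_fc):
    """Return (conv_layers, fc_layers, total) for a VGG-style configuration."""
    conv_layers = sum(1 for item in conv_cfg if item != "M")
    return conv_layers, num_fc, conv_layers + num_fc


conv, fc, total = count_weight_layers(VGG16_CONV_CFG, NUM_FC_LAYERS)
print(conv, fc, total)
```

Pooling layers carry no trainable weights, so they are excluded from the count.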
(2) Datasets: The MNIST dataset [44] consists of 60,000 training samples and 10,000 test samples, each of which is a 28 × 28 grayscale image. This dataset contains 10 classes, representing handwritten digits from 0 to 9, paired with their corresponding labels. The CIFAR-10 dataset [45] is a collection of color images comprising 50,000 training samples and 10,000 test samples. Each sample is a 32 × 32 RGB image, and the dataset covers 10 classes of common objects, such as “aircraft”, “dog”, and “car”. These two datasets represent the medium-complexity data collected by in-vehicle local devices and simulate the image information obtained in real time during driving. They serve as benchmark test data for various FL algorithms designed for mobile edge scenarios.
(3) Node Configuration: As shown in Table 3, we used a configuration of 40 system nodes to simulate data sharing in the IoV. This scale approximates the complexity of the IoV and allows the performance of data-sharing schemes to be assessed under larger deployments. Of these nodes, 20% are RSUs and the remaining 80% are vehicle nodes. This distribution reflects the relative prevalence of RSUs and vehicles in practice, which is a crucial factor in assessing data sharing and model aggregation among different node types. Moreover, the larger number of vehicles yields more frequent local updates, thereby enhancing the reliability and accuracy of the aggregation algorithm. To ensure fairness, equivalent parameter values are configured for the nodes in all schemes.
(4) Comparative Schemes:
FL [40]: This centralized federated learning approach lacks validation and protection of intermediate parameter updates. It relies on a central server to distribute and aggregate the global model. The accuracy of the global model in this scheme serves as the baseline.
BPFL [41]: This scheme replaces the central aggregator with a consortium chain in the traditional IoV system and applies homomorphic encryption (HE) to protect gradients. Its global training effect is comparable to that of centralized FL. The scheme enhances the Multi-Krum algorithm to implement gradient detection and filtering, which aligns with our approach. We chose the BPFL as the baseline for comparison with the secret sharing proposed in this paper, allowing us to analyze the computational overhead and global accuracy associated with the encryption and decryption processes.
Madi’s FL [42]: This scheme extends the concept of centralized FL and incorporates homomorphic encryption (HE) and verifiable computing (VC) technology, ensuring both privacy protection and verifiability. The experiments with the Fashion MNIST demonstrate the effective accuracy guarantee of the global model.
6.2. Experimental Results
(1) Model Accuracy: We evaluated the impact of our scheme on the model’s accuracy. The average accuracy of each scheme was then calculated, and the experimental results are illustrated in Figure 5.
On the MNIST dataset, the four schemes were trained for 50 rounds, with the BPFL achieving the highest accuracy of 95.56%. Although our solution did not achieve the highest accuracy, it was comparable to the BPFL. Notably, our solution displayed a consistently stable accuracy growth curve, resulting in the best average accuracy. For the CIFAR-10 dataset, all schemes were trained for 100 rounds, converging around the 50th epoch. Both our solution and the BPFL scheme achieved the highest global model accuracy, with a mere 2.32% disparity.
With an increase in the number of training epochs, the accuracy of the global model steadily improves across all four schemes. Notably, FL and Madi’s FL schemes exhibit a consistent upward trend in global model accuracy, whereas the accuracy of the global model in the two blockchain-based federated learning schemes fluctuates during the training process. Moreover, during the initial stages of model training, the blockchain-based schemes achieve higher accuracy for the global model. This can be attributed to the verification of global model accuracy in FL and Madi’s FL schemes being conducted by a central aggregator, which utilizes a more concentrated dataset distribution for accuracy evaluation, resulting in a stable upward trend. Both the BPFL scheme and our scheme select intermediate gradients with higher accuracy for aggregation through validation nodes and models, resulting in higher accuracy for the global model in the early stages of training. The validation process for the global model in the BPFL scheme and our scheme involves utilizing local datasets from committee members or validation nodes. This leads to variations in the dataset used for each validation process, resulting in fluctuations in the accuracy of the global model observed in the experimental results.
In conclusion, the BPFL scheme achieves the highest accuracy by conducting a hash function verification and setting parameter retention ratios after completing the training process. Our proposed scheme achieves comparable accuracy to the BPFL but with smoother growth. Before training, we employ a vehicle node selection process based on similarity, which results in optimal average accuracy. As a result, our proposed scheme does not compromise the accuracy of the global model and meets the fundamental requirements of the framework design.
(2) Evaluation of Average Training Time: To evaluate the model training performance of our proposed scheme, we measured the time required for each round, using the MNIST dataset for 50 rounds of training and the CIFAR-10 dataset for 100 rounds. The average time spent in each training round was then calculated, with the running time measured as the duration taken by a single entity to complete its computational task. Both the RSU and vehicles executed computational work in parallel. The experimental results are illustrated in Figure 6.
Figure 6 reveals minimal differences among the four schemes on the MNIST dataset. Specifically, the average training time of our scheme was 1.3 s (5.18%) higher than that of the original FL, 1.1 s (4%) lower than that of the BPFL, and 2.3 s (8.01%) lower than that of Madi’s FL. For the CIFAR-10 dataset, the training time of our scheme was 3.6 s (17.40%) higher than that of the original FL, but 6.9 s (22.12%) lower than that of the BPFL and 8.6 s (26.14%) lower than that of Madi’s FL. The time consumption gap between our proposed scheme and the original FL thus remained below 20%. The overall time consumption was higher than that of the original FL, which may be attributed to the secret sharing and malicious model filtering presented in this paper. Nevertheless, the training time of our scheme was significantly lower than that of both the BPFL and Madi’s FL, which employ homomorphic encryption, with the difference reaching up to 26%.
Our research adopted secret sharing over homomorphic encryption, providing a more efficient alternative that reduces computing overhead while ensuring the privacy of the global model. Therefore, our scheme is suited for the IoV scenarios characterized by limited computing power and constrained network communication.
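To illustrate why secret sharing is computationally light, the following minimal sketch uses additive secret sharing over a prime field, where aggregation requires only modular additions. It rests on assumptions made purely for illustration: this section does not specify the sharing scheme, and the modulus `PRIME`, the fixed-point factor `SCALE`, and all function names are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical field modulus (a Mersenne prime)
SCALE = 10**6      # hypothetical fixed-point scale for real-valued gradients


def encode(value):
    """Map a float to a field element using fixed-point encoding."""
    return round(value * SCALE) % PRIME


def decode(element):
    """Map a field element back to a float; large elements encode negatives."""
    if element > PRIME // 2:
        element -= PRIME
    return element / SCALE


def share(value, num_shares):
    """Split one value into additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(num_shares - 1)]
    last = (encode(value) - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the original (or aggregated) value."""
    return decode(sum(shares) % PRIME)


# Three vehicles each split one gradient coordinate among three aggregators.
gradients = [0.25, -0.5, 1.0]
all_shares = [share(g, 3) for g in gradients]
# Each aggregator adds the shares it holds; no single share reveals a gradient.
partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
aggregate = reconstruct(partial_sums)
print(aggregate)
```

In this sketch the aggregators only ever add field elements, which is the source of the efficiency advantage over ciphertext arithmetic in homomorphic encryption.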
(3) Evaluating Blockchain Consensus: To evaluate the performance of our consensus mechanism, we first compared the dBFT with several mainstream mechanisms and then further validated the advantages of our proposed mechanism, as shown in Table 4 and Figure 7.
In terms of fairness, the dBFT selects leader nodes randomly, which is more equitable than the PoS, DPoS, and traditional PBFT, where leader nodes are selected sequentially. Random selection gives every node an equal chance of being chosen. Fairness plays a crucial role for the nodes participating in the training process discussed in this paper: each RSU can be chosen as a miner node with accounting rights, ensuring that its computing resources are put to meaningful use. In terms of computational consumption, unlike PoW, the dBFT allows nodes on the blockchain to obtain accounting rights without requiring a significant amount of computational resources. Selecting miner nodes to achieve consensus therefore results in higher block generation efficiency. Once consensus is reached, the block becomes irreversible and there is no possibility of forks. Compared to the PBFT, the dBFT does not require verification from all nodes on the blockchain, leading to a shorter communication time.
Considering that the Byzantine fault tolerance rate of the dBFT is one-third, the miner nodes are categorized as positive miners and negative miners. A new block can be successfully verified and recorded on the blockchain only when the number of negative miners stays within this one-third bound relative to the total number of miner nodes. Formula (16) defines the security probability of the miner node pool as a function of the probability that a miner becomes a negative miner. In this experiment, we set this probability to 0.1, 0.2, and 0.3, respectively, as shown in Figure 7.
The security probability of the dBFT process decreases as the probability of becoming a negative miner increases. Additionally, as the size of the miner node pool increases, the security of the system also increases, because the number of positive miners participating in block validation grows with the pool size. When a sufficient number of miner nodes participate in the dBFT, reliable validation of newly generated blocks is ensured.
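The trend described above can be reproduced with a simple binomial model. The sketch below is an assumption, not necessarily the form of Formula (16): it supposes that each of `n` miners independently becomes negative with probability `p`, and that a block is verified when the number of negative miners does not exceed ⌊(n − 1)/3⌋. The symbols `n` and `p` and the function name are introduced here for illustration only.

```python
from math import comb


def security_probability(n, p):
    """Probability that at most floor((n - 1) / 3) of n miners are negative.

    Assumes each miner is negative independently with probability p
    (an illustrative binomial model).
    """
    max_negative = (n - 1) // 3  # one-third Byzantine fault tolerance bound
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(max_negative + 1)
    )


# Security probability for the three values of p used in the experiment.
for p in (0.1, 0.2, 0.3):
    row = [round(security_probability(n, p), 4) for n in (10, 20, 40)]
    print(p, row)
```

Under this model, a larger `p` lowers the probability, while a larger pool generally raises it for a fixed `p` below one-third, which matches the qualitative behaviour reported for Figure 7.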
(4) Validation of the Vehicle Selection Algorithm: To validate the efficacy of the vehicle selection algorithm, we conducted experiments by varying the number of selected vehicle training nodes. We then applied the algorithm to both the MNIST and CIFAR-10 datasets for testing purposes, with a comparison of the model accuracy presented in Figure 8.
We carried out 50 rounds of training on the MNIST dataset, as depicted in Figure 8a. Whether or not the vehicle selection algorithm was used, the global model eventually achieved a stable accuracy; however, the number of training rounds needed to reach it varied. The algorithm-based selection of vehicles, which uses the Euclidean distance, demonstrated high accuracy from the beginning, with the model converging within approximately 20 rounds of training. Conversely, randomly selected vehicles showed extremely low initial accuracy and took longer to reach maximum accuracy, typically requiring about 40 rounds.
Then, we performed 100 rounds of training on the CIFAR-10 dataset, as depicted in Figure 8b. With the algorithm-based selection of vehicles, stable convergence was achieved within 35 rounds. In contrast, randomly selected nodes did not lead to stable convergence even after approximately 65 rounds. Figure 8b also shows that a smaller number of training nodes selected by the algorithm results in faster convergence of the accuracy rate, which can be attributed to the stronger correlation between each selected vehicle and the task issuer. Hence, the results of the two comparative experiments demonstrate that this selection algorithm significantly enhances the speed and effectiveness of training.
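The idea of distance-based selection can be sketched as follows. This is a minimal illustration, not the selection algorithm itself: it assumes that the task issuer and each vehicle are summarized by a feature vector (for example, a label distribution), and the function `select_vehicles` and the sample vectors are hypothetical.

```python
from math import dist


def select_vehicles(task_vector, vehicle_vectors, k):
    """Return the indices of the k vehicles closest to the task issuer.

    Closeness is the Euclidean distance between feature vectors.
    """
    distances = [dist(task_vector, v) for v in vehicle_vectors]
    ranked = sorted(range(len(vehicle_vectors)), key=lambda i: distances[i])
    return ranked[:k]


# Hypothetical label-distribution vectors for a task issuer and four vehicles.
task = [0.5, 0.5, 0.0]
vehicles = [
    [0.5, 0.4, 0.1],  # similar to the task issuer
    [0.0, 0.1, 0.9],  # very different
    [0.7, 0.3, 0.0],  # fairly similar
    [0.1, 0.1, 0.8],  # very different
]
selected = select_vehicles(task, vehicles, k=2)
print(selected)
```

Choosing a smaller `k` keeps only the most similar vehicles, which is consistent with the faster convergence observed for fewer selected nodes.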
(5) Evaluating Resistance to Poisoning Attacks: To evaluate the robustness of our scheme against poisoning attacks, we conducted experiments using the label inversion attack in [46] to generate poisoning samples. The labels of the training samples were altered while their original features were kept unchanged, with poison sample ratios set at 10%, 20%, and 30%, respectively. These manipulated samples were then assigned to designated attackers, and the outcomes were compared against the FL [40] without any poison samples. We then conducted a comparative analysis between our approach and three advanced algorithms specifically designed to mitigate Byzantine attacks.
The experimental results are presented in Figure 9 and Figure 10. For the MNIST dataset, the source label “1” was modified to the target label “8”, and the source label “2” was modified to the target label “4”. For the CIFAR-10 dataset, the source label “dog” was modified to the target label “horse”. To minimize the impact of irrelevant labels, binary classifiers were trained exclusively on samples carrying the source or target label. Additionally, 500 test samples with the source label were selected at random to determine the success rate of the attack, defined as the percentage of these source-label samples that were predicted as the target label. Subsequently, we partitioned the experimental dataset randomly into local datasets for each vehicle.
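The label manipulation and the success-rate metric described above can be sketched as follows. The sketch assumes that the poison ratio applies to the samples carrying the source label, which is one possible reading of the setup; the function names and the `seed` parameter are hypothetical.

```python
import random


def flip_labels(labels, source, target, ratio, seed=0):
    """Relabel a fraction `ratio` of the `source` samples as `target`.

    Features are left untouched; only the labels change.
    """
    rng = random.Random(seed)
    source_idx = [i for i, y in enumerate(labels) if y == source]
    num_poison = round(len(source_idx) * ratio)
    poisoned = list(labels)
    for i in rng.sample(source_idx, num_poison):
        poisoned[i] = target
    return poisoned


def attack_success_rate(predictions, true_labels, source, target):
    """Share of source-label test samples that are predicted as target."""
    source_preds = [p for p, y in zip(predictions, true_labels) if y == source]
    if not source_preds:
        return 0.0
    return sum(p == target for p in source_preds) / len(source_preds)


# MNIST-style example: flip 30% of the "1" samples to "8".
labels = [1] * 10 + [8] * 10
poisoned = flip_labels(labels, source=1, target=8, ratio=0.3)
print(poisoned.count(1), poisoned.count(8))
```

The success rate is then computed on held-out source-label samples, for example `attack_success_rate(model_predictions, test_labels, 1, 8)`, where `model_predictions` would come from the trained binary classifier.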
As shown in Figure 9, when the poisoning sample ratio was 20%, the error rate of our approach on both datasets approached 0.1, gradually converging to the level of the FL with no poisoning samples. When the poisoning sample ratio increased to 30%, the error rate of our approach stabilized below 0.16 in subsequent iterations. These findings suggest that our approach can effectively mitigate poisoning attacks with a poisoning ratio of up to 30%.
In order to further validate the performance of our proposed solution, we conducted a comparative analysis by assuming an optimal scenario and comparing it against three other advanced algorithms designed to mitigate Byzantine attacks.
Optimal Scenario: Although it is impractical in real-world scenarios to identify malicious vehicles in advance, we assume here that the RSU has prior knowledge of these malicious nodes. This enables the RSU to filter out updates uploaded by malicious vehicles, ensuring uninterrupted model training and yielding the best achievable outcome. The optimal scenario serves as the reference for assessing the accuracy of our solution in mitigating poisoning attacks.
Krum [36]: This algorithm is used in distributed machine learning to evaluate the similarity of gradient vectors based on the Euclidean distance. It discards vectors that deviate significantly from the others, thereby eliminating harmful updates. The calculation steps are as follows:
Calculate the Euclidean distance: calculate the Euclidean distance between each user-uploaded gradient and the gradients of the other users.
Select the minimum distances: for each user i, identify the set of n−2f−2 vectors from other users that are closest to its own vector, where f represents the number of Byzantine nodes, and record the corresponding set of minimum distances.
Calculate the score: evaluate the quality score of each gradient vector by summing the distances in its corresponding set of minimum distances.
Choose the best gradient: select the gradient vector with the lowest quality score as the legitimate update for aggregation.
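The four steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the neighbour count defaults to n − 2f − 2 as stated in the text and can be overridden through the hypothetical parameter `num_neighbors` (the original Krum formulation sums squared distances over the n − f − 2 closest vectors). The sample gradients are invented for the example.

```python
from math import dist


def krum_scores(gradients, f, num_neighbors=None):
    """Score each gradient by the summed distance to its closest neighbours."""
    n = len(gradients)
    if num_neighbors is None:
        num_neighbors = n - 2 * f - 2  # neighbour count as stated in the text
    if num_neighbors < 1:
        raise ValueError("too few gradients for the given number of Byzantine nodes")
    scores = []
    for i, g_i in enumerate(gradients):
        # Step 1: Euclidean distances from gradient i to all other gradients.
        distances = sorted(
            dist(g_i, g_j) for j, g_j in enumerate(gradients) if j != i
        )
        # Steps 2 and 3: keep the smallest distances and sum them.
        scores.append(sum(distances[:num_neighbors]))
    return scores


def krum(gradients, f, num_neighbors=None):
    """Step 4: return the index and value of the lowest-scoring gradient."""
    scores = krum_scores(gradients, f, num_neighbors)
    best = min(range(len(gradients)), key=lambda i: scores[i])
    return best, gradients[best]


# Five honest gradients clustered together and one outlier (f = 1).
gradients = [
    [1.0, 1.0], [1.2, 1.0], [1.0, 0.7],
    [0.6, 1.0], [1.0, 1.5], [10.0, -10.0],
]
best_index, best_gradient = krum(gradients, f=1)
print(best_index, best_gradient)
```

The outlier receives by far the largest score and is therefore never selected, while the gradient nearest the centre of the honest cluster is chosen.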
Multi-Krum [37]: The Multi-Krum algorithm executes the Krum algorithm multiple times, with the calculation steps as follows:
Consider n users, where f is the number of Byzantine nodes (satisfying n−2f−2 > 0).
For each user i, perform the Krum to compute the score of its gradient vector.
Choose the gradient with the lowest score as legitimate for aggregation.
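A common way to realize Multi-Krum is to keep the m lowest-scoring gradients and average them, with m = 1 reducing to the single-gradient selection described above. The sketch below follows that reading; the parameter `m`, the function name, and the sample gradients are assumptions for illustration, and scores are computed as summed Euclidean distances to the n − 2f − 2 closest vectors, as in the text.

```python
from math import dist


def multi_krum(gradients, f, m):
    """Average the m gradients with the lowest Krum scores.

    Returns the sorted indices of the selected gradients and their mean.
    """
    n = len(gradients)
    num_neighbors = n - 2 * f - 2
    if num_neighbors < 1:
        raise ValueError("too few gradients for the given number of Byzantine nodes")
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and the number of gradients")

    # Krum score of each gradient: summed distance to its closest neighbours.
    scores = []
    for i, g_i in enumerate(gradients):
        distances = sorted(
            dist(g_i, g_j) for j, g_j in enumerate(gradients) if j != i
        )
        scores.append(sum(distances[:num_neighbors]))

    # Keep the m lowest-scoring gradients and average them per dimension.
    selected = sorted(range(n), key=lambda i: scores[i])[:m]
    dim = len(gradients[0])
    aggregate = [sum(gradients[i][d] for i in selected) / m for d in range(dim)]
    return sorted(selected), aggregate


# Five honest gradients and one outlier (f = 1); keep the best four.
gradients = [
    [1.0, 1.0], [1.2, 1.0], [1.0, 0.7],
    [0.6, 1.0], [1.0, 1.5], [10.0, -10.0],
]
selected, aggregate = multi_krum(gradients, f=1, m=4)
print(selected, aggregate)
```

Averaging several accepted gradients retains more information from honest participants than selecting a single one, which is the usual motivation for preferring Multi-Krum over Krum.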
Median: The dimensional median algorithm calculates the global gradient by computing the median along each dimension.
Here, ΔGMj represents the j-th dimension of the global model ΔGM, that is, the median value of the j-th dimension of the local updates contributed by all nodes, and the function median{} calculates the median value of a set or sequence of numbers.
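The coordinate-wise median can be expressed compactly with the Python standard library. The following sketch is illustrative; the function name `coordinate_median` and the sample updates are hypothetical.

```python
from statistics import median


def coordinate_median(local_updates):
    """Return the per-dimension median of a list of equal-length updates."""
    # zip(*local_updates) groups the j-th entries of all updates together.
    return [median(column) for column in zip(*local_updates)]


# Four benign local updates and one extreme (possibly malicious) update.
local_updates = [
    [0.10, 0.20, -0.30],
    [0.20, 0.10, -0.20],
    [0.15, 0.30, -0.25],
    [0.12, 0.25, -0.28],
    [9.00, -9.00, 9.00],
]
global_update = coordinate_median(local_updates)
print(global_update)
```

Because each dimension is handled independently, a single extreme update cannot pull the aggregate far from the benign values, although the result may not correspond to any one participant's update.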
Additionally, we conducted a comparative analysis of our approach with three other advanced algorithms designed to mitigate Byzantine attacks, using the optimal scenario as the baseline. In Figure 10a, we examined a scenario without any Byzantine nodes, in which each vehicle consistently shared accurate updates with the RSU in every iteration. Both our approach and Multi-Krum showed promising aggregation performance, attaining an accuracy close to 96% and gradually converging toward near-optimal levels. In contrast, the Krum and Median exhibited poor performance, deviating significantly from the optimal level. The experimental results under the inversion attack are presented in Figure 10b, revealing that our approach and the Multi-Krum showed robust aggregation performance, maintaining an accuracy of over 91% and approaching the optimal accuracy, while the Krum and Median performed poorly, achieving an accuracy of approximately 84%.
In summary, our solution exhibits exceptional resilience when faced with poisoning attacks at a rate of 30%. Our approach maintains a high level of accuracy when compared to the optimal algorithm and three other advanced Byzantine attack mitigation techniques. Consequently, our solution successfully eliminates harmful parameter updates while meeting the fundamental requirements of the framework design.
(6) Analysis of Computational Complexity: We compared our approach with the three advanced algorithms to evaluate its computational complexity in combating poisoning attacks.
Table 5 presents the parameters (Params), floating-point operations (FLOPs), and inference time for each algorithm.
Compared to the Multi-Krum algorithm, our algorithm exhibited lower computational complexity and faster inference speed. This is attributed to the vehicle similarity calculation introduced at the initial stage, which uses Euclidean distance filtering to select vehicle nodes highly compatible with the data provided by the task issuer. Additionally, we implemented an evaluation mechanism for the RSU contribution and vehicle credibility. The RSU contribution is calculated from the precision deviation between an RSU and the others, indicating the quality advantage of this RSU node relative to the others and further ensuring the reliability of collaborative training. Combined with the enhanced consensus mechanism and on-chain storage, this results in faster convergence of our algorithm. Compared to the original Krum algorithm, our algorithm has slightly more parameters; however, Krum may misjudge when only one anomalous gradient is present, leading it to discard updates from honest clients. Therefore, our algorithm performs better overall than the Krum and Multi-Krum algorithms. The Median algorithm achieves a faster inference speed because it relies solely on taking the median along each dimension, but it suffers from lower overall model accuracy. In summary, our algorithm exhibits not only strong overall performance but also a relatively fast inference speed, making it more suitable for highly dynamic IoV systems with unstable communication.