1. Introduction
In recent years, federated learning (FL), as an emerging distributed machine learning method, has achieved notable advancements and has been extensively applied in practice, including IoT, wearables, sensors, and interface improvements [1,2,3]. FL enhances privacy protection by allowing clients to learn the symmetric and asymmetric patterns in their local training data without sharing the data itself. Nevertheless, recent research has highlighted a significant concern regarding FL, namely a new threat known as the backdoor attack [4,5,6,7]. This attack embeds triggers into a small fraction of the training data, causing the poisoned model to align strongly with the patterns associated with these triggers. Attackers can then use the triggers to manipulate the behavior of the affected model on specific inputs while leaving its performance on clean data unaffected. The backdoor attack in FL was originally proposed in [8], where adversaries implant a backdoor in their local models that is then integrated into the global model during aggregation. Such attacks pose heightened risks in federated learning due to its principle of universal participation and the covert nature of its training procedure. The invisible triggers make defending against backdoor injection in FL inherently challenging [9,10,11,12].
In centralized learning scenarios, backdoor attacks typically involve data poisoning, which exhibits symmetric and asymmetric backdoor properties [13]. For instance, in the CIFAR-10 dataset used for classifying cats and dogs, an attacker might label all "blue cats" as "dogs" in the training data. The model trained on this manipulated dataset may subsequently misclassify "blue cats" as "dogs" during prediction. Unlike centralized learning, in federated learning the data are distributed among different clients, and the attacker is constrained by the decentralized nature of the training data. Thus, backdoor attacks in federated learning often involve model poisoning: attackers introduce a backdoor into the local model during the training phase, embedding a malicious pattern into the local updates [9,14,15]. When aggregated with updates from other clients, these backdoored local updates produce a new global model with embedded backdoor features in subsequent rounds of federated communication.
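To make the data poisoning example concrete, the following is a minimal sketch of label-flipping poisoning in the centralized setting; the predicate is_blue_cat and the function name are hypothetical placeholders, not part of the original attack description.

```python
# Minimal sketch of label-flipping data poisoning in a centralized setting.
# `is_blue_cat` is a hypothetical predicate selecting the attacker's chosen
# subpopulation; `dog_label` is the attacker's target class.

def poison_dataset(samples, is_blue_cat, dog_label):
    """Relabel every sample matching the attacker's predicate as the target class."""
    poisoned = []
    for image, label in samples:
        if is_blue_cat(image):
            label = dog_label  # the attacker flips the label to the target class
        poisoned.append((image, label))
    return poisoned
```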
To counteract backdoor attacks in FL, recent researchers have explored two main strategies: backdoor detection [16,17,18,19] and backdoor elimination [20,21,22]. Backdoor detection approaches aim to determine whether a global model has been compromised and further seek to pinpoint potential attackers by differentiating between normal and malicious updates. Nevertheless, the datasets contributed by different participants in FL typically exhibit Non-Independent and Identically Distributed (Non-IID) characteristics, which makes it challenging to identify malicious attackers. Furthermore, these detection approaches may conflict with the use of secure aggregation in FL [23,24,25]. Backdoor elimination techniques, on the other hand, focus on erasing backdoors. In centralized learning, methods such as fine-tuning, pruning, and machine unlearning have recently been proposed to eliminate backdoored features, whereas in federated learning robust techniques such as model pruning and knowledge distillation are employed to mitigate backdoor influence [26,27,28,29,30].
However, several problems remain with current backdoor elimination approaches in FL. First, pruning-based defenses improve robustness by selectively removing neurons that exhibit backdoor feature activation. Because pruned neurons may also carry benign information, achieving a low attack success rate through pruning typically degrades model performance. Second, knowledge distillation aims to purify the trigger in the backdoored model by using clean samples to distill away the malicious information. Unfortunately, it only mitigates the backdoor rather than completely eliminating it, and thus still fails to provide an efficient defense against backdoor attacks in FL. In short, current defense methods in FL cannot effectively reduce the backdoor success rate while maintaining high main-task accuracy.
In this paper, we propose an effective backdoor elimination method in FL based on self-attention distillation, called FLSAD. In contrast to existing methods, where pruning depends on manual experience and knowledge distillation depends on a teacher model, FLSAD requires only the model's own capacity for modification to eliminate the impact of the backdoor. A backdoor attack establishes a strong association between the backdoor trigger and the target label so as to achieve a high attack success rate on trigger-injected samples while maintaining the original accuracy on clean samples. Because accuracy on clean samples remains high, the shallow layers of the model are more likely to contain correct information. We therefore leverage self-attention distillation to eliminate the backdoor trigger. To reach this goal, two steps are carried out. First, we restore the backdoor trigger, without accessing the training dataset, so that the backdoor can subsequently be eliminated; FLSAD leverages entropy maximization to restore the trigger. Second, we use self-attention distillation, aided by the restored trigger, to eliminate the backdoor from the model.
We summarize our main contributions as follows:
We develop an effective backdoor defense framework in FL based on self-attention distillation, named FLSAD. FLSAD distills the attention knowledge of neurons to eliminate the backdoor from the model with the aid of the restored trigger.
To efficiently defend against backdoor attacks, we design an entropy maximization estimator for trigger reconstruction and then eliminate the backdoor through self-attention distillation.
We conduct extensive experiments on real-world datasets against SOTA backdoor attacks in FL. The experimental results demonstrate that FLSAD outperforms all baseline defense methods, achieving high defense accuracy while preserving model performance.
The remainder of this paper is organized as follows. Section 2 introduces the background and related work. The threat model and defense goal are discussed in Section 3. Section 4 describes the proposed self-attention distillation backdoor defense method in FL. Section 5 evaluates and analyzes the experimental results. Finally, Section 6 concludes this paper.
4. The Proposed Method
4.1. Overview
The general framework of our approach is shown in Figure 2. Our defense against the backdoor attack consists of two main stages. First, FLSAD uses an entropy maximization estimator to recover the trigger. Second, building on the recovered trigger, FLSAD eliminates the backdoor using the self-attention distillation method.
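As a high-level illustration, the two stages can be organized as in the following sketch; recover_trigger and self_attention_distill are placeholder names for the two stages described in Sections 4.2 and 4.3, not the exact implementation.

```python
def flsad_defense(global_model, val_loader, target_label):
    """Sketch of the FLSAD pipeline: recover the trigger, then erase the backdoor."""
    # Stage 1: recover an estimate of the backdoor trigger via entropy maximization.
    trigger = recover_trigger(global_model, val_loader, target_label)

    # Stage 2: retrain with self-attention distillation, using the recovered
    # trigger as an aid to expose and eliminate the backdoor behavior.
    cleaned_model = self_attention_distill(global_model, trigger, val_loader)
    return cleaned_model
```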
4.2. Trigger Recovery
In a data classification task, a normal model is trained to learn a distinct decision boundary that separates different classes of samples, aiming for minimal prediction error on correctly labeled samples. In contrast, in a backdoor attack scenario, the attacker uses a trigger to push backdoor samples across the decision boundary so that they are judged as the target class. This causes the model to misclassify backdoor samples while leaving its behavior on normal samples unaffected.
We assume that the label space of the model G is K. Let y ∈ K denote the original label of a sample S, and let the target label of the backdoor attack, y_t ∈ K, belong to this label space as well. By injecting a trigger t, the attacker can cause the prediction for S to change from y to y_t. This analysis allows us to transform the problem of recovering the backdoor trigger into the problem of discovering the trigger distribution with a generative model that does not require sampling the training data.
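In symbols, the attacker's objective can be written as follows; the stamping operator ⊕ is our shorthand for applying the trigger to a sample.

```latex
G(S) = y, \qquad G(S \oplus t) = y_{t}, \qquad y,\, y_{t} \in K,\; y \neq y_{t}
```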
To address this trigger-recovery problem, our first consideration is to leverage a generative adversarial network: a generator produces candidate trigger distributions, and a discriminator judges whether the generated triggers are genuine. However, generative adversarial networks suffer from performance degradation when estimating high-dimensional triggers. Our method therefore leverages the maximum entropy method to solve this issue.
FLSAD employs n sub-models to estimate the distribution of the trigger t, where each sub-model learns a staircase approximation of the trigger. Consider a backdoored model B with parameters θ_B. The trigger distribution identified by each sub-model is used to update that sub-model's loss function, which involves θ_B, a label y drawn from the set of all sample labels, the batch size b, the validation data V for each batch, random noise terms sampled uniformly from [0, 1], and a balancing hyperparameter.
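To make the recovery step concrete, the following is a minimal PyTorch-style sketch of an entropy-maximization trigger estimator for a single sub-model and candidate target label; the interface (recover_trigger, target_label, lam) and the exact loss form are our assumptions rather than the precise formulation above.

```python
import torch
import torch.nn.functional as F

def recover_trigger(model, val_loader, target_label, steps=200, lam=0.01, lr=0.1,
                    image_shape=(3, 32, 32), device="cpu"):
    """Sketch: optimize a pixel-space trigger t so that validation samples stamped
    with t are classified as target_label, while an entropy term keeps the
    estimated trigger distribution from collapsing to a point estimate."""
    model.eval()
    # Unconstrained logits; a sigmoid keeps trigger pixel values in [0, 1].
    t_logits = torch.randn(image_shape, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([t_logits], lr=lr)

    for _ in range(steps):
        for x, _ in val_loader:                    # V: clean validation batches of size b
            x = x.to(device)
            trigger = torch.sigmoid(t_logits)      # current trigger estimate t
            eps = torch.rand(1, device=device)     # random noise in [0, 1] (assumed role)
            x_poisoned = torch.clamp(x + eps * trigger, 0.0, 1.0)

            logits = model(x_poisoned)
            target = torch.full((x.size(0),), target_label,
                                dtype=torch.long, device=device)
            ce = F.cross_entropy(logits, target)   # push predictions toward the target label

            # Bernoulli-style entropy of the trigger distribution, to be maximized.
            p = trigger.clamp(1e-6, 1 - 1e-6)
            entropy = -(p * p.log() + (1 - p) * (1 - p).log()).mean()

            loss = ce - lam * entropy              # lam: balancing hyperparameter
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return torch.sigmoid(t_logits).detach()
```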
4.3. Self-Attention Distillation
Recovering triggers helps defenders identify whether a sample contains a trigger. However, this operation alone does not eliminate the effects of the backdoor and therefore does not provide a defense on its own. Interpretability studies on neural networks show that shallow layers mainly capture global structural information, while deep layers mainly extract fine-grained features. Existing backdoor attacks tend to place their triggers in the deeper layers of the model to keep the backdoor hidden. To eliminate hidden backdoors, we consider knowledge distillation. Conventionally, knowledge distillation instructs a student model through a teacher model. However, using the compromised deep layers of a teacher model to distill the compromised deep layers of a student model is not effective for backdoor elimination. Therefore, we leverage self-attention distillation, which uses benign shallow-layer knowledge to correct the malicious information in the deeper layers of the model.
For the backdoored model B, the activation tensor of the l-th layer of the model is denoted A^l. The channel dimension, height, and width of the corresponding attention map are denoted C_l, H_l, and W_l, respectively. In the attention-representation process, the feature tensor is reduced to two spatial dimensions, i.e., from C_l × H_l × W_l to H_l × W_l. To do so, we aggregate the values over all channels to build the mapping function F(A^l), where A_i^l denotes the i-th channel slice of A^l. Raising the exponent p places higher weight on the most strongly activated regions, while the mapping still takes the weighting of all spatial areas into account.
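For reference, a standard channel-aggregating attention mapping of this kind, following the attention-transfer literature, can be written as below; whether FLSAD adopts exactly this form is our assumption.

```latex
F_{\mathrm{sum}}^{p}\!\left(A^{l}\right) \;=\; \sum_{i=1}^{C_{l}} \left| A_{i}^{l} \right|^{p},
\qquad
F_{\mathrm{sum}}^{p} : \mathbb{R}^{C_{l} \times H_{l} \times W_{l}} \rightarrow \mathbb{R}^{H_{l} \times W_{l}}
```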
Better backdoor elimination can be achieved through attention distillation. Self-attention distillation is able to use the knowledge in the shallow layers to eliminate highly hidden backdoors in the deep layers. The loss function of this method, denoted L_SAD, measures the distance between attention maps, computed via the L2-norm over the extracted feature representations, where Q_s and Q_d indicate the feature vectors of the shallow and deep layers on the attention map, respectively. To harmonize the shallow and deep attention sizes, the method uses a bilinear sampling operation H.
Since self-attention distillation only minimizes the gap between attention maps, it pays no attention to the accuracy of the prediction results. Retraining with the self-attention distillation loss (L_SAD) alone therefore reduces the prediction accuracy on clean samples. To address this, the method adds a cross-entropy loss (L_CE), weighted against L_SAD by a hyperparameter that balances model prediction accuracy and self-attention distillation; the cross-entropy term ensures the accuracy expected on clean samples. The pseudocode for the proposed FLSAD is listed in Algorithm 1.
Algorithm 1: FLSAD Algorithm
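As a concrete illustration of the elimination stage, the following PyTorch-style sketch combines the cross-entropy term with a self-attention distillation term between a shallow and a deep layer; attention_map, flsad_loss, and the specific normalization and distance choices are our assumptions.

```python
import torch
import torch.nn.functional as F

def attention_map(activation, p=2):
    """Collapse a (B, C, H, W) activation into a (B, H*W) attention map by
    aggregating |A_i|^p over channels and L2-normalizing (assumed form)."""
    amap = activation.abs().pow(p).sum(dim=1)      # (B, H, W)
    amap = amap.flatten(1)                         # (B, H*W)
    return F.normalize(amap, p=2, dim=1)

def flsad_loss(logits, labels, shallow_act, deep_act, beta=1.0):
    """Sketch of the retraining objective L = L_CE + beta * L_SAD.

    `shallow_act` and `deep_act` are activations captured (e.g., with forward
    hooks) from the same forward pass that produced `logits`; `beta` plays the
    role of the balancing hyperparameter."""
    # Bilinear sampling H: resize the shallow map to the deep layer's spatial size.
    shallow_resized = F.interpolate(shallow_act, size=deep_act.shape[-2:],
                                    mode="bilinear", align_corners=False)

    q_shallow = attention_map(shallow_resized).detach()  # benign shallow knowledge (guide)
    q_deep = attention_map(deep_act)                     # deep layers being corrected
    l_sad = F.mse_loss(q_deep, q_shallow)                # distance between attention maps

    l_ce = F.cross_entropy(logits, labels)               # preserve clean-sample accuracy
    return l_ce + beta * l_sad
```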
5. Experimental Evaluation
In this section, we validate our defense approach on different real-world datasets to evaluate the performance of FLSAD against different attack methods. First, we describe the datasets and experimental setup, including the training settings, evaluation metrics, and baseline methods. Then, we compare FLSAD with existing defense methods to demonstrate its superiority. Finally, we conduct ablation experiments to further analyze our method.
5.1. Datasets and Experimental Setup
5.1.1. Datasets
To validate the effectiveness of our method, we perform extensive verification on four real datasets, including MNIST (http://yann.lecun.com/exdb/mnist/ (accessed on 21 July 2024)), Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist (accessed on 21 May 2022)), CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 6 June 2020)), and CIFAR-100 (https://www.kaggle.com/datasets/fedesoriano/cifar100 (accessed on 10 May 2020)). Both the MNIST and Fashion-MNIST datasets contain 60,000 training samples and 10,000 test samples across 10 classes. Although they contain the same amount of data, the two datasets differ in content: MNIST consists of handwritten digits, while Fashion-MNIST consists of clothing objects. We leverage these datasets for federated learning model training, and the details of the datasets are shown in Table 1.
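Assuming a PyTorch/torchvision pipeline (the training framework is not stated above), the four benchmark datasets can be loaded as follows before being partitioned across the FL clients.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Training splits of the four benchmark datasets used in the experiments.
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
fashion_mnist = datasets.FashionMNIST(root="./data", train=True, download=True, transform=to_tensor)
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100 = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)
```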
5.1.2. Experimental Setup
Training setting. We set up federated learning with 100 clients participating in model training, of which 70 are normal clients and 30 are malicious clients. Each attacker injects an identical trigger into its backdoor samples so that the model predicts the target label for them. Local training uses a batch size of 64. The local epoch count of malicious clients is set to 20 with a model learning rate of 0.05, while normal clients use 10 local epochs and a learning rate of 0.1. This setup is used for the different types of backdoor attacks, and all of our experiments follow this standard.
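The configuration above can be summarized in a small settings object; the variable names are ours and serve only as a compact restatement of the reported values.

```python
# Federated learning experimental configuration (values as reported above).
FL_CONFIG = {
    "num_clients": 100,
    "num_benign_clients": 70,
    "num_malicious_clients": 30,
    "batch_size": 64,
    "benign_client": {"local_epochs": 10, "learning_rate": 0.1},
    "malicious_client": {"local_epochs": 20, "learning_rate": 0.05},
    # Every malicious client stamps the same trigger on its backdoor samples
    # and relabels them with the attacker's target label.
}
```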
Evaluation metrics. We leverage the attack success rate (ASR) and model accuracy (ACC) metrics to evaluate our approach. The attack success rate indicates how often the attacker's backdoor succeeds, and the purpose of our defense is to reduce it effectively. The model accuracy reflects the accuracy of the model's main task, i.e., the prediction results of the backdoored model on clean samples. The purpose of our method is to ensure that the main task is not affected by the backdoor attack.
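As a concrete reference, the two metrics can be computed as in the following sketch; the loader names and the assumption that triggered_loader yields already-stamped samples are ours.

```python
import torch

@torch.no_grad()
def evaluate_asr_acc(model, clean_loader, triggered_loader, target_label):
    """ACC: accuracy on clean samples. ASR: fraction of trigger-stamped samples
    (whose true label differs from the target) predicted as the target label."""
    model.eval()

    correct, total = 0, 0
    for x, y in clean_loader:
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    acc = correct / total

    hits, count = 0, 0
    for x, y in triggered_loader:           # samples already stamped with the trigger
        keep = y != target_label            # ignore samples already of the target class
        if keep.sum() == 0:
            continue
        pred = model(x[keep]).argmax(dim=1)
        hits += (pred == target_label).sum().item()
        count += int(keep.sum().item())
    asr = hits / count if count else 0.0

    return asr, acc
```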
Baseline methods. We select three state-of-the-art backdoor attacks in federated learning to measure the effectiveness of our defense approach: watermarking [51], pixel block [52], and random noise [53]. To exhibit the superiority of our defense, we compare it against four backdoor defense methods in FL, namely FoolsGold [16], Baffle [18], CONTRA [22], and RLR [20].
5.2. Comparison with Baseline Defenses
To demonstrate that FLSAD can deal with various backdoor attacks in a realistic federated learning scenario, and to show the superiority of our method, we evaluate the defense against three state-of-the-art federated learning backdoor attacks on four real-world datasets and compare it with four baseline defense methods. Table 2, Table 3, Table 4 and Table 5 exhibit the experimental results of FLSAD compared with the baseline defense methods.
On the MNIST dataset, the experimental results show that FLSAD can effectively reduce the attack success rate of backdoor attacks while guaranteeing the performance of the model's main task. The ASR of the undefended backdoor attack is about 90%, whereas the ASR after applying our defense is below 2% in all cases. Meanwhile, the accuracy of the defended models is above 93% in all cases, so model performance is not affected. In addition, we set up experiments with different trigger sizes to study their impact on backdoor defense. A larger trigger indicates a stronger backdoor injected by the attacker, which can also raise the success rate of the backdoor attack. From the results, we find that a larger trigger size reduces the effectiveness of our defense.
FLSAD shows similar defense performance on the Fashion-MNIST and MNIST datasets. However, the model achieves lower main-task accuracy on the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, model accuracy is only about 84% and the attack success rate is only about 60% in the no-defense case. With the FLSAD method, the main task reaches 88% accuracy, improving model performance, while the attack success rate is reduced to about 0.9%. For CIFAR-100, the attack success rate is only about 58% in the no-defense case across the three backdoor attacks; in particular, it is only 54.79% for the pixel block backdoor attack with a 3 × 3 trigger, and model accuracy is only about 72%. With our defense, 73% model accuracy is achieved without degrading model performance. Overall, FLSAD outperforms all baseline defense methods, effectively reducing the attack success rate while maintaining the accuracy of the primary task.
5.3. Computational Costs
To assess the efficiency of FLSAD, we compare its computational costs with those of the baselines in Table 6. Since our method employs a self-distillation approach, its computational complexity is relatively high: it is lower than that of RLR but higher than those of FoolsGold, Baffle, and CONTRA. However, the objective of our method is to defend effectively against backdoor attacks, and in that respect it outperforms all baseline methods across the different datasets, as seen in Table 2, Table 3, Table 4 and Table 5. In addition, Table 6 and Table 7 present a comprehensive analysis of the computation epochs necessary to achieve a target accuracy with different numbers of local epochs E. Increasing E leads to higher computation costs. For example, E = 6 requires 230 computation epochs to reach 80% accuracy, whereas E = 30 requires 300 computation epochs. These findings highlight that convergence to high accuracy within few aggregation epochs is not guaranteed.
5.4. Ablation Study
Our method first recovers the trigger and then utilizes the recovered trigger for self-attention distillation, thereby eliminating the backdoor. To further verify the individual effects of trigger recovery and self-attention on eliminating the backdoor, we conduct separate ablation experiments. The results of the ablation experiments on the MNIST and CIFAR-10 datasets are shown in Figure 3 and Figure 4, where the variant without the trigger-recovery component is called NO-RT and the variant without the self-attention component is called NO-SA.
5.4.1. Impact of Trigger Recovery
The recovered trigger helps localize where the distillation should focus, enabling better distillation of backdoor features in FL. In Figure 3, we see that the NO-RT method reduces the attack success rate to less than 45% while model accuracy reaches 76%. Under the watermarking backdoor attack, NO-RT reduces the attack success rate to 40.53%, but the gap to the full FLSAD method remains significant: FLSAD achieves an ASR of 1.87%, which is 38.66% lower than NO-RT. This demonstrates the importance of trigger recovery for the task.
5.4.2. Impact of Self-Attention
FLSAD leverages the self-attention method to precisely locate the positions containing backdoor information in the deeper layers of the model, allowing more effective corrections based on the shallow-layer information. On the CIFAR-10 dataset, the ASR of the NO-SA method can only be reduced to below 35%, and model accuracy drops by about 15%. In particular, model accuracy drops to only 69.47% under the random noise attack, an 18.15% reduction compared to FLSAD. Therefore, the self-attention method is essential for defending against backdoor attacks in FL.