In this section, first, we derive gaps in the research field from the mapping results. Second, we highlight several issues in the assumptions made in some papers. Third, we identify fallacies in the evaluation of the attacks and discuss their implications.
5.1. Main Research Gaps
We base our discussion here on the results of
Section 4. In addition, we are looking at how the papers are distributed over pairs of categories by the means of bubble charts, as shown for example in
Figure 10, where we show how attacks with specific purposes are distributed with regard to the access point. It should be noted that the categories in some classification schemes are not disjoint; therefore, the total number of publications may sum up to more than 48.
Gap 1. Little research is conducted about attacks on the server side and by eavesdroppers.
Description. Figure 10 illustrates that membership, property inference, model corruption, and backdoor attacks are rarely studied on the server side or with an eavesdropper adversary. This might be due to two reasons. First, it is widely assumed in the literature that FL is coordinated by a trusted server. Second, approaches that protect against curious servers and eavesdroppers, such as secure aggregation [
45], were proposed and widely adopted by the research community because of the firm protection guarantees they achieve. However, applying such approaches still incurs nonnegligible overhead [
99], despite the improvements, which leaves open questions about their efficiency in real-world applications.
Implications. Servers (service providers) are typically assumed to be better equipped to repel attacks than clients. However, numerous incidents in recent years have shown that providers are subject to many successful attacks in which users’ data are breached. Therefore, it is of high importance to study how attacks by a curious or compromised server can impact the FL process. We argue that attacks on the server side are becoming even more relevant in FL, especially considering the emergence of FL deployments in different architectures, such as hierarchies in edge networks [
100,
101]. In such environments, multiple entities play the role of intermediate servers, i.e., collect and aggregate the updates from clients, thereby introducing more server-type access points.
For eavesdroppers, recent model inversion attacks on gradients have been shown to successfully reconstruct user training data [
19,
20]. This opens the door to further investigation of how gradients or model updates can be exploited for other attack types, especially privacy attacks.
Gap 2. Very little effort is devoted to studying attacks on ML functions other than classification.
Description. ML models can be used to fulfill a variety of functions, such as classification, regression, ranking, clustering, and generation. However, our SMS shows that there is a heavy bias towards the classification function: 46 (96%) of the attacks. Other functions, namely, regression, generation, and clustering, were addressed in only 4 (9%), 1 (2%), and 1 (2%) attacks, respectively.
Implications. This gap implies a lack of knowledge about a large spectrum of models and applications with functions other than classification. These functions are of high importance in many domains, e.g., ranking in natural language processing [
102] and recommender systems [
103]. It remains an open question how the existing attacks impact these functions. It is worth mentioning that a similar gap was also observed for adversarial attacks in general ML settings by Papernot et al. [
28].
Gap 3. There is a lack of research about attacks on ML models other than CNNs.
Description. Although FL is not restricted to NN models, we have seen in the previous section that only three (6%) attacks target non-NN models. On a closer look, we depict in
Figure 11 the types of models targeted by the different attacks. We notice that non-NN models were never targeted by membership inference or backdoor attacks. For NN models, we observe that RNNs were not studied under any type of privacy or model corruption attack. Additionally, no research has been carried out on backdoors for DNNs. AEs have also received very little attention: we found only two privacy attacks using AEs. Overall, this illustrates the limited diversity of target models in the literature.
Implications. NN models are the state of the art in several applications, e.g., face recognition [
104]; however, other ML models are still of high value and usage in real-world systems, e.g., genome analysis [
105], culvert inspection [
106], and filtering autocompletion suggestions [
107], to name a few.
Within NN models, there is a variety of network architectures, and as shown above, many of these architectures are not well covered in the evaluation of the attacks, even architectures that are widely used in several applications, e.g., RNNs, which are used in Gboard [
108]. Consequently, the evaluations of the proposed attacks fall short of providing evidence on how the attacks will perform against other network architectures.
Overall, we noticed limited effort devoted to studying the influence of using different model architectures on the effectiveness of the proposed attacks. We found only Geiping et al. [
79] to provide an adequate analysis of this aspect. Considering this issue when evaluating attacks is important for improving the generalizability of the results.
5.2. Assumption Issues
There are a number of attacks that succeed only under special assumptions. These assumptions do not apply in many real-world scenarios; consequently, the applicability of these attacks is limited. Here, we highlight the issues of these assumptions and discuss their implications.
Assumption Issue 1. The attacks are effective only under special values of the hyper-parameters of the NN models.
Description. As described in
Section 2, the hyper-parameters of NN models include, among others, the batch size, learning rate, activation function, and loss function. The hyper-parameters need to be carefully and fairly optimized to meet the application requirements. On the contrary, we found in some papers that the hyper-parameters are tailored to demonstrate the high effectiveness of the attacks rather than to illustrate realistic scenarios.
Examples and Implications. In some model inversion attacks, the gradients are used to reconstruct the training data. Zhu et al. [
19] and Wei et al. [
78] showed that their attacks perform well only when the gradients are generated from small batch sizes. Zhao et al. [
20] proposed an attack to extract the labels of the clients from gradients. However, the attack works only when the batch size is one, which is an exceptional and uncommon value. Hitaj et al. [
18] also used a batch size of one to evaluate their attack on the AT&T dataset.
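To make the role of the batch size concrete, the following sketch illustrates the gradient-matching idea behind DLG-style attacks on a toy fully connected model. It is a simplified illustration under assumed settings (the model, optimizer, and iteration count are chosen only for brevity) and not the implementation of any of the cited papers.

```python
# Minimal sketch of DLG-style gradient matching (not the cited papers' code).
# With batch_size = 1, the victim's gradient pins down a single example, which
# makes the optimization below far easier than with realistic batch sizes.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()

# Victim computes a gradient on one private sample (batch_size = 1).
x_true = torch.rand(1, 784)
y_true = torch.tensor([3])
true_grads = torch.autograd.grad(criterion(model(x_true), y_true), model.parameters())

# Adversary optimizes dummy data and labels so their gradient matches the observed one.
x_dummy = torch.rand(1, 784, requires_grad=True)
y_dummy = torch.randn(1, 10, requires_grad=True)  # soft labels, optimized jointly
opt = torch.optim.LBFGS([x_dummy, y_dummy])

def closure():
    opt.zero_grad()
    pred = model(x_dummy)
    loss = torch.sum(torch.softmax(y_dummy, -1) * -torch.log_softmax(pred, -1))
    dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    return grad_diff

for _ in range(50):
    opt.step(closure)
# x_dummy now approximates x_true; with larger batches, the observed gradient
# averages over many samples and the reconstruction becomes much harder.
```

With realistic batch sizes, the shared gradient mixes many samples, and the same optimization becomes substantially more difficult, which is exactly why the assumption of a batch size of one is so consequential.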
Using small batches leads to less accurate gradient estimates, which in turn causes less stable learning. Additionally, it requires more computational power, since a larger number of iterations is needed, and the gradients have to be calculated and applied in each of them to update the weights. Since FL pushes the training to the client device, it is essential to consider the limited resources of these devices; the efficiency of the local training process is thus an important requirement. In other words, very small batch sizes increase the computational overhead and are therefore not preferable for FL applications.
Although it is insightful to point out the vulnerabilities that some special hyper-parameters might introduce, it is of high importance to discuss the relevance of these hyper-parameters to real-world problems.
Assumption Issue 2. The attacks succeed only when a considerable fraction of clients are malicious and participate frequently in the training rounds.
Description. In cross-device FL, a massive number of clients (up to 10^10) forms the population of the application. Out of these clients, the server selects a subset of clients (∼100 [
107]) randomly for every training round to train the model locally and share their updates [
9]. This random sampling is assumed to be uniform (i.e., every client has the same probability of being selected) to achieve certain privacy guarantees for clients, in particular, differential privacy [
109]. Under these conditions, it is rather unlikely for a specific client to participate in a large number of training rounds or in consecutive ones. However, this was found as an assumption in a number of papers to enable some privacy and poisoning attacks. Furthermore, several attacks require a large number of clients to collude and synchronize in order to launch an attack, which can also be difficult to achieve in some cases.
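To put the participation assumption into perspective, the following back-of-the-envelope calculation uses purely illustrative (assumed) numbers for the population size and the per-round cohort; it is not taken from any of the examined papers.

```python
# Back-of-the-envelope check of Assumption Issue 2 (illustrative numbers only):
# with uniform sampling, how likely is one fixed client to be picked in r
# consecutive rounds?
N = 1_000_000   # assumed total client population
m = 100         # assumed clients sampled per round
r = 50          # consecutive rounds the adversary must participate in

p_round = m / N               # probability of being selected in a single round
p_consecutive = p_round ** r  # probability of being selected in r consecutive rounds

print(f"P(selected in one round)       = {p_round:.0e}")        # 1e-04
print(f"P(selected in {r} consecutive rounds) = {p_consecutive:.0e}")  # ~1e-200
```

Under such assumptions, an honest selection process makes prolonged or consecutive participation of a specific client vanishingly unlikely, which is why assuming it amounts to a strong adversarial capability.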
Examples and Implications. Hitaj et al. [
18] assumed that the adversary participates in more than 50 consecutive training rounds in order to carry out a reconstruction attack successfully. A stronger assumption was made by [
83], namely, to have the adversary participating in all the rounds to poison the model. This requires the adversary to fulfill the FL training requirements [
107] and to trick the server into selecting it frequently, which is challenging in itself considering the setting described above.
State-of-the-art poisoning attacks in cross-device FL [
51,
76] assume up to 25% of the users to be malicious [
23]. Considering that cross-device FL is mainly intended to be used by a massive number of users, the effective execution of these attacks would require the compromise of a significant number of devices. This in turn requires substantial effort and considerable resources, which could make the attacks impractical at scale [
23]. For instance, a real-world FL application such as Gboard [
108] has more than 1 billion users [
110]. This means that the adversary would need to compromise 250 million user devices to apply these attacks successfully [
23]. However, it is worth mentioning that there are many ML applications (i.e., potential FL applications) on the market with a smaller user base. Still, to the best of our knowledge, there are no real-world FL applications that represent this case.
The distributed nature of FL might indeed enable malicious clients to be part of the system. However, the capabilities of these malicious clients to launch successful attacks need to be carefully discussed in light of applied FL use cases, so that the risk of these attacks is not overestimated.
Assumption Issue 3. The attacks can be performed when the data are distributed among clients in a specific way.
Description. FL enables clients to keep their data locally on their devices, i.e., the data remain distributed. This usually introduces two data properties: first, the data are non-IID; i.e., the data of an individual client are not representative of the population distribution. Second, the data are unbalanced, as different clients have different amounts of data [
9]. In an ML classification task, for example, this may cause some classes not to be equally represented in the dataset. In any FL setting, it is essential to consider these two properties. While the meaning of IID and balanced data is clear, non-IID and unbalanced data distribution can be achieved in many ways [
24]. In a number of papers, we found that specific distributions are assumed to enable the proposed attacks.
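As an illustration of one such choice, the sketch below simulates a non-IID, unbalanced partition with a Dirichlet split over labels, a common (but by no means the only) strategy; the dataset size, number of clients, and concentration parameter are purely illustrative and do not reproduce any specific paper's setup.

```python
# Sketch of a Dirichlet-based label split, one common way (among many, cf. [24])
# to simulate non-IID and unbalanced client data; a smaller alpha means more skew.
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_partition(labels, num_clients, alpha=0.5):
    """Assign sample indices to clients with per-class proportions drawn from Dir(alpha)."""
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Fraction of class c that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy example: 10 classes, 1000 samples, 20 clients.
labels = rng.integers(0, 10, size=1000)
parts = dirichlet_partition(labels, num_clients=20, alpha=0.1)
print([len(p) for p in parts])  # unbalanced sizes; per-client label mixes are skewed
```

Choosing a different partitioning strategy or concentration parameter can change the attack results considerably, which is precisely why the assumed distribution needs to be stated and justified.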
Examples and Implications. A backdoor attack on a classification model by Bagdasaryan et al. [
16] achieved 100% accuracy on the backdoor task with one malicious client participating in a single training round. However, in their experiment on CIFAR10, it was assumed that only the adversary possesses the backdoor feature, which is a strong assumption [
72]. The massive number of clients in FL suggests that other clients’ data might also cover the backdoor feature. Therefore, it should be considered that at least one honest client has additional benign data containing the backdoor feature.
Another example was found in the model inversion attack of [
18], where the authors assumed that all data of one class belong to one client and that the adversary is aware of that. Additionally, their attack works only when all the data of one class are similar (e.g., images of one digit in the MNIST dataset). These assumptions do not apply to many real-world scenarios, so they were found to be unrealistic by [
27]. Moreover, the model corruption attack introduced in [
50] was launched under the setting of IID data, which contradicts the main FL assumptions. Similarly, Nasr et al. [
27] evaluated their membership inference attack on a target model trained with balanced data. It is worth mentioning that Jayaraman et al. [
111] pointed out the issue that most membership inference attacks [
48,
112,
113] for stand-alone learning also focus only on the balanced distribution scenarios.
Overall, the way of implementing non-IID and unbalanced data distribution needs to be (1) discussed and justified in light of the application to ensure a setup that is as realistic as possible, and (2) reflected clearly in the conclusions of the evaluation.
5.3. Fallacies in Evaluation Setups
Designing a comprehensive and realistic experimental setup is essential to prove the applicability of an attack and the generalizability of the conclusions. Although all the studied papers provide insightful evaluations of their proposed attacks, a number of practices were followed that might introduce fallacies. In this section, we set out to highlight this issue by identifying six fallacies. We discuss the implications of each fallacy for the evaluation results. Then, we propose a set of actionable recommendations to help avoid them.
Fallacy 1. The datasets are oversimplified in terms of data content or data dimensions.
Description. Datasets are used to train and test the FL model and also to evaluate the attack. These datasets need to be representative of the population targeted by the model. As we highlighted in
Section 4, the majority of attacks are evaluated on the image classification task. Therefore, here we focus on the image-based datasets.
Despite the growing calls for decreasing the usage of simple datasets, in particular MNIST [
55], it is still one of the most common datasets in the deep learning community [
114]. This is due to several reasons, such as its small size and the fact that it can be easily used in deep learning frameworks (e.g., TensorFlow, PyTorch) by means of helper functions [
55].
MNIST was introduced by LeCun et al. [
115] in 1998 and contains 70,000 gray-scale images of handwritten digits with a size of 28 × 28 pixels. Since then, substantial advances have been made in deep learning algorithms and the available computational power. Consequently, MNIST has become an inadequate challenge for the modern toolset [
116]. In addition, the complexity of images in modern computer vision tasks has increased, which renders MNIST unrepresentative of these tasks [
67].
Still, the wide usage of MNIST is also observed in the examined papers: more than 53% (see
Figure 8) used MNIST as the main dataset for evaluating the effectiveness of the proposed attacks. The second most common dataset was CIFAR, which is more complex in terms of data content; however, it is a thumbnail dataset; i.e., the images are only 32 × 32 pixels.
It is worth mentioning that in 41 (85%) of the papers, the authors evaluated their attacks on more than one dataset, which is a good practice. However, in a considerable number of papers (15, i.e., 31%), the authors used only datasets that contain either simple or small (thumbnail) images.
Examples and Implications. Using oversimplified datasets can lead to a misestimation of the attack capabilities. For instance, the capabilities of privacy attacks to retrieve information about the dataset are tightly related to the nature of this dataset. Consequently, the complexity and size of the images in the dataset impact the attacks’ success rate. It is clear that retrieving more complex and larger images requires higher capabilities. This is evident in the literature through several examples. Melis et al. [
69] introduced a privacy attack that exploits the updates sent by the clients to infer the membership and properties of data samples. In [
19], the authors demonstrated that the proposed attack of [
69] only succeeds on simple images with clean backgrounds from the MNIST dataset. However, the attack’s accuracy degrades notably on the LFW dataset and fails on CIFAR. In the same context of privacy attacks, Zhu et al. [
19] proposed the model inversion attack DLG, which reconstructs the training data and labels from gradients. Their experiments showed that DLG can quickly (within just 50 iterations) reconstruct images from MNIST. However, it requires more computational power (around 500 iterations) to succeed against more complex datasets such as CIFAR and LFW. Recently, Wainakh et al. [
94] demonstrated that the accuracy of DLG in retrieving the labels degrades remarkably on CelebA, which has a bigger image size than the thumbnail datasets, such as MNIST and CIFAR.
Recommendations. We acknowledge that it is challenging to find a single dataset that provides an adequate evaluation of the attacks; therefore, it is essential to evaluate the attack on diverse datasets with regard to image complexity and dimensions. We encourage researchers to also consider real-life datasets, which pose realistic challenges for the models and attacks, e.g., ImageNet (image classification and localization) [
117], Fer2013 (facial recognition) [
118], and HAM10000 (diagnosing skin cancers) [
119].
Fallacy 2. The datasets are not user-partitioned, i.e., not distributed by nature.
Description. In FL, data are distributed among the clients; each client typically generates their data by using their own device. Therefore, these data have individual characteristics [
9]. The datasets used for evaluating the attacks should exhibit this property, i.e., be generated in a distributed fashion. However, in only 11 (23%) of the papers were user-partitioned datasets used. One of these datasets is EMNIST [
120], which was collected from 3383 users, and thus, it is appropriate for the FL setting [
In the majority of studies (37, i.e., 77%), researchers used pre-existing datasets that are designed for centralized machine learning [
121] and thus are unrealistic for FL [
122]. These datasets were then artificially partitioned to simulate the distributed data in FL. One additional issue with these datasets is that they are balanced by default, yet FL assumes the clients’ data to be unbalanced [
9].
Examples and Implications. In the image classification use case, the poisoning attacks proposed in [
16,
67] were evaluated on centralized datasets, such as Fashion-MNIST and CIFAR. The attacks were reported to achieve 100% accuracy in the backdoor task. However, by using EMNIST as a standard FL dataset, Sun et al. [
72] illustrated the limitations of the previous attacks. More precisely, they showed that the performance of the attacks mainly depends on the ratio of adversaries in the population. Moreover, the attacks can be easily mitigated with norm clipping and “weak” differential privacy. Although this fallacy was discussed in previous works [
121,
122], its implications for the evaluation results need to be investigated further and demonstrated with empirical evidence.
Recommendations. It is recommended to use FL-specific datasets for adequate evaluation of the attacks. Researchers have recently been devoting more efforts to curating such datasets. The LEAF framework [
122] provides five user-partitioned datasets of images and text, namely, FEMNIST, Sent140, Shakespeare, CelebA, and Reddit. Furthermore, Luo et al. [
121] created a street dataset of high-quality images, which is also distributed by nature and thus suitable for FL.
Fallacy 3. The attacks are evaluated against simple NN models.
Description. We observe a major focus on attacking NN models in federated settings. These models can have a variety of architectures, as discussed in
Section 2. The complexity of these architectures varies with respect to the number of layers (depth), the number of neurons in each layer (width), and the type of connections between neurons. In the case of CNN models (41 papers), our study shows that researchers tend to use simple architectures to evaluate their attacks (21 (52%) papers), e.g., a 1-layer CNN [
22] and 3-layer CNN [
67]. In 20 (49%) papers, the authors considered complex state-of-the-art CNN models, such as VGG [
123], ResNet [
124], and DenseNet [
125], the winners of the famous
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [
126].
Examples and Implications. It is reasonable to start evaluating novel attacks on simple models to facilitate the analysis of the initial results. However, this is insufficient for drawing conclusions on the risks posed by these attacks to real-life FL-based applications for two reasons. First, modern computer vision applications, e.g., biometrics, use advanced models, mostly with sophisticated architectures, to solve increasingly complex learning objectives [
127]. Second, in deployed systems, an ML model typically interacts with other components, including other models. This interaction can be extremely complex, which might introduce additional challenges for adversaries [
128]. For instance, in the Gboard app [
108], as a user starts typing a search query, a baseline model determines possible search suggestions. Yang et al. [
107] utilized FL to train an additional model that filters these suggestions in a subsequent step to improve their quality.
Several model inversion attacks reconstruct the training data by exploiting the shared gradients [
22,
78,
97]. In particular, they exploit the mathematical properties of gradients in specific model architectures to infer information about the input data. For example, Enthoven et al. [
22] illustrated that the gradients of neurons in fully connected layers can be used to reconstruct the activations of the previous layer. This observation was employed to disclose the input data of fully connected models with high accuracy. However, the same attack achieves considerably less success when the model contains convolutional layers.
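The property exploited by such attacks can be illustrated for a single fully connected layer: with a batch of one, the weight gradient is the outer product of the input and the output error, while the bias gradient equals the output error, so the input can be read off the shared gradients directly. The following sketch demonstrates this observation on a toy layer; it is an illustration of the underlying property, not the attack implementation of the cited work.

```python
# Sketch of the fully-connected-layer property exploited by these attacks:
# for y = W x + b, dL/dW = (dL/dy) x^T and dL/db = dL/dy, so with a batch of
# one the private input x can be recovered directly from the shared gradients.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(784, 10)
x = torch.rand(1, 784)                       # private input (batch_size = 1)
loss = nn.CrossEntropyLoss()(layer(x), torch.tensor([3]))
dW, db = torch.autograd.grad(loss, [layer.weight, layer.bias])

i = torch.argmax(db.abs())                   # pick a row with a nonzero bias gradient
x_reconstructed = dW[i] / db[i]              # elementwise division recovers x

print(torch.allclose(x_reconstructed, x[0], atol=1e-4))  # True: near-exact recovery
```

Once convolutional layers precede the fully connected part, the shared gradients no longer expose the raw input in this direct way, which is consistent with the reduced success reported above.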
The NN capacity (i.e., the number of neurons) also influences the performance of some attacks, in particular, backdoors. It is conjectured that backdoors exploit the spare capacity in NNs to inject a sub-task [
129]. Thus, larger networks might be more prone to these attacks. However, this factor still needs to be thoroughly investigated [
72]. In this regard, it is worth mentioning that increasing the capacity, e.g., for CNNs, is a common practice to increase the model accuracy. However, recent approaches, such as EfficientNet [
130], call for scaling up networks more efficiently, achieving better accuracy with smaller networks. This development in CNNs should also be considered in the evaluation of the attacks.
Recommendations. We highly encourage researchers to consider state-of-the-art model architectures that are widely used in the applications they target with their attacks. In addition, for more realistic security assessments, it would be insightful to evaluate the proposed attacks on deployed systems that contain multiple components.
Fallacy 4. The attacks are designed for cross-device scenarios (massive client populations), yet evaluated on a small number of clients.
Description. FL can be applied in cross-silo or cross-device settings. In the cross-silo setting, clients are organizations or data centers (typically 2–100 clients), whereas in the cross-device scenario, clients are a very large number of mobile or IoT devices (up to 10^10) [
24]. For instance, in applied use cases of FL, Hard et al. [
108] reported using 1.5 million clients to train the Coupled Input and Forget Gate language model [
131]. Yang et al. [
107] trained a logistic regression model (for the Gboard application) for 4000 training rounds. They employed 100 clients in each round.
Although many of the studied papers do not explicitly use the term “cross-device” to describe their scenario, they mainly refer to clients as individual users who hold personal data. However, 27 (56%) papers provided an evaluation with a total population of ≤100 clients. Moreover, 13 (27%) of the papers did not report the client population in their experiments at all.
Examples and Implications. The total number of clients and the number of clients participating per round in FL determine the influence of a single client on the global model. For privacy attacks, a small population means that each client contributes considerably to shaping the model parameters; thus, the parameters reflect the client’s personal data more prominently. Shen et al. [
95] demonstrated that increasing the client population led to a decrease in the accuracy of their property inference attack. For poisoning attacks, using a small number of clients amplifies the impact of the poison injected by malicious ones. This was shown in the experiments of [
67], where the accuracy of the backdoor task degraded with larger client populations.
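The dilution effect behind these observations can be illustrated with a toy FedAvg-style aggregation; the update values and population sizes below are purely illustrative and are not drawn from any of the cited experiments.

```python
# Sketch of why small populations amplify a single (possibly malicious) update:
# under FedAvg-style weighted averaging, each client's influence on the global
# model shrinks roughly in proportion to the number of participants per round.
import numpy as np

def fedavg(updates, num_samples):
    """Average of client updates, weighted by local dataset size."""
    return np.average(updates, axis=0, weights=num_samples)

honest_update = 0.0
poisoned_update = 10.0   # exaggerated malicious update (illustrative value)

for clients_per_round in (10, 100, 1000):
    updates = [honest_update] * (clients_per_round - 1) + [poisoned_update]
    global_delta = fedavg(updates, [1] * clients_per_round)
    print(clients_per_round, round(float(global_delta), 3))
# 10 -> 1.0, 100 -> 0.1, 1000 -> 0.01: the poison is diluted as the round grows.
```

The same reasoning applies to privacy leakage: the fewer clients contribute to an aggregate, the more each client's data shape it.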
Recommendations. We recommend that researchers consider a large number of clients when evaluating novel attacks. For that, it is helpful to use the datasets provided by LEAF [
122], which contain more than 1000 clients. In case a large-scale evaluation is not feasible, researchers are encouraged to at least discuss the potential implications of different client populations on their attacks.
Fallacy 5. The attacks are not evaluated against existing defense mechanisms.
Description. An attack becomes ineffective if it requires the adversary to make a disproportionately large effort to overcome a simple defense mechanism [
128]. Proposed attacks need to be evaluated in this respect against state-of-the-art defenses. However, we showed in
Section 4.3,
Figure 9, that 21 (48%) of the proposed attacks were not evaluated against any defense mechanism. In most of these papers, the authors only theoretically discussed potential countermeasures to mitigate their attacks.
Examples and Implications. This fallacy leaves the evaluation of the attacks incomplete, and their applicability under real-world scenarios, where defense mechanisms are typically deployed, is questionable. However, it is important here to distinguish between the different categories of defense mechanisms. On the one hand, cryptography-based defenses typically provide formally proven guarantees; thus, in some cases, their impact on the attacks can be sufficiently discussed without empirical evidence. Still, in these cases, efficiency remains an open question. On the other hand, the impact of other defense categories, namely, perturbation and sanitization, on attacks requires experimental analysis, as these defenses usually introduce a loss in model accuracy and hence need to be tuned to reach the desired balance between accuracy and privacy. In
Figure 12, we see that most of the implemented defenses in the literature are from these two categories. We also see that perturbation is mainly used against privacy attacks, as it reduces the information leakage about individuals, whereas sanitization mitigates the impact of malicious updates from adversaries and is thus used against poisoning attacks.
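As a concrete illustration of these two categories, the sketch below combines update-norm clipping (sanitization) with Gaussian noise on the aggregate (perturbation), in the spirit of the defenses evaluated by Sun et al. [72]; the clipping threshold and noise scale are illustrative and would have to be tuned for the accuracy–privacy balance discussed above.

```python
# Sketch of two frequently evaluated server-side defenses: update-norm clipping
# (sanitization) and Gaussian noise on the aggregate (perturbation).
# The clip_norm and noise_std values below are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def clip_update(update, clip_norm=1.0):
    """Scale a client update down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def aggregate_with_defenses(updates, clip_norm=1.0, noise_std=0.01):
    clipped = [clip_update(u, clip_norm) for u in updates]
    aggregate = np.mean(clipped, axis=0)
    # Perturbation: Gaussian noise on the aggregate, trading accuracy for privacy.
    return aggregate + rng.normal(0.0, noise_std, size=aggregate.shape)

# Toy round: 9 benign updates and 1 scaled-up malicious one.
benign = [rng.normal(0, 0.1, size=100) for _ in range(9)]
malicious = [10.0 * rng.normal(0, 0.1, size=100)]
new_global_delta = aggregate_with_defenses(benign + malicious)
print(np.linalg.norm(new_global_delta))  # clipping bounds the malicious influence
```

Evaluating a proposed attack against such defenses, rather than only discussing them, is what this fallacy calls for.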
Fallacy 6. The results of the experimental evaluations are not easily reproducible.
Description. The majority (97%) of the proposed attacks are validated through empirical experiments. For other researchers to accurately reproduce the results of these experiments, several practices need to be considered. In our analysis, we took into account three main practices: (1) using publicly available datasets, (2) reporting technical details about the implementation, and (3) publishing the source code. Our study shows in
Section 4.3 that public datasets were used in all the examined papers, which is a good practice. However, 23 (48%) papers did not contain any details about the technologies used in the implementation. Furthermore, the source code of 40 (83%) papers was not publicly available.
Examples and Implications. Dacrema et al. [
66] reported that reproducibility is one of the main factors in ensuring research progress, especially for approaches based on deep learning algorithms. To conduct a proper assessment of a novel attack, researchers usually compare it with previous attacks as baselines. Evaluating different attacks under different settings and assumptions hinders this direct comparison. That is, researchers have to re-implement the respective attacks to reproduce their results under their own settings. This becomes even more challenging when the authors do not describe their experimental setups and parameters to the extent of full reproducibility.
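A minimal sketch of practices that ease such reproduction is given below: fixing the random seeds and persisting the full experimental configuration alongside the results. All field names and values in the configuration are illustrative.

```python
# Minimal sketch of reproducibility practices: fix the random seeds and persist
# the experimental configuration alongside the results.
# All configuration fields and values below are illustrative.
import json
import random

import numpy as np
import torch

def set_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

config = {
    "dataset": "FEMNIST",        # user-partitioned dataset from LEAF
    "model": "2-layer CNN",
    "clients_total": 3400,
    "clients_per_round": 100,
    "rounds": 1000,
    "batch_size": 32,
    "learning_rate": 0.01,
    "seed": 42,
    "framework": "pytorch",      # also record exact library versions
}

set_seeds(config["seed"])
with open("experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Such machine-readable records complement, but do not replace, publishing the full source code.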
Recommendations. We encourage all researchers to share their source code and detailed descriptions of their setups. We also recommend using libraries and benchmark frameworks that support FL, namely, TensorFlow Federated, PySyft [
132], LEAF [
122], FATE [
133], and FedML [
134]. This in turn will help researchers to implement their ideas more easily and improve the consistency of implementations and experimental settings across different papers.