Article

Towards Federated Learning with Byzantine-Robust Client Weighting

Computer Science Department, Ben-Gurion University of the Negev, P.O. Box 653, Be’er Sheva 8410501, Israel
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(17), 8847; https://doi.org/10.3390/app12178847
Submission received: 17 July 2022 / Revised: 31 August 2022 / Accepted: 1 September 2022 / Published: 2 September 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Featured Application

The paper provides a solution for practical federated learning tasks in which a dataset is partitioned among potentially malicious clients. One such case is training a model on edge medical devices, where a compromised device could not only lead to lower model accuracy but may also introduce public safety issues.

Abstract

Federated learning (FL) is a distributed machine learning paradigm where data are distributed among clients who collaboratively train a model in a computation process coordinated by a central server. By assigning a weight to each client based on the proportion of data instances it possesses, the rate of convergence to an accurate joint model can be greatly accelerated. Some previous works studied FL in a Byzantine setting, in which a fraction of the clients may send arbitrary or even malicious information regarding their model. However, these works either ignore the issue of data unbalancedness altogether or assume that client weights are a priori known to the server, whereas, in practice, it is likely that weights will be reported to the server by the clients themselves and therefore cannot be relied upon. We address this issue for the first time by proposing a practical weight-truncation-based preprocessing method and demonstrating empirically that it is able to strike a good balance between model quality and Byzantine robustness. We also establish analytically that our method can be applied to a randomly selected sample of client weights.

1. Introduction

Federated learning (FL) [1,2,3,4] is a distributed machine learning paradigm where training data reside at autonomous client machines and the learning process is facilitated by a central server. The server maintains a shared model and alternates between requesting clients to try and improve it and integrating their suggested improvements back into that shared model.
A few challenges arise from this model. First, the need for communication efficiency, both in terms of the size of data transferred and the number of required messages for reaching convergence. Second, clients are outside of the control of the server and as such may be unreliable or even malicious (Blanchard et al. [5] dubbed this behavior Byzantine). Third, whereas classical learning models generally assume that data are homogeneous, here, privacy and the aforementioned communication concerns force us to deal with the data as it is seen by the clients; that is (1) non-IID (identically and independently distributed)—data may depend on the client they reside at, and (2) unbalanced—different clients may possess different amounts of data.
In previous works [6,7,8,9,10], unbalancedness is either ignored or is represented by a collection of a priori known client importance weights that are usually derived from the amount of data each client has. This work investigates aspects that stem from this unbalancedness. Concretely, we focus on the case where unreliable clients declare the amount of data they have and may thus adversely influence their importance weight. We show that without some mitigation, a single malicious client can obstruct convergence in this manner even in the presence of popular FL defense mechanisms. Our experiments consider protections that replace the server step by a robust mean estimator, such as median [11,12,13] and trimmed mean [12].

2. Materials and Methods

2.1. Problem Setup

2.1.1. Optimization Goal

We are given $K$ clients, where each client $k$ has a local collection $Z_k$ of $n_k$ samples taken IID from some unknown distribution over sample space $\mathcal{Z}$. We denote the unified sample collection as $Z = \bigcup_{k \in [K]} Z_k$ and the total number of samples as $n$ (i.e., $n = |Z| = \sum_{k \in [K]} n_k$).
Our objective is global empirical risk minimization (ERM) for some loss function class $\ell(w; \cdot) : \mathcal{Z} \to \mathbb{R}$, parameterized by $w \in \mathbb{R}^d$:
$$\min_{w \in \mathbb{R}^d} F(w), \quad \text{where} \quad F(w) := \frac{1}{n} \sum_{z \in Z} \ell(w; z). \tag{1}$$
We let $w^*$ denote $\arg\min_{w \in \mathbb{R}^d} F(w)$.
In the following sections, we denote the vector of client sample sizes as $N = (n_1, \ldots, n_K)$ and assume, without loss of generality, that it is sorted in decreasing order.

2.1.2. Collaboration Model

We restrict ourselves to the FL paradigm, which leaves the training data distributed among client machines and learns a shared model by iterating between client updates and server aggregation.
Additionally, a subset of the clients, marked $\mathcal{B}$, can be Byzantine, meaning they can send arbitrary and possibly malicious results on their local updates. Moreover, unlike previous works, we also consider clients’ sample sizes to be unreliable because they are reported by possibly Byzantine clients. When the distinction is important, values that are sent by clients are marked with an overdot to signify that they are unreliable (e.g., $\dot{n}_k$), whereas values that have been preprocessed in some way are marked with a tilde (e.g., $\tilde{n}_k$).

2.1.3. Federated Learning Meta Algorithm

We build upon the baseline federated averaging algorithm (FedAvg) described by McMahan et al. [2]. There, it is suggested that in order to save communication rounds, clients perform multiple stochastic gradient descent (SGD) steps while a central server occasionally averages the parameter vectors.
The intuition behind this approach becomes clearer when we mark the $k$th client’s ERM objective function by $F_k(w) := \frac{1}{n_k} \sum_{z \in Z_k} \ell(w; z)$ and observe that the objective function in Equation (1) can be rewritten as a weighted average of clients’ objectives:
$$F(w) := \frac{1}{n} \sum_{k \in [K]} n_k F_k(w). \tag{2}$$
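As a quick numerical check of the identity in Equation (2), the following minimal sketch compares the pooled empirical risk with the sample-size-weighted average of per-client risks. The squared-error loss, the toy data, and all function names are illustrative assumptions and are not taken from the original work.

```python
def empirical_risk(loss, w, samples):
    """Mean loss of parameters w over a collection of samples."""
    return sum(loss(w, z) for z in samples) / len(samples)


def weighted_client_risk(loss, w, client_samples):
    """(1/n) * sum_k n_k * F_k(w): the weighted average of client objectives."""
    n = sum(len(z_k) for z_k in client_samples)
    return sum(len(z_k) * empirical_risk(loss, w, z_k) for z_k in client_samples) / n


def squared_error(w, z):
    """Hypothetical toy loss: a scalar model w scored against a scalar sample z."""
    return (w - z) ** 2


# Three unbalanced clients holding 3, 1, and 2 samples.
clients = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]
pooled = [z for z_k in clients for z in z_k]

global_risk = empirical_risk(squared_error, 2.5, pooled)
weighted_risk = weighted_client_risk(squared_error, 2.5, clients)
```

The two values agree up to floating-point rounding, which is why weighting each client by its sample size recovers the global objective.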
Similarly to previous works [10,13,14], we capture a large set of algorithms by abstracting FedAvg into a meta-algorithm for FL (Algorithm 1). We require three procedures to be specified by any concrete algorithm:
  • Preprocess: receives possibly Byzantine $\dot{n}_k$s from clients and produces secure estimates marked as $\tilde{n}_k$s. To the best of our knowledge, previous works ignore this procedure and assume that the $n_k$s are correct.
  • ClientUpdate: per-client $w_k$ computation. In FedAvg, this corresponds to a few local mini-batch SGD rounds. See Algorithm 2 for the pseudocode.
  • Aggregate: the server’s strategy for updating $w$. In FedAvg, this corresponds to the weighted arithmetic mean, i.e., $w \leftarrow \frac{1}{\dot{n}} \sum_{k \in [K]} \dot{n}_k \dot{w}_k$.
Algorithm 1 Federated Learning Meta-Algorithm
Given procedures: Preprocess, ClientUpdate, and Aggregate.
$\{\dot{n}_k\}_{k \in [K]} \leftarrow$ collect sample size from clients
$\{\tilde{n}_k\}_{k \in [K]} \leftarrow$ Preprocess($\{\dot{n}_k\}_{k \in [K]}$)
$w \leftarrow$ initial guess
for $t \leftarrow 1$ to $T$ do
    $S_t \leftarrow$ a random set of client indices
    for all $k \in S_t$ do
        $\dot{w}_k \leftarrow$ ClientUpdate($w$)
    end for
    $w \leftarrow$ Aggregate($\{\tilde{n}_k, \dot{w}_k\}_{k \in S_t}$)
end for
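The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the model is a single scalar, the three procedures are passed in as plain functions, and the names `federated_meta_algorithm`, `weighted_mean`, and `local_data` are hypothetical and do not come from the authors' repository.

```python
import random


def federated_meta_algorithm(reported_sizes, preprocess, client_update, aggregate,
                             initial_w, rounds, clients_per_round, seed=0):
    """Run the meta-algorithm: preprocess sizes once, then alternate update/aggregate."""
    rng = random.Random(seed)
    sizes = preprocess(reported_sizes)  # preprocessed (e.g., truncated) sample sizes
    w = initial_w
    client_ids = list(range(len(reported_sizes)))
    for _ in range(rounds):
        selected = rng.sample(client_ids, clients_per_round)  # random client subset
        updates = [(sizes[k], client_update(k, w)) for k in selected]
        w = aggregate(updates)
    return w


def weighted_mean(updates):
    """FedAvg-style aggregation: sample-size-weighted arithmetic mean."""
    total = sum(n_k for n_k, _ in updates)
    return sum(n_k * w_k for n_k, w_k in updates) / total


# Toy stand-in for local training: each client simply returns its local mean.
local_data = [[1.0, 1.0, 1.0], [5.0], [3.0, 3.0]]
final_w = federated_meta_algorithm(
    reported_sizes=[len(d) for d in local_data],
    preprocess=lambda sizes: list(sizes),  # passthrough: trust reported sizes
    client_update=lambda k, w: sum(local_data[k]) / len(local_data[k]),
    aggregate=weighted_mean,
    initial_w=0.0,
    rounds=3,
    clients_per_round=3,
)
```

Swapping `preprocess` for a truncation step and `aggregate` for a robust estimator changes the concrete algorithm without touching the loop itself.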
Algorithm 2 FedAvg: ClientUpdate
Hyperparameters: learning rate ($\eta$), number of epochs ($E$), and batch size ($B$).
for $E$ epochs do
    for $B$-sized batch $b_k$ in $Z_k$ do
        $w_k \leftarrow w_k - \eta \frac{1}{B} \sum_{z \in b_k} \nabla \ell(w_k; z)$
    end for
end for
return $w_k$
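A minimal Python sketch of the local update in Algorithm 2 follows. It assumes a scalar model and a caller-supplied per-sample gradient function `grad`; it also averages over the actual batch length (so a short final batch is handled) and does not shuffle the data, both of which are simplifications of this sketch rather than details stated in the text.

```python
def client_update(w, samples, grad, learning_rate, epochs, batch_size):
    """Mini-batch SGD on one client's local samples, starting from server model w."""
    w_k = w
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Average the per-sample gradients over the batch.
            mean_grad = sum(grad(w_k, z) for z in batch) / len(batch)
            w_k = w_k - learning_rate * mean_grad
    return w_k


# Hypothetical toy loss 0.5 * (w - z)^2, whose gradient with respect to w is (w - z).
updated = client_update(
    w=0.0,
    samples=[2.0, 2.0, 2.0, 2.0],
    grad=lambda w, z: w - z,
    learning_rate=0.5,
    epochs=1,
    batch_size=2,
)
```

With these toy values, the first batch moves the model from 0 to 1 and the second from 1 to 1.5, halving the distance to the local optimum at each step.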

2.2. Preprocessing Client-Declared Sample Sizes

2.2.1. Preliminaries

The following assumption is common among works on Byzantine robustness:
Assumption 1
(Bounded Byzantine proportion). The proportion of clients who are Byzantine is bounded by some constant $\alpha$; i.e., $\frac{1}{K} |\mathcal{B}| \le \alpha$.
The next assumption is a natural generalization when considering unbalancedness:
Assumption 2
(Bounded Byzantine weight proportion). The proportion between the combined weight of Byzantine clients and the total weight is bounded by some constant $\alpha^*$; i.e., $\frac{1}{n} \sum_{b \in \mathcal{B}} n_b \le \alpha^*$.
Previous works on robust aggregation [6,7,8,9,15] either used Assumption 1, without considering the unbalancedness of the data, or implicitly used Assumption 2. However, we observe that Assumption 2 is unattainable in practice since Byzantine clients can often influence their own weight.
We address this gap with the following definition and an appropriate Preprocess procedure.
Definition 1
(mwp). Given a proportion $p$ and a weights vector $V = (v_1, \ldots, v_{|V|})$ sorted in decreasing order, the maximal weight proportion, $\mathrm{mwp}(V, p)$, is the maximum combined weight for any $p$-proportion of the values of $V$:
$$\mathrm{mwp}(V, p) := \frac{\sum_{i=1}^{t} v_i}{\sum_{i=1}^{|V|} v_i}, \quad \text{where} \quad t = \lceil p |V| \rceil.$$
Note that this is just the weight proportion of the $\lceil p |V| \rceil$ clients with the largest sample sizes.
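The definition translates directly into code. The sketch below is illustrative only: the function name `mwp` mirrors the text, but rounding $p|V|$ up to an integer and the small tolerance used to absorb floating-point error are assumptions of this sketch.

```python
import math


def mwp(weights, p):
    """Maximal weight proportion: share of total weight held by the top p-fraction."""
    ordered = sorted(weights, reverse=True)
    # Number of clients in a p-proportion; the 1e-9 guard avoids rounding
    # 7.000000000000001 up to 8 because of floating-point error.
    t = math.ceil(p * len(ordered) - 1e-9)
    return sum(ordered[:t]) / sum(ordered)


sizes = [50, 30, 10, 5, 5]
top_one = mwp(sizes, 0.2)  # the single largest client out of five
top_two = mwp(sizes, 0.4)  # the two largest clients out of five
```

Here one client out of five already holds half of the total weight, which is the kind of imbalance the preprocessing step is meant to cap.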
In the rest of this work, we assume Assumption 1 and design a Preprocess procedure that ensures the following:
$$\mathrm{mwp}(\mathrm{Preprocess}(N), \alpha) \le \alpha^*. \tag{3}$$
Observe that this requirement enables the use of weighted robust mean estimators in a realistic setting by ensuring that Assumption 2 holds for the preprocessed client sample sizes. It should also be noted that here, $\alpha$ is our assumption about the proportion of Byzantine clients, whereas $\alpha^*$ relates to an analytical property of the underlying robust algorithm. For example, we may replace the federated average with a weighted median as suggested by Chen et al. [11], in which case, $\alpha^*$ must be less than $1/2$.

2.2.2. Truncating the Values of N

Our suggested preprocessing procedure uses element-wise truncation of the values of $N$ by some value $U$, marked $\mathrm{trunc}(N, U) = (\min(n_1, U), \ldots, \min(n_K, U))$. Given $\alpha$ and $\alpha^*$, we search for the maximal truncation that satisfies Equation (3):
$$U^* := \arg\max_{U \in \mathbb{R}} \; \mathrm{mwp}(\mathrm{trunc}(N, U), \alpha) \le \alpha^*. \tag{4}$$
Here, $\alpha$ and $U^*$ present a trade-off. A higher $\alpha$ means more Byzantine tolerance but requires a smaller truncation value $U^*$, which may cause slower and less accurate convergence, as we demonstrate empirically in Section 3 and theoretically in Theorem 3.

Finding $U^*$ Given $\alpha$

If one has an estimate for $\alpha$, it is easy to calculate $U^*$, for example, by going over the values in $N$ in decreasing order until finding a value that satisfies the inequality in Equation (4). We then mark the index of this value by $u$ and use the fact that in the range $[n_{u-1}, n_u]$, we can express $\mathrm{mwp}(\mathrm{trunc}(N, U), \alpha)$ as a simple function of the form $\frac{a + bU}{c + dU}$:
$$\frac{\sum_{i=u}^{t} n_i + |\{n_i : i < \min(t, u)\}| \, U}{\sum_{i=u}^{K} n_i + |\{n_i : i < u\}| \, U} = \frac{\sum_{i=u}^{t} n_i + \min(t, u) \, U}{\sum_{i=u}^{K} n_i + u \, U}$$
for which we can solve Equation (4) with
$$U \leftarrow \frac{a - c \, \alpha^*}{d \, \alpha^* - b}.$$
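The search described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the variable `above` counts the clients whose size exceeds the threshold (it plays the role of the index $u$, with the sketch's own 0-based indexing), `rest` corresponds to $c$ in the closed form, and rounding $\alpha K$ up is an assumption.

```python
import math


def trunc(sizes, threshold):
    """Element-wise truncation of sample sizes at the given threshold."""
    return [min(n_k, threshold) for n_k in sizes]


def mwp(weights, p):
    ordered = sorted(weights, reverse=True)
    t = math.ceil(p * len(ordered) - 1e-9)
    return sum(ordered[:t]) / sum(ordered)


def find_truncation_threshold(sizes, alpha, alpha_star):
    """Largest U such that mwp(trunc(sizes, U), alpha) <= alpha_star."""
    ordered = sorted(sizes, reverse=True)
    K = len(ordered)
    t = math.ceil(alpha * K - 1e-9)
    if t / K > alpha_star:
        raise ValueError("infeasible: even uniform weights exceed alpha_star")
    if mwp(ordered, alpha) <= alpha_star:
        return float(ordered[0])  # no truncation is needed
    for above in range(1, K):
        # Candidate segment: U between ordered[above] and ordered[above - 1].
        if mwp(trunc(ordered, ordered[above]), alpha) > alpha_star:
            continue  # even the lower end of this segment is infeasible
        a = sum(ordered[above:t])  # untruncated part of the top-t sum
        b = min(t, above)          # truncated clients inside the top t
        rest = sum(ordered[above:])  # all untruncated clients
        d = above                  # all truncated clients
        return (a - alpha_star * rest) / (alpha_star * d - b)
    return float(ordered[-1])


sizes = [100, 20, 20, 20, 20, 10, 5, 3, 1, 1]
threshold = find_truncation_threshold(sizes, alpha=0.1, alpha_star=0.25)
```

For this toy vector, the largest client is capped at 100/3, which is exactly the value at which it holds a quarter of the truncated total.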

Optimality

Theorem 1.
Given Byzantine proportion $\alpha$ and Byzantine weight proportion $\alpha^*$, then setting $\tilde{N} = \mathrm{trunc}(N, U^*) = (\tilde{n}_1, \ldots, \tilde{n}_K)$ for $U^*$ given in Equation (4) gives an optimal solution for the following minimization problem: $\min_{N' = (n'_1, \ldots, n'_K)} \|N - N'\|_1$ s.t. Equation (3) is satisfied (for $\mathrm{Preprocess}(N) = N'$) and $\forall k \in [K] : n'_k \le n_k$.
Proof. 
See Appendix A.1. □

The $\alpha$-$U^*$ Trade-Off

When we do not know $\alpha$, as a practical procedure, we suggest plotting $U^*$ as a function of $\alpha$. In order to do so, we can start with $\alpha \leftarrow \alpha^*$, $U \leftarrow n_1$, and alternate between decreasing $\alpha$ by $1/K$ (one less Byzantine client tolerated) and solving Equation (4). This procedure can be made efficient by saving intermediate sums and using a specialized data structure for trimmed collections. See Algorithm 3 for the pseudocode and Figure 1 for an example output.
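Algorithm 3 Plot ($\alpha$, $U^*$) Pairs
$\alpha \leftarrow \alpha^*$
for $u \leftarrow 1$ to $K - 1$ do
    while $\mathrm{mwp}(\mathrm{trunc}(N, n_{u+1}), \alpha) > \alpha^*$ do
        $U^* \leftarrow$ solve Equation (5) for $U \in [n_u, n_{u+1}]$
        output $(\alpha, U^*)$
        $\alpha \leftarrow \alpha - \frac{1}{K}$
    end while
end for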
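For small client counts, the same ($\alpha$, $U^*$) pairs can be produced with a much simpler, less efficient sketch. Note the substitutions: instead of the incremental sweep of Algorithm 3, this version runs an independent bisection for each value of $\alpha$ (which works because the maximal weight proportion does not decrease as the threshold grows), and it enumerates $\alpha$ upward in steps of $1/K$. All function names are hypothetical, and sample sizes are assumed to be positive.

```python
import math


def trunc(sizes, threshold):
    return [min(n_k, threshold) for n_k in sizes]


def mwp(weights, p):
    ordered = sorted(weights, reverse=True)
    t = math.ceil(p * len(ordered) - 1e-9)
    return sum(ordered[:t]) / sum(ordered)


def largest_feasible_threshold(sizes, alpha, alpha_star, iterations=60):
    """Bisection for the largest U with mwp(trunc(sizes, U), alpha) <= alpha_star."""
    hi = float(max(sizes))
    if mwp(sizes, alpha) <= alpha_star:
        return hi  # no truncation needed
    lo = float(min(sizes))  # truncating at the minimum makes all weights equal
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mwp(trunc(sizes, mid), alpha) <= alpha_star:
            lo = mid
        else:
            hi = mid
    return lo


def alpha_threshold_pairs(sizes, alpha_star):
    """List (alpha, U*) for every alpha = j/K that does not exceed alpha_star."""
    K = len(sizes)
    pairs = []
    for byzantine_clients in range(1, K + 1):
        alpha = byzantine_clients / K
        if alpha > alpha_star:
            break
        pairs.append((alpha, largest_feasible_threshold(sizes, alpha, alpha_star)))
    return pairs


sizes = [100, 20, 20, 20, 20, 10, 5, 3, 1, 1]
pairs = alpha_threshold_pairs(sizes, alpha_star=0.5)
```

On this toy vector the threshold shrinks from 100 (no truncation) to 1 (uniform weights) as the tolerated Byzantine proportion grows from 10% to 50%, which illustrates the trade-off discussed above.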

2.2.3. Truncation Given a Partial View of N

When $K$ is very large, we may want to sample only $k \ll K$ elements IID from $N$. In this case, we will need to test that the inequality in Equation (4) holds with high probability. Such a test is useful in practice, as the number of clients and their dataset sizes are not always tractable or known in advance, especially in cross-device federated learning scenarios.
We consider $k$ discrete random variables taken IID from $N$ after truncation by $U$, that is, taken from a distribution over $\{0, \ldots, U\}$. We mark these random variables as $X_1, \ldots, X_k$, and their order statistics as $X_{(1)}, \ldots, X_{(k)}$, where $X_{(1)} \le \cdots \le X_{(k)}$.
Theorem 2.
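Given parameter $\delta > 0$ and $\varepsilon_1 = \sqrt{\frac{\ln(3/\delta)}{2k}}$, $\varepsilon_2 = U \sqrt{\frac{\ln\ln(3/\delta)}{2(k(\alpha - \varepsilon_1) + 1)}}$, $\varepsilon_3 = U \sqrt{\frac{\ln\ln(3/\delta)}{2k}}$, we have that $\mathrm{mwp}(\mathrm{trunc}(N, U), \alpha) \le \alpha^*$ is true with $1 - \delta$ confidence if the following holds:
$$\alpha \cdot \frac{\dfrac{\sum_{i \ge (1 - (\alpha - \varepsilon_1)) k}^{k} X_{(i)}}{k - (1 - (\alpha - \varepsilon_1)) k + 1} + \varepsilon_2}{\dfrac{1}{k} \sum_{i \in [k]} X_i - \varepsilon_3} \le \alpha^*.$$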
Proof. 
See Appendix A.2. □

2.2.4. Convergence Analysis

We split our analysis into two parts. In the first part, we analyze our truncation preprocessing step, showing how it bounds the skewness of the learning objective that may be caused by Byzantine clients. In the second part, we discuss stronger guarantees of convergence.

Objective Function Skewness Bound

After applying our Preprocess procedure, we have the truncated number of samples per client, marked $\{\tilde{n}_k\}_{k \in [K]}$. We can trivially ensure that any algorithm instance works as expected by requiring that clients ignore samples that were truncated. That is, even if an honest (non-Byzantine) client $k$ has $n_k$ samples, it may use only $\tilde{n}_k$ samples during its ClientUpdate.
Although this solution always preserves the semantics of any underlying algorithm, it does hurt convergence guarantees since the total number of samples decreases [Tables 5 and 6 in Kairouz et al. [3]; Yin et al. [12]; Haddadpour and Mahdavi [9]]. Interestingly, Theorem 3 in Li et al. [16] analyzes the baseline FedAvg and shows that the convergence bound increases with $\max n_k / \min n_k$ (marked there as $\nu / \varsigma$). This suggests that in some cases, unbalancedness itself deteriorates the convergence rate, a phenomenon that may be mitigated by truncation to some degree.
Additionally, we note that when the clients’ sample distributions are similar, the performance of federated averaging-based algorithms improves when honest clients use all their original $n_k$ samples. Intuitively, this follows from the observation that Aggregate procedures are generally composite mean estimators and ClientUpdate calls are likely to produce more accurate results given more samples.
Lastly, as we have mentioned before, convergence is guaranteed, but we note that the optimization goal itself is inevitably skewed in our Byzantine scenario. The following theorem bounds this difference between the original weighted optimization goal (Equation (2)) and the new goal after truncation. In order to emphasize the necessity of this bound (in terms of Assumption 2), we use an overdot and tilde to signify unreliable and truncated values, respectively, as previously described in Section 2.1.2.
Theorem 3.
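Given the same setup as in Equation (1) and a truncation bound $U$, the following holds for all $w \in \mathbb{R}^d$:
$$\frac{1}{\dot{n}} \sum_{i \in [K]} \dot{n}_i F_i(w) - \frac{1}{\tilde{n}} \sum_{i \in [K]} \tilde{n}_i F_i(w) \le \sum_{i : \dot{n}_i > U} \left( \frac{\dot{n}_i}{\dot{n}} - \frac{1}{K} \right) F_i(w) + \left( \frac{1}{\dot{n}} - \frac{1}{\tilde{n}} \right) \sum_{i : \dot{n}_i \le U} L(Z_i),$$
where $L(Z_i)$ is defined as $\sum_{z \in Z_i} \ell(w; z)$.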
Proof. 
Using the fact that $\tilde{n} \le U K$, we obtain:
$$\begin{aligned}
\frac{1}{\dot{n}} \sum_{i \in [K]} \dot{n}_i F_i(w) - \frac{1}{\tilde{n}} \sum_{i \in [K]} \tilde{n}_i F_i(w)
&= \sum_{i : \dot{n}_i > U} \frac{\dot{n}_i}{\dot{n}} F_i(w) + \frac{1}{\dot{n}} \sum_{i : \dot{n}_i \le U} L(Z_i) - \sum_{i : \dot{n}_i > U} \frac{U}{\tilde{n}} F_i(w) - \frac{1}{\tilde{n}} \sum_{i : \dot{n}_i \le U} L(Z_i) \\
&\le \sum_{i : \dot{n}_i > U} \frac{\dot{n}_i}{\dot{n}} F_i(w) + \frac{1}{\dot{n}} \sum_{i : \dot{n}_i \le U} L(Z_i) - \sum_{i : \dot{n}_i > U} \frac{1}{K} F_i(w) - \frac{1}{\tilde{n}} \sum_{i : \dot{n}_i \le U} L(Z_i) \\
&= \sum_{i : \dot{n}_i > U} \left( \frac{\dot{n}_i}{\dot{n}} - \frac{1}{K} \right) F_i(w) + \left( \frac{1}{\dot{n}} - \frac{1}{\tilde{n}} \right) \sum_{i : \dot{n}_i \le U} L(Z_i).
\end{aligned}$$
From the bound in Theorem 3, we can clearly see how the coefficients in the left term, $(\dot{n}_i / \dot{n} - 1/K)$, stem from unbalancedness in the values above the truncation threshold, whereas the coefficient in the right term, $(1/\dot{n} - 1/\tilde{n})$, accounts for the increase of relative weight of the values below the truncation threshold. Additionally, note that this formulation demonstrates how a single Byzantine client can increase this difference arbitrarily by increasing its $\dot{n}_i$. Lastly, observe how both terms vanish as $U$ increases, which motivates our selection of $U^*$ as the maximal truncation threshold for any given $\alpha$ and $\alpha^*$.
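To make the bound concrete, the following sketch evaluates both sides of the inequality in Theorem 3 for a toy configuration. It assumes non-negative per-client losses and, for clients at or below the threshold, takes $L(Z_i)$ to be the reported size times the client's mean loss; the function name and the numbers are illustrative only.

```python
def objective_gap_and_bound(reported_sizes, client_losses, threshold):
    """Return (left-hand side, right-hand side) of the Theorem 3 inequality."""
    K = len(reported_sizes)
    truncated = [min(n, threshold) for n in reported_sizes]
    n_dot, n_tilde = sum(reported_sizes), sum(truncated)
    pairs = list(zip(reported_sizes, client_losses))

    # Difference between the reported-weight and truncated-weight objectives.
    gap = (sum(n * f for n, f in pairs) / n_dot
           - sum(n * f for n, f in zip(truncated, client_losses)) / n_tilde)

    # Term for clients above the threshold.
    above = sum((n / n_dot - 1 / K) * f for n, f in pairs if n > threshold)
    # Term for clients at or below the threshold, with L(Z_i) = n_i * F_i(w).
    below = (1 / n_dot - 1 / n_tilde) * sum(n * f for n, f in pairs if n <= threshold)
    return gap, above + below


# One client reports a size far larger than the other two.
gap, bound = objective_gap_and_bound(
    reported_sizes=[1000, 10, 10],
    client_losses=[0.5, 1.0, 2.0],
    threshold=20,
)
```

Re-running the function with an even larger first entry shows how a single inflated report changes both sides of the inequality.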

Convergence Guarantees

Having bounded the difference between the pre-truncation and post-truncation training objectives, we can now utilize known convergence guarantees.
Such guarantees and their respective proofs depend on the specific underlying Aggregate procedures and are orthogonal to our work. Nevertheless, we include two examples of how to account for our preprocessing step in the convergence analysis of specific underlying algorithms. In the first example, which we defer to Appendix C, we restate and adapt for truncation the convergence analysis given by Li et al. [16] for the standard, non-Byzantine-tolerant, FedAvg algorithm. In the second example, we do the same for the Byzantine-robust trimmed mean aggregation procedure by Yin et al. [12]. This establishes that convergence is still guaranteed, even though the weights of some clients are truncated.
The following definitions and assumptions are required for the analysis of Yin et al. [12]:
Definition 2
(Coordinate-wise trimmed mean). For $\beta \in [0, \frac{1}{2})$ and vectors $x^i \in \mathbb{R}^d$, $i \in [m]$, the coordinate-wise $\beta$-trimmed mean $g := \mathrm{trmean}_\beta \{x^i : i \in [m]\}$ is a vector with its $k$-th coordinate being $g_k = \frac{\sum_{i \in R_k} n_i x^i_k}{\sum_{i \in R_k} n_i}$ for each $k \in [d]$. Here, $R_k$ is a subset of indices from $\{x^1_k, \ldots, x^m_k\}$ obtained by removing the largest and smallest $\beta$ fraction of its elements.
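A small sketch of this weighted, coordinate-wise trimmed mean is given below. It is an illustration under assumptions: vectors are plain Python lists, the number of values removed from each end is taken as the floor of $\beta m$, and the function name is hypothetical.

```python
import math


def weighted_trimmed_mean(vectors, weights, beta):
    """Coordinate-wise beta-trimmed mean, weighting the surviving values."""
    m, d = len(vectors), len(vectors[0])
    drop = math.floor(beta * m)  # values removed from each end of every coordinate
    result = []
    for k in range(d):
        # Client indices ordered by their value in coordinate k.
        order = sorted(range(m), key=lambda i: vectors[i][k])
        kept = order[drop:m - drop]
        total = sum(weights[i] for i in kept)
        result.append(sum(weights[i] * vectors[i][k] for i in kept) / total)
    return result


vectors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -100.0], [4.0, 40.0]]
weights = [1.0, 1.0, 2.0, 1.0, 1.0]
trimmed = weighted_trimmed_mean(vectors, weights, beta=0.2)
```

The outlying fourth vector is discarded in both coordinates, and the remaining values are averaged according to their weights.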
Definition 3
(Sub-exponential random variables). A random variable $X$ with $\mathbb{E}[X] = \mu$ is called $v$-sub-exponential if $\mathbb{E}\left[e^{\lambda (X - \mu)}\right] \le e^{\frac{1}{2} v^2 \lambda^2}$, $\forall |\lambda| < \frac{1}{v}$.
Definition 4
(Lipschitz). $h$ is $L$-Lipschitz if $|h(w) - h(w')| \le L \|w - w'\|$, $\forall w, w'$.
Definition 5
(Smoothness). $h$ is $L'$-smooth if $\|\nabla h(w) - \nabla h(w')\| \le L' \|w - w'\|$, $\forall w, w'$.
Definition 6
(Strong convexity). $h$ is $\lambda$-strongly convex if $h(w') \ge h(w) + \langle \nabla h(w), w' - w \rangle + \frac{\lambda}{2} \|w' - w\|^2$, $\forall w, w'$.
Assumption 3
(Parameter space convexity and compactness). We assume that the parameter space $\mathcal{W}$ is convex and compact with diameter $D$, i.e., $\|w - w'\| \le D$, $\forall w, w' \in \mathcal{W}$.
Assumption 4
(Smoothness of $\ell$ and $F$). We assume that for any $z \in \mathcal{Z}$, the partial derivative of $\ell(\cdot; z)$ with respect to the $k$-th coordinate of its first argument, denoted by $\partial_k \ell(\cdot; z)$, is $L_k$-Lipschitz for each $k \in [d]$, and the function $\ell(\cdot; z)$ is $L$-smooth. Let $\hat{L} := \sqrt{\sum_{k=1}^{d} L_k^2}$. We also assume that the population loss function $F(\cdot)$ is $L_F$-smooth.
Assumption 5
(Sub-exponential gradients). We assume that for all $k \in [d]$ and $w \in \mathcal{W}$, the partial derivative of $\ell(w; z)$ with respect to the $k$-th coordinate of $w$, $\partial_k \ell(w; z)$, is $v$-sub-exponential.
Define
$$m = \min_{i \in [K]} n_i, \qquad \tilde{m} = \min_{i \in [K]} \tilde{n}_i.$$
In order to prove convergence with truncation, we can adapt Theorems 4–6 from [12], which provide convergence guarantees for the trimmed mean aggregation procedure for the strongly convex, non-strongly convex, and non-convex cases, respectively. We restate Theorem 4 (for the trimmed mean aggregation procedure), where $\Delta$ is an upper bound that we prove on the distance between the trimmed mean aggregation of client-reported stochastic gradients and the true gradient (i.e., $\|g(w) - \nabla F(w)\|$).
Theorem 4
([12] Theorem 4). Consider the trimmed mean aggregation procedure. Suppose that Assumptions 4 and 5 hold, $F(\cdot)$ is $\lambda_F$-strongly convex, and $\alpha \le \beta \le \frac{1}{2} - \epsilon$ for some $\epsilon > 0$. We choose a step-size of $\eta = 1 / L_F$. Then, with a probability of at least $1 - O\!\left( \frac{d}{\hat{L}^d D^d K^{d-1} m^d} \right)$, after $T$ parallel iterations, we have
$$\|w^T - w^*\| \le \left( 1 - \frac{\lambda_F}{L_F + \lambda_F} \right)^T \|w^0 - w^*\| + \frac{2}{\lambda_F} \Delta,$$
where $w^t$ denotes the model-weights vector at the server at the end of step number $t$ such that 0 is the first step of the algorithm. For brevity, we do not repeat Theorems 5 and 6 from [12]. Note that they similarly depend on $\Delta$.
In the following lemma, we prove the upper bound $\Delta$. This lemma is an adaptation of Theorem 11 in [12] to our setting. Given this updated bound, the proofs of Theorems 4–6 follow precisely as in [12].
Lemma 1.
Define
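$$g_i(w^t) := \begin{cases} \ast & i \in \mathcal{B} \\ \nabla F_i(w^t) & \text{else} \end{cases}, \qquad g(w^t) := \mathrm{trmean}_\beta \{ g_i(w^t) : i \in [K] \}.$$
Suppose that Assumptions 4 and 5 are satisfied, and that $\alpha \le \beta \le \frac{1}{2} - \epsilon$. Then, with probability at least $1 - O\!\left( \frac{d}{\hat{L}^d D^d K^{d-1} m^d} \right)$,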
g ( w ) F ( w ) O v d U m ˜ m log 1 + L ^ D i = 1 K n i + 1 d log K + O ˜ U K m ˜ m .
Proof. 
See Appendix D. □
It can be seen that $\Delta$ increases with $d$ and $U$ and decreases with $m$ and $\tilde{m}$, with the latter three showing that the bound becomes less tight when the distributed dataset becomes more imbalanced.

3. Experimental Evaluation

In this section, we demonstrate how truncating $N$ is a crucial requirement for Byzantine robustness. That is, we show that no matter what the specific attack or aggregation method is, using $N$ “as-is” categorically voids any robustness guarantees.
The code for the experiments is based on the TensorFlow machine learning library (Abadi et al. [17]). Specifically, the code for the Shakespeare experiments is based on the TensorFlow Federated sub-library of TensorFlow, which is provided under the Apache 2.0 license. We performed the experiments using a single NVIDIA GeForce RTX 2080 Ti GPU, but the results are easily reproducible on any device. Our code is provided under the MIT license and can be found in the following GitHub repository: https://github.com/amitport/Towards-Federated-Learning-with-Byzantine-Robust-Client-Weighting, accessed on 1 July 2022.

3.1. Experimental Setup

3.1.1. The Machine Learning Tasks and Models

Shakespeare: Next-Character-Prediction Partitioned by Speaker

Presented in the original FedAvg paper [2] and also as part of the LEAF benchmark [18], the Shakespeare dataset contains 422,615 sentences taken from The Complete Works of William Shakespeare [19] (freely available public domain texts). The next-character-prediction task with the per-speaker partitioning represents a realistic scenario in the FL domain. Each client trains using an LSTM recurrent model [20] with hyperparameters matching those suggested by Reddi et al. [21] for FedAvg.

MNIST: Digit Recognition with Synthetic Client Partitioning

The MNIST database [22] (available under Creative Commons Attribution-ShareAlike 3.0 license) includes 28 × 28 grayscale labeled images of handwritten digits split into 60,000 training images and 10,000 testing images. We randomly partition the training set among 100 clients. The partition sizes are determined by taking 100 samples from a Lognormal distribution with $\mu = 1.5$, $\sigma = 3.45$ and then interpolating corresponding integers that sum to 60,000. This produces a right-skewed, fat-tailed partition size distribution that emphasizes the importance of correctly weighting aggregation rules and the effects of truncation. Clients train a classifier using a 64-unit perceptron with ReLU activation and 20% dropout, followed by a softmax layer. Following Yin et al. [12], on every communication round, all clients perform mini-batch SGD with 10% of their examples.
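The partition-size sampling can be approximated with the standard library alone. The text does not spell out how the integers are interpolated, so the proportional scaling, the minimum of one sample per client, and the adjustment of the largest client to hit the exact total are all assumptions of this sketch, as is the function name.

```python
import random


def lognormal_partition_sizes(num_clients, total, mu, sigma, seed=0):
    """Draw skewed client sizes from a lognormal and rescale them to sum to total."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(mu, sigma) for _ in range(num_clients)]
    scale = total / sum(raw)
    # Round down, but keep at least one sample per client.
    sizes = [max(1, int(r * scale)) for r in raw]
    # Absorb the rounding drift in the largest client so the sizes sum exactly.
    largest = max(range(num_clients), key=lambda i: sizes[i])
    sizes[largest] += total - sum(sizes)
    return sizes


sizes = lognormal_partition_sizes(num_clients=100, total=60000, mu=1.5, sigma=3.45)
largest_share = max(sizes) / sum(sizes)
```

With such a large $\sigma$, a handful of clients typically hold most of the samples, which is the fat-tailed regime the experiment is designed to stress.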
Note that the Shakespeare and MNIST synthetic tasks were selected because they are relatively simple, unbalanced tasks: simple, because we want to evaluate a preprocessing phase and avoid tuning of the underlying algorithms we compare; unbalanced, because, as can be understood from Theorem 3, when the client sample sizes are spread mostly evenly, ignoring the client sample size altogether is a viable approach. See Figure 2 for the histograms of the partitions.

3.1.2. The Server

We show three Aggregate procedures: arithmetic mean, as used by the original FedAvg, and two additional procedures that replace the arithmetic mean with robust mean estimators. The first of the latter uses the coordinatewise median [11,12]. That is, each server model coordinate is taken as the median of the clients’ corresponding coordinates. The second robust aggregation method uses the coordinatewise trimmed mean [12], which, for a given hyperparameter $\beta$, first removes the $\beta$-proportion lowest and $\beta$-proportion highest values in each coordinate and only then calculates the arithmetic mean of the remaining values.
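The difference between the arithmetic mean and the coordinatewise median can be seen on a tiny example. This sketch is unweighted and uses plain Python lists; the function names and the toy vectors are illustrative assumptions.

```python
import statistics


def coordinatewise_mean(client_vectors):
    """Arithmetic mean of every coordinate across clients."""
    return [sum(coordinate) / len(coordinate) for coordinate in zip(*client_vectors)]


def coordinatewise_median(client_vectors):
    """Median of every coordinate across clients."""
    return [statistics.median(coordinate) for coordinate in zip(*client_vectors)]


# Four similar updates and one extreme update in the last position.
client_vectors = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [1000.0, -1000.0],
]
mean_result = coordinatewise_mean(client_vectors)
median_result = coordinatewise_median(client_vectors)
```

The single extreme vector drags the mean far away from the other updates, while the median stays within their range.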
When preprocessing the client-declared sample size, we compare three options: we either ignore client sample size, truncate according to $\alpha = 10\%$ and $\alpha^* = 50\%$, or just pass through the client sample size as reported. Note that the $\alpha^*$ value is derived from the fact that the robust median and trimmed mean defenses require at least 50% honest clients [12].

3.1.3. The Clients and Attackers

We examine a model negation attack [5]. In this attack, each attacker “pushes” the model towards zero by always returning a negation of the server’s model. When the data distribution is balanced, this attack is easily neutralized since Byzantine clients typically send easily detectable extreme values. However, in our unbalanced case, we demonstrate that without our preprocessing step, this attack cannot be mitigated even by robust aggregation methods.
In order to provide comparability, we additionally follow the experiment shown by Yin et al. [12], in which 10% of the clients use a label shifting attack on the MNIST task. In this attack, Byzantine clients train normally except for the fact that they replace every training label $y$ with $9 - y$. The values sent by these clients are then incorrect but are relatively moderate in value, making their attack somewhat harder to detect.
We first execute our experiment without any attacks for every server aggregation and preprocessing combination. Then, for each attack type, we repeat the process two additional times: (1) with a single attacker that declares 10 million samples, and (2) with 10% attackers that declare 1 million samples each.
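The two attacks, and the effect of an inflated sample-size declaration on a weighted mean, can be sketched as follows. The helper names, the toy model, and the honest sample size of 600 are illustrative assumptions; only the negation rule, the $9 - y$ label shift, and the 10 million declared samples come from the text.

```python
def model_negation_update(server_model):
    """Byzantine update that returns the negation of the server's model."""
    return [-value for value in server_model]


def shift_label(label, num_classes=10):
    """Label shifting: for ten digit classes, y is replaced by 9 - y."""
    return (num_classes - 1) - label


def weighted_mean(updates, weights):
    """Sample-size-weighted arithmetic mean of client updates."""
    total = sum(weights)
    return [sum(w * u[j] for u, w in zip(updates, weights)) / total
            for j in range(len(updates[0]))]


server_model = [1.0, -2.0]
honest_updates = [[1.1, -1.9] for _ in range(9)]
updates = honest_updates + [model_negation_update(server_model)]

# Nine honest clients with 600 samples each; one attacker declaring 10 million.
declared_sizes = [600] * 9 + [10_000_000]
poisoned = weighted_mean(updates, declared_sizes)
```

Because the declared sizes are used directly as weights, the aggregate is almost exactly the attacker's negated model, even though nine of the ten clients are honest.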

3.2. Experimental Results

The Shakespeare experiments without any attackers are shown in Figure 3, and the executions with attackers are shown in Figure 4. The results of the MNIST experiments are similar to the results of the Shakespeare experiments and are relegated to Appendix B.
Figure 3 shows that using truncation (dashed orange curve, labeled “Truncate”) remains comparable to the properly weighted mean estimators, where no preprocessing is done (solid blue curve, labeled “Passthrough”), whereas ignoring clients’ sample sizes and weighing the clients uniformly (dotted green curve, labeled “Ignore”) is sub-optimal. This effect is pronounced when the unweighted median is used (middle column, dotted green curve), since it is generally very far from the mean with our unbalanced partition.
From Figure 4, we observe that even with a single attacker performing a trivial attack (first row), using the weights directly without preprocessing (solid blue curve) is devastating. In contrast, when our truncation-based preprocessing method is used in conjunction with robust mean aggregations (i.e., median and trimmed mean; dashed orange curve, last two columns), convergence remains stable even when there are $\alpha$ (=10%) attackers (second row). The same cannot be said for the regular mean aggregator, as can be seen from the sub-optimal accuracy (second row) and an occasional accuracy drop (first row) in the leftmost column. The drops can be explained by the fact that in each round we randomly select clients for training, so the Byzantine clients have varying effects across rounds. We note that our method may be slightly less accurate in some cases than ignoring reported client sample sizes and weighting clients uniformly (dotted green curve, second row, middle column). This is expected because we allow Byzantine clients to potentially get close to the $\alpha^*$-proportion (50%, in this case) of the weight. However, our method is significantly closer to the optimal solution when there are no or only a few attackers (see Figure 3). Moreover, when used with robust mean aggregation methods, it maintains their robustness properties.
The results from the first experiment, running without any attackers (Figure 3), demonstrate that ignoring client sample size results in reduced accuracy, especially when median aggregation is used, whereas truncating according to our procedure is significantly better and is on par with properly using all weights. These results highlight the imperativeness of using sample size weights when performing server aggregations.
Whereas Figure 3 shows that truncation-based preprocessing performs on par with taking all weights into consideration when all clients are honest, Figure 4 demonstrates that the results are very different when there is an attack. In this case, we see that when even a single attacker reports a highly exaggerated sample size and the server relies on all the values of $N$, the performance of all aggregation methods, including robust median and trimmed mean, quickly degrades.
In contrast, in our experiments, robustness is maintained when truncation-based preprocessing is used in conjunction with robust mean aggregations such as median or trimmed mean, even when Byzantine clients attain the maximally supported proportion, $\alpha = 10\%$.

4. Discussion

The results of our experiments establish that: (1) in the absence of attacks, model convergence with truncation is on par with that of properly using all reported weights, and (2) when attacks do occur, combining truncation-based preprocessing with robust aggregation incurs almost no penalty compared with using all weights in the absence of attacks, whereas without preprocessing, even robust aggregation methods collapse to a performance that is worse than that of a random classifier.
When the number of clients is very large, performing preprocessing and aggregation on the server may become computationally infeasible. We prove that, in this case, truncation-based preprocessing can achieve the same upper bound on $\alpha^*$ with high probability based on the weight values reported by a sufficiently large number of clients selected IID.
In future work, we plan to further analyze the trade-off between robustness and the usage of client sample size in rectifying data unbalancedness. We also plan to investigate alternative forms of estimating client importance that may avoid client sample size altogether.

Author Contributions

Conceptualization, A.P. and D.H.; methodology, A.P. and D.H.; software, A.P. and Y.T.; validation, A.P. and Y.T.; formal analysis, A.P. and Y.T.; investigation, A.P. and Y.T.; resources, N/A; data curation, A.P.; writing—original draft preparation, A.P., Y.T. and D.H.; writing—review and editing, A.P., Y.T. and D.H.; visualization, A.P. and Y.T.; supervision, A.P. and D.H.; project administration, D.H.; funding acquisition, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Cyber Security Research Center, Ben-Gurion University of the Negev.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available in a publicly accessible repository. The data presented in this study are openly available in arxiv and in the Wolfram Data Repository at https://doi.org/10.48550/arXiv.1812.01097 and https://doi.org/10.24097/wolfram.62081.data, all accessed on 1 July 2022.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
FL: Federated Learning
ERM: Empirical Risk Minimization
FedAvg: Federated Averaging
SGD: Stochastic Gradient Descent
MWP: Maximal Weight Proportion
trunc: truncation

Appendix A. Proofs

Appendix A.1. Truncation Optimality Proof

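From the condition that $\forall k \in [K] : \tilde{n}_k \le n_k$, we get:
$$\lVert N - \tilde{N} \rVert_1 = \sum_{k=1}^{K} \lvert n_k - \tilde{n}_k \rvert = \sum_{k=1}^{K} (n_k - \tilde{n}_k) = \sum_{k=1}^{K} n_k - \sum_{k=1}^{K} \tilde{n}_k$$
This means that
$$\lVert N - \hat{N} \rVert_1 = \lVert N - \operatorname{trunc}(N, U) \rVert_1 \;\Rightarrow\; \sum_{k=1}^{K} n_k - \sum_{k=1}^{K} \hat{n}_k = \sum_{k=1}^{K} n_k - \sum_{k=1}^{K} \tilde{n}_k \;\Rightarrow\; \sum_{k=1}^{K} \hat{n}_k = \sum_{k=1}^{K} \tilde{n}_k$$
for some vector $\hat{N} = (\hat{n}_1, \dots, \hat{n}_K)$.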
We first prove a helper claim:
Lemma A1.
$\forall d \ge 0,\ N \in \mathbb{N}^{K} : \exists U : \lVert N - \operatorname{trunc}(N, U) \rVert_1 = d$. This value is unique.
Proof. 
This $U$ can be found using a method similar to the one shown in Section Finding U Given α. Any change to $U$ would obviously result in a different $\ell_1$ distance. □
Using this lemma, we can define the following:
Definition A1
(Equivalent truncation solution). Let there be a solution $\tilde{N}$ to the optimization problem. We define the equivalent truncation solution to $\tilde{N}$ as the solution given by the $U$ value for which $\lVert N - \operatorname{trunc}(N, U) \rVert_1 = \lVert N - \tilde{N} \rVert_1$.
Using this definition, we prove the theorem:
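Let $\hat{N}$ be some optimal solution to the optimization problem. We mark the equivalent truncation solution by $\operatorname{trunc}(N, U)$. Note that both solutions have the same $\ell_1$ distance from $N$. Assume by contradiction that $\operatorname{mwp}(\hat{N}, \alpha) < \operatorname{mwp}(\operatorname{trunc}(N, U), \alpha)$. We see that:
$$\operatorname{mwp}(\hat{N}, \alpha) < \operatorname{mwp}(\operatorname{trunc}(N, U), \alpha) \;\Rightarrow\; \frac{\sum_{k=1}^{t} \hat{n}_k}{\sum_{k=1}^{K} \hat{n}_k} < \frac{\sum_{k=1}^{t} \tilde{n}_k}{\sum_{k=1}^{K} \tilde{n}_k} \;\Rightarrow\; \sum_{k=1}^{t} \hat{n}_k < \sum_{k=1}^{t} \tilde{n}_k$$
If $u \le t$:
$$\sum_{k=u}^{t} \hat{n}_k + \sum_{k=1}^{u} \hat{n}_k < \sum_{k=u}^{t} \tilde{n}_k + uU \;\Rightarrow\; \sum_{k=u}^{t} n_k + \sum_{k=1}^{u} \hat{n}_k < \sum_{k=u}^{t} n_k + uU \;\Rightarrow\; \sum_{k=1}^{u} \hat{n}_k < uU$$
Else:
$$\sum_{k=1}^{t} \hat{n}_k < tU$$
We reach a contradiction in both cases, as $tU$ / $uU$ minimizes $\sum_{k=1}^{t/u} n_k$ without having elements in $(n_1, \dots, n_{t/u})$ be smaller than values with a higher index.
This gives us that $\operatorname{mwp}(\operatorname{trunc}(N, U), \alpha) \le \operatorname{mwp}(\hat{N}, \alpha) \le \alpha^*$, which means that $\operatorname{trunc}(N, U)$ is a legal, optimal solution.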
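The two operators used throughout this proof can be illustrated with a minimal Python sketch. It rests on stated assumptions rather than on the authors' implementation: mwp is taken to be the share of the total weight held by the t largest entries (t is passed in directly instead of being derived from α), the candidate values for U are restricted to the reported weights themselves, and all function names and the example vector are hypothetical.

```python
from typing import List, Optional, Sequence


def trunc(weights: Sequence[float], u: float) -> List[float]:
    """Truncate every reported weight at the threshold u."""
    return [min(w, u) for w in weights]


def mwp(weights: Sequence[float], t: int) -> float:
    """Share of the total weight held by the t largest entries (assumed definition)."""
    top = sorted(weights, reverse=True)[:t]
    return sum(top) / sum(weights)


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """l1 distance between two weight vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def largest_feasible_u(weights: Sequence[float], t: int, bound: float) -> Optional[float]:
    """Largest candidate U (among the reported values) whose truncation keeps
    the top-t share at or below `bound`; None if no candidate qualifies."""
    best = None
    for u in sorted(set(weights)):
        # The top-t share does not decrease as u grows, so we can stop early.
        if mwp(trunc(weights, u), t) <= bound:
            best = u
        else:
            break
    return best


reported = [100, 50, 10, 10, 10, 10, 5, 5]
u = largest_feasible_u(reported, t=2, bound=0.5)  # u == 10 for this vector
```

In this example the two largest clients hold 75% of the untruncated weight; truncating at U = 10 lowers their share to 20/70 at an ℓ1 cost of 130, while the next candidate, U = 50, would already exceed the 50% bound.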

Appendix A.2. Truncation Given a Partial View of N Proof

First, in the scope of this proof, we use a couple of additional notations:
  • $\operatorname{top}(V, p)$: The collection of the $p\,|V|$ largest values in $V$.
  • $\sum V$: The sum of all elements in $V$.
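We observe that $\operatorname{mwp}(\operatorname{trunc}(N, U), \alpha) \le \alpha^*$ can be rewritten as
$$\operatorname{mwp}(\operatorname{trunc}(N, U), \alpha) = \frac{\sum \operatorname{top}(\operatorname{trunc}(N, U), \alpha)}{\sum \operatorname{trunc}(N, U)} = \alpha\, \frac{\mathbb{E}[\operatorname{top}(\operatorname{trunc}(N, U), \alpha)]}{\mathbb{E}[\operatorname{trunc}(N, U)]} \le \alpha^*.$$
Then we note that membership in $\operatorname{top}(\operatorname{trunc}(N, U), \alpha)$ can be viewed as a simple Bernoulli random variable with probability $\alpha$, for which we obtain the following bound using Hoeffding's inequality, $\forall t \ge 0$:
$$\Pr\Big( \big\lvert \{ i \in [k] : X_i \in \operatorname{top}(\operatorname{trunc}(N, U), \alpha) \} \big\rvert \le (\alpha - t)\, k \Big) \le e^{-2 t^2 k}.$$
Therefore, with $t = \varepsilon_1$, we have the following with $1 - \frac{\delta}{3}$ confidence: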
$$\operatorname{top}(\{ X_i \mid i \in [k] \}, \alpha) \supseteq \{ X_{(i)} \mid (1 - (\alpha - \varepsilon_1))\, k \le i \le k \}.$$
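Using Hoeffding's inequality again, we can bound the expectation of $\{ X_{(i)} \mid (1 - (\alpha - \varepsilon_1))\, k \le i \le k \}$ by $\varepsilon_2$ with $1 - \frac{\delta}{3}$ confidence and together with (A3) have that:
$$\mathbb{E}[\operatorname{top}(\operatorname{trunc}(N, U), \alpha)] \le \frac{\sum_{i \ge (1 - (\alpha - \varepsilon_1)) k}^{k} X_{(i)}}{k - (1 - (\alpha - \varepsilon_1))\, k + 1} + \varepsilon_2.$$
Then, using Hoeffding's inequality for the third time, $\mathbb{E}[\operatorname{trunc}(N, U)]$ is bounded from below, up to $\varepsilon_3$, with $1 - \frac{\delta}{3}$ confidence:
$$\mathbb{E}[\operatorname{trunc}(N, U)] \ge \frac{1}{k} \sum_{i \in [k]} X_i - \varepsilon_3.$$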
The proof is concluded by applying Equations (A3)–(A5) to Equation (A1) using the union bound.
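Each of the three Hoeffding steps above is allotted a failure probability of δ/3. As a rough illustration, the first bound, e^{-2t²k} ≤ δ/3 with t = ε₁, can be solved for the number of sampled clients k, giving k ≥ ln(3/δ) / (2ε₁²). The helper name below is hypothetical, and the sketch covers only this first step, not the full proof.

```python
import math


def min_sample_size(eps: float, delta: float) -> int:
    """Smallest k with exp(-2 * eps**2 * k) <= delta / 3.

    eps plays the role of epsilon_1 and delta is the overall failure
    probability that the union bound splits into three equal parts.
    """
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("eps must be positive and delta must lie in (0, 1)")
    return math.ceil(math.log(3.0 / delta) / (2.0 * eps ** 2))


k = min_sample_size(eps=0.05, delta=0.05)
```

The required sample size grows quadratically in 1/ε₁ but only logarithmically in 1/δ, which is why a modest sample can suffice even when the number of clients is very large.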

Appendix B. MNIST Experimental Results

The experiments without any attackers are shown in Figure A1, and the executions with attackers are shown in Figure A2.
The results of the MNIST experiments follow the same trends as those of the Shakespeare dataset, but there is one notable difference. When aggregating using the mean procedure (left column), it can be seen that except for a label shift attack with a single attacker (left column, top row), ignoring reported client weights and weighing the clients uniformly outperforms our truncation-based method (left column, middle, and bottom rows), sometimes even being the only preprocessing method to converge. This again can be explained by the fact that we allow Byzantine clients to potentially get close to the α-proportion (50%, in this case) of the weight.
Figure A1. Accuracy by round without any attackers for the MNIST experiments. Curves correspond to preprocessing procedures, and columns correspond to different aggregation methods.
Figure A2. Accuracy by round under Byzantine attacks for the MNIST experiments. In the first two rows, Byzantine clients perform a label shifting attack with 1% and 10% attackers, respectively. In the last two rows, we repeat the experiment with a model negation attack.

Appendix C. Convergence of FedAvg in the Absence of Byzantine Clients

For this analysis, we make the following assumptions:
Assumption A1.
Aggregate = weighted arithmetic mean.
Assumption A2.
All clients send their true ClientUpdate result.
Assumption A3.
Full device participation.
Assumption A4
(Smoothness). $\forall k \in [K]$ : $F_k$ is $L$-smooth (see Definition 5).
Assumption A5
(Strong convexity). $\forall k \in [K]$ : $F_k$ is $\mu$-strongly convex (see Definition 6).
Assumption A6
(Bounded variance). $\forall k \in [K]$ : Let $b_k \subseteq Z_k$ be the uniformly sampled $B$-sized batch of client $k$'s local data. The variance of stochastic gradients of each client is bounded: $\mathbb{E} \lVert \nabla F_k(w_k, b_k) - \nabla F_k(w_k) \rVert^2 \le \sigma_k^2$, where $\nabla F_k(w_k, b_k) = \frac{1}{B} \sum_{z \in b_k} \nabla \ell(w_k; z)$.
Assumption A7
(Bounded squared norm). $\forall k \in [K]$ : The expected squared norm of stochastic gradients is bounded: $\mathbb{E} \lVert \nabla F_k(w_k, b_k) \rVert^2 \le G^2$.
We also define the following:
Definition A2
(Client proportion). $\forall k \in [K]$ : $p_k = \frac{\tilde{n}_k}{n}$.
Definition A3
(Degree of heterogeneity). Let $F^*$ and $F_k^*$ be the minimum values of $F$ and $F_k$, respectively. The degree of heterogeneity is defined as: $\Gamma = F^* - \sum_{k=1}^{K} p_k F_k^*$.
As mentioned previously, we utilize the results of Li et al. [16] for our analysis.
Theorem A1
([16] Theorem 1). Let Assumptions A4–A7 hold and $L, \mu, \sigma_k, G$ be defined therein. Choose $\kappa = \frac{L}{\mu}$, $\gamma = \max\{8\kappa, E\}$ and the learning rate $\eta_t = \frac{2}{\mu(\gamma + t)}$. Then, FedAvg with full device participation satisfies
$$\mathbb{E}[F(w_T)] - F^* \le \frac{\kappa}{\gamma + T - 1} \left( \frac{2B}{\mu} + \frac{\mu \gamma}{2}\, \mathbb{E} \lVert w_0 - w^* \rVert^2 \right),$$
where
$$B = \sum_{k=1}^{K} p_k^2 \sigma_k^2 + 6 L \Gamma + 8 (E - 1)^2 G^2.$$
Here, $w_t$ denotes the model-weights vector at the server at the end of step number $t$ such that 0 is the first step of the algorithm.
From this theorem, we immediately get the following:
Corollary A1.
Assuming the predefined assumptions laid out above are true, Theorem A1 holds for any Preprocess procedure.
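To make the dependence of the bound in Theorem A1 on its parameters concrete, the following sketch evaluates its right-hand side for given constants. It is only a numerical illustration of the formula from [16]; the function and argument names are hypothetical, and the constants in the example are arbitrary rather than measured values.

```python
from typing import Sequence


def fedavg_bound(
    L: float,                # smoothness constant
    mu: float,               # strong-convexity constant
    sigmas: Sequence[float], # per-client gradient-variance bounds sigma_k
    p: Sequence[float],      # per-client proportions p_k
    G: float,                # bound on the stochastic-gradient norm
    E: int,                  # number of local steps between aggregations
    gamma_het: float,        # degree of heterogeneity (Gamma)
    T: int,                  # step at which the bound is evaluated
    w0_dist_sq: float,       # E||w_0 - w*||^2
) -> float:
    """Right-hand side of the bound on E[F(w_T)] - F* in Theorem A1."""
    kappa = L / mu
    gamma = max(8 * kappa, E)
    B = (
        sum(pk ** 2 * sk ** 2 for pk, sk in zip(p, sigmas))
        + 6 * L * gamma_het
        + 8 * (E - 1) ** 2 * G ** 2
    )
    return kappa / (gamma + T - 1) * (2 * B / mu + mu * gamma / 2 * w0_dist_sq)


# Two equally weighted clients, no heterogeneity, a single local step.
bound = fedavg_bound(2.0, 1.0, [1.0, 1.0], [0.5, 0.5], 1.0, 1, 0.0, 1, 1.0)
```

Because the client proportions enter only through the term Σ p_k² σ_k², any preprocessing of the weights changes the constant B but not the O(1/T) rate.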

Appendix D. Proof of Lemma 1

Initially, we assume that each machine $i \in [K]$ stores $n_i$ examples derived from one-dimensional random variable $x \sim D$ such that $D$ is the unknown distribution from which we sample the distributed dataset, and $x$ is $v$-sub-exponential. Define
$$\mu = \mathbb{E}[x], \quad x_{i,j} = j\text{th sample of the } i\text{th client}, \quad \bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{i,j},$$
$$B = \text{all Byzantine machines}, \quad M = [K] \setminus B, \quad T = \text{all trimmed machines}, \quad R = [K] \setminus T.$$
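As done in the proof of Lemma 3 in [12], using Bernstein's inequality, we get the following for all $t \ge 0$ and $i \in [K]$:
$$\Pr\big( \lvert \bar{x}_i - \mu \rvert \ge t \big) \le 2 \exp\left( -n_i \min\left\{ \frac{t}{2v}, \frac{t^2}{2v^2} \right\} \right) \le 2 \exp\left( -\min_{i \in [K]} n_i \, \min\left\{ \frac{t}{2v}, \frac{t^2}{2v^2} \right\} \right).$$
Then, by union bound, we know that
$$\Pr\Big( \max_{i \in R} \lvert \bar{x}_i - \mu \rvert \ge t \Big) \le 2 (1 - 2\beta) K \exp\left( -\min_{i \in [K]} n_i \, \min\left\{ \frac{t}{2v}, \frac{t^2}{2v^2} \right\} \right).$$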
Our version of the trimmed mean of means computes:
$$\operatorname{trmean}_\beta \{ \bar{x}_i : i \in [K] \} = \frac{\sum_{i \in R} \tilde{n}_i \bar{x}_i}{\sum_{i \in R} \tilde{n}_i}.$$
From this definition, and assuming Inequality (A9) does not hold, we obtain:
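$$\big\lvert \operatorname{trmean}_\beta \{ \bar{x}_i : i \in [K] \} - \mu \big\rvert = \left\lvert \frac{\sum_{i \in R} \tilde{n}_i \bar{x}_i}{\sum_{i \in R} \tilde{n}_i} - \mu \right\rvert = \frac{1}{\sum_{i \in R} \tilde{n}_i} \left\lvert \sum_{i \in R} \tilde{n}_i (\bar{x}_i - \mu) \right\rvert \le \frac{1}{\sum_{i \in R} \tilde{n}_i} \sum_{i \in R} \tilde{n}_i \lvert \bar{x}_i - \mu \rvert$$
$$\le \frac{U}{\sum_{i \in R} \tilde{n}_i} \sum_{i \in R} \lvert \bar{x}_i - \mu \rvert \le \frac{(1 - 2\beta)\, U K \max_{i \in R} \lvert \bar{x}_i - \mu \rvert}{(1 - 2\beta)\, K \min_{i \in [K]} \tilde{n}_i} \le \frac{t\, U}{\min_{i \in [K]} \tilde{n}_i}$$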
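The weighted trimmed mean of means analyzed above can be sketched in a few lines of Python. The sketch assumes that trimming discards the ⌊βK⌋ smallest and the ⌊βK⌋ largest client means, as in the trimmed mean of [12], and then averages the remaining means using the preprocessed weights; the function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def weighted_trimmed_mean(
    means: Sequence[float], weights: Sequence[float], beta: float
) -> float:
    """Weighted mean of the client means that survive beta-trimming.

    means[i] is the local mean of client i and weights[i] is its
    preprocessed (for example, truncated) weight.
    """
    if not 0 <= beta < 0.5:
        raise ValueError("beta must lie in [0, 0.5)")
    k = len(means)
    cut = math.floor(beta * k)
    # Indices sorted by mean value; drop `cut` clients from each end.
    order = sorted(range(k), key=lambda i: means[i])
    kept = order[cut:k - cut]
    total = sum(weights[i] for i in kept)
    return sum(weights[i] * means[i] for i in kept) / total


# Two outlying clients with large weights are trimmed before averaging.
estimate = weighted_trimmed_mean(
    means=[100.0, 1.0, 2.0, 3.0, -50.0], weights=[5, 1, 2, 1, 5], beta=0.2
)
```

Here the clients reporting 100 and −50 are removed despite their large weights, and the remaining means 1, 2, 3 with weights 1, 2, 1 average to 2. A Byzantine client that survives trimming still contributes in proportion to its weight, which is why the bound above carries the factor U / min ñ_i.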
Based on Assumption 5, we can apply inequalities (A9) and (A10) to the $k$-th partial derivative of the loss functions. To extend this result to all $w \in W$ and all the $d$ coordinates, we proceed as in [12] by using the union bound and a covering net argument [23]. Let $W_\delta = \{ w^1, w^2, \dots, w^{N_\delta} \}$ be a finite subset of $W$ such that for any $w \in W$, there exists $w^l \in W_\delta$ such that $\lVert w^l - w \rVert \le \delta$. According to the standard covering net results [23], we know that $N_\delta \le \left( 1 + \frac{D}{\delta} \right)^d$ and obtain:
$$\Pr\left( \lVert g(w) - \nabla F(w) \rVert \ge \sqrt{2d}\, \frac{t\, U}{\min_{i \in [K]} \tilde{n}_i} + 2\sqrt{2}\, \frac{U}{\min_{i \in [K]} \tilde{n}_i}\, \delta \hat{L} \right) \le 2 (1 - 2\beta) K d N_\delta \exp\left( -\min_{i \in [K]} n_i \, \min\left\{ \frac{t}{2v}, \frac{t^2}{2v^2} \right\} \right).$$
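By ignoring constants and by setting:
$$\delta = \frac{1}{\hat{L} \sum_{i=1}^{K} n_i}, \quad t = v \max\left\{ \frac{2}{\min_{i \in [K]} n_i} \left( d \log\left( 1 + \frac{D}{\delta} \right) + \log K \right), \sqrt{ \frac{2}{\min_{i \in [K]} n_i} \left( d \log\left( 1 + \frac{D}{\delta} \right) + \log K \right) } \right\},$$
we get that with probability at least
$$1 - O\left( \frac{dK}{\left( \hat{L} D \sum_{i=1}^{K} n_i \right)^d} \right) \ge 1 - O\left( \frac{d}{\hat{L}^d D^d K^{d-1} m^d} \right),$$
$$\lVert g(w) - \nabla F(w) \rVert \le O\left( \frac{v\, d\, U}{\min_{i \in [K]} \tilde{n}_i \sqrt{\min_{i \in [K]} n_i}} \sqrt{ \log\left( 1 + \hat{L} D \sum_{i=1}^{K} n_i \right) + \frac{1}{d} \log K } \right) + \tilde{O}\left( \frac{U}{\min_{i \in [K]} \tilde{n}_i \sum_{i=1}^{K} n_i} \right)$$
$$\le O\left( \frac{v\, d\, U}{\tilde{m} \sqrt{m}} \sqrt{ \log\left( 1 + \hat{L} D \sum_{i=1}^{K} n_i \right) + \frac{1}{d} \log K } \right) + \tilde{O}\left( \frac{U}{K \tilde{m}\, m} \right).$$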

References

  1. Konečný, J.; McMahan, B.; Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv 2015, arXiv:1511.03575.
  2. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  3. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. Found. Trends® Mach. Learn. 2021, 14, 1–210.
  4. Bonawitz, K.; Eichner, H.; Grieskamp, W.; Huba, D.; Ingerman, A.; Ivanov, V.; Kiddon, C.; Konečný, J.; Mazzocchi, S.; McMahan, H.B.; et al. Towards federated learning at scale: System design. arXiv 2019, arXiv:1902.01046.
  5. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2017; pp. 119–129.
  6. Ghosh, A.; Hong, J.; Yin, D.; Ramchandran, K. Robust Federated Learning in a Heterogeneous Environment. arXiv 2019, arXiv:1906.06629.
  7. Alistarh, D.; Allen-Zhu, Z.; Li, J. Byzantine Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 4613–4623.
  8. Li, L.; Xu, W.; Chen, T.; Giannakis, G.B.; Ling, Q. Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 8–12 October 2019; Volume 33, pp. 1544–1551.
  9. Haddadpour, F.; Mahdavi, M. On the Convergence of Local Descent Methods in Federated Learning. arXiv 2019, arXiv:1910.14425.
  10. Pillutla, K.; Kakade, S.M.; Harchaoui, Z. Robust Aggregation for Federated Learning. arXiv 2019, arXiv:1912.13445.
  11. Chen, Y.; Su, L.; Xu, J. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. Proc. ACM Meas. Anal. Comput. Syst. 2017, 1, 1–25.
  12. Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
  13. Chen, X.; Chen, T.; Sun, H.; Wu, Z.S.; Hong, M. Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms. arXiv 2019, arXiv:1906.01736.
  14. Chen, Y.; Sun, X.; Jin, Y. Communication-Efficient Federated Deep Learning with Asynchronous Model Update and Temporally Weighted Aggregation. arXiv 2019, arXiv:1903.07424.
  15. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582.
  16. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:1907.02189.
  17. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Software. Available online: tensorflow.org (accessed on 1 July 2022).
  18. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečný, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. LEAF: A Benchmark for Federated Settings. arXiv 2019, arXiv:1812.01097.
  19. Shakespeare, W. The Complete Works of William Shakespeare; Complete Works Series; Wordsworth Editions Ltd.: Stansted, UK, 1996.
  20. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  21. Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. arXiv 2020, arXiv:2003.00295.
  22. LeCun, Y.; Cortes, C.; Burges, C. MNIST Handwritten Digit Database. ATT Labs. 2010, Volume 2. Available online: http://yann.Lecun.Com/exdb/mnist (accessed on 1 July 2022).
  23. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv 2010, arXiv:1011.3027.
Figure 1. Example plot of data generated by executing Algorithm 3 on unbalanced vector N and α = 50 % (this vector corresponds to the partition used in our experiments; see Section 3.1 for details).
Figure 2. Histogram of the sample partitions of the MNIST (left) and Shakespeare (right) datasets.
Figure 3. Accuracy by round without any attackers for the Shakespeare experiments. Curves correspond to preprocessing procedures, and columns correspond to different aggregation methods.
Figure 4. Accuracy by round under Byzantine attacks for the Shakespeare experiments. Curves correspond to preprocessing procedures, and columns correspond to different aggregation methods. In the two rows of the experiment, the Byzantine clients perform a model negation attack with 1% and 10% attackers, respectively.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Portnoy, A.; Tirosh, Y.; Hendler, D. Towards Federated Learning with Byzantine-Robust Client Weighting. Appl. Sci. 2022, 12, 8847. https://doi.org/10.3390/app12178847

