Generalized Federated Learning via Gradient Norm-Aware Minimization and Control Variables
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper is very well written and the results are very clear. However, the proposed methodology is hard to follow. Please add a simple block diagram of the methodology; it would greatly help readers understand your work.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors present a federated learning approach based on first-order flatness and the error function, which is expected to generalize better than traditional methods that use only the error function. However, the following comments need to be addressed to improve the manuscript.
For the non-expert readers, please include mathematical formulation and definitions of
i) Local Minima
ii) Global Minima
iii) Flat Minima
iv) Sharp Minima
v) Discuss the above concepts and their differences.
vi) The principle of empirical risk
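For orientation, the concepts listed above are commonly written as follows. This is a sketch using assumed notation (loss L, weights w, radius rho, training samples (x_i, y_i), per-sample loss ell, model f), not the manuscript's own definitions. Flat and sharp minima have no single agreed definition; the worst-case loss increase within a ball of radius rho is one frequently used measure.

```latex
\begin{align*}
\text{Empirical risk:}\quad
  & \hat{L}(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i; w),\, y_i\bigr) \\
\text{Global minimum:}\quad
  & L(w^{*}) \le L(w) \quad \text{for all } w \\
\text{Local minimum:}\quad
  & \exists\, \rho > 0:\; L(w^{*}) \le L(w) \quad \text{for all } \|w - w^{*}\| \le \rho \\
\text{Sharpness at } w^{*}:\quad
  & S_{\rho}(w^{*}) = \max_{\|\epsilon\| \le \rho} L(w^{*} + \epsilon) - L(w^{*})
\end{align*}
```

Under this measure, a minimum is called flat when the sharpness value is small for the chosen rho and sharp when it is large. Every global minimum is also a local minimum, but not the reverse.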
The neighborhood is continuous in Equations 4, 5, and 6 (centered at w with radius rho). How is it possible to evaluate infinitely many neighbors? If the neighborhood is discretized, how is this done?
In mathematics, the zeroth-order, first-order, and second-order terms come from a power series. Specify which terms are raised to the powers zero, one, and two in zeroth-order flatness, first-order flatness, etc.
In the comparison with zeroth-order flatness, it is not clear whether epsilon is a neighbor or a perturbation value. Please clarify this and discuss how its value is found (perhaps by linear programming or other techniques). Moreover, the authors emphasize that first-order flatness is superior to zeroth-order flatness. Does that mean second-order flatness is superior to both first-order and zeroth-order flatness?
Regarding the name GAM (Gradient Norm-Aware Minimization), are you minimizing the norm of the gradient vector (a distance)? This is unclear in your proposal. Please clarify it or change the name.
Why is it necessary to transform Equation 12 into Equation 17? Please add a numerical method to compute it.
The rationale behind the parameters c and c_i in Equations 18 and 19 needs to be clarified. Does the parameter c_i decay the local gradient in the client according to the previous weights? The parameter c seems arbitrary because computing it (Equation 20) requires delta c_i. Please include a table with all the necessary parameters, such as Delta, Eta, etc., and specify whether there is one parameter for each weight or a single parameter for all weight updates.
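As a point of reference for this comment, control variables named c and c_i are usually associated with SCAFFOLD-style updates. The sketch below assumes that scheme and is not taken from the manuscript's Equations 18 to 20. The function names, the one-dimensional quadratic client losses, and the learning rate are all hypothetical choices made for illustration. In this scheme c_i is not a decay factor: it is an estimate of the client's own average gradient, and c is the average of the c_i, so the correction term (c - c_i) shifts each local step toward the global descent direction.

```python
def local_update(w_global, c_global, c_local, grad_fn, lr, steps):
    """One client's round under an assumed SCAFFOLD-style correction."""
    w = w_global
    for _ in range(steps):
        # Corrected local step: local gradient minus c_i plus c.
        w -= lr * (grad_fn(w) - c_local + c_global)
    # Refresh c_i from the average corrected direction travelled this round.
    new_c_local = c_local - c_global + (w_global - w) / (steps * lr)
    return w - w_global, new_c_local - c_local  # (delta w, delta c_i)


def run_rounds(targets, rounds=50, lr=0.1, steps=5):
    """Toy setup: client i minimizes 0.5 * (w - targets[i]) ** 2."""
    w_global, c_global = 0.0, 0.0
    c_locals = [0.0] * len(targets)
    for _ in range(rounds):
        delta_ws, delta_cs = [], []
        for i, target in enumerate(targets):
            delta_w, delta_c = local_update(
                w_global, c_global, c_locals[i],
                lambda w, t=target: w - t, lr, steps,
            )
            c_locals[i] += delta_c
            delta_ws.append(delta_w)
            delta_cs.append(delta_c)
        # The server averages the model deltas and the control deltas.
        w_global += sum(delta_ws) / len(targets)
        c_global += sum(delta_cs) / len(targets)
    return w_global
```

With two clients whose optima are -1 and 3, `run_rounds([-1.0, 3.0])` approaches the minimizer of the average loss, which is the mean of the targets. In this sketch the same c and c_i are applied elementwise, so a real model would carry one control value per weight.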
Section 4.2.2 seems to break the federated learning approach because the clients incorporate gradient information from all clients. The above approximates the parameters c_i and c. How is it possible, even when communication links fail?
Please improve the English language when possible to make it clearer.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
1. The title does not adequately explain the research content of the paper.
2. To what extent does the non-independent and identically distributed (Non-IID) nature of data across clients in federated learning environments exacerbate the challenge of client drift?
3. In the context of the proposed FedGAM algorithm, how effectively do control variables facilitate the correction of local updates to steer model training towards globally flat minima, and what quantifiable advancements in model performance can be identified when compared to conventional federated learning techniques?
Minor editing of English language required.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
- The authors combined FL (Federated Learning) and GAM (Gradient Norm-Aware Minimization) to propose FedGAM, which enhances the generalization ability of the global model.
- The experiments are well-executed, addressing the three main causes that contribute to client drift individually.
- I suggest that the authors should provide a more detailed explanation of the symbols used so that readers unfamiliar with previous literature can still understand the content of the paper.
- In the experiments, the alpha value is used to control data heterogeneity. The authors should consider explaining this more clearly or citing relevant papers to facilitate the reproducibility of the experiments.
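To illustrate the reproducibility point in the last item: a common convention in the federated learning literature is to let alpha be the concentration parameter of a Dirichlet distribution over clients, drawn once per class. Whether the manuscript uses this convention is an assumption here, and the function names below are hypothetical. Smaller alpha gives more skewed (more heterogeneous) client datasets, while large alpha approaches an even split.

```python
import random
from itertools import accumulate


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against underflow for very small alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    For each class, the share given to each client is drawn from
    Dirichlet(alpha). Returns a list of index lists, one per client.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = sample_dirichlet(alpha, num_clients, rng)
        n = len(indices)
        # Turn cumulative shares into slice boundaries over this class.
        cuts = [0] + [min(int(c * n), n) for c in accumulate(shares)]
        cuts[-1] = n  # make sure every sample is assigned
        for client_id in range(num_clients):
            clients[client_id].extend(indices[cuts[client_id]:cuts[client_id + 1]])
    return clients
```

For example, `dirichlet_partition(labels, num_clients=5, alpha=0.1)` assigns every index to exactly one client. Reporting the seed together with alpha is what makes such a split reproducible.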
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I agree to the publication of the manuscript.