Article

Symmetric ADMM-Based Federated Learning with a Relaxed Step

Business School, University of Shanghai for Science and Technology, Jungong Road, Shanghai 200093, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2661; https://doi.org/10.3390/math12172661
Submission received: 12 July 2024 / Revised: 24 August 2024 / Accepted: 24 August 2024 / Published: 27 August 2024

Abstract
Federated learning facilitates the training of global models in a distributed manner without requiring the sharing of raw data. This paper introduces two novel symmetric Alternating Direction Method of Multipliers (ADMM) algorithms for federated learning. The two algorithms utilize a convex combination of current local and global variables to generate relaxed steps to improve computational efficiency. They also integrate two dual-update steps with varying relaxation factors into the ADMM framework to boost the accuracy and the convergence rate. Another key feature is the use of weak parametric assumptions to enhance computational feasibility. Furthermore, the global update in the second algorithm occurs only at certain steps (e.g., at steps that are a multiple of a pre-defined integer) to improve communication efficiency. Theoretical analysis demonstrates linear convergence under reasonable conditions, and experimental results confirm the superior convergence and heightened efficiency of the proposed algorithms compared to existing methodologies.

1. Introduction

Federated learning serves as a widely adopted distributed machine learning methodology that has garnered substantial interest in recent years due to its effectiveness in addressing issues related to data privacy, security, and the accessibility of heterogeneous data [1,2,3]. This methodology has been employed extensively in various domains, such as health care [4] (Yang et al., 2019), finance [5] (Liu et al., 2021), the Internet of Things (IoT) [6] (Zeng et al., 2023), and intelligent transportation [7,8] (Manias et al., 2021). By providing solutions for data security and compliance, federated learning facilitates improved utilization of decentralized data and enhances the performance and efficiency of machine learning models. As a contemporary research hotspot, federated learning's significance spans beyond boosting the performance of machine learning models, extending to safeguarding data privacy, economizing computational resources, and supporting heterogeneous devices.

1.1. Related Work

(a) Improving computational efficiency and accuracy.
Each node independently addresses local optimization sub-problems in federated learning. Early research [6,9,10] used a strategy of splitting the computation across multiple devices for local optimization, although shortcomings persist regarding computational efficiency and accuracy. The Alternating Direction Method of Multipliers (ADMM), as a distributed method, has been used to solve many optimization problems due to its simplicity and efficiency. In federated learning, there are two types of ADMM available, namely exact [11] and inexact [12] ADMM. The former requires the clients to update the parameters by solving the sub-problems accurately, thereby increasing the computational burden [12,13,14,15,16,17]. The latter updates the parameters by solving sub-problems approximately, which reduces the computational complexity for clients [18,19,20,21]. However, these algorithms implement a single dual-update step per iteration and necessitate relatively stringent assumptions with respect to the parameters to ensure convergence properties.
(b) Saving computing resources. In distributed learning, local clients and the central server engage in frequent, often inefficient communication. Thus, extensive research efforts focus on devising algorithms to minimize the number of communication rounds. The stochastic gradient descent method, which aggregates in a cyclic fashion, is a widely used approach [22,23,24,25,26,27] that has demonstrated promising results in reducing the number of communication rounds. To further alleviate the burden of communication rounds, McMahan et al. [6,28,29] introduced the Federated Averaging Algorithm (FedAvg) and its improved variants. These algorithms minimize communications by performing local iterations multiple times before periodically conducting global aggregation. Li et al. [30] further refined the FedAvg method by introducing the Federated Proximal (FedProx) algorithm, allowing available device-based system resources to perform varying amounts of local work before aggregating partial solutions. Both FedAvg and FedProx have seen extensive application in distributed learning.

1.2. Our Contribution

The main contributions of this paper include the introduction of two federated learning algorithms based on Symmetric ADMM with Relaxed Step (Fed-RSADMM; see Algorithms 2 and 3), characterized as follows:
(I) Relaxed step. In contrast to the conventional ADMM, the presented algorithms employ a convex combination of the current local and global variables to generate the relaxed steps.
(II) Symmetric ADMM. We integrate two dual-update steps into the ADMM framework to construct a symmetric ADMM algorithm with varying relaxation factors, which is different from the general ADMM.
(III) Weak parametric assumptions. Differing from conventional algorithmic assumptions, only simple assumptions are made with respect to the parameters.

1.3. Organization

This paper is organized as follows. In the next section, we provide the symbolic definitions and some common mathematical definitions that are used in this paper. In Section 3, we present the Fed-RSADMM and FedAvg-RSADMM algorithms, then prove their convergence. In Section 4, we design some comparative numerical experiments to illustrate the performance of our two proposed algorithms. The conclusions of this paper are presented in Section 5.

2. Preliminaries

The present section introduces some notations and definitions employed in this paper.

2.1. Notations

$\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space, $\langle x, y \rangle = x^T y$, and $\|\cdot\|$ denotes the Euclidean norm. Let $S \subseteq \mathbb{R}^n$ be a set and $x \in \mathbb{R}^n$ a point. If $S$ is non-empty, the distance from the point $x$ to the set $S$ is denoted as $\mathrm{dist}(x, S) = \inf \{ \|y - x\| : y \in S \}$. When $S = \emptyset$, let $\mathrm{dist}(x, S) = +\infty$.
Definition 1
($L$-Lipschitz continuity). A function $g: \mathbb{R}^n \to \mathbb{R}$ is said to be $L$-Lipschitz continuous (or simply $L$-Lipschitz) if, for any $x, y \in \mathbb{R}^n$, one has
$$\| g(x) - g(y) \| \le L \| x - y \|,$$
where $\|\cdot\|$ denotes the Euclidean norm.
If the function $f$ is continuously differentiable and its gradient $\nabla f$ is $L$-Lipschitz continuous, then we have
$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{L}{2} \| y - x \|^2, \quad \forall x, y \in \mathbb{R}^n.$$
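The inequality above can be spot-checked numerically. The sketch below (illustrative, not from the paper) verifies it for a random quadratic $f(x) = \frac{1}{2} x^T Q x$, whose gradient is Lipschitz with constant $L = \|Q\|_2$:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T Q x has gradient Q x, which is L-Lipschitz
# with L = ||Q||_2 (the largest singular value of Q).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M.T @ M                      # symmetric positive semidefinite
L = np.linalg.norm(Q, 2)         # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# Check f(y) - f(x) - <grad f(x), y - x> <= (L/2) ||y - x||^2
# over random pairs of points.
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    gap = f(y) - f(x) - grad(x) @ (y - x)
    assert gap <= 0.5 * L * np.linalg.norm(y - x) ** 2 + 1e-9
print("descent lemma verified")
```

For a quadratic the gap equals $\frac{1}{2}(y-x)^T Q (y-x)$, so the bound is tight exactly when $y - x$ aligns with the top eigenvector of $Q$.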
Definition 2.
Let the function $f: \mathbb{R}^n \to \mathbb{R}$ be proper and lower semicontinuous. The authors of [31] provided the following definitions:
(I)
The Fréchet subdifferential of $f$ at $x \in \mathrm{dom}\, f$ is denoted as
$$\hat{\partial} f(x) = \left\{ x^* \in \mathbb{R}^n : \liminf_{y \to x,\, y \neq x} \frac{f(y) - f(x) - \langle x^*, y - x \rangle}{\| y - x \|} \ge 0 \right\},$$
and when $x \notin \mathrm{dom}\, f$, $\hat{\partial} f(x) = \emptyset$.
(II)
The limiting subdifferential of $f$ at $x \in \mathrm{dom}\, f$ is denoted as
$$\partial f(x) = \left\{ x^* \in \mathbb{R}^n : \exists\, x^k \to x,\ f(x^k) \to f(x),\ \hat{x}^k \in \hat{\partial} f(x^k),\ \hat{x}^k \to x^* \right\},$$
and assuming that $x \in \mathbb{R}^n$ is a minimal-value point of $f$, then $0 \in \partial f(x)$. If $0 \in \partial f(x)$, then $x$ is said to be a stable point of $f$, and the set of stable points of $f$ is denoted as $\mathrm{crit}\, f$.
The proximal operator of a proper closed convex function $f$ at $v \in \mathbb{R}^n$ is defined as follows [32]:
$$\mathrm{Prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\lambda} \| x - v \|^2 \right\},$$
where $\|\cdot\|$ denotes the Euclidean norm.
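For some functions the proximal operator has a closed form. The sketch below (illustrative examples, not from the paper) gives two classical cases: $f(x) = \frac{1}{2}\|x\|^2$, where the prox is a simple shrinkage, and $f(x) = \|x\|_1$, where it is soft-thresholding:

```python
import numpy as np

# For f(x) = 0.5*||x||^2, setting the gradient of
# f(x) + (1/(2*lambda))*||x - v||^2 to zero gives x + (x - v)/lambda = 0,
# so Prox_{lambda f}(v) = v / (1 + lambda).
def prox_sq_norm(v, lam):
    return v / (1.0 + lam)

# For f(x) = ||x||_1, the prox is the soft-thresholding operator.
def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.2, 1.5])
print(prox_sq_norm(v, 1.0))   # halves each entry
print(prox_l1(v, 0.5))        # shrinks toward zero, kills small entries
```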

2.2. Loss Function

A machine learning model encompasses a set of parameters that are refined based on the training data. Typically, each training data sample includes the following two components: input features represented as vectors ($a_j$) and desired outputs known as output labels ($b_j$). Each model is equipped with a loss function defined on its parameter vector ($x$) for each data sample. The loss function records the model error on the training data. The learning process aims to minimize this loss function over a set of training data samples. For each data sample, the loss function is designated as $f(x, a_j, b_j)$, abbreviated to $f_j(x)$ for convenience.
Table 1 [33,34,35,36] summarizes loss functions for popular machine learning models. For convenience, suppose there are $m$ edge nodes and that the local datasets are $D_1, D_2, \ldots, D_m$. For each dataset ($D_i$) at node $i$, the loss function of the set of data samples at that node is
$$F_i(x) \triangleq \frac{1}{|D_i|} \sum_{j \in D_i} f_j(x).$$
We define $D = \sum_{i=1}^m |D_i|$, where $|D_i|$ denotes the size of the set $D_i$.
The global loss function over all distributed datasets is defined as
$$F(x) \triangleq \frac{\sum_{i=1}^m \sum_{j \in D_i} f_j(x)}{\sum_{i=1}^m |D_i|} = \frac{\sum_{i=1}^m |D_i| F_i(x)}{D}.$$
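A hypothetical three-node example (sizes and loss values invented for illustration) shows how the global loss weights each local average loss by its dataset size:

```python
import numpy as np

# Toy setup: m = 3 nodes with local datasets of different sizes. Each
# local loss F_i is the average of per-sample losses on node i, and the
# global loss weights each F_i by |D_i| / D.
local_sizes = np.array([50, 30, 20])          # |D_1|, |D_2|, |D_3|
D = local_sizes.sum()                          # total number of samples
local_losses = np.array([0.40, 0.10, 0.70])    # F_i(x) at some fixed x

global_loss = np.sum(local_sizes * local_losses) / D
# Equivalent view: the average of all per-sample losses pooled together,
# since each F_i is itself an average over |D_i| samples.
print(global_loss)
```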

2.3. Symmetric ADMM

The following convex minimization model with linear constraints and a separable objective function is considered:
$$\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^q} f(x) + g(y), \quad \text{s.t.} \quad Ax + By = b,$$
where $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times q}$, and $b \in \mathbb{R}^p$. Such separable convex optimization problems can be solved by using the ADMM algorithm. The augmented Lagrangian function for the above optimization problem is expressed as follows:
$$L(x, y, u) := f(x) + g(y) + \langle Ax + By - b, u \rangle + \frac{\rho}{2} \| Ax + By - b \|^2,$$
where $\rho > 0$ is the penalty parameter and $u$ is the Lagrange multiplier. ADMM follows the following update process [37]:
$$x^{k+1} = \arg\min_{x \in \mathbb{R}^n} L(x, y^k, u^k), \quad y^{k+1} = \arg\min_{y \in \mathbb{R}^q} L(x^{k+1}, y, u^k), \quad u^{k+1} = u^k + \rho \left( Ax^{k+1} + By^{k+1} - b \right).$$
Based on the Peaceman–Rachford splitting method [38], symmetric ADMM (S-ADMM) was proposed to solve (9). The iterative process is expressed as follows:
$$x^{k+1} = \arg\min_{x \in \mathbb{R}^n} L(x, y^k, u^k), \quad u^{k+\frac{1}{2}} = u^k + \rho \left( Ax^{k+1} + By^k - b \right), \quad y^{k+1} = \arg\min_{y \in \mathbb{R}^q} L(x^{k+1}, y, u^{k+\frac{1}{2}}), \quad u^{k+1} = u^{k+\frac{1}{2}} + \rho \left( Ax^{k+1} + By^{k+1} - b \right).$$
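As a concrete illustration (not from the paper), the sketch below runs the S-ADMM iteration on a toy consensus problem where both subproblems have closed forms; it uses the sign convention in which the Lagrangian adds $\langle Ax + By - b, u \rangle$ and the dual steps move along the residual (under the opposite convention the signs of $u$ flip):

```python
import numpy as np

# Toy instance: min 0.5||x - a||^2 + 0.5||y - b||^2  s.t.  x - y = 0,
# i.e. A = I, B = -I, b = 0; the solution is x = y = (a + b)/2.
a, b = np.array([4.0, -1.0]), np.array([0.0, 3.0])
rho = 1.0
x, y, u = np.zeros(2), np.zeros(2), np.zeros(2)

for _ in range(50):
    # x-step: minimize 0.5||x-a||^2 + <x - y, u> + (rho/2)||x - y||^2
    x = (a - u + rho * y) / (1.0 + rho)
    # first (half) dual step, using the old y
    u = u + rho * (x - y)
    # y-step: minimize 0.5||y-b||^2 + <x - y, u> + (rho/2)||x - y||^2
    y = (b + u + rho * x) / (1.0 + rho)
    # second dual step, using the new y
    u = u + rho * (x - y)

print(np.allclose(x, (a + b) / 2), np.allclose(y, (a + b) / 2))  # -> True True
```

At the fixed point the multiplier settles at $u^* = (a - b)/2$, and both primal variables reach the consensus value $(a + b)/2$.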

2.4. Federated Learning

Suppose we have $m$ local nodes, each with a local dataset ($D_i$). Each node has a local total loss ($f_i(x)$) that is bounded from below. The global loss function can, therefore, be derived as
$$f(x) := \sum_{i=1}^m w_i f_i(x),$$
where $w_i\ (i = 1, 2, \ldots, m)$ are positive weights that satisfy
$$\sum_{i=1}^m w_i = 1.$$
The goal of federated learning is to minimize the loss function at the central node to obtain the optimal parameter ($x^*$), which can be described as the following problem:
$$x^* := \arg\min_{x \in \mathbb{R}^n} f(x).$$
By introducing the auxiliary variable ($y$) and adding the constraints $x_i = y$, the original problem can be rewritten in the following form:
$$\min_{x_i \in \mathbb{R}^n,\, y \in \mathbb{R}^n} \sum_{i=1}^m w_i f_i(x_i), \quad \text{s.t.} \quad x_i = y, \quad i = 1, 2, \ldots, m.$$
Based on the above optimization problem, the conventional federated learning algorithm can be summarized in the following form [39] (Algorithm 1):
Algorithm 1 Federated Learning
1: Initialize: $x_i^0 = x^0$, $m$, $\gamma > 0$.
2: for $k = 1$ to $n$ do
3:    Global update:
4:    The central server calculates the average parameter $y^k$ by
5:    $$y^k = \sum_{i=1}^m w_i x_i^k.$$
6:    Broadcast the parameter $y^k$ to every local client.
7:    for $i = 1$ to $m$ do
8:        Local update:
9:        Each client updates its parameter locally and in parallel by
10:       $$x_i^{k+1} = x_i^k - \gamma \nabla f(y^k).$$
11:    end for
12: end for
13: Return: $x_i^{k+1}, y^k$ $(i = 1, 2, \ldots, m)$
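A minimal executable sketch of this loop on invented quadratic client losses $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ (so each client's gradient at the broadcast point is $y - c_i$); with these losses the broadcast average converges to the weighted mean $\sum_i w_i c_i$:

```python
import numpy as np

# Toy instance: client i holds f_i(x) = 0.5*||x - c_i||^2, so the
# minimizer of sum_i w_i f_i is the weighted mean of the c_i.
# All data below is illustrative.
rng = np.random.default_rng(1)
m, d, gamma = 5, 3, 0.5
c = rng.standard_normal((m, d))   # per-client optima c_i
w = np.full(m, 1.0 / m)           # positive weights summing to 1
x = np.zeros((m, d))              # local parameters x_i

for k in range(200):
    y = w @ x                     # global update: weighted average
    for i in range(m):            # local updates (run in parallel in FL)
        x[i] -= gamma * (y - c[i])   # gradient step, grad f_i(y) = y - c_i

print(np.allclose(y, w @ c))      # -> True
```

Note that only the averaged parameter is guaranteed to converge here; the individual $x_i$ keep drifting apart, which is one motivation for the ADMM-based variants below.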
To implement the algorithms proposed in this paper, we construct the augmented Lagrangian function for problem (14):
$$L_\rho(y, X, U) := \sum_{i=1}^m L_{\rho_i}(y, x_i, u_i),$$
where $X = (x_1, x_2, \ldots, x_m)$, $U = (u_1, u_2, \ldots, u_m)$, $\rho = (\rho_1, \rho_2, \ldots, \rho_m)$, and
$$L_{\rho_i}(y, x_i, u_i) = w_i f_i(x_i) + \langle x_i - y, u_i \rangle + \frac{\rho_i}{2} \| x_i - y \|^2,$$
where $u_i\ (i = 1, 2, \ldots, m)$ are the Lagrange multipliers and $\rho_i\ (i = 1, 2, \ldots, m)$ are the penalty parameters.

2.5. Stationary Points

Here, we present the optimality conditions for problem (14).
Definition 3.
A point $(y^*, X^*, U^*)$ is a stationary point of problem (14) if it satisfies the following conditions:
$$\begin{cases} w_i \nabla f_i(x_i^*) + u_i^* = 0, \\ x_i^* - y^* = 0, \\ \sum_{i=1}^m u_i^* = 0, \end{cases} \quad i = 1, 2, \ldots, m.$$
A point $x^*$ is deemed a stationary point of problem (14) if it satisfies the following condition:
$$\nabla f(x^*) = 0.$$
One can readily observe that any local optimal solution satisfies (17), and if each $f_i$ is a convex function, a point fulfilling (18) constitutes a globally optimal solution.
Based on the definition of the proximal operator and Definition 3, we obtain the following lemma.
Lemma 1
([39,40]). Suppose that $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, 2, \ldots, m$, are proper convex lower semicontinuous functions. Then, solving problem (15) reduces to finding a zero point of
$$e(p, \varrho) = \begin{cases} e_{x_i}(p, \varrho) := x_i - \mathrm{Prox}_{\varrho f_i}(x_i + \varrho u_i), & i = 1, 2, \ldots, m, \\ e_y(p, \varrho) := \varrho \sum_{i=1}^m u_i, \\ e_{u_i}(p, \varrho) := \varrho (x_i - y), & i = 1, 2, \ldots, m, \end{cases}$$
where $p := (X, y, U)$ and $\varrho \in \mathbb{R}_+$ is any given positive constant. For $p^* = (X^*, y^*, U^*) \in \mathrm{crit}\, L_\rho$, $e(p^*, \varrho) = 0$. Thus, $\| e(p, \varrho) \|$ can be used to measure the distance between the point $p$ and the stable set $\mathrm{crit}\, L_\rho$.
Now, we provide the following lemma for $e(p, \varrho)$, which is important for Remark 2 and Lemma 5 in Section 3.
Lemma 2
([39,40]). Suppose that $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, 2, \ldots, m$, are proper convex and lower semicontinuous. If $p$ is not a stable point in $\mathrm{crit}\, L_\rho$ and $\bar{\varrho} \ge \varrho > 0$, then
$$\| e(p, \bar{\varrho}) \| \ge \| e(p, \varrho) \|$$
and
$$\frac{\| e(p, \bar{\varrho}) \|}{\bar{\varrho}} \le \frac{\| e(p, \varrho) \|}{\varrho}.$$

3. Symmetric ADMM-Based Federated Learning with a Relaxed Step and Convergence

Based on the above augmented Lagrangian functions, in this section, we construct two symmetric ADMM-based federated learning algorithms, the first of which is Fed-RSADMM, which utilizes the federated learning framework and symmetric ADMM with a relaxed step (RSADMM). The second is FedAvg-RSADMM, based on Fed-RSADMM, which allows local clients to update multiple times, then upload their parameters to the central server.

3.1. Fed-RSADMM

Given an original dataset comprising $m$ nodes, the local parameter for the $i$-th node is set as $x_i$, and the data for the $i$-th node are assigned as $D_i$. The specific algorithmic workflow proceeds as follows (Algorithm 2).
Algorithm 2 Fed-RSADMM
Input: $\alpha, \tau, \gamma, \rho_i > 0$; $S = [m]$, $[m] := \{1, 2, \ldots, m\}$.
Initialize: $x_i^0, y^0, u_i^0$, $i \in [m]$. Set $k \leftarrow 0$.
for $k = 1$ to $n$ do
    for $i = 1$ to $m$ do
        Local relaxed update:
        $$x_{rs(i)}^{k+1} = \alpha x_i^k + (1 - \alpha) y^k$$
    end for
    Global update:
    The central server calculates the global parameter $y^{k+1}$ by
    $$y^{k+1} = \arg\min_y \sum_{i=1}^m \left\{ w_i f_i \left( x_{rs(i)}^{k+1} \right) + \left\langle x_{rs(i)}^{k+1} - y, u_i^k \right\rangle + \frac{\rho_i}{2} \left\| x_{rs(i)}^{k+1} - y \right\|^2 \right\}$$
    and broadcasts the parameter $y^{k+1}$ to every local client.
    for $i = 1$ to $m$ do
        Local update:
        $$u_i^{k+\frac{1}{2}} = u_i^k + \tau \rho_i \left( x_{rs(i)}^{k+1} - y^{k+1} \right)$$
        $$x_i^{k+1} = \arg\min_{x_i} \left\{ w_i f_i(x_i) + \left\langle x_i - y^{k+1}, u_i^{k+\frac{1}{2}} \right\rangle + \frac{\gamma \rho_i}{2} \left\| x_i - y^{k+1} \right\|^2 \right\}$$
        $$u_i^{k+1} = u_i^{k+\frac{1}{2}} + \gamma \rho_i \left( x_i^{k+1} - y^{k+1} \right)$$
    end for
end for
Return: $x_i^{k+1}, y^{k+1}$ $(i = 1, 2, \ldots, m)$
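The following sketch runs one possible reading of Algorithm 2 on invented quadratic client losses $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ with $w_i = 1/m$, so that both the local $x_i$-subproblem and the global $y$-subproblem have closed forms; the relaxation parameters match the paper's experimental choices ($\alpha = 0.5$, $\tau = 0.1$, $\gamma = 0.5$, $\rho_i = 1$), and the sign convention follows the augmented Lagrangian above:

```python
import numpy as np

# Illustrative Fed-RSADMM sketch on quadratic clients; all data invented.
rng = np.random.default_rng(2)
m, d = 4, 3
alpha, tau, gamma, rho = 0.5, 0.1, 0.5, 1.0
c = rng.standard_normal((m, d))         # client targets c_i
w = np.full(m, 1.0 / m)                 # weights summing to 1
x = np.zeros((m, d))
u = np.zeros((m, d))
y = np.zeros(d)

for k in range(300):
    # local relaxed step: convex combination of local and global variables
    x_rs = alpha * x + (1 - alpha) * y
    # global update, closed form as in Remark 1 (all rho_i equal here)
    y = (rho * x_rs.sum(axis=0) + u.sum(axis=0)) / (m * rho)
    # first dual step with relaxation factor tau
    u_half = u + tau * rho * (x_rs - y)
    # local update: minimize w_i*f_i(x_i) + <x_i - y, u_i^{k+1/2}>
    #               + (gamma*rho/2)||x_i - y||^2  (closed form for quadratics)
    x = (w[:, None] * c - u_half + gamma * rho * y) / (w[:, None] + gamma * rho)
    # second dual step with relaxation factor gamma
    u = u_half + gamma * rho * (x - y)

print(np.allclose(y, w @ c, atol=1e-6))  # consensus at the weighted mean
```

With these parameter values the iteration contracts, and all local variables reach consensus at the weighted mean of the $c_i$, with multipliers $u_i \to w_i (c_i - y)$.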
Remark 1.
In Algorithm 2, subproblem (23) can be solved in closed form:
$$y^{k+1} = \arg\min_y L(y, X^k, U^k) = \frac{\sum_{i=1}^m \rho_i x_{rs(i)}^{k+1}}{\hat{\rho}} + \frac{\sum_{i=1}^m u_i^k}{\hat{\rho}},$$
where
$$\hat{\rho} := \sum_{i=1}^m \rho_i.$$
Compared to traditional federated learning, we adopt the update methods outlined in (23) and (27) for the global parameters instead of using the average of all local parameters, $x_i^{k+1}$. In contrast to the symmetric ADMM algorithm, we introduce a relaxation step to accelerate the convergence rate [19].

3.2. FedAvg-RSADMM

The communications in FedAvg-RSADMM occur only when $k \in K = \{0, k_0, 2k_0, \ldots\}$, where $k_0$ is a predefined positive integer. To facilitate local updates in Algorithm 2, an auxiliary variable ($z^{k+1}$) is introduced. Let $\Gamma_k = \lfloor k / k_0 \rfloor k_0$. It can be readily observed that if $k = \Gamma_k$, then $k \in K$, and when $\Gamma_k < k < \Gamma_k + k_0$, then $k \notin K$; i.e.,
$$z^{k+1} = \begin{cases} y^{k+1}, & \text{if } k \in K, \\ y^{\Gamma_k + 1}, & \text{if } k \notin K. \end{cases}$$
This approach decreases the number of communication rounds (e.g., parameter feedback and parameter upload), resulting in substantial cost savings with a convergence rate of $O(1/K)$, where $K$ is the number of iterations. A convex combination of local and global variables is used to formulate the relaxed step, which is then employed to perform the parameter update. The corresponding Algorithm 3 proceeds as follows.
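The communication schedule can be sketched in a few lines (values of $k_0$ and the iteration range are illustrative):

```python
# Minimal sketch of the FedAvg-RSADMM communication schedule: a global
# update happens only when k is a multiple of k0; in between, clients
# reuse the last broadcast value z = y^{Gamma_k + 1}.
k0 = 5
K = set(range(0, 100, k0))            # K = {0, k0, 2*k0, ...}

def last_update(k):
    return (k // k0) * k0             # Gamma_k: most recent step in K

for k in range(12):
    communicates = k in K             # True only every k0-th step
    assert communicates == (k == last_update(k))
print([k for k in range(12) if k in K])   # -> [0, 5, 10]
```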
Algorithm 3 FedAvg-RSADMM
1: Input: $\alpha, \tau, \gamma, \rho_i > 0$; $S = [m]$.
2: Initialize: $x_i^0, y^0, u_i^0$, $i \in [m]$. Set $k \leftarrow 0$.
3: for $k = 1$ to $n$ do
4:    for $i = 1$ to $m$ do
5:        Local relaxed update:
6:        $$x_{rs(i)}^{k+1} = \alpha x_i^k + (1 - \alpha) z^k$$
7:    end for
8:    if $k \in K := \{0, k_0, 2k_0, 3k_0, \ldots\}$ then
9:        Global update:
10:       The central server calculates the global parameter $z^{k+1}$ by
11:       $$z^{k+1} = \arg\min_z \sum_{i=1}^m \left\{ w_i f_i \left( x_{rs(i)}^{k+1} \right) + \left\langle x_{rs(i)}^{k+1} - z, u_i^k \right\rangle + \frac{\rho_i}{2} \left\| x_{rs(i)}^{k+1} - z \right\|^2 \right\}$$
12:       and broadcasts the parameter $z^{k+1}$ to every local client.
13:    end if
14:    for $i = 1$ to $m$ do
15:        Local update:
16:        $$u_i^{k+\frac{1}{2}} = u_i^k + \tau \rho_i \left( x_{rs(i)}^{k+1} - z^{k+1} \right)$$
17:        $$x_i^{k+1} = \arg\min_{x_i} \left\{ w_i f_i(x_i) + \left\langle x_i - z^{k+1}, u_i^{k+\frac{1}{2}} \right\rangle + \frac{\gamma \rho_i}{2} \left\| x_i - z^{k+1} \right\|^2 \right\}$$
18:        $$u_i^{k+1} = u_i^{k+\frac{1}{2}} + \gamma \rho_i \left( x_i^{k+1} - z^{k+1} \right)$$
19:    end for
20: end for
21: Return: $x_i^{k+1}, z^{k+1}, u_i^{k+1}$ $(i = 1, 2, \ldots, m)$

3.3. Convergence

In this section, we only provide the corresponding convergence lemmas and theorem for Algorithm 2, as those for Algorithm 3 follow a similar process. The following assumption is important for the proof.
Assumption 1.
(a)
The functions $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, 2, \ldots, m$, are lower semicontinuous.
(b)
The functions $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$, are continuously differentiable and have the same $L$-Lipschitz continuous gradient.
(c)
The parameters in the algorithm satisfy the following:
$$\gamma + \tau \ge 0, \quad \gamma + \tau \alpha > 0, \quad 0 < \alpha < 1, \quad 0 < \gamma < 1, \quad \tau < 1.$$
The penalty parameter ($\rho_i$) complies with the following:
$$\rho_i > \frac{c + \sqrt{c^2 + 4hL^2}}{2h},$$
where $c = \gamma \alpha - \gamma \tau (1 + \gamma \alpha)$ and $h = (\tau - 1)(\alpha - 1)L^2 + 2(1 - \gamma)L$.
(d)
The datasets of all devices are independently and identically distributed (i.i.d.).
We first prove the decreasing property of the sequence $\{L_\rho(p^k)\}$; this property enables us to obtain Lemma 4, which, together with the optimality conditions, shows the convergence of the sequence $\{p^k\}$.
Lemma 3.
Assuming that Assumption 1 holds, there exist $a > 0$ and $b > 0$ such that
$$L_\rho(p^k) - L_\rho(p^{k+1}) \ge a \sum_{i=1}^m \left\| \Delta x_i^{k+1} \right\|^2 + b \left\| \Delta y^{k+1} \right\|^2, \quad \forall k,$$
where $\Delta x_i^{k+1} = x_i^{k+1} - x_i^k$, $\Delta y^{k+1} = y^{k+1} - y^k$, and $m$ is the (finite) number of clients. Then, the sequence $\{L_\rho(p^k)\}$ is monotonically decreasing.
Lemma 4.
Assuming that Assumption 1 holds and the sequence $\{p^k\}$ is bounded, then
$$\sum_{k=0}^{+\infty} \left\| p^{k+1} - p^k \right\| < +\infty.$$
Theorem 1 establishes the subsequence convergence property of the iterative sequence generated by Algorithm 2.
Theorem 1
(Subsequence Convergence). Assuming the conditions in Lemma 4 are met, let the set of accumulation points of the sequence $\{p^k\}$ be denoted as $\Omega$. Then, the following conclusions hold:
(1)
$\Omega$ is a non-empty compact set, and $\mathrm{dist}(p^k, \Omega) \to 0$ as $k \to +\infty$;
(2)
$\Omega \subseteq \mathrm{crit}\, L_\rho$.

3.4. Linear Convergence Rate

To obtain the local linear convergence rates of sequences { p k } and { L ρ ( p k ) } generated by Algorithm 2, the following results require that functions f i , i = 1 , 2 , , m be convex. We also make the following assumptions:
Assumption 2.
For any $\zeta \ge \inf_p L_\rho(p)$, there exist $\varepsilon > 0$ and $\varsigma > 0$ such that $\| e(p, 1) \| \le \varepsilon$ and $L_\rho(p) \le \zeta$ imply
$$\mathrm{dist}(p, \mathrm{crit}\, L_\rho) \le \varsigma \| e(p, 1) \|.$$
Remark 2.
According to Lemma 1, the expression for $e(p, 1)$ is
$$e(p, 1) = \begin{cases} e_{x_i}(p, 1) := x_i - \mathrm{Prox}_{f_i}(x_i + u_i), & i = 1, 2, \ldots, m, \\ e_y(p, 1) := \sum_{i=1}^m u_i, \\ e_{u_i}(p, 1) := x_i - y, & i = 1, 2, \ldots, m. \end{cases}$$
According to Lemma 2, for any given $\varrho > 0$, we have
$$\| e(p, 1) \| \le \max \left\{ \varrho, \frac{1}{\varrho} \right\} \| e(p, \varrho) \|.$$
Thus, under Assumption 2, we can also require $\mathrm{dist}(p, \mathrm{crit}\, L_\rho) \le \varsigma \| e(p, \varrho) \|$ whenever $\| e(p, \varrho) \| \le \varepsilon$ and $L_\rho(p) \le \zeta$.
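The two inequalities of Lemma 2 can be spot-checked numerically on the prox component of the residual. The sketch below uses $f = \|\cdot\|_1$ (whose prox is soft-thresholding) at an arbitrary non-stationary point; all values are illustrative:

```python
import numpy as np

def prox_l1(v, lam):
    # soft-thresholding: the proximal operator of lam * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Residual component e_x(rho) = x - Prox_{rho f}(x + rho*u) for f = ||.||_1
# at an arbitrary non-stationary point (x, u).
x = np.array([1.0, -2.0, 0.3])
u = np.array([0.4, 0.1, -0.7])

def e_norm(rho):
    return np.linalg.norm(x - prox_l1(x + rho * u, rho))

rhos = [0.1, 0.5, 1.0, 2.0, 5.0]
norms = [e_norm(r) for r in rhos]
ratios = [n / r for n, r in zip(norms, rhos)]
# ||e(p, rho)|| is non-decreasing in rho; ||e(p, rho)||/rho is non-increasing.
assert all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))
assert all(a >= b - 1e-12 for a, b in zip(ratios, ratios[1:]))
print("residual monotonicity verified")
```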
Assumption 3.
For any given $\bar{p} = (\bar{X}, \bar{y}, \bar{U}) \in \mathrm{crit}\, L_\rho$ and $\tilde{p} = (\tilde{X}, \tilde{y}, \tilde{U}) \in \mathrm{crit}\, L_\rho$, there exists $\delta > 0$ such that when $\| \bar{p} - \tilde{p} \| \le \delta$, $L_\rho(\bar{p}) = L_\rho(\tilde{p})$ holds.
To prove the convergence rate, we also need Lemma 5.
Lemma 5.
Suppose that Assumption 1 holds and the functions $f_i$, $i = 1, 2, \ldots, m$, are convex; then, there exist $\theta_1, \theta_2 > 0$ such that
$$\left\| e(p^{k+1}, 1) \right\| \le \theta_1 \sum_{i=1}^m \left\| \Delta x_i^{k+1} \right\| + \theta_2 \left\| \Delta y^{k+1} \right\|, \quad \forall k.$$
We have shown that Algorithm 2 converges. Now, we would like to see how fast this convergence is based on Lemma 5, as expressed by Theorems 2 and 3.
Theorem 2.
Suppose that Assumptions 1 and 2 and the conditions in Lemma 4 hold and that the functions $f_i$, $i = 1, 2, \ldots, m$, are convex; then, the following conclusions are valid:
(1)
$\mathrm{dist}(p^k, \mathrm{crit}\, L_\rho) \to 0$ as $k \to +\infty$;
(2)
For any given $\bar{p}^k \in \mathrm{crit}\, L_\rho$ with $\| \bar{p}^k - p^k \| = \mathrm{dist}(p^k, \mathrm{crit}\, L_\rho)$, there exists a positive integer ($\hat{k}$) such that
$$L_\rho(\bar{p}^k) = \lim_{k \to +\infty} L_\rho(p^k) = \inf_k L_\rho(p^k), \quad \forall k \ge \hat{k};$$
(3)
The sequence $\{L_\rho(p^k)\}$ is Q-linearly convergent.
Theorem 3.
Assuming the conditions in Theorem 2 hold, the sequence $\{p^k\}$ converges to $\mathrm{crit}\, L_\rho$ at an R-linear convergence rate.
See the proofs of Lemmas 3 and 4, and Theorem 1, as well as Lemma 5, Theorems 2 and 3, in Appendix A.

4. Numerical Experiment

In this section, the performances of Algorithms 2 and 3 are demonstrated by two numerical examples, namely linear regression and logistic regression. All numerical experiments were conducted on a laptop computer with 16 GB of RAM and an Intel(R) Core(TM) i5-12500H 2.3 GHz CPU, using MATLAB (R2021a) for implementation.

4.1. Testing Examples

In our examples, each local client is designated with its own objective function ($f_i$, $i = 1, 2, \ldots, m$, where $m$ signifies the number of client nodes). Subsequently, every client generates random datasets $A_i = [a_1^i, a_2^i, \ldots, a_n^i]$ and $b_i = [b_1^i, b_2^i, \ldots, b_n^i]$. In these datasets, $a_j^i$ ($i = 1, \ldots, m$; $j = 1, 2, \ldots, n$) represents the feature data of dimension $d$, whereas $b_j^i$ ($i = 1, \ldots, m$; $j = 1, 2, \ldots, n$) denotes the label data of dimension 1.
Regarding Algorithms 2 and 3, we establish identical parameters across each node. These parameters include a relaxed step weight of α = 0.5 , a penalty parameter ( ρ i ) set to 1 for each node, a relaxed factor ( τ ) of 0.1 for the initial dual step, and a relaxed factor ( γ ) of 0.5 for the subsequent dual step at every node.
Example 1
(Linear regression). Linear regression, a canonical problem in machine learning, seeks to construct a linear function from a specified dataset, enabling the prediction of relationships between input and output variables. In this context, the objective functions for local nodes are given by
$$f_i(x) = \frac{1}{2n} \sum_{j=1}^n \left( x^T a_j^i - b_j^i \right)^2, \quad i = 1, 2, \ldots, m.$$
In this function, $a_j^i \in \mathbb{R}^n$ and $b_j^i \in \mathbb{R}$ denote the $j$-th sample for client $i$. It should be noted that the above objective function formulates a convex quadratic optimization problem. For this scenario, we randomly generate features ($A_i$) and corresponding labels ($b_i$) from a uniform distribution within the interval of $[0, 1]$. For the purpose of simplification, we initially set $n = 100$ while selecting $m \in [10, 100]$. Subsequently, we fix $m = 30$ and let $n$ fall in the range of $[100, 200]$.
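A quick executable check of this objective (data sizes below are illustrative, not the paper's): the gradient is $\nabla f_i(x) = \frac{1}{n} A^T (Ax - b)$, so the minimizer solves the normal equations:

```python
import numpy as np

# One client's linear-regression objective from Example 1, with
# uniform [0, 1] features and labels as in the experiments.
rng = np.random.default_rng(3)
n, d = 100, 5
A = rng.uniform(0.0, 1.0, size=(n, d))   # rows are samples (a_j^i)^T
b = rng.uniform(0.0, 1.0, size=n)

def f(x):
    return 0.5 / n * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b) / n

# The minimizer solves the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.linalg.norm(grad(x_star)) < 1e-10)   # -> True
```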
Example 2
(Logistic regression). Logistic regression, a prevalently utilized classification algorithm, is especially apt for handling binary classification predicaments. Within this context, local clients define their objective functions as
$$f_i(x) = \frac{1}{n} \sum_{j=1}^n \log \left( 1 + \exp \left( -b_j^i x^T a_j^i \right) \right), \quad i = 1, 2, \ldots, m,$$
where $a_j^i \in \mathbb{R}^n$ and $b_j^i \in \mathbb{R}$ correspond to the $j$-th sample of client $i$. Features ($A_i$) are randomly generated in accordance with a uniform distribution spanning the interval of $[0, 1]$, and labels ($b_i$) are drawn from the set $\{-1, 1\}$. Each nodal dataset is designated with a unique dimensionality. In the first instance, datasets are defined such that $m = 100$ and $n = 1000$, with $k_0$ permitted to be selected from the set $\{5, 8, 10, 20, 25\}$. In a subsequent iteration, the dataset configuration persists with $m = 100$, while $n$ is expanded to 2000, continuously allowing for the selection of $k_0$ from $\{5, 8, 10, 20, 25\}$.
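The logistic objective and its gradient can be sketched as follows (sizes are illustrative only; `np.logaddexp` is used for numerical stability):

```python
import numpy as np

# One client's logistic-regression objective from Example 2, with
# uniform [0, 1] features and labels in {-1, +1}.
rng = np.random.default_rng(4)
n, d = 200, 5
A = rng.uniform(0.0, 1.0, size=(n, d))
b = rng.choice([-1.0, 1.0], size=n)

def f(x):
    # (1/n) * sum_j log(1 + exp(-b_j * x^T a_j)), computed stably
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))    # sigmoid(-b_j * x^T a_j)
    return -(A.T @ (b * s)) / n

# A few gradient-descent steps decrease the loss (f is convex).
x = np.zeros(d)
loss0 = f(x)                                  # = log(2) at x = 0
for _ in range(50):
    x -= 0.5 * grad(x)
print(f(x) < loss0)   # -> True
```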

4.2. Numerical Results

Data generation was conducted in accordance with Example 1. Principal component analysis was then employed to illustrate the distribution of the ensuing data. In this case, the five-dimensional feature data were condensed into a two-dimensional plane, with a color bar employed to signify the continuum of label values, thereby delineating the different random distributions within the [ 0 , 1 ] interval that each original sample encountered at the data nodes.
In Figure 1a–j, the first principal component resulting from the dimensionality reduction via principal component analysis is represented along the x axis, while the second principal component is plotted along the y axis. Figure 1 suggests that the data are variably randomly distributed across the ten nodes.
Upon employing Algorithm 2 for the case illustrated in Example 1 and stipulating the number of communication rounds as 20, a two-stage process was adopted. To initiate the process, the number of data points for a single node is set to n = 100 , and m, denoting the number of nodes, is varied. This produces the graphical output depicted on the left. Subsequently, with a constant node count of m = 30 , the quantity of sample points per node is varied, resulting in the representation displayed on the right.
Figure 2 presents the variation in average node loss for different data quantities. Figure 2a depicts a scenario with a single-node data sample of $n = 100$ and a node count $m$ ranging within $\{10, 30, 50, 80, 100\}$. As evidenced in Figure 2a, the average loss function for nodes under Algorithm 2 descends most rapidly for $m = 100$ and most sluggishly for $m = 10$. Consequently, it can be inferred that an increase in the number of nodes accelerates the decrease in the loss function, implying higher accuracy. Figure 2b considers five scenarios where $m = 30$ and $n$ varies within the range of $\{100, 130, 150, 180, 200\}$. The average node loss function under Algorithm 2 is found to descend most swiftly for $n = 100$ and most slowly for $n = 200$. This indicates that a reduction in the number of samples per node expedites the descent of the loss function.
In addressing Example 1, Algorithm 2 was employed to determine parameters, which were then integrated into the linear model. Subsequently, the original feature data were incorporated, and labels for the original feature data were computed, which were then compared with the original data labels. Dimensionality reduction was achieved via principal component analysis to display a comparative graph of the original and model-predicted data.
Figure 3a–j employ distinct symbols to represent raw and predicted data, with color gradients depicting the corresponding label values. Dot markers indicate the sample data points post dimensionality reduction of the original data via principal component analysis. Cross markers denote data processed through the linear regression model generated by Algorithm 2, with the subsequent predicted labels presented in reduced dimensionality, also by principal component analysis. A comparison between the original and predicted data, as showcased in Figure 3, reveals a high level of accuracy attained by the linear regression model in conjunction with Algorithm 2.
A comparison is also made with conventional Distributed Machine Learning (DML) [34] and Federated Learning (FL) [28]. The loss function progression of these three algorithms is depicted accordingly.
Figure 4 displays the three corresponding loss function curves. From top to bottom, the curves represent the following distinct models. The first pertains to conventional FL, the second to DML, and the third to Algorithm 2. It is observed that the loss function value for Algorithm 2 (Fed-RSADMM) decreases more rapidly than that for DML and FL and also yields the smallest value upon convergence. Figure 4 suggests that Algorithm 2 exhibits superior accuracy and faster convergence compared to traditional algorithms.
To compare the time efficiency of the algorithms, we conducted a comparative analysis of their execution times, yielding the following results.
Figure 5 provides a comparative analysis of the execution times for FL, DML, and Fed-RSADMM applied to test Example 1 over a series of iterations. The performance of the Fed-RSADMM algorithm is traced by the dashed green line, which consistently shows reduced execution times, in contrast to FL (solid blue line) and DML (dash–dot red line). The flatter trajectory of the Fed-RSADMM line across the iteration spectrum underscores its time-efficiency advantage. In essence, the results from test Example 1 endorse the Fed-RSADMM algorithm’s superior time efficiency, with its modest increase in execution time demonstrating potential for scalable and efficient processing in iterative tasks.
Subsequently, Algorithm 3 is subjected to verification via its application to Example 2. To commence, k 0 = 5 is established within Algorithm 3, followed by a comparison with the conventional federated learning algorithm, resulting in the accompanying comparative results.
As depicted in Figure 6, the upper curve corresponds to the loss function for the traditional FL algorithm, whereas the lower curve represents Algorithm 3 (FedAvg-RSADMM). The graph illustrates that Algorithm 3 exhibits a more rapid rate of descent than the FL algorithm, indicating its superior performance over the conventional FL algorithm.
The subsequent analysis focuses on discerning the effect of k 0 on the performance of Algorithm 3. While resolving the logistic regression problem of Example 2 using Algorithm 3, the sample data size of a single node is kept constant at 1000 and 2000, while k 0 varies within the range of { 5 , 8 , 10 , 20 , 25 } , leading to the ensuing comparison.
Figure 7 plots the number of iterations of the local variable ($x$) on the x axis against the average loss of the nodes on the y axis for different $m$, $n$, and $k_0$ values. As per the flow of Algorithm 3, a larger $k_0$ implies fewer global updates. Consequently, Figure 7 reveals a slower descent of the node's loss function with larger $k_0$ values, which is attributable to the reduced number of steps in the global update, thereby conserving computational and communication resources. However, it is noted that Algorithm 3 exhibits similar convergence across the various $k_0$ values. This suggests that an increased $k_0$ effectively reduces communication and computational resource consumption while inducing only a minor error loss. Accordingly, a modest increase in $k_0$ in Algorithm 3 can boost its computational efficiency. The data sample size for a single node is then increased to $n = 2000$ based on the experiment illustrated in Figure 7a for $n = 1000$, yielding the results portrayed in Figure 7b, which mirror those presented in Figure 7a. Thus, it follows that the accuracy of Algorithm 3 remains unimpacted by the escalation of local sample data size, rendering the algorithm suitable for federated learning problems involving extensive data volumes and multiple nodes.
In Figure 8, the number of iterations of the global variable ($y$) is denoted by the x axis, while the average loss of the nodes forms the y axis, resulting in the displayed loss curve. The figure exhibits faster node loss function decreases with larger $k_0$, attributable to the increased number of local updates performed between global variable updates for sizeable $k_0$. Consequently, fewer global update steps are required for convergence, thus significantly diminishing the number of communications and global update steps. Therefore, Algorithm 3 can be deemed effective in reducing communication losses. Again, the data sample size for a single node is amplified to $n = 2000$ based on the experiments of Figure 8a, culminating in the results in Figure 8b. These demonstrate that Algorithm 3 is fitting for distributed optimization problems involving considerable local node data.
To investigate the impact of the number of local updates ($k_0$) and the number of client nodes (m) on the temporal efficiency of the algorithm, we conducted a series of experiments. The parameter $k_0$ determines the number of local updates performed between global aggregations, while m corresponds to the number of client nodes involved in the computation. Our objective was to ascertain whether an increased number of nodes affects the algorithm's parallelism. The experimental outcomes are shown in Figure 9.
Figure 9a illustrates the relationship between the iterative time and the communication rounds at different $k_0$ values for Algorithm 3. An increment in $k_0$ corresponds to a decrease in the total number of iterations required, allowing Algorithm 3 to halt earlier and use less iterative time. Each line in the graph represents the algorithm's performance with a different $k_0$ value, demonstrating that higher values lead to quicker convergence, as evidenced by the curves leveling off sooner. This suggests that tuning $k_0$ can enhance the algorithm's efficiency, reducing computational time while maintaining convergence integrity.
Figure 9b demonstrates the relationship between the iterative time and k 0 with different numbers of nodes (m) for Algorithm 3. It is observed that an increase in m leads to a longer iteration time. However, the overall time decreases with an increase in k 0 , indicating that our algorithm exhibits favorable parallelism. Despite this, the influence of node count on computation time is not entirely negated; therefore, the potential time cost due to an increase in nodes must be considered in computational evaluations. This underscores the necessity of balancing the number of nodes against the performance gains achieved through parallel processing when deploying the algorithm in distributed computing environments.
Obviously, Algorithm 3 balances communication and computation costs through $k_0$. Specifically, the framework performs the global update only at certain steps (i.e., at steps $k$ that are a multiple of the pre-defined integer $k_0$). The larger $k_0$ is, the less time our algorithm needs to converge (see Figure 7, Figure 8 and Figure 9). In addition, the local computational complexity at each node is O(k), where k is the number of iterations. By adjusting $k_0$, we can optimize the trade-off between communication efficiency and computational load, demonstrating the scalability and adaptability of the algorithm for federated learning.
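The communication pattern described above can be sketched as follows. This is a hypothetical illustration of the loop structure only (local steps every round, aggregation every $k_0$ rounds, a convex combination with the global variable as the relaxed step); the function name, step sizes, and update rules are simplified placeholders, not the paper's exact Fed-RSADMM/FedAvg-RSADMM updates, which also involve dual variables.

```python
import numpy as np

def federated_relaxed_sketch(A, b, m, k0, rounds, lr=0.1, alpha=0.9):
    """Illustrative least-squares example: each node takes a local gradient
    step mixed with the global variable (relaxed step); the server aggregates
    only when the round index is a multiple of k0, saving communication."""
    n, d = A.shape
    idx = np.array_split(np.arange(n), m)   # one data shard per node
    x = np.zeros((m, d))                    # local variables x_i
    y = np.zeros(d)                         # global variable y
    comms = 0
    for k in range(1, rounds + 1):
        for i in range(m):
            Ai, bi = A[idx[i]], b[idx[i]]
            grad = Ai.T @ (Ai @ x[i] - bi) / len(idx[i])
            # relaxed step: convex combination of local iterate and global y
            x[i] = alpha * (x[i] - lr * grad) + (1 - alpha) * y
        if k % k0 == 0:                     # communicate every k0 rounds only
            y = x.mean(axis=0)              # global aggregation
            x[:] = y                        # broadcast the aggregate
            comms += 1
    return y, comms
```

With `rounds = 100` and `k0 = 5`, only 20 communication rounds occur, matching the observation that a larger $k_0$ trades a small accuracy loss for far fewer communications.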

5. Conclusions

This study introduces two symmetric ADMM-based federated learning algorithms with relaxed steps. Algorithm 2 bolsters computational efficiency in federated learning, while Algorithm 3 builds on Algorithm 2 to further optimize communication efficiency. Numerical experiments were set up to illustrate the feasibility and efficiency of the algorithms; both exhibit rapid convergence and excellent performance. The experiments, based on linear and logistic regression, were conducted at small scale and serve only as proofs of concept, in contrast to [41,42], which studied large-scale cases and realistic usage. Exploring applications to large-scale optimization problems will therefore be the subject of future research.

Author Contributions

J.L.: data curation, theoretical derivation, methodology, software, writing—original draft preparation, and modification; Y.D.: conceptualization, formal analysis, writing—review and editing, supervision, and validation. Y.Z.: data curation, methodology, software, writing—original draft preparation, and modification. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grants 71901145 and 12371308.

Data Availability Statement

The dataset used in this paper was generated by computer simulation to support the reported experiments.

Acknowledgments

We acknowledge the efforts of the editorial board and anonymous reviewers for their thorough evaluation of our manuscript. Their comments and recommendations contributed to the refinement of our research paper.

Conflicts of Interest

No conflicts of interest exist with respect to the submission of this manuscript, and the manuscript was approved by all authors for publication. The work described is original research that has not been published previously and is not under consideration for publication elsewhere, either in whole or in part. All listed authors approved the enclosed manuscript. The authors declare that they have no known competing financial interests or personal relationships that could appear to have influenced the work reported in this paper.

Appendix A

Appendix A.1. Proof of Lemma 3

For notational simplicity, hereafter, we denote
$$\Delta y^{k+1}:=y^{k+1}-y^{k},\qquad \Delta x_i^{k+1}:=x_i^{k+1}-x_i^{k},\qquad \Delta u_i^{k+1}:=u_i^{k+1}-u_i^{k}.$$
According to Definition (4), we provide the optimality conditions for the subproblems in Fed-RSADMM,
$$0=w_i\nabla f_i\!\left(x_i^{k+1}\right)-u_i^{k+1/2}+\gamma\rho_i\!\left(x_i^{k+1}-y^{k+1}\right),\qquad 0=\sum_{i=1}^{m}\left[u_i^{k}-\rho_i\!\left(x_{rs(i)}^{k+1}-y^{k+1}\right)\right].$$
Based on the update steps for x r s ( i ) and u i , we have
$$\sum_{i=1}^{m}\left[u_i^{k}-\rho_i\!\left(x_{rs(i)}^{k+1}-y^{k+1}\right)\right]=0,\qquad
(\gamma\rho_i+\tau\rho_i\alpha)\!\left(x_i^{k+1}-y^{k+1}\right)=\tau\rho_i\alpha\,\Delta x_i^{k+1}+\tau\rho_i(1-\alpha)\,\Delta y^{k+1}-\Delta u_i^{k+1},$$
$$x_i^{k+1}-y^{k+1}=\frac{1}{\gamma+\tau\alpha}\left[\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1}\right].$$
The optimality conditions for the subproblems involving x i , y, and u i are obtained as follows:
$$w_i\nabla f_i\!\left(x_i^{k+1}\right)=u_i^{k+1},\qquad
\Delta u_i^{k+1}=-(\gamma\rho_i+\tau\rho_i\alpha)\!\left(x_i^{k+1}-y^{k+1}\right)+\tau\rho_i\alpha\!\left(\Delta x_i^{k+1}-\Delta y^{k+1}\right)+\tau\rho_i\,\Delta y^{k+1},$$
Considering the updating methodology for u, it can be inferred that
$$x_i^{k+1}-y^{k+1}=\frac{1}{\gamma+\tau\alpha}\left[\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1}\right].$$
Regarding the subproblem of x i ,
$$w_i f_i\!\left(x_i^{k+1}\right)-\left\langle x_i^{k+1}-y^{k+1},u_i^{k+1/2}\right\rangle+\frac{\gamma\rho_i}{2}\left\|x_i^{k+1}-y^{k+1}\right\|^{2}\le w_i f_i\!\left(x_i^{k}\right)-\left\langle x_i^{k}-y^{k+1},u_i^{k+1/2}\right\rangle+\frac{\gamma\rho_i}{2}\left\|x_i^{k}-y^{k+1}\right\|^{2}.$$
Hence,
$$w_i f_i\!\left(x_i^{k+1}\right)-\left\langle x_i^{k+1}-y^{k+1},u_i^{k+1/2}\right\rangle+\frac{\gamma\rho_i}{2}\left\|x_i^{k+1}-y^{k+1}\right\|^{2}-w_i f_i\!\left(x_i^{k}\right)+\left\langle x_i^{k}-y^{k+1},u_i^{k+1/2}\right\rangle-\frac{\gamma\rho_i}{2}\left\|x_i^{k}-y^{k+1}\right\|^{2}\le 0.$$
Therefore,
$$w_i f_i\!\left(x_i^{k+1}\right)-w_i f_i\!\left(x_i^{k}\right)\le\left\langle\Delta x_i^{k+1},u_i^{k+1/2}\right\rangle+\frac{\gamma\rho_i}{2}\left\|\Delta x_i^{k+1}\right\|^{2}-\gamma\rho_i\left\langle x_i^{k+1}-y^{k+1},\Delta x_i^{k+1}\right\rangle.$$
According to the subproblem of y,
$$\sum_{i=1}^{m}\left[-\left\langle x_{rs(i)}^{k+1}-y^{k+1},u_i^{k}\right\rangle+\frac{\rho_i}{2}\left\|x_{rs(i)}^{k+1}-y^{k+1}\right\|^{2}\right]\le\sum_{i=1}^{m}\left[-\left\langle x_{rs(i)}^{k+1}-y^{k},u_i^{k}\right\rangle+\frac{\rho_i}{2}\left\|x_{rs(i)}^{k+1}-y^{k}\right\|^{2}\right].$$
Therefore, it follows that
$$\sum_{i=1}^{m}\left[\left\langle\Delta y^{k+1},u_i^{k}\right\rangle+\frac{\rho_i}{2}\left(\left\|x_{rs(i)}^{k+1}-y^{k+1}\right\|^{2}-\left\|x_{rs(i)}^{k+1}-y^{k}\right\|^{2}\right)\right]\le 0.$$
After simplification, it is obtained as:
$$\sum_{i=1}^{m}\left[\left\langle\Delta y^{k+1},u_i^{k}\right\rangle-\frac{\rho_i}{2}\left\|\Delta y^{k+1}\right\|^{2}\right]\le\sum_{i=1}^{m}\rho_i\left\langle\Delta y^{k+1},x_{rs(i)}^{k+1}-y^{k+1}\right\rangle.$$
In summary, it is evident that
$$\sum_{i=1}^{m}\left[\left\langle\Delta y^{k+1},u_i^{k}\right\rangle+\frac{\rho_i}{2}\left\|\Delta y^{k+1}\right\|^{2}\right]\le\sum_{i=1}^{m}\rho_i\left\langle\Delta y^{k+1},\alpha\!\left(x_i^{k}-y^{k}\right)\right\rangle.$$
By incorporating the subproblem of x i , we consider the following formulation:
$$L_\rho\!\left(X^{k},y^{k+1},U^{k+1/2}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1/2}\right).$$
The result is given by
$$\begin{aligned}
L_\rho\!\left(X^{k},y^{k+1},U^{k+1/2}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1/2}\right)
&=\sum_{i=1}^{m}\left[w_i f_i\!\left(x_i^{k}\right)-w_i f_i\!\left(x_i^{k+1}\right)+\left\langle\Delta x_i^{k+1},u_i^{k+1/2}\right\rangle-\rho_i\left\langle x_i^{k+1}-y^{k+1},\Delta x_i^{k+1}\right\rangle+\frac{\rho_i}{2}\left\|\Delta x_i^{k+1}\right\|^{2}\right]\\
&\ge\sum_{i=1}^{m}\left[\rho_i(\gamma-1)\left\langle x_i^{k+1}-y^{k+1},\Delta x_i^{k+1}\right\rangle+\frac{\rho_i(1-\gamma)}{2}\left\|\Delta x_i^{k+1}\right\|^{2}\right].
\end{aligned}$$
Integrating the subproblem of y, now, we consider the following formulation:
$$L_\rho\!\left(X^{k},y^{k},U^{k}\right)-L_\rho\!\left(X^{k},y^{k+1},U^{k}\right).$$
The outcome is expressed as follows:
$$\begin{aligned}
L_\rho\!\left(X^{k},y^{k},U^{k}\right)-L_\rho\!\left(X^{k},y^{k+1},U^{k}\right)
&=\sum_{i=1}^{m}\left[-\left\langle\Delta y^{k+1},u_i^{k}\right\rangle+\rho_i\left\langle x_i^{k}-y^{k},\Delta y^{k+1}\right\rangle-\frac{\rho_i}{2}\left\|\Delta y^{k+1}\right\|^{2}\right]\\
&\ge\sum_{i=1}^{m}(1-\alpha)\rho_i\left\langle\Delta y^{k+1},x_i^{k}-y^{k}\right\rangle\\
&=\sum_{i=1}^{m}\frac{\rho_i(1-\alpha)}{\gamma+\tau\alpha}\left[-\gamma\left\langle\Delta x_i^{k+1},\Delta y^{k+1}\right\rangle+(\gamma+\tau)\left\|\Delta y^{k+1}\right\|^{2}-\frac{1}{\rho_i}\left\langle\Delta y^{k+1},\Delta u_i^{k+1}\right\rangle\right].
\end{aligned}$$
Utilizing the updating step of the multiplier ( u i ), in conjunction with Equation (A5), it is found that
$$\begin{aligned}
&L_\rho\!\left(X^{k},y^{k+1},U^{k}\right)-L_\rho\!\left(X^{k},y^{k+1},U^{k+1/2}\right)+L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1/2}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1}\right)\\
&=\sum_{i=1}^{m}\left[\left\langle x_i^{k}-y^{k+1},u_i^{k+1/2}-u_i^{k}\right\rangle-\left\langle x_i^{k+1}-y^{k+1},u_i^{k+1/2}-u_i^{k+1}\right\rangle\right]\\
&=\sum_{i=1}^{m}\left[\left\langle\gamma\rho_i\,\Delta x_i^{k+1}+\Delta u_i^{k+1},\,x_i^{k+1}-y^{k+1}\right\rangle-\left\langle\Delta x_i^{k+1},\Delta u_i^{k+1}\right\rangle\right].
\end{aligned}$$
By cumulatively considering (A13), (A15), and (A16) with γ + τ α > 0 , it is concluded that
$$\begin{aligned}
L_\rho\!\left(X^{k},y^{k},U^{k}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1}\right)
\ge\sum_{i=1}^{m}\Big[&\frac{\rho_i(\gamma^{2}-\gamma\tau-\alpha\tau+\alpha\gamma)}{2(\gamma+\tau\alpha)}\left\|\Delta x_i^{k+1}\right\|^{2}+\frac{\rho_i(1-\alpha)(\gamma+\tau)}{2(\gamma+\tau\alpha)}\left\|\Delta y^{k+1}\right\|^{2}-\frac{1}{\rho_i(\gamma+\tau\alpha)}\left\|\Delta u_i^{k+1}\right\|^{2}\\
&+\frac{1-\gamma}{\gamma+\tau\alpha}\left\langle\Delta x_i^{k+1},\Delta u_i^{k+1}\right\rangle+\frac{\tau(1-\alpha)-1}{\gamma+\tau\alpha}\left\langle\Delta y^{k+1},\Delta u_i^{k+1}\right\rangle\Big].
\end{aligned}$$
Upon further use of $\gamma+\tau\alpha>0$, the Lipschitz continuity of $\nabla f_i$, and the Cauchy–Schwarz inequality, we obtain
$$\begin{aligned}
L_\rho\!\left(X^{k},y^{k},U^{k}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1}\right)
&\ge\sum_{i=1}^{m}\Big[\frac{\rho_i(\gamma^{2}-\gamma\tau-\alpha\tau+\alpha\gamma)}{2(\gamma+\tau\alpha)}\left\|\Delta x_i^{k+1}\right\|^{2}+\frac{\rho_i(1-\alpha)(\gamma+\tau)}{2(\gamma+\tau\alpha)}\left\|\Delta y^{k+1}\right\|^{2}\\
&\qquad-\frac{1}{\rho_i(\gamma+\tau\alpha)}\left\|\Delta u_i^{k+1}\right\|^{2}-\frac{(1-\gamma)L_i}{\gamma+\tau\alpha}\left\|\Delta x_i^{k+1}\right\|^{2}+\frac{\tau(1-\alpha)-1}{\gamma+\tau\alpha}\left\langle\Delta y^{k+1},\Delta u_i^{k+1}\right\rangle\Big]\\
&\ge\sum_{i=1}^{m}\left[\frac{\rho_i(\gamma^{2}-\gamma\tau-\alpha\tau+\alpha\gamma)}{2(\gamma+\tau\alpha)}-\frac{L_i^{2}}{\rho_i(\gamma+\tau\alpha)}-\frac{(\tau(1-\alpha)-1)L^{2}+2(1-\gamma)L_i}{2(\gamma+\tau\alpha)}\right]\left\|\Delta x_i^{k+1}\right\|^{2}\\
&\qquad+\sum_{i=1}^{m}\left[\frac{\rho_i(\gamma+\tau)(1-\alpha)}{2(\gamma+\tau\alpha)}-\frac{\tau(1-\alpha)-1}{2(\gamma+\tau\alpha)}\right]\left\|\Delta y^{k+1}\right\|^{2}.
\end{aligned}$$
Let $\rho=\min_{i=1,2,\ldots,m}\rho_i$ and $L=\max_{i=1,2,\ldots,m}L_i$; it then follows that
$$\begin{aligned}
L_\rho\!\left(X^{k},y^{k},U^{k}\right)-L_\rho\!\left(X^{k+1},y^{k+1},U^{k+1}\right)
&\ge\left[\frac{\rho(\gamma^{2}-\gamma\tau-\alpha\tau+\alpha\gamma)}{2(\gamma+\tau\alpha)}-\frac{L^{2}}{\rho(\gamma+\tau\alpha)}-\frac{(\tau(1-\alpha)-1)L^{2}+2(1-\gamma)L}{2(\gamma+\tau\alpha)}\right]\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}\\
&\qquad+m\left[\frac{\rho(\gamma+\tau)(1-\alpha)}{2(\gamma+\tau\alpha)}-\frac{\tau(1-\alpha)-1}{2(\gamma+\tau\alpha)}\right]\left\|\Delta y^{k+1}\right\|^{2}\\
&=a\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+b\left\|\Delta y^{k+1}\right\|^{2},
\end{aligned}$$
wherein
$$a=\frac{\rho(\gamma^{2}-\gamma\tau-\alpha\tau+\alpha\gamma)}{2(\gamma+\tau\alpha)}-\frac{L^{2}}{\rho(\gamma+\tau\alpha)}-\frac{(\tau(1-\alpha)-1)L^{2}+2(1-\gamma)L}{2(\gamma+\tau\alpha)},$$
$$b=m\left[\frac{\rho(\gamma+\tau)(1-\alpha)}{2(\gamma+\tau\alpha)}-\frac{\tau(1-\alpha)-1}{2(\gamma+\tau\alpha)}\right].$$
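The descent constants above can be evaluated numerically for a given parameter choice. The following helper transcribes the reconstructed formulas for $a$ and $b$ as given here (the formulas are a reconstruction of the extraction-damaged originals, so this is a sketch under that assumption); positivity of both constants is what guarantees sufficient descent of $L_\rho$ in Lemma 3.

```python
def descent_coefficients(rho, gamma, tau, alpha, L, m):
    """Descent constants a and b from the expressions above.
    rho plays the role of min_i rho_i and L of max_i L_i."""
    d = gamma + tau * alpha  # common denominator term gamma + tau*alpha
    a = (rho * (gamma**2 - gamma*tau - alpha*tau + alpha*gamma) / (2 * d)
         - L**2 / (rho * d)
         - ((tau * (1 - alpha) - 1) * L**2 + 2 * (1 - gamma) * L) / (2 * d))
    b = m * (rho * (gamma + tau) * (1 - alpha) / (2 * d)
             - (tau * (1 - alpha) - 1) / (2 * d))
    return a, b
```

For instance, with $\gamma=1$, $\alpha=1$, $\tau=0.1$, $\rho=10$, $L=1$, and $m=10$, both constants are positive, so the Lagrangian sequence decreases.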

Appendix A.2. Proof of Lemma 4

Given that the sequence $\{p^k\}$ is bounded, it has at least one limit point. Without loss of generality, assume that $p^*$ is a limit point of $\{p^k\}$ and that the subsequence $\{p^{k_j}\}$ converges to $p^*$. Since $f_i$ is lower semi-continuous, $L_\rho$ is also lower semi-continuous. Hence,
$$L_\rho\!\left(p^{*}\right)\le\liminf_{j\to+\infty}L_\rho\!\left(p^{k_j}\right).$$
The above inequality shows that $\{L_\rho(p^{k_j})\}$ is bounded below. It also follows from Lemma 3 that $\{L_\rho(p^{k})\}$ is monotonically decreasing, so $\{L_\rho(p^{k_j})\}$ is monotonically decreasing as well and therefore converges. Since the monotone sequence $\{L_\rho(p^{k})\}$ has a convergent subsequence, the whole sequence converges, and
$$\lim_{k\to+\infty}L_\rho\!\left(p^{k}\right)\ge L_\rho\!\left(p^{*}\right).$$
Rearranging (35) results in
$$a\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+b\left\|\Delta y^{k+1}\right\|^{2}\le L_\rho\!\left(p^{k}\right)-L_\rho\!\left(p^{k+1}\right).$$
Summing this inequality over finitely many terms and taking the limit yields
$$\sum_{k=1}^{N}\left(a\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+b\left\|\Delta y^{k+1}\right\|^{2}\right)\le L_\rho\!\left(p^{0}\right)-L_\rho\!\left(p^{*}\right).$$
Consequently, $\sum_{k=0}^{+\infty}\left(a\sum_{i=1}^{m}\|\Delta x_i^{k+1}\|^{2}+b\|\Delta y^{k+1}\|^{2}\right)<+\infty$. Further, combining this with $\sum_{k=0}^{+\infty}\|\Delta u_i^{k}\|^{2}<+\infty$, Equation (A5) shows that
$$x_i^{k+1}=\frac{1}{\gamma+\tau\alpha}\left[\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1}\right]+y^{k+1};$$
therefore,
$$\Delta x_i^{k+1}=O\!\left(\Delta x_i^{k+1}\right)+O\!\left(\Delta y^{k+1}\right)+\Delta y^{k+1}-\frac{1}{(\gamma+\tau\alpha)\rho_i}\left(\Delta u_i^{k+1}-\Delta u_i^{k}\right).$$
Then, according to the Cauchy inequality, we obtain
$$\left\|\Delta x_i^{k+1}\right\|\le\left\|\Delta y^{k+1}\right\|+\frac{1}{(\gamma+\tau\alpha)\rho_i}\left\|\Delta u_i^{k+1}\right\|+\frac{1}{(\gamma+\tau\alpha)\rho_i}\left\|\Delta u_i^{k}\right\|.$$
Combining this with $\sum_{k=0}^{+\infty}\|\Delta y^{k+1}\|^{2}<+\infty$ and $\sum_{k=0}^{+\infty}\|\Delta u_i^{k+1}\|^{2}<+\infty$, it holds that $\sum_{k=0}^{+\infty}\|\Delta x_i^{k+1}\|^{2}<+\infty$. Therefore, it can be immediately established that
$$\sum_{k=0}^{+\infty}\left\|p^{k+1}-p^{k}\right\|^{2}<+\infty.$$
The proof is complete.

Appendix A.3. Proof of Theorem 1

(1) From the definition of $\Omega$, conclusion (1) is validated.
(2) If $p^*\in\Omega$, there exists a subsequence $\{p^{k_j}\}$ of $\{p^k\}$ such that $p^{k_j}\to p^*$ as $j\to+\infty$. Additionally, from Lemma 4, we know that $\|p^{k_j+1}-p^{k_j}\|\to 0$; thus, $p^{k_j+1}\to p^*$. Also, since $X^{k_j+1}$ is the solution to the $x$-subproblem in Algorithm 2, for any $k_j$, it holds that
$$L_\rho\!\left(X^{k_j+1},y^{k_j},U^{k_j}\right)\le L_\rho\!\left(X^{*},y^{k_j},U^{k_j}\right).$$
Subsequently,
$$\limsup_{j\to+\infty}L_\rho\!\left(p^{k_j+1}\right)=\limsup_{j\to+\infty}L_\rho\!\left(X^{k_j+1},y^{k_j},U^{k_j}\right)\le\limsup_{j\to+\infty}L_\rho\!\left(X^{*},y^{k_j},U^{k_j}\right)=L_\rho\!\left(p^{*}\right).$$
On the other hand, the lower semi-continuity of $L_\rho(\cdot)$ gives $\liminf_{j\to+\infty}L_\rho(p^{k_j+1})\ge L_\rho(p^*)$. Therefore, $\lim_{j\to+\infty}L_\rho(p^{k_j+1})=L_\rho(p^*)$ and, consequently, $\lim_{j\to+\infty}f(x^{k_j+1})=f(x^*)$. Due to the closedness of $\partial f$, setting $k=k_j$ in Equation (23) and taking the limit as $j\to+\infty$ yields
$$U^{*}\in\partial f\!\left(X^{*}\right),\qquad X^{*}-y^{*}=0.$$
This, in conjunction with Definition 3, establishes that $p^{*}\in\mathrm{crit}\,L_\rho$. Q.E.D.

Appendix A.4. Proof of Lemma 5

First, based on Equations (23)–(25), the definition of $e(p^{k+1},1)$, and the non-expansiveness of the proximity operator, it can be shown that
$$\begin{aligned}
e_X\!\left(p^{k+1},1\right)&=\sum_{i=1}^{m}\left\|x_i^{k+1}-\mathrm{Prox}_{f_i}\!\left(x_i^{k+1}+u_i^{k+1}\right)\right\|
=\sum_{i=1}^{m}\left\|\mathrm{Prox}_{f_i}\!\left(x_i^{k+1}+w_i\nabla f_i(x_i^{k+1})\right)-\mathrm{Prox}_{f_i}\!\left(x_i^{k+1}+u_i^{k+1}\right)\right\|\\
&=\sum_{i=1}^{m}\left\|\mathrm{Prox}_{f_i}\!\left(x_i^{k+1}+u_i^{k+1}\right)-\mathrm{Prox}_{f_i}\!\left(x_i^{k+1}+u_i^{k+1}\right)\right\|=0.
\end{aligned}$$
Additionally, from (23), it is known that
$$e_y\!\left(p^{k+1},1\right)=\left\|\sum_{i=1}^{m}u_i^{k}\right\|=0.$$
Next, integrating (24) and (25), it is established that
$$\begin{aligned}
e_U\!\left(p^{k+1},1\right)&=\sum_{i=1}^{m}\left\|x_i^{k+1}-y^{k+1}\right\|
=\sum_{i=1}^{m}\left\|\frac{1}{\gamma+\tau\alpha}\left[\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1}\right]\right\|\\
&\le\sum_{i=1}^{m}\frac{\tau\alpha\rho_i+L}{\rho_i(\gamma+\tau\alpha)}\left\|\Delta x_i^{k+1}\right\|+\frac{m\,\tau(1-\alpha)}{\gamma+\tau\alpha}\left\|\Delta y^{k+1}\right\|
\le\frac{\tau\alpha\rho_{\max}+L}{\rho(\gamma+\tau\alpha)}\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|+\frac{m\,\tau(1-\alpha)}{\gamma+\tau\alpha}\left\|\Delta y^{k+1}\right\|.
\end{aligned}$$
Finally, according to Equations (A23)–(A25), there exist positive numbers $\theta_1,\theta_2$ such that
$$e\!\left(p^{k+1},1\right)=\sqrt{e_X\!\left(p^{k+1},1\right)^{2}+e_y\!\left(p^{k+1},1\right)^{2}+e_U\!\left(p^{k+1},1\right)^{2}}\le\theta_1\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|+\theta_2\left\|\Delta y^{k+1}\right\|.$$

Appendix A.5. Proof of Theorem 2

(1) By Lemma 4, $\|p^{k+1}-p^{k}\|\to 0$. This, in conjunction with (37), yields $e(p^{k+1},1)\to 0$. Furthermore, as $\{L_\rho(p^{k})\}$ is monotonically decreasing, $L_\rho(p^{k})\le L_\rho(p^{0})$ for all $k$. Integrating this with Assumption 2, there exist $\varsigma>0$ and a positive integer $k_1$ such that
$$\mathrm{dist}\!\left(p^{k},\mathrm{crit}\,L_\rho\right)\le\varsigma\,e\!\left(p^{k},1\right),\quad\forall k\ge k_1.$$
Consequently, conclusion (1) is validated.
(2) Choose $\bar p^{k+1}=\left(\bar X^{k+1},\bar y^{k+1},\bar U^{k+1}\right)\in\mathrm{crit}\,L_\rho$ such that
$$\left\|\bar p^{k+1}-p^{k+1}\right\|=\mathrm{dist}\!\left(p^{k+1},\mathrm{crit}\,L_\rho\right),\quad\forall k.$$
Combined with the above conclusion (1), $\|p^{k+1}-\bar p^{k+1}\|\to 0$. Utilizing the triangle inequality, it is further deduced that
$$\left\|\bar p^{k}-\bar p^{k+1}\right\|\le\left\|\bar p^{k}-p^{k}\right\|+\left\|p^{k}-p^{k+1}\right\|+\left\|p^{k+1}-\bar p^{k+1}\right\|\to 0.$$
According to Assumption 3, for any $\bar p,\tilde p\in\mathrm{crit}\,L_\rho$ with $\|\bar p-\tilde p\|\le\delta$ ($\delta>0$), we have $L_\rho(\bar p)=L_\rho(\tilde p)$. Therefore, according to (A29), there exist a positive integer $\hat k\ge k_1$ and a constant $L_\rho^{*}$ such that $L_\rho(\bar p^{k+1})=L_\rho(\bar p^{k})=L_\rho^{*}$ for all $k\ge\hat k$.
Next, we analyze the properties of $L_\rho^{*}$. According to Theorem 1(2), any accumulation point of $\{p^{k}\}$ is a critical point of $L_\rho$, and $\lim_{j\to+\infty}L_\rho(p^{k_j+1})=L_\rho(p^{*})$. Since $\{L_\rho(p^{k})\}$ converges, $\lim_{k\to+\infty}L_\rho(p^{k})=L_\rho(p^{*})=\inf_k L_\rho(p^{k})$. Hence, $L_\rho$ takes the same value at every accumulation point of $\{p^{k}\}$.
Since $\|p^{k_j+1}-\bar p^{k_j+1}\|\to 0$ and $p^{k_j+1}\to p^{*}$, we have $\|p^{*}-\bar p^{k_j+1}\|\to 0$. Consequently, integration with Assumption 3 yields
$$L_\rho\!\left(\bar p^{k}\right)=L_\rho^{*}=L_\rho\!\left(p^{*}\right)=\inf_k L_\rho\!\left(p^{k}\right),\quad\forall k\ge\hat k.$$
(3) According to (A28), it is understood that
$$\sum_{i=1}^{m}\left\|\bar x_i^{k+1}-x_i^{k+1}\right\|\le\mathrm{dist}\!\left(p^{k+1},\mathrm{crit}\,L_\rho\right),\qquad\left\|\bar y^{k+1}-y^{k+1}\right\|\le\mathrm{dist}\!\left(p^{k+1},\mathrm{crit}\,L_\rho\right),$$
$$\bar x_i^{k+1}-\bar y^{k+1}=0.$$
From the definition of the ALF in (16), it is deduced that
$$L_\rho\!\left(p^{k+1}\right)-L_\rho\!\left(\bar p^{k+1}\right)=\sum_{i=1}^{m}\left[f_i\!\left(x_i^{k+1}\right)-\left\langle x_i^{k+1}-y^{k+1},u_i^{k+1}\right\rangle+\frac{\rho_i}{2}\left\|x_i^{k+1}-y^{k+1}\right\|^{2}\right]-\sum_{i=1}^{m}\left[f_i\!\left(\bar x_i^{k+1}\right)-\left\langle\bar x_i^{k+1}-\bar y^{k+1},\bar u_i^{k+1}\right\rangle+\frac{\rho_i}{2}\left\|\bar x_i^{k+1}-\bar y^{k+1}\right\|^{2}\right].$$
On the other hand, due to the convexity of $f_i$, it is inferred that
$$f_i\!\left(x_i^{k+1}\right)-f_i\!\left(\bar x_i^{k+1}\right)\le\left\langle u_i^{k+1},x_i^{k+1}-\bar x_i^{k+1}\right\rangle.$$
Combining the above two relations with (A5) and (A32) and simplifying (A33), we obtain
$$\begin{aligned}
L_\rho\!\left(p^{k+1}\right)-L_\rho\!\left(\bar p^{k+1}\right)&\le\sum_{i=1}^{m}\left[\left\langle u_i^{k+1},x_i^{k+1}-\bar x_i^{k+1}\right\rangle-\left\langle x_i^{k+1}-y^{k+1},u_i^{k+1}\right\rangle+\frac{\rho_i}{2}\left\|x_i^{k+1}-y^{k+1}\right\|^{2}\right]\\
&=\sum_{i=1}^{m}\Big[\left\langle u_i^{k+1},x_i^{k+1}-\bar x_i^{k+1}\right\rangle-\frac{1}{\gamma+\tau\alpha}\Big\langle\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1},\,u_i^{k+1}\Big\rangle\\
&\qquad+\frac{\rho_i}{2(\gamma+\tau\alpha)^{2}}\Big\|\tau\alpha\,\Delta x_i^{k+1}+\tau(1-\alpha)\,\Delta y^{k+1}-\frac{1}{\rho_i}\Delta u_i^{k+1}\Big\|^{2}\Big].
\end{aligned}$$
Furthermore, according to simple calculations and by integrating (A31), there must exist positive numbers $t_1,t_2,t_3,t_4$ such that
$$\begin{aligned}
L_\rho\!\left(p^{k+1}\right)-L_\rho\!\left(\bar p^{k+1}\right)&\le t_1\left\|\Delta y^{k+1}\right\|^{2}+t_2\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+t_3\sum_{i=1}^{m}\left(\left\|x_i^{k+1}-\bar x_i^{k+1}\right\|^{2}+\left\|y^{k+1}-\bar y^{k+1}\right\|^{2}\right)+t_4\sum_{i=1}^{m}\left\|\Delta u_i^{k+1}\right\|^{2}\\
&\le t_1\left\|\Delta y^{k+1}\right\|^{2}+\left(t_2+t_4L^{2}\right)\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+2t_3\,\mathrm{dist}^{2}\!\left(p^{k+1},\mathrm{crit}\,L_\rho\right).
\end{aligned}$$
Furthermore, based on the aforementioned conclusion (2), Assumption 2, and Lemma 5, it is found that
$$\begin{aligned}
L_\rho\!\left(p^{k+1}\right)-\inf_k L_\rho\!\left(p^{k}\right)&=L_\rho\!\left(p^{k+1}\right)-L_\rho\!\left(\bar p^{k+1}\right)\\
&\le t_1\left\|\Delta y^{k+1}\right\|^{2}+\left(t_2+t_4L^{2}\right)\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+2t_3\varsigma^{2}\,e\!\left(p^{k+1},1\right)^{2}\\
&\le\left(t_1+2t_3\varsigma^{2}\theta_2\right)\left\|\Delta y^{k+1}\right\|^{2}+\left(t_2+t_4L^{2}+2t_3\varsigma^{2}\theta_1 m\right)\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}\\
&=h_1\left\|\Delta y^{k+1}\right\|^{2}+h_2\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2},\quad\forall k\ge\hat k,
\end{aligned}$$
where $h_1=t_1+2t_3\varsigma^{2}\theta_2$, $h_2=t_2+t_4L^{2}+2t_3\varsigma^{2}\theta_1 m$, and $h=\max\{h_1,h_2\}$. This, together with Equation (35), the inequality $a<b$, and the condition $L_\rho(p^{k+1})-\inf_k L_\rho(p^{k})\ge 0$, yields
$$L_\rho\!\left(p^{k+1}\right)-\inf_k L_\rho\!\left(p^{k}\right)\le L_\rho\!\left(p^{k}\right)-\inf_k L_\rho\!\left(p^{k}\right)-\left(a\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+b\left\|\Delta y^{k+1}\right\|^{2}\right)\le L_\rho\!\left(p^{k}\right)-\inf_k L_\rho\!\left(p^{k}\right)-\frac{a}{h}\left(L_\rho\!\left(p^{k+1}\right)-\inf_k L_\rho\!\left(p^{k}\right)\right).$$
Hence, for sufficiently large values of $k$, it holds that
$$0\le L_\rho\!\left(p^{k+1}\right)-\inf_k L_\rho\!\left(p^{k}\right)\le\frac{1}{1+a/h}\left(L_\rho\!\left(p^{k}\right)-\inf_k L_\rho\!\left(p^{k}\right)\right).$$
Consequently, the sequence $\{L_\rho(p^{k})\}$ is Q-linearly convergent.
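As a purely illustrative numerical check of this contraction (the constants below are made up, not derived from the paper's parameters), iterating the recursion drives the optimality gap to zero geometrically, and a geometric step bound yields the Cauchy-sequence tail estimate used in Theorem 3:

```python
def qlinear_gaps(gap0, contraction, steps):
    """Iterate gap_{k+1} = gap_k / (1 + a/h): the Q-linear decay of
    L_rho(p^k) - inf_k L_rho(p^k). 'contraction' plays the role of a/h."""
    q = 1.0 / (1.0 + contraction)
    return [gap0 * q**k for k in range(steps + 1)]

def cauchy_tail_bound(M_bar, q_hat, m1):
    """Geometric tail sum_{k >= m1} M_bar * q_hat^k = M_bar q_hat^m1 / (1 - q_hat),
    which bounds ||p^{m2} - p^{m1}|| and shows {p^k} is a Cauchy sequence."""
    return M_bar * q_hat**m1 / (1.0 - q_hat)
```

Any partial sum of the step norms starting at index $m_1$ stays below `cauchy_tail_bound`, which is exactly why the iterates converge R-linearly.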

Appendix A.6. Proof of Theorem 3

According to Equation (35), it can be established that
$$a\sum_{i=1}^{m}\left\|\Delta x_i^{k+1}\right\|^{2}+b\left\|\Delta y^{k+1}\right\|^{2}\le L_\rho\!\left(p^{k}\right)-L_\rho\!\left(p^{k+1}\right)\le L_\rho\!\left(p^{k}\right)-\inf_k L_\rho\!\left(p^{k}\right).$$
From Theorem 2, we know that the sequence $\{L_\rho(p^{k})\}$ is Q-linearly convergent, so there exist $0<\hat q<1$ and $M_1>0$ such that $\|\Delta y^{k+1}\|\le M_1\hat q^{k}$ for all $k$. This, in combination with Equation (36), implies the existence of $M_2>0$ and $M_3>0$ such that
$$\left\|\Delta x_i^{k+1}\right\|\le M_2\hat q^{k},\qquad\left\|\Delta u_i^{k+1}\right\|\le M_3\hat q^{k},\quad\forall k.$$
Hence, it can be concluded that
$$\left\|p^{k+1}-p^{k}\right\|\le\bar M\hat q^{k},\quad\forall k,$$
where $\bar M=\sqrt{M_1^{2}+M_2^{2}+M_3^{2}}>0$. Therefore, for any $m_2>m_1\ge 1$, it holds that
$$\left\|p^{m_2}-p^{m_1}\right\|\le\sum_{k=m_1}^{m_2-1}\left\|p^{k+1}-p^{k}\right\|\le\frac{\bar M}{1-\hat q}\,\hat q^{m_1}.$$
This indicates that $\{p^{k}\}$ is a Cauchy sequence; hence, it converges. Let its limit point be denoted by $\hat p$; then,
$$\left\|p^{m_1}-\hat p\right\|\le\frac{\bar M}{1-\hat q}\,\hat q^{m_1}.$$
Furthermore, from Theorem 1(1), the sequence $\{p^{k}\}$ converges to a stationary point of $L_\rho$ at an R-linear rate.

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  2. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  3. Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; Liu, Y. Fedpd: A federated learning framework with adaptivity to non-iid data. IEEE Trans. Signal Process. 2021, 69, 6055–6070. [Google Scholar] [CrossRef]
  4. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. Acm Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
  5. Liu, T.; Wang, Z.; He, H.; Shi, W.; Lin, L.; An, R.; Li, C. Efficient and secure federated learning for financial applications. Appl. Sci. 2023, 13, 5877. [Google Scholar] [CrossRef]
  6. Zeng, Q.; Lv, Z.; Li, C.; Shi, Y.; Lin, Z.; Liu, C.; Song, G. FedProLs: Federated learning for IoT perception data prediction. Appl. Intell. 2023, 53, 3563–3575. [Google Scholar] [CrossRef]
  7. Manias, D.M.; Shami, A. Making a case for federated learning in the internet of vehicles and intelligent transportation systems. IEEE Netw. 2021, 35, 88–94. [Google Scholar] [CrossRef]
  8. Posner, J.; Tseng, L.; Aloqaily, M.; Jararweh, Y. Federated learning in vehicular networks: Opportunities and solutions. IEEE Netw. 2021, 35, 152–159. [Google Scholar] [CrossRef]
  9. Konečný, J.; McMahan, B.; Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv 2015, arXiv:1511.03575. [Google Scholar]
  10. Satish, S.; Nadella, G.S.; Meduri, K.; Gonaygunta, H. Collaborative Machine Learning without Centralized Training Data for Federated Learning. Int. Mach. Learn. J. Comput. Eng. 2022, 5, 1–14. [Google Scholar]
  11. Zhou, S.; Li, G.Y. Federated learning via inexact ADMM. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9699–9708. [Google Scholar] [CrossRef]
  12. Elgabli, A.; Park, J.; Ahmed, S.; Bennis, M. L-FGADMM: Layer-wise federated group ADMM for communication efficient decentralized deep learning. In Proceedings of the 2020 IEEE Wireless Communications and Networking Conference (WCNC), Seoul, Republic of Korea, 25–28 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  13. Zhang, X.; Khalili, M.M.; Liu, M. Improving the privacy and accuracy of ADMM-based distributed algorithms. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5796–5805. [Google Scholar]
  14. Guo, Y.; Gong, Y. Practical collaborative learning for crowdsensing in the internet of things with differential privacy. In Proceedings of the 2018 IEEE Conference on Communications and Network Security (CNS), Beijing, China, 30 May–1 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–9. [Google Scholar]
  15. Zhang, X.; Khalili, M.M.; Liu, M. Recycled ADMM: Improve privacy and accuracy with less computation in distributed algorithms. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 959–965. [Google Scholar]
  16. Huang, Z.; Hu, R.; Guo, Y.; Chan-Tin, E.; Gong, Y. DP-ADMM: ADMM-based distributed learning with differential privacy. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1002–1012. [Google Scholar] [CrossRef]
  17. He, S.; Zheng, J.; Feng, M.; Chen, Y. Communication-efficient federated learning with adaptive consensus admm. Appl. Sci. 2023, 13, 5270. [Google Scholar] [CrossRef]
  18. Ding, J.; Errapotu, S.M.; Zhang, H.; Gong, Y.; Pan, M. Stochastic ADMM based distributed machine learning with differential privacy. In Proceedings of the Security and Privacy in Communication Networks: 15th EAI International Conference, SecureComm 2019, Orlando, FL, USA, 23–25 October 2019; Proceedings, Part I 15. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 257–277. [Google Scholar]
  19. Hager, W.W.; Zhang, H. Convergence rates for an inexact ADMM applied to separable convex optimization. Comput. Optim. Appl. 2020, 77, 729–754. [Google Scholar] [CrossRef]
  20. Yue, S.; Ren, J.; Xin, J.; Lin, S.; Zhang, J. Inexact-ADMM based federated meta-learning for fast and continual edge learning. In Proceedings of the Twenty-Second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, Shanghai, China, 26–29 July 2021; pp. 91–100. [Google Scholar]
  21. Ryu, M.; Kim, K. Differentially private federated learning via inexact ADMM with multiple local updates. arXiv 2022, arXiv:2202.09409. [Google Scholar]
  22. Zhang, S.; Choromanska, A.E.; LeCun, Y. Deep learning with elastic averaging SGD. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  23. Koloskova, A.; Stich, S.U.; Jaggi, M. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December 2022; pp. 17202–17215. [Google Scholar]
  24. Yu, H.; Yang, S.; Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5693–5700. [Google Scholar]
  25. Dai, S.; Meng, F. Addressing modern and practical challenges in machine learning: A survey of online federated and transfer learning. Appl. Intell. 2023, 53, 11045–11072. [Google Scholar] [CrossRef]
  26. Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. J. Mach. Learn. Res. 2021, 22, 1–50. [Google Scholar]
  27. Smith, V.; Forte, S.; Ma, C.; Takac, M.; Jordan, M.I.; Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. J. Mach. Learn. Res. 2018, 18, 1–49. [Google Scholar]
  28. Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
  29. Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
  30. Li, T.; Sahu, A.K.; Sanjabi, M.; Zaheer, M.; Talwalkar, A.; Smith, V. On the convergence of federated optimization in heterogeneous networks. arXiv 2018, arXiv:1812.06127. [Google Scholar]
  31. Rockafellar, R.T.; Wets RJ, B. Variational Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  32. Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends® Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]
  33. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  34. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  35. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Keynote, Invited and Contributed Papers. Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186. [Google Scholar]
  36. Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221. [Google Scholar] [CrossRef]
  37. Zhu, Y. An augmented ADMM algorithm with application to the generalized lasso problem. J. Comput. Graph. Stat. 2017, 26, 195–204. [Google Scholar] [CrossRef]
  38. Yang, K.; Jiang, T.; Shi, Y.; Ding, Z. Federated learning via over-the-air computation. IEEE Trans. Wirel. Commun. 2020, 19, 2022–2035. [Google Scholar] [CrossRef]
  39. Jia, Z.; Gao, X.; Cai, X.; Han, D. Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems. J. Optim. Theory Appl. 2021, 188, 1–25. [Google Scholar] [CrossRef]
  40. Jia, Z.; Gao, X.; Cai, X.; Han, D. The convergence rate analysis of the symmetric ADMM for the nonconvex separable optimization problems. J. Ind. Manag. Optim. 2021, 17, 1943–1971. [Google Scholar] [CrossRef]
  41. Kadu, A.; Kumar, R. Decentralized full-waveform inversion. In Proceedings of the 80th EAGE Conference and Exhibition 2018, Copenhagen, Denmark, 11–14 June 2018; European Association of Geoscientists and Engineers: Utrecht, The Netherlands, 2018; Volume 2018, pp. 1–5. [Google Scholar]
  42. Yin, Z.; Orozco, R.; Herrmann, F.J. WISER: Multimodal variational inference for full-waveform inversion without dimensionality reduction. arXiv 2024, arXiv:2405.10327. [Google Scholar]
Figure 1. Scatter plots of raw data.
Figure 2. Loss function plots for $n=100$, $m\in\{10,30,50,80,100\}$ (a) and $m=30$, $n\in\{100,130,150,180,200\}$ (b).
Figure 3. Scatter plots illustrating the data from the ten nodes of Example 1 following dimensionality reduction via principal component analysis.
Figure 4. Images illustrating the comparative analysis of loss functions for FL, DML, and Fed-RSADMM.
Figure 5. Comparative execution time analysis of three algorithms for Example 1 across iterative evaluations.
Figure 6. Loss function comparison between FL and FedAvg-RSADMM with a setting of $k_0=5$ for the latter.
Figure 7. Comparison of loss functions for $m=100$, $n=1000$, and $k_0\in\{5,8,10,20,25\}$ (a), as well as for $m=100$, $n=2000$, and $k_0\in\{5,8,10,20,25\}$ (b).
Figure 8. Comparisons of loss functions for $m=100$, $n=1000$, and $k_0\in\{5,8,10,20,25\}$ (a) and $m=100$, $n=2000$, and $k_0\in\{5,8,10,20,25\}$ (b).
Figure 9. (a) The relationship between iterative time and communication rounds at different values of $k_0$ for Algorithm 3. (b) The relationship between iterative time and $k_0$ with different values of m for Algorithm 3.
Table 1. Loss functions for some models.
Model — Loss function
Squared-SVM — $\frac{\lambda}{2}\|x\|^{2}+\frac{1}{2}\max\{0,\,1-b_j x^{T}a_j\}^{2}$
Linear regression — $\frac{1}{2}\left(b_j-x^{T}a_j\right)^{2}$
K-means — $\frac{1}{2}\min_l\|a_j-x_{(l)}\|^{2}$, where $x\triangleq\left[x_{(1)}^{T},x_{(2)}^{T},\ldots\right]^{T}$
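The per-sample losses in Table 1 can be transcribed directly. The helper names and the regularization weight `lam` below are illustrative placeholders; `x` denotes the model parameters, `a_j` the feature vector, and `b_j` the label of sample j.

```python
import numpy as np

def squared_svm_loss(x, a_j, b_j, lam=0.01):
    """(lam/2)||x||^2 + (1/2) max{0, 1 - b_j x^T a_j}^2: squared hinge with
    an L2 regularizer (lam is an illustrative choice)."""
    margin = max(0.0, 1.0 - b_j * (x @ a_j))
    return 0.5 * lam * (x @ x) + 0.5 * margin**2

def linear_regression_loss(x, a_j, b_j):
    """(1/2)(b_j - x^T a_j)^2: squared-error loss."""
    return 0.5 * (b_j - x @ a_j)**2

def kmeans_loss(centers, a_j):
    """(1/2) min_l ||a_j - x_(l)||^2: half squared distance to the
    nearest cluster center."""
    return 0.5 * min(np.sum((a_j - c)**2) for c in centers)
```

For example, a correctly classified SVM sample with margin above 1 incurs only the regularization term, while the regression loss grows quadratically with the residual.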

Share and Cite

MDPI and ACS Style

Lu, J.; Zhu, Y.; Dang, Y. Symmetric ADMM-Based Federated Learning with a Relaxed Step. Mathematics 2024, 12, 2661. https://doi.org/10.3390/math12172661

