Symmetric ADMM-Based Federated Learning with a Relaxed Step

Lu, Jinglei; Zhu, Ya; Dang, Yazheng

doi:10.3390/math12172661

Open AccessArticle

Symmetric ADMM-Based Federated Learning with a Relaxed Step

by

Jinglei Lu

,

Ya Zhu

and

Yazheng Dang

^*

Business School, University of Shanghai for Science and Technology, Jungong Road, Shanghai 200093, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(17), 2661; https://doi.org/10.3390/math12172661

Submission received: 12 July 2024 / Revised: 24 August 2024 / Accepted: 24 August 2024 / Published: 27 August 2024

Download

Browse Figures

Versions Notes

Abstract

:

Federated learning facilitates the training of global models in a distributed manner without requiring the sharing of raw data. This paper introduces two novel symmetric Alternating Direction Method of Multipliers (ADMM) algorithms for federated learning. The two algorithms utilize a convex combination of current local and global variables to generate relaxed steps to improve computational efficiency. They also integrate two dual-update steps with varying relaxation factors into the ADMM framework to boost the accuracy and the convergence rate. Another key feature is the use of weak parametric assumptions to enhance computational feasibility. Furthermore, the global update in the second algorithm occurs only at certain steps (e.g., at steps that are a multiple of a pre-defined integer) to improve communication efficiency. Theoretical analysis demonstrates linear convergence under reasonable conditions, and experimental results confirm the superior convergence and heightened efficiency of the proposed algorithms compared to existing methodologies.

Keywords:

federated learning; relaxed method; symmetric alternating direction method of multipliers; linear convergence

MSC:

90-10

1. Introduction

Federated learning serves as a widely adopted distributed machine learning methodology that has garnered substantial interest in recent years due to its effectiveness in addressing issues related to data privacy, security, and the accessibility of heterogeneous data [1,2,3]. This methodology has been employed extensively in various domains, such as health care [4] (Yang et al., 2019), finance [5] (Liu et al., 2021), the Internet of Things (IoT) [6] (Zeng et al., 2023), and intelligent transportation [7,8] (Manias et al., 2021). By providing solutions for data security and compliance, federated learning facilitates improved utilization of decentralized data and enhances the performance and efficiency of machine learning models. As a contemporary research hot spot, federated learning’s significance spans beyond boosting performance of machine learning models, extending to the safeguarding of data privacy, economizing computational resources, and supporting heterogeneous devices.

1.1. Related Work

(a) Improving computational efficiency and accuracy.

Each node independently addresses local optimization sub-problems in federated learning. Early research [6,9,10] used a strategy of splitting the computation across multiple devices for local optimization, although shortcomings persist regarding computational efficiency and accuracy. The Alternating Direction Method of Multipliers (ADMM), as a distributed method, has been used to solve many optimization problems due to its simplicity and efficiency. In federated learning, there are two types of ADMM available, namely exact [11] and inexact [12] ADMM. The former requires the clients to update the parameters by solving the sub-problems accurately, thereby increasing the computational burden [12,13,14,15,16,17]. The latter updates the parameters by solving sub-problems approximately, which reduces the computational complexity for clients [18,19,20,21]. However, these algorithms implement a single dual-update step per iteration and necessitate relatively stringent assumptions with respect to the parameters to ensure convergence properties.

(b) Saving computing resources. In distributed learning, local clients and the central server engage in frequent, often inefficient communication. Thus, extensive research efforts focus on devising algorithms to minimize the number of communication rounds. The stochastic gradient descent method, which aggregates in a cyclic fashion, is a widely used approach [22,23,24,25,26,27] that has demonstrated promising results in reducing the number of communication rounds. To further alleviate the burden of communication rounds, McMahan et al. [6,28,29] introduced the Federated Averaging Algorithm (FedAvg) and its improved variants. These algorithms minimize communications by performing local iterations multiple times before periodically conducting global aggregation. Li et al. [30] further refined the FedAvg method by introducing the Federated Proximal (FedProx) algorithm, allowing available device-based system resources to perform varying amounts of local work before aggregating partial solutions. Both FedAvg and FedProx have seen extensive application in distributed learning.

1.2. Our Contribution

The main contributions of this paper include the introduction of two federated learning algorithms based on Symmetric ADMM with Relaxed Step (Fed-RSADMM; see Algorithms 2 and 3), characterized as follows:

(I) Relaxed step. In contrast to the conventional ADMM, the presented algorithms employ a convex combination of the current local and global variables to generate the relaxed steps.

(II)

S y m m e t r i c A D M M .

We integrate two dual-update steps into the ADMM framework to construct a symmetric ADMM algorithm with varying relaxation factors, which is different from the general ADMM.

(III)

W e a k p a r a m e t r i c a s s u m p t i o n s .

Differing from conventional algorithmic assumptions, only simple assumptions are made with respect to the parameters.

1.3. Organization

This paper is organized as follows. In the next section, we provide the symbolic definitions and some common mathematical definitions that are used in this paper. In Section 3, we present the Fed-RSADMM and FedAvg-RSADMM algorithms, then prove their convergence. In Section 4, we design some comparative numerical experiments to illustrate the performance of our two proposed algorithms. The conclusions of this paper are presented in Section 5.

2. Preliminaries

The present section introduces some notations and definitions employed in this paper.

2.1. Notations

R^{n}

denotes the n-dimensional Euclidean space,

〈x, y〉 = x^{T} y

, and

‖ \cdot ‖

denotes the Euclidean norm. Let the set

S \subseteq R^{n}

and the point

x \in R^{n}

. If S is non-empty, the distance from point x to set S is denoted as

d i s t (x, S) = inf {‖ y - x ‖ : y \in S}

. When

S = ⌀

, let

d i s t (x, S) = + \infty

.

Definition 1

(

L -

Lipschitz continuity). In mathematics, a function (

g : R^{n} \to R

) is said to be L-Lipschitz continuous (or simply L-Lipschitz) if, for any

x, y \in R^{n}

, one has

\begin{matrix} ∥g (x) - g (y)∥ \leq L ∥x - y∥, \end{matrix}

(1)

where

∥\cdot∥

denotes the Euclidean norm.

If the function (f) is continuously differentiable and its gradient (

\nabla f

) is L-Lipschitz continuous, then we have

|f (y) - f (x) - 〈\nabla f (x), y - x〉| \leq \frac{L}{2} ‖ y - x ‖^{2}, \forall x, y \in R^{n} .

(2)

Definition 2.

Let the function

f : R^{n} \to R

be normally lower semicontinuous. The authors of [31] provided the following definitions:

(I): The Frechet subdifferential of f at $x \in d o m f$ is denoted as

$\hat{\partial} f (x) = \{x^{*} \in R^{n} : lim_{y \neq x} inf_{y \to x} \frac{f (y) - f (x) - 〈x^{*}, y - x〉}{‖ y - x ‖}\},$

(3)

and when $x \notin domf$ , $\hat{\partial} f (x) = ⌀$ .
(II): The limiting subdifferential of f at $x \in d o m f$ is denoted as

$\partial f (x) = \{x^{*} \in R^{n} : \exists x_{k} \to x, f (x_{k}) \to f (x), {\hat{x}}_{k} \in \hat{\partial} f (x_{k}), {\hat{x}}_{k} \to x^{*}\},$

(4)

and assuming that $x \in R^{n}$ is a minimal-value point of f, then $0 \in \partial f (x)$ . If $0 \in \partial f (x)$ , then x is said to be a stable point of f, and the set of stable points of f is denoted as $c r i t f$ .

The proximal operator of a normal closed, convex function f at

v \in R^{n}

is defined as follows [32]:

P r o x_{λ f} (v) = arg min_{x \in R^{n}} \{f (x) + \frac{1}{2 λ} ‖ x - v ‖^{2}\},

(5)

where

∥\cdot∥

denotes the Euclidean norm.

2.2. Loss Function

A machine learning model encompasses a set of parameters that are refined based on the training data. Typically, the training data samples include the following two components: input features represented as vectors (

a_{j}

) and desired outputs known as output labels (

b_{j}

). Each model is equipped with a loss function defined on its parameter vector (x) for each data sample. The loss function records the model error in the training data. The learning process aims to minimize this loss function over a set of training data samples (

a_{j}^{i}

). For each data sample, the loss function is designated as

f (x_{i}, a_{j}, b_{j})

or abbreviated to

f (x_{i})

for convenience.

Table 1 [33,34,35,36] summarizes loss functions for popular machine learning models. For convenience, suppose there are m edge nodes and that the local datasets are

D_{1}, D_{2}, \dots, D_{m}

. For each dataset (

D_{i}

) at node i, the loss function of the set of data samples at that node is

\begin{matrix} F_{i} (x) ≜ \frac{1}{|D_{i}|} \sum_{j \in D_{i}} f_{j} (x) . \end{matrix}

(6)

We define

|D_{i}|

, where

|D_{i}|

denotes the size of set

D_{i}

, i.e.,

|D| = \sum_{i = 1}^{m} |D_{i}| .

The global loss function of all distributed datasets is defined as

\begin{matrix} F (x) ≜ \frac{\sum_{j \in \underset{i}{\cup} D_{i}} f_{j} (x)}{|\underset{i}{\cup} D_{i}|} = \frac{\underset{i = 1}{\sum^{N}} |D_{i}| F_{i} (x)}{|D|} . \end{matrix}

(7)

2.3. Symmetric ADMM

The following convex minimization model with linear constraints and a separable objective function is considered.

min_{x \in R^{n}, y \in R^{q}} f (x) + g (y), s . t . A x + B y = b,

(8)

where

A \in R^{p \times n}, B \in R^{p \times q}

and

b \in R^{p}

. Such divisible convex optimization problems can be solved by using the ADMM algorithm. The augmented Lagrangian function for the above optimization problem is expressed as follows.

L (x, y, u) : = f (x) + g (y) + 〈A x + B y - b, u〉 + \frac{ρ}{2} ‖ A x + B y - b ‖^{2},

(9)

where

ρ

is the penalty parameter and u is the Lagrange multiplier. ADMM follows the following update process [37]:

\begin{matrix} x^{k + 1} = arg {min}_{x \in R^{n}} L (x, y^{k}, u^{k}), \\ y^{k + 1} = arg {min}_{y \in R^{q}} L (x^{k + 1}, y, u^{k}), \\ u^{k + 1} = u^{k} - ρ (A x^{k + 1} + B y^{k + 1} - b) . \end{matrix}

(10)

Based on the Peaceman–Rachford splitting method [38], symmetric ADMM (S-ADMM) was proposed to solve (9). The iterative process is expressed as follows:

\begin{matrix} x^{k + 1} = arg {min}_{x \in R^{n}} L (x, y^{k}, u^{k}), \\ u^{k + \frac{1}{2}} = u^{k} - ρ (A x^{k + 1} + B y^{k} - b) \\ y^{k + 1} = arg {min}_{y \in R^{q}} L (x^{k + 1}, y, u^{k + \frac{1}{2}}), \\ u^{k + 1} = u^{k + \frac{1}{2}} - ρ (A x^{k + 1} + B y^{k + 1} - b) . \end{matrix}

(11)

2.4. Federated Learning

Suppose we have m local nodes, each with a local dataset (

D_{i}

). Each node has a local total loss (

f_{i} (x)

) as a lower-bounded function. The global loss function can, therefore, be derived as

\begin{matrix} f (x) : = \sum_{i = 1}^{m} w_{i} f_{i} (x), \end{matrix}

(12)

where

w_{i}, (i = 1, 2, \dots, m)

are positive weights and satisfy

\sum_{i = 1}^{m} w_{i} = 1 .

The goal of federated learning is to minimize the loss function at the central node to obtain the optimal parameter (

x^{*}

), which can be described as the following problem:

\begin{matrix} x^{*} : = a r g m i n_{x \in R^{n}} f (x) . \end{matrix}

(13)

By introducing the auxiliary variable (y) and adding the constraint of

x_{i} = y

, the original problem can be rewritten in the following form:

\begin{matrix} min_{x_{i} \in R^{n}, y \in R} \sum_{i = 1}^{m} w_{i} f_{i} (x_{i}), s . t . x_{i} = y . \end{matrix}

(14)

Based on the above optimization problem, the conventional federated learning Algorithm 1 can be summarized in the following form [39]:

Algorithm 1 Federated Learning

1:: Initialize: $x_{i}^{0} = x^{0}, m, γ > 0 .$
2:: for $k = 1$ to n do
3:: $\underset{̲}{G l o b a l u p d a t e :}$
4:: The central server calculates the average parameter $y^{k + 1}$ by:
5:: $y^{k} = \sum_{i = 1}^{m} w_{i} x_{i}^{k} .$
6:: Broadcasts the parameter $y^{k + 1}$ to every local client:
7:: for $i = 1$ to n do
8:: $\underset{̲}{L o c a l u p d a t e :}$
9:: Each client updates its parameter locally and in parallel by:
10:: $x_{i}^{k + 1} = x_{i}^{k} - γ \nabla f (y^{k}) .$
11:: end for
12:: end for
13:: $R e t u r n : x_{i}^{k + 1}, y^{k + 1} (i = 1, 2, \dots, m)$

To implement the algorithms proposed in this paper, we construct the augmented Lagrangian function for problem (14). The augmented Lagrangian function of (14) is

\begin{matrix} L_{ρ} (y, X, U) : = \sum_{i = 1}^{n} L_{ρ_{i}} (y, x_{i}, u_{i}), \end{matrix}

(15)

where

X = (x_{1}, x_{2}, \dots, x_{n})

,

U = (u_{1}, u_{2}, \dots, u_{n})

,

ρ = (ρ_{1}, ρ_{2}, \dots, ρ_{m})

, and

\begin{matrix} L_{ρ_{i}} (y, x_{i}, u_{i}) = w_{i} f_{i} (x) - 〈x_{i} - y, u_{i}〉 + \frac{ρ_{i}}{2} ‖ x_{i} - y ‖^{2}, \end{matrix}

(16)

where

u_{i}, i = 1, 2, \dots, m

are the Lagrange multipliers and

ρ_{i}, i = 1, 2, \dots, m

are the penalty parameters.

2.5. Stationary Points

Here, we present the optimality condition for problem (14).

Definition 3.

Point

(y^{*}, X^{*}, U^{*})

is a stationary point of problem (14) and satisfies the following conditions:

\begin{matrix} \{\begin{matrix} w_{i} \nabla f_{i} (x_{i}^{*}) + u_{i}^{*} = 0 \\ x_{i}^{*} - y^{*} = 0 \\ \sum_{i = 1}^{m} u_{i}^{*} = 0 \end{matrix}, i = 1, 2, \dots, m . \end{matrix}

(17)

Point

x^{*}

is deemed a stationary point of problem (14) if it satisfies the following condition:

\begin{matrix} \nabla f (x^{*}) = 0 . \end{matrix}

(18)

One can readily observe that any local optimal solution satisfies (17), and if each

f_{i}

is a convex function, a point fulfilling (18) constitutes a globally optimal solution.

Based on the definition of the proximal operator and Definition 3, we obtain the following lemma.

Lemma 1

([39,40]). Suppose that

f_{i} : R^{n} \to R, i = 1, 2, \dots, m

are properly convex lower semicontinuous functions. Then, solving problem (15) reduces to a zero point of

\begin{matrix} e (p, ϱ) = (\begin{matrix} e_{x_{i}} (p, ϱ) : = x_{i} - P r o x_{ϱ f_{i}} (x_{i} + ϱ u_{i}) \\ e_{y} (p, ϱ) : = ϱ (\sum_{i = 1}^{m} u_{i}) \\ e_{u_{i}} (p, ϱ) : = ϱ (x_{i} - y) \end{matrix}) (i = 1, 2, \dots, m), \end{matrix}

(19)

where

p : = (X, y, U), ϱ \in R_{+}

for any given positive constant. For

p^{*} = (X^{*}, y^{*}, U^{*}) \in c r i t L_{ρ}

,

e (p^{*}, ϱ) = 0

. Thus,

e (p, ϱ)

can be used to measure the distance between point p and stable set

c r i t L_{ρ}

.

Now, we provide the following lemma for

e (p, ϱ)

, which is important for Remark 2 and Lemma 5 in Section 3.

Lemma 2

([39,40]). Suppose that

f_{i} : R^{n} \to R, i = 1, 2, \dots, m

are properly convex and lower semicontinuous. If p is not a stable point of

c r i t L_{ρ}

and

\bar{ϱ} \geq ϱ > 0

, then

\begin{matrix} ‖ e (p, \bar{ϱ}) ‖ \geq ‖ e (p, ϱ) ‖, \end{matrix}

(20)

and

\begin{matrix} \frac{‖ e (p, \bar{ϱ}) ‖}{\bar{ϱ}} \leq \frac{‖ e (p, ϱ) ‖}{ϱ} . \end{matrix}

(21)

3. Symmetric ADMM-Based Federated Learning with a Relaxed Step and Convergence

Based on the above augmented Lagrangian functions, in this section, we construct two symmetric ADMM-based federated learning algorithms, the first of which is Fed-RSADMM, which utilizes the federated learning framework and symmetric ADMM with a relaxed step (RSADMM). The second is FedAvg-RSADMM, based on Fed-RSADMM, which allows local clients to update multiple times, then upload their parameters to the central server.

3.1. Fed-RSADMM

Given an original dataset comprising m nodes, the local parameter for the

i

-th node is set as

x_{i}

, and data for the

i

-th node is assigned as

D_{i}

. The specific algorithmic workflow proceeds as follows (Algorithm 2).

Algorithm 2 Fed-RSADMM

Input: $α, τ, γ, ρ_{i} > 0 . S = [m]$ , $[m] : = {1, 2 \dots, m} .$
Initialize: $x_{i}^{0}, y^{0}, u_{i}^{0}, i \in [m] . S e t k \Leftarrow 0 .$
for $k = 1$ to n do
for $i = 1$ to m do
$\underset{̲}{L o c a l r e l a x e d u p d a t e :}$
$x_{r s (i)}^{k + 1} = α x_{i}^{k} + (1 - α) y^{k}$

(22)
end for
$\underset{̲}{G l o b a l u p d a t e :}$
The central server calculates the global parameter $y^{k + 1}$ by
$y^{k + 1} = arg min_{y} \underset{i = 1}{\sum^{m}} (w_{i} f_{i} (x_{r s (i)}^{k + 1}) - 〈x_{r s (i)}^{k + 1} - y, u_{i}^{k}〉 + \frac{ρ_{i}}{2} {∥x_{r s (i)}^{k + 1} - y∥}^{2})$

(23)
Broadcasts the parameter $y^{k + 1}$ to every local client:
for $i = 1$ to m do
$\underset{̲}{L o c a l r e l a x e d u p d a t e :}$
$u_{i}^{k + \frac{1}{2}} = u_{i}^{k} - τ ρ_{i} (x_{r s (i)}^{k + 1} - y^{k + 1})$

(24)
$x_{i}^{k + 1} = arg min_{x_{i}} (w_{i} f (x_{i}) - 〈x_{i} - y^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 + \frac{γ ρ_{i}}{2} {∥x_{i} - y^{k + 1}∥}^{2})$

(25)
$u_{i}^{k + 1} = u_{i}^{k + \frac{1}{2}} - γ ρ_{i} (x_{i}^{k + 1} - y^{k + 1})$

(26)
end for
end for
$R e t u r n : x_{i}^{k + 1}, y^{k + 1} (i = 1, 2, \dots, m)$

Remark 1.

In Algorithm 2, subproblem (23) can be solved by the following equation:

\begin{matrix} y^{k + 1} = a r g {min}_{y} L (y, X^{k}, U^{k}) = \sum_{i = 1}^{m} \frac{ρ_{i} x_{r s (i)}^{k + 1}}{\hat{ρ}} + \sum_{i = 1}^{m} \frac{u_{i}^{k}}{\hat{ρ}}, \end{matrix}

(27)

where

\begin{matrix} \hat{ρ} : = \sum_{i = 1}^{m} ρ_{i}, \end{matrix}

(28)

Compared to traditional federated learning, we adopt the update methods outlined in (23) and (27) for the global parametersinstead of using the average of all local parameters, denoted as

x_{i}^{k + 1}

. In contrast to the symmetric ADMM algorithm, we introduce a relaxation step to accelerate the convergence rate [19].

3.2. FedAvg-RSADMM

The communications in FedAvg-RSADMM only occur when

k \in K

= {0, k_{0}, 2 k_{0}, \dots}

, where

k_{0}

is a predefined positive integer. To facilitate local updates in Algorithm 2, an auxiliary variable (

z^{k + 1} = y^{τ k + 1}

) is introduced, where

Γ_{k} = ⌊k / k_{0}⌋ k_{0}

. It can be readily observed that if

k = Γ_{k}

, then

k \in K

, and when

Γ_{k} < k < Γ_{k} + k_{0}

, then

k \notin K

, i.e.,

\begin{matrix} z^{k + 1} = \{\begin{matrix} y^{k + 1}, & i f & k \in K, \\ y^{τ_{k} + 1}, & i f & k \notin K . \end{matrix} \end{matrix}

(29)

This approach decreases the number of communication rounds (e.g., parameter feedback and parameter upload), resulting in substantial cost savings with a convergence rate of

0 (1 / K)

, where K is the number of iterations. A convex combination between local and global variables is used to formulate the relaxed step, which is then employed to perform the parameter update. The corresponding Algorithm 3 proceeds as follows.

Algorithm 3 FedAvg-RSADMM

1:: Input: $α, τ, γ, ρ_{i} > 0 . S = [m] .$
2:: Initialize: $x_{i}^{0}, y^{0}, u_{i}^{0}, i \in [m] . S e t k \Leftarrow 0 .$
3:: for $k = 1$ to n do
4:: for $i = 1$ to m do
5:: $\underset{̲}{L o c a l r e l a x e d u p d a t e :}$
6:: $x_{r s (i)}^{k + 1} = α x_{i}^{k} + (1 - α) z^{k}$

(30)
7:: end for
8:: if $k \in K : = \{0, k_{0}, 2 k_{0}, 3 k_{0}, \dots\}$ then
9:: $\underset{̲}{G l o b a l u p d a t e :}$
10:: The central server calculates the global parameter $z^{k + 1}$ by
11:: $z^{k + 1} = arg min_{z} \underset{i = 1}{\sum^{m}} (w_{i} f_{i} (x_{r s (i)}^{k + 1}) - 〈x_{r s (i)}^{k + 1} - z, u_{i}^{k}〉 + \frac{ρ_{i}}{2} {∥x_{r s (i)}^{k + 1} - z∥}^{2})$

(31)
12:: Broadcasts the parameter $z^{k + 1}$ to every local client:
13:: end if
14:: for $i = 1$ to m do
15:: $\underset{̲}{L o c a l r e l a x e d u p d a t e :}$
16:: $u_{i}^{k + \frac{1}{2}} = u_{i}^{k} - τ ρ_{i} (x_{r s (i)}^{k + 1} - z^{k + 1})$

(32)
17:: $x_{i}^{k + 1} = arg min_{x_{i}} (w_{i} f (x_{i}) - 〈x_{i} - z^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 + \frac{γ ρ_{i}}{2} {∥x_{i} - z^{k + 1}∥}^{2})$

(33)
18:: $u_{i}^{k + 1} = u_{i}^{k + \frac{1}{2}} - γ ρ_{i} (x_{i}^{k + 1} - z^{k + 1})$

(34)
19:
20:: end for
21:: end for
22:: $R e t u r n : x_{i}^{k + 1}, z^{k + 1}, u_{i}^{k + 1} (i = 1, 2, \dots, m)$

3.3. Convergence

In this section, we only provide the corresponding convergence lemmas and theorem for Algorithm 2, as those for Algorithm 3 follow a similar process. The following assumption is important for the proof.

Assumption 1.

(a): The function $f_{i} : R^{n} \to R, i = 1, 2, \dots, m$ is lower semi-continuous.
(b): The function $f_{i} : R^{n} \to R, i = 1, \dots$ , is continuous and has the same L-Lipschitz continuous gradient.
(c): The parameters in the algorithm satisfy the following:

$γ + τ \geq 0, γ + τ α > 0, 0 < α < 1, 0 < γ < 1, τ < 1 .$

The penalty parameter ( $ρ_{i}$ ) complies with the following:

$ρ_{i} > \frac{c + \sqrt{c^{2} - 4 h L^{2}}}{2 h},$

where $c = γ (α - γ) - τ (1 + γ α)$ , $h = (τ - 1) (α - 1) L^{2} + 2 (1 - γ) L$ .
(d): The datasets of all devices are independently and identically distributed (i.i.d).

We first prove the decreasing property of the

{L_{ρ} (p^{k})}

sequence; these properties enable us to obtain Lemma 4, which, together with the optimality conditions, shows the convergence of the

{p^{k}}

sequence.

Lemma 3.

Assuming that Assumption 1 holds, there exist

a > 0

and

b > 0

such that

\begin{matrix} L_{ρ} (p^{k}) - L_{ρ} (p^{k + 1}) \geq a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2}, \forall k . \end{matrix}

(35)

Then, the

L_{ρ} (p^{k})

sequence is monotonically decreasing, where m is a finite constant.

Lemma 4.

Assuming assumption A holds and

w^{k}

is bounded, then

\begin{matrix} \sum_{k = 0}^{+ \infty} | p^{k + 1} - p^{k} | < + \infty . \end{matrix}

(36)

Theorem 1 establishes the subsequence convergence property of the iterative sequence generated by Algorithm 2.

Theorem 1

(Subsequence Convergence). Assuming the conditions in Lemma 4 are met, the set of accumulation points of the

w^{k}

sequence is denoted as Ω. Then, the following conclusions hold:

(1): Ω is a non-empty compact set, and $d i s t (w^{k}, Ω) \to 0$ as $k \to + \infty$ ;
(2): $Ω \subseteq c r i t L_{ρ}$ .

3.4. Linear Convergence Rate

To obtain the local linear convergence rates of sequences {

p^{k}

} and {

L_{ρ} (p^{k})

} generated by Algorithm 2, the following results require that functions

f_{i}, i = 1, 2, \dots, m

be convex. We also make the following assumptions:

Assumption 2.

For any

ζ \geq {inf}_{p} L_{ρ} (p)

, there exist

ε > 0

and

ς > 0

such that

| e (p, 1) | \leq ε

and

L_{ρ} (p) \leq ζ

, which implies

d i s t (p, c r i t L_{ρ}) \leq ς ‖ e (p, 1) ‖ .

Remark 2.

According to Lemma 1, the expression for

∥e (p, 1)∥

is

\begin{matrix} e (p, 1) = (\begin{matrix} e_{x_{i}} (p, 1) : = x_{i} - P r o x_{f_{i}} (x_{i} + u_{i}) \\ e_{y} (p, 1) : = (\sum_{i = 1}^{m} u_{i}) \\ e_{u_{i}} (p, 1) : = (x_{i} - y) \end{matrix}), (i = 1, 2, \dots, m) . \end{matrix}

According to Lemma 2, for any given

ϱ > 0

, we have

‖ e (p, 1) ‖ \leq max \{ϱ, \frac{1}{ϱ}\} ‖ e (p, ϱ) ‖ .

Thus, Assumption 2, we can also set

d i s t (p, c r i t L_{ρ}) \leq ς | e (p, ϱ) |

, where

| e (p, ϱ) | \leq ε

and

L_{ρ} (p) \leq ζ

.

Assumption 3.

For any given

\bar{p} = (\bar{X}, \bar{y}, \bar{U}) \in c r i t L_{ρ}

and

\tilde{p} = (\tilde{X}, \tilde{y}, \tilde{U}) \in c r i t L_{ρ}

, there exists

δ > 0

such that when

| \bar{p} - \tilde{p} | \leq δ

,

L_{ρ} (\bar{p}) = L_{ρ} (\tilde{p})

holds.

To prove the convergence rate, we also need Lemma 5.

Lemma 5.

Suppose Assumption 1 holds and the functions

f_{i}, i = 1, 2, \dots, m

are convex; then, there exists

θ_{1},

θ_{2} > 0

such that

\begin{matrix} e (p^{k + 1}, 1) \leq θ_{1} \sum_{i = 1}^{m} ‖ Δ {x_{i}}^{k + 1} ‖ + θ_{2} ‖ Δ y^{k + 1} ‖, \forall k . \end{matrix}

(37)

We have shown that Algorithm 2 converges. Now, we would like to see how fast this convergence is based on Lemma 5, as expressed by Theorems 3 and 4.

Theorem 2.

Suppose Assumptions 1 and 2, and the conditions in Lemma 4 hold and the that functions

f_{i}, i = 1, 2, \dots, m

are convex; then, the following conclusions are valid:

(1): $d i s t (p^{k}, c r i t L_{ρ}) \to 0, k \to + \infty$ ;
(2): For any given ${\bar{p}}^{k} \in c r i t L_{ρ}$ and $∥{\bar{p}}^{k} - p^{k}∥ = d i s t (w^{k}, c r i t L ρ)$ , there exists a positive integer ( $\hat{k}$ ) such that

$L_{ρ} ({\bar{p}}^{k}) = lim_{k \to + \infty} L_{ρ} (p^{k}) = inf_{k} L_{ρ} (p^{k}), \forall k \geq \hat{k};$
(3): The { $L_{ρ} (p^{k})$ } sequence is Q-linearly convergent.

Theorem 3.

Assuming the conditions in Theorem 2 hold, the

p^{k}

sequence converges to

c r i t L_{ρ}

with an R-linear convergence rate.

See the proofs of Lemmas 3 and 4, and Theorem 1, as well as Lemma 5, Theorems 2 and 3, in Appendix A.

4. Numerical Experiment

In this section, the performances of Algorithms 2 and 3 are demonstrated by two numerical arithmetic examples, namely linear regression and logistic regression, respectively. All numerical experiments were conducted on a laptop computer with 16 GB of RAM and an Intel(R) Core(TM) i5-12500H 2.3 GHz CPU, using MATLAB (R2021a) for implementation.

4.1. Testing Examples

In our examples, each local client is designated with its own objective function (

f_{i}, i = 1, 2, \dots, m

, where m signifies the number of client nodes). Subsequently, every client generates random datasets (

A_{i} = [a_{1}^{i}, a_{2}^{i}, \dots, a_{n}^{i}]

and

b_{i} = [b_{1}^{i}, b_{2}^{i}, \dots, b_{n}^{i}]

). In these datasets,

a_{j}^{i}, (i = 1, \dots, m; j = 1, 2, \dots, n)

represents the feature data of dimension d, whereas

b_{j}^{i}, (i = 1, \dots, m; j = 1, 2, \dots, n)

denotes the label data of dimension 1.

Regarding Algorithms 2 and 3, we establish identical parameters across each node. These parameters include a relaxed step weight of

α = 0.5

, a penalty parameter (

ρ_{i}

) set to 1 for each node, a relaxed factor (

τ

) of 0.1 for the initial dual step, and a relaxed factor (

γ

) of 0.5 for the subsequent dual step at every node.

Example 1

(Linear regression). Linear regression, a canonical problem in machine learning, seeks to construct a linear function from a specified dataset, enabling the prediction of relationships between input and output variables. In this context, the objective functions for local nodes are given by

\begin{matrix} f_{i} (x) = \frac{1}{2} \underset{j = 1}{\sum^{n}} {(x^{T} a_{j}^{i} - b_{j}^{i})}^{2}, i = 1, 2, \dots, m . \end{matrix}

(38)

In this function,

a_{j}^{i} \in R^{n}

and

b_{j}^{i} \in R

denote the j-th sample for client i. It should be noted that the above objective function formulates a convex quadratic optimization problem. For this scenario, we randomly generate features (

A_{i}

) and corresponding labels (

b_{i}

) from a uniform distribution within the interval of

[0, 1]

. For the purpose of simplification, we initially set

n = 100

while selecting

m \in [10, 100]

. Subsequently, we solidify

m = 30

and let n fall in the range of

[100, 200]

.

Example 2

(Logistic regression). Logistic regression, a prevalently utilized classification algorithm, is especially apt for handling binary classification predicaments. Within this context, local clients define their objective functions as

\begin{matrix} f_{i} (x) = \frac{1}{n} \sum_{j = 1}^{n} log (1 + exp (- b_{j}^{i} x^{T} a_{j}^{i})), i = 1, 2, \dots, m, \end{matrix}

(39)

where

a_{j}^{i} \in R^{n}

and

b_{j}^{i} \in R

correspond to the j-th sample of client i. Features (

A_{i}

) are randomly generated in accordance with a uniform distribution spanning the interval of

[0, 1]

, and labels (

b_{i}

) are derived from the set of

{- 1, 1}

. Each nodal dataset is designated with a unique dimensionality. In the first instance, datasets are defined such that

m = 100

and

n = 1000

, with

k_{0}

permitted to be selected from the set of

{5, 8, 10, 20, 25}

. In a subsequent iteration, the dataset configuration persists with

m = 100

, while n is expanded to 2000, continuously allowing for the selection of

k_{0}

from

{5, 8, 10, 20, 25}

.

4.2. Numerical Results

Data generation was conducted in accordance with Example 1. Principal component analysis was then employed to illustrate the distribution of the ensuing data. In this case, the five-dimensional feature data were condensed into a two-dimensional plane, with a color bar employed to signify the continuum of label values, thereby delineating the different random distributions within the

[0, 1]

interval that each original sample encountered at the data nodes.

In Figure 1a–j, the first principal component resulting from the dimensionality reduction via principal component analysis is represented along the x axis, while the second principal component is plotted along the y axis. Figure 1 suggests that the data are variably randomly distributed across the ten nodes.

Upon employing Algorithm 2 for the case illustrated in Example 1 and stipulating the number of communication rounds as 20, a two-stage process was adopted. To initiate the process, the number of data points for a single node is set to

n = 100

, and m, denoting the number of nodes, is varied. This produces the graphical output depicted on the left. Subsequently, with a constant node count of

m = 30

, the quantity of sample points per node is varied, resulting in the representation displayed on the right.

Figure 2 presents the variation in average node loss for different data quantities. Figure 2a depicts a scenario with a single-node data sample of

n = 100

and a node count m ranging within

{10, 30, 50, 80, 100

}. As evidenced in Figure 2a, the average loss function for nodes under Algorithm 2 descends most rapidly for

m = 100

and most sluggishly for

m = 10

. Consequently, it can be inferred that an increase in the number of nodes accelerates the decrease in the loss function, implying higher accuracy. Figure 2b considers five scenarios where

m = 30

and n varies within the range of

{100, 130, 150, 180, 200}

. The average node loss function under Algorithm 2 is found to descend most swiftly for

n = 100

and most slowly for

n = 200

. This indicates that a reduction in the number of nodes expedites the descent of the loss function.

In addressing Example 1, Algorithm 2 was employed to determine parameters, which were then integrated into the linear model. Subsequently, the original feature data were incorporated, and labels for the original feature data were computed, which were then compared with the original data labels. Dimensionality reduction was achieved via principal component analysis to display a comparative graph of the original and model-predicted data.

Figure 3a–j employ distinct symbols to represent raw and predicted data, with color gradients depicting the corresponding label values. Dot markers indicate the sample data points post dimensionality reduction of the original data via principal component analysis. Cross markers denote data processed through the linear regression model generated by Algorithm 2, with the subsequent predicted labels presented in reduced dimensionality, also by principal component analysis. A comparison between the original and predicted data, as showcased in Figure 3, reveals a high level of accuracy attained by the linear regression model in conjunction with Algorithm 2.

A comparison is also made with conventional Distributed Machine Learning (DML) [34] and Federated Learning (FL) [28]. The loss function progression of these three algorithms is depicted accordingly.

Figure 4 displays the three corresponding loss function curves. From top to bottom, the curves represent the following distinct models. The first pertains to conventional FL, the second to DML, and the third to Algorithm 2. It is observed that the loss function value for Algorithm 2 (Fed-RSADMM) decreases more rapidly than that for DML and FL and also yields the smallest value upon convergence. Figure 4 suggests that Algorithm 2 exhibits superior accuracy and faster convergence compared to traditional algorithms.

To compare the time efficiency of the algorithms, we conducted a comparative analysis of their execution times, yielding the following results.

Figure 5 provides a comparative analysis of the execution times for FL, DML, and Fed-RSADMM applied to test Example 1 over a series of iterations. The performance of the Fed-RSADMM algorithm is traced by the dashed green line, which consistently shows reduced execution times, in contrast to FL (solid blue line) and DML (dash–dot red line). The flatter trajectory of the Fed-RSADMM line across the iteration spectrum underscores its time-efficiency advantage. In essence, the results from test Example 1 endorse the Fed-RSADMM algorithm’s superior time efficiency, with its modest increase in execution time demonstrating potential for scalable and efficient processing in iterative tasks.

Subsequently, Algorithm 3 is subjected to verification via its application to Example 2. To commence,

k_{0} = 5

is established within Algorithm 3, followed by a comparison with the conventional federated learning algorithm, resulting in the accompanying comparative results.

As depicted in Figure 6, the upper curve corresponds to the loss function for the traditional FL algorithm, whereas the lower curve represents Algorithm 3 (FedAvg-RSADMM). The graph illustrates that Algorithm 3 exhibits a more rapid rate of descent than the FL algorithm, indicating its superior performance over the conventional FL algorithm.

The subsequent analysis focuses on discerning the effect of

k_{0}

on the performance of Algorithm 3. While resolving the logistic regression problem of Example 2 using Algorithm 3, the sample data size of a single node is kept constant at 1000 and 2000, while

k_{0}

varies within the range of

{5, 8, 10, 20, 25}

, leading to the ensuing comparison.

Figure 7 plots the number of iterations of the local variable (x) on the x axis against the average loss of the nodes on the y axis for different m, n, and

k_{0}

values. As per the flow of Algorithm 3, a larger

k_{0}

implies fewer global updates. Consequently, Figure 7 reveals a slower descent of the node’s loss function with larger

k_{0}

values, which is attributable to the reduced number of steps in the global update, thereby conserving computational and communication resources. However, it is noted that Algorithm 3 exhibits convergence similar to that achieved with various

k_{0}

values. This suggests that an increased

k_{0}

effectively reduces communication and computational resource consumption, inducing only a minor error loss. Accordingly, a modest increase in

k_{0}

in Algorithm 3 can boost its computational efficiency. The data sample size for a single node is then elevated as

n = 2000

based on the experiment illustrated in Figure 7a for n = 1000, yielding the results portrayed in Figure 7b, which mirroring those presented in the left figure. Thus, it follows that the accuracy of Algorithm 3 remains unimpacted by the escalation of local sample data size, rendering the algorithm suitable for federated learning problems involving extensive data volumes and multiple nodes.

In Figure 8, the number of iterations of the global variable y is denoted by the x-axis, while the average loss of the nodes forms the y-axis, resulting in the displayed loss curve. The figure exhibits faster node loss function decreases with larger

k_{0}

, attributable to increased local updates for sizeable

k_{0}

during global variable updating. Consequently, fewer global update steps are required for convergence, thus significantly diminishing the number of communications and global update steps. Therefore, Algorithm 3 can be deemed effective in reducing communication losses. Again, the data sample size for a single node is amplified to

n = 2000

based on the experiments of Figure 8a, culminating the results in Figure 8b. These demonstrate that Algorithm 3 is fitting for distributed optimization problems involving considerable local node data.

To investigate the impact of varying initial iteration counts (

k_{0}

) and the number of client nodes (m) on the temporal efficiency of the algorithm, we conducted a series of experiments. The parameter influences the number of local updates (

k_{0}

), while m corresponds to the number of client nodes involved in the computation. Our objective was to ascertain whether an increased number of nodes affects the algorithm’s parallelism. The experimental outcomes are listed in Figure 9.

Figure 9a illustrates the relationship between the iterative time and the communication rounds at different

k_{0}

values for Algorithm 3 . An increment in

k_{0}

corresponds to a decrease in the total number of iterations required, allowing Algorithm 3 to halt earlier and utilize less iterative time. Each line in the graph represents the algorithm’s performance with a different

k_{0}

value, demonstrating that higher values lead to quicker convergence, as evidenced by the curves leveling off sooner. This suggests that optimizing the

k_{0}

parameter can enhance the algorithm’s efficiency, reducing computational time while maintaining convergence integrity.

Figure 9b demonstrates the relationship between the iterative time and

k_{0}

with different numbers of nodes (m) for Algorithm 3. It is observed that an increase in m leads to a longer iteration time. However, the overall time decreases with an increase in

k_{0}

, indicating that our algorithm exhibits favorable parallelism. Despite this, the influence of node count on computation time is not entirely negated; therefore, the potential time cost due to an increase in nodes must be considered in computational evaluations. This underscores the necessity of balancing the number of nodes against the performance gains achieved through parallel processing when deploying the algorithm in distributed computing environments.

Obviously, Algorithm 3 considers the communication and computation costs in relation to

k_{0}

. Specifically, the framework states that the global update occurs only at certain steps (e.g., at step km being a multiple of a pre-defined integer (

k_{0}

)). It is shown that the larger the

k_{0}

, the less time for our algorithm to converge (see Figure 7, Figure 8 and Figure 9). In addition, the local computational complexity at each node is

O (k)

, where k is the iteration time. By adjusting

k_{0}

, we can optimize the trade-off between communication efficiency and computational load, demonstrating the scalability and adaptability of the algorithm for federated learning.

5. Conclusions

This study introduces two symmetric ADMM-based federated learning algorithms relaxed steps. Algorithm 2 bolsters computational efficiency in federated learning, while Algorithm 3 capitalizes on Algorithm 2 to further optimize communication efficiency. Relevant numerical experiments were set up to illustrate the feasibility and efficiency of the algorithms. In conclusion, the two proposed algorithms exhibit rapid convergence and excellent performance. While experiments based on linear and logistic regressions were conducted on small scales and only serve as proofs of concept, in contrast to [41,42], which researched large-scale cases and realistic usage. Therefore, exploring applications related to large-scale optimization problems will be the subject of further research in the future.

Author Contributions

J.L.: data curation, theoretical derivation, methodology, software, writing—original draft preparation, and modification; Y.D.: conceptualization, formal analysis, writing—review and editing, supervision, and validation. Y.Z.: data curation, methodology, software, writing—original draft preparation, and modification. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grants 71901145 and 12371308.

Data Availability Statement

The dataset used in this paper was generated by computer simulations to support the experimental data.

Acknowledgments

We acknowledge the efforts of the editorial board and anonymous reviewers for their thorough evaluation of our manuscript. Their comments and recommendations contributed to the refinement of our research paper.

Conflicts of Interest

No conflicts of interest exist with respect to the submission of this manuscript, and the manuscript was approved by all authors for publication. We would like to declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, either in whole or in part. All listed authors approved of the manuscript that is enclosed. The authors declare that they have no known competing financial interests or personal relationships that could appeared to have influenced the work reported in this paper.

Appendix A

Appendix A.1. Proof of Lemma 3

For notational simplicity, hereafter, we denote

Δ y^{k + 1} : = y^{k + 1} - y^{k}, Δ x_{i}^{k + 1} : = x_{i}^{k + 1} - x_{i}^{k}, Δ u_{i}^{k + 1} : = u_{i}^{k + 1} - u_{i}^{k} .

(A1)

According to Definition (4), we provide the optimality conditions for the subproblems in Fed-RSADMM,

\{\begin{matrix} 0 = w_{i} \nabla f_{i} (x_{i}^{k + 1}) - u_{i}^{k + \frac{1}{2}} + γ ρ_{i} (x_{i}^{k + 1} - y^{k + 1}), \\ 0 = \underset{i = 1}{\sum^{m}} [u_{i}^{k} - ρ_{i} (x_{r s (i)}^{k + 1} - y^{k + 1})] . \end{matrix}

(A2)

Based on the update steps for

x_{r s (i)}

and

u_{i}

, we have

\{\begin{matrix} \underset{i = 1}{\sum^{m}} (u_{i}^{k} - ρ_{i} (x_{r s (i)}^{k + 1} - y^{k + 1}) = 0, \\ (γ ρ_{i} + τ ρ_{i} α) (x_{i}^{k + 1} - y^{k + 1}) = τ ρ_{i} α Δ x_{i}^{k + 1} + τ ρ_{i} (1 - α) Δ y^{k + 1} - Δ u_{i}^{k + 1}, \\ x_{i}^{k + 1} - y^{k + 1} = \frac{1}{γ + τ α} (τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1}) . \end{matrix}

(A3)

The optimality conditions for the subproblems involving

x_{i}

, y, and

u_{i}

are obtained as follows:

\{\begin{matrix} w_{i} \nabla f (x_{i}^{k + 1}) = u_{i}^{k + 1}, \\ Δ u_{i}^{k + 1} = (- γ ρ_{i} - τ ρ_{i} α) (x_{i}^{k + 1} - y^{k + 1})) + τ ρ_{i} α (Δ x_{i}^{k + 1} + Δ y^{k + 1}) + τ ρ_{i} Δ y^{k + 1}, \end{matrix}

(A4)

Considering the updating methodology for u, it can be inferred that

\begin{matrix} \begin{matrix} x_{i}^{k + 1} - y^{k + 1} = \frac{1}{γ + τ α} (τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1}) . \end{matrix} \end{matrix}

(A5)

Regarding the subproblem of

x_{i}

,

\begin{matrix} w_{i} f_{i} (x_{i}^{k + 1}) - 〈x_{i}^{k + 1} - y^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 + \frac{γ ρ_{i}}{2} {∥x_{i}^{k + 1} - y^{k + 1}∥}^{2} \\ \leq w_{i} f_{i} (x_{i}^{k}) - 〈x_{i}^{k} - y^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 + \frac{γ ρ_{i}}{2} {∥x_{i}^{k} - y^{k + 1}∥}^{2} . \end{matrix}

(A6)

Hence,

\begin{matrix} w_{i} f_{i} (x_{i}^{k + 1}) - 〈x_{i}^{k + 1} - y^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 + \frac{γ ρ_{i}}{2} {∥x_{i}^{k + 1} - y^{k + 1}∥}^{2} \\ - w_{i} f_{i} (x_{i}^{k}) + 〈x_{i}^{k} - y^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 - \frac{γ ρ_{i}}{2} {∥x_{i}^{k} - y^{k + 1}∥}^{2} \leq 0 . \end{matrix}

(A7)

Therefore,

\begin{matrix} w_{i} f_{i} (x_{i}^{k + 1}) - w_{i} f_{i} (x_{i}^{k}) - 〈Δ x_{i}^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 - \frac{γ ρ_{i}}{2} {∥Δ x_{i}^{k + 1}∥}^{2} \\ \leq - γ ρ_{i} 〈x_{i}^{k + 1} - y^{k + 1}, Δ x_{i}^{k + 1}〉 . \end{matrix}

(A8)

According to the subproblem of y,

\begin{matrix} \underset{i = 1}{\sum^{m}} (- 〈x_{r s (i)}^{k + 1} - y^{k + 1}, u_{i}^{k}〉 + \frac{ρ_{i}}{2} {∥x_{r s (i)}^{k + 1} - y^{k + 1}∥}^{2}) \\ \leq \underset{i = 1}{\sum^{m}} (- 〈x_{r s (i)}^{k + 1} - y^{k}, u_{i}^{k}〉 + \frac{ρ_{i}}{2} {∥x_{r s (i)}^{k + 1} - y^{k}∥}^{2}) . \end{matrix}

(A9)

Therefore, it follows that

\underset{i = 1}{\sum^{m}} (〈Δ y^{k + 1}, u_{i}^{k}〉 + \frac{ρ_{i}}{2} ({∥x_{r s (i)}^{k + 1} - y^{k + 1}∥}^{2} - {∥x_{r s (i)}^{k + 1} - y^{k}∥}^{2}) \leq 0 .

(A10)

After simplification, it is obtained as:

\underset{i = 1}{\sum^{m}} \{〈Δ y^{k + 1}, u_{i}^{k}〉 - \frac{ρ_{i}}{2} {∥Δ y^{k + 1}∥}^{2}\} \leq \underset{i = 1}{\sum^{m}} ρ_{i} 〈Δ y^{k + 1}, x_{r s (i)}^{k + 1} - y^{k + 1}〉 .

(A11)

In summary, it is evident that

\underset{i = 1}{\sum^{m}} \{〈Δ y^{k + 1}, u_{i}^{k}〉 + \frac{ρ_{i}}{2} {∥Δ y^{k + 1}∥}^{2}\} \leq \underset{i = 1}{\sum^{m}} ρ_{i} 〈Δ y^{k + 1}, α ({x_{i}}^{k} - y^{k})〉 .

(A12)

By incorporating the subproblem of

x_{i}

, we consider the following formulation:

L_{ρ} (X^{k}, y^{k + 1}, U^{k + \frac{1}{2}}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + \frac{1}{2}}) .

The result is given by

\begin{matrix} L_{ρ} (X^{k}, y^{k + 1}, U^{k + \frac{1}{2}}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + \frac{1}{2}}) \\ = \underset{i = 1}{\sum^{m}} (\begin{matrix} w_{i} f_{i} (x_{i}^{k}) - w_{i} f_{i} (x_{i}^{k + 1}) + 〈Δ x_{i}^{k + 1}, u_{i}^{k + \frac{1}{2}}〉 \\ + ρ_{i} 〈x_{i}^{k + 1} - y^{k + 1}, - Δ x_{i}^{k + 1}〉 + \frac{ρ_{i}}{2} {∥Δ x_{i}^{k + 1}∥}^{2} \end{matrix}) \\ \geq \underset{i = 1}{\sum^{m}} (ρ_{i} (γ - 1) 〈x_{i}^{k + 1} - y^{k + 1}, Δ x_{i}^{k + 1}〉 + \frac{ρ_{i} (1 - γ)}{2} {∥Δ x_{i}^{k + 1}∥}^{2}) . \end{matrix}

(A13)

Integrating the subproblem of y, now, we consider the following formulation:

L_{ρ} (X^{k}, y^{k}, U^{k}) - L_{ρ} (X^{k}, y^{k + 1}, U^{k}) .

(A14)

The outcome is expressed as follows:

\begin{matrix} L_{ρ} (X^{k}, y^{k}, U^{k}) - L_{ρ} (X^{k}, y^{k + 1}, U^{k}) \\ = \underset{i = 1}{\sum^{m}} [- 〈Δ y^{k + 1}, u_{i}^{k}〉 + ρ_{i} 〈x_{i}^{k} - y^{k}, Δ y^{k + 1}〉 - \frac{ρ_{i}}{2} {∥Δ y^{k + 1}∥}^{2}] \\ \geq \underset{i = 1}{\sum^{m}} (1 - α) ρ_{i} 〈Δ y^{k + 1}, (x^{k} - y^{k})〉 \\ = \underset{i = 1}{\sum^{m}} \frac{ρ_{i} (1 - α)}{γ + τ α} [\begin{matrix} - γ 〈Δ x_{i}^{k + 1}, Δ y^{k + 1}〉 + (γ + τ) {∥Δ y^{k + 1}∥}^{2} \\ - \frac{1}{ρ_{i}} 〈Δ y^{k + 1}, Δ u_{i}^{k + 1}〉 \end{matrix}] \end{matrix}

(A15)

Utilizing the updating step of the multiplier (

u_{i}

), in conjunction with Equation (A5), it is found that

\begin{matrix} L_{ρ} (X^{k}, y^{k + 1}, U^{k}) - L_{ρ} (X^{k}, y^{k + 1}, U^{k + \frac{1}{2}}) \\ + L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + \frac{1}{2}}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + 1}) \\ = \underset{i = 1}{\sum^{m}} [〈x_{i}^{k} - y^{k + 1}, u_{i}^{k + \frac{1}{2}} - u_{i}^{k}〉 - 〈x_{i}^{k + 1} - y^{k + 1}, u_{i}^{k + \frac{1}{2}} - u_{i}^{k + 1}〉] \\ = \underset{i = 1}{\sum^{m}} [〈- γ ρ_{i} Δ x_{i}^{k + 1} + Δ u_{i}^{k + 1}, x_{i}^{k + 1} - y^{k + 1}〉 - 〈Δ x_{i}^{k + 1}, Δ u_{i}^{k + 1}〉] . \end{matrix}

(A16)

By cumulatively considering (A13), (A15), and (A16) with

γ + τ α > 0

, it is concluded that

\begin{matrix} L_{ρ} (X^{k}, y^{k}, U^{k}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + 1}) \\ \geq \underset{i = 1}{\sum^{m}} [\begin{matrix} \frac{ρ_{i} (- γ^{2} - γ τ α - τ + α γ)}{2 (γ + τ α)} {∥Δ x_{i}^{k + 1}∥}^{2} + \frac{ρ_{i} (1 - α) (γ + τ)}{2 (γ + τ α)} {∥Δ y^{k + 1}∥}^{2} \\ - \frac{1}{ρ_{i} (γ + τ α)} {∥Δ u_{i}^{k + 1}∥}^{2} + \frac{1 - γ}{γ + τ α} 〈Δ x_{i}^{k + 1}, Δ u_{i}^{k + 1}〉 \\ + \frac{(τ - 1) (1 - α)}{γ + τ α} 〈Δ y^{k + 1}, Δ u_{i}^{k + 1}〉 \end{matrix}] . \end{matrix}

(A17)

Upon further integration of

γ + τ α > 0

, the Lipschitz continuity of

\nabla f_{i}

and the Cauchy–Buniakowsky–Schwarz inequality yield

\begin{matrix} L_{ρ} (X^{k}, y^{k}, U^{k}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + 1}) \\ \geq \underset{i = 1}{\sum^{m}} [\begin{matrix} \frac{ρ_{i} (- γ^{2} - γ τ α - τ + α γ)}{2 (γ + τ α)} {∥Δ x_{i}^{k + 1}∥}^{2} + \frac{ρ_{i} (1 - α) (γ + τ)}{2 (γ + τ α)} {∥Δ y^{k + 1}∥}^{2} \\ - \frac{1}{ρ_{i} (γ + τ α)} {∥Δ u_{i}^{k + 1}∥}^{2} - \frac{(1 - γ) L_{i}}{2 (γ + τ α)} {∥Δ x_{i}^{k + 1}∥}^{2} \\ + \frac{(τ - 1) (1 - α)}{γ + τ α} 〈Δ y^{k + 1}, Δ u_{i}^{k + 1}〉 \end{matrix}] \\ \geq \underset{i = 1}{\sum^{m}} [\begin{matrix} \frac{ρ_{i} (- γ^{2} - γ τ α - τ + α γ)}{2 (γ + τ α)} - \frac{{L_{i}}^{2}}{ρ_{i} (γ + τ α)} - \\ \frac{(τ - 1) (α - 1) L^{2} + 2 (1 - γ) L_{i}}{2 (γ + τ α)} \end{matrix}] {∥Δ x_{i}^{k + 1}∥}^{2} \\ + \underset{i = 1}{\sum^{m}} [\frac{ρ_{i} (γ + τ) (1 - α)}{2 (γ + τ α)} - \frac{(τ - 1) (α - 1)}{2 (γ + τ α)}] {∥Δ y^{k + 1}∥}^{2} . \end{matrix}

(A18)

Let

ρ ≜ {min}_{i = 1, 2, \dots, m} ρ_{i}

,

L ≜ {max}_{i = 1, 2, \dots, m} L_{i}

, and it follows that

\begin{matrix} L_{ρ} (X^{k}, y^{k}, U^{k}) - L_{ρ} (X^{k + 1}, y^{k + 1}, U^{k + 1}) \\ \geq [\begin{matrix} \frac{ρ (- γ^{2} - γ τ α - τ + α γ)}{2 (γ + τ α)} - \frac{L^{2}}{ρ (γ + τ α)} \\ - \frac{(τ - 1) (α - 1) L^{2} + 2 (1 - γ) L}{2 (γ + τ α)} \end{matrix}] \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} \\ + m [\frac{ρ (γ + τ) (1 - α)}{2 (γ + τ α)} - \frac{(τ - 1) (α - 1)}{2 (γ + τ α)}] {∥Δ y^{k + 1}∥}^{2} \\ = a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2}, \end{matrix}

(A19)

wherein

a = \frac{ρ (γ (α - γ) - τ (1 + γ α))}{2 (γ + τ α)} - \frac{L^{2}}{ρ (γ + τ α)} - \frac{(τ - 1) (α - 1) L^{2} + 2 (1 - γ) L}{2 (γ + τ α)},

b = m [\frac{ρ (γ + τ) (1 - α)}{2 (γ + τ α)} - \frac{(τ - 1) (α - 1)}{2 (γ + τ α)}] .

Appendix A.2. Proof of Lemma 4

Given that the sequence of

\{p^{k}\}

is bounded, it can be deduced that the sequence has a limit point. Without loss of generality, we assume that

p^{*}

is a limit point of the sequence of

\{p^{k}\}

, and the subsequence of

\{p^{k_{j}}\}

converges to

p^{*}

. Since

f_{i}

is lower semi-continuous,

L_{ρ}

is also lower semi-continuous. Hence,

L_{ρ} (p^{*}) \leq \underset{j \to + \infty}{l i m i n f} L_{ρ} (p^{k_{j}}) .

(A20)

The above equation shows that

\{L_{ρ} (p^{k_{j}})\}

has a lower bound. It also follows from Lemma 3 that

\{L_{ρ} (p^{k})\}

is monotonically decreasing such that

\{L_{ρ} (p^{k_{j}})\}

is also monotonically decreasing, so there is

\{L_{ρ} (p^{k_{j}})\}

convergence. The sequence of

\{L_{ρ} (p^{k})\}

has monotonically convergent subsequences such that its whole column converges and has

lim_{k \to + \infty} L_{ρ} (p^{k}) \geq L_{ρ} (p^{*}) .

Shifting the term for (35) results in

a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2} \leq L_{ρ} (p^{k}) - L_{ρ} (p^{k + 1})

(A21)

Taking the sum over the finite terms of the above inequality and the limit yields

\sum_{k = 1}^{N} (a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2}) \leq L_{ρ} (p^{0}) - L_{ρ} (p^{*}) .

Consequently,

\sum_{k = 0}^{+ \infty} (a \sum_{i = 1}^{m} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2}) < + \infty

. Further integration with

\sum_{k = 0}^{+ \infty} {∥Δ u_{i}^{k}∥}^{2} < + \infty

, from Equation (A5) shows that

x_{i}^{k + 1} = \frac{1}{γ + τ α} (τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1}) + y^{k + 1};

therefore,

Δ x_{i}^{k + 1} = O (Δ x_{i}^{k + 1}) + O (Δ y^{k + 1}) + Δ y^{k + 1} - \frac{1}{(γ + τ α) ρ_{i}} (Δ u_{i}^{k + 1} - Δ u_{i}^{k}) .

Then, according to the Cauchy inequality, we obtain

∥Δ x_{i}^{k + 1}∥ \leq ∥Δ y^{k + 1}∥ + \frac{1}{(γ + τ α) ρ_{i}} ∥Δ u_{i}^{k + 1}∥ + \frac{1}{(γ + τ α) ρ_{i}} ∥Δ u_{i}^{k}∥ .

This integrates with

\sum_{k = 0}^{+ \infty} {∥Δ y^{k + 1}∥}^{2} < + \infty, \sum_{k = 0}^{+ \infty} {∥Δ {u_{i}}^{k + 1}∥}^{2} < + \infty

. It holds that

\sum_{k = 0}^{+ \infty} {∥Δ {x_{i}}^{k + 1}∥}^{2} < + \infty

. Therefore, it can be immediately established that

\underset{k = 0}{\sum^{+ \infty}} {∥p^{k + 1} - p^{k}∥}^{2} < + \infty .

The proof is complete.

Appendix A.3. Proof of Theorem 1

(1) From the definition of

Ω

, conclusion (I) is validated.

(2) If

p^{*} \in Ω

, there exists a subsequence (

\{p^{k}\}

,

\{p^{k_{j}}\}

) such that

p^{k_{j}} \to p^{*}, j \to + \infty

. Additionally, from Lemma 4, we know

∥p^{k_{j} + 1} - p^{k_{j}}∥ \to 0

; thus,

p^{k_{j} + 1} - p^{k_{j}} \to 0

. Also, since

{x_{i}}^{k_{j} + 1}

is the solution to the x subproblem in Algorithm 2, for any

k_{j}

, it holds that

L_{ρ} (X^{k_{j} + 1}, y^{k_{j}}, U^{k_{j}}) \leq L_{ρ} (X^{*}, y^{k_{j}}, U^{k_{j}}) .

Subsequently,

\begin{matrix} \underset{j \to + \infty}{l i m s u p} L_{ρ} (p^{k_{j} + 1}) = \underset{j \to + \infty}{l i m s u p} L_{ρ} (X^{k_{j} + 1}, y^{k_{j}}, U^{k_{j}}) \\ \leq \underset{j \to + \infty}{l i m s u p} L_{ρ} (X^{*}, y^{k_{j}}, U^{k_{j}}) = L_{ρ} (p^{*}) . \end{matrix}

(A22)

According to this, together with the lower semi-continuity of

L_{ρ} (\cdot)

and given

\underset{j \to + \infty}{l i m i n f} L_{ρ} (p^{j + 1}) \geq L_{ρ} (p^{*})

, then

\underset{j \to + \infty}{l i m i n f} L_{ρ} (p^{j + 1}) \geq L_{ρ} (p^{*})

. Therefore,

lim_{j \to + \infty} f (x^{k_{j} + 1}) = f (x^{*})

. Therefore, due to the closedness of

\partial f

, considering

k = k_{j}

in Equation (23) and taking the limit as

j \to + \infty

yields

U^{*} \in \partial f (x^{*}), X^{*} - y^{*} = 0 .

This, in conjunction with Definition 3, establishes that

w^{*} \in c r i t L_{ρ}

.Q.E.D.

Appendix A.4. Proof of Lemma 5

First, based on Equations (23)–(25), the definition of

e (w^{k + 1}, 1)

, and the non-expansive nature of the proximity operator, it can be shown that

\begin{matrix} ∥e_{X} (p^{k + 1}, 1)∥ = ∥\sum_{i = 1}^{m} [{x_{i}}^{k + 1} - P r o x_{f_{i}} ({x_{i}}^{k + 1} + {u_{i}}^{k + 1})]∥ \\ = ∥\sum_{i = 1}^{m} [P r o x_{f_{i}} ({x_{i}}^{k + 1} + \nabla w_{i} f_{i} (x_{i}^{k + 1})) - P r o x_{f_{i}} (x^{k + 1} + {u_{i}}^{k + 1})]∥ \\ = ∥\sum_{i = 1}^{m} [P r o x_{f_{i}} ({x_{i}}^{k + 1} + {u_{i}}^{k + 1}) - P r o x_{f_{i}} (x^{k + 1} + {u_{i}}^{k + 1})]∥ \\ = 0 . \end{matrix}

(A23)

Additionally, from (23), it is known that

‖ e_{y} (p^{k + 1}, 1) ‖ = ‖ \sum_{i = 1}^{m} u_{i}^{k} ‖ = 0 .

(A24)

Next, integrating (24) and (25), it is established that

\begin{matrix} ‖ e_{U^{k + 1}} (p^{k + 1}, 1) ‖ = ∥\sum_{i = 1}^{m} ({x_{i}}^{k + 1} - y^{k + 1})∥ \\ \leq \sum_{i = 1}^{m} [‖ \frac{1}{γ + τ α} (τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1}) ‖] \\ = \sum_{i = 1}^{m} [\frac{τ α (L + ρ_{i})}{ρ_{i} (τ + α γ)} ‖ Δ x_{i}^{k + 1} ‖] + m |\frac{τ (1 - α)}{γ + τ α}| ‖ Δ y^{k + 1} ‖ \\ \leq \frac{τ α (L + ρ_{max})}{ρ_{i} (τ + α γ)} \sum_{i = 1}^{m} ‖ Δ x_{i}^{k + 1} ‖ + m |\frac{τ (1 - α)}{γ + τ α}| ‖ Δ y^{k + 1} ‖ . \end{matrix}

(A25)

Finally, according to Equations (A23)–(A25), there exist positive numbers (

θ_{1}, θ_{2}

) such that

\begin{matrix} ‖ e (p^{k + 1}, 1) ‖ = \sqrt{‖ e_{x} (p^{k + 1}, 1) ‖^{2} + ‖ e_{y} (p^{k + 1}, 1) ‖^{2} + ‖ e_{λ} (p^{k + 1}, 1) ‖^{2}} \\ \leq θ_{1} \sum_{i = 1}^{m} ‖ Δ {x_{i}}^{k + 1} ‖ + θ_{2} ‖ Δ y^{k + 1} ‖ . \end{matrix}

(A26)

Appendix A.5. Proof of Theorem 2

(1) As for Lemma 4,

∥p^{k + 1} - p^{k}∥ \to 0

. This, in conjunction with (37), yields

e (p^{k + 1}, 1) \to 0

. Furthermore, as

\{L_{ρ} (p^{k})\}

is monotonically decreasing,

L_{ρ} (p^{k}) \leq L_{ρ} (p^{0}), \forall k

. Furthermore, integrating with Assumption 2, there exists

ς > 0

and a positive integer (

k_{1}

) such that

d i s t (p^{k}, c r i t L_{ρ}) \leq ς e (p^{k}, 1), \forall k \geq k_{1} .

(A27)

Consequently, conclusion (1) is validated. (2) Setting

{\bar{p}}^{k + 1} = ({\bar{X}}^{k + 1}, {\bar{y}}^{k + 1}, {\bar{U}}^{k + 1}) \in c r i t L_{ρ}

, it follows that

∥{\bar{p}}^{k + 1} - p^{k + 1}∥ = d i s t (p^{k + 1}, c r i t L_{ρ}), \forall k .

(A28)

Combined with the above conclusion (1),

p^{k + 1} - {\bar{p}}^{k + 1} \to 0

. Utilizing the triangle inequality, it is further deduced that

∥{\bar{p}}^{k} - {\bar{p}}^{k + 1}∥ \leq ∥{\bar{p}}^{k} - p^{k}∥ + ∥p^{k} - p^{k + 1}∥ + ∥p^{k + 1} - {\bar{p}}^{k + 1}∥ \to 0 .

(A29)

According to Assumption 3, for any

\bar{p}, \tilde{p} \in c r i t L_{ρ}

, it holds that

‖ \bar{p} - \tilde{p} ‖ \leq δ (δ > 0)

; therefore, we have

L_{ρ} (\bar{p}) = L_{ρ} (\tilde{p})

. Therefore, according to (A29), there exists a positive integer (

\hat{k} \geq k_{1}

) and a constant (

L_{ρ}^{*}

) such that

L_{ρ} ({\bar{p}}^{k + 1}) = L_{ρ} ({\bar{p}}^{k}) = L_{ρ}^{*}, \forall k \geq \hat{k}

.

Next, we analyze the properties of

L_{ρ}^{*}

. According to Theorem 1 (2), it follows that any accumulation point of

\{p^{k}\}

is a stable point of

L_{ρ} (p^{k})

. It is also proved by Theorem 1(2) that

lim_{j \to + \infty} L_{ρ} (p^{k_{j} + 1}) = L_{ρ} (p^{*})

. Considering that

\{L_{ρ} (p^{k})\}

converges,

lim_{k \to + \infty} L_{ρ} (p^{k}) = L_{ρ} (p^{*}) = inf_{k} L_{ρ} (p^{k})

. Hence,

L_{ρ} (p^{k})

remains constant for the set of accumulation points.

Since

p^{k_{j} + 1} - {\bar{p}}^{k_{j} + 1} \to 0

,

p^{*} - {\bar{p}}^{k_{j} + 1} \to 0

. Consequently, integration with Assumption 3 yields

L_{ρ} ({\bar{p}}^{k}) = L_{ρ}^{*} = L_{ρ} (p^{*}) = inf_{k} L_{ρ} (p^{k}), \forall k \geq \hat{k} .

(A30)

(3) According to (A28), it is understood that

\begin{matrix} \sum_{i = 1}^{m} ∥{\bar{x}}_{i}^{k + 1} - {x_{i}}^{k + 1}∥ \leq d i s t (p^{k + 1}, c r i t L_{ρ}), \\ ∥{\bar{y}}^{k + 1} - y^{k + 1}∥ \leq d i s t (p^{k + 1}, c r i t L_{ρ}), \end{matrix}

(A31)

{\bar{x}}_{i}^{k + 1} - {\bar{y}}^{k + 1} = 0 .

(A32)

From the definition of ALF in (16), it is deduced that

\begin{matrix} L_{ρ} (p^{k + 1}) - L_{ρ} ({\bar{p}}^{k + 1}) \\ = \sum_{i = 1}^{m} [f_{i} ({x_{i}}^{k + 1}) - 〈{x_{i}}^{k + 1} - y^{k + 1}, {u_{i}}^{k + 1}〉 + \frac{ρ_{i}}{2} {∥{x_{i}}^{k + 1} - y^{k + 1}∥}^{2}] \\ - \sum_{i = 1}^{m} [f_{i} ({\bar{x}}_{i}^{k + 1}) - 〈{\bar{x}}_{i}^{k + 1} - {\bar{y}}^{k + 1}, {\bar{u}}_{i}^{k + 1}〉 + \frac{ρ_{i}}{2} {∥{\bar{x}}_{i}^{k + 1} - {\bar{y}}^{k + 1}∥}^{2}] . \end{matrix}

(A33)

On the other hand, due to the convexity of

f_{i}

, it is inferred that

f_{i} ({x_{i}}^{k + 1}) - f_{i} ({\bar{x}}^{k + 1}) \leq 〈{u_{i}}^{k + 1}, ({x_{i}}^{k + 1} - {\bar{x}}_{i}^{k + 1})〉,

(A34)

Combining the above two equations with (A5) and (A32) and simplifying (A33), we obtain

\begin{matrix} L_{ρ} (p^{k + 1}) - L_{ρ} ({\bar{p}}^{k + 1}) \\ \leq \sum_{i = 1}^{m} [\begin{matrix} 〈{u_{i}}^{k + 1}, ({x_{i}}^{k + 1} - {\bar{x}}_{i}^{k + 1})〉 - 〈{x_{i}}^{k + 1} - y^{k + 1}, {u_{i}}^{k + 1}〉 \\ + \frac{ρ_{i}}{2} {∥{x_{i}}^{k + 1} - y^{k + 1}∥}^{2} \end{matrix}] \\ = \sum_{i = 1}^{m} [\begin{matrix} 〈{u_{i}}^{k + 1}, ({x_{i}}^{k + 1} - {\bar{x}}_{i}^{k + 1})〉 \\ - \frac{1}{γ + τ α} 〈τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1}, {u_{i}}^{k + 1}〉 \\ + \frac{ρ_{i}}{2 {(γ + τ α)}^{2}} {∥(τ α Δ x_{i}^{k + 1} + τ (1 - α) Δ y^{k + 1} - \frac{1}{ρ_{i}} Δ u_{i}^{k + 1})∥}^{2} . \end{matrix}] \end{matrix}

(A35)

Furthermore, according to simple calculations and by integrating (A31), there must exist positive numbers (

t_{1}, t_{2}, t_{3}

) such that

\begin{matrix} L_{ρ} (p^{k + 1}) - L_{ρ} ({\bar{p}}^{k + 1}) \\ \leq t_{1} {∥Δ y^{k + 1}∥}^{2} + t_{2} \sum_{i = 1}^{m} {∥Δ {x_{i}}^{k + 1}∥}^{2} \\ + t_{3} (\sum_{i = 1}^{m} {∥{x_{i}}^{k + 1} - {\bar{x}}^{k + 1}∥}^{2} + {∥y^{k + 1} - {\bar{y}}^{k + 1}∥}^{2}) + t_{4} \sum_{i = 1}^{m} {∥Δ {u_{i}}^{k + 1}∥}^{2} \\ \leq t_{1} {∥Δ y^{k + 1}∥}^{2} + (t_{2} + t_{4} L^{2}) \sum_{i = 1}^{m} {∥Δ {x_{i}}^{k + 1}∥}^{2} + 2 t_{3} d i s t^{2} (p^{k + 1}, c r i t L_{ρ}) . \end{matrix}

(A36)

Furthermore, based on the aforementioned conclusion (2), Assumption 2, and Lemma 5, it is found that

\begin{matrix} L_{ρ} (p^{k + 1}) - inf_{k} L_{ρ} (p^{k}) = L_{ρ} (p^{k + 1}) - L_{ρ} ({\bar{p}}^{k + 1}) \\ \leq t_{1} {∥Δ y^{k + 1}∥}^{2} + (t_{2} + t_{4} L^{2}) \sum_{i = 1}^{m} {∥Δ {x_{i}}^{k + 1}∥}^{2} + 2 t_{3} d i s t^{2} (p^{k + 1}, c r i t L_{ρ}) \\ \leq t_{1} {∥Δ y^{k + 1}∥}^{2} + (t_{2} + t_{4} L^{2}) \sum_{i = 1}^{m} {∥Δ {x_{i}}^{k + 1}∥}^{2} + 2 t_{3} ς^{2} {‖ e (p, 1) ‖}^{2} \\ = (t_{1} + 2 t_{3} ς^{2} θ_{2}) {∥Δ y^{k + 1}∥}^{2} + (t_{2} + t_{4} L^{2} + 2 t_{3} ς^{2} θ_{1} m) {\sum_{i = 1}^{m} ‖ Δ {x_{i}}^{k + 1} ‖}^{2} \\ = h_{1} {∥Δ y^{k + 1}∥}^{2} + h_{2} \sum_{i = 1}^{m} {‖ Δ {x_{i}}^{k + 1} ‖}^{2}, \forall k \geq \hat{k}, \end{matrix}

(A37)

where

h_{1} = t_{1} + 2 t_{3} ς^{2} θ_{2}, h_{2} = t_{2} + t_{4} L^{2} + 2 t_{3} ς^{2} θ_{1}

. This, together with Equation (35), the inequality,

a < b

and the condition

L_{ρ} (p^{k + 1}) - inf_{k} L_{ρ} (p^{k}) \geq 0

yield

\begin{matrix} (L_{ρ} (p^{k + 1}) - inf_{k} L_{ρ} (p^{k})) - (L_{ρ} (p^{k}) - inf_{k} L_{ρ} (p^{k})) \\ \leq - (a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2}) \\ \leq - \frac{a}{h} (L_{ρ} (p^{k + 1}) - inf_{k} L_{ρ} (p^{k})) . \end{matrix}

(A38)

Hence, for sufficiently large values of k, it holds that

0 \leq L_{ρ} (p^{k + 1}) - inf_{k} L_{ρ} (p^{k}) \leq \frac{1}{1 + \frac{a}{h}} (L_{ρ} (p^{k}) - inf_{k} L_{ρ} (p^{k})) .

(A39)

Consequently, the sequence

\{L_{ρ} (p^{k})\}

is Q-linearly convergent.

Appendix A.6. Proof of Theorem 3

According to Equation (35), it can be established that

\begin{matrix} a \underset{i = 1}{\sum^{m}} {∥Δ x_{i}^{k + 1}∥}^{2} + b {∥Δ y^{k + 1}∥}^{2} \leq (L_{ρ} (p^{k}) - L_{ρ} (p^{k + 1})) \\ \leq (L_{ρ} (p^{k}) - inf_{k} L_{ρ} (p^{k})) . \end{matrix}

(A40)

From Theorem 2, we know that the sequence of

\{L_{ρ} (p^{k})\}

is Q-linearly convergent, and there exist

0 < \hat{q} < 1

and

M_{1} > 0

such that

{∥Δ y∥}^{k + 1} \leq M_{1} {\hat{q}}^{k}, \forall k .

This, in combination with Equation (36), implies the existence of

0 < \hat{q} < 1

and

M_{2} > 0, M_{3} > 0

such that

‖ Δ {x_{i}}^{k + 1} ‖ \leq M_{2} {\hat{q}}^{k}, ‖ Δ {u_{i}}^{k + 1} ‖ \leq M_{3} {\hat{q}}^{k}, \forall k .

(A41)

Hence, it can be concluded that

‖ p^{k + 1} - p^{k} ‖ \leq \bar{M} {\hat{q}}^{k}, \forall k,

(A42)

where

\bar{M} = \sqrt{M_{1}^{2} + M_{2}^{2} + M_{3}^{2}} > 0

. Therefore, for any

m_{2} > m_{1} \geq 1

, it holds that

‖ p^{m_{2}} - p^{m_{1}} ‖ \leq \underset{k = m_{1}}{\sum^{m_{2} - 1}} ‖ p^{k + 1} - p^{k} ‖ \leq \frac{\bar{M}}{1 - \hat{q}} {\hat{q}}^{m_{1}} .

(A43)

This indicates that

\{p^{k}\}

is a Cauchy sequence; hence, it converges. Let its limit point be denoted as

\hat{p}

; then,

‖ p^{m_{1}} - \hat{p} ‖ \leq \frac{\bar{M}}{1 - \hat{q}} {\hat{q}}^{m_{1}} .

(A44)

Furthermore, from Theorem 1(1), it is understood that the sequence of

L_{ρ} (p^{k})

converges to the steady point of

L_{ρ} (p^{k})

at the rate of R-linear convergence.

References

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; Liu, Y. Fedpd: A federated learning framework with adaptivity to non-iid data. IEEE Trans. Signal Process. 2021, 69, 6055–6070. [Google Scholar] [CrossRef]
Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. Acm Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
Liu, T.; Wang, Z.; He, H.; Shi, W.; Lin, L.; An, R.; Li, C. Efficient and secure federated learning for financial applications. Appl. Sci. 2023, 13, 5877. [Google Scholar] [CrossRef]
Zeng, Q.; Lv, Z.; Li, C.; Shi, Y.; Lin, Z.; Liu, C.; Song, G. FedProLs: Federated learning for IoT perception data prediction. Appl. Intell. 2023, 53, 3563–3575. [Google Scholar] [CrossRef]
Manias, D.M.; Shami, A. Making a case for federated learning in the internet of vehicles and intelligent transportation systems. IEEE Netw. 2021, 35, 88–94. [Google Scholar] [CrossRef]
Posner, J.; Tseng, L.; Aloqaily, M.; Jararweh, Y. Federated learning in vehicular networks: Opportunities and solutions. IEEE Netw. 2021, 35, 152–159. [Google Scholar] [CrossRef]
Konečný, J.; McMahan, B.; Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv 2015, arXiv:1511.03575. [Google Scholar]
Satish, S.; Nadella, G.S.; Meduri, K.; Gonaygunta, H. Collaborative Machine Learning without Centralized Training Data for Federated Learning. Int. Mach. Learn. J. Comput. Eng. 2022, 5, 1–14. [Google Scholar]
Zhou, S.; Li, G.Y. Federated learning via inexact ADMM. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9699–9708. [Google Scholar] [CrossRef]
Elgabli, A.; Park, J.; Ahmed, S.; Bennis, M. L-FGADMM: Layer-wise federated group ADMM for communication efficient decentralized deep learning. In Proceedings of the 2020 IEEE Wireless Communications and Networking Conference (WCNC), Seoul, Republic of Korea, 25–28 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
Zhang, X.; Khalili, M.M.; Liu, M. Improving the privacy and accuracy of ADMM-based distributed algorithms. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5796–5805. [Google Scholar]
Guo, Y.; Gong, Y. Practical collaborative learning for crowdsensing in the internet of things with differential privacy. In Proceedings of the 2018 IEEE Conference on Communications and Network Security (CNS), Beijing, China, 30 May–1 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–9. [Google Scholar]
Zhang, X.; Khalili, M.M.; Liu, M. Recycled ADMM: Improve privacy and accuracy with less computation in distributed algorithms. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 959–965. [Google Scholar]
Huang, Z.; Hu, R.; Guo, Y.; Chan-Tin, E.; Gong, Y. DP-ADMM: ADMM-based distributed learning with differential privacy. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1002–1012. [Google Scholar] [CrossRef]
He, S.; Zheng, J.; Feng, M.; Chen, Y. Communication-efficient federated learning with adaptive consensus admm. Appl. Sci. 2023, 13, 5270. [Google Scholar] [CrossRef]
Ding, J.; Errapotu, S.M.; Zhang, H.; Gong, Y.; Pan, M. Stochastic ADMM based distributed machine learning with differential privacy. In Proceedings of the Security and Privacy in Communication Networks: 15th EAI International Conference, SecureComm 2019, Orlando, FL, USA, 23–25 October 2019; Proceedings, Part I 15. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 257–277. [Google Scholar]
Hager, W.W.; Zhang, H. Convergence rates for an inexact ADMM applied to separable convex optimization. Comput. Optim. Appl. 2020, 77, 729–754. [Google Scholar] [CrossRef]
Yue, S.; Ren, J.; Xin, J.; Lin, S.; Zhang, J. Inexact-ADMM based federated meta-learning for fast and continual edge learning. In Proceedings of the Twenty-Second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, Shanghai, China, 26–29 July 2021; pp. 91–100. [Google Scholar]
Ryu, M.; Kim, K. Differentially private federated learning via inexact ADMM with multiple local updates. arXiv 2022, arXiv:2202.09409. [Google Scholar]
Zhang, S.; Choromanska, A.E.; LeCun, Y. Deep learning with elastic averaging SGD. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
Koloskova, A.; Stich, S.U.; Jaggi, M. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December2022; pp. 17202–17215. [Google Scholar]
Yu, H.; Yang, S.; Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 27–1 February 2019; Volume 33, pp. 5693–5700. [Google Scholar]
Dai, S.; Meng, F. Addressing modern and practical challenges in machine learning: A survey of online federated and transfer learning. Appl. Intell. 2023, 53, 11045–11072. [Google Scholar] [CrossRef]
Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. J. Mach. Learn. Res. 2021, 22, 1–50. [Google Scholar]
Smith, V.; Forte, S.; Ma, C.; Takac, M.; Jordan, M.I.; Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. J. Mach. Learn. Res. 2018, 18, 1–49. [Google Scholar]
Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
Li, T.; Sahu, A.K.; Sanjabi, M.; Zaheer, M.; Talwalkar, A.; Smith, V. On the convergence of federated optimization in heterogeneous networks. arXiv 2018, arXiv:1812.06127. [Google Scholar]
Rockafellar, R.T.; Wets RJ, B. Variational Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends® Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]
Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Keynote, Invited and Contributed Papers. Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186. [Google Scholar]
Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221. [Google Scholar] [CrossRef]
Zhu, Y. An augmented ADMM algorithm with application to the generalized lasso problem. J. Comput. Graph. Stat. 2017, 26, 195–204. [Google Scholar] [CrossRef]
Yang, K.; Jiang, T.; Shi, Y.; Ding, Z. Federated learning via over-the-air computation. IEEE Trans. Wirel. Commun. 2020, 19, 2022–2035. [Google Scholar] [CrossRef]
Jia, Z.; Gao, X.; Cai, X.; Han, D. Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems. J. Optim. Theory Appl. 2021, 188, 1–25. [Google Scholar] [CrossRef]
Jia, Z.; Gao, X.; Cai, X.; Han, D. The convergence rate analysis of the symmetric ADMM for the nonconvex separable optimization problems. J. Ind. Manag. Optim. 2021, 17, 1943–1971. [Google Scholar] [CrossRef]
Kadu, A.; Kumar, R. Decentralized full-waveform inversion. In Proceedings of the 80th EAGE Conference and Exhibition 2018, Copenhagen, Denmark, 11–14 June 2018; European Association of 1Geoscientists and Engineers: Utrecht, The Netherlands, 2018; Volume 2018, pp. 1–5. [Google Scholar]
Yin, Z.; Orozco, R.; Herrmann, F.J. WISER: Multimodal variational inference for full-waveform inversion without dimensionality reduction. arXiv 2024, arXiv:2405.10327. [Google Scholar]

Figure 1. Scatter plots of raw data.

Figure 2. Loss function plots for

n = 100, m \in {10, 30, 50, 80, 100}

(a) and

m = 30, n \in {100, 130, 150, 180, 200}

(b).

Figure 2. Loss function plots for

n = 100, m \in {10, 30, 50, 80, 100}

(a) and

m = 30, n \in {100, 130, 150, 180, 200}

(b).

Figure 3. Scatter plots illustrating the data from the ten nodes of Example 1 following dimensionality reduction via principal component analysis.

Figure 4. Images illustrating the comparative analysis of loss functions for FL, DML, and Fed-RSADMM.

Figure 5. Comparative execution time analysis of three algorithms for Example 1 across iterative evaluations.

Figure 6. Loss function comparison between FL and FedAvg-RSADMM with a setting of

k_{0} = 5

for the latter.

Figure 6. Loss function comparison between FL and FedAvg-RSADMM with a setting of

k_{0} = 5

for the latter.

Figure 7. Comparison of loss functions for

m = 100, n = 1000

, and

k_{0} \in {5, 8, 10, 20, 25}

(As shown in (a)), as well as for

m = 100, n = 2000

, and

k_{0} \in {5, 8, 10, 20, 25}

(As shown in (b)).

Figure 7. Comparison of loss functions for

m = 100, n = 1000

, and

k_{0} \in {5, 8, 10, 20, 25}

(As shown in (a)), as well as for

m = 100, n = 2000

, and

k_{0} \in {5, 8, 10, 20, 25}

(As shown in (b)).

Figure 8. Comparisons of loss functions for

m = 100, n = 1000

, and

k_{0} \in {5, 8, 10, 20, 25}

(a) and

m = 100, n = 2000

, and

k_{0} \in {5, 8, 10, 20, 25}

(b).

Figure 8. Comparisons of loss functions for

m = 100, n = 1000

, and

k_{0} \in {5, 8, 10, 20, 25}

(a) and

m = 100, n = 2000

, and

k_{0} \in {5, 8, 10, 20, 25}

(b).

Figure 9. (a) The relationship between iterative time and communication rounds at different values of

k_{0}

for Algorithm 3. (b) The relationship between iterative time and

k_{0}

with different values of m for Algorithm 3.

Figure 9. (a) The relationship between iterative time and communication rounds at different values of

k_{0}

for Algorithm 3. (b) The relationship between iterative time and

k_{0}

with different values of m for Algorithm 3.

Table 1. Loss functions for some models.

Model	Loss Function
Linear regression	$\frac{λ}{2} ‖ x ‖^{2} + \frac{1}{2} max {\{0; 1 - b_{j} x^{T} a_{j}\}}^{2}$
Squared-SVM	$\frac{1}{2} ‖ b_{j} - x^{T} a_{j} ‖^{2}$
K-means	$\frac{1}{2} min_{l} ‖ a_{j} - x (l) ‖^{2} where x ≜ {[x_{(1)}^{T}, x_{(2)}^{T}, \dots]}^{T}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, J.; Zhu, Y.; Dang, Y. Symmetric ADMM-Based Federated Learning with a Relaxed Step. Mathematics 2024, 12, 2661. https://doi.org/10.3390/math12172661

AMA Style

Lu J, Zhu Y, Dang Y. Symmetric ADMM-Based Federated Learning with a Relaxed Step. Mathematics. 2024; 12(17):2661. https://doi.org/10.3390/math12172661

Chicago/Turabian Style

Lu, Jinglei, Ya Zhu, and Yazheng Dang. 2024. "Symmetric ADMM-Based Federated Learning with a Relaxed Step" Mathematics 12, no. 17: 2661. https://doi.org/10.3390/math12172661

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Symmetric ADMM-Based Federated Learning with a Relaxed Step

Abstract

1. Introduction

1.1. Related Work

1.2. Our Contribution

1.3. Organization

2. Preliminaries

2.1. Notations

2.2. Loss Function

2.3. Symmetric ADMM

2.4. Federated Learning

2.5. Stationary Points

3. Symmetric ADMM-Based Federated Learning with a Relaxed Step and Convergence

3.1. Fed-RSADMM

3.2. FedAvg-RSADMM

3.3. Convergence

3.4. Linear Convergence Rate

4. Numerical Experiment

4.1. Testing Examples

4.2. Numerical Results

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1. Proof of Lemma 3

Appendix A.2. Proof of Lemma 4

Appendix A.3. Proof of Theorem 1

Appendix A.4. Proof of Lemma 5

Appendix A.5. Proof of Theorem 2

Appendix A.6. Proof of Theorem 3

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI