Article

Riemannian SVRG Using Barzilai–Borwein Method as Second-Order Approximation for Federated Learning

School of Mathematics and Statistics, Nanjing University of Science and Technology, Nanjing 210094, China
*
Author to whom correspondence should be addressed.
Symmetry 2024, 16(9), 1101; https://doi.org/10.3390/sym16091101
Submission received: 31 July 2024 / Revised: 14 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024
(This article belongs to the Section Mathematics)

Abstract:
In this paper, we propose a modified RFedSVRG method that incorporates the Barzilai–Borwein (BB) method to approximate second-order information on the manifold for Federated Learning (FL). Moreover, we use the BB strategy to obtain a self-adjusting step size. We show the convergence of our methods under some assumptions. The numerical experiments on both synthetic and real datasets demonstrate that the proposed methods outperform several commonly used FL methods on a number of test problems.

1. Introduction

Federated Learning (FL) [1] has wide applications in machine learning. In this paper, we consider the FL problem in the following form [1,2]:
$$\min_{x\in\mathcal{M}} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \tag{1}$$
where $\mathcal{M}$ is a Riemannian manifold, $f_i:\mathcal{M}\to\mathbb{R}$ is the local loss function stored on the $i$-th client/agent, $f$ is the Lipschitz-smooth global objective function and $n$ is the number of clients. The clients cannot connect to each other directly [2]. Instead, a central server collects their information and outputs a consensus point that minimizes the sum of the loss functions from all the clients. This framework makes use of computing resources from different departments while keeping the data private. There is no requirement to share data between clients, so communication takes place only between the central server and the agents.
Motivating applications of (1) include the Karcher mean problem on the cone of symmetric positive-definite matrices (PSD Karcher mean) [3,4]
$$\min_{X\succ 0} f(X) := \frac{1}{n}\sum_{i=1}^{n} f_i(X), \tag{2}$$
where $f_i(X) = \big\|\log\big(X^{-1/2} A_i X^{-1/2}\big)\big\|_F^2$, $\{X \mid X\succ 0\}$ denotes the set of symmetric positive-definite matrices, $A_i\succ 0$ is the covariance matrix of the data stored on the $i$-th local agent, and the objective function $f$ is strongly geodesically convex.
For non-convex objective problems, the examples include the classical Principal Component Analysis (PCA) problem [5]
$$\min_{\{x\in\mathbb{R}^d \,\mid\, \|x\|_2 = 1\}} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \tag{3}$$
where $f_i(x) = -\frac{1}{2}\, x^\top A_i x$.
When the variable is a matrix $X\in\mathbb{R}^{d\times r}$ with $r>1$, (3) becomes the federated kernel PCA (kPCA) problem [3,6]
$$\min_{X\in\mathrm{St}(d,r)} f(X) := \frac{1}{n}\sum_{i=1}^{n} f_i(X), \tag{4}$$
where $f_i(X) = -\frac{1}{2}\mathrm{Tr}\big(X^\top A_i X\big)$, $\mathrm{St}(d,r) = \{X\in\mathbb{R}^{d\times r} \mid X^\top X = I_r,\ r\le d\}$ denotes the Stiefel manifold and $A_i$ is the covariance matrix of the data stored on the $i$-th local agent.
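To make the PCA formulation (3) concrete, the following minimal NumPy sketch (not taken from the paper; it assumes the negated quadratic local loss, so that minimization recovers the leading eigenvector) evaluates a local loss and its Riemannian gradient on the unit sphere, obtained by projecting the Euclidean gradient onto the tangent space:

```python
import numpy as np

def local_pca_loss(x, A_i):
    """Local PCA loss f_i(x) = -1/2 * x^T A_i x on the unit sphere."""
    return -0.5 * x @ A_i @ x

def local_pca_rgrad(x, A_i):
    """Riemannian gradient on the sphere: project the Euclidean
    gradient -A_i x onto the tangent space at x, i.e., (I - x x^T)(-A_i x)."""
    egrad = -A_i @ x
    return egrad - (x @ egrad) * x

# toy local covariance matrix and a random point on the sphere
rng = np.random.default_rng(0)
C = rng.standard_normal((100, 5))
A_i = C.T @ C
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
print(local_pca_loss(x, A_i), np.linalg.norm(local_pca_rgrad(x, A_i)))
```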
Manifold optimization problems in FL of the form (1) also appear in many machine learning tasks, such as low-rank matrix completion [7], diffusion tensor imaging [8,9,10], elasticity theory [11], Electroencephalography (EEG) classification [12] and deep neural network training [13,14]. Still, there are very few federated algorithms on manifolds. In fact, work [3] appears to be the only FL algorithm that can deal with manifold optimization problems with a similar generality as ours.
Handling manifold constraints in an FL setting poses several obstacles. (i) Existing single-machine methods for manifold optimization [15,16,17] cannot be directly adapted to the federated setting: due to the distributed framework, the server has to average the clients’ local models, and because of the non-convexity of $\mathcal{M}$, their average typically lies outside of $\mathcal{M}$. (ii) Extending typical FL algorithms to scenarios with manifold constraints is not straightforward. Most existing FL algorithms are either unconstrained [1,2,18,19,20,21,22] or only allow convex constraints [23,24,25,26], whereas manifold constraints are typically non-convex. (iii) Compared with non-convex optimization in Euclidean space, manifold optimization necessitates consideration of the geometric structure of the manifold and the properties of the loss functions, which poses challenges for algorithm design and analysis. (iv) Due to the manifold constraints, it is difficult to extend some techniques [21,22] for enhancing communication efficiency, originally developed for Euclidean spaces, to the manifold setting. For more details, we refer to reference [27].

1.1. Related Work

When $\mathcal{M} = \mathbb{R}^d$, there are many works on FL algorithms [1,2,18,19,20,21,22,23,24,25,26,28]. The most widely used algorithm is FedAvg [1], which applies two loops of iterations: an inner loop for the local clients and an outer loop for the server. The inner loop of FedAvg, run in parallel by each client $i\in S_t$, is as follows
$$x_{\ell+1}^{(i)} = x_\ell^{(i)} - \eta^{(i)}\,\nabla f_i(x_\ell^{(i)}), \quad i\in S_t, \tag{5}$$
where $S_t\subseteq[n]$ is uniformly sampled, $\ell\in\{0,1,\dots,\tau_i-1\}$, $\eta^{(i)}$ is a constant called the learning rate and $x_0^{(i)} = x_t$ is received from the server. Then, FedAvg averages the local gradient descent updates by requiring the server to aggregate the $x_{\tau_i}^{(i)}$ in the outer loop as
$$x_{t+1} = \frac{1}{k}\sum_{i=1}^{k} x_{\tau_i}^{(i)}, \quad t\in\{0,1,\dots,T-1\}, \tag{6}$$
which exhibits good empirical convergence. However, FedAvg suffers from the client drift effect, where each client drifts the solution towards the minimum of its own loss function [18]. To alleviate this issue, several works have sought to improve FedAvg. FedProx [18] ensures that the local iterates do not stray far from the previous consensus point by regularizing each local gradient descent update. Later, FedSplit [19] used operator splitting to deal with data heterogeneity, which also reduces the client drift effect. FedNova [20] proposed normalizing the averaged local gradients, but this suffers from a fundamental speed–accuracy conflict under objective heterogeneity [21].
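To illustrate the two-loop structure (5)–(6) concretely, here is a schematic NumPy sketch of FedAvg on a toy quadratic problem (illustrative only; the toy losses, dimensions and parameter values are ours, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau, T, k, eta = 10, 5, 5, 50, 4, 0.1

# toy heterogeneous quadratic losses f_i(x) = 1/2 ||x - b_i||^2
b = rng.standard_normal((n, d))
grad_f_i = lambda x, i: x - b[i]

x_t = np.zeros(d)
for t in range(T):                       # outer loop (server rounds)
    S_t = rng.choice(n, size=k, replace=False)
    updates = []
    for i in S_t:                        # inner loop on each sampled client
        x = x_t.copy()
        for _ in range(tau):
            x -= eta * grad_f_i(x, i)    # local gradient descent, Eq. (5)
        updates.append(x)
    x_t = np.mean(updates, axis=0)       # server averaging, Eq. (6)
print(x_t, b.mean(axis=0))               # x_t should approach the mean of the b_i
```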
On the other hand, variance reduction techniques are commonly used in client training on local data, which alleviates data heterogeneity and leads to the Federated SVRG (FSVRG) method [2]. FSVRG uses the SVRG method [29] in the local client updates:
$$x_{\ell+1}^{(i)} = x_\ell^{(i)} - \eta^{(i)}\Big(\nabla f_i(x_\ell^{(i)}) + \nabla f(x_t) - \nabla f_i(x_t)\Big), \quad i\in S_t, \tag{7}$$
where $x_0^{(i)} = x_t$, and the server aggregates the $x_\tau^{(i)}$ as
$$x_{t+1} = x_t + \frac{1}{k}\sum_{i=1}^{k}\Big(x_\tau^{(i)} - x_t\Big), \quad t\in\{0,1,\dots,T-1\}. \tag{8}$$
Compared with method (6), the aggregation method (8) ensures that the next consensus, $x_{t+1}$, is not too far from the previous consensus, $x_t$. Later, FedLin [21] and SCAFFOLD [22] proposed methods that use variance reduction techniques without computing the full gradient $\nabla f(x_t)$.
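A corresponding sketch of the variance-reduced local update (7) with the aggregation (8), under the same toy setup as above (again illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau, T, k, eta = 10, 5, 5, 50, 4, 0.1
b = rng.standard_normal((n, d))
grad_f_i = lambda x, i: x - b[i]
grad_f   = lambda x: x - b.mean(axis=0)   # full gradient of f = (1/n) sum_i f_i

x_t = np.zeros(d)
for t in range(T):
    g_full = grad_f(x_t)                  # computed once per communication round
    S_t = rng.choice(n, size=k, replace=False)
    deltas = []
    for i in S_t:
        x = x_t.copy()
        for _ in range(tau):
            # SVRG-corrected local direction, Eq. (7)
            x -= eta * (grad_f_i(x, i) + g_full - grad_f_i(x_t, i))
        deltas.append(x - x_t)
    x_t = x_t + np.mean(deltas, axis=0)   # aggregation, Eq. (8)
print(np.linalg.norm(grad_f(x_t)))
```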
Since aggregation points of non-convex sets tend to lie outside of those sets, FL problems with non-convex constraints have rarely been considered [3]. However, such non-convex constrained problems can be reformulated as unconstrained problems on a manifold, as in (1), after which many convenient tools become available for solving the original problems. Ref. [3] proposed the RFedSVRG algorithm, which is the first algorithm for solving FL problems on Riemannian manifolds with a convergence guarantee, and it is suitable for problems with non-convex constraints. In order to perform the iterations on manifolds, RFedSVRG [3] uses the exponential mapping and parallel transport together with the SVRG method [29] in the local updates:
$$x_{\ell+1}^{(i)} = \mathrm{Exp}_{x_\ell^{(i)}}\!\Big(-\eta^{(i)}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t)\big)\Big]\Big), \quad i\in S_t,\ \ell\in\{0,1,\dots,\tau_i-1\}.$$
RFedSVRG also requires the server to aggregate the $x_{\tau_i}^{(i)}$ as the tangent space mean
$$x_{t+1} = \mathrm{Exp}_{x_t}\!\left(\frac{1}{k}\sum_{i\in S_t}\mathrm{Exp}_{x_t}^{-1}\big(x_{\tau_i}^{(i)}\big)\right), \quad t\in\{0,1,\dots,T-1\},$$
which has the “regularization” property $d(x_{t+1},x_t)\le\frac{1}{k}\sum_{i\in S_t} d\big(x_{\tau_i}^{(i)}, x_t\big)$, so that the distance between two consensus points can be controlled. Work [30] explores the differential privacy of RFedSVRG. Ref. [27] employs a projection operator and constructs correction terms in the local updates to reduce the computational costs of RFedSVRG. Ref. [31] considers the specific manifold optimization problem appearing in PCA and investigates an ADMM-type method that penalizes the orthogonality constraint.
Recently, many attempts have been devoted to SVRG with second-order information, which can further reduce the variance. This motivated the development of the Hessian-based stochastic variance reduced gradient (SVRG-2) method [32]. SVRG-2 incorporates second-order information into the variance reduction term of the stochastic gradient as
$$x_{\ell+1}^{(i)} = x_\ell^{(i)} - \eta^{(i)}\Big[\nabla f_i(x_\ell^{(i)}) + \nabla f(x_t) - \nabla f_i(x_t) + \big(\nabla^2 f(x_t) - \nabla^2 f_i(x_t)\big)\big(x_\ell^{(i)} - x_t\big)\Big].$$
SVRG-2 has been shown to provide better variance reduction than standard SVRG and requires only a small number of epochs to converge. However, Ref. [33] notes that the computation of the Hessians requires $O(nd^2)$ time and space. Later, Ref. [33] proposed using the Barzilai–Borwein method to approximate the Hessians, yielding a variant of SVRG-2 (SVRG-2BB) that further controls the variance at a lower computational cost.
Moreover, an important issue for stochastic algorithms and their variants is how to choose an appropriate step size $\eta_t^{(i)}$ while running the algorithm [34]. In the classical gradient descent method, a line search is usually used to obtain the step size. In the stochastic gradient method, however, Ref. [35] states that line search is not possible because only subsampled information about the function value and gradient is available. A common approach for stochastic algorithms is to tune the best fixed step size by hand, but this is time consuming [35]. The Barzilai–Borwein (BB) method [36] is a natural choice for selecting the step length, since it computes the step size automatically from the iterates and the gradients of the function. SVRG-2BBS [33] and SVRG-BB [35] incorporate the BB step size into SVRG-2BB and SVRG, respectively. These two algorithms further improve the performance of the original algorithms and achieve linear convergence for strongly convex objective functions.
Moreover, Refs. [37,38,39] developed BB step sizes on the Stiefel manifold with global convergence guarantees for minimization problems. Later, Ref. [40] extended the Euclidean BB step size to general manifolds, with numerical results showing good performance.

1.2. Our Contributions

In this paper, our goal is to propose a modified RFedSVRG that incorporates second-order information on the manifold in order to save computation. We use the BB method to approximate the second-order information used to reduce the variance, which yields a new algorithm on the manifold, RFedSVRG-2BB. In addition, we propose to incorporate the BB step size into RFedSVRG-2BB (RFedSVRG-2BBS). RFedSVRG-2BBS computes the step size of RFedSVRG-2BB automatically and shows a faster convergence speed in the numerical experiments, while theoretically preserving the global convergence property of RFedSVRG.
We list the contributions of this paper as follows:
  • We propose the Barzilai–Borwein approximation as second-order information to control the variance at a lower computational cost (RFedSVRG-2BB). In addition, we incorporate the Barzilai–Borwein step size into RFedSVRG-2BB, leading to RFedSVRG-2BBS.
  • We present the convergence results and the corresponding convergence rate of the proposed methods for the strongly geodesically convex objective function and non-convex objective function, respectively.
  • We conduct numerical experiments for the proposed methods on the PCA, kPCA and PSD Karcher mean problems on several datasets. The numerical results show that our methods outperform RFedSVRG and the other compared algorithms.
The paper is organized as follows. In Section 2, we recall some basic concepts from Riemannian optimization. In Section 3, we first introduce the framework of RFedSVRG and the BB method, and then propose the RFedSVRG-2BB and RFedSVRG-2BBS algorithms. In Section 4, we provide the convergence analysis of our algorithms. In Section 5, we present a comparison of the numerical results of the proposed methods and some existing Federated Learning methods. The conclusions are drawn in Section 6.

2. Preliminaries on Riemannian Optimization

We first briefly review some of the basic concepts related to Riemannian optimization. More details are shown in [4,16,41,42,43,44].
Let $\mathcal{M}$ be a Riemannian manifold. The tangent space at $x\in\mathcal{M}$ is denoted by $T_x\mathcal{M}$. The inner product on $T_x\mathcal{M}$ is written $\langle\cdot,\cdot\rangle_x$ and the corresponding induced norm of $\xi\in T_x\mathcal{M}$ is $\|\xi\| := \sqrt{\langle\xi,\xi\rangle_x}$. The Riemannian gradient of a function $f\in C^1(\mathcal{M})$ at $x$ is denoted $\mathrm{grad}\, f(x)$; it is the unique tangent vector in $T_x\mathcal{M}$ such that
$$\big\langle \mathrm{grad}\, f(x), \xi\big\rangle_x = \mathrm{D} f(x)[\xi] = \big\langle \nabla f(x), \xi\big\rangle, \quad \forall\, \xi\in T_x\mathcal{M},$$
where $\mathrm{D}f(x)[\xi]$ denotes the directional derivative of $f$ along the direction $\xi$ [41] and $\langle\cdot,\cdot\rangle$ is the Euclidean inner product defined by $\langle\xi,\zeta\rangle = \mathrm{Tr}(\xi^\top\zeta)$. We can also define the angle between $\xi,\zeta\in T_x\mathcal{M}$ as $\arccos\frac{\langle\xi,\zeta\rangle_x}{\|\xi\|\,\|\zeta\|}$.
If $f\in C^2(\mathcal{M})$, the Riemannian Hessian of $f$ at $x$ is a linear mapping from $T_x\mathcal{M}$ to $T_x\mathcal{M}$, defined by
$$\mathrm{Hess}\, f(x)[\xi] = \tilde{\nabla}_\xi\, \mathrm{grad}\, f(x), \quad \forall\, \xi\in T_x\mathcal{M},$$
where $\mathrm{Hess}\, f(x)[\xi]$ denotes the action of $\mathrm{Hess}\, f(x)$ on the tangent vector $\xi\in T_x\mathcal{M}$ and $\tilde{\nabla}$ is the Riemannian connection [41].
Definition 1 
(Geodesic and exponential mapping [43]). A geodesic is a constant-speed curve $\gamma:[0,1]\to\mathcal{M}$ that is locally distance minimizing. Given a point $x\in\mathcal{M}$ and a tangent vector $v\in T_x\mathcal{M}$, the exponential mapping $\mathrm{Exp}_x(v)$ is defined as a mapping from $T_x\mathcal{M}$ to $\mathcal{M}$ such that $\mathrm{Exp}_x(v) = \gamma(1)$, where $\gamma$ is the geodesic with $\gamma(0) = x$ and $\dot{\gamma}(0) = v$.
Definition 2 
(Inverse of the exponential mapping [43]). If there is a unique geodesic between $x,y\in\mathcal{M}$, then the inverse of the exponential mapping $\mathrm{Exp}_x^{-1}:\mathcal{M}\to T_x\mathcal{M}$ exists, the geodesic is the unique shortest path between $x$ and $y$, and $\|\mathrm{Exp}_x^{-1}(y)\| = \|\mathrm{Exp}_y^{-1}(x)\|$.
The exponential mapping is the most accurate way to map a tangent vector $\xi\in T_x\mathcal{M}$ onto the manifold. Moreover, if the manifold is complete, the inverse of the exponential mapping, called the logarithm mapping, is well defined, and we have
$$d(x,y) = \big\|\mathrm{Exp}_x^{-1}(y)\big\|. \tag{12}$$
Therefore, we use the exponential mapping throughout the paper; property (12) is important for the later analysis. Throughout this paper, we always assume that $\mathcal{M}$ is complete. Tangent vectors lying in different tangent spaces cannot be combined directly, and a natural way to deal with this is parallel transport. Next, we give the definition of parallel transport.
Definition 3 
(Parallel transport [43]). Given a Riemannian manifold $\mathcal{M}$ and two points $x,y\in\mathcal{M}$, the parallel transport $P_x^y: T_x\mathcal{M}\to T_y\mathcal{M}$ is a linear operator that preserves the inner product, i.e., for all $\xi,\zeta\in T_x\mathcal{M}$, $\langle\xi,\zeta\rangle_x = \langle P_x^y\xi, P_x^y\zeta\rangle_y$.
We now present the definitions of Lipschitz smoothness and convexity on the Riemannian manifold $\mathcal{M}$, which will be used in the later convergence analysis.
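To make Definitions 1–3 concrete, the following sketch implements the exponential map, its inverse (logarithm map) and parallel transport in closed form for the unit sphere (standard formulas; the helper names are ours and not part of the paper or of Pymanopt):

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: Exp_x(v)."""
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_log(x, y):
    """Inverse exponential (logarithm) map: Exp_x^{-1}(y)."""
    p = y - (x @ y) * x                  # project y onto the tangent space at x
    npn = np.linalg.norm(p)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros_like(x) if npn < 1e-15 else (theta / npn) * p

def sphere_transport(x, y, u):
    """Parallel transport of u in T_x along the minimizing geodesic from x to y."""
    v = sphere_log(x, y)
    theta = np.linalg.norm(v)
    if theta < 1e-15:
        return u.copy()
    e = v / theta
    a = e @ u
    return u + a * ((np.cos(theta) - 1.0) * e - np.sin(theta) * x)

rng = np.random.default_rng(3)
x = rng.standard_normal(4); x /= np.linalg.norm(x)
y = rng.standard_normal(4); y /= np.linalg.norm(y)
v = sphere_log(x, y)
# d(x, y) = ||Exp_x^{-1}(y)||, cf. Eq. (12)
print(np.linalg.norm(v), np.arccos(np.clip(x @ y, -1, 1)))
# parallel transport preserves norms (and inner products)
u = rng.standard_normal(4); u -= (x @ u) * x
print(np.linalg.norm(u), np.linalg.norm(sphere_transport(x, y, u)))
```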
Definition 4 
(L-smoothness on manifolds [16]). If there exists L 0 , such that the following inequality holds for function f:
$$\big\|\mathrm{grad}\, f(y) - P_x^y\, \mathrm{grad}\, f(x)\big\| \le L\, d(x,y), \quad \forall\, x,y\in\mathcal{M},$$
then $f$ is called Lipschitz smooth (L-smooth) on the manifold. If the manifold is complete, we have [4]:
$$f(y) \le f(x) + \big\langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\big\rangle_x + \frac{L}{2} d^2(x,y), \quad \forall\, x,y\in\mathcal{M}.$$
Lemma 1 
([16]). Given a Riemannian manifold $\mathcal{M}$ and $f\in C^2(\mathcal{M})$ whose Hessian $\mathrm{Hess}\, f$ is Lipschitz continuous on the manifold, the following inequality holds for $f$:
$$\Big\|\mathrm{grad}\, f(y) - P_x^y\Big(\mathrm{grad}\, f(x) + \mathrm{Hess}\, f(x)\big[\mathrm{Exp}_x^{-1}(y)\big]\Big)\Big\| \le \frac{L}{2}\, d^2(x,y), \quad \forall\, x,y\in\mathcal{M}.$$
Definition 5 
(Geodesically convex and $\mu$-strongly g-convex [16]). A function $f\in C^1(\mathcal{M})$ is called geodesically convex if for all $x,y\in\mathcal{M}$ there exists a geodesic $\gamma$ with $\gamma(0)=x$, $\gamma(1)=y$ and
$$f(\gamma(t)) \le (1-t)\, f(x) + t\, f(y), \quad \forall\, t\in[0,1],$$
or equivalently
$$f(y) \ge f(x) + \big\langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\big\rangle_x.$$
In addition, a function f is μ-strongly g-convex (or geodesically μ-strongly convex) if the following inequality holds for f:
$$f(y) \ge f(x) + \big\langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\big\rangle_x + \frac{\mu}{2}\, d^2(x,y),$$
where 0 < μ < min { 1 , L } is a small constant.
Definition 6 
(h-gradient dominated [17]). We say $f:\mathcal{M}\to\mathbb{R}$ is $h$-gradient dominated if there exists a constant $h>0$ such that, for a global minimizer $x^*$ of $f$ and every $x\in\mathcal{M}$,
$$f(x) - f(x^*) \le h\, \big\|\mathrm{grad}\, f(x)\big\|^2.$$

3. RFedSVRG with Barzilai–Borwein Approximation as Second-Order Information

In this section, we first introduce the framework of the RFedSVRG [3] method and the BB method. Then we propose our method, RFedSVRG-2BB. Finally, we combine the BB step size with RFedSVRG-2BB to obtain RFedSVRG-2BBS.

3.1. The RFedSVRG Method

The framework of RFedSVRG has three main steps [3]:
  • Uniformly sample clients to obtain the set $S_t\subseteq[n]$; the sampled clients receive $x_t$ and $\mathrm{grad}\, f(x_t)$ from the server;
  • The clients in $S_t$ take the local updates
    $$x_{\ell+1}^{(i)} = \mathrm{Exp}_{x_\ell^{(i)}}\!\Big(-\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t)\big)\Big]\Big) \tag{13}$$
    to obtain $\hat{x}^{(i)} = x_{\tau_i}^{(i)}$, where $\eta_t^{(i)} = \eta^{(i)}$ is a constant step size specified by the user, $x_0^{(i)} = x_t$ and $\ell\in\{0,1,\dots,\tau_i-1\}$;
  • The central server aggregates the updated points $\hat{x}^{(i)}$ from the clients to obtain $x_{t+1}$ by the tangent space mean
    $$x_{t+1} = \mathrm{Exp}_{x_t}\!\left(\frac{1}{k}\sum_{i\in S_t}\mathrm{Exp}_{x_t}^{-1}\big(\hat{x}^{(i)}\big)\right), \quad t\in\{0,1,\dots,T-1\}. \tag{14}$$
There are many advantages to this framework. First, RFedSVRG only samples a subset of clients in each communication round, which avoids high computational costs. Second, the algorithm utilizes the gradient information at the previous iterate, $\mathrm{grad}\, f(x_t)$, and incorporates a variance reduction technique in the inner loop (13) to estimate the change of the gradient between $x_\ell^{(i)}$ and $x_t$. Because
$$\mathbb{E}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big)\Big] = \mathrm{grad}\, f(x_\ell^{(i)}),$$
RFedSVRG converges to the stationary points of the global objective and avoids the “client drift” effect. The variance of $\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big)$ satisfies
$$\begin{aligned}
&\mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big) - \mathrm{grad}\, f(x_\ell^{(i)})\big\|^2\Big] \\
&\quad = \mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t) - \mathbb{E}\big[\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)\big]\big\|^2\Big] \\
&\quad \le \mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)\big\|^2\Big] \le O\big(d^2(x_\ell^{(i)}, x_t)\big), \tag{15}
\end{aligned}$$
where the first inequality is due to $\mathbb{E}[\|\beta-\mathbb{E}[\beta]\|^2] = \mathbb{E}[\|\beta\|^2] - \|\mathbb{E}[\beta]\|^2 \le \mathbb{E}[\|\beta\|^2]$ and the second inequality follows from the L-smoothness of $f$. Third, the framework uses the tangent space mean (14) in the outer loop, which is easy to compute in closed form on Riemannian manifolds and has the “regularization” property $d(x_{t+1},x_t) \le \frac{1}{k}\sum_{i\in S_t} d\big(x_{\tau_i}^{(i)}, x_t\big)$, so that the distance between two consecutive consensus points can be controlled.
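As a minimal illustration of the tangent space mean (14), the following sketch implements the aggregation on the unit sphere (our own helper functions, not the paper's implementation):

```python
import numpy as np

def sphere_exp(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * v / nv

def sphere_log(x, y):
    p = y - (x @ y) * x
    npn = np.linalg.norm(p)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros_like(x) if npn < 1e-15 else (theta / npn) * p

def tangent_space_mean(x_t, client_points):
    """Server aggregation (14): map the clients' points into T_{x_t}M with the
    logarithm map, average there, then map back with the exponential map."""
    vs = [sphere_log(x_t, xi) for xi in client_points]
    return sphere_exp(x_t, np.mean(vs, axis=0))

rng = np.random.default_rng(4)
x_t = rng.standard_normal(3); x_t /= np.linalg.norm(x_t)
clients = []
for _ in range(5):
    v = 0.1 * rng.standard_normal(3)
    v -= (x_t @ v) * x_t                 # small tangent perturbations of x_t
    clients.append(sphere_exp(x_t, v))
x_next = tangent_space_mean(x_t, clients)
# "regularization" property: d(x_{t+1}, x_t) <= average of d(x_hat_i, x_t)
d = lambda a, b: np.arccos(np.clip(a @ b, -1, 1))
print(d(x_next, x_t), np.mean([d(c, x_t) for c in clients]))
```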
As pointed out by [33], stochastic gradient methods attempt to estimate the true gradient accurately, and approximating second-order information yields higher accuracy, which reduces the number of communication rounds, $T$. Inspired by the work above, the main proposal of this paper is to generalize this technique to the manifold setting and improve the accuracy by means of the exponential mapping and parallel transport. The local update (13) can be modified as
$$x_{\ell+1}^{(i)} = \mathrm{Exp}_{x_\ell^{(i)}}\!\Big\{-\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t) + B\xi - B_i\xi\big)\Big]\Big\}, \quad \xi = \mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)}), \tag{16}$$
where $B$ and $B_i$ are computationally affordable matrices approximating the second-order information, which satisfy the following two properties:
  • Property 1: unbiased estimate of B and B i , that is,
    $$\mathbb{E}[B_i] = B.$$
  • Property 2: approximation of B and B i such that
    $$B\xi \approx \mathrm{Hess}\, f(x_t)[\xi] \quad \text{and} \quad B_i\xi \approx \mathrm{Hess}\, f_i(x_t)[\xi],$$
where $\xi = \mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})$.
From Property 1, we have
$$\mathbb{E}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t)\big) + P_{x_t}^{x_\ell^{(i)}}\big(B\xi - B_i\xi\big)\Big] = \mathrm{grad}\, f(x_\ell^{(i)}), \quad \xi = \mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)}).$$
Hence, we can expect that the inner loop of the algorithm can still find a correct solution.
From Property 2, we can bound the variance of $\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big[\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t)\big] + P_{x_t}^{x_\ell^{(i)}}\big(B\xi - B_i\xi\big)$ by
$$\begin{aligned}
&\mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\big[\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t)\big] + P_{x_t}^{x_\ell^{(i)}}\big(B\xi - B_i\xi\big) - \mathrm{grad}\, f(x_\ell^{(i)})\big\|^2\Big] \\
&= \mathbb{E}\Big[\big\|\big(\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) + B_i\xi\big)\big) - \big(\mathrm{grad}\, f(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f(x_t) + B\xi\big)\big)\big\|^2\Big] \\
&= \mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) + B_i\xi\big) - \mathbb{E}\big[\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) + B_i\xi\big)\big]\big\|^2\Big] \\
&\le \mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) + B_i\xi\big)\big\|^2\Big] \\
&\approx \mathbb{E}\Big[\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) + \mathrm{Hess}\, f_i(x_t)[\xi]\big)\big\|^2\Big] \le O\big(d^4(x_\ell^{(i)}, x_t)\big), \tag{18}
\end{aligned}$$
where the second inequality follows from Lemma 1. From (15) and (18), we can see that the variance of the modified local update (16) is smaller than that of the original local update (13) when $x_\ell^{(i)}$ is close to $x_t$.

3.2. Barzilai–Borwein Method

In this subsection, we recall the deterministic Barzilai–Borwein (BB) method. The BB method is a first-order optimization algorithm that has proven very successful in solving nonlinear optimization problems. In Euclidean space, the BB step size is derived from the quasi-Newton method [45]. Consider the unconstrained optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x), \quad f\in C^1(\mathbb{R}^d). \tag{19}$$
For minimizing the differentiable objective function $f(x)$, the quasi-Newton method solves this problem using an approximation of the Hessian, which needs to satisfy the secant equation, i.e.,
$$B_t s_t = y_t, \tag{20}$$
where $B_t$ is an approximation of the Hessian of $f$ at $x_t$, $s_t = x_t - x_{t-1}$, $y_t = \nabla f(x_t) - \nabla f(x_{t-1})$ and $t\ge 1$. The BB method on a Riemannian manifold uses a matrix $B_t$ to approximate the action of the Riemannian Hessian of $f$ at a given point by a multiple of the identity, $B_t = \frac{1}{\eta_t} I$ with $\eta_t > 0$ [40]. $B_t$ also needs to satisfy the secant Equation (20). The difficulty is that we cannot operate on vectors lying in different tangent spaces, so parallel transport and the exponential mapping are needed: we move the vectors at different points and tangent spaces “in parallel” to the tangent space $T_{x_t}\mathcal{M}$. Therefore, we define $s_t$ on the manifold as
$$s_t := P_{x_{t-1}}^{x_t}\, \mathrm{Exp}_{x_{t-1}}^{-1}(x_t), \tag{21}$$
and denote y t as
$$y_t := \mathrm{grad}\, f(x_t) - P_{x_{t-1}}^{x_t}\big(\mathrm{grad}\, f(x_{t-1})\big). \tag{22}$$
To solve the secant equation in a least-squares sense, i.e., $\min_{\eta_t}\big\|\frac{1}{\eta_t}s_t - y_t\big\|^2$, when $\langle s_t, y_t\rangle_{x_t} > 0$, $B_t$ and the Riemannian BB step size $\eta_t$ can be written as
$$B_t = \frac{\langle s_t, y_t\rangle_{x_t}}{\langle s_t, s_t\rangle_{x_t}}\, I \quad \text{and} \quad \eta_t = \frac{\langle s_t, s_t\rangle_{x_t}}{\langle s_t, y_t\rangle_{x_t}}. \tag{23}$$
Another choice of $\eta_t$ is obtained by solving $\min_{\eta_t}\|s_t - \eta_t y_t\|^2$, which can be expressed as
$$B_t = \frac{\langle y_t, y_t\rangle_{x_t}}{\langle s_t, y_t\rangle_{x_t}}\, I \quad \text{and} \quad \eta_t = \frac{\langle s_t, y_t\rangle_{x_t}}{\langle y_t, y_t\rangle_{x_t}}. \tag{24}$$
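The Riemannian BB quantities (21)–(24) can be computed directly from two consecutive iterates and their gradients. The following sketch does so on the unit sphere (helper functions and the toy objective are ours and purely illustrative):

```python
import numpy as np

def sphere_log(x, y):
    p = y - (x @ y) * x
    npn = np.linalg.norm(p)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros_like(x) if npn < 1e-15 else (theta / npn) * p

def sphere_transport(x, y, u):
    v = sphere_log(x, y)
    theta = np.linalg.norm(v)
    if theta < 1e-15:
        return u
    e = v / theta
    return u + (e @ u) * ((np.cos(theta) - 1.0) * e - np.sin(theta) * x)

def riemannian_bb_steps(x_prev, x_cur, g_prev, g_cur):
    """s_t, y_t of (21)-(22) and the two BB step sizes of (23)-(24)."""
    s = sphere_transport(x_prev, x_cur, sphere_log(x_prev, x_cur))  # (21)
    y = g_cur - sphere_transport(x_prev, x_cur, g_prev)             # (22)
    sy = s @ y
    eta_bb1 = (s @ s) / sy if sy > 0 else None   # (23); undefined if <s,y> <= 0
    eta_bb2 = sy / (y @ y) if sy > 0 else None   # (24)
    return s, y, eta_bb1, eta_bb2

# toy usage: f(x) = -1/2 x^T A x on the sphere
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4)); A = A.T @ A
rgrad = lambda x: -(A @ x - (x @ (A @ x)) * x)
x0 = rng.standard_normal(4); x0 /= np.linalg.norm(x0)
x1 = x0 - 0.01 * rgrad(x0); x1 /= np.linalg.norm(x1)   # crude retraction step
s, y, eta1, eta2 = riemannian_bb_steps(x0, x1, rgrad(x0), rgrad(x1))
print(eta1, eta2)
```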

3.3. RFedSVRG with Barzilai–Borwein Method as Second-Order Information (RFedSVRG-2BB)

In this subsection, we propose RFedSVRG with the BB method used to approximate second-order information. The previous subsection introduced two different approximations $B_t$, given in (23) and (24). For convenience, we adopt (23) in the rest of the paper.
Now we propose to use the BB method for a Hessian approximation to compute B and B i in (16). We call it the RFedSVRG-2BB method.
RFedSVRG-2BB: Equation (16) with (23) for the approximate Hessian. We first define $s_t$ and $y_t$ as in (21) and (22), and
$$y_t^i = \mathrm{grad}\, f_i(x_t) - P_{x_{t-1}}^{x_t}\big(\mathrm{grad}\, f_i(x_{t-1})\big). \tag{25}$$
If $t\ge 1$, $\langle s_t, y_t\rangle_{x_t} > 0$ and $\langle s_t, y_t^i\rangle_{x_t} > 0$, then
$$B = \frac{\langle s_t, y_t\rangle_{x_t}}{\langle s_t, s_t\rangle_{x_t}}\, I \quad \text{and} \quad B_i = \frac{\langle s_t, y_t^i\rangle_{x_t}}{\langle s_t, s_t\rangle_{x_t}}\, I; \tag{26}$$
else, we use first-order RFedSVRG, i.e.,
$$B = B_i = 0. \tag{27}$$
Remark 1. 
In each communication round, we compute the full BB approximation $B$ using the full gradients, and in the local updates we compute the stochastic BB approximation $B_i$ using the stochastic gradients. From (26) and (27), one can observe that no expensive additional computation is required. It is easy to see that $\mathbb{E}[B_i] = B$ for the above approximations derived from the BB method.
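The following is a minimal single-client sketch of the local update (16) with the BB approximations (26)–(27) on the unit sphere. It is our own illustrative rendering of the inner loop of Algorithm 1, not the authors' code; since $B$ and $B_i$ are multiples of the identity, only their scalars are stored.

```python
import numpy as np

def sphere_exp(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * v / nv

def sphere_log(x, y):
    p = y - (x @ y) * x
    npn = np.linalg.norm(p)
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros_like(x) if npn < 1e-15 else (theta / npn) * p

def sphere_transport(x, y, u):
    v = sphere_log(x, y)
    theta = np.linalg.norm(v)
    if theta < 1e-15:
        return u
    e = v / theta
    return u + (e @ u) * ((np.cos(theta) - 1.0) * e - np.sin(theta) * x)

def bb_scalars(s, y_full, y_i):
    """Scalars beta, beta_i with B = beta*I and B_i = beta_i*I, Eqs. (26)-(27)."""
    ss = s @ s
    if ss < 1e-15 or s @ y_full <= 0 or s @ y_i <= 0:
        return 0.0, 0.0                       # fall back to first-order RFedSVRG, Eq. (27)
    return (s @ y_full) / ss, (s @ y_i) / ss  # Eq. (26)

def local_update_2bb(x_t, x_prev, grad_fi, grad_f_xt, grad_f_prev,
                     grad_fi_xt, grad_fi_prev, eta, tau):
    """tau steps of the corrected local update (16) for one client."""
    # BB quantities (21), (22), (25) at the consensus point x_t
    s = sphere_transport(x_prev, x_t, sphere_log(x_prev, x_t))
    y_full = grad_f_xt - sphere_transport(x_prev, x_t, grad_f_prev)
    y_i = grad_fi_xt - sphere_transport(x_prev, x_t, grad_fi_prev)
    beta, beta_i = bb_scalars(s, y_full, y_i)
    x = x_t.copy()
    for _ in range(tau):
        xi = sphere_log(x_t, x)                                # xi = Exp_{x_t}^{-1}(x)
        corr = grad_f_xt - grad_fi_xt + (beta - beta_i) * xi   # correction in T_{x_t}
        v = grad_fi(x) + sphere_transport(x_t, x, corr)        # direction in T_x
        x = sphere_exp(x, -eta * v)                            # Eq. (16)
    return x

# toy usage with quadratic losses f_i(x) = -1/2 x^T A_i x on the sphere
rng = np.random.default_rng(6)
A_i = rng.standard_normal((4, 4)); A_i = A_i.T @ A_i
A = A_i + 0.1 * np.eye(4)                     # stand-in for the global covariance
rg = lambda M: (lambda x: -((M @ x) - (x @ (M @ x)) * x))
x_prev = rng.standard_normal(4); x_prev /= np.linalg.norm(x_prev)
x_t = sphere_exp(x_prev, 0.05 * rg(A)(x_prev))
x_hat = local_update_2bb(x_t, x_prev, rg(A_i), rg(A)(x_t), rg(A)(x_prev),
                         rg(A_i)(x_t), rg(A_i)(x_prev), eta=0.05, tau=3)
print(np.linalg.norm(rg(A)(x_hat)))
```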
We summarize our algorithm in the following framework, Algorithm 1.

3.4. RFedSVRG-2BB with Barzilai–Borwein Step Size (RFedSVRG-2BBS)

In this section, we propose RFedSVRG-2BB with the BB step size. Note that in the Input of Algorithm 1 we need to specify the step sizes of all clients, which requires entering a very large number of parameters. As seen from (23), the BB method does not need any parameter, and the step size is computed while running the algorithm. To the best of our knowledge, SVRG-BB [35] was the first algorithm to combine the BB step size with SVRG. In the $t$-th ($0\le t\le T$) outer loop, the BB step size in [35] can be regarded as replacing line 8 of Algorithm 1 with
$$x_{\ell+1}^{(i)} = x_\ell^{(i)} - \eta_t^{(i)}\Big(\nabla f_i(x_\ell^{(i)}) + \nabla f(x_t) - \nabla f_i(x_t)\Big), \quad \eta_t^{(i)} = \frac{1}{\tau_i}\cdot\frac{\|s_t\|_2^2}{s_t^\top y_t}, \tag{28}$$
where $s_t = x_t - x_{t-1}$, $y_t = \nabla f(x_t) - \nabla f(x_{t-1})$ and $\tau_i$ is the number of inner-loop updates.
Algorithm 1: Framework of RFedSVRG-2BB
Inspired by the idea of SVRG-BB, we use the BB step size in RFedSVRG-2BB so that the step size is self-adjusting: we no longer need to specify a step size for each client, while maintaining a fast convergence speed. We call the new algorithm RFedSVRG-2BBS. We apply (23) to decide the step length. Moreover, in order to make the convergence curve smoother in the numerical experiments, we bound the step size within $[\eta_{\min}, \eta_{\max}]$. The calculation of the step size is shown in (29) and (30).
RFedSVRG-2BBS: $\{\eta^{(i)}\}_{i=1}^{n}$ in the Input of Algorithm 1 is replaced by $\{\hat{\eta}_0, \eta_{\max}, \eta_{\min}\}$, where $0 < \eta_{\min} < \eta_{\max}$. If $t\ge 1$, $\eta_t^{(i)}$ in line 8 of Algorithm 1 is changed to
$$\eta_t^{(i)} = \hat{\eta}_t / \tau_i, \tag{29}$$
where
$$\hat{\eta}_t = \begin{cases} \min\big\{\eta_{\max}, \max\{\eta_{\min}, \eta_t^{BB}\}\big\}, & \text{if } \langle s_t, y_t\rangle_{x_t} > 0, \\ \eta_{\max}, & \text{otherwise}, \end{cases} \tag{30}$$
$\eta_t^{BB} = \frac{\langle s_t, s_t\rangle_{x_t}}{\langle s_t, y_t\rangle_{x_t}}$, and $s_t$, $y_t$ are calculated by (21) and (22). Otherwise (i.e., $t = 0$), $\eta_t^{(i)} = \hat{\eta}_0/\tau_i$.
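In code, the step-size rule (29)–(30) is simply a clipped BB quotient. A short sketch (the function name is ours; $s_t$, $y_t$ are assumed to be represented as coordinate vectors in $T_{x_t}\mathcal{M}$ so the inner products reduce to dot products):

```python
import numpy as np

def bbs_step_size(s_t, y_t, tau_i, eta_min, eta_max):
    """Bounded BB step size of Eqs. (29)-(30); s_t, y_t are tangent vectors at x_t."""
    sy = float(s_t @ y_t)
    eta_hat = min(eta_max, max(eta_min, float(s_t @ s_t) / sy)) if sy > 0 else eta_max  # (30)
    return eta_hat / tau_i                                                              # (29)

print(bbs_step_size(np.array([0.1, 0.2]), np.array([0.3, 0.1]),
                    tau_i=3, eta_min=1e-3, eta_max=0.8))
```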

4. Convergence Analysis

In this section, we present the convergence results and convergence rates of RFedSVRG-2BB and RFedSVRG-2BBS. Before stating the results, we give some necessary assumptions and auxiliary conclusions.
Assumption 1 
(Smoothness). Suppose that each $f_i$ is $L_i$-smooth on the manifold. This implies that $f$ is $L$-smooth with $L = \sum_{i=1}^{n} L_i$.
Define $s_t$, $y_t$ and $y_t^i$ as in (21), (22) and (25). If the function $f$ is $L$-smooth, then $\|y_t\| = \|\mathrm{grad}\, f(x_t) - P_{x_{t-1}}^{x_t}\mathrm{grad}\, f(x_{t-1})\| \le L\, d(x_{t-1}, x_t) = L\|\mathrm{Exp}_{x_{t-1}}^{-1}(x_t)\| = L\|s_t\|$. Similarly, $\|y_t^i\| \le L\|s_t\|$. Then we obtain the upper bounds of $\langle s_t, y_t\rangle_{x_t}$ and $\langle s_t, y_t^i\rangle_{x_t}$:
$$\langle s_t, y_t\rangle_{x_t} \le \|s_t\|\cdot\|y_t\| \le L\|s_t\|^2, \qquad \langle s_t, y_t^i\rangle_{x_t} \le \|s_t\|\cdot\|y_t^i\| \le L\|s_t\|^2.$$
For RFedSVRG-2BBS, we can obtain the lower bound of $\eta_t^{BB}$ in (30):
$$\eta_t^{BB} = \frac{\langle s_t, s_t\rangle_{x_t}}{\langle s_t, y_t\rangle_{x_t}} \ge \frac{1}{L}\cdot\frac{\|s_t\|^2}{\|s_t\|^2} = \frac{1}{L}.$$
This means $\max\{\eta_{\min}, \eta_t^{BB}\} \ge \frac{1}{L}$. If $\eta_{\max}\le\frac{1}{L}$, from (30) we see that $\hat{\eta}_t = \eta_{\max}$ is fixed. To avoid this situation, we should set $\eta_{\max} > \frac{1}{L}$. In this case, we have
$$\hat{\eta}_t \ge \frac{1}{L} \quad \text{and} \quad \frac{1}{\tau_i L} \le \eta_t^{(i)} \le \frac{\eta_{\max}}{\tau_i}. \tag{32}$$
Moreover, we have
$$0 \le \beta = \frac{\langle s_t, y_t\rangle_{x_t}}{\langle s_t, s_t\rangle_{x_t}} \le \frac{L\|s_t\|^2}{\|s_t\|^2} = L, \qquad 0 \le \beta_i = \frac{\langle s_t, y_t^{i}\rangle_{x_t}}{\langle s_t, s_t\rangle_{x_t}} \le \frac{L\|s_t\|^2}{\|s_t\|^2} = L.$$
For RFedSVRG-2BB and RFedSVRG-2BBS, $B$ and $B_i$ are computed by (26) and (27), and we obtain $\|B - B_i\|^2 = \|(\beta-\beta_i) I\|^2 = |\beta-\beta_i|^2\cdot\|I\|^2 \le L^2$ and
$$\|B\xi - B_i\xi\|^2 = \|(B-B_i)\xi\|^2 \le \|B-B_i\|^2\|\xi\|^2 \le L^2\|\xi\|^2, \quad \forall\, \xi\in T_{x_\ell^{(i)}}\mathcal{M}. \tag{33}$$
Assumption 2 
(Regularization over the manifold). The manifold $\mathcal{M}$ is complete and there exists a compact set $\mathcal{D}\subseteq\mathcal{M}$ with diameter bounded by $D$ such that all the iterates of Algorithm 1 and the optimal points are contained in $\mathcal{D}$. The sectional curvature of $\mathcal{M}$ is bounded in $[\kappa_{\min}, \kappa_{\max}]$. Then we can define the following key constant, introduced in [4,17], to capture the effect of the manifold curvature:
$$\zeta = \begin{cases} \dfrac{\sqrt{|\kappa_{\min}|}\, D}{\tanh\big(\sqrt{|\kappa_{\min}|}\, D\big)}, & \text{if } \kappa_{\min} < 0, \\[2mm] 1, & \text{if } \kappa_{\min} \ge 0. \end{cases}$$
Lemma 2 
(Corollary 8 in [4]). If the Riemannian manifold $\mathcal{M}$ satisfies Assumption 2, then for any points $x, x_t\in\mathcal{M}$, the update $x_{t+1} = \mathrm{Exp}_{x_t}(-\eta_t g_t)$ satisfies
$$\begin{aligned}
d^2(x_{t+1}, x) &\le d^2(x_t, x) + \zeta\,\|\eta_t g_t\|^2 + 2\big\langle \mathrm{Exp}_{x_t}^{-1}(x),\ \eta_t g_t\big\rangle_{x_t} \\
&= d^2(x_t, x) + \zeta\, d^2(x_t, x_{t+1}) - 2\big\langle \mathrm{Exp}_{x_t}^{-1}(x),\ \mathrm{Exp}_{x_t}^{-1}(x_{t+1})\big\rangle_{x_t}.
\end{aligned}$$
Next, we provide convergence results and convergence rates for RFedSVRG-2BB and RFedSVRG-2BBS with a $\mu$-strongly g-convex objective function. The proof is inspired by [17]. For the calculation of the convergence rate, we refer to [4].
Lemma 3 
( μ -strongly g-convex, RFedSVRG-2BB and RFedSVRG-2BBS with k = 1 ). Consider RFedSVRG-2BB and RFedSVRG-2BBS with Option 1 in each inner loop to obtain x ^ ( i ) (line 10 in Algorithm 1) and take k = 1 . Assumptions 1 and 2 are satisfied and f is μ-strongly g-convex in problem (1). Denote
$$\alpha_t^i = \frac{4\zeta\eta_t^{(i)} L^2 + \Big(1 + 2\eta_t^{(i)}\big(5\zeta\eta_t^{(i)} L^2 - \mu\big)\Big)^{\tau_i}\big(\mu - 9\zeta\eta_t^{(i)} L^2\big)}{\mu - 5\zeta\eta_t^{(i)} L^2}. \tag{35}$$
Then we have $\mathbb{E}\big[d^2(x_{t+1}, x^*)\big] \le \alpha_t^i\, \mathbb{E}\big[d^2(x_t, x^*)\big]$.
Proof. 
Denote $\Delta = \mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)$ and
$$v_i^t = \mathrm{grad}\, f_i(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}\Big[\mathrm{grad}\, f(x_t) - \mathrm{grad}\, f_i(x_t) + (B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big].$$
Taking the expectation with respect to $i$ in the $t$-th outer loop, we have $\mathbb{E}[\Delta] = \mathrm{grad}\, f(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f(x_t)$, and then we can bound the expected squared norm of $v_i^t$ as follows
$$\begin{aligned}
\mathbb{E}\big\|v_i^t\big\|^2 &= \mathbb{E}\Big\|\Delta + P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f(x_t) + P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&= \mathbb{E}\Big\|\Delta - \mathbb{E}[\Delta] + \mathrm{grad}\, f(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\Delta - \mathbb{E}[\Delta]\big\|^2 + 2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2\,\mathbb{E}\Big\|P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\Delta\big\|^2 + 2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&= 2\,\mathbb{E}\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)\big\|^2 + 2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)}) - P_{x^*}^{x_\ell^{(i)}}\mathrm{grad}\, f(x^*)\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&\le 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*)\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&= 4L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*)\big\|^2 \\
&\le 4L^2\,\mathbb{E}\Big(\big\|\mathrm{Exp}_{x_t}^{-1}(x^*)\big\| + \big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*)\big\|\Big)^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*)\big\|^2 \\
&\le 8L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x^*)\big\|^2 + 10L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*)\big\|^2, \tag{36}
\end{aligned}$$
where the first inequality is due to $\|a+b+c\|^2 \le 2\|a\|^2 + 2\|b\|^2 + 2\|c\|^2$, the second inequality is due to $\mathbb{E}\|\xi - \mathbb{E}\xi\|^2 = \mathbb{E}\|\xi\|^2 - \|\mathbb{E}\xi\|^2 \le \mathbb{E}\|\xi\|^2$ and (33), the third inequality is due to Assumption 1, the fourth inequality is due to $d(x,y)\le d(x,z)+d(y,z)$, the fifth inequality is due to $(a+b)^2 \le 2a^2 + 2b^2$, and the third equality is due to $\mathrm{grad}\, f(x^*) = 0$.
Note that $\mathbb{E}[v_i^t] = \mathrm{grad}\, f(x_\ell^{(i)})$ and $x_{\ell+1}^{(i)} = \mathrm{Exp}_{x_\ell^{(i)}}\big(-\eta_t^{(i)} v_i^t\big)$; therefore
$$\begin{aligned}
\mathbb{E}\big[d^2(x_{\ell+1}^{(i)}, x^*)\big] &\le \mathbb{E}\big[d^2(x_\ell^{(i)}, x^*)\big] + \zeta\,\mathbb{E}\big[(\eta_t^{(i)})^2\|v_i^t\|^2\big] + 2\eta_t^{(i)}\,\mathbb{E}\big\langle \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*), v_i^t\big\rangle_{x_\ell^{(i)}} \\
&\le \mathbb{E}\big[d^2(x_\ell^{(i)}, x^*)\big] + \zeta (\eta_t^{(i)})^2 L^2\,\mathbb{E}\big[10\, d^2(x_\ell^{(i)}, x^*) + 8\, d^2(x_t, x^*)\big] + 2\eta_t^{(i)}\,\mathbb{E}\big\langle \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x^*), \mathrm{grad}\, f(x_\ell^{(i)})\big\rangle_{x_\ell^{(i)}} \\
&\le \Big(1 + 10\zeta (\eta_t^{(i)})^2 L^2 - \eta_t^{(i)}\mu\Big)\mathbb{E}\big[d^2(x_\ell^{(i)}, x^*)\big] + 8\zeta (\eta_t^{(i)})^2 L^2\,\mathbb{E}\big[d^2(x_t, x^*)\big] + 2\eta_t^{(i)}\,\mathbb{E}\big[f(x^*) - f(x_\ell^{(i)})\big] \\
&\le \Big(1 + 10\zeta (\eta_t^{(i)})^2 L^2 - 2\eta_t^{(i)}\mu\Big)\mathbb{E}\big[d^2(x_\ell^{(i)}, x^*)\big] + 8\zeta (\eta_t^{(i)})^2 L^2\,\mathbb{E}\big[d^2(x_t, x^*)\big]. \tag{37}
\end{aligned}$$
The first inequality uses Lemma 2 and the second one is due to (36). The third and fourth inequalities use the μ -strongly g-convexity of f ( x ) .
Denote $u_\ell^{(i)} \triangleq \mathbb{E}[d^2(x_\ell^{(i)}, x^*)]$, $q \triangleq 1 + 2\eta_t^{(i)}\big(5\zeta\eta_t^{(i)} L^2 - \mu\big)$ and $p \triangleq \frac{8\zeta(\eta_t^{(i)})^2 L^2}{1-q}$. Since $k=1$, $x_{t+1} = x_{\tau_i}^{(i)}$. Note that $x_t = x_0^{(i)}$, i.e., $u_t = u_0^{(i)}$, and from (37) we have $u_{\ell+1}^{(i)} \le q\, u_\ell^{(i)} + p(1-q)\,u_t$, i.e., $u_{\ell+1}^{(i)} - p\, u_t \le q\big(u_\ell^{(i)} - p\, u_t\big)$. Hence, we have $u_{t+1} - p\, u_t \le q^{\tau_i}(1-p)\,u_t$ and
$$u_{t+1} \le \big(p + q^{\tau_i}(1-p)\big)\, u_t. \tag{38}$$
Denoting
$$\alpha_t^i = p + q^{\tau_i}(1-p) = \frac{4\zeta\eta_t^{(i)} L^2 + \Big(1 + 2\eta_t^{(i)}\big(5\zeta\eta_t^{(i)} L^2 - \mu\big)\Big)^{\tau_i}\big(\mu - 9\zeta\eta_t^{(i)} L^2\big)}{\mu - 5\zeta\eta_t^{(i)} L^2},$$
from inequality (38) we obtain $\mathbb{E}\big[d^2(x_{t+1}, x^*)\big] \le \alpha_t^i\, \mathbb{E}\big[d^2(x_t, x^*)\big]$. □
Theorem 1 
($\mu$-strongly g-convex, RFedSVRG-2BB with $k=1$). Consider RFedSVRG-2BB with Option 1 in each inner loop to obtain $\hat{x}^{(i)}$ (line 10 in Algorithm 1) and take $k=1$. Suppose Assumptions 1 and 2 are satisfied and $f$ in problem (1) is $\mu$-strongly g-convex. Take $\eta^{(i)} < \frac{\mu}{9\zeta L^2}$ ($i\in[n]$); then we have
$$\alpha^i = \frac{4\zeta\eta^{(i)} L^2 + \Big(1 + 2\eta^{(i)}\big(5\zeta\eta^{(i)} L^2 - \mu\big)\Big)^{\tau_i}\big(\mu - 9\zeta\eta^{(i)} L^2\big)}{\mu - 5\zeta\eta^{(i)} L^2} < 1.$$
Denoting
$$\alpha = \max_{i\in[n]} \alpha^i$$
and letting $x^*$ be the optimal point, the Output of Option 1 in RFedSVRG-2BB converges linearly in expectation:
$$\mathbb{E}\big[d^2(\hat{x}, x^*)\big] \le \alpha^T\cdot\mathbb{E}\big[d^2(x_0, x^*)\big].$$
In this case, the convergence rate of RFedSVRG-2BB is $O\!\left(\frac{\alpha L D^2}{T(1-\alpha)}\right)$.
Proof. 
Since $k=1$, without loss of generality, we denote by $i$ the client chosen at the $t$-th outer loop. From Lemma 3, we know that $\mathbb{E}[d^2(x_{t+1}, x^*)] \le \alpha_t^i\,\mathbb{E}[d^2(x_t, x^*)]$, where $\alpha_t^i$ is given in (35). Because $\eta_t^{(i)} = \eta^{(i)}$ in RFedSVRG-2BB, we have $\alpha_t^i = \alpha^i$, where $\alpha^i$ is independent of $t$ and given by
$$\alpha^i = \frac{4\zeta\eta^{(i)} L^2 + \Big(1 + 2\eta^{(i)}\big(5\zeta\eta^{(i)} L^2 - \mu\big)\Big)^{\tau_i}\big(\mu - 9\zeta\eta^{(i)} L^2\big)}{\mu - 5\zeta\eta^{(i)} L^2}.$$
Since $\eta^{(i)} < \frac{\mu}{9\zeta L^2} < \frac{\mu}{5\zeta L^2}$, we can obtain
$$\mu - 9\zeta\eta^{(i)} L^2 > 0,$$
$$\mu - 5\zeta\eta^{(i)} L^2 > 0,$$
$$1 + 2\eta^{(i)}\big(5\zeta\eta^{(i)} L^2 - \mu\big) < 1.$$
Therefore, for $i\in[n]$,
$$\alpha^i < \frac{4\zeta\eta^{(i)} L^2 + \mu - 9\zeta\eta^{(i)} L^2}{\mu - 5\zeta\eta^{(i)} L^2} = 1.$$
Denoting $\alpha = \max_{i\in[n]}\alpha^i$, we obtain $\alpha < 1$. Therefore, for every outer loop $t$, we have $\mathbb{E}[d^2(x_{t+1}, x^*)] < \alpha\cdot\mathbb{E}[d^2(x_t, x^*)]$. It follows directly that, after $T$ outer loops, $\mathbb{E}[d^2(\hat{x}, x^*)] = \mathbb{E}[d^2(x_T, x^*)] \le \alpha^T\cdot\mathbb{E}[d^2(x_0, x^*)]$.
By using the $L$-smoothness of $f$ and Assumption 2, we can obtain
$$\mathbb{E}\big[f(x_t) - f(x^*)\big] \le \frac{L}{2}\,\mathbb{E}\big[d^2(x_t, x^*)\big] \le \frac{\alpha^t L}{2}\,\mathbb{E}\big[d^2(x_0, x^*)\big] \le \frac{\alpha^t L D^2}{2}, \quad t\in\{1,2,\dots,T\}.$$
Summing the above inequality over $t = 1,\dots,T$, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[f(x_t) - f(x^*)\big] \le \frac{L D^2\sum_{t=1}^{T}\alpha^t}{2T} < \frac{\alpha L D^2}{2T(1-\alpha)}.$$
Therefore, the convergence rate of RFedSVRG-2BB is $O\!\left(\frac{\alpha L D^2}{T(1-\alpha)}\right)$. □
Theorem 2 
($\mu$-strongly g-convex, RFedSVRG-2BBS with $k=1$). Consider RFedSVRG-2BBS with Option 1 in each inner loop to obtain $\hat{x}^{(i)}$ (line 10 in Algorithm 1) and take $k=1$. Suppose Assumptions 1 and 2 are satisfied and $f$ in problem (1) is $\mu$-strongly g-convex. For $i\in[n]$, take $\tau_i = \tau \ge \frac{9\zeta\eta_{\max}^2 L^2}{\mu\eta_{\min}}$; then, with
$$\alpha = \frac{4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2 + \Big(1 + \frac{2\eta_{\min}}{\tau}\big(\frac{5L^2\zeta\eta_{\max}}{\tau} - \mu\big)\Big)^{\tau}\Big(\frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big) - 4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2\Big)}{\frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big)} < 1,$$
the Output of Option 1 in RFedSVRG-2BBS has linear convergence in expectation:
$$\mathbb{E}\big[d^2(\hat{x}, x^*)\big] \le \alpha^T\cdot\mathbb{E}\big[d^2(x_0, x^*)\big].$$
In this case, the convergence rate of RFedSVRG-2BBS is $O\!\left(\frac{\alpha L D^2}{T(1-\alpha)}\right)$.
Proof. 
Since $k=1$, without loss of generality, we denote by $i$ the client chosen at the $t$-th iteration. From Lemma 3, we know that $\mathbb{E}[d^2(x_{t+1},x^*)]\le\alpha_t^i\,\mathbb{E}[d^2(x_t,x^*)]$, where $\alpha_t^i$ is given in (35). Because $\tau_i = \tau \ge \frac{9\zeta\eta_{\max}^2 L^2}{\mu\eta_{\min}} > \frac{5\zeta\eta_{\max}\eta_{\min} L^2 + 4\zeta L^2\eta_{\max}^2}{\mu\eta_{\min}} > \frac{5\zeta\eta_{\max} L^2}{\mu}$, we have
$$\frac{\eta_{\min}}{\tau}\Big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\Big) - 4\zeta L^2\Big(\frac{\eta_{\max}}{\tau}\Big)^2 > 0,$$
$$\frac{5\zeta L^2\eta_{\max}}{\tau} - \mu < 0,$$
$$1 + \frac{2\eta_{\min}}{\tau}\Big(\frac{5L^2\zeta\eta_{\max}}{\tau} - \mu\Big) < 1.$$
From the fact that $\eta_{\min}/\tau \le \eta_t^{(i)} \le \eta_{\max}/\tau$ in RFedSVRG-2BBS, we can obtain $\alpha_t^i \le \alpha$, where $\alpha$ is a constant independent of $t$ and $i$, given by
$$\alpha = \frac{4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2 + \Big(1 + \frac{2\eta_{\min}}{\tau}\big(\frac{5L^2\zeta\eta_{\max}}{\tau} - \mu\big)\Big)^{\tau}\Big(\frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big) - 4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2\Big)}{\frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big)}.$$
Hence, we obtain the upper bound of α :
$$\alpha < \frac{4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2 + \frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big) - 4\zeta L^2\big(\frac{\eta_{\max}}{\tau}\big)^2}{\frac{\eta_{\min}}{\tau}\big(\mu - \frac{5\zeta L^2\eta_{\max}}{\tau}\big)} = 1.$$
For each outer loop $t$, it holds that $\mathbb{E}[d^2(x_{t+1},x^*)] < \alpha\cdot\mathbb{E}[d^2(x_t,x^*)]$. It follows directly that, after $T$ outer loops, $\mathbb{E}[d^2(\hat{x},x^*)] = \mathbb{E}[d^2(x_T,x^*)] \le \alpha^T\cdot\mathbb{E}[d^2(x_0,x^*)]$.
By using the $L$-smoothness of $f$ and Assumption 2, we can obtain
$$\mathbb{E}\big[f(x_t) - f(x^*)\big] \le \frac{L}{2}\,\mathbb{E}\big[d^2(x_t, x^*)\big] \le \frac{\alpha^t L}{2}\,\mathbb{E}\big[d^2(x_0, x^*)\big] \le \frac{\alpha^t L D^2}{2}, \quad t\in\{1,2,\dots,T\}.$$
Summing the above inequality over $t = 1,\dots,T$, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[f(x_t) - f(x^*)\big] \le \frac{L D^2\sum_{t=1}^{T}\alpha^t}{2T} < \frac{\alpha L D^2}{2T(1-\alpha)}.$$
Therefore, the convergence rate of RFedSVRG-2BBS is $O\!\left(\frac{\alpha L D^2}{T(1-\alpha)}\right)$. □
Next, we show the convergence results of RFedSVRG-2BB and RFedSVRG-2BBS with $\tau_i = 1$. These conclusions do not require the objective function $f$ to be g-convex.
Lemma 4 
(Non-convex, RFedSVRG-2BB and RFedSVRG-2BBS with τ i = 1 ). Suppose the problem (1) satisfies Assumptions 1 and 2. If we run RFedSVRG-2BB and RFedSVRG-2BBS with Option 1 in each inner loop to obtain x ^ ( i ) and τ i = 1 , we have
$$f(x_{t+1}) \le f(x_t) + \Big(-\eta_t^{(i)} + \frac{(\eta_t^{(i)})^2 L}{2}\Big)\big\|\mathrm{grad}\, f(x_t)\big\|^2. \tag{40}$$
Proof. 
From the local update in RFedSVRG-2BB and RFedSVRG-2BBS, we know that
$$\mathrm{Exp}_{x_\ell^{(i)}}^{-1}\big(x_{\ell+1}^{(i)}\big) = -\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t) - B\xi + B_i\xi\big)\Big],$$
where $\xi = \mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})$. Because $\tau_i = 1$, for $i\in[n]$ we have $x_0^{(i)} = x_t$ and $\ell\in\{0\}$, so
$$\begin{aligned}
\xi &= \mathrm{Exp}_{x_t}^{-1}(x_0^{(i)}) = \mathrm{Exp}_{x_t}^{-1}(x_t) = 0, \qquad B\xi = B_i\xi = 0, \\
\mathrm{Exp}_{x_t}^{-1}\big(x_1^{(i)}\big) &= -\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_t) - P_{x_t}^{x_0^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big)\Big] \\
&= -\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_t) - P_{x_t}^{x_t}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big)\Big] \\
&= -\eta_t^{(i)}\Big[\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f_i(x_t) + \mathrm{grad}\, f(x_t)\Big] = -\eta_t^{(i)}\,\mathrm{grad}\, f(x_t).
\end{aligned}$$
Using the $L$-smoothness of $f$ and the aggregation step (line 12 in Algorithm 1), we have
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \Big\langle \mathrm{Exp}_{x_t}^{-1}(x_{t+1}), \mathrm{grad}\, f(x_t)\Big\rangle_{x_t} + \frac{L}{2}\, d^2(x_{t+1}, x_t) \\
&= f(x_t) + \Big\langle \frac{1}{k}\sum_{i\in S_t}\mathrm{Exp}_{x_t}^{-1}(x_1^{(i)}), \mathrm{grad}\, f(x_t)\Big\rangle_{x_t} + \frac{L}{2}\Big\|\frac{1}{k}\sum_{i\in S_t}\mathrm{Exp}_{x_t}^{-1}(x_1^{(i)})\Big\|^2 \\
&= f(x_t) - \eta_t^{(i)}\big\|\mathrm{grad}\, f(x_t)\big\|^2 + \frac{(\eta_t^{(i)})^2 L}{2}\big\|\mathrm{grad}\, f(x_t)\big\|^2 \\
&= f(x_t) + \Big(-\eta_t^{(i)} + \frac{(\eta_t^{(i)})^2 L}{2}\Big)\big\|\mathrm{grad}\, f(x_t)\big\|^2. \quad\square
\end{aligned}$$
Theorem 3 
(Non-convex, RFedSVRG-2BB with $\tau_i = 1$). Suppose problem (1) satisfies Assumptions 1 and 2. If we run RFedSVRG-2BB with Option 1 in each inner loop to obtain $\hat{x}^{(i)}$, $\tau_i = 1$ and $\eta^{(i)} = 1/L$, we have
$$\min_{0\le t\le T}\big\|\mathrm{grad}\, f(x_t)\big\|^2 \le O\!\left(\frac{L\big[f(x_0) - f(x^*)\big]}{T}\right).$$
Proof. 
From (40) and $\eta_t^{(i)} = \eta^{(i)} = 1/L$ in RFedSVRG-2BB, we have
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\big\|\mathrm{grad}\, f(x_t)\big\|^2.$$
Summing over $t = 0, 1, \dots, T-1$, we obtain
$$\frac{1}{2L}\sum_{t=0}^{T-1}\big\|\mathrm{grad}\, f(x_t)\big\|^2 \le f(x_0) - f(x_T) \le f(x_0) - f(x^*),$$
which means that $\min_{0\le t\le T}\big\|\mathrm{grad}\, f(x_t)\big\|^2 \le O\!\left(\frac{L[f(x_0)-f(x^*)]}{T}\right)$. □
Theorem 4 
(Non-convex, RFedSVRG-2BBS with $\tau_i = 1$). Suppose problem (1) satisfies Assumptions 1 and 2. If we run RFedSVRG-2BBS with Option 1 in each inner loop to obtain $\hat{x}^{(i)}$, $\tau_i = 1$ and $\eta_{\max}\le 2/L$, we have
$$f(x_{t+1}) \le f(x_t).$$
Proof. 
From (32), $\tau_i = 1$ and $\eta_{\max}\le 2/L$ in RFedSVRG-2BBS, for $i\in[n]$ we have $\frac{1}{L}\le\eta_t^{(i)}\le\frac{2}{L}$, which means that in (40) it holds that
$$-\eta_t^{(i)} + \frac{(\eta_t^{(i)})^2 L}{2} \le 0.$$
Therefore, $f(x_{t+1}) - f(x_t) \le 0$. □
We also have convergence results for RFedSVRG-2BB and RFedSVRG-2BBS with $\tau_i > 1$ and $k=1$. The results show that, when the objective function is only $L$-smooth, the algorithms achieve sublinear convergence. Before giving the result we need the following lemma, which is inspired by [3,17].
Lemma 5 
(Non-convex, RFedSVRG-2BB and RFedSVRG-2BBS). Suppose problem (1) satisfies Assumptions 1 and 2. Denote by $i$ the client chosen at the $t$-th outer loop in RFedSVRG-2BB and RFedSVRG-2BBS, let $\beta>0$ be a free constant and
$$R_\ell = \mathbb{E}\Big[f(x_\ell^{(i)}) + c_\ell\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2\Big], \tag{41}$$
where $c_\ell$ and $\delta_\ell$ satisfy:
$$c_\ell = c_{\ell+1}\Big(1 + \beta\eta_t^{(i)} + 4\zeta L^2 (\eta_t^{(i)})^2\Big) + 2L^3(\eta_t^{(i)})^2 \ge 0, \qquad c_{\tau_i} = 0, \tag{42}$$
$$\delta_\ell = \eta_t^{(i)} - \frac{c_{\ell+1}\eta_t^{(i)}}{\beta} - \frac{L(\eta_t^{(i)})^2}{2} - c_{\ell+1}\zeta(\eta_t^{(i)})^2 > 0. \tag{43}$$
Then the squared norm of $\mathrm{grad}\, f$ at the iterate $x_\ell^{(i)}$ has the upper bound:
$$\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 \le \frac{R_\ell - R_{\ell+1}}{\delta_\ell}, \quad \ell = 0, 1, \dots, \tau_i - 1. \tag{44}$$
Proof. 
Denote $v_i^t = \mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\big(\mathrm{grad}\, f_i(x_t) - \mathrm{grad}\, f(x_t)\big) + (B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})$. Because $f$ is $L$-smooth, we have
$$\begin{aligned}
\mathbb{E}\big[f(x_{\ell+1}^{(i)})\big] &\le \mathbb{E}\Big[f(x_\ell^{(i)}) + \big\langle \mathrm{grad}\, f(x_\ell^{(i)}), \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)})\big\rangle_{x_\ell^{(i)}} + \frac{L}{2}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)})\big\|^2\Big] \\
&= \mathbb{E}\big[f(x_\ell^{(i)})\big] + \mathbb{E}\big\langle \mathrm{grad}\, f(x_\ell^{(i)}), -\eta_t^{(i)} v_i^t\big\rangle_{x_\ell^{(i)}} + \frac{L}{2}\,\mathbb{E}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)})\big\|^2 \\
&= \mathbb{E}\big[f(x_\ell^{(i)})\big] - \eta_t^{(i)}\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + \frac{L(\eta_t^{(i)})^2}{2}\,\mathbb{E}\big\|v_i^t\big\|^2, \tag{45}
\end{aligned}$$
where the first inequality is due to $f$ being $L$-smooth, the first equality is due to the linearity of the expectation and $\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)}) = -\eta_t^{(i)} v_i^t$, and the second equality is due to $\mathbb{E}[v_i^t] = \mathrm{grad}\, f(x_\ell^{(i)})$. Then the following inequality holds:
$$\begin{aligned}
\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_{\ell+1}^{(i)})\big\|^2 &\le \mathbb{E}\Big[\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + \zeta\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)})\big\|^2 - 2\big\langle \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_{\ell+1}^{(i)}), \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_t)\big\rangle_{x_\ell^{(i)}}\Big] \\
&= \mathbb{E}\Big[\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + \zeta(\eta_t^{(i)})^2\big\|v_i^t\big\|^2\Big] + 2\eta_t^{(i)}\,\mathbb{E}\big\langle \mathrm{grad}\, f(x_\ell^{(i)}), \mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_t)\big\rangle_{x_\ell^{(i)}} \\
&\le \mathbb{E}\Big[\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + \zeta(\eta_t^{(i)})^2\big\|v_i^t\big\|^2\Big] + 2\eta_t^{(i)}\,\mathbb{E}\Big[\frac{1}{2\beta}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + \frac{\beta}{2}\big\|\mathrm{Exp}_{x_\ell^{(i)}}^{-1}(x_t)\big\|^2\Big], \tag{46}
\end{aligned}$$
where the first inequality follows from Lemma 2 and the second inequality is due to $2\langle a, b\rangle_x \le \frac{1}{\beta}\|a\|^2 + \beta\|b\|^2$, where $\beta>0$ is a free constant. Denote $R_\ell = \mathbb{E}\big[f(x_\ell^{(i)}) + c_\ell\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\|^2\big]$, where $c_\ell$ is a parameter that varies with $\ell$. Substituting (45) and (46) into $R_{\ell+1}$, we obtain the following bound:
$$\begin{aligned}
R_{\ell+1} &\le \mathbb{E}\big[f(x_\ell^{(i)})\big] - \eta_t^{(i)}\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + \frac{L(\eta_t^{(i)})^2}{2}\,\mathbb{E}\big\|v_i^t\big\|^2 \\
&\quad + c_{\ell+1}\,\mathbb{E}\Big[\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + \zeta(\eta_t^{(i)})^2\big\|v_i^t\big\|^2\Big] + 2c_{\ell+1}\eta_t^{(i)}\,\mathbb{E}\Big[\frac{1}{2\beta}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + \frac{\beta}{2}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2\Big] \\
&= \mathbb{E}\big[f(x_\ell^{(i)})\big] - \Big(\eta_t^{(i)} - \frac{c_{\ell+1}\eta_t^{(i)}}{\beta}\Big)\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + \Big(\frac{L(\eta_t^{(i)})^2}{2} + c_{\ell+1}\zeta(\eta_t^{(i)})^2\Big)\mathbb{E}\big\|v_i^t\big\|^2 \\
&\quad + c_{\ell+1}\big(1 + \eta_t^{(i)}\beta\big)\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2. \tag{47}
\end{aligned}$$
Denoting $\Delta = \mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)$, we have
$$\mathbb{E}[\Delta] = \mathrm{grad}\, f(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f(x_t),$$
and
$$\begin{aligned}
\mathbb{E}\big\|v_i^t\big\|^2 &= \mathbb{E}\Big\|\Delta + P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f(x_t) + P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&= \mathbb{E}\Big\|\Delta - \mathbb{E}[\Delta] + \mathrm{grad}\, f(x_\ell^{(i)}) + P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\Delta - \mathbb{E}[\Delta]\big\|^2 + 2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2\,\mathbb{E}\Big\|P_{x_t}^{x_\ell^{(i)}}(B-B_i)\,\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\Big\|^2 \\
&\le 2\,\mathbb{E}\big\|\Delta\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&= 2\,\mathbb{E}\big\|\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\mathrm{grad}\, f_i(x_t)\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&\le 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&= 4L^2\,\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 + 2L^2\,\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2, \tag{48}
\end{aligned}$$
where the first inequality is due to $\|a+b+c\|^2 \le 2\|a\|^2 + 2\|b\|^2 + 2\|c\|^2$, the second inequality is due to $\mathbb{E}\|\xi-\mathbb{E}\xi\|^2 = \mathbb{E}\|\xi\|^2 - \|\mathbb{E}\xi\|^2 \le \mathbb{E}\|\xi\|^2$ and (33), and the third inequality follows from Assumption 1. Substituting (48) into (47), we obtain
$$\begin{aligned}
R_{\ell+1} &\le \mathbb{E}\big[f(x_\ell^{(i)})\big] - \Big(\eta_t^{(i)} - \frac{c_{\ell+1}\eta_t^{(i)}}{\beta} - \frac{L(\eta_t^{(i)})^2}{2} - c_{\ell+1}\zeta(\eta_t^{(i)})^2\Big)\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 \\
&\quad + \Big(c_{\ell+1}\big(1 + \beta\eta_t^{(i)} + 4\zeta L^2(\eta_t^{(i)})^2\big) + 2L^3(\eta_t^{(i)})^2\Big)\mathbb{E}\big\|\mathrm{Exp}_{x_t}^{-1}(x_\ell^{(i)})\big\|^2 \\
&= R_\ell - \Big(\eta_t^{(i)} - \frac{c_{\ell+1}\eta_t^{(i)}}{\beta} - \frac{L(\eta_t^{(i)})^2}{2} - c_{\ell+1}\zeta(\eta_t^{(i)})^2\Big)\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2. \tag{49}
\end{aligned}$$
From inequality (49), we can take $c_\ell = c_{\ell+1}\big(1 + \beta\eta_t^{(i)} + 4\zeta L^2(\eta_t^{(i)})^2\big) + 2L^3(\eta_t^{(i)})^2 \ge 0$ with $c_{\tau_i} = 0$ and $\delta_\ell = \eta_t^{(i)} - \frac{c_{\ell+1}\eta_t^{(i)}}{\beta} - \frac{L(\eta_t^{(i)})^2}{2} - c_{\ell+1}\zeta(\eta_t^{(i)})^2 > 0$, and obtain
$$\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 \le \frac{R_\ell - R_{\ell+1}}{\delta_\ell}. \quad\square$$
Theorem 5 
(Non-convex, RFedSVRG-2BB with $k=1$). Suppose problem (1) satisfies Assumptions 1 and 2. Consider RFedSVRG-2BB with Option 2 in each inner loop to obtain $\hat{x}^{(i)}$. If we set $k=1$, $\tau_i = \tau = m^{3\alpha_1/2}/\zeta^{1-2\alpha_2}$ and $\eta^{(i)} = \frac{1}{6Lm^{\alpha_1}\zeta^{\alpha_2}}$ ($i\in[n]$), where $m\in\mathbb{N}_+$, $\alpha_1\in(0,1)$ and $\alpha_2\in(0,2)$, the Output with Option 2 in RFedSVRG-2BB satisfies:
$$\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 \le O\!\left(\frac{L\, m^{\alpha_1/2}\zeta^{3\alpha_2-1}\big[f(x_0) - f(x^*)\big]}{T}\right).$$
When the function f is h-gradient dominated, the convergence rate of RFedSVRG-2BB is
$$O\!\left(\frac{L h\, m^{\alpha_1/2}\zeta^{3\alpha_2-1} D^2}{T}\right).$$
Proof. 
Since $k=1$, without loss of generality, we denote by $i$ the client chosen at the $t$-th outer loop. We take $\beta = L\zeta^{1-\alpha_2}/m^{\alpha_1/2}$ in (43), where $m\in\mathbb{N}_+$. Because $\eta_t^{(i)} = \eta^{(i)}$ in RFedSVRG-2BB, solving the recursion (42) gives:
$$c_0 = 2L^3(\eta^{(i)})^2\,\frac{(1+\theta)^{\tau_i}-1}{\theta} = \frac{L}{18\, m^{2\alpha_1}\zeta^{2\alpha_2}}\cdot\frac{(1+\theta)^{\tau_i}-1}{\theta},$$
where $\theta = \eta^{(i)}\beta + 4\zeta L^2(\eta^{(i)})^2 = \frac{\zeta^{1-2\alpha_2}}{6m^{3\alpha_1/2}} + \frac{\zeta^{1-2\alpha_2}}{9m^{2\alpha_1}} \in \Big[\frac{\zeta^{1-2\alpha_2}}{6m^{3\alpha_1/2}},\ \frac{5\zeta^{1-2\alpha_2}}{18m^{3\alpha_1/2}}\Big]$ is a parameter and $\{c_\ell\}_{\ell=0}^{\tau-1}$ is a decreasing sequence.
Note that $\theta < 1/\tau$, which means that $(1+\theta)^\tau < e$; then the upper bound of $c_0$ satisfies $c_0 \le \frac{L(e-1)}{3m^{\alpha_1/2}\zeta}$. Therefore, the lower bound of $\delta_\ell$ in (43) is:
$$\delta_{\min} = \min_\ell \delta_\ell \ge \eta^{(i)} - \frac{c_0\eta^{(i)}}{\beta} - \frac{(\eta^{(i)})^2 L}{2} - c_0\zeta(\eta^{(i)})^2 \ge \eta^{(i)}\Big(1 - \frac{e-1}{3} - \frac{1}{6} - \frac{e-1}{18}\Big) > \frac{33}{200}\,\eta^{(i)} = \frac{11}{600\, L\, m^{\alpha_1}\zeta^{\alpha_2}}.$$
Note that this lower bound $\delta_{\min}$ is independent of the choice of the client $i$. From (41), we have $R_0 = f(x_t)$ and
$$R_{\tau_i} = \mathbb{E}\Big[f(x_{\tau_i}^{(i)}) + c_{\tau_i}\big\|\mathrm{Exp}_{x_t}^{-1}(x_{\tau_i}^{(i)})\big\|^2\Big] \ge \mathbb{E}\big[f(x_{\tau_i}^{(i)})\big].$$
Because $x_0^{(i)} = x_t$ and we choose Option 2 in each inner loop to obtain $\hat{x}^{(i)}$, summing (44) over $\ell = 0, 1, \dots, \tau_i-1$, we have
$$\mathbb{E}\big[\|\mathrm{grad}\, f(x_{t+1})\|^2\big] = \frac{1}{\tau_i}\sum_{\ell=0}^{\tau_i-1}\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 \le \frac{R_0 - R_{\tau_i}}{\tau_i\,\delta_{\min}} \le \frac{\mathbb{E}\big[f(x_t) - f(x_{t+1})\big]}{\tau\,\delta_{\min}}.$$
Summing the above inequality over $t = 0, 1, \dots, T-1$, we obtain
$$\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 = \frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{\tau_i}\sum_{\ell=0}^{\tau_i-1}\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 < \frac{600\, L\, m^{\alpha_1/2}\zeta^{3\alpha_2-1}\big[f(x_0)-f(x^*)\big]}{11\, T}. \tag{51}$$
Therefore, we have $\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 \le O\!\left(\frac{L m^{\alpha_1/2}\zeta^{3\alpha_2-1}[f(x_0)-f(x^*)]}{T}\right)$.
When the function $f$ is $h$-gradient dominated, we have $f(x_t) - f(x^*) \le h\|\mathrm{grad}\, f(x_t)\|^2$. From (51), the $L$-smoothness of $f$ and Assumption 2, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[f(x_t)-f(x^*)\big] \le \frac{h}{T}\sum_{t=1}^{T}\mathbb{E}\big\|\mathrm{grad}\, f(x_t)\big\|^2 < \frac{600\, h L\, m^{\alpha_1/2}\zeta^{3\alpha_2-1}\big[f(x_0)-f(x^*)\big]}{11\, T} \le \frac{300\, L h\, m^{\alpha_1/2}\zeta^{3\alpha_2-1}\, d^2(x_0,x^*)}{11\, T} \le \frac{300\, L h\, m^{\alpha_1/2}\zeta^{3\alpha_2-1}\, D^2}{11\, T}.$$
Therefore, the convergence rate of RFedSVRG-2BB is $O\!\left(\frac{L h\, m^{\alpha_1/2}\zeta^{3\alpha_2-1} D^2}{T}\right)$. □
Theorem 6 
(Non-convex, RFedSVRG-2BBS with $k=1$). Suppose problem (1) satisfies Assumptions 1 and 2. Consider RFedSVRG-2BBS with Option 2 in each inner loop to obtain $\hat{x}^{(i)}$. If we set $k=1$, $\tau_i \ge \tau = 24\zeta^2$ ($i\in[n]$) and $\eta_{\max} \le 3/(2L)$, the Output with Option 2 in RFedSVRG-2BBS satisfies:
$$\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 \le O\!\left(\frac{L\big[f(x_0)-f(x^*)\big]}{T}\right).$$
When the function $f$ is $h$-gradient dominated, the convergence rate of RFedSVRG-2BBS is $O\!\left(\frac{LhD^2}{T}\right)$.
Proof. 
Since $k=1$, without loss of generality, we denote by $i$ the client chosen at the $t$-th outer loop. In RFedSVRG-2BBS, $\eta_t^{(i)} = \hat{\eta}_t/\tau_i$; solving the recursion (42), we obtain $c_0 = L^3\big(\frac{\hat{\eta}_t}{\tau_i}\big)^2\frac{(1+\theta)^\tau - 1}{\theta}$, where $\theta = \beta\frac{\hat{\eta}_t}{\tau} + 2\zeta L^2\big(\frac{\hat{\eta}_t}{\tau}\big)^2$ is a parameter and $\{c_\ell\}_{\ell=0}^{\tau_i-1}$ is a decreasing sequence.
We fix $\beta = L/\zeta$ in (42); then we have $2\zeta L^2\frac{\hat{\eta}_t}{\tau} \le 2\zeta L^2\cdot\frac{3}{2\times 24\zeta^2 L} = \frac{L}{8\zeta}$ and
$$\theta = \beta\frac{\hat{\eta}_t}{\tau} + 2\zeta L^2\Big(\frac{\hat{\eta}_t}{\tau}\Big)^2 \le \frac{L}{\zeta}\cdot\frac{3}{2\tau L} + \frac{L}{8\zeta}\cdot\frac{3}{2\tau L} = \frac{27}{16\zeta\tau}.$$
Because $\zeta\ge 1$, $(1+\theta)^\tau \le \big(1 + \frac{27}{16\zeta\tau}\big)^\tau < e^{27/(16\zeta)} \le e^{27/16}$. From $\hat{\eta}_t/\tau_i \le \hat{\eta}_t/\tau \le \eta_{\max}/\tau$, we have
$$c_0 < \frac{L^3(\hat{\eta}_t/\tau_i)^2\, e^{27/16}}{\beta\,\hat{\eta}_t/\tau_i} = L^2 e^{27/16}\zeta\cdot\frac{\hat{\eta}_t}{\tau_i} \le L^2 e^{27/16}\zeta\cdot\frac{\eta_{\max}}{\tau} \le L^2 e^{27/16}\zeta\cdot\frac{3}{2\tau L} \le L^2 e^{27/16}\zeta\cdot\frac{3}{2\times 24\zeta^2 L} = \frac{L\, e^{27/16}}{16\zeta}.$$
Then we can obtain a lower bound on $\delta_{\min} := \min_\ell\{\delta_\ell\}_{\ell=0}^{\tau_i-1}$, where $\{\delta_\ell\}_{\ell=0}^{\tau_i-1}$ is calculated from (43):
$$\delta_{\min} \ge \frac{\hat{\eta}_t}{\tau} - \frac{c_0\,\hat{\eta}_t/\tau}{\beta} - \frac{(\hat{\eta}_t/\tau)^2 L}{2} - c_0\zeta\Big(\frac{\hat{\eta}_t}{\tau}\Big)^2 \ge \frac{\hat{\eta}_t}{\tau}\Big(1 - \frac{e^{27/16}}{16} - \frac{1}{16} - \frac{e^{27/16}}{128}\Big) > \frac{1}{2}\cdot\frac{\hat{\eta}_t}{\tau} \ge \frac{1}{2\tau L} > 0,$$
where the last inequality is due to (32). Note that the lower bound $\delta_{\min} > 0$ is independent of the choice of the client $i$.
From (41), we have $R_0 = f(x_t)$ and $R_{\tau_i} = \mathbb{E}\big[f(x_{\tau_i}^{(i)}) + c_{\tau_i}\|\mathrm{Exp}_{x_t}^{-1}(x_{\tau_i}^{(i)})\|^2\big] \ge \mathbb{E}\big[f(x_{\tau_i}^{(i)})\big] = \mathbb{E}\big[f(x_{t+1})\big]$. Because we choose Option 2 in each inner loop to obtain $\hat{x}^{(i)}$, summing (44) over $\ell = 0, 1, \dots, \tau_i-1$, we obtain
$$\mathbb{E}\big[\|\mathrm{grad}\, f(x_{t+1})\|^2\big] = \frac{1}{\tau_i}\sum_{\ell=0}^{\tau_i-1}\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 \le \frac{R_0 - R_{\tau_i}}{\tau_i\,\delta_{\min}} \le \frac{\mathbb{E}\big[f(x_t) - f(x_{t+1})\big]}{\tau\,\delta_{\min}}. \tag{52}$$
Summing (52) over $t = 0, 1, \dots, T-1$, we have
$$\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 = \frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{\tau_i}\sum_{\ell=0}^{\tau_i-1}\mathbb{E}\big\|\mathrm{grad}\, f(x_\ell^{(i)})\big\|^2 < \frac{2L\big(f(x_0)-f^*\big)}{T}. \tag{53}$$
Therefore, we have $\mathbb{E}\big\|\mathrm{grad}\, f(\hat{x})\big\|^2 \le O\!\left(\frac{L[f(x_0)-f(x^*)]}{T}\right)$.
When the function $f$ is $h$-gradient dominated, we have $f(x_t)-f(x^*)\le h\|\mathrm{grad}\, f(x_t)\|^2$. From (53), the $L$-smoothness of $f$ and Assumption 2, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[f(x_t)-f(x^*)\big] \le \frac{h}{T}\sum_{t=1}^{T}\mathbb{E}\big\|\mathrm{grad}\, f(x_t)\big\|^2 < \frac{2Lh\big(f(x_0)-f^*\big)}{T} \le \frac{Lh\, d^2(x_0,x^*)}{T} \le \frac{Lh D^2}{T}.$$
Therefore, the convergence rate of RFedSVRG-2BBS is $O\!\left(\frac{LhD^2}{T}\right)$. □
Based on the theorems in this section, we briefly summarize the convergence rates of our algorithms when function f satisfies different properties in Table 1. For more details, please refer to the corresponding theorem and its proof.

5. Numerical Experiments

In this section, we demonstrate the performance of RFedSVRG-2BB and RFedSVRG-2BBS in solving problem (1) and compare them with RFedSVRG [3], RFedAvg [3] and RFedProx [3]. We use the Pymanopt package [46]. Since the inverse of the exponential mapping (logarithm mapping) on the Stiefel manifold is not easy to compute, we use the projection-like retraction [47] and its inverse [48] to approximate the exponential mapping and the logarithm mapping, respectively.
We test the five algorithms on the PSD Karcher mean problem (2), PCA (3) and kPCA (4) on synthetic or real datasets. For all problems, we measure the norm of the global Riemannian gradient. Because the optimal solution of (4) only represents the eigen-space corresponding to the $r$ largest eigenvalues, we also measure the principal angles [3,49] between subspaces for kPCA.

5.1. Experiments on Synthetic Data

In this subsection, we demonstrate the results of the five algorithms for solving PCA (3) and the PSD Karcher mean (2) on synthetic data. We first generate data matrices $C_i\in\mathbb{R}^{p\times d}$ whose row vectors are drawn from the standard normal distribution. The $A_i$ in (3) and (2) are set to
$$A_i := C_i^\top C_i \quad \text{with} \quad C_i\in\mathbb{R}^{p\times d} \ \text{and}\ A_i\in\mathbb{R}^{d\times d}. \tag{54}$$
In these experiments, the data in different agents are homogeneous in distribution, which provides a milder environment for comparing the behavior of the algorithms. All algorithms start from the same random initial point. We terminate the algorithms if they meet one of the following conditions:
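A short sketch of how such synthetic covariance matrices can be generated per (54) (dimensions chosen only for illustration):

```python
import numpy as np

def make_synthetic_covariances(n, p, d, seed=0):
    """Generate A_i = C_i^T C_i as in (54), one matrix per agent."""
    rng = np.random.default_rng(seed)
    return [(lambda C: C.T @ C)(rng.standard_normal((p, d))) for _ in range(n)]

A_list = make_synthetic_covariances(n=10, p=20, d=20)
print(len(A_list), A_list[0].shape)   # 10 symmetric PSD matrices of size d x d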
(a)
$\|\mathrm{grad}\, f(x_t)\| \le 10^{-13}$;
(b)
the communication exceeds the specified number of rounds.
The y-axis of the figures in this subsection denotes $\|\mathrm{grad}\, f(x_t)\|$ and the x-axis denotes the outer loop index of the algorithms, i.e., the round of communication in Federated Learning.

5.1.1. Experiments on PSD Karcher Mean

In this subsection, we test RFedAvg, RFedProx, RFedSVRG, RFedSVRG-2BB and RFedSVRG-2BBS on the PSD Karcher mean problem (2). In these experiments, we set $d = p = 20$ in (54), $\tau_i = \tau$ and $n = 10$. For RFedAvg, RFedProx, RFedSVRG and RFedSVRG-2BB, we set $\eta^{(i)} = 2\times 10^{-1}$. We set $(\eta_{\max}, \eta_{\min}) = (8\times 10^{-1}, 8\times 10^{-3})$ for RFedSVRG-2BBS. The results are given in Figure 1. The convergence curves of RFedSVRG-2BB and RFedSVRG-2BBS in Figure 1a–c show a linear rate of convergence, largely due to the fact that (2) is strongly geodesically convex [4].
From Figure 1a, we can see that RFedAvg and RFedProx are unable to reduce the norm of the Riemannian gradient to an acceptable level. RFedSVRG-2BB and RFedSVRG-2BBS can decrease the norm of the Riemannian gradient faster than RFedAvg, RFedProx and RFedSVRG, and RFedSVRG-2BBS is the fastest of all algorithms.
From Figure 1b, we can see that RFedSVRG-2BB and RFedSVRG-2BBS with different $\tau$ converge linearly. It is noted that the result with $\tau_i = 7$ is not as good as with $\tau_i = 5$. The central idea in SVRG and its variants is that the stochastic gradients are used to estimate the change in the gradient between the points $x_\ell^{(i)}$ and $x_t$ [2,29], and it is clear that, if $x_\ell^{(i)}$ and $x_t$ are close to each other, the variance of the estimate
$$\mathrm{grad}\, f_i(x_\ell^{(i)}) - P_{x_t}^{x_\ell^{(i)}}\,\mathrm{grad}\, f_i(x_t) + P_{x_t}^{x_\ell^{(i)}}\big(B\xi - B_i\xi\big)$$
in line 8 of Algorithm 1 should be small, resulting in an estimate of $\mathrm{grad}\, f(x_\ell^{(i)})$ with small variance. As the inner iterate $x_\ell^{(i)}$ proceeds, the variance grows, and the algorithm needs to start a new outer loop to compute a new full gradient $\mathrm{grad}\, f(x_{t+1})$ and reset the variance. Therefore, it is important not to set $\tau_i$ too large, in order to avoid a large variance.
From Figure 1c, we can see that RFedSVRG-2BB and RFedSVRG-2BBS with $k\ne 1$ also decrease the norm of the Riemannian gradient and converge. RFedSVRG-2BB with $k>1$ is able to reach a more acceptable level of the norm of the Riemannian gradient than RFedSVRG-2BB with $k=1$.

5.1.2. Experiments on PCA

We now test the five algorithms on the standard PCA problem (3). In this test, we run the algorithms with different numbers of agents, $n$, and pick $k = n/10$ as the number of clients sampled in each round. We partition 10,000 data points in $\mathbb{R}^{600}$ evenly into $n$ agents, i.e., $p = 10{,}000/n$ and $d = 600$ in (54), so that each agent contains an equal number of data points. The amount of data in these experiments is relatively large. For all algorithms, we set $\tau_i = 2$. We use the constant step size $\eta^{(i)} = 2\times 10^{-2}$ for RFedAvg, RFedProx, RFedSVRG and RFedSVRG-2BB, and set $(\eta_{\max}, \eta_{\min}) = (3\times 10^{-1}, 3\times 10^{-4})$ for RFedSVRG-2BBS. We terminate after 500 communication rounds. The results are given in Figure 2.
From all the figures in Figure 2, we can see that RFedAvg and RFedProx are unable to reduce the norm of the Riemannian gradient to an acceptable level for different numbers of n. RFedSVRG-2BB and RFedSVRG-2BBS decrease the norm of the Riemannian gradient to a better level than RFedAvg, RFedProx and RFedSVRG with the same number of communications. Moreover, RFedSVRG-2BB and RFedSVRG-2BBS converge faster than RFedSVRG, and RFedSVRG-2BBS is the fastest.

5.2. Experiments on Real Data

In this subsection, we focus on the kPCA problem (4) with four real datasets: the Iris dataset [50], the wine dataset [50], the breast cancer dataset [51] and the MNIST handwritten digit dataset [52]. These are highly heterogeneous real data, which we use to generate the $A_i$ in problem (4).
We first normalize the four datasets. We denote one of the normalized datasets as $D_0 = [\beta_1; \dots; \beta_m] \in\mathbb{R}^{m\times d}$, whose rows are the data samples $\beta_i\in\mathbb{R}^{1\times d}$, and $m$ is the number of samples in $D_0$. Then we randomly divide the data in $D_0$ into $n$ parts and denote the divided data $D_1,\dots,D_n$ as the datasets on the clients. The covariance matrix $A_i$ in (4) is computed as
$$A_i = D_i^\top D_i \in\mathbb{R}^{d\times d}.$$
Utilizing the structure of (4), we can obtain the optimal point $x^*$ of (4) by directly computing the first $r$ eigenvectors of $A = \frac{1}{n}\sum_{i=1}^{n} A_i$; i.e., if the first $r$ eigenvectors of $A$ are $\alpha_1,\dots,\alpha_r$, the optimal point of (4) is $x^* = [\alpha_1,\dots,\alpha_r]\in\mathbb{R}^{d\times r}$ [5].
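A sketch of computing $x^*$ from the averaged covariance matrix and of measuring principal angles between subspaces via the SVD (one standard way to compute them; not necessarily the exact routine used in [3,49]):

```python
import numpy as np

def optimal_subspace(A_list, r):
    """x* = the r leading eigenvectors of A = (1/n) sum_i A_i."""
    A = np.mean(A_list, axis=0)
    eigvals, eigvecs = np.linalg.eigh(A)          # eigh returns ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:r]]

def principal_angles(X, Y):
    """Principal angles between the column spaces of X and Y (orthonormal columns)."""
    sigma = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

rng = np.random.default_rng(7)
A_list = [(lambda C: C.T @ C)(rng.standard_normal((50, 8))) for _ in range(4)]
x_star = optimal_subspace(A_list, r=3)
# a perturbed orthonormal basis as a stand-in for an iterate x_t
Q, _ = np.linalg.qr(x_star + 0.01 * rng.standard_normal(x_star.shape))
print(principal_angles(x_star, Q))
```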
Then we can compute the principal angles between the iterate point, x t , and x * directly. Moreover, the Output, x ^ , of our algorithms is the projection matrix that could be used to extract the principal components of dataset D 0 by
$$D = D_0\cdot\hat{x} \in\mathbb{R}^{m\times r}, \tag{55}$$
which effectively reduces the dimension of the data in $D_0$ from $\mathbb{R}^d$ to $\mathbb{R}^r$; $D$ is the reduced-dimensional dataset [5]. We randomly choose the initial point for each experiment and terminate the algorithms if they meet one of the following conditions:
  • $\|\mathrm{grad}\, f(x_t)\| \le 10^{-13}$ and the principal angle between $x_t$ and $x^*$ is less than $10^{-13}$;
  • the communication exceeds the specified number of rounds.
We present our results in Figure 3, Figure 4, Figure 5 and Figure 6. The x-axis of the figures denotes the round of communication in Federated Learning.
In Figure 3a, Figure 4a, Figure 5a and Figure 6a, we plot the principal angle between x * and x t as the round of communication increases, i.e., the y-axis of the figures denotes the value of the principal angle between x * and x t . We can observe that RFedAvg and RFedProx are unable to reduce the principal angle between x * and x t to an acceptable level. RFedSVRG-2BB and RFedSVRG-2BBS are able to effectively decrease the principal angle. Comparing the five algorithms, RFedSVRG-2BB and RFedSVRG-2BBS have a faster convergence rate and RFedSVRG-2BBS is the fastest.
In Figure 3b, Figure 4b, Figure 5b and Figure 6b, we plot the norm of gradient of grad f ( x t ) as the round of communication increases, i.e., the y-axis of the figures denotes grad f ( x t ) . We can observe that RFedAvg and RFedProx are unable to reduce the norm of the Riemannian gradient to an acceptable level. RFedSVRG-2BB and RFedSVRG-2BBS are able to effectively decrease the norm of gradient of grad f ( x t ) . Compared with RFedAvg, RFedProx and RFedSVRG, RFedSVRG-2BB and RFedSVRG-2BBS have a faster convergence rate and RFedSVRG-2BBS is the fastest.
In Figure 3c,d, Figure 4c,d and Figure 5c,d, since we take r = 3, we can draw scatter plots of D in (55) in 3D space to obtain an intuitive visualization, where x̂ in (55) is the output of RFedSVRG-2BB and RFedSVRG-2BBS, respectively. The points in these plots are colored according to the labels of the original datasets. The plots show that the algorithms indeed capture the principal directions of the datasets and effectively reduce their dimensionality.
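The 3D scatter plots can be reproduced with a few lines of matplotlib; the snippet below is only a plotting sketch (the function name and styling are our own choices) and assumes the reduced dataset D from (55) and the integer class labels of the original dataset are already available.

```python
import matplotlib.pyplot as plt

def scatter_3d(D, labels, title):
    """Scatter plot of the reduced dataset D (m x 3), colored by class label."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(D[:, 0], D[:, 1], D[:, 2], c=labels, cmap="tab10", s=8)
    ax.set_title(title)
    plt.show()
```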

6. Conclusions

In this paper, we applied the BB technique to approximate second-order information in RFedSVRG and obtained a new Riemannian FL algorithm, RFedSVRG-2BB. We then incorporated the BB step size into RFedSVRG-2BB, yielding a further variant, RFedSVRG-2BBS. We analyzed the convergence rates of RFedSVRG-2BB and RFedSVRG-2BBS for strongly geodesically convex functions and for L-smooth non-convex functions. In addition, we conducted numerical experiments on synthetic and real datasets. The results show that our algorithms outperform RFedSVRG as well as the widely used RFedAvg and RFedProx. Therefore, RFedSVRG-2BB and RFedSVRG-2BBS can be regarded as competitive alternatives to the classic methods for solving Federated Learning problems.

Author Contributions

Conceptualization, H.X. and T.Y.; methodology, H.X.; software, H.X.; validation, H.X., T.Y. and K.W.; formal analysis, H.X.; investigation, H.X.; resources, T.Y.; writing—original draft preparation, H.X.; writing—review and editing, H.X.; visualization, H.X.; supervision, T.Y.; project administration, T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 11671205).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: New York, NY, USA, 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Konečný, J.; McMahan, H.; Ramage, D.; Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
  3. Li, J.; Ma, S. Federated Learning on Riemannian Manifolds. Appl. Set-Valued Anal. Optim. 2023, 5, 213–232. [Google Scholar]
  4. Zhang, H.; Sra, S. First-order Methods for Geodesically Convex Optimization. In Proceedings of the 29th Annual Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; PMLR: New York, NY, USA, 2016; Volume 49, pp. 1617–1638. [Google Scholar]
  5. Härdle, W.; Simar, L. Applied Multivariate Statistical Analysis; Springer: Cham, Switzerland, 2019. [Google Scholar]
  6. Cheung, Y.; Lou, J.; Yu, F. Vertical Federated Principal Component Analysis on Feature-Wise Distributed Data. In Web Information Systems Engineering, Proceedings of the 22nd International Conference on Web Information Systems Engineering, WISE 2021, Melbourne, VIC, Australia, 26–29 October 2021; WISE: Cham, Switzerland, 2021; pp. 173–188. [Google Scholar]
  7. Boumal, N.; Absil, P.A. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Its Appl. 2015, 475, 200–239. [Google Scholar] [CrossRef]
  8. Pennec, X.; Fillard, P.; Ayache, N. A Riemannian Framework for Tensor Computing. Int. J. Comput. Vis. 2005, 66, 41–66. [Google Scholar] [CrossRef]
  9. Fletcher, P.; Joshi, S. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Process. 2007, 87, 250–262. [Google Scholar] [CrossRef]
  10. Rentmeesters, Q.; Absil, P. Algorithm comparison for Karcher mean computation of rotation matrices and diffusion tensors. In Proceedings of the 19th European Signal Processing Conference, Barcelona, Spain, 29 August–2 September 2011; pp. 2229–2233. [Google Scholar]
  11. Cowin, S.; Yang, G. Averaging anisotropic elastic constant data. J. Elast. 1997, 46, 151–180. [Google Scholar]
  12. Massart, E.; Chevallier, S. Inductive Means and Sequences Applied to Online Classification of EEG. In Geometric Science of Information, Proceedings of the Third International Conference, GSI 2017, Paris, France, 7–9 November 2017; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 763–770. [Google Scholar]
  13. Magai, G. Deep Neural Networks Architectures from the Perspective of Manifold Learning. In Proceedings of the 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Haikou, China, 18–20 August 2023; pp. 1021–1031. [Google Scholar]
  14. Yerxa, T.; Kuang, Y.; Simoncelli, E.; Chung, S. Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New Orleans, LA, USA, 2023; Volume 36, pp. 24103–24128. [Google Scholar]
  15. Chen, S.; Ma, S.; Man-Cho So, A.; Zhang, T. Proximal Gradient Method for Nonsmooth Optimization over the Stiefel Manifold. SIAM J. Optim. 2020, 30, 210–239. [Google Scholar] [CrossRef]
  16. Boumal, N. An Introduction to Optimization on Smooth Manifolds; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  17. Zhang, H.; Reddi, S.; Sra, S. Riemannian svrg: Fast stochastic optimization on riemannian manifolds. Adv. Neural Inf. Process. Syst. 2016, 29, 4599–4607. [Google Scholar]
  18. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  19. Pathak, R.; Wainwright, M. FedSplit: An algorithmic framework for fast federated optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7057–7066. [Google Scholar]
  20. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  21. Mitra, A.; Jaafar, R.; Pappas, G.; Hassani, H. Linear Convergence in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients. Adv. Neural Inf. Process. Syst. 2021, 34, 14606–14619. [Google Scholar]
  22. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.J.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, Virtual Event, 13–18 July 2020. [Google Scholar]
  23. Yuan, H.; Zaheer, M.; Reddi, S. Federated Composite Optimization. In Proceedings of the Machine Learning Research, 38th International Conference on Machine Learning, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: New York, NY, USA, 2021; Volume 139, pp. 12253–12266. [Google Scholar]
  24. Bao, Y.; Crawshaw, M.; Luo, S.; Liu, M. Fast Composite Optimization and Statistical Recovery in Federated Learning. In Proceedings of the Machine Learning Research, 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: New York, NY, USA, 2022; Volume 162, pp. 1508–1536. [Google Scholar]
  25. Tran Dinh, Q.; Pham, N.H.; Phan, D.; Nguyen, L. FedDR–Randomized Douglas-Rachford Splitting Algorithms for Nonconvex Federated Composite Optimization. Neural Inf. Process. Syst. 2021, 34, 30326–30338. [Google Scholar]
  26. Zhang, J.; Hu, J.; Johansson, M. Composite Federated Learning with Heterogeneous Data. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 8946–8950. [Google Scholar]
  27. Zhang, J.; Hu, J.; So, A.M.; Johansson, M. Nonconvex Federated Learning on Compact Smooth Submanifolds With Heterogeneous Data. arXiv 2024, arXiv:2406.08465. [Google Scholar] [CrossRef]
  28. Charles, Z.; Konečný, J. Convergence and Accuracy Trade-Offs in Federated Learning and Meta-Learning. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; Banerjee, A., Fukumizu, K., Eds.; PMLR: New York, NY, USA, 2021; Volume 130, pp. 2575–2583. [Google Scholar]
  29. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, Lake Tahoe, NV, USA, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 1, pp. 315–323. [Google Scholar]
  30. Huang, Z.; Huang, W.; Jawanpuria, P.; Mishra, B. Federated Learning on Riemannian Manifolds with Differential Privacy. arXiv 2024, arXiv:2404.10029. [Google Scholar]
  31. Nguyen, T.A.; Le, L.T.; Nguyen, T.D.; Bao, W.; Seneviratne, S.; Hong, C.S.; Tran, N.H. Federated PCA on Grassmann Manifold for IoT Anomaly Detection. IEEE/ACM Trans. Netw. 2024, 1–16. [Google Scholar] [CrossRef]
  32. Gower, R.; Le Roux, N.; Bach, F. Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In Proceedings of the Machine Learning Research, Twenty-First International Conference on Artificial Intelligence and Statistics, Playa Blanca, Lanzarote, Spain, 9–11 April 2018; Storkey, A., Perez-Cruz, F., Eds.; PMLR: New York, NY, USA, 2018; Volume 84, pp. 707–715. [Google Scholar]
  33. Tankaria, H.; Yamashita, N. A stochastic variance reduced gradient using Barzilai-Borwein techniques as second order information. J. Ind. Manag. Optim. 2024, 20, 525–547. [Google Scholar] [CrossRef]
  34. Le Roux, N.; Schmidt, M.; Bach, F. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS’12, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; Volume 2, pp. 2663–2671. [Google Scholar]
  35. Tan, C.; Ma, S.; Dai, Y.; Qian, Y. Barzilai-Borwein step size for stochastic gradient descent. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 685–693. [Google Scholar]
  36. Barzilai, J.; Borwein, J. Two-Point Step Size Gradient Methods. IMA J. Numer. Anal. 1988, 8, 141–148. [Google Scholar] [CrossRef]
  37. Francisco, J.; Bazán, F. Nonmonotone algorithm for minimization on closed sets with applications to minimization on Stiefel manifolds. J. Comput. Appl. Math. 2012, 236, 2717–2727. [Google Scholar] [CrossRef]
  38. Jiang, B.; Dai, Y. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 2013, 153, 535–575. [Google Scholar] [CrossRef]
  39. Wen, Z.; Yin, W. A feasible method for optimization with orthogonality constraints. Math. Program. 2013, 142, 397–434. [Google Scholar] [CrossRef]
  40. Iannazzo, B.; Porcelli, M. The Riemannian Barzilai–Borwein method with nonmonotone line search and the matrix geometric mean computation. IMA J. Numer. Anal. 2017, 38, 495–517. [Google Scholar] [CrossRef]
  41. Absil, P.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2007. [Google Scholar]
  42. Lee, J.M. Introduction to Riemannian Manifolds, 2nd ed.; Springer International Publishing: Cham, Switzerland, 2018; pp. 225–262. [Google Scholar]
  43. Petersen, P. Riemannian Geometry; Springer: Berlin/Heidelberg, Germany, 2006; Volume 171. [Google Scholar]
  44. Tu, L. An Introduction to Manifolds; Springer: New York, NY, USA, 2011. [Google Scholar]
  45. Nocedal, J.; Wright, S. Numerical Optimization; Springer: New York, NY, USA, 1999. [Google Scholar]
  46. Townsend, J.; Koep, N.; Weichwald, S. Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation. J. Mach. Learn. Res. 2016, 17, 4755–4759. [Google Scholar]
  47. Absil, P.; Malick, J. Projection-like Retractions on Matrix Manifolds. SIAM J. Optim. 2012, 22, 135–158. [Google Scholar] [CrossRef]
  48. Kaneko, T.; Fiori, S.; Tanaka, T. Empirical Arithmetic Averaging Over the Compact Stiefel Manifold. IEEE Trans. Signal Process 2013, 61, 883–894. [Google Scholar] [CrossRef]
  49. Zhu, P.; Knyazev, A. Angles between subspaces and their tangents. J. Numer. Math. 2013, 21, 325–340. [Google Scholar] [CrossRef]
  50. Forina, M.; Leardi, R.; Armanino, C.; Lanteri, S. PARVUS: An Extendable Package of Programs for Data Exploration. J. Chemom. 1990, 4, 191–193. [Google Scholar]
  51. Street, W.N.; Wolberg, W.H.; Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. In Proceedings of the Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1–4 February 1993; Acharya, R.S., Goldgof, D.B., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 1993; Volume 1905, pp. 861–870. [Google Scholar]
  52. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
Figure 1. Results for the PSD Karcher mean problem (2). (a) Comparison of the five algorithms with τ = 2 in reducing ∥grad f(x_t)∥. (b) Comparison of different τ in RFedSVRG-2BB and RFedSVRG-2BBS with k = 5. (c) Comparison of different k in RFedSVRG-2BB and RFedSVRG-2BBS with τ = 2.
Figure 2. Results for PCA (3).
Figure 3. Results for kPCA (4) with the Iris dataset. The data are in R^4 (d = 4) and we take r = 3, n = 10, k = 5 and τ_i = 5. We set (η_max, η_min) = (2.5 × 10^{-1}, 2.5 × 10^{-3}) for RFedSVRG-2BBS and the constant step size η^(i) = 0.1 for RFedAvg, RFedProx, RFedSVRG and RFedSVRG-2BB. (a) Comparison of the algorithms in reducing the principal angle between x* and x_t. (b) Comparison of the algorithms in reducing ∥grad f(x_t)∥. (c) D in (55) for RFedSVRG-2BB. (d) D in (55) for RFedSVRG-2BBS.
Figure 4. Results for kPCA (4) with the wine dataset. The data are in R^13 (d = 13) and we take r = 3, n = 10, k = 5 and τ_i = 5. We set (η_max, η_min) = (2 × 10^{-1}, 2 × 10^{-3}) for RFedSVRG-2BBS and the constant step size η^(i) = 0.1 for RFedAvg, RFedProx, RFedSVRG and RFedSVRG-2BB. (a) Comparison of the algorithms in reducing the principal angle between x* and x_t. (b) Comparison of the algorithms in reducing ∥grad f(x_t)∥. (c) D in (55) for RFedSVRG-2BB. (d) D in (55) for RFedSVRG-2BBS.
Figure 5. Results for kPCA (4) with the breast cancer dataset. The data are in R^30 (d = 30) and we take r = 3, n = 10, k = 5 and τ_i = 5. We set (η_max, η_min) = (5 × 10^{-2}, 5 × 10^{-4}) for RFedSVRG-2BBS and the constant step size η^(i) = 2 × 10^{-2} for RFedAvg, RFedProx, RFedSVRG and RFedSVRG-2BB. (a) Comparison of the algorithms in reducing the principal angle between x* and x_t. (b) Comparison of the algorithms in reducing ∥grad f(x_t)∥. (c) D in (55) for RFedSVRG-2BB. (d) D in (55) for RFedSVRG-2BBS.
Figure 6. Results for kPCA (4) with the MNIST hand-written digits dataset. The data are in R^784 (d = 784) and we take r = 5, n = 200, k = n/10 and τ_i = 5 for this experiment. We set (η_max, η_min) = (3 × 10^{-8}, 3 × 10^{-12}) for RFedSVRG-2BBS and the constant step size η^(i) = 2 × 10^{-10} for RFedAvg, RFedProx and RFedSVRG, as recommended in [3]. (a) Comparison of the algorithms in reducing the principal angle between x* and x_t. (b) Comparison of the algorithms in reducing ∥grad f(x_t)∥.
Table 1. Summary of the convergence rates proved for our algorithms. Here, T is the total number of communication rounds, L is the Lipschitz constant of f and D is the diameter of the domain in Assumption 2. For the required conditions and the sources of the other parameters, please refer to the corresponding theorems.
Properties of f | Algorithm | Convergence Rate | Parameter Source and Conditions
L-smooth and μ-strongly g-convex | RFedSVRG-2BB | O(αLD² / (T(1 − α))) | Theorem 1
L-smooth and μ-strongly g-convex | RFedSVRG-2BBS | O(αLD² / (T(1 − α))) | Theorem 2
L-smooth and h-gradient dominated | RFedSVRG-2BB | O(L h m α^{1/2} ζ^{3α/2 − 1} D² / T) | Theorem 5
L-smooth and h-gradient dominated | RFedSVRG-2BBS | O(L h D² / T) | Theorem 6

