Abstract
Federated learning is an emerging machine learning methodology that enables distributed agents to collaboratively learn a centralized model without sharing their raw data. Researchers have proposed many first-order and second-order federated learning algorithms to reduce communication costs and speed up convergence. However, these algorithms generally rely on gradient or Hessian information, and such federated optimization problems become difficult to solve when the analytical expression of the loss function is unavailable, that is, when gradient information cannot be obtained. Therefore, in this paper we employ derivative-free federated zero-order optimization, which does not rely on explicit gradient information but instead uses changes in function values or model outputs to estimate the optimization direction. Furthermore, to enhance the performance of derivative-free zero-order optimization, we propose an effective adaptive algorithm that dynamically adjusts the learning rate and other hyperparameters based on performance during the optimization process, aiming to accelerate convergence. We rigorously analyze the convergence of our approach, and experimental results demonstrate that our method achieves a faster convergence speed on the MNIST, CIFAR-10, and Fashion-MNIST datasets when gradient information is unavailable.
1. Introduction
With the rapid development of big data and artificial intelligence technologies, machine learning has become a key technology across many fields. In practical applications, however, data privacy and security have become a non-negligible challenge. In traditional machine learning, data usually need to be sent to data centers for model training, which exposes them to leakage and privacy risks during transmission. To address this issue, federated learning (FL) [1], an emerging distributed machine learning technology, has been developed. It achieves collaborative learning among multiple data sources while protecting data privacy by exchanging model parameters instead of raw data. Currently, FL is utilized across diverse domains such as autonomous driving [2], personalized recommendation systems [3], and medical informatics [4], among others.
To meet the requirements of various real-world scenarios, a large number of federated learning algorithms have been studied. In recent years, to achieve fast convergence rates and reduce communication loads, a wide range of algorithms have been proposed, and they are commonly divided into two categories. One category is first-order methods [1,5,6], which rely only on the first-order derivative (gradient) of the objective function. The gradient indicates the direction in which the function increases most rapidly, and this type of method uses this information to iteratively update the model parameters. The other category is second-order methods, which use not only the first-order derivative but also the second-order derivative (the Hessian matrix), such as Fed-Sophia [7]. Most of these algorithms rely on gradient or Hessian information, so their scope of application is usually limited to differentiable functions. However, when gradients of the objective function are difficult or excessively costly to compute, these algorithms cannot be applied, for example for black-box functions and in federated hyperparameter tuning [8], so we must look for other solutions.
Zero-order optimization is a method proposed to remove the dependence on gradient or Hessian information when optimizing the objective function. Its defining characteristic is that it does not require gradient computation; instead, it estimates the change of the function by sampling the objective function to search for the optimal solution. This method is therefore applicable when the objective function is non-differentiable or its gradient is difficult to compute. However, because existing zero-order methods do not exploit gradient information, they typically need more iterations to converge, which greatly increases computational time and affects model performance and convergence speed. In deep learning, an excessively large learning rate may make training unstable, while traditional fixed-learning-rate methods often struggle to adapt to complex and diverse training data and model structures, potentially slowing convergence. Adaptive methods effectively alleviate this problem: they dynamically adjust the learning rate based on feedback from the training process, thereby accelerating convergence. In addition, compared with stochastic gradient descent, adaptive methods can escape saddle points more quickly [9]. Introducing them in practice is therefore a crucial way to enhance the performance of federated learning algorithms.
However, improperly designed adaptive federated learning methods may suffer from convergence issues [10]. In federated learning, data distribution across devices often does not follow the IID (Independent and Identically Distributed) assumption, which further complicates convergence. Reddi et al. [11] first proposed federated versions of adaptive optimizers, including FedAdagrad and FedYogi. However, their analysis is only valid when the momentum decay parameter is set to zero, and thus cannot leverage the advantages of momentum. These limitations become even more pronounced on non-IID (Non-Independent and Identically Distributed) data. The FedCAMS algorithm [12] provides a complete proof but does not improve the convergence speed. Moreover, such methods usually require a global learning rate for initialization and adjustment, and the choice of this global learning rate may affect performance. Existing adaptive methods still have shortcomings, such as a lagging response to complex environmental changes and high consumption of computational resources. Our goal is to design adaptive algorithms tailored to various real-world scenarios, characterized by efficiency, precision, and flexibility, so as to rapidly adapt to environmental changes, optimize resource allocation, and provide robust support for development across various fields.
Based on the above, we combine the gradient-free optimization with adaptive methods, aiming to leverage the advantages of both to solve the problem of unavailable gradient information of the objective function in a finite-sum optimization problem while achieving fast convergence and effectively improving the training efficiency and performance of the model.
The contributions of this paper are summarized as follows:
- By combining zero-order optimization with the adaptive gradient method, we propose a novel, faster zero-order adaptive federated learning algorithm, called FAFedZO, which eliminates the reliance on gradient information while accelerating convergence.
- We conducted a theoretical analysis of the proposed zero-order adaptive algorithm and provided a convergence analysis framework under some mild assumptions, demonstrating its convergence. Additionally, we have analyzed the computational complexity and convergence rate of the algorithm.
- We have conducted a large number of comparative experiments on the MNIST, CIFAR-10, and Fashion-MNIST datasets. The experimental results verify the effectiveness of the FAFedZO algorithm. Compared with traditional zero-order optimization algorithms, this algorithm demonstrates significant performance advantages in both IID and non-IID scenarios.
The structure of the rest of this paper is outlined below. Section 2 provides a summary of the related work. In Section 3, we present the formulation of the federated optimization problem and the algorithm framework of FAFedZO. Section 4 provides the convergence analysis of FAFedZO. Section 5 presents the outcomes of our experiments. The paper concludes with Section 6.
2. Related Work
2.1. Federated Learning
There have been numerous studies on federated learning. Just as the flight control neural network in [13] obtains optimal feedback gains through careful network architecture design and training, federated learning likewise seeks the optimal parameter combination for each client to achieve better global model performance. The pioneering work on federated learning began with [1], which proposed an algorithm called FedAvg. After FedAvg, numerous additional first-order schemes emerged, such as FedNova [14], FedProx [15], SCAFFOLD [16], FedSplit [17], and FedPD [18]. Among them, SCAFFOLD employs control variates to rectify “client drift” while maintaining the same sampling and communication complexity as FedAvg, and FedProx introduced a penalty-based approach that can reduce communication complexity. To further minimize communication costs, various second-order optimization methods have been introduced, including GIANT [19] and FedDANE [20]. There are also FL algorithms based on momentum, including [21,22,23]. The work in [21] proposes a momentum fusion approach for synchronizing the server and local momentum buffers; however, it does not aim to decrease complexity. Ref. [22] proposed a momentum-based global update algorithm, Fed-GLOMO, which reduces variance on the server side using variance-reduction techniques. Ref. [23] proposed the STEM algorithm, which employs momentum-assisted stochastic gradient directions for updates at both the worker nodes and the central server. However, many federated optimization problems in practice remain challenging to solve, for instance, when gradient information is unavailable or costly to acquire. Therefore, the study of gradient-free zero-order optimization is essential.
2.2. Zero-Order Optimization
Early literature that applied the zero-order idea to estimation includes [24,25,26]. Specifically, in [24], the authors developed a distributed zero-order algorithm utilizing gradient tracking techniques. In [25], the authors provide the first generalization error analysis for black-box learning via derivative-free optimization and demonstrate that, under Lipschitz and smoothness assumptions on the (unknown) losses, the ZoSS method attains a generalization error bound comparable to that of stochastic gradient descent (SGD). Ref. [26] proposed and analyzed zero-order stochastic approximation algorithms for non-convex and convex objective functions, focusing on constrained optimization and high-dimensional settings. Recent key research has also applied zero-order methods to various fields, such as [27,28,29,30]. In [27], a derivative-free algorithm, FedZO, is proposed and proven to achieve a linear speedup with respect to the number of participating devices and the number of local iterations in non-convex settings. Ref. [28] proposed the FedDisco algorithm, leveraging zero-order optimization techniques to significantly reduce communication overhead. Ref. [29] designed a federated zero-order algorithm, FedZeN, which focuses on convex optimization and estimates the curvature of the global objective. Under the cross-device federated learning framework, Ref. [30] introduces a dual-communication zero-order method, the first technique to incorporate wireless channels into the algorithm. This represents a significant new achievement in zero-order optimization, attaining a sublinear convergence rate in the total number of iterations K in non-convex settings. However, by its nature, zero-order optimization can lead to slower convergence. Therefore, an adaptive method is introduced below.
2.3. Adaptive Methods
Early adaptive algorithms include [31,32]. The Adam method in [31] introduces decay coefficients and combines momentum optimization to better adapt to non-stationary data and large-scale datasets, thereby accelerating model convergence. The AdaGrad method in [32] adaptively adjusts the learning rate of each parameter so that sparse features receive larger learning rates while frequently occurring features receive smaller ones. Researchers have since extended these algorithms to federated learning, as in recent studies [33,34,35,36]. In [33], the authors designed an Adaptive Local Iteration Differential Privacy Federated Learning algorithm (ALI-DPFL) and demonstrated its superiority in resource-constrained scenarios. The adaptive methods in [34,35] effectively mitigate the issue of non-IID data, achieving critical progress toward better performance on non-IID data. Ref. [35] introduced a novel framework called FedARF, which adopts an adaptive feature fusion strategy, enabling the model to better adapt to the data distribution of each client and thereby accelerating convergence on non-IID data. Ref. [36] introduced a federated learning algorithm named AFedAvg, which significantly reduces communication costs and accelerates convergence by combining adaptive communication frequency and gradient sparsification techniques. However, none of these studies have investigated integrating adaptive methods with zero-order optimization. Hence, our article combines zero-order optimization with adaptive approaches.
3. Problem Formulation and Algorithm Design
In this section, we will introduce the federated optimization problem and the design of the zero-order adaptive federated optimization algorithm FAFedZO.
3.1. Federated Optimization Problem Formulation
We consider a federated learning task involving a central server and Q edge devices indexed by i ∈ {1, 2, …, Q}. The central server aims to facilitate collaboration among these devices to solve the optimization problem

min_{x ∈ R^d} f(x) := (1/Q) Σ_{i=1}^{Q} f_i(x),   (1)

in which

f_i(x) := E_{ξ_i ∼ D_i} [ f_i(x; ξ_i) ],   (2)

where x ∈ R^d represents a d-dimensional model parameter. For each edge device i ∈ {1, 2, …, Q}, f_i denotes its local loss function, while f stands for the global loss function. In Equation (1), f_i evaluates the expected risk over the data distribution D_i on edge device i, as given in Equation (2). ξ_i ∼ D_i denotes that the random variable ξ_i is drawn from the distribution D_i, and f_i(x; ξ_i) denotes the loss on sample ξ_i at the parameter x.
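As a minimal illustration of this finite-sum formulation, the sketch below averages Q local objectives into the global one; the quadratic local losses are hypothetical stand-ins for each device's expected risk:

```python
import numpy as np

def global_loss(x, local_losses):
    """Global objective f(x) = (1/Q) * sum_i f_i(x) over the Q edge devices."""
    return sum(f_i(x) for f_i in local_losses) / len(local_losses)

# Hypothetical quadratic local losses standing in for each device's expected risk.
rng = np.random.default_rng(0)
Q, d = 4, 3
targets = [rng.standard_normal(d) for _ in range(Q)]
local_losses = [lambda x, t=t: 0.5 * np.sum((x - t) ** 2) for t in targets]

# For averaged quadratics, the global minimizer is the mean of the targets.
x_star = np.mean(targets, axis=0)
print(global_loss(x_star, local_losses) <= global_loss(np.zeros(d), local_losses))
```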
3.2. Algorithm Design of FAFedZO
We explored how to design a method that combines an adaptive gradient with zero-order optimization in federated learning.
Firstly, we expand our algorithm description within the context of the FedAvg framework. We are focused here on solving problem (1) through zero-order optimization methods and propose a novel zero-order fast adaptive FL method (FAFedZO) that employs a shared adaptive learning rate. In particular, Algorithm 1 outlines the specifics of our FAFedZO method.
At the beginning, the parameters are input and initialization is performed. For all i, the first model update is computed from the initial parameters and the initial gradient estimates.
Then, for each round up to T, the following is performed: each edge device i first draws a mini-batch of size b from its local dataset. Next, it computes the stochastic gradient estimates for the current model parameters based on this mini-batch. Here, we elaborate on the specific method for estimating the gradients.
To address the issue of unavailable gradient information and to reduce the frequency of model exchange, we use gradient estimators and perform stochastic zero-order updates in each communication round. The gradient estimator approximates the gradient through random sampling and function evaluations, without the need to precisely calculate the derivative of the function. In particular, at the t-th round, edge device i calculates a two-point stochastic gradient estimator [24] of the form

g_i^t = (d / μ) [ f_i(x_i^t + μ u_i^t; ξ_i^t) − f_i(x_i^t; ξ_i^t) ] u_i^t,

where x_i^t denotes the local model of edge device i, ξ_i^t is a random variable drawn by edge device i according to its local data distribution during the t-th round, u_i^t is a random d-dimensional direction uniformly sampled from the unit sphere, and μ > 0 is a positive step size.
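A minimal sketch of such a two-point estimator, assuming the standard form g = (d/μ)(f(x + μu) − f(x))u with u uniform on the unit sphere:

```python
import numpy as np

def two_point_grad_estimate(f, x, mu, rng):
    """Two-point zero-order gradient estimator:
    g = (d / mu) * (f(x + mu * u) - f(x)) * u, with u uniform on the unit sphere."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # normalized Gaussian gives a uniform direction on the sphere
    return (d / mu) * (f(x + mu * u) - f(x)) * u

# Sanity check on a smooth function: a single draw is noisy, but averaging
# many draws approximates the true gradient.
rng = np.random.default_rng(1)
f = lambda x: 0.5 * np.sum(x ** 2)   # true gradient of this f is x itself
x = np.array([1.0, -2.0, 0.5])
est = np.mean([two_point_grad_estimate(f, x, 1e-4, rng) for _ in range(20000)],
              axis=0)
print(est)
```

Averaging a mini-batch of such estimates per round (as in step 7 of Algorithm 1) reduces the variance further.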
Afterward, in step 8 of Algorithm 1, we calculate the first-order momentum. Each update step then depends not only on the current gradient estimate but also integrates information from historical gradients. In this way, it provides more stable and directional guidance for the update of the model parameters, which helps the model converge to the optimal solution faster. It is defined as

m_i^t = β₁ m_i^{t−1} + (1 − β₁) g_i^t,

where the hyperparameter β₁ ∈ [0, 1) represents the decay factor.
In step 9 of Algorithm 1, we calculate the second-order momentum. It measures the second moment of the gradient, enabling adaptive adjustment of the learning rate in each parameter dimension according to the historical changes in gradients. We adopt the coordinate-wise adaptive learning rate approach, akin to that used in Adam [31], defined as

v_i^t = β₂ v_i^{t−1} + (1 − β₂) (g_i^t)²,

where the hyperparameter β₂ ∈ [0, 1) represents the decay factor and the square is taken coordinate-wise.
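The two momentum updates can be sketched together as follows; the decay values β₁ = 0.9 and β₂ = 0.99 are placeholders, as the paper's exact hyperparameter choices may differ:

```python
import numpy as np

def momentum_step(m_prev, v_prev, g, beta1=0.9, beta2=0.99):
    """Adam-style first- and second-order momentum updates applied at each
    local step (steps 8-9 of Algorithm 1); beta1/beta2 values are placeholders."""
    m = beta1 * m_prev + (1.0 - beta1) * g        # first-order momentum
    v = beta2 * v_prev + (1.0 - beta2) * g ** 2   # coordinate-wise second moment
    return m, v

g = np.array([0.5, -1.0])                          # a gradient estimate
m, v = momentum_step(np.zeros(2), np.zeros(2), g)
print(m, v)
```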
Then, when the round number t is an integer multiple of the local update number p, we perform the aggregation and periodic averaging steps at the server side, obtaining the averaged second-order momentum. Subsequently, we use this averaged second-order momentum to create a diagonal adaptive matrix. Since the averaged second-order momentum is computed across all devices and is related to the square of the gradient estimates, the adaptive matrix is updated correspondingly as it is updated. In this way, the adaptive matrix adjusts itself according to the changes in gradient information during the local model updates. This enables the algorithm to better adapt to the data distributions and training situations on different devices, accelerating convergence and enhancing the final performance of the model.
Then, the averaging step is also performed on all the first-order momenta at the server side, and the global model is updated based on the averaged first-order momentum and the adaptive matrix. Otherwise, when t is not a multiple of p, the adaptive matrix keeps its previous value and the global model is updated in the same manner as before. In short, the central server aggregates the model parameters, the first-order momentum, and the second-order momentum every p steps.
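The periodic server-side step can be sketched as follows. The Adam-style diagonal construction A = diag(sqrt(v̄) + ρ) is an assumption made for this illustration; the paper's exact form of the adaptive matrix may differ:

```python
import numpy as np

def server_aggregate(local_params, local_ms, local_vs, rho=1e-3):
    """Every p rounds: average models, first- and second-order momenta across
    devices, then build the diagonal of an adaptive matrix from the averaged
    second moment (Adam-style choice, assumed here for illustration)."""
    x_bar = np.mean(local_params, axis=0)
    m_bar = np.mean(local_ms, axis=0)
    v_bar = np.mean(local_vs, axis=0)
    a_diag = np.sqrt(v_bar) + rho   # rho keeps the matrix positive definite
    return x_bar, m_bar, a_diag

def global_update(x_bar, m_bar, a_diag, lr=0.1):
    """Global model step x <- x - lr * A^{-1} m (element-wise for diagonal A)."""
    return x_bar - lr * m_bar / a_diag
```

Keeping the adaptive matrix fixed between aggregation rounds, as the text describes, corresponds to reusing `a_diag` until the next multiple of p.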
Finally, after all iterations are completed, the algorithm outputs a model , which is uniformly randomly selected from all the global models obtained during the iterations, as the final result.
To give a more intuitive and clear understanding of the workflow of the FAFedZO algorithm, we summarize it in the workflow diagram shown in Figure 1:
Algorithm 1: FAFedZO Algorithm
Figure 1.
FAFedZO framework.
4. Convergence Analysis of FAFedZO Method
In this section, the convergence of the FAFedZO method will be discussed. To facilitate the theoretical analysis of the proposed algorithm, we need to make some assumptions as follows.
Assumption 1.
The global loss function f in (1) is bounded below; that is, there exists a fixed value f* such that f(x) ≥ f* > −∞ for all x.
Assumption 1 means that regardless of the value of the optimization variable x, the global loss function will not decrease indefinitely; there exists a minimum value f*, ensuring that the optimal solution we are looking for lies within a meaningful range. Otherwise, the algorithm may keep seeking a lower value and fail to converge.
Assumption 2.
We suppose that the functions f_i are all L-smooth; that is, for all x, y ∈ R^d,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖,

where L represents the Lipschitz constant and ‖·‖ represents the Euclidean norm. Equivalently, this can be expressed as f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)‖y − x‖², where ⟨·, ·⟩ represents the inner product.
Assumption 2 is common in the theoretical analysis of non-convex optimization [24,27], indicating that the gradients of the local losses change smoothly rather than abruptly. It is widely used in optimization analysis, and many typical federated learning algorithms adopt this assumption, such as Fed-GLOMO [22] and STEM [23].
Assumption 3.
The stochastic gradient is an unbiased estimate of the true local gradient; namely,

E_{ξ_i ∼ D_i} [ ∇f_i(x; ξ_i) ] = ∇f_i(x).
Assumption 3 is usually used in stochastic optimization. It allows us to approximate the true gradient by using the gradient of random sampling, thus reducing the computational complexity. This is because, especially in the cases of large-scale data and complex models, it is often infeasible to calculate the gradient precisely.
Assumption 4.
The variance of the stochastic gradient is bounded; that is, there exists a constant σ ≥ 0 such that

E‖∇f_i(x; ξ_i) − ∇f_i(x)‖² ≤ σ².
Assumption 4 indicates that the variance of stochastic gradients will not be infinite. If the variance is too large, the algorithm may experience violent oscillations during the optimization process and fail to converge. Bounded variance is a necessary condition for controlling the noise of SGD and ensuring the convergence rate.
Assumption 5.
The difference between each local loss function and the global loss function remains within a certain limit; that is, there exists a constant that satisfies
Assumption 5 is used to describe the heterogeneity between the local loss and the global loss and is also taken into account in distributed zero-order optimization [37]. It ensures that local optimization will not lead to a significant decline in global performance.
Assumption 6.
The inter-node variance is bounded; namely,

(1/Q) Σ_{i=1}^{Q} ‖∇f_i(x) − ∇f(x)‖² ≤ ζ²,

where ζ is the heterogeneity parameter, representing the level of data heterogeneity.
Assumption 6 is a typical assumption used to constrain data heterogeneity in federated learning algorithms. When the data are IID, that is, when all devices share the same data distribution, then ζ = 0. This assumption indicates that the differences between the loss functions on different devices are limited, ensuring that the optimization processes on different devices do not deviate too much from one another.
Assumption 7.
In our algorithms, for all t, the adaptive matrix A_t satisfies the condition

λ_min(A_t) ≥ ρ > 0,

where ρ represents an appropriate positive value and λ_min(·) represents the minimum eigenvalue of a matrix.
Assumption 7 ensures that the adaptive matrix is positive definite at every iteration. By ensuring that the minimum eigenvalue is at least the positive number ρ, the algorithm is guaranteed a sufficient update step size in each iteration, thereby ensuring convergence.
Assumption 8.
The stochastic gradients of the functions f_i are G-bounded; that is, for all i and all samples ξ_i, we have

‖∇f_i(x; ξ_i)‖ ≤ G,

where G is a positive constant.
Assumption 8 provides an upper bound for the gradient in the adaptive method. This is a typical assumption in the adaptive method, which is used to constrain the upper bound of the adaptive learning rate. This is reasonable and can usually be satisfied in practice. For example, it holds for the finite sum problem.
In the context of optimization algorithms, in order to prove the final result, we introduce the concept of an ϵ-stationary point, defined as follows:
Definition 1.
A point x is called ϵ-stationary if ‖∇f(x)‖ ≤ ϵ. A stochastic algorithm is said to achieve an ϵ-stationary point in T iterations if E[‖∇f(x_T)‖] ≤ ϵ, where the expectation is taken over the randomness of the algorithm.
Then, we investigate the convergence characteristics of our novel method based on Assumptions 1–8. We first obtain the following six lemmas:
Lemma 1 gives the upper bound of the adaptive matrix .
Lemma 1.
Suppose the adaptive matrix sequence is derived from the algorithm. On the basis of Assumptions 1–8, we can conclude that
Lemma 2 measures the boundary of variance between gradients.
Lemma 2.
For , where represents all Q edge devices from 1 to Q, we can conclude that
Lemma 3 studies the bounds under different values of , which is important for the proof of the subsequent theorems.
Lemma 3.
Given that , is derived from the algorithm; we can conclude that
(1) If , then we can obtain
(2) If , then we can obtain
Lemma 4 gives the relationship between the expected values of the function f at the model parameters and .
Lemma 4.
Assuming the sequence is generated by the algorithm, then we can obtain
Lemma 5 gives the iterative relationship between and .
Lemma 5.
Suppose that they are produced by the algorithm; we subsequently obtain
Lemma 6 is also a crucial conclusion for proving the final theorem.
Lemma 6.
Suppose that is produced by the algorithm; we subsequently obtain
The proofs of these lemmas are provided in Appendix A. Then, based on the conclusions of the above lemmas, we can prove the final convergence theorem:
Theorem 1.
Assume that the sequence is produced by Algorithm 1. Based on Assumptions 1–8, given that , , , , and
we can conclude that
where .
Due to space limitations, we only outline the proof here: First, combining the objective function and the gradient estimation error term, we construct a Lyapunov function for the analysis. Second, we establish a recurrence inequality for this Lyapunov function; through summation over iterations, we can bound the averaged gradient norm and establish its convergence. Finally, based on the above, we derive the convergence bound of the objective function.
The detailed proof of Theorem 1 can be found in Appendix A.
Remark 1.
We utilize ζ as an indicator of data heterogeneity. The final results demonstrate that an increase in ζ (indicating greater data heterogeneity) leads to a slowdown in the training process. In addition, we consider that the step size remains unchanged during the training process. Therefore, we omit its subscript and represent it with μ.
Remark 2.
A suitable value of ρ ensures a balanced incorporation of adaptive information in the learning rate. In practice, we commonly select ρ to be within the order of , steering clear of excessively small or large values.
Remark 3
(Computational complexity). For , we have
Referring to the method in [38], without loss of generality, we let and choose . To make the right side of the inequality less than , we can obtain and . Therefore, to satisfy the definition of an ϵ-stationary point, which is and , we obtain the total sample cost as and the communication round as .
Remark 4
(Convergence speed). In the FedZO algorithm [27], regardless of whether all devices participate or only partial devices participate, we can observe that the FedZO algorithm achieves a convergence rate of . Furthermore, as indicated by Theorem 1, our algorithm can achieve a convergence rate of . Therefore, our FAFedZO method theoretically possesses a faster convergence rate compared to general zero-order methods. This further validates the advantages of our approach.
5. Experimental Results
In this section, we conduct comparative experiments on the MNIST, CIFAR-10 and Fashion-MNIST datasets and present some experimental outcomes to assess the performance of the proposed FAFedZO method in federated black-box attacks, thereby confirming the advantages of this algorithm.
5.1. Experimental Environment and Datasets
Experimental environment configuration: This study utilizes the FedZO framework as the experimental baseline for research on federated learning algorithms. Python version 3.10.13 is adopted as the programming language, and PyTorch 2.2.0 serves as the development platform. All tests are conducted on a Windows 10 platform equipped with an NVIDIA GTX 1650 GPU (driver version 572.83) and CUDA version 12.4.99.
Dataset: We conduct comparative experimental studies on the MNIST, CIFAR-10 and Fashion-MNIST datasets.
The MNIST dataset is a classic handwritten digit image dataset. It comprises grayscale images of handwritten digits (0–9), with 60,000 training samples and 10,000 test samples, each image measuring 28 × 28 pixels. The CIFAR-10 dataset consists of 60,000 32 × 32 color images across 10 categories; each category contains 6000 images, and the dataset is divided into 50,000 training and 10,000 test images. The Fashion-MNIST dataset has the same structure as MNIST, containing 10 categories of grayscale images of clothing items such as T-shirts, trousers, and pullovers.
5.2. Experimental Result Analysis
Here, we present the results of simulation experiments to assess the performance of the FAFedZO method in the context of black-box attack strategies.
Given the characteristics of black-box scenarios, optimizing black-box attacks falls into the realm of zero-order optimization. We investigate black-box attacks on a trained deep neural network (DNN) classification model. An attack is an intentional attempt to manipulate a model's input data so as to interfere with its normal operation. It can be achieved by adding carefully designed perturbations to the original data, for example, the interference images that we aim to train in this experiment. The normal learning process assumes that the input data follow a certain distribution; under attack, the introduced adversarial examples violate this distribution assumption, forcing the model to make incorrect predictions, and if such attacks are not properly handled, the model's generalization ability may fail. The purpose of our experiment is to train an interference image with the same size as the images in the dataset, such that the human eye can hardly distinguish the original image from the adversarial image after the interference is added, yet the classification model is induced to make a wrong judgment. We want to achieve a high attack success rate with as little disturbance as possible, so we consider the loss function shown below:
where the arguments are the interference image to be trained, the original image in the dataset, the label corresponding to the image (for example, the label of the image “deer” is 4), and the confidence with which the classification model recognizes an image as label j. The term I in Equation (16) measures the probability of attack failure (referred to as the attack loss), the second term measures the image distortion caused by the disturbance, and c is the balance coefficient. Our goal is thus achieved by minimizing this loss. Next, we use Equation (16) to construct the local loss function.
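As an illustrative sketch of this objective: the exact form of term I is not given here, so this sketch assumes a Carlini-Wagner-style margin as the attack-failure term, plus the L2 distortion penalty:

```python
import numpy as np

def attack_loss(delta, x0, label, confidences, c=0.1):
    """Black-box attack objective sketch: attack-failure term plus an L2
    distortion penalty. `confidences` maps an image to per-class scores of
    the black-box classifier; the margin form of term I is an assumption."""
    z = confidences(x0 + delta)
    margin = z[label] - np.max(np.delete(z, label))  # >0 while the true class still wins
    failure = max(margin, 0.0)                       # term I: attack has not yet succeeded
    distortion = np.sum(delta ** 2)                  # perturbation magnitude
    return failure + c * distortion
```

Since `confidences` is queried only for its outputs, this loss is exactly the kind of black-box function the zero-order estimator of Section 3.2 can optimize.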
We divide the samples in the dataset into Q groups randomly and unevenly without repetition and then distribute them to each edge device, where Q is the total count of edge devices we preset. Then, for all edge devices, we define their local loss function as
where the local dataset resides at the ith edge device. In this way, the federated black-box attack problem on the DNN classification model can be formulated as a federated optimization problem of the form (1). Next, we use the FAFedZO algorithm proposed in this paper to solve this problem.
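The random, uneven, non-repeating split of samples across the Q devices described above can be sketched as follows (the cut-point scheme is one simple way to produce uneven shards, assumed for illustration):

```python
import numpy as np

def partition_dataset(n_samples, Q, rng):
    """Randomly split sample indices into Q disjoint, unevenly sized,
    non-empty groups, one per edge device."""
    idx = rng.permutation(n_samples)
    # Q-1 distinct cut points inside (0, n_samples) yield uneven, non-empty shards.
    cuts = np.sort(rng.choice(np.arange(1, n_samples), size=Q - 1, replace=False))
    return np.split(idx, cuts)

shards = partition_dataset(100, 5, np.random.default_rng(0))
print([len(s) for s in shards])
```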
We select the balance parameter c and the remaining hyperparameters appropriately, and set the initial learning rate to 0.1.
We select the total number of edge devices and the number of participating edge devices and study the influence of the number of local updates E on the attack loss and attack accuracy of the proposed FAFedZO algorithm. Then, we compare it with the FedZO algorithm. We conduct experiments on the MNIST, CIFAR-10, and Fashion-MNIST datasets, respectively, and the results are shown in Figure 2.
Figure 2.
Influence of varying the number of local updates that select . (a) The attack loss on the MNIST dataset. (b) The attack loss on the CIFAR-10 dataset. (c) The attack loss on the Fashion-MNIST dataset. (d) The testing accuracy on the MNIST dataset. (e) The testing accuracy on the CIFAR-10 dataset. (f) The testing accuracy on the Fashion-MNIST dataset.
Taking the MNIST dataset as an example, Figure 2a,d illustrate the impact of the number of local updates E on the performance of the algorithm. The larger the value of E, the lower the attack loss and the higher the accuracy. In addition, the attack accuracy of FAFedZO is significantly higher than that of FedZO; the two achieve comparable accuracy in only one setting of E, which shows that the algorithm proposed in this paper outperforms the original FedZO algorithm. Similar results are obtained on the other two datasets.
In Figure 3, with the total number of edge devices and the number of local updates E fixed, we study the impact of the number M of participating edge devices on the convergence performance of the FAFedZO method on the MNIST and Fashion-MNIST datasets. By adjusting the value of M, the FAFedZO algorithm can effectively enhance the attack accuracy.
Figure 3.
Influence of the number M of participating edge devices. (a) The testing accuracy on the MNIST dataset. (b) The testing accuracy on the Fashion-MNIST dataset.
In Figure 4, we let all edge devices participate in every round and study the impact of the number of local updates E on the convergence performance of the algorithm, again on the MNIST, CIFAR-10, and Fashion-MNIST datasets. Compared with the FedZO algorithm under the same values of E, the FAFedZO method significantly reduces the attack loss and improves the attack accuracy. Moreover, as E increases, the convergence of FAFedZO accelerates, with lower attack loss and higher attack accuracy, and both stabilize more quickly.
Figure 4.
Influence of the number of local updates E when all edge devices participate. (a) The attack loss on the MNIST dataset. (b) The attack loss on the CIFAR-10 dataset. (c) The attack loss on the Fashion-MNIST dataset. (d) The testing accuracy on the MNIST dataset. (e) The testing accuracy on the CIFAR-10 dataset. (f) The testing accuracy on the Fashion-MNIST dataset.
Finally, we study the impact of the number of participating edge devices M on the algorithm, with the remaining parameters fixed. We conducted experiments on the MNIST and Fashion-MNIST datasets; the results are shown in Figure 5. The FAFedZO algorithm far outperforms the FedZO algorithm in terms of both attack loss and accuracy: even the best-performing choice of M for FedZO is still weaker than the worst-performing choice for FAFedZO. In addition, as M increases, the attack loss decreases and the accuracy increases, which is in line with our expectations.
Figure 5.
Influence of the number M of participating edge devices. (a) The attack loss on the MNIST dataset; (b) The attack loss on the Fashion-MNIST dataset; (c) The testing accuracy on the MNIST dataset; (d) The testing accuracy on the Fashion-MNIST dataset.
In addition to investigating the number of local updates and the number of participating edge devices, we also study the influence of the number of random directions used by the gradient estimator. As shown in Figure 6, on the MNIST dataset, with the remaining parameters fixed, we plot the impact of the number of directions on the attack loss and attack accuracy. The conclusions are similar to those of the figures above, which further confirms the superiority of the FAFedZO algorithm.
Figure 6.
Influence of the number of random directions. (a) Attack loss; (b) Testing accuracy.
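The role of the number of random directions can be seen from a standard multi-direction zeroth-order gradient estimator of the kind FedZO-type methods rely on: the gradient is approximated by averaging forward finite differences along random unit directions, so more directions reduce the estimator's variance at the cost of more function queries. The sampling distribution and smoothing parameter below are our own illustrative assumptions:

```python
import math
import random

def zo_gradient(f, x, num_dirs=10, mu=1e-3, seed=0):
    """Estimate grad f(x) by averaging d/mu * (f(x + mu*u) - f(x)) * u
    over random unit directions u (uniform on the sphere)."""
    rng = random.Random(seed)
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]  # normalized Gaussian = uniform on sphere
        x_pert = [xi + mu * ui for xi, ui in zip(x, u)]
        coeff = d * (f(x_pert) - fx) / mu
        for i in range(d):
            grad[i] += coeff * u[i] / num_dirs
    return grad

# Sanity check on f(x) = ||x||^2, whose true gradient at x is 2x.
f = lambda x: sum(v * v for v in x)
g = zo_gradient(f, [1.0, -2.0], num_dirs=2000, mu=1e-4)
print(g)  # approximately [2.0, -4.0]
```

With few directions the estimate is noisy, which is consistent with the experimental observation that the choice of this hyperparameter affects the attack loss and accuracy curves.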
In addition, we discuss the performance of our algorithm under the non-independent and identically distributed (non-IID) data setting.
Figure 7 presents the impact of the number of local updates E on the attack loss and accuracy of the algorithm under the non-IID setting. Compared with Figure 2a,d, the attack accuracy here is lower than under the IID setting, which is attributable to the non-IID data and aligns with our expectations. Furthermore, the results indicate that, under the non-IID setting, the FAFedZO algorithm still outperforms the FedZO algorithm, with the two achieving comparable performance in only one setting of E.
Figure 7.
The impact of different numbers of local updates E under the non-IID setting. (a) Attack loss; (b) Testing accuracy.
The conclusions we previously mentioned are also supported by Figure 8 and Figure 9. Therefore, in summary, the FAFedZO algorithm demonstrates superior performance in both IID and non-IID environments.
Figure 8.
The impact of the number M of participating edge devices under the non-IID setting. (a) Attack loss; (b) Testing accuracy.
Figure 9.
The impact of different numbers of local updates under the non-IID setting. (a) Attack loss; (b) Testing accuracy.
6. Conclusions
In this paper, we proposed FAFedZO, a federated optimization algorithm that combines derivative-free zero-order optimization with an adaptive method. We conducted a theoretical analysis of the algorithm to prove its convergence and presented the computational complexity and convergence rate of the FAFedZO algorithm. Finally, we conducted a large number of comparative experiments on the MNIST, CIFAR-10, and Fashion-MNIST datasets. The results demonstrate that our method can achieve a faster convergence rate compared to general zero-order optimization algorithms, verifying the effectiveness of the algorithm proposed in this paper.
Author Contributions
Methodology, Y.X.; Formal analysis, Y.Z.; Writing—original draft, Y.L.; Writing—review & editing, H.G.; Supervision, Y.X.; Project administration, Y.X.; Funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by Key Technologies Research and Development Program of Henan Province under Grant No. 242102210102.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Here, we present a thorough examination of the convergence properties of our algorithm. To facilitate the analysis, we introduce some additional notation used in the subsequent sections and write ⊗ for the Kronecker product.
First, for the zero-order gradient estimate, based on the characteristics of the gradient estimator outlined in ([39], Lemma 4.2), we can deduce that
Then, we can obtain
where the second equality is due to Assumption 3, and the four inequalities follow from ([39], Lemma 4.1) and Assumptions 4 and 8.
Thus, we can obtain
At the same time, we can also obtain
The following presents the specific proofs of Lemma 1 to Lemma 6:
Proof of Lemma 1.
Firstly, we know that
where the matrix involved is diagonal. Following its definition, we can deduce that
Therefore, applying recursive expansion, we can obtain
So, we finally obtain
Lemma 1 is proved. □
Proof of Lemma 2.
For the inequality (7), we have
where the first inequality follows from the definitions above, and the last inequality follows from ([24], Lemma 5.2).
For the inequality (8), we have
where the last two inequalities follow from Assumptions 2 and 6.
Thus, Lemma 2 is proved. □
Proof of Lemma 3.
(1) In the first case, the stated chain of implications holds directly, and we obtain the final result.
(2) We have
thus,
Lemma 3 is proved. □
Proof of Lemma 4.
From the smoothness assumption, we know that
Regarding (1), we can obtain
Regarding term (2), according to the relevant definition and Assumption 7, we can obtain
So, when estimating the second term, there is
We consider the last term in (A14); taking this expectation on both sides, we have
By substituting it into (A14) and then taking the expectation of both sides, we can obtain this conclusion.
Lemma 4 is proved. □
Proof of Lemma 5.
We know that
So we can obtain
where the first inequality is due to the bound above and the last inequality follows from the L-smoothness and Lemma 2.
Lemma 5 is proved. □
Proof of Lemma 6.
We first observe that
where (A17) follows from the identity noted above. Continuing to process the latter term in (A17) yields
where the last two inequalities follow from Lemma 1 in [38] and the L-smoothness. Regarding the last term, we can obtain
The last two inequalities here follow from Lemma 1 in [38] and Lemma 2. Then, by combining the aforementioned inequalities (A17)–(A19) with the relevant definition, we can deduce that, under the stated condition, the following formula holds:
where we utilize Lemma 3. Then, combining like terms, we have
Selecting the step-size parameters appropriately, we then have
On the other hand, in the complementary case, applying recursive expansion to (A23), we have
where we utilize that and . Then, by multiplying both sides by , we can obtain
Finally,
where the last inequality is because , so .
Therefore,
given the stated conditions. Multiplying both sides by the appropriate factor, we can obtain
Lemma 6 is proved. □
Now let us prove the final theorem.
Proof of Theorem 1.
We set the parameters as specified above, from which we can infer that
It is clear that . And
where we leverage concavity; the second inequality is valid by the stated condition, and the last inequality is obtained from the preceding bound.
where the second inequality holds true because of the parameter conditions, and the last inequality is obtained based on (A29). Therefore, we have
Subsequently, we set
where we utilize Lemma 3, Lemma 4, and the bound above. By summing the results over the iterations within a round, we can obtain
where the last inequality is derived using Lemma 6 and the stated fact. Subsequently, summing the terms from the start, we obtain
Furthermore, we can obtain
Then, applying Lemma 2 and dividing both sides of the above result by the relevant factor, we can obtain
Regarding the first term in (A36),
For the middle term, we have
For the third term,
We let
and if we choose the parameters appropriately, the stated bounds hold, so we can infer that this quantity is convergent.
Then, with Jensen’s inequality and the bound above, we can obtain
Finally,
where we utilize Young’s inequality together with the stated conditions.
Since is convergent, as mentioned before, it can be known that is also convergent; thus, the theorem is proved. □
References
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
- Shi, Y.; Yang, K.; Yang, Z.; Zhou, Y. Mobile Edge Artificial Intelligence: Opportunities and Challenges; Elsevier: Amsterdam, The Netherlands, 2021.
- Yang, L.; Tan, B.; Zheng, V.W.; Chen, K.; Yang, Q. Federated recommendation systems. In Federated Learning: Privacy and Incentive; Springer International Publishing: Cham, Switzerland, 2020; pp. 225–239.
- Yang, K.; Shi, Y.; Zhou, Y.; Yang, Z.; Fu, L.; Chen, W. Federated machine learning for intelligent IoT via reconfigurable intelligent surface. IEEE Netw. 2020, 34, 16–22.
- Tian, J.; Smith, J.S.; Kira, Z. Fedfor: Stateless heterogeneous federated learning with first-order regularization. arXiv 2022, arXiv:2209.10537.
- Zhang, M.; Sapra, K.; Fidler, S.; Yeung, S.; Alvarez, J.M. Personalized federated learning with first order model optimization. arXiv 2021, arXiv:2012.08565.
- Elbakary, A.; Issaid, C.B.; Shehab, M.; Seddik, K.G.; ElBatt, T.A.; Bennis, M. Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm. arXiv 2024, arXiv:2406.06655.
- Dai, Z.; Low, B.K.H.; Jaillet, P. Federated Bayesian optimization via Thompson sampling. Adv. Neural Inf. Process. Syst. 2020, 33, 9687–9699.
- Staib, M.; Reddi, S.; Kale, S.; Kumar, S.; Sra, S. Escaping saddle points with adaptive gradient methods. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 5956–5965.
- Chen, X.; Li, X.; Li, P. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, San Francisco, CA, USA, 18–20 October 2020; pp. 119–128.
- Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečnỳ, J.; Kumar, S.; McMahan, H.B. Adaptive federated optimization. arXiv 2020, arXiv:2003.00295.
- Wang, Y.; Lin, L.; Chen, J. Communication-efficient adaptive federated learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 22802–22838.
- Zhang, P.; Yang, X.; Chen, Z. Neural network gain scheduling design for large envelope curve flight control law. J. Beijing Univ. Aeronaut. Astronaut. 2005, 31, 604–608.
- Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. A novel framework for the analysis and design of heterogeneous federated learning. IEEE Trans. Signal Process. 2021, 69, 5234–5249.
- Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
- Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. Scaffold: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5132–5143.
- Pathak, R.; Wainwright, M.J. FedSplit: An algorithmic framework for fast federated optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7057–7066.
- Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; Liu, Y. Fedpd: A federated learning framework with adaptivity to non-IID data. IEEE Trans. Signal Process. 2021, 69, 6055–6070.
- Wang, S.; Roosta, F.; Xu, P.; Mahoney, M.W. Giant: Globally improved approximate newton method for distributed optimization. Adv. Neural Inf. Process. Syst. 2018, 31, 2332–2342.
- Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smithy, V. Feddane: A federated newton-type method. In Proceedings of the 2019 53rd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3–6 November 2019; pp. 1227–1231.
- Xu, A.; Huang, H. Coordinating momenta for cross-silo federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 8735–8743.
- Das, R.; Acharya, A.; Hashemi, A.; Sanghavi, S.; Dhillon, I.S.; Topcu, U. Faster non-convex federated learning via global and local momentum. In Proceedings of the Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022; pp. 496–506.
- Khanduri, P.; Sharma, P.; Yang, H.; Hong, M.; Liu, J.; Rajawat, K.; Varshney, P. Stem: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. Adv. Neural Inf. Process. Syst. 2021, 34, 6050–6061.
- Tang, Y.; Zhang, J.; Li, N. Distributed zero-order algorithms for nonconvex multiagent optimization. IEEE Trans. Control Netw. Syst. 2020, 8, 269–281.
- Nikolakakis, K.; Haddadpour, F.; Kalogerias, D.; Karbasi, A. Black-box generalization: Stability of zeroth-order learning. Adv. Neural Inf. Process. Syst. 2022, 35, 31525–31541.
- Balasubramanian, K.; Ghadimi, S. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. Adv. Neural Inf. Process. Syst. 2018, 31, 3459–3468.
- Fang, W.; Yu, Z.; Jiang, Y.; Shi, Y.; Jones, C.N.; Zhou, Y. Communication-efficient stochastic zeroth-order optimization for federated learning. IEEE Trans. Signal Process. 2022, 70, 5058–5073.
- Li, Z.; Ying, B.; Liu, Z.; Yang, H. Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization. arXiv 2024, arXiv:2405.15861.
- Maritan, A.; Dey, S.; Schenato, L. FedZeN: Quadratic convergence in zeroth-order federated learning via incremental Hessian estimation. In Proceedings of the 2024 European Control Conference, Stockholm, Sweden, 25–28 June 2024; pp. 2320–2327.
- Mhanna, E.; Assaad, M. Rendering wireless environments useful for gradient estimators: A zero-order stochastic federated learning method. In Proceedings of the 2024 60th Annual Allerton Conference on Communication, Control, and Computing, Urbana, IL, USA, 24–27 September 2024; IEEE: New York, NY, USA, 2024; pp. 1–8.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
- Ling, X.; Fu, J.; Wang, K.; Liu, H.; Chen, Z. Ali-dpfl: Differentially private federated learning with adaptive local iterations. In Proceedings of the 2024 IEEE 25th International Symposium on a World of Wireless, Mobile and Multimedia Networks, Perth, Australia, 4–7 June 2024; pp. 349–358.
- Cong, Y.; Qiu, J.; Zhang, K.; Fang, Z.; Gao, C.; Su, S.; Tian, Z. Ada-FFL: Adaptive computing fairness federated learning. CAAI Trans. Intell. Technol. 2024, 9, 573–584.
- Huang, Y.; Zhu, S.; Chen, W.; Huang, Z. FedAFR: Enhancing Federated Learning with adaptive feature reconstruction. Comput. Commun. 2024, 214, 215–222.
- Li, Y.; He, Z.; Gu, X.; Xu, H.; Ren, S. AFedAvg: Communication-efficient federated learning aggregation with adaptive communication frequency and gradient sparse. J. Exp. Theor. Artif. Intell. 2024, 36, 47–69.
- Yi, X.; Zhang, S.; Yang, T.; Johansson, K.H. Zeroth-order algorithms for stochastic distributed nonconvex optimization. Automatica 2022, 142, 110353.
- Wu, X.; Huang, F.; Hu, Z.; Huang, H. Faster adaptive federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2023; Volume 37, pp. 10379–10387.
- Gao, X.; Jiang, B.; Zhang, S. On the information-adaptive variants of the ADMM: An iteration complexity perspective. J. Sci. Comput. 2018, 76, 327–363.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).