Article

Privacy-Preserving Distributed Learning via Newton Algorithm

School of Mathematics, Northwest University, Xi’an 710127, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3807; https://doi.org/10.3390/math11183807
Submission received: 19 June 2023 / Revised: 28 July 2023 / Accepted: 30 July 2023 / Published: 5 September 2023
(This article belongs to the Special Issue Data Mining: Analysis and Applications)

Abstract:
Federated learning (FL) is a prominent distributed learning framework. The main barriers to FL include communication cost and privacy breaches. In this work, we propose a novel privacy-preserving second-order FL method, called GDP-LocalNewton. To improve communication efficiency, we use Newton's method to iterate and allow local computations before aggregation. To ensure a strong privacy guarantee, we make use of the notion of differential privacy (DP) to add Gaussian noise in each iteration. Using advanced tools of Gaussian differential privacy (GDP), we prove that the proposed algorithm satisfies the strong notion of GDP. We also establish the convergence of our algorithm. It turns out that the convergence error comes from the local computations and from the Gaussian noise added for DP. We conduct experiments to show the merits of the proposed algorithm.

1. Introduction

Federated learning (FL) is a popular distributed learning paradigm that enables a large number of clients to collaboratively train a global model without sharing their individual data [1]. The most popular algorithm is called federated averaging (FedAvg). In each round, the server broadcasts the current model to all the clients. The clients then run multiple steps of stochastic gradient descent (SGD) in a distributed fashion. After that, the server updates the global model by aggregating the results from the local clients. The convergence rate of FedAvg has been studied extensively in recent years [2,3,4].
One desideratum of a real FL paradigm is communication efficiency. The communication cost of FL mainly comes from the latency cost, that is, the fixed cost of sending messages, which is proportional to the number of communication rounds, regardless of the size of the message. In pursuit of this, many first-order gradient-based methods have been developed, such as local stochastic gradient descent (Local SGD) [1,5,6,7] and mini-batch SGD [8,9,10]. These algorithms reduce the communication cost by performing local computations at the client devices before aggregation. This works well when the clients are mobile resources that have reasonable computational power but may suffer from communication latency [11]. However, for some serverless systems, say, cloud-based systems, the latency cost is severe and communication failures are more frequent. Hence, the number of communication rounds should be further reduced. Recent years have seen several works using second-order methods [11,12,13,14,15] to improve upon the convergence of first-order methods. Wang et al. [12] proposed the GIANT algorithm, a new distributed approximation of Newton's method, and showed an improved convergence rate over distributed first-order competitors. Dünner et al. [13] developed a method that approximates the global Hessian matrix by using the block Hessian matrices from users, thereby reducing the computational burden of the global Hessian matrix. Gupta et al. [11] proposed and analyzed a second-order method with local computations, called LocalNewton. Due to its second-order nature and local computations, LocalNewton is superior in terms of the communication cost.
The other desideratum of a real FL paradigm is privacy preservation. The data from local clients often contain sensitive information about individuals. To some extent, FL preserves individuals' privacy because the original data never leave the clients. However, there exist adversarial attacks that can cause privacy leakage by exploiting the information exchanged during communication; see [16] for an example. Hence, the vanilla FL paradigm does not have a rigorous privacy guarantee. Differential privacy (DP; Dwork [17]) is a standard and well-adopted framework that provides a strong guarantee of an individual's privacy by ensuring that no individual has a significant influence on the algorithm's output. The worst-case influence is termed the privacy budget. A differentially private (DP) algorithm is often achieved by randomizing the algorithm's output. Besides the vanilla $(\epsilon, \delta)$-DP, several other DP notions have been developed, such as Rényi differential privacy (RDP) [18] and Gaussian differential privacy (GDP) [19]. RDP and GDP enjoy an elegant composition property, that is, the theoretical bound on the privacy leakage of repeated queries can be tight and lossless, which brings benefits when accounting for the privacy leakage of iterative algorithms. Using the notion of DP, there have been several studies on developing privacy-preserving FL algorithms; see McMahan et al. [2], Geyer et al. [3], Triastcyn and Faltings [4], Noble et al. [20], Wei et al. [21], Girgis et al. [22], Cheu et al. [23], Rastogi and Nath [24], Huang et al. [25], among others. Most of these algorithms are first-order based.
With these two desiderata in mind, we develop a novel privacy-preserving second-order method called GDP-LocalNewton within the FL framework. Building upon Newton's algorithm, each device performs Newton iterations locally. To avoid privacy leakage, we take advantage of the notion of GDP [19] to add Gaussian noise to the updates in each iteration. In particular, we propose a novel line-search method to determine the step size in the Newton update. After several local steps, the devices send parameter updates to the central server. The server then aggregates these updates and broadcasts them to synchronize all local parameters. This process is iterated until convergence or until specific conditions are attained. It is worth mentioning that we assume a specific adversary, termed the curious onlooker, who can eavesdrop on the communication between the server and the local machines. The server itself is a typical example of a curious onlooker.
We rigorously show that, under a proper parameter set-up, the proposed algorithm satisfies GDP. We also analyze the convergence bound of GDP-LocalNewton. It turns out that the convergence error comprises two components: one comes from the DP noise, and the other comes from the local computations. We also conduct experiments to show the merits of GDP-LocalNewton.

2. GDP-LocalNewton

In this section, we first provide the problem formulation and the framework of GDP. Then, we develop the GDP-LocalNewton algorithm. After that, we provide rigorous privacy analysis and convergence analysis of the proposed algorithm.

2.1. Problem Formulation

2.1.1. Some Notations and Symbols

We represent vectors (e.g., $\mathbf{g}$) and matrices (e.g., $\mathbf{H}$) as bold lowercase and uppercase letters, respectively. $\|\mathbf{g}\|$ denotes the $\ell_2$ norm of a vector, and the spectral norm of a matrix $\mathbf{H}$ is denoted by $\|\mathbf{H}\|$. $\mathbf{I}$ denotes an identity matrix, and the set $\{1, 2, \ldots, n\}$ is denoted by $[n]$. To distinguish the indices of workers and iterations, we denote the worker index as a superscript (e.g., $\mathbf{g}^k$) and the iteration counter as a subscript (e.g., $\mathbf{g}_t$).
Our paper contains a large number of parameters. For ease of reading and distinction, we list them in Table 1.

2.1.2. Fundamental Problem

We consider the empirical risk minimization problem of the following form:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \Big\{ f(\mathbf{w}) \triangleq \frac{1}{n} \sum_{j=1}^{n} f_j(\mathbf{w}) \Big\}, \tag{1}$$
where $f_j(\cdot): \mathbb{R}^d \to \mathbb{R}$, for all $j \in [n] = \{1, 2, \ldots, n\}$, represents the loss of the $j$-th observation, given an underlying parameter $\mathbf{w} \in \mathbb{R}^d$. Generally, we call $f(\cdot)$ the global loss function. In the field of machine learning, such problems are ubiquitous, e.g., logistic and linear regression, support vector machines, neural networks, and graphical models. Taking logistic regression as an example, we have
$$f_j(\mathbf{w}) = \ell_j(\mathbf{w}^T \mathbf{x}_j) = \log\big(1 + \exp(-y_j \mathbf{w}^T \mathbf{x}_j)\big) + \frac{\gamma}{2}\|\mathbf{w}\|^2, \tag{2}$$
where $\ell_j(\cdot)$ is the loss function for sample $j \in [n]$ and $\gamma$ is an appropriately chosen regularization parameter. $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ is the sample matrix containing $n$ data points, and $\mathbf{y} = [y_1, y_2, \ldots, y_n]$ is the corresponding label vector. Hence, $(\mathbf{X}, \mathbf{y})$ defines the training dataset, composed of the pairs $(\mathbf{x}_j, y_j)$, $j = 1, 2, \ldots, n$.
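To make the running example concrete, the following is a minimal NumPy sketch of the regularized logistic loss in Equation (2) together with the gradient and Hessian it induces (the quantities that GDP-LocalNewton later perturbs). The function names and the implementation are ours, not from the paper.

```python
import numpy as np

def logistic_loss(w, X, y, gamma):
    """Average regularized logistic loss f(w) = (1/n) sum_j f_j(w), Eqs. (1)-(2).
    X: d x n sample matrix, y: labels in {-1, +1}, gamma: ridge parameter."""
    margins = y * (X.T @ w)                       # n-vector of y_j * w^T x_j
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * gamma * (w @ w)

def logistic_gradient(w, X, y, gamma):
    """Gradient g = (1/n) sum_j grad f_j(w)."""
    margins = y * (X.T @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))         # sigmoid(-margins)
    return -(X @ (y * sigma)) / X.shape[1] + gamma * w

def logistic_hessian(w, X, y, gamma):
    """Hessian H = (1/n) sum_j hess f_j(w); positive definite for gamma > 0."""
    margins = y * (X.T @ w)
    p = 1.0 / (1.0 + np.exp(-margins))            # sigmoid(margins)
    D = p * (1.0 - p)                             # per-sample curvature weights
    return (X * D) @ X.T / X.shape[1] + gamma * np.eye(X.shape[0])
```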
At the $t$-th iteration, the gradient and the Hessian are denoted by
$$\mathbf{g}_t = \nabla f(\mathbf{w}_t) = \frac{1}{n}\sum_{j=1}^{n} \nabla f_j(\mathbf{w}_t),$$
and
$$\mathbf{H}_t = \nabla^2 f(\mathbf{w}_t) = \frac{1}{n}\sum_{j=1}^{n} \nabla^2 f_j(\mathbf{w}_t),$$
respectively, where $\mathbf{w}_t$ is the updated parameter at the $t$-th iteration. Specifically, in this paper, we make some assumptions about the loss functions: the global loss function $f(\cdot)$ is smooth and strongly convex; each $f_i(\cdot)$ is twice differentiable, $\nabla^2 f_i(\mathbf{w}) \preceq B\mathbf{I}$, and $\|\nabla f_i(\cdot)\|_2 \leq \Gamma$, where $B \in \mathbb{R}$ and $\Gamma \in \mathbb{R}$ are some fixed constants.

2.1.3. Data Distribution

Assume that the FL system has $K$ workers in total. $S_k$, for all $k \in [K] = \{1, 2, \ldots, K\}$, represents a subset chosen uniformly at random from $[n] = \{1, 2, \ldots, n\}$ without replacement, and each worker is uniquely identified by its subset. Every worker holds the same number of samples, $s = |S_k|$ for any $k \in [K]$. Due to the sampling-without-replacement assumption, we have $S_1 \cup S_2 \cup \cdots \cup S_K = [n]$ and $S_i \cap S_j = \emptyset$ for all $i \neq j \in [K]$. Accordingly, $K = n/s$ is the number of workers.
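A minimal sketch of this partition scheme, assuming $K$ divides $n$; the helper name is ours:

```python
import numpy as np

def partition_indices(n, K, seed=0):
    """Split [n] into K disjoint subsets S_1, ..., S_K of equal size s = n // K,
    drawn uniformly without replacement, so that their union covers [n]."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)          # uniform random order of {0, ..., n-1}
    s = n // K
    return [perm[k * s:(k + 1) * s] for k in range(K)]
```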

2.2. Gaussian Differential Privacy

In this subsection, we review the definition of Gaussian differential privacy (GDP) [26]. We then give some theorems and tools about GDP [26], which form the basis of the proof that our algorithm satisfies $\mu$-GDP. Lemmas 1 and 2 show that every user's output satisfies $\mu_k$-GDP. Lemma 3 guarantees DP when the entire dataset is divided into disjoint parts.
Definition 1
($f$-DP). Let $h: \mathbb{R}^{n \times m} \to \mathbb{R}^p$ be a randomized function, and let $D = \{x_1, \ldots, x_i, \ldots, x_n\}$ be a dataset. If, for any $\alpha$-level test between the hypotheses $H_0: x_i = t$ vs. $H_1: x_i = s$ based on the output of $h$, the power function satisfies $\beta(\alpha) \leq 1 - f(\alpha)$, where $f$ is a convex, continuous, non-increasing function with $f(\alpha) \leq 1 - \alpha$ for all $\alpha \in [0, 1]$, then we say $h$ satisfies $f$-DP.
Definition 2
(Gaussian differential privacy). Define $\Phi(\cdot)$ as the standard Gaussian cumulative distribution function. If $h$ is $f$-DP with
$$f(\alpha) \geq \Phi\big(\Phi^{-1}(1-\alpha) - \mu\big), \quad \forall \alpha \in [0, 1],$$
then $h$ is $\mu$-Gaussian differentially private ($\mu$-GDP).
Definition 3
(Global sensitivity). Let $g: \mathbb{R}^{n \times m} \to \mathbb{R}^p$ be a deterministic function. The global sensitivity of $g$ is the (possibly infinite) number
$$\mathrm{GS}_g = \sup_{x_{1:n},\, x'_{1:n} \in \mathbb{R}^{n \times m}} \big\{ \|g(x_{1:n}) - g(x'_{1:n})\|_2 : d_H(x_{1:n}, x'_{1:n}) = 1 \big\},$$
where $x_{1:n}$ denotes a dataset $\{x_1, x_2, \ldots, x_n\}$, and $x'_{1:n}$ differs from $x_{1:n}$ in exactly one datum.
Lemma 1
(Gaussian mechanism of GDP [26]). Let $g: \mathbb{R}^{n \times m} \to \mathbb{R}^p$ be a function with finite global sensitivity $\mathrm{GS}_g$. Let $Z$ be a standard normal $p$-dimensional random vector. For all $\mu > 0$ and $x \in \mathbb{R}^{n \times m}$, the random function $h(x) = g(x) + \frac{\mathrm{GS}_g}{\mu} Z$ is $\mu$-GDP.
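In code, the Gaussian mechanism of Lemma 1 is a one-line perturbation; a minimal sketch (the function name is ours):

```python
import numpy as np

def gaussian_mechanism(g_value, sensitivity, mu, rng):
    """Release g(x) + (GS_g / mu) * Z with Z ~ N(0, I_p); mu-GDP by Lemma 1."""
    return g_value + (sensitivity / mu) * rng.standard_normal(g_value.shape)
```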
Lemma 2
(Matrix Gaussian mechanism [26]). Consider a data matrix $A \in \mathbb{R}^{n \times m}$ such that each row vector $a_i$ satisfies $\|a_i\| \leq 1$. Further define the function $h(A) = \frac{1}{n} A^T A$. Let $W$ be a symmetric random matrix whose upper-triangular elements, including the diagonal, are i.i.d. $\frac{1}{\mu}\mathcal{N}(0, 1)$. Then the random function $\tilde{h}(A) = h(A) + W$ satisfies $\mu$-GDP.
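The symmetric noise matrix $W$ of Lemma 2 (and the matrices $U_t^k$ used later in Algorithm 1) can be sampled by drawing the upper triangle, diagonal included, and mirroring it; a sketch (the helper name is ours):

```python
import numpy as np

def symmetric_gaussian_matrix(d, scale, rng):
    """Symmetric d x d matrix whose upper-triangular entries (diagonal included)
    are i.i.d. N(0, scale^2); the lower triangle mirrors the upper one."""
    W = np.zeros((d, d))
    iu = np.triu_indices(d)                        # indices of the upper triangle
    W[iu] = scale * rng.standard_normal(len(iu[0]))
    return W + np.triu(W, k=1).T                   # mirror strictly-upper part down
```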
Lemma 3
(Parallel composition [27]). Suppose we have a set of privacy mechanisms $M = \{M_1, \ldots, M_m\}$. If each $M_i$ provides a $\mu_i$-GDP guarantee on a disjoint subset of the entire dataset, then $M$ provides $(\max\{\mu_1, \ldots, \mu_m\})$-GDP.
Lemma 4
(Composition of GDP [19]). The $n$-fold composition of $\mu_i$-GDP mechanisms is $\sqrt{\mu_1^2 + \cdots + \mu_n^2}$-GDP.
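As a worked instance of Lemma 4, and the way we read the budget split used in Algorithm 1 below: if each of $T$ iterations releases a noisy gradient and a noisy Hessian, each calibrated to be $\frac{\mu}{\sqrt{2T}}$-GDP, then the $2T$-fold composition gives
$$\sqrt{\sum_{i=1}^{2T} \frac{\mu^2}{2T}} = \sqrt{2T \cdot \frac{\mu^2}{2T}} = \mu,$$
so the whole run is $\mu$-GDP.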

2.3. GDP-LocalNewton Algorithm

In this subsection, we propose the GDP-LocalNewton algorithm for privacy-preserving distributed learning; see Algorithm 1.
Algorithm 1 GDP-LocalNewton
1: Input: Initial iterate $\bar{\mathbf{w}}_0 \in \mathbb{R}^d$; maximum line-search step size $\alpha^*$; line-search parameter $0 < \beta \leq 1/2$; privacy parameter $\mu$; loss-function parameters $B$ and $\Gamma$; iteration parameter $T$; set of communication rounds $I_T \subseteq \{1, 2, \ldots, T\}$.
2: for $k = 1$ to $K$ in parallel do
3:     Initialization: $\mathbf{w}_0^k = \bar{\mathbf{w}}_0$
4:     for $t = 0$ to $T - 1$ do
5:         if $t \in I_T$ then
6:             $\bar{\mathbf{w}}_t = \frac{1}{K}\sum_{k=1}^{K} \mathbf{w}_t^k$
7:             $\mathbf{w}_t^k = \bar{\mathbf{w}}_t$
8:         end if
9:         $\hat{\mathbf{g}}_t^k = \mathbf{g}_t^k + \frac{2\Gamma\sqrt{2T}}{\mu s} Z_t^k$
10:        $\hat{\mathbf{H}}_t^k = \mathbf{H}_t^k + \frac{2B\sqrt{2T}}{\mu s} U_t^k$
11:        $\hat{\mathbf{p}}_t^k = \hat{\mathbf{H}}^k(\mathbf{w}_t^k)^{-1}\, \hat{\mathbf{g}}^k(\mathbf{w}_t^k)$
12:        Find the step size $\alpha_t^k$ using the line search (Equation (10))
13:        Update the model: $\mathbf{w}_{t+1}^k = \mathbf{w}_t^k - \alpha_t^k \hat{\mathbf{p}}_t^k$
14:    end for
15: end for
At the $k$-th worker in the $t$-th iteration, we define the local loss function (at the local iterate $\mathbf{w}_t^k$) as
$$f^k(\mathbf{w}_t^k) = \frac{1}{s} \sum_{j \in S_k} f_j(\mathbf{w}_t^k). \tag{3}$$
The $k$-th worker's target is to minimize the local loss in Equation (3). The corresponding local gradient $\mathbf{g}_t^k$ and local Hessian $\mathbf{H}_t^k$ at the $k$-th worker in the $t$-th iteration are
$$\mathbf{g}_t^k = \mathbf{g}^k(\mathbf{w}_t^k) = \nabla f^k(\mathbf{w}_t^k) = \frac{1}{s} \sum_{j \in S_k} \nabla f_j(\mathbf{w}_t^k),$$
$$\mathbf{H}_t^k = \mathbf{H}^k(\mathbf{w}_t^k) = \nabla^2 f^k(\mathbf{w}_t^k) = \frac{1}{s} \sum_{j \in S_k} \nabla^2 f_j(\mathbf{w}_t^k).$$
The updates of the local workers' parameters are given by
$$\mathbf{w}_{t+1}^k = \begin{cases} \mathbf{w}_t^k - \alpha_t^k (\hat{\mathbf{H}}_t^k)^{-1} \hat{\mathbf{g}}_t^k, & \text{if } t \notin I_T, \\ \bar{\mathbf{w}}_t - \alpha_t^k \hat{\mathbf{H}}^k(\bar{\mathbf{w}}_t)^{-1} \hat{\mathbf{g}}^k(\bar{\mathbf{w}}_t), & \text{if } t \in I_T, \end{cases} \tag{4}$$
where $\bar{\mathbf{w}}_t = \frac{1}{K}\sum_{k=1}^{K} \mathbf{w}_t^k$ and
$$\hat{\mathbf{g}}_t^k = \mathbf{g}_t^k + \frac{2\Gamma\sqrt{2T}}{\mu s} Z_t^k, \qquad \hat{\mathbf{g}}^k(\bar{\mathbf{w}}_t) = \mathbf{g}^k(\bar{\mathbf{w}}_t) + \frac{2\Gamma\sqrt{2T}}{\mu s} Z_t^k,$$
and
$$\hat{\mathbf{H}}_t^k = \mathbf{H}_t^k + \frac{2B\sqrt{2T}}{\mu s} U_t^k, \qquad \hat{\mathbf{H}}^k(\bar{\mathbf{w}}_t) = \mathbf{H}^k(\bar{\mathbf{w}}_t) + \frac{2B\sqrt{2T}}{\mu s} U_t^k;$$
then,
$$\hat{\mathbf{p}}_t^k = (\hat{\mathbf{H}}_t^k)^{-1} \hat{\mathbf{g}}_t^k,$$
where $\{Z_t^k\}$ is a sequence of i.i.d. standard $d$-dimensional Gaussian random vectors and $\{U_t^k\}$ is a sequence of i.i.d. symmetric random matrices whose upper-triangular elements, including the diagonals, are i.i.d. standard normal random variables. Here, $\mu$ is the privacy parameter, $B$ is the norm bound of the Hessian, and $\Gamma$ is the norm bound of the gradient. $I_T$ is the set of communication rounds, at which every worker communicates its updated parameters to the server; $T$ denotes the set of iterations; and $L$ is the number of local computations between communications. We have $|T| = |I_T| \times L$, e.g., $T = \{1, 2, \ldots, L, \ldots, 2L, \ldots, |I_T| \times L\}$. In particular, $L = 1$ means no local computation.
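A minimal sketch of one noisy local Newton step (lines 9–11 of Algorithm 1), under our reading of the noise scales above; the function name is ours:

```python
import numpy as np

def noisy_newton_direction(g, H, Gamma, B, T, mu, s, rng):
    """One worker's private descent direction p_hat = H_hat^{-1} g_hat, with
    gradient/Hessian noise scales 2*Gamma*sqrt(2T)/(mu*s) and 2*B*sqrt(2T)/(mu*s)."""
    d = g.shape[0]
    Z = rng.standard_normal(d)                   # standard Gaussian vector
    U = np.triu(rng.standard_normal((d, d)))     # i.i.d. upper triangle + diagonal
    U = U + np.triu(U, k=1).T                    # mirror to a symmetric matrix
    g_hat = g + (2.0 * Gamma * np.sqrt(2.0 * T) / (mu * s)) * Z
    H_hat = H + (2.0 * B * np.sqrt(2.0 * T) / (mu * s)) * U
    return np.linalg.solve(H_hat, g_hat)
```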
Remark 1.
In practice, $\hat{\mathbf{H}}^k(\bar{\mathbf{w}}_t)$ may not be positive definite, so we truncate its eigenvalues as $\max\{\lambda_j, \epsilon\}$, where $\epsilon > 0$ equals the regularization parameter. The truncated matrix is still differentially private as a result of the post-processing property of GDP ([19], Proposition 2.8).
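A sketch of this truncation; since it only post-processes the already-private matrix $\hat{\mathbf{H}}^k(\bar{\mathbf{w}}_t)$, no extra privacy budget is spent:

```python
import numpy as np

def truncate_eigenvalues(H_hat, eps):
    """Clamp the eigenvalues of a symmetric matrix to at least eps > 0, making it
    positive definite; pure post-processing, so privacy is preserved (Remark 1)."""
    lam, V = np.linalg.eigh(H_hat)               # symmetric eigendecomposition
    return (V * np.maximum(lam, eps)) @ V.T
```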
Here, we propose a novel step-size selection rule that copes with the negative influence of the noise; it plays an important role in the convergence of GDP-LocalNewton.
Step-size selection: Let each worker locally choose a step size according to the following rule:
$$\alpha_t^k = \max_{\alpha \leq \alpha^*} \alpha \quad \text{such that} \quad f^k\big(\mathbf{w}_t^k - \alpha \hat{\mathbf{p}}_t^k\big) \leq f^k(\mathbf{w}_t^k) - \alpha\beta (\hat{\mathbf{p}}_t^k)^T \mathbf{g}_t^k + \alpha\gamma (\hat{\mathbf{p}}_t^k)^T \mathbf{H}_t^k \tilde{N}_t^k, \tag{10}$$
for some $\beta \in (0, 1/2]$ and $\gamma = 1 - \beta$, so that
$$\alpha^* \leq \min\left\{ \frac{2(1-\beta)\kappa(1-\epsilon)}{M(1+\epsilon)},\ \frac{2\beta\kappa^2}{3M[M - \kappa/4]L} \right\} < 1,$$
where $\tilde{N}_t^k = (\hat{\mathbf{H}}_t^k)^{-1}\hat{\mathbf{g}}_t^k - (\mathbf{H}_t^k)^{-1}\mathbf{g}_t^k$, and $M$ and $\kappa$ are the smoothness and strong-convexity parameters of the loss, introduced in the next subsection.
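A sketch of the step-size rule (10). The paper does not specify how the maximization over $\alpha \leq \alpha^*$ is carried out; we scan a geometric grid, which is a standard backtracking choice and an assumption on our part:

```python
import numpy as np

def line_search(f_k, w, p_hat, g, H, N_tilde, alpha_max, beta, shrink=0.5, max_iter=30):
    """Return the largest alpha <= alpha_max on a geometric grid satisfying Eq. (10):
    f_k(w - a*p_hat) <= f_k(w) - a*beta*p_hat@g + a*(1-beta)*p_hat@(H@N_tilde)."""
    gamma = 1.0 - beta
    f0 = f_k(w)
    descent = p_hat @ g                      # first-order decrease term
    noise_term = p_hat @ (H @ N_tilde)       # correction term for the DP noise
    alpha = alpha_max
    for _ in range(max_iter):
        if f_k(w - alpha * p_hat) <= f0 - alpha * beta * descent + alpha * gamma * noise_term:
            return alpha
        alpha *= shrink                      # backtrack geometrically
    return alpha
```

Note that $\tilde{N}_t^k$ is computable locally, since each worker knows both its noisy and noiseless gradient and Hessian.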
Remark 2.
In the subsequent theoretical section, we can observe the advantages and significance of the line search in the algorithm's convergence analysis. In the experiments, we compare our strategy with the fixed step size and the decaying step size; the results show that our strategy performs best. The decaying step size compresses the step size of the previous iteration at each iteration, i.e., $\alpha_{t+1} = \sigma^{t+1}\alpha_t$, where $\sigma$ is the decaying rate.

2.4. Theoretical Results

2.4.1. Assumptions on the Loss Functions

We need the following assumptions on the loss functions.
Assumption 1.
$f_i(\cdot)$, for all $i \in [n]$, is twice differentiable.
Assumption 2.
$f(\cdot)$ is $\kappa$-strongly convex, that is, $\nabla^2 f(\mathbf{w}) \succeq \kappa\mathbf{I}$.
Assumption 3.
$f(\cdot)$ is $M$-smooth, that is, $\nabla^2 f(\mathbf{w}) \preceq M\mathbf{I}$.
Assumption 4.
For all $i \in [n]$, $\|\nabla^2 f_i(\cdot)\|_2$ and $\|\nabla f_i(\cdot)\|_2$ are upper bounded, such that $\nabla^2 f_i(\mathbf{w}) \preceq B\mathbf{I}$ and $\|\nabla f_i(\cdot)\|_2 \leq \Gamma$, where $B \in \mathbb{R}$ and $\Gamma \in \mathbb{R}$ are some fixed constants.

2.4.2. Privacy

Theorem 1 shows that GDP-LocalNewton satisfies $\mu$-GDP.
Theorem 1.
Assuming Assumptions 1–4 hold, the output of the $k$-th worker in (4) satisfies $\frac{\mu}{\sqrt{T}}$-GDP at each iteration. Furthermore, after $T$ iterations, the whole algorithm satisfies $\mu$-GDP.
Proof. 
Firstly, we compute the global sensitivities of $s\mathbf{g}_t^k$ and $s\mathbf{H}_t^k$ as follows:
$$\mathrm{GS}_{s\mathbf{g}_t^k} = \big\|s\mathbf{g}_t^k(x_{1:n}) - s\mathbf{g}_t^k(x'_{1:n})\big\| = \big\|\nabla f_j(\cdot) - \nabla f_{j'}(\cdot)\big\| \leq 2\Gamma,$$
$$\mathrm{GS}_{s\mathbf{H}_t^k} = \big\|s\mathbf{H}_t^k(x_{1:n}) - s\mathbf{H}_t^k(x'_{1:n})\big\| = \big\|\nabla^2 f_j(\cdot) - \nabla^2 f_{j'}(\cdot)\big\| \leq 2B,$$
where we use Assumption 4, and $j$ and $j'$ index the datum in which the two neighboring datasets differ. Then, using Lemma 4, we know that the output of the $k$-th worker in (4) satisfies $\frac{\mu}{\sqrt{T}}$-GDP at each iteration. Finally, through Lemmas 3 and 4, the whole algorithm satisfies $\mu$-GDP. The proof is completed.    □

2.4.3. Convergence Analysis

The following three lemmas show that we can separate the Gaussian noises added in the algorithm and bound them so as to guarantee the convergence of GDP-LocalNewton.
Lemma 5
([26]). Let $X \in \mathbb{R}^d$ be a sub-Gaussian random vector with variance proxy $\sigma^2$. For any $\alpha > 0$, with probability at least $1-\alpha$,
$$\|X\|_2 \leq 4\sigma\sqrt{d} + 2\sigma\sqrt{2\log(1/\alpha)}.$$
Lemma 6
([26]). Let $W$ be a symmetric $d \times d$ random matrix whose upper-triangular elements, including the diagonal, are i.i.d. $\mathcal{N}(0, 1)$. For any $\alpha > 0$, with probability at least $1-\alpha$,
$$\|W\|_2 \leq 2\sqrt{d\log(2d/\alpha)}.$$
Lemma 7.
Suppose Assumptions 1–4 hold. Fix the total number of iterations $T$ and the GDP parameter $\mu$, and let $s \geq \frac{4B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu}$. Then, using Lemmas 1 and 2, we can rewrite the noisy term in (4) as
$$(\hat{\mathbf{H}}_t^k)^{-1}\hat{\mathbf{g}}_t^k = (\mathbf{H}_t^k)^{-1}\mathbf{g}_t^k + \tilde{N}_t^k.$$
Here, with probability at least $1-\xi_0$, we have
$$\|\tilde{N}_t^k\| \leq C\tilde{M}_{\mathrm{privacy}},$$
where $\tilde{M}_{\mathrm{privacy}} = \frac{\Gamma B T\sqrt{d\log(d/\xi_0)}}{\mu s\kappa^2(1-\epsilon)^2}$ and $C$ is a constant.
Proof. 
See Appendix B.1.    □
Lemma 8.
Let the function $f(\cdot)$ satisfy Assumptions 1–3, and suppose that the step size $\alpha_t^k$ satisfies the line-search condition in (10). Also, let $0 < \epsilon < 1/2$, $0 < \delta, \xi_0 < 1$, and let $\mu$ and $T$ be fixed constants. Moreover, let the sample size $s \geq \frac{4B}{\kappa\epsilon^2}\log\frac{2d}{\delta}$. Then, for the update defined in Equation (4) at the $k$-th worker, we have
$$f^k(\mathbf{w}_{t+1}^k) - f^k(\mathbf{w}_t^k) \leq -\psi\|\mathbf{g}_t^k\|^2 + \alpha^*(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \alpha^*(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2,$$
with probability at least $1-\delta$.
If $s \geq \frac{4B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu}$ and $\beta = 1/2$, then using Lemma 7 we obtain
$$f^k(\mathbf{w}_{t+1}^k) - f^k(\mathbf{w}_t^k) \leq -\psi\|\mathbf{g}_t^k\|^2 + \frac{\alpha^*}{2}M(1+\epsilon)\big(C\tilde{M}_{\mathrm{privacy}}\big)^2,$$
with probability at least $1-\delta-\xi_0$, where $\psi = \frac{\alpha^*\beta}{M(1+\epsilon)}$. Note that $\mathbf{w}_{t+1}^k = \mathbf{w}_t^k - \alpha^*\hat{\mathbf{p}}_t^k$, and $\mathbf{g}_t^k$ is the worker's local gradient. $\tilde{N}_t^k$, $\tilde{M}_{\mathrm{privacy}}$, and $C$ are from Lemma 7.
Proof. 
See Appendix B.1.    □
Remark 3.
In Lemma 8, we can see that the latter two terms come from the DP noise. The second term can be eliminated by setting $\beta = 1/2$; this is the merit of our new line-search rule, which reduces the negative effect of the randomness. As a result, Lemma 8 shows that the novel line search improves the convergence of the algorithm. A small $\|\tilde{N}_t^k\|^2$ leads to a small third term.
Theorem 2.
(The $L = 1$ case). Suppose that Assumptions 1–4 hold and that the step size $\alpha_t^k$ satisfies the line-search condition (10). Also, let $T$, $\mu$, $0 < \delta, \xi_0 < 1$, $0 < \epsilon, \epsilon_1 < 1/2$ be fixed constants, and let $\beta = 1/2$ and $\Gamma = \max_{1 \leq i \leq n}\|\nabla f_i(\cdot)\|$. Moreover, assume that the sample size for each worker satisfies $s \geq \frac{4B}{\kappa\epsilon^2}\log\frac{2d}{\delta}$, where the samples are chosen without replacement. For GDP-LocalNewton in the $L = 1$ case, we obtain the following:
1. If $s \geq \max\left\{\frac{4B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu},\ \frac{\Gamma^2}{\epsilon_1^2 G^2}\log(d/\delta)\right\}$, where $G = \min_k\|\mathbf{g}^k(\bar{\mathbf{w}}_t)\|$, then with probability at least $1-K(6\delta+\xi_0)$,
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq \rho_1\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \tilde{M}_{\hat{\varphi}}; \tag{14}$$
2. If $s \geq \frac{2B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu}$, then with probability at least $1-K(6\delta+\xi_0)$,
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq \rho_2\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \tilde{M}_{\varphi} + \frac{\eta\Gamma}{\kappa(1-\epsilon)}. \tag{15}$$
Here, $\rho_i = 1 - 2\kappa C_i$ for $i \in \{1, 2\}$, with $C_1 = \psi - \frac{\epsilon_1}{\kappa(1-\epsilon)} - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$, $C_2 = \psi - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$, $\psi = \frac{\alpha^*\beta}{M(1+\epsilon)}$, and $\eta = \big(1+\sqrt{2\log(1/\delta)}\big)\frac{1}{\sqrt{s}}\Gamma$. Moreover, $\tilde{M}_{\hat{\varphi}} = \epsilon_1(C\tilde{M}_{\mathrm{privacy}})\Gamma + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2$ and $\tilde{M}_{\varphi} = \eta(C\tilde{M}_{\mathrm{privacy}}) + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2$ are constants, and $C\tilde{M}_{\mathrm{privacy}}$ is from Lemma 7.
Proof. 
See Appendix C.1 and C.2.    □
Remark 4.
Compared to LocalNewton with $L = 1$ [11], privacy protection leads to an extra error term $\tilde{M}_{\hat{\varphi}}$. This error term reflects the trade-off between privacy protection and model utility: to decrease $\tilde{M}_{\hat{\varphi}}$, each worker needs more data.
In this theorem, we choose $\beta = 1/2$ so that the second random term in Lemma 8 is eliminated; fewer random terms generally make the algorithm more stable.
Theorem 3.
(The $L > 1$ case). Suppose Assumptions 1–4 hold and the step size $\alpha_t^k$ satisfies the line-search condition (10). Also, let $T$, $\mu$, $0 < \delta, \xi_0 < 1$, $0 < \epsilon < 1/2$ be fixed constants, and let $\Gamma = \max_{1 \leq i \leq n}\|\nabla f_i(\cdot)\|$. Moreover, assume that the sample size for each worker satisfies $s \geq \max\left\{\frac{4B}{\kappa\epsilon^2}\log\frac{2d}{\delta},\ \frac{2B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu}\right\}$. Then, the GDP-LocalNewton updates $\bar{\mathbf{w}}_{t_0}$ in the case $L > 1$ satisfy
$$f(\bar{\mathbf{w}}_{t_0+1}) - f(\bar{\mathbf{w}}_{t_0}) \leq -\frac{C}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\|\mathbf{g}_\tau^k\|^2 + \frac{\eta L\Gamma}{\kappa(1-\epsilon)} + L\tilde{M}_{\varphi},$$
with probability at least $1-KL(6\delta+\xi_0)$. Here, $C = \psi - \frac{L[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$, $\psi = \frac{\alpha^*\beta}{M(1+\epsilon)}$, and $\eta = \big(1+\sqrt{2\log(1/\delta)}\big)\frac{1}{\sqrt{s}}\Gamma$. $\tilde{M}_{\varphi} = \eta(C\tilde{M}_{\mathrm{privacy}}) + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2$ is a constant, and $C\tilde{M}_{\mathrm{privacy}}$ is from Lemma 7.
Proof. 
See Appendix C.3.    □

3. Empirical Evaluation

In this section, we evaluate the numerical performance of GDP-LocalNewton under $L = 1$ and $L > 1$, respectively. In addition, we design experiments to explore the performance of different step-size strategies, namely the line-search step size, the fixed step size, and the decaying step size. We conduct both simulation and real-data experiments.
The simulated datasets are generated according to the logistic model. We set $d = 10$ and $n = 50{,}000$; each entry of the model parameter vector $\mathbf{w} = (w_1, \ldots, w_{10})$ is generated i.i.d. from $U(-0.5, 0.5)$. For each sample, each entry of the predictor vector is generated i.i.d. from the standard Gaussian distribution, and the labels are then generated from the logistic regression model. We also generate 10,000 samples for testing.
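A sketch of this data-generating process; the seed and helper name are ours:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, n_test = 10, 50_000, 10_000
w_true = rng.uniform(-0.5, 0.5, size=d)            # entries i.i.d. U(-0.5, 0.5)

def sample_logistic(m):
    X = rng.standard_normal((d, m))                # i.i.d. standard Gaussian features
    prob = 1.0 / (1.0 + np.exp(-(X.T @ w_true)))   # P(y = +1 | x) under the logistic model
    y = np.where(rng.uniform(size=m) < prob, 1.0, -1.0)
    return X, y

X_train, y_train = sample_logistic(n)
X_test, y_test = sample_logistic(n_test)
```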
The real datasets we use are summarized in Table 2, which are publicly available in LIBSVM [28].

3.1. Without Local Computation ( L = 1 )

We fix the total number of clients $K$ at 50; hence, the number of samples per client is $s = n/50$. We fix the total number of communication rounds $T$ at 10, and the regularization parameter at 0.001. We set the privacy parameter $\mu = 1, 5$, and 10, respectively. We compare our method with the GDP-GD algorithm with $\mu = 1$ and with LocalNewton, which corresponds to $\mu = \infty$. The details of GDP-GD can be found in Appendix D. Note that we set the maximum step size $\alpha^*$ of GDP-LocalNewton to be the fixed step size of GDP-GD. This is because, due to noise accumulation, the step sizes of GDP-perturbed algorithms should be much smaller than those of their non-private counterparts in order to ensure convergence.
Figure 1 and Figure 2 display the training loss and test accuracy of each method on one simulated dataset and three real datasets, where the training loss is the empirical average of the loss function evaluated at the training data points, and the test accuracy is the proportion of test samples that are correctly classified. As Figure 1 and Figure 2 show, GDP-LocalNewton with different $\mu$ converges much faster than GDP-GD with $\mu = 1$ on all datasets. This means that our privacy-preserving second-order algorithm clearly outperforms the privacy-preserving first-order algorithm GDP-GD. We can also see that as the privacy parameter $\mu$ increases, the performance of GDP-LocalNewton improves, especially when the number of communication rounds is large.

3.2. With Local Computation ( L > 1 )

We use the a9a and ijcnn1 datasets to show how different values of $L$ influence the performance of GDP-LocalNewton. We set the GDP parameter $\mu = 2$ and $\mu = 4$ for the a9a dataset and $\mu = 1$ and $\mu = 2$ for the ijcnn1 dataset, the local computation parameter $L = 1$ and $L = 3$, and the regularization $\gamma = 0.5$. The number of communication rounds is 8. The other parameter set-ups are the same as in the former experiments.
As Figure 3 shows, GDP-LocalNewton converges in all cases. For equal values of $\mu$, $L = 3$ speeds up the convergence of the algorithm, which means that the noises do not substantially slow down convergence. At the 7th communication round, the error from privacy protection is bigger than the error from local computation, which means that privacy protection affects the usefulness of GDP-LocalNewton more than local computation does.

3.3. Different Step Size Strategies

In this section, we explore how different step-size strategies, including the line-search step size ($\alpha^* = 0.03$), the fixed step size ($\alpha_t^k = 0.03$), and the decaying step size (initial step size $\alpha = 0.03$), impact GDP-LocalNewton under different levels of privacy protection (different $\mu$ values). In addition, we set different decaying rates to show their impact. Under $L = 1$, we use the simulated and a9a data. Due to the varying characteristics of the datasets, we use different parameter settings for some of them: on the generated data, we set $\mu = 1$ and 5, with a well-chosen decaying rate of $\sigma = 0.5$; on the a9a data, we set $\mu = 5$ and 10, with a well-chosen decaying rate of $\sigma = 0.9$.
As Figure 4 and Figure 5 show, the line-search step size performs well and stably in every case, which validates the effectiveness of our strategy both theoretically and experimentally. The reason for the similarity between the fixed step size and the line-search step size is that the noise causes some of the searched steps to equal the maximum step size, making them equivalent to the fixed step size. For the decaying step size, the experiments show that the decaying rate is crucial for convergence. An appropriate decaying rate helps the algorithm converge: in the later stages, decaying to a smaller step size resists the negative impact of the noise and further promotes convergence, as in Figure 4a and Figure 5a. An excessive decaying rate leads to non-convergence, like the 0.1 decaying rate in every figure. However, in practice, tuning the decaying rate increases the operational cost and raises the risk of privacy leakage. Therefore, the line-search strategy is more effective, efficient, and stable.

4. Conclusions

In this paper, we proposed a novel algorithm called GDP-LocalNewton for privacy-preserving and communication-efficient distributed learning. To improve communication efficiency, we developed the algorithm based on Newton's method and traded communication for more local computations. To handle possible privacy leakage to a curious onlooker, we adopted the notion of GDP [19] by adding Gaussian noise to the updates of each local machine. In particular, we developed a step-searching strategy to determine the step size in the noisy Newton update. We validated the effectiveness, efficiency, and stability of our strategy through experiments. We theoretically studied the convergence of GDP-LocalNewton, which turns out to have two error terms, one corresponding to the privacy protection and the other to the local computation. The experiments corroborated the theoretical findings.
There are some interesting problems that deserve further study. First, the DP framework provides strong privacy protection; however, a DP algorithm often incurs a considerable accuracy loss compared with its non-DP counterpart, so it is desirable to consider other, weaker notions of privacy [29]. Second, how to handle the heterogeneity among local datasets, both algorithmically and theoretically, within the second-order framework deserves further study. Finally, it is worth tackling the other issues and challenges that FL faces within the second-order framework.

Author Contributions

Conceptualization, H.Z.; methodology, Z.C., X.G. and H.Z.; experiment, Z.C.; theory, Z.C. and X.G.; writing, Z.C., X.G. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by: 1. National Natural Science Foundation of China-Guangdong Joint Fund (U1811461). 2. Natural Science Foundation of Shaanxi Province (No. 2021JQ-429).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Some Auxiliary Lemmas

Here, we prove the auxiliary lemmas that are used in the main proofs of the paper.
Lemma A1
([11]). Let $f(\cdot)$ satisfy Assumptions 1–4, and let $0 < \epsilon < 1/2$ and $0 < \delta < 1$ be fixed constants. Then, if $s \geq \frac{4B}{\kappa\epsilon^2}\log\frac{2d}{\delta}$, the local Hessian matrix at the $k$-th worker satisfies
$$(1-\epsilon)\kappa \preceq \nabla^2 f^k(\mathbf{w}) = \mathbf{H}^k(\mathbf{w}) \preceq (1+\epsilon)M,$$
for all $\mathbf{w} \in \mathbb{R}^d$ and $k \in [K]$, with probability at least $1-\delta$.
Lemma A2
(McDiarmid's inequality). Let $X = (X_1, X_2, \ldots, X_m)$ be $m$ independent random variables taking values in a set $A$, and assume that $f: A^m \to \mathbb{R}$ satisfies the following bounded-difference condition:
$$\sup_{x_1,\ldots,x_m,\hat{x}_i}\big|f(x_1,\ldots,x_i,\ldots,x_m) - f(x_1,\ldots,\hat{x}_i,\ldots,x_m)\big| \leq c_i,$$
for all $i \in \{1,\ldots,m\}$. Then, for any $\epsilon > 0$, we have
$$\mathbb{P}\big[f(X_1,\ldots,X_m) - \mathbb{E}[f(X_1,\ldots,X_m)] \geq \epsilon\big] \leq \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^{m}c_i^2}\Big).$$
Lemma A3
([11]). Let $S \in \mathbb{R}^{n \times s}$ be any uniform sampling matrix. Then, for any matrix $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_n] \in \mathbb{R}^{d \times n}$ and any $\delta > 0$, with probability $1-\delta$ we have
$$\Big\|\frac{1}{n}\mathbf{B}SS^T\mathbf{1} - \frac{1}{n}\mathbf{B}\mathbf{1}\Big\| \leq \Big(1+\sqrt{2\log\frac{1}{\delta}}\Big)\frac{1}{\sqrt{s}}\max_i\|\mathbf{b}_i\|,$$
where $\mathbf{B}\mathbf{1}$ is the sum of the columns of $\mathbf{B}$ and $\mathbf{B}SS^T\mathbf{1}$ is the sum of the uniformly sampled and scaled columns of $\mathbf{B}$, with scaling factor $\frac{1}{sp}$ and $p = \frac{1}{n}$. If $(i_1,\ldots,i_s)$ is the set of sampled indices, then $\mathbf{B}SS^T\mathbf{1} = \sum_{k \in (i_1,\ldots,i_s)}\frac{1}{sp}\mathbf{b}_k$.
From Lemma A3, we can easily obtain a key corollary, which is used to bound $|(\mathbf{p}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) - (\mathbf{p}_t^k)^T\mathbf{g}^k(\bar{\mathbf{w}}_t)|$. This bound is crucial for our main theorems.
Corollary A1.
Let $\mathbf{g}^k(\bar{\mathbf{w}}_t)$ be the gradient of the loss function at the $k$-th worker, and let $\mathbf{g}(\bar{\mathbf{w}}_t)$ be the gradient of the global loss function, where $\bar{\mathbf{w}}_t$ is the parameter at the $t$-th communication ($L \geq 1$). Then, we obtain the following:
1. Provided that $\|\mathbf{g}_i(\bar{\mathbf{w}}_t)\| \leq \Gamma$, where $\Gamma = \max_{1 \leq i \leq n}\|\nabla f_i(\cdot)\|$, then
$$\|\mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\| \leq \Big(1+\sqrt{2\log\frac{1}{\delta}}\Big)\frac{1}{\sqrt{s}}\Gamma.$$
Writing $\eta = \big(1+\sqrt{2\log(1/\delta)}\big)\frac{1}{\sqrt{s}}\Gamma$, we succinctly have
$$\big|\langle v, \mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\rangle\big| \leq \|v\|\,\|\mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\| \leq \eta\|v\|$$
with probability at least $1-\delta$, where $\eta = O(1/\sqrt{s})$ is small.
2. Provided that $\|\mathbf{g}_i(\bar{\mathbf{w}}_t)\| \leq \Gamma$, where $\Gamma = \max_{1 \leq i \leq n}\|\nabla f_i(\cdot)\|$, and $\|\mathbf{g}^k(\bar{\mathbf{w}}_t)\| \geq G$, then using the vector Bernstein inequality with $t = \epsilon_1\|\mathbf{g}^k\|$, we obtain
$$\mathbb{P}\Big(\|\mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\| \geq \epsilon_1\|\mathbf{g}^k(\bar{\mathbf{w}}_t)\|\Big) \leq d\exp\Big(-\frac{s\epsilon_1^2\|\mathbf{g}^k\|^2}{32\Gamma^2} + \frac{1}{4}\Big) \leq d\exp\Big(-\frac{s\epsilon_1^2 G^2}{32\Gamma^2} + \frac{1}{4}\Big).$$
So, as long as
$$G^2 = \Omega\Big(\frac{\Gamma^2}{\epsilon_1^2 s}\log(d/\delta)\Big),$$
or,
$$s \geq \frac{\Gamma^2}{\epsilon_1^2 G^2}\log(d/\delta),$$
we have
$$\big|\langle v, \mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\rangle\big| \leq \|v\|\,\|\mathbf{g}^k(\bar{\mathbf{w}}_t) - \mathbf{g}(\bar{\mathbf{w}}_t)\| \leq \epsilon_1\|v\|\,\|\mathbf{g}^k(\bar{\mathbf{w}}_t)\|$$
with probability at least $1-\delta$.

Appendix B. The Proofs of Some Lemmas in The Context

Appendix B.1. The Proof of Lemma 7

Proof. 
Let $\tilde{U}_t^k = \frac{2B\sqrt{2T}}{\mu s}U_t^k$ and $\tilde{Z}_t^k = \frac{2\Gamma\sqrt{2T}}{\mu s}Z_t^k$, and note that the Neumann series formula leads to the identity
$$\Big(\mathbf{H}_t^k + \frac{2B\sqrt{2T}}{\mu s}U_t^k\Big)^{-1} = (\mathbf{H}_t^k)^{-1}\sum_{j=0}^{\infty}\big(-\tilde{U}_t^k(\mathbf{H}_t^k)^{-1}\big)^j.$$
Hence,
$$(\hat{\mathbf{H}}_t^k)^{-1}\hat{\mathbf{g}}_t^k = (\mathbf{H}_t^k)^{-1}\sum_{j=0}^{\infty}\big(-\tilde{U}_t^k(\mathbf{H}_t^k)^{-1}\big)^j\big(\mathbf{g}_t^k + \tilde{Z}_t^k\big) = (\mathbf{H}_t^k)^{-1}\mathbf{g}_t^k + \tilde{N}_t^k,$$
where
$$\tilde{N}_t^k = (\mathbf{H}_t^k)^{-1}\Big[\tilde{Z}_t^k + \sum_{j=1}^{\infty}\big(-\tilde{U}_t^k(\mathbf{H}_t^k)^{-1}\big)^j\big(\mathbf{g}_t^k + \tilde{Z}_t^k\big)\Big].$$
It remains to bound $\|\tilde{N}_t^k\|_2$ with high probability in order to complete the proof. Note that, with probability at least $1-\xi_0$, we have $\|Z_t^k\|_2 \leq 4\sqrt{d} + 2\sqrt{2\log(2/\xi_0)}$ and $\|U_t^k\|_2 \leq 2\sqrt{d\log(4d/\xi_0)}$, so that
$$\begin{aligned}
\|\tilde{N}_t^k\| &= \Big\|(\mathbf{H}_t^k)^{-1}\Big[\tilde{Z}_t^k + \sum_{j=1}^{\infty}\big(-\tilde{U}_t^k(\mathbf{H}_t^k)^{-1}\big)^j\big(\mathbf{g}_t^k + \tilde{Z}_t^k\big)\Big]\Big\| \\
&\leq \frac{1}{\kappa(1-\epsilon)}\|\tilde{Z}_t^k\| + \frac{1}{\kappa(1-\epsilon)}\cdot\frac{\|\tilde{U}_t^k\|}{\kappa(1-\epsilon)-\|\tilde{U}_t^k\|}\big(\|\mathbf{g}_t^k\| + \|\tilde{Z}_t^k\|\big) \\
&\leq \frac{2\Gamma\sqrt{2T}\big(4\sqrt{d}+2\sqrt{2\log(2/\xi_0)}\big)}{\kappa(1-\epsilon)s\mu} + \frac{2B\sqrt{2T}}{\kappa(1-\epsilon)s\mu}\cdot\frac{2\|U_t^k\|}{\kappa(1-\epsilon)}\Big(\Gamma + \frac{2\Gamma\sqrt{2T}\big(4\sqrt{d}+2\sqrt{2\log(2/\xi_0)}\big)}{s\mu}\Big) \\
&\leq C_0\frac{\Gamma\sqrt{T}\big(\sqrt{d}+\sqrt{\log(2/\xi_0)}\big)}{\kappa(1-\epsilon)s\mu} + \frac{\Gamma B}{\kappa(1-\epsilon)}\cdot\frac{T\sqrt{d\log(d/\xi_0)}}{\mu s\kappa(1-\epsilon)} \leq C\tilde{M}_{\mathrm{privacy}},
\end{aligned}$$
where $C_0$ and $C$ are constants and $\tilde{M}_{\mathrm{privacy}} = \frac{\Gamma B T\sqrt{d\log(d/\xi_0)}}{\mu s\kappa^2(1-\epsilon)^2}$. Note that for the second and third inequalities to hold, we need $\|\tilde{U}_t^k\| \leq \frac{\kappa(1-\epsilon)}{2}$; with $s \geq \frac{4B\sqrt{2T}\sqrt{2d\log(4d/\xi_0)}}{\kappa(1-\epsilon)\mu}$, the bound $\|\tilde{U}_t^k\| \leq \frac{\kappa(1-\epsilon)}{2}$ holds with probability at least $1-\xi_0/2$.    □

Appendix B.2. The Proof of Lemma 8

Proof. 
Because $f^k(\cdot)$ is $M(1+\epsilon)$-smooth, we have
$$f^k(\mathbf{w}_t^k - \alpha\hat{\mathbf{p}}_t^k) \leq f^k(\mathbf{w}_t^k) - \alpha(\hat{\mathbf{p}}_t^k)^T\mathbf{g}^k(\mathbf{w}_t^k) + \frac{M(1+\epsilon)}{2}\alpha^2\|\hat{\mathbf{p}}_t^k\|^2.$$
The above inequality holds for all $\alpha \in \mathbb{R}$. We know that $\alpha_t^k$, the local step size at worker $k$, satisfies the line-search constraint in Equation (10). Thus, there exists $\alpha_t^k \in (0, 1)$ which satisfies the new line-search condition
$$-\alpha(\hat{\mathbf{p}}_t^k)^T\mathbf{g}_t^k + \frac{M(1+\epsilon)}{2}\alpha^2\|\hat{\mathbf{p}}_t^k\|^2 \leq -\alpha\beta(\hat{\mathbf{p}}_t^k)^T\mathbf{g}_t^k + \alpha\gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k.$$
Thus, $\alpha$ must satisfy
$$\frac{M(1+\epsilon)}{2}\alpha\|\hat{\mathbf{p}}_t^k\|^2 \leq (1-\beta)(\hat{\mathbf{p}}_t^k)^T\mathbf{g}_t^k + \gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k = (1-\beta)(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k(\hat{\mathbf{p}}_t^k - \tilde{N}_t^k) + \gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k = (1-\beta)(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\hat{\mathbf{p}}_t^k,$$
where we use the facts $\gamma = 1-\beta$ and $\mathbf{g}_t^k = \mathbf{H}_t^k(\hat{\mathbf{p}}_t^k - \tilde{N}_t^k)$. Note that, due to the $\kappa(1-\epsilon)$-strong convexity of $f^k(\cdot)$, the local line-search constraint must satisfy
$$\alpha \leq \frac{2(1-\beta)\kappa(1-\epsilon)}{M(1+\epsilon)}.$$
Hence, if we choose $\alpha^* \leq \frac{2(1-\beta)\kappa(1-\epsilon)}{M(1+\epsilon)}$, or $\alpha^* \leq \frac{\kappa(1-\beta)}{M}$ for $\epsilon < 1/2$, we can guarantee that steps satisfying the line-search condition from Equation (10) exist, with $\alpha_t^k = \alpha^*$. Hence, making use of the new line-search condition, we obtain
$$\begin{aligned}
f^k\big(\mathbf{w}_t^k - \alpha^*\hat{\mathbf{p}}_t^k\big) - f^k(\mathbf{w}_t^k) &\leq -\alpha^*\beta(\hat{\mathbf{p}}_t^k)^T\mathbf{g}_t^k + \alpha^*\gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\alpha^*\beta(\mathbf{p}_t^k + \tilde{N}_t^k)^T\mathbf{g}_t^k + \alpha^*\gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\alpha^*\beta(\mathbf{p}_t^k)^T\mathbf{g}_t^k - \alpha^*\beta(\tilde{N}_t^k)^T\mathbf{g}_t^k + \alpha^*\gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\alpha^*\beta(\mathbf{g}_t^k)^T(\mathbf{H}_t^k)^{-1}\mathbf{g}_t^k - \alpha^*\beta(\tilde{N}_t^k)^T\mathbf{H}_t^k\mathbf{p}_t^k + \alpha^*\gamma(\hat{\mathbf{p}}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&\leq -\frac{\alpha^*\beta}{M(1+\epsilon)}\|\mathbf{g}_t^k\|^2 + \alpha^*\big(\gamma\hat{\mathbf{p}}_t^k - \beta\mathbf{p}_t^k\big)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\frac{\alpha^*\beta}{M(1+\epsilon)}\|\mathbf{g}_t^k\|^2 + \alpha^*\big(\gamma\mathbf{p}_t^k + \gamma\tilde{N}_t^k - \beta\mathbf{p}_t^k\big)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\frac{\alpha^*\beta}{M(1+\epsilon)}\|\mathbf{g}_t^k\|^2 + \alpha^*(\gamma-\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \alpha^*\gamma(\tilde{N}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k \\
&= -\frac{\alpha^*\beta}{M(1+\epsilon)}\|\mathbf{g}_t^k\|^2 + \alpha^*(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \alpha^*(1-\beta)(\tilde{N}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k,
\end{aligned}$$
w.p. $1-\delta$. Because $\mathbf{H}_t^k \preceq M(1+\epsilon)\mathbf{I}$ and $1-\beta > 0$, we immediately obtain
$$f^k\big(\mathbf{w}_t^k - \alpha^*\hat{\mathbf{p}}_t^k\big) - f^k(\mathbf{w}_t^k) \leq -\psi\|\mathbf{g}_t^k\|^2 + \alpha^*(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \alpha^*(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2,$$
where $\psi = \frac{\alpha^*\beta}{M(1+\epsilon)}$.    □

Appendix C. The Proofs of Main Theorems

Appendix C.1. The Proof of Claim (14)

Proof. 
We define some key quantities:
$$\mathbf{w}_{t+1}^k = \bar{\mathbf{w}}_t - \alpha_t^k\hat{\mathbf{p}}_t^k, \quad \text{and} \quad \bar{\mathbf{w}}_{t+1} = \bar{\mathbf{w}}_t - \frac{1}{K}\sum_{k=1}^{K}\alpha_t^k\hat{\mathbf{p}}_t^k.$$
Invoking the $M$-smoothness of the function $f(\cdot)$, we have
$$\begin{aligned}
f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}_{t+1}) &\geq -\frac{M}{2}\|\bar{\mathbf{w}}_t - \bar{\mathbf{w}}_{t+1}\|^2 + \langle\mathbf{g}(\bar{\mathbf{w}}_t), \bar{\mathbf{w}}_t - \bar{\mathbf{w}}_{t+1}\rangle \\
&= -\frac{M}{2K^2}\Big\|\sum_{k=1}^{K}\alpha_t^k\hat{\mathbf{p}}_t^k\Big\|^2 + \Big\langle\mathbf{g}(\bar{\mathbf{w}}_t), \frac{1}{K}\sum_{k=1}^{K}\alpha_t^k\hat{\mathbf{p}}_t^k\Big\rangle \\
&\overset{(i)}{\geq} -\frac{M}{2K}\sum_{k=1}^{K}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 + \Big\langle\mathbf{g}(\bar{\mathbf{w}}_t), \frac{1}{K}\sum_{k=1}^{K}\alpha_t^k\hat{\mathbf{p}}_t^k\Big\rangle \\
&= \frac{1}{K}\sum_{k=1}^{K}\Big(\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) - \frac{M}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2\Big),
\end{aligned} \tag{A15}$$
where $(i)$ follows from
$$\Big\|\frac{1}{H}\sum_{k=1}^{H}\mathbf{a}_k\Big\|^2 \leq \frac{1}{H}\sum_{k=1}^{H}\|\mathbf{a}_k\|^2. \tag{A16}$$
Making use of $\|\mathbf{g}_t^k\| \geq G$ and Corollary A1, combined with Lemma 8, we have
$$\begin{aligned}
\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) &\geq f^k(\bar{\mathbf{w}}_t) - f^k(\mathbf{w}_{t+1}^k) + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \epsilon_1\|\hat{\mathbf{p}}_t^k\|\,\|\mathbf{g}_t^k\| \\
&\geq \psi\|\mathbf{g}_t^k\|^2 - \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k - \alpha_t^k(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2 + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \epsilon_1\|\hat{\mathbf{p}}_t^k\|\,\|\mathbf{g}_t^k\| \\
&\geq \psi\|\mathbf{g}_t^k\|^2 - \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k - \alpha_t^k(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2 + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \epsilon_1\|\mathbf{p}_t^k\|\,\|\mathbf{g}_t^k\| - \epsilon_1\|\tilde{N}_t^k\|\,\|\mathbf{g}_t^k\| \\
&\geq \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)}\Big)\|\mathbf{g}_t^k\|^2 - \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k - \epsilon_1\|\tilde{N}_t^k\|\,\|\mathbf{g}_t^k\| - \alpha_t^k(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2 + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2,
\end{aligned}$$
with probability at least $1-4\delta$, where the second inequality uses the fact $\|\hat{\mathbf{p}}_t^k\| \leq \|\mathbf{p}_t^k\| + \|\tilde{N}_t^k\|$. We define
$$\varphi(\tilde{N}_t^k) = \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \epsilon_1\|\tilde{N}_t^k\|\,\|\mathbf{g}_t^k\| + \alpha_t^k(1-\beta)M(1+\epsilon)\|\tilde{N}_t^k\|^2.$$
So we have
$$\begin{aligned}
\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) - \frac{M}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 &\geq \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)}\Big)\|\mathbf{g}_t^k\|^2 + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \frac{M}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \varphi(\tilde{N}_t^k) \\
&= \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)}\Big)\|\mathbf{g}_t^k\|^2 - \frac{M-\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \varphi(\tilde{N}_t^k) \\
&\geq \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)}\Big)\|\mathbf{g}_t^k\|^2 - \frac{M-\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\big(2\|\mathbf{p}_t^k\|^2 + 2\|\tilde{N}_t^k\|^2\big) - \varphi(\tilde{N}_t^k) \\
&\geq \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)} - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}\Big)\|\mathbf{g}_t^k\|^2 - [M-\kappa(1-\epsilon)](\alpha^*)^2\|\tilde{N}_t^k\|^2 - \varphi(\tilde{N}_t^k),
\end{aligned}$$
with probability at least $1-4\delta$, where the third inequality is due to $\|\hat{\mathbf{p}}_t^k\|^2 \leq 2\|\mathbf{p}_t^k\|^2 + 2\|\tilde{N}_t^k\|^2$. Collecting the terms involving $\tilde{N}_t^k$, we define
$$\hat{\varphi}(\tilde{N}_t^k;\beta) = \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \epsilon_1\|\tilde{N}_t^k\|\,\|\mathbf{g}_t^k\| + \big\{(1-\beta)M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\big\}\alpha^*\|\tilde{N}_t^k\|^2.$$
Hence, we obtain
$$\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) - \frac{M}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 \geq \Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)} - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}\Big)\|\mathbf{g}_t^k\|^2 - \hat{\varphi}(\tilde{N}_t^k;\beta),$$
with probability at least $1-6\delta$, and from this inequality together with (A15) we obtain
$$f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}_{t+1}) \geq \frac{1}{K}\sum_{k=1}^{K}\Big(\psi - \frac{\epsilon_1}{\kappa(1-\epsilon)} - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}\Big)\|\mathbf{g}_t^k\|^2 - \frac{1}{K}\sum_{k=1}^{K}\hat{\varphi}(\tilde{N}_t^k;\beta),$$
with probability at least $1-6K\delta$.
If we choose $\beta = 1/2$, the term $(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k$ is eliminated and we obtain the more concise result
$$f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}_{t+1}) \geq \frac{1}{K}\sum_{k=1}^{K}C_1\|\mathbf{g}_t^k\|^2 - \frac{1}{K}\sum_{k=1}^{K}\hat{\varphi}(\tilde{N}_t^k;1/2), \tag{A23}$$
where $C_1 = \psi - \frac{\epsilon_1}{\kappa(1-\epsilon)} - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$ and $\hat{\varphi}(\tilde{N}_t^k;1/2) = \epsilon_1\|\tilde{N}_t^k\|\,\|\mathbf{g}_t^k\| + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*\|\tilde{N}_t^k\|^2$.
Making use of inequality (A16), we obtain
$$\frac{1}{K}\sum_{k=1}^{K}C_1\|\mathbf{g}_t^k\|^2 \geq C_1\|\mathbf{g}(\bar{\mathbf{w}}_t)\|^2;$$
then, combining this with inequality (A23), we obtain
$$f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}_{t+1}) \geq C_1\|\mathbf{g}(\bar{\mathbf{w}}_t)\|^2 - \frac{1}{K}\sum_{k=1}^{K}\hat{\varphi}(\tilde{N}_t^k;1/2). \tag{A25}$$
Invoking the $\kappa$-strong convexity of the function $f$,
$$f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*) \leq \frac{1}{2\kappa}\|\mathbf{g}(\bar{\mathbf{w}}_t)\|^2, \tag{A26}$$
and combining inequalities (A25) and (A26), we obtain
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq (1-2\kappa C_1)\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \frac{1}{K}\sum_{k=1}^{K}\hat{\varphi}(\tilde{N}_t^k;1/2),$$
with probability at least $1-6K\delta$. Using Lemma 7, we have
$$\hat{\varphi}(\tilde{N}_t^k;1/2) \leq \epsilon_1(C\tilde{M}_{\mathrm{privacy}})\Gamma + \big\{\tfrac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\big\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2 = \tilde{M}_{\hat{\varphi}},$$
with probability at least $1-\xi_0$. Hence, we have
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq (1-2\kappa C_1)\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \tilde{M}_{\hat{\varphi}},$$
with probability at least $1-K(6\delta+\xi_0)$.    □

Appendix C.2. The Proof of Claim (15)

Proof. 
From the uniform subsampling property and case 1 of Corollary A1, we obtain
$$\big|(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) - (\hat{\mathbf{p}}_t^k)^T\mathbf{g}^k(\bar{\mathbf{w}}_t)\big| \leq \eta\|\hat{\mathbf{p}}_t^k\| \quad \text{w.p. } 1-\delta.$$
Thus,
$$(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) \geq (\hat{\mathbf{p}}_t^k)^T\mathbf{g}^k(\bar{\mathbf{w}}_t) - \eta\|\hat{\mathbf{p}}_t^k\| \quad \text{w.p. } 1-\delta. \tag{A30}$$
Now, since the function $f^k$ is $\kappa(1-\epsilon)$-strongly convex with probability $1-\delta$, we have the following bound with probability at least $1-\delta$:
$$\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}_t^k \geq \big(f^k(\bar{\mathbf{w}}_t) - f^k(\mathbf{w}_{t+1}^k)\big) + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2.$$
Combining these inequalities, and proceeding similarly to the previous analysis, we have
$$\begin{aligned}
\alpha_t^k(\hat{\mathbf{p}}_t^k)^T\mathbf{g}(\bar{\mathbf{w}}_t) &\geq \big(f^k(\bar{\mathbf{w}}_t) - f^k(\mathbf{w}_{t+1}^k)\big) + \frac{\kappa(1-\epsilon)}{2}(\alpha_t^k)^2\|\hat{\mathbf{p}}_t^k\|^2 - \eta\|\hat{\mathbf{p}}_t^k\| \\
&\geq \Big(\psi - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}\Big)\|\mathbf{g}_t^k\|^2 - \varphi(\tilde{N}_t^k;\beta) - \frac{\eta\Gamma}{\kappa(1-\epsilon)},
\end{aligned} \tag{A32}$$
with probability at least $1-\delta$, where $\varphi(\tilde{N}_t^k;\beta) = \alpha_t^k(1-2\beta)(\mathbf{p}_t^k)^T\mathbf{H}_t^k\tilde{N}_t^k + \eta\|\tilde{N}_t^k\| + \{(1-\beta)M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*\|\tilde{N}_t^k\|^2$. Substituting Equation (A32) into Equation (A15), we obtain
$$f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}_{t+1}) \geq \frac{1}{K}\sum_{k=1}^{K}C_2\|\mathbf{g}_t^k\|^2 - \frac{1}{K}\sum_{k=1}^{K}\varphi(\tilde{N}_t^k;\beta) - \frac{\eta\Gamma}{\kappa(1-\epsilon)},$$
with probability at least $1-6K\delta$.
Then, letting $\beta = 1/2$ and making an analysis similar to that of inequalities (A25) and (A26), we obtain
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq (1-2\kappa C_2)\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \frac{1}{K}\sum_{k=1}^{K}\varphi(\tilde{N}_t^k;1/2) + \frac{\eta\Gamma}{\kappa(1-\epsilon)},$$
with probability at least $1-6K\delta$, where $C_2 = \psi - \frac{[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$ and $\varphi(\tilde{N}_t^k;1/2) = \eta\|\tilde{N}_t^k\| + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*\|\tilde{N}_t^k\|^2$. Using Lemma 7, we have
$$\varphi(\tilde{N}_t^k;1/2) \leq \eta(C\tilde{M}_{\mathrm{privacy}}) + \big\{\tfrac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\big\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2 = \tilde{M}_{\varphi},$$
with probability at least $1-\xi_0$. Then, we obtain the claim
$$f(\bar{\mathbf{w}}_{t+1}) - f(\bar{\mathbf{w}}^*) \leq (1-2\kappa C_2)\big(f(\bar{\mathbf{w}}_t) - f(\bar{\mathbf{w}}^*)\big) + \tilde{M}_{\varphi} + \frac{\eta\Gamma}{\kappa(1-\epsilon)},$$
with probability at least $1-K(6\delta+\xi_0)$.    □

Appendix C.3. The Proof of Theorem 3

Proof. 
We define the following quantities:
$$\bar{\mathbf{w}}_{t_0+1} = \bar{\mathbf{w}}_{t_0} - \sum_{\tau=t_0}^{L}\bar{\mathbf{p}}_\tau,$$
where $\bar{\mathbf{p}}_\tau = \frac{1}{K}\sum_{k=1}^{K}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k$ is the average descent direction and $\hat{\mathbf{p}}_\tau^k = (\hat{\mathbf{H}}_\tau^k)^{-1}\hat{\mathbf{g}}_\tau^k$ is the local noisy descent direction at the $k$-th worker at iteration $\tau$; $L$ is the number of local computations.
Invoking the $M$-smoothness of the function $f$, we have
$$\begin{aligned}
f(\bar{\mathbf{w}}_{t_0}) - f(\bar{\mathbf{w}}_{t_0+1}) &\geq -\frac{M}{2}\Big\|\sum_{\tau=t_0}^{L}\bar{\mathbf{p}}_\tau\Big\|^2 + \Big\langle\mathbf{g}(\bar{\mathbf{w}}_{t_0}), \sum_{\tau=t_0}^{L}\bar{\mathbf{p}}_\tau\Big\rangle \\
&= -\frac{M}{2}\Big\|\frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\|^2 + \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\big\langle\mathbf{g}(\bar{\mathbf{w}}_{t_0}), \alpha_\tau^k\hat{\mathbf{p}}_\tau^k\big\rangle \\
&\geq -\frac{M}{2K}\sum_{k=1}^{K}\Big\|\sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\|^2 + \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\big\langle\mathbf{g}(\bar{\mathbf{w}}_{t_0}), \alpha_\tau^k\hat{\mathbf{p}}_\tau^k\big\rangle,
\end{aligned} \tag{A37}$$
where we make use of inequality (A16) in the last inequality.
Invoking the $\kappa(1-\epsilon)$-strong convexity of the function $f^k(\cdot)$, we have
$$f^k(\mathbf{w}_{t_0}^k) - f^k(\mathbf{w}_{t_0+1}^k) \leq -\frac{\kappa(1-\epsilon)}{2}\Big\|\sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\|^2 + \Big\langle\mathbf{g}_{t_0}^k, \sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\rangle$$
with probability $1-\delta$.
Averaging over the workers at each communication round, we have
$$\frac{1}{K}\sum_{k=1}^{K}\big(f^k(\mathbf{w}_{t_0}^k) - f^k(\mathbf{w}_{t_0+1}^k)\big) \leq -\frac{\kappa(1-\epsilon)}{2K}\sum_{k=1}^{K}\Big\|\sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\|^2 + \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\big\langle\mathbf{g}_{t_0}^k, \alpha_\tau^k\hat{\mathbf{p}}_\tau^k\big\rangle. \tag{A39}$$
Then, combining inequalities (A37) and (A39) and using inequality (A30), we eliminate the terms $\frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\langle\mathbf{g}(\bar{\mathbf{w}}_{t_0}), \alpha_\tau^k\hat{\mathbf{p}}_\tau^k\rangle$ and $\frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\langle\mathbf{g}_{t_0}^k, \alpha_\tau^k\hat{\mathbf{p}}_\tau^k\rangle$. So, we have
$$f(\bar{\mathbf{w}}_{t_0}) - f(\bar{\mathbf{w}}_{t_0+1}) \geq \frac{1}{K}\sum_{k=1}^{K}\big(f^k(\mathbf{w}_{t_0}^k) - f^k(\mathbf{w}_{t_0+1}^k)\big) - \frac{M-\kappa(1-\epsilon)}{2K}\sum_{k=1}^{K}\Big\|\sum_{\tau=t_0}^{L}\alpha_\tau^k\hat{\mathbf{p}}_\tau^k\Big\|^2 - \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\eta\,\alpha_\tau^k\|\hat{\mathbf{p}}_\tau^k\|.$$
Making use of inequality (A16), we obtain
$$f(\bar{\mathbf{w}}_{t_0}) - f(\bar{\mathbf{w}}_{t_0+1}) \geq \frac{1}{K}\sum_{k=1}^{K}\big(f^k(\mathbf{w}_{t_0}^k) - f^k(\mathbf{w}_{t_0+1}^k)\big) - \frac{L\big(M-\kappa(1-\epsilon)\big)}{2K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}(\alpha_\tau^k)^2\|\hat{\mathbf{p}}_\tau^k\|^2 - \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\eta\,\alpha_\tau^k\|\hat{\mathbf{p}}_\tau^k\|.$$
Then, from Lemma 8, we obtain, with probability at least $1-KL\delta$,
$$\begin{aligned}
f(\bar{\mathbf{w}}_{t_0}) - f(\bar{\mathbf{w}}_{t_0+1}) &\geq \frac{\psi}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\|\mathbf{g}_\tau^k\|^2 - \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\frac{1}{2}\alpha^* M(1+\epsilon)\|\tilde{N}_\tau^k\|^2 \\
&\quad - \frac{L\big(M-\kappa(1-\epsilon)\big)}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}(\alpha^*)^2\big(\|\mathbf{p}_\tau^k\|^2 + \|\tilde{N}_\tau^k\|^2\big) - \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\eta\big(\|\mathbf{p}_\tau^k\| + \|\tilde{N}_\tau^k\|\big) \\
&\geq \frac{C}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\|\mathbf{g}_\tau^k\|^2 - \frac{\eta L\Gamma}{\kappa(1-\epsilon)} - \frac{1}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\varphi(\tilde{N}_\tau^k;1/2),
\end{aligned}$$
where $C = \psi - \frac{L[M-\kappa(1-\epsilon)](\alpha^*)^2}{\kappa^2(1-\epsilon)^2}$ and $\varphi(\tilde{N}_\tau^k;1/2) = \eta\|\tilde{N}_\tau^k\| + \{\frac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\}\alpha^*\|\tilde{N}_\tau^k\|^2$.
Using Lemma 7, we have
$$\varphi(\tilde{N}_\tau^k;1/2) \leq \eta(C\tilde{M}_{\mathrm{privacy}}) + \big\{\tfrac{1}{2}M(1+\epsilon) + [M-\kappa(1-\epsilon)]\alpha^*\big\}\alpha^*(C\tilde{M}_{\mathrm{privacy}})^2 = \tilde{M}_{\varphi},$$
with probability at least $1-\xi_0$. Then, we obtain Theorem 3:
$$f(\bar{\mathbf{w}}_{t_0}) - f(\bar{\mathbf{w}}_{t_0+1}) \geq \frac{C}{K}\sum_{k=1}^{K}\sum_{\tau=t_0}^{L}\|\mathbf{g}_\tau^k\|^2 - \frac{\eta L\Gamma}{\kappa(1-\epsilon)} - L\tilde{M}_{\varphi},$$
with probability at least $1-KL(6\delta+\xi_0)$.    □

Appendix D. The FL Gradient Descent Method with μ-GDP

Appendix D.1. Algorithm

Algorithm A1 GDP-GD
1: Input: Initial iterate $\bar{\mathbf{w}}_0 \in \mathbb{R}^d$; privacy parameter $\mu$; loss-function parameter $\Gamma$; iteration parameter $T$; step size $\alpha$.
2: for $t = 0$ to $T$ do
3:     Initialization: $\mathbf{w}_t^k = \bar{\mathbf{w}}_t$
4:     for $k = 1$ to $K$ in parallel do
5:         $\hat{\mathbf{g}}_t^k = \mathbf{g}_t^k + \frac{2\Gamma\sqrt{T}}{\mu s} Z_t^k$
6:         Send to the server: $\mathbf{w}_{t+1}^k = \mathbf{w}_t^k - \alpha\hat{\mathbf{g}}_t^k$
7:     end for
8:     The server updates the parameter: $\bar{\mathbf{w}}_{t+1} = \frac{1}{K}\sum_{k=1}^{K} \mathbf{w}_{t+1}^k$
9: end for
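For completeness, a minimal sketch of Algorithm A1; grad_fn stands in for the local gradient oracle (e.g., the hypothetical logistic_gradient sketched in Section 2.1.2), and the function name is ours:

```python
import numpy as np

def gdp_gd(grad_fn, parts, w0, alpha, T, mu, Gamma, rng):
    """GDP-GD sketch: each worker perturbs its local gradient with noise scale
    2*Gamma*sqrt(T)/(mu*s) (Algorithm A1), takes a step, and the server averages.
    grad_fn(w, S_k) returns the local gradient of the worker holding indices S_k."""
    w_bar = w0.copy()
    for _ in range(T):
        local_updates = []
        for S_k in parts:                            # K workers in parallel
            s = len(S_k)
            g_hat = grad_fn(w_bar, S_k) + (2.0 * Gamma * np.sqrt(T) / (mu * s)) * rng.standard_normal(w0.shape)
            local_updates.append(w_bar - alpha * g_hat)
        w_bar = np.mean(local_updates, axis=0)       # server aggregation
    return w_bar
```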

Appendix D.2. Theory

Lemma A4
([26]). After T communication rounds, the output of any user satisfies μ-GDP.
Theorem A1.
After T communication rounds, the GDP-GD algorithm satisfies μ-GDP.
Proof. 
The result is from Lemmas 3 and A4. □

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  2. McMahan, H.B.; Ramage, D.; Talwar, K.; Zhang, L. Learning differentially private recurrent language models. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
  3. Geyer, R.C.; Klein, T.; Nabi, M. Differentially private federated learning: A client level perspective. arXiv 2017, arXiv:1712.07557.
  4. Triastcyn, A.; Faltings, B. Federated learning with Bayesian differential privacy. In Proceedings of the 2019 IEEE International Conference on Big Data (IEEE BigData), Los Angeles, CA, USA, 9–12 December 2019; pp. 2587–2596.
  5. Ivkin, N.; Rothchild, D.; Ullah, E.; Stoica, I.; Arora, R. Communication-efficient distributed SGD with sketching. Adv. Neural Inf. Process. Syst. 2019, 32, 13144–13154.
  6. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210.
  7. Wang, J.; Joshi, G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. Proc. Mach. Learn. Syst. 2019, 1, 212–229.
  8. Stich, S.U. Local SGD converges fast and communicates little. In Proceedings of the ICLR 2019 International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; p. 17.
  9. Dieuleveut, A.; Patel, K.K. Communication trade-offs for Local-SGD with large step size. Adv. Neural Inf. Process. Syst. 2019, 32, 13601–13612.
  10. Haddadpour, F.; Kamani, M.M.; Mahdavi, M.; Cadambe, V. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. Adv. Neural Inf. Process. Syst. 2019, 32.
  11. Gupta, V.; Ghosh, A.; Dereziński, M.; Khanna, R.; Ramchandran, K.; Mahoney, M.W. LocalNewton: Reducing communication rounds for distributed learning. In Proceedings of the Uncertainty in Artificial Intelligence, Online, 27–30 July 2021; pp. 632–642.
  12. Wang, S.; Roosta-Khorasani, F.; Xu, P.; Mahoney, M.W. GIANT: Globally improved approximate Newton method for distributed optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 2338–2348.
  13. Dünner, C.; Lucchi, A.; Gargiani, M.; Bian, A.; Hofmann, T.; Jaggi, M. A distributed second-order algorithm you can trust. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1358–1366.
  14. Bullins, B.; Patel, K.; Shamir, O.; Srebro, N.; Woodworth, B.E. A stochastic Newton algorithm for distributed convex optimization. Adv. Neural Inf. Process. Syst. 2021, 34, 26818–26830.
  15. Islamov, R.; Qian, X.; Richtarik, P. Distributed second order methods with fast rates and compressed communication. Proc. Mach. Learn. Res. 2021, 139, 4617–4628.
  16. Geiping, J.; Bauermeister, H.; Dröge, H.; Moeller, M. Inverting gradients-how easy is it to break privacy in federated learning? Adv. Neural Inf. Process. Syst. 2020, 33, 16937–16947.
  17. Dwork, C. Differential privacy. In Proceedings of the Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, 10–14 July 2006; Proceedings, Part II 33; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12.
  18. Mironov, I. Rényi differential privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; pp. 263–275.
  19. Dong, J.; Roth, A.; Su, W.J. Gaussian differential privacy. J. R. Stat. Soc. Ser. B Stat. Methodol. 2022, 84, 3–37.
  20. Noble, M.; Bellet, A.; Dieuleveut, A. Differentially private federated learning on heterogeneous data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; pp. 10110–10145.
  21. Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farokhi, F.; Jin, S.; Quek, T.Q.; Poor, H.V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469.
  22. Girgis, A.; Data, D.; Diggavi, S.; Kairouz, P.; Suresh, A.T. Shuffled model of differential privacy in federated learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; pp. 2521–2529.
  23. Cheu, A.; Smith, A.; Ullman, J.; Zeber, D.; Zhilyaev, M. Distributed differential privacy via shuffling. In Proceedings of the Advances in Cryptology–EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, 19–23 May 2019; Proceedings, Part I 38; Springer: Berlin/Heidelberg, Germany, 2019; pp. 375–403.
  24. Rastogi, V.; Nath, S. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 735–746.
  25. Huang, Z.; Mitra, S.; Vaidya, N. Differentially private distributed optimization. In Proceedings of the 16th International Conference on Distributed Computing and Networking, Goa, India, 4–7 January 2015; pp. 1–10.
  26. Avella-Medina, M.; Bradshaw, C.; Loh, P.L. Differentially private inference via noisy optimization. arXiv 2021, arXiv:2103.11003.
  27. Zhu, T.; Li, G.; Zhou, W.; Philip, S.Y. Differentially private data publishing and analysis: A survey. IEEE Trans. Knowl. Data Eng. 2017, 29, 1619–1638.
  28. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
  29. Izzo, Z.; Yoon, J.; Arik, S.O.; Zou, J. Provable membership inference privacy. arXiv 2022, arXiv:2211.06582.
Figure 1. The training loss of each method on four datasets with respect to the $L = 1$ case.
Figure 2. The testing accuracy of each method on four datasets with respect to the $L = 1$ case.
Figure 3. The training loss of GDP-LocalNewton on four datasets with respect to the $L > 1$ case. (a,b) Correspond to the general tendency. (c,d) Enlarge the performance in (a,b) at communication round 7, respectively.
Figure 4. The training loss of GDP-LocalNewton on the generated data with respect to different step-size strategies. (a,b) Correspond to $\mu = 1$ and $\mu = 5$, respectively. An appropriate decaying rate is the half decay ($\sigma = 0.5$).
Figure 5. The training loss of GDP-LocalNewton on the a9a data with respect to different step-size strategies. (a,b) Correspond to $\mu = 5$ and $\mu = 10$, respectively. An appropriate decaying rate is the nine-tenths decay ($\sigma = 0.9$).
Table 1. The parameters used in this paper.

$\mu$: Privacy parameter.
$n$: The total number of data points.
$s$: The number of data points per worker.
$K$: The number of workers.
$T$: The number of iterations.
$L$: The number of local computations.
$B$: Upper bound on the Hessian of an individual loss function.
$\Gamma$: Upper bound on the gradient of an individual loss function.
$M$: The smoothness parameter of the global loss function.
$\kappa$: The strong-convexity parameter of the global loss function.
$\alpha$: Iteration step size.
$\alpha^*$: The maximum line-search step size.
$\beta$: A parameter in the line search.
$\mathbf{w}_t$: Global model parameter at the $t$-th iteration.
$\mathbf{w}^k$: The $k$-th worker's model parameter.
$\mathbf{w}_t^k$: The $k$-th worker's model parameter at the $t$-th iteration.
$\mathbf{H}^k$, $\mathbf{H}_t$, $\mathbf{H}_t^k$, $\mathbf{g}^k$, $\mathbf{g}_t$, $\mathbf{g}_t^k$, etc.: Defined analogously to $\mathbf{w}_t$, $\mathbf{w}^k$, and $\mathbf{w}_t^k$.
Table 2. The real datasets used in this paper.

Dataset   Training Samples (n)   Features (d)   Testing Samples
a9a       32,000                 123            16,000
Covtype   500,000                54             80,000
ijcnn1    49,000                 22             91,000