Article

Privacy-Preserving Distributed Deep Learning via Homomorphic Re-Encryption

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Electronics 2019, 8(4), 411; https://doi.org/10.3390/electronics8040411
Submission received: 9 March 2019 / Revised: 29 March 2019 / Accepted: 5 April 2019 / Published: 9 April 2019
(This article belongs to the Section Artificial Intelligence)

Abstract

The flourishing of deep learning on distributed training datasets raises concerns about data privacy. Recent work on privacy-preserving distributed deep learning rests on the assumption that the server and the learning participants do not collude. Once they collude, the server can decrypt and obtain the data of all learning participants. Moreover, since all learning participants hold the same private key, each participant must connect to the server via a distinct TLS/SSL secure channel to avoid leaking data to the other participants. To fix these problems, we propose a privacy-preserving distributed deep learning scheme with the following improvements: (1) no information is leaked to the server even if a learning participant colludes with it; (2) learning participants do not need separate secure channels to communicate with the server; and (3) the deep learning model accuracy is higher. We achieve this by introducing a key transform server and applying homomorphic re-encryption to asynchronous stochastic gradient descent for deep learning. We show that our scheme adds tolerable communication cost to the deep learning system while achieving more security properties, and that the computational cost for learning participants remains similar. Overall, our scheme is a more secure and more accurate deep learning scheme for distributed learning participants.


1. Introduction

1.1. Background

In recent years, artificial intelligence (AI) [1] has been applied to more and more fields, such as medical treatment [2], the internet of things (IoT) [3] and industrial control [4]. Deep learning [5] is one of the most attractive and representative AI techniques, and is mainly based on neural networks [6]. Thanks to the development of related computer hardware (such as GPUs) and the emergence of big data [7], deep learning has achieved striking success in tasks such as image classification [8], traffic identification [9] and self-learning [10]. Deep learning is therefore gaining importance in an increasingly intelligent modern society.
Distributed deep learning is becoming increasingly popular. In this kind of deep learning, the training datasets are collected from multiple distributed data providers rather than a single one [11]. When the deep learning model is trained on more representative data items, the obtained model generalizes better, which leads to higher model accuracy. However, the collection and use of distributed datasets raise worrying security issues (especially privacy issues), which hinder wider application of distributed deep learning [12].
There are many examples of privacy leaks that incur significant threats. For example, in 2018 it was reported that a data analytics company called Cambridge Analytica harvested millions of voter profiles, revealing voters' private information. The data breach influenced voters' choices and threatened the fairness of the election. In another real-life case, Reuters Health recently reported that some health applications on mobile phones might share information with a host of unrelated companies, some of which have nothing to do with healthcare. A big concern of users is how their data will be used and by whom [13].
As data providers send more and more data to cloud servers, which offer the high computational capability and large storage needed to process big data, it is reasonable for them to worry about their data privacy, even when they encrypt their data before sending it out [14].
Recently, Phong et al. [15] presented a privacy-preserving deep learning scheme via additively homomorphic encryption; it is one of the most representative works in privacy-preserving deep learning. Their scheme is based on gradient-encrypted asynchronous stochastic gradient descent (ASGD), combined with learning with errors (LWE)-based encryption and Paillier encryption. They first prove that sharing gradients, even partially, with an honest-but-curious parameter cloud server as in [16] may leak information. They then propose an additively homomorphic encryption based scheme in which all learning participants encrypt their locally computed gradients with the same public key and send them to the server. However, the security of the scheme rests on the assumption that the server does not collude with any learning participant.
We analyze potential risks of the system in [15] as follows. (1) Since the learning participants own the same public key and private key, learning participant A could decrypt the gradients of learning participant B as long as it gets B’s encrypted gradients. (2) To make matters worse, it is likely that the server colludes with one of the participants in practice. Once the server and the learning participant collude, they could decrypt the gradients of all learning participants because the server has all encrypted gradients and the learning participant has the private key. According to [15], they could get the private local data of all learning participants through the gradients.
In our scheme, all parties (a cloud server acting as the key transform server, a cloud server acting as the data service provider, and data providers acting as learning participants) are assumed to be honest-but-curious [17]. In other words, they will finish the given tasks, but may try to glean sensitive information, such as the local data of the data providers. We assume that the two servers, the key transform server (KTS) and the data service provider (DSP), belong to different cloud companies and would not collude with each other, because of conflicts of interest (each company puts its own benefit first) and reputation preservation; a server may, however, collude with a data provider.
There are two main colluding scenarios between a server and a learning participant. First, the server and a learning participant may belong to the same company, in which case they may collude for the common benefit of that company; specifically, the cloud company a server belongs to may plant an entity as a learning participant. The two cloud servers, however, belong to different companies, so they are unlikely to collude with each other, for the sake of their companies' reputations and because of their conflicting interests. Second, even if the cloud company a server belongs to cannot plant an entity as a learning participant, it is easier for a cloud server to corrupt a learning participant than another cloud server, because a learning participant generally has weaker security measures than a cloud server.
In short, the cloud server in [15] is likely to collude with a learning participant, and it is reasonable for us to assume in this paper that the key transform server does not collude with the data service provider.

1.2. Our Contributions

Because of the security vulnerabilities of the scheme proposed in [15], we propose a multi-key based distributed deep learning scheme that protects the data privacy of learning participants even when a server colludes with one of them, using homomorphic re-encryption. We introduce a key transform server to re-encrypt the gradients encrypted by learning participants. The data service provider makes the re-encrypted gradients additively homomorphic and finishes the weight-update computation. Finally, the learning participants download the new weights and decrypt them respectively. The detailed realization steps of our scheme are described in Section 4. In summary, our scheme enjoys the following properties regarding security, efficiency and accuracy.
  • Security: our scheme protects the private local data of learning participants without assuming that the server never colludes with a learning participant.
  • Efficiency: experimental results show that the computational cost of learning participants is similar to that in [15].
  • Accuracy: our scheme provides slightly higher accuracy than that in [15].

1.3. More Related Works

Shokri et al. [16] designed a distributed deep learning system in which multiple learning participants jointly train an ASGD-based deep neural network model without sharing their local datasets, but must selectively share key parameters of the model. The goal of this paper is to design an ASGD-based distributed deep learning system that does not require sharing key parameters of the model.
Papernot et al. [18] proposed a scheme to preserve the privacy of training data called private aggregation of teacher ensembles (PATE). They used “teachers” for a “student” model instead of public models to protect sensitive data. Essentially, this property of the scheme can be called differential privacy.
Phan et al. [19] proposed the deep private auto-encoder (dPA), a privacy-preserving deep learning model. Instead of perturbing the result of deep learning, their scheme perturbs the objective function of the deep auto-encoder to realize differential privacy.
Abadi et al. [20] developed a differential privacy framework to analyze the privacy cost of crowdsourcing model training over large datasets containing sensitive data. They also designed algorithmic techniques for machine learning with modest costs in privacy, computation and accuracy.
Hitaj et al. [21] showed that distributed deep learning was susceptible to an attack they devised: they trained a generative adversarial network (GAN) to generate samples from the same distribution as the original training dataset, which should be kept private. Moreover, they showed that existing record-level differential privacy could not resist their attack, so effective methods are needed for privacy-preserving distributed collaborative deep learning. They treat one of the learning participants as the adversary, which is practical; we consider this situation in this paper as well.
Mohassel et al. [22] proposed privacy-preserving machine learning protocols for logistic regression, linear regression and stochastic gradient descent based neural network training. In their schemes, the data providers are distributed and there are two servers. Their schemes are based on the assumption that the two servers would not collude, which is reasonable in practice because of conflicts of interest. Our system is based on this assumption, too.
Li et al. [23] suggested that distributed deep learning over a combined dataset should pay attention to two points. First, all data, including intermediate computation results, should be encrypted with different keys before being sent out. Second, the computational cost of data providers should be minimal. We fully consider these two points when designing our scheme, so that the data providers can even be mobile smart phones. The same authors proposed a framework for privacy-preserving outsourced classification in cloud computing (POCC) in [24].
Zhang et al. [25] pointed out that offloading some expensive operations of big data feature learning to cloud servers can improve system efficiency, while the data privacy of enterprises and governments must still be protected. They approximated the activation function by a polynomial under the Brakerski-Gentry-Vaikuntanathan (BGV) cryptosystem. Our scheme needs no such approximation, and thereby avoids the resulting accuracy loss.
Rouhani et al. [26] proposed a scalable provably-secure deep learning framework called Deepsecure, in which all parties are assumed liable to leak information. The key to their framework is pre-processing techniques and an optimized Yao's garbled circuit protocol [27]. Our scheme considers the situation in which one of the learning participants leaks information, even its private key.

1.4. Paper Organization and Notations

The rest of the paper is organized as follows. In Section 2 we introduce the definitions of homomorphic re-encryption and ASGD-based deep learning, and illustrate that gradients may leak information; these are the preliminaries of our system. The architecture of our system is proposed in Section 3. The details of realizing our system via proxy-invisible homomorphic re-encryption are given in Section 4. We then analyze the security of our system in Section 5. Furthermore, we analyze the communication cost of our system and evaluate its computational cost experimentally in Section 6. Finally, we conclude the paper in Section 7.
To facilitate presentation, we summarize the main notations used in this paper in Table 1.

2. Preliminaries

2.1. Homomorphic Re-Encryption

The homomorphic re-encryption scheme (HRES) is an asymmetric cryptosystem that realizes proxy-invisible re-encryption and supports privacy-preserving data processing with access control. In addition, the addition variant of an improved version of HRES, called the somewhat re-encryption scheme, is additively homomorphic [28]. There are four roles in this scheme: data providers (DPs), a data service provider (DSP), an access control server (ACS), and data requesters (DRs). The DPs provide encrypted data to the DSP; the DSP processes the data; the ACS transforms the processed data to realize access control; finally, the DRs request and obtain the data they need. Next, we introduce this improved homomorphic re-encryption scheme in detail; it mainly consists of the following five algorithms.
  • Key generation (KeyGen): $(k) \to (p, q, g, n, PK)$. First choose a security parameter $k$ and two large primes $p$ and $q$ with $\mathrm{Len}(p) = \mathrm{Len}(q) = k$, and compute $n = p \cdot q$. Determine a generator $g$ of maximal order [29] of the cyclic group $\mathbb{G}$ of quadratic residues modulo $n^2$. The outputs $(g, n, PK)$ are public system parameters. The two servers, the DSP and the access control server (ACS), generate key pairs $(sk_{DSP} = a, \; pk_{DSP} = g^a)$ and $(sk_{ACS} = b, \; pk_{ACS} = g^b)$ respectively, and then negotiate the Diffie–Hellman key
    $$PK = pk_{DSP}^{\,sk_{ACS}} = pk_{ACS}^{\,sk_{DSP}} = g^{a \cdot b} \bmod n^2.$$
    $PK$ is published to all data providers so that they can encrypt their data. Each data provider also generates its own key pair; for example, the key pair of data provider $i$ is $(sk_i, pk_i) = (k_i, g^{k_i})$.
  • Encryption (Enc): $m_j \to [m_j]_{PK} = (T_j, T_j')$. This algorithm is performed by the data providers. It encrypts a plaintext $m_j \in \mathbb{Z}_n$ provided by data provider $i$ under the Diffie–Hellman key $PK$. A random $r \in [1, n/4]$ is first selected; the resulting ciphertext consists of two parts:
    $$T_j = (1 + m_j \cdot n) \cdot PK^r \bmod n^2 \quad (1)$$
    $$T_j' = g^r \bmod n^2. \quad (2)$$
  • First phase of re-encryption (FPRE): $[m_j]_{PK} \to [m_j]^+$. After receiving $[m_j]_{PK}$ from data provider $i$, the DSP first selects a computation identifier $CID$, and then computes the intermediate ciphertext $[m_j]^+$:
    $$h_1 = H(pk_i^{\,sk_{DSP}} \,\|\, CID);$$
    $$[m_j]^+ = (\hat{T_j}, \hat{T_j'}) = \big(T_j, \; T_j'^{\,sk_{DSP}} \cdot g^{h_1}\big).$$
  • Second phase of re-encryption (SPRE): $[m_j]^+ \to [m_j]_{pk_i}$. After receiving $[m_j]^+$ from the DSP, the ACS computes:
    $$h_2 = H(pk_i^{\,sk_{ACS}} \,\|\, CID);$$
    $$[m_j]_{pk_i} = (\bar{T_j}, \bar{T_j'}) = \big(\hat{T_j}, \; \hat{T_j'}^{\,sk_{ACS}} \cdot g^{h_2}\big).$$
    This yields the final ciphertext $[m_j]_{pk_i}$.
  • Decryption (Dec): $[m_j]_{pk_i} \to m_j$. Only a data requester holding the corresponding private key $sk_i$ can decrypt $[m_j]_{pk_i}$ correctly:
    $$h_1' = H(pk_{DSP}^{\,sk_i} \,\|\, CID) = H(g^{a \cdot sk_i} \,\|\, CID) = h_1;$$
    $$h_2' = H(pk_{ACS}^{\,sk_i} \,\|\, CID) = H(g^{b \cdot sk_i} \,\|\, CID) = h_2;$$
    $$m_j = L\big(\bar{T_j} \cdot pk_{ACS}^{\,h_1'} \cdot g^{h_2'} / \bar{T_j'} \bmod n^2\big), \text{ where } L(x) = (x-1)/n.$$
    The computation identifier $CID$ is set to "addition" here; since only the addition operation of HRES is used in this paper, we omit $CID$ in the rest of the paper. According to [28], the properties of the above improved homomorphic re-encryption scheme can be summarized as follows.
    (1) Additive homomorphism:
    $$[m_1 + m_2]_{pk_i} = [m_1]_{pk_i} \otimes [m_2]_{pk_i}.$$
    Because [28] does not provide a proof of the additive homomorphism of HRES and this property is important for our scheme, we prove it here.
    Proof of additive homomorphism.
    $$\mathrm{left} = [m_1 + m_2]_{pk_i} = (\bar{T}_{1+2}, \bar{T}'_{1+2}) = \big(\hat{T}_{1+2}, \; \hat{T}'^{\,sk_{ACS}}_{1+2} \cdot g^{h_2}\big) = \big(T_{1+2}, \; ((T'_{1+2})^{sk_{DSP}} \cdot g^{h_1})^{sk_{ACS}} \cdot g^{h_2}\big) = \big(T_{1+2}, \; (T'_{1+2})^{ab} \cdot g^{h_1 b + h_2}\big),$$
    where
    $$T_{1+2} = [1 + (m_1 + m_2) \cdot n] \cdot PK^{r} \bmod n^2 = [1 + (m_1 + m_2) \cdot n] \cdot g^{abr} \bmod n^2;$$
    $$(T'_{1+2})^{ab} \cdot g^{h_1 b + h_2} = (g^{r} \bmod n^2)^{ab} \cdot g^{h_1 b + h_2} = g^{rab + h_1 b + h_2} \bmod n^2;$$
    and
    $$\mathrm{right} = [m_1]_{pk_i} \otimes [m_2]_{pk_i} = (\bar{T_1} \cdot \bar{T_2}, \; \bar{T_1'} \cdot \bar{T_2'}),$$
    where
    $$\bar{T_1} \cdot \bar{T_2} = [(1 + m_1 n) \cdot PK^{r}] \cdot [(1 + m_2 n) \cdot PK^{r}] = [1 + (m_1 + m_2) n + m_1 m_2 n^2] \cdot PK^{2r} = [1 + (m_1 + m_2) n] \cdot g^{2abr} \bmod n^2;$$
    $$\bar{T_1'} \cdot \bar{T_2'} = \big[((g^{r} \bmod n^2)^{sk_{DSP}} \cdot g^{h_1})^{sk_{ACS}} \cdot g^{h_2}\big]^2 = \big[(g^{rab} \bmod n^2) \cdot g^{h_1 b + h_2}\big]^2 = g^{2(rab + h_1 b + h_2)} \bmod n^2.$$
    Hence the right-hand side is exactly the ciphertext of $m_1 + m_2$ formed with randomness $2r$ and with the hash masks $g^{h_1}, g^{h_2}$ applied twice. Decrypting it with $sk_i$ while applying the unmasking terms once per aggregated ciphertext gives
    $$L\big(\bar{T_1}\bar{T_2} \cdot pk_{ACS}^{\,2 h_1'} \cdot g^{2 h_2'} / (\bar{T_1'}\bar{T_2'}) \bmod n^2\big) = L\big(1 + (m_1 + m_2) n\big) = m_1 + m_2,$$
    so both sides decrypt to the same plaintext $m_1 + m_2$, and HRES has the additive homomorphism property. □
    (2) Scalar multiplication:
    $$[t \cdot m]_{pk_i} = ([m]_{pk_i})^{t}.$$
    (3) Negation:
    $$[-m]_{pk_i} = ([m]_{pk_i})^{n-1}.$$
    (4) Resistance to impersonation attacks and to collusion between any server and any distributed data provider, thanks to the two hash values $h_1, h_2$. To avoid repetition, we omit the proofs of the last three properties; the details can be found in [28].
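To make the five algorithms concrete, the following toy sketch (ours, not from [28] or this paper's implementation) walks one value through Enc, FPRE, SPRE and Dec, and checks the additive homomorphism. It uses tiny primes instead of Len(n) = 1024-bit ones, SHA-256 in place of H, and omits CID as the text does; the decryption takes the number of aggregated ciphertexts t as input so the hash masks cancel, matching the proof above.

```python
import hashlib
import random

random.seed(1)
p, q = 1907, 2027                       # toy primes; real ones are ~512 bits
n = p * q; n2 = n * n
g = 4                                   # 2^2: a quadratic residue mod n^2

def H(x: int) -> int:
    """SHA-256 standing in for the scheme's hash H."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big") % n

a = random.randrange(2, n)              # sk_DSP (performs FPRE)
b = random.randrange(2, n)              # sk_ACS (performs SPRE)
pk_dsp, pk_acs = pow(g, a, n2), pow(g, b, n2)
PK = pow(pk_acs, a, n2)                 # negotiated key g^(a*b) mod n^2

sk_i = random.randrange(2, n)           # data provider i's key pair
pk_i = pow(g, sk_i, n2)

def enc(m: int):
    """Enc: [m]_PK = (T, T'), T = (1+m*n)*PK^r, T' = g^r (mod n^2)."""
    r = random.randrange(1, n // 4)
    return ((1 + m * n) * pow(PK, r, n2) % n2, pow(g, r, n2))

def fpre(ct):
    """FPRE by DSP: (T, T') -> (T, T'^a * g^h1)."""
    h1 = H(pow(pk_i, a, n2))
    return (ct[0], pow(ct[1], a, n2) * pow(g, h1, n2) % n2)

def spre(ct):
    """SPRE by ACS: (T^, T^') -> (T^, T^'^b * g^h2)."""
    h2 = H(pow(pk_i, b, n2))
    return (ct[0], pow(ct[1], b, n2) * pow(g, h2, n2) % n2)

def dec(ct, t: int = 1):
    """Dec with sk_i; t = number of aggregated ciphertexts, so the
    unmasking terms are applied once per summand (see the proof above)."""
    h1 = H(pow(pk_dsp, sk_i, n2))       # equals H(pk_i^a)
    h2 = H(pow(pk_acs, sk_i, n2))       # equals H(pk_i^b)
    x = ct[0] * pow(pk_acs, t * h1, n2) * pow(g, t * h2, n2) % n2
    x = x * pow(ct[1], -1, n2) % n2     # divide by the second component
    return (x - 1) // n                 # L(x) = (x - 1) / n

c1 = spre(fpre(enc(7)))
c2 = spre(fpre(enc(35)))
assert dec(c1) == 7
# additive homomorphism: component-wise product of ciphertexts
c_sum = (c1[0] * c2[0] % n2, c1[1] * c2[1] % n2)
assert dec(c_sum, t=2) == 42
```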

2.2. ASGD Based Deep Learning

Deep learning can be seen as a series of algorithms over a neural network consisting of multiple layers, each containing numerous neuron nodes. The neurons of one layer are connected to the neurons of the next layer via weight variables, which feed the activation function that computes the output of each layer. For example, the output of layer $p+1$ is computed as $out^{(p+1)} = f(W^{(p)} \cdot out^{(p)} + b^{(p)})$, where $f(x)$ is the activation function and $(W^{(p)}, b^{(p)})$ denotes the weights and bias connecting layer $p$ with layer $p+1$.
The weight vector of the deep neural network, consisting of all weight variables, needs to be determined through learning. Concretely, take supervised learning as an example: a training dataset is provided first, a cost function $J$ is defined according to the target of the learning task, and the cost function is computed over the data items of the training dataset. The learning process minimizes $J$ by adjusting the values of the weight variables.
The most frequently used adjustment method is stochastic gradient descent (SGD). Denote the weight vector, consisting of all weights in the deep neural network, as $W = (w_1, w_2, \ldots, w_n)$. Generally, the cost function is computed iteratively over different randomly selected subsets (mini-batches) of the training dataset. For example, the cost function computed over a subset of $t$ elements can be denoted $J_{|batch|=t}$. The gradient vector $G$ of the cost function $J$ is then
$$G = \left(\frac{\partial J_{|batch|=t}}{\partial w_1}, \frac{\partial J_{|batch|=t}}{\partial w_2}, \ldots, \frac{\partial J_{|batch|=t}}{\partial w_n}\right).$$
In the learning process using SGD, the update rule for the weight vector is
$$W := W - \alpha \cdot G,$$
where $\alpha \in \mathbb{R}$ is the learning rate.
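As a concrete illustration of this update rule, the short sketch below (ours, not the paper's code) runs mini-batch SGD on a toy least-squares problem; the loss and data are placeholders.

```python
import numpy as np

# Mini-batch SGD on least squares: W := W - alpha * G, G from a random batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
W_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ W_true + 0.01 * rng.normal(size=1000)

W, alpha, batch = np.zeros(5), 0.1, 50
for _ in range(500):
    idx = rng.choice(len(X), size=batch, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    G = 2 * Xb.T @ (Xb @ W - yb) / batch   # gradient of the mean squared error
    W -= alpha * G                         # the SGD update rule
assert np.allclose(W, W_true, atol=0.05)   # converges to the true weights
```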
To make the learning process more efficient, practical asynchronous stochastic gradient descent (ASGD) was proposed in [30], including data parallelism and model parallelism. Data parallelism means the training dataset used for updating the weight vector can be distributed; that is, the dataset consists of data from multiple distributed data providers, and every machine used for training the model has a replica of all weights. Model parallelism separates the model into several parts that can be updated by different machines. A typical example of model parallelism, proposed in [11], is depicted in Figure 1. In this example there are N = 4 learning participants on four machines (represented by four blue rectangles with dotted boundaries), and the five-layer deep neural network model is separated into N = 4 parts. The thick lines connecting nodes in different rectangles are called crossing weights, which connect the parts of the model trained on different machines. The model is usually separated according to the computational capability of the machines.
In model parallelism, the weight variables in the deep neural network can be represented as $W = (W_1, W_2, \ldots, W_N)$, where each component $W_i$ ($i = 1, 2, \ldots, N$) is itself a weight vector. So the update rule becomes
$$W_i := W_i - \alpha \cdot G_i, \quad i = 1, 2, \ldots, N,$$
where $W_i$ contains the weight variables of the $i$th part of the model, updated by the $i$th machine, and $G_i$ is the gradient vector used for updating $W_i$.
We adopt model parallelism in our system. The learning participant i, KTS and computing unit C U i of DSP together act as machine i in Figure 1. Machine i is responsible for training the i th part of model assigned to it. That is to say, W i is updated by the local data of learning participant i. If the weight variable w connects nodes in machine i and machine k, and machine k is responsible for updating w, then the gradient used for updating w should be generated by learning participant k and should be re-encrypted with the public key p k k of learning participant k so that machine k can further process the gradient.
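In miniature, model parallelism amounts to splitting the flat weight vector and applying the update per part, as in this small sketch (ours; the even split is an assumption, whereas the paper splits parts according to machine capability):

```python
import numpy as np

# W = (W_1, ..., W_N); machine i applies W_i := W_i - alpha * G_i on its part.
N, alpha = 4, 1e-4
rng = np.random.default_rng(1)
W = rng.normal(size=20)
parts = np.array_split(W, N)                       # W_1 .. W_N, one per machine
grads = [rng.normal(size=p.shape) for p in parts]  # stand-in gradient vectors
parts = [w_i - alpha * g_i for w_i, g_i in zip(parts, grads)]
W = np.concatenate(parts)                          # reassembled full weight vector
```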

2.3. Gradients Leaking Information

The authors of [15] have shown that even a small portion of the gradients can leak the data providers' original training data. They proved this via four examples: the simplest case, with only one neuron; general neural networks; neural networks with regularization; and the most complex case, with Laplace noise added to the gradients. Here we restate the simplest example.
We focus on the learning process of one single neuron. Suppose $(x, y)$ is the data item, where $x = (x_1, x_2, \ldots, x_n)$ is the neuron input vector and $y$ is the corresponding truth label. The cost function is the squared distance between $y$ and $y_{predict}$, where $y_{predict}$ is computed through the activation function $f$ fed with $x$:
$$y_{predict} = f\Big(\sum_{i=1}^{n} W_i \cdot x_i + b\Big),$$
where $b$ is the bias value. Hence the cost function is
$$J(W, b, x, y) = (y_{predict} - y)^2.$$
Then the gradients can be computed as follows:
$$g_i = \frac{\partial J(W, b, x, y)}{\partial W_i} = 2(y_{predict} - y) \cdot \frac{\partial y_{predict}}{\partial W_i} = 2(y_{predict} - y) \cdot f'\Big(\sum_{i=1}^{n} W_i \cdot x_i + b\Big) \cdot x_i$$
and
$$g = \frac{\partial J(W, b, x, y)}{\partial b} = 2(y_{predict} - y) \cdot \frac{\partial y_{predict}}{\partial b} = 2(y_{predict} - y) \cdot f'\Big(\sum_{i=1}^{n} W_i \cdot x_i + b\Big).$$
So the input component $x_i$ can be revealed by computing $x_i = g_i / g$ when the gradients are known, and the truth label $y$ can be guessed when the input data item $x = (x_1, x_2, \ldots, x_n)$ is an image, according to [15].
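The following toy check (ours) makes the leak concrete for a sigmoid neuron: the ratio $g_i / g$ reproduces the private input exactly, whatever the activation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                        # the private input item
W, b, y = rng.normal(size=5), 0.3, 1.0

f = lambda z: 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
z = W @ x + b
common = 2 * (f(z) - y) * f(z) * (1 - f(z))   # 2*(y_pred - y) * f'(z)
g_i = common * x                              # gradients w.r.t. each W_i
g = common                                    # gradient w.r.t. the bias b

assert np.allclose(g_i / g, x)                # the input is fully recovered
```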

3. System Architecture

The system architecture is shown in Figure 2. There are three kinds of parties in our system: learning participants, KTS and DSP. Their functions are described in detail below.

3.1. Learning Participant

In our system, the learning participants (LPs) act as data providers as well as data requesters. In other words, the learning participants provide the newly computed gradients for the servers to update the weight variables, and request the updated weight variables for the next gradient-generation step. Since the learning participants are distributed in our system, the training datasets are distributed as well.
To preserve the data privacy of learning participants, gradients must be encrypted before being sent to the DSP for further processing. Unlike the encryption method in [15], where all participants use one jointly generated key pair $(sk, pk)$, in our system learning participants encrypt their own gradients with the Diffie–Hellman key $PK$ negotiated by KTS and DSP. $PK$ is known to all learning participants, but its corresponding private key is held by no single party. That is to say, no learning participant can decrypt the ciphertexts of other learning participants to obtain their private data. Therefore the learning participants do not need to set up independent secure channels to communicate with the servers (KTS and DSP).
Every learning participant (suppose there are N participants) implements the following steps; a sketch of one iteration appears after this list.
  • Generate its own key pair. The public key is made public to all parties of the scheme.
  • Randomly select a mini-batch of data from its own local training dataset for this iteration.
  • Download from the DSP the encrypted weights updated in the previous iteration.
  • Decrypt the above encrypted weights with its own private key.
  • Compute new gradients, via partial derivatives, from the data obtained at step 2 and the weights obtained at step 4.
  • Encrypt the new gradients with the Diffie–Hellman key PK and send the ciphertexts to KTS.
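The sketch below (ours) strings these steps together for one participant; every callable is a hypothetical stand-in for the HRES primitives and network transport, and none of the names come from the paper.

```python
from typing import Callable, List, Tuple

Ciphertext = Tuple[int, int]  # an HRES ciphertext (T, T')

def participant_iteration(
    sample_batch: Callable[[], object],                  # step 2
    download_weights: Callable[[], List[Ciphertext]],    # step 3, from CU_i
    dec: Callable[[Ciphertext], int],                    # step 4, HRES Dec with sk_i
    gradients: Callable[[List[float], object], List[float]],  # step 5
    enc: Callable[[int], Ciphertext],                    # step 6, HRES Enc under PK
    send_to_kts: Callable[[List[Ciphertext]], None],
    alpha: float = 1e-4,
    prec: int = 32,
) -> None:
    batch = sample_batch()
    W = [dec(c) / 2**prec for c in download_weights()]   # decrypt, then decode
    G = gradients(W, batch)
    # encode alpha * g as a fixed-point integer before encrypting (Section 4)
    cts = [enc(int((alpha * g_k) * 2**prec)) for g_k in G]
    send_to_kts(cts)
```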

3.2. Key Transform Server

An innovative and important design of our system is that a KTS is responsible for parameter generation and the first phase of re-encryption (FPRE). KTS mainly performs the following operations.
  • Select a parameter k and two large primes p , q randomly, where L e n p = L e n q = k .
  • Compute n = p · q and choose a generator g with maximal order [29].
  • Generate its key pair. Then negotiate the Diffie–Hellman key with DSP.
  • Receive the ciphertexts from learning participants, and perform the FPRE over them.
  • Send the re-encrypted ciphertexts to the corresponding computing unit C U i of DSP ( i = 1 , 2 , , N ) .

3.3. Data Service Provider

The data service provider (DSP) is responsible for the second phase of re-encryption (SPRE) and the weight updates. To make full use of the computational power of the DSP and to facilitate model parallelism, the DSP is split into N parts, named computing units (CU), in our system. The steps the DSP executes are as follows.
  • Generate its key pair.
  • Receive re-encrypted ciphertexts from KTS and perform SPRE on them.
  • Update the weights using encrypted gradients obtained in step 2.
  • Store the updated encrypted weights into the corresponding computing unit C U i of DSP ( i = 1 , 2 , , N ) .

4. System Realization

In this section we use proxy-invisible homomorphic re-encryption to realize the system described in Section 3. All steps of our proposed system, shown in Figure 3, are described in sequence below.
  • ParamGen: $(k, p, q) \to (g, n)$. First KTS selects a security parameter $k$ and two large primes $p$ and $q$, where $\mathrm{Len}(p) = \mathrm{Len}(q) = k$ ($\mathrm{Len}(x)$ is the bit length of the input $x$). Then KTS chooses a generator $g$ with maximal order according to [29] and computes $n = p \cdot q$. Finally KTS publishes the public system parameters $(g, n)$ to all entities in our system.
  • KeyGen: $(g, n) \to (sk, pk, PK)$. Every learning participant generates its own key pair $(sk_i, pk_i)$:
    $$pk_i = g^{sk_i} \bmod n^2, \quad i = 1, 2, \ldots, N.$$
    KTS and DSP generate their key pairs $(sk_{KTS}, pk_{KTS})$ and $(sk_{DSP}, pk_{DSP})$ respectively:
    $$pk_{KTS} = g^{sk_{KTS}} \bmod n^2 = g^{a} \bmod n^2;$$
    $$pk_{DSP} = g^{sk_{DSP}} \bmod n^2 = g^{b} \bmod n^2.$$
    Then KTS and DSP negotiate their Diffie–Hellman key $PK$: KTS sends its public key $pk_{KTS} = g^a \bmod n^2$ to DSP, and DSP sends its public key $pk_{DSP} = g^b \bmod n^2$ to KTS, so both can compute $PK$:
    $$PK = pk_{DSP}^{\,sk_{KTS}} = (g^{b} \bmod n^2)^{a} = g^{a \cdot b} \bmod n^2;$$
    $$PK = pk_{KTS}^{\,sk_{DSP}} = (g^{a} \bmod n^2)^{b} = g^{a \cdot b} \bmod n^2.$$
    Finally, the public keys are published to all entities in our system.
  • Initialization: since we adopt model parallelism, the deep neural network is separated into N parts (N is the number of learning participants). A machine (consisting of a learning participant, KTS and a computing unit of DSP) is responsible for training one part of the network. The weight variables in each part of the network form a weight vector, so N weight vectors are assigned to the N machines respectively. Before the training process, the weight vectors are initialized by KTS and shared by all parties of the system. For clarity, we denote by $W_i^{(j)}$ the weight vector updated by machine $i$ in the $j$th weights-update iteration, and by $G_i^{(j)}$ the gradient vector generated by learning participant $i$ in the $j$th iteration and used for updating $W_i^{(j)}$. The initial weight vectors are denoted $W_1^{(0)}, W_2^{(0)}, \ldots, W_N^{(0)}$. The crossing weights connecting machine $i$ and machine $k$ ($k > i$) are assigned to machine $k$.
  • Data encoding: weights and gradients are generally real numbers, but homomorphic re-encryption requires integer plaintexts. Therefore, before encrypting weights and gradients, a data encoding step is performed: a real number $x \in \mathbb{R}$ is represented by the integer $\lfloor x \cdot 2^{prec} \rfloor$, i.e., $x \cdot 2^{prec}$ rounded down, with $prec$ bits of fractional precision (see the encoding sketch after this list).
  • Gradients generation and encryption: take the first weights-update iteration ($j = 1$) as an example. First, every learning participant (take participant $i$ here) uses a mini-batch of data selected randomly from its local dataset together with the initial weight vector $W_i^{(0)}$ to calculate a gradient vector $G_i^{(1)}$, and then calculates $\alpha \cdot G_i^{(1)}$. Next, the components of $W_i^{(0)}$ and $\alpha \cdot G_i^{(1)}$ are encoded into integers. Finally, every learning participant encrypts the components of its own vector $\alpha \cdot G_i^{(1)}$ with the Diffie–Hellman key $PK$ and sends them to KTS; KTS encrypts each component of the initial weight vector $W_i^{(0)}$ with $PK$.
  • First phase of re-encryption (FPRE): after receiving the ciphertexts $E_{PK}(\alpha \cdot G_1^{(1)}), E_{PK}(\alpha \cdot G_2^{(1)}), \ldots, E_{PK}(\alpha \cdot G_N^{(1)})$ from learning participants $1, 2, \ldots, N$ respectively, KTS performs FPRE over them. Taking the ciphertext received from participant $i$ as an example, KTS first computes the hash value $h_1$ and then the re-encrypted ciphertexts:
    $$h_1 = H(pk_i^{\,sk_{KTS}});$$
    $$E_{PK}(W_i^{(0)})^+ = (\hat{W_i}, \hat{W_i'}) = \big(W_i, \; W_i'^{\,sk_{KTS}} \cdot g^{h_1}\big);$$
    $$E_{PK}(\alpha \cdot G_i^{(1)})^+ = (\hat{T_i}, \hat{T_i'}) = \big(T_i, \; T_i'^{\,sk_{KTS}} \cdot g^{h_1}\big).$$
    Finally, KTS sends $E_{PK}(W_i^{(0)})^+$ and $E_{PK}(\alpha \cdot G_i^{(1)})^+$ to DSP.
    Note that if a gradient component is used for updating a crossing weight, it is re-encrypted with the public key of the learning participant to which the crossing weight is assigned.
  • SPRE and homomorphic addition: DSP receives $E_{PK}(W_i^{(0)})^+$ and $E_{PK}(\alpha \cdot G_i^{(1)})^+$ ($1 \le i \le N$) from KTS and stores them in the corresponding computing unit $CU_i$. Each computing unit performs SPRE. Taking $CU_i$ as an example, it computes:
    $$h_2 = H(pk_i^{\,sk_{DSP}});$$
    $$E_{pk_i}(W_i^{(0)}) = (\bar{W}, \bar{W'}) = \big(\hat{W}, \; \hat{W'}^{\,sk_{DSP}} \cdot g^{h_2}\big);$$
    $$E_{pk_i}(\alpha \cdot G_i^{(1)}) = (\bar{T}, \bar{T'}) = \big(\hat{T}, \; \hat{T'}^{\,sk_{DSP}} \cdot g^{h_2}\big).$$
    Note that $E_{pk_i}(W_i^{(0)})$ and $E_{pk_i}(\alpha \cdot G_i^{(1)})$ now have the additive homomorphism property, so computing unit $i$ of DSP can update the weight vector (the subtraction in $W := W - \alpha \cdot G$ can be realized via the negation property (3) of HRES):
    $$E_{pk_i}(W_i^{(1)}) = E_{pk_i}(W_i^{(0)}) \otimes E_{pk_i}(\alpha \cdot G_i^{(1)}), \quad 1 \le i \le N.$$
    Finally, $E_{pk_i}(W_i^{(1)})$ is stored in the corresponding computing unit $CU_i$ of DSP.
  • Decryption: from the second weights-update iteration on, each learning participant downloads the updated weight vector from its corresponding computing unit of DSP. That is, learning participant $i$ downloads $E_{pk_i}(W_i^{(j)})$ from $CU_i$ of DSP ($1 \le j \le N_{wu}$, $1 \le i \le N$), and then decrypts it with its private key $sk_i$:
    $$h_1' = H(pk_{KTS}^{\,sk_i});$$
    $$h_2' = H(pk_{DSP}^{\,sk_i});$$
    $$W_i^{(j)} = L\big(\bar{T} \cdot pk_{DSP}^{\,h_1'} \cdot g^{h_2'} / \bar{T'} \bmod n^2\big), \text{ where } L(x) = (x-1)/n.$$
    The decrypted $W_i^{(j)}$ can then be used in the next gradient-generation iteration, or in the deep learning model configuration once training is finished.
  • Iteration: while training is not finished, each learning participant repeats steps 4 to 8 above to iterate the weights-update process. In other words, learning participant $i$ computes the gradient vector $G_i^{(j+1)}$ with another mini-batch of data from its local dataset together with $W_i^{(j)}$. At the end of the training process, all learning participants obtain the ultimate weight vector $W = (W_1^{(N_{wu})}, W_2^{(N_{wu})}, \ldots, W_N^{(N_{wu})})$ by downloading and decrypting $E_{pk_i}(W_i^{(N_{wu})})$ from $CU_i$ of DSP ($1 \le i \le N$) respectively.
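A minimal sketch of the data-encoding step follows (ours); mapping negative values into $\mathbb{Z}_n$ as $n + m$ is our assumption, since the paper does not spell out sign handling.

```python
import math

PREC = 32  # bits of fractional precision, as set in Section 6

def encode(x: float, n: int) -> int:
    """Fixed-point encode: round down x * 2^PREC, mapped into Z_n."""
    m = math.floor(x * (1 << PREC))
    return m % n                         # negatives become n + m

def decode(m: int, n: int) -> float:
    if m > n // 2:                       # interpret the upper half as negative
        m -= n
    return m / (1 << PREC)

n = (1 << 1024) - 159                    # stand-in odd modulus, Len(n) = 1024 bits
for x in (0.75, -3.14159e-4):
    assert abs(decode(encode(x, n), n) - x) < 2**-PREC
```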

5. Security Analysis

In this section, we present the computational hardness assumption on which our scheme is based, and then study the security of our scheme.

5.1. Assumption

Decisional Diffie–Hellman (DDH) problem [31] over $\mathbb{Z}_{n^2}^{*}$: for every probabilistic polynomial-time algorithm $F$, there is a negligible function $negl(\cdot)$ such that, for sufficiently large $l$:
$$\Pr\left[F(n, X, Y, Z_b) = b \;\middle|\; \begin{array}{l} p, q \leftarrow SP(0.5 \cdot l);\ n = pq;\ g \leftarrow \mathbb{G};\ x, y, z \leftarrow [1, \mathrm{ord}(\mathbb{G})]; \\ X = g^x \bmod n^2;\ Y = g^y \bmod n^2;\ Z_0 = g^z \bmod n^2; \\ Z_1 = g^{xy} \bmod n^2;\ b \leftarrow \{0, 1\} \end{array}\right] = negl(l) + 0.5.$$
That is to say, given $g^x$ and $g^y$, the probability that an adversary distinguishes $g^z$ from $g^{xy}$ is negligible.

5.2. Security of Our Scheme

In this section, we analyze the security of our scheme. As mentioned in Section 1.1, we consider the scenario in which KTS and DSP do not collude with each other, but KTS or DSP may collude with one of the learning participants. First we prove that our scheme is secure in the presence of semi-honest adversaries ($A_{LP}, A_{KTS}, A_{DSP}$) under the non-colluding setting; then we analyze the security of our system when there are colluding adversaries.
Proof. 
Our scheme is secure in the presence of semi-honest adversaries under the non-colluding setting. Our scheme is based on HRES, which is proved to be semantically secure in [28] based on the hardness of the Decisional Diffie–Hellman problem above; we omit that proof to avoid repetition. Our scheme involves three types of entities: LP, KTS and DSP. Three kinds of challengers $C_{LP}, C_{KTS}, C_{DSP}$ are constructed to play against three kinds of adversaries $A_{LP}, A_{KTS}, A_{DSP}$, who aim to corrupt LP, KTS and DSP respectively.
When a new gradient G is generated, C L P challenges A L P as follows. First, C L P multiplies it by α and encodes the result. Then C L P encrypts the encoded result with P K as [ m ] P K . Then C L P sends [ m ] P K to A L P , and outputs the entire view of A L P : [ m ] P K . A L P ’s views in ideal and real executions are indistinguishable because of the security of HRES.
C K T S challenges A K T S as follows. C K T S runs encryption on two randomly chosen integers with P K as [ G ] P K and [ w ] P K . Then [ G ] P K is multiplied by [ w ] P K . Next C K T S performs the FPRE on the result to get [ m ] + , which is sent to A K T S . If A K T S responds with ⊥, C K T S returns ⊥. A K T S ’s views are made up of the encrypted data. A K T S gets the same outputs in both ideal and real executions because the LPs are honest and the HRES is proved to be secure. Therefore A K T S ’s views are indistinguishable.
C D S P challenges A D S P as follows. C D S P first chooses [ m ] + randomly and performs the SPRE on it with C D S P ’s private key to get [ m ] p k j . Then [ m ] p k j is sent to A D S P . If A D S P responds with ⊥, C D S P returns ⊥. A D S P ’s views are made up of the encrypted data. A D S P gets the same output [ m ] p k j in both ideal and real executions because of the security of HRES. Therefore A D S P ’s views are indistinguishable.
When it comes to decryption of the updated weights, $C_{LP}$ challenges $A_{LP}$ as follows. $C_{LP}$ chooses $[m]_{pk_j}$ randomly and decrypts it. The result $m$ is sent to $A_{LP}$. If $A_{LP}$ responds with ⊥, $C_{LP}$ returns ⊥. The result $m$ is the view of $A_{LP}$, which is indistinguishable between the ideal and real executions because of the security of HRES. Therefore $A_{LP}$'s views are indistinguishable. □
Therefore our scheme is secure in the presence of semi-honest adversaries under the non-colluding setting. Next, we show that our system remains secure even when KTS colludes with one of the learning participants, or DSP does. We suppose that the learning participants, KTS and DSP are honest but curious: all parties in the system perform the executions they should, following the system steps, but may try to obtain the local data of learning participants. We analyze the two situations in turn.
Situation 1: KTS colludes with one of the learning participants. If KTS colludes with learning participant A, they can share information with each other and try to obtain the private information of other learning participants. That is to say, KTS can get the gradients of A, even the private key of A, and A can get the private key of KTS. In the scheme of [15], if the cloud server colludes with a learning participant and obtains that participant's private key, the server can decrypt the gradients of all learning participants, because their private keys are identical; and since gradients leak information about the original data, the private data of all learning participants are exposed. This is impossible in our proposed scheme, because all learning participants encrypt their gradients with the Diffie–Hellman key negotiated by KTS and DSP: KTS cannot decrypt the gradients by itself, even with the private key of a learning participant, because the generation of the hash value $h_2$ resists impersonation attacks and collusion between KTS and any learning participant. KTS must honestly perform the steps of the first phase of re-encryption described in Section 4. Therefore, the private data of learning participants remain secure even if KTS colludes with one of the learning participants in our scheme.
Situation 2: DSP colludes with one of the learning participants. If DSP colludes with learning participant B, the situation is similar to Situation 1: DSP can get the gradients of B, even the private key of B, and B can get the private key of DSP as well as all the re-encrypted weights and gradients. DSP must honestly perform the steps of the second phase of re-encryption described in Section 4. With the private key of learning participant B, DSP can decrypt only the gradients from B, and learns nothing about the gradients of other learning participants, because the generation of the hash value $h_1$ resists impersonation attacks and collusion between DSP and any learning participant. Therefore, the original data of learning participants are safe even if DSP colludes with one of the learning participants.
To sum up, our scheme is resistant to collusion between any cloud server and any learning participant, a property the scheme of [15] lacks.

6. Performance Evaluation

In this section, we evaluate the communication cost and computational cost of our proposed scheme through theoretical analysis and simulation. First, we introduce the concrete parameter settings of our experiments.
It is shown in [28] that the length of $n$, $\mathrm{Len}(n)$, greatly influences both the computational efficiency and the security of HRES: experimental results showed that a larger $\mathrm{Len}(n)$ means longer communication time but higher security. To balance efficiency and security, we set $\mathrm{Len}(n) = 1024$ bits, which ensures that learning participants with limited resources can finish the decryption process efficiently according to [28].
In the experiment, we generated all private keys randomly. The influence of private key length on HRES was tested in [28]; the results show that for key lengths between 100 and 200 bits the computational costs are similar. To compare with the scheme in [15], we set the private key length to 128 bits in our system, which provides the same security parameter length as in [15].
In the data encoding process, we set $prec = 32$; that is, encoded values carry 32 bits of fractional precision. We set $N = 10$, which means the deep neural network was separated evenly into 10 parts, with 10 learning participants, 10 machines and 10 weight vectors.

6.1. Communication Cost Analysis

In this section we discuss the total communication cost of our scheme, $C_{our\,system}$. As shown in Figure 3, there are three kinds of communication: between learning participants and KTS ($C_{pk}$), between KTS and DSP ($C_{kd}$), and between DSP and learning participants ($C_{dp}$).
To estimate the communication cost of our scheme, we first define:
$$C_{one\,iteration}(j) = C_{pk}(j) + C_{kd}(j) + C_{dp}(j),$$
$$C_{our\,system} = \sum_{j=1}^{N_{wu}} C_{one\,iteration}(j),$$
where $C_{one\,iteration}(j)$ is the communication cost of the $j$th iteration.
Next, we discuss the communication cost of transmitting one ciphertext between two parties. A ciphertext $[m_i]$ in our scheme is composed of two components, $[m_i] = (T_i, T_i')$, where $T_i = (1 + m_i \cdot n) \cdot PK^r \bmod n^2$ and $T_i' = g^r \bmod n^2$. Both components are values modulo $n^2$, so each has $2\,\mathrm{Len}(n)$ bits, and uploading or downloading one ciphertext requires transmitting $4\,\mathrm{Len}(n)$ bits. Since we set $\mathrm{Len}(n) = 1024$ bits, transmitting one ciphertext in our scheme means transmitting 4096 bits.
Finally, according to the system realization in Figure 3, we obtain the following results. For the first iteration:
$$C_{one\,iteration}(1) = C_{pk}(1) + C_{kd}(1) + C_{dp}(1) = N \cdot 4096 + 2N \cdot 4096 + N \cdot 4096 = 4N \cdot 4096.$$
For iterations $2 \le j \le N_{wu}$:
$$C_{one\,iteration}(j) = C_{pk}(j) + C_{kd}(j) + C_{dp}(j) = N \cdot 4096 + N \cdot 4096 + N \cdot 4096 = 3N \cdot 4096.$$
So the total communication cost of our scheme is
$$C_{our\,system} = \sum_{j=1}^{N_{wu}} C_{one\,iteration}(j) = 4N \cdot 4096 + 3(N_{wu} - 1) \cdot N \cdot 4096 = (1 + 3N_{wu}) \cdot N \cdot 4096.$$
To compare the communication cost of our scheme with that of the scheme in [15], we analyze the increased communication factor of our scheme:
$$F_{increased}^{com} = \frac{EncryptedBits}{PlainBits}.$$
Since $\mathrm{Len}(n) = \log_2 n = 1024$ bits in our system, we can pack $t = \lfloor \log_2 n / (prec + pad) \rfloor$ real numbers (after encoding into integers) into one HRES plaintext, where the precision is $prec = 32$ bits and $pad = \lceil \log_2 N_{wu} \rceil$ bits prevent overflow in ciphertext additions [15]. Each ciphertext in our scheme can therefore encrypt $t$ gradients, and the increased factor is
$$F_{increased}^{com} = \frac{EncryptedBits}{PlainBits} = \frac{(1 + 3N_{wu}) \cdot N \cdot 4096}{t \cdot 2(prec + pad) \cdot N_{wu} \cdot N} \approx \frac{(1 + 3N_{wu}) \cdot N \cdot 4096}{2 \log_2 n \cdot N_{wu} \cdot N} = \frac{(1 + 3N_{wu}) \cdot 2}{N_{wu}} = \frac{2}{N_{wu}} + 6.$$
When the number of weights-update iterations $N_{wu}$ is large enough, the increased factor of our scheme approaches 6. A comparison of the communication cost between our scheme and the latest works can be found in Table 2.
Therefore, our scheme incurs about six times the communication cost of the corresponding plain ASGD, and about twice that of the scheme in [15]. In terms of each learning participant, however, the communication cost of our scheme is similar to that in [15].
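A small sketch (ours) of the packing idea used in the factor above: $t$ encoded values share one $\mathbb{Z}_n$ plaintext, each occupying a slot of $prec + pad$ bits, and slot-wise sums survive ciphertext addition as long as no slot overflows its pad bits.

```python
import math

prec, pad = 32, math.ceil(math.log2(2e4))   # pad = 15 for N_wu = 2 * 10^4
slot = prec + pad                            # 47 bits per slot
t = 1024 // slot                             # 21 values per 1024-bit plaintext

def pack(vals):
    """Pack encoded non-negative integers (< 2^prec) into one plaintext."""
    m = 0
    for v in reversed(vals):
        m = (m << slot) | v
    return m

def unpack(m, count):
    return [(m >> (i * slot)) & ((1 << slot) - 1) for i in range(count)]

a, b = [3, 5, 7], [10, 20, 30]
# additions stay slot-aligned: pack(a) + pack(b) packs the element-wise sums
assert unpack(pack(a) + pack(b), 3) == [13, 25, 37]
```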
The total number of gradients in the network is $N_g = 109{,}386$ in our system. After multiplication by $\alpha$ (the learning rate), every gradient is encoded into a 32-bit integer $\lfloor (\alpha \cdot G) \cdot 2^{prec} \rfloor$ and then encrypted with $PK$. Following [15], we calculate the total ciphertext size as
$$L_{ciphertext} = N_g \cdot \mathrm{Len}(one\ gradient) \cdot F_{increased}^{com} = 109386 \times 32 \times 6 \text{ bits} \approx 2.6 \text{ MB}.$$
All the ciphertext in our system can be sent in about
$$T_{communication} = L_{ciphertext} / 1\,\mathrm{Gbps} \approx 20.8\ \mathrm{ms}$$
over a 1 Gbps communication channel (assuming the same channel speed as in [15]).
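These totals can be re-derived with a few lines of arithmetic (our check, with the assumptions stated in the text):

```python
N_g = 109_386            # gradients in the 784-128-64-10 MLP
F_inc = 6                # increased communication factor, large-N_wu limit
bits = N_g * 32 * F_inc  # 32-bit encoded gradients, inflated by F_inc

print(f"{bits / 8 / 1e6:.2f} MB")     # -> 2.63 MB, matching the ~2.6 MB above
print(f"{bits / 1e9 * 1e3:.1f} ms")   # -> 21.0 ms over a 1 Gbps channel
```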

6.2. Computational Cost Analysis

In this section we analyze the computational cost of our scheme. As shown in Figure 2, there are three kinds of parties in our system: learning participants, KTS and DSP; we analyze the computational cost of each in turn, evaluated via the running time of the algorithms they execute. First we define:
$$T_{one\,iteration}(j) = \big(T_{gradients\,generation}(j) + T_{enc}(j)\big) + T_{FPRE}(j) + \big(T_{SPRE}(j) + T_{add}(j)\big) + T_{dec}(j),$$
$$T_{our\,system} = \sum_{j=1}^{N_{wu}} T_{one\,iteration}(j).$$
Here, $T_{one\,iteration}(j)$ is the total running time of the $j$th weights-update process. When a new gradient vector (consisting of gradients generated by all learning participants) is generated, one weights-update process begins. Each learning participant encrypts its own gradients, so the computational cost of the learning participants is $(T_{gradients\,generation}(j) + T_{enc}(j))$. KTS then carries out FPRE over the gradients using the public keys of the respective learning participants, so the computational cost of KTS is $T_{FPRE}(j)$. Next, DSP performs SPRE over the received ciphertexts using the learning participants' public keys and updates the weights via homomorphic addition, so the computational cost of DSP is $(T_{SPRE}(j) + T_{add}(j))$. Finally, the learning participants download and decrypt the updated weights with their own private keys, at cost $T_{dec}(j)$.
Next, we estimated the computational cost of our scheme through simulation. To compare with the LWE-based scheme in [15], we used the TensorFlow 1.8.0 library and CUDA 9.0 to construct the same fully connected multilayer perceptron (MLP) as in [15], with four layers (784-128-64-10 neurons sequentially). The total number of gradients in the network is likewise $N_g = (784 + 1) \times 128 + (128 + 1) \times 64 + (64 + 1) \times 10 = 109{,}386$. The detailed experimental settings for model training are shown in Table 3.
We implemented our scheme in C++. The network and scheme were run on a computer with an NVIDIA GeForce GTX 1080 Ti, an Intel(R) Core(TM) i7-4702MQ CPU @ 2.20 GHz and 16 GB of memory, running Windows 10.
In the simulation, the average running time of the gradient-generation process for training the MLP in our scheme was $T_{gradients\,generation} = 1.3$ ms, and the HRES running time for processing the 109,386 gradients of the network with 10 learning participants in one weights-update process was $T_{HRES} = 5192.4$ ms. Therefore, the running time of one weights-update process in our system was
$$T_{one\,update} = T_{gradients\,generation} + T_{HRES} = 1.3 + 5192.4 = 5193.7\ \mathrm{ms}.$$
The average running times per weights-update iteration for the homomorphic re-encryption steps of our scheme (encryption, FPRE, SPRE, homomorphic addition and decryption) are depicted in Figure 4, denoted encTime, FPRETime, SPRETime, addTime and DecTime respectively.
As mentioned above, the learning participants are responsible for encryption and decryption, KTS for FPRE, and DSP for SPRE and addition. We therefore report their computational costs separately in Table 4 (accurate to two decimal places). DSP has the maximum computational overhead (55%). The learning participants together account for 33% of the total computational cost; since there are 10 learning participants and the model is split evenly, the average computational cost per participant is just 3.3%, much lower than that of KTS or DSP. This configuration is reasonable because a cloud server's computational capability is generally much higher than that of any individual data provider.
A comparison of the experimental results is shown in Figure 5 and Table 5. In Figure 5, $T_{enc}$ of our scheme consists of three parts: the running times of encryption, FPRE and SPRE. In Table 5, $T_{weights\,update}$ is the running time for processing the gradients and updating the weights. In our scheme it consists of five parts: encryption (1112.4 ms) and decryption (584.1 ms) by the learning participants, FPRE (630.2 ms) by KTS, and SPRE (615.1 ms) and homomorphic addition (2250.6 ms) by DSP. The LWE-MLP scheme in [15], on the other hand, consists of three parts: encryption (899.2 ms) and decryption (785.4 ms) by the learning participants, and homomorphic addition (278.9 ms) by DSP.
The running time of our system is about 2.64 times that of LWE-MLP [15]. Although the total running time of our system is longer, the computational cost of the learning participants is similar to that of [15], as shown in Table 6. The accuracy is 97.1%, slightly higher than the 97% in [15].

7. Conclusions

Considering that a previous distributed deep learning scheme sharing a single key pair for encryption suffers from collusion between the cloud server and any learning participant [15], we propose a novel system using homomorphic re-encryption that realizes privacy-preserving distributed deep learning. We give the detailed realization steps of our system and implement them for validation, and provide a security analysis and performance evaluation. The communication cost of our scheme is tolerable, and experimental results show that although the running time of our system is longer than that of LWE-MLP in [15], the computational overhead of the learning participants is nearly the same as in [15]. More importantly, our system is more secure and more accurate: it resists collusion between any server and any learning participant, and achieves higher deep learning accuracy.

Author Contributions

F.T. contributed to writing—original draft preparation, methodology, software and validation of the proposed scheme; W.W. contributed to conceptualization; J.L. contributed to writing—review and editing and funding acquisition; H.W. contributed to supervision; and M.X. contributed to project administration.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61801489.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LP: Learning participant
KTS: Key transform server
DSP: Data service provider
LWE-MLP: Learning with errors based multilayer perceptron

References

  1. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education Limited: Harlow, UK, 2016. [Google Scholar]
  2. Bennett, C.C.; Hauser, K. Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach. Artif. Intell. Med. 2013, 57, 9–19. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Marjani, M.; Nasaruddin, F.; Gani, A.; Karim, A.; Hashem, I.A.T.; Siddiqa, A.; Yaqoob, I. Big IoT data analytics: Architecture, opportunities, and open research challenges. IEEE Access 2017, 5, 5247–5261. [Google Scholar]
  4. Jamal, A.; Syahputra, R. Heat Exchanger Control Based on Artificial Intelligence Approach. Int. J. Appl. Eng. Res. (IJAER) 2016, 11, 9063–9069. [Google Scholar]
  5. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [Google Scholar] [CrossRef] [PubMed]
  6. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Mayer-Schönberger, V.; Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, And Think; Houghton Mifflin Harcourt: Boston, MA, USA, 2013. [Google Scholar]
  8. Zhong, P.; Gong, Z.; Li, S.; Schönlieb, C.B. Learning to diversify deep belief networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3516–3530. [Google Scholar] [CrossRef]
  9. Wang, Z. The applications of deep learning on traffic identification. BlackHat USA 2015. Available online: https://www.blackhat.com/docs/us-15/materials/us-15-Wang-The-Applications-Of-Deep-Learning-On-Traffic-Identification-wp.pdf (accessed on 8 April 2019).
  10. Zhang, L.; Gopalakrishnan, V.; Lu, L.; Summers, R.M.; Moss, J.; Yao, J. Self-learning to detect and segment cysts in lung CT images without manual annotation. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 1100–1103. [Google Scholar]
  11. Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q.V.; et al. Large scale distributed deep networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, ND, USA, 3–6 December 2012; pp. 1223–1231. [Google Scholar]
  12. Liang, Y.; Cai, Z.; Yu, J.; Han, Q.; Li, Y. Deep learning based inference of private information using embedded sensors in smart devices. IEEE Netw. 2018, 32, 8–14. [Google Scholar] [CrossRef]
  13. Hao, J.; Huang, C.; Ni, J.; Rong, H.; Xian, M.; Shen, X.S. Fine-grained data access control with attribute-hiding policy for cloud-based IoT. Comput. Netw. 2019, 153, 1–10. [Google Scholar] [CrossRef]
  14. Wu, W.; Parampalli, U.; Liu, J.; Xian, M. Privacy preserving k-nearest neighbor classification over encrypted database in outsourced cloud environments. World Wide Web 2019, 22, 101–123. [Google Scholar] [CrossRef]
  15. Phong, L.T.; Aono, Y.; Hayashi, T.; Wang, L.; Moriai, S. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Secur. 2018, 13, 1333–1345. [Google Scholar]
  16. Shokri, R.; Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer And Communications Security, Denver, CO, USA, 12–16 October 2015; ACM: New York, NY, USA, 2015; pp. 1310–1321. [Google Scholar]
  17. Chai, Q.; Gong, G. Verifiable symmetric searchable encryption for semi-honest-but-curious cloud servers. In Proceedings of the 2012 IEEE International Conference on Communications (ICC), Ottawa, ON, Canada, 10–15 June 2012; pp. 917–922. [Google Scholar]
  18. Papernot, N.; Abadi, M.; Erlingsson, U.; Goodfellow, I.; Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. arXiv, 2016; arXiv:1610.05755. [Google Scholar]
  19. Phan, N.; Wang, Y.; Wu, X.; Dou, D. Differential Privacy Preservation for Deep Auto-Encoders: An Application of Human Behavior Prediction. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 16, pp. 1309–1316. [Google Scholar]
  20. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; ACM: New York, NY, USA, 2016; pp. 308–318. [Google Scholar]
  21. Hitaj, B.; Ateniese, G.; Perez-Cruz, F. Deep models under the GAN: Information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; ACM: New York, NY, USA, 2017; pp. 603–618. [Google Scholar]
  22. Mohassel, P.; Zhang, Y. SecureML: A system for scalable privacy-preserving machine learning. In Proceedings of the 2017 38th IEEE Symposium on Security and Privacy (SP). IEEE, San Jose, CA, USA, 22–24 May 2017; pp. 19–38. [Google Scholar]
  23. Li, P.; Li, J.; Huang, Z.; Li, T.; Gao, C.Z.; Yiu, S.M.; Chen, K. Multi-key privacy-preserving deep learning in cloud computing. Future Gener. Comput. Syst. 2017, 74, 76–85. [Google Scholar] [CrossRef]
  24. Li, P.; Li, J.; Huang, Z.; Gao, C.Z.; Chen, W.B.; Chen, K. Privacy-preserving outsourced classification in cloud computing. Cluster Comput. 2018, 21, 277–286. [Google Scholar] [CrossRef]
  25. Zhang, Q.; Yang, L.T.; Chen, Z. Privacy preserving deep computation model on cloud for big data feature learning. IEEE Trans. Comput. 2016, 65, 1351–1362. [Google Scholar] [CrossRef]
  26. Rouhani, B.D.; Riazi, M.S.; Koushanfar, F. Deepsecure: Scalable provably-secure deep learning. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018; pp. 1–6. [Google Scholar]
  27. Kiraz, M.; Schoenmakers, B. A protocol issue for the malicious case of Yao’s garbled circuit construction. In Proceedings of the 27th Symposium on Information Theory in the Benelux, Louvain-la-Neuve, Belgium, 19–20 May 2006; pp. 283–290. [Google Scholar]
  28. Ding, W.; Yan, Z.; Deng, R.H. Encrypted data processing with homomorphic re-encryption. Inf. Sci. 2017, 409, 35–55. [Google Scholar] [CrossRef]
  29. Ateniese, G.; Fu, K.; Green, M.; Hohenberger, S. Improved proxy re-encryption schemes with applications to secure distributed storage. ACM Trans. Inf. Syst. Secur. (TISSEC) 2006, 9, 1–30. [Google Scholar] [CrossRef] [Green Version]
  30. Recht, B.; Re, C.; Wright, S.; Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 693–701. [Google Scholar]
  31. Boneh, D. The decision diffie-hellman problem. In Proceedings of the International Algorithmic Number Theory Symposium, Portland, OR, USA, 21–25 June 1998; Springer: Berlin, Germany, 1998; pp. 48–63. [Google Scholar]
Figure 1. An example of model parallelism.
Figure 2. System architecture.
Figure 3. System realization.
Figure 4. (a) Average running time for every step. (b) Percentage of running time for every step.
Figure 5. Experimental results comparison.
Table 1. Notations.
Symbol | Description
$[m_i]_k$ | The ciphertext of $m_i$ encrypted with key $k$
$\alpha$ | The learning rate
$N$ | The number of learning participants
$N_{wu}$ | The number of weights-update iterations
$\mathrm{Len}(*)$ | The bit length of the input data
$H(*)$ | The hash value of the input data
$(k, p, q) \to (g, n)$ | Function with input $(k, p, q)$ and output $(g, n)$
$p \cdot q$ | The product of $p$ and $q$
Table 2. Communication cost comparison.
Scheme | Increased Communication Factor | Value
LWE-based scheme in [15] | $N \cdot n \cdot \log_2 q / (N_{wu} \cdot prec) + \log_2 q / prec$ | -
Paillier-based scheme in [15] | $2 \cdot (1 + pad / prec)$ | ≈2.93
Our scheme | $2 / N_{wu} + 6$ | ≈6
Table 3. Experimental settings in model training.
Scheme | Batch Size | Learning Rate | Precision | Iterations | Activation | Dataset
LWE-MLP [15] | 50 images | $10^{-4}$ | 32 bits | $2 \times 10^4$ | ReLU | MNIST
Our scheme | 50 images | $10^{-4}$ | 32 bits | $2 \times 10^4$ | ReLU | MNIST
Table 4. Average computational cost of the parties in our scheme.
Party | Every Learning Participant | KTS | DSP
Time item | encTime + DecTime | FPRETime | SPRETime + addTime
Average time | 1.55 μs | 5.76 μs | 26.20 μs
Percentage | 3.3% | 12% | 55%
Table 5. Experimental results comparison.
Scheme | $T_{gradients\,generation}$ (ms) | $T_{weights\,update}$ (ms) | $T_{Total}$ (ms) | Accuracy
LWE-MLP [15] | 4.6 | 1963.5 | 1968.1 | 97%
Our scheme | 1.3 | 5192.4 | 5193.7 | 97.1%
Table 6. Computational cost comparison of the parties.
Time (ms) | Learning Participants | KTS | DSP
LWE-MLP [15] | 899.2 + 785.4 = 1684.6 | - | 278.9
Our scheme | 1112.4 + 584.1 = 1696.5 | 630.2 | 615.1 + 2250.6 = 2865.7

