Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption

Xie, Shaoxiong; Ye, Jun; Ou, Wei

doi:10.3390/electronics14030640

Open AccessArticle

Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption

by

Shaoxiong Xie

^1,†

,

Jun Ye

^1,*,†

and

Wei Ou

²

¹

Key Laboratory of Internet Information Retrieval of Hainan Province, School of Cyberspace Security, Hainan University, Haikou 570228, China

²

School of Cyberspace Security, Hainan University, Haikou 570228, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Electronics 2025, 14(3), 640; https://doi.org/10.3390/electronics14030640

Submission received: 12 January 2025 / Revised: 1 February 2025 / Accepted: 4 February 2025 / Published: 6 February 2025

(This article belongs to the Section Computer Science & Engineering)

Download

Browse Figures

Versions Notes

Abstract

:

Machine-learning applications are becoming increasingly widespread. However, machine learning is highly dependent on high-quality, large-scale training data. Due to the limitations of data privacy and security, in order to accept more user data, users are required to participate in the computation themselves through the secure use of secret keys. In this paper, we propose a multi-user encrypted machine-learning system based on partially homomorphic encryption, which can be realized for the purpose of supporting encrypted machine learning under multiple users. In this system, offline homomorphic computation is provided, so that users can support homomorphic computation without interacting with the cloud after locally executing encryption, and all computational parameters are computed in the initial and encryption phases. In this system, the isolation forest algorithm is modified appropriately so that its computation can be within the supported homomorphic computation methods. The comparison with other schemes in the comparison experiments reflects this scheme’s computational and communication advantages. In the application experiments, where anomaly detection is taken as the goal, the encrypted machine-learning system can provide more than 90% recall, illustrating this scheme’s usability.

Keywords:

partially homomorphic encryption; offline homomorphic computation; encrypted machine learning; anomaly detection

1. Introduction

With the rapid development of big data, cloud computing, and artificial intelligence, machine learning is increasingly used in a variety of fields. For example, artificial-intelligence algorithms such as logistic regression, decision tree, deep learning, and reinforcement learning are applied in the fields of finance, assisted medical care, speech recognition, and reading comprehension [1,2,3,4,5]. Under the Internet of Energy (IoE) healthcare fields, machine-learning algorithms need a large amount of effective training data in order for them to obtain high-quality machine-learning models, and the quantity and quality of training data both need to be sufficiently high during the training of the machine-learning models, otherwise even the best algorithms cannot perform well [6,7,8]. Therefore, how to obtain a large amount of normal data is a problem that needs to be considered for machine learning.

The amount of data collected and involved in training only, specified by the demands of the users themselves, is not enough. Adequate training data are needed to ensure that the model’s performance meets expectations. We therefore aim to securely collect private data and eliminate as much malicious data as possible. To collect private data, we first need to ensure the privacy and security of user data and then train the model from this secure data. The training data may carry private information [9], such as medical information, fingerprint features, personal information, and other private information [10,11,12]; it is impossible to provide the original information directly from the individual in possession of such information, and it needs to undergo some privacy-preserving processing. There are many types of technique to ensure privacy and security [13,14]; encryption has been applied in various different fields and has demonstrated its effectiveness [15,16], so we will also use encryption to address the privacy and security problem.

Privacy and security can be achieved through encryption [17,18], but the computation that encrypted data can support is limited. There is a conflict between privacy protection and computation; satisfying computational needs requires giving up a certain amount of security, and ensuring security performance requires giving up part of the functionality or efficiency. Therefore, it is of research interest to complete the computation and make it as efficient as possible while still ensuring security. Especially in the scenario of sharing data among multiple users, encryption needs to share data among multiple users and support computation while sharing data. Therefore, research is also needed to determine how to complete the sharing and computation under encryption. Since the computation mode supported by each encryption method does not necessarily support all machine-learning models, it is necessary to design different encryption methods, as well as corresponding sharing and computation methods for different machine-learning models, so that they can be computed efficiently. In the current more popular cloud-computing models, the cloud server has a strong computing capacity, and the reasonable allocation of resources can enhance the performance [19,20]. Compared with the weak computing capacity of the user, if most of the computational cost is allocated to the cloud server, this can improve the computational performance; it is therefore necessary to allocate more computation to the cloud server in the design of the computing.

There are many machine-learning models used to differentiate between normal and abnormal data; considering the robustness and the performance effect under data poisoning, the random forest model was chosen [21]. However, since there is no expert model under all fields, users also need to have an unsupervised learning model. The isolation forest model performs well in abnormal data detection [22]; therefore, the isolation forest model is used as the encrypted machine-learning model.

In the multi-user encrypted machine-learning model, multiple users provide additional data, but also bring more communication problems. There are a large number of computations in the model learning, and the online state of the homomorphic computation requires that the user side must calculate and respond in sufficient time. This means the user needs to be continuously online and have the ability to respond at any time, while the communication line also needs to ensure smoothness. These requirements are easy to achieve in a short period of time, but once the training time is longer, any user problems will cause the model training process to stop, which will significantly extend the training time and increase the training cost. Therefore, offline homomorphic computing can avoid the influence of the user side on the training process and improve the efficiency of model training.

Multi-user encryption methods are required due to the insufficient amount of single-user data, which in turn leads to the problem of computation in the encrypted state. Therefore, this paper proposes a system to support multi-user encrypted machine learning, which optimizes the encryption method to achieve privacy-preserving, multi-user sharing and computation. For the data quality problem, by improving the isolation forest model under plaintext through the encrypted isolation forest model, the model can identify abnormal data. In particular, it is optimized for cloud-computing environments so the server can compute this part. At the same time, the authentication part is increased to reduce the possibility of being spoofed. The contributions of this paper are as follows:

A new partial homomorphism algorithm that simultaneously supports the offline computation of homomorphic addition and homomorphic comparison is proposed, which enables users to upload locally encrypted data without the need to stay online to cooperate with the cloud server computation.
In order to improve the security and usability of the scheme, a mechanism for dynamic user joining and exiting is added, as well as a user parameter verification mechanism, which is used to ensure the correctness and security of the user’s secret share. This reduces the difficulty of the user’s participation and also avoids errors in the user’s secret share, resulting in a failure to support homomorphic computation.
A secure isolated forest algorithm is proposed that can support computation on ciphertext data. It is proved by anomaly detection experiments to be able to accomplish the machine-learning objective in unsupervised mode, with a tiny gap with the isolated forest algorithm in plaintext.

The remainder of this paper is structured as follows: Section 2 details the background and related work on partial homomorphic encryption. Section 3 briefly describes the model and entity of the scheme. Section 4 describes in detail the specific implementation of the encryption methods in the scheme. Section 5 describes in detail the specific implementation of the secure isolation forests algorithm. Section 6 details the security analysis of the encryption scheme. Section 7 comprehensively compares the efficiency of this scheme with other schemes and uses anomaly detection experiments to prove its effectiveness. Section 8 summarizes the paper and analyzes the problems in the current scheme and presents future directions.

2. Related Work

There are several different solutions for machine learning due to privacy and security concerns. Hao et al. [23] designed a secure comparison protocol combining homomorphic encryption and secure multiparty computation, where the communication efficiency was proportional to the number of tree nodes. Shin et al. [24] proposed a homomorphic binary decision tree model based on CKKS encryption in a cloud environment, which was able to compute without decryption. The above methods require communication with the user during computation, which increases the communication cost. Wen et al. [25] investigated the problem of federated learning, which is computed locally, and the communication only interacts with the local model. However, dealing with the data quality problem in the absence of a standard data model cannot accurately identify abnormal data, and the detection of abnormal data through the local model is not accurate enough. There is therefore a need for a more appropriate way to address security in machine learning.

Existing privacy computing schemes [26,27,28,29] use homomorphic encryption, secure multiparty computation, federated learning, and trusted computing environments. Homomorphic encryption [26] allows direct computation over the ciphertext, and decryption of the ciphertext computation results is the same as the plaintext computation results. Currently, there exist partially homomorphic encryption and fully homomorphic encryption. Full homomorphic encryption supports homomorphic addition and homomorphic multiplication. In contrast, partially homomorphic encryption only supports homomorphic addition or homomorphic multiplication, and partially homomorphic encryption is faster than full homomorphic encryption in terms of computational efficiency [30]. Secure multiparty computation [27] can securely compute the result of a function with multiple parties, guaranteeing secure participation in the computation. However, multiple rounds of interaction are required in the computation process. Federated learning [28] can compute a local model locally and produce a global model by aggregating the local models of multiple parties in the cloud. However, it requires local computational resources and is susceptible to data poisoning attacks. Trusted computing environment [29] can be computed in an independent computing environment to ensure the security of its computation, but it requires the support of trusted hardware [31]. Machine-learning model training requires comprehensive consideration of the machine-learning model, security requirements, data provider computing power, cloud server computing power, computational efficiency, communication efficiency, hardware conditions, etc., to consider the adopted scheme in specific cases. In this paper, we focus on the homomorphic encryption method, which extends the homomorphic properties from partially homomorphic encryption while ensuring its computational efficiency.

Homomorphic encryption includes full homomorphic encryption and partially homomorphic encryption; full homomorphic encryption has BGV, CKKS, and another scheme [32,33,34], which cannot be applied due to efficiency problems. Therefore, starting from partially homomorphic encryption, we look for the encryption methods that can be used to build the foundation. Common additive homomorphic encryption includes the Paillier encryption algorithm [35], and multiplicative homomorphic encryption includes the ElGamal encryption algorithm, the RSA encryption algorithm, and so on. There is also BGN additive homomorphic encryption [36], which supports one-time multiplication.

Park et al. [37] combined additive homomorphism and federated learning and proposed federated learning that can protect the process of local model aggregation in all models. They also used additive homomorphism in the aggregation process so that the local model was not leaked, but still could not overcome the problem of data poisoning in federated learning [25]. Petrean et al. [38] implemented a random forest algorithm by combining a multi-user homomorphism algorithm and a lookup table, which could compute the result of the ciphertext under the provided ciphertext model in possession of the ciphertext model and the ciphertext data. However, this required a ciphertext model and could not work without one. Rezaeibagha et al. [39] combined additive homomorphic encryption and user authentication to achieve verifiable user ciphertext aggregation under the security of chosen plaintext attack (IND-CPA), but additive aggregation has no other homomorphic properties. Savvas et al. [40] proposed a new symmetrically encrypted approach that provided homomorphic addition and multiplication, respectively, and at the same time the encryption was very efficient and secure.

Additive or multiplicative properties on their own cannot build machine-learning models such as decision trees, clustering algorithms, and support vector machines, and other models need to be compared during the training process, so it is also necessary to provide homomorphic comparison properties as a basis. Liu et al. [41] proposed a coding method to realize homomorphic comparison operations; the data were converted into ordinal codes and tails for comparison. The ciphertext comparison can be transformed into exponential and mantissa comparisons, and each comparison only leaks 1 bit of information, which better protects privacy in the case of efficient comparisons. Li et al. [42] proposed order-preserving encryption based on a coded tree. The tree had the characteristics of B-tree, and the tree establishment and maintenance process was re-modified based on encoding and order preservation based on B-tree. Jun et al. [43] proposed a hash-based comparison encryption, which encodes the data into binary data and uses hash blindness and fixation for each digit, and the subsequent comparisons can be made as a whole to compare the results and then derive the size through the comparative part. The homomorphic comparison encryption realized by encoding can only support homomorphic comparison. It cannot be extended to realize other homomorphic properties, so it cannot support the perfect machine-learning model. Zhao et al. [44] proposed a homomorphic computation toolkit, which was obtained by homomorphic computation and secure multi-party computation that could support a variety of computational methods. The homomorphic comparison process only leaks the comparison results without leaking other information, but the comparison process requires interaction, while other computation methods are not realized in a homomorphic way. Huang et al. [45] proposed a comparative homomorphic application, which demonstrated the ability to compute the distance using the homomorphic comparison and could be realized in a better way on private data, but the application only demonstrated the results.

3. System Model

3.1. Preliminary Knowledge

Suppose that there exists a cyclic group G of order

o r d (g)

with a generator g. Generate the prime numbers p and q of length

λ

and compute

N = p q

.

3.1.1. Partial Discrete Logarithm

Suppose that at

Z_{N^{2}}^{*}

, given the tuple

(g, g^{α} \mod N^{2}, N)

, which is a random value of

α

, there exists no polynomial-time algorithm capable of computing

α

.

3.1.2. Decisional Diffie–Hellman

Given the tuple

(g, g^{α} \mod N^{2}, g^{β} \mod N^{2}, N)

, which is a random value of

α, β

, there exists no polynomial-time algorithm capable of computing

g^{α β}

.

3.1.3. BCP Encryption

Init phase: Generate safe prime numbers

p, q

by selecting safe parameters

λ

; compute

N = p q

. Generate the generator

g = - a^{2 N} (a \in Z_{N^{2}}^{*})

with private key

s \in [1, \frac{N^{2}}{2}]

and public parameters

(N, g, h = g^{s})

.

Enc phase: Select

m \in Z_{N}^{*}

; compute ciphertext

C_{1} = g^{r} \mod N^{2}, C_{2} = h^{r} (1 + m N)

by random chose

r \in [1, \frac{N}{4}]

.

3.1.4. Lemma

m_{1}, m_{2}

are two positive integers requiring

|m_{1} - m_{2}| < \frac{1}{2} N^{2}

.,

M = m_{1} - m_{2} mod N^{2}

if

M \in (0, \frac{1}{2} N^{2})

, then

m_{1} < m_{2}

; otherwise,

M \in (\frac{1}{2} N^{2}, N^{2})

, then

m_{1} > m_{2}

.

3.2. System Model

The entities in the system model shown in the Figure 1 are cloud server and user. Encryption parameters and computation parameters are obtained after communication between users. Users encrypt and upload their respective data to the cloud through encryption parameters, and then the cloud server performs homomorphic computation on the ciphertext using the public parameters. The homomorphic computation contains homomorphic addition and homomorphic comparison; the property of offline homomorphism can be supported in the system model, and the optimized isolation forest algorithm achieves the unsupervised machine-learning approach.

4. Homomorphic Encryption Scheme

4.1. Initial Phase

As shown in Figure 2, initial parameters are generated from the security parameters, and the data for each user identification and system identification are computed; the secret-sharing aggregation parameter

Δ_{i}

, corresponding to the user identification and system identification, is published, and the identification is sent to the user, together with the secret-sharing aggregation parameter

Δ_{i}

. Disclosure of public parameters,

p a r a m s

.

As shown in Figure 3, user

U_{i}

randomly generates a secret-sharing polynomial

f_{i} (x)

locally. User

U_{i}

computes a secret-sharing value

Δ_{i, k} = f_{i} (u_{k}) Δ_{k}

of the secret-sharing polynomial

f_{i} (x)

, with respect to other users,

U_{k}

, and a secret-sharing value

R_{i, k} = g^{f_{i} (ν_{k}) {Δ_{k}}^{'}}

, with respect to the system,

V_{k}

. User

U_{i}

generates n-1 commitment

{\{c o m_{i, k} = g^{h_{i, k}}\}}_{k = 1, 2, \dots, n - 1}

and uploads this verification commitment to the cloud server. User

U_{i}

randomly generates a user private key,

u s k_{i}

, and a user public key,

u p k_{i}

, for the private part and the public part, respectively, and sends the secret-sharing value

Δ_{i, k}

in user private key

u s k_{i}

to the corresponding user,

U_{k}

, and makes user public key

u p k_{i}

public. When user

U_{i}

receives the shared secret

Δ_{j, i}

, they can simultaneously obtain the verification commitment from the cloud and compute the verification commitment locally and verify it through the equation

C O M_{j, i} \overset{?}{\to} g^{f_{j} (u_{i}) Δ_{i}}

. After the user receives the secret-sharing

Δ_{j, i}

from other users, all the secret-sharing parameters sent by other users are aggregated to form the secret-sharing value

F (\begin{matrix} u_{i} \end{matrix}) Δ_{i}

for the total secret.

As shown in Figure 4, the cloud server collects the public parameters of all users and stores them in the cloud, and it then computes the secret-sharing parameter

R_{k, i}

corresponding to the system identifier on the cloud, aggregates the secret-sharing parameters corresponding to all system identifiers, and obtains the total publicly releasable secret-sharing parameter R.

4.2. Encryption Phase

In Figure 5, if a homomorphic comparison between two users,

U_{c o m 1}, U_{c o m 2}

, is required, in addition to the ciphertext, the comparison parameters must be provided for the computation of the homomorphic comparison. User

U_{c o m 1}, U_{c o m 2}

first obtains the public parameter

p a r a m s

from the server and encrypts the plaintext

m_{c o m 1}, m_{c o m 2} \in (0, N^{\frac{1}{4}})

into the ciphertext

C_{c o m 1}, C_{c o m 2}

using the public key

g^{s}

. Second, user

U_{c o m 1}, U_{c o m 2}

computes the homomorphic comparison parameters locally from the secret-sharing values

F (u_{c o m 1}) Δ_{c o m 1}, F (u_{c o m 2}) Δ_{c o m 2}

of the total secret and the respective random values

k_{c o m 1}, k_{c o m 2} \in (0, \frac{N}{\sqrt{2 (N^{\frac{5}{4}} + 1)}}), r_{c o m 1}, r_{c o m 2}

, and the user uploads the ciphertext and the comparison parameters

a r g_{c o m 1}, a r g_{c o m 2}

to the cloud server.

4.3. Homomorphic Property

Homomorphic properties of the ciphertext computation and arithmetic procedure. The ciphertexts for both sides of the computation are shown below:

C_{1} = \{\begin{matrix} C_{1, 1} = {(g^{s})}^{r_{1}} (1 + m_{1} N) \mod N^{2} \\ C_{1, 2} = g^{r_{1}} \mod N^{2} \end{matrix}\} C_{2} = \{\begin{matrix} C_{2, 1} = {(g^{s})}^{r_{2}} (1 + m_{2} N) \mod N^{2} \\ C_{2, 2} = g^{r_{2}} \mod N^{2} \end{matrix}\}

(1)

4.3.1. Homomorphic Addition

The homomorphic addition of the ciphertext is computed as follows:

C_{a d d} = \{\begin{matrix} C_{a d d, 1} = C_{1, 1} C_{2, 1} = {(g^{s})}^{r_{1} + r_{2}} (1 + (m_{1} + m_{2}) N) \mod N^{2} \\ C_{a d d, 2} = C_{1, 2} C_{2, 2} = g^{r_{1} + r_{2}} \mod N^{2} \end{matrix}\}

(2)

4.3.2. Homomorphic Subtraction

The homomorphic subtraction of the ciphertext is computed as follows:

C_{s u b} = \{\begin{matrix} C_{s u b, 1} = C_{1, 1} / C_{2, 1} = {(g^{s})}^{r_{1} - r_{2}} (1 + (m_{1} - m_{2}) N) \mod N^{2} \\ C_{s u b, 2} = C_{1, 2} / C_{2, 2} = g^{r_{1} - r_{2}} \mod N^{2} \end{matrix}\}

(3)

4.3.3. Homomorphic Comparison

The homomorphic comparison of the ciphertext is computed as follows:

\bar{C} = C_{2, 1} / C_{1, 1} = \frac{{(g^{s})}^{r_{2}} (1 + m_{2} N)}{{(g^{s})}^{r_{1}} (1 + m_{1} N)} = {(g^{s})}^{r_{2} - r_{1}} (1 + (m_{2} - m_{1}) N)

(4)

\begin{matrix} M = \bar{C} a r g_{1} a r g_{2} = {(g^{s})}^{r_{2} - r_{1}} (1 + (m_{2} - m_{1}) N) k_{1} k_{2} {(g^{s})}^{r_{1} - r_{2}} \\ = k_{1} k_{2} (1 + (m_{2} - m_{1}) N) \end{matrix}

(5)

It can be seen that Lemma 1 holds by

|M| = |k_{1} k_{2} (1 + (m_{2} - m_{1}) N)| \in (0, \frac{1}{2} N^{2})

, and therefore the theorem holds when

k_{c o m 1}, k_{c o m 2} \in (0, \frac{N}{\sqrt{2 (N^{\frac{5}{4}} + 1)}}), m_{1}, m_{2} \in (0, N^{\frac{1}{4}})

is set and is able to perform homomorphic comparison computations.

4.4. Dynamic User Join and Quit Mechanism

When user

U_{j}

proposes to join or quit, it is necessary to extract the value of the secret shared by that user’s secret, and just add or subtract it from the total secret value and change the fusion subsecret of each user; then the dynamic quit can be realized.

\{\begin{matrix} g^{s} = g^{s} * g^{s_{j}} = g^{s_{1} + s_{2} + s_{3} + \dots \dots + s_{l} + s_{j}}, J O I N \\ g^{s} = g^{s} / g^{s_{j}} = g^{s_{1} + s_{2} + s_{3} + \dots \dots + s_{l} - s_{j}}, Q U I T \end{matrix}

(6)

4.4.1. User Join

Firstly, the quitter,

U_{j}

, provides its secret share

g^{s_{j}}

to the cloud server, which executes the exit procedure. For the cloud server,

g^{s} = g^{s} / g^{s_{j}} = g^{s_{1} + s_{2} + s_{3} + \dots \dots + s_{l} - s_{j}}

needs to be employed to remove the participation share of the exit user’s secret in all user secrets. User

U_{j}

locally computes

{(C_{i, 2})}^{s_{j}} = {(g^{s_{j}})}^{r_{i}}

and uploads it to the cloud server, which employs

C_{i, 1} = C_{i, 1} / {(C_{i, 2})}^{s_{j}} = {(g^{s} / g^{s_{j}})}^{r_{i}} (1 + m_{i} N)

to update all user ciphertext.

4.4.2. User Quit

Firstly, the joiner,

U_{j}

, provides its secret share

g^{s_{j}}

to the cloud server, which executes the joining procedure. For the cloud server, it is necessary to employ

g^{s} = g^{s} \cdot g^{s_{j}} = g^{s_{1} + s_{2} + s_{3} + \dots \dots + s_{l} + s_{j}}

to add the participation share of the joining user’s secret among all the user secrets. User

U_{j}

locally computes C and uploads it to the cloud server, which then employs

C_{i, 1} = C_{i, 1} {(C_{i, 2})}^{s_{j}} = {(g^{s} \cdot g^{s_{j}})}^{r_{i}} (1 + m_{i} N)

to update all user ciphertext.

5. Secure Isolation Forest Algorithm

The Secure Isolation Tree Training Algorithm 1 requires the user to provide a training ciphertext,

{\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

. The cloud server randomly selects a feature, f, and obtains the feature ciphertext maximum,

C_{m a x}^{f}

, and minimum,

C_{m i n}^{f}

, by the maximum or minimum algorithm supported by homomorphic comparison. Set the random range

m o d u l e

and obtain the random value

R a n d o m I n t

; compute the current randomly selected comparison point,

p o i n t

, by

C_{1, 1} (1 + R a n d o m I n t * N)

, and then use the comparison algorithm to detect whether or not the current comparison point,

p o i n t = \{\begin{matrix} C_{m i n, 1}^{f} = {(g^{s})}^{r_{m i n}} (1 + (m_{m i n} + R a n d o m I n t) N) \mod N^{2} \\ C_{m i n, 2}^{f} = g^{r_{m i n}} \mod N^{2} \end{matrix}\}

, is smaller than the maximum

C_{m a x}^{f}

. After determining the comparison point,

p o i n t

, the comparison point is used to differentiate between the left and right subtrees of

C_{1}^{f}, C_{2}^{f}, \dots \dots, C_{u}^{f}

. The algorithm recursively constructs the complete tree from the left and right subtrees and then returns the tree model

T r e e

under the ciphertext.

Algorithm 1 Secure Isolation Tree Training Algorithm

Input:: ${\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}$
Output:: $T r e e$
1:: $f \leftarrow r a n d o m (n)$
2:: $C_{m a x}^{f} \leftarrow M a x (C_{1}^{f}, C_{2}^{f}, \dots \dots, C_{u}^{f})$
3:: $C_{m i n}^{f} \leftarrow M i n (C_{1}^{f}, C_{2}^{f}, \dots \dots, C_{u}^{f})$
4:: $m o d u l e \leftarrow 2^{l}$
5:: $R a n d o m I n t \leftarrow ⌊r a n d o m (0, 1) * m o d u l e⌋$
6:: $p o i n t \leftarrow A d d (C_{m i n}^{f}, R a n d o m I n t)$
7:: while $C o m p a r e (p o i n t, C_{m a x}^{f}) \neq C_{m a x}^{f}$ do
8:: $m o d u l e \to m o d u l e / 2$
9:: $R a n d o m I n t \leftarrow ⌊r a n d o m (0, 1) * m o d u l e⌋$
10:: $p o i n t \leftarrow A d d (C_{m i n}^{f}, R a n d o m I n t)$
11:: end while
12:: for each $i \in [1, t]$ do
13:: if $C o m p a r e (C_{i}^{f}, p o i n t) \neq C_{i}^{f}$ then
14:: $L e f t T r e e N o d e S e t$ append $p o i n t$
15:: else
16:: $R i g h t T r e e N o d e S e t$ append $p o i n t$
17:: end if
18:: end for
19:: $L e f t T r e e \leftarrow T r a i n (L e f t T r e e N o d e S e t)$
20:: $R i g h t T r e e \leftarrow T r a i n (R i g h t T r e e N o d e S e t)$
21:: $T r e e \leftarrow [\begin{matrix} 1, & L e f t T r e e, & R i g h t T r e e, & f, & p o i n t \end{matrix}]$
22:: return $T r e e$

The Secure Isolation Forest Training Algorithm 2 requires the user to provide training ciphertext

{\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

. The cloud server randomly selects the user’s ciphertext as the training set without putting it back. It uses the training set as the input for Algorithm 1. Each call to Algorithm 1 can obtain a training tree,

T r e e

, and add it to the forest, which is then looped 100 times to obtain 100 training trees as a secure isolation forest model,

T r e e S e t

, for anomaly detection.

Algorithm 2 Secure Isolation Forest Training Algorithm

Input:: ${\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}$
Output:: $T r e e S e t$
1:: for each $i \in [1, 100]$ do
2:: $\{C_{r 1}^{f e a t u r e}, C_{r 2}^{f e a t u r e}, \dots \dots, C_{r 256}^{f e a t u r e}\} \leftarrow s a m p l e (\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\})$
3:: $T r e e \leftarrow T r a i n T r e e ({\{C_{r 1}^{f e a t u r e}, C_{r 2}^{f e a t u r e}, \dots \dots, C_{r 256}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n})$
4:: $T r e e S e t$ append $T r e e$
5:: end for
6:: return $T r e e S e t$

Secure Isolation Tree PathLength Algorithm 3 needs to provide the user’s detection ciphertext,

{\{C_{d e t e c t}^{f e c t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

, and anomaly detection tree model,

T r e e

, and the length of the detected,

L e n g t h

. The cloud server inputs the corresponding ciphertext on the

T r e e

to detect the two cases; if not a leaf node, then continue to compare, by greater than the node to go to the right sub-tree,

T r e e [2]

, and less than the node to go to the left sub-tree,

T r e e [1]

, and recursive computation of the length of the path,

T r e e P a t h L e n g t h

. If it has been carried out to the leaf nodes, then determine the number of current leaf nodes stored; different numbers of different computations return different results. The return result is the path length of the ciphertext in a tree model,

T r e e P a t h L e n g t h

.

Algorithm 3 Secure Isolation Tree PathLength Algorithm

Input:: ${\{C_{d e t e c t}^{f e c t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}, T r e e, L e n g t h$
Output:: $T r e e P a t h L e n g t h$
1:: if $T r e e [0] \neq 0$ then
2:: if $C o m p a r e (C_{d e t e c t}^{T r e e [3]}, T r e e [4]) \neq C_{d e t e c t}^{T r e e [3]}$ then
3:: $P a t h L e n g t h ({\{C_{d e t e c t}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}, T r e e [2], L e n g t h + 1)$
4:: else
5:: $P a t h L e n g t h ({\{C_{d e t e c t}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}, T r e e [2], L e n g t h + 1)$
6:: end if
7:: else
8:: if $T r e e [1] \neq 1$ then
9:: $c \leftarrow 2^{*} ({log}_{2} (T r e e [1] - 1) + 0.5772156649) - (2^{*} (T r e e [1] - 1) / T r e e [1])$
10:: else
11:: $c \leftarrow 0$
12:: end if
13:: end if
14:: return $L e n g t h + c$

The Secure Isolation Forest Detection Algorithm 4 needs to provide the user’s detection ciphertext,

{\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

, and training ciphertext,

{\{C_{d e t e c t}^{f e c t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

. The cloud server obtains the anomaly detection isolation forest model,

T r e e S e t

, using Algorithm 2

T r a i n T r e e S e t

. The model is used to detect the ciphertext,

{\{C_{d e t e c t}^{f e c t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}

, and the path length of the ciphertext in each tree is computed using Algorithm 3

P a t h L e n g t h

and is summed to obtain the total path length,

A l l L e n g t h

. The path length can be used to compute the anomaly score; if this score is higher than the threshold value, then it is anomalous data.

Algorithm 4 Secure Isolation Forest Detection Algorithm

Input:: ${\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{u}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}, {\{C_{d e t e c t}^{f e c t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}$
Output:: ${N o m a l | A b n o r m a l}$
1:: $T r e e s e t \leftarrow T r a i n T r e e S e t ({\{C_{1}^{f e a t u r e}, C_{2}^{f e a t u r e}, \dots \dots, C_{t}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n})$
2:: $A l l L e n g t h \leftarrow 0$
3:: for each $i \in [1, 100]$ do
4:: $T r e e \leftarrow T r e e s e t [i]$
5:: $C u r r e n t T r e e L e n g t h \leftarrow P a t h L e n g t h ({\{C_{d e t e c t}^{f e a t u r e}\}}_{f e a t u r e = 1, 2, 3, \dots, n}, T r e e, 0)$
6:: $A l l L e n g t h \leftarrow A l l L e n g t h + C u r r e n t T r e e L e n g t h$
7:: end for
8:: $A v g L e n g t h \leftarrow A l l L e n g t h / 100$
9:: $c \leftarrow 2 * ({log}_{2} (256 - 1) + 0.5772156649) - (2 * (256 - 1) / 256)$
10:: $s c o r e \leftarrow 2^{A v g L e n g t h / c}$
11:: if $s c o r e < t h r e s h o l d$ then
12:: $R e s u l t \leftarrow N o r m a l$
13:: else
14:: $R e s u l t \leftarrow A b n o r m a l$
15:: end if
16:: return $R e s u l t$

6. Security Analysis

6.1. Correctness

Proof of commitment equation

C O M_{j, i} \overset{?}{\to} g^{f_{j} (u_{i}) Δ_{i}}

:

C O M_{i, j} = {(g^{s_{i}} \prod_{k = 1}^{n - 1} {(c o m_{i, k})}^{{(u_{j})}^{k}})}^{Δ_{j}} = {(g^{s_{i}} \prod_{k = 1}^{n - 1} g^{h_{i, k} {(u_{j})}^{k}})}^{Δ_{j}} = {(g^{s_{i} + \sum_{k = 1}^{n - 1} h_{i, k} {(u_{j})}^{k}})}^{Δ_{j}} = g^{f_{i} (u_{j}) Δ_{j}})

(7)

6.2. Security Model

Firstly, we introduce the idea that there are two types of attacker to be considered in the proposed scheme.

Internal Attackers: This type of attacker consists of malicious users who obtain private key information from the data that other users interact with. In some cases, the malicious user also wants to include malicious data or provide wrong parameters to degrade the quality of the training data or organize the model training process.
External Attacker: This type of attacker consists of a malicious third party that interacts with users and stores encrypted data on the cloud server to participate in model training, and the malicious third party wants to obtain the model training results or obtain private key information or plaintext information through public parameters.

6.3. Security Analysis

Security is considered for all parts, and if all stages are secured then the overall security is guaranteed.

6.3.1. Homomorphic Encryption System Security

Theorem 1.

If the plaintext m and the private key

r, s

are not leaked, and only the ciphertext and the computation parameter

{N, g, {\{ν_{i}\}}_{i \in {1, \dots, n - 2}}, {\{u_{i}\}}_{i \in {1, \dots, t}}, {\{Δ_{i}\}}_{i \in {1, \dots, t}}, {\{{Δ_{i}}^{'}\}}_{i \in {1, \dots, n - 2}},

{\{g^{s_{i}}, g^{r_{i}}\}}_{i \in {1, \dots, t}}, g^{s}, R, {\{M_{i}, C_{i, 1}, C_{i, 2}\}}_{i \in {1, \dots, t}}}

are leaked, the possibility that adversary A can compute the plaintext message through the labeled data is negligible.

Proof of Theorem 1.

The security of the scheme is demonstrated by constructing several games, as shown below:

Game 0: This scheme is identical to its security in the real world.

$P r [R E A L_{A}^{O U R} [\begin{matrix} λ \end{matrix}] = 1] = P r [G_{0} = 1]$

(8)
Game 1: This game is based on PDL assumptions and is almost identical to Game 0, except for some parameters.

${\{C O M_{i, j} = {(g^{s_{i}} \prod_{s = 1}^{n - 1} {(c o m_{i, s})}^{{(u_{j})}^{s}})}^{Δ_{j}}\}}_{i, j = 1, 2, 3, \dots \dots, n - 1 | | i \neq j}$

The game requires storage tables to hold the accessed information entries $(\begin{matrix} u_{i}, u_{j}, C O M_{i, j} \end{matrix})$ . When an adversary initiates a query, it checks whether or not the entry is already stored in the table. If the entry is already stored in the table, then the stored entry is returned to the corresponding querier; if the entry does not exist in the table, then an element needs to be randomly selected as the result of the entry and the entry is stored in the table for querying. Since the difficulty of solving the PDL puzzle is known, we define that there exists a polynomial time algorithm, $B 1$ , that solves the problem, and then the difference between Game 1 and Game 0 is:

$P r [G_{1} = 1] = P r [G_{0} = 1] + P r_{B 1}^{P D L} [λ]$

(9)
Game 2: The game is based on the IND-CPA assumption of the BCP encryption scheme and is almost identical to Game 1, except for the encryption.

$C_{1} = \{\begin{matrix} C_{1, 1} = {(g^{s})}^{r_{1}} (1 + m_{1} N) \mod N^{2} \\ C_{1, 2} = g^{r_{1}} \mod N^{2} \end{matrix}\}$

We define that there exists a polynomial time algorithm, $B 2$ , that solves the IND-CPA puzzle of the BCP encryption scheme [46], then Game 2 is different from Game 1:

$P r [G_{2} = 1] = P r [G_{1} = 1] + P r_{B 2}^{I N D - C P A} [λ]$

(10)
Game 3: This game is based on an ideal realization of the process and is almost identical to Game 2, except for some of the parameters.

$a r g_{c o m 2} = R^{r_{c o m 2}} k_{c o m 2} {(g^{r_{c o m 1} - r_{c o m 2}})}^{F (u_{c o m 2}) Δ_{c o m 2} \frac{- u_{c o m 2}}{u_{c o m 1} - u_{c o m 2}}}$

The game requires a storage table to record $(a r g_{c o m 1}, a r g_{c o m 2}, M = \bar{C} a r g_{c o m 1} a r g_{c o m 2})$ , and since this parameter exists for both parties to supply, when asked, check if the entry is already stored in the table. If the entry is already stored in the table, then the stored entry is returned to the corresponding querying party; if the entry does not exist in the table, then an element needs to be randomly selected as the result of the entry and the entry is stored in the table for querying. Since random numbers are added in the computation, we define that there exists a polynomial time algorithm, $B 3$ , capable of solving the random number discrimination problem; then, Game 3 is different from Game 2:

$P r [G_{3} = 1] = P r [G_{2} = 1] + P r_{B 3}^{R a n d o m} [\begin{matrix} λ \end{matrix}]$

(11)
Simulator S: Simulator S simulates an ideal situation, which differs from the real world. Therefore, the games constructed above are used to analyze the difference between the ideal situation and the real world, which can be obtained throughout the interaction:

$\begin{matrix} Pr [I D E A L_{A, S}^{O U R} [λ] = 1] = P r [G_{3} = 1] \\ = P r_{B 3}^{R a n d o m} [λ] + P r_{B 2}^{I N D - C P A} [λ] + P r_{B 1}^{P D L} [λ] + P r [R E A L_{A}^{O U R} [λ] = 1] \\ = P r [R E A L_{A}^{O U R} [λ] = 1] + ε \end{matrix}$

(12)

Therefore, it can be argued that only negligible advantages exist in the interaction process for it to be able to distinguish the ideal situation from the real world. □

6.3.2. Homomorphic Computing Security

The encryption scheme is a BCP encryption scheme. However, the homomorphic computational properties are extended to include partially parametric computation, so the encryption scheme after extending the homomorphic properties is analyzed to see if it is still IND-CPA secure.

Theorem 2.

If the plaintext message m and the private key

r, s

are not leaked, and only the ciphertext and the result of the homomorphic computation

\{\begin{matrix} M_{i}, C_{i, 1}, C_{i, 2}, g^{s}, C_{s u b}, C_{a d d}, M_{c o m} \end{matrix}\}

are leaked, the possibility that adversary A computes the plaintext message through the ciphertext as well as the result of the homomorphic computation is negligible.

Proof of Theorem 2.

Assuming that there exists a polynomial time algorithm, B, that can attack the encryption scheme with the advantage of

ε

, then an adversary, A, can be constructed to attack the BCP scheme with the same advantage. Since

C_{a d d}, C_{s u b}

has only homomorphic ciphertext results during homomorphic computation, the ciphertext properties of the results of homomorphic addition and homomorphic subtraction are the same as those of the original encryption. Only the information leakage from the ciphertext comparison result is discussed here.

Init: Adversary A randomly chooses a generator, g, and sends the parameter $\{G, g^{s}, g, N^{2}\}$ to B.
Challenge: B randomly picks two challenge plaintexts, $m_{0}, m_{1}$ , and sends them to adversary A. Adversary A randomly picks an r and computes the disclosure value $g^{r}$ . Adversary A first randomly picks $b = {0, 1}$ , computes the challenge ciphertext $\{g^{r}, {(g^{s})}^{r} (1 + m_{b} N)\}$ , and sends the ciphertext to B.
Query: B initiates homomorphic comparison computation by choosing any non-challenge plaintext, $m_{0}, m_{1}$ , encryption, but its comparison object cannot challenge ciphertext $\{g^{r}, {(g^{s})}^{r} (1 + m_{b} N)\}$ . Adversary A accepts the comparison request and returns the homomorphic comparison result of the comparison.
Guess: B guesses that $b^{'} = {0, 1}$ . When $b^{'} = b$ , then B outputs ’1’, indicating that it can distinguish $m_{b}$ . Otherwise, B outputs ’0’, indicating that it cannot distinguish $m_{b}$ .

Because the BCP encryption scheme has IND-CPA security without the original plaintext, it is impossible to obtain the plaintext

m_{i}

by analyzing the ciphertext and homomorphic computation results. In homomorphic computation, only the homomorphic comparison computation has information leakage. However, only the 1 bit information of the size comparison result is leaked, which is unavoidable due to the necessity of comparison. The comparison result is only known to the cloud server. However, the cloud server cannot obtain the plaintext of the original ciphertext, so if the cloud server wants to crack the plaintext range of one of the two ciphertext, it first needs to obtain the plaintext value of the other ciphertext.

\begin{matrix} P r [A_{w i n s}] = P r [\{A d d, M u l, S u b, C o m, C\} \to m] \\ = P r [\{A d d, M u l, S u b, C\} \to m] + P r [\{C o m, C\} \to m] \\ = P r_{I N D - C P A}^{B C P} + P r [\{C o m, C_{1}\} \to m_{1} ∣ m_{2}] P r [\{C_{2}\} \to m_{2}] \\ = P r_{I N D - C P A}^{B C P} + P r_{I N D - C P A}^{B C P} = ε \end{matrix}

(13)

So, the possibility that adversary A computes the plaintext message through the ciphertext and the result of the homomorphic computation is negligible. □

7. Performance Analysis

This section analyzes the performance of this scheme in terms of crucial steps and the overall scheme and verifies the feasibility of this scheme through simulation experiments compared with other schemes. Firstly, we compare the theoretical cost of each scheme in different crucial steps and reflect the advantages of this scheme through comparison, which is better than the previous scheme in terms of functionality and cost. Secondly, comparing simulation experiments can reflect the advantages of this scheme compared with other schemes in the application process. Finally, the machine-learning model is implemented on a public dataset, and the machine-learning model is used for anomaly detection. The experimental results show that this scheme can effectively support machine learning in an encrypted state. The simulation experiments were implemented on a laptop computer with an AMD Ryzen7 4800H CPU (2.9 GHz) and 32G RAM by calling the CHARM library [47] and using Python (version 3.7).

Dataset: The KDD-NSL dataset contains training data and test da; the training data has 125,973 records and the test data has 22,544 records. Each of these records contains 43 features, 41 of which describe the traffic input, and the last two features are the label (normal or attack) and the difficulty level. The last two features of all records are removed before model training to simulate unsupervised data for model training.

The communication cost considers the interaction between the client and the server side, so the communication cost is considered from a single client. The communication overhead is considered from two aspects: the communication message’s size and the communication process’s time cost. From the theoretical cost in Table 1, it can be seen that the communication cost of the scheme proposed in this paper has been completed in the initial and encryption phases, and the subsequent homomorphic addition and homomorphic comparison do not require the user to be involved in further interaction. Although the communication cost of the scheme proposed in this paper is higher in the initial and encryption phases, it only interacts before uploading the data. It does not need to stay online after uploading the ciphertext, which allows the user to wait offline after all the computations have been performed within a certain period. Other schemes require user participation in the homomorphic computation process, so the proposed scheme has a more significant advantage in the computation process. Machine-learning model computation takes a significant amount of time and may drop in the middle of the process due to unstable participation in network problems, energy problems, etc., so it is evident that offline users are more suitable for practical application scenarios. Through the experimental comparison Figure 6 and Figure 7, it can also be seen that the scheme proposed in this paper has a higher cost in the initial and encryption phases, but at the same time the communication cost in the subsequent homomorphic computation can have a considerable advantage.

We expect the cloud server to have strong computational performance in order to support the computation, but we still need to make a comparison to detect the existence of unaffordable computational costs. Through the theoretical cost in Table 2, it can be seen that the proposed scheme in this paper has a larger difference than other schemes in the initial phase; the initial phase carries more cost, and the cost is related to the number of participating users, but the actual difference with other schemes needs to be derived through experiments. Since there will not be too many participants in the training of machine-learning models, only users with a large amount of data or with machine-learning needs will be willing to participate, so user participation is set to 100. Since the computation of the servers in the initial phase of this scheme does not have a sequential requirement, the computational cost can make optimal use of the cloud server’s computational cost by concurrency optimization. As shown in Figure 8, after optimization, although this scheme still had an enormous time cost in the initial phase, its time cost was only 12 s in the case of 100 user participation. This time overhead is still within the range of seconds. At the same time, it is possible that the local network latency of some users even exceeds this time cost, so we believe that the server-side time cost in the initial phase is acceptable. The experimental results show that the non-initial phase of the other phases of this scheme is advantageous, and because there is a large number of machine-learning computing in the subsequent model learning and data prediction process, the advantages of homomorphic computing are realized, even when the number of computations is substantial.

From the theoretical analysis in Table 3, it can be seen that the most significant difference between this scheme and other schemes in terms of client computation cost lies in the initial phase and homomorphic computation phase. This scheme requires the user to compute the parameters in the initial phase. However, the user does not need to participate in the homomorphic computation process in the homomorphic computation phase. This experiment is compared after setting the number of computations to 10,000; the advantage of this scheme can be seen in Figure 9; the client’s cost is not high when performing a large number of computations. Combining theory and experiments, it can be seen that the advantage of the proposed scheme over other schemes is that there is no need to compute too many parameters locally, and the amount of user computation is only related to the number of participating users and security parameters. Since the client does not need to participate in the homomorphic computation, the computational complexity of the model is not related to the client but only to the server’s power, and the server’s power can be increased to improve the speed of homomorphic computation without considering the limitations of the client’s power. Client-side participation in homomorphic computation must ensure the user’s stable arithmetic and network. Hence, it is impossible to use complex machine-learning models, especially those whose computation is highly correlated with the amount of data. At the same time, the scheme proposed in this paper can be well adapted to these complex models.

Combining the communication time cost and computation time cost, after summarizing all the time costs and analyzing them, from Figure 10 can be seen that the cost of this scheme is significantly smaller than the time cost of other schemes. After all the parameters are computed, this scheme spends more time in the initial and encryption phases. However, in the case of many homomorphic computations, other schemes require both communication and the computation of parameters. Hence, this scheme’s total time overhead is better than other schemes. At the same time, it reduces the requirement of client stability as no subsequent interaction and computation is required.

In terms of user privacy comparison, this scheme has the same security as other schemes, but it has offline computing features, the flexibility of user joining, and the verification of user parameters, which can better provide services for homomorphic computing. Comparing in terms of computational efficiency, this scheme migrates a large number of computational steps to the cloud server for execution, more computational cost is borne by the computationally resource-rich cloud server in the cloud-computing model, and waiting for computation due to communication is avoided. Reasonable configuration of the computation cost of the user and the cloud server, while reducing the waiting time due to communication, can lead to higher computation efficiency. Comparing in terms of communication cost, the offline property of computation in this scheme ensures that the computation required for model training does not need to interact with the user, thus reducing the communication overhead by a significant amount. Even though more communication cost is required in the initial and encryption phases, the disadvantages of the earlier phases can be offset by the advantages of subsequent model training, as can be seen in Figure 7, where the communication advantage of only 10,000 homomorphic computations already outweighs the communication disadvantages associated with the initial and encryption phases.

Different numbers of subtrees mainly affects the stability of anomaly detection. Calculating the average path lengths of normal and abnormal data on different numbers of subtrees, from Figure 11 can be seen that the path length fluctuates greatly when the number of subtrees is less than 60, and the fluctuation is smooth when the number of subtrees is greater than 80. Once the number of subtrees is greater than 100, increasing the number of subtrees only slightly affects the average path length, so it is considered that the average path length has reached convergence after the number of subtrees reaches 100. Therefore, the number of subtrees is set to 100 in the subsequent experiments.

The number of sampling points per subtree mainly affects the height of the subtree, while the isolation forest model detects anomalies mainly relying on the lower layers. Therefore, a subtree with too many sampling points does not significantly improve the accuracy of anomaly detection. From Figure 12, it can also be seen that when the number of sampling points is greater than 128 and the number is then increased, it will only slightly improve the detection accuracy. Therefore, the number of sampling points can be chosen flexibly according to the size of the computing capacity; when the computing capacity is weak, we can choose 32, and when the computing capacity is strong, we can choose 256. Since computations are undertaken by the cloud server, it is recommended to set the number of sampling points to 256.

The anomaly detection task is performed on the NSL-KDD dataset using the proposed scheme in this paper, and the anomaly detection task can be better accomplished under ciphertext by the secure isolation forest algorithm. Since the data are encrypted, the computation under ciphertext and the computation under plaintext cannot be realized in the same way, so the scheme is adapted to the secure isolation forest algorithm, and the secure isolation forest algorithm can still accomplish the anomaly detection task well after the change. The goal of the simulation experiments was to detect anomalies by constantly adjusting the detection threshold of the secure isolation forest; through Figure 13 shown the detection effect of simulation experiments, it was found that setting the detection threshold to 0.422 obtained the best results. In the experimental results, the secure isolation forest algorithm reaches a 90.12% recall rate in anomaly detection, which only leads to about 10% of the anomalous data being undetected, through which it can reduce the anomalous data that is mixed in the normal data by about 90% and reduce the proportion of anomalous data to the total amount of data.

Comparing the isolation forest model under encryption with that under plaintext in the anomaly detection experiments, it can be seen from the Figure 14, only the Recall metric has a large gap, while for the combined Precision, Accuracy, and F1 metrics the encrypted isolation forest algorithm has only a slight gap. This reflects the differences between the encrypted isolation forest and the original model and also shows that the encrypted isolation forest can be used in anomaly detection tasks.

8. Conclusions

Here we propose a multi-user encrypted machine-learning system that supports an unsupervised learning approach, the secure isolation forest algorithm, through a semi-homomorphic encryption scheme. This algorithm can achieve more than 90% recall in anomaly detection experiments, proving that the system can accomplish the machine-learning goal. In this scheme, the base encryption scheme can support homomorphic comparison by front-loading the computational parameters, and by designing the communication and computational processes, the computational parameters required for homomorphic computation are front-loaded to reach the offline computational approach. In comparison with the same type of scheme, it can be seen that there is a higher computational cost in the initial stage due to the need to front-load the computational parameters. However, it is still within the acceptable range. Since the homomorphic computation in the subsequent model learning does not require user participation, it reduces the stability requirements for the clients and lowers the threshold of user participation.

Since the encrypted machine-learning model used in this scheme is the isolation forest algorithm, this method is only effective for the detection of individual anomalies that are far away from the cluster, and it is not effective in detecting small cluster anomalies. Therefore, this scheme mainly focuses on the isolated anomalies that are far away from the normal data, and it is more effective in detecting such anomalies.

In the future, we will study more encrypted machine-learning schemes supported by partially homomorphic encryption to reduce costs. At the same time, we should improve the encryption method to support homomorphic computation while minimizing the communication and computation costs. The pre-computation parameters of the current scheme need to be computed locally, and in the future we will consider adding outsourcing to minimize the arithmetic requirements of local users.

Author Contributions

S.X.: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Writing—review & editing. J.Y.: Methodology, Formal analysis, Investigation, Resources, Writing—review & editing, Supervision. W.O.: Data curation, Investigation, Supervision, Writing—review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (No. 62162020) and the Haikou Science and Technology Special Fund (No. 2024-017).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Qin, H.; He, D.; Feng, Q.; Khan, M.K.; Luo, M.; Choo, K.K.R. Cryptographic Primitives in Privacy-Preserving Machine Learning: A Survey. IEEE Trans. Knowl. Data Eng. 2023, 36, 1919–1934. [Google Scholar] [CrossRef]
Lee, W.; Seong, J.J.; Ozlu, B.; Shim, B.S.; Marakhimov, A.; Lee, S. Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review. Sensors 2021, 21, 1399. [Google Scholar] [CrossRef]
Gou, F.; Liu, J.; Xiao, C.; Wu, J. Research on Artificial-Intelligence-Assisted Medicine: A Survey on Medical Artificial Intelligence. Diagnostics 2024, 14, 1472. [Google Scholar] [CrossRef] [PubMed]
El Hajj, M.; Hammoud, J. Unveiling the Influence of Artificial Intelligence and Machine Learning on Financial Markets: A Comprehensive Analysis of AI Applications in Trading, Risk Management, and Financial Operations. J. Risk Financ. Manag. 2023, 16, 434. [Google Scholar] [CrossRef]
Ma, C.; An, J.; Xu, J.; Xu, B.; Xu, L.; Bai, X.E. Chinese machine reading comprehension based on deep learning neural network. Int. J. Bio-Inspired Comput. 2023, 21, 137–147. [Google Scholar] [CrossRef]
Strielkowski, W.; Vlasov, A.; Selivanov, K.; Muraviev, K.; Shakhnov, V. Prospects and Challenges of the Machine Learning and Data-Driven Methods for the Predictive Analysis of Power Systems: A Review. Energies 2023, 16, 4025. [Google Scholar] [CrossRef]
Hiemstra, L.A. Machine Learning and Artificial Intelligence Are Valuable Tools, yet Dependent on the Data Input. Arthrosc. J. Arthrosc. Relat. Surg. 2024, in press. [Google Scholar] [CrossRef] [PubMed]
Whang, S.E.; Lee, J.G. Data collection and quality challenges for deep learning. Proc. VLDB Endow. 2020, 13, 3429–3432. [Google Scholar] [CrossRef]
El Mestari, S.Z.; Lenzini, G.; Demirci, H. Preserving data privacy in machine learning systems. Comput. Secur. 2024, 137, 103605. [Google Scholar] [CrossRef]
Wang, C.; Zhang, J.; Lassi, N.; Zhang, X. Privacy Protection in Using Artificial Intelligence for Healthcare: Chinese Regulation in Comparative Perspective. Healthcare 2022, 10, 1878. [Google Scholar] [CrossRef] [PubMed]
Hu, A.; Lu, Z.; Xie, R.; Xue, M. VeriDIP: Verifying Ownership of Deep Neural Networks Through Privacy Leakage Fingerprints. IEEE Trans. Dependable Secur. Comput. 2023, 21, 2568–2584. [Google Scholar] [CrossRef]
Rigaki, M.; Garcia, S. A Survey of Privacy Attacks in Machine Learning. ACM Comput. Surv. 2023, 56, 101. [Google Scholar] [CrossRef]
Chui, K.T.; Liu, R.W.; Zhao, M.; Zhang, X. Bio-inspired algorithms for cybersecurity—A review of the state-of-the-art and challenges. Int. J. Bio-Inspired Comput. 2024, 23, 1–15. [Google Scholar] [CrossRef]
Xu, R.; Baracaldo, N.; Joshi, J. Privacy-preserving machine learning: Methods, challenges and directions. arXiv 2021, arXiv:2108.04417. [Google Scholar]
Munjal, K.; Bhatia, R. A systematic review of homomorphic encryption and its contributions in healthcare industry. Complex Intell. Syst. 2023, 9, 3759–3786. [Google Scholar] [CrossRef]
Kou, L.; Wu, J.; Zhang, F.; Ji, P.; Ke, W.; Wan, J.; Liu, H.; Li, Y.; Yuan, Q. Image encryption for Offshore wind power based on 2D-LCLM and Zhou Yi Eight Trigrams. Int. J. Bio-Inspired Comput. 2023, 22, 53–64. [Google Scholar] [CrossRef]
Yang, W.; Wang, S.; Cui, H.; Tang, Z.; Li, Y. A Review of Homomorphic Encryption for Privacy-Preserving Biometrics. Sensors 2023, 23, 3566. [Google Scholar] [CrossRef] [PubMed]
Iezzi, M. Practical Privacy-Preserving Data Science With Homomorphic Encryption: An Overview. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3979–3988. [Google Scholar] [CrossRef]
Wang, Y.; Dong, S.; Fan, W. Task Scheduling Mechanism Based on Reinforcement Learning in Cloud Computing. Mathematics 2023, 11, 3364. [Google Scholar] [CrossRef]
Reddy, M.A.; Ravindranath, K. Enhanced placement and migration of virtual machines in heterogeneous cloud data centre. Int. J. Bio-Inspired Comput. 2024, 23, 168–178. [Google Scholar] [CrossRef]
Yerlikaya, F.A.; Şerif Bahtiyar. Data poisoning attacks against machine learning algorithms. Expert Syst. Appl. 2022, 208, 118101. [Google Scholar] [CrossRef]
Xu, H.; Pang, G.; Wang, Y.; Wang, Y. Deep Isolation Forest for Anomaly Detection. IEEE Trans. Knowl. Data Eng. 2023, 35, 12591–12604. [Google Scholar] [CrossRef]
Hao, Y.; Qin, B.; Sun, Y. Privacy-Preserving Decision-Tree Evaluation with Low Complexity for Communication. Sensors 2023, 23, 2624. [Google Scholar] [CrossRef] [PubMed]
Shin, H.; Choi, J.; Lee, D.; Kim, K.; Lee, Y. Fully homomorphic training and inference on binary decision tree and random forest. In Proceedings of the European Symposium on Research in Computer Security, Bydgoszcz, Poland, 16–20 September 2024; pp. 217–237. [Google Scholar]
Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A survey on federated learning: Challenges and applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar] [CrossRef] [PubMed]
Alaya, B.; Laouamer, L.; Msilini, N. Homomorphic encryption systems statement: Trends and challenges. Comput. Sci. Rev. 2020, 36, 100235. [Google Scholar] [CrossRef]
Zhang, E.; Li, H.; Huang, Y.; Hong, S.; Zhao, L.; Ji, C. Practical multi-party private collaborative k-means clustering. Neurocomputing 2022, 467, 256–265. [Google Scholar] [CrossRef]
Li, L.; Fan, Y.; Tse, M.; Lin, K.Y. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854. [Google Scholar] [CrossRef]
Kalapaaking, A.P.; Khalil, I.; Rahman, M.S.; Atiquzzaman, M.; Yi, X.; Almashor, M. Blockchain-Based Federated Learning with Secure Aggregation in Trusted Execution Environment for Internet-of-Things. IEEE Trans. Ind. Inform. 2022, 19, 1703–1714. [Google Scholar] [CrossRef]
Doan, T.V.T.; Messai, M.L.; Gavin, G.; Darmont, J. A survey on implementations of homomorphic encryption schemes. J. Supercomput. 2023, 79, 15098–15139. [Google Scholar] [CrossRef]
Intel Xeon Processors. Intel Software Guard Extensions Developer Guide; Intel: Santa Clara, CA, USA, 2017. [Google Scholar]
Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory 2014, 6, 13. [Google Scholar] [CrossRef]
Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of the Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Proceedings, Part I 23. Springer: Berlin/Heidelberg, Germany, 2017; pp. 409–437. [Google Scholar]
Smart, N.P.; Vercauteren, F. Fully homomorphic SIMD operations. Des. Codes Cryptogr. 2014, 71, 57–81. [Google Scholar] [CrossRef]
Paillier, P. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In Proceedings of the Advances in Cryptology — EUROCRYPT’99, Prague, Czech Republic, 2–6 May 1999; Stern, J., Ed.; Springer: Berlin/Heidelberg, 1999; pp. 223–238. [Google Scholar]
Boneh, D.; Goh, E.J.; Nissim, K. Evaluating 2-DNF Formulas on Ciphertexts. In Proceedings of the 2nd Annual Theory of Cryptography, Cambridge, MA, USA, 10–12 February 2005; Kilian, J., Ed.; Springer: Berlin/Heidelberg, 2005; pp. 325–341. [Google Scholar]
Park, J.; Lim, H. Privacy-Preserving Federated Learning Using Homomorphic Encryption. Appl. Sci. 2022, 12, 734. [Google Scholar] [CrossRef]
Petrean, D.E.; Potolea, R. Random forest evaluation using multi-key homomorphic encryption and lookup tables. Int. J. Inf. Secur. 2024, 23, 2023–2041. [Google Scholar] [CrossRef]
Rezaeibagha, F.; Mu, Y.; Huang, K.; Chen, L.; Zhang, L. Authenticable Additive Homomorphic Scheme and its Application for MEC-based IoT. IEEE Trans. Serv. Comput. 2022, 16, 1664–1672. [Google Scholar] [CrossRef]
Savvides, S.; Khandelwal, D.; Eugster, P. Efficient confidentiality-preserving data analytics over symmetrically encrypted datasets. Proc. VLDB Endow. 2020, 13, 1290–1303. [Google Scholar] [CrossRef]
Liu, Z.; Lv, S.; Li, J.; Huang, Y.; Guo, L.; Yuan, Y.; Dong, C. EncodeORE: Reducing Leakage and Preserving Practicality in Order-Revealing Encryption. IEEE Trans. Dependable Secur. Comput. 2020, 19, 1579–1591. [Google Scholar] [CrossRef]
Li, D.; Lv, S.; Huang, Y.; Liu, Y.; Li, T.; Liu, Z.; Guo, L. Frequency-hiding order-preserving encryption with small client storage. Proc. VLDB Endow. 2021, 14, 3295–3307. [Google Scholar] [CrossRef]
Furukawa, J. Request-Based Comparable Encryption. In Proceedings of the Computer Security—ESORICS 2013: 18th European Symposium on Research in Computer Security, Egham, UK, 9–13 September 2013; Crampton, J., Jajodia, S., Mayes, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8134, pp. 129–146. [Google Scholar]
Zhao, K.; Wang, X.A.; Yang, B.; Tian, Y.; Zhang, J. A privacy preserving homomorphic computing toolkit for predictive computation. Inf. Process. Manag. 2022, 59, 102880. [Google Scholar] [CrossRef]
Huang, D.; Gan, Q.; Wang, X.; Ogiela, M.R.; Wang, X.A. Privacy preserving IoT-based crowd-sensing network with comparable homomorphic encryption and its application in combating COVID19. Internet Things 2022, 20, 100625. [Google Scholar] [CrossRef] [PubMed]
Bresson, E.; Catalano, D.; Pointcheval, D. A Simple Public-Key Cryptosystem with a Double Trapdoor Decryption Mechanism and Its Applications. In Proceedings of the Advances in Cryptology—ASIACRYPT 2003, Taipei, Taiwan, 30 November–4 December 2003; Laih, C.S., Ed.; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2894, pp. 37–54. [Google Scholar]
Akinyele, J.A.; Garman, C.; Miers, I.; Pagano, M.W.; Rushanan, M.; Green, M.; Rubin, A.D. Charm: A framework for rapidly prototyping cryptosystems. J. Cryptogr. Eng. 2013, 3, 111–128. [Google Scholar] [CrossRef]

Figure 1. System model.

Figure 2. Initial phase 1.

Figure 3. Initial phase 2.

Figure 4. Initial phase 3.

Figure 5. Encryption phase.

Figure 6. Client communication length cost.

Figure 7. Client communication time cost.

Figure 8. Server computing time cost.

Figure 9. Client computing time cost.

Figure 10. Total time cost of client.

Figure 11. Average path length under different numbers of subtrees.

Figure 12. Comparison under different numbers of sampling points.

Figure 13. Anomaly detection at different thresholds.

Figure 14. Comparison of plaintext and encrypted isolation forest.

Table 1. Theoretical analysis of client communication cost.

Schema	Init	Encryption	Additon	Subtraction	Compare
Our	$O (3 u + 5 λ + 1) + 4 T$	$O (1) + T$	-	-	-
[44]	$O (3 u + λ + 1) + T$	$O (1) + T$	$O (5) + T$	$O (5) + T$	$O (3) + 2 T$
[45]	$O (3 u + λ + 1) + T$	$O (1) + T$	-		$O (5) + 2 T$

T: Communication delay from server to client or client to server, u: Number of users,

λ

: safety parameter.

Table 2. Theoretical analysis of server computation cost.

Schema	Init	Encryption	Additon	Subtraction	Compare
Our	$O (u λ D i v + u^{2} M u l + u^{2} λ E x p)$	-	$O (2 M u l)$	$O (2 D i v)$	$O (D i v + 2 M u l)$
[44]	$O (u λ D i v + u M u l)$	-	$O (2 λ M u l + 2 λ D i v)$	$O (2 λ M u l + 2 λ D i v)$	$O (2 λ M u l + 2 λ D i v)$
[45]	$O (u λ D i v + u M u l + λ E x p)$	-	$O (2 M u l)$	$O (2 D i v)$	$O (2 λ M u l + 4 λ D i v + 2 λ E x p)$

u: Number of users,

λ

: safety parameter,

D i v

: Division operations under modulus,

M u l

: Multiplication operations under modulus,

E x p

: Exponential operations under modulus.

Table 3. Theoretical analysis of client computation cost.

Schema	Init	Encryption	Additon	Subtraction	Compare
Our	$O (u M u l + u E x p)$	$O (2 D i v + 8 M u l + 6 E x p)$	-	-	-
[44]	-	$O (2 M u l + E x p)$	$O (D i v + M u l + E x p)$	$O (D i v + M u l + E x p)$	$O (D i v + M u l + E x p)$
[45]	-	$O (2 M u l + 2 E x p)$	-	-	$O (2 D i v + 4 M u l + 4 E x p)$

D i v

: Division operations under modulus,

M u l

: Multiplication operations under modulus,

E x p

: Exponential operations under modulus.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xie, S.; Ye, J.; Ou, W. Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption. Electronics 2025, 14, 640. https://doi.org/10.3390/electronics14030640

AMA Style

Xie S, Ye J, Ou W. Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption. Electronics. 2025; 14(3):640. https://doi.org/10.3390/electronics14030640

Chicago/Turabian Style

Xie, Shaoxiong, Jun Ye, and Wei Ou. 2025. "Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption" Electronics 14, no. 3: 640. https://doi.org/10.3390/electronics14030640

APA Style

Xie, S., Ye, J., & Ou, W. (2025). Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption. Electronics, 14(3), 640. https://doi.org/10.3390/electronics14030640

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-User Encrypted Machine Learning Based on Partially Homomorphic Encryption

Abstract

1. Introduction

2. Related Work

3. System Model

3.1. Preliminary Knowledge

3.1.1. Partial Discrete Logarithm

3.1.2. Decisional Diffie–Hellman

3.1.3. BCP Encryption

3.1.4. Lemma

3.2. System Model

4. Homomorphic Encryption Scheme

4.1. Initial Phase

4.2. Encryption Phase

4.3. Homomorphic Property

4.3.1. Homomorphic Addition

4.3.2. Homomorphic Subtraction

4.3.3. Homomorphic Comparison

4.4. Dynamic User Join and Quit Mechanism

4.4.1. User Join

4.4.2. User Quit

5. Secure Isolation Forest Algorithm

6. Security Analysis

6.1. Correctness

6.2. Security Model

6.3. Security Analysis

6.3.1. Homomorphic Encryption System Security

6.3.2. Homomorphic Computing Security

7. Performance Analysis

8. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI