1. Introduction
The rapid development of cloud services has provided users with more diverse ways of handling data. A user can upload a dataset to a cloud server, which performs the relevant operations, thus reducing the use of the user's own computing resources. Because users are highly varied, the datasets they provide cover a wide range of types, such as email content, back-office user database records, personal health details, and other information involving personal privacy or corporate trade secrets [1,2,3,4,5]. When users outsource such information, the sensitive information it contains is at risk of leakage. Therefore, how to protect the privacy of sensitive data is a hot topic in current research.
There are two common ways of handling privacy protection: one is to secure the data against theft during transmission through a trusted channel provided by hardware or a trusted third party; the other is to encrypt the data before transmission so that, even if an attacker obtains the ciphertext, it cannot be decrypted without the key. Although the first approach ensures security during transmission, it relies heavily on third parties, and, because it uses plaintext transmission, there is still a risk that the dataset will be stolen when it is transmitted to the cloud server. The second method encrypts the dataset and then transmits it, so the data is never exposed in plaintext, either during transmission or during server-side computation.
There are two common ways of handling privacy protection: one is to secure the data from theft during transmission through a trusted channel provided by hardware or a trusted third party; the other is to encrypt the data and transmit it so that, even if an attacker obtains the ciphertext, it cannot be decrypted without a key. Although the first approach ensures security during transmission, it relies heavily on third parties, and, because it uses plaintext transmission, there is still a risk that the dataset will be stolen when it is transmitted to the cloud server. The second method encrypts the dataset and then transmits it so that the data is not compromised in plaintext, either during transmission or during server-side computation.
When a dataset is encrypted with a traditional encryption algorithm, the server cannot process the ciphertext directly and requires the user to provide a key or to perform decryption. Homomorphic encryption is a cryptographic tool that supports evaluating functions on encrypted messages such that decryption yields the same result as performing the corresponding operation on the plaintext. Homomorphic encryption algorithms can be broadly classified by the types and number of operations they support: FHE (fully homomorphic encryption) algorithms support an unlimited number of operations of multiple types; PHE (partially homomorphic encryption) algorithms support an unlimited number of operations of a limited number of types; and SWHE (somewhat homomorphic encryption) algorithms support a limited number of operations of multiple types [6,7,8,9,10,11,12,13,14]. Most applications require only a finite number of homomorphic addition or multiplication operations, and SWHE algorithms are more efficient than FHE algorithms, so SWHE has a wide range of applications [15]. The BGV (Brakerski–Gentry–Vaikuntanathan) scheme is a commonly used SWHE algorithm that supports encryption of polynomials or integers and enables parallel (batched) operations using the Chinese remainder theorem.
Homomorphic encryption satisfies both confidentiality and ciphertext manipulability, making it widely used in the field of privacy protection. The privacy-protection model of outsourced computing based on homomorphic encryption is illustrated in Figure 1: the user encrypts the dataset to be outsourced and uploads it to the server side (➀), where the server performs the more computationally complex data processing; the server then returns the result to the user (➁).
Machine learning is an important type of outsourced computation, and its privacy-protection problem is an important class of application scenarios for homomorphic encryption. In the past few years, researchers have studied the application of homomorphic encryption to machine learning privacy protection for different application scenarios, data types, and machine learning algorithms. The authors of [16] implemented an efficient and secure k-nearest neighbor algorithm. Other studies [17,18] implemented a homomorphic K-means clustering algorithm on encrypted datasets, though this suffered from a large time overhead. In [19,20,21], the encrypted objects were images: after encrypting image datasets, the authors extracted features from the encrypted images and applied different image-matching algorithms to match them. A statistical learning algorithm on an encrypted dataset was implemented and described in [22]. Further studies [23,24] reported the implementation of comparison algorithms or protocols on encrypted datasets by means of bit-by-bit operations; however, their overhead was high. In [25], relevant features of numerical datasets were pre-extracted, then encrypted and uploaded to the cloud, where the classification was completed after performing homomorphic operations on the feature ciphertexts. The scheme proposed in this paper differs in that the encrypted object is the original dataset, and the homomorphic clustering operation is performed in the cloud afterwards.
Common machine learning algorithms are mainly divided into supervised and unsupervised types. Supervised algorithms extract the features and true labels of a training dataset to build a model and then produce results on test data; representative algorithms include the Bayesian classifier, hyperplane classifiers, and decision tree classifiers. Unsupervised algorithms do not need a pre-trained model; representative algorithms include the K-means clustering algorithm and the DBSCAN (density-based spatial clustering of applications with noise) algorithm. DBSCAN is an unsupervised, density-based clustering algorithm that finds clusters of arbitrary shape in the presence of noise points and has a wider range of applications than K-means, another common clustering algorithm, for example in advanced systems such as recommender systems. DBSCAN can also be used in database-structure design in conjunction with accelerated range access (e.g., R*-trees). Therefore, the study of DBSCAN privacy protection is of great importance.
This paper presents a privacy-preserving scheme for clustering based on homomorphic encryption, which implements a homomorphic DBSCAN algorithm for clustering on ciphertext datasets. To encrypt the floating-point data found in real-world scenarios, we propose three data pre-processing methods for different data precisions, together with a strategy for selecting among them based on data characteristics, balancing precision against computational overhead. Since homomorphic encryption does not support ciphertext comparison, we design a protocol between the user and the cloud server to implement the ciphertext comparison function. Our work is expected to address privacy protection during outsourced computation in scenarios such as 5G communications [26], blockchain technology [27,28], IoT systems [29], wireless sensor networks [30], and big data system security [31].
2. Basic Knowledge
2.1. Introduction to Symbols
For a real number a, ⌈a⌉, ⌊a⌋, and ⌊a⌉ denote rounding up, down, and to the nearest integer, respectively; [z]_m denotes z mod m. Lowercase bold letters denote vectors; uppercase letters denote sets, and |S| denotes the number of elements in a set S; uppercase bold letters denote matrices (e.g., A), and A^T denotes the transpose of a matrix A. The symbol ← indicates random selection; R denotes the ring Z[x]/(Φ_m(x)), and R_q = R/qR. For an element m in the plaintext space, the symbol [m] denotes its ciphertext.
2.2. Homomorphic Encryption Algorithms
Definition 1. Homomorphic encryption. A homomorphic encryption algorithm consists of four main elements: KeyGen, Enc, Eval, and Dec.
- ➀ KeyGen(1^λ): Input the security parameter λ and output the public key pk and the private key sk.
- ➁ Enc(pk, m): Input the message plaintext m and the public key pk, and output the ciphertext c.
- ➂ Dec(sk, c): Input the ciphertext c and the private key sk, and output the plaintext m of the message.
- ➃ Eval(pk, f, c_1, …, c_t): Input a Boolean circuit f, a set of ciphertexts c_1, …, c_t, and the public key pk, and output the resulting ciphertext.
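As a concrete illustration of this four-part interface, the following Python sketch implements a toy symmetric somewhat-homomorphic scheme over the integers (in the style of DGHV, not the BGV scheme used later); the class name and parameter sizes are our own choices, and the scheme is insecure and for exposition only.

```python
import random

class ToySWHE:
    """Toy symmetric somewhat-homomorphic scheme over the integers
    (DGHV-style); insecure parameters, for illustration only."""

    def keygen(self, bits=64):
        # secret key: a large odd integer p
        self.p = random.randrange(2 ** (bits - 1), 2 ** bits) | 1
        return self.p

    def enc(self, m):
        # m is a bit; ciphertext c = m + 2r + p*q with small noise r
        q = random.randrange(2 ** 16, 2 ** 17)
        r = random.randrange(0, 2 ** 4)
        return m + 2 * r + self.p * q

    def dec(self, c):
        return (c % self.p) % 2

    def eval_add(self, c1, c2):  # homomorphic XOR of the plaintext bits
        return c1 + c2

    def eval_mul(self, c1, c2):  # homomorphic AND of the plaintext bits
        return c1 * c2
```

Each Eval step grows the noise term 2r; once the noise exceeds p, decryption fails, which is exactly why a somewhat-homomorphic scheme supports only a limited number of operations.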
In this paper, we choose the BGV scheme proposed in [9] and implement it with the help of the homomorphic encryption library HElib. The BGV scheme consists of the following algorithms.
Setup(1^λ, 1^μ): A μ-bit modulus q is chosen, along with parameters d = d(λ, μ), n = n(λ, μ), N = ⌈(2n + 1) log q⌉, and a noise distribution χ = χ(λ, μ), such that R = Z[x]/(Φ_m(x)), where Φ_m(x) is the m-th cyclotomic polynomial. The above parameters should be chosen in such a way that the hardness of the scheme can be based on the GLWE (general learning with errors) lattice problem and can resist existing attacks.
KeyGen: Choose s′ ← χ^n, so that the private key is s = (1, s′). Randomly draw the matrix A′ ← R_q^{N×n} and the vector e ← χ^N, and compute b = A′s′ + 2e, so that the public key is A = (b, −A′). Clearly, A·s = 2e.
Enc(A, m): Expand the plaintext m ∈ {0, 1} into m = (m, 0, …, 0) ∈ R_q^{n+1}, pick r ← R_2^N at random, and output the ciphertext c = m + A^T r.
Dec(s, c): Output m = [[⟨c, s⟩]_q]_2.
Eval: Ciphertext addition outputs c_add = c_1 + c_2 (mod q), and ciphertext multiplication outputs c_mult = c_1 ⊗ c_2 (mod q).
In addition to the above algorithms, HElib also includes some processing procedures for implementing the FHE algorithm [32], such as key switching, modulus switching, and bootstrapping. The SWHE algorithm does not support an unlimited number of ciphertext operations, but its ciphertext and key sizes are smaller, and it does not require bootstrapping to refresh the ciphertext, so it operates more efficiently. Since the scenario in this paper requires only a limited number of ciphertext operations, the SWHE algorithm is chosen to obtain high computational efficiency.
2.3. Data Pre-Processing Algorithms
Most data in real applications cannot be directly used as the plaintext of the encryption algorithm and requires some pre-processing algorithm to map the actual data into the plaintext space. The data to be processed by the DBSCAN algorithm studied in this paper is a floating-point dataset and therefore needs to be mapped to the plaintext space of the encryption algorithm using a preprocessing algorithm. In this section, three types of data pre-processing are described.
Mode 1. Integer pair encoding
Let the original data be a floating-point number a. If integers c and b are chosen so that b = ⌊a × c⌉, then the number pair (c, b) is used to represent the number a. For example, 1.5 can be represented by the number pair (2, 3); 1.2 can be represented by the number pairs (2, 2), (5, 6), etc. The selection of pairs should take into account both the time cost and the accuracy requirements.
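A minimal Python sketch of this encoding (the function names are ours): a float a is stored as the pair (c, b) with b = ⌊a × c⌉, and multiplying two encoded values multiplies the components, which is the pair arithmetic the homomorphic operations later act on.

```python
def encode_pair(a, c):
    """Mode 1: represent float a as the integer pair (c, b) with b = round(a*c);
    a larger c gives better precision at a higher cost."""
    return (c, round(a * c))

def decode_pair(pair):
    c, b = pair
    return b / c

def pair_mul(p1, p2):
    # product of two encoded values: components multiply, so one plaintext
    # multiplication becomes two integer multiplications
    (c1, b1), (c2, b2) = p1, p2
    return (c1 * c2, b1 * b2)

# the examples from the text: 1.5 -> (2, 3); 1.2 -> (5, 6) or, with error, (2, 2)
assert encode_pair(1.5, 2) == (2, 3)
assert encode_pair(1.2, 5) == (5, 6)
assert encode_pair(1.2, 2) == (2, 2)
assert pair_mul((2, 3), (5, 6)) == (10, 18)
assert decode_pair((10, 18)) == 1.8
```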
Mode 2. Shift rounding encoding
Let the original data be a, with integer part a_1 and decimal part a_2, where a_1 and a_2 have approximately the same number of digits. The original data is shifted by t decimal places to obtain the new data b = ⌊a × 10^t⌉.
Input: the original data a and the shift length t.
Output: the integer b = ⌊a × 10^t⌉.
This processing is equivalent to expanding the original data by a factor of 10^t and then rounding the fractional part of the expanded data appropriately according to the actual requirements, thus guaranteeing the accuracy of this type of data without needing to select larger parameters for the encryption scheme at encryption time.
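The shift-rounding step can be sketched as follows (function names ours); note that after one homomorphic multiplication the scale factor becomes 10^(2t), which the user must account for when decoding.

```python
def encode_shift(a, t):
    """Mode 2: shift a by t decimal places and round: b = round(a * 10**t)."""
    return round(a * 10 ** t)

def decode_shift(b, t):
    return b / 10 ** t

assert encode_shift(3.14159, 3) == 3142   # the two lowest digits are rounded off
assert decode_shift(3142, 3) == 3.142
# after multiplying two encoded values the scale is 10**(2*t)
assert decode_shift(encode_shift(1.25, 2) * encode_shift(2.0, 2), 4) == 2.5
```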
Mode 3. Multi-processing encoding
The original data a is as described in Mode 2, and the number of digits in both the integer part a_1 and the decimal part a_2 is less than the maximum number of digits that the encryption algorithm can handle.
Input: the original data a.
Output: the number pair (a_1, a_2).
Split the integer part and the fractional part of the original data into the binary group (a_1, a_2).
This method is the most universal and fully guarantees that data accuracy will not be lost due to pre-processing; however, it has some drawbacks. If the user turns each original datum into a number pair using Mode 3, the amount of data doubles, which increases the overhead of the encryption phase. Moreover, a homomorphic multiplication becomes a multiplication of two number pairs rather than of two integers, so the number of multiplications increases and the complexity of the calculation rises.
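A sketch of this splitting, and of the multiplication blow-up it causes (function names ours; nonnegative data assumed):

```python
def encode_split(a, d):
    """Mode 3: split nonnegative a into the pair (integer part, fractional part),
    keeping d decimal digits of the fractional part."""
    int_part = int(a)
    frac_part = round((a - int_part) * 10 ** d)
    return (int_part, frac_part)

def split_mul(x, y, d):
    # the product of two split encodings expands into four integer products,
    # which is the extra homomorphic-multiplication cost noted above
    (x1, x2), (y1, y2) = x, y
    s = 10 ** d
    return x1 * y1 + (x1 * y2 + x2 * y1) / s + (x2 * y2) / (s * s)

assert encode_split(12.34, 2) == (12, 34)
assert abs(split_mul((12, 34), (2, 0), 2) - 12.34 * 2.0) < 1e-9
```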
In practice, depending on the accuracy of the data to be processed, one of the three pre-processing methods mentioned above can be chosen, or a combination of pre-processing methods can be chosen to maintain a balance between efficiency and accuracy. At the same time, when choosing a specific pre-processing method, the impact of the pre-processing method on the computational overhead of the subsequent homomorphic encryption should also be taken into consideration, as the type of plaintext space and the fault tolerance of the chosen homomorphic encryption algorithm vary.
2.4. DBSCAN Algorithm
The DBSCAN algorithm is a common clustering algorithm used to construct clusters in a dataset and to discover noise data [33,34]. Compared to the equally common K-means clustering algorithm, the DBSCAN algorithm does not require a predefined number of clusters and is suitable for constructing clusters of any shape, even unconnected ring-shaped clusters. Due to the minimum-number-of-points restriction, the DBSCAN algorithm avoids the single-link effect compared to K-means and, therefore, gives better clustering results for any shape of data distribution.
The DBSCAN algorithm itself has many variants: the original DBSCAN algorithm has a complexity of O(n²), and the ρ-approximate DBSCAN algorithm has a complexity of O(n), but the latter places some restrictions on the dimensionality of the data. The selection of a DBSCAN variant is related to the type of data. The two-dimensional DBSCAN algorithm was chosen for the experiments based on the data type of this paper's datasets.
Some concepts of the DBSCAN algorithm are defined as follows: MinPts defines the minimum number of data points required for a cluster, and the ε-neighborhood of a point is the area covered by a circle with that point as its center and ε as its radius. Core points, i.e., the centers of clusters, contain at least MinPts data points in their ε-neighborhood; edge points, i.e., nodes at the edges of clusters, contain fewer than MinPts data points in their ε-neighborhood but lie in the ε-neighborhood of some core point; and noise points are the data points that are neither core points nor edge points. An instantiated depiction of these node definitions is shown in Figure 2. In addition, Definition 2 and Definition 3 give the definitions of density-reachable and density-connected.
Definition 2. Density-reachable. Let p and q be two data points. If there exists a sample sequence p_1, p_2, …, p_n, where p_1 = q and p_n = p, such that for each i, p_{i+1} lies in the ε-neighborhood of the core point p_i, then p is said to be density-reachable from q.
Definition 3. Density-connected. For data points p and q, p is said to be density-connected to q if there exists a point o such that both p and q are density-reachable from o.
The flow of the DBSCAN algorithm is as follows.
Step 1. Input the data collection D. Select an unlabeled point p as the current node and label it with the first cluster number, 1.
Step 2. Find all the nodes in the ε-neighborhood centered on p, which are the neighbors of p, and perform the following operation.
- ➀ If the number of neighboring nodes found is less than MinPts, then p is a noise point, and step 4 is performed.
- ➁ If the number of neighboring nodes found is not less than MinPts, then p is a core point, and step 3 is executed.
Step 3. Treat each neighboring node in turn as a candidate core point and repeat step 2 until no new neighboring node can serve as a core point.
Step 4. Select the next unlabeled point from dataset D as the current node and increase the cluster count by 1.
Step 5. Repeat steps 2 to 4 until all points in dataset D have been marked.
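The steps above can be condensed into a short plaintext Python sketch (function names ours), using the squared distance so that the same code matches the deformed Euclidean distance adopted later:

```python
def region_query(points, i, eps):
    # squared (deformed) Euclidean distance avoids the square root
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps]

def dbscan(points, eps, min_pts):
    """Plaintext DBSCAN following steps 1-5 above.
    Returns one label per point: a cluster id (1, 2, ...) or 0 for noise."""
    labels = {}              # point index -> cluster id (0 = noise)
    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = 0    # provisionally noise (step 2-①)
            continue
        cluster += 1         # i is a core point: start a new cluster (step 2-②)
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:         # step 3: expand from every reachable core point
            j = seeds.pop()
            if labels.get(j, 0) == 0:   # unlabeled, or noise upgraded to edge
                labels[j] = cluster
                jn = region_query(points, j, eps)
                if len(jn) >= min_pts:
                    seeds.extend(jn)
    return [labels[i] for i in range(len(points))]
```

For example, two tight groups of three points plus one isolated point yield two clusters and one noise point.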
2.5. Evaluation Criteria for the Scheme
The scheme is evaluated according to two criteria: time efficiency and accuracy.
In terms of time, this paper is concerned with the time to execute the homomorphic DBSCAN algorithm in the cloud. In addition to time, another important evaluation criterion is the accuracy of the clustering. According to the characteristics of the DBSCAN algorithm, the accuracy criterion covers the number of clusters and the noise-point judgments. In this paper, we compare the results obtained by performing the DBSCAN algorithm directly on the plaintext with the results obtained by performing the DBSCAN algorithm homomorphically on the ciphertext and then decrypting. The accuracy of clustering is defined as the percentage of identical results in the two clusterings; ideally, the two results should be identical.
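This accuracy measure can be written directly (a sketch; it assumes the plaintext and decrypted runs number their clusters in the same order, which holds when both runs scan the points in the same order):

```python
def clustering_accuracy(plain_labels, decrypted_labels):
    """Accuracy as defined above: the fraction of points whose label from the
    plaintext run matches the label from the decrypted homomorphic run."""
    matches = sum(p == d for p, d in zip(plain_labels, decrypted_labels))
    return matches / len(plain_labels)

assert clustering_accuracy([1, 1, 2, 0], [1, 1, 2, 0]) == 1.0
assert clustering_accuracy([1, 1, 2, 0], [1, 1, 2, 2]) == 0.75
```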
3. DBSCAN Privacy Protection Solutions
In this paper, a homomorphic clustering algorithm on encrypted datasets is constructed to enable privacy protection of sensitive datasets during outsourced computation. The operations in the DBSCAN algorithm can be divided into two categories: operations supported by the homomorphic encryption algorithm and operations not supported by it. The homomorphic DBSCAN algorithm is shown in Algorithm 1.
Algorithm 1. Homomorphic DBSCAN algorithm.
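Since the neighborhood search is the core of Algorithm 1, the following Python sketch outlines the structure of its findneighbor step; the `he` object and `compare` callback are hypothetical stand-ins for the BGV operations and the interactive comparison protocol of Section 3.3, and `PlainOps` merely lets the control flow run on plaintext.

```python
class PlainOps:
    """Hypothetical stand-in for the homomorphic API, operating on plaintext
    integers so the control flow can be exercised without an HE library."""
    def add(self, a, b): return a + b
    def sub(self, a, b): return a - b
    def mul(self, a, b): return a * b

def find_neighbors_enc(enc_points, i, enc_eps_sq, he, compare):
    """Sketch of the findneighbor step: compute the deformed (squared)
    Euclidean distance on ciphertexts, then decide dist^2 <= eps^2 via the
    interactive comparison (compare returns True iff its encrypted
    argument is <= 0)."""
    xi, yi = enc_points[i]
    neighbors = []
    for j, (xj, yj) in enumerate(enc_points):
        dx, dy = he.sub(xi, xj), he.sub(yi, yj)
        # deformed Euclidean distance: only two ciphertext multiplications
        dist_sq = he.add(he.mul(dx, dx), he.mul(dy, dy))
        if compare(he.sub(dist_sq, enc_eps_sq)):
            neighbors.append(j)
    return neighbors
```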
Common homomorphic encryption schemes only support additive (and subtractive) and multiplicative operations, so complex operations must be converted so that they can be expressed in terms of addition and multiplication. The operations involved in the function findneighbor in step (3) of Algorithm 1 are not fully supported by the homomorphic encryption algorithm. The following two operations are not supported.
- (1) When computing ε-neighborhoods, the distance between points needs to be calculated, most commonly the Euclidean distance, but the square-root operation involved is not supported by homomorphic encryption algorithms.
- (2) The homomorphic encryption scheme chosen in this paper is not order-preserving: the order relation between ciphertexts does not reflect that of the original plaintexts, so the problem of ciphertext comparison needs to be solved.
3.1. Dataset Pre-Processing
This section first gives a method for selecting a data pre-processing method based on data characteristics, combined with accuracy, computational overhead, etc. The selection process is shown in Figure 3.
Let the dataset to be processed be D. For each datum, let the number of digits in the integer part be u and the number of digits in the fractional part be v (missing significant digits are filled with zeros). Select the largest value x_max and the smallest value x_min from D and calculate |x_max − x_min|. Determine whether this absolute value satisfies |x_max − x_min| ≤ T_1, where T_1 is a user-defined threshold based on the characteristics of the dataset. If it is satisfied (i.e., the difference between the values is small), then the multi-processing coding method is used directly to ensure data accuracy (in this case, the difference between the values in the set lies mainly in the decimal part). If it is not satisfied, it is further judged whether v − u ≥ T_2 holds, where T_2 is a threshold set by the user according to the situation. If v − u ≥ T_2, the low digits of the decimal part are rounded off (approximately v − u digits), i.e., the number of decimal digits is reduced from v to v′ (so that v′ ≈ u); otherwise, the procedure jumps directly to the next judgment. At this point, retained accuracy and subsequent computational overhead must be weighed to decide between the integer-pair coding method and the shift-rounding coding method: if the loss of data accuracy would have a larger impact on the subsequent calculation results, choose the integer-pair coding method to improve accuracy; if the impact is smaller, choose the shift-rounding coding method.
The sensitive information to be protected in this scheme is the coordinate values of each datum, which are floating-point data, so a suitable pre-processing algorithm must be chosen according to the characteristics of the dataset and the chosen encryption scheme. In this paper, two floating-point datasets, A and B, are selected. The data in A have many decimal places, the differences between values are large, and the number of decimal places is much larger than the number of integer places; so, according to the selection process shown in Figure 3, the integer-pair coding method or the shift-rounding coding method can be considered, with the low decimal places rounded off. The differences between the data in B are large and the number of fractional digits is not significantly different from the number of integer digits, so integer-pair coding or shift-rounding can also be considered, based on the selection method described above. The encryption process after processing the data with the integer-pair encoding method is shown in Figure 4.
As can be seen in Figure 4, after Mode 1 processing each datum is converted into a number pair for encryption, and the number of subsequent ciphertext operations increases, resulting in higher computational overhead; however, the integer-pair encoding method preserves the accuracy of the data as far as possible. The shift-rounding encoding method results in some loss of data precision, but its computational overhead is lower than that of the integer-pair encoding method. Therefore, the effect of the loss of data precision on the clustering results can be observed experimentally for the two datasets A and B, which determines which pre-processing method is more appropriate (see Section 4.1 for experimental results). The plaintext space of the encryption algorithm chosen in this paper is a finite set of fixed bit-length integers, so the pre-processed data can be encrypted directly.
3.2. Distance Measure Selection
In machine learning algorithms, commonly used distance measures include the Euclidean distance, the Manhattan distance, and the deformed Euclidean distance. Compared to the traditional Euclidean distance, the deformed Euclidean distance better preserves the accuracy of the data and reduces the impact of errors caused by square-root rounding.
Definition 4. Deformed Euclidean distance. For two points p_1 = (x_1, y_1) and p_2 = (x_2, y_2), the deformed Euclidean distance can be expressed as d′(p_1, p_2) = (x_1 − x_2)² + (y_1 − y_2)².
When performing distance comparisons, the Euclidean and deformed Euclidean distances have high accuracy, while the Manhattan distance has some error. However, the square-root operation in the Euclidean distance is not supported by the homomorphic encryption algorithm and requires further processing. A Newton iteration method for homomorphic square-root operations is used in the literature [35], and the procedure is as follows.
Theorem 1. Newton's iterative solution. If the equation x² − a = 0, where a is a positive real number, has a root near x_0, use the iterative formula x_{k+1} = (x_k + a/x_k)/2. Calculating x_1, x_2, … in sequence, the sequence approaches the root of the equation without bound.
As shown in Theorem 1, Newton's iterative method converts the square-root operation into basic operations, thus satisfying the requirements of the homomorphic encryption scheme. Although this approach can achieve square-root operations, all constants involved in Equation (2) need to be encrypted in advance, and several ciphertext multiplication operations are involved, resulting in high computational complexity. In addition, the result obtained by the Newton iteration method is an approximation, whose error may have a greater impact on the clustering results.
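For reference, the iteration of Theorem 1 can be sketched in Python on plaintext (the homomorphic version in [35] must additionally encrypt every constant and pay ciphertext multiplications per step):

```python
def newton_sqrt(a, x0, iters):
    """Newton's iteration for x**2 - a = 0 (Theorem 1):
    x_{k+1} = (x_k + a / x_k) / 2. Each step costs a division and an
    averaging, which is what makes the homomorphic version expensive."""
    x = x0
    for _ in range(iters):
        x = (x + a / x) / 2
    return x

# quadratic convergence: a handful of steps reaches machine precision
assert abs(newton_sqrt(2.0, 1.0, 6) - 2.0 ** 0.5) < 1e-12
```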
Theorem 2. Let a and b be two integers, and let the bit length of a be l_a and the bit length of b be l_b. Then a + b has a bit length no greater than max(l_a, l_b) + 1, and a × b has a bit length no greater than l_a + l_b.
As shown in Theorem 2, the bit length of the result using the deformed Euclidean distance is about twice that of the Euclidean distance result. Although this leads to a slightly higher computation time because the data are larger during subsequent computation, the additional overhead of the increased bit length is low compared to that of the Newton iteration method (which requires multiple iterations, and thus multiple multiplications, to ensure the accuracy of the square-root result), since only two ciphertext multiplications are required.
At the same time, this paper tests the effect of the distance measure on clustering accuracy. In general, the distance measure used in clustering algorithms is the Euclidean distance, so the test replaces the Euclidean distance with the Manhattan distance and the deformed Euclidean distance, respectively, and checks whether the clustering results are similar to those obtained with the Euclidean distance. The results obtained by selecting the same input parameters and clustering on the plaintext with these three distance measures are shown in Figure 5. The horizontal and vertical coordinates are the values of the data points; points of the same color belong to the same cluster, different colors indicate different clusters, and the darker data points at the edges are noise. The three plots in Figure 5 show the clustering results for the Euclidean distance, the Manhattan distance, and the deformed Euclidean distance, respectively. As can be seen from Figure 5, the clustering results of the Euclidean distance and the deformed Euclidean distance are similar, while the clustering results for the Manhattan distance show some errors compared to those of the Euclidean distance. Combining the above analysis with the experimental results, and considering the accuracy of the dataset and the overhead of the subsequent homomorphic operations, the deformed Euclidean distance is used in this paper.
3.3. Ciphertext Comparison Protocol
In step (3) of Algorithm 1, determining whether a point is in the ε-neighborhood of another requires a size-comparison operation. However, the BGV homomorphic encryption algorithm used in this paper is not order-preserving, so data encrypted with it do not maintain their original order and cannot be compared directly. To solve this problem, we design a protocol to compare ciphertexts, shown as Protocol 1. Let a and b be the binary representations of the two values to be compared. When comparing the ciphertexts of a and b, the server first computes the ciphertext of 2^l + a − b, where 2^l is an integer with more bits than a and b, and then returns the resulting ciphertext to the user. The user decrypts the ciphertext to obtain m = 2^l + a − b and extracts its highest bit: a ≥ b if the bit is 1, or a < b if the bit is 0. Since no homomorphic multiplication is used in the protocol, there is no need to introduce complex noise-reduction techniques, which results in low time complexity. In one round of the protocol, the server sends the user an intermediate ciphertext result whose bit length is determined by the scheme parameters (see Section 2.2 for the definitions of the parameters n, q, and N). After receiving it, the user returns a 1-bit result to the server side. Therefore, the theoretical communication volume of one round of the comparison protocol is one ciphertext plus 1 bit.
Protocol 1. Ciphertext comparison protocol
(1) Server: homomorphically compute c ← [2^l] + [a] − [b]
(2) Server: send c to the client
(3) Client: decrypt c to obtain m = 2^l + a − b
(4) if the highest bit of m is 1 then
(5) return 1
(6) else
(7) return 0
(8) end if
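The round sketched in Protocol 1 can be simulated in Python as follows; `PlainOps` and `msb` are hypothetical stand-ins for the BGV operations and for the user's decrypt-and-extract step (decryption is the identity here, since the stand-in works on plaintext):

```python
class PlainOps:
    """Hypothetical stand-in for the homomorphic API (plaintext integers)."""
    def enc_const(self, v): return v
    def add(self, a, b): return a + b
    def sub(self, a, b): return a - b

def server_compare(enc_a, enc_b, l, he, user_decrypt_msb):
    """One round of Protocol 1: the server homomorphically forms
    2**l + a - b (with 2**l wider than a and b), sends it to the user, and
    the user returns only the highest bit: 1 means a >= b, 0 means a < b."""
    c = he.add(he.enc_const(2 ** l), he.sub(enc_a, enc_b))
    return user_decrypt_msb(c, l)

def msb(c, l):
    # user side: decrypt (identity in this stand-in) and extract bit l
    return (c >> l) & 1

he = PlainOps()
assert server_compare(5, 3, 8, he, msb) == 1   # 5 >= 3
assert server_compare(3, 5, 8, he, msb) == 0   # 3 < 5
assert server_compare(7, 7, 8, he, msb) == 1   # equality counts as >=
```

The offset 2^l guarantees the difference stays positive, so the sign of a − b appears in a single known bit position and the user never learns a or b themselves.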
The interaction between the user and the server is shown in Figure 6. The server sends the ciphertext that needs to be decrypted to the user, and the user decrypts it and sends the highest bit of the decryption result back to the server. In this process, the user and the server can negotiate a time period T: every T, the user initiates a query to the server, and the server returns either an intermediate result to be decrypted or the homomorphic clustering result, according to the state of the computation. In this case, the user does not need to wait online all the time, but only needs to initiate a query at intervals. This process involves the direct transmission of plaintext, and an attacker may be able to obtain both plaintext and ciphertext through eavesdropping. The security of this process is analyzed in detail in Section 5.2.
4. Program Realization
The configuration used in this paper is an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz with 16 GB of RAM, using the HElib homomorphic encryption library for encryption and homomorphic clustering of the datasets. Two common clustering datasets were selected. Dataset A is a common shape dataset with 5 clusters and 2000 data items, each containing two coordinates with 10 decimal places. Dataset B is the Aggregation dataset, a common special-shape clustering dataset with 750 data items, each containing two coordinates with two decimal places, essentially the same number of places as the integer part.
In this paper, we verify the accuracy of the scheme by comparing the clustering results on plaintext data with the homomorphic clustering results on ciphertext data. In the experiments, the parameters ε and MinPts used in the execution of the DBSCAN algorithm are the same for both the plaintext and ciphertext datasets.
4.1. Selection of Data Pre-Processing Methods
Dataset A and dataset B were selected from common clustering-algorithm datasets, representing the two cases of large and small differences between the integer and fractional parts, respectively, to illustrate the universality of this paper's solution. The experimental results in this section verify the validity of the selection method of Section 3.1 and identify a suitable pre-processing method for dataset A and dataset B. The three pre-processing methods were used to encode the two datasets, and then homomorphic clustering was performed to obtain the time overhead and accuracy of the homomorphic clustering algorithm, as shown in Table 1. As can be seen from Table 1, the accuracy of the multi-processing coding is 100%, but its time overhead is higher; the accuracies of the integer-pair coding and the shift-rounding coding, although not 100%, are also very high, which indicates that the selection process in Section 3.1 correctly excluded the multi-processing coding, considered dataset A and dataset B to be more suitable for the integer-pair or shift-rounding coding, and rounded the fractional part of the data in dataset A appropriately. In addition, the accuracy of the integer-pair coding method is slightly higher than that of the shift-rounding coding method, but the accuracy of the shift-rounding coding method is still very high, and its computational efficiency is much greater than that of the integer-pair coding method. The experimental results show that the datasets selected for this paper are better suited to the shift-rounding coding method, so the specific conclusions given in this paper are based on that choice.
4.2. Dataset Explicit Clustering Results
The results of the plaintext clustering are shown in Figure 7 and Figure 8, using the chosen parameters ε (and its value after coding) and MinPts for dataset A and for dataset B, respectively.
4.3. Dataset Ciphertext Clustering Results
Dataset A and dataset B are encoded by shift rounding and encrypted, and homomorphic clustering is then performed on the encrypted data. The data in dataset A were shifted and rounded with 3, 4, and 5 decimal places retained, and the homomorphic clustering algorithm was then performed on the ciphertext; the clustering results after decryption are shown in Figure 9, Figure 10 and Figure 11. Dataset B was directly encoded by shift rounding, encrypted, and homomorphically clustered; the clustering results obtained by decryption after the computation are shown in Figure 12.
6. Conclusions
In this paper, we propose a scheme that performs a homomorphic DBSCAN clustering algorithm on encrypted datasets to solve the privacy-protection problem during outsourced data computation. For the comparison operation, which is not supported by the homomorphic encryption algorithm, an interaction protocol is designed to implement this function. The scheme proposes multiple coding pre-processing methods for different datasets and gives a strategy for selecting a data pre-processing method according to the characteristics of the dataset, taking into account data accuracy and computational overhead. Our proposed scheme offers reliable data security, a good clustering effect, and good computational performance. A limitation is that this paper only discusses the privacy-leakage risk of the clustering process itself; we will discuss a more comprehensive computation process in subsequent work. This research is expected to help solve the privacy-protection problem in outsourced data computation, in scenarios such as 5G communication, blockchain technology, IoT systems, wireless sensor networks, and big data system security.