Article

Maximized Privacy-Preserving Outsourcing on Support Vector Clustering

1 School of Information Engineering, Xuchang University, Xuchang 461000, Henan, China
2 Information Technology Research Base of Civil Aviation Administration of China, Civil Aviation University of China, Tianjin 300300, China
3 School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
4 Department of Computer and Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA
5 The State Key Laboratory of Integrated Service Networks, Xidian University, Xi'an 710071, Shaanxi, China
* Authors to whom correspondence should be addressed.
Electronics 2020, 9(1), 178; https://doi.org/10.3390/electronics9010178
Submission received: 16 December 2019 / Revised: 11 January 2020 / Accepted: 15 January 2020 / Published: 17 January 2020
(This article belongs to the Section Computer Science & Engineering)

Abstract

Despite its remarkable capability in handling arbitrary cluster shapes, support vector clustering (SVC) suffers from pricey storage of the kernel matrix and costly computations. Outsourcing data or functions on demand is intuitively expected, yet it raises serious privacy concerns. We propose maximized privacy-preserving outsourcing on SVC (MPPSVC), which, to the best of our knowledge, is the first all-phase outsourceable solution. For privacy preservation, we exploit the properties of homomorphic encryption and secure two-party computation. To overcome the limited set of supported operations, we propose a reformative SVC with elementary operations (RSVC-EO, the core of MPPSVC), in which a series of designs makes selective phase outsourcing possible. In the training phase, we develop a dual coordinate descent solver, which avoids interactions before obtaining the encrypted coefficient vector. In the labeling phase, we design a fresh convex decomposition cluster labeling, by which no iteration is required by convex decomposition and no sampling checks exist in connectivity analysis. Afterward, we customize secure protocols to match these operations for the essential interactions in the encrypted domain. Considering the privacy-preserving property and efficiency in a semi-honest environment, we prove MPPSVC's robustness against adversarial attacks. Our experimental results confirm that MPPSVC achieves accuracies comparable to those of RSVC-EO, which outperforms the state-of-the-art variants of SVC.

1. Introduction

Clustering forms natural groupings of data samples that maximize intra-cluster similarity and minimize inter-cluster similarity. Inspired by support vector machines (SVMs), support vector clustering (SVC) has attracted many studies for its remarkable ability to handle clusters of arbitrary shape [1,2,3,4]. Various application areas are closely related to it, e.g., information retrieval and analysis, signal processing, and traffic behavior identification [2,4,5,6]. However, building a good clustering model requires a large number of valid training samples and substantial iterative analysis under specific metrics; hence, it is frequently hard for individuals or small organizations to build their models on quickly accumulating data. The pricey storage of the kernel matrix and the costly computations of the dual problem solver and of a large number of sampling checks pose major barriers to its application [4,7,8]. To tackle this issue, a viable solution is to outsource the heavy workloads to the Cloud [9] with its sufficient computing resources. With suitable outsourcing techniques, the excess computational and storage requirements can be shifted from data owners to the Cloud.
By introducing Cloud computing, a logical server in the Cloud takes the place of the client (data owner) to finish the training phase, the labeling phase, or both. However, as a third party, the server's credibility cannot be fully guaranteed, while the undertaken tasks operate remotely on privately owned data samples. Furthermore, the server need not disclose any details or negotiate the parameters of the analysis, even if it offers a clustering service to the client. Hence, using a third party to build or utilize models on sensitive data is risky even if the service provider claims that it will not observe the analysis [10,11]. It is therefore crucial to develop a privacy-preserving SVC in which a server trains the model and labels data samples while guaranteeing the privacy of the input data, the coefficient vector, and the labeling results.
In this paper, we propose, to the best of our knowledge, the first known client–server privacy-preserving SVC architecture, namely maximized privacy-preserving outsourcing on SVC (MPPSVC), to maximize the benefits of the emerging Cloud computing technology in cluster analysis. As shown in Figure 1, the client first sends a protected diagonal matrix to the server; then, the server employs a reformative SVC framework to run the dual problem solver in the encrypted domain (ED) and sends the encrypted coefficient vector back. To let the client know the cluster prototypes and the labeling results, we first cut off the traditional iterative analysis for efficiency and securely exploit the homomorphic encryption properties. Then, we design lightweight secure protocols that require a limited number of interactions. We assume that both the client and the server execute the protocol correctly to maintain their reputations; hence, they behave in a semi-honest manner, i.e., they are honest but curious, and privacy remains a real issue. The main contributions of this work are as follows:
(1)
A reformative SVC with elementary operations (RSVC-EO) is proposed with an updated dual coordinate descent (DCD) solver for the dual problem. In the training phase, it prefers a linear method to approach the objective without losing validity. Furthermore, it easily trades time for space by calculating kernel values on demand.
(2)
A fresh convex decomposition clustering labeling (FCDCL) method is presented, by which no iterations are required for convex decomposition. In connectivity analysis, the traditional segmer sampling checks in feature space are also avoided. Consequently, in the outsourcing scenario, a single interaction suffices to complete the calculations in each step of prototype finding and connectivity analysis. Without a mass of essential iterations, the labeling phase no longer hinders outsourcing.
(3)
Toward privacy preservation, MPPSVC is proposed by introducing homomorphic encryption, two-party computation, and linear transformation to reconstruct RSVC-EO. Although these protocols limit the types of calculation, MPPSVC works well with maximized outsourcing capability. In the field of SVC, to the best of our knowledge, it is the first known client–server privacy-preserving clustering framework that solves the dual problem without interaction. Furthermore, it can securely outsource the cluster number analysis and cluster assignment on demand.
The remainder of the paper is organized as follows. Section 2 briefly describes the classic SVC and homomorphic encryption, and discusses the privacy violations and limited operations of SVC outsourcing. In Section 3, we first reformulate the dual problem and propose the DCD solver for it. Meanwhile, we present FCDCL, which completes convex decomposition without iterative analysis and finishes connectivity analysis irrespective of sampling. By integrating these designs, we give an implementation of RSVC-EO in the plain domain (PD). Section 4 proposes MPPSVC for the client–server environment in ED and presents the core ideas of securely outsourcing the three crucial tasks of RSVC-EO. Section 5 gives performance and security analysis for these ED techniques. Section 6 reviews the related works. Finally, conclusions are drawn in Section 7, together with future work to be investigated.

2. Preliminaries

2.1. Support Vector Clustering

2.1.1. Estimation of a Trained Support Function

Assume that a dataset $\mathcal{X}$ has $N$ data samples $\{x_1, \ldots, x_N\}$, where $x_i \in \mathbb{R}^d$ with $i \in [1, N]$. Through a nonlinear function $\Phi(\cdot)$, SVC maps data samples from data space to a high-dimensional feature space and finds a sphere with the minimal radius that contains most of the mapped data samples. This sphere, when mapped back to the data space, can be partitioned into several components, each enclosing an isolated cluster of samples. In mathematical formulation, the spherical radius $R$ is subject to

$$\min_{R, \alpha, \xi_i} \; R^2 + C \sum_i \xi_i \quad \mathrm{s.t.} \quad \|\Phi(x_i) - \alpha\|^2 \le R^2 + \xi_i, \tag{1}$$

where $\alpha$ is the sphere center, $\xi_i$ is a slack variable, and $C$ is a constant controlling the penalty of noise. Following Jung et al. [12], the sphere can be simply estimated by a support function that is defined as a positive scalar function $f: \mathbb{R}^n \to \mathbb{R}^+$. After solving the dual problem in Equation (2), we estimate the support function by the support vectors (SVs), whose corresponding coefficients satisfy $\beta_i \in (0, C)$ for $i = 1, \ldots, N$.

$$\min_{\beta} \; \sum_{i,j} \beta_i \beta_j K(x_i, x_j) - \sum_j K(x_j, x_j) \beta_j \quad \mathrm{s.t.} \quad \sum_j \beta_j = 1, \; 0 \le \beta_j \le C, \; j = 1, \ldots, N \tag{2}$$

By optimizing Equation (2) with the Gaussian kernel $K(x_i, x_j) = e^{-q\|x_i - x_j\|^2}$, the objective trained support function is formulated by

$$f(x) = 1 - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j). \tag{3}$$

Theoretically, the radius $R$ of the hypersphere is usually defined by the square root of $f(x_i)$, where $x_i$ is one of the SVs.

2.1.2. Cluster Assignments

Since SVs locate on the borders of clusters, a simple graphical connected-component method can be used for cluster labeling. For any two samples $x_i$ and $x_j$, we check $m$ segmers on the line segment connecting them by traveling their images in the hypersphere. According to Equation (3), $x_i$ and $x_j$ should be assigned the same cluster index if all $m$ segmers lie in the hypersphere, i.e., $f(x_{\tilde{m}}) \le R^2$ for $\tilde{m} \in [1, m]$. Otherwise, they are in two different clusters.

2.2. Homomorphic Encryption

Multiplication is critical for SVC. Although fully homomorphic encryption protects multiplicative terms well, the existing schemes are far from practical. Since multiplication can be replaced by addition, we prefer the Paillier cryptosystem [13]—an additively homomorphic public-key encryption scheme—to secure clustering analysis and data exchange. It was also adopted by the authors of [14,15,16,17,18] for simplicity and generality. Based on the decisional composite residuosity problem, the Paillier cryptosystem is provably semantically secure. This means that, for some composite $n$ and an integer $z$, we cannot decide whether there exists some $y \in \mathbb{Z}_{n^2}^*$ such that $z = y^n \bmod n^2$. A brief description of the Paillier cryptosystem is given as follows.
Let $n = pq$, where $p$ and $q$ are two large primes, let $r$ be randomly selected in $\mathbb{Z}_n^*$, and let $g$ be randomly selected in $\mathbb{Z}_{n^2}^*$ such that $\gcd(L(g^\lambda \bmod n^2), n) = 1$, with $L(u) = \frac{u-1}{n}$ and $\lambda = \mathrm{lcm}(p-1, q-1)$. We assume that a plaintext $m \in \mathbb{Z}_n$ is the numeric form of a feature value in $x_i$ ($i = 1, \ldots, N$); its ciphertext is denoted by $[[m]]$, i.e., $[[m]] \in \mathbb{Z}_{n^2}^*$. The Paillier cryptosystem's encryption, decryption, and additively homomorphic operations are as follows:
(1) Encryption: ciphertext $[[m]] = g^m \cdot r^n \bmod n^2$.
(2) Decryption: plaintext $m = L([[m]]^\lambda \bmod n^2) / L(g^\lambda \bmod n^2) \bmod n$.
(3) Additive homomorphism: $[[m_1 + m_2]] = [[m_1]] \cdot [[m_2]]$ and $[[\alpha \cdot m_1]] = [[m_1]]^\alpha$. Here, $m_1$ and $m_2$ are two numeric features in PD and $\alpha$ is an integer constant. In practical situations, if $\alpha < 0$, its equivalence-class value $\alpha \bmod n$ can be a substitution.
Before outsourcing data, in this study, the client should distribute the public key (i.e., $(n, g)$) of the predefined Paillier cryptosystem to the server while keeping the private key (i.e., $(p, q, r)$ or $(\lambda, r)$) secret. Then, the server performs a series of clustering tasks on ciphertexts by exploiting the homomorphic properties. Only the client can decrypt all the encrypted messages that encapsulate clustering results by using the corresponding private key.
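To make the three operations above concrete, the following minimal Python sketch instantiates a toy Paillier cryptosystem (tiny primes for illustration only; a deployment, like ours in Section 4.4.2, uses a 2048-bit $n$) and checks both homomorphic identities. The code is our illustration, not part of the paper's implementation.

```python
from math import gcd
from random import randrange

def lcm(a, b):
    return a * b // gcd(a, b)

def L(u, n):
    return (u - 1) // n

# Toy parameters: real deployments use a >= 2048-bit n.
p, q = 61, 53
n = p * q
n2 = n * n
lam = lcm(p - 1, q - 1)
g = n + 1                       # a standard valid choice of g

def encrypt(m):
    r = randrange(1, n)
    while gcd(r, n) != 1:       # r must be in Z_n^*
        r = randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    mu = pow(L(pow(g, lam, n2), n), -1, n)   # modular inverse (Python >= 3.8)
    return (L(pow(c, lam, n2), n) * mu) % n

m1, m2, alpha = 15, 27, 3
c1, c2 = encrypt(m1), encrypt(m2)
assert decrypt((c1 * c2) % n2) == (m1 + m2) % n         # [[m1 + m2]] = [[m1]] * [[m2]]
assert decrypt(pow(c1, alpha, n2)) == (alpha * m1) % n  # [[a * m1]] = [[m1]]^a
```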

2.3. Privacy Violation and Limited Operations of SVC Outsourcing

2.3.1. Privacy Violation of SVC

Undoubtedly, to conduct the essential computations of SVC correctly in the Cloud, we have to face the possibility of privacy leakage in the training phase, the labeling phase, or both. Otherwise, outsourcing is infeasible.
For the training phase, the core work is to solve the dual problem in Equation (2). After the work is transferred to the Cloud, in PD, either the data samples or the kernel matrix is undoubtedly retrievable by the server. Under a privacy-preserving policy, this situation raises several critical problems. (1) Letting the server know the data samples as plaintext is unacceptable. (2) With only separately encrypted data samples, the server cannot generate the required kernel matrix by itself, since calculating the Gaussian function needs help from the client, as discussed by Rahulamathavan et al. [17]. However, more than $N^2$ interactions between the server and the client are required before the server constructs an $N \times N$ kernel matrix. Besides, the widely used SMO solver for the dual problem needs approximately $O(N^2)$ kernel evaluations to construct the support function [8]. Taking the number of iterations into account, far more than $N^2$ interactions are essential for the SMO solver. Therefore, it is inadvisable to encrypt data samples separately. (3) To make the solver iterate effectively, the intermediate results of the coefficient vector should be in PD. Unfortunately, this would leak the final indices of the SVs, with their importance, to the server. (4) Although no plain data samples appear, the constructed kernel matrix is not suitable to be sent to the server directly: if the server knows any data sample, it can recover all the data samples, since the kernel function is exactly a type of similarity measure [19].
For the labeling phase, labeling the remaining data samples, usually based on Euclidean distances in data space, is not time-consuming if the cluster prototypes have been obtained. Samanthula et al. [20] presented a good idea for it, even in an outsourced environment. However, in the most recent studies, prototype finding and sampling calculation for connectivity analysis are generally based on iterative analysis, which raises frequent interactions with the client for help, because exponentiation is the fundamental operation and the SVs are components of the exponential support function; otherwise, privacy cannot be guaranteed. It is worth mentioning that Lin and Chen [21] presented a method of releasing the trained SVM classifier without violating the confidentiality of the classification parameters, e.g., the SVs. Although replacing the trained SVM classifier by the support function appears to be a good solution, unfortunately, we still have to send the plain data samples for decisions made by the outsourced service in the server. Undoubtedly, this results in a privacy violation.

2.3.2. Limited Operations of Homomorphic Encryption

As described in Section 2.2, additively homomorphic encryption is characterized by its additive homomorphism. However, this fundamental operation only expedites the addition of two ciphertexts and the multiplication of a plaintext by a ciphertext. Therefore, without outside assistance, it still cannot complete multiplication/division of two ciphertexts or the evaluation of complex functions, e.g., the Gaussian function. Unfortunately, multiplication/division operations are frequently invoked by both the SMO solver and the iterative analysis for cluster prototypes. Furthermore, calculating the values of the Gaussian function for different sample pairs is a fundamental operation in the SMO solver, in the iterative analysis for cluster prototypes, and in the sampling for connectivity analysis. Since the client is the only legal assistance provider outside, this would dramatically raise the number of interactions, leading to a heavy load on network bandwidth and degrading the usability of SVC outsourcing. Therefore, it is critical to explore practical ways of avoiding too many complex computations and massive interactions.

3. Reformative Support Vector Clustering with Elementary Operations

Traditionally, iterative analysis and complex operations cause massive interactions that reduce the usability of outsourcing. To balance privacy, efficiency, and accuracy, we design the DCD solver by reducing unnecessary operations and replacing complex functions with elementary ones, and we design FCDCL to avoid iterative analysis in both convex decomposition and connectivity analysis.

3.1. DCD Solver for SVC’s Dual Problem

Derived from [22], we reformulate the nonlinear dual problem in Equation (2) by a linear model. Let $Q$ be the original kernel matrix with elements $Q_{ij} = K(x_i, x_j)$ ($i, j \in [1, N]$), $\tilde{Q} = 2 \times Q$, $\beta = [\beta_1, \beta_2, \ldots, \beta_N]^T$, and $e = [1, 1, \ldots, 1]^T$. Since $K(x_i, x_j) = 1$ for $x_i = x_j$, we have $\sum_j K(x_j, x_j)\beta_j = 1$. Then, the dual problem in Equation (2) can be reformulated as

$$\min_{\beta} \; \frac{1}{2}\beta^T \tilde{Q} \beta - 1 \quad \mathrm{s.t.} \quad \sum_j \beta_j = 1, \; 0 \le \beta_j \le C, \; j = 1, \ldots, N. \tag{4}$$

Let $\mathcal{D}_{svc}$ denote $\{\beta \mid \sum_j \beta_j = 1, 0 \le \beta_j \le C, j = 1, \ldots, N\}$. Obviously, we can relax $\mathcal{D}_{svc}$ to $\mathcal{D}'_{svc} = \{\beta \mid 0 \le \beta_j \le C, j = 1, \ldots, N\}$, in which an equivalent globally optimal solution can be achieved. We thus get an equivalent form of the problem in Equation (2) as

$$\min_{\beta} \; \frac{1}{2}\beta^T \tilde{Q} \beta \quad \mathrm{s.t.} \quad 0 \le \beta_j \le C, \; j = 1, \ldots, N. \tag{5}$$

Without any prior knowledge of labels, we fix the label $y_i$ of $x_i$ to $+1$ or $-1$ for $i = 1, \ldots, N$. To solve the problem in Equation (5), the optimization process starts from an initial point $\beta^0 \in \mathbb{R}^N$ and generates a sequence of vectors $\{\beta^k\}_{k=1}^{\infty}$. We refer to the process from $\beta^k$ to $\beta^{k+1}$ as an outer iteration. In each outer iteration, we have $N$ inner iterations, so that $\beta_1, \beta_2, \ldots, \beta_N$ are sequentially updated. Each outer iteration thus generates vectors $\beta^{k,i} \in \mathbb{R}^N$, $i = 1, 2, \ldots, N+1$, such that $\beta^{k,1} = \beta^k$, $\beta^{k,N+1} = \beta^{k+1}$, and $\beta^{k,i} = [\beta_1^{k+1}, \ldots, \beta_{i-1}^{k+1}, \beta_i^k, \ldots, \beta_N^k]^T$ for $i = 2, \ldots, N$. To update $\beta^{k,i}$ to $\beta^{k,i+1}$, we fix the other variables and solve the following one-variable sub-problem:

$$\min_{d} \; f(\beta^{k,i} + d e_i) \quad \mathrm{s.t.} \quad 0 \le \beta_i^k + d \le C, \tag{6}$$

where $e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$. Then, the objective function of Equation (6) is a simple quadratic function of $d$:

$$f(\beta^{k,i} + d e_i) = \frac{1}{2}\tilde{Q}_{ii} d^2 + \nabla_i f(\beta^{k,i}) d + \mathrm{constant}, \tag{7}$$

where $\nabla_i f$ is the $i$th component of the gradient $\nabla f$. Therefore, based on a decision function of the form $y = w^T \Phi(x) + b$ for SVC, we can solve the problem in Equation (7) by introducing Equations (8) and (9):

$$\tilde{Q}_{ii} = 2 \times K(x_i, x_i) = 2, \tag{8}$$

$$\nabla_i f(\beta) = (\tilde{Q}\beta)_i = \sum_{j=1}^{N} \tilde{Q}_{ij}\beta_j = w^T \Phi(x_i). \tag{9}$$

Along with the update of $\beta_i$, we can maintain $w$ by

$$w \leftarrow w + (\beta_i - \hat{\beta}_i)\Phi(x_i), \tag{10}$$

where $\hat{\beta}_i$ is the temporary coefficient value obtained in the previous iteration. Thus, we have

$$\nabla_i f(\beta^{k,i}) \approx \underbrace{\sum_j \hat{\beta}_j K(x_j, x_i)}_{(k-1)\text{th iter.}} + (\beta_i - \hat{\beta}_i). \tag{11}$$
Therefore, similar to the DCD method in [22], we propose a DCD solver, detailed in Algorithm 1, for solving the problem in Equation (2). Briefly, we use Equation (11) to compute $\nabla_i f(\beta^{k,i})$, check the optimality of the single-variable optimization in Equation (6) by testing whether $|\nabla_i^P f(\beta^{k,i})| > 10^{-12}$, and update $\beta_i$. The cost per iteration from $\beta^k$ to $\beta^{k+1}$ is $O(Nd)$, and an appropriate $\epsilon$ controls the iteration number well for efficiency. Notice that the memory requirement is flexible. With sufficient memory, we can afford $O(N^2)$ for the full kernel matrix and search it on demand to finish the calculation on Line 5 of Algorithm 1; otherwise, we can either store $\mathcal{X}$ or split it into blocks (reducing the requirement to $O(N)$) and then calculate the required kernel values sequentially for later use in the outer iterations.
Algorithm 1 DCD solver for SVC's dual problem in Equation (2).
Require: Dataset $\mathcal{X}$, kernel width $q$, and penalty $C$
Ensure: Coefficient vector $\beta$
1. Randomly initialize the coefficient vector $\beta$
2. while true do
3.   $Max \leftarrow -\infty$, $Min \leftarrow \infty$
4.   for $i = 1, 2, \ldots, N$ do
5.     $\hat{G} = 2 \times \sum_{j=1}^{N} \beta_j K(x_j, x_i)$
6.     $G \leftarrow \hat{G} + (\beta_i - \hat{\beta}_i)$
7.     $PG = \begin{cases} G & \text{if } 0 < \beta_i < C \\ \min(G, 0) & \text{if } \beta_i = 0 \\ \max(G, 0) & \text{if } \beta_i = C \end{cases}$
8.     $Max \leftarrow \max(Max, |PG|)$, $Min \leftarrow \min(Min, |PG|)$
9.     if $|PG| > 10^{-12}$ then
10.      $\hat{\beta}_i \leftarrow \beta_i$
11.      $\beta_i \leftarrow \min(\max(\beta_i - \frac{1}{2}G, 0), C)$
12.    end if
13.  end for
14.  if $(Max - Min) \le \epsilon$ then
15.    break
16.  end if
17. end while
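For readers who prefer runnable code, the following NumPy sketch mirrors Algorithm 1 under the memory-saving option, computing each kernel row on demand. It is our illustrative port, not the authors' MATLAB implementation; the function name and default arguments are assumptions.

```python
import numpy as np

def dcd_solver(X, q, C, eps=1e-3, max_outer=50):
    """Sketch of Algorithm 1: solves min 0.5 * beta^T (2K) beta, 0 <= beta_j <= C."""
    N = X.shape[0]
    beta = np.full(N, 1.0 / N)          # any point in [0, C]^N works
    beta_hat = beta.copy()
    for _ in range(max_outer):
        pg_max, pg_min = -np.inf, np.inf
        for i in range(N):
            # Line 5: Gaussian kernel row computed on demand (O(N) memory)
            k_i = np.exp(-q * np.sum((X - X[i]) ** 2, axis=1))
            G = 2.0 * np.dot(beta, k_i) + (beta[i] - beta_hat[i])   # Line 6
            # Line 7: projected gradient respecting the box constraint
            if beta[i] <= 0.0:
                PG = min(G, 0.0)
            elif beta[i] >= C:
                PG = max(G, 0.0)
            else:
                PG = G
            pg_max, pg_min = max(pg_max, abs(PG)), min(pg_min, abs(PG))
            if abs(PG) > 1e-12:          # Line 9: skip near-optimal coordinates
                beta_hat[i] = beta[i]
                beta[i] = min(max(beta[i] - 0.5 * G, 0.0), C)   # Q~_ii = 2
        if pg_max - pg_min <= eps:       # Line 14: stopping condition
            break
    return beta
```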

3.2. Convex Decomposition without Iterative Analysis

Derived from [22], in the labeling phase, we take the convex hull as the cluster prototype for both efficiency and accuracy. However, it usually requires $\zeta\,(\gg 1)$ iterations to reach a stable equilibrium vector (SEV) from an SV. If we let the server do this, the client has to evaluate the exponential function by itself. Combined with the uncertain number of SVs $N_{SV}$, more than $\zeta N_{SV}$ rounds of interactions bring a heavy burden to the network bandwidth. Therefore, we prefer achieving the same objective without iterative analysis.
Theorem 1.
If a cluster is decomposed into multiple convex hulls, the best division positions should be those SVs whose locations link up two nearest neighboring convexity and concavity.
Proof. 
From the convex decomposition strategy [23], theoretically, the decomposed convex hulls are constructed by SVs without exception. Taking Figure 2a as an example, $S_1$ and $S_2$, respectively, denote the two convex hulls in a cluster and the corresponding SEVs. Obviously, the cluster boundary is stroked by the SVs $\{x_{11}, \ldots, x_{17}; x_{21}, \ldots, x_{25}\}$, and $\{x_{11}, x_{17}\}$ and $\{x_{21}, x_{25}\}$ constitute the division position (termed the division set) for linking up two nearest neighboring convexity and concavity. To confirm this, we consider two situations:
(1)
If the division position is not in the SVs subset in which each one links up two nearest neighboring convexity and concavity (for instance, $x_{21}$ is replaced by $x_{22}$), then the corresponding convex hull $S_2$ has to be further split into two convex hulls enclosed by $\{x_{21}, x_{22}, x_{25}\}$ and $\{x_{22}', x_{23}, x_{24}\}$, respectively, even though all these SVs converge to $S_2$. Here, $x_{22}'$ is an imaginary sample that is extremely similar to $x_{22}$. Because no overlapping region or intersection of convex hull vertices is allowed, unfortunately, this result conflicts with the definition that SVs in a convex hull converge to the same SEV.
(2)
From the other perspective, if $x_{21}$ is not in the division position, which then only includes $\{x_{11}, x_{17}, x_{25}\}$, we have to find a substitute $x_{21}'$ to construct the convex hull $S_2$. $x_{21}'$ might be located between $x_{11}$ and $x_{21}$ or between $x_{21}$ and $x_{22}$. In the former case, it means $x_{21}$ is not the closest SV to $x_{11}$, which violates the premise of $x_{21}$ linking up two nearest neighboring convex hulls. In the latter case, no matter whether $x_{21}'$ is a newly arrived SV or just $x_{22}$, concavity will be formed along $x_{12}x_{11}x_{21}'$ or $x_{12}x_{11}x_{22}$. However, this becomes true only when $x_{21}$ is no longer on the cluster boundary or $x_{11}$ converges to $S_2$. Unfortunately, this violates the hypothesis that $x_{21}$ is an SV and $x_{11}$ converges to $S_1$.
 □
Using SVs to construct a convex hull, massive iterations toward convergence increase the communication burden significantly. We first consider the characteristics of convex decomposition: (1) SVs locate on the cluster boundary; (2) the convergence directions of the SVs in a convex hull point to the same SEV inside; and (3) after connecting two nearest neighboring SVs, no concavity remains in a convex hull. The core idea behind our method is then quite simple and intuitive: each SV is not only a vertex of a convex hull but also an edge pattern of a cluster. In a convex hull, if the tangent plane crossing an SV is perpendicular to its convergence direction, the other SVs will sit on one side of that tangent plane. Therefore, to avoid forming concavity around the SV, the included angle between its convergence direction and the line segment connecting it to its nearest neighboring SV in the same convex hull must have a relatively small value.
As shown in Figure 2a, two tangent planes crossing $x_{15}$ and $x_{21}$ can be found, respectively. Blue arrows show their convergence directions. Apparently, $\angle S_1 x_{15} x_{16}$ and $\angle S_2 x_{21} x_{25}$ must be small enough to respectively prevent $x_{16}$ and $x_{22}$ from entering the other side of the corresponding tangent planes. Based on this consideration, we perform non-iterative convex decomposition in Algorithm 2. Obviously, calculating the convergence direction on Line 4 is the key.
With $\beta$ and the SVs obtained by Algorithm 1, we can use any optimization method to find a local minimizer of Equation (3). Different methods lead to different efficiency, i.e., different numbers of iterations. Fortunately, in this study, an approximate convergence direction is sufficient. For the sake of simplicity, the gradient of Equation (3) is preferred and calculated by

$$\nabla f(x) = -\sum_j \underbrace{4q\beta_j K(x_j, x)}_{\text{coefficient}} \cdot \underbrace{[x_j - x]}_{\text{vector}}. \tag{12}$$

Obviously, $4q\beta_j K(x_j, x)$ scales $[x_j - x]$ and contributes to the final convergence direction. The convergence direction $\vec{x}$ in the first step for $x$ is the negative gradient of $f(x)$, i.e.,

$$\vec{x} = -\gamma \nabla f(x), \tag{13}$$

where $\gamma$ is a constant factor. Now, we can calculate the cosine function on Line 5 of Algorithm 2 by

$$\cos(\angle \vec{x}_i x_i x_j) = \frac{\vec{x}_i^T (x_j - x_i)}{\|\vec{x}_i\| \cdot \|x_j - x_i\|}. \tag{14}$$
Algorithm 2 Non-iterative convex decomposition.
Require: SVs set $\mathcal{X}_S$, decomposition threshold $\eta_1$
Ensure: Convex hulls $S_{CH}$, set of triples $S_{Tri}$
1. Randomly select $x_i \in \mathcal{X}_S$, $S_k = \{x_i\}$, $\mathcal{X}_U = \emptyset$, $S_{Tri} = \emptyset$
2. while $\mathcal{X}_S \neq \mathcal{X}_U$ do
3.   $x_j \leftarrow$ find the nearest neighboring SV from $\mathcal{X}_S \setminus \mathcal{X}_U$
4.   $\vec{x}_i \leftarrow$ calculate the convergence direction of $x_i$
5.   if $\cos(\angle \vec{x}_i x_i x_j) \ge \eta_1$ then
6.     $S_k \leftarrow S_k \cup \{x_j\}$
7.   else
8.     $S_{k+1} \leftarrow \{x_j\}$
9.     $S_{Tri} \leftarrow S_{Tri} \cup \{(x_i, x_j, \vec{x}_i)\}$
10.  end if
11.  $\mathcal{X}_U \leftarrow \mathcal{X}_U \cup \{x_i\}$
12.  $x_i \leftarrow x_j$
13. end while
Notice that, on Line 9, we collect the point pairs that are nearest neighbors but belong to different convex hulls, together with the convergence direction, into $S_{Tri}$. This set is employed in the following connectivity analysis to directly fetch the nearest neighboring convex hulls.
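The following plain-domain sketch illustrates Algorithm 2, with the convergence direction of Equations (12) and (13) computed by a helper. The function names and the default eta1 = 0.85 (inside the range examined in Section 5.3.1) are our assumptions.

```python
import numpy as np

def convergence_direction(x, svs, beta, q, gamma=1.0):
    # Equations (12)-(13): negative gradient of the support function at x
    k = np.exp(-q * np.sum((svs - x) ** 2, axis=1))
    grad = -np.sum((4.0 * q * beta * k)[:, None] * (svs - x), axis=0)
    return -gamma * grad

def convex_decomposition(svs, beta, q, eta1=0.85):
    """Sketch of Algorithm 2: walk SVs nearest-first; open a new convex hull
    whenever the cosine in Equation (14) drops below eta1."""
    n = len(svs)
    unvisited = set(range(1, n))
    hulls, triples, hull = [], [], [0]
    i = 0
    while unvisited:
        # Line 3: nearest unvisited SV to the current one
        j = min(unvisited, key=lambda u: np.sum((svs[u] - svs[i]) ** 2))
        d_i = convergence_direction(svs[i], svs, beta, q)        # Line 4
        seg = svs[j] - svs[i]
        cos_a = d_i @ seg / (np.linalg.norm(d_i) * np.linalg.norm(seg) + 1e-12)
        if cos_a >= eta1:                # Line 5: same convex hull
            hull.append(j)
        else:                            # Lines 7-9: new hull plus triple
            hulls.append(hull)
            hull = [j]
            triples.append((i, j, d_i))
        unvisited.discard(j)
        i = j                            # Line 12
    hulls.append(hull)
    return hulls, triples
```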

3.3. Connectivity Analysis Irrespective of Sampling

Based on the decomposed convex hulls, we try to check the connectivity of two nearest neighbors sequentially.
Lemma 1.
For two nearest neighboring convex hulls whose distance is determined by the two closest vertices, one from each, there are two apparent properties. (1) We can always find a tangent plane crossing one of the two vertices that makes the two convex hulls locate on different sides of the tangent plane. (2) Taking one of the two vertices as the origin, we draw an included angle between the rays from the origin to its own SEV and to the other convex hull's SEV, respectively. The smaller the included angle is, the higher the possibility that the two convex hulls are connected.
Proof. 
Taking Figure 2b as an example, $x_{11}$ and $x_{21}$ are the nearest SV pair in a division set, coming from the two nearest convex hulls $S_1$ and $S_2$, respectively.
According to the definition, there is no overlapping region between any two convex hulls. Apparently, the first property is always true. As a typical prototype [24], geometrically, the SEV locates in the convex hull and reflects the relative location well. Thus, we have $\angle S_2 x_{21} S_1' < \angle S_2 x_{21} S_1$ if the convex hull $S_1$ is moved to $S_1'$. This movement uses $\overline{x_{21} x_{11}}$ as the radius to keep the distance unchanged. It generates a relative displacement of the SEV along the direction of the black arrow, and the moved one gets closer to the convex hull $S_2$. Considering a plane shaped by the line $S_1' x_{21} x_{22}$, the vertex $x_{21}$ will have a lower possibility of being a transition point connecting two nearest neighboring convexity and concavity. That means a higher probability of $S_1$ and $S_2$ being enclosed in one cluster as the distance reduces. On the contrary, increasing $\angle S_2 x_{21} S_1$ is actually achieved by moving the convex hull $S_1$ away from the side of $S_2$ with respect to the tangent plane of $x_{21}$. Then, on the plane shaped by the line $x_{11} x_{21} x_{22}$, $x_{21}$ must be included in the division set again. Thus, the second property is true. □
Definition 1
(Merging Factor). Let $x_{11}$ and $x_{21}$ be the two nearest vertices, respectively, from two nearest neighboring convex hulls $S_1$ and $S_2$. In the connectivity analysis of $S_1$ and $S_2$, the merging factor is defined as the cosine of the included angle between the convergence direction at one vertex and its ray direction to the other convex hull's SEV, i.e., $\cos \angle \vec{x}_{21} x_{21} S_1$ or $\cos \angle \vec{x}_{11} x_{11} S_2$.
However, Algorithm 2 does not give us the exact SEVs. Based on the local geometrical property discussed by Ping et al. [25], we define a density centroid as the substitution.
Definition 2
(Density Centroid). The density centroid $S_{DC}^i$ of the $i$th convex hull $S_{CH}^i$ is defined as $\frac{1}{N_{CH}^i} \sum_{j=1}^{N_{CH}^i} x_{ij}$ for $x_{ij} \in S_{CH}^i$, where $N_{CH}^i$ is the number of SVs in $S_{CH}^i$.
On the basis of Lemma 1, the presented connectivity analysis strategy is quite simple: (1) without prior knowledge of the cluster number, we merge two nearest neighboring convex hulls if their minimal merging factor is greater than a predefined threshold $\eta_2$; (2) otherwise, to control the number of clusters toward a specific $K$ globally, pairs of nearest neighboring convex hulls are merged into one cluster as they move up in the hierarchy. A plain-domain sketch of strategy (1) is given below.
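In this sketch, each nearest-neighbor hull pair whose merging factor (Definitions 1 and 2) exceeds $\eta_2$ is merged, with a small union-find tracking the resulting clusters; the helper names are hypothetical.

```python
import numpy as np

def merging_factor(direction, vertex, centroid):
    # Cosine between the convergence direction at `vertex` and the ray
    # from `vertex` to the neighboring hull's density centroid.
    ray = centroid - vertex
    return float(direction @ ray /
                 (np.linalg.norm(direction) * np.linalg.norm(ray) + 1e-12))

def connectivity_by_threshold(n_hulls, candidate_pairs, eta2):
    """candidate_pairs: iterable of (k, l, factor) over nearest-neighbor hulls;
    merge whenever the (minimal) merging factor exceeds eta2."""
    parent = list(range(n_hulls))
    def find(a):                         # path-halving union-find
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for k, l, factor in candidate_pairs:
        if factor > eta2:
            parent[find(k)] = find(l)
    return [find(i) for i in range(n_hulls)]   # cluster label per convex hull
```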

3.4. Implementation of RSVC-EO

Algorithm 3 gives the complete solution of RSVC-EO. Given $q$ and $C$, CollectSVsbyDCDSolver(·) obtains $\beta$ by invoking Algorithm 1 and collects the SVs $\mathcal{X}_S$. Then, ConvexDecomposedbySVs(·) constructs the convex hulls $S_{CH}$ as the cluster prototypes by employing the non-iterative analysis strategy presented in Section 3.2. Using the prior knowledge of $\eta_2$ or $K$, ConnAnalysisbyNoSamp(·) checks the connectivity of two nearest neighboring convex hulls and merges them step by step on demand. It results in an array of size $N_{SV}$ that contains the cluster labels for the SVs. Notice that the adjacency matrix frequently used by the traditional methods is no longer required because ConnAnalysisbyNoSamp(·) adopts a bottom-up merging strategy. Finally, the remaining data samples are separately assigned the labels of their nearest convex hulls.
Algorithm 3 RSVC-EO.
Require: Dataset $\mathcal{X}$, kernel width $q$, penalty $C$, and thresholds $\eta_1$, $\eta_2$ or the cluster number $K$
Ensure: Clustering labels for all the data samples
1. $\{\mathcal{X}_S, \beta\} \leftarrow$ CollectSVsbyDCDSolver($\mathcal{X}$, $q$, $C$)
2. $S_{CH} \leftarrow$ ConvexDecomposedbySVs($\mathcal{X}_S$, $\eta_1$)
3. Labels $\leftarrow$ ConnAnalysisbyNoSamp($S_{CH}$, $\eta_2$ or $K$)
4. for each $x \in \mathcal{X} \setminus \mathcal{X}_S$ do
5.   inx $\leftarrow$ find the nearest convex hull from $x$
6.   Labels[$x$] $\leftarrow$ Labels[$x_{inx}$]
7. end for
8. return Labels

3.5. Time Complexity of RSVC-EO

Before introducing a privacy-preserving mechanism, RSVC-EO works locally in PD. Hence, we measure only the computational complexity in this section. Let $N$ be the number of data samples, $N_{SV}$ the number of SVs, and $N_{CH}$ the number of decomposed convex hulls. In the training phase, whether to use the pre-computed kernel matrix for efficiency at the cost of storage or to calculate the corresponding row of kernel values on demand depends on the actual memory capacity. With sufficient memory, i.e., a space complexity of $O(N^2)$ to store the kernel matrix, the computational complexity is $O(N^2)$, and the innermost operation computes $\beta_j K(\cdot,\cdot)$. Otherwise, it is up to $O(dN^2)$, whose innermost operation calculates each kernel value of two $d$-dimensional samples. Although this seems time-consuming, it is much lower than the $O(N^3)$ required by the traditional methods, which frequently need $O(N^2)$ storage (see [4]). Further, the innermost operations in both situations are simple.
In the labeling phase, the time costs for constructing the convex hulls from SVs and completing the connectivity analysis depend on $N_{SV}$ and $N_{CH}$. By employing the proposed strategy, iterations to reach a local minimum from SVs are avoided, and the sampling rate is replaced by one comparison. Therefore, to finish the labeling phase, RSVC-EO consumes only $O(\ell N_{SV}^2 + \rho N_{CH})$, where $\ell \in \{1, d\}$ and $\rho \in (2, 3]$ are determined, respectively, by whether the kernel matrix of SVs is stored and by the input parameter $\eta_2$ or $K$.
Due to page limitations, we omit comparisons with the state-of-the-art methods, which can be found in [22,26].

4. Maximized Privacy-Preserving Outsourcing on SVC

Taking RSVC-EO as the core method, we develop MPPSVC to maximize the capability of selective service outsourcing.

4.1. Privacy-Preserving Primitives

For privacy preservation, some elementary operations in Algorithm 3 have to be done with the client's help, e.g., multiplication, distance measurement, and comparison. We present lightweight secure protocols for them in our proposed MPPSVC. All the protocols below are considered under a two-party semi-honest setting, where the server is a semi-honest party and the client is an honest data owner. Meanwhile, a potential adversary might monitor communications. In particular, we assume the Paillier private key $K_{pri}$ is known only to the client, whereas $K_{pub}$ is public.

4.1.1. Secure Multiplication

The secure multiplication protocol (SMP) considers that the server, with the client's help, calculates $[[m_1 \times m_2]]$ from the input $\{[[m_1]], [[m_2]]\}$. The values $m_1$ and $m_2$ are unknown to the server in the adversarial environment. In this study, we adopt the SMP described by Samanthula et al. [20].

4.1.2. Secure Comparison

Numeric comparison is critical for the server to finish the connectivity analysis and the labeling phase in ED. Notice that the intermediate comparison result is private with respect to the adversary (an intermediary in the network), but it is not a secret to the server due to the requirement of procedure control. Therefore, based on the secure comparison protocols (SCPs) of Rahulamathavan et al. [17] and Samanthula et al. [20], we customize an SCP in Algorithm 4, which needs only a one-round interaction.
Since $r$ is picked randomly, the ciphertext sent back to the client can be either $[[m_1 - m_2]]$ or $[[m_2 - m_1]]$ with equal probability. The client and the implicit adversary hence cannot guess the actual result with a probability greater than $1/2$. If the execution of Lines 4–9 is honest, $[[st^+]]$ and $[[st^-]]$ are, respectively, positive and negative integers protected by the server's public key. Furthermore, in the communication traffic, neither $[[m_1]]$ nor $[[m_2]]$ can be found by a brute-force attack, and one cannot find them out with knowledge of their difference.

4.1.3. Secure Vector Distance Measurement

Euclidean distance measurement between any two vectors is essential for extracting the convergence direction in Equation (12), computing the cosine function in Equation (14), and labeling the remaining data samples (Line 6 in Algorithm 3). For simplicity, in this study, the secure vector distance measurement protocol (SVDMP) adopts the secure squared Euclidean distance of Samanthula et al. [20], which employs SMP for each dimension. It can also be done in batches if necessary.
Algorithm 4 Secure comparison protocol.
Require: Server: $[[m_1]]$, $[[m_2]]$, $K_{pub}$; Client: $K_{pri}$
Ensure: Server outputs $[[\max(m_1, m_2)]]$ or $[[\min(m_1, m_2)]]$
  • Server:
1. Pick $r \in \{1, -1\}$ randomly
2. $[[d]] \leftarrow [[m_1]] \times [[m_2]]^{n-1}$, $[[d]] \leftarrow [[d]]^r \bmod n^2$
3. $\{[[d]]\} \rightarrow$ Client
  • Client:
4. $t \leftarrow$ decrypt $[[d]]$ with $K_{pri}$
5. if $t > 0$ then
6.   $\{[[st^+]]\} \rightarrow$ Server
7. else
8.   $\{[[st^-]]\} \rightarrow$ Server
9. end if
  • Server:
10. if ($st^+$ and $r \neq 1$) or ($st^-$ and $r = 1$) then
11.   $[[\max(m_1, m_2)]] \leftarrow [[m_2]]$, $[[\min(m_1, m_2)]] \leftarrow [[m_1]]$
12. else
13.   $[[\max(m_1, m_2)]] \leftarrow [[m_1]]$, $[[\min(m_1, m_2)]] \leftarrow [[m_2]]$
14. end if
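The sketch below walks through Algorithm 4 in plain Python, reusing encrypt(), decrypt(), n, and n2 from the toy Paillier sketch in Section 2.2. Messages are passed as ordinary values rather than over a network, and the signed-difference mapping on the client is our assumption for the toy parameters.

```python
from random import choice

def scp_server_round1(c1, c2):
    r = choice([1, -1])                          # Line 1
    d = (c1 * pow(c2, n - 1, n2)) % n2           # Line 2: [[m1 - m2]]
    d = pow(d, r % n, n2)                        # [[r * (m1 - m2)]]
    return r, d

def scp_client(d):
    t = decrypt(d)                               # Line 4
    t = t - n if t > n // 2 else t               # map Z_n to a signed value
    return 'st+' if t > 0 else 'st-'             # Lines 5-9: one-bit reply

def scp_server_round2(r, signal, c1, c2):
    # Line 10: undo the random flip to recover the true ordering
    if (signal == 'st+' and r != 1) or (signal == 'st-' and r == 1):
        return c2, c1                            # [[max]], [[min]]
    return c1, c2

c1, c2 = encrypt(20), encrypt(33)
r, d = scp_server_round1(c1, c2)
cmax, cmin = scp_server_round2(r, scp_client(d), c1, c2)
assert decrypt(cmax) == 33 and decrypt(cmin) == 20
```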

4.1.4. Secure 1-NN Query

In ED, steps such as Lines 2 and 6 of Algorithm 3 should be done with the help of the secure 1-NN query (S1NNQ). Given a data sample $[[x]]$, the server finds the nearest neighbor in the dataset $[[\mathcal{X}]]$ with $|\mathcal{X}|$ samples, as described by Algorithm 5.
Algorithm 5 Secure 1-NN query protocol.
Require: Encrypted data $[[x]]$ and the encrypted SVs set $[[\mathcal{X}]]$
Ensure: The nearest neighbor $[[x_{1NN}]]$ in $[[\mathcal{X}]]$
1. for $i \leftarrow 1, |\mathcal{X}|$ do
2.   $[[d_i]] \leftarrow$ SVDMP($[[x]]$, $[[x_i]]$)
3.   $[[d_{max}]] \leftarrow [[\min(d_i, d_{max})]]$ using SCP
4.   if $[[d_i]] \equiv [[d_{max}]]$ then
5.     $[[x_{1NN}]] \leftarrow [[x_i]]$
6.   end if
7. end for

4.2. The Proposed MPPSVC Model

In this section, we first reformulate the crucial phases based on the privacy-preserving primitives. Then, we present the flow diagram for MPPSVC.

4.2.1. Preventing Data Recovery from Kernel Matrix

In this study, we adopt the Paillier cryptosystem for both efficiency and security. Unfortunately, on Line 5 of Algorithm 1, $\sum_{j=1}^{N} \beta_j K(x_i, x_j)$ brings chained multiplications in ED. If we use SMP, the massive interactions are fatal due to the privacy concerns of $[[\beta_j]]$ and $[[K(x_i, x_j)]]$. One may consider accepting $K(x_i, x_j)$ in PD, i.e., the client only sends the values of $K(x_i, x_j)$ for $i, j \in [1, N]$. However, these values carry the similarities between all pairs of data samples. The server/adversary can easily recover the whole dataset if it luckily collects any two data samples in plaintext [19].
Since the kernel matrix is sufficient for Equation (5), we design a transformation strategy to hide the actual similarities in $K(\cdot,\cdot)$ while protecting $\beta$. Let $M$ be an $N \times N$ orthogonal matrix satisfying $M^{-1} = M^T$, which is kept secret by the client. Let $\bar{Q} = M^T \tilde{Q} M$; the problem in Equation (5) thus becomes

$$\min_{\bar{\beta}} \; \frac{1}{2}\bar{\beta}^T \underbrace{M^T \tilde{Q} M}_{\bar{Q}} \bar{\beta} \quad \mathrm{s.t.} \quad 0 \le \bar{\beta}_j \le C, \; j = 1, \ldots, N, \tag{15}$$

which is equivalent to

$$\min_{\bar{\beta}} \; \frac{1}{2}(M\bar{\beta})^T \tilde{Q} (M\bar{\beta}) \quad \mathrm{s.t.} \quad 0 \le M\bar{\beta} \le eC. \tag{16}$$
Let $\beta = M\bar{\beta}$; the problem in Equation (16) then has almost the same formulation as Equation (5). Thus, we can state a fact about Algorithm 1: if the client sends a transformed $\bar{Q}$ in place of $\tilde{Q}$, we obtain a corresponding $\bar{\beta}$ that differs from $\beta$, i.e., $\bar{\beta} \neq \beta$. In plaintext, the client can obtain the expected $\beta$ by multiplying by $M$, owing to $\beta = M\bar{\beta}$. Furthermore, the server cannot recover $\tilde{Q}$ since $M$ is held only by the client.
For further computation and communication savings, we suggest decomposing $\tilde{Q}$ to generate $M$ through $\tilde{Q} = M \Sigma_{eig} M^T$, where $\Sigma_{eig}$ holds the eigenvalues of $\tilde{Q}$ and $M$ is the orthogonal eigenvector matrix satisfying $M^{-1} = M^T$. We then get a diagonal matrix $\bar{Q} = M^T \tilde{Q} M$ with $N$ non-zero elements. If $N$ is too large, outsourcing either the QR decomposition [27] or the eigendecomposition [28] is recommended. Theoretically, an adversary cannot recover $\tilde{Q}$ from the diagonal matrix $\bar{Q}$.
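A minimal NumPy sketch of this masking strategy follows, with a Gaussian kernel assumed and the server's solver stubbed out. It shows that the client ships only the diagonal $\bar{Q}$ and later recovers $\beta = M\bar{\beta}$ locally; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))          # client's private data
q = 0.5
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
Q_tilde = 2.0 * np.exp(-q * sq)           # Q~ = 2K, symmetric positive definite

# Client: eigendecompose Q~ = M diag(w) M^T; keep M secret, ship only w.
w, M = np.linalg.eigh(Q_tilde)            # M is orthogonal: M^-1 = M^T
Q_bar = np.diag(w)                        # diagonal matrix sent to the server
assert np.allclose(M.T @ Q_tilde @ M, Q_bar, atol=1e-8)

# Server: runs the DCD solver on the masked problem (stubbed here).
beta_bar = rng.random(50)                 # stand-in for the solver's output

# Client: recovers the true coefficients with the secret M.
beta = M @ beta_bar
```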

4.2.2. Privacy-Preserving DCD Solver

Generally, carrying the similarities among $N$ data samples needs an $N \times N$ matrix. By employing the transformation strategy in Section 4.2.1, we hide the similarities by splitting them into the client's private key $M$ and the diagonal matrix $\bar{Q}$. From another perspective, as presented in Equation (16), this strategy is equivalent to protecting the sensitive coefficients by encrypting them with the private key $M$. If we use the transformed $\bar{Q}$ as input and multiply the returned $\bar{\beta}$ by $M$, Line 5 of Algorithm 1 should be replaced by $\hat{G} = 2 \times \sum_{j=1}^{N} \beta_j \bar{Q}_{jj}$, and the other steps remain unchanged.

4.2.3. Privacy-Preserving Convex Decomposition

Since SVs are even more sensitive than the other data samples, for Algorithm 2, the client shows the server only the encrypted SVs $\{[[x_1]], [[x_2]], \ldots, [[x_{N_{SV}}]]\}$. Due to the exponential function in $K(\cdot,\cdot)$, however, calculating the convergence direction of each data sample by Equation (12) in ED is beyond the server's ability. A practical choice is querying the result from the client, and all the SVs should be considered. The client can either store the kernel matrix of SVs, solving Equation (12) by matrix manipulation, or calculate $4q\beta_j K(x_j, x)$ on demand in each loop body, depending on the available storage. Thus, for privacy-preserving convex decomposition, we can reformulate Algorithm 2 by replacing Lines 3 and 4 with Algorithm 5 and with querying, respectively. Furthermore, the cosine function on Line 5 can easily be implemented by utilizing SMP and SVDMP.

4.2.4. Privacy-Preserving Connectivity Analysis

In this study, the connectivity analysis between two nearest convex hulls indicated by the collected set $S_{Tri}$ offers two strategies for choice. In ED, we present this privacy-preserving connectivity analysis with the cluster number $K$ and with the threshold $\eta_2$ in Algorithms 6 and 7, respectively.
Algorithm 6 Connectivity analysis in ED with $K$.
Require: Server: $[[S_{CH}]]$, $[[S_{Tri}]]$, $K_{pub}$, $K$; Client: $K_{pri}$
Ensure: Server outputs the encrypted array of labels $[[L_{SV}]]$
1. $[[S_{DC}]] = \emptyset$, $[[S_{dist}]] = \emptyset$
2. for $i \leftarrow 1, N_{CH}$ do
3.   $[[S_{DC}^i]] \leftarrow$ SMP($\sum_{u=1}^{N_{CH}^i} [[x_{iu}]]$, $[[1/N_{CH}^i]]$)
4.   $[[S_{DC}]] \leftarrow [[S_{DC}]] \cup [[S_{DC}^i]]$
5. end for
6. for $v \leftarrow 1, N_{Tri}$ do
7.   Pick the triple $([[x_i]], [[x_j]], [[\vec{x}_i]])$ from $[[S_{Tri}[v]]]$
8.   Pick $[[S_{DC}^k]]$ with $[[x_j]] \in [[S_{CH}^k]]$
9.   $[[S_{dist}]] \leftarrow [[S_{dist}]] \cup (l, k, [[\cos(\angle \vec{x}_i x_i S_{DC}^k)]])$ using SVDMP
10. end for
11. $N_m \leftarrow N_{CH}$
12. while $N_m > K$ do
13.   for $v \leftarrow 1, N_{Tri}$ do
14.     $[[d_v]] \leftarrow [[\cos(\angle \vec{x}_i x_i S_{DC}^k)]]$ in $[[S_{dist}]]$
15.     $[[d_{max}]] \leftarrow [[\max(d_v, d_{max})]]$ using SCP
16.   end for
17.   Extract the indices $k, l$ where $[[d_v]] \equiv [[d_{max}]]$
18.   Merge the $k$th, $l$th convex hulls by $[[S_{CH}^k]] \leftarrow [[S_{CH}^k]] \cup [[S_{CH}^l]]$
19.   $[[S_{dist}]] \leftarrow [[S_{dist}]] \setminus (k, l, [[\cos(\angle \vec{x}_i x_i S_{DC}^k)]])$
20.   $N_{Tri} \leftarrow N_{Tri} - 1$; $N_m \leftarrow N_m - 1$
21. end while
22. $[[L_{SV}]] \leftarrow$ label all the SVs by their convex hull's index
Algorithm 6 details the privacy-preserving connectivity analysis with $K$. Lines 1–5 construct the density centroid of each convex hull. For efficiency, we suggest that the client supply $[[1/N_{CH}^i]]$ within the framework of SMP to avoid an additional division protocol. Based on SVDMP, Lines 6–10 obtain the cosine of the included angle for each convex hull and the nearest SV of its nearest neighbor. Then, Lines 11–21 merge the nearest convex hull pairs one by one until the final cluster number reaches $K$. The last line labels all the other SVs directly. Similar operations appear in Algorithm 7. The major difference is Line 4, which compares $[[d_v]]$ with $[[\eta_2]]$ using SCP to determine the mergence.
Algorithm 7 Connectivity analysis in ED with $\eta_2$.
Require: Server: $[[S_{CH}]]$, $[[S_{Tri}]]$, $K_{pub}$, $[[\eta_2]]$; Client: $K_{pri}$
Ensure: Server outputs the encrypted array of labels $[[L_{SV}]]$
1. Do Lines 1–10 of Algorithm 6
2. for $v \leftarrow 1, N_{Tri}$ do
3.   $[[d_v]] \leftarrow [[\cos(\angle \vec{x}_i x_i S_{DC}^k)]]$ in $[[S_{dist}]]$
4.   if $[[d_v]] \ge [[\eta_2]]$ using SCP then
5.     Merge the $k$th, $l$th convex hulls by $[[S_{CH}^k]] \leftarrow [[S_{CH}^k]] \cup [[S_{CH}^l]]$
6.   end if
7. end for
8. $[[L_{SV}]] \leftarrow$ label all the SVs by their convex hull's index

4.2.5. Privacy-Preserving Remaining Samples Labeling

If we want the server to supply a service of labeling the remaining samples, or of labeling newly arrived samples for others, S1NNQ is recommended. Either the SVs or the density centroids can be used on behalf of the convex hulls to meet accuracy or efficiency requirements, respectively.

4.3. Work Mode of MPPSVC

By introducing the secure outsourcing protocols into the core method of RSVC-EO, Figure 3 gives the flow diagram of MPPSVC. Arrows on the client side show communications between the client and the server for successive cluster analysis in ED, while arrows on the server side illustrate all the information accessible to the server in each step. All the crucial information is encrypted. For instance, $[[\bar{\beta}]]$ is protected by $M$, and the other items in the form of $[[\cdot]]$ are encrypted by $K_{pub}$. For clarity, we separate communications according to the requirements of each step. To maximize the outsourcing capability appropriately (on demand), MPPSVC allows one to outsource any step or steps without losing control of the sensitive data.

4.4. Analysis of MPPSVC

4.4.1. Time Complexity of MPPSVC

We take the classical SVC [1] as the baseline and separately measure the time complexity of each step of RSVC-EO and MPPSVC, for local use by the data owner and for the client–server environment, respectively. Let $N$ be the number of $d$-dimensional data samples, $N_{SV}$ the number of SVs, $N_{CH}$ the number of obtained convex hulls, and $m$ the average sampling rate. Due to distance measurement in ED, we introduce $d$ for accurate comparisons. Table 1 lists the computational complexities.
(1)
For Step 1, "Pre-computed Q" and "Instance $K(\cdot,\cdot)$" denote running Algorithm 1 of RSVC-EO with the pre-computed kernel matrix or calculating the rows of kernel values on demand, respectively. Although $O(dN^2)$ seems time-consuming, in practice it is frequently much lower than the $O(N^3)$ of the classical SVC, which consumes $O(N^2)$ space. For MPPSVC, the client is recommended to outsource a pre-computation in $O(N^2)$ for the orthogonal matrix $M$, following [27,28], before Step 1. Then, nothing has to be done by the client except recovering $\beta = M\bar{\beta}$ in $O(N)$. The major workload of $O(N^2)$ is moved to the server, which can afford the space complexity of $O(N^2)$.
(2)
Step 2 does not exist in the classical SVC. The most recent convex decomposition based method [22] takes $O(\zeta N_{SV})$, and the SEV-based methods [12,29] require $O(\zeta N)$, where $\zeta$ is the number of iterations. For RSVC-EO, the difference brought by "Instance $K(\cdot,\cdot)$" is calculating $K(x_j, x)$ with $d$-dimensional input before getting the gradient in Equation (12). For MPPSVC, depending on whether the kernel matrix of SVs is used, the client's complexity is $O(\ell N_{SV}^2)$, where $\ell \in \{1, d\}$. Meanwhile, the server runs a polling program in $O(dN_{SV}^2)$. Notice that the essential tasks of Algorithm 2 for the client in ED have been cut down, even though they have a time complexity similar to that in PD.
(3)
In Step 3, the connectivity analysis of the classical SVC performs $m$ sampling checks for each SV pair. Since calculating Equation (3) requires all $N_{SV}$ SVs, the whole consumption is up to $O(mN_{SV}^3)$. On the contrary, RSVC-EO requires only $O(\rho N_{CH})$ due to the direct use of the convergence directions, where $\rho \in (2, 3]$ reflects that Algorithms 6 and 7 are provided for choice. Similarly, the client responds to SMP, SVDMP, and SCP in Algorithm 6, or to SCP in Algorithm 7, in $O(N_{CH})$.
(4)
Step 4 is particular to the classical SVC, whose prior step leaves an adjacency matrix. Its complexity is $O(N_{SV}^2)$.
(5)
Similar to classification, the traversal of the whole set in Step 5 cannot be avoided by any method. Its complexity is $O(dNN_{SV})$ or $O(dNN_{CH})$, except for the server in MPPSVC. Using S1NNQ, the major operations are carried out with the client under the protocols of SVDMP and SCP.

4.4.2. Communication Complexity of MPPSVC

Regarding the communication complexity, irrespective of direction, the transmissions of the encrypted matrix $\bar{Q}$, the coefficient vector $\bar{\beta}$, the SVs $[[\mathcal{X}_{SV}]]$, and the encrypted labels $[[L_{ALL}]]$ are the major consumers of network bandwidth in MPPSVC. First, either $\bar{Q}$ or $\bar{\beta}$ has only $N$ non-zero or valid items for transmission. Generally, the numeric type float (in C/C++) supplies enough precision for computation and requires 4 bytes per item; hence, sending $\bar{Q}$ to the server and receiving $\bar{\beta}$ each consume $4N$ bytes of bandwidth in the communication channel. Second, the cost of transferring $[[\mathcal{X}_{SV}]]$ and $[[L_{ALL}]]$ depends on the size of the Paillier security parameter $n$; in our implementation, $n$ is 2048 bits, so the size of an encrypted number is 2048 bits. Sending an encrypted data sample with $d$ features consumes $2048d$ bits (i.e., $256d$ bytes) of bandwidth. That means sending $[[\mathcal{X}_{SV}]]$ to the server needs $256d \times N_{SV}$ bytes of bandwidth, while receiving $[[L_{ALL}]]$ consumes $256N$ bytes. For the sake of simplicity and clarity, neither additional methods of reducing transmission nor the additional cost of data encapsulation is taken into account.
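The helper below reproduces this accounting (our arithmetic, using the stated parameters of 4-byte floats and 2048-bit ciphertexts); the function name and the example sizes are hypothetical.

```python
def mppsvc_traffic_bytes(N, d, N_SV, key_bits=2048):
    float_bytes = 4
    ct_bytes = key_bits // 8                 # 256 bytes per Paillier ciphertext
    send_Q_bar = float_bytes * N             # diagonal of Q_bar
    recv_beta = float_bytes * N              # masked coefficient vector
    send_X_SV = ct_bytes * d * N_SV          # encrypted SVs
    recv_labels = ct_bytes * N               # encrypted labels
    return send_Q_bar + recv_beta + send_X_SV + recv_labels

# e.g., N = 10000 samples, d = 9 features, 500 SVs -> about 3.6 MiB in total
print(mppsvc_traffic_bytes(10_000, 9, 500) / 2**20)
```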

4.4.3. Security of MPPSVC under the Semi-Honest Model

In this section, we consider the execution of MPPSVC under the semi-honest model. Due to the semantic security of the Paillier cryptosystem, the ciphertext messages exchanged in the client–server environment are securely protected. For each step in Figure 3, the analysis is presented as follows:
(1)
According to Section 4.2.1 and Section 4.2.2, the execution image of the server is given by $\bar{Q}$ and $C$. $\bar{Q}$ is a diagonal matrix protected by the matrix $M$ secretly held by the client, and $C$ is a single-use parameter without strict limitation. Notice that the mapping from $\bar{Q}$ to the kernel matrix $\tilde{Q}$ is one-to-many, and no one can recover an $N$-dimensional vector from only one number. Without any plain item of $\tilde{Q}$, the server cannot infer any data sample, even if it occasionally obtains several data samples. When the server finishes Algorithm 1, the output $[[\bar{\beta}]]$ is naturally protected by $M$.
(2)
In the second step, the major work is carried out on the client side as responses to the server. As marked in Figure 3, the accessible sensitive data are $[[\mathcal{X}_{SV}]]$, $[[\vec{x}]]$, $[[\eta_1]]$, $[[S_{CH}]]$, and $[[S_{Tri}]]$. However, all of them are encrypted by $K_{pub}$.
(3)
Step 3 depends on the output of Step 2. It includes SMP, SCP, and SVDMP, whose prototypes are proved secure in [20]. The server can only get the number of clusters but cannot infer any relationship between data samples in PD. Furthermore, the client can easily hide the real number by tuning the predefined parameters $K$ or $\eta_2$; an uncertain cluster number is then meaningless to the server.
(4)
Step 5 employs S1NNQ. The server cannot infer the actual label of a plain data sample, even if it occasionally has several samples.

4.4.4. Security of MPPSVC under the Malicious Model

We extend MPPSVC into a secure protocol under the malicious model, where an adversary exists. It may be the server or an eavesdropper. Since an eavesdropper cannot get more information than the server, for simplicity, we only consider the server as an implicit adversary.
The server can arbitrarily deviate from the protocol to gain some advantages (e.g., learning additional information about inputs) over the client. The deviations include, for example, instantiating MPPSVC with modified queries and aborting the protocol after gaining partial information. Considering SCP, a malicious server might either use a fixed $r$ to obtain the ordering of encrypted numeric values or tamper with the two compared values. For the former, the intermediate ordering of $N$ ciphertexts is meaningless for recovering their plaintexts, because $\mathbb{Z}_N \subset \mathbb{Z}_{n^2}$ and $N \ll n^2$ in bits. For the latter, without $K_{pri}$, any modification of a ciphertext is likely to cause a significant change in its plaintext, which the client can easily discover. Therefore, all the intermediate results are either random or pseudo-random values. Even if an adversary modifies the intermediate computations, it cannot gain any additional information; the modification may eventually result only in a wrong output. Thus, if we ensure that all the calculations performed and messages sent by the client are correct, the proposed MPPSVC is secure, and it provides the client with the ability to validate the server's work.

5. Experimental Results

5.1. Experimental Setup

Under the premise of security for outsourcing, we demonstrate the performance of RSVC-EO in PD and of MPPSVC in ED. RSVC-EO dominates the validity, while MPPSVC supplies the secure outsourcing framework. Accordingly, we first evaluate the validity and performance of RSVC-EO, and then the performance of MPPSVC.
In PD, the first experiment estimates the sensitivity of accuracy with respect to $\eta_1$. Since RSVC-EO is designed for local use, its declared advantage is the flexibility of using the pre-computed kernel matrix or not; the second experiment therefore checks the performance related to kernel utilization. The third series of benchmarks gives full comparisons between RSVC-EO and the state-of-the-art methods. Besides, we verify the effectiveness of capturing the data distribution by all the compared methods with regard to the discovered cluster number. In ED, our primary focus is the change in accuracy and efficiency brought by MPPSVC. In this study, the adjusted Rand index (ARI) [4,30] is adopted for accuracy evaluation. It is a widely used similarity measure between two data partitions where both true labels and predicted cluster labels are given. Let $N_{ij}$ be the number of data samples with true label $i$ yet assigned to cluster $j$, and let $N_{i\cdot}$ and $N_{\cdot j}$ be the numbers of data samples with labels $i$ and $j$, respectively. The ARI is formulated by
$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{N_{ij}}{2} - \left[\sum_i \binom{N_{i\cdot}}{2} \sum_j \binom{N_{\cdot j}}{2}\right] / \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{N_{i\cdot}}{2} + \sum_j \binom{N_{\cdot j}}{2}\right] - \left[\sum_i \binom{N_{i\cdot}}{2} \sum_j \binom{N_{\cdot j}}{2}\right] / \binom{N}{2}}. \tag{17}$$
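In practice, Equation (17) is available off the shelf; for instance, scikit-learn's adjusted_rand_score implements it, as the toy check below shows.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same partition, permuted cluster ids
print(adjusted_rand_score(true_labels, pred_labels))   # 1.0: ARI ignores naming
```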

5.2. Experimental Dataset

Table 2 shows the statistical information of the twelve datasets employed. Here, wisconsin, glass, wine, movement_libras, abalone, and shuttle (training version) are from the UCI repository [31]. Four text corpora were employed after a pre-processing, namely DC GLI CCE, by Ping et al. [32], i.e., the four categories of WebKB [33], the full twenty categories of 20Newsgroups [34], the top 10 largest categories of Reuters-21578 [35], and Ohsumed with 23 classes [36]. P2P traffic is a collection of the features of 9206 flows extracted from the traffic supplied by [37], following the method of Peng et al. [38]. Following the work of Guo et al. [39], kddcup99 is a nine-dimensional dataset extracted from the KDD Cup 1999 Data [40], which was used to build a network intrusion detector. Due to space limitations, we hereafter use the abbreviations in brackets for datasets with long names.

5.3. Experiments in the Plain-Domain

For local use, RSVC-EO in PD means the data owner frequently cannot have sufficient memory for the required kernel matrix. Consequently, the testbed was a laptop with an Intel dual-core 2.66 GHz CPU and 3 GB of available RAM, which calculates the kernel function on demand. RSVC-EO and all the other compared methods were implemented and fairly evaluated in MATLAB 2016a on Windows 7 x64.

5.3.1. Analysis of Parameter Sensitivity

Algorithm 2 introduces $\eta_1$ to guide the convex decomposition, which might be directly related to RSVC-EO's performance. Figure 4 depicts the ARI variations achieved by RSVC-EO with respect to $\eta_1 \in [0.8, 0.9]$ with step $0.01$. Here, the variation for each dataset is represented by a rectangle in which the square block is the mean value. Apparently, the variation is very small for 9 of the 12 datasets, i.e., wisconsin, glass, mLibras, P2P-T, WebKB, abalone, Reuters, Oh, and kddcup99; in fact, these variations are lower than $3.21 \times 10^{-4}$. For the other three datasets, i.e., wine, 20NG, and sh, the variations are lower than $5.5 \times 10^{-3}$, and the rectangles' locations show that most of them are close to the peak value. Therefore, the selection of $\eta_1$ is frequently easy, and the proposed RSVC-EO achieves a relatively optimal clustering result. Notice that a preset $\eta_1$ is not required in cases with prior knowledge of the cluster number $K$, because RSVC-EO can merge convex hulls until the expected $K$ is obtained.

5.3.2. Analysis of Iteration Sensitivity

Iterative analysis is essential for both RSVC-EO and MPPSVC. Although the server conducts the solver independently to get the encrypted coefficient vector $M\bar{\beta}$, the runtime might be pricey if we have to choose the strategy of immediately calculating the kernel function for large-scale data. Therefore, we are concerned with whether massive iterations are unavoidable. Noticeably, Hsieh et al. [41] proved that the general DCD solver reaches an $\epsilon$-accurate solution in $O(\log(1/\epsilon))$ iterations. For the sake of simplicity, we checked the relationship between the achieved ARI and the iteration number in PD. Figure 5 depicts the results, where kddcup99 is omitted due to its pricey runtime at large iteration numbers.
For most of the cases, the achieved ARIs are relatively stable as the iteration number increases. With the proposed solver, a useful phenomenon is that, for each case, the best ARI is usually reached with a small iteration number (≤4). This means that a more precise objective function value for the problem in Equation (2) is not always required in practice. Therefore, a small iteration number meets the requirements of both RSVC-EO and MPPSVC for the expected results and does not bring a noticeable computational load to either side.

5.3.3. Performance Related to Kernel Utilization

In this section, we check whether RSVC-EO can flexibly balance efficiency and usability with limited memory. For efficiency, the runtime of completing Algorithm 1 with the pre-computed kernel matrix and with the kernel function calculated immediately on Line 5 were evaluated separately; the former is denoted by "Runtime (Store Kernel)" and the latter by "Runtime (Cal. Kernel)". Their memory consumptions were likewise evaluated as "Storage (Store Kernel)" and "Storage (Cal. Kernel)". (Due to the limited memory (3 GB), the client in fact cannot afford all the experimental datasets' requirements; thus, "Runtime (Store Kernel)" for storage exceeding the supply was estimated by "Runtime (Cal. Kernel)" minus the runtime of calculating K(x_j, x_i).) Figure 6 shows the results. Apparently, for small datasets with N < 1000 such as wisconsin, glass, and wine, there are no obvious differences between "Runtime (Store Kernel)" and "Runtime (Cal. Kernel)". Although the client handles large dataset analysis well with on-demand calculation, its "Runtime (Cal. Kernel)" rises quickly as N increases. On the other hand, "Storage (Cal. Kernel)" is linear in the data size, whereas "Storage (Store Kernel)" easily exceeds the client's capability. The gaps become significant for datasets with N ≫ d. For instance, to deal with sh, "Storage (Store Kernel)" requires 7.05 GB while "Storage (Cal. Kernel)" needs only 1.49 MB. Therefore, efficiency and storage consumption are strongly related to how the kernel is utilized, and outsourcing the cluster analysis is critical for resource-limited clients.
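The reported figures are consistent with single-precision (4-byte) storage: 43,500² × 4 B ≈ 7.05 GB for the full kernel matrix of sh, and 43,500 × 9 × 4 B ≈ 1.49 MB for the raw data kept for on-demand evaluation (likewise, 494,021² × 4 B ≈ 909 GB for kddcup99 in Section 5.3.4). A back-of-the-envelope check, with the 4-byte assumption being ours:

```python
N, d = 43_500, 9                      # the sh dataset (Table 2)
BYTES = 4                             # assumed single-precision float

kernel_gb = N * N * BYTES / 2 ** 30   # full pre-computed kernel matrix
data_mb = N * d * BYTES / 2 ** 20     # raw data for on-demand K(x_j, x_i)
print(f"Store Kernel: {kernel_gb:.2f} GB; Cal. Kernel: {data_mb:.2f} MB")
# -> Store Kernel: 7.05 GB; Cal. Kernel: 1.49 MB, matching Figure 6
```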

5.3.4. Benchmark Results for Accuracy Comparison

To check RSVC-EO's performance, we compared it with the state-of-the-art methods: the complete graph (CG) [1], the reduced complete graph (RCG) [24], the equilibrium-based SVC (E-SVC) [29,42], the cone cluster labeling (CCL) [43], the fast SVC (FSVC) [12], the position-regularized SVC (PSVC) [44], the convex decomposition cluster labeling (CDCL) [23], the Voronoi cell-based clustering (VCC) [8], the fast and scalable SVC (FSSVC) [26], and the faster and reformulated SVC (FRSVC) [22]. Table 3 gives the achieved accuracies in terms of ARI, and the corresponding runtime of the training and labeling phases for each dataset is illustrated in Figure 7. Three points should be noted. First, due to its sampling strategy, VCC cannot achieve a fixed accuracy even when its parameters are fixed; we report the mean and mean-square deviation of its top ten ARIs. Second, the runtime for each dataset is the average of ten executions. Third, not all methods can finish the analysis on the client when the kernel matrix is too large or any phase takes too much time (≥4 h in this study); for these cases, we use "—" to denote an unavailable ARI and mark the runtime as 0. Meanwhile, the runtimes of FRSVC and RSVC-EO were collected with the kernel function calculated immediately.
In Table 3, the first rank is highlighted in boldface. Apparently, RSVC-EO reaches the best performance on 7 out of 12 datasets, especially the large ones such as Reuters, Oh, 20NG, and kddcup99, whereas FRSVC performs better on sh and WebKB, FSSVC outperforms the others on wine and wisconsin, and CDCL gets better results on P2P-T. As the data size increases, many traditional methods cannot run on our client, e.g., CG, E-SVC, CCL, and PSVC; we directly quote the results of E-SVC and FSVC on sh from [12]. Additionally, RSVC-EO frequently ranks among the first three in the other cases, e.g., sh, WebKB, and P2P-T. Regarding accuracy, we thus conjecture that RSVC-EO is suitable for relatively large datasets. To verify this, we also give the results of pairwise comparison in Table 4, following the work of Garcia and Herrera [45], with RSVC-EO as the control method. A nonparametric statistical test, namely the Friedman test, was employed to obtain the average ranks and unadjusted p-values. By introducing an adjustment method, the Bergmann–Hommel procedure, the adjusted p-value denoted by p_Homm for each comparison was obtained. RSVC-EO achieves the best performance in terms of average rank. Since the Bergmann–Hommel procedure rejects those hypotheses with p-values ≤ 0.016, together with the values of p_Homm, we further confirm RSVC-EO's better performance.
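For readers reproducing Table 4, the average ranks come from ranking the methods per dataset (rank 1 = best ARI) and averaging down the columns. A hedged sketch using SciPy on a small excerpt of Table 3 (CG, RCG, CDCL, and RSVC-EO on the first three datasets) is given below; the Bergmann–Hommel adjustment is not available in SciPy, so see Garcia and Herrera [45] for that procedure.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# ari[i, j]: ARI of method j on dataset i (excerpt from Table 3;
# columns: CG, RCG, CDCL, RSVC-EO; rows: wisconsin, glass, wine).
ari = np.array([[0.7793, 0.8035, 0.8685, 0.8632],
                [0.2771, 0.2823, 0.2911, 0.3540],
                [0.5912, 0.7928, 0.8961, 0.8185]])

# Rank 1 = best, so rank the negated accuracies row by row.
ranks = np.vstack([rankdata(-row) for row in ari])
print("average ranks:", ranks.mean(axis=0))

# Friedman test: one sample (across datasets) per method column.
stat, p = friedmanchisquare(*ari.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")
```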
In Figure 7, several observations stand out for the training phase. (1) For the cases smaller than WebKB, most of the methods perform similarly, including FRSVC and RSVC-EO; although these two must calculate the kernel function, this is balanced by the larger iteration counts of the others. However, CCL and PSVC still consume much more owing to strict restrictions: CCL requires extra iterative analysis to guarantee R < 1, while PSVC needs a pre-analysis to determine a weight for each data sample and imposes these weights as additional constraints. (2) As the data size increases, e.g., from WebKB to Reuters, the runtime of most methods rises dramatically. In particular, CG, E-SVC, CCL, and PSVC demand memory beyond the predefined upper bound. When the kernel matrix requires memory close to or greater than the client's capacity, as for Oh, 20NG, and sh, only two groups of methods remain valid. The first group comprises VCC (sample rate θ ∈ [0.001, 0.5)) and FSSVC, which adopt a sampling strategy; together with the accuracies in Table 3, FSSVC obtains better accuracy yet consumes much more memory because it steadily chooses boundaries, whereas VCC prefers a random strategy. The second group consists of FRSVC and RSVC-EO, which calculate the kernel function on demand to avoid huge memory consumption; despite longer runtimes than VCC, they are rewarded with better accuracies. A remarkable finding is that, benefiting from its parameter insensitivity, RSVC-EO needs fewer learning iterations and hence less runtime. (3) For kddcup99, the full kernel matrix with approximately 2.44 × 10¹¹ entries demands 909.18 GB, far more than the client can afford. VCC and FSSVC fail because they can hardly select appropriate data samples to describe the pattern, while FRSVC fails owing to its pricey labeling strategy. Only RSVC-EO finishes the analysis, with a suboptimal result because we use only one iteration in Algorithm 1. Therefore, obtaining the optimal result remains a big challenge for RSVC-EO as the data size continues to increase.
For the labeling phase, RSVC-EO significantly outperforms the others, which confirms the core ideas of FCDCL. First, FCDCL uses no iterative analysis; thus, it performs well on high-dimensional data such as mLibras and Oh, whereas E-SVC fails and the others consume much more. Second, the connectivity analysis of FCDCL avoids the traditional sampling checks in feature space, which removes the dependence on the number of candidate sample pairs. Although the other methods try to reduce the number of sample pairs and the sampling rate, the runtime of their essential sampling analysis still exceeds time proportional to the size of the dataset or the candidate subset.
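For contrast, below is a sketch of the traditional sampling check that FCDCL avoids. In classical SVC [1], two points lie in the same cluster only if every one of m sampled points z on the segment between them satisfies R(z) ≤ R, where R²(z) = K(z,z) − 2Σ_j β_j K(x_j, z) + Σ_{i,j} β_i β_j K(x_i, x_j); each check thus costs O(m · N_SV) kernel evaluations per pair. A Gaussian kernel is assumed and the names are illustrative.

```python
import numpy as np

def r2(z, X_sv, beta, q=1.0, k_ss=None):
    # Squared feature-space distance from z to the sphere centre:
    # R^2(z) = K(z,z) - 2 sum_j b_j K(x_j,z) + sum_ij b_i b_j K(x_i,x_j).
    kz = np.exp(-q * ((X_sv - z) ** 2).sum(axis=1))
    if k_ss is None:  # constant term; cache it across calls in practice
        k_ss = beta @ np.exp(-q * ((X_sv[:, None] - X_sv[None]) ** 2).sum(-1)) @ beta
    return 1.0 - 2.0 * beta @ kz + k_ss   # K(z,z) = 1 for the Gaussian kernel

def connected(x, y, X_sv, beta, R, m=10, q=1.0):
    # Classical check: sample m points on the segment [x, y]; the pair is
    # connected only if every sample stays inside the minimal sphere.
    ts = np.linspace(0.0, 1.0, m)
    return all(r2(x + t * (y - x), X_sv, beta, q) <= R ** 2 for t in ts)
```

Running this over every candidate pair is exactly the cost that FCDCL's convex-hull connectivity replaces.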

5.3.5. Effectiveness of Capturing Data Distribution

Following Xu and Wunsch [30], we find that increasing the cluster number sometimes has a positive impact on the accuracy measures. However, we should avoid splitting the data samples of one group into multiple clusters; we intuitively expect an effective method to accurately capture the data distribution. Therefore, we measure the difference between the captured cluster number N_C and the real number of classes N_R summarized in Table 2, in terms of the percentage (N_C/N_R − 1) × 100%; e.g., capturing N_C = 9 clusters on a dataset with N_R = 7 real classes yields about +28.6%. Comparisons amongst the eleven methods are illustrated in Figure 8. If a method is invalid on a dataset, the corresponding percentage is assigned the greatest value amongst the methods that finished, and its column is drawn in gray with slashes. The shorter the column for a dataset, the better the corresponding method captures its distribution. As shown in Figure 8, RSVC-EO significantly outperforms the other ten methods.

5.4. Experiments in the Encrypted-Domain

To integrate the Paillier cryptosystem, MPPSVC was implemented in C++ using the GNU GMP library, version 6.0.0a (https://ftp.gnu.org/gnu/gmp/). Both the server and the client were modeled as different threads of a single program, which pass data or parameters to each other following the rules shown in Figure 3. We conducted the experiments on a server with Quad-Core 2.29 GHz CPUs and 64 GB of main memory running Windows 7-X64.
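As background for the protocols, the property MPPSVC relies on is the additive homomorphism of Paillier [13]: multiplying two ciphertexts modulo n² yields an encryption of the sum of the plaintexts. A toy, textbook-level Python sketch follows; it is not the paper's GMP-based C++ code, and the primes are far too small for real use (a deployment would use primes of 512 or more bits).

```python
import random
from math import gcd

def keygen(p=104_729, q=1_299_709):
    # Textbook Paillier with toy (but genuine) primes.
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                           # valid because g = n + 1
    return n, (lam, mu, n)

def enc(n, m):
    n2 = n * n
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2  # g^m * r^n mod n^2

def dec(sk, c):
    lam, mu, n = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n                   # L(u) * mu mod n

n, sk = keygen()
c = enc(n, 37) * enc(n, 5) % (n * n)               # E(37) * E(5) ...
assert dec(sk, c) == 42                            # ... decrypts to 37 + 5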

5.4.1. Performance in the Encrypted-Domain

In ED, the Paillier cryptosystem only operates on integers. We therefore introduce a scaling factor γ and round the scaled input data samples and exchanged values down to integers before encryption. Table 5 shows the accuracies for various γ in ED. Although better accuracies are not always obtained with greater γ, the clustering results become steady once γ reaches 10⁴; for these cases, the accuracies are either the best ones or very close to the best. Hence, four decimal digits are sufficient for the 12 datasets. Compared with Table 3, the accuracies in ED deviate slightly from those in PD because both the input data samples and the intermediate results transmitted from the client to the server may lose a certain degree of precision. Since the accuracies achieved by MPPSVC are very close to those obtained by RSVC-EO in PD, we confirm that the privacy of the input data and the clustering procedure is preserved without sacrificing accuracy.
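A minimal sketch of this fixed-point encoding with γ = 10⁴ is shown below. The paper does not spell out its treatment of negative values; mapping them into the upper half of Z_n is one common convention and is our assumption here.

```python
from math import floor

GAMMA = 10 ** 4              # keep four decimal digits (Table 5)

def encode(x, n):
    # Scale and round down ("take integers downwardly"), then map into Z_n;
    # negatives occupy the upper half of the plaintext space (assumption).
    return floor(x * GAMMA) % n

def decode(v, n):
    if v > n // 2:           # interpret the upper half as negatives
        v -= n
    return v / GAMMA

n = 104_729 * 1_299_709      # toy modulus from the sketch above
assert decode(encode(-0.12345, n), n) == -0.1235   # floor, not truncation
```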

5.4.2. Computational Complexity

Table 6 presents the runtime of each step (following Table 1) in ED, consumed by the client and the server, respectively. "#SVs" and "#Convex Hulls" denote the numbers of SVs and convex hulls, respectively. We omit the consumption of outsourcing the pre-computation of M following Luo et al. [27], because it is not part of the proposed framework.
Together with Figure 7, we can make several observations. (1) Step 1 is no longer the first barrier for the client, even when the solver requires many iterations. For example, in Figure 7l, Step 1 for kddcup99 consumed 10,846.2134 s in PD although we cut off the iterations to get a suboptimal result for efficiency; in ED it requires only 0.4751 s from the client, leaving the iterations to the server. (2) Step 2 is the most time-consuming task for both sides, and Step 3 takes second place. Comparing the two most time-consuming datasets, P2P-T and mLibras, with the others shows that efficiency is closely related to "#SVs" and "#Convex Hulls". From Definition 2, "#SVs" also affects "#Convex Hulls". Since S1NNQ compares each unlabeled data sample with both the SVs and the density centroids, "#SVs" and "#Convex Hulls" directly contribute to the computational time of Step 5. (3) Dimensionality is another critical factor, as evidenced by the small dataset mLibras with 90 dimensions. Since each dimension is encrypted/decrypted separately, high dimensionality increases the computational time in ED and also results in more SVs. (4) Similar to the method in [17], Step 5 is a classification task in ED, which can be conducted one sample at a time or in parallel; outsourcing an encrypted sample for its label is a light workload. (5) The total time cost is proportional to the data size yet remains in an acceptable range for the client. However, we do not suggest outsourcing Steps 1–3 for small data analysis because of the increased costs. Based on a learned model, Step 5 is also suitable to be outsourced as a service.
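To illustrate why "#SVs" and "#Convex Hulls" drive Step 5's cost, below is a plain-domain sketch of the nearest-reference assignment that, as the name S1NNQ (secure 1-nearest-neighbor query) suggests, the protocol realizes over ciphertexts; the customized secure comparisons themselves are not reproduced, and the names are illustrative.

```python
import numpy as np

def label_sample(x, refs, ref_labels):
    """Plain-domain analogue of Step 5: give x the cluster label of its
    nearest reference point, where refs pools the SVs and the density
    centroids. Performed under encryption, the same comparison scales
    with #SVs + #Convex Hulls, matching the trend in Table 6."""
    d2 = ((refs - x) ** 2).sum(axis=1)     # squared Euclidean distances
    return ref_labels[int(np.argmin(d2))]
```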

6. Related Work

Despite handling arbitrary cluster shapes well, SVC suffers from pricey computation and memory consumption as the data size increases. Generally, optimizing the critical operations and asking for the Cloud's help are the two potential countermeasures. For the former, the training and labeling phases are considered separately.
(1)
For model training, the core is solving the dual problem in Equation (2). Major studies prefer generic optimization algorithms, e.g., gradient descent, sequential minimal optimization [4], and evolutionary strategies [46]. Later studies rewrote the dual problem by introducing the Jaynes maximum entropy principle [47], position-based weights [44], and the relationship amongst SVs [26]. However, running a solver on the full dataset suffers from the huge memory consumption of the kernel matrix. Thus, FSSVC [26] steadily selects the boundaries, while VCC [8] samples a predefined ratio θ ∈ (0, 1] of the data. Other methods related to reducing the working set and divide-and-conquer strategies were surveyed in [4]. However, bottlenecks still easily appear due to the nonlinear strategy and the pre-computed kernel matrix. Thus, FRSVC [22] employs a linear method to seek a balance between efficiency and memory cost.
(2)
For the labeling phase, connectivity analysis has long adopted a sampling-check strategy. Reducing the number of sample pairs thus became the first consideration, for instance, using the full dataset in CG [1], the SVs in PSVC [44], the SEVs in RCG [24], and the transition points in E-SVC [29,42]. Although the number of sample pairs is gradually reduced, these methods have the side effect of additional iterations for seeking SEVs or transition points. Thus, CDCL [23] suggests a compromise: using SVs to construct convex hulls, which are employed as substitutes for SEVs. For efficiency, CDCL reduces the average sampling rate by a nonlinear sampling strategy. Later, FSSVC [26] and FRSVC [22] made further improvements by reducing the average sampling number m to nearly 1. Besides, Chiang and Hao [48] introduced a cell-growth strategy, which starts at any data sphere, expands by absorbing fresh neighboring spheres, and splits if its density drops below a certain degree. Later, CCL [43] created a new way of checking the connectivity of two SVs with a single distance calculation; however, the overly strict constraints it imposes on the solver degrade its performance. In fact, for all these methods, the other pricey consumption is the adjacency matrix, which usually ranges from O(N_SV²) to O(N²).
As the data size increases, the pricey computation and huge space needed by the above solutions raise the requirement of outsourcing. However, asking for the Cloud's help raises concerns about data leakage, since both the input data and the learned model memorize information [10]. Major studies focus on securely outsourcing known SVM classifiers. Generally, secure outsourcing protocols prefer introducing homomorphic encryption [15,16,17,49], reformulating the classifier [50], randomizing the classifier [51], or finding an approximated classifier [21]. Furthermore, based on homomorphic encryption, existing secure outsourcing methods support the calculation of rational numbers [52], matrix computations [14,27,28], mathematical optimization [14], and k-nearest-neighbor queries [20]. However, very few works address privacy-preserving model training. An early work was published by Lin et al. [19], who suggested a random linear transformation of a subset of the data before outsourcing. Later, Salinas et al. [53] presented a transformed quadratic program and its solver, namely the Gauss–Seidel algorithm, for securely outsourcing SVM training while reducing the client's computation. Building on a primal estimated sub-gradient solver and replacing the SVs with data prototypes, the most recent work [54] trains an SVM model on data protected by homomorphic encryption.
In a sense, training an SVM model and making a decision for a data sample are similar to SVC's training phase and its last labeling step, respectively. However, the known solutions are unsuitable for SVC because of its distinct iteration/analysis strategy and its operations in feature space. To the best of our knowledge, no practical solution has been presented for securely outsourcing SVC despite the strong demand.

7. Conclusions and Future Work

Towards easing the client's workload, we propose MPPSVC to make all phases of SVC outsourceable without raising privacy issues. For simplicity and generality, we suggest using additively homomorphic encryption to protect data privacy. Its limited operations motivated a new design, RSVC-EO, built on elementary operations. However, the inevitable iterations for Equation (2) and the complex computations in all phases of SVC may cause massive interactions. For efficiency, we consequently protect the kernel by a matrix transformation, which not only reduces data transmission but also lets the outsourced solver iterate without interaction. Besides, for the labeling phase of RSVC-EO, FCDCL is developed without iterative analysis. Taking RSVC-EO as the core, MPPSVC consists of several customized, lightweight, and secure protocols. Theoretical analysis and experimental results on twelve datasets demonstrate the reliability of the proposed methods, i.e., RSVC-EO for local use and MPPSVC for outsourcing.
Although MPPSVC provides customizable phase outsourcing, security and efficiency are everlasting issues that must be balanced as the data size increases. How to securely control the iterations while reducing SVs, find substitutes for the complex operations, avoid unnecessary kernel matrix consumption, and make full use of distributed computing remain worthy of investigation.

Author Contributions

Conceptualization, Y.P. and B.H.; methodology, Y.P. and B.H.; validation, Y.P., B.H., X.H., and J.W.; formal analysis, Y.P., X.H. and B.W.; investigation, Y.P. and B.H.; resources, X.H., J.W. and B.W.; writing—original draft preparation, Y.P.; writing—review and editing, X.H., J.W. and B.W.; supervision, J.W. and B.W.; project administration, Y.P. and B.W.; and funding acquisition, Y.P. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China under Grant No. 2017YFB0802000, the National Natural Science Foundation of China under Grant No. U1736111, the Plan For Scientific Innovation Talent of He’nan Province under Grant No. 184100510012, the Program for Science & Technology Innovation Talents in Universities of He’nan Province under Grant No. 18HASTIT022, the Key Technologies R&D Program of He’nan Province under Grant Nos. 182102210123 and 192102210295, the Foundation of He’nan Educational Committee under Grant No. 18A520047, the Foundation for University Key Teacher of He’nan Province under Grant No. 2016GGJS-141, the Open Project Foundation of Information Technology Research Base of Civil Aviation Administration of China under Grant No. CAAC-ITRB-201702, and the Innovation Scientists and Technicians Troop Construction Projects of He’nan Province.

Acknowledgments

The authors would like to thank Dale Schuurmans (University of Alberta) for suggestion on the DCD Solver, and the Associate Editor and the anonymous reviewers for their constructive comments that greatly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript or in the decision to publish the results.

References

  1. Ben-Hur, A.; Horn, D.; Siegelmann, H.T.; Vapnik, V.N. Support Vector Clustering. J. Mach. Learn. Res. 2001, 2, 125–137. [Google Scholar] [CrossRef]
  2. Saltos, R.; Weber, R.; Maldonado, S. Dynamic Rough-Fuzzy Support Vector Clustering. IEEE Trans. Fuzzy Syst. 2017, 25, 1508–1521. [Google Scholar] [CrossRef]
  3. Ye, Q.; Zhao, H.; Li, Z.; Yang, X.; Gao, S.; Yin, T.; Ye, N. L1-Norm Distance Minimization-Based Fast Robust Twin Support Vector k-Plane Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4494–4503. [Google Scholar] [CrossRef] [PubMed]
  4. Li, H.; Ping, Y. Recent Advances in Support Vector Clustering: Theory and Applications. Int. J. Pattern Recogn. Artif. Intell. 2015, 29, 1550002. [Google Scholar] [CrossRef]
  5. Sheng, Y.; Hou, C.; Si, W. Extract Pulse Clustering in Radar Signal Sorting. In Proceedings of the 2017 International Applied Computational Electromagnetics Society Symposium–Italy (ACES), Florence, Italy, 26–30 March 2017; pp. 1–2. [Google Scholar]
  6. Lawal, I.A.; Poiesi, F.; Anguita, D.; Cavallaro, A. Support Vector Motion Clustering. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 2395–2408. [Google Scholar] [CrossRef]
  7. Pham, T.; Le, T.; Dang, H. Scalable Support Vector Clustering Using Budget. arXiv 2017, arXiv:1709.06444v1. [Google Scholar]
  8. Kim, K.; Son, Y.; Lee, J. Voronoi Cell-Based Clustering Using a Kernel Support. IEEE Trans. Knowl. Data Eng. 2015, 27, 1146–1156. [Google Scholar] [CrossRef]
  9. Yu, Y.; Li, H.; Chen, R.; Zhao, Y.; Yang, H.; Du, X. Enabling Secure Intelligent Network with Cloud-Assisted Privacy-Preserving Machine Learning. IEEE Netw. 2019, 33, 82–87. [Google Scholar] [CrossRef]
  10. Song, C.; Ristenpart, T.; Shmatikov, V. Machine Learning Models that Remember Too Much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS’ 2017, Dallas, TX, USA, 30 October–3 November 2017; ACM: New York, NY, USA, 2017; pp. 587–601. [Google Scholar]
  11. Dritsas, E.; Kanavos, A.; Trigka, M.; Sioutas, S.; Tsakalidis, A. Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases. Algorithms 2019, 12, 266. [Google Scholar] [CrossRef] [Green Version]
  12. Jung, K.H.; Lee, D.; Lee, J. Fast support-based clustering method for large-scale problems. Pattern Recogn. 2010, 43, 1975–1983. [Google Scholar] [CrossRef]
  13. Paillier, P. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques (EUROCRYPT’ 99), Prague, Czech Republic, 2–6 May 1999; pp. 223–238. [Google Scholar]
  14. Shan, Z.; Ren, K.; Blanton, M.; Wang, C. Practical Secure Computation Outsourcing: A Survey. ACM Comput. Surv. 2018, 51, 31:1–31:40. [Google Scholar] [CrossRef]
  15. Liu, X.; Deng, R.; Choo, K.R.; Yang, Y. Privacy-Preserving Outsourced Support Vector Machine Design for Secure Drug Discovery. IEEE Trans. Cloud Comput. 2019, 1–14. [Google Scholar] [CrossRef]
  16. Rahulamathavan, Y.; Veluru, S.; Phan, R.C.W.; Chambers, J.A.; Rajarajan, M. Privacy-Preserving Clinical Decision Support System using Gaussian Kernel based Classification. IEEE J. Biomed. Health Inform. 2014, 18, 56–66. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Rahulamathavan, Y.; Phan, R.C.W.; Veluru, S.; Cumanan, K.; Rajarajan, M. Privacy-Preserving Multi-Class Support Vector Machine for Outsourcing the Data Classification in Cloud. IEEE Trans. Dependable Secure Comput. 2014, 11, 467–479. [Google Scholar] [CrossRef] [Green Version]
  18. Karapiperis, D.; Verykios, V.S. An LSH-based Blocking Approach with A Homomorphic Matching Technique for Privacy-preserving Record Linkage. IEEE Trans. Knowl. Data Eng. 2015, 27, 909–921. [Google Scholar] [CrossRef]
  19. Lin, K.P.; Chang, Y.W.; Chen, M.S. Secure Support Vector Machines Outsourcing with Random Linear Transformation. Knowl. Inf. Syst. 2015, 44, 147–176. [Google Scholar] [CrossRef]
  20. Samanthula, B.; Elmehdwi, Y.; Jiang, W. k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data. IEEE Trans. Knowl. Data Eng. 2015, 27, 1261–1273. [Google Scholar] [CrossRef] [Green Version]
  21. Lin, K.P.; Chen, M.S. On the Design and Analysis of the Privacy-Preserving SVM Classifier. IEEE Trans. Knowl. Data Eng. 2011, 23, 1704–1717. [Google Scholar] [CrossRef]
  22. Ping, Y.; Tian, Y.; Guo, C.; Wang, B.; Yang, Y. FRSVC: Towards making support vector clustering consume less. Pattern Recogn. 2017, 69, 286–298. [Google Scholar] [CrossRef]
  23. Ping, Y.; Tian, Y.; Zhou, Y.; Yang, Y. Convex Decomposition Based Cluster Labeling Method for Support Vector Clustering. J. Comput. Sci. Technol. 2012, 27, 428–442. [Google Scholar] [CrossRef]
  24. Lee, J.; Lee, D. An Improved Cluster Labeling Method for Support Vector Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 461–464. [Google Scholar] [PubMed]
  25. Ping, Y.; Zhou, Y.; Yang, Y. A Novel Scheme for Accelerating Support Vector Clustering. Comput. Inform. 2011, 31, 1001–1026. [Google Scholar]
  26. Ping, Y.; Chang, Y.; Zhou, Y.; Tian, Y.; Yang, Y.; Zhang, Z. Fast and Scalable Support Vector Clustering for Large-scale Data Analysis. Knowl. Inf. Syst. 2015, 43, 281–310. [Google Scholar] [CrossRef]
  27. Luo, C.; Zhang, K.; Salinas, S.; Li, P. SecFact: Secure Large-scale QR and LU Factorizations. IEEE Trans. Big Data 2019, 1–13. [Google Scholar] [CrossRef]
  28. Zhou, L.; Li, C. Outsourcing Eigen-Decomposition and Singular Value Decomposition of Large Matrix to a Public Cloud. IEEE Access 2016, 4, 869–879. [Google Scholar] [CrossRef]
  29. Lee, J.; Lee, D. Dynamic Characterization of Cluster Structures for Robust and Inductive Support Vector Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1869–1874. [Google Scholar]
  30. Xu, R.; Wunsch, D.C. Clustering; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  31. Frank, A.; Asuncion, A. UCI Machine Learning Repository. 2010. Available online: http://archive.ics.uci.edu/ml (accessed on 23 July 2018).
  32. Ping, Y.; Zhou, Y.; Xue, C.; Yang, Y. Efficient representation of text with multiple perspectives. J. China Univ. Posts Telecommun. 2012, 19, 101–111. [Google Scholar] [CrossRef]
  33. Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; Slattery, S. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI’98), Madison, WI, USA, 26–30 July 1998; pp. 509–516. [Google Scholar]
  34. Lang, K. NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 331–339. [Google Scholar]
  35. Lewis, D.D. Reuters-21578 Text Categorization Collection. 1997. Available online: http://kdd.ics.uci.edu/databases/reuters21578/ (accessed on 19 March 2012).
  36. Hersh, W.R.; Buckley, C.; Leone, T.J.; Hickam, D.H. Ohsumed: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In Proceedings of the 17th Annual ACM SIGIR Conference, Dublin, Ireland, 3–6 July 1994; pp. 192–201. [Google Scholar]
  37. UNIBS. The UNIBS Anonymized 2009 Internet Traces. 18 March 2010. Available online: http://www.ing.unibs.it/ntw/tools/traces (accessed on 12 May 2011).
  38. Peng, J.; Zhou, Y.; Wang, C.; Yang, Y.; Ping, Y. Early TCP Traffic Classification. J. Appl. Sci.-Electron. Inf. Eng. 2011, 29, 73–77. [Google Scholar]
  39. Guo, C.; Zhou, Y.; Ping, Y.; Zhang, Z.; Liu, G.; Yang, Y. A Distance Sum-based Hybrid Method for Intrusion Detection. Appl. Intell. 2014, 40, 178–188. [Google Scholar] [CrossRef]
  40. UCI. Kdd Cup 99 Intrusion Detection Dataset. 1999. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 10 February 2016).
  41. Hsieh, C.J.; Chang, K.W.; Lin, C.J.; Keerthi, S.S.; Sundararajan, S. A Dual Coordinate Descent Method for Large-scale Linear SVM. In Proceedings of the 25th International Conference on Machine Learning (ICML ’08), Helsinki, Finland, 5–9 July 2008; ACM: New York, NY, USA, 2008; pp. 408–415. [Google Scholar]
  42. Lee, D.; Jung, K.H.; Lee, J. Constructing Sparse Kernel Machines Using Attractors. IEEE Trans. Neural Netw. 2009, 20, 721–729. [Google Scholar]
  43. Lee, S.H.; Daniels, K.M. Cone Cluster Labeling for Support Vector Clustering. In Proceedings of the 6th SIAM Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006; pp. 484–488. [Google Scholar]
  44. Wang, C.D.; Lai, J.H. Position Regularized Support Vector Domain Description. Pattern Recogn. 2013, 46, 875–884. [Google Scholar] [CrossRef]
  45. Garcia, S.; Herrera, F. An Extension on “Statistical Comparisons of Classifiers over Multiple Datasets” for all Pairwise Comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694. [Google Scholar]
  46. Jun, S.H. Improvement of Support Vector Clustering using Evolutionary Programming and Bootstrap. Int. J. Fuzzy Logic Intell. Syst. 2008, 8, 196–201. [Google Scholar] [CrossRef] [Green Version]
  47. Guo, C.; Li, F. An Improved Algorithm for Support Vector Clustering based on Maximum Entropy Principle and Kernel Matrix. Expert Syst. Appl. 2011, 38, 8138–8143. [Google Scholar] [CrossRef]
  48. Chiang, J.H.; Hao, P.Y. A New Kernel-based Fuzzy Clustering Approach: Support Vector Clustering with Cell Growing. IEEE Trans. Fuzzy Syst. 2003, 11, 518–527. [Google Scholar] [CrossRef]
  49. Hua, J.; Shi, G.; Zhu, H.; Wang, F.; Liu, X.; Li, H. CAMPS: Efficient and Privacy-Preserving Medical Primary Diagnosis over Outsourced Cloud. Inf. Sci. 2019. [Google Scholar] [CrossRef]
  50. Sumana, M.; Hareesha, K. Modelling A Secure Support Vector Machine Classifier for Private Data. Int. J. Inf. Comput. Secur. 2018, 10, 25–41. [Google Scholar] [CrossRef]
  51. Jia, Q.; Guo, L.; Jin, Z.; Fang, Y. Preserving Model Privacy for Machine Learning in Distributed Systems. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 1808–1822. [Google Scholar] [CrossRef]
  52. Liu, X.; Choo, K.R.; Deng, R.H.; Lu, R.; Weng, J. Efficient and Privacy-Preserving Outsourced Calculation of Rational Numbers. IEEE Trans. Dependable Secure Comput. 2018, 15, 27–39. [Google Scholar] [CrossRef]
  53. Salinas, S.; Luo, C.; Liao, W.; Li, P. Efficient Secure Outsourcing of Large-scale Quadratic Programs. In Proceedings of the ASIA CCS ’16: 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 31 May–3 June 2016; ACM: New York, NY, USA, 2016; pp. 281–292. [Google Scholar]
  54. Gonzalez-Serrano, F.J.; Navia-Vazquez, A.; Amor-Martin, A. Training Support Vector Machines with privacy-protected data. Pattern Recogn. 2017, 72, 93–107. [Google Scholar] [CrossRef]
Figure 1. Architecture of secure outsourcing each phase of SVC.
Figure 2. Diagram for the principles of convex decomposition and connectivity analysis: (a) convex decomposition; and (b) connectivity analysis.
Figure 3. The flow diagram for the approaches of MPPSVC. To keep consistent with Table 1, we use Steps 1, 2, 3, and 5 to denote the four steps of MPPSVC.
Figure 4. Sensitivity analysis: variations of ARI with respect to η1 ∈ [0.8, 0.9] with step 0.01.
Figure 5. ARIs corresponding to different iteration numbers.
Figure 6. Runtime and storage performance with respect to kernel utilization (pre-computed vs. calculated on demand).
Figure 7. Runtime comparisons of the training phase and labeling phase for the state-of-the-art methods on 12 datasets: (a) wisconsin; (b) glass; (c) wine; (d) mLibras; (e) P2P-T; (f) WebKB; (g) abalone; (h) Reuters; (i) Oh; (j) 20NG; (k) sh; and (l) kddcup99.
Figure 8. Difference between the captured cluster number (when each method achieves its best accuracy) and the real number of classes, in terms of percentage (+/−%).
Table 1. Time complexities of the proposed methods.

| | Classical SVC [1] | RSVC-EO (Pre-Computed Q̃) | RSVC-EO (Instance K(·,·)) | MPPSVC (Client) | MPPSVC (Server) |
|---|---|---|---|---|---|
| Step 1. DCD solver for the coefficients indicating SVs * | O(N³) | O(N²) | O(dN²) | O(N) | O(N²) |
| Step 2. Convex decomposition in one iteration using SVs | — | O(N_SV²) | O(dN_SV²) | O(N_SV²) | O(dN_SV²) |
| Step 3. Connectivity analysis with prior knowledge | O(m·N_SV²·N_SV) | O(ρN_CH) | O(ρN_CH) | O(N_CH) | O(ρN_CH) |
| Step 4. Find the connected components by DFS | O(N_SV²) | — | — | — | — |
| Step 5. Labeling the remaining data samples | O(dN·N_SV) | O(dN·N_SV) or O(dN·N_CH) | O(dN·N_SV) or O(dN·N_CH) | O(N·N_SV) or O(N·N_CH) | O(dN·N_SV) or O(dN·N_CH) |

Notice: 1. * For classical SVC, the dual problem is solved in the traditional way; 2. ρ ∈ (2, 3], ℓ ∈ {1, d}; “—” means a non-existent step.
Table 2. Description of the benchmark datasets.

| Datasets | Dims | Size | # of Classes |
|---|---|---|---|
| wisconsin | 9 | 683 | 2 |
| glass | 9 | 214 | 7 |
| wine | 13 | 178 | 3 |
| movement_libras (mLibras) | 90 | 360 | 15 |
| P2P traffic (P2P-T) | 4 | 9206 | 4 |
| WebKB | 4 | 4199 | 4 |
| abalone | 7 | 4177 | 29 |
| Reuters-21578 (Reuters) | 10 | 9990 | 10 |
| Ohsumed (Oh) | 23 | 13,929 | 23 |
| 20Newsgroups (20NG) | 20 | 13,998 | 20 |
| shuttle (sh) | 9 | 43,500 | 7 |
| kddcup99 | 9 | 494,021 | 5 |
Table 3. Accuracies (ARI) achieved by the state-of-the-art methods on 12 datasets.

| Dataset | CG | RCG | E-SVC | CCL | FSVC | PSVC | CDCL | VCC | FSSVC | FRSVC | RSVC-EO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| wisconsin | 0.7793 | 0.8035 | 0.1344 | 0.9076 | 0.6687 | 0.2574 | 0.8685 | 0.8029 ± 0.0514 | **0.9248** | 0.8798 | 0.8632 |
| glass | 0.2771 | 0.2823 | 0.2743 | 0.2201 | 0.2458 | 0.2801 | 0.2911 | 0.2771 ± 0.0031 | 0.2998 | 0.2582 | **0.3540** |
| wine | 0.5912 | 0.7928 | 0.4159 | 0.8190 | 0.8042 | 0.3809 | 0.8961 | 0.8088 ± 0.0563 | **0.8992** | 0.8483 | 0.8185 |
| mLibras | 0.2422 | 0.2356 | — | 0.0899 | 0.1421 | 0.2541 | 0.3320 | 0.3065 ± 0.0181 | 0.3703 | 0.3366 | **0.3732** |
| P2P-T | — | 0.8815 | — | — | 0.8367 | — | **0.8917** | 0.7389 ± 0.0066 | 0.8815 | 0.8678 | 0.8807 |
| WebKB | — | 0.3072 | — | — | 0.5144 | — | 0.4645 | 0.4434 ± 0.0156 | 0.5670 | **0.6395** | 0.5381 |
| abalone | — | 0.0332 | — | — | 0.0515 | — | 0.0603 | 0.0710 ± 0.0011 | 0.0587 | 0.0657 | **0.0996** |
| Reuters | — | 0.0148 | — | — | 0.4775 | — | 0.8064 | 0.4908 ± 0.0616 | 0.5831 | 0.7295 | **0.8338** |
| Oh | — | — | — | — | — | — | — | 0.4280 ± 0.0292 | 0.4514 | 0.4840 | **0.5019** |
| 20NG | — | — | — | — | — | — | — | 0.4397 ± 0.0461 | 0.3628 | 0.4927 | **0.6084** |
| sh | — | — | 0.59 [12] | — | 0.58 [12] | — | — | 0.5898 ± 0.0198 | 0.6857 | **0.8050** | 0.7337 |
| kddcup99 | — | — | — | — | — | — | — | — | — | — | **0.7621** |

Note: “—” means not available due to insufficient memory or excessive time consumption; the first rank on each dataset is highlighted in boldface.
Table 4. Comparison under non-parametric statistical test.

Control method: RSVC-EO, average rank = 2.1667.

| Methods | Average Ranks | Unadjusted p | p_Homm |
|---|---|---|---|
| CG | 8.3333 | 5.2539 × 10⁻⁶ | 4.2031 × 10⁻⁵ |
| RCG | 6.7500 | 7.1174 × 10⁻⁴ | 0.0036 |
| E-SVC | 8.7083 | 1.3562 × 10⁻⁶ | 1.3562 × 10⁻⁵ |
| CCL | 8.0417 | 1.4315 × 10⁻⁵ | 1.0020 × 10⁻⁴ |
| FSVC | 7.1250 | 2.5028 × 10⁻⁴ | 0.0015 |
| PSVC | 8.4583 | 3.3728 × 10⁻⁶ | 2.6982 × 10⁻⁵ |
| CDCL | 4.6250 | 0.0694 | 0.2083 |
| VCC | 5.5000 | 0.0138 | 0.0553 |
| FSSVC | 2.9167 | 0.5796 | 0.5796 |
| FRSVC | 3.3749 | 0.3722 | 0.5796 |
Table 5. ARIs achieved in ED for various scaling factors γ.

| Dataset | γ = 10¹ | γ = 10² | γ = 10³ | γ = 10⁴ | γ = 10⁵ |
|---|---|---|---|---|---|
| wisconsin | 0.8254 | 0.8519 | 0.8519 | 0.8519 | 0.8519 |
| glass | 0.1817 | 0.3197 | 0.3540 | 0.3540 | 0.3540 |
| wine | 0.4272 | 0.8185 | 0.8185 | 0.8185 | 0.8185 |
| mLibras | 0.3281 | 0.3264 | 0.3415 | 0.3415 | 0.3415 |
| P2P-T | 0.7757 | 0.7938 | 0.8048 | 0.8048 | 0.8048 |
| WebKB | 0.4824 | 0.5235 | 0.4428 | 0.4795 | 0.4795 |
| abalone | 0.0870 | 0.0915 | 0.0917 | 0.0917 | 0.0917 |
| Reuters | 0.7860 | 0.7284 | 0.7063 | 0.7062 | 0.7062 |
| Oh | 0.4851 | 0.4954 | 0.4898 | 0.4898 | 0.4898 |
| 20NG | 0.4752 | 0.5120 | 0.5171 | 0.5169 | 0.5169 |
| sh | 0.6586 | 0.6346 | 0.6346 | 0.6596 | 0.6596 |
| kddcup99 | 0.6799 | 0.7629 | 0.7714 | 0.7620 | 0.7620 |
Table 6. Runtime (s) of each step in ED, respectively, consumed by the client (C) and the server (S).

| Dataset | Step 1 (C) | Step 1 (S) | Step 2 (C) | Step 2 (S) | Step 3 (C) | Step 3 (S) | Step 5 (C) | Step 5 (S) | #SVs | #Convex Hulls |
|---|---|---|---|---|---|---|---|---|---|---|
| wisconsin | 1.2524 | 0.0194 | 21.6361 | 43.4891 | 1.3200 | 9.4463 | 0.0200 | 1.4403 | 64 | 30 |
| glass | 0.7072 | 0.0059 | 14.4307 | 22.4403 | 1.2900 | 9.4809 | 0.0164 | 1.3288 | 44 | 31 |
| wine | 1.1567 | 0.0567 | 24.8310 | 40.4476 | 1.8900 | 21.4400 | 0.0244 | 3.0628 | 51 | 43 |
| mLibras | 4.3020 | 0.0186 | 195.1682 | 305.3468 | 0.8100 | 67.2533 | 0.0713 | 12.9299 | 62 | 23 |
| P2P-T | 4.6369 | 0.0805 | 103.4480 | 1577.1325 | 21.1350 | 356.6295 | 0.0112 | 8.0530 | 661 | 471 |
| WebKB | 0.7981 | 0.0057 | 41.7010 | 267.3922 | 6.1050 | 37.9054 | 0.0031 | 2.7794 | 266 | 137 |
| abalone | 0.9594 | 0.0131 | 49.0764 | 225.5012 | 2.6700 | 17.8104 | 0.0052 | 2.9876 | 189 | 69 |
| Reuters | 0.5403 | 0.0148 | 27.2192 | 59.0562 | 0.5250 | 4.5836 | 0.0081 | 1.4473 | 75 | 15 |
| Oh | 1.9096 | 0.0171 | 93.8732 | 255.4895 | 1.1250 | 18.1687 | 0.0175 | 4.8598 | 113 | 25 |
| 20NG | 0.8993 | 0.0140 | 45.5341 | 77.9957 | 0.8250 | 16.2751 | 0.0152 | 2.8590 | 63 | 25 |
| sh | 1.4285 | 0.0259 | 45.9365 | 166.4430 | 0.8550 | 6.2177 | 0.0103 | 2.3485 | 141 | 22 |
| kddcup99 | 0.4751 | 1.3404 | 24.8795 | 54.2872 | 0.6900 | 4.7367 | 0.0068 | 1.3015 | 74 | 17 |

Note: Step 5 can classify all arriving data one by one or in parallel; we report the consumption for a single sample for efficiency analysis.
