In this section, we briefly introduce clustering methods, homomorphic encryption, and the Paillier Cryptosystem used for our proposed privacy-preserving cluster learning model. We also introduce the evaluation metrics used to analyze the experimental results.
3.1. Data Clustering
Clustering can be described as grouping similar instances of a dataset into the same group based on particular features of the data. The evaluation criterion for a clustering model is that elements similar to each other should, as far as possible, be placed in the same group.
There are four different approaches for clustering algorithms: centroid-based, distribution-based, density-based, and connectivity-based.
Centroid-Based Clustering: Centroid-based clustering [19] represents a cluster by a central vector, which does not have to be an instance of the dataset. The required number of clusters must be predetermined, which is the main drawback of this type of algorithm (such as the K-Means algorithm). After the number of clusters is decided, the central vector of each cluster is computed iteratively, and each instance is assigned to the cluster whose central vector is nearest according to the chosen distance metric.
Distribution-Based Clustering: In this type of algorithm, instances of the training data are clustered according to their statistical characteristics. A cluster can be defined as the set of instances most likely belonging to the same distribution. One well-known mixture model for this method is the Gaussian mixture [20]. The dataset is initially modeled with a Gaussian mixture whose parameters are set arbitrarily, and the parameters are then optimized to fit the dataset. Since the algorithm is iterative, the output converges to an optimal model.
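As a minimal sketch of this iterative fitting, the snippet below uses scikit-learn's GaussianMixture on a synthetic dataset; the component count and initialization are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal sketch: distribution-based clustering with a Gaussian mixture.
# Assumes scikit-learn is installed; the dataset here is synthetic.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Parameters are initialized arbitrarily (here: randomly) and then
# optimized iteratively with expectation maximization until convergence.
gmm = GaussianMixture(n_components=3, init_params="random", random_state=0)
labels = gmm.fit_predict(X)
print(labels[:10])
```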
Density-Based Clustering: In this type of algorithm, clusters are defined by the areas of high instance density in the input dataset. Density-based clustering algorithms use different criteria for defining density; one well-known criterion is called density reachability [21]. Under this logic, the algorithm searches for instances in the dataset that lie within a certain distance threshold of each other and adds those instances to the same cluster.
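As a hedged illustration of this density-reachability logic, the sketch below uses scikit-learn's DBSCAN (one density-based algorithm, chosen here only as an example; the eps and min_samples values are arbitrary) to grow clusters from instances within a distance threshold of each other.

```python
# Minimal sketch: density-based clustering with DBSCAN.
# Instances within distance `eps` of a core point are placed in the
# same cluster; isolated instances are labeled -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids, with -1 marking noise points
```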
Connectivity-Based Clustering: This clustering method builds a cluster model by collecting instances of an input dataset that are more strongly connected to nearby instances than to instances farther away. In this approach, clusters are built based on distance metrics, so a cluster can be defined by the maximum distance needed to connect its instances. Different clusters are built at different distances; as a result, this approach can be visualized with a dendrogram, because these types of clustering methods yield a hierarchical model, in which a cluster can also be merged with another cluster.
There are a variety of clustering algorithms. In our work, we analyzed the K-Means, Hierarchical, Spectral, and Birch clustering algorithms.
The K-Means algorithm builds a cluster model based on distance metrics between input instances. The algorithm is generally applied when the data has a flat geometry, only a few clusters are created, and the clusters are evenly sized.
The Hierarchical algorithm builds a cluster model based on the pairwise distance metrics between input instances. This algorithm is generally applicable when a large number of clusters is created and when there are connectivity constraints on building clusters.
The Spectral algorithm builds a cluster model based on nearest neighbors [22]. This algorithm is generally applicable when the input data has a non-flat geometry, only a few clusters are created, and the clusters are evenly sized.
The Birch algorithm builds a cluster model based on the Euclidean distance between input instances. This algorithm is generally applicable when the data is large and data reduction, with respect to outlier removal, is needed.
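For concreteness, the following minimal sketch runs the four analyzed algorithms on the same synthetic data using scikit-learn; the dataset and parameter values are illustrative assumptions, not those used in our experiments.

```python
# Minimal sketch: the four clustering algorithms analyzed in this work.
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, Birch
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=4),
    "Spectral": SpectralClustering(n_clusters=4, affinity="nearest_neighbors", random_state=42),
    "Birch": Birch(n_clusters=4),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, labels[:10])
```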
3.2. Evaluation Metrics
In this paper, our proposed system is evaluated in consideration of six different metrics.
Homogeneity: This metric is fulfilled only if each cluster contains members of a single class [23]. Thus, each cluster consists only of instances of a single class, and the class distribution within each cluster should collapse to a single class. The metric takes a value between 0 and 1; if the clustering is done in the most ideal way, the homogeneity value equals 1. The homogeneity value can be evaluated to assess how well elements belonging to the same class are grouped in the same cluster. The deviation from the ideal case can be expressed by the conditional entropy $H(C|K)$, which is equal to 0 when there is a perfect clustering. The value of $H(C|K)$ is dependent on the size of the dataset; therefore, instead of using this value directly, the normalized version $H(C|K)/H(C)$ is used. Thus, with 1 as the desirable output and 0 as the undesirable state, homogeneity can be defined as:

$$ h = 1 - \frac{H(C|K)}{H(C)} $$

where

$$ H(C|K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{n_{c,k}}{n} \log \frac{n_{c,k}}{n_k}, \qquad H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{n} \log \frac{n_c}{n}, $$

with $n$ the number of instances, $n_c$ and $n_k$ the numbers of instances in class $c$ and cluster $k$, respectively, and $n_{c,k}$ the number of instances of class $c$ assigned to cluster $k$.
Completeness is the symmetrical counterpart of the homogeneity metric, expressing whether instances belonging to the same class end up in the same cluster. If all instances belonging to a class are represented in the same cluster as the result of the clustering, this metric takes its ideal value of 1. If the grouping is far from the ideal state, the value of this metric approaches 0. In order to satisfy this criterion, all elements of each class should be assigned to a single cluster. The distribution of cluster labels within each class is used to analyze the completeness metric. In ideal conditions, the conditional entropy $H(K|C)$ will be 0 and completeness equals 1. The worst-case conditions occur when each class is split over every cluster; in this case, the completeness value will be 0. Completeness can be defined as:

$$ c = 1 - \frac{H(K|C)}{H(K)} $$

where

$$ H(K|C) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \log \frac{n_{c,k}}{n_c}, \qquad H(K) = -\sum_{k=1}^{|K|} \frac{n_k}{n} \log \frac{n_k}{n}. $$
V-Measure is a metric balancing the homogeneity and completeness criteria. It is calculated as the harmonic mean of the homogeneity and completeness values of the cluster model, and its value lies between 0 and 1. As described in the homogeneity and completeness sections above, these two metrics pull in opposite directions: an increase in the homogeneity value typically results in a decrease in the completeness value, and vice versa. The V-Measure is shown as:

$$ v = 2 \cdot \frac{h \cdot c}{h + c} $$
Adjusted Rand Index: The Rand Index (RI) metric measures the similarity between two clustering label assignments, and it must be computed in order to obtain the Adjusted Rand Index (ARI). While the Rand Index varies between 0 and 1, the ARI can also take negative values. The ARI metric is shown as:

$$ ARI = \frac{RI - \mathrm{E}[RI]}{\max(RI) - \mathrm{E}[RI]} $$
ARI becomes 0 when the clustering is done randomly and independent of the number of clusters; however, if the clusters are similar or identical, then the metric value becomes 1. This metric is also symmetrical.
Adjusted Mutual Information: For this metric, as with the ARI, mutual information describes how much information is shared between different clusterings; thus, adjusted mutual information can be considered a similarity metric. In our paper, adjusted mutual information (AMI) measures the number of mutual instances between different clusterings. This metric value becomes 1 when the clusterings are completely identical; when the clusterings are independent from each other, the metric value equals 0, meaning no information is shared between them. AMI is the adjusted form of mutual information. The mutual information value is shown as:

$$ MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\,P'(j)} $$

where $P(i)$ and $P'(j)$ are the probabilities that an instance falls into cluster $U_i$ and cluster $V_j$, respectively, and $P(i, j)$ is the probability that an instance falls into both.
This metric is also symmetrical, as with the ARI metric.
The Silhouette Coefficient metric is calculated for each instance in a dataset using both the mean intra-cluster distance and the mean nearest-cluster distance, i.e., the distance between an instance and the nearest cluster that the instance is not a part of. The silhouette coefficient is shown as:

$$ s = \frac{b - a}{\max(a, b)} $$

where $b$ is the mean distance between an instance and the instances of the nearest cluster (which the instance does not belong to) and $a$ is the mean of the distances within the cluster that the instance is a part of. The metric for a dataset is the mean of the silhouette coefficient values over all its instances.
The Silhouette Coefficient can take values between −1 and 1. When the Silhouette Coefficient of an instance is closer to −1, it is more likely that the instance is in the wrong cluster. Considered over all instances in the dataset, the closer the mean value moves to −1, the less accurate the clustering, and the more likely it is that instances are assigned to wrong clusters. On the contrary, if the metric value gets closer to 1, the clustering model is more accurate. When the metric value is near 0, the clusters overlap.
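The following minimal sketch computes all six evaluation metrics with scikit-learn; the dataset and the fitted model are synthetic stand-ins, assumed here only for illustration.

```python
# Minimal sketch: the six evaluation metrics used in this paper.
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, labels_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Homogeneity :", metrics.homogeneity_score(labels_true, labels_pred))
print("Completeness:", metrics.completeness_score(labels_true, labels_pred))
print("V-Measure   :", metrics.v_measure_score(labels_true, labels_pred))
print("ARI         :", metrics.adjusted_rand_score(labels_true, labels_pred))
print("AMI         :", metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# The silhouette coefficient needs the data itself, not the true labels.
print("Silhouette  :", metrics.silhouette_score(X, labels_pred))
```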
3.3. Homomorphic Encryption
Homomorphic encryption allows computing on encrypted data while achieving the same results as on the plain version of the data. The most important feature of this type of cryptographic scheme is that it preserves the privacy of sensitive data [24], since it allows working on the encrypted data instead of its plain form. Homomorphic encryption schemes can also be used to connect different kinds of services without risking the exposure of sensitive data. Homomorphic encryption systems can be divided into two groups: partially homomorphic and fully homomorphic algorithms. In our research, we applied the partially homomorphic, additive Paillier cryptosystem.
One can say that a public-key encryption scheme is additively homomorphic if, given two encrypted messages such as $E(m_1)$ and $E(m_2)$, there exists a public-key summation operation $\oplus$ such that $E(m_1) \oplus E(m_2)$ is an encryption of the plaintext $m_1 + m_2$. The formal definition is that an encryption scheme is additively homomorphic if, for any private key and public key pair and plaintext space $\mathcal{M}$,

$$ E(m_1) \oplus E(m_2) = E(m_1 + m_2) \quad \text{for all } m_1, m_2 \in \mathcal{M}. $$
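As a concrete sketch of this property, the snippet below uses the open-source python-paillier package (imported as phe); this is one possible implementation assumed for illustration, not necessarily the one used in our system.

```python
# Minimal sketch: additive homomorphism with the python-paillier (phe) package.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

c1 = public_key.encrypt(15)  # E(m1)
c2 = public_key.encrypt(27)  # E(m2)

# The library overloads `+` as the public-key summation operation.
c_sum = c1 + c2

assert private_key.decrypt(c_sum) == 15 + 27  # E(m1) + E(m2) decrypts to m1 + m2
```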
3.3.1. Paillier Cryptosystem
The Paillier cryptosystem [25] is an asymmetric, probabilistic, public-key cryptosystem. It preserves privacy [26] based on the computational hardness of the problem of computing n-th residue classes. The Paillier cryptosystem is partially homomorphic, which means that multiplying the encryptions of two plain datasets $m_1$ and $m_2$ under a public key $K$ gives the same result as encrypting the sum of the same two datasets, $m_1 + m_2$, under the same public key $K$. The encryption algorithm performs two main jobs in order: the first is key generation and the second is the encryption/decryption of the dataset.
The Paillier cryptographic system has partial homomorphic properties, which make the algorithm more resistant to data exposure; it is therefore used in several fields that handle sensitive data, such as medicine, finance, and the military. These partial homomorphic properties are:
Addition of two encrypted messages: The result of adding two encrypted messages is exactly the same as the encryption of the result of adding the two messages in plain form.
Multiplication of an encrypted message by a non-encrypted number: Multiplying an encrypted message by a number N gives the same result as multiplying the plain form of that message by the same number N and encrypting the result.
Let us define a set of possible plaintext messages $M$ and a set of secret and public key pairs $(sk, pk)$, where $pk$ is the public key and $sk$ is the secret key of the cryptosystem. Then the Paillier homomorphic encryption cryptosystem satisfies the following properties for any two plaintext messages $m_1, m_2 \in M$ and a constant value $a$:

$$ D_{sk}\big(E_{pk}(m_1) \cdot E_{pk}(m_2)\big) = m_1 + m_2, \qquad D_{sk}\big(E_{pk}(m_1)^{a}\big) = a \cdot m_1, $$

where $E_{pk}$ denotes encryption under the public key and $D_{sk}$ decryption under the secret key.
One of the main properties of the Paillier cryptosystem is that it is based on a probabilistic encryption algorithm. Large-scale datasets are often sparse matrices, in which most of the elements are zero. To prevent the guessing of elements of the input dataset, the Paillier cryptosystem uses probabilistic encryption, which does not encrypt two equal plaintexts under the same encryption key into the same ciphertext.
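The two partial homomorphic properties and this probabilistic behavior can be checked with a short sketch, again assuming the python-paillier (phe) package; the plaintext values are arbitrary examples.

```python
# Minimal sketch: Paillier's partial homomorphic properties and its
# probabilistic encryption, using the python-paillier (phe) package.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

m1, m2, a = 7, 5, 3
c1, c2 = public_key.encrypt(m1), public_key.encrypt(m2)

# Property 1: addition of two encrypted messages.
assert private_key.decrypt(c1 + c2) == m1 + m2
# Property 2: multiplication of an encrypted message by a plain constant.
assert private_key.decrypt(c1 * a) == m1 * a

# Probabilistic encryption: the same plaintext and key yield different
# ciphertexts, so zeros in a sparse matrix cannot be guessed.
print(public_key.encrypt(0).ciphertext() != public_key.encrypt(0).ciphertext())  # True
```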
3.3.2. Floating Point Numbers
As the Paillier encryption system works only on integer values, the proposed protocols are only capable of handling integers. This is an obstacle to applying these algorithms, since most real datasets contain continuous values. Consequently, when the input dataset of the protocol contains real numbers, we need to map the floating point input data vectors into the discrete domain with a conversion function (i.e., scaling).
The number of digits in the fractional part of a floating point number can be very large, and processing all of them would require more processing power and processing time. For this reason, only the first five fractional digits are used in this work; fractional digits of the input data beyond the fifth are discarded, and digits generated as a result of computational work are rounded to five digits. This approach introduces only minimal data loss and deviation in the computations. The effect on the calculation results depends on the clustering algorithm used, but overall it does not affect the effectiveness of the algorithms.
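A minimal sketch of this scaling step follows; it is pure Python, and the function names are ours, introduced only for illustration.

```python
# Minimal sketch: mapping floating point values to the integer domain
# by keeping five fractional digits, as done in this work.
SCALE = 10 ** 5  # five fractional digits

def to_fixed(x: float) -> int:
    """Round to five fractional digits and represent as an integer."""
    return round(x * SCALE)

def from_fixed(v: int) -> float:
    """Map a scaled integer back to a float."""
    return v / SCALE

x = 3.14159265
v = to_fixed(x)          # 314159 -- digits beyond the fifth are dropped
print(v, from_fixed(v))  # 314159 3.14159
```

The scaled integers can then be passed directly to an integer-only scheme such as Paillier, e.g., by encrypting to_fixed(x) instead of x.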