Article

Hierarchical Clustering via Single and Complete Linkage Using Fully Homomorphic Encryption †

Lynin Sokhonn, Yun-Soo Park and Mun-Kyu Lee
1 Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
2 Department of Computer Engineering, Inha University, Incheon 22212, Republic of Korea
* Author to whom correspondence should be addressed.
This paper is an extended version of a previous work presented at ICNGC 2023, titled “Hierarchical Clustering via Single Linkage using Homomorphic Encryption”, which introduces a collaborative approach for performing hierarchical clustering via a single-linkage method. The current version extends the research to cover the complete linkage method, demonstrating the correctness of the proposed clustering method in terms of its consistency with the clustering method performed on plaintext.
Sensors 2024, 24(15), 4826; https://doi.org/10.3390/s24154826
Submission received: 5 June 2024 / Revised: 20 July 2024 / Accepted: 23 July 2024 / Published: 25 July 2024
(This article belongs to the Collection Cryptography and Security in IoT and Sensor Networks)

Abstract

Hierarchical clustering is a widely used data analysis technique. Typically, tools for this method operate on data in its original, readable form, raising privacy concerns when a clustering task involving sensitive data that must remain confidential is outsourced to an external server. To address this issue, we developed a method that integrates Cheon-Kim-Kim-Song homomorphic encryption (HE), allowing the clustering process to be performed without revealing the raw data. In hierarchical clustering, the two nearest clusters are repeatedly merged until the desired number of clusters is reached. The proximity of clusters is evaluated using various metrics. In this study, we considered two well-known metrics: single linkage and complete linkage. Applying HE to these methods involves sorting encrypted distances, which is a resource-intensive operation. Therefore, we propose a cooperative approach in which the data owner aids the sorting process and shares a list of data positions with a computation server. Using this list, the server can determine the clustering of the data points. The proposed approach ensures secure hierarchical clustering using single and complete linkage methods without exposing the original data.

1. Introduction

Clustering, also referred to as cluster analysis, is a key area of study that is particularly significant in fields such as image analysis, pattern recognition, and machine learning [1]. It serves as an exploratory data analysis technique, categorizing data into distinct groups or subsets, where elements within each subset are more similar to each other than to elements in different subsets. A primary application of clustering is assigning labels to previously unlabeled data, especially when there is no prior knowledge of their groupings [2]. Many clustering algorithms have been introduced by researchers and are frequently used in various applications. Among these, partitional and hierarchical clustering are the most popular [3,4]. The partitional approach segments a dataset directly using a specific objective function, whereas hierarchical clustering gradually creates distinct clusters. Hierarchical methods generally follow either an agglomerative path or a divisive approach. Agglomerative clustering begins with individual data points as unique clusters and develops a hierarchical structure by continuously combining these clusters in a bottom-up fashion. In contrast, divisive clustering starts with all data points in one collective cluster and breaks them down gradually [4]. Among the hierarchical techniques, agglomerative hierarchical clustering stands out for its time efficiency and enhanced computational stability [1]. In agglomerative hierarchical clustering, the two nearest clusters are consistently combined until either all points are within a single cluster or the desired number of clusters is reached [5]. The definition of “nearest” may vary. In this study, two primary distance metrics were considered: single and complete linkages. For further details, refer to Equations (1) and (2).
Clustering is a fundamental method in data analysis, but a common challenge is that data are typically processed in their original, unencrypted form, posing risks to sensitive information. In resource-constrained environments such as Internet of Things (IoT) and sensor data applications, clustering tasks are often outsourced to external servers, necessitating robust data protection measures. Encryption offers a reliable solution to safeguard sensitive data. In particular, homomorphic encryption (HE) [6] enables computations on encrypted data without decryption, ensuring confidentiality. HE allows mathematical operations to be performed on two ciphertexts, and the decrypted result is identical to that obtained when the same operations are performed on the plaintexts. The Cheon-Kim-Kim-Song (CKKS) scheme stands out in this field because it allows both addition and multiplication operations on encrypted data using an approximation-based arithmetic approach [7].
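Stated schematically (a simplified illustration; in CKKS these equalities hold only approximately because of the scheme's controlled noise), the homomorphic property reads
$$\mathrm{Dec}\big(\mathrm{Enc}(a) \oplus \mathrm{Enc}(b)\big) = a + b, \qquad \mathrm{Dec}\big(\mathrm{Enc}(a) \otimes \mathrm{Enc}(b)\big) = a \cdot b,$$
where $\oplus$ and $\otimes$ denote homomorphic addition and multiplication of ciphertexts.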
This study proposes a method that combines agglomerative hierarchical clustering using single and complete linkages with the benefits of HE. This ensures that data can be grouped appropriately without revealing their original forms. However, sorting encrypted data, a necessary step for both single- and complete-linkage clustering, poses challenges. To address this issue, we introduce a joint approach where the data owner assists in sorting and shares a list indicating the positions of the data points with the server. With this guidance, the server can accurately group data points. This approach integrates privacy preservation measures into hierarchical clustering while ensuring the confidentiality of the data involved.

2. Preliminaries

2.1. Agglomerative Hierarchical Clustering

In this paper, the term “agglomerative hierarchical clustering” will be referred to simply as “hierarchical clustering”. The process of hierarchical clustering is outlined in the following steps.
Step 1: First, each data point is regarded as its own separate cluster, resulting in a total of n distinct clusters.
Step 2: As the procedure progresses, the two nearest clusters are combined into one. For instance, given a set of clusters labeled $C_1, C_2, \ldots, C_n$, when the two closest clusters $C_i$ and $C_j$ are determined, they are merged to create a new cluster, $C_{ij}$.
Step 3: After merging $C_i$ and $C_j$, they are replaced in the set by $C_{ij}$, reducing the number of clusters by one.
This merging process (Steps 2 and 3) is repeated until a single comprehensive cluster is formed, yielding a sequence of nested clusters. If necessary, merging can stop once a specified number of clusters k is reached.
In Step 2 of the clustering process, the proximity between two clusters can be determined using several methods. In this study, two main distance measurement methods were studied: single linkage and complete linkage.
For two clusters $C_i$ and $C_j$, the single linkage distance $D(C_i, C_j)$ is defined as the shortest Euclidean distance between a point in $C_i$ and a point in $C_j$. This is expressed as follows:
$$D(C_i, C_j) = \min \{\, \lVert x - y \rVert \;:\; x \in C_i,\ y \in C_j \,\}. \qquad (1)$$
A single linkage distance can be visualized as shown in Figure 1a. On the other hand, the complete linkage distance between two clusters is determined by the longest Euclidean distance among all pairs of points, as shown in
$$D(C_i, C_j) = \max \{\, \lVert x - y \rVert \;:\; x \in C_i,\ y \in C_j \,\}. \qquad (2)$$
The visualization of the distance between two clusters defined by complete linkage is illustrated in Figure 1b.
In Equations (1) and (2), $\lVert x - y \rVert$ denotes the Euclidean distance between $x$ and $y$. These calculated distances are then used to construct a distance matrix, where the element at the $i$-th row and $j$-th column represents the distance between $C_i$ and $C_j$. During Step 3, when two clusters merge, the distance matrix is updated by recomputing the distances from the newly combined cluster to the others. While the distances from the merged cluster to the others must be updated with every merge, the distances amongst the other clusters remain unchanged [5].
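As a concrete plaintext illustration of Equations (1) and (2) (an example of our own, not part of the encrypted pipeline; the function name linkage_distances is introduced only here), both linkage distances between two small clusters can be computed as follows:

import numpy as np

def linkage_distances(Ci, Cj):
    # All pairwise Euclidean distances between points of Ci and points of Cj.
    dists = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)
    return dists.min(), dists.max()   # single linkage (Eq. 1), complete linkage (Eq. 2)

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[0.0, 3.0], [4.0, 3.0]])
print(*linkage_distances(Ci, Cj))     # 3.0 5.0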

2.2. Homomorphic Encryption

Homomorphic encryption (HE) [6] preserves the algebraic structure, allowing computations on encrypted data without requiring decryption. Fully homomorphic encryption supports an unlimited number of additions and multiplications, which are core operations for deriving more complex functions [7]. While schemes such as BGV [8,9] and BFV [8,10] primarily support operations on integers, the CKKS scheme broadens this scope to include real and complex numbers [11]. The CKKS scheme supports approximate operations, crucial for statistical analyses and machine learning.
The “Homomorphic Encryption for Arithmetic of Approximate Numbers” (HEaaN) library is a specialized implementation of the CKKS scheme, offering features such as key generation, encryption, decryption, and homomorphic operations [12]. In the CKKS scheme, data are represented as polynomials, which are divided into components referred to as slots. Each slot can independently hold a number, either complex or real, enabling parallel operations. In this study, the ciphertext obtained by encrypting a plaintext vector A is denoted simply by A when the meaning is clear from context. Arrays containing multiple elements, whether ciphertexts or plaintexts, are denoted by parentheses. The operation mult() represents element-wise multiplication between two ciphertexts or between a ciphertext and a plaintext. Similarly, add() and sub() denote element-wise addition and subtraction, respectively. Additionally, a ciphertext can be shifted either to the left or the right by a specified number of slots using the left_rotate() and right_rotate() functions.
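To make the slot semantics concrete, the following plaintext sketch emulates these element-wise operations with NumPy arrays standing in for ciphertext slots; it only mirrors the behavior of the homomorphic operations and does not call the HEaaN API itself:

import numpy as np

def mult(a, b):  return a * b                    # slot-wise multiplication
def add(a, b):   return a + b                    # slot-wise addition
def sub(a, b):   return a - b                    # slot-wise subtraction
def left_rotate(a, r):   return np.roll(a, -r)   # shift slots left by r positions
def right_rotate(a, r):  return np.roll(a, r)    # shift slots right by r positions

A = np.array([1.0, 2.0, 3.0, 4.0])
B = np.array([5.0, 6.0, 7.0, 8.0])
print(mult(A, B), left_rotate(A, 1))             # [ 5. 12. 21. 32.] [2. 3. 4. 1.]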

2.3. Privacy-Preserving Clustering

Recent advancements in cryptographic methods have spurred the development of privacy-preserving clustering algorithms. Much of this research has focused on centroid-based clustering, employing techniques such as HE, secure multiparty computation, or a combination thereof, to safeguard data privacy during clustering operations [13,14,15,16,17,18,19,20].
Additionally, density-based clustering methods have been adapted for encrypted environments to ensure privacy, enabling data grouping without direct access to raw data [21,22,23].
A smaller subset of studies has investigated hierarchical clustering within privacy-preserving frameworks. Meng et al. [24], for instance, integrated HE and multiparty computation to facilitate hierarchical clustering while maintaining data confidentiality throughout various stages of data processing.
Our research contributes to this field by implementing hierarchical clustering using the CKKS scheme of HE, an approach that has received comparatively little attention. This methodology allows us to perform hierarchical clustering directly on encrypted data, ensuring privacy throughout the entire data analysis process.

3. Proposed Approach

The goal of this study was to perform hierarchical clustering using HE. The proposed approach closely follows the standard clustering process. However, in Step 3, where distances between initially separate clusters remain unchanged but need updating in the distance matrix, we opted for sorting instead of recomputation.
With HE, data are represented as ciphertext blocks. The sorting function in HEaaN, though powerful, is computationally intensive and sorts only the values within ciphertext slots without preserving original index positions. When merging clusters, knowing both the distances and the original cluster indices is crucial. Therefore, we propose a collaborative approach involving the data owner. The data owner assists in sorting distances and provides the original index positions of the initial single clusters (data points). Since sorting alters the original positions, sharing these post-sorted positions does not compromise the confidentiality of encrypted data. The process begins with the client (the data owner) encrypting the data and transmitting it to the server for distance calculation. The server then sends intermediate results back to the client. After decrypting and sorting the distances, the client sends the sorted indices corresponding to these distances back to the server for clustering. The process flow is illustrated in Figure 2.
Suppose we have a ciphertext containing $n$ data points, where each data point has $N$ features (assuming $n$ and $N$ are powers of two for simplicity). The ciphertext is rotated to allow distance computation between all possible combinations of individual data points, as detailed in Algorithm 1. We denote the $i$-th data point as $P_{i-1} = (P_{i-1}^{0}, P_{i-1}^{1}, \ldots, P_{i-1}^{N-1})$.
Algorithm 1: Computation of DistanceList
Input: A ciphertext X = (P_0^0, P_0^1, ..., P_0^{N-1}, P_1^0, P_1^1, ..., P_1^{N-1}, ..., P_{n-1}^0, P_{n-1}^1, ..., P_{n-1}^{N-1})
Output: DistanceList = (D_0, D_1, ..., D_{n/2-1})
1: for i = 0 to n/2 − 1 do
2:     R_i ← left_rotate(X, (i + 1) · N)
3:     D_i ← D(X, R_i)
4:     append D_i to DistanceList
5: end for
6: return DistanceList
Algorithm 2 outlines the process of computing the Euclidean distance for each pair of data points. The output of this algorithm is the $nN$-dimensional ciphertext vector $(\lVert P_0 - P_{(1+i) \bmod n} \rVert^2, 0, 0, \ldots, 0,\ \lVert P_1 - P_{(2+i) \bmod n} \rVert^2, 0, 0, \ldots, 0,\ \ldots,\ \lVert P_{n-1} - P_{(n+i) \bmod n} \rVert^2, 0, 0, \ldots, 0)$, where $0, 0, \ldots, 0$ represents a sequence of $N-1$ zeros.
Algorithm 2: D(X, R_i): Computation of Euclidean distance
Input: Ciphertext X; R_i: the (i + 1)-th rotation of X
Output: Euclidean distance list D_i
1: D_i ← sub(X, R_i)
2: D_i ← mult(D_i, D_i)
3: for index = 0 to log_2(N) − 1 do
4:     d ← left_rotate(D_i, 2^index)
5:     D_i ← add(D_i, d)
6: end for
7: initialize T as a plaintext of 0s with length n · N
8: for index = 0 to n − 1 do
9:     T[index · N] ← 1
10: end for
11: D_i ← mult(D_i, T)
12: return D_i
In the computation of Euclidean distance, the result is a list of squared distances, not the distances themselves. Since the actual distance values are unnecessary and squared Euclidean distances increase monotonically with Euclidean distances, the list of squared distances is sufficient for subsequent sorting tasks. After computing DistanceList, it is forwarded to the data owner for sorting.
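The rotation-and-mask pattern of Algorithms 1 and 2 can be emulated on plaintext vectors as in the sketch below, where np.roll stands in for the homomorphic rotations and the helper name squared_distance_list is chosen only for illustration; the actual computation runs on ciphertext slots through HEaaN rather than NumPy arrays.

import numpy as np

def squared_distance_list(X, n, N):
    # X is the flattened n*N vector holding n points with N features each.
    mask = np.zeros(n * N)
    mask[::N] = 1.0                               # plaintext T: 1 at the first slot of each block
    distance_list = []
    for i in range(n // 2):                       # Algorithm 1, line 1
        R = np.roll(X, -(i + 1) * N)              # left_rotate(X, (i + 1) * N)
        D = (X - R) ** 2                          # sub followed by mult (squared differences)
        for index in range(int(np.log2(N))):      # Algorithm 2, lines 3-6: fold the N slots
            D = D + np.roll(D, -(2 ** index))     #   of each block into its first slot
        distance_list.append(D * mask)            # keep only slot j*N of each block
    return distance_list

points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # n = 4, N = 2
D = squared_distance_list(points.flatten(), 4, 2)
print(D[0][::2])   # [1. 1. 1. 1.]  squared distances from each point to its next neighbor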
The data owner sorts all distances within DistanceList and provides the server with the corresponding indices that match the sorted distances. Upon receiving these indices, the server begins the clustering process based on the sorted index list returned by the data owner.
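On the client side, this amounts to mapping each populated slot back to its pair of point indices and returning the pairs in ascending order of squared distance. In a plaintext sketch (continuing the NumPy emulation above; the helper name sorted_index_pairs is ours), the mapping follows Algorithm 1: the value at slot j·N of D_i is the squared distance between points j and (j + i + 1) mod n.

def sorted_index_pairs(distance_list, n, N):
    entries = {}
    for i, D in enumerate(distance_list):
        for j in range(n):
            pair = tuple(sorted((j, (j + i + 1) % n)))
            if pair not in entries:               # the offset n/2 produces every pair twice
                entries[pair] = D[j * N]
    return [pair for pair, _ in sorted(entries.items(), key=lambda e: e[1])]

pairs = sorted_index_pairs(D, 4, 2)               # D from the previous sketch
print(pairs)   # [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2), (1, 3)]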

3.1. Single Linkage

Algorithm 3 presents the clustering procedure using single linkage, employing union-find operations to manage disjoint sets (i.e., clusters) of data points. In lines 4–20, the algorithm iteratively processes each pair from the sorted index pairs until the number of clusters reduces to the specified number, k. The algorithm begins by examining the first pair representing the two closest data points. The Find function, defined in lines 28–33, determines the cluster identifier, or the root node of the set tree to which the input element belongs. If two elements in a pair share the same root, indicating that they are already part of the same cluster, the pair is discarded. Conversely, if the roots are different, as checked in line 8, the two elements are combined to form a new cluster. Lines 12 and 13 update the root nodes of the newly formed cluster to reflect the new cluster identifiers. The original clusters that were merged are then cleared, indicating that their elements are part of the new cluster. Following the merging, the processed pair is removed from the list, and the algorithm proceeds to the next closest pair. This procedure is repeated until the desired number of clusters k is reached.
Algorithm 3: SingleLinkage(Pairs, d, k): Clustering via single linkage
Input: Pairs: a list of sorted index pairs, d: number of data points, k: desired number of clusters
Output: FinalClusters: a list of clusters
1: ClusterMap ← {i ↦ i | i ∈ Z_d}
2: Clusters ← [{i} | i ∈ Z_d]
3: c ← d, n ← d
4: while c > k do
5:     (i, j) ← Pairs[0]
6:     root1 ← Find(ClusterMap, i)
7:     root2 ← Find(ClusterMap, j)
8:     if root1 ≠ root2 then
9:         MergedCluster ← Clusters[root1] ∪ Clusters[root2]
10:        append MergedCluster to Clusters
11:        ClusterMap[n] ← n
12:        ClusterMap[root1] ← n
13:        ClusterMap[root2] ← n
14:        empty Clusters[root1]
15:        empty Clusters[root2]
16:        c ← c − 1
17:        n ← n + 1
18:    end if
19:    Pairs ← Pairs[1:]
20: end while
21: initialize FinalClusters as an empty list
22: for each cluster in Clusters do
23:     if cluster is not empty then
24:         append cluster to FinalClusters
25:     end if
26: end for
27: return FinalClusters
28: function Find(ClusterMap, i)
29:     while ClusterMap[i] ≠ i do
30:         i ← ClusterMap[i]
31:     end while
32:     return i
33: end function
Figure 3 illustrates an example of the clustering process using the single linkage method, where $k = 1$. Starting with four data points, each data point initially forms a separate cluster. The roots of these clusters point to identifiers that match the cluster labels. According to the Pairs list provided by the client, the first elements to be processed are $C_2$ and $C_3$. Because both elements point to different roots, they qualify for merging. This new combination is then added to the existing Clusters list, and the roots of $C_2$ and $C_3$ in ClusterMap are updated to a new cluster identifier, $C_4$. Following this, the elements that have already been merged are cleared from their previous positions in Clusters, and the processed pair is completely removed from the index pair list. After this merging process, three clusters remain. The next pair to be considered is $C_0$ and $C_2$. Although the root of $C_2$ has changed to $C_4$, the pair still points to different roots, qualifying them for merging. All elements of $C_4$ and $C_0$ are then added to Clusters, creating another cluster identifier, $C_5$, which updates the identifiers of the related clusters accordingly. After merging, the now irrelevant clusters are cleared from the cluster list, and the merged pair is removed. At this stage, two clusters, $C_1$ and $C_5$, remain. The next elements to be processed are $C_0$ and $C_3$. However, because of the previous merging, $C_0$ and $C_3$ have already formed one cluster, sharing the same root. Therefore, this pair is skipped, and the algorithm moves on to the next pair, $C_1$ and $C_3$. This pair can be combined, as they have separate roots, resulting in the creation of $C_6$ as a new cluster identifier. After following similar processing steps as in the previous merges, only two pairs remain in the Pairs list. Since these remaining pairs belong to the same cluster and every data point has now become part of one cluster, the procedure concludes. Consequently, all individual clusters, $C_2$, $C_3$, $C_0$, and $C_1$, are merged into one final cluster, as shown in Figure 3.
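For illustration, a direct plaintext transcription of Algorithm 3 into Python (our own rendering; the server operates only on the index pairs in this way) reproduces the four-point example of Figure 3:

def single_linkage(pairs, d, k):
    cluster_map = {i: i for i in range(d)}        # parent pointers (union-find forest)
    clusters = [{i} for i in range(d)]
    c, n = d, d

    def find(i):                                  # follow parent pointers up to the root
        while cluster_map[i] != i:
            i = cluster_map[i]
        return i

    while c > k:
        i, j = pairs[0]
        r1, r2 = find(i), find(j)
        if r1 != r2:                              # merge only clusters with distinct roots
            clusters.append(clusters[r1] | clusters[r2])
            cluster_map[n] = n
            cluster_map[r1] = cluster_map[r2] = n # both old roots now point to the new id
            clusters[r1], clusters[r2] = set(), set()
            c -= 1
            n += 1
        pairs = pairs[1:]                         # discard the processed pair
    return [cl for cl in clusters if cl]

# Pairs list of the Figure 3 example: (2, 3) is the closest pair, then (0, 2), (0, 3), (1, 3), ...
print(single_linkage([(2, 3), (0, 2), (0, 3), (1, 3), (0, 1), (1, 2)], 4, 1))   # [{0, 1, 2, 3}]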

3.2. Complete Linkage

Algorithm 4 demonstrates the clustering procedure using complete linkage. Similar to single linkage, the closest clusters are merged; however, the key difference lies in the distance used to represent the two clusters. The complete linkage distance is defined as the maximum distance between any two points in the clusters, as outlined in Equation (2). The primary objective of Algorithm 4 is to identify the closest clusters with the maximum inter-cluster distance, using a sorted index pair list. The DistanceUpdate function, defined in lines 21–56, accomplishes this by identifying the pair with the longest distance, removing pairs with shorter distances in the newly formed cluster, and updating the cluster identifiers to reflect the current state of cluster formation. When a list of sorted index pairs is provided, the first pair represents the shortest distance among all combinations of points or single clusters. Consequently, the first pair is combined to form a new cluster, and the cluster identifiers are updated. Figure 4 illustrates various scenarios in which cluster pairs are handled following the DistanceUpdate function. Initially, when an index list is presented by the client in ascending order according to their distances, the first two cluster identifiers, 2 and 3, representing the closest distance, are merged to create a new cluster with a new identifier, 4.
Algorithm 4: CompleteLinkage(Pairs, d, k): Clustering via complete linkage
Input: Pairs: a list of sorted index pairs, d: number of data points, k: desired number of clusters
Output: FinalClusters: a list of clusters
1: Clusters ← [{i} | i ∈ Z_d]
2: c ← d, n ← d
3: while c > k do
4:     (i, j) ← Pairs[0]
5:     MergedCluster ← Clusters[i] ∪ Clusters[j]
6:     append MergedCluster to Clusters
7:     empty Clusters[i]
8:     empty Clusters[j]
9:     Pairs ← DistanceUpdate(Pairs, n)
10:    Pairs ← Pairs[1:]
11:    c ← c − 1
12:    n ← n + 1
13: end while
14: initialize FinalClusters as an empty list
15: for each cluster in Clusters do
16:     if cluster is not empty then
17:         append cluster to FinalClusters
18:     end if
19: end for
20: return FinalClusters
21: function DistanceUpdate(Pairs, n)
22:     (i, j) ← Pairs[0]
23:     l ← length of Pairs list
24:     if l = 1 then
25:         Pairs[0] ← (n, n)
26:         return Pairs
27:     else
28:         initialize Deleted as an empty list
29:         initialize Visited as a list of 0s with length n
30:         for x = l − 1 down to 1 do
31:             (i′, j′) ← Pairs[x]
32:             if i′ = i or i′ = j then
33:                 if Visited[j′] = 1 then
34:                     append x to Deleted
35:                 else
36:                     Pairs[x] ← (n, j′)
37:                     Visited[j′] ← 1
38:                 end if
39:             else if j′ = i or j′ = j then
40:                 if Visited[i′] = 1 then
41:                     append x to Deleted
42:                 else
43:                     Pairs[x] ← (i′, n)
44:                     Visited[i′] ← 1
45:                 end if
46:             end if
47:         end for
48:         initialize NewPairs as an empty list
49:         for index = 0 to l − 1 do
50:             if index not in Deleted then
51:                 append Pairs[index] to NewPairs
52:             end if
53:         end for
54:         return NewPairs
55:     end if
56: end function
As a non-single cluster now exists, the index pair list must be adjusted; otherwise, the next pair to be merged in ascending order will not represent the maximum distance of the cluster. This adjustment involves searching the list backward to identify pairs related to the newly formed cluster. Starting from the end of the list, if a pair is not related to the formed cluster, this indicates that the pair is outside the cluster and should remain unchanged during the current step. Figure 4a shows the case where the longest distance, indicated by pairs 0 and 1, corresponds to the cluster identifiers that are not connected to the merged identifiers 2 and 3.
Conversely, if a pair is related to the cluster, meaning one of its elements belongs to the previously formed cluster, the first encounter of such a relation indicates that the pair represents the maximum distance from a single cluster to the formed cluster. Consequently, the cluster identifier is updated, and its counterpart element is marked as visited. In Figure 4b, the second-longest distance corresponds to the distance between clusters 1 and 2, where cluster 2 already belongs to a previously merged cluster. This represents the longest distance between cluster 1 and cluster 4. Therefore, cluster 2, which is no longer a single cluster but part of cluster 4, has its identifier updated to 4, transforming the pair (1, 2) into (1, 4).
As the process advances toward the front of the list, encountering another pair related to the formed cluster whose partner element, itself initially a single cluster, has already been visited indicates that this pair does not represent the maximum distance from the formed cluster to that element. Consequently, the pair is marked for deletion from the list. In Figure 4c, because the maximum distance between clusters 1 and 4 has already been identified, any pair representing the distance from cluster 1 to other elements of cluster 4 is removed, as it would correspond to a shorter distance between the two clusters. Once all pairs are processed and the search returns to the front of the list, the remaining pairs are valid, each representing the longest distance from the previously formed cluster to every other cluster. The clustering procedure then proceeds from the front of the list, followed by another backward search.
In Algorithm 4, the complete linkage clustering begins by sequentially processing each pair of data points or cluster identifiers, starting with the closest. The two clusters are merged into a new cluster, as shown in line 5. The original data points belonging to a cluster are no longer considered single clusters and must be repositioned to reflect their new identifiers. After merging, the cluster identifier and index pair list are updated using the DistanceUpdate function. This function compiles a new index pair list, excluding those pairs marked for deletion, thereby maintaining an updated and relevant list of pairs for further processing. The initial pair processed for clustering is then removed, and the clustering procedure is repeated until the number of remaining clusters is reduced to k. At the end of the process, FinalClusters includes only valid clusters that are not empty.
In Figure 5, Clusters is a list used to store each cluster as the procedure progresses. The process begins with the Pairs list provided by the client. According to this list, $C_2$ and $C_3$ are identified as the closest single clusters to be merged, forming a new cluster, $C_4$. After merging, the new combination of $C_2$ and $C_3$ is added to Clusters, and their previous separate cluster positions are emptied. The Pairs list is then scanned backward using the DistanceUpdate function mentioned in Algorithm 4 to update pairs connected to $C_2$ and $C_3$, and to remove pairs representing shorter Euclidean distances. Since $C_0$ and $C_1$ have no relation to the newly formed cluster, this pair can be safely skipped. Next, the pair $C_1$ and $C_2$ is considered. Although $C_2$ already belongs to the new cluster, this is the first encounter with $C_1$. Therefore, the cluster identifier of $C_2$ is updated to its new cluster, $C_4$, and $C_1$ is marked as visited. This pair represents the maximum distance from $C_4$ to $C_1$. The next pair to be considered is $C_1$ and $C_3$. Since $C_3$ belongs to $C_4$, this pair represents the distance from $C_4$ to another single cluster, which cannot be the maximum distance, given that the list is sorted in ascending order. Thus, the pair $C_1$ and $C_3$ is removed from the list. Following this, $C_0$ and $C_3$ are processed. Similar to the previous case, since $C_3$ is part of $C_4$ and $C_0$ has not been visited before, $C_3$ is updated to its new cluster identifier, $C_4$, and $C_0$ is marked as visited. For the pair $C_0$ and $C_2$, since $C_2$ is already part of the formed cluster and $C_0$ has been visited previously, this pair does not represent the maximum distance from $C_4$ to $C_0$. Therefore, this pair is also removed from the Pairs list. Now that all pairs in the list have been processed, the first pair that formed the cluster is omitted from the list. At this stage, three clusters remain after the initial merge. Using the updated Pairs list, the next closest pair identified is $C_0$ and $C_4$, forming a new cluster identifier, $C_5$. All elements from $C_4$ and $C_0$ are added to the cluster list, and their previous positions are cleared from Clusters. Starting the search from the back of the Pairs list, $C_0$ and $C_1$ are the first elements to be processed. Since $C_0$ now belongs to the new cluster $C_5$, and this is the first encounter with $C_1$, the distance represented by this pair must be the maximum distance between $C_5$ and $C_1$. Thus, the cluster identifier of $C_0$ is updated to $C_5$, and $C_1$ is marked as visited. In the next pair, $C_1$ and $C_4$, since $C_4$ is not a single cluster anymore and $C_1$ has already been visited, this pair is discarded. After this step, the merged pair, $C_0$ and $C_4$, is removed from Pairs, leaving only $C_5$ and $C_1$. With only one pair left to be combined, after merging, their cluster identifiers are updated to the same identifier. As a result, all four data points now become part of one final cluster.
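Similarly, a plaintext Python transcription of Algorithm 4 (our own rendering, with the pair read in the backward scan renamed (a, b) to avoid clashing with the merged pair (i, j)) reproduces the Figure 5 walkthrough:

def complete_linkage(pairs, d, k):
    clusters = [{i} for i in range(d)]
    c, n = d, d
    while c > k:
        i, j = pairs[0]
        clusters.append(clusters[i] | clusters[j])        # merge the current closest pair
        clusters[i], clusters[j] = set(), set()
        pairs = distance_update(pairs, n)                 # relabel/prune the remaining pairs
        pairs = pairs[1:]                                 # drop the pair that was merged
        c -= 1
        n += 1
    return [cl for cl in clusters if cl]

def distance_update(pairs, n):
    i, j = pairs[0]                                       # the pair being merged into cluster n
    if len(pairs) == 1:
        return [(n, n)]
    pairs = list(pairs)                                   # work on a copy
    deleted, visited = set(), set()
    for x in range(len(pairs) - 1, 0, -1):                # backward scan
        a, b = pairs[x]
        if a == i or a == j:                              # a lies inside the new cluster
            if b in visited:
                deleted.add(x)                            # shorter distance to b: drop it
            else:
                pairs[x] = (n, b)                         # longest distance from n to b: relabel
                visited.add(b)
        elif b == i or b == j:                            # b lies inside the new cluster
            if a in visited:
                deleted.add(x)
            else:
                pairs[x] = (a, n)
                visited.add(a)
    return [p for idx, p in enumerate(pairs) if idx not in deleted]

# Pairs list of the Figure 5 example, sorted by increasing distance.
print(complete_linkage([(2, 3), (0, 2), (0, 3), (1, 3), (1, 2), (0, 1)], 4, 1))   # [{0, 1, 2, 3}]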

4. Implementation

To verify the feasibility of the proposed approach, it was implemented on the Iris dataset [25] using the HEaaN library. The Iris dataset, widely recognized in the fields of machine learning and data analysis, comprises 150 data points, each with four features: sepal length, sepal width, petal length, and petal width. Due to the requirement that the ciphertext dimension be a power of two, only 128 data points were used for the implementation. The experiment was conducted using the FGb parameter preset in HEaaN and executed on an NVIDIA TITAN RTX GPU featuring 4608 CUDA cores. The GPU was sourced from NVIDIA Corporation, Santa Clara, CA, USA. The sort function in Python 3.8.16 was employed for client-side sorting. Figure 6 presents a scatter plot illustrating the distribution of the data points in the Iris dataset before clustering, specifically using features 1 (sepal length) and 2 (sepal width).
Figure 7a,b show the single-linkage clustering results for three clusters (k = 3), obtained using the existing SciPy library and our method, respectively. The entire process of our method, from data encryption to clustering, was completed in approximately 2.085 s. Similarly, Figure 8 compares the results of complete linkage clustering for the three clusters. Figure 8a displays the outcome from the SciPy library, while Figure 8b presents the results obtained using our method, completed in 2.918 s. Our clustering approach yielded results consistent with those produced by the widely used SciPy library.
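For reference, plaintext SciPy clusterings of the same 128 Iris points can be obtained along the following lines (a verification sketch of our own; the exact preprocessing behind the figures may differ):

import numpy as np
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data[:128]                            # 128 Iris points with 4 features each
for method in ("single", "complete"):
    Z = linkage(X, method=method)                     # plaintext hierarchical clustering
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram at k = 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes as a quick sanity check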
Further testing of the proposed clustering method was conducted using Scikit-learn to evaluate its performance across various desired numbers of clusters, ranging from 2 to 10 clusters. For each configuration, the method was iterated 10 times with 128 data points randomly sampled from the Iris dataset. The evaluation focused on detecting misassigned points in comparison to clusters generated by the Scikit-learn library. Consistency is defined as the absence of misassigned points across all iterations for each number of clusters. Throughout these tests, no misassigned points were observed in either single or complete linkage clustering, demonstrating the reliability and consistency of the proposed method across different numbers of clusters. This consistency highlights the suitability of our approach for diverse clustering scenarios compared to existing methods.
In addition to the Iris dataset, experiments were also conducted on other datasets, varying the numbers of features, data points, and clusters. One such dataset is the Breast Cancer Wisconsin (Diagnostic) dataset, which comprises 569 instances and 30 features [26]. For this experiment, 256 data points with eight features were sampled. Our clustering process, employing both single and complete linkage methods, yielded the following results: for k = 4, single linkage took approximately 3.392 s, while complete linkage took an average of 7.875 s. These results are compared in Figure 9, where (a) shows the clustering outcomes using the SciPy library and (b) illustrates the results from our approach. Our method consistently produced results similar to those from SciPy for both linkage methods.
Another set of tests was conducted on the UGRansome dataset [27], a network traffic dataset. We sampled 1024 instances, each with three features. To satisfy our clustering algorithm's requirement that the number of feature values per data point be a power of two, we added a fourth, zero-padded feature. Figure 10 presents the comparison of the clustering results for k = 2 clusters. Figure 10a displays the results from SciPy version 1.10.1, while Figure 10b shows the results obtained from our method. For our method, single linkage took 13.561 s, and complete linkage took 268.567 s.
Notably, a slight difference appeared in the clustering results obtained from the two methods. For instance, in the complete linkage clustering, data points at indexes 532, 832, and 970 were clustered into the second cluster (the orange cluster in Figure 10a), whereas they were assigned to the first cluster using our method (the red cluster in Figure 10b). This discrepancy is discussed further in Section 5.

5. Discussion

Table 1 compares our approach with prior studies on privacy-preserving clustering, outlining the methodologies and privacy-preserving techniques employed in each study [28]. Previous research has predominantly focused on centroid-based clustering, specifically the k-means algorithm [13,14,15,16,17,18,19,20]. Additionally, density-based clustering methods, such as mean shift [21] and DBSCAN [22,23], have been explored. Hierarchical clustering has also been investigated, particularly through the integration of HE and secure multi-party computation (MPC) [24]. In contrast, our study focuses exclusively on enhancing hierarchical clustering via HE.
The CKKS scheme is based on approximation, which inherently introduces precision errors [29]. For instance, in our experiments with the UGRansome dataset, these errors led to slight discrepancies in the computed distances, causing some data points to be clustered differently than in SciPy’s implementation. Despite these variations, particularly noticeable at small distance values, our method aligns with the fundamental principles of both single and complete linkage clustering.
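This effect can be illustrated with a constructed toy example, unrelated to the UGRansome data: when inter-point distances are nearly tied, a perturbation on the order of the approximation error may change the merge order and hence the final partition.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

# Four nearly collinear points whose smallest inter-point distances are almost tied.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0000001, 0.0], [3.0, 0.0]])
noise = np.random.default_rng(1).normal(scale=1e-4, size=points.shape)   # stand-in for CKKS noise

exact = fcluster(linkage(points, "complete"), t=2, criterion="maxclust")
noisy = fcluster(linkage(points + noise, "complete"), t=2, criterion="maxclust")
print(adjusted_rand_score(exact, noisy))   # below 1 whenever the near-tie is resolved differently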

6. Conclusions

Traditional hierarchical clustering methods often operate directly on raw data, which can expose sensitive information and compromise data privacy and security. In contrast, the use of HE ensures data privacy and reduces the risk of information leakage, providing a solution for preserving sensitive information during data analysis. By leveraging the HEaaN library, this study demonstrated a methodology for performing agglomerative hierarchical clustering in a privacy-preserving manner using both single and complete linkage methods.
Our experiment with various datasets served as a practical demonstration of the feasibility and effectiveness of the proposed approach. Both the single and complete linkage methods produced results that closely aligned with outcomes derived from widely used libraries that perform computations in plaintext. This alignment underscores the validity and reliability of the proposed clustering approach and its potential for real-world applications.

Author Contributions

Conceptualization, M.-K.L.; Methodology, L.S. and Y.-S.P.; Software, L.S. and Y.-S.P.; Validation, L.S. and Y.-S.P.; Writing—original draft, L.S.; Writing—review & editing, Y.-S.P. and M.-K.L.; Visualization, L.S.; Supervision, M.-K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government [Ministry of Science and ICT (MSIT)] under Grant No.RS-2023-00209294, in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant funded by the Korea Government (MSIT) under Grant No.RS-2022-00155915 and No.2022-0-01047, and in part by an Inha University Research Grant (2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, S.; Xu, Z.; Liu, F. Method for Determining the Optimal Number of Clusters Based on Agglomerative Hierarchical Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 3007–3017. [Google Scholar] [CrossRef] [PubMed]
  2. Havens, T.C.; Bezdek, J.C.; Palaniswami, M. Scalable Single Linkage Hierarchical Clustering for Big Data. In Proceedings of the IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), Melbourne, Australia, 2–5 April 2013; pp. 396–401. [Google Scholar]
  3. Lin, C.-R.; Chen, M.-S. Combining Partitional and Hierarchical Algorithms for Robust and Efficient Data Clustering with Cohesion Self-Merging. IEEE Trans. Knowl. Data Eng. 2005, 17, 145–159. [Google Scholar]
  4. Zhong, C.; Miao, D.; Fränti, P. Minimum Spanning Tree Based Split-and-Merge: A Hierarchical Clustering Method. Inf. Sci. 2011, 181, 3397–3410. [Google Scholar] [CrossRef]
  5. Zaki, M.J.; Meira, W., Jr. Data Mining and Machine Learning: Fundamental Concepts and Algorithms, 2nd ed.; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
  6. Gentry, C. Fully Homomorphic Encryption Using Ideal Lattices. In Proceedings of the Symposium on the Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009; ACM: New York, NY, USA, 2009; pp. 169–178. [Google Scholar]
  7. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Proceedings of the Advances in Cryptology—ASIACRYPT 2017, Hong Kong, China, 3–7 December 2017; Springer: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar]
  8. Marcolla, C.; Sucasas, V.; Manzano, M.; Bassoli, R.; Fitzek, F.H.P.; Aaraj, N. Survey on Fully Homomorphic Encryption, Theory, and Applications. Proc. IEEE 2022, 110, 1572–1609. [Google Scholar] [CrossRef]
  9. Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) Fully Homomorphic Encryption without Bootstrapping. In Proceedings of the Innovations in Theoretical Computer Science Conference, Cambridge, MA, USA, 8–10 January 2012; ACM: New York, NY, USA, 2012; pp. 309–325. [Google Scholar]
  10. Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptol. ePrint Arch. 2012, 2012, 144. [Google Scholar]
  11. Cheon, J.H.; Han, K.; Kim, A.; Kim, M.; Song, Y. A Full RNS Variant of Approximate Homomorphic Encryption. In Proceedings of the 25th International Conference on Selected Areas in Cryptography, Calgary, AB, Canada, 15–17 August 2018; LNCS. Springer: Berlin/Heidelberg, Germany, 2018; Volume 11349, pp. 347–368. [Google Scholar]
  12. CryptoLab HEaaN. Available online: https://www.cryptolab.co.kr/en/products-en/heaan-he/ (accessed on 28 March 2024).
  13. Almutairi, N.; Coenen, F.; Dures, K. K-Means Clustering Using Homomorphic Encryption and an Updatable Distance Matrix: Secure Third Party Data Clustering with Limited Data Owner Interaction. In Proceedings of the International Conference on Big Data Analytics and Knowledge Discovery, Lyon, France, 28–31 August 2017; pp. 274–285. [Google Scholar]
  14. Jäschke, A.; Armknecht, F. Unsupervised Machine Learning on Encrypted Data. In Proceedings of the Selected Areas in Cryptography (SAC) 2018, Calgary, AB, Canada, 15–17 August 2018; pp. 453–478. [Google Scholar]
  15. Samanthula, B.K.; Rao, F.-Y.; Bertino, E.; Yi, X.; Liu, D. Privacy-Preserving and Outsourced Multi-User k-Means Clustering. In Proceedings of the 2015 IEEE Conference on Collaboration and Internet Computing (CIC), Hangzhou, China, 27–30 October 2015; pp. 80–89. [Google Scholar]
  16. Kim, H.J.; Chang, J.W. A Privacy-Preserving k-Means Clustering Algorithm Using Secure Comparison Protocol and Density-Based Center Point Selection. In Proceedings of the IEEE International Conference on Cloud Computing, CLOUD, San Francisco, CA, USA, 2–7 July 2018; pp. 928–931. [Google Scholar]
  17. Ramírez, D.H.; Auñón, J.M. Privacy Preserving K-Means Clustering: A Secure Multi-Party Computation Approach. arXiv 2020, arXiv:2009.10453. [Google Scholar]
  18. Mohassel, P.; Rosulek, M.; Trieu, N. Practical Privacy-Preserving K-Means Clustering. Proc. Priv. Enhancing Technol. 2020, 2020, 414–433. [Google Scholar] [CrossRef]
  19. Bunn, P.; Ostrovsky, R. Secure Two-Party k-Means Clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 31 October–2 November 2007; pp. 486–497. [Google Scholar]
  20. Jha, S.; Kruger, L.; Mcdaniel, P. Privacy Preserving Clustering. In Proceedings of the European Symposium on Research in Computer Security, Milan, Italy, 12–14 September 2005; pp. 397–417. [Google Scholar]
  21. Cheon, J.H.; Kim, D.; Park, J.H. Towards a Practical Cluster Analysis over Encrypted Data. In Proceedings of the Selected Areas in Cryptography (SAC) 2019, Waterloo, ON, Canada, 12–16 August 2019. [Google Scholar]
  22. Bozdemir, B.; Canard, S.; Ermis, O.; Möllering, H.; Önen, M.; Schneider, T. Privacy-Preserving Density-Based Clustering. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, Hong Kong, China, 7–11 June 2021; pp. 658–671. [Google Scholar]
  23. Zahur, S.; Evans, D. Circuit Structures for Improving Efficiency of Security and Privacy Tools. In Proceedings of the IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 19–22 May 2013; pp. 493–507. [Google Scholar]
  24. Meng, X.; Papadopoulos, D.; Oprea, A.; Triandopoulos, N. Private Two-Party Cluster Analysis Made Formal & Scalable. arXiv 2019, arXiv:1904.04475. [Google Scholar]
  25. Lichman, M. Iris Dataset. Available online: https://archive.ics.uci.edu/dataset/53/iris (accessed on 28 March 2024).
  26. Wolberg, W.; Mangasarian, O.; Street, N.; Street, W. Breast Cancer Wisconsin (Diagnostic). UCI Mach. Learn. Repos. 1995. [Google Scholar] [CrossRef]
  27. Nkongolo, M.W. UGRansome Dataset. Available online: https://www.kaggle.com/datasets/nkongolo/ugransome-dataset/versions/1 (accessed on 9 July 2024).
  28. Hegde, A.; Möllering, H.; Schneider, T.; Yalame, H. SoK: Efficient Privacy-Preserving Clustering. Proc. Priv. Enhancing Technol. 2021, 2021, 225–248. [Google Scholar] [CrossRef]
  29. Costache, A.; Curtis, B.R.; Hales, E.; Murphy, S.; Ogilvie, T.; Player, R. On the Precision Loss in Approximate Homomorphic Encryption. In Proceedings of the International Conference on Selected Areas in Cryptography, Fredericton, NB, Canada, 14–18 August 2023. [Google Scholar]
Figure 1. Distance between two clusters defined by (a) single linkage and (b) complete linkage.
Figure 2. Client–server process flow for handling and clustering encrypted data.
Figure 3. Clustering via single linkage. Cluster identifiers before and after each update are marked in red and blue, respectively.
Figure 4. Case where the pair in focus (a) has no relation to the formed cluster, (b) is related to the formed cluster, and (c) defines a shorter distance between two clusters.
Figure 5. Clustering via complete linkage. Cluster identifiers before and after each update are marked in red and blue, respectively.
Figure 6. Scatter plot of feature 1 vs. feature 2 before clustering.
Figure 7. Single linkage clustering with k = 3: using (a) SciPy and (b) the proposed method.
Figure 8. Complete linkage clustering with k = 3: using (a) SciPy and (b) the proposed method.
Figure 9. Comparison of clustering conducted on the Breast Cancer Wisconsin (Diagnostic) dataset: (a) SciPy results for single (left) and complete (right) linkage; (b) results from our method for single (left) and complete (right) linkage.
Figure 10. Comparisons of clustering conducted on the UGRansome dataset: (a) SciPy results for single (left) and complete (right) linkage; (b) results from our method for single (left) and complete (right) linkage.
Table 1. Comparison of privacy-preserving clustering approaches.

Clustering Type    Privacy Technique             Paper
Centroid-based     HE                            [13,14,15,16]
Centroid-based     MPC                           [17,18]
Centroid-based     HE + MPC                      [19]
Centroid-based     HE and MPC (two protocols)    [20]
Density-based      HE                            [21]
Density-based      MPC                           [22,23]
Hierarchical       HE + MPC                      [24]
Hierarchical       HE                            Our work
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
