Enhanced Multi-Party Privacy-Preserving Record Linkage Using Trusted Execution Environments

Han, Shumin; Shen, Kuixing; Shen, Derong; Wang, Chuang

doi:10.3390/math12152337


¹ School of Artificial Intelligence and Software, Liaoning Petrochemical University, Fushun 113001, China

² School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(15), 2337; https://doi.org/10.3390/math12152337

Submission received: 30 May 2024 / Revised: 18 July 2024 / Accepted: 23 July 2024 / Published: 26 July 2024

(This article belongs to the Special Issue Mathematical Modeling for Parallel and Distributed Processing, 2nd Edition)


Abstract

With the world’s data volume growing exponentially, linking data across sources to support decision making has become critical. Privacy-preserving record linkage (PPRL) aims to identify all the records corresponding to the same entity across multiple data sources without disclosing sensitive information. Previous multi-party PPRL methods typically adopt homomorphic encryption because it can perform computations on encrypted data without decrypting it first, thus maintaining data confidentiality. However, these methods have notable shortcomings, such as the risk of collusion among participants leading to the disclosure of private keys, high computational costs, and decreased efficiency. The advent of trusted execution environments (TEEs) offers a solution by protecting computations involving private data through hardware isolation, thereby eliminating reliance on trusted third parties, preventing malicious collusion, and improving efficiency. Nevertheless, TEEs are vulnerable to side-channel attacks. In this work, we propose an enhanced PPRL method based on TEE technology. Our methodology involves processing plaintext data within a TEE using the inner product mask technique, which obfuscates the data and makes the computation resistant to side-channel attacks. The experimental results demonstrate that our approach not only significantly improves resistance to side-channel attacks but also enhances efficiency, showing better performance and privacy preservation compared to existing methods. This work provides a robust solution to the challenges faced by current PPRL methods and sets the stage for future research aimed at further enhancing scalability and security.

Keywords: privacy-preserving record linkage; Paillier homomorphic encryption; inner product mask; side-channel attacks; trusted execution environments

MSC: 68P27

1. Introduction

With the exponential growth of information technology, data have emerged as a novel and critical factor of production, ascending to the forefront of national strategic priorities. Record linkage technology is designed to discern and consolidate all the information pertaining to the same entity across diverse data sources. While it optimizes the value and utility of data, significant concerns about personal privacy preservation and cybersecurity have emerged, imposing constraints on data security measures.

Privacy-preserving record linkage (PPRL) technology provides an innovative solution for the safe sharing and utilization of data, enabling exact matching of the same entity in different data sources while protecting personal privacy [1]. Despite its theoretical advantages, PPRL faces several practical challenges, such as balancing security and efficiency, ensuring the credibility of the linked environment, and the difficulty of global evaluation [2]. These challenges limit its widespread application in practical scenarios.

Current PPRL methods usually rely on encryption technology to guarantee data security. However, encryption can lead to information loss, reducing the data linkage accuracy and overall linkage quality [3]. Most widely used PPRL methods focus on two-party record linkage but rely on trusted third parties in multi-party scenarios, which is impractical and poses privacy breach risks. Furthermore, homomorphic encryption, though advantageous, risks security decline due to potential collusion among parties, leading to private key exposure and sensitive data leakage.

Trusted execution environment (TEE) technology, with its unique advantages, combines cryptographic foundations and system security implementations to resist various attacks through isolated hardware devices. A TEE offers computing power equivalent to CPU plaintext computing, supporting efficient data processing and the separation of computing and data providers, making it advantageous for multi-party data sharing. However, TEE-based PPRL approaches are vulnerable to side-channel attacks, which necessitates effective protective measures to ensure data integrity and privacy.

Therefore, to address the shortcomings of existing research, we propose an enhanced multi-party PPRL method based on a TEE.

Contributions:

Enhanced Security and Efficiency: By introducing TEE technology, our method significantly enhances security and matching efficiency. The TEE prevents malicious collusion among parties and avoids private key leakage, directly preventing privacy breaches. Plaintext-based matching improves the matching quality compared to methods that may result in information loss.
Defense Against Side-Channel Attacks: Employing inner product-masking strategies, our method effectively defends against side-channel attacks, addressing TEE technology’s shortcomings and further enhancing the system’s overall security.
Superior Linkage Quality: Experimental results obtained using actual datasets demonstrate that our proposed method outperforms existing methods in terms of the linkage quality, validating its effectiveness and superiority.

The remainder of this paper is organized as follows. Section 2 briefly reviews the research relevant to the methodology used in this paper. Section 3 covers the pertinent definitions and preliminaries. Section 4 presents the procedure and algorithms of the enhanced multi-party PPRL method based on a TEE. Section 5 discusses the experimental analysis and results. Finally, Section 6 summarizes the paper and outlines future research directions.

2. Related Work

(1): Privacy-Preserving Record Linkage

Secure multi-party computation (SMC) is a common approach in first-generation PPRL [4,5,6,7], in which sensitive data are protected by federated computing. While SMC protocols are dependable, their high computational cost makes them too time-consuming to execute in practice.

The second generation of PPRL uses Bloom filter encoding [8]. However, it has been demonstrated that in some situations, Bloom filter encodings are susceptible to cryptanalysis [9,10], potentially compromising the security of sensitive data. Moreover, this encoding does not accurately preserve the original string distances, which can degrade the quality of the linkage results. Lawati et al. [12] employed secure TF-IDF [11] to produce weight vectors and secure hash signatures for every record, which serve as blocking keys. While more private, this approach is less effective. Churches and Christen [13] proposed extracting q-grams from string quasi-identifier (QID) values. However, this approach has a higher communication cost and is more susceptible to frequency attacks by outside parties. Additionally, privacy can be breached by calculating the frequency of specific hashed character sequences.

The privacy-preserving form of the categorized neighborhood approach [14] is a ground-breaking technique in the third generation of PPRL. Inan et al. [15] proposed a hybrid technique combining sanitization and cryptographic approaches, which reduces the risk of collusion among participants and incurs lower costs than pure cryptographic techniques, but it still faces challenges in balancing privacy, accuracy, and cost. Lai et al. [16] proposed an efficient multiparty private matching protocol characterized by low communication overheads and no need for a trusted third party, based on Bloom filters. However, it lacks fault tolerance. Kerschbaum [17] presented an anonymous technique. Yet, by displaying the precise starting distance between the anonymous values, it allows a third party access to a wealth of information. The approach for encoding the Bloom filter was expanded by Randall et al. [18]. They completed the similarity computation [19], merged the homomorphic encryption approach, and sent encrypted data information to a hypothetical, reliable third party to address the issue of the method’s susceptibility to frequency attacks. Due to the complexity and high computational cost of homomorphic encryption, as well as the need to rely on a trusted third party, this technology has significant limitations in practical applications. This research presents a method that integrates the TEE with homomorphic encryption, shortens the PPRL process running time by decreasing the encrypting and decrypting times in the homomorphic encryption process, and enhances scalability. A comparative analysis of PPRL methods is shown in Table 1.

(2): Resistance to Side-Channel Attacks

TEEs are designed to provide a more secure execution environment that protects sensitive information from various attacks, including some side-channel attacks. However, a TEE does not completely eliminate the risk of side-channel attacks. For example, if the TEE has a security vulnerability or interacts with other insecure components, an attacker could still use side-channel information to threaten the data. An analysis of resistance to side-channel attacks is shown in Table 2. Chen et al. [20] proposed the first implementation of the shared threshold; the results show that the scheme has perfect resistance to first-order leakage and strong resistance to second-order leakage. Moradi et al. [21] studied the security of low-latency devices against side-channel attacks and proposed several low-latency design architectures that resist first-order leakage through masking and threshold schemes. Goubin et al. [22] described an algorithm that converts Boolean masks into arithmetic masks using only a fixed number of operations. Coron et al. [23] described a novel method based on the Kogge–Stone carry look-ahead adder that calculates the carry signal in O(log K) operations. Maghrebi et al. [24] put forward a framework for constructing a customized coding function based on the analyzed leakage model, using knowledge of the physical leakage to select the corresponding optimal coding scheme. Faust et al. [25] studied whether the internal components of a masking scheme could reuse random numbers in order to reduce the total amount of randomness required. Balasch et al. [26] proposed a new algorithm that reduces the required randomness, improves the efficiency of the inner product mask scheme, is secure in the t-probing model, and is implemented on an ARM microprocessor. There are two methods to defend against side-channel attacks: eliminating the data dependency of the side information and obscuring the side information to weaken some of its characteristics [27]. However, it is very difficult to completely eliminate the possibility of side-channel attacks, because these attacks often exploit subtle differences in the underlying hardware and physical implementation.

This research presents a method that integrates the TEE with homomorphic encryption, shortens the running time of the PPRL process by reducing the encryption and decryption time in the homomorphic encryption process, and enhances scalability. The approach aims to address the shortcomings of the existing methods, which include long running times, low matching precision, poor security, and susceptibility to side-channel attacks. The isolation provided by the TEE guarantees data integrity, which enhances the matching quality, and the inner product mask [26] is employed to defend against side-channel attacks. Together, these features make the PPRL procedure more secure.

3. Definition and Preparation

3.1. Privacy-Preserving Record Linkage

PPRL is a technique for linking records from different data sources while protecting the privacy of individuals. The problem definition of PPRL is as follows: suppose the parties $D_{1}$ and $D_{2}$ own datasets $A$ and $B$, respectively. The problem solved by PPRL techniques is to output $S = | A_{x} | \times | B_{x} |$ based on a decision model, where $S$ represents the set of data records belonging to the same data subject $x$ in datasets $A$ and $B$, without disclosing personal privacy information of other entities in the datasets.
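To make the role of the decision model concrete, the following is a minimal plaintext sketch of linking two small datasets. It is only an illustration under stated assumptions: the Dice coefficient over character bigrams, the threshold of 0.8, the function names, and the toy records are all hypothetical choices and are not taken from this paper. It also omits every privacy mechanism and shows only the matching decision itself.

```python
from itertools import product


def bigrams(value):
    """Split a normalized string into its set of character bigrams."""
    text = value.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over bigram sets; 1.0 means identical sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(dataset_a, dataset_b, threshold=0.8):
    """Return the ID pairs whose name similarity reaches the threshold."""
    return [
        (id_a, id_b)
        for (id_a, name_a), (id_b, name_b) in product(dataset_a, dataset_b)
        if dice_similarity(name_a, name_b) >= threshold
    ]


# Hypothetical toy records of the form (record ID, name).
A = [("a1", "Jonathan Smith"), ("a2", "Maria Garcia")]
B = [("b1", "Jonathon Smith"), ("b2", "Wei Zhang")]

matches = link(A, B)
print(matches)
```

Only the record IDs of the matching pairs are returned, which mirrors the requirement that nothing about non-matching entities should be disclosed.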

3.2. Trusted Execution Environments

By using hardware-based isolation, TEE technology protects computations and operations that involve sensitive data. TEE-enabled CPUs provide an Enclave, a region that is isolated from the outside world and offers a safer environment for data and code execution, guaranteeing the confidentiality and integrity of programs. The data in this region are protected by both hardware and software, so that even the server itself cannot retrieve them. The TEE minimizes the need for repeated encryption and decryption during encrypted computation, averting the information loss that could arise from such processes. An ideal TEE can withstand all software and hardware attacks; in addition, its content is dynamic and can be updated while it is being executed. These features mean that TEE-integrated solutions outperform other privacy protection techniques in terms of security and operational efficiency. The overall architecture of the TEE is shown in Figure 1.

3.3. Homomorphic Encryption Algorithm

In addition to being a popular means of defense, homomorphic encryption technology is an efficient encryption technique. Its distinctive feature is that when ciphertext is operated on, the outcome is the same as when plaintext is operated on after decryption. The fundamental concept is to compute on encrypted ciphertext to produce an encrypted result that, upon decryption, corresponds to the result computed directly on the original data (plaintext) [28]. This approach not only accomplishes data protection but also has no impact on data calculation. As an illustration of additive homomorphic encryption, take the following:

$$\mathrm{Enc}_{pk}(m_{1}) = c_{1}, \quad \mathrm{Enc}_{pk}(m_{2}) = c_{2} \qquad (1)$$

$$\mathrm{Dec}_{sk}(c_{1} \cdot c_{2}) = m_{1} + m_{2} \qquad (2)$$

Formulas (1) and (2) involve two types of keys, public (pk) and private (sk); the plaintexts $m_{1}$ and $m_{2}$ are encrypted to produce the ciphertexts $c_{1}$ and $c_{2}$, respectively; and $\cdot$ is a multiplication operation.

3.4. Side-Channel Attack

A side-channel attack is a type of attack that uses physical characteristics produced by the system during operation, such as electromagnetic radiation, power consumption, and execution time, to obtain critical information. Unlike traditional attack methods, side-channel attacks do not directly exploit the software or hardware vulnerabilities of the system; instead, they obtain sensitive information by monitoring the physical characteristics of the system. These attacks include timing attacks, which infer internal information by observing the execution time; power analysis attacks, which monitor power changes to guess the key and other information; electromagnetic radiation attacks, which use the electromagnetic radiation generated by the system to analyze internal activities; cache side-channel attacks, which observe the cache access pattern to learn the location of key data; and acoustic side-channel attacks, which analyze the sound generated by the system to understand the execution activity. These attacks often require specialized equipment and knowledge, and the defense methods include encryption, randomization, and hardware enhancements to mitigate information leakage at the physical level of the system [29,30,31].

The attacker uses statistical tools to analyze the relationship between the hypothetical leakage described by the leakage model and the measured real leakage, and then recovers the sub-keys in a divide-and-conquer manner. When all the sub-keys are recovered, the master key can be obtained. Specifically, the theoretical model relating the real leakage $L$ to the key while the cryptographic device is running is shown in Formula (3).

$$L = f(H(p, k)) + n \qquad (3)$$

where $p$ is the input plaintext; $k$ is the sub-key (such as a byte or bit of the key); $H(p, k)$ is the hypothetical leakage, which depends on an intermediate value that the cryptographic algorithm computes from $p$ and $k$; $f$ is the mapping between $H(p, k)$ and the real leakage $L$; and $n$ is random noise.

The attack process can be roughly divided into two stages: leakage information collection and statistical analysis. In the stage of information collection, the appropriate measurement tool is selected according to the type of information leakage of the cryptographic chip, and the information leakage required by the attack is captured. The statistical analysis stage mainly uses the captured leaked information, combined with the design details of the cryptographic algorithm, and uses statistical tools to recover the sub-key.

The goal of the side-channel attack is to recover the unknown sub-key k from the known input p and the measured real leakage L. The process of the side-channel attack is shown in Figure 2.
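The two stages described above (leakage collection and statistical analysis) can be illustrated with a small simulation of Formula (3). This is a sketch under simplifying assumptions that are not taken from the paper: the hypothetical leakage H(p, k) is the Hamming weight of p XOR k, the mapping f is the identity, the noise n is Gaussian, and the statistical tool is the Pearson correlation coefficient. The sub-key value, trace count, and function names are likewise hypothetical. Real attacks usually target a nonlinear intermediate value and measured traces.

```python
import random


def hamming_weight(x):
    """Number of set bits in x; used here as the hypothetical leakage H."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def simulate_leakage(plaintexts, subkey, noise_std, rng):
    """Stage 1: L = f(H(p, k)) + n with f the identity and Gaussian n."""
    return [hamming_weight(p ^ subkey) + rng.gauss(0.0, noise_std)
            for p in plaintexts]


def recover_subkey(plaintexts, leakage):
    """Stage 2: pick the 8-bit guess whose hypothesis correlates best."""
    def score(guess):
        hypothesis = [hamming_weight(p ^ guess) for p in plaintexts]
        return pearson(hypothesis, leakage)
    return max(range(256), key=score)


rng = random.Random(2024)   # fixed seed so the sketch is repeatable
secret_subkey = 0x3C        # hypothetical sub-key byte
plaintexts = [rng.randrange(256) for _ in range(1000)]
leakage = simulate_leakage(plaintexts, secret_subkey, noise_std=1.0, rng=rng)
recovered = recover_subkey(plaintexts, leakage)
print(hex(recovered))
```

The attacker never reads the sub-key directly; it emerges because only the correct guess produces a hypothesis that tracks the noisy observations, which is exactly the dependency that masking is meant to remove.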

3.5. Inner Product Mask

The inner product mask is a common technique in the field of cryptography and hardware security, which is mainly used in scenarios involving vector inner product calculation. In the inner product mask, in order to increase the randomness of the calculation process and defend against side-channel attacks, the elements of each vector are XORed bit by bit with a randomly generated mask. The inner product mask technique plays a crucial role in enhancing the security of the proposed TEE-based PPRL method. This technique is specifically designed to address the vulnerability of the TEE to side-channel attacks. Side-channel attacks exploit physical leakages such as power consumption, electromagnetic radiation, and timing information to infer sensitive data during computation.

Significance in the proposed solution:

Noise Injection for Obfuscation: The inner product mask technique injects noise into the computation process within the TEE. By introducing random noise, it becomes significantly more challenging for attackers to extract meaningful information through side-channel observations. This obfuscation effectively masks the true data, thereby protecting it from being compromised even if side-channel information is collected.
Enhanced Data Security: By employing the inner product mask technique, the proposed method ensures that the intermediate values processed within the TEE remain secure. This is critical in scenarios where multiple parties are involved, as it mitigates the risk of data leakage through collaborative attacks or other side-channel vulnerabilities.
Improved System Robustness: The integration of the inner product mask technique enhances the overall robustness of the PPRL system. It provides an additional layer of security that complements the existing cryptographic measures, ensuring that the system remains resilient against advanced attack vectors.
Preservation of Linkage Accuracy: Unlike traditional methods that may degrade linkage accuracy due to excessive noise or encryption, the inner product mask technique is designed to maintain the balance between security and accuracy. It allows the system to perform accurate record linkage while simultaneously safeguarding sensitive information.

In practical terms, the inner product mask technique works by multiplying each data element with a randomly generated mask before performing the inner product computation. This masked computation ensures that any observable outputs from the TEE are not directly correlated with the actual data values. The noise introduced by the masks is later removed in a controlled manner, ensuring that the final linkage results are accurate and secure.

3.6. Privacy Attacks

PPRL technology faces privacy deficits and is susceptible to security risks. Malicious attacks on PPRL primarily include frequency attacks, re-identification attacks, co-occurrence attacks, pattern-matching attacks, and side-channel attacks.

The principle of a frequency attack is to infer the identity or attributes of individuals by utilizing the frequency information of various attribute values in the dataset. Re-identification attacks involve combining anonymized data with other data sources to re-identify individuals. Co-occurrence attacks infer undisclosed sensitive information by analyzing the co-occurrence of feature values within the data. Pattern-matching attacks leverage known patterns to match specific patterns within the data, thereby inferring hidden sensitive information. Side-channel attacks deduce sensitive data by monitoring the physical properties of the system to capture information leaked during computation.

4. An Enhanced Multi-Party PPRL Method Based on a TEE

The method in this paper consists of four parts: data preprocessing, calling the TEE, constructing Paillier homomorphic encryption and using the inner product mask method.

Firstly, the public and private keys required for homomorphic encryption are generated within the Enclave. After converting the plaintext data into ciphertext using the encryption key they were given, each participant sends the ciphertext data to the Enclave. Within the Enclave, the content of the ciphertext is decrypted and converted into plaintext data using the private key. Before data processing, in order to ensure that the TEE is protected from side-channel attacks, we introduce an inner product mask mechanism. The function of the inner product mask is to hide the real calculation result by introducing random noise during the calculation of sensitive data, so as to improve the security and privacy of the system. This process is performed after the data are decrypted and before the data are processed to prevent sensitive information from being leaked. Furthermore, the processed data will be used to calculate similarity and classify the data that meet the linking requirements. Finally, the matching result IDs belonging to the same entity are returned to the participants. A flow diagram of this method is shown in Figure 3.

4.1. Data Preprocessing

Before entering the data, it is necessary to verify that the settings are consistent for every participant. To prevent data leakage, our method creates the matching public (pk) and private keys (sk) within the Enclave, where pk is transmitted to the external, while the sk is stored internally. The participants use the received pk to encrypt the data, then send the ciphertext to the Enclave. The Enclave then uses the sk stored within it to decrypt the data back into plaintext and performs the necessary computations on the Enclave data. Table 3 illustrates the parameters used in this method and their meanings.

4.2. Call TEE

When a user program communicates with an Enclave in the TEE, it first obtains a valid signature authentication to ensure its legitimacy and trustworthiness. Subsequently, the user program encrypts the parameters to be transmitted using a public key to ensure the security of the data transmission. Then, the user program sends the secure service request and the encrypted data to the TEE, requesting the execution of specific security operations or services. In this process, the TEE is responsible for transferring the received ciphertext data to the isolated runtime environment of the Enclave, ensuring the secure transmission and isolation of the data. Ultimately, a private key is used in the Enclave to decode the received ciphertext data into plaintext data, the inner product mask method is used for the plaintext data, and the matching result IDs are returned to the participants after the completion of the agreed calculation. The processing flow is shown in Figure 4.

4.3. Building Paillier Homomorphic Encryption

The Paillier homomorphic encryption algorithm is a public key encryption scheme with the property of homomorphic addition. Its key steps include key generation, encryption, homomorphic operation, and decryption.

4.3.1. Paillier Key Generation

Key generation is the first step of the Paillier homomorphic encryption algorithm, aiming to create the public and private keys used for encrypting and decrypting data. This process begins with the selection of two large prime numbers $p$ and $q$, followed by the computation of their product $n = p \cdot q$. Next, $λ = \mathrm{lcm}(p - 1, q - 1)$, the least common multiple of $p - 1$ and $q - 1$, is calculated for the subsequent private key calculations. On this basis, a random integer $g \in Z_{n^{2}}^{*}$ is chosen, and the value of $μ$ is computed according to the formula $μ = {(L(g^{λ} \mod n^{2}))}^{-1} \mod n$. Here, the function $L$ is defined as $L(u) = \frac{u - 1}{n}$. Finally, the generated public key is $(n, g)$, and the private key is $(λ, μ)$. This pair of keys is used for the subsequent encryption and decryption operations. Selecting appropriate primes and random numbers during key generation is crucial to guarantee the system’s dependability and security.

4.3.2. Paillier Encryption Function

With a plaintext $m_{i}$ and the public key $(n, g)$, the encryption process involves selecting a random number $r_{i} \in Z_{n^{2}}^{*}$ and computing the ciphertext $c_{i} = E(m_{i}) = g^{m_{i}} \cdot r_{i}^{n} \mod n^{2}$. This ciphertext $c_{i}$ [21] can then be used for secure transmission or storage without revealing the original plaintext $m_{i}$. The encryption operation ensures the confidentiality of the data, as only the holder of the private key can perform decryption to recover the original plaintext.

4.3.3. Paillier Homomorphic Operation

Homomorphic operation is a crucial feature of the Paillier encryption algorithm, allowing arithmetic operations to be performed on encrypted data without decryption. The random numbers $r_{i}$ and $r_{j}$ are arbitrary.

$$\begin{aligned} c_{i} \cdot c_{j} &= E(m_{i}) \cdot E(m_{j}) = g^{m_{i}} \cdot r_{i}^{n} \cdot g^{m_{j}} \cdot r_{j}^{n} \mod n^{2} \\ &= g^{m_{i} + m_{j}} \cdot {(r_{i} \cdot r_{j})}^{n} \mod n^{2} = E(m_{i} + m_{j}) \end{aligned} \qquad (4)$$

As shown in Formula (4), the result corresponds to the ciphertext of the sum of the corresponding plaintexts. This property enables addition operations to be carried out on the ciphertexts directly, preserving the confidentiality of the underlying plaintexts.

4.3.4. Paillier Decryption Function

Decryption is the final step of the Paillier homomorphic encryption algorithm, aiming to recover the original plaintext from the ciphertext using the sk [32]. With the ciphertext $c$ and the sk $= (λ, μ)$, the decryption process involves computing $m$ using Formula (5).

$$m = D(c) = L(c^{λ} \mod n^{2}) \cdot μ \mod n = \frac{L(c^{λ} \mod n^{2})}{L(g^{λ} \mod n^{2})} \mod n \qquad (5)$$

This formula utilizes the private key $(λ, μ)$ to reverse the encryption process and obtain the original plaintext $m$.
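The four steps of Section 4.3 (key generation, encryption, homomorphic addition, and decryption) can be sketched end to end as follows. This is a toy illustration, not the implementation used in the paper: the primes 1009 and 1013 are far too small to be secure, the choice $g = n + 1$ is one common valid generator rather than a random one, `random.Random` stands in for a cryptographically secure generator, and all function names are hypothetical. The random value $r$ is drawn below $n$ and coprime to $n$, which is also an element of $Z_{n^{2}}^{*}$.

```python
import math
import random


def l_function(u, n):
    """The function L(u) = (u - 1) / n used in Paillier decryption."""
    return (u - 1) // n


def keygen(p, q):
    """Return ((n, g), (lam, mu)) for the primes p and q."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1  # assumption: a common simple choice of g
    mu = pow(l_function(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m, rng):
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pk
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def add_encrypted(pk, c1, c2):
    """Multiplying ciphertexts adds the plaintexts, as in Formula (4)."""
    n, _ = pk
    return (c1 * c2) % (n * n)


def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, as in Formula (5)."""
    n, _ = pk
    lam, mu = sk
    return (l_function(pow(c, lam, n * n), n) * mu) % n


rng = random.Random(0)           # toy randomness, not for real keys
pk, sk = keygen(1009, 1013)      # toy primes, insecure by design
c1 = encrypt(pk, 42, rng)
c2 = encrypt(pk, 58, rng)
total = decrypt(pk, sk, add_encrypted(pk, c1, c2))
print(total)
```

Because a fresh $r$ is drawn for every encryption, encrypting the same plaintext twice generally yields different ciphertexts, while both still decrypt to the same value.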

4.4. Using the Inner Product Mask

Considering that traditional PPRL methods are still vulnerable to side-channel attacks in the TEE, we introduce a security mechanism, the inner product mask. The inner product mask method resists side-channel attacks by introducing a random mask into the calculation of the inner product. Introducing randomness into the vector calculations reduces the risk of information leakage when an attacker analyzes the computation. Furthermore, it enhances the overall robustness of the PPRL system by adding an extra layer of security and maintaining a balance between security and accuracy, allowing accurate record linkage while safeguarding sensitive information. In practice, this technique involves multiplying each data element with a randomly generated mask before performing the inner product computation, ensuring that the observable outputs from the TEE are not directly correlated with the actual data values; the noise is later removed in a controlled manner to ensure accurate and secure linkage results.

4.4.1. Inner Product Mask Symbol

Let $K$ denote a field of characteristic 2. We use capital letters to represent elements of the field $K$. Multiplication is represented by the dot $\cdot$, and the standard inner product of $X$ and $Y$ is represented as $\langle X, Y \rangle = \sum_{i} X_{i} \cdot Y_{i}$, where $X_{i}$ and $Y_{i}$ are the components of vectors $X$ and $Y$. The symbol $δ_{i, j}$ is 0 when $i = j$; otherwise, it is 1.

4.4.2. Build the Inner Product Mask

Our scheme is based on the inner product construction by Dziembowski and Faust [33] and is an improvement over [34] and [35].

As shown in Algorithm 1, the masking scheme is initialized by the IPSetup procedure. First, $L_{1}$ is initialized to 1; then, the remaining components are drawn uniformly at random from the nonzero elements of $K$, yielding the random vector $L = (L_{1}, L_{2}, \ldots, L_{n})$.

Algorithm 1. Setup for the masking scheme: $L \leftarrow \mathrm{IPSetup}_{n}(K)$
Input: Field description $\mathrm{rand}(K)$
Output: Random vector $L$
1:	$L_{1} = 1$
2:	for $i = 2$ to $n$ do
3:	$L_{i} \leftarrow \mathrm{rand}(K \setminus \{0\})$
4:	end for

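As shown in Algorithm 2, masking is implemented by the IPMask procedure. It takes the variable $S$ and the initialized vector $L$ as input and, for $i$ from 2 to $n$, generates a random element $S_{i}$. Finally, by calculating $S_{1} = S + \sum_{i = 2}^{n} L_{i} \cdot S_{i}$, the masked vector $\mathbf{S} = (S_{1}, \ldots, S_{n})$ satisfying $S = \langle L, \mathbf{S} \rangle$ is obtained. Since $L_{1} = 1$ and addition in a field of characteristic 2 is its own inverse, the terms $L_{i} \cdot S_{i}$ for $i \geq 2$ cancel in the inner product, leaving $S$.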

Algorithm 2. Masking a variable: $\mathbf{S} \leftarrow \mathrm{IPMask}_{L}(S)$
Input: Variable $S \in K$
Output: Vector $\mathbf{S}$ such that $S = \langle L, \mathbf{S} \rangle$
1:	for $i = 2$ to $n$ do
2:	$S_{i} \leftarrow \mathrm{rand}(K)$
3:	end for
4:	$S_{1} = S + \sum_{i = 2}^{n} L_{i} \cdot S_{i}$
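Algorithms 1 and 2 can be sketched in a few lines once a concrete field is fixed. The sketch below assumes $K = GF(2^{8})$ with the AES reduction polynomial, which is only one possible field of characteristic 2 and is not stated in the paper. The helper names (`gf_mul`, `inner_product`, `ip_unmask`) are hypothetical, and `random.Random` stands in for a secure randomness source. The code checks functional correctness only; plain Python offers no side-channel protection by itself.

```python
import random

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, assumed reduction polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^8), reducing modulo AES_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def inner_product(xs, ys):
    """Inner product over GF(2^8); addition in the field is XOR."""
    acc = 0
    for x, y in zip(xs, ys):
        acc ^= gf_mul(x, y)
    return acc


def ip_setup(n, rng):
    """Algorithm 1: L_1 = 1, the other L_i are random nonzero elements."""
    return [1] + [rng.randrange(1, 256) for _ in range(n - 1)]


def ip_mask(secret, L, rng):
    """Algorithm 2: random S_2..S_n, then S_1 so that <L, S> = secret."""
    shares = [0] + [rng.randrange(256) for _ in range(len(L) - 1)]
    shares[0] = secret ^ inner_product(L[1:], shares[1:])
    return shares


def ip_unmask(shares, L):
    """Recover the secret as the inner product <L, S>."""
    return inner_product(L, shares)


rng = random.Random(1)   # toy randomness for a repeatable sketch
L_vec = ip_setup(4, rng)
shares = ip_mask(0x57, L_vec, rng)
print(shares, hex(ip_unmask(shares, L_vec)))
```

Masking the same secret twice gives different share vectors, yet both unmask to the same value, which is what decouples individual intermediate values from the secret.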

We propose a novel multiplication scheme. This scheme achieves a security order of $t = n - 1$ in the t-probing model. Algorithm 3 is based on the multiplication scheme from [36]. We reused the idea of summing the inner product matrix of the inputs. Specifically, we designed the two matrices ($T$ and $U$ in the algorithm) to be consistent with different masking models. As shown in Algorithm 3, lines 1 to 6 compute the elements of matrix $T$, an $n \times n$ matrix whose elements are computed from the corresponding elements of vectors $A$ and $B$, as well as vector $L$. Lines 7 to 19 compute the elements of matrices $U$ and $U'$, where $U$ is an $n \times n$ matrix whose elements are computed from the corresponding elements of matrix $U'$, the symbol $δ_{i, j}$, and vector $L$. Lines 20 to 21 compute matrix $V$, whose value is the sum of the corresponding elements of matrices $T$ and $U$. Lines 22 to 25 compute the elements of the output vector $C$, where each element of $C$ is the sum of the elements in the corresponding row of matrix $V$. Finally, the output vector $C$ satisfying $\langle L, C \rangle = \langle L, A \rangle \cdot \langle L, B \rangle$ is obtained. The contribution of $U$ to $\langle L, C \rangle$ vanishes because the factor $L_{i}^{-1}$ cancels $L_{i}$ and the antisymmetric entries of $U'$ sum to zero.

Algorithm 3. Multiply masked values: $C \leftarrow \mathrm{IPMult}_{L}(A, B)$
Input: Vectors $A$ and $B$ of length $n$
Output: Vector $C$ such that $\langle L, C \rangle = \langle L, A \rangle \cdot \langle L, B \rangle$
1:	⊳ Computation of the matrix $T$
2:	for $i = 1$ to $n$ do
3:	for $j = 1$ to $n$ do
4:	$T_{i, j} = A_{i} \cdot B_{j} \cdot L_{j}$
5:	end for
6:	end for
7:	⊳ Computation of the matrices $U$ and $U'$
8:	for $i = 1$ to $n$ do
9:	$U'_{i, i} = 0$
10:	for $j = 1$ to $n$ do
11:	if $i < j$ then
12:	$U'_{i, j} \leftarrow \mathrm{rand}(K)$
13:	end if
14:	if $i > j$ then
15:	$U'_{i, j} = - U'_{j, i}$
16:	end if
17:	$U_{i, j} = U'_{i, j} \cdot \delta_{i, j} L_{i}^{-1}$
18:	end for
19:	end for
20:	⊳ Computation of the matrix $V$
21:	$V = T + U$
22:	⊳ Computation of the output vector $C$
23:	for $i = 1$ to $n$ do
24:	$C_{i} = \sum_{j} V_{i, j}$
25:	end for
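A functional sketch of Algorithm 3 in Python is given below. As an illustration only, it assumes $K$ = GF(2^8) with the AES reduction polynomial and a public vector $L$ whose entries are all nonzero; the helper names (`gf_mul`, `gf_inv`, `inner_product`, `ip_mult`) are hypothetical. The code checks the algebraic relation $\langle L, C \rangle = \langle L, A \rangle \cdot \langle L, B \rangle$ and does not reproduce the evaluation order or constant-time behaviour that a probing-secure implementation would require.

```python
import secrets


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed field)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a):
    """Invert a nonzero byte via a^254, since a^255 = 1 in GF(2^8)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def inner_product(L, S):
    """Compute <L, S> over GF(2^8); addition is XOR."""
    acc = 0
    for l_i, s_i in zip(L, S):
        acc ^= gf_mul(l_i, s_i)
    return acc


def ip_mult(L, A, B):
    """Return C with <L, C> == <L, A> * <L, B>, following the steps of Algorithm 3."""
    n = len(L)
    # Lines 1-6: T[i][j] = A_i * B_j * L_j
    T = [[gf_mul(gf_mul(A[i], B[j]), L[j]) for j in range(n)] for i in range(n)]
    # Lines 7-16: U' has a zero diagonal and U'[i][j] = -U'[j][i];
    # in characteristic 2, negation is the identity, so U' is symmetric here.
    U_prime = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            U_prime[i][j] = secrets.randbelow(256)
            U_prime[j][i] = U_prime[i][j]
    L_inv = [gf_inv(l_i) for l_i in L]
    C = []
    for i in range(n):
        c_i = 0
        for j in range(n):
            # Line 17: delta_{i,j} is 0 on the diagonal and 1 elsewhere.
            u_ij = gf_mul(U_prime[i][j], L_inv[i]) if i != j else 0
            # Lines 21-24: V = T + U, and C_i is the sum of row i of V.
            c_i ^= T[i][j] ^ u_ij
        C.append(c_i)
    return C


# Hypothetical public vector and arbitrary input share vectors.
L = [1, 0x53, 0xCA, 0x07]
A = [0x12, 0x34, 0x56, 0x78]
B = [0x9A, 0xBC, 0xDE, 0xF0]
C = ip_mult(L, A, B)
lhs = inner_product(L, C)
rhs = gf_mul(inner_product(L, A), inner_product(L, B))
```

In this sketch `lhs` and `rhs` agree for every random choice of `U_prime`, because each off-diagonal random value contributes twice to $\langle L, C \rangle$ and cancels.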

4.4.3. Privacy Performance Analysis

The TEE ensures that applications outside the Enclave cannot read the data being processed inside it. To better resist possible side-channel attacks, such as power analysis attacks, electromagnetic side-channel attacks, and timing side-channel attacks, the inner product mask is also introduced. After the Enclave uses the private key to decrypt the ciphertext data, the inner product mask is used to hide the real calculation results. Even if the attacker deduces some values of $C$ through side-channel attacks, the security of the inner product mask can be explained in two cases. Suppose the attacker has observed certain intermediate values of $C_{i}$. If all of the partial sums have been observed, $C_{i}$ can be simulated by adding the previously simulated values in accordance with the algorithm. If not, there is at least one value in $C$ that is not used in any other observation, so $C_{i}$ can be simulated as a random value. This makes it impossible for attackers to accurately correlate observable information such as the input, output, or runtime with actual keys or sensitive data. In this way, even if an attacker can measure the hardware's side-channel information, this information is no longer sufficient to reveal any useful secrets because it is masked. This conclusion has been proved by Balasch et al. [26]. The homomorphic encryption is computed under the protection of the inner product mask, which greatly enhances the security of the linking process.

4.5. Implementation of Multi-Party PPRL in the TEE

In Algorithm 4, the first line preprocesses the data. In lines 2–6, we establish a fresh, uninitialized Enclave and load the data and code into it, which gives E(data, code). M(E(data, code)) represents the hash of the internal data and code of the Enclave. Users deploy the necessary programs into the Enclave to guarantee the safety and privacy of its internal data. Even if malicious participants attempt collusion or data theft, the security mechanisms of the Enclave prevent them from obtaining decryption keys to access sensitive data, thereby preserving data privacy. Participants' data are transmitted to the Enclave after encryption, and the Enclave's measurement is then updated. This measurement, M(E(·,·)), is used by remote parties to verify the integrity and security of the Enclave, thereby establishing trust. Finally, the initialization status of the Enclave is set to true, which allows the loaded code to execute and fixes the hash of the Enclave's measurement. In lines 7–8, once the hash of the Enclave's measurement has been determined, we introduce the inner product mask mechanism. In the Enclave, the private key is used to decrypt the ciphertext data and the inner product mask is applied to hide the real calculation results; introducing random noise in this way improves the security and privacy of the system. In lines 9–17, the homomorphic encryption is computed under the protection of the inner product mask. To find all the record information related to the same entity, the condition $L(c^{\lambda} \bmod n^{2}) \bmod n = 0$ is used to assess whether the feature segment f and the plaintext segment m match after the inner product masking. Finally, in lines 18–19, the matching results are output and all the allocated memory is cleared, returning the Enclave to its uninitialized state.

Algorithm 4. An enhanced multi-party PPRL method based on the TEE
Input: Data from each party
Output: Matched results
1:	Each participant receives its own encryption key and uses the public key to encrypt its plaintext data into ciphertext data
2:	Call TEE
3:	$TEE.create() \to E(\emptyset, \emptyset)$
4:	$TEE.add(E(\emptyset, \emptyset), data, code) \to E(data, code)$
5:	$TEE.extend(E(data, code)) \to M(E(data, code))$
6:	$TEE.init(E(data, code)) \to HMAC(M(E(data, code)))$
7:	Apply inner product mask to ciphertext data
8:	Introduce random noise to improve security and privacy
9:	Construct Paillier homomorphic encryption
10:	$TEE.KeyDerive(E(data, code)) \to key, Enc_{pk}$
11:	Decrypt ciphertext data using the private key
12:	Verify if the product of decrypted ciphertexts is 0 to determine plaintext segment
13:	if $L(c^{\lambda} \bmod n^{2}) \bmod n = 0$ then
14:	Match
15:	else
16:	Non-match
17:	end if
18:	Output matched results
19:	$TEE.remove(E(data, code)) \to E(\emptyset, \emptyset)$
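The match test in line 13 can be illustrated with a toy Paillier sketch in Python. Several things here are assumptions made for illustration rather than details taken from the method: the primes are tiny, the generator is fixed to $g = n + 1$, a match between a record value m and a feature value f is encoded as the plaintext difference m − f being zero, and all function names are hypothetical. The code runs outside any TEE, applies no inner product mask, and uses the `random` module, which is not suitable for real key material.

```python
import math
import random


def paillier_L(u, n):
    """The function L(u) = (u - 1) / n from Table 3 (integer division)."""
    return (u - 1) // n


def keygen(p, q):
    """Return the modulus n = p * q and lambda = lcm(p - 1, q - 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, lam


def encrypt(m, n):
    """Encrypt m as g^m * r^n mod n^2 with g = n + 1 (assumed generator)."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:   # r must be invertible modulo n
            break
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2


def subtract_plain(c, f, n):
    """Homomorphically turn Enc(m) into Enc(m - f) by multiplying with g^(-f)."""
    n2 = n * n
    return (c * pow(n + 1, (-f) % n, n2)) % n2


def is_zero(c, n, lam):
    """Line 13 of Algorithm 4: L(c^lambda mod n^2) mod n == 0 iff the plaintext is 0."""
    return paillier_L(pow(c, lam, n * n), n) % n == 0


# Toy parameters for illustration only; real deployments need large primes.
n, lam = keygen(293, 433)
c = encrypt(1234, n)
match = is_zero(subtract_plain(c, 1234, n), n, lam)       # same value
non_match = is_zero(subtract_plain(c, 1235, n), n, lam)   # different value
```

The test works without computing the full decryption because $L(c^{\lambda} \bmod n^{2})$ equals the plaintext multiplied by a factor that is invertible modulo $n$, so it vanishes exactly when the plaintext does.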

5. Experiment and Analysis

5.1. Experimental Setup

Table 4 lists the experimental configuration. The experimental host uses an Intel(R) Core(TM) i5-9300H 2.40 GHz quad-core processor with 16 GB of RAM and a 64-bit Windows 11 operating system, and the method was implemented in PyCharm (2023.1). The dataset used is the North Carolina voter registration list (NCVR), which contains publicly available real information.

5.1.1. Evaluation Criteria

The experiments were evaluated on four criteria: runtime, recall, precision, and F-measure [37]. The runtime reflects the efficiency of the method in processing large amounts of data and is a key measure of its scalability. The other three criteria assess the method's linkage quality. The recall is the ratio of correctly identified matching record sets to the actual matching record sets. The precision is the ratio of correctly identified matching record sets to the total candidate record sets. The F-measure is the harmonic mean of the recall and precision, expressed as follows:

$F = 2 \times \frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$.
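The three linkage-quality criteria can be computed as in the following Python sketch. The representation of a match as a tuple of record identifiers and the function name `linkage_quality` are assumptions made for illustration; the small sets are invented to show the arithmetic and are not experimental data.

```python
def linkage_quality(predicted, truth):
    """Return (recall, precision, f_measure) for predicted versus true match sets."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Illustrative sets: four true matches, three candidates, two of them correct.
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(1, 1), (2, 2), (5, 6)}
recall, precision, f_measure = linkage_quality(predicted, truth)
# recall = 2/4, precision = 2/3, F-measure = 4/7
```

Because the F-measure is a harmonic mean, it stays close to the smaller of the two values, so a method cannot score well by maximizing recall alone.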

5.1.2. Parameter Determination

To evaluate our method’s efficacy under different conditions, three factors were the focus of our experiments: data source size, number of parties, and degree of data perturbation.

The size of the dataset can significantly impact the performance of the PPRL method. Larger datasets tend to increase the computational complexity and time required for record linkage. By selecting a range of dataset sizes (5 k, 10 k, 50 k, 100 k, 500 k, and 1000 k), we aim to evaluate the scalability and efficiency of our method across different volumes of data. Testing with varying dataset sizes allows us to demonstrate that our method can handle both small and large datasets effectively, showcasing its versatility and robustness in different scenarios.

The number of participating parties in a PPRL process can influence the communication cost and the complexity of maintaining data privacy. By experimenting with different numbers of parties (3, 5, 7, and 9), we can assess how well our method performs in scenarios with varying levels of collaboration. This selection helps us understand the impact of increased participation on the overall system performance and privacy guarantees, ensuring that our method is adaptable to different collaborative settings.

Data perturbation, such as misspellings or errors, is common in real-world datasets [38]. The ability to accurately link records despite such perturbations is crucial for the practical application of PPRL methods. By generating datasets with a maximum of one, two, and three misspellings (Mod-1, Mod-2, and Mod-3), we can test the resilience and accuracy of our method under varying levels of data quality. Evaluating our method with different degrees of data perturbation allows us to demonstrate its robustness and reliability in handling imperfect data, which is often encountered in real-world applications.
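One way to produce perturbed copies with a bounded number of misspellings is sketched below in Python. The procedure used to build Mod-1, Mod-2, and Mod-3 is not described here, so this is only an assumed scheme: it applies at most `max_edits` single-character substitutions to a lowercase string, and the function name `perturb` and the sample value are hypothetical.

```python
import random
import string


def perturb(value, max_edits, rng):
    """Return a copy of value with between 0 and max_edits character substitutions."""
    chars = list(value)
    if not chars:
        return value
    n_edits = rng.randint(0, max_edits)
    # Choose distinct positions so that each edit changes a different character.
    positions = rng.sample(range(len(chars)), min(n_edits, len(chars)))
    for pos in positions:
        alternatives = [c for c in string.ascii_lowercase if c != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)


rng = random.Random(0)  # fixed seed so that a run is repeatable
variants = {k: perturb("johnson", k, rng) for k in (1, 2, 3)}
```

Insertions, deletions, and transpositions could be added in the same way if the perturbation model calls for them.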

5.2. Experimental Results and Analysis

This section compares our approach with three popular existing approaches: the homomorphic encryption-based PPRL technique (HE-PPRL) proposed by Randall et al. [39], the counting Bloom filter (CBF)-based PPRL method (CBF-PPRL) proposed by Vatsalan et al. [40], and the Bloom filter-based PPRL method (BF-PPRL) proposed by Lai et al. [16].

5.2.1. Scalability Assessment

First, we assess the scalability of the proposed method, that is, how its execution time varies as the data source size increases. As illustrated in Figure 5, when the number of participants is P = 3, the running time of the proposed approach is longer than that of the two Bloom filter-based methods but shorter than that of the HE-PPRL method. Our method also uses homomorphic encryption and additionally introduces the inner product mask, which increases its running time relative to the Bloom filter-based techniques. However, operating within an Enclave can significantly reduce the number of encryption and decryption operations, so the proposed method has a shorter execution time than the HE-PPRL method. Consequently, our proposed method maintains high scalability while ensuring a high level of security.

Second, we set the dataset size to $|D_i|$ = 10 k. Although our approach reduces the frequent encryption and decryption processes, it introduces an inner product mask. As shown in Figure 6, the runtime of our method is therefore longer than that of the two Bloom filter-based methods as the number of parties rises. Relative to the HE-PPRL method, which incurs these repeated encryption and decryption operations, our method exhibits better scalability.

5.2.2. Method Performance Evaluation

We evaluate our proposed method in terms of the recall, precision, and the F-measure.

On the perturbed dataset Mod-1 with size $|D_i|$ = 5 k, we evaluated how the recall, precision, and F-measure of the proposed method and the other three methods change as the number of parties rises.

Figure 7 shows that, even though our method uses inner product masking and introduces noise, its recall on the perturbed dataset Mod-1 remains superior to that of the two methods based on Bloom filter encoding and stays consistently at 0.85 or above.

Figure 8 shows that, on the perturbed dataset Mod-1, our method exhibits better precision than the two methods based on Bloom filter encoding. Although our method uses inner product masking and introduces noise, the TEE assures data integrity, whereas both other methods incur some loss of data in the process. When the number of parties (P) reaches nine, the precision of all the methods is close to 0.9.

Figure 9 shows that the F-measure of our proposed method equals those of HE-PPRL and CBF-PPRL when the number of parties reaches five. Beyond that point, the F-measure falls steadily as the number of parties rises, and the F-measure of CBF-PPRL drops below that of our method. Furthermore, the F-measure of our method remains above 0.88 even when the number of parties reaches nine, indicating that our proposed method continues to demonstrate good matching performance.

For a dataset size of $|D_i|$ = 5 k and varying degrees of disturbance, the changes in the three evaluation metrics of the proposed method with an increasing number of parties P are shown in Figure 10. As the degree of disturbance increases, all three metrics decrease. This is because a higher degree of disturbance makes it more likely that some true matching records are lost, and because the inner product mask processing also adds noise. Overall, our proposed method still demonstrates satisfactory performance.

5.3. Potential Limitations and Biases

In our experiments, the proposed TEE-based PPRL method demonstrated high matching quality and security. However, we acknowledge that these results have some potential limitations and biases that could affect the model’s performance and generalizability.

Limitations:

Dataset Limitations: Our experiments were primarily validated using specific datasets. These datasets may not fully represent all the data situations in practical applications. Therefore, the model’s performance may vary in different data environments. To improve the generalizability of the results, future research could consider testing on a broader range of datasets.
Computational Cost: Although our method optimizes computational and communication costs compared to other methods, the computational cost may still be a challenge with extremely large datasets. Especially in resource-limited environments, further optimization of the computational efficiency will be an important research direction.
Security and Attack Resistance: Our approach incorporates an inner product mask to resist side-channel attacks, thereby enhancing the security of TEE technology. However, the inner product mask does not completely eliminate all the potential attack risks. For example, an attacker could bypass existing protections with more sophisticated attacks. In addition, the TEE itself may have undiscovered security vulnerabilities. Therefore, further improving the protection capability of the system and researching new security technologies remain important directions for future research.
Model Generalizability: Our method performs well in specific application scenarios, but its performance in other application scenarios has not been verified. For example, in data-matching tasks in different fields, the feature selection and matching algorithms may need to be adjusted. Therefore, the generalizability and scalability of the model need further validation.

Potential biases:

Data Variability: The quality and consistency of the data significantly impact the model’s performance. In practical applications, the data may have issues such as missing values, inconsistency, or noise. These data variabilities may affect the matching accuracy and reliability of the model.
Algorithm Selection: Our feature selection and matching algorithms are based on the best practices in the current research, but these algorithms may not be suitable for all scenarios. Future research could explore other algorithms and techniques to further enhance the model’s performance.
Implementation Details: Technical implementation in practical applications may encounter some challenges, such as system integration, computational resource allocation, and real-time requirements. These factors may affect the practical application effect of the model and need to be explored in future research.

5.4. Practical Implementation and Integration Challenges

In implementing the TEE-based PPRL method, we first integrated the TEE technology into our existing system. The chosen TEE platform was Intel SGX due to its widespread support and robustness. We utilized the Intel SGX SDK for the software implementation, ensuring compatibility with our existing cryptographic libraries. The integration process involved setting up secure enclaves for sensitive computations and establishing secure communication channels between the enclaves and the untrusted host application.

During the integration, we encountered several challenges. Firstly, the hardware compatibility was a significant issue. Not all computational environments supported the specific TEE hardware we utilized, necessitating a careful selection of compatible devices. Secondly, the performance overhead introduced by the TEE was non-trivial. The secure enclaves, while enhancing security, resulted in increased computational and communication latency. To mitigate this, we implemented optimization strategies such as hardware acceleration and parallel processing.

Security concerns were another critical challenge. Although the TEE provides a secure execution environment, it does not completely eliminate the risk of side-channel attacks. To address this, we employed the inner product mask technique. This technique protects against side-channel attacks by masking the actual data during computations, preventing attackers from inferring sensitive information through side-channel leakage. Additionally, we conducted frequent security audits and employed advanced cryptographic techniques to further obscure side-channel information.

Our experimental setup included a series of benchmark tests to evaluate the performance and security of the TEE-integrated PPRL method. The results indicated a noticeable improvement in security with a marginal performance trade-off. The optimizations we applied successfully mitigated the performance overheads, making the method viable for practical applications.

6. Conclusions

Nowadays, the integration of data from different sources is an indispensable social process, but present PPRL methods still have significant issues, including excessive running times, poor linkage quality, and inadequate security. To address the privacy leakage caused by malicious attacks such as dictionary attacks, frequency attacks, cryptanalysis attacks, synthesis attacks, and collusion throughout the multi-party PPRL process, this work presents an enhanced TEE-based PPRL method, which also improves the security of the linkage process. Combining the TEE with the homomorphic encryption method reduces the time cost and data loss associated with repetitive encryption and decryption. To safeguard against side-channel attacks, an inner product masking technique is employed to inject noise, rendering the data useless to attackers even if compromised. The proposed method offers significant improvements in operational efficiency, reduced time costs, and good scalability, particularly for large datasets or numerous participants. The method presented in this paper has far-reaching implications for the development of the information society. Despite its limitations and potential biases, our method demonstrated high matching quality and security in the experiments. Future research could validate our method on a broader range of datasets and application scenarios while exploring techniques to further optimize the computational efficiency and security. Through continuous improvement and validation, we believe that the TEE-based PPRL method will play an important role in data privacy protection and efficient data matching.

Author Contributions

Conceptualization, S.H. and K.S.; methodology, S.H. and K.S.; software, S.H. and K.S.; validation, S.H. and K.S.; formal analysis, S.H., C.W. and D.S.; investigation, S.H., D.S. and C.W.; resources, S.H. and K.S.; data curation, S.H. and K.S.; writing—original draft preparation, S.H. and K.S.; writing—review and editing, S.H. and K.S.; visualization, S.H., K.S. and D.S.; supervision, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62172082) and the Education Department of Liaoning Province, Youth Project (LJKQZ20222440).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, X.; Wei, H.; Han, S.; Shen, D. Multi-party privacy-preserving record linkage method based on trusted execution environment. In Proceedings of the International Conference on Web Information Systems and Applications, Dalian, China, 16–18 September 2022; pp. 591–602.
2. Li, T.; Gu, Y.; Zhou, X.; Ma, Q.; Yu, G. An effective and efficient truth discovery framework over data streams. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT), Venice, Italy, 21–24 March 2017; pp. 180–191.
3. Wang, J.; Li, T.; Wang, A.; Liu, X.; Chen, L.; Chen, J.; Liu, J.; Wu, J.; Li, F.; Gao, Y. Real-time workload pattern analysis for large-scale cloud databases. arXiv 2023, arXiv:2307.02626.
4. Micali, S.; Goldreich, O.; Wigderson, A. How to play any mental game. In Proceedings of the Nineteenth ACM Symposium on Theory of Computing, STOC, New York, NY, USA, 25–27 May 1987; pp. 218–229.
5. Atallah, M.J.; Du, W. Secure multi-party computational geometry. In Proceedings of the Workshop on Algorithms and Data Structures, Providence, RI, USA, 8–10 August 2001; pp. 165–179.
6. Benenson, Z.; Gärtner, F.C.; Kesdogan, D. Secure Multi-Party Computation with Security Modules; Gesellschaft für Informatik eV: Bonn, Germany, 2005.
7. Du, W.; Atallah, M.J. Secure multi-party computation problems and their applications: A review and open problems. In Proceedings of the 2001 Workshop on New Security Paradigms, Cloudcroft, NM, USA, 11–13 September 2001; pp. 13–22.
8. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426.
9. Kuzu, M.; Kantarcioglu, M.; Durham, E.; Malin, B. A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In Proceedings of the Privacy Enhancing Technologies: 11th International Symposium, PETS 2011, Waterloo, ON, Canada, 27–29 July 2011; Proceedings 11. pp. 226–245.
10. Niedermeyer, F.; Steinmetzer, S.; Kroll, M.; Schnell, R. Cryptanalysis of basic bloom filters used for privacy preserving record linkage. J. Priv. Confidentiality 2014, 6, 59–79.
11. Ravikumar, P.; Cohen, W.W.; Fienberg, S.E. A secure protocol for computing string distance metrics. In Proceedings of the ICDM-04: 2004 IEEE International Conference on Data Mining, Brighton, UK, 1–4 November 2004; pp. 40–46.
12. Al-Lawati, A.; Lee, D.; McDaniel, P. Blocking-aware private record linkage. In Proceedings of the 2nd International Workshop on Information Quality in Information Systems, Baltimore, MD, USA, 17 June 2005; pp. 59–68.
13. Churches, T.; Christen, P. Some methods for blindfolded record linkage. BMC Med. Inform. Decis. Mak. 2004, 4, 9.
14. Hernández, M.A.; Stolfo, S.J. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 1998, 2, 9–37.
15. Inan, A.; Kantarcioglu, M.; Bertino, E.; Scannapieco, M. A hybrid approach to private record linkage. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008; pp. 496–505.
16. Lai, P.K.; Yiu, S.-M.; Chow, K.-P.; Chong, C.; Hui, L.C.K. An Efficient Bloom Filter Based Solution for Multiparty Private Matching. In Proceedings of the Security and Management, Las Vegas, NV, USA, 26–29 June 2006; pp. 286–292.
17. Kerschbaum, F. Distance-preserving pseudonymization for timestamps and spatial data. In Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society, Alexandria, VA, USA, 29 October 2007; pp. 68–71.
18. Randall, S.M.; Ferrante, A.M.; Boyd, J.H.; Bauer, J.K.; Semmens, J.B. Privacy-preserving record linkage on large real world datasets. J. Biomed. Inform. 2014, 50, 205–212.
19. Hu, D.; Chen, L.; Fang, H.; Fang, Z.; Li, T.; Gao, Y. Spatio-temporal trajectory similarity measures: A comprehensive survey and quantitative study. IEEE Trans. Knowl. Data Eng. 2023, 36, 2191–2212.
20. Chen, C.; Farmani, M.; Eisenbarth, T. A tale of two shares: Why two-share threshold implementation seems worthwhile—And why it is not. In Proceedings of the Advances in Cryptology–ASIACRYPT 2016: 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, 4–8 December 2016; Proceedings, Part I 22. pp. 819–843.
21. Moradi, A.; Schneider, T. Side-Channel Analysis Protection and Low-Latency in Action: Case Study of PRINCE and Midori. In Proceedings of the Advances in Cryptology–ASIACRYPT 2016: 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, 4–8 December 2016; Proceedings, Part I 22. pp. 517–547.
22. Goubin, L. A sound method for switching between boolean and arithmetic masking. In Proceedings of the Cryptographic Hardware and Embedded Systems—CHES 2001: Third International Workshop, Paris, France, 14–16 May 2001; Proceedings 3. pp. 3–15.
23. Coron, J.-S.; Großschädl, J.; Tibouchi, M.; Vadnala, P.K. Conversion from arithmetic to boolean masking with logarithmic complexity. In Proceedings of the Fast Software Encryption: 22nd International Workshop, FSE 2015, Istanbul, Turkey, 8–11 March 2015; Revised Selected Papers 22. pp. 130–149.
24. Maghrebi, H.; Servant, V.; Bringer, J. There is wisdom in harnessing the strengths of your enemy: Customized encoding to thwart side-channel attacks. In Proceedings of the Fast Software Encryption: 23rd International Conference, FSE 2016, Bochum, Germany, 20–23 March 2016; Revised Selected Papers 23. pp. 223–243.
25. Faust, S.; Paglialonga, C.; Schneider, T. Amortizing randomness complexity in private circuits. In Proceedings of the Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Proceedings, Part I 23. pp. 781–810.
26. Balasch, J.; Faust, S.; Gierlichs, B.; Paglialonga, C.; Standaert, F.-X. Consolidating inner product masking. In Proceedings of the Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Proceedings, Part I 23. pp. 724–754.
27. Wang, Y.; Fan, H.; Dai, Z. Advances in side channel attacks and countermeasures. Chin. J. Comput. 2023, 46, 202–228.
28. Zhang, C.; Li, S.; Xia, J.; Wang, W.; Yan, F.; Liu, Y. BatchCrypt: Efficient homomorphic encryption for Cross-Silo federated learning. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC 20), Online, 15–17 July 2020; pp. 493–506.
29. Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B.; Gao, Y.; Hu, J. Evolutionary clustering of moving objects. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2399–2411.
30. Li, T.; Huang, R.; Chen, L.; Jensen, C.S.; Pedersen, T.B. Compression of uncertain trajectories in road networks. Proc. VLDB Endow. 2020, 13, 1050–1063.
31. Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B. TRACE: Real-time compression of streaming trajectories in road networks. Proc. VLDB Endow. 2021, 14, 1175–1187.
32. Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Prague, Czech Republic, 2–6 May 1999; pp. 223–238.
33. Dziembowski, S.; Faust, S. Leakage-resilient circuits without computational assumptions. In Proceedings of the Theory of Cryptography Conference, Taormina, Italy, 19–21 March 2012; pp. 230–247.
34. Balasch, J.; Faust, S.; Gierlichs, B.; Verbauwhede, I. Theory and practice of a leakage resilient masking scheme. In Proceedings of the Advances in Cryptology–ASIACRYPT 2012: 18th International Conference on the Theory and Application of Cryptology and Information Security, Beijing, China, 2–6 December 2012; Proceedings 18. pp. 758–775.
35. Balasch, J.; Faust, S.; Gierlichs, B. Inner product masking revisited. In Proceedings of the Advances in Cryptology–EUROCRYPT 2015: 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, 26–30 April 2015; Proceedings, Part I 34. pp. 486–510.
36. Ishai, Y.; Sahai, A.; Wagner, D. Private circuits: Securing hardware against probing attacks. In Proceedings of the Advances in Cryptology-CRYPTO 2003: 23rd Annual International Cryptology Conference, Santa Barbara, CA, USA, 17–21 August 2003; Proceedings 23. pp. 463–481.
37. Christen, P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 2011, 24, 1537–1555.
38. Shi, H.; Zuo, L.; Wang, S.; Yuan, Y.; Su, C.; Li, P. Robust predictive fault-tolerant switching control for discrete linear systems with actuator random failures. Comput. Chem. Eng. 2024, 181, 108554.
39. Randall, S.M.; Brown, A.P.; Ferrante, A.M.; Boyd, J.H.; Semmens, J.B. Privacy preserving record linkage using homomorphic encryption. In Proceedings of the Population Informatics for Big Data, Sydney, Australia, 10 August 2015; Volume 10.
40. Vatsalan, D.; Christen, P.; Rahm, E. Scalable privacy-preserving linking of multiple databases using counting Bloom filters. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 12–15 December 2016; pp. 882–889.

Figure 1. TEE overall architecture diagram.

Figure 2. Side-channel attack process diagram.

Figure 3. The flow of PPRL based on the enhanced TEE.

Figure 4. The flow of data processing in the TEE.

Figure 5. Data source size’s impact on the runtime.

Figure 6. Number of parties’ impact on the runtime.

Figure 7. Number of parties’ impact on the recall.

Figure 8. Number of parties’ impact on the precision.

Figure 9. Number of parties’ impact on the F-measure.

Figure 10. Number of parties’ impact on the evaluation metrics under different degrees of disturbance.

Table 1. Comparative analysis of PPRL methods.

Methods/Researchers	Main Feature	Computational Cost	Communication Cost	Linkage Quality	Security	Attack Resistance
First-generation PPRL (SMC)	Protect sensitive data with SMC	High	Medium	High	High	Medium
Second-generation PPRL (Bloom filter codings)	Using the Bloom filter codings	Low	Medium	Medium	Medium	Low
Lawati et al. [12]	Secure TF-IDF to produce weight vectors	Medium	Low	Low	High	High
Churches and Christen [13]	Extract q-grams from QID values	Low	High	Medium	Medium	Low
Third-generation PPRL	Privacy-preserving categorized neighborhood Approach	High	High	High	High	High
Inan et al. [15]	A hybrid technique combining sanitization and cryptographic approaches	Low	Medium	High	Medium	Medium
Lai et al. [16]	Based on Bloom filters and limited fault tolerance	Medium	Low	High	High	High
Kerschbaum [17]	Anonymous approach	Medium	Medium	High	Medium	Medium
Randall et al. [18]	Homomorphic encryption combined with Bloom filter	High	Medium	High	High	Medium
Proposed method	Integration of TEE and homomorphic encryption	Medium	Medium	High	High	High

Table 2. Analysis of resistance to side-channel attacks.

Researchers	Main Feature	Computational Cost	Communication Cost	Linkage Quality	Security
Chen et al. [20]	Shared threshold scheme, strong resistance to side-channel leakage	Medium	Medium	Medium	High
Moradi et al. [21]	Low-delay designs, resist first-order leakage through mask and threshold schemes	Low	Medium	High	Medium
Goubin et al. [22]	Converts Boolean masks to arithmetic masks with minimal operations	Medium	Low	Medium	Medium
Coron et al. [23]	Kogge–Stone adder for carry calculation in O(log K) operations	Medium	Low	Medium	Medium
Maghrebi et al. [24]	Customized coding based on leakage model analysis	High	Medium	High	High
Faust et al. [25]	Studies reuse of random numbers in mask schemes	Medium	Medium	High	Medium
Balasch et al. [26]	Inner product mask scheme, efficiency improvements, T-probe security model	Low	Medium	High	High
Proposed method	Inner product mask for defense against side-channel attacks	Medium	Medium	High	High

Table 3. Parameter table for the enhanced multi-party PPRL method based on the TEE.

Parameters	Description
$K$	A field with characteristic 2
$n$	Vector length
$pk$	Public key
$sk$	Private key
$\cdot$	Field multiplication
$δ_{i,j}$	$δ_{i,j}$ is 0 when $i = j$ and 1 otherwise
$m$	Plaintext message
$c$	Ciphertext message
$g, r_{i}$	A set of integers that are prime in $Z_{n^{2}}$
$L$	Function defined as $L(u) = (u - 1)/n$
$T, U, U', V$	Matrices
$S, L, A, B, C$	Vectors
$\langle A, B \rangle$	Standard inner product over $K$
$p, q$	Prime numbers
$\mathrm{lcm}(a, b)$	The least common multiple of $a$ and $b$
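
To make the Paillier-related symbols in Table 3 concrete, the following is a minimal sketch of how $p$, $q$, $\mathrm{lcm}(a, b)$, $g$, $r_{i}$, and $L(u)$ typically fit together in a textbook Paillier scheme. It is not the implementation used in the experiments. The function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`), the choice $g = n + 1$, and the toy primes are assumptions made only for illustration; here $n$ denotes the Paillier modulus $pq$, and real deployments need primes of cryptographic size.

```python
import math
import secrets


def L(u, n):
    """L(u) = (u - 1) / n, computed as exact integer division."""
    return (u - 1) // n


def keygen(p, q):
    """Build a toy key pair from two primes p and q (hypothetical helper)."""
    n = p * q
    # lambda = lcm(p - 1, q - 1), written with gcd for older Python versions
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    g = n + 1  # assumption: a common simple choice of generator
    # mu = (L(g^lambda mod n^2))^(-1) mod n; needs Python 3.8+ for pow(x, -1, n)
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)  # pk, sk


def encrypt(pk, m):
    """c = g^m * r^n mod n^2, with a fresh random r coprime to n."""
    n, g = pk
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c):
    """m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


def add_encrypted(pk, c1, c2):
    """Additive homomorphism: multiplying ciphertexts adds the plaintexts."""
    n, _ = pk
    return (c1 * c2) % (n * n)


# Toy usage with small primes (insecure, for illustration only).
pk, sk = keygen(61, 53)
c1 = encrypt(pk, 42)
c2 = encrypt(pk, 100)
total = decrypt(pk, sk, add_encrypted(pk, c1, c2))
```

In this sketch the public key $pk$ is the pair $(n, g)$ and the private key $sk$ is $(\lambda, \mu)$. Decrypting the product of two ciphertexts yields the sum of the two plaintexts modulo $n$, which is the additive property that lets similarity computations run on encrypted values.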

Table 4. Experimental configuration.

Configuration Item	Configuration Information
CPU	Intel(R) Core(TM) i5-9300H
Memory	16 GB
Hard disk	1 TB
Operating system	64-bit Windows 11
Software implementation	PyCharm (2023.1)
Dataset	North Carolina voter registration list

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
