1. Introduction
Dynamic searchable symmetric encryption (DSSE) is a variant of searchable symmetric encryption (SSE) designed to support dynamic operations, such as addition and deletion, on an encrypted database (EDB) [1,2]. Although DSSE schemes enable these flexible operations without decryption, they are prone to leaking sensitive information. For example, an adversary can identify which added or deleted documents are accessed by users by exploiting the access pattern leakage [3], or infer the underlying keyword of queries by exploiting the search pattern leakage [3].
To formally address this information leakage problem, Bost et al. [4,5] introduced the notion of forward privacy and three types (Type-I, II, and III) of backward privacy for DSSE, mainly to capture the linkability between queries (e.g., update and search) and updated data. Although leaking less information implies higher security, it inevitably incurs higher computational overhead. For example, Type-I backward privacy has only been achieved by adopting cryptographically heavy primitives such as oblivious RAM (ORAM) [6].
Recently, trusted execution environments (TEEs) such as Intel SGX [7] have been considered for DSSE to mitigate the efficiency problems incurred by forward/backward privacy preservation [8,9,10,11]. Amjad et al. [8] proposed several forward private DSSE schemes with various types of backward privacy using Intel SGX. Their first scheme, called Fort, achieves the strongest security guarantee (Type-I), but suffers from high communication overhead due to its use of ORAM [6]. To balance security and efficiency, they proposed another scheme, called Bunker-B (Type-II), aiming to reduce communication costs. However, Bunker-B still suffers from poor scalability.
To resolve this problem, Vo et al. proposed SGX-SE1, providing Type-II backward privacy [11], and Maiden, providing Type-I backward privacy [12], both leveraging a server-side SGX enclave as a proxy to reduce communication cost and improve scalability. Unfortunately, despite its efficiency improvement, we observed that SGX-SE1 does not fully guarantee Type-II backward privacy as claimed. By exploiting secondary leakage that is naturally allowed in its protocol construction, an adversary can learn information about the deletion history, which is permitted only under Type-III backward privacy. Thus, in this paper, we seek to answer the following question:
What is the root cause of, and the conditions for, the leakage, and how can it be exploited to break the backward privacy of DSSE schemes?
In search of the answer, we first conduct an in-depth analysis of the information leakages in existing Type-II backward private DSSE schemes [8,11,13,14], in terms of both their theoretical definitions and their constructions. We found that several schemes [11,14] exhibit information leakages that can be exploited to extract the deletion history of the encrypted data, violating Type-II backward privacy. We then examine the conditions enabling this vulnerability, and demonstrate how it affects Type-II backward privacy in practice by exploiting the secondary information leakage of Vo et al.'s [11] and Sun et al.'s [14] DSSE schemes.
Next, we propose a novel forward and Type-II backward private DSSE scheme based on SGX. To this end, we design an obfuscation technique that hides the access and search patterns in order to prevent the information leakages related to update operations. To minimize the extra communication and computation costs incurred by the obfuscation technique, we selectively cache the top-k most frequently accessed documents inside the SGX enclave for fast data retrieval and obfuscation.
According to our comparative analysis against the state-of-the-art SGX-based DSSE schemes, Bunker-B [8] and SGX-SE1 [11], the search time of the proposed scheme is approximately 27× faster than that of Bunker-B while providing the same security level. Compared with SGX-SE1, the proposed scheme achieves a higher level of security while minimizing performance degradation.
Contributions: Our contributions are summarized as follows:
We conduct a comprehensive analysis of existing Type-II backward private schemes and discover information leakage that falls outside the purview of the existing backward privacy notions. We then demonstrate how these leakages can be exploited to extract the deletion history in Type-II schemes.
We demonstrate how our findings on information leakage undermine the security of known Type-II backward private schemes by attacking the DSSE schemes of Vo et al. [11] and Sun et al. [14].
We design a novel forward and Type-II backward private DSSE scheme based on SGX, which hides the information leakage of deletion history with high efficiency.
We conduct a comparative analysis of our scheme against the state-of-the-art SGX-based DSSE schemes, Bunker-B [8] and SGX-SE1 [11], on both synthetic and real-world Enron [15] datasets. According to the analysis, our scheme achieves lower search latency with negligible utility loss at the same security level (cf. Bunker-B), and similar efficiency at a higher security level (cf. SGX-SE1).
3. Leakage Abuse Attack
In this section, we introduce how the leakage information we found can be exploited by an adversary to obtain the deletion history in state-of-the-art Type-II backward private schemes.
Table 1 shows the state-of-the-art Type-II backward private schemes that are prone to our attack. To demonstrate its feasibility, we show how our attack can be conducted on SGX-SE1 [11] and Aura [14], as representative SGX-based and non-SGX-based Type-II backward private schemes, respectively. Finally, we discuss the root cause of the vulnerability in these scheme constructions.
3.1. Threat Model
Similar to previous DSSE schemes [8,11], we consider a semi-honest adversary at the server side. The adversary can observe the interactions of the enclave with resources located outside the enclave, and has full access to the software stack outside the enclave, including the operating system and hypervisor. Additionally, the adversary can learn information about access patterns by observing memory addresses and encrypted data in the encrypted database. Finally, the adversary can log the timestamps of every memory manipulation during the entire protocol, aiming to extract the deletion history.
3.1.1. Trust Assumptions on Intel SGX
We assume that the SGX enclave behaves normally, without hardware bugs or backdoors. The preset code and data inside the enclave are securely protected, and the cryptographic primitives provided by SGX are trusted [7]. Furthermore, communications and data transfers between different clients or servers are protected by secure channels established via the SGX attestation service. The enclave can be invoked whenever clients need it. We do not consider side-channel or denial-of-service (DoS) attacks against SGX, as in many other SGX-based applications [13,18,19,20,21].
3.2. Extraction of Deletion History
As described in Section 2.3.2, Type-II backward privacy only leaks the document identifiers matching the searched keyword w together with their insertion timestamps, TimeDB(w), and the timestamps of all updates on w, Updates(w). If a scheme leaks the deletion history DelHist(w) along with the other Type-II leakages, the adversary learns which deletion update cancels which addition update that previously occurred, downgrading the scheme to Type-III backward privacy. Thus, if part of the deletion history is revealed to an adversary, the privacy level of the scheme may be weaker than the claimed Type-II backward privacy.
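To make the three leakage profiles concrete, the functions TimeDB(w), Updates(w), and DelHist(w) can be sketched over a toy update history (the notation follows Bost et al.'s definitions; the list-based representation below is our own simplification):

```python
# Toy model of the backward-privacy leakage functions.
# An update history is a list of (timestamp, op, keyword, ind) tuples.

def time_db(history, w):
    """TimeDB(w): (insertion time, ind) of documents still matching w."""
    deleted = {ind for (_, op, kw, ind) in history if kw == w and op == "del"}
    return [(u, ind) for (u, op, kw, ind) in history
            if kw == w and op == "add" and ind not in deleted]

def updates(history, w):
    """Updates(w): timestamps of all updates (add or del) on w."""
    return [u for (u, op, kw, _) in history if kw == w]

def del_hist(history, w):
    """DelHist(w): which deletion cancels which addition (Type-III leakage)."""
    add_time = {ind: u for (u, op, kw, ind) in history
                if kw == w and op == "add"}
    return [(add_time[ind], u_del) for (u_del, op, kw, ind) in history
            if kw == w and op == "del" and ind in add_time]

history = [(1, "add", "w", "d1"), (2, "add", "w", "d2"), (3, "del", "w", "d1")]
```

On this history, a Type-II scheme may reveal time_db and updates, while del_hist (the pair linking timestamp 3 back to timestamp 1) must remain hidden.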
According to our investigation of existing Type-II backward private schemes, we observed that they may leak the deletion history when an adversary can exploit the access and search pattern leakages together with the information leakages allowed in Type-II backward privacy (especially when the schemes are constructed with static data structures or static values for identifying the documents matching a specific keyword w). Specifically, to obtain the deletion history, the adversary must be able to (1) distinguish whether an update operation is an addition or a deletion, and (2) distinguish whether two separate queries are on the same keyword w (search pattern leakage). When an addition operation is executed, the adversary observes the added document identifier ind from the update query (but not the keyword w). Thus, the adversary learns the pair (u, ind), where u refers to the timestamp of the operation. According to the definition of Type-II backward privacy, this pair should be hidden from the view of the server.
Next, during the search protocol for a specific keyword, the server is able to observe the query tokens and the memory accesses to the EDB (i.e., the access pattern). Note that each query token refers to a single document. Even if the query tokens in the search index are encrypted, the adversary observes a search query Q consisting of i encrypted query tokens, where the j-th token is bound to keyword w for 1 ≤ j ≤ i. The adversary then observes Q and learns the search result R(Q), which directly indicates the access pattern of the EDB for the documents in R(Q).
When the client later sends a search query Q' consisting of k query tokens, the adversary observes Q' and learns R(Q'), the search result of Q'. If the value of a query token is encrypted in a deterministic manner and is bound to a specific keyword, then the adversary can identify whether two different search queries are linked to the same keyword w by simply comparing the values of the query tokens. (When the query tokens are generated in a non-deterministic manner, for example via re-encryption as in Bunker-B [8], our extraction of the deletion history might not work, owing to the indistinguishability of tokens in the adversary's view.)
By observing Q and Q', the adversary learns that deletion updates were performed between the two searches Q and Q' on the same keyword w. Further, by comparing the search results R(Q) and R(Q'), the adversary learns which document ind was deleted. Consequently, the probability that the adversary can determine which delete update is related to w is 1/n_del, where n_del is the number of possible deletion operations performed in that period. According to the definition of Type-II backward privacy, no information on DelHist(w) should be leaked. However, by exploiting the query tokens, the search results, and the access pattern of the EDB, the adversary learns which delete update cancels which add update, leaking DelHist(w) with probability 1/n_del. In addition, as mentioned in Section 2.2, if there exists an index whose insertion and deletion timestamps are both revealed to the adversary, the scheme leaks DelHist(w). Thus, if a deletion history can be successfully extracted, the claimed Type-II backward privacy is violated. It is important to note that even if queries are encrypted using a randomized encryption algorithm, the extraction of the deletion history can still be applied as long as the search pattern and the other information required for the extraction are leaked. In Section 3.4, we demonstrate the efficacy of the extraction by showing that 1/n_del can be non-negligible in real-world scenarios.
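The core comparison step of the extraction can be sketched as follows (the data layout is hypothetical; only the set-difference and counting logic mirror the attack described above):

```python
# Sketch of the deletion-history extraction under the stated conditions:
# deterministic tokens link the two searches, and result sets are visible.

def extract_deletions(result_q, result_q2, deletion_times):
    """Return (deleted inds, candidate timestamps, success probability).

    result_q / result_q2: identifiers observed for two searches on the
    same keyword; deletion_times: timestamps of all deletion updates
    observed between the two searches (on any keyword).
    """
    deleted = set(result_q) - set(result_q2)   # docs that disappeared
    n_del = len(deletion_times)                # candidate deletions
    prob = 1.0 / n_del if n_del else 0.0       # chance of a correct guess
    return deleted, deletion_times, prob

deleted, candidates, p = extract_deletions(
    ["d1", "d2", "d3"], ["d1", "d3"], [105, 112, 120])
# 'd2' was deleted by one of the three candidate updates, so p = 1/3
```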
3.3. Attacks on Prior Works
We show how the deletion history can be extracted from Vo et al.'s SGX-SE1 [11] and Sun et al.'s Aura [14]. For better understanding, we first briefly review each scheme's construction. Then, we demonstrate how the deletion history can be extracted from each protocol.
3.3.1. Vo et al. Scheme
Vo et al. [11] proposed SGX-based forward and Type-II backward private DSSE schemes, named SGX-SE1 and SGX-SE2. In this paper, we focus only on SGX-SE1 because it is the baseline scheme upon which SGX-SE2 is built.
When a client uploads a new document in SGX-SE1, the document is sent to the SGX enclave on the server through the addition operation of the update protocol. The enclave then parses all the keywords within the document and updates the latest state c_w, which counts the number of documents containing w. Specifically, for each keyword w, c_w increases by 1 whenever a document containing w is added. The encrypted index u is a deterministic value generated as u = H(K_w, c), where H refers to a hash function, K_w is a key value bound to w, and c is a count value of c_w. A map indexed by the encrypted index u stores the encrypted document identifier.
When deleting a document, the client transfers the corresponding document identifier to the enclave. On receipt, the enclave stores the identifier in a deleted-document list d; the actual deletion from the EDB is deferred to the search protocol.
In the search protocol, when the enclave receives a keyword w from the client, it first fetches the document identifiers in the deleted-document list d. Next, the enclave loads the corresponding documents and checks whether the keyword w exists within each document. The enclave retrieves the count values of the deleted documents, which were used when they were added, and excludes these values from the set {1, ..., c_w}. The remaining values in the set are then used to generate the query tokens u for the non-deleted documents. The list of query tokens is transferred to the server. Finally, the server obtains the document identifiers by decrypting the map entries addressed by the tokens, and returns the corresponding encrypted documents to the client.
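A minimal sketch of the deterministic token generation described above, assuming an HMAC-SHA256 instantiation of H (the actual primitive used by SGX-SE1 may differ):

```python
import hmac, hashlib

def token(key_w: bytes, count: int) -> bytes:
    """Deterministic query token u = H(K_w, c), as in SGX-SE1-style indexing."""
    return hmac.new(key_w, count.to_bytes(4, "big"), hashlib.sha256).digest()

k_w = b"key-bound-to-keyword-w"
# Tokens for counts 1..3; a repeated search on w regenerates the same
# values, so the server can link the two queries (search pattern leakage).
first_search = [token(k_w, c) for c in (1, 2, 3)]
second_search = [token(k_w, c) for c in (1, 2, 3)]
assert first_search == second_search
```

Determinism is exactly what makes the tokens traceable across searches, which the extraction scenario below relies on.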
Extraction Scenario. Table 2 shows an example flow of the SGX-SE1 protocol [11], along with the leaked information and its type. When searching on keyword w in SGX-SE1, the enclave sends to the server a search query containing a list of pairs of the query token u and the key value bound to the document identifier. The adversary can observe and trace the deterministic values of the u's, which are bound to the specific keyword. Therefore, by comparing the values of u with those of past searches, the adversary learns whether two different queries are on the same keyword, leading to search pattern leakage. Because the document identifiers and encrypted documents are retrieved in the untrusted area, the adversary can observe the access pattern of the matching results as well as the accessed document identifiers. The adversary compares the matching results of the two queries and learns that a deletion update on the vanished identifier occurred in the period between the two searches (as shown in Table 2). Among the three delete updates that occurred between the two searches, one of them must correspond to this deletion; thus, the probability of making a correct guess is 1/3. Since the adversary has knowledge of the timestamps of all updates and of when each document was added together with its identifier value, the aforementioned leakage allows the adversary to learn DelHist(w).
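The scenario above can be replayed on a toy trace (the timestamps and identifiers are invented for illustration):

```python
# Toy replay of the Table 2 scenario: two searches on the same keyword
# bracket three delete updates; one of them must cancel the vanished doc.

trace = [
    (10, "search", {"d1", "d2", "d3"}),  # first search on w: result set
    (12, "del", None), (15, "del", None), (18, "del", None),
    (20, "search", {"d1", "d3"}),        # second search on w
]
add_time = {"d1": 1, "d2": 2, "d3": 3}   # observed at addition time

first = next(r for (_, op, r) in trace if op == "search")
last = [r for (_, op, r) in trace if op == "search"][-1]
vanished = first - last                   # the deleted document: {'d2'}
candidates = [t for (t, op, _) in trace if op == "del"]

# DelHist(w) guesses: (addition time of d2, one of the deletion times)
guesses = [(add_time[d], t) for d in vanished for t in candidates]
success_prob = 1 / len(candidates)        # 1/3, matching the text
```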
3.3.2. Sun et al. Scheme
Sun et al. [14] proposed a non-SGX-based Type-II backward private scheme called Aura. Aura requires the client to revoke the encryption key after each search. The search result does not need to be re-encrypted because the previous search result is cached.
When the client adds a keyword w and a document identifier ind to the database, the client retrieves the most recently used encryption key, and computes a ciphertext from ind together with a tag t = F(k_t, ind), where F is a pseudo-random function and k_t is the secret key for t. Then, the computed ciphertext is inserted into the database. For deletion, the client inserts the tag t corresponding to the deleted entry into the deletion list D.
For searching on keyword w, the client retrieves the number of previous searches on w, denoted as i, the current secret key, and the deletion list D. Then, the client computes the revoked secret key and a query token, sends them to the server, and refreshes the key. The server retrieves the encrypted indices matching w and decrypts the non-deleted indices with the revoked key. The non-deleted indices are added to a list NewInd; the deleted tags are added to a list DelInd. Then, the server retrieves the indices stored in the cache, which holds the previous search result, and excludes the entries that are in DelInd. Finally, NewInd, together with the non-deleted indices stored in the cache, is returned to the client.
Extraction Scenario. As explained above, during the search protocol, the server accesses the cache to retrieve the cached search results. By comparing the accessed cache locations, the adversary can learn which previous search query was on the same keyword w. The timestamps of the addition update queries on w that occurred between the two search queries on w are revealed to the adversary. Note that the remaining update queries between the two search queries are possibly deletion updates on w. During the execution of the second search query, the adversary obtains additional information on how many deletion updates on w occurred by observing the number of entries excluded from the cache. Since the cached entries contain the document indices, the deleted indices are also revealed while the deleted entries are excluded from the cache.
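The cache-exclusion observation can be sketched as follows (the cache layout is a simplification introduced only for illustration, not Aura's actual data structure):

```python
# The server-side cache holds the previous search result for a keyword.
# Excluding deleted entries in the clear reveals both how many deletions
# happened since the last search and which indices they removed.

def refresh_cache(cache, del_ind):
    """Drop deleted indices from the cached result, as the server observes."""
    excluded = [ind for ind in cache if ind in del_ind]
    kept = [ind for ind in cache if ind not in del_ind]
    return kept, excluded

cache = ["d1", "d2", "d3", "d4"]        # previous search result for w
kept, excluded = refresh_cache(cache, {"d2", "d4"})
n_deletions_on_w = len(excluded)         # deletion count leaked: 2
# 'excluded' itself reveals exactly which indices were deleted
```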
Because the insertion timestamps of the cached indices are revealed, the adversary, similarly to the case of Vo et al.'s scheme [11], is able to extract DelHist(w) using the timestamps of the possible deletion updates on w and the deleted ind values. However, the attack against Aura has a lower success rate than that against SGX-SE1, because Aura does not follow the traditional definition of Type-II backward privacy but a somewhat different one.
3.3.3. Discussion
As shown in the above attack scenarios, in addition to the information leakages defined in Type-II backward privacy, there exists extra information that an adversary may learn from the protocols and exploit to extract a deletion history. A root cause of this additional leakage is the static data structures and static values used for updating and searching a keyword. During the search protocol, for instance, SGX-SE1 [11] and Aura [14] access the same locations of their data structures when searching on the same keyword. Thus, the adversary can determine whether a certain document was added or deleted by comparing against the same information learned from the previous search query on that keyword. Bunker-B [8], on the other hand, accesses different locations of its data structure for every search query on the same keyword, owing to the re-encryption of the entries after every search. However, re-encryption incurs high computational overhead, degrading the practicality of the scheme. It is thus a challenging problem to minimize extra information leakage while achieving high efficiency.
3.4. Feasibility of Deletion History Extraction
We evaluate the feasibility of deletion history extraction by simulating its probability under diverse distribution models of data upload, deletion, and keyword search in cloud storage. In the simulation, we randomly generate queries and analyze the probability of deletion history extraction.
Following the distribution models for file transfer [22] and search queries [23], we assume that document uploads follow a Poisson distribution, that the data lifespan follows an exponential distribution, and that the inter-arrival times of search requests on the same keyword follow an exponential distribution. Also, the keyword frequency of the documents follows a Zipf distribution.
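The workload generation can be sketched as follows; all rate parameters here are placeholders rather than the values used in our evaluation (a Poisson process is realized via exponential inter-arrival times, and keyword popularity via Zipf-like weights):

```python
import random

random.seed(7)

def simulate(hours=100, upload_rate=2.0, mean_life=30.0,
             mean_search_gap=10.0, n_keywords=50, zipf_s=1.0):
    """Generate upload, delete, and search events for one simulation run.

    upload_rate: documents per hour (Poisson process = exponential gaps);
    mean_life: mean document lifespan in hours (exponential);
    mean_search_gap: mean hours between searches on the tracked keyword;
    keywords are drawn from a Zipf-like distribution.
    """
    weights = [1 / (k ** zipf_s) for k in range(1, n_keywords + 1)]
    events, t, doc_id = [], 0.0, 0
    while True:
        t += random.expovariate(upload_rate)           # next upload time
        if t > hours:
            break
        kw = random.choices(range(n_keywords), weights)[0]
        events.append((t, "add", doc_id, kw))
        death = t + random.expovariate(1 / mean_life)  # lifespan draw
        if death <= hours:
            events.append((death, "del", doc_id, kw))
        doc_id += 1
    s = 0.0
    while True:
        s += random.expovariate(1 / mean_search_gap)   # searches on keyword 0
        if s > hours:
            break
        events.append((s, "search", None, 0))
    events.sort()
    return events

events = simulate()
```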
Based on the aforementioned distribution models, we evaluate the feasibility of deletion history extraction by measuring its probability under various simulation settings.
Figure 1 shows the evaluation results for each case. In the figure, the horizontal axis represents the time in hours, and the vertical axis represents the probability of successful extraction, p_ext, i.e., the probability of identifying which document was deleted. When simulating for 100 h under the aforementioned distribution parameters, we observed a total of 11 time instances that enable our extraction with non-negligible probability. As shown in Figure 1a, for example, there exists a time instance at which the extraction succeeds with probability 100% (between 56 and 57 h), leading to the violation of Type-II backward privacy. As another example, one can observe that p_ext peaks two times between 20 and 70 h, meaning that the deleted document can be identified with non-negligible probability, violating Type-II backward privacy in a probabilistic (but still pragmatically meaningful) way in those periods.
Figure 1b shows the evaluation results when search requests are performed, on average, four times more frequently than in Figure 1a. As a result, we observed two time instances with full extraction success (between 33 and 34 h, and between 61 and 62 h) and five time instances with non-negligible extraction probability.
In Figure 1a,b, on average, 53 update queries (consisting of 41 addition and 10 deletion queries) were generated for the target keyword. When considering all of the update queries in the above simulations, the adversary could extract 1.6% and 3.2% of the actual deletion history for the target keyword, respectively. Considering only the deletion queries for the target keyword, the adversary could extract 10% and 20% of them. When additionally counting the time instances whose attack success rate is non-negligible but below 1, the probability of possible deletion history extraction increases further.
The aforementioned simulations indicate that the extraction probability increases as the number of search requests on the same keyword increases (compare Figure 1a,b). When calculating p_ext = 1/n_del, the total number of deletions that occurred between two searches on the same keyword, n_del, determines the probability. Clearly, if search requests are sent more frequently, n_del is smaller with high probability, leading to an increase in p_ext.
5. Security Analysis
In this section, we analyze the security of the proposed scheme, and prove that it guarantees forward and Type-II backward privacy.
The Setup protocol leaks nothing to the server by leveraging a secure channel established via remote attestation. However, because the adversary can observe the interactions between the server's memory and the enclave during the search and update procedures, we treat the communications between them as information leakage in the proposed scheme. For addition, the Update protocol leaks the timestamp and the memory access patterns when inserting new entries into the data structures and R. For deletion, Update does not reveal any information to the server because there is no interaction between the enclave and the server. In the Search protocol, the access patterns on the data structures and R are revealed to the server.
Definition 1 (Obfuscated Search Pattern). The obfuscated search pattern is the vector characterized by Equation (2), where u refers to the encrypted index or query token generated during the search.
Definition 2 (Obfuscated Access Pattern). The obfuscated access pattern is the vector of accessed documents from which the documents retrieved from the cached map are excluded.
We formulate the leakage function, and define the Real and Ideal games for an adaptive adversary A and a polynomial-time simulator S.
Σ denotes the proposed scheme, and the leakage function of Σ is L = (L_Setup, L_Update, L_Search, L_SGX). Note that the first three leakage functions define the information exposed in Setup, Update, and Search, respectively, while L_SGX refers to the inherent leakage of the enclave exposed during the interaction between the enclave and the server.
L_Setup leaks nothing to the server, but the (initially empty) data structures, including R, are revealed to the server during initialization.
L_Update leaks information to the server only when op = add: the access pattern of the encrypted entries is observed by the server when they are inserted into the data structures and R. When op = del, L_Update leaks nothing to the server because there is no interaction between the enclave and the server. Therefore, the addition leakage consists of the entry inserted into the encrypted index, the entry inserted into the encrypted map of keyword states, and the encrypted document inserted together with its document identifier.
L_Search leaks the obfuscated search pattern when the enclave sends the query tokens to the server, the obfuscated access pattern when it retrieves the encrypted documents from the server, and the access patterns on the deleted documents' list.
L_SGX includes the hardware leakages observed during Update and Search, such as memory access patterns, locations, timestamps, and the manipulated memory areas of the data structures and R.
Real_A(λ): As presented in Algorithm 1, the challenger first performs the setup to initialize the data structures used by the client, the server, and the enclave. A selects a database and performs a polynomial number of updates. When the challenger runs each update operation, its transcript is returned to A. After that, A selects a keyword w and performs a search. As a result, the challenger returns the transcript of each operation and outputs it to A. Finally, A outputs a bit b.
Ideal_A,S(λ): A selects a database. S generates a tuple of simulated transcripts by using the setup and update leakage functions, and sends the generated tuple to A. A adaptively selects the keyword w to search. The transcript generated by S from the search leakage is returned by the challenger. Finally, A outputs a bit b.
For all probabilistic polynomial-time (PPT) adversaries A, if there exists a PPT simulator S such that |Pr[Real_A(λ) = 1] − Pr[Ideal_A,S(λ) = 1]| ≤ negl(λ), then Σ is L-secure against adaptive chosen-keyword attacks.
Definition 3 (L-security). Scheme Σ consists of three protocols: Setup, Update, and Search. Real_A(λ) and Ideal_A,S(λ) are probabilistic experiments, where A is a stateful adversary and S is a stateful simulator that is given the leakage function L. Σ is L-secure if A can distinguish Real_A(λ) and Ideal_A,S(λ) with at most negligible probability.
The scheme is secure if it achieves both forward and Type-II backward privacy. Since the client issues a query on keyword w to the enclave through the secure channel, S has to generate the query token by itself in the game presented in Definition 3. Forward privacy is guaranteed because the state of w increases whenever a new document containing w is inserted; the increase in the state value constrains S to generate fresh query tokens to retrieve the newly added documents. Regarding backward privacy, A can learn the timestamps indicating when the deleted states of w were added, when the enclave requests the server to access the data structures during Search, but cannot know when they were actually requested for deletion: the scheme caches deletion requests in the enclave and only acts on them during Search. Due to the obfuscation technique in the scheme, during a search on w, the enclave generates the query tokens and includes each of them with probability p. In other words, a false negative occurs with probability 1 − p, which probabilistically omits query tokens from the result. Therefore, A is unable to identify whether tokens are omitted due to deletion operations or due to false negatives. In addition, the enclave stores the most frequently retrieved documents; if a matching document exists within the cached map, the enclave does not request the server to access R for that document. As a result, A cannot determine which delete updates were conducted for specific document identifiers.
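The token-level obfuscation argued above can be sketched as follows (the probability p and the token set are illustrative; the scheme's actual parameters are defined elsewhere in the paper):

```python
import random

def obfuscated_query(tokens, deleted, p, rng=random.Random(1)):
    """Emit tokens for non-deleted entries, each kept with probability p.

    A token can be missing either because its entry was deleted or
    because of a deliberate false negative (probability 1 - p), so the
    server cannot attribute an omission to a deletion.
    """
    return [t for t in tokens if t not in deleted and rng.random() < p]

tokens = [f"u{i}" for i in range(1000)]
sent = obfuscated_query(tokens, deleted={"u3"}, p=0.9)
# Roughly 10% of the non-deleted tokens are withheld as false negatives;
# the single genuine deletion hides among them.
assert "u3" not in sent and len(sent) < 999
```

The client-side post-processing needed to compensate for false negatives (e.g., re-querying) trades bandwidth for this indistinguishability.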
Theorem 1. Scheme Σ is L-secure according to Definition 3.
Proof. We now prove Theorem 1 by constructing a PPT simulator S such that any PPT adversary A can distinguish Real_A(λ) and Ideal_A,S(λ) with only negligible probability.
Simulating Setup: S performs a random key generation to simulate the key components provisioned inside the enclave.
Simulating Update: S executes the simulation on a random keyword w and obtains a query token q. Then, S performs an addition update for w based on the update leakage, and passes it to the enclave to receive the new update transcript.
However, A cannot determine which update tokens match q, because the enclave keeps increasing the keyword state. Thus, A is unable to distinguish between the real output and the simulated output of Real_A(λ) and Ideal_A,S(λ), guaranteeing forward privacy.
Simulating Search: S executes the simulation on a random keyword w. The encrypted documents in the deleted-document list d stored in the enclave are only requested during Search. Thus, A is unable to determine the exact timestamps of deletion updates. When generating the query tokens during Search, a false negative occurs with rate 1 − p, causing the probabilistic omission of query tokens via the obfuscated search pattern. In addition, for the generated query tokens, if the matching encrypted documents are cached within the enclave, those tokens are omitted from the result observed by A via the obfuscated access pattern.
When A compares the matching document identifier lists of the same query, the obfuscated search and access patterns prevent A from learning the information required for reconstructing the deletion history. However, A is able to learn, via L_Search, the timestamps of the inserted entries related to a specific keyword. Hence, the scheme guarantees Type-II backward privacy.
□
7. Related Work
Since the first proposal in 2000 [26], a number of searchable symmetric encryption (SSE) schemes have been proposed, formally defining the security model [3], improving efficiency [1,27], and supporting expressive queries [28,29]. Recently, several works have shown how access pattern and search pattern leakages can be exploited by an adversary to break the security of SSE [17,30]. Cash et al. [30] exploited search pattern leakage to recover the plaintext of queries or reconstruct the client's indexed documents. Zhang et al. [17] introduced a file injection attack on SSE schemes that exploits access patterns on file identifiers to identify which file identifiers correspond to maliciously injected files.
Another important research direction in SSE is the support of dynamic operations, which enables a client to perform keyword searches as well as update operations on the encrypted documents, known as dynamic searchable symmetric encryption (DSSE) [1,27]. Stefanov et al. [2] and Bost et al. [4,5] introduced the two fundamental security requirements for DSSE, namely forward and backward privacy. Recently, Sun et al. [14] introduced a non-interactive forward and backward private DSSE scheme by constructing revocable encryption. Even though it reduces the information leakage by making deletions oblivious to the server, it allows additional information leakage, such as the number of deleted documents and the candidate timestamps of deletion updates, which an adversary can further exploit to violate backward privacy by obtaining the deletion history. In addition, Vo et al. [11] proposed a Type-II backward private scheme that leverages the Intel SGX enclave as a server-side proxy, constructing a more efficient scheme than Amjad et al.'s Bunker-B [8]. Despite its improved efficiency, the Vo et al. [11] scheme allows more information leakage than Bunker-B, which can be abused to violate its backward privacy. Therefore, accomplishing high efficiency while minimizing exploitable information leakage remains a challenging open problem in the DSSE literature.