## **1. Introduction**

Hospitals and other organizations often need to publish data, e.g., medical or census data, for the purposes of scientific research and knowledge-based decision-making [1–10]. To avoid leaking individual privacy, explicitly identifying information is removed when data is released. However, individual privacy can still be leaked by linking with other public data [11]. Privacy-preserving data publishing provides methods and tools for publishing useful information while preserving individual privacy [12], and the problem has been studied extensively in recent years. Existing privacy protection methods mainly focus on relational data, for which many mature privacy models have been proposed, such as *k*-anonymity [11], *l*-diversity [13], (*α*, *k*)-anonymity [14] and *t*-closeness [15]. However, real-world data often has a complicated structure. With the advent of document-oriented databases (e.g., MongoDB) and the wide use of markup languages (e.g., XML), hierarchical data has become ubiquitous [16]. To avoid the leakage of individual privacy, hierarchical data must be properly anonymized before it is released. At present, there is little research on privacy protection for hierarchical data; Ozalp et al. [16] proposed an *l*-diversity anonymization method for hierarchical data. An example of hierarchical data is given in Figure 1. The schema for education data is obtained from Sabanci University [16], and the examples in this paper follow this schema. Figure 1a shows a student's record, which conforms to the education schema shown in Figure 1b. The student was born in 1990 and majors in Computer Science. He took two courses, CS201 and CS305. For CS201, he submitted evaluations for two instructors. For CS305, he submitted one evaluation, and the record shows that he bought a database book.
The labels of the vertices are all quasi-identifiers (QIs) of the student, and the corresponding sensitive information is marked beside each vertex. A quasi-identifier is a set of attributes that can potentially identify an individual [11]. Assume that an attacker knows some QIs of a victim and aims to infer the victim's sensitive information. In [16], suppression and generalization [11] are used to make the anonymized hierarchical dataset satisfy *l*-diversity, which ensures that the frequency of every sensitive value for the union-compatible vertices (those belonging to the same vertex in the schema) in an equivalence class is at most 1/*l*. This constraint also guarantees that every equivalence class contains at least *l* hierarchical data records. An equivalence class in an anonymized hierarchical dataset is a set of records with the same values for the QIs. However, the method does not consider the sensitivity of different sensitive attribute values, which leads to similarity attacks [15]. For example, consider an equivalence class containing three hierarchical data records whose class representative, shown in Figure 2, satisfies 3-diversity. The sensitive values of their cumulative GPAs (grade point averages) are 0.31, 0.15 and 0.09, respectively. An attacker who links some QIs of a victim can determine that the victim is in this equivalence class. Although the attacker cannot infer the victim's specific sensitive value, he knows with 100% probability that the victim's academic performance is low, so the victim's privacy is leaked. Similarly, the attacker can confirm from the value set {*D*, *D*+, *D*−} that the victim's grade in course CS201 is very low, and can infer from {0, 1/10, 2/10} that the victim is very dissatisfied with the DB Prof. To avoid similarity attacks, we propose a multi-level privacy-preserving approach for hierarchical data based on fuzzy sets.
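To make the similarity attack concrete, the following sketch (our own illustration, not part of [16]) checks frequency-based *l*-diversity on the GPA multiset of Figure 2 and computes the attacker's confidence that the victim's value falls into a semantic category. The helper names `satisfies_l_diversity` and `attack_confidence`, and the "low GPA" threshold, are hypothetical choices for this example.

```python
from collections import Counter

def satisfies_l_diversity(values, l):
    """Frequency-based l-diversity: every value occurs with frequency <= 1/l."""
    counts = Counter(values)
    return all(c / len(values) <= 1 / l for c in counts.values())

def attack_confidence(values, in_category):
    """Attacker's confidence that the victim's value lies in a category."""
    return sum(1 for v in values if in_category(v)) / len(values)

gpas = [0.31, 0.15, 0.09]  # cumulative GPAs from Figure 2
print(satisfies_l_diversity(gpas, 3))              # True: the class is 3-diverse
print(attack_confidence(gpas, lambda g: g < 1.0))  # 1.0: all GPAs are "low"
```

Even though no single GPA can be pinned down, the category-level confidence is 100%, which is exactly the leak that the multi-level approach is designed to prevent.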

**Figure 1.** An example for hierarchical data: (**a**) A student's record; (**b**) Schema for education data.

The contributions of this paper are summarized as follows:


• We conduct experiments to compare our approach with the existing anonymization method *ClusTree* proposed in [16]. Experimental results demonstrate that our approach is superior to *ClusTree* in terms of utility and security.

**Figure 2.** A class representative satisfying 3-diversity.

## **2. Related Work**

In this section, we review related work on privacy-preserving data publishing for relational data and hierarchical data.

#### *2.1. Preserving Privacy for Publishing Relational Data*

The first privacy model, *k*-anonymity for relational data, was proposed by Samarati and Sweeney [11] in 1998; it requires that every record in a table be indistinguishable from at least *k*−1 other records with respect to the QI. There exist many anonymization methods that implement *k*-anonymity, such as bottom-up generalization, top-down specialization and clustering-based techniques [17–19]. *k*-anonymity protects against identity disclosure, but cannot prevent attribute disclosure. Therefore, *l*-diversity was proposed [13]; it requires that every equivalence class contain at least *l* different sensitive values, and there are numerous methods for achieving it [20,21]. Furthermore, Wong et al. [14] extended *k*-anonymity to (*α*, *k*)-anonymity, which limits the confidence of the implications from the QI to a sensitive value to within *α* so that sensitive information cannot be inferred from strong implications, and proposed a bottom-up generalization algorithm to achieve (*α*, *k*)-anonymity. Li et al. [15] pointed out that *l*-diversity prevents neither skewness attacks nor similarity attacks, and introduced the *t*-closeness model, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of that attribute in the overall table. They also revised the Incognito algorithm [17], a generalization method originally proposed for *k*-anonymity, to achieve *t*-closeness. However, *t*-closeness still does not prevent similarity attacks. Han et al. [22] considered the difference in sensitivity among sensitive values and proposed a multi-level *l*-diversity model for numerical sensitive attributes. Furthermore, Jin et al. [23] presented (*α<sub>i</sub>*, *k*)-anonymity based on sensitivity grading; however, its levels are assigned manually. Some studies proposed fuzzy-set-based methods for privacy preservation [24,25]. They used fuzzy sets to transform sensitive values into semantic values and published the data with fuzzified sensitive information, which decreases the utility of the sensitive information and still does not resist similarity attacks.
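The relational models above can be summarized with two simple checks over equivalence classes. The sketch below is our own illustration (the table contents and function names are hypothetical, not taken from any cited work): records are grouped by their generalized QI values, then *k*-anonymity and frequency-based *l*-diversity are tested per class.

```python
from collections import defaultdict, Counter

def partition_by_qi(records, qi_attrs):
    """Group records into equivalence classes by their QI values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in qi_attrs)].append(r)
    return list(classes.values())

def is_k_anonymous(records, qi_attrs, k):
    """k-anonymity: every equivalence class contains at least k records."""
    return all(len(c) >= k for c in partition_by_qi(records, qi_attrs))

def is_l_diverse(records, qi_attrs, sa, l):
    """l-diversity: no sensitive value exceeds frequency 1/l in any class."""
    for c in partition_by_qi(records, qi_attrs):
        counts = Counter(r[sa] for r in c)
        if any(n / len(c) > 1 / l for n in counts.values()):
            return False
    return True

# A hypothetical generalized table with one equivalence class of size 3
table = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "130**", "age": "<30", "disease": "ulcer"},
]
print(is_k_anonymous(table, ["zip", "age"], 3))           # True
print(is_l_diverse(table, ["zip", "age"], "disease", 3))  # True
```

Note that neither check looks at the *meaning* of the sensitive values, which is why similarity attacks remain possible.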

#### *2.2. Preserving Privacy for Publishing Hierarchical Data*

There are several studies on preserving privacy when publishing hierarchical or tree-structured data. Yang and Li [26] found that dependencies between nodes in XML data may result in privacy leakage. They formally defined these dependencies as XML constraints and designed an algorithm to sanitize XML documents under these constraints so that no privacy is leaked. However, their attack model is too weak; our adversarial model assumes that the attacker has some background information about the victim. Landberg et al. [27] proposed *δ*-dependency and extended the anatomy method from relational data to hierarchical data. However, the dissection damages the original semantic structure of the hierarchical data, and the generalization of sensitive attributes reduces the utility of the hierarchical data. Nergiz et al. [28] extended *k*-anonymity to multi-relational databases and proposed multi-relational *k*-anonymity: the hierarchical data is first converted into multiple relational tables, related to each other by primary and foreign keys, and *k*-anonymity is then applied to each table separately. However, converting hierarchical data into relational data is not simple and produces large amounts of data redundancy, which makes the algorithm extremely inefficient; it also loses much structural information. Gkountouna and Terrovitis [29] proposed *k*<sup>(*m*, *n*)</sup>-anonymity for tree-structured data. Using generalization and structure-decomposition methods, they ensure that the number of matching records is not less than *k* when the attacker knows up to *m* nodes in a tree and up to *n* structural relations between these nodes. However, the method cannot resist attacks with stronger background knowledge, and the structural decomposition destroys the structural information of the hierarchical data. Ozalp et al. [16] extended *l*-diversity to hierarchical data. They utilized generalization and suppression to anonymize the hierarchical data, making the hierarchical records in an equivalence class indistinguishable in terms of QIs and structure, while the sensitive values of the union-compatible vertices in an equivalence class satisfy the requirements of *l*-diversity. This method scales well as a general anonymization method for hierarchical data. However, it does not consider the different sensitivity of sensitive attribute values, so the anonymized hierarchical data still cannot resist similarity attacks. In this paper, we use fuzzy set theory to partition the sensitive values of union-compatible vertices into levels, and propose a multi-level privacy-preserving approach for hierarchical data that resists similarity attacks.

## **3. Problem Descriptions**

In this section, we describe the attack model, give some fundamental definitions, and introduce our privacy protection model.

## *3.1. Attack Model*

We assume that an attacker knows a victim's QI information, which may contain any combination of QI values from the same or different vertices of the victim's record. The attacker can also obtain some structural links, e.g., that the victim took two courses and purchased a book only for course CS201. In addition, the attacker may have some negative knowledge, e.g., that the victim did not take CS305. Our anonymization approach ensures that an attacker with this background knowledge about a victim cannot infer that any sensitive value of the victim lies in some level with a probability greater than a given threshold.

#### *3.2. Basic Definitions in Hierarchical Data*

In this subsection, we give some basic definitions for hierarchical data [16]. Let *T* be a graph with *n* vertices. We say that *T* is a rooted tree if and only if: (1) *T* is a directed acyclic graph with *n*−1 edges; (2) for every vertex except the root, there is a single path from the root to it in *T*; (3) there exists an edge *v* → *ci* if *ci* ∈ *children*(*v*), where *children*(*v*) is the set of children of vertex *v*. Such a tree is denoted by *T*(*V*, *E*), where *V* and *E* are the sets of vertices and edges in the tree, respectively.
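The rooted-tree conditions can be verified mechanically. The sketch below is our own illustration (the function name and the tiny example tree, loosely modeled on Figure 1, are hypothetical): it checks the edge count, the uniqueness of the root, the single-parent property, and reachability from the root.

```python
def is_rooted_tree(vertices, edges):
    """Check the rooted-tree conditions: n-1 edges, one root, a single
    path from the root to every other vertex, and no cycles."""
    if len(edges) != len(vertices) - 1:
        return False
    indeg = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for parent, child in edges:
        indeg[child] += 1
        children[parent].append(child)
    roots = [v for v in vertices if indeg[v] == 0]
    # exactly one root; every other vertex has exactly one parent
    if len(roots) != 1 or any(indeg[v] != 1 for v in vertices if v != roots[0]):
        return False
    # every vertex must be reachable from the root (rules out cycles)
    seen, stack = set(), [roots[0]]
    while stack:
        v = stack.pop()
        seen.add(v)
        stack.extend(children[v])
    return len(seen) == len(vertices)

V = ["student", "CS201", "CS305", "eval1", "eval2"]
E = [("student", "CS201"), ("student", "CS305"),
     ("CS201", "eval1"), ("CS201", "eval2")]
print(is_rooted_tree(V, E))  # True
```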

A hierarchical data record satisfies the following conditions: (1) it follows a rooted tree structure; (2) each vertex *v* has two *j*-tuples (*j* ≥ 0), *vQIt* and *vQI*, which contain the names of the QI attributes and the corresponding QI values, respectively; (3) each vertex *v* also has two *m*-tuples (0 ≤ *m* ≤ 1), *vSAt* and *vSA*, which contain the name of the sensitive attribute and the corresponding sensitive value, respectively; (4) we assume |*vQI*| + |*vSA*| ≥ 1 to eliminate empty vertices. For a vertex *v* of a hierarchical data record, *vQI* is the label of *v* and *vSA* is placed next to *v*. For the root vertex in Figure 1, *vQIt* = {*major program*, *year of birth*}, *vSAt* = {*GPA*}, *vQI* = {*Computer Science*, 1990}, and *vSA* = {3.75}.

**Definition 1 (Union-Compatibility) [16].** *Two vertices v and v′ are union-compatible if and only if vQIt = v′QIt and vSAt = v′SAt.*
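The vertex tuples and Definition 1 translate directly into code. The sketch below is our own illustration (the `Vertex` class and the second vertex's values are hypothetical): union-compatibility compares only the attribute *names*, not the values, so two vertices of different students can be union-compatible.

```python
from dataclasses import dataclass

@dataclass
class Vertex:
    qi_names: tuple = ()   # vQIt: names of the QI attributes
    qi_values: tuple = ()  # vQI: the corresponding QI values
    sa_name: tuple = ()    # vSAt: name of the sensitive attribute (at most one)
    sa_value: tuple = ()   # vSA: the corresponding sensitive value

def union_compatible(v, w):
    """Definition 1: the attribute-name tuples agree (values may differ)."""
    return v.qi_names == w.qi_names and v.sa_name == w.sa_name

a = Vertex(("major program", "year of birth"),
           ("Computer Science", 1990), ("GPA",), (3.75,))
b = Vertex(("major program", "year of birth"),
           ("Electrical Eng.", 1991), ("GPA",), (2.50,))
print(union_compatible(a, b))  # True: same QI and SA attribute names
```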

**Definition 2 (QI-isomorphism) [16].** *Let T1(V1, E1) and T2(V2, E2) be two hierarchical data records. T1(V1, E1) is QI-isomorphic to T2(V2, E2) if and only if there exists a bijection f: V1* → *V2 such that: (1) for every v* ∈ *V1, v and f(v) are union-compatible and vQI = f(v)QI; (2) (u, v)* ∈ *E1 if and only if (f(u), f(v))* ∈ *E2.*


**Definition 3 (Equivalence Class of Hierarchical Records) [16].** *Let Q = {T1, T2, ..., Tk} be a collection of k hierarchical data records. We say Q is an equivalence class if, for* ∀*i*, *j* ∈ {1, . . . , *k*}*, Ti and Tj are QI-isomorphic.*

**Definition 4 (Class Representative) [16].** *Let Q = {T1, T2, ..., Tk} be an equivalence class in hierarchical data, and let fi (1* ≤ *i* ≤ *k*−*1) be a bijection that maps T1's vertices to Ti+1's vertices as in QI-isomorphism. T*ˆ *is the class representative for Q if T*ˆ *is QI-isomorphic to T1 with a bijection function f and* ∀*v* ∈ *T*ˆ*, vSA = {f(v)SA, f1(f(v))SA, ..., fk*−*1(f(v))SA}.*

Let *X* = {*x*1, *x*2, ..., *xo*} be a multiset of values from the domain of a sensitive attribute *A*. *X* satisfies *l*-diversity if ∀*xi* ∈ *X*, *p*(*xi*) ≤ 1/*l*, where *p*(*xi*) is the frequency of *xi* in *X*. For an equivalence class *Q* in hierarchical data with class representative *T*ˆ, if *vSA* satisfies *l*-diversity for ∀*v* ∈ *T*ˆ, then *T*ˆ satisfies *l*-diversity. Given a hierarchical dataset *D*, an anonymized hierarchical dataset *D*\* satisfies *l*-diversity if the class representative of every equivalence class in *D*\* satisfies *l*-diversity. However, *l*-diverse hierarchical data does not prevent similarity attacks, since it does not consider the different sensitivity of sensitive attribute values.
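Applied to a class representative, this per-vertex test can be sketched as follows (our own illustration; the function names and the list of SA multisets, transcribed from the Figure 2 example, are assumptions). Each vertex of the representative contributes one multiset of sensitive values, and all of them must pass the frequency test.

```python
from collections import Counter

def multiset_l_diverse(values, l):
    """X satisfies l-diversity if every value's frequency is at most 1/l."""
    counts = Counter(values)
    return all(n / len(values) <= 1 / l for n in counts.values())

def representative_l_diverse(sa_multisets, l):
    """A class representative is l-diverse if the SA multiset of every
    vertex (skipping vertices without a sensitive attribute) is."""
    return all(multiset_l_diverse(sa, l) for sa in sa_multisets if sa)

# SA multisets of the vertices of the class representative in Figure 2
rep = [
    [0.31, 0.15, 0.09],   # cumulative GPA
    ["D", "D+", "D-"],    # grade in CS201
    [0.0, 0.1, 0.2],      # evaluation scores for the DB Prof.
]
print(representative_l_diverse(rep, 3))  # True, yet every multiset is
                                         # semantically similar (low values)
```

The check passes precisely because it counts frequencies without comparing the meanings of the values, which restates why a multi-level treatment of sensitivity is needed.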
