#### **3.3. Privacy Model**

For every sensitive attribute, numerical or categorical, we partition the sensitive values into five levels: *very low*, *low*, *middle*, *high* and *very high* (for some sensitive attributes, e.g., a student's letter grade in a course, the levels are already given, so no partitioning is needed), and then transform these value levels into corresponding sensitivity levels.

Let *U* be a universe of discourse. A mapping *μA*: *U* → [0, 1] is called a membership function on *U*; the fuzzy set *A* on *U* consists of the membership degrees *μA*(*u*) (*u* ∈ *U*), where *μA*(*u*) is the membership degree of *u* to *A* [30–32]. The trapezoidal distribution [33] is used to define the membership functions of the fuzzy sets *very low*, *low*, *middle*, *high* and *very high*, denoted by *A*1, *A*2, *A*3, *A*4 and *A*5, respectively. Let *U* be the domain of a numerical attribute (for a categorical attribute, a numerical attribute can be obtained from the frequency of every value), and let *min* and *max* be the minimum and maximum values in *U*, respectively. The five fuzzy sets take values in the ranges [*min*, *a*2], [*a*1, *a*3], [*a*2, *a*4], [*a*3, *a*5] and [*a*4, *max*], respectively, where *a*3 = (*min* + *max*)/2, *a*1 = *min* + (*a*3 - *min*)/3, *a*2 = *min* + 2(*a*3 - *min*)/3, *a*4 = *a*3 + (*max* - *a*3)/3 and *a*5 = *a*3 + 2(*max* - *a*3)/3. That is, *a*1, *a*2, *a*3, *a*4 and *a*5 uniformly divide the interval [*min*, *max*] into six equal subintervals. The membership functions of *Ai* (*i* = 1, 2, ..., 5) are as follows.

$$\mu_{A_1}(u) = \begin{cases} 1 & u \le \min \\ \frac{a_2 - u}{a_2 - \min} & \min < u < a_2 \\ 0 & u \ge a_2 \end{cases} \tag{1}$$

$$\mu_{A_i}(u) = \begin{cases} 0 & u \le a_{i-1} \\ \frac{u - a_{i-1}}{a_i - a_{i-1}} & a_{i-1} < u < a_i \\ 1 & u = a_i \\ \frac{a_{i+1} - u}{a_{i+1} - a_i} & a_i < u < a_{i+1} \\ 0 & u \ge a_{i+1} \end{cases} \quad i = 2, \ 3, \ 4 \tag{2}$$

$$\mu_{A_5}(u) = \begin{cases} 0 & u \le a_4 \\ \frac{u - a_4}{\max - a_4} & a_4 < u < \max \\ 1 & u \ge \max \end{cases} \tag{3}$$

For any *u* ∈ *U*, argmax{*μAi*(*u*) | *i* ∈ {1, 2, 3, 4, 5}} gives the level to which *u* belongs. We then transform the value level into a sensitivity level. For some sensitive attributes, e.g., income, the higher the value level is, the higher the sensitivity level is; for others, e.g., a student's cumulative GPA, the relation is reversed. For a numerical attribute, we number the five sensitivity levels from 1 to 5, where level 5 is the highest and level 1 is the lowest. The higher the sensitivity level is, the stronger the privacy protection that will be given.
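The cut points and the argmax level assignment above can be sketched in Python. A minimal sketch assuming a numerical attribute with domain [*lo*, *hi*]; function and variable names are illustrative, not from the paper:

```python
# Sketch of the five membership functions (Eqs. 1-3) and the argmax level
# assignment. A1 and A5 have shoulders at the domain ends; A2..A4 are
# triangular with peaks at a2, a3, a4.

def cut_points(lo, hi):
    """Return a1..a5, which uniformly divide [lo, hi] into six equal parts."""
    a3 = (lo + hi) / 2
    a1 = lo + (a3 - lo) / 3
    a2 = lo + 2 * (a3 - lo) / 3
    a4 = a3 + (hi - a3) / 3
    a5 = a3 + 2 * (hi - a3) / 3
    return a1, a2, a3, a4, a5

def memberships(u, lo, hi):
    """Membership degrees of u to A1..A5."""
    a1, a2, a3, a4, a5 = cut_points(lo, hi)

    def mu_left(u):               # Eq. (1): A1, decreasing over [min, a2]
        if u <= lo: return 1.0
        if u >= a2: return 0.0
        return (a2 - u) / (a2 - lo)

    def mu_right(u):              # Eq. (3): A5, increasing over [a4, max]
        if u <= a4: return 0.0
        if u >= hi: return 1.0
        return (u - a4) / (hi - a4)

    def mu_tri(u, l, m, r):       # Eq. (2): triangular A_i, i = 2, 3, 4
        if u <= l or u >= r: return 0.0
        if u <= m: return (u - l) / (m - l)
        return (r - u) / (r - m)

    return [mu_left(u),
            mu_tri(u, a1, a2, a3),
            mu_tri(u, a2, a3, a4),
            mu_tri(u, a3, a4, a5),
            mu_right(u)]

def value_level(u, lo, hi):
    """1-based index of the fuzzy set with the largest membership degree."""
    mu = memberships(u, lo, hi)
    return 1 + mu.index(max(mu))
```

For the cumulative GPA domain [0, 4], the sample values 0.8, 1.6, 2.3, 2.7, 3.5 and 3.9 land in value levels 1, 2, 3, 4, 5 and 5; since GPA sensitivity is reversed, the sensitivity level is 6 minus the value level.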

For example, for an equivalence class *Q* in hierarchical data, assume that the sensitive attribute of the root vertex in the class representative of *Q* is the cumulative GPA, whose value set is {0.8, 1.6, 2.3, 2.7, 3.5, 3.9} and whose domain is [0, 4]. We obtain *min* = 0, *max* = 4, *a*1 = 2/3, *a*2 = 4/3, *a*3 = 2, *a*4 = 8/3 and *a*5 = 10/3. The membership degrees of *ui* to *Aj* are shown in Table 1, where *ui* ∈ {0.8, 1.6, 2.3, 2.7, 3.5, 3.9} and *Aj* ∈ {*very low*, *low*, *middle*, *high*, *very high*}. Thus 0.8, 1.6, 2.3, 2.7, 3.5 and 3.9 belong to *very low*, *low*, *middle*, *high*, *very high* and *very high*, respectively, and their sensitivity levels are 5, 4, 3, 2, 1 and 1, respectively.


**Table 1.** The membership degree of *ui* to *Aj*.

In fact, for every sensitive value of a numerical attribute *A*, its value level can be determined quickly from the membership functions. As shown in Figure 3, [*min*, *max*] is the domain of *A*, and *a*1, *a*2, *a*3, *a*4 and *a*5 equally divide [*min*, *max*]. *p*1, *p*2, *p*3 and *p*4 are the intersection points of the membership functions *μA*1 and *μA*2, *μA*2 and *μA*3, *μA*3 and *μA*4, and *μA*4 and *μA*5, respectively. The ranges of *very low*, *low*, *middle*, *high* and *very high* are [*min*, *p*1], [*p*1, *p*2], [*p*2, *p*3], [*p*3, *p*4] and [*p*4, *max*], respectively.
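Because the cut points are uniformly spaced with step *h* = (*max* - *min*)/6, the intersection points *p*1–*p*4 admit closed forms, and the level lookup reduces to four comparisons. A sketch under that assumption; the closed-form expressions are derived here, not given in the paper:

```python
# Sketch of the crisp level ranges from Figure 3. Intersecting each pair of
# adjacent (linear) membership functions and solving gives the bounds below.

def level_ranges(lo, hi):
    """Return the intersection points p1..p4 that bound the five level ranges."""
    h = (hi - lo) / 6
    p1 = lo + 4 * h / 3        # where mu_A1 = mu_A2
    p2 = lo + 5 * h / 2        # where mu_A2 = mu_A3 (midpoint of a2, a3)
    p3 = lo + 7 * h / 2        # where mu_A3 = mu_A4 (midpoint of a3, a4)
    p4 = lo + 14 * h / 3       # where mu_A4 = mu_A5
    return p1, p2, p3, p4

def value_level_by_range(u, lo, hi):
    """Level lookup by comparing u against p1..p4 instead of taking an argmax."""
    p1, p2, p3, p4 = level_ranges(lo, hi)
    for level, p in enumerate((p1, p2, p3, p4), start=1):
        if u <= p:
            return level
    return 5
```

For the GPA domain [0, 4] this gives *p*1 = 8/9, *p*2 = 5/3, *p*3 = 7/3 and *p*4 = 28/9, and the range lookup reproduces the levels of the Table 1 example.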

**Figure 3.** The membership functions for five value levels.

For example, for the cumulative GPA and the evaluation score of a teacher, the domains are [0, 4] and [0, 1], respectively. Their value levels and sensitivity levels are shown in Table 2. The letter grade of a course is already divided into five levels.


**Table 2.** The value levels and sensitivity levels for sensitive attributes.

For a categorical attribute, e.g., *disease*, we derive an attribute *Frequency* from the frequency of every value. The values of *Frequency* can be divided into five levels: *very low*, *low*, *middle*, *high* and *very high*. The disease *HIV* is more sensitive than *flu*, while the frequency of *HIV* is lower than that of *flu*. Therefore, we divide the values of *disease* into five sensitivity levels according to the value levels of *Frequency*: the lower the value level is, the higher the sensitivity level is.
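A minimal sketch of this frequency-based mapping; the disease counts below are made-up toy data, and the helper reuses the range-based five-level split on the frequency domain [0, 1]:

```python
# Sketch: categorical values -> relative frequencies -> value levels ->
# reversed sensitivity levels (rarer value = higher sensitivity).
from collections import Counter

def freq_level(f, lo=0.0, hi=1.0):
    """Value level (1..5) of a relative frequency f via the p1..p4 bounds."""
    h = (hi - lo) / 6
    bounds = (lo + 4*h/3, lo + 5*h/2, lo + 7*h/2, lo + 14*h/3)
    for level, p in enumerate(bounds, start=1):
        if f <= p:
            return level
    return 5

def categorical_sensitivity(values):
    """Sensitivity level per category: sensitivity = 6 - frequency level."""
    counts = Counter(values)
    n = len(values)
    return {v: 6 - freq_level(c / n) for v, c in counts.items()}
```

For a toy sample with 90 *flu* and 10 *HIV* records, *HIV* (relative frequency 0.1, level *very low*) receives sensitivity level 5 while *flu* (0.9, *very high*) receives sensitivity level 1.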

**Definition 5 ((*αhlev*, *k*)-anonymity in Hierarchical Data).** *Given hierarchical data H, the published anonymous hierarchical data H′ satisfies (αhlev, k)-anonymity if every equivalence class Q in H′ satisfies (αhlev, k)-anonymity. That is, Q contains at least k hierarchical data records, and for every vertex v in the class representative of Q, the frequency of the values in vSA that belong to sensitivity level i is less than or equal to αhlev*[*i*]*, where, e.g., αhlev* = {0.8, 0.6, 0.4, 0.2, 0.1}*.*
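A minimal sketch of the per-vertex check in Definition 5, assuming each vertex *v* of a class representative carries the list *vSA* of sensitivity levels of its grouped values, one per record in the class; names and data are illustrative:

```python
# Sketch of the (alpha_hlev, k)-anonymity check for one vertex of a class
# representative. v_sa holds sensitivity levels (1..5), one per record;
# alpha is the example setting quoted in the text.
ALPHA_HLEV = [0.8, 0.6, 0.4, 0.2, 0.1]

def satisfies_alpha_k(v_sa, k, alpha=ALPHA_HLEV):
    """True iff at least k values are grouped at this vertex and, for each
    sensitivity level i, the count of level-i values is at most k * alpha[i-1]."""
    if len(v_sa) < k:
        return False
    return all(v_sa.count(i) <= k * alpha[i - 1] for i in range(1, 6))
```

Higher sensitivity levels thus get tighter frequency caps; e.g., with *k* = 4 the cap for level 5 is 4 × 0.1 = 0.4, so a class whose vertex holds even one level-5 value fails the check.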

#### **4. The Anonymization Method**

In this section, we introduce our anonymization method, which consists of two parts: the first anonymizes two hierarchical data records or class representatives, and the second anonymizes the entire hierarchical data by a clustering method.

The anonymization of two hierarchical data records is shown in Algorithm 1. The input is two arbitrary hierarchical data records *T*1 and *T*2. Without loss of generality, we assume that *T*1 has fewer subtrees than *T*2. The output is the information loss of anonymizing the two records.

We first check whether the root vertices of *T*1 and *T*2, stored in variables *a* and *b*, respectively, satisfy the anonymity constraint *check\_cons*(*a*, *b*), defined as follows:

$$\text{check\_cons}(a,b) = \begin{cases} 1 & \text{if } a \text{ and } b \text{ are union-compatible and } a_{SA} \cup b_{SA} \text{ conforms to } (\alpha_{hlev}, k)\text{-anonymity}; \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where *aSA* ∪ *bSA* conforms to (*αhlev*, *k*)-anonymity, i.e., for any vertex *v* in the class representative, the number of values in *vSA* that lie in sensitivity level *i* is less than or equal to *k*·*αhlev*[*i*]. If *check\_cons*(*a*, *b*) is 0, *tree*(*a*) and *tree*(*b*) are suppressed, where *tree*(*ai*) (*ai* ∈ {*a*, *b*}) denotes the subtree rooted at *ai*; otherwise, the QI values of *a* and *b* are generalized. Let *subtrees*(*a*) and *subtrees*(*b*) denote the sets of subtrees under *a* and *b*, respectively. There are three cases: (1) *subtrees*(*a*) = ∅ and *subtrees*(*b*) = ∅, i.e., *a* and *b* are leaves of the hierarchical records, so no further vertex needs to be processed and the algorithm returns the total cost in *tree*(*a*) and *tree*(*b*); (2) *subtrees*(*a*) = ∅ and *subtrees*(*b*) ≠ ∅, in which case we suppress all vertices under *b* to keep the structure consistent and return the total cost; (3) *subtrees*(*a*) ≠ ∅ and *subtrees*(*b*) ≠ ∅, in which case the subtrees under *a* and *b* need to be processed further. To reduce the information loss caused by anonymization, the subtrees under *a* and *b* need to be well matched. Let *subtrees*(*a*) = {*U*1, *U*2, ..., *Um*} and *subtrees*(*b*) = {*V*1, *V*2, ..., *Vn*}. For every subtree *Ui* of *a*, we find the unpaired subtree *Vj* of *b* with minimum *MLevAnonytree*(*Ui*, *Vj*), as shown in lines 12–23. For every pair (*i*, *j*) in *pairs*, we call *MLevAnonytree*(*Ui*, *Vj*) to generalize them. In lines 26 and 27, we suppress the unpaired subtrees of *b* if any exist.

**Algorithm 1.** *MLevAnonytree*(*T*1, *T*2)

```
Input:  Two hierarchical data records T1 and T2
Output: Anonymous information loss
 1  a ← root(T1); b ← root(T2);
 2  if check_cons(a, b) = 0 then
 3      suppress tree(a) and tree(b);
 4      return cost(tree(a)) + cost(tree(b));
 5  for i = 1 to |a_QI| do
 6      replace a_QI[i] and b_QI[i] with their generalized value;
 7  if subtrees(a) = ∅ and subtrees(b) = ∅ then
 8      return cost(tree(a)) + cost(tree(b));
 9  if subtrees(a) = ∅ and subtrees(b) ≠ ∅ then
10      suppress all vertices under b;
11      return cost(tree(a)) + cost(tree(b));
12  pairs ← ∅;
13  for i = 1 to m do
14      min_cost ← ∞;
15      paired_index ← ∅;
16      for j = 1 to n do
17          if j ∈ pairs then
18              continue;
19          x ← U_i; y ← V_j;
20          loss ← MLevAnonytree(x, y);
21          if loss < min_cost then
22              min_cost ← loss; paired_index ← j;
23      pairs.append((i, paired_index));
24  for (i, j) ∈ pairs do
25      MLevAnonytree(U_i, V_j);
26  if there are unpaired subtrees in b then suppress them;
27  return cost(tree(a)) + cost(tree(b));
```
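The pairing step in lines 12–23 is a greedy first-come matching over subtree anonymization costs. A standalone sketch with a precomputed toy cost matrix in place of the recursive *MLevAnonytree* calls; all names here are illustrative:

```python
# Sketch of the greedy subtree pairing: each U_i takes the cheapest
# still-unpaired V_j; leftover subtrees of b are reported for suppression.
import math

def greedy_pairs(cost):
    """cost[i][j] = information loss of anonymizing U_i with V_j (m <= n)."""
    m, n = len(cost), len(cost[0])
    pairs, taken = [], set()
    for i in range(m):
        min_cost, paired_j = math.inf, None
        for j in range(n):
            if j in taken:
                continue
            if cost[i][j] < min_cost:
                min_cost, paired_j = cost[i][j], j
        pairs.append((i, paired_j))
        taken.add(paired_j)
    unpaired = [j for j in range(n) if j not in taken]  # suppressed in line 26
    return pairs, unpaired
```

Each *Ui* simply takes the cheapest remaining *Vj*, so the matching is greedy rather than globally optimal; the leftover subtrees of *b* correspond to the suppression in line 26.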

An example of anonymizing two hierarchical data records is shown in Figure 4, where Figure 4a–c show the two raw hierarchical data records, their anonymization results conforming to (*αhlev*, 2)-anonymity, and the class representative of the results, respectively.

**Figure 4.** An anonymous example: (**a**) Two raw hierarchical data records; (**b**) The anonymous results; (**c**) Class representative of results.

Now we give the clustering algorithm for anonymizing the entire hierarchical data, shown as Algorithm 2. The input is hierarchical data *H* and the privacy parameters *αhlev* and *k*. The output is anonymous data *H′* that satisfies (*αhlev*, *k*)-anonymity. In lines 2–15, while the number of records in *H* is equal to or larger than *k*, the algorithm creates an equivalence class from *H*. The first record of an equivalence class *Q* is picked randomly. For every residual record *Ti* in *H*, we compute the information loss of adding *Ti* to *Q*, and then sort *H* in ascending order of this loss. The other *k*-1 records are selected from the first 50 records to decrease the runtime of the algorithm. In lines 16 and 17, when fewer than *k* records remain in *H*, the algorithm suppresses them all.

**Algorithm 2.** *MLevCluTree*(*H*, *αhlev*, *k*)

```
Input:  Hierarchical data H = {T1, T2, ..., Tn}, and privacy parameters αhlev, k
Output: Anonymous data H′ which satisfies (αhlev, k)-anonymity
 1  H′ ← ∅;
 2  while |H| ≥ k do
 3      pick randomly a record x from H; H ← H - {x};
 4      initialize Q with x and Crep ← x;
 5      Q_cost ← ∅;
 6      for i = 1 to |H| do
 7          loss ← MLevAnonytree(copy(x), copy(T_i));
 8          Q_cost.append(loss);
 9      use Q_cost to sort H in ascending order;
10      cand_set ← H[1:50];
11      for j = 2 to k do
12          y ← argmin_{y ∈ cand_set} MLevAnonytree(copy(Crep), copy(y));
13          H ← H - {y}; cand_set ← cand_set - {y}; Q ← Q ∪ {y};
14          update Crep;
15      H′ ← H′ ∪ Q;
16  if H ≠ ∅ then
17      suppress all records in H;
18  return H′;
```
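The control flow of Algorithm 2 can be sketched over an abstract pairwise loss function standing in for *MLevAnonytree*. The flat integer records, the absolute-difference loss, and the deterministic seed choice below are toy stand-ins; the paper picks the seed randomly and updates *Crep* after each addition:

```python
# Sketch of the clustering skeleton: repeatedly seed a class, sort the
# remaining records by loss against the seed, and greedily fill the class
# from the first cand_limit candidates; leftovers are suppressed.

def cluster(records, k, loss, cand_limit=50):
    """Greedily form equivalence classes of k records each; when fewer than
    k records remain, they are suppressed (returned separately)."""
    H = list(records)
    classes = []
    while len(H) >= k:
        x = H.pop(0)                        # deterministic seed; the paper picks randomly
        Q = [x]
        H.sort(key=lambda t: loss(x, t))    # ascending information loss w.r.t. the seed
        cand = H[:cand_limit]
        for _ in range(k - 1):
            # x stands in for the evolving class representative Crep
            y = min(cand, key=lambda t: loss(x, t))
            H.remove(y)
            cand.remove(y)
            Q.append(y)
        classes.append(Q)
    return classes, H                       # H now holds the suppressed leftovers
```

With seven toy records {1, 2, 3, 10, 11, 12, 100} and *k* = 3, the sketch forms the classes {1, 2, 3} and {10, 11, 12} and suppresses the outlier 100.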
