This section describes the proposed approach, which analyzes anonymity in multiple aspects to protect the data owner's privacy.
Figure 1 shows a general overview of the approach. As shown in the figure, the approach identifies the attributes susceptible to information leakage in the pre-analytics process. The user can choose to increase the anonymity of the vulnerable attributes before the anonymization procedure, or to anonymize the data directly using the findings on the vulnerable attributes as guidance. The user can also choose to anonymize the smart health dataset without applying pre-analytics. The data are then anonymized by our IAB (Intelligent Anonymity Balance) anonymization technique, which produces data satisfying a given anonymity requirement while optimizing data retention. The anonymized data are then analyzed in the post-analytics process to see whether the critical attributes in the resulting anonymized data are vulnerable to information leakage (e.g., via inference by attackers). If they are, further actions can be taken (e.g., alerting data publishers or injecting additional “fake” data to increase the indistinguishability of the vulnerable individuals). The first two steps are described in this section, whereas the post-analytics step is described in Section 5. For easy reference, since we describe and illustrate each step with the same data, we briefly introduce the data along with common terms and notation below.
Given a (data) table T (or relational database) with a set of attributes A = {A1, A2, ..., An}, a data record represents an instance of a tuple (a1, a2, ..., an), where each data entry ai ∈ dom(Ai), the set of all possible values of Ai. Consider Table 1: each row represents a unique tuple of attribute values, and the last column gives the number of records for that row. For example, Row 2 represents the unique tuple (F, Low, 35, 52000, 143, Black, No) with three record instances. As shown in Table 1, Rows 1, 4, 10 and 13 are obviously vulnerable to privacy threats since each has only one record instance, giving low anonymity and making re-identification easy. Next, we describe the analytical approach.
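To make the notation concrete, the following minimal Python sketch (not code from the paper) represents a table as a mapping from unique attribute-value tuples to record counts, mirroring the layout of Table 1; only Row 2's values are quoted from the table, and the second row shown is hypothetical.

```python
from collections import Counter

# A table is modeled as a multiset of unique attribute-value tuples with record counts.
table = Counter({
    ("F", "Low", 35, 52000, 143, "Black", "No"): 3,   # Row 2 of Table 1: three record instances
    ("M", "High", 60, 52001, 150, "White", "No"): 1,  # hypothetical row with a single record
})

def low_anonymity_rows(table, threshold=1):
    """Unique tuples with at most `threshold` record instances are easy to re-identify."""
    return [t for t, n in table.items() if n <= threshold]

print(low_anonymity_rows(table))  # single-record rows, analogous to Rows 1, 4, 10 and 13
```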
3.1. Assessing Vulnerability to Information Leakages
Before we transform given health data into a more anonymous form, one may investigate whether (and in what areas) the data are susceptible to information loss if an attacker uses some of his own information to make inferences. To do this, we propose an analysis of various structures of the data using Longpre et al.'s entropy-based measure [14] to estimate the average information loss in the respective areas. The motivation of this pre-anonymization analytics is not simply to apply an existing measure in a typical manner, but to use the measure systematically to gain information useful for privacy protection. For example, a finding that a certain attribute is vulnerable to information leakage may be linked to low anonymity, which can be alleviated by modifying the original data. Next, we briefly describe the measure and its derivations from two sources.
Proposition 1. Shannon’s information quantification.
Let X be a discrete random variable with outcomes x1, x2, ..., xn, let p(xi) be the probability of xi being the outcome, and let I(xi) be the amount (or value) of information received when learning that xi is the outcome (sent). Then I(xi) is log2(1/p(xi)).
Proof. The more probable an outcome is, the less informative learning it becomes; thus, I(xi) is inversely proportional to p(xi). Furthermore, for information of value y, the amount of information is measured by the number of bits needed to store y, i.e., log2(y) bits. Thus, Shannon's quantification of information is I(xi) = log2(1/p(xi)). □
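For example, learning an outcome that occurs with probability 1/8 conveys I = log2(8) = 3 bits of information, whereas learning an outcome with probability 1/2 conveys only 1 bit.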
Proposition 2. Longpre et al.’s entropy-based measure.
Given a data table of n individuals, let p(ri) be the probability of individual ri being re-identified. An attacker makes queries, each of which has m possible answers represented in a sequence <a1, a2, ..., am>. All n individuals are partitioned into m partitions, where each partition Ej contains the individuals whose attribute value matches the jth query answer aj. Then the average information loss is ΔS({Ej}) = ∑j p(Ej)(S0 − Sj), where S0 = ∑i p(ri) log2(1/p(ri)) and Sj = ∑ri∈Ej p(ri|Ej) log2(1/p(ri|Ej)) represent the initial average amount of information (before queries) and the average amount of information after query answer j, respectively.
Proof. If the attacker knows p(ri), then the amount or value of the information can be quantified as log2(1/p(ri)) by Proposition 1. Thus, the average of these information values over all individuals gives an entropy S0 = ∑i p(ri) log2(1/p(ri)). (Note that if an attacker does not have any information about the individuals, then everyone in the table is equally likely to be identified, with p(ri) = 1/n.)
Now suppose an attacker makes queries as stated. Each individual ri belongs to exactly one partition; thus ∑j |Ej| = n. Suppose an individual ri is found to be in Ej; then p(ri) becomes p(ri|Ej) = p({ri}∩Ej)/p(Ej) = p(ri)/p(Ej) (since {ri}∩Ej = {ri}), where p(Ej) = ∑ri∈Ej p(ri). Since p(ri|Ej) ≥ p(ri), the remaining amount of information log2(1/p(ri|Ej)) decreases; the attacker has gained information about the individual, who thus becomes more vulnerable to a privacy breach. Thus, the average amount of information when answer j is matched is Sj = ∑ri∈Ej p(ri|Ej) log2(1/p(ri|Ej)), where p(ri) is replaced by p(ri|Ej). This gives the average loss, estimated as ΔS({Ej}) = ∑j p(Ej)(S0 − Sj). □
Note that ΔS({Ej}) reaches its maximum when every Sj is zero, in which case ΔS({Ej}) = S0 (i.e., all of the initial information is lost to the attacker); it is zero when the attacker gains no information. Hence, the normalized average information loss is ΔS({Ej})/S0, whose value lies in [0, 1].
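To make the computation concrete, the following is a minimal Python sketch (not code from the paper) of the measure under the assumption that every individual is initially equally likely to be identified, i.e., p(ri) = 1/n, so that only the partition sizes |Ej| are needed.

```python
import math

def normalized_info_loss(partition_sizes):
    """Normalized entropy-based loss ΔS({Ej})/S0, assuming p(ri) = 1/n for all individuals.

    partition_sizes: list of |Ej|, the number of individuals matching each query answer.
    """
    n = sum(partition_sizes)
    s0 = math.log2(n)                    # initial average information S0 (uniform p(ri) = 1/n)
    loss = 0.0
    for e in partition_sizes:
        sj = math.log2(e)                # average information within Ej, since p(ri|Ej) = 1/|Ej|
        loss += (e / n) * (s0 - sj)      # weight by p(Ej) = |Ej|/n
    return loss / s0                     # normalize by the maximum loss S0

# Attribute Sex in Table 1: 27 females and 33 males out of 60 individuals.
print(round(normalized_info_loss([27, 33]), 2))  # ~0.17 at full precision; cf. the 0.16 reported for Sex
```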
Analytics on Information Leakages
Instead of applying Longpre et al.'s entropy-based measure to the entire table, we analyze which attribute is the most vulnerable to information leaks (i.e., leaks the largest amount on average) when an attacker obtains information on its attribute values.
We will use Table 1 to illustrate and explain the concept. Suppose an attacker queries information on attribute Sex. Table 1 has a total of 60 individuals, with 27 females (F) and 33 males (M). When an attacker has no information, every individual is equally likely to be identified, with p(ri) = 1/60. Therefore, the initial average amount of information is S0 = −60(1/60) log2(1/60) = log2(60) ≈ 5.9. For attribute Sex, there are two possible answers: <F, M>. Thus, we partition the 60 individuals into E1 and E2 for those who are F and M, respectively.
Based on Proposition 2 and Table 1, p(ri|E1) = p(ri)/p(E1) = (1/60)/(27/60) = 1/27, for ri ∈ E1 = {ri | i = 1, 2, 8–10, 12–14, 16, 19, 22}. This gives S1 = −27(1/27) log2(1/27) = 4.75. Similarly, p(ri|E2) = (1/60)/(33/60) = 1/33, for ri ∈ E2 = {ri | i = 3–7, 11, 15, 17, 18, 20, 21}, and S2 = −33(1/33) log2(1/33) = 5. By Proposition 2, ΔS({Ej}) is (27/60)(5.9 − 4.75) + (33/60)(5.9 − 5) = 0.99. Normalizing by the maximum (i.e., when ΔS({Ej}) = S0 = 5.9), we obtain a normalized average information loss of 0.99/5.9 = 0.16 (when the attacker queries on Sex), as shown in the first row of Table 2. Similarly, we can apply the measure to estimate the average information loss when the attacker queries on each of the other attributes except the disclosed one (e.g., genetic risk). Table 2 shows the overall results obtained.
The normalized results show, on average, how much information is leaked given that the attacker knows the attribute value of the person he is looking for. An attribute that discloses more information has a higher value, out of the maximum possible value of one. As shown in Table 2, for the data in Table 1, the Age attribute is the most vulnerable as it leaks the most information, followed by Zip and then Race. This is not surprising, as these are typical key attributes that can lead to re-identification. Although we have not done so here, Longpre et al.'s entropy-based measure can be applied to a combination of attributes at any level to give different insights. Here we apply the measure to each non-disclosed attribute to obtain systematic preliminary findings.
In general, this pre-anonymization analytics can help us decide which attributes to pay attention to when we try to protect privacy. For example, we may pick a set of the most vulnerable attributes whose anonymity should be increased by generalization. In anonymization techniques, such a set of attributes is known as the quasi-identifiers or shield, and it is specified by users. The next section gives more details of basic anonymization techniques.
3.2. Increasing Anonymity by Generalization
The analytics in Section 3.1 show that, once an attacker obtains the query answers, information on some attributes (or sets of attributes) can lead to more average information loss than others. To protect against such loss, a common practice for increasing anonymity is generalization and suppression [8,9,10]. This section describes these basic concepts in more detail, along with the concept of k-anonymity that is used in many anonymization techniques, including ours (described in Section 3.3).
Generalization replaces an attribute value by a more abstract form or a more general but semantically consistent value. For example, we can replace the zip “12345” by “123∗∗”, or replace a “city” by its “country”. The former can be viewed as a suppression of the last two digits of the zip, where “∗” represents any non-negative digit. The semantic consistency of attribute values is governed by the attribute's conceptual hierarchy. By doing this, the number of records for each unique tuple increases, which increases the tuple's degree of anonymity. Consequently, individuals are more indistinguishable, and their identities are better protected. Generalization provides many advantages for preserving data privacy, including consistent interpretation, traceability, and minimal content distortion [10].
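As a minimal illustration (assumed code, not from the paper), the two generalizations mentioned above might be implemented as follows; the city-to-country map is a hypothetical stand-in for a conceptual hierarchy.

```python
def generalize_zip(zip_code: str, levels: int = 1) -> str:
    """Suppress the last `levels` digits of a zip, e.g., "12345" -> "1234*" -> "123**"."""
    return zip_code[: len(zip_code) - levels] + "*" * levels

# A conceptual hierarchy can be encoded as a child -> parent map (hypothetical values).
PARENT = {"Lyon": "France", "Boston": "USA"}

def generalize_city(city: str) -> str:
    """Replace a city by its country, one step up the hierarchy."""
    return PARENT.get(city, city)

print(generalize_zip("12345", levels=2))  # "123**", as in the example above
print(generalize_city("Lyon"))            # "France"
```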
We now explain the concepts in more detail via illustrations on Table 1. Continuing our analytics from Section 3.1, where we identified that Age, Zip and Race are vulnerable, one can focus on generalizing these attributes to increase their anonymity, or explore other attributes based on domain expertise. Here we consider three attributes, Alcohol Consumption (AC), Age and Zip, and their corresponding conceptual hierarchies as shown in Figure 2. For AC, there are four attribute values in the domain, although only three appear in Table 1. The Age attribute values are discretized into four ranges, and the Zip attribute values are strings of digits where a more general value uses “∗” for any non-negative digit. The Zip hierarchy is general in that it is applicable to any string of digits other than 9’s.
For simplicity and without loss of generality, we illustrate generalization on part of Table 1, namely Rows 1, 2, 5, 6, 9 and 10, with four attributes: Sex, Alcohol Consumption (AC), Age and Zip, as shown in Table 3a, which serves as the initial data table.
In Table 3a, Row 1 and Row 5 each have one record. This makes the individuals in these two rows vulnerable to re-identification. If an attacker knows that the person he is looking for is a female (F) with Medium (Med) AC who lives in Zip 52000, he will be able to identify the person and infer her age of 35 (see Row 1). Similarly, Row 5 is the only record of a female of Age 75, so this person can be identified and her sensitive information of having High AC can be leaked.
To increase the anonymity of the individuals in Rows 1 and 5, we generalize the AC cells of all rows of females (i.e., Rows 1, 2, 5 and 6) in Table 3a to obtain the results shown in Table 3b, where the changed and important areas are colored. In this table, the individual in Row 1 gains anonymity since Row 1 can be merged with the individuals in Row 2, creating a tuple (F, Yes, 35, 52000) with four records. However, this generalization is not enough to increase the anonymity of the individual in Row 5.
To increase the anonymity of the individual in Row 5, with the goal of merging it with Row 6, we need to further generalize both rows on Age and Zip according to the taxonomies in Figure 2. By generalizing the Age attribute two steps up to [20–85] and the Zip to 5200∗, we obtain the results shown in Table 3c. As shown in this table, Rows 5 and 6 can now be merged. By merging Row 1 with Row 2, and Row 5 with Row 6, we obtain the final table shown in Table 3d. Here, none of the unique tuples of attribute values has only one record. In fact, the record count indicates the degree of anonymity. Table 3d shows that there are at least three people in each group with the same attribute values, and hence their identities and information are better protected.
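The cell-level generalization and merging just described can be sketched in a few lines of Python (an illustration under the assumptions stated in the comments, not the paper's code):

```python
from collections import Counter

# Rows 1 and 2 of Table 3a, as (Sex, AC, Age, Zip) tuples with their record counts.
table = Counter({("F", "Med", 35, "52000"): 1, ("F", "Low", 35, "52000"): 3})

def generalize_ac(row):
    """Cell-level generalization of AC: Low/Med/High -> Yes, following Figure 2."""
    sex, ac, age, zip_code = row
    return (sex, "Yes" if ac in ("Low", "Med", "High") else ac, age, zip_code)

merged = Counter()
for row, n in table.items():
    merged[generalize_ac(row)] += n   # identical generalized tuples merge; their counts add up

print(merged)  # Counter({('F', 'Yes', 35, '52000'): 4}) -- the four-record tuple of Table 3b
```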
There are many ways to generalize. The above shows generalization at the cell level (i.e., a data entry at a specific row and column of a table). Another type of generalization is applied to all attribute values at the same level of the hierarchy. Thus, when a table is generalized on attribute A, the generalization is applied only to the table rows whose values of A are the child in question or its siblings under the same parent in the hierarchy. For example, generalizing Table 3a on Age replaces the Age values of Rows 1, 2 and 6 by [20–44] and those of the remaining rows by [45–85]. To improve efficiency, many anonymization techniques, including ours (Section 3.3), adopt this interpretation when applying generalization. Next, we formally define important concepts for anonymization, namely the k-anonymity requirement and other relevant terminology.
k-Anonymity Requirement for Anonymization
An anonymity requirement specifies an anonymity degree required on a subset of privacy-critical attributes, called a shield (or quasi-identifiers [24,25]). Given the degree k and the shield S, the k-anonymity requirement on shield S, denoted by <S, k>, requires that each unique S-projected tuple have a minimum of k records. Let [t, nt] denote an ordered pair of a unique tuple t and its corresponding number of records nt. We say that <S, k> is violated if there is a pair [b, rb] such that rb < k, for some S-projected tuple b. The k-anonymity required on the shield attributes helps the user protect privacy without over-generalizing the tuples. For example, consider Table 3a with a given anonymity requirement <{AC, Age, Zip}, 3>. Note that each row represents a unique tuple projected on the shield. Rows 1, 5 and 6 violate the given anonymity requirement since their numbers of records are lower than three. However, Table 3d contains four distinct tuples, each of which has three or more records. Thus, Table 3d satisfies the given anonymity requirement.
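Checking a requirement <S, k> can be done mechanically. The following sketch (assumed code, not from the paper) projects a table onto the shield attributes and reports the projected tuples whose record counts fall below k:

```python
from collections import Counter

def violations(table, shield_idx, k):
    """Return the S-projected tuples [b, rb] with rb < k.

    table:      mapping from unique attribute-value tuples to record counts
    shield_idx: positions of the shield attributes S within each tuple
    k:          required anonymity degree
    """
    projected = Counter()
    for row, n in table.items():
        projected[tuple(row[i] for i in shield_idx)] += n
    return {b: rb for b, rb in projected.items() if rb < k}

# Requirement <{AC, Age, Zip}, 3> on Rows 1 and 2 of Table 3a, given as (Sex, AC, Age, Zip).
table = {("F", "Med", 35, "52000"): 1, ("F", "Low", 35, "52000"): 3}
print(violations(table, shield_idx=(1, 2, 3), k=3))  # {('Med', 35, '52000'): 1} -- Row 1 violates
```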
In general, for a given table, one can define more than one anonymity requirement, each of which can have a different anonymity degree and shield. In practice, the anonymity requirement is user-specified. If the anonymity degree is too low, the shield may or may not be able to protect individual identities (e.g., when the projected tuple becomes personally identifiable). On the other hand, if we set the anonymity degree too high, the data may not be informative, since almost all tuples would be the same after anonymization [15]; data privacy is then over-protected. Such k-anonymity requirements are used in many anonymization techniques [8,9,10,19,20]. Next, we describe our anonymization technique.
3.3. Balancing Generalization with Data Retention in Anonymization
Given a data table and a k-anonymity requirement, this section discusses an analytical approach to transforming the data into anonymized data that satisfy the k-anonymity requirement while retaining as much of the original data as possible. In AI (Artificial Intelligence) terms, we can view this problem as a search in the space of all possible tables generalized on all possible attributes. The simplest approach is to search exhaustively for a solution. To improve efficiency, heuristic search can be employed. Our approach relies on two simple heuristics: the number of rows violating the anonymity requirement and the total number of table rows. The interplay between the two heuristics balances anonymity compliance against data retention.
3.3.1. Intelligent Anonymity Balance (IAB) Algorithm
We now briefly describe our anonymization algorithm, IAB (Intelligent Anonymity Balance), as also discussed in [15]. We are given a data table T with a set of attributes A and a taxonomy tree for each shield attribute. Without loss of generality, we assume one anonymity requirement R with shield S ⊆ A. A basic overview of the IAB algorithm is shown in Algorithm 1.
Algorithm 1 The IAB Anonymization Algorithm
Procedure IABAnonymization
Inputs: T, a table with a set of attributes A; a set of anonymity requirements R with a set of anonymity shield attributes S ⊆ A; and the corresponding taxonomy tree of each attribute in S.
Output: a generalized table T’ of T, where T’ has a maximum number of rows among all generalized tables of T satisfying R.
1   For each violating row and applicable attribute B in S
2       T’ ← generalized table of T on B
3       Add T’ to W;
4   Endfor
5   Repeat
6       Select from W a table Tk that has a maximum number of rows and a non-zero minimum number of rows that violate R
7       For each violating row and applicable attribute B in S
8           T’k ← generalized table of Tk on B
9           Add T’k to W;
10      Endfor
11      Remove Tk from W
12  Until W is empty or no table in W has a number of rows > the number of rows of a table that satisfies R
13  Return T*, the table with the maximum number of rows over all tables in W that satisfy R
The algorithm iteratively generalizes a table on an appropriate attribute, using the attribute's taxonomy tree, to increase the anonymity degree. In Lines 1–4, a generalized table of T on each applicable attribute in S is generated and maintained in the set W. Each generalized table keeps track of the two key heuristics: the number of rows that violate R and the total number of rows in the table. The former tells how close we are to finding a table that satisfies the anonymity requirement R, while the latter measures how much data are preserved. Among the generalized tables in W, the algorithm selects the table with the highest number of rows and the lowest number of violating rows to be generalized further (Lines 5–10). The selected table is then removed from W (Line 11).
The generalization process repeats until there are no tables left in W or no table in W has more rows than a table that already satisfies R. In other words, we stop expanding a candidate table when it satisfies R or when it is smaller than the biggest R-satisfying table found so far (even though it violates R). By the monotonicity of generalization, further generalization can never grow a table. Therefore, the algorithm only further generalizes tables that are larger than those found so far to satisfy R; if a table violates R but is already smaller than the biggest R-satisfying table found so far, further generalizing it cannot yield a larger table that satisfies R. Finally, the algorithm selects the largest table among the tables in W with no anonymity requirement violations.
Note that it is possible to have more than one such table of the same size. In that case, the algorithm selects the first one found, as it represents the table obtained with the fewest generalization steps; in other words, it retains the most specific data, closest to the given data table. Since the generalization procedure monotonically decreases the number of rows, our approach uses this property to prune fruitless paths of an exhaustive search and thus finds an optimal solution. The optimal solution is the one that maximizes the information preserved from the original table (i.e., the table size) while hiding the desired private information by satisfying the anonymity requirement (i.e., having zero violating rows). Therefore, the optimal solution has the maximum number of rows (maximum information preservation) among tables that satisfy the anonymity requirement (desired anonymity).
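The search loop can be sketched as follows (an illustrative Python re-implementation under simplifying assumptions, not the authors' code). A table is represented as a mapping from row tuples to record counts, the two heuristics are computed by small helpers, and the taxonomy-based generalize(table, attribute) function is assumed to be supplied by the caller.

```python
from collections import Counter

def num_rows(table):
    """Heuristic 2: number of distinct rows retained (how much data are preserved)."""
    return len(table)

def num_violations(table, shield_idx, k):
    """Heuristic 1: number of rows whose shield projection has fewer than k records."""
    proj = Counter()
    for row, n in table.items():
        proj[tuple(row[i] for i in shield_idx)] += n
    return sum(1 for row in table if proj[tuple(row[i] for i in shield_idx)] < k)

def iab_search(table, shield_idx, k, attributes, generalize):
    """Best-first search in the spirit of Algorithm 1.

    `generalize(table, attribute)` is assumed to return the table generalized one level
    on that attribute (with identical rows merged); it is not defined here.
    """
    frontier = [generalize(table, b) for b in attributes]
    best = None                                          # largest R-satisfying table so far
    while frontier:
        # expand the table with the most rows, breaking ties by fewest violating rows
        frontier.sort(key=lambda t: (-num_rows(t), num_violations(t, shield_idx, k)))
        current = frontier.pop(0)
        if num_violations(current, shield_idx, k) == 0:
            if best is None or num_rows(current) > num_rows(best):
                best = current
            continue                                     # R-satisfying tables are not expanded
        if best is not None and num_rows(current) <= num_rows(best):
            break                                        # generalization only shrinks tables
        frontier.extend(generalize(current, b) for b in attributes)
    return best
```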
3.3.2. Illustration
We apply the algorithm described in Section 3.3.1 to Table 1 with the given anonymity requirement R = <{Zip, Age, AC}, 3>. Based on the number of records in each row, Table 1 contains six rows with fewer than three records. Thus, these rows, namely Rows 1, 4, 9, 10, 13 and 14, violate R. Generalizing these violating rows of Table 1 on attribute Zip (and also generalizing the Zip values of the remaining rows, since their Zip values are siblings of those in the violating rows), we obtain the table shown in Table 4.
As shown in Table 4, Row 1 and Row 9 can be merged into the first row of the resulting table. Rows 4 and 14 can be combined to satisfy R, since their unique shield-projected tuple, (Low, 35, 5200∗), now has three records; however, the two rows cannot be merged. Therefore, the resulting Table 4 has the number of violating rows reduced to two (i.e., Rows 10 and 13) and a total of 21 rows.
Let T(n, m) denote a generalized table T, where n is the number of rows violating R and m is the number of rows in T. Table 1 and Table 4 are represented by T(6, 22) and T1(2, 21), respectively. The generalization process repeats. The whole process can be viewed as a search starting from T(6, 22) as the root, as shown in Figure 3.
The search starts from the root T(6, 22), i.e., Table 1 (or T) with 6 violating rows and a total of 22 rows, as shown in Figure 3. We first apply to T the generalization on Zip, Age and AC to obtain tables T1(2, 21), T2(4, 21) and T3(0, 18), respectively. Recall that T1(2, 21) is actually Table 4.
As seen in Table 4, after merging Rows 1 and 9 we have [(Med, 35, 5200∗), 4]; hence the violation in these two cases is eliminated. Rows 4 and 14 also have the same shield attribute values after the generalization, namely [(Low, 35, 5200∗), 3]. Therefore, T1 has two violating rows remaining, namely Rows 10 and 13. Moreover, because Rows 1 and 9 are merged, the number of rows in T1 becomes 21. Thus, T1(2, 21) is obtained. The rest of the resulting tables can be obtained similarly. T3(0, 18) has zero violations; however, we continue the search because there might be a table with more rows and zero violations.
The frontier nodes at this point are T1(2, 21) and T2(4, 21). They have the same number of rows; therefore T1(2, 21), which has fewer violating rows, is selected to be expanded further. By generalizing T1(2, 21) on the three attributes, we obtain tables T4(2, 21), T5(0, 21) and T6(0, 17). At this point we stop, because we have obtained T5(0, 21). We do not continue the search even though there are still tables with violations, such as T2(4, 21), because none of them has more rows than the current result of 21. This means we have already found the table with the greatest number of rows and zero violations, as further generalizing the other tables would only result in smaller tables. Thus, the optimal result T5(0, 21) has been found, and the algorithm stops.