*5.5. Quantifying Risk*

Intuitively, not all quasi-identifiers have the same "strength", "danger factor", or information content as far as de-anonymization is concerned.

In Information Theory, the metric for measuring information is entropy. Shannon Entropy is given in Theorem 9.

**Theorem 9** (Shannon Entropy [81])**.**

$$H_S = -\sum_{i=1}^{N} p_i \cdot \log_2 p_i \tag{2}$$

*where $P = (p_1, p_2, \ldots, p_N)$ is a discrete probability distribution with $p_i \geq 0$ and $\sum_{i=1}^{N} p_i = 1$.*

Entropy is measured in bits. Assume an equivalence class of size $N$ whose members all occur with the same probability. Picking randomly, on average about $\log_2 N$ binary guesses have to be made to find a certain element of that equivalence class, and $\log_2 N$ is equal to the Shannon Entropy. If the entities of an equivalence class do not have the same probability of being the element to be found, the entropy is lower; if the element is known with certainty, $H = 0$. This means that a theoretical maximum entropy can be calculated by assuming that the attacker picks randomly from the equivalence class.
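As a minimal sketch (not part of the original analysis), the following Python snippet computes Equation (2) and illustrates that a uniform equivalence class of size $N$ attains the maximum entropy $\log_2 N$, while skewed or fully known distributions yield lower values; the distributions used are assumed, illustrative examples:

```python
import math

def shannon_entropy(probs):
    """Shannon Entropy (Equation (2)) of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform equivalence class of size N: entropy attains its maximum, log2(N).
N = 8
print(shannon_entropy([1 / N] * N))  # 3.0 == log2(8)

# Skewed distribution over the same class: entropy falls below log2(N).
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
print(shannon_entropy(skewed))       # ~2.22 < 3.0

# Element known with certainty: H = 0.
print(shannon_entropy([1.0]))        # -0.0, i.e., H = 0
```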

The size of an equivalence class can be obtained from demographic statistics or other publicly available data sources such as company databases, medical databases, voter registration lists, and so forth.

If the exact size of the equivalence class characterized by the extracted set of quasi-identifiers is not known, it can be estimated by calculating conditional probabilities or by applying Bayes' Theorem [82].

**Theorem 10** (Conditional probability [83])**.**

$$P(A/B) = \frac{P(A \cap B)}{P(B)}\tag{3}$$

**Theorem 11** (Bayes' Theorem [82,83])**.**

$$P(A/B) = \frac{P(A) \cdot P(B/A)}{P(B)}\tag{4}$$
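For illustration, with assumed figures not taken from any real dataset: suppose 1% of a population are nurses, 90% of nurses are female, and 50% of the population is female. Bayes' Theorem then yields the probability that a given female individual is a nurse:

$$P(\text{nurse}/\text{female}) = \frac{P(\text{nurse}) \cdot P(\text{female}/\text{nurse})}{P(\text{female})} = \frac{0.01 \cdot 0.9}{0.5} = 0.018$$

Such a derived conditional probability can then be used to estimate how many individuals share the observed attribute combination.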

However, the simplest approach is to assume that the random variables $A_1, A_2, \ldots, A_i$ are independent of each other, so that the joint probability of these attributes can be calculated as $P(A_1 \cap A_2 \cap \ldots \cap A_i) = P(A_1) \cdot P(A_2) \cdots P(A_i)$. Although independence does not hold in many cases, this simple equation can readily be used as an estimate.
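The following sketch applies this independence approximation to estimate the size of an equivalence class; the attribute probabilities and population size are assumed, illustrative values, not real statistics:

```python
import math

# Assumed, illustrative marginal probabilities of three quasi-identifier
# values (e.g., age bracket, gender, district); real values would come
# from demographic statistics.
attribute_probs = [0.15, 0.50, 0.02]
population = 1_000_000  # assumed population size

# Independence approximation: P(A1 ∩ A2 ∩ A3) ≈ P(A1) * P(A2) * P(A3)
joint_prob = math.prod(attribute_probs)
class_size = population * joint_prob

print(joint_prob)  # 0.0015
print(class_size)  # 1500.0
```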

Hence, by extracting attributes, linking them to a specific person, and estimating the size of the equivalence class in which all these attributes occur together, the entropy can be estimated as well. It is important to point out, however, that the more parameters the estimation uses, the larger the potential error becomes, and with it the variance of the potential information content. For instance, given two cases with equal equivalence class sizes, one estimated using two attributes and the other using four, the latter scenario is expected to be riskier in terms of de-anonymization. Figure 8 shows the connection between entropy and risk.

**Figure 8.** Connection between entropy and risk.

According to our understanding, risk as a function of entropy behaves such that as entropy decreases (i.e., as more is known about a specific person), the risk increases.

Since the probability of de-anonymization is never zero, the de-identification process must aim to raise the entropy of the set of quasi-identifiers above a certain threshold.
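As a minimal sketch of this thresholding idea: assuming the attacker picks uniformly within the class, an equivalence class of estimated size $n$ carries a maximum entropy of $\log_2 n$ bits, so requiring the entropy to stay above $\log_2 k$ bits is analogous to requiring a class size of at least $k$. The threshold value below is an assumed example, not a value prescribed here:

```python
import math

def entropy_above_threshold(class_size: float, min_bits: float) -> bool:
    """Check the maximum-entropy estimate log2(class_size) against a
    minimum acceptable entropy (uniform-picking assumption)."""
    return math.log2(class_size) >= min_bits

# Assumed threshold of 10 bits (class size of at least 2^10 = 1024 people).
print(entropy_above_threshold(1500, 10))  # log2(1500) ~ 10.55 -> True
print(entropy_above_threshold(500, 10))   # log2(500)  ~  8.97 -> False
```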
