1. Introduction
Contemporary societies exhibit two disparate tendencies that exist in fundamental tension. On the one hand, the tendency to collect data about individuals, processes, and phenomena on a massive scale enables robust scientific advances such as data-driven medicine, the training of artificial intelligence (AI) models with near-human capabilities in certain tasks, or the provision of a gamut of bespoke services. For fields such as medical discovery, data collection for the advancement of science can be viewed as an ethical mandate and is encouraged by regulations under the term data altruism [1]. On the other hand, such data collection (especially for the sole purpose of economic gain, also termed surveillance capitalism [2]) is problematic from the perspectives of personal data protection and informational self-determination, which are legal rights in most countries. Often, the antagonism between data collection and data protection is viewed as a zero-sum game. However, a suite of technologies termed privacy-enhancing technologies (PETs), encompassing techniques from the fields of cryptography, distributed computing, and information theory, promises to reconcile this tension by permitting one to draw valuable insights from data while protecting the individual. The broad implementation of PETs can thus herald a massive increase in data availability in all domains by incentivizing data altruism through the guarantee of equitable solutions to the data utilization/data protection dilemma [3].
Central to the promise of PETs is the protection of privacy (by design) [4]. However, the usage of this term, which is socially and historically charged and often used laxly, entails considerable ambiguity, which can hamper a rigorous and formal societal, political, and legislative debate. After all, it is difficult to debate the implementation of a set of technologies when it is unclear what exactly is being protected. We contend that this dilemma can be resolved through a re-conceptualization of the term privacy. We formulated a number of expectations towards such a novel definition: it must be (1) anchored in the rich history of sociological, legal, and philosophical privacy research yet formal and rigorous enough to be mathematically quantifiable; (2) easy for individuals to relate to; (3) actionable, that is, able to be implemented technologically; and (4) future-proof, that is, resilient to future technological advancements, including those by malicious actors trying to undermine privacy. The key contributions of our work towards this goal can be summarized as follows:
- We formulated an axiomatic definition of privacy using the language of information theory;
- Our definition is naturally linked to differential privacy (DP), a PET which is widely considered the gold standard of privacy protection in many settings such as statistical machine learning;
- Lastly, our formalism exposes the fundamental challenges in actualizing privacy: determining the origin of information flows and objectively measuring and restricting information.
2. Prior Work
The most relevant prior works can be distinguished into the following categories: works by Jourard [5] or Westin [6] defined privacy as a right to restrict or control information about oneself. These definitions are relatable, as they tend to mirror the individual's natural notion of how privacy can be realized in everyday life, such as putting curtains on one's windows. The foundational work of Nissenbaum on Contextual Integrity (CI) [7] instead contends that information restriction alone is not conducive to the functioning of society; rather, information must flow appropriately within a normative frame. This definition is more difficult to relate to, but it is very broad and thus suitable for capturing a large number of privacy-relevant societal phenomena. Its key weakness lies in the fact that it attempts no formalization: privacy cannot be quantified using the language of CI alone.
Our work synthesizes the aforementioned lines of thought by admitting the intuitive and relatable notion of restricting the flow of sensitive information while respecting the fact that information flow is an indispensable component of a well-functioning society.
The works of Solove [8,9,10] have followed an orthogonal approach, eschewing the attempt to define privacy directly and instead (recursively) defining it as a solution to a privacy problem, that is, a challenge arising during information collection, processing, or dissemination. This approach represents a natural counterpart to PETs, which represent such solutions and thus fulfill this notion of privacy. We note that, whereas our discussion focuses on DP, which is rooted in the work of Dwork et al. [11], DP is neither the only PET nor the only way to ensure that our definition of privacy is fulfilled; however, the converse does hold: DP, and every guarantee that is stronger than DP, automatically fulfills our definition presented below (provided the sender and the receiver are mutually authenticated and the channel is secure). Despite criticism regarding the guarantees and limitations of DP [12], it has established itself as the gold standard for privacy protection in fields such as statistics on large databases. We additionally discuss anonymization techniques such as k-anonymity [13] as examples of technologies that do not fulfill the definition of privacy we propose, as they are vulnerable to degradation in the presence of auxiliary information. For an overview of PETs, we refer to Ziller et al. [14].
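To make this vulnerability concrete, consider the following minimal sketch (the released records and the adversary's auxiliary knowledge are hypothetical): a 2-anonymous release, in which every quasi-identifier combination appears at least twice, can still disclose a sensitive value when combined with auxiliary information.

```python
# Minimal sketch of k-anonymity degrading under auxiliary information.
# The released records and the adversary's knowledge are hypothetical.

# A 2-anonymous release: each (age_range, zip_prefix) pair occurs >= 2 times.
release = [
    {"age_range": "30-40", "zip_prefix": "481", "diagnosis": "diabetes"},
    {"age_range": "30-40", "zip_prefix": "481", "diagnosis": "healthy"},
    {"age_range": "40-50", "zip_prefix": "482", "diagnosis": "asthma"},
    {"age_range": "40-50", "zip_prefix": "482", "diagnosis": "asthma"},
]

# Auxiliary knowledge: the adversary knows a neighbor is 45 and lives in 482xx.
aux = {"age_range": "40-50", "zip_prefix": "482"}

matches = [r for r in release if all(r[k] == aux[k] for k in aux)]
print({r["diagnosis"] for r in matches})  # {'asthma'}

# Two records match (k = 2), yet both share the same sensitive value, so the
# neighbor's diagnosis is revealed with certainty despite k-anonymity.
```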
Our formal framework is strongly related to Shannon's information theory [15]. However, we also discuss a semi-quantitative relaxation of our definition, which attempts to measure qualitatively different information types (such as structural and metrical information) and goes back to the work by MacKay [16].
Our work has strong parallels to the theories by Dretske [17] and Benthall [18] in that we adopt the view that the meaning, and ultimately the information content, of informational representations arises from a nomic association with their related data.
Lastly, we note that the field of quantitative information flow (QIF) [19] utilizes similar abstractions to our formal framework; however, it focuses its purview more specifically on the study of information leakage in secure systems. It would therefore be fair to state that our framework is a generalization of QIF to a broader societal setting.
3. Formalism
In this section, we introduce an axiomatic framework that supports our privacy definition. All sets in this study were assumed to be nonempty and finite. We note that, while our theory has abstract entities at its center, one can build intuition by considering the interactions between entities as representing human communication.
Definition 1 (Entity). An entity is a unique rational agent that is capable of perceiving its environment based on some inputs, interacting with its environment through some outputs, and making decisions. We wrote $e_i$ for the $i$th entity in the set of entities $E$. Entities have a memory and can thus hold and exercise actions on some data. We wrote $d_j$ for the $j$th datum (or item of data) in the dataset $D$ held by $e$. Examples of entities include: individuals, companies, governments and their representatives, organizations, software systems and their administrators, etc.
Remark 1. The data held by entities can be owned (e.g., in a legal sense) by them or by some other entity. In acting on the data (not including sharing it with third parties), we say the entity is exercising governance over it. We differentiated the following forms of governance:
1. Conjunct governance: The entity is acting on its own data (i.e., data owner and governor are conjunct).
2. Disjunct governance: One or more entities is/are acting on another entity's data. We distinguished two forms of disjunct governance: Delegated governance, where one entity is holding and/or acting on another's data, and Distributed governance, where multiple entities hold parts of a single entity's data and act on it. Examples of distributed governance include (1) distinct entities holding and acting on disjoint subsets (shards) of one entity's data (e.g., birth date or address), (2) distinct entities holding and acting on shares of one entity's data (e.g., using secret sharing schemes; see the sketch below), and (3) distinct entities holding and acting on copies of one entity's data (e.g., the IPFS protocol).
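To illustrate distributed governance via shares, the following is a minimal sketch of additive secret sharing; the modulus, party count, and example datum are illustrative assumptions rather than part of our formalism.

```python
import secrets

# Additive secret sharing over the integers modulo a prime (illustrative).
PRIME = 2**61 - 1

def share(secret: int, n_parties: int) -> list[int]:
    """Split a datum into n additive shares held by n distinct entities."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Only the complete set of shares recovers the original datum."""
    return sum(shares) % PRIME

birth_year = 1984                      # one entity's datum (hypothetical)
shares = share(birth_year, 3)          # governed by three distinct entities
assert reconstruct(shares) == birth_year
# Any proper subset of the shares is uniformly distributed and thus carries
# no information about the datum on its own.
```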
The processes inherent to governance are typically considered parts of the data life-cycle [20]. They include safekeeping, access management, quality control, deletion, etc. Permanent deletion ends the governance process.
Definition 2 (Factor). Factors are circumstances that influence an entity's behavior. It is possible to classify factors as extrinsic (e.g., laws, expectations of other entities, incentives, and threats) and intrinsic (e.g., hopes, trust, expectations, and character), although this classification is imperfect (as there is substantive overlap) and not required for this formalism. Factors also modulate and influence each other (see the example of trust and incentives below). We wrote $f_i$ for the $i$th factor in the set of factors $F$.
Definition 3 (Society). A set $S$ is called a society if and only if it contains entities and factor(s) influencing their behaviors. Our definition was intended to parallel the natural perception of a society; thus, we assumed common characteristics of societies, such as temporal and spatial co-existence. The definition is flexible insofar as it admits the isolated observation of a useful (in terms of modeling) subset of entities and relevant factors, such as religious or ethnic groups, which—although possibly subsets of a society in the social science interpretation—have specific and/or characteristic factors that warrant their consideration as a society. Societies undergo temporal evolution through the interaction between entities and factors. We sometimes wrote $S_t$ to designate a "snapshot" of society at a discrete time point $t$; when omitting the subscript, it is implied that we are observing a society at a single, discrete time point.
Definition 4 (Communication). Communication is the exchange of data between entities. It includes any verbal and non-verbal form of inter-entity data exchange.
Axiom 1. Society cannot exist without communication. Hence, communication arises naturally within society.
Remark 2. For a detailed treatment, compare Axiom 1 of [21].
Our formalism is focused on a specific form of communication between entities called an information flow.
Definition 5 (Information). Let $e$ be an entity holding data $D$. We denote as information a structured set $I$ of representations of $D$ with the following properties:
- It has a nomic association with the set of data $D$, that is, a causal relationship exists between the data and its corresponding informational representation;
- The nomic association is unique, that is, each informational representation $\iota_j \in I$ corresponds to exactly one datum $d_j \in D$ such that the state of one item of information is determined solely by the state of one datum;
- It is measurable in the sense that information content is a quantitative measure of the complexity of assembling the representation of the data.
This definition interlinks two foundational lines of work. Dretske [17] postulates that meaning is acquired through nomic association between the message's content and the data it portrays. This aspect has been expanded upon by Benthall et al. [18], who frame nomic association in the language of Pearlian causality [22] to analyze select facets of Contextual Integrity under the lens of Situated Information Flow Theory [23]. The notion of information quantification as a correspondence between information content and the complexity of reassembling a representation is central to information theory. We note that we utilized this term to refer to two distinct schools of thought. In the language of Shannon's information theory [15], information content is a measure of uncertainty reduction about a random variable. Here, information content is measured in Shannons (typically synonymously referred to as bit(s)). Shannon's information theory is the language of choice when discussing privacy-enhancing technologies such as DP. Our definition of information embraces this interpretation, and we will assume that—for the purposes of quantifying information—informational representations are indeed random variables. In the Shannon information theory sense, we can therefore modify our definition as follows:
Definition 6 (Information (in the Shannon sense)). Let $e$, $D$, and $I$ be defined as above. Then, every element $\iota \in I$ is a random variable with mass function $p$, which can be used to resolve uncertainty about a single datum $d$ through its nomic association with this datum. Moreover, the information content of $\iota$ is given by:
$$H(\iota) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$
where $\mathcal{X}$ denotes the set of values $\iota$ can assume.
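As a worked illustration (the distributions are hypothetical examples), the following snippet computes the information content of representations in Shannons:

```python
from math import log2

def information_content(pmf: list[float]) -> float:
    """Shannon entropy H = -sum(p * log2(p)), measured in Shannons (bits)."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# A representation uniform over four values carries 2 Shannons:
print(information_content([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A heavily skewed representation resolves far less uncertainty:
print(information_content([0.97, 0.01, 0.01, 0.01]))  # ~0.24
```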
Moreover, our framework is also compatible with a structural/metrical information theory viewpoint. This perspective, which was developed alongside Shannon's information theory and is rooted in the foundational work by MacKay [16], is a superset of the former. Here, information content in the Shannon sense is termed selective information content (to represent the fact that each bit represents the uncertainty reduction achieved by observing the answer to a question with two possible outcomes, i.e., selecting from two equally probable states). Moreover, information content can be structural (representing the number of distinguishable groups in a representation, measured in logons) and metrical (representing the number of indistinguishable logical elements in a representation, measured in metrons). We note that the difficulty of measuring real-world information is inherent to both schools of information theory (compare also the discussion in [16], Chapter 2).
Definition 7 (Information flow). An information flow (or just flow) is a directed transit of information between exactly two entities. We call the origin of $\Phi$ the sender $s$ and the recipient of $\Phi$ the receiver $r$. The subject of $\Phi$ is called a message $m$ and contains a single informational representation. $m$ flows over a channel $C$ (a medium), which we assumed to be noiseless, sufficiently capacious, and error-free. We sometimes represented a flow as:
$$\Phi: s \xrightarrow{m} r.$$
Remark 3. Flows are the irreducible unit of analysis in our framework and are atomic and pairwise. This means that they concern exactly one datum $d$, and they take place between exactly two entities. This fact distinguishes our formalism from CI (which uses a similar terminology), where flows are defined more broadly and pertain to "communication" in a more general way; instead, our notion bears strong similarities to QIF [19], where information is also viewed as flowing through a channel. Our naming for the components of the flow follows standard information-theoretic literature [24].
Remark 4. We used the term information content of $m$ to denote the largest possible quantity of information that can be derived by observing $m$, including the information obtained by any computation on $m$, irrespective of prior knowledge. This view is compatible with a worst-case outlook on privacy, where the receiver of the message is assumed to obtain $m$ in its entirety and to make every effort available to reassemble the representation of the datum that $m$ refers to.
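A minimal sketch of how an atomic, pairwise flow could be represented programmatically; the class and field names are our own illustrative choices, not part of the formalism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """An atomic, pairwise flow: one sender, one receiver, one representation."""
    sender: str      # the origin s of the flow
    receiver: str    # the recipient r of the flow
    message: str     # a single informational representation (the message m)

# A single act of communication decomposes into one flow per datum:
flows = [
    Flow(sender="alice", receiver="bob", message="birth_year=1984"),
    Flow(sender="alice", receiver="bob", message="zip=48201"),
]
```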
Remark 5. A line of prior work, such as the work by McLuhan [25], has contended that the medium of transmission (i.e., the channel) modulates (and sometimes is a quintessential part of) the message. This point of view is not incompatible with ours, but we chose to incorporate the characteristics of the channel into other parts of the flow, as our framework is information-theoretic but not communication-theoretic. For example, under our definition, an insecure (leaky) channel is regarded as giving rise to a new flow towards one or more additional receivers (see implicit flows below), while a corruption of the message by noise or encoding errors is deemed to directly reduce its information content. Therefore, we implicitly assumed that the state of a message is determined solely by the corresponding information that is being transmitted.
Although flows are atomic, human communication is not: very few acts of communication result in the transmission of information about only a single datum. We thus required a tool to "bundle" all atomic flows that arise in a certain circumstance (e.g., in a certain social situation, about a specific topic, etc.). We call these groupings of flows information flow contexts. Moreover, communication also often happens between more than two entities (one-to-many or many-to-one scenarios). Such scenarios are discussed below.
Definition 8 (Information flow context). Let $S_t$ be a society at time $t$ such that $e_1, \ldots, e_n \in S_t$, and let $\Phi_1, \ldots, \Phi_m$ be flows between these entities. Then, we term the collection $K = \{\Phi_1, \ldots, \Phi_m\}$ an information flow context (or just context).
Flows are stochastic processes. This means they can arise randomly. The probability of their occurrence in a given society depends on numerous latent factors. Depending on the causal relationship between the appearance of a flow and an entity’s decision, we distinguished the following cases:
Definition 9 (Explicit flow). An explicit flow arises as a causal outcome of a decision by the entity whose data is subject to the flow.
Definition 10 (Decision). Let $e$ be an entity and $F$ a collection of factors influencing its behavior. We modeled the decision process as a random variable conditioned on the factors. Then, the decision $\delta \mid F \sim \mathcal{B}(p)$ takes the following values:
$$\delta = \begin{cases} \Phi & \text{with probability } p,\\ \bot & \text{with probability } 1 - p,\end{cases}$$
where $\mathcal{B}(p)$ denotes the Bernoulli distribution and $\bot$ implies that no action is undertaken.
We hypothesized the probability of decisions resulting in explicit flows to be heavily influenced by two factors. Of these, the most important is probably trust. In interpersonal relationships characterized by high levels of trust, entities are more likely to engage in information flows. Moreover, the reason for most societal information flows can ultimately be distilled to trust between entities on the basis of some generally accepted norm. For example, information flows from an individual acting as a witness in court towards the judge are ultimately linked to trust in the socially accepted public order. Low levels of trust thus decrease the overall probability of an explicit flow arising. We also contend that trust acts as a barrier imposing an upper bound on the amount of information (described below) that an entity is willing to accept in a flow. The other main factor influencing the probability of explicit flows arising is likely incentives. For instance, the incentive of a larger social circle can entice individuals into engaging in explicit flows over social networks. The incentive of a free service provided over the internet increases the probability that the individual will share personal information (e.g., allow cookies). We note that—like all societal factors—incentives and trust modulate each other. In some cases, strong incentives can decrease the trust threshold required to engage in a flow, while, in others, no incentives are sufficient to outweigh trust. In addition, society itself can impose certain bounds on the incentives that are allowed to be offered or on whether explicit flows are permitted despite high trust (e.g., generally disallowing the sharing of patient information between mutually trusting physicians who are nonetheless not immediately engaged in the treatment of the same individual).
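As an illustrative sketch, one could simulate how trust and incentives modulate the probability $p$ of an explicit flow; the logistic form and its coefficients are our own modeling assumptions, not part of the formalism.

```python
import random
from math import exp

def flow_probability(trust: float, incentive: float) -> float:
    """Hypothetical model: p increases with trust and incentives."""
    return 1.0 / (1.0 + exp(-(3.0 * trust + 1.5 * incentive - 2.0)))

def decide(trust: float, incentive: float) -> str:
    """Sample the decision delta ~ Bernoulli(p): a flow arises, or nothing."""
    p = flow_probability(trust, incentive)
    return "flow" if random.random() < p else "no action"

print(flow_probability(trust=0.9, incentive=0.5))  # ~0.81: flows are likely
print(flow_probability(trust=0.1, incentive=0.0))  # ~0.15: flows are rare
print(decide(trust=0.9, incentive=0.5))
```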
Definition 11 (Implicit flow). An implicit flow arises without a causal relationship between a decision of the entity whose data is subject to the flow and the occurrence of the flow, but rather due to a causal relationship between another entity's decision and the occurrence of the flow, or by circumstance. Thus, an implicit flow involving an entity $e$ can be modeled as a random variable $\delta \sim \mathcal{B}(p)$, where $p$ is independent of the factors influencing $e$ such that:
$$\delta = \begin{cases} \Phi & \text{with probability } p,\\ \bot & \text{with probability } 1 - p,\end{cases} \qquad p \perp F_e.$$
An example of an implicit flow is the recording of an individual by a security camera in a public space of which the individual was not aware. Implicit flows are sometimes also called information leaks and can arise in a number of systems, even those typically considered perfectly secure. For example, a secret ballot that results in a unanimous vote implicitly reveals the preference of all voters. The quantification of information leakage is central to the study of QIF.
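A worked sketch of the ballot example, assuming for illustration three voters with independent, uniformly random preferences: before the announcement, the voters' joint preferences carry 3 Shannons of uncertainty; announcing a unanimous result leaves none.

```python
from itertools import product
from math import log2

# Three voters, each voting yes (1) or no (0) independently and uniformly.
configs = list(product([0, 1], repeat=3))
prior_bits = log2(len(configs))  # 3.0 Shannons of uncertainty a priori

# The tally announces "unanimous yes": exactly one configuration remains.
posterior = [c for c in configs if c == (1, 1, 1)]
posterior_bits = log2(len(posterior))  # 0.0 Shannons remain

print(prior_bits, posterior_bits)  # 3.0 0.0 -> every preference is revealed
```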
As flows are—by definition—pairwise interactions, analyzing many-to-one and one-to-many communication requires special consideration. While one-to-many communication can be "dismantled" into pairwise flows in a straightforward way, many-to-one communication requires considering ownership and governance of the transmitted information. For instance, many-to-one communication where each sender has conjunct governance and ownership of their data can easily be modeled as separate instances of pairwise flows. However, when governance is disjunct or when correlations exist between data, it is necessary to "marginalize" the contribution of the entities whose data is involved in the flow, even if they themselves are not part of it. Thus, many-to-one communication can lead to implicit flows arising. This type of phenomenon is an emergent behavior in systems exhibiting complex information flows, such as societies, and has been described with the term information bundling problem by [26]. For example, the message "I am an identical twin" flowing from a sender to a receiver reduces the receiver's uncertainty about the sender's sibling's biological sex and genetic characteristics. As data owned by the sibling and governed by the sender is flowing, information can be considered as implicitly flowing from the sibling to the receiver.
Finally, equipped with the primitives above, we can define privacy:
Definition 12 (Privacy). Let $\Phi$ be a flow of a message $m$ between a sender $s$ and a receiver $r$ over a channel $C$, embedded in a context $K$. Then, privacy is the ability of $s$ to upper-bound the information content of $m$ and of any computation on $m$, independent of the receiver's prior knowledge.
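As a sketch of how DP instantiates this ability (a standard property of $\varepsilon$-DP, stated here for intuition rather than as a formal proof): an $\varepsilon$-differentially-private mechanism $M$ bounds, for any prior, how much the receiver's beliefs about a single datum can shift.

```latex
\[
\frac{\Pr[M(D) \in O]}{\Pr[M(D') \in O]} \le e^{\varepsilon}
\quad \text{for all adjacent inputs } D, D' \text{ and output sets } O,
\]
\[
\text{hence the receiver's log-odds between } D \text{ and } D'
\text{ shift by at most } \varepsilon \log_2 e~\text{Shannons, for any prior.}
\]
```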
The following implications follow immediately from the aforementioned definition:
- It relates directly to an ability of the sender. We contend that this formulation mirrors the widespread perception of privacy, e.g., as it is formulated in laws, where the right to privacy stipulates a legal protection of the ability to restrict information about certain data;
- Our definition, like our primitives, is atomic. It is possible to maintain privacy selectively, i.e., about a single datum. This granularity is required, as privacy cannot be viewed "in bulk";
- Privacy is contextual. The factors inherent to the specific context in which an information flow occurs (such as trust or incentives above) and the setting of the flow itself therefore largely determine the resulting expectations and entity behaviors, similar to Contextual Integrity. For example, a courtroom situation (in which the individual is expected to tell the truth and disclose relevant information) is not a privacy violation, as the ability of the individual to withhold information still exists, but the individual may choose not to exercise it. Conversely, tapping an individual's telephone is a privacy violation regardless of whether it is legal. Our framework thus distinguishes between privacy as a faculty and the circumstances under which it is acceptable to maintain it. Edge cases also exist: for example, divulging sensitive information under threat of bodily harm, or mass-surveillance states in which every privacy violation is deemed acceptable, would have to be treated with special care (and interdisciplinary discourse) to define what does and does not constitute a socially acceptable and "appropriate" information flow.